---
title: ColiFormer - E. coli Codon Optimization
emoji: 🧬
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.28.1
app_file: app.py
pinned: false
license: mit
short_description: E. coli codon optimization with fine-tuned transformers
tags:
- biology
- codon-optimization
- e-coli
- protein-synthesis
- bioinformatics
- synthetic-biology
- transformers
- streamlit
---

# 🧬 ColiFormer - E. coli Codon Optimization

**ColiFormer** is a codon optimization tool fine-tuned for *Escherichia coli* sequences. It achieves a **6.2% higher CAI score** than the base CodonTransformer model.

## 🚀 Features

- **🎯 E. coli Specialized**: Fine-tuned on 4,300 high-CAI E. coli sequences
- **📊 Advanced Metrics**: CAI, tAI, GC content, and codon frequency analysis
- **🤖 Auto-Loading**: Automatically downloads model and reference data from Hugging Face
- **⚡ Real-time**: Interactive sequence optimization with live metrics
- **🔬 Research-Grade**: Based on BigBird Transformer architecture
- **📈 Performance**: +6.2% CAI and +6.0% tAI over the base CodonTransformer model on E. coli

## 📊 Model Performance

| Metric | Base Model | ColiFormer | Improvement |
|--------|------------|------------|-------------|
| CAI Score | 0.742 | 0.788 | **+6.2%** |
| tAI Score | 0.451 | 0.478 | **+6.0%** |
| GC Content | 52.1% | 51.8% | Optimized |
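
The percentages in the Improvement column are consistent with a relative change computed against the base model. A minimal sketch of that arithmetic, assuming the improvement is defined as `(tuned - base) / base`; the helper name `relative_improvement` is illustrative and not part of the app:

```python
def relative_improvement(base: float, tuned: float) -> float:
    """Return the relative change from base to tuned, in percent."""
    return (tuned - base) / base * 100.0


# Scores copied from the table above.
cai_gain = relative_improvement(0.742, 0.788)
tai_gain = relative_improvement(0.451, 0.478)

print(f"CAI: +{cai_gain:.1f}%")
print(f"tAI: +{tai_gain:.1f}%")
```

Rounded to one decimal place, these reproduce the +6.2% and +6.0% figures in the table.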

## 🔗 Related Resources

- **Model**: [saketh11/ColiFormer](https://huggingface.co/saketh11/ColiFormer)
- **Dataset**: [saketh11/ColiFormer-Data](https://huggingface.co/datasets/saketh11/ColiFormer-Data)
- **Base Model**: [adibvafa/CodonTransformer](https://huggingface.co/adibvafa/CodonTransformer)
- **Paper**: [CodonTransformer: The Global Translation of Genetic Code by Transformer](https://www.biorxiv.org/content/10.1101/2023.09.09.556981v1)
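
The app fetches the model and reference data automatically, but the same repositories can also be pulled manually. The sketch below assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function; `fetch_resources` is a hypothetical helper for illustration, not the loader used in `app.py`:

```python
from pathlib import Path
from typing import Optional

MODEL_REPO = "saketh11/ColiFormer"         # model repo listed above
DATASET_REPO = "saketh11/ColiFormer-Data"  # dataset repo listed above


def fetch_resources(cache_dir: Optional[str] = None) -> dict:
    """Download both repos and return their local snapshot directories."""
    # Imported lazily so this module loads even where huggingface_hub is absent.
    from huggingface_hub import snapshot_download

    model_dir = snapshot_download(repo_id=MODEL_REPO, cache_dir=cache_dir)
    data_dir = snapshot_download(
        repo_id=DATASET_REPO,
        repo_type="dataset",  # dataset repos need an explicit repo_type
        cache_dir=cache_dir,
    )
    return {"model": Path(model_dir), "data": Path(data_dir)}
```

Calling `fetch_resources()` requires network access and returns the local paths of both snapshots; later calls reuse the Hugging Face cache.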

## 💡 How to Use

1. **Enter your protein sequence** in single-letter amino acid format
2. **Select optimization parameters** (temperature, max length, etc.)
3. **Click "Optimize Sequence"** to generate the optimized DNA sequence
4. **View comprehensive metrics** including CAI, tAI, GC content, and codon usage
5. **Download results** as FASTA or Excel files
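
Steps 1 and 5 can be approximated outside the app. The following is a minimal sketch, assuming the input is restricted to the 20 standard amino acids (the app may accept more, such as stop symbols) and that FASTA output uses fixed-width lines; `clean_protein` and `to_fasta` are illustrative names, not functions from the app:

```python
# The 20 standard amino acids in single-letter code.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def clean_protein(raw: str) -> str:
    """Strip whitespace, uppercase, and reject non-standard residues."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("empty protein sequence")
    invalid = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if invalid:
        raise ValueError(f"unsupported residue(s): {', '.join(invalid)}")
    return seq


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format one record as FASTA with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


protein = clean_protein(" mkristtitttitittgngag ")
print(protein)
print(to_fasta("example_protein", protein), end="")
```

Validating before optimization gives a clear error for typos such as digits or ambiguous residue codes instead of a failure further down the pipeline.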

## 🧪 Example

**Input Protein**: `MKRISTTITTTITITTGNGAG`

**Optimized DNA**: `ATGAAACGTATTAGT...` (optimized for E. coli expression)

**Metrics**:
- CAI: 0.85 (High)
- tAI: 0.52 (Good)
- GC Content: 51.2% (Optimal)
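
For readers who want to see how such metrics are defined, here is a minimal sketch of GC content and of CAI in the classic Sharp and Li formulation (geometric mean of each codon's usage relative to the most used synonymous codon in a reference set). It is an illustration under simplifying assumptions, not the app's implementation: codons absent from the reference are skipped rather than given a pseudo-count, and the one-gene reference is made up:

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def split_codons(dna: str) -> list:
    """Split an in-frame DNA sequence into codons."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def gc_content(dna: str) -> float:
    """Percentage of G and C bases in the sequence."""
    dna = dna.upper()
    return 100.0 * (dna.count("G") + dna.count("C")) / len(dna)


def relative_adaptiveness(reference_genes) -> dict:
    """Weight per codon: its count divided by the top synonymous codon's count."""
    counts = Counter(c for gene in reference_genes for c in split_codons(gene))
    best = Counter()
    for codon, aa in CODON_TABLE.items():
        best[aa] = max(best[aa], counts[codon])
    return {
        codon: counts[codon] / best[aa]
        for codon, aa in CODON_TABLE.items()
        if counts[codon] > 0
    }


def cai(dna: str, weights: dict) -> float:
    """Geometric mean of codon weights, ignoring Met, Trp, and stop codons."""
    logs = [
        math.log(weights[c])
        for c in split_codons(dna)
        if c in weights and CODON_TABLE[c] not in "MW*"
    ]
    return math.exp(sum(logs) / len(logs)) if logs else 0.0


# Made-up one-gene reference: ATG AAA AAA AAG TAA (illustration only).
toy_reference = ["ATG" + "AAA" * 2 + "AAG" + "TAA"]
weights = relative_adaptiveness(toy_reference)
print(round(cai("AAAAAG", weights), 3))
print(gc_content("ATGC"))
```

In the toy reference, `AAA` is used twice as often as `AAG`, so `AAG` gets weight 0.5 and the two-codon query scores the geometric mean of 1.0 and 0.5.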

## 🔬 Technical Details

- **Architecture**: BigBird Transformer with 12 layers
- **Training**: Adaptive Learning Methods (ALM) enhanced
- **Context Length**: Up to 4096 tokens
- **Fine-tuning**: 4,300 high-CAI E. coli sequences
- **Reference Data**: 50,000+ E. coli gene sequences for metrics

## 📜 Citation

If you use ColiFormer in your research, please cite the CodonTransformer paper it builds on:

```bibtex
@article{codon_transformer_2023,
  title={CodonTransformer: The Global Translation of Genetic Code by Transformer},
  author={Adibvafa Fallahpour and Bartosz Grzybowski and Bogdan Gliwa and Bartosz Michalak},
  journal={bioRxiv},
  year={2023},
  doi={10.1101/2023.09.09.556981}
}
```

## 📄 License

This project is licensed under the MIT License.

---

**Built with ❤️ for the synthetic biology community**