---
title: ColiFormer - E. coli Codon Optimization
emoji: 🧬
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.28.1
app_file: app.py
pinned: false
license: mit
short_description: E. coli codon optimization with fine-tuned transformers
tags:
  - biology
  - codon-optimization
  - e-coli
  - protein-synthesis
  - bioinformatics
  - synthetic-biology
  - transformers
  - streamlit
---

# 🧬 ColiFormer - E. coli Codon Optimization

**ColiFormer** is a codon optimization tool fine-tuned specifically for *Escherichia coli* sequences. It achieves a **6.2% higher CAI score** (0.788 vs. 0.742) than the base CodonTransformer model.

## Features

- **🎯 E. coli Specialized**: Fine-tuned on 4,300 high-CAI E. coli sequences
- **Advanced Metrics**: CAI, tAI, GC content, and codon frequency analysis
- **🤖 Auto-Loading**: Automatically downloads the model and reference data from Hugging Face
- **⚡ Real-time**: Interactive sequence optimization with live metrics
- **🔬 Research-Grade**: Based on the BigBird Transformer architecture
- **Performance**: +6.2% CAI and +6.0% tAI over the base model on E. coli
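
For readers unfamiliar with the first metric, the sketch below follows the conventional definition of the Codon Adaptation Index: each codon gets a weight equal to its count in a reference gene set divided by the count of the most frequent synonymous codon, and the CAI of a sequence is the geometric mean of those weights. This is an illustration only and may differ from the calculation the app performs. The function names and the one-gene toy reference are hypothetical, and the sketch assumes that stop codons, single-codon amino acids, and codons unseen in the reference are simply skipped.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Group synonymous codons by the amino acid they encode.
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def split_codons(dna):
    """Split a DNA string into complete codons, ignoring a trailing partial one."""
    dna = dna.upper()
    return [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]


def relative_adaptiveness(reference_genes):
    """Weight per codon: its count divided by the top count in its synonymous family."""
    counts = Counter(c for gene in reference_genes for c in split_codons(gene))
    weights = {}
    for aa, family in SYNONYMS.items():
        # Assumption: stop codons and single-codon amino acids (M, W) are not scored.
        if aa == "*" or len(family) == 1:
            continue
        top = max(counts[c] for c in family)
        for c in family:
            if counts[c] > 0:  # assumption: unseen codons get no weight and are skipped
                weights[c] = counts[c] / top
    return weights


def cai(dna, weights):
    """Geometric mean of the weights of all scorable codons in `dna`."""
    logs = [math.log(weights[c]) for c in split_codons(dna) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Hypothetical toy reference: lysine encoded three times as AAA, once as AAG.
toy_weights = relative_adaptiveness(["AAAAAAAAAAAG"])
print(cai("AAAAAA", toy_weights), cai("AAAAAG", toy_weights))
```

With a real reference set, the weights would be computed once and reused for every sequence scored.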

## Model Performance

| Metric | Base Model | ColiFormer | Change |
|--------|------------|------------|--------|
| CAI Score | 0.742 | 0.788 | **+6.2%** |
| tAI Score | 0.451 | 0.478 | **+6.0%** |
| GC Content | 52.1% | 51.8% | -0.3 pp |
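
The percentages in the table are relative changes with respect to the base model. A minimal check of that arithmetic, using only the values from the table (the helper name is hypothetical):

```python
def relative_improvement(base: float, tuned: float) -> float:
    """Percent change of `tuned` relative to `base`."""
    return (tuned - base) / base * 100.0


cai_gain = relative_improvement(0.742, 0.788)
tai_gain = relative_improvement(0.451, 0.478)
# GC content is already a percentage, so its change is a plain difference
# in percentage points rather than a relative change.
gc_shift = 51.8 - 52.1

print(f"CAI: {cai_gain:+.1f}%")
print(f"tAI: {tai_gain:+.1f}%")
print(f"GC:  {gc_shift:+.1f} pp")
```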

## Related Resources

- **Model**: [saketh11/ColiFormer](https://huggingface.co/saketh11/ColiFormer)
- **Dataset**: [saketh11/ColiFormer-Data](https://huggingface.co/datasets/saketh11/ColiFormer-Data)
- **Base Model**: [adibvafa/CodonTransformer](https://huggingface.co/adibvafa/CodonTransformer)
- **Paper**: [CodonTransformer: The Global Translation of Genetic Code by Transformer](https://www.biorxiv.org/content/10.1101/2023.09.09.556981v1)

## 💡 How to Use

1. **Enter your protein sequence** in single-letter amino acid format
2. **Select optimization parameters** (temperature, max length, etc.)
3. **Click "Optimize Sequence"** to generate the optimized DNA sequence
4. **View comprehensive metrics** including CAI, tAI, GC content, and codon usage
5. **Download results** as FASTA or Excel files
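
Steps 1 and 5 can also be reproduced outside the app. The sketch below shows one way to normalize a protein sequence before pasting it in and to write a DNA result as FASTA. Both helpers are hypothetical and not part of the app; in particular, the assumption that only the 20 standard amino acid letters are accepted has not been checked against the app's own validation.

```python
# Assumption: only the 20 standard amino acid letters are valid input.
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")


def clean_protein(raw: str) -> str:
    """Remove whitespace, upper-case the sequence, and reject unknown letters."""
    seq = "".join(raw.split()).upper()
    if not seq:
        raise ValueError("protein sequence is empty")
    invalid = sorted(set(seq) - VALID_AA)
    if invalid:
        raise ValueError(f"unsupported characters: {''.join(invalid)}")
    return seq


def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record wrapped at `width` characters per line."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{name}\n" + "\n".join(lines) + "\n"


protein = clean_protein("mkris ttitt titit tgnga g")
print(to_fasta("example_protein", protein))
```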

## 🧪 Example

**Input Protein**: `MKRISTTITTTITITTGNGAG`

**Optimized DNA**: `ATGAAACGTATTAGT...` (optimized for E. coli expression)

**Metrics**:

- CAI: 0.85 (High)
- tAI: 0.52 (Good)
- GC Content: 51.2% (Optimal)
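
A quick sanity check for any optimized sequence is to translate it back and confirm that it still encodes the input protein. The sketch below does this for the 15 nucleotides shown in the example, using the standard genetic code; the rest of the DNA is elided in the example and is not reconstructed here. The helper names are hypothetical. A `gc_content` helper is included as well, but note that the GC content of such a short prefix is not comparable to the 51.2% reported for the full sequence.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate complete codons of `dna`; raises KeyError on non-ACGT characters."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def gc_content(dna: str) -> float:
    """Percentage of G and C bases in `dna`."""
    dna = dna.upper()
    if not dna:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in dna) / len(dna)


protein = "MKRISTTITTTITITTGNGAG"
dna_prefix = "ATGAAACGTATTAGT"  # only the part shown in the example; the rest is elided

decoded = translate(dna_prefix)
print(decoded, protein.startswith(decoded))
```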

## 🔬 Technical Details

- **Architecture**: BigBird Transformer with 12 layers
- **Training**: Enhanced with Adaptive Learning Methods (ALM)
- **Context Length**: Up to 4096 tokens
- **Fine-tuning**: 4,300 high-CAI E. coli sequences
- **Reference Data**: 50,000+ E. coli gene sequences for metrics

## Citation

If you use ColiFormer in your research, please cite:

```bibtex
@article{codon_transformer_2023,
  title={CodonTransformer: The Global Translation of Genetic Code by Transformer},
  author={Adibvafa Fallahpour and Bartosz Grzybowski and Bogdan Gliwa and Bartosz Michalak},
  journal={bioRxiv},
  year={2023},
  doi={10.1101/2023.09.09.556981}
}
```

## License

This project is licensed under the MIT License.

---

**Built with ❤️ for the synthetic biology community**