metadata
title: ColiFormer - E. coli Codon Optimization
emoji: 🧬
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.28.1
app_file: app.py
pinned: false
license: mit
short_description: E. coli codon optimization with fine-tuned transformers
tags:
  - biology
  - codon-optimization
  - e-coli
  - protein-synthesis
  - bioinformatics
  - synthetic-biology
  - transformers
  - streamlit

🧬 ColiFormer - E. coli Codon Optimization

ColiFormer is a codon optimization tool fine-tuned specifically for Escherichia coli sequences. It achieves a 6.2% higher CAI score than the base CodonTransformer model (0.788 vs. 0.742).

🚀 Features

  • 🎯 E. coli Specialized: Fine-tuned on 4,300 high-CAI E. coli sequences
  • 📊 Advanced Metrics: CAI, tAI, GC content, and codon frequency analysis
  • 🤖 Auto-Loading: Automatically downloads model and reference data from Hugging Face
  • ⚡ Real-time: Interactive sequence optimization with live metrics
  • 🔬 Research-Grade: Based on BigBird Transformer architecture
  • 📈 Performance: 6.2% higher CAI and 6.0% higher tAI than the base model on E. coli

📊 Model Performance

| Metric     | Base Model | ColiFormer | Improvement |
|------------|------------|------------|-------------|
| CAI Score  | 0.742      | 0.788      | +6.2%       |
| tAI Score  | 0.451      | 0.478      | +6.0%       |
| GC Content | 52.1%      | 51.8%      | Optimized   |
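
The improvement figures above are relative changes against the base model. As a quick check, the short sketch below recomputes them from the reported scores. The helper name `relative_improvement` is illustrative and not part of ColiFormer.

```python
def relative_improvement(base: float, new: float) -> float:
    """Return the relative change from base to new, in percent."""
    return (new - base) / base * 100.0

# Scores as reported in the Model Performance table.
cai_gain = relative_improvement(0.742, 0.788)
tai_gain = relative_improvement(0.451, 0.478)

# GC content is already a percentage, so the shift is in percentage points.
gc_shift = 51.8 - 52.1

print(f"CAI: {cai_gain:+.1f}%")
print(f"tAI: {tai_gain:+.1f}%")
print(f"GC content: {gc_shift:+.1f} percentage points")
```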

🔗 Related Resources

💡 How to Use

  1. Enter your protein sequence in single-letter amino acid format
  2. Select optimization parameters (temperature, max length, etc.)
  3. Click "Optimize Sequence" to generate the optimized DNA sequence
  4. View comprehensive metrics including CAI, tAI, GC content, and codon usage
  5. Download results as FASTA or Excel files
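
Steps 1 and 5 can also be prepared or reproduced outside the app. The following is a minimal sketch, not the app's own code: `clean_protein` and `to_fasta` are hypothetical helpers, and restricting input to the 20 standard amino acids (with an optional trailing `*`) is an assumption about what the app accepts.

```python
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def clean_protein(raw: str) -> str:
    """Normalize a single-letter protein sequence and reject unknown letters."""
    # Drop all whitespace, upper-case, and strip a trailing stop symbol.
    seq = "".join(raw.split()).upper().rstrip("*")
    if not seq:
        raise ValueError("empty protein sequence")
    bad = sorted(set(seq) - STANDARD_AMINO_ACIDS)
    if bad:
        raise ValueError(f"unsupported residue(s): {', '.join(bad)}")
    return seq

def to_fasta(name: str, dna: str, width: int = 60) -> str:
    """Format a DNA sequence as a FASTA record with fixed-width lines."""
    lines = [f">{name}"]
    lines += [dna[i:i + width] for i in range(0, len(dna), width)]
    return "\n".join(lines) + "\n"

protein = clean_protein("mkristtitt titittgngag")
print(protein)

# Only the visible prefix of the example output is used here.
print(to_fasta("example_prefix", "ATGAAACGTATTAGT"), end="")
```

Validating before pasting avoids submitting sequences with stray whitespace or lowercase letters. The FASTA helper wraps lines at 60 characters, a common convention rather than a requirement.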

🧪 Example

Input Protein: MKRISTTITTTITITTGNGAG

Optimized DNA: ATGAAACGTATTAGT... (optimized for E. coli expression)

Metrics:

  • CAI: 0.85 (High)
  • tAI: 0.52 (Good)
  • GC Content: 51.2% (Optimal)
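
For readers who want to see how two of these metrics are commonly defined, here is a hedged sketch. GC content is the share of G and C bases. CAI is computed in its standard form, the geometric mean of per-codon relative adaptiveness weights. The weights in `TOY_WEIGHTS` are made-up placeholders, not real E. coli values, and ColiFormer's own reference set and implementation may differ. The sequence is only the visible prefix of the example, so the results will not match the metrics listed above.

```python
import math
from typing import Mapping

def gc_content(dna: str) -> float:
    """Return the percentage of G and C bases in a DNA sequence."""
    dna = dna.upper()
    return 100.0 * sum(base in "GC" for base in dna) / len(dna)

def cai(dna: str, weights: Mapping[str, float]) -> float:
    """Geometric mean of codon weights; codons missing from weights are skipped."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    codons = [dna[i:i + 3].upper() for i in range(0, usable, 3)]
    scored = [weights[c] for c in codons if c in weights]
    if not scored:
        raise ValueError("no codon in the sequence has a weight")
    return math.exp(sum(math.log(w) for w in scored) / len(scored))

# Hypothetical weights for illustration only (NOT real E. coli data).
# ATG is left out: single-codon amino acids are commonly excluded from CAI.
TOY_WEIGHTS = {"AAA": 1.0, "CGT": 1.0, "ATT": 1.0, "AGT": 0.5}

prefix = "ATGAAACGTATTAGT"  # visible prefix of the example output
print(f"GC content of prefix: {gc_content(prefix):.1f}%")
print(f"CAI with toy weights: {cai(prefix, TOY_WEIGHTS):.3f}")
```

In practice the weights come from codon frequencies in a reference set of highly expressed genes, where each codon's weight is its frequency divided by that of the most frequent synonymous codon.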

🔬 Technical Details

  • Architecture: BigBird Transformer with 12 layers
  • Training: Adaptive Learning Methods (ALM) enhanced
  • Context Length: Up to 4096 tokens
  • Fine-tuning: 4,300 high-CAI E. coli sequences
  • Reference Data: 50,000+ E. coli gene sequences for metrics

📜 Citation

If you use ColiFormer in your research, please cite:

@article{codon_transformer_2023,
  title={CodonTransformer: The Global Translation of Genetic Code by Transformer},
  author={Adibvafa Fallahpour and Bartosz Grzybowski and Bogdan Gliwa and Bartosz Michalak},
  journal={bioRxiv},
  year={2023},
  doi={10.1101/2023.09.09.556981}
}

📄 License

This project is licensed under the MIT License.


Built with ❤️ for the synthetic biology community