---
title: Language Detection App
emoji: 🌍
colorFrom: indigo
colorTo: blue
sdk: gradio
python_version: 3.9
app_file: app.py
license: mit
---

# 🌍 Language Detection App

A powerful and elegant language detection application built with a Gradio frontend and a modular backend featuring multiple state-of-the-art ML models, organized by architecture and training dataset.

## ✨ Features

- **Clean Gradio Interface**: Simple, intuitive web interface for language detection
- **Multiple Model Architectures**: Choose between XLM-RoBERTa (Model A) and BERT (Model B) architectures
- **Multiple Training Datasets**: Models trained on standard (Dataset A) and enhanced (Dataset B) datasets
- **Centralized Configuration**: All model configurations and settings in one place
- **Modular Backend**: Easy-to-extend architecture for plugging in your own ML models
- **Real-time Detection**: Instant language detection with confidence scores
- **Multiple Predictions**: Shows the top 5 language predictions with confidence levels
- **100+ Languages**: Support for major world languages (varies by model)
- **Example Texts**: Pre-loaded examples in various languages for testing
- **Model Switching**: Seamlessly switch between different models
- **Extensible**: Abstract base class for implementing custom models

## 🚀 Quick Start

### 1. Set Up the Environment

```bash
# Create virtual environment
python -m venv venv

# Activate environment
# On macOS/Linux:
source venv/bin/activate
# On Windows:
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 2. Test the Backend

```bash
# Run tests to verify everything works
python test_app.py

# Test specific model combinations
python test_model_a_dataset_a.py
python test_model_b_dataset_b.py
```

### 3. Launch the App

```bash
# Start the Gradio app
python app.py
```

The app will be available at `http://localhost:7860`.
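For orientation, here is a minimal sketch of how a Gradio app like this one can be wired up. It is illustrative only, not the project's actual `app.py`: the stand-in `detect_language` function returns hard-coded scores where the real app delegates to the backend components described below.

```python
# Illustrative sketch only -- not the project's actual app.py.
import gradio as gr


def detect_language(text: str) -> dict:
    # Stand-in for the backend call. The real app would delegate to the
    # LanguageDetector orchestrator and return {language: confidence, ...};
    # gr.Label renders such a dict as ranked confidence bars.
    return {"English": 0.97, "German": 0.02, "French": 0.01}


demo = gr.Interface(
    fn=detect_language,
    inputs=gr.Textbox(lines=3, placeholder="Enter text in any language..."),
    outputs=gr.Label(num_top_classes=5),  # top-5 predictions with confidences
    title="Language Detection App",
)

if __name__ == "__main__":
    demo.launch()  # serves on http://localhost:7860 by default
```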
## 🧩 Model Architecture

The system is organized around two dimensions:

### 🏗️ Model Architectures

- **Model A**: XLM-RoBERTa-based architectures
  - Excellent cross-lingual transfer capabilities
- **Model B**: BERT-based architectures
  - Efficient and fast processing

### 📊 Training Datasets

- **Dataset A**: Standard multilingual language detection dataset
  - Broad language coverage
- **Dataset B**: Enhanced/specialized language detection dataset
  - Ultra-high accuracy focus

### 🤖 Available Model Combinations

1. **Model A Dataset A** - XLM-RoBERTa + Standard Dataset ✅
   - **Architecture**: XLM-RoBERTa (Model A)
   - **Training**: Dataset A (standard multilingual)
   - **Accuracy**: 97.9%
   - **Size**: 278M parameters
   - **Languages**: 100+ languages
   - **Strengths**: Balanced performance, robust cross-lingual capabilities, comprehensive language coverage
   - **Use Cases**: General-purpose language detection, multilingual content processing

2. **Model B Dataset A** - BERT + Standard Dataset ✅
   - **Architecture**: BERT (Model B)
   - **Training**: Dataset A (standard multilingual)
   - **Accuracy**: 96.17%
   - **Size**: 178M parameters
   - **Languages**: 100+ languages
   - **Strengths**: Fast inference, broad language support, efficient processing
   - **Use Cases**: High-throughput detection, real-time applications, resource-constrained environments

3. **Model A Dataset B** - XLM-RoBERTa + Enhanced Dataset ✅
   - **Architecture**: XLM-RoBERTa (Model A)
   - **Training**: Dataset B (enhanced/specialized)
   - **Accuracy**: 99.72%
   - **Size**: 278M parameters
   - **Training Loss**: 0.0176
   - **Languages**: 20 carefully selected languages
   - **Strengths**: Exceptional accuracy, focused language support, state-of-the-art results
   - **Use Cases**: Research applications, high-precision detection, critical accuracy requirements

4. **Model B Dataset B** - BERT + Enhanced Dataset ✅
   - **Architecture**: BERT (Model B)
   - **Training**: Dataset B (enhanced/specialized)
   - **Accuracy**: 99.85%
   - **Size**: 178M parameters
   - **Training Loss**: 0.0125
   - **Languages**: 20 carefully selected languages
   - **Strengths**: Highest accuracy, ultra-low training loss, precision-optimized
   - **Use Cases**: Maximum precision applications, research requiring the highest accuracy

### 🏗️ Core Components

- **`BaseLanguageModel`**: Abstract interface that all models must implement
- **`ModelRegistry`**: Manages model registration and creation with centralized configuration
- **`LanguageDetector`**: Main orchestrator for language detection
- **`model_config.py`**: Centralized configuration for all models and language mappings

### 🔧 Adding New Models

To add a new model combination:

1. Create a new file in `backend/models/` (e.g., `model_c_dataset_a.py`)
2. Inherit from `BaseLanguageModel`
3. Implement the required methods
4. Add configuration to `model_config.py`
5. Register it in `ModelRegistry`

Example:

```python
# backend/models/model_c_dataset_a.py
from typing import Any, Dict, List

from .base_model import BaseLanguageModel
from .model_config import get_model_config


class ModelCDatasetA(BaseLanguageModel):
    def __init__(self):
        self.model_key = "model-c-dataset-a"
        self.config = get_model_config(self.model_key)
        # Initialize your model

    def predict(self, text: str) -> Dict[str, Any]:
        # Implement prediction logic
        pass

    def get_supported_languages(self) -> List[str]:
        # Return supported language codes
        pass

    def get_model_info(self) -> Dict[str, Any]:
        # Return model metadata from config
        pass
```

Then add the configuration in `model_config.py` and register the model in `language_detector.py`; a rough sketch of these two steps follows.
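The exact shape of the configuration entry and the registry call depend on the project's internals; the sketch below is a hedged illustration, where the field names and the `register()` signature are assumptions rather than the actual API.

```python
# Illustrative sketch only: the real schema lives in backend/models/
# model_config.py and the real registration in language_detector.py.
# Field names and the register() signature below are assumptions.

# 1) Configuration entry for the new combination (hypothetical fields):
MODEL_CONFIGS = {
    # ... existing entries ...
    "model-c-dataset-a": {
        "display_name": "Model C Dataset A",
        "architecture": "Model C",
        "dataset": "Dataset A (standard multilingual)",
        "model_id": "your-org/your-checkpoint",  # hypothetical checkpoint id
    },
}

# 2) Registration so LanguageDetector can instantiate the model by key
#    (assumes a ModelRegistry.register(key, model_class)-style API):
# from backend.models.model_c_dataset_a import ModelCDatasetA
# ModelRegistry.register("model-c-dataset-a", ModelCDatasetA)
```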
## 🧪 Testing

The project includes comprehensive test suites:

- **`test_app.py`**: General app functionality tests
- **`test_model_a_dataset_a.py`**: Tests for XLM-RoBERTa + standard dataset
- **`test_model_b_dataset_b.py`**: Tests for BERT + enhanced dataset (highest accuracy)
- **Model comparison tests**: Automated testing across all model combinations
- **Model switching tests**: Verify seamless model switching

## 🌍 Supported Languages

The models support different language sets based on their training:

- **Model A/B + Dataset A**: 100+ languages, including major European, Asian, African, and other world languages, based on the CC-100 dataset
- **Model A/B + Dataset B**: 20 carefully selected high-performance languages (Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Italian, Japanese, Dutch, Polish, Portuguese, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese)

## 📊 Model Comparison

| Feature | Model A Dataset A | Model B Dataset A | Model A Dataset B | Model B Dataset B |
|---------|-------------------|-------------------|-------------------|-------------------|
| **Architecture** | XLM-RoBERTa | BERT | XLM-RoBERTa | BERT |
| **Dataset** | Standard | Standard | Enhanced | Enhanced |
| **Accuracy** | 97.9% | 96.17% | 99.72% | **99.85%** 🏆 |
| **Model Size** | 278M | 178M | 278M | 178M |
| **Languages** | 100+ | 100+ | 20 (curated) | 20 (curated) |
| **Training Loss** | N/A | N/A | 0.0176 | **0.0125** |
| **Speed** | Moderate | **Fast** | Moderate | **Fast** |
| **Memory Usage** | Higher | **Lower** | Higher | **Lower** |
| **Best For** | Balanced performance | Speed & broad coverage | Ultra-high accuracy | **Maximum precision** |

### 🎯 Model Selection Guide

- **🏆 Model B Dataset B**: Choose for maximum accuracy on 20 core languages (99.85%)
- **🔬 Model A Dataset B**: Choose for ultra-high accuracy on 20 core languages (99.72%)
- **⚖️ Model A Dataset A**: Choose for balanced performance and comprehensive language coverage (97.9%)
- **⚡ Model B Dataset A**: Choose for fast inference and broad language coverage (96.17%)

## 🔧 Configuration

You can configure models using the centralized configuration system:

```python
# Default model selection
detector = LanguageDetector(model_key="model-a-dataset-a")  # Balanced XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-a")  # Fast BERT
detector = LanguageDetector(model_key="model-a-dataset-b")  # Ultra-high accuracy XLM-RoBERTa
detector = LanguageDetector(model_key="model-b-dataset-b")  # Maximum precision BERT

# All configurations are centralized in backend/models/model_config.py
```
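From there, running a detection might look like the following minimal sketch. The `detect()` method name and the shape of the returned predictions are assumptions for illustration; check `backend/language_detector.py` for the actual API.

```python
# Minimal usage sketch. The detect() method name and the result shape are
# assumptions for illustration; see backend/language_detector.py for the
# actual API.
from backend.language_detector import LanguageDetector

detector = LanguageDetector(model_key="model-b-dataset-b")  # maximum precision

result = detector.detect("Bonjour tout le monde")  # hypothetical method name

# Hypothetical result shape: a ranked list of (language, confidence) pairs,
# matching the app's "top 5 predictions with confidence levels" feature.
for language, confidence in result["predictions"][:5]:
    print(f"{language}: {confidence:.2%}")
```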
## 📁 Project Structure

```
language-detection/
├── backend/
│   ├── models/
│   │   ├── model_config.py        # Centralized configuration
│   │   ├── base_model.py          # Abstract base class
│   │   ├── model_a_dataset_a.py   # XLM-RoBERTa + Standard
│   │   ├── model_b_dataset_a.py   # BERT + Standard
│   │   ├── model_a_dataset_b.py   # XLM-RoBERTa + Enhanced
│   │   ├── model_b_dataset_b.py   # BERT + Enhanced
│   │   └── __init__.py
│   └── language_detector.py       # Main orchestrator
├── tests/
├── app.py                         # Gradio interface
└── README.md
```

## 🤝 Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/new-model-combination`)
3. Implement your model following the `BaseLanguageModel` interface
4. Add configuration to `model_config.py`
5. Add tests for your implementation
6. Commit your changes (`git commit -m 'Add new model combination'`)
7. Push to the branch (`git push origin feature/new-model-combination`)
8. Open a Pull Request

## 📝 License

This project is open source and available under the MIT License.

## 🙏 Acknowledgments

- **Hugging Face** for the transformers library and model hosting platform
- **Model providers** for the fine-tuned language detection models used in this project
- **Gradio** for the excellent web interface framework
- **Open source community** for the foundational technologies that make this project possible