--- title: AI Language Monitor emoji: 🌍 colorFrom: purple colorTo: pink sdk: docker app_port: 8000 license: cc-by-sa-4.0 short_description: Evaluating LLM performance across all human languages. datasets: - openlanguagedata/flores_plus - google/fleurs - mozilla-foundation/common_voice_1_0 - CohereForAI/Global-MMLU models: - meta-llama/Llama-3.3-70B-Instruct - mistralai/Mistral-Small-24B-Instruct-2501 - deepseek-ai/DeepSeek-V3 - microsoft/phi-4 - openai/whisper-large-v3 - google/gemma-3-27b-it tags: - leaderboard - submission:manual - test:public - judge:auto - modality:text - modality:artefacts - eval:generation - language:English - language:German --- [![Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-purple)](https://huggingface.co/spaces/datenlabor-bmz/ai-language-monitor) # AI Language Monitor 🌍 _Tracking language proficiency of AI models for every language_ ## System Architecture The AI Language Monitor evaluates language models across 100+ languages using a comprehensive pipeline that combines model discovery, automated evaluation, and real-time visualization. ```mermaid flowchart TD %% Model Sources A1["important_models
Static Curated List"] --> D[load_models] A2["get_historical_popular_models
Web Scraping - Top 20"] --> D A3["get_current_popular_models
Web Scraping - Top 10"] --> D A4["blocklist
Exclusions"] --> D %% Model Processing D --> |"Combine & Dedupe"| E["Dynamic Model List
~40-50 models"] E --> |get_or_metadata| F["OpenRouter API
Model Metadata"] F --> |get_hf_metadata| G["HuggingFace API
Model Details"] G --> H["Enriched Model DataFrame"] H --> |Save| I[models.json] %% Model Validation & Cost Filtering H --> |"Validate Models
Check API Availability"| H1["Valid Models Only
Cost ≤ $20/1M tokens"] H1 --> |"Timeout Protection
120s for Large Models"| H2["Robust Model List"] %% Language Data J["languages.py
BCP-47 + Population"] --> K["Top 100 Languages"] %% Task Registry with Unified Prompting L["tasks.py
7 Evaluation Tasks"] --> M["Task Functions
Unified English Zero-Shot"] M --> M1["translation_from/to
BLEU + ChrF"] M --> M2["classification
Accuracy"] M --> M3["mmlu
Accuracy"] M --> M4["arc
Accuracy"] M --> M5["truthfulqa
Accuracy"] M --> M6["mgsm
Accuracy"] %% On-the-fly Translation with Origin Tagging subgraph OTF [On-the-fly Dataset Translation] direction LR DS_raw["Raw English Dataset
(e.g., MMLU)"] --> Google_Translate["Google Translate API"] Google_Translate --> DS_translated["Translated Dataset
(e.g., German MMLU)
Origin: 'machine'"] DS_native["Native Dataset
(e.g., German MMLU)
Origin: 'human'"] end %% Evaluation Pipeline H2 --> |"models ID"| N["main.py / main_gcs.py
evaluate"] K --> |"languages bcp_47"| N L --> |"tasks.items"| N N --> |"Filter by model.tasks"| O["Valid Combinations
Model × Language × Task"] O --> |"10 samples each"| P["Evaluation Execution
Batch Processing"] %% Task Execution with Origin Tracking P --> Q1[translate_and_evaluate
Origin: 'human'] P --> Q2[classify_and_evaluate
Origin: 'human'] P --> Q3[mmlu_and_evaluate
Origin: 'human'/'machine'] P --> Q4[arc_and_evaluate
Origin: 'human'/'machine'] P --> Q5[truthfulqa_and_evaluate
Origin: 'human'/'machine'] P --> Q6[mgsm_and_evaluate
Origin: 'human'/'machine'] %% API Calls with Error Handling Q1 --> |"complete() API
Rate Limiting"| R["OpenRouter
Model Inference"] Q2 --> |"complete() API
Rate Limiting"| R Q3 --> |"complete() API
Rate Limiting"| R Q4 --> |"complete() API
Rate Limiting"| R Q5 --> |"complete() API
Rate Limiting"| R Q6 --> |"complete() API
Rate Limiting"| R %% Results Processing with Origin Aggregation R --> |Scores| S["Result Aggregation
Mean by model+lang+task+origin"] S --> |Save| T[results.json] %% Backend & Frontend with Origin-Specific Metrics T --> |Read| U[backend.py] I --> |Read| U U --> |make_model_table| V["Model Rankings
Origin-Specific Metrics"] U --> |make_country_table| W["Country Aggregation"] U --> |"API Endpoint"| X["FastAPI /api/data
arc_accuracy_human
arc_accuracy_machine"] X --> |"JSON Response"| Y["Frontend React App"] %% UI Components Y --> Z1["WorldMap.js
Country Visualization"] Y --> Z2["ModelTable.js
Model Rankings"] Y --> Z3["LanguageTable.js
Language Coverage"] Y --> Z4["DatasetTable.js
Task Performance"] %% Data Sources with Origin Information subgraph DS ["Data Sources"] DS1["Flores-200
Translation Sentences
Origin: 'human'"] DS2["MMLU/AfriMMLU
Knowledge QA
Origin: 'human'"] DS3["ARC
Science Reasoning
Origin: 'human'"] DS4["TruthfulQA
Truthfulness
Origin: 'human'"] DS5["MGSM
Math Problems
Origin: 'human'"] end DS1 --> Q1 DS2 --> Q3 DS3 --> Q4 DS4 --> Q5 DS5 --> Q6 DS_translated --> Q3 DS_translated --> Q4 DS_translated --> Q5 DS_native --> Q3 DS_native --> Q4 DS_native --> Q5 %% Styling - Neutral colors that work in both dark and light modes classDef modelSource fill:#f8f9fa,stroke:#6c757d,color:#212529 classDef evaluation fill:#e9ecef,stroke:#495057,color:#212529 classDef api fill:#dee2e6,stroke:#6c757d,color:#212529 classDef storage fill:#d1ecf1,stroke:#0c5460,color:#0c5460 classDef frontend fill:#f8d7da,stroke:#721c24,color:#721c24 classDef translation fill:#d4edda,stroke:#155724,color:#155724 class A1,A2,A3,A4 modelSource class Q1,Q2,Q3,Q4,Q5,Q6,P evaluation class R,F,G,X api class T,I storage class Y,Z1,Z2,Z3,Z4 frontend class Google_Translate,DS_translated,DS_native translation ``` **Key Features:** - **Model Discovery**: Combines curated models with real-time trending models via web scraping - **Multi-Task Evaluation**: 7 tasks across 100+ languages with origin tracking (human vs machine-translated) - **Scalable Architecture**: Dual deployment (local/GitHub vs Google Cloud) - **Real-time Visualization**: Interactive web interface with country-level insights ## Evaluate ### Local Development ```bash uv run --extra dev evals/main.py ``` ### Google Cloud Deployment ```bash uv run --extra dev evals/main_gcs.py ``` ## Explore ```bash uv run evals/backend.py cd frontend && npm i && npm start ```