---
title: AI Language Monitor
emoji: 🌍
colorFrom: purple
colorTo: pink
sdk: docker
app_port: 8000
license: cc-by-sa-4.0
short_description: Evaluating LLM performance across all human languages.
datasets:
- openlanguagedata/flores_plus
- google/fleurs
- mozilla-foundation/common_voice_1_0
- CohereForAI/Global-MMLU
models:
- meta-llama/Llama-3.3-70B-Instruct
- mistralai/Mistral-Small-24B-Instruct-2501
- deepseek-ai/DeepSeek-V3
- microsoft/phi-4
- openai/whisper-large-v3
- google/gemma-3-27b-it
tags:
- leaderboard
- submission:manual
- test:public
- judge:auto
- modality:text
- modality:artefacts
- eval:generation
- language:English
- language:German
---
[AI Language Monitor on Hugging Face Spaces](https://huggingface.co/spaces/datenlabor-bmz/ai-language-monitor)
# AI Language Monitor 🌍
_Tracking language proficiency of AI models for every language_
## System Architecture
The AI Language Monitor evaluates language models across 100+ languages using a comprehensive pipeline that combines model discovery, automated evaluation, and real-time visualization.
```mermaid
flowchart TD
%% Model Sources
A1["important_models
Static Curated List"] --> D[load_models]
A2["get_historical_popular_models
Web Scraping - Top 20"] --> D
A3["get_current_popular_models
Web Scraping - Top 10"] --> D
A4["blocklist
Exclusions"] --> D
%% Model Processing
D --> |"Combine & Dedupe"| E["Dynamic Model List
~40-50 models"]
E --> |get_or_metadata| F["OpenRouter API
Model Metadata"]
F --> |get_hf_metadata| G["HuggingFace API
Model Details"]
G --> H["Enriched Model DataFrame"]
H --> |Save| I[models.json]
%% Model Validation & Cost Filtering
H --> |"Validate Models
Check API Availability"| H1["Valid Models Only
Cost ≤ $20/1M tokens"]
H1 --> |"Timeout Protection
120s for Large Models"| H2["Robust Model List"]
%% Language Data
J["languages.py
BCP-47 + Population"] --> K["Top 100 Languages"]
%% Task Registry with Unified Prompting
L["tasks.py
7 Evaluation Tasks"] --> M["Task Functions
Unified English Zero-Shot"]
M --> M1["translation_from/to
BLEU + ChrF"]
M --> M2["classification
Accuracy"]
M --> M3["mmlu
Accuracy"]
M --> M4["arc
Accuracy"]
M --> M5["truthfulqa
Accuracy"]
M --> M6["mgsm
Accuracy"]
%% On-the-fly Translation with Origin Tagging
subgraph OTF [On-the-fly Dataset Translation]
direction LR
DS_raw["Raw English Dataset
(e.g., MMLU)"] --> Google_Translate["Google Translate API"]
Google_Translate --> DS_translated["Translated Dataset
(e.g., German MMLU)
Origin: 'machine'"]
DS_native["Native Dataset
(e.g., German MMLU)
Origin: 'human'"]
end
%% Evaluation Pipeline
H2 --> |"model IDs"| N["main.py / main_gcs.py
evaluate"]
K --> |"languages bcp_47"| N
L --> |"tasks.items"| N
N --> |"Filter by model.tasks"| O["Valid Combinations
Model × Language × Task"]
O --> |"10 samples each"| P["Evaluation Execution
Batch Processing"]
%% Task Execution with Origin Tracking
P --> Q1["translate_and_evaluate
Origin: 'human'"]
P --> Q2["classify_and_evaluate
Origin: 'human'"]
P --> Q3["mmlu_and_evaluate
Origin: 'human'/'machine'"]
P --> Q4["arc_and_evaluate
Origin: 'human'/'machine'"]
P --> Q5["truthfulqa_and_evaluate
Origin: 'human'/'machine'"]
P --> Q6["mgsm_and_evaluate
Origin: 'human'/'machine'"]
%% API Calls with Error Handling
Q1 --> |"complete() API
Rate Limiting"| R["OpenRouter
Model Inference"]
Q2 --> |"complete() API
Rate Limiting"| R
Q3 --> |"complete() API
Rate Limiting"| R
Q4 --> |"complete() API
Rate Limiting"| R
Q5 --> |"complete() API
Rate Limiting"| R
Q6 --> |"complete() API
Rate Limiting"| R
%% Results Processing with Origin Aggregation
R --> |Scores| S["Result Aggregation
Mean by model+lang+task+origin"]
S --> |Save| T[results.json]
%% Backend & Frontend with Origin-Specific Metrics
T --> |Read| U[backend.py]
I --> |Read| U
U --> |make_model_table| V["Model Rankings
Origin-Specific Metrics"]
U --> |make_country_table| W["Country Aggregation"]
U --> |"API Endpoint"| X["FastAPI /api/data
arc_accuracy_human
arc_accuracy_machine"]
X --> |"JSON Response"| Y["Frontend React App"]
%% UI Components
Y --> Z1["WorldMap.js
Country Visualization"]
Y --> Z2["ModelTable.js
Model Rankings"]
Y --> Z3["LanguageTable.js
Language Coverage"]
Y --> Z4["DatasetTable.js
Task Performance"]
%% Data Sources with Origin Information
subgraph DS ["Data Sources"]
DS1["Flores-200
Translation Sentences
Origin: 'human'"]
DS2["MMLU/AfriMMLU
Knowledge QA
Origin: 'human'"]
DS3["ARC
Science Reasoning
Origin: 'human'"]
DS4["TruthfulQA
Truthfulness
Origin: 'human'"]
DS5["MGSM
Math Problems
Origin: 'human'"]
end
DS1 --> Q1
DS2 --> Q3
DS3 --> Q4
DS4 --> Q5
DS5 --> Q6
DS_translated --> Q3
DS_translated --> Q4
DS_translated --> Q5
DS_native --> Q3
DS_native --> Q4
DS_native --> Q5
%% Styling - Neutral colors that work in both dark and light modes
classDef modelSource fill:#f8f9fa,stroke:#6c757d,color:#212529
classDef evaluation fill:#e9ecef,stroke:#495057,color:#212529
classDef api fill:#dee2e6,stroke:#6c757d,color:#212529
classDef storage fill:#d1ecf1,stroke:#0c5460,color:#0c5460
classDef frontend fill:#f8d7da,stroke:#721c24,color:#721c24
classDef translation fill:#d4edda,stroke:#155724,color:#155724
class A1,A2,A3,A4 modelSource
class Q1,Q2,Q3,Q4,Q5,Q6,P evaluation
class R,F,G,X api
class T,I storage
class Y,Z1,Z2,Z3,Z4 frontend
class Google_Translate,DS_translated,DS_native translation
```
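The first stage of the diagram, model discovery, merges the static curated list with scraped trending models, drops blocklisted entries, and de-duplicates the result. A minimal sketch of that merge-and-dedupe step, assuming plain lists of OpenRouter model IDs (the helper names mirror the diagram but are placeholders, not the project's actual functions):

```python
# Illustrative sketch of the "load_models" merge-and-dedupe step from the
# diagram. important_models, blocklist and the two scraping helpers below are
# placeholders standing in for the real implementations in the evals package.
important_models = [
    "meta-llama/llama-3.3-70b-instruct",
    "deepseek/deepseek-chat",
]
blocklist = {"example/broken-model"}

def get_historical_popular_models():
    # placeholder for the web-scraping helper (top 20 historical models)
    return ["mistralai/mistral-small-24b-instruct-2501"]

def get_current_popular_models():
    # placeholder for the web-scraping helper (top 10 trending models)
    return ["google/gemma-3-27b-it", "deepseek/deepseek-chat"]

def load_models():
    candidates = (
        important_models
        + get_historical_popular_models()
        + get_current_popular_models()
    )
    seen, merged = set(), []
    for model_id in candidates:
        if model_id in blocklist or model_id in seen:
            continue
        seen.add(model_id)
        merged.append(model_id)
    return merged  # roughly 40-50 models in the real pipeline

print(load_models())
```

Keeping the curated list first means hand-picked models survive de-duplication even when they also appear in the scraped rankings.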
**Key Features:**
- **Model Discovery**: Combines curated models with real-time trending models via web scraping
- **Multi-Task Evaluation**: 7 tasks across 100+ languages with origin tracking (human vs machine-translated; aggregation sketched below)
- **Scalable Architecture**: Dual deployment (local/GitHub vs Google Cloud)
- **Real-time Visualization**: Interactive web interface with country-level insights
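
Origin tracking carries through to the aggregation step at the bottom of the diagram: per-sample scores are averaged per model, language, task, and origin before being written to `results.json`, so human-translated and machine-translated benchmarks are never mixed into one number. A minimal sketch of that grouping, assuming a flat record format (column names such as `bcp_47` and `score` are illustrative):

```python
# Minimal sketch of the "Result Aggregation" step: per-sample scores are
# averaged per model + language + task + origin so that human-translated and
# machine-translated items stay separate. Column names are illustrative.
import json

import pandas as pd

raw_scores = pd.DataFrame([
    {"model": "microsoft/phi-4", "bcp_47": "de", "task": "mmlu",
     "origin": "human", "score": 0.8},
    {"model": "microsoft/phi-4", "bcp_47": "de", "task": "mmlu",
     "origin": "machine", "score": 0.7},
])

aggregated = (
    raw_scores
    .groupby(["model", "bcp_47", "task", "origin"], as_index=False)["score"]
    .mean()
)

with open("results.json", "w") as f:
    json.dump(aggregated.to_dict(orient="records"), f, indent=2)
```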
## Evaluate
### Local Development
```bash
uv run --extra dev evals/main.py
```
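
For orientation, the evaluation entry point roughly amounts to enumerating valid model × language × task combinations and scoring a small fixed sample per cell (10 per the diagram). The sketch below is a simplification with placeholder names, not the actual `main.py`:

```python
# Simplified, self-contained sketch of the evaluation loop: every valid
# model x language x task combination is scored on a small fixed number of
# samples (10 in the diagram). evaluate_sample is a placeholder; the real
# pipeline calls OpenRouter with rate limiting and error handling.
import asyncio
from statistics import mean

N_SAMPLES = 10

async def evaluate_sample(model: str, bcp_47: str, task: str, i: int) -> dict:
    # placeholder: pretend every sample scores 1.0
    return {"model": model, "bcp_47": bcp_47, "task": task, "score": 1.0}

async def evaluate(models: list[str], languages: list[str], tasks: list[str]):
    jobs = [
        evaluate_sample(model, bcp_47, task, i)
        for model in models
        for bcp_47 in languages
        for task in tasks
        for i in range(N_SAMPLES)
    ]
    return await asyncio.gather(*jobs)

scores = asyncio.run(
    evaluate(["microsoft/phi-4"], ["de"], ["mmlu"])
)
print(mean(s["score"] for s in scores))
```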
### Google Cloud Deployment
```bash
uv run --extra dev evals/main_gcs.py
```
## Explore
```bash
uv run evals/backend.py
cd frontend && npm i && npm start
```
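
With the backend running, the aggregated tables are served from the FastAPI endpoint shown in the diagram. A quick way to peek at the same payload the React app consumes, assuming the backend listens on `localhost:8000` and `/api/data` answers plain GET requests (adjust if the real route differs):

```python
# Hedged example: inspect the backend's aggregated data outside the React app.
# Assumes the backend is reachable on localhost:8000 and that /api/data
# answers plain GET requests; adjust host, port, or method if the real
# endpoint differs.
import json
from urllib.request import urlopen

with urlopen("http://localhost:8000/api/data") as resp:
    data = json.load(resp)

# origin-specific metrics such as arc_accuracy_human / arc_accuracy_machine
# should show up in the model table portion of this payload
print(str(data)[:500])
```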