# Code Explanation: Text Language Detector

This document explains the **Text Language Detector** app, detailing each part of the provided code and intended use cases.

---

## Overview

**Purpose**

Detect the language of input text, choosing among the 20 languages the model supports, and return the full language name, ISO 639-1 code, and confidence score.

**Tech Stack**

- **Model**: `papluca/xlm-roberta-base-language-detection` (Hugging Face Transformers)
- **Model Precision**: `torch_dtype=torch.bfloat16` for reduced memory usage
- **Language Mapping**: `pycountry` to convert ISO codes to full language names
- **Interface**: Gradio Blocks + Buttons

---

## Setup & Dependencies

Install the required libraries:

```bash
pip install transformers gradio torch pycountry
```
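
Before launching the app, it can help to confirm that the four packages from the install command are actually present in the active environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of the app.

```python
from importlib import metadata

# Package names taken from the pip command above
REQUIRED = ("transformers", "gradio", "torch", "pycountry")


def installed_versions(packages=REQUIRED):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```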

---

## Detailed Block-by-Block Code Explanation

```python
import torch
import gradio as gr
from transformers import pipeline
import pycountry

# Load the language-detection pipeline with bfloat16 precision
language_detector = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
    torch_dtype=torch.bfloat16
)


def detect_language(text: str) -> str:
    # Skip the model call when the input is empty or whitespace only
    if not text or not text.strip():
        return "Please enter some text."

    # truncation=True keeps long inputs within the model's maximum sequence length
    result = language_detector(text, truncation=True)[0]
    code = result["label"]   # e.g. "en", "fr", "hi"
    score = result["score"]

    # Map the ISO 639-1 code to a full language name using pycountry
    try:
        lang = pycountry.languages.get(alpha_2=code).name
    except (AttributeError, LookupError):
        # Unknown code: fall back to the upper-cased code itself
        lang = code.upper()

    return f"{lang} ({code}) - {score:.2f}"


# Build the Gradio interface
with gr.Blocks(theme=gr.themes.Default()) as demo:
    gr.Markdown("## Text Language Detector")
    gr.Markdown("Type or paste text below to detect its language (name + code + confidence).")

    with gr.Row():
        text_input = gr.Textbox(label="Input Text", placeholder="Type or paste text here...", lines=4)
        lang_output = gr.Textbox(label="✅ Detected Language", placeholder="Language & confidence", lines=1, interactive=False)

    # Run detect_language on the input text when the button is clicked
    detect_btn = gr.Button("Detect Language")
    detect_btn.click(fn=detect_language, inputs=text_input, outputs=lang_output)

    gr.Markdown("---")
    gr.Markdown("Built with 🤗 Transformers (`papluca/xlm-roberta-base-language-detection`), `pycountry`, and Gradio")

demo.launch()
```
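
The code above reads a `label` and a `score` from the first pipeline result and formats them as name, code, and confidence. The sketch below isolates that formatting step so it can be tried without downloading the model. Everything in it is an assumption for illustration: `SAMPLE_NAMES` is a tiny hand-written table standing in for `pycountry`, and `lookup_name` and `format_prediction` are hypothetical helpers, not part of the app.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-in for pycountry: a tiny code -> name table
SAMPLE_NAMES: Dict[str, str] = {"en": "English", "fr": "French", "hi": "Hindi"}


def lookup_name(code: str) -> Optional[str]:
    """Return the language name for a code, or None if it is unknown."""
    return SAMPLE_NAMES.get(code)


def format_prediction(
    result: dict,
    name_lookup: Callable[[str], Optional[str]] = lookup_name,
) -> str:
    """Format one {'label': ..., 'score': ...} dict as 'Name (code) - 0.99'."""
    code = result["label"]
    score = result["score"]
    # Fall back to the upper-cased code when no name is found
    name = name_lookup(code) or code.upper()
    return f"{name} ({code}) - {score:.2f}"


print(format_prediction({"label": "fr", "score": 0.9934}))  # French (fr) - 0.99
print(format_prediction({"label": "sw", "score": 0.5}))     # SW (sw) - 0.50
```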

---

## Core Concepts

| Concept               | Why It Matters                                              |
|-----------------------|-------------------------------------------------------------|
| Hugging Face Pipeline | One-line model loading & inference                          |
| bfloat16 Precision    | Lower memory usage, faster inference on supported hardware  |
| pycountry Mapping     | Converts ISO codes to human-readable language names         |
| Gradio Blocks         | Builds interactive web apps with pure Python                |

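
To make the bfloat16 row concrete: memory for the weights is roughly the parameter count times the bytes per parameter, so moving from 4-byte float32 to 2-byte bfloat16 halves it. The sketch below does that arithmetic. The parameter count is an assumed, approximate figure used only for illustration, and the estimate covers weights alone, not activations or framework overhead.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Assumption: approximate parameter count, for illustration only
APPROX_PARAMS = 278_000_000


def weight_memory_mib(num_params: int, dtype: str) -> float:
    """Estimate the memory taken by model weights alone, in MiB."""
    return num_params * BYTES_PER_PARAM[dtype] / (1024 ** 2)


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_mib(APPROX_PARAMS, dtype):.0f} MiB")
```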

---

## Intended Uses & Limitations

You can directly use this model as a language detector for sequence classification tasks. Currently, it supports the following 20 languages:

- Arabic (ar)
- Bulgarian (bg)
- German (de)
- Modern Greek (el)
- English (en)
- Spanish (es)
- French (fr)
- Hindi (hi)
- Italian (it)
- Japanese (ja)
- Dutch (nl)
- Polish (pl)
- Portuguese (pt)
- Russian (ru)
- Swahili (sw)
- Thai (th)
- Turkish (tr)
- Urdu (ur)
- Vietnamese (vi)
- Chinese (zh)

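
Because the model is a classifier over a fixed label set, text in a language outside this list is still assigned one of the 20 codes. One possible guard, sketched below, is to treat low-confidence predictions as "unknown". This is only a heuristic: `accept_prediction` and the `MIN_CONFIDENCE` cut-off are hypothetical and would need tuning on real data.

```python
# The 20 language codes listed above
SUPPORTED_CODES = frozenset({
    "ar", "bg", "de", "el", "en", "es", "fr", "hi", "it", "ja",
    "nl", "pl", "pt", "ru", "sw", "th", "tr", "ur", "vi", "zh",
})

# Hypothetical cut-off; tune it on your own data
MIN_CONFIDENCE = 0.80


def accept_prediction(result: dict, min_confidence: float = MIN_CONFIDENCE) -> bool:
    """Accept a {'label': ..., 'score': ...} dict only if the label is a
    supported code and the score reaches the confidence cut-off."""
    return result["label"] in SUPPORTED_CODES and result["score"] >= min_confidence


print(accept_prediction({"label": "en", "score": 0.99}))  # True
print(accept_prediction({"label": "en", "score": 0.42}))  # False
```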

---