# DistilBERT Multilingual Model for Language Identification

This repository contains a fine-tuned version of the `distilbert-base-multilingual-cased` transformer model for language identification, trained on the `papluca/language-identification` dataset from Hugging Face.
## Model Details
- Model Architecture: DistilBERT (Base, Multilingual, Cased)
- Task: Language Identification
- Dataset: papluca/language-identification
- Quantization: Float16
- Fine-tuning Framework: Hugging Face Transformers
## Installation

```bash
pip install transformers datasets scikit-learn torch
```
## Loading the Model

The snippet below assumes the fine-tuned checkpoint saved in `quantized-model/` and rebuilds the label encoder from the dataset's language codes.

```python
import torch
from datasets import load_dataset
from sklearn.preprocessing import LabelEncoder
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load tokenizer and model. The quantized-model/ checkpoint from this repository
# is assumed here; replace with your own path or Hub repo id.
model_path = "quantized-model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Rebuild the label encoder from the dataset's language codes
# (the label column is assumed to be named "labels").
train_labels = load_dataset("papluca/language-identification", split="train")["labels"]
label_encoder = LabelEncoder().fit(train_labels)

# Define test sentences
sample_texts = [
    "This is an English sentence.",
    "C'est une phrase en français.",
    "यह एक हिंदी वाक्य है।",
]

# Tokenize and predict
def predict_language(texts, model, tokenizer, label_encoder):
    if isinstance(texts, str):
        texts = [texts]
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    preds = torch.argmax(outputs.logits, dim=1).cpu().numpy()  # move to CPU first
    return label_encoder.inverse_transform(preds)

print(predict_language(sample_texts, model, tokenizer, label_encoder))
```
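Run against the three `sample_texts` above, this should print the corresponding ISO 639-1 codes (expected to be `['en', 'fr', 'hi']`), provided the label encoder was fitted on the same language codes used during fine-tuning.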
## Performance Metrics
- Accuracy: 0.993300
- Precision: 0.993337
- Recall: 0.993648
- F1 Score: 0.993300
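These figures were measured on the held-out test split. A minimal sketch of how comparable numbers could be reproduced with scikit-learn is shown below; the reuse of `predict_language` from the loading example, the batch size, and macro averaging for precision/recall/F1 are assumptions, not a description of the original evaluation script.

```python
from datasets import load_dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Assumes `model`, `tokenizer`, `label_encoder`, and `predict_language`
# from the "Loading the Model" example are already defined.
test_split = load_dataset("papluca/language-identification", split="test")
texts, gold = test_split["text"], test_split["labels"]

# Predict in small batches to keep memory usage modest.
pred = []
for i in range(0, len(texts), 64):
    pred.extend(predict_language(texts[i:i + 64], model, tokenizer, label_encoder))

precision, recall, f1, _ = precision_recall_fscore_support(gold, pred, average="macro")
print("accuracy :", accuracy_score(gold, pred))
print("precision:", precision)
print("recall   :", recall)
print("f1       :", f1)
```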
## Fine-Tuning Details

### Dataset

The model is trained on the `papluca/language-identification` dataset from Hugging Face, which contains text samples labeled with ISO 639-1 language codes. It covers 20 languages with a balanced class distribution across the training, validation, and test splits.
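For a quick sanity check, the splits and the set of language codes can be inspected with the `datasets` library; the column name `labels` used here is an assumption based on the dataset card.

```python
from datasets import load_dataset

# Inspect split sizes and the language codes present in the training data.
ds = load_dataset("papluca/language-identification")
print({split: len(ds[split]) for split in ds})   # train / validation / test sizes
print(sorted(set(ds["train"]["labels"])))        # the 20 ISO 639-1 codes
```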
### Training

- Epochs: 2
- Batch size: 32
- Learning rate: 2e-5
- Evaluation strategy: epoch
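These hyperparameters map directly onto the Hugging Face `Trainer` API. The following is an illustrative sketch under those settings rather than the original training script; the preprocessing function, output directory name, and the `eval_strategy` argument name are assumptions.

```python
from datasets import load_dataset
from sklearn.preprocessing import LabelEncoder
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ds = load_dataset("papluca/language-identification")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

# Encode the ISO 639-1 codes as integer class ids.
label_encoder = LabelEncoder().fit(ds["train"]["labels"])

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True)
    enc["label"] = label_encoder.transform(batch["labels"]).tolist()
    return enc

tokenized = ds.map(preprocess, batched=True, remove_columns=ds["train"].column_names)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-multilingual-cased", num_labels=len(label_encoder.classes_)
)

# Note: recent transformers releases require the `accelerate` package for Trainer.
args = TrainingArguments(
    output_dir="distilbert-langid",   # assumed output directory
    num_train_epochs=2,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    eval_strategy="epoch",            # older versions name this evaluation_strategy
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```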
### Quantization

Post-training quantization was applied using PyTorch's `half()` conversion (FP16) to reduce model size and inference time.
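As an illustration, a conversion of this kind can be done with a single `half()` call before saving; the sketch below shows the general approach, not the exact script used, and the `distilbert-langid` input path is a placeholder.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the full-precision fine-tuned checkpoint (placeholder path),
# cast its weights to FP16, and save the quantized copy with its tokenizer.
model = AutoModelForSequenceClassification.from_pretrained("distilbert-langid")
tokenizer = AutoTokenizer.from_pretrained("distilbert-langid")

model = model.half()                          # cast parameters to torch.float16
print(next(model.parameters()).dtype)         # torch.float16

model.save_pretrained("quantized-model")      # corresponds to quantized-model/ below
tokenizer.save_pretrained("quantized-model")
```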
## Repository Structure

```
.
├── quantized-model/           # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                  # Model documentation
```
## Limitations

- The model may not generalize well to code-mixed or low-resource languages.
- FP16 quantization may introduce slight numerical instability in edge cases.
- Smaller language subsets may be underrepresented.
## Contributing
Feel free to open issues or submit pull requests to improve the model or documentation.