
DistilBERT Multilingual Model for Language Identification

This repository contains a version of the distilbert-base-multilingual-cased transformer model fine-tuned for language identification on the papluca/language-identification dataset from Hugging Face.

Model Details

  • Model Architecture: DistilBERT (Base, Multilingual, Cased)
  • Task: Language Identification
  • Dataset: papluca/language-identification
  • Quantization: Float16 (FP16)
  • Model Size: ~135M parameters
  • Fine-tuning Framework: Hugging Face Transformers

Installation

pip install transformers datasets scikit-learn torch

Loading the Model

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load tokenizer and model from the fine-tuned, quantized checkpoint
# (the quantized-model/ directory in this repository; see Repository Structure)
model_path = "quantized-model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Define test sentences
sample_texts = [
    "This is an English sentence.",
    "C'est une phrase en français.",
    "यह एक हिंदी वाक्य है।"
]


# Tokenize a batch of texts and predict their language codes
def predict_language(texts, model, tokenizer, label_encoder):
    # `label_encoder`: scikit-learn LabelEncoder fitted on the training labels,
    # used to map predicted class indices back to ISO 639-1 codes
    if isinstance(texts, str):
        texts = [texts]

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()

    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        preds = torch.argmax(outputs.logits, dim=1).cpu().numpy()  # move to CPU first

    predicted_languages = label_encoder.inverse_transform(preds)
    return predicted_languages
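
The label_encoder argument is assumed to be the scikit-learn LabelEncoder fitted on the training labels during fine-tuning. A minimal sketch of rebuilding it from the dataset and calling the helper on the sample sentences (the labels column name follows the dataset card):

from datasets import load_dataset
from sklearn.preprocessing import LabelEncoder

# Refit the encoder on the dataset's language codes (assumed to reproduce the
# ordering used during fine-tuning, since LabelEncoder sorts labels)
dataset = load_dataset("papluca/language-identification")
label_encoder = LabelEncoder()
label_encoder.fit(dataset["train"]["labels"])

predictions = predict_language(sample_texts, model, tokenizer, label_encoder)
print(predictions)  # e.g. ['en' 'fr' 'hi'] for the sample sentences above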

Performance Metrics

The fine-tuned model achieves the following evaluation scores (a sketch for reproducing them with scikit-learn follows the list):

  • Accuracy: 0.993300
  • Precision: 0.993337
  • Recall: 0.993648
  • F1 Score: 0.993300
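
A hedged scikit-learn sketch for reproducing these numbers on the dataset's test split (the weighted averaging and the compute_metrics name are assumptions, not taken from the original evaluation script):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    # y_true / y_pred are integer class indices for the test split
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted"  # assumed averaging strategy
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}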

Fine-Tuning Details

Dataset

The model is trained on the papluca/language-identification dataset from Hugging Face, which contains text samples labeled with ISO 639-1 language codes. It covers 20 languages with a balanced class distribution across the training, validation, and test splits.
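
A quick way to inspect the dataset with the datasets library (the split and column layout in the comments follows the dataset card):

from datasets import load_dataset

# Download the train / validation / test splits
dataset = load_dataset("papluca/language-identification")
print(dataset)              # split sizes and column names
print(dataset["train"][0])  # a single example: raw text plus its ISO 639-1 label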

Training

Fine-tuning used the following hyperparameters (a sketch mapping them onto the Trainer API follows the list):

  • Epochs: 2
  • Batch size: 32
  • Learning rate: 2e-5
  • Evaluation strategy: epoch
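
A minimal sketch of a training script consistent with these settings, using the Hugging Face Trainer (names such as output_dir, the max_length of 128, and the preprocessing details are illustrative assumptions, not the original script):

from datasets import load_dataset
from sklearn.preprocessing import LabelEncoder
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("papluca/language-identification")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

# Encode the ISO 639-1 language codes as integer class indices
label_encoder = LabelEncoder()
label_encoder.fit(dataset["train"]["labels"])

def preprocess(batch):
    encoded = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
    encoded["label"] = label_encoder.transform(batch["labels"])
    return encoded

tokenized = dataset.map(preprocess, batched=True, remove_columns=["labels", "text"])

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-multilingual-cased",
    num_labels=len(label_encoder.classes_),
)

training_args = TrainingArguments(
    output_dir="./results",             # illustrative output path
    num_train_epochs=2,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    evaluation_strategy="epoch",        # "eval_strategy" in newer transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
trainer.train()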

Quantization

Post-training quantization was applied by casting the model to half precision (FP16) with PyTorch's half() method, reducing model size and inference time.
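
A minimal sketch of that step, assuming the fine-tuned model and tokenizer are already loaded in memory and the output directory matches the repository layout below:

# Cast every weight to FP16 and save next to the tokenizer files
model = model.half()
model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")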


Repository Structure

.
├── quantized-model/               # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                      # Model documentation

Limitations

  • The model might not generalize well on code-mixed or low-resource languages.
  • FP16 quantization may result in slight numerical instability in edge cases.
  • Smaller language subsets may be underrepresented.

Contributing

Feel free to open issues or submit pull requests to improve the model or documentation.
