
DistilBERT Multilingual Model for Language Identification

This repository contains a version of the distilbert-base-multilingual-cased transformer model fine-tuned for language identification on the papluca/language-identification dataset from Hugging Face.

Model Details

  • Model Architecture: DistilBERT (Base, Multilingual, Cased)
  • Task: Language Identification
  • Dataset: papluca/language-identification
  • Quantization: Float16 (FP16)
  • Model Size: ~135M parameters
  • Fine-tuning Framework: Hugging Face Transformers

Installation

pip install transformers datasets scikit-learn torch

Loading the Model

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load tokenizer and model from the fine-tuned, quantized checkpoint
# (the quantized-model/ directory in this repository; see Repository Structure)
model_path = "quantized-model"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Define test sentences
sample_texts = [
    "This is an English sentence.",
    "C'est une phrase en français.",
    "यह एक हिंदी वाक्य है।"
]


# Tokenize a batch of texts and predict their language codes
def predict_language(texts, model, tokenizer, label_encoder):
    # `label_encoder`: scikit-learn LabelEncoder fitted on the training labels,
    # used to map predicted class indices back to ISO 639-1 codes
    if isinstance(texts, str):
        texts = [texts]

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)
    model.eval()

    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model(**inputs)
        preds = torch.argmax(outputs.logits, dim=1).cpu().numpy()  # move to CPU first

    predicted_languages = label_encoder.inverse_transform(preds)
    return predicted_languages
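
The label_encoder argument is assumed to be the scikit-learn LabelEncoder fitted on the training labels during fine-tuning. A minimal sketch of rebuilding it from the dataset and calling the helper on the sample sentences (the labels column name follows the dataset card):

from datasets import load_dataset
from sklearn.preprocessing import LabelEncoder

# Refit the encoder on the dataset's language codes (assumed to reproduce the
# ordering used during fine-tuning, since LabelEncoder sorts labels)
dataset = load_dataset("papluca/language-identification")
label_encoder = LabelEncoder()
label_encoder.fit(dataset["train"]["labels"])

predictions = predict_language(sample_texts, model, tokenizer, label_encoder)
print(predictions)  # e.g. ['en' 'fr' 'hi'] for the sample sentences above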

Performance Metrics

The fine-tuned model achieves the following evaluation scores (a sketch for reproducing them with scikit-learn follows the list):

  • Accuracy: 0.993300
  • Precision: 0.993337
  • Recall: 0.993648
  • F1 Score: 0.993300
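
A hedged scikit-learn sketch for reproducing these numbers on the dataset's test split (the weighted averaging and the compute_metrics name are assumptions, not taken from the original evaluation script):

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(y_true, y_pred):
    # y_true / y_pred are integer class indices for the test split
    accuracy = accuracy_score(y_true, y_pred)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted"  # assumed averaging strategy
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}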

Fine-Tuning Details

Dataset

The model is trained on the papluca/language-identification dataset from Hugging Face, which contains text samples labeled with ISO 639-1 language codes. It covers 20 languages with a balanced class distribution across the training, validation, and test splits.
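
A quick way to inspect the dataset with the datasets library (the split and column layout in the comments follows the dataset card):

from datasets import load_dataset

# Download the train / validation / test splits
dataset = load_dataset("papluca/language-identification")
print(dataset)              # split sizes and column names
print(dataset["train"][0])  # a single example: raw text plus its ISO 639-1 label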

Training

Fine-tuning used the following hyperparameters (a sketch mapping them onto the Trainer API follows the list):

  • Epochs: 2
  • Batch size: 32
  • Learning rate: 2e-5
  • Evaluation strategy: epoch
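
A minimal sketch of a training script consistent with these settings, using the Hugging Face Trainer (names such as output_dir, the max_length of 128, and the preprocessing details are illustrative assumptions, not the original script):

from datasets import load_dataset
from sklearn.preprocessing import LabelEncoder
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("papluca/language-identification")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-multilingual-cased")

# Encode the ISO 639-1 language codes as integer class indices
label_encoder = LabelEncoder()
label_encoder.fit(dataset["train"]["labels"])

def preprocess(batch):
    encoded = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)
    encoded["label"] = label_encoder.transform(batch["labels"])
    return encoded

tokenized = dataset.map(preprocess, batched=True, remove_columns=["labels", "text"])

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-multilingual-cased",
    num_labels=len(label_encoder.classes_),
)

training_args = TrainingArguments(
    output_dir="./results",             # illustrative output path
    num_train_epochs=2,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    evaluation_strategy="epoch",        # "eval_strategy" in newer transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
)
trainer.train()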

Quantization

Post-training quantization was applied by casting the model to half precision (FP16) with PyTorch's half() method, reducing model size and inference time.
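
A minimal sketch of that step, assuming the fine-tuned model and tokenizer are already loaded in memory and the output directory matches the repository layout below:

# Cast every weight to FP16 and save next to the tokenizer files
model = model.half()
model.save_pretrained("quantized-model")
tokenizer.save_pretrained("quantized-model")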


Repository Structure

.
├── quantized-model/               # Contains the quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   ├── vocab.txt
│   └── special_tokens_map.json
└── README.md                      # Model documentation

Limitations

  • The model might not generalize well on code-mixed or low-resource languages.
  • FP16 quantization may result in slight numerical instability in edge cases.
  • Smaller language subsets may be underrepresented.

Contributing

Feel free to open issues or submit pull requests to improve the model or documentation.
