
BERT-Based Language Classification Model

This repository contains a fine-tuned BERT-based model for classifying text by language. The model identifies the language of a given sentence and was fine-tuned with the Hugging Face Transformers library. It supports post-training dynamic quantization for faster CPU inference and a smaller memory footprint in deployment environments.


Model Details

  • Model Name: BERT Base for Language Classification
  • Model Architecture: BERT Base
  • Task: Language Identification
  • Dataset: Custom dataset of multilingual text samples
  • Quantization: Dynamic Quantization (INT8)
  • Fine-tuning Framework: Hugging Face Transformers
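
The training code itself is not shown in this README. The sketch below illustrates one plausible fine-tuning setup with the Trainer API; the data files (train.csv, test.csv), the "text"/"label" column names, num_labels=20, and the hyperparameters are assumptions for illustration, not values taken from this repository. It additionally requires the datasets package (pip install datasets).

# Illustrative fine-tuning sketch; file names, label count, and
# hyperparameters below are assumptions, not repository values
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical CSVs with a "text" column and an integer "label" column
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

# num_labels must match the number of languages in the label set
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=20)

args = TrainingArguments(output_dir="./results", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], eval_dataset=dataset["test"])
trainer.train()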

Usage

Installation

pip install transformers torch

Loading the Fine-tuned Model

from transformers import pipeline

# Load the fine-tuned model and tokenizer from the saved_model directory
classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")

# Example input
text = "Bonjour, comment allez-vous?"

# Get prediction
prediction = classifier(text)
print(f"Prediction: {prediction}")
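
The pipeline returns a list of dictionaries with label and score fields, for example [{'label': 'fr', 'score': 0.99}] (the score shown is illustrative). The label names come from the id2label mapping stored in the model's config.json.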

Saving and Testing the Model

Saving

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_checkpoint = "bert-base-uncased"  # replace with the path to your fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

# Save model and tokenizer
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")
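
If the fine-tuned config does not already carry readable label names, the pipeline will print generic labels such as LABEL_3. The snippet below shows how to store language codes in the config before saving; the label set used here is illustrative, not the one shipped with this model.

# Optional: map class indices to language codes so predictions are readable
# (the mapping below is an illustrative example)
model.config.id2label = {0: "en", 1: "fr", 2: "es", 3: "de"}
model.config.label2id = {v: k for k, v in model.config.id2label.items()}
model.save_pretrained("./saved_model")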

Testing

from transformers import pipeline

classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")
text = "Ceci est un exemple de texte."
print(classifier(text))
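
The pipeline also accepts a list of sentences, which is convenient for batch testing:

# Batch inference over several sentences at once
texts = ["This is English.", "Das ist Deutsch.", "C'est du français."]
print(classifier(texts))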

Quantization

Apply Dynamic Quantization

import os
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("./saved_model")

# Apply dynamic quantization: Linear layers get INT8 weights while
# activations stay in floating point
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# save_pretrained()/from_pretrained() does not round-trip a dynamically
# quantized model, so save the state dict with torch.save instead
os.makedirs("./quantized_model", exist_ok=True)
torch.save(quantized_model.state_dict(), "./quantized_model/pytorch_model.bin")
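
To verify the footprint reduction, you can compare on-disk sizes (a rough check; saved_model also contains tokenizer files, so treat the numbers as indicative):

# Rough on-disk size comparison, following the directory layout above
import os

def dir_size_mb(path):
    return sum(os.path.getsize(os.path.join(dp, f))
               for dp, _, files in os.walk(path) for f in files) / 1e6

print(f"FP32 model: {dir_size_mb('./saved_model'):.1f} MB")
print(f"INT8 model: {dir_size_mb('./quantized_model'):.1f} MB")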

Load and Test Quantized Model

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("./saved_model")

# Re-create the architecture, re-apply quantization, then load the
# quantized weights saved above
model = AutoModelForSequenceClassification.from_pretrained("./saved_model")
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
quantized_model.load_state_dict(torch.load("./quantized_model/pytorch_model.bin"))

classifier = pipeline("text-classification", model=quantized_model, tokenizer=tokenizer)
text = "Hola, ¿cómo estás?"
print(classifier(text))
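
A quick latency check, continuing from the block above (numbers are indicative and vary with hardware):

# Average per-sentence CPU latency over several runs
import time

def avg_ms(pipe, sentence, runs=20):
    pipe(sentence)  # warm-up
    start = time.perf_counter()
    for _ in range(runs):
        pipe(sentence)
    return (time.perf_counter() - start) / runs * 1000

print(f"Quantized: {avg_ms(classifier, text):.1f} ms per sentence")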

Repository Structure

.
├── saved_model/                    # Fine-tuned model and tokenizer
├── quantized_model/                # Dynamically quantized model
├── language-clasification.ipynb    # Training and evaluation notebook
└── README.md                       # Documentation

Limitations

  • Performance may vary for low-resource or underrepresented languages in the training data.
  • Dynamic quantization trades a small amount of accuracy for faster CPU inference and a smaller memory footprint.

Contributing

Feel free to submit issues or pull requests to enhance performance, accuracy, or add new language support.