# BERT-Based Language Classification Model
This repository contains a fine-tuned BERT model for classifying text by language. Given a sentence, the model identifies which language it is written in. It was trained with the Hugging Face Transformers library and supports post-training dynamic quantization for smaller, faster CPU inference in deployment environments.
---
## Model Details
- **Model Name:** BERT Base for Language Classification
- **Model Architecture:** BERT Base
- **Task:** Language Identification
- **Dataset:** Custom Dataset with multilingual text samples
- **Quantization:** Dynamic Quantization (INT8)
- **Fine-tuning Framework:** Hugging Face Transformers (see the training sketch below)
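For reference, here is a minimal fine-tuning sketch using the Trainer API. The label set, toy training data, and hyperparameters are illustrative assumptions, not the actual configuration used to train this model.

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Assumed label set; the real model's languages may differ
languages = ["en", "fr", "es"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(languages),
    id2label=dict(enumerate(languages)),
    label2id={lang: i for i, lang in enumerate(languages)},
)

class LanguageDataset(torch.utils.data.Dataset):
    """Wraps (text, label) pairs as tokenized examples for the Trainer."""
    def __init__(self, texts, labels):
        # padding=True pads every example to the longest one in the list
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Toy data for illustration only
train_dataset = LanguageDataset(
    ["Hello, how are you?", "Bonjour, comment allez-vous?", "Hola, ¿cómo estás?"],
    [0, 1, 2],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./results", num_train_epochs=3),
    train_dataset=train_dataset,
)
trainer.train()
```

In practice the training data would be a large multilingual corpus; the sketch only shows the moving parts (label mappings, tokenization, and the Trainer loop).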
---
## Usage
### Installation
```bash
pip install transformers torch
```
### Loading the Fine-tuned Model
```python
from transformers import pipeline
# Load the model and tokenizer from saved directory
classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")
# Example input
text = "Bonjour, comment allez-vous?"
# Get prediction
prediction = classifier(text)
print(f"Prediction: {prediction}")
```
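The pipeline returns a list of dictionaries with `label` and `score` keys, for example `[{'label': 'fr', 'score': 0.99}]`. The label names come from the `id2label` mapping in the saved model's config; if that mapping was not set during fine-tuning, generic names such as `LABEL_0` appear instead.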
---
## Saving and Testing the Model
### Saving
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_checkpoint = "bert-base-uncased" # or your fine-tuned model path
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
# Save model and tokenizer
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")
```
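`save_pretrained` writes the model weights and `config.json` to the target directory, and the tokenizer call adds the vocabulary and tokenizer configuration, so `./saved_model` can afterwards be passed anywhere a Hub model name is accepted.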
### Testing
```python
from transformers import pipeline
classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")
text = "Ceci est un exemple de texte."
print(classifier(text))
```
---
## Quantization
### Apply Dynamic Quantization
```python
import os
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("./saved_model")

# Apply dynamic quantization: nn.Linear weights become INT8, and
# activations are quantized on the fly at inference time
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# from_pretrained() cannot restore quantized Linear layers into a float
# architecture, so serialize the state dict with torch.save instead of
# save_pretrained()
os.makedirs("./quantized_model", exist_ok=True)
torch.save(quantized_model.state_dict(), "./quantized_model/pytorch_model.bin")
```
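Dynamic quantization targets `nn.Linear` because BERT's parameter count and compute are dominated by the linear projections in its attention and feed-forward blocks; embeddings and LayerNorm layers remain in float32. Dynamically quantized models run on CPU only.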
### Load and Test Quantized Model
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("./saved_model")
# Rebuild the float model, re-apply quantization, then load the INT8 weights
model = AutoModelForSequenceClassification.from_pretrained("./saved_model")
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
quantized_model.load_state_dict(torch.load("./quantized_model/pytorch_model.bin"))
classifier = pipeline("text-classification", model=quantized_model, tokenizer=tokenizer)
text = "Hola, ¿cómo estás?"
print(classifier(text))
```
---
## Repository Structure
```
.
├── saved_model/                  # Fine-tuned model and tokenizer
├── quantized_model/              # Quantized model weights
├── language-clasification.ipynb
└── README.md                     # Documentation
```
---
## Limitations
- Performance may degrade for low-resource languages or languages underrepresented in the training dataset.
- Quantization may slightly reduce accuracy in exchange for a smaller model and faster CPU inference; a quick way to measure the size side of that trade-off is sketched below.
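A minimal size-comparison sketch, assuming the `./saved_model` directory produced above (the temporary file name is arbitrary):

```python
import os
import torch
from transformers import AutoModelForSequenceClassification

def size_mb(model):
    """Serialize the state dict to a temporary file and report its size in MB."""
    torch.save(model.state_dict(), "tmp_weights.pt")
    size = os.path.getsize("tmp_weights.pt") / 1e6
    os.remove("tmp_weights.pt")
    return size

model = AutoModelForSequenceClassification.from_pretrained("./saved_model")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(f"float32: {size_mb(model):.1f} MB | int8: {size_mb(quantized):.1f} MB")
```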
---
## Contributing
Feel free to submit issues or pull requests to improve performance or accuracy, or to add support for new languages.
---