# BERT-Based Language Classification Model

This repository contains a fine-tuned BERT-based model for language identification: given a sentence, it predicts which language the text is written in. The model was fine-tuned with the Hugging Face Transformers library and supports post-training dynamic quantization for more efficient inference in deployment environments.

---

## Model Details

- **Model Name:** BERT Base for Language Classification
- **Model Architecture:** BERT Base
- **Task:** Language Identification
- **Dataset:** Custom dataset with multilingual text samples
- **Quantization:** Dynamic Quantization (INT8)
- **Fine-tuning Framework:** Hugging Face Transformers (a rough fine-tuning sketch follows below)
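
The exact fine-tuning procedure is not documented in this README. As a rough illustration only, a language classifier like this can be fine-tuned with the Transformers `Trainer` API along the lines below; the label set, dataset objects, and hyperparameters are placeholders, not the configuration actually used for this model.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder label set -- replace with the languages in your own dataset
id2label = {0: "english", 1: "french", 2: "spanish"}
label2id = {name: idx for idx, name in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

# `train_dataset` / `eval_dataset` are placeholders for your own tokenized
# datasets (e.g. datasets.Dataset objects with "input_ids" and "label" columns).
trainer = Trainer(
    model=model,
    args=training_args,
    # train_dataset=train_dataset,
    # eval_dataset=eval_dataset,
)
# trainer.train()
# trainer.save_model("./saved_model")
```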

---

## Usage

### Installation

```bash
pip install transformers torch
```

### Loading the Fine-tuned Model

```python
from transformers import pipeline

# Load the model and tokenizer from the saved directory
classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")

# Example input
text = "Bonjour, comment allez-vous?"

# Get prediction
prediction = classifier(text)
print(f"Prediction: {prediction}")
```
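
The pipeline also accepts a list of sentences, which is convenient when classifying a batch at once. The sentences below are only illustrative inputs, reusing the `classifier` created above; each prediction is a dict with a `label` and a `score`.

```python
texts = [
    "Bonjour, comment allez-vous?",
    "This is an English sentence.",
    "Hola, ¿cómo estás?",
]

# The pipeline returns one {"label": ..., "score": ...} dict per input sentence
for sentence, result in zip(texts, classifier(texts)):
    print(f"{sentence!r} -> {result['label']} ({result['score']:.3f})")
```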

---

## Saving and Testing the Model

### Saving

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_checkpoint = "bert-base-uncased"  # replace with the path to your fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

# Save the model and tokenizer to a local directory
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")
```
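
If the checkpoint being saved still carries generic class names (`LABEL_0`, `LABEL_1`, ...), it can help to attach human-readable language names to the config before saving, so that pipeline outputs are self-explanatory. The mapping below is only an example; use the labels and ordering from your own training run.

```python
# Example mapping -- replace with the languages/ordering used during fine-tuning
id2label = {0: "english", 1: "french", 2: "spanish"}

model.config.id2label = id2label
model.config.label2id = {name: idx for idx, name in id2label.items()}

# Re-save so the label names end up in config.json
model.save_pretrained("./saved_model")
```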

### Testing

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")
text = "Ceci est un exemple de texte."  # French: "This is an example text."
print(classifier(text))
```

---

## Quantization

### Apply Dynamic Quantization

```python
import os

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("./saved_model")

# Apply dynamic quantization: weights of all Linear layers are stored as INT8
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Quantized modules do not round-trip through save_pretrained/from_pretrained,
# so save the quantized state dict with torch.save instead
os.makedirs("./quantized_model", exist_ok=True)
torch.save(quantized_model.state_dict(), "./quantized_model/model_state_dict.pt")
```
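
To sanity-check the benefit, you can compare the serialized size of the FP32 and INT8 weights. This is a rough, illustrative check that reuses `model` and `quantized_model` from the snippet above; the file names are arbitrary.

```python
import os

import torch

# Serialize both state dicts the same way so the comparison is fair
torch.save(model.state_dict(), "./fp32_state_dict.pt")
torch.save(quantized_model.state_dict(), "./int8_state_dict.pt")

fp32_mb = os.path.getsize("./fp32_state_dict.pt") / 1e6
int8_mb = os.path.getsize("./int8_state_dict.pt") / 1e6
print(f"FP32: {fp32_mb:.1f} MB, dynamic INT8: {int8_mb:.1f} MB")
```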

### Load and Test Quantized Model

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("./saved_model")

# Rebuild the architecture, re-apply dynamic quantization, then load the saved INT8 weights
model = AutoModelForSequenceClassification.from_pretrained("./saved_model")
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
quantized_model.load_state_dict(torch.load("./quantized_model/model_state_dict.pt"))

classifier = pipeline("text-classification", model=quantized_model, tokenizer=tokenizer)
text = "Hola, ¿cómo estás?"
print(classifier(text))
```
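
For a quick CPU latency comparison between the original and quantized models, a simple timing loop like the one below is enough. Absolute numbers depend on hardware, sequence length, and batch size, so treat it as a rough illustration; it reuses `classifier` and `text` from the snippet above.

```python
import time

from transformers import pipeline

def average_latency(clf, sentence, runs=20):
    clf(sentence)  # warm-up run
    start = time.perf_counter()
    for _ in range(runs):
        clf(sentence)
    return (time.perf_counter() - start) / runs

fp32_classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")
print(f"FP32 latency: {average_latency(fp32_classifier, text) * 1000:.1f} ms/sentence")
print(f"INT8 latency: {average_latency(classifier, text) * 1000:.1f} ms/sentence")
```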

---

## Repository Structure

```
.
├── saved_model/                  # Fine-tuned model and tokenizer
├── quantized_model/              # Quantized model weights
├── language-clasification.ipynb  # Language classification notebook
└── README.md                     # Documentation
```

---

## Limitations

- The model's performance may vary for low-resource or underrepresented languages in the training dataset.
- Quantization may slightly reduce accuracy but improves inference efficiency.

---

## Contributing

Feel free to submit issues or pull requests to improve performance or accuracy, or to add support for new languages.

---