# BERT-Based Language Classification Model
This repository contains a fine-tuned BERT-based model for classifying text into different languages. The model is designed to identify the language of a given sentence and has been trained using the Hugging Face Transformers library. It supports post-training dynamic quantization for optimized performance in deployment environments.
---
## Model Details
- **Model Name:** BERT Base for Language Classification
- **Model Architecture:** BERT Base
- **Task:** Language Identification
- **Dataset:** Custom Dataset with multilingual text samples
- **Quantization:** Dynamic Quantization (INT8)
- **Fine-tuning Framework:** Hugging Face Transformers
---
## Usage
### Installation
```bash
pip install transformers torch
```
### Loading the Fine-tuned Model
```python
from transformers import pipeline
# Load the model and tokenizer from the saved directory
classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")
# Example input
text = "Bonjour, comment allez-vous?"
# Get prediction
prediction = classifier(text)
print(f"Prediction: {prediction}")
```
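The pipeline returns a list with one dictionary per input, holding the predicted label and a confidence score. The label string below is illustrative; the actual names come from the `id2label` mapping stored in the model config:
```
[{'label': 'french', 'score': 0.98}]
```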
---
## Saving and Testing the Model
### Saving
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Point this at your fine-tuned checkpoint (e.g., the Trainer output directory);
# saving plain "bert-base-uncased" would store an untrained classification head
model_checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

# Save model and tokenizer for later inference
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")
```
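If the saved config still reports generic `LABEL_0`-style names, you can attach human-readable language names to the config before saving. A minimal sketch, assuming a hypothetical three-language label set (replace it with the languages your model was actually trained on, in training-label order):
```python
from transformers import AutoConfig, AutoModelForSequenceClassification

# Hypothetical label mapping; must match the label indices used during training
id2label = {0: "english", 1: "french", 2: "spanish"}
label2id = {v: k for k, v in id2label.items()}

config = AutoConfig.from_pretrained("./saved_model", id2label=id2label, label2id=label2id)
model = AutoModelForSequenceClassification.from_pretrained("./saved_model", config=config)
model.save_pretrained("./saved_model")  # pipeline predictions now show language names
```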
### Testing
```python
from transformers import pipeline
classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")
text = "Ceci est un exemple de texte."
print(classifier(text))
```
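Continuing from the snippet above, the pipeline also accepts a batch of sentences and returns one prediction per input:
```python
texts = ["This is an English sentence.", "Das ist ein deutscher Satz."]
print(classifier(texts))  # one {'label': ..., 'score': ...} dict per sentence
```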
---
## Quantization
### Apply Dynamic Quantization
```python
import os
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("./saved_model")

# Apply dynamic quantization: Linear weights are stored as INT8,
# activations are quantized on the fly at inference time
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# save_pretrained cannot round-trip the packed INT8 modules,
# so save the quantized state dict directly
os.makedirs("./quantized_model", exist_ok=True)
torch.save(quantized_model.state_dict(), "./quantized_model/pytorch_model.bin")
```
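Dynamic quantization converts only the `Linear` weights to INT8 and quantizes activations on the fly, so it needs no calibration data; it targets CPU inference and leaves embeddings and layer norms in FP32.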
### Load and Test Quantized Model
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("./saved_model")
# Rebuild the float architecture, re-apply quantization, then load the INT8 weights
model = AutoModelForSequenceClassification.from_pretrained("./saved_model")
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
quantized_model.load_state_dict(torch.load("./quantized_model/pytorch_model.bin"))
classifier = pipeline("text-classification", model=quantized_model, tokenizer=tokenizer)
text = "Hola, ¿cómo estás?"
print(classifier(text))
```
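To check what quantization saves on disk, you can compare the two directories. A minimal sketch, assuming the file layout used above (the reduction is well below the theoretical 4x because embeddings stay in FP32):
```python
import os

# Sum the size of all files under a saved-model directory, in MB
def dir_size_mb(path):
    return sum(
        os.path.getsize(os.path.join(root, f))
        for root, _, files in os.walk(path)
        for f in files
    ) / 1e6

print(f"original:  {dir_size_mb('./saved_model'):.1f} MB")
print(f"quantized: {dir_size_mb('./quantized_model'):.1f} MB")
```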
---
## Repository Structure
```
.
├── saved_model/                  # Fine-tuned model
├── quantized_model/              # Quantized model
├── language-clasification.ipynb  # Training and evaluation notebook
└── README.md                     # Documentation
```
---
## Limitations
- Performance may vary for languages that are low-resource or underrepresented in the training data.
- Dynamic quantization can cost a small amount of accuracy in exchange for a smaller model and faster CPU inference.
---
## Contributing
Feel free to submit issues or pull requests to enhance performance, accuracy, or add new language support.
---