# BERT-Based Language Classification Model
This repository contains a fine-tuned BERT model for classifying text by language. Given a sentence, the model identifies which language it is written in. It was trained with the Hugging Face Transformers library and supports post-training dynamic quantization for smaller, faster CPU inference in deployment environments.
---
## Model Details
- **Model Name:** BERT Base for Language Classification
- **Model Architecture:** BERT Base
- **Task:** Language Identification
- **Dataset:** Custom Dataset with multilingual text samples
- **Quantization:** Dynamic Quantization (INT8)
- **Fine-tuning Framework:** Hugging Face Transformers (see the training sketch below)
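For reference, here is a minimal fine-tuning sketch using the Trainer API. The label set, toy training data, and hyperparameters are illustrative assumptions, not the actual configuration used to train this model.

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Assumed label set; the real model's languages may differ
languages = ["en", "fr", "es"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(languages),
    id2label=dict(enumerate(languages)),
    label2id={lang: i for i, lang in enumerate(languages)},
)

class LanguageDataset(torch.utils.data.Dataset):
    """Wraps (text, label) pairs as tokenized examples for the Trainer."""
    def __init__(self, texts, labels):
        # padding=True pads every example to the longest one in the list
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Toy data for illustration only
train_dataset = LanguageDataset(
    ["Hello, how are you?", "Bonjour, comment allez-vous?", "Hola, ¿cómo estás?"],
    [0, 1, 2],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./results", num_train_epochs=3),
    train_dataset=train_dataset,
)
trainer.train()
```

In practice the training data would be a large multilingual corpus; the sketch only shows the moving parts (label mappings, tokenization, and the Trainer loop).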
---
## Usage
### Installation
```bash
pip install transformers torch
```
### Loading the Fine-tuned Model
```python
from transformers import pipeline
# Load the model and tokenizer from saved directory
classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")
# Example input
text = "Bonjour, comment allez-vous?"
# Get prediction
prediction = classifier(text)
print(f"Prediction: {prediction}")
```
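The pipeline returns a list of dictionaries with `label` and `score` keys, for example `[{'label': 'fr', 'score': 0.99}]`. The label names come from the `id2label` mapping in the saved model's config; if that mapping was not set during fine-tuning, generic names such as `LABEL_0` appear instead.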
---
## Saving and Testing the Model
### Saving
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_checkpoint = "bert-base-uncased" # or your fine-tuned model path
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
# Save model and tokenizer
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")
```
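`save_pretrained` writes the model weights and `config.json` to the target directory, and the tokenizer call adds the vocabulary and tokenizer configuration, so `./saved_model` can afterwards be passed anywhere a Hub model name is accepted.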
### Testing
```python
from transformers import pipeline
classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")
text = "Ceci est un exemple de texte."
print(classifier(text))
```
---
## Quantization
### Apply Dynamic Quantization
```python
import os
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("./saved_model")

# Apply dynamic quantization: nn.Linear weights become INT8, and
# activations are quantized on the fly at inference time
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# from_pretrained() cannot restore quantized Linear layers into a float
# architecture, so serialize the state dict with torch.save instead of
# save_pretrained()
os.makedirs("./quantized_model", exist_ok=True)
torch.save(quantized_model.state_dict(), "./quantized_model/pytorch_model.bin")
```
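Dynamic quantization targets `nn.Linear` because BERT's parameter count and compute are dominated by the linear projections in its attention and feed-forward blocks; embeddings and LayerNorm layers remain in float32. Dynamically quantized models run on CPU only.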
### Load and Test Quantized Model
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("./saved_model")
# Rebuild the float model, re-apply quantization, then load the INT8 weights
model = AutoModelForSequenceClassification.from_pretrained("./saved_model")
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
quantized_model.load_state_dict(torch.load("./quantized_model/pytorch_model.bin"))
classifier = pipeline("text-classification", model=quantized_model, tokenizer=tokenizer)
text = "Hola, ¿cómo estás?"
print(classifier(text))
```
---
## Repository Structure
```
.
├── saved_model/                  # Fine-tuned model and tokenizer
├── quantized_model/              # Quantized model weights
├── language-clasification.ipynb
└── README.md                     # Documentation
```
---
## Limitations
- Performance may degrade for low-resource languages or languages underrepresented in the training dataset.
- Quantization may slightly reduce accuracy in exchange for a smaller model and faster CPU inference; a quick way to measure the size side of that trade-off is sketched below.
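A minimal size-comparison sketch, assuming the `./saved_model` directory produced above (the temporary file name is arbitrary):

```python
import os
import torch
from transformers import AutoModelForSequenceClassification

def size_mb(model):
    """Serialize the state dict to a temporary file and report its size in MB."""
    torch.save(model.state_dict(), "tmp_weights.pt")
    size = os.path.getsize("tmp_weights.pt") / 1e6
    os.remove("tmp_weights.pt")
    return size

model = AutoModelForSequenceClassification.from_pretrained("./saved_model")
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
print(f"float32: {size_mb(model):.1f} MB | int8: {size_mb(quantized):.1f} MB")
```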
---
## Contributing
Feel free to submit issues or pull requests to improve performance or accuracy, or to add support for new languages.
---