# BERT-Based Language Classification Model

This repository contains a fine-tuned BERT-based model for classifying text by language. The model identifies the language of a given sentence and was trained with the Hugging Face Transformers library. It supports post-training dynamic quantization for a smaller footprint and faster CPU inference in deployment environments.

---

## Model Details

- **Model Name:** BERT Base for Language Classification  
- **Model Architecture:** BERT Base  
- **Task:** Language Identification  
- **Dataset:** Custom Dataset with multilingual text samples  
- **Quantization:** Dynamic Quantization (INT8)  
- **Fine-tuning Framework:** Hugging Face Transformers (see the fine-tuning sketch below)  
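
For reference, a minimal fine-tuning sketch is shown below. It is illustrative only: the label set, CSV paths, and hyperparameters are placeholders rather than the settings used for the released model (the actual training workflow presumably lives in `language-clasification.ipynb`), and it additionally requires the `datasets` package (`pip install datasets`).

```python
# Illustrative fine-tuning sketch; labels, file paths, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

labels = ["en", "fr", "es", "de"]  # hypothetical label set

# Expects CSV files with a "text" column and an integer "label" column
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={name: i for i, name in enumerate(labels)},
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
# After training, save with model.save_pretrained("./saved_model") as shown in the Saving section below.
```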

---

## Usage

### Installation

```bash
pip install transformers torch
```

### Loading the Fine-tuned Model

```python
from transformers import pipeline

# Load the model and tokenizer from saved directory
classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")

# Example input
text = "Bonjour, comment allez-vous?"

# Get prediction
prediction = classifier(text)
print(f"Prediction: {prediction}")
```
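
The pipeline returns a list of dictionaries, each with a `label` and a `score`. If the saved config does not define human-readable label names, predictions appear as generic `LABEL_0`, `LABEL_1`, and so on; the mapping can be inspected via the model config (a small sketch, assuming the model was saved to `./saved_model`):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("./saved_model")
# id2label maps class indices to language names, e.g. {0: "en", 1: "fr", ...},
# if it was set during fine-tuning; otherwise it falls back to "LABEL_<i>".
print(config.id2label)
```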

---

## Saving and Testing the Model

### Saving

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Point this at the fine-tuned checkpoint; loading the raw "bert-base-uncased"
# checkpoint would attach a freshly initialized (untrained) classification head.
model_checkpoint = "bert-base-uncased"  # or your fine-tuned model path
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

# Save model and tokenizer
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")
```

### Testing

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")
text = "Ceci est un exemple de texte."
print(classifier(text))
```
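
The pipeline also accepts a list of sentences, which is convenient for spot-checking several languages at once (the example inputs below are arbitrary):

```python
texts = [
    "This is an English sentence.",
    "Das ist ein deutscher Satz.",
    "Questa è una frase italiana.",
]
# One {"label", "score"} dict is returned per input sentence
for sentence, result in zip(texts, classifier(texts)):
    print(f"{sentence} -> {result['label']} ({result['score']:.3f})")
```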

---

## Quantization

### Apply Dynamic Quantization

```python
import os

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("./saved_model")

# Apply dynamic quantization: replace every nn.Linear with an INT8 dynamically
# quantized version (weights stored as int8, activations quantized on the fly)
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Save the quantized weights. save_pretrained()/from_pretrained() do not
# round-trip dynamically quantized modules, so save the state dict directly.
os.makedirs("./quantized_model", exist_ok=True)
torch.save(quantized_model.state_dict(), "./quantized_model/pytorch_model.bin")
```
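
Dynamic quantization stores the Linear weights as INT8, so the saved state dict should be noticeably smaller than the FP32 checkpoint. A quick way to compare the on-disk sizes (using the paths from above; note that `./saved_model` also contains tokenizer files):

```python
import os

def dir_size_mb(path):
    # Total size of all files under `path`, in megabytes
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _, files in os.walk(path)
        for name in files
    ) / (1024 * 1024)

print(f"FP32 model: {dir_size_mb('./saved_model'):.1f} MB")
print(f"INT8 model: {dir_size_mb('./quantized_model'):.1f} MB")
```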

### Load and Test Quantized Model

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("./saved_model")

# Rebuild the quantized architecture from the FP32 checkpoint, then load the
# saved INT8 state dict into it.
model = AutoModelForSequenceClassification.from_pretrained("./saved_model")
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
quantized_model.load_state_dict(torch.load("./quantized_model/pytorch_model.bin"))

classifier = pipeline("text-classification", model=quantized_model, tokenizer=tokenizer)
text = "Hola, ¿cómo estás?"
print(classifier(text))
```
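
To verify the efficiency gain, the quantized pipeline can be timed against the FP32 one with a simple wall-clock measurement (CPU, single sentence, reusing `classifier` and `text` from the snippet above; numbers will vary by machine):

```python
import time

def avg_latency_ms(clf, sentence, n_runs=50):
    clf(sentence)  # warm-up run
    start = time.perf_counter()
    for _ in range(n_runs):
        clf(sentence)
    return (time.perf_counter() - start) / n_runs * 1000

fp32_classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")
print(f"FP32 avg latency: {avg_latency_ms(fp32_classifier, text):.1f} ms")
print(f"INT8 avg latency: {avg_latency_ms(classifier, text):.1f} ms")
```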

---

## Repository Structure

```
.
├── saved_model/                    # Fine-tuned model
├── quantized_model/                # Quantized model
├── language-clasification.ipynb
└── README.md                       # Documentation
```

---

## Limitations

- Performance may degrade for languages that are low-resource or underrepresented in the training dataset.  
- Dynamic quantization may slightly reduce accuracy, but it shrinks the model and speeds up CPU inference.  

---

## Contributing

Feel free to submit issues or pull requests to enhance performance, accuracy, or add new language support.

---