# Language Detection Model

A **BERT-based** language detection model trained on [hac541309/open-lid-dataset](https://huggingface.co/datasets/hac541309/open-lid-dataset), which includes **121 million sentences across 200 languages**. This model is optimized for **fast and accurate** language identification in text classification tasks.

## Model Details

- **Architecture**: [BertForSequenceClassification](https://huggingface.co/transformers/model_doc/bert.html)
- **Hidden Size**: 384
- **Number of Layers**: 4
- **Attention Heads**: 6
- **Max Sequence Length**: 512
- **Dropout**: 0.1
- **Vocabulary Size**: 50,257
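
These values correspond to a compact BERT configuration. Below is a minimal sketch of the equivalent `BertConfig`; the 200-label classification head is an assumption based on the dataset's 200 languages, not a value stated in this card.

```python
from transformers import BertConfig, BertForSequenceClassification

# Compact BERT configuration mirroring the values listed above.
# num_labels=200 is an assumption based on the dataset's 200 languages.
config = BertConfig(
    vocab_size=50_257,
    hidden_size=384,
    num_hidden_layers=4,
    num_attention_heads=6,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    num_labels=200,
)
model = BertForSequenceClassification(config)
print(f"Parameters: {model.num_parameters():,}")
```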

## Training Process

- **Dataset**:
  - Used the [open-lid-dataset](https://huggingface.co/datasets/hac541309/open-lid-dataset)
  - Split into train (90%) and test (10%)
- **Tokenizer**: A custom `BertTokenizerFast` with special tokens for `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`
- **Hyperparameters**:
  - Learning Rate: 2e-5
  - Batch Size: 256 (training) / 512 (testing)
  - Epochs: 1
  - Scheduler: Cosine
- **Trainer**: Leveraged the Hugging Face [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer) with Weights & Biases for logging
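
A minimal sketch of this training setup is shown below, reusing the `model` from the configuration sketch above. The output directory, dataset column names, split mechanics, and label encoding are illustrative assumptions, not the exact training script.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, Trainer, TrainingArguments

# Sketch only: the real run used a custom-trained BertTokenizerFast; loading the
# published tokenizer is a stand-in. The "text" column name, the 90/10 split via
# train_test_split, and the label-to-id mapping (omitted here) are assumptions.
tokenizer = AutoTokenizer.from_pretrained("alexneakameni/language_detection")
dataset = load_dataset("hac541309/open-lid-dataset", split="train").train_test_split(test_size=0.1)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

training_args = TrainingArguments(
    output_dir="language_detection",   # illustrative output directory
    learning_rate=2e-5,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=512,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    report_to="wandb",                 # Weights & Biases logging
)

trainer = Trainer(
    model=model,                       # e.g. the BertForSequenceClassification sketched above
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,
)
trainer.train()
```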

## Evaluation

The model was evaluated on the test split. Below are the overall metrics:

- **Accuracy**: 0.969466
- **Precision**: 0.969586
- **Recall**: 0.969466
- **F1 Score**: 0.969417
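
The overall precision, recall, and F1 appear to be support-weighted averages (the weighted recall coincides with the accuracy). A minimal sketch of a `compute_metrics` function that produces metrics in this style with scikit-learn, offered as an illustration rather than the exact evaluation code:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative compute_metrics for the Trainer; a weighted average over the
# (highly imbalanced) language classes yields metrics in the style reported above.
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```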

Detailed evaluation by script (Size is the number of supported languages written in that script):

| Script | Support | Precision | Recall | F1 Score | Size |
|--------|---------|-----------|--------|----------|------|
| Arab | 819219 | 0.9038 | 0.9014 | 0.9023 | 21 |
| Latn | 7924704 | 0.9678 | 0.9663 | 0.9670 | 125 |
| Ethi | 144403 | 0.9967 | 0.9964 | 0.9966 | 2 |
| Beng | 163983 | 0.9949 | 0.9935 | 0.9942 | 3 |
| Deva | 423895 | 0.9495 | 0.9326 | 0.9405 | 10 |
| Cyrl | 831949 | 0.9899 | 0.9883 | 0.9891 | 12 |
| Tibt | 35683 | 0.9925 | 0.9930 | 0.9927 | 2 |
| Grek | 131155 | 0.9984 | 0.9990 | 0.9987 | 1 |
| Gujr | 86912 | 0.99999 | 0.9999 | 0.99995 | 1 |
| Hebr | 100530 | 0.9966 | 0.9995 | 0.9981 | 2 |
| Armn | 67203 | 0.9999 | 0.9998 | 0.9998 | 1 |
| Jpan | 88004 | 0.9983 | 0.9987 | 0.9985 | 1 |
| Knda | 67170 | 0.9999 | 0.9998 | 0.9999 | 1 |
| Geor | 70769 | 0.99997 | 0.9998 | 0.9999 | 1 |
| Khmr | 39708 | 1.0000 | 0.9997 | 0.9999 | 1 |
| Hang | 108509 | 0.9997 | 0.9999 | 0.9998 | 1 |
| Laoo | 29389 | 0.9999 | 0.9999 | 0.9999 | 1 |
| Mlym | 68418 | 0.99996 | 0.9999 | 0.9999 | 1 |
| Mymr | 100857 | 0.9999 | 0.9992 | 0.9995 | 2 |
| Orya | 44976 | 0.9995 | 0.9998 | 0.9996 | 1 |
| Guru | 67106 | 0.99999 | 0.9999 | 0.9999 | 1 |
| Olck | 22279 | 1.0000 | 0.9991 | 0.9995 | 1 |
| Sinh | 67492 | 1.0000 | 0.9998 | 0.9999 | 1 |
| Taml | 76373 | 0.99997 | 0.9999 | 0.9999 | 1 |
| Tfng | 41325 | 0.8512 | 0.8246 | 0.8247 | 2 |
| Telu | 62387 | 0.99997 | 0.9999 | 0.9999 | 1 |
| Thai | 83820 | 0.99995 | 0.9998 | 0.9999 | 1 |
| Hant | 152723 | 0.9945 | 0.9954 | 0.9949 | 2 |
| Hans | 92689 | 0.9893 | 0.9870 | 0.9882 | 1 |

A detailed per-script classification report is also provided in the repository for further analysis.

---

### How to Use

You can quickly load and run inference with this model using the [Transformers pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

tokenizer = AutoTokenizer.from_pretrained("alexneakameni/language_detection")
model = AutoModelForSequenceClassification.from_pretrained("alexneakameni/language_detection")

language_detection = pipeline("text-classification", model=model, tokenizer=tokenizer)

text = "Hello world!"
predictions = language_detection(text)
print(predictions)
```

This will output the predicted language code or label with the corresponding confidence score.
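
If you need a ranked list of candidate languages rather than only the best label, the pipeline's standard `top_k` argument can be passed at call time (reusing the `language_detection` pipeline created above; the label strings come from the model's `id2label` mapping):

```python
# Ranked candidates instead of only the top prediction.
predictions = language_detection("Bonjour tout le monde !", top_k=5)
for pred in predictions:
    print(pred["label"], round(pred["score"], 4))
```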

---

**Note**: The model’s performance may vary with text length, language variety, and domain-specific vocabulary. Always validate results on your own data before relying on them in critical applications.

For more information, see the [repository documentation](https://github.com/KameniAlexNea/learning_language).

Thank you for using this model. Feedback and contributions are welcome!
|