alexneakameni committed
Commit 61b1af3 · verified · 1 Parent(s): 93dd06c

Update README.md

Files changed (1):
  1. README.md +99 -27
README.md CHANGED
@@ -12,30 +12,102 @@ pipeline_tag: text-classification
 
 # Language Detection Model
 
- This project trains a **BERT-based language detection model** on the **Hugging Face `hac541309/open-lid-dataset`**, which contains **121 million sentences across 200 languages**. The trained model is designed for **fast and accurate language identification** in text classification tasks.
-
- ## 📌 Model Details
- - **Architecture**: `BertForSequenceClassification`
- - **Hidden Size**: `384`
- - **Layers**: `4`
- - **Attention Heads**: `6`
- - **Max Sequence Length**: `512`
- - **Dropout**: `0.1`
- - **Vocabulary Size**: `50,257`
-
- ## 🚀 Training Process
- - **Dataset**: Preprocessed and split into **train (90%)** and **test (10%)** sets.
- - **Tokenizer**: Custom `PreTrainedTokenizerFast` for text tokenization.
- - **Evaluation Metrics**: Tracked using `compute_metrics` function.
- - **Hyperparameters**:
-   - Learning Rate: `2e-5`
-   - Batch Size: `256` (train) / `512` (test)
-   - Epochs: `1`
-   - Scheduler: `cosine`
- - **Trainer**: Utilizes `Hugging Face Trainer` API with `wandb` logging.
-
- ## 📊 Evaluation Results
- The model was evaluated on a **separate test set**, and the results are shared in this repository.
-
-
- https://wandb.ai/eak/lang_detection/reports/Language-detection--VmlldzoxMTMzNjc2NQ
+ A **BERT-based** language detection model trained on [hac541309/open-lid-dataset](https://huggingface.co/datasets/hac541309/open-lid-dataset), which includes **121 million sentences across 200 languages**. This model is optimized for **fast and accurate** language identification in text classification tasks.
+
+ ## Model Details
+
+ - **Architecture**: [BertForSequenceClassification](https://huggingface.co/transformers/model_doc/bert.html)
+ - **Hidden Size**: 384
+ - **Number of Layers**: 4
+ - **Attention Heads**: 6
+ - **Max Sequence Length**: 512
+ - **Dropout**: 0.1
+ - **Vocabulary Size**: 50,257
+
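+ The configuration listed above maps onto a standard `transformers` `BertConfig`. The snippet below is a minimal sketch, not the original training script: the intermediate feed-forward size is an assumption (4x the hidden size), and the label count is taken from the 200 languages in the dataset.
+
+ ```python
+ from transformers import BertConfig, BertForSequenceClassification
+
+ # Compact BERT configuration matching the numbers in this card.
+ # intermediate_size is assumed (4 * hidden_size); it is not stated above.
+ config = BertConfig(
+     vocab_size=50_257,
+     hidden_size=384,
+     num_hidden_layers=4,
+     num_attention_heads=6,
+     intermediate_size=1536,
+     max_position_embeddings=512,
+     hidden_dropout_prob=0.1,
+     attention_probs_dropout_prob=0.1,
+     num_labels=200,  # one class per supported language
+ )
+
+ model = BertForSequenceClassification(config)  # randomly initialized, for illustration only
+ ```
+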
+ ## Training Process
+
+ - **Dataset**:
+   - Used the [open-lid-dataset](https://huggingface.co/datasets/hac541309/open-lid-dataset)
+   - Split into train (90%) and test (10%)
+ - **Tokenizer**: A custom `BertTokenizerFast` with special tokens for `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`
+ - **Hyperparameters**:
+   - Learning Rate: 2e-5
+   - Batch Size: 256 (training) / 512 (testing)
+   - Epochs: 1
+   - Scheduler: Cosine
+ - **Trainer**: Leveraged the Hugging Face [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer) with Weights & Biases for logging
+
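+ The hyperparameters above map directly onto the Hugging Face `TrainingArguments`/`Trainer` API. The following is a hedged sketch rather than the original training script; the output directory, the per-device interpretation of the batch sizes, and the pre-tokenized dataset splits are illustrative assumptions.
+
+ ```python
+ from transformers import Trainer, TrainingArguments
+
+ # Illustrative setup; assumes `model`, `tokenizer`, and a tokenized `dataset`
+ # with "train" and "test" splits already exist.
+ training_args = TrainingArguments(
+     output_dir="language_detection",   # assumed output path
+     learning_rate=2e-5,
+     per_device_train_batch_size=256,   # card states 256, treated here as per-device
+     per_device_eval_batch_size=512,
+     num_train_epochs=1,
+     lr_scheduler_type="cosine",
+     report_to="wandb",                 # Weights & Biases logging
+ )
+
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=dataset["train"],
+     eval_dataset=dataset["test"],
+     tokenizer=tokenizer,
+ )
+ trainer.train()
+ ```
+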
+ ## Evaluation
+
+ The model was evaluated on the test split. Below are the overall metrics:
+
+ - **Accuracy**: 0.969466
+ - **Precision**: 0.969586
+ - **Recall**: 0.969466
+ - **F1 Score**: 0.969417
+
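+ Metrics like these are typically produced by a `compute_metrics` callback passed to the `Trainer`. A minimal sketch is shown below; the use of scikit-learn and weighted averaging is an assumption, not a statement of how the reported numbers were computed.
+
+ ```python
+ import numpy as np
+ from sklearn.metrics import accuracy_score, precision_recall_fscore_support
+
+ def compute_metrics(eval_pred):
+     # eval_pred is a (logits, labels) pair provided by the Trainer.
+     logits, labels = eval_pred
+     preds = np.argmax(logits, axis=-1)
+     precision, recall, f1, _ = precision_recall_fscore_support(
+         labels, preds, average="weighted", zero_division=0
+     )
+     return {
+         "accuracy": accuracy_score(labels, preds),
+         "precision": precision,
+         "recall": recall,
+         "f1": f1,
+     }
+ ```
+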
+ Detailed per-script evaluation (**Support** is the number of test samples for that script; **Size** is the number of supported languages written in that script):
+
+ | Script | Support | Precision | Recall | F1 Score | Size |
+ |--------|---------|-----------|--------|----------|------|
+ | Arab | 819219 | 0.9038 | 0.9014 | 0.9023 | 21 |
+ | Latn | 7924704 | 0.9678 | 0.9663 | 0.9670 | 125 |
+ | Ethi | 144403 | 0.9967 | 0.9964 | 0.9966 | 2 |
+ | Beng | 163983 | 0.9949 | 0.9935 | 0.9942 | 3 |
+ | Deva | 423895 | 0.9495 | 0.9326 | 0.9405 | 10 |
+ | Cyrl | 831949 | 0.9899 | 0.9883 | 0.9891 | 12 |
+ | Tibt | 35683 | 0.9925 | 0.9930 | 0.9927 | 2 |
+ | Grek | 131155 | 0.9984 | 0.9990 | 0.9987 | 1 |
+ | Gujr | 86912 | 0.99999 | 0.9999 | 0.99995 | 1 |
+ | Hebr | 100530 | 0.9966 | 0.9995 | 0.9981 | 2 |
+ | Armn | 67203 | 0.9999 | 0.9998 | 0.9998 | 1 |
+ | Jpan | 88004 | 0.9983 | 0.9987 | 0.9985 | 1 |
+ | Knda | 67170 | 0.9999 | 0.9998 | 0.9999 | 1 |
+ | Geor | 70769 | 0.99997 | 0.9998 | 0.9999 | 1 |
+ | Khmr | 39708 | 1.0000 | 0.9997 | 0.9999 | 1 |
+ | Hang | 108509 | 0.9997 | 0.9999 | 0.9998 | 1 |
+ | Laoo | 29389 | 0.9999 | 0.9999 | 0.9999 | 1 |
+ | Mlym | 68418 | 0.99996 | 0.9999 | 0.9999 | 1 |
+ | Mymr | 100857 | 0.9999 | 0.9992 | 0.9995 | 2 |
+ | Orya | 44976 | 0.9995 | 0.9998 | 0.9996 | 1 |
+ | Guru | 67106 | 0.99999 | 0.9999 | 0.9999 | 1 |
+ | Olck | 22279 | 1.0000 | 0.9991 | 0.9995 | 1 |
+ | Sinh | 67492 | 1.0000 | 0.9998 | 0.9999 | 1 |
+ | Taml | 76373 | 0.99997 | 0.9999 | 0.9999 | 1 |
+ | Tfng | 41325 | 0.8512 | 0.8246 | 0.8247 | 2 |
+ | Telu | 62387 | 0.99997 | 0.9999 | 0.9999 | 1 |
+ | Thai | 83820 | 0.99995 | 0.9998 | 0.9999 | 1 |
+ | Hant | 152723 | 0.9945 | 0.9954 | 0.9949 | 2 |
+ | Hans | 92689 | 0.9893 | 0.9870 | 0.9882 | 1 |
+
+ A detailed per-script classification report is also provided in the repository for further analysis.
+
+ ---
+
+ ### How to Use
+
+ You can quickly load and run inference with this model using the [Transformers pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines):
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
+
+ # Load the fine-tuned tokenizer and model from the Hugging Face Hub
+ tokenizer = AutoTokenizer.from_pretrained("alexneakameni/language_detection")
+ model = AutoModelForSequenceClassification.from_pretrained("alexneakameni/language_detection")
+
+ # Wrap them in a text-classification pipeline for language identification
+ language_detection = pipeline("text-classification", model=model, tokenizer=tokenizer)
+
+ text = "Hello world!"
+ predictions = language_detection(text)
+ print(predictions)
+ ```
+
+ This will output the predicted language code or label with the corresponding confidence score.
+
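+ If you need more than the single best guess, the same pipeline can return several ranked candidates. The `top_k` argument is available in recent `transformers` text-classification pipelines; on older versions you may need `return_all_scores=True` instead.
+
+ ```python
+ # Return the five most likely languages with their scores.
+ predictions = language_detection("Bonjour tout le monde!", top_k=5)
+ print(predictions)  # e.g. [{"label": "<language code>", "score": ...}, ...]
+ ```
+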
+ ---
+
+ **Note**: The model’s performance may vary depending on text length, language variety, and domain-specific vocabulary. Always validate results against your own datasets for critical applications.
+
+ For more information, see the [repository documentation](https://github.com/KameniAlexNea/learning_language).
+
+ Thank you for using this model. Feedback and contributions are welcome!