---
library_name: transformers
tags:
- language
- detection
- classification
license: mit
datasets:
- hac541309/open-lid-dataset
pipeline_tag: text-classification
---

This is a clone of https://huggingface.co/alexneakameni/language_detection, with the model additionally provided in ONNX format.

# Language Detection Model

A **BERT-based** language detection model trained on [hac541309/open-lid-dataset](https://huggingface.co/datasets/hac541309/open-lid-dataset), which includes **121 million sentences across 200 languages**. This model is optimized for **fast and accurate** language identification in text classification tasks.

## Model Details

- **Architecture**: [BertForSequenceClassification](https://huggingface.co/transformers/model_doc/bert.html)
- **Hidden Size**: 384
- **Number of Layers**: 4
- **Attention Heads**: 6
- **Max Sequence Length**: 512
- **Dropout**: 0.1
- **Vocabulary Size**: 50,257

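For reference, the settings above map onto a `BertConfig` roughly as sketched below. This is not the published configuration file; in particular, `num_labels=200` is an assumption based on the dataset covering 200 languages.

```python
from transformers import BertConfig, BertForSequenceClassification

# Approximate configuration matching the architecture listed above.
# num_labels=200 is assumed from the 200 languages in the training data.
config = BertConfig(
    vocab_size=50_257,
    hidden_size=384,
    num_hidden_layers=4,
    num_attention_heads=6,
    max_position_embeddings=512,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    num_labels=200,
)
model = BertForSequenceClassification(config)
print(f"{model.num_parameters():,} parameters")  # randomly initialized; size check only
```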
					
						
## Training Process

- **Dataset**:
  - Used the [open-lid-dataset](https://huggingface.co/datasets/hac541309/open-lid-dataset)
  - Split into train (90%) and test (10%)
- **Tokenizer**: A custom `BertTokenizerFast` with special tokens for `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`
- **Hyperparameters**:
  - Learning Rate: 2e-5
  - Batch Size: 256 (training) / 512 (testing)
  - Epochs: 1
  - Scheduler: Cosine
- **Trainer**: Leveraged the Hugging Face [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer) with Weights & Biases for logging

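The training script itself is not included in this card. The snippet below is only a minimal sketch of the setup described above using the `datasets` and `transformers` APIs; the 90/10 split, learning rate, batch sizes, epoch count, scheduler, and W&B logging come from the list, while the `text` column name and the omitted label preprocessing are assumptions about the dataset schema.

```python
from datasets import load_dataset
from transformers import (
    BertForSequenceClassification,
    BertTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Reproduce the 90% / 10% split described above.
dataset = load_dataset("hac541309/open-lid-dataset", split="train")
dataset = dataset.train_test_split(test_size=0.1, seed=42)

tokenizer = BertTokenizerFast.from_pretrained("alexneakameni/language_detection")

def tokenize(batch):
    # "text" is an assumed column name; label encoding is omitted here.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

# Hyperparameters listed above, with Weights & Biases logging.
args = TrainingArguments(
    output_dir="language_detection",
    learning_rate=2e-5,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=512,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    report_to="wandb",
)

trainer = Trainer(
    model=BertForSequenceClassification.from_pretrained("alexneakameni/language_detection"),
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables dynamic padding of batches
)
# trainer.train()
```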
					
						
## Evaluation

The model was evaluated on the test split. Below are the overall metrics:

- **Accuracy**: 0.969466
- **Precision**: 0.969586
- **Recall**: 0.969466
- **F1 Score**: 0.969417

Detailed evaluation per script (Size is the number of supported languages written in that script):

| Script | Support | Precision | Recall | F1 Score | Size |
|--------|---------|-----------|--------|----------|------|
| Arab   | 819219  | 0.9038    | 0.9014 | 0.9023   | 21   |
| Latn   | 7924704 | 0.9678    | 0.9663 | 0.9670   | 125  |
| Ethi   | 144403  | 0.9967    | 0.9964 | 0.9966   | 2    |
| Beng   | 163983  | 0.9949    | 0.9935 | 0.9942   | 3    |
| Deva   | 423895  | 0.9495    | 0.9326 | 0.9405   | 10   |
| Cyrl   | 831949  | 0.9899    | 0.9883 | 0.9891   | 12   |
| Tibt   | 35683   | 0.9925    | 0.9930 | 0.9927   | 2    |
| Grek   | 131155  | 0.9984    | 0.9990 | 0.9987   | 1    |
| Gujr   | 86912   | 0.99999   | 0.9999 | 0.99995  | 1    |
| Hebr   | 100530  | 0.9966    | 0.9995 | 0.9981   | 2    |
| Armn   | 67203   | 0.9999    | 0.9998 | 0.9998   | 1    |
| Jpan   | 88004   | 0.9983    | 0.9987 | 0.9985   | 1    |
| Knda   | 67170   | 0.9999    | 0.9998 | 0.9999   | 1    |
| Geor   | 70769   | 0.99997   | 0.9998 | 0.9999   | 1    |
| Khmr   | 39708   | 1.0000    | 0.9997 | 0.9999   | 1    |
| Hang   | 108509  | 0.9997    | 0.9999 | 0.9998   | 1    |
| Laoo   | 29389   | 0.9999    | 0.9999 | 0.9999   | 1    |
| Mlym   | 68418   | 0.99996   | 0.9999 | 0.9999   | 1    |
| Mymr   | 100857  | 0.9999    | 0.9992 | 0.9995   | 2    |
| Orya   | 44976   | 0.9995    | 0.9998 | 0.9996   | 1    |
| Guru   | 67106   | 0.99999   | 0.9999 | 0.9999   | 1    |
| Olck   | 22279   | 1.0000    | 0.9991 | 0.9995   | 1    |
| Sinh   | 67492   | 1.0000    | 0.9998 | 0.9999   | 1    |
| Taml   | 76373   | 0.99997   | 0.9999 | 0.9999   | 1    |
| Tfng   | 41325   | 0.8512    | 0.8246 | 0.8247   | 2    |
| Telu   | 62387   | 0.99997   | 0.9999 | 0.9999   | 1    |
| Thai   | 83820   | 0.99995   | 0.9998 | 0.9999   | 1    |
| Hant   | 152723  | 0.9945    | 0.9954 | 0.9949   | 2    |
| Hans   | 92689   | 0.9893    | 0.9870 | 0.9882   | 1    |

A detailed per-script classification report is also provided in the repository for further analysis.

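The evaluation code is not reproduced here, but per-label metrics of this kind are commonly computed with scikit-learn. A minimal sketch, where `y_true` and `y_pred` stand for the reference and predicted labels on the test split (the `eng_Latn`-style codes and toy values are illustrative only):

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Placeholder labels; in practice these come from the test split and model predictions.
y_true = ["eng_Latn", "fra_Latn", "deu_Latn", "eng_Latn"]
y_pred = ["eng_Latn", "fra_Latn", "eng_Latn", "eng_Latn"]

# Weighted precision / recall / F1, as in the overall metrics above.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(precision, recall, f1)

# Per-language breakdown, which can then be aggregated by script as in the table above.
print(classification_report(y_true, y_pred, zero_division=0))
```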
					
						
---

### How to Use

You can quickly load and run inference with this model using the [Transformers pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the tokenizer and classification model from the original PyTorch repository.
tokenizer = AutoTokenizer.from_pretrained("alexneakameni/language_detection")
model = AutoModelForSequenceClassification.from_pretrained("alexneakameni/language_detection")

# Wrap them in a text-classification pipeline for language identification.
language_detection = pipeline("text-classification", model=model, tokenizer=tokenizer)

text = "Hello world!"
predictions = language_detection(text)
print(predictions)
```

This will output the predicted language code or label with the corresponding confidence score.

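Because this repository also ships ONNX weights, the model can likely be run with ONNX Runtime through [Optimum](https://huggingface.co/docs/optimum/index). A sketch, assuming the exported ONNX file is stored in this repository and loads with the standard `optimum.onnxruntime` API (the repository id below is a placeholder):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

model_id = "<this-repo-id>"  # placeholder: replace with this repository's Hub id

# export=False assumes the ONNX weights are already present in the repository.
onnx_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=False)
tokenizer = AutoTokenizer.from_pretrained(model_id)

language_detection = pipeline("text-classification", model=onnx_model, tokenizer=tokenizer)
print(language_detection("Hello world!"))
```

This requires the `optimum` package with the ONNX Runtime extra (`pip install optimum[onnxruntime]`).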
					
						
---

**Note**: The model’s performance may vary depending on text length, language variety, and domain-specific vocabulary. Always validate results against your own datasets for critical applications.

For more information, see the [repository documentation](https://github.com/KameniAlexNea/learning_language).

Thank you for using this model. Feedback and contributions are welcome!