---
library_name: transformers
tags:
- language
- detection
- classification
license: mit
datasets:
- hac541309/open-lid-dataset
pipeline_tag: text-classification
---

This is a clone of https://huggingface.co/alexneakameni/language_detection, provided in ONNX format.

# Language Detection Model

A **BERT-based** language detection model trained on [hac541309/open-lid-dataset](https://huggingface.co/datasets/hac541309/open-lid-dataset), which includes **121 million sentences across 200 languages**. This compact model targets **fast and accurate** language identification, framed as a text classification task.

## Model Details

- **Architecture**: [BertForSequenceClassification](https://huggingface.co/transformers/model_doc/bert.html)
- **Hidden Size**: 384
- **Number of Layers**: 4
- **Attention Heads**: 6
- **Max Sequence Length**: 512
- **Dropout**: 0.1
- **Vocabulary Size**: 50,257
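
As a rough size check, the figures above can be turned into a parameter-count estimate. This is a back-of-the-envelope sketch, not a number reported by the model card: the feed-forward width (4 × hidden size), the two token-type embeddings, the pooler layer, and the 200-label classification head are assumptions based on standard BERT defaults and on the 200 languages mentioned above.

```python
# Estimate the parameter count of a small BERT classifier from its config values.
HIDDEN = 384                 # from the model card
LAYERS = 4                   # from the model card
VOCAB = 50_257               # from the model card
MAX_POS = 512                # from the model card
INTERMEDIATE = 4 * HIDDEN    # assumption: standard BERT feed-forward ratio
TYPE_VOCAB = 2               # assumption: standard BERT token-type vocabulary
NUM_LABELS = 200             # assumption: one label per language


def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_norm = 2 * HIDDEN  # scale and shift vectors

# Word, position, and token-type embedding tables, followed by a LayerNorm.
embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm
# Query, key, value, and output projections, followed by a LayerNorm.
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm
# Two dense layers (up and down projection), followed by a LayerNorm.
feed_forward = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + layer_norm
encoder = LAYERS * (attention + feed_forward)
pooler = linear(HIDDEN, HIDDEN)
classifier = linear(HIDDEN, NUM_LABELS)
total = embeddings + encoder + pooler + classifier

print(f"embeddings: {embeddings:,}")   # 19,496,832
print(f"encoder:    {encoder:,}")      # 7,097,856
print(f"total:      {total:,}")        # 26,819,528
```

Under these assumptions, roughly 70% of the weights sit in the embedding tables rather than in the four encoder layers, which is typical for a shallow model with a large vocabulary.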

## Training Process

- **Dataset**:
  - Used the [open-lid-dataset](https://huggingface.co/datasets/hac541309/open-lid-dataset)
  - Split into train (90%) and test (10%)
- **Tokenizer**: A custom `BertTokenizerFast` with special tokens for `[UNK]`, `[CLS]`, `[SEP]`, `[PAD]`, `[MASK]`
- **Hyperparameters**:
  - Learning Rate: 2e-5
  - Batch Size: 256 (training) / 512 (testing)
  - Epochs: 1
  - Scheduler: Cosine
- **Trainer**: Leveraged the Hugging Face [Trainer API](https://huggingface.co/docs/transformers/main_classes/trainer) with Weights & Biases for logging
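
To make the hyperparameters above concrete, the following sketch computes the learning rate that a cosine schedule would produce over the single training epoch. Several details are assumptions that the card does not state: no warmup, decay all the way to zero, a global batch of 256 with no gradient accumulation, and a training set of exactly 90% of 121 million sentences. The function name `cosine_lr` is illustrative only.

```python
import math

BASE_LR = 2e-5        # from the model card
BATCH_SIZE = 256      # from the model card (assumed to be the global batch size)
TRAIN_SENTENCES = 121_000_000 * 9 // 10   # approximate: 90% of ~121M sentences

# One epoch means one pass over the training data.
TOTAL_STEPS = math.ceil(TRAIN_SENTENCES / BATCH_SIZE)


def cosine_lr(step, total_steps=TOTAL_STEPS, base_lr=BASE_LR):
    """Half-cosine decay from base_lr to 0 with no warmup (assumption)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


print(TOTAL_STEPS)  # 425391
for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    step = int(fraction * TOTAL_STEPS)
    print(f"{fraction:.0%} of training: lr = {cosine_lr(step):.3e}")
```

With one epoch and this step count, the learning rate falls to half its starting value about midway through the data and reaches zero at the final step.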

## Evaluation

The model was evaluated on the test split. Below are the overall metrics:

- **Accuracy**: 0.969466
- **Precision**: 0.969586
- **Recall**: 0.969466
- **F1 Score**: 0.969417
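
Accuracy and recall above are identical, which is what support-weighted averaging produces: weighted recall always equals accuracy. The card does not state the averaging mode, so treat this as an inference. The sketch below implements weighted metrics in plain Python; the function name and the toy labels are made up for illustration and are not taken from the test split.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy and support-weighted precision, recall, and F1."""
    support = Counter(y_true)      # true samples per label
    predicted = Counter(y_pred)    # predictions per label
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n         # each label contributes by its share of samples
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    accuracy = sum(correct.values()) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels, made up for illustration.
y_true = ["eng", "eng", "eng", "fra", "fra", "deu"]
y_pred = ["eng", "eng", "fra", "fra", "fra", "eng"]
metrics = weighted_metrics(y_true, y_pred)
print(metrics)
```

Weighted F1 averages the per-label F1 scores instead of combining the averaged precision and recall, so it need not lie between them. That is consistent with the reported F1 sitting slightly below both.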

Detailed evaluation by script (Size is the number of supported languages written in that script):

| Script | Support | Precision | Recall | F1 Score | Size |
|--------|---------|-----------|--------|----------|------|
| Arab | 819219 | 0.9038 | 0.9014 | 0.9023 | 21 |
| Latn | 7924704 | 0.9678 | 0.9663 | 0.9670 | 125 |
| Ethi | 144403 | 0.9967 | 0.9964 | 0.9966 | 2 |
| Beng | 163983 | 0.9949 | 0.9935 | 0.9942 | 3 |
| Deva | 423895 | 0.9495 | 0.9326 | 0.9405 | 10 |
| Cyrl | 831949 | 0.9899 | 0.9883 | 0.9891 | 12 |
| Tibt | 35683 | 0.9925 | 0.9930 | 0.9927 | 2 |
| Grek | 131155 | 0.9984 | 0.9990 | 0.9987 | 1 |
| Gujr | 86912 | 0.99999 | 0.9999 | 0.99995 | 1 |
| Hebr | 100530 | 0.9966 | 0.9995 | 0.9981 | 2 |
| Armn | 67203 | 0.9999 | 0.9998 | 0.9998 | 1 |
| Jpan | 88004 | 0.9983 | 0.9987 | 0.9985 | 1 |
| Knda | 67170 | 0.9999 | 0.9998 | 0.9999 | 1 |
| Geor | 70769 | 0.99997 | 0.9998 | 0.9999 | 1 |
| Khmr | 39708 | 1.0000 | 0.9997 | 0.9999 | 1 |
| Hang | 108509 | 0.9997 | 0.9999 | 0.9998 | 1 |
| Laoo | 29389 | 0.9999 | 0.9999 | 0.9999 | 1 |
| Mlym | 68418 | 0.99996 | 0.9999 | 0.9999 | 1 |
| Mymr | 100857 | 0.9999 | 0.9992 | 0.9995 | 2 |
| Orya | 44976 | 0.9995 | 0.9998 | 0.9996 | 1 |
| Guru | 67106 | 0.99999 | 0.9999 | 0.9999 | 1 |
| Olck | 22279 | 1.0000 | 0.9991 | 0.9995 | 1 |
| Sinh | 67492 | 1.0000 | 0.9998 | 0.9999 | 1 |
| Taml | 76373 | 0.99997 | 0.9999 | 0.9999 | 1 |
| Tfng | 41325 | 0.8512 | 0.8246 | 0.8247 | 2 |
| Telu | 62387 | 0.99997 | 0.9999 | 0.9999 | 1 |
| Thai | 83820 | 0.99995 | 0.9998 | 0.9999 | 1 |
| Hant | 152723 | 0.9945 | 0.9954 | 0.9949 | 2 |
| Hans | 92689 | 0.9893 | 0.9870 | 0.9882 | 1 |

A detailed per-script classification report is also provided in the repository for further analysis.
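
When deciding where extra validation is worthwhile, it can help to flag the scripts with the lowest scores. The sketch below does this for a subset of rows copied from the table above; the helper name `scripts_below` and the 0.95 cutoff are arbitrary choices for illustration, not part of the original evaluation.

```python
# (script, support, F1) rows copied from the per-script table (subset).
rows = [
    ("Arab", 819219, 0.9023),
    ("Latn", 7924704, 0.9670),
    ("Deva", 423895, 0.9405),
    ("Cyrl", 831949, 0.9891),
    ("Tfng", 41325, 0.8247),
    ("Hans", 92689, 0.9882),
]


def scripts_below(rows, threshold):
    """Return (script, f1) pairs with F1 below the threshold, weakest first."""
    flagged = [(script, f1) for script, _support, f1 in rows if f1 < threshold]
    return sorted(flagged, key=lambda item: item[1])


weak = scripts_below(rows, threshold=0.95)
print(weak)  # [('Tfng', 0.8247), ('Arab', 0.9023), ('Deva', 0.9405)]
```

In the full table these are the only three scripts with an F1 below 0.95, and each of them covers more than one language.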

---

### How to Use

You can quickly load and run inference with this model using the [Transformers pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load the tokenizer and the classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("alexneakameni/language_detection")
model = AutoModelForSequenceClassification.from_pretrained("alexneakameni/language_detection")

# Wrap both in a text-classification pipeline
language_detection = pipeline("text-classification", model=model, tokenizer=tokenizer)

text = "Hello world!"
predictions = language_detection(text)
print(predictions)
```

This prints the predicted language label together with its confidence score.
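
For short or ambiguous inputs, the top score can be low, so it may be useful to post-process the pipeline output before trusting it. The helper below is a minimal sketch, assuming the usual text-classification output shape of a list of `{"label", "score"}` dicts. The function name `pick_language`, the 0.5 threshold, and the `eng_Latn`-style labels are illustrative assumptions, and `und` is the ISO 639 code for an undetermined language.

```python
def pick_language(predictions, min_score=0.5, fallback="und"):
    """Return the top label, or `fallback` when the confidence is too low.

    `predictions` is a list of {"label": str, "score": float} dicts, the
    shape a text-classification pipeline returns for a single input.
    """
    if not predictions:
        return fallback
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= min_score else fallback


# Hand-written stand-ins for pipeline output (labels and scores are made up).
confident = [{"label": "eng_Latn", "score": 0.98}]
uncertain = [
    {"label": "fra_Latn", "score": 0.31},
    {"label": "ita_Latn", "score": 0.27},
]

print(pick_language(confident))  # eng_Latn
print(pick_language(uncertain))  # und
```

To get several candidates per input instead of only the best one, the pipeline call accepts a `top_k` argument, and the resulting list can be passed to the same helper.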

---

**Note**: The model’s performance may vary depending on text length, language variety, and domain-specific vocabulary. Always validate results against your own datasets for critical applications.

For more information, see the [repository documentation](https://github.com/KameniAlexNea/learning_language).

Thank you for using this model. Feedback and contributions are welcome!