Upload 7 files

Files changed (7)

README_language_classification.md +130 -0
config.json +25 -0
model.safetensors +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +56 -0
vocab.txt +0 -0

README_language_classification.md ADDED

	@@ -0,0 +1,130 @@

+# BERT-Based Language Classification Model
+This repository contains a BERT Base model fine-tuned to identify the language of a given sentence. It was trained with the Hugging Face Transformers library and supports post-training dynamic quantization for a smaller model and faster CPU inference in deployment.
+---
+## Model Details
+- **Model Name:** BERT Base for Language Classification
+- **Model Architecture:** BERT Base
+- **Task:** Language Identification
+- **Dataset:** Custom Dataset with multilingual text samples
+- **Quantization:** Dynamic Quantization (INT8)
+- **Fine-tuning Framework:** Hugging Face Transformers
+---
+## Usage
+### Installation
+```bash
+pip install transformers torch
+```
+### Loading the Fine-tuned Model
+```python
+from transformers import pipeline
+# Load the model and tokenizer from the saved directory
+classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")
+# Example input
+text = "Bonjour, comment allez-vous?"
+# Get prediction
+prediction = classifier(text)
+print(f"Prediction: {prediction}")
+```
+---
+## Saving and Testing the Model
+### Saving
+```python
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
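+# Replace with the path to your fine-tuned checkpoint; the plain
+# "bert-base-uncased" checkpoint has an untrained classification head
+model_checkpoint = "bert-base-uncased"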
+tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
+model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)
+# Save model and tokenizer
+model.save_pretrained("./saved_model")
+tokenizer.save_pretrained("./saved_model")
+```
+### Testing
+```python
+from transformers import pipeline
+classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")
+text = "Ceci est un exemple de texte."
+print(classifier(text))
+```
+---
+## Quantization
+### Apply Dynamic Quantization
+```python
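+import os
+import torch
+from transformers import AutoModelForSequenceClassification
+model = AutoModelForSequenceClassification.from_pretrained("./saved_model")
+# Apply dynamic quantization to the Linear layers (INT8 weights, CPU inference)
+quantized_model = torch.quantization.quantize_dynamic(
+    model, {torch.nn.Linear}, dtype=torch.qint8
+)
+# Save the quantized weights as a state dict; the packed INT8 parameters
+# are not restored by save_pretrained()/from_pretrained()
+os.makedirs("./quantized_model", exist_ok=True)
+torch.save(quantized_model.state_dict(), "./quantized_model/quantized_state_dict.pt")
+```
+### Load and Test Quantized Model
+```python
+import torch
+from transformers import AutoTokenizer, pipeline
+from transformers import AutoModelForSequenceClassification
+tokenizer = AutoTokenizer.from_pretrained("./saved_model")
+# Rebuild the quantized architecture first, then load the saved INT8 weights
+model = AutoModelForSequenceClassification.from_pretrained("./saved_model")
+quantized_model = torch.quantization.quantize_dynamic(
+    model, {torch.nn.Linear}, dtype=torch.qint8
+)
+# weights_only=False is a conservative choice because the state dict holds
+# packed parameters; only load files you trust
+state_dict = torch.load("./quantized_model/quantized_state_dict.pt", weights_only=False)
+quantized_model.load_state_dict(state_dict)
+classifier = pipeline("text-classification", model=quantized_model, tokenizer=tokenizer)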
+text = "Hola, ¿cómo estás?"
+print(classifier(text))
+```
+---
+## Repository Structure
+```
+.
+├── saved_model/            # Fine-tuned Model
+├── quantized_model/        # Quantized Model
+├── language-clasification.ipynb
+└── README.md               # Documentation
+```
+---
+## Limitations
+- Accuracy may be lower for languages that are low-resource or underrepresented in the training dataset.
+- Quantization may slightly reduce accuracy in exchange for faster, lighter inference.
+---
+## Contributing
+Feel free to submit issues or pull requests to improve performance or accuracy, or to add support for new languages.
+---

config.json ADDED

	@@ -0,0 +1,25 @@

+{
+  "architectures": [
+    "BertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float16",
+  "transformers_version": "4.51.3",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
+}
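
Review note: the config shown above carries no `id2label` / `label2id` entries, so a `text-classification` pipeline built from it would be expected to report generic names such as `LABEL_0` and `LABEL_1` rather than language names. A minimal sketch of mapping those names back to languages follows. The `LANGUAGES` list and the `readable` helper are hypothetical and not part of this commit; the real label order has to come from the fine-tuning run.

```python
import re

# Hypothetical label order; replace with the order used during fine-tuning.
LANGUAGES = ["en", "fr", "es"]


def readable(prediction: dict, languages: list = LANGUAGES) -> dict:
    """Return a copy of one pipeline result with LABEL_<n> replaced by a language code.

    Results whose label is not of the form LABEL_<n>, or whose index is
    outside the list, are returned unchanged.
    """
    match = re.fullmatch(r"LABEL_(\d+)", prediction["label"])
    if match is None:
        return dict(prediction)
    index = int(match.group(1))
    if index >= len(languages):
        return dict(prediction)
    return {**prediction, "label": languages[index]}


# A pipeline call returns a list of {"label": ..., "score": ...} dicts.
example_output = [{"label": "LABEL_1", "score": 0.98}]
print([readable(p) for p in example_output])
```

Storing the mapping in `config.json` as `id2label` and `label2id` would make the pipeline emit the language names directly, without a helper like this.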

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2482cf07a839e4f091ada914e740ad10849999df90e38f157b86d5d0c8a710da
+size 218990972
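
Review note: the pointer size can be checked against the architecture in `config.json`. The sketch below counts the parameters of a BERT Base encoder and pooler from the config values and compares two bytes per parameter with the file size. It leaves out the classification head, whose size depends on a label count that the config does not state, and the safetensors header, so a small remainder is expected.

```python
# Values copied from config.json in this commit.
HIDDEN = 768
VOCAB = 30522
MAX_POS = 512
TYPE_VOCAB = 2
INTERMEDIATE = 3072
LAYERS = 12

FILE_SIZE = 218_990_972  # "size" field of the model.safetensors LFS pointer


def linear(n_in: int, n_out: int) -> int:
    """Weights plus biases of a dense layer."""
    return n_in * n_out + n_out


def layer_norm(n: int) -> int:
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def bert_encoder_params() -> int:
    """Parameters of the embeddings, encoder layers and pooler (no classifier head)."""
    embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm(HIDDEN)
    attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm(HIDDEN)
    feed_forward = (
        linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + layer_norm(HIDDEN)
    )
    pooler = linear(HIDDEN, HIDDEN)
    return embeddings + LAYERS * (attention + feed_forward) + pooler


params = bert_encoder_params()
fp16_bytes = params * 2  # two bytes per float16 value

print(params)                 # 109482240
print(fp16_bytes)             # 218964480
print(FILE_SIZE - fp16_bytes) # 26492 bytes left for the header and classifier head
```

The size is consistent with half-precision weights, which matches `"torch_dtype": "float16"` in the config. This suggests the uploaded file is the half-precision model rather than the INT8-quantized one described in the README.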

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,56 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
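
Review note: with `"do_lower_case": true` and `"strip_accents": null`, `BertTokenizer` lowercases the input and also strips accents before WordPiece splitting. For language identification this removes diacritics that help tell languages apart. The sketch below approximates only that normalization step using the standard library, without loading the tokenizer; the function name is hypothetical, and the real tokenizer also handles whitespace, punctuation and subword splitting.

```python
import unicodedata


def normalize_like_uncased_bert(text: str) -> str:
    """Approximate the lowercasing and accent stripping of an uncased BERT tokenizer."""
    lowered = text.lower()
    # NFD separates base characters from their combining marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category "Mn"), keeping the base characters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Example sentence taken from the README in this commit.
print(normalize_like_uncased_bert("Hola, ¿cómo estás?"))  # hola, ¿como estas?
```

If the accents matter for the target languages, a cased or multilingual tokenizer and vocabulary may be a better fit than the 30,522-entry uncased vocabulary declared in `config.json`.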

vocab.txt ADDED

The diff for this file is too large to render. See raw diff