# BERT-Based Language Classification Model

This repository contains a fine-tuned BERT-based model for language identification: given a sentence, it predicts which language the text is written in. The model was fine-tuned with the Hugging Face Transformers library and supports post-training dynamic quantization for more efficient inference in deployment environments.

---

## Model Details

- **Model Name:** BERT Base for Language Classification
- **Model Architecture:** BERT Base
- **Task:** Language Identification
- **Dataset:** Custom dataset with multilingual text samples
- **Quantization:** Dynamic Quantization (INT8)
- **Fine-tuning Framework:** Hugging Face Transformers (a rough fine-tuning sketch follows below)
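
The exact fine-tuning procedure is not documented in this README. As a rough illustration only, a language classifier like this can be fine-tuned with the Transformers `Trainer` API along the lines below; the label set, dataset objects, and hyperparameters are placeholders, not the configuration actually used for this model.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder label set -- replace with the languages in your own dataset
id2label = {0: "english", 1: "french", 2: "spanish"}
label2id = {name: idx for idx, name in id2label.items()}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(id2label),
    id2label=id2label,
    label2id=label2id,
)

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

# `train_dataset` / `eval_dataset` are placeholders for your own tokenized
# datasets (e.g. datasets.Dataset objects with "input_ids" and "label" columns).
trainer = Trainer(
    model=model,
    args=training_args,
    # train_dataset=train_dataset,
    # eval_dataset=eval_dataset,
)
# trainer.train()
# trainer.save_model("./saved_model")
```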

---

## Usage

### Installation

```bash
pip install transformers torch
```

### Loading the Fine-tuned Model

```python
from transformers import pipeline

# Load the model and tokenizer from the saved directory
classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")

# Example input
text = "Bonjour, comment allez-vous?"

# Get prediction
prediction = classifier(text)
print(f"Prediction: {prediction}")
```
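
The pipeline also accepts a list of sentences, which is convenient when classifying a batch at once. The sentences below are only illustrative inputs, reusing the `classifier` created above; each prediction is a dict with a `label` and a `score`.

```python
texts = [
    "Bonjour, comment allez-vous?",
    "This is an English sentence.",
    "Hola, ¿cómo estás?",
]

# The pipeline returns one {"label": ..., "score": ...} dict per input sentence
for sentence, result in zip(texts, classifier(texts)):
    print(f"{sentence!r} -> {result['label']} ({result['score']:.3f})")
```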

---

## Saving and Testing the Model

### Saving

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_checkpoint = "bert-base-uncased"  # replace with the path to your fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

# Save the model and tokenizer to a local directory
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")
```
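
If the checkpoint being saved still carries generic class names (`LABEL_0`, `LABEL_1`, ...), it can help to attach human-readable language names to the config before saving, so that pipeline outputs are self-explanatory. The mapping below is only an example; use the labels and ordering from your own training run.

```python
# Example mapping -- replace with the languages/ordering used during fine-tuning
id2label = {0: "english", 1: "french", 2: "spanish"}

model.config.id2label = id2label
model.config.label2id = {name: idx for idx, name in id2label.items()}

# Re-save so the label names end up in config.json
model.save_pretrained("./saved_model")
```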

### Testing

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")
text = "Ceci est un exemple de texte."  # French: "This is an example text."
print(classifier(text))
```

---

## Quantization

### Apply Dynamic Quantization

```python
import os

import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("./saved_model")

# Apply dynamic quantization: weights of all Linear layers are stored as INT8
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Quantized modules do not round-trip through save_pretrained/from_pretrained,
# so save the quantized state dict with torch.save instead
os.makedirs("./quantized_model", exist_ok=True)
torch.save(quantized_model.state_dict(), "./quantized_model/model_state_dict.pt")
```
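
To sanity-check the benefit, you can compare the serialized size of the FP32 and INT8 weights. This is a rough, illustrative check that reuses `model` and `quantized_model` from the snippet above; the file names are arbitrary.

```python
import os

import torch

# Serialize both state dicts the same way so the comparison is fair
torch.save(model.state_dict(), "./fp32_state_dict.pt")
torch.save(quantized_model.state_dict(), "./int8_state_dict.pt")

fp32_mb = os.path.getsize("./fp32_state_dict.pt") / 1e6
int8_mb = os.path.getsize("./int8_state_dict.pt") / 1e6
print(f"FP32: {fp32_mb:.1f} MB, dynamic INT8: {int8_mb:.1f} MB")
```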

### Load and Test Quantized Model

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("./saved_model")

# Rebuild the architecture, re-apply dynamic quantization, then load the saved INT8 weights
model = AutoModelForSequenceClassification.from_pretrained("./saved_model")
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
quantized_model.load_state_dict(torch.load("./quantized_model/model_state_dict.pt"))

classifier = pipeline("text-classification", model=quantized_model, tokenizer=tokenizer)
text = "Hola, ¿cómo estás?"
print(classifier(text))
```
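
For a quick CPU latency comparison between the original and quantized models, a simple timing loop like the one below is enough. Absolute numbers depend on hardware, sequence length, and batch size, so treat it as a rough illustration; it reuses `classifier` and `text` from the snippet above.

```python
import time

from transformers import pipeline

def average_latency(clf, sentence, runs=20):
    clf(sentence)  # warm-up run
    start = time.perf_counter()
    for _ in range(runs):
        clf(sentence)
    return (time.perf_counter() - start) / runs

fp32_classifier = pipeline("text-classification", model="./saved_model", tokenizer="./saved_model")
print(f"FP32 latency: {average_latency(fp32_classifier, text) * 1000:.1f} ms/sentence")
print(f"INT8 latency: {average_latency(classifier, text) * 1000:.1f} ms/sentence")
```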

---

## Repository Structure

```
.
├── saved_model/                  # Fine-tuned model and tokenizer
├── quantized_model/              # Quantized model weights
├── language-clasification.ipynb  # Language classification notebook
└── README.md                     # Documentation
```

---

## Limitations

- The model's performance may vary for low-resource or underrepresented languages in the training dataset.
- Quantization may slightly reduce accuracy but improves inference efficiency.

---

## Contributing

Feel free to submit issues or pull requests to improve performance or accuracy, or to add support for new languages.

---