BERT-Base Quantized Model for Relation Extraction
This repository hosts a quantized version of the BERT-Base Chinese model, fine-tuned for relation extraction tasks. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.
Model Details
- Model Name: BERT-Base Chinese
- Model Architecture: BERT Base
- Task: Relation Extraction/Classification
- Dataset: Chinese Entity-Relation Dataset
- Quantization: Float16
- Fine-tuning Framework: Hugging Face Transformers
Usage
Installation
pip install transformers torch evaluate
Loading the Quantized Model
from transformers import BertTokenizerFast, BertForSequenceClassification
import torch
# Load the fine-tuned model and tokenizer
model_path = "final_relation_extraction_model"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()
# Example input with entity markers
text = "η¬εοΌ[SUBJ] ζ¨ζ§ [/SUBJ] εΊηε°οΌ[OBJ] ζι½ [/OBJ]"
# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
# Inference
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class = torch.argmax(logits, dim=1).item()
# Map prediction to relation label
label_mapping = {0: "出生地", 1: "出生日期", 2: "民族", 3: "职业"}  # Customize based on your labels
predicted_relation = label_mapping[predicted_class]
print(f"Predicted Relation: {predicted_relation}")
Performance Metrics
- Accuracy: 0.970222
- F1 Score: 0.964973
- Training Loss: 0.130104
- Validation Loss: 0.066986
Fine-Tuning Details
Dataset
The model was fine-tuned on a Chinese entity-relation dataset with:
- Entity pairs marked with the special tokens [SUBJ] ... [/SUBJ] and [OBJ] ... [/OBJ] (see the sketch after this list)
- Text preprocessed to include entity boundaries
- Multiple relation types including biographical information
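Since [SUBJ], [/SUBJ], [OBJ], and [/OBJ] are not in the original BERT vocabulary, they have to be registered with the tokenizer and the embedding matrix resized before fine-tuning; the repository's added_tokens.json suggests this was done. A minimal sketch assuming the standard Transformers APIs were used; num_labels is illustrative:
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=4)  # set to the real number of relation types

# Register the entity-marker tokens and grow the embedding table to match the new vocabulary size
tokenizer.add_special_tokens({"additional_special_tokens": ["[SUBJ]", "[/SUBJ]", "[OBJ]", "[/OBJ]"]})
model.resize_token_embeddings(len(tokenizer))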
Training Configuration
- Epochs: 3
- Batch Size: 16
- Learning Rate: 2e-5
- Max Length: 128 tokens
- Evaluation Strategy: epoch
- Weight Decay: 0.01
- Optimizer: AdamW
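These settings map onto Hugging Face TrainingArguments roughly as follows (a sketch; the authoritative setup is in relationship-extraction.ipynb, and output_dir here is illustrative):
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="final_relation_extraction_model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
)
# AdamW is the Trainer's default optimizer, matching the configuration above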
Data Processing
The original SPO (Subject-Predicate-Object) format was converted to relation classification:
- Each SPO triple becomes a separate training example
- Entities are marked with special tokens in the text
- Relations are encoded as numerical labels for classification
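For illustration, a minimal version of that conversion; the field names ("text", "spo_list", "subject", "predicate", "object") are assumptions about the dataset schema rather than the exact code used in the notebook:
def spo_to_examples(record, label2id):
    # Turn one annotated sentence with its SPO triples into marked classification examples
    examples = []
    for spo in record["spo_list"]:
        text = record["text"]
        # Wrap the first occurrence of each entity mention with the marker tokens
        text = text.replace(spo["subject"], f"[SUBJ] {spo['subject']} [/SUBJ]", 1)
        text = text.replace(spo["object"], f"[OBJ] {spo['object']} [/OBJ]", 1)
        examples.append({"text": text, "label": label2id[spo["predicate"]]})
    return examples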
Quantization
Post-training quantization was applied by converting the model weights to Float16 (half precision) in PyTorch, reducing the model size and improving inference efficiency.
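A minimal sketch of that step, assuming the weights were simply cast with half() and re-saved (the exact procedure is in the training notebook; the output path is illustrative):
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("final_relation_extraction_model")
model = model.half()  # cast all weights to Float16
model.save_pretrained("final_relation_extraction_model_fp16")
# Float16 inference is typically run on a GPU; call model.float() to restore full precision on CPU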
Repository Structure
.
├── final_relation_extraction_model/
│   ├── config.json
│   ├── pytorch_model.bin            # Fine-tuned model
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── vocab.txt
│   └── added_tokens.json
├── relationship-extraction.ipynb    # Training notebook
└── README.md                        # Model documentation
Entity Marking Format
The model expects input text with entities marked using special tokens:
- Subject entities: [SUBJ] entity_name [/SUBJ]
- Object entities: [OBJ] entity_name [/OBJ]
Example:
Input: "η¬εοΌ[SUBJ] ζ¨ζ§ [/SUBJ]εεοΌζ¨θζΎζ°ζοΌ [OBJ] εζ [/OBJ]"
Output: "ζ°ζ" (ethnicity relation)
Supported Relations
The model can classify various biographical and factual relations in Chinese text, including:
- 出生地 (Birthplace)
- 出生日期 (Birth Date)
- 民族 (Ethnicity)
- 职业 (Occupation)
- And many more based on the training dataset
Limitations
- The model is specifically trained for Chinese text and may not work well with other languages
- Performance depends on proper entity marking in the input text
- The model may not generalize well to domains outside the fine-tuning dataset
- Quantization may result in minor accuracy degradation compared to full-precision models
Training Environment
- Platform: Kaggle Notebooks with GPU acceleration
- GPU: NVIDIA Tesla T4
- Training Time: Approximately 1 hour 5 minutes
- Framework: Hugging Face Transformers with PyTorch backend
Contributing
Contributions are welcome! Feel free to open an issue or PR for improvements, fixes, or feature extensions.