# BERT-Base Quantized Model for Relation Extraction
This repository hosts a quantized version of the BERT-Base Chinese model, fine-tuned for relation extraction tasks. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.
---
## Model Details
- **Model Name:** BERT-Base Chinese
- **Model Architecture:** BERT Base
- **Task:** Relation Extraction/Classification
- **Dataset:** Chinese Entity-Relation Dataset
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers
---
## Usage
### Installation
```bash
pip install transformers torch evaluate
```
### Loading the Quantized Model
```python
from transformers import BertTokenizerFast, BertForSequenceClassification
import torch
# Load the fine-tuned model and tokenizer
model_path = "final_relation_extraction_model"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()
# Example input with entity markers
text = "η¬”εοΌš[SUBJ] ζœ¨ζ–§ [/SUBJ] ε‡Ίη”Ÿεœ°οΌš[OBJ] ζˆιƒ½ [/OBJ]"
# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
predicted_class = torch.argmax(logits, dim=1).item()
# Map prediction to relation label
label_mapping = {0: "ε‡Ίη”Ÿεœ°", 1: "ε‡Ίη”Ÿζ—₯期", 2: "民族", 3: "职业"} # Customize based on your labels
predicted_relation = label_mapping[predicted_class]
print(f"Predicted Relation: {predicted_relation}")
```
---
## Performance Metrics
- **Accuracy:** 0.970222
- **F1 Score:** 0.964973
- **Training Loss:** 0.130104
- **Validation Loss:** 0.066986
---
## Fine-Tuning Details
### Dataset
The model was fine-tuned on a Chinese entity-relation dataset with:
- Entity pairs marked with the special tokens `[SUBJ]` and `[OBJ]` (registered with the tokenizer as sketched below)
- Text preprocessed to include entity boundaries
- Multiple relation types including biographical information
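Because `[SUBJ]`, `[/SUBJ]`, `[OBJ]`, and `[/OBJ]` are not part of the base BERT vocabulary, they need to be registered as special tokens and the embedding matrix resized. A minimal sketch, assuming the standard Transformers workflow (the `num_labels` value is illustrative and should match the number of relation classes):
```python
from transformers import BertTokenizerFast, BertForSequenceClassification

# Register the entity-marker tokens so they are kept as single tokens
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[SUBJ]", "[/SUBJ]", "[OBJ]", "[/OBJ]"]}
)

# num_labels is illustrative; use the actual number of relation classes
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=4)
model.resize_token_embeddings(len(tokenizer))  # make room for the new tokens
```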
### Training Configuration
- **Epochs:** 3
- **Batch Size:** 16
- **Learning Rate:** 2e-5
- **Max Length:** 128 tokens
- **Evaluation Strategy:** epoch
- **Weight Decay:** 0.01
- **Optimizer:** AdamW (see the configuration sketch below)
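These hyperparameters map roughly onto a Hugging Face `TrainingArguments` object as follows; the output path is a placeholder, and the training notebook may set additional options:
```python
from transformers import TrainingArguments

# Sketch of the configuration listed above; output_dir is a placeholder
training_args = TrainingArguments(
    output_dir="./relation_extraction_output",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
)
```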
### Data Processing
The original SPO (Subject-Predicate-Object) format was converted to relation classification:
- Each SPO triple becomes a separate training example
- Entities are marked with special tokens in the text
- Relations are encoded as numerical labels for classification (see the conversion sketch below)
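A minimal sketch of that conversion, assuming each SPO record carries `subject`, `predicate`, and `object` fields (the actual dataset schema and label set may differ):
```python
# Hypothetical label map, for illustration only
label2id = {"ε‡Ίη”Ÿεœ°": 0, "ε‡Ίη”Ÿζ—₯期": 1, "民族": 2, "职业": 3}

def spo_to_example(text, spo, label2id):
    """Turn one SPO triple into a marked-text classification example."""
    marked = text.replace(spo["subject"], f"[SUBJ] {spo['subject']} [/SUBJ]", 1)
    marked = marked.replace(spo["object"], f"[OBJ] {spo['object']} [/OBJ]", 1)
    return {"text": marked, "label": label2id[spo["predicate"]]}

example = spo_to_example(
    "ζœ¨ζ–§ε‡Ίη”ŸδΊŽζˆιƒ½",
    {"subject": "ζœ¨ζ–§", "predicate": "ε‡Ίη”Ÿεœ°", "object": "ζˆιƒ½"},
    label2id,
)
# -> {"text": "[SUBJ] ζœ¨ζ–§ [/SUBJ]ε‡Ίη”ŸδΊŽ[OBJ] ζˆιƒ½ [/OBJ]", "label": 0}
```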
### Quantization
Post-training quantization was applied using PyTorch's Float16 precision to reduce the model size and improve inference efficiency.
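In practice this amounts to casting the fine-tuned weights to `torch.float16` and saving the result; a sketch, with a placeholder output path:
```python
from transformers import BertForSequenceClassification

# Load the fine-tuned model and cast its weights to float16
model = BertForSequenceClassification.from_pretrained("final_relation_extraction_model")
model = model.half()  # same as model.to(torch.float16)

# Save the quantized weights (placeholder path)
model.save_pretrained("./quantized_model")
```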
---
## Repository Structure
```
.
β”œβ”€β”€ final_relation_extraction_model/
β”‚   β”œβ”€β”€ config.json
β”‚   β”œβ”€β”€ pytorch_model.bin          # Fine-tuned model
β”‚   β”œβ”€β”€ tokenizer_config.json
β”‚   β”œβ”€β”€ special_tokens_map.json
β”‚   β”œβ”€β”€ tokenizer.json
β”‚   β”œβ”€β”€ vocab.txt
β”‚   └── added_tokens.json
β”œβ”€β”€ relationship-extraction.ipynb  # Training notebook
└── README.md                      # Model documentation
```
---
## Entity Marking Format
The model expects input text with entities marked using special tokens:
- Subject entities: `[SUBJ] entity_name [/SUBJ]`
- Object entities: `[OBJ] entity_name [/OBJ]`
Example:
```
Input: "η¬”εοΌš[SUBJ] ζœ¨ζ–§ [/SUBJ]εŽŸεοΌšζ¨θŽ†ζ›Ύζ°‘ζ—οΌš [OBJ] ε›žζ— [/OBJ]"
Output: "民族" (ethnicity relation)
```
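Because the markers are stored as added tokens (see `added_tokens.json` in the repository), the saved tokenizer should keep each marker as a single token rather than splitting it into sub-word pieces. A quick way to check this:
```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("final_relation_extraction_model")
tokens = tokenizer.tokenize("η¬”εοΌš[SUBJ] ζœ¨ζ–§ [/SUBJ] ε‡Ίη”Ÿεœ°οΌš[OBJ] ζˆιƒ½ [/OBJ]")
print(tokens)  # each marker should appear as a single token in the output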
---
## Supported Relations
The model can classify various biographical and factual relations in Chinese text, including:
- ε‡Ίη”Ÿεœ° (Birthplace)
- ε‡Ίη”Ÿζ—₯期 (Birth Date)
- 民族 (Ethnicity)
- 职业 (Occupation)
- Additional relation types present in the training dataset
---
## Limitations
- The model is specifically trained for Chinese text and may not work well with other languages
- Performance depends on proper entity marking in the input text
- The model may not generalize well to domains outside the fine-tuning dataset
- Quantization may result in minor accuracy degradation compared to full-precision models
---
## Training Environment
- **Platform:** Kaggle Notebooks with GPU acceleration
- **GPU:** NVIDIA Tesla T4
- **Training Time:** Approximately 1 hour 5 minutes
- **Framework:** Hugging Face Transformers with PyTorch backend
---
## Contributing
Contributions are welcome! Feel free to open an issue or PR for improvements, fixes, or feature extensions.
---