# BERT-Base Quantized Model for Relation Extraction

This repository hosts a quantized version of the BERT-Base Chinese model, fine-tuned for relation extraction tasks. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.

---
## Model Details

- **Model Name:** BERT-Base Chinese
- **Model Architecture:** BERT Base
- **Task:** Relation Extraction/Classification
- **Dataset:** Chinese Entity-Relation Dataset
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers

---
## Usage

### Installation

```bash
pip install transformers torch evaluate
```

### Loading the Quantized Model
```python
from transformers import BertTokenizerFast, BertForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model_path = "final_relation_extraction_model"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()

# Example input with entity markers
text = "笔名：[SUBJ] 木斧 [/SUBJ] 出生地：[OBJ] 成都 [/OBJ]"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()

# Map prediction to relation label
label_mapping = {0: "出生地", 1: "出生日期", 2: "民族", 3: "职业"}  # Customize based on your labels
predicted_relation = label_mapping[predicted_class]
print(f"Predicted Relation: {predicted_relation}")
```
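
By default `from_pretrained` loads the weights in full precision (float32). If you want to keep the checkpoint's Float16 weights, which is usually only worthwhile on a GPU, you can request the dtype explicitly; a minimal sketch:

```python
import torch
from transformers import BertForSequenceClassification

# Keep the stored half-precision weights instead of upcasting to float32
model = BertForSequenceClassification.from_pretrained(
    "final_relation_extraction_model",
    torch_dtype=torch.float16,
)
model.eval()
```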
---

## Performance Metrics

- **Accuracy:** 0.970222
- **F1 Score:** 0.964973
- **Training Loss:** 0.130104
- **Validation Loss:** 0.066986
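
If you want to recompute metrics like these on your own evaluation split, the `evaluate` package from the installation step can be used along these lines (the label id lists are placeholders, and the F1 averaging scheme is an assumption since the card does not state it):

```python
import evaluate

accuracy = evaluate.load("accuracy")
f1 = evaluate.load("f1")

# Placeholder label ids; in practice these come from the model's predictions
# and the gold labels of the evaluation split
predictions = [0, 2, 3, 1]
references = [0, 2, 3, 3]

print(accuracy.compute(predictions=predictions, references=references))
print(f1.compute(predictions=predictions, references=references, average="weighted"))
```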
---

## Fine-Tuning Details

### Dataset

The model was fine-tuned on a Chinese entity-relation dataset with:

- Entity pairs marked with the special tokens `[SUBJ]` and `[OBJ]` (see the sketch below)
- Text preprocessed to include entity boundaries
- Multiple relation types, including biographical information
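
The exact preprocessing lives in the training notebook; as a rough sketch, the marker tokens are typically registered with the tokenizer before fine-tuning so they are kept as single tokens (the `num_labels` value here is illustrative):

```python
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=4)

# Register the entity markers so they are not split into sub-word pieces
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[SUBJ]", "[/SUBJ]", "[OBJ]", "[/OBJ]"]}
)

# Grow the embedding matrix to cover the newly added tokens
model.resize_token_embeddings(len(tokenizer))
```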
### Training Configuration

- **Epochs:** 3
- **Batch Size:** 16
- **Learning Rate:** 2e-5
- **Max Length:** 128 tokens
- **Evaluation Strategy:** epoch
- **Weight Decay:** 0.01
- **Optimizer:** AdamW
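
A `Trainer` setup matching the configuration above might look roughly like this; the output path and dataset objects are placeholders, AdamW is the `Trainer` default optimizer, and the evaluation-strategy argument is named `eval_strategy` in newer `transformers` releases:

```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="relation_extraction_checkpoints",  # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,                  # the classification model prepared above
    args=training_args,
    train_dataset=train_dataset,  # placeholder: tokenized training split
    eval_dataset=eval_dataset,    # placeholder: tokenized validation split
)
trainer.train()
```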
### Data Processing

The original SPO (Subject-Predicate-Object) format was converted to relation classification:

- Each SPO triple becomes a separate training example
- Entities are marked with special tokens in the text
- Relations are encoded as numerical labels for classification (see the sketch below)
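
The actual conversion is in the training notebook; a simplified sketch of the idea, with an illustrative sentence, triple, and label map:

```python
# Illustrative label map; the real dataset defines the full relation inventory
label2id = {"出生地": 0, "出生日期": 1, "民族": 2, "职业": 3}

def spo_to_example(text: str, spo: dict) -> dict:
    """Turn one SPO triple into a marked-text classification example."""
    marked = text.replace(spo["subject"], f"[SUBJ] {spo['subject']} [/SUBJ]", 1)
    marked = marked.replace(spo["object"], f"[OBJ] {spo['object']} [/OBJ]", 1)
    return {"text": marked, "label": label2id[spo["predicate"]]}

example = spo_to_example(
    "木斧，回族，出生于成都。",
    {"subject": "木斧", "predicate": "出生地", "object": "成都"},
)
print(example)
```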
### Quantization

Post-training quantization was applied using PyTorch's Float16 precision to reduce the model size and improve inference efficiency.
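
In practice this amounts to casting the fine-tuned weights to half precision and saving the result; a minimal sketch (the source path is a placeholder for the full-precision checkpoint):

```python
from transformers import BertForSequenceClassification

# Load the full-precision fine-tuned model, cast it to Float16, and save it
model = BertForSequenceClassification.from_pretrained("relation_extraction_fp32")  # placeholder path
model = model.half()
model.save_pretrained("final_relation_extraction_model")
```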
---

## Repository Structure

```
.
├── final_relation_extraction_model/
│   ├── config.json
│   ├── pytorch_model.bin            # Fine-tuned Model
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── vocab.txt
│   └── added_tokens.json
├── relationship-extraction.ipynb    # Training notebook
└── README.md                        # Model documentation
```
---

## Entity Marking Format

The model expects input text with entities marked using special tokens:

- Subject entities: `[SUBJ] entity_name [/SUBJ]`
- Object entities: `[OBJ] entity_name [/OBJ]`

Example:

```
Input: "笔名：[SUBJ] 木斧 [/SUBJ]民族： [OBJ] 回族 [/OBJ]"
Output: "民族" (ethnicity relation)
```
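
If you build inputs by hand, it is worth checking that the saved tokenizer keeps the markers as single tokens (they are stored in `added_tokens.json`); a quick sanity check using the model directory from the usage example:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("final_relation_extraction_model")

# Each marker should appear as one token rather than being split into sub-words
print(tokenizer.tokenize("笔名：[SUBJ] 木斧 [/SUBJ] 出生地：[OBJ] 成都 [/OBJ]"))
```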
---

## Supported Relations

The model can classify various biographical and factual relations in Chinese text, including:

- 出生地 (Birthplace)
- 出生日期 (Birth Date)
- 民族 (Ethnicity)
- 职业 (Occupation)
- And many more based on the training dataset

---
## Limitations

- The model is specifically trained for Chinese text and may not work well with other languages
- Performance depends on proper entity marking in the input text
- The model may not generalize well to domains outside the fine-tuning dataset
- Quantization may result in minor accuracy degradation compared to full-precision models

---
## Training Environment

- **Platform:** Kaggle Notebooks with GPU acceleration
- **GPU:** NVIDIA Tesla T4
- **Training Time:** Approximately 1 hour 5 minutes
- **Framework:** Hugging Face Transformers with PyTorch backend

---
## Contributing

Contributions are welcome! Feel free to open an issue or PR for improvements, fixes, or feature extensions.

---