# BERT-Base Quantized Model for Relation Extraction
This repository hosts a quantized version of the BERT-Base Chinese model, fine-tuned for relation extraction tasks. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.
---
## Model Details
- **Model Name:** BERT-Base Chinese
- **Model Architecture:** BERT Base
- **Task:** Relation Extraction/Classification
- **Dataset:** Chinese Entity-Relation Dataset
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers
---
## Usage
### Installation
```bash
pip install transformers torch evaluate
```
### Loading the Quantized Model
```python
from transformers import BertTokenizerFast, BertForSequenceClassification
import torch
# Load the fine-tuned model and tokenizer
model_path = "final_relation_extraction_model"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()
# Example input with entity markers
text = "η¬εοΌ[SUBJ] ζ¨ζ§ [/SUBJ] εΊηε°οΌ[OBJ] ζι½ [/OBJ]"
# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

predicted_class = torch.argmax(logits, dim=1).item()
# Map prediction to relation label
label_mapping = {0: "出生地", 1: "出生日期", 2: "民族", 3: "职业"}  # Customize based on your labels
predicted_relation = label_mapping[predicted_class]
print(f"Predicted Relation: {predicted_relation}")
```
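For quick experiments, the same checkpoint can also be called through the `text-classification` pipeline. This is a convenience sketch, not part of the repository; whether the returned labels are human-readable depends on the `id2label` mapping stored in `config.json` (otherwise they appear as `LABEL_0`, `LABEL_1`, ...).
```python
from transformers import pipeline

# Same directory as in the example above
model_path = "final_relation_extraction_model"
classifier = pipeline("text-classification", model=model_path, tokenizer=model_path)
print(classifier("笔名：[SUBJ] 木斧 [/SUBJ] 出生地：[OBJ] 成都 [/OBJ]"))
# e.g. [{'label': 'LABEL_0', 'score': 0.99}]  (illustrative output)
```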
---
## Performance Metrics
- **Accuracy:** 0.970222
- **F1 Score:** 0.964973
- **Training Loss:** 0.130104
- **Validation Loss:** 0.066986
---
## Fine-Tuning Details
### Dataset
The model was fine-tuned on a Chinese entity-relation dataset with:
- Entity pairs marked with special tokens `[SUBJ]` and `[OBJ]`
- Text preprocessed to include entity boundaries
- Multiple relation types including biographical information
### Training Configuration
- **Epochs:** 3
- **Batch Size:** 16
- **Learning Rate:** 2e-5
- **Max Length:** 128 tokens
- **Evaluation Strategy:** epoch
- **Weight Decay:** 0.01
- **Optimizer:** AdamW
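As a reference, the hyperparameters above map onto the Hugging Face `Trainer` API roughly as sketched below. This is not the exact notebook code: the dataset objects are placeholders, `num_labels=4` is illustrative, and the special-token registration is an assumption based on the `added_tokens.json` shipped with the model.
```python
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

# Base model and tokenizer; num_labels must match the relation inventory.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=4)

# Register the entity markers so they are kept as single tokens (assumed setup).
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[SUBJ]", "[/SUBJ]", "[OBJ]", "[/OBJ]"]}
)
model.resize_token_embeddings(len(tokenizer))

training_args = TrainingArguments(
    output_dir="final_relation_extraction_model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",  # named evaluation_strategy on older transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # tokenized, entity-marked examples (placeholder)
    eval_dataset=eval_dataset,    # placeholder
)
trainer.train()
```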
### Data Processing
The original SPO (Subject-Predicate-Object) format was converted to relation classification:
- Each SPO triple becomes a separate training example
- Entities are marked with special tokens in the text
- Relations are encoded as numerical labels for classification
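A rough illustration of that conversion is shown below. The record layout (`text`, `spo_list`, field names) and the example sentence are assumptions made for illustration, not the dataset's documented schema.
```python
# Map each relation (predicate) to an integer class id.
relation2id = {"出生地": 0, "出生日期": 1, "民族": 2, "职业": 3}

# One hypothetical SPO record.
record = {
    "text": "张三，1970年出生于成都，职业为教师。",
    "spo_list": [
        {"subject": "张三", "predicate": "出生地", "object": "成都"},
        {"subject": "张三", "predicate": "职业", "object": "教师"},
    ],
}

# Each triple becomes one classification example with marked entities.
examples = []
for spo in record["spo_list"]:
    marked = record["text"].replace(spo["subject"], f"[SUBJ] {spo['subject']} [/SUBJ]", 1)
    marked = marked.replace(spo["object"], f"[OBJ] {spo['object']} [/OBJ]", 1)
    examples.append({"text": marked, "label": relation2id[spo["predicate"]]})

print(len(examples))  # 2 classification examples produced from a single record
```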
### Quantization
Post-training quantization was applied by casting the model weights to float16 (half precision) with PyTorch, reducing the model size and improving inference efficiency.
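A minimal sketch of that conversion, assuming the directory name used in this repository (the `_fp16` output path is purely illustrative):
```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("final_relation_extraction_model")
model = model.half()  # cast all weights to torch.float16
model.save_pretrained("final_relation_extraction_model_fp16")  # illustrative output path

# At inference time the checkpoint can also be loaded directly in half precision:
fp16_model = BertForSequenceClassification.from_pretrained(
    "final_relation_extraction_model_fp16", torch_dtype=torch.float16
)
```
Note that some CPU kernels do not support float16; casting back with `model.float()` is a safe fallback for CPU-only inference.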
---
## Repository Structure
```
.
├── final_relation_extraction_model/
│   ├── config.json
│   ├── pytorch_model.bin              # Fine-tuned Model
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── vocab.txt
│   └── added_tokens.json
├── relationship-extraction.ipynb      # Training notebook
└── README.md                          # Model documentation
```
---
## Entity Marking Format
The model expects input text with entities marked using special tokens:
- Subject entities: `[SUBJ] entity_name [/SUBJ]`
- Object entities: `[OBJ] entity_name [/OBJ]`
Example:
```
Input: "η¬εοΌ[SUBJ] ζ¨ζ§ [/SUBJ]εεοΌζ¨θζΎζ°ζοΌ [OBJ] εζ [/OBJ]"
Output: "ζ°ζ" (ethnicity relation)
```
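Assuming the markers were registered as added tokens (the repository includes `added_tokens.json`), they should tokenize as single units rather than being split into word pieces. A quick check:
```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("final_relation_extraction_model")
tokens = tokenizer.tokenize("笔名：[SUBJ] 木斧 [/SUBJ]")
print(tokens)  # expect '[SUBJ]' and '[/SUBJ]' to appear as single tokens
```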
---
## Supported Relations
The model can classify various biographical and factual relations in Chinese text, including:
- 出生地 (Birthplace)
- 出生日期 (Birth Date)
- 民族 (Ethnicity)
- 职业 (Occupation)
- Additional relation types present in the fine-tuning dataset
---
## Limitations
- The model is specifically trained for Chinese text and may not work well with other languages
- Performance depends on proper entity marking in the input text
- The model may not generalize well to domains outside the fine-tuning dataset
- Quantization may result in minor accuracy degradation compared to full-precision models
---
## Training Environment
- **Platform:** Kaggle Notebooks with GPU acceleration
- **GPU:** NVIDIA Tesla T4
- **Training Time:** Approximately 1 hour 5 minutes
- **Framework:** Hugging Face Transformers with PyTorch backend
---
## Contributing
Contributions are welcome! Feel free to open an issue or PR for improvements, fixes, or feature extensions.
---