# BERT-Base Quantized Model for Relation Extraction

This repository hosts a quantized version of the BERT-Base Chinese model, fine-tuned for relation extraction tasks. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.

---

## Model Details

- **Model Name:** BERT-Base Chinese   
- **Model Architecture:** BERT Base  
- **Task:** Relation Extraction/Classification  
- **Dataset:** Chinese Entity-Relation Dataset  
- **Quantization:** Float16  
- **Fine-tuning Framework:** Hugging Face Transformers

---

## Usage

### Installation

```bash
pip install transformers torch evaluate
```

### Loading the Quantized Model

```python
from transformers import BertTokenizerFast, BertForSequenceClassification
import torch

# Load the fine-tuned model and tokenizer
model_path = "final_relation_extraction_model"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()

# Example input with entity markers
text = "笔名：[SUBJ] 木斧 [/SUBJ] 出生地：[OBJ] 成都 [/OBJ]"

# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = torch.argmax(logits, dim=1).item()

# Map prediction to relation label
label_mapping = {0: "出生地", 1: "出生日期", 2: "民族", 3: "职业"}  # Customize based on your labels
predicted_relation = label_mapping[predicted_class]
print(f"Predicted Relation: {predicted_relation}")
```

---

## Performance Metrics

- **Accuracy:** 0.970222
- **F1 Score:** 0.964973  
- **Training Loss:** 0.130104  
- **Validation Loss:** 0.066986
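
For reference, a minimal sketch of how these metrics can be recomputed with the `evaluate` library installed above; the validation split and its format are assumptions, and the weighted F1 averaging is a guess rather than a documented choice:

```python
import evaluate
import numpy as np

accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    """Compute accuracy and F1 from (logits, labels) pairs returned by the Trainer."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"],
        # average="weighted" is an assumption; the original averaging method is not documented
        "f1": f1_metric.compute(predictions=predictions, references=labels, average="weighted")["f1"],
    }
```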

---

## Fine-Tuning Details

### Dataset

The model was fine-tuned on a Chinese entity-relation dataset with:
- Entity pairs marked with special tokens `[SUBJ]` and `[OBJ]`
- Text preprocessed to include entity boundaries
- Multiple relation types including biographical information

### Training Configuration

- **Epochs:** 3  
- **Batch Size:** 16  
- **Learning Rate:** 2e-5  
- **Max Length:** 128 tokens   
- **Evaluation Strategy:** epoch  
- **Weight Decay:** 0.01
- **Optimizer:** AdamW
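
A rough sketch of how these settings map onto Hugging Face `TrainingArguments` and `Trainer`; the output directory and dataset variables are placeholders, and AdamW is simply the `Trainer` default:

```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="relation-extraction-bert",  # placeholder output directory
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,                      # BertForSequenceClassification as loaded above
    args=training_args,
    train_dataset=train_dataset,      # assumed: tokenized training split (max_length=128)
    eval_dataset=eval_dataset,        # assumed: tokenized validation split
    compute_metrics=compute_metrics,  # metric function sketched under Performance Metrics
)
trainer.train()
```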

### Data Processing

The original SPO (Subject-Predicate-Object) format was converted to relation classification (a sketch of the conversion follows this list):
- Each SPO triple becomes a separate training example
- Entities are marked with special tokens in the text
- Relations are encoded as numerical labels for classification
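
A hedged sketch of that conversion; the field names `text`, `spo_list`, `subject`, `predicate`, and `object` are assumptions about the dataset schema, not taken from the original notebook:

```python
def spo_to_examples(record, label2id):
    """Turn one annotated sentence with SPO triples into classification examples."""
    examples = []
    for spo in record["spo_list"]:
        # Mark the subject and object spans with the special tokens the model expects
        marked = record["text"].replace(spo["subject"], f"[SUBJ] {spo['subject']} [/SUBJ]", 1)
        marked = marked.replace(spo["object"], f"[OBJ] {spo['object']} [/OBJ]", 1)
        # Encode the predicate as a numerical label
        examples.append({"text": marked, "label": label2id[spo["predicate"]]})
    return examples
```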

### Quantization

Post-training quantization was applied using PyTorch's Float16 precision to reduce the model size and improve inference efficiency.
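
A minimal sketch of this step, assuming the fine-tuned model is already loaded as `model` and the save path is reused from above:

```python
# Cast all weights to float16 and save the smaller checkpoint
model = model.half()
model.save_pretrained("final_relation_extraction_model")
tokenizer.save_pretrained("final_relation_extraction_model")
```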

---

## Repository Structure

```
.
├── final_relation_extraction_model/
│   ├── config.json
│   ├── pytorch_model.bin          # Fine-tuned model
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── vocab.txt
│   └── added_tokens.json
├── relationship-extraction.ipynb  # Training notebook
└── README.md                      # Model documentation
```

---

## Entity Marking Format

The model expects input text with entities marked using special tokens:
- Subject entities: `[SUBJ] entity_name [/SUBJ]`
- Object entities: `[OBJ] entity_name [/OBJ]`

Example:
```
Input: "笔名：[SUBJ] 木斧 [/SUBJ]原名：杨莆曾民族： [OBJ] 回族 [/OBJ]"
Output: "民族" (ethnicity relation)
```
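
The `added_tokens.json` file in this repository suggests these markers were registered with the tokenizer. A typical way to do that with Transformers is sketched below; whether the original notebook used `add_special_tokens` or plain `add_tokens` is an assumption:

```python
# Register the entity markers so they are never split into word pieces,
# then resize the embedding matrix for the new vocabulary entries.
markers = {"additional_special_tokens": ["[SUBJ]", "[/SUBJ]", "[OBJ]", "[/OBJ]"]}
tokenizer.add_special_tokens(markers)
model.resize_token_embeddings(len(tokenizer))
```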

---

## Supported Relations

The model can classify various biographical and factual relations in Chinese text, including:
- 出生地 (Birthplace)
- 出生日期 (Birth Date)
- 民族 (Ethnicity)
- 职业 (Occupation)
- And many more based on the training dataset
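
If the full label mapping was stored in the model config during fine-tuning (an assumption; it is not documented above), it can be inspected directly:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("final_relation_extraction_model")
print(config.id2label)  # full id-to-relation mapping, if present
```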

---

## Limitations

- The model is specifically trained for Chinese text and may not work well with other languages
- Performance depends on proper entity marking in the input text
- The model may not generalize well to domains outside the fine-tuning dataset
- Quantization may result in minor accuracy degradation compared to full-precision models

---

## Training Environment

- **Platform:** Kaggle Notebooks with GPU acceleration
- **GPU:** NVIDIA Tesla T4
- **Training Time:** Approximately 1 hour 5 minutes
- **Framework:** Hugging Face Transformers with PyTorch backend

---

## Contributing

Contributions are welcome! Feel free to open an issue or PR for improvements, fixes, or feature extensions.

---