# BERT-Base Quantized Model for Relation Extraction
This repository hosts a quantized version of the BERT-Base Chinese model, fine-tuned for relation extraction tasks. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.
---
## Model Details
- **Model Name:** BERT-Base Chinese
- **Model Architecture:** BERT Base
- **Task:** Relation Extraction/Classification
- **Dataset:** Chinese Entity-Relation Dataset
- **Quantization:** Float16
- **Fine-tuning Framework:** Hugging Face Transformers
---
## Usage
### Installation
```bash
pip install transformers torch evaluate
```
### Loading the Quantized Model
```python
from transformers import BertTokenizerFast, BertForSequenceClassification
import torch
# Load the fine-tuned model and tokenizer
model_path = "final_relation_extraction_model"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()
# Example input with entity markers
text = "η¬εοΌ[SUBJ] ζ¨ζ§ [/SUBJ] εΊηε°οΌ[OBJ] ζι½ [/OBJ]"
# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

predicted_class = torch.argmax(logits, dim=1).item()
# Map prediction to relation label
label_mapping = {0: "出生地", 1: "出生日期", 2: "民族", 3: "职业"}  # Customize based on your labels
predicted_relation = label_mapping[predicted_class]
print(f"Predicted Relation: {predicted_relation}")
```
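For quick experiments, the same checkpoint can also be called through the `text-classification` pipeline. This is a convenience sketch, not part of the repository; whether the returned labels are human-readable depends on the `id2label` mapping stored in `config.json` (otherwise they appear as `LABEL_0`, `LABEL_1`, ...).
```python
from transformers import pipeline

# Same directory as in the example above
model_path = "final_relation_extraction_model"
classifier = pipeline("text-classification", model=model_path, tokenizer=model_path)
print(classifier("笔名：[SUBJ] 木斧 [/SUBJ] 出生地：[OBJ] 成都 [/OBJ]"))
# e.g. [{'label': 'LABEL_0', 'score': 0.99}]  (illustrative output)
```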
---
## Performance Metrics
- **Accuracy:** 0.970222
- **F1 Score:** 0.964973
- **Training Loss:** 0.130104
- **Validation Loss:** 0.066986
---
## Fine-Tuning Details
### Dataset
The model was fine-tuned on a Chinese entity-relation dataset with:
- Entity pairs marked with special tokens `[SUBJ]` and `[OBJ]`
- Text preprocessed to include entity boundaries
- Multiple relation types including biographical information
### Training Configuration
- **Epochs:** 3
- **Batch Size:** 16
- **Learning Rate:** 2e-5
- **Max Length:** 128 tokens
- **Evaluation Strategy:** epoch
- **Weight Decay:** 0.01
- **Optimizer:** AdamW
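As a reference, the hyperparameters above map onto the Hugging Face `Trainer` API roughly as sketched below. This is not the exact notebook code: the dataset objects are placeholders, `num_labels=4` is illustrative, and the special-token registration is an assumption based on the `added_tokens.json` shipped with the model.
```python
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

# Base model and tokenizer; num_labels must match the relation inventory.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=4)

# Register the entity markers so they are kept as single tokens (assumed setup).
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[SUBJ]", "[/SUBJ]", "[OBJ]", "[/OBJ]"]}
)
model.resize_token_embeddings(len(tokenizer))

training_args = TrainingArguments(
    output_dir="final_relation_extraction_model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",  # named evaluation_strategy on older transformers releases
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # tokenized, entity-marked examples (placeholder)
    eval_dataset=eval_dataset,    # placeholder
)
trainer.train()
```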
### Data Processing
The original SPO (Subject-Predicate-Object) format was converted to relation classification:
- Each SPO triple becomes a separate training example
- Entities are marked with special tokens in the text
- Relations are encoded as numerical labels for classification
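A rough illustration of that conversion is shown below. The record layout (`text`, `spo_list`, field names) and the example sentence are assumptions made for illustration, not the dataset's documented schema.
```python
# Map each relation (predicate) to an integer class id.
relation2id = {"出生地": 0, "出生日期": 1, "民族": 2, "职业": 3}

# One hypothetical SPO record.
record = {
    "text": "张三，1970年出生于成都，职业为教师。",
    "spo_list": [
        {"subject": "张三", "predicate": "出生地", "object": "成都"},
        {"subject": "张三", "predicate": "职业", "object": "教师"},
    ],
}

# Each triple becomes one classification example with marked entities.
examples = []
for spo in record["spo_list"]:
    marked = record["text"].replace(spo["subject"], f"[SUBJ] {spo['subject']} [/SUBJ]", 1)
    marked = marked.replace(spo["object"], f"[OBJ] {spo['object']} [/OBJ]", 1)
    examples.append({"text": marked, "label": relation2id[spo["predicate"]]})

print(len(examples))  # 2 classification examples produced from a single record
```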
### Quantization
Post-training quantization was applied by casting the model weights to float16 (half precision) with PyTorch, reducing the model size and improving inference efficiency.
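A minimal sketch of that conversion, assuming the directory name used in this repository (the `_fp16` output path is purely illustrative):
```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("final_relation_extraction_model")
model = model.half()  # cast all weights to torch.float16
model.save_pretrained("final_relation_extraction_model_fp16")  # illustrative output path

# At inference time the checkpoint can also be loaded directly in half precision:
fp16_model = BertForSequenceClassification.from_pretrained(
    "final_relation_extraction_model_fp16", torch_dtype=torch.float16
)
```
Note that some CPU kernels do not support float16; casting back with `model.float()` is a safe fallback for CPU-only inference.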
---
## Repository Structure
```
.
├── final_relation_extraction_model/
│   ├── config.json
│   ├── pytorch_model.bin              # Fine-tuned Model
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── vocab.txt
│   └── added_tokens.json
├── relationship-extraction.ipynb      # Training notebook
└── README.md                          # Model documentation
```
---
## Entity Marking Format
The model expects input text with entities marked using special tokens:
- Subject entities: `[SUBJ] entity_name [/SUBJ]`
- Object entities: `[OBJ] entity_name [/OBJ]`
Example:
```
Input: "η¬εοΌ[SUBJ] ζ¨ζ§ [/SUBJ]εεοΌζ¨θζΎζ°ζοΌ [OBJ] εζ [/OBJ]"
Output: "ζ°ζ" (ethnicity relation)
```
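Assuming the markers were registered as added tokens (the repository includes `added_tokens.json`), they should tokenize as single units rather than being split into word pieces. A quick check:
```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("final_relation_extraction_model")
tokens = tokenizer.tokenize("笔名：[SUBJ] 木斧 [/SUBJ]")
print(tokens)  # expect '[SUBJ]' and '[/SUBJ]' to appear as single tokens
```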
---
## Supported Relations
The model can classify various biographical and factual relations in Chinese text, including:
- 出生地 (Birthplace)
- 出生日期 (Birth Date)
- 民族 (Ethnicity)
- 职业 (Occupation)
- Additional relation types present in the fine-tuning dataset
---
## Limitations
- The model is specifically trained for Chinese text and may not work well with other languages
- Performance depends on proper entity marking in the input text
- The model may not generalize well to domains outside the fine-tuning dataset
- Quantization may result in minor accuracy degradation compared to full-precision models
---
## Training Environment
- **Platform:** Kaggle Notebooks with GPU acceleration
- **GPU:** NVIDIA Tesla T4
- **Training Time:** Approximately 1 hour 5 minutes
- **Framework:** Hugging Face Transformers with PyTorch backend
---
## Contributing
Contributions are welcome! Feel free to open an issue or PR for improvements, fixes, or feature extensions.
---