BERT-Base Quantized Model for Relation Extraction
This repository hosts a quantized version of the BERT-Base Chinese model, fine-tuned for relation extraction tasks. The model has been optimized for efficient deployment while maintaining high accuracy, making it suitable for resource-constrained environments.
Model Details
- Model Name: BERT-Base Chinese
- Model Architecture: BERT Base
- Task: Relation Extraction/Classification
- Dataset: Chinese Entity-Relation Dataset
- Quantization: Float16
- Fine-tuning Framework: Hugging Face Transformers
Usage
Installation
pip install transformers torch evaluate
Loading the Quantized Model
from transformers import BertTokenizerFast, BertForSequenceClassification
import torch
# Load the fine-tuned model and tokenizer
model_path = "final_relation_extraction_model"
tokenizer = BertTokenizerFast.from_pretrained(model_path)
model = BertForSequenceClassification.from_pretrained(model_path)
model.eval()
# Example input with entity markers
text = "η¬εοΌ[SUBJ] ζ¨ζ§ [/SUBJ] εΊηε°οΌ[OBJ] ζι½ [/OBJ]"
# Tokenize input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
# Inference
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class = torch.argmax(logits, dim=1).item()
# Map prediction to relation label
label_mapping = {0: "出生地", 1: "出生日期", 2: "民族", 3: "职业"}  # Customize based on your labels
predicted_relation = label_mapping[predicted_class]
print(f"Predicted Relation: {predicted_relation}")
Performance Metrics
- Accuracy: 0.970222
- F1 Score: 0.964973
- Training Loss: 0.130104
- Validation Loss: 0.066986
Fine-Tuning Details
Dataset
The model was fine-tuned on a Chinese entity-relation dataset with:
- Entity pairs marked with the special tokens [SUBJ] ... [/SUBJ] and [OBJ] ... [/OBJ] (see the sketch after this list)
- Text preprocessed to include entity boundaries
- Multiple relation types including biographical information
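Since [SUBJ], [/SUBJ], [OBJ], and [/OBJ] are not in the original BERT vocabulary, they have to be registered with the tokenizer and the embedding matrix resized before fine-tuning; the repository's added_tokens.json suggests this was done. A minimal sketch assuming the standard Transformers APIs were used; num_labels is illustrative:
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=4)  # set to the real number of relation types

# Register the entity-marker tokens and grow the embedding table to match the new vocabulary size
tokenizer.add_special_tokens({"additional_special_tokens": ["[SUBJ]", "[/SUBJ]", "[OBJ]", "[/OBJ]"]})
model.resize_token_embeddings(len(tokenizer))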
Training Configuration
- Epochs: 3
- Batch Size: 16
- Learning Rate: 2e-5
- Max Length: 128 tokens
- Evaluation Strategy: epoch
- Weight Decay: 0.01
- Optimizer: AdamW
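These settings map onto Hugging Face TrainingArguments roughly as follows (a sketch; the authoritative setup is in relationship-extraction.ipynb, and output_dir here is illustrative):
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="final_relation_extraction_model",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy="epoch",  # named evaluation_strategy in older transformers releases
)
# AdamW is the Trainer's default optimizer, matching the configuration above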
Data Processing
The original SPO (Subject-Predicate-Object) format was converted to relation classification:
- Each SPO triple becomes a separate training example
- Entities are marked with special tokens in the text
- Relations are encoded as numerical labels for classification
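For illustration, a minimal version of that conversion; the field names ("text", "spo_list", "subject", "predicate", "object") are assumptions about the dataset schema rather than the exact code used in the notebook:
def spo_to_examples(record, label2id):
    # Turn one annotated sentence with its SPO triples into marked classification examples
    examples = []
    for spo in record["spo_list"]:
        text = record["text"]
        # Wrap the first occurrence of each entity mention with the marker tokens
        text = text.replace(spo["subject"], f"[SUBJ] {spo['subject']} [/SUBJ]", 1)
        text = text.replace(spo["object"], f"[OBJ] {spo['object']} [/OBJ]", 1)
        examples.append({"text": text, "label": label2id[spo["predicate"]]})
    return examples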
Quantization
Post-training quantization was applied by converting the model weights to Float16 (half precision) in PyTorch, reducing the model size and improving inference efficiency.
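A minimal sketch of that step, assuming the weights were simply cast with half() and re-saved (the exact procedure is in the training notebook; the output path is illustrative):
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("final_relation_extraction_model")
model = model.half()  # cast all weights to Float16
model.save_pretrained("final_relation_extraction_model_fp16")
# Float16 inference is typically run on a GPU; call model.float() to restore full precision on CPU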
Repository Structure
.
├── final_relation_extraction_model/
│   ├── config.json
│   ├── pytorch_model.bin            # Fine-tuned model
│   ├── tokenizer_config.json
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   ├── vocab.txt
│   └── added_tokens.json
├── relationship-extraction.ipynb    # Training notebook
└── README.md                        # Model documentation
Entity Marking Format
The model expects input text with entities marked using special tokens:
- Subject entities: [SUBJ] entity_name [/SUBJ]
- Object entities: [OBJ] entity_name [/OBJ]
Example:
Input: "η¬εοΌ[SUBJ] ζ¨ζ§ [/SUBJ]εεοΌζ¨θζΎζ°ζοΌ [OBJ] εζ [/OBJ]"
Output: "ζ°ζ" (ethnicity relation)
Supported Relations
The model can classify various biographical and factual relations in Chinese text, including:
- 出生地 (Birthplace)
- 出生日期 (Birth Date)
- 民族 (Ethnicity)
- 职业 (Occupation)
- And many more based on the training dataset
Limitations
- The model is specifically trained for Chinese text and may not work well with other languages
- Performance depends on proper entity marking in the input text
- The model may not generalize well to domains outside the fine-tuning dataset
- Quantization may result in minor accuracy degradation compared to full-precision models
Training Environment
- Platform: Kaggle Notebooks with GPU acceleration
- GPU: NVIDIA Tesla T4
- Training Time: Approximately 1 hour 5 minutes
- Framework: Hugging Face Transformers with PyTorch backend
Contributing
Contributions are welcome! Feel free to open an issue or PR for improvements, fixes, or feature extensions.