---
title: Vietnamese Legal Doc Retrieval
emoji: π
colorFrom: indigo
colorTo: pink
sdk: docker
pinned: false
short_description: Fine-tuned Retrieval System for Vietnamese Legal Documents
models:
- YuITC/bert-base-multilingual-cased-finetuned-VNLegalDocs
datasets:
- YuITC/Vietnamese-Legal-Doc-Retrieval-Data
---

# Vietnamese Legal Document Retrieval System

[Demo on Hugging Face Spaces](https://huggingface.co/spaces/YuITC/Vietnamese-Legal-Doc-Retrieval)
[Model](https://huggingface.co/YuITC/bert-base-multilingual-cased-finetuned-VNLegalDocs)
[Dataset](https://huggingface.co/datasets/YuITC/Vietnamese-Legal-Doc-Retrieval-Data)

A retrieval system for Vietnamese legal documents, built on a fine-tuned SBERT (Sentence-BERT) model.

## Overview
This project retrieves relevant Vietnamese legal documents for user queries. It uses a fine-tuned multilingual BERT model to encode legal queries and documents into a shared semantic vector space, so retrieval is based on meaning rather than on keyword matching alone.

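To make retrieval by meaning concrete, here is a minimal sketch of ranking documents by cosine similarity to a query. The 3-dimensional vectors and the document IDs (`doc_labor`, `doc_marriage`, `doc_penal`) are made up for illustration; in the actual system the vectors would come from the fine-tuned model, and the search is handled by FAISS rather than a Python loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return (doc_id, score) pairs for the top_k most similar documents."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy vectors standing in for real model embeddings.
doc_vecs = {
    "doc_labor": [0.9, 0.1, 0.0],
    "doc_marriage": [0.0, 0.2, 0.9],
    "doc_penal": [0.1, 0.9, 0.2],
}
query_vec = [0.8, 0.2, 0.1]  # imagined embedding of a labor-rights question

for doc_id, score in rank_documents(query_vec, doc_vecs, top_k=2):
    print(f"{doc_id}: {score:.3f}")
```

Documents whose vectors point in nearly the same direction as the query rank first, regardless of whether they share any exact words with it.
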
## Key features
- Step-by-step notebooks that walk through each stage of the pipeline.
- Fine-tuned SBERT model specialized for Vietnamese legal document retrieval.
- FAISS indexing for efficient vector search.
- Evaluation with the MTEB benchmark.
- Interactive web interface for quick legal document search.
- Strong retrieval quality: NDCG@10 of 60.4% and Recall@100 of 94.8% on the evaluation set.

## Installation & Usage
```bash
# Install dependencies
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda install faiss-gpu=1.9.0 -c pytorch -c nvidia
pip install -r requirements.txt

# Run the application
python main.py
```

The application starts a local web server with the Gradio interface, where you can enter legal queries and retrieve relevant documents.

## Project Structure

```
Vietnamese-Legal-Doc-Retrieval/
├── assets/                        # Visual assets for documentation
│   └── gradio_demo.png            # Screenshot of the Gradio demo interface
├── cache/                         # Cached model files
│   └── VN-legalDocs-SBERT/        # Cached BERT model files
├── data/                          # Dataset files
│   ├── original/                  # Original downloaded dataset
│   │   ├── corpus.csv             # Raw corpus documents
│   │   ├── train_split.csv        # Training data
│   │   ├── val_split.csv          # Validation data
│   │   └── ...
│   ├── processed/                 # Processed dataset files
│   │   ├── corpus_data.parquet    # Processed corpus for embedding
│   │   ├── train_data.parquet     # Processed training data
│   │   └── test_data.parquet      # Processed test data
│   └── retrieval/                 # Files for retrieval system
│       └── legal_faiss.index      # FAISS index for fast vector search
├── models/                        # Trained model files
│   └── VN-legalDocs-SBERT/        # Fine-tuned BERT model for legal documents
│       ├── model.safetensors      # Model weights
│       ├── config.json            # Model configuration
│       └── checkpoint-*/          # Training checkpoints
├── results/                       # Evaluation results
├── Dockerfile                     # Docker configuration for deployment
├── main.py                        # Main application entry point
├── requirements.txt               # Python dependencies
├── settings.py                    # Configuration settings
└── step_*_*.ipynb                 # Jupyter notebooks for each step of the process
```

## Dataset
The system is trained on a Vietnamese legal document corpus that contains:
- Legal texts from various domains
- Query-document pairs for training and evaluation

The data is processed and structured for semantic search training.

The dataset is available on [Hugging Face](https://huggingface.co/datasets/YuITC/Vietnamese-Legal-Doc-Retrieval-Data). It is a modified version of the base dataset credited in the Acknowledgments.

## Model Training Process
The project follows a systematic approach to build the retrieval system:

1. **Data Preparation** (`step_01_Prepare_Data.ipynb`):
   - Processes raw legal documents
   - Creates query-document pairs for training
   - Formats data for the embedding model

2. **SBERT Fine-tuning** (`step_02_Finetune_SBERT.ipynb`):
   - Fine-tunes a multilingual BERT model with legal document pairs
   - Uses `CachedMultipleNegativesRankingLoss` for training
   - Optimizes for semantic similarity in legal context

3. **Evaluation** (`step_03_Eval_with_MTEB.ipynb`):
   - Evaluates model performance using retrieval metrics
   - Compares with baseline models

4. **Retrieval System Setup** (`step_04_Retrieval.ipynb`):
   - Creates FAISS index from document embeddings
   - Implements efficient search functionality
   - Prepares for deployment

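As a rough illustration of what a multiple-negatives ranking objective optimizes, the sketch below computes it in plain Python for a tiny batch: each query should score its own document higher than every other document in the batch, and the loss is the cross-entropy of that choice. This is a simplification under stated assumptions, not the notebook's code: the function names, the toy 2-dimensional vectors, and the `scale=20.0` factor are illustrative, and the cached variant in Sentence Transformers additionally processes the batch in chunks to save memory.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def multiple_negatives_ranking_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where query i must pick document i from the batch."""
    losses = []
    for i, query in enumerate(query_vecs):
        # Every document in the batch is a candidate; all but doc i are negatives.
        logits = [scale * cosine(query, doc) for doc in doc_vecs]
        # Numerically stable log-sum-exp over the candidates.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Hypothetical embeddings: each query lies close to its own document.
documents = [[1.0, 0.0], [0.0, 1.0]]
aligned_queries = [[1.0, 0.1], [0.1, 1.0]]
# Reversing the queries pairs each one with the wrong document.
swapped_queries = aligned_queries[::-1]

print("aligned:", multiple_negatives_ranking_loss(aligned_queries, documents))
print("swapped:", multiple_negatives_ranking_loss(swapped_queries, documents))
```

The loss is near zero when every query is closest to its paired document and grows when another document in the batch scores higher, which is why larger batches provide more negatives per query.
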
## Usage Examples

The system accepts natural language queries in Vietnamese related to legal topics. Example queries:

- "Tội xúc phạm danh dự?" (Crimes against honor?)
- "Quyền lợi của người lao động?" (Rights of workers?)
- "Thủ tục đăng ký kết hôn?" (Marriage registration procedures?)

## Performance

The fine-tuned model was evaluated using the [MTEB benchmark](https://github.com/embeddings-benchmark/mteb) on the BKAILegalDocRetrieval dataset. Key results:

| Metric        | @k  | Pre-trained model score (%) | Fine-tuned model score (%) |
|---------------|-----|-----------------------------|----------------------------|
| **NDCG**      | 1   | 0.007                       | 42.425                     |
|               | 5   | 0.011                       | 57.387                     |
|               | 10  | 0.023                       | 60.389                     |
|               | 20  | 0.049                       | 62.160                     |
|               | 100 | 0.147                       | 63.894                     |
| **MAP**       | 1   | 0.007                       | 40.328                     |
|               | 5   | 0.009                       | 52.297                     |
|               | 10  | 0.014                       | 53.608                     |
|               | 20  | 0.021                       | 54.136                     |
|               | 100 | 0.033                       | 54.418                     |
| **Recall**    | 1   | 0.007                       | 40.328                     |
|               | 5   | 0.017                       | 70.466                     |
|               | 10  | 0.054                       | 79.407                     |
|               | 20  | 0.157                       | 86.112                     |
|               | 100 | 0.713                       | 94.805                     |
| **Precision** | 1   | 0.007                       | 42.425                     |
|               | 5   | 0.003                       | 15.119                     |
|               | 10  | 0.005                       | 8.587                      |
|               | 20  | 0.008                       | 4.687                      |
|               | 100 | 0.007                       | 1.045                      |
| **MRR**       | 1   | 0.007                       | 42.418                     |
|               | 5   | 0.010                       | 54.337                     |
|               | 10  | 0.014                       | 55.510                     |
|               | 20  | 0.021                       | 55.956                     |
|               | 100 | 0.033                       | 56.172                     |

- **NDCG@k (Normalized Discounted Cumulative Gain)**: Measures ranking quality by evaluating the relevance of results with logarithmic position-based discounting.
- **MAP@k (Mean Average Precision)**: For each query, averages the precision values at the ranks of the relevant documents retrieved within the top k, then takes the mean across all queries.
- **Recall@k**: The proportion of all relevant documents that are retrieved in the top k results.
- **Precision@k**: The proportion of the top k retrieved documents that are relevant.
- **MRR@k (Mean Reciprocal Rank)**: The average, across all queries, of the reciprocal rank of the first relevant document within the top k.

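For a single query with binary relevance labels, the five metrics can be sketched as follows. This is a simplified illustration, not MTEB's implementation: the function name and the document IDs are hypothetical, and average precision is normalized by the total number of relevant documents, which is one common convention. The table values would correspond to these per-query numbers averaged over all queries and multiplied by 100.

```python
import math


def retrieval_metrics(ranked_ids, relevant_ids, k):
    """Binary-relevance metrics for one query, truncated at rank k.

    Assumes k >= 1 and a non-empty set of relevant document IDs.
    """
    gains = [1 if doc_id in relevant_ids else 0 for doc_id in ranked_ids[:k]]
    hits = sum(gains)

    precision = hits / k
    recall = hits / len(relevant_ids)

    # Reciprocal rank of the first relevant document (0 if none in the top k).
    rr = next((1 / rank for rank, g in enumerate(gains, 1) if g), 0.0)

    # Average precision: precision at each relevant hit, over all relevant docs.
    running_hits, precision_sum = 0, 0.0
    for rank, g in enumerate(gains, 1):
        if g:
            running_hits += 1
            precision_sum += running_hits / rank
    ap = precision_sum / len(relevant_ids)

    # NDCG: discounted gain divided by the gain of an ideal ordering.
    dcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, 1))
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    ndcg = dcg / idcg

    return {"ndcg": ndcg, "ap": ap, "recall": recall,
            "precision": precision, "rr": rr}


# Hypothetical ranking: the two relevant documents appear at ranks 2 and 4.
scores = retrieval_metrics(["d3", "d1", "d7", "d2"], {"d1", "d2"}, k=3)
for name, value in scores.items():
    print(f"{name}@3 = {value:.3f}")
```

Under these definitions NDCG@1 equals Precision@1 and AP@1 equals Recall@1, which is consistent with the k = 1 rows of the table above.
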
The fine-tuned model far outperforms the pre-trained model: the main evaluation score (NDCG@10) rises from 0.023% to 60.4%, and Recall@100 reaches 94.8%, indicating strong performance on Vietnamese legal document retrieval.

## Docker Deployment

The project includes a Docker configuration for easy deployment. The Docker image is built on `continuumio/miniconda3` and includes GPU support via PyTorch CUDA and FAISS-GPU.

```bash
# Build the Docker image
docker build -t vietnamese-legal-retrieval .

# Run the container
# To expose host GPUs to the container, add `--gpus all`
# (requires the NVIDIA Container Toolkit on the host).
docker run -p 7860:7860 vietnamese-legal-retrieval
```

The container:
- Uses Python 3.10 with CUDA 12.1 support
- Installs required dependencies from `requirements.txt`
- Exposes port 7860 for the Gradio web interface
- Sets proper environment variables for security and performance
- Runs as a non-root user for enhanced security

You can access the web interface by navigating to `http://localhost:7860` after starting the container.

## License
This project is licensed under the MIT License. Feel free to modify and distribute it as needed.

## Acknowledgments
Thanks to:
- [BKAI Legal Retrieval Dataset](https://huggingface.co/datasets/tmnam20/BKAI-Legal-Retrieval) for the original data
- [Sentence Transformers](https://www.sbert.net/) library for the embedding model architecture
- [Hugging Face](https://huggingface.co/) for hosting the model and dataset

If you find this project useful, consider starring the repository or contributing to further improvements!

## Contact
For any questions or collaboration opportunities, feel free to reach out:

Email: [email protected]