---
title: RAG Benchmark Leaderboard
emoji: π
colorFrom: gray
colorTo: purple
sdk: gradio
sdk_version: 5.4.0
app_file: app.py
pinned: false
---

# RAG Benchmark Leaderboard

An interactive leaderboard for comparing and visualizing the performance of RAG (Retrieval-Augmented Generation) systems.

## Features

- **Version Comparison**: Compare model performance across different versions of the benchmark dataset
- **Interactive Radar Charts**: Visualize generation and retrieval metrics
- **Customizable Views**: Filter and sort models based on different criteria
- **Easy Submission**: Simple API for submitting your model results

## Installation

```bash
pip install -r requirements.txt
```

## Running the Leaderboard

```bash
cd leaderboard
python app.py
```

This will start a Gradio server, and you can access the leaderboard in your browser at http://localhost:7860.
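
For orientation, the sketch below shows roughly what a Gradio leaderboard app of this kind looks like: a table of submissions rendered with `gr.Dataframe`. It is a minimal illustration with placeholder rows, not the actual `app.py`; the real app presumably loads submissions from `results.json` (see the Data Format section below).

```python
import gradio as gr
import pandas as pd

# Placeholder rows for illustration only; real values come from results.json
df = pd.DataFrame(
    [
        {"model_name": "Model A", "hit_rate": 0.82, "mrr": 0.65, "rougeL": 0.68},
        {"model_name": "Model B", "hit_rate": 0.75, "mrr": 0.58, "rougeL": 0.61},
    ]
)

with gr.Blocks(title="RAG Benchmark Leaderboard") as demo:
    gr.Markdown("# RAG Benchmark Leaderboard")
    gr.Dataframe(value=df, interactive=False)

if __name__ == "__main__":
    demo.launch()  # Gradio serves on http://localhost:7860 by default
```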

## Submitting Results

To submit your results to the leaderboard, use the provided API:

```python
from rag_benchmark import RAGBenchmark

# Initialize the benchmark
benchmark = RAGBenchmark(version="2.0")  # Use the latest version

# Run evaluation
results = benchmark.evaluate(
    model_name="Your Model Name",
    embedding_model="your-embedding-model",
    retriever_type="dense",  # Options: dense, sparse, hybrid
    retrieval_config={"top_k": 3}
)

# Submit results
benchmark.submit_results(results)
```

## Data Format

The `results.json` file has the following structure:

```json
{
  "items": {
    "1.0": {                      // Dataset version
      "model1": {                 // Submission ID
        "model_name": "Model Name",
        "timestamp": "2025-03-20T12:00:00",
        "config": {
          "embedding_model": "embedding-model-name",
          "retriever_type": "dense",
          "retrieval_config": {
            "top_k": 3
          }
        },
        "metrics": {
          "retrieval": {
            "hit_rate": 0.82,
            "mrr": 0.65,
            "precision": 0.78
          },
          "generation": {
            "rouge1": 0.72,
            "rouge2": 0.55,
            "rougeL": 0.68
          }
        }
      }
    }
  },
  "last_version": "2.0",
  "n_questions": "1000"
}
```
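
To work with this file programmatically, it is convenient to flatten it into one row per submission. The helper below is a minimal sketch against the structure shown above; the function name and column layout are illustrative and are not part of the benchmark package.

```python
import json

import pandas as pd


def load_results(path="results.json", version=None):
    """Flatten results.json into one row per submission for a given dataset version."""
    with open(path) as f:
        data = json.load(f)
    version = version or data["last_version"]
    rows = []
    for submission_id, entry in data["items"].get(version, {}).items():
        rows.append(
            {
                "submission_id": submission_id,
                "model_name": entry["model_name"],
                "timestamp": entry["timestamp"],
                **entry["metrics"]["retrieval"],
                **entry["metrics"]["generation"],
            }
        )
    return pd.DataFrame(rows)


# Example: show submissions for dataset version 1.0, sorted by hit rate
print(load_results(version="1.0").sort_values("hit_rate", ascending=False))
```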

## License

MIT

## Metrics Tracked

### Retrieval Metrics

- Hit Rate: Proportion of queries for which at least one relevant document appears in the retrieved top-k
- MRR (Mean Reciprocal Rank): Average reciprocal rank of the first relevant document
- Precision: Proportion of retrieved documents that are relevant

### Generation Metrics

- ROUGE-1: Unigram overlap with the reference answer
- ROUGE-2: Bigram overlap with the reference answer
- ROUGE-L: Longest common subsequence with the reference answer
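
As a reference for how the retrieval metrics are typically computed, here is a small, self-contained sketch. It assumes that for each query you have the ranked list of retrieved document IDs and the set of gold relevant IDs; it is not taken from the benchmark code. Generation metrics (ROUGE-1/2/L) are usually computed with an off-the-shelf package such as `rouge-score`.

```python
def hit_rate(retrieved_ids, relevant_ids):
    """1.0 if any relevant document appears in the retrieved list, else 0.0."""
    return 1.0 if set(retrieved_ids) & set(relevant_ids) else 0.0


def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(runs):
    """Average hit rate and MRR over (retrieved_ids, relevant_ids) pairs."""
    hits = [hit_rate(r, g) for r, g in runs]
    rrs = [reciprocal_rank(r, g) for r, g in runs]
    n = len(runs)
    return {"hit_rate": sum(hits) / n, "mrr": sum(rrs) / n}


# Example: two queries with top-3 retrieved documents each
runs = [
    (["d3", "d7", "d1"], {"d1"}),  # first relevant doc at rank 3
    (["d5", "d2", "d9"], {"d8"}),  # no relevant doc retrieved
]
print(evaluate_retrieval(runs))  # {'hit_rate': 0.5, 'mrr': 0.1666...}
```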