---
title: RAG Benchmark Leaderboard
emoji: π
colorFrom: gray
colorTo: purple
sdk: gradio
sdk_version: 5.4.0
app_file: app.py
pinned: false
---

# RAG Benchmark Leaderboard

An interactive leaderboard for comparing and visualizing the performance of RAG (Retrieval-Augmented Generation) systems.

## Features

- **Version Comparison**: Compare model performance across different versions of the benchmark dataset
- **Interactive Radar Charts**: Visualize generation and retrieval metrics (see the sketch after this list)
- **Customizable Views**: Filter and sort models by different criteria
- **Easy Submission**: Simple API for submitting your model results
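
Gradio can render Plotly figures via `gr.Plot`; whether this app actually uses Plotly is not stated here, so the snippet below is only an illustrative sketch of how a radar chart over the benchmark metrics could be built. The metric names are borrowed from the Data Format section further down.

```python
import plotly.graph_objects as go

def radar_chart(metrics: dict, title: str) -> go.Figure:
    """Draw one closed radar trace from a {metric_name: score} dict."""
    names = list(metrics.keys())
    values = list(metrics.values())
    # Close the polygon by repeating the first point.
    fig = go.Figure(
        go.Scatterpolar(
            r=values + values[:1],
            theta=names + names[:1],
            fill="toself",
            name=title,
        )
    )
    fig.update_layout(title=title, polar=dict(radialaxis=dict(range=[0, 1])))
    return fig

# Example: retrieval metrics for one submission
fig = radar_chart({"hit_rate": 0.82, "mrr": 0.65, "precision": 0.78}, "Retrieval metrics")
fig.show()
```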

## Installation

```bash
pip install -r requirements.txt
```

## Running the Leaderboard

```bash
cd leaderboard
python app.py
```

This will start a Gradio server, and you can access the leaderboard in your browser at http://localhost:7860.
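
If port 7860 is already in use, the port can be changed when launching Gradio. The snippet below is a minimal illustrative sketch of how the launch call works, not the Space's actual `app.py`:

```python
import gradio as gr

# Minimal stand-in for the leaderboard UI (illustrative only).
with gr.Blocks(title="RAG Benchmark Leaderboard") as demo:
    gr.Markdown("# RAG Benchmark Leaderboard")

if __name__ == "__main__":
    # server_name="0.0.0.0" makes the app reachable from other machines;
    # change server_port if 7860 is occupied.
    demo.launch(server_name="0.0.0.0", server_port=7861)
```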

## Submitting Results

To submit your results to the leaderboard, use the provided API:

```python
from rag_benchmark import RAGBenchmark

# Initialize the benchmark
benchmark = RAGBenchmark(version="2.0")  # Use the latest version

# Run evaluation
results = benchmark.evaluate(
    model_name="Your Model Name",
    embedding_model="your-embedding-model",
    retriever_type="dense",  # Options: dense, sparse, hybrid
    retrieval_config={"top_k": 3}
)

# Submit results
benchmark.submit_results(results)
```

## Data Format

The `results.json` file has the following structure (the `//` comments are for illustration only and are not part of the actual JSON):

```json
{
  "items": {
    "1.0": {                              // Dataset version
      "model1": {                         // Submission ID
        "model_name": "Model Name",
        "timestamp": "2024-03-20T12:00:00",
        "config": {
          "embedding_model": "embedding-model-name",
          "retriever_type": "dense",
          "retrieval_config": {
            "top_k": 3
          }
        },
        "metrics": {
          "retrieval": {
            "hit_rate": 0.82,
            "mrr": 0.65,
            "precision": 0.78
          },
          "generation": {
            "rouge1": 0.72,
            "rouge2": 0.55,
            "rougeL": 0.68
          }
        }
      }
    }
  },
  "last_version": "2.0",
  "n_questions": "1000"
}
```
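
For reference, here is a sketch of how `results.json` could be read and flattened into one row per submission, for example to feed a table view. The field names follow the structure above; the `load_rows` helper itself is illustrative and not part of the leaderboard's code.

```python
import json

def load_rows(path: str = "results.json") -> list[dict]:
    """Flatten results.json into one dict per (version, submission)."""
    with open(path) as f:
        data = json.load(f)

    rows = []
    for version, submissions in data["items"].items():
        for submission_id, entry in submissions.items():
            rows.append({
                "version": version,
                "submission_id": submission_id,
                "model_name": entry["model_name"],
                "retriever_type": entry["config"]["retriever_type"],
                **entry["metrics"]["retrieval"],
                **entry["metrics"]["generation"],
            })
    return rows

for row in load_rows():
    print(row)
```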

## License

MIT

# RAG Evaluation Leaderboard

This leaderboard tracks different RAG (Retrieval-Augmented Generation) implementations and their performance metrics.

## Metrics Tracked

### Retrieval Metrics
- Hit Rate: Proportion of questions for which at least one relevant document is retrieved
- MRR (Mean Reciprocal Rank): Average reciprocal rank of the first relevant document retrieved

### Generation Metrics
- ROUGE-1: Unigram overlap with the reference answer
- ROUGE-2: Bigram overlap with the reference answer
- ROUGE-L: Longest common subsequence overlap with the reference answer
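
As a rough guide to how these numbers can be computed (the leaderboard's own implementation may differ; the `rouge-score` package is used here only as an example ROUGE implementation):

```python
from rouge_score import rouge_scorer  # pip install rouge-score

def hit_rate_and_mrr(ranked_ids: list[list[str]], relevant_ids: list[set[str]]) -> tuple[float, float]:
    """Hit rate: share of questions with at least one relevant doc retrieved.
    MRR: mean of 1/rank of the first relevant doc (0 if none retrieved)."""
    hits, rr_sum = 0, 0.0
    for ranking, relevant in zip(ranked_ids, relevant_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                hits += 1
                rr_sum += 1.0 / rank
                break
    n = len(ranked_ids)
    return hits / n, rr_sum / n

# ROUGE-1/2/L of a generated answer against a reference answer
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score("the reference answer", "the generated answer")
print({name: round(score.fmeasure, 3) for name, score in scores.items()})

print(hit_rate_and_mrr([["d3", "d1"], ["d7", "d9"]], [{"d1"}, {"d2"}]))  # -> (0.5, 0.25)
```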