ai-forever committed
Commit
aff180f
·
verified ·
1 Parent(s): 7a8aa93

Initialize README

Files changed (1)
  1. README.md +116 -12
README.md CHANGED
@@ -1,12 +1,116 @@
- ---
- title: Rag Leaderboard
- emoji: 👍
- colorFrom: blue
- colorTo: pink
- sdk: gradio
- sdk_version: 5.23.1
- app_file: app.py
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ ---
+ title: RAG Benchmark Leaderboard
+ emoji: 📚
+ colorFrom: gray
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 5.4.0
+ app_file: app.py
+ pinned: false
+ ---
+
+ # RAG Benchmark Leaderboard
+
+ An interactive leaderboard for comparing and visualizing the performance of RAG (Retrieval-Augmented Generation) systems.
+
+ ## Features
+
+ - **Version Comparison**: Compare model performance across different versions of the benchmark dataset
+ - **Interactive Radar Charts**: Visualize generation and retrieval metrics
+ - **Customizable Views**: Filter and sort models based on different criteria
+ - **Easy Submission**: Simple API for submitting your model results
+
+ ## Installation
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ## Running the Leaderboard
+
+ ```bash
+ cd leaderboard
+ python app.py
+ ```
+
+ This will start a Gradio server, and you can access the leaderboard in your browser at http://localhost:7860.
+
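+ If the default port is busy or you need to bind to all interfaces, Gradio honours the `GRADIO_SERVER_NAME` and `GRADIO_SERVER_PORT` environment variables when `launch()` is called without explicit overrides. The following is a minimal sketch under that assumption (it does not reflect anything specific to this repository's `app.py`):
+
+ ```python
+ # Sketch: run the app on a custom host/port via Gradio's standard
+ # GRADIO_SERVER_NAME / GRADIO_SERVER_PORT environment variables.
+ # This assumes app.py does not hard-code server_name/server_port itself.
+ import os
+ import subprocess
+
+ env = dict(os.environ, GRADIO_SERVER_NAME="0.0.0.0", GRADIO_SERVER_PORT="8080")
+ subprocess.run(["python", "app.py"], cwd="leaderboard", env=env, check=True)
+ ```
+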
+ ## Submitting Results
+
+ To submit your results to the leaderboard, use the provided API:
+
+ ```python
+ from rag_benchmark import RAGBenchmark
+
+ # Initialize the benchmark
+ benchmark = RAGBenchmark(version="2.0")  # Use the latest version
+
+ # Run evaluation
+ results = benchmark.evaluate(
+     model_name="Your Model Name",
+     embedding_model="your-embedding-model",
+     retriever_type="dense",  # Options: dense, sparse, hybrid
+     retrieval_config={"top_k": 3}
+ )
+
+ # Submit results
+ benchmark.submit_results(results)
+ ```
+
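+ The same pattern works for the other retriever types listed in the comment above; only the argument values change. For example, a hybrid run could look like this (a sketch that reuses only the parameters already shown, since the rest of the `RAGBenchmark` interface is not documented here):
+
+ ```python
+ # Sketch: re-run the evaluation above with a hybrid retriever and submit it too.
+ results_hybrid = benchmark.evaluate(
+     model_name="Your Model Name",
+     embedding_model="your-embedding-model",
+     retriever_type="hybrid",
+     retrieval_config={"top_k": 3}
+ )
+ benchmark.submit_results(results_hybrid)
+ ```
+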
+ ## Data Format
+
+ The `results.json` file has the following structure:
+
+ ```json
+ {
+   "items": {
+     "1.0": {                     // Dataset version
+       "model1": {                // Submission ID
+         "model_name": "Model Name",
+         "timestamp": "2024-03-20T12:00:00",
+         "config": {
+           "embedding_model": "embedding-model-name",
+           "retriever_type": "dense",
+           "retrieval_config": {
+             "top_k": 3
+           }
+         },
+         "metrics": {
+           "retrieval": {
+             "hit_rate": 0.82,
+             "mrr": 0.65,
+             "precision": 0.78
+           },
+           "generation": {
+             "rouge1": 0.72,
+             "rouge2": 0.55,
+             "rougeL": 0.68
+           }
+         }
+       }
+     }
+   },
+   "last_version": "2.0",
+   "n_questions": "1000"
+ }
+ ```
+
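+ Because the format is plain JSON, submissions can be inspected with the standard library alone (the `//` comments above are annotations for the reader, not valid JSON). The sketch below relies only on the structure shown above; the path to `results.json` is an assumption and may differ in your checkout:
+
+ ```python
+ import json
+
+ # Path is an assumption -- adjust it to wherever results.json lives in your checkout.
+ with open("results.json", encoding="utf-8") as f:
+     data = json.load(f)
+
+ # Print one summary line per submission, per dataset version.
+ for version, submissions in data["items"].items():
+     for submission_id, entry in submissions.items():
+         retrieval = entry["metrics"]["retrieval"]
+         generation = entry["metrics"]["generation"]
+         print(
+             f"[v{version}] {entry['model_name']} ({submission_id}): "
+             f"hit_rate={retrieval['hit_rate']}, mrr={retrieval['mrr']}, "
+             f"rougeL={generation['rougeL']}"
+         )
+ ```
+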
+ ## License
+
+ MIT
+
+ ## Metrics Tracked
+
+ The leaderboard tracks the following metrics for each RAG (Retrieval-Augmented Generation) implementation:
+
+ ### Retrieval Metrics
+ - Hit Rate: Proportion of questions for which at least one relevant document appears in the retrieved set
+ - MRR (Mean Reciprocal Rank): Average reciprocal rank of the first relevant document across questions (see the sketch below)
+
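+ For reference, these two retrieval metrics are commonly computed as in the following self-contained sketch; it is illustrative only and is not the leaderboard's own evaluation code:
+
+ ```python
+ # Illustrative only: standard hit-rate and MRR over ranked retrieval results.
+ # ranked_ids[i] lists retrieved document ids for question i (best first);
+ # relevant_ids[i] is the set of ids judged relevant for the same question.
+
+ def hit_rate(ranked_ids, relevant_ids, k=3):
+     """Fraction of questions with at least one relevant document in the top-k."""
+     hits = sum(
+         any(doc in relevant for doc in ranked[:k])
+         for ranked, relevant in zip(ranked_ids, relevant_ids)
+     )
+     return hits / len(ranked_ids)
+
+ def mean_reciprocal_rank(ranked_ids, relevant_ids):
+     """Average of 1/rank of the first relevant document (0 when none is retrieved)."""
+     total = 0.0
+     for ranked, relevant in zip(ranked_ids, relevant_ids):
+         for rank, doc in enumerate(ranked, start=1):
+             if doc in relevant:
+                 total += 1.0 / rank
+                 break
+     return total / len(ranked_ids)
+
+ # Toy example with two questions.
+ ranked = [["d1", "d7", "d3"], ["d9", "d2", "d4"]]
+ relevant = [{"d3"}, {"d5"}]
+ print(hit_rate(ranked, relevant, k=3))         # 0.5
+ print(mean_reciprocal_rank(ranked, relevant))  # (1/3 + 0) / 2 ≈ 0.167
+ ```
+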
+ ### Generation Metrics
+ - ROUGE-1: Unigram overlap between the generated answer and the reference
+ - ROUGE-2: Bigram overlap between the generated answer and the reference
+ - ROUGE-L: Longest common subsequence between the generated answer and the reference (see the ROUGE sketch below)
+
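+ The ROUGE variants above can be reproduced with the open-source `rouge-score` package; using it here is an assumption (the README does not state which implementation the benchmark relies on), so treat this as an illustrative sketch:
+
+ ```python
+ # Sketch using the `rouge-score` package (pip install rouge-score).
+ from rouge_score import rouge_scorer
+
+ scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
+
+ reference = "The capital of France is Paris."
+ generated = "Paris is the capital city of France."
+ scores = scorer.score(reference, generated)  # arguments: (target, prediction)
+
+ for name, result in scores.items():
+     print(name, round(result.fmeasure, 3))
+ ```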