---
title: RAG Benchmark Leaderboard
emoji: 📚
colorFrom: gray
colorTo: purple
sdk: gradio
sdk_version: 5.4.0
app_file: app.py
pinned: false
---

# RAG Benchmark Leaderboard

An interactive leaderboard for comparing and visualizing the performance of RAG (Retrieval-Augmented Generation) systems.

## Features

- **Version Comparison**: Compare model performance across different versions of the benchmark dataset
- **Interactive Radar Charts**: Visualize generation and retrieval metrics
- **Customizable Views**: Filter and sort models by different criteria
- **Easy Submission**: Simple API for submitting your model results

## Installation

```bash
pip install -r requirements.txt
```

## Running the Leaderboard

```bash
cd leaderboard
python app.py
```

This will start a Gradio server, and you can access the leaderboard in your browser at http://localhost:7860.

## Submitting Results

To submit your results to the leaderboard, use the provided API:

```python
from rag_benchmark import RAGBenchmark

# Initialize the benchmark
benchmark = RAGBenchmark(version="2.0")  # Use the latest version

# Run evaluation
results = benchmark.evaluate(
    model_name="Your Model Name",
    embedding_model="your-embedding-model",
    retriever_type="dense",  # Options: dense, sparse, hybrid
    retrieval_config={"top_k": 3}
)

# Submit results
benchmark.submit_results(results)
```

## Data Format

The `results.json` file has the following structure:

```jsonc
{
  "items": {
    "1.0": {  // Dataset version
      "model1": {  // Submission ID
        "model_name": "Model Name",
        "timestamp": "2025-03-20T12:00:00",
        "config": {
          "embedding_model": "embedding-model-name",
          "retriever_type": "dense",
          "retrieval_config": {
            "top_k": 3
          }
        },
        "metrics": {
          "retrieval": {
            "hit_rate": 0.82,
            "mrr": 0.65,
            "precision": 0.78
          },
          "generation": {
            "rouge1": 0.72,
            "rouge2": 0.55,
            "rougeL": 0.68
          }
        }
      }
    }
  },
  "last_version": "2.0",
  "n_questions": "1000"
}
```
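
For example, here is a minimal sketch of how the file can be read to list the submissions recorded for the latest dataset version (the path is an assumption, and the `//` comments in the example above are annotations rather than file contents):

```python
import json

# Load the leaderboard data (path assumed for illustration)
with open("leaderboard/results.json", encoding="utf-8") as f:
    data = json.load(f)

version = data["last_version"]
print(f"Dataset version {version} ({data['n_questions']} questions)")

# Print each submission with a few headline metrics
for submission_id, entry in data["items"].get(version, {}).items():
    retrieval = entry["metrics"]["retrieval"]
    generation = entry["metrics"]["generation"]
    print(
        f"{submission_id} ({entry['model_name']}): "
        f"hit_rate={retrieval['hit_rate']:.2f}, "
        f"mrr={retrieval['mrr']:.2f}, "
        f"rougeL={generation['rougeL']:.2f}"
    )
```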

## License

MIT

## Metrics Tracked

The leaderboard tracks the following retrieval and generation metrics for each submitted RAG implementation.

### Retrieval Metrics
- Hit Rate: Proportion of questions for which at least one relevant document appears in the retrieved top-k results
- MRR (Mean Reciprocal Rank): Mean of the reciprocal rank of the first relevant retrieved document
- Precision: Fraction of retrieved documents that are relevant (reported in `results.json` alongside hit rate and MRR)
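
As an illustration (not the benchmark's own implementation), hit rate and MRR can be computed from the rank at which the first relevant document was retrieved for each question:

```python
def retrieval_metrics(first_relevant_ranks, k=3):
    """Toy hit rate / MRR from the 1-based rank of the first relevant
    document per question (None if no relevant document was retrieved)."""
    n = len(first_relevant_ranks)
    hits = sum(1 for r in first_relevant_ranks if r is not None and r <= k)
    reciprocal = sum(1.0 / r for r in first_relevant_ranks if r is not None)
    return {"hit_rate": hits / n, "mrr": reciprocal / n}

# Example: four questions; first relevant document at ranks 1, 3, None, 2
print(retrieval_metrics([1, 3, None, 2], k=3))
# {'hit_rate': 0.75, 'mrr': 0.4583333333333333}
```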

### Generation Metrics
- ROUGE-1: Unigram overlap
- ROUGE-2: Bigram overlap
- ROUGE-L: Longest common subsequence
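
A quick way to sanity-check ROUGE scores locally is the `rouge-score` package (shown here purely as an illustration; the benchmark's own scoring pipeline may differ):

```python
from rouge_score import rouge_scorer  # pip install rouge-score

# Score one generated answer against its reference
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="The capital of France is Paris.",      # reference answer
    prediction="Paris is the capital of France.",  # model output
)

for name, score in scores.items():
    print(f"{name}: f1={score.fmeasure:.2f}")
```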