---
title: AutoBench
emoji: 🐠
colorFrom: red
colorTo: yellow
sdk: streamlit
sdk_version: 1.42.2
app_file: app.py
pinned: false
license: mit
short_description: LLM Many-Model-As-Judge Benchmark
---
# AutoBench
This Space runs a benchmark to compare different language models using Hugging Face's Inference API.
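For orientation, a single model query through the Inference API might look like the following sketch using the `huggingface_hub` client; the token placeholder, model ID, prompt, and `max_tokens` value are illustrative assumptions, not the app's actual code:

```python
from huggingface_hub import InferenceClient

# Illustrative only: one chat-completion request through the Inference API.
client = InferenceClient(token="hf_...")  # your Hugging Face API token

response = client.chat_completion(
    messages=[{"role": "user", "content": "Explain gradient descent in two sentences."}],
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model ID
    max_tokens=256,
)
print(response.choices[0].message.content)
```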
## Features
- Benchmark multiple models side by side (models evaluate models)
- Test models across various topics and difficulty levels
- Evaluate question quality and answer quality
- Generate detailed performance reports
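The "models evaluate models" idea can be pictured as a cross-scoring loop: each model answers a question, then every model grades every answer. Below is a minimal sketch built on the `InferenceClient` call shown above; the `ask()` helper, prompt wording, scoring scale, and aggregation are assumptions, not the app's implementation:

```python
from huggingface_hub import InferenceClient

client = InferenceClient(token="hf_...")  # your Hugging Face API token

def ask(model: str, prompt: str) -> str:
    # One chat-completion call, as in the snippet above.
    out = client.chat_completion(
        messages=[{"role": "user", "content": prompt}], model=model, max_tokens=512
    )
    return out.choices[0].message.content

def cross_evaluate(models: list[str], question: str) -> dict[str, float]:
    # Every model answers the question, then every model grades every answer.
    answers = {m: ask(m, question) for m in models}

    scores: dict[str, list[float]] = {m: [] for m in models}
    for judge in models:
        for author, answer in answers.items():
            verdict = ask(
                judge,
                f"Rate the following answer to the question '{question}' "
                f"on a scale of 1 to 10. Reply with only the number.\n\n{answer}",
            )
            try:
                scores[author].append(float(verdict.strip()))
            except ValueError:
                pass  # ignore judges that don't return a clean number

    # Average grade each model received across all judges
    return {m: sum(s) / len(s) for m, s in scores.items() if s}
```

A real run would also assess the quality of the generated questions and handle API rate limits, which this sketch omits.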
## How to Use

1. Enter your Hugging Face API token (needed to access models)
2. Select the models you want to benchmark
3. Choose topics and number of iterations
4. Click "Start Benchmark"
5. View and download results when complete
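In Streamlit terms, the controls behind these steps could look roughly like the sketch below; the widget labels, topic list, default model IDs, and the placeholder result table are assumptions, not the app's actual `app.py`:

```python
import pandas as pd
import streamlit as st

# Hypothetical sketch of the benchmark controls; all labels and defaults are assumptions.
token = st.text_input("Hugging Face API token", type="password")
models = st.multiselect(
    "Models to benchmark",
    ["meta-llama/Llama-3.1-8B-Instruct", "mistralai/Mistral-7B-Instruct-v0.3"],
)
topics = st.multiselect("Topics", ["math", "coding", "science", "general knowledge"])
iterations = st.slider("Iterations", min_value=1, max_value=20, value=5)

if st.button("Start Benchmark") and token and models:
    # Placeholder: the real app runs the benchmark here and collects scores.
    results = pd.DataFrame({"model": models, "avg_score": [0.0] * len(models)})
    st.dataframe(results)
    st.download_button(
        "Download results (CSV)",
        results.to_csv(index=False),
        file_name="autobench_results.csv",
    )
```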
## Models
The benchmark supports any model available through Hugging Face's Inference API, including:
- Meta Llama models
- Google Gemma models
- Mistral models
- And many more!
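Models are referred to by their Hub repo IDs. A few illustrative entries (whether a given model is actually served by the Inference API depends on your account and the current catalogue):

```python
# Example Hub repo IDs; availability through the Inference API may vary.
CANDIDATE_MODELS = [
    "meta-llama/Llama-3.1-8B-Instruct",    # Meta Llama
    "google/gemma-2-9b-it",                # Google Gemma
    "mistralai/Mistral-7B-Instruct-v0.3",  # Mistral
]
```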
## Note

Running a full benchmark can take a while: total runtime scales with the number of models, topics, and iterations you select.