---
title: AutoBench
emoji: ๐
colorFrom: red
colorTo: yellow
sdk: streamlit
sdk_version: 1.42.2
app_file: app.py
pinned: false
license: mit
short_description: LLM Many-Model-As-Judge Benchmark
---
# AutoBench

This Space runs a benchmark that compares different language models using Hugging Face's Inference API.
## Features

- Benchmark multiple models side by side (models evaluate models; see the sketch below)
- Test models across various topics and difficulty levels
- Evaluate question quality and answer quality
- Generate detailed performance reports
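
Conceptually, each iteration follows a many-model-as-judge loop: one model writes a question on a topic, every model answers it, and the models grade each other's output. The sketch below is illustrative only; `run_iteration` and `ask` are hypothetical placeholders, not the actual logic in app.py.

```python
from statistics import mean

def run_iteration(models, topic, difficulty, ask):
    """Illustrative sketch of one benchmark iteration.
    `ask(model, prompt)` is a hypothetical helper that calls the
    Inference API and returns the model's text reply."""
    scores = {m: [] for m in models}
    for questioner in models:
        # One model drafts a question for the chosen topic and difficulty level.
        # (Question quality can be graded by the other models in the same way.)
        question = ask(questioner, f"Write a {difficulty} question about {topic}.")

        # Every model answers the question.
        answers = {m: ask(m, question) for m in models}

        # Every model grades every other model's answer on a 1-10 scale.
        for answerer, answer in answers.items():
            grades = []
            for judge in models:
                if judge == answerer:
                    continue
                reply = ask(judge, "Rate this answer from 1 to 10 (number only).\n"
                                   f"Question: {question}\nAnswer: {answer}")
                grades.append(float(reply))  # real code would parse and validate the reply
            scores[answerer].append(mean(grades))

    # Average each model's grades across all questions in this iteration.
    return {m: mean(vals) for m, vals in scores.items()}
```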
## How to Use

1. Enter your Hugging Face API token (needed to access models; see the example call after this list)
2. Select the models you want to benchmark
3. Choose topics and the number of iterations
4. Click "Start Benchmark"
5. View and download the results when the run completes
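
Each of these steps ultimately drives requests to the Inference API with your token. Here is a minimal sketch of such a call, assuming the `huggingface_hub` client; the token, model ID, and prompt are placeholders, not values used by this Space.

```python
from huggingface_hub import InferenceClient

# Replace with the token you enter in the app.
client = InferenceClient(token="hf_...")

response = client.chat_completion(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model ID
    messages=[{"role": "user", "content": "Write a hard question about quantum computing."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```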
## Models

The benchmark supports any model available through Hugging Face's Inference API, including:

- Meta Llama models
- Google Gemma models
- Mistral models
- And many more!
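
Models are referenced by their Hub repo IDs. The list below is purely illustrative (the variable name is hypothetical, and Inference API availability for specific models changes over time):

```python
# Example Hugging Face Hub model IDs (illustrative only).
MODELS_TO_BENCHMARK = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "google/gemma-2-9b-it",
    "mistralai/Mistral-7B-Instruct-v0.3",
]
```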
## Note

A full benchmark can take a significant amount of time, depending on the number of models and iterations you select.