---
title: AutoBench
emoji: ๐
colorFrom: red
colorTo: yellow
sdk: streamlit
sdk_version: 1.42.2
app_file: app.py
pinned: false
license: mit
short_description: LLM Many-Model-As-Judge Benchmark
---
# AutoBench

This Space runs a benchmark to compare different language models using Hugging Face's Inference API.
## Features

- Benchmark multiple models side by side, with the models themselves acting as judges (see the sketch after this list)
- Test models across various topics and difficulty levels
- Evaluate both question quality and answer quality
- Generate detailed performance reports
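The README does not spell out the judging loop, so the following is only a minimal, hypothetical sketch of the many-model-as-judge idea: each model answers a question, every other model scores that answer, and the average score is reported. The `ask_model` and `score_answer` helpers are placeholders, not functions from this Space.

```python
from statistics import mean

# Hypothetical helpers: in a real run these would call the Inference API.
def ask_model(model_id: str, prompt: str) -> str:
    """Placeholder for querying `model_id` with `prompt`."""
    return f"[{model_id}'s answer to: {prompt}]"

def score_answer(judge_id: str, question: str, answer: str) -> float:
    """Placeholder for asking `judge_id` to rate `answer` on a 1-5 scale."""
    return 3.0

def judge_round(models: list[str], question: str) -> dict[str, float]:
    """Each model answers; all other models judge that answer; return mean scores."""
    results = {}
    for candidate in models:
        answer = ask_model(candidate, question)
        scores = [
            score_answer(judge, question, answer)
            for judge in models
            if judge != candidate
        ]
        results[candidate] = mean(scores)
    return results

print(judge_round(["model-a", "model-b", "model-c"], "Explain quicksort."))
```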
## How to Use

1. Enter your Hugging Face API token, which is needed to access models (a token check is sketched after these steps)
2. Select the models you want to benchmark
3. Choose topics and the number of iterations
4. Click "Start Benchmark"
5. View and download the results when the run completes
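The token is entered directly in the app's UI. If you want to confirm a token is valid before starting a run, the `huggingface_hub` library offers a check along these lines; the environment variable name `HF_TOKEN` is only an example, not something this Space requires.

```python
import os
from huggingface_hub import whoami

# Example only: read a token from the environment rather than hard-coding it.
token = os.environ.get("HF_TOKEN")

try:
    info = whoami(token=token)  # Raises if the token is missing or invalid.
    print(f"Token OK, logged in as: {info['name']}")
except Exception as err:
    print(f"Token check failed: {err}")
```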
## Models

The benchmark supports any model available through Hugging Face's Inference API (see the query sketch after this list), including:

- Meta Llama models
- Google Gemma models
- Mistral models
- And many more!
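As a rough illustration of what "available through the Inference API" means, the snippet below sends the same prompt to a few example models with `huggingface_hub.InferenceClient`. The repository IDs are assumptions for illustration (some, such as the Llama and Gemma families, are gated and require accepting their licenses first), and this is not the exact code the Space runs.

```python
import os
from huggingface_hub import InferenceClient

token = os.environ.get("HF_TOKEN")  # Example: token read from the environment.

# Example repository IDs; swap in any model served by the Inference API.
model_ids = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "google/gemma-2-9b-it",
    "mistralai/Mistral-7B-Instruct-v0.3",
]

for model_id in model_ids:
    client = InferenceClient(model=model_id, token=token)
    reply = client.chat_completion(
        messages=[{"role": "user", "content": "In one sentence, what is a benchmark?"}],
        max_tokens=64,
    )
    print(model_id, "->", reply.choices[0].message.content)
```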
## Note

Running a full benchmark might take some time, depending on the number of models and iterations.