---
title: AutoBench
emoji: 🐠
colorFrom: red
colorTo: yellow
sdk: streamlit
sdk_version: 1.42.2
app_file: app.py
pinned: false
license: mit
short_description: LLM Many-Model-As-Judge Benchmark
---

# AutoBench

This Space runs a benchmark to compare different language models using Hugging Face's Inference API.

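For reference, a single Inference API call can be made with `huggingface_hub`'s `InferenceClient`. The snippet below is only a minimal sketch; the model ID, prompt, and token placeholder are illustrative, not the exact values the app uses.

```python
from huggingface_hub import InferenceClient

# Minimal sketch of one Inference API call.
# The model ID and prompt are examples, not the values used by the app.
client = InferenceClient(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model ID
    token="hf_xxx",                            # your Hugging Face API token
)

response = client.chat_completion(
    messages=[{"role": "user", "content": "Explain beam search in two sentences."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```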
## Features

- Benchmark multiple models side by side (models evaluate models; see the sketch below)
- Test models across various topics and difficulty levels
- Evaluate question quality and answer quality
- Generate detailed performance reports

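To make the "models evaluate models" idea concrete, here is one possible shape of a cross-grading round: every model answers a question, each of the other models grades that answer, and the grades are averaged. This is a sketch under assumptions; the prompts, the 1-10 scale, and the aggregation are illustrative and may differ from what the app actually does.

```python
from statistics import mean

from huggingface_hub import InferenceClient


def query(model_id: str, prompt: str, token: str) -> str:
    """Send one prompt to one model through the Inference API."""
    client = InferenceClient(model=model_id, token=token)
    out = client.chat_completion(
        messages=[{"role": "user", "content": prompt}], max_tokens=256
    )
    return out.choices[0].message.content


def judge_round(models: list[str], question: str, token: str) -> dict[str, float]:
    """Each model answers the question; every other model grades the answer (1-10)."""
    answers = {m: query(m, question, token) for m in models}
    scores: dict[str, float] = {}
    for answerer, answer in answers.items():
        grades = []
        for judge in models:
            if judge == answerer:
                continue  # a model never grades its own answer
            verdict = query(
                judge,
                "Rate the following answer on a 1-10 scale. Reply with a number only.\n"
                f"Question: {question}\nAnswer: {answer}",
                token,
            )
            try:
                grades.append(float(verdict.strip()))
            except ValueError:
                pass  # ignore judges that return an unparsable grade
        # average the peer grades received by this answering model
        scores[answerer] = mean(grades) if grades else float("nan")
    return scores
```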
## How to Use

  1. Enter your Hugging Face API token (needed to access models)
  2. Select the models you want to benchmark
  3. Choose topics and number of iterations
  4. Click "Start Benchmark"
  5. View and download results when complete (see the sketch below)

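As an illustration of the final step, results could be shown as a table and offered as a CSV download. The Streamlit sketch below uses made-up column names and scores and is not the app's actual output format.

```python
import pandas as pd
import streamlit as st

# Illustrative only: column names and scores are made up, not the app's real output.
results = pd.DataFrame(
    {
        "model": ["model-a", "model-b"],
        "avg_answer_score": [7.8, 6.9],
        "avg_question_score": [8.1, 7.4],
    }
)

st.dataframe(results)  # view the results table in the app
st.download_button(
    label="Download results as CSV",
    data=results.to_csv(index=False),
    file_name="autobench_results.csv",
    mime="text/csv",
)
```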
## Models

The benchmark supports any model available through Hugging Face's Inference API, including:

- Meta Llama models
- Google Gemma models
- Mistral models
- And many more!

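For example, a run could be pointed at a list of Hub repo IDs like the one below. The IDs are examples only; whether a given model is currently served through the Inference API depends on provider availability.

```python
# Example repo IDs only; confirm each model is currently served by the Inference API.
models_to_benchmark = [
    "meta-llama/Llama-3.1-8B-Instruct",
    "google/gemma-2-9b-it",
    "mistralai/Mistral-7B-Instruct-v0.3",
]
```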
## Note

A full benchmark run can take a significant amount of time, depending on the number of models and iterations you select.