Update README.md
README.md CHANGED
@@ -1,5 +1,5 @@
---
-title: AutoBench
emoji: 🐠
colorFrom: red
colorTo: yellow
@@ -8,13 +8,13 @@ sdk_version: 1.42.2
app_file: app.py
pinned: false
license: mit
-short_description:
---

-# AutoBench

-This Space runs a benchmark to compare different language models using Hugging Face's Inference API.

## Features
@@ -31,6 +31,16 @@ This Space runs a benchmark to compare different language models using Hugging Face's Inference API.
4. Click "Start Benchmark"
5. View and download results when complete

## Models

The benchmark supports any model available through Hugging Face's Inference API, including:
@@ -41,4 +51,4 @@ The benchmark supports any model available through Hugging Face's Inference API, including:
## Note

-Running a full benchmark might take some time depending on the number of models and iterations.

---
+title: AutoBench 1.0 Demo
emoji: 🐠
colorFrom: red
colorTo: yellow

app_file: app.py
pinned: false
license: mit
+short_description: Many-Model-As-Judge LLM Benchmark
---

+# AutoBench 1.0 Demo

+This Space runs a Many-Model-As-Judge LLM benchmark to compare different language models using Hugging Face's Inference API. It is a simplified version of AutoBench 1.0, which relies on multiple inference providers to manage request load and supports a wider range of models (Anthropic, Grok, Nebius, OpenAI, Together AI, Vertex AI). For more advanced use, please refer to the AutoBench 1.0 repository.
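
For context, a single model query through the Inference API looks roughly like the snippet below. This is a minimal sketch using `huggingface_hub.InferenceClient`; the model ID, prompt, and token are placeholders, not values taken from this Space.

```python
from huggingface_hub import InferenceClient

# Minimal sketch of one Inference API call (model ID, prompt, and token are placeholders).
client = InferenceClient(model="meta-llama/Llama-3.1-8B-Instruct", token="hf_...")

response = client.chat_completion(
    messages=[{"role": "user", "content": "Rank this question from 1 to 5: ..."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The benchmark issues many calls like this per iteration: one per model for each question, answer, and ranking step.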

## Features

4. Click "Start Benchmark"
5. View and download results when complete

+## How it works
+
+On each iteration, the system (sketched in code below):
+1. generates a question prompt based on a random topic and difficulty level
+2. randomly selects a model to generate the question
+3. asks all models to rank the question; the question is accepted if its average rank is above a threshold (3.5) and every individual rank is above a minimum value (2), otherwise this step is repeated
+4. asks all models to generate an answer
+5. for each answer, asks all models to rank it (from 1 to 5) and computes a weighted average rank, with weights proportional to each judging model's rank
+6. computes a cumulative average rank for each model over all iterations
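
The loop can be summarized as below. This is an illustrative sketch, not the Space's actual `app.py`: `generate_question`, `ask_for_rank`, and `ask_for_answer` are hypothetical stand-ins for Inference API calls, and weighting each judge by its own current average rank is one reading of step 5; see the AutoBench 1.0 repository for the exact scheme.

```python
import random
from statistics import mean

# Hypothetical stand-ins for the Inference API calls made by the Space.
def generate_question(model_id, topic, difficulty):
    return f"[{difficulty}] question on {topic}, written by {model_id}"

def ask_for_rank(model_id, text):
    return random.randint(1, 5)   # a real judge call returns a 1-5 rank

def ask_for_answer(model_id, question):
    return f"answer from {model_id}"

QUESTION_THRESHOLD = 3.5   # average rank a question must exceed (step 3)
MIN_RANK = 2               # every individual rank must also exceed this value

def run_iteration(models, topics, difficulties, history):
    # Steps 1-2: random topic/difficulty, random question-writing model.
    topic, difficulty = random.choice(topics), random.choice(difficulties)
    author = random.choice(models)

    # Step 3: all models rank the question; regenerate until it is accepted.
    while True:
        question = generate_question(author, topic, difficulty)
        q_ranks = [ask_for_rank(m, question) for m in models]
        if mean(q_ranks) > QUESTION_THRESHOLD and min(q_ranks) > MIN_RANK:
            break

    # Judge weights: here, each judge's current cumulative average rank
    # (a neutral 3.0 before it has been scored at all).
    weights = {judge: mean(history.get(judge, [3.0])) for judge in models}

    # Steps 4-5: every model answers; all models rank each answer (1-5).
    for candidate in models:
        answer = ask_for_answer(candidate, question)
        votes = {judge: ask_for_rank(judge, answer) for judge in models}
        avg = sum(weights[j] * votes[j] for j in models) / sum(weights.values())
        # Step 6: fold the result into the candidate's cumulative average.
        history.setdefault(candidate, []).append(avg)

    return {m: mean(scores) for m, scores in history.items()}

if __name__ == "__main__":
    leaderboard = {}
    for _ in range(5):   # a handful of iterations with placeholder model IDs
        scores = run_iteration(["model-a", "model-b", "model-c"],
                               ["history", "coding"], ["easy", "hard"], leaderboard)
    print(scores)
```

In this sketch the weights collapse to a simple average on the first iteration, since no judge has a cumulative score yet.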

## Models

The benchmark supports any model available through Hugging Face's Inference API, including:

## Note

+Running a full benchmark might take some time, depending on the number of models and iterations. Make sure you have sufficient Hugging Face credits to run the benchmark, especially when employing many models over many iterations.