PeterKruger committed · verified
Commit 0f654e5 · 1 Parent(s): f0f4459

Update README.md

Files changed (1): README.md (+15 -5)
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- title: AutoBench
+ title: AutoBench 1.0 Demo
  emoji: 🐠
  colorFrom: red
  colorTo: yellow
@@ -8,13 +8,13 @@ sdk_version: 1.42.2
  app_file: app.py
  pinned: false
  license: mit
- short_description: LLM Many-Model-As-Judge Benchmark
+ short_description: Many-Model-As-Judge LLM Benchmark
  ---


- # AutoBench
+ # AutoBench 1.0 Demo

- This Space runs a benchmark to compare different language models using Hugging Face's Inference API.
+ This Space runs a Many-Model-As-Judge LLM benchmark to compare different language models using Hugging Face's Inference API. It is a simplified version of AutoBench 1.0, which relies on multiple inference providers (Anthropic, Grok, Nebius, OpenAI, Together AI, Vertex AI) to manage request load and to cover a wider range of models. For more advanced use, please refer to the AutoBench 1.0 repository.

  ## Features

@@ -31,6 +31,16 @@ This Space runs a benchmark to compare different language models using Hugging Face's Inference API.
  4. Click "Start Benchmark"
  5. View and download results when complete

+ ## How it works
+
+ On each iteration, the system:
+ 1. generates a question prompt based on a random topic and difficulty level
+ 2. randomly selects a model to generate the question
+ 3. asks all models to rank the question; the question is accepted if its average rank is above a threshold (3.5) and every individual rank is above a set minimum (2), otherwise this step is repeated
+ 4. asks all models to generate an answer
+ 5. for each answer, asks all models to rank it (from 1 to 5) and computes a weighted average rank, with weights proportional to each model's rank
+ 6. computes a cumulative average rank for each model over all iterations
+
  ## Models

  The benchmark supports any model available through Hugging Face's Inference API, including:
@@ -41,4 +51,4 @@ The benchmark supports any model available through Hugging Face's Inference API, including:

  ## Note

- Running a full benchmark might take some time depending on the number of models and iterations.
+ Running a full benchmark might take some time depending on the number of models and iterations. Make sure you have sufficient Hugging Face credits to run the benchmark, especially when using many models over many iterations.
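
To illustrate the new "How it works" section, here is a minimal Python sketch of one benchmark iteration. The helper callables `generate(model, prompt)` and `judge(model, text)` stand in for the Hugging Face Inference API calls, and the weighting scheme (each judging model weighted by its own current average rank) is one plausible reading of step 5; none of these names come from the actual app.py implementation. The thresholds 3.5 and 2 are the values quoted in step 3.

```python
import random
from statistics import mean

QUESTION_THRESHOLD = 3.5   # average rank a candidate question must exceed
MIN_QUESTION_RANK = 2      # every individual rank must also exceed this value


def run_iteration(models, generate, judge, topics, difficulties, totals, counts):
    """Run one AutoBench-style iteration (illustrative sketch).

    generate(model, prompt) -> str and judge(model, text) -> float in [1, 5] are
    placeholders for Inference API calls; totals/counts map each model to its
    running score sum and number of scored answers.
    """
    # Steps 1-2: build a prompt from a random topic and difficulty level,
    # then let a randomly selected model write the question.
    topic, difficulty = random.choice(topics), random.choice(difficulties)
    question = generate(random.choice(models),
                        f"Write a {difficulty} question about {topic}.")

    # Step 3: every model ranks the question; repeat until the average rank
    # clears 3.5 and no individual rank falls at or below 2.
    q_ranks = [judge(m, question) for m in models]
    while not (mean(q_ranks) > QUESTION_THRESHOLD and min(q_ranks) > MIN_QUESTION_RANK):
        q_ranks = [judge(m, question) for m in models]

    # Step 4: every model answers the accepted question.
    answers = {m: generate(m, question) for m in models}

    # Step 5: all models rank each answer (1-5); the answer's score is a
    # weighted average, with each judge weighted by its current average rank
    # (assumed interpretation of "weights proportional to each model's rank").
    for m, answer in answers.items():
        ranks = {j: judge(j, f"Question: {question}\nAnswer: {answer}") for j in models}
        weights = {j: totals[j] / counts[j] if counts[j] else 1.0 for j in models}
        score = sum(weights[j] * ranks[j] for j in models) / sum(weights.values())

        # Step 6: fold the score into the model's cumulative average rank.
        totals[m] += score
        counts[m] += 1

    return {m: totals[m] / counts[m] for m in models if counts[m]}
```

Repeating this function over many iterations yields the cumulative per-model averages described in step 6; the real Space orchestrates these calls through its Streamlit UI.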