Update README.md
README.md CHANGED
@@ -1,5 +1,5 @@
---
-title: AutoBench
emoji: 🐠
colorFrom: red
colorTo: yellow
@@ -8,13 +8,13 @@ sdk_version: 1.42.2
app_file: app.py
pinned: false
license: mit
-short_description:
---

-# AutoBench

-This Space runs a benchmark to compare different language models using Hugging Face's Inference API.

## Features
@@ -31,6 +31,16 @@ This Space runs a benchmark to compare different language models using Hugging Face's Inference API.
4. Click "Start Benchmark"
5. View and download results when complete

## Models

The benchmark supports any model available through Hugging Face's Inference API, including:
@@ -41,4 +51,4 @@ The benchmark supports any model available through Hugging Face's Inference API, including:
## Note

-Running a full benchmark might take some time depending on the number of models and iterations.

---
+title: AutoBench 1.0 Demo
emoji: 🐠
colorFrom: red
colorTo: yellow

app_file: app.py
pinned: false
license: mit
+short_description: Many-Model-As-Judge LLM Benchmark
---

+# AutoBench 1.0 Demo

+This Space runs a Many-Model-As-Judge LLM benchmark to compare different language models using Hugging Face's Inference API. It is a simplified version of AutoBench 1.0, which relies on multiple inference providers to manage request load and supports a wider range of models (Anthropic, Grok, Nebius, OpenAI, Together AI, Vertex AI). For more advanced use, please refer to the AutoBench 1.0 repository.
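
For context, a single model query through the Inference API looks roughly like the snippet below. This is a minimal sketch using `huggingface_hub.InferenceClient`; the model ID, prompt, and token are placeholders, not values taken from this Space.

```python
from huggingface_hub import InferenceClient

# Minimal sketch of one Inference API call (model ID, prompt, and token are placeholders).
client = InferenceClient(model="meta-llama/Llama-3.1-8B-Instruct", token="hf_...")

response = client.chat_completion(
    messages=[{"role": "user", "content": "Rank this question from 1 to 5: ..."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```

The benchmark issues many calls like this per iteration: one per model for each question, answer, and ranking step.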

## Features

4. Click "Start Benchmark"
5. View and download results when complete

+## How it works
+
+On each iteration, the system (sketched in code below):
+1. generates a question prompt based on a random topic and difficulty level
+2. randomly selects a model to generate the question
+3. asks all models to rank the question; the question is accepted if its average rank is above a threshold (3.5) and every individual rank is above a minimum value (2), otherwise this step is repeated
+4. asks all models to generate an answer
+5. for each answer, asks all models to rank it (from 1 to 5) and computes a weighted average rank, with weights proportional to each judging model's rank
+6. computes a cumulative average rank for each model over all iterations
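
The loop can be summarized as below. This is an illustrative sketch, not the Space's actual `app.py`: `generate_question`, `ask_for_rank`, and `ask_for_answer` are hypothetical stand-ins for Inference API calls, and weighting each judge by its own current average rank is one reading of step 5; see the AutoBench 1.0 repository for the exact scheme.

```python
import random
from statistics import mean

# Hypothetical stand-ins for the Inference API calls made by the Space.
def generate_question(model_id, topic, difficulty):
    return f"[{difficulty}] question on {topic}, written by {model_id}"

def ask_for_rank(model_id, text):
    return random.randint(1, 5)   # a real judge call returns a 1-5 rank

def ask_for_answer(model_id, question):
    return f"answer from {model_id}"

QUESTION_THRESHOLD = 3.5   # average rank a question must exceed (step 3)
MIN_RANK = 2               # every individual rank must also exceed this value

def run_iteration(models, topics, difficulties, history):
    # Steps 1-2: random topic/difficulty, random question-writing model.
    topic, difficulty = random.choice(topics), random.choice(difficulties)
    author = random.choice(models)

    # Step 3: all models rank the question; regenerate until it is accepted.
    while True:
        question = generate_question(author, topic, difficulty)
        q_ranks = [ask_for_rank(m, question) for m in models]
        if mean(q_ranks) > QUESTION_THRESHOLD and min(q_ranks) > MIN_RANK:
            break

    # Judge weights: here, each judge's current cumulative average rank
    # (a neutral 3.0 before it has been scored at all).
    weights = {judge: mean(history.get(judge, [3.0])) for judge in models}

    # Steps 4-5: every model answers; all models rank each answer (1-5).
    for candidate in models:
        answer = ask_for_answer(candidate, question)
        votes = {judge: ask_for_rank(judge, answer) for judge in models}
        avg = sum(weights[j] * votes[j] for j in models) / sum(weights.values())
        # Step 6: fold the result into the candidate's cumulative average.
        history.setdefault(candidate, []).append(avg)

    return {m: mean(scores) for m, scores in history.items()}

if __name__ == "__main__":
    leaderboard = {}
    for _ in range(5):   # a handful of iterations with placeholder model IDs
        scores = run_iteration(["model-a", "model-b", "model-c"],
                               ["history", "coding"], ["easy", "hard"], leaderboard)
    print(scores)
```

In this sketch the weights collapse to a simple average on the first iteration, since no judge has a cumulative score yet.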

## Models

The benchmark supports any model available through Hugging Face's Inference API, including:

## Note

+Running a full benchmark might take some time, depending on the number of models and iterations. Make sure you have sufficient Hugging Face credits to run the benchmark, especially when employing many models over many iterations.