Commit 2990bc2
Parent(s): 70cc330

Update README.md with project structure and development instructions; fix error percentage calculation in user_eval.py

Files changed:
- README.md +33 -1
- src/user_eval.py +1 -1
README.md
CHANGED

@@ -11,4 +11,36 @@ pinned: true
 license: apache-2.0
 ---
 
-# CP
+# [CP-Bench Leaderboard] – Hugging Face Leaderboard Space
+
+This repository contains the logic and configuration for the **[CP-Bench Leaderboard]**.
+
+## Structure
+
+- `app.py` – Launches the Gradio interface.
+- `src/` – Contains the main logic for fetching and displaying leaderboard data.
+- `config.py` – Configuration for the leaderboard.
+- `eval.py` – Evaluation logic for model submissions.
+- `hf_utils.py` – Utilities file.
+- `ui.py` – UI components for displaying the leaderboard.
+- `user_eval.py` – The logic for evaluating submitted models; it can also be used to evaluate models locally.
+- `README.md` – (you are here)
+
+## How It Works
+
+1. Users submit a .jsonl file with their generated models.
+2. The submission is uploaded to a storage repository (Hugging Face Hub).
+3. An evaluation script is triggered, which:
+   - Loads the submission.
+   - Evaluates the models against the benchmark dataset.
+   - Computes metrics.
+4. The results are stored and displayed on the leaderboard.
+
+## Development
+
+To run locally:
+
+```bash
+pip install -r requirements.txt
+python app.py
+```
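The "How It Works" steps in the new README describe reading a .jsonl submission (one generated model per line) before evaluation. As a minimal sketch of that loading step, assuming one JSON object per line; this is illustrative only and not the Space's actual loader or schema:

```python
import json

def load_submission(path):
    """Read a .jsonl submission: one JSON object per line.

    Illustrative sketch only; the real entry schema is defined by the Space.
    """
    entries = []
    with open(path, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                entries.append(json.loads(line))
            except json.JSONDecodeError as exc:
                print(f"Skipping malformed line {line_no}: {exc}")
    return entries

# Example usage (hypothetical file name):
# submissions = load_submission("my_models.jsonl")
```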
src/user_eval.py
CHANGED

@@ -328,7 +328,7 @@ def evaluate_submission(submitted_models, summary_file_path, modelling_framw, to
 summary_f.write(f" Total Submitted Models Parsed: {total_submitted_models}\n")
 summary_f.write(f" Models That Ran Successfully (out of the submitted models): {models_ran_successfully}/{total_submitted_models}\n")
 summary_f.write(f" Submission coverage perc: {float(total_submitted_models) / len(ground_truth_models) * 100:.2f}%\n")
-summary_f.write(f" Error perc: {(total_submitted_models - models_ran_successfully) / len(
+summary_f.write(f" Error perc: {(total_submitted_models - models_ran_successfully) / len(total_submitted_models) * 100:.2f}%\n")
 summary_f.write(f" Consistency perc: {consistency_checks_passed / len(ground_truth_models) * 100:.2f}%\n")
 summary_f.write(f" Final Solution Accuracy perc: {all_checks_passed / len(ground_truth_models) * 100:.2f}%\n")
 summary_f.write("-" * 30 + "\n")
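The changed line above sits in a summary block that reports percentages derived from a few counters. As a rough, self-contained sketch of that kind of summary, with hypothetical variable names and assuming every input is a plain integer count (this is not the repository's exact code):

```python
def write_summary(path, total_submitted, ran_successfully,
                  consistency_passed, all_passed, num_ground_truth):
    """Write an evaluation summary; all arguments are integer counts (hypothetical names)."""
    with open(path, "w") as summary_f:
        summary_f.write(f" Total Submitted Models Parsed: {total_submitted}\n")
        summary_f.write(f" Models That Ran Successfully: {ran_successfully}/{total_submitted}\n")
        summary_f.write(f" Submission coverage perc: {total_submitted / num_ground_truth * 100:.2f}%\n")
        # Error rate is measured among the submitted models only.
        summary_f.write(f" Error perc: {(total_submitted - ran_successfully) / total_submitted * 100:.2f}%\n")
        summary_f.write(f" Consistency perc: {consistency_passed / num_ground_truth * 100:.2f}%\n")
        summary_f.write(f" Final Solution Accuracy perc: {all_passed / num_ground_truth * 100:.2f}%\n")
        summary_f.write("-" * 30 + "\n")

# Example with made-up numbers: 90 of 100 benchmark problems submitted, 81 ran successfully.
write_summary("summary.txt", total_submitted=90, ran_successfully=81,
              consistency_passed=70, all_passed=65, num_ground_truth=100)
```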