performance-improvement
Changes to pyproject.toml:
- Corrected `ruff` settings to work with VSCode
Changes to src/envs.py:
- Removed the formatted string literal in the print statement, replacing the `f-string` with a plain string for constant messages.
Changes to src/leaderboard/read_evals.py:
- Class `EvalResult`:
  - Changed some instance variables to optionally include types or use defaults.
  - Replaced the `tags` default from `None` to an empty list using `field(default_factory=list)`.
  - Refactored the `init_from_json_file` method to handle the new config structure and use `cls` instead of `self`.
  - Extracted result processing into a new method, `extract_results`.
  - Implemented structured error handling and refined the update methods.
- Functionality:
  - Redefined how request files are selected and validated using `pathlib` and added more structured checks.
  - Enhanced error handling across methods with specific exceptions and logging for errors.
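For readers who have not opened the diff, the `tags` default and the `cls`-based constructor could look roughly like the sketch below. The field names and the JSON layout (`config`, `model_name`, `results`) are assumptions for illustration, not the actual `EvalResult` definition.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class EvalResult:
    # Illustrative fields only; the real class has more attributes.
    model: str
    results: dict = field(default_factory=dict)
    # default_factory gives every instance its own list, so appending a tag
    # to one result never leaks into another.
    tags: list = field(default_factory=list)

    @classmethod
    def init_from_json_file(cls, json_filepath):
        """Build an instance from a results file (hypothetical layout)."""
        data = json.loads(Path(json_filepath).read_text())
        config = data.get("config", {})
        # Using cls (not a hard-coded class name) keeps subclasses working.
        return cls(
            model=config.get("model_name", "unknown"),
            results=data.get("results", {}),
        )
```

With `tags` defaulting to a list instead of `None`, callers can append or iterate without a `None` check first.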
❗This is a first commit, I'm going to improve existing functionality in the next commits
Do we need to see the list of flagged models from src/leaderboard/filter_models.py line 144? Probably not, so I commented it out.
Key changes for src/leaderboard/read_evals.py:
- Replaced the method for sorting JSON files based on datetime embedded in their filenames. The new method uses a list of expected datetime formats to parse these strings, and logs an error if none of the formats match, defaulting to a Unix start time for legacy files with incorrect time formats.
- Introduced error handling during the construction of evaluation results dictionary to log missing keys specifically, improving debugging capabilities.
- Wrapped the iteration over model files with `tqdm` for a visual progress indicator during execution.
- Added handling within the logging scope for `tqdm` to ensure progress output and log messages don't conflict, improving the clarity of console output during execution.
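As a rough illustration of the datetime-based sorting described above, here is a minimal sketch. The format strings, the `results_<timestamp>.json` naming, and the `sort_result_files` helper are assumptions made for the example; the list of formats in the actual commit may differ.

```python
import logging
from datetime import datetime
from pathlib import Path

logger = logging.getLogger(__name__)

# Assumed formats, tried in order; not necessarily the ones used in the PR.
DATETIME_FORMATS = ("%Y-%m-%dT%H-%M-%S.%f", "%Y-%m-%dT%H:%M:%S.%f")
UNIX_EPOCH = datetime(1970, 1, 1)


def parse_datetime(datetime_str):
    """Return the parsed datetime, or the Unix epoch if no format matches."""
    for fmt in DATETIME_FORMATS:
        try:
            return datetime.strptime(datetime_str, fmt)
        except ValueError:
            continue
    # Legacy files with unexpected timestamps end up here.
    logger.error("No valid date format found for: %s", datetime_str)
    return UNIX_EPOCH


def sort_result_files(filenames):
    """Sort names like 'results_<timestamp>.json' from oldest to newest."""

    def sort_key(name):
        timestamp = Path(name).stem.split("results_", 1)[-1]
        return parse_datetime(timestamp)

    return sorted(filenames, key=sort_key)
```

Falling back to the epoch means legacy files sort first instead of aborting the whole run, and the logged error still points at the offending filename.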
Notes
My concern is how tqdm will behave in an ephemeral Space. I need to check.
Is this one reviewable? :)
General comments
- nice system with the exponential backoff
- cool work on the type hinting
- careful, you removed some docstrings
Specific comments
src/leaderboard/filter_models
Feel free to remove the "flagged models" log
src/leaderboard/read_evals
- please revert the change for `result_key` as the new system with the join is considerably less clear to read/edit if needed
- truthfulqa and NaNs > could be interesting to set any NaN value to 0, no matter the eval, it will also make the code more readable (but add in comment that it's mostly for truthfulqa)
- l.79: add a comment to explain the system
- add the comments back in `extract_results`
- nice exception management in `update_with_request_file`
- `parse_datetime` could go in `utils`
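To make the NaN suggestion concrete, something along these lines might work. It is only a sketch: `nan_to_zero` is a hypothetical helper name, and it assumes the scores arrive as a flat dict of benchmark name to value.

```python
import math


def nan_to_zero(results):
    """Replace NaN scores with 0.0 for every eval (mostly needed for truthfulqa)."""
    return {
        name: 0.0 if isinstance(score, float) and math.isnan(score) else score
        for name, score in results.items()
    }
```

Applying it to every eval avoids a truthfulqa-specific branch, which is the readability gain mentioned above.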
Following new commits that happened in this PR, the ephemeral Space HuggingFaceH4/open_llm_leaderboard-ci-pr-705 has been updated.
(This is an automated message.)
Don't mind the above commit, it's a WIP one
Finished with the changes, I'm ready to merge!
LGTM!