
Run the Evaluation Script: Open your terminal, navigate to the utilities directory, and run the script:

Evaluate all levels:

cd /Users/yagoairm2/Desktop/agents/final\ project/HF_Agents_Final_Project/utilities
python evaluate_local.py --answers_file ../agent_answers.json

Evaluate only Level 1:

python evaluate_local.py --answers_file ../agent_answers.json --level 1

Evaluate Level 1 and show incorrect answers:

python evaluate_local.py --answers_file ../agent_answers.json --level 1 --verbose

This script will calculate and print the accuracy based on the exact match criterion used by GAIA, without submitting anything to the official leaderboard.
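
To show what an exact-match accuracy calculation with a level filter can look like, here is a minimal sketch. It is not the code of evaluate_local.py. The data layouts (a dict of submitted answers keyed by task ID, and a ground-truth dict holding an answer and a level per task), the function names, and the simple lowercase/whitespace normalization are all assumptions made for illustration. The real script, and GAIA's own scorer, may normalize answers differently.

```python
from typing import Optional


def normalize(text) -> str:
    """Lowercase and collapse whitespace (an assumed, simplified normalization)."""
    return " ".join(str(text).strip().lower().split())


def exact_match_accuracy(answers: dict, ground_truth: dict, level: Optional[int] = None):
    """Return (accuracy, incorrect) over the tasks at the requested level.

    answers:      hypothetical layout, task_id -> submitted answer
    ground_truth: hypothetical layout, task_id -> {"answer": str, "level": int}
    level:        if given, only tasks at this level are scored
    """
    incorrect = []
    total = 0
    for task_id, truth in ground_truth.items():
        if level is not None and truth["level"] != level:
            continue
        total += 1
        # A task with no submitted answer is scored as incorrect.
        submitted = answers.get(task_id, "")
        if normalize(submitted) != normalize(truth["answer"]):
            incorrect.append((task_id, submitted, truth["answer"]))
    accuracy = (total - len(incorrect)) / total if total else 0.0
    return accuracy, incorrect


if __name__ == "__main__":
    # Toy data, invented for this example only.
    ground_truth = {
        "t1": {"answer": "Paris", "level": 1},
        "t2": {"answer": "42", "level": 1},
        "t3": {"answer": "blue", "level": 2},
    }
    answers = {"t1": " paris ", "t2": "41"}

    accuracy, wrong = exact_match_accuracy(answers, ground_truth, level=1)
    print(f"Level 1 accuracy: {accuracy:.2%}")
    # Roughly what a verbose mode might list: each incorrect answer next to the expected one.
    for task_id, got, expected in wrong:
        print(f"{task_id}: got {got!r}, expected {expected!r}")
```

In this sketch, passing `level=None` scores every task, which mirrors the "evaluate all levels" command, and the returned list of incorrect answers is the kind of detail a `--verbose` flag could print.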