Project: GAIA Benchmark Agent Development
This project will run on a Hugging Face Space.
Context
The project involves the design and implementation of an advanced AI agent that can efficiently tackle a variety of real-world tasks defined by the GAIA benchmark. This benchmark evaluates AI systems across three complexity levels, focusing on core competencies like reasoning, multimodal understanding, web browsing, and proficient use of tools. The agent must demonstrate capabilities in structured problem-solving, multimodal reasoning, multi-hop fact retrieval, and coherent task sequencing.
Requirements
General Requirements
- Design and implement an AI agent capable of addressing tasks from the GAIA benchmark.
- The agent must achieve at least 30% accuracy on GAIA's Level 1 questions.
- Maintain publicly accessible code and documentation on Hugging Face to allow verification and reproducibility.
Technical Requirements
- Use the provided API endpoints for interacting with the GAIA evaluation service (a minimal client sketch follows this list):
  - `GET /questions`: Retrieve the full set of evaluation questions.
  - `GET /random-question`: Retrieve a single random question.
  - `GET /files/{task_id}`: Access files associated with a specific task ID.
  - `POST /submit`: Submit answers for evaluation and leaderboard updates.
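As a rough illustration, a thin Python client for the read-only endpoints might look like the sketch below. The base URL is a placeholder (the actual evaluation host is not specified here), and the response shapes are assumptions to verify against the API:

```python
import requests

# Placeholder host -- substitute the actual GAIA evaluation API base URL.
BASE_URL = "https://example-gaia-eval.hf.space"


def get_questions() -> list[dict]:
    """Retrieve the full set of evaluation questions."""
    resp = requests.get(f"{BASE_URL}/questions", timeout=30)
    resp.raise_for_status()
    return resp.json()


def get_random_question() -> dict:
    """Retrieve a single random question."""
    resp = requests.get(f"{BASE_URL}/random-question", timeout=30)
    resp.raise_for_status()
    return resp.json()


def get_task_file(task_id: str) -> bytes:
    """Download the file attached to the task with the given ID."""
    resp = requests.get(f"{BASE_URL}/files/{task_id}", timeout=30)
    resp.raise_for_status()
    return resp.content
```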
Submission Requirements
- Username: Provide your Hugging Face username for identification purposes.
- Code Link (agent_code): Include a URL pointing directly to your Hugging Face Space code repository.
- Answers: Submit a structured response (`{"task_id": ..., "submitted_answer": ...}`) generated by your agent (see the submission sketch below).
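A submission helper might be sketched as follows. The field names mirror the requirements above, but the exact payload schema and response format are assumptions, not a confirmed API contract:

```python
import requests

BASE_URL = "https://example-gaia-eval.hf.space"  # placeholder host


def submit_answers(username: str, agent_code: str, answers: list[dict]) -> dict:
    """POST the agent's answers for scoring and a leaderboard update.

    `answers` is a list of {"task_id": ..., "submitted_answer": ...} dicts,
    matching the structured response format described above.
    """
    payload = {
        "username": username,      # Hugging Face username
        "agent_code": agent_code,  # URL of the Space code repository
        "answers": answers,
    }
    resp = requests.post(f"{BASE_URL}/submit", json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()
```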
Evaluation Criteria
- Answers are evaluated by exact-match comparison against the provided ground truth (see the scoring sketch below).
- Results and rankings are displayed on a publicly accessible student leaderboard.
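To make the scoring rule concrete, here is a minimal sketch of exact-match accuracy. Whitespace stripping is an assumption; the official scorer's normalization rules are not specified here:

```python
def exact_match(submitted: str, ground_truth: str) -> bool:
    # Stripping surrounding whitespace is an assumption; the official
    # scorer may normalize answers differently (or not at all).
    return submitted.strip() == ground_truth.strip()


def accuracy(predictions: dict[str, str], truths: dict[str, str]) -> float:
    """Fraction of tasks whose predicted answer exactly matches the truth."""
    correct = sum(
        exact_match(predictions.get(task_id, ""), truth)
        for task_id, truth in truths.items()
    )
    return correct / len(truths)


# The project targets at least 30% accuracy on Level 1 questions, i.e.:
# accuracy(level1_predictions, level1_truths) >= 0.30
```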
Advanced Capabilities
- Ensure your agent can do the following efficiently (a minimal tool-dispatch sketch follows this list):
  - Engage in multi-step reasoning and coordinate multiple tools.
  - Conduct multimodal analyses involving text and image-based data.
  - Execute web browsing and structured information retrieval.
  - Manage advanced planning and tool integration (particularly relevant for Level 2 and Level 3 GAIA tasks).
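One way to structure these capabilities is a tool registry plus a step-by-step execution loop, sketched below. The tool names and the plan format are illustrative assumptions, not the project's actual architecture:

```python
from typing import Callable

# Illustrative tool registry; real tools (web search, image analysis, ...)
# would wrap actual APIs or models rather than returning stub strings.
TOOLS: dict[str, Callable[[str], str]] = {
    "web_search": lambda query: f"<search results for {query!r}>",
    "read_file": lambda path: f"<contents of {path}>",
}


def run_plan(steps: list[tuple[str, str]]) -> list[str]:
    """Execute a multi-step plan of (tool_name, tool_input) pairs in order,
    collecting intermediate observations for later reasoning steps."""
    observations = []
    for tool_name, tool_input in steps:
        tool = TOOLS[tool_name]
        observations.append(tool(tool_input))
    return observations


# Example: a two-hop plan that searches the web, then reads an attachment.
print(run_plan([("web_search", "GAIA benchmark levels"),
                ("read_file", "task_attachment.csv")]))
```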
Additional Notes
- GAIA's benchmark questions range from simple tasks (Level 1, fewer than 5 steps) to complex tasks requiring extensive coordination and planning (Levels 2 and 3).
- Agent development should prioritize the following principles:
  - Real-world applicability of solutions.
  - Clarity and interpretability for human evaluators.
  - Robustness against brute-force methods (non-gameability).
  - Concise and unambiguous answers.
Documentation and Compliance
- Ensure proper attribution and compliance with GAIA data usage policies.
- Keep your codebase transparent and publicly accessible for verification purposes.
- Clearly document your agent's design, architecture, and reasoning processes to facilitate comprehension and evaluation.