Project: GAIA Benchmark Agent Development
This project will run on a Hugging Face Space.
Context
The project involves the design and implementation of an advanced AI agent that can efficiently tackle a variety of real-world tasks defined by the GAIA benchmark. This benchmark evaluates AI systems across three complexity levels, focusing on core competencies like reasoning, multimodal understanding, web browsing, and proficient use of tools. The agent must demonstrate capabilities in structured problem-solving, multimodal reasoning, multi-hop fact retrieval, and coherent task sequencing.
Requirements
General Requirements
- Design and implement an AI agent capable of addressing tasks from the GAIA benchmark.
- The agent must achieve at least 30% accuracy on GAIA's Level 1 questions.
- Maintain publicly accessible code and documentation on Hugging Face to allow verification and reproducibility.
Technical Requirements
- Use the provided API endpoints for interacting with the GAIA evaluation service (a minimal client sketch follows this list):
  - `GET /questions`: Retrieve the full set of evaluation questions.
  - `GET /random-question`: Retrieve a single random question.
  - `GET /files/{task_id}`: Access files associated with a specific task ID.
  - `POST /submit`: Submit answers for evaluation and leaderboard updates.
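As a rough illustration, a thin Python client for the read-only endpoints might look like the sketch below. The base URL is a placeholder (the actual evaluation host is not specified here), and the response shapes are assumptions to verify against the API:

```python
import requests

# Placeholder host -- substitute the actual GAIA evaluation API base URL.
BASE_URL = "https://example-gaia-eval.hf.space"


def get_questions() -> list[dict]:
    """Retrieve the full set of evaluation questions."""
    resp = requests.get(f"{BASE_URL}/questions", timeout=30)
    resp.raise_for_status()
    return resp.json()


def get_random_question() -> dict:
    """Retrieve a single random question."""
    resp = requests.get(f"{BASE_URL}/random-question", timeout=30)
    resp.raise_for_status()
    return resp.json()


def get_task_file(task_id: str) -> bytes:
    """Download the file attached to the task with the given ID."""
    resp = requests.get(f"{BASE_URL}/files/{task_id}", timeout=30)
    resp.raise_for_status()
    return resp.content
```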
Submission Requirements
- Username: Provide your Hugging Face username for identification purposes.
- Code Link (agent_code): Include a URL pointing directly to your Hugging Face Space code repository.
- Answers: Submit a structured response (`{"task_id": ..., "submitted_answer": ...}`) generated by your agent (see the submission sketch below).
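A submission helper might be sketched as follows. The field names mirror the requirements above, but the exact payload schema and response format are assumptions, not a confirmed API contract:

```python
import requests

BASE_URL = "https://example-gaia-eval.hf.space"  # placeholder host


def submit_answers(username: str, agent_code: str, answers: list[dict]) -> dict:
    """POST the agent's answers for scoring and a leaderboard update.

    `answers` is a list of {"task_id": ..., "submitted_answer": ...} dicts,
    matching the structured response format described above.
    """
    payload = {
        "username": username,      # Hugging Face username
        "agent_code": agent_code,  # URL of the Space code repository
        "answers": answers,
    }
    resp = requests.post(f"{BASE_URL}/submit", json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()
```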
Evaluation Criteria
- Answers are evaluated by exact-match comparison against the provided ground truth (see the scoring sketch below).
- Results and rankings are displayed on a publicly accessible student leaderboard.
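To make the scoring rule concrete, here is a minimal sketch of exact-match accuracy. Whitespace stripping is an assumption; the official scorer's normalization rules are not specified here:

```python
def exact_match(submitted: str, ground_truth: str) -> bool:
    # Stripping surrounding whitespace is an assumption; the official
    # scorer may normalize answers differently (or not at all).
    return submitted.strip() == ground_truth.strip()


def accuracy(predictions: dict[str, str], truths: dict[str, str]) -> float:
    """Fraction of tasks whose predicted answer exactly matches the truth."""
    correct = sum(
        exact_match(predictions.get(task_id, ""), truth)
        for task_id, truth in truths.items()
    )
    return correct / len(truths)


# The project targets at least 30% accuracy on Level 1 questions, i.e.:
# accuracy(level1_predictions, level1_truths) >= 0.30
```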
Advanced Capabilities
- Ensure your agent can do the following efficiently (a minimal tool-dispatch sketch follows this list):
  - Engage in multi-step reasoning and coordinate multiple tools.
  - Conduct multimodal analyses involving text and image-based data.
  - Execute web browsing and structured information retrieval.
  - Manage advanced planning and tool integration (particularly relevant for Level 2 and Level 3 GAIA tasks).
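One way to structure these capabilities is a tool registry plus a step-by-step execution loop, sketched below. The tool names and the plan format are illustrative assumptions, not the project's actual architecture:

```python
from typing import Callable

# Illustrative tool registry; real tools (web search, image analysis, ...)
# would wrap actual APIs or models rather than returning stub strings.
TOOLS: dict[str, Callable[[str], str]] = {
    "web_search": lambda query: f"<search results for {query!r}>",
    "read_file": lambda path: f"<contents of {path}>",
}


def run_plan(steps: list[tuple[str, str]]) -> list[str]:
    """Execute a multi-step plan of (tool_name, tool_input) pairs in order,
    collecting intermediate observations for later reasoning steps."""
    observations = []
    for tool_name, tool_input in steps:
        tool = TOOLS[tool_name]
        observations.append(tool(tool_input))
    return observations


# Example: a two-hop plan that searches the web, then reads an attachment.
print(run_plan([("web_search", "GAIA benchmark levels"),
                ("read_file", "task_attachment.csv")]))
```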
Additional Notes
- GAIA's benchmark questions range from simple tasks (Level 1, fewer than 5 steps) to complex tasks requiring extensive coordination and planning (Levels 2 and 3).
- Agent development should prioritize the following principles:
  - Real-world applicability of solutions.
  - Clarity and interpretability for human evaluators.
  - Robustness against brute-force methods (non-gameability).
  - Concise and unambiguous answers.
Documentation and Compliance
- Ensure proper attribution and compliance with GAIA data usage policies.
- Keep your codebase transparent and publicly accessible for verification purposes.
- Clearly document your agent's design, architecture, and reasoning processes to facilitate comprehension and evaluation.