GAIA Agent Development Plan
This document outlines a structured approach to developing an agent that can successfully solve a subset of the GAIA benchmark, focusing on understanding the task, designing the agent architecture, and planning the development process.
I. Understanding the Task & Data:
- Analyze `common_questions.json`:
  - Structure: Each entry has `task_id`, `Question`, `Level`, `Final answer`, and sometimes `file_name`.
  - Question Types: Identify patterns:
    - Direct information retrieval (e.g., "How many studio albums...").
    - Web search required (e.g., "On June 6, 2023, an article...").
    - File-based questions (audio, images, code - indicated by `file_name`).
    - Logic/reasoning puzzles (e.g., the table-based commutativity question, reversed sentence).
    - Multistep questions.
  - Answer Format: Observe the format of `Final answer` for each type. Note the guidance in `docs/submission_instructions.md` regarding formatting (numbers, few words, comma-separated lists).
  - File Dependencies: List all unique `file_name` extensions to understand what file processing capabilities are needed (e.g., `.mp3`, `.png`, `.py`, `.xlsx`); a quick tally script is sketched below.
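To ground this pre-analysis, here is a minimal sketch for tallying levels and attachment types. It assumes `common_questions.json` is a top-level JSON list in the project root using the fields noted above (`Level`, `file_name`):

```python
import json
from collections import Counter
from pathlib import Path

def summarize_questions(path: str = "common_questions.json") -> None:
    """Print how many questions exist per level and which attachment types appear."""
    entries = json.loads(Path(path).read_text(encoding="utf-8"))
    levels = Counter(entry.get("Level") for entry in entries)
    extensions = Counter(
        Path(entry["file_name"]).suffix
        for entry in entries
        if entry.get("file_name")
    )
    print("Questions per level:", dict(levels))
    print("Attachment extensions:", dict(extensions))

if __name__ == "__main__":
    summarize_questions()
```

The extension tally directly drives which `FileReaderTool` branches (audio, image, spreadsheet, code) are worth building first.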
- Review Project Context:
  - Agent Interface: The agent will need to fit into the `BasicAgent` structure in `app.py` (i.e., an `__init__` and a `__call__(self, question: str) -> str` method).
  - Evaluation: Keep `docs/testing_recipe.md` and the `normalize` function in mind for how answers will be compared.
  - Model: The agent will use an LLM (like the Llama 3 model mentioned in `docs/log.md`).
II. Agent Architecture Design (Conceptual):
- Core Agent Loop (`MyAgent.answer` or `MyAgent.__call__`):
  - Input: Question string (and `task_id`/`file_name` if passed separately or parsed from a richer input object).
  - Step 1: Question Analysis & Planning:
    - Use the LLM to understand the question.
    - Determine the type of question (web search, file processing, direct knowledge, etc.).
    - Identify if any tools are needed.
    - Formulate a high-level plan (e.g., "Search web for X, then extract Y from the page").
  - Step 2: Tool Selection & Execution (if needed):
    - Based on the plan, select and invoke appropriate tools.
    - Pass necessary parameters to tools (e.g., search query, file path).
    - Collect tool outputs.
  - Step 3: Information Synthesis & Answer Generation:
    - Use the LLM to process tool outputs and any retrieved information.
    - Generate the final answer string.
  - Step 4: Answer Formatting:
    - Ensure the answer conforms to the expected format (using guidance from the `common_questions.json` examples and `docs/submission_instructions.md`). This might involve specific post-processing rules or prompting the LLM for a specific format.
  - Output: Return the formatted answer string. A skeleton of this loop is sketched below.
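A minimal skeleton of this loop, shaped to the `BasicAgent` interface in `app.py` (an `__init__` plus `__call__(self, question: str) -> str`). The helper names (`plan`, `run_tools`, `synthesize`, `format_answer`) and the tool-registry shape are placeholders for the plan above, not existing project code:

```python
class MyAgent:
    def __init__(self, llm, tools: dict):
        self.llm = llm      # LLM interaction module (e.g., a GPT4All Llama 3 wrapper)
        self.tools = tools  # mapping of tool name -> callable

    def __call__(self, question: str) -> str:
        plan = self.plan(question)                       # Step 1: analysis & planning
        observations = self.run_tools(plan)              # Step 2: tool selection & execution
        draft = self.synthesize(question, observations)  # Step 3: synthesis & answer generation
        return self.format_answer(draft)                 # Step 4: answer formatting

    def plan(self, question: str) -> dict:
        # Ask the LLM to classify the question and propose tool calls,
        # e.g. {"tool_calls": [("web_search", {"query": "..."})]}.
        raise NotImplementedError

    def run_tools(self, plan: dict) -> list:
        # Invoke each requested tool and collect its output.
        return [self.tools[name](**args) for name, args in plan.get("tool_calls", [])]

    def synthesize(self, question: str, observations: list) -> str:
        # Ask the LLM to answer the question from the gathered evidence.
        raise NotImplementedError

    def format_answer(self, draft: str) -> str:
        # Apply GAIA formatting rules (plain number, few words, or comma-separated list).
        return draft.strip()
```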
- Key Modules/Components:
  - LLM Interaction Module:
    - Handles communication with the chosen LLM (e.g., GPT4All Llama 3).
    - Manages prompt construction (system prompts, user prompts, few-shot examples if useful).
    - Parses LLM responses.
  - Tool Library: A collection of functions/classes that the agent can call.
    - `WebSearchTool`:
      - Input: Search query.
      - Action: Uses a search engine API (or simulates browsing if necessary, though a direct API is better).
      - Output: List of search results (titles, snippets, URLs) or page content.
    - `FileReaderTool` (sketched after this list):
      - Input: File path (derived from `file_name` and `task_id` to locate/fetch the file).
      - Action: Reads content based on file type.
        - Text files (`.py`): Read as string.
        - Spreadsheets (`.xlsx`): Parse relevant data (requires a library like `pandas` or `openpyxl`).
        - Audio files (`.mp3`): Transcribe to text (requires a speech-to-text model/API).
        - Image files (`.png`): Describe image content or extract text (requires a vision model/API or OCR).
      - Output: Processed content (text, structured data).
    - `CodeInterpreterTool` (for `.py` files like in task `f918266a-b3e0-4914-865d-4faa564f1aef`):
      - Input: Python code string.
      - Action: Executes the code in a sandboxed environment.
      - Output: Captured stdout/stderr or final expression value.
    - (Potentially) `KnowledgeBaseTool`: If there's a way to pre-process or index relevant documents/FAQs for faster lookups (though most GAIA questions imply dynamic information retrieval).
  - File Management/Access:
    - Mechanism to locate/download files associated with `task_id` and `file_name`. The API endpoint `GET /files/{task_id}` from `docs/API.md` is relevant here. For local testing with `common_questions.json`, ensure these files are available locally.
  - Prompt Engineering Strategy:
    - Develop a set of system prompts to guide the agent's behavior (e.g., "You are a helpful AI assistant designed to answer questions from the GAIA benchmark...").
    - Develop task-specific prompts or prompt templates for different question types or tool usage.
    - Incorporate answer formatting instructions into prompts.
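As a sketch of how `FileReaderTool` and the file-access mechanism could fit together. Only the `GET /files/{task_id}` path comes from `docs/API.md`; the `API_BASE` constant, the caching scheme, and the placeholder strings for audio/image handling are assumptions:

```python
from pathlib import Path

import pandas as pd
import requests
from PIL import Image

API_BASE = "https://example-scoring-api"  # placeholder; use the base URL from docs/API.md

def fetch_task_file(task_id: str, file_name: str, cache_dir: str = "files") -> Path:
    """Download a task's attachment once and cache it locally for reuse."""
    path = Path(cache_dir) / file_name
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        response = requests.get(f"{API_BASE}/files/{task_id}", timeout=30)
        response.raise_for_status()
        path.write_bytes(response.content)
    return path

def read_file(path: Path) -> str:
    """Dispatch on extension and return text the LLM can consume."""
    suffix = path.suffix.lower()
    if suffix in {".txt", ".py"}:
        return path.read_text(encoding="utf-8")
    if suffix == ".xlsx":
        # Flatten the sheet to CSV text so the LLM can reason over it.
        return pd.read_excel(path).to_csv(index=False)
    if suffix == ".png":
        # Real handling needs a vision model or OCR; report basic metadata for now.
        width, height = Image.open(path).size
        return f"[image {width}x{height}: needs a vision model or OCR]"
    if suffix == ".mp3":
        return "[audio: needs a speech-to-text model]"
    raise ValueError(f"Unsupported file type: {suffix}")
```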
III. Development & Testing Strategy:
- Environment Setup:
  - Install necessary Python libraries for LLM interaction, web requests, and file processing (e.g., `requests`, `beautifulsoup4` (for web scraping if needed), `pandas`, `Pillow` (for images), speech recognition libraries, etc.).
- Iterative Implementation:
  - Phase 1: Basic LLM Agent: Start with an agent that only uses the LLM for direct-answer questions (no tools).
  - Phase 2: Web Search Integration: Implement the `WebSearchTool` and integrate it for questions requiring web lookups.
  - Phase 3: File Handling:
    - Implement `FileReaderTool` for one file type at a time (e.g., start with `.txt` or `.py`, then `.mp3`, `.png`, `.xlsx`).
    - Implement `CodeInterpreterTool` (a minimal sketch follows this list).
  - Phase 4: Complex Reasoning & Multi-step: Refine the planning and synthesis capabilities of the LLM to handle more complex, multi-step questions that might involve multiple tool uses.
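A minimal sketch of `CodeInterpreterTool`, assuming the attached script is run in a separate child process with a timeout. This gives only light isolation, not a true sandbox; a container or restricted runtime would be safer for untrusted code:

```python
import subprocess
import sys
import tempfile

def run_python_code(code: str, timeout: int = 30) -> str:
    """Execute a Python snippet in a child process and return its stdout (or stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as handle:
        handle.write(code)
        script_path = handle.name
    result = subprocess.run(
        [sys.executable, script_path],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout if result.returncode == 0 else result.stderr
```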
- Testing:
  - Use `common_questions.json` as the primary test set.
  - Adapt the script from `docs/testing_recipe.md` to run your agent against these questions and compare outputs (a minimal harness is sketched below).
  - Focus on one question type or `task_id` at a time for debugging.
  - Log the agent's internal "thoughts" (plan, tool calls, tool outputs) for easier debugging.
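A minimal local harness in that spirit; the `normalize` below is only a stand-in for the project's actual `normalize` function from `docs/testing_recipe.md`, and the field names follow the structure noted in section I:

```python
import json
from pathlib import Path

def normalize(text: str) -> str:
    # Stand-in for the project's normalize(); collapse case and whitespace.
    return " ".join(str(text).strip().lower().split())

def evaluate(agent, questions_path: str = "common_questions.json") -> float:
    """Run the agent over every question and report exact-match accuracy."""
    entries = json.loads(Path(questions_path).read_text(encoding="utf-8"))
    correct = 0
    for entry in entries:
        predicted = agent(entry["Question"])
        expected = entry["Final answer"]
        passed = normalize(predicted) == normalize(expected)
        correct += passed
        print(f"{entry['task_id']}: {'PASS' if passed else 'FAIL'} "
              f"(got {predicted!r}, expected {expected!r})")
    return correct / len(entries)
```

Filtering `entries` by `Level` or a single `task_id` keeps debugging runs fast, matching the one-question-at-a-time approach above.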
IV. Pre-computation/Pre-analysis (before coding):
- Map Question Types to Tools: For each question in `common_questions.json`, manually note which tool(s) would ideally be used. This helps prioritize tool development.
  - Examples:
    - `8e867cd7-cff9-4e6c-867a-ff5ddc2550be` (Mercedes Sosa albums): WebSearchTool
    - `cca530fc-4052-43b2-b130-b30968d8aa44` (Chess): FileReaderTool (image) + Vision/Chess Engine Tool (or very advanced LLM vision)
    - `99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3` (Pie ingredients): FileReaderTool (audio) + SpeechToText
    - `f918266a-b3e0-4914-865d-4faa564f1aef` (Python output): FileReaderTool (code) + CodeInterpreterTool
- Define Tool Interfaces: Specify the exact input/output signature for each planned tool.
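One possible way to pin those signatures down, using a shared result type and a `Protocol`; the exact shapes here are an assumption rather than project code:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class ToolResult:
    content: str      # text handed back to the LLM for synthesis
    source: str = ""  # URL or file path the content came from

class Tool(Protocol):
    """Every tool exposes a name, a description for the LLM, and is callable."""
    name: str
    description: str

    def __call__(self, **kwargs) -> ToolResult: ...

class WebSearchTool:
    name = "web_search"
    description = "Search the web. Input: 'query' string. Output: result snippets."

    def __call__(self, **kwargs) -> ToolResult:
        # Sketch: call a search API with kwargs["query"] and join the top snippets.
        raise NotImplementedError
```

Keeping every tool behind the same callable signature lets the core loop's tool registry (see the skeleton in section II) invoke them uniformly.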