GAIA Agent Development Plan

This document outlines a structured approach to developing an agent that can successfully solve a subset of the GAIA benchmark, focusing on understanding the task, designing the agent architecture, and planning the development process.

I. Understanding the Task & Data:

  1. Analyze common_questions.json:

    • Structure: Each entry has task_id, Question, Level, Final answer, and sometimes file_name.
    • Question Types: Identify patterns:
      • Direct information retrieval (e.g., "How many studio albums...").
      • Web search required (e.g., "On June 6, 2023, an article...").
      • File-based questions (audio, images, code - indicated by file_name).
      • Logic/reasoning puzzles (e.g., the table-based commutativity question, the reversed-sentence question).
      • Multi-step questions.
    • Answer Format: Observe the format of Final answer for each type. Note the guidance in docs/submission_instructions.md regarding formatting (numbers, few words, comma-separated lists).
    • File Dependencies: List all unique file_name extensions to understand which file-processing capabilities are needed (e.g., .mp3, .png, .py, .xlsx); a quick profiling sketch follows this section.
  2. Review Project Context:

    • Agent Interface: The agent will need to fit into the BasicAgent structure in app.py (i.e., an __init__ and a __call__(self, question: str) -> str method).
    • Evaluation: Keep docs/testing_recipe.md and the normalize function in mind for how answers will be compared.
    • Model: The agent will use an LLM (like the Llama 3 model mentioned in docs/log.md).
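
As a quick first pass at this analysis, a short script along the following lines (assuming common_questions.json is a JSON list of objects with the fields named above and sits at the repository root) tallies question levels and the unique file_name extensions the agent must eventually handle:

```python
import json
from collections import Counter
from pathlib import Path

# Assumed location of the question set; adjust to where it lives in this repo.
QUESTIONS_PATH = Path("common_questions.json")

with QUESTIONS_PATH.open(encoding="utf-8") as f:
    questions = json.load(f)

levels = Counter(q.get("Level") for q in questions)
extensions = Counter(
    Path(q["file_name"]).suffix.lower()
    for q in questions
    if q.get("file_name")  # only some entries carry an attached file
)

print(f"{len(questions)} questions; levels: {dict(levels)}")
print(f"file extensions to support: {dict(extensions)}")
```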

II. Agent Architecture Design (Conceptual):

  1. Core Agent Loop (MyAgent.answer or MyAgent.__call__; a minimal skeleton is sketched at the end of this section):

    • Input: Question string (and task_id/file_name if passed separately or parsed from a richer input object).
    • Step 1: Question Analysis & Planning:
      • Use the LLM to understand the question.
      • Determine the type of question (web search, file processing, direct knowledge, etc.).
      • Identify if any tools are needed.
      • Formulate a high-level plan (e.g., "Search web for X, then extract Y from the page").
    • Step 2: Tool Selection & Execution (if needed):
      • Based on the plan, select and invoke appropriate tools.
      • Pass necessary parameters to tools (e.g., search query, file path).
      • Collect tool outputs.
    • Step 3: Information Synthesis & Answer Generation:
      • Use the LLM to process tool outputs and any retrieved information.
      • Generate the final answer string.
    • Step 4: Answer Formatting:
      • Ensure the answer conforms to the expected format (using guidance from common_questions.json examples and docs/submission_instructions.md). This might involve specific post-processing rules or prompting the LLM for a specific format.
    • Output: Return the formatted answer string.
  2. Key Modules/Components:

    • LLM Interaction Module:
      • Handles communication with the chosen LLM (e.g., GPT4All Llama 3).
      • Manages prompt construction (system prompts, user prompts, few-shot examples if useful).
      • Parses LLM responses.
    • Tool Library: A collection of functions/classes that the agent can call.
      • WebSearchTool:
        • Input: Search query.
        • Action: Uses a search-engine API (or simulates browsing if necessary, though a direct API is preferable).
        • Output: List of search results (titles, snippets, URLs) or page content.
      • FileReaderTool:
        • Input: File path (derived from file_name and task_id to locate/fetch the file).
        • Action: Reads content based on file type.
          • Text files (.py): Read as string.
          • Spreadsheets (.xlsx): Parse relevant data (requires a library like pandas or openpyxl).
          • Audio files (.mp3): Transcribe to text (requires a speech-to-text model/API).
          • Image files (.png): Describe image content or extract text (requires a vision model/API or OCR).
        • Output: Processed content (text, structured data).
      • CodeInterpreterTool (for .py files such as the one in task f918266a-b3e0-4914-865d-4faa564f1aef):
        • Input: Python code string.
        • Action: Executes the code in a sandboxed environment.
        • Output: Captured stdout/stderr or final expression value.
      • (Potentially) KnowledgeBaseTool: If there's a way to pre-process or index relevant documents/FAQs for faster lookups (though most GAIA questions imply dynamic information retrieval).
    • File Management/Access:
      • Mechanism to locate/download files associated with task_id and file_name. The API endpoint GET /files/{task_id} from docs/API.md is relevant here; for local testing with common_questions.json, ensure these files are available locally. A small download helper is sketched at the end of this section.
    • Prompt Engineering Strategy:
      • Develop a set of system prompts to guide the agent's behavior (e.g., "You are a helpful AI assistant designed to answer questions from the GAIA benchmark...").
      • Develop task-specific prompts or prompt templates for different question types or tool usage.
      • Incorporate answer formatting instructions into prompts.
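
To make the loop concrete, here is a minimal sketch of a MyAgent-style class that fits the __call__(self, question: str) -> str interface expected by app.py; the LLM client, tool registry, and naive tool-selection logic are placeholders for the modules described above, not a finalized implementation:

```python
from typing import Callable, Dict


class MyAgent:
    """Sketch of the core loop: plan -> run tools -> synthesize -> format."""

    def __init__(self, llm: Callable[[str], str], tools: Dict[str, Callable[[str], str]]):
        self.llm = llm      # placeholder LLM client (e.g., a GPT4All / Llama 3 wrapper)
        self.tools = tools  # e.g., {"web_search": ..., "file_reader": ..., "code_interpreter": ...}

    def __call__(self, question: str) -> str:
        # Step 1: ask the LLM for a plan, including which tools it wants to use.
        plan = self.llm(
            "Plan how to answer this GAIA question and name any tools needed "
            f"(web_search, file_reader, code_interpreter):\n{question}"
        )

        # Step 2: run the tools named in the plan and collect their outputs.
        observations = []
        for name, tool in self.tools.items():
            if name in plan:  # naive selection; replace with structured plan parsing later
                observations.append(f"{name}: {tool(question)}")

        # Step 3: synthesize a final answer from the question plus tool outputs.
        answer = self.llm(
            "Answer the question using the observations. Reply with only the final answer, "
            "formatted per the GAIA submission rules (a number, a few words, or a comma-separated list).\n"
            f"Question: {question}\nObservations:\n" + "\n".join(observations)
        )

        # Step 4: light post-processing before returning the answer string.
        return answer.strip()
```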
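The file-access mechanism can start as a thin wrapper around GET /files/{task_id}; the sketch below uses a hypothetical fetch_task_file helper and a BASE_URL placeholder for the API host documented in docs/API.md, and caches downloads so local test runs stay offline:

```python
import os

import requests

# Placeholder for the API host documented in docs/API.md.
BASE_URL = "https://<api-host>"


def fetch_task_file(task_id: str, file_name: str, dest_dir: str = "files") -> str:
    """Download the attachment for a task via GET /files/{task_id}; return its local path."""
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, file_name)
    if not os.path.exists(local_path):  # reuse previously downloaded files
        response = requests.get(f"{BASE_URL}/files/{task_id}", timeout=30)
        response.raise_for_status()
        with open(local_path, "wb") as f:
            f.write(response.content)
    return local_path
```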

III. Development & Testing Strategy:

  1. Environment Setup:
    • Install the necessary Python libraries for LLM interaction, web requests, and file processing (e.g., requests, beautifulsoup4 for web scraping if needed, pandas, Pillow for images, and a speech-recognition library).
  2. Iterative Implementation:
    • Phase 1: Basic LLM Agent: Start with an agent that only uses the LLM for direct-answer questions (no tools).
    • Phase 2: Web Search Integration: Implement the WebSearchTool and integrate it for questions requiring web lookups.
    • Phase 3: File Handling:
      • Implement FileReaderTool for one file type at a time (e.g., start with .txt or .py, then .mp3, .png, .xlsx).
      • Implement CodeInterpreterTool.
    • Phase 4: Complex Reasoning & Multi-step: Refine the planning and synthesis capabilities of the LLM to handle more complex, multi-step questions that might involve multiple tool uses.
  3. Testing:
    • Use common_questions.json as the primary test set.
    • Adapt the script from docs/testing_recipe.md to run the agent against these questions and compare outputs (a minimal harness is sketched after this list).
    • Focus on one question type or task_id at a time for debugging.
    • Log the agent's internal "thoughts" (plan, tool calls, tool outputs) for easier debugging.
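
A minimal local harness along these lines (a hypothetical run_eval helper, with a stand-in for the normalize function from docs/testing_recipe.md) keeps that loop tight:

```python
import json


def normalize(text: str) -> str:
    # Stand-in for the normalize function from docs/testing_recipe.md.
    return " ".join(text.strip().lower().split())


def run_eval(agent, questions_path: str = "common_questions.json") -> None:
    with open(questions_path, encoding="utf-8") as f:
        questions = json.load(f)

    correct = 0
    for q in questions:
        predicted = agent(q["Question"])
        expected = q["Final answer"]
        ok = normalize(predicted) == normalize(expected)
        correct += ok
        # Print per-task results so failing task_ids are easy to re-run in isolation.
        status = "OK" if ok else f"got {predicted!r}, expected {expected!r}"
        print(f"{q['task_id']}: {status}")

    print(f"{correct}/{len(questions)} correct")
```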

IV. Pre-computation/Pre-analysis (before coding):

  1. Map Question Types to Tools: For each question in common_questions.json, manually note which tool(s) would ideally be used. This helps prioritize tool development.
    • Example:
      • 8e867cd7-cff9-4e6c-867a-ff5ddc2550be (Mercedes Sosa albums): WebSearchTool
      • cca530fc-4052-43b2-b130-b30968d8aa44 (Chess): FileReaderTool (image) + Vision/Chess Engine Tool (or very advanced LLM vision)
      • 99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3 (Pie ingredients): FileReaderTool (audio) + SpeechToText
      • f918266a-b3e0-4914-865d-4faa564f1aef (Python output): FileReaderTool (code) + CodeInterpreterTool
  2. Define Tool Interfaces: Specify the exact input/output signature for each planned tool (example signatures are sketched below).
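
For instance, the planned tools could be pinned to explicit signatures along these lines (a hypothetical shape for illustration, not a finalized API), so each development phase targets a fixed contract:

```python
from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class SearchResult:
    title: str
    snippet: str
    url: str


class WebSearchTool(Protocol):
    def __call__(self, query: str, max_results: int = 5) -> List[SearchResult]:
        """Run a web search and return result titles, snippets, and URLs."""
        ...


class FileReaderTool(Protocol):
    def __call__(self, file_path: str) -> str:
        """Return the file's content as text (transcript for audio, description/OCR output for images)."""
        ...


class CodeInterpreterTool(Protocol):
    def __call__(self, code: str, timeout_s: float = 10.0) -> str:
        """Execute Python code in a sandbox and return captured stdout/stderr."""
        ...
```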