Yago Bolivar committed
Commit · 736bdeb
Parent: a3c3cd5

fix: clarify GAIA agent development plan and remove unnecessary lines from testing recipe

- docs/devplan.md +5 -6
- docs/testing_recipe.md +1 -6
- notes.md +3 -1
docs/devplan.md
CHANGED
@@ -1,5 +1,6 @@
 # GAIA Agent Development Plan
-
+This document outlines a structured approach to developing an agent that can successfully solve a subset of the GAIA benchmark, focusing on understanding the task, designing the agent architecture, and planning the development process.
+
 **I. Understanding the Task & Data:**
 
 1. **Analyze common_questions.json:**
@@ -16,7 +17,7 @@
 2. **Review Project Context:**
 * **Agent Interface:** The agent will need to fit into the `BasicAgent` structure in `app.py` (i.e., an `__init__` and a `__call__(self, question: str) -> str` method).
 * **Evaluation:** Keep `docs/testing_recipe.md` and the `normalize` function in mind for how answers will be compared.
-* **Model:** The agent will
+* **Model:** The agent will use an LLM (like the Llama 3 model mentioned in `docs/log.md`).
 
 **II. Agent Architecture Design (Conceptual):**
 
@@ -82,7 +83,7 @@
 * **Phase 4: Complex Reasoning & Multi-step:** Refine the planning and synthesis capabilities of the LLM to handle more complex, multi-step questions that might involve multiple tool uses.
 3. **Testing:**
 * Use `common_questions.json` as the primary test set.
-* Adapt the script from `docs/testing_recipe.md`
+* Adapt the script from `docs/testing_recipe.md` to run your agent against these questions and compare outputs.
 * Focus on one question type or `task_id` at a time for debugging.
 * Log agent's internal "thoughts" (plan, tool calls, tool outputs) for easier debugging.
 
@@ -94,6 +95,4 @@
 * `cca530fc-4052-43b2-b130-b30968d8aa44` (Chess): FileReaderTool (image) + Vision/Chess Engine Tool (or very advanced LLM vision)
 * `99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3` (Pie ingredients): FileReaderTool (audio) + SpeechToText
 * `f918266a-b3e0-4914-865d-4faa564f1aef` (Python output): FileReaderTool (code) + CodeInterpreterTool
-2. **Define Tool Interfaces:** Specify the exact input/output signature for each planned tool.
-
-This structured approach should provide a solid foundation for developing the agent. The key will be modularity, robust tool implementation, and effective prompt engineering to guide the LLM.
+2. **Define Tool Interfaces:** Specify the exact input/output signature for each planned tool.
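The plan's **Agent Interface** bullet pins down the contract the agent has to satisfy: an `__init__` plus a `__call__(self, question: str) -> str` that fits the `BasicAgent` structure in `app.py`. A minimal sketch of a class with that shape is below; the LLM client, tool registry, and plan/act/synthesize loop are illustrative placeholders, not the repository's actual implementation.

```python
# Minimal sketch of an agent matching the interface described in the plan:
# __init__ plus __call__(self, question: str) -> str. The llm_client and tools
# arguments are hypothetical; swap in the project's real model and tools.

class GAIAAgent:
    def __init__(self, llm_client=None, tools=None):
        # llm_client: any callable mapping a prompt string to a completion string.
        self.llm = llm_client or (lambda prompt: "")
        # tools: mapping of tool name -> callable, e.g. {"file_reader": ...}
        self.tools = tools or {}

    def __call__(self, question: str) -> str:
        # 1. Plan: ask the LLM whether a tool is needed and which one.
        plan = self.llm(f"Question: {question}\nWhich tool, if any, is needed?")
        # 2. Act: run the first registered tool the plan mentions by name.
        observation = ""
        for name, tool in self.tools.items():
            if name in plan:
                observation = tool(question)
                break
        # 3. Synthesize: produce the short final answer expected by exact-match scoring.
        answer = self.llm(
            f"Question: {question}\nTool output: {observation}\n"
            "Reply with the final answer only, no explanation."
        )
        return answer.strip()
```

Keeping the LLM call and the tool registry injectable makes it easy to stub either one while testing a single `task_id` at a time, as the plan's Testing step suggests.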
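For the **Define Tool Interfaces** step, one way to "specify the exact input/output signature for each planned tool" is a shared protocol plus a small result type, sketched below with `FileReaderTool` as the example. The `ToolResult` fields and the dispatch-by-extension rule are assumptions made for illustration, not decisions recorded in the plan.

```python
# Hypothetical tool interface: every tool exposes a name and run(task_input) -> ToolResult.
# FileReaderTool is used as the worked example since the plan pairs it with vision,
# speech-to-text, and code-interpreter tools for the file-based questions.

from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class ToolResult:
    content: str          # text handed back to the LLM for synthesis
    success: bool = True  # lets the agent fall back or retry on failure


class Tool(Protocol):
    name: str

    def run(self, task_input: str) -> ToolResult: ...


class FileReaderTool:
    """Reads a task attachment and routes it by extension (image/audio/code/text)."""

    name = "file_reader"

    def run(self, task_input: str) -> ToolResult:
        path = Path(task_input)
        if not path.exists():
            return ToolResult(content=f"file not found: {path}", success=False)
        if path.suffix in {".png", ".jpg"}:
            return ToolResult(content=f"[image attachment: {path.name}]")  # hand off to a vision tool
        if path.suffix in {".mp3", ".wav"}:
            return ToolResult(content=f"[audio attachment: {path.name}]")  # hand off to speech-to-text
        return ToolResult(content=path.read_text(errors="replace"))        # code or plain text
```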
docs/testing_recipe.md
CHANGED
@@ -1,6 +1,3 @@
-Pensó durante 4 segundos
-
-
 Below is a practical, lightweight recipe you can adapt to measure **exact-match accuracy** (the metric GAIA uses) on your new evaluation file.
 
 ---
@@ -113,6 +110,4 @@ python3 evaluate_agent.py question_set/common_questions.json
 ### 5 Interpreting results
 
 * **Exact-match accuracy** (>= 100 % means your agent reproduced all answers).
-* **Latency** helps you spot outliers in run time (e.g. long tool chains).
-
-That’s all you need to benchmark quickly. Happy testing!
+* **Latency** helps you spot outliers in run time (e.g. long tool chains).
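The recipe's core loop (run the agent on each question, `normalize` both strings, count exact matches, and record latency) could look roughly like the sketch below. The JSON field names (`question`, `final_answer`, `task_id`) and this placeholder `normalize` are assumptions; substitute the project's actual `normalize` and the keys used in `question_set/common_questions.json`.

```python
# Sketch of an exact-match evaluation loop with per-question latency.
# Field names and normalize() are assumptions, not the project's real helpers.

import json
import re
import sys
import time


def normalize(text: str) -> str:
    # Placeholder normalization: lowercase, strip punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def evaluate(agent, questions_path: str) -> float:
    with open(questions_path) as fh:
        questions = json.load(fh)

    correct = 0
    for item in questions:
        start = time.perf_counter()
        predicted = agent(item["question"])
        latency = time.perf_counter() - start
        hit = normalize(predicted) == normalize(item["final_answer"])
        correct += hit
        print(f"{item.get('task_id', '?')}: {'OK' if hit else 'MISS'} ({latency:.1f}s)")

    accuracy = correct / len(questions) if questions else 0.0
    print(f"exact-match accuracy: {accuracy:.1%}")
    return accuracy


if __name__ == "__main__":
    # e.g. python3 this_sketch.py question_set/common_questions.json
    stub_agent = lambda q: ""  # stand-in; plug in the real agent here
    evaluate(stub_agent, sys.argv[1])
```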
notes.md
CHANGED
@@ -1,4 +1,6 @@
-#
+# NOTES
+## general notes
+- There are 5 questions that require the interpretation of a file
 
 ## utilities/ Python scripts:
 - random_questions.py: fetches random questions from the GAIA API