Yago Bolivar committed
Commit 736bdeb · Parent: a3c3cd5

fix: clarify GAIA agent development plan and remove unnecessary lines from testing recipe

Files changed (3):
  1. docs/devplan.md +5 -6
  2. docs/testing_recipe.md +1 -6
  3. notes.md +3 -1
docs/devplan.md CHANGED
@@ -1,5 +1,6 @@
# GAIA Agent Development Plan
- # This document outlines a structured approach to developing an agent for the GAIA benchmark, focusing on understanding the task, designing the agent architecture, and planning the development process.
+ This document outlines a structured approach to developing an agent that can successfully solve a subset of the GAIA benchmark, focusing on understanding the task, designing the agent architecture, and planning the development process.
+
**I. Understanding the Task & Data:**

1. **Analyze common_questions.json:**
@@ -16,7 +17,7 @@
2. **Review Project Context:**
* **Agent Interface:** The agent will need to fit into the `BasicAgent` structure in `app.py` (i.e., an `__init__` and a `__call__(self, question: str) -> str` method).
* **Evaluation:** Keep `docs/testing_recipe.md` and the `normalize` function in mind for how answers will be compared.
- * **Model:** The agent will likely use an LLM (like the Llama 3 model mentioned in `docs/log.md`).
+ * **Model:** The agent will use an LLM (like the Llama 3 model mentioned in `docs/log.md`).

**II. Agent Architecture Design (Conceptual):**

@@ -82,7 +83,7 @@
* **Phase 4: Complex Reasoning & Multi-step:** Refine the planning and synthesis capabilities of the LLM to handle more complex, multi-step questions that might involve multiple tool uses.
3. **Testing:**
* Use `common_questions.json` as the primary test set.
- * Adapt the script from `docs/testing_recipe.md` (or use `utilities/evaluate_local.py` if suitable) to run your agent against these questions and compare outputs.
+ * Adapt the script from `docs/testing_recipe.md` to run your agent against these questions and compare outputs.
* Focus on one question type or `task_id` at a time for debugging.
* Log agent's internal "thoughts" (plan, tool calls, tool outputs) for easier debugging.

@@ -94,6 +95,4 @@
* `cca530fc-4052-43b2-b130-b30968d8aa44` (Chess): FileReaderTool (image) + Vision/Chess Engine Tool (or very advanced LLM vision)
* `99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3` (Pie ingredients): FileReaderTool (audio) + SpeechToText
* `f918266a-b3e0-4914-865d-4faa564f1aef` (Python output): FileReaderTool (code) + CodeInterpreterTool
- 2. **Define Tool Interfaces:** Specify the exact input/output signature for each planned tool.
-
- This structured approach should provide a solid foundation for developing the agent. The key will be modularity, robust tool implementation, and effective prompt engineering to guide the LLM.
+ 2. **Define Tool Interfaces:** Specify the exact input/output signature for each planned tool.
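The `BasicAgent` contract named in the **Agent Interface** bullet above is small enough to sketch. A minimal skeleton, assuming only the `__init__` and `__call__(self, question: str) -> str` shape the plan describes; the model handle and stub body are illustrative, not the code in `app.py`:

```python
# Minimal sketch of the BasicAgent contract: construct once, then call with a
# question string and get an answer string back. The body is a placeholder;
# a real agent would plan, invoke tools, and synthesize an answer here.


class BasicAgent:
    def __init__(self, model_name: str = "llama-3"):
        # Hypothetical: hold a handle to the LLM mentioned in docs/log.md.
        self.model_name = model_name

    def __call__(self, question: str) -> str:
        # Placeholder answer; shows only the required signature and return type.
        return f"[{self.model_name} stub answer for: {question!r}]"


if __name__ == "__main__":
    agent = BasicAgent()
    print(agent("What is the capital of France?"))
```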
 
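The appendix's closing step ("Define Tool Interfaces: specify the exact input/output signature for each planned tool") is easier to act on with a concrete shape in mind. One possible convention, sketched below; the `Tool` protocol, its `run` method, and the `FileReaderTool` body are assumptions, not signatures from the repository:

```python
# One way to pin down a uniform tool signature: each tool exposes a name, a
# short description (useful in the LLM prompt), and run(input) -> output.
# All names here are illustrative assumptions, not the project's actual API.
from typing import Protocol


class Tool(Protocol):
    name: str
    description: str

    def run(self, tool_input: str) -> str: ...


class FileReaderTool:
    name = "file_reader"
    description = "Read a task attachment from disk and return its contents as text."

    def run(self, tool_input: str) -> str:
        # tool_input is a file path; binary attachments (audio, images) would
        # need dedicated tools (SpeechToText, vision), as the task list notes.
        with open(tool_input, encoding="utf-8") as f:
            return f.read()
```

A registry mapping each tool's `name` to its instance would then let the planner dispatch tool calls by name.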
 
 
docs/testing_recipe.md CHANGED
@@ -1,6 +1,3 @@
- Pensó durante 4 segundos
-
-
Below is a practical, lightweight recipe you can adapt to measure **exact-match accuracy** (the metric GAIA uses) on your new evaluation file.

---
@@ -113,6 +110,4 @@ python3 evaluate_agent.py question_set/common_questions.json
### 5 Interpreting results

* **Exact-match accuracy** (>= 100 % means your agent reproduced all answers).
- * **Latency** helps you spot outliers in run time (e.g. long tool chains).
-
- That’s all you need to benchmark quickly. Happy testing!
+ * **Latency** helps you spot outliers in run time (e.g. long tool chains).
 
 
 
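For the **Latency** bullet, per-question wall-clock timing is enough to surface long tool chains. A sketch assuming an agent callable and question dicts shaped like those in `common_questions.json`; the `question` and `task_id` field names are assumptions:

```python
# Time each agent call so slow outliers (e.g. long tool chains) stand out.
import time


def run_with_latency(agent, questions):
    """Return (task_id, seconds) pairs, slowest first."""
    timings = []
    for q in questions:
        start = time.perf_counter()
        agent(q["question"])  # field name assumed from common_questions.json
        timings.append((q.get("task_id", "?"), time.perf_counter() - start))
    return sorted(timings, key=lambda t: t[1], reverse=True)
```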
 
 
notes.md CHANGED
@@ -1,4 +1,6 @@
- # README
+ # NOTES
+ ## general notes
+ - There are 5 questions that require the interpretation of a file

## utilities/ Python scripts:
- random_questions.py: fetches random questions from the GAIA API