Yago Bolivar committed
Commit · 736bdeb
Parent: a3c3cd5

fix: clarify GAIA agent development plan and remove unnecessary lines from testing recipe

- docs/devplan.md +5 -6
- docs/testing_recipe.md +1 -6
- notes.md +3 -1
docs/devplan.md
CHANGED
@@ -1,5 +1,6 @@
 # GAIA Agent Development Plan
-
+This document outlines a structured approach to developing an agent that can successfully solve a subset of the GAIA benchmark, focusing on understanding the task, designing the agent architecture, and planning the development process.
+
 **I. Understanding the Task & Data:**
 
 1. **Analyze common_questions.json:**
@@ -16,7 +17,7 @@
 2. **Review Project Context:**
 * **Agent Interface:** The agent will need to fit into the `BasicAgent` structure in `app.py` (i.e., an `__init__` and a `__call__(self, question: str) -> str` method).
 * **Evaluation:** Keep `docs/testing_recipe.md` and the `normalize` function in mind for how answers will be compared.
-* **Model:** The agent will
+* **Model:** The agent will use an LLM (like the Llama 3 model mentioned in `docs/log.md`).
 
 **II. Agent Architecture Design (Conceptual):**
 
@@ -82,7 +83,7 @@
 * **Phase 4: Complex Reasoning & Multi-step:** Refine the planning and synthesis capabilities of the LLM to handle more complex, multi-step questions that might involve multiple tool uses.
 3. **Testing:**
 * Use `common_questions.json` as the primary test set.
-* Adapt the script from `docs/testing_recipe.md`
+* Adapt the script from `docs/testing_recipe.md` to run your agent against these questions and compare outputs.
 * Focus on one question type or `task_id` at a time for debugging.
 * Log agent's internal "thoughts" (plan, tool calls, tool outputs) for easier debugging.
 
@@ -94,6 +95,4 @@
 * `cca530fc-4052-43b2-b130-b30968d8aa44` (Chess): FileReaderTool (image) + Vision/Chess Engine Tool (or very advanced LLM vision)
 * `99c9cc74-fdc8-46c6-8f8d-3ce2d3bfeea3` (Pie ingredients): FileReaderTool (audio) + SpeechToText
 * `f918266a-b3e0-4914-865d-4faa564f1aef` (Python output): FileReaderTool (code) + CodeInterpreterTool
-2. **Define Tool Interfaces:** Specify the exact input/output signature for each planned tool.
-
-This structured approach should provide a solid foundation for developing the agent. The key will be modularity, robust tool implementation, and effective prompt engineering to guide the LLM.
+2. **Define Tool Interfaces:** Specify the exact input/output signature for each planned tool.
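The plan's **Agent Interface** bullet pins down the contract the agent has to satisfy: an `__init__` plus a `__call__(self, question: str) -> str` that fits the `BasicAgent` structure in `app.py`. A minimal sketch of a class with that shape is below; the LLM client, tool registry, and plan/act/synthesize loop are illustrative placeholders, not the repository's actual implementation.

```python
# Minimal sketch of an agent matching the interface described in the plan:
# __init__ plus __call__(self, question: str) -> str. The llm_client and tools
# arguments are hypothetical; swap in the project's real model and tools.

class GAIAAgent:
    def __init__(self, llm_client=None, tools=None):
        # llm_client: any callable mapping a prompt string to a completion string.
        self.llm = llm_client or (lambda prompt: "")
        # tools: mapping of tool name -> callable, e.g. {"file_reader": ...}
        self.tools = tools or {}

    def __call__(self, question: str) -> str:
        # 1. Plan: ask the LLM whether a tool is needed and which one.
        plan = self.llm(f"Question: {question}\nWhich tool, if any, is needed?")
        # 2. Act: run the first registered tool the plan mentions by name.
        observation = ""
        for name, tool in self.tools.items():
            if name in plan:
                observation = tool(question)
                break
        # 3. Synthesize: produce the short final answer expected by exact-match scoring.
        answer = self.llm(
            f"Question: {question}\nTool output: {observation}\n"
            "Reply with the final answer only, no explanation."
        )
        return answer.strip()
```

Keeping the LLM call and the tool registry injectable makes it easy to stub either one while testing a single `task_id` at a time, as the plan's Testing step suggests.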
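For the **Define Tool Interfaces** step, one way to "specify the exact input/output signature for each planned tool" is a shared protocol plus a small result type, sketched below with `FileReaderTool` as the example. The `ToolResult` fields and the dispatch-by-extension rule are assumptions made for illustration, not decisions recorded in the plan.

```python
# Hypothetical tool interface: every tool exposes a name and run(task_input) -> ToolResult.
# FileReaderTool is used as the worked example since the plan pairs it with vision,
# speech-to-text, and code-interpreter tools for the file-based questions.

from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class ToolResult:
    content: str          # text handed back to the LLM for synthesis
    success: bool = True  # lets the agent fall back or retry on failure


class Tool(Protocol):
    name: str

    def run(self, task_input: str) -> ToolResult: ...


class FileReaderTool:
    """Reads a task attachment and routes it by extension (image/audio/code/text)."""

    name = "file_reader"

    def run(self, task_input: str) -> ToolResult:
        path = Path(task_input)
        if not path.exists():
            return ToolResult(content=f"file not found: {path}", success=False)
        if path.suffix in {".png", ".jpg"}:
            return ToolResult(content=f"[image attachment: {path.name}]")  # hand off to a vision tool
        if path.suffix in {".mp3", ".wav"}:
            return ToolResult(content=f"[audio attachment: {path.name}]")  # hand off to speech-to-text
        return ToolResult(content=path.read_text(errors="replace"))        # code or plain text
```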
docs/testing_recipe.md
CHANGED
@@ -1,6 +1,3 @@
-Pensó durante 4 segundos
-
-
 Below is a practical, lightweight recipe you can adapt to measure **exact-match accuracy** (the metric GAIA uses) on your new evaluation file.
 
 ---
@@ -113,6 +110,4 @@ python3 evaluate_agent.py question_set/common_questions.json
 ### 5 Interpreting results
 
 * **Exact-match accuracy** (>= 100 % means your agent reproduced all answers).
-* **Latency** helps you spot outliers in run time (e.g. long tool chains).
-
-That’s all you need to benchmark quickly. Happy testing!
+* **Latency** helps you spot outliers in run time (e.g. long tool chains).
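The recipe's core loop (run the agent on each question, `normalize` both strings, count exact matches, and record latency) could look roughly like the sketch below. The JSON field names (`question`, `final_answer`, `task_id`) and this placeholder `normalize` are assumptions; substitute the project's actual `normalize` and the keys used in `question_set/common_questions.json`.

```python
# Sketch of an exact-match evaluation loop with per-question latency.
# Field names and normalize() are assumptions, not the project's real helpers.

import json
import re
import sys
import time


def normalize(text: str) -> str:
    # Placeholder normalization: lowercase, strip punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def evaluate(agent, questions_path: str) -> float:
    with open(questions_path) as fh:
        questions = json.load(fh)

    correct = 0
    for item in questions:
        start = time.perf_counter()
        predicted = agent(item["question"])
        latency = time.perf_counter() - start
        hit = normalize(predicted) == normalize(item["final_answer"])
        correct += hit
        print(f"{item.get('task_id', '?')}: {'OK' if hit else 'MISS'} ({latency:.1f}s)")

    accuracy = correct / len(questions) if questions else 0.0
    print(f"exact-match accuracy: {accuracy:.1%}")
    return accuracy


if __name__ == "__main__":
    # e.g. python3 this_sketch.py question_set/common_questions.json
    stub_agent = lambda q: ""  # stand-in; plug in the real agent here
    evaluate(stub_agent, sys.argv[1])
```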
notes.md
CHANGED
@@ -1,4 +1,6 @@
-#
+# NOTES
+## general notes
+- There are 5 questions that require the interpretation of a file
 
 ## utilities/ Python scripts:
 - random_questions.py: fetches random questions from the GAIA API