Yago Bolivar
committed on

Commit · aa49c02
Parent(s): d1780a5

chore: update documentation and add evaluation scripts for GAIA project

Browse files
- README.md +3 -1
- TODO.md +6 -1
- docs/evaluate_local_commands.md +17 -0
- docs/log.md +11 -0
- docs/submission_instructions.md +14 -0
- docs/testing_recipe.md +118 -0
- notes.md +9 -0
README.md
CHANGED
@@ -12,4 +12,6 @@ hf_oauth_expiration_minutes: 480
 short_description: 'Design and implementation of an advanced AI agent '
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+Notes at: [notes.md](notes.md)
TODO.md
CHANGED
@@ -1,2 +1,7 @@
 - get GAIA answers specification
--
+- develop agent
+- how to set up my app.py for tests
+- how will I test the agent. Check out [testing_recipe.md](docs/testing_recipe.md)
+- check gpt4all as test model
+x find a way to evaluate the performance of the agent -> now we have a dataset with answers
+x what model will it be tested with? -> DEFAULT_API_URL in app.py
docs/evaluate_local_commands.md
ADDED
@@ -0,0 +1,17 @@
**Run the Evaluation Script:** Open your terminal, navigate to the `utilities` directory, and run the script:

* **Evaluate all levels:**
  ```bash
  cd /Users/yagoairm2/Desktop/agents/final\ projectHF_Agents_Final_Project/utilities
  python evaluate_local.py --answers_file ../agent_answers.json
  ```
* **Evaluate only Level 1:**
  ```bash
  python evaluate_local.py --answers_file ../agent_answers.json --level 1
  ```
* **Evaluate Level 1 and show incorrect answers:**
  ```bash
  python evaluate_local.py --answers_file ../agent_answers.json --level 1 --verbose
  ```

This script calculates and prints the accuracy based on the exact-match criterion used by GAIA, without submitting anything to the official leaderboard.
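For reference, the exact-match criterion that `evaluate_local.py` applies can be approximated in a few lines. This is only an illustrative sketch of the metric, not the actual script:

```python
# Illustrative only: GAIA-style quasi-exact match after trimming and lowercasing.
def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def accuracy(pairs: list[tuple[str, str]]) -> float:
    # pairs holds (predicted answer, ground-truth answer) tuples
    return sum(exact_match(p, g) for p, g in pairs) / len(pairs)
```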
docs/log.md
ADDED
@@ -0,0 +1,11 @@
# Log

- Checked the API documentation and endpoints
- Downloaded questions
- Downloaded validation question set
- Extracted answers from validation and put them in common_questions.json
- Identified a default API URL sourcing the model that will drive the agent
- Created a script to test whether gpt4all is working
- Found Meta-Llama-3-8B-Instruct.Q4_0.gguf in /yagoairm2/Library/Application Support/nomic.ai/GPT4All/Meta-Llama-3-8B-Instruct.Q4_0.gguf and successfully loaded it
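A load check along the lines of the script mentioned above can be very small. The sketch below is hypothetical (it assumes the `gpt4all` Python bindings and only mirrors what `check_gpt4all.py` is described as doing), not the committed script:

```python
# Hypothetical smoke test: load the local GGUF model and generate a few tokens.
from gpt4all import GPT4All

# GPT4All resolves model names against its default models directory;
# pass model_path="..." explicitly if the .gguf file lives elsewhere.
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")
print(model.generate("Reply with the single word: ok", max_tokens=8))
```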
docs/submission_instructions.md
ADDED
@@ -0,0 +1,14 @@
Submissions

Results can be submitted for both validation and test. Scores are expressed as the percentage of correct answers for a given split.

Each question calls for an answer that is either a string (one or a few words), a number, or a comma separated list of strings or floats, unless specified otherwise. There is only one correct answer. Hence, evaluation is done via quasi exact match between a model’s answer and the ground truth (up to some normalization that is tied to the “type” of the ground truth).

In our evaluation, we use a system prompt to instruct the model about the required format:

You are a general AI assistant. I will ask you a question. Report your thoughts, and finish your answer with the following template: FINAL ANSWER: [YOUR FINAL ANSWER]. YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings. If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise. If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise. If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string.

We advise you to use the system prompt provided in the paper to ensure your agents answer using the correct and expected format. In practice, GPT4 level models easily follow it.

We expect submissions to be json-line files with the following format. The first two fields are mandatory, reasoning_trace is optional:

{"task_id": "task_id_1", "model_answer": "Answer 1 from your model", "reasoning_trace": "The different steps by which your model reached answer 1"}
{"task_id": "task_id_2", "model_answer": "Answer 2 from your model", "reasoning_trace": "The different steps by which your model reached answer 2"}
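That JSON-lines format is straightforward to produce from a dict of answers. The helper below is a hypothetical sketch (the file name and function are not part of this commit):

```python
# Hypothetical helper: write {task_id: answer} pairs as a GAIA submission file.
import json

def write_submission(answers: dict[str, str], path: str = "submission.jsonl") -> None:
    with open(path, "w", encoding="utf-8") as f:
        for task_id, model_answer in answers.items():
            record = {"task_id": task_id, "model_answer": model_answer}
            # "reasoning_trace" is optional and could be added to the record as well
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

write_submission({"task_id_1": "Answer 1 from your model"})
```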
docs/testing_recipe.md
ADDED
@@ -0,0 +1,118 @@
Below is a practical, lightweight recipe you can adapt to measure **exact-match accuracy** (the metric GAIA uses) on your new evaluation file.

---

### 1 Define a thin wrapper around your agent

```python
# agent_wrapper.py
from typing import Dict

class MyAgent:
    """
    Replace the `answer` method with however you call your own agent
    (API call, local model .predict(), etc.).
    """
    def answer(self, record: Dict) -> str:
        prompt = record["question"]
        # ► ► your code here ◄ ◄
        response = ...  # the raw answer string
        return response.strip()
```
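As a concrete starting point, the wrapper could drive the local GPT4All model noted in the log and parse the FINAL ANSWER template from the submission instructions. This is a hypothetical sketch (model name, prompt, and parsing are assumptions), not the project's actual agent:

```python
# example_agent.py: one possible MyAgent.answer(), assuming the gpt4all Python bindings
from typing import Dict
from gpt4all import GPT4All

SYSTEM = (
    "You are a general AI assistant. I will ask you a question. "
    "Finish your answer with: FINAL ANSWER: [YOUR FINAL ANSWER]."
)

class MyAgent:
    def __init__(self) -> None:
        self.model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")

    def answer(self, record: Dict) -> str:
        prompt = f"{SYSTEM}\n\nQuestion: {record['question']}\nAnswer:"
        raw = self.model.generate(prompt, max_tokens=512, temp=0.0)
        # keep only the text after the FINAL ANSWER marker, if present
        marker = "FINAL ANSWER:"
        return raw.split(marker, 1)[-1].strip() if marker in raw else raw.strip()
```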

---

### 2 Normalization helpers (GAIA style)

```python
# normalize.py
import re

def normalize(ans: str) -> str:
    """
    GAIA scoring ≈ quasi-exact match after:
      • trim / collapse whitespace
      • lowercase (safe for numbers, too)
    Extend if you need custom rules (e.g. strip trailing $ or %).
    """
    ans = ans.strip().lower()
    ans = re.sub(r"\s+", " ", ans)  # collapse inner whitespace runs
    return ans
```
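For example, under these rules the following comparisons behave as shown (illustrative doctest-style snippet):

```python
>>> from normalize import normalize
>>> normalize("  New   York ") == normalize("new york")
True
>>> normalize("3,5") == normalize("3, 5")  # only whitespace runs are collapsed, commas are untouched
False
```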

---

### 3 Evaluation script

```python
# evaluate_agent.py
import json, argparse, pathlib, time
from typing import Dict, List

from agent_wrapper import MyAgent
from normalize import normalize

def load_records(path: pathlib.Path) -> List[Dict]:
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)  # your new file is a JSON array

def main(path_eval: str, limit: int | None = None):
    eval_path = pathlib.Path(path_eval)
    records = load_records(eval_path)
    if limit:
        records = records[:limit]

    agent = MyAgent()
    n_total = len(records)
    n_correct = 0
    latencies = []

    for rec in records:
        t0 = time.perf_counter()
        pred = agent.answer(rec)
        latencies.append(time.perf_counter() - t0)

        # ground truth: try the key variants that appear in the evaluation file
        gold = rec.get("Final answer") or rec.get("final answer") or ""
        if normalize(pred) == normalize(gold):
            n_correct += 1

    acc = n_correct / n_total * 100
    print(f"Accuracy: {n_correct}/{n_total} ({acc:.2f}%)")
    print(f"Median latency: {sorted(latencies)[len(latencies)//2]:.2f}s")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("eval_json", help="common_questions.json (or other)")
    parser.add_argument("--limit", type=int, help="debug with first N records")
    args = parser.parse_args()
    main(args.eval_json, args.limit)
```

*Run*:

```bash
python3 evaluate_agent.py question_set/common_questions.json
```

---

### 4 Customizing

| Need | Where to tweak |
| ----------------------------------------------------------------------- | ----------------------------------------- |
| **Agent call** (local model vs. API with keys, tool-use, etc.) | `MyAgent.answer()` |
| **More elaborate normalization** (e.g. strip `$` or `%`, round numbers) | `normalize()` |
| **Partial credit / numeric tolerance** | Replace the `==` line with your own logic (see sketch below) |
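For the last row, one hedged possibility (names are illustrative; `normalize` is the helper from section 2):

```python
import math
from normalize import normalize

def matches(pred: str, gold: str, rel_tol: float = 1e-6) -> bool:
    """Numeric tolerance when both sides parse as numbers, exact match otherwise."""
    try:
        return math.isclose(float(pred), float(gold), rel_tol=rel_tol)
    except ValueError:
        return normalize(pred) == normalize(gold)
```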

---

### 5 Interpreting results

* **Exact-match accuracy** (100 % means your agent reproduced every answer).
* **Latency** helps you spot outliers in run time (e.g. long tool chains).

That’s all you need to benchmark quickly. Happy testing!
notes.md
CHANGED
@@ -6,15 +6,24 @@
 - fetch_all_questions.py: fetches all questions from the GAIA API
 - random_question_submit.py: fetches a random question and submits the answer to the GAIA API
 - evaluate_local.py: evaluates questions locally
+- common_questions.py: finds common questions between validation.json and gaia_questions.json, and formats them in JSON
+- check_gpt4all.py: checks if the gpt4all model is working
 
 ## docs/ Project documentation:
 - project_overview.md: overview of the project
 - API.md: API documentation
+- scorer.py: GAIA scoring function
+- submission_instructions.md: GAIA submission instructions
 - pdf/: PDF files for the project
+- testing_recipe.md: testing recipe for the project (not used yet)
+- evaluate_local_commands.md: commands to evaluate the agent locally
+- log.md: log of the project
 
 ## question_set: GAIA question set
 - gaia_questions.json: JSON file with the GAIA question set
 - new_gaia_questions.json: JSON file with the new GAIA question set
+- validation.json: JSON file with the validation set from GAIA
+- common_questions.json: JSON file with the common questions between validation.json and gaia_questions.json, including the answers
 
 ## answers/ agent's answers
 - agent_answers.json: JSON file with the agent's answers