Yago Bolivar committed
Commit aa49c02 · 1 Parent(s): d1780a5

chore: update documentation and add evaluation scripts for GAIA project

README.md CHANGED
@@ -12,4 +12,6 @@ hf_oauth_expiration_minutes: 480
  short_description: 'Design and implementation of an advanced AI agent '
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+ Notes at: [notes.md](notes.md)
TODO.md CHANGED
@@ -1,2 +1,7 @@
  - get GAIA answers specification
- - design agent
+ - develop agent
+ - how to set up my app.py for tests
+ - how will I test the agent. Check out [testing_recipe.md](docs/testing_recipe.md)
+ - check gpt4all as test model
+ x find a way to evaluate the performance of the agent -> now we have a dataset with answers
+ x what model will it be tested with? -> DEFAULT_API_URL in app.py
docs/evaluate_local_commands.md ADDED
@@ -0,0 +1,17 @@
+ **Run the Evaluation Script:** Open your terminal, navigate to the `utilities` directory, and run the script:
+
+ * **Evaluate all levels:**
+   ```bash
+   cd /Users/yagoairm2/Desktop/agents/final\ project/HF_Agents_Final_Project/utilities
+   python evaluate_local.py --answers_file ../agent_answers.json
+   ```
+ * **Evaluate only Level 1:**
+   ```bash
+   python evaluate_local.py --answers_file ../agent_answers.json --level 1
+   ```
+ * **Evaluate Level 1 and show incorrect answers:**
+   ```bash
+   python evaluate_local.py --answers_file ../agent_answers.json --level 1 --verbose
+   ```
+
+ This script calculates and prints the accuracy based on the exact-match criterion used by GAIA, without submitting anything to the official leaderboard.
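+
+ For reference, a minimal sketch of how such a command-line interface could be wired up (illustrative only; the actual `utilities/evaluate_local.py` is not reproduced here):
+
+ ```python
+ # illustrative argument handling for a local GAIA-style evaluator
+ import argparse, json
+
+ parser = argparse.ArgumentParser(description="Score agent answers locally")
+ parser.add_argument("--answers_file", required=True, help="JSON file with the agent's answers")
+ parser.add_argument("--level", type=int, help="restrict scoring to a single GAIA level")
+ parser.add_argument("--verbose", action="store_true", help="also print incorrect answers")
+ args = parser.parse_args()
+
+ with open(args.answers_file, encoding="utf-8") as f:
+     answers = json.load(f)
+ print(f"Loaded {len(answers)} answers from {args.answers_file}")
+ ```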
docs/log.md ADDED
@@ -0,0 +1,11 @@
+ # Log
+
+ - Checked the API documentation and endpoints
+ - Downloaded questions
+ - Downloaded the validation question set
+ - Extracted answers from validation and put them in common_questions.json
+ - Identified the default API URL sourcing the model that will drive the agent
+ - Created a script to test if gpt4all is working
+ - Found Meta-Llama-3-8B-Instruct.Q4_0.gguf in /yagoairm2/Library/Application Support/nomic.ai/GPT4All/Meta-Llama-3-8B-Instruct.Q4_0.gguf and successfully loaded it
docs/submission_instructions.md ADDED
@@ -0,0 +1,14 @@
+ Submissions
+ Results can be submitted for both validation and test. Scores are expressed as the percentage of correct answers for a given split.
+
+ Each question calls for an answer that is either a string (one or a few words), a number, or a comma separated list of strings or floats, unless specified otherwise. There is only one correct answer. Hence, evaluation is done via quasi exact match between a model’s answer and the ground truth (up to some normalization that is tied to the “type” of the ground truth).
+
+ In our evaluation, we use a system prompt to instruct the model about the required format:
+
+ You are a general AI assistant. I will ask you a question. Report your thoughts, and finish your answer with the following template: FINAL ANSWER: [YOUR FINAL ANSWER]. YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings. If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise. If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise. If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string.
+ We advise you to use the system prompt provided in the paper to ensure your agents answer using the correct and expected format. In practice, GPT4 level models easily follow it.
+
+ We expect submissions to be json-line files with the following format. The first two fields are mandatory, reasoning_trace is optional:
+
+ {"task_id": "task_id_1", "model_answer": "Answer 1 from your model", "reasoning_trace": "The different steps by which your model reached answer 1"}
+ {"task_id": "task_id_2", "model_answer": "Answer 2 from your model", "reasoning_trace": "The different steps by which your model reached answer 2"}
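+
+ For illustration, a minimal sketch of writing answers into this JSON-lines format (the `answers_by_task` dict and output file name are hypothetical, not part of the GAIA spec):
+
+ ```python
+ # build a GAIA-style submission file from a dict of {task_id: answer}
+ import json
+
+ answers_by_task = {"task_id_1": "Answer 1 from your model"}  # hypothetical example input
+
+ with open("submission.jsonl", "w", encoding="utf-8") as f:
+     for task_id, answer in answers_by_task.items():
+         record = {"task_id": task_id, "model_answer": answer}  # "reasoning_trace" is optional
+         f.write(json.dumps(record) + "\n")
+ ```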
docs/testing_recipe.md ADDED
@@ -0,0 +1,118 @@
+ Below is a practical, lightweight recipe you can adapt to measure **exact-match accuracy** (the metric GAIA uses) on your new evaluation file.
+
+ ---
+
+ ### 1 Define a thin wrapper around your agent
+
+ ```python
+ # agent_wrapper.py
+ from typing import Dict
+
+ class MyAgent:
+     """
+     Replace the `answer` method with however you call your own agent
+     (API call, local model .predict(), etc.).
+     """
+     def answer(self, record: Dict) -> str:
+         prompt = record["question"]
+         # ► ► your code here ◄ ◄
+         response = ...  # the raw answer string
+         return response.strip()
+ ```
+
+ ---
+
+ ### 2 Normalization helpers (GAIA style)
+
+ ```python
+ # normalize.py
+ import re
+
+ def normalize(ans: str) -> str:
+     """
+     GAIA scoring ≈ quasi-exact match after:
+       • trim / collapse whitespace
+       • lowercase (safe for numbers, too)
+     Extend if you need custom rules (e.g. strip trailing $ or %).
+     """
+     ans = ans.strip().lower()
+     ans = re.sub(r"\s+", " ", ans)  # collapse inner spaces
+     return ans
+ ```
+
+ ---
+
+ ### 3 Evaluation script
+
+ ```python
+ # evaluate_agent.py
+ import json, argparse, pathlib, time
+ from typing import Dict, List
+
+ from agent_wrapper import MyAgent
+ from normalize import normalize
+
+ def load_records(path: pathlib.Path) -> List[Dict]:
+     with path.open("r", encoding="utf-8") as f:
+         return json.load(f)  # your new file is a JSON array
+
+ def main(path_eval: str, limit: int | None = None):
+     eval_path = pathlib.Path(path_eval)
+     records = load_records(eval_path)
+     if limit:
+         records = records[:limit]
+
+     agent = MyAgent()
+     n_total = len(records)
+     n_correct = 0
+     latencies = []
+
+     for rec in records:
+         t0 = time.perf_counter()
+         pred = agent.answer(rec)
+         latencies.append(time.perf_counter() - t0)
+
+         # the validation metadata stores the ground truth under "Final answer"
+         gold = rec.get("Final answer") or rec.get("final answer") or ""
+         if normalize(pred) == normalize(str(gold)):
+             n_correct += 1
+
+     acc = n_correct / n_total * 100
+     print(f"Accuracy: {n_correct}/{n_total} ({acc:.2f}%)")
+     print(f"Median latency: {sorted(latencies)[len(latencies)//2]:.2f}s")
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser()
+     parser.add_argument("eval_json", help="common_questions.json (or other)")
+     parser.add_argument("--limit", type=int, help="debug with first N records")
+     args = parser.parse_args()
+     main(args.eval_json, args.limit)
+ ```
+
+ *Run*:
+
+ ```bash
+ python3 evaluate_agent.py question_set/common_questions.json
+ ```
+
+ ---
+
+ ### 4 Customizing
+
+ | Need | Where to tweak |
+ | ----------------------------------------------------------------------- | ----------------------------------------- |
+ | **Agent call** (local model vs. API with keys, tool-use, etc.) | `MyAgent.answer()` |
+ | **More elaborate normalization** (e.g. strip `$` or `%`, round numbers) | `normalize()` |
+ | **Partial credit / numeric tolerance** | Replace the `==` line with your own logic (see the sketch below) |
+
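+ A minimal sketch of such a tolerance check (a hypothetical `is_match` helper, not part of the recipe above; it reuses `normalize()` and assumes numeric answers parse as floats):
+
+ ```python
+ # tolerant_match.py - illustrative only, not used by evaluate_agent.py as written
+ from normalize import normalize
+
+ def is_match(pred: str, gold: str, rel_tol: float = 1e-4) -> bool:
+     """Relative tolerance for numeric answers, normalized equality otherwise."""
+     try:
+         return abs(float(pred) - float(gold)) <= rel_tol * max(1.0, abs(float(gold)))
+     except ValueError:
+         return normalize(pred) == normalize(gold)
+ ```
+
+ To use it, swap `if normalize(pred) == normalize(str(gold)):` for `if is_match(pred, str(gold)):` in the evaluation loop.
+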
+ ---
+
+ ### 5 Interpreting results
+
+ * **Exact-match accuracy** (100 % means your agent reproduced all answers).
+ * **Latency** helps you spot outliers in run time (e.g. long tool chains).
+
+ That’s all you need to benchmark quickly. Happy testing!
notes.md CHANGED
@@ -6,15 +6,24 @@
  - fetch_all_questions.py: fetches all questions from the GAIA API
  - random_question_submit.py: fetches a random question and submits the answer to the GAIA API
  - evaluate_local.py: evaluates questions locally
+ - common_questions.py: finds common questions between validation.json and gaia_questions.json and formats them as JSON
+ - check_gpt4all.py: checks if the gpt4all model is working
 
  ## docs/ Project documentation:
  - project_overview.md: overview of the project
  - API.md: API documentation
+ - scorer.py: GAIA scoring function
+ - submission_instructions.md: GAIA submission instructions
  - pdf/: PDF files for the project
+ - testing_recipe.md: testing recipe for the project (not used yet)
+ - evaluate_local_commands.md: commands to evaluate the agent locally
+ - log.md: log of the project
 
  ## question_set: GAIA question set
  - gaia_questions.json: JSON file with the GAIA question set
  - new_gaia_questions.json: JSON file with the new GAIA question set
+ - validation.json: JSON file with the validation set from GAIA
+ - common_questions.json: JSON file with the common questions between validation.json and gaia_questions.json, including the answers
 
  ## answers/ agent's answers
  - agent_answers.json: JSON file with the agent's answers