Yago Bolivar
committed on

Commit · aa49c02
Parent(s): d1780a5

chore: update documentation and add evaluation scripts for GAIA project

Browse files
- README.md +3 -1
- TODO.md +6 -1
- docs/evaluate_local_commands.md +17 -0
- docs/log.md +11 -0
- docs/submission_instructions.md +14 -0
- docs/testing_recipe.md +118 -0
- notes.md +9 -0
README.md
CHANGED
@@ -12,4 +12,6 @@ hf_oauth_expiration_minutes: 480
 short_description: 'Design and implementation of an advanced AI agent '
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+
+Notes at: [notes.md](notes.md)
TODO.md
CHANGED
@@ -1,2 +1,7 @@
 - get GAIA answers specification
--
+- develop agent
+- how to set up my app.py for tests
+- how will I test the agent. Check out [testing_recipe.md](docs/testing_recipe.md)
+- check gpt4all as test model
+x find a way to evaluate the performance of the agent -> now we have a dataset with answers
+x what model will it be tested with? -> DEFAULT_API_URL in app.py
docs/evaluate_local_commands.md
ADDED
@@ -0,0 +1,17 @@
**Run the Evaluation Script:** Open your terminal, navigate to the `utilities` directory, and run the script:

* **Evaluate all levels:**
  ```bash
  cd /Users/yagoairm2/Desktop/agents/final\ projectHF_Agents_Final_Project/utilities
  python evaluate_local.py --answers_file ../agent_answers.json
  ```
* **Evaluate only Level 1:**
  ```bash
  python evaluate_local.py --answers_file ../agent_answers.json --level 1
  ```
* **Evaluate Level 1 and show incorrect answers:**
  ```bash
  python evaluate_local.py --answers_file ../agent_answers.json --level 1 --verbose
  ```

This script calculates and prints the accuracy based on the exact-match criterion used by GAIA, without submitting anything to the official leaderboard.
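For reference, the exact-match criterion that `evaluate_local.py` applies can be approximated in a few lines. This is only an illustrative sketch of the metric, not the actual script:

```python
# Illustrative only: GAIA-style quasi-exact match after trimming and lowercasing.
def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()

def accuracy(pairs: list[tuple[str, str]]) -> float:
    # pairs holds (predicted answer, ground-truth answer) tuples
    return sum(exact_match(p, g) for p, g in pairs) / len(pairs)
```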
docs/log.md
ADDED
@@ -0,0 +1,11 @@
# Log

- Checked the API documentation and endpoints
- Downloaded questions
- Downloaded validation question set
- Extracted answers from validation and put them in common_questions.json
- Identified a default API URL sourcing the model that will drive the agent
- Created a script to test whether gpt4all is working
- Found Meta-Llama-3-8B-Instruct.Q4_0.gguf in /yagoairm2/Library/Application Support/nomic.ai/GPT4All/Meta-Llama-3-8B-Instruct.Q4_0.gguf and successfully loaded it
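A load check along the lines of the script mentioned above can be very small. The sketch below is hypothetical (it assumes the `gpt4all` Python bindings and only mirrors what `check_gpt4all.py` is described as doing), not the committed script:

```python
# Hypothetical smoke test: load the local GGUF model and generate a few tokens.
from gpt4all import GPT4All

# GPT4All resolves model names against its default models directory;
# pass model_path="..." explicitly if the .gguf file lives elsewhere.
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")
print(model.generate("Reply with the single word: ok", max_tokens=8))
```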
docs/submission_instructions.md
ADDED
@@ -0,0 +1,14 @@
Submissions

Results can be submitted for both validation and test. Scores are expressed as the percentage of correct answers for a given split.

Each question calls for an answer that is either a string (one or a few words), a number, or a comma separated list of strings or floats, unless specified otherwise. There is only one correct answer. Hence, evaluation is done via quasi exact match between a model’s answer and the ground truth (up to some normalization that is tied to the “type” of the ground truth).

In our evaluation, we use a system prompt to instruct the model about the required format:

You are a general AI assistant. I will ask you a question. Report your thoughts, and finish your answer with the following template: FINAL ANSWER: [YOUR FINAL ANSWER]. YOUR FINAL ANSWER should be a number OR as few words as possible OR a comma separated list of numbers and/or strings. If you are asked for a number, don't use comma to write your number neither use units such as $ or percent sign unless specified otherwise. If you are asked for a string, don't use articles, neither abbreviations (e.g. for cities), and write the digits in plain text unless specified otherwise. If you are asked for a comma separated list, apply the above rules depending of whether the element to be put in the list is a number or a string.

We advise you to use the system prompt provided in the paper to ensure your agents answer using the correct and expected format. In practice, GPT4 level models easily follow it.

We expect submissions to be json-line files with the following format. The first two fields are mandatory, reasoning_trace is optional:

{"task_id": "task_id_1", "model_answer": "Answer 1 from your model", "reasoning_trace": "The different steps by which your model reached answer 1"}
{"task_id": "task_id_2", "model_answer": "Answer 2 from your model", "reasoning_trace": "The different steps by which your model reached answer 2"}
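That JSON-lines format is straightforward to produce from a dict of answers. The helper below is a hypothetical sketch (the file name and function are not part of this commit):

```python
# Hypothetical helper: write {task_id: answer} pairs as a GAIA submission file.
import json

def write_submission(answers: dict[str, str], path: str = "submission.jsonl") -> None:
    with open(path, "w", encoding="utf-8") as f:
        for task_id, model_answer in answers.items():
            record = {"task_id": task_id, "model_answer": model_answer}
            # "reasoning_trace" is optional and could be added to the record as well
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

write_submission({"task_id_1": "Answer 1 from your model"})
```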
docs/testing_recipe.md
ADDED
@@ -0,0 +1,118 @@
Below is a practical, lightweight recipe you can adapt to measure **exact-match accuracy** (the metric GAIA uses) on your new evaluation file.

---

### 1 Define a thin wrapper around your agent

```python
# agent_wrapper.py
from typing import Dict

class MyAgent:
    """
    Replace the `answer` method with however you call your own agent
    (API call, local model .predict(), etc.).
    """
    def answer(self, record: Dict) -> str:
        prompt = record["question"]
        # ► ► your code here ◄ ◄
        response = ...  # the raw answer string
        return response.strip()
```
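As a concrete starting point, the wrapper could drive the local GPT4All model noted in the log and parse the FINAL ANSWER template from the submission instructions. This is a hypothetical sketch (model name, prompt, and parsing are assumptions), not the project's actual agent:

```python
# example_agent.py: one possible MyAgent.answer(), assuming the gpt4all Python bindings
from typing import Dict
from gpt4all import GPT4All

SYSTEM = (
    "You are a general AI assistant. I will ask you a question. "
    "Finish your answer with: FINAL ANSWER: [YOUR FINAL ANSWER]."
)

class MyAgent:
    def __init__(self) -> None:
        self.model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")

    def answer(self, record: Dict) -> str:
        prompt = f"{SYSTEM}\n\nQuestion: {record['question']}\nAnswer:"
        raw = self.model.generate(prompt, max_tokens=512, temp=0.0)
        # keep only the text after the FINAL ANSWER marker, if present
        marker = "FINAL ANSWER:"
        return raw.split(marker, 1)[-1].strip() if marker in raw else raw.strip()
```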

---

### 2 Normalization helpers (GAIA style)

```python
# normalize.py
import re

def normalize(ans: str) -> str:
    """
    GAIA scoring ≈ quasi-exact match after:
      • trim / collapse whitespace
      • lowercase (safe for numbers, too)
    Extend if you need custom rules (e.g. strip trailing $ or %).
    """
    ans = ans.strip().lower()
    ans = re.sub(r"\s+", " ", ans)  # collapse inner whitespace runs
    return ans
```
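For example, under these rules the following comparisons behave as shown (illustrative doctest-style snippet):

```python
>>> from normalize import normalize
>>> normalize("  New   York ") == normalize("new york")
True
>>> normalize("3,5") == normalize("3, 5")  # only whitespace runs are collapsed, commas are untouched
False
```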

---

### 3 Evaluation script

```python
# evaluate_agent.py
import json, argparse, pathlib, time
from typing import Dict, List

from agent_wrapper import MyAgent
from normalize import normalize

def load_records(path: pathlib.Path) -> List[Dict]:
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)  # your new file is a JSON array

def main(path_eval: str, limit: int | None = None):
    eval_path = pathlib.Path(path_eval)
    records = load_records(eval_path)
    if limit:
        records = records[:limit]

    agent = MyAgent()
    n_total = len(records)
    n_correct = 0
    latencies = []

    for rec in records:
        t0 = time.perf_counter()
        pred = agent.answer(rec)
        latencies.append(time.perf_counter() - t0)

        # ground truth: try the key variants that appear in the evaluation file
        gold = rec.get("Final answer") or rec.get("final answer") or ""
        if normalize(pred) == normalize(gold):
            n_correct += 1

    acc = n_correct / n_total * 100
    print(f"Accuracy: {n_correct}/{n_total} ({acc:.2f}%)")
    print(f"Median latency: {sorted(latencies)[len(latencies)//2]:.2f}s")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("eval_json", help="common_questions.json (or other)")
    parser.add_argument("--limit", type=int, help="debug with first N records")
    args = parser.parse_args()
    main(args.eval_json, args.limit)
```

*Run*:

```bash
python3 evaluate_agent.py question_set/common_questions.json
```

---

### 4 Customizing

| Need | Where to tweak |
| ----------------------------------------------------------------------- | ----------------------------------------- |
| **Agent call** (local model vs. API with keys, tool-use, etc.) | `MyAgent.answer()` |
| **More elaborate normalization** (e.g. strip `$` or `%`, round numbers) | `normalize()` |
| **Partial credit / numeric tolerance** | Replace the `==` line with your own logic (see sketch below) |
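For the last row, one hedged possibility (names are illustrative; `normalize` is the helper from section 2):

```python
import math
from normalize import normalize

def matches(pred: str, gold: str, rel_tol: float = 1e-6) -> bool:
    """Numeric tolerance when both sides parse as numbers, exact match otherwise."""
    try:
        return math.isclose(float(pred), float(gold), rel_tol=rel_tol)
    except ValueError:
        return normalize(pred) == normalize(gold)
```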

---

### 5 Interpreting results

* **Exact-match accuracy** (100 % means your agent reproduced every answer).
* **Latency** helps you spot outliers in run time (e.g. long tool chains).

That’s all you need to benchmark quickly. Happy testing!
notes.md
CHANGED
@@ -6,15 +6,24 @@
 - fetch_all_questions.py: fetches all questions from the GAIA API
 - random_question_submit.py: fetches a random question and submits the answer to the GAIA API
 - evaluate_local.py: evaluates questions locally
+- common_questions.py: finds common questions between validation.json and gaia_questions.json, and formats them in JSON
+- check_gpt4all.py: checks if the gpt4all model is working
 
 ## docs/ Project documentation:
 - project_overview.md: overview of the project
 - API.md: API documentation
+- scorer.py: GAIA scoring function
+- submission_instructions.md: GAIA submission instructions
 - pdf/: PDF files for the project
+- testing_recipe.md: testing recipe for the project (not used yet)
+- evaluate_local_commands.md: commands to evaluate the agent locally
+- log.md: log of the project
 
 ## question_set: GAIA question set
 - gaia_questions.json: JSON file with the GAIA question set
 - new_gaia_questions.json: JSON file with the new GAIA question set
+- validation.json: JSON file with the validation set from GAIA
+- common_questions.json: JSON file with the common questions between validation.json and gaia_questions.json, including the answers
 
 ## answers/ agent's answers
 - agent_answers.json: JSON file with the agent's answers