
Below is a practical, lightweight recipe you can adapt to measure exact-match accuracy (the metric GAIA uses) on your new evaluation file.


1 Define a thin wrapper around your agent

# agent_wrapper.py
from typing import Dict

class MyAgent:
    """
    Replace the `answer` method with however you call your own agent
    (API call, local model .predict(), etc.). 
    """
    def answer(self, record: Dict) -> str:
        prompt = record["question"]
        # ► ► your code here ◄ ◄
        response = ...                 # the raw answer string
        return response.strip()
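
For example, if your agent is simply a hosted chat model reached through `huggingface_hub`'s `InferenceClient`, the wrapper might look like the sketch below. The model name and the `HF_TOKEN` environment variable are placeholder assumptions; substitute whatever client, model, or tool-using loop your agent actually runs.

# agent_wrapper.py -- one possible implementation (illustrative, not the only way)
import os
from typing import Dict

from huggingface_hub import InferenceClient

class MyAgent:
    def __init__(self):
        # Placeholder model and token handling: adapt to your own setup.
        self.client = InferenceClient(
            model="meta-llama/Llama-3.1-8B-Instruct",
            token=os.environ.get("HF_TOKEN"),
        )

    def answer(self, record: Dict) -> str:
        resp = self.client.chat_completion(
            messages=[{"role": "user", "content": record["question"]}],
            max_tokens=256,
        )
        return resp.choices[0].message.content.strip()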

2 Normalization helpers (GAIA style)

# normalize.py
import re

def normalize(ans: str) -> str:
    """
    GAIA scoring ≈ quasi-exact match after:
      • trim / collapse whitespace
      • lowercase (safe for numbers, too)
    Extend if you need custom rules (e.g. strip trailing $ or %).
    """
    ans = ans.strip().lower()
    ans = re.sub(r"\s+", " ", ans)         # collapse inner whitespace
    return ans
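
If your gold answers contain currency symbols, percent signs, or thousands separators, one way to extend the helper is sketched below. These substitutions are examples for illustration, not GAIA's official scoring rules.

# normalize.py (continued) -- optional extra rules, adapt to your data
def normalize_extended(ans: str) -> str:
    """normalize() plus a few unit / punctuation rules."""
    ans = normalize(ans)
    ans = ans.replace("$", "").replace("%", "")   # strip currency / percent signs
    ans = ans.replace(",", "")                    # "1,234" -> "1234"
    return ans.strip()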

3 Evaluation script

# evaluate_agent.py
import json, argparse, pathlib, time
from typing import Dict, List

from agent_wrapper import MyAgent
from normalize import normalize

def load_records(path: pathlib.Path) -> List[Dict]:
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)               # your new file is a JSON array

def main(path_eval: str, limit: int | None = None):
    eval_path = pathlib.Path(path_eval)
    records = load_records(eval_path)
    if limit:
        records = records[:limit]

    agent = MyAgent()
    n_total = len(records)
    n_correct = 0
    latencies = []

    for rec in records:
        t0 = time.perf_counter()
        pred = agent.answer(rec)
        latencies.append(time.perf_counter() - t0)

        # GAIA metadata stores the gold answer under the "Final answer" key.
        gold = rec.get("Final answer") or rec.get("final answer") or ""
        if normalize(pred) == normalize(gold):
            n_correct += 1

    acc = n_correct / n_total * 100
    print(f"Accuracy: {n_correct}/{n_total}  ({acc:.2f}%)")
    print(f"Median latency: {sorted(latencies)[len(latencies)//2]:.2f}s")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("eval_json", help="common_questions.json (or other)")
    parser.add_argument("--limit", type=int, help="debug with first N records")
    args = parser.parse_args()
    main(args.eval_json, args.limit)

Run:

python3 evaluate_agent.py question_set/common_questions.json
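
The script assumes the file is a JSON array whose records carry at least a question and a gold answer. A minimal record might look like this (the field values, and the `task_id` field, are invented for illustration):

[
  {
    "task_id": "001",
    "question": "What is the capital of France?",
    "Final answer": "Paris"
  }
]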

4 Customizing

| Need | Where to tweak |
|------|----------------|
| Agent call (local model vs. API with keys, tool use, etc.) | `MyAgent.answer()` |
| More elaborate normalization (e.g. strip `$` or `%`, round numbers) | `normalize()` |
| Partial credit / numeric tolerance | Replace the `==` comparison with your own logic (see the sketch below) |
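
For instance, a drop-in replacement for the strict comparison that also accepts numeric answers within a relative tolerance (the 1e-4 tolerance below is arbitrary; tune it to your data):

# compare.py -- sketch of a tolerant matcher
import math
from normalize import normalize

def is_match(pred: str, gold: str, rel_tol: float = 1e-4) -> bool:
    """Exact match after normalization, with a numeric fallback."""
    p, g = normalize(pred), normalize(gold)
    if p == g:
        return True
    try:
        # Accept e.g. "3.1416" vs "3.14159" when both sides parse as floats.
        return math.isclose(float(p), float(g), rel_tol=rel_tol)
    except ValueError:
        return False

In evaluate_agent.py, swap the `==` check for `if is_match(pred, gold):`.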

5 Interpreting results

  • Exact-match accuracy (100 % means your agent reproduced every gold answer).
  • Latency helps you spot outliers in run time (e.g. long tool chains).
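
Because the script appends latencies in record order, you can pair them back with the records to surface the slow cases. A small helper along these lines (the 2× median threshold is an arbitrary convention) would do:

# latency_report.py -- sketch for spotting slow records
import statistics
from typing import Dict, List

def report_outliers(latencies: List[float], records: List[Dict], factor: float = 2.0) -> None:
    """Print every record whose latency exceeds `factor` x the median."""
    med = statistics.median(latencies)
    for lat, rec in zip(latencies, records):
        if lat > factor * med:
            print(f"{lat:6.2f}s  {rec['question'][:60]}")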