Below is a practical, lightweight recipe you can adapt to measure **exact-match accuracy** (the metric GAIA uses) on your new evaluation file.
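
The scripts below assume each record in that file is a JSON object with at least a `question` field and a gold `Final answer` field. A minimal sketch of one record (field values are purely illustrative):

```python
# One record, as the evaluator below expects it (values are made up).
record = {
    "question": "What is the capital of France?",
    "Final answer": "Paris",
    # any extra metadata fields are simply ignored by the evaluator
}
```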

---

### 1  Define a thin wrapper around your agent

```python
# agent_wrapper.py
from typing import Dict

class MyAgent:
    """
    Replace the `answer` method with however you call your own agent
    (API call, local model .predict(), etc.). 
    """
    def answer(self, record: Dict) -> str:
        prompt = record["question"]
        # ► ► your code here ◄ ◄
        response = ...                 # the raw answer string
        return response.strip()
```
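
As one concrete (purely illustrative) way to fill in the stub, assuming the `openai` Python client (v1+), an `OPENAI_API_KEY` in the environment, and a placeholder model name:

```python
# agent_wrapper_openai.py — a sketch, not the only way to implement MyAgent.
from typing import Dict

from openai import OpenAI   # assumes `pip install openai` (>= 1.0)

class MyAgent:
    def __init__(self, model: str = "gpt-4o-mini"):   # placeholder model name
        self.client = OpenAI()                        # reads OPENAI_API_KEY
        self.model = model

    def answer(self, record: Dict) -> str:
        prompt = record["question"]
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content.strip()
```

A local model works the same way: anything that turns `record["question"]` into a string satisfies the interface.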

---

### 2  Normalization helpers (GAIA style)

```python
# normalize.py
import re

def normalize(ans: str) -> str:
    """
    GAIA scoring ≈ quasi-exact match after:
      • trim / collapse whitespace
      • lowercase (safe for numbers, too)
    Extend if you need custom rules (e.g. strip trailing $ or %).
    """
    ans = ans.strip().lower()
    ans = re.sub(r"\s+", " ", ans)         # collapse inner spaces
    return ans
```
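
If your gold answers contain currency or percent signs or thousands separators, an extended variant (an illustrative assumption about *your* data, not GAIA's official scorer) might look like:

```python
# normalize_extended.py — optional, stricter-cleanup variant of normalize().
import re

def normalize_extended(ans: str) -> str:
    ans = ans.strip().lower()
    ans = re.sub(r"\s+", " ", ans)   # collapse inner whitespace
    ans = ans.replace(",", "")       # drop thousands separators: "3,435" -> "3435"
    ans = ans.strip("$%€ ")          # drop leading/trailing currency / percent signs
    return ans
```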

---

### 3  Evaluation script

```python
# evaluate_agent.py
import json, argparse, pathlib, time
from typing import Dict, List

from agent_wrapper import MyAgent
from normalize import normalize

def load_records(path: pathlib.Path) -> List[Dict]:
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)               # your new file is a JSON array

def main(path_eval: str, limit: int | None = None):
    eval_path = pathlib.Path(path_eval)
    records = load_records(eval_path)
    if limit:
        records = records[:limit]

    agent = MyAgent()
    n_total = len(records)
    n_correct = 0
    latencies = []

    for rec in records:
        t0 = time.perf_counter()
        pred = agent.answer(rec)
        latencies.append(time.perf_counter() - t0)

        gold = rec.get("Final answer") or rec.get("Final answer.".lower()) \
               or rec.get("Final answer".lower()) or rec.get("Final answer", "")
        if normalize(pred) == normalize(gold):
            n_correct += 1

    acc = n_correct / n_total * 100
    print(f"Accuracy: {n_correct}/{n_total}  ({acc:.2f}%)")
    print(f"Median latency: {sorted(latencies)[len(latencies)//2]:.2f}s")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("eval_json", help="common_questions.json (or other)")
    parser.add_argument("--limit", type=int, help="debug with first N records")
    args = parser.parse_args()
    main(args.eval_json, args.limit)
```

*Run*:

```bash
python3 evaluate_agent.py question_set/common_questions.json
```

---

### 4  Customizing

| Need                                                                    | Where to tweak                            |
| ----------------------------------------------------------------------- | ----------------------------------------- |
| **Agent call** (local model vs. API with keys, tool-use, etc.)          | `MyAgent.answer()`                        |
| **More elaborate normalization** (e.g. strip `$` or `%`, round numbers) | `normalize()`                             |
| **Partial credit / numeric tolerance**                                  | Replace the `==` line with your own logic (see the sketch below) |
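
For that last row, a minimal sketch of a more forgiving comparison (the relative tolerance is an arbitrary choice, tune it to your data):

```python
# match.py — optional replacement for the strict equality check.
import math

from normalize import normalize

def _to_float(s: str) -> float:
    # strip thousands separators and leading/trailing $, %, spaces before parsing
    return float(normalize(s).replace(",", "").strip("$% "))

def is_match(pred: str, gold: str, rel_tol: float = 1e-4) -> bool:
    """Exact match after normalization, with a numeric fallback."""
    if normalize(pred) == normalize(gold):
        return True
    try:
        return math.isclose(_to_float(pred), _to_float(gold), rel_tol=rel_tol)
    except ValueError:               # at least one side is not a number
        return False
```

In `evaluate_agent.py`, the `if normalize(pred) == normalize(gold):` line then becomes `if is_match(pred, gold):`.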

---

### 5  Interpreting results

* **Exact-match accuracy**: 100 % means your agent reproduced every gold answer exactly (after normalization).
* **Latency** helps you spot outliers in run time (e.g. long tool chains); see the sketch below for surfacing the slowest records.
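
A minimal sketch for surfacing those outliers, assuming you keep `records` and `latencies` from `evaluate_agent.py` in the same order:

```python
# slow_outliers.py — print the slowest questions for manual inspection.
from typing import Dict, List

def print_slowest(records: List[Dict], latencies: List[float], top_n: int = 5) -> None:
    ranked = sorted(zip(latencies, records), key=lambda pair: pair[0], reverse=True)
    for latency, rec in ranked[:top_n]:
        print(f"{latency:6.2f}s  {rec.get('question', '')[:80]}")
```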