Duibonduil committed
Commit 68e0793 · verified · 1 Parent(s): 987d77e

Upload 7 files
examples/open_deep_research/README.md ADDED
@@ -0,0 +1,64 @@
1
+ # Open Deep Research
2
+
3
+ Welcome to this open replication of [OpenAI's Deep Research](https://openai.com/index/introducing-deep-research/)! This agent aims to reproduce OpenAI's system and reach similar performance on research tasks.
4
+
5
+ Read more about this implementation's goal and methods in our [blog post](https://huggingface.co/blog/open-deep-research).
6
+
7
+
8
+ This agent achieves **55% pass@1** on the GAIA validation set, compared to **67%** for the original Deep Research.
9
+
10
+ ## Setup
11
+
12
+ To get started, follow the steps below:
13
+
14
+ ### Clone the repository
15
+
16
+ ```bash
17
+ git clone https://github.com/huggingface/smolagents.git
18
+ cd smolagents/examples/open_deep_research
19
+ ```
20
+
21
+ ### Install dependencies
22
+
23
+ Run the following command to install the required dependencies from the `requirements.txt` file:
24
+
25
+ ```bash
26
+ pip install -r requirements.txt
27
+ ```
28
+
29
+ ### Install the development version of `smolagents`
30
+
31
+ ```bash
32
+ pip install -e ../../.[dev]
33
+ ```
34
+
35
+ ### Set up environment variables
36
+
37
+ The agent uses the `GoogleSearchTool` for web search, which requires an environment variable with the corresponding API key, based on the selected provider:
38
+ - `SERPAPI_API_KEY` for SerpApi: [Sign up here to get a key](https://serpapi.com/users/sign_up)
39
+ - `SERPER_API_KEY` for Serper: [Sign up here to get a key](https://serper.dev/signup)
40
+
41
+ Depending on the model you want to use, you may also need to set an API key for the model provider.
42
+ For example, to use the default `o1` model, you need to set the `OPENAI_API_KEY` environment variable.
43
+ [Sign up here to get a key](https://platform.openai.com/signup).
44
+
45
+ > [!WARNING]
46
+ > The use of the default `o1` model is restricted to tier-3 access: https://help.openai.com/en/articles/10362446-api-access-to-o1-and-o3-mini
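Once the keys are set, you can quickly verify that your `.env` file exposes them before launching a run. Below is a minimal sketch using `python-dotenv` (already listed in `requirements.txt`); adjust the key names to the search provider and model you actually use:

```python
import os

from dotenv import load_dotenv

# Loads SERPAPI_API_KEY / SERPER_API_KEY, OPENAI_API_KEY, HF_TOKEN, ... from a local .env file
load_dotenv(override=True)

for key in ("SERPER_API_KEY", "OPENAI_API_KEY", "HF_TOKEN"):  # adapt to your setup
    if not os.getenv(key):
        print(f"Missing environment variable: {key}")
```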
47
+
48
+
49
+ ## Usage
50
+
51
+ You're now ready to go! Run the `run.py` script, for example:
52
+ ```bash
53
+ python run.py --model-id "o1" "Your question here!"
54
+ ```
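You can also drive the agent from Python rather than the CLI. Here is a small sketch based on the `create_agent` helper defined in `run.py` (this is how `app.py` builds its Gradio demo):

```python
from run import create_agent

# Build the manager agent with the default o1 model and ask it a question
agent = create_agent(model_id="o1")
answer = agent.run("How many studio albums did Mercedes Sosa release before 2007?")
print(answer)
```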
55
+
56
+ ## Full reproducibility of results
57
+
58
+ The data used in our GAIA submissions was augmented as follows:
59
+ - Each single-page .pdf or .xls file was opened in a file reader (macOS Sonoma Numbers or Preview), and a ".png" screenshot was taken and added to the folder.
60
+ - Then, for any file used in a question, the file-loading system checks whether a ".png" version of the file exists and, if so, loads it instead of the original (see the sketch below).
61
+
62
+ This process was done manually but could be automated.
63
+
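For illustration, here is a minimal sketch of the ".png" substitution check described above (a hypothetical helper, not the actual loading code used for the runs):

```python
from pathlib import Path


def resolve_attachment(file_path: str) -> str:
    """Prefer a ".png" screenshot of the file if one exists next to it."""
    png_candidate = Path(file_path).with_suffix(".png")
    return str(png_candidate) if png_candidate.exists() else file_path
```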
64
+ After processing, the annotated data was uploaded to a [new dataset](https://huggingface.co/datasets/smolagents/GAIA-annotated). You need to request access (it is granted instantly).
examples/open_deep_research/analysis.ipynb ADDED
@@ -0,0 +1,457 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": null,
6
+ "metadata": {},
7
+ "outputs": [],
8
+ "source": [
9
+ "!pip install plotly kaleido datasets nbformat -U -q"
10
+ ]
11
+ },
12
+ {
13
+ "cell_type": "code",
14
+ "execution_count": null,
15
+ "metadata": {},
16
+ "outputs": [],
17
+ "source": [
18
+ "import os\n",
19
+ "\n",
20
+ "import datasets\n",
21
+ "import pandas as pd\n",
22
+ "from dotenv import load_dotenv\n",
23
+ "from huggingface_hub import login\n",
24
+ "\n",
25
+ "\n",
26
+ "load_dotenv(override=True)\n",
27
+ "login(os.getenv(\"HF_TOKEN\"))\n",
28
+ "\n",
29
+ "pd.set_option(\"max_colwidth\", None)\n",
30
+ "\n",
31
+ "OUTPUT_DIR = \"output\""
32
+ ]
33
+ },
34
+ {
35
+ "cell_type": "code",
36
+ "execution_count": null,
37
+ "metadata": {},
38
+ "outputs": [],
39
+ "source": [
40
+ "eval_ds = datasets.load_dataset(\"gaia-benchmark/GAIA\", \"2023_all\")[\"validation\"]\n",
41
+ "eval_ds = eval_ds.rename_columns({\"Question\": \"question\", \"Final answer\": \"true_answer\", \"Level\": \"task\"})\n",
42
+ "eval_df = pd.DataFrame(eval_ds)"
43
+ ]
44
+ },
45
+ {
46
+ "cell_type": "markdown",
47
+ "metadata": {},
48
+ "source": [
49
+ "# 1. Load all results"
50
+ ]
51
+ },
52
+ {
53
+ "cell_type": "code",
54
+ "execution_count": 88,
55
+ "metadata": {},
56
+ "outputs": [],
57
+ "source": [
58
+ "import glob\n",
59
+ "\n",
60
+ "\n",
61
+ "results = []\n",
62
+ "for f in glob.glob(f\"{OUTPUT_DIR}/validation/*.jsonl\"):\n",
63
+ " df = pd.read_json(f, lines=True)\n",
64
+ " df[\"agent_name\"] = f.split(\"/\")[-1].split(\".\")[0]\n",
65
+ " results.append(df)\n",
66
+ "\n",
67
+ "result_df = pd.concat(results)\n",
68
+ "result_df[\"prediction\"] = result_df[\"prediction\"].fillna(\"No prediction\")"
69
+ ]
70
+ },
71
+ {
72
+ "cell_type": "code",
73
+ "execution_count": null,
74
+ "metadata": {},
75
+ "outputs": [],
76
+ "source": [
77
+ "import re\n",
78
+ "from collections import Counter\n",
79
+ "\n",
80
+ "from scripts.gaia_scorer import check_close_call, question_scorer\n",
81
+ "\n",
82
+ "\n",
83
+ "result_df[\"is_correct\"] = result_df.apply(lambda x: question_scorer(x[\"prediction\"], x[\"true_answer\"]), axis=1)\n",
84
+ "result_df[\"is_near_correct\"] = result_df.apply(\n",
85
+ " lambda x: check_close_call(x[\"prediction\"], x[\"true_answer\"], x[\"is_correct\"]),\n",
86
+ " axis=1,\n",
87
+ ")\n",
88
+ "\n",
89
+ "result_df[\"count_steps\"] = result_df[\"intermediate_steps\"].apply(len)\n",
90
+ "\n",
91
+ "\n",
92
+ "def find_attachment(question):\n",
93
+ " matches = eval_df.loc[eval_df[\"question\"].apply(lambda x: x in question), \"file_name\"]\n",
94
+ "\n",
95
+ " if len(matches) == 0:\n",
96
+ " return \"Not found\"\n",
97
+ " file_path = matches.values[0]\n",
98
+ "\n",
99
+ " if isinstance(file_path, str) and len(file_path) > 0:\n",
100
+ " return file_path.split(\".\")[-1]\n",
101
+ " else:\n",
102
+ " return \"None\"\n",
103
+ "\n",
104
+ "\n",
105
+ "result_df[\"attachment_type\"] = result_df[\"question\"].apply(find_attachment)\n",
106
+ "\n",
107
+ "\n",
108
+ "def extract_tool_calls(code):\n",
109
+ " regex = r\"\\b(\\w+)\\(\"\n",
110
+ " function_calls = [el for el in re.findall(regex, code) if el.islower()]\n",
111
+ "\n",
112
+ " function_call_counter = Counter(function_calls)\n",
113
+ " return function_call_counter\n",
114
+ "\n",
115
+ "\n",
116
+ "def sum_tool_calls(steps):\n",
117
+ " total_count = Counter()\n",
118
+ " for step in steps:\n",
119
+ " if \"llm_output\" in step:\n",
120
+ " total_count += extract_tool_calls(step[\"llm_output\"])\n",
121
+ "\n",
122
+ " return total_count\n",
123
+ "\n",
124
+ "\n",
125
+ "def get_durations(row):\n",
126
+ " # start_datetime = datetime.strptime(row['start_time'], \"%Y-%m-%d %H:%M:%S\")\n",
127
+ " # end_datetime = datetime.strptime(row['end_time'], \"%Y-%m-%d %H:%M:%S\")\n",
128
+ "\n",
129
+ " duration_timedelta = row[\"end_time\"] - row[\"start_time\"]\n",
130
+ " return int(duration_timedelta.total_seconds())\n",
131
+ "\n",
132
+ "\n",
133
+ "result_df[\"duration\"] = result_df.apply(get_durations, axis=1)\n",
134
+ "# result_df[\"tool_calls\"] = result_df[\"intermediate_steps\"].apply(sum_tool_calls)"
135
+ ]
136
+ },
137
+ {
138
+ "cell_type": "code",
139
+ "execution_count": null,
140
+ "metadata": {},
141
+ "outputs": [],
142
+ "source": [
143
+ "result_df[\"agent_name\"].value_counts()"
144
+ ]
145
+ },
146
+ {
147
+ "cell_type": "markdown",
148
+ "metadata": {},
149
+ "source": [
150
+ "# 2. Inspect specific runs"
151
+ ]
152
+ },
153
+ {
154
+ "cell_type": "code",
155
+ "execution_count": null,
156
+ "metadata": {},
157
+ "outputs": [],
158
+ "source": [
159
+ "sel_df = result_df\n",
160
+ "# sel_df = sel_df.loc[\n",
161
+ "# (result_df[\"agent_name\"].isin(list_versions))\n",
162
+ "# ]\n",
163
+ "sel_df = sel_df.reset_index(drop=True)\n",
164
+ "display(sel_df[\"agent_name\"].value_counts())\n",
165
+ "sel_df = sel_df.drop_duplicates(subset=[\"agent_name\", \"question\"])\n",
166
+ "display(sel_df.groupby(\"agent_name\")[[\"task\"]].value_counts())\n",
167
+ "print(\"Total length:\", len(sel_df), \"- is complete:\", len(sel_df) == 165)"
168
+ ]
169
+ },
170
+ {
171
+ "cell_type": "code",
172
+ "execution_count": null,
173
+ "metadata": {},
174
+ "outputs": [],
175
+ "source": [
176
+ "display(\"Average score:\", sel_df.groupby(\"agent_name\")[[\"is_correct\"]].mean().round(3))\n",
177
+ "display(\n",
178
+ " sel_df.groupby([\"agent_name\", \"task\"])[[\"is_correct\", \"is_near_correct\", \"count_steps\", \"question\", \"duration\"]]\n",
179
+ " .agg(\n",
180
+ " {\n",
181
+ " \"is_correct\": \"mean\",\n",
182
+ " \"is_near_correct\": \"mean\",\n",
183
+ " \"count_steps\": \"mean\",\n",
184
+ " \"question\": \"count\",\n",
185
+ " \"duration\": \"mean\",\n",
186
+ " }\n",
187
+ " )\n",
188
+ " .rename(columns={\"question\": \"count\"})\n",
189
+ ")"
190
+ ]
191
+ },
192
+ {
193
+ "cell_type": "code",
194
+ "execution_count": null,
195
+ "metadata": {},
196
+ "outputs": [],
197
+ "source": [
198
+ "import plotly.express as px\n",
199
+ "\n",
200
+ "\n",
201
+ "cumulative_df = (\n",
202
+ " (\n",
203
+ " sel_df.groupby(\"agent_name\")[[\"is_correct\", \"is_near_correct\"]]\n",
204
+ " .expanding(min_periods=1, axis=0, method=\"single\")\n",
205
+ " .agg({\"is_correct\": \"mean\", \"is_near_correct\": \"count\"})\n",
206
+ " .reset_index()\n",
207
+ " )\n",
208
+ " .copy()\n",
209
+ " .rename(columns={\"is_near_correct\": \"index\"})\n",
210
+ ")\n",
211
+ "cumulative_df[\"index\"] = cumulative_df[\"index\"].astype(int) - 1\n",
212
+ "\n",
213
+ "\n",
214
+ "def find_question(row):\n",
215
+ " try:\n",
216
+ " res = sel_df.loc[sel_df[\"agent_name\"] == row[\"agent_name\"], \"question\"].iloc[row[\"index\"]][:50]\n",
217
+ " return res\n",
218
+ " except Exception:\n",
219
+ " return \"\"\n",
220
+ "\n",
221
+ "\n",
222
+ "cumulative_df[\"question\"] = cumulative_df.apply(find_question, axis=1)\n",
223
+ "\n",
224
+ "px.line(\n",
225
+ " cumulative_df,\n",
226
+ " color=\"agent_name\",\n",
227
+ " x=\"index\",\n",
228
+ " y=\"is_correct\",\n",
229
+ " hover_data=\"question\",\n",
230
+ ")"
231
+ ]
232
+ },
233
+ {
234
+ "cell_type": "markdown",
235
+ "metadata": {},
236
+ "source": [
237
+ "# 3. Dive deeper into one run"
238
+ ]
239
+ },
240
+ {
241
+ "cell_type": "code",
242
+ "execution_count": null,
243
+ "metadata": {},
244
+ "outputs": [],
245
+ "source": [
246
+ "sel_df = result_df.loc[result_df[\"agent_name\"] == \"o1\"]\n",
247
+ "print(len(sel_df))"
248
+ ]
249
+ },
250
+ {
251
+ "cell_type": "markdown",
252
+ "metadata": {},
253
+ "source": [
254
+ "### Count errors"
255
+ ]
256
+ },
257
+ {
258
+ "cell_type": "code",
259
+ "execution_count": null,
260
+ "metadata": {},
261
+ "outputs": [],
262
+ "source": [
263
+ "import numpy as np\n",
264
+ "\n",
265
+ "\n",
266
+ "error_types = [\n",
267
+ " \"AgentParsingError\",\n",
268
+ " \"AgentExecutionError\",\n",
269
+ " \"AgentMaxIterationsError\",\n",
270
+ " \"AgentGenerationError\",\n",
271
+ "]\n",
272
+ "sel_df[error_types] = 0\n",
273
+ "sel_df[\"Count steps\"] = np.nan\n",
274
+ "\n",
275
+ "\n",
276
+ "def count_errors(row):\n",
277
+ " if isinstance(row[\"intermediate_steps\"], list):\n",
278
+ " row[\"Count steps\"] = len(row[\"intermediate_steps\"])\n",
279
+ " for step in row[\"intermediate_steps\"]:\n",
280
+ " if isinstance(step, dict) and \"error\" in step:\n",
281
+ " try:\n",
282
+ " row[str(step[\"error\"][\"error_type\"])] += 1\n",
283
+ " except Exception:\n",
284
+ " pass\n",
285
+ " return row\n",
286
+ "\n",
287
+ "\n",
288
+ "sel_df = sel_df.apply(count_errors, axis=1)"
289
+ ]
290
+ },
291
+ {
292
+ "cell_type": "code",
293
+ "execution_count": null,
294
+ "metadata": {},
295
+ "outputs": [],
296
+ "source": [
297
+ "import plotly.express as px\n",
298
+ "\n",
299
+ "\n",
300
+ "aggregate_errors = (\n",
301
+ " sel_df.groupby([\"is_correct\"])[error_types + [\"Count steps\"]].mean().reset_index().melt(id_vars=[\"is_correct\"])\n",
302
+ ")\n",
303
+ "\n",
304
+ "fig = px.bar(\n",
305
+ " aggregate_errors,\n",
306
+ " y=\"value\",\n",
307
+ " x=\"variable\",\n",
308
+ " color=\"is_correct\",\n",
309
+ " labels={\n",
310
+ " \"agent_name\": \"<b>Model</b>\",\n",
311
+ " \"task\": \"<b>Level</b>\",\n",
312
+ " \"aggregate_score\": \"<b>Performance</b>\",\n",
313
+ " \"value\": \"<b>Average count</b>\",\n",
314
+ " \"eval_score_GPT4\": \"<b>Score</b>\",\n",
315
+ " },\n",
316
+ ")\n",
317
+ "fig.update_layout(\n",
318
+ " height=500,\n",
319
+ " width=800,\n",
320
+ " barmode=\"group\",\n",
321
+ " bargroupgap=0.0,\n",
322
+ ")\n",
323
+ "fig.update_traces(textposition=\"outside\")\n",
324
+ "fig.write_image(\"aggregate_errors.png\", scale=3)\n",
325
+ "fig.show()"
326
+ ]
327
+ },
328
+ {
329
+ "cell_type": "markdown",
330
+ "metadata": {},
331
+ "source": [
332
+ "### Inspect result by file extension type"
333
+ ]
334
+ },
335
+ {
336
+ "cell_type": "code",
337
+ "execution_count": null,
338
+ "metadata": {},
339
+ "outputs": [],
340
+ "source": [
341
+ "display(\n",
342
+ " result_df.groupby([\"attachment_type\"])[[\"is_correct\", \"count_steps\", \"question\"]].agg(\n",
343
+ " {\"is_correct\": \"mean\", \"count_steps\": \"mean\", \"question\": \"count\"}\n",
344
+ " )\n",
345
+ ")"
346
+ ]
347
+ },
348
+ {
349
+ "cell_type": "markdown",
350
+ "metadata": {},
351
+ "source": [
352
+ "# 4. Ensembling methods"
353
+ ]
354
+ },
355
+ {
356
+ "cell_type": "code",
357
+ "execution_count": null,
358
+ "metadata": {},
359
+ "outputs": [],
360
+ "source": [
361
+ "counts = result_df[\"agent_name\"].value_counts()\n",
362
+ "long_series = result_df.loc[result_df[\"agent_name\"].isin(counts[counts > 140].index)]"
363
+ ]
364
+ },
365
+ {
366
+ "cell_type": "code",
367
+ "execution_count": null,
368
+ "metadata": {},
369
+ "outputs": [],
370
+ "source": [
371
+ "def majority_vote(df):\n",
372
+ " df = df[(df[\"prediction\"] != \"Unable to determine\") & (~df[\"prediction\"].isna()) & (df[\"prediction\"] != \"None\")]\n",
373
+ "\n",
374
+ " answer_modes = df.groupby(\"question\")[\"prediction\"].agg(lambda x: x.mode()[0]).reset_index()\n",
375
+ " first_occurrences = (\n",
376
+ " df.groupby([\"question\", \"prediction\"]).agg({\"task\": \"first\", \"is_correct\": \"first\"}).reset_index()\n",
377
+ " )\n",
378
+ " result = answer_modes.merge(first_occurrences, on=[\"question\", \"prediction\"], how=\"left\")\n",
379
+ "\n",
380
+ " return result\n",
381
+ "\n",
382
+ "\n",
383
+ "def oracle(df):\n",
384
+ " def get_first_correct_or_first_wrong(group):\n",
385
+ " correct_answers = group[group[\"is_correct\"]]\n",
386
+ " if len(correct_answers) > 0:\n",
387
+ " return correct_answers.iloc[0]\n",
388
+ " return group.iloc[0]\n",
389
+ "\n",
390
+ " result = df.groupby(\"question\").apply(get_first_correct_or_first_wrong)\n",
391
+ "\n",
392
+ " return result.reset_index(drop=True)\n",
393
+ "\n",
394
+ "\n",
395
+ "display((long_series.groupby(\"agent_name\")[\"is_correct\"].mean() * 100).round(2))\n",
396
+ "print(f\"Majority score: {majority_vote(long_series)['is_correct'].mean() * 100:.2f}\")\n",
397
+ "print(f\"Oracle score: {oracle(long_series)['is_correct'].mean() * 100:.2f}\")"
398
+ ]
399
+ },
400
+ {
401
+ "cell_type": "markdown",
402
+ "metadata": {},
403
+ "source": [
404
+ "### Submit"
405
+ ]
406
+ },
407
+ {
408
+ "cell_type": "code",
409
+ "execution_count": null,
410
+ "metadata": {},
411
+ "outputs": [],
412
+ "source": [
413
+ "agent_run = \"code_o1_04_february_submission5.jsonl\"\n",
414
+ "df = pd.read_json(f\"output/validation/{agent_run}\", lines=True)\n",
415
+ "df = df[[\"task_id\", \"prediction\", \"intermediate_steps\"]]\n",
416
+ "df = df.rename(columns={\"prediction\": \"model_answer\", \"intermediate_steps\": \"reasoning_trace\"})"
417
+ ]
418
+ },
419
+ {
420
+ "cell_type": "code",
421
+ "execution_count": null,
422
+ "metadata": {},
423
+ "outputs": [],
424
+ "source": [
425
+ "df.to_json(\"submission.jsonl\", orient=\"records\", lines=True)"
426
+ ]
427
+ },
428
+ {
429
+ "cell_type": "code",
430
+ "execution_count": null,
431
+ "metadata": {},
432
+ "outputs": [],
433
+ "source": []
434
+ }
435
+ ],
436
+ "metadata": {
437
+ "kernelspec": {
438
+ "display_name": "agents",
439
+ "language": "python",
440
+ "name": "python3"
441
+ },
442
+ "language_info": {
443
+ "codemirror_mode": {
444
+ "name": "ipython",
445
+ "version": 3
446
+ },
447
+ "file_extension": ".py",
448
+ "mimetype": "text/x-python",
449
+ "name": "python",
450
+ "nbconvert_exporter": "python",
451
+ "pygments_lexer": "ipython3",
452
+ "version": "3.12.0"
453
+ }
454
+ },
455
+ "nbformat": 4,
456
+ "nbformat_minor": 2
457
+ }
examples/open_deep_research/app.py ADDED
@@ -0,0 +1,11 @@
1
+ from run import create_agent
2
+
3
+ from smolagents.gradio_ui import GradioUI
4
+
5
+
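+ # Instantiate the Open Deep Research agent (create_agent defaults to the "o1" model) and serve it through smolagents' Gradio chat UI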
6
+ agent = create_agent()
7
+
8
+ demo = GradioUI(agent)
9
+
10
+ if __name__ == "__main__":
11
+ demo.launch()
examples/open_deep_research/requirements.txt ADDED
@@ -0,0 +1,39 @@
1
+ anthropic>=0.37.1
2
+ audioop-lts<1.0; python_version >= "3.13" # required to use pydub in Python >=3.13; LTS port of the removed Python builtin module audioop
3
+ beautifulsoup4>=4.12.3
4
+ datasets>=2.21.0
5
+ google_search_results>=2.4.2
6
+ huggingface_hub>=0.23.4
7
+ mammoth>=1.8.0
8
+ markdownify>=0.13.1
9
+ numexpr>=2.10.1
10
+ numpy>=2.1.2
11
+ openai>=1.52.2
12
+ openpyxl
13
+ pandas>=2.2.3
14
+ pathvalidate>=3.2.1
15
+ pdfminer>=20191125
16
+ pdfminer.six>=20240706
17
+ Pillow>=11.0.0
18
+ puremagic>=1.28
19
+ pypdf>=5.1.0
20
+ python-dotenv>=1.0.1
21
+ python_pptx>=1.0.2
22
+ Requests>=2.32.3
23
+ tqdm>=4.66.4
24
+ torch>=2.2.2
25
+ torchvision>=0.17.2
26
+ transformers>=4.46.0
27
+ youtube_transcript_api>=0.6.2
28
+ chess
29
+ sympy
30
+ pubchempy
31
+ Bio
32
+ scikit-learn
33
+ scipy
34
+ pydub
35
+ PyPDF2
36
+ python-pptx
37
+ torch
38
+ xlrd
39
+ SpeechRecognition
examples/open_deep_research/run.py ADDED
@@ -0,0 +1,125 @@
1
+ import argparse
2
+ import os
3
+ import threading
4
+
5
+ from dotenv import load_dotenv
6
+ from huggingface_hub import login
7
+ from scripts.text_inspector_tool import TextInspectorTool
8
+ from scripts.text_web_browser import (
9
+ ArchiveSearchTool,
10
+ FinderTool,
11
+ FindNextTool,
12
+ PageDownTool,
13
+ PageUpTool,
14
+ SimpleTextBrowser,
15
+ VisitTool,
16
+ )
17
+ from scripts.visual_qa import visualizer
18
+
19
+ from smolagents import (
20
+ CodeAgent,
21
+ GoogleSearchTool,
22
+ # InferenceClientModel,
23
+ LiteLLMModel,
24
+ ToolCallingAgent,
25
+ )
26
+
27
+
28
+ load_dotenv(override=True)
29
+ login(os.getenv("HF_TOKEN"))
30
+
31
+ append_answer_lock = threading.Lock()
32
+
33
+
34
+ def parse_args():
35
+ parser = argparse.ArgumentParser()
36
+ parser.add_argument(
37
+ "question", type=str, help="for example: 'How many studio albums did Mercedes Sosa release before 2007?'"
38
+ )
39
+ parser.add_argument("--model-id", type=str, default="o1")
40
+ return parser.parse_args()
41
+
42
+
43
+ custom_role_conversions = {"tool-call": "assistant", "tool-response": "user"}
44
+
45
+ user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0"
46
+
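+ # Configuration shared by the text-based browser tools: viewport_size is roughly how much text is shown per page, and downloaded files land in ./downloads_folder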
47
+ BROWSER_CONFIG = {
48
+ "viewport_size": 1024 * 5,
49
+ "downloads_folder": "downloads_folder",
50
+ "request_kwargs": {
51
+ "headers": {"User-Agent": user_agent},
52
+ "timeout": 300,
53
+ },
54
+ "serpapi_key": os.getenv("SERPAPI_API_KEY"),
55
+ }
56
+
57
+ os.makedirs(f"./{BROWSER_CONFIG['downloads_folder']}", exist_ok=True)
58
+
59
+
60
+ def create_agent(model_id="o1"):
61
+ model_params = {
62
+ "model_id": model_id,
63
+ "custom_role_conversions": custom_role_conversions,
64
+ "max_completion_tokens": 8192,
65
+ }
66
+ if model_id == "o1":
67
+ model_params["reasoning_effort"] = "high"
68
+ model = LiteLLMModel(**model_params)
69
+
70
+ text_limit = 100000
71
+ browser = SimpleTextBrowser(**BROWSER_CONFIG)
72
+ WEB_TOOLS = [
73
+ GoogleSearchTool(provider="serper"),
74
+ VisitTool(browser),
75
+ PageUpTool(browser),
76
+ PageDownTool(browser),
77
+ FinderTool(browser),
78
+ FindNextTool(browser),
79
+ ArchiveSearchTool(browser),
80
+ TextInspectorTool(model, text_limit),
81
+ ]
82
+ text_webbrowser_agent = ToolCallingAgent(
83
+ model=model,
84
+ tools=WEB_TOOLS,
85
+ max_steps=20,
86
+ verbosity_level=2,
87
+ planning_interval=4,
88
+ name="search_agent",
89
+ description="""A team member that will search the internet to answer your question.
90
+ Ask him for all your questions that require browsing the web.
91
+ Provide him as much context as possible, in particular if you need to search on a specific timeframe!
92
+ And don't hesitate to provide him with a complex search task, like finding a difference between two webpages.
93
+ Your request must be a real sentence, not a google search! Like "Find me this information (...)" rather than a few keywords.
94
+ """,
95
+ provide_run_summary=True,
96
+ )
97
+ text_webbrowser_agent.prompt_templates["managed_agent"]["task"] += """You can navigate to .txt online files.
98
+ If a non-html page is in another format, especially .pdf or a Youtube video, use tool 'inspect_file_as_text' to inspect it.
99
+ Additionally, if after some searching you find out that you need more information to answer the question, you can use `final_answer` with your request for clarification as argument to request for more information."""
100
+
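+ # The manager is a CodeAgent that plans and acts in Python and delegates web browsing to search_agent; additional_authorized_imports=["*"] lets its generated code import anything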
101
+ manager_agent = CodeAgent(
102
+ model=model,
103
+ tools=[visualizer, TextInspectorTool(model, text_limit)],
104
+ max_steps=12,
105
+ verbosity_level=2,
106
+ additional_authorized_imports=["*"],
107
+ planning_interval=4,
108
+ managed_agents=[text_webbrowser_agent],
109
+ )
110
+
111
+ return manager_agent
112
+
113
+
114
+ def main():
115
+ args = parse_args()
116
+
117
+ agent = create_agent(model_id=args.model_id)
118
+
119
+ answer = agent.run(args.question)
120
+
121
+ print(f"Got this answer: {answer}")
122
+
123
+
124
+ if __name__ == "__main__":
125
+ main()
examples/open_deep_research/run_gaia.py ADDED
@@ -0,0 +1,303 @@
1
+ # EXAMPLE COMMAND: from folder examples/open_deep_research, run: python run_gaia.py --concurrency 32 --run-name generate-traces-03-apr-noplanning --model-id gpt-4o
2
+ import argparse
3
+ import json
4
+ import os
5
+ import threading
6
+ from concurrent.futures import ThreadPoolExecutor, as_completed
7
+ from datetime import datetime
8
+ from pathlib import Path
9
+ from typing import Any
10
+
11
+ import datasets
12
+ import pandas as pd
13
+ from dotenv import load_dotenv
14
+ from huggingface_hub import login, snapshot_download
15
+ from scripts.reformulator import prepare_response
16
+ from scripts.run_agents import (
17
+ get_single_file_description,
18
+ get_zip_description,
19
+ )
20
+ from scripts.text_inspector_tool import TextInspectorTool
21
+ from scripts.text_web_browser import (
22
+ ArchiveSearchTool,
23
+ FinderTool,
24
+ FindNextTool,
25
+ PageDownTool,
26
+ PageUpTool,
27
+ SimpleTextBrowser,
28
+ VisitTool,
29
+ )
30
+ from scripts.visual_qa import visualizer
31
+ from tqdm import tqdm
32
+
33
+ from smolagents import (
34
+ CodeAgent,
35
+ GoogleSearchTool,
36
+ LiteLLMModel,
37
+ Model,
38
+ ToolCallingAgent,
39
+ )
40
+
41
+
42
+ load_dotenv(override=True)
43
+ login(os.getenv("HF_TOKEN"))
44
+
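+ # Serializes appends to the shared answers .jsonl file across the ThreadPoolExecutor workers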
45
+ append_answer_lock = threading.Lock()
46
+
47
+
48
+ def parse_args():
49
+ parser = argparse.ArgumentParser()
50
+ parser.add_argument("--concurrency", type=int, default=8)
51
+ parser.add_argument("--model-id", type=str, default="o1")
52
+ parser.add_argument("--run-name", type=str, required=True)
53
+ parser.add_argument("--set-to-run", type=str, default="validation")
54
+ parser.add_argument("--use-open-models", action="store_true")  # type=bool would have treated any non-empty string as True
55
+ parser.add_argument("--use-raw-dataset", action="store_true")
56
+ return parser.parse_args()
57
+
58
+
59
+ ### IMPORTANT: EVALUATION SWITCHES
60
+
61
+ print("Make sure you deactivated any VPN like Tailscale, else some URLs will be blocked!")
62
+
63
+ custom_role_conversions = {"tool-call": "assistant", "tool-response": "user"}
64
+
65
+
66
+ user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0"
67
+
68
+ BROWSER_CONFIG = {
69
+ "viewport_size": 1024 * 5,
70
+ "downloads_folder": "downloads_folder",
71
+ "request_kwargs": {
72
+ "headers": {"User-Agent": user_agent},
73
+ "timeout": 300,
74
+ },
75
+ "serpapi_key": os.getenv("SERPAPI_API_KEY"),
76
+ }
77
+
78
+ os.makedirs(f"./{BROWSER_CONFIG['downloads_folder']}", exist_ok=True)
79
+
80
+
81
+ def create_agent_team(model: Model):
82
+ text_limit = 100000
83
+ ti_tool = TextInspectorTool(model, text_limit)
84
+
85
+ browser = SimpleTextBrowser(**BROWSER_CONFIG)
86
+
87
+ WEB_TOOLS = [
88
+ GoogleSearchTool(provider="serper"),
89
+ VisitTool(browser),
90
+ PageUpTool(browser),
91
+ PageDownTool(browser),
92
+ FinderTool(browser),
93
+ FindNextTool(browser),
94
+ ArchiveSearchTool(browser),
95
+ TextInspectorTool(model, text_limit),
96
+ ]
97
+
98
+ text_webbrowser_agent = ToolCallingAgent(
99
+ model=model,
100
+ tools=WEB_TOOLS,
101
+ max_steps=20,
102
+ verbosity_level=2,
103
+ planning_interval=4,
104
+ name="search_agent",
105
+ description="""A team member that will search the internet to answer your question.
106
+ Ask him for all your questions that require browsing the web.
107
+ Provide him as much context as possible, in particular if you need to search on a specific timeframe!
108
+ And don't hesitate to provide him with a complex search task, like finding a difference between two webpages.
109
+ Your request must be a real sentence, not a google search! Like "Find me this information (...)" rather than a few keywords.
110
+ """,
111
+ provide_run_summary=True,
112
+ )
113
+ text_webbrowser_agent.prompt_templates["managed_agent"]["task"] += """You can navigate to .txt online files.
114
+ If a non-html page is in another format, especially .pdf or a Youtube video, use tool 'inspect_file_as_text' to inspect it.
115
+ Additionally, if after some searching you find out that you need more information to answer the question, you can use `final_answer` with your request for clarification as argument to request for more information."""
116
+
117
+ manager_agent = CodeAgent(
118
+ model=model,
119
+ tools=[visualizer, ti_tool],
120
+ max_steps=12,
121
+ verbosity_level=2,
122
+ additional_authorized_imports=["*"],
123
+ planning_interval=4,
124
+ managed_agents=[text_webbrowser_agent],
125
+ )
126
+ return manager_agent
127
+
128
+
129
+ def load_gaia_dataset(use_raw_dataset: bool, set_to_run: str) -> datasets.Dataset:
130
+ if not os.path.exists("data/gaia"):
131
+ if use_raw_dataset:
132
+ snapshot_download(
133
+ repo_id="gaia-benchmark/GAIA",
134
+ repo_type="dataset",
135
+ local_dir="data/gaia",
136
+ ignore_patterns=[".gitattributes", "README.md"],
137
+ )
138
+ else:
139
+ # WARNING: this dataset is gated: make sure you visit the repo to request access.
140
+ snapshot_download(
141
+ repo_id="smolagents/GAIA-annotated",
142
+ repo_type="dataset",
143
+ local_dir="data/gaia",
144
+ ignore_patterns=[".gitattributes", "README.md"],
145
+ )
146
+
147
+ def preprocess_file_paths(row):
148
+ if len(row["file_name"]) > 0:
149
+ row["file_name"] = f"data/gaia/{set_to_run}/" + row["file_name"]
150
+ return row
151
+
152
+ eval_ds = datasets.load_dataset(
153
+ "data/gaia/GAIA.py",
154
+ name="2023_all",
155
+ split=set_to_run,
156
+ # data_files={"validation": "validation/metadata.jsonl", "test": "test/metadata.jsonl"},
157
+ )
158
+
159
+ eval_ds = eval_ds.rename_columns({"Question": "question", "Final answer": "true_answer", "Level": "task"})
160
+ eval_ds = eval_ds.map(preprocess_file_paths)
161
+ return eval_ds
162
+
163
+
164
+ def append_answer(entry: dict, jsonl_file: str) -> None:
165
+ jsonl_path = Path(jsonl_file)
166
+ jsonl_path.parent.mkdir(parents=True, exist_ok=True)
167
+ with append_answer_lock, open(jsonl_file, "a", encoding="utf-8") as fp:
168
+ fp.write(json.dumps(entry) + "\n")
169
+ assert jsonl_path.exists(), "File not found!"
170
+ print("Answer exported to file:", jsonl_path.resolve())
171
+
172
+
173
+ def answer_single_question(
174
+ example: dict, model_id: str, answers_file: str, visual_inspection_tool: TextInspectorTool
175
+ ) -> None:
176
+ model_params: dict[str, Any] = {
177
+ "model_id": model_id,
178
+ "custom_role_conversions": custom_role_conversions,
179
+ }
180
+ if model_id == "o1":
181
+ model_params["reasoning_effort"] = "high"
182
+ model_params["max_completion_tokens"] = 8192
183
+ else:
184
+ model_params["max_tokens"] = 4096
185
+ model = LiteLLMModel(**model_params)
186
+ # model = InferenceClientModel(model_id="Qwen/Qwen3-32B", provider="novita", max_tokens=4096)
187
+ document_inspection_tool = TextInspectorTool(model, 100000)
188
+
189
+ agent = create_agent_team(model)
190
+
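+ # Prepend a strongly-worded instruction to every GAIA question, then describe any attached file so the agent knows what it is working with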
191
+ augmented_question = """You have one question to answer. It is paramount that you provide a correct answer.
192
+ Give it all you can: I know for a fact that you have access to all the relevant tools to solve it and find the correct answer (the answer does exist).
193
+ Failure or 'I cannot answer' or 'None found' will not be tolerated, success will be rewarded.
194
+ Run verification steps if that's needed, you must make sure you find the correct answer! Here is the task:
195
+
196
+ """ + example["question"]
197
+
198
+ if example["file_name"]:
199
+ if ".zip" in example["file_name"]:
200
+ prompt_use_files = "\n\nTo solve the task above, you will have to use these attached files:\n"
201
+ prompt_use_files += get_zip_description(
202
+ example["file_name"], example["question"], visual_inspection_tool, document_inspection_tool
203
+ )
204
+ else:
205
+ prompt_use_files = "\n\nTo solve the task above, you will have to use this attached file:\n"
206
+ prompt_use_files += get_single_file_description(
207
+ example["file_name"], example["question"], visual_inspection_tool, document_inspection_tool
208
+ )
209
+ augmented_question += prompt_use_files
210
+
211
+ start_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
212
+ try:
213
+ # Run agent 🚀
214
+ final_result = agent.run(augmented_question)
215
+
216
+ agent_memory = agent.write_memory_to_messages()
217
+
218
+ final_result = prepare_response(augmented_question, agent_memory, reformulation_model=model)
219
+
220
+ output = str(final_result)
221
+ for memory_step in agent.memory.steps:
222
+ memory_step.model_input_messages = None
223
+ intermediate_steps = agent_memory
224
+
225
+ # Check for parsing errors which indicate the LLM failed to follow the required format
226
+ parsing_error = any("AgentParsingError" in str(step) for step in intermediate_steps)
227
+
228
+ # check if iteration limit exceeded
229
+ iteration_limit_exceeded = "Agent stopped due to iteration limit or time limit." in output
230
+ raised_exception = False
231
+
232
+ except Exception as e:
233
+ print("Error on ", augmented_question, e)
234
+ output = None
235
+ intermediate_steps = []
236
+ parsing_error = False
237
+ iteration_limit_exceeded = False
238
+ exception = e
239
+ raised_exception = True
240
+ end_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
241
+ token_counts_manager = agent.monitor.get_total_token_counts()
242
+ token_counts_web = list(agent.managed_agents.values())[0].monitor.get_total_token_counts()
243
+ total_token_counts = {
244
+ "input": token_counts_manager["input"] + token_counts_web["input"],
245
+ "output": token_counts_manager["output"] + token_counts_web["output"],
246
+ }
247
+ annotated_example = {
248
+ "agent_name": model.model_id,
249
+ "question": example["question"],
250
+ "augmented_question": augmented_question,
251
+ "prediction": output,
252
+ "intermediate_steps": intermediate_steps,
253
+ "parsing_error": parsing_error,
254
+ "iteration_limit_exceeded": iteration_limit_exceeded,
255
+ "agent_error": str(exception) if raised_exception else None,
256
+ "task": example["task"],
257
+ "task_id": example["task_id"],
258
+ "true_answer": example["true_answer"],
259
+ "start_time": start_time,
260
+ "end_time": end_time,
261
+ "token_counts": total_token_counts,
262
+ }
263
+ append_answer(annotated_example, answers_file)
264
+
265
+
266
+ def get_examples_to_answer(answers_file: str, eval_ds: datasets.Dataset) -> list[dict]:
267
+ print(f"Loading answers from {answers_file}...")
268
+ try:
269
+ done_questions = pd.read_json(answers_file, lines=True)["question"].tolist()
270
+ print(f"Found {len(done_questions)} previous results!")
271
+ except Exception as e:
272
+ print("Error when loading records: ", e)
273
+ print("No usable records! ▶️ Starting new.")
274
+ done_questions = []
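+ # Note: the filter below also skips examples without an attached file ("file_name" must be non-empty)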
275
+ return [line for line in eval_ds.to_list() if line["question"] not in done_questions and line["file_name"]]
276
+
277
+
278
+ def main():
279
+ args = parse_args()
280
+ print(f"Starting run with arguments: {args}")
281
+
282
+ eval_ds = load_gaia_dataset(args.use_raw_dataset, args.set_to_run)
283
+ print("Loaded evaluation dataset:")
284
+ print(pd.DataFrame(eval_ds)["task"].value_counts())
285
+
286
+ answers_file = f"output/{args.set_to_run}/{args.run_name}.jsonl"
287
+ tasks_to_run = get_examples_to_answer(answers_file, eval_ds)
288
+
289
+ with ThreadPoolExecutor(max_workers=args.concurrency) as exe:
290
+ futures = [
291
+ exe.submit(answer_single_question, example, args.model_id, answers_file, visualizer)
292
+ for example in tasks_to_run
293
+ ]
294
+ for f in tqdm(as_completed(futures), total=len(tasks_to_run), desc="Processing tasks"):
295
+ f.result()
296
+
297
+ # for example in tasks_to_run:
298
+ # answer_single_question(example, args.model_id, answers_file, visualizer)
299
+ print("All tasks processed.")
300
+
301
+
302
+ if __name__ == "__main__":
303
+ main()
examples/open_deep_research/visual_vs_text_browser.ipynb ADDED
@@ -0,0 +1,359 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# Compare a text-based vs a vision-based browser\n",
8
+ "\n",
9
+ "Warning: this notebook is experimental, it probably won't work out of the box!"
10
+ ]
11
+ },
12
+ {
13
+ "cell_type": "code",
14
+ "execution_count": null,
15
+ "metadata": {},
16
+ "outputs": [],
17
+ "source": [
18
+ "!pip install \"smolagents[litellm,toolkit]\" -q"
19
+ ]
20
+ },
21
+ {
22
+ "cell_type": "code",
23
+ "execution_count": null,
24
+ "metadata": {},
25
+ "outputs": [],
26
+ "source": [
27
+ "import datasets\n",
28
+ "\n",
29
+ "\n",
30
+ "eval_ds = datasets.load_dataset(\"gaia-benchmark/GAIA\", \"2023_all\")[\"validation\"]"
31
+ ]
32
+ },
33
+ {
34
+ "cell_type": "code",
35
+ "execution_count": 3,
36
+ "metadata": {},
37
+ "outputs": [],
38
+ "source": [
39
+ "to_keep = [\n",
40
+ " \"What's the last line of the rhyme under the flavor\",\n",
41
+ " 'Of the authors (First M. Last) that worked on the paper \"Pie Menus or Linear Menus',\n",
42
+ " \"In Series 9, Episode 11 of Doctor Who, the Doctor is trapped inside an ever-shifting maze. What is this location called in the official script for the episode? Give the setting exactly as it appears in the first scene heading.\",\n",
43
+ " \"Which contributor to the version of OpenCV where support was added for the Mask-RCNN model has the same name as a former Chinese head of government when the names are transliterated to the Latin alphabet?\",\n",
44
+ " \"The photograph in the Whitney Museum of American Art's collection with accession number 2022.128 shows a person holding a book. Which military unit did the author of this book join in 1813? Answer without using articles.\",\n",
45
+ " \"I went to Virtue restaurant & bar in Chicago for my birthday on March 22, 2021 and the main course I had was delicious! Unfortunately, when I went back about a month later on April 21, it was no longer on the dinner menu.\",\n",
46
+ " \"In Emily Midkiff's June 2014 article in a journal named for the one of Hreidmar's \",\n",
47
+ " \"Under DDC 633 on Bielefeld University Library's BASE, as of 2020\",\n",
48
+ " \"In the 2018 VSCode blog post on replit.com, what was the command they clicked on in the last video to remove extra lines?\",\n",
49
+ " \"The Metropolitan Museum of Art has a portrait in its collection with an accession number of 29.100.5. Of the consecrators and co-consecrators\",\n",
50
+ " \"In Nature journal's Scientific Reports conference proceedings from 2012, in the article that did not mention plasmons or plasmonics, what nano-compound is studied?\",\n",
51
+ " 'In the year 2022, and before December, what does \"R\" stand for in the three core policies of the type of content',\n",
52
+ " \"Who nominated the only Featured Article on English Wikipedia about a dinosaur that was promoted in November 2016?\",\n",
53
+ "]\n",
54
+ "eval_ds = eval_ds.filter(lambda row: any([el in row[\"Question\"] for el in to_keep]))\n",
55
+ "eval_ds = eval_ds.rename_columns({\"Question\": \"question\", \"Final answer\": \"true_answer\", \"Level\": \"task\"})"
56
+ ]
57
+ },
58
+ {
59
+ "cell_type": "code",
60
+ "execution_count": null,
61
+ "metadata": {},
62
+ "outputs": [],
63
+ "source": [
64
+ "import os\n",
65
+ "\n",
66
+ "from dotenv import load_dotenv\n",
67
+ "from huggingface_hub import login\n",
68
+ "\n",
69
+ "\n",
70
+ "load_dotenv(override=True)\n",
71
+ "\n",
72
+ "login(os.getenv(\"HF_TOKEN\"))"
73
+ ]
74
+ },
75
+ {
76
+ "cell_type": "markdown",
77
+ "metadata": {},
78
+ "source": [
79
+ "### Text browser"
80
+ ]
81
+ },
82
+ {
83
+ "cell_type": "code",
84
+ "execution_count": null,
85
+ "metadata": {},
86
+ "outputs": [],
87
+ "source": [
88
+ "from scripts.run_agents import answer_questions\n",
89
+ "from scripts.text_inspector_tool import TextInspectorTool\n",
90
+ "from scripts.text_web_browser import (\n",
91
+ " ArchiveSearchTool,\n",
92
+ " FinderTool,\n",
93
+ " FindNextTool,\n",
94
+ " NavigationalSearchTool,\n",
95
+ " PageDownTool,\n",
96
+ " PageUpTool,\n",
97
+ " SearchInformationTool,\n",
98
+ " VisitTool,\n",
99
+ ")\n",
100
+ "from scripts.visual_qa import VisualQAGPT4Tool\n",
101
+ "\n",
102
+ "from smolagents import CodeAgent, LiteLLMModel\n",
103
+ "\n",
104
+ "\n",
105
+ "proprietary_model = LiteLLMModel(model_id=\"gpt-4o\")"
106
+ ]
107
+ },
108
+ {
109
+ "cell_type": "code",
110
+ "execution_count": null,
111
+ "metadata": {},
112
+ "outputs": [],
113
+ "source": [
114
+ "### BUILD AGENTS & TOOLS\n",
115
+ "\n",
116
+ "WEB_TOOLS = [\n",
117
+ " SearchInformationTool(),\n",
118
+ " NavigationalSearchTool(),\n",
119
+ " VisitTool(),\n",
120
+ " PageUpTool(),\n",
121
+ " PageDownTool(),\n",
122
+ " FinderTool(),\n",
123
+ " FindNextTool(),\n",
124
+ " ArchiveSearchTool(),\n",
125
+ "]\n",
126
+ "\n",
127
+ "\n",
128
+ "surfer_agent = CodeAgent(\n",
129
+ " model=proprietary_model,\n",
130
+ " tools=WEB_TOOLS,\n",
131
+ " max_steps=20,\n",
132
+ " verbosity_level=2,\n",
133
+ ")\n",
134
+ "\n",
135
+ "results_text = answer_questions(\n",
136
+ " eval_ds,\n",
137
+ " surfer_agent,\n",
138
+ " \"code_gpt4o_27-01_text\",\n",
139
+ " reformulation_model=proprietary_model,\n",
140
+ " output_folder=\"output_browsers\",\n",
141
+ " visual_inspection_tool=VisualQAGPT4Tool(),\n",
142
+ " text_inspector_tool=TextInspectorTool(proprietary_model, 40000),\n",
143
+ ")"
144
+ ]
145
+ },
146
+ {
147
+ "cell_type": "markdown",
148
+ "metadata": {},
149
+ "source": [
150
+ "### Vision browser"
151
+ ]
152
+ },
153
+ {
154
+ "cell_type": "code",
155
+ "execution_count": null,
156
+ "metadata": {},
157
+ "outputs": [],
158
+ "source": [
159
+ "!pip install helium -q"
160
+ ]
161
+ },
162
+ {
163
+ "cell_type": "code",
164
+ "execution_count": null,
165
+ "metadata": {},
166
+ "outputs": [],
167
+ "source": [
168
+ "from scripts.visual_qa import VisualQAGPT4Tool\n",
169
+ "\n",
170
+ "from smolagents import CodeAgent, LiteLLMModel, WebSearchTool\n",
171
+ "from smolagents.vision_web_browser import (\n",
172
+ " close_popups,\n",
173
+ " go_back,\n",
174
+ " helium_instructions,\n",
175
+ " initialize_agent,\n",
176
+ " save_screenshot,\n",
177
+ " search_item_ctrl_f,\n",
178
+ ")\n",
179
+ "\n",
180
+ "\n",
181
+ "proprietary_model = LiteLLMModel(model_id=\"gpt-4o\")\n",
182
+ "vision_browser_agent = initialize_agent(proprietary_model)\n",
183
+ "### BUILD AGENTS & TOOLS\n",
184
+ "\n",
185
+ "CodeAgent(\n",
186
+ " tools=[WebSearchTool(), go_back, close_popups, search_item_ctrl_f],\n",
187
+ " model=proprietary_model,\n",
188
+ " additional_authorized_imports=[\"helium\"],\n",
189
+ " step_callbacks=[save_screenshot],\n",
190
+ " max_steps=20,\n",
191
+ " verbosity_level=2,\n",
192
+ ")\n",
193
+ "\n",
194
+ "results_vision = answer_questions(\n",
195
+ " eval_ds,\n",
196
+ " vision_browser_agent,\n",
197
+ " \"code_gpt4o_27-01_vision\",\n",
198
+ " reformulation_model=proprietary_model,\n",
199
+ " output_folder=\"output_browsers\",\n",
200
+ " visual_inspection_tool=VisualQAGPT4Tool(),\n",
201
+ " text_inspector_tool=TextInspectorTool(proprietary_model, 40000),\n",
202
+ " postprompt=helium_instructions\n",
203
+ " + \"Any web browser controls won't work on .pdf urls, rather use the tool 'inspect_file_as_text' to read them\",\n",
204
+ ")"
205
+ ]
206
+ },
207
+ {
208
+ "cell_type": "markdown",
209
+ "metadata": {},
210
+ "source": [
211
+ "### Browser-use browser"
212
+ ]
213
+ },
214
+ {
215
+ "cell_type": "code",
216
+ "execution_count": null,
217
+ "metadata": {},
218
+ "outputs": [],
219
+ "source": [
220
+ "!pip install browser-use lxml_html_clean -q\n",
221
+ "!playwright install"
222
+ ]
223
+ },
224
+ {
225
+ "cell_type": "code",
226
+ "execution_count": null,
227
+ "metadata": {},
228
+ "outputs": [],
229
+ "source": [
230
+ "import asyncio\n",
231
+ "\n",
232
+ "import nest_asyncio\n",
233
+ "\n",
234
+ "\n",
235
+ "nest_asyncio.apply()\n",
236
+ "\n",
237
+ "from browser_use import Agent\n",
238
+ "from dotenv import load_dotenv\n",
239
+ "from langchain_openai import ChatOpenAI\n",
240
+ "\n",
241
+ "\n",
242
+ "load_dotenv()\n",
243
+ "\n",
244
+ "\n",
245
+ "class BrowserUseAgent:\n",
246
+ " logs = []\n",
247
+ "\n",
248
+ " def write_inner_memory_from_logs(self, summary_mode):\n",
249
+ " return self.results\n",
250
+ "\n",
251
+ " def run(self, task, **kwargs):\n",
252
+ " agent = Agent(\n",
253
+ " task=task,\n",
254
+ " llm=ChatOpenAI(model=\"gpt-4o\"),\n",
255
+ " )\n",
256
+ " self.results = asyncio.get_event_loop().run_until_complete(agent.run())\n",
257
+ " return self.results.history[-1].result[0].extracted_content\n",
258
+ "\n",
259
+ "\n",
260
+ "browser_use_agent = BrowserUseAgent()\n",
261
+ "\n",
262
+ "results_browseruse = answer_questions(\n",
263
+ " eval_ds,\n",
264
+ " browser_use_agent,\n",
265
+ " \"gpt-4o_27-01_browseruse\",\n",
266
+ " reformulation_model=proprietary_model,\n",
267
+ " output_folder=\"output_browsers\",\n",
268
+ " visual_inspection_tool=VisualQAGPT4Tool(),\n",
269
+ " text_inspector_tool=TextInspectorTool(proprietary_model, 40000),\n",
270
+ " postprompt=\"\",\n",
271
+ " run_simple=True,\n",
272
+ ")"
273
+ ]
274
+ },
275
+ {
276
+ "cell_type": "markdown",
277
+ "metadata": {},
278
+ "source": [
279
+ "### Get results"
280
+ ]
281
+ },
282
+ {
283
+ "cell_type": "code",
284
+ "execution_count": null,
285
+ "metadata": {},
286
+ "outputs": [],
287
+ "source": [
288
+ "import pandas as pd\n",
289
+ "from scripts.gaia_scorer import question_scorer\n",
290
+ "\n",
291
+ "\n",
292
+ "results_vision, results_text, results_browseruse = (\n",
293
+ " pd.DataFrame(results_vision),\n",
294
+ " pd.DataFrame(results_text),\n",
295
+ " pd.DataFrame(results_browseruse),\n",
296
+ ")\n",
297
+ "\n",
298
+ "results_vision[\"is_correct\"] = results_vision.apply(\n",
299
+ " lambda x: question_scorer(x[\"prediction\"], x[\"true_answer\"]), axis=1\n",
300
+ ")\n",
301
+ "results_text[\"is_correct\"] = results_text.apply(lambda x: question_scorer(x[\"prediction\"], x[\"true_answer\"]), axis=1)\n",
302
+ "results_browseruse[\"is_correct\"] = results_browseruse.apply(\n",
303
+ " lambda x: question_scorer(x[\"prediction\"], x[\"true_answer\"]), axis=1\n",
304
+ ")"
305
+ ]
306
+ },
307
+ {
308
+ "cell_type": "code",
309
+ "execution_count": null,
310
+ "metadata": {},
311
+ "outputs": [],
312
+ "source": [
313
+ "results = pd.concat([results_vision, results_text, results_browseruse])\n",
314
+ "results.groupby(\"agent_name\")[\"is_correct\"].mean()"
315
+ ]
316
+ },
317
+ {
318
+ "cell_type": "code",
319
+ "execution_count": null,
320
+ "metadata": {},
321
+ "outputs": [],
322
+ "source": [
323
+ "correct_vision_results = results_vision.loc[results_vision[\"is_correct\"]]\n",
324
+ "correct_vision_results"
325
+ ]
326
+ },
327
+ {
328
+ "cell_type": "code",
329
+ "execution_count": null,
330
+ "metadata": {},
331
+ "outputs": [],
332
+ "source": [
333
+ "false_text_results = results_text.loc[~results_text[\"is_correct\"]]\n",
334
+ "false_text_results"
335
+ ]
336
+ }
337
+ ],
338
+ "metadata": {
339
+ "kernelspec": {
340
+ "display_name": "gaia",
341
+ "language": "python",
342
+ "name": "python3"
343
+ },
344
+ "language_info": {
345
+ "codemirror_mode": {
346
+ "name": "ipython",
347
+ "version": 3
348
+ },
349
+ "file_extension": ".py",
350
+ "mimetype": "text/x-python",
351
+ "name": "python",
352
+ "nbconvert_exporter": "python",
353
+ "pygments_lexer": "ipython3",
354
+ "version": "3.12.0"
355
+ }
356
+ },
357
+ "nbformat": 4,
358
+ "nbformat_minor": 2
359
+ }