Upload 7 files

- examples/open_deep_research/README.md +64 -0
- examples/open_deep_research/analysis.ipynb +457 -0
- examples/open_deep_research/app.py +11 -0
- examples/open_deep_research/requirements.txt +39 -0
- examples/open_deep_research/run.py +125 -0
- examples/open_deep_research/run_gaia.py +303 -0
- examples/open_deep_research/visual_vs_text_browser.ipynb +359 -0
examples/open_deep_research/README.md
ADDED
@@ -0,0 +1,64 @@

# Open Deep Research

Welcome to this open replication of [OpenAI's Deep Research](https://openai.com/index/introducing-deep-research/)! This agent attempts to replicate OpenAI's system and achieve similar performance on research tasks.

Read more about this implementation's goals and methods in our [blog post](https://huggingface.co/blog/open-deep-research).

This agent achieves **55% pass@1** on the GAIA validation set, compared to **67%** for the original Deep Research.

## Setup

To get started, follow the steps below:

### Clone the repository

```bash
git clone https://github.com/huggingface/smolagents.git
cd smolagents/examples/open_deep_research
```

### Install dependencies

Run the following command to install the required dependencies from the `requirements.txt` file:

```bash
pip install -r requirements.txt
```

### Install the development version of `smolagents`

```bash
pip install -e ../../.[dev]
```

### Set up environment variables

The agent uses the `GoogleSearchTool` for web search, which requires an environment variable with the API key for the selected provider:
- `SERPAPI_API_KEY` for SerpApi: [Sign up here to get a key](https://serpapi.com/users/sign_up)
- `SERPER_API_KEY` for Serper: [Sign up here to get a key](https://serper.dev/signup)

Depending on the model you want to use, you may also need to set further environment variables. For example, to use the default `o1` model, you need to set the `OPENAI_API_KEY` environment variable ([sign up here to get a key](https://platform.openai.com/signup)).

> [!WARNING]
> The use of the default `o1` model is restricted to tier-3 access: https://help.openai.com/en/articles/10362446-api-access-to-o1-and-o3-mini
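
For instance, you can put these keys in a `.env` file in this folder: the scripts load it with `load_dotenv`. A minimal sketch with placeholder values (`HF_TOKEN` is also read by the scripts for the Hugging Face Hub login):

```bash
# .env (example values only, replace with your own keys)
OPENAI_API_KEY=sk-...
SERPER_API_KEY=your-serper-key
# SERPAPI_API_KEY=your-serpapi-key  # if using SerpApi instead of Serper
HF_TOKEN=hf_...
```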

## Usage

Then you're good to go! Run the `run.py` script, as in:

```bash
python run.py --model-id "o1" "Your question here!"
```
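
You can also drive the agent from Python instead of the CLI. A minimal sketch using the `create_agent` helper defined in `run.py` (the question is just the example from its CLI help text):

```python
from run import create_agent

# Assumes OPENAI_API_KEY and a search-provider key are set in the environment
agent = create_agent(model_id="o1")
answer = agent.run("How many studio albums did Mercedes Sosa release before 2007?")
print(f"Got this answer: {answer}")
```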

## Full reproducibility of results

The data used in our submissions to GAIA was augmented as follows:
- Each single-page .pdf or .xls file was opened in a file reader (macOS Sonoma Numbers or Preview), and a `.png` screenshot was taken and added to the folder.
- For any file used in a question, the file-loading system then checks whether a `.png` version of the file exists, and loads it instead of the original if it does.

This process was done manually but could be automated.

After processing, the annotated dataset was uploaded to a [new dataset](https://huggingface.co/datasets/smolagents/GAIA-annotated). You need to request access (granted instantly).
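
For illustration, the substitution check could look like the sketch below; `resolve_attachment` is a hypothetical helper, not the exact loader used in these scripts:

```python
from pathlib import Path


def resolve_attachment(file_path: str) -> str:
    # Hypothetical helper: prefer a pre-made .png screenshot of the file if one exists.
    png_candidate = Path(file_path).with_suffix(".png")  # e.g. table.xls -> table.png
    if png_candidate.exists():
        return str(png_candidate)
    return file_path
```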
examples/open_deep_research/analysis.ipynb
ADDED
@@ -0,0 +1,457 @@

```python
!pip install plotly kaleido datasets nbformat -U -q
```

```python
import os

import datasets
import pandas as pd
from dotenv import load_dotenv
from huggingface_hub import login


load_dotenv(override=True)
login(os.getenv("HF_TOKEN"))

pd.set_option("max_colwidth", None)

OUTPUT_DIR = "output"
```

```python
eval_ds = datasets.load_dataset("gaia-benchmark/GAIA", "2023_all")["validation"]
eval_ds = eval_ds.rename_columns({"Question": "question", "Final answer": "true_answer", "Level": "task"})
eval_df = pd.DataFrame(eval_ds)
```

# 1. Load all results

```python
import glob


results = []
for f in glob.glob(f"{OUTPUT_DIR}/validation/*.jsonl"):
    df = pd.read_json(f, lines=True)
    df["agent_name"] = f.split("/")[-1].split(".")[0]
    results.append(df)

result_df = pd.concat(results)
result_df["prediction"] = result_df["prediction"].fillna("No prediction")
```

```python
import re
from collections import Counter

from scripts.gaia_scorer import check_close_call, question_scorer


result_df["is_correct"] = result_df.apply(lambda x: question_scorer(x["prediction"], x["true_answer"]), axis=1)
result_df["is_near_correct"] = result_df.apply(
    lambda x: check_close_call(x["prediction"], x["true_answer"], x["is_correct"]),
    axis=1,
)

result_df["count_steps"] = result_df["intermediate_steps"].apply(len)


def find_attachment(question):
    matches = eval_df.loc[eval_df["question"].apply(lambda x: x in question), "file_name"]

    if len(matches) == 0:
        return "Not found"
    file_path = matches.values[0]

    if isinstance(file_path, str) and len(file_path) > 0:
        return file_path.split(".")[-1]
    else:
        return "None"


result_df["attachment_type"] = result_df["question"].apply(find_attachment)


def extract_tool_calls(code):
    regex = r"\b(\w+)\("
    function_calls = [el for el in re.findall(regex, code) if el.islower()]

    function_call_counter = Counter(function_calls)
    return function_call_counter


def sum_tool_calls(steps):
    total_count = Counter()
    for step in steps:
        if "llm_output" in step:
            total_count += extract_tool_calls(step["llm_output"])

    return total_count


def get_durations(row):
    # start_datetime = datetime.strptime(row['start_time'], "%Y-%m-%d %H:%M:%S")
    # end_datetime = datetime.strptime(row['end_time'], "%Y-%m-%d %H:%M:%S")

    duration_timedelta = row["end_time"] - row["start_time"]
    return int(duration_timedelta.total_seconds())


result_df["duration"] = result_df.apply(get_durations, axis=1)
# result_df["tool_calls"] = result_df["intermediate_steps"].apply(sum_tool_calls)
```

```python
result_df["agent_name"].value_counts()
```

# 2. Inspect specific runs

```python
sel_df = result_df
# sel_df = sel_df.loc[
#     (result_df["agent_name"].isin(list_versions))
# ]
sel_df = sel_df.reset_index(drop=True)
display(sel_df["agent_name"].value_counts())
sel_df = sel_df.drop_duplicates(subset=["agent_name", "question"])
display(sel_df.groupby("agent_name")[["task"]].value_counts())
print("Total length:", len(sel_df), "- is complete:", len(sel_df) == 165)
```

```python
display("Average score:", sel_df.groupby("agent_name")[["is_correct"]].mean().round(3))
display(
    sel_df.groupby(["agent_name", "task"])[["is_correct", "is_near_correct", "count_steps", "question", "duration"]]
    .agg(
        {
            "is_correct": "mean",
            "is_near_correct": "mean",
            "count_steps": "mean",
            "question": "count",
            "duration": "mean",
        }
    )
    .rename(columns={"question": "count"})
)
```

```python
import plotly.express as px


cumulative_df = (
    (
        sel_df.groupby("agent_name")[["is_correct", "is_near_correct"]]
        .expanding(min_periods=1, axis=0, method="single")
        .agg({"is_correct": "mean", "is_near_correct": "count"})
        .reset_index()
    )
    .copy()
    .rename(columns={"is_near_correct": "index"})
)
cumulative_df["index"] = cumulative_df["index"].astype(int) - 1


def find_question(row):
    try:
        res = sel_df.loc[sel_df["agent_name"] == row["agent_name"], "question"].iloc[row["index"]][:50]
        return res
    except Exception:
        return ""


cumulative_df["question"] = cumulative_df.apply(find_question, axis=1)

px.line(
    cumulative_df,
    color="agent_name",
    x="index",
    y="is_correct",
    hover_data="question",
)
```

# 3. Dive deeper into one run

```python
sel_df = result_df.loc[result_df["agent_name"] == "o1"]
print(len(sel_df))
```

### Count errors

```python
import numpy as np


error_types = [
    "AgentParsingError",
    "AgentExecutionError",
    "AgentMaxIterationsError",
    "AgentGenerationError",
]
sel_df[error_types] = 0
sel_df["Count steps"] = np.nan


def count_errors(row):
    if isinstance(row["intermediate_steps"], list):
        row["Count steps"] = len(row["intermediate_steps"])
        for step in row["intermediate_steps"]:
            if isinstance(step, dict) and "error" in step:
                try:
                    row[str(step["error"]["error_type"])] += 1
                except Exception:
                    pass
    return row


sel_df = sel_df.apply(count_errors, axis=1)
```

```python
import plotly.express as px


aggregate_errors = (
    sel_df.groupby(["is_correct"])[error_types + ["Count steps"]].mean().reset_index().melt(id_vars=["is_correct"])
)

fig = px.bar(
    aggregate_errors,
    y="value",
    x="variable",
    color="is_correct",
    labels={
        "agent_name": "<b>Model</b>",
        "task": "<b>Level</b>",
        "aggregate_score": "<b>Performance</b>",
        "value": "<b>Average count</b>",
        "eval_score_GPT4": "<b>Score</b>",
    },
)
fig.update_layout(
    height=500,
    width=800,
    barmode="group",
    bargroupgap=0.0,
)
fig.update_traces(textposition="outside")
fig.write_image("aggregate_errors.png", scale=3)
fig.show()
```

### Inspect result by file extension type

```python
display(
    result_df.groupby(["attachment_type"])[["is_correct", "count_steps", "question"]].agg(
        {"is_correct": "mean", "count_steps": "mean", "question": "count"}
    )
)
```

# 4. Ensembling methods

```python
counts = result_df["agent_name"].value_counts()
long_series = result_df.loc[result_df["agent_name"].isin(counts[counts > 140].index)]
```

```python
def majority_vote(df):
    df = df[(df["prediction"] != "Unable to determine") & (~df["prediction"].isna()) & (df["prediction"] != "None")]

    answer_modes = df.groupby("question")["prediction"].agg(lambda x: x.mode()[0]).reset_index()
    first_occurrences = (
        df.groupby(["question", "prediction"]).agg({"task": "first", "is_correct": "first"}).reset_index()
    )
    result = answer_modes.merge(first_occurrences, on=["question", "prediction"], how="left")

    return result


def oracle(df):
    def get_first_correct_or_first_wrong(group):
        correct_answers = group[group["is_correct"]]
        if len(correct_answers) > 0:
            return correct_answers.iloc[0]
        return group.iloc[0]

    result = df.groupby("question").apply(get_first_correct_or_first_wrong)

    return result.reset_index(drop=True)


display((long_series.groupby("agent_name")["is_correct"].mean() * 100).round(2))
print(f"Majority score: {majority_vote(long_series)['is_correct'].mean() * 100:.2f}")
print(f"Oracle score: {oracle(long_series)['is_correct'].mean() * 100:.2f}")
```

### Submit

```python
agent_run = "code_o1_04_february_submission5.jsonl"
df = pd.read_json(f"output/validation/{agent_run}", lines=True)
df = df[["task_id", "prediction", "intermediate_steps"]]
df = df.rename(columns={"prediction": "model_answer", "intermediate_steps": "reasoning_trace"})
```

```python
df.to_json("submission.jsonl", orient="records", lines=True)
```
examples/open_deep_research/app.py
ADDED
@@ -0,0 +1,11 @@

```python
from run import create_agent

from smolagents.gradio_ui import GradioUI


agent = create_agent()

demo = GradioUI(agent)

if __name__ == "__main__":
    demo.launch()
```
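
To launch this Gradio UI locally (with the same environment variables set as for `run.py`):

```bash
python app.py
```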
examples/open_deep_research/requirements.txt
ADDED
@@ -0,0 +1,39 @@

```
anthropic>=0.37.1
audioop-lts<1.0; python_version >= "3.13"  # required to use pydub in Python >=3.13; LTS port of the removed Python builtin module audioop
beautifulsoup4>=4.12.3
datasets>=2.21.0
google_search_results>=2.4.2
huggingface_hub>=0.23.4
mammoth>=1.8.0
markdownify>=0.13.1
numexpr>=2.10.1
numpy>=2.1.2
openai>=1.52.2
openpyxl
pandas>=2.2.3
pathvalidate>=3.2.1
pdfminer>=20191125
pdfminer.six>=20240706
Pillow>=11.0.0
puremagic>=1.28
pypdf>=5.1.0
python-dotenv>=1.0.1
python_pptx>=1.0.2
Requests>=2.32.3
tqdm>=4.66.4
torch>=2.2.2
torchvision>=0.17.2
transformers>=4.46.0
youtube_transcript_api>=0.6.2
chess
sympy
pubchempy
Bio
scikit-learn
scipy
pydub
PyPDF2
xlrd
SpeechRecognition
```
examples/open_deep_research/run.py
ADDED
@@ -0,0 +1,125 @@

```python
import argparse
import os
import threading

from dotenv import load_dotenv
from huggingface_hub import login
from scripts.text_inspector_tool import TextInspectorTool
from scripts.text_web_browser import (
    ArchiveSearchTool,
    FinderTool,
    FindNextTool,
    PageDownTool,
    PageUpTool,
    SimpleTextBrowser,
    VisitTool,
)
from scripts.visual_qa import visualizer

from smolagents import (
    CodeAgent,
    GoogleSearchTool,
    # InferenceClientModel,
    LiteLLMModel,
    ToolCallingAgent,
)


load_dotenv(override=True)
login(os.getenv("HF_TOKEN"))

append_answer_lock = threading.Lock()


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "question", type=str, help="for example: 'How many studio albums did Mercedes Sosa release before 2007?'"
    )
    parser.add_argument("--model-id", type=str, default="o1")
    return parser.parse_args()


custom_role_conversions = {"tool-call": "assistant", "tool-response": "user"}

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0"

BROWSER_CONFIG = {
    "viewport_size": 1024 * 5,
    "downloads_folder": "downloads_folder",
    "request_kwargs": {
        "headers": {"User-Agent": user_agent},
        "timeout": 300,
    },
    "serpapi_key": os.getenv("SERPAPI_API_KEY"),
}

os.makedirs(f"./{BROWSER_CONFIG['downloads_folder']}", exist_ok=True)


def create_agent(model_id="o1"):
    model_params = {
        "model_id": model_id,
        "custom_role_conversions": custom_role_conversions,
        "max_completion_tokens": 8192,
    }
    if model_id == "o1":
        model_params["reasoning_effort"] = "high"
    model = LiteLLMModel(**model_params)

    text_limit = 100000
    browser = SimpleTextBrowser(**BROWSER_CONFIG)
    WEB_TOOLS = [
        GoogleSearchTool(provider="serper"),
        VisitTool(browser),
        PageUpTool(browser),
        PageDownTool(browser),
        FinderTool(browser),
        FindNextTool(browser),
        ArchiveSearchTool(browser),
        TextInspectorTool(model, text_limit),
    ]
    text_webbrowser_agent = ToolCallingAgent(
        model=model,
        tools=WEB_TOOLS,
        max_steps=20,
        verbosity_level=2,
        planning_interval=4,
        name="search_agent",
        description="""A team member that will search the internet to answer your question.
Ask him for all your questions that require browsing the web.
Provide him as much context as possible, in particular if you need to search on a specific timeframe!
And don't hesitate to provide him with a complex search task, like finding a difference between two webpages.
Your request must be a real sentence, not a google search! Like "Find me this information (...)" rather than a few keywords.
""",
        provide_run_summary=True,
    )
    text_webbrowser_agent.prompt_templates["managed_agent"]["task"] += """You can navigate to .txt online files.
If a non-html page is in another format, especially .pdf or a Youtube video, use tool 'inspect_file_as_text' to inspect it.
Additionally, if after some searching you find out that you need more information to answer the question, you can use `final_answer` with your request for clarification as argument to request for more information."""

    manager_agent = CodeAgent(
        model=model,
        tools=[visualizer, TextInspectorTool(model, text_limit)],
        max_steps=12,
        verbosity_level=2,
        additional_authorized_imports=["*"],
        planning_interval=4,
        managed_agents=[text_webbrowser_agent],
    )

    return manager_agent


def main():
    args = parse_args()

    agent = create_agent(model_id=args.model_id)

    answer = agent.run(args.question)

    print(f"Got this answer: {answer}")


if __name__ == "__main__":
    main()
```
examples/open_deep_research/run_gaia.py
ADDED
@@ -0,0 +1,303 @@

```python
# EXAMPLE COMMAND: from folder examples/open_deep_research, run: python run_gaia.py --concurrency 32 --run-name generate-traces-03-apr-noplanning --model-id gpt-4o
import argparse
import json
import os
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed
from datetime import datetime
from pathlib import Path
from typing import Any

import datasets
import pandas as pd
from dotenv import load_dotenv
from huggingface_hub import login, snapshot_download
from scripts.reformulator import prepare_response
from scripts.run_agents import (
    get_single_file_description,
    get_zip_description,
)
from scripts.text_inspector_tool import TextInspectorTool
from scripts.text_web_browser import (
    ArchiveSearchTool,
    FinderTool,
    FindNextTool,
    PageDownTool,
    PageUpTool,
    SimpleTextBrowser,
    VisitTool,
)
from scripts.visual_qa import visualizer
from tqdm import tqdm

from smolagents import (
    CodeAgent,
    GoogleSearchTool,
    LiteLLMModel,
    Model,
    ToolCallingAgent,
)


load_dotenv(override=True)
login(os.getenv("HF_TOKEN"))

append_answer_lock = threading.Lock()


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--concurrency", type=int, default=8)
    parser.add_argument("--model-id", type=str, default="o1")
    parser.add_argument("--run-name", type=str, required=True)
    parser.add_argument("--set-to-run", type=str, default="validation")
    parser.add_argument("--use-open-models", type=bool, default=False)
    parser.add_argument("--use-raw-dataset", action="store_true")
    return parser.parse_args()


### IMPORTANT: EVALUATION SWITCHES

print("Make sure you deactivated any VPN like Tailscale, else some URLs will be blocked!")

custom_role_conversions = {"tool-call": "assistant", "tool-response": "user"}


user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36 Edg/119.0.0.0"

BROWSER_CONFIG = {
    "viewport_size": 1024 * 5,
    "downloads_folder": "downloads_folder",
    "request_kwargs": {
        "headers": {"User-Agent": user_agent},
        "timeout": 300,
    },
    "serpapi_key": os.getenv("SERPAPI_API_KEY"),
}

os.makedirs(f"./{BROWSER_CONFIG['downloads_folder']}", exist_ok=True)


def create_agent_team(model: Model):
    text_limit = 100000
    ti_tool = TextInspectorTool(model, text_limit)

    browser = SimpleTextBrowser(**BROWSER_CONFIG)

    WEB_TOOLS = [
        GoogleSearchTool(provider="serper"),
        VisitTool(browser),
        PageUpTool(browser),
        PageDownTool(browser),
        FinderTool(browser),
        FindNextTool(browser),
        ArchiveSearchTool(browser),
        TextInspectorTool(model, text_limit),
    ]

    text_webbrowser_agent = ToolCallingAgent(
        model=model,
        tools=WEB_TOOLS,
        max_steps=20,
        verbosity_level=2,
        planning_interval=4,
        name="search_agent",
        description="""A team member that will search the internet to answer your question.
Ask him for all your questions that require browsing the web.
Provide him as much context as possible, in particular if you need to search on a specific timeframe!
And don't hesitate to provide him with a complex search task, like finding a difference between two webpages.
Your request must be a real sentence, not a google search! Like "Find me this information (...)" rather than a few keywords.
""",
        provide_run_summary=True,
    )
    text_webbrowser_agent.prompt_templates["managed_agent"]["task"] += """You can navigate to .txt online files.
If a non-html page is in another format, especially .pdf or a Youtube video, use tool 'inspect_file_as_text' to inspect it.
Additionally, if after some searching you find out that you need more information to answer the question, you can use `final_answer` with your request for clarification as argument to request for more information."""

    manager_agent = CodeAgent(
        model=model,
        tools=[visualizer, ti_tool],
        max_steps=12,
        verbosity_level=2,
        additional_authorized_imports=["*"],
        planning_interval=4,
        managed_agents=[text_webbrowser_agent],
    )
    return manager_agent


def load_gaia_dataset(use_raw_dataset: bool, set_to_run: str) -> datasets.Dataset:
    if not os.path.exists("data/gaia"):
        if use_raw_dataset:
            snapshot_download(
                repo_id="gaia-benchmark/GAIA",
                repo_type="dataset",
                local_dir="data/gaia",
                ignore_patterns=[".gitattributes", "README.md"],
            )
        else:
            # WARNING: this dataset is gated: make sure you visit the repo to request access.
            snapshot_download(
                repo_id="smolagents/GAIA-annotated",
                repo_type="dataset",
                local_dir="data/gaia",
                ignore_patterns=[".gitattributes", "README.md"],
            )

    def preprocess_file_paths(row):
        if len(row["file_name"]) > 0:
            row["file_name"] = f"data/gaia/{set_to_run}/" + row["file_name"]
        return row

    eval_ds = datasets.load_dataset(
        "data/gaia/GAIA.py",
        name="2023_all",
        split=set_to_run,
        # data_files={"validation": "validation/metadata.jsonl", "test": "test/metadata.jsonl"},
    )

    eval_ds = eval_ds.rename_columns({"Question": "question", "Final answer": "true_answer", "Level": "task"})
    eval_ds = eval_ds.map(preprocess_file_paths)
    return eval_ds


def append_answer(entry: dict, jsonl_file: str) -> None:
    jsonl_path = Path(jsonl_file)
    jsonl_path.parent.mkdir(parents=True, exist_ok=True)
    with append_answer_lock, open(jsonl_file, "a", encoding="utf-8") as fp:
        fp.write(json.dumps(entry) + "\n")
    assert jsonl_path.exists(), "File not found!"
    print("Answer exported to file:", jsonl_path.resolve())


def answer_single_question(
    example: dict, model_id: str, answers_file: str, visual_inspection_tool: TextInspectorTool
) -> None:
    model_params: dict[str, Any] = {
        "model_id": model_id,
        "custom_role_conversions": custom_role_conversions,
    }
    if model_id == "o1":
        model_params["reasoning_effort"] = "high"
        model_params["max_completion_tokens"] = 8192
    else:
        model_params["max_tokens"] = 4096
    model = LiteLLMModel(**model_params)
    # model = InferenceClientModel(model_id="Qwen/Qwen3-32B", provider="novita", max_tokens=4096)
    document_inspection_tool = TextInspectorTool(model, 100000)

    agent = create_agent_team(model)

    augmented_question = """You have one question to answer. It is paramount that you provide a correct answer.
Give it all you can: I know for a fact that you have access to all the relevant tools to solve it and find the correct answer (the answer does exist).
Failure or 'I cannot answer' or 'None found' will not be tolerated, success will be rewarded.
Run verification steps if that's needed, you must make sure you find the correct answer! Here is the task:

""" + example["question"]

    if example["file_name"]:
        if ".zip" in example["file_name"]:
            prompt_use_files = "\n\nTo solve the task above, you will have to use these attached files:\n"
            prompt_use_files += get_zip_description(
                example["file_name"], example["question"], visual_inspection_tool, document_inspection_tool
            )
        else:
            prompt_use_files = "\n\nTo solve the task above, you will have to use this attached file:\n"
            prompt_use_files += get_single_file_description(
                example["file_name"], example["question"], visual_inspection_tool, document_inspection_tool
            )
        augmented_question += prompt_use_files

    start_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    try:
        # Run agent 🚀
        final_result = agent.run(augmented_question)

        agent_memory = agent.write_memory_to_messages()

        final_result = prepare_response(augmented_question, agent_memory, reformulation_model=model)

        output = str(final_result)
        for memory_step in agent.memory.steps:
            memory_step.model_input_messages = None
        intermediate_steps = agent_memory

        # Check for parsing errors which indicate the LLM failed to follow the required format
        parsing_error = True if any(["AgentParsingError" in step for step in intermediate_steps]) else False

        # check if iteration limit exceeded
        iteration_limit_exceeded = True if "Agent stopped due to iteration limit or time limit." in output else False
        raised_exception = False

    except Exception as e:
        print("Error on ", augmented_question, e)
        output = None
        intermediate_steps = []
        parsing_error = False
        iteration_limit_exceeded = False
        exception = e
        raised_exception = True
    end_time = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    token_counts_manager = agent.monitor.get_total_token_counts()
    token_counts_web = list(agent.managed_agents.values())[0].monitor.get_total_token_counts()
    total_token_counts = {
        "input": token_counts_manager["input"] + token_counts_web["input"],
        "output": token_counts_manager["output"] + token_counts_web["output"],
    }
    annotated_example = {
        "agent_name": model.model_id,
        "question": example["question"],
        "augmented_question": augmented_question,
        "prediction": output,
        "intermediate_steps": intermediate_steps,
        "parsing_error": parsing_error,
        "iteration_limit_exceeded": iteration_limit_exceeded,
        "agent_error": str(exception) if raised_exception else None,
        "task": example["task"],
        "task_id": example["task_id"],
        "true_answer": example["true_answer"],
        "start_time": start_time,
        "end_time": end_time,
        "token_counts": total_token_counts,
    }
    append_answer(annotated_example, answers_file)


def get_examples_to_answer(answers_file: str, eval_ds: datasets.Dataset) -> list[dict]:
    print(f"Loading answers from {answers_file}...")
    try:
        done_questions = pd.read_json(answers_file, lines=True)["question"].tolist()
        print(f"Found {len(done_questions)} previous results!")
    except Exception as e:
        print("Error when loading records: ", e)
        print("No usable records! ▶️ Starting new.")
        done_questions = []
    return [line for line in eval_ds.to_list() if line["question"] not in done_questions and line["file_name"]]


def main():
    args = parse_args()
    print(f"Starting run with arguments: {args}")

    eval_ds = load_gaia_dataset(args.use_raw_dataset, args.set_to_run)
    print("Loaded evaluation dataset:")
    print(pd.DataFrame(eval_ds)["task"].value_counts())

    answers_file = f"output/{args.set_to_run}/{args.run_name}.jsonl"
    tasks_to_run = get_examples_to_answer(answers_file, eval_ds)

    with ThreadPoolExecutor(max_workers=args.concurrency) as exe:
        futures = [
            exe.submit(answer_single_question, example, args.model_id, answers_file, visualizer)
            for example in tasks_to_run
        ]
        for f in tqdm(as_completed(futures), total=len(tasks_to_run), desc="Processing tasks"):
            f.result()

    # for example in tasks_to_run:
    #     answer_single_question(example, args.model_id, answers_file, visualizer)
    print("All tasks processed.")


if __name__ == "__main__":
    main()
```
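
Once a run has finished, the resulting `.jsonl` traces can be scored the same way `analysis.ipynb` does. A minimal sketch, where the run name is a placeholder for whatever `--run-name` you passed:

```python
import pandas as pd
from scripts.gaia_scorer import question_scorer

# "my-run" is a placeholder for the --run-name used above
df = pd.read_json("output/validation/my-run.jsonl", lines=True)
df["is_correct"] = df.apply(lambda x: question_scorer(x["prediction"], x["true_answer"]), axis=1)
print(f"Accuracy: {df['is_correct'].mean():.2%}")
```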
examples/open_deep_research/visual_vs_text_browser.ipynb
ADDED
@@ -0,0 +1,359 @@

# Compare a text-based vs a vision-based browser

Warning: this notebook is experimental, it probably won't work out of the box!

```python
!pip install "smolagents[litellm,toolkit]" -q
```

```python
import datasets


eval_ds = datasets.load_dataset("gaia-benchmark/GAIA", "2023_all")["validation"]
```

```python
to_keep = [
    "What's the last line of the rhyme under the flavor",
    'Of the authors (First M. Last) that worked on the paper "Pie Menus or Linear Menus',
    "In Series 9, Episode 11 of Doctor Who, the Doctor is trapped inside an ever-shifting maze. What is this location called in the official script for the episode? Give the setting exactly as it appears in the first scene heading.",
    "Which contributor to the version of OpenCV where support was added for the Mask-RCNN model has the same name as a former Chinese head of government when the names are transliterated to the Latin alphabet?",
    "The photograph in the Whitney Museum of American Art's collection with accession number 2022.128 shows a person holding a book. Which military unit did the author of this book join in 1813? Answer without using articles.",
    "I went to Virtue restaurant & bar in Chicago for my birthday on March 22, 2021 and the main course I had was delicious! Unfortunately, when I went back about a month later on April 21, it was no longer on the dinner menu.",
    "In Emily Midkiff's June 2014 article in a journal named for the one of Hreidmar's ",
    "Under DDC 633 on Bielefeld University Library's BASE, as of 2020",
    "In the 2018 VSCode blog post on replit.com, what was the command they clicked on in the last video to remove extra lines?",
    "The Metropolitan Museum of Art has a portrait in its collection with an accession number of 29.100.5. Of the consecrators and co-consecrators",
    "In Nature journal's Scientific Reports conference proceedings from 2012, in the article that did not mention plasmons or plasmonics, what nano-compound is studied?",
    'In the year 2022, and before December, what does "R" stand for in the three core policies of the type of content',
    "Who nominated the only Featured Article on English Wikipedia about a dinosaur that was promoted in November 2016?",
]
eval_ds = eval_ds.filter(lambda row: any([el in row["Question"] for el in to_keep]))
eval_ds = eval_ds.rename_columns({"Question": "question", "Final answer": "true_answer", "Level": "task"})
```

```python
import os

from dotenv import load_dotenv
from huggingface_hub import login


load_dotenv(override=True)

login(os.getenv("HF_TOKEN"))
```

### Text browser

```python
from scripts.run_agents import answer_questions
from scripts.text_inspector_tool import TextInspectorTool
from scripts.text_web_browser import (
    ArchiveSearchTool,
    FinderTool,
    FindNextTool,
    NavigationalSearchTool,
    PageDownTool,
    PageUpTool,
    SearchInformationTool,
    VisitTool,
)
from scripts.visual_qa import VisualQAGPT4Tool

from smolagents import CodeAgent, LiteLLMModel


proprietary_model = LiteLLMModel(model_id="gpt-4o")
```

```python
### BUILD AGENTS & TOOLS

WEB_TOOLS = [
    SearchInformationTool(),
    NavigationalSearchTool(),
    VisitTool(),
    PageUpTool(),
    PageDownTool(),
    FinderTool(),
    FindNextTool(),
    ArchiveSearchTool(),
]


surfer_agent = CodeAgent(
    model=proprietary_model,
    tools=WEB_TOOLS,
    max_steps=20,
    verbosity_level=2,
)

results_text = answer_questions(
    eval_ds,
    surfer_agent,
    "code_gpt4o_27-01_text",
    reformulation_model=proprietary_model,
    output_folder="output_browsers",
    visual_inspection_tool=VisualQAGPT4Tool(),
    text_inspector_tool=TextInspectorTool(proprietary_model, 40000),
)
```

### Vision browser

```python
!pip install helium -q
```

```python
from scripts.visual_qa import VisualQAGPT4Tool

from smolagents import CodeAgent, LiteLLMModel, WebSearchTool
from smolagents.vision_web_browser import (
    close_popups,
    go_back,
    helium_instructions,
    initialize_agent,
    save_screenshot,
    search_item_ctrl_f,
)


proprietary_model = LiteLLMModel(model_id="gpt-4o")
vision_browser_agent = initialize_agent(proprietary_model)
### BUILD AGENTS & TOOLS

CodeAgent(
    tools=[WebSearchTool(), go_back, close_popups, search_item_ctrl_f],
    model=proprietary_model,
    additional_authorized_imports=["helium"],
    step_callbacks=[save_screenshot],
    max_steps=20,
    verbosity_level=2,
)

results_vision = answer_questions(
    eval_ds,
    vision_browser_agent,
    "code_gpt4o_27-01_vision",
    reformulation_model=proprietary_model,
    output_folder="output_browsers",
    visual_inspection_tool=VisualQAGPT4Tool(),
    text_inspector_tool=TextInspectorTool(proprietary_model, 40000),
    postprompt=helium_instructions
    + "Any web browser controls won't work on .pdf urls, rather use the tool 'inspect_file_as_text' to read them",
)
```

### Browser-use browser

```python
!pip install browser-use lxml_html_clean -q
!playwright install
```

```python
import asyncio

import nest_asyncio


nest_asyncio.apply()

from browser_use import Agent
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI


load_dotenv()


class BrowserUseAgent:
    logs = []

    def write_inner_memory_from_logs(self, summary_mode):
        return self.results

    def run(self, task, **kwargs):
        agent = Agent(
            task=task,
            llm=ChatOpenAI(model="gpt-4o"),
        )
        self.results = asyncio.get_event_loop().run_until_complete(agent.run())
        return self.results.history[-1].result[0].extracted_content


browser_use_agent = BrowserUseAgent()

results_browseruse = answer_questions(
    eval_ds,
    browser_use_agent,
    "gpt-4o_27-01_browseruse",
    reformulation_model=proprietary_model,
    output_folder="output_browsers",
    visual_inspection_tool=VisualQAGPT4Tool(),
    text_inspector_tool=TextInspectorTool(proprietary_model, 40000),
    postprompt="",
    run_simple=True,
)
```

### Get results

```python
import pandas as pd
from scripts.gaia_scorer import question_scorer


results_vision, results_text, results_browseruse = (
    pd.DataFrame(results_vision),
    pd.DataFrame(results_text),
    pd.DataFrame(results_browseruse),
)

results_vision["is_correct"] = results_vision.apply(
    lambda x: question_scorer(x["prediction"], x["true_answer"]), axis=1
)
results_text["is_correct"] = results_text.apply(lambda x: question_scorer(x["prediction"], x["true_answer"]), axis=1)
results_browseruse["is_correct"] = results_browseruse.apply(
    lambda x: question_scorer(x["prediction"], x["true_answer"]), axis=1
)
```

```python
results = pd.concat([results_vision, results_text, results_browseruse])
results.groupby("agent_name")["is_correct"].mean()
```

```python
correct_vision_results = results_vision.loc[results_vision["is_correct"]]
correct_vision_results
```

```python
false_text_results = results_text.loc[~results_text["is_correct"]]
false_text_results
```