Duibonduil committed on
Commit
c0ce657
·
verified ·
1 Parent(s): e378ac8

Upload 4 files

docs/source/en/examples/rag.md ADDED
@@ -0,0 +1,206 @@
1
+ # Agentic RAG
2
+
3
+ [[open-in-colab]]
4
+
5
+ ## Introduction to Retrieval-Augmented Generation (RAG)
6
+
7
+ Retrieval-Augmented Generation (RAG) combines the power of large language models with external knowledge retrieval to produce more accurate, factual, and contextually relevant responses. At its core, RAG is about "using an LLM to answer a user query, but basing the answer on information retrieved from a knowledge base."
8
+
9
+ ### Why Use RAG?
10
+
11
+ RAG offers several significant advantages over using vanilla or fine-tuned LLMs:
12
+
13
+ 1. **Factual Grounding**: Reduces hallucinations by anchoring responses in retrieved facts
14
+ 2. **Domain Specialization**: Provides domain-specific knowledge without model retraining
15
+ 3. **Knowledge Recency**: Allows access to information beyond the model's training cutoff
16
+ 4. **Transparency**: Enables citation of sources for generated content
17
+ 5. **Control**: Offers fine-grained control over what information the model can access
18
+
19
+ ### Limitations of Traditional RAG
20
+
21
+ Despite its benefits, traditional RAG approaches face several challenges:
22
+
23
+ - **Single Retrieval Step**: If the initial retrieval results are poor, the final generation will suffer
24
+ - **Query-Document Mismatch**: User queries (often questions) may not match well with documents containing answers (often statements)
25
+ - **Limited Reasoning**: Simple RAG pipelines don't allow for multi-step reasoning or query refinement
26
+ - **Context Window Constraints**: Retrieved documents must fit within the model's context window
27
+
28
+ ## Agentic RAG: A More Powerful Approach
29
+
30
+ We can overcome these limitations by implementing an **Agentic RAG** system - essentially an agent equipped with retrieval capabilities. This approach transforms RAG from a rigid pipeline into an interactive, reasoning-driven process.
31
+
32
+ ### Key Benefits of Agentic RAG
33
+
34
+ An agent with retrieval tools can:
35
+
36
+ 1. ✅ **Formulate optimized queries**: The agent can transform user questions into retrieval-friendly queries
37
+ 2. ✅ **Perform multiple retrievals**: The agent can retrieve information iteratively as needed
38
+ 3. ✅ **Reason over retrieved content**: The agent can analyze, synthesize, and draw conclusions from multiple sources
39
+ 4. ✅ **Self-critique and refine**: The agent can evaluate retrieval results and adjust its approach
40
+
41
+ This approach naturally implements advanced RAG techniques:
42
+ - **Hypothetical Document Embedding (HyDE)**: Instead of searching with the raw user question, the agent writes a hypothetical, answer-like passage and uses it as the retrieval query ([paper reference](https://huggingface.co/papers/2212.10496))
43
+ - **Self-Query Refinement**: The agent can analyze initial results and perform follow-up retrievals with refined queries ([technique reference](https://docs.llamaindex.ai/en/stable/examples/evaluation/RetryQuery/))
44
+
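The retrieve, inspect, and refine loop described in the benefits above can be sketched in plain Python. This is a toy illustration under stated assumptions, not smolagents code: `toy_retrieve`, `refine_query`, `agentic_retrieve`, and the three-sentence corpus are hypothetical stand-ins for a real retriever and for the LLM's reasoning.

```python
# Toy sketch of an agentic retrieval loop (every name here is hypothetical).
CORPUS = [
    "The backward pass computes gradients and is usually slower than the forward pass.",
    "Tokenizers convert text into input ids.",
    "Pipelines wrap a model and its preprocessing.",
]


def toy_retrieve(query: str) -> list[str]:
    """Return documents sharing at least one lowercase word with the query."""
    words = set(query.lower().split())
    return [doc for doc in CORPUS if words & set(doc.lower().rstrip(".").split())]


def refine_query(query: str) -> str:
    """Stand-in for the LLM: rewrite a question as an affirmative statement."""
    rewrites = {"Why does training take so long?": "backward pass slower than forward pass"}
    return rewrites.get(query, query)


def agentic_retrieve(question: str, max_steps: int = 3) -> list[str]:
    """Retrieve, check the results, and retry with a refined query if needed."""
    query = question
    for _ in range(max_steps):
        results = toy_retrieve(query)
        if results:  # A real agent would judge relevance, not just emptiness
            return results
        query = refine_query(query)
    return []


print(agentic_retrieve("Why does training take so long?"))
```

The first attempt finds nothing because the question shares no words with the corpus; the refined, statement-like query does. A single-step pipeline would have stopped at the empty result.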
45
+ ## Building an Agentic RAG System
46
+
47
+ Let's build a complete Agentic RAG system step by step. We'll create an agent that can answer questions about the Hugging Face Transformers library by retrieving information from its documentation.
48
+
49
+ You can follow along with the code snippets below, or check out the full example in the smolagents GitHub repository: [examples/rag.py](https://github.com/huggingface/smolagents/blob/main/examples/rag.py).
50
+
51
+ ### Step 1: Install Required Dependencies
52
+
53
+ First, we need to install the necessary packages:
54
+
55
+ ```bash
56
+ pip install smolagents pandas langchain langchain-community sentence-transformers datasets python-dotenv rank_bm25 --upgrade
57
+ ```
58
+
59
+ If you plan to use Hugging Face's Inference API, you'll need to set up your API token:
60
+
61
+ ```python
62
+ # Load environment variables (including HF_TOKEN)
63
+ from dotenv import load_dotenv
64
+ load_dotenv()
65
+ ```
66
+
67
+ ### Step 2: Prepare the Knowledge Base
68
+
69
+ We'll use a dataset containing Hugging Face documentation and prepare it for retrieval:
70
+
71
+ ```python
72
+ import datasets
73
+ from langchain.docstore.document import Document
74
+ from langchain.text_splitter import RecursiveCharacterTextSplitter
75
+ from langchain_community.retrievers import BM25Retriever
76
+
77
+ # Load the Hugging Face documentation dataset
78
+ knowledge_base = datasets.load_dataset("m-ric/huggingface_doc", split="train")
79
+
80
+ # Filter to include only Transformers documentation
81
+ knowledge_base = knowledge_base.filter(lambda row: row["source"].startswith("huggingface/transformers"))
82
+
83
+ # Convert dataset entries to Document objects with metadata
84
+ source_docs = [
85
+     Document(page_content=doc["text"], metadata={"source": doc["source"].split("/")[1]})
86
+     for doc in knowledge_base
87
+ ]
88
+
89
+ # Split documents into smaller chunks for better retrieval
90
+ text_splitter = RecursiveCharacterTextSplitter(
91
+     chunk_size=500,  # Characters per chunk
92
+     chunk_overlap=50,  # Overlap between chunks to maintain context
93
+     add_start_index=True,
94
+     strip_whitespace=True,
95
+     separators=["\n\n", "\n", ".", " ", ""],  # Priority order for splitting
96
+ )
97
+ docs_processed = text_splitter.split_documents(source_docs)
98
+
99
+ print(f"Knowledge base prepared with {len(docs_processed)} document chunks")
100
+ ```
101
+
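To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width splitter. It is only a sketch: `split_with_overlap` is a hypothetical helper, and it deliberately ignores the separator-priority logic that `RecursiveCharacterTextSplitter` applies, so it does not reproduce that class's output.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop early enough that the last window is not just a repeat of the overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 50-character overlap plays in the real configuration: a sentence cut at a chunk boundary still appears with some of its context in the next chunk.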
102
+ ### Step 3: Create a Retriever Tool
103
+
104
+ Now we'll create a custom tool that our agent can use to retrieve information from the knowledge base:
105
+
106
+ ```python
107
+ from smolagents import Tool
108
+
109
+ class RetrieverTool(Tool):
110
+     name = "retriever"
111
+     description = "Uses BM25 lexical search to retrieve the parts of transformers documentation that could be most relevant to answer your query."
112
+     inputs = {
113
+         "query": {
114
+             "type": "string",
115
+             "description": "The query to perform. This should be semantically close to your target documents. Use the affirmative form rather than a question.",
116
+         }
117
+     }
118
+     output_type = "string"
119
+
120
+     def __init__(self, docs, **kwargs):
121
+         super().__init__(**kwargs)
122
+         # Initialize the retriever with our processed documents
123
+         self.retriever = BM25Retriever.from_documents(
124
+             docs, k=10  # Return top 10 most relevant documents
125
+         )
126
+
127
+     def forward(self, query: str) -> str:
128
+         """Execute the retrieval based on the provided query."""
129
+         assert isinstance(query, str), "Your search query must be a string"
130
+
131
+         # Retrieve relevant documents
132
+         docs = self.retriever.invoke(query)
133
+
134
+         # Format the retrieved documents for readability
135
+         return "\nRetrieved documents:\n" + "".join(
136
+             [
137
+                 f"\n\n===== Document {str(i)} =====\n" + doc.page_content
138
+                 for i, doc in enumerate(docs)
139
+             ]
140
+         )
141
+
142
+ # Initialize our retriever tool with the processed documents
143
+ retriever_tool = RetrieverTool(docs_processed)
144
+ ```
145
+
146
+ > [!TIP]
147
+ > We're using BM25, a lexical retrieval method, for simplicity and speed. For production systems, you might want to use semantic search with embeddings for better retrieval quality. Check the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard) for high-quality embedding models.
148
+
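For readers curious what "lexical retrieval" means in practice, the following is a minimal sketch of Okapi BM25 scoring in pure Python. It assumes whitespace tokenization and the commonly used defaults `k1=1.5` and `b=0.75`; `bm25_scores` is a hypothetical helper and is not the implementation behind `BM25Retriever`.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_counts[term]
            if tf == 0:
                continue  # Terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


docs = [
    "the backward pass computes gradients",
    "tokenizers convert text into ids",
    "the forward pass computes activations",
]
scores = bm25_scores("backward pass gradients", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(docs[best])
```

Because the score depends only on shared tokens, a document that answers the question with different wording scores zero. That is the gap embedding-based semantic search is meant to close.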
149
+ ### Step 4: Create an Advanced Retrieval Agent
150
+
151
+ Now we'll create an agent that can use our retriever tool to answer questions:
152
+
153
+ ```python
154
+ from smolagents import InferenceClientModel, CodeAgent
155
+
156
+ # Initialize the agent with our retriever tool
157
+ agent = CodeAgent(
158
+     tools=[retriever_tool],  # List of tools available to the agent
159
+     model=InferenceClientModel(),  # Default model "Qwen/Qwen2.5-Coder-32B-Instruct"
160
+     max_steps=4,  # Limit the number of reasoning steps
161
+     verbosity_level=2,  # Show detailed agent reasoning
162
+ )
163
+
164
+ # To use a specific model, you can specify it like this:
165
+ # model=InferenceClientModel(model_id="meta-llama/Llama-3.3-70B-Instruct")
166
+ ```
167
+
168
+ > [!TIP]
169
+ > Inference Providers give access to hundreds of models, powered by serverless inference partners. A list of supported providers can be found [here](https://huggingface.co/docs/inference-providers/index).
170
+
171
+ ### Step 5: Run the Agent to Answer Questions
172
+
173
+ Let's use our agent to answer a question about Transformers:
174
+
175
+ ```python
176
+ # Ask a question that requires retrieving information
177
+ question = "For a transformers model training, which is slower, the forward or the backward pass?"
178
+
179
+ # Run the agent to get an answer
180
+ agent_output = agent.run(question)
181
+
182
+ # Display the final answer
183
+ print("\nFinal answer:")
184
+ print(agent_output)
185
+ ```
186
+
187
+ ## Practical Applications of Agentic RAG
188
+
189
+ Agentic RAG systems can be applied to various use cases:
190
+
191
+ 1. **Technical Documentation Assistance**: Help users navigate complex technical documentation
192
+ 2. **Research Paper Analysis**: Extract and synthesize information from scientific papers
193
+ 3. **Legal Document Review**: Find relevant precedents and clauses in legal documents
194
+ 4. **Customer Support**: Answer questions based on product documentation and knowledge bases
195
+ 5. **Educational Tutoring**: Provide explanations based on textbooks and learning materials
196
+
197
+ ## Conclusion
198
+
199
+ Agentic RAG represents a significant advancement over traditional RAG pipelines. By combining the reasoning capabilities of LLM agents with the factual grounding of retrieval systems, we can build more powerful, flexible, and accurate information systems.
200
+
201
+ The approach we've demonstrated:
202
+ - Overcomes the limitations of single-step retrieval
203
+ - Enables more natural interactions with knowledge bases
204
+ - Provides a framework for continuous improvement through self-critique and query refinement
205
+
206
+ As you build your own Agentic RAG systems, consider experimenting with different retrieval methods, agent architectures, and knowledge sources to find the optimal configuration for your specific use case.
docs/source/en/examples/text_to_sql.md ADDED
@@ -0,0 +1,197 @@
1
+ # Text-to-SQL
2
+
3
+ [[open-in-colab]]
4
+
5
+ In this tutorial, we’ll see how to implement an agent that leverages SQL using `smolagents`.
6
+
7
+ > Let's start with the golden question: why not keep it simple and use a standard text-to-SQL pipeline?
8
+
9
+ A standard text-to-SQL pipeline is brittle because the generated SQL query can be incorrect. Even worse, an incorrect query may run without raising an error and silently return wrong or useless results.
10
+
11
+ 👉 Instead, an agent system is able to critically inspect outputs and decide if the query needs to be changed or not, thus giving it a huge performance boost.
12
+
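The "incorrect but silent" failure mode is easy to reproduce. The sketch below uses Python's built-in `sqlite3` module instead of SQLAlchemy to stay self-contained, with three of the receipt rows from later in this tutorial; the two queries are illustrative assumptions about what a model might generate.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE receipts (receipt_id INTEGER, customer_name TEXT, price REAL)")
con.executemany(
    "INSERT INTO receipts VALUES (?, ?, ?)",
    [(1, "Alan Payne", 12.06), (2, "Alex Mason", 23.86), (3, "Woodrow Wilson", 53.43)],
)

# Intended question: who has the most expensive receipt?
# This query sorts in the wrong direction, yet it runs without any error.
wrong = con.execute("SELECT customer_name FROM receipts ORDER BY price ASC LIMIT 1").fetchone()
right = con.execute("SELECT customer_name FROM receipts ORDER BY price DESC LIMIT 1").fetchone()

print(wrong[0])  # Alan Payne (plausible-looking, but wrong)
print(right[0])  # Woodrow Wilson
```

Nothing in the database layer flags the first result as wrong. Only something that compares the output against the question, such as an agent inspecting its own observations, can catch it.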
13
+ Let’s build this agent! 💪
14
+
15
+ Run the line below to install required dependencies:
16
+ ```bash
17
+ pip install smolagents python-dotenv sqlalchemy --upgrade -q
18
+ ```
19
+ To call Inference Providers, you will need a valid token as your environment variable `HF_TOKEN`.
20
+ We use python-dotenv to load it.
21
+ ```py
22
+ from dotenv import load_dotenv
23
+ load_dotenv()
24
+ ```
25
+
26
+ Then, we set up the SQL environment:
27
+ ```py
28
+ from sqlalchemy import (
29
+     create_engine,
30
+     MetaData,
31
+     Table,
32
+     Column,
33
+     String,
34
+     Integer,
35
+     Float,
36
+     insert,
37
+     inspect,
38
+     text,
39
+ )
40
+
41
+ engine = create_engine("sqlite:///:memory:")
42
+ metadata_obj = MetaData()
43
+
44
+ def insert_rows_into_table(rows, table, engine=engine):
45
+     for row in rows:
46
+         stmt = insert(table).values(**row)
47
+         with engine.begin() as connection:
48
+             connection.execute(stmt)
49
+
50
+ table_name = "receipts"
51
+ receipts = Table(
52
+     table_name,
53
+     metadata_obj,
54
+     Column("receipt_id", Integer, primary_key=True),
55
+     Column("customer_name", String(16), primary_key=True),
56
+     Column("price", Float),
57
+     Column("tip", Float),
58
+ )
59
+ metadata_obj.create_all(engine)
60
+
61
+ rows = [
62
+     {"receipt_id": 1, "customer_name": "Alan Payne", "price": 12.06, "tip": 1.20},
63
+     {"receipt_id": 2, "customer_name": "Alex Mason", "price": 23.86, "tip": 0.24},
64
+     {"receipt_id": 3, "customer_name": "Woodrow Wilson", "price": 53.43, "tip": 5.43},
65
+     {"receipt_id": 4, "customer_name": "Margaret James", "price": 21.11, "tip": 1.00},
66
+ ]
67
+ insert_rows_into_table(rows, receipts)
68
+ ```
69
+
70
+ ### Build our agent
71
+
72
+ Now let’s make our SQL table retrievable by a tool.
73
+
74
+ The tool’s description attribute will be embedded in the LLM’s prompt by the agent system: it gives the LLM information about how to use the tool. This is where we want to describe the SQL table.
75
+
76
+ ```py
77
+ inspector = inspect(engine)
78
+ columns_info = [(col["name"], col["type"]) for col in inspector.get_columns("receipts")]
79
+
80
+ table_description = "Columns:\n" + "\n".join([f" - {name}: {col_type}" for name, col_type in columns_info])
81
+ print(table_description)
82
+ ```
83
+
84
+ ```text
85
+ Columns:
86
+ - receipt_id: INTEGER
87
+ - customer_name: VARCHAR(16)
88
+ - price: FLOAT
89
+ - tip: FLOAT
90
+ ```
91
+
92
+ Now let’s build our tool. It needs the following (read [the tool doc](../tutorials/tools) for more detail):
93
+ - A docstring with an `Args:` part listing arguments.
94
+ - Type hints on both inputs and output.
95
+
96
+ ```py
97
+ from smolagents import tool
98
+
99
+ @tool
100
+ def sql_engine(query: str) -> str:
101
+     """
102
+     Allows you to perform SQL queries on the table. Returns a string representation of the result.
103
+     The table is named 'receipts'. Its description is as follows:
104
+         Columns:
105
+         - receipt_id: INTEGER
106
+         - customer_name: VARCHAR(16)
107
+         - price: FLOAT
108
+         - tip: FLOAT
109
+
110
+     Args:
111
+         query: The query to perform. This should be correct SQL.
112
+     """
113
+     output = ""
114
+     with engine.connect() as con:
115
+         rows = con.execute(text(query))
116
+         for row in rows:
117
+             output += "\n" + str(row)
118
+     return output
119
+ ```
120
+
121
+ Now let us create an agent that leverages this tool.
122
+
123
+ We use the `CodeAgent`, which is smolagents’ main agent class: an agent that writes actions in code and can iterate on previous output according to the ReAct framework.
124
+
125
+ The model is the LLM that powers the agent system. `InferenceClientModel` allows you to call LLMs using HF’s Inference API, through either a serverless or a dedicated endpoint, but you could also use any proprietary API.
126
+
127
+ ```py
128
+ from smolagents import CodeAgent, InferenceClientModel
129
+
130
+ agent = CodeAgent(
131
+     tools=[sql_engine],
132
+     model=InferenceClientModel(model_id="meta-llama/Llama-3.1-8B-Instruct"),
133
+ )
134
+ agent.run("Can you give me the name of the client who got the most expensive receipt?")
135
+ ```
136
+
137
+ ### Level 2: Table joins
138
+
139
+ Now let’s make it more challenging! We want our agent to handle joins across multiple tables.
140
+
141
+ So let’s make a second table recording the names of waiters for each receipt_id!
142
+
143
+ ```py
144
+ table_name = "waiters"
145
+ waiters = Table(
146
+     table_name,
147
+     metadata_obj,
148
+     Column("receipt_id", Integer, primary_key=True),
149
+     Column("waiter_name", String(16), primary_key=True),
150
+ )
151
+ metadata_obj.create_all(engine)
152
+
153
+ rows = [
154
+     {"receipt_id": 1, "waiter_name": "Corey Johnson"},
155
+     {"receipt_id": 2, "waiter_name": "Michael Watts"},
156
+     {"receipt_id": 3, "waiter_name": "Michael Watts"},
157
+     {"receipt_id": 4, "waiter_name": "Margaret James"},
158
+ ]
159
+ insert_rows_into_table(rows, waiters)
160
+ ```
161
+ Since we added a table, we update the `sql_engine` tool with this table’s description so that the LLM can properly leverage information from it.
162
+
163
+ ```py
164
+ updated_description = """Allows you to perform SQL queries on the table. Beware that this tool's output is a string representation of the execution output.
165
+ It can use the following tables:"""
166
+
167
+ inspector = inspect(engine)
168
+ for table in ["receipts", "waiters"]:
169
+     columns_info = [(col["name"], col["type"]) for col in inspector.get_columns(table)]
170
+
171
+     table_description = f"Table '{table}':\n"
172
+
173
+     table_description += "Columns:\n" + "\n".join([f" - {name}: {col_type}" for name, col_type in columns_info])
174
+     updated_description += "\n\n" + table_description
175
+
176
+ print(updated_description)
177
+ ```
178
+ Since this request is a bit harder than the previous one, we’ll switch the LLM engine to use the more powerful [Qwen/Qwen2.5-Coder-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct)!
179
+
180
+ ```py
181
+ sql_engine.description = updated_description
182
+
183
+ agent = CodeAgent(
184
+     tools=[sql_engine],
185
+     model=InferenceClientModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct"),
186
+ )
187
+
188
+ agent.run("Which waiter got more total money from tips?")
189
+ ```
190
+ It works right away! The setup was surprisingly simple, wasn’t it?
191
+
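For reference, the kind of join the agent is expected to arrive at can be checked by hand. The sketch below rebuilds both tables with Python's built-in `sqlite3` module (an assumption made to keep it self-contained; the tutorial itself uses SQLAlchemy) and runs one plausible query. The agent's actual SQL may differ.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE receipts (receipt_id INTEGER PRIMARY KEY, customer_name TEXT, price REAL, tip REAL);
    CREATE TABLE waiters (receipt_id INTEGER PRIMARY KEY, waiter_name TEXT);
    INSERT INTO receipts VALUES
        (1, 'Alan Payne', 12.06, 1.20),
        (2, 'Alex Mason', 23.86, 0.24),
        (3, 'Woodrow Wilson', 53.43, 5.43),
        (4, 'Margaret James', 21.11, 1.00);
    INSERT INTO waiters VALUES
        (1, 'Corey Johnson'),
        (2, 'Michael Watts'),
        (3, 'Michael Watts'),
        (4, 'Margaret James');
    """
)

# Join each receipt to its waiter, then sum the tips per waiter
query = """
SELECT w.waiter_name, SUM(r.tip) AS total_tips
FROM receipts AS r
JOIN waiters AS w ON r.receipt_id = w.receipt_id
GROUP BY w.waiter_name
ORDER BY total_tips DESC
"""
totals = con.execute(query).fetchall()
top_waiter, top_tips = totals[0]
print(top_waiter, round(top_tips, 2))  # Michael Watts 5.67
```

Having a hand-checked answer like this is a convenient way to verify the agent's final output when you experiment with different models.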
192
+ This example is done! We've touched upon these concepts:
193
+ - Building new tools.
194
+ - Updating a tool's description.
195
+ - Switching to a stronger LLM to help agent reasoning.
196
+
197
+ ✅ Now you can go build this text-to-SQL system you’ve always dreamt of! ✨
docs/source/en/examples/using_different_models.md ADDED
@@ -0,0 +1,76 @@
1
+ # Using different models
2
+
3
+ [[open-in-colab]]
4
+
5
+ `smolagents` provides a flexible framework that allows you to use various language models from different providers.
6
+ This guide will show you how to use different model types with your agents.
7
+
8
+ ## Available model types
9
+
10
+ `smolagents` supports several model types out of the box:
11
+ 1. [`InferenceClientModel`]: Uses Hugging Face's Inference API to access models
12
+ 2. [`TransformersModel`]: Runs models locally using the Transformers library
13
+ 3. [`VLLMModel`]: Uses vLLM for fast inference with optimized serving
14
+ 4. [`MLXModel`]: Optimized for Apple Silicon devices using MLX
15
+ 5. [`LiteLLMModel`]: Provides access to hundreds of LLMs through LiteLLM
16
+ 6. [`LiteLLMRouterModel`]: Distributes requests among multiple models
17
+ 7. [`OpenAIServerModel`]: Provides access to any provider that implements an OpenAI-compatible API
18
+ 8. [`AzureOpenAIServerModel`]: Uses Azure's OpenAI service
19
+ 9. [`AmazonBedrockServerModel`]: Connects to AWS Bedrock's API
20
+
21
+ ## Using Google Gemini Models
22
+
23
+ As explained in the Google Gemini API documentation (https://ai.google.dev/gemini-api/docs/openai),
24
+ Google provides an OpenAI-compatible API for Gemini models, allowing you to use the [`OpenAIServerModel`]
25
+ with Gemini models by setting the appropriate base URL.
26
+
27
+ First, install the required dependencies:
28
+ ```bash
29
+ pip install "smolagents[openai]"
30
+ ```
31
+
32
+ Then, [get a Gemini API key](https://ai.google.dev/gemini-api/docs/api-key) and set it in your code:
33
+ ```python
34
+ GEMINI_API_KEY = "<YOUR-GEMINI-API-KEY>"
35
+ ```
36
+
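If you would rather not paste the key into your code, one common alternative is to read it from an environment variable. This is a minimal sketch, assuming you exported the key in your shell beforehand; `read_api_key` is a hypothetical helper, not part of `smolagents`.

```python
import os


def read_api_key(var_name: str) -> str:
    """Return the API key stored in an environment variable, or fail with a clear message."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable before creating the model")
    return key


# Example (assumes GEMINI_API_KEY was exported in your shell):
# GEMINI_API_KEY = read_api_key("GEMINI_API_KEY")
```

Failing early with an explicit message tends to be easier to debug than an authentication error returned later by the API.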
37
+ Now, you can initialize the Gemini model using the `OpenAIServerModel` class
38
+ and setting the `api_base` parameter to the Gemini API base URL:
39
+ ```python
40
+ from smolagents import OpenAIServerModel
41
+
42
+ model = OpenAIServerModel(
43
+     model_id="gemini-2.0-flash",
44
+     # Google Gemini OpenAI-compatible API base URL
45
+     api_base="https://generativelanguage.googleapis.com/v1beta/openai/",
46
+     api_key=GEMINI_API_KEY,
47
+ )
48
+ ```
49
+
50
+ ## Using OpenRouter Models
51
+
52
+ OpenRouter provides access to a wide variety of language models through a unified OpenAI-compatible API.
53
+ You can use the [`OpenAIServerModel`] to connect to OpenRouter by setting the appropriate base URL.
54
+
55
+ First, install the required dependencies:
56
+ ```bash
57
+ pip install "smolagents[openai]"
58
+ ```
59
+
60
+ Then, [get an OpenRouter API key](https://openrouter.ai/keys) and set it in your code:
61
+ ```python
62
+ OPENROUTER_API_KEY = "<YOUR-OPENROUTER-API-KEY>"
63
+ ```
64
+
65
+ Now, you can initialize any model available on OpenRouter using the `OpenAIServerModel` class:
66
+ ```python
67
+ from smolagents import OpenAIServerModel
68
+
69
+ model = OpenAIServerModel(
70
+     # You can use any model ID available on OpenRouter
71
+     model_id="openai/gpt-4o",
72
+     # OpenRouter API base URL
73
+     api_base="https://openrouter.ai/api/v1",
74
+     api_key=OPENROUTER_API_KEY,
75
+ )
76
+ ```
docs/source/en/examples/web_browser.md ADDED
@@ -0,0 +1,213 @@
1
+ # Web Browser Automation with Agents 🤖🌐
2
+
3
+ [[open-in-colab]]
4
+
5
+ In this notebook, we'll create an **agent-powered web browser automation system**! This system can navigate websites, interact with elements, and extract information automatically.
6
+
7
+ The agent will be able to:
8
+
9
+ - [x] Navigate to web pages
10
+ - [x] Click on elements
11
+ - [x] Search within pages
12
+ - [x] Handle popups and modals
13
+ - [x] Extract information
14
+
15
+ Let's set up this system step by step!
16
+
17
+ First, run these lines to install the required dependencies:
18
+
19
+ ```bash
20
+ pip install smolagents selenium helium pillow python-dotenv -q
21
+ ```
22
+
23
+ Let's import our required libraries and set up environment variables:
24
+
25
+ ```python
26
+ from io import BytesIO
27
+ from time import sleep
28
+
29
+ import helium
30
+ from dotenv import load_dotenv
31
+ from PIL import Image
32
+ from selenium import webdriver
33
+ from selenium.webdriver.common.by import By
34
+ from selenium.webdriver.common.keys import Keys
35
+
36
+ from smolagents import CodeAgent, tool
37
+ from smolagents.agents import ActionStep
38
+
39
+ # Load environment variables
40
+ load_dotenv()
41
+ ```
42
+
43
+ Now let's create our core browser interaction tools that will allow our agent to navigate and interact with web pages:
44
+
45
+ ```python
46
+ @tool
47
+ def search_item_ctrl_f(text: str, nth_result: int = 1) -> str:
48
+     """
49
+     Searches for text on the current page via Ctrl + F and jumps to the nth occurrence.
50
+     Args:
51
+         text: The text to search for
52
+         nth_result: Which occurrence to jump to (default: 1)
53
+     """
54
+     elements = driver.find_elements(By.XPATH, f"//*[contains(text(), '{text}')]")
55
+     if nth_result > len(elements):
56
+         raise Exception(f"Match n°{nth_result} not found (only {len(elements)} matches found)")
57
+     result = f"Found {len(elements)} matches for '{text}'."
58
+     elem = elements[nth_result - 1]
59
+     driver.execute_script("arguments[0].scrollIntoView(true);", elem)
60
+     result += f" Focused on element {nth_result} of {len(elements)}"
61
+     return result
62
+
63
+ @tool
64
+ def go_back() -> None:
65
+     """Goes back to previous page."""
66
+     driver.back()
67
+
68
+ @tool
69
+ def close_popups() -> None:
70
+     """
71
+     Closes any visible modal or pop-up on the page. Use this to dismiss pop-up windows!
72
+     This does not work on cookie consent banners.
73
+     """
74
+     webdriver.ActionChains(driver).send_keys(Keys.ESCAPE).perform()
75
+ ```
76
+
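One detail worth knowing about `search_item_ctrl_f`: it wraps the search text in single quotes inside the XPath expression, so a search string containing an apostrophe (for example `Chicago's`) yields an invalid expression. A possible hardening is sketched below; `xpath_literal` and `contains_text_xpath` are hypothetical helpers, not part of Selenium or `smolagents`.

```python
def xpath_literal(value: str) -> str:
    """Quote a Python string as an XPath 1.0 string literal."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Both quote types present: join the single-quote-free pieces with concat()
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join(f"'{part}'" for part in parts) + ")"


def contains_text_xpath(text: str) -> str:
    """Build the same kind of XPath used by search_item_ctrl_f, with safe quoting."""
    return f"//*[contains(text(), {xpath_literal(text)})]"


print(contains_text_xpath("1992"))       # //*[contains(text(), '1992')]
print(contains_text_xpath("Chicago's"))  # //*[contains(text(), "Chicago's")]
```

XPath 1.0 has no escape character inside string literals, which is why the mixed-quote case falls back to `concat()`.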
77
+ Let's set up our browser with Chrome and configure screenshot capabilities:
78
+
79
+ ```python
80
+ # Configure Chrome options
81
+ chrome_options = webdriver.ChromeOptions()
82
+ chrome_options.add_argument("--force-device-scale-factor=1")
83
+ chrome_options.add_argument("--window-size=1000,1350")
84
+ chrome_options.add_argument("--disable-pdf-viewer")
85
+ chrome_options.add_argument("--window-position=0,0")
86
+
87
+ # Initialize the browser
88
+ driver = helium.start_chrome(headless=False, options=chrome_options)
89
+
90
+ # Set up screenshot callback
91
+ def save_screenshot(memory_step: ActionStep, agent: CodeAgent) -> None:
92
+     sleep(1.0)  # Let JavaScript animations happen before taking the screenshot
93
+     driver = helium.get_driver()
94
+     current_step = memory_step.step_number
95
+     if driver is not None:
96
+         for previous_memory_step in agent.memory.steps:  # Remove previous screenshots for lean processing
97
+             if isinstance(previous_memory_step, ActionStep) and previous_memory_step.step_number <= current_step - 2:
98
+                 previous_memory_step.observations_images = None
99
+         png_bytes = driver.get_screenshot_as_png()
100
+         image = Image.open(BytesIO(png_bytes))
101
+         print(f"Captured a browser screenshot: {image.size} pixels")
102
+         memory_step.observations_images = [image.copy()]  # Create a copy to ensure it persists
103
+
104
+     # Update observations with current URL
105
+     url_info = f"Current url: {driver.current_url}"
106
+     memory_step.observations = (
107
+         url_info if memory_step.observations is None else memory_step.observations + "\n" + url_info
108
+     )
109
+ ```
110
+
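The screenshot-pruning rule in the callback (keep images only for the two most recent steps) can be checked in isolation. The sketch below uses a hypothetical `FakeStep` dataclass in place of `ActionStep`, so it runs without a browser or `smolagents` installed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeStep:
    """Hypothetical stand-in for an ActionStep: a step number plus optional screenshots."""
    step_number: int
    observations_images: Optional[list] = None


def prune_old_screenshots(steps: list[FakeStep], current_step: int) -> None:
    """Drop screenshots from every step at least two steps older than the current one."""
    for step in steps:
        if step.step_number <= current_step - 2:
            step.observations_images = None


steps = [FakeStep(n, [f"screenshot-{n}"]) for n in range(1, 5)]
prune_old_screenshots(steps, current_step=4)
kept = [s.step_number for s in steps if s.observations_images is not None]
print(kept)  # [3, 4]
```

Dropping older images keeps the prompt sent to the vision model small, at the cost of the agent no longer seeing what earlier pages looked like.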
111
+ Now let's create our web automation agent:
112
+
113
+ ```python
114
+ from smolagents import InferenceClientModel
115
+
116
+ # Initialize the model
117
+ model_id = "Qwen/Qwen2-VL-72B-Instruct"  # You can change this to your preferred VLM
118
+ model = InferenceClientModel(model_id=model_id)
119
+
120
+ # Create the agent
121
+ agent = CodeAgent(
122
+     tools=[go_back, close_popups, search_item_ctrl_f],
123
+     model=model,
124
+     additional_authorized_imports=["helium"],
125
+     step_callbacks=[save_screenshot],
126
+     max_steps=20,
127
+     verbosity_level=2,
128
+ )
129
+
130
+ # Import helium for the agent
131
+ agent.python_executor("from helium import *", agent.state)
132
+ ```
133
+
134
+ The agent needs instructions on how to use Helium for web automation. Here are the instructions we'll provide:
135
+
136
+ ```python
137
+ helium_instructions = """
138
+ You can use helium to access websites. Don't worry about the helium driver, it's already managed.
139
+ We've already run "from helium import *"
140
+ Then you can go to pages!
141
+ Code:
142
+ ```py
143
+ go_to('github.com/trending')
144
+ ```<end_code>
145
+
146
+ You can directly click clickable elements by inputting the text that appears on them.
147
+ Code:
148
+ ```py
149
+ click("Top products")
150
+ ```<end_code>
151
+
152
+ If it's a link:
153
+ Code:
154
+ ```py
155
+ click(Link("Top products"))
156
+ ```<end_code>
157
+
158
+ If you try to interact with an element and it's not found, you'll get a LookupError.
159
+ In general, stop your action after each button click to see what happens on your screenshot.
160
+ Never try to log in to a page.
161
+
162
+ To scroll up or down, use scroll_down or scroll_up with the number of pixels to scroll as an argument.
163
+ Code:
164
+ ```py
165
+ scroll_down(num_pixels=1200) # This will scroll one viewport down
166
+ ```<end_code>
167
+
168
+ When you have pop-ups with a cross icon to close, don't try to click the close icon by finding its element or targeting an 'X' element (this most often fails).
169
+ Just use your built-in tool `close_popups` to close them:
170
+ Code:
171
+ ```py
172
+ close_popups()
173
+ ```<end_code>
174
+
175
+ You can use .exists() to check for the existence of an element. For example:
176
+ Code:
177
+ ```py
178
+ if Text('Accept cookies?').exists():
179
+     click('I accept')
180
+ ```<end_code>
181
+ """
182
+ ```
183
+
184
+ Now we can run our agent with a task! Let's try finding information on Wikipedia:
185
+
186
+ ```python
187
+ search_request = """
188
+ Please navigate to https://en.wikipedia.org/wiki/Chicago and give me a sentence containing the word "1992" that mentions a construction accident.
189
+ """
190
+
191
+ agent_output = agent.run(search_request + helium_instructions)
192
+ print("Final output:")
193
+ print(agent_output)
194
+ ```
195
+
196
+ You can run different tasks by modifying the request. For example, here is one that tells me whether I should work harder:
197
+
198
+ ```python
199
+ github_request = """
200
+ I'm trying to find how hard I have to work to get a repo in github.com/trending.
201
+ Can you navigate to the profile for the top author of the top trending repo, and give me their total number of commits over the last year?
202
+ """
203
+
204
+ agent_output = agent.run(github_request + helium_instructions)
205
+ print("Final output:")
206
+ print(agent_output)
207
+ ```
208
+
209
+ The system is particularly effective for tasks like:
210
+ - Data extraction from websites
211
+ - Web research automation
212
+ - UI testing and verification
213
+ - Content monitoring