Post
207
๐ต๏ธ๐ Building Browser Agents - notebook
No API? No problem.
Browser Agents can use websites like you do: click, type, wait, read.
๐ Step-by-step notebook: https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/browser_agents.ipynb
๐ฅ In the video, the Agent:
- Goes to Hugging Face Spaces
- Finds black-forest-labs/FLUX.1-schnell
- Expands a short prompt ("my holiday on Lake Como") into a detailed image generation prompt
- Waits for the image
- Returns the image URL
## What else can it do?
Great for information gathering and summarization
๐๏ธ๐๏ธ Compare news websites and create a table of shared stories with links
โถ๏ธ Find content creator social profiles from YouTube videos
๐๏ธ Find a product's price range on Amazon
๐ ๐ Gather public transportation travel options
## How is it built?
๐๏ธ Haystack โ Agent execution logic
๐ง Google Gemini 2.5 Flash โ Good and fast LLM with a generous free tier
๐ ๏ธ Playwright MCP server โ Browser automation tools: navigate, click, type, wait...
Even without vision capabilities, this setup can get quite far.
## Next steps
- Try a local open model
- Move from notebook to real deployment
- Incorporate vision
And you? Have you built something similar? What's in your stack?
No API? No problem.
Browser Agents can use websites like you do: click, type, wait, read.
๐ Step-by-step notebook: https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/browser_agents.ipynb
๐ฅ In the video, the Agent:
- Goes to Hugging Face Spaces
- Finds black-forest-labs/FLUX.1-schnell
- Expands a short prompt ("my holiday on Lake Como") into a detailed image generation prompt
- Waits for the image
- Returns the image URL
## What else can it do?
Great for information gathering and summarization
๐๏ธ๐๏ธ Compare news websites and create a table of shared stories with links
โถ๏ธ Find content creator social profiles from YouTube videos
๐๏ธ Find a product's price range on Amazon
๐ ๐ Gather public transportation travel options
## How is it built?
๐๏ธ Haystack โ Agent execution logic
๐ง Google Gemini 2.5 Flash โ Good and fast LLM with a generous free tier
๐ ๏ธ Playwright MCP server โ Browser automation tools: navigate, click, type, wait...
Even without vision capabilities, this setup can get quite far.
## Next steps
- Try a local open model
- Move from notebook to real deployment
- Incorporate vision
And you? Have you built something similar? What's in your stack?