victor HF Staff committed
Commit 6ef48c6 · 1 Parent(s): edda836

Update README and app.py for Web Search MCP Server: enhance documentation, improve usage instructions, and implement main content extraction with error handling.

Files changed (2)
  1. README.md +129 -20
  2. app.py +173 -89
README.md CHANGED
@@ -1,39 +1,148 @@
  ---
  title: Websearch
- emoji: 🏢
  colorFrom: red
- colorTo: red
  sdk: gradio
  sdk_version: 5.36.2
  app_file: app.py
  pinned: false
  ---

- # Gradio News‑to‑Context Service

  ## Prerequisites

- `$ pip install gradio httpx trafilatura python-dateutil`

- ## Environment

- `export SERPER_API_KEY="YOUR‑KEY‑HERE"`

- ## How it works: design notes

- | Step | Technique | Why it matters |
- |---|---|---|
- | API search | Serper’s Google‑News JSON | Fast, cost‑effective and immune to Google’s bot‑blocking. |
- | Concurrency | `httpx.AsyncClient` + `asyncio.gather` | Gets 10 articles in < 2 s on typical broadband. |
- | Extraction | Trafilatura | Consistently tops accuracy charts for main‑content extraction and needs no browser or heavy ML models. |
- | Date parsing | `python‑dateutil` | Converts fuzzy strings (“16 hours ago”) into ISO YYYY‑MM‑DD so the LLM sees absolute dates. |
- | LLM‑friendly output | Markdown headings and horizontal rules | Chunk boundaries are explicit; hyperlinks preserved for optional citation. |
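
For illustration, each extracted article becomes one markdown chunk shaped roughly like the following, with chunks separated by horizontal rules (all values below are placeholders):

```markdown
## Example headline about Apple Inc
**Source:** Example News **Date:** 2025-01-15
https://example.com/articles/apple-earnings

First paragraphs of the extracted article body…

---
```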
- ## Extending in production

- * **Caching** – add `aiocache` or Redis to avoid re‑fetching identical URLs within TTL.
- * **Long‑content trimming** – if each article can exceed your LLM’s context window, pipe `body` through a sentence‑ranker or GPT‑based summariser before concatenation.
- * **Paywalls / PDFs** – guard `extract_main_text` with fallback libraries (e.g. `readability‑lxml` or `pymupdf`) for unusual formats.
- * **Rate‑limiting** – Serper’s free tier allows 100 req/day; wrap the call with exponential backoff on HTTP 429 (see the sketch below).
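
A minimal sketch of that last point, using plain `httpx` and `asyncio` (the `post_with_backoff` helper and its retry parameters are illustrative, not something this repo ships):

```python
import asyncio

import httpx


async def post_with_backoff(url: str, headers: dict, payload: dict,
                            retries: int = 4, base_delay: float = 1.0) -> httpx.Response:
    """POST with exponential backoff on HTTP 429 (illustrative helper)."""
    async with httpx.AsyncClient(timeout=15) as client:
        for attempt in range(retries):
            resp = await client.post(url, headers=headers, json=payload)
            if resp.status_code != 429:
                resp.raise_for_status()
                return resp
            # Rate limited: wait 1 s, 2 s, 4 s, ... then retry
            await asyncio.sleep(base_delay * 2**attempt)
        resp.raise_for_status()  # still 429 after the final attempt
        return resp
```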
 
- Drop this file into any Python 3.10+ environment, set `SERPER_API_KEY`, pip install the libraries above, and you have a ready‑to‑embed “query → context” micro‑service for your LLM pipeline.

  ---
  title: Websearch
+ emoji: 🔎
  colorFrom: red
+ colorTo: green
  sdk: gradio
  sdk_version: 5.36.2
  app_file: app.py
  pinned: false
  ---

+ # Web Search MCP Server
+
+ A Model Context Protocol (MCP) server that provides web search capabilities to LLMs, allowing them to fetch and extract content from recent news articles.
+
+ ## Features
+
+ - **Real-time web search**: Search for recent news on any topic
+ - **Content extraction**: Automatically extracts main article content, removing ads and boilerplate
+ - **Rate limiting**: Built-in rate limiting (200 requests/hour) to prevent API abuse
+ - **Structured output**: Returns formatted content with metadata (title, source, date, URL)
+ - **Flexible results**: Control the number of results (1-20)

  ## Prerequisites

+ 1. **Serper API Key**: Sign up at [serper.dev](https://serper.dev) to get your API key
+ 2. **Python 3.8+**: Ensure you have Python installed
+ 3. **MCP-compatible LLM client**: Such as Claude Desktop, Cursor, or any MCP-enabled application
+
+ ## Installation
+
+ 1. Clone or download this repository
+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+ Or install manually:
+ ```bash
+ pip install "gradio[mcp]" httpx trafilatura python-dateutil limits
+ ```
+
+ 3. Set your Serper API key:
+ ```bash
+ export SERPER_API_KEY="your-api-key-here"
+ ```
+
+ ## Usage
+
+ ### Starting the MCP Server
+
+ ```bash
+ python app_mcp.py
+ ```
+
+ The server will start on `http://localhost:7860` with the MCP endpoint at:
+ ```
+ http://localhost:7860/gradio_api/mcp/sse
+ ```
+
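As a quick sanity check that the endpoint is up (a sketch, assuming the server is running locally on the default port):

```python
import httpx

# Open the SSE endpoint and report the status code without consuming the stream
with httpx.stream("GET", "http://localhost:7860/gradio_api/mcp/sse", timeout=10) as resp:
    print(resp.status_code)  # 200 means the MCP endpoint is reachable
```
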
+ ### Connecting to LLM Clients
+
+ #### Claude Desktop
+ Add to your `claude_desktop_config.json`:
+ ```json
+ {
+   "mcpServers": {
+     "web-search": {
+       "command": "python",
+       "args": ["/path/to/app_mcp.py"],
+       "env": {
+         "SERPER_API_KEY": "your-api-key-here"
+       }
+     }
+   }
+ }
+ ```
+
+ #### Direct URL Connection
+ For clients that support URL-based MCP servers:
+ 1. Start the server: `python app_mcp.py`
+ 2. Connect to: `http://localhost:7860/gradio_api/mcp/sse`
+
+ ## Tool Documentation
+
+ ### `search_web` Function
+
+ **Purpose**: Search the web for recent news and extract article content.
+
+ **Parameters**:
+ - `query` (str, **REQUIRED**): The search query
+   - Examples: "OpenAI news", "climate change 2024", "python updates"
+
+ - `num_results` (int, **OPTIONAL**): Number of results to fetch
+   - Default: 4
+   - Range: 1-20
+   - More results provide more context but take longer
+
+ **Returns**: Formatted text containing:
+ - Summary of extraction results
+ - For each article:
+   - Title
+   - Source and date
+   - URL
+   - Extracted main content
+
+ **Example Usage in LLM**:
+ ```
+ "Search for recent developments in artificial intelligence"
+ "Find 10 articles about climate change in 2024"
+ "Get news about Python programming language updates"
+ ```
+
+ ## Error Handling
+
+ The tool handles various error scenarios:
+ - Missing API key: Clear error message with setup instructions
+ - Rate limiting: Informs when limit is exceeded
+ - Failed extractions: Reports which articles couldn't be extracted
+ - Network errors: Graceful error messages
+
+ ## Testing

+ You can test the server manually:
+ 1. Open `http://localhost:7860` in your browser
+ 2. Enter a search query
+ 3. Adjust the number of results
+ 4. Click "Search" to see the extracted content
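
You can also drive the same tool from Python with `gradio_client`; a rough sketch, assuming the function is exposed under Gradio's default `/search_web` API name:

```python
from gradio_client import Client

# Connect to the locally running app (default port assumed)
client = Client("http://localhost:7860")

# Call the search tool programmatically; api_name is an assumption based on the function name
result = client.predict("OpenAI news", 5, api_name="/search_web")
print(result)
```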

+ ## Tips for LLM Usage
+
+ 1. **Be specific with queries**: More specific queries yield better results
+ 2. **Adjust result count**: Use fewer results for quick searches, more for comprehensive research
+ 3. **Check dates**: The tool shows article dates for temporal context
+ 4. **Follow up**: Use the extracted content to ask follow-up questions
+
+ ## Limitations
+
+ - Rate limited to 200 requests per hour
+ - Only searches news articles (not general web pages)
+ - Extraction quality depends on website structure
+ - Some websites may block automated access
+
+ ## Troubleshooting
+
+ 1. **"SERPER_API_KEY is not set"**: Ensure the environment variable is exported
+ 2. **Rate limit errors**: Wait before making more requests
+ 3. **No content extracted**: Some websites block scrapers; try different queries
+ 4. **Connection errors**: Check your internet connection and firewall settings
app.py CHANGED
@@ -1,123 +1,207 @@
  """
- Web Search - Feed LLMs with fresh sources
- ==========================================

  Prerequisites
  -------------
- $ pip install gradio httpx trafilatura python-dateutil

  Environment
  -----------
- export SERPER_API_KEY="YOUR-KEY-HERE"
  """

- import os, asyncio, httpx, trafilatura, gradio as gr
  from dateutil import parser as dateparser
  from limits import parse
  from limits.aio.storage import MemoryStorage
  from limits.aio.strategies import MovingWindowRateLimiter
- from fastapi import FastAPI, Request, HTTPException
- from fastapi.responses import JSONResponse

  SERPER_API_KEY = os.getenv("SERPER_API_KEY")
  SERPER_ENDPOINT = "https://google.serper.dev/news"
  HEADERS = {"X-API-KEY": SERPER_API_KEY, "Content-Type": "application/json"}

  # Rate limiting
- app = FastAPI()
  storage = MemoryStorage()
  limiter = MovingWindowRateLimiter(storage)
  rate_limit = parse("200/hour")


- @app.exception_handler(HTTPException)
- async def http_exception_handler(request: Request, exc: HTTPException):
-     return JSONResponse(status_code=exc.status_code, content={"message": exc.detail})
-
-
- ### 1 ─ Serper call -------------------------------------------------------------
- @app.post("/serper-news")
- async def get_serper_news(query: str, num: int = 4) -> list[dict]:
-     if not await limiter.hit(rate_limit, "global"):
-         raise HTTPException(status_code=429, detail="Too Many Requests")
-
-     payload = {"q": query, "type": "news", "num": num, "page": 1}
-     async with httpx.AsyncClient(timeout=15) as client:
-         resp = await client.post(SERPER_ENDPOINT, headers=HEADERS, json=payload)
-         resp.raise_for_status()
-         return resp.json()["news"]
-
-
- ### 2 ─ Concurrent HTML downloads ----------------------------------------------
- async def fetch_html_many(urls: list[str]) -> list[dict]:
-     async with httpx.AsyncClient(timeout=20, follow_redirects=True) as client:
-         tasks = [client.get(u) for u in urls]
-         responses = await asyncio.gather(*tasks, return_exceptions=True)
-     html_pages = []
-     for r in responses:
-         if isinstance(r, Exception):
-             html_pages.append("")  # keep positions aligned
-         else:
-             html_pages.append(r.text)
-     return html_pages
-
-
- ### 3 ─ Main‑content extraction -------------------------------------------------
- def extract_main_text(html: str) -> str:
-     if not html:
-         return ""
-     # Trafilatura auto‑detects language, removes boilerplate & returns plain text.
-     return (
-         trafilatura.extract(html, include_formatting=False, include_comments=False)
-         or ""
-     )
 
 
 
 
 
75
 
76
- ### 4 ─ Orchestration -----------------------------------------------------------
77
- async def build_context(query: str, k: int = 4) -> str:
78
- news_items = await get_serper_news(query, num=k)
79
- urls = [n["link"] for n in news_items]
80
- raw_pages = await fetch_html_many(urls)
81
-
82
- chunks = []
83
- for meta, html in zip(news_items, raw_pages):
84
- body = extract_main_text(html)
85
- if not body:
86
- continue # skip if extraction failed
87
- # Normalise Serper’s relative date (“21 hours ago”) to ISO date
88
- try:
89
- date_iso = dateparser.parse(meta.get("date", ""), fuzzy=True).strftime(
90
- "%Y-%m-%d"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
91
  )
92
- except Exception:
93
- date_iso = meta.get("date", "")
94
- chunk = (
95
- f"## {meta['title']}\n"
96
- f"**Source:** {meta['source']} "
97
- f"**Date:** {date_iso}\n"
98
- f"{meta['link']}\n\n"
99
- f"{body.strip()}\n"
100
- )
101
- chunks.append(chunk)
102
 
103
- return "\n---\n".join(chunks) or "No extractable content found."
 
 
 
 
104
 
 
 
 
105
 
106
- ### 5 Gradio user interface ---------------------------------------------------
107
- async def handler(user_query: str, k: int) -> str:
108
- if not SERPER_API_KEY:
109
- return "✖️ SERPER_API_KEY is not set."
110
- return await build_context(user_query, k)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
111
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
112
 
113
- with gr.Blocks(title="WebSearch") as demo:
-     gr.Markdown("# 🔍 Web Search\n" "Feed LLMs with fresh sources.")
-     query = gr.Textbox(label="Query", placeholder='e.g. "apple inc"')
-     top_k = gr.Slider(1, 20, value=4, label="How many results?")
-     out = gr.Textbox(label="Extracted Context", lines=25)
-     run = gr.Button("Fetch")
-     run.click(handler, inputs=[query, top_k], outputs=out)

  if __name__ == "__main__":
-     # Launch in shareable mode when running on Colab/VMs; edit as you wish.
-     demo.launch()
 
 
  """
+ Web Search MCP Server - Feed LLMs with fresh sources
+ ====================================================

  Prerequisites
  -------------
+ $ pip install "gradio[mcp]" httpx trafilatura python-dateutil limits

  Environment
  -----------
+ export SERPER_API_KEY="YOUR-KEY-HERE"
+
+ Usage
+ -----
+ python app_mcp.py
+ Then connect to: http://localhost:7860/gradio_api/mcp/sse
  """

+ import os
+ import asyncio
+ from typing import Optional
+ import httpx
+ import trafilatura
+ import gradio as gr
  from dateutil import parser as dateparser
  from limits import parse
  from limits.aio.storage import MemoryStorage
  from limits.aio.strategies import MovingWindowRateLimiter

+ # Configuration
  SERPER_API_KEY = os.getenv("SERPER_API_KEY")
  SERPER_ENDPOINT = "https://google.serper.dev/news"
  HEADERS = {"X-API-KEY": SERPER_API_KEY, "Content-Type": "application/json"}

  # Rate limiting
  storage = MemoryStorage()
  limiter = MovingWindowRateLimiter(storage)
  rate_limit = parse("200/hour")

+ async def search_web(query: str, num_results: Optional[int] = 4) -> str:
+     """
+     Search the web for recent news and information, returning extracted content.
+
+     This tool searches for recent news articles related to your query and extracts
+     the main content from each article, providing you with fresh, relevant information
+     from the web.
+
+     Args:
+         query (str): The search query. This is REQUIRED. Examples: "apple inc earnings",
+             "climate change 2024", "AI developments"
+         num_results (int): Number of results to fetch. This is OPTIONAL. Default is 4.
+             Range: 1-20. More results = more context but longer response time.
+
+     Returns:
+         str: Formatted text containing extracted article content with metadata (title,
+             source, date, URL, and main text) for each result, separated by dividers.
+             Returns error message if API key is missing or search fails.
+
+     Examples:
+         - search_web("OpenAI news", 5) - Get 5 recent news articles about OpenAI
+         - search_web("python 3.13 features") - Get 4 articles about Python 3.13
+         - search_web("stock market today", 10) - Get 10 articles about today's market
+     """
+     if not SERPER_API_KEY:
+         return "Error: SERPER_API_KEY environment variable is not set. Please set it to use this tool."
+
+     # Validate and constrain num_results
+     if num_results is None:
+         num_results = 4
+     num_results = max(1, min(20, num_results))
+
+     try:
+         # Check rate limit
+         if not await limiter.hit(rate_limit, "global"):
+             return "Error: Rate limit exceeded. Please try again later (limit: 200 requests per hour)."
+
+         # Search for news
+         payload = {"q": query, "type": "news", "num": num_results, "page": 1}
+         async with httpx.AsyncClient(timeout=15) as client:
+             resp = await client.post(SERPER_ENDPOINT, headers=HEADERS, json=payload)
+
+         if resp.status_code != 200:
+             return f"Error: Search API returned status {resp.status_code}. Please check your API key and try again."
+
+         news_items = resp.json().get("news", [])
+         if not news_items:
+             return (
+                 f"No results found for query: '{query}'. Try a different search term."
              )
+
+         # Fetch HTML content concurrently
+         urls = [n["link"] for n in news_items]
+         async with httpx.AsyncClient(timeout=20, follow_redirects=True) as client:
+             tasks = [client.get(u) for u in urls]
+             responses = await asyncio.gather(*tasks, return_exceptions=True)
+
+         # Extract and format content
+         chunks = []
+         successful_extractions = 0
+
+         for meta, response in zip(news_items, responses):
+             if isinstance(response, Exception):
+                 continue
+
+             # Extract main text content
+             body = trafilatura.extract(
+                 response.text, include_formatting=False, include_comments=False
+             )
+
+             if not body:
+                 continue
+
+             successful_extractions += 1
+
+             # Parse and format date
+             try:
+                 date_iso = dateparser.parse(meta.get("date", ""), fuzzy=True).strftime(
+                     "%Y-%m-%d"
+                 )
+             except Exception:
+                 date_iso = meta.get("date", "Unknown")
+
+             # Format the chunk
+             chunk = (
+                 f"## {meta['title']}\n"
+                 f"**Source:** {meta['source']} "
+                 f"**Date:** {date_iso}\n"
+                 f"**URL:** {meta['link']}\n\n"
+                 f"{body.strip()}\n"
+             )
+             chunks.append(chunk)
+
+         if not chunks:
+             return f"Found {len(news_items)} results for '{query}', but couldn't extract readable content from any of them. The websites might be blocking automated access."
+
+         result = "\n---\n".join(chunks)
+         summary = f"Successfully extracted content from {successful_extractions} out of {len(news_items)} search results for query: '{query}'\n\n---\n\n"

+         return summary + result
+
+     except Exception as e:
+         return f"Error occurred while searching: {str(e)}. Please try again or check your query."
+
+
+ # Create Gradio interface
+ with gr.Blocks(title="Web Search MCP Server") as demo:
+     gr.Markdown(
+         """
+         # 🔍 Web Search MCP Server
+
+         This MCP server provides web search capabilities to LLMs. It searches for recent news
+         and extracts the main content from articles.
+
+         **Note:** This interface is primarily designed for MCP tool usage by LLMs, but you can
+         also test it manually below.
+         """
+     )
+
+     with gr.Row():
+         query_input = gr.Textbox(
+             label="Search Query",
+             placeholder='e.g. "OpenAI news", "climate change 2024", "AI developments"',
+             info="Required: Enter your search query",
+         )
+         num_results_input = gr.Slider(
+             minimum=1,
+             maximum=20,
+             value=4,
+             step=1,
+             label="Number of Results",
+             info="Optional: How many articles to fetch (default: 4)",
+         )
+
+     output = gr.Textbox(
+         label="Extracted Content",
+         lines=25,
+         max_lines=50,
+         info="The extracted article content will appear here",
+     )
+
+     search_button = gr.Button("Search", variant="primary")
+
+     # Add examples
+     gr.Examples(
+         examples=[
+             ["OpenAI GPT-5 news", 5],
+             ["climate change 2024", 4],
+             ["artificial intelligence breakthroughs", 8],
+             ["stock market today", 6],
+             ["python programming updates", 4],
+         ],
+         inputs=[query_input, num_results_input],
+         outputs=output,
+         fn=search_web,
+         cache_examples=False,
+     )
+
+     search_button.click(
+         fn=search_web, inputs=[query_input, num_results_input], outputs=output
+     )

  if __name__ == "__main__":
+     # Launch with MCP server enabled
+     # The MCP endpoint will be available at: http://localhost:7860/gradio_api/mcp/sse
+     demo.launch(mcp_server=True, show_api=True)