victor HF Staff committed
Commit 6ef48c6 · 1 Parent(s): edda836

Update README and app.py for Web Search MCP Server: enhance documentation, improve usage instructions, and implement main content extraction with error handling.

Files changed (2)
  1. README.md +129 -20
  2. app.py +173 -89
README.md CHANGED
@@ -1,39 +1,148 @@
  ---
  title: Websearch
- emoji: 🏢
  colorFrom: red
- colorTo: red
  sdk: gradio
  sdk_version: 5.36.2
  app_file: app.py
  pinned: false
  ---

- # Gradio News‑to‑Context Service

  ## Prerequisites

- `$ pip install gradio httpx trafilatura python-dateutil`

- ## Environment

- `export SERPER_API_KEY="YOUR‑KEY‑HERE"`

- ## How it works: design notes

- | Step | Technique | Why it matters |
- |---|---|---|
- | API search | Serper’s Google‑News JSON | Fast, cost‑effective and immune to Google’s bot‑blocking. |
- | Concurrency | `httpx.AsyncClient` + `asyncio.gather` | Gets 10 articles in < 2 s on typical broadband. |
- | Extraction | Trafilatura | Consistently tops accuracy charts for main‑content extraction and needs no browser or heavy ML models. |
- | Date parsing | `python‑dateutil` | Converts fuzzy strings (“16 hours ago”) into ISO YYYY‑MM‑DD so the LLM sees absolute dates. |
- | LLM‑friendly output | Markdown headings and horizontal rules | Chunk boundaries are explicit; hyperlinks preserved for optional citation. |
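
For illustration, each extracted article becomes one markdown chunk shaped roughly like the following, with chunks separated by horizontal rules (all values below are placeholders):

```markdown
## Example headline about Apple Inc
**Source:** Example News **Date:** 2025-01-15
https://example.com/articles/apple-earnings

First paragraphs of the extracted article body…

---
```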
- ## Extending in production

- * **Caching** – add `aiocache` or Redis to avoid re‑fetching identical URLs within TTL.
- * **Long‑content trimming** – if each article can exceed your LLM’s context window, pipe `body` through a sentence‑ranker or GPT‑based summariser before concatenation.
- * **Paywalls / PDFs** – guard `extract_main_text` with fallback libraries (e.g. `readability‑lxml` or `pymupdf`) for unusual formats.
- * **Rate‑limiting** – Serper’s free tier allows 100 req/day; wrap the call with exponential backoff on HTTP 429 (see the sketch below).
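
A minimal sketch of that last point, using plain `httpx` and `asyncio` (the `post_with_backoff` helper and its retry parameters are illustrative, not something this repo ships):

```python
import asyncio

import httpx


async def post_with_backoff(url: str, headers: dict, payload: dict,
                            retries: int = 4, base_delay: float = 1.0) -> httpx.Response:
    """POST with exponential backoff on HTTP 429 (illustrative helper)."""
    async with httpx.AsyncClient(timeout=15) as client:
        for attempt in range(retries):
            resp = await client.post(url, headers=headers, json=payload)
            if resp.status_code != 429:
                resp.raise_for_status()
                return resp
            # Rate limited: wait 1 s, 2 s, 4 s, ... then retry
            await asyncio.sleep(base_delay * 2**attempt)
        resp.raise_for_status()  # still 429 after the final attempt
        return resp
```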
 
- Drop this file into any Python 3.10+ environment, set `SERPER_API_KEY`, pip install the libraries above, and you have a ready‑to‑embed “query → context” micro‑service for your LLM pipeline.

  ---
  title: Websearch
+ emoji: 🔎
  colorFrom: red
+ colorTo: green
  sdk: gradio
  sdk_version: 5.36.2
  app_file: app.py
  pinned: false
  ---

+ # Web Search MCP Server
+
+ A Model Context Protocol (MCP) server that provides web search capabilities to LLMs, allowing them to fetch and extract content from recent news articles.
+
+ ## Features
+
+ - **Real-time web search**: Search for recent news on any topic
+ - **Content extraction**: Automatically extracts main article content, removing ads and boilerplate
+ - **Rate limiting**: Built-in rate limiting (200 requests/hour) to prevent API abuse
+ - **Structured output**: Returns formatted content with metadata (title, source, date, URL)
+ - **Flexible results**: Control the number of results (1-20)

  ## Prerequisites

+ 1. **Serper API Key**: Sign up at [serper.dev](https://serper.dev) to get your API key
+ 2. **Python 3.8+**: Ensure you have Python installed
+ 3. **MCP-compatible LLM client**: Such as Claude Desktop, Cursor, or any MCP-enabled application
+
+ ## Installation
+
+ 1. Clone or download this repository
+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+ Or install manually:
+ ```bash
+ pip install "gradio[mcp]" httpx trafilatura python-dateutil limits
+ ```
+
+ 3. Set your Serper API key:
+ ```bash
+ export SERPER_API_KEY="your-api-key-here"
+ ```
+
+ ## Usage
+
+ ### Starting the MCP Server
+
+ ```bash
+ python app_mcp.py
+ ```
+
+ The server will start on `http://localhost:7860` with the MCP endpoint at:
+ ```
+ http://localhost:7860/gradio_api/mcp/sse
+ ```
+
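As a quick sanity check that the endpoint is up (a sketch, assuming the server is running locally on the default port):

```python
import httpx

# Open the SSE endpoint and report the status code without consuming the stream
with httpx.stream("GET", "http://localhost:7860/gradio_api/mcp/sse", timeout=10) as resp:
    print(resp.status_code)  # 200 means the MCP endpoint is reachable
```
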
+ ### Connecting to LLM Clients
+
+ #### Claude Desktop
+ Add to your `claude_desktop_config.json`:
+ ```json
+ {
+   "mcpServers": {
+     "web-search": {
+       "command": "python",
+       "args": ["/path/to/app_mcp.py"],
+       "env": {
+         "SERPER_API_KEY": "your-api-key-here"
+       }
+     }
+   }
+ }
+ ```
+
+ #### Direct URL Connection
+ For clients that support URL-based MCP servers:
+ 1. Start the server: `python app_mcp.py`
+ 2. Connect to: `http://localhost:7860/gradio_api/mcp/sse`
+
+ ## Tool Documentation
+
+ ### `search_web` Function
+
+ **Purpose**: Search the web for recent news and extract article content.
+
+ **Parameters**:
+ - `query` (str, **REQUIRED**): The search query
+   - Examples: "OpenAI news", "climate change 2024", "python updates"
+
+ - `num_results` (int, **OPTIONAL**): Number of results to fetch
+   - Default: 4
+   - Range: 1-20
+   - More results provide more context but take longer
+
+ **Returns**: Formatted text containing:
+ - Summary of extraction results
+ - For each article:
+   - Title
+   - Source and date
+   - URL
+   - Extracted main content
+
+ **Example Usage in LLM**:
+ ```
+ "Search for recent developments in artificial intelligence"
+ "Find 10 articles about climate change in 2024"
+ "Get news about Python programming language updates"
+ ```
+
+ ## Error Handling
+
+ The tool handles various error scenarios:
+ - Missing API key: Clear error message with setup instructions
+ - Rate limiting: Informs when limit is exceeded
+ - Failed extractions: Reports which articles couldn't be extracted
+ - Network errors: Graceful error messages
+
+ ## Testing

+ You can test the server manually:
+ 1. Open `http://localhost:7860` in your browser
+ 2. Enter a search query
+ 3. Adjust the number of results
+ 4. Click "Search" to see the extracted content
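
You can also drive the same tool from Python with `gradio_client`; a rough sketch, assuming the function is exposed under Gradio's default `/search_web` API name:

```python
from gradio_client import Client

# Connect to the locally running app (default port assumed)
client = Client("http://localhost:7860")

# Call the search tool programmatically; api_name is an assumption based on the function name
result = client.predict("OpenAI news", 5, api_name="/search_web")
print(result)
```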

+ ## Tips for LLM Usage
+
+ 1. **Be specific with queries**: More specific queries yield better results
+ 2. **Adjust result count**: Use fewer results for quick searches, more for comprehensive research
+ 3. **Check dates**: The tool shows article dates for temporal context
+ 4. **Follow up**: Use the extracted content to ask follow-up questions
+
+ ## Limitations
+
+ - Rate limited to 200 requests per hour
+ - Only searches news articles (not general web pages)
+ - Extraction quality depends on website structure
+ - Some websites may block automated access
+
+ ## Troubleshooting
+
+ 1. **"SERPER_API_KEY is not set"**: Ensure the environment variable is exported
+ 2. **Rate limit errors**: Wait before making more requests
+ 3. **No content extracted**: Some websites block scrapers; try different queries
+ 4. **Connection errors**: Check your internet connection and firewall settings
app.py CHANGED
@@ -1,123 +1,207 @@
  """
- Web Search - Feed LLMs with fresh sources
- ==========================================

  Prerequisites
  -------------
- $ pip install gradio httpx trafilatura python-dateutil

  Environment
  -----------
- export SERPER_API_KEY="YOUR-KEY-HERE"
  """

- import os, asyncio, httpx, trafilatura, gradio as gr
  from dateutil import parser as dateparser
  from limits import parse
  from limits.aio.storage import MemoryStorage
  from limits.aio.strategies import MovingWindowRateLimiter
- from fastapi import FastAPI, Request, HTTPException
- from fastapi.responses import JSONResponse

  SERPER_API_KEY = os.getenv("SERPER_API_KEY")
  SERPER_ENDPOINT = "https://google.serper.dev/news"
  HEADERS = {"X-API-KEY": SERPER_API_KEY, "Content-Type": "application/json"}

  # Rate limiting
- app = FastAPI()
  storage = MemoryStorage()
  limiter = MovingWindowRateLimiter(storage)
  rate_limit = parse("200/hour")


- @app.exception_handler(HTTPException)
- async def http_exception_handler(request: Request, exc: HTTPException):
-     return JSONResponse(status_code=exc.status_code, content={"message": exc.detail})
-
-
- ### 1 ─ Serper call -------------------------------------------------------------
- @app.post("/serper-news")
- async def get_serper_news(query: str, num: int = 4) -> list[dict]:
-     if not await limiter.hit(rate_limit, "global"):
-         raise HTTPException(status_code=429, detail="Too Many Requests")
-
-     payload = {"q": query, "type": "news", "num": num, "page": 1}
-     async with httpx.AsyncClient(timeout=15) as client:
-         resp = await client.post(SERPER_ENDPOINT, headers=HEADERS, json=payload)
-         resp.raise_for_status()
-         return resp.json()["news"]
-
-
- ### 2 ─ Concurrent HTML downloads ----------------------------------------------
- async def fetch_html_many(urls: list[str]) -> list[dict]:
-     async with httpx.AsyncClient(timeout=20, follow_redirects=True) as client:
-         tasks = [client.get(u) for u in urls]
-         responses = await asyncio.gather(*tasks, return_exceptions=True)
-     html_pages = []
-     for r in responses:
-         if isinstance(r, Exception):
-             html_pages.append("")  # keep positions aligned
-         else:
-             html_pages.append(r.text)
-     return html_pages
-
-
- ### 3 ─ Main‑content extraction -------------------------------------------------
- def extract_main_text(html: str) -> str:
-     if not html:
-         return ""
-     # Trafilatura auto‑detects language, removes boilerplate & returns plain text.
-     return (
-         trafilatura.extract(html, include_formatting=False, include_comments=False)
-         or ""
-     )
 
 
 
 
 
75
 
76
- ### 4 ─ Orchestration -----------------------------------------------------------
77
- async def build_context(query: str, k: int = 4) -> str:
78
- news_items = await get_serper_news(query, num=k)
79
- urls = [n["link"] for n in news_items]
80
- raw_pages = await fetch_html_many(urls)
81
-
82
- chunks = []
83
- for meta, html in zip(news_items, raw_pages):
84
- body = extract_main_text(html)
85
- if not body:
86
- continue # skip if extraction failed
87
- # Normalise Serper’s relative date (“21 hours ago”) to ISO date
88
- try:
89
- date_iso = dateparser.parse(meta.get("date", ""), fuzzy=True).strftime(
90
- "%Y-%m-%d"
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
91
  )
92
- except Exception:
93
- date_iso = meta.get("date", "")
94
- chunk = (
95
- f"## {meta['title']}\n"
96
- f"**Source:** {meta['source']} "
97
- f"**Date:** {date_iso}\n"
98
- f"{meta['link']}\n\n"
99
- f"{body.strip()}\n"
100
- )
101
- chunks.append(chunk)
102
 
103
- return "\n---\n".join(chunks) or "No extractable content found."
 
 
 
 
104
 
 
 
 
105
 
106
- ### 5 Gradio user interface ---------------------------------------------------
107
- async def handler(user_query: str, k: int) -> str:
108
- if not SERPER_API_KEY:
109
- return "✖️ SERPER_API_KEY is not set."
110
- return await build_context(user_query, k)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
111
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
112
 
113
- with gr.Blocks(title="WebSearch") as demo:
-     gr.Markdown("# 🔍 Web Search\n" "Feed LLMs with fresh sources.")
-     query = gr.Textbox(label="Query", placeholder='e.g. "apple inc"')
-     top_k = gr.Slider(1, 20, value=4, label="How many results?")
-     out = gr.Textbox(label="Extracted Context", lines=25)
-     run = gr.Button("Fetch")
-     run.click(handler, inputs=[query, top_k], outputs=out)

  if __name__ == "__main__":
-     # Launch in shareable mode when running on Colab/VMs; edit as you wish.
-     demo.launch()
 
 
  """
+ Web Search MCP Server - Feed LLMs with fresh sources
+ ====================================================

  Prerequisites
  -------------
+ $ pip install "gradio[mcp]" httpx trafilatura python-dateutil limits

  Environment
  -----------
+ export SERPER_API_KEY="YOUR-KEY-HERE"
+
+ Usage
+ -----
+ python app_mcp.py
+ Then connect to: http://localhost:7860/gradio_api/mcp/sse
  """

+ import os
+ import asyncio
+ from typing import Optional
+ import httpx
+ import trafilatura
+ import gradio as gr
  from dateutil import parser as dateparser
  from limits import parse
  from limits.aio.storage import MemoryStorage
  from limits.aio.strategies import MovingWindowRateLimiter

+ # Configuration
  SERPER_API_KEY = os.getenv("SERPER_API_KEY")
  SERPER_ENDPOINT = "https://google.serper.dev/news"
  HEADERS = {"X-API-KEY": SERPER_API_KEY, "Content-Type": "application/json"}

  # Rate limiting
  storage = MemoryStorage()
  limiter = MovingWindowRateLimiter(storage)
  rate_limit = parse("200/hour")

+ async def search_web(query: str, num_results: Optional[int] = 4) -> str:
+     """
+     Search the web for recent news and information, returning extracted content.
+
+     This tool searches for recent news articles related to your query and extracts
+     the main content from each article, providing you with fresh, relevant information
+     from the web.
+
+     Args:
+         query (str): The search query. This is REQUIRED. Examples: "apple inc earnings",
+             "climate change 2024", "AI developments"
+         num_results (int): Number of results to fetch. This is OPTIONAL. Default is 4.
+             Range: 1-20. More results = more context but longer response time.
+
+     Returns:
+         str: Formatted text containing extracted article content with metadata (title,
+             source, date, URL, and main text) for each result, separated by dividers.
+             Returns error message if API key is missing or search fails.
+
+     Examples:
+         - search_web("OpenAI news", 5) - Get 5 recent news articles about OpenAI
+         - search_web("python 3.13 features") - Get 4 articles about Python 3.13
+         - search_web("stock market today", 10) - Get 10 articles about today's market
+     """
+     if not SERPER_API_KEY:
+         return "Error: SERPER_API_KEY environment variable is not set. Please set it to use this tool."
+
+     # Validate and constrain num_results
+     if num_results is None:
+         num_results = 4
+     num_results = max(1, min(20, num_results))
+
+     try:
+         # Check rate limit
+         if not await limiter.hit(rate_limit, "global"):
+             return "Error: Rate limit exceeded. Please try again later (limit: 200 requests per hour)."
+
+         # Search for news
+         payload = {"q": query, "type": "news", "num": num_results, "page": 1}
+         async with httpx.AsyncClient(timeout=15) as client:
+             resp = await client.post(SERPER_ENDPOINT, headers=HEADERS, json=payload)
+
+         if resp.status_code != 200:
+             return f"Error: Search API returned status {resp.status_code}. Please check your API key and try again."
+
+         news_items = resp.json().get("news", [])
+         if not news_items:
+             return (
+                 f"No results found for query: '{query}'. Try a different search term."
              )
+
+         # Fetch HTML content concurrently
+         urls = [n["link"] for n in news_items]
+         async with httpx.AsyncClient(timeout=20, follow_redirects=True) as client:
+             tasks = [client.get(u) for u in urls]
+             responses = await asyncio.gather(*tasks, return_exceptions=True)
+
+         # Extract and format content
+         chunks = []
+         successful_extractions = 0
+
+         for meta, response in zip(news_items, responses):
+             if isinstance(response, Exception):
+                 continue
+
+             # Extract main text content
+             body = trafilatura.extract(
+                 response.text, include_formatting=False, include_comments=False
+             )
+
+             if not body:
+                 continue
+
+             successful_extractions += 1
+
+             # Parse and format date
+             try:
+                 date_iso = dateparser.parse(meta.get("date", ""), fuzzy=True).strftime(
+                     "%Y-%m-%d"
+                 )
+             except Exception:
+                 date_iso = meta.get("date", "Unknown")
+
+             # Format the chunk
+             chunk = (
+                 f"## {meta['title']}\n"
+                 f"**Source:** {meta['source']} "
+                 f"**Date:** {date_iso}\n"
+                 f"**URL:** {meta['link']}\n\n"
+                 f"{body.strip()}\n"
+             )
+             chunks.append(chunk)
+
+         if not chunks:
+             return f"Found {len(news_items)} results for '{query}', but couldn't extract readable content from any of them. The websites might be blocking automated access."
+
+         result = "\n---\n".join(chunks)
+         summary = f"Successfully extracted content from {successful_extractions} out of {len(news_items)} search results for query: '{query}'\n\n---\n\n"

+         return summary + result
+
+     except Exception as e:
+         return f"Error occurred while searching: {str(e)}. Please try again or check your query."
+
+
+ # Create Gradio interface
+ with gr.Blocks(title="Web Search MCP Server") as demo:
+     gr.Markdown(
+         """
+         # 🔍 Web Search MCP Server
+
+         This MCP server provides web search capabilities to LLMs. It searches for recent news
+         and extracts the main content from articles.
+
+         **Note:** This interface is primarily designed for MCP tool usage by LLMs, but you can
+         also test it manually below.
+         """
+     )
+
+     with gr.Row():
+         query_input = gr.Textbox(
+             label="Search Query",
+             placeholder='e.g. "OpenAI news", "climate change 2024", "AI developments"',
+             info="Required: Enter your search query",
+         )
+         num_results_input = gr.Slider(
+             minimum=1,
+             maximum=20,
+             value=4,
+             step=1,
+             label="Number of Results",
+             info="Optional: How many articles to fetch (default: 4)",
+         )
+
+     output = gr.Textbox(
+         label="Extracted Content",
+         lines=25,
+         max_lines=50,
+         info="The extracted article content will appear here",
+     )
+
+     search_button = gr.Button("Search", variant="primary")
+
+     # Add examples
+     gr.Examples(
+         examples=[
+             ["OpenAI GPT-5 news", 5],
+             ["climate change 2024", 4],
+             ["artificial intelligence breakthroughs", 8],
+             ["stock market today", 6],
+             ["python programming updates", 4],
+         ],
+         inputs=[query_input, num_results_input],
+         outputs=output,
+         fn=search_web,
+         cache_examples=False,
+     )
+
+     search_button.click(
+         fn=search_web, inputs=[query_input, num_results_input], outputs=output
+     )

  if __name__ == "__main__":
+     # Launch with MCP server enabled
+     # The MCP endpoint will be available at: http://localhost:7860/gradio_api/mcp/sse
+     demo.launch(mcp_server=True, show_api=True)