Spaces: victor/websearch

victor (HF Staff) committed 1 day ago

Commit 9d978bc

1 Parent(s): 6ef48c6

Enhance README and app.py: clarify search functionality, add search type options, and improve usage examples for web search capabilities.

Files changed (2)

README.md +43 -12
app.py +94 -40

README.md CHANGED

@@ -11,11 +11,14 @@ pinned: false
 # Web Search MCP Server
-A Model Context Protocol (MCP) server that provides web search capabilities to LLMs, allowing them to fetch and extract content from recent news articles.
 ## Features
-- **Real-time web search**: Search for recent news on any topic
 - **Content extraction**: Automatically extracts main article content, removing ads and boilerplate
 - **Rate limiting**: Built-in rate limiting (200 requests/hour) to prevent API abuse
 - **Structured output**: Returns formatted content with metadata (title, source, date, URL)
@@ -84,17 +87,23 @@ For clients that support URL-based MCP servers:
 ### `search_web` Function
-**Purpose**: Search the web for recent news and extract article content.
 **Parameters**:
 - `query` (str, **REQUIRED**): The search query
-  - Examples: "OpenAI news", "climate change 2024", "python updates"
 - `num_results` (int, **OPTIONAL**): Number of results to fetch
   - Default: 4
   - Range: 1-20
   - More results provide more context but take longer
 **Returns**: Formatted text containing:
 - Summary of extraction results
 - For each article:
@@ -103,11 +112,31 @@ For clients that support URL-based MCP servers:
   - URL
   - Extracted main content
 **Example Usage in LLM**:
 ```
-"Search for recent developments in artificial intelligence"
-"Find 10 articles about climate change in 2024"
-"Get news about Python programming language updates"
 ```
 ## Error Handling
@@ -128,17 +157,19 @@ You can test the server manually:
 ## Tips for LLM Usage
-1. **Be specific with queries**: More specific queries yield better results
-2. **Adjust result count**: Use fewer results for quick searches, more for comprehensive research
-3. **Check dates**: The tool shows article dates for temporal context
-4. **Follow up**: Use the extracted content to ask follow-up questions
 ## Limitations
 - Rate limited to 200 requests per hour
-- Only searches news articles (not general web pages)
 - Extraction quality depends on website structure
 - Some websites may block automated access
 ## Troubleshooting

 # Web Search MCP Server
+A Model Context Protocol (MCP) server that provides web search capabilities to LLMs, allowing them to fetch and extract content from web pages and news articles.
 ## Features
+- **Dual search modes**:
+  - **General Search**: Get diverse results from blogs, documentation, articles, and more
+  - **News Search**: Find fresh news articles and breaking stories from news sources
+- **Real-time web search**: Search for any topic with up-to-date results
 - **Content extraction**: Automatically extracts main article content, removing ads and boilerplate
 - **Rate limiting**: Built-in rate limiting (200 requests/hour) to prevent API abuse
 - **Structured output**: Returns formatted content with metadata (title, source, date, URL)
 ### `search_web` Function
+**Purpose**: Search the web for information or fresh news and extract content.
 **Parameters**:
 - `query` (str, **REQUIRED**): The search query
+  - Examples: "OpenAI news", "climate change 2024", "python tutorial"
 - `num_results` (int, **OPTIONAL**): Number of results to fetch
   - Default: 4
   - Range: 1-20
   - More results provide more context but take longer
+- `search_type` (str, **OPTIONAL**): Type of search to perform
+  - Default: "search" (general web search)
+  - Options: "search" or "news"
+  - Use "news" for fresh, time-sensitive news articles
+  - Use "search" for general information, documentation, tutorials
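
The defaults and limits listed for the optional parameters can be sketched as a small helper. This is only an illustration of the documented rules (default of 4, range 1-20, fallback to "search" for unknown types); `normalize_args` and `VALID_SEARCH_TYPES` are hypothetical names that do not appear in the commit.

```python
from typing import Optional, Tuple

# Hypothetical constant for illustration; the commit inlines this list.
VALID_SEARCH_TYPES = ("search", "news")


def normalize_args(
    search_type: str = "search", num_results: Optional[int] = 4
) -> Tuple[str, int]:
    """Apply the documented defaults and limits to the optional parameters."""
    # A missing count falls back to the documented default of 4.
    if num_results is None:
        num_results = 4
    # Clamp the count into the documented 1-20 range.
    num_results = max(1, min(20, num_results))
    # Any unrecognized search type falls back to general web search.
    if search_type not in VALID_SEARCH_TYPES:
        search_type = "search"
    return search_type, num_results
```

Under these rules an out-of-range count is clamped silently and an unknown search type is treated as a general search, so callers never receive a validation error for these two parameters.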
 **Returns**: Formatted text containing:
 - Summary of extraction results
 - For each article:
   - URL
   - Extracted main content
+**When to use each search type**:
+- **Use "news" mode for**:
+  - Breaking news or very recent events
+  - Time-sensitive information ("today", "this week")
+  - Current affairs and latest developments
+  - Press releases and announcements
+- **Use "search" mode for**:
+  - General information and research
+  - Technical documentation or tutorials
+  - Historical information
+  - Diverse perspectives from various sources
+  - How-to guides and explanations
 **Example Usage in LLM**:
 ```
+# News mode examples
+"Search for breaking news about OpenAI" -> uses news mode
+"Find today's stock market updates" -> uses news mode
+"Get latest climate change developments" -> uses news mode
+# Search mode examples (default)
+"Search for Python programming tutorials" -> uses search mode
+"Find information about machine learning algorithms" -> uses search mode
+"Research historical data about climate change" -> uses search mode
 ```
 ## Error Handling
 ## Tips for LLM Usage
+1. **Choose the right search type**: Use "news" for fresh, breaking news; use "search" for general information
+2. **Be specific with queries**: More specific queries yield better results
+3. **Adjust result count**: Use fewer results for quick searches, more for comprehensive research
+4. **Check dates**: The tool shows article dates for temporal context
+5. **Follow up**: Use the extracted content to ask follow-up questions
 ## Limitations
 - Rate limited to 200 requests per hour
 - Extraction quality depends on website structure
 - Some websites may block automated access
+- News mode focuses on recent articles from news sources
+- Search mode provides diverse results but may include older content
 ## Troubleshooting

app.py CHANGED

@@ -29,7 +29,8 @@ from limits.aio.strategies import MovingWindowRateLimiter
 # Configuration
 SERPER_API_KEY = os.getenv("SERPER_API_KEY")
-SERPER_ENDPOINT = "https://google.serper.dev/news"
 HEADERS = {"X-API-KEY": SERPER_API_KEY, "Content-Type": "application/json"}
 # Rate limiting
@@ -38,29 +39,45 @@ limiter = MovingWindowRateLimiter(storage)
 rate_limit = parse("200/hour")
-async def search_web(query: str, num_results: Optional[int] = 4) -> str:
     """
-    Search the web for recent news and information, returning extracted content.
-    This tool searches for recent news articles related to your query and extracts
-    the main content from each article, providing you with fresh, relevant information
-    from the web.
     Args:
         query (str): The search query. This is REQUIRED. Examples: "apple inc earnings",
                     "climate change 2024", "AI developments"
         num_results (int): Number of results to fetch. This is OPTIONAL. Default is 4.
                           Range: 1-20. More results = more context but longer response time.
     Returns:
-        str: Formatted text containing extracted article content with metadata (title,
              source, date, URL, and main text) for each result, separated by dividers.
              Returns error message if API key is missing or search fails.
     Examples:
-        - search_web("OpenAI news", 5) - Get 5 recent news articles about OpenAI
-        - search_web("python 3.13 features") - Get 4 articles about Python 3.13
-        - search_web("stock market today", 10) - Get 10 articles about today's market
     """
     if not SERPER_API_KEY:
         return "Error: SERPER_API_KEY environment variable is not set. Please set it to use this tool."
@@ -69,28 +86,44 @@ async def search_web(query: str, num_results: Optional[int] = 4) -> str:
     if num_results is None:
         num_results = 4
     num_results = max(1, min(20, num_results))
     try:
         # Check rate limit
         if not await limiter.hit(rate_limit, "global"):
             return "Error: Rate limit exceeded. Please try again later (limit: 200 requests per hour)."
-        # Search for news
-        payload = {"q": query, "type": "news", "num": num_results, "page": 1}
         async with httpx.AsyncClient(timeout=15) as client:
-            resp = await client.post(SERPER_ENDPOINT, headers=HEADERS, json=payload)
         if resp.status_code != 200:
             return f"Error: Search API returned status {resp.status_code}. Please check your API key and try again."
-        news_items = resp.json().get("news", [])
-        if not news_items:
             return (
-                f"No results found for query: '{query}'. Try a different search term."
             )
         # Fetch HTML content concurrently
-        urls = [n["link"] for n in news_items]
         async with httpx.AsyncClient(timeout=20, follow_redirects=True) as client:
             tasks = [client.get(u) for u in urls]
             responses = await asyncio.gather(*tasks, return_exceptions=True)
@@ -99,7 +132,7 @@ async def search_web(query: str, num_results: Optional[int] = 4) -> str:
         chunks = []
         successful_extractions = 0
-        for meta, response in zip(news_items, responses):
             if isinstance(response, Exception):
                 continue
@@ -115,16 +148,22 @@ async def search_web(query: str, num_results: Optional[int] = 4) -> str:
             # Parse and format date
             try:
-                date_iso = dateparser.parse(meta.get("date", ""), fuzzy=True).strftime(
-                    "%Y-%m-%d"
-                )
             except Exception:
-                date_iso = meta.get("date", "Unknown")
             # Format the chunk
             chunk = (
                 f"## {meta['title']}\n"
-                f"**Source:** {meta['source']}   "
                 f"**Date:** {date_iso}\n"
                 f"**URL:** {meta['link']}\n\n"
                 f"{body.strip()}\n"
@@ -132,10 +171,10 @@ async def search_web(query: str, num_results: Optional[int] = 4) -> str:
             chunks.append(chunk)
         if not chunks:
-            return f"Found {len(news_items)} results for '{query}', but couldn't extract readable content from any of them. The websites might be blocking automated access."
         result = "\n---\n".join(chunks)
-        summary = f"Successfully extracted content from {successful_extractions} out of {len(news_items)} search results for query: '{query}'\n\n---\n\n"
         return summary + result
@@ -149,8 +188,12 @@ with gr.Blocks(title="Web Search MCP Server") as demo:
         """
         # 🔍 Web Search MCP Server
-        This MCP server provides web search capabilities to LLMs. It searches for recent news
-        and extracts the main content from articles.
         **Note:** This interface is primarily designed for MCP tool usage by LLMs, but you can
         also test it manually below.
@@ -158,18 +201,28 @@ with gr.Blocks(title="Web Search MCP Server") as demo:
     )
     with gr.Row():
-        query_input = gr.Textbox(
-            label="Search Query",
-            placeholder='e.g. "OpenAI news", "climate change 2024", "AI developments"',
-            info="Required: Enter your search query",
-        )
         num_results_input = gr.Slider(
             minimum=1,
             maximum=20,
             value=4,
             step=1,
             label="Number of Results",
-            info="Optional: How many articles to fetch (default: 4)",
         )
     output = gr.Textbox(
@@ -184,20 +237,21 @@ with gr.Blocks(title="Web Search MCP Server") as demo:
     # Add examples
     gr.Examples(
         examples=[
-            ["OpenAI GPT-5 news", 5],
-            ["climate change 2024", 4],
-            ["artificial intelligence breakthroughs", 8],
-            ["stock market today", 6],
-            ["python programming updates", 4],
         ],
-        inputs=[query_input, num_results_input],
         outputs=output,
         fn=search_web,
         cache_examples=False,
     )
     search_button.click(
-        fn=search_web, inputs=[query_input, num_results_input], outputs=output
     )

 # Configuration
 SERPER_API_KEY = os.getenv("SERPER_API_KEY")
+SERPER_SEARCH_ENDPOINT = "https://google.serper.dev/search"
+SERPER_NEWS_ENDPOINT = "https://google.serper.dev/news"
 HEADERS = {"X-API-KEY": SERPER_API_KEY, "Content-Type": "application/json"}
 # Rate limiting
 rate_limit = parse("200/hour")
+async def search_web(query: str, search_type: str = "search", num_results: Optional[int] = 4) -> str:
     """
+    Search the web for information or fresh news, returning extracted content.
+    This tool can perform two types of searches:
+    - "search" (default): General web search for diverse, relevant content from various sources
+    - "news": Specifically searches for fresh news articles and breaking stories
+    Use "news" mode when looking for:
+    - Breaking news or very recent events
+    - Time-sensitive information
+    - Current affairs and latest developments
+    - Today's/this week's happenings
+    Use "search" mode (default) for:
+    - General information and research
+    - Technical documentation or guides
+    - Historical information
+    - Diverse perspectives from various sources
     Args:
         query (str): The search query. This is REQUIRED. Examples: "apple inc earnings",
                     "climate change 2024", "AI developments"
+        search_type (str): Type of search. This is OPTIONAL. Default is "search".
+                          Options: "search" (general web search) or "news" (fresh news articles).
+                          Use "news" for time-sensitive, breaking news content.
         num_results (int): Number of results to fetch. This is OPTIONAL. Default is 4.
                           Range: 1-20. More results = more context but longer response time.
     Returns:
+        str: Formatted text containing extracted content with metadata (title,
              source, date, URL, and main text) for each result, separated by dividers.
              Returns error message if API key is missing or search fails.
     Examples:
+        - search_web("OpenAI GPT-5", "news", 5) - Get 5 fresh news articles about OpenAI
+        - search_web("python tutorial", "search") - Get 4 general results about Python (default count)
+        - search_web("stock market today", "news", 10) - Get 10 news articles about today's market
+        - search_web("machine learning basics") - Get 4 general search results (all defaults)
     """
     if not SERPER_API_KEY:
         return "Error: SERPER_API_KEY environment variable is not set. Please set it to use this tool."
     if num_results is None:
         num_results = 4
     num_results = max(1, min(20, num_results))
+    # Validate search_type
+    if search_type not in ["search", "news"]:
+        search_type = "search"
     try:
         # Check rate limit
         if not await limiter.hit(rate_limit, "global"):
             return "Error: Rate limit exceeded. Please try again later (limit: 200 requests per hour)."
+        # Select endpoint based on search type
+        endpoint = SERPER_NEWS_ENDPOINT if search_type == "news" else SERPER_SEARCH_ENDPOINT
+        # Prepare payload
+        payload = {"q": query, "num": num_results}
+        if search_type == "news":
+            payload["type"] = "news"
+            payload["page"] = 1
         async with httpx.AsyncClient(timeout=15) as client:
+            resp = await client.post(endpoint, headers=HEADERS, json=payload)
         if resp.status_code != 200:
             return f"Error: Search API returned status {resp.status_code}. Please check your API key and try again."
+        # Extract results based on search type
+        if search_type == "news":
+            results = resp.json().get("news", [])
+        else:
+            results = resp.json().get("organic", [])
+        if not results:
             return (
+                f"No {search_type} results found for query: '{query}'. Try a different search term or search type."
             )
         # Fetch HTML content concurrently
+        urls = [r["link"] for r in results]
         async with httpx.AsyncClient(timeout=20, follow_redirects=True) as client:
             tasks = [client.get(u) for u in urls]
             responses = await asyncio.gather(*tasks, return_exceptions=True)
         chunks = []
         successful_extractions = 0
+        for meta, response in zip(results, responses):
             if isinstance(response, Exception):
                 continue
             # Parse and format date
             try:
+                # For news results, date is in 'date' field; for search results, it might be in 'snippet'
+                date_str = meta.get("date", "")
+                if date_str:
+                    date_iso = dateparser.parse(date_str, fuzzy=True).strftime("%Y-%m-%d")
+                else:
+                    date_iso = "Unknown"
             except Exception:
+                date_iso = "Unknown"
             # Format the chunk
+            # For search results, source might be in 'displayLink' or domain
+            source = meta.get('source', meta.get('displayLink', meta['link'].split('/')[2]))
             chunk = (
                 f"## {meta['title']}\n"
+                f"**Source:** {source}   "
                 f"**Date:** {date_iso}\n"
                 f"**URL:** {meta['link']}\n\n"
                 f"{body.strip()}\n"
             chunks.append(chunk)
         if not chunks:
+            return f"Found {len(results)} {search_type} results for '{query}', but couldn't extract readable content from any of them. The websites might be blocking automated access."
         result = "\n---\n".join(chunks)
+        summary = f"Successfully extracted content from {successful_extractions} out of {len(results)} {search_type} results for query: '{query}'\n\n---\n\n"
         return summary + result
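
Review note: the source fallback in the new code passes `meta['link'].split('/')[2]` as the default argument to `.get()`, so Python evaluates it even when `source` is present, and it would raise `IndexError` for a link without a scheme. A possible alternative, sketched under the assumption that results are plain dicts with the same keys the commit reads (`source`, `displayLink`, `link`), uses `urllib.parse.urlparse`; the helper name `result_source` is hypothetical.

```python
from urllib.parse import urlparse


def result_source(meta: dict) -> str:
    """Pick a display source for a search result, falling back to the URL host."""
    # Prefer explicit fields, in the same order the commit checks them.
    for key in ("source", "displayLink"):
        value = meta.get(key)
        if value:
            return value
    # Only parse the link when no explicit source field is available.
    host = urlparse(meta.get("link", "")).netloc
    # A link without a scheme has an empty netloc, so report "Unknown".
    return host or "Unknown"
```

This keeps the same precedence as the committed expression while avoiding the eager evaluation and the index lookup.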
         """
         # 🔍 Web Search MCP Server
+        This MCP server provides web search capabilities to LLMs. It can perform general web searches
+        or specifically search for fresh news articles, extracting the main content from results.
+        **Search Types:**
+        - **General Search**: Diverse results from various sources (blogs, docs, articles, etc.)
+        - **News Search**: Fresh news articles and breaking stories from news sources
         **Note:** This interface is primarily designed for MCP tool usage by LLMs, but you can
         also test it manually below.
     )
     with gr.Row():
+        with gr.Column(scale=3):
+            query_input = gr.Textbox(
+                label="Search Query",
+                placeholder='e.g. "OpenAI news", "climate change 2024", "AI developments"',
+                info="Required: Enter your search query",
+            )
+        with gr.Column(scale=1):
+            search_type_input = gr.Radio(
+                choices=["search", "news"],
+                value="search",
+                label="Search Type",
+                info="Choose search type",
+            )
+    with gr.Row():
         num_results_input = gr.Slider(
             minimum=1,
             maximum=20,
             value=4,
             step=1,
             label="Number of Results",
+            info="Optional: How many results to fetch (default: 4)",
         )
     output = gr.Textbox(
     # Add examples
     gr.Examples(
         examples=[
+            ["OpenAI GPT-5 latest developments", "news", 5],
+            ["python programming tutorial", "search", 4],
+            ["stock market today breaking news", "news", 6],
+            ["machine learning algorithms explained", "search", 8],
+            ["climate change 2024 latest news", "news", 4],
+            ["web development best practices", "search", 4],
         ],
+        inputs=[query_input, search_type_input, num_results_input],
         outputs=output,
         fn=search_web,
         cache_examples=False,
     )
     search_button.click(
+        fn=search_web, inputs=[query_input, search_type_input, num_results_input], outputs=output
     )
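
Review note: the endpoint selection, payload construction, and response-key choice that this commit adds to `search_web` could be isolated into one pure function, which makes the branching easy to test without calling the API. The sketch below mirrors the values in the diff (the two endpoint URLs, the `type` and `page` fields for news, and the `news` / `organic` response keys); the function name `build_request` is hypothetical and not part of the commit.

```python
from typing import Any, Dict, Tuple

SERPER_SEARCH_ENDPOINT = "https://google.serper.dev/search"
SERPER_NEWS_ENDPOINT = "https://google.serper.dev/news"


def build_request(
    query: str, search_type: str, num_results: int
) -> Tuple[str, Dict[str, Any], str]:
    """Return (endpoint, payload, results_key) for the given search type."""
    payload: Dict[str, Any] = {"q": query, "num": num_results}
    if search_type == "news":
        # News requests carry the extra fields set in the commit.
        payload["type"] = "news"
        payload["page"] = 1
        return SERPER_NEWS_ENDPOINT, payload, "news"
    # Everything else is treated as a general web search.
    return SERPER_SEARCH_ENDPOINT, payload, "organic"
```

With such a helper, `search_web` would post `payload` to `endpoint` and read `resp.json().get(results_key, [])`, keeping the two modes in one place.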