---
title: Web Scraper
emoji: πŸš€
colorFrom: yellow
colorTo: green
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Web Scraper & Sitemap Generator

A Python Gradio application that scrapes websites, converts content to markdown, and generates sitemaps from page links. Available both as a web interface and as an MCP (Model Context Protocol) server for AI integration.

## Features

- πŸ•·οΈ **Web Scraping**: Extract text content from any website
- πŸ“ **Markdown Conversion**: Convert scraped HTML content to clean markdown format
- πŸ—ΊοΈ **Sitemap Generation**: Create organized sitemaps based on all links found on the page
- 🌐 **User-Friendly Interface**: Easy-to-use Gradio web interface
- πŸ”— **Link Organization**: Separate internal and external links for better navigation
- πŸ€– **MCP Server**: Expose scraping tools for AI assistants and LLMs

## Installation

Install the Python dependencies:

```bash
pip install -r requirements.txt
```

## Usage

### Web Interface

1. Run the web application:

   ```bash
   python app.py
   ```

2. Open your browser and navigate to `http://localhost:7861`
3. Enter a URL in the input field and click "Scrape Website"
4. View the results:
   - **Status**: Shows success/error messages
   - **Scraped Content**: Website content converted to markdown
   - **Sitemap**: Organized list of all links found on the page

### MCP Server

1. Run the MCP server:

   ```bash
   python mcp_server.py
   ```

2. The server will be available at `http://localhost:7862`
3. **MCP Endpoint**: `http://localhost:7862/gradio_api/mcp/sse`

#### Available MCP Tools

- **scrape_content**: Extract and format website content as markdown
- **generate_sitemap**: Generate a sitemap of all links found on a webpage
- **analyze_website**: Complete website analysis with both content and sitemap

#### MCP Client Configuration

To use the server with Claude Desktop or another MCP client, add this to your configuration:

```json
{
  "mcpServers": {
    "web-scraper": {
      "url": "http://localhost:7862/gradio_api/mcp/sse"
    }
  }
}
```

## Dependencies

- `gradio[mcp]`: Web interface framework with MCP support
- `requests`: HTTP library for making web requests
- `beautifulsoup4`: HTML parsing library
- `markdownify`: HTML-to-markdown conversion
- `lxml`: XML and HTML parser

## Project Structure

```
web-scraper/
β”œβ”€β”€ app.py              # Main web interface application
β”œβ”€β”€ mcp_server.py       # MCP server with exposed tools
β”œβ”€β”€ requirements.txt    # Python dependencies
β”œβ”€β”€ README.md           # Project documentation
β”œβ”€β”€ .github/
β”‚   └── copilot-instructions.md
└── .vscode/
    └── tasks.json      # VS Code tasks
```

## Feature Details

### Web Scraping

- Handles both HTTP and HTTPS URLs
- Automatically adds a protocol if missing
- Removes unwanted elements (scripts, styles, navigation)
- Focuses on main content areas

### Markdown Conversion

- Converts HTML to clean markdown
- Preserves heading structure
- Removes empty links and excessive whitespace
- Adds the page title as the main heading

### Sitemap Generation

- Extracts all links from the page
- Converts relative URLs to absolute URLs
- Organizes links by domain (internal vs. external)
- Limits display to prevent overwhelming output
- Filters out unwanted links (anchors, `javascript:`, etc.)
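For orientation, here is a minimal sketch of the pipeline these details describe, using `requests`, `beautifulsoup4`, and `markdownify` as listed under Dependencies. The function names `scrape_to_markdown` and `organize_links` are illustrative only, not the actual API of `app.py`:

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md


def scrape_to_markdown(url: str) -> str:
    """Fetch a page and convert its main content to markdown (illustrative)."""
    if not url.startswith(("http://", "https://")):
        url = "https://" + url  # add a protocol if one is missing
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    for tag in soup(["script", "style", "nav"]):  # drop non-content elements
        tag.decompose()
    title = soup.title.get_text(strip=True) if soup.title else url
    body = soup.body or soup
    return f"# {title}\n\n" + md(str(body), heading_style="atx")


def organize_links(url: str, soup: BeautifulSoup) -> tuple[list[str], list[str]]:
    """Split page links into internal and external, resolving relative URLs."""
    base = urlparse(url).netloc
    internal, external = [], []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.startswith(("#", "javascript:", "mailto:")):
            continue  # filter out anchors and non-HTTP links
        absolute = urljoin(url, href)  # relative -> absolute
        (internal if urlparse(absolute).netloc == base else external).append(absolute)
    return internal, external
```

For example, `scrape_to_markdown("example.com")` adds the missing protocol, fetches the page, and returns the title as an H1 heading followed by the converted body.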
## Example URLs to Try

- `https://httpbin.org/html` - Simple test page
- `https://example.com` - Basic example site
- `https://python.org` - Python official website

## Error Handling

The application includes comprehensive error handling (sketched at the end of this README) for:

- Invalid URLs
- Network timeouts
- HTTP errors
- Content parsing issues

## Customization

You can customize the scraper by modifying:

- The User-Agent string in the `WebScraper` class
- The content extraction selectors
- The markdown formatting rules
- The link filtering criteria
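As a rough illustration of how the failure modes listed under Error Handling might be caught, here is a hedged sketch built on the `requests` exception hierarchy; the helper `safe_fetch` is hypothetical and not part of the codebase:

```python
import requests


def safe_fetch(url: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Return (ok, html_or_error_message) covering the failure modes above."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return True, response.text
    except requests.exceptions.MissingSchema:
        return False, f"Invalid URL: {url!r}"
    except requests.exceptions.Timeout:
        return False, f"Request timed out after {timeout} seconds"
    except requests.exceptions.HTTPError as exc:
        return False, f"HTTP error {exc.response.status_code} for {url}"
    except requests.exceptions.RequestException as exc:
        return False, f"Network error: {exc}"
```

Content-parsing issues are not shown here, since BeautifulSoup tolerates most malformed HTML; the scraping layer would handle those separately.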