---
title: Web Scraper
emoji: 🌐
colorFrom: yellow
colorTo: green
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Web Scraper & Sitemap Generator

A Python Gradio application that scrapes websites, converts content to markdown, and generates sitemaps from page links. Available both as a web interface and as an MCP (Model Context Protocol) server for AI integration.

## Features

- 🕷️ **Web Scraping**: Extract text content from any website
- 📝 **Markdown Conversion**: Convert scraped HTML content to clean markdown format
- 🗺️ **Sitemap Generation**: Create organized sitemaps based on all links found on the page
- 🌐 **User-Friendly Interface**: Easy-to-use Gradio web interface
- 🔗 **Link Organization**: Separate internal and external links for better navigation
- 🤖 **MCP Server**: Expose scraping tools for AI assistants and LLMs

## Installation

1. Install Python dependencies:

```bash
pip install -r requirements.txt
```

## Usage

### Web Interface

1. Run the web application:

```bash
python app.py
```

2. Open your browser and navigate to `http://localhost:7861`
3. Enter a URL in the input field and click "Scrape Website"
4. View the results:
   - **Status**: Shows success/error messages
   - **Scraped Content**: Website content converted to markdown
   - **Sitemap**: Organized list of all links found on the page
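
For orientation, here is a minimal sketch of the overall shape of `app.py`. The function name, selectors, and layout are illustrative assumptions, not the actual implementation:

```python
# Minimal sketch of the app's shape (illustrative; the real app.py may differ).
import gradio as gr
import requests
from bs4 import BeautifulSoup
from markdownify import markdownify

def scrape_website(url: str) -> tuple[str, str, str]:
    """Fetch a page and return (status, markdown content, sitemap). Name is assumed."""
    if not url.startswith(("http://", "https://")):
        url = "https://" + url  # add a protocol if missing
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException as exc:
        return f"Error: {exc}", "", ""
    soup = BeautifulSoup(response.text, "lxml")
    for tag in soup(["script", "style", "nav"]):  # drop non-content elements
        tag.decompose()
    content = markdownify(str(soup))
    sitemap = "\n".join(f"- {a['href']}" for a in soup.find_all("a", href=True))
    return "Success", content, sitemap

demo = gr.Interface(
    fn=scrape_website,
    inputs=gr.Textbox(label="URL"),
    outputs=[
        gr.Textbox(label="Status"),
        gr.Textbox(label="Scraped Content"),
        gr.Textbox(label="Sitemap"),
    ],
    title="Web Scraper & Sitemap Generator",
)

if __name__ == "__main__":
    demo.launch(server_port=7861)
```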

### MCP Server

1. Run the MCP server:

```bash
python mcp_server.py
```

2. The server will be available at `http://localhost:7862`
3. **MCP Endpoint**: `http://localhost:7862/gradio_api/mcp/sse`

#### Available MCP Tools

- **scrape_content**: Extract and format website content as markdown
- **generate_sitemap**: Generate a sitemap of all links found on a webpage
- **analyze_website**: Complete website analysis with both content and sitemap
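
Gradio builds the MCP tool schema from each function's type hints and docstring, so `mcp_server.py` likely amounts to plain Python functions plus `launch(mcp_server=True)`. A hedged sketch of one tool (the function body here is a simplified assumption):

```python
# Sketch of how a tool might be exposed (simplified; actual code may differ).
import gradio as gr
import requests
from markdownify import markdownify

def scrape_content(url: str) -> str:
    """Extract and format website content as markdown.

    Args:
        url: The URL of the page to scrape.
    """
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return markdownify(response.text)

demo = gr.Interface(fn=scrape_content, inputs="text", outputs="text")

if __name__ == "__main__":
    # mcp_server=True also serves the /gradio_api/mcp/sse endpoint.
    demo.launch(server_port=7862, mcp_server=True)
```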

#### MCP Client Configuration

To use the server with Claude Desktop or another MCP client, add this to your client configuration:

```json
{
  "mcpServers": {
    "web-scraper": {
      "url": "http://localhost:7862/gradio_api/mcp/sse"
    }
  }
}
```

## Dependencies

- `gradio[mcp]`: Web interface framework with MCP support
- `requests`: HTTP library for making web requests
- `beautifulsoup4`: HTML parsing library
- `markdownify`: HTML to markdown conversion
- `lxml`: XML and HTML parser
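
Given this list, `requirements.txt` presumably contains something close to:

```
gradio[mcp]
requests
beautifulsoup4
markdownify
lxml
```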

## Project Structure

```
web-scraper/
├── app.py                  # Main web interface application
├── mcp_server.py           # MCP server with exposed tools
├── requirements.txt        # Python dependencies
├── README.md               # Project documentation
├── .github/
│   └── copilot-instructions.md
└── .vscode/
    └── tasks.json          # VS Code tasks
```

## Feature Details

### Web Scraping

- Handles both HTTP and HTTPS URLs
- Automatically adds protocol if missing
- Removes unwanted elements (scripts, styles, navigation)
- Focuses on main content areas
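
A sketch of how that cleanup might look (the tag and selector choices are assumptions, not necessarily the app's exact ones):

```python
# Sketch of content-area extraction (selectors are assumed).
from bs4 import BeautifulSoup

def extract_main_content(html: str) -> str:
    soup = BeautifulSoup(html, "lxml")
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()  # strip non-content elements
    # Prefer a dedicated content container when the page provides one.
    main = soup.find("main") or soup.find("article") or soup.body or soup
    return str(main)
```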

### Markdown Conversion

- Converts HTML to clean markdown format
- Preserves heading structure
- Removes empty links and excessive whitespace
- Adds page title as main heading
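
Roughly, the conversion step might look like this (the regex cleanup rules are illustrative guesses, not the app's exact rules):

```python
# Sketch of markdown conversion and cleanup (rules are assumed).
import re
from markdownify import markdownify

def html_to_markdown(html: str, title: str) -> str:
    md = markdownify(html, heading_style="ATX")  # '#'-style headings
    md = re.sub(r"\[\s*\]\([^)]*\)", "", md)     # drop empty links
    md = re.sub(r"\n{3,}", "\n\n", md)           # collapse extra blank lines
    return f"# {title}\n\n{md.strip()}"          # page title as main heading
```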

### Sitemap Generation

- Extracts all links from the page
- Converts relative URLs to absolute URLs
- Organizes links by domain (internal vs. external)
- Limits display to prevent overwhelming output
- Filters out unwanted links (anchors, javascript, etc.)
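
The core of that logic is standard-library URL handling; a hedged sketch (the function name and filter list are assumptions):

```python
# Sketch of link collection and internal/external grouping (illustrative).
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

def collect_links(html: str, base_url: str) -> tuple[list[str], list[str]]:
    soup = BeautifulSoup(html, "lxml")
    base_domain = urlparse(base_url).netloc
    internal, external = [], []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        if href.startswith(("#", "javascript:", "mailto:")):
            continue  # skip anchors and non-HTTP schemes
        absolute = urljoin(base_url, href)  # make relative URLs absolute
        if urlparse(absolute).netloc == base_domain:
            internal.append(absolute)
        else:
            external.append(absolute)
    return internal, external
```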

## Example URLs to Try

- `https://httpbin.org/html` - Simple test page
- `https://example.com` - Basic example site
- `https://python.org` - The official Python website

## Error Handling

The application includes comprehensive error handling for:

- Invalid URLs
- Network timeouts
- HTTP errors
- Content parsing issues
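
On the fetch side, these cases map naturally onto the `requests` exception hierarchy; a sketch of the pattern (the messages are illustrative):

```python
# Sketch of the error-handling pattern for fetching (messages are illustrative).
import requests

def fetch(url: str) -> tuple[str, str]:
    """Return (status message, body); body is empty on failure."""
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # raises on 4xx/5xx responses
        return "Success", response.text
    except requests.exceptions.MissingSchema:
        return "Error: invalid URL", ""
    except requests.exceptions.Timeout:
        return "Error: request timed out", ""
    except requests.exceptions.HTTPError as exc:
        return f"Error: HTTP {exc.response.status_code}", ""
    except requests.RequestException as exc:
        return f"Error: {exc}", ""
```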

## Customization

You can customize the scraper by modifying:

- User-Agent string in the `WebScraper` class
- Content extraction selectors
- Markdown formatting rules
- Link filtering criteria
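
For example, overriding the User-Agent can be as simple as changing one attribute. Only the `WebScraper` name comes from the code above; the internals shown here are assumptions:

```python
# Sketch of a User-Agent customization point (class internals are assumed).
import requests

class WebScraper:
    USER_AGENT = "Mozilla/5.0 (compatible; MyScraper/1.0)"  # customize here

    def __init__(self):
        self.session = requests.Session()
        self.session.headers["User-Agent"] = self.USER_AGENT

    def get(self, url: str) -> requests.Response:
        return self.session.get(url, timeout=10)
```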