---
title: Web Scraper
emoji: πŸš€
colorFrom: yellow
colorTo: green
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Web Scraper & Sitemap Generator

A Python Gradio application that scrapes websites, converts content to markdown, and generates sitemaps from page links. Available both as a web interface and as an MCP (Model Context Protocol) server for AI integration.

## Features

- πŸ•·οΈ **Web Scraping**: Extract text content from any website
- πŸ“ **Markdown Conversion**: Convert scraped HTML content to clean markdown format
- πŸ—ΊοΈ **Sitemap Generation**: Create organized sitemaps based on all links found on the page
- 🌐 **User-Friendly Interface**: Easy-to-use Gradio web interface
- πŸ”— **Link Organization**: Separate internal and external links for better navigation
- πŸ€– **MCP Server**: Expose scraping tools for AI assistants and LLMs

## Installation

Install the Python dependencies:

```bash
pip install -r requirements.txt
```

## Usage

### Web Interface

1. Run the web application:

   ```bash
   python app.py
   ```

2. Open your browser and navigate to `http://localhost:7861`
3. Enter a URL in the input field and click "Scrape Website"
4. View the results:
   - **Status**: Shows success/error messages
   - **Scraped Content**: Website content converted to markdown
   - **Sitemap**: Organized list of all links found on the page

### MCP Server

1. Run the MCP server:

   ```bash
   python mcp_server.py
   ```

2. The server will be available at `http://localhost:7862`
3. **MCP Endpoint**: `http://localhost:7862/gradio_api/mcp/sse`

#### Available MCP Tools

- **scrape_content**: Extract and format website content as markdown
- **generate_sitemap**: Generate a sitemap of all links found on a webpage
- **analyze_website**: Complete website analysis with both content and sitemap

#### MCP Client Configuration

To use the server with Claude Desktop or another MCP client, add this to your configuration:

```json
{
  "mcpServers": {
    "web-scraper": {
      "url": "http://localhost:7862/gradio_api/mcp/sse"
    }
  }
}
```

## Dependencies

- `gradio[mcp]`: Web interface framework with MCP support
- `requests`: HTTP library for making web requests
- `beautifulsoup4`: HTML parsing library
- `markdownify`: HTML-to-markdown conversion
- `lxml`: XML and HTML parser

## Project Structure

```
web-scraper/
β”œβ”€β”€ app.py              # Main web interface application
β”œβ”€β”€ mcp_server.py       # MCP server with exposed tools
β”œβ”€β”€ requirements.txt    # Python dependencies
β”œβ”€β”€ README.md           # Project documentation
β”œβ”€β”€ .github/
β”‚   └── copilot-instructions.md
└── .vscode/
    └── tasks.json      # VS Code tasks
```

## Feature Details

### Web Scraping

- Handles both HTTP and HTTPS URLs
- Automatically adds a protocol if missing
- Removes unwanted elements (scripts, styles, navigation)
- Focuses on main content areas

### Markdown Conversion

- Converts HTML to clean markdown
- Preserves heading structure
- Removes empty links and excessive whitespace
- Adds the page title as the main heading

### Sitemap Generation

- Extracts all links from the page
- Converts relative URLs to absolute URLs
- Organizes links by domain (internal vs. external)
- Limits display to prevent overwhelming output
- Filters out unwanted links (anchors, `javascript:`, etc.)
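For orientation, here is a minimal sketch of the pipeline these details describe, using `requests`, `beautifulsoup4`, and `markdownify` as listed under Dependencies. The function names `scrape_to_markdown` and `organize_links` are illustrative only, not the actual API of `app.py`:

```python
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup
from markdownify import markdownify as md


def scrape_to_markdown(url: str) -> str:
    """Fetch a page and convert its main content to markdown (illustrative)."""
    if not url.startswith(("http://", "https://")):
        url = "https://" + url  # add a protocol if one is missing
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "lxml")
    for tag in soup(["script", "style", "nav"]):  # drop non-content elements
        tag.decompose()
    title = soup.title.get_text(strip=True) if soup.title else url
    body = soup.body or soup
    return f"# {title}\n\n" + md(str(body), heading_style="atx")


def organize_links(url: str, soup: BeautifulSoup) -> tuple[list[str], list[str]]:
    """Split page links into internal and external, resolving relative URLs."""
    base = urlparse(url).netloc
    internal, external = [], []
    for anchor in soup.find_all("a", href=True):
        href = anchor["href"]
        if href.startswith(("#", "javascript:", "mailto:")):
            continue  # filter out anchors and non-HTTP links
        absolute = urljoin(url, href)  # relative -> absolute
        (internal if urlparse(absolute).netloc == base else external).append(absolute)
    return internal, external
```

For example, `scrape_to_markdown("example.com")` adds the missing protocol, fetches the page, and returns the title as an H1 heading followed by the converted body.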
## Example URLs to Try

- `https://httpbin.org/html` - Simple test page
- `https://example.com` - Basic example site
- `https://python.org` - Python official website

## Error Handling

The application includes comprehensive error handling (sketched at the end of this README) for:

- Invalid URLs
- Network timeouts
- HTTP errors
- Content parsing issues

## Customization

You can customize the scraper by modifying:

- The User-Agent string in the `WebScraper` class
- The content extraction selectors
- The markdown formatting rules
- The link filtering criteria
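As a rough illustration of how the failure modes listed under Error Handling might be caught, here is a hedged sketch built on the `requests` exception hierarchy; the helper `safe_fetch` is hypothetical and not part of the codebase:

```python
import requests


def safe_fetch(url: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Return (ok, html_or_error_message) covering the failure modes above."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return True, response.text
    except requests.exceptions.MissingSchema:
        return False, f"Invalid URL: {url!r}"
    except requests.exceptions.Timeout:
        return False, f"Request timed out after {timeout} seconds"
    except requests.exceptions.HTTPError as exc:
        return False, f"HTTP error {exc.response.status_code} for {url}"
    except requests.exceptions.RequestException as exc:
        return False, f"Network error: {exc}"
```

Content-parsing issues are not shown here, since BeautifulSoup tolerates most malformed HTML; the scraping layer would handle those separately.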