---
title: AI_SEO_Crawler
app_file: seo_analyzer_ui.py
sdk: gradio
sdk_version: 5.30.0
---
# Web Crawler Documentation

A scalable web crawler with configurable behavior, politeness controls, and content extraction.

## Table of Contents

- [Architecture](#architecture)
- [Setup](#setup)
- [Usage](#usage)
- [Components](#components)
- [Troubleshooting](#troubleshooting)

## Architecture

The web crawler consists of the following key components:

1. **URL Frontier**: Manages URLs to be crawled with prioritization
2. **DNS Resolver**: Caches DNS lookups to improve performance
3. **Robots Handler**: Ensures compliance with robots.txt
4. **HTML Downloader**: Downloads web pages with error handling
5. **HTML Parser**: Extracts URLs and metadata from web pages
6. **Storage**: MongoDB persistence for crawled URLs and page metadata
7. **Crawler**: Main crawler orchestration
8. **API**: REST API for controlling the crawler
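
As a rough illustration of how these components interact, here is a minimal crawl-loop sketch. All class and method names below are hypothetical stand-ins, not the project's actual API:

```python
# Hypothetical orchestration sketch -- names are illustrative only.
def crawl_loop(frontier, robots, downloader, parser, storage, max_depth=3):
    while not frontier.empty():
        url, depth = frontier.pop()           # URL Frontier: next URL by priority
        if depth > max_depth or not robots.allowed(url):
            continue                          # Robots Handler: skip disallowed URLs
        page = downloader.fetch(url)          # HTML Downloader: fetch with retries
        if page is None:
            continue
        links, metadata = parser.parse(page)  # HTML Parser: extract links + metadata
        storage.save(url, metadata)           # Storage: persist to MongoDB
        for link in links:
            frontier.push(link, depth + 1)    # feed new URLs back to the frontier
```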

## Setup

### Requirements

- Python 3.8+
- MongoDB
- Redis server

### Installation

1. Install MongoDB:
   ```bash
   # For Ubuntu
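   # Note: newer Ubuntu releases no longer ship a "mongodb" package;
   # install "mongodb-org" from MongoDB's official apt repository instead
   # (the service is then named "mongod").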
   sudo apt-get install -y mongodb
   sudo systemctl start mongodb
   sudo systemctl enable mongodb
   
   # Verify MongoDB is running
   sudo systemctl status mongodb
   ```

2. Install Redis:
   ```bash
   sudo apt-get install redis-server
   sudo systemctl start redis-server
   
   # Verify Redis is running
   redis-cli ping  # Should return PONG
   ```

3. Install Python dependencies:
   ```bash
   pip install -r requirements.txt
   ```

4. Create a local configuration file:
   ```bash
   cp config.py local_config.py
   ```

5. Edit `local_config.py` to customize settings:
   ```python
   # Example configuration
   SEED_URLS = ["https://example.com"]  # Start URLs
   MAX_DEPTH = 3                        # Crawl depth
   MAX_WORKERS = 4                      # Number of worker threads
   DELAY_BETWEEN_REQUESTS = 1           # Politeness delay
   ```

## Usage

### Running the Crawler

To run the crawler with default settings:

```bash
cd 4_web_crawler
python run_crawler.py
```

To specify custom seed URLs:

```bash
python run_crawler.py --seed https://example.com https://another-site.com
```

To limit crawl depth:

```bash
python run_crawler.py --depth 2
```

To run with more worker threads:

```bash
python run_crawler.py --workers 8
```

### Sample Commands

Here are some common use cases with sample commands:

#### Crawl a Single Domain

This command crawls only example.com and does not follow external links:

```bash
python run_crawler.py --seed example.com --domain-filter example.com
```

#### Fresh Start (Reset Database)

This clears both MongoDB and Redis before starting, which resolves duplicate key errors:

```bash
python run_crawler.py --seed example.com --reset-db
```

#### Custom Speed and Depth

Control the crawler's speed and depth:

```bash
python run_crawler.py --seed example.com --depth 3 --workers 4 --delay 0.5
```

#### Crawl Multiple Sites

Crawl multiple websites at once:

```bash
python run_crawler.py --seed example.com blog.example.org docs.example.com
```

#### Ignore robots.txt Rules

Use with caution, as this ignores website crawling policies:

```bash
python run_crawler.py --seed example.com --ignore-robots
```

#### Set Custom User Agent

Identify the crawler with a specific user agent:

```bash
python run_crawler.py --seed example.com --user-agent "MyCustomBot/1.0"
```

#### Crawl sagarnildas.com

To specifically crawl sagarnildas.com with optimal settings:

```bash
python run_crawler.py --seed sagarnildas.com --domain-filter sagarnildas.com --reset-db --workers 2 --depth 3 --verbose
```

### Using the API

The crawler provides a REST API for control and monitoring:

```bash
cd 4_web_crawler
python api.py
```

The API will be available at http://localhost:8000

#### API Endpoints

- `GET /status` - Get crawler status
- `GET /stats` - Get detailed statistics
- `POST /start` - Start the crawler
- `POST /stop` - Stop the crawler
- `POST /seed` - Add seed URLs
- `GET /pages` - List crawled pages
- `GET /urls` - List discovered URLs
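
These endpoints can also be exercised from Python. The sketch below is illustrative: the JSON payload shape for `/seed` is an assumption, so check `api.py` for the actual request schema.

```python
import requests

BASE = "http://localhost:8000"

# Assumed payload shape for /seed -- verify against api.py.
requests.post(f"{BASE}/seed", json={"urls": ["https://example.com"]})
requests.post(f"{BASE}/start")

# Poll crawler status and statistics.
print(requests.get(f"{BASE}/status").json())
print(requests.get(f"{BASE}/stats").json())
```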

### Checking Results

Monitor the crawler through:

1. Log output:
   ```bash
   tail -f crawler.log
   ```

2. MongoDB collections:
   ```bash
   # Start the mongo shell (use mongosh on newer MongoDB installs)
   mongo
   
   # Switch to crawler database
   use crawler
   
   # Count discovered URLs
   db.urls.count()
   
   # View crawled pages
   db.pages.find().limit(5)
   ```

3. API statistics:
   ```bash
   curl http://localhost:8000/stats
   ```

## Components

The crawler has several key components that work together:

### URL Frontier

Manages the queue of URLs to be crawled with priority-based scheduling.
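
A minimal sketch of a priority frontier using Python's `heapq` (illustrative only; the real frontier also handles per-domain queues and politeness):

```python
import heapq
import itertools

class SimpleFrontier:
    """Illustrative priority frontier: lower score is crawled sooner."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def push(self, url, priority=0):
        if url not in self._seen:          # de-duplicate discovered URLs
            self._seen.add(url)
            heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        priority, _, url = heapq.heappop(self._heap)
        return url

    def empty(self):
        return not self._heap
```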

### DNS Resolver

Caches DNS lookups to improve performance and reduce load on DNS servers.
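
The core idea can be sketched as a TTL-bounded cache in front of a standard lookup (a simplified illustration, not the project's actual resolver):

```python
import socket
import time

class CachingResolver:
    """Illustrative DNS cache with a per-entry time-to-live."""

    def __init__(self, ttl=3600):
        self.ttl = ttl
        self._cache = {}  # hostname -> (ip_address, expiry_time)

    def resolve(self, hostname):
        entry = self._cache.get(hostname)
        if entry and entry[1] > time.time():
            return entry[0]                  # fresh cache hit
        ip = socket.gethostbyname(hostname)  # real lookup on miss or expiry
        self._cache[hostname] = (ip, time.time() + self.ttl)
        return ip
```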

### Robots Handler

Ensures compliance with robots.txt rules to be a good web citizen.
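
Python's standard library covers the core of this check; a minimal sketch with `urllib.robotparser` (in practice the parser would be cached per domain rather than re-fetched for every URL):

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed(url, user_agent="MyBot/1.0"):
    """Check a URL against its site's robots.txt."""
    parts = urlparse(url)
    rp = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # fetches and parses the site's robots.txt
    return rp.can_fetch(user_agent, url)
```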

### HTML Downloader

Downloads web pages with error handling, timeouts, and retries.
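
A simplified downloader with a timeout and basic retry backoff might look like this (using `requests`; the real module's limits come from the configuration):

```python
import time
import requests

def fetch(url, retries=3, timeout=10, user_agent="MyBot/1.0"):
    """Illustrative fetch with timeout and exponential backoff."""
    headers = {"User-Agent": user_agent}
    for attempt in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=timeout)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            time.sleep(2 ** attempt)  # backoff: 1s, 2s, 4s
    return None                       # give up after the last retry
```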

### HTML Parser

Extracts URLs and metadata from web pages.
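
Link extraction can be sketched with the standard library's `html.parser` (the project may use a different parsing library; this only shows the shape of the operation):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

# Usage:
#   extractor = LinkExtractor("https://example.com")
#   extractor.feed(html_text)
#   print(extractor.links)
```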

### Crawler

The main component that orchestrates the crawling process.

## Troubleshooting

### MongoDB Errors

If you see duplicate key errors:

```
ERROR: Error saving seed URL to database: E11000 duplicate key error
```

Clean MongoDB collections:

```bash
cd 4_web_crawler
python mongo_cleanup.py
```

### Redis Connection Issues

If the crawler can't connect to Redis:

1. Check if Redis is running:
   ```bash
   sudo systemctl status redis-server
   ```

2. Verify Redis connection:
   ```bash
   redis-cli ping
   ```

### Performance Issues

If the crawler is running slowly:

1. Increase worker threads in `local_config.py`:
   ```python
   MAX_WORKERS = 8
   ```

2. Adjust the politeness delay:
   ```python
   DELAY_BETWEEN_REQUESTS = 0.5  # Half-second delay
   ```

3. Optimize DNS caching:
   ```python
   DNS_CACHE_SIZE = 10000
   DNS_CACHE_TTL = 7200  # 2 hours
   ```

### Crawler Not Starting

If the crawler won't start:

1. Check for MongoDB connection:
   ```bash
   mongo --eval "db.version()"
   ```

2. Ensure Redis is running:
   ```bash
   redis-cli info
   ```

3. Look for error messages in the logs:
   ```bash
   cat crawler.log
   ```

## Configuration Reference

Key configurations in `config.py` or `local_config.py`:

```python
# General settings
MAX_WORKERS = 4            # Number of worker threads
MAX_DEPTH = 3              # Maximum crawl depth
SEED_URLS = ["https://example.com"]  # Initial URLs

# Politeness settings
RESPECT_ROBOTS_TXT = True  # Whether to respect robots.txt
USER_AGENT = "MyBot/1.0"   # User agent for requests
DELAY_BETWEEN_REQUESTS = 1 # Delay between requests to the same domain

# Storage settings
MONGODB_URI = "mongodb://localhost:27017/"
MONGODB_DB = "crawler" 

# DNS settings
DNS_CACHE_SIZE = 10000
DNS_CACHE_TTL = 3600       # 1 hour

# Logging settings
LOG_LEVEL = "INFO"
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
```