title: AI_SEO_Crawler
app_file: seo_analyzer_ui.py
sdk: gradio
sdk_version: 5.30.0
Web Crawler Documentation
A scalable web crawler with configurable behavior, politeness controls, and content extraction capabilities.
Table of Contents
- Architecture
- Setup
- Usage
- Components
- Troubleshooting
- Configuration Reference
Architecture
The web crawler consists of the following key components; a minimal sketch of how they fit together follows the list:
- URL Frontier: Manages URLs to be crawled with prioritization
- DNS Resolver: Caches DNS lookups to improve performance
- Robots Handler: Ensures compliance with robots.txt
- HTML Downloader: Downloads web pages with error handling
- HTML Parser: Extracts URLs and metadata from web pages
- Storage: MongoDB for storage of URLs and metadata
- Crawler: Main crawler orchestration
- API: REST API for controlling the crawler
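The sketch below is a simplified, in-memory illustration of how these components might interact in the main crawl loop. It is not the project's actual code: it stands in a deque for the Redis-backed frontier, a dict for MongoDB storage, and standard-library parsing for the real parser.
# Illustrative crawl loop only; class/function names are assumptions, not the project API.
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib import robotparser
from html.parser import HTMLParser
import requests

class LinkExtractor(HTMLParser):
    """HTML Parser stand-in: collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_depth=3, user_agent="MyBot/1.0"):
    frontier = deque((url, 0) for url in seed_urls)   # URL Frontier stand-in
    seen = set(seed_urls)                             # dedup / storage stand-in
    robots_cache = {}                                 # Robots Handler cache
    pages = {}

    while frontier:
        url, depth = frontier.popleft()
        host = urlparse(url).netloc

        # Robots Handler: fetch and cache robots.txt per host
        if host not in robots_cache:
            rp = robotparser.RobotFileParser(f"https://{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                rp = None
            robots_cache[host] = rp
        rp = robots_cache[host]
        if rp and not rp.can_fetch(user_agent, url):
            continue

        # HTML Downloader: fetch with a timeout and basic error handling
        try:
            resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue
        pages[url] = resp.text

        # HTML Parser: extract links and enqueue unseen URLs
        if depth < max_depth:
            extractor = LinkExtractor()
            extractor.feed(resp.text)
            for link in extractor.links:
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    frontier.append((absolute, depth + 1))
    return pages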
Setup
Requirements
- Python 3.8+
- MongoDB
- Redis server
Installation
Install MongoDB:
# For Ubuntu
sudo apt-get install -y mongodb
sudo systemctl start mongodb
sudo systemctl enable mongodb
# Verify MongoDB is running
sudo systemctl status mongodb
Install Redis:
sudo apt-get install redis-server
sudo systemctl start redis-server
# Verify Redis is running
redis-cli ping  # Should return PONG
Install Python dependencies:
pip install -r requirements.txt
Create a local configuration file:
cp config.py local_config.py
Edit local_config.py to customize settings:
# Example configuration
SEED_URLS = ["https://example.com"]  # Start URLs
MAX_DEPTH = 3                        # Crawl depth
MAX_WORKERS = 4                      # Number of worker threads
DELAY_BETWEEN_REQUESTS = 1           # Politeness delay
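One plausible way the crawler could pick up these overrides (this is a hedged guess at the mechanism, not confirmed from the source) is to prefer local_config.py when it exists and fall back to config.py otherwise:
# Hypothetical settings loader: use local_config.py if present, else config.py.
try:
    import local_config as settings
except ImportError:
    import config as settings

print(settings.SEED_URLS, settings.MAX_DEPTH, settings.MAX_WORKERS)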
Usage
Running the Crawler
To run the crawler with default settings:
cd 4_web_crawler
python run_crawler.py
To specify custom seed URLs:
python run_crawler.py --seed https://example.com https://another-site.com
To limit crawl depth:
python run_crawler.py --depth 2
To run with more worker threads:
python run_crawler.py --workers 8
Sample Commands
Here are some common use cases with sample commands:
Crawl a Single Domain
This command crawls only example.com, not following external links:
python run_crawler.py --seed example.com --domain-filter example.com
Fresh Start (Reset Database)
This clears both MongoDB and Redis before starting, which resolves duplicate key errors:
python run_crawler.py --seed example.com --reset-db
Custom Speed and Depth
Control the crawler's speed and depth:
python run_crawler.py --seed example.com --depth 3 --workers 4 --delay 0.5
Crawl Multiple Sites
Crawl multiple websites at once:
python run_crawler.py --seed example.com blog.example.org docs.example.com
Ignore robots.txt Rules
Use with caution, as this ignores website crawling policies:
python run_crawler.py --seed example.com --ignore-robots
Set Custom User Agent
Identify the crawler with a specific user agent:
python run_crawler.py --seed example.com --user-agent "MyCustomBot/1.0"
Crawl sagarnildas.com
To specifically crawl sagarnildas.com with optimal settings:
python run_crawler.py --seed sagarnildas.com --domain-filter sagarnildas.com --reset-db --workers 2 --depth 3 --verbose
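The flags used in the commands above suggest a CLI along these lines. The sketch below uses argparse to show how such options might be parsed before overriding the config defaults; the argument names mirror the commands shown, but the defaults and internals are assumptions.
# Hypothetical sketch of how run_crawler.py's flags might be parsed.
import argparse

parser = argparse.ArgumentParser(description="Run the web crawler")
parser.add_argument("--seed", nargs="+", default=["https://example.com"],
                    help="One or more seed URLs")
parser.add_argument("--depth", type=int, default=3, help="Maximum crawl depth")
parser.add_argument("--workers", type=int, default=4, help="Number of worker threads")
parser.add_argument("--delay", type=float, default=1.0,
                    help="Politeness delay between requests (seconds)")
parser.add_argument("--domain-filter", help="Restrict crawling to this domain")
parser.add_argument("--reset-db", action="store_true",
                    help="Clear MongoDB and Redis before starting")
parser.add_argument("--ignore-robots", action="store_true",
                    help="Skip robots.txt checks")
parser.add_argument("--user-agent", default="MyBot/1.0", help="User agent string")
parser.add_argument("--verbose", action="store_true", help="Verbose logging")
args = parser.parse_args()
print(args)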
Using the API
The crawler provides a REST API for control and monitoring:
cd 4_web_crawler
python api.py
The API will be available at http://localhost:8000
API Endpoints
- GET /status - Get crawler status
- GET /stats - Get detailed statistics
- POST /start - Start the crawler
- POST /stop - Stop the crawler
- POST /seed - Add seed URLs
- GET /pages - List crawled pages
- GET /urls - List discovered URLs
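As a quick illustration, these endpoints can be driven from Python with requests. The endpoint paths come from the list above, but the request/response payload shapes (for example the /seed body) are assumptions.
# Minimal API client sketch; /seed payload format is an assumption.
import requests

BASE = "http://localhost:8000"

# Add seed URLs (payload shape assumed)
requests.post(f"{BASE}/seed", json={"urls": ["https://example.com"]})

# Start the crawler
requests.post(f"{BASE}/start")

# Poll status and statistics
print(requests.get(f"{BASE}/status").json())
print(requests.get(f"{BASE}/stats").json())

# Stop the crawler when done
requests.post(f"{BASE}/stop")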
Checking Results
Monitor the crawler through:
Console output:
tail -f crawler.log
MongoDB collections:
# Start mongo shell
mongo
# Switch to crawler database
use crawler
# Count discovered URLs
db.urls.count()
# View crawled pages
db.pages.find().limit(5)
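The same checks can be run from Python with pymongo. The database and collection names come from the shell example above; the field names in the print are assumptions.
# Query the crawler's MongoDB collections from Python.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["crawler"]

print("Discovered URLs:", db.urls.count_documents({}))
for page in db.pages.find().limit(5):
    print(page.get("url"), page.get("title"))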
API statistics:
curl http://localhost:8000/stats
Components
The crawler has several key components that work together:
URL Frontier
Manages the queue of URLs to be crawled with priority-based scheduling.
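A hedged sketch of the idea: a priority queue combined with per-domain politeness tracking. The real frontier may be backed by Redis (the project requires a Redis server); this in-memory version only illustrates the scheduling concept.
# Illustrative priority-based URL frontier (in-memory; not the project's implementation).
import heapq
import itertools
import time
from urllib.parse import urlparse

class URLFrontier:
    def __init__(self, delay_between_requests=1.0):
        self._heap = []                     # (priority, tie-breaker, url)
        self._counter = itertools.count()   # stable ordering for equal priorities
        self._last_fetch = {}               # domain -> last fetch timestamp
        self._delay = delay_between_requests

    def push(self, url, priority=0):
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        """Return the best-priority URL whose domain is not in its politeness cooldown."""
        deferred, url = [], None
        while self._heap:
            priority, count, candidate = heapq.heappop(self._heap)
            domain = urlparse(candidate).netloc or candidate
            if time.time() - self._last_fetch.get(domain, 0) >= self._delay:
                self._last_fetch[domain] = time.time()
                url = candidate
                break
            deferred.append((priority, count, candidate))
        for item in deferred:
            heapq.heappush(self._heap, item)   # put deferred URLs back
        return url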
DNS Resolver
Caches DNS lookups to improve performance and reduce load on DNS servers.
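An illustrative version of such a cache, mirroring the DNS_CACHE_SIZE and DNS_CACHE_TTL settings from the configuration reference; the actual resolver component may differ.
# TTL-bounded DNS cache sketch with simple oldest-entry eviction.
import socket
import time
from collections import OrderedDict

class DNSResolver:
    def __init__(self, cache_size=10000, cache_ttl=3600):
        self._cache = OrderedDict()   # hostname -> (ip, expiry timestamp)
        self._size = cache_size
        self._ttl = cache_ttl

    def resolve(self, hostname):
        entry = self._cache.get(hostname)
        if entry and entry[1] > time.time():
            return entry[0]                      # cache hit, still fresh
        ip = socket.gethostbyname(hostname)      # cache miss: real lookup
        self._cache[hostname] = (ip, time.time() + self._ttl)
        self._cache.move_to_end(hostname)
        if len(self._cache) > self._size:
            self._cache.popitem(last=False)      # evict the oldest entry
        return ip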
Robots Handler
Ensures compliance with robots.txt rules to be a good web citizen.
HTML Downloader
Downloads web pages with error handling, timeouts, and retries.
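A minimal sketch of that behavior using requests with a retry-enabled session; the retry counts and back-off values are illustrative assumptions, not the project's settings.
# Downloader sketch: timeouts plus automatic retries on transient errors.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session(user_agent="MyBot/1.0", retries=3):
    session = requests.Session()
    session.headers["User-Agent"] = user_agent
    retry = Retry(total=retries, backoff_factor=0.5,
                  status_forcelist=[429, 500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

def download(session, url, timeout=10):
    try:
        resp = session.get(url, timeout=timeout)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return None   # caller decides whether to log or requeue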
HTML Parser
Extracts URLs and metadata from web pages.
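For example, link and metadata extraction could look like the sketch below. It assumes beautifulsoup4 is available, which is not confirmed by the source; the project's actual parser and output fields may differ.
# Illustrative URL and metadata extraction with BeautifulSoup.
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def parse(base_url, html):
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else None
    desc_tag = soup.find("meta", attrs={"name": "description"})
    description = desc_tag.get("content") if desc_tag else None
    links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
    return {"title": title, "description": description, "links": links}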
Crawler
The main component that orchestrates the crawling process.
Troubleshooting
MongoDB Errors
If you see duplicate key errors:
ERROR: Error saving seed URL to database: E11000 duplicate key error
Clean MongoDB collections:
cd 4_web_crawler
python mongo_cleanup.py
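If the cleanup script is unavailable, an equivalent reset can be done with pymongo. What mongo_cleanup.py actually does is an assumption; this sketch simply drops the collections named in the MongoDB examples above.
# Drop the crawler's collections to clear duplicate-key conflicts.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["crawler"]
db.urls.drop()
db.pages.drop()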
Redis Connection Issues
If the crawler can't connect to Redis:
Check if Redis is running:
sudo systemctl status redis-server
Verify Redis connection:
redis-cli ping
Performance Issues
If the crawler is running slowly:
Increase worker threads in local_config.py:
MAX_WORKERS = 8
Adjust the politeness delay:
DELAY_BETWEEN_REQUESTS = 0.5 # Half-second delay
Optimize DNS caching:
DNS_CACHE_SIZE = 10000
DNS_CACHE_TTL = 7200  # 2 hours
Crawler Not Starting
If the crawler won't start:
Check for MongoDB connection:
mongo --eval "db.version()"
Ensure Redis is running:
redis-cli info
Look for error messages in the logs:
cat crawler.log
Configuration Reference
Key configurations in config.py
or local_config.py
:
# General settings
MAX_WORKERS = 4 # Number of worker threads
MAX_DEPTH = 3 # Maximum crawl depth
SEED_URLS = ["https://example.com"] # Initial URLs
# Politeness settings
RESPECT_ROBOTS_TXT = True # Whether to respect robots.txt
USER_AGENT = "MyBot/1.0" # User agent for requests
DELAY_BETWEEN_REQUESTS = 1 # Delay between requests to the same domain
# Storage settings
MONGODB_URI = "mongodb://localhost:27017/"
MONGODB_DB = "crawler"
# DNS settings
DNS_CACHE_SIZE = 10000
DNS_CACHE_TTL = 3600 # 1 hour
# Logging settings
LOG_LEVEL = "INFO"
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"