---
title: AI_SEO_Crawler
app_file: seo_analyzer_ui.py
sdk: gradio
sdk_version: 5.30.0
---

Web Crawler Documentation

A scalable, configurable web crawler with politeness controls and content extraction capabilities.

Table of Contents

  • Architecture
  • Setup
  • Usage
  • Components
  • Troubleshooting
  • Configuration Reference

Architecture

The web crawler consists of the following key components; a minimal sketch of how they fit together follows the list:

  1. URL Frontier: Manages URLs to be crawled with prioritization
  2. DNS Resolver: Caches DNS lookups to improve performance
  3. Robots Handler: Ensures compliance with robots.txt
  4. HTML Downloader: Downloads web pages with error handling
  5. HTML Parser: Extracts URLs and metadata from web pages
  6. Storage: MongoDB persistence for crawled URLs and page metadata
  7. Crawler: Main crawler orchestration
  8. API: REST API for controlling the crawler
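
The snippet below is a rough, hypothetical illustration of how these components cooperate in a single crawl loop: a frontier queue feeds URLs through a robots.txt check, a downloader, and a parser that pushes newly discovered links back into the frontier. Class and function names are placeholders, not the project's actual API.

import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_depth=3, user_agent="MyBot/1.0"):
    """Toy breadth-first crawl loop: frontier -> robots check -> download -> parse."""
    frontier = deque((url, 0) for url in seed_urls)   # URL Frontier (no prioritization here)
    seen = set(seed_urls)
    robots = {}                                       # Robots Handler: one parser per host

    while frontier:
        url, depth = frontier.popleft()
        host = urlparse(url).netloc

        # Robots Handler: fetch and cache robots.txt for this host
        if host not in robots:
            parser = urllib.robotparser.RobotFileParser()
            parser.set_url(f"https://{host}/robots.txt")
            try:
                parser.read()
            except OSError:
                parser = None                         # treat unreachable robots.txt as "allow"
            robots[host] = parser
        if robots[host] and not robots[host].can_fetch(user_agent, url):
            continue

        # HTML Downloader: fetch the page with a timeout and basic error handling
        try:
            response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
        except requests.RequestException:
            continue

        # HTML Parser: extract out-links and push unseen ones back into the frontier
        if depth < max_depth:
            soup = BeautifulSoup(response.text, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))

        print(f"crawled {url} (depth {depth})")

if __name__ == "__main__":
    crawl(["https://example.com"], max_depth=1)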

Setup

Requirements

  • Python 3.8+
  • MongoDB
  • Redis server

Installation

  1. Install MongoDB:

    # For Ubuntu
    sudo apt-get install -y mongodb
    sudo systemctl start mongodb
    sudo systemctl enable mongodb
    
    # Verify MongoDB is running
    sudo systemctl status mongodb
    
  2. Install Redis:

    sudo apt-get install redis-server
    sudo systemctl start redis-server
    
    # Verify Redis is running
    redis-cli ping  # Should return PONG
    
  3. Install Python dependencies:

    pip install -r requirements.txt
    
  4. Create a local configuration file:

    cp config.py local_config.py
    
  5. Edit local_config.py to customize settings:

    # Example configuration
    SEED_URLS = ["https://example.com"]  # Start URLs
    MAX_DEPTH = 3                        # Crawl depth
    MAX_WORKERS = 4                      # Number of worker threads
    DELAY_BETWEEN_REQUESTS = 1           # Politeness delay in seconds
    

Usage

Running the Crawler

To run the crawler with default settings:

cd 4_web_crawler
python run_crawler.py

To specify custom seed URLs:

python run_crawler.py --seed https://example.com https://another-site.com

To limit crawl depth:

python run_crawler.py --depth 2

To run with more worker threads:

python run_crawler.py --workers 8
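
Internally, flags like these are typically wired up with argparse. The sketch below mirrors the options shown in this README, but it is only an assumption about how run_crawler.py defines them; consult the script for the authoritative flag list and defaults.

import argparse

def parse_args():
    """Hypothetical CLI wiring for the flags shown above; run_crawler.py may differ."""
    parser = argparse.ArgumentParser(description="Run the web crawler")
    parser.add_argument("--seed", nargs="+", default=["https://example.com"],
                        help="One or more seed URLs to start crawling from")
    parser.add_argument("--depth", type=int, default=3, help="Maximum crawl depth")
    parser.add_argument("--workers", type=int, default=4, help="Number of worker threads")
    parser.add_argument("--delay", type=float, default=1.0,
                        help="Politeness delay in seconds between requests to the same domain")
    parser.add_argument("--domain-filter", help="Restrict crawling to this domain")
    parser.add_argument("--reset-db", action="store_true", help="Clear MongoDB and Redis first")
    parser.add_argument("--ignore-robots", action="store_true", help="Do not honor robots.txt")
    parser.add_argument("--user-agent", default="MyBot/1.0", help="User agent string")
    parser.add_argument("--verbose", action="store_true", help="Enable verbose logging")
    return parser.parse_args()

if __name__ == "__main__":
    print(parse_args())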

Sample Commands

Here are some common use cases with sample commands:

Crawl a Single Domain

This command crawls only example.com and does not follow links to other domains:

python run_crawler.py --seed example.com --domain-filter example.com

Fresh Start (Reset Database)

This clears both MongoDB and Redis before starting, which resolves duplicate key errors left over from previous runs:

python run_crawler.py --seed example.com --reset-db

Custom Speed and Depth

Control the crawler's speed and depth:

python run_crawler.py --seed example.com --depth 3 --workers 4 --delay 0.5

Crawl Multiple Sites

Crawl multiple websites at once:

python run_crawler.py --seed example.com blog.example.org docs.example.com

Ignore robots.txt Rules

Use with caution, as this ignores website crawling policies:

python run_crawler.py --seed example.com --ignore-robots

Set Custom User Agent

Identify the crawler with a specific user agent:

python run_crawler.py --seed example.com --user-agent "MyCustomBot/1.0"

Crawl sagarnildas.com

To specifically crawl sagarnildas.com with optimal settings:

python run_crawler.py --seed sagarnildas.com --domain-filter sagarnildas.com --reset-db --workers 2 --depth 3 --verbose

Using the API

The crawler provides a REST API for control and monitoring; a Python usage sketch follows the endpoint list below:

cd 4_web_crawler
python api.py

The API will be available at http://localhost:8000

API Endpoints

  • GET /status - Get crawler status
  • GET /stats - Get detailed statistics
  • POST /start - Start the crawler
  • POST /stop - Stop the crawler
  • POST /seed - Add seed URLs
  • GET /pages - List crawled pages
  • GET /urls - List discovered URLs
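
The same endpoints can be driven from Python with the requests library. The sketch below assumes the API is running on localhost:8000 and that POST /seed accepts a JSON body with a list of URLs; check api.py for the exact request and response schemas.

import requests

API = "http://localhost:8000"

# Add seed URLs, start the crawler, then poll its status and statistics.
# The payload shape for /seed is an assumption; see api.py for the real schema.
requests.post(f"{API}/seed", json={"urls": ["https://example.com"]}, timeout=10)
requests.post(f"{API}/start", timeout=10)

status = requests.get(f"{API}/status", timeout=10).json()
stats = requests.get(f"{API}/stats", timeout=10).json()
print("status:", status)
print("stats:", stats)

# Stop the crawler when done.
requests.post(f"{API}/stop", timeout=10)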

Checking Results

Monitor the crawler through the following (a pymongo alternative to the mongo shell checks is sketched after the list):

  1. Log file output:

    tail -f crawler.log
    
  2. MongoDB collections:

    # Start mongo shell
    mongo
    
    # Switch to crawler database
    use crawler
    
    # Count discovered URLs
    db.urls.count()
    
    # View crawled pages
    db.pages.find().limit(5)
    
  3. API statistics:

    curl http://localhost:8000/stats
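
The MongoDB checks above can also be run from Python with pymongo, assuming the default connection settings listed in the Configuration Reference (database crawler, collections urls and pages). The field names printed below are assumptions; inspect a stored document to see the actual schema.

from pymongo import MongoClient

# Connect using the default URI and database name from the Configuration Reference.
client = MongoClient("mongodb://localhost:27017/")
db = client["crawler"]

# Count discovered URLs and preview a few crawled pages.
print("discovered URLs:", db.urls.count_documents({}))
for page in db.pages.find().limit(5):
    # "url" and "title" are assumed field names; print the whole document to see the schema.
    print(page.get("url"), "-", page.get("title"))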
    

Components

The crawler has several key components that work together:

URL Frontier

Manages the queue of URLs to be crawled with priority-based scheduling.
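
As a minimal illustration of priority-based scheduling (not the project's actual implementation, which also persists state), a frontier can be a heap keyed by priority with a seen-set for deduplication:

import heapq
import itertools

class URLFrontier:
    """Toy priority frontier: lower priority numbers are crawled first."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()   # tie-breaker keeps insertion order stable
        self._seen = set()

    def add(self, url, priority=0):
        if url not in self._seen:         # deduplicate before queueing
            self._seen.add(url)
            heapq.heappush(self._heap, (priority, next(self._order), url))

    def pop(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

frontier = URLFrontier()
frontier.add("https://example.com/about", priority=5)
frontier.add("https://example.com", priority=1)
print(frontier.pop())   # https://example.com comes out first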

DNS Resolver

Caches DNS lookups to improve performance and reduce load on DNS servers.
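
A simple TTL cache captures the idea; the real resolver's behavior is controlled by DNS_CACHE_SIZE and DNS_CACHE_TTL in the configuration, and this sketch is illustrative only:

import socket
import time

class CachingDNSResolver:
    """Resolve hostnames to IP addresses, caching each answer for a fixed TTL."""

    def __init__(self, ttl=3600):
        self.ttl = ttl
        self._cache = {}   # hostname -> (ip, expiry timestamp)

    def resolve(self, hostname):
        cached = self._cache.get(hostname)
        if cached and cached[1] > time.time():
            return cached[0]                        # fresh cache hit
        ip = socket.gethostbyname(hostname)         # cache miss: real DNS lookup
        self._cache[hostname] = (ip, time.time() + self.ttl)
        return ip

resolver = CachingDNSResolver(ttl=3600)
print(resolver.resolve("example.com"))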

Robots Handler

Ensures compliance with robots.txt rules to be a good web citizen.
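
Python's standard library handles the parsing side; a minimal sketch with urllib.robotparser (the project's handler may add per-host caching and other conveniences):

import urllib.robotparser

# Fetch and parse robots.txt, then ask whether a given URL may be crawled.
parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

print("allowed:", parser.can_fetch("MyBot/1.0", "https://example.com/some/page"))
# crawl_delay() returns the Crawl-delay directive for this agent, or None if absent.
print("crawl delay:", parser.crawl_delay("MyBot/1.0"))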

HTML Downloader

Downloads web pages with error handling, timeouts, and retries.
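
A minimal downloader sketch using requests with a retrying session; the timeout and retry values here are illustrative rather than the project's configured defaults:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session(user_agent="MyBot/1.0", retries=3):
    """Session that retries transient failures with exponential backoff."""
    session = requests.Session()
    session.headers["User-Agent"] = user_agent
    retry = Retry(total=retries, backoff_factor=0.5,
                  status_forcelist=[429, 500, 502, 503, 504])
    session.mount("http://", HTTPAdapter(max_retries=retry))
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

def download(session, url, timeout=10):
    """Return the page body as text, or None if the request ultimately fails."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return None

html = download(build_session(), "https://example.com")
print(len(html or ""), "characters downloaded")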

HTML Parser

Extracts URLs and metadata from web pages.
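
A minimal extraction sketch with BeautifulSoup; the metadata fields the project's parser actually stores may differ:

from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def parse(base_url, html):
    """Extract absolute out-links plus basic page metadata."""
    soup = BeautifulSoup(html, "html.parser")
    links = {urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)}
    meta_desc = soup.find("meta", attrs={"name": "description"})
    return {
        "title": soup.title.string.strip() if soup.title and soup.title.string else None,
        "description": meta_desc.get("content") if meta_desc else None,
        "links": sorted(links),
    }

html = requests.get("https://example.com", timeout=10).text
print(parse("https://example.com", html))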

Crawler

The main component that orchestrates the crawling process.

Troubleshooting

MongoDB Errors

If you see duplicate key errors:

ERROR: Error saving seed URL to database: E11000 duplicate key error

Clean MongoDB collections:

cd 4_web_crawler
python mongo_cleanup.py
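
mongo_cleanup.py is the supported way to do this. If you prefer to clean up manually from Python, a minimal pymongo equivalent (assuming the default database name crawler and the collections referenced elsewhere in this README) looks like this:

from pymongo import MongoClient

# Drop the crawler's collections so the next run starts from a clean slate.
client = MongoClient("mongodb://localhost:27017/")
db = client["crawler"]
for name in ("urls", "pages"):
    db[name].drop()
    print("dropped collection:", name)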

Redis Connection Issues

If the crawler can't connect to Redis (a Python connectivity check is sketched after these steps):

  1. Check if Redis is running:

    sudo systemctl status redis-server
    
  2. Verify Redis connection:

    redis-cli ping
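
The same connectivity check can be scripted from Python with the redis package, assuming Redis on its default host and port:

import redis

# ping() returns True when the server is reachable on the default host and port.
client = redis.Redis(host="localhost", port=6379)
print("Redis reachable:", client.ping())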
    

Performance Issues

If the crawler is running slowly:

  1. Increase worker threads in local_config.py:

    MAX_WORKERS = 8
    
  2. Adjust the politeness delay:

    DELAY_BETWEEN_REQUESTS = 0.5  # Half-second delay
    
  3. Optimize DNS caching:

    DNS_CACHE_SIZE = 10000
    DNS_CACHE_TTL = 7200  # 2 hours
    

Crawler Not Starting

If the crawler won't start:

  1. Check for MongoDB connection:

    mongo --eval "db.version()"
    
  2. Ensure Redis is running:

    redis-cli info
    
  3. Look for error messages in the logs:

    cat crawler.log
    

Configuration Reference

Key configuration options in config.py or local_config.py:

# General settings
MAX_WORKERS = 4            # Number of worker threads
MAX_DEPTH = 3              # Maximum crawl depth
SEED_URLS = ["https://example.com"]  # Initial URLs

# Politeness settings
RESPECT_ROBOTS_TXT = True  # Whether to respect robots.txt
USER_AGENT = "MyBot/1.0"   # User agent for requests
DELAY_BETWEEN_REQUESTS = 1 # Delay in seconds between requests to the same domain

# Storage settings
MONGODB_URI = "mongodb://localhost:27017/"
MONGODB_DB = "crawler" 

# DNS settings
DNS_CACHE_SIZE = 10000
DNS_CACHE_TTL = 3600       # 1 hour

# Logging settings
LOG_LEVEL = "INFO"
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
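
How these values are loaded is defined by the project itself; as a hedged illustration of one common pattern, defaults from config.py can be overlaid with anything defined in local_config.py:

# settings_loader.py -- illustrative only; the project's actual loading logic may differ.
import config

try:
    import local_config
    # Overlay every uppercase setting from local_config onto the defaults in config.
    for name in dir(local_config):
        if name.isupper():
            setattr(config, name, getattr(local_config, name))
except ImportError:
    pass   # no local overrides present

print(config.MAX_WORKERS, config.MAX_DEPTH)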