title: AI_SEO_Crawler
app_file: seo_analyzer_ui.py
sdk: gradio
sdk_version: 5.30.0
Web Crawler Documentation
A scalable web crawler with configurable behavior, politeness controls, and content extraction capabilities.
Table of Contents
- Architecture
- Setup
- Usage
- Components
- Troubleshooting
- Configuration Reference
Architecture
The web crawler consists of the following key components; a minimal sketch of how they fit together follows the list:
- URL Frontier: Manages URLs to be crawled with prioritization
- DNS Resolver: Caches DNS lookups to improve performance
- Robots Handler: Ensures compliance with robots.txt
- HTML Downloader: Downloads web pages with error handling
- HTML Parser: Extracts URLs and metadata from web pages
- Storage: MongoDB for storage of URLs and metadata
- Crawler: Main crawler orchestration
- API: REST API for controlling the crawler
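The sketch below is a simplified, in-memory illustration of how these components might interact in the main crawl loop. It is not the project's actual code: it stands in a deque for the Redis-backed frontier, a dict for MongoDB storage, and standard-library parsing for the real parser.
# Illustrative crawl loop only; class/function names are assumptions, not the project API.
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib import robotparser
from html.parser import HTMLParser
import requests

class LinkExtractor(HTMLParser):
    """HTML Parser stand-in: collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, max_depth=3, user_agent="MyBot/1.0"):
    frontier = deque((url, 0) for url in seed_urls)   # URL Frontier stand-in
    seen = set(seed_urls)                             # dedup / storage stand-in
    robots_cache = {}                                 # Robots Handler cache
    pages = {}

    while frontier:
        url, depth = frontier.popleft()
        host = urlparse(url).netloc

        # Robots Handler: fetch and cache robots.txt per host
        if host not in robots_cache:
            rp = robotparser.RobotFileParser(f"https://{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                rp = None
            robots_cache[host] = rp
        rp = robots_cache[host]
        if rp and not rp.can_fetch(user_agent, url):
            continue

        # HTML Downloader: fetch with a timeout and basic error handling
        try:
            resp = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue
        pages[url] = resp.text

        # HTML Parser: extract links and enqueue unseen URLs
        if depth < max_depth:
            extractor = LinkExtractor()
            extractor.feed(resp.text)
            for link in extractor.links:
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    frontier.append((absolute, depth + 1))
    return pages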
Setup
Requirements
- Python 3.8+
- MongoDB
- Redis server
Installation
Install MongoDB:
# For Ubuntu
sudo apt-get install -y mongodb
sudo systemctl start mongodb
sudo systemctl enable mongodb
# Verify MongoDB is running
sudo systemctl status mongodb
Install Redis:
sudo apt-get install redis-server
sudo systemctl start redis-server
# Verify Redis is running
redis-cli ping  # Should return PONG
Install Python dependencies:
pip install -r requirements.txt
Create a local configuration file:
cp config.py local_config.py
Edit local_config.py to customize settings:
# Example configuration
SEED_URLS = ["https://example.com"]  # Start URLs
MAX_DEPTH = 3                        # Crawl depth
MAX_WORKERS = 4                      # Number of worker threads
DELAY_BETWEEN_REQUESTS = 1           # Politeness delay
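One plausible way the crawler could pick up these overrides (this is a hedged guess at the mechanism, not confirmed from the source) is to prefer local_config.py when it exists and fall back to config.py otherwise:
# Hypothetical settings loader: use local_config.py if present, else config.py.
try:
    import local_config as settings
except ImportError:
    import config as settings

print(settings.SEED_URLS, settings.MAX_DEPTH, settings.MAX_WORKERS)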
Usage
Running the Crawler
To run the crawler with default settings:
cd 4_web_crawler
python run_crawler.py
To specify custom seed URLs:
python run_crawler.py --seed https://example.com https://another-site.com
To limit crawl depth:
python run_crawler.py --depth 2
To run with more worker threads:
python run_crawler.py --workers 8
Sample Commands
Here are some common use cases with sample commands:
Crawl a Single Domain
This command crawls only example.com, not following external links:
python run_crawler.py --seed example.com --domain-filter example.com
Fresh Start (Reset Database)
This clears both MongoDB and Redis before starting, which resolves duplicate key errors:
python run_crawler.py --seed example.com --reset-db
Custom Speed and Depth
Control the crawler's speed and depth:
python run_crawler.py --seed example.com --depth 3 --workers 4 --delay 0.5
Crawl Multiple Sites
Crawl multiple websites at once:
python run_crawler.py --seed example.com blog.example.org docs.example.com
Ignore robots.txt Rules
Use with caution, as this ignores website crawling policies:
python run_crawler.py --seed example.com --ignore-robots
Set Custom User Agent
Identify the crawler with a specific user agent:
python run_crawler.py --seed example.com --user-agent "MyCustomBot/1.0"
Crawl sagarnildas.com
To specifically crawl sagarnildas.com with optimal settings:
python run_crawler.py --seed sagarnildas.com --domain-filter sagarnildas.com --reset-db --workers 2 --depth 3 --verbose
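The flags used in the commands above suggest a CLI along these lines. The sketch below uses argparse to show how such options might be parsed before overriding the config defaults; the argument names mirror the commands shown, but the defaults and internals are assumptions.
# Hypothetical sketch of how run_crawler.py's flags might be parsed.
import argparse

parser = argparse.ArgumentParser(description="Run the web crawler")
parser.add_argument("--seed", nargs="+", default=["https://example.com"],
                    help="One or more seed URLs")
parser.add_argument("--depth", type=int, default=3, help="Maximum crawl depth")
parser.add_argument("--workers", type=int, default=4, help="Number of worker threads")
parser.add_argument("--delay", type=float, default=1.0,
                    help="Politeness delay between requests (seconds)")
parser.add_argument("--domain-filter", help="Restrict crawling to this domain")
parser.add_argument("--reset-db", action="store_true",
                    help="Clear MongoDB and Redis before starting")
parser.add_argument("--ignore-robots", action="store_true",
                    help="Skip robots.txt checks")
parser.add_argument("--user-agent", default="MyBot/1.0", help="User agent string")
parser.add_argument("--verbose", action="store_true", help="Verbose logging")
args = parser.parse_args()
print(args)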
Using the API
The crawler provides a REST API for control and monitoring:
cd 4_web_crawler
python api.py
The API will be available at http://localhost:8000
API Endpoints
- GET /status - Get crawler status
- GET /stats - Get detailed statistics
- POST /start - Start the crawler
- POST /stop - Stop the crawler
- POST /seed - Add seed URLs
- GET /pages - List crawled pages
- GET /urls - List discovered URLs
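As a quick illustration, these endpoints can be driven from Python with requests. The endpoint paths come from the list above, but the request/response payload shapes (for example the /seed body) are assumptions.
# Minimal API client sketch; /seed payload format is an assumption.
import requests

BASE = "http://localhost:8000"

# Add seed URLs (payload shape assumed)
requests.post(f"{BASE}/seed", json={"urls": ["https://example.com"]})

# Start the crawler
requests.post(f"{BASE}/start")

# Poll status and statistics
print(requests.get(f"{BASE}/status").json())
print(requests.get(f"{BASE}/stats").json())

# Stop the crawler when done
requests.post(f"{BASE}/stop")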
Checking Results
Monitor the crawler through:
Console output:
tail -f crawler.log
MongoDB collections:
# Start mongo shell
mongo
# Switch to crawler database
use crawler
# Count discovered URLs
db.urls.count()
# View crawled pages
db.pages.find().limit(5)
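The same checks can be run from Python with pymongo. The database and collection names come from the shell example above; the field names in the print are assumptions.
# Query the crawler's MongoDB collections from Python.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["crawler"]

print("Discovered URLs:", db.urls.count_documents({}))
for page in db.pages.find().limit(5):
    print(page.get("url"), page.get("title"))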
API statistics:
curl http://localhost:8000/stats
Components
The crawler has several key components that work together:
URL Frontier
Manages the queue of URLs to be crawled with priority-based scheduling.
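A hedged sketch of the idea: a priority queue combined with per-domain politeness tracking. The real frontier may be backed by Redis (the project requires a Redis server); this in-memory version only illustrates the scheduling concept.
# Illustrative priority-based URL frontier (in-memory; not the project's implementation).
import heapq
import itertools
import time
from urllib.parse import urlparse

class URLFrontier:
    def __init__(self, delay_between_requests=1.0):
        self._heap = []                     # (priority, tie-breaker, url)
        self._counter = itertools.count()   # stable ordering for equal priorities
        self._last_fetch = {}               # domain -> last fetch timestamp
        self._delay = delay_between_requests

    def push(self, url, priority=0):
        heapq.heappush(self._heap, (priority, next(self._counter), url))

    def pop(self):
        """Return the best-priority URL whose domain is not in its politeness cooldown."""
        deferred, url = [], None
        while self._heap:
            priority, count, candidate = heapq.heappop(self._heap)
            domain = urlparse(candidate).netloc or candidate
            if time.time() - self._last_fetch.get(domain, 0) >= self._delay:
                self._last_fetch[domain] = time.time()
                url = candidate
                break
            deferred.append((priority, count, candidate))
        for item in deferred:
            heapq.heappush(self._heap, item)   # put deferred URLs back
        return url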
DNS Resolver
Caches DNS lookups to improve performance and reduce load on DNS servers.
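An illustrative version of such a cache, mirroring the DNS_CACHE_SIZE and DNS_CACHE_TTL settings from the configuration reference; the actual resolver component may differ.
# TTL-bounded DNS cache sketch with simple oldest-entry eviction.
import socket
import time
from collections import OrderedDict

class DNSResolver:
    def __init__(self, cache_size=10000, cache_ttl=3600):
        self._cache = OrderedDict()   # hostname -> (ip, expiry timestamp)
        self._size = cache_size
        self._ttl = cache_ttl

    def resolve(self, hostname):
        entry = self._cache.get(hostname)
        if entry and entry[1] > time.time():
            return entry[0]                      # cache hit, still fresh
        ip = socket.gethostbyname(hostname)      # cache miss: real lookup
        self._cache[hostname] = (ip, time.time() + self._ttl)
        self._cache.move_to_end(hostname)
        if len(self._cache) > self._size:
            self._cache.popitem(last=False)      # evict the oldest entry
        return ip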
Robots Handler
Ensures compliance with robots.txt rules to be a good web citizen.
HTML Downloader
Downloads web pages with error handling, timeouts, and retries.
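A minimal sketch of that behavior using requests with a retry-enabled session; the retry counts and back-off values are illustrative assumptions, not the project's settings.
# Downloader sketch: timeouts plus automatic retries on transient errors.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def build_session(user_agent="MyBot/1.0", retries=3):
    session = requests.Session()
    session.headers["User-Agent"] = user_agent
    retry = Retry(total=retries, backoff_factor=0.5,
                  status_forcelist=[429, 500, 502, 503, 504])
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session

def download(session, url, timeout=10):
    try:
        resp = session.get(url, timeout=timeout)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return None   # caller decides whether to log or requeue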
HTML Parser
Extracts URLs and metadata from web pages.
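For example, link and metadata extraction could look like the sketch below. It assumes beautifulsoup4 is available, which is not confirmed by the source; the project's actual parser and output fields may differ.
# Illustrative URL and metadata extraction with BeautifulSoup.
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def parse(base_url, html):
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string.strip() if soup.title and soup.title.string else None
    desc_tag = soup.find("meta", attrs={"name": "description"})
    description = desc_tag.get("content") if desc_tag else None
    links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]
    return {"title": title, "description": description, "links": links}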
Crawler
The main component that orchestrates the crawling process.
Troubleshooting
MongoDB Errors
If you see duplicate key errors:
ERROR: Error saving seed URL to database: E11000 duplicate key error
Clean MongoDB collections:
cd 4_web_crawler
python mongo_cleanup.py
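If the cleanup script is unavailable, an equivalent reset can be done with pymongo. What mongo_cleanup.py actually does is an assumption; this sketch simply drops the collections named in the MongoDB examples above.
# Drop the crawler's collections to clear duplicate-key conflicts.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
db = client["crawler"]
db.urls.drop()
db.pages.drop()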
Redis Connection Issues
If the crawler can't connect to Redis:
Check if Redis is running:
sudo systemctl status redis-server
Verify Redis connection:
redis-cli ping
Performance Issues
If the crawler is running slowly:
Increase worker threads in local_config.py:
MAX_WORKERS = 8
Adjust the politeness delay:
DELAY_BETWEEN_REQUESTS = 0.5 # Half-second delay
Optimize DNS caching:
DNS_CACHE_SIZE = 10000
DNS_CACHE_TTL = 7200  # 2 hours
Crawler Not Starting
If the crawler won't start:
Check for MongoDB connection:
mongo --eval "db.version()"
Ensure Redis is running:
redis-cli info
Look for error messages in the logs:
cat crawler.log
Configuration Reference
Key configurations in config.py
or local_config.py
:
# General settings
MAX_WORKERS = 4 # Number of worker threads
MAX_DEPTH = 3 # Maximum crawl depth
SEED_URLS = ["https://example.com"] # Initial URLs
# Politeness settings
RESPECT_ROBOTS_TXT = True # Whether to respect robots.txt
USER_AGENT = "MyBot/1.0" # User agent for requests
DELAY_BETWEEN_REQUESTS = 1 # Delay between requests to the same domain
# Storage settings
MONGODB_URI = "mongodb://localhost:27017/"
MONGODB_DB = "crawler"
# DNS settings
DNS_CACHE_SIZE = 10000
DNS_CACHE_TTL = 3600 # 1 hour
# Logging settings
LOG_LEVEL = "INFO"
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"