---
title: AI_SEO_Crawler
app_file: seo_analyzer_ui.py
sdk: gradio
sdk_version: 5.30.0
---

# Web Crawler Documentation

A scalable web crawler with configurable crawling, politeness controls, and content extraction.

## Table of Contents

- [Architecture](#architecture)
- [Setup](#setup)
- [Usage](#usage)
- [Components](#components)
- [Troubleshooting](#troubleshooting)

## Architecture

The web crawler consists of the following key components:

1. **URL Frontier**: Manages URLs to be crawled with prioritization
2. **DNS Resolver**: Caches DNS lookups to improve performance
3. **Robots Handler**: Ensures compliance with robots.txt
4. **HTML Downloader**: Downloads web pages with error handling
5. **HTML Parser**: Extracts URLs and metadata from web pages
6. **Storage**: MongoDB-backed storage for URLs and metadata
7. **Crawler**: Main crawler orchestration
8. **API**: REST API for controlling the crawler

## Setup

### Requirements

- Python 3.8+
- MongoDB
- Redis server

### Installation

1. Install MongoDB:

   ```bash
   # For Ubuntu
   sudo apt-get install -y mongodb
   sudo systemctl start mongodb
   sudo systemctl enable mongodb

   # Verify MongoDB is running
   sudo systemctl status mongodb
   ```

2. Install Redis:

   ```bash
   sudo apt-get install redis-server
   sudo systemctl start redis-server

   # Verify Redis is running
   redis-cli ping  # Should return PONG
   ```

3. Install Python dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Create a local configuration file:

   ```bash
   cp config.py local_config.py
   ```

5. Edit `local_config.py` to customize settings:

   ```python
   # Example configuration
   SEED_URLS = ["https://example.com"]  # Start URLs
   MAX_DEPTH = 3                        # Crawl depth
   MAX_WORKERS = 4                      # Number of worker threads
   DELAY_BETWEEN_REQUESTS = 1           # Politeness delay
   ```

## Usage

### Running the Crawler

To run the crawler with default settings:

```bash
cd 4_web_crawler
python run_crawler.py
```

To specify custom seed URLs:

```bash
python run_crawler.py --seed https://example.com https://another-site.com
```

To limit crawl depth:

```bash
python run_crawler.py --depth 2
```

To run with more worker threads:

```bash
python run_crawler.py --workers 8
```

### Sample Commands

Here are some common use cases with sample commands:

#### Crawl a Single Domain

This command crawls only example.com, without following external links:

```bash
python run_crawler.py --seed example.com --domain-filter example.com
```

#### Fresh Start (Reset Database)

This clears both MongoDB and Redis before starting, which resolves duplicate key errors:

```bash
python run_crawler.py --seed example.com --reset-db
```

#### Custom Speed and Depth

Control the crawler's speed and depth:

```bash
python run_crawler.py --seed example.com --depth 3 --workers 4 --delay 0.5
```

#### Crawl Multiple Sites

Crawl multiple websites at once:

```bash
python run_crawler.py --seed example.com blog.example.org docs.example.com
```

#### Ignore robots.txt Rules

Use with caution, as this ignores website crawling policies:

```bash
python run_crawler.py --seed example.com --ignore-robots
```

#### Set Custom User Agent

Identify the crawler with a specific user agent:

```bash
python run_crawler.py --seed example.com --user-agent "MyCustomBot/1.0"
```

#### Crawl sagarnildas.com

To specifically crawl sagarnildas.com with optimal settings:

```bash
python run_crawler.py --seed sagarnildas.com --domain-filter sagarnildas.com --reset-db --workers 2 --depth 3 --verbose
```

### Using the API

The crawler provides a REST API for control and monitoring:

```bash
cd 4_web_crawler
python api.py
```

The API will be available at http://localhost:8000.
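Besides `curl`, the API can be driven from a short Python script; the endpoints it exposes are listed in the next section. The sketch below is illustrative only: the JSON payload shape for `/seed` is an assumption, so check `api.py` for the actual schema.

```python
import requests

BASE_URL = "http://localhost:8000"

# Add a seed URL, start the crawl, then poll statistics.
# NOTE: the JSON body for /seed is a guess; adjust it to match api.py.
requests.post(f"{BASE_URL}/seed", json={"urls": ["https://example.com"]}, timeout=10)
requests.post(f"{BASE_URL}/start", timeout=10)

response = requests.get(f"{BASE_URL}/stats", timeout=10)
print(response.json())
```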
#### API Endpoints

- `GET /status` - Get crawler status
- `GET /stats` - Get detailed statistics
- `POST /start` - Start the crawler
- `POST /stop` - Stop the crawler
- `POST /seed` - Add seed URLs
- `GET /pages` - List crawled pages
- `GET /urls` - List discovered URLs

### Checking Results

Monitor the crawler through:

1. Console output:

   ```bash
   tail -f crawler.log
   ```

2. MongoDB collections:

   ```bash
   # Start mongo shell
   mongo

   # Switch to crawler database
   use crawler

   # Count discovered URLs
   db.urls.count()

   # View crawled pages
   db.pages.find().limit(5)
   ```

3. API statistics:

   ```bash
   curl http://localhost:8000/stats
   ```

## Components

The crawler has several key components that work together:

### URL Frontier

Manages the queue of URLs to be crawled with priority-based scheduling.

### DNS Resolver

Caches DNS lookups to improve performance and reduce load on DNS servers.

### Robots Handler

Ensures compliance with robots.txt rules to be a good web citizen.

### HTML Downloader

Downloads web pages with error handling, timeouts, and retries.

### HTML Parser

Extracts URLs and metadata from web pages.

### Crawler

The main component that orchestrates the crawling process.

## Troubleshooting

### MongoDB Errors

If you see duplicate key errors:

```
ERROR: Error saving seed URL to database: E11000 duplicate key error
```

Clean the MongoDB collections:

```bash
cd 4_web_crawler
python mongo_cleanup.py
```

### Redis Connection Issues

If the crawler can't connect to Redis:

1. Check if Redis is running:

   ```bash
   sudo systemctl status redis-server
   ```

2. Verify the Redis connection:

   ```bash
   redis-cli ping
   ```

### Performance Issues

If the crawler is running slowly:

1. Increase worker threads in `local_config.py`:

   ```python
   MAX_WORKERS = 8
   ```

2. Adjust the politeness delay:

   ```python
   DELAY_BETWEEN_REQUESTS = 0.5  # Half-second delay
   ```

3. Optimize DNS caching:

   ```python
   DNS_CACHE_SIZE = 10000
   DNS_CACHE_TTL = 7200  # 2 hours
   ```

### Crawler Not Starting

If the crawler won't start:

1. Check the MongoDB connection:

   ```bash
   mongo --eval "db.version()"
   ```

2. Ensure Redis is running:

   ```bash
   redis-cli info
   ```

3. Look for error messages in the logs:

   ```bash
   cat crawler.log
   ```

## Configuration Reference

Key configuration settings in `config.py` or `local_config.py`:

```python
# General settings
MAX_WORKERS = 4                      # Number of worker threads
MAX_DEPTH = 3                        # Maximum crawl depth
SEED_URLS = ["https://example.com"]  # Initial URLs

# Politeness settings
RESPECT_ROBOTS_TXT = True   # Whether to respect robots.txt
USER_AGENT = "MyBot/1.0"    # User agent for requests
DELAY_BETWEEN_REQUESTS = 1  # Delay between requests to the same domain

# Storage settings
MONGODB_URI = "mongodb://localhost:27017/"
MONGODB_DB = "crawler"

# DNS settings
DNS_CACHE_SIZE = 10000
DNS_CACHE_TTL = 3600  # 1 hour

# Logging settings
LOG_LEVEL = "INFO"
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
```
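To illustrate how the DNS settings above relate to the DNS Resolver component, here is a minimal sketch of TTL-based DNS caching. It uses the third-party `cachetools` package purely as an example; the crawler's actual resolver implementation may work differently.

```python
import socket

from cachetools import TTLCache, cached

# Mirror the DNS_CACHE_SIZE and DNS_CACHE_TTL settings shown above.
_dns_cache = TTLCache(maxsize=10000, ttl=3600)

@cached(_dns_cache)
def resolve(hostname: str) -> str:
    """Resolve a hostname to an IPv4 address, caching the result for the TTL."""
    return socket.gethostbyname(hostname)

if __name__ == "__main__":
    print(resolve("example.com"))  # first call performs a real DNS lookup
    print(resolve("example.com"))  # repeat call is served from the cache
```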