Spaces:
Sleeping
Sleeping
Upload folder using huggingface_hub
Browse files- .env +1 -0
- Dockerfile +89 -0
- README.md +339 -7
- __pycache__/api.cpython-311.pyc +0 -0
- __pycache__/config.cpython-310.pyc +0 -0
- __pycache__/config.cpython-311.pyc +0 -0
- __pycache__/crawler.cpython-310.pyc +0 -0
- __pycache__/crawler.cpython-311.pyc +0 -0
- __pycache__/dns_resolver.cpython-310.pyc +0 -0
- __pycache__/dns_resolver.cpython-311.pyc +0 -0
- __pycache__/downloader.cpython-310.pyc +0 -0
- __pycache__/downloader.cpython-311.pyc +0 -0
- __pycache__/frontier.cpython-310.pyc +0 -0
- __pycache__/frontier.cpython-311.pyc +0 -0
- __pycache__/local_config.cpython-310.pyc +0 -0
- __pycache__/local_config.cpython-311.pyc +0 -0
- __pycache__/models.cpython-310.pyc +0 -0
- __pycache__/models.cpython-311.pyc +0 -0
- __pycache__/mongo_cleanup.cpython-310.pyc +0 -0
- __pycache__/mongo_cleanup.cpython-311.pyc +0 -0
- __pycache__/parser.cpython-310.pyc +0 -0
- __pycache__/parser.cpython-311.pyc +0 -0
- __pycache__/robots.cpython-310.pyc +0 -0
- __pycache__/robots.cpython-311.pyc +0 -0
- __pycache__/run_crawler.cpython-310.pyc +0 -0
- api.py +588 -0
- cleanup.py +130 -0
- cleanup_all.sh +47 -0
- config.py +96 -0
- crawl.py +370 -0
- crawler.log +0 -0
- crawler.py +908 -0
- deduplication.py +422 -0
- dns_resolver.py +161 -0
- docker-compose.yml +79 -0
- downloader.py +400 -0
- example.py +250 -0
- file_cleanup.py +100 -0
- frontier.py +319 -0
- models.py +167 -0
- mongo_cleanup.py +86 -0
- parser.py +316 -0
- requirements.txt +43 -0
- robots.py +203 -0
- run_crawler.py +237 -0
- seo_analyzer_ui.py +708 -0
- storage.py +888 -0
- test_crawler.py +219 -0
.env
ADDED
@@ -0,0 +1 @@
|
|
|
|
|
1 |
+
DEPLOYMENT=true
|
Dockerfile
ADDED
@@ -0,0 +1,89 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
FROM python:3.10-slim
|
2 |
+
|
3 |
+
# Set working directory
|
4 |
+
WORKDIR /app
|
5 |
+
|
6 |
+
# Install system dependencies
|
7 |
+
RUN apt-get update && apt-get install -y --no-install-recommends \
|
8 |
+
build-essential \
|
9 |
+
wget \
|
10 |
+
curl \
|
11 |
+
gnupg \
|
12 |
+
&& rm -rf /var/lib/apt/lists/*
|
13 |
+
|
14 |
+
# Install MongoDB
|
15 |
+
RUN wget -qO - https://www.mongodb.org/static/pgp/server-6.0.asc | apt-key add - \
|
16 |
+
&& echo "deb http://repo.mongodb.org/apt/debian buster/mongodb-org/6.0 main" | tee /etc/apt/sources.list.d/mongodb-org-6.0.list \
|
17 |
+
&& apt-get update \
|
18 |
+
&& apt-get install -y mongodb-org \
|
19 |
+
&& mkdir -p /data/db \
|
20 |
+
&& apt-get clean \
|
21 |
+
&& rm -rf /var/lib/apt/lists/*
|
22 |
+
|
23 |
+
# Install Redis
|
24 |
+
RUN apt-get update && apt-get install -y --no-install-recommends \
|
25 |
+
redis-server \
|
26 |
+
&& apt-get clean \
|
27 |
+
&& rm -rf /var/lib/apt/lists/*
|
28 |
+
|
29 |
+
# Copy requirements.txt
|
30 |
+
COPY requirements.txt .
|
31 |
+
|
32 |
+
# Install Python dependencies
|
33 |
+
RUN pip install --no-cache-dir -r requirements.txt
|
34 |
+
|
35 |
+
# Copy the crawler code
|
36 |
+
COPY . .
|
37 |
+
|
38 |
+
# Create necessary directories
|
39 |
+
RUN mkdir -p /data/storage/html_pages \
|
40 |
+
&& mkdir -p /data/storage/logs \
|
41 |
+
&& mkdir -p /data/storage/exports
|
42 |
+
|
43 |
+
# Expose ports
|
44 |
+
# Prometheus metrics port
|
45 |
+
EXPOSE 9100
|
46 |
+
# MongoDB port
|
47 |
+
EXPOSE 27017
|
48 |
+
# Redis port
|
49 |
+
EXPOSE 6379
|
50 |
+
|
51 |
+
# Set environment variables
|
52 |
+
ENV MONGODB_URI=mongodb://localhost:27017/
|
53 |
+
ENV REDIS_URI=redis://localhost:6379/0
|
54 |
+
ENV PYTHONUNBUFFERED=1
|
55 |
+
|
56 |
+
# Create entrypoint script
|
57 |
+
RUN echo '#!/bin/bash\n\
|
58 |
+
# Start MongoDB\n\
|
59 |
+
mongod --fork --logpath /var/log/mongodb.log\n\
|
60 |
+
\n\
|
61 |
+
# Start Redis\n\
|
62 |
+
redis-server --daemonize yes\n\
|
63 |
+
\n\
|
64 |
+
# Check if services are running\n\
|
65 |
+
echo "Waiting for MongoDB to start..."\n\
|
66 |
+
until mongo --eval "print(\"MongoDB is ready\")" > /dev/null 2>&1; do\n\
|
67 |
+
sleep 1\n\
|
68 |
+
done\n\
|
69 |
+
\n\
|
70 |
+
echo "Waiting for Redis to start..."\n\
|
71 |
+
until redis-cli ping > /dev/null 2>&1; do\n\
|
72 |
+
sleep 1\n\
|
73 |
+
done\n\
|
74 |
+
\n\
|
75 |
+
echo "All services are running!"\n\
|
76 |
+
\n\
|
77 |
+
# Execute the provided command or default to help\n\
|
78 |
+
if [ $# -eq 0 ]; then\n\
|
79 |
+
python crawl.py --help\n\
|
80 |
+
else\n\
|
81 |
+
exec "$@"\n\
|
82 |
+
fi' > /app/entrypoint.sh \
|
83 |
+
&& chmod +x /app/entrypoint.sh
|
84 |
+
|
85 |
+
# Set entrypoint
|
86 |
+
ENTRYPOINT ["/app/entrypoint.sh"]
|
87 |
+
|
88 |
+
# Default command is to show help
|
89 |
+
CMD ["python", "crawl.py", "--help"]
|
README.md
CHANGED
@@ -1,12 +1,344 @@
|
|
1 |
---
|
2 |
-
title:
|
3 |
-
|
4 |
-
colorFrom: green
|
5 |
-
colorTo: indigo
|
6 |
sdk: gradio
|
7 |
sdk_version: 5.30.0
|
8 |
-
app_file: app.py
|
9 |
-
pinned: false
|
10 |
---
|
|
|
11 |
|
12 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
---
|
2 |
+
title: AI_SEO_Crawler
|
3 |
+
app_file: seo_analyzer_ui.py
|
|
|
|
|
4 |
sdk: gradio
|
5 |
sdk_version: 5.30.0
|
|
|
|
|
6 |
---
|
7 |
+
# Web Crawler Documentation
|
8 |
|
9 |
+
A scalable web crawler with configurability, politeness, and content extraction capabilities.
|
10 |
+
|
11 |
+
## Table of Contents
|
12 |
+
|
13 |
+
- [Architecture](#architecture)
|
14 |
+
- [Setup](#setup)
|
15 |
+
- [Usage](#usage)
|
16 |
+
- [Components](#components)
|
17 |
+
- [Troubleshooting](#troubleshooting)
|
18 |
+
|
19 |
+
## Architecture
|
20 |
+
|
21 |
+
The web crawler consists of the following key components:
|
22 |
+
|
23 |
+
1. **URL Frontier**: Manages URLs to be crawled with prioritization
|
24 |
+
2. **DNS Resolver**: Caches DNS lookups to improve performance
|
25 |
+
3. **Robots Handler**: Ensures compliance with robots.txt
|
26 |
+
4. **HTML Downloader**: Downloads web pages with error handling
|
27 |
+
5. **HTML Parser**: Extracts URLs and metadata from web pages
|
28 |
+
6. **Storage**: MongoDB for storage of URLs and metadata
|
29 |
+
7. **Crawler**: Main crawler orchestration
|
30 |
+
8. **API**: REST API for controlling the crawler
|
31 |
+
|
32 |
+
## Setup
|
33 |
+
|
34 |
+
### Requirements
|
35 |
+
|
36 |
+
- Python 3.8+
|
37 |
+
- MongoDB
|
38 |
+
- Redis server
|
39 |
+
|
40 |
+
### Installation
|
41 |
+
|
42 |
+
1. Install MongoDB:
|
43 |
+
```bash
|
44 |
+
# For Ubuntu
|
45 |
+
sudo apt-get install -y mongodb
|
46 |
+
sudo systemctl start mongodb
|
47 |
+
sudo systemctl enable mongodb
|
48 |
+
|
49 |
+
# Verify MongoDB is running
|
50 |
+
sudo systemctl status mongodb
|
51 |
+
```
|
52 |
+
|
53 |
+
2. Install Redis:
|
54 |
+
```bash
|
55 |
+
sudo apt-get install redis-server
|
56 |
+
sudo systemctl start redis-server
|
57 |
+
|
58 |
+
# Verify Redis is running
|
59 |
+
redis-cli ping # Should return PONG
|
60 |
+
```
|
61 |
+
|
62 |
+
3. Install Python dependencies:
|
63 |
+
```bash
|
64 |
+
pip install -r requirements.txt
|
65 |
+
```
|
66 |
+
|
67 |
+
4. Create a local configuration file:
|
68 |
+
```bash
|
69 |
+
cp config.py local_config.py
|
70 |
+
```
|
71 |
+
|
72 |
+
5. Edit `local_config.py` to customize settings:
|
73 |
+
```python
|
74 |
+
# Example configuration
|
75 |
+
SEED_URLS = ["https://example.com"] # Start URLs
|
76 |
+
MAX_DEPTH = 3 # Crawl depth
|
77 |
+
MAX_WORKERS = 4 # Number of worker threads
|
78 |
+
DELAY_BETWEEN_REQUESTS = 1 # Politeness delay
|
79 |
+
```
|
80 |
+
|
81 |
+
## Usage
|
82 |
+
|
83 |
+
### Running the Crawler
|
84 |
+
|
85 |
+
To run the crawler with default settings:
|
86 |
+
|
87 |
+
```bash
|
88 |
+
cd 4_web_crawler
|
89 |
+
python run_crawler.py
|
90 |
+
```
|
91 |
+
|
92 |
+
To specify custom seed URLs:
|
93 |
+
|
94 |
+
```bash
|
95 |
+
python run_crawler.py --seed https://example.com https://another-site.com
|
96 |
+
```
|
97 |
+
|
98 |
+
To limit crawl depth:
|
99 |
+
|
100 |
+
```bash
|
101 |
+
python run_crawler.py --depth 2
|
102 |
+
```
|
103 |
+
|
104 |
+
To run with more worker threads:
|
105 |
+
|
106 |
+
```bash
|
107 |
+
python run_crawler.py --workers 8
|
108 |
+
```
|
109 |
+
|
110 |
+
### Sample Commands
|
111 |
+
|
112 |
+
Here are some common use cases with sample commands:
|
113 |
+
|
114 |
+
#### Crawl a Single Domain
|
115 |
+
|
116 |
+
This command crawls only example.com, not following external links:
|
117 |
+
|
118 |
+
```bash
|
119 |
+
python run_crawler.py --seed example.com --domain-filter example.com
|
120 |
+
```
|
121 |
+
|
122 |
+
#### Fresh Start (Reset Database)
|
123 |
+
|
124 |
+
This clears both MongoDB and Redis before starting, solving duplicate key errors:
|
125 |
+
|
126 |
+
```bash
|
127 |
+
python run_crawler.py --seed example.com --reset-db
|
128 |
+
```
|
129 |
+
|
130 |
+
#### Custom Speed and Depth
|
131 |
+
|
132 |
+
Control the crawler's speed and depth:
|
133 |
+
|
134 |
+
```bash
|
135 |
+
python run_crawler.py --seed example.com --depth 3 --workers 4 --delay 0.5
|
136 |
+
```
|
137 |
+
|
138 |
+
#### Crawl Multiple Sites
|
139 |
+
|
140 |
+
Crawl multiple websites at once:
|
141 |
+
|
142 |
+
```bash
|
143 |
+
python run_crawler.py --seed example.com blog.example.org docs.example.com
|
144 |
+
```
|
145 |
+
|
146 |
+
#### Ignore robots.txt Rules
|
147 |
+
|
148 |
+
Use with caution, as this ignores website crawling policies:
|
149 |
+
|
150 |
+
```bash
|
151 |
+
python run_crawler.py --seed example.com --ignore-robots
|
152 |
+
```
|
153 |
+
|
154 |
+
#### Set Custom User Agent
|
155 |
+
|
156 |
+
Identity the crawler with a specific user agent:
|
157 |
+
|
158 |
+
```bash
|
159 |
+
python run_crawler.py --seed example.com --user-agent "MyCustomBot/1.0"
|
160 |
+
```
|
161 |
+
|
162 |
+
#### Crawl sagarnildas.com
|
163 |
+
|
164 |
+
To specifically crawl sagarnildas.com with optimal settings:
|
165 |
+
|
166 |
+
```bash
|
167 |
+
python run_crawler.py --seed sagarnildas.com --domain-filter sagarnildas.com --reset-db --workers 2 --depth 3 --verbose
|
168 |
+
```
|
169 |
+
|
170 |
+
### Using the API
|
171 |
+
|
172 |
+
The crawler provides a REST API for control and monitoring:
|
173 |
+
|
174 |
+
```bash
|
175 |
+
cd 4_web_crawler
|
176 |
+
python api.py
|
177 |
+
```
|
178 |
+
|
179 |
+
The API will be available at http://localhost:8000
|
180 |
+
|
181 |
+
#### API Endpoints
|
182 |
+
|
183 |
+
- `GET /status` - Get crawler status
|
184 |
+
- `GET /stats` - Get detailed statistics
|
185 |
+
- `POST /start` - Start the crawler
|
186 |
+
- `POST /stop` - Stop the crawler
|
187 |
+
- `POST /seed` - Add seed URLs
|
188 |
+
- `GET /pages` - List crawled pages
|
189 |
+
- `GET /urls` - List discovered URLs
|
190 |
+
|
191 |
+
### Checking Results
|
192 |
+
|
193 |
+
Monitor the crawler through:
|
194 |
+
|
195 |
+
1. Console output:
|
196 |
+
```bash
|
197 |
+
tail -f crawler.log
|
198 |
+
```
|
199 |
+
|
200 |
+
2. MongoDB collections:
|
201 |
+
```bash
|
202 |
+
# Start mongo shell
|
203 |
+
mongo
|
204 |
+
|
205 |
+
# Switch to crawler database
|
206 |
+
use crawler
|
207 |
+
|
208 |
+
# Count discovered URLs
|
209 |
+
db.urls.count()
|
210 |
+
|
211 |
+
# View crawled pages
|
212 |
+
db.pages.find().limit(5)
|
213 |
+
```
|
214 |
+
|
215 |
+
3. API statistics:
|
216 |
+
```bash
|
217 |
+
curl http://localhost:8000/stats
|
218 |
+
```
|
219 |
+
|
220 |
+
## Components
|
221 |
+
|
222 |
+
The crawler has several key components that work together:
|
223 |
+
|
224 |
+
### URL Frontier
|
225 |
+
|
226 |
+
Manages the queue of URLs to be crawled with priority-based scheduling.
|
227 |
+
|
228 |
+
### DNS Resolver
|
229 |
+
|
230 |
+
Caches DNS lookups to improve performance and reduce load on DNS servers.
|
231 |
+
|
232 |
+
### Robots Handler
|
233 |
+
|
234 |
+
Ensures compliance with robots.txt rules to be a good web citizen.
|
235 |
+
|
236 |
+
### HTML Downloader
|
237 |
+
|
238 |
+
Downloads web pages with error handling, timeouts, and retries.
|
239 |
+
|
240 |
+
### HTML Parser
|
241 |
+
|
242 |
+
Extracts URLs and metadata from web pages.
|
243 |
+
|
244 |
+
### Crawler
|
245 |
+
|
246 |
+
The main component that orchestrates the crawling process.
|
247 |
+
|
248 |
+
## Troubleshooting
|
249 |
+
|
250 |
+
### MongoDB Errors
|
251 |
+
|
252 |
+
If you see duplicate key errors:
|
253 |
+
|
254 |
+
```
|
255 |
+
ERROR: Error saving seed URL to database: E11000 duplicate key error
|
256 |
+
```
|
257 |
+
|
258 |
+
Clean MongoDB collections:
|
259 |
+
|
260 |
+
```bash
|
261 |
+
cd 4_web_crawler
|
262 |
+
python mongo_cleanup.py
|
263 |
+
```
|
264 |
+
|
265 |
+
### Redis Connection Issues
|
266 |
+
|
267 |
+
If the crawler can't connect to Redis:
|
268 |
+
|
269 |
+
1. Check if Redis is running:
|
270 |
+
```bash
|
271 |
+
sudo systemctl status redis-server
|
272 |
+
```
|
273 |
+
|
274 |
+
2. Verify Redis connection:
|
275 |
+
```bash
|
276 |
+
redis-cli ping
|
277 |
+
```
|
278 |
+
|
279 |
+
### Performance Issues
|
280 |
+
|
281 |
+
If the crawler is running slowly:
|
282 |
+
|
283 |
+
1. Increase worker threads in `local_config.py`:
|
284 |
+
```python
|
285 |
+
MAX_WORKERS = 8
|
286 |
+
```
|
287 |
+
|
288 |
+
2. Adjust the politeness delay:
|
289 |
+
```python
|
290 |
+
DELAY_BETWEEN_REQUESTS = 0.5 # Half-second delay
|
291 |
+
```
|
292 |
+
|
293 |
+
3. Optimize DNS caching:
|
294 |
+
```python
|
295 |
+
DNS_CACHE_SIZE = 10000
|
296 |
+
DNS_CACHE_TTL = 7200 # 2 hours
|
297 |
+
```
|
298 |
+
|
299 |
+
### Crawler Not Starting
|
300 |
+
|
301 |
+
If the crawler won't start:
|
302 |
+
|
303 |
+
1. Check for MongoDB connection:
|
304 |
+
```bash
|
305 |
+
mongo --eval "db.version()"
|
306 |
+
```
|
307 |
+
|
308 |
+
2. Ensure Redis is running:
|
309 |
+
```bash
|
310 |
+
redis-cli info
|
311 |
+
```
|
312 |
+
|
313 |
+
3. Look for error messages in the logs:
|
314 |
+
```bash
|
315 |
+
cat crawler.log
|
316 |
+
```
|
317 |
+
|
318 |
+
## Configuration Reference
|
319 |
+
|
320 |
+
Key configurations in `config.py` or `local_config.py`:
|
321 |
+
|
322 |
+
```python
|
323 |
+
# General settings
|
324 |
+
MAX_WORKERS = 4 # Number of worker threads
|
325 |
+
MAX_DEPTH = 3 # Maximum crawl depth
|
326 |
+
SEED_URLS = ["https://example.com"] # Initial URLs
|
327 |
+
|
328 |
+
# Politeness settings
|
329 |
+
RESPECT_ROBOTS_TXT = True # Whether to respect robots.txt
|
330 |
+
USER_AGENT = "MyBot/1.0" # User agent for requests
|
331 |
+
DELAY_BETWEEN_REQUESTS = 1 # Delay between requests to the same domain
|
332 |
+
|
333 |
+
# Storage settings
|
334 |
+
MONGODB_URI = "mongodb://localhost:27017/"
|
335 |
+
MONGODB_DB = "crawler"
|
336 |
+
|
337 |
+
# DNS settings
|
338 |
+
DNS_CACHE_SIZE = 10000
|
339 |
+
DNS_CACHE_TTL = 3600 # 1 hour
|
340 |
+
|
341 |
+
# Logging settings
|
342 |
+
LOG_LEVEL = "INFO"
|
343 |
+
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
|
344 |
+
```
|
__pycache__/api.cpython-311.pyc
ADDED
Binary file (25.7 kB). View file
|
|
__pycache__/config.cpython-310.pyc
ADDED
Binary file (2.5 kB). View file
|
|
__pycache__/config.cpython-311.pyc
ADDED
Binary file (3.84 kB). View file
|
|
__pycache__/crawler.cpython-310.pyc
ADDED
Binary file (22.3 kB). View file
|
|
__pycache__/crawler.cpython-311.pyc
ADDED
Binary file (40.2 kB). View file
|
|
__pycache__/dns_resolver.cpython-310.pyc
ADDED
Binary file (4.7 kB). View file
|
|
__pycache__/dns_resolver.cpython-311.pyc
ADDED
Binary file (7.84 kB). View file
|
|
__pycache__/downloader.cpython-310.pyc
ADDED
Binary file (10.8 kB). View file
|
|
__pycache__/downloader.cpython-311.pyc
ADDED
Binary file (18.6 kB). View file
|
|
__pycache__/frontier.cpython-310.pyc
ADDED
Binary file (8.74 kB). View file
|
|
__pycache__/frontier.cpython-311.pyc
ADDED
Binary file (20.6 kB). View file
|
|
__pycache__/local_config.cpython-310.pyc
ADDED
Binary file (850 Bytes). View file
|
|
__pycache__/local_config.cpython-311.pyc
ADDED
Binary file (1.31 kB). View file
|
|
__pycache__/models.cpython-310.pyc
ADDED
Binary file (5.57 kB). View file
|
|
__pycache__/models.cpython-311.pyc
ADDED
Binary file (7.77 kB). View file
|
|
__pycache__/mongo_cleanup.cpython-310.pyc
ADDED
Binary file (2.27 kB). View file
|
|
__pycache__/mongo_cleanup.cpython-311.pyc
ADDED
Binary file (4.26 kB). View file
|
|
__pycache__/parser.cpython-310.pyc
ADDED
Binary file (7.95 kB). View file
|
|
__pycache__/parser.cpython-311.pyc
ADDED
Binary file (14.1 kB). View file
|
|
__pycache__/robots.cpython-310.pyc
ADDED
Binary file (4.75 kB). View file
|
|
__pycache__/robots.cpython-311.pyc
ADDED
Binary file (7.92 kB). View file
|
|
__pycache__/run_crawler.cpython-310.pyc
ADDED
Binary file (6.36 kB). View file
|
|
api.py
ADDED
@@ -0,0 +1,588 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
"""
|
2 |
+
Web API for the web crawler.
|
3 |
+
|
4 |
+
This module provides a FastAPI-based web API for controlling and monitoring the web crawler.
|
5 |
+
"""
|
6 |
+
|
7 |
+
import os
|
8 |
+
import sys
|
9 |
+
import time
|
10 |
+
import json
|
11 |
+
import logging
|
12 |
+
import datetime
|
13 |
+
from typing import List, Dict, Any, Optional
|
14 |
+
from fastapi import FastAPI, HTTPException, Query, Path, BackgroundTasks, Depends
|
15 |
+
from fastapi.middleware.cors import CORSMiddleware
|
16 |
+
from fastapi.responses import JSONResponse
|
17 |
+
from pydantic import BaseModel, HttpUrl, Field
|
18 |
+
import uvicorn
|
19 |
+
|
20 |
+
from crawler import Crawler
|
21 |
+
from models import URL, URLStatus, Priority
|
22 |
+
import config
|
23 |
+
|
24 |
+
# Configure logging
|
25 |
+
logging.basicConfig(
|
26 |
+
level=getattr(logging, config.LOG_LEVEL),
|
27 |
+
format=config.LOG_FORMAT
|
28 |
+
)
|
29 |
+
logger = logging.getLogger(__name__)
|
30 |
+
|
31 |
+
# Create FastAPI app
|
32 |
+
app = FastAPI(
|
33 |
+
title="Web Crawler API",
|
34 |
+
description="API for controlling and monitoring the web crawler",
|
35 |
+
version="1.0.0"
|
36 |
+
)
|
37 |
+
|
38 |
+
# Enable CORS
|
39 |
+
app.add_middleware(
|
40 |
+
CORSMiddleware,
|
41 |
+
allow_origins=["*"],
|
42 |
+
allow_credentials=True,
|
43 |
+
allow_methods=["*"],
|
44 |
+
allow_headers=["*"],
|
45 |
+
)
|
46 |
+
|
47 |
+
# Global crawler instance
|
48 |
+
crawler = None
|
49 |
+
|
50 |
+
|
51 |
+
def get_crawler() -> Crawler:
|
52 |
+
"""Get or initialize the crawler instance"""
|
53 |
+
global crawler
|
54 |
+
if crawler is None:
|
55 |
+
crawler = Crawler()
|
56 |
+
return crawler
|
57 |
+
|
58 |
+
|
59 |
+
# API Models
|
60 |
+
class SeedURL(BaseModel):
|
61 |
+
url: HttpUrl
|
62 |
+
priority: Optional[str] = Field(
|
63 |
+
default="NORMAL",
|
64 |
+
description="URL priority (VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW)"
|
65 |
+
)
|
66 |
+
|
67 |
+
|
68 |
+
class SeedURLs(BaseModel):
|
69 |
+
urls: List[SeedURL]
|
70 |
+
|
71 |
+
|
72 |
+
class CrawlerStatus(BaseModel):
|
73 |
+
running: bool
|
74 |
+
paused: bool
|
75 |
+
start_time: Optional[float] = None
|
76 |
+
uptime_seconds: Optional[float] = None
|
77 |
+
pages_crawled: int
|
78 |
+
pages_failed: int
|
79 |
+
urls_discovered: int
|
80 |
+
urls_filtered: int
|
81 |
+
domains_crawled: int
|
82 |
+
frontier_size: int
|
83 |
+
|
84 |
+
|
85 |
+
class CrawlerConfig(BaseModel):
|
86 |
+
max_depth: int = Field(..., description="Maximum crawl depth")
|
87 |
+
max_workers: int = Field(..., description="Maximum number of worker threads")
|
88 |
+
delay_between_requests: float = Field(..., description="Delay between requests to the same domain (seconds)")
|
89 |
+
|
90 |
+
|
91 |
+
class PageDetail(BaseModel):
|
92 |
+
url: str
|
93 |
+
domain: str
|
94 |
+
status_code: int
|
95 |
+
content_type: str
|
96 |
+
content_length: int
|
97 |
+
crawled_at: str
|
98 |
+
is_seed: bool
|
99 |
+
depth: int
|
100 |
+
title: Optional[str] = None
|
101 |
+
description: Optional[str] = None
|
102 |
+
|
103 |
+
|
104 |
+
class URLDetail(BaseModel):
|
105 |
+
url: str
|
106 |
+
normalized_url: str
|
107 |
+
domain: str
|
108 |
+
status: str
|
109 |
+
priority: str
|
110 |
+
depth: int
|
111 |
+
parent_url: Optional[str] = None
|
112 |
+
last_crawled: Optional[str] = None
|
113 |
+
error: Optional[str] = None
|
114 |
+
retries: int
|
115 |
+
|
116 |
+
|
117 |
+
class DomainStats(BaseModel):
|
118 |
+
domain: str
|
119 |
+
pages_count: int
|
120 |
+
successful_requests: int
|
121 |
+
failed_requests: int
|
122 |
+
avg_page_size: float
|
123 |
+
content_types: Dict[str, int]
|
124 |
+
status_codes: Dict[str, int]
|
125 |
+
|
126 |
+
|
127 |
+
# API Routes
|
128 |
+
@app.get("/")
|
129 |
+
async def read_root():
|
130 |
+
"""Root endpoint"""
|
131 |
+
return {
|
132 |
+
"name": "Web Crawler API",
|
133 |
+
"version": "1.0.0",
|
134 |
+
"description": "API for controlling and monitoring the web crawler",
|
135 |
+
"endpoints": {
|
136 |
+
"GET /": "This help message",
|
137 |
+
"GET /status": "Get crawler status",
|
138 |
+
"GET /stats": "Get crawler statistics",
|
139 |
+
"GET /config": "Get crawler configuration",
|
140 |
+
"PUT /config": "Update crawler configuration",
|
141 |
+
"POST /start": "Start the crawler",
|
142 |
+
"POST /stop": "Stop the crawler",
|
143 |
+
"POST /pause": "Pause the crawler",
|
144 |
+
"POST /resume": "Resume the crawler",
|
145 |
+
"GET /pages": "List crawled pages",
|
146 |
+
"GET /pages/{url}": "Get page details",
|
147 |
+
"GET /urls": "List discovered URLs",
|
148 |
+
"GET /urls/{url}": "Get URL details",
|
149 |
+
"POST /seed": "Add seed URLs",
|
150 |
+
"GET /domains": "Get domain statistics",
|
151 |
+
"GET /domains/{domain}": "Get statistics for a specific domain",
|
152 |
+
}
|
153 |
+
}
|
154 |
+
|
155 |
+
|
156 |
+
@app.get("/status", response_model=CrawlerStatus)
|
157 |
+
async def get_status(crawler: Crawler = Depends(get_crawler)):
|
158 |
+
"""Get crawler status"""
|
159 |
+
status = {
|
160 |
+
"running": crawler.running,
|
161 |
+
"paused": crawler.paused,
|
162 |
+
"start_time": crawler.stats.get('start_time'),
|
163 |
+
"uptime_seconds": time.time() - crawler.stats.get('start_time', time.time()) if crawler.running else None,
|
164 |
+
"pages_crawled": crawler.stats.get('pages_crawled', 0),
|
165 |
+
"pages_failed": crawler.stats.get('pages_failed', 0),
|
166 |
+
"urls_discovered": crawler.stats.get('urls_discovered', 0),
|
167 |
+
"urls_filtered": crawler.stats.get('urls_filtered', 0),
|
168 |
+
"domains_crawled": len(crawler.stats.get('domains_crawled', set())),
|
169 |
+
"frontier_size": crawler.frontier.size()
|
170 |
+
}
|
171 |
+
return status
|
172 |
+
|
173 |
+
|
174 |
+
@app.get("/stats")
|
175 |
+
async def get_stats(crawler: Crawler = Depends(get_crawler)):
|
176 |
+
"""Get detailed crawler statistics"""
|
177 |
+
stats = crawler.stats.copy()
|
178 |
+
|
179 |
+
# Convert sets to lists for JSON serialization
|
180 |
+
for key, value in stats.items():
|
181 |
+
if isinstance(value, set):
|
182 |
+
stats[key] = list(value)
|
183 |
+
|
184 |
+
# Add uptime
|
185 |
+
if stats.get('start_time'):
|
186 |
+
stats['uptime_seconds'] = time.time() - stats['start_time']
|
187 |
+
stats['uptime_formatted'] = str(datetime.timedelta(seconds=int(stats['uptime_seconds'])))
|
188 |
+
|
189 |
+
# Add DNS cache statistics if available
|
190 |
+
try:
|
191 |
+
dns_stats = crawler.dns_resolver.get_stats()
|
192 |
+
stats['dns_cache'] = dns_stats
|
193 |
+
except (AttributeError, Exception) as e:
|
194 |
+
logger.warning(f"Failed to get DNS stats: {e}")
|
195 |
+
stats['dns_cache'] = {'error': 'Stats not available'}
|
196 |
+
|
197 |
+
# Add frontier statistics if available
|
198 |
+
try:
|
199 |
+
stats['frontier_size'] = crawler.frontier.size()
|
200 |
+
if hasattr(crawler.frontier, 'get_stats'):
|
201 |
+
frontier_stats = crawler.frontier.get_stats()
|
202 |
+
stats['frontier'] = frontier_stats
|
203 |
+
else:
|
204 |
+
stats['frontier'] = {'size': crawler.frontier.size()}
|
205 |
+
except Exception as e:
|
206 |
+
logger.warning(f"Failed to get frontier stats: {e}")
|
207 |
+
stats['frontier'] = {'error': 'Stats not available'}
|
208 |
+
|
209 |
+
return stats
|
210 |
+
|
211 |
+
|
212 |
+
@app.get("/config", response_model=CrawlerConfig)
|
213 |
+
async def get_config():
|
214 |
+
"""Get crawler configuration"""
|
215 |
+
return {
|
216 |
+
"max_depth": config.MAX_DEPTH,
|
217 |
+
"max_workers": config.MAX_WORKERS,
|
218 |
+
"delay_between_requests": config.DELAY_BETWEEN_REQUESTS
|
219 |
+
}
|
220 |
+
|
221 |
+
|
222 |
+
@app.put("/config", response_model=CrawlerConfig)
|
223 |
+
async def update_config(
|
224 |
+
crawler_config: CrawlerConfig,
|
225 |
+
crawler: Crawler = Depends(get_crawler)
|
226 |
+
):
|
227 |
+
"""Update crawler configuration"""
|
228 |
+
# Update configuration
|
229 |
+
config.MAX_DEPTH = crawler_config.max_depth
|
230 |
+
config.MAX_WORKERS = crawler_config.max_workers
|
231 |
+
config.DELAY_BETWEEN_REQUESTS = crawler_config.delay_between_requests
|
232 |
+
|
233 |
+
return crawler_config
|
234 |
+
|
235 |
+
|
236 |
+
@app.post("/start")
|
237 |
+
async def start_crawler(
|
238 |
+
background_tasks: BackgroundTasks,
|
239 |
+
num_workers: int = Query(None, description="Number of worker threads"),
|
240 |
+
async_mode: bool = Query(False, description="Whether to use async mode"),
|
241 |
+
crawler: Crawler = Depends(get_crawler)
|
242 |
+
):
|
243 |
+
"""Start the crawler"""
|
244 |
+
if crawler.running:
|
245 |
+
return {"status": "Crawler is already running"}
|
246 |
+
|
247 |
+
# Start crawler in background
|
248 |
+
def start_crawler_task():
|
249 |
+
try:
|
250 |
+
crawler.start(num_workers=num_workers, async_mode=async_mode)
|
251 |
+
except Exception as e:
|
252 |
+
logger.error(f"Error starting crawler: {e}")
|
253 |
+
|
254 |
+
background_tasks.add_task(start_crawler_task)
|
255 |
+
|
256 |
+
return {"status": "Crawler starting in background"}
|
257 |
+
|
258 |
+
|
259 |
+
@app.post("/stop")
|
260 |
+
async def stop_crawler(crawler: Crawler = Depends(get_crawler)):
|
261 |
+
"""Stop the crawler"""
|
262 |
+
if not crawler.running:
|
263 |
+
return {"status": "Crawler is not running"}
|
264 |
+
|
265 |
+
crawler.stop()
|
266 |
+
return {"status": "Crawler stopped"}
|
267 |
+
|
268 |
+
|
269 |
+
@app.post("/pause")
|
270 |
+
async def pause_crawler(crawler: Crawler = Depends(get_crawler)):
|
271 |
+
"""Pause the crawler"""
|
272 |
+
if not crawler.running:
|
273 |
+
return {"status": "Crawler is not running"}
|
274 |
+
|
275 |
+
if crawler.paused:
|
276 |
+
return {"status": "Crawler is already paused"}
|
277 |
+
|
278 |
+
crawler.pause()
|
279 |
+
return {"status": "Crawler paused"}
|
280 |
+
|
281 |
+
|
282 |
+
@app.post("/resume")
|
283 |
+
async def resume_crawler(crawler: Crawler = Depends(get_crawler)):
|
284 |
+
"""Resume the crawler"""
|
285 |
+
if not crawler.running:
|
286 |
+
return {"status": "Crawler is not running"}
|
287 |
+
|
288 |
+
if not crawler.paused:
|
289 |
+
return {"status": "Crawler is not paused"}
|
290 |
+
|
291 |
+
crawler.resume()
|
292 |
+
return {"status": "Crawler resumed"}
|
293 |
+
|
294 |
+
|
295 |
+
@app.get("/pages")
|
296 |
+
async def list_pages(
|
297 |
+
limit: int = Query(10, ge=1, le=100, description="Number of pages to return"),
|
298 |
+
offset: int = Query(0, ge=0, description="Offset for pagination"),
|
299 |
+
domain: Optional[str] = Query(None, description="Filter by domain"),
|
300 |
+
status_code: Optional[int] = Query(None, description="Filter by HTTP status code"),
|
301 |
+
crawler: Crawler = Depends(get_crawler)
|
302 |
+
):
|
303 |
+
"""List crawled pages"""
|
304 |
+
# Build query
|
305 |
+
query = {}
|
306 |
+
if domain:
|
307 |
+
query['domain'] = domain
|
308 |
+
if status_code:
|
309 |
+
query['status_code'] = status_code
|
310 |
+
|
311 |
+
# Execute query
|
312 |
+
try:
|
313 |
+
pages = list(crawler.db.pages_collection.find(
|
314 |
+
query,
|
315 |
+
{'_id': 0}
|
316 |
+
).skip(offset).limit(limit))
|
317 |
+
|
318 |
+
# Count total pages matching query
|
319 |
+
total_count = crawler.db.pages_collection.count_documents(query)
|
320 |
+
|
321 |
+
return {
|
322 |
+
"pages": pages,
|
323 |
+
"total": total_count,
|
324 |
+
"limit": limit,
|
325 |
+
"offset": offset
|
326 |
+
}
|
327 |
+
except Exception as e:
|
328 |
+
logger.error(f"Error listing pages: {e}")
|
329 |
+
raise HTTPException(status_code=500, detail=str(e))
|
330 |
+
|
331 |
+
|
332 |
+
@app.get("/pages/{url:path}", response_model=PageDetail)
|
333 |
+
async def get_page(
|
334 |
+
url: str,
|
335 |
+
include_content: bool = Query(False, description="Include page content"),
|
336 |
+
crawler: Crawler = Depends(get_crawler)
|
337 |
+
):
|
338 |
+
"""Get page details"""
|
339 |
+
try:
|
340 |
+
# Decode URL from path parameter
|
341 |
+
url = url.replace("___", "/")
|
342 |
+
|
343 |
+
# Find page in database
|
344 |
+
page = crawler.db.pages_collection.find_one({'url': url}, {'_id': 0})
|
345 |
+
|
346 |
+
if not page:
|
347 |
+
raise HTTPException(status_code=404, detail="Page not found")
|
348 |
+
|
349 |
+
# Load content if requested
|
350 |
+
if include_content:
|
351 |
+
try:
|
352 |
+
if crawler.use_s3:
|
353 |
+
content = crawler._load_content_s3(url)
|
354 |
+
else:
|
355 |
+
content = crawler._load_content_disk(url)
|
356 |
+
|
357 |
+
if content:
|
358 |
+
page['content'] = content
|
359 |
+
except Exception as e:
|
360 |
+
logger.error(f"Error loading content for {url}: {e}")
|
361 |
+
page['content'] = None
|
362 |
+
|
363 |
+
return page
|
364 |
+
except HTTPException:
|
365 |
+
raise
|
366 |
+
except Exception as e:
|
367 |
+
logger.error(f"Error getting page {url}: {e}")
|
368 |
+
raise HTTPException(status_code=500, detail=str(e))
|
369 |
+
|
370 |
+
|
371 |
+
@app.get("/urls")
|
372 |
+
async def list_urls(
|
373 |
+
limit: int = Query(10, ge=1, le=100, description="Number of URLs to return"),
|
374 |
+
offset: int = Query(0, ge=0, description="Offset for pagination"),
|
375 |
+
status: Optional[str] = Query(None, description="Filter by URL status"),
|
376 |
+
domain: Optional[str] = Query(None, description="Filter by domain"),
|
377 |
+
priority: Optional[str] = Query(None, description="Filter by priority"),
|
378 |
+
crawler: Crawler = Depends(get_crawler)
|
379 |
+
):
|
380 |
+
"""List discovered URLs"""
|
381 |
+
# Build query
|
382 |
+
query = {}
|
383 |
+
if status:
|
384 |
+
query['status'] = status
|
385 |
+
if domain:
|
386 |
+
query['domain'] = domain
|
387 |
+
if priority:
|
388 |
+
query['priority'] = priority
|
389 |
+
|
390 |
+
# Execute query
|
391 |
+
try:
|
392 |
+
urls = list(crawler.db.urls_collection.find(
|
393 |
+
query,
|
394 |
+
{'_id': 0}
|
395 |
+
).skip(offset).limit(limit))
|
396 |
+
|
397 |
+
# Count total URLs matching query
|
398 |
+
total_count = crawler.db.urls_collection.count_documents(query)
|
399 |
+
|
400 |
+
return {
|
401 |
+
"urls": urls,
|
402 |
+
"total": total_count,
|
403 |
+
"limit": limit,
|
404 |
+
"offset": offset
|
405 |
+
}
|
406 |
+
except Exception as e:
|
407 |
+
logger.error(f"Error listing URLs: {e}")
|
408 |
+
raise HTTPException(status_code=500, detail=str(e))
|
409 |
+
|
410 |
+
|
411 |
+
@app.get("/urls/{url:path}", response_model=URLDetail)
|
412 |
+
async def get_url(
|
413 |
+
url: str,
|
414 |
+
crawler: Crawler = Depends(get_crawler)
|
415 |
+
):
|
416 |
+
"""Get URL details"""
|
417 |
+
try:
|
418 |
+
# Decode URL from path parameter
|
419 |
+
url = url.replace("___", "/")
|
420 |
+
|
421 |
+
# Find URL in database
|
422 |
+
url_obj = crawler.db.urls_collection.find_one({'url': url}, {'_id': 0})
|
423 |
+
|
424 |
+
if not url_obj:
|
425 |
+
raise HTTPException(status_code=404, detail="URL not found")
|
426 |
+
|
427 |
+
return url_obj
|
428 |
+
except HTTPException:
|
429 |
+
raise
|
430 |
+
except Exception as e:
|
431 |
+
logger.error(f"Error getting URL {url}: {e}")
|
432 |
+
raise HTTPException(status_code=500, detail=str(e))
|
433 |
+
|
434 |
+
|
435 |
+
@app.post("/seed")
|
436 |
+
async def add_seed_urls(
|
437 |
+
seed_urls: SeedURLs,
|
438 |
+
crawler: Crawler = Depends(get_crawler)
|
439 |
+
):
|
440 |
+
"""Add seed URLs to the frontier"""
|
441 |
+
try:
|
442 |
+
urls_added = 0
|
443 |
+
for seed in seed_urls.urls:
|
444 |
+
url = str(seed.url)
|
445 |
+
priority = getattr(Priority, seed.priority, Priority.NORMAL)
|
446 |
+
|
447 |
+
# Create URL object
|
448 |
+
url_obj = URL(
|
449 |
+
url=url,
|
450 |
+
status=URLStatus.PENDING,
|
451 |
+
priority=priority,
|
452 |
+
depth=0 # Seed URLs are at depth 0
|
453 |
+
)
|
454 |
+
|
455 |
+
# Add to frontier
|
456 |
+
if crawler.frontier.add_url(url_obj):
|
457 |
+
# Save URL to database
|
458 |
+
crawler.urls_collection.update_one(
|
459 |
+
{'url': url},
|
460 |
+
{'$set': url_obj.dict()},
|
461 |
+
upsert=True
|
462 |
+
)
|
463 |
+
|
464 |
+
urls_added += 1
|
465 |
+
logger.info(f"Added seed URL: {url}")
|
466 |
+
|
467 |
+
return {"status": "success", "urls_added": urls_added}
|
468 |
+
except Exception as e:
|
469 |
+
logger.error(f"Error adding seed URLs: {e}")
|
470 |
+
raise HTTPException(status_code=500, detail=str(e))
|
471 |
+
|
472 |
+
|
473 |
+
@app.get("/domains")
|
474 |
+
async def list_domains(
|
475 |
+
limit: int = Query(10, ge=1, le=100, description="Number of domains to return"),
|
476 |
+
offset: int = Query(0, ge=0, description="Offset for pagination"),
|
477 |
+
crawler: Crawler = Depends(get_crawler)
|
478 |
+
):
|
479 |
+
"""Get domain statistics"""
|
480 |
+
try:
|
481 |
+
# Get domains with counts
|
482 |
+
domain_counts = crawler.db.pages_collection.aggregate([
|
483 |
+
{"$group": {
|
484 |
+
"_id": "$domain",
|
485 |
+
"pages_count": {"$sum": 1},
|
486 |
+
"avg_page_size": {"$avg": "$content_length"}
|
487 |
+
}},
|
488 |
+
{"$sort": {"pages_count": -1}},
|
489 |
+
{"$skip": offset},
|
490 |
+
{"$limit": limit}
|
491 |
+
])
|
492 |
+
|
493 |
+
# Get total domains count
|
494 |
+
total_domains = len(crawler.stats.get('domains_crawled', set()))
|
495 |
+
|
496 |
+
# Format result
|
497 |
+
domains = []
|
498 |
+
for domain in domain_counts:
|
499 |
+
domains.append({
|
500 |
+
"domain": domain["_id"],
|
501 |
+
"pages_count": domain["pages_count"],
|
502 |
+
"avg_page_size": domain["avg_page_size"]
|
503 |
+
})
|
504 |
+
|
505 |
+
return {
|
506 |
+
"domains": domains,
|
507 |
+
"total": total_domains,
|
508 |
+
"limit": limit,
|
509 |
+
"offset": offset
|
510 |
+
}
|
511 |
+
except Exception as e:
|
512 |
+
logger.error(f"Error listing domains: {e}")
|
513 |
+
raise HTTPException(status_code=500, detail=str(e))
|
514 |
+
|
515 |
+
|
516 |
+
@app.get("/domains/{domain}", response_model=DomainStats)
|
517 |
+
async def get_domain_stats(
|
518 |
+
domain: str,
|
519 |
+
crawler: Crawler = Depends(get_crawler)
|
520 |
+
):
|
521 |
+
"""Get statistics for a specific domain"""
|
522 |
+
try:
|
523 |
+
# Get basic domain stats
|
524 |
+
domain_stats = crawler.db.pages_collection.aggregate([
|
525 |
+
{"$match": {"domain": domain}},
|
526 |
+
{"$group": {
|
527 |
+
"_id": "$domain",
|
528 |
+
"pages_count": {"$sum": 1},
|
529 |
+
"successful_requests": {"$sum": {"$cond": [{"$lt": ["$status_code", 400]}, 1, 0]}},
|
530 |
+
"failed_requests": {"$sum": {"$cond": [{"$gte": ["$status_code", 400]}, 1, 0]}},
|
531 |
+
"avg_page_size": {"$avg": "$content_length"}
|
532 |
+
}}
|
533 |
+
]).next()
|
534 |
+
|
535 |
+
# Get content type distribution
|
536 |
+
content_types = crawler.db.pages_collection.aggregate([
|
537 |
+
{"$match": {"domain": domain}},
|
538 |
+
{"$group": {
|
539 |
+
"_id": "$content_type",
|
540 |
+
"count": {"$sum": 1}
|
541 |
+
}}
|
542 |
+
])
|
543 |
+
|
544 |
+
content_type_map = {}
|
545 |
+
for ct in content_types:
|
546 |
+
content_type_map[ct["_id"]] = ct["count"]
|
547 |
+
|
548 |
+
# Get status code distribution
|
549 |
+
status_codes = crawler.db.pages_collection.aggregate([
|
550 |
+
{"$match": {"domain": domain}},
|
551 |
+
{"$group": {
|
552 |
+
"_id": "$status_code",
|
553 |
+
"count": {"$sum": 1}
|
554 |
+
}}
|
555 |
+
])
|
556 |
+
|
557 |
+
status_code_map = {}
|
558 |
+
for sc in status_codes:
|
559 |
+
status_code_map[str(sc["_id"])] = sc["count"]
|
560 |
+
|
561 |
+
# Format result
|
562 |
+
result = {
|
563 |
+
"domain": domain,
|
564 |
+
"pages_count": domain_stats["pages_count"],
|
565 |
+
"successful_requests": domain_stats["successful_requests"],
|
566 |
+
"failed_requests": domain_stats["failed_requests"],
|
567 |
+
"avg_page_size": domain_stats["avg_page_size"],
|
568 |
+
"content_types": content_type_map,
|
569 |
+
"status_codes": status_code_map
|
570 |
+
}
|
571 |
+
|
572 |
+
return result
|
573 |
+
except StopIteration:
|
574 |
+
# Domain not found
|
575 |
+
raise HTTPException(status_code=404, detail=f"Domain '{domain}' not found")
|
576 |
+
except Exception as e:
|
577 |
+
logger.error(f"Error getting domain stats for {domain}: {e}")
|
578 |
+
raise HTTPException(status_code=500, detail=str(e))
|
579 |
+
|
580 |
+
|
581 |
+
if __name__ == "__main__":
|
582 |
+
# Run the API server
|
583 |
+
uvicorn.run(
|
584 |
+
"api:app",
|
585 |
+
host="0.0.0.0",
|
586 |
+
port=8000,
|
587 |
+
reload=True
|
588 |
+
)
|
cleanup.py
ADDED
@@ -0,0 +1,130 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
#!/usr/bin/env python3
|
2 |
+
"""
|
3 |
+
Cleanup script to remove all web crawler data from MongoDB
|
4 |
+
and list files to be removed
|
5 |
+
"""
|
6 |
+
|
7 |
+
import os
|
8 |
+
import sys
|
9 |
+
import logging
|
10 |
+
import shutil
|
11 |
+
from pymongo import MongoClient
|
12 |
+
|
13 |
+
# Configure logging
|
14 |
+
logging.basicConfig(
|
15 |
+
level=logging.INFO,
|
16 |
+
format='%(asctime)s [%(name)s] %(levelname)s: %(message)s'
|
17 |
+
)
|
18 |
+
logger = logging.getLogger("cleanup")
|
19 |
+
|
20 |
+
def cleanup_mongodb():
|
21 |
+
"""Remove all web crawler data from MongoDB"""
|
22 |
+
try:
|
23 |
+
# Connect to MongoDB
|
24 |
+
logger.info("Connecting to MongoDB...")
|
25 |
+
client = MongoClient("mongodb://localhost:27017/")
|
26 |
+
|
27 |
+
# Access crawler database
|
28 |
+
db = client["crawler"]
|
29 |
+
|
30 |
+
# List and drop all collections
|
31 |
+
collections = db.list_collection_names()
|
32 |
+
|
33 |
+
if not collections:
|
34 |
+
logger.info("No collections found in the crawler database")
|
35 |
+
else:
|
36 |
+
logger.info(f"Found {len(collections)} collections to drop: {collections}")
|
37 |
+
|
38 |
+
for collection in collections:
|
39 |
+
logger.info(f"Dropping collection: {collection}")
|
40 |
+
db[collection].drop()
|
41 |
+
|
42 |
+
logger.info("All crawler collections dropped successfully")
|
43 |
+
|
44 |
+
# Optional: Drop the entire database
|
45 |
+
# client.drop_database("crawler")
|
46 |
+
# logger.info("Dropped entire crawler database")
|
47 |
+
|
48 |
+
logger.info("MongoDB cleanup completed")
|
49 |
+
|
50 |
+
except Exception as e:
|
51 |
+
logger.error(f"Error cleaning up MongoDB: {e}")
|
52 |
+
return False
|
53 |
+
|
54 |
+
return True
|
55 |
+
|
56 |
+
def cleanup_files():
|
57 |
+
"""List and remove files related to simple_crawler"""
|
58 |
+
try:
|
59 |
+
crawler_dir = os.path.dirname(os.path.abspath(__file__))
|
60 |
+
|
61 |
+
# Files directly related to simple_crawler
|
62 |
+
simple_crawler_files = [
|
63 |
+
os.path.join(crawler_dir, "simple_crawler.py"),
|
64 |
+
os.path.join(crawler_dir, "README_SIMPLE.md"),
|
65 |
+
os.path.join(crawler_dir, "simple_crawler.log")
|
66 |
+
]
|
67 |
+
|
68 |
+
# Check storage directories
|
69 |
+
storage_dir = os.path.join(crawler_dir, "storage")
|
70 |
+
if os.path.exists(storage_dir):
|
71 |
+
logger.info(f"Will remove storage directory: {storage_dir}")
|
72 |
+
simple_crawler_files.append(storage_dir)
|
73 |
+
|
74 |
+
# List all files that will be removed
|
75 |
+
logger.info("The following files will be removed:")
|
76 |
+
for file_path in simple_crawler_files:
|
77 |
+
if os.path.exists(file_path):
|
78 |
+
logger.info(f" - {file_path}")
|
79 |
+
else:
|
80 |
+
logger.info(f" - {file_path} (not found)")
|
81 |
+
|
82 |
+
# Confirm removal
|
83 |
+
confirm = input("Do you want to proceed with removal? (y/n): ")
|
84 |
+
if confirm.lower() != 'y':
|
85 |
+
logger.info("File removal cancelled")
|
86 |
+
return False
|
87 |
+
|
88 |
+
# Remove files and directories
|
89 |
+
for file_path in simple_crawler_files:
|
90 |
+
if os.path.exists(file_path):
|
91 |
+
if os.path.isdir(file_path):
|
92 |
+
logger.info(f"Removing directory: {file_path}")
|
93 |
+
shutil.rmtree(file_path)
|
94 |
+
else:
|
95 |
+
logger.info(f"Removing file: {file_path}")
|
96 |
+
os.remove(file_path)
|
97 |
+
|
98 |
+
logger.info("File cleanup completed")
|
99 |
+
|
100 |
+
except Exception as e:
|
101 |
+
logger.error(f"Error cleaning up files: {e}")
|
102 |
+
return False
|
103 |
+
|
104 |
+
return True
|
105 |
+
|
106 |
+
if __name__ == "__main__":
|
107 |
+
print("Web Crawler Cleanup Utility")
|
108 |
+
print("---------------------------")
|
109 |
+
print("This script will:")
|
110 |
+
print("1. Remove all web crawler collections from MongoDB")
|
111 |
+
print("2. List and remove files related to simple_crawler")
|
112 |
+
print()
|
113 |
+
|
114 |
+
proceed = input("Do you want to proceed? (y/n): ")
|
115 |
+
if proceed.lower() != 'y':
|
116 |
+
print("Cleanup cancelled")
|
117 |
+
sys.exit(0)
|
118 |
+
|
119 |
+
# Clean up MongoDB
|
120 |
+
print("\nStep 1: Cleaning up MongoDB...")
|
121 |
+
mongo_success = cleanup_mongodb()
|
122 |
+
|
123 |
+
# Clean up files
|
124 |
+
print("\nStep 2: Cleaning up files...")
|
125 |
+
files_success = cleanup_files()
|
126 |
+
|
127 |
+
# Summary
|
128 |
+
print("\nCleanup Summary:")
|
129 |
+
print(f"MongoDB cleanup: {'Completed' if mongo_success else 'Failed'}")
|
130 |
+
print(f"File cleanup: {'Completed' if files_success else 'Failed'}")
|
cleanup_all.sh
ADDED
@@ -0,0 +1,47 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
#!/bin/bash
|
2 |
+
# Master cleanup script for web crawler - runs both MongoDB and file cleanup
|
3 |
+
|
4 |
+
set -e # Exit on error
|
5 |
+
|
6 |
+
echo "====================================================="
|
7 |
+
echo " WEB CRAWLER COMPLETE CLEANUP "
|
8 |
+
echo "====================================================="
|
9 |
+
echo
|
10 |
+
|
11 |
+
# Get script directory
|
12 |
+
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
|
13 |
+
cd "$SCRIPT_DIR"
|
14 |
+
|
15 |
+
# Check if scripts exist
|
16 |
+
if [ ! -f "./mongo_cleanup.py" ] || [ ! -f "./file_cleanup.py" ]; then
|
17 |
+
echo "Error: Required cleanup scripts not found in $SCRIPT_DIR"
|
18 |
+
exit 1
|
19 |
+
fi
|
20 |
+
|
21 |
+
# Ensure scripts are executable
|
22 |
+
chmod +x ./mongo_cleanup.py
|
23 |
+
chmod +x ./file_cleanup.py
|
24 |
+
|
25 |
+
# Step 1: MongoDB cleanup
|
26 |
+
echo "Step 1: MongoDB Cleanup"
|
27 |
+
echo "----------------------"
|
28 |
+
if [ "$1" == "--force" ]; then
|
29 |
+
python3 ./mongo_cleanup.py --force
|
30 |
+
else
|
31 |
+
python3 ./mongo_cleanup.py
|
32 |
+
fi
|
33 |
+
|
34 |
+
# Step 2: File cleanup
|
35 |
+
echo
|
36 |
+
echo "Step 2: File Cleanup"
|
37 |
+
echo "------------------"
|
38 |
+
if [ "$1" == "--force" ]; then
|
39 |
+
python3 ./file_cleanup.py --force
|
40 |
+
else
|
41 |
+
python3 ./file_cleanup.py
|
42 |
+
fi
|
43 |
+
|
44 |
+
echo
|
45 |
+
echo "====================================================="
|
46 |
+
echo " CLEANUP PROCESS COMPLETED "
|
47 |
+
echo "====================================================="
|
config.py
ADDED
@@ -0,0 +1,96 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
"""
|
2 |
+
Configuration settings for the web crawler
|
3 |
+
"""
|
4 |
+
|
5 |
+
import os
|
6 |
+
from typing import Dict, List, Any, Optional
|
7 |
+
|
8 |
+
# General settings
|
9 |
+
MAX_WORKERS = 100 # Maximum number of worker threads/processes
|
10 |
+
MAX_DEPTH = 10 # Maximum depth to crawl from seed URLs
|
11 |
+
CRAWL_TIMEOUT = 30 # Timeout for HTTP requests in seconds
|
12 |
+
USER_AGENT = "Mozilla/5.0 WebCrawler/1.0 (+https://example.org/bot)"
|
13 |
+
|
14 |
+
# Politeness settings
|
15 |
+
ROBOTSTXT_OBEY = True # Whether to obey robots.txt rules
|
16 |
+
DOWNLOAD_DELAY = 1.0 # Delay between requests to the same domain (seconds)
|
17 |
+
MAX_REQUESTS_PER_DOMAIN = 10 # Maximum concurrent requests per domain
|
18 |
+
RESPECT_CRAWL_DELAY = True # Respect Crawl-delay in robots.txt
|
19 |
+
RETRY_TIMES = 3 # Number of retries for failed requests
|
20 |
+
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429] # HTTP codes to retry
|
21 |
+
|
22 |
+
# URL settings
|
23 |
+
ALLOWED_DOMAINS: Optional[List[str]] = None # Domains to restrict crawling to (None = all domains)
|
24 |
+
EXCLUDED_DOMAINS: List[str] = [] # Domains to exclude from crawling
|
25 |
+
ALLOWED_SCHEMES = ["http", "https"] # URL schemes to allow
|
26 |
+
URL_FILTERS = [
|
27 |
+
# Only filter out binary and media files
|
28 |
+
r".*\.(jpg|jpeg|gif|png|ico|mp3|mp4|wav|avi|mov|mpeg|pdf|zip|rar|gz|exe|dmg|pkg|iso|bin)$",
|
29 |
+
] # Regex patterns to filter out URLs
|
30 |
+
|
31 |
+
# Storage settings
|
32 |
+
MONGODB_URI = "mongodb://localhost:27017/"
|
33 |
+
MONGODB_DB = "webcrawler"
|
34 |
+
REDIS_URI = "redis://localhost:6379/0"
|
35 |
+
STORAGE_PATH = os.path.join(os.path.dirname(__file__), "storage")
|
36 |
+
HTML_STORAGE_PATH = os.path.join(STORAGE_PATH, "html")
|
37 |
+
LOG_PATH = os.path.join(STORAGE_PATH, "logs")
|
38 |
+
|
39 |
+
# Frontier settings
|
40 |
+
FRONTIER_QUEUE_SIZE = 100000 # Maximum number of URLs in the frontier queue
|
41 |
+
PRIORITY_QUEUE_NUM = 5 # Number of priority queues
|
42 |
+
HOST_QUEUE_NUM = 1000 # Number of host queues for politeness
|
43 |
+
|
44 |
+
# Content settings
|
45 |
+
MAX_CONTENT_SIZE = 10 * 1024 * 1024 # Maximum size of HTML content to download (10MB)
|
46 |
+
ALLOWED_CONTENT_TYPES = [
|
47 |
+
"text/html",
|
48 |
+
"application/xhtml+xml",
|
49 |
+
"text/plain", # Some servers might serve HTML as text/plain
|
50 |
+
"application/html",
|
51 |
+
"*/*", # Accept any content type
|
52 |
+
] # Allowed content types
|
53 |
+
|
54 |
+
# DNS settings
|
55 |
+
DNS_CACHE_SIZE = 10000 # Maximum number of entries in DNS cache
|
56 |
+
DNS_CACHE_TIMEOUT = 3600 # DNS cache timeout in seconds
|
57 |
+
|
58 |
+
# Logging settings
|
59 |
+
LOG_LEVEL = "INFO"
|
60 |
+
LOG_FORMAT = "%(asctime)s [%(name)s] %(levelname)s: %(message)s"
|
61 |
+
|
62 |
+
# Seed URLs
|
63 |
+
SEED_URLS = [
|
64 |
+
"https://en.wikipedia.org/",
|
65 |
+
"https://www.nytimes.com/",
|
66 |
+
"https://www.bbc.com/",
|
67 |
+
"https://www.github.com/",
|
68 |
+
"https://www.reddit.com/",
|
69 |
+
]
|
70 |
+
|
71 |
+
# Override settings with environment variables
|
72 |
+
def get_env_settings() -> Dict[str, Any]:
|
73 |
+
"""Get settings from environment variables"""
|
74 |
+
env_settings = {}
|
75 |
+
|
76 |
+
for key, value in globals().items():
|
77 |
+
if key.isupper(): # Only consider uppercase variables as settings
|
78 |
+
env_value = os.environ.get(f"WEBCRAWLER_{key}")
|
79 |
+
if env_value is not None:
|
80 |
+
# Convert to appropriate type based on default value
|
81 |
+
if isinstance(value, bool):
|
82 |
+
env_settings[key] = env_value.lower() in ("true", "1", "yes")
|
83 |
+
elif isinstance(value, int):
|
84 |
+
env_settings[key] = int(env_value)
|
85 |
+
elif isinstance(value, float):
|
86 |
+
env_settings[key] = float(env_value)
|
87 |
+
elif isinstance(value, list):
|
88 |
+
# Assume comma-separated values
|
89 |
+
env_settings[key] = [item.strip() for item in env_value.split(",")]
|
90 |
+
else:
|
91 |
+
env_settings[key] = env_value
|
92 |
+
|
93 |
+
return env_settings
|
94 |
+
|
95 |
+
# Update settings with environment variables
|
96 |
+
globals().update(get_env_settings())
|
crawl.py
ADDED
@@ -0,0 +1,370 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
#!/usr/bin/env python3
"""
Command-line interface for the web crawler.

Usage:
    crawl.py start [--workers=<num>] [--async] [--seed=<url>...]
    crawl.py stop
    crawl.py pause
    crawl.py resume
    crawl.py stats
    crawl.py clean [--days=<days>]
    crawl.py export [--format=<format>] [--output=<file>]
    crawl.py set-max-depth <depth>
    crawl.py add-seed <url>...
    crawl.py (-h | --help)
    crawl.py --version

Options:
    -h --help            Show this help message
    --version            Show version
    --workers=<num>      Number of worker threads [default: 4]
    --async              Use asynchronous mode
    --seed=<url>         Seed URL(s) to start crawling
    --days=<days>        Days threshold for data cleaning [default: 90]
    --format=<format>    Export format (json, csv) [default: json]
    --output=<file>      Output file path [default: crawl_data.json]
"""

import os
import sys
import time
import json
import signal
import logging
import csv
from typing import List, Dict, Any
from docopt import docopt
import datetime
import traceback

from models import URL, URLStatus, Priority
from crawler import Crawler
import config

# Configure logging
logging.basicConfig(
    level=getattr(logging, config.LOG_LEVEL),
    format=config.LOG_FORMAT
)
logger = logging.getLogger(__name__)

# Global crawler instance
crawler = None


def initialize_crawler() -> Crawler:
    """Initialize the crawler instance"""
    global crawler
    if crawler is None:
        crawler = Crawler()
    return crawler


def start_crawler(workers: int, async_mode: bool, seed_urls: List[str]) -> None:
    """
    Start the crawler

    Args:
        workers: Number of worker threads
        async_mode: Whether to use async mode
        seed_urls: List of seed URLs to add
    """
    crawler = initialize_crawler()

    # Add seed URLs if provided
    if seed_urls:
        num_added = crawler.add_seed_urls(seed_urls)
        logger.info(f"Added {num_added} seed URLs")

    # Start crawler
    try:
        crawler.start(num_workers=workers, async_mode=async_mode)
    except KeyboardInterrupt:
        logger.info("Crawler interrupted by user")
        crawler.stop()
    except Exception as e:
        logger.error(f"Error starting crawler: {e}")
        logger.error(traceback.format_exc())
        crawler.stop()


def stop_crawler() -> None:
    """Stop the crawler"""
    if crawler is None:
        logger.error("Crawler is not running")
        return

    crawler.stop()
    logger.info("Crawler stopped")


def pause_crawler() -> None:
    """Pause the crawler"""
    if crawler is None:
        logger.error("Crawler is not running")
        return

    crawler.pause()
    logger.info("Crawler paused")


def resume_crawler() -> None:
    """Resume the crawler"""
    if crawler is None:
        logger.error("Crawler is not running")
        return

    crawler.resume()
    logger.info("Crawler resumed")


def show_stats() -> None:
    """Show crawler statistics"""
    if crawler is None:
        logger.error("Crawler is not running")
        return

    # Get crawler stats
    stats = crawler.stats

    # Calculate elapsed time
    elapsed = time.time() - stats['start_time']
    elapsed_str = str(datetime.timedelta(seconds=int(elapsed)))

    # Format statistics
    print("\n=== Crawler Statistics ===")
    print(f"Running time: {elapsed_str}")
    print(f"Pages crawled: {stats['pages_crawled']}")
    print(f"Pages failed: {stats['pages_failed']}")
    print(f"URLs discovered: {stats['urls_discovered']}")
    print(f"URLs filtered: {stats['urls_filtered']}")

    # Calculate pages per second
    pages_per_second = stats['pages_crawled'] / elapsed if elapsed > 0 else 0
    print(f"Crawl rate: {pages_per_second:.2f} pages/second")

    # Domain statistics
    domains = len(stats['domains_crawled'])
    print(f"Domains crawled: {domains}")

    # Status code statistics
    print("\n--- HTTP Status Codes ---")
    for status, count in sorted(stats['status_codes'].items()):
        print(f"  {status}: {count}")

    # Content type statistics
    print("\n--- Content Types ---")
    for content_type, count in sorted(stats['content_types'].items(), key=lambda x: x[1], reverse=True)[:10]:
        print(f"  {content_type}: {count}")

    # Frontier size
    print(f"\nFrontier size: {crawler.frontier.size()}")

    # DNS cache statistics
    dns_stats = crawler.dns_resolver.get_stats()
    print(f"\nDNS cache: {dns_stats['hit_count']} hits, {dns_stats['miss_count']} misses, {dns_stats['size']} entries")

    print("\n=========================\n")


def clean_data(days: int) -> None:
    """
    Clean old data

    Args:
        days: Days threshold for data cleaning
    """
    try:
        if crawler is None:
            initialize_crawler()

        # Get MongoDB connection
        storage = crawler.mongo_client

        # Clean old pages
        old_pages = storage.clean_old_pages(days)

        # Clean failed URLs
        failed_urls = storage.clean_failed_urls()

        logger.info(f"Cleaned {old_pages} old pages and {failed_urls} failed URLs")
        print(f"Cleaned {old_pages} old pages and {failed_urls} failed URLs")
    except Exception as e:
        logger.error(f"Error cleaning data: {e}")
        print(f"Error cleaning data: {e}")

def export_data(export_format: str, output_file: str) -> None:
    """
    Export crawler data

    Args:
        export_format: Format to export (json, csv)
        output_file: Output file path
    """
    try:
        if crawler is None:
            initialize_crawler()

        # Read from the collections the Crawler created ('pages', 'urls', 'stats')
        pages = list(crawler.pages_collection.find({}, {'_id': 0}))
        urls = list(crawler.urls_collection.find({}, {'_id': 0}))
        stats = list(crawler.stats_collection.find({}, {'_id': 0}))

        # Prepare export data
        export_data = {
            'metadata': {
                'exported_at': datetime.datetime.now().isoformat(),
                'pages_count': len(pages),
                'urls_count': len(urls),
                'stats_count': len(stats),
            },
            'pages': pages,
            'urls': urls,
            'stats': stats
        }

        # Convert datetime objects to strings
        export_data = json.loads(json.dumps(export_data, default=str))

        # Export based on format
        if export_format.lower() == 'json':
            with open(output_file, 'w') as f:
                json.dump(export_data, f, indent=2)
            logger.info(f"Data exported to {output_file} in JSON format")
            print(f"Data exported to {output_file} in JSON format")
        elif export_format.lower() == 'csv':
            # Split export into multiple CSV files
            base_name = os.path.splitext(output_file)[0]

            # Export pages
            pages_file = f"{base_name}_pages.csv"
            if pages:
                with open(pages_file, 'w', newline='') as f:
                    writer = csv.DictWriter(f, fieldnames=pages[0].keys())
                    writer.writeheader()
                    writer.writerows(pages)

            # Export URLs
            urls_file = f"{base_name}_urls.csv"
            if urls:
                with open(urls_file, 'w', newline='') as f:
                    writer = csv.DictWriter(f, fieldnames=urls[0].keys())
                    writer.writeheader()
                    writer.writerows(urls)

            # Export stats
            stats_file = f"{base_name}_stats.csv"
            if stats:
                with open(stats_file, 'w', newline='') as f:
                    writer = csv.DictWriter(f, fieldnames=stats[0].keys())
                    writer.writeheader()
                    writer.writerows(stats)

            logger.info(f"Data exported to {base_name}_*.csv files in CSV format")
            print(f"Data exported to {base_name}_*.csv files in CSV format")
        else:
            logger.error(f"Unsupported export format: {export_format}")
            print(f"Unsupported export format: {export_format}")
    except Exception as e:
        logger.error(f"Error exporting data: {e}")
        print(f"Error exporting data: {e}")


def set_max_depth(depth: int) -> None:
    """
    Set maximum crawl depth

    Args:
        depth: Maximum crawl depth
    """
    try:
        depth = int(depth)
        if depth < 0:
            logger.error("Depth must be a non-negative integer")
            print("Depth must be a non-negative integer")
            return

        # Update configuration
        config.MAX_DEPTH = depth

        logger.info(f"Maximum crawl depth set to {depth}")
        print(f"Maximum crawl depth set to {depth}")
    except ValueError:
        logger.error("Depth must be a valid integer")
        print("Depth must be a valid integer")


def add_seed_urls(urls: List[str]) -> None:
    """
    Add seed URLs to the crawler

    Args:
        urls: List of URLs to add
    """
    if crawler is None:
        initialize_crawler()

    num_added = crawler.add_seed_urls(urls)
    logger.info(f"Added {num_added} seed URLs")
    print(f"Added {num_added} seed URLs")


def handle_signal(sig, frame):
    """Handle signal interrupts"""
    if sig == signal.SIGINT:
        logger.info("Received SIGINT, stopping crawler")
        stop_crawler()
        sys.exit(0)
    elif sig == signal.SIGTERM:
        logger.info("Received SIGTERM, stopping crawler")
        stop_crawler()
        sys.exit(0)


def main():
    """Main entry point"""
    # Register signal handlers
    signal.signal(signal.SIGINT, handle_signal)
    signal.signal(signal.SIGTERM, handle_signal)

    # Parse arguments
    args = docopt(__doc__, version='Web Crawler 1.0')

    # Handle commands
    if args['start']:
        workers = int(args['--workers'])
        async_mode = args['--async']
        seed_urls = args['--seed'] if args['--seed'] else []
        start_crawler(workers, async_mode, seed_urls)
    elif args['stop']:
        stop_crawler()
    elif args['pause']:
        pause_crawler()
    elif args['resume']:
        resume_crawler()
    elif args['stats']:
        show_stats()
    elif args['clean']:
        days = int(args['--days'])
        clean_data(days)
    elif args['export']:
        export_format = args['--format']
        output_file = args['--output']
        export_data(export_format, output_file)
    elif args['set-max-depth']:
        depth = args['<depth>']
        set_max_depth(depth)
    elif args['add-seed']:
        urls = args['<url>']
        add_seed_urls(urls)
    else:
        print(__doc__)


if __name__ == '__main__':
    main()
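For reference, the commands documented in the docstring above map directly onto the module-level helpers, so the same actions can be scripted from Python. A minimal sketch, assuming the project directory is on the import path; the seed URL is a placeholder:

    import crawl

    crawl.initialize_crawler()                      # build the global Crawler instance
    crawl.add_seed_urls(['https://example.com'])    # same effect as: crawl.py add-seed https://example.com
    crawl.start_crawler(workers=4, async_mode=False, seed_urls=[])  # same effect as: crawl.py start --workers=4
    # start_crawler() blocks until the crawl is stopped or interrupted.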
crawler.log
ADDED
File without changes
crawler.py
ADDED
@@ -0,0 +1,908 @@
"""
Main crawler class to coordinate the web crawling process
"""

import time
import logging
import os
import asyncio
import threading
from typing import List, Dict, Set, Tuple, Optional, Any, Callable
from concurrent.futures import ThreadPoolExecutor
import signal
import json
from datetime import datetime
from urllib.parse import urlparse
import traceback
from pymongo import MongoClient
from prometheus_client import Counter, Gauge, Histogram, start_http_server, REGISTRY
import redis

from models import URL, Page, URLStatus, Priority
from frontier import URLFrontier
from downloader import HTMLDownloader
from parser import HTMLParser
from robots import RobotsHandler
from dns_resolver import DNSResolver
import config
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())


# Check if we're in deployment mode
IS_DEPLOYMENT = os.getenv('DEPLOYMENT', 'false').lower() == 'true'

# Import local configuration if available
try:
    import local_config
    # Override config settings with local settings
    for key in dir(local_config):
        if key.isupper():
            setattr(config, key, getattr(local_config, key))
    print(f"Loaded local configuration from {local_config.__file__}")
except ImportError:
    pass

# Configure logging
logging.basicConfig(
    level=getattr(logging, config.LOG_LEVEL),
    format=config.LOG_FORMAT
)
logger = logging.getLogger(__name__)


class Crawler:
    """
    Main crawler class that coordinates the web crawling process

    Manages:
    - URL Frontier
    - HTML Downloader
    - HTML Parser
    - Content Storage
    - Monitoring and Statistics
    """

    def __init__(self,
                 mongo_uri: Optional[str] = None,
                 redis_uri: Optional[str] = None,
                 metrics_port: int = 9100,
                 storage: Optional[Any] = None):
        """
        Initialize the crawler

        Args:
            mongo_uri: MongoDB URI for content storage
            redis_uri: Redis URI for URL frontier
            metrics_port: Port for Prometheus metrics server
            storage: Optional storage backend for deployment mode
        """
        self.storage = storage
        self.metrics_port = metrics_port

        # Initialize database connections only if not using custom storage
        if storage is None:
            self.mongo_uri = mongo_uri or config.MONGODB_URI
            self.redis_uri = redis_uri or config.REDIS_URI

            # Connect to MongoDB
            self.mongo_client = MongoClient(self.mongo_uri)
            self.db = self.mongo_client[config.MONGODB_DB]
            self.pages_collection = self.db['pages']
            self.urls_collection = self.db['urls']
            self.stats_collection = self.db['stats']

            # Ensure indexes
            self._create_indexes()

            # Create frontier with Redis
            self.frontier = URLFrontier(redis_client=redis.from_url(self.redis_uri))
        else:
            # In deployment mode, use in-memory storage
            self.frontier = URLFrontier(use_memory=True)

        # Create other components that don't need database connections
        self.robots_handler = RobotsHandler()
        self.dns_resolver = DNSResolver()
        self.downloader = HTMLDownloader(self.dns_resolver, self.robots_handler)
        self.parser = HTMLParser()

        # Initialize statistics
        self.stats = {
            'pages_crawled': 0,
            'pages_failed': 0,
            'urls_discovered': 0,
            'urls_filtered': 0,
            'start_time': time.time(),
            'domains_crawled': set(),
            'content_types': {},
            'status_codes': {},
        }

        # Set up metrics only in local mode
        if not IS_DEPLOYMENT:
            self._setup_metrics()
        else:
            # In deployment mode, use dummy metrics that do nothing
            self.pages_crawled_counter = DummyMetric()
            self.pages_failed_counter = DummyMetric()
            self.urls_discovered_counter = DummyMetric()
            self.urls_filtered_counter = DummyMetric()
            self.frontier_size_gauge = DummyMetric()
            self.active_threads_gauge = DummyMetric()
            self.download_time_histogram = DummyMetric()
            self.page_size_histogram = DummyMetric()

        # Flags to control crawling
        self.running = False
        self.paused = False
        self.stop_event = threading.Event()

        # Create storage directories if they don't exist
        os.makedirs(config.HTML_STORAGE_PATH, exist_ok=True)
        os.makedirs(config.LOG_PATH, exist_ok=True)

    def _create_indexes(self):
        """Create indexes for MongoDB collections"""
        try:
            # Pages collection indexes
            self.pages_collection.create_index('url', unique=True)
            self.pages_collection.create_index('content_hash')
            self.pages_collection.create_index('crawled_at')

            # URLs collection indexes
            self.urls_collection.create_index('url', unique=True)
            self.urls_collection.create_index('normalized_url', unique=True)
            self.urls_collection.create_index('domain')
            self.urls_collection.create_index('status')
            self.urls_collection.create_index('priority')

            logger.info("MongoDB indexes created")
        except Exception as e:
            logger.error(f"Error creating MongoDB indexes: {e}")

    def _setup_metrics(self):
        """Set up Prometheus metrics"""
        # Clean up any existing metrics
        collectors_to_remove = []
        for collector in REGISTRY._collector_to_names:
            for name in REGISTRY._collector_to_names[collector]:
                if name.startswith('crawler_'):
                    collectors_to_remove.append(collector)
                    break

        for collector in collectors_to_remove:
            REGISTRY.unregister(collector)

        # Counters
        self.pages_crawled_counter = Counter('crawler_pages_crawled_total', 'Total pages crawled')
        self.pages_failed_counter = Counter('crawler_pages_failed_total', 'Total pages failed')
        self.urls_discovered_counter = Counter('crawler_urls_discovered_total', 'Total URLs discovered')
        self.urls_filtered_counter = Counter('crawler_urls_filtered_total', 'Total URLs filtered')

        # Gauges
        self.frontier_size_gauge = Gauge('crawler_frontier_size', 'Size of URL frontier')
        self.active_threads_gauge = Gauge('crawler_active_threads', 'Number of active crawler threads')

        # Histograms
        self.download_time_histogram = Histogram('crawler_download_time_seconds', 'Time to download pages')
        self.page_size_histogram = Histogram('crawler_page_size_bytes', 'Size of downloaded pages')

        # Start metrics server
        try:
            start_http_server(self.metrics_port)
            logger.info(f"Metrics server started on port {self.metrics_port}")
        except Exception as e:
            logger.error(f"Error starting metrics server: {e}")

    def add_seed_urls(self, urls: List[str], priority: Priority = Priority.VERY_HIGH) -> int:
        """
        Add seed URLs to the frontier

        Args:
            urls: List of URLs to add
            priority: Priority for the seed URLs

        Returns:
            Number of URLs added
        """
        added = 0
        for url in urls:
            url_obj = URL(
                url=url,
                status=URLStatus.PENDING,
                priority=priority,
                depth=0  # Seed URLs are at depth 0
            )

            # Save URL based on storage mode
            try:
                if self.storage is not None:
                    # Use custom storage in deployment mode
                    self.storage.add_url(url_obj)
                else:
                    # Use MongoDB in local mode
                    self.urls_collection.update_one(
                        {'url': url},
                        {'$set': url_obj.dict()},
                        upsert=True
                    )
            except Exception as e:
                logger.error(f"Error saving seed URL to database: {e}")

            # Add to frontier
            if self.frontier.add_url(url_obj):
                added += 1
                self.urls_discovered_counter.inc()
                logger.info(f"Added seed URL: {url}")

        return added

    def start(self, num_workers: int = None, async_mode: bool = False) -> None:
        """
        Start the crawler

        Args:
            num_workers: Number of worker threads
            async_mode: Whether to use async mode
        """
        if self.running:
            logger.warning("Crawler is already running")
            return

        num_workers = num_workers or config.MAX_WORKERS

        # Reset stop event
        self.stop_event.clear()

        # Add seed URLs if frontier is empty
        if self.frontier.size() == 0:
            logger.info("Adding seed URLs")
            self.add_seed_urls(config.SEED_URLS)

        # Start crawler
        self.running = True

        # Register signal handlers
        self._register_signal_handlers()

        logger.info(f"Starting crawler with {num_workers} workers")

        if async_mode:
            # Use asyncio for crawler
            try:
                loop = asyncio.get_event_loop()
                loop.run_until_complete(self._crawl_async(num_workers))
            except KeyboardInterrupt:
                logger.info("Crawler stopped by user")
            except Exception as e:
                logger.error(f"Error in async crawler: {e}")
                logger.error(traceback.format_exc())
            finally:
                self._cleanup()
        else:
            # Use threads for crawler
            with ThreadPoolExecutor(max_workers=num_workers) as executor:
                try:
                    # Submit worker tasks
                    futures = [executor.submit(self._crawl_worker) for _ in range(num_workers)]

                    # Wait for completion
                    for future in futures:
                        future.result()
                except KeyboardInterrupt:
                    logger.info("Crawler stopped by user")
                except Exception as e:
                    logger.error(f"Error in threaded crawler: {e}")
                    logger.error(traceback.format_exc())
                finally:
                    self._cleanup()

    def _register_signal_handlers(self) -> None:
        """Register signal handlers for graceful shutdown"""
        def signal_handler(sig, frame):
            logger.info(f"Received signal {sig}, shutting down")
            self.stop()

        signal.signal(signal.SIGINT, signal_handler)
        signal.signal(signal.SIGTERM, signal_handler)

    def _crawl_worker(self) -> None:
        """Worker function for threaded crawler"""
        try:
            self.active_threads_gauge.inc()

            while self.running and not self.stop_event.is_set():
                # Check if paused
                if self.paused:
                    time.sleep(1)
                    continue

                # Get next URL from frontier
                url_obj = self.frontier.get_next_url()

                # No URL available, wait and retry
                if url_obj is None:
                    time.sleep(1)
                    continue

                try:
                    # Process the URL
                    self._process_url(url_obj)

                    # Update statistics
                    self._update_stats()

                except Exception as e:
                    logger.error(f"Error processing URL {url_obj.url}: {e}")
                    logger.error(traceback.format_exc())

                    # Update URL status to failed
                    self._mark_url_failed(url_obj, str(e))
        except Exception as e:
            logger.error(f"Unhandled error in worker thread: {e}")
            logger.error(traceback.format_exc())
        finally:
            self.active_threads_gauge.dec()

    async def _crawl_async(self, num_workers: int) -> None:
        """Async worker function for asyncio crawler"""
        try:
            self.active_threads_gauge.inc(num_workers)

            # Create tasks
            tasks = [self._async_worker() for _ in range(num_workers)]

            # Wait for all tasks to complete
            await asyncio.gather(*tasks)

        except Exception as e:
            logger.error(f"Unhandled error in async crawler: {e}")
            logger.error(traceback.format_exc())
        finally:
            self.active_threads_gauge.dec(num_workers)

    async def _async_worker(self) -> None:
        """Async worker function"""
        try:
            while self.running and not self.stop_event.is_set():
                # Check if paused
                if self.paused:
                    await asyncio.sleep(1)
                    continue

                # Get next URL from frontier
                url_obj = self.frontier.get_next_url()

                # No URL available, wait and retry
                if url_obj is None:
                    await asyncio.sleep(1)
                    continue

                try:
                    # Process the URL
                    await self._process_url_async(url_obj)

                    # Update statistics
                    self._update_stats()

                except Exception as e:
                    logger.error(f"Error processing URL {url_obj.url}: {e}")
                    logger.error(traceback.format_exc())

                    # Update URL status to failed
                    self._mark_url_failed(url_obj, str(e))
        except Exception as e:
            logger.error(f"Unhandled error in async worker: {e}")
            logger.error(traceback.format_exc())

    def _process_url(self, url_obj: URL) -> None:
        """
        Process a URL

        Args:
            url_obj: URL object to process
        """
        url = url_obj.url
        logger.debug(f"Processing URL: {url}")

        # Download page
        with self.download_time_histogram.time():
            page = self.downloader.download(url_obj)

        # If download failed
        if page is None:
            self.pages_failed_counter.inc()
            self.stats['pages_failed'] += 1
            self._mark_url_failed(url_obj, url_obj.error or "Download failed")
            return

        # Record page size
        self.page_size_histogram.observe(page.content_length)

        # Check for duplicate content
        content_hash = page.content_hash
        duplicate = self._check_duplicate_content(content_hash, url)

        if duplicate:
            logger.info(f"Duplicate content detected for URL {url}")
            page.is_duplicate = True

            # Mark URL as duplicate but still store the page
            self._mark_url_completed(url_obj)
        else:
            # Parse page and extract URLs
            extracted_urls, metadata = self.parser.parse(page)

            # Store page metadata
            page.metadata = metadata

            # Process extracted URLs
            self._process_extracted_urls(extracted_urls, url_obj, metadata)

            # Mark URL as completed
            self._mark_url_completed(url_obj)

        # Store page
        self._store_page(page)

        # Update statistics
        self.pages_crawled_counter.inc()
        self.stats['pages_crawled'] += 1

        # Add domain to statistics
        domain = url_obj.domain
        self.stats['domains_crawled'].add(domain)

        # Update content type statistics
        content_type = page.content_type.split(';')[0].strip()
        self.stats['content_types'][content_type] = self.stats['content_types'].get(content_type, 0) + 1

        # Update status code statistics
        status_code = page.status_code
        self.stats['status_codes'][str(status_code)] = self.stats['status_codes'].get(str(status_code), 0) + 1

    async def _process_url_async(self, url_obj: URL) -> None:
        """
        Process a URL asynchronously

        Args:
            url_obj: URL object to process
        """
        url = url_obj.url
        logger.debug(f"Processing URL (async): {url}")

        # Download page
        download_start = time.time()
        page = await self.downloader.download_async(url_obj)
        download_time = time.time() - download_start
        self.download_time_histogram.observe(download_time)

        # If download failed
        if page is None:
            self.pages_failed_counter.inc()
            self.stats['pages_failed'] += 1
            self._mark_url_failed(url_obj, url_obj.error or "Download failed")
            return

        # Record page size
        self.page_size_histogram.observe(page.content_length)

        # Check for duplicate content
        content_hash = page.content_hash
        duplicate = self._check_duplicate_content(content_hash, url)

        if duplicate:
            logger.info(f"Duplicate content detected for URL {url}")
            page.is_duplicate = True

            # Mark URL as duplicate but still store the page
            self._mark_url_completed(url_obj)
        else:
            # Parse page and extract URLs
            extracted_urls, metadata = self.parser.parse(page)

            # Store page metadata
            page.metadata = metadata

            # Process extracted URLs
            self._process_extracted_urls(extracted_urls, url_obj, metadata)

            # Mark URL as completed
            self._mark_url_completed(url_obj)

        # Store page
        self._store_page(page)

        # Update statistics
        self.pages_crawled_counter.inc()
        self.stats['pages_crawled'] += 1

    def _check_duplicate_content(self, content_hash: str, url: str) -> bool:
        """
        Check if content has been seen before

        Args:
            content_hash: Hash of the content
            url: URL of the page

        Returns:
            True if content is a duplicate, False otherwise
        """
        try:
            if self.storage is not None:
                # Use custom storage - simplified duplicate check
                for page in self.storage.pages.values():
                    if page.content_hash == content_hash and page.url != url:
                        return True
                return False
            else:
                # Use MongoDB
                return self.pages_collection.find_one({
                    'content_hash': content_hash,
                    'url': {'$ne': url}
                }) is not None
        except Exception as e:
            logger.error(f"Error checking for duplicate content: {e}")
            return False

    def _process_extracted_urls(self, urls: List[str], parent_url_obj: URL, metadata: Dict[str, Any]) -> None:
        """
        Process extracted URLs

        Args:
            urls: List of URLs to process
            parent_url_obj: Parent URL object
            metadata: Metadata from the parent page
        """
        parent_url = parent_url_obj.url
        parent_depth = parent_url_obj.depth

        # Check max depth
        if parent_depth >= config.MAX_DEPTH:
            logger.debug(f"Max depth reached for {parent_url}")
            return

        for url in urls:
            # Calculate priority based on URL and metadata
            priority = self.parser.calculate_priority(url, metadata)

            # Create URL object
            url_obj = URL(
                url=url,
                status=URLStatus.PENDING,
                priority=priority,
                depth=parent_depth + 1,
                parent_url=parent_url
            )

            # Add to frontier
            if self.frontier.add_url(url_obj):
                # URL was added to frontier
                self.urls_discovered_counter.inc()
                self.stats['urls_discovered'] += 1

                # Save URL based on storage mode
                try:
                    if self.storage is not None:
                        # Use custom storage in deployment mode
                        self.storage.add_url(url_obj)
                    else:
                        # Use MongoDB in local mode
                        self.urls_collection.update_one(
                            {'url': url},
                            {'$set': url_obj.dict()},
                            upsert=True
                        )
                except Exception as e:
                    logger.error(f"Error saving URL to database: {e}")
            else:
                # URL was not added (filtered or duplicate)
                self.urls_filtered_counter.inc()
                self.stats['urls_filtered'] += 1

    def _mark_url_completed(self, url_obj: URL) -> None:
        """
        Mark URL as completed

        Args:
            url_obj: URL object to mark as completed
        """
        try:
            url_obj.status = URLStatus.COMPLETED
            url_obj.completed_at = datetime.now()

            if self.storage is not None:
                # Use custom storage
                self.storage.add_url(url_obj)
            else:
                # Use MongoDB
                self.urls_collection.update_one(
                    {'url': url_obj.url},
                    {'$set': url_obj.dict()},
                    upsert=True
                )
        except Exception as e:
            logger.error(f"Error marking URL as completed: {e}")

    def _mark_url_failed(self, url_obj: URL, error: str) -> None:
        """
        Mark URL as failed

        Args:
            url_obj: URL object to mark as failed
            error: Error message
        """
        try:
            url_obj.status = URLStatus.FAILED
            url_obj.error = error
            url_obj.completed_at = datetime.now()

            if self.storage is not None:
                # Use custom storage
                self.storage.add_url(url_obj)
            else:
                # Use MongoDB
                self.urls_collection.update_one(
                    {'url': url_obj.url},
                    {'$set': url_obj.dict()},
                    upsert=True
                )

            # If retries not exceeded, add back to frontier with lower priority
            if url_obj.retries < config.RETRY_TIMES:
                # Lower priority by one level (to a maximum of VERY_LOW)
                new_priority = min(Priority.VERY_LOW, Priority(url_obj.priority + 1))
                url_obj.priority = new_priority
                url_obj.status = URLStatus.PENDING

                # Add back to frontier
                self.frontier.add_url(url_obj)

        except Exception as e:
            logger.error(f"Error marking URL as failed: {e}")

    def _store_page(self, page: Page) -> None:
        """
        Store a page in the database and optionally on disk

        Args:
            page: Page object to store
        """
        try:
            if self.storage is not None:
                # Use custom storage in deployment mode
                self.storage.add_page(page)
            else:
                # Use MongoDB in local mode
                self.pages_collection.update_one(
                    {'url': page.url},
                    {'$set': page.dict()},
                    upsert=True
                )

            # Optionally store HTML content on disk
            if not page.is_duplicate:
                if IS_DEPLOYMENT:
                    # In deployment mode, store in temporary directory
                    domain_dir = os.path.join(config.HTML_STORAGE_PATH, self._extract_domain(page.url))
                    os.makedirs(domain_dir, exist_ok=True)

                    # Create filename from URL
                    filename = self._url_to_filename(page.url)
                    filepath = os.path.join(domain_dir, filename)

                    # Write HTML to file
                    with open(filepath, 'w', encoding='utf-8') as f:
                        f.write(page.content)

                    logger.debug(f"Stored HTML content for {page.url} at {filepath}")
                else:
                    # In local mode, store in permanent storage
                    domain = self._extract_domain(page.url)
                    domain_dir = os.path.join(config.HTML_STORAGE_PATH, domain)
                    os.makedirs(domain_dir, exist_ok=True)

                    # Create filename from URL
                    filename = self._url_to_filename(page.url)
                    filepath = os.path.join(domain_dir, filename)

                    # Write HTML to file
                    with open(filepath, 'w', encoding='utf-8') as f:
                        f.write(page.content)

                    logger.debug(f"Stored HTML content for {page.url} at {filepath}")
        except Exception as e:
            logger.error(f"Error storing page: {e}")

    def _extract_domain(self, url: str) -> str:
        """Extract domain from URL"""
        parsed = urlparse(url)
        return parsed.netloc.replace(':', '_')

    def _url_to_filename(self, url: str) -> str:
        """Convert URL to filename"""
        # Hash the URL to create a safe filename
        url_hash = self._hash_url(url)
        return f"{url_hash}.html"

    def _hash_url(self, url: str) -> str:
        """Create a hash of a URL"""
        import hashlib
        return hashlib.md5(url.encode('utf-8')).hexdigest()

    def _update_stats(self) -> None:
        """Update and log statistics"""
        # Update frontier size gauge
        self.frontier_size_gauge.set(self.frontier.size())

        # Log statistics periodically
        if self.stats['pages_crawled'] % 100 == 0:
            self._log_stats()

    def _log_stats(self) -> None:
        """Log crawler statistics"""
        # Calculate elapsed time
        elapsed = time.time() - self.stats['start_time']
        hours, remainder = divmod(elapsed, 3600)
        minutes, seconds = divmod(remainder, 60)

        # Get current statistics
        pages_crawled = self.stats['pages_crawled']
        pages_failed = self.stats['pages_failed']
        urls_discovered = self.stats['urls_discovered']
        urls_filtered = self.stats['urls_filtered']
        domains_crawled = len(self.stats['domains_crawled'])
        frontier_size = self.frontier.size()

        # Calculate pages per second
        pages_per_second = pages_crawled / elapsed if elapsed > 0 else 0

        # Log statistics
        logger.info(
            f"Crawler running for {int(hours):02d}:{int(minutes):02d}:{int(seconds):02d} - "
            f"Pages: {pages_crawled} ({pages_per_second:.2f}/s) - "
            f"Failed: {pages_failed} - "
            f"URLs Discovered: {urls_discovered} - "
            f"URLs Filtered: {urls_filtered} - "
            f"Domains: {domains_crawled} - "
            f"Frontier: {frontier_size}"
        )

        # Save statistics to database
        try:
            stats_copy = self.stats.copy()
            stats_copy['domains_crawled'] = list(stats_copy['domains_crawled'])
            stats_copy['timestamp'] = datetime.now()

            self.stats_collection.insert_one(stats_copy)
        except Exception as e:
            logger.error(f"Error saving statistics to database: {e}")

    def stop(self) -> None:
        """Stop the crawler"""
        if not self.running:
            logger.warning("Crawler is not running")
            return

        logger.info("Stopping crawler")
        self.stop_event.set()
        self.running = False

    def pause(self) -> None:
        """Pause the crawler"""
        if not self.running:
            logger.warning("Crawler is not running")
            return

        logger.info("Pausing crawler")
        self.paused = True

    def resume(self) -> None:
        """Resume the crawler"""
        if not self.running:
            logger.warning("Crawler is not running")
            return

        logger.info("Resuming crawler")
        self.paused = False

    def checkpoint(self) -> bool:
        """
        Save crawler state for recovery

        Returns:
            True if successful, False otherwise
        """
        logger.info("Creating crawler checkpoint")

        # Checkpoint the frontier
        frontier_checkpoint = self.frontier.checkpoint()

        # Save current statistics
        try:
            stats_copy = self.stats.copy()
            stats_copy['domains_crawled'] = list(stats_copy['domains_crawled'])
            stats_copy['checkpoint_time'] = datetime.now()

            with open(os.path.join(config.STORAGE_PATH, 'crawler_stats.json'), 'w') as f:
                json.dump(stats_copy, f, default=str)

            logger.info("Crawler checkpoint created")
            return frontier_checkpoint
        except Exception as e:
            logger.error(f"Error creating crawler checkpoint: {e}")
            return False

    def restore(self) -> bool:
        """
        Restore crawler state from checkpoint

        Returns:
            True if successful, False otherwise
        """
        logger.info("Restoring crawler from checkpoint")

        # Restore frontier
        frontier_restored = self.frontier.restore()

        # Restore statistics
        try:
            stats_path = os.path.join(config.STORAGE_PATH, 'crawler_stats.json')
            if os.path.exists(stats_path):
                with open(stats_path, 'r') as f:
                    saved_stats = json.load(f)

                # Restore stats
                self.stats = saved_stats
                self.stats['domains_crawled'] = set(self.stats['domains_crawled'])

                logger.info("Crawler statistics restored")
            else:
                logger.warning("No statistics checkpoint found")

            return frontier_restored
        except Exception as e:
            logger.error(f"Error restoring crawler checkpoint: {e}")
            return False

    def _cleanup(self) -> None:
        """Clean up resources when crawler stops"""
        # Create final checkpoint
        self.checkpoint()

        # Log final statistics
        self._log_stats()

        # Reset flags
        self.running = False
        self.paused = False

        logger.info("Crawler stopped")


# Dummy metric class for deployment mode
class DummyMetric:
    """A dummy metric that does nothing"""
    def inc(self, *args, **kwargs): pass
    def dec(self, *args, **kwargs): pass
    def set(self, *args, **kwargs): pass
    def observe(self, *args, **kwargs): pass
    def time(self): return self.Timer()

    class Timer:
        def __enter__(self): pass
        def __exit__(self, exc_type, exc_val, exc_tb): pass


if __name__ == "__main__":
    # Create and start crawler
    crawler = Crawler()

    try:
        crawler.start()
    except KeyboardInterrupt:
        logger.info("Crawler interrupted by user")
    finally:
        crawler.stop()
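For reference, the Crawler class can also be embedded in another script instead of being run via `python crawler.py`. A minimal sketch, assuming MongoDB and Redis are reachable at the URIs configured in config.py; the seed URL is a placeholder:

    from crawler import Crawler

    crawler = Crawler(metrics_port=9100)            # connects to MongoDB/Redis from config
    crawler.add_seed_urls(['https://example.com'])  # enqueue a starting point at depth 0
    try:
        crawler.start(num_workers=4)                # blocks until stopped or interrupted
    finally:
        crawler.stop()
        crawler.checkpoint()                        # persist frontier state and statistics for restore()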
deduplication.py
ADDED
@@ -0,0 +1,422 @@
"""
Content deduplication component for the web crawler.

Provides functionality to detect duplicate pages efficiently using:

1. Exact content hashing
2. Shingling and MinHash for near-duplicate detection
3. SimHash for fuzzy matching
"""

import hashlib
import logging
import time
from typing import Set, List, Dict, Tuple, Optional, Union
import random
import numpy as np
from collections import defaultdict
import re

import config

# Configure logging
logging.basicConfig(
    level=getattr(logging, config.LOG_LEVEL),
    format=config.LOG_FORMAT
)
logger = logging.getLogger(__name__)


class ContentDeduplicator:
    """
    Content deduplication using multiple techniques:
    - Exact match (MD5 hash)
    - Near-duplicate detection (MinHash)
    - Fuzzy matching (SimHash)
    """

    def __init__(self):
        """Initialize the deduplicator"""
        # Exact content hashing
        self.content_hashes = set()
        self.url_hashes = {}  # URL -> exact content hash

        # MinHash parameters
        self.num_hashes = 100
        self.minhash_signatures = {}  # URL -> MinHash signature
        self.minhash_bands = defaultdict(set)  # band_id -> set of URLs
        self.band_size = 5  # Each band contains 5 signatures
        self.shingle_size = 3  # k-shingles of 3 consecutive tokens

        # SimHash parameters
        self.simhash_dim = 64
        self.simhash_values = {}  # URL -> SimHash value
        self.hamming_threshold = 3  # Maximum Hamming distance for similarity

        # Cache of previously computed duplicates for quick lookups
        self.duplicate_cache = {}  # URL -> set of duplicate URLs

        # Token preprocessing
        self.token_pattern = re.compile(r'\w+')
        self.stop_words = set(['the', 'and', 'a', 'to', 'of', 'in', 'is', 'that', 'for', 'on', 'with'])

        # Statistics
        self.stats = {
            'exact_duplicates': 0,
            'near_duplicates': 0,
            'fuzzy_duplicates': 0,
            'processing_time': 0,
            'total_documents': 0,
        }

    def is_duplicate(self, url: str, content: str) -> Tuple[bool, Optional[str]]:
        """
        Check if content is a duplicate

        Args:
            url: URL of the page
            content: Page content

        Returns:
            (is_duplicate, duplicate_url): Tuple indicating if content is duplicate and what it duplicates
        """
        start_time = time.time()

        # Check exact match first (fastest)
        content_hash = self._hash_content(content)
        if content_hash in self.content_hashes:
            self.stats['exact_duplicates'] += 1
            processing_time = time.time() - start_time
            self.stats['processing_time'] += processing_time

            # Find the URL with the same hash
            for existing_url, existing_hash in self._get_hash_map().items():
                if existing_hash == content_hash and existing_url != url:
                    logger.debug(f"Exact duplicate detected: {url} duplicates {existing_url}")
                    return True, existing_url

            return True, None

        # Check cache for quick lookup
        if url in self.duplicate_cache:
            duplicate_url = next(iter(self.duplicate_cache[url]))
            logger.debug(f"Duplicate found in cache: {url} duplicates {duplicate_url}")
            return True, duplicate_url

        # Only perform more expensive checks if configured to do so
        if config.NEAR_DUPLICATE_DETECTION:
            # Check for near-duplicates using MinHash
            near_duplicate = self._check_minhash(url, content)
            if near_duplicate:
                self.stats['near_duplicates'] += 1
                processing_time = time.time() - start_time
                self.stats['processing_time'] += processing_time

                logger.debug(f"Near-duplicate detected: {url} is similar to {near_duplicate}")
                self._add_to_duplicate_cache(url, near_duplicate)
                return True, near_duplicate

        if config.FUZZY_DUPLICATE_DETECTION:
            # Check for fuzzy matches using SimHash
            fuzzy_duplicate = self._check_simhash(url, content)
            if fuzzy_duplicate:
                self.stats['fuzzy_duplicates'] += 1
                processing_time = time.time() - start_time
                self.stats['processing_time'] += processing_time

                logger.debug(f"Fuzzy duplicate detected: {url} is similar to {fuzzy_duplicate}")
                self._add_to_duplicate_cache(url, fuzzy_duplicate)
                return True, fuzzy_duplicate

        # Not a duplicate, add to index
        self._add_to_index(url, content, content_hash)

        self.stats['total_documents'] += 1
        processing_time = time.time() - start_time
        self.stats['processing_time'] += processing_time

        return False, None

    def _add_to_duplicate_cache(self, url: str, duplicate_url: str) -> None:
        """Add URL to duplicate cache for faster lookups"""
        if url not in self.duplicate_cache:
            self.duplicate_cache[url] = set()
        self.duplicate_cache[url].add(duplicate_url)

        # Also add reverse relationship
        if duplicate_url not in self.duplicate_cache:
            self.duplicate_cache[duplicate_url] = set()
        self.duplicate_cache[duplicate_url].add(url)

    def _get_hash_map(self) -> Dict[str, str]:
        """Get mapping of URLs to their exact content hashes"""
        return dict(self.url_hashes)

    def _hash_content(self, content: str) -> str:
        """Create MD5 hash of content"""
        return hashlib.md5(content.encode('utf-8')).hexdigest()

    def _preprocess_content(self, content: str) -> List[str]:
        """
        Preprocess content for tokenization:
        1. Convert to lowercase
        2. Remove HTML tags
        3. Extract tokens
        4. Remove stop words
        """
        # Remove HTML tags
        content = re.sub(r'<[^>]+>', ' ', content)

        # Tokenize
        tokens = self.token_pattern.findall(content.lower())

        # Remove stop words
        tokens = [token for token in tokens if token not in self.stop_words]

        return tokens

    def _add_to_index(self, url: str, content: str, content_hash: Optional[str] = None) -> None:
        """
        Add content to the deduplication index

        Args:
            url: URL of the page
            content: Page content
            content_hash: Optional pre-computed hash
        """
        # Add exact hash
        if content_hash is None:
            content_hash = self._hash_content(content)
        self.content_hashes.add(content_hash)
        self.url_hashes[url] = content_hash

        # Add MinHash signature
        if config.NEAR_DUPLICATE_DETECTION:
            signature = self._compute_minhash(content)
            self.minhash_signatures[url] = signature

            # Add to LSH bands
            for i in range(0, self.num_hashes, self.band_size):
                band = tuple(signature[i:i+self.band_size])
                band_id = hash(band)
                self.minhash_bands[band_id].add(url)

        # Add SimHash
        if config.FUZZY_DUPLICATE_DETECTION:
            simhash_value = self._compute_simhash(content)
            self.simhash_values[url] = simhash_value

    def _create_shingles(self, tokens: List[str], k: int = 3) -> Set[str]:
        """
        Create k-shingles from tokens

        Args:
            tokens: List of tokens
            k: Size of shingles

        Returns:
            Set of shingles
        """
        return set(' '.join(tokens[i:i+k]) for i in range(len(tokens) - k + 1))

    def _compute_minhash(self, content: str) -> List[int]:
        """
        Compute MinHash signature for content

        Args:
            content: Page content

        Returns:
            MinHash signature (list of hash values)
        """
        tokens = self._preprocess_content(content)
        shingles = self._create_shingles(tokens, self.shingle_size)

        # Generate random hash functions
        max_hash = 2**32 - 1

        # Create signature
        signature = [max_hash] * self.num_hashes

        # For each shingle, compute hashes and keep minimum values
        for shingle in shingles:
            # Use shingle as seed for random hash functions
            shingle_hash = hash(shingle)

            for i in range(self.num_hashes):
                # Simple linear hash function: (a*x + b) mod c
                a = i + 1  # Different 'a' for each hash function
                b = i * i  # Different 'b' for each hash function
                hash_value = (a * shingle_hash + b) % max_hash

                # Keep the minimum hash value
                signature[i] = min(signature[i], hash_value)

        return signature

    def _check_minhash(self, url: str, content: str) -> Optional[str]:
        """
        Check for near-duplicates using MinHash and LSH

        Args:
            url: URL of the page
            content: Page content

        Returns:
            URL of duplicate page if found, None otherwise
        """
        # Compute MinHash signature
        signature = self._compute_minhash(content)

        # Check each band for potential matches
        candidate_urls = set()
        for i in range(0, self.num_hashes, self.band_size):
            band = tuple(signature[i:i+self.band_size])
            band_id = hash(band)

            # Get URLs that share this band
            if band_id in self.minhash_bands:
                candidate_urls.update(self.minhash_bands[band_id])

        # Check Jaccard similarity with candidates
        for candidate_url in candidate_urls:
            if candidate_url == url:
                continue

            candidate_signature = self.minhash_signatures[candidate_url]
            similarity = self._jaccard_similarity(signature, candidate_signature)

            if similarity >= config.SIMILARITY_THRESHOLD:
                return candidate_url

        return None

    def _jaccard_similarity(self, sig1: List[int], sig2: List[int]) -> float:
        """
        Estimate Jaccard similarity from MinHash signatures

        Args:
            sig1: First signature
            sig2: Second signature

        Returns:
            Estimated Jaccard similarity (0-1)
        """
        if len(sig1) != len(sig2):
            raise ValueError("Signatures must have the same length")

        # Count matching hash values
        matches = sum(1 for i in range(len(sig1)) if sig1[i] == sig2[i])

        # Estimate similarity
        return matches / len(sig1)

+
def _compute_simhash(self, content: str) -> int:
|
313 |
+
"""
|
314 |
+
Compute SimHash for content
|
315 |
+
|
316 |
+
Args:
|
317 |
+
content: Page content
|
318 |
+
|
319 |
+
Returns:
|
320 |
+
SimHash value
|
321 |
+
"""
|
322 |
+
tokens = self._preprocess_content(content)
|
323 |
+
|
324 |
+
# Initialize vector
|
325 |
+
v = [0] * self.simhash_dim
|
326 |
+
|
327 |
+
# For each token, compute hash and update vector
|
328 |
+
for token in tokens:
|
329 |
+
# Compute hash of token
|
330 |
+
token_hash = hashlib.md5(token.encode('utf-8')).digest()
|
331 |
+
|
332 |
+
# Convert to binary representation
|
333 |
+
token_bits = ''.join(format(byte, '08b') for byte in token_hash)
|
334 |
+
|
335 |
+
# Use first self.simhash_dim bits
|
336 |
+
token_bits = token_bits[:self.simhash_dim]
|
337 |
+
|
338 |
+
# Update vector
|
339 |
+
for i, bit in enumerate(token_bits):
|
340 |
+
if bit == '1':
|
341 |
+
v[i] += 1
|
342 |
+
else:
|
343 |
+
v[i] -= 1
|
344 |
+
|
345 |
+
# Create fingerprint
|
346 |
+
fingerprint = 0
|
347 |
+
for i, val in enumerate(v):
|
348 |
+
if val > 0:
|
349 |
+
fingerprint |= (1 << i)
|
350 |
+
|
351 |
+
return fingerprint
|
352 |
+
|
353 |
+
def _check_simhash(self, url: str, content: str) -> Optional[str]:
|
354 |
+
"""
|
355 |
+
Check for fuzzy duplicates using SimHash
|
356 |
+
|
357 |
+
Args:
|
358 |
+
url: URL of the page
|
359 |
+
content: Page content
|
360 |
+
|
361 |
+
Returns:
|
362 |
+
URL of duplicate page if found, None otherwise
|
363 |
+
"""
|
364 |
+
# Compute SimHash
|
365 |
+
simhash_value = self._compute_simhash(content)
|
366 |
+
|
367 |
+
# Compare with existing SimHash values
|
368 |
+
for existing_url, existing_simhash in self.simhash_values.items():
|
369 |
+
if existing_url == url:
|
370 |
+
continue
|
371 |
+
|
372 |
+
# Calculate Hamming distance
|
373 |
+
hamming_distance = bin(simhash_value ^ existing_simhash).count('1')
|
374 |
+
|
375 |
+
if hamming_distance <= self.hamming_threshold:
|
376 |
+
return existing_url
|
377 |
+
|
378 |
+
return None
|
379 |
+
|
380 |
+
def clear(self) -> None:
|
381 |
+
"""Clear all indexes and caches"""
|
382 |
+
self.content_hashes.clear()
|
383 |
+
self.minhash_signatures.clear()
|
384 |
+
self.minhash_bands.clear()
|
385 |
+
self.simhash_values.clear()
|
386 |
+
self.duplicate_cache.clear()
|
387 |
+
|
388 |
+
# Reset statistics
|
389 |
+
self.stats = {
|
390 |
+
'exact_duplicates': 0,
|
391 |
+
'near_duplicates': 0,
|
392 |
+
'fuzzy_duplicates': 0,
|
393 |
+
'processing_time': 0,
|
394 |
+
'total_documents': 0,
|
395 |
+
}
|
396 |
+
|
397 |
+
def get_stats(self) -> Dict[str, Union[int, float]]:
|
398 |
+
"""Get deduplication statistics"""
|
399 |
+
stats_copy = self.stats.copy()
|
400 |
+
|
401 |
+
# Calculate average processing time
|
402 |
+
total_docs = self.stats['total_documents']
|
403 |
+
if total_docs > 0:
|
404 |
+
avg_time = self.stats['processing_time'] / total_docs
|
405 |
+
stats_copy['avg_processing_time'] = avg_time
|
406 |
+
else:
|
407 |
+
stats_copy['avg_processing_time'] = 0
|
408 |
+
|
409 |
+
# Calculate total duplicates
|
410 |
+
total_duplicates = (self.stats['exact_duplicates'] +
|
411 |
+
self.stats['near_duplicates'] +
|
412 |
+
self.stats['fuzzy_duplicates'])
|
413 |
+
stats_copy['total_duplicates'] = total_duplicates
|
414 |
+
|
415 |
+
# Calculate duplicate percentage
|
416 |
+
if total_docs > 0:
|
417 |
+
duplicate_percentage = (total_duplicates / total_docs) * 100
|
418 |
+
stats_copy['duplicate_percentage'] = duplicate_percentage
|
419 |
+
else:
|
420 |
+
stats_copy['duplicate_percentage'] = 0
|
421 |
+
|
422 |
+
return stats_copy
|
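The MinHash machinery above can be exercised on its own. Below is a minimal, self-contained sketch of the signature and estimate steps, mirroring the linear hash scheme used in _compute_minhash; the parameter choices (128 hash functions, 3-word shingles) are illustrative assumptions rather than values taken from config.py, and, like the class above, it relies on Python's per-process hash(), so signatures are only comparable within a single run.

# Standalone MinHash sketch (illustrative parameters, not config.py values)
NUM_HASHES = 128
MAX_HASH = 2**32 - 1

def shingle(text, k=3):
    # Build k-word shingles from whitespace tokens
    tokens = text.lower().split()
    return set(' '.join(tokens[i:i + k]) for i in range(len(tokens) - k + 1))

def minhash_signature(text):
    # Keep the minimum of each linear hash (a*x + b) mod MAX_HASH over all shingles
    sig = [MAX_HASH] * NUM_HASHES
    for s in shingle(text):
        h = hash(s)
        for i in range(NUM_HASHES):
            sig[i] = min(sig[i], ((i + 1) * h + i * i) % MAX_HASH)
    return sig

def estimated_jaccard(sig1, sig2):
    # Fraction of matching positions approximates the Jaccard similarity of the shingle sets
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

doc_a = "the quick brown fox jumps over the lazy dog near the old river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the old river bend"
print(estimated_jaccard(minhash_signature(doc_a), minhash_signature(doc_b)))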
dns_resolver.py
ADDED
@@ -0,0 +1,161 @@
"""
DNS resolver with caching for web crawler
"""

import socket
import logging
import time
from typing import Dict, Optional, Tuple
from urllib.parse import urlparse
from datetime import datetime, timedelta
from cachetools import TTLCache
import threading
import dns
import dns.resolver

import config

# Import local configuration if available
try:
    import local_config
    # Override config settings with local settings
    for key in dir(local_config):
        if key.isupper():
            setattr(config, key, getattr(local_config, key))
    logging.info("Loaded local configuration")
except ImportError:
    pass

# Configure logging
logging.basicConfig(
    level=getattr(logging, config.LOG_LEVEL),
    format=config.LOG_FORMAT
)
logger = logging.getLogger(__name__)


class DNSResolver:
    """
    DNS resolver with caching to improve performance

    DNS resolution can be a bottleneck for crawlers due to the synchronous
    nature of many DNS interfaces. This class provides a cached resolver
    to reduce the number of DNS lookups.
    """

    def __init__(self, cache_size: int = 10000, cache_ttl: int = 3600):
        """
        Initialize DNS resolver

        Args:
            cache_size: Maximum number of DNS records to cache
            cache_ttl: Time to live for cache entries in seconds
        """
        self.cache = TTLCache(maxsize=cache_size, ttl=cache_ttl)
        self.lock = threading.RLock()  # Thread-safe operations
        self.resolver = dns.resolver.Resolver()
        self.resolver.timeout = 3.0  # Timeout for DNS requests in seconds
        self.resolver.lifetime = 5.0  # Total timeout for all DNS requests

        # Stats tracking
        self.hit_count = 0
        self.miss_count = 0

    def resolve(self, url: str) -> Optional[str]:
        """
        Resolve a URL to an IP address

        Args:
            url: URL to resolve

        Returns:
            IP address or None if resolution fails
        """
        try:
            parsed = urlparse(url)
            hostname = parsed.netloc.split(':')[0]  # Remove port if present

            # Check cache first
            with self.lock:
                if hostname in self.cache:
                    logger.debug(f"DNS cache hit for {hostname}")
                    self.hit_count += 1
                    return self.cache[hostname]

            # Cache miss - resolve hostname
            ip_address = self._resolve_hostname(hostname)

            # Update cache
            if ip_address:
                with self.lock:
                    self.cache[hostname] = ip_address
                    self.miss_count += 1

            return ip_address

        except Exception as e:
            logger.warning(f"Error resolving DNS for {url}: {e}")
            return None

    def _resolve_hostname(self, hostname: str) -> Optional[str]:
        """
        Resolve hostname to IP address

        Args:
            hostname: Hostname to resolve

        Returns:
            IP address or None if resolution fails
        """
        try:
            # First try using dnspython for more control
            answers = self.resolver.resolve(hostname, 'A')
            if answers:
                # Return first IP address
                return str(answers[0])
        except dns.exception.DNSException as e:
            logger.debug(f"dnspython DNS resolution failed for {hostname}: {e}")

        # Fall back to socket.gethostbyname
        try:
            return socket.gethostbyname(hostname)
        except socket.gaierror as e:
            logger.warning(f"Socket DNS resolution failed for {hostname}: {e}")
            return None

    def bulk_resolve(self, urls: list) -> Dict[str, Optional[str]]:
        """
        Resolve multiple URLs to IP addresses

        Args:
            urls: List of URLs to resolve

        Returns:
            Dictionary mapping URLs to IP addresses
        """
        results = {}
        for url in urls:
            results[url] = self.resolve(url)
        return results

    def clear_cache(self) -> None:
        """Clear the DNS cache"""
        with self.lock:
            self.cache.clear()

    def get_stats(self) -> Dict[str, int]:
        """
        Get statistics about the DNS cache

        Returns:
            Dictionary with cache statistics
        """
        with self.lock:
            return {
                'size': len(self.cache),
                'max_size': self.cache.maxsize,
                'ttl': self.cache.ttl,
                'hit_count': self.hit_count,
                'miss_count': self.miss_count,
                'hit_ratio': self.hit_count / (self.hit_count + self.miss_count) if (self.hit_count + self.miss_count) > 0 else 0
            }
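A hypothetical usage sketch of the resolver above; the URLs and cache settings are placeholders, not values taken from config.py.

from dns_resolver import DNSResolver

resolver = DNSResolver(cache_size=5000, cache_ttl=1800)
ip_first = resolver.resolve("https://example.com/some/page")    # cache miss, performs a lookup
ip_second = resolver.resolve("https://example.com/other/page")  # same host, served from cache
print(ip_first, ip_second)
print(resolver.get_stats())  # includes hit_count, miss_count and hit_ratio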
docker-compose.yml
ADDED
@@ -0,0 +1,79 @@
version: '3'

services:
  mongodb:
    image: mongo:6.0
    container_name: crawler-mongodb
    ports:
      - "27017:27017"
    volumes:
      - mongodb_data:/data/db
    restart: unless-stopped
    environment:
      - MONGO_INITDB_DATABASE=webcrawler
    networks:
      - crawler-network

  redis:
    image: redis:latest
    container_name: crawler-redis
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    restart: unless-stopped
    networks:
      - crawler-network

  web-crawler:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: web-crawler
    volumes:
      - ./:/app
      - crawler_data:/data/storage
    ports:
      - "9100:9100"
    depends_on:
      - mongodb
      - redis
    environment:
      - MONGODB_URI=mongodb://mongodb:27017/
      - REDIS_URI=redis://redis:6379/0
      - LOG_LEVEL=INFO
      - MAX_WORKERS=4
    networks:
      - crawler-network
    command: python crawl.py start --workers=4

  crawler-api:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: crawler-api
    volumes:
      - ./:/app
      - crawler_data:/data/storage
    ports:
      - "8000:8000"
    depends_on:
      - mongodb
      - redis
      - web-crawler
    environment:
      - MONGODB_URI=mongodb://mongodb:27017/
      - REDIS_URI=redis://redis:6379/0
      - LOG_LEVEL=INFO
    networks:
      - crawler-network
    command: python api.py

networks:
  crawler-network:
    driver: bridge

volumes:
  mongodb_data:
  redis_data:
  crawler_data:
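As a rough connectivity check for the services defined above, something like the following could be run inside either crawler container. It assumes pymongo and redis-py are installed (the frontier already imports redis, and MongoDB access elsewhere in the repo implies a MongoDB driver) and reuses the environment variable names from this compose file.

import os
import redis
from pymongo import MongoClient

# Defaults mirror the compose service names; they are overridden by the environment block above
mongo = MongoClient(os.environ.get("MONGODB_URI", "mongodb://mongodb:27017/"))
cache = redis.from_url(os.environ.get("REDIS_URI", "redis://redis:6379/0"))

print(mongo.admin.command("ping"))  # {'ok': 1.0} when MongoDB is reachable
print(cache.ping())                 # True when Redis is reachable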
downloader.py
ADDED
@@ -0,0 +1,400 @@
"""
HTML Downloader component for web crawler
"""

import time
import logging
import requests
from requests.exceptions import RequestException
from typing import Dict, Optional, Tuple, List, Any
from urllib.parse import urlparse
import aiohttp
import asyncio
from aiohttp.client_exceptions import ClientError
import hashlib
import os

from models import URL, Page, calculate_content_hash
from dns_resolver import DNSResolver
from robots import RobotsHandler
import config

# Configure logging
logging.basicConfig(
    level=getattr(logging, config.LOG_LEVEL),
    format=config.LOG_FORMAT
)
logger = logging.getLogger(__name__)


class HTMLDownloader:
    """
    HTML Downloader responsible for downloading web pages

    Features:
    - Respects robots.txt rules
    - Uses DNS caching for performance
    - Handles errors and retries
    - Supports both synchronous and asynchronous downloads
    """

    def __init__(self,
                 dns_resolver: Optional[DNSResolver] = None,
                 robots_handler: Optional[RobotsHandler] = None,
                 user_agent: Optional[str] = None):
        """
        Initialize HTML Downloader

        Args:
            dns_resolver: DNS resolver for hostname resolution
            robots_handler: Handler for robots.txt
            user_agent: User agent to use for requests
        """
        self.dns_resolver = dns_resolver or DNSResolver()
        self.robots_handler = robots_handler or RobotsHandler()
        self.user_agent = user_agent or config.USER_AGENT

        # Create request session
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': self.user_agent,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Cache-Control': 'max-age=0'
        })

    def download(self, url_obj: URL) -> Optional[Page]:
        """
        Download an HTML page from a URL

        Args:
            url_obj: URL object to download

        Returns:
            Page object or None if download fails
        """
        url = url_obj.url
        try:
            # Check robots.txt first
            if config.ROBOTSTXT_OBEY:
                allowed, crawl_delay = self.robots_handler.can_fetch(url)
                if not allowed:
                    logger.info(f"URL not allowed by robots.txt: {url}")
                    url_obj.status = "robotstxt_excluded"
                    return None

                # Respect crawl delay if specified
                if crawl_delay and crawl_delay > 0:
                    time.sleep(crawl_delay)

            # Resolve DNS
            ip_address = self.dns_resolver.resolve(url)
            if not ip_address:
                logger.warning(f"Failed to resolve DNS for URL: {url}")
                url_obj.error = "DNS resolution failed"
                return None

            # Download page with specific headers
            start_time = time.time()
            response = self.session.get(
                url,
                timeout=config.CRAWL_TIMEOUT,
                allow_redirects=True,
                stream=True,  # Stream to avoid downloading large files fully
                headers={
                    'User-Agent': self.user_agent,
                    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
                    'Accept-Language': 'en-US,en;q=0.5',
                    'Accept-Encoding': 'gzip',  # Only accept gzip to avoid encoding issues
                    'Connection': 'keep-alive'
                }
            )

            # Log response details
            logger.debug(f"Response status code: {response.status_code}")
            logger.debug(f"Response headers: {dict(response.headers)}")

            # Check content type
            content_type = response.headers.get('Content-Type', '').lower()
            logger.debug(f"Content type for {url}: {content_type}")

            is_html = any(allowed_type in content_type for allowed_type in config.ALLOWED_CONTENT_TYPES) or \
                      any(allowed_type == '*/*' for allowed_type in config.ALLOWED_CONTENT_TYPES)

            if not is_html:
                logger.info(f"Skipping non-HTML content ({content_type}): {url}")
                url_obj.error = f"Non-HTML content type: {content_type}"
                return None

            # Read content (with size limit)
            content = b""
            for chunk in response.iter_content(chunk_size=1024*1024):  # 1MB chunks
                content += chunk
                if len(content) > config.MAX_CONTENT_SIZE:
                    logger.info(f"Content exceeded max size during download: {url}")
                    url_obj.error = f"Content exceeded max size: {len(content)} bytes"
                    return None

            # Log content details
            logger.debug(f"Downloaded content size: {len(content)} bytes")
            logger.debug(f"First 100 bytes (hex): {content[:100].hex()}")

            # Check for UTF-8 BOM
            if content.startswith(b'\xef\xbb\xbf'):
                content = content[3:]
                logger.debug("Removed UTF-8 BOM from content")

            # Try to detect encoding from response headers
            encoding = None
            if 'charset=' in content_type:
                encoding = content_type.split('charset=')[-1].strip()
                logger.debug(f"Found encoding in Content-Type header: {encoding}")

            # Try to detect encoding from content
            try:
                import chardet
                detected = chardet.detect(content)
                if detected['confidence'] > 0.8:  # Only use if confidence is high
                    encoding = detected['encoding']
                    logger.debug(f"Detected encoding using chardet: {encoding} (confidence: {detected['confidence']})")
            except ImportError:
                logger.debug("chardet not available for encoding detection")

            # Decode content with fallbacks
            html_content = None
            encodings_to_try = [
                encoding,
                'utf-8',
                'utf-8-sig',
                'iso-8859-1',
                'cp1252',
                'ascii'
            ]

            for enc in encodings_to_try:
                if not enc:
                    continue
                try:
                    html_content = content.decode(enc)
                    # Quick validation of HTML content
                    if '<!DOCTYPE' in html_content[:1000] or '<html' in html_content[:1000]:
                        logger.debug(f"Successfully decoded content using {enc} encoding")
                        break
                    else:
                        logger.debug(f"Decoded with {enc} but content doesn't look like HTML")
                        html_content = None
                except UnicodeDecodeError:
                    logger.debug(f"Failed to decode content using {enc} encoding")
                    continue

            if html_content is None:
                logger.warning(f"Failed to decode content for URL: {url} with any encoding")
                url_obj.error = "Failed to decode content"
                return None

            # Additional HTML validation
            if not any(marker in html_content[:1000] for marker in ['<!DOCTYPE', '<html', '<head', '<body']):
                logger.warning(f"Content doesn't appear to be valid HTML for URL: {url}")
                url_obj.error = "Invalid HTML content"
                return None

            # Calculate hash for duplicate detection
            content_hash = calculate_content_hash(html_content)

            elapsed_time = time.time() - start_time

            # Create page object
            page = Page(
                url=url,
                status_code=response.status_code,
                content=html_content,
                content_type=content_type,
                content_length=len(content),
                content_hash=content_hash,
                headers={k.lower(): v for k, v in response.headers.items()},
                crawled_at=time.time(),
                redirect_url=response.url if response.url != url else None,
                elapsed_time=elapsed_time
            )

            logger.info(f"Downloaded {len(content)} bytes from {url} in {elapsed_time:.2f}s")
            return page

        except RequestException as e:
            logger.warning(f"Request error for URL {url}: {e}")
            url_obj.error = f"Request error: {str(e)}"
            return None

        except Exception as e:
            logger.error(f"Unexpected error downloading URL {url}: {e}")
            url_obj.error = f"Unexpected error: {str(e)}"
            return None

    async def download_async(self, url_obj: URL, session: Optional[aiohttp.ClientSession] = None) -> Optional[Page]:
        """
        Download an HTML page asynchronously

        Args:
            url_obj: URL object to download
            session: Optional aiohttp session to use

        Returns:
            Page object or None if download fails
        """
        url = url_obj.url
        own_session = False

        try:
            # Check robots.txt first (blocking call)
            if config.ROBOTSTXT_OBEY:
                allowed, crawl_delay = self.robots_handler.can_fetch(url)
                if not allowed:
                    logger.info(f"URL not allowed by robots.txt: {url}")
                    url_obj.status = "robotstxt_excluded"
                    return None

                # Respect crawl delay if specified
                if crawl_delay and crawl_delay > 0:
                    await asyncio.sleep(crawl_delay)

            # Resolve DNS (blocking call, but cached)
            ip_address = self.dns_resolver.resolve(url)
            if not ip_address:
                logger.warning(f"Failed to resolve DNS for URL: {url}")
                url_obj.error = "DNS resolution failed"
                return None

            # Create session if not provided
            if session is None:
                own_session = True
                session = aiohttp.ClientSession(headers={
                    'User-Agent': self.user_agent,
                    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
                    'Accept-Language': 'en-US,en;q=0.5',
                    'Accept-Encoding': 'gzip, deflate, br',
                    'Connection': 'keep-alive',
                    'Upgrade-Insecure-Requests': '1',
                    'Cache-Control': 'max-age=0'
                })

            # Download page
            start_time = time.time()
            async with session.get(url, timeout=config.CRAWL_TIMEOUT, allow_redirects=True) as response:
                # Check content type
                content_type = response.headers.get('Content-Type', '').lower()
                is_html = any(allowed_type in content_type for allowed_type in config.ALLOWED_CONTENT_TYPES)

                if not is_html:
                    logger.info(f"Skipping non-HTML content ({content_type}): {url}")
                    url_obj.error = f"Non-HTML content type: {content_type}"
                    return None

                # Check content length
                content_length = int(response.headers.get('Content-Length', 0))
                if content_length > config.MAX_CONTENT_SIZE:
                    logger.info(f"Skipping large content ({content_length} bytes): {url}")
                    url_obj.error = f"Content too large: {content_length} bytes"
                    return None

                # Read content (with size limit)
                content = b""
                async for chunk in response.content.iter_chunked(1024*1024):  # 1MB chunks
                    content += chunk
                    if len(content) > config.MAX_CONTENT_SIZE:
                        logger.info(f"Content exceeded max size during download: {url}")
                        url_obj.error = f"Content exceeded max size: {len(content)} bytes"
                        return None

                # Decode content
                try:
                    html_content = content.decode('utf-8')
                except UnicodeDecodeError:
                    try:
                        # Try with a more forgiving encoding
                        html_content = content.decode('iso-8859-1')
                    except UnicodeDecodeError:
                        logger.warning(f"Failed to decode content for URL: {url}")
                        url_obj.error = "Failed to decode content"
                        return None

                # Calculate hash for duplicate detection
                content_hash = calculate_content_hash(html_content)

                elapsed_time = time.time() - start_time

                # Create page object
                page = Page(
                    url=url,
                    status_code=response.status,
                    content=html_content,
                    content_type=content_type,
                    content_length=len(content),
                    content_hash=content_hash,
                    headers={k.lower(): v for k, v in response.headers.items()},
                    crawled_at=time.time(),
                    redirect_url=str(response.url) if str(response.url) != url else None,
                    elapsed_time=elapsed_time
                )

                logger.info(f"Downloaded {len(content)} bytes from {url} in {elapsed_time:.2f}s")
                return page

        except (ClientError, asyncio.TimeoutError) as e:
            logger.warning(f"Request error for URL {url}: {e}")
            url_obj.error = f"Request error: {str(e)}"
            return None

        except Exception as e:
            logger.error(f"Unexpected error downloading URL {url}: {e}")
            url_obj.error = f"Unexpected error: {str(e)}"
            return None

        finally:
            # Close session if we created it
            if own_session and session:
                await session.close()

    async def bulk_download(self, urls: List[URL], concurrency: int = 10) -> Dict[str, Optional[Page]]:
        """
        Download multiple URLs concurrently

        Args:
            urls: List of URL objects to download
            concurrency: Maximum number of concurrent downloads

        Returns:
            Dictionary mapping URL strings to Page objects
        """
        results = {}

        # Create a session to be shared across requests
        async with aiohttp.ClientSession(headers={
            'User-Agent': self.user_agent,
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9',
            'Accept-Language': 'en-US,en;q=0.5',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Cache-Control': 'max-age=0'
        }) as session:
            # Create a semaphore to limit concurrency
            semaphore = asyncio.Semaphore(concurrency)

            async def download_with_semaphore(url_obj):
                async with semaphore:
                    return await self.download_async(url_obj, session)

            # Create download tasks
            tasks = [download_with_semaphore(url_obj) for url_obj in urls]

            # Wait for all tasks to complete
            pages = await asyncio.gather(*tasks)

            # Map results
            for url_obj, page in zip(urls, pages):
                results[url_obj.url] = page

        return results
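A hypothetical driver for the asynchronous path above; the seed URLs and concurrency value are placeholders.

import asyncio

from downloader import HTMLDownloader
from models import URL

async def main():
    downloader = HTMLDownloader()
    urls = [URL(url=u) for u in (
        "https://example.com/",
        "https://example.org/",
    )]
    results = await downloader.bulk_download(urls, concurrency=5)
    for url, page in results.items():
        # page is None when the download was skipped or failed
        print(url, page.status_code if page else "failed")

if __name__ == "__main__":
    asyncio.run(main())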
example.py
ADDED
@@ -0,0 +1,250 @@
#!/usr/bin/env python3
"""
Example script that demonstrates how to use the web crawler programmatically.

This example:
1. Initializes the crawler
2. Adds seed URLs
3. Starts the crawler with 2 workers
4. Monitors progress for a specific duration
5. Pauses, resumes, and stops the crawler
6. Exports crawl data

Usage:
    python example.py [--time=<seconds>] [--workers=<num>] [--async]

Options:
    --time=<seconds>   Duration of the crawl in seconds [default: 60]
    --workers=<num>    Number of worker threads [default: 2]
    --async            Use asynchronous mode
"""

import time
import logging
import sys
import json
import os
import signal
import threading
from docopt import docopt

from crawler import Crawler
from models import URLStatus, Priority
import config

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('example')


def log_stats(crawler, interval=5):
    """Log crawler statistics periodically"""
    stats = crawler.stats
    elapsed = time.time() - stats['start_time']

    logger.info(f"=== Crawler Statistics (after {int(elapsed)}s) ===")
    logger.info(f"Pages crawled: {stats['pages_crawled']}")
    logger.info(f"Pages failed: {stats['pages_failed']}")
    logger.info(f"URLs discovered: {stats['urls_discovered']}")
    logger.info(f"URLs filtered: {stats['urls_filtered']}")
    logger.info(f"Domains crawled: {len(stats['domains_crawled'])}")
    logger.info(f"Frontier size: {crawler.frontier.size()}")

    # Status code distribution
    status_codes = stats['status_codes']
    if status_codes:
        logger.info("Status code distribution:")
        for status, count in sorted(status_codes.items()):
            logger.info(f"  {status}: {count}")

    # Check if crawler is still running
    if crawler.running and not crawler.stop_event.is_set():
        # Schedule next logging
        timer = threading.Timer(interval, log_stats, args=[crawler, interval])
        timer.daemon = True
        timer.start()


def example_crawl(duration=60, workers=2, async_mode=False):
    """
    Example crawler use

    Args:
        duration: Duration of the crawl in seconds
        workers: Number of worker threads
        async_mode: Whether to use async mode
    """
    logger.info("Initializing web crawler...")

    # Initialize crawler
    crawler = Crawler()

    # Add seed URLs
    seed_urls = [
        'https://en.wikipedia.org/wiki/Web_crawler',
        'https://en.wikipedia.org/wiki/Search_engine',
        'https://en.wikipedia.org/wiki/Web_indexing',
        'https://python.org',
        'https://www.example.com'
    ]
    logger.info(f"Adding {len(seed_urls)} seed URLs...")
    crawler.add_seed_urls(seed_urls)

    # Set up signal handling
    def signal_handler(sig, frame):
        logger.info("Received interrupt signal, stopping crawler")
        crawler.stop()
        sys.exit(0)

    signal.signal(signal.SIGINT, signal_handler)

    # Start a thread to log stats periodically
    log_stats(crawler, interval=5)

    # Start the crawler in a separate thread
    logger.info(f"Starting crawler with {workers} workers (async={async_mode})...")
    crawler_thread = threading.Thread(
        target=crawler.start,
        kwargs={'num_workers': workers, 'async_mode': async_mode}
    )
    crawler_thread.daemon = True
    crawler_thread.start()

    # Let the crawler run for a while
    logger.info(f"Crawler will run for {duration} seconds...")
    time.sleep(duration // 2)

    # Pause crawler
    logger.info("Pausing crawler for 5 seconds...")
    crawler.pause()
    time.sleep(5)

    # Resume crawler
    logger.info("Resuming crawler...")
    crawler.resume()
    time.sleep(duration // 2)

    # Stop crawler
    logger.info("Stopping crawler...")
    crawler.stop()

    # Wait for crawler to stop
    crawler_thread.join(timeout=10)

    # Export crawl data
    export_dir = os.path.join(config.STORAGE_PATH, 'exports')
    os.makedirs(export_dir, exist_ok=True)
    export_file = os.path.join(export_dir, 'example_crawl_results.json')

    logger.info(f"Exporting crawl data to {export_file}...")
    export_results(crawler, export_file)

    logger.info("Crawl example completed")

    # Print summary
    print_summary(crawler)


def export_results(crawler, output_file):
    """
    Export crawler results to a file

    Args:
        crawler: Crawler instance
        output_file: Output file path
    """
    try:
        # Get MongoDB collections
        pages_collection = crawler.db.pages_collection
        urls_collection = crawler.db.urls_collection

        # Get data
        pages = list(pages_collection.find({}, {'_id': 0}).limit(1000))
        urls = list(urls_collection.find({}, {'_id': 0}).limit(1000))

        # Prepare export data
        export_data = {
            'metadata': {
                'crawl_duration': time.time() - crawler.stats['start_time'],
                'pages_crawled': crawler.stats['pages_crawled'],
                'urls_discovered': crawler.stats['urls_discovered'],
                'domains_crawled': list(crawler.stats['domains_crawled']),
                'exported_pages': len(pages),
                'exported_urls': len(urls),
                'export_timestamp': time.strftime('%Y-%m-%d %H:%M:%S')
            },
            'pages': pages,
            'urls': urls,
            'stats': crawler.stats
        }

        # Convert datetime objects to strings for JSON serialization
        export_data = json.loads(json.dumps(export_data, default=str))

        # Write to file
        with open(output_file, 'w') as f:
            json.dump(export_data, f, indent=2)

        logger.info(f"Exported data to {output_file}")
    except Exception as e:
        logger.error(f"Error exporting results: {e}")


def print_summary(crawler):
    """
    Print a summary of the crawl

    Args:
        crawler: Crawler instance
    """
    stats = crawler.stats

    print("\n=============== CRAWL SUMMARY ===============")
    print(f"Duration: {time.time() - stats['start_time']:.2f} seconds")
    print(f"Pages crawled: {stats['pages_crawled']}")
    print(f"Pages failed: {stats['pages_failed']}")
    print(f"URLs discovered: {stats['urls_discovered']}")
    print(f"URLs filtered: {stats['urls_filtered']}")
    print(f"Domains crawled: {len(stats['domains_crawled'])}")

    if stats['domains_crawled']:
        print("\nTop domains:")
        domain_counts = {}
        # Count pages per domain
        for page in crawler.db.pages_collection.find({}, {'domain': 1}):
            domain = page.get('domain', 'unknown')
            domain_counts[domain] = domain_counts.get(domain, 0) + 1

        # Display top domains
        for domain, count in sorted(domain_counts.items(), key=lambda x: x[1], reverse=True)[:10]:
            print(f"  {domain}: {count} pages")

    print("\nHTTP Status Codes:")
    for status, count in sorted(stats['status_codes'].items()):
        print(f"  {status}: {count}")

    print("\nContent Types:")
    for content_type, count in sorted(stats['content_types'].items(), key=lambda x: x[1], reverse=True)[:5]:
        print(f"  {content_type}: {count}")

    print("=============================================\n")


if __name__ == '__main__':
    # Parse command-line arguments
    args = docopt(__doc__)

    duration = int(args['--time'])
    workers = int(args['--workers'])
    async_mode = args['--async']

    try:
        example_crawl(duration, workers, async_mode)
    except KeyboardInterrupt:
        logger.info("Example interrupted by user")
    except Exception as e:
        logger.error(f"Error in example: {e}")
        logger.exception(e)
file_cleanup.py
ADDED
@@ -0,0 +1,100 @@
#!/usr/bin/env python3
"""
Script to remove all simple_crawler related files without interactive confirmation
"""

import os
import sys
import logging
import shutil

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(name)s] %(levelname)s: %(message)s'
)
logger = logging.getLogger("file_cleanup")

def cleanup_files(dry_run=False):
    """List and remove files related to simple_crawler"""
    try:
        crawler_dir = os.path.dirname(os.path.abspath(__file__))

        # Files directly related to simple_crawler
        simple_crawler_files = [
            os.path.join(crawler_dir, "simple_crawler.py"),
            os.path.join(crawler_dir, "README_SIMPLE.md"),
            os.path.join(crawler_dir, "simple_crawler.log"),
            os.path.join(crawler_dir, "local_config.py")
        ]

        # Check storage directories
        storage_dir = os.path.join(crawler_dir, "storage")
        if os.path.exists(storage_dir):
            logger.info(f"Adding storage directory to removal list: {storage_dir}")
            simple_crawler_files.append(storage_dir)

        # Check for any log files with 'crawler' in the name
        for filename in os.listdir(crawler_dir):
            if ('crawler' in filename.lower() or 'crawl' in filename.lower()) and filename.endswith('.log'):
                full_path = os.path.join(crawler_dir, filename)
                if full_path not in simple_crawler_files:
                    logger.info(f"Adding log file to removal list: {filename}")
                    simple_crawler_files.append(full_path)

        # List files that will be removed
        logger.info("The following files will be removed:")
        files_to_remove = []

        for file_path in simple_crawler_files:
            if os.path.exists(file_path):
                logger.info(f"  - {file_path}")
                files_to_remove.append(file_path)
            else:
                logger.info(f"  - {file_path} (not found)")

        if dry_run:
            logger.info("Dry run mode - no files will be removed")
            return True

        # Remove files and directories
        for file_path in files_to_remove:
            if os.path.isdir(file_path):
                logger.info(f"Removing directory: {file_path}")
                shutil.rmtree(file_path)
            else:
                logger.info(f"Removing file: {file_path}")
                os.remove(file_path)

        logger.info("File cleanup completed")
        return True

    except Exception as e:
        logger.error(f"Error cleaning up files: {e}")
        return False

if __name__ == "__main__":
    print("Simple Crawler File Cleanup")
    print("--------------------------")
    print("This script will remove all files related to simple_crawler")
    print()

    # Check for dry-run flag
    dry_run = '--dry-run' in sys.argv

    if '--force' in sys.argv:
        # Non-interactive mode for scripting
        success = cleanup_files(dry_run)
        sys.exit(0 if success else 1)
    else:
        # Interactive mode
        if dry_run:
            print("DRY RUN MODE: Files will be listed but not removed")

        proceed = input("Do you want to proceed with file cleanup? (y/n): ")
        if proceed.lower() != 'y':
            print("Cleanup cancelled")
            sys.exit(0)

        success = cleanup_files(dry_run)
        print(f"\nFile cleanup: {'Completed' if success else 'Failed'}")
frontier.py
ADDED
@@ -0,0 +1,319 @@
"""
URL Frontier implementation for web crawler

The URL Frontier maintains URLs to be crawled with two main goals:
1. Prioritization - Important URLs are crawled first
2. Politeness - Avoid overloading web servers with too many requests
"""

import time
import logging
import heapq
import pickle
import threading
import random
from typing import Dict, List, Tuple, Optional, Any, Set
from collections import deque
import redis
from redis.exceptions import RedisError
import mmh3
import os
import json

from models import URL, Priority, URLStatus
import config

# Import local configuration if available
try:
    import local_config
    # Override config settings with local settings
    for key in dir(local_config):
        if key.isupper():
            setattr(config, key, getattr(local_config, key))
    logging.info("Loaded local configuration")
except ImportError:
    pass

# Configure logging
logging.basicConfig(
    level=getattr(logging, config.LOG_LEVEL),
    format=config.LOG_FORMAT
)
logger = logging.getLogger(__name__)


class URLFrontier:
    """
    URL Frontier implementation with prioritization and politeness

    Architecture:
    - Front queues: Priority-based queues
    - Back queues: Host-based queues for politeness

    This uses Redis for persistent storage to handle large number of URLs
    and enable distributed crawling. In deployment mode, it can also use
    in-memory storage.
    """

    def __init__(self, redis_client: Optional[redis.Redis] = None, use_memory: bool = False):
        """Initialize the URL Frontier"""
        self.use_memory = use_memory
        if use_memory:
            # Initialize in-memory storage
            self.memory_storage = {
                'seen_urls': set(),
                'priority_queues': [[] for _ in range(config.PRIORITY_QUEUE_NUM)],
                'host_queues': [[] for _ in range(config.HOST_QUEUE_NUM)]
            }
        else:
            # Use Redis
            self.redis = redis_client or redis.from_url(config.REDIS_URI)

        self.priority_count = config.PRIORITY_QUEUE_NUM  # Number of priority queues
        self.host_count = config.HOST_QUEUE_NUM  # Number of host queues
        self.url_seen_key = "webcrawler:url_seen"  # Bloom filter for seen URLs
        self.priority_queue_key_prefix = "webcrawler:priority_queue:"
        self.host_queue_key_prefix = "webcrawler:host_queue:"
        self.lock = threading.RLock()  # Thread-safe operations

        # Simple mode uses Redis Set instead of Bloom filter
        self.use_simple_mode = getattr(config, 'USE_SIMPLE_URL_SEEN', False)
        logger.info(f"URLFrontier using simple mode: {self.use_simple_mode}")

        # Ensure directory for checkpoint exists
        if not os.path.exists(config.STORAGE_PATH):
            os.makedirs(config.STORAGE_PATH)

        # Initialize URL seen storage
        if not self.use_memory:
            self._init_url_seen()

    def _init_url_seen(self):
        """Initialize URL seen storage based on configuration"""
        try:
            # If using simple mode, just use a Redis set
            if self.use_simple_mode:
                if not self.redis.exists(self.url_seen_key):
                    logger.info("Initializing URL seen set")
                    self.redis.sadd(self.url_seen_key, "initialized")
                return

            # Try to use Bloom filter
            if not self.redis.exists(self.url_seen_key):
                logger.info("Initializing URL seen bloom filter")
                try:
                    # Use a bloom filter with 100 million items and 0.01 false positive rate
                    # This requires approximately 119.5 MB of memory
                    self.redis.execute_command("BF.RESERVE", self.url_seen_key, 0.01, 100000000)
                except RedisError as e:
                    logger.error(f"Failed to initialize bloom filter: {e}")
                    logger.info("Falling back to simple set for URL seen detection")
                    self.use_simple_mode = True
                    # Initialize a set instead
                    if not self.redis.exists(self.url_seen_key):
                        self.redis.sadd(self.url_seen_key, "initialized")
        except RedisError as e:
            logger.error(f"Error initializing URL seen: {e}")
            # Fallback to set if bloom filter is not available
            self.use_simple_mode = True
            if not self.redis.exists(self.url_seen_key):
                self.redis.sadd(self.url_seen_key, "initialized")

    def add_url(self, url_obj: URL) -> bool:
        """Add a URL to the frontier"""
        with self.lock:
            url = url_obj.url

            # Check if URL has been seen
            if self.use_memory:
                if url in self.memory_storage['seen_urls']:
                    return False
                self.memory_storage['seen_urls'].add(url)
            else:
                if self.use_simple_mode:
                    if self.redis.sismember(self.url_seen_key, url):
                        return False
                    self.redis.sadd(self.url_seen_key, url)
                else:
                    if self._check_url_seen(url):
                        return False
                    self._mark_url_seen(url)

            # Add to priority queue
            priority_index = url_obj.priority.value % self.priority_count
            if self.use_memory:
                self.memory_storage['priority_queues'][priority_index].append(url_obj)
            else:
                priority_key = f"{self.priority_queue_key_prefix}{priority_index}"
                self.redis.rpush(priority_key, url_obj.json())

            return True

    def get_next_url(self) -> Optional[URL]:
        """Get the next URL to crawl"""
        with self.lock:
            # Try each priority queue
            for i in range(self.priority_count):
                if self.use_memory:
                    queue = self.memory_storage['priority_queues'][i]
                    if queue:
                        return queue.pop(0)
                else:
                    priority_key = f"{self.priority_queue_key_prefix}{i}"
                    url_data = self.redis.lpop(priority_key)
                    if url_data:
                        return URL.parse_raw(url_data)
            return None

    def _check_url_seen(self, url: str) -> bool:
        """Check if URL has been seen"""
        if self.use_memory:
            return url in self.memory_storage['seen_urls']
        elif self.use_simple_mode:
            return self.redis.sismember(self.url_seen_key, url)
        else:
            # Using Redis Bloom filter
            return bool(self.redis.getbit(self.url_seen_key, self._hash_url(url)))

    def _mark_url_seen(self, url: str) -> None:
        """Mark URL as seen"""
        if self.use_memory:
            self.memory_storage['seen_urls'].add(url)
        elif self.use_simple_mode:
            self.redis.sadd(self.url_seen_key, url)
        else:
            # Using Redis Bloom filter
            self.redis.setbit(self.url_seen_key, self._hash_url(url), 1)

    def _hash_url(self, url: str) -> int:
        """Hash URL for Bloom filter"""
        return hash(url) % (1 << 32)  # 32-bit hash

    def size(self) -> int:
        """Get the total size of all queues"""
        if self.use_memory:
            return sum(len(q) for q in self.memory_storage['priority_queues'])
        else:
            total = 0
            for i in range(self.priority_count):
                priority_key = f"{self.priority_queue_key_prefix}{i}"
                total += self.redis.llen(priority_key)
            return total

    def get_stats(self) -> Dict[str, Any]:
        """Get frontier statistics"""
        stats = {
            "size": self.size(),
            "priority_queues": {},
            "host_queues": {},
        }

        try:
            # Get priority queue stats
            for priority in range(1, self.priority_count + 1):
                queue_key = f"{self.priority_queue_key_prefix}{priority}"
                stats["priority_queues"][f"priority_{priority}"] = self.redis.llen(queue_key)

            # Get host queue stats (just count total host queues with items)
            host_queue_count = 0
            for host_id in range(self.host_count):
                queue_key = f"{self.host_queue_key_prefix}{host_id}"
                if self.redis.llen(queue_key) > 0:
                    host_queue_count += 1

            stats["host_queues"]["count_with_items"] = host_queue_count

            # Add URLs seen count if using simple mode
            if self.use_simple_mode:
                stats["urls_seen"] = self.redis.scard(self.url_seen_key)

            return stats
        except RedisError as e:
            logger.error(f"Error getting frontier stats: {e}")
            return stats

    def checkpoint(self) -> bool:
        """Save frontier state"""
        if self.use_memory:
            # No need to checkpoint in-memory storage
            return True

        try:
            # Save priority queues
            for i in range(self.priority_count):
                priority_key = f"{self.priority_queue_key_prefix}{i}"
                queue_data = []
                while True:
                    url_data = self.redis.lpop(priority_key)
                    if not url_data:
                        break
                    queue_data.append(url_data)

                # Save to file
                checkpoint_file = os.path.join(config.STORAGE_PATH, f"priority_queue_{i}.json")
                with open(checkpoint_file, 'w') as f:
                    json.dump(queue_data, f)

                # Restore queue
                for url_data in reversed(queue_data):
                    self.redis.rpush(priority_key, url_data)

            return True
        except Exception as e:
            logger.error(f"Error creating frontier checkpoint: {e}")
            return False

    def restore(self) -> bool:
        """Restore frontier state"""
        if self.use_memory:
            # No need to restore in-memory storage
            return True

        try:
            # Restore priority queues
            for i in range(self.priority_count):
                checkpoint_file = os.path.join(config.STORAGE_PATH, f"priority_queue_{i}.json")
                if os.path.exists(checkpoint_file):
                    with open(checkpoint_file, 'r') as f:
                        queue_data = json.load(f)

                    # Clear existing queue
                    priority_key = f"{self.priority_queue_key_prefix}{i}"
                    self.redis.delete(priority_key)

                    # Restore queue
                    for url_data in queue_data:
                        self.redis.rpush(priority_key, url_data)

            return True
        except Exception as e:
            logger.error(f"Error restoring frontier checkpoint: {e}")
            return False

    def clear(self) -> bool:
        """
        Clear all queues in the frontier

        Returns:
            bool: True if successful, False otherwise
        """
        try:
            # Delete all queue keys
            keys = []
            for priority in range(1, self.priority_count + 1):
                keys.append(f"{self.priority_queue_key_prefix}{priority}")

            for host_id in range(self.host_count):
                keys.append(f"{self.host_queue_key_prefix}{host_id}")

            if keys:
                self.redis.delete(*keys)

            # Reset URL seen filter (optional)
            self.redis.delete(self.url_seen_key)

            logger.info("Frontier cleared")
            return True
        except Exception as e:
            logger.error(f"Error clearing frontier: {e}")
            return False
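A small in-memory exercise of the frontier above, assuming config.py defines PRIORITY_QUEUE_NUM, HOST_QUEUE_NUM and STORAGE_PATH as the rest of the project expects; the URL is a placeholder.

from frontier import URLFrontier
from models import URL, Priority

frontier = URLFrontier(use_memory=True)  # no Redis required in this mode

frontier.add_url(URL(url="https://example.com/", priority=Priority.HIGH))
frontier.add_url(URL(url="https://example.com/"))  # already seen, returns False

print(frontier.size())  # 1
next_url = frontier.get_next_url()
print(next_url.url if next_url else None)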
models.py
ADDED
@@ -0,0 +1,167 @@
"""
Data models for the web crawler
"""

import time
import hashlib
import tldextract
from urllib.parse import urlparse, urljoin, urlunparse
from datetime import datetime
from typing import Dict, List, Any, Optional, Set, Tuple
from pydantic import BaseModel, Field, HttpUrl, validator
from enum import Enum
import logging

logger = logging.getLogger(__name__)


class URLStatus(str, Enum):
    """Status of a URL in the crawl process"""
    PENDING = "pending"  # Not yet processed
    IN_PROGRESS = "in_progress"  # Currently being processed
    COMPLETED = "completed"  # Successfully processed
    FAILED = "failed"  # Failed to process
    FILTERED = "filtered"  # Filtered out based on rules
    ROBOTSTXT_EXCLUDED = "robotstxt_excluded"  # Excluded by robots.txt


class Priority(int, Enum):
    """Priority levels for URLs"""
    VERY_HIGH = 1
    HIGH = 2
    MEDIUM = 3
    LOW = 4
    VERY_LOW = 5


class URL(BaseModel):
    """URL model with metadata for crawling"""
    url: str
    normalized_url: str = ""  # Normalized version of the URL
    domain: str = ""  # Domain extracted from the URL
    depth: int = 0  # Depth from seed URL
    discovered_at: datetime = Field(default_factory=datetime.now)
    last_crawled: Optional[datetime] = None
    completed_at: Optional[datetime] = None  # When the URL was completed/failed
    status: URLStatus = URLStatus.PENDING
    priority: Priority = Priority.MEDIUM
    parent_url: Optional[str] = None  # URL that led to this URL
    retries: int = 0  # Number of times retried
    error: Optional[str] = None  # Error message if failed
    metadata: Dict[str, Any] = Field(default_factory=dict)  # Additional metadata

    @validator("normalized_url", pre=True, always=True)
    def set_normalized_url(cls, v, values):
        """Normalize the URL if not already set"""
        if not v and "url" in values:
            return normalize_url(values["url"])
        return v

    @validator("domain", pre=True, always=True)
    def set_domain(cls, v, values):
        """Extract domain from URL if not already set"""
        if not v and "url" in values:
            parsed = tldextract.extract(values["url"])
            return f"{parsed.domain}.{parsed.suffix}" if parsed.suffix else parsed.domain
        return v


class RobotsInfo(BaseModel):
    """Information from robots.txt for a domain"""
    domain: str
    allowed: bool = True  # Whether crawling is allowed
    crawl_delay: Optional[float] = None  # Crawl delay in seconds
    last_fetched: datetime = Field(default_factory=datetime.now)
    user_agents: Dict[str, Dict[str, Any]] = Field(default_factory=dict)  # Info per user agent
    status_code: Optional[int] = None  # HTTP status code when fetching robots.txt


class Page(BaseModel):
    """Web page model with content and metadata"""
    url: str
    status_code: int
    content: str  # HTML content
    content_type: str
    content_length: int
    content_hash: str  # Hash of the content for duplicate detection
    headers: Dict[str, str] = Field(default_factory=dict)
    links: List[str] = Field(default_factory=list)  # Links extracted from the page
    crawled_at: datetime = Field(default_factory=datetime.now)
    redirect_url: Optional[str] = None  # URL after redirects
    elapsed_time: float = 0.0  # Time taken to fetch the page
    is_duplicate: bool = False  # Whether this is duplicate content
    metadata: Dict[str, Any] = Field(default_factory=dict)  # Additional metadata


class DomainStats(BaseModel):
    """Statistics for a domain"""
    domain: str
    pages_crawled: int = 0
    successful_crawls: int = 0
    failed_crawls: int = 0
    last_crawled: Optional[datetime] = None
    robots_info: Optional[RobotsInfo] = None
    crawl_times: List[float] = Field(default_factory=list)  # Recent crawl times
    errors: Dict[int, int] = Field(default_factory=dict)  # Status code counts for errors


def normalize_url(url: str) -> str:
    """
    Normalize a URL by:
    1. Converting to lowercase
    2. Removing fragments
    3. Removing default ports
    4. Sorting query parameters
    5. Removing trailing slashes
    6. Adding scheme if missing
    """
    try:
        # Parse URL
        parsed = urlparse(url)

        # Add scheme if missing
        if not parsed.scheme:
            url = 'http://' + url
            parsed = urlparse(url)

        # Get domain and path
        domain = parsed.netloc.lower()
        path = parsed.path

        # Remove default ports
        if ':' in domain:
            domain_parts = domain.split(':')
            if (parsed.scheme == 'http' and domain_parts[1] == '80') or \
               (parsed.scheme == 'https' and domain_parts[1] == '443'):
                domain = domain_parts[0]

        # Sort query parameters
        query = parsed.query
        if query:
            query_params = sorted(query.split('&'))
            query = '&'.join(query_params)

        # Remove trailing slashes from path
        while path.endswith('/') and len(path) > 1:
            path = path[:-1]

        # Add leading slash if missing
        if not path:
            path = '/'

        # Reconstruct URL
        normalized = f"{parsed.scheme}://{domain}{path}"
        if query:
            normalized += f"?{query}"

        logger.debug(f"Normalized URL: {url} -> {normalized}")
        return normalized

    except Exception as e:
        logger.error(f"Error normalizing URL {url}: {e}")
        return url


def calculate_content_hash(content: str) -> str:
    """Calculate hash of content for duplicate detection"""
    return hashlib.md5(content.encode('utf-8')).hexdigest()
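A short example of how these models compose: constructing a URL triggers the validators, which fill in normalized_url and domain automatically, and calculate_content_hash produces the digest stored on Page for duplicate detection. Illustrative only; the example URL is made up.

# Illustrative sketch of the models.py API.
from models import URL, Priority, normalize_url, calculate_content_hash

u = URL(url="HTTP://Example.com:80/a/b/?z=2&a=1#frag", priority=Priority.HIGH)
print(u.normalized_url)   # http://example.com/a/b?a=1&z=2 (validator calls normalize_url)
print(u.domain)           # example.com (validator uses tldextract)
print(u.status)           # URLStatus.PENDING by default

print(normalize_url("example.com/path/"))        # scheme added, trailing slash stripped
print(calculate_content_hash("<html></html>"))   # MD5 hex digest used for dedup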
mongo_cleanup.py
ADDED
@@ -0,0 +1,86 @@
#!/usr/bin/env python3
"""
Script to remove all web crawler data from MongoDB without interactive confirmation
"""

import logging
from pymongo import MongoClient
import sys

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(name)s] %(levelname)s: %(message)s'
)
logger = logging.getLogger("mongo_cleanup")

def cleanup_mongodb():
    """Remove all web crawler data from MongoDB"""
    try:
        # Connect to MongoDB
        logger.info("Connecting to MongoDB...")
        client = MongoClient("mongodb://localhost:27017/")

        # Access crawler database
        db = client["crawler"]

        # List and drop all collections
        collections = db.list_collection_names()

        if not collections:
            logger.info("No collections found in the crawler database")
        else:
            logger.info(f"Found {len(collections)} collections to drop: {collections}")

            for collection in collections:
                logger.info(f"Dropping collection: {collection}")
                db[collection].drop()

            logger.info("All crawler collections dropped successfully")

        # Optionally drop the entire database
        logger.info("Dropping entire crawler database")
        client.drop_database("crawler")

        # Check for any URLs collection in other databases that might be related
        all_dbs = client.list_database_names()
        for db_name in all_dbs:
            if db_name in ['admin', 'config', 'local']:
                continue

            db = client[db_name]
            if 'urls' in db.list_collection_names() or 'pages' in db.list_collection_names():
                logger.info(f"Found crawler-related collections in database: {db_name}")

                # Drop crawler-related collections found in other databases as well
                for collection in ['urls', 'pages', 'domains', 'stats']:
                    if collection in db.list_collection_names():
                        logger.info(f"Dropping collection {db_name}.{collection}")
                        db[collection].drop()

        logger.info("MongoDB cleanup completed successfully")
        return True

    except Exception as e:
        logger.error(f"Error cleaning up MongoDB: {e}")
        return False

if __name__ == "__main__":
    print("MongoDB Crawler Data Cleanup")
    print("--------------------------")
    print("This script will remove all web crawler collections from MongoDB")
    print()

    if len(sys.argv) > 1 and sys.argv[1] == '--force':
        # Non-interactive mode for scripting
        success = cleanup_mongodb()
        sys.exit(0 if success else 1)
    else:
        # Interactive mode
        proceed = input("Do you want to proceed with MongoDB cleanup? (y/n): ")
        if proceed.lower() != 'y':
            print("Cleanup cancelled")
            sys.exit(0)

        success = cleanup_mongodb()
        print(f"\nMongoDB cleanup: {'Completed' if success else 'Failed'}")
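For scripted resets, the --force flag above skips the interactive prompt and maps the return value of cleanup_mongodb() onto the process exit code. A minimal sketch of driving it from another Python process; the working directory containing mongo_cleanup.py is an assumption:

# Illustrative sketch: run the cleanup non-interactively from another process.
import subprocess
import sys

result = subprocess.run([sys.executable, "mongo_cleanup.py", "--force"])
print("cleanup ok" if result.returncode == 0 else "cleanup failed")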
parser.py
ADDED
@@ -0,0 +1,316 @@
"""
HTML Parser and URL Extractor component for web crawler
"""

import logging
import re
from typing import Dict, List, Set, Tuple, Optional, Any
from urllib.parse import urlparse, urljoin, unquote
from bs4 import BeautifulSoup
import tldextract
import hashlib
import os

from models import URL, Page, Priority, normalize_url
import config

# Configure logging
logging.basicConfig(
    level=getattr(logging, config.LOG_LEVEL),
    format=config.LOG_FORMAT
)
logger = logging.getLogger(__name__)


class HTMLParser:
    """
    Parses HTML content and extracts URLs and other information
    """

    def __init__(self):
        """Initialize HTML parser"""
        # Compile URL filter regex patterns for efficiency
        self.url_filters = [re.compile(pattern) for pattern in config.URL_FILTERS]

    def parse(self, page: Page, base_url: Optional[str] = None) -> Tuple[List[str], Dict[str, Any]]:
        """
        Parse HTML content and extract URLs and metadata

        Args:
            page: Page object containing HTML content
            base_url: Base URL for resolving relative links (defaults to page URL)

        Returns:
            Tuple of (extracted URLs, metadata)
        """
        if not page or not page.content:
            return [], {}

        # Use page URL as base URL if not provided
        if not base_url:
            base_url = page.url

        # Parse HTML content
        soup = BeautifulSoup(page.content, 'html.parser')

        # Extract URLs
        urls = self._extract_urls(soup, base_url)

        # Extract metadata
        metadata = self._extract_metadata(soup)

        return urls, metadata

    def _extract_urls(self, soup: BeautifulSoup, base_url: str) -> List[str]:
        """
        Extract and normalize URLs from HTML content

        Args:
            soup: BeautifulSoup object
            base_url: Base URL for resolving relative links

        Returns:
            List of normalized URLs
        """
        urls = set()
        all_urls = set()  # Track all URLs before filtering
        filtered_urls = set()  # Track filtered URLs

        logger.debug(f"Extracting URLs from page: {base_url}")

        # Extract URLs from <a> tags
        for link in soup.find_all('a', href=True):
            href = link['href'].strip()
            if href and not href.startswith(('#', 'javascript:', 'mailto:', 'tel:')):
                # Resolve relative URLs
                try:
                    absolute_url = urljoin(base_url, href)
                    all_urls.add(absolute_url)
                    # Normalize URL
                    normalized_url = normalize_url(absolute_url)
                    # Apply URL filters
                    if self._should_allow_url(normalized_url):
                        urls.add(normalized_url)
                    else:
                        filtered_urls.add(normalized_url)
                except Exception as e:
                    logger.debug(f"Error processing URL {href}: {e}")

        # Extract URLs from other elements like <iframe>, <frame>, <img>, etc.
        for tag_name, attr in [('frame', 'src'), ('iframe', 'src'), ('img', 'src'),
                               ('link', 'href'), ('script', 'src'), ('area', 'href')]:
            for tag in soup.find_all(tag_name, attrs={attr: True}):
                url = tag[attr].strip()
                if url and not url.startswith(('#', 'javascript:', 'data:', 'mailto:', 'tel:')):
                    try:
                        absolute_url = urljoin(base_url, url)
                        all_urls.add(absolute_url)
                        normalized_url = normalize_url(absolute_url)
                        if self._should_allow_url(normalized_url):
                            urls.add(normalized_url)
                        else:
                            filtered_urls.add(normalized_url)
                    except Exception as e:
                        logger.debug(f"Error processing URL {url}: {e}")

        # Log statistics
        logger.debug(f"Found {len(all_urls)} total URLs")
        logger.debug(f"Filtered {len(filtered_urls)} URLs")
        logger.debug(f"Accepted {len(urls)} URLs")

        # Log some example filtered URLs for debugging
        if filtered_urls:
            sample_filtered = list(filtered_urls)[:5]
            logger.debug(f"Sample filtered URLs: {sample_filtered}")

        # Return list of unique URLs
        return list(urls)

    def _should_allow_url(self, url: str) -> bool:
        """
        Check if URL should be allowed based on filters

        Args:
            url: URL to check

        Returns:
            True if URL should be allowed, False otherwise
        """
        try:
            parsed = urlparse(url)

            # Check scheme
            if parsed.scheme not in config.ALLOWED_SCHEMES:
                logger.debug(f"URL filtered - invalid scheme: {url}")
                return False

            # Check domain restrictions
            domain = self._extract_domain(url)

            # Check allowed domains if set
            if config.ALLOWED_DOMAINS and domain not in config.ALLOWED_DOMAINS:
                logger.debug(f"URL filtered - domain not allowed: {url} (domain: {domain}, allowed: {config.ALLOWED_DOMAINS})")
                return False

            # Check excluded domains
            if domain in config.EXCLUDED_DOMAINS:
                logger.debug(f"URL filtered - domain excluded: {url}")
                return False

            # Check URL filters
            for pattern in self.url_filters:
                if pattern.match(url):
                    logger.debug(f"URL filtered - pattern match: {url}")
                    return False

            return True

        except Exception as e:
            logger.debug(f"Error checking URL {url}: {e}")
            return False

    def _extract_metadata(self, soup: BeautifulSoup) -> Dict[str, Any]:
        """
        Extract metadata from HTML content

        Args:
            soup: BeautifulSoup object

        Returns:
            Dictionary of metadata
        """
        metadata = {}

        # Extract title
        title_tag = soup.find('title')
        if title_tag and title_tag.string:
            metadata['title'] = title_tag.string.strip()

        # Extract meta description
        description_tag = soup.find('meta', attrs={'name': 'description'})
        if description_tag and description_tag.get('content'):
            metadata['description'] = description_tag['content'].strip()

        # Extract meta keywords
        keywords_tag = soup.find('meta', attrs={'name': 'keywords'})
        if keywords_tag and keywords_tag.get('content'):
            metadata['keywords'] = [k.strip() for k in keywords_tag['content'].split(',')]

        # Extract canonical URL
        canonical_tag = soup.find('link', attrs={'rel': 'canonical'})
        if canonical_tag and canonical_tag.get('href'):
            metadata['canonical_url'] = canonical_tag['href'].strip()

        # Extract robots meta
        robots_tag = soup.find('meta', attrs={'name': 'robots'})
        if robots_tag and robots_tag.get('content'):
            metadata['robots'] = robots_tag['content'].strip()

        # Extract Open Graph metadata
        og_metadata = {}
        for meta_tag in soup.find_all('meta', attrs={'property': re.compile('^og:')}):
            if meta_tag.get('content'):
                property_name = meta_tag['property'][3:]  # Remove 'og:' prefix
                og_metadata[property_name] = meta_tag['content'].strip()

        if og_metadata:
            metadata['open_graph'] = og_metadata

        # Extract Twitter Card metadata
        twitter_metadata = {}
        for meta_tag in soup.find_all('meta', attrs={'name': re.compile('^twitter:')}):
            if meta_tag.get('content'):
                property_name = meta_tag['name'][8:]  # Remove 'twitter:' prefix
                twitter_metadata[property_name] = meta_tag['content'].strip()

        if twitter_metadata:
            metadata['twitter_card'] = twitter_metadata

        # Extract schema.org structured data (JSON-LD)
        schema_metadata = []
        for script in soup.find_all('script', attrs={'type': 'application/ld+json'}):
            if script.string:
                try:
                    import json
                    schema_data = json.loads(script.string)
                    schema_metadata.append(schema_data)
                except Exception as e:
                    logger.debug(f"Error parsing JSON-LD: {e}")

        if schema_metadata:
            metadata['structured_data'] = schema_metadata

        # Extract text content statistics
        text_content = soup.get_text(separator=' ', strip=True)
        if text_content:
            word_count = len(text_content.split())
            metadata['word_count'] = word_count
            metadata['text_length'] = len(text_content)

        return metadata

    def _extract_domain(self, url: str) -> str:
        """Extract domain from URL"""
        parsed = tldextract.extract(url)
        return f"{parsed.domain}.{parsed.suffix}" if parsed.suffix else parsed.domain

    def calculate_priority(self, url: str, metadata: Dict[str, Any]) -> Priority:
        """
        Calculate priority for a URL based on various factors

        Args:
            url: URL to calculate priority for
            metadata: Metadata extracted from the page

        Returns:
            Priority enum value
        """
        # Default priority
        priority = Priority.MEDIUM

        try:
            # Extract path depth
            parsed = urlparse(url)
            path = parsed.path
            depth = len([p for p in path.split('/') if p])

            # Prioritize URLs with shorter paths
            if depth <= 1:
                priority = Priority.HIGH
            elif depth <= 3:
                priority = Priority.MEDIUM
            else:
                priority = Priority.LOW

            # Prioritize URLs with certain keywords in path
            if re.search(r'(article|blog|news|post)', path, re.IGNORECASE):
                priority = Priority.HIGH

            # Deprioritize URLs with pagination patterns
            if re.search(r'(page|p|pg)=\d+', url, re.IGNORECASE):
                priority = Priority.LOW

            # Check metadata
            if metadata:
                # Prioritize based on title
                title = metadata.get('title', '')
                if title and len(title) > 10:
                    priority = min(priority, Priority.MEDIUM)  # Raise priority if it's lower

                # Prioritize based on description
                description = metadata.get('description', '')
                if description and len(description) > 50:
                    priority = min(priority, Priority.MEDIUM)  # Raise priority if it's lower

                # Prioritize based on word count
                word_count = metadata.get('word_count', 0)
                if word_count > 1000:
                    priority = min(priority, Priority.HIGH)  # High priority for content-rich pages
                elif word_count > 500:
                    priority = min(priority, Priority.MEDIUM)

            return priority

        except Exception as e:
            logger.debug(f"Error calculating priority for URL {url}: {e}")
            return Priority.MEDIUM
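A hedged usage sketch for HTMLParser: parse() takes a Page (as defined in models.py) and returns the filtered outbound links plus the extracted metadata, which can then feed calculate_priority(). The Page field values are placeholders, and the sketch assumes the settings in config.py (ALLOWED_SCHEMES, ALLOWED_DOMAINS, URL_FILTERS) permit the example domain.

# Illustrative sketch of the HTMLParser API.
from models import Page, calculate_content_hash
from parser import HTMLParser

html = "<html><head><title>Example article</title></head><body><a href='/blog/post-1'>Post</a></body></html>"
page = Page(
    url="https://example.com/",
    status_code=200,
    content=html,
    content_type="text/html",
    content_length=len(html),
    content_hash=calculate_content_hash(html),
)

html_parser = HTMLParser()
links, metadata = html_parser.parse(page)
for link in links:
    print(link, html_parser.calculate_priority(link, metadata))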
requirements.txt
ADDED
@@ -0,0 +1,43 @@
# Core dependencies
requests==2.31.0
beautifulsoup4==4.12.3
aiohttp==3.9.3
lxml==4.9.2
html5lib==1.1
pydantic==1.10.7
pymongo==4.6.1
redis==5.0.1
boto3==1.26.123
docopt==0.6.2

# URL and DNS handling
dnspython==2.3.0
tldextract==5.1.1
validators==0.20.0
robotexclusionrulesparser==1.7.1
urllib3==1.26.15

# Monitoring and metrics
prometheus-client==0.16.0

# HTML processing
html2text==2020.1.16

# Async and concurrency
anyio==3.6.2
asyncio==3.4.3

# Utilities
python-dateutil==2.8.2
pytz==2023.3
retry==0.9.2
cryptography==40.0.1
cachetools==5.3.0

# Added from the code block
openai==1.12.0
gradio==4.16.0
chardet==5.2.0

# Dotenv
python-dotenv
robots.py
ADDED
@@ -0,0 +1,203 @@
"""
Robots.txt handler for web crawler
"""

import time
import logging
import requests
from urllib.parse import urlparse, urljoin
from typing import Dict, Optional, Tuple
import tldextract
from datetime import datetime, timedelta
from cachetools import TTLCache
import robotexclusionrulesparser

from models import RobotsInfo
import config

# Import local configuration if available
try:
    import local_config
    # Override config settings with local settings
    for key in dir(local_config):
        if key.isupper():
            setattr(config, key, getattr(local_config, key))
    logging.info("Loaded local configuration")
except ImportError:
    pass

# Configure logging
logging.basicConfig(
    level=getattr(logging, config.LOG_LEVEL),
    format=config.LOG_FORMAT
)
logger = logging.getLogger(__name__)


class RobotsHandler:
    """Handles robots.txt fetching and parsing"""

    def __init__(self, user_agent: Optional[str] = None, cache_size: int = 1000, cache_ttl: int = 3600):
        """
        Initialize robots handler

        Args:
            user_agent: User agent to use when fetching robots.txt
            cache_size: Maximum number of robots.txt rules to cache
            cache_ttl: Time to live for cache entries in seconds
        """
        self.user_agent = user_agent or config.USER_AGENT
        self.parser = robotexclusionrulesparser.RobotExclusionRulesParser()

        # Cache of robots.txt rules for domains
        self.robots_cache = TTLCache(maxsize=cache_size, ttl=cache_ttl)

        # Create request session
        self.session = requests.Session()
        self.session.headers.update({'User-Agent': self.user_agent})

    def can_fetch(self, url: str) -> Tuple[bool, Optional[float]]:
        """
        Check if URL can be fetched according to robots.txt

        Args:
            url: URL to check

        Returns:
            Tuple of (can_fetch, crawl_delay), where crawl_delay is in seconds
        """
        try:
            parsed = urlparse(url)
            base_url = f"{parsed.scheme}://{parsed.netloc}"
            domain = self._get_domain(url)

            # Check if robots info is in cache
            robots_info = self._get_robots_info(base_url, domain)

            # Check if allowed
            path = parsed.path or "/"
            allowed = robots_info.allowed
            if allowed:
                allowed = self.parser.is_allowed(self.user_agent, path)

            # Get crawl delay
            crawl_delay = robots_info.crawl_delay
            if not crawl_delay and hasattr(self.parser, 'get_crawl_delay'):
                try:
                    crawl_delay = float(self.parser.get_crawl_delay(self.user_agent) or 0)
                except Exception:
                    crawl_delay = 0

            return allowed, crawl_delay

        except Exception as e:
            logger.warning(f"Error checking robots.txt for {url}: {e}")
            # In case of error, assume allowed
            return True, None

    def _get_robots_info(self, base_url: str, domain: str) -> RobotsInfo:
        """
        Get robots.txt info for a domain

        Args:
            base_url: Base URL of the domain
            domain: Domain name

        Returns:
            RobotsInfo object
        """
        # Check if in cache
        if domain in self.robots_cache:
            return self.robots_cache[domain]

        # Fetch robots.txt
        robots_url = urljoin(base_url, "/robots.txt")
        try:
            response = self.session.get(
                robots_url,
                timeout=config.CRAWL_TIMEOUT,
                allow_redirects=True
            )

            status_code = response.status_code

            # If robots.txt exists
            if status_code == 200:
                # Parse robots.txt
                self.parser.parse(response.text)

                # Create simpler user agents info that doesn't depend on get_user_agents
                user_agents = {}
                # Just store info for our specific user agent
                crawl_delay = None
                if hasattr(self.parser, 'get_crawl_delay'):
                    try:
                        crawl_delay = self.parser.get_crawl_delay(self.user_agent)
                    except Exception:
                        crawl_delay = None

                user_agents[self.user_agent] = {
                    'crawl_delay': crawl_delay
                }

                # Create robots info
                robots_info = RobotsInfo(
                    domain=domain,
                    allowed=True,
                    crawl_delay=crawl_delay,
                    last_fetched=datetime.now(),
                    user_agents=user_agents,
                    status_code=status_code
                )
            else:
                # If no robots.txt or error, assume allowed
                self.parser.parse("")  # Parse empty robots.txt
                robots_info = RobotsInfo(
                    domain=domain,
                    allowed=True,
                    crawl_delay=None,
                    last_fetched=datetime.now(),
                    user_agents={},
                    status_code=status_code
                )

            # Cache robots info
            self.robots_cache[domain] = robots_info
            return robots_info

        except requests.RequestException as e:
            logger.warning(f"Error fetching robots.txt from {robots_url}: {e}")

            # In case of error, assume allowed
            self.parser.parse("")  # Parse empty robots.txt
            robots_info = RobotsInfo(
                domain=domain,
                allowed=True,
                crawl_delay=None,
                last_fetched=datetime.now(),
                user_agents={},
                status_code=None
            )

            # Cache robots info
            self.robots_cache[domain] = robots_info
            return robots_info

    def _get_domain(self, url: str) -> str:
        """Extract domain from URL"""
        parsed = tldextract.extract(url)
        return f"{parsed.domain}.{parsed.suffix}" if parsed.suffix else parsed.domain

    def clear_cache(self) -> None:
        """Clear the robots.txt cache"""
        self.robots_cache.clear()

    def update_cache(self, domain: str) -> None:
        """
        Force update of a domain's robots.txt in the cache

        Args:
            domain: Domain to update
        """
        if domain in self.robots_cache:
            del self.robots_cache[domain]
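A small sketch of the RobotsHandler contract: can_fetch() returns both the allow/deny decision and any crawl-delay hint, and results are cached per domain for cache_ttl seconds. The URL and user agent below are illustrative.

# Illustrative sketch of the RobotsHandler API.
from robots import RobotsHandler

handler = RobotsHandler(user_agent="MyCrawler/1.0", cache_ttl=600)
allowed, crawl_delay = handler.can_fetch("https://example.com/some/page")

if allowed:
    # Honour the crawl delay advertised by robots.txt, if any
    delay = crawl_delay or 0.0
    print(f"Fetch permitted, waiting {delay:.1f}s between requests to this host")
else:
    print("robots.txt disallows this path for our user agent")

# Force a re-fetch of a domain's robots.txt on the next check
handler.update_cache("example.com")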
run_crawler.py
ADDED
@@ -0,0 +1,237 @@
#!/usr/bin/env python3
"""
Main script to run the web crawler with command line arguments
"""

import os
import sys
import time
import logging
import argparse
import signal
from urllib.parse import urlparse

# Add the current directory to path if needed
script_dir = os.path.dirname(os.path.abspath(__file__))
if script_dir not in sys.path:
    sys.path.insert(0, script_dir)

# Configure logging - do this first
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(name)s] %(levelname)s: %(message)s',
    handlers=[
        logging.StreamHandler(sys.stdout),
        logging.FileHandler(os.path.join(script_dir, 'crawler.log'))
    ]
)
logger = logging.getLogger("run_crawler")

# Now import the crawler components
logger.info("Importing crawler modules...")
try:
    from crawler import Crawler
    from models import Priority
    logger.info("Successfully imported crawler modules")
except Exception as e:
    logger.error(f"Error importing crawler modules: {e}", exc_info=True)
    sys.exit(1)

def parse_arguments():
    """Parse command line arguments"""
    parser = argparse.ArgumentParser(description='Run the web crawler with custom settings')

    parser.add_argument('--seed', nargs='+', metavar='URL',
                        help='One or more seed URLs to start crawling')

    parser.add_argument('--depth', type=int, default=None,
                        help='Maximum crawl depth')

    parser.add_argument('--workers', type=int, default=None,
                        help='Number of worker threads')

    parser.add_argument('--delay', type=float, default=None,
                        help='Delay between requests to the same domain (in seconds)')

    parser.add_argument('--respect-robots', dest='respect_robots', action='store_true',
                        help='Respect robots.txt rules')

    parser.add_argument('--ignore-robots', dest='respect_robots', action='store_false',
                        help='Ignore robots.txt rules')

    parser.add_argument('--user-agent', type=str, default=None,
                        help='User agent to use for requests')

    parser.add_argument('--async', dest='async_mode', action='store_true',
                        help='Use async mode for requests')

    parser.add_argument('--domain-filter', type=str, default=None,
                        help='Only crawl URLs that match this domain')

    parser.add_argument('--reset-db', action='store_true',
                        help='Reset MongoDB and flush Redis data before starting')

    parser.add_argument('--verbose', action='store_true',
                        help='Enable verbose logging')

    args = parser.parse_args()

    # Set log level based on verbose flag
    if args.verbose:
        logger.setLevel(logging.DEBUG)
        logger.debug("Verbose logging enabled")

    return args

def reset_databases():
    """Reset MongoDB and flush Redis data"""
    success = True

    # Reset MongoDB
    try:
        logger.info("Starting MongoDB cleanup...")
        from mongo_cleanup import cleanup_mongodb
        mongo_success = cleanup_mongodb()
        if not mongo_success:
            logger.warning("MongoDB cleanup may not have been completely successful")
            success = False
        else:
            logger.info("MongoDB cleanup completed successfully")
    except Exception as e:
        logger.error(f"Error cleaning up MongoDB: {e}", exc_info=True)
        success = False

    # Flush Redis
    try:
        logger.info("Starting Redis flush...")
        import redis
        logger.debug("Connecting to Redis to flush data...")

        # Set a timeout for Redis connection
        r = redis.Redis(host='localhost', port=6379, db=0, socket_timeout=5)

        # Check if Redis is available
        try:
            logger.debug("Testing Redis connection...")
            ping_result = r.ping()
            logger.debug(f"Redis ping result: {ping_result}")

            # If connection works, flush all data
            logger.info("Flushing all Redis data...")
            result = r.flushall()
            logger.info(f"Redis flush result: {result}")
        except redis.ConnectionError as e:
            logger.error(f"Redis connection error: {e}")
            success = False
    except Exception as e:
        logger.error(f"Error flushing Redis: {e}", exc_info=True)
        success = False

    return success

def setup_signal_handlers(crawler_instance):
    """Setup signal handlers for graceful shutdown"""
    def signal_handler(sig, frame):
        logger.info(f"Received signal {sig}, shutting down gracefully...")
        if crawler_instance and crawler_instance.running:
            logger.info("Stopping crawler...")
            crawler_instance.stop()
        sys.exit(0)

    signal.signal(signal.SIGINT, signal_handler)
    signal.signal(signal.SIGTERM, signal_handler)

def run_crawler():
    """Run the crawler with command-line arguments"""
    args = parse_arguments()
    crawler = None

    try:
        logger.info("Starting the web crawler...")

        # Reset database if requested
        if args.reset_db:
            logger.info("Resetting MongoDB and flushing Redis data...")
            if not reset_databases():
                logger.warning("Database reset was not completely successful")

        # Create crawler instance
        logger.info("Creating crawler instance...")
        crawler = Crawler()
        logger.info("Crawler instance created successfully")

        # Setup signal handlers
        setup_signal_handlers(crawler)

        # Override settings from command line if provided
        if args.depth is not None:
            import config
            config.MAX_DEPTH = args.depth
            logger.info(f"Setting maximum depth to {args.depth}")

        if args.delay is not None:
            import config
            config.DELAY_BETWEEN_REQUESTS = args.delay
            logger.info(f"Setting delay between requests to {args.delay} seconds")

        if args.respect_robots is not None:
            import config
            config.RESPECT_ROBOTS_TXT = args.respect_robots
            logger.info(f"Respect robots.txt: {args.respect_robots}")

        if args.user_agent is not None:
            import config
            config.USER_AGENT = args.user_agent
            logger.info(f"Using user agent: {args.user_agent}")

        # Add seed URLs if provided
        if args.seed:
            logger.info(f"Adding {len(args.seed)} seed URLs")
            seed_urls = []
            for url in args.seed:
                if not (url.startswith('http://') or url.startswith('https://')):
                    url = 'https://' + url
                seed_urls.append(url)
                logger.debug(f"Added seed URL: {url}")

            # Add the URLs to the frontier
            logger.info("Adding seed URLs to frontier...")
            added = crawler.add_seed_urls(seed_urls, Priority.VERY_HIGH)
            logger.info(f"Successfully added {added} seed URLs to the frontier")

        # Apply domain filter if provided
        if args.domain_filter:
            import config

            # Allow both domain.com or http://domain.com formats
            domain = args.domain_filter
            if domain.startswith('http://') or domain.startswith('https://'):
                domain = urlparse(domain).netloc

            config.ALLOWED_DOMAINS = [domain]
            logger.info(f"Filtering to domain: {domain}")

        # Start the crawler
        num_workers = args.workers if args.workers is not None else 4

        logger.info(f"Starting crawler with {num_workers} workers...")
        crawler.start(num_workers=num_workers, async_mode=args.async_mode)
        # If we get here, crawler has finished or was stopped
        logger.info("Crawler finished")

    except KeyboardInterrupt:
        logger.info("Crawler interrupted by user")
        if crawler and crawler.running:
            logger.info("Stopping crawler...")
            crawler.stop()
    except Exception as e:
        logger.error(f"Error running crawler: {e}", exc_info=True)
        if crawler and crawler.running:
            try:
                logger.info("Attempting to stop crawler after error...")
                crawler.stop()
            except Exception:
                pass

if __name__ == "__main__":
    run_crawler()
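Typical invocation is from the command line; the sketch below mirrors the argparse options defined above by setting sys.argv before calling run_crawler(), which can be convenient in a notebook or test harness. The seed URL and option values are placeholders.

# Illustrative sketch: drive run_crawler() programmatically via sys.argv.
import sys
from run_crawler import run_crawler

# Equivalent to:
#   python run_crawler.py --seed example.com --depth 2 --workers 2 --domain-filter example.com --reset-db
sys.argv = [
    "run_crawler.py",
    "--seed", "example.com",
    "--depth", "2",
    "--workers", "2",
    "--domain-filter", "example.com",
    "--reset-db",
]
run_crawler()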
seo_analyzer_ui.py
ADDED
@@ -0,0 +1,708 @@
"""
SEO Analyzer UI using Gradio, Web Crawler, and OpenAI
"""

import gradio as gr
import logging
import json
from typing import Dict, List, Any, Tuple, Optional
from urllib.parse import urlparse
import tldextract
from openai import OpenAI
import time
import os
import threading
import queue
import shutil
import uuid
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime
import tempfile

from crawler import Crawler
from frontier import URLFrontier
from models import URL, Page
import config
from run_crawler import reset_databases
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())

# Check if we're in deployment mode (e.g., Hugging Face Spaces)
IS_DEPLOYMENT = os.getenv('DEPLOYMENT', 'false').lower() == 'true'

# Custom CSS for better styling
CUSTOM_CSS = """
.container {
    max-width: 1200px !important;
    margin: auto;
    padding: 20px;
}

.header {
    text-align: center;
    margin-bottom: 2rem;
}

.header h1 {
    color: #2d3748;
    font-size: 2.5rem;
    font-weight: 700;
    margin-bottom: 1rem;
}

.header p {
    color: #4a5568;
    font-size: 1.1rem;
    max-width: 800px;
    margin: 0 auto;
}

.input-section {
    background: #f7fafc;
    border-radius: 12px;
    padding: 24px;
    margin-bottom: 24px;
    box-shadow: 0 2px 4px rgba(0,0,0,0.1);
}

.analysis-section {
    background: white;
    border-radius: 12px;
    padding: 24px;
    margin-top: 24px;
    box-shadow: 0 2px 4px rgba(0,0,0,0.1);
}

.log-section {
    font-family: monospace;
    background: #1a202c;
    color: #e2e8f0;
    padding: 16px;
    border-radius: 8px;
    margin-top: 24px;
}

/* Custom styling for inputs */
.input-container {
    background: white;
    padding: 16px;
    border-radius: 8px;
    margin-bottom: 16px;
}

/* Custom styling for the slider */
.slider-container {
    padding: 12px;
    background: white;
    border-radius: 8px;
}

/* Custom styling for buttons */
.primary-button {
    background: #4299e1 !important;
    color: white !important;
    padding: 12px 24px !important;
    border-radius: 8px !important;
    font-weight: 600 !important;
    transition: all 0.3s ease !important;
}

.primary-button:hover {
    background: #3182ce !important;
    transform: translateY(-1px) !important;
}

/* Markdown output styling */
.markdown-output {
    font-family: system-ui, -apple-system, sans-serif;
    line-height: 1.6;
}

.markdown-output h1 {
    color: #2d3748;
    border-bottom: 2px solid #e2e8f0;
    padding-bottom: 0.5rem;
}

.markdown-output h2 {
    color: #4a5568;
    margin-top: 2rem;
}

.markdown-output h3 {
    color: #718096;
    margin-top: 1.5rem;
}

/* Progress bar styling */
.progress-bar {
    height: 8px !important;
    border-radius: 4px !important;
    background: #ebf8ff !important;
}

.progress-bar-fill {
    background: #4299e1 !important;
    border-radius: 4px !important;
}

/* Add some spacing between sections */
.gap {
    margin: 2rem 0;
}
"""

# Create a custom handler that will store logs in a queue
class QueueHandler(logging.Handler):
    def __init__(self, log_queue):
        super().__init__()
        self.log_queue = log_queue

    def emit(self, record):
        log_entry = self.format(record)
        try:
            self.log_queue.put_nowait(f"{datetime.now().strftime('%H:%M:%S')} - {log_entry}")
        except queue.Full:
            pass  # Ignore if queue is full

# Configure logging
logging.basicConfig(
    level=getattr(logging, config.LOG_LEVEL),
    format='%(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

logger.info(f"IS_DEPLOYMENT: {IS_DEPLOYMENT}")

class InMemoryStorage:
    """Simple in-memory storage for deployment mode"""
    def __init__(self):
        self.urls = {}
        self.pages = {}

    def reset(self):
        self.urls.clear()
        self.pages.clear()

    def add_url(self, url_obj):
        self.urls[url_obj.url] = url_obj

    def add_page(self, page_obj):
        self.pages[page_obj.url] = page_obj

    def get_url(self, url):
        return self.urls.get(url)

    def get_page(self, url):
        return self.pages.get(url)

class SEOAnalyzer:
    """
    SEO Analyzer that combines web crawler with OpenAI analysis
    """

    def __init__(self, api_key: str):
        """Initialize SEO Analyzer"""
        self.client = OpenAI(api_key=api_key)
        self.crawler = None
        self.crawled_pages = []
        self.pages_crawled = 0
        self.max_pages = 0
        self.crawl_complete = threading.Event()
        self.log_queue = queue.Queue(maxsize=1000)
        self.session_id = str(uuid.uuid4())
        self.storage = InMemoryStorage() if IS_DEPLOYMENT else None

        # Add queue handler to logger
        queue_handler = QueueHandler(self.log_queue)
        queue_handler.setFormatter(logging.Formatter('%(levelname)s - %(message)s'))
        logger.addHandler(queue_handler)

    def _setup_session_storage(self) -> Tuple[str, str, str]:
        """
        Set up session-specific storage directories

        Returns:
            Tuple of (storage_path, html_path, log_path)
        """
        # Create session-specific paths
        session_storage = os.path.join(config.STORAGE_PATH, self.session_id)
        session_html = os.path.join(session_storage, "html")
        session_logs = os.path.join(session_storage, "logs")

        # Create directories
        os.makedirs(session_storage, exist_ok=True)
        os.makedirs(session_html, exist_ok=True)
        os.makedirs(session_logs, exist_ok=True)

        logger.info(f"Created session storage at {session_storage}")
        return session_storage, session_html, session_logs

    def _cleanup_session_storage(self):
        """Clean up session-specific storage"""
        session_path = os.path.join(config.STORAGE_PATH, self.session_id)
        try:
            if os.path.exists(session_path):
                shutil.rmtree(session_path)
                logger.info(f"Cleaned up session storage at {session_path}")
        except Exception as e:
            logger.error(f"Error cleaning up session storage: {e}")

    def _reset_storage(self):
        """Reset storage based on deployment mode"""
        if IS_DEPLOYMENT:
            self.storage.reset()
        else:
            reset_databases()

    def analyze_website(self, url: str, max_pages: int = 10, progress: gr.Progress = gr.Progress()) -> Tuple[str, List[Dict], str]:
        """
        Crawl website and analyze SEO using OpenAI

        Args:
            url: Seed URL to crawl
            max_pages: Maximum number of pages to crawl
            progress: Gradio progress indicator

        Returns:
            Tuple of (overall analysis, list of page-specific analyses, log output)
        """
        try:
            # Reset state
            self.crawled_pages = []
            self.pages_crawled = 0
            self.max_pages = max_pages
            self.crawl_complete.clear()

            # Set up storage
            if IS_DEPLOYMENT:
                # Use temporary directory for file storage in deployment
                temp_dir = tempfile.mkdtemp()
                session_storage = temp_dir
                session_html = os.path.join(temp_dir, "html")
                session_logs = os.path.join(temp_dir, "logs")
                os.makedirs(session_html, exist_ok=True)
                os.makedirs(session_logs, exist_ok=True)
            else:
                session_storage, session_html, session_logs = self._setup_session_storage()

            # Update config paths for this session
            config.HTML_STORAGE_PATH = session_html
            config.LOG_PATH = session_logs

            # Clear log queue
            while not self.log_queue.empty():
                self.log_queue.get_nowait()

            logger.info(f"Starting analysis of {url} with max_pages={max_pages}")

            # Reset storage
            logger.info("Resetting storage...")
            self._reset_storage()
            logger.info("Storage reset completed")

            # Create new crawler instance with appropriate storage
            logger.info("Creating crawler instance...")
            if IS_DEPLOYMENT:
                # In deployment mode, use in-memory storage
                self.crawler = Crawler(storage=self.storage)
                # Set frontier to use memory mode
                self.crawler.frontier = URLFrontier(use_memory=True)
            else:
                # In local mode, use MongoDB and Redis
                self.crawler = Crawler()
            logger.info("Crawler instance created successfully")

            # Extract domain for filtering
            domain = self._extract_domain(url)
            logger.info(f"Analyzing domain: {domain}")

            # Add seed URL and configure domain filter
            self.crawler.add_seed_urls([url])
            config.ALLOWED_DOMAINS = [domain]
            logger.info("Added seed URL and configured domain filter")

            # Override the crawler's _process_url method to capture pages
            original_process_url = self.crawler._process_url
            def wrapped_process_url(url_obj):
                if self.pages_crawled >= self.max_pages:
                    self.crawler.running = False  # Signal crawler to stop
                    self.crawl_complete.set()
                    return

                original_process_url(url_obj)

                # Get the page based on storage mode
                if IS_DEPLOYMENT:
                    # In deployment mode, get page from in-memory storage
                    page = self.storage.get_page(url_obj.url)
                    if page:
                        _, metadata = self.crawler.parser.parse(page)
                        self.crawled_pages.append({
                            'url': url_obj.url,
                            'content': page.content,
                            'metadata': metadata
                        })
                        self.pages_crawled += 1
                        logger.info(f"Crawled page {self.pages_crawled}/{max_pages}: {url_obj.url}")
                else:
                    # In local mode, get page from MongoDB
                    page_data = self.crawler.pages_collection.find_one({'url': url_obj.url})
                    if page_data and page_data.get('content'):
                        _, metadata = self.crawler.parser.parse(Page(**page_data))
                        self.crawled_pages.append({
                            'url': url_obj.url,
                            'content': page_data['content'],
                            'metadata': metadata
                        })
                        self.pages_crawled += 1
                        logger.info(f"Crawled page {self.pages_crawled}/{max_pages}: {url_obj.url}")

                if self.pages_crawled >= self.max_pages:
                    self.crawler.running = False  # Signal crawler to stop
                    self.crawl_complete.set()

            self.crawler._process_url = wrapped_process_url

            def run_crawler():
                try:
                    # Skip signal handler registration
                    self.crawler.running = True
                    with ThreadPoolExecutor(max_workers=1) as executor:
                        try:
                            futures = [executor.submit(self.crawler._crawl_worker)]
                            for future in futures:
                                future.result()
                        except Exception as e:
                            logger.error(f"Error in crawler worker: {e}")
                        finally:
                            self.crawler.running = False
                            self.crawl_complete.set()
                except Exception as e:
                    logger.error(f"Error in run_crawler: {e}")
                    self.crawl_complete.set()

            # Start crawler in a thread
            crawler_thread = threading.Thread(target=run_crawler)
            crawler_thread.daemon = True
            crawler_thread.start()

            # Wait for completion or timeout with progress updates
            timeout = 300  # 5 minutes
            start_time = time.time()
            last_progress = 0
            while not self.crawl_complete.is_set() and time.time() - start_time < timeout:
                current_progress = min(0.8, self.pages_crawled / max_pages)
                if current_progress != last_progress:
                    progress(current_progress, f"Crawled {self.pages_crawled}/{max_pages} pages")
                    last_progress = current_progress
                time.sleep(0.1)  # More frequent updates

            if time.time() - start_time >= timeout:
                logger.warning("Crawler timed out")
                self.crawler.running = False

            # Wait for thread to finish
            crawler_thread.join(timeout=10)

            # Restore original method
            self.crawler._process_url = original_process_url

            # Collect all logs
            logs = []
            while not self.log_queue.empty():
                logs.append(self.log_queue.get_nowait())
            log_output = "\n".join(logs)

            if not self.crawled_pages:
                self._cleanup_session_storage()
                return "No pages were successfully crawled.", [], log_output

            logger.info("Starting OpenAI analysis...")
            progress(0.9, "Analyzing crawled pages with OpenAI...")

            # Analyze crawled pages with OpenAI
            overall_analysis = self._get_overall_analysis(self.crawled_pages)
            progress(0.95, "Generating page-specific analyses...")
            page_analyses = self._get_page_analyses(self.crawled_pages)

            logger.info("Analysis complete")
            progress(1.0, "Analysis complete")

            # Format the results
            formatted_analysis = f"""
# SEO Analysis Report for {domain}

## Overall Analysis
{overall_analysis}

## Page-Specific Analyses
"""
            for page_analysis in page_analyses:
                formatted_analysis += f"""
### {page_analysis['url']}
{page_analysis['analysis']}
"""

            # Clean up all resources
            logger.info("Cleaning up resources...")
            if IS_DEPLOYMENT:
                shutil.rmtree(temp_dir, ignore_errors=True)
|
452 |
+
self.storage.reset()
|
453 |
+
else:
|
454 |
+
self._cleanup_session_storage()
|
455 |
+
self._reset_storage()
|
456 |
+
logger.info("All resources cleaned up")
|
457 |
+
|
458 |
+
return formatted_analysis, page_analyses, log_output
|
459 |
+
|
460 |
+
except Exception as e:
|
461 |
+
logger.error(f"Error analyzing website: {e}")
|
462 |
+
# Clean up all resources even on error
|
463 |
+
if IS_DEPLOYMENT:
|
464 |
+
shutil.rmtree(temp_dir, ignore_errors=True)
|
465 |
+
self.storage.reset()
|
466 |
+
else:
|
467 |
+
self._cleanup_session_storage()
|
468 |
+
self._reset_storage()
|
469 |
+
# Collect all logs
|
470 |
+
logs = []
|
471 |
+
while not self.log_queue.empty():
|
472 |
+
logs.append(self.log_queue.get_nowait())
|
473 |
+
log_output = "\n".join(logs)
|
474 |
+
return f"Error analyzing website: {str(e)}", [], log_output
|
475 |
+
|
476 |
+
def _extract_domain(self, url: str) -> str:
|
477 |
+
"""Extract domain from URL"""
|
478 |
+
extracted = tldextract.extract(url)
|
479 |
+
return f"{extracted.domain}.{extracted.suffix}"
|
480 |
+
|
481 |
+
def _get_overall_analysis(self, pages: List[Dict]) -> str:
|
482 |
+
"""Get overall SEO analysis using OpenAI"""
|
483 |
+
try:
|
484 |
+
# Prepare site overview for analysis
|
485 |
+
site_overview = {
|
486 |
+
'num_pages': len(pages),
|
487 |
+
'pages': [{
|
488 |
+
'url': page['url'],
|
489 |
+
'metadata': page['metadata']
|
490 |
+
} for page in pages]
|
491 |
+
}
|
492 |
+
|
493 |
+
# Create analysis prompt
|
494 |
+
prompt = f"""
|
495 |
+
You are an expert SEO consultant. Analyze this website's SEO based on the crawled data:
|
496 |
+
|
497 |
+
{json.dumps(site_overview, indent=2)}
|
498 |
+
|
499 |
+
Provide a comprehensive SEO analysis including:
|
500 |
+
1. Overall site structure and navigation
|
501 |
+
2. Common SEO issues across pages
|
502 |
+
3. Content quality and optimization
|
503 |
+
4. Technical SEO recommendations
|
504 |
+
5. Priority improvements
|
505 |
+
|
506 |
+
Format your response in Markdown.
|
507 |
+
"""
|
508 |
+
|
509 |
+
# Get analysis from OpenAI
|
510 |
+
response = self.client.chat.completions.create(
|
511 |
+
model="gpt-4o-mini",
|
512 |
+
messages=[
|
513 |
+
{"role": "system", "content": "You are an expert SEO consultant providing detailed website analysis."},
|
514 |
+
{"role": "user", "content": prompt}
|
515 |
+
],
|
516 |
+
temperature=0.7,
|
517 |
+
max_tokens=2000
|
518 |
+
)
|
519 |
+
|
520 |
+
return response.choices[0].message.content
|
521 |
+
|
522 |
+
except Exception as e:
|
523 |
+
logger.error(f"Error getting overall analysis: {e}")
|
524 |
+
return f"Error generating overall analysis: {str(e)}"
|
525 |
+
|
526 |
+
def _get_page_analyses(self, pages: List[Dict]) -> List[Dict]:
|
527 |
+
"""Get page-specific SEO analyses using OpenAI"""
|
528 |
+
page_analyses = []
|
529 |
+
|
530 |
+
for page in pages:
|
531 |
+
try:
|
532 |
+
# Create page analysis prompt
|
533 |
+
prompt = f"""
|
534 |
+
Analyze this page's SEO:
|
535 |
+
|
536 |
+
URL: {page['url']}
|
537 |
+
Metadata: {json.dumps(page['metadata'], indent=2)}
|
538 |
+
|
539 |
+
Provide specific recommendations for:
|
540 |
+
1. Title and meta description
|
541 |
+
2. Heading structure
|
542 |
+
3. Content optimization
|
543 |
+
4. Internal linking
|
544 |
+
5. Technical improvements
|
545 |
+
|
546 |
+
Format your response in Markdown.
|
547 |
+
"""
|
548 |
+
|
549 |
+
# Get analysis from OpenAI
|
550 |
+
response = self.client.chat.completions.create(
|
551 |
+
model="gpt-4o-mini",
|
552 |
+
messages=[
|
553 |
+
{"role": "system", "content": "You are an expert SEO consultant providing detailed page analysis."},
|
554 |
+
{"role": "user", "content": prompt}
|
555 |
+
],
|
556 |
+
temperature=0.7,
|
557 |
+
max_tokens=1000
|
558 |
+
)
|
559 |
+
|
560 |
+
page_analyses.append({
|
561 |
+
'url': page['url'],
|
562 |
+
'analysis': response.choices[0].message.content
|
563 |
+
})
|
564 |
+
|
565 |
+
# Sleep to respect rate limits
|
566 |
+
time.sleep(1)
|
567 |
+
|
568 |
+
except Exception as e:
|
569 |
+
logger.error(f"Error analyzing page {page['url']}: {e}")
|
570 |
+
page_analyses.append({
|
571 |
+
'url': page['url'],
|
572 |
+
'analysis': f"Error analyzing page: {str(e)}"
|
573 |
+
})
|
574 |
+
|
575 |
+
return page_analyses
|
576 |
+
|
577 |
+
def create_ui() -> gr.Interface:
|
578 |
+
"""Create Gradio interface"""
|
579 |
+
|
580 |
+
def analyze(url: str, api_key: str, max_pages: int, progress: gr.Progress = gr.Progress()) -> Tuple[str, str]:
|
581 |
+
"""Gradio interface function"""
|
582 |
+
try:
|
583 |
+
# Initialize analyzer
|
584 |
+
analyzer = SEOAnalyzer(api_key)
|
585 |
+
|
586 |
+
# Run analysis with progress updates
|
587 |
+
analysis, _, logs = analyzer.analyze_website(url, max_pages, progress)
|
588 |
+
|
589 |
+
# Collect all logs
|
590 |
+
log_output = ""
|
591 |
+
while not analyzer.log_queue.empty():
|
592 |
+
try:
|
593 |
+
log_output += analyzer.log_queue.get_nowait() + "\n"
|
594 |
+
except queue.Empty:
|
595 |
+
break
|
596 |
+
|
597 |
+
# Set progress to complete
|
598 |
+
progress(1.0, "Analysis complete")
|
599 |
+
|
600 |
+
# Return results
|
601 |
+
return analysis, log_output
|
602 |
+
|
603 |
+
except Exception as e:
|
604 |
+
error_msg = f"Error: {str(e)}"
|
605 |
+
logger.error(error_msg)
|
606 |
+
return error_msg, error_msg
|
607 |
+
|
608 |
+
# Create markdown content for the about section
|
609 |
+
about_markdown = """
|
610 |
+
# 🔍 SEO Analyzer Pro
|
611 |
+
|
612 |
+
Analyze your website's SEO performance using advanced crawling and AI technology.
|
613 |
+
|
614 |
+
### Features:
|
615 |
+
- 🕷️ Intelligent Web Crawling
|
616 |
+
- 🧠 AI-Powered Analysis
|
617 |
+
- 📊 Comprehensive Reports
|
618 |
+
- 🚀 Performance Insights
|
619 |
+
|
620 |
+
### How to Use:
|
621 |
+
1. Enter your website URL
|
622 |
+
2. Provide your OpenAI API key
|
623 |
+
3. Choose how many pages to analyze
|
624 |
+
4. Click Analyze and watch the magic happen!
|
625 |
+
|
626 |
+
### What You'll Get:
|
627 |
+
- Detailed SEO analysis
|
628 |
+
- Content quality assessment
|
629 |
+
- Technical recommendations
|
630 |
+
- Performance insights
|
631 |
+
- Actionable improvements
|
632 |
+
"""
|
633 |
+
|
634 |
+
# Create the interface with custom styling
|
635 |
+
with gr.Blocks(css=CUSTOM_CSS) as iface:
|
636 |
+
gr.Markdown(about_markdown)
|
637 |
+
|
638 |
+
with gr.Row():
|
639 |
+
with gr.Column(scale=2):
|
640 |
+
with gr.Group(elem_classes="input-section"):
|
641 |
+
gr.Markdown("### 📝 Enter Website Details")
|
642 |
+
url_input = gr.Textbox(
|
643 |
+
label="Website URL",
|
644 |
+
placeholder="https://example.com",
|
645 |
+
elem_classes="input-container",
|
646 |
+
info="Enter the full URL of the website you want to analyze (e.g., https://example.com)"
|
647 |
+
)
|
648 |
+
api_key = gr.Textbox(
|
649 |
+
label="OpenAI API Key",
|
650 |
+
placeholder="sk-...",
|
651 |
+
type="password",
|
652 |
+
elem_classes="input-container",
|
653 |
+
info="Your OpenAI API key is required for AI-powered analysis. Keep this secure!"
|
654 |
+
)
|
655 |
+
max_pages = gr.Slider(
|
656 |
+
minimum=1,
|
657 |
+
maximum=50,
|
658 |
+
value=10,
|
659 |
+
step=1,
|
660 |
+
label="Maximum Pages to Crawl",
|
661 |
+
elem_classes="slider-container",
|
662 |
+
info="Choose how many pages to analyze. More pages = more comprehensive analysis but takes longer"
|
663 |
+
)
|
664 |
+
analyze_btn = gr.Button(
|
665 |
+
"🔍 Analyze Website",
|
666 |
+
elem_classes="primary-button"
|
667 |
+
)
|
668 |
+
|
669 |
+
with gr.Row():
|
670 |
+
with gr.Column():
|
671 |
+
with gr.Group(elem_classes="analysis-section"):
|
672 |
+
gr.Markdown("### 📊 Analysis Results")
|
673 |
+
analysis_output = gr.Markdown(
|
674 |
+
label="SEO Analysis",
|
675 |
+
elem_classes="markdown-output"
|
676 |
+
)
|
677 |
+
|
678 |
+
with gr.Row():
|
679 |
+
with gr.Column():
|
680 |
+
with gr.Group(elem_classes="log-section"):
|
681 |
+
gr.Markdown("### 📋 Process Logs")
|
682 |
+
logs_output = gr.Textbox(
|
683 |
+
label="Logs",
|
684 |
+
lines=10,
|
685 |
+
elem_classes="log-output"
|
686 |
+
)
|
687 |
+
|
688 |
+
# Connect the button click to the analyze function
|
689 |
+
analyze_btn.click(
|
690 |
+
fn=analyze,
|
691 |
+
inputs=[url_input, api_key, max_pages],
|
692 |
+
outputs=[analysis_output, logs_output],
|
693 |
+
)
|
694 |
+
|
695 |
+
return iface
|
696 |
+
|
697 |
+
if __name__ == "__main__":
|
698 |
+
# Create base storage directory if it doesn't exist
|
699 |
+
os.makedirs(config.STORAGE_PATH, exist_ok=True)
|
700 |
+
|
701 |
+
# Create and launch UI
|
702 |
+
ui = create_ui()
|
703 |
+
ui.launch(
|
704 |
+
share=False,
|
705 |
+
server_name="0.0.0.0",
|
706 |
+
show_api=False,
|
707 |
+
show_error=True,
|
708 |
+
)
|
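A minimal sketch of driving the analyzer above without the Gradio front end, for scripting or debugging. The OPENAI_API_KEY placeholder and the no-op progress callback are assumptions; SEOAnalyzer and analyze_website are used as they appear in seo_analyzer_ui.py, so treat this as illustrative rather than an official entry point.

    # Hypothetical headless driver for the SEOAnalyzer defined above
    from seo_analyzer_ui import SEOAnalyzer

    def no_progress(value, desc=""):
        # Stand-in for gr.Progress(); simply discards progress updates
        pass

    analyzer = SEOAnalyzer(OPENAI_API_KEY)  # OPENAI_API_KEY: assumed placeholder for a real key string
    report, page_reports, logs = analyzer.analyze_website("https://example.com", 3, no_progress)
    print(report)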
storage.py
ADDED
@@ -0,0 +1,888 @@
"""
Storage component for the web crawler.

Handles storing and retrieving crawled web pages using:
1. MongoDB for metadata, URL information, and crawl stats
2. Disk-based storage for HTML content
3. Optional Amazon S3 integration for scalable storage
"""

import os
import logging
import time
import datetime
import hashlib
import json
import gzip
import shutil
from typing import Dict, List, Optional, Union, Any, Tuple
from urllib.parse import urlparse
import pymongo
from pymongo import MongoClient, UpdateOne
from pymongo.errors import PyMongoError, BulkWriteError
import boto3
from botocore.exceptions import ClientError

from models import Page, URL
import config

# Configure logging
logging.basicConfig(
    level=getattr(logging, config.LOG_LEVEL),
    format=config.LOG_FORMAT
)
logger = logging.getLogger(__name__)


class StorageManager:
    """
    Storage manager for web crawler data

    Handles:
    - MongoDB for metadata, URL information, and stats
    - Disk-based storage for HTML content
    - Optional Amazon S3 integration
    """

    def __init__(self,
                 mongo_uri: Optional[str] = None,
                 use_s3: bool = False,
                 compress_html: bool = True,
                 max_disk_usage_gb: float = 100.0):
        """
        Initialize the storage manager

        Args:
            mongo_uri: MongoDB connection URI
            use_s3: Whether to use Amazon S3 for HTML storage
            compress_html: Whether to compress HTML content
            max_disk_usage_gb: Maximum disk space to use in GB
        """
        self.mongo_uri = mongo_uri or config.MONGODB_URI
        self.use_s3 = use_s3
        self.compress_html = compress_html
        self.max_disk_usage_gb = max_disk_usage_gb

        # Connect to MongoDB
        self.mongo_client = MongoClient(self.mongo_uri)
        self.db = self.mongo_client[config.MONGODB_DB]

        # MongoDB collections
        self.pages_collection = self.db['pages']
        self.urls_collection = self.db['urls']
        self.stats_collection = self.db['stats']

        # Create necessary indexes
        self._create_indexes()

        # S3 client (if enabled)
        self.s3_client = None
        if self.use_s3:
            self._init_s3_client()

        # Ensure storage directories exist
        self._ensure_directories()

        # Bulk operation buffers
        self.page_buffer = []
        self.url_buffer = []
        self.max_buffer_size = 100

        # Statistics
        self.stats = {
            'pages_stored': 0,
            'pages_retrieved': 0,
            'urls_stored': 0,
            'urls_retrieved': 0,
            'disk_space_used': 0,
            's3_objects_stored': 0,
            'mongodb_size': 0,
            'storage_errors': 0,
            'start_time': time.time()
        }

    def _create_indexes(self) -> None:
        """Create necessary indexes in MongoDB collections"""
        try:
            # Pages collection indexes
            self.pages_collection.create_index('url', unique=True)
            self.pages_collection.create_index('content_hash')
            self.pages_collection.create_index('crawled_at')
            self.pages_collection.create_index('domain')

            # URLs collection indexes
            self.urls_collection.create_index('url', unique=True)
            self.urls_collection.create_index('normalized_url')
            self.urls_collection.create_index('domain')
            self.urls_collection.create_index('status')
            self.urls_collection.create_index('priority')
            self.urls_collection.create_index('last_crawled')

            logger.info("MongoDB indexes created")
        except PyMongoError as e:
            logger.error(f"Error creating MongoDB indexes: {e}")
            self.stats['storage_errors'] += 1

    def _init_s3_client(self) -> None:
        """Initialize AWS S3 client"""
        try:
            self.s3_client = boto3.client(
                's3',
                aws_access_key_id=config.AWS_ACCESS_KEY,
                aws_secret_access_key=config.AWS_SECRET_KEY,
                region_name=config.AWS_REGION
            )
            logger.info("S3 client initialized")

            # Create bucket if it doesn't exist
            self._ensure_s3_bucket()
        except Exception as e:
            logger.error(f"Error initializing S3 client: {e}")
            self.use_s3 = False
            self.stats['storage_errors'] += 1

    def _ensure_s3_bucket(self) -> None:
        """Create S3 bucket if it doesn't exist"""
        if not self.s3_client:
            return

        try:
            # Check if bucket exists
            self.s3_client.head_bucket(Bucket=config.S3_BUCKET)
            logger.info(f"S3 bucket '{config.S3_BUCKET}' exists")
        except ClientError as e:
            error_code = e.response.get('Error', {}).get('Code')

            if error_code == '404':
                # Bucket doesn't exist, create it
                try:
                    self.s3_client.create_bucket(
                        Bucket=config.S3_BUCKET,
                        CreateBucketConfiguration={
                            'LocationConstraint': config.AWS_REGION
                        }
                    )
                    logger.info(f"Created S3 bucket '{config.S3_BUCKET}'")
                except ClientError as ce:
                    logger.error(f"Error creating S3 bucket: {ce}")
                    self.use_s3 = False
                    self.stats['storage_errors'] += 1
            else:
                logger.error(f"Error checking S3 bucket: {e}")
                self.use_s3 = False
                self.stats['storage_errors'] += 1

    def _ensure_directories(self) -> None:
        """Ensure storage directories exist"""
        # Create main storage directory
        os.makedirs(config.STORAGE_PATH, exist_ok=True)

        # Create HTML storage directory
        os.makedirs(config.HTML_STORAGE_PATH, exist_ok=True)

        # Create log directory
        os.makedirs(config.LOG_PATH, exist_ok=True)

        logger.info("Storage directories created")

    def store_page(self, page: Page, flush: bool = False) -> bool:
        """
        Store a crawled page

        Args:
            page: Page object to store
            flush: Whether to flush page buffer immediately

        Returns:
            True if successful, False otherwise
        """
        try:
            # Store page content based on configuration
            if self.use_s3:
                content_stored = self._store_content_s3(page)
            else:
                content_stored = self._store_content_disk(page)

            if not content_stored:
                logger.warning(f"Failed to store content for {page.url}")
                self.stats['storage_errors'] += 1
                return False

            # Remove HTML content from page object for MongoDB storage
            page_dict = page.dict(exclude={'content'})

            # Convert datetime fields to proper format
            if page.crawled_at:
                page_dict['crawled_at'] = page.crawled_at

            # Add to buffer
            self.page_buffer.append(
                UpdateOne(
                    {'url': page.url},
                    {'$set': page_dict},
                    upsert=True
                )
            )

            # Update statistics
            self.stats['pages_stored'] += 1

            # Check if buffer should be flushed
            if flush or len(self.page_buffer) >= self.max_buffer_size:
                return self.flush_page_buffer()

            return True
        except Exception as e:
            logger.error(f"Error storing page {page.url}: {e}")
            self.stats['storage_errors'] += 1
            return False

    def _store_content_disk(self, page: Page) -> bool:
        """
        Store page content on disk

        Args:
            page: Page to store

        Returns:
            True if successful, False otherwise
        """
        try:
            # Check disk space
            if not self._check_disk_space():
                logger.warning("Disk space limit exceeded")
                return False

            # Create directory for domain if it doesn't exist
            domain = self._extract_domain(page.url)
            domain_dir = os.path.join(config.HTML_STORAGE_PATH, domain)
            os.makedirs(domain_dir, exist_ok=True)

            # Create filename
            filename = self._url_to_filename(page.url)

            # Full path for the file
            if self.compress_html:
                filepath = os.path.join(domain_dir, f"{filename}.gz")

                # Compress and write HTML to file
                with gzip.open(filepath, 'wt', encoding='utf-8') as f:
                    f.write(page.content)
            else:
                filepath = os.path.join(domain_dir, f"{filename}.html")

                # Write HTML to file
                with open(filepath, 'w', encoding='utf-8') as f:
                    f.write(page.content)

            # Update disk space used
            file_size = os.path.getsize(filepath)
            self.stats['disk_space_used'] += file_size

            logger.debug(f"Stored HTML content for {page.url} at {filepath}")
            return True
        except Exception as e:
            logger.error(f"Error storing content on disk for {page.url}: {e}")
            self.stats['storage_errors'] += 1
            return False

    def _store_content_s3(self, page: Page) -> bool:
        """
        Store page content in S3

        Args:
            page: Page to store

        Returns:
            True if successful, False otherwise
        """
        if not self.s3_client:
            logger.warning("S3 client not initialized, falling back to disk storage")
            return self._store_content_disk(page)

        try:
            # Create key for S3 object
            domain = self._extract_domain(page.url)
            filename = self._url_to_filename(page.url)

            # S3 key
            s3_key = f"{domain}/{filename}"
            if self.compress_html:
                s3_key += ".gz"

                # Compress content
                content_bytes = gzip.compress(page.content.encode('utf-8'))
                content_type = 'application/gzip'
            else:
                s3_key += ".html"
                content_bytes = page.content.encode('utf-8')
                content_type = 'text/html'

            # Upload to S3
            self.s3_client.put_object(
                Bucket=config.S3_BUCKET,
                Key=s3_key,
                Body=content_bytes,
                ContentType=content_type,
                Metadata={
                    'url': page.url,
                    'crawled_at': page.crawled_at.isoformat() if page.crawled_at else '',
                    'content_hash': page.content_hash or ''
                }
            )

            # Update statistics
            self.stats['s3_objects_stored'] += 1

            logger.debug(f"Stored HTML content for {page.url} in S3 at {s3_key}")
            return True
        except Exception as e:
            logger.error(f"Error storing content in S3 for {page.url}: {e}")
            self.stats['storage_errors'] += 1

            # Fall back to disk storage
            logger.info(f"Falling back to disk storage for {page.url}")
            return self._store_content_disk(page)

    def store_url(self, url_obj: URL, flush: bool = False) -> bool:
        """
        Store URL information

        Args:
            url_obj: URL object to store
            flush: Whether to flush URL buffer immediately

        Returns:
            True if successful, False otherwise
        """
        try:
            # Convert URL object to dict
            url_dict = url_obj.dict()

            # Add to buffer
            self.url_buffer.append(
                UpdateOne(
                    {'url': url_obj.url},
                    {'$set': url_dict},
                    upsert=True
                )
            )

            # Update statistics
            self.stats['urls_stored'] += 1

            # Check if buffer should be flushed
            if flush or len(self.url_buffer) >= self.max_buffer_size:
                return self.flush_url_buffer()

            return True
        except Exception as e:
            logger.error(f"Error storing URL {url_obj.url}: {e}")
            self.stats['storage_errors'] += 1
            return False

    def flush_page_buffer(self) -> bool:
        """
        Flush page buffer to MongoDB

        Returns:
            True if successful, False otherwise
        """
        if not self.page_buffer:
            return True

        try:
            # Execute bulk operation
            result = self.pages_collection.bulk_write(self.page_buffer, ordered=False)

            # Clear buffer
            buffer_size = len(self.page_buffer)
            self.page_buffer = []

            logger.debug(f"Flushed {buffer_size} pages to MongoDB")
            return True
        except BulkWriteError as e:
            logger.error(f"Error in bulk write for pages: {e.details}")
            self.stats['storage_errors'] += 1

            # Clear buffer
            self.page_buffer = []
            return False
        except Exception as e:
            logger.error(f"Error flushing page buffer: {e}")
            self.stats['storage_errors'] += 1

            # Clear buffer
            self.page_buffer = []
            return False

    def flush_url_buffer(self) -> bool:
        """
        Flush URL buffer to MongoDB

        Returns:
            True if successful, False otherwise
        """
        if not self.url_buffer:
            return True

        try:
            # Execute bulk operation
            result = self.urls_collection.bulk_write(self.url_buffer, ordered=False)

            # Clear buffer
            buffer_size = len(self.url_buffer)
            self.url_buffer = []

            logger.debug(f"Flushed {buffer_size} URLs to MongoDB")
            return True
        except BulkWriteError as e:
            logger.error(f"Error in bulk write for URLs: {e.details}")
            self.stats['storage_errors'] += 1

            # Clear buffer
            self.url_buffer = []
            return False
        except Exception as e:
            logger.error(f"Error flushing URL buffer: {e}")
            self.stats['storage_errors'] += 1

            # Clear buffer
            self.url_buffer = []
            return False

    def get_page(self, url: str) -> Optional[Page]:
        """
        Retrieve a page by URL

        Args:
            url: URL of the page to retrieve

        Returns:
            Page object if found, None otherwise
        """
        try:
            # Get page metadata from MongoDB
            page_doc = self.pages_collection.find_one({'url': url})

            if not page_doc:
                return None

            # Create Page object from document
            page = Page(**page_doc)

            # Load content based on configuration
            if self.use_s3:
                content = self._load_content_s3(url)
            else:
                content = self._load_content_disk(url)

            if content:
                page.content = content

            # Update statistics
            self.stats['pages_retrieved'] += 1

            return page
        except Exception as e:
            logger.error(f"Error retrieving page {url}: {e}")
            self.stats['storage_errors'] += 1
            return None

    def _load_content_disk(self, url: str) -> Optional[str]:
        """
        Load page content from disk

        Args:
            url: URL of the page

        Returns:
            Page content if found, None otherwise
        """
        try:
            # Get domain and filename
            domain = self._extract_domain(url)
            filename = self._url_to_filename(url)

            # Check for compressed file first
            compressed_path = os.path.join(config.HTML_STORAGE_PATH, domain, f"{filename}.gz")
            uncompressed_path = os.path.join(config.HTML_STORAGE_PATH, domain, f"{filename}.html")

            if os.path.exists(compressed_path):
                # Load compressed content
                with gzip.open(compressed_path, 'rt', encoding='utf-8') as f:
                    return f.read()
            elif os.path.exists(uncompressed_path):
                # Load uncompressed content
                with open(uncompressed_path, 'r', encoding='utf-8') as f:
                    return f.read()
            else:
                logger.warning(f"Content file not found for {url}")
                return None
        except Exception as e:
            logger.error(f"Error loading content from disk for {url}: {e}")
            self.stats['storage_errors'] += 1
            return None

    def _load_content_s3(self, url: str) -> Optional[str]:
        """
        Load page content from S3

        Args:
            url: URL of the page

        Returns:
            Page content if found, None otherwise
        """
        if not self.s3_client:
            logger.warning("S3 client not initialized, falling back to disk loading")
            return self._load_content_disk(url)

        try:
            # Get domain and filename
            domain = self._extract_domain(url)
            filename = self._url_to_filename(url)

            # Try both compressed and uncompressed keys
            s3_key_compressed = f"{domain}/{filename}.gz"
            s3_key_uncompressed = f"{domain}/{filename}.html"

            try:
                # Try compressed file first
                response = self.s3_client.get_object(
                    Bucket=config.S3_BUCKET,
                    Key=s3_key_compressed
                )

                # Decompress content
                content_bytes = response['Body'].read()
                return gzip.decompress(content_bytes).decode('utf-8')
            except ClientError as e:
                if e.response['Error']['Code'] == 'NoSuchKey':
                    # Try uncompressed file
                    try:
                        response = self.s3_client.get_object(
                            Bucket=config.S3_BUCKET,
                            Key=s3_key_uncompressed
                        )
                        content_bytes = response['Body'].read()
                        return content_bytes.decode('utf-8')
                    except ClientError as e2:
                        if e2.response['Error']['Code'] == 'NoSuchKey':
                            logger.warning(f"Content not found in S3 for {url}")

                            # Try loading from disk as fallback
                            return self._load_content_disk(url)
                        else:
                            raise e2
                else:
                    raise e
        except Exception as e:
            logger.error(f"Error loading content from S3 for {url}: {e}")
            self.stats['storage_errors'] += 1

            # Try loading from disk as fallback
            return self._load_content_disk(url)

    def get_url(self, url: str) -> Optional[URL]:
        """
        Retrieve URL information by URL

        Args:
            url: URL to retrieve

        Returns:
            URL object if found, None otherwise
        """
        try:
            # Get URL information from MongoDB
            url_doc = self.urls_collection.find_one({'url': url})

            if not url_doc:
                return None

            # Create URL object from document
            url_obj = URL(**url_doc)

            # Update statistics
            self.stats['urls_retrieved'] += 1

            return url_obj
        except Exception as e:
            logger.error(f"Error retrieving URL {url}: {e}")
            self.stats['storage_errors'] += 1
            return None

    def get_urls_by_status(self, status: str, limit: int = 100) -> List[URL]:
        """
        Retrieve URLs by status

        Args:
            status: Status of URLs to retrieve
            limit: Maximum number of URLs to retrieve

        Returns:
            List of URL objects
        """
        try:
            # Get URLs from MongoDB
            url_docs = list(self.urls_collection.find({'status': status}).limit(limit))

            # Create URL objects from documents
            url_objs = [URL(**doc) for doc in url_docs]

            # Update statistics
            self.stats['urls_retrieved'] += len(url_objs)

            return url_objs
        except Exception as e:
            logger.error(f"Error retrieving URLs by status {status}: {e}")
            self.stats['storage_errors'] += 1
            return []

    def get_urls_by_domain(self, domain: str, limit: int = 100) -> List[URL]:
        """
        Retrieve URLs by domain

        Args:
            domain: Domain of URLs to retrieve
            limit: Maximum number of URLs to retrieve

        Returns:
            List of URL objects
        """
        try:
            # Get URLs from MongoDB
            url_docs = list(self.urls_collection.find({'domain': domain}).limit(limit))

            # Create URL objects from documents
            url_objs = [URL(**doc) for doc in url_docs]

            # Update statistics
            self.stats['urls_retrieved'] += len(url_objs)

            return url_objs
        except Exception as e:
            logger.error(f"Error retrieving URLs by domain {domain}: {e}")
            self.stats['storage_errors'] += 1
            return []

    def store_stats(self, stats: Dict[str, Any]) -> bool:
        """
        Store crawler statistics

        Args:
            stats: Statistics to store

        Returns:
            True if successful, False otherwise
        """
        try:
            # Create statistics document
            stats_doc = stats.copy()
            stats_doc['timestamp'] = datetime.datetime.now()

            # Convert sets to lists for MongoDB
            for key, value in stats_doc.items():
                if isinstance(value, set):
                    stats_doc[key] = list(value)

            # Store in MongoDB
            self.stats_collection.insert_one(stats_doc)

            return True
        except Exception as e:
            logger.error(f"Error storing statistics: {e}")
            self.stats['storage_errors'] += 1
            return False

    def _check_disk_space(self) -> bool:
        """
        Check if disk space limit is exceeded

        Returns:
            True if space is available, False otherwise
        """
        # Convert max disk usage to bytes
        max_bytes = self.max_disk_usage_gb * 1024 * 1024 * 1024

        # Check if limit is exceeded
        return self.stats['disk_space_used'] < max_bytes

    def _extract_domain(self, url: str) -> str:
        """Extract domain from URL"""
        parsed = urlparse(url)
        return parsed.netloc.replace(':', '_')

    def _url_to_filename(self, url: str) -> str:
        """Convert URL to filename"""
        # Hash the URL to create a safe filename
        return hashlib.md5(url.encode('utf-8')).hexdigest()

    def clean_old_pages(self, days: int = 90) -> int:
        """
        Remove pages older than a specified number of days

        Args:
            days: Number of days after which pages are considered old

        Returns:
            Number of pages removed
        """
        try:
            # Calculate cutoff date
            cutoff_date = datetime.datetime.now() - datetime.timedelta(days=days)

            # Find old pages
            old_pages = list(self.pages_collection.find({
                'crawled_at': {'$lt': cutoff_date}
            }, {'url': 1}))

            if not old_pages:
                logger.info(f"No pages older than {days} days found")
                return 0

            # Remove from database
            delete_result = self.pages_collection.delete_many({
                'crawled_at': {'$lt': cutoff_date}
            })

            # Remove content files
            count = 0
            for page in old_pages:
                url = page['url']
                domain = self._extract_domain(url)
                filename = self._url_to_filename(url)

                # Check disk
                compressed_path = os.path.join(config.HTML_STORAGE_PATH, domain, f"{filename}.gz")
                uncompressed_path = os.path.join(config.HTML_STORAGE_PATH, domain, f"{filename}.html")

                if os.path.exists(compressed_path):
                    os.remove(compressed_path)
                    count += 1

                if os.path.exists(uncompressed_path):
                    os.remove(uncompressed_path)
                    count += 1

                # Check S3
                if self.s3_client:
                    s3_key_compressed = f"{domain}/{filename}.gz"
                    s3_key_uncompressed = f"{domain}/{filename}.html"

                    try:
                        self.s3_client.delete_object(
                            Bucket=config.S3_BUCKET,
                            Key=s3_key_compressed
                        )
                        count += 1
                    except Exception:
                        pass

                    try:
                        self.s3_client.delete_object(
                            Bucket=config.S3_BUCKET,
                            Key=s3_key_uncompressed
                        )
                        count += 1
                    except Exception:
                        pass

            logger.info(f"Removed {delete_result.deleted_count} old pages and {count} content files")
            return delete_result.deleted_count
        except Exception as e:
            logger.error(f"Error cleaning old pages: {e}")
            self.stats['storage_errors'] += 1
            return 0

    def clean_failed_urls(self, retries: int = 3) -> int:
        """
        Remove URLs that have failed repeatedly

        Args:
            retries: Number of retries after which a URL is considered permanently failed

        Returns:
            Number of URLs removed
        """
        try:
            # Delete failed URLs with too many retries
            delete_result = self.urls_collection.delete_many({
                'status': 'FAILED',
                'retries': {'$gte': retries}
            })

            logger.info(f"Removed {delete_result.deleted_count} permanently failed URLs")
            return delete_result.deleted_count
        except Exception as e:
            logger.error(f"Error cleaning failed URLs: {e}")
            self.stats['storage_errors'] += 1
            return 0

    def calculate_storage_stats(self) -> Dict[str, Any]:
        """
        Calculate storage statistics

        Returns:
            Dictionary of storage statistics
        """
        stats = {
            'timestamp': datetime.datetime.now(),
            'pages_count': 0,
            'urls_count': 0,
            'disk_space_used_mb': 0,
            's3_objects_count': 0,
            'mongodb_size_mb': 0,
        }

        try:
            # Count pages and URLs
            stats['pages_count'] = self.pages_collection.count_documents({})
            stats['urls_count'] = self.urls_collection.count_documents({})

            # Calculate disk space used
            total_size = 0
            for root, _, files in os.walk(config.HTML_STORAGE_PATH):
                total_size += sum(os.path.getsize(os.path.join(root, name)) for name in files)
            stats['disk_space_used_mb'] = total_size / (1024 * 1024)

            # Calculate MongoDB size
            db_stats = self.db.command('dbStats')
            stats['mongodb_size_mb'] = db_stats['storageSize'] / (1024 * 1024)

            # Count S3 objects if enabled
            if self.s3_client:
                try:
                    s3_objects = 0
                    paginator = self.s3_client.get_paginator('list_objects_v2')
                    for page in paginator.paginate(Bucket=config.S3_BUCKET):
                        if 'Contents' in page:
                            s3_objects += len(page['Contents'])
                    stats['s3_objects_count'] = s3_objects
                except Exception as e:
                    logger.error(f"Error counting S3 objects: {e}")

            # Update internal statistics
            self.stats['disk_space_used'] = total_size
            self.stats['mongodb_size'] = db_stats['storageSize']

            return stats
        except Exception as e:
            logger.error(f"Error calculating storage statistics: {e}")
            self.stats['storage_errors'] += 1
            return stats

    def close(self) -> None:
        """Close connections and perform cleanup"""
        # Flush any pending buffers
        self.flush_page_buffer()
        self.flush_url_buffer()

        # Close MongoDB connection
        if self.mongo_client:
            self.mongo_client.close()
            logger.info("MongoDB connection closed")

        # Log final statistics
        logger.info(f"Storage manager closed. Pages stored: {self.stats['pages_stored']}, URLs stored: {self.stats['urls_stored']}")
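A minimal usage sketch of the StorageManager above, assuming MongoDB is reachable at config.MONGODB_URI. The Page constructor fields shown are assumptions inferred from how the class reads them (url, content, crawled_at); the real schema lives in models.py and may differ.

    # Hypothetical StorageManager round trip (Page field names are assumed)
    import datetime
    from models import Page
    from storage import StorageManager

    storage = StorageManager(use_s3=False, compress_html=True)
    page = Page(
        url="https://example.com/",
        content="<html><body>Hello crawler</body></html>",
        crawled_at=datetime.datetime.now(),   # read by store_page when building the metadata document
    )
    storage.store_page(page, flush=True)      # HTML to gzip file on disk, metadata to MongoDB
    restored = storage.get_page("https://example.com/")
    print(storage.calculate_storage_stats())
    storage.close()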
test_crawler.py
ADDED
@@ -0,0 +1,219 @@
#!/usr/bin/env python3
"""
Test script for the web crawler - tests only the URL frontier and downloader
without requiring MongoDB
"""

import os
import sys
import time
import logging
import threading
from urllib.parse import urlparse
import redis

# Make sure we're in the right directory
script_dir = os.path.dirname(os.path.abspath(__file__))
os.chdir(script_dir)

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s [%(name)s] %(levelname)s: %(message)s',
    handlers=[
        logging.StreamHandler(sys.stdout),
        logging.FileHandler(os.path.join(script_dir, 'test_crawler.log'))
    ]
)
logger = logging.getLogger("test_crawler")

# Import our modules
import config
from frontier import URLFrontier
from models import URL, Priority, URLStatus
from downloader import HTMLDownloader
from parser import HTMLParser
from robots import RobotsHandler
from dns_resolver import DNSResolver

# Import local configuration if available
try:
    import local_config
    # Override config settings with local settings
    for key in dir(local_config):
        if key.isupper():
            setattr(config, key, getattr(local_config, key))
    logger.info("Loaded local configuration")
except ImportError:
    logger.warning("No local_config.py found - using default config")

def test_redis():
    """Test Redis connection"""
    try:
        logger.info(f"Testing Redis connection to {config.REDIS_URI}")
        r = redis.from_url(config.REDIS_URI)
        r.ping()
        logger.info("Redis connection successful")
        return True
    except Exception as e:
        logger.error(f"Redis connection failed: {e}")
        return False

def test_robots_txt():
    """Test robots.txt handling"""
    try:
        logger.info("Testing robots.txt handling")
        robots_handler = RobotsHandler()
        test_urls = [
            "https://www.google.com/",
            "https://www.github.com/",
            "https://sagarnildas.com/",
        ]

        for url in test_urls:
            logger.info(f"Checking robots.txt for {url}")
            allowed, crawl_delay = robots_handler.can_fetch(url)
            logger.info(f"  Allowed: {allowed}, Crawl delay: {crawl_delay}")

        return True
    except Exception as e:
        logger.error(f"Error testing robots.txt: {e}")
        return False

def test_dns_resolver():
    """Test DNS resolver"""
    try:
        logger.info("Testing DNS resolver")
        dns_resolver = DNSResolver()
        test_domains = [
            "www.google.com",
            "www.github.com",
            "example.com",
        ]

        for domain in test_domains:
            logger.info(f"Resolving {domain}")
            ip = dns_resolver.resolve(f"https://{domain}/")
            logger.info(f"  IP: {ip}")

        return True
    except Exception as e:
        logger.error(f"Error testing DNS resolver: {e}")
        return False

def test_url_frontier():
    """Test URL frontier"""
    try:
        logger.info("Testing URL frontier")
        frontier = URLFrontier()

        # Clear frontier
        frontier.clear()

        # Add some URLs
        test_urls = [
            "https://www.google.com/",
            "https://www.github.com/",
            "https://sagarnildas.com/",
        ]

        for i, url in enumerate(test_urls):
            url_obj = URL(
                url=url,
                priority=Priority.MEDIUM,
                status=URLStatus.PENDING,
                depth=0
            )
            added = frontier.add_url(url_obj)
            logger.info(f"Added {url}: {added}")

        # Check size
        size = frontier.size()
        logger.info(f"Frontier size: {size}")

        # Get next URL
        url = frontier.get_next_url()
        if url:
            logger.info(f"Next URL: {url.url} (priority: {url.priority})")
        else:
            logger.info("No URL available")

        return True
    except Exception as e:
        logger.error(f"Error testing URL frontier: {e}")
        return False

def test_downloader():
    """Test HTML downloader"""
    try:
        logger.info("Testing HTML downloader")
        downloader = HTMLDownloader()

        test_urls = [
            URL(url="https://sagarnildas.com/", priority=Priority.MEDIUM, status=URLStatus.PENDING, depth=0),
            URL(url="https://www.google.com/", priority=Priority.MEDIUM, status=URLStatus.PENDING, depth=0),
        ]

        for url_obj in test_urls:
            logger.info(f"Downloading {url_obj.url}")
            page = downloader.download(url_obj)
            if page:
                logger.info(f"  Downloaded {page.content_length} bytes, status: {page.status_code}")
                # Test parsing
                parser = HTMLParser()
                urls, metadata = parser.parse(page)
                logger.info(f"  Extracted {len(urls)} URLs and {len(metadata)} metadata items")
            else:
                logger.info(f"  Download failed: {url_obj.error}")

        return True
    except Exception as e:
        logger.error(f"Error testing HTML downloader: {e}")
        return False

def run_tests():
    """Run all tests"""
    logger.info("Starting crawler component tests")

    tests = [
        ("Redis", test_redis),
        ("Robots.txt", test_robots_txt),
        ("DNS Resolver", test_dns_resolver),
        ("URL Frontier", test_url_frontier),
        ("HTML Downloader", test_downloader),
    ]

    results = []
    for name, test_func in tests:
        logger.info(f"\n=== Testing {name} ===")
        start_time = time.time()
        success = test_func()
        elapsed = time.time() - start_time

        result = {
            "name": name,
            "success": success,
            "time": elapsed
        }
        results.append(result)

        logger.info(f"=== {name} test {'succeeded' if success else 'failed'} in {elapsed:.2f}s ===\n")

    # Print summary
    logger.info("\n=== Test Summary ===")
    all_success = True
    for result in results:
        status = "SUCCESS" if result["success"] else "FAILED"
        logger.info(f"{result['name']}: {status} ({result['time']:.2f}s)")
        if not result["success"]:
            all_success = False

    if all_success:
        logger.info("All tests passed!")
    else:
        logger.warning("Some tests failed. Check logs for details.")

    return all_success

if __name__ == "__main__":
    run_tests()
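The script is meant to be run directly (python test_crawler.py) with Redis reachable at config.REDIS_URI; results go to stdout and test_crawler.log. Extending the suite takes a test_* function returning a bool plus an entry in the tests list inside run_tests(). A small hypothetical example, reusing only names that appear above:

    # Hypothetical extra check: the frontier reports size 0 right after clear()
    def test_frontier_clear():
        """Frontier should be empty immediately after clear()"""
        try:
            frontier = URLFrontier()
            frontier.clear()
            return frontier.size() == 0
        except Exception as e:
            logger.error(f"Error testing frontier clear: {e}")
            return False

    # ...and register it in run_tests():
    #     tests.append(("Frontier Clear", test_frontier_clear))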