Replace Crawl4AI with simple HTTP requests and BeautifulSoup
- Remove Crawl4AI dependency and scraping_service.py
- Implement simple HTTP-based web scraping using requests + BeautifulSoup
- Update all documentation and templates to reflect new scraping approach
- Maintain compatibility with existing URL fetching functionality
- Simplify deployment requirements by removing async dependencies
- CLAUDE.md +10 -11
- README.md +2 -2
- TEST_PROCEDURE.md +1 -1
- app.py +44 -43
- file_upload_proposal.md +2 -2
- requirements.txt +0 -2
- scraping_service.py +0 -146
CLAUDE.md
CHANGED

@@ -11,14 +11,13 @@ Chat UI Helper is a Gradio-based tool for generating and configuring chat interf
 ### Main Application (`app.py`)
 - **Primary Interface**: Two-tab Gradio application - "Spaces Configuration" for generating chat interfaces and "Chat Support" for getting help
 - **Template System**: `SPACE_TEMPLATE` generates complete HuggingFace Space apps with embedded configuration
-- **Web Scraping**: …
+- **Web Scraping**: Simple HTTP requests with BeautifulSoup for URL content fetching and context grounding
 - **Vector RAG**: Optional document processing pipeline for course materials and knowledge bases

 ### Document Processing Pipeline
 - **RAGTool** (`rag_tool.py`): Main orchestrator for document upload and processing
 - **DocumentProcessor** (`document_processor.py`): Handles PDF, DOCX, TXT, MD file parsing and chunking
 - **VectorStore** (`vector_store.py`): FAISS-based similarity search and embedding management
-- **ScrapingService** (`scraping_service.py`): Crawl4AI integration for web content extraction

 ### Package Generation
 The tool generates complete HuggingFace Spaces with:

@@ -59,7 +58,7 @@ pip install -r requirements.txt
 ### Key Dependencies
 - **Gradio 5.35.0+**: Main UI framework
-- **…
+- **requests**: HTTP requests for web content fetching
 - **sentence-transformers**: Embeddings for RAG (optional)
 - **faiss-cpu**: Vector similarity search (optional)
 - **PyMuPDF**: PDF text extraction (optional)

@@ -101,7 +100,7 @@ Generated spaces use these template substitutions:
 ### Template Generation Pattern
 All generated HuggingFace Spaces follow consistent structure:
 1. Configuration section with environment variable loading
-2. Web scraping functions (…
+2. Web scraping functions (simple HTTP requests with BeautifulSoup)
 3. RAG context retrieval (if enabled)
 4. OpenRouter API integration with conversation history
 5. Gradio ChatInterface with access control

@@ -137,7 +136,7 @@ This pattern allows the main application to function even when optional vector d
 - Proper message format handling for OpenRouter API

 ### Dynamic URL Fetching
-When enabled, generated spaces can extract URLs from user messages and fetch content dynamically using regex pattern matching and …
+When enabled, generated spaces can extract URLs from user messages and fetch content dynamically using regex pattern matching and simple HTTP requests with BeautifulSoup.

 ### Vector RAG Workflow
 1. Documents uploaded through Gradio File component

@@ -147,12 +146,12 @@ When enabled, generated spaces can extract URLs from user messages and fetch con
 5. Embedded in generated template for deployment portability
 6. Runtime similarity search for context-aware responses

-### …
-The application …
-- **…
-- **…
+### Web Scraping Implementation
+The application uses simple HTTP requests with BeautifulSoup for web content extraction:
+- **Simple HTTP requests**: Uses `requests` library with timeout and user-agent headers
+- **Content extraction**: BeautifulSoup for HTML parsing and text extraction
+- **Content cleaning**: Removes scripts, styles, navigation elements and normalizes whitespace
+- **Content limits**: Truncates content to ~4000 characters for context management

 ## Testing and Quality Assurance
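As a rough illustration of the Dynamic URL Fetching flow documented above, the sketch below extracts URLs from a user message and fetches them with the requests + BeautifulSoup approach. The regex and helper names are illustrative only; the generated spaces' actual pattern and wiring are not shown in this diff.

```python
import re

import requests
from bs4 import BeautifulSoup

# Hypothetical URL pattern; the template's real regex is not part of this diff.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(message):
    """Pull candidate http(s) URLs out of a user message."""
    return URL_PATTERN.findall(message)


def fetch_dynamic_context(message):
    """Fetch cleaned, truncated page text for every URL mentioned in the message."""
    parts = []
    for url in extract_urls(message):
        try:
            resp = requests.get(url, timeout=10, headers={"User-Agent": "Mozilla/5.0"})
            resp.raise_for_status()
            soup = BeautifulSoup(resp.content, "html.parser")
            # Drop non-content elements, then collapse whitespace and truncate.
            for tag in soup(["script", "style", "nav", "header", "footer"]):
                tag.decompose()
            text = " ".join(soup.get_text().split())[:4000]
        except Exception as e:
            text = f"Error fetching {url}: {e}"
        parts.append(f"Context from {url}:\n{text}")
    return "\n\n".join(parts)
```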
README.md
CHANGED

@@ -47,7 +47,7 @@ Set your OpenRouter API key as a secret:
 Each generated space includes:
 - **OpenRouter API Integration**: Support for multiple LLM models
-- **Web Scraping**: …
+- **Web Scraping**: Simple HTTP requests with BeautifulSoup for URL content fetching
 - **Document RAG**: Optional upload and search through PDF, DOCX, TXT, MD files
 - **Access Control**: Environment-based student access codes
 - **Modern UI**: Gradio 5.x ChatInterface with proper message formatting

@@ -56,7 +56,7 @@ Each generated space includes:
 - **Main Application**: `app.py` with two-tab interface
 - **Document Processing**: RAG pipeline with FAISS vector search
-- **Web Scraping**: …
+- **Web Scraping**: HTTP requests with BeautifulSoup for content extraction
 - **Template Generation**: Complete HuggingFace Space creation

 For detailed development guidance, see [CLAUDE.md](CLAUDE.md).
TEST_PROCEDURE.md
CHANGED

@@ -98,7 +98,7 @@ python -c "from test_vector_db import test_rag_tool; test_rag_tool()"
 # Verify placeholder content is returned

 # Test in production mode
-# Verify actual web content is fetched via …
+# Verify actual web content is fetched via HTTP requests
 ```

 #### 4.2 URL Processing
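For a quick manual check of that step, something along these lines should work from the repo root (assuming network access; `fetch_url_content` is the module-level helper in `app.py` after this commit, and the URL is only an example):

```python
# Sanity check: the HTTP-based fetch should return readable page text, not an error string.
from app import fetch_url_content

text = fetch_url_content("https://example.com")
assert not text.startswith("Error fetching"), text
print(text[:200])  # content is truncated to ~4000 characters overall
```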
app.py
CHANGED

@@ -9,13 +9,21 @@ import requests
 from bs4 import BeautifulSoup
 import tempfile
 from pathlib import Path
-# …
+# Simple URL content fetching using requests and BeautifulSoup
+def get_grounding_context_simple(urls):
+    """Fetch grounding context using simple HTTP requests"""
+    if not urls:
+        return ""
+
+    context_parts = []
+    for i, url in enumerate(urls, 1):
+        if url and url.strip():
+            content = fetch_url_content(url.strip())
+            context_parts.append(f"Context from URL {i} ({url}):\n{content}")
+
+    if context_parts:
+        return "\n\n" + "\n\n".join(context_parts) + "\n\n"
+    return ""

 # Import RAG components
 try:

@@ -33,8 +41,7 @@ SPACE_TEMPLATE = '''import gradio as gr
 import os
 import requests
 import json
-import …
-from crawl4ai import AsyncWebCrawler
+from bs4 import BeautifulSoup

 # Configuration
 SPACE_NAME = "{name}"

@@ -51,36 +58,30 @@ RAG_DATA = {rag_data_json}
 # Get API key from environment - customizable variable name
 API_KEY = os.environ.get("{api_key_var}")

-async def fetch_url_content_async(url, crawler):
-    """Fetch and extract text content from a URL using Crawl4AI"""
-    try:
-        result = await crawler.arun(
-            url=url,
-            bypass_cache=True,
-            word_count_threshold=10,
-            excluded_tags=['script', 'style', 'nav', 'header', 'footer'],
-            remove_overlay_elements=True
-        )
-
-        if result.success:
-            content = result.markdown or result.cleaned_html or ""
-            # Truncate to ~4000 characters
-            if len(content) > 4000:
-                content = content[:4000] + "..."
-            return content
-        else:
-            return f"Error fetching {{url}}: Failed to retrieve content"
-    except Exception as e:
-        return f"Error fetching {{url}}: {{str(e)}}"
-
-def fetch_url_content(url):
-    """…
-    async def fetch():
-        async with AsyncWebCrawler(verbose=False) as crawler:
-            return await fetch_url_content_async(url, crawler)
-
-    try:
-        …
+def fetch_url_content(url):
+    """Fetch and extract text content from a URL using requests and BeautifulSoup"""
+    try:
+        response = requests.get(url, timeout=10, headers={{'User-Agent': 'Mozilla/5.0'}})
+        response.raise_for_status()
+        soup = BeautifulSoup(response.content, 'html.parser')
+
+        # Remove script and style elements
+        for script in soup(["script", "style", "nav", "header", "footer"]):
+            script.decompose()
+
+        # Get text content
+        text = soup.get_text()
+
+        # Clean up whitespace
+        lines = (line.strip() for line in text.splitlines())
+        chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
+        text = ' '.join(chunk for chunk in chunks if chunk)
+
+        # Truncate to ~4000 characters
+        if len(text) > 4000:
+            text = text[:4000] + "..."
+
+        return text
     except Exception as e:
         return f"Error fetching {{url}}: {{str(e)}}"

@@ -462,7 +463,7 @@ Generated on {datetime.now().strftime('%Y-%m-%d %H:%M:%S')} with Chat U/I Helper
 def create_requirements(enable_vector_rag=False):
     """Generate requirements.txt"""
-    base_requirements = "gradio>=5.35.0\nrequests>=2.32.3\…
+    base_requirements = "gradio>=5.35.0\nrequests>=2.32.3\nbeautifulsoup4>=4.12.3"

     if enable_vector_rag:
         base_requirements += "\nfaiss-cpu==1.7.4\nnumpy==1.24.3"

@@ -720,8 +721,8 @@ def get_cached_grounding_context(urls):
     if cache_key in url_content_cache:
         return url_content_cache[cache_key]

-    # If not cached, fetch using …
-    grounding_context = …
+    # If not cached, fetch using simple HTTP requests
+    grounding_context = get_grounding_context_simple(valid_urls)

     # Cache the result
     url_content_cache[cache_key] = grounding_context

@@ -877,11 +878,11 @@ def remove_chat_urls(count):
 def toggle_research_assistant(enable_research):
     """Toggle visibility of research assistant detailed fields and disable custom categories"""
     if enable_research:
-        combined_prompt = "You are a research assistant that provides link-grounded information through …
+        combined_prompt = "You are a research assistant that provides link-grounded information through web fetching. Use MLA documentation for parenthetical citations and bibliographic entries. This assistant is designed for students and researchers conducting academic inquiry. Your main responsibilities include: analyzing academic sources, fact-checking claims with evidence, providing properly cited research summaries, and helping users navigate scholarly information. Ground all responses in provided URL contexts and any additional URLs you're instructed to fetch. Never rely on memory for factual claims."
         return (
             gr.update(visible=True),  # Show research detailed fields
             gr.update(value=combined_prompt),  # Update main system prompt
-            gr.update(value="You are a research assistant that provides link-grounded information through …
+            gr.update(value="You are a research assistant that provides link-grounded information through web fetching. Use MLA documentation for parenthetical citations and bibliographic entries."),
             gr.update(value="This assistant is designed for students and researchers conducting academic inquiry."),
             gr.update(value="Your main responsibilities include: analyzing academic sources, fact-checking claims with evidence, providing properly cited research summaries, and helping users navigate scholarly information."),
             gr.update(value="Ground all responses in provided URL contexts and any additional URLs you're instructed to fetch. Never rely on memory for factual claims."),

@@ -1088,7 +1089,7 @@ with gr.Blocks(title="Chat U/I Helper") as demo:
 enable_dynamic_urls = gr.Checkbox(
     label="Enable Dynamic URL Fetching",
     value=False,
-    info="Allow the assistant to fetch additional URLs mentioned in conversations …
+    info="Allow the assistant to fetch additional URLs mentioned in conversations"
 )

 enable_vector_rag = gr.Checkbox(
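A quick usage sketch of the new module-level helper follows. The function name comes from the diff above; the URLs and prompt text are placeholders.

```python
from app import get_grounding_context_simple  # helper added in this commit

# Hypothetical URLs whose page text should ground the assistant's responses.
urls = ["https://example.com/syllabus", "https://example.com/readings"]

grounding = get_grounding_context_simple(urls)
system_prompt = "You are a course assistant. Use the provided context." + grounding
print(system_prompt[:500])
```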
file_upload_proposal.md
CHANGED

@@ -31,7 +31,7 @@ Upload → Parse → Chunk → Vector Store → RAG Integration → Deployment P
 - **TXT/MD**: Direct text processing with metadata extraction
 - **Auto-detection**: File type identification and appropriate parser routing

-### RAG Integration (enhancement to existing …
+### RAG Integration (enhancement to existing web scraping system)
 - **Chunking Strategy**: Semantic chunking (500-1000 tokens with 100-token overlap)
 - **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (lightweight, fast)
 - **Vector Store**: In-memory FAISS index for deployment portability

@@ -138,7 +138,7 @@ This approach maintains your existing speed while adding powerful document under
 1. Implement document parser service
 2. Add file upload UI components
-3. Integrate RAG system with existing …
+3. Integrate RAG system with existing web scraping architecture
 4. Enhance SPACE_TEMPLATE with embedded materials
 5. Test with sample course materials
 6. Optimize for deployment package size
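For context on the RAG integration the proposal describes, here is a minimal sketch of the chunk → embed → FAISS → search flow. The model name and the in-memory FAISS index follow the proposal; the character-based chunking stand-in and all function names are illustrative, not the repository's implementation.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Lightweight embedding model named in the proposal.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def chunk_text(text, size=800, overlap=100):
    """Fixed-size character chunking as a stand-in for the proposal's token-based chunking."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks


def build_index(chunks):
    """Embed chunks and load them into an in-memory FAISS index."""
    embeddings = model.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine for normalized vectors
    index.add(np.asarray(embeddings, dtype="float32"))
    return index


def search(index, chunks, query, k=3):
    """Return the k most similar chunks for a query."""
    q = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [(chunks[i], float(s)) for i, s in zip(ids[0], scores[0])]
```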
requirements.txt
CHANGED

@@ -2,8 +2,6 @@ gradio==5.35.0
 requests>=2.32.3
 beautifulsoup4>=4.12.3
 python-dotenv>=1.0.0
-crawl4ai>=0.4.0
-aiofiles>=24.0

 # Vector RAG dependencies (optional)
 sentence-transformers>=2.2.2
scraping_service.py
DELETED

@@ -1,146 +0,0 @@
-import asyncio
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
-import json
-from typing import List, Dict, Optional
-
-class Crawl4AIScraper:
-    """Web scraping service using Crawl4AI for better content extraction"""
-
-    def __init__(self):
-        self.crawler = None
-
-    async def __aenter__(self):
-        """Initialize the crawler when entering async context"""
-        self.crawler = AsyncWebCrawler(verbose=False)
-        await self.crawler.__aenter__()
-        return self
-
-    async def __aexit__(self, exc_type, exc_val, exc_tb):
-        """Clean up the crawler when exiting async context"""
-        if self.crawler:
-            await self.crawler.__aexit__(exc_type, exc_val, exc_tb)
-
-    async def scrape_url(self, url: str, max_chars: int = 4000) -> str:
-        """
-        Scrape a single URL and extract text content
-
-        Args:
-            url: The URL to scrape
-            max_chars: Maximum characters to return (default 4000)
-
-        Returns:
-            Extracted text content or error message
-        """
-        try:
-            # Perform the crawl
-            result = await self.crawler.arun(
-                url=url,
-                bypass_cache=True,
-                word_count_threshold=10,
-                excluded_tags=['script', 'style', 'nav', 'header', 'footer'],
-                remove_overlay_elements=True
-            )
-
-            if result.success:
-                # Get cleaned text content - prefer markdown over cleaned_html
-                content = result.markdown or result.cleaned_html or ""
-
-                # Truncate if needed
-                if len(content) > max_chars:
-                    content = content[:max_chars] + "..."
-
-                return content
-            else:
-                return f"Error fetching {url}: Failed to retrieve content"
-
-        except Exception as e:
-            return f"Error fetching {url}: {str(e)}"
-
-    async def scrape_multiple_urls(self, urls: List[str], max_chars_per_url: int = 4000) -> Dict[str, str]:
-        """
-        Scrape multiple URLs concurrently
-
-        Args:
-            urls: List of URLs to scrape
-            max_chars_per_url: Maximum characters per URL
-
-        Returns:
-            Dictionary mapping URLs to their content
-        """
-        tasks = [self.scrape_url(url, max_chars_per_url) for url in urls if url and url.strip()]
-        results = await asyncio.gather(*tasks, return_exceptions=True)
-
-        url_content = {}
-        for url, result in zip(urls, results):
-            if isinstance(result, Exception):
-                url_content[url] = f"Error fetching {url}: {str(result)}"
-            else:
-                url_content[url] = result
-
-        return url_content
-
-def get_grounding_context_crawl4ai(urls: List[str]) -> str:
-    """
-    Synchronous wrapper to fetch grounding context using Crawl4AI
-
-    Args:
-        urls: List of URLs to fetch context from
-
-    Returns:
-        Formatted grounding context string
-    """
-    if not urls:
-        return ""
-
-    # Filter valid URLs
-    valid_urls = [url for url in urls if url and url.strip()]
-    if not valid_urls:
-        return ""
-
-    async def fetch_all():
-        async with Crawl4AIScraper() as scraper:
-            return await scraper.scrape_multiple_urls(valid_urls)
-
-    # Run the async function - handle existing event loop
-    try:
-        loop = asyncio.get_running_loop()
-        # We're already in an async context, create a new event loop in a thread
-        import concurrent.futures
-        with concurrent.futures.ThreadPoolExecutor() as executor:
-            future = executor.submit(asyncio.run, fetch_all())
-            url_content = future.result()
-    except RuntimeError:
-        # No event loop running, we can use asyncio.run directly
-        url_content = asyncio.run(fetch_all())
-    except Exception as e:
-        return f"Error initializing scraper: {str(e)}"
-
-    # Format the context
-    context_parts = []
-    for i, (url, content) in enumerate(url_content.items(), 1):
-        context_parts.append(f"Context from URL {i} ({url}):\n{content}")
-
-    if context_parts:
-        return "\n\n" + "\n\n".join(context_parts) + "\n\n"
-    return ""
-
-# Backwards compatibility function
-def fetch_url_content_crawl4ai(url: str) -> str:
-    """
-    Synchronous wrapper for single URL scraping (backwards compatibility)
-
-    Args:
-        url: The URL to fetch
-
-    Returns:
-        Extracted content or error message
-    """
-    async def fetch_one():
-        async with Crawl4AIScraper() as scraper:
-            return await scraper.scrape_url(url)
-
-    try:
-        return asyncio.run(fetch_one())
-    except Exception as e:
-        return f"Error fetching {url}: {str(e)}"