Replace Crawl4AI with simple HTTP requests and BeautifulSoup
- Remove Crawl4AI dependency and scraping_service.py
- Implement simple HTTP-based web scraping using requests + BeautifulSoup
- Update all documentation and templates to reflect new scraping approach
- Maintain compatibility with existing URL fetching functionality
- Simplify deployment requirements by removing async dependencies
- CLAUDE.md +10 -11
- README.md +2 -2
- TEST_PROCEDURE.md +1 -1
- app.py +44 -43
- file_upload_proposal.md +2 -2
- requirements.txt +0 -2
- scraping_service.py +0 -146
CLAUDE.md
CHANGED

@@ -11,14 +11,13 @@ Chat UI Helper is a Gradio-based tool for generating and configuring chat interf
 ### Main Application (`app.py`)
 - **Primary Interface**: Two-tab Gradio application - "Spaces Configuration" for generating chat interfaces and "Chat Support" for getting help
 - **Template System**: `SPACE_TEMPLATE` generates complete HuggingFace Space apps with embedded configuration
-- **Web Scraping**: …
+- **Web Scraping**: Simple HTTP requests with BeautifulSoup for URL content fetching and context grounding
 - **Vector RAG**: Optional document processing pipeline for course materials and knowledge bases

 ### Document Processing Pipeline
 - **RAGTool** (`rag_tool.py`): Main orchestrator for document upload and processing
 - **DocumentProcessor** (`document_processor.py`): Handles PDF, DOCX, TXT, MD file parsing and chunking
 - **VectorStore** (`vector_store.py`): FAISS-based similarity search and embedding management
-- **ScrapingService** (`scraping_service.py`): Crawl4AI integration for web content extraction

 ### Package Generation
 The tool generates complete HuggingFace Spaces with:

@@ -59,7 +58,7 @@ pip install -r requirements.txt
 ### Key Dependencies
 - **Gradio 5.35.0+**: Main UI framework
-- **…
+- **requests**: HTTP requests for web content fetching
 - **sentence-transformers**: Embeddings for RAG (optional)
 - **faiss-cpu**: Vector similarity search (optional)
 - **PyMuPDF**: PDF text extraction (optional)

@@ -101,7 +100,7 @@ Generated spaces use these template substitutions:
 ### Template Generation Pattern
 All generated HuggingFace Spaces follow consistent structure:
 1. Configuration section with environment variable loading
-2. Web scraping functions (…
+2. Web scraping functions (simple HTTP requests with BeautifulSoup)
 3. RAG context retrieval (if enabled)
 4. OpenRouter API integration with conversation history
 5. Gradio ChatInterface with access control

@@ -137,7 +136,7 @@ This pattern allows the main application to function even when optional vector d
 - Proper message format handling for OpenRouter API

 ### Dynamic URL Fetching
-When enabled, generated spaces can extract URLs from user messages and fetch content dynamically using regex pattern matching and …
+When enabled, generated spaces can extract URLs from user messages and fetch content dynamically using regex pattern matching and simple HTTP requests with BeautifulSoup.

 ### Vector RAG Workflow
 1. Documents uploaded through Gradio File component

@@ -147,12 +146,12 @@ When enabled, generated spaces can extract URLs from user messages and fetch con
 5. Embedded in generated template for deployment portability
 6. Runtime similarity search for context-aware responses

-### …
-The application …
-- **…
-- **…
+### Web Scraping Implementation
+The application uses simple HTTP requests with BeautifulSoup for web content extraction:
+- **Simple HTTP requests**: Uses `requests` library with timeout and user-agent headers
+- **Content extraction**: BeautifulSoup for HTML parsing and text extraction
+- **Content cleaning**: Removes scripts, styles, navigation elements and normalizes whitespace
+- **Content limits**: Truncates content to ~4000 characters for context management

 ## Testing and Quality Assurance
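As a rough illustration of the Dynamic URL Fetching flow documented above, the sketch below extracts URLs from a user message and fetches them with the requests + BeautifulSoup approach. The regex and helper names are illustrative only; the generated spaces' actual pattern and wiring are not shown in this diff.

```python
import re

import requests
from bs4 import BeautifulSoup

# Hypothetical URL pattern; the template's real regex is not part of this diff.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(message):
    """Pull candidate http(s) URLs out of a user message."""
    return URL_PATTERN.findall(message)


def fetch_dynamic_context(message):
    """Fetch cleaned, truncated page text for every URL mentioned in the message."""
    parts = []
    for url in extract_urls(message):
        try:
            resp = requests.get(url, timeout=10, headers={"User-Agent": "Mozilla/5.0"})
            resp.raise_for_status()
            soup = BeautifulSoup(resp.content, "html.parser")
            # Drop non-content elements, then collapse whitespace and truncate.
            for tag in soup(["script", "style", "nav", "header", "footer"]):
                tag.decompose()
            text = " ".join(soup.get_text().split())[:4000]
        except Exception as e:
            text = f"Error fetching {url}: {e}"
        parts.append(f"Context from {url}:\n{text}")
    return "\n\n".join(parts)
```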
README.md
CHANGED

@@ -47,7 +47,7 @@ Set your OpenRouter API key as a secret:
 Each generated space includes:
 - **OpenRouter API Integration**: Support for multiple LLM models
-- **Web Scraping**: …
+- **Web Scraping**: Simple HTTP requests with BeautifulSoup for URL content fetching
 - **Document RAG**: Optional upload and search through PDF, DOCX, TXT, MD files
 - **Access Control**: Environment-based student access codes
 - **Modern UI**: Gradio 5.x ChatInterface with proper message formatting

@@ -56,7 +56,7 @@ Each generated space includes:
 - **Main Application**: `app.py` with two-tab interface
 - **Document Processing**: RAG pipeline with FAISS vector search
-- **Web Scraping**: …
+- **Web Scraping**: HTTP requests with BeautifulSoup for content extraction
 - **Template Generation**: Complete HuggingFace Space creation

 For detailed development guidance, see [CLAUDE.md](CLAUDE.md).
TEST_PROCEDURE.md
CHANGED

@@ -98,7 +98,7 @@ python -c "from test_vector_db import test_rag_tool; test_rag_tool()"
 # Verify placeholder content is returned

 # Test in production mode
-# Verify actual web content is fetched via …
+# Verify actual web content is fetched via HTTP requests
 ```

 #### 4.2 URL Processing
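For a quick manual check of that step, something along these lines should work from the repo root (assuming network access; `fetch_url_content` is the module-level helper in `app.py` after this commit, and the URL is only an example):

```python
# Sanity check: the HTTP-based fetch should return readable page text, not an error string.
from app import fetch_url_content

text = fetch_url_content("https://example.com")
assert not text.startswith("Error fetching"), text
print(text[:200])  # content is truncated to ~4000 characters overall
```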
app.py
CHANGED

@@ -9,13 +9,21 @@ import requests
 from bs4 import BeautifulSoup
 import tempfile
 from pathlib import Path
-# …
+# Simple URL content fetching using requests and BeautifulSoup
+def get_grounding_context_simple(urls):
+    """Fetch grounding context using simple HTTP requests"""
+    if not urls:
+        return ""
+
+    context_parts = []
+    for i, url in enumerate(urls, 1):
+        if url and url.strip():
+            content = fetch_url_content(url.strip())
+            context_parts.append(f"Context from URL {i} ({url}):\n{content}")
+
+    if context_parts:
+        return "\n\n" + "\n\n".join(context_parts) + "\n\n"
+    return ""

 # Import RAG components
 try:

@@ -33,8 +41,7 @@ SPACE_TEMPLATE = '''import gradio as gr
 import os
 import requests
 import json
-import …
-from crawl4ai import AsyncWebCrawler
+from bs4 import BeautifulSoup

 # Configuration
 SPACE_NAME = "{name}"

@@ -51,36 +58,30 @@ RAG_DATA = {rag_data_json}
 # Get API key from environment - customizable variable name
 API_KEY = os.environ.get("{api_key_var}")

-async def fetch_url_content_async(url, crawler):
-    """Fetch and extract text content from a URL using Crawl4AI"""
-    try:
-        result = await crawler.arun(
-            url=url,
-            bypass_cache=True,
-            word_count_threshold=10,
-            excluded_tags=['script', 'style', 'nav', 'header', 'footer'],
-            remove_overlay_elements=True
-        )
-
-        if result.success:
-            content = result.markdown or result.cleaned_html or ""
-            # Truncate to ~4000 characters
-            if len(content) > 4000:
-                content = content[:4000] + "..."
-            return content
-        else:
-            return f"Error fetching {{url}}: Failed to retrieve content"
-    except Exception as e:
-        return f"Error fetching {{url}}: {{str(e)}}"
-
-def fetch_url_content(url):
-    """…
-    async def fetch():
-        async with AsyncWebCrawler(verbose=False) as crawler:
-            return await fetch_url_content_async(url, crawler)
-
-    try:
-        …
+def fetch_url_content(url):
+    """Fetch and extract text content from a URL using requests and BeautifulSoup"""
+    try:
+        response = requests.get(url, timeout=10, headers={{'User-Agent': 'Mozilla/5.0'}})
+        response.raise_for_status()
+        soup = BeautifulSoup(response.content, 'html.parser')
+
+        # Remove script and style elements
+        for script in soup(["script", "style", "nav", "header", "footer"]):
+            script.decompose()
+
+        # Get text content
+        text = soup.get_text()
+
+        # Clean up whitespace
+        lines = (line.strip() for line in text.splitlines())
+        chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
+        text = ' '.join(chunk for chunk in chunks if chunk)
+
+        # Truncate to ~4000 characters
+        if len(text) > 4000:
+            text = text[:4000] + "..."
+
+        return text
     except Exception as e:
         return f"Error fetching {{url}}: {{str(e)}}"

@@ -462,7 +463,7 @@ Generated on {datetime.now().strftime('%Y-%m-%d %H:%M:%S')} with Chat U/I Helper
 def create_requirements(enable_vector_rag=False):
     """Generate requirements.txt"""
-    base_requirements = "gradio>=5.35.0\nrequests>=2.32.3\…
+    base_requirements = "gradio>=5.35.0\nrequests>=2.32.3\nbeautifulsoup4>=4.12.3"

     if enable_vector_rag:
         base_requirements += "\nfaiss-cpu==1.7.4\nnumpy==1.24.3"

@@ -720,8 +721,8 @@ def get_cached_grounding_context(urls):
     if cache_key in url_content_cache:
         return url_content_cache[cache_key]

-    # If not cached, fetch using …
-    grounding_context = …
+    # If not cached, fetch using simple HTTP requests
+    grounding_context = get_grounding_context_simple(valid_urls)

     # Cache the result
     url_content_cache[cache_key] = grounding_context

@@ -877,11 +878,11 @@ def remove_chat_urls(count):
 def toggle_research_assistant(enable_research):
     """Toggle visibility of research assistant detailed fields and disable custom categories"""
     if enable_research:
-        combined_prompt = "You are a research assistant that provides link-grounded information through …
+        combined_prompt = "You are a research assistant that provides link-grounded information through web fetching. Use MLA documentation for parenthetical citations and bibliographic entries. This assistant is designed for students and researchers conducting academic inquiry. Your main responsibilities include: analyzing academic sources, fact-checking claims with evidence, providing properly cited research summaries, and helping users navigate scholarly information. Ground all responses in provided URL contexts and any additional URLs you're instructed to fetch. Never rely on memory for factual claims."
         return (
             gr.update(visible=True),  # Show research detailed fields
             gr.update(value=combined_prompt),  # Update main system prompt
-            gr.update(value="You are a research assistant that provides link-grounded information through …
+            gr.update(value="You are a research assistant that provides link-grounded information through web fetching. Use MLA documentation for parenthetical citations and bibliographic entries."),
             gr.update(value="This assistant is designed for students and researchers conducting academic inquiry."),
             gr.update(value="Your main responsibilities include: analyzing academic sources, fact-checking claims with evidence, providing properly cited research summaries, and helping users navigate scholarly information."),
             gr.update(value="Ground all responses in provided URL contexts and any additional URLs you're instructed to fetch. Never rely on memory for factual claims."),

@@ -1088,7 +1089,7 @@ with gr.Blocks(title="Chat U/I Helper") as demo:
 enable_dynamic_urls = gr.Checkbox(
     label="Enable Dynamic URL Fetching",
     value=False,
-    info="Allow the assistant to fetch additional URLs mentioned in conversations …
+    info="Allow the assistant to fetch additional URLs mentioned in conversations"
 )

 enable_vector_rag = gr.Checkbox(
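A quick usage sketch of the new module-level helper follows. The function name comes from the diff above; the URLs and prompt text are placeholders.

```python
from app import get_grounding_context_simple  # helper added in this commit

# Hypothetical URLs whose page text should ground the assistant's responses.
urls = ["https://example.com/syllabus", "https://example.com/readings"]

grounding = get_grounding_context_simple(urls)
system_prompt = "You are a course assistant. Use the provided context." + grounding
print(system_prompt[:500])
```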
file_upload_proposal.md
CHANGED

@@ -31,7 +31,7 @@ Upload → Parse → Chunk → Vector Store → RAG Integration → Deployment P
 - **TXT/MD**: Direct text processing with metadata extraction
 - **Auto-detection**: File type identification and appropriate parser routing

-### RAG Integration (enhancement to existing …
+### RAG Integration (enhancement to existing web scraping system)
 - **Chunking Strategy**: Semantic chunking (500-1000 tokens with 100-token overlap)
 - **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (lightweight, fast)
 - **Vector Store**: In-memory FAISS index for deployment portability

@@ -138,7 +138,7 @@ This approach maintains your existing speed while adding powerful document under
 1. Implement document parser service
 2. Add file upload UI components
-3. Integrate RAG system with existing …
+3. Integrate RAG system with existing web scraping architecture
 4. Enhance SPACE_TEMPLATE with embedded materials
 5. Test with sample course materials
 6. Optimize for deployment package size
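For context on the RAG integration the proposal describes, here is a minimal sketch of the chunk → embed → FAISS → search flow. The model name and the in-memory FAISS index follow the proposal; the character-based chunking stand-in and all function names are illustrative, not the repository's implementation.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Lightweight embedding model named in the proposal.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")


def chunk_text(text, size=800, overlap=100):
    """Fixed-size character chunking as a stand-in for the proposal's token-based chunking."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks


def build_index(chunks):
    """Embed chunks and load them into an in-memory FAISS index."""
    embeddings = model.encode(chunks, normalize_embeddings=True)
    index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine for normalized vectors
    index.add(np.asarray(embeddings, dtype="float32"))
    return index


def search(index, chunks, query, k=3):
    """Return the k most similar chunks for a query."""
    q = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [(chunks[i], float(s)) for i, s in zip(ids[0], scores[0])]
```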
requirements.txt
CHANGED

@@ -2,8 +2,6 @@ gradio==5.35.0
 requests>=2.32.3
 beautifulsoup4>=4.12.3
 python-dotenv>=1.0.0
-crawl4ai>=0.4.0
-aiofiles>=24.0

 # Vector RAG dependencies (optional)
 sentence-transformers>=2.2.2
scraping_service.py
DELETED

@@ -1,146 +0,0 @@
-import asyncio
-from crawl4ai import AsyncWebCrawler
-from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
-import json
-from typing import List, Dict, Optional
-
-class Crawl4AIScraper:
-    """Web scraping service using Crawl4AI for better content extraction"""
-
-    def __init__(self):
-        self.crawler = None
-
-    async def __aenter__(self):
-        """Initialize the crawler when entering async context"""
-        self.crawler = AsyncWebCrawler(verbose=False)
-        await self.crawler.__aenter__()
-        return self
-
-    async def __aexit__(self, exc_type, exc_val, exc_tb):
-        """Clean up the crawler when exiting async context"""
-        if self.crawler:
-            await self.crawler.__aexit__(exc_type, exc_val, exc_tb)
-
-    async def scrape_url(self, url: str, max_chars: int = 4000) -> str:
-        """
-        Scrape a single URL and extract text content
-
-        Args:
-            url: The URL to scrape
-            max_chars: Maximum characters to return (default 4000)
-
-        Returns:
-            Extracted text content or error message
-        """
-        try:
-            # Perform the crawl
-            result = await self.crawler.arun(
-                url=url,
-                bypass_cache=True,
-                word_count_threshold=10,
-                excluded_tags=['script', 'style', 'nav', 'header', 'footer'],
-                remove_overlay_elements=True
-            )
-
-            if result.success:
-                # Get cleaned text content - prefer markdown over cleaned_html
-                content = result.markdown or result.cleaned_html or ""
-
-                # Truncate if needed
-                if len(content) > max_chars:
-                    content = content[:max_chars] + "..."
-
-                return content
-            else:
-                return f"Error fetching {url}: Failed to retrieve content"
-
-        except Exception as e:
-            return f"Error fetching {url}: {str(e)}"
-
-    async def scrape_multiple_urls(self, urls: List[str], max_chars_per_url: int = 4000) -> Dict[str, str]:
-        """
-        Scrape multiple URLs concurrently
-
-        Args:
-            urls: List of URLs to scrape
-            max_chars_per_url: Maximum characters per URL
-
-        Returns:
-            Dictionary mapping URLs to their content
-        """
-        tasks = [self.scrape_url(url, max_chars_per_url) for url in urls if url and url.strip()]
-        results = await asyncio.gather(*tasks, return_exceptions=True)
-
-        url_content = {}
-        for url, result in zip(urls, results):
-            if isinstance(result, Exception):
-                url_content[url] = f"Error fetching {url}: {str(result)}"
-            else:
-                url_content[url] = result
-
-        return url_content
-
-def get_grounding_context_crawl4ai(urls: List[str]) -> str:
-    """
-    Synchronous wrapper to fetch grounding context using Crawl4AI
-
-    Args:
-        urls: List of URLs to fetch context from
-
-    Returns:
-        Formatted grounding context string
-    """
-    if not urls:
-        return ""
-
-    # Filter valid URLs
-    valid_urls = [url for url in urls if url and url.strip()]
-    if not valid_urls:
-        return ""
-
-    async def fetch_all():
-        async with Crawl4AIScraper() as scraper:
-            return await scraper.scrape_multiple_urls(valid_urls)
-
-    # Run the async function - handle existing event loop
-    try:
-        loop = asyncio.get_running_loop()
-        # We're already in an async context, create a new event loop in a thread
-        import concurrent.futures
-        with concurrent.futures.ThreadPoolExecutor() as executor:
-            future = executor.submit(asyncio.run, fetch_all())
-            url_content = future.result()
-    except RuntimeError:
-        # No event loop running, we can use asyncio.run directly
-        url_content = asyncio.run(fetch_all())
-    except Exception as e:
-        return f"Error initializing scraper: {str(e)}"
-
-    # Format the context
-    context_parts = []
-    for i, (url, content) in enumerate(url_content.items(), 1):
-        context_parts.append(f"Context from URL {i} ({url}):\n{content}")
-
-    if context_parts:
-        return "\n\n" + "\n\n".join(context_parts) + "\n\n"
-    return ""
-
-# Backwards compatibility function
-def fetch_url_content_crawl4ai(url: str) -> str:
-    """
-    Synchronous wrapper for single URL scraping (backwards compatibility)
-
-    Args:
-        url: The URL to fetch
-
-    Returns:
-        Extracted content or error message
-    """
-    async def fetch_one():
-        async with Crawl4AIScraper() as scraper:
-            return await scraper.scrape_url(url)
-
-    try:
-        return asyncio.run(fetch_one())
-    except Exception as e:
-        return f"Error fetching {url}: {str(e)}"