milwright committed · Commit 12839ce · 1 Parent(s): e2619ba

Replace Crawl4AI with simple HTTP requests and BeautifulSoup


- Remove Crawl4AI dependency and scraping_service.py
- Implement simple HTTP-based web scraping using requests + BeautifulSoup
- Update all documentation and templates to reflect new scraping approach
- Maintain compatibility with existing URL fetching functionality
- Simplify deployment requirements by removing async dependencies
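
For orientation, a minimal, self-contained sketch of the new fetching approach; it mirrors the `fetch_url_content` helper added to the `SPACE_TEMPLATE` in `app.py` below (whitespace cleanup is condensed here, and the example URL is illustrative):

```python
# Sketch of the requests + BeautifulSoup fetch this commit switches to.
# Mirrors fetch_url_content from the app.py diff below; not the verbatim implementation.
import requests
from bs4 import BeautifulSoup

def fetch_url_content(url):
    """Fetch a page and return cleaned text, truncated to ~4000 characters."""
    try:
        response = requests.get(url, timeout=10, headers={'User-Agent': 'Mozilla/5.0'})
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')
        # Drop non-content elements before extracting text
        for tag in soup(["script", "style", "nav", "header", "footer"]):
            tag.decompose()
        # Normalize whitespace and truncate for prompt-context use
        text = ' '.join(soup.get_text().split())
        return text[:4000] + "..." if len(text) > 4000 else text
    except Exception as e:
        return f"Error fetching {url}: {e}"

if __name__ == "__main__":
    print(fetch_url_content("https://example.com")[:200])  # illustrative URL
```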

Files changed (7)
  1. CLAUDE.md +10 -11
  2. README.md +2 -2
  3. TEST_PROCEDURE.md +1 -1
  4. app.py +44 -43
  5. file_upload_proposal.md +2 -2
  6. requirements.txt +0 -2
  7. scraping_service.py +0 -146
CLAUDE.md CHANGED
@@ -11,14 +11,13 @@ Chat UI Helper is a Gradio-based tool for generating and configuring chat interf
  ### Main Application (`app.py`)
  - **Primary Interface**: Two-tab Gradio application - "Spaces Configuration" for generating chat interfaces and "Chat Support" for getting help
  - **Template System**: `SPACE_TEMPLATE` generates complete HuggingFace Space apps with embedded configuration
- - **Web Scraping**: Integration with Crawl4AI for URL content fetching and context grounding
+ - **Web Scraping**: Simple HTTP requests with BeautifulSoup for URL content fetching and context grounding
  - **Vector RAG**: Optional document processing pipeline for course materials and knowledge bases

  ### Document Processing Pipeline
  - **RAGTool** (`rag_tool.py`): Main orchestrator for document upload and processing
  - **DocumentProcessor** (`document_processor.py`): Handles PDF, DOCX, TXT, MD file parsing and chunking
  - **VectorStore** (`vector_store.py`): FAISS-based similarity search and embedding management
- - **ScrapingService** (`scraping_service.py`): Crawl4AI integration for web content extraction

  ### Package Generation
  The tool generates complete HuggingFace Spaces with:
@@ -59,7 +58,7 @@ pip install -r requirements.txt

  ### Key Dependencies
  - **Gradio 5.35.0+**: Main UI framework
- - **Crawl4AI 0.4.0+**: Web scraping with async support
+ - **requests**: HTTP requests for web content fetching
  - **sentence-transformers**: Embeddings for RAG (optional)
  - **faiss-cpu**: Vector similarity search (optional)
  - **PyMuPDF**: PDF text extraction (optional)
@@ -101,7 +100,7 @@ Generated spaces use these template substitutions:
  ### Template Generation Pattern
  All generated HuggingFace Spaces follow consistent structure:
  1. Configuration section with environment variable loading
- 2. Web scraping functions (sync/async Crawl4AI wrappers)
+ 2. Web scraping functions (simple HTTP requests with BeautifulSoup)
  3. RAG context retrieval (if enabled)
  4. OpenRouter API integration with conversation history
  5. Gradio ChatInterface with access control
@@ -137,7 +136,7 @@ This pattern allows the main application to function even when optional vector d
  - Proper message format handling for OpenRouter API

  ### Dynamic URL Fetching
- When enabled, generated spaces can extract URLs from user messages and fetch content dynamically using regex pattern matching and Crawl4AI processing.
+ When enabled, generated spaces can extract URLs from user messages and fetch content dynamically using regex pattern matching and simple HTTP requests with BeautifulSoup.

  ### Vector RAG Workflow
  1. Documents uploaded through Gradio File component
@@ -147,12 +146,12 @@ When enabled, generated spaces can extract URLs from user messages and fetch con
  5. Embedded in generated template for deployment portability
  6. Runtime similarity search for context-aware responses

- ### Mock vs Production Web Scraping
- The application has two modes for web scraping:
- - **Mock mode** (lines 14-18 in app.py): Returns placeholder content for testing
- - **Production mode**: Uses Crawl4AI via scraping_service.py for actual web content extraction
-
- Switch between modes by commenting/uncommenting the imports and function definitions.
+ ### Web Scraping Implementation
+ The application uses simple HTTP requests with BeautifulSoup for web content extraction:
+ - **Simple HTTP requests**: Uses `requests` library with timeout and user-agent headers
+ - **Content extraction**: BeautifulSoup for HTML parsing and text extraction
+ - **Content cleaning**: Removes scripts, styles, navigation elements and normalizes whitespace
+ - **Content limits**: Truncates content to ~4000 characters for context management

  ## Testing and Quality Assurance

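The "Dynamic URL Fetching" behavior described in the CLAUDE.md section above amounts to a regex pass over the user message followed by the same HTTP fetch; a rough sketch, with the pattern and helper names assumed for illustration rather than taken verbatim from the generated template:

```python
# Illustrative sketch of dynamic URL fetching: extract URLs from a chat message,
# then fetch each one for grounding context. The regex is an assumption, not the
# exact pattern used in generated spaces.
import re

URL_PATTERN = re.compile(r"https?://[^\s)>\]]+")

def extract_urls(message):
    """Return any http(s) URLs mentioned in a user message."""
    return URL_PATTERN.findall(message)

# Each extracted URL would then go through fetch_url_content() (see the app.py diff below)
# and the results get appended to the system prompt as grounding context.
print(extract_urls("Summarize https://example.com/article and https://example.org please"))
```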
README.md CHANGED
@@ -47,7 +47,7 @@ Set your OpenRouter API key as a secret:

  Each generated space includes:
  - **OpenRouter API Integration**: Support for multiple LLM models
- - **Web Scraping**: Crawl4AI integration for URL content fetching
+ - **Web Scraping**: Simple HTTP requests with BeautifulSoup for URL content fetching
  - **Document RAG**: Optional upload and search through PDF, DOCX, TXT, MD files
  - **Access Control**: Environment-based student access codes
  - **Modern UI**: Gradio 5.x ChatInterface with proper message formatting
@@ -56,7 +56,7 @@ Each generated space includes:

  - **Main Application**: `app.py` with two-tab interface
  - **Document Processing**: RAG pipeline with FAISS vector search
- - **Web Scraping**: Async Crawl4AI integration
+ - **Web Scraping**: HTTP requests with BeautifulSoup for content extraction
  - **Template Generation**: Complete HuggingFace Space creation

  For detailed development guidance, see [CLAUDE.md](CLAUDE.md).
TEST_PROCEDURE.md CHANGED
@@ -98,7 +98,7 @@ python -c "from test_vector_db import test_rag_tool; test_rag_tool()"
  # Verify placeholder content is returned

  # Test in production mode
- # Verify actual web content is fetched via Crawl4AI
+ # Verify actual web content is fetched via HTTP requests
  ```

  #### 4.2 URL Processing
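As a concrete version of the production-mode check above, a standalone smoke test along these lines could confirm that real content comes back over HTTP (the URL is illustrative):

```python
# Standalone smoke test for production-mode fetching: confirm a real page is
# retrieved over HTTP. Mirrors the manual check in TEST_PROCEDURE.md; URL is illustrative.
import requests

resp = requests.get("https://example.com", timeout=10, headers={"User-Agent": "Mozilla/5.0"})
resp.raise_for_status()
assert "<html" in resp.text.lower(), "expected an HTML document"
print(f"Fetched {len(resp.text)} characters from example.com")
```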
app.py CHANGED
@@ -9,13 +9,21 @@ import requests
  from bs4 import BeautifulSoup
  import tempfile
  from pathlib import Path
- # from scraping_service import get_grounding_context_crawl4ai, fetch_url_content_crawl4ai
- # Temporary mock functions for testing
- def get_grounding_context_crawl4ai(urls):
-     return "\n\n[URL content would be fetched here]\n\n"
-
- def fetch_url_content_crawl4ai(url):
-     return f"[Content from {url} would be fetched here]"
+ # Simple URL content fetching using requests and BeautifulSoup
+ def get_grounding_context_simple(urls):
+     """Fetch grounding context using simple HTTP requests"""
+     if not urls:
+         return ""
+
+     context_parts = []
+     for i, url in enumerate(urls, 1):
+         if url and url.strip():
+             content = fetch_url_content(url.strip())
+             context_parts.append(f"Context from URL {i} ({url}):\n{content}")
+
+     if context_parts:
+         return "\n\n" + "\n\n".join(context_parts) + "\n\n"
+     return ""

  # Import RAG components
  try:
@@ -33,8 +41,7 @@ SPACE_TEMPLATE = '''import gradio as gr
  import os
  import requests
  import json
- import asyncio
- from crawl4ai import AsyncWebCrawler
+ from bs4 import BeautifulSoup

  # Configuration
  SPACE_NAME = "{name}"
@@ -51,36 +58,30 @@ RAG_DATA = {rag_data_json}
  # Get API key from environment - customizable variable name
  API_KEY = os.environ.get("{api_key_var}")

- async def fetch_url_content_async(url, crawler):
-     """Fetch and extract text content from a URL using Crawl4AI"""
-     try:
-         result = await crawler.arun(
-             url=url,
-             bypass_cache=True,
-             word_count_threshold=10,
-             excluded_tags=['script', 'style', 'nav', 'header', 'footer'],
-             remove_overlay_elements=True
-         )
-
-         if result.success:
-             content = result.markdown or result.cleaned_html or ""
-             # Truncate to ~4000 characters
-             if len(content) > 4000:
-                 content = content[:4000] + "..."
-             return content
-         else:
-             return f"Error fetching {{url}}: Failed to retrieve content"
-     except Exception as e:
-         return f"Error fetching {{url}}: {{str(e)}}"
-
  def fetch_url_content(url):
-     """Synchronous wrapper for URL fetching"""
-     async def fetch():
-         async with AsyncWebCrawler(verbose=False) as crawler:
-             return await fetch_url_content_async(url, crawler)
-
+     """Fetch and extract text content from a URL using requests and BeautifulSoup"""
      try:
-         return asyncio.run(fetch())
+         response = requests.get(url, timeout=10, headers={{'User-Agent': 'Mozilla/5.0'}})
+         response.raise_for_status()
+         soup = BeautifulSoup(response.content, 'html.parser')
+
+         # Remove script and style elements
+         for script in soup(["script", "style", "nav", "header", "footer"]):
+             script.decompose()
+
+         # Get text content
+         text = soup.get_text()
+
+         # Clean up whitespace
+         lines = (line.strip() for line in text.splitlines())
+         chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
+         text = ' '.join(chunk for chunk in chunks if chunk)
+
+         # Truncate to ~4000 characters
+         if len(text) > 4000:
+             text = text[:4000] + "..."
+
+         return text
      except Exception as e:
          return f"Error fetching {{url}}: {{str(e)}}"

@@ -462,7 +463,7 @@ Generated on {datetime.now().strftime('%Y-%m-%d %H:%M:%S')} with Chat U/I Helper

  def create_requirements(enable_vector_rag=False):
      """Generate requirements.txt"""
-     base_requirements = "gradio>=5.35.0\nrequests>=2.32.3\ncrawl4ai>=0.4.0\naiofiles>=24.0"
+     base_requirements = "gradio>=5.35.0\nrequests>=2.32.3\nbeautifulsoup4>=4.12.3"

      if enable_vector_rag:
          base_requirements += "\nfaiss-cpu==1.7.4\nnumpy==1.24.3"
@@ -720,8 +721,8 @@ def get_cached_grounding_context(urls):
      if cache_key in url_content_cache:
          return url_content_cache[cache_key]

-     # If not cached, fetch using Crawl4AI
-     grounding_context = get_grounding_context_crawl4ai(valid_urls)
+     # If not cached, fetch using simple HTTP requests
+     grounding_context = get_grounding_context_simple(valid_urls)

      # Cache the result
      url_content_cache[cache_key] = grounding_context
@@ -877,11 +878,11 @@ def remove_chat_urls(count):
  def toggle_research_assistant(enable_research):
      """Toggle visibility of research assistant detailed fields and disable custom categories"""
      if enable_research:
-         combined_prompt = "You are a research assistant that provides link-grounded information through Crawl4AI web fetching. Use MLA documentation for parenthetical citations and bibliographic entries. This assistant is designed for students and researchers conducting academic inquiry. Your main responsibilities include: analyzing academic sources, fact-checking claims with evidence, providing properly cited research summaries, and helping users navigate scholarly information. Ground all responses in provided URL contexts and any additional URLs you're instructed to fetch. Never rely on memory for factual claims."
+         combined_prompt = "You are a research assistant that provides link-grounded information through web fetching. Use MLA documentation for parenthetical citations and bibliographic entries. This assistant is designed for students and researchers conducting academic inquiry. Your main responsibilities include: analyzing academic sources, fact-checking claims with evidence, providing properly cited research summaries, and helping users navigate scholarly information. Ground all responses in provided URL contexts and any additional URLs you're instructed to fetch. Never rely on memory for factual claims."
          return (
              gr.update(visible=True),  # Show research detailed fields
              gr.update(value=combined_prompt),  # Update main system prompt
-             gr.update(value="You are a research assistant that provides link-grounded information through Crawl4AI web fetching. Use MLA documentation for parenthetical citations and bibliographic entries."),
+             gr.update(value="You are a research assistant that provides link-grounded information through web fetching. Use MLA documentation for parenthetical citations and bibliographic entries."),
              gr.update(value="This assistant is designed for students and researchers conducting academic inquiry."),
              gr.update(value="Your main responsibilities include: analyzing academic sources, fact-checking claims with evidence, providing properly cited research summaries, and helping users navigate scholarly information."),
              gr.update(value="Ground all responses in provided URL contexts and any additional URLs you're instructed to fetch. Never rely on memory for factual claims."),
@@ -1088,7 +1089,7 @@ with gr.Blocks(title="Chat U/I Helper") as demo:
          enable_dynamic_urls = gr.Checkbox(
              label="Enable Dynamic URL Fetching",
              value=False,
-             info="Allow the assistant to fetch additional URLs mentioned in conversations (uses Crawl4AI)"
+             info="Allow the assistant to fetch additional URLs mentioned in conversations"
          )

          enable_vector_rag = gr.Checkbox(
file_upload_proposal.md CHANGED
@@ -31,7 +31,7 @@ Upload → Parse → Chunk → Vector Store → RAG Integration → Deployment P
  - **TXT/MD**: Direct text processing with metadata extraction
  - **Auto-detection**: File type identification and appropriate parser routing

- ### RAG Integration (enhancement to existing Crawl4AI system)
+ ### RAG Integration (enhancement to existing web scraping system)
  - **Chunking Strategy**: Semantic chunking (500-1000 tokens with 100-token overlap)
  - **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 (lightweight, fast)
  - **Vector Store**: In-memory FAISS index for deployment portability
@@ -138,7 +138,7 @@ This approach maintains your existing speed while adding powerful document under

  1. Implement document parser service
  2. Add file upload UI components
- 3. Integrate RAG system with existing Crawl4AI architecture
+ 3. Integrate RAG system with existing web scraping architecture
  4. Enhance SPACE_TEMPLATE with embedded materials
  5. Test with sample course materials
  6. Optimize for deployment package size
requirements.txt CHANGED
@@ -2,8 +2,6 @@ gradio==5.35.0
  requests>=2.32.3
  beautifulsoup4>=4.12.3
  python-dotenv>=1.0.0
- crawl4ai>=0.4.0
- aiofiles>=24.0

  # Vector RAG dependencies (optional)
  sentence-transformers>=2.2.2
scraping_service.py DELETED
@@ -1,146 +0,0 @@
- import asyncio
- from crawl4ai import AsyncWebCrawler
- from crawl4ai.extraction_strategy import JsonCssExtractionStrategy, LLMExtractionStrategy
- import json
- from typing import List, Dict, Optional
-
- class Crawl4AIScraper:
-     """Web scraping service using Crawl4AI for better content extraction"""
-
-     def __init__(self):
-         self.crawler = None
-
-     async def __aenter__(self):
-         """Initialize the crawler when entering async context"""
-         self.crawler = AsyncWebCrawler(verbose=False)
-         await self.crawler.__aenter__()
-         return self
-
-     async def __aexit__(self, exc_type, exc_val, exc_tb):
-         """Clean up the crawler when exiting async context"""
-         if self.crawler:
-             await self.crawler.__aexit__(exc_type, exc_val, exc_tb)
-
-     async def scrape_url(self, url: str, max_chars: int = 4000) -> str:
-         """
-         Scrape a single URL and extract text content
-
-         Args:
-             url: The URL to scrape
-             max_chars: Maximum characters to return (default 4000)
-
-         Returns:
-             Extracted text content or error message
-         """
-         try:
-             # Perform the crawl
-             result = await self.crawler.arun(
-                 url=url,
-                 bypass_cache=True,
-                 word_count_threshold=10,
-                 excluded_tags=['script', 'style', 'nav', 'header', 'footer'],
-                 remove_overlay_elements=True
-             )
-
-             if result.success:
-                 # Get cleaned text content - prefer markdown over cleaned_html
-                 content = result.markdown or result.cleaned_html or ""
-
-                 # Truncate if needed
-                 if len(content) > max_chars:
-                     content = content[:max_chars] + "..."
-
-                 return content
-             else:
-                 return f"Error fetching {url}: Failed to retrieve content"
-
-         except Exception as e:
-             return f"Error fetching {url}: {str(e)}"
-
-     async def scrape_multiple_urls(self, urls: List[str], max_chars_per_url: int = 4000) -> Dict[str, str]:
-         """
-         Scrape multiple URLs concurrently
-
-         Args:
-             urls: List of URLs to scrape
-             max_chars_per_url: Maximum characters per URL
-
-         Returns:
-             Dictionary mapping URLs to their content
-         """
-         tasks = [self.scrape_url(url, max_chars_per_url) for url in urls if url and url.strip()]
-         results = await asyncio.gather(*tasks, return_exceptions=True)
-
-         url_content = {}
-         for url, result in zip(urls, results):
-             if isinstance(result, Exception):
-                 url_content[url] = f"Error fetching {url}: {str(result)}"
-             else:
-                 url_content[url] = result
-
-         return url_content
-
- def get_grounding_context_crawl4ai(urls: List[str]) -> str:
-     """
-     Synchronous wrapper to fetch grounding context using Crawl4AI
-
-     Args:
-         urls: List of URLs to fetch context from
-
-     Returns:
-         Formatted grounding context string
-     """
-     if not urls:
-         return ""
-
-     # Filter valid URLs
-     valid_urls = [url for url in urls if url and url.strip()]
-     if not valid_urls:
-         return ""
-
-     async def fetch_all():
-         async with Crawl4AIScraper() as scraper:
-             return await scraper.scrape_multiple_urls(valid_urls)
-
-     # Run the async function - handle existing event loop
-     try:
-         loop = asyncio.get_running_loop()
-         # We're already in an async context, create a new event loop in a thread
-         import concurrent.futures
-         with concurrent.futures.ThreadPoolExecutor() as executor:
-             future = executor.submit(asyncio.run, fetch_all())
-             url_content = future.result()
-     except RuntimeError:
-         # No event loop running, we can use asyncio.run directly
-         url_content = asyncio.run(fetch_all())
-     except Exception as e:
-         return f"Error initializing scraper: {str(e)}"
-
-     # Format the context
-     context_parts = []
-     for i, (url, content) in enumerate(url_content.items(), 1):
-         context_parts.append(f"Context from URL {i} ({url}):\n{content}")
-
-     if context_parts:
-         return "\n\n" + "\n\n".join(context_parts) + "\n\n"
-     return ""
-
- # Backwards compatibility function
- def fetch_url_content_crawl4ai(url: str) -> str:
-     """
-     Synchronous wrapper for single URL scraping (backwards compatibility)
-
-     Args:
-         url: The URL to fetch
-
-     Returns:
-         Extracted content or error message
-     """
-     async def fetch_one():
-         async with Crawl4AIScraper() as scraper:
-             return await scraper.scrape_url(url)
-
-     try:
-         return asyncio.run(fetch_one())
-     except Exception as e:
-         return f"Error fetching {url}: {str(e)}"