Added support for PDF files.
Generated README using Claude 3
- README.md +33 -8
- messages.py +0 -1
- requirements.txt +1 -0
- search_agent.py +41 -16
README.md
CHANGED
@@ -1,15 +1,33 @@
 # Simple Search Agent
 
-This is a simple search agent that (kind of) does what [Perplexity AI](https://www.perplexity.ai/) does.
+This Python project provides a search agent that can perform web searches, optimize search queries, fetch and process web content, and generate responses using a language model and the retrieved information.
+Does a bit of what [Perplexity AI](https://www.perplexity.ai/) does.
 
-## How It Works
 
-…
+This Python script is a search agent that utilizes the LangChain library to perform optimized web searches, retrieve relevant content, and generate informative answers to user queries. The script supports multiple language models and providers, including OpenAI, Anthropic, and Groq.
+
+The main functionality of the script can be summarized as follows:
+
+1. **Query Optimization**: The user's input query is optimized for web search by identifying the key information requested and transforming it into a concise search string using the language model's capabilities.
+2. **Web Search**: The optimized search query is used to fetch search results from the Brave Search API. The script allows limiting the search to a specific domain and setting the maximum number of pages to retrieve.
+3. **Content Extraction**: The script fetches the content of the retrieved search results, handling both HTML and PDF documents. It extracts the main text content from web pages and the text from PDF files.
+4. **Vectorization**: The extracted content is split into smaller text chunks and vectorized using OpenAI's text embeddings. The vectorized data is stored in a FAISS vector store for efficient retrieval.
+5. **Query Answering**: The user's original query is answered by retrieving the most relevant text chunks from the vector store using a Multi-Query Retriever. The language model generates an informative answer by synthesizing the retrieved information, citing the sources used, and formatting the response in Markdown.
+
+The script supports various options for customization, such as specifying the language model provider (OpenAI, Anthropic, Groq, or Ollama), the temperature for language model generation, and the output format (text or Markdown).
+
+Additionally, the script integrates with the LangChain Tracing V2 feature, allowing users to monitor and analyze the execution of their LangChain applications with LangSmith.
+
+To run the script, users need to provide their API keys for the desired language model provider and the Brave Search API in a `.env` file. The script can be executed from the command line, passing the desired options and the search query as arguments.
+
+## Features
+
+- Supports multiple language model providers (Bedrock, OpenAI, Groq, and Ollama)
+- Optimizes search queries using a language model
+- Fetches web pages and extracts the main content (HTML and PDF)
+- Vectorizes the content for efficient retrieval
+- Queries the vectorized content using a Retrieval-Augmented Generation (RAG) approach
+- Generates Markdown-formatted responses with references to the sources used
 
 ## Setup and Installation
 
@@ -24,6 +42,13 @@ This is a simple search agent that (kind of) does what [Perplexity AI](https://w
 
 ## Usage
 
+```
+python search_agent.py --query "your search query" --provider "provider_name" --model "model_name" --temperature 0.0
+```
+
+Replace `"your search query"` with your desired search query, `"provider_name"` with the language model provider (e.g., `bedrock`, `openai`, `groq`, `ollama`), `"model_name"` with the specific model name (optional), and `temperature` with the desired temperature value for the language model (optional).
+
+Example:
 ```
 ➜ python ./search_agent.py --provider groq -o text "Write a linkedin post on how Sequoia Capital AI Ascent 2024 is interesting"
 [21:44:05] Using mixtral-8x7b-32768 on groq with temperature 0.0 search_agent.py:78
messages.py
CHANGED
@@ -38,7 +38,6 @@ def get_optimized_search_messages(query):
         - Remove lenght instruction (example: essay, article, letter, etc)
 
         Add "**" to the end of the search string to indicate the end of the query
-        Provide your output in this format: optimized search string**
 
         Example:
             Question: How do I bake chocolate chip cookies from scratch?
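
The prompt text above asks the model to end its optimized search string with `**`. How that marker is consumed lives in `search_agent.py` and is not part of this diff; the snippet below is only a hypothetical illustration of the kind of post-processing such a terminator enables.

```
def parse_optimized_query(llm_output: str) -> str:
    """Hypothetical helper: keep only the text before the '**' end-of-query marker."""
    return llm_output.split("**", 1)[0].strip().strip('"')

# Example: the model's raw output becomes a clean search string.
print(parse_optimized_query('chocolate chip cookies recipe from scratch**'))
# -> chocolate chip cookies recipe from scratch
```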
requirements.txt
CHANGED
@@ -2,6 +2,7 @@ boto3
 bs4
 docopt
 faiss-cpu
+pdfplumber
 python-dotenv
 langchain
 langchain_core
search_agent.py
CHANGED
@@ -25,12 +25,14 @@ Options:
 
 import json
 import os
+import io
 from concurrent.futures import ThreadPoolExecutor
 from urllib.parse import quote
 
 from bs4 import BeautifulSoup
 from docopt import docopt
 import dotenv
+import pdfplumber
 
 from langchain_core.documents.base import Document
 from langchain.text_splitter import RecursiveCharacterTextSplitter
@@ -77,7 +79,7 @@ def get_chat_llm(provider, model=None, temperature=0.0):
             chat_llm = ChatOllama(model=model, temperature=temperature)
         case _:
             raise ValueError(f"Unknown LLM provider {provider}")
-
+
     console.log(f"Using {model} on {provider} with temperature {temperature}")
     return chat_llm
 
@@ -140,17 +142,35 @@ def extract_main_content(html):
         soup = BeautifulSoup(html, 'html.parser')
         for element in soup(["script", "style", "head", "nav", "footer", "iframe", "img"]):
            element.extract()
-        main_content = …
+        main_content = soup.get_text(separator='\n', strip=True)
         return main_content
     except Exception:
         return None
 
 def process_source(source):
     response = fetch_with_timeout(source['link'], 8)
+    console.log(f"Processing {source['link']}")
     if response:
-        …
+        content_type = response.headers.get('Content-Type')
+        if content_type == 'application/pdf':
+            # The response is a PDF file
+            pdf_content = response.content
+            # Create a file-like object from the bytes
+            pdf_file = io.BytesIO(pdf_content)
+            # Extract text from PDF using pdfplumber
+            with pdfplumber.open(pdf_file) as pdf:
+                text = ""
+                for page in pdf.pages:
+                    text += page.extract_text()
+            return {**source, 'pdf_content': text}
+        elif content_type.startswith('text/html'):
+            # The response is an HTML file
+            html = response.text
+            main_content = extract_main_content(html)
+            return {**source, 'html': main_content}
+        else:
+            console.log(f"Skipping {source['link']}! Unsupported content type: {content_type}")
+            return None
     return None
 
 def get_links_contents(sources):
@@ -163,14 +183,17 @@ def get_links_contents(sources):
 def vectorize(contents, text_chunk_size=1000,text_chunk_overlap=200,):
     documents = []
     for content in contents:
-        …
+        page_content = content['snippet']
+        if 'html' in content:
+            page_content = content['html']
+        if 'pdf_content' in content:
+            page_content = content['pdf_content']
+        try:
+            metadata = {'title': content['title'], 'source': content['link']}
+            doc = Document(page_content=page_content, metadata=metadata)
+            documents.append(doc)
+        except Exception as e:
+            console.log(f"[gray]Error processing content for {content['link']}: {e}")
 
     text_splitter = RecursiveCharacterTextSplitter(
         chunk_size=text_chunk_size,
@@ -195,9 +218,11 @@ def format_docs(docs):
 
 def query_rag(chat_llm, question, search_query, vectorstore):
     retriever_from_llm = MultiQueryRetriever.from_llm(
-        retriever=vectorstore.as_retriever(), llm=chat_llm,
+        retriever=vectorstore.as_retriever(), llm=chat_llm, include_original=True,
+    )
+    unique_docs = retriever_from_llm.get_relevant_documents(
+        query=search_query, callbacks=callbacks, verbose=True
     )
-    unique_docs = retriever_from_llm.get_relevant_documents(query=search_query, config={"callbacks": callbacks})
     context = format_docs(unique_docs)
     prompt = get_rag_prompt_template().format(query=question, context=context)
     response = chat_llm.invoke(prompt, config={"callbacks": callbacks})
@@ -249,7 +274,7 @@ if __name__ == '__main__':
     contents = get_links_contents(sources)
     console.log(f"Managed to extract content from {len(contents)} sources")
 
-    with console.status(f"[bold green]Embeddubg {len(…
+    with console.status(f"[bold green]Embedding {len(contents)} sources for content", spinner="growVertical"):
        vector_store = vectorize(contents)
 
    with console.status("[bold green]Querying LLM relevant context", spinner='dots8Bit'):
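
As a standalone illustration of the new PDF path in `process_source`, the sketch below downloads a PDF and extracts its text with pdfplumber. It is not the project's code: the URL is a placeholder, the `requests` package is assumed to be available, and `page.extract_text()` is guarded with `or ""` because it can return `None` for pages without extractable text, which the loop in the diff above would concatenate directly.

```
# Hypothetical, standalone sketch of PDF text extraction similar to process_source().
import io

import pdfplumber
import requests

def pdf_url_to_text(url: str, timeout: int = 8) -> str:
    response = requests.get(url, timeout=timeout)
    if response.headers.get('Content-Type') != 'application/pdf':
        raise ValueError("Not a PDF response")
    # Wrap the raw bytes in a file-like object for pdfplumber.
    with pdfplumber.open(io.BytesIO(response.content)) as pdf:
        # extract_text() may return None on image-only pages, hence the `or ""` guard.
        return "\n".join((page.extract_text() or "") for page in pdf.pages)

if __name__ == "__main__":
    print(pdf_url_to_text("https://example.com/sample.pdf")[:500])
```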