CyranoB committed
Commit 615fe60 · 1 Parent(s): 8d1e83e

Added support for PDF files.


Generated README using Claude 3

Files changed (4)
  1. README.md +33 -8
  2. messages.py +0 -1
  3. requirements.txt +1 -0
  4. search_agent.py +41 -16
README.md CHANGED
@@ -1,15 +1,33 @@
 # Simple Search Agent
 
-This is a simple search agent that (kind of) does what [Perplexity AI](https://www.perplexity.ai/) does.
-
-## How It Works
-
-1. The user asks the agent a question.
-2. The agent performs a web search using the question as the query.
-3. The agent extracts the most relevant snippets and information from the top search results.
-4. The extracted web results are passed as context to a large language model.
-5. The LLM uses the web search context to generate a final answer to the original question.
-6. The agent returns the generated answer to the user.
+This Python project provides a search agent that can perform web searches, optimize search queries, fetch and process web content, and generate responses using a language model and the retrieved information.
+It does a bit of what [Perplexity AI](https://www.perplexity.ai/) does.
+
+
+This Python script is a search agent that uses the LangChain library to perform optimized web searches, retrieve relevant content, and generate informative answers to user queries. The script supports multiple language models and providers, including OpenAI, Anthropic, and Groq.
+
+The main functionality of the script can be summarized as follows:
+
+1. **Query Optimization**: The user's input query is optimized for web search by identifying the key information requested and transforming it into a concise search string using the language model's capabilities.
+2. **Web Search**: The optimized search query is used to fetch search results from the Brave Search API. The script allows limiting the search to a specific domain and setting the maximum number of pages to retrieve.
+3. **Content Extraction**: The script fetches the content of the retrieved search results, handling both HTML and PDF documents. It extracts the main text content from web pages and text from PDF files.
+4. **Vectorization**: The extracted content is split into smaller text chunks and vectorized using OpenAI's text embeddings. The vectorized data is stored in a FAISS vector store for efficient retrieval.
+5. **Query Answering**: The user's original query is answered by retrieving the most relevant text chunks from the vector store using a Multi-Query Retriever. The language model generates an informative answer by synthesizing the retrieved information, citing the sources used, and formatting the response in Markdown.
+
+The script supports various options for customization, such as specifying the language model provider (OpenAI, Anthropic, Groq, or Ollama), the temperature for language model generation, and the output format (text or Markdown).
+
+Additionally, the script integrates with the LangChain Tracing V2 feature, allowing users to monitor and analyze the execution of their LangChain applications using LangChain Studio.
+
+To run the script, users need to provide their API keys for the desired language model provider and the Brave Search API in a `.env` file. The script can be executed from the command line, passing the desired options and the search query as arguments.
+
+## Features
+
+- Supports multiple language model providers (Bedrock, OpenAI, Groq, and Ollama)
+- Optimizes search queries using a language model
+- Fetches web pages and extracts main content (HTML and PDF)
+- Vectorizes the content for efficient retrieval
+- Queries the vectorized content using a Retrieval-Augmented Generation (RAG) approach
+- Generates Markdown-formatted responses with references to the used sources
 
 ## Setup and Installation
 
@@ -24,6 +42,13 @@
 
 ## Usage
 
+```
+python search_agent.py --query "your search query" --provider "provider_name" --model "model_name" --temperature 0.0
+```
+
+Replace `"your search query"` with your desired search query, `"provider_name"` with the language model provider (e.g., `bedrock`, `openai`, `groq`, `ollama`), `"model_name"` with the specific model name (optional), and `temperature` with the desired temperature value for the language model (optional).
+
+Example:
 ```
 ➜ python ./search_agent.py --provider groq -o text "Write a linkedin post on how Sequoia Capital AI Ascent 2024 is interesting"
 [21:44:05] Using mixtral-8x7b-32768 on groq with temperature 0.0 search_agent.py:78
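The README's Web Search step relies on the Brave Search API, but the client code for it is not part of this diff. As a rough sketch only: a request against Brave's web-search endpoint with the key read from the `.env` file might look like the following (the helper name, the `BRAVE_SEARCH_API_KEY` variable, the use of `requests`, and the result fields are assumptions, not taken from search_agent.py):

```
import os
import requests

def brave_search(query: str, count: int = 10):
    """Hypothetical helper: fetch web results from the Brave Search API."""
    # Assumes the key is stored in BRAVE_SEARCH_API_KEY in the .env file;
    # the real script may use a different name and extra parameters
    # (domain restriction, paging) as described in the README.
    response = requests.get(
        "https://api.search.brave.com/res/v1/web/search",
        params={"q": query, "count": count},
        headers={
            "Accept": "application/json",
            "X-Subscription-Token": os.environ["BRAVE_SEARCH_API_KEY"],
        },
        timeout=10,
    )
    response.raise_for_status()
    results = response.json().get("web", {}).get("results", [])
    # Normalize to a title/link/snippet shape like the one the rest of the
    # script works with (field names are assumptions, not taken from the diff).
    return [
        {"title": r.get("title"), "link": r.get("url"), "snippet": r.get("description")}
        for r in results
    ]
```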
messages.py CHANGED
@@ -38,7 +38,6 @@ def get_optimized_search_messages(query):
 - Remove lenght instruction (example: essay, article, letter, etc)
 
 Add "**" to the end of the search string to indicate the end of the query
-Provide your output in this format: optimized search string**
 
 Example:
 Question: How do I bake chocolate chip cookies from scratch?
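The prompt above still asks the model to terminate the optimized search string with `**`; the commit only drops the redundant output-format instruction. A minimal, hypothetical sketch of how a caller could strip that marker from the model's reply (the actual parsing in search_agent.py is not shown in this diff):

```
def extract_optimized_query(llm_output: str) -> str:
    # Keep everything before the "**" terminator the prompt asks for,
    # falling back to the raw output if the marker is missing.
    # Hypothetical helper; search_agent.py may parse the reply differently.
    return llm_output.split("**", 1)[0].strip()

# Example: 'chocolate chip cookies recipe from scratch**'
# -> 'chocolate chip cookies recipe from scratch'
```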
requirements.txt CHANGED
@@ -2,6 +2,7 @@ boto3
 bs4
 docopt
 faiss-cpu
+pdfplumber
 python-dotenv
 langchain
 langchain_core
search_agent.py CHANGED
@@ -25,12 +25,14 @@ Options:
 
 import json
 import os
+import io
 from concurrent.futures import ThreadPoolExecutor
 from urllib.parse import quote
 
 from bs4 import BeautifulSoup
 from docopt import docopt
 import dotenv
+import pdfplumber
 
 from langchain_core.documents.base import Document
 from langchain.text_splitter import RecursiveCharacterTextSplitter
@@ -77,7 +79,7 @@ def get_chat_llm(provider, model=None, temperature=0.0):
             chat_llm = ChatOllama(model=model, temperature=temperature)
         case _:
             raise ValueError(f"Unknown LLM provider {provider}")
-
+
     console.log(f"Using {model} on {provider} with temperature {temperature}")
     return chat_llm
 
@@ -140,17 +142,35 @@ def extract_main_content(html):
         soup = BeautifulSoup(html, 'html.parser')
         for element in soup(["script", "style", "head", "nav", "footer", "iframe", "img"]):
             element.extract()
-        main_content = ' '.join(soup.body.get_text().split())
+        main_content = soup.get_text(separator='\n', strip=True)
         return main_content
     except Exception:
         return None
 
 def process_source(source):
     response = fetch_with_timeout(source['link'], 8)
+    console.log(f"Processing {source['link']}")
     if response:
-        html = response.text
-        main_content = extract_main_content(html)
-        return {**source, 'html': main_content}
+        content_type = response.headers.get('Content-Type')
+        if content_type == 'application/pdf':
+            # The response is a PDF file
+            pdf_content = response.content
+            # Create a file-like object from the bytes
+            pdf_file = io.BytesIO(pdf_content)
+            # Extract text from PDF using pdfplumber
+            with pdfplumber.open(pdf_file) as pdf:
+                text = ""
+                for page in pdf.pages:
+                    text += page.extract_text()
+            return {**source, 'pdf_content': text}
+        elif content_type.startswith('text/html'):
+            # The response is an HTML file
+            html = response.text
+            main_content = extract_main_content(html)
+            return {**source, 'html': main_content}
+        else:
+            console.log(f"Skipping {source['link']}! Unsupported content type: {content_type}")
+            return None
     return None
 
 def get_links_contents(sources):
@@ -163,14 +183,17 @@ def get_links_contents(sources):
 def vectorize(contents, text_chunk_size=1000, text_chunk_overlap=200):
     documents = []
     for content in contents:
-        if content['html']:
-            try:
-                page_content = content['html']
-                metadata = {'title': content['title'], 'source': content['link']}
-                doc = Document(page_content=page_content, metadata=metadata)
-                documents.append(doc)
-            except Exception as e:
-                console.log(f"[gray]Error processing content for {content['link']}: {e}")
+        page_content = content['snippet']
+        if 'html' in content:
+            page_content = content['html']
+        if 'pdf_content' in content:
+            page_content = content['pdf_content']
+        try:
+            metadata = {'title': content['title'], 'source': content['link']}
+            doc = Document(page_content=page_content, metadata=metadata)
+            documents.append(doc)
+        except Exception as e:
+            console.log(f"[gray]Error processing content for {content['link']}: {e}")
 
     text_splitter = RecursiveCharacterTextSplitter(
         chunk_size=text_chunk_size,
@@ -195,9 +218,11 @@ def format_docs(docs):
 
 def query_rag(chat_llm, question, search_query, vectorstore):
     retriever_from_llm = MultiQueryRetriever.from_llm(
-        retriever=vectorstore.as_retriever(), llm=chat_llm,
+        retriever=vectorstore.as_retriever(), llm=chat_llm, include_original=True,
+    )
+    unique_docs = retriever_from_llm.get_relevant_documents(
+        query=search_query, callbacks=callbacks, verbose=True
     )
-    unique_docs = retriever_from_llm.get_relevant_documents(query=search_query, config={"callbacks": callbacks})
     context = format_docs(unique_docs)
     prompt = get_rag_prompt_template().format(query=question, context=context)
     response = chat_llm.invoke(prompt, config={"callbacks": callbacks})
@@ -249,7 +274,7 @@ if __name__ == '__main__':
     contents = get_links_contents(sources)
     console.log(f"Managed to extract content from {len(contents)} sources")
 
-    with console.status(f"[bold green]Embeddubg {len(sources)} sources", spinner="growVertical"):
+    with console.status(f"[bold green]Embedding {len(contents)} sources for content", spinner="growVertical"):
        vector_store = vectorize(contents)
 
    with console.status("[bold green]Querying LLM relevant context", spinner='dots8Bit'):
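The `vectorize` hunk above stops at the text splitter; per the README, the chunks are then embedded with OpenAI text embeddings and stored in a FAISS index. A sketch of that final step, assuming the `langchain_community` FAISS wrapper and `langchain_openai` embeddings (the function name and exact imports are assumptions, since that part of search_agent.py is outside this diff):

```
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

def build_vector_store(documents, chunk_size=1000, chunk_overlap=200):
    # Split the collected Documents into chunks and embed them into a FAISS store.
    # Assumes OPENAI_API_KEY is set in the environment (.env), matching the
    # README's description; the real vectorize() may differ in detail.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    chunks = splitter.split_documents(documents)
    return FAISS.from_documents(chunks, OpenAIEmbeddings())
```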