---
title: SemanticSearchPOC
emoji: π
colorFrom: red
colorTo: indigo
sdk: docker
app_port: 8501
pinned: true
startup_duration_timeout: 3 hours
---
# POC for Retrieval Augmented Generation with Large Language Models

I created this Proof-of-Concept project to learn how to implement Retrieval Augmented Generation (RAG) when prompting Large Language Models (LLMs). I plan to use this POC as a starting point for future LLM-based applications that leverage RAG techniques.
The "happy path" of the code seems to work fairly well. As noted later, there is more work to be done to improve it.
If you encounter issues running the POC, please try reloading the web page. Also note that the Space is currently configured to use the 2-vCPU setting. Depending on load, re-initialization can take five minutes or more, and inference can take three minutes or more to complete. Adding GPU support is at the top of the list of future improvements.
## Components
Here are the key components of the project:
- llama.cpp: An optimized C/C++ inference implementation for LLaMA-family language models.
- Weaviate Vector Database: A vector database for efficient storage and retrieval of embeddings.
- text2vec-transformers: A Weaviate module that converts text to vector representations using transformer models.
- Streamlit: A framework for building interactive web applications with Python.
## Screenshot
## Application Notes
As part of the initialization process, the Python application executes a Bash script asynchronously (a minimal sketch of this launch step follows the list below). The script carries out these steps:
- Start the text2vec-transformers Weaviate module to run as an asynchronous process.
- Start the Weaviate database server to run asynchronously as well.
- Wait so that the subprocesses keep running and have time to become ready to accept requests.
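
The sketch below illustrates how the Streamlit application might launch such a startup script without blocking; the script name (`start_services.sh`) and the readiness check against Weaviate's standard port are assumptions for the example, not the exact code in this repo.

```python
import subprocess
import time
import urllib.request


def launch_backend_services() -> subprocess.Popen:
    """Launch the Bash startup script without blocking the Streamlit app.

    Assumes a script named start_services.sh (hypothetical) that starts the
    text2vec-transformers module and the Weaviate server in the background.
    """
    # Popen returns immediately; the script keeps running as a child process.
    proc = subprocess.Popen(["bash", "start_services.sh"])

    # Poll Weaviate's readiness endpoint until the database accepts requests.
    for _ in range(60):
        try:
            with urllib.request.urlopen("http://localhost:8080/v1/.well-known/ready") as resp:
                if resp.status == 200:
                    break
        except OSError:
            time.sleep(5)  # not ready yet; wait and retry
    return proc
```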
Also, the vector database is loaded with only two Weaviate schemas/collections, based on two documents in the inputDocs folder. These are main-topic HTML pages from Wikipedia: one page covers artificial intelligence and the other Norwegian literature. More and different web pages can be added later. A sketch of how such pages could be chunked and loaded appears below.
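
For illustration only, here is a minimal sketch of loading chunked page text into a Weaviate collection that uses the text2vec-transformers vectorizer. It assumes the v4 Weaviate Python client; the collection name, property names, chunk size, and file path are hypothetical rather than taken from the actual code.

```python
import weaviate
from weaviate.classes.config import Configure, DataType, Property

# Connect to the locally running Weaviate server started by the Bash script.
client = weaviate.connect_to_local()

# Create a collection whose text is vectorized by the text2vec-transformers module.
collection = client.collections.create(
    name="WikipediaAI",  # hypothetical collection name
    vectorizer_config=Configure.Vectorizer.text2vec_transformers(),
    properties=[
        Property(name="text", data_type=DataType.TEXT),
        Property(name="source", data_type=DataType.TEXT),
    ],
)


def load_document(page_text: str, source: str, chunk_size: int = 500) -> None:
    """Split a page into fixed-size character chunks and insert them."""
    chunks = [page_text[i:i + chunk_size] for i in range(0, len(page_text), chunk_size)]
    with collection.batch.dynamic() as batch:
        for chunk in chunks:
            batch.add_object(properties={"text": chunk, "source": source})


# Example usage (illustrative path; real code would first extract plain text
# from the HTML before chunking it).
with open("inputDocs/Artificial_intelligence.html", encoding="utf-8") as f:
    load_document(f.read(), source="Artificial_intelligence")

client.close()
```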
## Usage
To use the application, follow these steps:
- Type an optional system prompt and a user prompt into the corresponding text input boxes.
- Click the "Run LLM Prompt" button to call the llama-2 LLM with the prompt.
- The application displays the completion along with the full prompt it built using the llama-2 JSON format for prompts.
- If the "Enable RAG" check box is selected, the user prompt is augmented with RAG information retrieved from the Vector DB before generation (a sketch of this prompt construction follows the list).
- Click the "Get All Rag Data" button to view all of the stored information for the two documents, including their chunks.
## Future Improvements
The following areas have been identified for future improvements:
- Run the POC with a GPU.
- Include RAG documents/web pages containing distinctive information that is unlikely to be in the LLM's training data. This should make it clearer when RAG information is being used.
- Do more testing of the RAG support.
- Experiment with different database query settings, such as the distance parameter on the collection's query.near_vector() call (see the sketch after this list).
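
As a starting point for that experimentation, here is a minimal sketch of a query with an explicit distance threshold, assuming the v4 Weaviate Python client. It uses near_text, which vectorizes the query server-side via the text2vec-transformers module; the same distance parameter applies to near_vector, which takes a precomputed embedding instead. The collection name and threshold value are illustrative.

```python
import weaviate

client = weaviate.connect_to_local()
collection = client.collections.get("WikipediaAI")  # hypothetical collection name

# Smaller distance values are stricter: only chunks whose vectors are close
# enough to the query vector are returned.
response = collection.query.near_text(
    query="What is machine learning?",
    distance=0.6,
    limit=3,
)

for obj in response.objects:
    print(obj.properties["text"][:80])

client.close()
```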