---
title: SemanticSearchPOC
emoji: π
colorFrom: red
colorTo: indigo
sdk: docker
app_port: 8501
pinned: true
startup_duration_timeout: 3 hours
hardware: gpu
---
# POC for Retrieval Augmented Generation with Large Language Models
I created this Proof-of-Concept (POC) project to learn how to implement Retrieval Augmented Generation (RAG) when prompting Large Language Models (LLMs). I plan to use this POC as a starting point for future LLM-based applications that take advantage of RAG techniques.
The "happy path" of the code seems to work well. As noted later, there is more work to be done to improve it.
If you encounter issues running the POC, please try reloading the web page. Also note that I have currently configured the Space to use the 2-vCPU setting. Depending on load, re-initialization takes five or more minutes, and inference can take three minutes or more to complete. Adding GPU support is at the top of the list of future improvements.
## Components
Here are the key components of the project:
- llama.cpp: An optimized C/C++ implementation for running Llama-family models.
- Weaviate Vector Database: A vector database for efficient storage and retrieval of information encoded in vector embeddings.
- text2vec-transformers: A library for converting text to vector representations using transformer models.
- Streamlit: A framework for building interactive web applications with Python.
## Screenshot

## Application Notes
As part of the initialization process, the Python application executes a Bash script asynchronously. The script carries out these steps (a launch sketch follows the list):
- Start the text2vec-transformers Weaviate module to run as an asynchronous process.
- Start the Weaviate database server asynchronously as well, connecting it to the text2vec-transformers module.
- Wait so that the subprocesses keep running and are ready to accept requests.
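
Here is a minimal Python sketch of how the application could launch that script and wait for the services. The script name `start_services.sh` and Weaviate's default port 8080 are assumptions, not necessarily what the POC uses:

```python
import subprocess
import time
import urllib.request

def start_backend_services() -> subprocess.Popen:
    """Launch the Bash script that starts text2vec-transformers and Weaviate."""
    # Assumed script name; the POC's actual script may differ.
    proc = subprocess.Popen(["bash", "start_services.sh"])  # runs asynchronously

    # Poll Weaviate's readiness endpoint until the subprocesses accept requests.
    for _ in range(60):
        try:
            with urllib.request.urlopen("http://localhost:8080/v1/.well-known/ready") as resp:
                if resp.status == 200:
                    break
        except OSError:
            time.sleep(5)
    return proc
```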
Also, the vector database is loaded with only a few Weaviate schemas/collections, built from the HTML documents in the inputDocs folder. More and different web pages can be added later.
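
The following is a rough sketch of how such a collection could be created and loaded with the weaviate-client v4 API and the text2vec-transformers vectorizer; the collection name `HtmlDocs`, the property names, and the file path are illustrative assumptions:

```python
import weaviate
from weaviate.classes.config import Configure, DataType, Property

client = weaviate.connect_to_local()
try:
    # Create a collection whose objects are vectorized by the text2vec-transformers module.
    client.collections.create(
        name="HtmlDocs",
        vectorizer_config=Configure.Vectorizer.text2vec_transformers(),
        properties=[
            Property(name="source", data_type=DataType.TEXT),
            Property(name="chunk", data_type=DataType.TEXT),
        ],
    )

    # Insert pre-chunked text extracted from an HTML file in inputDocs.
    chunks = ["First chunk of text...", "Second chunk of text..."]
    docs = client.collections.get("HtmlDocs")
    with docs.batch.dynamic() as batch:
        for chunk in chunks:
            batch.add_object({"source": "inputDocs/example.html", "chunk": chunk})
finally:
    client.close()
```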
## Usage
To use the application, follow these steps (a UI sketch follows the list):
- Type in a user prompt and an optional system prompt in the corresponding input text boxes.
- Click the "Run LLM Prompt" button to invoke the LLM with the prompt.
- The application then displays the completion and the full prompt it built using the Llama JSON prompt format.
- If the "Enable RAG" check box is clicked, the user prompt will be modified to include RAG information from the vector database for respose generation.
- Click the "Get All Rag Data" button to view all the records in the vector database including chunked information.
## How to Test RAG Support
A special HTML page in the inputDocs directory contains bogus information about a non-existent planet in Earth's solar system named QuantumLeap. A non-RAG query returns a response saying the planet does not exist. Enabling RAG for the same query causes information about the bogus planet to be included in the response, although the LLM will still say that the planet does not exist. The presence of the bogus details makes it clear that the RAG data was used in the query.
Here is an example prompt: Describe the planet in Earth's solar system named QuantumLeap.
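
One plausible way the RAG augmentation could be wired up for this test, sketched with a weaviate-client v4 `near_text` query; the collection name, property name, and prompt template are assumptions:

```python
import weaviate

def build_rag_prompt(user_prompt: str, limit: int = 3) -> str:
    """Retrieve related chunks from Weaviate and prepend them to the user prompt."""
    client = weaviate.connect_to_local()
    try:
        docs = client.collections.get("HtmlDocs")
        # text2vec-transformers vectorizes the query text on the Weaviate side.
        result = docs.query.near_text(query=user_prompt, limit=limit)
        context = "\n".join(obj.properties["chunk"] for obj in result.objects)
    finally:
        client.close()
    return (
        "Use the following context when answering.\n"
        f"{context}\n\n"
        f"Question: {user_prompt}"
    )

print(build_rag_prompt("Describe the planet in Earth's solar system named QuantumLeap."))
```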
## Future Improvements
The following areas have been identified for future improvements:
- Run the POC with a GPU.
- Add more RAG information.
- Do more testing of RAG support.
- Experiment with different database settings on queries, such as the distance parameter on the collection's query.near_vector() call (see the sketch below).
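
For the last item, a hedged example of tuning the distance parameter on a weaviate-client v4 near_vector query; the collection name and the placeholder query vector are assumptions:

```python
import weaviate

client = weaviate.connect_to_local()
try:
    docs = client.collections.get("HtmlDocs")
    # Placeholder 384-dimensional vector; in the POC the query embedding would
    # come from text2vec-transformers.
    query_vector = [0.0] * 384
    result = docs.query.near_vector(
        near_vector=query_vector,
        distance=0.6,  # only return chunks closer than this vector distance
        limit=5,
    )
    for obj in result.objects:
        print(obj.properties.get("source"), obj.properties.get("chunk"))
finally:
    client.close()
```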