---
title: SemanticSearchPOC
emoji: π
colorFrom: red
colorTo: indigo
sdk: docker
app_port: 8501
pinned: true
startup_duration_timeout: 3 hours
---
POC for Retrieval Augmented Generation with Large Language Models
I created this Proof-of-Concept project to learn how to implement Retrieval Augmented Generation (RAG) when prompting Large Language Models (LLMs). I plan to use this POC as a starting point for future LLM-based applications that leverage RAG techniques.
The "happy path" of the code seems to work fairly well. As noted later, there is more work to be done to improve it.
If you encounter issues running the POC, please try reloading the web page. Also note that I've currently configured the Space with the 2-vCPU hardware, so re-initialization takes five or more minutes and inferencing takes two to three minutes to complete. I suspect this also depends on the total load on the Hugging Face system. Adding GPU support is at the top of the list of future improvements.
Components
Here are the key components of the project (a minimal sketch of calling the LLM from Python follows the list):
- llama.cpp: An optimized C++ implementation of the LLaMA language model.
- Weaviate Vector Database: A vector database for efficient storage and retrieval of embeddings.
- text2vec-transformers: A library for converting text to vector representations using transformer models.
- Streamlit: A framework for building interactive web applications with Python.
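For illustration, here is a hedged sketch of how the LLM component might be called from Python via the llama-cpp-python bindings. The model file name, thread count, and prompt are placeholder assumptions, not the POC's actual settings; the POC may instead invoke the llama.cpp binary or server directly.

```python
# Hypothetical example: running a llama-2 chat completion through llama-cpp-python.
# The model path and parameters below are placeholders, not the POC's real values.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-2-7b-chat.Q4_K_M.gguf",  # assumed GGUF model file
    n_ctx=2048,    # context window size
    n_threads=2,   # roughly matches the 2-vCPU Space configuration
)

# llama-2 chat prompts wrap the system prompt in <<SYS>> and the turn in [INST] ... [/INST].
prompt = "[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\nWhat is RAG? [/INST]"
output = llm(prompt, max_tokens=256, temperature=0.7)
print(output["choices"][0]["text"])
```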
Screenshot
Application Notes
As part of the initialization process, the Python application executes a Bash script asynchronously (a rough sketch of this launch-and-wait pattern follows the list). The script carries out these steps:
- Start the text2vec-transformers Weaviate module as an asynchronous process. The Weaviate DB uses this module to vectorize text.
- Start the Weaviate database server asynchronously as well.
- Wait so that the subprocesses keep running and are ready to accept requests.
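As a rough sketch, the Python side of this pattern could look like the following. The script name, port, and timeout are assumptions for illustration; only the Weaviate readiness endpoint (/v1/.well-known/ready) is a documented Weaviate URL.

```python
# Sketch (not the POC's actual code): launch the startup script without blocking,
# then poll Weaviate's readiness endpoint before accepting requests.
import subprocess
import time
import urllib.request

proc = subprocess.Popen(["bash", "start_services.sh"])  # hypothetical script name

def wait_for(url: str, timeout_s: int = 300) -> None:
    """Poll a readiness URL until it responds or the timeout expires."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            urllib.request.urlopen(url, timeout=5)
            return
        except OSError:
            time.sleep(5)
    raise TimeoutError(f"{url} not ready after {timeout_s}s")

# Weaviate exposes a readiness check at /v1/.well-known/ready (default port 8080).
wait_for("http://localhost:8080/v1/.well-known/ready")
```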
Also, the vector database is loaded with only two Weaviate schemas/collections, based on two documents in the inputDocs folder. These are main-topic HTML pages from Wikipedia: one covers artificial intelligence and the other Norwegian literature. More and different web pages can be added later.
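For context, a collection backed by the text2vec-transformers vectorizer might be created and populated roughly as shown below (Weaviate Python client v4 syntax). The collection name, property names, and chunk text are assumptions for the example, not the POC's actual schema.

```python
# Hedged sketch: create one collection per input document and insert its text chunks.
import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()  # assumes Weaviate on localhost:8080

collection = client.collections.create(
    name="ArtificialIntelligence",  # hypothetical collection for one Wikipedia page
    vectorizer_config=Configure.Vectorizer.text2vec_transformers(),
)

# Chunks would come from parsing the HTML page in inputDocs; these are placeholders.
chunks = ["Artificial intelligence is ...", "Machine learning is a subfield of AI ..."]
with collection.batch.dynamic() as batch:
    for i, chunk in enumerate(chunks):
        batch.add_object(properties={"chunk": chunk, "chunk_index": i})

client.close()
```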
Usage
To use the application, follow these steps:
- Type in an optional system prompt and a user prompt in the corresponding input text boxes.
- Click the "Run LLM Prompt" button to call the llama-2 LLM with the prompt.
- The application then displays the completion and the full prompt it created using the llama-2 JSON format for prompts.
- If the "Enable RAG" check box is checked, the user prompt is augmented with RAG information retrieved from the vector DB (see the sketch after this list).
- Click the "Get All Rag Data" button to view all the information about the two documents in the database including chunks.
Future Improvements
The following areas have been identified for future improvements:
- Run the POC with a GPU.
- Do more testing of the RAG support. It appears to work at a basic level, but it is not yet clear whether it adds useful information for inferencing.
- Also to this end, add web pages with details on a topic that the LLM wasn't trained on, and compare prompts with and without RAG.
- Experiment with different database query settings, such as the distance parameter on the collection query.near_vector() call (see the sketch below).
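As a starting point for that experiment, a near_vector query with an explicit distance threshold might look like the sketch below (client v4 syntax). The vector values, threshold, and collection name are placeholders; in practice the query vector must match the dimensionality of the collection's vectorizer.

```python
# Hedged example: tighten or loosen `distance` to control how much context RAG injects.
import weaviate

client = weaviate.connect_to_local()
collection = client.collections.get("ArtificialIntelligence")  # hypothetical name

query_vector = [0.12, -0.03, 0.57]  # placeholder; really the embedding of the user prompt

results = collection.query.near_vector(
    near_vector=query_vector,
    distance=0.7,   # only return objects closer than this distance
    limit=5,
)
for obj in results.objects:
    print(obj.properties)

client.close()
```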