---
title: Tspec RAG
emoji: πŸ‘
colorFrom: pink
colorTo: indigo
sdk: gradio
sdk_version: 5.28.0
app_file: app.py
pinned: false
license: cc-by-nc-sa-4.0
short_description: Chat with 3GPP documents using Tspec-LLM dataset
---

# 3GPP TSpec RAG Assistant

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/rasoul-nikbakht/Tspec-RAG)

## Overview

This application provides a Retrieval-Augmented Generation (RAG) interface for querying specific 3GPP technical specification documents. It leverages a base set of pre-processed documents and allows users to dynamically load and query additional specification files from the `rasoul-nikbakht/TSpec-LLM` dataset hosted on Hugging Face Hub.

The system uses an OpenAI chat model (`gpt-4o-mini`) for understanding and generation, and an OpenAI embedding model (`text-embedding-3-small`) combined with FAISS for efficient retrieval from the document content.
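
A minimal sketch of that retrieval pipeline (illustrative only; the index name, query, and prompt wiring are placeholders and the actual `app.py` may differ):

```python
# Sketch: load a saved FAISS index, retrieve relevant chunks, and ask the LLM.
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
store = FAISS.load_local(
    "cached_embeddings", embeddings,
    index_name="base_knowledge",
    allow_dangerous_deserialization=True,  # needed to load the pickled docstore
)
retriever = store.as_retriever(search_kwargs={"k": 4})
llm = ChatOpenAI(model="gpt-4o-mini")

docs = retriever.invoke("How does TS 23.501 define network slicing?")
context = "\n\n".join(d.page_content for d in docs)
reply = llm.invoke(f"Answer using only this context:\n{context}\n\nQuestion: ...")
print(reply.content)
```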

## Features

*   **Base Knowledge:** Includes a pre-indexed set of core 3GPP specification documents for immediate querying.
*   **Dynamic File Loading:** Users can specify up to 3 additional specification files (by their relative path within the dataset) per session for querying.
*   **Embedding Caching:** Embeddings for dynamically loaded files are cached locally (`cached_embeddings/`) to speed up subsequent sessions and reduce API costs. A manifest (`cache_manifest.json`) tracks cached files (see the lookup sketch after this list).
*   **User Interaction Logging:** Logs user emails and the files they process in `user_data.json` (for usage tracking).
*   **Gradio Interface:** Provides an interactive web UI built with Gradio, featuring separate chat and configuration/information panels.
*   **Cost Estimation:** Provides a rough estimate of the OpenAI API cost for processing *new* dynamic files.
*   **Hugging Face Spaces Ready:** Designed for easy deployment on Hugging Face Spaces, utilizing environment variables/secrets for API keys.
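
How a lookup against `cache_manifest.json` could work (a hypothetical helper; the real manifest schema in `app.py` may differ):

```python
# Sketch: map a dataset-relative path to a cached FAISS index, if one exists.
import json
import os
from typing import Optional

CACHE_DIR = "cached_embeddings"
MANIFEST_PATH = "cache_manifest.json"

def cached_index_name(dataset_path: str) -> Optional[str]:
    """Return the cached index name for a dataset file, or None if uncached."""
    if not os.path.exists(MANIFEST_PATH):
        return None
    with open(MANIFEST_PATH) as f:
        manifest = json.load(f)
    name = manifest.get(dataset_path)
    if name and os.path.exists(os.path.join(CACHE_DIR, f"{name}.faiss")):
        return name
    return None
```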

## Technology Stack

*   **Backend:** Python 3
*   **UI:** Gradio
*   **RAG & LLM Orchestration:** LangChain
*   **LLM & Embeddings:** OpenAI API (`gpt-4o-mini`, `text-embedding-3-small`)
*   **Vector Store:** FAISS (Facebook AI Similarity Search)
*   **Dataset Access:** Hugging Face Hub (`huggingface_hub`)
*   **Dependencies:** See `requirements.txt`

## Setup

### Prerequisites

*   Python 3.8+
*   Git (Git LFS is not required if you follow the `.gitignore` setup, since the large generated index files are never committed)
*   An OpenAI API Key
*   A Hugging Face Account and an API Token (`HF_TOKEN`) with access granted to the `rasoul-nikbakht/TSpec-LLM` dataset. (You need to accept the dataset's license terms on the Hugging Face website while logged in).

### Installation

1.  **Clone the repository:**
    ```bash
    git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git # Or your HF Space repo URL
    cd YOUR_REPO_NAME
    ```

2.  **Create a virtual environment (Recommended):**
    ```bash
    python -m venv .venv
    source .venv/bin/activate # On Windows use `.venv\Scripts\activate`
    ```

3.  **Install dependencies:**
    ```bash
    pip install -r requirements.txt
    ```

### Configuration

1.  **API Keys:** You need to provide your Hugging Face token and OpenAI API key.
    *   **Local Development:** Create a `.env` file in the root directory of the project:
        ```dotenv
        # Required for downloading the dataset
        HF_TOKEN=hf_YOUR_HUGGING_FACE_TOKEN

        # Required for embedding generation (base index + dynamic files)
        # This key will be used for the initial base knowledge processing
        OPENAI_API_KEY=sk_YOUR_OPENAI_API_KEY
        ```
        **Important:** Ensure `.env` is listed in your `.gitignore` file to avoid committing secrets.
    *   **Hugging Face Spaces Deployment:** Do **NOT** upload the `.env` file. Instead, set `HF_TOKEN` and `OPENAI_API_KEY` as **Secrets** in your Space settings. The application will automatically use these secrets (see the loading sketch after this list).

2.  **Dataset Access:** Ensure the Hugging Face account associated with your `HF_TOKEN` has accepted the license for `rasoul-nikbakht/TSpec-LLM` on the Hugging Face Hub website.
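
Because Space Secrets are exposed as environment variables, one loading pattern covers both environments (a minimal sketch; `app.py` may structure this differently):

```python
# Sketch: read keys from .env locally, or from Space Secrets when deployed.
import os
from dotenv import load_dotenv

load_dotenv()  # no-op if .env is absent (e.g., on Spaces)

HF_TOKEN = os.getenv("HF_TOKEN")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not HF_TOKEN or not OPENAI_API_KEY:
    raise RuntimeError("Set HF_TOKEN and OPENAI_API_KEY via .env or Space Secrets.")
```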

## Running Locally

1.  **Activate your virtual environment (if using one):**
    ```bash
    source .venv/bin/activate
    ```

2.  **Run the Gradio app:**
    ```bash
    python app.py
    ```

3.  **Initial Run:** The *very first time* you run the application locally, it will:
    *   Download the fixed base knowledge files specified in the script.
    *   Process these files (chunking and embedding using the `OPENAI_API_KEY` from your `.env` file).
    *   Create and save the `base_knowledge.faiss` and `base_knowledge.pkl` files in the `cached_embeddings/` directory.
    *   This initial pre-processing step might take a few minutes and requires the API keys to be correctly configured. Subsequent runs load the existing base index much faster (see the sketch after this list).

4.  **Access the UI:** Open the local URL provided in your terminal (usually `http://127.0.0.1:7860`).
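
The load-or-build behaviour described in step 3 could look like this (a sketch under assumed names; the source file path and chunking parameters are illustrative):

```python
# Sketch: reuse the saved base index if present, otherwise build and save it.
import os
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
index_dir, index_name = "cached_embeddings", "base_knowledge"

if os.path.exists(os.path.join(index_dir, f"{index_name}.faiss")):
    store = FAISS.load_local(index_dir, embeddings, index_name=index_name,
                             allow_dangerous_deserialization=True)
else:
    with open("hf_cache/example_spec.md") as f:  # hypothetical downloaded file
        text = f.read()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    store = FAISS.from_texts(splitter.split_text(text), embeddings)
    store.save_local(index_dir, index_name=index_name)
```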

## Deployment (Hugging Face Spaces)

1.  **Create a Hugging Face Space:** Choose the Gradio SDK.
2.  **Upload Files:** Upload the following files to your Space repository:
    *   `app.py`
    *   `requirements.txt`
    *   `.gitignore` (ensure it includes `cached_embeddings/`, `hf_cache/`, and `.env`; a minimal example follows this list)
3.  **Configure Secrets:** In your Space settings, go to the "Secrets" section and add:
    *   `HF_TOKEN`: Your Hugging Face API token.
    *   `OPENAI_API_KEY`: Your OpenAI API key.
4.  **Build & Run:** The Space will automatically build the environment using `requirements.txt` and run `app.py`. Similar to the local first run, the deployed Space will perform the initial base knowledge processing using the secrets you provided. This happens on the Space's persistent or ephemeral storage.
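
A minimal `.gitignore` covering the entries above:

```gitignore
# Secrets and bulky generated artifacts stay out of the repository
.env
cached_embeddings/
hf_cache/
```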

## Usage

1.  **Enter Email:** Provide your email address (used for logging interaction history).
2.  **Enter OpenAI API Key (Optional but Recommended):** Provide your OpenAI key if you plan to query *new* dynamic files not already cached. If you only query the base knowledge or already cached files, this *might* not be strictly necessary if a base key was configured, but providing it is safer.
3.  **Specify Dynamic Files (Optional):** Enter the relative paths (e.g., `Rel-17/23_series/23501-h50.md`) of up to 3 specification documents you want to query *in addition* to the base knowledge. Separate multiple paths with commas. Check the "Cached Dynamic Files" accordion to see which files might already be processed (a sketch of how base and dynamic indexes combine follows this list).
4.  **Ask Question:** Enter your question related to the content of the base knowledge files and any dynamic files you specified.
5.  **Interact:** View the chatbot's response. The right-hand panel provides status updates (e.g., if a new file was cached) and information about the system.
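
One way base and dynamic indexes can be combined for retrieval (a hypothetical sketch; `23501-h50` is an illustrative cached index name):

```python
# Sketch: fold per-session dynamic indexes into the base index for querying.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
combined = FAISS.load_local("cached_embeddings", embeddings,
                            index_name="base_knowledge",
                            allow_dangerous_deserialization=True)
for name in ["23501-h50"]:  # cached index names for this session's files
    dynamic = FAISS.load_local("cached_embeddings", embeddings, index_name=name,
                               allow_dangerous_deserialization=True)
    combined.merge_from(dynamic)  # in-place merge of vectors and docstore

retriever = combined.as_retriever(search_kwargs={"k": 4})
```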

## Directory Structure

```
.
β”œβ”€β”€ app.py                 # Main Gradio application script
β”œβ”€β”€ requirements.txt       # Python dependencies
β”œβ”€β”€ .env                   # Local environment variables (API keys - DO NOT COMMIT)
β”œβ”€β”€ .gitignore             # Specifies intentionally untracked files by Git
β”œβ”€β”€ cached_embeddings/     # Stores generated FAISS index files (.faiss, .pkl) - GITIGNORED
β”‚   β”œβ”€β”€ base_knowledge.faiss
β”‚   β”œβ”€β”€ base_knowledge.pkl
β”‚   └── ... (dynamically cached files)
β”œβ”€β”€ user_data.json         # Logs user emails and processed files
β”œβ”€β”€ cache_manifest.json    # Maps dataset file paths to local cached FAISS files
β”œβ”€β”€ hf_cache/              # Stores downloaded dataset files - GITIGNORED
└── README.md              # This file
```

## Important Notes & Disclaimers

*   **Research Preview:** This tool is intended for demonstration and research purposes. The accuracy of the generated responses is not guaranteed. Always consult the original specifications for authoritative information.
*   **Dataset License:** Your use of this application is subject to the terms and license agreement of the underlying dataset (`rasoul-nikbakht/TSpec-LLM`). Please review these terms on the dataset's Hugging Face page.
*   **API Costs:** Processing *new* (not yet cached) dynamic files incurs OpenAI Embedding API costs (estimated at ~$0.02 per file for `text-embedding-3-small`). Querying the base knowledge or already cached files incurs no additional embedding costs within this application, but LLM inference still costs money on every query (a back-of-envelope estimate follows this list).
*   **API Key Security:** Your OpenAI API key, when provided in the UI, is used directly for embedding generation during your session. For local use, it's read from `.env`. For deployment, it should be stored securely as a Hugging Face Space Secret.
*   **Data Logging:** The application logs the email address you provide and the dynamic files processed during your session in the `user_data.json` file for usage analysis.
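
A back-of-envelope way to estimate the embedding cost of a new file before processing it, assuming `text-embedding-3-small`'s published rate of $0.02 per 1M tokens (verify current pricing):

```python
# Sketch: estimate embedding cost from the token count of a file's text.
import tiktoken

def estimate_embedding_cost(text: str, usd_per_million_tokens: float = 0.02) -> float:
    enc = tiktoken.get_encoding("cl100k_base")  # encoding used by OpenAI embedding models
    return len(enc.encode(text)) / 1_000_000 * usd_per_million_tokens

# e.g. a file of roughly one million tokens would cost about $0.02 to embed
```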

## License

The code in this repository is provided under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. However, usage is strictly governed by the license terms of the `rasoul-nikbakht/TSpec-LLM` dataset.

## Acknowledgements

*   Thanks to Rasoul Nikbakht for creating and sharing the `TSpec-LLM` dataset.
*   Built using Gradio, LangChain, OpenAI, FAISS, and the Hugging Face ecosystem.