short_description: Chat with 3GPP documents using Tspec-LLM dataset
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# 3GPP TSpec RAG Assistant

[](https://huggingface.co/spaces/YOUR_HF_USERNAME/YOUR_SPACE_NAME) <!-- Replace with your actual Space URL -->

## Overview

This application provides a Retrieval-Augmented Generation (RAG) interface for querying 3GPP technical specification documents. It ships with a base set of pre-processed documents and lets users dynamically load and query additional specification files from the `rasoul-nikbakht/TSpec-LLM` dataset hosted on the Hugging Face Hub.

The system uses an OpenAI language model (`gpt-4o-mini`) for understanding and generation, and an OpenAI embedding model (`text-embedding-3-small`) combined with FAISS for efficient retrieval over the document content.
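The retrieve-then-generate flow can be illustrated with a toy retriever. This is a sketch, not the app's code: plain word overlap stands in for the OpenAI embeddings and FAISS index, and the final LLM call is only indicated in a comment.

```python
# Toy sketch of the RAG flow: score chunks against the query, then hand the
# best chunks to the LLM as context. The real app uses OpenAI embeddings and
# a FAISS similarity search instead of this word-overlap scorer.

def score(query: str, chunk: str) -> float:
    """Crude relevance score: fraction of query words present in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / max(len(q_words), 1)

def retrieve(query: str, chunks: list, k: int = 2) -> list:
    """Return the k highest-scoring chunks (FAISS search stand-in)."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

chunks = [
    "The AMF handles registration management in the 5G core.",
    "Annex A lists abbreviations.",
    "The SMF manages PDU session establishment.",
]
context = retrieve("Which function handles registration management?", chunks)
# The app would now send `context` plus the question to gpt-4o-mini.
```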
## Features

* **Base Knowledge:** Includes a pre-indexed set of core 3GPP specification documents for immediate querying.
* **Dynamic File Loading:** Users can specify up to 3 additional specification files (by their relative path within the dataset) per session for querying.
* **Embedding Caching:** Embeddings for dynamically loaded files are cached locally (`cached_embeddings/`) to speed up subsequent sessions and reduce API costs. A manifest (`cache_manifest.json`) tracks cached files.
* **User Interaction Logging:** Logs user emails and the files they process in `user_data.json` for usage tracking.
* **Gradio Interface:** Provides an interactive web UI built with Gradio, with separate chat and configuration/information panels.
* **Cost Estimation:** Provides a rough estimate of the OpenAI API cost for processing *new* dynamic files.
* **Hugging Face Spaces Ready:** Designed for easy deployment on Hugging Face Spaces, using environment variables/secrets for API keys.
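One plausible shape for the manifest-based embedding cache is sketched below. The field names and hashing scheme are illustrative assumptions, not necessarily what `app.py` does; the point is mapping a dataset-relative path to stable local index filenames.

```python
import hashlib

def index_stem(dataset_path: str) -> str:
    """Stable filename stem derived from the dataset-relative path."""
    return hashlib.sha256(dataset_path.encode()).hexdigest()[:16]

def lookup(manifest: dict, dataset_path: str):
    """Return the cached FAISS index path for this file, or None."""
    entry = manifest.get(dataset_path)
    return entry["faiss"] if entry else None

def register(manifest: dict, dataset_path: str) -> dict:
    """Record a freshly built index for this file in the manifest."""
    stem = index_stem(dataset_path)
    manifest[dataset_path] = {
        "faiss": f"cached_embeddings/{stem}.faiss",
        "pkl": f"cached_embeddings/{stem}.pkl",
    }
    return manifest
```

In the app, the manifest would be persisted as `cache_manifest.json` and consulted before spending embedding calls on a file.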
## Technology Stack

* **Backend:** Python 3
* **UI:** Gradio
* **RAG & LLM Orchestration:** LangChain
* **LLM & Embeddings:** OpenAI API (`gpt-4o-mini`, `text-embedding-3-small`)
* **Vector Store:** FAISS (Facebook AI Similarity Search)
* **Dataset Access:** Hugging Face Hub (`huggingface_hub`)
* **Dependencies:** See `requirements.txt`
## Setup

### Prerequisites

* Python 3.8+
* Git (Git LFS is not needed if you follow the `.gitignore` setup, since generated indexes are never committed)
* An OpenAI API key
* A Hugging Face account and an API token (`HF_TOKEN`) with access to the `rasoul-nikbakht/TSpec-LLM` dataset (accept the dataset's license terms on the Hugging Face website while logged in)
### Installation

1. **Clone the repository:**
   ```bash
   git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git # Or your HF Space repo URL
   cd YOUR_REPO_NAME
   ```

2. **Create a virtual environment (recommended):**
   ```bash
   python -m venv .venv
   source .venv/bin/activate # On Windows use `.venv\Scripts\activate`
   ```

3. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```
### Configuration

1. **API Keys:** You need to provide your Hugging Face token and OpenAI API key.
   * **Local Development:** Create a `.env` file in the root directory of the project:
     ```dotenv
     # Required for downloading the dataset
     HF_TOKEN=hf_YOUR_HUGGING_FACE_TOKEN

     # Required for embedding generation (base index + dynamic files);
     # this key is used for the initial base-knowledge processing
     OPENAI_API_KEY=sk_YOUR_OPENAI_API_KEY
     ```
     **Important:** Ensure `.env` is listed in your `.gitignore` file to avoid committing secrets.
   * **Hugging Face Spaces Deployment:** Do **not** upload the `.env` file. Instead, set `HF_TOKEN` and `OPENAI_API_KEY` as **Secrets** in your Space settings. The application will automatically use these secrets.

2. **Dataset Access:** Ensure the Hugging Face account associated with your `HF_TOKEN` has accepted the license for `rasoul-nikbakht/TSpec-LLM` on the Hugging Face Hub website.
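A minimal sketch of how such an app typically reads these values (assumed here, not taken from `app.py`): the `python-dotenv` import is optional so the same code works on Spaces, where only Secrets are injected into the environment.

```python
import os

def load_keys():
    """Read API keys from the environment (.env locally, Secrets on Spaces)."""
    try:
        from dotenv import load_dotenv  # optional: only needed for local .env
        load_dotenv()  # does not override variables already in the environment
    except ImportError:
        pass  # on Spaces, secrets are already present as env vars
    hf_token = os.getenv("HF_TOKEN", "")
    openai_key = os.getenv("OPENAI_API_KEY", "")
    if not hf_token or not openai_key:
        raise RuntimeError(
            "Set HF_TOKEN and OPENAI_API_KEY in .env (local) or as Space secrets."
        )
    return hf_token, openai_key
```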
## Running Locally

1. **Activate your virtual environment (if using one):**
   ```bash
   source .venv/bin/activate
   ```

2. **Run the Gradio app:**
   ```bash
   python app.py
   ```

3. **Initial run:** The *very first time* you run the application locally, it will:
   * Download the fixed base knowledge files specified in the script.
   * Process these files (chunking and embedding using the `OPENAI_API_KEY` from your `.env` file).
   * Create and save the `base_knowledge.faiss` and `base_knowledge.pkl` files in the `cached_embeddings/` directory.

   This initial pre-processing step can take a few minutes and requires correctly configured API keys. Subsequent runs load the existing base index much faster.

4. **Access the UI:** Open the local URL shown in your terminal (usually `http://127.0.0.1:7860`).
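The chunking step above can be approximated as fixed-size character windows with overlap. The sizes here are illustrative; the app's actual splitter settings may differ.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list:
    """Split text into overlapping character windows for embedding."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step forward, keeping `overlap` chars shared
    return chunks

parts = chunk_text("x" * 2500, size=1000, overlap=200)
# windows start at characters 0, 800, 1600, and 2400
```

The overlap keeps sentences that straddle a window boundary retrievable from at least one chunk.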
## Deployment (Hugging Face Spaces)

1. **Create a Hugging Face Space:** Choose the Gradio SDK.
2. **Upload files:** Upload the following files to your Space repository:
   * `app.py`
   * `requirements.txt`
   * `.gitignore` (ensure it includes `cached_embeddings/`, `hf_cache/`, and `.env`)
3. **Configure secrets:** In your Space settings, go to the "Secrets" section and add:
   * `HF_TOKEN`: Your Hugging Face API token.
   * `OPENAI_API_KEY`: Your OpenAI API key.
4. **Build & run:** The Space automatically builds the environment from `requirements.txt` and runs `app.py`. As with the first local run, the deployed Space performs the initial base-knowledge processing using the secrets you provided, writing the index to the Space's persistent or ephemeral storage.
## Usage

1. **Enter email:** Provide your email address (used for logging interaction history).
2. **Enter OpenAI API key (optional but recommended):** Provide your OpenAI key if you plan to query *new* dynamic files that are not already cached. If you only query the base knowledge or already-cached files, it may not be strictly necessary (a base key is configured server-side), but providing it is safer.
3. **Specify dynamic files (optional):** Enter the relative paths (e.g., `Rel-17/23_series/23501-h50.md`) of up to 3 specification documents you want to query *in addition* to the base knowledge. Separate multiple paths with commas. Check the "Cached Dynamic Files" accordion to see which files are already processed.
4. **Ask a question:** Enter your question about the content of the base knowledge files and any dynamic files you specified.
5. **Interact:** View the chatbot's response. The right-hand panel shows status updates (e.g., whether a new file was cached) and information about the system.
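Parsing the comma-separated path field from step 3 might look like the following sketch. The helper name is hypothetical; the 3-file cap matches the session limit described above.

```python
MAX_DYNAMIC_FILES = 3

def parse_dynamic_paths(raw: str) -> list:
    """Split the comma-separated input into unique, trimmed relative paths."""
    paths = []
    for part in raw.split(","):
        path = part.strip()
        if path and path not in paths:  # drop blanks and duplicates
            paths.append(path)
    if len(paths) > MAX_DYNAMIC_FILES:
        raise ValueError(f"At most {MAX_DYNAMIC_FILES} dynamic files per session")
    return paths

parse_dynamic_paths("Rel-17/23_series/23501-h50.md")
# returns ["Rel-17/23_series/23501-h50.md"]
```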
## Directory Structure

```
.
├── app.py               # Main Gradio application script
├── requirements.txt     # Python dependencies
├── .env                 # Local environment variables (API keys - DO NOT COMMIT)
├── .gitignore           # Specifies intentionally untracked files
├── cached_embeddings/   # Generated FAISS index files (.faiss, .pkl) - GITIGNORED
│   ├── base_knowledge.faiss
│   ├── base_knowledge.pkl
│   └── ... (dynamically cached files)
├── user_data.json       # Logs user emails and processed files
├── cache_manifest.json  # Maps dataset file paths to local cached FAISS files
├── hf_cache/            # Downloaded dataset files - GITIGNORED
└── README.md            # This file
```
## Important Notes & Disclaimers

* **Research Preview:** This tool is intended for demonstration and research purposes. The accuracy of generated responses is not guaranteed; always consult the original specifications for authoritative information.
* **Dataset License:** Use of this application is subject to the terms and license agreement of the underlying dataset (`rasoul-nikbakht/TSpec-LLM`). Please review these terms on the dataset's Hugging Face page.
* **API Costs:** Processing *new* (not yet cached) dynamic files incurs OpenAI Embedding API costs (roughly $0.02 per file with `text-embedding-3-small`). Querying the base knowledge or already-cached files incurs no additional embedding costs, but LLM inference for every query does incur costs.
* **API Key Security:** An OpenAI API key provided in the UI is used directly for embedding generation during your session. For local use it is read from `.env`; for deployment it should be stored securely as a Hugging Face Space Secret.
* **Data Logging:** The application logs the email address you provide and the dynamic files processed during your session in `user_data.json` for usage analysis.
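The ~$0.02-per-file figure is easy to sanity-check: `text-embedding-3-small` is priced per input token (about $0.02 per million tokens at the time of writing; check OpenAI's current pricing page), and a token is roughly 4 characters of English text.

```python
def estimate_embedding_cost(text: str, usd_per_million_tokens: float = 0.02) -> float:
    """Rough embedding cost: ~4 characters per token, priced per 1M tokens."""
    est_tokens = len(text) / 4
    return est_tokens / 1_000_000 * usd_per_million_tokens

# A ~4 MB specification file lands right at the $0.02 ballpark:
cost = estimate_embedding_cost("x" * 4_000_000)
# cost == 0.02
```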
## License

The code in this repository is provided under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Usage is additionally governed by the license terms of the `rasoul-nikbakht/TSpec-LLM` dataset.
## Acknowledgements

* Thanks to Rasoul Nikbakht for creating and sharing the `TSpec-LLM` dataset.
* Built using Gradio, LangChain, OpenAI, FAISS, and the Hugging Face ecosystem.