Commit 50bd1cd (verified) by rasoul-nikbakht · Parent: b406f5a

Add readme to the space

Files changed (1): README.md +150 −0
short_description: Chat with 3GPP documents using Tspec-LLM dataset
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# 3GPP TSpec RAG Assistant

[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/YOUR_HF_USERNAME/YOUR_SPACE_NAME) <!-- Replace with your actual Space URL -->

## Overview

This application provides a Retrieval-Augmented Generation (RAG) interface for querying 3GPP technical specification documents. It ships with a base set of pre-processed documents and lets users dynamically load and query additional specification files from the `rasoul-nikbakht/TSpec-LLM` dataset hosted on the Hugging Face Hub.

The system uses an OpenAI chat model (`gpt-4o-mini`) for understanding and generation, plus an OpenAI embedding model (`text-embedding-3-small`) combined with FAISS for efficient retrieval over the document content.
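The retrieve-then-generate flow described above can be sketched in miniature. This is an illustration only: in the real app the embeddings come from `text-embedding-3-small` and similarity search runs in FAISS, but hand-made toy vectors and a plain cosine similarity show the same mechanics without an API key.

```python
import math

# Toy "embeddings": stand-ins for real OpenAI embedding vectors.
CHUNKS = {
    "NAS registration is described in TS 23.501 clause 5.": [0.9, 0.1, 0.0],
    "The UE sends a Registration Request to the AMF.":      [0.8, 0.2, 0.1],
    "FAISS stores dense vectors for similarity search.":    [0.1, 0.9, 0.3],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, k=2):
    """Return the k chunks most similar to the query vector."""
    ranked = sorted(CHUNKS, key=lambda c: cosine(query_vec, CHUNKS[c]), reverse=True)
    return ranked[:k]

# A query vector "close to" the registration chunks; the retrieved
# chunks would then be stuffed into the LLM prompt as context.
context = retrieve([0.85, 0.15, 0.05])
```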
## Features

* **Base Knowledge:** Includes a pre-indexed set of core 3GPP specification documents for immediate querying.
* **Dynamic File Loading:** Users can specify up to 3 additional specification files (by their relative path within the dataset) per session.
* **Embedding Caching:** Embeddings for dynamically loaded files are cached locally (`cached_embeddings/`) to speed up subsequent sessions and reduce API costs; a manifest (`cache_manifest.json`) tracks cached files.
* **User Interaction Logging:** Logs user emails and the files they process in `user_data.json` for usage tracking.
* **Gradio Interface:** Interactive web UI built with Gradio, with separate chat and configuration/information panels.
* **Cost Estimation:** Gives a rough estimate of the OpenAI API cost for processing *new* dynamic files.
* **Hugging Face Spaces Ready:** Designed for easy deployment on Hugging Face Spaces, using environment variables/secrets for API keys.
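The embedding cache above could be keyed roughly like this: map a dataset-relative path to a stable local index filename and record it in the manifest so later sessions skip re-embedding. A minimal sketch; the function names and hashing scheme are illustrative, not the app's actual code.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def index_filename(dataset_path: str) -> str:
    # Derive a stable local filename from the dataset-relative path.
    digest = hashlib.sha256(dataset_path.encode()).hexdigest()[:16]
    return f"{digest}.faiss"

def mark_cached(manifest_file: Path, dataset_path: str) -> str:
    # Record the mapping so later sessions can skip re-embedding.
    manifest = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    local_name = manifest.setdefault(dataset_path, index_filename(dataset_path))
    manifest_file.write_text(json.dumps(manifest, indent=2))
    return local_name

with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "cache_manifest.json"
    first = mark_cached(manifest, "Rel-17/23_series/23501-h50.md")
    # A second request for the same file reuses the cached entry.
    second = mark_cached(manifest, "Rel-17/23_series/23501-h50.md")
```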
## Technology Stack

* **Backend:** Python 3
* **UI:** Gradio
* **RAG & LLM Orchestration:** LangChain
* **LLM & Embeddings:** OpenAI API (`gpt-4o-mini`, `text-embedding-3-small`)
* **Vector Store:** FAISS (Facebook AI Similarity Search)
* **Dataset Access:** Hugging Face Hub (`huggingface_hub`)
* **Dependencies:** See `requirements.txt`
## Setup

### Prerequisites

* Python 3.8+
* Git (Git LFS is not needed if the `.gitignore` setup described below is followed)
* An OpenAI API key
* A Hugging Face account and an API token (`HF_TOKEN`) with access to the `rasoul-nikbakht/TSpec-LLM` dataset (you must accept the dataset's license terms on the Hugging Face website while logged in)
### Installation

1. **Clone the repository:**
   ```bash
   git clone https://github.com/YOUR_USERNAME/YOUR_REPO_NAME.git  # Or your HF Space repo URL
   cd YOUR_REPO_NAME
   ```

2. **Create a virtual environment (recommended):**
   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`
   ```

3. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```
### Configuration

1. **API keys:** Provide your Hugging Face token and OpenAI API key.
   * **Local development:** Create a `.env` file in the project root:
     ```dotenv
     # Required for downloading the dataset
     HF_TOKEN=hf_YOUR_HUGGING_FACE_TOKEN

     # Required for embedding generation (base index + dynamic files)
     # This key is used for the initial base knowledge processing
     OPENAI_API_KEY=sk_YOUR_OPENAI_API_KEY
     ```
     **Important:** Ensure `.env` is listed in your `.gitignore` file so you never commit secrets.
   * **Hugging Face Spaces deployment:** Do **not** upload the `.env` file. Instead, set `HF_TOKEN` and `OPENAI_API_KEY` as **Secrets** in your Space settings; the application uses these automatically.

2. **Dataset access:** Ensure the Hugging Face account associated with your `HF_TOKEN` has accepted the license for `rasoul-nikbakht/TSpec-LLM` on the Hugging Face Hub website.
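The two configuration paths above (a local `.env` versus Space Secrets) can share one code path, because Spaces injects Secrets as environment variables. A hedged sketch; the helper name is illustrative:

```python
import os

def get_required(name: str) -> str:
    # Works both locally (after python-dotenv loads .env into the
    # environment) and on Spaces (Secrets arrive as env vars).
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Locally: add it to .env; "
            "on Spaces: add it under Settings -> Secrets."
        )
    return value

os.environ.setdefault("HF_TOKEN", "hf_dummy_token_for_demo")  # demo value only
token = get_required("HF_TOKEN")
```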
## Running Locally

1. **Activate your virtual environment (if using one):**
   ```bash
   source .venv/bin/activate
   ```

2. **Run the Gradio app:**
   ```bash
   python app.py
   ```

3. **Initial run:** The *very first time* you run the application locally, it will:
   * Download the fixed base knowledge files specified in the script.
   * Process these files (chunking and embedding, using the `OPENAI_API_KEY` from your `.env` file).
   * Create and save `base_knowledge.faiss` and `base_knowledge.pkl` in the `cached_embeddings/` directory.

   This initial pre-processing can take a few minutes and requires correctly configured API keys. Subsequent runs load the existing base index and start much faster.

4. **Access the UI:** Open the local URL shown in your terminal (usually `http://127.0.0.1:7860`).
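The chunking step performed during the initial run can be illustrated with a fixed-size splitter with overlap, so each embedded piece shares context with its neighbors. A minimal sketch; the app likely uses LangChain's text splitters, and the sizes below are arbitrary demo values.

```python
def chunk(text: str, size: int = 100, overlap: int = 20):
    """Split text into overlapping character windows of `size` chars."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

# Stand-in for a spec document: 250 varied characters.
doc = "".join(chr(97 + i % 26) for i in range(250))
pieces = chunk(doc)
# Each chunk overlaps the next by 20 characters.
```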
## Deployment (Hugging Face Spaces)

1. **Create a Hugging Face Space:** Choose the Gradio SDK.
2. **Upload files:** Upload the following to your Space repository:
   * `app.py`
   * `requirements.txt`
   * `.gitignore` (ensure it includes `cached_embeddings/`, `hf_cache/`, and `.env`)
3. **Configure secrets:** In your Space settings, open the "Secrets" section and add:
   * `HF_TOKEN`: Your Hugging Face API token.
   * `OPENAI_API_KEY`: Your OpenAI API key.
4. **Build & run:** The Space automatically builds the environment from `requirements.txt` and runs `app.py`. As with the first local run, the deployed Space performs the initial base knowledge processing using the secrets you provided, writing the index to the Space's persistent or ephemeral storage.
## Usage

1. **Enter email:** Provide your email address (used for logging interaction history).
2. **Enter OpenAI API key (optional but recommended):** Provide your own OpenAI key if you plan to query *new* dynamic files that are not yet cached. If you only query the base knowledge or already cached files, the server-configured key may suffice, but providing your own is safer.
3. **Specify dynamic files (optional):** Enter the relative paths (e.g., `Rel-17/23_series/23501-h50.md`) of up to 3 specification documents to query *in addition* to the base knowledge, separated by commas. Check the "Cached Dynamic Files" accordion to see which files are already processed.
4. **Ask a question:** Enter a question about the content of the base knowledge files or any dynamic files you specified.
5. **Interact:** View the chatbot's response. The right-hand panel shows status updates (e.g., whether a new file was cached) and information about the system.
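The dynamic-file input from step 3 can be validated along these lines: trim whitespace, drop empty entries, enforce the 3-file limit, and sanity-check each entry's shape. A sketch only; the regex is an assumed pattern, not the app's actual validation rule.

```python
import re

MAX_FILES = 3
# Assumed shape of a dataset-relative markdown path (illustrative).
PATH_RE = re.compile(r"^Rel-\d+/[\w\-./]+\.md$")

def parse_paths(raw: str):
    paths = [p.strip() for p in raw.split(",") if p.strip()]
    if len(paths) > MAX_FILES:
        raise ValueError(f"At most {MAX_FILES} dynamic files per session.")
    bad = [p for p in paths if not PATH_RE.match(p)]
    if bad:
        raise ValueError(f"Unrecognized path(s): {bad}")
    return paths

ok = parse_paths("Rel-17/23_series/23501-h50.md, Rel-16/38_series/38300-g30.md")
```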
## Directory Structure

```
.
├── app.py               # Main Gradio application script
├── requirements.txt     # Python dependencies
├── .env                 # Local environment variables (API keys - DO NOT COMMIT)
├── .gitignore           # Specifies intentionally untracked files
├── cached_embeddings/   # Generated FAISS index files (.faiss, .pkl) - GITIGNORED
│   ├── base_knowledge.faiss
│   ├── base_knowledge.pkl
│   └── ... (dynamically cached files)
├── user_data.json       # Logs user emails and processed files
├── cache_manifest.json  # Maps dataset file paths to local cached FAISS files
├── hf_cache/            # Downloaded dataset files - GITIGNORED
└── README.md            # This file
```
## Important Notes & Disclaimers

* **Research preview:** This tool is intended for demonstration and research purposes. The accuracy of generated responses is not guaranteed; always consult the original specifications for authoritative information.
* **Dataset license:** Use of this application is subject to the license terms of the underlying `rasoul-nikbakht/TSpec-LLM` dataset. Please review them on the dataset's Hugging Face page.
* **API costs:** Processing *new* (not yet cached) dynamic files incurs OpenAI embedding costs (estimated at ~$0.02 per file with `text-embedding-3-small`). Querying the base knowledge or already cached files incurs no additional embedding costs, though LLM inference for each query still costs money.
* **API key security:** An OpenAI API key provided in the UI is used directly for embedding generation during your session. For local use it is read from `.env`; for deployment it should be stored securely as a Hugging Face Space Secret.
* **Data logging:** The application logs the email address you provide and the dynamic files processed during your session in `user_data.json` for usage analysis.
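The ~$0.02-per-file figure above can be sanity-checked with back-of-envelope arithmetic, assuming `text-embedding-3-small` pricing of $0.02 per million tokens and a rough 4-characters-per-token heuristic. Both figures are assumptions here and should be checked against OpenAI's current pricing page.

```python
PRICE_PER_MTOK = 0.02   # USD per 1,000,000 tokens (assumed price)
CHARS_PER_TOKEN = 4     # rough heuristic, not a tokenizer

def estimate_embedding_cost(num_chars: int) -> float:
    """Estimate embedding cost in USD for a file of num_chars characters."""
    tokens = num_chars / CHARS_PER_TOKEN
    return tokens / 1_000_000 * PRICE_PER_MTOK

# A large spec file of ~4M characters -> ~1M tokens -> ~$0.02.
cost = estimate_embedding_cost(4_000_000)
```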
## License

The code in this repository is provided under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Usage is additionally governed by the license terms of the `rasoul-nikbakht/TSpec-LLM` dataset.

## Acknowledgements

* Thanks to Rasoul Nikbakht for creating and sharing the `TSpec-LLM` dataset.
* Built with Gradio, LangChain, OpenAI, FAISS, and the Hugging Face ecosystem.