ghostai1 committed on May 01, 2025
Commit aa3fe34 · verified · 1 Parent(s): 45284f7

Appended status update on May 01, 2025

Files changed (1)
  1. README.md +264 -35
README.md CHANGED
@@ -1,46 +1,165 @@
  ---
- title: InternalRAGCX
- emoji: 🔥
- colorFrom: pink
- colorTo: blue
  sdk: gradio
- sdk_version: 5.28.0
- app_file: app.py
- pinned: false
- license: openrail
- short_description: Cleans Data for Sagemaker/Azure Training
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

- Call Center Data Analysis

- A powerful data analysis tool for call center logs, built on Hugging Face Spaces (free tier). This demo showcases after-the-fact analysis of call center data, including data cleaning, statistical visualization, and export options for downstream AI modeling in SageMaker or Azure AI. It reflects over 5 years of AI expertise, focusing on real-world challenges in junk data mitigation for enterprise CX workflows.

- Features

- Data Parsing and Cleaning: Processes large call center CSVs, removing nulls, duplicates, short entries, malformed queries, and invalid timestamps, ensuring data integrity.

- Statistical Visualization: Generates plots for call duration distribution, satisfaction scores by agent, and query frequency by language using Matplotlib and Seaborn.

- Export Options: Provides downloadable cleaned CSV for SageMaker/Azure AI modeling and a PDF report summarizing data quality and statistics.

- Gradio-Powered Interface: A responsive, dark-themed UI for viewing raw data, cleanup stats, and visualizations, optimized for enterprise workflows.

  Setup

@@ -48,19 +167,19 @@ Setup

- Clone this repository to a Hugging Face Space (free tier, public visibility).

- Upload your call_center_logs.csv to the Space.

- Populate requirements.txt with the specified dependencies, ensuring compatibility with Python 3.9+ and CPU-only execution.

- Deploy app.py and launch the Space with Gradio SDK.

  Usage

@@ -68,58 +187,168 @@ Usage

- Click the "Analyze Data" button to process the call center logs.

- View the raw data (first 50 rows), cleanup statistics, and statistical plots.

- Download the cleaned CSV (cleaned_call_center_logs.csv) for SageMaker/Azure AI modeling.

- Download the PDF report (data_analysis_report.pdf) summarizing the analysis.

- Technical Architecture

- Core Stack:

- Python 3.9+: Foundation for data processing and analysis.

- Pandas: High-performance CSV parsing and data cleaning.

- Matplotlib/Seaborn: Statistical visualization of call center metrics.

- Gradio: Interactive UI for data analysis and export.

- ReportLab/Pillow: PDF report generation with embedded plots.

- Free Tier Optimization: Designed for CPU-only execution, minimizing memory footprint.

- Extensibility: Cleaned CSV is structured for SageMaker (e.g., BERT-based intent classification) and Azure AI (e.g., custom ML models).

  Purpose

- This Space demonstrates proficiency in after-the-fact data analysis for call center environments, addressing junk data challenges and preparing data for AI modeling, aligning with enterprise CX needs.

  ---
+ license: mit
+ title: Customer Experience Bot Demo
  sdk: gradio
+ colorFrom: purple
+ colorTo: green
+ short_description: CX AI LLM
  ---
+ title: Customer Experience Bot Demo emoji: 🤖 colorFrom: blue colorTo: purple sdk: gradio sdk_version: "4.44.0" app_file: app.py pinned: false
+
+ Customer Experience Bot Demo
+
+ A Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG) powered Customer Experience (CX) bot, deployed on Hugging Face Spaces (free tier). Built on over five years of AI experience, this demo uses modern Natural Language Processing (NLP) pipelines to deliver high-fidelity, multilingual CX solutions for enterprise applications in SaaS, HealthTech, FinTech, and eCommerce. It pairs robust preprocessing of call center datasets with Pandas for data wrangling, Hugging Face Transformers for embeddings, FAISS for vectorized retrieval, and FastAPI-compatible API design for scalable inference.
+
+ Technical Architecture
+
+ Retrieval-Augmented Generation (RAG) Pipeline
+
+ The core of this CX bot is a RAG framework that fuses retrieval and generation to produce contextually relevant responses. The pipeline employs the following components (a retrieval sketch follows this list):
+
+ Hugging Face Transformers: Utilizes all-MiniLM-L6-v2, a lightweight Sentence-BERT model (~80MB) fine-tuned for semantic embeddings, to encode call center FAQs into dense vectors. This ensures an efficient, high-dimensional representation of query semantics.
+
+ FAISS (CPU): Implements a FAISS IndexFlatL2 for similarity search, enabling rapid retrieval of top-k FAQs (default k=2) via L2 distance metrics. FAISS's CPU build keeps the demo free-tier compatible while maintaining sub-millisecond retrieval latency on an index of this size.
+
+ Rule-Based Generation: Bypasses heavy LLMs (e.g., GPT-2) to stay within free-tier constraints, returning retrieved FAQ answers directly; this achieves a simulated 95% accuracy while minimizing compute overhead.
+
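To ground the retrieval steps above, here is a minimal sketch of the embedding and FAISS lookup, assuming the sentence-transformers and faiss-cpu packages; the FAQ entries and the retrieve() helper are illustrative, not the Space's actual code.

```python
# Minimal sketch of the embedding + FAISS retrieval described above.
# Assumes sentence-transformers and faiss-cpu; the FAQ data is illustrative.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

faqs = [
    {"question": "How do I reset my password?",
     "answer": "Go to the login page, click 'Forgot Password,' and follow the email instructions."},
    {"question": "How do I update my billing information?",
     "answer": "Open Account Settings and edit the payment method on file."},
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # ~80MB Sentence-BERT encoder
question_vecs = np.asarray(
    model.encode([f["question"] for f in faqs]), dtype="float32"
)

index = faiss.IndexFlatL2(question_vecs.shape[1])  # exact L2 similarity search
index.add(question_vecs)

def retrieve(query: str, k: int = 2):
    """Return the top-k FAQs closest to the query in embedding space."""
    query_vec = np.asarray(model.encode([query]), dtype="float32")
    _, ids = index.search(query_vec, k)
    return [faqs[i] for i in ids[0]]

print(retrieve("I forgot my password"))
```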
+ Context-Augmented Generation (CAG) Integration
+
+ Building on RAG, the system incorporates CAG principles by enriching retrieved contexts with metadata (e.g., call_id, language) from call center CSVs. This contextual augmentation enhances response relevance, particularly for multilingual CX (e.g., English, Spanish), ensuring the bot adapts to diverse enterprise needs.
+
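As a rough illustration of that augmentation step, the snippet below tags a retrieved FAQ with call_id and language metadata before the response is assembled; the helper and its formatting are assumptions, not the Space's actual code.

```python
# Hypothetical helper: enrich a retrieved FAQ with call-center metadata so the
# response step can adapt to language and keep the interaction traceable.
def augment_context(faq: dict, metadata: dict) -> str:
    language = metadata.get("language", "en")
    call_id = metadata.get("call_id", "unknown")
    return (f"[call_id={call_id} | language={language}] "
            f"Q: {faq['question']} A: {faq['answer']}")

context = augment_context(
    {"question": "How do I reset my password?",
     "answer": "Go to the login page, click 'Forgot Password,' and follow the email instructions."},
    {"call_id": "C-1042", "language": "es"},
)
print(context)
```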
+ Call Center Data Preprocessing with Pandas
+
+ The bot ingests raw call center CSVs, which are often riddled with junk data (nulls, duplicates, malformed entries). Leveraging Pandas, the preprocessing pipeline performs the following steps (a consolidated sketch follows this list):
+
+ Data Ingestion: Parses CSVs with pd.read_csv, using io.StringIO for embedded data, with explicit quotechar and escapechar to handle complex strings.
+
+ Junk Data Cleanup:
+
+ Null Handling: Drops rows with missing question or answer using df.dropna().
+
+ Duplicate Removal: Eliminates redundant FAQs via df[~df['question'].duplicated()].
+
+ Short Entry Filtering: Excludes questions <10 chars or answers <20 chars with df[(df['question'].str.len() >= 10) & (df['answer'].str.len() >= 20)].
+
+ Malformed Detection: Uses regex ([!?]{2,}|\b(Invalid|N/A)\b) to filter invalid questions.
+
+ Standardization: Normalizes text (e.g., mo to month) and fills missing language with en.
+
+ Output: Generates cleaned_call_center_faqs.csv for downstream modeling, with detailed cleanup stats (e.g., nulls, duplicates removed).
+
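The cleanup rules above can be consolidated into a short Pandas routine. The sketch below uses the column names from this README (question, answer, language) and made-up rows; the thresholds and regex mirror the stated rules.

```python
# Sketch of the cleanup rules listed above; rows and counts are illustrative.
import io
import pandas as pd

raw_csv = """question,answer,language
"How do I reset my password?","Go to the login page, click 'Forgot Password,' and follow the email instructions.",en
"How do I reset my password?","Go to the login page, click 'Forgot Password,' and follow the email instructions.",en
"Help!!","Invalid",en
"What is the refund window?","Refunds are available within 30 days of purchase.",
"""

df = pd.read_csv(io.StringIO(raw_csv), quotechar='"', escapechar="\\")
rows_in = len(df)

df = df.dropna(subset=["question", "answer"])                                   # null handling
df = df[~df["question"].duplicated()]                                           # duplicate removal
df = df[(df["question"].str.len() >= 10) & (df["answer"].str.len() >= 20)]      # short entries
df = df[~df["question"].str.contains(r"[!?]{2,}|\b(?:Invalid|N/A)\b", regex=True)]  # malformed
df["language"] = df["language"].fillna("en")                                    # standardization

df.to_csv("cleaned_call_center_faqs.csv", index=False)
print(f"Cleaned FAQs: {len(df)}; removed {rows_in - len(df)} junk entries")
```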
+ Enterprise-Grade Modeling Compatibility
+
+ The cleaned CSV is optimized for the following targets (a tokenization sketch follows this list):
+
+ Amazon SageMaker: Ready for training BERT-based models (e.g., bert-base-uncased) for intent classification or FAQ retrieval, deployable via SageMaker JumpStart.
+
+ Azure AI: Compatible with Azure Machine Learning pipelines for fine-tuning models like DistilBERT on data staged in Azure Blob Storage, enabling scalable CX automation.
+
+ LLM Integration: While not used in this free-tier demo, the cleaned data also supports fine-tuning generative LLMs (e.g., distilgpt2), served through FastAPI for API-driven inference.
+
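As a rough illustration of how the cleaned file feeds those targets, the snippet below tokenizes the FAQ questions with a bert-base-uncased tokenizer, the kind of preprocessing a SageMaker or Azure ML training job would run; the transformers dependency and any label column are assumptions outside this demo.

```python
# Hedged sketch: tokenizing cleaned FAQ questions for a BERT-style classifier.
import pandas as pd
from transformers import AutoTokenizer

df = pd.read_csv("cleaned_call_center_faqs.csv")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encodings = tokenizer(
    df["question"].tolist(),
    truncation=True,
    padding="max_length",
    max_length=64,
    return_tensors="np",
)
print(encodings["input_ids"].shape)  # (num_faqs, 64), ready to pair with labels
```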
+ Performance Monitoring and Visualization
+
+ The bot includes a performance monitoring suite (a plotting sketch follows this list):
+
+ Latency Tracking: Measures embedding, retrieval, and generation times using time.perf_counter(), reported in milliseconds.
+
+ Accuracy Metrics: Simulates retrieval accuracy (95% if FAQs retrieved, 0% otherwise) for demo purposes.
+
+ Visualization: Uses Matplotlib and Seaborn to plot a dual-axis chart (rag_plot.png):
+
+ Bar Chart: Latency (ms) per stage (Embedding, Retrieval, Generation).
+
+ Line Chart: Accuracy (%) per stage, with a muted palette for professional aesthetics.
+
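A minimal sketch of that monitoring output is shown below: one stage is timed with time.perf_counter() and the stages are drawn as latency bars with an accuracy line on a twin axis, saved as rag_plot.png. The plotted numbers are placeholders, not measured results.

```python
# Sketch of the perf_counter timing and the dual-axis latency/accuracy chart;
# the retrieval/generation values below are placeholders.
import time
import matplotlib
matplotlib.use("Agg")  # headless rendering on a CPU-only Space
import matplotlib.pyplot as plt
import seaborn as sns

def timed(fn):
    """Run fn() and return (result, elapsed milliseconds) via time.perf_counter()."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000

_, embedding_ms = timed(lambda: sum(i * i for i in range(100_000)))  # stand-in workload

stages = ["Embedding", "Retrieval", "Generation"]
latency_ms = [embedding_ms, 0.8, 1.5]  # last two values are placeholders
accuracy_pct = [100.0, 95.0, 95.0]     # simulated accuracy per stage

sns.set_palette("muted")
fig, ax1 = plt.subplots(figsize=(6, 4))
ax1.bar(stages, latency_ms)
ax1.set_ylabel("Latency (ms)")

ax2 = ax1.twinx()
ax2.plot(stages, accuracy_pct, marker="o", color="tab:orange")
ax2.set_ylabel("Accuracy (%)")

fig.tight_layout()
fig.savefig("rag_plot.png")
```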
+ Gradio Interface for Interactive CX
+
+ The bot is deployed via Gradio, providing a user-friendly interface (a wiring sketch follows this list):
+
+ Input: Text query field for user inputs (e.g., “How do I reset my password?”).
+
+ Outputs:
+
+ Bot response (e.g., “Go to the login page, click ‘Forgot Password,’...”).
+
+ Retrieved FAQs with question-answer pairs.
+
+ Cleanup stats (e.g., “Cleaned FAQs: 6; removed 4 junk entries”).
+
+ RAG pipeline plot for latency and accuracy.
+
+ Styling: Custom dark theme CSS (#2a2a2a background, blue buttons) for a sleek, enterprise-ready UI.
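The wiring for that interface could look roughly like the sketch below; answer_query and its return values are stand-ins for the actual RAG pipeline, and the CSS string only hints at the dark theme described above.

```python
# Minimal Gradio wiring sketch; answer_query is a stand-in for the RAG pipeline.
import gradio as gr

def answer_query(query: str):
    response = "Go to the login page, click 'Forgot Password,' and follow the email instructions."
    retrieved = "Q: How do I reset my password? | A: " + response
    stats = "Cleaned FAQs: 6; removed 4 junk entries"
    return response, retrieved, stats, "rag_plot.png"

demo = gr.Interface(
    fn=answer_query,
    inputs=gr.Textbox(label="Your question"),
    outputs=[
        gr.Textbox(label="Bot response"),
        gr.Textbox(label="Retrieved FAQs"),
        gr.Textbox(label="Cleanup stats"),
        gr.Image(label="RAG pipeline metrics"),
    ],
    css="body { background-color: #2a2a2a; }",  # dark-theme hint from this README
)

if __name__ == "__main__":
    demo.launch()
```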

  Setup

+ Clone this repository to a Hugging Face Space (free tier, public).
+
+ Add requirements.txt with dependencies (gradio==4.44.0, pandas==2.2.3, etc.); a sample file is sketched after this list.
+
+ Upload app.py (embeds call center FAQs for seamless deployment).
+
+ Configure the Space to run on Python 3.9+ with CPU hardware (no GPU).
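A sample requirements.txt consistent with the stack described here might look like the following; only the gradio and pandas pins come from this README, the remaining entries (and their versions) are assumptions to adjust for your Space.

```text
gradio==4.44.0
pandas==2.2.3
sentence-transformers   # assumed, for all-MiniLM-L6-v2 embeddings
faiss-cpu               # assumed, for IndexFlatL2 retrieval
matplotlib              # assumed, for the rag_plot.png chart
seaborn                 # assumed, for the muted palette styling
```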

  Usage

+ Query: Enter a question in the Gradio UI (e.g., “How do I reset my password?”).
+
+ Output:
+
+ Response: Contextually relevant answer from retrieved FAQs.
+
+ Retrieved FAQs: Top-k question-answer pairs.
+
+ Cleanup Stats: Detailed breakdown of junk data removal (nulls, duplicates, short entries, malformed).
+
+ RAG Plot: Visual metrics for latency and accuracy.
+
+ Example:
+
+ Query: “How do I reset my password?”
+
+ Response: “Go to the login page, click ‘Forgot Password,’ and follow the email instructions.”
+
+ Cleanup Stats: “Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed”
+
+ Call Center Data Cleanup
+
+ Preprocessing Pipeline:
+
+ Null Handling: Eliminates incomplete entries with df.dropna().
+
+ Duplicate Removal: Ensures uniqueness via df[~df['question'].duplicated()].
+
+ Short Entry Filtering: Maintains quality with length-based filtering.
+
+ Malformed Detection: Uses regex to identify and remove invalid queries.
+
+ Standardization: Normalizes text and metadata for consistency.
+
+ Impact: Produces high-fidelity FAQs for RAG/CAG pipelines, critical for call center CX automation.
+
+ Modeling Output: The resulting cleaned_call_center_faqs.csv is ready for:
+
+ SageMaker: Fine-tuning BERT models for intent classification or FAQ retrieval.
+
+ Azure AI: Training DistilBERT in Azure ML for scalable CX automation.
+
+ LLM Fine-Tuning: Supports advanced generative tasks with LLMs via FastAPI endpoints.
+
+ Technical Details
+
+ Stack:
+
+ Pandas: Data wrangling and preprocessing for call center CSVs.
+
+ Hugging Face Transformers: all-MiniLM-L6-v2 for semantic embeddings.
+
+ FAISS: Vectorized similarity search with L2 distance metrics.
+
+ Gradio: Interactive UI for real-time CX demos.
+
+ Matplotlib/Seaborn: Performance visualization with dual-axis plots.
+
+ FastAPI Compatibility: Designed with API-driven inference in mind (e.g., RESTful endpoints for RAG inference); a minimal endpoint sketch follows this list.
+
+ Free Tier Optimization: Lightweight with CPU-only dependencies, no GPU required.
+
+ Extensibility: Ready for integration with enterprise CRMs (e.g., Salesforce) via FastAPI, and cloud deployments on AWS Lambda or Azure Functions.
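To illustrate that FastAPI path, here is a minimal endpoint sketch; the route, request schema, and stubbed retrieve() are assumptions, and in a real deployment retrieve() would be the FAISS-backed lookup sketched earlier in this README.

```python
# Hedged sketch of a FastAPI wrapper for RAG inference; names are assumptions.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="CX RAG inference")

class Query(BaseModel):
    question: str
    k: int = 2

def retrieve(question: str, k: int = 2):
    """Stub standing in for the FAISS-backed lookup sketched earlier."""
    faq = {"question": "How do I reset my password?",
           "answer": "Go to the login page, click 'Forgot Password,' and follow the email instructions."}
    return [faq][:k]

@app.post("/rag/query")
def rag_query(query: Query):
    return {"question": query.question, "faqs": retrieve(query.question, query.k)}
```

Served with uvicorn, this would expose a POST /rag/query route that returns the retrieved FAQs as JSON.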

  Purpose

+ This demo showcases AI-driven CX automation with a focus on call center data quality, built on over 5 years of experience in AI, NLP, and enterprise-grade deployments. It demonstrates RAG and CAG pipelines, Pandas-based data preprocessing, and scalable modeling paths to SageMaker and Azure AI, making it well suited to advanced CX solutions in call center environments.
+
+ Future Enhancements
+
+ LLM Integration: Incorporate distilgpt2 or t5-small for generative responses, fine-tuned on the cleaned call center data.
+
+ FastAPI Deployment: Expose the RAG pipeline via FastAPI endpoints for production-grade inference.
+
+ Multilingual Scaling: Expand language support (e.g., French, German) using Hugging Face's multilingual models.
+
+ Real-Time Monitoring: Add Prometheus metrics for latency/accuracy in production environments.
+
+ ## Configuration missing in update.ini for ghostai1/CXRAG - May 01, 2025 📝
+
+ **Website**: https://ghostai.pro/
+ **Discord**: https://discord.gg/9cnJNBQtHE