---
tags: [model]
---
# Internal RAG CX Data Preprocessing Demo

A data preprocessing pipeline for Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG) systems, deployed on Hugging Face as a Model repository (free tier). Built on over 5 years of AI experience (since 2020), this demo cleans and prepares call center datasets for enterprise-grade CX applications in SaaS, HealthTech, FinTech, and eCommerce. It uses Pandas to remove junk rows and standardize FAQ text for downstream RAG/CAG pipelines, and its output is compatible with Amazon SageMaker and Azure AI for scalable modeling.

## Technical Architecture

### Data Preprocessing Pipeline

The core of this demo is a comprehensive data preprocessing pipeline designed to clean raw call center datasets:

- **Data Ingestion**:
  - Parses CSVs with `pd.read_csv`, using `io.StringIO` for embedded data, with explicit `quotechar` and `escapechar` to handle complex strings.
  - Handles datasets with columns: `call_id`, `question`, `answer`, `language`.

- **Junk Data Cleanup**:
  - **Null Handling**: Drops rows with missing `question` or `answer` using `df.dropna()`.
  - **Duplicate Removal**: Eliminates redundant FAQs via `df[~df['question'].duplicated()]`.
  - **Short Entry Filtering**: Excludes questions <10 chars or answers <20 chars with `df[(df['question'].str.len() >= 10) & (df['answer'].str.len() >= 20)]`.
  - **Malformed Detection**: Uses regex (`[!?]{2,}|(Invalid|N/A)`) to filter invalid questions.
  - **Standardization**: Normalizes text (e.g., "mo" to "month") and fills missing `language` with "en".

- **Output**:
  - Generates `cleaned_call_center_faqs.csv` for downstream modeling.
  - Provides cleanup stats: nulls removed, duplicates removed, short entries filtered, malformed entries detected.
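
The steps above can be sketched roughly as follows. This is a minimal illustration rather than the code in `app.py`: the sample rows, the `clean_faqs` function name, and the choice of `\` as `escapechar` are assumptions, and the regex uses a non-capturing group (`(?:...)`) in place of the capturing group shown above so that `str.contains` does not emit a warning.

```python
import io

import pandas as pd

# Hypothetical sample data; the real demo embeds its own dataset in app.py.
RAW_CSV = '''call_id,question,answer,language
1,"How do I reset my password?","Open Settings, choose Security, then select Reset Password.",en
2,"How do I reset my password?","Use the reset link on the login page to set a new one.",en
3,,"This answer has no question attached to it.",en
4,"Hi?","Hello there, how can we help you today?",en
5,"What is the cost per mo?","The basic plan is billed at ten dollars per mo.",
6,"Invalid question here??","This entry should be flagged as malformed.",en
'''

# Same pattern as in the list above, with a non-capturing group.
MALFORMED = r"[!?]{2,}|(?:Invalid|N/A)"


def clean_faqs(csv_text: str):
    """Return (cleaned DataFrame, stats dict) for a call center FAQ CSV."""
    df = pd.read_csv(io.StringIO(csv_text), quotechar='"', escapechar="\\")
    stats = {}

    # Null handling: drop rows missing a question or an answer.
    before = len(df)
    df = df.dropna(subset=["question", "answer"])
    stats["nulls"] = before - len(df)

    # Duplicate removal: keep the first occurrence of each question.
    before = len(df)
    df = df[~df["question"].duplicated()]
    stats["duplicates"] = before - len(df)

    # Short entry filtering: questions < 10 chars or answers < 20 chars.
    before = len(df)
    df = df[(df["question"].str.len() >= 10) & (df["answer"].str.len() >= 20)]
    stats["short"] = before - len(df)

    # Malformed detection: repeated punctuation or placeholder text.
    before = len(df)
    df = df[~df["question"].str.contains(MALFORMED, regex=True)]
    stats["malformed"] = before - len(df)

    # Standardization: expand "mo" to "month" and default the language.
    df = df.copy()
    for col in ("question", "answer"):
        df[col] = df[col].str.replace(r"\bmo\b", "month", regex=True)
    df["language"] = df["language"].fillna("en")
    return df.reset_index(drop=True), stats


cleaned, stats = clean_faqs(RAW_CSV)
print(stats)
```

Calling `cleaned.to_csv("cleaned_call_center_faqs.csv", index=False)` afterwards would write the output file named above. Counting removals stage by stage means a row is attributed to the first rule it fails, so the order of the filters affects the per-category stats.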

### Enterprise-Grade Modeling Compatibility

The cleaned dataset is optimized for:

- **Amazon SageMaker**: Ready for training BERT-based models (e.g., `bert-base-uncased`) for intent classification or FAQ retrieval, deployable via SageMaker JumpStart.
- **Azure AI**: Compatible with Azure Machine Learning pipelines for fine-tuning models like DistilBERT, with the dataset stored in Azure Blob Storage, enabling scalable CX automation.
- **LLM Integration**: Supports fine-tuning LLMs (e.g., `distilgpt2`) for generative tasks, which can then be served through FastAPI for API-driven inference.

## Performance Monitoring and Visualization

The demo includes a performance monitoring suite:

- **Processing Time Tracking**: Measures data ingestion, cleaning, and output times using `time.perf_counter()`, reported in milliseconds.
- **Cleanup Metrics**: Tracks the number of nulls, duplicates, short entries, and malformed entries removed.
- **Visualization**: Uses Matplotlib to plot a bar chart (`cleanup_stats.png`):
  - Bars: Number of entries removed per category (Nulls, Duplicates, Short, Malformed).
  - Palette: Professional muted colors for enterprise aesthetics.
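
A minimal sketch of how the timing and the bar chart could fit together is shown here. The `timed` context manager, the `plot_cleanup_stats` function, the stand-in workloads, and the specific hex colors are all assumptions for illustration; only `time.perf_counter()`, Matplotlib, and the `cleanup_stats.png` filename come from the description above.

```python
import time
from contextlib import contextmanager

import matplotlib

matplotlib.use("Agg")  # headless backend, suitable for CPU-only servers
import matplotlib.pyplot as plt


@contextmanager
def timed(stage: str, timings: dict):
    """Record the wall-clock duration of a pipeline stage in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000.0


def plot_cleanup_stats(stats: dict, path: str = "cleanup_stats.png") -> str:
    """Save a bar chart of entries removed per category and return its path."""
    labels = list(stats)
    counts = [stats[label] for label in labels]
    fig, ax = plt.subplots(figsize=(6, 4))
    # Muted palette chosen for illustration only.
    bars = ax.bar(labels, counts,
                  color=["#4c72b0", "#55a868", "#c44e52", "#8172b2"])
    ax.bar_label(bars)
    ax.set_ylabel("Entries removed")
    ax.set_title("Cleanup stats")
    fig.tight_layout()
    fig.savefig(path, dpi=120)
    plt.close(fig)
    return path


# Stand-in workloads; in the demo these would be the real pipeline stages.
timings = {}
with timed("ingestion", timings):
    rows = [f"row {i}" for i in range(1000)]
with timed("cleaning", timings):
    rows = [r for r in rows if r.endswith("0")]
with timed("output", timings):
    report = "\n".join(rows)

for stage, ms in timings.items():
    print(f"{stage}: {ms:.3f} ms")
```

Passing a dict such as `{"Nulls": 2, "Duplicates": 1, "Short": 1, "Malformed": 0}` to `plot_cleanup_stats` would write the chart to `cleanup_stats.png`. Using `try`/`finally` in `timed` means a stage's duration is still recorded if it raises.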

## Gradio Interface for Interactive Demo

The demo is accessible via Gradio, providing an interactive data preprocessing experience:

- **Input**: Upload a sample call center CSV or use the embedded demo dataset.
- **Outputs**:
  - **Cleaned Dataset**: Download `cleaned_call_center_faqs.csv`.
  - **Cleanup Stats**: Detailed breakdown (e.g., “Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed”).
  - **Performance Plot**: Visual metrics for processing time and cleanup stats.
- **Styling**: Custom dark theme CSS (`#2a2a2a` background, blue buttons) for a sleek, enterprise-ready UI.

## Setup

- Clone this repository to a Hugging Face Model repository (free tier, public).
- Add `requirements.txt` with dependencies (`gradio==4.44.0`, `pandas==2.2.3`, `matplotlib==3.9.2`, etc.).
- Upload `app.py` (includes embedded demo dataset for seamless deployment).
- Configure to run with Python 3.9+, CPU hardware (no GPU).

## Usage

- **Upload CSV**: Provide a call center CSV in the Gradio UI, or use the default demo dataset.
- **Output**:
  - **Cleaned Dataset**: Download the processed `cleaned_call_center_faqs.csv`.
  - **Cleanup Stats**: “Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed”.
  - **Performance Plot**: Visual metrics for processing time and cleanup stats.

**Example**:
- **Input CSV**: Sample dataset with 10 FAQs, including 2 nulls, 1 duplicate, 1 short entry.
- **Output**:
  - Cleaned Dataset: 6 FAQs in `cleaned_call_center_faqs.csv`.
  - Cleanup Stats: “Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed”.
  - Plot: Processing Time (Ingestion: 50ms, Cleaning: 30ms, Output: 10ms), Cleanup Stats (Nulls: 2, Duplicates: 1, Short: 1, Malformed: 0).
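
The arithmetic behind the example (10 input rows minus 2 + 1 + 1 + 0 removed rows leaves 6) can be reproduced with a small helper. The `summarize` function and the dictionary keys are hypothetical; only the message format is taken from the stats line above.

```python
def summarize(total_rows: int, stats: dict) -> str:
    """Build the cleanup summary line from per-category removal counts."""
    removed = sum(stats.values())
    cleaned = total_rows - removed
    return (
        f"Cleaned FAQs: {cleaned}; removed {removed} junk entries: "
        f"{stats['nulls']} nulls, {stats['duplicates']} duplicates, "
        f"{stats['short']} short, {stats['malformed']} malformed"
    )


example_stats = {"nulls": 2, "duplicates": 1, "short": 1, "malformed": 0}
print(summarize(10, example_stats))
# Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed
```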

## Technical Details

**Stack**:
- **Pandas**: Data wrangling and preprocessing for call center CSVs.
- **Gradio**: Interactive UI for real-time data preprocessing demos.
- **Matplotlib**: Performance visualization with bar charts.
- **FastAPI Compatibility**: Designed with API-driven preprocessing in mind, so the pipeline can be exposed through FastAPI for scalable deployments.

**Free Tier Optimization**: Lightweight with CPU-only dependencies, no GPU required.

**Extensibility**: Ready for integration with RAG/CAG pipelines, and cloud deployments on AWS Lambda or Azure Functions.

## Purpose

This demo showcases expertise in data preprocessing for AI-driven CX automation, focusing on call center data quality. Built on over 5 years of experience in AI, data engineering, and enterprise-grade deployments, it demonstrates the power of Pandas-based data cleaning for RAG/CAG pipelines, making it ideal for advanced CX solutions in call center environments.

## Latest Update

**Status Update** (May 28, 2025): Configuration missing in `update.ini` for `ghostai1/internalRAGCX`; expected sections `InternalragcxUpdate` and `InternalragcxEmojis`.

## Future Enhancements

- **Real-Time Streaming**: Add support for real-time data streaming from Kafka for live preprocessing.
- **FastAPI Deployment**: Expose preprocessing pipeline via FastAPI endpoints for production-grade use.
- **Advanced Validation**: Implement stricter data validation rules using machine learning-based outlier detection.
- **Cloud Integration**: Enhance compatibility with AWS Glue or Azure Data Factory for enterprise data pipelines.

**Website**: https://ghostainews.com/  
**Discord**: https://discord.gg/BfA23aYz