---
tags: [model]
---
# Internal RAG CX Data Preprocessing Demo

A data preprocessing pipeline for Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG) systems, deployed on Hugging Face as a Model repository (free tier). Drawing on AI work since 2020, this demo cleans and prepares call center datasets for enterprise CX applications in SaaS, HealthTech, FinTech, and eCommerce. It uses Pandas to turn raw call center records into a cleaned FAQ dataset for downstream RAG/CAG pipelines, and the resulting CSV can be used for modeling on Amazon SageMaker or Azure AI.

## Technical Architecture

### Data Preprocessing Pipeline

The core of this demo is a comprehensive data preprocessing pipeline designed to clean raw call center datasets:

- **Data Ingestion**:
  - Parses CSVs with `pd.read_csv`, using `io.StringIO` for embedded data, with explicit `quotechar` and `escapechar` to handle complex strings.
  - Handles datasets with columns: `call_id`, `question`, `answer`, `language`.

- **Junk Data Cleanup**:
  - **Null Handling**: Drops rows with missing `question` or `answer` using `df.dropna(subset=['question', 'answer'])`, so rows missing only `language` are kept and filled later.
  - **Duplicate Removal**: Eliminates redundant FAQs via `df[~df['question'].duplicated()]`.
  - **Short Entry Filtering**: Excludes questions <10 chars or answers <20 chars with `df[(df['question'].str.len() >= 10) & (df['answer'].str.len() >= 20)]`.
  - **Malformed Detection**: Uses regex (`[!?]{2,}|(Invalid|N/A)`) to filter invalid questions.
  - **Standardization**: Normalizes text (e.g., "mo" to "month") and fills missing `language` with "en".

- **Output**:
  - Generates `cleaned_call_center_faqs.csv` for downstream modeling.
  - Provides cleanup stats: nulls removed, duplicates removed, short entries filtered, malformed entries detected.
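
The ingestion and cleanup steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `app.py`: the sample rows, the `clean_faqs` name, applying the "mo" to "month" normalization to the `answer` column only, and restricting `dropna` to the `question` and `answer` columns (so rows missing only `language` survive to be filled) are all assumptions. The regex uses a non-capturing group, which matches the same text as the pattern listed above while avoiding a pandas warning about match groups.

```python
import io

import pandas as pd

# Hypothetical sample data; the real embedded demo dataset is not shown here.
RAW_CSV = '''call_id,question,answer,language
1,"How do I reset my password?","Open Settings, choose Security, then select Reset Password.",en
2,"How do I reset my password?","Use the reset link on the login page to set a new one.",en
3,,"This answer has no question attached to it at all.",en
4,"Refund?","Refunds are processed within five business days.",en
5,"Why was I charged twice??","Duplicate charges are reversed automatically within 48 hours.",en
6,"How much does the pro plan cost?","The pro plan costs 20 USD per mo, billed annually.",
'''

# Same alternatives as the documented pattern, with a non-capturing group.
MALFORMED_PATTERN = r"[!?]{2,}|(?:Invalid|N/A)"


def clean_faqs(raw_csv):
    """Return (cleaned DataFrame, dict of rows removed per cleanup step)."""
    df = pd.read_csv(io.StringIO(raw_csv), quotechar='"', escapechar="\\")
    stats = {}

    # Null handling: only question/answer are required (assumption).
    before = len(df)
    df = df.dropna(subset=["question", "answer"])
    stats["nulls"] = before - len(df)

    # Duplicate removal: keep the first occurrence of each question.
    before = len(df)
    df = df[~df["question"].duplicated()]
    stats["duplicates"] = before - len(df)

    # Short entry filtering: questions < 10 chars or answers < 20 chars.
    before = len(df)
    df = df[(df["question"].str.len() >= 10) & (df["answer"].str.len() >= 20)]
    stats["short"] = before - len(df)

    # Malformed detection on the question text.
    before = len(df)
    df = df[~df["question"].str.contains(MALFORMED_PATTERN, regex=True)]
    stats["malformed"] = before - len(df)

    # Standardization: expand "mo" and default the language to "en".
    df = df.copy()
    df["answer"] = df["answer"].str.replace(r"\bmo\b", "month", regex=True)
    df["language"] = df["language"].fillna("en")
    return df.reset_index(drop=True), stats


cleaned, stats = clean_faqs(RAW_CSV)
print(stats)
print(cleaned[["call_id", "question", "language"]])
```

Writing the result with `cleaned.to_csv("cleaned_call_center_faqs.csv", index=False)` would produce the output file named above. Counting rows before and after each filter keeps the per-category stats independent of the order in which the filters are defined.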

### Enterprise-Grade Modeling Compatibility

The cleaned dataset is optimized for:

- **Amazon SageMaker**: Ready for training BERT-based models (e.g., `bert-base-uncased`) for intent classification or FAQ retrieval, deployable via SageMaker JumpStart.
- **Azure AI**: Compatible with Azure Machine Learning pipelines for fine-tuning models like DistilBERT on data stored in Azure Blob Storage, enabling scalable CX automation.
- **LLM Integration**: Supports fine-tuning LLMs (e.g., `distilgpt2`) for generative tasks, with inference served through an API layer such as FastAPI.

## Performance Monitoring and Visualization

The demo includes a performance monitoring suite:

- **Processing Time Tracking**: Measures data ingestion, cleaning, and output times using `time.perf_counter()`, reported in milliseconds.
- **Cleanup Metrics**: Tracks the number of nulls, duplicates, short entries, and malformed entries removed.
- **Visualization**: Uses Matplotlib to plot a bar chart (`cleanup_stats.png`):
  - Bars: Number of entries removed per category (Nulls, Duplicates, Short, Malformed).
  - Palette: Professional muted colors for enterprise aesthetics.
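
As a rough sketch of the stage timing described above, the snippet below wraps each stage in a small context manager built on `time.perf_counter()` and stores the elapsed time in milliseconds. The `timed` helper, the stage names, and the toy workload are hypothetical and stand in for the real ingestion, cleaning, and output steps.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage, timings):
    """Record the wall-clock duration of a block under timings[stage], in ms."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000.0


timings_ms = {}

# Toy stand-ins for the real pipeline stages (assumption).
with timed("ingestion", timings_ms):
    rows = [line.split(",") for line in "q1,a1\nq2,\nq3,a3".splitlines()]

with timed("cleaning", timings_ms):
    rows = [row for row in rows if all(row)]

with timed("output", timings_ms):
    text = "\n".join(",".join(row) for row in rows)

for stage, ms in timings_ms.items():
    print(f"{stage}: {ms:.3f} ms")
```

The resulting dictionary has one entry per stage, so it can be passed directly to a plotting function alongside the cleanup counts.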

## Gradio Interface for Interactive Demo

The demo is accessible via Gradio, providing an interactive data preprocessing experience:

- **Input**: Upload a sample call center CSV or use the embedded demo dataset.
- **Outputs**:
  - **Cleaned Dataset**: Download `cleaned_call_center_faqs.csv`.
  - **Cleanup Stats**: Detailed breakdown (e.g., “Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed”).
  - **Performance Plot**: Visual metrics for processing time and cleanup stats.
- **Styling**: Custom dark theme CSS (`#2a2a2a` background, blue buttons) for a sleek, enterprise-ready UI.

## Setup

- Clone this repository to a Hugging Face Model repository (free tier, public).
- Add `requirements.txt` with dependencies (`gradio==4.44.0`, `pandas==2.2.3`, `matplotlib==3.9.2`, etc.).
- Upload `app.py` (includes embedded demo dataset for seamless deployment).
- Configure to run with Python 3.9+, CPU hardware (no GPU).

## Usage

- **Upload CSV**: Provide a call center CSV in the Gradio UI, or use the default demo dataset.
- **Output**:
  - **Cleaned Dataset**: Download the processed `cleaned_call_center_faqs.csv`.
  - **Cleanup Stats**: “Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed”.
  - **Performance Plot**: Visual metrics for processing time and cleanup stats.

**Example**:
- **Input CSV**: Sample dataset with 10 FAQs, including 2 nulls, 1 duplicate, 1 short entry.
- **Output**:
  - Cleaned Dataset: 6 FAQs in `cleaned_call_center_faqs.csv`.
  - Cleanup Stats: “Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed”.
  - Plot: Processing Time (Ingestion: 50ms, Cleaning: 30ms, Output: 10ms), Cleanup Stats (Nulls: 2, Duplicates: 1, Short: 1, Malformed: 0).
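
The arithmetic behind the example is 10 input rows minus 4 removed rows (2 + 1 + 1 + 0), leaving 6 cleaned FAQs. A small sketch of how such a summary string could be assembled is shown below; the `summarize` function is a hypothetical helper, not necessarily how `app.py` builds the message.

```python
def summarize(total_rows, removed):
    """Build the cleanup summary from the input row count and per-category removals."""
    junk = sum(removed.values())
    cleaned = total_rows - junk
    parts = ", ".join(f"{count} {name}" for name, count in removed.items())
    return f"Cleaned FAQs: {cleaned}; removed {junk} junk entries: {parts}"


# Counts taken from the example above.
removed = {"nulls": 2, "duplicates": 1, "short": 1, "malformed": 0}
message = summarize(10, removed)
print(message)
```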

## Technical Details

**Stack**:
- **Pandas**: Data wrangling and preprocessing for call center CSVs.
- **Gradio**: Interactive UI for real-time data preprocessing demos.
- **Matplotlib**: Performance visualization with bar charts.
- **FastAPI Compatibility**: Designed with API-driven preprocessing in mind, so the pipeline can be wrapped in FastAPI endpoints for scalable deployments.

**Free Tier Optimization**: Lightweight with CPU-only dependencies, no GPU required.

**Extensibility**: Ready for integration with RAG/CAG pipelines, and cloud deployments on AWS Lambda or Azure Functions.

## Purpose

This demo showcases expertise in data preprocessing for AI-driven CX automation, focusing on call center data quality. Built on over 5 years of experience in AI, data engineering, and enterprise-grade deployments, it demonstrates the power of Pandas-based data cleaning for RAG/CAG pipelines, making it ideal for advanced CX solutions in call center environments.

## Latest Update

**Status Update**: Configuration missing in update.ini for ghostai1/internalRAGCX: Expected sections InternalragcxUpdate and InternalragcxEmojis - May 28, 2025 📝  

## Future Enhancements

- **Real-Time Streaming**: Add support for real-time data streaming from Kafka for live preprocessing.
- **FastAPI Deployment**: Expose preprocessing pipeline via FastAPI endpoints for production-grade use.
- **Advanced Validation**: Implement stricter data validation rules using machine learning-based outlier detection.
- **Cloud Integration**: Enhance compatibility with AWS Glue or Azure Data Factory for enterprise data pipelines.

**Website**: https://ghostainews.com/  
**Discord**: https://discord.gg/BfA23aYz