Palbha Kulkarni (Nazwale) committed
Commit bd044ff · unverified · 1 parent: df48547

Update README.md

Files changed (1): README.md (+111, −1)
# Airline FAQ RAG Project

Welcome to the **Airline FAQ RAG Project**! This repository houses a pet project exploring the creation of airline-related FAQ data using an AI agent and building a Retrieval-Augmented Generation (RAG) application to provide accurate and context-aware responses. Whether you're an AI enthusiast, a developer, or someone curious about natural language processing (NLP) and information retrieval, this project offers insights into data generation, embedding strategies, and RAG system optimization.

## 🎯 Project Overview

This project has two primary components:

1. **FAQ Data Generation Agent**: An AI-driven agent that generates high-quality, airline-related FAQ data based on predefined topics, exploring the impact of prompt engineering on output quality.
2. **RAG Application**: A Retrieval-Augmented Generation system that leverages the generated FAQ data to answer queries, with experiments on different chunking strategies to optimize retrieval performance.

The goal is to build a foundation for an end-to-end conversational AI agent for airline customer support, starting with a robust RAG system.

## 🚀 Features

- **Dynamic FAQ Generation**: Automatically creates comprehensive airline FAQs covering topics like booking, baggage, cancellations, and in-flight services.
- **RAG Implementation**: Combines retrieval and generation to provide accurate, context-aware answers to user queries.
- **Chunking Experiments**: Evaluates multiple chunking strategies (e.g., full FAQ file vs. question-answer pair chunking) to optimize embedding and retrieval performance.
- **Prompt Engineering Insights**: Explores how different prompts affect the quality and relevance of generated FAQ data.
- **Scalable Design**: Lays the groundwork for extending the system into a fully autonomous airline support agent.

## 📂 Repository Structure

- `data_generation/`: Scripts for generating airline FAQ data using an AI agent.
- `rag_application/`: Implementation of the RAG system, including embedding creation and retrieval logic.
- `data/`: Sample FAQ datasets and embeddings.
- `experiments/`: Notebooks and scripts comparing chunking strategies and their performance.
- `docs/`: Additional documentation and analysis of findings.

## 🛠️ Getting Started

### Prerequisites
- Python 3.8+
- Libraries: `transformers`, `faiss`, `numpy`, `pandas`, `langchain` (or your preferred RAG framework)
- Optional: GPU for faster embedding generation

### Installation
1. Clone the repository:
   ```bash
   git clone https://github.com/your-username/airline-faq-rag.git
   cd airline-faq-rag
   ```
2. Install dependencies:
   ```bash
   pip install -r requirements.txt
   ```
3. Generate FAQ data:
   ```bash
   python data_generation/generate_faq.py --topics "booking,baggage,cancellations"
   ```
4. Run the RAG application:
   ```bash
   python rag_application/run_rag.py
   ```

## 💡 Key Components

### 1. FAQ Data Generation
The FAQ generation agent creates airline-related question-answer pairs based on user-specified topics (e.g., booking, baggage, cancellations). Key features:
- **Prompt Engineering**: Experimented with various prompts to control tone, detail, and accuracy. For example:
  - Prompt 1: "Generate 10 FAQs about airline baggage policies in a formal tone."
  - Prompt 2: "Create concise FAQs for airline cancellations with customer-friendly language."
- **Learnings**:
  - Specific prompts with clear instructions (e.g., "include examples") yield more relevant and detailed FAQs.
  - Iterative prompt refinement improves output consistency and reduces hallucination.
  - Adding context (e.g., airline-specific policies) enhances realism but requires careful prompt design to avoid bias.
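
As a minimal sketch of the prompt engineering above, a small prompt builder makes the knobs explicit (the function name and parameters are hypothetical, not part of this repo; the assembled string is what you would send to your LLM of choice):

```python
def build_faq_prompt(topic: str, n: int = 10, tone: str = "formal",
                     include_examples: bool = False) -> str:
    """Assemble an FAQ-generation prompt from the levers that mattered in
    the experiments: topic, FAQ count, tone, and whether to ask for examples."""
    prompt = f"Generate {n} FAQs about airline {topic} in a {tone} tone."
    if include_examples:
        # Explicit instructions like "include examples" yielded more detailed FAQs.
        prompt += " Include concrete examples in each answer."
    return prompt

print(build_faq_prompt("baggage policies"))
# Iterate on the wording and compare the model's outputs across variants.
```

Keeping the prompt as a parameterized template rather than a hard-coded string makes iterative refinement and A/B comparison of variants straightforward.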

### 2. RAG Application
The RAG system retrieves relevant FAQ answers for user queries using embeddings and generates responses. Key experiments:
- **Embedding Strategies**:
  - **Full FAQ File Embedding**: Treated the entire FAQ dataset as a single document, creating one embedding per file.
  - **Question-Answer Pair Chunking**: Split FAQs into individual question-answer pairs, creating embeddings for each pair.
  - **Custom Chunking**: Grouped related FAQs (e.g., all baggage-related questions) into chunks to balance context and granularity.
- **Performance Evaluation**:
  - **Full FAQ Embedding**: Fast but less precise, as it struggles with fine-grained retrieval for specific questions.
  - **Question-Answer Pair Chunking**: Best performance for precise queries (e.g., "What is the baggage allowance?"), with higher relevance scores in retrieval.
  - **Custom Chunking**: Improved context for complex queries but increased retrieval latency.
- **Findings**: Question-answer pair chunking outperformed other methods in precision and recall, making it the preferred approach for this use case.
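
The question-answer pair chunking can be sketched as follows. The `Q:`/`A:` file format here is an assumption for illustration; adapt the split pattern to the actual layout of the generated FAQ files:

```python
import re

def chunk_qa_pairs(faq_text: str) -> list[str]:
    """Split an FAQ document into one chunk per question-answer pair,
    assuming a (hypothetical) 'Q: ...' / 'A: ...' line format."""
    # A new chunk starts at every line beginning with 'Q:'.
    pairs = re.split(r"\n(?=Q:)", faq_text.strip())
    return [p.strip() for p in pairs if p.strip()]

faq = """Q: What is the baggage allowance?
A: One checked bag up to 23 kg on most fares.
Q: Can I cancel my booking online?
A: Yes, via Manage Booking up to 24 hours before departure."""

chunks = chunk_qa_pairs(faq)
print(len(chunks))  # → 2
```

Each returned chunk is then embedded individually, which is what gave this strategy its edge on precise, single-question queries.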

## 🔍 Key Learnings
- **Prompt Sensitivity**: Small changes in prompt wording significantly affect FAQ quality. For example, specifying "customer-friendly" vs. "formal" tones altered the output's readability and tone.
- **Chunking Matters**: Fine-grained chunking (question-answer pairs) improves retrieval accuracy but requires careful indexing to manage scale.
- **Embedding Trade-offs**: Dense embeddings (e.g., using BERT-based models) offer better semantic understanding but are computationally expensive compared to sparse methods.
- **Scalability Challenges**: Large FAQ datasets require efficient indexing (e.g., FAISS) to maintain low-latency retrieval.
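
The retrieval step itself reduces to nearest-neighbor search over chunk embeddings. The sketch below uses a toy bag-of-words "embedding" and brute-force cosine similarity purely to show the shape of the logic; a real pipeline would substitute a dense encoder and a FAISS index, as noted above:

```python
import numpy as np

def embed(text: str, vocab: list[str]) -> np.ndarray:
    """Toy bag-of-words vector standing in for a real dense encoder."""
    words = text.lower().split()
    return np.array([words.count(w) for w in vocab], dtype=float)

def retrieve(query: str, chunks: list[str], vocab: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query by cosine similarity.
    A production system would replace this loop with an ANN index (e.g., FAISS)."""
    q = embed(query, vocab)
    def cos(a: np.ndarray, b: np.ndarray) -> float:
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0
    return sorted(chunks, key=lambda c: cos(q, embed(c, vocab)), reverse=True)[:k]

vocab = ["baggage", "allowance", "cancel", "booking", "refund"]
chunks = [
    "Q: What is the baggage allowance? A: One checked bag up to 23 kg.",
    "Q: Can I cancel my booking? A: Yes, up to 24 hours before departure.",
]
print(retrieve("baggage allowance for economy", chunks, vocab))
```

The brute-force scan is O(n) per query, which is exactly why large FAQ datasets push you toward an approximate index to keep retrieval latency low.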

## 🚀 Next Steps
To evolve this project into an end-to-end airline support agent:
1. **Context-Aware Generation**: Integrate user context (e.g., booking details) into the RAG pipeline for personalized responses.
2. **Multi-Turn Conversations**: Enhance the agent to handle follow-up questions and maintain conversation history.
3. **Real-Time Data Integration**: Incorporate live airline data (e.g., flight status APIs) to provide dynamic answers.
4. **Model Fine-Tuning**: Fine-tune the language model on airline-specific data to improve response accuracy and domain knowledge.
5. **Evaluation Metrics**: Implement automated evaluation (e.g., BLEU, ROUGE, or human-in-the-loop feedback) to quantify response quality.
6. **Deployment**: Package the RAG system as a web or mobile app for real-world testing.

## 🤝 Contributing
We welcome contributions! Feel free to:
- Submit pull requests for new features or bug fixes.
- Share ideas for improving FAQ generation or RAG performance.
- Add new chunking strategies or prompt templates to the `experiments/` folder.

Please read `CONTRIBUTING.md` for guidelines.

## 📜 License
This project is licensed under the MIT License. See `LICENSE` for details.

## 📬 Contact
For questions or feedback, reach out via GitHub Issues or email at [[email protected]].

---

✈️ *Let's build the future of airline customer support together!*