import streamlit as st

def render_report():
    st.title("Group 5: Term Project Report")
    
    # Title Page Information
    st.markdown("""
    **Course:** CSE 555 – Introduction to Pattern Recognition  
    **Authors:** Saksham Lakhera and Ahmed Zaher  
    **Date:** July 2025
    """)
    
    # Abstract
    st.header("Abstract")
    
    st.subheader("NLP Engineering Perspective")
    st.markdown("""
    This project addresses the challenge of improving recipe recommendation systems through
    advanced semantic search capabilities using transformer-based language models. Traditional
    keyword-based search methods often fail to capture the nuanced relationships between
    ingredients, cooking techniques, and user preferences in culinary contexts. 
    
    Our approach fine-tunes BERT (Bidirectional Encoder Representations from Transformers) 
    on a custom recipe dataset to develop a semantic understanding of culinary content. 
    We preprocessed and structured a subset of 15,000 recipes into standardized sequences organized
    by food categories (proteins, vegetables, legumes, etc.) to create training data optimized for
    the BERT architecture.
    
    The model was fine-tuned to learn contextual embeddings that capture semantic relationships 
    between ingredients and tags. At inference time we generate embeddings for all recipes in our 
    dataset and perform cosine-similarity retrieval to produce the top-K most relevant recipes 
    for a user query.
    """)
    
    # Introduction
    st.header("Introduction")
    st.markdown("""
    This term project serves primarily as an educational exercise aimed at giving students 
    end-to-end exposure to building a modern NLP system. Our goal is to construct a semantic 
    recipe-search engine that demonstrates how domain-specific fine-tuning of BERT can 
    substantially improve retrieval quality over simple keyword matching.
    
    **Key Contributions:**
    - A cleaned, category-labelled subset of 15,000 recipes
    - Training scripts that yield domain-adapted contextual embeddings
    - A production-ready retrieval service that returns the top-K most relevant recipes
    - Comparative evaluation against classical baselines
    """)
    
    # Dataset and Preprocessing
    st.header("Dataset and Pre-processing")
    
    st.subheader("Data Sources")
    st.markdown("""
    The project draws from two CSV files:
    - **Raw_recipes.csv** – 231,637 rows, one per recipe with columns: *id, name, ingredients, tags, minutes, steps, description, n_steps, n_ingredients*
    - **Raw_interactions.csv** – user feedback containing *recipe_id, user_id, rating (1-5), review text*
    """)
    
    st.subheader("Corpus Filtering and Subset Selection")
    st.markdown("""
    1. **Invalid rows removed** – recipes with empty ingredient lists, missing tags, or fewer than three total tags
    2. **Random sampling** – 15,000 recipes selected for NLP fine-tuning
    3. **Positive/negative pairs** – generated for contrastive learning using ratings and tag similarity
    4. **Train/test split** – 80/20 stratified split (12,000/3,000 pairs)
    """)
    
    st.subheader("Text Pre-processing Pipeline")
    st.markdown("""
    - **Lower-casing & punctuation removal** – normalized to lowercase, special characters stripped
    - **Stop-descriptor removal** – culinary modifiers (*fresh, chopped, minced*) and measurements removed
    - **Ingredient ordering** – re-ordered into the sequence **protein → vegetables → grains → dairy → other** (see the sketch below)
    - **Tag normalization** – mapped to six canonical slots: *cuisine, course, main-ingredient, dietary, difficulty, occasion*
    - **Tokenization** – standard *bert-base-uncased* WordPiece tokenizer, sequences truncated/padded to 128 tokens
    """)
    
    # Methodology
    st.header("Methodology")
    
    st.subheader("Model Architecture")
    st.markdown("""
    - **Base Model:** `bert-base-uncased` checkpoint
    - **Additional Layers:** Single linear classification layer (768 → 1) with dropout (p = 0.1)
    - **Training Objective:** Triplet-margin loss with margin of 1.0
    """)
    
    st.subheader("Hyperparameters")
    col1, col2 = st.columns(2)
    with col1:
        st.markdown("""
        - **Batch size:** 8
        - **Max sequence length:** 128 tokens
        - **Learning rate:** 2 × 10⁻⁵
        - **Weight decay:** 0.01
        """)
    with col2:
        st.markdown("""
        - **Optimizer:** AdamW
        - **Epochs:** 3
        - **Hardware:** Google Colab A100 GPU (40 GB VRAM)
        - **Training time:** ~75 minutes per run
        """)
    
    # Mathematical Formulations
    st.header("Mathematical Formulations")
    
    st.subheader("Query Embedding and Similarity Calculation")
    st.latex(r"""
        \text{Similarity}(q, r_i) = \cos(\hat{q}, \hat{r}_i) = \frac{\hat{q} \cdot \hat{r}_i}{\|\hat{q}\|\|\hat{r}_i\|}
    """)
    st.markdown("Where $\\hat{q}$ is the BERT embedding of the query, and $\\hat{r}_i$ is the embedding of the i-th recipe.")
    
    st.subheader("Final Score Calculation")
    st.latex(r"""
        \text{Score}_i = 0.6 \times \text{Similarity}_i + 0.4 \times \text{Popularity}_i
    """)
    
    # Results
    st.header("Results")
    
    st.subheader("Training and Validation Loss")
    results_data = {
        "Run": [1, 2, 3, 4],
        "Configuration": [
            "Raw, no cleaning/ordering",
            "Cleaned text, unordered", 
            "Cleaned text + dropout",
            "Cleaned text + dropout + ordering"
        ],
        "Epoch-3 Train Loss": [0.0065, 0.0023, 0.0061, 0.0119],
        "Validation Loss": [0.1100, 0.0000, 0.0118, 0.0067]
    }
    st.table(results_data)
    
    st.markdown("""
    **Key Finding:** Run 4 (cleaned text + dropout + ordering) achieved the best balance 
    between low validation loss and meaningful retrieval quality.
    """)
    
    st.subheader("Qualitative Retrieval Examples")
    st.markdown("""
    **Query: "beef steak dinner"**
    - Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans*
    - Run 4 (Final): *grilled garlic steak dinner*, *classic beef steak au poivre*
    
    **Query: "chicken italian pasta"**  
    - Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans*
    - Run 4 (Final): *creamy tuscan chicken pasta*, *italian chicken penne bake*
    
    **Query: "vegetarian salad healthy"**
    - Run 1 (Raw): (irrelevant hits)
    - Run 4 (Final): *kale quinoa power salad*, *superfood spinach & berry salad*
    """)
    
    # Discussion and Conclusion
    st.header("Discussion and Conclusion")
    st.markdown("""
    The experimental evidence underscores the importance of disciplined pre-processing when 
    adapting large language models to niche domains. The breakthrough came with **ingredient-ordering** 
    (protein → vegetables → grains → dairy → other), which supplied consistent positional signals.
    
    **Key Achievements:**
    - End-to-end recipe recommendation system with semantic search
    - Sub-second latency across 231k recipes
    - Meaningful semantic understanding of culinary content
    - Reproducible blueprint for domain-specific NLP applications
    
    **Limitations:**
    - Fine-tuning subset relatively small (15k recipes) compared to public corpora
    - Minimal hyperparameter search conducted
    - Single-machine deployment tested
    """)
    
    # Technical Specifications
    st.header("Technical Specifications")
    col1, col2 = st.columns(2)
    with col1:
        st.markdown("""
        **Dataset:**
        - Total Recipes: 231,637
        - Training Set: 15,000 recipes
        - Average Tags per Recipe: ~6
        - Ingredients per Recipe: 3-20
        """)
    with col2:
        st.markdown("""
        **Infrastructure:**
        - Python 3.10
        - PyTorch 2.1 (CUDA 11.8)
        - Transformers 4.38
        - Google Colab A100 GPU
        """)
    
    # References
    st.header("References")
    st.markdown("""
    [1] Vaswani et al., "Attention Is All You Need," NeurIPS, 2017.
    
    [2] Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL-HLT, 2019.
    
    [3] Reimers and Gurevych, "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks," EMNLP-IJCNLP, 2019.
    
    [4] Hugging Face, "BERT Model Documentation," 2024.
    """)
    
    st.markdown("---")
    st.markdown("Β© 2025 CSE 555 Term Project. All rights reserved.")

# Render the report
render_report()