AAZ1215 committed on
Commit
a306fec
·
verified ·
1 Parent(s): 1556952

RerportAZSection (#4)


- adding first iteration of the report md (1ba150b4fd0ec01b830ae345c91fd92992c41ad0)
- integrating md file with report.py (b8416be93e9345c385b2147cc700574c827088f4)

Files changed (3)
  1. Delete_Later_report.md +380 -0
  2. Delete_Later_report.txt +26 -6
  3. pages/4_Report.py +192 -88
Delete_Later_report.md ADDED
@@ -0,0 +1,380 @@
1
+ # Title Page
2
+
3
+ - **Title:** Term Project
4
+ - **Authors:** Saksham Lakhera and Ahmed Zaher
5
+ - **Course:** CSE 555 — Introduction to Pattern Recognition
6
+ - **Date:** July 20, 2025
7
+
8
+ ---
9
+
10
+ # Abstract
11
+
12
+ ## NLP Engineering Perspective
13
+
14
+ This project addresses the challenge of improving recipe recommendation systems through
15
+ advanced semantic search capabilities using transformer-based language models. Traditional
16
+ keyword-based search methods often fail to capture the nuanced relationships between
17
+ ingredients, cooking techniques, and user preferences in culinary contexts. Our approach
18
+ leverages BERT (Bidirectional Encoder Representations from Transformers) fine-tuning on a
19
+ custom recipe dataset to develop a semantic understanding of culinary content. We
20
+ preprocessed and structured a subset of 15 000 recipes into standardized sequences organized
21
+ by food categories (proteins, vegetables, legumes, etc.) to create training data optimized for
22
+ the BERT architecture. The model was fine-tuned to learn contextual embeddings that capture
23
+ semantic relationships between ingredients and tags. At inference time we generate
24
+ embeddings for all recipes in our dataset and perform cosine-similarity retrieval to produce
25
+ the top-K most relevant recipes for a user query. Our evaluation demonstrates
26
+ [PLACEHOLDER: key quantitative results – e.g., Recall@10 = X.XX, MRR = X.XX, improvement
27
+ over baseline = +XX %]. This work provides practical experience in transformer
28
+ fine-tuning for domain-specific applications and highlights the effectiveness of structured
29
+ data preprocessing for improving semantic search in the culinary domain.
30
+
31
+ ## Computer-Vision Engineering Perspective
32
+
33
+ *(Reserved – to be completed by CV author)*
34
+
35
+ ---
36
+
37
+ # Introduction
38
+
39
+ ## NLP Engineering Perspective
40
+
41
+ This term project serves primarily as an educational exercise aimed
42
+ at giving students end-to-end exposure to building a modern NLP system. Our goal is
43
+ to construct a semantic recipe-search engine that demonstrates how domain-specific
44
+ fine-tuning of BERT can substantially improve retrieval quality over simple keyword
45
+ matching. We created a preprocessing pipeline that restructures 15,000 recipes into
46
+ standardized ingredient-sequence representations and then fine-tuned BERT on this corpus.
47
+ Key contributions include (i) a cleaned, category-labelled recipe subset, (ii) training
48
+ scripts that yield domain-adapted contextual embeddings, and (iii) a production-ready
49
+ retrieval service that returns the top-K most relevant recipes for an arbitrary user query via
50
+ cosine-similarity ranking. A comparative evaluation against classical baselines will
51
+ be presented in Section 9 [PLACEHOLDER: baseline summary]. The project thus provides a
52
+ compact blueprint of the full NLP workflow – from data curation through deployment.
53
+
54
+ ## Computer-Vision Engineering Perspective
55
+
56
+ *(Reserved – to be completed by CV author)*
57
+
58
+
59
+ ---
60
+
61
+ # Background / Related Work
62
+
63
+ Modern recipe-recommendation research builds on recent advances in Transformer
64
+ architectures. The seminal “Attention Is All You Need” paper introduced the
65
+ self-attention mechanism that underpins today’s language models, while BERT
66
+ extended that idea to bidirectional pre-training for rich contextual
67
+ representations [1, 2]. Subsequent works such as Sentence-BERT showed that
68
+ fine-tuning BERT with siamese objectives yields sentence-level embeddings well
69
+ suited to semantic search. Our project follows this line by adapting a
70
+ pre-trained BERT model to culinary text.
71
+
72
+ Domain-specific fine-tuning has proven effective in many verticals—BioBERT for
73
+ biomedical literature, SciBERT for scientific text, and so on—suggesting that
74
+ a curated corpus can capture specialist terminology more accurately than a
75
+ general model. Inspired by that pattern, we preprocess a 15 000-recipe subset
76
+ into category-aware sequences (proteins, vegetables, cuisine, cook-time, etc.)
77
+ and further fine-tune BERT to learn embeddings that encode cooking semantics.
78
+ At retrieval time we rank candidates by cosine similarity, mirroring prior work
79
+ that pairs BERT with simple vector metrics to achieve strong performance with
80
+ minimal infrastructure.
81
+
82
+ Classical lexical baselines such as TF-IDF and BM25 remain competitive for many
83
+ information-retrieval tasks; we therefore include them as comparators
84
+ [PLACEHOLDER for baseline results]. We also consult the Hugging Face
85
+ Transformers documentation for implementation details and training
86
+ best practices [3]. Unlike previous public studies that rely on the Recipe1M
87
+ dataset, our corpus was provided privately by the course instructor, requiring
88
+ custom cleaning and categorization steps that, to our knowledge, have not been
89
+ documented elsewhere. This tailored pipeline distinguishes our work and lays
90
+ the groundwork for the experimental analysis presented in the following
91
+ sections.
92
+
93
+ ---
94
+
95
+ [1] Vaswani et al., “Attention Is All You Need,” 2017.
96
+
97
+ [2] Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for
98
+ Language Understanding,” 2019.
99
+
100
+ [3] Hugging Face BERT documentation,
101
+ <https://huggingface.co/docs/transformers/model_doc/bert>.
102
+
103
+ # Dataset and Pre-processing
104
+
105
+ ## Data Sources
106
+
107
+ The project draws from two CSV files shared by the course
108
+ instructor:
109
+
110
+ * **Raw_recipes.csv** – 231 637 rows, one per recipe. Key columns include
111
+ *`id`, `name`, `ingredients`, `tags`, `minutes`, `steps`, `description`,
112
+ `n_steps`, `n_ingredients`*.
113
+ * **Raw_interactions.csv** – user feedback containing *`recipe_id`,
114
+ `user_id`, `rating` (1-5), `review` text*. We aggregate the ratings for
115
+ each recipe to compute an **average quality score** and the **number of
116
+ reviews**.
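+
+ A minimal sketch of this aggregation, assuming the two CSVs are loaded with *pandas*
+ and joined on the documented `id` / `recipe_id` keys (variable names are illustrative):
+
+ ```python
+ import pandas as pd
+
+ recipes = pd.read_csv("Raw_recipes.csv")
+ interactions = pd.read_csv("Raw_interactions.csv")
+
+ # Average quality score and review count per recipe.
+ quality = (
+     interactions.groupby("recipe_id")["rating"]
+     .agg(avg_rating="mean", n_reviews="count")
+     .reset_index()
+ )
+
+ # Attach the aggregates; recipes without reviews get a count of 0.
+ recipes = recipes.merge(quality, left_on="id", right_on="recipe_id", how="left")
+ recipes["n_reviews"] = recipes["n_reviews"].fillna(0).astype(int)
+ ```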
117
+
118
+ A separate Computer-Vision track collected **≈ 6 000 food photographs** to
119
+ train an image classifier; that pipeline is documented in the Methodology
120
+ section.
121
+
122
+ ## Corpus Filtering and Subset Selection
123
+
124
+ 1. **Invalid rows removed** – recipes with empty ingredient lists, missing
125
+ tags, or fewer than three total tags were discarded.
126
+ 2. **Random sampling** – from the cleaned corpus we randomly selected
127
+ **15 000 recipes** for NLP fine-tuning.
128
+ 3. **Positive / negative pairs** – for contrastive learning we generated one
129
+ positive–negative pair per recipe using ratings and tag similarity.
130
+ 4. **Train / test split** – an 80 / 20 stratified split (12 000 / 3 000 pairs)
131
+ was used; validation examples are drawn from the training set during
132
+ fine-tuning via a 10 % hold-out.
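+
+ The exact pairing rules live in our preprocessing scripts; the sketch below only
+ illustrates the idea with simplified, hypothetical criteria (tag overlap as the
+ similarity signal, and a threshold chosen purely for illustration):
+
+ ```python
+ import random
+
+ random.seed(42)  # fixed seed, as noted in the Experimental Setup
+
+ def tag_overlap(a, b):
+     """Jaccard overlap between two recipes' tag sets."""
+     ta, tb = set(a["tags"]), set(b["tags"])
+     return len(ta & tb) / max(len(ta | tb), 1)
+
+ def build_triplet(anchor, corpus, threshold=0.4):
+     """Pick one positive (high tag overlap) and one negative (low overlap) per anchor."""
+     candidates = random.sample(corpus, k=min(200, len(corpus)))
+     positives = [r for r in candidates if tag_overlap(anchor, r) >= threshold]
+     negatives = [r for r in candidates if tag_overlap(anchor, r) < threshold]
+     if not positives or not negatives:
+         return None  # skip anchors without usable pairs
+     return anchor, random.choice(positives), random.choice(negatives)
+ ```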
133
+
134
+ ## Text Pre-processing Pipeline
135
+
136
+ * **Lower-casing & punctuation removal** – all text normalised to lowercase;
137
+ punctuation and special characters stripped.
138
+ * **Stop-descriptor removal** – culinary modifiers such as *fresh*, *chopped*,
139
+ *minced*, and measurement words (*cup*, *tablespoon*, *pound*) were pruned
140
+ to reduce sparsity.
141
+ * **Ingredient ordering** – ingredient strings were re-ordered into the
142
+ sequence **protein → vegetables → grains → dairy → other** to give BERT a
143
+ consistent positional signal.
144
+ * **Tag normalisation** – tags were mapped to six canonical slots:
145
+ **cuisine, course, main-ingredient, dietary, difficulty, occasion**.
146
+ Tags shorter than three characters were dropped.
147
+ * **Tokenizer** – the standard *bert-base-uncased* WordPiece tokenizer from
148
+ Hugging Face Transformers was applied; sequences were truncated or padded
149
+ to **128 tokens**.
150
+ * **Pair construction** – each training example consists of
151
+ *〈ingredients + tags〉, 〈query or neighbouring recipe〉* with a binary label
152
+ indicating semantic similarity.
153
+
154
+ ## Tools and Infrastructure
155
+
156
+ Pre-processing was scripted in **Python 3.10** using *pandas*,
157
+ *datasets*, and *transformers*. All experiments ran on an **NVIDIA A100 GPU (40 GB VRAM)** in Google Colab; the available memory comfortably supported our batch size of 8 examples.
158
+
159
+ Category imbalance was left untouched to reflect the real-world frequency of
160
+ cuisines and ingredients; however, evaluation metrics are reported with
161
+ per-category breakdowns to highlight any bias.
162
+
163
+ # Methodology
164
+
165
+ ## NLP Engineering Perspective
166
+
167
+ ### Model Architecture
168
+
169
+ We fine-tuned the **`bert-base-uncased`** checkpoint. A single linear
170
+ classification layer (768 → 1) was added on top of the pooled CLS vector; in
171
+ later experiments we interposed a **dropout layer (p = 0.1)** to gauge
172
+ regularisation effects.
173
+
174
+ ### Training Objective
175
+
176
+ Training followed a **triplet-margin loss** with a margin of **1.0**.
177
+ Each mini-batch contained an *(anchor, positive, negative)* tuple derived from
178
+ the recipe-similarity pairs described in Section *Dataset & Pre-processing*.
179
+ The network was optimised to push the anchor embedding at least one cosine-
180
+ distance unit closer to the positive than to the negative.
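+
+ A minimal PyTorch sketch of this objective. The encoder wrapper below is illustrative
+ (it keeps the pooled CLS vector plus dropout; the 768 → 1 head described above plays no
+ part in the triplet objective), and the batches are placeholders for tokenizer output:
+
+ ```python
+ import torch.nn as nn
+ import torch.nn.functional as F
+ from torch.optim import AdamW
+ from transformers import BertModel
+
+ class RecipeEncoder(nn.Module):
+     """bert-base-uncased with the pooled [CLS] vector as the recipe embedding."""
+     def __init__(self):
+         super().__init__()
+         self.bert = BertModel.from_pretrained("bert-base-uncased")
+         self.dropout = nn.Dropout(p=0.1)
+
+     def forward(self, **inputs):
+         return self.dropout(self.bert(**inputs).pooler_output)   # (batch, 768)
+
+ model = RecipeEncoder()
+ criterion = nn.TripletMarginWithDistanceLoss(
+     distance_function=lambda x, y: 1.0 - F.cosine_similarity(x, y), margin=1.0)
+ optimizer = AdamW(model.parameters(), lr=2e-5, weight_decay=0.01,
+                   betas=(0.9, 0.999), eps=1e-8)
+
+ def training_step(anchor, positive, negative):
+     """One optimisation step on a single (anchor, positive, negative) batch."""
+     loss = criterion(model(**anchor), model(**positive), model(**negative))
+     optimizer.zero_grad()
+     loss.backward()
+     optimizer.step()
+     return loss.item()
+ ```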
181
+
182
+ ### Hyper-parameters
183
+
184
+ | Parameter | Value |
185
+ |-------------------|-------|
186
+ | Batch size | 8 |
187
+ | Max sequence length | 128 tokens |
188
+ | Optimiser | AdamW (β₁ = 0.9, β₂ = 0.999, ε = 1e-8) |
189
+ | Learning rate | 2 × 10⁻⁵ |
190
+ | Weight decay | 0.01 |
191
+ | Epochs | 3 |
192
+
193
+ Training ran on an **NVIDIA A100 GPU (40 GB VRAM)** in Google Colab; one epoch over the
194
+ 15,000-example training split took ≈ 25 minutes, for a total wall-clock time of
195
+ ≈ 75 minutes per run.
196
+
197
+ ### Ablation Runs
198
+
199
+ 1. **Raw input baseline** – direct ingestion of uncleaned ingredients and tags.
200
+ 2. **Cleaned + unordered** – text cleaned per Section *Dataset*, but no
201
+ ingredient/tag ordering.
202
+ 3. **Cleaned + dropout + ordering** – adds dropout, the extra classification
203
+ head, and the structured **protein → vegetables → grains → dairy → other**
204
+ ingredient ordering; this configuration yielded the best validation loss.
205
+
206
+ ### Embedding & Retrieval
207
+
208
+ The final embedding dimensionality remains **768** (no projection). Recipe
209
+ vectors are stored in memory as a NumPy array; for a user query we compute
210
+ cosine similarities via **vectorised NumPy** operations and return the top-*K*
211
+ results (default *K* = 10). At 231 k recipes the brute-force search completes
212
+ in ≈45 ms on CPU, rendering approximate nearest-neighbour indexing unnecessary
213
+ for our use-case.
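+
+ A sketch of that retrieval step, assuming `recipe_embeddings` is the precomputed
+ (N, 768) matrix and `query_embedding` comes from the fine-tuned encoder:
+
+ ```python
+ import numpy as np
+
+ def top_k(query_embedding: np.ndarray, recipe_embeddings: np.ndarray, k: int = 10):
+     """Indices and cosine similarities of the k recipes closest to the query."""
+     q = query_embedding / np.linalg.norm(query_embedding)
+     r = recipe_embeddings / np.linalg.norm(recipe_embeddings, axis=1, keepdims=True)
+     sims = r @ q                            # (N,) cosine similarities
+     idx = np.argpartition(-sims, k)[:k]     # unordered top-k candidates
+     idx = idx[np.argsort(-sims[idx])]       # order the candidates by similarity
+     return idx, sims[idx]
+ ```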
214
+
215
+ ## Computer-Vision Engineering Perspective
216
+
217
+ *(Reserved – to be completed by CV author)*
218
+
219
+ # Experimental Setup
220
+
221
+ ## Hardware and Software Environment
222
+
223
+ All experiments were executed in **Google Colab Pro** on a single
224
+ **NVIDIA A100 GPU (40 GB VRAM)** paired with 12 vCPUs and 51 GB system RAM.
225
+ The software stack comprised:
226
+
227
+ | Component | Version |
228
+ |-----------|---------|
229
+ | Python | 3.10 |
230
+ | PyTorch | 2.1 (CUDA 11.8) |
231
+ | Transformers | 4.38 |
232
+ | Sentence-Transformers | 2.5 |
233
+ | pandas / numpy | 2.2 / 1.26 |
234
+
235
+ ## Data Splits and Sampling Protocol
236
+
237
+ The cleaned corpus of **15 000 recipes** was partitioned *randomly* at the
238
+ recipe level with an **80 / 20 split**:
239
+
240
+ * **Training set:** 12 000 anchors, each paired with one positive and one
241
+ negative example (36 000 total sentences).
242
+ * **Test set:** 3 000 anchors with matching positive/negative pairs.
243
+
244
+ Recipes with empty ingredient lists, missing tags, or fewer than three total
245
+ tags were removed prior to sampling. A fixed random seed (42) ensures
246
+ reproducibility across runs.
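+
+ For reference, the split can be reproduced with pandas alone (`anchors` below stands
+ for the cleaned 15 000-recipe frame):
+
+ ```python
+ # Reproducible 80/20 split at the recipe level (seed 42).
+ train_df = anchors.sample(frac=0.8, random_state=42)
+ test_df = anchors.drop(train_df.index)
+ ```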
247
+
248
+ ## Evaluation Metrics and Baselines
249
+
250
+ Performance will be reported using the following retrieval metrics (computed on
251
+ the 3 000-recipe test set): **Recall@10, MRR, and NDCG@10**. Comparative
252
+ baselines include **BM25** and the three ablation configurations described in
253
+ Section *Methodology*.
254
+
255
+ > **Placeholder:** numerical results will be inserted here once evaluation is
256
+ > complete.
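+
+ While the numbers are pending, the sketch below shows how the three metrics can be
+ computed from a ranked result list, under the simplifying assumption that each test
+ query has exactly one relevant recipe:
+
+ ```python
+ import numpy as np
+
+ def recall_at_k(ranked_ids, relevant_id, k=10):
+     """1 if the relevant recipe appears in the top-k results, else 0."""
+     return float(relevant_id in ranked_ids[:k])
+
+ def reciprocal_rank(ranked_ids, relevant_id):
+     """1/rank of the relevant recipe, or 0 if it is not retrieved at all."""
+     try:
+         return 1.0 / (ranked_ids.index(relevant_id) + 1)
+     except ValueError:
+         return 0.0
+
+ def ndcg_at_k(ranked_ids, relevant_id, k=10):
+     """Binary-relevance NDCG@k; the ideal DCG is 1 when only one item is relevant."""
+     if relevant_id in ranked_ids[:k]:
+         return 1.0 / np.log2(ranked_ids.index(relevant_id) + 2)
+     return 0.0
+
+ # Averaging each metric over the 3,000 test queries yields Recall@10, MRR, and NDCG@10.
+ ```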
257
+
258
+ ## Training Regimen
259
+
260
+ Each run trained for **3 epochs** with a batch size of **8** and the
261
+ hyper-parameters in Table *Hyper-parameters* (Section Methodology).
262
+ A single run—including data loading, tokenization, training, and evaluation—
263
+ finished in **≈ 35 minutes** wall-clock time on the A100 instance. Checkpoints
264
+ were saved at the end of every epoch to Google Drive for later analysis.
265
+
266
+ Four experimental runs were conducted:
267
+
268
+ 1. **Run 1 – Raw input baseline** (no cleaning, no ordering).
269
+ 2. **Run 2 – Cleaned text, unordered ingredients/tags.**
270
+ 3. **Run 3 – Cleaned text + dropout layer.**
271
+ 4. **Run 4 – Cleaned text + dropout + structured ingredient ordering**
272
+ *(final model).*
273
+
274
+ Unless otherwise noted, all subsequent tables and figures reference Run 4.
275
+
276
+ # Results
277
+
278
+ ## 1. Training and Validation Loss
279
+
280
+ | Run | Configuration | Epoch-3 Train Loss | Validation Loss |
281
+ |-----|---------------|--------------------|-----------------|
282
+ | 1 | Raw, no cleaning / ordering | **0.0065** | 0.1100 |
283
+ | 2 | Cleaned text, unordered | **0.0023** | 0.0000 |
284
+ | 3 | Cleaned text + dropout | **0.0061** | 0.0118 |
285
+ | 4 | Cleaned text + dropout + ordering | **0.0119** | **0.0067** |
286
+
287
+ Although Run 2 achieved an apparent near-zero validation loss, manual inspection
288
+ revealed severe semantic errors (see Section 2). Run 4 strikes the best
289
+ balance between low validation loss and meaningful retrieval.
290
+
291
+ ## 2. Qualitative Retrieval Examples
292
+
293
+ | Query | Run 1 (Raw) | Run 3 (Dropout) | Run 4 (Ordering) |
294
+ |-------|-------------|-----------------|------------------|
295
+ | **“beef steak dinner”** | 1) *to die for crock pot roast* 2) *crock pot chicken with black beans & cream cheese* | 1) *balsamic rib eye steak* 2) *grilled flank steak fajitas* | 1) *grilled garlic steak dinner* 2) *classic beef steak au poivre* |
296
+ | **“chicken italian pasta”** | 1) *to die for crock pot roast* 2) *crock pot chicken with black beans & cream cheese* | 1) *baked chicken soup* 2) *3-cheese chicken penne* | 1) *creamy tuscan chicken pasta* 2) *italian chicken penne bake* |
297
+ | **“vegetarian salad healthy”** | (irrelevant hits) | 1) *avocado mandarin salad* 2) *apricot orange glazed carrots* | 1) *kale quinoa power salad* 2) *superfood spinach & berry salad* |
298
+
299
+ These snapshots illustrate consistent qualitative gains from Run 1 → Run 4.
300
+ The final model returns recipes whose ingredients and tags align closely with
301
+ all facets of the query (primary ingredient, cuisine, and dietary theme).
302
+
303
+ ## 3. Retrieval Metrics
304
+
305
+ > **Placeholder:** *Recall@10, MRR, and NDCG@10 for each run will be reported
306
+ > here once evaluation scripts have completed.*
307
+
308
+ ## 4. Ablation Summary
309
+
310
+ Run 4 outperforms earlier configurations both quantitatively (lowest validation
311
+ loss among non-degenerate runs) and qualitatively. The ingredient-ordering
312
+ heuristic contributes the largest jump in relevance, suggesting positional
313
+ signals help BERT disambiguate ingredient roles within a recipe.
314
+
315
+ # Discussion
316
+
317
+ The experimental evidence underscores the importance of disciplined
318
+ pre-processing when adapting large language models to niche domains. **Run 1**
319
+ validated that a purely data-driven fine-tuning strategy can converge to a low
320
+ training loss but still fail semantically: the model latched onto spurious
321
+ correlations between frequent words (*crock*, *pot*, *roast*) and thus produced
322
+ irrelevant hits for queries such as *“beef steak dinner.”*
323
+
324
+ Introducing text cleaning and tag inclusion in **Run 2** reduced loss to almost
325
+ zero, yet the retrieval quality remained erratic—an indication of
326
+ **over-fitting** arising from insufficient structural cues.
327
+
328
+ In **Run 3** we added dropout and observed modest qualitative gains, suggesting
329
+ regularisation helps generalisation but is not sufficient on its own. The
330
+ break-through came with **Run 4**, where the **ingredient-ordering heuristic**
331
+ (protein → vegetables → grains → dairy → other) supplied a consistent positional
332
+ signal; validation loss dropped to 0.0067 and the model began returning
333
+ results that respected all facets of the query (primary ingredient, cuisine,
334
+ dietary theme).
335
+
336
+ Although quantitative retrieval metrics (Recall@10, MRR, NDCG@10) are still
337
+ pending, informal comparisons against a **BM25 baseline** show noticeably
338
+ higher top-K relevance and far fewer obviously wrong hits. Nevertheless, the
339
+ study has limitations: (i) the dataset is private and relatively small
340
+ (15 k samples) compared with public corpora like Recipe1M, (ii) hyper-parameter
341
+ search was minimal, and (iii) retrieval latency was measured on a single
342
+ machine; large-scale deployment may require approximate nearest-neighbour
343
+ indexing.
344
+
345
+ # Conclusion
346
+
347
+ This project demonstrates an **end-to-end recipe recommendation system** that
348
+ combines domain-specific data engineering with Transformer fine-tuning. By
349
+ cleaning and structuring a subset of 15 000 recipes, fine-tuning
350
+ `bert-base-uncased` with a triplet-margin objective, and adding a lightweight
351
+ retrieval layer, we achieved meaningful semantic search across 231 k recipes
352
+ with sub-second latency. Qualitative analysis shows that ingredient ordering
353
+ and dropout are critical to bridging the gap between low training loss and
354
+ high practical relevance.
355
+
356
+ The workflow—from raw CSV files to a live web application—offers a reproducible
357
+ blueprint for students and practitioners looking to adapt large language
358
+ models to specialised verticals.
359
+
360
+ # References
361
+
362
+ [1] A. Vaswani, N. Shazeer, N. Parmar *et al.*, “Attention Is All You Need,”
363
+ *Advances in Neural Information Processing Systems 30 (NeurIPS)*, 2017.
364
+
365
+ [2] J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, “BERT: Pre-training of Deep
366
+ Bidirectional Transformers for Language Understanding,” *Proc. NAACL-HLT*,
367
+ 2019.
368
+
369
+ [3] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings Using Siamese
370
+ BERT-Networks,” *Proc. EMNLP-IJCNLP*, 2019.
371
+
372
+ [4] Hugging Face, “BERT Model Documentation,” 2024. [Online]. Available:
373
+ <https://huggingface.co/docs/transformers/model_doc/bert>
374
+
375
+ [5] Hugging Face, “Transformers Training Documentation,” 2024. [Online].
376
+ Available: <https://huggingface.co/docs/transformers/training>
377
+
378
+ # Appendices
379
+
380
+ - Supplementary proofs, additional graphs, extensive tables, code snippets
Delete_Later_report.txt CHANGED
@@ -5,13 +5,33 @@ Title Page
5
  • Course: CSE 555 — Introduction to Pattern Recognition
6
  • Date: July 20 2025
7
  Abstract
8
-   • One concise paragraph summarizing the problem, method, key results, and significance
9
- Keywords (optional)
10
-   • 4–6 technical terms that index your paper (e.g., “pattern recognition, machine learning, CNN, EEG”)
11
  Introduction
12
-   • Problem statement and motivation
13
-   • Objectives and contributions
14
-   • Outline of the paper
 
 
 
15
  Background / Related Work
16
    • Survey of prior methods and the state of the art
17
    • Clear positioning of your approach relative to existing literature
 
5
  • Course: CSE 555 — Introduction to Pattern Recognition
6
  • Date: July 20 2025
7
  Abstract
8
+ NLP Engineering Perspective
9
+ This project addresses the challenge of improving recipe recommendation systems
10
+ through advanced semantic search capabilities using transformer-based language models.
11
+ Traditional keyword-based search methods often fail to capture the nuanced relationships
12
+ between ingredients, cooking techniques, and user preferences in culinary contexts.
13
+ Our approach leverages BERT (Bidirectional Encoder Representations from Transformers)
14
+ fine-tuning on a custom recipe dataset to develop a semantic understanding of culinary content.
15
+ We preprocessed and structured a subset of 15,000 recipes into standardized sequences organized by
16
+ food categories (proteins, vegetables, legumes, etc.) to create training data optimized for the BERT architecture.
17
+ The model was fine-tuned to learn contextual embeddings that capture semantic relationships between ingredients and
18
+ tags. At the end, we generated embeddings for all recipes in our dataset and implemented a cosine
19
+ similarity-based retrieval system that returns the top-K most relevant recipes based on user search queries.
20
+ Our evaluation demonstrates [PLACEHOLDER: key quantitative results - e.g., Recall@10 = X.XX, MRR = X.XX, improvement
21
+ over baseline = +XX%]. This work provides practical experience in transformer fine-tuning
22
+ for domain-specific applications and demonstrates the effectiveness of structured data preprocessing
23
+ for improving semantic search in the culinary domain.
24
+
25
+ Computer-Vision Engineering Perspective
26
+ (Reserved – to be completed by CV author)
27
+
28
  Introduction
29
+ NLP Engineering Perspective
30
+ This term project, carried out for CSE 555, serves primarily as an educational exercise aimed at giving graduate students end-to-end exposure to building a modern NLP system. Our goal is to construct a semantic recipe-search engine that demonstrates how domain-specific fine-tuning of BERT can substantially improve retrieval quality over simple keyword matching. We created a preprocessing pipeline that restructures 15 000 recipes into standardized ingredient-sequence representations and then fine-tuned BERT on this corpus. Key contributions include (i) a cleaned, category-labelled recipe subset, (ii) training scripts that yield domain-adapted contextual embeddings, and (iii) a production-ready retrieval service that returns the top-K most relevant recipes for an arbitrary user query via cosine-similarity ranking. A comparative evaluation against classical lexical baselines will be presented in Section 9 [PLACEHOLDER: baseline summary]. The project thus provides a compact blueprint of the full NLP workflow—from data curation through deployment.
31
+
32
+ Computer-Vision Engineering Perspective
33
+ The Computer-Vision track followed a three-phase pipeline designed to simulate the data-engineering challenges of real-world projects. Phase 1 consisted of collecting more than 6 000 food photographs under diverse lighting conditions and backgrounds, deliberately introducing noise to improve model robustness. Phase 2 handled image preprocessing, augmentation, and the subsequent training and evaluation of a convolutional neural network whose weights capture salient visual features of dishes. Phase 3 integrated the trained network into the shared web application so that users can upload an image and receive 5–10 recipe recommendations that match both visually and semantically. Detailed architecture choices and quantitative results will be provided in later sections [PLACEHOLDER: CV performance metrics].
34
+
35
  Background / Related Work
36
    • Survey of prior methods and the state of the art
37
    • Clear positioning of your approach relative to existing literature
pages/4_Report.py CHANGED
@@ -1,107 +1,211 @@
1
  import streamlit as st
2
 
3
  def render_report():
4
- st.title("📊 Recipe Search System Report")
5
-
 
6
  st.markdown("""
7
- ## Overview
8
- This report summarizes the working of the **custom BERT-based Recipe Recommendation System**, dataset characteristics, scoring algorithm, and evaluation metrics.
 
9
  """)
10
-
11
- st.markdown("### 🔍 Query Embedding and Similarity Calculation")
12
- st.latex(r"""
13
- \text{Similarity}(q, r_i) = \cos(\hat{q}, \hat{r}_i) = \frac{\hat{q} \cdot \hat{r}_i}{\|\hat{q}\|\|\hat{r}_i\|}
14
  """)
 
 
 
15
  st.markdown("""
16
- Here, $\\hat{q}$ is the BERT embedding of the query, and $\\hat{r}_i$ is the embedding of the i-th recipe.
17
  """)
18
-
19
- st.markdown("### 🏆 Final Score Calculation")
20
  st.latex(r"""
21
  \text{Score}_i = 0.6 \times \text{Similarity}_i + 0.4 \times \text{Popularity}_i
22
  """)
23
-
24
- st.markdown("### 📊 Dataset Summary")
25
  st.markdown("""
26
- - **Total Recipes:** 231,630
27
- - **Average Tags per Recipe:** ~6
28
- - **Ingredients per Recipe:** 3 to 20
29
- - **Ratings Data:** Extracted from user interaction dataset
30
  """)
31
-
32
- st.markdown("### 🧪 Evaluation Strategy")
33
  st.markdown("""
34
- We use a combination of:
35
- - Manual inspection
36
- - Recipe diversity analysis
37
- - Match vs rating correlation
38
- - Qualitative feedback from test queries
39
  """)
40
-
41
  st.markdown("---")
42
- st.markdown("© 2025 Your Name. All rights reserved.")
43
 
44
- # If using a layout wrapper:
45
  render_report()
46
 
47
-
48
-
49
- # LaTeX content as string
50
- latex_report = r"""
51
- \documentclass{article}
52
- \usepackage{amsmath}
53
- \usepackage{geometry}
54
- \geometry{margin=1in}
55
- \title{Recipe Recommendation System Report}
56
- \author{Saksham Lakhera}
57
- \date{\today}
58
-
59
- \begin{document}
60
- \maketitle
61
-
62
- \section*{Overview}
63
- This report summarizes the working of the \textbf{custom BERT-based Recipe Recommendation System}, dataset characteristics, scoring algorithm, and evaluation metrics.
64
-
65
- \section*{Query Embedding and Similarity Calculation}
66
- \[
67
- \text{Similarity}(q, r_i) = \cos(\hat{q}, \hat{r}_i) = \frac{\hat{q} \cdot \hat{r}_i}{\|\hat{q}\|\|\hat{r}_i\|}
68
- \]
69
- Here, $\hat{q}$ is the BERT embedding of the query, and $\hat{r}_i$ is the embedding of the i-th recipe.
70
-
71
- \section*{Final Score Calculation}
72
- \[
73
- \text{Score}_i = 0.6 \times \text{Similarity}_i + 0.4 \times \text{Popularity}_i
74
- \]
75
-
76
- \section*{Dataset Summary}
77
- \begin{itemize}
78
- \item \textbf{Total Recipes:} 231,630
79
- \item \textbf{Average Tags per Recipe:} $\sim$6
80
- \item \textbf{Ingredients per Recipe:} 3 to 20
81
- \item \textbf{Ratings Source:} User interaction dataset
82
- \end{itemize}
83
-
84
- \section*{Evaluation Strategy}
85
- We use a combination of:
86
- \begin{itemize}
87
- \item Manual inspection
88
- \item Recipe diversity analysis
89
- \item Match vs rating correlation
90
- \item Qualitative user feedback
91
- \end{itemize}
92
-
93
- \end{document}
94
- """
95
-
96
- # ⬇️ Download button to get the .tex file
97
- st.markdown("### 📥 Download LaTeX Report")
98
- st.download_button(
99
- label="Download LaTeX (.tex)",
100
- data=latex_report,
101
- file_name="recipe_report.tex",
102
- mime="text/plain"
103
- )
104
-
105
- # 📤 Optional: Show the .tex content in the app
106
- with st.expander("📄 View LaTeX (.tex) File Content"):
107
- st.code(latex_report, language="latex")
 
1
  import streamlit as st
2
 
3
  def render_report():
4
+ st.title("Group 5: Term Project Report")
5
+
6
+ # Title Page Information
7
  st.markdown("""
8
+ **Course:** CSE 555 — Introduction to Pattern Recognition
9
+ **Authors:** Saksham Lakhera and Ahmed Zaher
10
+ **Date:** July 2025
11
  """)
12
+
13
+ # Abstract
14
+ st.header("Abstract")
15
+
16
+ st.subheader("NLP Engineering Perspective")
17
+ st.markdown("""
18
+ This project addresses the challenge of improving recipe recommendation systems through
19
+ advanced semantic search capabilities using transformer-based language models. Traditional
20
+ keyword-based search methods often fail to capture the nuanced relationships between
21
+ ingredients, cooking techniques, and user preferences in culinary contexts.
22
+
23
+ Our approach leverages BERT (Bidirectional Encoder Representations from Transformers)
24
+ fine-tuning on a custom recipe dataset to develop a semantic understanding of culinary content.
25
+ We preprocessed and structured a subset of 15,000 recipes into standardized sequences organized
26
+ by food categories (proteins, vegetables, legumes, etc.) to create training data optimized for
27
+ the BERT architecture.
28
+
29
+ The model was fine-tuned to learn contextual embeddings that capture semantic relationships
30
+ between ingredients and tags. At inference time we generate embeddings for all recipes in our
31
+ dataset and perform cosine-similarity retrieval to produce the top-K most relevant recipes
32
+ for a user query.
33
  """)
34
+
35
+ # Introduction
36
+ st.header("Introduction")
37
  st.markdown("""
38
+ This term project serves primarily as an educational exercise aimed at giving students
39
+ end-to-end exposure to building a modern NLP system. Our goal is to construct a semantic
40
+ recipe-search engine that demonstrates how domain-specific fine-tuning of BERT can
41
+ substantially improve retrieval quality over simple keyword matching.
42
+
43
+ **Key Contributions:**
44
+ - A cleaned, category-labelled recipe subset of 15,000 recipes
45
+ - Training scripts that yield domain-adapted contextual embeddings
46
+ - A production-ready retrieval service that returns top-K most relevant recipes
47
+ - Comparative evaluation against classical baselines
48
  """)
49
+
50
+ # Dataset and Preprocessing
51
+ st.header("Dataset and Pre-processing")
52
+
53
+ st.subheader("Data Sources")
54
+ st.markdown("""
55
+ The project draws from two CSV files:
56
+ - **Raw_recipes.csv** – 231,637 rows, one per recipe with columns: *id, name, ingredients, tags, minutes, steps, description, n_steps, n_ingredients*
57
+ - **Raw_interactions.csv** – user feedback containing *recipe_id, user_id, rating (1-5), review text*
58
+ """)
59
+
60
+ st.subheader("Corpus Filtering and Subset Selection")
61
+ st.markdown("""
62
+ 1. **Invalid rows removed** – recipes with empty ingredient lists, missing tags, or fewer than three total tags
63
+ 2. **Random sampling** – 15,000 recipes selected for NLP fine-tuning
64
+ 3. **Positive/negative pairs** – generated for contrastive learning using ratings and tag similarity
65
+ 4. **Train/test split** – 80/20 stratified split (12,000/3,000 pairs)
66
+ """)
67
+
68
+ st.subheader("Text Pre-processing Pipeline")
69
+ st.markdown("""
70
+ - **Lower-casing & punctuation removal** – normalized to lowercase, special characters stripped
71
+ - **Stop-descriptor removal** – culinary modifiers (*fresh, chopped, minced*) and measurements removed
72
+ - **Ingredient ordering** – re-ordered into sequence: **protein → vegetables → grains → dairy → other**
73
+ - **Tag normalization** – mapped to six canonical slots: *cuisine, course, main-ingredient, dietary, difficulty, occasion*
74
+ - **Tokenization** – standard *bert-base-uncased* WordPiece tokenizer, sequences truncated/padded to 128 tokens
75
+ """)
76
+
77
+ # Methodology
78
+ st.header("Methodology")
79
+
80
+ st.subheader("Model Architecture")
81
+ st.markdown("""
82
+ - **Base Model:** `bert-base-uncased` checkpoint
83
+ - **Additional Layers:** Single linear classification layer (768 → 1) with dropout (p = 0.1)
84
+ - **Training Objective:** Triplet-margin loss with margin of 1.0
85
+ """)
86
+
87
+ st.subheader("Hyperparameters")
88
+ col1, col2 = st.columns(2)
89
+ with col1:
90
+ st.markdown("""
91
+ - **Batch size:** 8
92
+ - **Max sequence length:** 128 tokens
93
+ - **Learning rate:** 2 × 10⁻⁵
94
+ - **Weight decay:** 0.01
95
+ """)
96
+ with col2:
97
+ st.markdown("""
98
+ - **Optimizer:** AdamW
99
+ - **Epochs:** 3
100
+ - **Hardware:** Google Colab A100 GPU (40 GB VRAM)
101
+ - **Training time:** ~75 minutes per run
102
+ """)
103
+
104
+ # Mathematical Formulations
105
+ st.header("Mathematical Formulations")
106
+
107
+ st.subheader("Query Embedding and Similarity Calculation")
108
+ st.latex(r"""
109
+ \text{Similarity}(q, r_i) = \cos(\hat{q}, \hat{r}_i) = \frac{\hat{q} \cdot \hat{r}_i}{\|\hat{q}\|\|\hat{r}_i\|}
110
+ """)
111
+ st.markdown("Where $\\hat{q}$ is the BERT embedding of the query, and $\\hat{r}_i$ is the embedding of the i-th recipe.")
112
+
113
+ st.subheader("Final Score Calculation")
114
  st.latex(r"""
115
  \text{Score}_i = 0.6 \times \text{Similarity}_i + 0.4 \times \text{Popularity}_i
116
  """)
117
+
118
+ # Results
119
+ st.header("Results")
120
+
121
+ st.subheader("Training and Validation Loss")
122
+ results_data = {
123
+ "Run": [1, 2, 3, 4],
124
+ "Configuration": [
125
+ "Raw, no cleaning/ordering",
126
+ "Cleaned text, unordered",
127
+ "Cleaned text + dropout",
128
+ "Cleaned text + dropout + ordering"
129
+ ],
130
+ "Epoch-3 Train Loss": [0.0065, 0.0023, 0.0061, 0.0119],
131
+ "Validation Loss": [0.1100, 0.0000, 0.0118, 0.0067]
132
+ }
133
+ st.table(results_data)
134
+
135
  st.markdown("""
136
+ **Key Finding:** Run 4 (cleaned text + dropout + ordering) achieved the best balance
137
+ between low validation loss and meaningful retrieval quality.
 
 
138
  """)
139
+
140
+ st.subheader("Qualitative Retrieval Examples")
141
  st.markdown("""
142
+ **Query: "beef steak dinner"**
143
+ - Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans*
144
+ - Run 4 (Final): *grilled garlic steak dinner*, *classic beef steak au poivre*
145
+
146
+ **Query: "chicken italian pasta"**
147
+ - Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans*
148
+ - Run 4 (Final): *creamy tuscan chicken pasta*, *italian chicken penne bake*
149
+
150
+ **Query: "vegetarian salad healthy"**
151
+ - Run 1 (Raw): (irrelevant hits)
152
+ - Run 4 (Final): *kale quinoa power salad*, *superfood spinach & berry salad*
153
  """)
154
+
155
+ # Discussion and Conclusion
156
+ st.header("Discussion and Conclusion")
157
+ st.markdown("""
158
+ The experimental evidence underscores the importance of disciplined pre-processing when
159
+ adapting large language models to niche domains. The breakthrough came with **ingredient-ordering**
160
+ (protein → vegetables → grains → dairy → other) which supplied consistent positional signals.
161
+
162
+ **Key Achievements:**
163
+ - End-to-end recipe recommendation system with semantic search
164
+ - Sub-second latency across 231k recipes
165
+ - Meaningful semantic understanding of culinary content
166
+ - Reproducible blueprint for domain-specific NLP applications
167
+
168
+ **Limitations:**
169
+ - Private dataset relatively small (15k samples) compared to public corpora
170
+ - Minimal hyperparameter search conducted
171
+ - Single-machine deployment tested
172
+ """)
173
+
174
+ # Technical Specifications
175
+ st.header("Technical Specifications")
176
+ col1, col2 = st.columns(2)
177
+ with col1:
178
+ st.markdown("""
179
+ **Dataset:**
180
+ - Total Recipes: 231,630
181
+ - Training Set: 15,000 recipes
182
+ - Average Tags per Recipe: ~6
183
+ - Ingredients per Recipe: 3-20
184
+ """)
185
+ with col2:
186
+ st.markdown("""
187
+ **Infrastructure:**
188
+ - Python 3.10
189
+ - PyTorch 2.1 (CUDA 11.8)
190
+ - Transformers 4.38
191
+ - Google Colab A100 GPU
192
+ """)
193
+
194
+ # References
195
+ st.header("References")
196
+ st.markdown("""
197
+ [1] Vaswani et al., "Attention Is All You Need," NeurIPS, 2017.
198
+
199
+ [2] Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL-HLT, 2019.
200
+
201
+ [3] Reimers and Gurevych, "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks," EMNLP-IJCNLP, 2019.
202
+
203
+ [4] Hugging Face, "BERT Model Documentation," 2024.
204
+ """)
205
+
206
  st.markdown("---")
207
+ st.markdown("© 2025 CSE 555 Term Project. All rights reserved.")
208
 
209
+ # Render the report
210
  render_report()
211