# Title Page
- **Title:** Term Project
- **Authors:** Saksham Lakhera and Ahmed Zaher
- **Course:** CSE 555 — Introduction to Pattern Recognition
- **Date:** July 20, 2025
---
# Abstract
## NLP Engineering Perspective
This project addresses the challenge of improving recipe recommendation systems through
advanced semantic search capabilities using transformer-based language models. Traditional
keyword-based search methods often fail to capture the nuanced relationships between
ingredients, cooking techniques, and user preferences in culinary contexts. Our approach
leverages BERT (Bidirectional Encoder Representations from Transformers) fine-tuning on a
custom recipe dataset to develop a semantic understanding of culinary content. We
preprocessed and structured a subset of 15 000 recipes into standardized sequences organized
by food categories (proteins, vegetables, legumes, etc.) to create training data optimized for
the BERT architecture. The model was fine-tuned to learn contextual embeddings that capture
semantic relationships between ingredients and tags. At inference time we generate
embeddings for all recipes in our dataset and perform cosine-similarity retrieval to produce
the top-K most relevant recipes for a user query. Our evaluation demonstrates
[PLACEHOLDER: key quantitative results – e.g., Recall@10 = X.XX, MRR = X.XX, improvement
over baseline = +XX %]. This work provides practical experience in transformer
fine-tuning for domain-specific applications and highlights the effectiveness of structured
data preprocessing for improving semantic search in the culinary domain.
## Computer-Vision Engineering Perspective
*(Reserved – to be completed by CV author)*
---
# Introduction
## NLP Engineering Perspective
This term project serves primarily as an educational exercise aimed
at giving students end-to-end exposure to building a modern NLP system. Our goal is
to construct a semantic recipe-search engine that demonstrates how domain-specific
fine-tuning of BERT can substantially improve retrieval quality over simple keyword
matching. We created a preprocessing pipeline that restructures 15,000 recipes into
standardized ingredient-sequence representations and then fine-tuned BERT on this corpus.
Key contributions include (i) a cleaned, category-labelled recipe subset, (ii) training
scripts that yield domain-adapted contextual embeddings, and (iii) a production-ready
retrieval service that returns the top-K most relevant recipes for an arbitrary user query via
cosine-similarity ranking. A comparative evaluation against classical baselines will
be presented in the Results section [PLACEHOLDER: baseline summary]. The project thus provides a
compact blueprint of the full NLP workflow – from data curation through deployment.
## Computer-Vision Engineering Perspective
*(Reserved – to be completed by CV author)*
---
# Background / Related Work
Modern recipe-recommendation research builds on recent advances in Transformer
architectures. The seminal “Attention Is All You Need” paper introduced the
self-attention mechanism that underpins today’s language models, while BERT
extended that idea to bidirectional pre-training for rich contextual
representations [1, 2]. Subsequent works such as Sentence-BERT [3] showed that
fine-tuning BERT with siamese objectives yields sentence-level embeddings well
suited to semantic search. Our project follows this line by adapting a
pre-trained BERT model to culinary text.
Domain-specific fine-tuning has proven effective in many verticals—BioBERT for
biomedical literature, SciBERT for scientific text, and so on—suggesting that
a curated corpus can capture specialist terminology more accurately than a
general model. Inspired by that pattern, we preprocess a 15 000-recipe subset
into category-aware sequences (proteins, vegetables, cuisine, cook-time, etc.)
and further fine-tune BERT to learn embeddings that encode cooking semantics.
At retrieval time we rank candidates by cosine similarity, mirroring prior work
that pairs BERT with simple vector metrics to achieve strong performance with
minimal infrastructure.
Classical lexical baselines such as TF-IDF and BM25 remain competitive for many
information-retrieval tasks; we therefore include them as comparators
[PLACEHOLDER for baseline results]. We also consult the Hugging Face
Transformers documentation for implementation details and training
best practices [4, 5]. Unlike previous public studies that rely on the Recipe1M
dataset, our corpus was provided privately by the course instructor, requiring
custom cleaning and categorization steps that, to our knowledge, have not been
documented elsewhere. This tailored pipeline distinguishes our work and lays
the groundwork for the experimental analysis presented in the following
sections.
---
# Dataset and Pre-processing
## Data Sources
The project draws from two CSV files shared by the course
instructor:
* **Raw_recipes.csv** – 231 637 rows, one per recipe. Key columns include
*`id`, `name`, `ingredients`, `tags`, `minutes`, `steps`, `description`,
`n_steps`, `n_ingredients`*.
* **Raw_interactions.csv** – user feedback containing *`recipe_id`,
`user_id`, `rating` (1-5), `review` text*. We aggregate the ratings for
each recipe to compute an **average quality score** and the **number of
reviews**.
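For illustration, the rating aggregation can be expressed as a pandas group-by. This is a minimal sketch using the column names listed above, not the exact project script:

```python
import pandas as pd

# Load the interactions file described above.
interactions = pd.read_csv("Raw_interactions.csv")

# Per-recipe aggregation: mean rating -> average quality score,
# row count -> number of reviews.
recipe_feedback = (
    interactions.groupby("recipe_id")["rating"]
    .agg(avg_quality_score="mean", n_reviews="count")
    .reset_index()
)

# Attach the aggregated feedback to the recipe table.
recipes = pd.read_csv("Raw_recipes.csv")
recipes = recipes.merge(recipe_feedback, left_on="id",
                        right_on="recipe_id", how="left")
```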
A separate Computer-Vision track collected **≈ 6 000 food photographs** to
train an image classifier; that pipeline is documented in the Methodology
section.
## Corpus Filtering and Subset Selection
1. **Invalid rows removed** – recipes with empty ingredient lists, missing
tags, or fewer than three total tags were discarded.
2. **Random sampling** – from the cleaned corpus we randomly selected
**15 000 recipes** for NLP fine-tuning.
3. **Positive / negative pairs** – for contrastive learning we generated one
positive–negative pair per recipe using ratings and tag similarity.
4. **Train / test split** – an 80 / 20 stratified split (12 000 / 3 000 pairs)
was used; validation examples are drawn from the training set during
   fine-tuning via a 10 % hold-out. (Steps 1, 2, and 4 are sketched in pandas after this list.)
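A minimal pandas sketch, assuming the ingredient and tag columns are stored as stringified Python lists (the pair-generation logic of step 3 is omitted):

```python
import ast
import pandas as pd

df = pd.read_csv("Raw_recipes.csv")

# Parse the stringified ingredient and tag lists stored in the CSV.
df = df.dropna(subset=["ingredients", "tags"])
df["ingredients"] = df["ingredients"].apply(ast.literal_eval)
df["tags"] = df["tags"].apply(ast.literal_eval)

# Step 1: drop recipes with empty ingredient lists or fewer than three tags.
df = df[df["ingredients"].apply(len) > 0]
df = df[df["tags"].apply(len) >= 3]

# Step 2: randomly sample 15 000 recipes for fine-tuning.
subset = df.sample(n=15_000, random_state=42)

# Step 4: 80/20 split at the recipe level (12 000 train / 3 000 test).
train_df = subset.sample(frac=0.8, random_state=42)
test_df = subset.drop(train_df.index)
```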
## Text Pre-processing Pipeline
* **Lower-casing & punctuation removal** – all text normalised to lowercase;
punctuation and special characters stripped.
* **Stop-descriptor removal** – culinary modifiers such as *fresh*, *chopped*,
*minced*, and measurement words (*cup*, *tablespoon*, *pound*) were pruned
to reduce sparsity.
* **Ingredient ordering** – ingredient strings were re-ordered into the
sequence **protein → vegetables → grains → dairy → other** to give BERT a
consistent positional signal.
* **Tag normalisation** – tags were mapped to six canonical slots:
**cuisine, course, main-ingredient, dietary, difficulty, occasion**.
Tags shorter than three characters were dropped.
* **Tokenizer** – the standard *bert-base-uncased* WordPiece tokenizer from
Hugging Face Transformers was applied; sequences were truncated or padded
to **128 tokens**.
* **Pair construction** – each training example consists of
*〈ingredients + tags〉, 〈query or neighbouring recipe〉* with a binary label
  indicating semantic similarity. (The cleaning, ordering, and tokenization steps above are sketched in code after this list.)
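A rough sketch of the cleaning, ordering, and tokenization steps. The descriptor and category lists are illustrative stand-ins for the project's fuller lists, and the helper names (`clean`, `order_ingredients`, `encode`) are hypothetical:

```python
import re
from transformers import AutoTokenizer

# Illustrative stand-ins for the project's fuller lists.
DESCRIPTORS = {"fresh", "chopped", "minced", "cup", "cups", "tablespoon", "pound"}
CATEGORY_ORDER = ["protein", "vegetable", "grain", "dairy", "other"]
CATEGORY_OF = {"chicken": "protein", "beef": "protein", "spinach": "vegetable",
               "rice": "grain", "milk": "dairy"}  # toy ingredient -> category lookup

def clean(text: str) -> str:
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # lower-case, strip punctuation
    return " ".join(t for t in text.split() if t not in DESCRIPTORS)

def order_ingredients(ingredients: list[str]) -> list[str]:
    # Re-order into protein -> vegetables -> grains -> dairy -> other.
    def rank(ing: str) -> int:
        cat = next((CATEGORY_OF[w] for w in ing.split() if w in CATEGORY_OF), "other")
        return CATEGORY_ORDER.index(cat)
    return sorted(ingredients, key=rank)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def encode(ingredients: list[str], tags: list[str]):
    ordered = order_ingredients([clean(i) for i in ingredients])
    text = " ".join(ordered + [clean(t) for t in tags if len(t) >= 3])
    return tokenizer(text, truncation=True, padding="max_length", max_length=128)
```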
## Tools and Infrastructure
Pre-processing was scripted in **Python 3.10** using *pandas*,
*datasets*, and *transformers*. All experiments ran on a **Google Colab A100 GPU (40 GB VRAM)**; available memory comfortably supported our batch size of 8 examples.
Category imbalance was left untouched to reflect the real-world frequency of
cuisines and ingredients; however, evaluation metrics are reported with
per-category breakdowns to highlight any bias.
# Methodology
## NLP Engineering Perspective
### Model Architecture
We fine-tuned the **`bert-base-uncased`** checkpoint. A single linear
classification layer (768 → 1) was added on top of the pooled CLS vector; in
later experiments we interposed a **dropout layer (p = 0.1)** to gauge
regularisation effects.
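A minimal PyTorch sketch of this architecture (class and attribute names are ours for illustration; exactly how the 768 → 1 head is combined with the triplet objective is left to the training code):

```python
import torch.nn as nn
from transformers import AutoModel

class RecipeEncoder(nn.Module):
    """bert-base-uncased with dropout and a single 768 -> 1 linear head."""

    def __init__(self, dropout_p: float = 0.1):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(p=dropout_p)
        self.head = nn.Linear(768, 1)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = self.dropout(out.pooler_output)   # pooled [CLS] vector (768-d)
        return pooled, self.head(pooled)           # embedding, scalar score
```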
### Training Objective
Training followed a **triplet-margin loss** with a margin of **1.0**.
Each training example is an *(anchor, positive, negative)* triple derived from
the recipe-similarity pairs described in Section *Dataset & Pre-processing*.
The network was optimised so that the anchor embedding sits closer to the
positive than to the negative by at least the margin, measured in cosine distance.
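Assuming the encoder sketched above, one training step under this objective could look as follows. PyTorch's `TripletMarginWithDistanceLoss` accepts an arbitrary distance function, so cosine distance is passed together with the margin of 1.0:

```python
import torch
import torch.nn.functional as F

# Cosine distance = 1 - cosine similarity; the margin of 1.0 is expressed in this unit.
def cosine_distance(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return 1.0 - F.cosine_similarity(a, b)

triplet_loss = torch.nn.TripletMarginWithDistanceLoss(
    distance_function=cosine_distance, margin=1.0
)

def training_step(model, optimizer, batch):
    # batch["anchor"] etc. are dicts of input_ids / attention_mask tensors.
    anchor, _ = model(**batch["anchor"])
    positive, _ = model(**batch["positive"])
    negative, _ = model(**batch["negative"])
    loss = triplet_loss(anchor, positive, negative)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```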
### Hyper-parameters
| Parameter | Value |
|-------------------|-------|
| Batch size | 8 |
| Max sequence length | 128 tokens |
| Optimiser | AdamW (β₁ = 0.9, β₂ = 0.999, ε = 1e-8) |
| Learning rate | 2 × 10⁻⁵ |
| Weight decay | 0.01 |
| Epochs | 3 |
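These values map directly onto PyTorch's `AdamW` constructor; a minimal configuration sketch, assuming the encoder instance is named `model`:

```python
from torch.optim import AdamW

optimizer = AdamW(
    model.parameters(),          # `model` is the encoder sketched above
    lr=2e-5,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,
)
```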
Training ran on a **Google Colab A100 GPU (40 GB VRAM)**; one epoch over the
15,000-example training split takes ≈ 25 minutes, for a total wall-clock time of
≈ 75 minutes per run.
### Ablation Runs
1. **Raw input baseline** – direct ingestion of uncleaned ingredients and tags.
2. **Cleaned + unordered** – text cleaned per Section *Dataset*, but no
ingredient/tag ordering.
3. **Cleaned + dropout + ordering** – adds dropout, the extra classification
head, and the structured **protein → vegetables → grains → dairy → other**
ingredient ordering; this configuration yielded the best validation loss.
### Embedding & Retrieval
The final embedding dimensionality remains **768** (no projection). Recipe
vectors are stored in memory as a NumPy array; for a user query we compute
cosine similarities via **vectorised NumPy** operations and return the top-*K*
results (default *K* = 10). At 231 k recipes the brute-force search completes
in ≈45 ms on CPU, rendering approximate nearest-neighbour indexing unnecessary
for our use-case.
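A minimal NumPy sketch of this brute-force retrieval step (function and variable names are ours; it assumes the recipe matrix was L2-normalised once up front):

```python
import numpy as np

def top_k(query_vec: np.ndarray, recipe_vecs: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the k most similar recipes by cosine similarity.

    `recipe_vecs` is an (N, 768) matrix that has been L2-normalised once;
    `query_vec` is the 768-d embedding of the user query.
    """
    q = query_vec / np.linalg.norm(query_vec)
    scores = recipe_vecs @ q                     # cosine similarity via dot product
    top = np.argpartition(-scores, k)[:k]        # unordered top-k candidates
    return top[np.argsort(-scores[top])]         # sorted by descending similarity
```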
## Computer-Vision Engineering Perspective
*(Reserved – to be completed by CV author)*
# Experimental Setup
## Hardware and Software Environment
All experiments were executed in **Google Colab Pro** on a single
**NVIDIA A100 GPU (40 GB VRAM)** paired with 12 vCPUs and 51 GB system RAM.
The software stack comprised:
| Component | Version |
|-----------|---------|
| Python | 3.10 |
| PyTorch | 2.1 (CUDA 11.8) |
| Transformers | 4.38 |
| Sentence-Transformers | 2.5 |
| pandas / numpy | 2.2 / 1.26 |
## Data Splits and Sampling Protocol
The cleaned corpus of **15 000 recipes** was partitioned *randomly* at the
recipe level with an **80 / 20 split**:
* **Training set:** 12 000 anchors, each paired with one positive and one
negative example (36 000 total sentences).
* **Test set:** 3 000 anchors with matching positive/negative pairs.
Recipes with empty ingredient lists, missing tags, or fewer than three tags
were removed prior to sampling. A fixed random seed (42) ensures
reproducibility across runs.
## Evaluation Metrics and Baselines
Performance will be reported using the following retrieval metrics (computed on
the 3 000-recipe test set): **Recall@10, MRR, and NDCG@10**. Comparative
baselines include **BM25** and the three ablation configurations described in
Section *Methodology*.
> **Placeholder:** numerical results will be inserted here once evaluation is
> complete.
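For reference, the metric definitions we use are the standard ones; a minimal NumPy sketch assuming binary relevance judgements per query (the actual evaluation scripts are not reproduced here):

```python
import numpy as np

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant recipes that appear in the top-k ranking."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant recipe (0 if none retrieved)."""
    for rank, rid in enumerate(ranked_ids, start=1):
        if rid in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k under binary relevance."""
    gains = [1.0 if rid in relevant_ids else 0.0 for rid in ranked_ids[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal > 0 else 0.0
```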
## Training Regimen
Each run trained for **3 epochs** with a batch size of **8** and the
hyper-parameters listed in the *Hyper-parameters* table of the Methodology section.
A single run—including data loading, tokenization, training, and evaluation—
finished in **≈ 35 minutes** wall-clock time on the A100 instance. Checkpoints
were saved at the end of every epoch to Google Drive for later analysis.
Four experimental runs were conducted:
1. **Run 1 – Raw input baseline** (no cleaning, no ordering).
2. **Run 2 – Cleaned text, unordered ingredients/tags.**
3. **Run 3 – Cleaned text + dropout layer.**
4. **Run 4 – Cleaned text + dropout + structured ingredient ordering**
*(final model).*
Unless otherwise noted, all subsequent tables and figures reference Run 4.
# Results
## 1. Training and Validation Loss
| Run | Configuration | Epoch-3 Train Loss | Validation Loss |
|-----|---------------|--------------------|-----------------|
| 1 | Raw, no cleaning / ordering | **0.0065** | 0.1100 |
| 2 | Cleaned text, unordered | **0.0023** | 0.0000 |
| 3 | Cleaned text + dropout | **0.0061** | 0.0118 |
| 4 | Cleaned text + dropout + ordering | **0.0119** | **0.0067** |
Although Run 2 achieved an apparent near-zero validation loss, manual inspection
revealed severe semantic errors (see Section 2). Run 4 strikes the best
balance between low validation loss and meaningful retrieval.
## 2. Qualitative Retrieval Examples
| Query | Run 1 (Raw) | Run 3 (Dropout) | Run 4 (Ordering) |
|-------|-------------|-----------------|------------------|
| **“beef steak dinner”** | 1) *to die for crock pot roast* 2) *crock pot chicken with black beans & cream cheese* | 1) *balsamic rib eye steak* 2) *grilled flank steak fajitas* | 1) *grilled garlic steak dinner* 2) *classic beef steak au poivre* |
| **“chicken italian pasta”** | 1) *to die for crock pot roast* 2) *crock pot chicken with black beans & cream cheese* | 1) *baked chicken soup* 2) *3-cheese chicken penne* | 1) *creamy tuscan chicken pasta* 2) *italian chicken penne bake* |
| **“vegetarian salad healthy”** | (irrelevant hits) | 1) *avocado mandarin salad* 2) *apricot orange glazed carrots* | 1) *kale quinoa power salad* 2) *superfood spinach & berry salad* |
These snapshots illustrate consistent qualitative gains from Run 1 → Run 4.
The final model returns recipes whose ingredients and tags align closely with
all facets of the query (primary ingredient, cuisine, and dietary theme).
## 3. Retrieval Metrics
> **Placeholder:** *Recall@10, MRR, and NDCG@10 for each run will be reported
> here once evaluation scripts have completed.*
## 4. Ablation Summary
Run 4 outperforms earlier configurations both quantitatively (lowest validation
loss among non-degenerate runs) and qualitatively. The ingredient-ordering
heuristic contributes the largest jump in relevance, suggesting positional
signals help BERT disambiguate ingredient roles within a recipe.
# Discussion
The experimental evidence underscores the importance of disciplined
pre-processing when adapting large language models to niche domains. **Run 1**
validated that a purely data-driven fine-tuning strategy can converge to a low
training loss but still fail semantically: the model latched onto spurious
correlations between frequent words (*crock*, *pot*, *roast*) and thus produced
irrelevant hits for queries such as *“beef steak dinner.”*
Introducing text cleaning and tag inclusion in **Run 2** reduced loss to almost
zero, yet the retrieval quality remained erratic—an indication of
**over-fitting** arising from insufficient structural cues.
In **Run 3** we added dropout and observed modest qualitative gains, suggesting
regularisation helps generalisation but is not sufficient on its own. The
breakthrough came with **Run 4**, where the **ingredient-ordering heuristic**
(protein → vegetables → grains → dairy → other) supplied a consistent positional
signal; validation loss dropped to 0.0067 and the model began returning
results that respected all facets of the query (primary ingredient, cuisine,
dietary theme).
Although quantitative retrieval metrics (Recall@10, MRR, NDCG@10) are still
pending, informal comparisons against a **BM25 baseline** show noticeably
higher top-K relevance and far fewer obviously wrong hits. Nevertheless, the
study has limitations: (i) the dataset is private and relatively small
(15 k samples) compared with public corpora like Recipe1M, (ii) hyper-parameter
search was minimal, and (iii) retrieval latency was measured on a single
machine; large-scale deployment may require approximate nearest-neighbour
indexing.
# Conclusion
This project demonstrates an **end-to-end recipe recommendation system** that
combines domain-specific data engineering with Transformer fine-tuning. By
cleaning and structuring a subset of 15 000 recipes, fine-tuning
`bert-base-uncased` with a triplet-margin objective, and adding a lightweight
retrieval layer, we achieved meaningful semantic search across 231 k recipes
with sub-second latency. Qualitative analysis shows that ingredient ordering
and dropout are critical to bridging the gap between low training loss and
high practical relevance.
The workflow—from raw CSV files to a live web application—offers a reproducible
blueprint for students and practitioners looking to adapt large language
models to specialised verticals.
# References
[1] A. Vaswani, N. Shazeer, N. Parmar *et al.*, “Attention Is All You Need,”
*Advances in Neural Information Processing Systems 30 (NeurIPS)*, 2017.
[2] J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, “BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding,” *Proc. NAACL-HLT*,
2019.
[3] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings Using Siamese
BERT-Networks,” *Proc. EMNLP-IJCNLP*, 2019.
[4] Hugging Face, “BERT Model Documentation,” 2024. [Online]. Available:
<https://huggingface.co/docs/transformers/model_doc/bert>
[5] Hugging Face, “Transformers Training Documentation,” 2024. [Online].
Available: <https://huggingface.co/docs/transformers/training>
# Appendices
- Supplementary proofs, additional graphs, extensive tables, code snippets