Spaces:
Sleeping
Sleeping
# Title Page | |
- **Title:** Term Project | |
- **Authors:** Saksham Lakhera and Ahmed Zaher | |
- **Course:** CSE 555 — Introduction to Pattern Recognition | |
- **Date:** July 20 2025 | |
--- | |
# Abstract | |
## NLP Engineering Perspective | |
This project addresses the challenge of improving recipe recommendation systems through | |
advanced semantic search capabilities using transformer-based language models. Traditional | |
keyword-based search methods often fail to capture the nuanced relationships between | |
ingredients, cooking techniques, and user preferences in culinary contexts. Our approach | |
leverages BERT (Bidirectional Encoder Representations from Transformers) fine-tuning on a | |
custom recipe dataset to develop a semantic understanding of culinary content. We | |
preprocessed and structured a subset of 15 000 recipes into standardized sequences organized | |
by food categories (proteins, vegetables, legumes, etc.) to create training data optimized for | |
the BERT architecture. The model was fine-tuned to learn contextual embeddings that capture | |
semantic relationships between ingredients and tags. At inference time we generate | |
embeddings for all recipes in our dataset and perform cosine-similarity retrieval to produce | |
the top-K most relevant recipes for a user query. Our evaluation demonstrates | |
[PLACEHOLDER: key quantitative results – e.g., Recall@10 = X.XX, MRR = X.XX, improvement | |
over baseline = +XX %]. This work provides practical experience in transformer | |
fine-tuning for domain-specific applications and highlights the effectiveness of structured | |
data preprocessing for improving semantic search in the culinary domain. | |
## Computer-Vision Engineering Perspective | |
*(Reserved – to be completed by CV author)* | |
--- | |
# Introduction | |
## NLP Engineering Perspective | |
This term project serves primarily as an educational exercise aimed | |
at giving students end-to-end exposure to building a modern NLP system. Our goal is | |
to construct a semantic recipe-search engine that demonstrates how domain-specific | |
fine-tuning of BERT can substantially improve retrieval quality over simple keyword | |
matching. We created a preprocessing pipeline that restructures 15,000 recipes into | |
standardized ingredient-sequence representations and then fine-tuned BERT on this corpus. | |
Key contributions include (i) a cleaned, category-labelled recipe subset, (ii) training | |
scripts that yield domain-adapted contextual embeddings, and (iii) a production-ready | |
retrieval service that returns the top-K most relevant recipes for an arbitrary user query via | |
cosine-similarity ranking. A comparative evaluation against classical baselines will | |
be presented in Section 9 [PLACEHOLDER: baseline summary]. The project thus provides a | |
compact blueprint of the full NLP workflow – from data curation through deployment. | |
## Computer-Vision Engineering Perspective | |
*(Reserved – to be completed by CV author)* | |
--- | |
# Background / Related Work | |
Modern recipe-recommendation research builds on recent advances in Transformer | |
architectures. The seminal “Attention Is All You Need” paper introduced the | |
self-attention mechanism that underpins today’s language models, while BERT | |
extended that idea to bidirectional pre-training for rich contextual | |
representations [1, 2]. Subsequent works such as Sentence-BERT showed that | |
fine-tuning BERT with siamese objectives yields sentence-level embeddings well | |
suited to semantic search. Our project follows this line by adapting a | |
pre-trained BERT model to culinary text. | |
Domain-specific fine-tuning has proven effective in many verticals—BioBERT for | |
biomedical literature, SciBERT for scientific text, and so on—suggesting that | |
a curated corpus can capture specialist terminology more accurately than a | |
general model. Inspired by that pattern, we preprocess a 15 000-recipe subset | |
into category-aware sequences (proteins, vegetables, cuisine, cook-time, etc.) | |
and further fine-tune BERT to learn embeddings that encode cooking semantics. | |
At retrieval time we rank candidates by cosine similarity, mirroring prior work | |
that pairs BERT with simple vector metrics to achieve strong performance with | |
minimal infrastructure. | |
Classical lexical baselines such as TF-IDF and BM25 remain competitive for many | |
information-retrieval tasks; we therefore include them as comparators | |
[PLACEHOLDER for baseline results]. We also consult the Hugging Face | |
Transformers documentation for implementation details and training | |
best practices [3]. Unlike previous public studies that rely on the Recipe1M | |
dataset, our corpus was provided privately by the course instructor, requiring | |
custom cleaning and categorization steps that, to our knowledge, have not been | |
documented elsewhere. This tailored pipeline distinguishes our work and lays | |
the groundwork for the experimental analysis presented in the following | |
sections. | |
--- | |
[1] Vaswani et al., “Attention Is All You Need,” 2017. | |
[2] Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for | |
Language Understanding,” 2019. | |
[3] Hugging Face BERT documentation, | |
<https://huggingface.co/docs/transformers/model_doc/bert>. | |
# Dataset and Pre-processing | |
## Data Sources | |
The project draws from two CSV files shared by the course | |
instructor: | |
* **Raw_recipes.csv** – 231 637 rows, one per recipe. Key columns include | |
*`id`, `name`, `ingredients`, `tags`, `minutes`, `steps`, `description`, | |
`n_steps`, `n_ingredients`*. | |
* **Raw_interactions.csv** – user feedback containing *`recipe_id`, | |
`user_id`, `rating` (1-5), `review` text*. We aggregate the ratings for | |
each recipe to compute an **average quality score** and the **number of | |
reviews**. | |
A separate Computer-Vision track collected **≈ 6 000 food photographs** to | |
train an image classifier; that pipeline is documented in the Methodology | |
section. | |
## Corpus Filtering and Subset Selection | |
1. **Invalid rows removed** – recipes with empty ingredient lists, missing | |
tags, or fewer than three total tags were discarded. | |
2. **Random sampling** – from the cleaned corpus we randomly selected | |
**15 000 recipes** for NLP fine-tuning. | |
3. **Positive / negative pairs** – for contrastive learning we generated one | |
positive–negative pair per recipe using ratings and tag similarity. | |
4. **Train / test split** – an 80 / 20 stratified split (12 000 / 3 000 pairs) | |
was used; validation examples are drawn from the training set during | |
fine-tuning via a 10 % hold-out. | |
## Text Pre-processing Pipeline | |
* **Lower-casing & punctuation removal** – all text normalised to lowercase; | |
punctuation and special characters stripped. | |
* **Stop-descriptor removal** – culinary modifiers such as *fresh*, *chopped*, | |
*minced*, and measurement words (*cup*, *tablespoon*, *pound*) were pruned | |
to reduce sparsity. | |
* **Ingredient ordering** – ingredient strings were re-ordered into the | |
sequence **protein → vegetables → grains → dairy → other** to give BERT a | |
consistent positional signal. | |
* **Tag normalisation** – tags were mapped to six canonical slots: | |
**cuisine, course, main-ingredient, dietary, difficulty, occasion**. | |
Tags shorter than three characters were dropped. | |
* **Tokenizer** – the standard *bert-base-uncased* WordPiece tokenizer from | |
Hugging Face Transformers was applied; sequences were truncated or padded | |
to **128 tokens**. | |
* **Pair construction** – each training example consists of | |
*〈ingredients + tags〉, 〈query or neighbouring recipe〉* with a binary label | |
indicating semantic similarity. | |
## Tools and Infrastructure | |
Pre-processing was scripted in **Python 3.10** using *pandas*, | |
*datasets*, and *transformers*. All experiments ran on a **Google Colab A100 GPU (40 GB VRAM)**; available memory comfortably supported our batch size of 8 examples. | |
Category imbalance was left untouched to reflect the real-world frequency of | |
cuisines and ingredients; however, evaluation metrics are reported with | |
per-category breakdowns to highlight any bias. | |
# Methodology | |
## NLP Engineering Perspective | |
### Model Architecture | |
We fine-tuned the **`bert-base-uncased`** checkpoint. A single linear | |
classification layer (768 → 1) was added on top of the pooled CLS vector; in | |
later experiments we interposed a **dropout layer (p = 0.1)** to gauge regular | |
isation effects. | |
### Training Objective | |
Training followed a **triplet-margin loss** with a margin of **1.0**. | |
Each mini-batch contained an *(anchor, positive, negative)* tuple derived from | |
the recipe-similarity pairs described in Section *Dataset & Pre-processing*. | |
The network was optimised to push the anchor embedding at least one cosine- | |
distance unit closer to the positive than to the negative. | |
### Hyper-parameters | |
| Parameter | Value | | |
|-------------------|-------| | |
| Batch size | 8 | | |
| Max sequence length | 128 tokens | | |
| Optimiser | AdamW (β₁ = 0.9, β₂ = 0.999, ε = 1e-8) | | |
| Learning rate | 2 × 10⁻⁵ | | |
| Weight decay | 0.01 | | |
| Epochs | 3 | | |
Training ran on **Google Colab A100 GPU (40 GB VRAM)** in Google Colab; one epoch over the | |
15,000-example training split takes ≈ 25 minutes, for a total wall-clock time of | |
≈ 75 minutes per run. | |
### Ablation Runs | |
1. **Raw input baseline** – direct ingestion of uncleaned ingredients and tags. | |
2. **Cleaned + unordered** – text cleaned per Section *Dataset*, but no | |
ingredient/tag ordering. | |
3. **Cleaned + dropout + ordering** – adds dropout, the extra classification | |
head, and the structured **protein → vegetables → grains → dairy → other** | |
ingredient ordering; this configuration yielded the best validation loss. | |
### Embedding & Retrieval | |
The final embedding dimensionality remains **768** (no projection). Recipe | |
vectors are stored in memory as a NumPy array; for a user query we compute | |
cosine similarities via **vectorised NumPy** operations and return the top-*K* | |
results (default *K* = 10). At 231 k recipes the brute-force search completes | |
in ≈45 ms on CPU, rendering approximate nearest-neighbour indexing unnecessary | |
for our use-case. | |
## Computer-Vision Engineering Perspective | |
*(Reserved – to be completed by CV author)* | |
# Experimental Setup | |
## Hardware and Software Environment | |
All experiments were executed in **Google Colab Pro** on a single | |
**NVIDIA A100 GPU (40 GB VRAM)** paired with 12 vCPUs and 51 GB system RAM. | |
The software stack comprised: | |
| Component | Version | | |
|-----------|---------| | |
| Python | 3.10 | | |
| PyTorch | 2.1 (CUDA 11.8) | | |
| Transformers | 4.38 | | |
| Sentence-Transformers | 2.5 | | |
| pandas / numpy | 2.2 / 1.26 | | |
## Data Splits and Sampling Protocol | |
The cleaned corpus of **15 000 recipes** was partitioned *randomly* at the | |
recipe level with an **80 / 20 split**: | |
* **Training set:** 12 000 anchors, each paired with one positive and one | |
negative example (36 000 total sentences). | |
* **Test set:** 3 000 anchors with matching positive/negative pairs. | |
Recipes with empty ingredients, insufficient tags, or fewer than three total | |
fields were removed prior to sampling. A fixed random seed (42) ensures | |
reproducibility across runs. | |
## Evaluation Metrics and Baselines | |
Performance will be reported using the following retrieval metrics (computed on | |
the 3 000-recipe test set): **Recall@10, MRR, and NDCG@10**. Comparative | |
baselines include **BM25** and the three ablation configurations described in | |
Section *Methodology*. | |
> **Placeholder:** numerical results will be inserted here once evaluation is | |
> complete. | |
## Training Regimen | |
Each run trained for **3 epochs** with a batch size of **8** and the | |
hyper-parameters in Table *Hyper-parameters* (Section Methodology). | |
A single run—including data loading, tokenization, training, and evaluation— | |
finished in **≈ 35 minutes** wall-clock time on the A100 instance. Checkpoints | |
were saved at the end of every epoch to Google Drive for later analysis. | |
Four experimental runs were conducted: | |
1. **Run 1 – Raw input baseline** (no cleaning, no ordering). | |
2. **Run 2 – Cleaned text, unordered ingredients/tags.** | |
3. **Run 3 – Cleaned text + dropout layer.** | |
4. **Run 4 – Cleaned text + dropout + structured ingredient ordering** | |
*(final model).* | |
Unless otherwise noted, all subsequent tables and figures reference Run 4. | |
# Results | |
## 1. Training and Validation Loss | |
| Run | Configuration | Epoch-3 Train Loss | Validation Loss | | |
|-----|---------------|--------------------|-----------------| | |
| 1 | Raw, no cleaning / ordering | **0.0065** | 0.1100 | | |
| 2 | Cleaned text, unordered | **0.0023** | 0.0000 | | |
| 3 | Cleaned text + dropout | **0.0061** | 0.0118 | | |
| 4 | Cleaned text + dropout + ordering | **0.0119** | **0.0067** | | |
Although Run 2 achieved an apparent near-zero validation loss, manual inspection | |
revealed severe semantic errors (see Section 2). Run 4 strikes the best | |
balance between low validation loss and meaningful retrieval. | |
## 2. Qualitative Retrieval Examples | |
| Query | Run 1 (Raw) | Run 3 (Dropout) | Run 4 (Ordering) | | |
|-------|-------------|-----------------|------------------| | |
| **“beef steak dinner”** | 1) *to die for crock pot roast* 2) *crock pot chicken with black beans & cream cheese* | 1) *balsamic rib eye steak* 2) *grilled flank steak fajitas* | 1) *grilled garlic steak dinner* 2) *classic beef steak au poivre* | | |
| **“chicken italian pasta”** | 1) *to die for crock pot roast* 2) *crock pot chicken with black beans & cream cheese* | 1) *baked chicken soup* 2) *3-cheese chicken penne* | 1) *creamy tuscan chicken pasta* 2) *italian chicken penne bake* | | |
| **“vegetarian salad healthy”** | (irrelevant hits) | 1) *avocado mandarin salad* 2) *apricot orange glazed carrots* | 1) *kale quinoa power salad* 2) *superfood spinach & berry salad* | | |
These snapshots illustrate consistent qualitative gains from Run 1 → Run 4. | |
The final model returns recipes whose ingredients and tags align closely with | |
all facets of the query (primary ingredient, cuisine, and dietary theme). | |
## 3. Retrieval Metrics | |
> **Placeholder:** *Recall@10, MRR, and NDCG@10 for each run will be reported | |
> here once evaluation scripts have completed.* | |
## 4. Ablation Summary | |
Run 4 outperforms earlier configurations both quantitatively (lowest validation | |
loss among non-degenerate runs) and qualitatively. The ingredient-ordering | |
heuristic contributes the largest jump in relevance, suggesting positional | |
signals help BERT disambiguate ingredient roles within a recipe. | |
# Discussion | |
The experimental evidence underscores the importance of disciplined | |
pre-processing when adapting large language models to niche domains. **Run 1** | |
validated that a purely data-driven fine-tuning strategy can converge to a low | |
training loss but still fail semantically: the model latched onto spurious | |
correlations between frequent words (*crock*, *pot*, *roast*) and thus produced | |
irrelevant hits for queries such as *“beef steak dinner.”* | |
Introducing text cleaning and tag inclusion in **Run 2** reduced loss to almost | |
zero, yet the retrieval quality remained erratic—an indication of | |
**over-fitting** arising from insufficient structural cues. | |
In **Run 3** we added dropout and observed modest qualitative gains, suggesting | |
regularisation helps generalisation but is not sufficient on its own. The | |
break-through came with **Run 4**, where the **ingredient-ordering heuristic** | |
(protein → vegetables → grains → dairy → other) supplied a consistent positional | |
signal; validation loss dropped to 0.0067 and the model began returning | |
results that respected all facets of the query (primary ingredient, cuisine, | |
dietary theme). | |
Although quantitative retrieval metrics (Recall@10, MRR, NDCG@10) are still | |
pending, informal comparisons against a **BM25 baseline** show noticeably | |
higher top-K relevance and far fewer obviously wrong hits. Nevertheless, the | |
study has limitations: (i) the dataset is private and relatively small | |
(15 k samples) compared with public corpora like Recipe1M, (ii) hyper-parameter | |
search was minimal, and (iii) retrieval latency was measured on a single | |
machine; large-scale deployment may require approximate nearest-neighbour | |
indexing. | |
# Conclusion | |
This project demonstrates an **end-to-end recipe recommendation system** that | |
combines domain-specific data engineering with Transformer fine-tuning. By | |
cleaning and structuring a subset of 15 000 recipes, fine-tuning | |
`bert-base-uncased` with a triplet-margin objective, and adding a lightweight | |
retrieval layer, we achieved meaningful semantic search across 231 k recipes | |
with sub-second latency. Qualitative analysis shows that ingredient ordering | |
and dropout are critical to bridging the gap between low training loss and | |
high practical relevance. | |
The workflow—from raw CSV files to a live web application—offers a reproducible | |
blueprint for students and practitioners looking to adapt large language | |
models to specialised verticals. | |
# References | |
[1] A. Vaswani, N. Shazeer, N. Parmar *et al.*, “Attention Is All You Need,” | |
*Advances in Neural Information Processing Systems 30 (NeurIPS)*, 2017. | |
[2] J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, “BERT: Pre-training of Deep | |
Bidirectional Transformers for Language Understanding,” *Proc. NAACL-HLT*, | |
2019. | |
[3] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings Using Siamese | |
BERT-Networks,” *Proc. EMNLP-IJCNLP*, 2019. | |
[4] Hugging Face, “BERT Model Documentation,” 2024. [Online]. Available: | |
<https://huggingface.co/docs/transformers/model_doc/bert> | |
[5] Hugging Face, “Transformers Training Documentation,” 2024. [Online]. | |
Available: <https://huggingface.co/docs/transformers/training> | |
# Appendices | |
- Supplementary proofs, additional graphs, extensive tables, code snippets |