Title Page
- Title: Term Project
- Authors: Saksham Lakhera and Ahmed Zaher
- Course: CSE 555 — Introduction to Pattern Recognition
- Date: July 20, 2025
Abstract
NLP Engineering Perspective
This project addresses the challenge of improving recipe recommendation systems through advanced semantic search capabilities using transformer-based language models. Traditional keyword-based search methods often fail to capture the nuanced relationships between ingredients, cooking techniques, and user preferences in culinary contexts. Our approach leverages BERT (Bidirectional Encoder Representations from Transformers) fine-tuning on a custom recipe dataset to develop a semantic understanding of culinary content. We preprocessed and structured a subset of 15,000 recipes into standardized sequences organized by food categories (proteins, vegetables, legumes, etc.) to create training data optimized for the BERT architecture. The model was fine-tuned to learn contextual embeddings that capture semantic relationships between ingredients and tags. At inference time we generate embeddings for all recipes in our dataset and perform cosine-similarity retrieval to produce the top-K most relevant recipes for a user query. Our evaluation demonstrates [PLACEHOLDER: key quantitative results – e.g., Recall@10 = X.XX, MRR = X.XX, improvement over baseline = +XX %]. This work provides practical experience in transformer fine-tuning for domain-specific applications and highlights the effectiveness of structured data preprocessing for improving semantic search in the culinary domain.
Computer-Vision Engineering Perspective
(Reserved – to be completed by CV author)
Introduction
NLP Engineering Perspective
This term project serves primarily as an educational exercise aimed at giving students end-to-end exposure to building a modern NLP system. Our goal is to construct a semantic recipe-search engine that demonstrates how domain-specific fine-tuning of BERT can substantially improve retrieval quality over simple keyword matching. We created a preprocessing pipeline that restructures 15,000 recipes into standardized ingredient-sequence representations and then fine-tuned BERT on this corpus. Key contributions include (i) a cleaned, category-labelled recipe subset, (ii) training scripts that yield domain-adapted contextual embeddings, and (iii) a production-ready retrieval service that returns the top-K most relevant recipes for an arbitrary user query via cosine-similarity ranking. A comparative evaluation against classical baselines will be presented in the Results section [PLACEHOLDER: baseline summary]. The project thus provides a compact blueprint of the full NLP workflow – from data curation through deployment.
Computer-Vision Engineering Perspective
(Reserved – to be completed by CV author)
Background / Related Work
Modern recipe-recommendation research builds on recent advances in Transformer architectures. The seminal “Attention Is All You Need” paper introduced the self-attention mechanism that underpins today’s language models, while BERT extended that idea to bidirectional pre-training for rich contextual representations [1, 2]. Subsequent works such as Sentence-BERT [3] showed that fine-tuning BERT with siamese objectives yields sentence-level embeddings well suited to semantic search. Our project follows this line by adapting a pre-trained BERT model to culinary text.
Domain-specific fine-tuning has proven effective in many verticals—BioBERT for biomedical literature, SciBERT for scientific text, and so on—suggesting that a curated corpus can capture specialist terminology more accurately than a general model. Inspired by that pattern, we preprocess a 15,000-recipe subset into category-aware sequences (proteins, vegetables, cuisine, cook-time, etc.) and further fine-tune BERT to learn embeddings that encode cooking semantics. At retrieval time we rank candidates by cosine similarity, mirroring prior work that pairs BERT with simple vector metrics to achieve strong performance with minimal infrastructure.
Classical lexical baselines such as TF-IDF and BM25 remain competitive for many information-retrieval tasks; we therefore include them as comparators [PLACEHOLDER for baseline results]. We also consult the Hugging Face Transformers documentation for implementation details and training best practices [4, 5]. Unlike previous public studies that rely on the Recipe1M dataset, our corpus was provided privately by the course instructor, requiring custom cleaning and categorization steps that, to our knowledge, have not been documented elsewhere. This tailored pipeline distinguishes our work and lays the groundwork for the experimental analysis presented in the following sections.
Dataset and Pre-processing
Data Sources
The project draws from two CSV files shared by the course instructor:
- Raw_recipes.csv – 231,637 rows, one per recipe. Key columns include id, name, ingredients, tags, minutes, steps, description, n_steps, n_ingredients.
- Raw_interactions.csv – user feedback containing recipe_id, user_id, rating (1–5), and review text. We aggregate the ratings for each recipe to compute an average quality score and the number of reviews.
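For illustration, a minimal pandas sketch of this rating aggregation (column names follow the files described above; paths and the merge step are illustrative, not the project's exact script):

```python
import pandas as pd

# Load the two instructor-provided CSV files (paths are illustrative).
recipes = pd.read_csv("Raw_recipes.csv")
interactions = pd.read_csv("Raw_interactions.csv")

# Aggregate per-recipe feedback: average quality score and review count.
quality = (
    interactions.groupby("recipe_id")["rating"]
    .agg(avg_rating="mean", n_reviews="count")
    .reset_index()
)

# Attach the aggregated scores to the recipe table.
recipes = recipes.merge(quality, left_on="id", right_on="recipe_id", how="left")
```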
A separate Computer-Vision track collected ≈ 6,000 food photographs to train an image classifier; that pipeline is documented in the Methodology section.
Corpus Filtering and Subset Selection
- Invalid rows removed – recipes with empty ingredient lists, missing tags, or fewer than three total tags were discarded.
- Random sampling – from the cleaned corpus we randomly selected 15,000 recipes for NLP fine-tuning.
- Positive / negative pairs – for contrastive learning we generated one positive–negative pair per recipe using ratings and tag similarity (see the sketch after this list).
- Train / test split – an 80/20 stratified split (12,000 / 3,000 pairs) was used; validation examples are drawn from the training set during fine-tuning via a 10% hold-out.
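The report does not fix an exact pairing rule, so the sketch below assumes one plausible criterion: positives share many tags with the anchor and are well rated, negatives share few tags. The thresholds and the 4.0-rating cut-off are hypothetical.

```python
import random

def tag_jaccard(a: set, b: set) -> float:
    """Tag-overlap similarity between two recipes."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def make_pair(anchor: dict, corpus: list[dict],
              pos_thresh: float = 0.5, neg_thresh: float = 0.1):
    """Pick one positive and one negative for the anchor.

    Hypothetical criterion: a positive shares many tags and is well rated;
    a negative shares few tags. Assumes at least one candidate of each kind.
    """
    positives = [r for r in corpus
                 if r["id"] != anchor["id"]
                 and tag_jaccard(anchor["tags"], r["tags"]) >= pos_thresh
                 and r.get("avg_rating", 0) >= 4.0]
    negatives = [r for r in corpus
                 if tag_jaccard(anchor["tags"], r["tags"]) <= neg_thresh]
    return random.choice(positives), random.choice(negatives)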
Text Pre-processing Pipeline
- Lower-casing & punctuation removal – all text normalised to lowercase; punctuation and special characters stripped.
- Stop-descriptor removal – culinary modifiers such as fresh, chopped, minced, and measurement words (cup, tablespoon, pound) were pruned to reduce sparsity.
- Ingredient ordering – ingredient strings were re-ordered into the sequence protein → vegetables → grains → dairy → other to give BERT a consistent positional signal.
- Tag normalisation – tags were mapped to six canonical slots: cuisine, course, main-ingredient, dietary, difficulty, occasion. Tags shorter than three characters were dropped.
- Tokenizer – the standard bert-base-uncased WordPiece tokenizer from Hugging Face Transformers was applied; sequences were truncated or padded to 128 tokens.
- Pair construction – each training example pairs 〈ingredients + tags〉 with 〈query or neighbouring recipe〉 and carries a binary label indicating semantic similarity (the cleaning and ordering steps above are sketched below).
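A condensed sketch of the cleaning and ordering steps. The category keyword lists shown are small illustrative samples, not the project's full vocabularies:

```python
import re

# Target ordering and a few illustrative keywords per category.
CATEGORY_ORDER = ["protein", "vegetables", "grains", "dairy", "other"]
CATEGORY_KEYWORDS = {
    "protein": {"chicken", "beef", "pork", "tofu", "shrimp"},
    "vegetables": {"onion", "carrot", "spinach", "tomato", "kale"},
    "grains": {"rice", "pasta", "quinoa", "flour", "oats"},
    "dairy": {"milk", "cheese", "butter", "yogurt", "cream"},
}
STOP_DESCRIPTORS = {"fresh", "chopped", "minced", "cup", "tablespoon", "pound"}

def clean(text: str) -> str:
    """Lower-case, strip punctuation, and drop descriptor/measurement words."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return " ".join(w for w in text.split() if w not in STOP_DESCRIPTORS)

def categorize(ingredient: str) -> str:
    """Assign an ingredient to the first matching category, else 'other'."""
    for cat in CATEGORY_ORDER[:-1]:
        if any(k in ingredient for k in CATEGORY_KEYWORDS[cat]):
            return cat
    return "other"

def order_ingredients(ingredients: list[str]) -> list[str]:
    """Re-order cleaned ingredients: protein -> vegetables -> grains -> dairy -> other."""
    cleaned = [clean(i) for i in ingredients]
    return sorted(cleaned, key=lambda i: CATEGORY_ORDER.index(categorize(i)))

# Example: order_ingredients(["Chopped Onion", "1 cup rice", "Beef chuck"])
# -> ["beef chuck", "onion", "1 rice"]
```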
Tools and Infrastructure
Pre-processing was scripted in Python 3.10 using pandas, datasets, and transformers. All experiments ran on a Google Colab A100 GPU (40 GB VRAM); available memory comfortably supported our batch size of 8 examples.
Category imbalance was left untouched to reflect the real-world frequency of cuisines and ingredients; however, evaluation metrics are reported with per-category breakdowns to highlight any bias.
Methodology
NLP Engineering Perspective
Model Architecture
We fine-tuned the bert-base-uncased checkpoint. A single linear classification layer (768 → 1) was added on top of the pooled CLS vector; in later experiments we interposed a dropout layer (p = 0.1) to gauge regularisation effects.
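A minimal sketch of this architecture, assuming the head sits on BERT's pooled output as described; the exact wiring in the project code may differ:

```python
import torch
from torch import nn
from transformers import BertModel

class RecipeEncoder(nn.Module):
    """bert-base-uncased with the dropout layer and linear head described above."""

    def __init__(self, dropout_p: float = 0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(p=dropout_p)
        self.head = nn.Linear(768, 1)  # pooled CLS vector -> scalar

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.pooler_output           # (batch, 768) embedding used for retrieval
        return pooled, self.head(self.dropout(pooled))
```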
Training Objective
Training followed a triplet-margin loss with a margin of 1.0. Each mini-batch contained (anchor, positive, negative) tuples derived from the recipe-similarity pairs described in Section Dataset and Pre-processing. The network was optimised to push the anchor embedding at least one cosine-distance unit closer to the positive than to the negative.
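In PyTorch terms, this objective can be expressed with the built-in distance-based triplet loss (a sketch; anchor, positive, and negative are embeddings produced by the encoder above):

```python
import torch
import torch.nn.functional as F

# Cosine distance = 1 - cosine similarity, so the margin of 1.0 is in
# cosine-distance units, matching the description above.
def cosine_distance(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return 1.0 - F.cosine_similarity(x, y)

triplet_loss = torch.nn.TripletMarginWithDistanceLoss(
    distance_function=cosine_distance, margin=1.0
)

# anchor, positive, negative: (batch, 768) embeddings from the fine-tuned encoder.
# loss = triplet_loss(anchor, positive, negative)
```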
Hyper-parameters
| Parameter | Value |
|---|---|
| Batch size | 8 |
| Max sequence length | 128 tokens |
| Optimiser | AdamW (β₁ = 0.9, β₂ = 0.999, ε = 1e-8) |
| Learning rate | 2 × 10⁻⁵ |
| Weight decay | 0.01 |
| Epochs | 3 |
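The optimiser configuration corresponding to this table (a sketch; `model` is the encoder from the previous subsection, and no learning-rate scheduler is shown):

```python
from torch.optim import AdamW

optimizer = AdamW(
    model.parameters(),   # parameters of the fine-tuned BERT encoder
    lr=2e-5,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,
)
```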
Training ran on a Google Colab A100 GPU (40 GB VRAM); one epoch over the 12,000-example training split takes ≈ 25 minutes, for a total wall-clock time of ≈ 75 minutes per run.
Ablation Runs
- Raw input baseline – direct ingestion of uncleaned ingredients and tags.
- Cleaned + unordered – text cleaned per Section Dataset, but no ingredient/tag ordering.
- Cleaned + dropout + ordering – adds dropout on the classification head and the structured protein → vegetables → grains → dairy → other ingredient ordering; this configuration yielded the best validation loss.
Embedding & Retrieval
The final embedding dimensionality remains 768 (no projection). Recipe vectors are stored in memory as a NumPy array; for a user query we compute cosine similarities via vectorised NumPy operations and return the top-K results (default K = 10). At 231 k recipes the brute-force search completes in ≈ 45 ms on CPU, rendering approximate nearest-neighbour indexing unnecessary for our use case.
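A sketch of this brute-force retrieval step, assuming recipe embeddings have been pre-computed into a single NumPy matrix:

```python
import numpy as np

def top_k(query_vec: np.ndarray, recipe_matrix: np.ndarray, k: int = 10) -> np.ndarray:
    """Return indices of the K most cosine-similar recipes, best first.

    query_vec: (768,); recipe_matrix: (n_recipes, 768).
    """
    # Normalise rows so a plain dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    m = recipe_matrix / np.linalg.norm(recipe_matrix, axis=1, keepdims=True)
    sims = m @ q
    # Partial sort for the top K, then order those K by similarity.
    idx = np.argpartition(-sims, k)[:k]
    return idx[np.argsort(-sims[idx])]
```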
Computer-Vision Engineering Perspective
(Reserved – to be completed by CV author)
Experimental Setup
Hardware and Software Environment
All experiments were executed in Google Colab Pro on a single NVIDIA A100 GPU (40 GB VRAM) paired with 12 vCPUs and 51 GB system RAM. The software stack comprised:
| Component | Version |
|---|---|
| Python | 3.10 |
| PyTorch | 2.1 (CUDA 11.8) |
| Transformers | 4.38 |
| Sentence-Transformers | 2.5 |
| pandas / numpy | 2.2 / 1.26 |
Data Splits and Sampling Protocol
The cleaned corpus of 15,000 recipes was partitioned at the recipe level with a stratified 80/20 split:
- Training set: 12,000 anchors, each paired with one positive and one negative example (36,000 total sentences).
- Test set: 3,000 anchors with matching positive/negative pairs.
Recipes with empty ingredient lists, missing tags, or fewer than three total tags were removed prior to sampling. A fixed random seed (42) ensures reproducibility across runs.
Evaluation Metrics and Baselines
Performance will be reported using the following retrieval metrics (computed on the 3,000-recipe test set): Recall@10, MRR, and NDCG@10. Comparative baselines include BM25 and the three ablation configurations described in Section Methodology.
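For reference, minimal implementations of two of these metrics, assuming each test query has a single relevant recipe (NDCG@10 follows analogously):

```python
def recall_at_k(ranked_ids: list, relevant_id, k: int = 10) -> int:
    """1 if the relevant recipe appears in the top-K results, else 0."""
    return int(relevant_id in ranked_ids[:k])

def reciprocal_rank(ranked_ids: list, relevant_id) -> float:
    """1/rank of the relevant recipe (0.0 if absent); MRR averages this over queries."""
    for rank, rid in enumerate(ranked_ids, start=1):
        if rid == relevant_id:
            return 1.0 / rank
    return 0.0
```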
Placeholder: numerical results will be inserted here once evaluation is complete.
Training Regimen
Each run trained for 3 epochs with a batch size of 8 and the hyper-parameters in Table Hyper-parameters (Section Methodology). A single run—including data loading, tokenization, training, and evaluation—finished in ≈ 75 minutes of wall-clock time on the A100 instance, consistent with the per-epoch timing reported in Section Methodology. Checkpoints were saved at the end of every epoch to Google Drive for later analysis.
Four experimental runs were conducted:
- Run 1 – Raw input baseline (no cleaning, no ordering).
- Run 2 – Cleaned text, unordered ingredients/tags.
- Run 3 – Cleaned text + dropout layer.
- Run 4 – Cleaned text + dropout + structured ingredient ordering (final model).
Unless otherwise noted, all subsequent tables and figures reference Run 4.
Results
1. Training and Validation Loss
| Run | Configuration | Epoch-3 Train Loss | Validation Loss |
|---|---|---|---|
| 1 | Raw, no cleaning / ordering | 0.0065 | 0.1100 |
| 2 | Cleaned text, unordered | 0.0023 | 0.0000 |
| 3 | Cleaned text + dropout | 0.0061 | 0.0118 |
| 4 | Cleaned text + dropout + ordering | 0.0119 | 0.0067 |
Although Run 2 achieved an apparent near-zero validation loss, manual inspection revealed severe semantic errors (see Section 2). Run 4 strikes the best balance between low validation loss and meaningful retrieval.
2. Qualitative Retrieval Examples
| Query | Run 1 (Raw) | Run 3 (Dropout) | Run 4 (Ordering) |
|---|---|---|---|
| “beef steak dinner” | 1) to die for crock pot roast 2) crock pot chicken with black beans & cream cheese | 1) balsamic rib eye steak 2) grilled flank steak fajitas | 1) grilled garlic steak dinner 2) classic beef steak au poivre |
| “chicken italian pasta” | 1) to die for crock pot roast 2) crock pot chicken with black beans & cream cheese | 1) baked chicken soup 2) 3-cheese chicken penne | 1) creamy tuscan chicken pasta 2) italian chicken penne bake |
| “vegetarian salad healthy” | (irrelevant hits) | 1) avocado mandarin salad 2) apricot orange glazed carrots | 1) kale quinoa power salad 2) superfood spinach & berry salad |
These snapshots illustrate consistent qualitative gains from Run 1 → Run 4. The final model returns recipes whose ingredients and tags align closely with all facets of the query (primary ingredient, cuisine, and dietary theme).
3. Retrieval Metrics
Placeholder: Recall@10, MRR, and NDCG@10 for each run will be reported here once evaluation scripts have completed.
4. Ablation Summary
Run 4 outperforms earlier configurations both quantitatively (lowest validation loss among non-degenerate runs) and qualitatively. The ingredient-ordering heuristic contributes the largest jump in relevance, suggesting positional signals help BERT disambiguate ingredient roles within a recipe.
Discussion
The experimental evidence underscores the importance of disciplined pre-processing when adapting large language models to niche domains. Run 1 validated that a purely data-driven fine-tuning strategy can converge to a low training loss but still fail semantically: the model latched onto spurious correlations between frequent words (crock, pot, roast) and thus produced irrelevant hits for queries such as “beef steak dinner.”
Introducing text cleaning and tag inclusion in Run 2 reduced loss to almost zero, yet the retrieval quality remained erratic—an indication of over-fitting arising from insufficient structural cues.
In Run 3 we added dropout and observed modest qualitative gains, suggesting regularisation helps generalisation but is not sufficient on its own. The breakthrough came with Run 4, where the ingredient-ordering heuristic (protein → vegetables → grains → dairy → other) supplied a consistent positional signal; validation loss dropped to 0.0067 and the model began returning results that respected all facets of the query (primary ingredient, cuisine, dietary theme).
Although quantitative retrieval metrics (Recall@10, MRR, NDCG@10) are still pending, informal comparisons against a BM25 baseline show noticeably higher top-K relevance and far fewer obviously wrong hits. Nevertheless, the study has limitations: (i) the dataset is private and relatively small (15 k samples) compared with public corpora like Recipe1M, (ii) hyper-parameter search was minimal, and (iii) retrieval latency was measured on a single machine; large-scale deployment may require approximate nearest-neighbour indexing.
Conclusion
This project demonstrates an end-to-end recipe recommendation system that combines domain-specific data engineering with Transformer fine-tuning. By cleaning and structuring a subset of 15,000 recipes, fine-tuning bert-base-uncased with a triplet-margin objective, and adding a lightweight retrieval layer, we achieved meaningful semantic search across 231,637 recipes with sub-second latency. Qualitative analysis shows that ingredient ordering and dropout are critical to bridging the gap between low training loss and high practical relevance.
The workflow—from raw CSV files to a live web application—offers a reproducible blueprint for students and practitioners looking to adapt large language models to specialised verticals.
References
[1] A. Vaswani, N. Shazeer, N. Parmar et al., “Attention Is All You Need,” Advances in Neural Information Processing Systems 30 (NeurIPS), 2017.
[2] J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” Proc. NAACL-HLT, 2019.
[3] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks,” Proc. EMNLP-IJCNLP, 2019.
[4] Hugging Face, “BERT Model Documentation,” 2024. [Online]. Available: https://huggingface.co/docs/transformers/model_doc/bert
[5] Hugging Face, “Transformers Training Documentation,” 2024. [Online]. Available: https://huggingface.co/docs/transformers/training
Appendices
- Supplementary proofs, additional graphs, extensive tables, code snippets