# Title Page
- **Title:** Term Project
- **Authors:** Saksham Lakhera and Ahmed Zaher
- **Course:** CSE 555 — Introduction to Pattern Recognition
- **Date:** July 20, 2025

---
# Abstract
## NLP Engineering Perspective
This project addresses the challenge of improving recipe recommendation systems through
advanced semantic search capabilities using transformer-based language models. Traditional
keyword-based search methods often fail to capture the nuanced relationships between
ingredients, cooking techniques, and user preferences in culinary contexts. Our approach
leverages BERT (Bidirectional Encoder Representations from Transformers) fine-tuning on a
custom recipe dataset to develop a semantic understanding of culinary content. We
preprocessed and structured a subset of 15,000 recipes into standardized sequences organized
by food categories (proteins, vegetables, legumes, etc.) to create training data optimized for
the BERT architecture. The model was fine-tuned to learn contextual embeddings that capture
semantic relationships between ingredients and tags. At inference time we generate
embeddings for all recipes in our dataset and perform cosine-similarity retrieval to produce
the top-K most relevant recipes for a user query. Our evaluation demonstrates
[PLACEHOLDER: key quantitative results – e.g., Recall@10 = X.XX, MRR = X.XX, improvement
over baseline = +XX %]. This work provides practical experience in transformer
fine-tuning for domain-specific applications and highlights the effectiveness of structured
data preprocessing for improving semantic search in the culinary domain.
## Computer-Vision Engineering Perspective
*(Reserved – to be completed by CV author)*

---
# Introduction
## NLP Engineering Perspective
This term project serves primarily as an educational exercise aimed at giving students
end-to-end exposure to building a modern NLP system. Our goal is to construct a semantic
recipe-search engine that demonstrates how domain-specific fine-tuning of BERT can
substantially improve retrieval quality over simple keyword matching. We created a
preprocessing pipeline that restructures 15,000 recipes into standardized
ingredient-sequence representations and then fine-tuned BERT on this corpus.
Key contributions include (i) a cleaned, category-labelled recipe subset, (ii) training
scripts that yield domain-adapted contextual embeddings, and (iii) a production-ready
retrieval service that returns the top-K most relevant recipes for an arbitrary user query via
cosine-similarity ranking. A comparative evaluation against classical baselines will
be presented in the *Results* section [PLACEHOLDER: baseline summary]. The project thus provides a
compact blueprint of the full NLP workflow – from data curation through deployment.
## Computer-Vision Engineering Perspective
*(Reserved – to be completed by CV author)*

---
# Background / Related Work
Modern recipe-recommendation research builds on recent advances in Transformer
architectures. The seminal “Attention Is All You Need” paper introduced the
self-attention mechanism that underpins today’s language models, while BERT
extended that idea to bidirectional pre-training for rich contextual
representations [1, 2]. Subsequent works such as Sentence-BERT showed that
fine-tuning BERT with siamese objectives yields sentence-level embeddings well
suited to semantic search [3]. Our project follows this line by adapting a
pre-trained BERT model to culinary text.

Domain-specific fine-tuning has proven effective in many verticals—BioBERT for
biomedical literature, SciBERT for scientific text, and so on—suggesting that
a curated corpus can capture specialist terminology more accurately than a
general model. Inspired by that pattern, we preprocess a 15,000-recipe subset
into category-aware sequences (proteins, vegetables, cuisine, cook-time, etc.)
and further fine-tune BERT to learn embeddings that encode cooking semantics.
At retrieval time we rank candidates by cosine similarity, mirroring prior work
that pairs BERT with simple vector metrics to achieve strong performance with
minimal infrastructure.

Classical lexical baselines such as TF-IDF and BM25 remain competitive for many
information-retrieval tasks; we therefore include them as comparators
[PLACEHOLDER for baseline results]. We also consulted the Hugging Face
Transformers documentation for implementation details and training
best practices [4, 5]. Unlike previous public studies that rely on the Recipe1M
dataset, our corpus was provided privately by the course instructor, requiring
custom cleaning and categorization steps that, to our knowledge, have not been
documented elsewhere. This tailored pipeline distinguishes our work and lays
the groundwork for the experimental analysis presented in the following
sections.

---
# Dataset and Pre-processing
## Data Sources
The project draws from two CSV files shared by the course instructor:

* **Raw_recipes.csv** – 231,637 rows, one per recipe. Key columns include
  *`id`, `name`, `ingredients`, `tags`, `minutes`, `steps`, `description`,
  `n_steps`, `n_ingredients`*.
* **Raw_interactions.csv** – user feedback containing *`recipe_id`,
  `user_id`, `rating` (1-5), `review` text*. We aggregate the ratings for
  each recipe to compute an **average quality score** and the **number of
  reviews**.
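
The aggregation described above can be sketched in a few lines of pandas; file and column names follow the description of the two CSVs, while the merge details are a simplified illustration rather than our exact pipeline code:

```python
import pandas as pd

# Per-recipe average quality score and review count from Raw_interactions.csv.
interactions = pd.read_csv("Raw_interactions.csv")
recipe_stats = (
    interactions.groupby("recipe_id")["rating"]
    .agg(avg_quality_score="mean", n_reviews="count")
    .reset_index()
)

# Attach both statistics to the recipe table (recipes without reviews get NaN).
recipes = pd.read_csv("Raw_recipes.csv")
recipes = recipes.merge(
    recipe_stats, left_on="id", right_on="recipe_id", how="left"
)
```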
A separate Computer-Vision track collected **≈ 6,000 food photographs** to
train an image classifier; that pipeline is documented in the Methodology
section.
## Corpus Filtering and Subset Selection
1. **Invalid rows removed** – recipes with empty ingredient lists, missing
   tags, or fewer than three total tags were discarded.
2. **Random sampling** – from the cleaned corpus we randomly selected
   **15,000 recipes** for NLP fine-tuning.
3. **Positive / negative pairs** – for contrastive learning we generated one
   positive–negative pair per recipe using ratings and tag similarity.
4. **Train / test split** – an 80/20 stratified split (12,000 / 3,000 pairs)
   was used; validation examples are drawn from the training set during
   fine-tuning via a 10 % hold-out.
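
A condensed sketch of the filtering, sampling, and splitting steps above (pair generation is omitted, and a plain random split is shown; the stratified variant would pass a label column to `train_test_split`'s `stratify=` argument):

```python
import ast

import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("Raw_recipes.csv")

# Drop rows with missing fields, then parse the stringified list columns.
df = df.dropna(subset=["ingredients", "tags"])
df["ingredients"] = df["ingredients"].apply(ast.literal_eval)
df["tags"] = df["tags"].apply(ast.literal_eval)

# Discard recipes with empty ingredient lists or fewer than three tags.
df = df[(df["ingredients"].map(len) > 0) & (df["tags"].map(len) >= 3)]

# Randomly sample 15,000 recipes, then split 80/20 with a fixed seed.
subset = df.sample(n=15_000, random_state=42)
train_df, test_df = train_test_split(subset, test_size=0.2, random_state=42)
```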
## Text Pre-processing Pipeline
* **Lower-casing & punctuation removal** – all text normalised to lowercase;
  punctuation and special characters stripped.
* **Stop-descriptor removal** – culinary modifiers such as *fresh*, *chopped*,
  *minced*, and measurement words (*cup*, *tablespoon*, *pound*) were pruned
  to reduce sparsity.
* **Ingredient ordering** – ingredient strings were re-ordered into the
  sequence **protein → vegetables → grains → dairy → other** to give BERT a
  consistent positional signal.
* **Tag normalisation** – tags were mapped to six canonical slots:
  **cuisine, course, main-ingredient, dietary, difficulty, occasion**.
  Tags shorter than three characters were dropped.
* **Tokenizer** – the standard *bert-base-uncased* WordPiece tokenizer from
  Hugging Face Transformers was applied; sequences were truncated or padded
  to **128 tokens**.
* **Pair construction** – each training example consists of
  *〈ingredients + tags〉, 〈query or neighbouring recipe〉* with a binary label
  indicating semantic similarity.
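
To make the cleaning and ordering rules concrete, the sketch below applies them to a toy example; the stop-descriptor set and category lexicons are tiny illustrative stand-ins for the fuller vocabularies used in our pipeline:

```python
import re

STOP_DESCRIPTORS = {"fresh", "chopped", "minced", "cup", "cups", "tablespoon", "pound"}
CATEGORY_ORDER = ["protein", "vegetable", "grain", "dairy", "other"]
CATEGORY_LEXICON = {
    "protein": {"chicken", "beef", "tofu", "salmon"},
    "vegetable": {"onion", "spinach", "carrot", "tomato"},
    "grain": {"rice", "pasta", "quinoa", "flour"},
    "dairy": {"butter", "cheese", "milk", "yogurt"},
}

def clean(text: str) -> str:
    """Lower-case, strip punctuation/digits, and drop stop descriptors."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [t for t in text.split() if t not in STOP_DESCRIPTORS]
    return " ".join(tokens)

def categorize(ingredient: str) -> str:
    """Assign a cleaned ingredient to one of the five ordering categories."""
    for category, vocab in CATEGORY_LEXICON.items():
        if any(word in vocab for word in ingredient.split()):
            return category
    return "other"

def order_ingredients(ingredients: list[str]) -> list[str]:
    """Re-order cleaned ingredients as protein -> vegetables -> grains -> dairy -> other."""
    cleaned = [clean(i) for i in ingredients]
    return sorted(cleaned, key=lambda i: CATEGORY_ORDER.index(categorize(i)))

print(order_ingredients(["2 cups white rice", "fresh spinach, chopped", "chicken breast"]))
# -> ['chicken breast', 'spinach', 'white rice']
```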
## Tools and Infrastructure
Pre-processing was scripted in **Python 3.10** using *pandas*, *datasets*, and
*transformers*. All experiments ran on a **Google Colab A100 GPU (40 GB VRAM)**;
available memory comfortably supported our batch size of 8 examples.
Category imbalance was left untouched to reflect the real-world frequency of
cuisines and ingredients; however, evaluation metrics are reported with
per-category breakdowns to highlight any bias.
# Methodology
## NLP Engineering Perspective
### Model Architecture
We fine-tuned the **`bert-base-uncased`** checkpoint. A single linear
classification layer (768 → 1) was added on top of the pooled CLS vector; in
later experiments we interposed a **dropout layer (p = 0.1)** to gauge
regularisation effects.
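
A minimal PyTorch sketch of this architecture; the class name and attribute names are ours (illustrative), not the project's actual training code:

```python
import torch.nn as nn
from transformers import AutoModel

class RecipeEncoder(nn.Module):
    """bert-base-uncased with dropout and a single 768 -> 1 linear head."""

    def __init__(self, dropout_p: float = 0.1):
        super().__init__()
        self.bert = AutoModel.from_pretrained("bert-base-uncased")
        self.dropout = nn.Dropout(p=dropout_p)
        self.head = nn.Linear(768, 1)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.pooler_output        # pooled [CLS] vector, shape (batch, 768)
        embedding = self.dropout(pooled)  # used as the recipe/query embedding
        score = self.head(embedding)      # auxiliary 768 -> 1 classification head
        return embedding, score
```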
### Training Objective
Training followed a **triplet-margin loss** with a margin of **1.0**.
Each mini-batch contained an *(anchor, positive, negative)* tuple derived from
the recipe-similarity pairs described in Section *Dataset & Pre-processing*.
The network was optimised to push the anchor embedding at least one
cosine-distance unit closer to the positive than to the negative.
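
This objective maps onto PyTorch's built-in `TripletMarginWithDistanceLoss` with a cosine-distance function. The sketch below assumes the `RecipeEncoder` from the previous subsection and tokenizer-style input dicts; it is not our exact training loop:

```python
import torch.nn.functional as F
from torch import nn

# Triplet-margin loss over cosine distance (1 - cosine similarity), margin 1.0.
triplet_loss = nn.TripletMarginWithDistanceLoss(
    distance_function=lambda a, b: 1.0 - F.cosine_similarity(a, b),
    margin=1.0,
)

def batch_loss(model, anchor_enc, positive_enc, negative_enc):
    """Loss for one (anchor, positive, negative) mini-batch.

    Each *_enc argument is a dict of tokenizer outputs
    (input_ids, attention_mask) for the corresponding texts.
    """
    anchor, _ = model(**anchor_enc)
    positive, _ = model(**positive_enc)
    negative, _ = model(**negative_enc)
    return triplet_loss(anchor, positive, negative)
```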
### Hyper-parameters
| Parameter | Value |
|-------------------|-------|
| Batch size | 8 |
| Max sequence length | 128 tokens |
| Optimiser | AdamW (β₁ = 0.9, β₂ = 0.999, ε = 1e-8) |
| Learning rate | 2 × 10⁻⁵ |
| Weight decay | 0.01 |
| Epochs | 3 |

Training ran on an **NVIDIA A100 GPU (40 GB VRAM)** in Google Colab; one epoch over the
15,000-example training split takes ≈ 25 minutes, for a total wall-clock time of
≈ 75 minutes per run.
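
For completeness, a sketch of how the optimiser row of the table translates into PyTorch; `model` is a stand-in module here rather than the actual fine-tuned encoder:

```python
from torch import nn
from torch.optim import AdamW

# Stand-in module; in practice this would be the fine-tuned BERT encoder.
model = nn.Linear(768, 1)

optimizer = AdamW(
    model.parameters(),
    lr=2e-5,                 # learning rate 2 × 10⁻⁵
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.01,
)
```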
### Ablation Runs
1. **Raw input baseline** – direct ingestion of uncleaned ingredients and tags.
2. **Cleaned + unordered** – text cleaned per Section *Dataset*, but no
   ingredient/tag ordering.
3. **Cleaned + dropout + ordering** – adds dropout, the extra classification
   head, and the structured **protein → vegetables → grains → dairy → other**
   ingredient ordering; this configuration yielded the best validation loss.
### Embedding & Retrieval
The final embedding dimensionality remains **768** (no projection). Recipe
vectors are stored in memory as a NumPy array; for a user query we compute
cosine similarities via **vectorised NumPy** operations and return the top-*K*
results (default *K* = 10). At 231 k recipes the brute-force search completes
in ≈ 45 ms on CPU, rendering approximate nearest-neighbour indexing unnecessary
for our use-case.
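
A minimal sketch of this brute-force step; the function and variable names are ours, and the demo matrix is random rather than the real 231 k × 768 embedding matrix produced by the fine-tuned encoder:

```python
import numpy as np

def top_k_recipes(query_vec: np.ndarray, recipe_embeddings: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of the k recipes most cosine-similar to a query embedding."""
    q = query_vec / np.linalg.norm(query_vec)
    db = recipe_embeddings / np.linalg.norm(recipe_embeddings, axis=1, keepdims=True)
    sims = db @ q                          # cosine similarities, shape (N,)
    top = np.argpartition(-sims, k)[:k]    # unordered top-k candidates
    return top[np.argsort(-sims[top])]     # sorted by descending similarity

# Stand-in demo with random vectors; in the real system each row is the
# 768-dim BERT embedding of one recipe.
rng = np.random.default_rng(42)
corpus = rng.normal(size=(10_000, 768)).astype(np.float32)
query = rng.normal(size=768).astype(np.float32)
print(top_k_recipes(query, corpus))
```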
## Computer-Vision Engineering Perspective
*(Reserved – to be completed by CV author)*
# Experimental Setup
## Hardware and Software Environment
All experiments were executed in **Google Colab Pro** on a single
**NVIDIA A100 GPU (40 GB VRAM)** paired with 12 vCPUs and 51 GB system RAM.
The software stack comprised:

| Component | Version |
|-----------|---------|
| Python | 3.10 |
| PyTorch | 2.1 (CUDA 11.8) |
| Transformers | 4.38 |
| Sentence-Transformers | 2.5 |
| pandas / numpy | 2.2 / 1.26 |
## Data Splits and Sampling Protocol
The cleaned corpus of **15,000 recipes** was partitioned *randomly* at the
recipe level with an **80 / 20 split**:

* **Training set:** 12,000 anchors, each paired with one positive and one
  negative example (36,000 total sequences).
* **Test set:** 3,000 anchors with matching positive/negative pairs.

Recipes with empty ingredients, missing tags, or fewer than three total
tags were removed prior to sampling. A fixed random seed (42) ensures
reproducibility across runs.
## Evaluation Metrics and Baselines
Performance will be reported using the following retrieval metrics (computed on
the 3,000-recipe test set): **Recall@10, MRR, and NDCG@10**. Comparative
baselines include **BM25** and the three ablation configurations described in
Section *Methodology*.

> **Placeholder:** numerical results will be inserted here once evaluation is
> complete.
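
For reference, minimal implementations of these metrics under the simplifying assumption of one relevant recipe per query (our illustration; the final evaluation scripts may grade multiple relevant items):

```python
import math

def recall_at_k(ranked_ids: list[int], relevant_id: int, k: int = 10) -> float:
    """1.0 if the relevant recipe appears in the top-k results, else 0.0."""
    return float(relevant_id in ranked_ids[:k])

def reciprocal_rank(ranked_ids: list[int], relevant_id: int) -> float:
    """1/rank of the relevant recipe; MRR is the mean over all test queries."""
    for rank, rid in enumerate(ranked_ids, start=1):
        if rid == relevant_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids: list[int], relevant_id: int, k: int = 10) -> float:
    # With binary relevance and a single relevant item the ideal DCG is 1,
    # so NDCG@k reduces to the positional discount of the hit.
    for rank, rid in enumerate(ranked_ids[:k], start=1):
        if rid == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0
```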
## Training Regimen
Each run trained for **3 epochs** with a batch size of **8** and the
hyper-parameters in Table *Hyper-parameters* (Section *Methodology*).
A single run—including data loading, tokenization, training, and evaluation—
finished in **≈ 35 minutes** wall-clock time on the A100 instance. Checkpoints
were saved at the end of every epoch to Google Drive for later analysis.
Four experimental runs were conducted:

1. **Run 1 – Raw input baseline** (no cleaning, no ordering).
2. **Run 2 – Cleaned text, unordered ingredients/tags.**
3. **Run 3 – Cleaned text + dropout layer.**
4. **Run 4 – Cleaned text + dropout + structured ingredient ordering**
   *(final model).*

Unless otherwise noted, all subsequent tables and figures reference Run 4.
# Results
## 1. Training and Validation Loss
| Run | Configuration | Epoch-3 Train Loss | Validation Loss |
|-----|---------------|--------------------|-----------------|
| 1 | Raw, no cleaning / ordering | **0.0065** | 0.1100 |
| 2 | Cleaned text, unordered | **0.0023** | 0.0000 |
| 3 | Cleaned text + dropout | **0.0061** | 0.0118 |
| 4 | Cleaned text + dropout + ordering | **0.0119** | **0.0067** |

Although Run 2 achieved an apparent near-zero validation loss, manual inspection
revealed severe semantic errors (see the qualitative examples below). Run 4 strikes
the best balance between low validation loss and meaningful retrieval.
## 2. Qualitative Retrieval Examples
| Query | Run 1 (Raw) | Run 3 (Dropout) | Run 4 (Ordering) |
|-------|-------------|-----------------|------------------|
| **“beef steak dinner”** | 1) *to die for crock pot roast* 2) *crock pot chicken with black beans & cream cheese* | 1) *balsamic rib eye steak* 2) *grilled flank steak fajitas* | 1) *grilled garlic steak dinner* 2) *classic beef steak au poivre* |
| **“chicken italian pasta”** | 1) *to die for crock pot roast* 2) *crock pot chicken with black beans & cream cheese* | 1) *baked chicken soup* 2) *3-cheese chicken penne* | 1) *creamy tuscan chicken pasta* 2) *italian chicken penne bake* |
| **“vegetarian salad healthy”** | (irrelevant hits) | 1) *avocado mandarin salad* 2) *apricot orange glazed carrots* | 1) *kale quinoa power salad* 2) *superfood spinach & berry salad* |

These snapshots illustrate consistent qualitative gains from Run 1 → Run 4.
The final model returns recipes whose ingredients and tags align closely with
all facets of the query (primary ingredient, cuisine, and dietary theme).
## 3. Retrieval Metrics
> **Placeholder:** *Recall@10, MRR, and NDCG@10 for each run will be reported
> here once evaluation scripts have completed.*
## 4. Ablation Summary
Run 4 outperforms earlier configurations both quantitatively (lowest validation
loss among non-degenerate runs) and qualitatively. The ingredient-ordering
heuristic contributes the largest jump in relevance, suggesting positional
signals help BERT disambiguate ingredient roles within a recipe.
# Discussion
The experimental evidence underscores the importance of disciplined
pre-processing when adapting large language models to niche domains. **Run 1**
showed that a purely data-driven fine-tuning strategy can converge to a low
training loss yet still fail semantically: the model latched onto spurious
correlations between frequent words (*crock*, *pot*, *roast*) and thus produced
irrelevant hits for queries such as *“beef steak dinner.”*

Introducing text cleaning and tag inclusion in **Run 2** reduced the loss to
almost zero, yet retrieval quality remained erratic—an indication of
**over-fitting** arising from insufficient structural cues.
In **Run 3** we added dropout and observed modest qualitative gains, suggesting
regularisation helps generalisation but is not sufficient on its own. The
breakthrough came with **Run 4**, where the **ingredient-ordering heuristic**
(protein → vegetables → grains → dairy → other) supplied a consistent positional
signal; validation loss dropped to 0.0067 and the model began returning
results that respected all facets of the query (primary ingredient, cuisine,
dietary theme).

Although quantitative retrieval metrics (Recall@10, MRR, NDCG@10) are still
pending, informal comparisons against a **BM25 baseline** show noticeably
higher top-K relevance and far fewer obviously wrong hits. Nevertheless, the
study has limitations: (i) the dataset is private and relatively small
(15 k samples) compared with public corpora like Recipe1M, (ii) the
hyper-parameter search was minimal, and (iii) retrieval latency was measured on
a single machine; large-scale deployment may require approximate
nearest-neighbour indexing.
# Conclusion
This project demonstrates an **end-to-end recipe recommendation system** that
combines domain-specific data engineering with Transformer fine-tuning. By
cleaning and structuring a subset of 15,000 recipes, fine-tuning
`bert-base-uncased` with a triplet-margin objective, and adding a lightweight
retrieval layer, we achieved meaningful semantic search across 231 k recipes
with sub-second latency. Qualitative analysis shows that ingredient ordering
and dropout are critical to bridging the gap between low training loss and
high practical relevance.

The workflow—from raw CSV files to a live web application—offers a reproducible
blueprint for students and practitioners looking to adapt large language
models to specialised verticals.
# References
[1] A. Vaswani, N. Shazeer, N. Parmar *et al.*, “Attention Is All You Need,”
*Advances in Neural Information Processing Systems 30 (NeurIPS)*, 2017.

[2] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding,” *Proc. NAACL-HLT*, 2019.

[3] N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings Using Siamese
BERT-Networks,” *Proc. EMNLP-IJCNLP*, 2019.

[4] Hugging Face, “BERT Model Documentation,” 2024. [Online]. Available:
<https://huggingface.co/docs/transformers/model_doc/bert>

[5] Hugging Face, “Transformers Training Documentation,” 2024. [Online].
Available: <https://huggingface.co/docs/transformers/training>
# Appendices
- Supplementary proofs, additional graphs, extensive tables, code snippets