import streamlit as st
from utils.layout import render_layout
def render_report():
st.title("Image Classification CV and Fine-Tuned NLP Recipe Recommendation")
# Title Page Information
st.markdown("""
**Authors:** Saksham Lakhera and Ahmed Zaher
**Date:** July 2025
""")
# Abstract
st.subheader("Abstract")
st.markdown("""
**NLP Engineering Perspective:**
This project addresses the challenge of improving recipe recommendation systems through
advanced semantic search using transformer-based language models. We explain how to fine-tune a model
on domain-specific text so that it captures the nuanced relationships between
ingredients and cooking techniques in culinary contexts.
Our approach leverages BERT (Bidirectional Encoder Representations from Transformers)
fine-tuning on a custom recipe dataset to develop a semantic understanding of culinary content.
We preprocessed and structured a subset of 15,000 recipes into standardized sequences organized
by food categories (proteins, vegetables, legumes, etc.) to create training data optimized for
the BERT architecture.
The model was fine-tuned to learn contextual embeddings that capture semantic relationships
between ingredients and tags. Finally, we generate embeddings for all recipes in our
dataset and perform cosine-similarity retrieval to produce the top-K most relevant recipes
for a user query.
""")
# Introduction
st.subheader("Introduction")
st.markdown("""
This term project serves primarily as an educational exercise aimed at giving
end-to-end exposure to building a modern NLP system. Our goal is to construct a semantic
recipe-search engine that demonstrates how domain-specific fine-tuning of BERT can
substantially improve retrieval quality over simple keyword matching.
**Key Contributions:**
- A cleaned, category-labelled recipe subset of 15,000 recipes
- Training scripts that yield adapted contextual embeddings
- A production-ready retrieval service that returns top-K most relevant recipes
- Comparative evaluation against classical baselines
""")
# Dataset and Preprocessing
st.subheader("Dataset and Pre-processing")
st.markdown("""
**Data Sources:**
The project draws from two CSV files:
- **Raw_recipes.csv:** 231,637 rows, one per recipe with columns: *id, name, ingredients, tags, minutes, steps, description, n_steps, n_ingredients*
- **Raw_interactions.csv:** user feedback containing *recipe_id, user_id, rating, review text*
""")
st.markdown("""
**Corpus Filtering and Subset Selection**
- **Invalid rows removed:** recipes with empty ingredient lists, missing tags, or fewer than three total tags
- **Random sampling:** 15,000 recipes selected for NLP fine-tuning
- **Positive/negative pairs:** generated for contrastive learning using ratings and tag similarity
- **Train/test split:** 80/20 stratified split (12,000/3,000 pairs)
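
The filtering, sampling, and splitting steps above can be sketched as follows. This is a minimal illustration, not the project's actual script: the function names, the dictionary layout of a recipe, and the fixed seed are assumptions, and the split shown is a plain shuffled 80/20 cut rather than a stratified one.

```python
import random

def is_valid(recipe, min_tags=3):
    # Keep recipes that have ingredients and at least min_tags tags.
    ingredients = recipe.get("ingredients") or []
    tags = recipe.get("tags") or []
    return len(ingredients) > 0 and len(tags) >= min_tags

def sample_subset(recipes, n=15000, seed=42):
    # Drop invalid rows, then draw a random sample of at most n recipes.
    valid = [r for r in recipes if is_valid(r)]
    rng = random.Random(seed)
    return rng.sample(valid, min(n, len(valid)))

def split_train_test(items, train_fraction=0.8, seed=42):
    # Shuffle a copy and cut it at the train fraction (not stratified).
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

With 15,000 sampled items, this cut yields 12,000 training and 3,000 test items, matching the split sizes listed above.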
""")
st.markdown("""
**Text Pre-processing Pipeline**
- **Lower-casing & punctuation removal:** normalized to lowercase, special characters stripped
- **Stop-descriptor removal:** culinary modifiers (*fresh, chopped, minced*) and measurements (tablespoons, teaspoons, cups, etc.) removed
- **Ingredient ordering:** re-ordered into sequence: protein → vegetables/grains/dairy → other
- **Tag normalization:** mapped to 7 main categories: *cuisine, course, main-ingredient, dietary, difficulty, occasion, cooking_method*
- **Tokenization:** standard *bert-base-uncased* WordPiece tokenizer, sequences truncated/padded to 128 tokens
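
As an illustration of the cleaning and ordering steps above, the sketch below normalizes ingredient strings and re-orders them by category. The word lists are small hypothetical samples, not the project's actual vocabularies, the function names are illustrative, and the category order (protein, vegetables, grains, dairy, other) is one reading of the ordering rule.

```python
import re

# Hypothetical sample vocabularies; real lists would be much larger.
STOP_DESCRIPTORS = {"fresh", "chopped", "minced", "tablespoons", "teaspoons", "cups"}
CATEGORY_KEYWORDS = {
    "protein": {"beef", "chicken", "tofu"},
    "vegetables": {"onion", "spinach", "carrot"},
    "grains": {"rice", "pasta"},
    "dairy": {"milk", "cheese", "butter"},
}
CATEGORY_ORDER = ["protein", "vegetables", "grains", "dairy", "other"]

def clean_ingredient(text):
    # Lower-case, replace everything except letters and spaces, drop descriptors.
    text = re.sub("[^a-z ]", " ", text.lower())
    words = [w for w in text.split() if w not in STOP_DESCRIPTORS]
    return " ".join(words)

def categorize(ingredient):
    # Return the first category whose keyword appears in the ingredient.
    words = set(ingredient.split())
    for category, keywords in CATEGORY_KEYWORDS.items():
        if words & keywords:
            return category
    return "other"

def order_ingredients(raw_ingredients):
    # Clean each ingredient, then sort by category rank (stable within a category).
    cleaned = [c for c in (clean_ingredient(i) for i in raw_ingredients) if c]
    return sorted(cleaned, key=lambda i: CATEGORY_ORDER.index(categorize(i)))
```

For example, `order_ingredients(["2 cups Fresh Spinach", "Salt", "Chopped Chicken!", "2 cups rice"])` places the chicken first and the salt last.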
""")
# Technical Specifications
st.subheader("Technical Specifications")
col1, col2 = st.columns(2)
with col1:
st.markdown("""
**Dataset:**
- Total Recipes: 231,637
- Training Set: 12,000 recipes
- Average Tags per Recipe: ~6
- Ingredients per Recipe: 3-20
""")
with col2:
st.markdown("""
**Infrastructure:**
- Python 3.10
- PyTorch 2.1 (CUDA 11.8)
- Transformers 4.38
- Google Colab A100 GPU
""")
# Methodology
st.subheader("Methodology")
st.markdown("""
**Model Architecture**
- **Base Model:** bert-base-uncased
- **Additional Layers:** In some runs, we added a single linear classification layer with dropout (p = 0.1)
- **Training Objective:** Triplet-margin loss with margin of 1.0
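
For reference, the triplet-margin objective with a margin of 1.0 can be written out in a few lines. The sketch below uses plain Python and assumes Euclidean distance between embeddings (the default in PyTorch's `TripletMarginLoss`); the distance actually used in training is not stated here, and the function names are illustrative.

```python
import math

def euclidean(u, v):
    # L2 distance between two equal-length vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    # Zero once the negative is at least margin farther from the anchor than the positive.
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return max(gap, 0.0)
```

A loss of zero means the triplet already satisfies the margin, so only triplets where the negative sits too close to the anchor contribute to the gradient.
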
We first trained the model directly on the raw data to establish a baseline. As seen in Table 1, this run reached a very low training loss,
but the validation loss was much higher. We then cleaned the data by removing extra whitespace, lower-casing the text, and removing
all punctuation, and retrained the model. This resulted in a highly overfitted model, as seen in Table 1 and the Results section below. Next, we added a single linear layer on top of
BERT's existing architecture, with dropout to reduce overfitting. The results, as shown in Table 1, were better. Although the semantic
results improved, the model was still poor at identifying the relationships between ingredients and the different tags. We then further
structured the data by ordering the tags and ingredients consistently across the dataset and retrained the model. This resulted in a better
balance between training and validation loss, which is also evident in the semantic retrieval results below.
**Website Development:**
- We used Streamlit to develop the website. However, we ran into issues with the size of the trained model, so we switched hosting to Hugging Face.
- The website loads the trained model along with the recipe embeddings and the top-K retrieval function, then waits for the user to enter a query.
- The query is then processed by the model and the top-K recipes are returned.
""")
st.markdown("**Hyperparameters and Training**")
col1, col2 = st.columns(2)
with col1:
st.markdown("""
- **Batch size:** 8
- **Max sequence length:** 128 tokens
- **Learning rate:** 2 × 10⁻⁵
- **Weight decay:** 0.01
""")
with col2:
st.markdown("""
- **Optimizer:** AdamW
- **Epochs:** 3
- **Hardware:** Google Colab A100 GPU (40 GB VRAM)
- **Training time:** ~30 minutes per run
""")
# Mathematical Formulations
st.subheader("Mathematical Formulations and Top-K Retrieval")
st.markdown("""**Query Embedding and Similarity Calculation:** We used the trained model weights to generate embeddings for the entire recipe corpus.
When a user submits a query, we embed it with the same trained model and use the cosine-similarity formula below to retrieve the top-K
recipes. We then keep only recipes with an average rating >= 3.0 and at least 5 ratings, and sort them by similarity and then by average rating.
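
The retrieval steps described above can be sketched in plain Python as follows. The recipe dictionary fields (`name`, `embedding`, `avg_rating`, `n_ratings`) and the function names are hypothetical, and a real implementation would likely vectorize the similarity computation over the precomputed embedding matrix.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector norms.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_embedding, recipes, k=5, min_rating=3.0, min_count=5):
    # Score every recipe against the query and keep the k most similar.
    scored = [(cosine_similarity(query_embedding, r["embedding"]), r) for r in recipes]
    top_k = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    # Keep only well-rated recipes with enough ratings.
    kept = [(s, r) for s, r in top_k
            if r["avg_rating"] >= min_rating and r["n_ratings"] >= min_count]
    # Sort by similarity first, then by average rating, both descending.
    kept.sort(key=lambda pair: (pair[0], pair[1]["avg_rating"]), reverse=True)
    return [r["name"] for _, r in kept]
```

Because the rating filter runs after the top-K cut, fewer than K recipes may be returned; filtering before the cut would be an alternative design if exactly K results are required.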
""")
st.latex(r"""
\text{Similarity}(q, r_i) = \cos(\hat{q}, \hat{r}_i) = \frac{\hat{q} \cdot \hat{r}_i}{\|\hat{q}\|\|\hat{r}_i\|}
""")
st.markdown("Where $\\hat{q}$ is the BERT embedding of the query, and $\\hat{r}_i$ is the embedding of the i-th recipe.")
# Results
st.subheader("Results")
st.markdown("**Training and Validation Loss**")
results_data = {
"Run": [1, 2, 3, 4],
"Configuration": [
"Raw, no cleaning/ordering",
"Cleaned text, unordered",
"Cleaned text + single layer + dropout",
"Cleaned text + ordering"
],
"Epoch-3 Train Loss": [0.0065, 0.0023, 0.0061, 0.0119],
"Validation Loss": [0.1100, 0.0000, 0.0118, 0.0067]
}
st.table(results_data)
st.markdown("""Table 1: Training and Validation Loss for each run""")
st.markdown("""
**Key Finding:** Run 4 (cleaned text + ordering) achieved the best balance
between low validation loss and meaningful retrieval quality.
""")
st.markdown("**Qualitative Retrieval Examples**")
st.markdown("""
This section shows how retrieval results differ between runs and how the model performs on different queries.
**Query: "beef steak dinner"**
- Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans*
- Run 2 (Cleaned text, unordered): *aussie pepper steak steak with creamy pepper sauce*
- Run 3 (Cleaned text + single layer + dropout): *balsamic rib eye steak with bleu cheese sauce*
- Run 4 (Final): *grilled garlic steak dinner*, *classic beef steak au poivre*
**Query: "chicken italian pasta"**
- Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans*
- Run 2 (Cleaned text, unordered): *baked chicken soup*
- Run 3 (Cleaned text + single layer + dropout): *absolute best ever lasagna*
- Run 4 (Final): *creamy tuscan chicken pasta*, *italian chicken penne bake*
**Query: "vegetarian salad healthy"**
- Run 1 (Raw): *to die for crock pot roast*
- Run 2 (Cleaned text, unordered): *avocado mandarin salad*
- Run 3 (Cleaned text + single layer + dropout): *black bean and sweet potato salad*
- Run 4 (Final): *kale quinoa power salad*, *superfood spinach & berry salad*
""")
# Discussion and Conclusion
st.subheader("Discussion and Conclusion")
st.markdown("""
The experimental evidence underscores the importance of disciplined pre-processing when
adapting large language models to niche domains. The breakthrough came with ingredient ordering
(protein → vegetables → grains → dairy → other), which supplied consistent positional signals. As the results show,
adding the single linear layer and dropout improves the model, but the results are still not as good as in the final run, where
we added the ordering of the ingredients.
**Key Achievements:**
- End-to-end recipe recommendation system with semantic search
- Meaningful semantic understanding of culinary content
- Reproducible blueprint for domain-specific NLP applications
**Limitations:**
- Private dataset with a relatively small training set (12k samples) compared to public corpora
- Further pre-processing could be done to improve the results
- Minimal hyperparameter search conducted
- Single-machine deployment tested
- The model cannot handle complex queries, synonyms, or antonyms
""")
# References
st.subheader("References")
st.markdown("""
[1] Vaswani et al., "Attention Is All You Need," NeurIPS, 2017.
[2] Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL-HLT, 2019.
[3] Reimers and Gurevych, "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks," EMNLP-IJCNLP, 2019.
[4] Hugging Face, "BERT Model Documentation," 2024.
""")
st.markdown("---")
st.markdown("© 2025 CSE 555 Term Project. All rights reserved.")
# Render the report
render_layout(render_report)