import streamlit as st


def render_report():
    st.title("Group 5: Term Project Report")

    # Title page information
    st.markdown("""
**Course:** CSE 555 – Introduction to Pattern Recognition  
**Authors:** Saksham Lakhera and Ahmed Zaher  
**Date:** July 2025
""")
    # Abstract
    st.header("Abstract")
    st.subheader("NLP Engineering Perspective")
    st.markdown("""
This project addresses the challenge of improving recipe recommendation systems through
advanced semantic search built on transformer-based language models. Traditional
keyword-based search methods often fail to capture the nuanced relationships between
ingredients, cooking techniques, and user preferences in culinary contexts.

Our approach fine-tunes BERT (Bidirectional Encoder Representations from Transformers)
on a custom recipe dataset to develop a semantic understanding of culinary content.
We preprocessed and structured a subset of 15,000 recipes into standardized sequences organized
by food category (proteins, vegetables, legumes, etc.) to create training data optimized for
the BERT architecture.

The model was fine-tuned to learn contextual embeddings that capture semantic relationships
between ingredients and tags. At inference time we generate embeddings for all recipes in our
dataset and perform cosine-similarity retrieval to return the top-K most relevant recipes
for a user query.
""")
    # Introduction
    st.header("Introduction")
    st.markdown("""
This term project serves primarily as an educational exercise aimed at giving students
end-to-end exposure to building a modern NLP system. Our goal is to construct a semantic
recipe-search engine that demonstrates how domain-specific fine-tuning of BERT can
substantially improve retrieval quality over simple keyword matching.

**Key Contributions:**
- A cleaned, category-labelled recipe subset of 15,000 recipes
- Training scripts that yield domain-adapted contextual embeddings
- A production-ready retrieval service that returns the top-K most relevant recipes
- Comparative evaluation against classical baselines
""")
    # Dataset and Preprocessing
    st.header("Dataset and Pre-processing")
    st.subheader("Data Sources")
    st.markdown("""
The project draws on two CSV files:
- **Raw_recipes.csv** – 231,637 rows, one per recipe, with columns: *id, name, ingredients, tags, minutes, steps, description, n_steps, n_ingredients*
- **Raw_interactions.csv** – user feedback containing *recipe_id, user_id, rating (1–5), review text*
""")
st.subheader("Corpus Filtering and Subset Selection") | |
st.markdown(""" | |
1. **Invalid rows removed** β recipes with empty ingredient lists, missing tags, or fewer than three total tags | |
2. **Random sampling** β 15,000 recipes selected for NLP fine-tuning | |
3. **Positive/negative pairs** β generated for contrastive learning using ratings and tag similarity | |
4. **Train/test split** β 80/20 stratified split (12,000/3,000 pairs) | |
""") | |
st.subheader("Text Pre-processing Pipeline") | |
st.markdown(""" | |
- **Lower-casing & punctuation removal** β normalized to lowercase, special characters stripped | |
- **Stop-descriptor removal** β culinary modifiers (*fresh, chopped, minced*) and measurements removed | |
- **Ingredient ordering** β re-ordered into sequence: **protein β vegetables β grains β dairy β other** | |
- **Tag normalization** β mapped to six canonical slots: *cuisine, course, main-ingredient, dietary, difficulty, occasion* | |
- **Tokenization** β standard *bert-base-uncased* WordPiece tokenizer, sequences truncated/padded to 128 tokens | |
""") | |
    # Methodology
    st.header("Methodology")
    st.subheader("Model Architecture")
    st.markdown("""
- **Base Model:** `bert-base-uncased` checkpoint
- **Additional Layers:** single linear classification layer (768 → 1) with dropout (p = 0.1)
- **Training Objective:** triplet-margin loss with a margin of 1.0 (sketched below)
""")
st.subheader("Hyperparameters") | |
col1, col2 = st.columns(2) | |
with col1: | |
st.markdown(""" | |
- **Batch size:** 8 | |
- **Max sequence length:** 128 tokens | |
- **Learning rate:** 2 Γ 10β»β΅ | |
- **Weight decay:** 0.01 | |
""") | |
with col2: | |
st.markdown(""" | |
- **Optimizer:** AdamW | |
- **Epochs:** 3 | |
- **Hardware:** Google Colab A100 GPU (40 GB VRAM) | |
- **Training time:** ~75 minutes per run | |
""") | |
    # Mathematical Formulations
    st.header("Mathematical Formulations")
    st.subheader("Query Embedding and Similarity Calculation")
    st.latex(r"""
\text{Similarity}(q, r_i) = \cos(\hat{q}, \hat{r}_i) = \frac{\hat{q} \cdot \hat{r}_i}{\|\hat{q}\|\,\|\hat{r}_i\|}
""")
    st.markdown("where $\\hat{q}$ is the BERT embedding of the query and $\\hat{r}_i$ is the embedding of the $i$-th recipe.")
    st.subheader("Final Score Calculation")
    st.latex(r"""
\text{Score}_i = 0.6 \times \text{Similarity}_i + 0.4 \times \text{Popularity}_i
""")
    # Results
    st.header("Results")
    st.subheader("Training and Validation Loss")
    results_data = {
        "Run": [1, 2, 3, 4],
        "Configuration": [
            "Raw, no cleaning/ordering",
            "Cleaned text, unordered",
            "Cleaned text + dropout",
            "Cleaned text + dropout + ordering",
        ],
        "Epoch-3 Train Loss": [0.0065, 0.0023, 0.0061, 0.0119],
        "Validation Loss": [0.1100, 0.0000, 0.0118, 0.0067],
    }
    st.table(results_data)
    st.markdown("""
**Key Finding:** Run 4 (cleaned text + dropout + ordering) achieved the best balance
between low validation loss and meaningful retrieval quality.
""")
st.subheader("Qualitative Retrieval Examples") | |
st.markdown(""" | |
**Query: "beef steak dinner"** | |
- Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans* | |
- Run 4 (Final): *grilled garlic steak dinner*, *classic beef steak au poivre* | |
**Query: "chicken italian pasta"** | |
- Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans* | |
- Run 4 (Final): *creamy tuscan chicken pasta*, *italian chicken penne bake* | |
**Query: "vegetarian salad healthy"** | |
- Run 1 (Raw): (irrelevant hits) | |
- Run 4 (Final): *kale quinoa power salad*, *superfood spinach & berry salad* | |
""") | |
    # Discussion and Conclusion
    st.header("Discussion and Conclusion")
    st.markdown("""
The experimental evidence underscores the importance of disciplined pre-processing when
adapting large language models to niche domains. The breakthrough came with **ingredient ordering**
(protein → vegetables → grains → dairy → other), which supplied consistent positional signals.

**Key Achievements:**
- End-to-end recipe recommendation system with semantic search
- Sub-second retrieval latency across 231k recipes
- Meaningful semantic understanding of culinary content
- Reproducible blueprint for domain-specific NLP applications

**Limitations:**
- Fine-tuning subset is relatively small (15k recipes) compared with public corpora
- Minimal hyperparameter search conducted
- Only single-machine deployment tested
""")
    # Technical Specifications
    st.header("Technical Specifications")
    col1, col2 = st.columns(2)
    with col1:
        st.markdown("""
**Dataset:**
- Total Recipes: 231,637
- Training Set: 15,000 recipes
- Average Tags per Recipe: ~6
- Ingredients per Recipe: 3–20
""")
    with col2:
        st.markdown("""
**Infrastructure:**
- Python 3.10
- PyTorch 2.1 (CUDA 11.8)
- Transformers 4.38
- Google Colab A100 GPU
""")
    # References
    st.header("References")
    st.markdown("""
[1] Vaswani et al., "Attention Is All You Need," NeurIPS, 2017.

[2] Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL-HLT, 2019.

[3] Reimers and Gurevych, "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks," EMNLP-IJCNLP, 2019.

[4] Hugging Face, "BERT Model Documentation," 2024.
""")
st.markdown("---") | |
st.markdown("Β© 2025 CSE 555 Term Project. All rights reserved.") | |
# Render the report | |
render_report() | |