import streamlit as st
def render_report():
st.title("Group 5: Term Project Report")
# Title Page Information
st.markdown("""
**Course:** CSE 555 – Introduction to Pattern Recognition
**Authors:** Saksham Lakhera and Ahmed Zaher
**Date:** July 2025
""")
# Abstract
st.header("Abstract")
st.subheader("NLP Engineering Perspective")
st.markdown("""
This project addresses the challenge of improving recipe recommendation systems through
advanced semantic search capabilities using transformer-based language models. Traditional
keyword-based search methods often fail to capture the nuanced relationships between
ingredients, cooking techniques, and user preferences in culinary contexts.
Our approach leverages BERT (Bidirectional Encoder Representations from Transformers)
fine-tuning on a custom recipe dataset to develop a semantic understanding of culinary content.
We preprocessed and structured a subset of 15,000 recipes into standardized sequences organized
by food categories (proteins, vegetables, legumes, etc.) to create training data optimized for
the BERT architecture.
The model was fine-tuned to learn contextual embeddings that capture semantic relationships
between ingredients and tags. At inference time we generate embeddings for all recipes in our
dataset and perform cosine-similarity retrieval to produce the top-K most relevant recipes
for a user query.
""")
# Introduction
st.header("Introduction")
st.markdown("""
This term project serves primarily as an educational exercise aimed at giving students
end-to-end exposure to building a modern NLP system. Our goal is to construct a semantic
recipe-search engine that demonstrates how domain-specific fine-tuning of BERT can
substantially improve retrieval quality over simple keyword matching.
**Key Contributions:**
- A cleaned, category-labelled recipe subset of 15,000 recipes
- Training scripts that yield domain-adapted contextual embeddings
- A production-ready retrieval service that returns the top-K most relevant recipes
- Comparative evaluation against classical baselines
""")
# Dataset and Preprocessing
st.header("Dataset and Pre-processing")
st.subheader("Data Sources")
st.markdown("""
The project draws from two CSV files:
- **Raw_recipes.csv** – 231,637 rows, one per recipe with columns: *id, name, ingredients, tags, minutes, steps, description, n_steps, n_ingredients*
- **Raw_interactions.csv** – user feedback containing *recipe_id, user_id, rating (1-5), review text*
""")
st.subheader("Corpus Filtering and Subset Selection")
st.markdown("""
1. **Invalid rows removed** – recipes with empty ingredient lists, missing tags, or fewer than three total tags
2. **Random sampling** – 15,000 recipes selected for NLP fine-tuning
3. **Positive/negative pairs** – generated for contrastive learning using ratings and tag similarity
4. **Train/test split** – 80/20 stratified split (12,000/3,000 pairs)
""")
st.subheader("Text Pre-processing Pipeline")
st.markdown("""
- **Lower-casing & punctuation removal** – normalized to lowercase, special characters stripped
- **Stop-descriptor removal** – culinary modifiers (*fresh, chopped, minced*) and measurements removed
- **Ingredient ordering** – re-ordered into sequence: **protein → vegetables → grains → dairy → other**
- **Tag normalization** – mapped to six canonical slots: *cuisine, course, main-ingredient, dietary, difficulty, occasion*
- **Tokenization** – standard *bert-base-uncased* WordPiece tokenizer, sequences truncated/padded to 128 tokens
""")
# Methodology
st.header("Methodology")
st.subheader("Model Architecture")
st.markdown("""
- **Base Model:** `bert-base-uncased` checkpoint
- **Additional Layers:** Single linear classification layer (768 → 1) with dropout (p = 0.1)
- **Training Objective:** Triplet-margin loss with margin of 1.0
""")
st.subheader("Hyperparameters")
col1, col2 = st.columns(2)
with col1:
st.markdown("""
- **Batch size:** 8
- **Max sequence length:** 128 tokens
- **Learning rate:** 2 × 10⁻⁵
- **Weight decay:** 0.01
""")
with col2:
st.markdown("""
- **Optimizer:** AdamW
- **Epochs:** 3
- **Hardware:** Google Colab A100 GPU (40 GB VRAM)
- **Training time:** ~75 minutes per run
""")
# Mathematical Formulations
st.header("Mathematical Formulations")
st.subheader("Query Embedding and Similarity Calculation")
st.latex(r"""
\text{Similarity}(q, r_i) = \cos(\hat{q}, \hat{r}_i) = \frac{\hat{q} \cdot \hat{r}_i}{\|\hat{q}\|\|\hat{r}_i\|}
""")
st.markdown("Where $\\hat{q}$ is the BERT embedding of the query, and $\\hat{r}_i$ is the embedding of the i-th recipe.")
st.subheader("Final Score Calculation")
st.latex(r"""
\text{Score}_i = 0.6 \times \text{Similarity}_i + 0.4 \times \text{Popularity}_i
""")
# Results
st.header("Results")
st.subheader("Training and Validation Loss")
results_data = {
"Run": [1, 2, 3, 4],
"Configuration": [
"Raw, no cleaning/ordering",
"Cleaned text, unordered",
"Cleaned text + dropout",
"Cleaned text + dropout + ordering"
],
"Epoch-3 Train Loss": [0.0065, 0.0023, 0.0061, 0.0119],
"Validation Loss": [0.1100, 0.0000, 0.0118, 0.0067]
}
st.table(results_data)
st.markdown("""
**Key Finding:** Run 4 (cleaned text + dropout + ordering) achieved the best balance
between low validation loss and meaningful retrieval quality.
""")
st.subheader("Qualitative Retrieval Examples")
st.markdown("""
**Query: "beef steak dinner"**
- Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans*
- Run 4 (Final): *grilled garlic steak dinner*, *classic beef steak au poivre*
**Query: "chicken italian pasta"**
- Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans*
- Run 4 (Final): *creamy tuscan chicken pasta*, *italian chicken penne bake*
**Query: "vegetarian salad healthy"**
- Run 1 (Raw): (irrelevant hits)
- Run 4 (Final): *kale quinoa power salad*, *superfood spinach & berry salad*
""")
# Discussion and Conclusion
st.header("Discussion and Conclusion")
st.markdown("""
The experimental evidence underscores the importance of disciplined pre-processing when
adapting large language models to niche domains. The breakthrough came with **ingredient ordering**
(protein → vegetables → grains → dairy → other), which supplied consistent positional signals.
**Key Achievements:**
- End-to-end recipe recommendation system with semantic search
- Sub-second latency across 231k recipes
- Meaningful semantic understanding of culinary content
- Reproducible blueprint for domain-specific NLP applications
**Limitations:**
- Fine-tuning subset relatively small (15k recipes) compared to public corpora
- Minimal hyperparameter search conducted
- Single-machine deployment tested
""")
# Technical Specifications
st.header("Technical Specifications")
col1, col2 = st.columns(2)
with col1:
st.markdown("""
**Dataset:**
- Total Recipes: 231,630
- Training Set: 15,000 recipes
- Average Tags per Recipe: ~6
- Ingredients per Recipe: 3-20
""")
with col2:
st.markdown("""
**Infrastructure:**
- Python 3.10
- PyTorch 2.1 (CUDA 11.8)
- Transformers 4.38
- Google Colab A100 GPU
""")
# References
st.header("References")
st.markdown("""
[1] Vaswani et al., "Attention Is All You Need," NeurIPS, 2017.
[2] Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL-HLT, 2019.
[3] Reimers and Gurevych, "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks," EMNLP-IJCNLP, 2019.
[4] Hugging Face, "BERT Model Documentation," 2024.
""")
st.markdown("---")
st.markdown("Β© 2025 CSE 555 Term Project. All rights reserved.")
# Render the report
render_report()