Spaces: PatternGroup5/pattern


azaher1215 committed 25 days ago

Commit b8416be (1 parent: 1ba150b)

integrating md file with report.py

Files changed (1)

pages/4_Report.py +192 -88


@@ -1,107 +1,211 @@
 import streamlit as st
 def render_report():
-    st.title("📊 Recipe Search System Report")
     st.markdown("""
-        ## Overview
-        This report summarizes the working of the **custom BERT-based Recipe Recommendation System**, dataset characteristics, scoring algorithm, and evaluation metrics.
     """)
-    st.markdown("### 🔍 Query Embedding and Similarity Calculation")
-    st.latex(r"""
-        \text{Similarity}(q, r_i) = \cos(\hat{q}, \hat{r}_i) = \frac{\hat{q} \cdot \hat{r}_i}{\|\hat{q}\|\|\hat{r}_i\|}
     """)
     st.markdown("""
-        Here, $\\hat{q}$ is the BERT embedding of the query, and $\\hat{r}_i$ is the embedding of the i-th recipe.
     """)
-    st.markdown("### 🏆 Final Score Calculation")
     st.latex(r"""
         \text{Score}_i = 0.6 \times \text{Similarity}_i + 0.4 \times \text{Popularity}_i
     """)
-    st.markdown("### 📊 Dataset Summary")
     st.markdown("""
-        - **Total Recipes:** 231,630
-        - **Average Tags per Recipe:** ~6
-        - **Ingredients per Recipe:** 3 to 20
-        - **Ratings Data:** Extracted from user interaction dataset
     """)
-    st.markdown("### 🧪 Evaluation Strategy")
     st.markdown("""
-        We use a combination of:
-        - Manual inspection
-        - Recipe diversity analysis
-        - Match vs rating correlation
-        - Qualitative feedback from test queries
     """)
     st.markdown("---")
-    st.markdown("© 2025 Your Name. All rights reserved.")
-# If using a layout wrapper:
 render_report()
-# LaTeX content as string
-latex_report = r"""
-\documentclass{article}
-\usepackage{amsmath}
-\usepackage{geometry}
-\geometry{margin=1in}
-\title{Recipe Recommendation System Report}
-\author{Saksham Lakhera}
-\date{\today}
-\begin{document}
-\maketitle
-\section*{Overview}
-This report summarizes the working of the \textbf{custom BERT-based Recipe Recommendation System}, dataset characteristics, scoring algorithm, and evaluation metrics.
-\section*{Query Embedding and Similarity Calculation}
-\[
-\text{Similarity}(q, r_i) = \cos(\hat{q}, \hat{r}_i) = \frac{\hat{q} \cdot \hat{r}_i}{\|\hat{q}\|\|\hat{r}_i\|}
-\]
-Here, $\hat{q}$ is the BERT embedding of the query, and $\hat{r}_i$ is the embedding of the i-th recipe.
-\section*{Final Score Calculation}
-\[
-\text{Score}_i = 0.6 \times \text{Similarity}_i + 0.4 \times \text{Popularity}_i
-\]
-\section*{Dataset Summary}
-\begin{itemize}
-  \item \textbf{Total Recipes:} 231,630
-  \item \textbf{Average Tags per Recipe:} $\sim$6
-  \item \textbf{Ingredients per Recipe:} 3 to 20
-  \item \textbf{Ratings Source:} User interaction dataset
-\end{itemize}
-\section*{Evaluation Strategy}
-We use a combination of:
-\begin{itemize}
-  \item Manual inspection
-  \item Recipe diversity analysis
-  \item Match vs rating correlation
-  \item Qualitative user feedback
-\end{itemize}
-\end{document}
-"""
-# ⬇️ Download button to get the .tex file
-st.markdown("### 📥 Download LaTeX Report")
-st.download_button(
-    label="Download LaTeX (.tex)",
-    data=latex_report,
-    file_name="recipe_report.tex",
-    mime="text/plain"
-)
-# 📤 Optional: Show the .tex content in the app
-with st.expander("📄 View LaTeX (.tex) File Content"):
-    st.code(latex_report, language="latex")

 import streamlit as st
 def render_report():
+    st.title("Group 5: Term Project Report")
+    # Title Page Information
     st.markdown("""
+    **Course:** CSE 555 — Introduction to Pattern Recognition
+    **Authors:** Saksham Lakhera and Ahmed Zaher
+    **Date:** July 2025
     """)
+    # Abstract
+    st.header("Abstract")
+    st.subheader("NLP Engineering Perspective")
+    st.markdown("""
+    This project addresses the challenge of improving recipe recommendation systems through
+    advanced semantic search capabilities using transformer-based language models. Traditional
+    keyword-based search methods often fail to capture the nuanced relationships between
+    ingredients, cooking techniques, and user preferences in culinary contexts.
+    Our approach leverages BERT (Bidirectional Encoder Representations from Transformers)
+    fine-tuning on a custom recipe dataset to develop a semantic understanding of culinary content.
+    We preprocessed and structured a subset of 15,000 recipes into standardized sequences organized
+    by food categories (proteins, vegetables, legumes, etc.) to create training data optimized for
+    the BERT architecture.
+    The model was fine-tuned to learn contextual embeddings that capture semantic relationships
+    between ingredients and tags. At inference time we generate embeddings for all recipes in our
+    dataset and perform cosine-similarity retrieval to produce the top-K most relevant recipes
+    for a user query.
     """)
+    # Introduction
+    st.header("Introduction")
     st.markdown("""
+    This term project serves primarily as an educational exercise aimed at giving students
+    end-to-end exposure to building a modern NLP system. Our goal is to construct a semantic
+    recipe-search engine that demonstrates how domain-specific fine-tuning of BERT can
+    substantially improve retrieval quality over simple keyword matching.
+    **Key Contributions:**
+    - A cleaned, category-labelled recipe subset of 15,000 recipes
+    - Training scripts that yield domain-adapted contextual embeddings
+    - A production-ready retrieval service that returns top-K most relevant recipes
+    - Comparative evaluation against classical baselines
     """)
+    # Dataset and Preprocessing
+    st.header("Dataset and Pre-processing")
+    st.subheader("Data Sources")
+    st.markdown("""
+    The project draws from two CSV files:
+    - **Raw_recipes.csv** – 231,637 rows, one per recipe with columns: *id, name, ingredients, tags, minutes, steps, description, n_steps, n_ingredients*
+    - **Raw_interactions.csv** – user feedback containing *recipe_id, user_id, rating (1-5), review text*
+    """)
+    st.subheader("Corpus Filtering and Subset Selection")
+    st.markdown("""
+    1. **Invalid rows removed** – recipes with empty ingredient lists, missing tags, or fewer than three total tags
+    2. **Random sampling** – 15,000 recipes selected for NLP fine-tuning
+    3. **Positive/negative pairs** – generated for contrastive learning using ratings and tag similarity
+    4. **Train/test split** – 80/20 stratified split (12,000/3,000 pairs)
+    """)
+    st.subheader("Text Pre-processing Pipeline")
+    st.markdown("""
+    - **Lower-casing & punctuation removal** – normalized to lowercase, special characters stripped
+    - **Stop-descriptor removal** – culinary modifiers (*fresh, chopped, minced*) and measurements removed
+    - **Ingredient ordering** – re-ordered into sequence: **protein → vegetables → grains → dairy → other**
+    - **Tag normalization** – mapped to six canonical slots: *cuisine, course, main-ingredient, dietary, difficulty, occasion*
+    - **Tokenization** – standard *bert-base-uncased* WordPiece tokenizer, sequences truncated/padded to 128 tokens
+    """)
+    # Methodology
+    st.header("Methodology")
+    st.subheader("Model Architecture")
+    st.markdown("""
+    - **Base Model:** `bert-base-uncased` checkpoint
+    - **Additional Layers:** Single linear classification layer (768 → 1) with dropout (p = 0.1)
+    - **Training Objective:** Triplet-margin loss with margin of 1.0
+    """)
+    st.subheader("Hyperparameters")
+    col1, col2 = st.columns(2)
+    with col1:
+        st.markdown("""
+        - **Batch size:** 8
+        - **Max sequence length:** 128 tokens
+        - **Learning rate:** 2 × 10⁻⁵
+        - **Weight decay:** 0.01
+        """)
+    with col2:
+        st.markdown("""
+        - **Optimizer:** AdamW
+        - **Epochs:** 3
+        - **Hardware:** Google Colab A100 GPU (40 GB VRAM)
+        - **Training time:** ~75 minutes per run
+        """)
+    # Mathematical Formulations
+    st.header("Mathematical Formulations")
+    st.subheader("Query Embedding and Similarity Calculation")
+    st.latex(r"""
+        \text{Similarity}(q, r_i) = \cos(\hat{q}, \hat{r}_i) = \frac{\hat{q} \cdot \hat{r}_i}{\|\hat{q}\|\|\hat{r}_i\|}
+    """)
+    st.markdown("Where $\\hat{q}$ is the BERT embedding of the query, and $\\hat{r}_i$ is the embedding of the i-th recipe.")
+    st.subheader("Final Score Calculation")
     st.latex(r"""
         \text{Score}_i = 0.6 \times \text{Similarity}_i + 0.4 \times \text{Popularity}_i
     """)
+    # Results
+    st.header("Results")
+    st.subheader("Training and Validation Loss")
+    results_data = {
+        "Run": [1, 2, 3, 4],
+        "Configuration": [
+            "Raw, no cleaning/ordering",
+            "Cleaned text, unordered",
+            "Cleaned text + dropout",
+            "Cleaned text + dropout + ordering"
+        ],
+        "Epoch-3 Train Loss": [0.0065, 0.0023, 0.0061, 0.0119],
+        "Validation Loss": [0.1100, 0.0000, 0.0118, 0.0067]
+    }
+    st.table(results_data)
     st.markdown("""
+    **Key Finding:** Run 4 (cleaned text + dropout + ordering) achieved the best balance
+    between low validation loss and meaningful retrieval quality.
     """)
+    st.subheader("Qualitative Retrieval Examples")
     st.markdown("""
+    **Query: "beef steak dinner"**
+    - Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans*
+    - Run 4 (Final): *grilled garlic steak dinner*, *classic beef steak au poivre*
+    **Query: "chicken italian pasta"**
+    - Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans*
+    - Run 4 (Final): *creamy tuscan chicken pasta*, *italian chicken penne bake*
+    **Query: "vegetarian salad healthy"**
+    - Run 1 (Raw): (irrelevant hits)
+    - Run 4 (Final): *kale quinoa power salad*, *superfood spinach & berry salad*
     """)
+    # Discussion and Conclusion
+    st.header("Discussion and Conclusion")
+    st.markdown("""
+    The experimental evidence underscores the importance of disciplined pre-processing when
+    adapting large language models to niche domains. The breakthrough came with **ingredient-ordering**
+    (protein → vegetables → grains → dairy → other) which supplied consistent positional signals.
+    **Key Achievements:**
+    - End-to-end recipe recommendation system with semantic search
+    - Sub-second latency across 231k recipes
+    - Meaningful semantic understanding of culinary content
+    - Reproducible blueprint for domain-specific NLP applications
+    **Limitations:**
+    - Private dataset relatively small (15k samples) compared to public corpora
+    - Minimal hyperparameter search conducted
+    - Single-machine deployment tested
+    """)
+    # Technical Specifications
+    st.header("Technical Specifications")
+    col1, col2 = st.columns(2)
+    with col1:
+        st.markdown("""
+        **Dataset:**
+        - Total Recipes: 231,630
+        - Training Set: 15,000 recipes
+        - Average Tags per Recipe: ~6
+        - Ingredients per Recipe: 3-20
+        """)
+    with col2:
+        st.markdown("""
+        **Infrastructure:**
+        - Python 3.10
+        - PyTorch 2.1 (CUDA 11.8)
+        - Transformers 4.38
+        - Google Colab A100 GPU
+        """)
+    # References
+    st.header("References")
+    st.markdown("""
+    [1] Vaswani et al., "Attention Is All You Need," NeurIPS, 2017.
+    [2] Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL-HLT, 2019.
+    [3] Reimers and Gurevych, "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks," EMNLP-IJCNLP, 2019.
+    [4] Hugging Face, "BERT Model Documentation," 2024.
+    """)
     st.markdown("---")
+    st.markdown("© 2025 CSE 555 Term Project. All rights reserved.")
+# Render the report
 render_report()
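
Note on the scoring formulas: the report's "Mathematical Formulations" section states a cosine similarity between query and recipe embeddings and a final score that blends similarity and popularity with weights 0.6 and 0.4. The sketch below shows one way those two formulas could combine into a top-K ranking. It is an illustration only, not the project's retrieval code: the names `cosine_similarity` and `rank_recipes`, the toy vectors, and the assumption that popularity is already scaled to [0, 1] are hypothetical. Only the cosine formula and the two weights come from the report.

```python
import math

# Weights taken from the Score_i formula in the report.
SIMILARITY_WEIGHT = 0.6
POPULARITY_WEIGHT = 0.4


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is treated as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_recipes(query_vec, recipes, top_k=3):
    """Return the top_k (name, score) pairs, highest score first.

    `recipes` is a list of (name, embedding, popularity) tuples.
    Popularity is assumed to be pre-scaled to [0, 1]; the report does
    not say how it is normalized.
    """
    scored = []
    for name, embedding, popularity in recipes:
        similarity = cosine_similarity(query_vec, embedding)
        score = SIMILARITY_WEIGHT * similarity + POPULARITY_WEIGHT * popularity
        scored.append((name, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors standing in for 768-dimensional BERT embeddings.
toy_recipes = [
    ("steak", [1.0, 0.0, 0.0], 0.5),
    ("salad", [0.0, 1.0, 0.0], 1.0),
    ("pasta", [1.0, 1.0, 0.0], 0.0),
]
ranking = rank_recipes([1.0, 0.0, 0.0], toy_recipes, top_k=2)
print(ranking)
```

In this toy data the most popular recipe ("salad") is orthogonal to the query, so its popularity term alone is not enough to place it in the top two. With a 0.4 popularity weight, a low-similarity but highly rated recipe can still outrank a closer match when the similarity gap is small, which is worth keeping in mind when reading the qualitative retrieval examples.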