import streamlit as st

def render_report():
    st.title("Group 5: Term Project Report")
    
    # Title Page Information
    st.markdown("""
    **Course:** CSE 555 – Introduction to Pattern Recognition  
    **Authors:** Saksham Lakhera and Ahmed Zaher  
    **Date:** July 2025
    """)
    
    # Abstract
    st.header("Abstract")
    
    st.subheader("NLP Engineering Perspective")
    st.markdown("""
    This project addresses the challenge of improving recipe recommendation systems through
    advanced semantic search capabilities using transformer-based language models. Traditional
    keyword-based search methods often fail to capture the nuanced relationships between
    ingredients, cooking techniques, and user preferences in culinary contexts. 
    
    Our approach fine-tunes BERT (Bidirectional Encoder Representations from Transformers) 
    on a custom recipe dataset to develop a semantic understanding of culinary content. 
    We preprocessed and structured a subset of 15,000 recipes into standardized sequences organized
    by food categories (proteins, vegetables, legumes, etc.) to create training data optimized for
    the BERT architecture.
    
    The model was fine-tuned to learn contextual embeddings that capture semantic relationships 
    between ingredients and tags. At inference time we generate embeddings for all recipes in our 
    dataset and perform cosine-similarity retrieval to produce the top-K most relevant recipes 
    for a user query.
    """)
    
    # Introduction
    st.header("Introduction")
    st.markdown("""
    This term project serves primarily as an educational exercise aimed at giving students 
    end-to-end exposure to building a modern NLP system. Our goal is to construct a semantic 
    recipe-search engine that demonstrates how domain-specific fine-tuning of BERT can 
    substantially improve retrieval quality over simple keyword matching.
    
    **Key Contributions:**
    - A cleaned, category-labelled subset of 15,000 recipes
    - Training scripts that yield domain-adapted contextual embeddings
    - A production-ready retrieval service that returns the top-K most relevant recipes
    - Comparative evaluation against classical baselines
    """)
    
    # Dataset and Preprocessing
    st.header("Dataset and Pre-processing")
    
    st.subheader("Data Sources")
    st.markdown("""
    The project draws from two CSV files:
    - **Raw_recipes.csv** – 231,637 rows, one per recipe with columns: *id, name, ingredients, tags, minutes, steps, description, n_steps, n_ingredients*
    - **Raw_interactions.csv** – user feedback containing *recipe_id, user_id, rating (1-5), review text*
    """)
    
    st.subheader("Corpus Filtering and Subset Selection")
    st.markdown("""
    1. **Invalid rows removed** – recipes with empty ingredient lists, missing tags, or fewer than three total tags
    2. **Random sampling** – 15,000 recipes selected for NLP fine-tuning
    3. **Positive/negative pairs** – generated for contrastive learning using ratings and tag similarity
    4. **Train/test split** – 80/20 stratified split (12,000/3,000 pairs)
    """)
    
    st.subheader("Text Pre-processing Pipeline")
    st.markdown("""
    - **Lower-casing & punctuation removal** – normalized to lowercase, special characters stripped
    - **Stop-descriptor removal** – culinary modifiers (*fresh, chopped, minced*) and measurements removed
    - **Ingredient ordering** – re-ordered into the sequence **protein → vegetables → grains → dairy → other** (see the sketch below)
    - **Tag normalization** – mapped to six canonical slots: *cuisine, course, main-ingredient, dietary, difficulty, occasion*
    - **Tokenization** – standard *bert-base-uncased* WordPiece tokenizer, sequences truncated/padded to 128 tokens
    """)
    
    # Methodology
    st.header("Methodology")
    
    st.subheader("Model Architecture")
    st.markdown("""
    - **Base Model:** `bert-base-uncased` checkpoint
    - **Additional Layers:** Single linear classification layer (768 → 1) with dropout (p = 0.1)
    - **Training Objective:** Triplet-margin loss with margin of 1.0
    """)
    
    st.subheader("Hyperparameters")
    col1, col2 = st.columns(2)
    with col1:
        st.markdown("""
        - **Batch size:** 8
        - **Max sequence length:** 128 tokens
        - **Learning rate:** 2 × 10⁻⁵
        - **Weight decay:** 0.01
        """)
    with col2:
        st.markdown("""
        - **Optimizer:** AdamW
        - **Epochs:** 3
        - **Hardware:** Google Colab A100 GPU (40 GB VRAM)
        - **Training time:** ~75 minutes per run
        """)
    
    # Mathematical Formulations
    st.header("Mathematical Formulations")
    
    st.subheader("Query Embedding and Similarity Calculation")
    st.latex(r"""
        \text{Similarity}(q, r_i) = \cos(\hat{q}, \hat{r}_i) = \frac{\hat{q} \cdot \hat{r}_i}{\|\hat{q}\|\|\hat{r}_i\|}
    """)
    st.markdown("Where $\\hat{q}$ is the BERT embedding of the query, and $\\hat{r}_i$ is the embedding of the i-th recipe.")
    
    st.subheader("Final Score Calculation")
    st.latex(r"""
        \text{Score}_i = 0.6 \times \text{Similarity}_i + 0.4 \times \text{Popularity}_i
    """)
    
    # Results
    st.header("Results")
    
    st.subheader("Training and Validation Loss")
    results_data = {
        "Run": [1, 2, 3, 4],
        "Configuration": [
            "Raw, no cleaning/ordering",
            "Cleaned text, unordered", 
            "Cleaned text + dropout",
            "Cleaned text + dropout + ordering"
        ],
        "Epoch-3 Train Loss": [0.0065, 0.0023, 0.0061, 0.0119],
        "Validation Loss": [0.1100, 0.0000, 0.0118, 0.0067]
    }
    st.table(results_data)
    
    st.markdown("""
    **Key Finding:** Run 4 (cleaned text + dropout + ordering) achieved the best balance 
    between low validation loss and meaningful retrieval quality.
    """)
    
    st.subheader("Qualitative Retrieval Examples")
    st.markdown("""
    **Query: "beef steak dinner"**
    - Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans*
    - Run 4 (Final): *grilled garlic steak dinner*, *classic beef steak au poivre*
    
    **Query: "chicken italian pasta"**  
    - Run 1 (Raw): *to die for crock pot roast*, *crock pot chicken with black beans*
    - Run 4 (Final): *creamy tuscan chicken pasta*, *italian chicken penne bake*
    
    **Query: "vegetarian salad healthy"**
    - Run 1 (Raw): (irrelevant hits)
    - Run 4 (Final): *kale quinoa power salad*, *superfood spinach & berry salad*
    """)
    
    # Discussion and Conclusion
    st.header("Discussion and Conclusion")
    st.markdown("""
    The experimental evidence underscores the importance of disciplined pre-processing when 
    adapting large language models to niche domains. The breakthrough came with **ingredient-ordering** 
    (protein → vegetables → grains → dairy → other), which supplied consistent positional signals.
    
    **Key Achievements:**
    - End-to-end recipe recommendation system with semantic search
    - Sub-second latency across 231k recipes
    - Meaningful semantic understanding of culinary content
    - Reproducible blueprint for domain-specific NLP applications
    
    **Limitations:**
    - Fine-tuning subset relatively small (15k recipes) compared to public corpora
    - Minimal hyperparameter search conducted
    - Single-machine deployment tested
    """)
    
    # Technical Specifications
    st.header("Technical Specifications")
    col1, col2 = st.columns(2)
    with col1:
        st.markdown("""
        **Dataset:**
        - Total Recipes: 231,637
        - Training Set: 15,000 recipes
        - Average Tags per Recipe: ~6
        - Ingredients per Recipe: 3-20
        """)
    with col2:
        st.markdown("""
        **Infrastructure:**
        - Python 3.10
        - PyTorch 2.1 (CUDA 11.8)
        - Transformers 4.38
        - Google Colab A100 GPU
        """)
    
    # References
    st.header("References")
    st.markdown("""
    [1] Vaswani et al., "Attention Is All You Need," NeurIPS, 2017.
    
    [2] Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding," NAACL-HLT, 2019.
    
    [3] Reimers and Gurevych, "Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks," EMNLP-IJCNLP, 2019.
    
    [4] Hugging Face, "BERT Model Documentation," 2024.
    """)
    
    st.markdown("---")
    st.markdown("Β© 2025 CSE 555 Term Project. All rights reserved.")

# Render the report
render_report()