Spaces: raymondEDS/DS_webclass

raymondEDS committed on May 27

Commit faeb953, 1 parent: b1b0b70

test week 5

Files changed (2)

Reference files/Copy_Lab_5_hands_on_peer_review.ipynb +0 -0
app/pages/week_5.py +271 -0

Reference files/Copy_Lab_5_hands_on_peer_review.ipynb ADDED

The diff for this file is too large to render.

app/pages/week_5.py ADDED

	@@ -0,0 +1,271 @@

+import streamlit as st
+import pandas as pd
+import numpy as np
+import matplotlib.pyplot as plt
+import seaborn as sns
+from sklearn.linear_model import LinearRegression
+from sklearn.metrics import r2_score
+import scipy.stats as stats
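+import nltk
+from nltk.tokenize import word_tokenize
+# word_tokenize needs the Punkt tokenizer data; fetch it if it is missing
+# (newer NLTK releases look for 'punkt_tab', older ones for 'punkt')
+nltk.download('punkt', quiet=True)
+nltk.download('punkt_tab', quiet=True)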
+def show():
+    st.title("Week 5: Introduction to Machine Learning and Linear Regression")
+    # Introduction Section
+    st.header("Course Overview")
+    st.write("""
+    This week, we'll explore machine learning through a real-world challenge: The Academic Publishing Crisis.
+    Imagine you're the program chair for a prestigious AI conference. You've just received 5,000 paper submissions, and you need to:
+    - Decide which papers to accept (only 20% can be accepted)
+    - Ensure fair and consistent reviews
+    - Understand what makes reviewers confident in their assessments
+    The Problem: Human reviewers are inconsistent. Some are harsh, others lenient. Some write detailed reviews, others just a few sentences.
+    How can we use data to understand and improve this process?
+    **Your Mission: Build a machine learning system to analyze review patterns and predict paper acceptance!**
+    """)
+    # Learning Path
+    st.subheader("Key Concepts You'll Master")
+    st.write("""
+    1. **Linear Regression (线性回归):**
+       - Definition: A statistical method that models the relationship between a dependent variable and one or more independent variables
+       - Real-world example: Predicting house prices based on size and location
+    2. **Correlation Analysis (相关性分析):**
+       - Definition: Statistical measure that shows how strongly two variables are related
+       - Range: -1 (perfect negative correlation) to +1 (perfect positive correlation)
+    3. **Reading Linear Regression Output (解读线性回归结果):**
+       - R-squared (R²): Proportion of variance explained by the model (0-1)
+       - p-value: Probability of seeing a relationship at least this strong if there were truly no relationship (the null hypothesis)
+       - Coefficients (系数): How much the dependent variable changes with a one-unit change in the independent variable
+       - Standard errors: Uncertainty in coefficient estimates
+       - Confidence intervals: Range of plausible values for the true coefficient at a chosen confidence level (commonly 95%)
+    """)
+    # Module 1: Setting Up Your Data Science Toolkit
+    st.header("Module 1: Setting Up Your Data Science Toolkit")
+    st.write("""
+    Let's start by importing the necessary libraries for our analysis:
+    """)
+    st.code("""
+    import numpy as np
+    import pandas as pd
+    import scipy.stats as stats
+    import matplotlib.pyplot as plt
+    import sklearn
+    from nltk.tokenize import word_tokenize
+    import seaborn as sns
+    # Set up visualization style
+    sns.set_style("whitegrid")
+    sns.set_context("poster")
+    """)
+    # Module 2: Loading and Understanding Data
+    st.header("Module 2: Loading and Understanding Data")
+    st.write("""
+    Before diving into analysis, we need to understand our data structure. What information do we have about each review? Each submission?
+    """)
+    if st.button("Load Sample Data"):
+        # Create sample data for demonstration
+        sample_reviews = pd.DataFrame({
+            'rating_int': [6, 6, 5, 6, 8],
+            'confidence_int': [4.0, 4.0, 4.0, 3.0, 3.0],
+            'review': [
+                'There is a lot of recent work on link-prediction...',
+                'Pros: The different attention techniques...',
+                'Overview of the paper: This paper studies...',
+                'Summary: The authors propose a near minimax...',
+                'This paper introduces a GPU-friendly variant...'
+            ],
+            'forum': ['tGZu6DlbreV', 'uKhGRvM8QNH', 'IrM64DGB21', 'ww-7bdU6GA9', 'r1VGvBcxl']
+        })
+        st.write("Sample Reviews Data:")
+        st.dataframe(sample_reviews)
+    # Module 3: Feature Engineering
+    st.header("Module 3: Feature Engineering")
+    st.write("""
+    We'll create features from our review data that can help predict paper acceptance:
+    - Review length (word count)
+    - Review rating
+    - Reviewer confidence
+    - Number of keywords in the paper
+    """)
+    # Interactive Feature Engineering
+    st.subheader("Try Feature Engineering")
+    st.write("""
+    Let's create some features from a review:
+    """)
+    review_text = st.text_area(
+        "Enter a review to analyze:",
+        "This paper introduces a novel approach to machine learning. The methodology is sound and the results are promising.",
+        key="review_text"
+    )
+    if st.button("Extract Features"):
+        # Calculate features
+        word_count = len(word_tokenize(review_text))
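+        # Count non-empty fragments so a trailing period does not add a phantom sentence
+        sentence_count = len([s for s in review_text.split('.') if s.strip()])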
+        st.write("Extracted Features:")
+        st.write(f"Word Count: {word_count}")
+        st.write(f"Sentence Count: {sentence_count}")
+    # Module 4: Linear Regression Analysis
+    st.header("Module 4: Linear Regression Analysis")
+    st.write("""
+    Let's build a simple linear regression model to predict paper ratings based on review features.
+    """)
+    # Interactive Regression
+    st.subheader("Try Linear Regression")
+    st.write("""
+    Let's create a simple regression model:
+    """)
+    if st.button("Run Sample Regression"):
+        # Create sample data
+        np.random.seed(42)
+        X = np.random.rand(100, 1) * 10  # Review length
+        y = 2 * X + np.random.randn(100, 1) * 2  # Rating with some noise
+        # Fit regression model
+        model = LinearRegression()
+        model.fit(X, y)
+        # Create visualization
+        fig, ax = plt.subplots(figsize=(10, 6))
+        ax.scatter(X, y, color='blue', alpha=0.5)
+        ax.plot(X, model.predict(X), color='red', linewidth=2)
+        ax.set_xlabel('Review Length')
+        ax.set_ylabel('Rating')
+        ax.set_title('Linear Regression: Review Length vs Rating')
+        # Pass the figure explicitly instead of relying on pyplot's global state
+        st.pyplot(fig)
+        # Show model metrics
+        st.write(f"R-squared: {r2_score(y, model.predict(X)):.3f}")
+        st.write(f"Coefficient: {model.coef_[0][0]:.3f}")
+        st.write(f"Intercept: {model.intercept_[0]:.3f}")
+    # Practice Exercises
+    st.header("Practice Exercises")
+    with st.expander("Exercise 1: Feature Engineering"):
+        st.write("""
+        1. Load the reviews dataset
+        2. Create features from review text
+        3. Calculate correlation between features
+        4. Visualize relationships
+        """)
+        st.code("""
+        # Solution
+        import pandas as pd
+        import numpy as np
+        import matplotlib.pyplot as plt
+        import seaborn as sns
+        from nltk.tokenize import word_tokenize
+        # Load data
+        df_reviews = pd.read_csv('reviews.csv')
+        # Create features
+        df_reviews['word_count'] = df_reviews['review'].apply(
+            lambda x: len(word_tokenize(x)))
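+        # Ignore empty fragments so a trailing period is not counted as a sentence
+        df_reviews['sentence_count'] = df_reviews['review'].apply(
+            lambda x: len([s for s in x.split('.') if s.strip()]))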
+        # Calculate correlation
+        correlation = df_reviews[['word_count', 'rating_int',
+                                'confidence_int']].corr()
+        # Visualize
+        sns.heatmap(correlation, annot=True)
+        plt.show()
+        """)
+    with st.expander("Exercise 2: Building a Predictive Model"):
+        st.write("""
+        1. Prepare features for modeling
+        2. Split data into training and test sets
+        3. Train a linear regression model
+        4. Evaluate model performance
+        """)
+        st.code("""
+        # Solution
+        from sklearn.model_selection import train_test_split
+        from sklearn.linear_model import LinearRegression
+        # Prepare features (reuses df_reviews and its word_count column from Exercise 1)
+        X = df_reviews[['word_count', 'confidence_int']]
+        y = df_reviews['rating_int']
+        # Split data
+        X_train, X_test, y_train, y_test = train_test_split(
+            X, y, test_size=0.2, random_state=42)
+        # Train model
+        model = LinearRegression()
+        model.fit(X_train, y_train)
+        # Evaluate
+        train_score = model.score(X_train, y_train)
+        test_score = model.score(X_test, y_test)
+        print(f"Training R²: {train_score:.3f}")
+        print(f"Testing R²: {test_score:.3f}")
+        """)
+    # Weekly Assignment
+    username = st.session_state.get("username", "Student")
+    st.header(f"{username}'s Weekly Assignment")
+    if username == "manxiii":
+        st.markdown("""
+        Hello **manxiii**, here is your Assignment 5: Machine Learning Analysis.
+        1. Complete the feature engineering pipeline for the ICLR dataset
+        2. Build a linear regression model to predict paper ratings
+        3. Analyze the relationship between review features and acceptance
+        4. Submit your findings in a Jupyter notebook
+        **Due Date:** End of Week 5
+        """)
+    elif username == "zhu":
+        st.markdown("""
+        Hello **zhu**, here is your Assignment 5: Machine Learning Analysis.
+        1. Implement the complete machine learning workflow
+        2. Create insightful visualizations of model results
+        3. Draw conclusions from your analysis
+        4. Submit your work in a Jupyter notebook
+        **Due Date:** End of Week 5
+        """)
+    else:
+        # Default assignment; also serves WK, whose task list is identical
+        st.markdown(f"""
+        Hello **{username}**, here is your Assignment 5: Machine Learning Analysis.
+        1. Complete the feature engineering pipeline
+        2. Build and evaluate a linear regression model
+        3. Analyze patterns in the data
+        4. Submit your findings
+        **Due Date:** End of Week 5
+        """)
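
Note on "Reading Linear Regression Output": the page defines R-squared, coefficients, standard errors, and confidence intervals, but the sample regression only prints the first two, and scikit-learn's `LinearRegression` does not report standard errors or intervals. The following is a minimal, dependency-free sketch of how those quantities could be computed for a single predictor. The function name `simple_ols` and the toy word counts and ratings are hypothetical and not part of the commit, and the interval uses a normal approximation rather than the exact t distribution, so it will be slightly too narrow for small samples.

```python
import math
from statistics import NormalDist, fmean


def simple_ols(x, y, level=0.95):
    """Fit y = intercept + slope * x by least squares and summarize the fit.

    Hypothetical helper for illustration; assumes one predictor.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")

    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("x and y must each contain at least two distinct values")

    # Coefficient (slope) and intercept from the closed-form least-squares solution
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # R-squared: share of the variance in y explained by the fitted line
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1 - ss_res / ss_tot

    # Standard error of the slope: residual variance scaled by the spread of x
    std_err = math.sqrt(ss_res / (n - 2) / sxx)

    # Approximate confidence interval (normal quantile, an assumption;
    # the exact interval would use a t quantile with n - 2 degrees of freedom)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    ci = (slope - z * std_err, slope + z * std_err)

    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": r_squared,
        "std_err": std_err,
        "ci": ci,
    }


# Toy values made up for illustration only (not from the ICLR dataset)
word_counts = [120, 250, 310, 400, 520, 610]
ratings = [4, 5, 5, 6, 7, 8]

result = simple_ols(word_counts, ratings)
print(f"Coefficient: {result['slope']:.4f}")
print(f"Intercept: {result['intercept']:.3f}")
print(f"R-squared: {result['r_squared']:.3f}")
print(f"Std. error: {result['std_err']:.4f}")
print(f"Approx. 95% CI: ({result['ci'][0]:.4f}, {result['ci'][1]:.4f})")
```

A wide interval relative to the coefficient, or one that includes zero, suggests the data do not pin down the direction or size of the relationship.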