Spaces: raymondEDS/DS_webclass

raymondEDS committed on May 27

Commit ae38d1c

1 Parent(s): faeb953

Updating lesson 5


Files changed (15)

Data/Submissions.csv +0 -0
Data/decision.csv +0 -0
Data/reviews.csv +0 -0
Data/submission_keyword.csv +0 -0
app/__pycache__/__init__.cpython-311.pyc +0 -0
app/__pycache__/main.cpython-311.pyc +0 -0
app/components/__pycache__/__init__.cpython-311.pyc +0 -0
app/components/__pycache__/login.cpython-311.pyc +0 -0
app/main.py +4 -1
app/pages/__pycache__/week_1.cpython-311.pyc +0 -0
app/pages/__pycache__/week_2.cpython-311.pyc +0 -0
app/pages/__pycache__/week_3.cpython-311.pyc +0 -0
app/pages/__pycache__/week_4.cpython-311.pyc +0 -0
app/pages/__pycache__/week_5.cpython-311.pyc +0 -0
app/pages/week_5.py +269 -200

Data/Submissions.csv ADDED

The diff for this file is too large to render.

Data/decision.csv ADDED

The diff for this file is too large to render.

Data/reviews.csv ADDED

The diff for this file is too large to render.

Data/submission_keyword.csv ADDED

The diff for this file is too large to render.

app/__pycache__/__init__.cpython-311.pyc CHANGED

Binary files a/app/__pycache__/__init__.cpython-311.pyc and b/app/__pycache__/__init__.cpython-311.pyc differ

app/__pycache__/main.cpython-311.pyc CHANGED

Binary files a/app/__pycache__/main.cpython-311.pyc and b/app/__pycache__/main.cpython-311.pyc differ

app/components/__pycache__/__init__.cpython-311.pyc CHANGED

Binary files a/app/components/__pycache__/__init__.cpython-311.pyc and b/app/components/__pycache__/__init__.cpython-311.pyc differ

app/components/__pycache__/login.cpython-311.pyc CHANGED

Binary files a/app/components/__pycache__/login.cpython-311.pyc and b/app/components/__pycache__/login.cpython-311.pyc differ

app/main.py CHANGED

@@ -22,6 +22,7 @@ from app.pages import week_1
 from app.pages import week_2
 from app.pages import week_3
 from app.pages import week_4
 # Page configuration
 st.set_page_config(
     page_title="Data Science Course App",
@@ -146,6 +147,8 @@ def show_week_content():
         week_3.show()
     elif st.session_state.current_week == 4:
         week_4.show()
     else:
         st.warning("Content for this week is not yet available.")
@@ -158,7 +161,7 @@ def main():
         return
     # User is logged in, show course content
-    if st.session_state.current_week in [1, 2, 3, 4]:
         show_week_content()
     else:
         st.title("Data Science Research Paper Course")

 from app.pages import week_2
 from app.pages import week_3
 from app.pages import week_4
+from app.pages import week_5
 # Page configuration
 st.set_page_config(
     page_title="Data Science Course App",
         week_3.show()
     elif st.session_state.current_week == 4:
         week_4.show()
+    elif st.session_state.current_week == 5:
+        week_5.show()
     else:
         st.warning("Content for this week is not yet available.")
         return
     # User is logged in, show course content
+    if st.session_state.current_week in [1, 2, 3, 4, 5]:
         show_week_content()
     else:
         st.title("Data Science Research Paper Course")
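
Review note (a sketch, not part of this commit): adding a week to app/main.py currently takes three edits: an import, an `elif` branch in `show_week_content()`, and an entry in the list checked in `main()`. One way to keep those in sync, assuming every week module exposes a `show()` function as `week_1` through `week_5` appear to, is a single lookup table. `WEEK_PAGES`, `is_available`, `make_page`, and the stand-in page objects below are hypothetical names used only for illustration, and the functions take the week as an argument instead of reading `st.session_state`.

```python
from types import SimpleNamespace

# Stand-ins for the real page modules (app.pages.week_1 ... week_5);
# each real module is assumed to expose a show() function.
def make_page(week):
    return SimpleNamespace(show=lambda: f"week {week} content")

# Hypothetical registry: the single place where a new week is registered.
WEEK_PAGES = {week: make_page(week) for week in range(1, 6)}

def is_available(current_week):
    """Stands in for the hard-coded `in [1, 2, 3, 4, 5]` check in main()."""
    return current_week in WEEK_PAGES

def show_week_content(current_week):
    """Stands in for the if/elif chain; returns None for unknown weeks."""
    page = WEEK_PAGES.get(current_week)
    if page is None:
        return None  # the app would call st.warning(...) here
    return page.show()
```

With this shape, a future week 6 would need only one new entry in the table.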

app/pages/__pycache__/week_1.cpython-311.pyc CHANGED

Binary files a/app/pages/__pycache__/week_1.cpython-311.pyc and b/app/pages/__pycache__/week_1.cpython-311.pyc differ

app/pages/__pycache__/week_2.cpython-311.pyc CHANGED

Binary files a/app/pages/__pycache__/week_2.cpython-311.pyc and b/app/pages/__pycache__/week_2.cpython-311.pyc differ

app/pages/__pycache__/week_3.cpython-311.pyc CHANGED

Binary files a/app/pages/__pycache__/week_3.cpython-311.pyc and b/app/pages/__pycache__/week_3.cpython-311.pyc differ

app/pages/__pycache__/week_4.cpython-311.pyc CHANGED

Binary files a/app/pages/__pycache__/week_4.cpython-311.pyc and b/app/pages/__pycache__/week_4.cpython-311.pyc differ

app/pages/__pycache__/week_5.cpython-311.pyc ADDED

Binary file (18.4 kB).

app/pages/week_5.py CHANGED

@@ -7,6 +7,70 @@ from sklearn.linear_model import LinearRegression
 from sklearn.metrics import r2_score
 import scipy.stats as stats
 from nltk.tokenize import word_tokenize
 def show():
     st.title("Week 5: Introduction to Machine Learning and Linear Regression")
@@ -28,7 +92,7 @@ def show():
     """)
     # Learning Path
-    st.subheader("Key Concepts You'll Master")
     st.write("""
     1. **Linear Regression (线性回归):**
        - Definition: A statistical method that models the relationship between a dependent variable and one or more independent variables
@@ -46,226 +110,231 @@ def show():
        - Confidence intervals: Range where true coefficient likely lies
     """)
-    # Module 1: Setting Up Your Data Science Toolkit
-    st.header("Module 1: Setting Up Your Data Science Toolkit")
-    st.write("""
-    Let's start by importing the necessary libraries for our analysis:
-    """)
-    st.code("""
-    import numpy as np
-    import pandas as pd
-    import scipy.stats as stats
-    import matplotlib.pyplot as plt
-    import sklearn
-    from nltk.tokenize import word_tokenize
-    import seaborn as sns
-    # Set up visualization style
-    sns.set_style("whitegrid")
-    sns.set_context("poster")
-    """)
-    # Module 2: Loading and Understanding Data
-    st.header("Module 2: Loading and Understanding Data")
-    st.write("""
-    Before diving into analysis, we need to understand our data structure. What information do we have about each review? Each submission?
-    """)
-    if st.button("Load Sample Data"):
-        # Create sample data for demonstration
-        sample_reviews = pd.DataFrame({
-            'rating_int': [6, 6, 5, 6, 8],
-            'confidence_int': [4.0, 4.0, 4.0, 3.0, 3.0],
-            'review': [
-                'There is a lot of recent work on link-prediction...',
-                'Pros: The different attention techniques...',
-                'Overview of the paper: This paper studies...',
-                'Summary: The authors propose a near minimax...',
-                'This paper introduces a GPU-friendly variant...'
-            ],
-            'forum': ['tGZu6DlbreV', 'uKhGRvM8QNH', 'IrM64DGB21', 'ww-7bdU6GA9', 'r1VGvBcxl']
-        })
-        st.write("Sample Reviews Data:")
-        st.dataframe(sample_reviews)
-    # Module 3: Feature Engineering
-    st.header("Module 3: Feature Engineering")
-    st.write("""
-    We'll create features from our text data that can help predict paper acceptance:
-    - Review length (word count)
-    - Review rating
-    - Reviewer confidence
-    - Number of keywords in the paper
-    """)
-    # Interactive Feature Engineering
-    st.subheader("Try Feature Engineering")
-    st.write("""
-    Let's create some features from a review:
-    """)
-    review_text = st.text_area(
-        "Enter a review to analyze:",
-        "This paper introduces a novel approach to machine learning. The methodology is sound and the results are promising.",
-        key="review_text"
-    )
-    if st.button("Extract Features"):
-        # Calculate features
-        word_count = len(word_tokenize(review_text))
-        sentence_count = len(review_text.split('.'))
-        st.write("Extracted Features:")
-        st.write(f"Word Count: {word_count}")
-        st.write(f"Sentence Count: {sentence_count}")
-    # Module 4: Linear Regression Analysis
-    st.header("Module 4: Linear Regression Analysis")
-    st.write("""
-    Let's build a simple linear regression model to predict paper ratings based on review features.
-    """)
-    # Interactive Regression
-    st.subheader("Try Linear Regression")
-    st.write("""
-    Let's create a simple regression model:
-    """)
-    if st.button("Run Sample Regression"):
-        # Create sample data
-        np.random.seed(42)
-        X = np.random.rand(100, 1) * 10  # Review length
-        y = 2 * X + np.random.randn(100, 1) * 2  # Rating with some noise
-        # Fit regression model
-        model = LinearRegression()
-        model.fit(X, y)
-        # Create visualization
-        plt.figure(figsize=(10, 6))
-        plt.scatter(X, y, color='blue', alpha=0.5)
-        plt.plot(X, model.predict(X), color='red', linewidth=2)
-        plt.xlabel('Review Length')
-        plt.ylabel('Rating')
-        plt.title('Linear Regression: Review Length vs Rating')
-        st.pyplot(plt)
-        # Show model metrics
-        st.write(f"R-squared: {r2_score(y, model.predict(X)):.3f}")
-        st.write(f"Coefficient: {model.coef_[0][0]:.3f}")
-        st.write(f"Intercept: {model.intercept_[0]:.3f}")
-    # Practice Exercises
-    st.header("Practice Exercises")
-    with st.expander("Exercise 1: Feature Engineering"):
         st.write("""
-        1. Load the reviews dataset
-        2. Create features from review text
-        3. Calculate correlation between features
-        4. Visualize relationships
         """)
-        st.code("""
-        # Solution
-        import pandas as pd
-        import numpy as np
-        from nltk.tokenize import word_tokenize
-        # Load data
-        df_reviews = pd.read_csv('reviews.csv')
-        # Create features
-        df_reviews['word_count'] = df_reviews['review'].apply(
-            lambda x: len(word_tokenize(x)))
-        df_reviews['sentence_count'] = df_reviews['review'].apply(
-            lambda x: len(x.split('.')))
-        # Calculate correlation
-        correlation = df_reviews[['word_count', 'rating_int',
-                                'confidence_int']].corr()
-        # Visualize
-        sns.heatmap(correlation, annot=True)
-        plt.show()
-        """)
-    with st.expander("Exercise 2: Building a Predictive Model"):
         st.write("""
-        1. Prepare features for modeling
-        2. Split data into training and test sets
-        3. Train a linear regression model
-        4. Evaluate model performance
         """)
-        st.code("""
-        # Solution
-        from sklearn.model_selection import train_test_split
-        from sklearn.linear_model import LinearRegression
-        # Prepare features
         X = df_reviews[['word_count', 'confidence_int']]
         y = df_reviews['rating_int']
-        # Split data
-        X_train, X_test, y_train, y_test = train_test_split(
-            X, y, test_size=0.2, random_state=42)
-        # Train model
         model = LinearRegression()
-        model.fit(X_train, y_train)
-        # Evaluate
-        train_score = model.score(X_train, y_train)
-        test_score = model.score(X_test, y_test)
-        print(f"Training R²: {train_score:.3f}")
-        print(f"Testing R²: {test_score:.3f}")
-        """)
-    # Weekly Assignment
-    username = st.session_state.get("username", "Student")
-    st.header(f"{username}'s Weekly Assignment")
-    if username == "manxiii":
-        st.markdown("""
-        Hello **manxiii**, here is your Assignment 5: Machine Learning Analysis.
-        1. Complete the feature engineering pipeline for the ICLR dataset
-        2. Build a linear regression model to predict paper ratings
-        3. Analyze the relationship between review features and acceptance
-        4. Submit your findings in a Jupyter notebook
-        **Due Date:** End of Week 5
-        """)
-    elif username == "zhu":
-        st.markdown("""
-        Hello **zhu**, here is your Assignment 5: Machine Learning Analysis.
-        1. Implement the complete machine learning workflow
-        2. Create insightful visualizations of model results
-        3. Draw conclusions from your analysis
-        4. Submit your work in a Jupyter notebook
-        **Due Date:** End of Week 5
-        """)
-    elif username == "WK":
-        st.markdown("""
-        Hello **WK**, here is your Assignment 5: Machine Learning Analysis.
-        1. Complete the feature engineering pipeline
-        2. Build and evaluate a linear regression model
-        3. Analyze patterns in the data
-        4. Submit your findings
-        **Due Date:** End of Week 5
-        """)
-    else:
-        st.markdown(f"""
-        Hello **{username}**, here is your Assignment 5: Machine Learning Analysis.
-        1. Complete the feature engineering pipeline
-        2. Build and evaluate a linear regression model
-        3. Analyze patterns in the data
-        4. Submit your findings
-        **Due Date:** End of Week 5
-        """)

 from sklearn.metrics import r2_score
 import scipy.stats as stats
 from nltk.tokenize import word_tokenize
+import plotly.express as px
+import plotly.graph_objects as go
+from pathlib import Path
+import os
+# Set up the style for all plots
+plt.style.use('default')
+sns.set_theme(style="whitegrid", palette="husl")
+def load_data():
+    """Load and prepare the data"""
+    # Get the current file's directory
+    current_dir = Path(__file__).parent
+    # Navigate to the Data directory (two levels up from the pages directory)
+    data_dir = current_dir.parent.parent / "Data"
+    # Load the datasets
+    try:
+        df_reviews = pd.read_csv(data_dir / "reviews.csv")
+        df_submissions = pd.read_csv(data_dir / "Submissions.csv")
+        df_dec = pd.read_csv(data_dir / "decision.csv")
+        df_keyword = pd.read_csv(data_dir / "submission_keyword.csv")
+        return df_reviews, df_submissions, df_dec, df_keyword
+    except FileNotFoundError as e:
+        st.error(f"Data files not found. Please make sure the data files are in the correct location: {data_dir}")
+        st.error(f"Error details: {str(e)}")
+        return None, None, None, None
+def create_feature_plot(df, x_col, y_col, title):
+    """Create an interactive scatter plot using plotly"""
+    fig = px.scatter(df, x=x_col, y=y_col,
+                    title=title,
+                    labels={x_col: x_col.replace('_', ' ').title(),
+                           y_col: y_col.replace('_', ' ').title()},
+                    template="plotly_white")
+    fig.update_layout(
+        title_x=0.5,
+        title_font_size=20,
+        showlegend=True,
+        plot_bgcolor='white',
+        paper_bgcolor='white'
+    )
+    return fig
+def create_correlation_heatmap(df, columns):
+    """Create a correlation heatmap using plotly"""
+    corr = df[columns].corr()
+    fig = go.Figure(data=go.Heatmap(
+        z=corr,
+        x=corr.columns,
+        y=corr.columns,
+        colorscale='RdBu',
+        zmin=-1, zmax=1
+    ))
+    fig.update_layout(
+        title='Feature Correlation Heatmap',
+        title_x=0.5,
+        title_font_size=20,
+        plot_bgcolor='white',
+        paper_bgcolor='white'
+    )
+    return fig
 def show():
     st.title("Week 5: Introduction to Machine Learning and Linear Regression")
     """)
     # Learning Path
+    st.subheader("Key Concepts You'll Learn")
     st.write("""
     1. **Linear Regression (线性回归):**
        - Definition: A statistical method that models the relationship between a dependent variable and one or more independent variables
        - Confidence intervals: Range where true coefficient likely lies
     """)
+    # Load the data
+    try:
+        df_reviews, df_submissions, df_dec, df_keyword = load_data()
+        # Module 1: Data Exploration
+        st.header("Module 1: Data Exploration")
+        st.write("Let's explore our dataset to understand the review patterns:")
+        # Create features from review text
+        df_reviews['word_count'] = df_reviews['review'].apply(lambda x: len(str(x).split()))
+        df_reviews['sentence_count'] = df_reviews['review'].apply(lambda x: len(str(x).split('.')))
+        # Show basic statistics
+        col1, col2 = st.columns(2)
+        with col1:
+            st.metric("Total Reviews", len(df_reviews))
+            st.metric("Average Rating", f"{df_reviews['rating_int'].mean():.2f}")
+        with col2:
+            st.metric("Average Word Count", f"{df_reviews['word_count'].mean():.0f}")
+            st.metric("Average Confidence", f"{df_reviews['confidence_int'].mean():.2f}")
+        # Create interactive visualizations
+        st.subheader("Review Length vs Rating")
+        fig = create_feature_plot(df_reviews, 'word_count', 'rating_int',
+                                'Relationship between Review Length and Rating')
+        st.plotly_chart(fig, use_container_width=True)
+        # Correlation analysis
+        st.subheader("Feature Correlations")
+        corr_fig = create_correlation_heatmap(df_reviews,
+                                            ['word_count', 'rating_int', 'confidence_int'])
+        st.plotly_chart(corr_fig, use_container_width=True)
+        # Module 2: Feature Engineering
+        st.header("Module 2: Feature Engineering")
         st.write("""
+        Let's create more sophisticated features from our review data:
+        - Review length (word count)
+        - Review rating
+        - Reviewer confidence
+        - Number of keywords in the paper
         """)
+        # Interactive Feature Engineering
+        st.subheader("Try Feature Engineering")
+        review_text = st.text_area(
+            "Enter a review to analyze:",
+            "This paper introduces a novel approach to machine learning. The methodology is sound and the results are promising.",
+            key="review_text"
+        )
+        if st.button("Extract Features"):
+            # Calculate features
+            word_count = len(word_tokenize(review_text))
+            sentence_count = len(review_text.split('.'))
+            # Create a nice display of features
+            col1, col2, col3 = st.columns(3)
+            with col1:
+                st.metric("Word Count", word_count)
+            with col2:
+                st.metric("Sentence Count", sentence_count)
+            with col3:
+                st.metric("Average Words per Sentence", f"{word_count/sentence_count:.1f}")
+        # Module 3: Linear Regression Analysis
+        st.header("Module 3: Linear Regression Analysis")
         st.write("""
+        Let's build a linear regression model to predict paper ratings based on review features.
         """)
+        # Prepare data for modeling
         X = df_reviews[['word_count', 'confidence_int']]
         y = df_reviews['rating_int']
+        # Fit regression model
         model = LinearRegression()
+        model.fit(X, y)
+        # Create 3D visualization of the regression
+        st.subheader("3D Visualization of Review Features")
+        fig = px.scatter_3d(df_reviews.sample(min(1000, len(df_reviews))),
+                           x='word_count',
+                           y='confidence_int',
+                           z='rating_int',
+                           title='Review Features in 3D Space',
+                           labels={
+                               'word_count': 'Word Count',
+                               'confidence_int': 'Confidence',
+                               'rating_int': 'Rating'
+                           })
+        fig.update_layout(
+            title_x=0.5,
+            title_font_size=20,
+            scene = dict(
+                xaxis_title='Word Count',
+                yaxis_title='Confidence',
+                zaxis_title='Rating'
+            )
+        )
+        st.plotly_chart(fig, use_container_width=True)
+        # Show model metrics
+        st.subheader("Model Performance")
+        col1, col2, col3 = st.columns(3)
+        with col1:
+            st.metric("R-squared", f"{model.score(X, y):.3f}")
+        with col2:
+            st.metric("Word Count Coefficient", f"{model.coef_[0]:.3f}")
+        with col3:
+            st.metric("Confidence Coefficient", f"{model.coef_[1]:.3f}")
+        # Practice Exercises
+        st.header("Practice Exercises")
+        with st.expander("Exercise 1: Feature Engineering"):
+            st.write("""
+            1. Load the reviews dataset
+            2. Create features from review text
+            3. Calculate correlation between features
+            4. Visualize relationships
+            """)
+            st.code("""
+            # Solution
+            import pandas as pd
+            import numpy as np
+            import matplotlib.pyplot as plt
+            import seaborn as sns
+            from nltk.tokenize import word_tokenize
+            # Load data
+            df_reviews = pd.read_csv('reviews.csv')
+            # Create features
+            df_reviews['word_count'] = df_reviews['review'].apply(
+                lambda x: len(word_tokenize(x)))
+            df_reviews['sentence_count'] = df_reviews['review'].apply(
+                lambda x: len(x.split('.')))
+            # Calculate correlation
+            correlation = df_reviews[['word_count', 'rating_int',
+                                    'confidence_int']].corr()
+            # Visualize
+            sns.heatmap(correlation, annot=True)
+            plt.show()
+            """)
+        with st.expander("Exercise 2: Building a Predictive Model"):
+            st.write("""
+            1. Prepare features for modeling
+            2. Split data into training and test sets
+            3. Train a linear regression model
+            4. Evaluate model performance
+            """)
+            st.code("""
+            # Solution
+            from sklearn.model_selection import train_test_split
+            from sklearn.linear_model import LinearRegression
+            # Prepare features
+            X = df_reviews[['word_count', 'confidence_int']]
+            y = df_reviews['rating_int']
+            # Split data
+            X_train, X_test, y_train, y_test = train_test_split(
+                X, y, test_size=0.2, random_state=42)
+            # Train model
+            model = LinearRegression()
+            model.fit(X_train, y_train)
+            # Evaluate
+            train_score = model.score(X_train, y_train)
+            test_score = model.score(X_test, y_test)
+            print(f"Training R²: {train_score:.3f}")
+            print(f"Testing R²: {test_score:.3f}")
+            """)
+        # Weekly Assignment
+        username = st.session_state.get("username", "Student")
+        st.header(f"{username}'s Weekly Assignment")
+        if username == "manxiii":
+            st.markdown("""
+            Hello **manxiii**, here is your Assignment 5: Machine Learning Analysis.
+            1. Complete the feature engineering pipeline for the ICLR dataset
+            2. Build a linear regression model to predict paper ratings
+            3. Analyze the relationship between review features and acceptance
+            4. Submit your findings in a Jupyter notebook
+            **Due Date:** End of Week 5
+            """)
+        elif username == "zhu":
+            st.markdown("""
+            Hello **zhu**, here is your Assignment 5: Machine Learning Analysis.
+            1. Implement the complete machine learning workflow
+            2. Create insightful visualizations of model results
+            3. Draw conclusions from your analysis
+            4. Submit your work in a Jupyter notebook
+            **Due Date:** End of Week 5
+            """)
+        elif username == "WK":
+            st.markdown("""
+            Hello **WK**, here is your Assignment 5: Machine Learning Analysis.
+            1. Complete the feature engineering pipeline
+            2. Build and evaluate a linear regression model
+            3. Analyze patterns in the data
+            4. Submit your findings
+            **Due Date:** End of Week 5
+            """)
+        else:
+            st.markdown(f"""
+            Hello **{username}**, here is your Assignment 5: Machine Learning Analysis.
+            1. Complete the feature engineering pipeline
+            2. Build and evaluate a linear regression model
+            3. Analyze patterns in the data
+            4. Submit your findings
+            **Due Date:** End of Week 5
+            """)
+    except Exception as e:
+        st.error(f"Error loading data: {str(e)}")
+        st.write("Please make sure the data files are in the correct location.")
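
Review note (a sketch, not part of this commit): `len(text.split('.'))` also counts the empty fragment after the final period, so the default two-sentence review in the text area is reported as three sentences, which in turn lowers the "Average Words per Sentence" metric. A minimal alternative is sketched below, assuming sentences end in `.`, `!`, or `?`. The helper names `count_words`, `count_sentences`, and `words_per_sentence` are hypothetical, and the regular expression is a rough heuristic: abbreviations such as "e.g." and decimals such as "3.14" are still split.

```python
import re

def count_words(text):
    """Whitespace-based word count, like the str.split() used in Module 1."""
    return len(str(text).split())

def count_sentences(text):
    """Count non-empty fragments between runs of sentence-ending punctuation."""
    fragments = re.split(r"[.!?]+", str(text))
    return sum(1 for fragment in fragments if fragment.strip())

def words_per_sentence(text):
    """Average words per sentence; 0.0 when the text has no sentences."""
    sentences = count_sentences(text)
    return count_words(text) / sentences if sentences else 0.0

review = ("This paper introduces a novel approach to machine learning. "
          "The methodology is sound and the results are promising.")
print(count_sentences(review))   # 2
print(len(review.split(".")))    # 3 (the trailing empty fragment is counted)
```

The `str(text)` conversion mirrors the commit's own handling of non-string cells in the `review` column; a missing value would still be counted as the one-word string "nan", so dropping or filling missing reviews first may be preferable.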