Spaces: raymondEDS/DS_webclass

raymondEDS committed on Jun 3

Commit 4a23d33 (1 parent: f91be81)

Week 6 logistic regression

Files changed (8)

Reference files/w6_logistic_regression_lab.py +400 -0
app/__pycache__/main.cpython-311.pyc +0 -0
app/main.py +6 -4
app/pages/__pycache__/week_2.cpython-311.pyc +0 -0
app/pages/__pycache__/week_5.cpython-311.pyc +0 -0
app/pages/__pycache__/week_6.cpython-311.pyc +0 -0
app/pages/week_6.py +803 -0
requirements.txt +2 -1

Reference files/w6_logistic_regression_lab.py ADDED

	@@ -0,0 +1,400 @@

+# -*- coding: utf-8 -*-
+"""W6_Logistic_regression_lab
+Automatically generated by Colab.
+Original file is located at
+    https://colab.research.google.com/drive/1MG7N2HN-Nxow9fzvc0fzxvp3WyKqtgs8
+# 🚀 Logistic Regression Lab: Stock Market Prediction
+## Lab Overview
+In this lab, we'll use logistic regression to try predicting whether the stock market goes up or down. Spoiler alert: This is intentionally a challenging prediction problem that will teach us important lessons about when logistic regression works well and when it doesn't.
+## Learning Goals:
+- Apply logistic regression to real data
+- Interpret probabilities and coefficients
+- Understand why some prediction problems are inherently difficult
+- Learn proper model evaluation techniques
+## The Stock Market Data
+In this lab we will examine the `Smarket`
+data, which is part of the `ISLP`
+library. This data set consists of percentage returns for the S&P 500
+stock index over 1,250 days, from the beginning of 2001 until the end
+of 2005. For each date, we have recorded the percentage returns for
+each of the five previous trading days,  `Lag1`  through
+ `Lag5`. We have also recorded  `Volume`  (the number of
+shares traded on the previous day, in billions),  `Today`  (the
+percentage return on the date in question) and  `Direction`
+(whether the market was  `Up`  or  `Down`  on this date).
+### Your Challenge
+**Question**: Can we predict if the S&P 500 will go up or down based on recent trading patterns?
+**Why This Matters:** If predictable, this would be incredibly valuable. If not predictable, we learn about market efficiency and realistic expectations for prediction models.
+To answer the question, **we start by importing our libraries at the top level; these are all imports we have seen in previous labs.**
+"""
+import numpy as np
+import pandas as pd
+from matplotlib.pyplot import subplots
+import statsmodels.api as sm
+from ISLP import load_data
+from ISLP.models import (ModelSpec as MS,
+                         summarize)
+"""We also collect together the new imports needed for this lab."""
+from ISLP import confusion_table
+from ISLP.models import contrast
+from sklearn.discriminant_analysis import \
+     (LinearDiscriminantAnalysis as LDA,
+      QuadraticDiscriminantAnalysis as QDA)
+from sklearn.naive_bayes import GaussianNB
+from sklearn.neighbors import KNeighborsClassifier
+from sklearn.preprocessing import StandardScaler
+from sklearn.model_selection import train_test_split
+from sklearn.linear_model import LogisticRegression
+"""Now we are ready to load the `Smarket` data."""
+Smarket = load_data('Smarket')
+Smarket
+"""This gives a truncated listing of the data.
+We can see what the variable names are.
+"""
+Smarket.columns
+"""We compute the correlation matrix using the `corr()` method
+for data frames, which produces a matrix that contains all of
+the pairwise correlations among the variables.
+Because we instruct `pandas` to use only numeric variables, the `corr()` method does not report a correlation for the `Direction` variable, which is
+qualitative.
+"""
+Smarket.corr(numeric_only=True)
+"""As one would expect, the correlations between the lagged return  variables and
+today’s return are close to zero.  The only substantial correlation is between  `Year`  and
+ `Volume`. By plotting the data we see that  `Volume`
+is increasing over time. In other words, the average number of shares traded
+daily increased from 2001 to 2005.
+"""
+Smarket.plot(y='Volume');
+"""## Logistic Regression
+Next, we will fit a logistic regression model in order to predict
+ `Direction`  using  `Lag1`  through  `Lag5`  and
+ `Volume`. The `sm.GLM()`  function fits *generalized linear models*, a class of
+models that includes logistic regression.  Alternatively,
+the function `sm.Logit()` fits a logistic regression
+model directly. The syntax of
+`sm.GLM()` is similar to that of `sm.OLS()`, except
+that we must pass in the argument `family=sm.families.Binomial()`
+in order to tell `statsmodels` to run a logistic regression rather than some other
+type of generalized linear model.
+"""
+allvars = Smarket.columns.drop(['Today', 'Direction', 'Year'])
+design = MS(allvars)
+X = design.fit_transform(Smarket)
+y = Smarket.Direction == 'Up'
+glm = sm.GLM(y,
+             X,
+             family=sm.families.Binomial())
+results = glm.fit()
+summarize(results)
+"""The smallest *p*-value here is associated with  `Lag1`. The
+negative coefficient for this predictor suggests that if the market
+had a positive return yesterday, then it is less likely to go up
+today. However, at a value of 0.15, the *p*-value is still
+relatively large, and so there is no clear evidence of a real
+association between  `Lag1`  and  `Direction`.
+We use the `params`  attribute of `results`
+in order to access just the
+coefficients for this fitted model.
+"""
+results.params
+"""Likewise we can use the
+`pvalues`  attribute to access the *p*-values for the coefficients.
+"""
+results.pvalues
+"""The `predict()`  method of `results` can be used to predict the
+probability that the market will go up, given values of the
+predictors. This method returns predictions
+on the probability scale. If no data set is supplied to the `predict()`
+function, then the probabilities are computed for the training data
+that was used to fit the logistic regression model.
+As with linear regression, one can pass an optional `exog` argument consistent
+with a design matrix if desired. Here we have
+printed only the first ten probabilities.
+"""
+probs = results.predict()
+probs[:10]
+"""In order to make a prediction as to whether the market will go up or
+down on a particular day, we must convert these predicted
+probabilities into class labels,  `Up`  or  `Down`.  The
+following two commands create a vector of class predictions based on
+whether the predicted probability of a market increase is greater than
+or less than 0.5.
+"""
+labels = np.array(['Down']*1250)
+labels[probs>0.5] = "Up"
+"""The `confusion_table()`
+function from the `ISLP` package summarizes these predictions, showing   how
+many observations were correctly or incorrectly classified. Our function, which is adapted from a similar function
+in the module `sklearn.metrics`,  transposes the resulting
+matrix and includes row and column labels.
+The `confusion_table()` function takes as first argument the
+predicted labels, and second argument the true labels.
+"""
+confusion_table(labels, Smarket.Direction)
+"""The diagonal elements of the confusion matrix indicate correct
+predictions, while the off-diagonals represent incorrect
+predictions. Hence our model correctly predicted that the market would
+go up on 507 days and that it would go down on 145 days, for a
+total of 507 + 145 = 652 correct predictions. The `np.mean()`
+function can be used to compute the fraction of days for which the
+prediction was correct. In this case, logistic regression correctly
+predicted the movement of the market 52.2% of the time.
+"""
+(507+145)/1250, np.mean(labels == Smarket.Direction)
+"""At first glance, it appears that the logistic regression model is
+working a little better than random guessing. However, this result is
+misleading because we trained and tested the model on the same set of
+1,250 observations. In other words, 100% - 52.2% = 47.8% is the
+*training* error  rate. As we have seen
+previously, the training error rate is often overly optimistic --- it
+tends to underestimate the test error rate.  In
+order to better assess the accuracy of the logistic regression model
+in this setting, we can fit the model using part of the data, and
+then examine how well it predicts the *held out* data.  This
+will yield a more realistic error rate, in the sense that in practice
+we will be interested in our model’s performance not on the data that
+we used to fit the model, but rather on days in the future for which
+the market’s movements are unknown.
+To implement this strategy, we first create a Boolean vector
+corresponding to the observations from 2001 through 2004. We  then
+use this vector to create a held out data set of observations from
+2005.
+"""
+train = (Smarket.Year < 2005)
+Smarket_train = Smarket.loc[train]
+Smarket_test = Smarket.loc[~train]
+Smarket_test.shape
+"""The object `train` is a vector of 1,250 elements, corresponding
+to the observations in our data set. The elements of the vector that
+correspond to observations that occurred before 2005 are set to
+`True`, whereas those that correspond to observations in 2005 are
+set to `False`.  Hence `train` is a
+*boolean*   array, since its
+elements are `True` and `False`.  Boolean arrays can be used
+to obtain a subset of the rows or columns of a data frame
+using the `loc` method. For instance,
+the command `Smarket.loc[train]` would pick out a submatrix of the
+stock market data set, corresponding only to the dates before 2005,
+since those are the ones for which the elements of `train` are
+`True`.  The `~` symbol can be used to negate all of the
+elements of a Boolean vector. That is, `~train` is a vector
+similar to `train`, except that the elements that are `True`
+in `train` get swapped to `False` in `~train`, and vice versa.
+Therefore, `Smarket.loc[~train]` yields a
+subset of the rows of the data frame
+of the stock market data containing only the observations for which
+`train` is `False`.
+The output above indicates that there are 252 such
+observations.
+We now fit a logistic regression model using only the subset of the
+observations that correspond to dates before 2005. We then obtain predicted probabilities of the
+stock market going up for each of the days in our test set --- that is,
+for the days in 2005.
+"""
+X_train, X_test = X.loc[train], X.loc[~train]
+y_train, y_test = y.loc[train], y.loc[~train]
+glm_train = sm.GLM(y_train,
+                   X_train,
+                   family=sm.families.Binomial())
+results = glm_train.fit()
+probs = results.predict(exog=X_test)
+"""Notice that we have trained and tested our model on two completely
+separate data sets: training was performed using only the dates before
+2005, and testing was performed using only the dates in 2005.
+Finally, we compare the predictions for 2005 to the
+actual movements of the market over that time period.
+We will first store the test and training labels (recall `y_test` is binary).
+"""
+D = Smarket.Direction
+L_train, L_test = D.loc[train], D.loc[~train]
+"""Now we threshold the
+fitted probability at 50% to form
+our predicted labels.
+"""
+labels = np.array(['Down']*252)
+labels[probs>0.5] = 'Up'
+confusion_table(labels, L_test)
+"""The test accuracy is about 48%, while the error rate is about 52%."""
+np.mean(labels == L_test), np.mean(labels != L_test)
+"""The `!=` notation means *not equal to*, and so the last command
+computes the test set error rate. The results are rather
+disappointing: the test error rate is 52%, which is worse than
+random guessing! Of course this result is not all that surprising,
+given that one would not generally expect to be able to use previous
+days’ returns to predict future market performance. (After all, if it
+were possible to do so, then the authors of this book would be out
+striking it rich rather than writing a statistics textbook.)
+We recall that the logistic regression model had very underwhelming
+*p*-values associated with all of the predictors, and that the
+smallest *p*-value, though not very small, corresponded to
+ `Lag1`. Perhaps by removing the variables that appear not to be
+helpful in predicting  `Direction`, we can obtain a more
+effective model. After all, using predictors that have no relationship
+with the response tends to cause a deterioration in the test error
+rate (since such predictors cause an increase in variance without a
+corresponding decrease in bias), and so removing such predictors may
+in turn yield an improvement.  Below we refit the logistic
+regression using just  `Lag1`  and  `Lag2`, which seemed to
+have the highest predictive power in the original logistic regression
+model.
+"""
+model = MS(['Lag1', 'Lag2']).fit(Smarket)
+X = model.transform(Smarket)
+X_train, X_test = X.loc[train], X.loc[~train]
+glm_train = sm.GLM(y_train,
+                   X_train,
+                   family=sm.families.Binomial())
+results = glm_train.fit()
+probs = results.predict(exog=X_test)
+labels = np.array(['Down']*252)
+labels[probs>0.5] = 'Up'
+confusion_table(labels, L_test)
+"""Let’s evaluate the overall accuracy as well as the accuracy within the days when
+logistic regression predicts an increase.
+"""
+(35+106)/252,106/(106+76)
+"""Now the results appear to be a little better: 56% of the daily
+movements have been correctly predicted. It is worth noting that in
+this case, a much simpler strategy of predicting that the market will
+increase every day will also be correct 56% of the time! Hence, in
+terms of overall error rate, the logistic regression method is no
+better than the naive approach. However, the confusion matrix
+shows that on days when logistic regression predicts an increase in
+the market, it has a 58% accuracy rate. This suggests a possible
+trading strategy of buying on days when the model predicts an
+increasing market, and avoiding trades on days when a decrease is
+predicted. Of course one would need to investigate more carefully
+whether this small improvement was real or just due to random chance.
+Suppose that we want to predict the returns associated with particular
+values of  `Lag1`  and  `Lag2`. In particular, we want to
+predict  `Direction`  on a day when  `Lag1`  and
+ `Lag2`  equal $1.2$ and $1.1$, respectively, and on a day when they
+equal $1.5$ and $-0.8$.  We do this using the `predict()`
+function.
+"""
+newdata = pd.DataFrame({'Lag1':[1.2, 1.5],
+                        'Lag2':[1.1, -0.8]});
+newX = model.transform(newdata)
+results.predict(newX)
+# Alternative approach: fit the two-predictor model with scikit-learn.
+# pandas, numpy, statsmodels, train_test_split and LogisticRegression are already imported above.
+import matplotlib.pyplot as plt
+from sklearn.metrics import classification_report, confusion_matrix
+# Load the dataset
+data = load_data('Smarket')
+# Display the first few rows of the dataset
+print(data.head())
+# Prepare the data for logistic regression
+# Using 'Lag1' and 'Lag2' as predictors and 'Direction' as the response
+data['Direction'] = data['Direction'].map({'Up': 1, 'Down': 0}).astype(int)
+X = data[['Lag1', 'Lag2']]
+y = data['Direction']
+# Split the data into training and testing sets (a random 70/30 split, unlike the chronological 2001-2004 vs. 2005 split used earlier)
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
+# Fit the logistic regression model
+log_reg = LogisticRegression()
+log_reg.fit(X_train, y_train)
+# Make predictions on the test set
+y_pred = log_reg.predict(X_test)
+# Print classification report and confusion matrix
+print(classification_report(y_test, y_pred))
+print(confusion_matrix(y_test, y_pred))
+# Visualize the decision boundary
+plt.figure(figsize=(10, 6))
+# Create a mesh grid for plotting decision boundary
+x_min, x_max = X['Lag1'].min() - 1, X['Lag1'].max() + 1
+y_min, y_max = X['Lag2'].min() - 1, X['Lag2'].max() + 1
+xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
+                     np.arange(y_min, y_max, 0.01))
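+# Predict the class for every grid point; wrap the grid in a DataFrame so the
+# column names match those the model was fitted with
+grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=['Lag1', 'Lag2'])
+Z = log_reg.predict(grid)
+Z = Z.reshape(xx.shape)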
+# Plot the decision boundary
+plt.contourf(xx, yy, Z, alpha=0.8)
+plt.scatter(X_test['Lag1'], X_test['Lag2'], c=y_test, edgecolor='k', s=20)
+plt.xlabel('Lag1')
+plt.ylabel('Lag2')
+plt.title('Logistic Regression Decision Boundary')
+plt.show()
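
Reviewer note on the lab file above: the accuracy figures quoted for the two-predictor model can be checked with a few lines of NumPy. The sketch below is not part of the commit. It rebuilds label vectors from the counts stated in the lab text (35, 106 and 76, with the fourth cell inferred as 252 - 35 - 106 - 76 = 35), and `confusion_counts` is a hypothetical helper that only mimics the predicted-by-true layout the lab describes for `confusion_table()`.

```python
import numpy as np

def confusion_counts(predicted, truth, classes=("Down", "Up")):
    """Hypothetical helper: rows are predicted labels, columns are true labels."""
    predicted = np.asarray(predicted)
    truth = np.asarray(truth)
    table = np.zeros((len(classes), len(classes)), dtype=int)
    for i, p in enumerate(classes):
        for j, t in enumerate(classes):
            table[i, j] = np.sum((predicted == p) & (truth == t))
    return table

# Label vectors that reproduce the counts quoted in the lab text:
# 35 correct "Down", 106 correct "Up", 76 days predicted "Up" that were "Down".
# The remaining 35 days (inferred) were predicted "Down" but were "Up".
predicted = np.array(["Down"] * 35 + ["Down"] * 35 + ["Up"] * 76 + ["Up"] * 106)
truth = np.array(["Down"] * 35 + ["Up"] * 35 + ["Down"] * 76 + ["Up"] * 106)

table = confusion_counts(predicted, truth)

# Overall accuracy: correct predictions (the diagonal) over all test days.
accuracy = np.trace(table) / table.sum()
# Accuracy restricted to the days on which the model predicted "Up".
up_precision = table[1, 1] / table[1].sum()
# Baseline: predict "Up" every day.
always_up_accuracy = np.mean(truth == "Up")

print(table)
print(f"accuracy={accuracy:.3f}, up_precision={up_precision:.3f}, "
      f"always_up={always_up_accuracy:.3f}")
```

Under these counts the overall accuracy and the always-"Up" baseline coincide at 141/252, which is the point the lab text makes when it says the model is no better than the naive approach on overall error rate.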

app/__pycache__/main.cpython-311.pyc CHANGED

Binary files a/app/__pycache__/main.cpython-311.pyc and b/app/__pycache__/main.cpython-311.pyc differ

app/main.py CHANGED

@@ -8,8 +8,7 @@ from sklearn.linear_model import LinearRegression
 import nltk
 from nltk.corpus import stopwords
 from nltk.tokenize import word_tokenize, sent_tokenize
-nltk.download('punkt_tab')
-nltk.download('stopwords')
 # Add the parent directory to the Python path
 sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
@@ -23,6 +22,7 @@ from app.pages import week_2
 from app.pages import week_3
 from app.pages import week_4
 from app.pages import week_5
+from app.pages import week_6
 # Page configuration
 st.set_page_config(
     page_title="Data Science Course App",
@@ -149,6 +149,8 @@ def show_week_content():
         week_4.show()
     elif st.session_state.current_week == 5:
         week_5.show()
+    elif st.session_state.current_week == 6:
+        week_6.show()
     else:
         st.warning("Content for this week is not yet available.")
@@ -161,14 +163,14 @@ def main():
         return
     # User is logged in, show course content
-    if st.session_state.current_week in [1, 2, 3, 4, 5]:
+    if st.session_state.current_week in [1, 2, 3, 4, 5, 6]:
         show_week_content()
     else:
         st.title("Data Science Research Paper Course")
         st.markdown("""
         ## Welcome to the Data Science Research Paper Course! 📚
-        This section has not bee released yet.
+        This section has not been released yet.
         """)
 if __name__ == "__main__":
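
Reviewer note on `app/main.py`: adding a week currently means touching an import, an `elif` branch, and the list of released weeks. One possible alternative, sketched below under assumptions, is a single registry that maps week numbers to page functions. This is not what the commit does; `WEEK_PAGES`, `is_released` and the stand-in `show_week_5` / `show_week_6` functions are hypothetical names, and the functions return strings here only so the sketch runs without Streamlit.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for week_5.show and week_6.show in app/pages.
def show_week_5() -> str:
    return "week 5 content"

def show_week_6() -> str:
    return "week 6 content"

# Single registry: releasing a new week means adding one entry here.
WEEK_PAGES: Dict[int, Callable[[], str]] = {
    5: show_week_5,
    6: show_week_6,
}

def is_released(current_week: int) -> bool:
    """Replaces the hard-coded list of released week numbers."""
    return current_week in WEEK_PAGES

def show_week_content(current_week: int) -> str:
    """Dispatch to the page for the given week, with the same fallback text."""
    handler = WEEK_PAGES.get(current_week)
    if handler is None:
        return "Content for this week is not yet available."
    return handler()

print(show_week_content(6))
print(show_week_content(7))
```

With this shape the released-week check and the dispatch cannot drift apart, because both read from the same dictionary.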

app/pages/__pycache__/week_2.cpython-311.pyc CHANGED

Binary files a/app/pages/__pycache__/week_2.cpython-311.pyc and b/app/pages/__pycache__/week_2.cpython-311.pyc differ

app/pages/__pycache__/week_5.cpython-311.pyc CHANGED

Binary files a/app/pages/__pycache__/week_5.cpython-311.pyc and b/app/pages/__pycache__/week_5.cpython-311.pyc differ

app/pages/__pycache__/week_6.cpython-311.pyc ADDED

Binary file (34.6 kB).

app/pages/week_6.py ADDED

	@@ -0,0 +1,803 @@

+import streamlit as st
+import pandas as pd
+import numpy as np
+import matplotlib.pyplot as plt
+import seaborn as sns
+from sklearn.model_selection import train_test_split
+from sklearn.linear_model import LogisticRegression
+from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
+from sklearn.preprocessing import StandardScaler
+import plotly.express as px
+import plotly.graph_objects as go
+from plotly.subplots import make_subplots
+import scipy.stats as stats
+from pathlib import Path
+import statsmodels.api as sm
+from ISLP import load_data
+from ISLP.models import ModelSpec as MS, summarize
+# Set up the style for all plots
+plt.style.use('default')
+sns.set_theme(style="whitegrid", palette="husl")
+def load_smarket_data():
+    """Load and prepare the Smarket data"""
+    try:
+        Smarket = load_data('Smarket')
+        return Smarket
+    except Exception as e:
+        st.error(f"Error loading Smarket data: {str(e)}")
+        return None
+def create_confusion_matrix_plot(y_true, y_pred, title="Confusion Matrix"):
+    """Create an interactive confusion matrix plot"""
+    cm = confusion_matrix(y_true, y_pred)
+    fig = go.Figure(data=go.Heatmap(
+        z=cm,
+        x=['Predicted Down', 'Predicted Up'],
+        y=['Actual Down', 'Actual Up'],
+        colorscale='RdBu',
+        text=[[str(val) for val in row] for row in cm],
+        texttemplate='%{text}',
+        textfont={"size": 16}
+    ))
+    fig.update_layout(
+        title=title,
+        title_x=0.5,
+        title_font_size=20,
+        plot_bgcolor='rgb(30, 30, 30)',
+        paper_bgcolor='rgb(30, 30, 30)',
+        font=dict(color='white')
+    )
+    return fig
+def create_correlation_heatmap(df):
+    """Create a correlation heatmap using plotly"""
+    corr = df.corr(numeric_only=True)
+    fig = go.Figure(data=go.Heatmap(
+        z=corr,
+        x=corr.columns,
+        y=corr.columns,
+        colorscale='RdBu',
+        zmin=-1, zmax=1,
+        text=[[f'{val:.2f}' for val in row] for row in corr.values],
+        texttemplate='%{text}',
+        textfont={"size": 12}
+    ))
+    fig.update_layout(
+        title='S&P 500 Returns Correlation Heatmap',
+        title_x=0.5,
+        title_font_size=20,
+        plot_bgcolor='rgb(30, 30, 30)',
+        paper_bgcolor='rgb(30, 30, 30)',
+        font=dict(color='white')
+    )
+    return fig
+def create_decision_boundary_plot(X, y, model):
+    """Create an interactive decision boundary plot using plotly"""
+    # Create a mesh grid
+    x_min, x_max = X['Lag1'].min() - 1, X['Lag1'].max() + 1
+    y_min, y_max = X['Lag2'].min() - 1, X['Lag2'].max() + 1
+    # Build the axis values once and reuse them for the grid and the contour trace
+    x_vals = np.arange(x_min, x_max, 0.01)
+    y_vals = np.arange(y_min, y_max, 0.01)
+    xx, yy = np.meshgrid(x_vals, y_vals)
+    # Get predictions for the mesh grid
+    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
+    Z = Z.reshape(xx.shape)
+    # Create the plot
+    fig = go.Figure()
+    # Add the decision boundary
+    fig.add_trace(go.Contour(
+        x=x_vals,
+        y=y_vals,
+        z=Z,
+        colorscale='RdBu',
+        showscale=False,
+        opacity=0.5
+    ))
+    # Add the scatter points
+    fig.add_trace(go.Scatter(
+        x=X['Lag1'],
+        y=X['Lag2'],
+        mode='markers',
+        marker=dict(
+            color=y,
+            colorscale='RdBu',
+            size=8,
+            line=dict(color='black', width=1)
+        ),
+        name='Data Points'
+    ))
+    # Update layout
+    fig.update_layout(
+        title='Logistic Regression Decision Boundary',
+        xaxis_title='Lag1',
+        yaxis_title='Lag2',
+        plot_bgcolor='rgb(30, 30, 30)',
+        paper_bgcolor='rgb(30, 30, 30)',
+        font=dict(color='white'),
+        showlegend=False
+    )
+    return fig
+def show():
+    st.title("Week 6: Logistic Regression and Stock Market Prediction")
+    # Introduction Section
+    st.header("Course Overview")
+    st.write("""
+    This week, we'll use logistic regression to try to predict whether the stock market goes up or down.
+    This is intentionally a challenging prediction problem that will teach us important lessons about:
+    - When logistic regression works well and when it doesn't
+    - How to interpret probabilities and coefficients
+    - Why some prediction problems are inherently difficult
+    - Proper model evaluation techniques
+    """)
+    # Learning Path
+    st.subheader("Learning Path")
+    st.write("""
+    1. Understanding the Stock Market Data: S&P 500 returns and predictors
+    2. Logistic Regression Fundamentals: From linear to logistic
+    3. Model Training and Evaluation: Proper train-test splitting
+    4. Interpreting Results: Coefficients and probabilities
+    5. Model Assessment: Confusion matrices and metrics
+    6. Real-world Applications: Challenges and limitations
+    """)
+    # Module 1: Understanding the Data
+    st.header("Module 1: Understanding the Stock Market Data")
+    st.write("""
+    We'll examine the Smarket data, which consists of percentage returns for the S&P 500 stock index over 1,250 days,
+    from the beginning of 2001 until the end of 2005. For each date, we have:
+    - Percentage returns for each of the five previous trading days (Lag1 through Lag5)
+    - Volume (number of shares traded on the previous day, in billions)
+    - Today (percentage return on the date in question)
+    - Direction (whether the market was Up or Down on this date)
+    """)
+    # Load and display data
+    Smarket = load_smarket_data()
+    if Smarket is not None:
+        st.write("First few rows of the Smarket data:")
+        st.dataframe(Smarket.head())
+        # EDA Plots
+        st.subheader("Exploratory Data Analysis")
+        # Volume over time
+        st.write("**Trading Volume Over Time**")
+        fig_volume = go.Figure()
+        fig_volume.add_trace(go.Scatter(
+            x=Smarket.index,
+            y=Smarket['Volume'],
+            mode='lines',
+            name='Volume'
+        ))
+        fig_volume.update_layout(
+            title='Trading Volume Over Time',
+            xaxis_title='Time',
+            yaxis_title='Volume (billions of shares)',
+            plot_bgcolor='rgb(30, 30, 30)',
+            paper_bgcolor='rgb(30, 30, 30)',
+            font=dict(color='white')
+        )
+        st.plotly_chart(fig_volume)
+        # Returns distribution
+        st.write("**Distribution of Returns**")
+        # Add column selection
+        selected_columns = st.multiselect(
+            "Select columns to display",
+            options=['Lag1', 'Lag2', 'Lag3', 'Lag4', 'Lag5', 'Today'],
+            default=['Lag1', 'Lag2']
+        )
+        if selected_columns:
+            fig_returns = go.Figure()
+            for col in selected_columns:
+                fig_returns.add_trace(go.Histogram(
+                    x=Smarket[col],
+                    name=col,
+                    opacity=0.7,
+                    nbinsx=50  # Adjust number of bins for better visualization
+                ))
+            # Add mean and std lines
+            for col in selected_columns:
+                mean_val = Smarket[col].mean()
+                std_val = Smarket[col].std()
+                fig_returns.add_vline(
+                    x=mean_val,
+                    line_dash="dash",
+                    line_color="red",
+                    annotation_text=f"{col} Mean: {mean_val:.2f}%",
+                    annotation_position="top right",
+                    annotation=dict(
+                        textangle=-45,
+                        font=dict(size=10)
+                    )
+                )
+                fig_returns.add_vline(
+                    x=mean_val + std_val,
+                    line_dash="dot",
+                    line_color="yellow",
+                    annotation_text=f"{col} +1σ: {mean_val + std_val:.2f}%",
+                    annotation_position="top right",
+                    annotation=dict(
+                        textangle=-45,
+                        font=dict(size=10)
+                    )
+                )
+                fig_returns.add_vline(
+                    x=mean_val - std_val,
+                    line_dash="dot",
+                    line_color="yellow",
+                    annotation_text=f"{col} -1σ: {mean_val - std_val:.2f}%",
+                    annotation_position="top right",
+                    annotation=dict(
+                        textangle=-45,
+                        font=dict(size=10)
+                    )
+                )
+            fig_returns.update_layout(
+                title='Distribution of Returns',
+                xaxis_title='Return (%)',
+                yaxis_title='Frequency',
+                barmode='overlay',
+                plot_bgcolor='rgb(30, 30, 30)',
+                paper_bgcolor='rgb(30, 30, 30)',
+                font=dict(color='white'),
+                showlegend=True,
+                legend=dict(
+                    yanchor="top",
+                    y=0.99,
+                    xanchor="left",
+                    x=0.01
+                )
+            )
+            # Add summary statistics
+            st.write("**Summary Statistics**")
+            summary_stats = Smarket[selected_columns].describe()
+            st.dataframe(summary_stats.style.format('{:.2f}'))
+            st.plotly_chart(fig_returns)
+            # Add interpretation
+            st.write("""
+            **Interpretation:**
+            - The dashed red line shows the mean return for each selected period
+            - The dotted yellow lines show one standard deviation above and below the mean
+            - The overlap of distributions helps identify similarities in return patterns
+            - Wider distributions indicate higher volatility
+            """)
+        # Returns over time
+        st.write("**Returns Over Time**")
+        fig_returns_time = go.Figure()
+        fig_returns_time.add_trace(go.Scatter(
+            x=Smarket.index,
+            y=Smarket['Today'],
+            mode='lines',
+            name='Today\'s Return'
+        ))
+        fig_returns_time.update_layout(
+            title='Daily Returns Over Time',
+            xaxis_title='Time',
+            yaxis_title='Return (%)',
+            plot_bgcolor='rgb(30, 30, 30)',
+            paper_bgcolor='rgb(30, 30, 30)',
+            font=dict(color='white')
+        )
+        st.plotly_chart(fig_returns_time)
+        # Direction distribution
+        st.write("**Market Direction Distribution**")
+        direction_counts = Smarket['Direction'].value_counts()
+        fig_direction = go.Figure(data=[go.Pie(
+            labels=direction_counts.index,
+            values=direction_counts.values,
+            hole=.3
+        )])
+        fig_direction.update_layout(
+            title='Distribution of Market Direction',
+            plot_bgcolor='rgb(30, 30, 30)',
+            paper_bgcolor='rgb(30, 30, 30)',
+            font=dict(color='white')
+        )
+        st.plotly_chart(fig_direction)
+        # Show correlation heatmap
+        st.write("**Correlation Analysis**")
+        st.plotly_chart(create_correlation_heatmap(Smarket))
+        st.write("""
+        Key observations from the exploratory analysis:
+        1. **Trading Volume**:
+           - Shows an increasing trend over time
+           - Higher volatility in the later years of the sample
+           - Some periods of unusually high volume
+        2. **Returns Distribution**:
+           - Approximately normal distribution
+           - Most returns are close to zero
+           - Some extreme values (outliers)
+        3. **Market Direction**:
+           - Relatively balanced between Up and Down days
+           - Slight bias towards Up days
+        4. **Correlations**:
+           - Low correlation between lagged returns
+           - Strong correlation between Year and Volume
+           - Today's return shows little correlation with past returns
+        """)
+    # Module 2: Logistic Regression Implementation
+    st.header("Module 2: Logistic Regression Implementation")
+    st.write("""
+    We'll fit a logistic regression model to predict Direction using Lag1 through Lag5 and Volume.
+    The model will help us understand if we can predict market movements based on recent trading patterns.
+    """)
+    if Smarket is not None:
+        # Prepare data for logistic regression
+        allvars = Smarket.columns.drop(['Today', 'Direction', 'Year'])
+        design = MS(allvars)
+        X = design.fit_transform(Smarket)
+        y = Smarket.Direction == 'Up'
+        # Fit the model
+        glm = sm.GLM(y, X, family=sm.families.Binomial())
+        results = glm.fit()
+        # Display model summary
+        st.write("Model Summary:")
+        st.write(summarize(results))
+        # Show coefficients
+        st.write("Model Coefficients:")
+        coef_df = pd.DataFrame({
+            'Feature': allvars,
+            'Coefficient': results.params[1:],  # Skip the intercept
+            'P-value': results.pvalues[1:]  # Skip the intercept
+        })
+        st.write(coef_df)
+    # Module 3: Model Evaluation
+    st.header("Module 3: Model Evaluation")
+    st.write("""
+    We'll evaluate our model using proper train-test splitting, focusing on predicting 2005 data using models trained on 2001-2004 data.
+    This gives us a more realistic assessment of model performance.
+    """)
+    if Smarket is not None:
+        # Split data by year
+        train = (Smarket.Year < 2005)
+        X_train, X_test = X.loc[train], X.loc[~train]
+        y_train, y_test = y.loc[train], y.loc[~train]
+        # Fit model on training data
+        glm_train = sm.GLM(y_train, X_train, family=sm.families.Binomial())
+        results = glm_train.fit()
+        # Make predictions
+        probs = results.predict(exog=X_test)
+        labels = np.array(['Down']*len(probs))
+        labels[probs>0.5] = 'Up'
+        # Show confusion matrix
+        st.plotly_chart(create_confusion_matrix_plot(Smarket.Direction[~train], labels))
+        # Calculate and display accuracy
+        accuracy = np.mean(labels == Smarket.Direction[~train])
+        st.write(f"Test Accuracy: {accuracy:.2%}")
+    # Module 4: Decision Boundary Visualization
+    st.header("Module 4: Decision Boundary Visualization")
+    st.write("""
+    Let's visualize how our logistic regression model separates the market movements using Lag1 and Lag2 as predictors.
+    The decision boundary shows how the model classifies different combinations of the previous two days' returns.
+    """)
+    if Smarket is not None:
+        # Prepare data for decision boundary plot
+        X_plot = Smarket[['Lag1', 'Lag2']]
+        y_plot = (Smarket['Direction'] == 'Up').astype(int)
+        # Fit a simple logistic regression model for visualization
+        log_reg = LogisticRegression()
+        log_reg.fit(X_plot, y_plot)
+        # Create and display the decision boundary plot
+        st.plotly_chart(create_decision_boundary_plot(X_plot, y_plot, log_reg))
+        st.write("""
+        The decision boundary plot shows:
+        - Blue regions indicate where the model predicts the market will go down
+        - Red regions indicate where the model predicts the market will go up
+        - The boundary between these regions is where the predicted probability of 'Up' equals 0.5, so the model favors neither class
+        - The scatter points show actual market movements, colored by their true direction
+        """)
+    # Module 5: Interpreting Logistic Regression Results
+    st.header("Module 5: Interpreting Logistic Regression Results")
+    st.subheader("Understanding the Coefficients")
+    st.write("""
+    In logistic regression, coefficients tell us about the relationship between predictors and the probability of the outcome.
+    Let's break down how to interpret them:
+    1. **Coefficient Sign**:
+       - A positive coefficient means the probability of the outcome (market going up) rises as the predictor increases
+       - A negative coefficient means the probability of the market going up falls as the predictor increases
+    2. **Coefficient Magnitude**:
+       - Larger absolute values indicate stronger effects
+       - Each coefficient is the change in log-odds per one-unit increase in the predictor, so its effect on the probability is non-linear
+    """)
+    # Add visualization comparing linear and logistic regression
+    st.write("**Linear vs Logistic Regression**")
+    # Create sample data
+    x = np.linspace(-5, 5, 100)
+    y_linear = 0.5 * x + 0.5  # Linear regression
+    y_logistic = 1 / (1 + np.exp(-(2 * x)))  # Logistic regression with steeper slope
+    # Create the comparison plot
+    fig_comparison = go.Figure()
+    # Add linear regression line
+    fig_comparison.add_trace(go.Scatter(
+        x=x,
+        y=y_linear,
+        mode='lines',
+        name='Linear Regression',
+        line=dict(color='blue', width=2)
+    ))
+    # Add logistic regression curve
+    fig_comparison.add_trace(go.Scatter(
+        x=x,
+        y=y_logistic,
+        mode='lines',
+        name='Logistic Regression',
+        line=dict(color='red', width=2)
+    ))
+    # Add sample points with a clear class separation
+    np.random.seed(42)
+    x_samples = np.random.normal(0, 1, 50)
+    # Label a point as class 1 when x exceeds 0.5 (illustrative threshold; the plotted curve itself crosses p = 0.5 at x = 0)
+    y_samples = (x_samples > 0.5).astype(int)
+    fig_comparison.add_trace(go.Scatter(
+        x=x_samples,
+        y=y_samples,
+        mode='markers',
+        name='Sample Data',
+        marker=dict(
+            color=['red' if y == 0 else 'green' for y in y_samples],
+            size=8,
+            symbol='circle'
+        )
+    ))
+    # Update layout
+    fig_comparison.update_layout(
+        title='Linear vs Logistic Regression',
+        xaxis_title='Input Feature (X)',
+        yaxis_title='Output',
+        plot_bgcolor='rgb(30, 30, 30)',
+        paper_bgcolor='rgb(30, 30, 30)',
+        font=dict(color='white'),
+        showlegend=True,
+        legend=dict(
+            yanchor="top",
+            y=0.99,
+            xanchor="left",
+            x=0.01
+        ),
+        yaxis=dict(
+            range=[-0.1, 1.1]  # Extend y-axis range slightly
+        )
+    )
+    # Add annotations
+    fig_comparison.add_annotation(
+        x=0.6, y=0.8,  # Point on the line: 0.5 * 0.6 + 0.5 = 0.8
+        text="Linear Regression<br>predicts continuous values",
+        showarrow=True,
+        arrowhead=1,
+        ax=50, ay=-30,
+        font=dict(color='white', size=10)
+    )
+    fig_comparison.add_annotation(
+        x=-0.5, y=0.27,  # Point on the curve: 1 / (1 + e^1) ≈ 0.27
+        text="Logistic Regression<br>predicts probabilities<br>(S-shaped curve)",
+        showarrow=True,
+        arrowhead=1,
+        ax=50, ay=30,
+        font=dict(color='white', size=10)
+    )
+    # Add decision boundary annotation
+    fig_comparison.add_annotation(
+        x=0, y=0.5,
+        text="Decision Boundary<br>(p = 0.5)",
+        showarrow=True,
+        arrowhead=1,
+        ax=0, ay=-40,
+        font=dict(color='white', size=10)
+    )
+    st.plotly_chart(fig_comparison)
+    st.write("""
+    **Key Differences:**
+    1. **Output Range**:
+       - Linear Regression: Can predict any value (-∞ to +∞)
+       - Logistic Regression: Predicts probabilities (0 to 1)
+    2. **Function Shape**:
+       - Linear Regression: Straight line
+       - Logistic Regression: S-shaped curve (sigmoid)
+       - The sigmoid changes fastest around the decision boundary (p = 0.5) and flattens toward 0 and 1
+    3. **Use Case**:
+       - Linear Regression: Predicting continuous values
+       - Logistic Regression: Predicting binary outcomes (Up/Down)
+    4. **Interpretation**:
+       - Linear Regression: Direct relationship between X and Y
+       - Logistic Regression: Non-linear relationship between X and probability of Y
+       - The same change in X shifts the probability most near the decision boundary and least in the tails
+    """)
+    if Smarket is not None:
+        # Calculate and display coefficients
+        st.subheader("Example: Interpreting Our Model's Coefficients")
+        # Get coefficients from the model (at this point `results` is the 2001-2004 training fit from Module 3)
+        coef_results = pd.DataFrame({
+            'Feature': allvars,
+            'Coefficient': results.params[1:],
+            'P-value': results.pvalues[1:]
+        })
+        st.write("Coefficient Analysis:")
+        st.dataframe(coef_results.style.format({
+            'Coefficient': '{:.4f}',
+            'P-value': '{:.4f}'
+        }))
+        st.write("""
+        Let's interpret some examples from our model (check the signs in the table above):
+        1. **Lag1 Coefficient**:
+           - If it is positive, a higher Lag1 is associated with a higher probability of the market going up; if it is negative, with a lower probability
+           - The magnitude tells us how strong this relationship is
+        2. **Volume Coefficient**:
+           - If it is positive, higher trading volume is associated with a higher probability of upward market movement; if it is negative, with a lower probability
+           - The size of the coefficient indicates the strength of this relationship
+        """)
+    st.subheader("Understanding Model Performance")
+    st.write("""
+    Our model's performance metrics tell us important information:
+    1. **Accuracy**:
+       - The proportion of correct predictions
+       - In our case, see the Test Accuracy reported in Module 3; on the 2005 test set it is close to 50%
+       - This is about what random guessing would achieve (50%)
+    2. **Confusion Matrix**:
+       The confusion matrix is a 2x2 table that shows:
+       - **True Positives (TP)**:
+         - Correctly predicted market going up
+         - These are the cases where we predicted 'Up' and the market actually went up
+       - **False Positives (FP)**:
+         - Incorrectly predicted market going up
+         - These are the cases where we predicted 'Up' but the market actually went down
+         - Also known as Type I errors
+       - **True Negatives (TN)**:
+         - Correctly predicted market going down
+         - These are the cases where we predicted 'Down' and the market actually went down
+       - **False Negatives (FN)**:
+         - Incorrectly predicted market going down
+         - These are the cases where we predicted 'Down' but the market actually went up
+         - Also known as Type II errors
+       From these values, we can calculate important metrics:
+       - **Precision** = TP / (TP + FP): How many of our 'Up' predictions were correct
+       - **Recall** = TP / (TP + FN): How many of the actual 'Up' days did we catch
+       - **F1 Score** = 2 * (Precision * Recall) / (Precision + Recall): Balanced measure of precision and recall
+       - **Accuracy** = (TP + TN) / (TP + TN + FP + FN): Overall correct predictions
+    3. **P-values**:
+       - Indicate statistical significance of each predictor
+       - P-value < 0.05 is the conventional threshold for calling a predictor statistically significant
+       - In our case, most predictors are not statistically significant
+    """)
+    st.subheader("Practical Implications")
+    st.write("""
+    What does this mean for real-world trading?
+    1. **Model Limitations**:
+       - The model's test accuracy is close to random guessing
+       - This suggests that predicting market direction is inherently difficult
+       - Past returns alone are not reliable predictors
+    2. **Risk Management**:
+       - Even with a model, trading decisions should include:
+         - Stop-loss orders
+         - Position sizing
+         - Diversification
+         - Risk tolerance considerations
+    3. **Model Improvement**:
+       - Consider adding more features:
+         - Technical indicators
+         - Market sentiment
+         - Economic indicators
+       - Use more sophisticated models:
+         - Ensemble methods
+         - Deep learning
+         - Time series models
+    """)
+    st.subheader("Example: Making a Prediction")
+    st.write("""
+    Let's walk through an example of making a prediction:
+    1. **Input Data**:
+       - Lag1 = 1.2% (yesterday's return)
+       - Lag2 = -0.8% (day before yesterday's return)
+       - Volume = 1.1 billion shares
+    2. **Calculate Probability**:
+       - Use the logistic function: P(Y=1) = 1 / (1 + e^(-z))
+       - where z = β₀ + β₁(Lag1) + β₂(Lag2) + ... + β₆(Volume)
+    3. **Interpret Result**:
+       - If P(Y=1) > 0.5, predict market will go up
+       - If P(Y=1) < 0.5, predict market will go down
+       - The probability itself tells us about confidence
+    """)
+    if Smarket is not None:
+        # Example prediction
+        st.write("**Interactive Example:**")
+        col1, col2, col3 = st.columns(3)
+        with col1:
+            lag1 = st.number_input("Lag1 (%)", value=1.2, step=0.1)
+        with col2:
+            lag2 = st.number_input("Lag2 (%)", value=-0.8, step=0.1)
+        with col3:
+            volume = st.number_input("Volume (billions)", value=1.1, step=0.1)
+        # Make prediction (Lag3 through Lag5 are held at 0 because only three inputs are collected)
+        X_example = pd.DataFrame({
+            'Lag1': [lag1],
+            'Lag2': [lag2],
+            'Lag3': [0],
+            'Lag4': [0],
+            'Lag5': [0],
+            'Volume': [volume]
+        })
+        # Transform using the same design matrix
+        X_example = design.transform(X_example)
+        prob = results.predict(X_example)[0]
+        st.write(f"""
+        **Prediction Results:**
+        - Probability of market going up: {prob:.2%}
+        - Predicted direction: {'Up' if prob > 0.5 else 'Down'}
+        - Distance from a 50/50 call (0% = coin flip, 100% = certain): {abs(prob - 0.5)*2:.2%}
+        """)
+    # Practice Exercises
+    st.header("Practice Exercises")
+    with st.expander("Exercise 1: Implementing Logistic Regression with Lag1 and Lag2"):
+        st.write("""
+        1. Implement a logistic regression model using only Lag1 and Lag2
+        2. Compare its performance with the full model
+        3. Analyze the coefficients and their significance
+        4. Visualize the results
+        """)
+        st.code("""
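+        # Solution (assumes Smarket, MS, sm and np are already loaded as in the lab)
+        model = MS(['Lag1', 'Lag2']).fit(Smarket)
+        X = model.transform(Smarket)
+        y = Smarket.Direction == 'Up'
+        # Train on 2001-2004, test on 2005
+        train = (Smarket.Year < 2005)
+        X_train, X_test = X.loc[train], X.loc[~train]
+        y_train = y.loc[train]
+        glm_train = sm.GLM(y_train, X_train, family=sm.families.Binomial())
+        results = glm_train.fit()
+        # Classify as 'Up' when the predicted probability exceeds 0.5
+        probs = results.predict(exog=X_test)
+        labels = np.array(['Down']*len(probs))
+        labels[probs>0.5] = 'Up'
+        # Evaluate performance
+        accuracy = np.mean(labels == Smarket.Direction[~train])
+        print(f"Test Accuracy: {accuracy:.2%}")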
+        """)
+    with st.expander("Exercise 2: Making Predictions for New Data"):
+        st.write("""
+        1. Create a function to make predictions for new market conditions
+        2. Test the model with specific Lag1 and Lag2 values
+        3. Interpret the predicted probabilities
+        4. Discuss the model's limitations
+        """)
+        st.code("""
+        # Solution (reuses `model` and `results` from Exercise 1)
+        def predict_market_direction(lag1, lag2):
+            newdata = pd.DataFrame({'Lag1': [lag1], 'Lag2': [lag2]})
+            newX = model.transform(newdata)
+            prob = results.predict(newX)[0]
+            return prob
+        # Example predictions
+        prob1 = predict_market_direction(1.2, 1.1)
+        prob2 = predict_market_direction(1.5, -0.8)
+        print(f"Probability of market going up for Lag1=1.2, Lag2=1.1: {prob1:.2%}")
+        print(f"Probability of market going up for Lag1=1.5, Lag2=-0.8: {prob2:.2%}")
+        """)
+    # Weekly Assignment
+    username = st.session_state.get("username", "Student")
+    st.header(f"{username}'s Weekly Assignment")
+    if username == "manxiii":
+        st.markdown("""
+        Hello **manxiii**, here is your Assignment 6: Stock Market Prediction with Logistic Regression.
+        1. Implement a logistic regression model using Lag1 and Lag2
+        2. Compare its performance with the full model
+        3. Analyze the coefficients and their significance
+        4. Create visualizations to support your findings
+        5. Write a brief report on why stock market prediction is challenging
+        **Due Date:** End of Week 6
+        """)
+    elif username == "zhu":
+        st.markdown("""
+        Hello **zhu**, here is your Assignment 6: Stock Market Prediction with Logistic Regression.
+        """)
+    elif username == "WK":
+        st.markdown("""
+        Hello **WK**, here is your Assignment 6: Stock Market Prediction with Logistic Regression.
+        """)
+    else:
+        st.markdown(f"""
+        Hello **{username}**, here is your Assignment 6: Stock Market Prediction with Logistic Regression.
+        Please contact the instructor for your specific assignment.
+        """)
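
The precision, recall, F1, and accuracy formulas quoted in the "Understanding Model Performance" text of week_6.py can be checked numerically. The following is a minimal sketch, not part of the commit: the helper name `classification_metrics` and the four counts are hypothetical and chosen only for illustration, not taken from the Smarket results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute precision, recall, F1 and accuracy from confusion-matrix counts."""
    # Guard against empty denominators so the helper never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Hypothetical counts (assumption): 'Up' is treated as the positive class.
metrics = classification_metrics(tp=40, fp=20, tn=30, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a model whose accuracy sits near 50%, precision and recall can still differ a lot from each other, which is why the page lists them separately from accuracy.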

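The "Calculate Probability" step in the "Example: Making a Prediction" text of week_6.py, P(Y=1) = 1 / (1 + e^(-z)), can be worked through by hand. The sketch below is an illustration under stated assumptions: the intercept and coefficients are hypothetical placeholders, not the values fitted in this commit, and `logistic_probability` is a helper name introduced here.

```python
import math


def logistic_probability(intercept, coefs, values):
    """Return P(Y=1) for a logistic model given coefficients and inputs."""
    # Linear predictor: z = b0 + sum(b_i * x_i)
    z = intercept + sum(coefs[name] * values[name] for name in coefs)
    # Logistic (sigmoid) function maps z to a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients (assumption, not fitted values).
intercept = 0.1
coefs = {"Lag1": -0.05, "Lag2": -0.04}
# Inputs taken from the page's worked example: Lag1 = 1.2%, Lag2 = -0.8%.
values = {"Lag1": 1.2, "Lag2": -0.8}

prob = logistic_probability(intercept, coefs, values)
direction = "Up" if prob > 0.5 else "Down"

# exp(coefficient) is the factor by which the odds of 'Up' are multiplied
# for a one-unit increase in that predictor.
odds_ratio_lag1 = math.exp(coefs["Lag1"])

print(f"P(Up) = {prob:.4f}, predicted direction: {direction}")
print(f"Odds ratio for Lag1: {odds_ratio_lag1:.4f}")
```

Reading a coefficient through its odds ratio makes the sign rule concrete: a negative coefficient gives an odds ratio below 1, so each extra unit of the predictor lowers the odds of an 'Up' day.
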
requirements.txt CHANGED Viewed

@@ -6,4 +6,5 @@ matplotlib==3.8.3
 seaborn==0.13.2
 plotly==5.18.0
 nltk==3.8.1
-wordcloud==1.9.3
+wordcloud==1.9.3
+ISLP