# -*- coding: utf-8 -*-
"""W6_Logistic_regression_lab

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1MG7N2HN-Nxow9fzvc0fzxvp3WyKqtgs8
# 🚀 Logistic Regression Lab: Stock Market Prediction

## Lab Overview

In this lab, we'll use logistic regression to try predicting whether the stock market goes up or down. Spoiler alert: this is intentionally a challenging prediction problem that will teach us important lessons about when logistic regression works well and when it doesn't.

## Learning Goals:
- Apply logistic regression to real data
- Interpret probabilities and coefficients
- Understand why some prediction problems are inherently difficult
- Learn proper model evaluation techniques

## The Stock Market Data

In this lab we will examine the `Smarket`
data, which is part of the `ISLP`
library. This data set consists of percentage returns for the S&P 500
stock index over 1,250 days, from the beginning of 2001 until the end
of 2005. For each date, we have recorded the percentage returns for
each of the five previous trading days, `Lag1` through
`Lag5`. We have also recorded `Volume` (the number of
shares traded on the previous day, in billions), `Today` (the
percentage return on the date in question) and `Direction`
(whether the market was `Up` or `Down` on this date).

### Your Challenge

**Question**: Can we predict if the S&P 500 will go up or down based on recent trading patterns?

**Why This Matters:** If the market were predictable, a model like this would be incredibly valuable. If it is not, we learn something about market efficiency and about setting realistic expectations for prediction models.

To answer the question, **we start by importing our libraries at the top level; these are all imports we have seen in previous labs.**
"""
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize)
"""We also collect together the new imports needed for this lab.""" | |
from ISLP import confusion_table | |
from ISLP.models import contrast | |
from sklearn.discriminant_analysis import \ | |
(LinearDiscriminantAnalysis as LDA, | |
QuadraticDiscriminantAnalysis as QDA) | |
from sklearn.naive_bayes import GaussianNB | |
from sklearn.neighbors import KNeighborsClassifier | |
from sklearn.preprocessing import StandardScaler | |
from sklearn.model_selection import train_test_split | |
from sklearn.linear_model import LogisticRegression | |
"""Now we are ready to load the `Smarket` data.""" | |
Smarket = load_data('Smarket') | |
Smarket | |
"""This gives a truncated listing of the data. | |
We can see what the variable names are. | |
""" | |
Smarket.columns | |
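"""As a quick sanity check (an extra step, not in the original lab), we can
confirm the dimensions match the description above: 1,250 trading days and
9 variables.
"""

# Expect (1250, 9): 1,250 days; Year, Lag1-Lag5, Volume, Today, Direction
Smarket.shape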
"""We compute the correlation matrix using the `corr()` method | |
for data frames, which produces a matrix that contains all of | |
the pairwise correlations among the variables. | |
By instructing `pandas` to use only numeric variables, the `corr()` method does not report a correlation for the `Direction` variable because it is | |
qualitative. | |
 | |
""" | |
Smarket.corr(numeric_only=True) | |
"""As one would expect, the correlations between the lagged return variables and | |
today’s return are close to zero. The only substantial correlation is between `Year` and | |
`Volume`. By plotting the data we see that `Volume` | |
is increasing over time. In other words, the average number of shares traded | |
daily increased from 2001 to 2005. | |
""" | |
Smarket.plot(y='Volume'); | |
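"""We can also confirm this trend numerically with a quick aggregation (an extra
step, not in the original lab): the average daily `Volume` within each `Year`.
"""

# Mean daily trading volume (in billions of shares) for each year;
# the values should increase from 2001 to 2005
Smarket.groupby('Year')['Volume'].mean()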
"""## Logistic Regression | |
Next, we will fit a logistic regression model in order to predict | |
`Direction` using `Lag1` through `Lag5` and | |
`Volume`. The `sm.GLM()` function fits *generalized linear models*, a class of | |
models that includes logistic regression. Alternatively, | |
the function `sm.Logit()` fits a logistic regression | |
model directly. The syntax of | |
`sm.GLM()` is similar to that of `sm.OLS()`, except | |
that we must pass in the argument `family=sm.families.Binomial()` | |
in order to tell `statsmodels` to run a logistic regression rather than some other | |
type of generalized linear model. | |
""" | |
allvars = Smarket.columns.drop(['Today', 'Direction', 'Year']) | |
design = MS(allvars) | |
X = design.fit_transform(Smarket) | |
y = Smarket.Direction == 'Up' | |
glm = sm.GLM(y, | |
X, | |
family=sm.families.Binomial()) | |
results = glm.fit() | |
summarize(results) | |
"""The smallest *p*-value here is associated with `Lag1`. The | |
negative coefficient for this predictor suggests that if the market | |
had a positive return yesterday, then it is less likely to go up | |
today. However, at a value of 0.15, the *p*-value is still | |
relatively large, and so there is no clear evidence of a real | |
association between `Lag1` and `Direction`. | |
We use the `params` attribute of `results` | |
in order to access just the | |
coefficients for this fitted model. | |
""" | |
results.params | |
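"""Because logistic regression models the log-odds, exponentiating a coefficient
gives an odds ratio: the multiplicative change in the odds of the market going
up for a one-unit increase in that predictor. This is an extra interpretive
step, not part of the original lab.
"""

# Odds ratios: values below 1 (e.g., for Lag1) mean the odds of "Up"
# shrink as the predictor increases
np.exp(results.params)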
"""Likewise we can use the | |
`pvalues` attribute to access the *p*-values for the coefficients. | |
""" | |
results.pvalues | |
"""The `predict()` method of `results` can be used to predict the | |
probability that the market will go up, given values of the | |
predictors. This method returns predictions | |
on the probability scale. If no data set is supplied to the `predict()` | |
function, then the probabilities are computed for the training data | |
that was used to fit the logistic regression model. | |
As with linear regression, one can pass an optional `exog` argument consistent | |
with a design matrix if desired. Here we have | |
printed only the first ten probabilities. | |
""" | |
probs = results.predict() | |
probs[:10] | |
"""In order to make a prediction as to whether the market will go up or | |
down on a particular day, we must convert these predicted | |
probabilities into class labels, `Up` or `Down`. The | |
following two commands create a vector of class predictions based on | |
whether the predicted probability of a market increase is greater than | |
or less than 0.5. | |
""" | |
labels = np.array(['Down']*1250) | |
labels[probs>0.5] = "Up" | |
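"""An equivalent one-liner (a stylistic alternative, not part of the original
lab) uses `np.where()` to apply the 0.5 threshold in a single vectorized step.
"""

# Same result as the two commands above: 'Up' where the predicted
# probability exceeds 0.5, 'Down' otherwise
labels_alt = np.where(probs > 0.5, 'Up', 'Down')
np.all(labels_alt == labels)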
"""The `confusion_table()` | |
function from the `ISLP` package summarizes these predictions, showing how | |
many observations were correctly or incorrectly classified. Our function, which is adapted from a similar function | |
in the module `sklearn.metrics`, transposes the resulting | |
matrix and includes row and column labels. | |
The `confusion_table()` function takes as first argument the | |
predicted labels, and second argument the true labels. | |
""" | |
confusion_table(labels, Smarket.Direction) | |
"""The diagonal elements of the confusion matrix indicate correct | |
predictions, while the off-diagonals represent incorrect | |
predictions. Hence our model correctly predicted that the market would | |
go up on 507 days and that it would go down on 145 days, for a | |
total of 507 + 145 = 652 correct predictions. The `np.mean()` | |
function can be used to compute the fraction of days for which the | |
prediction was correct. In this case, logistic regression correctly | |
predicted the movement of the market 52.2% of the time. | |
""" | |
(507+145)/1250, np.mean(labels == Smarket.Direction) | |
"""At first glance, it appears that the logistic regression model is | |
working a little better than random guessing. However, this result is | |
misleading because we trained and tested the model on the same set of | |
1,250 observations. In other words, $100-52.2=47.8%$ is the | |
*training* error rate. As we have seen | |
previously, the training error rate is often overly optimistic --- it | |
tends to underestimate the test error rate. In | |
order to better assess the accuracy of the logistic regression model | |
in this setting, we can fit the model using part of the data, and | |
then examine how well it predicts the *held out* data. This | |
will yield a more realistic error rate, in the sense that in practice | |
we will be interested in our model’s performance not on the data that | |
we used to fit the model, but rather on days in the future for which | |
the market’s movements are unknown. | |
To implement this strategy, we first create a Boolean vector | |
corresponding to the observations from 2001 through 2004. We then | |
use this vector to create a held out data set of observations from | |
2005. | |
""" | |
train = (Smarket.Year < 2005)
Smarket_train = Smarket.loc[train]
Smarket_test = Smarket.loc[~train]
Smarket_test.shape
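"""To make the split explicit (an extra check, not in the original lab), we can
count how many observations fall on each side of the Boolean mask.
"""

# 998 training days (2001-2004) and 252 test days (2005), totaling 1,250
train.sum(), (~train).sum()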
"""The object `train` is a vector of 1,250 elements, corresponding | |
to the observations in our data set. The elements of the vector that | |
correspond to observations that occurred before 2005 are set to | |
`True`, whereas those that correspond to observations in 2005 are | |
set to `False`. Hence `train` is a | |
*boolean* array, since its | |
elements are `True` and `False`. Boolean arrays can be used | |
to obtain a subset of the rows or columns of a data frame | |
using the `loc` method. For instance, | |
the command `Smarket.loc[train]` would pick out a submatrix of the | |
stock market data set, corresponding only to the dates before 2005, | |
since those are the ones for which the elements of `train` are | |
`True`. The `~` symbol can be used to negate all of the | |
elements of a Boolean vector. That is, `~train` is a vector | |
similar to `train`, except that the elements that are `True` | |
in `train` get swapped to `False` in `~train`, and vice versa. | |
Therefore, `Smarket.loc[~train]` yields a | |
subset of the rows of the data frame | |
of the stock market data containing only the observations for which | |
`train` is `False`. | |
The output above indicates that there are 252 such | |
observations. | |
We now fit a logistic regression model using only the subset of the | |
observations that correspond to dates before 2005. We then obtain predicted probabilities of the | |
stock market going up for each of the days in our test set --- that is, | |
for the days in 2005. | |
""" | |
X_train, X_test = X.loc[train], X.loc[~train]
y_train, y_test = y.loc[train], y.loc[~train]
glm_train = sm.GLM(y_train,
                   X_train,
                   family=sm.families.Binomial())
results = glm_train.fit()
probs = results.predict(exog=X_test)
"""Notice that we have trained and tested our model on two completely | |
separate data sets: training was performed using only the dates before | |
2005, and testing was performed using only the dates in 2005. | |
Finally, we compare the predictions for 2005 to the | |
actual movements of the market over that time period. | |
We will first store the test and training labels (recall `y_test` is binary). | |
""" | |
D = Smarket.Direction | |
L_train, L_test = D.loc[train], D.loc[~train] | |
"""Now we threshold the | |
fitted probability at 50% to form | |
our predicted labels. | |
""" | |
labels = np.array(['Down']*252) | |
labels[probs>0.5] = 'Up' | |
confusion_table(labels, L_test) | |
"""The test accuracy is about 48% while the error rate is about 52%""" | |
np.mean(labels == L_test), np.mean(labels != L_test) | |
"""The `!=` notation means *not equal to*, and so the last command | |
computes the test set error rate. The results are rather | |
disappointing: the test error rate is 52%, which is worse than | |
random guessing! Of course this result is not all that surprising, | |
given that one would not generally expect to be able to use previous | |
days’ returns to predict future market performance. (After all, if it | |
were possible to do so, then the authors of this book would be out | |
striking it rich rather than writing a statistics textbook.) | |
We recall that the logistic regression model had very underwhelming | |
*p*-values associated with all of the predictors, and that the | |
smallest *p*-value, though not very small, corresponded to | |
`Lag1`. Perhaps by removing the variables that appear not to be | |
helpful in predicting `Direction`, we can obtain a more | |
effective model. After all, using predictors that have no relationship | |
with the response tends to cause a deterioration in the test error | |
rate (since such predictors cause an increase in variance without a | |
corresponding decrease in bias), and so removing such predictors may | |
in turn yield an improvement. Below we refit the logistic | |
regression using just `Lag1` and `Lag2`, which seemed to | |
have the highest predictive power in the original logistic regression | |
model. | |
""" | |
model = MS(['Lag1', 'Lag2']).fit(Smarket)
X = model.transform(Smarket)
X_train, X_test = X.loc[train], X.loc[~train]
glm_train = sm.GLM(y_train,
                   X_train,
                   family=sm.families.Binomial())
results = glm_train.fit()
probs = results.predict(exog=X_test)
labels = np.array(['Down']*252)
labels[probs>0.5] = 'Up'
confusion_table(labels, L_test)
"""Let’s evaluate the overall accuracy as well as the accuracy within the days when | |
logistic regression predicts an increase. | |
""" | |
(35+106)/252,106/(106+76) | |
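"""Rather than hard-coding the cell counts, we can compute the same quantities
directly from the predictions (an extra step, not in the original lab), along
with the naive always-`Up` baseline discussed next.
"""

# Overall test accuracy, precision on predicted-Up days, and the accuracy
# of simply predicting 'Up' every day in 2005
(np.mean(labels == L_test),
 np.mean(L_test[labels == 'Up'] == 'Up'),
 np.mean(L_test == 'Up'))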
"""Now the results appear to be a little better: 56% of the daily | |
movements have been correctly predicted. It is worth noting that in | |
this case, a much simpler strategy of predicting that the market will | |
increase every day will also be correct 56% of the time! Hence, in | |
terms of overall error rate, the logistic regression method is no | |
better than the naive approach. However, the confusion matrix | |
shows that on days when logistic regression predicts an increase in | |
the market, it has a 58% accuracy rate. This suggests a possible | |
trading strategy of buying on days when the model predicts an | |
increasing market, and avoiding trades on days when a decrease is | |
predicted. Of course one would need to investigate more carefully | |
whether this small improvement was real or just due to random chance. | |
Suppose that we want to predict the returns associated with particular | |
values of `Lag1` and `Lag2`. In particular, we want to | |
predict `Direction` on a day when `Lag1` and | |
`Lag2` equal $1.2$ and $1.1$, respectively, and on a day when they | |
equal $1.5$ and $-0.8$. We do this using the `predict()` | |
function. | |
""" | |
newdata = pd.DataFrame({'Lag1':[1.2, 1.5],
                        'Lag2':[1.1, -0.8]});
newX = model.transform(newdata)
results.predict(newX)
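"""To turn these two probabilities into `Up`/`Down` predictions, we can apply
the same 0.5 threshold as before (an extra step, not in the original lab).
"""

# Predicted direction for each of the two hypothetical days
np.where(results.predict(newX) > 0.5, 'Up', 'Down')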
"""## An Alternative Fit with scikit-learn

We now repeat the `Lag1`/`Lag2` logistic regression using scikit-learn's
`LogisticRegression`, this time with a random 70/30 train/test split, and
visualize the fitted decision boundary. Only the imports not already loaded
above are needed here.
"""

import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix
# Load the dataset
data = load_data('Smarket')

# Display the first few rows of the dataset
print(data.head())

# Prepare the data for logistic regression:
# use 'Lag1' and 'Lag2' as predictors and 'Direction' as the response
data['Direction'] = data['Direction'].map({'Up': 1, 'Down': 0})
X = data[['Lag1', 'Lag2']]
y = data['Direction']

# Split the data into training and testing sets (random 70/30 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit the logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred = log_reg.predict(X_test)

# Print the classification report and confusion matrix
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
# Visualize the decision boundary
plt.figure(figsize=(10, 6))

# Create a mesh grid covering the range of the two predictors
x_min, x_max = X['Lag1'].min() - 1, X['Lag1'].max() + 1
y_min, y_max = X['Lag2'].min() - 1, X['Lag2'].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                     np.arange(y_min, y_max, 0.01))

# Predict the class for every point on the grid; wrapping the grid in a
# DataFrame with the original column names avoids a feature-name warning
grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=['Lag1', 'Lag2'])
Z = log_reg.predict(grid).reshape(xx.shape)

# Plot the decision regions and overlay the test observations
plt.contourf(xx, yy, Z, alpha=0.8)
plt.scatter(X_test['Lag1'], X_test['Lag2'], c=y_test, edgecolor='k', s=20)
plt.xlabel('Lag1')
plt.ylabel('Lag2')
plt.title('Logistic Regression Decision Boundary')
plt.show()
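"""Because the `Smarket` rows are ordered in time, a random split lets the model
train on days that come *after* some of its test days. Below is a minimal
sketch (an extra comparison, not part of the original lab) of a chronological
evaluation with scikit-learn, mirroring the 2001--2004 / 2005 split used
earlier. Note that scikit-learn applies L2 regularization by default, so the
result may differ slightly from the statsmodels fit.
"""

# Chronological split: train on 2001-2004, test on 2005, as in the
# statsmodels analysis above
train_mask = data['Year'] < 2005
X_tr, X_te = X[train_mask], X[~train_mask]
y_tr, y_te = y[train_mask], y[~train_mask]

clf = LogisticRegression()
clf.fit(X_tr, y_tr)

# Test accuracy on the truly held-out 2005 days; we would expect a figure
# close to the ~56% found with the statsmodels fit above
clf.score(X_te, y_te)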