# -*- coding: utf-8 -*-
"""W6_Logistic_regression_lab
Automatically generated by Colab.
Original file is located at
https://colab.research.google.com/drive/1MG7N2HN-Nxow9fzvc0fzxvp3WyKqtgs8
# 🚀 Logistic Regression Lab: Stock Market Prediction
## Lab Overview
In this lab, we'll use logistic regression to try to predict whether the stock market goes up or down. Spoiler alert: this is intentionally a challenging prediction problem, and it will teach us important lessons about when logistic regression works well and when it doesn't.
## Learning Goals:
- Apply logistic regression to real data
- Interpret probabilities and coefficients
- Understand why some prediction problems are inherently difficult
- Learn proper model evaluation techniques
## The Stock Market Data
In this lab we will examine the `Smarket`
data, which is part of the `ISLP`
library. This data set consists of percentage returns for the S&P 500
stock index over 1,250 days, from the beginning of 2001 until the end
of 2005. For each date, we have recorded the percentage returns for
each of the five previous trading days, `Lag1` through
`Lag5`. We have also recorded `Volume` (the number of
shares traded on the previous day, in billions), `Today` (the
percentage return on the date in question) and `Direction`
(whether the market was `Up` or `Down` on this date).
### Your Challenge
**Question**: Can we predict if the S&P 500 will go up or down based on recent trading patterns?
**Why This Matters:** If the market were predictable, this would be incredibly valuable. If it is not, we learn about market efficiency and about realistic expectations for prediction models.
To answer this question, **we start by importing our libraries at the top level; these are all imports we have seen in previous labs.**
"""
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
import statsmodels.api as sm
from ISLP import load_data
from ISLP.models import (ModelSpec as MS,
                         summarize)
"""We also collect together the new imports needed for this lab."""
from ISLP import confusion_table
from ISLP.models import contrast
from sklearn.discriminant_analysis import \
    (LinearDiscriminantAnalysis as LDA,
     QuadraticDiscriminantAnalysis as QDA)
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
"""Now we are ready to load the `Smarket` data."""
Smarket = load_data('Smarket')
Smarket
"""This gives a truncated listing of the data.
We can see what the variable names are.
"""
Smarket.columns
"""We compute the correlation matrix using the `corr()` method
for data frames, which produces a matrix that contains all of
the pairwise correlations among the variables.
By instructing `pandas` to use only numeric variables, the `corr()` method does not report a correlation for the `Direction` variable because it is
qualitative.
"""
Smarket.corr(numeric_only=True)
"""As one would expect, the correlations between the lagged return variables and
today’s return are close to zero. The only substantial correlation is between `Year` and
`Volume`. By plotting the data we see that `Volume`
is increasing over time. In other words, the average number of shares traded
daily increased from 2001 to 2005.
"""
Smarket.plot(y='Volume');
"""## Logistic Regression
Next, we will fit a logistic regression model in order to predict
`Direction` using `Lag1` through `Lag5` and
`Volume`. The `sm.GLM()` function fits *generalized linear models*, a class of
models that includes logistic regression. Alternatively,
the function `sm.Logit()` fits a logistic regression
model directly. The syntax of
`sm.GLM()` is similar to that of `sm.OLS()`, except
that we must pass in the argument `family=sm.families.Binomial()`
in order to tell `statsmodels` to run a logistic regression rather than some other
type of generalized linear model.
"""
allvars = Smarket.columns.drop(['Today', 'Direction', 'Year'])
design = MS(allvars)
X = design.fit_transform(Smarket)
y = Smarket.Direction == 'Up'
glm = sm.GLM(y,
             X,
             family=sm.families.Binomial())
results = glm.fit()
summarize(results)
"""The smallest *p*-value here is associated with `Lag1`. The
negative coefficient for this predictor suggests that if the market
had a positive return yesterday, then it is less likely to go up
today. However, at a value of 0.15, the *p*-value is still
relatively large, and so there is no clear evidence of a real
association between `Lag1` and `Direction`.
We use the `params` attribute of `results`
in order to access just the
coefficients for this fitted model.
"""
results.params
"""Likewise we can use the
`pvalues` attribute to access the *p*-values for the coefficients.
"""
results.pvalues
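"""Since logistic regression is the model we are fitting, we could equivalently have used `sm.Logit()`, which fits it directly. A minimal check (assuming the `X` and `y` objects defined above; the coefficient estimates should match those from `sm.GLM()`):"""
logit_results = sm.Logit(y.astype(int), X).fit()
logit_results.params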
"""The `predict()` method of `results` can be used to predict the
probability that the market will go up, given values of the
predictors. This method returns predictions
on the probability scale. If no data set is supplied to the `predict()`
function, then the probabilities are computed for the training data
that was used to fit the logistic regression model.
As with linear regression, one can pass an optional `exog` argument consistent
with a design matrix if desired. Here we have
printed only the first ten probabilities.
"""
probs = results.predict()
probs[:10]
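"""To connect the probability scale back to the linear (log-odds) scale, note that the logit of each fitted probability equals the linear predictor $X\beta$. A quick check, using the `probs`, `X`, and `results` objects from above:"""
np.allclose(np.log(probs / (1 - probs)), X @ results.params)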
"""In order to make a prediction as to whether the market will go up or
down on a particular day, we must convert these predicted
probabilities into class labels, `Up` or `Down`. The
following two commands create a vector of class predictions based on
whether the predicted probability of a market increase is greater than
or less than 0.5.
"""
labels = np.array(['Down']*1250)
labels[probs>0.5] = "Up"
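"""An equivalent one-liner uses `np.where()`, which selects `'Up'` or `'Down'` elementwise according to the condition; it produces the same vector as the two commands above."""
np.all(np.where(probs > 0.5, 'Up', 'Down') == labels)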
"""The `confusion_table()`
function from the `ISLP` package summarizes these predictions, showing how
many observations were correctly or incorrectly classified. Our function, which is adapted from a similar function
in the module `sklearn.metrics`, transposes the resulting
matrix and includes row and column labels.
The `confusion_table()` function takes as first argument the
predicted labels, and second argument the true labels.
"""
confusion_table(labels, Smarket.Direction)
"""The diagonal elements of the confusion matrix indicate correct
predictions, while the off-diagonals represent incorrect
predictions. Hence our model correctly predicted that the market would
go up on 507 days and that it would go down on 145 days, for a
total of 507 + 145 = 652 correct predictions. The `np.mean()`
function can be used to compute the fraction of days for which the
prediction was correct. In this case, logistic regression correctly
predicted the movement of the market 52.2% of the time.
"""
(507+145)/1250, np.mean(labels == Smarket.Direction)
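"""Rather than hard-coding the counts 507 and 145, we can compute the accuracy directly from the confusion table; a small sketch, assuming the labelled data frame layout that `confusion_table()` returns above:"""
ct = confusion_table(labels, Smarket.Direction)
np.diag(ct).sum() / ct.to_numpy().sum()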
"""At first glance, it appears that the logistic regression model is
working a little better than random guessing. However, this result is
misleading because we trained and tested the model on the same set of
1,250 observations. In other words, $100\% - 52.2\% = 47.8\%$ is the
*training* error rate. As we have seen
previously, the training error rate is often overly optimistic --- it
tends to underestimate the test error rate. In
order to better assess the accuracy of the logistic regression model
in this setting, we can fit the model using part of the data, and
then examine how well it predicts the *held out* data. This
will yield a more realistic error rate, in the sense that in practice
we will be interested in our model’s performance not on the data that
we used to fit the model, but rather on days in the future for which
the market’s movements are unknown.
To implement this strategy, we first create a Boolean vector
corresponding to the observations from 2001 through 2004. We then
use this vector to create a held out data set of observations from
2005.
"""
train = (Smarket.Year < 2005)
Smarket_train = Smarket.loc[train]
Smarket_test = Smarket.loc[~train]
Smarket_test.shape
"""The object `train` is a vector of 1,250 elements, corresponding
to the observations in our data set. The elements of the vector that
correspond to observations that occurred before 2005 are set to
`True`, whereas those that correspond to observations in 2005 are
set to `False`. Hence `train` is a
*boolean* array, since its
elements are `True` and `False`. Boolean arrays can be used
to obtain a subset of the rows or columns of a data frame
using the `loc` method. For instance,
the command `Smarket.loc[train]` would pick out a submatrix of the
stock market data set, corresponding only to the dates before 2005,
since those are the ones for which the elements of `train` are
`True`. The `~` symbol can be used to negate all of the
elements of a Boolean vector. That is, `~train` is a vector
similar to `train`, except that the elements that are `True`
in `train` get swapped to `False` in `~train`, and vice versa.
Therefore, `Smarket.loc[~train]` yields a
subset of the rows of the data frame
of the stock market data containing only the observations for which
`train` is `False`.
The output above indicates that there are 252 such
observations.
"""
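"""As a quick check on the Boolean machinery just described, we can count the `True` entries of `train` and of its negation `~train` directly (998 training days plus 252 test days gives the full 1,250):"""
train.sum(), (~train).sum()
"""We now fit a logistic regression model using only the subset of the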
observations that correspond to dates before 2005. We then obtain predicted probabilities of the
stock market going up for each of the days in our test set --- that is,
for the days in 2005.
"""
X_train, X_test = X.loc[train], X.loc[~train]
y_train, y_test = y.loc[train], y.loc[~train]
glm_train = sm.GLM(y_train,
                   X_train,
                   family=sm.families.Binomial())
results = glm_train.fit()
probs = results.predict(exog=X_test)
"""Notice that we have trained and tested our model on two completely
separate data sets: training was performed using only the dates before
2005, and testing was performed using only the dates in 2005.
Finally, we compare the predictions for 2005 to the
actual movements of the market over that time period.
We will first store the test and training labels (recall `y_test` is binary).
"""
D = Smarket.Direction
L_train, L_test = D.loc[train], D.loc[~train]
"""Now we threshold the
fitted probability at 50% to form
our predicted labels.
"""
labels = np.array(['Down']*252)
labels[probs>0.5] = 'Up'
confusion_table(labels, L_test)
"""The test accuracy is about 48% while the error rate is about 52%"""
np.mean(labels == L_test), np.mean(labels != L_test)
"""The `!=` notation means *not equal to*, and so the last command
computes the test set error rate. The results are rather
disappointing: the test error rate is 52%, which is worse than
random guessing! Of course this result is not all that surprising,
given that one would not generally expect to be able to use previous
days’ returns to predict future market performance. (After all, if it
were possible to do so, then the authors of this book would be out
striking it rich rather than writing a statistics textbook.)
We recall that the logistic regression model had very underwhelming
*p*-values associated with all of the predictors, and that the
smallest *p*-value, though not very small, corresponded to
`Lag1`. Perhaps by removing the variables that appear not to be
helpful in predicting `Direction`, we can obtain a more
effective model. After all, using predictors that have no relationship
with the response tends to cause a deterioration in the test error
rate (since such predictors cause an increase in variance without a
corresponding decrease in bias), and so removing such predictors may
in turn yield an improvement. Below we refit the logistic
regression using just `Lag1` and `Lag2`, which seemed to
have the highest predictive power in the original logistic regression
model.
"""
model = MS(['Lag1', 'Lag2']).fit(Smarket)
X = model.transform(Smarket)
X_train, X_test = X.loc[train], X.loc[~train]
glm_train = sm.GLM(y_train,
                   X_train,
                   family=sm.families.Binomial())
results = glm_train.fit()
probs = results.predict(exog=X_test)
labels = np.array(['Down']*252)
labels[probs>0.5] = 'Up'
confusion_table(labels, L_test)
"""Let’s evaluate the overall accuracy as well as the accuracy within the days when
logistic regression predicts an increase.
"""
(35+106)/252, 106/(106+76)
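"""Both quantities can also be read off the confusion table programmatically rather than typed in by hand (a sketch, assuming the same predicted-by-true layout as above):"""
ct = confusion_table(labels, L_test)
np.diag(ct).sum() / ct.to_numpy().sum(), ct.loc['Up', 'Up'] / ct.loc['Up'].sum()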
"""Now the results appear to be a little better: 56% of the daily
movements have been correctly predicted. It is worth noting that in
this case, a much simpler strategy of predicting that the market will
increase every day will also be correct 56% of the time! Hence, in
terms of overall error rate, the logistic regression method is no
better than the naive approach. However, the confusion matrix
shows that on days when logistic regression predicts an increase in
the market, it has a 58% accuracy rate. This suggests a possible
trading strategy of buying on days when the model predicts an
increasing market, and avoiding trades on days when a decrease is
predicted. Of course one would need to investigate more carefully
whether this small improvement was real or just due to random chance.
Suppose that we want to predict the returns associated with particular
values of `Lag1` and `Lag2`. In particular, we want to
predict `Direction` on a day when `Lag1` and
`Lag2` equal $1.2$ and $1.1$, respectively, and on a day when they
equal $1.5$ and $-0.8$. We do this using the `predict()`
function.
"""
newdata = pd.DataFrame({'Lag1': [1.2, 1.5],
                        'Lag2': [1.1, -0.8]})
newX = model.transform(newdata)
results.predict(newX)
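"""Both predicted probabilities turn out to be just below 0.5, so thresholding at 50% as before predicts `'Down'` on both days:"""
np.where(results.predict(newX) > 0.5, 'Up', 'Down')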
"""## Logistic Regression with scikit-learn
As a complement to the `statsmodels` workflow above, we now repeat the `Lag1`/`Lag2` fit using scikit-learn's `LogisticRegression`, evaluate it with a classification report and a confusion matrix, and visualize the decision boundary.
"""
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
# Load the dataset
data = load_data('Smarket')
# Display the first few rows of the dataset
print(data.head())
# Prepare the data for logistic regression
# Using 'Lag1' and 'Lag2' as predictors and 'Direction' as the response
data['Direction'] = data['Direction'].map({'Up': 1, 'Down': 0})
X = data[['Lag1', 'Lag2']]
y = data['Direction']
# Split the data into training and testing sets
# (note: a random split, unlike the chronological 2001-2004 / 2005 split above)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit the logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
# Make predictions on the test set
y_pred = log_reg.predict(X_test)
# Print classification report and confusion matrix
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
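# Optional: inspect the fitted coefficients. Unlike sm.GLM above, sklearn's
# LogisticRegression applies L2 regularization by default, so the estimates
# will differ slightly (pass penalty=None for an unpenalized fit).
print("Intercept:", log_reg.intercept_)
print("Coefficients (Lag1, Lag2):", log_reg.coef_)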
# Visualize the decision boundary
plt.figure(figsize=(10, 6))
# Create a mesh grid for plotting decision boundary
x_min, x_max = X['Lag1'].min() - 1, X['Lag1'].max() + 1
y_min, y_max = X['Lag2'].min() - 1, X['Lag2'].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                     np.arange(y_min, y_max, 0.01))
# Predict the class for every point on the grid; wrapping the grid in a
# DataFrame with the original column names matches the features the model saw
Z = log_reg.predict(pd.DataFrame(np.c_[xx.ravel(), yy.ravel()],
                                 columns=['Lag1', 'Lag2']))
Z = Z.reshape(xx.shape)
# Plot the decision boundary
plt.contourf(xx, yy, Z, alpha=0.8)
plt.scatter(X_test['Lag1'], X_test['Lag2'], c=y_test, edgecolor='k', s=20)
plt.xlabel('Lag1')
plt.ylabel('Lag2')
plt.title('Logistic Regression Decision Boundary')
plt.show()