---
license: mit
datasets:
- Nnaodeh/Stroke_Prediction_Dataset
language:
- en
pipeline_tag: tabular-classification
---
# Stroke Prediction Model
This project implements a machine learning pipeline that predicts stroke risk from tabular patient data. Several models are trained and compared so that the best-performing one can be selected. The sections below explain how each key consideration was implemented.
### Data Set
The dataset is used to predict whether a patient is likely to have a stroke based on input parameters such as gender, age, existing conditions (hypertension, heart disease), and smoking status. Each row provides relevant information about one patient.
### Attribute Information
1. id: unique identifier
2. gender: "Male", "Female" or "Other"
3. age: age of the patient
4. hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
5. heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
6. ever_married: "No" or "Yes"
7. work_type: "children", "Govt_job", "Never_worked", "Private" or "Self-employed"
8. Residence_type: "Rural" or "Urban"
9. avg_glucose_level: average glucose level in blood
10. bmi: body mass index
11. smoking_status: "formerly smoked", "never smoked", "smokes" or "Unknown"\*
12. stroke: 1 if the patient had a stroke or 0 if not
## Key Considerations Implementation
## Data Cleaning
#### Drop id column
The id column is dropped as it serves as a unique identifier for each row but does not contribute to the predictive power of the model.
#### Remove missing values
Remove rows with a missing 'bmi' value. These rows are few in number, so dropping them has negligible impact on model accuracy.
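The two cleaning steps above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the helper name `clean` and the sample values are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the real dataset (values are illustrative only).
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [67.0, 61.0, 80.0, 49.0],
    "bmi": [36.6, np.nan, 32.5, 34.4],
    "stroke": [1, 1, 1, 0],
})


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop the id column and any rows where bmi is missing (hypothetical helper)."""
    out = frame.drop(columns=["id"])      # identifier carries no predictive signal
    out = out.dropna(subset=["bmi"])      # remove rows with missing bmi
    return out.reset_index(drop=True)


cleaned = clean(df)
print(cleaned)
```

Checking `df["bmi"].isna().sum()` before dropping is a quick way to confirm that only a small share of rows is affected.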
## Feature Engineering
#### Binary Encoding
Convert categorical features with only two unique values into binary numeric format for easier processing by machine learning models:
- ever_married: Encoded as 0 for “No” and 1 for “Yes”.
- Residence_type: Encoded as 0 for “Rural” and 1 for “Urban”.
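A minimal sketch of this binary encoding, assuming the data is held in a pandas DataFrame. The helper name `binary_encode` and its `positive_value` parameter are hypothetical and chosen only for this example.

```python
import pandas as pd

# Illustrative rows only; the real dataset has many more columns.
df = pd.DataFrame({
    "ever_married": ["Yes", "No", "Yes"],
    "Residence_type": ["Urban", "Rural", "Rural"],
})


def binary_encode(frame: pd.DataFrame, column: str, positive_value: str) -> pd.DataFrame:
    """Return a copy where `column` is 1 for `positive_value` and 0 otherwise."""
    out = frame.copy()
    out[column] = (out[column] == positive_value).astype(int)
    return out


df = binary_encode(df, "ever_married", positive_value="Yes")
df = binary_encode(df, "Residence_type", positive_value="Urban")
print(df)
```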
#### One-Hot Encoding for Multi-Class Categorical Features
- For features with more than two categories, such as gender, work_type, and smoking_status, apply one-hot encoding to create separate binary columns for each category.
- The onehot_encode function is assumed to handle the transformation, creating additional columns for each category while dropping the original column.
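The README does not show the body of `onehot_encode`, so the version below is only one plausible sketch built on `pandas.get_dummies`. Its signature (`frame`, `column`, `prefix`) is an assumption, not the project's confirmed interface.

```python
import pandas as pd


def onehot_encode(frame: pd.DataFrame, column: str, prefix: str) -> pd.DataFrame:
    """Replace `column` with one 0/1 indicator column per category (assumed signature)."""
    out = frame.copy()
    dummies = pd.get_dummies(out[column], prefix=prefix, dtype=int)
    out = pd.concat([out, dummies], axis=1)
    return out.drop(columns=[column])  # the original categorical column is removed


# Illustrative rows only.
df = pd.DataFrame({
    "gender": ["Male", "Female", "Other"],
    "smoking_status": ["smokes", "never smoked", "Unknown"],
})

for col in ["gender", "smoking_status"]:
    df = onehot_encode(df, column=col, prefix=col)

print(df.columns.tolist())
```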
#### Split Dataset into Features and Target
- Separate the target variable (stroke) from the features:
  - X: Contains all feature columns used as input for the model.
  - y: Contains the target column, which indicates whether a stroke occurred.
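As a short sketch, the separation of features and target could look like this in pandas. The sample values are made up for illustration.

```python
import pandas as pd

# Illustrative, already-encoded rows.
df = pd.DataFrame({
    "age": [67.0, 80.0, 49.0],
    "bmi": [36.6, 32.5, 34.4],
    "stroke": [1, 1, 0],
})

X = df.drop(columns=["stroke"])  # every column except the target
y = df["stroke"]                 # 1 = stroke occurred, 0 = no stroke

print(X.shape, y.shape)
```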
#### Train-Test Split
- Split the dataset into training and testing sets so that the model is evaluated on data it has not seen during training, which helps detect overfitting.
- The specific split ratio (e.g., 70% train, 30% test) can be customized as needed.
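A 70/30 split can be sketched with scikit-learn's `train_test_split`. The `stratify=y` and `random_state=42` arguments are assumptions added for this example (the README does not say whether the project uses them); stratifying keeps the share of stroke cases similar in both sets, which matters when the positive class is rare.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 rows, 6 of them positive.
rng = np.random.default_rng(0)
X = pd.DataFrame({"age": rng.uniform(20, 90, size=20), "bmi": rng.uniform(18, 40, size=20)})
y = pd.Series([0] * 14 + [1] * 6, name="stroke")

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,     # 70% train, 30% test
    stratify=y,        # assumption: preserve the class ratio in both sets
    random_state=42,   # assumption: fixed seed for reproducibility
)

print(len(X_train), len(X_test))
```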
### Model Selection
The following models are evaluated:
- Logistic Regression
- K-Nearest Neighbors
- Support Vector Machine (Linear Kernel)
- Support Vector Machine (RBF Kernel)
- Neural Network
- Gradient Boosting
Each model is assessed on the following criteria:
- Handles both numerical and categorical features
- Resistant to overfitting
- Provides feature importance
- Good performance on imbalanced data
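One way to compare the listed models is to fit each on the same training split and score it on the same test split. The sketch below uses scikit-learn estimators with mostly default hyperparameters, synthetic imbalanced data from `make_classification`, and balanced accuracy as the metric. All of these choices are assumptions for illustration; the README does not state the project's actual settings or metric.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in for the encoded stroke features (about 10% positives).
X, y = make_classification(n_samples=400, n_features=8, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Support Vector Machine (Linear Kernel)": SVC(kernel="linear"),
    "Support Vector Machine (RBF Kernel)": SVC(kernel="rbf"),
    "Neural Network": MLPClassifier(max_iter=500, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Balanced accuracy averages recall over both classes, so it is less
    # flattering than plain accuracy when one class dominates.
    scores[name] = balanced_accuracy_score(y_test, model.predict(X_test))

best_name = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
print("Best:", best_name)
```

On real data, distance- and gradient-based models (KNN, SVM, neural network) usually benefit from feature scaling, which is omitted here to keep the sketch short.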
### Software Engineering Best Practices
#### A. Logging
Comprehensive logging system:
```python
import logging  # standard-library module required by basicConfig
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
```
Logging features:
- Timestamp for each operation
- Different log levels (INFO, ERROR)
- Operation tracking
- Error capture and reporting
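The features above could be combined in a small wrapper that logs the start and end of each pipeline step and records any exception at ERROR level. This is a sketch only: the logger name `stroke_pipeline` and the helper `run_step` are hypothetical and do not come from the project.

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger("stroke_pipeline")  # hypothetical logger name


def run_step(name, func):
    """Run one pipeline step, logging its start, its success, and any failure."""
    logger.info("Starting step: %s", name)
    try:
        result = func()
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception("Step failed: %s", name)
        return None
    logger.info("Finished step: %s", name)
    return result


ok = run_step("load data", lambda: [1, 2, 3])
failed = run_step("bad step", lambda: 1 / 0)
```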
#### B. Documentation
- Docstrings for all classes and methods
- Clear code structure with comments
- This README file
- Logging outputs for tracking