# Linear Regression

**Category:** Classical Machine Learning

## Overview

Linear regression establishes a mathematical relationship between predictor variables and a continuous outcome variable. The algorithm identifies the linear combination of input features that best predicts the target variable by minimizing the sum of squared residuals between predicted and observed values.

## Mathematical Foundation

### Linear Model Equation

```
y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ε
```

Where:

- **y** = dependent variable (outcome)
- **β₀** = intercept (baseline value when all predictors = 0)
- **β₁, β₂, ..., βₚ** = regression coefficients
- **x₁, x₂, ..., xₚ** = independent variables (predictors)
- **ε** = error term (residual)

### Ordinary Least Squares (OLS)

The method minimizes the sum of squared residuals:

```
minimize: Σ(yᵢ - ŷᵢ)²
```

This optimization finds the coefficient values that provide the best linear fit to the observed data.

## Medical Applications

### Clinical Research

- **Dose-response modeling** for pharmaceutical studies
- **Biomarker relationship analysis** in clinical trials
- **Treatment outcome prediction** based on patient characteristics
- **Healthcare cost prediction** models

### Diagnostic Applications

- Predicting laboratory values from patient demographics
- Modeling disease progression rates
- Quantifying risk factors for preventive medicine

### Epidemiological Studies

- Population health trend analysis
- Environmental factor impact assessment
- Disease prevalence modeling

## Key Assumptions

### 1. Linearity

The relationship between the independent and dependent variables must be linear. Non-linear relationships require transformation or alternative methods.

### 2. Independence

Observations must be independent of each other. Violations occur in:

- Repeated measures data
- Clustered sampling
- Time series data

### 3. Homoscedasticity

Residual variance should be constant across all fitted values.
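As a rough first check, residual spread can be compared between the lower and upper halves of the fitted values. The following is a minimal pure-Python sketch on synthetic data; the function names and data are illustrative, and this heuristic is not a substitute for a formal test such as Breusch-Pagan:

```python
# Crude homoscedasticity check: fit a one-predictor OLS model, then
# compare residual variance between the lower and upper halves of the
# fitted values. A ratio far from 1 suggests heteroscedasticity.
# (Illustrative heuristic only; names and data are made up.)

def fit_ols(x, y):
    """Closed-form OLS for a single predictor: y ≈ b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

def residual_variance_ratio(x, y):
    """Residual variance in the upper half of fitted values divided by
    the variance in the lower half."""
    b0, b1 = fit_ols(x, y)

    def var(v):
        m = sum(v) / len(v)
        return sum((vi - m) ** 2 for vi in v) / len(v)

    # Sort (fitted, residual) pairs by fitted value, keep the residuals.
    pairs = sorted((b0 + b1 * xi, yi - (b0 + b1 * xi)) for xi, yi in zip(x, y))
    resid = [r for _, r in pairs]
    half = len(resid) // 2
    return var(resid[half:]) / var(resid[:half])

# Synthetic data whose noise grows with x: the ratio should be well above 1.
x = [float(i) for i in range(1, 21)]
y = [2.0 + 0.5 * xi + (0.1 * xi if i % 2 else -0.1 * xi)
     for i, xi in enumerate(x)]
print(round(residual_variance_ratio(x, y), 2))
```

In practice, a residual-versus-fitted plot or a formal test gives a more reliable picture; a ratio like this merely flags gross violations.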
Heteroscedasticity can be detected through:

- Residual plots
- Breusch-Pagan test
- White test

### 4. Normality of Residuals

Residuals should follow a normal distribution. This is particularly important for:

- Confidence intervals
- Hypothesis testing
- Prediction intervals

## Clinical Considerations

### Multicollinearity

**Problem:** Highly correlated predictors can make coefficient interpretation unreliable.

**Solutions:**

- Variance Inflation Factor (VIF) assessment
- Principal component analysis
- Ridge or LASSO regression

### Outliers and Influential Points

**Detection Methods:**

- Cook's distance
- Leverage values
- Studentized residuals

**Impact:** A single outlier can significantly alter regression coefficients and predictions.

### Sample Size Requirements

**General Rule:** A minimum of 10-15 observations per predictor variable.

**Considerations:**

- Expected effect size
- Desired statistical power
- Expected R² value

### Model Validation

- **Cross-validation** for generalizability assessment
- **Hold-out validation** sets
- **Bootstrap methods** for confidence intervals

## Performance Metrics

### R² (Coefficient of Determination)

- **Range:** 0 to 1
- **Interpretation:** Proportion of variance explained by the model
- **Medical Context:** Higher values indicate better predictive capability

### Root Mean Square Error (RMSE)

- **Units:** Same as the dependent variable
- **Interpretation:** Average prediction error magnitude
- **Clinical Relevance:** Should be within acceptable clinical tolerance

### F-statistic

- **Purpose:** Tests overall model significance
- **Interpretation:** Higher values with low p-values indicate significant predictive ability
- **Threshold:** p < 0.05 for statistical significance

### Adjusted R²

- **Advantage:** Penalizes additional predictors
- **Use Case:** Comparing models with different numbers of variables

## Limitations in Medical Context

### Extrapolation Risks

Predictions outside the observed data range may be unreliable, particularly
concerning for:

- Extreme dosages
- Unusual patient populations
- Novel treatment combinations

### Assumption Violations

Many medical datasets violate linear regression assumptions:

- Non-linear dose-response relationships
- Heteroscedastic error patterns
- Non-normal distributions

### Interpretation Challenges

- Correlation vs. causation distinctions
- Confounding variable effects
- Clinical vs. statistical significance

## Best Practices for Medical Applications

### Model Development

1. Conduct thorough exploratory data analysis
2. Test all assumptions before model fitting
3. Incorporate clinical knowledge into variable selection
4. Validate findings on independent datasets

### Reporting Standards

1. Report confidence intervals alongside point estimates
2. Discuss the clinical significance of coefficients
3. Acknowledge model limitations
4. Provide goodness-of-fit statistics

### Clinical Implementation

1. Validate in the target population
2. Consider integration with the clinical workflow
3. Monitor performance over time
4. Update models as new data become available
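To make the fit-then-evaluate workflow concrete, here is a minimal, self-contained pure-Python sketch on synthetic data. It fits a one-predictor OLS model in closed form and reports the R² and RMSE metrics described above; the function names and data are illustrative, not a production implementation:

```python
import math

def fit_ols(x, y):
    """Closed-form OLS for a single predictor: y ≈ b0 + b1 * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return b0, b1

def r_squared(y, y_hat):
    """Proportion of outcome variance explained by the model (0 to 1)."""
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def rmse(y, y_hat):
    """Root mean square error, in the units of the outcome."""
    return math.sqrt(sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat)) / len(y))

# Synthetic example: outcome roughly 2 + 0.5 * predictor plus small noise.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.6, 2.9, 3.6, 3.9, 4.6, 4.9]
b0, b1 = fit_ols(x, y)
y_hat = [b0 + b1 * xi for xi in x]
print(f"intercept={b0:.3f} slope={b1:.3f} "
      f"R2={r_squared(y, y_hat):.3f} RMSE={rmse(y, y_hat):.3f}")
```

For real clinical data, an established library handles multiple predictors, standard errors, and diagnostics; the closed-form version here only illustrates the mechanics behind the reported metrics.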