Unsupervised Learning: K-means and Hierarchical Clustering
1. Course Overview
The State Safety Profile Challenge
This week, we'll explore unsupervised machine learning through a compelling real-world challenge: understanding crime patterns across US states without any predetermined categories.
Unsupervised Learning: A type of machine learning where we find hidden patterns in data without being told what to look for. Think of it like being a detective who examines evidence without knowing what crime was committed - you're looking for patterns and connections that emerge naturally from the data.
Example: Instead of being told "find violent states vs. peaceful states," unsupervised learning lets the data reveal its own natural groupings, like "states with high murder but low assault" or "urban states with moderate crime."
Imagine you're a policy researcher working with the FBI's crime statistics. You have data on violent crime rates across all 50 US states - murder rates, assault rates, urban population percentages, and rape statistics. But here's the key challenge: you don't know how states naturally group together in terms of crime profiles.
Your Mission: Discover hidden patterns in state crime profiles without any predefined classifications!
The Challenge: Without any predetermined safety categories, you need to:
● Uncover natural groupings of states based on their crime characteristics
● Identify which crime factors tend to cluster together
● Understand regional patterns that might not follow obvious geographic boundaries
● Find states with surprisingly similar or different crime profiles
Cluster: A group of similar things. In our case, states that have similar crime patterns naturally group together in a cluster.
Example: You might discover that Alaska, Nevada, and Florida cluster together because they all have high crime rates despite being in different regions of the country.
Why This Matters: Traditional approaches might group states by region (South, Northeast, etc.) or population size. But what if crime patterns reveal different natural groupings? What if some Southern states cluster more closely with Western states based on crime profiles? What if urban percentage affects crime differently than expected?
Urban Percentage: The proportion of a state's population that lives in cities rather than rural areas.
Example: New York has a high urban percentage (87%) while Wyoming has a low urban percentage (29%).
What You'll Discover Through This Challenge
● Hidden State Safety Types: Use clustering to identify groups of states with similar crime profiles
● Crime Pattern Relationships: Find unexpected connections between different types of violent crime
● Urban vs. Rural Effects: Discover how urbanization relates to different crime patterns
● Policy Insights: Understand which states face similar challenges and might benefit from shared approaches
Clustering: The process of grouping similar data points together. It's like organizing your music library - songs naturally group by genre, but clustering might reveal unexpected groups like "workout songs" or "rainy day music" that cross traditional genre boundaries.
Core Techniques We'll Master
K-Means Clustering: A method that divides data into exactly K groups (where you choose the number K). It's like being asked to organize 50 students into exactly 4 study groups based on their academic interests.
Hierarchical Clustering: A method that creates a tree-like structure showing how data points relate to each other at different levels. It's like a family tree, but for data - showing which states are "cousins" and which are "distant relatives" in terms of crime patterns.
Both K-Means and Hierarchical Clustering are examples of unsupervised learning.
2. K-Means Clustering
What it does: Divides data into exactly K groups by finding central points (centroids).
Central Points (Centroids): The "center" or average point of each group. Think of it like the center of a basketball team huddle - it's the point that best represents where all the players are standing.
Example: If you have a cluster of high-crime states, the centroid might represent "average murder rate of 8.5, average assault rate of 250, average urban population of 70%."
USArrests Example: Analyzing crime data across 50 states, you might discover 4 distinct state safety profiles:
● High Crime States (above average in murder, assault, and rape rates)
● Urban Safe States (high urban population but lower violent crime rates)
● Rural Traditional States (low urban population, moderate crime rates)
● Mixed Profile States (high in some crime types but not others)
How to Read K-Means Results:
● Scatter Plot: Points (states) colored by cluster membership
○ Well-separated colors indicate distinct state profiles
○ Mixed colors suggest overlapping crime patterns
● Cluster Centers: Average crime characteristics of each state group
● Elbow Plot: Helps choose optimal number of state groupings
Cluster Membership: Which group each data point belongs to. Like being assigned to a team - each state gets assigned to exactly one crime profile group.
Example: Texas might be assigned to "High Crime States" while Vermont is assigned to "Rural Traditional States."
Scatter Plot: A graph where each point represents one observation (in our case, one state). Points that are close together have similar characteristics.
Elbow Plot: A graph that helps you choose the right number of clusters. It's called "elbow" because you look for a bend in the line that looks like an elbow joint.
Key Parameters:
python
# Essential parameters from the lab
from sklearn.cluster import KMeans

KMeans(
    n_clusters=4,     # Number of state safety profiles to discover
    random_state=42,  # For reproducible results
    n_init=20         # Run algorithm 20 times, keep best result
)
Parameters: Settings that control how the algorithm works. Like settings on your phone - you can adjust them to get different results.
n_clusters: How many groups you want to create. You have to decide this ahead of time.
random_state: A number that ensures you get the same results every time you run the analysis. Like setting a specific starting point so everyone gets the same answer.
n_init: How many times to run the algorithm. The computer tries multiple starting points and picks the best result. More tries make it more likely to find the best grouping.
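Putting these parameters together, here's a minimal sketch of fitting K-Means and reading its main outputs. It assumes data_scaled is the standardized USArrests data produced in the preprocessing step later in these notes.
python
from sklearn.cluster import KMeans

# Fit K-Means with the parameters above (assumes data_scaled already exists)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=20)
kmeans.fit(data_scaled)

print(kmeans.labels_)           # cluster membership: one label (0-3) per state
print(kmeans.cluster_centers_)  # cluster centers: average (scaled) crime profile of each group
print(kmeans.inertia_)          # within-cluster sum of squares for this solution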
3. Hierarchical Clustering
What it does: Creates a tree structure (dendrogram) showing how data points group together at different levels.
Dendrogram: A tree-like diagram that shows how groups form at different levels. Think of it like a family tree, but for data. At the bottom are individuals (states), and as you go up, you see how they group into families, then extended families, then larger clans.
Example: At the bottom level, you might see Vermont and New Hampshire grouped together. Moving up, they might join with Maine to form a "New England Low Crime" group. Moving up further, this group might combine with other regional groups.
USArrests Example: Analyzing state crime patterns might reveal:
● Level 1: High Crime vs. Low Crime states
● Level 2: Within high crime: Urban-driven vs. Rural-driven crime patterns
● Level 3: Within urban-driven: Assault-heavy vs. Murder-heavy profiles
How to Read Dendrograms:
● Height: Distance between groups when they merge
○ Higher merges = very different crime profiles
○ Lower merges = similar crime patterns
● Branches: Each split shows a potential state grouping
● Cutting the Tree: Draw a horizontal line to create clusters
Height: In a dendrogram, height represents how different two groups are. Think of it like difficulty level - it takes more "effort" (higher height) to combine very different groups.
Example: Combining two very similar states (like Vermont and New Hampshire) happens at low height. Combining very different groups (like "High Crime States" and "Low Crime States") happens at high height.
Cutting the Tree: Drawing a horizontal line across the dendrogram to create a specific number of groups. Like slicing a layer cake - where you cut determines how many pieces you get.
Three Linkage Methods:
● Complete Linkage: Measures distance between most different states (good for distinct profiles)
● Average Linkage: Uses average distance between all states (balanced approach)
● Single Linkage: Uses closest states (tends to create chains, often less useful)
Linkage Methods: Different ways to measure how close or far apart groups are. It's like different ways to measure the distance between two cities - you could use the distance between the farthest suburbs (complete), the average distance between all neighborhoods (average), or the distance between the closest points (single).
Example: When deciding if "High Crime Group" and "Medium Crime Group" should merge, complete linkage looks at the most different states between the groups, while average linkage looks at the typical difference.
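To see how the linkage choice plays out in practice, here's a minimal sketch that clusters the same standardized data three times, once per linkage method. It assumes data_scaled from the preprocessing step below, and 4 clusters is just an illustrative choice.
python
from sklearn.cluster import AgglomerativeClustering

# Compare cluster assignments under the three linkage methods (assumes data_scaled exists)
for method in ['complete', 'average', 'single']:
    hc = AgglomerativeClustering(n_clusters=4, linkage=method)
    labels = hc.fit_predict(data_scaled)
    print(method, labels[:10])  # cluster labels for the first ten states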
Choosing Between K-Means and Hierarchical:
● Use K-Means when: You want to segment states into a specific number of safety categories for policy targeting
● Use Hierarchical when: You want to explore the natural structure of crime patterns without assumptions
Segmentation: Dividing your data into groups for specific purposes. Like organizing students into study groups - you might want exactly 4 groups so each has a teaching assistant.
Exploratory Analysis: Looking at data to discover patterns without knowing what you'll find. Like being an explorer in uncharted territory - you're not looking for a specific destination, just seeing what interesting things you can discover.
4. Data Exploration
Step 1: Understanding Your Data
Essential Checks (from the USArrests example):
python
# Check the basic structure
print(data.shape) # How many observations and variables?
print(data.columns) # What variables do you have?
print(data.head()) # What do the first few rows look like?
# Examine the distribution
print(data.mean()) # Average values
print(data.var()) # Variability
print(data.describe()) # Full statistical summary
Observations: Individual data points we're studying. In our case, each of the 50 US states is one observation.
Variables: The characteristics we're measuring for each observation. In USArrests, we have 4 variables: Murder rate, Assault rate, Urban Population percentage, and Rape rate.
Example: For California (one observation), we might have Murder=9.0, Assault=276, UrbanPop=91, Rape=40.6 (four variables).
Distribution: How values are spread out. Like looking at test scores in a class - are most scores clustered around the average, or spread out widely?
Variability (Variance): How much the values differ from each other. High variance means values are spread out; low variance means they're clustered together.
Why This Matters: The USArrests data showed vastly different scales:
● Murder: Average 7.8, Variance 19
● Assault: Average 170.8, Variance 6,945
● This scale difference would dominate any analysis without preprocessing
Scales: The range and units of measurement for different variables. Like comparing dollars ($50,000 salary) to percentages (75% approval rating) - they're measured very differently.
Example: Assault rates are in the hundreds (like 276 per 100,000) while murder rates are single digits (like 7.8 per 100,000). Without adjustment, assault would seem much more important just because the numbers are bigger.
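To make "bigger numbers dominate" concrete, here's a small sketch comparing two states on the raw scales; the values roughly match the USArrests rows for California and Vermont, and the squared differences mimic what a distance-based clustering algorithm actually compares.
python
import numpy as np

# Squared differences between two states on raw scales
# (values roughly match USArrests for California and Vermont: Murder, Assault, UrbanPop, Rape)
california = np.array([9.0, 276, 91, 40.6])
vermont = np.array([2.2, 48, 32, 11.2])

diff_sq = (california - vermont) ** 2
print(diff_sq / diff_sq.sum())  # Assault alone accounts for roughly 90% of the squared distance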
Step 2: Data Preprocessing
Standardization (Critical for clustering):
python
from sklearn.preprocessing import StandardScaler
# Always scale when variables have different units
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
Standardization: Converting all variables to the same scale so they can be fairly compared. Like converting all measurements to the same units - instead of comparing feet to meters, you convert everything to inches.
StandardScaler: A tool that transforms data so each variable has an average of 0 and standard deviation of 1. Think of it like grading on a curve - it makes all variables equally important.
Example: After standardization, a murder rate of 7.8 might become 0.2, and an assault rate of 276 might become 1.5. Now they're on comparable scales.
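As a quick sanity check (a minimal sketch, assuming data is the USArrests DataFrame and data_scaled comes from the StandardScaler step above), you can confirm that every variable now has a mean of approximately 0 and a standard deviation of approximately 1:
python
import pandas as pd

# Wrap the scaled array back into a DataFrame to inspect it with familiar tools
scaled_df = pd.DataFrame(data_scaled, columns=data.columns, index=data.index)
print(scaled_df.mean().round(2))  # each variable's mean is approximately 0
print(scaled_df.std().round(2))   # each variable's standard deviation is approximately 1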
When to Scale:
● ✅ Always scale when variables have different units (dollars vs. percentages)
● ✅ Scale when variances differ by orders of magnitude
● ❓ Consider not scaling when all variables are in the same meaningful units
Orders of Magnitude: When one number is 10 times, 100 times, or 1000 times bigger than another. In USArrests, assault variance (6,945) is about 365 times bigger than murder variance (19) - more than two orders of magnitude difference.
Step 3: Exploratory Analysis
For K-Means Clustering:
python
# Try different numbers of clusters to find optimal K
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)
    kmeans.fit(data_scaled)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Within-Cluster Sum of Squares')
plt.title('Elbow Method for Optimal K')
plt.show()
Inertia: A measure of how tightly grouped each cluster is (the code collects one inertia value per K in the inertias list). Lower inertia means points in each cluster are closer together (better clustering). It's like measuring how close teammates stand to each other - closer teammates indicate better team cohesion.
Within-Cluster Sum of Squares: The total distance from each point to its cluster center. Think of it as measuring how far each student sits from their group's center - smaller distances mean tighter, more cohesive groups.
Elbow Method: A technique for choosing the best number of clusters. You plot the results and look for the "elbow" - the point where adding more clusters doesn't help much anymore.
For Hierarchical Clustering:
python
# Create dendrogram to explore natural groupings
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from ISLP.cluster import compute_linkage
from scipy.cluster.hierarchy import dendrogram

hc = AgglomerativeClustering(distance_threshold=0, n_clusters=None, linkage='complete')
hc.fit(data_scaled)
linkage_matrix = compute_linkage(hc)

plt.figure(figsize=(12, 8))
dendrogram(linkage_matrix, color_threshold=-np.inf, above_threshold_color='black')
plt.title('Hierarchical Clustering Dendrogram')
plt.show()
AgglomerativeClustering: A type of hierarchical clustering that starts with individual points and gradually combines them into larger groups. Like building a pyramid from the bottom up.
distance_threshold=0: A setting that tells the algorithm to build the complete tree structure without stopping early.
Linkage Matrix: A mathematical representation of how the tree structure was built. Think of it as the blueprint showing how the dendrogram was constructed.
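To turn the dendrogram into concrete groups ("cutting the tree" from Section 3), one option is scipy's cut_tree applied to the linkage matrix built above. A minimal sketch, with 4 groups chosen purely for illustration:
python
from scipy.cluster.hierarchy import cut_tree

# Cut the tree so that exactly 4 clusters remain, then read off each state's group
hier_labels = cut_tree(linkage_matrix, n_clusters=4).reshape(-1)
print(hier_labels[:10])  # cluster assignment for the first ten states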
Step 4: Validation Questions
Before proceeding with analysis, ask the following (a short code sketch of these checks appears after the definitions below):
1. Do the variables make sense together? (e.g., don't cluster height with income)
2. Are there obvious outliers that need attention?
3. Do you have enough data points? (Rule of thumb: at least 10x more observations than variables)
4. Are there missing values that need handling?
Outliers: Data points that are very different from all the others. Like a 7-foot-tall person in a group of average-height people - they're so different they might skew your analysis.
Example: If most states have murder rates between 1-15, but one state has a rate of 50, that's probably an outlier that needs special attention.
Missing Values: Data points where we don't have complete information. Like a student who didn't take one of the tests - you need to decide how to handle that gap in the data.
Rule of Thumb: A general guideline that works in most situations. For clustering, having at least 10 times more observations than variables helps ensure reliable results.
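Here is a small sketch of these checks in code, assuming data is the USArrests DataFrame (with 50 states and 4 variables, the 10x rule of thumb comfortably holds):
python
# Quick validation checks before clustering (assumes `data` is the USArrests DataFrame)
print(data.isnull().sum())                   # any missing values per variable?
n_obs, n_vars = data.shape
print(n_obs >= 10 * n_vars)                  # at least 10x more observations than variables?
print(data.describe().loc[['min', 'max']])   # scan extremes for possible outliers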