Commit 46e47b6 · raymondEDS committed · 1 parent: 63732ac
week 7
- Reference files/week 7/W7_Lab_KNN_clustering.ipynb +481 -0
- Reference files/week 7/Week7_Clustering Curriculum.docx +0 -0
- Reference files/week 7/Week7_Clustering Learning Objectives.docx +0 -0
- Reference files/week 7/w7_curriculum +178 -0
- app/.DS_Store +0 -0
- app/main.py +26 -15
- app/pages/__pycache__/week_5.cpython-311.pyc +0 -0
- app/pages/__pycache__/week_7.cpython-311.pyc +0 -0
- app/pages/week_7.py +715 -236
Reference files/week 7/W7_Lab_KNN_clustering.ipynb
ADDED
@@ -0,0 +1,481 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b1c6a137",
   "metadata": {"id": "b1c6a137"},
   "source": [
    "# Clustering Lab: State Crime Pattern Analysis\n",
    "\n",
    "## Lab Overview\n",
    "\n",
    "Welcome to your hands-on clustering lab! You'll be working as a policy analyst for the Department of Justice, analyzing crime patterns across US states. Your mission: discover hidden safety profiles that could inform federal resource allocation and crime prevention strategies.\n",
    "\n",
    "**Your Deliverable**: A policy brief with visualizations and recommendations based on your clustering analysis.\n",
    "\n",
    "---\n",
    "\n",
    "## Exercise 1: Data Detective Work\n",
    "**Time: 15 minutes | Product: Data Summary Report**\n",
    "\n",
    "### Your Task\n",
    "Before any analysis, you need to understand what you're working with. Create a brief data summary that a non-technical policy maker could understand.\n"
   ]
  },
  {
   "cell_type": "code",
   "id": "mqRVE1hlXK9x",
   "metadata": {
    "colab": {"base_uri": "https://localhost:8080/", "height": 106},
    "id": "mqRVE1hlXK9x",
    "outputId": "5a1bbd64-15cd-4e1c-9344-64a901d8a396"
   },
   "execution_count": null,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from statsmodels.datasets import get_rdataset\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.cluster import KMeans, AgglomerativeClustering\n",
    "\n",
    "# Load the data\n",
    "USArrests = get_rdataset('USArrests').data\n",
    "print(\"Dataset shape:\", USArrests.shape)\n",
    "print(\"\\nVariables:\", USArrests.columns.tolist())\n",
    "print(\"\\nFirst 5 states:\")\n",
    "print(USArrests.head())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7qkDKTe4XLtG",
   "metadata": {"id": "7qkDKTe4XLtG"},
   "source": [
    "## Your Investigation\n",
    "Complete this data summary table:\n",
    "\n",
    "| Variable | What it measures | Average Value | Highest State | Lowest State |\n",
    "|----------|------------------|---------------|---------------|--------------|\n",
    "| Murder | Rate per 100,000 people | ??? | ??? | ??? |\n",
    "| Assault | Rate per 100,000 people | ??? | ??? | ??? |\n",
    "| UrbanPop | Percentage living in cities | ??? | ??? | ??? |\n",
    "| Rape | Rate per 100,000 people | ??? | ??? | ??? |\n",
    "\n",
    "**Deliverable**: Write 2-3 sentences describing the biggest surprises in this data. Which states are not what you expected?\n",
    "\n",
    "---\n",
    "\n",
    "## Exercise 2: The Scaling Challenge\n",
    "**Time: 10 minutes | Product: Before/After Comparison**\n",
    "\n",
    "### Your Task\n",
    "Demonstrate why scaling is critical for clustering crime data.\n"
   ]
  },
  {
   "cell_type": "code",
   "id": "zQ3VowYNXLeQ",
   "metadata": {"id": "zQ3VowYNXLeQ"},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Check the scale differences\n",
    "print(\"Original data ranges:\")\n",
    "print(USArrests.describe())\n",
    "\n",
    "print(\"\\nVariances (how spread out the data is):\")\n",
    "print(USArrests.var())\n",
    "\n",
    "# Scale the data\n",
    "scaler = StandardScaler()\n",
    "USArrests_scaled = scaler.fit_transform(USArrests)\n",
    "scaled_df = pd.DataFrame(USArrests_scaled,\n",
    "                         columns=USArrests.columns,\n",
    "                         index=USArrests.index)\n",
    "\n",
    "print(\"\\nAfter scaling - all variables now have similar ranges:\")\n",
    "print(scaled_df.describe())\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "FnOT700SXLPh",
   "metadata": {"id": "FnOT700SXLPh"},
   "source": [
    "### Your Analysis\n",
    "1. **Before scaling**: Which variable would dominate the clustering? Why?\n",
    "2. **After scaling**: Explain in simple terms what StandardScaler did to the data.\n",
    "\n",
    "**Deliverable**: One paragraph explaining why a policy analyst should care about data scaling.\n",
    "\n",
    "---\n",
    "\n",
    "## Exercise 3: Finding the Right Number of Groups\n",
    "**Time: 20 minutes | Product: Recommendation with Visual Evidence**\n",
    "\n",
    "### Your Task\n",
    "Use the elbow method to determine how many distinct crime profiles exist among US states.\n"
   ]
  },
  {
   "cell_type": "code",
   "id": "zOQrS9lmXpTF",
   "metadata": {"id": "zOQrS9lmXpTF"},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Test different numbers of clusters\n",
    "inertias = []\n",
    "K_values = range(1, 11)\n",
    "\n",
    "for k in K_values:\n",
    "    kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)\n",
    "    kmeans.fit(USArrests_scaled)\n",
    "    inertias.append(kmeans.inertia_)\n",
    "\n",
    "# Create the elbow plot\n",
    "plt.figure(figsize=(10, 6))\n",
    "plt.plot(K_values, inertias, 'bo-', linewidth=2, markersize=8)\n",
    "plt.xlabel('Number of Clusters (K)')\n",
    "plt.ylabel('Within-Cluster Sum of Squares')\n",
    "plt.title('Finding the Optimal Number of State Crime Profiles')\n",
    "plt.grid(True, alpha=0.3)\n",
    "plt.show()\n",
    "\n",
    "# Print the inertia values\n",
    "for k, inertia in zip(K_values, inertias):\n",
    "    print(f\"K={k}: Inertia = {inertia:.1f}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e388ef2",
   "metadata": {"id": "2e388ef2"},
   "source": [
    "### Your Decision\n",
    "Based on your elbow plot:\n",
    "1. **What value of K do you recommend?** (Look for the \"elbow\" where the line starts to flatten)\n",
    "2. **What does this mean in policy terms?** (How many distinct types of state crime profiles exist?)\n",
    "\n",
    "**Deliverable**: A one-paragraph recommendation with your chosen K value and reasoning.\n",
    "\n",
    "---\n",
    "\n",
    "## Exercise 4: K-Means State Profiling\n",
    "**Time: 25 minutes | Product: State Crime Profile Report**\n",
    "\n",
    "### Your Task\n",
    "Create distinct crime profiles and identify which states belong to each category.\n"
   ]
  },
  {
   "cell_type": "code",
   "id": "_5b0nE6KXv1P",
   "metadata": {"id": "_5b0nE6KXv1P"},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Use your chosen K value from Exercise 3\n",
    "optimal_k = 4  # Replace with your chosen value\n",
    "\n",
    "# Perform K-means clustering\n",
    "kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=20)\n",
    "cluster_labels = kmeans.fit_predict(USArrests_scaled)\n",
    "\n",
    "# Add cluster labels to original data\n",
    "USArrests_clustered = USArrests.copy()\n",
    "USArrests_clustered['Cluster'] = cluster_labels\n",
    "\n",
    "# Analyze each cluster\n",
    "print(\"State Crime Profiles Analysis\")\n",
    "print(\"=\" * 50)\n",
    "\n",
    "for cluster_num in range(optimal_k):\n",
    "    cluster_states = USArrests_clustered[USArrests_clustered['Cluster'] == cluster_num]\n",
    "    print(f\"\\nCLUSTER {cluster_num}: {len(cluster_states)} states\")\n",
    "    print(\"States:\", \", \".join(cluster_states.index.tolist()))\n",
    "    print(\"Average characteristics:\")\n",
    "    avg_profile = cluster_states[['Murder', 'Assault', 'UrbanPop', 'Rape']].mean()\n",
    "    for var, value in avg_profile.items():\n",
    "        print(f\"  {var}: {value:.1f}\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "J1WVGb_nX4ye",
   "metadata": {"id": "J1WVGb_nX4ye"},
   "source": [
    "### Your Analysis\n",
    "For each cluster, create a profile:\n",
    "\n",
    "**Cluster 0: \"[Your Creative Name]\"**\n",
    "- **States**: [List them]\n",
    "- **Characteristics**: [Describe the pattern]\n",
    "- **Policy Insight**: [What should federal agencies know about these states?]\n",
    "\n",
    "**Deliverable**: A table summarizing each cluster with creative names and policy recommendations.\n",
    "\n",
    "---\n",
    "\n",
    "## Exercise 5: Hierarchical Clustering Exploration\n",
    "**Time: 25 minutes | Product: Family Tree Interpretation**\n",
    "\n",
    "### Your Task\n",
    "Create a dendrogram to understand how states naturally group together.\n"
   ]
  },
  {
   "cell_type": "code",
   "id": "Y9a_cbZKX7QX",
   "metadata": {"id": "Y9a_cbZKX7QX"},
   "execution_count": null,
   "outputs": [],
   "source": [
    "from scipy.cluster.hierarchy import dendrogram, linkage\n",
    "\n",
    "# Create hierarchical clustering\n",
    "linkage_matrix = linkage(USArrests_scaled, method='complete')\n",
    "\n",
    "# Plot the dendrogram\n",
    "plt.figure(figsize=(15, 8))\n",
    "dendrogram(linkage_matrix,\n",
    "           labels=USArrests.index.tolist(),\n",
    "           leaf_rotation=90,\n",
    "           leaf_font_size=10)\n",
    "plt.title('State Crime Pattern Family Tree')\n",
    "plt.xlabel('States')\n",
    "plt.ylabel('Distance Between Groups')\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0PaImqZtX6f3",
   "metadata": {"id": "0PaImqZtX6f3"},
   "source": [
    "### Your Interpretation\n",
    "1. **Closest Pairs**: Which two states are most similar in crime patterns?\n",
    "2. **Biggest Divide**: Where is the largest split in the tree? What does this represent?\n",
    "3. **Surprising Neighbors**: Which states cluster together that surprised you geographically?\n",
    "\n",
    "### Code to Compare Methods\n"
   ]
  },
  {
   "cell_type": "code",
   "id": "tJQ-C5GFYBRT",
   "metadata": {"id": "tJQ-C5GFYBRT"},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Compare your K-means results with hierarchical clustering\n",
    "from scipy.cluster.hierarchy import fcluster\n",
    "\n",
    "# Cut the tree to get the same number of clusters as K-means\n",
    "hierarchical_labels = fcluster(linkage_matrix, optimal_k, criterion='maxclust') - 1\n",
    "\n",
    "# Create comparison\n",
    "comparison_df = pd.DataFrame({\n",
    "    'State': USArrests.index,\n",
    "    'K_Means_Cluster': cluster_labels,\n",
    "    'Hierarchical_Cluster': hierarchical_labels\n",
    "})\n",
    "\n",
    "print(\"Comparison of K-Means vs Hierarchical Clustering:\")\n",
    "print(comparison_df.sort_values('State'))\n",
    "\n",
    "# Count agreements\n",
    "agreements = sum(comparison_df['K_Means_Cluster'] == comparison_df['Hierarchical_Cluster'])\n",
    "print(f\"\\nMethods agreed on {agreements} out of {len(comparison_df)} states ({agreements/len(comparison_df)*100:.1f}%)\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dx1fNhu4YD7-",
   "metadata": {"id": "dx1fNhu4YD7-"},
   "source": [
    "**Deliverable**: A paragraph explaining the key differences between what K-means and hierarchical clustering revealed.\n",
    "\n",
    "---\n",
    "\n",
    "## Exercise 6: Policy Brief Creation\n",
    "**Time: 20 minutes | Product: Executive Summary**\n",
    "\n",
    "### Your Task\n",
    "Synthesize your findings into a policy brief for Department of Justice leadership.\n",
    "\n",
    "### Code Framework for Final Visualization\n"
   ]
  },
  {
   "cell_type": "code",
   "id": "N8bkxURpYHJF",
   "metadata": {"id": "N8bkxURpYHJF"},
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Create a comprehensive visualization\n",
    "fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))\n",
    "\n",
    "# Plot 1: Murder vs Assault by cluster\n",
    "colors = ['red', 'blue', 'green', 'orange', 'purple']\n",
    "for i in range(optimal_k):\n",
    "    cluster_data = USArrests_clustered[USArrests_clustered['Cluster'] == i]\n",
    "    ax1.scatter(cluster_data['Murder'], cluster_data['Assault'],\n",
    "                c=colors[i], label=f'Cluster {i}', s=60, alpha=0.7)\n",
    "ax1.set_xlabel('Murder Rate')\n",
    "ax1.set_ylabel('Assault Rate')\n",
    "ax1.set_title('Murder vs Assault by Crime Profile')\n",
    "ax1.legend()\n",
    "ax1.grid(True, alpha=0.3)\n",
    "\n",
    "# Plot 2: Urban Population vs Rape by cluster\n",
    "for i in range(optimal_k):\n",
    "    cluster_data = USArrests_clustered[USArrests_clustered['Cluster'] == i]\n",
    "    ax2.scatter(cluster_data['UrbanPop'], cluster_data['Rape'],\n",
    "                c=colors[i], label=f'Cluster {i}', s=60, alpha=0.7)\n",
    "ax2.set_xlabel('Urban Population %')\n",
    "ax2.set_ylabel('Rape Rate')\n",
    "ax2.set_title('Urban Population vs Rape Rate by Crime Profile')\n",
    "ax2.legend()\n",
    "ax2.grid(True, alpha=0.3)\n",
    "\n",
    "# Plot 3: Cluster size comparison\n",
    "cluster_sizes = USArrests_clustered['Cluster'].value_counts().sort_index()\n",
    "ax3.bar(range(len(cluster_sizes)), cluster_sizes.values, color=colors[:len(cluster_sizes)])\n",
    "ax3.set_xlabel('Cluster Number')\n",
    "ax3.set_ylabel('Number of States')\n",
    "ax3.set_title('Number of States in Each Crime Profile')\n",
    "ax3.set_xticks(range(len(cluster_sizes)))\n",
    "\n",
    "# Plot 4: Average crime rates by cluster\n",
    "cluster_means = USArrests_clustered.groupby('Cluster')[['Murder', 'Assault', 'Rape']].mean()\n",
    "cluster_means.plot(kind='bar', ax=ax4)\n",
    "ax4.set_xlabel('Cluster Number')\n",
    "ax4.set_ylabel('Average Rate')\n",
    "ax4.set_title('Average Crime Rates by Profile')\n",
    "ax4.legend()\n",
    "ax4.tick_params(axis='x', rotation=0)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "rAy_Ye0WYLK0",
   "metadata": {"id": "rAy_Ye0WYLK0"},
   "source": [
    "### Your Policy Brief Template\n",
    "\n",
    "**EXECUTIVE SUMMARY: US State Crime Profile Analysis**\n",
    "\n",
    "**Key Findings:**\n",
    "- We identified [X] distinct crime profiles among US states\n",
    "- [State examples] represent the highest-risk profile\n",
    "- [State examples] represent the lowest-risk profile\n",
    "- Urban population [does/does not] strongly correlate with violent crime\n",
    "\n",
    "**Policy Recommendations:**\n",
    "1. **High-Priority States**: [List and explain why]\n",
    "2. **Resource Allocation**: [Suggest how to distribute federal crime prevention funds]\n",
    "3. **Best Practice Sharing**: [Which states should learn from which others?]\n",
    "\n",
    "**Methodology Note**: Analysis used unsupervised clustering on 4 crime variables across 50 states, with data standardization to ensure fair comparison.\n",
    "\n",
    "**Deliverable**: A complete 1-page policy brief with your clustering insights and specific recommendations.\n"
   ]
  }
 ],
 "metadata": {
  "jupytext": {"cell_metadata_filter": "-all", "formats": "Rmd,ipynb", "main_language": "python"},
  "kernelspec": {"display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3"},
  "language_info": {
   "codemirror_mode": {"name": "ipython", "version": 3},
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.4"
  },
  "colab": {"provenance": []}
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
Reference files/week 7/Week7_Clustering Curriculum.docx
ADDED
Binary file (18.4 kB).
Reference files/week 7/Week7_Clustering Learning Objectives.docx
ADDED
Binary file (11.4 kB).
Reference files/week 7/w7_curriculum
ADDED
@@ -0,0 +1,178 @@
Unsupervised Learning: K-means and Hierarchical Clustering

1. Course Overview

The State Safety Profile Challenge

This week, we'll explore unsupervised machine learning through a compelling real-world challenge: understanding crime patterns across US states without any predetermined categories.

Unsupervised Learning: A type of machine learning where we find hidden patterns in data without being told what to look for. Think of it like being a detective who examines evidence without knowing what crime was committed - you're looking for patterns and connections that emerge naturally from the data.

Example: Instead of being told "find violent states vs. peaceful states," unsupervised learning lets the data reveal its own natural groupings, like "states with high murder but low assault" or "urban states with moderate crime."

Imagine you're a policy researcher working with the FBI's crime statistics. You have data on violent crime rates across all 50 US states - murder rates, assault rates, urban population percentages, and rape statistics. But here's the key challenge: you don't know how states naturally group together in terms of crime profiles.

Your Mission: Discover hidden patterns in state crime profiles without any predefined classifications!

The Challenge: Without any predetermined safety categories, you need to:
● Uncover natural groupings of states based on their crime characteristics
● Identify which crime factors tend to cluster together
● Understand regional patterns that might not follow obvious geographic boundaries
● Find states with surprisingly similar or different crime profiles

Cluster: A group of similar things. In our case, states that have similar crime patterns naturally group together in a cluster.

Example: You might discover that Alaska, Nevada, and Florida cluster together because they all have high crime rates despite being in different regions of the country.

Why This Matters: Traditional approaches might group states by region (South, Northeast, etc.) or population size. But what if crime patterns reveal different natural groupings? What if some Southern states cluster more closely with Western states based on crime profiles? What if urban percentage affects crime differently than expected?

Urban Percentage: The proportion of a state's population that lives in cities rather than rural areas.

Example: New York has a high urban percentage (87%) while Wyoming has a low urban percentage (29%).

What You'll Discover Through This Challenge
● Hidden State Safety Types: Use clustering to identify groups of states with similar crime profiles
● Crime Pattern Relationships: Find unexpected connections between different types of violent crime
● Urban vs. Rural Effects: Discover how urbanization relates to different crime patterns
● Policy Insights: Understand which states face similar challenges and might benefit from shared approaches

Clustering: The process of grouping similar data points together. It's like organizing your music library - songs naturally group by genre, but clustering might reveal unexpected groups like "workout songs" or "rainy day music" that cross traditional genre boundaries.

Core Techniques We'll Master

K-Means Clustering: A method that divides data into exactly K groups (where you choose the number K). It's like being asked to organize 50 students into exactly 4 study groups based on their academic interests.

Hierarchical Clustering: A method that creates a tree-like structure showing how data points relate to each other at different levels. It's like a family tree, but for data - showing which states are "cousins" and which are "distant relatives" in terms of crime patterns.

Both K-Means and Hierarchical Clustering are examples of unsupervised learning.

2. K-Means Clustering

What it does: Divides data into exactly K groups by finding central points (centroids).

Central Points (Centroids): The "center" or average point of each group. Think of it like the center of a basketball team huddle - it's the point that best represents where all the players are standing.

Example: If you have a cluster of high-crime states, the centroid might represent "average murder rate of 8.5, average assault rate of 250, average urban population of 70%."

USArrests Example: Analyzing crime data across 50 states, you might discover 4 distinct state safety profiles:
● High Crime States (above average in murder, assault, and rape rates)
● Urban Safe States (high urban population but lower violent crime rates)
● Rural Traditional States (low urban population, moderate crime rates)
● Mixed Profile States (high in some crime types but not others)

How to Read K-Means Results:
● Scatter Plot: Points (states) colored by cluster membership
  ○ Well-separated colors indicate distinct state profiles
  ○ Mixed colors suggest overlapping crime patterns
● Cluster Centers: Average crime characteristics of each state group
● Elbow Plot: Helps choose the optimal number of state groupings

Cluster Membership: Which group each data point belongs to. Like being assigned to a team - each state gets assigned to exactly one crime profile group.

Example: Texas might be assigned to "High Crime States" while Vermont is assigned to "Rural Traditional States."

Scatter Plot: A graph where each point represents one observation (in our case, one state). Points that are close together have similar characteristics.

Elbow Plot: A graph that helps you choose the right number of clusters. It's called an "elbow" plot because you look for a bend in the line that looks like an elbow joint.

Key Parameters:

python
# Essential parameters from the lab
KMeans(
    n_clusters=4,     # Number of state safety profiles to discover
    random_state=42,  # For reproducible results
    n_init=20         # Run algorithm 20 times, keep best result
)

Parameters: Settings that control how the algorithm works. Like settings on your phone - you can adjust them to get different results.

n_clusters: How many groups you want to create. You have to decide this ahead of time.

random_state: A number that ensures you get the same results every time you run the analysis. Like setting a specific starting point so everyone gets the same answer.

n_init: How many times to run the algorithm. The computer tries multiple starting points and picks the best result. More tries = better results.
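To see these parameters in a complete run, here is a minimal sketch that fits K-means on the standardized USArrests data and reads back the two outputs you'll use most: cluster membership and centroids. The names X, labels, and centers are illustrative, not from the lab code.

python
# Fit K-means with the parameters above and inspect the results
import pandas as pd
from statsmodels.datasets import get_rdataset
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

USArrests = get_rdataset('USArrests').data
X = StandardScaler().fit_transform(USArrests)   # scale first (see section 4)

kmeans = KMeans(n_clusters=4, random_state=42, n_init=20)
labels = kmeans.fit_predict(X)                  # cluster membership, one label per state

# Centroids are in standardized units; rows are clusters, columns are variables
centers = pd.DataFrame(kmeans.cluster_centers_, columns=USArrests.columns)
print(centers.round(2))

# Which states landed in cluster 0?
print(USArrests.index[labels == 0].tolist())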
3. Hierarchical Clustering

What it does: Creates a tree structure (dendrogram) showing how data points group together at different levels.

Dendrogram: A tree-like diagram that shows how groups form at different levels. Think of it like a family tree, but for data. At the bottom are individuals (states), and as you go up, you see how they group into families, then extended families, then larger clans.

Example: At the bottom level, you might see Vermont and New Hampshire grouped together. Moving up, they might join with Maine to form a "New England Low Crime" group. Moving up further, this group might combine with other regional groups.

USArrests Example: Analyzing state crime patterns might reveal:
● Level 1: High Crime vs. Low Crime states
● Level 2: Within high crime: Urban-driven vs. Rural-driven crime patterns
● Level 3: Within urban-driven: Assault-heavy vs. Murder-heavy profiles

How to Read Dendrograms:
● Height: Distance between groups when they merge
  ○ Higher merges = very different crime profiles
  ○ Lower merges = similar crime patterns
● Branches: Each split shows a potential state grouping
● Cutting the Tree: Draw a horizontal line to create clusters

Height: In a dendrogram, height represents how different two groups are. Think of it like difficulty level - it takes more "effort" (higher height) to combine very different groups.

Example: Combining two very similar states (like Vermont and New Hampshire) happens at low height. Combining very different groups (like "High Crime States" and "Low Crime States") happens at high height.

Cutting the Tree: Drawing a horizontal line across the dendrogram to create a specific number of groups. Like slicing a layer cake - where you cut determines how many pieces you get.

Three Linkage Methods:
● Complete Linkage: Measures distance between the most different states (good for distinct profiles)
● Average Linkage: Uses the average distance between all states (balanced approach)
● Single Linkage: Uses the closest states (tends to create chains, often less useful)

Linkage Methods: Different ways to measure how close or far apart groups are. It's like different ways to measure the distance between two cities - you could use the distance between the farthest suburbs (complete), the average distance between all neighborhoods (average), or the distance between the closest points (single).

Example: When deciding if "High Crime Group" and "Medium Crime Group" should merge, complete linkage looks at the most different states between the groups, while average linkage looks at the typical difference.

Choosing Between K-Means and Hierarchical:
● Use K-Means when: You want to segment states into a specific number of safety categories for policy targeting
● Use Hierarchical when: You want to explore the natural structure of crime patterns without assumptions

Segmentation: Dividing your data into groups for specific purposes. Like organizing students into study groups - you might want exactly 4 groups so each has a teaching assistant.

Exploratory Analysis: Looking at data to discover patterns without knowing what you'll find. Like being an explorer in uncharted territory - you're not looking for a specific destination, just seeing what interesting things you can discover.
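To see how the linkage choice and "cutting the tree" play out in code, here is a hedged sketch using scipy's linkage and fcluster; the loop over methods and the choice of 4 clusters are illustrative, not part of the lab.

python
# Compare linkage methods, then cut each tree into 4 clusters
import pandas as pd
from statsmodels.datasets import get_rdataset
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import linkage, fcluster

data = get_rdataset('USArrests').data
data_scaled = StandardScaler().fit_transform(data)

for method in ['complete', 'average', 'single']:
    Z = linkage(data_scaled, method=method)          # build the merge tree
    labels = fcluster(Z, t=4, criterion='maxclust')  # "cut the tree" into 4 groups
    # Single linkage often yields one giant chain-like cluster plus stragglers
    sizes = pd.Series(labels).value_counts().sort_index()
    print(method, sizes.tolist())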
4. Data Exploration

Step 1: Understanding Your Data

Essential Checks (from the USArrests example):

python
# Check the basic structure
print(data.shape)    # How many observations and variables?
print(data.columns)  # What variables do you have?
print(data.head())   # What do the first few rows look like?

# Examine the distribution
print(data.mean())      # Average values
print(data.var())       # Variability
print(data.describe())  # Full statistical summary

Observations: Individual data points we're studying. In our case, each of the 50 US states is one observation.

Variables: The characteristics we're measuring for each observation. In USArrests, we have 4 variables: Murder rate, Assault rate, Urban Population percentage, and Rape rate.

Example: For California (one observation), we might have Murder=9.0, Assault=276, UrbanPop=91, Rape=40.6 (four variables).

Distribution: How values are spread out. Like looking at test scores in a class - are most scores clustered around the average, or spread out widely?

Variability (Variance): How much the values differ from each other. High variance means values are spread out; low variance means they're clustered together.

Why This Matters: The USArrests data showed vastly different scales:
● Murder: Average 7.8, Variance 19
● Assault: Average 170.8, Variance 6,945
● This scale difference would dominate any analysis without preprocessing

Scales: The range and units of measurement for different variables. Like comparing dollars ($50,000 salary) to percentages (75% approval rating) - they're measured very differently.

Example: Assault rates are in the hundreds (like 276 per 100,000) while murder rates are single digits (like 7.8 per 100,000). Without adjustment, assault would seem much more important just because the numbers are bigger.

Step 2: Data Preprocessing

Standardization (Critical for clustering):

python
from sklearn.preprocessing import StandardScaler

# Always scale when variables have different units
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

Standardization: Converting all variables to the same scale so they can be fairly compared. Like converting all measurements to the same units - instead of comparing feet to meters, you convert everything to inches.

StandardScaler: A tool that transforms data so each variable has an average of 0 and a standard deviation of 1. Think of it like grading on a curve - it makes all variables equally important.

Example: After standardization, a murder rate of 7.8 might become 0.2, and an assault rate of 276 might become 1.5. Now they're on comparable scales.

When to Scale:
● ✅ Always scale when variables have different units (dollars vs. percentages)
● ✅ Scale when variances differ by orders of magnitude
● ❓ Consider not scaling when all variables are in the same meaningful units

Orders of Magnitude: When one number is 10 times, 100 times, or 1,000 times bigger than another. In USArrests, assault variance (6,945) is about 365 times bigger than murder variance (19) - more than two orders of magnitude of difference.

Step 3: Exploratory Analysis

For K-Means Clustering:

python
# Try different numbers of clusters to find optimal K
inertias = []
K_range = range(1, 11)
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)
    kmeans.fit(data_scaled)
    inertias.append(kmeans.inertia_)

# Plot elbow curve
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Within-Cluster Sum of Squares')
plt.title('Elbow Method for Optimal K')

Inertias: A measure of how tightly grouped each cluster is. Lower inertia means points in each cluster are closer together (better clustering). It's like measuring how close teammates stand to each other - closer teammates indicate better team cohesion.

Within-Cluster Sum of Squares: The total distance from each point to its cluster center. Think of it as measuring how far each student sits from their group's center - smaller distances mean tighter, more cohesive groups.

Elbow Method: A technique for choosing the best number of clusters. You plot the results and look for the "elbow" - the point where adding more clusters doesn't help much anymore.
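When the elbow is ambiguous, the average silhouette score gives a complementary check (the app code in this commit already imports silhouette_score from sklearn.metrics): it rates how well each point fits its own cluster versus the nearest other cluster, and higher averages are better. A minimal sketch, assuming the data_scaled matrix from Step 2; this is an addition beyond the lab's elbow-only workflow.

python
# Silhouette analysis: score K = 2..10 and look for the peak
from statsmodels.datasets import get_rdataset
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

data = get_rdataset('USArrests').data
data_scaled = StandardScaler().fit_transform(data)

for k in range(2, 11):  # silhouette is undefined for a single cluster
    labels = KMeans(n_clusters=k, random_state=42, n_init=20).fit_predict(data_scaled)
    print(k, round(silhouette_score(data_scaled, labels), 3))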
For Hierarchical Clustering:

python
# Create dendrogram to explore natural groupings
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from ISLP.cluster import compute_linkage
from scipy.cluster.hierarchy import dendrogram

hc = AgglomerativeClustering(distance_threshold=0, n_clusters=None, linkage='complete')
hc.fit(data_scaled)
linkage_matrix = compute_linkage(hc)

plt.figure(figsize=(12, 8))
dendrogram(linkage_matrix, color_threshold=-np.inf, above_threshold_color='black')
plt.title('Hierarchical Clustering Dendrogram')

AgglomerativeClustering: A type of hierarchical clustering that starts with individual points and gradually combines them into larger groups. Like building a pyramid from the bottom up.

distance_threshold=0: A setting that tells the algorithm to build the complete tree structure without stopping early.

Linkage Matrix: A mathematical representation of how the tree structure was built. Think of it as the blueprint showing how the dendrogram was constructed.

Step 4: Validation Questions

Before proceeding with analysis, ask:
1. Do the variables make sense together? (e.g., don't cluster height with income)
2. Are there obvious outliers that need attention? (see the sketch after this list)
3. Do you have enough data points? (Rule of thumb: at least 10x more observations than variables)
4. Are there missing values that need handling?

Outliers: Data points that are very different from all the others. Like a 7-foot-tall person in a group of average-height people - they're so different they might skew your analysis.

Example: If most states have murder rates between 1-15, but one state has a rate of 50, that's probably an outlier that needs special attention.

Missing Values: Data points where we don't have complete information. Like a student who didn't take one of the tests - you need to decide how to handle that gap in the data.

Rule of Thumb: A general guideline that works in most situations. For clustering, having at least 10 times more observations than variables helps ensure reliable results.
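Questions 2 and 4 lend themselves to a quick programmatic check. A minimal sketch, assuming the raw USArrests frame is loaded as data; the 3-standard-deviation cutoff is one common convention, not a rule from the lab.

python
# Quick validation pass: missing values and z-score outliers
from statsmodels.datasets import get_rdataset

data = get_rdataset('USArrests').data

print(data.isna().sum())                    # question 4: any gaps per column?

z = (data - data.mean()) / data.std()       # standardize each variable
outliers = data[(z.abs() > 3).any(axis=1)]  # question 2: any state beyond 3 SDs?
print(outliers)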
app/.DS_Store
CHANGED
Binary files a/app/.DS_Store and b/app/.DS_Store differ
app/main.py
CHANGED
@@ -29,9 +29,24 @@ st.set_page_config(
 29      page_title="Data Science Course App",
 30      page_icon="📚",
 31      layout="wide",
 32 -    initial_sidebar_state="expanded"
 33  )
 34
 35  # Custom CSS
 36  def load_css():
 37      try:

@@ -56,11 +71,6 @@ def load_css():
 56      margin-bottom: 1rem;
 57  }
 58
 59 -    /* Sidebar styling */
 60 -    .sidebar .sidebar-content {
 61 -        background-color: #f8f9fa;
 62 -    }
 63 -
 64      /* Button styling */
 65      .stButton>button {
 66          width: 100%;

@@ -125,16 +135,17 @@ def sidebar_navigation():
125          st.rerun()
126
127      st.markdown("---")
128 -    st.subheader("Course
129 -    progress = st.progress(st.session_state.current_week / 10)
130 -    st.write(f"Week {st.session_state.current_week} of 10")
131
138
139  def show_week_content():
140      # Debug print to show current week

 29      page_title="Data Science Course App",
 30      page_icon="📚",
 31      layout="wide",
 32 +    initial_sidebar_state="expanded",
 33 +    menu_items={
 34 +        'Get Help': None,
 35 +        'Report a bug': None,
 36 +        'About': None
 37 +    }
 38  )
 39
 40 +# Disable URL paths and hide Streamlit elements
 41 +st.markdown("""
 42 +    <style>
 43 +    #MainMenu {visibility: hidden;}
 44 +    footer {visibility: hidden;}
 45 +    .stDeployButton {display: none;}
 46 +    .viewerBadge_container__1QSob {display: none;}
 47 +    </style>
 48 +""", unsafe_allow_html=True)
 49 +
 50  # Custom CSS
 51  def load_css():
 52      try:

 71      margin-bottom: 1rem;
 72  }
 73
 74      /* Button styling */
 75      .stButton>button {
 76          width: 100%;

135          st.rerun()
136
137      st.markdown("---")
138 +    st.subheader("Course Content")
139
140 +    # Create a container for week buttons
141 +    week_container = st.container()
142 +
143 +    # Add week buttons with custom styling
144 +    with week_container:
145 +        for week in range(1, 11):
146 +            if st.button(f"Week {week}", key=f"week_{week}"):
147 +                st.session_state.current_week = week
148 +                st.rerun()
149
150  def show_week_content():
151      # Debug print to show current week
app/pages/__pycache__/week_5.cpython-311.pyc
CHANGED
Binary files a/app/pages/__pycache__/week_5.cpython-311.pyc and b/app/pages/__pycache__/week_5.cpython-311.pyc differ
app/pages/__pycache__/week_7.cpython-311.pyc
CHANGED
Binary files a/app/pages/__pycache__/week_7.cpython-311.pyc and b/app/pages/__pycache__/week_7.cpython-311.pyc differ
app/pages/week_7.py
CHANGED
@@ -6,16 +6,21 @@ import seaborn as sns
|
|
6 |
import plotly.express as px
|
7 |
import plotly.graph_objects as go
|
8 |
from plotly.subplots import make_subplots
|
|
|
|
|
|
|
|
|
|
|
|
|
9 |
|
10 |
# Set up the style for all plots
|
11 |
plt.style.use('default')
|
12 |
sns.set_theme(style="whitegrid", palette="husl")
|
13 |
|
14 |
-
def
|
15 |
-
"""Load and return the
|
16 |
-
|
17 |
-
|
18 |
-
return df
|
19 |
|
20 |
def create_categorical_plot(df, column, target='Survived'):
|
21 |
"""Create an interactive plot for categorical variables"""
|
@@ -54,284 +59,758 @@ def create_numeric_plot(df, column, target='Survived'):
|
|
54 |
return fig
|
55 |
|
56 |
def show():
|
57 |
-
st.title("Week 7:
|
58 |
|
59 |
-
#
|
60 |
-
st.
|
61 |
-
|
62 |
-
|
63 |
-
|
64 |
-
|
65 |
-
|
66 |
-
|
67 |
-
|
68 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
69 |
""")
|
70 |
|
71 |
-
#
|
72 |
-
st.
|
73 |
st.write("""
|
74 |
-
|
75 |
-
|
76 |
-
|
77 |
-
4. Feature Engineering: Creating new features
|
78 |
-
5. Data Visualization: Interactive plots and insights
|
79 |
-
6. Practical Applications: Real-world data analysis
|
80 |
""")
|
81 |
-
|
82 |
# Load Data
|
83 |
-
st.header("
|
84 |
-
st.write(""
|
85 |
-
|
86 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
87 |
""")
|
88 |
|
89 |
-
|
90 |
-
|
91 |
-
|
92 |
-
|
93 |
-
|
94 |
-
|
95 |
-
|
96 |
-
|
97 |
-
|
98 |
-
|
99 |
-
|
100 |
-
|
101 |
-
|
102 |
-
title='Missing Values by Column',
|
103 |
-
labels={'x': 'Columns', 'y': 'Number of Missing Values'}
|
104 |
)
|
105 |
-
|
106 |
-
|
107 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
108 |
plot_bgcolor='rgb(30, 30, 30)',
|
109 |
paper_bgcolor='rgb(30, 30, 30)',
|
110 |
font=dict(color='white')
|
111 |
)
|
112 |
-
st.plotly_chart(
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
113 |
|
114 |
-
|
115 |
-
|
116 |
|
117 |
-
|
118 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
119 |
st.write("""
|
120 |
-
Let's
|
121 |
-
1. Filling missing Age values with median
|
122 |
-
2. Filling missing Embarked values with mode
|
123 |
-
3. Creating a new feature for Cabin availability
|
124 |
""")
|
125 |
|
126 |
-
#
|
127 |
-
|
128 |
|
129 |
-
#
|
130 |
-
|
131 |
-
|
132 |
-
df_cleaned['HasCabin'] = df_cleaned['Cabin'].notna().astype(int)
|
133 |
|
134 |
-
#
|
135 |
-
|
|
|
136 |
|
137 |
-
#
|
138 |
-
|
139 |
-
|
140 |
-
|
141 |
-
|
142 |
-
|
|
|
|
|
143 |
|
144 |
-
#
|
145 |
-
|
146 |
-
|
147 |
|
148 |
-
|
149 |
-
|
|
|
|
|
150 |
|
151 |
-
|
152 |
-
|
153 |
-
|
154 |
-
|
155 |
-
|
156 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
157 |
|
158 |
-
#
|
159 |
-
|
160 |
-
|
|
|
|
|
|
|
|
|
161 |
|
162 |
-
# Reference Code Section
|
163 |
-
st.header("Reference Code")
|
164 |
st.write("""
|
165 |
-
|
166 |
-
|
|
|
|
|
167 |
""")
|
168 |
|
169 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
170 |
st.code("""
|
171 |
-
#
|
172 |
-
|
173 |
-
df_cleaned['Age'].fillna(df_cleaned['Age'].median(), inplace=True)
|
174 |
-
df_cleaned['Embarked'].fillna(df_cleaned['Embarked'].mode()[0], inplace=True)
|
175 |
-
df_cleaned['HasCabin'] = df_cleaned['Cabin'].notna().astype(int)
|
176 |
|
177 |
-
#
|
178 |
-
|
179 |
-
fig = px.bar(
|
180 |
-
df.groupby(column)[target].mean().reset_index(),
|
181 |
-
x=column,
|
182 |
-
y=target,
|
183 |
-
title=f'Survival Rate by {column}',
|
184 |
-
labels={target: 'Survival Rate', column: column},
|
185 |
-
color=target,
|
186 |
-
color_continuous_scale='RdBu'
|
187 |
-
)
|
188 |
-
return fig
|
189 |
|
190 |
-
#
|
191 |
-
|
192 |
-
|
193 |
-
df,
|
194 |
-
x=target,
|
195 |
-
y=column,
|
196 |
-
title=f'{column} Distribution by Survival',
|
197 |
-
labels={target: 'Survived', column: column},
|
198 |
-
color=target,
|
199 |
-
color_discrete_sequence=px.colors.qualitative.Set1
|
200 |
-
)
|
201 |
-
return fig
|
202 |
|
203 |
-
|
204 |
-
|
205 |
-
|
206 |
-
|
207 |
-
|
208 |
-
|
|
|
|
|
|
|
|
|
|
|
209 |
)
|
210 |
-
|
|
|
|
|
|
|
|
|
211 |
""", language="python")
|
212 |
|
213 |
-
|
214 |
-
|
215 |
-
|
216 |
-
|
217 |
-
|
218 |
-
|
219 |
-
|
220 |
-
|
221 |
-
|
222 |
-
|
223 |
-
|
224 |
-
|
225 |
-
|
226 |
-
|
227 |
-
|
228 |
-
|
229 |
-
|
230 |
-
],
|
231 |
-
"correct": 1
|
232 |
-
},
|
233 |
-
"q2": {
|
234 |
-
"question": "Why do we create the 'HasCabin' feature?",
|
235 |
-
"options": [
|
236 |
-
"To reduce the number of missing values",
|
237 |
-
"To create a binary indicator for cabin availability",
|
238 |
-
"To make the data more complex",
|
239 |
-
"To remove the Cabin column"
|
240 |
-
],
|
241 |
-
"correct": 1
|
242 |
-
},
|
243 |
-
"q3": {
|
244 |
-
"question": "What does the FamilySize feature represent?",
|
245 |
-
"options": [
|
246 |
-
"Number of siblings only",
|
247 |
-
"Number of parents only",
|
248 |
-
"Total family members (including the passenger)",
|
249 |
-
"Number of children only"
|
250 |
-
],
|
251 |
-
"correct": 2
|
252 |
-
}
|
253 |
-
}
|
254 |
-
|
255 |
-
# Display quiz if not submitted
|
256 |
-
if not st.session_state.quiz_submitted:
|
257 |
-
answers = {}
|
258 |
-
for q_id, q_data in questions.items():
|
259 |
-
st.write(f"**{q_data['question']}**")
|
260 |
-
answers[q_id] = st.radio(
|
261 |
-
"Select your answer:",
|
262 |
-
q_data["options"],
|
263 |
-
key=q_id
|
264 |
-
)
|
265 |
-
|
266 |
-
if st.button("Submit Quiz"):
|
267 |
-
# Calculate score
|
268 |
-
score = sum(1 for q_id, q_data in questions.items()
|
269 |
-
if answers[q_id] == q_data["options"][q_data["correct"]])
|
270 |
-
|
271 |
-
# Show results
|
272 |
-
st.write(f"Your score: {score}/{len(questions)}")
|
273 |
-
|
274 |
-
# Show correct answers
|
275 |
-
st.write("Correct answers:")
|
276 |
-
for q_id, q_data in questions.items():
|
277 |
-
st.write(f"- {q_data['question']}")
|
278 |
-
st.write(f" Correct answer: {q_data['options'][q_data['correct']]}")
|
279 |
-
|
280 |
-
st.session_state.quiz_submitted = True
|
281 |
-
|
282 |
-
# Reset quiz button
|
283 |
-
if st.session_state.quiz_submitted:
|
284 |
-
if st.button("Take Quiz Again"):
|
285 |
-
st.session_state.quiz_submitted = False
|
286 |
-
st.rerun()
|
287 |
-
|
288 |
-
# Feature Engineering
|
289 |
-
st.header("Feature Engineering")
|
290 |
-
st.write("""
|
291 |
-
Let's create some new features:
|
292 |
-
1. Family Size = SibSp + Parch + 1
|
293 |
-
2. Age Groups
|
294 |
-
3. Fare per Person
|
295 |
""")
|
296 |
|
297 |
-
# Create
|
298 |
-
|
299 |
-
|
300 |
-
|
301 |
-
|
302 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
303 |
)
|
304 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
305 |
|
306 |
-
#
|
307 |
-
st.subheader("
|
308 |
|
309 |
-
#
|
310 |
-
|
311 |
-
|
|
|
|
|
|
|
312 |
|
313 |
-
#
|
314 |
-
|
315 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
316 |
|
317 |
-
# Conclusion
|
318 |
-
st.header("Conclusion")
|
319 |
st.write("""
|
320 |
-
|
321 |
-
|
322 |
-
|
323 |
-
-
|
324 |
-
-
|
325 |
-
-
|
|
|
|
|
|
|
|
|
|
|
326 |
""")
|
327 |
|
328 |
# Additional Resources
|
329 |
st.header("Additional Resources")
|
330 |
st.write("""
|
331 |
-
- [
|
332 |
-
- [
|
333 |
-
- [Plotly Documentation](https://plotly.com/python/)
|
334 |
-
- [Data Cleaning Best Practices](https://towardsdatascience.com/data-cleaning-steps-and-process-8ae2d0f5147)
|
335 |
-
- [Colab Notebook](https://colab.research.google.com/drive/1ScwSa8WBcOMCloXsTV5TPFoVrcPHXlW2#scrollTo=VDMRGRbSR0gc)
|
336 |
-
- [Overleaf Project](https://www.overleaf.com/project/68228f4ccb9d18d92c26ba13)
|
337 |
""")
|
|
|
6 |
import plotly.express as px
|
7 |
import plotly.graph_objects as go
|
8 |
from plotly.subplots import make_subplots
|
9 |
+
from sklearn.cluster import KMeans
|
10 |
+
from sklearn.neighbors import KNeighborsClassifier
|
11 |
+
from sklearn.preprocessing import StandardScaler
|
12 |
+
from sklearn.metrics import silhouette_score
|
13 |
+
from statsmodels.datasets import get_rdataset
|
14 |
+
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
|
15 |
|
16 |
# Set up the style for all plots
|
17 |
plt.style.use('default')
|
18 |
sns.set_theme(style="whitegrid", palette="husl")
|
19 |
|
20 |
+
def load_arrests_data():
|
21 |
+
"""Load and return the US Arrests dataset"""
|
22 |
+
USArrests = get_rdataset('USArrests').data
|
23 |
+
return USArrests
|
|
|
24 |
|
25 |
def create_categorical_plot(df, column, target='Survived'):
|
26 |
"""Create an interactive plot for categorical variables"""
|
|
|
59 |
return fig
|
60 |
|
61 |
def show():
|
62 |
+
st.title("Week 7: Clustering Lab - State Crime Pattern Analysis")
|
63 |
|
64 |
+
# Code Example: Loading and Basic Data Exploration
|
65 |
+
with st.expander("Code Example: Loading and Exploring Data"):
|
66 |
+
st.code("""
|
67 |
+
# Load the data
|
68 |
+
from statsmodels.datasets import get_rdataset
|
69 |
+
USArrests = get_rdataset('USArrests').data
|
70 |
+
|
71 |
+
# Basic data exploration
|
72 |
+
print("Dataset shape:", USArrests.shape)
|
73 |
+
print("\\nVariables:", USArrests.columns.tolist())
|
74 |
+
print("\\nFirst 5 states:")
|
75 |
+
print(USArrests.head())
|
76 |
+
|
77 |
+
# Basic statistics
|
78 |
+
print("\\nData Summary:")
|
79 |
+
print(USArrests.describe())
|
80 |
+
""", language="python")
|
81 |
+
|
82 |
+
# Introduction Section with Learning Objectives
|
83 |
+
st.header("Learning Objectives")
|
84 |
+
st.markdown("""
|
85 |
+
In this week, you'll master:
|
86 |
+
1. **Unsupervised Learning**: Discover hidden patterns in crime data without predefined categories
|
87 |
+
2. **K-Means Clustering**: Learn to divide states into distinct safety profiles
|
88 |
+
3. **Hierarchical Clustering**: Create a "family tree" of state crime patterns
|
89 |
+
4. **Data Preprocessing**: Understand why scaling is crucial for fair comparisons
|
90 |
""")
|
91 |
|
92 |
+
# Interactive Overview
|
93 |
+
st.header("Lab Overview")
|
94 |
st.write("""
|
95 |
+
Welcome to your hands-on clustering lab! You'll be working as a policy analyst for the Department of Justice,
|
96 |
+
analyzing crime patterns across US states. Your mission: discover hidden safety profiles that could inform
|
97 |
+
federal resource allocation and crime prevention strategies.
|
|
|
|
|
|
|
98 |
""")
|
99 |
+
|
100 |
# Load Data
|
101 |
+
st.header("Exercise 1: Data Detective Work")
|
102 |
+
st.write("Let's start by understanding our dataset - the US Arrests data.")
|
103 |
+
|
104 |
+
df = load_arrests_data()
|
105 |
+
|
106 |
+
# Code Example: Data Visualization
|
107 |
+
with st.expander("Code Example: Creating Visualizations"):
|
108 |
+
st.code("""
|
109 |
+
# Create correlation heatmap
|
110 |
+
import plotly.express as px
|
111 |
+
fig = px.imshow(df.corr(),
|
112 |
+
labels=dict(color="Correlation"),
|
113 |
+
color_continuous_scale="RdBu")
|
114 |
+
fig.show()
|
115 |
+
|
116 |
+
# Create box plots
|
117 |
+
fig = px.box(df, title="Data Distribution")
|
118 |
+
fig.show()
|
119 |
+
""", language="python")
# Interactive Data Exploration
col1, col2 = st.columns(2)

with col1:
    st.subheader("Dataset Overview")
    st.write(f"Number of states: {len(df)}")
    st.write(f"Number of variables: {len(df.columns)}")
    st.write("Variables:", df.columns.tolist())

    # Interactive data summary
    st.subheader("Data Summary")
    summary = df.describe()
    st.dataframe(summary)

with col2:
    st.subheader("First 5 States")
    st.dataframe(df.head())

    # Interactive correlation heatmap
    st.subheader("Correlation Heatmap")
    fig = px.imshow(df.corr(),
                    labels=dict(color="Correlation"),
                    color_continuous_scale="RdBu")
    st.plotly_chart(fig)

# Exercise 2: Scaling Challenge
st.header("Exercise 2: The Scaling Challenge")

# Code Example: Data Scaling
with st.expander("Code Example: Scaling Data"):
    st.code("""
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Create and fit the scaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

# Convert back to DataFrame
df_scaled = pd.DataFrame(df_scaled,
                         columns=df.columns,
                         index=df.index)

# Compare original vs scaled data
print("Original data ranges:")
print(df.describe())
print("\\nScaled data ranges:")
print(df_scaled.describe())
""", language="python")
# Explanation of scaling
st.markdown("""
### Why Do We Need Scaling?

In our crime data, we have variables measured on very different scales:
- Murder rates: typically 0-20 per 100,000
- Assault rates: typically 50-350 per 100,000
- Urban population: 0-100 percentage
- Rape rates: typically 0-50 per 100,000

Without scaling, variables with larger numbers (like Assault) would dominate our analysis,
making smaller-scale variables (like Murder) less influential. This would be like comparing
dollars to cents - the cents would seem insignificant even if they were important!
""")

# Show original data ranges
st.subheader("Original Data Ranges")
col1, col2 = st.columns(2)

with col1:
    # Create a bar chart of variances
    fig_var = px.bar(
        x=df.columns,
        y=df.var(),
        title="Variance of Each Variable (Before Scaling)",
        labels={'x': 'Crime Variables', 'y': 'Variance'},
        color=df.var(),
        color_continuous_scale='Viridis'
    )
    st.plotly_chart(fig_var)

    st.write("""
Notice how Assault has a much larger variance (6,945) compared to Murder (19).
This means Assault would dominate our clustering if we didn't scale the data!
""")

with col2:
    # Create box plots of original data
    fig_box = px.box(df, title="Original Data Distribution")
    fig_box.update_layout(
        xaxis_title="Crime Variables",
        yaxis_title="Rate per 100,000"
    )
    st.plotly_chart(fig_box)

# Explain standardization
st.markdown("""
### What is Standardization?

Standardization (also called Z-score normalization) transforms our data so that:
1. Each variable has a mean of 0
2. Each variable has a standard deviation of 1

The formula is: z = (x - μ) / σ, where:
- x is the original value
- μ is the mean of the variable
- σ is the standard deviation of the variable
""")
# Scale the data
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
df_scaled = pd.DataFrame(df_scaled, columns=df.columns, index=df.index)

# Show scaled data
st.subheader("After Scaling")

# Create box plots of scaled data
fig_scaled = px.box(df_scaled, title="Scaled Data Distribution")
fig_scaled.update_layout(
    xaxis_title="Crime Variables",
    yaxis_title="Standardized Values"
)
st.plotly_chart(fig_scaled)

st.write("""
After scaling, all variables are on the same scale:
- Mean = 0
- Standard Deviation = 1
- Values typically range from -3 to +3
""")

# Show before/after comparison for a few states
st.write("### Before vs After Scaling (Sample States)")
comparison_df = pd.DataFrame({
    'State': df.index[:5],
    'Original Murder': df['Murder'][:5],
    'Scaled Murder': df_scaled['Murder'][:5],
    'Original Assault': df['Assault'][:5],
    'Scaled Assault': df_scaled['Assault'][:5]
})
st.dataframe(comparison_df)

st.write("""
Notice how the relative differences between states are preserved,
but now all variables contribute equally to our analysis!
""")

# Why scaling matters for clustering
st.markdown("""
### Why Scaling Matters for Clustering

In clustering, we measure distances between data points. Without scaling:
- States might be grouped together just because they have similar assault rates
- Important differences in murder rates might be ignored

With scaling:
- All variables contribute equally to the distance calculations
- We can find true patterns in the data, not just patterns in the largest numbers
""")
# Exercise 3: Finding Optimal Clusters
st.header("Exercise 3: Finding the Right Number of Groups")

# Code Example: Elbow Method
with st.expander("Code Example: Finding Optimal K"):
    st.code("""
# Calculate inertias for different K values
inertias = []
K_values = range(1, 11)

for k in K_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)
    kmeans.fit(df_scaled)
    inertias.append(kmeans.inertia_)

# Create elbow plot
import plotly.graph_objects as go

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=list(K_values),
    y=inertias,
    mode='lines+markers',
    name='Inertia'
))
fig.update_layout(
    title='Finding the Optimal Number of Clusters',
    xaxis_title='Number of Clusters (K)',
    yaxis_title='Within-Cluster Sum of Squares'
)
fig.show()
""", language="python")

st.markdown("""
### The Elbow Method Explained

The elbow method helps us find the optimal number of clusters (K) by looking at how the "within-cluster sum of squares"
(WCSS) changes as we increase the number of clusters. Think of it like this:

- **What is WCSS?** It's a measure of how spread out the points are within each cluster
- **Lower WCSS** means points are closer to their cluster center (better clustering)
- **Higher WCSS** means points are more spread out from their cluster center

As we increase K:
1. WCSS always decreases (more clusters = tighter groups)
2. The rate of decrease slows down
3. We look for the "elbow" - where adding more clusters doesn't help much anymore
""")
# Calculate inertias for different K values
inertias = []
K_values = range(1, 11)

for k in K_values:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)
    kmeans.fit(df_scaled)
    inertias.append(kmeans.inertia_)

# Create interactive elbow plot
fig_elbow = go.Figure()
fig_elbow.add_trace(go.Scatter(
    x=list(K_values),
    y=inertias,
    mode='lines+markers',
    name='Inertia'
))
fig_elbow.update_layout(
    title='Finding the Optimal Number of State Crime Profiles',
    xaxis_title='Number of Clusters (K)',
    yaxis_title='Within-Cluster Sum of Squares',
    plot_bgcolor='rgb(30, 30, 30)',
    paper_bgcolor='rgb(30, 30, 30)',
    font=dict(color='white')
)
st.plotly_chart(fig_elbow)

# Interpretation guide
st.markdown("""
### How to Interpret the Elbow Plot

Look at the plot above and ask yourself:
1. **Where is the "elbow"?**
   - The point where the line starts to level off
   - Adding more clusters doesn't give much improvement
   - In our case, it's around K=4

2. **What do the numbers mean?**
   - K=1: All states in one group (not useful)
   - K=2: Basic high/low crime split
   - K=3: More nuanced grouping
   - K=4: Our "elbow" - good balance of detail and simplicity
   - K>4: Diminishing returns - more complexity without much benefit

3. **Why not just use more clusters?**
   - More clusters = more complex to interpret
   - Small clusters might not be meaningful
   - The goal is to find the simplest model that captures the main patterns
""")

# Show the actual values
st.write("### WCSS Values for Each K")
wcss_df = pd.DataFrame({
    'Number of Clusters (K)': K_values,
    'Within-Cluster Sum of Squares': inertias,
    'Improvement from Previous K': [0] + [inertias[i-1] - inertias[i] for i in range(1, len(inertias))]
})
st.dataframe(wcss_df)

st.markdown("""
### Making the Decision

Based on our elbow plot and the numbers above:
1. The biggest improvements happen from K=1 to K=4
2. After K=4, the improvements get much smaller
3. K=4 gives us a good balance of:
   - Capturing meaningful patterns
   - Keeping the model simple enough to interpret
   - Having enough states in each cluster to be meaningful

This is why we'll use K=4 for our clustering analysis!
""")
# Exercise 4: K-Means Clustering
st.header("Exercise 4: K-Means State Profiling")

# Code Example: K-Means Clustering
with st.expander("Code Example: K-Means Implementation"):
    st.code("""
# Perform K-means clustering
from sklearn.cluster import KMeans

# Create and fit the model
kmeans = KMeans(
    n_clusters=4,     # Number of clusters
    random_state=42,  # For reproducibility
    n_init=20         # Number of runs with different initial centroids
)
cluster_labels = kmeans.fit_predict(df_scaled)

# Add cluster labels to original data
df_clustered = df.copy()
df_clustered['Cluster'] = cluster_labels

# Visualize the clusters
import plotly.express as px

fig = px.scatter(df_clustered,
                 x='Murder',
                 y='Assault',
                 color='Cluster',
                 hover_data=['UrbanPop', 'Rape'],
                 title='State Crime Profiles')
fig.show()

# Show cluster centers (in standardized units, since we fit on scaled data)
centers_df = pd.DataFrame(
    kmeans.cluster_centers_,
    columns=df.columns
)
print("Cluster Centers:")
print(centers_df)
""", language="python")

st.markdown("""
### What is K-Means Clustering?

K-means is an unsupervised learning algorithm that groups similar data points together. Think of it like organizing
students into study groups based on their interests:

1. **Initialization**:
   - We randomly place K "centers" (centroids) in our data space
   - Each center represents the "average" of its cluster
   - In our case, each center represents a typical crime profile

2. **Assignment**:
   - Each state is assigned to its nearest center
   - "Nearest" is measured by Euclidean distance
   - States with similar crime patterns end up in the same cluster

3. **Update**:
   - Centers move to the average position of their assigned states
   - The assignment and update steps repeat until the centers stop moving
   - The result is a locally optimal grouping, which is why we rerun the algorithm from several random starts
""")
# Visualize the process
st.subheader("K-Means in Action")
st.write("""
Let's see how K-means works with our state crime data. We'll use K=4 clusters to find distinct crime profiles.
""")

# Let user choose number of clusters
k = st.slider("Choose number of clusters (K)", 2, 6, 4)

# Perform K-means clustering
kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)
cluster_labels = kmeans.fit_predict(df_scaled)

# Add cluster labels to original data
df_clustered = df.copy()
df_clustered['Cluster'] = cluster_labels

# Create interactive scatter plot
fig = px.scatter(df_clustered,
                 x='Murder',
                 y='Assault',
                 color='Cluster',
                 hover_data=['UrbanPop', 'Rape'],
                 title='State Crime Profiles')
st.plotly_chart(fig)

# Explain hyperparameters
st.markdown("""
### K-Means Hyperparameters Explained

1. **n_clusters (K)**
   - The number of groups we want to create
   - We chose K=4 based on the elbow method
   - Each cluster represents a distinct crime profile

2. **random_state**
   - Controls the random initialization of centroids
   - Setting it to 42 ensures reproducible results
   - Different values might give slightly different clusters

3. **n_init**
   - Number of times to run the algorithm with different initial centroids
   - We use 20 to find the best possible clustering
   - Higher values give more reliable results but take longer

4. **max_iter**
   - Maximum number of iterations for each run
   - The default is 300, which is usually enough
   - The algorithm stops earlier if it converges

5. **algorithm**
   - 'lloyd': the standard K-means algorithm (called 'full' in older scikit-learn versions)
   - 'elkan': a variant that can be faster for well-separated clusters
""")
# Show cluster centers
st.subheader("Cluster Centers (Typical Crime Profiles)")
centers_df = pd.DataFrame(
    kmeans.cluster_centers_,
    columns=df.columns
)
st.dataframe(centers_df)

st.write("""
Each row represents the "average" crime profile for that cluster, in standardized units
(positive values are above the national average, negative values below). For example:
- High values in Murder and Assault indicate a high-crime cluster
- High UrbanPop with low crime rates might indicate urban safety
- Low values across all metrics might indicate rural safety
""")

# Display cluster analysis
st.subheader("State Crime Profiles Analysis")

for cluster_num in range(k):
    cluster_states = df_clustered[df_clustered['Cluster'] == cluster_num]
    st.write(f"\n**CLUSTER {cluster_num}: {len(cluster_states)} states**")
    st.write("States:", ", ".join(cluster_states.index.tolist()))
    st.write("Average characteristics:")
    avg_profile = cluster_states[['Murder', 'Assault', 'UrbanPop', 'Rape']].mean()
    st.write(avg_profile)

# Explain the results
st.markdown("""
### Interpreting the Results

Each cluster represents a distinct crime profile:
1. **Cluster Characteristics**
   - Look at the average values for each crime type
   - Compare urban population percentages
   - Identify the defining features of each cluster

2. **State Groupings**
   - States in the same cluster have similar crime patterns
   - Geographic proximity doesn't always mean similar profiles
   - Some states might surprise you with their cluster membership

3. **Policy Implications**
   - Clusters help identify states with similar challenges
   - Can guide resource allocation and policy development
   - Enables targeted interventions based on crime profiles
""")
# Exercise 5: Hierarchical Clustering
st.header("Exercise 5: Hierarchical Clustering Exploration")

# Code Example: Hierarchical Clustering
with st.expander("Code Example: Hierarchical Clustering"):
    st.code("""
# Create hierarchical clustering
from scipy.cluster.hierarchy import linkage, dendrogram

# Create linkage matrix
linkage_matrix = linkage(df_scaled, method='complete')

# Compute dendrogram coordinates only, then draw them with plotly
import plotly.graph_objects as go
dendro = dendrogram(linkage_matrix, labels=df.index.tolist(), no_plot=True)

# icoord/dcoord hold one set of coordinates per U-shaped link,
# so each link gets its own line trace
fig = go.Figure()
for xs, ys in zip(dendro['icoord'], dendro['dcoord']):
    fig.add_trace(go.Scatter(
        x=xs,
        y=ys,
        mode='lines',
        line=dict(color='white'),
        showlegend=False
    ))
fig.update_layout(
    title='State Crime Pattern Family Tree',
    xaxis_title='States',
    yaxis_title='Distance Between Groups'
)
fig.show()

# Cut the tree into k clusters (labels shifted to start at 0, like K-means)
from scipy.cluster.hierarchy import fcluster
k = 4
hierarchical_labels = fcluster(linkage_matrix, k, criterion='maxclust') - 1
""", language="python")

st.markdown("""
### What is Hierarchical Clustering?

Hierarchical clustering creates a tree-like structure (dendrogram) that shows how data points are related at different levels.
Think of it like building a family tree for states based on their crime patterns:

1. **Bottom-Up Approach (Agglomerative)**:
   - Start with each state as its own cluster
   - Find the two closest clusters and merge them
   - Continue merging until all states are in one cluster
   - Creates a complete hierarchy of relationships

2. **Distance Measurement** (see the linkage comparison sketch below):
   - Complete Linkage: Uses the maximum distance between states in two clusters
   - Average Linkage: Uses the average distance between states in two clusters
   - Single Linkage: Uses the minimum distance between states in two clusters
   - We use complete linkage for more distinct, compact clusters
""")
# Create hierarchical clustering
linkage_matrix = linkage(df_scaled, method='complete')

# Create interactive dendrogram
dendro = dendrogram(linkage_matrix, labels=df.index.tolist(), no_plot=True)

# icoord/dcoord hold one set of coordinates per dendrogram link,
# so each link is drawn as its own line trace
fig_dendro = go.Figure()
for xs, ys in zip(dendro['icoord'], dendro['dcoord']):
    fig_dendro.add_trace(go.Scatter(
        x=xs,
        y=ys,
        mode='lines',
        line=dict(color='white'),
        showlegend=False
    ))

fig_dendro.update_layout(
    title='State Crime Pattern Family Tree',
    xaxis_title='States',
    yaxis_title='Distance Between Groups',
    plot_bgcolor='rgb(30, 30, 30)',
    paper_bgcolor='rgb(30, 30, 30)',
    font=dict(color='white')
)
st.plotly_chart(fig_dendro)

# Explain how to read the dendrogram
st.markdown("""
### How to Read the Dendrogram

1. **Height of Connections**:
   - Higher connections = more different groups
   - Lower connections = more similar groups
   - The height shows how different two groups are

2. **Cutting the Tree**:
   - Draw a horizontal line to create clusters
   - Where you cut determines the number of clusters
   - We'll cut at a height that gives us the same number of clusters (k) as K-means
""")
# Cut the tree to get clusters
hierarchical_labels = fcluster(linkage_matrix, k, criterion='maxclust') - 1

# Compare K-means and Hierarchical Clustering
st.header("Comparing K-Means and Hierarchical Clustering")

# Create side-by-side comparison
col1, col2 = st.columns(2)

with col1:
    st.subheader("K-Means Clustering")
    fig_kmeans = px.scatter(df_clustered,
                            x='Murder',
                            y='Assault',
                            color='Cluster',
                            title=f'K-Means Clustering (K={k})',
                            hover_data=['UrbanPop', 'Rape'])
    st.plotly_chart(fig_kmeans)

    st.markdown("""
**K-Means Characteristics**:
- Requires specifying the number of clusters upfront
- Creates clusters of similar size
- Works well with spherical clusters
- Faster for large datasets
- Can be sensitive to outliers
""")

with col2:
    st.subheader("Hierarchical Clustering")
    df_hierarchical = df.copy()
    df_hierarchical['Cluster'] = hierarchical_labels
    fig_hierarchical = px.scatter(df_hierarchical,
                                  x='Murder',
                                  y='Assault',
                                  color='Cluster',
                                  title=f'Hierarchical Clustering ({k} clusters)',
                                  hover_data=['UrbanPop', 'Rape'])
    st.plotly_chart(fig_hierarchical)

    st.markdown("""
**Hierarchical Clustering Characteristics**:
- Creates a complete hierarchy of clusters
- Can handle non-spherical clusters
- More flexible in cluster shapes
- Slower for large datasets
- Less sensitive to outliers
""")
# Show agreement between methods
st.subheader("Comparing the Results")

# Create comparison dataframe
comparison_df = pd.DataFrame({
    'State': df.index,
    'K-Means Cluster': cluster_labels,
    'Hierarchical Cluster': hierarchical_labels
})

# Count agreements
agreements = sum(comparison_df['K-Means Cluster'] == comparison_df['Hierarchical Cluster'])
agreement_percentage = (agreements / len(comparison_df)) * 100

st.write(f"Methods agreed on {agreements} out of {len(comparison_df)} states ({agreement_percentage:.1f}%)")

# Show states where methods disagree
disagreements = comparison_df[comparison_df['K-Means Cluster'] != comparison_df['Hierarchical Cluster']]
if not disagreements.empty:
    st.write("States where the methods disagreed:")
    st.dataframe(disagreements)
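# One caveat on the raw agreement count above: cluster numbers are arbitrary,
# so the two methods can find the same groups under different labels. The
# adjusted Rand index compares the groupings themselves; a sketch of that check:
with st.expander("Code Example: Label-Independent Agreement"):
    st.code("""
from sklearn.metrics import adjusted_rand_score

# 1.0 = identical groupings; values near 0 = no better than chance
ari = adjusted_rand_score(cluster_labels, hierarchical_labels)
print(f"Adjusted Rand index: {ari:.2f}")
""", language="python")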
st.markdown("""
### When to Use Each Method

1. **Use K-Means when**:
   - You know the number of clusters
   - Your data has spherical clusters
   - You need fast computation
   - You want clusters of similar size

2. **Use Hierarchical Clustering when**:
   - You don't know the number of clusters
   - You want to explore the hierarchy
   - Your clusters might be non-spherical
   - You need to handle outliers carefully

In our case, both methods found similar patterns, suggesting our clusters are robust!
""")

# Exercise 6: Policy Brief
st.header("Exercise 6: Policy Brief Creation")

# Code Example: Creating Final Visualizations
with st.expander("Code Example: Creating Policy Brief Visualizations"):
    st.code("""
# Create a comprehensive visualization
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create subplots
fig = make_subplots(rows=2, cols=2)

# Plot 1: Murder vs Assault by cluster
for i in range(k):
    cluster_data = df_clustered[df_clustered['Cluster'] == i]
    fig.add_trace(
        go.Scatter(
            x=cluster_data['Murder'],
            y=cluster_data['Assault'],
            mode='markers',
            name=f'Cluster {i}'
        ),
        row=1, col=1
    )

# Plot 2: Urban Population vs Rape by cluster
for i in range(k):
    cluster_data = df_clustered[df_clustered['Cluster'] == i]
    fig.add_trace(
        go.Scatter(
            x=cluster_data['UrbanPop'],
            y=cluster_data['Rape'],
            mode='markers',
            name=f'Cluster {i}'
        ),
        row=1, col=2
    )

# Update layout
fig.update_layout(
    title_text="State Crime Profile Analysis",
    showlegend=True
)
fig.show()
""", language="python")
st.write("""
Based on our analysis, here's a summary of findings and recommendations:

**Key Findings:**
- We identified distinct crime profiles among US states
- Each cluster represents a unique pattern of crime rates and urban population
- Some states show surprising similarities despite geographic distance

**Policy Recommendations:**
1. High-Priority States: Focus on states in high-crime clusters
2. Resource Allocation: Distribute federal crime prevention funds based on cluster profiles
3. Best Practice Sharing: Encourage states within the same cluster to share successful strategies
""")

# Additional Resources
st.header("Additional Resources")
st.write("""
- [Scikit-learn Clustering Documentation](https://scikit-learn.org/stable/modules/clustering.html)
- [KNN Documentation](https://scikit-learn.org/stable/modules/neighbors.html)
""")
|