raymondEDS committed on
Commit
46e47b6
·
1 Parent(s): 63732ac
Reference files/week 7/W7_Lab_KNN_clustering.ipynb ADDED
@@ -0,0 +1,481 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "b1c6a137",
6
+ "metadata": {
7
+ "id": "b1c6a137"
8
+ },
9
+ "source": [
10
+ "# Clustering Lab: State Crime Pattern Analysis\n",
11
+ "\n",
12
+ "## Lab Overview\n",
13
+ "\n",
14
+ "Welcome to your hands-on clustering lab! You'll be working as a policy analyst for the Department of Justice, analyzing crime patterns across US states. Your mission: discover hidden safety profiles that could inform federal resource allocation and crime prevention strategies.\n",
15
+ "\n",
16
+ "**Your Deliverable**: A policy brief with visualizations and recommendations based on your clustering analysis.\n",
17
+ "\n",
18
+ "---\n",
19
+ "\n",
20
+ "## Exercise 1: Data Detective Work\n",
21
+ "**Time: 15 minutes | Product: Data Summary Report**\n",
22
+ "\n",
23
+ "### Your Task\n",
24
+ "Before any analysis, you need to understand what you're working with. Create a brief data summary that a non-technical policy maker could understand.\n"
25
+ ]
26
+ },
27
+ {
28
+ "cell_type": "code",
29
+ "source": [
30
31
+ "import numpy as np\n",
32
+ "import pandas as pd\n",
33
+ "import matplotlib.pyplot as plt\n",
34
+ "from statsmodels.datasets import get_rdataset\n",
35
+ "from sklearn.preprocessing import StandardScaler\n",
36
+ "from sklearn.cluster import KMeans, AgglomerativeClustering\n",
37
+ "\n",
38
+ "# Load the data\n",
39
+ "USArrests = get_rdataset('USArrests').data\n",
40
+ "print(\"Dataset shape:\", USArrests.shape)\n",
41
+ "print(\"\\nVariables:\", USArrests.columns.tolist())\n",
42
+ "print(\"\\nFirst 5 states:\")\n",
43
+ "print(USArrests.head())"
45
+ ],
46
+ "metadata": {
47
+ "colab": {
48
+ "base_uri": "https://localhost:8080/",
49
+ "height": 106
50
+ },
51
+ "id": "mqRVE1hlXK9x",
52
+ "outputId": "5a1bbd64-15cd-4e1c-9344-64a901d8a396"
53
+ },
54
+ "id": "mqRVE1hlXK9x",
55
+ "execution_count": null,
56
+ "outputs": []
66
+ },
67
+ {
68
+ "cell_type": "markdown",
69
+ "source": [
70
+ "## Your Investigation\n",
71
+ "Complete this data summary table:\n",
72
+ "\n",
73
+ "| Variable | What it measures | Average Value | Highest State | Lowest State |\n",
74
+ "|----------|------------------|---------------|---------------|--------------|\n",
75
+ "| Murder | Rate per 100,000 people | ??? | ??? | ??? |\n",
76
+ "| Assault | Rate per 100,000 people | ??? | ??? | ??? |\n",
77
+ "| UrbanPop | Percentage living in cities | ??? | ??? | ??? |\n",
78
+ "| Rape | Rate per 100,000 people | ??? | ??? | ??? |\n",
79
+ "\n",
80
+ "**Deliverable**: Write 2-3 sentences describing the biggest surprises in this data. Which states are not what you expected?\n",
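+ "\n",
+ "A minimal sketch for filling in the table (assuming `USArrests` is loaded as in the cell above):\n",
+ "\n",
+ "```python\n",
+ "for col in USArrests.columns:\n",
+ "    # mean, plus the states with the highest and lowest values for each variable\n",
+ "    print(f\"{col}: avg={USArrests[col].mean():.1f}, highest={USArrests[col].idxmax()}, lowest={USArrests[col].idxmin()}\")\n",
+ "```\n",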
81
+ "\n",
82
+ "---\n",
83
+ "\n",
84
+ "## Exercise 2: The Scaling Challenge\n",
85
+ "**Time: 10 minutes | Product: Before/After Comparison**\n",
86
+ "\n",
87
+ "### Your Task\n",
88
+ "Demonstrate why scaling is critical for clustering crime data.\n",
89
+ "\n"
90
+ ],
91
+ "metadata": {
92
+ "id": "7qkDKTe4XLtG"
93
+ },
94
+ "id": "7qkDKTe4XLtG"
95
+ },
96
+ {
97
+ "cell_type": "code",
98
+ "source": [
99
100
+ "# Check the scale differences\n",
101
+ "print(\"Original data ranges:\")\n",
102
+ "print(USArrests.describe())\n",
103
+ "\n",
104
+ "print(\"\\nVariances (how spread out the data is):\")\n",
105
+ "print(USArrests.var())\n",
106
+ "\n",
107
+ "# Scale the data\n",
108
+ "scaler = StandardScaler()\n",
109
+ "USArrests_scaled = scaler.fit_transform(USArrests)\n",
110
+ "scaled_df = pd.DataFrame(USArrests_scaled,\n",
111
+ " columns=USArrests.columns,\n",
112
+ " index=USArrests.index)\n",
113
+ "\n",
114
+ "print(\"\\nAfter scaling - all variables now have similar ranges:\")\n",
115
+ "print(scaled_df.describe())"
117
+ ],
118
+ "metadata": {
119
+ "id": "zQ3VowYNXLeQ"
120
+ },
121
+ "id": "zQ3VowYNXLeQ",
122
+ "execution_count": null,
123
+ "outputs": []
124
+ },
125
+ {
126
+ "cell_type": "markdown",
127
+ "source": [
128
+ "### Your Analysis\n",
129
+ "1. **Before scaling**: Which variable would dominate the clustering? Why?\n",
130
+ "2. **After scaling**: Explain in simple terms what StandardScaler did to the data.\n",
131
+ "\n",
132
+ "**Deliverable**: One paragraph explaining why a policy analyst should care about data scaling.\n",
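+ "\n",
+ "A minimal sketch of the effect (assuming `USArrests` and `scaled_df` from the cell above; the state pair is just an illustration):\n",
+ "\n",
+ "```python\n",
+ "from scipy.spatial.distance import euclidean\n",
+ "\n",
+ "a, b = 'Florida', 'Vermont'\n",
+ "print('Distance on raw data:   ', euclidean(USArrests.loc[a], USArrests.loc[b]))  # dominated by Assault\n",
+ "print('Distance on scaled data:', euclidean(scaled_df.loc[a], scaled_df.loc[b]))\n",
+ "```\n",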
133
+ "\n",
134
+ "---\n",
135
+ "\n",
136
+ "## Exercise 3: Finding the Right Number of Groups\n",
137
+ "**Time: 20 minutes | Product: Recommendation with Visual Evidence**\n",
138
+ "\n",
139
+ "### Your Task\n",
140
+ "Use the elbow method to determine how many distinct crime profiles exist among US states.\n"
141
+ ],
142
+ "metadata": {
143
+ "id": "FnOT700SXLPh"
144
+ },
145
+ "id": "FnOT700SXLPh"
146
+ },
147
+ {
148
+ "cell_type": "code",
149
+ "source": [
150
151
+ "# Test different numbers of clusters\n",
152
+ "inertias = []\n",
153
+ "K_values = range(1, 11)\n",
154
+ "\n",
155
+ "for k in K_values:\n",
156
+ " kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)\n",
157
+ " kmeans.fit(USArrests_scaled)\n",
158
+ " inertias.append(kmeans.inertia_)\n",
159
+ "\n",
160
+ "# Create the elbow plot\n",
161
+ "plt.figure(figsize=(10, 6))\n",
162
+ "plt.plot(K_values, inertias, 'bo-', linewidth=2, markersize=8)\n",
163
+ "plt.xlabel('Number of Clusters (K)')\n",
164
+ "plt.ylabel('Within-Cluster Sum of Squares')\n",
165
+ "plt.title('Finding the Optimal Number of State Crime Profiles')\n",
166
+ "plt.grid(True, alpha=0.3)\n",
167
+ "plt.show()\n",
168
+ "\n",
169
+ "# Print the inertia values\n",
170
+ "for k, inertia in zip(K_values, inertias):\n",
171
+ " print(f\"K={k}: Inertia = {inertia:.1f}\")"
173
+ ],
174
+ "metadata": {
175
+ "id": "zOQrS9lmXpTF"
176
+ },
177
+ "id": "zOQrS9lmXpTF",
178
+ "execution_count": null,
179
+ "outputs": []
180
+ },
181
+ {
182
+ "cell_type": "markdown",
183
+ "id": "2e388ef2",
184
+ "metadata": {
185
+ "id": "2e388ef2"
186
+ },
187
+ "source": [
188
+ "### Your Decision\n",
189
+ "Based on your elbow plot:\n",
190
+ "1. **What value of K do you recommend?** (Look for the \"elbow\" where the line starts to flatten)\n",
191
+ "2. **What does this mean in policy terms?** (How many distinct types of state crime profiles exist?)\n",
192
+ "\n",
193
+ "**Deliverable**: A one-paragraph recommendation with your chosen K value and reasoning.\n",
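+ "\n",
+ "A minimal sketch for reading the elbow numerically (assuming the `inertias` list from the cell above):\n",
+ "\n",
+ "```python\n",
+ "for k in range(1, len(inertias)):\n",
+ "    drop = (inertias[k-1] - inertias[k]) / inertias[k-1] * 100\n",
+ "    print(f\"K={k} -> K={k+1}: inertia falls by {drop:.1f}%\")\n",
+ "```\n",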
194
+ "\n",
195
+ "---\n",
196
+ "\n",
197
+ "## Exercise 4: K-Means State Profiling\n",
198
+ "**Time: 25 minutes | Product: State Crime Profile Report**\n",
199
+ "\n",
200
+ "### Your Task\n",
201
+ "Create distinct crime profiles and identify which states belong to each category.\n",
202
+ "\n",
203
+ "\n",
204
+ "\n",
205
+ "\n"
206
+ ]
207
+ },
208
+ {
209
+ "cell_type": "code",
210
+ "source": [
211
212
+ "# Use your chosen K value from Exercise 3\n",
213
+ "optimal_k = 4 # Replace with your chosen value\n",
214
+ "\n",
215
+ "# Perform K-means clustering\n",
216
+ "kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=20)\n",
217
+ "cluster_labels = kmeans.fit_predict(USArrests_scaled)\n",
218
+ "\n",
219
+ "# Add cluster labels to original data\n",
220
+ "USArrests_clustered = USArrests.copy()\n",
221
+ "USArrests_clustered['Cluster'] = cluster_labels\n",
222
+ "\n",
223
+ "# Analyze each cluster\n",
224
+ "print(\"State Crime Profiles Analysis\")\n",
225
+ "print(\"=\" * 50)\n",
226
+ "\n",
227
+ "for cluster_num in range(optimal_k):\n",
228
+ " cluster_states = USArrests_clustered[USArrests_clustered['Cluster'] == cluster_num]\n",
229
+ " print(f\"\\nCLUSTER {cluster_num}: {len(cluster_states)} states\")\n",
230
+ " print(\"States:\", \", \".join(cluster_states.index.tolist()))\n",
231
+ " print(\"Average characteristics:\")\n",
232
+ " avg_profile = cluster_states[['Murder', 'Assault', 'UrbanPop', 'Rape']].mean()\n",
233
+ " for var, value in avg_profile.items():\n",
234
+ " print(f\" {var}: {value:.1f}\")"
236
+ ],
237
+ "metadata": {
238
+ "id": "_5b0nE6KXv1P"
239
+ },
240
+ "id": "_5b0nE6KXv1P",
241
+ "execution_count": null,
242
+ "outputs": []
243
+ },
244
+ {
245
+ "cell_type": "markdown",
246
+ "source": [
247
+ "### Your Analysis\n",
248
+ "For each cluster, create a profile:\n",
249
+ "\n",
250
+ "**Cluster 0: \"[Your Creative Name]\"**\n",
251
+ "- **States**: [List them]\n",
252
+ "- **Characteristics**: [Describe the pattern]\n",
253
+ "- **Policy Insight**: [What should federal agencies know about these states?]\n",
254
+ "\n",
255
+ "**Deliverable**: A table summarizing each cluster with creative names and policy recommendations.\n",
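+ "\n",
+ "A minimal sketch for building that summary table (assuming `USArrests_clustered` from the cell above):\n",
+ "\n",
+ "```python\n",
+ "profile = USArrests_clustered.groupby('Cluster')[['Murder', 'Assault', 'UrbanPop', 'Rape']].mean().round(1)\n",
+ "profile['NumStates'] = USArrests_clustered['Cluster'].value_counts().sort_index()\n",
+ "print(profile)\n",
+ "```\n",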
256
+ "\n",
257
+ "---\n",
258
+ "\n",
259
+ "## Exercise 5: Hierarchical Clustering Exploration\n",
260
+ "**Time: 25 minutes | Product: Family Tree Interpretation**\n",
261
+ "\n",
262
+ "### Your Task\n",
263
+ "Create a dendrogram to understand how states naturally group together.\n"
264
+ ],
265
+ "metadata": {
266
+ "id": "J1WVGb_nX4ye"
267
+ },
268
+ "id": "J1WVGb_nX4ye"
269
+ },
270
+ {
271
+ "cell_type": "code",
272
+ "source": [
273
274
+ "from scipy.cluster.hierarchy import dendrogram, linkage\n",
275
+ "\n",
276
+ "# Create hierarchical clustering\n",
277
+ "linkage_matrix = linkage(USArrests_scaled, method='complete')\n",
278
+ "\n",
279
+ "# Plot the dendrogram\n",
280
+ "plt.figure(figsize=(15, 8))\n",
281
+ "dendrogram(linkage_matrix,\n",
282
+ " labels=USArrests.index.tolist(),\n",
283
+ " leaf_rotation=90,\n",
284
+ " leaf_font_size=10)\n",
285
+ "plt.title('State Crime Pattern Family Tree')\n",
286
+ "plt.xlabel('States')\n",
287
+ "plt.ylabel('Distance Between Groups')\n",
288
+ "plt.tight_layout()\n",
289
+ "plt.show()"
291
+ ],
292
+ "metadata": {
293
+ "id": "Y9a_cbZKX7QX"
294
+ },
295
+ "id": "Y9a_cbZKX7QX",
296
+ "execution_count": null,
297
+ "outputs": []
298
+ },
299
+ {
300
+ "cell_type": "markdown",
301
+ "source": [
302
+ "### Your Interpretation\n",
303
+ "1. **Closest Pairs**: Which two states are most similar in crime patterns?\n",
304
+ "2. **Biggest Divide**: Where is the largest split in the tree? What does this represent?\n",
305
+ "3. **Surprising Neighbors**: Which states cluster together that surprised you geographically?\n",
306
+ "\n",
307
+ "### Code to Compare Methods"
308
+ ],
309
+ "metadata": {
310
+ "id": "0PaImqZtX6f3"
311
+ },
312
+ "id": "0PaImqZtX6f3"
313
+ },
314
+ {
315
+ "cell_type": "code",
316
+ "source": [
317
318
+ "# Compare your K-means results with hierarchical clustering\n",
319
+ "from scipy.cluster.hierarchy import fcluster\n",
320
+ "\n",
321
+ "# Cut the tree to get the same number of clusters as K-means\n",
322
+ "hierarchical_labels = fcluster(linkage_matrix, optimal_k, criterion='maxclust') - 1\n",
323
+ "\n",
324
+ "# Create comparison\n",
325
+ "comparison_df = pd.DataFrame({\n",
326
+ " 'State': USArrests.index,\n",
327
+ " 'K_Means_Cluster': cluster_labels,\n",
328
+ " 'Hierarchical_Cluster': hierarchical_labels\n",
329
+ "})\n",
330
+ "\n",
331
+ "print(\"Comparison of K-Means vs Hierarchical Clustering:\")\n",
332
+ "print(comparison_df.sort_values('State'))\n",
333
+ "\n",
334
+ "# Count agreements\n",
335
+ "agreements = sum(comparison_df['K_Means_Cluster'] == comparison_df['Hierarchical_Cluster'])\n",
336
+ "print(f\"\\nMethods agreed on {agreements} out of {len(comparison_df)} states ({agreements/len(comparison_df)*100:.1f}%)\")"
338
+ ],
339
+ "metadata": {
340
+ "id": "tJQ-C5GFYBRT"
341
+ },
342
+ "id": "tJQ-C5GFYBRT",
343
+ "execution_count": null,
344
+ "outputs": []
345
+ },
346
+ {
347
+ "cell_type": "markdown",
348
+ "source": [
349
+ "**Deliverable**: A paragraph explaining the key differences between what K-means and hierarchical clustering revealed.\n",
350
+ "\n",
351
+ "---\n",
352
+ "\n",
353
+ "## Exercise 6: Policy Brief Creation\n",
354
+ "**Time: 20 minutes | Product: Executive Summary**\n",
355
+ "\n",
356
+ "### Your Task\n",
357
+ "Synthesize your findings into a policy brief for Department of Justice leadership.\n",
358
+ "\n",
359
+ "### Code Framework for Final Visualization"
360
+ ],
361
+ "metadata": {
362
+ "id": "dx1fNhu4YD7-"
363
+ },
364
+ "id": "dx1fNhu4YD7-"
365
+ },
366
+ {
367
+ "cell_type": "code",
368
+ "source": [
369
370
+ "# Create a comprehensive visualization\n",
371
+ "fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))\n",
372
+ "\n",
373
+ "# Plot 1: Murder vs Assault by cluster\n",
374
+ "colors = ['red', 'blue', 'green', 'orange', 'purple']\n",
375
+ "for i in range(optimal_k):\n",
376
+ " cluster_data = USArrests_clustered[USArrests_clustered['Cluster'] == i]\n",
377
+ " ax1.scatter(cluster_data['Murder'], cluster_data['Assault'],\n",
378
+ " c=colors[i], label=f'Cluster {i}', s=60, alpha=0.7)\n",
379
+ "ax1.set_xlabel('Murder Rate')\n",
380
+ "ax1.set_ylabel('Assault Rate')\n",
381
+ "ax1.set_title('Murder vs Assault by Crime Profile')\n",
382
+ "ax1.legend()\n",
383
+ "ax1.grid(True, alpha=0.3)\n",
384
+ "\n",
385
+ "# Plot 2: Urban Population vs Rape by cluster\n",
386
+ "for i in range(optimal_k):\n",
387
+ " cluster_data = USArrests_clustered[USArrests_clustered['Cluster'] == i]\n",
388
+ " ax2.scatter(cluster_data['UrbanPop'], cluster_data['Rape'],\n",
389
+ " c=colors[i], label=f'Cluster {i}', s=60, alpha=0.7)\n",
390
+ "ax2.set_xlabel('Urban Population %')\n",
391
+ "ax2.set_ylabel('Rape Rate')\n",
392
+ "ax2.set_title('Urban Population vs Rape Rate by Crime Profile')\n",
393
+ "ax2.legend()\n",
394
+ "ax2.grid(True, alpha=0.3)\n",
395
+ "\n",
396
+ "# Plot 3: Cluster size comparison\n",
397
+ "cluster_sizes = USArrests_clustered['Cluster'].value_counts().sort_index()\n",
398
+ "ax3.bar(range(len(cluster_sizes)), cluster_sizes.values, color=colors[:len(cluster_sizes)])\n",
399
+ "ax3.set_xlabel('Cluster Number')\n",
400
+ "ax3.set_ylabel('Number of States')\n",
401
+ "ax3.set_title('Number of States in Each Crime Profile')\n",
402
+ "ax3.set_xticks(range(len(cluster_sizes)))\n",
403
+ "\n",
404
+ "# Plot 4: Average crime rates by cluster\n",
405
+ "cluster_means = USArrests_clustered.groupby('Cluster')[['Murder', 'Assault', 'Rape']].mean()\n",
406
+ "cluster_means.plot(kind='bar', ax=ax4)\n",
407
+ "ax4.set_xlabel('Cluster Number')\n",
408
+ "ax4.set_ylabel('Average Rate')\n",
409
+ "ax4.set_title('Average Crime Rates by Profile')\n",
410
+ "ax4.legend()\n",
411
+ "ax4.tick_params(axis='x', rotation=0)\n",
412
+ "\n",
413
+ "plt.tight_layout()\n",
414
+ "plt.show()"
416
+ ],
417
+ "metadata": {
418
+ "id": "N8bkxURpYHJF"
419
+ },
420
+ "id": "N8bkxURpYHJF",
421
+ "execution_count": null,
422
+ "outputs": []
423
+ },
424
+ {
425
+ "cell_type": "markdown",
426
+ "source": [
427
+ "### Your Policy Brief Template\n",
428
+ "\n",
429
+ "**EXECUTIVE SUMMARY: US State Crime Profile Analysis**\n",
430
+ "\n",
431
+ "**Key Findings:**\n",
432
+ "- We identified [X] distinct crime profiles among US states\n",
433
+ "- [State examples] represent the highest-risk profile\n",
434
+ "- [State examples] represent the lowest-risk profile\n",
435
+ "- Urban population [does/does not] strongly correlate with violent crime\n",
436
+ "\n",
437
+ "**Policy Recommendations:**\n",
438
+ "1. **High-Priority States**: [List and explain why]\n",
439
+ "2. **Resource Allocation**: [Suggest how to distribute federal crime prevention funds]\n",
440
+ "3. **Best Practice Sharing**: [Which states should learn from which others?]\n",
441
+ "\n",
442
+ "**Methodology Note**: Analysis used unsupervised clustering on 4 crime variables across 50 states, with data standardization to ensure fair comparison.\n",
443
+ "\n",
444
+ "**Deliverable**: A complete 1-page policy brief with your clustering insights and specific recommendations.\n"
445
+ ],
446
+ "metadata": {
447
+ "id": "rAy_Ye0WYLK0"
448
+ },
449
+ "id": "rAy_Ye0WYLK0"
450
+ }
451
+ ],
452
+ "metadata": {
453
+ "jupytext": {
454
+ "cell_metadata_filter": "-all",
455
+ "formats": "Rmd,ipynb",
456
+ "main_language": "python"
457
+ },
458
+ "kernelspec": {
459
+ "display_name": "Python 3 (ipykernel)",
460
+ "language": "python",
461
+ "name": "python3"
462
+ },
463
+ "language_info": {
464
+ "codemirror_mode": {
465
+ "name": "ipython",
466
+ "version": 3
467
+ },
468
+ "file_extension": ".py",
469
+ "mimetype": "text/x-python",
470
+ "name": "python",
471
+ "nbconvert_exporter": "python",
472
+ "pygments_lexer": "ipython3",
473
+ "version": "3.10.4"
474
+ },
475
+ "colab": {
476
+ "provenance": []
477
+ }
478
+ },
479
+ "nbformat": 4,
480
+ "nbformat_minor": 5
481
+ }
Reference files/week 7/Week7_Clustering Curriculum.docx ADDED
Binary file (18.4 kB). View file
 
Reference files/week 7/Week7_Clustering Learning Objectives.docx ADDED
Binary file (11.4 kB). View file
 
Reference files/week 7/w7_curriculum ADDED
@@ -0,0 +1,178 @@
1
+ Unsupervised Learning: K-means and Hierarchical Clustering
2
+ 1. Course Overview
3
+ The State Safety Profile Challenge
4
+ This week, we'll explore unsupervised machine learning through a compelling real-world challenge: understanding crime patterns across US states without any predetermined categories.
5
+ Unsupervised Learning: A type of machine learning where we find hidden patterns in data without being told what to look for. Think of it like being a detective who examines evidence without knowing what crime was committed - you're looking for patterns and connections that emerge naturally from the data.
6
+ Example: Instead of being told "find violent states vs. peaceful states," unsupervised learning lets the data reveal its own natural groupings, like "states with high murder but low assault" or "urban states with moderate crime."
7
+ Imagine you're a policy researcher working with the FBI's crime statistics. You have data on violent crime rates across all 50 US states - murder rates, assault rates, urban population percentages, and rape statistics. But here's the key challenge: you don't know how states naturally group together in terms of crime profiles.
8
+ Your Mission: Discover hidden patterns in state crime profiles without any predefined classifications!
9
+ The Challenge: Without any predetermined safety categories, you need to:
10
+ ● Uncover natural groupings of states based on their crime characteristics
11
+ ● Identify which crime factors tend to cluster together
12
+ ● Understand regional patterns that might not follow obvious geographic boundaries
13
+ ● Find states with surprisingly similar or different crime profiles
14
+ Cluster: A group of similar things. In our case, states that have similar crime patterns naturally group together in a cluster.
15
+ Example: You might discover that Alaska, Nevada, and Florida cluster together because they all have high crime rates despite being in different regions of the country.
16
+ Why This Matters: Traditional approaches might group states by region (South, Northeast, etc.) or population size. But what if crime patterns reveal different natural groupings? What if some Southern states cluster more closely with Western states based on crime profiles? What if urban percentage affects crime differently than expected?
17
+ Urban Percentage: The proportion of a state's population that lives in cities rather than rural areas.
18
+ Example: New York has a high urban percentage (87%) while Wyoming has a low urban percentage (29%).
19
+ What You'll Discover Through This Challenge
20
+ ● Hidden State Safety Types: Use clustering to identify groups of states with similar crime profiles
21
+ ● Crime Pattern Relationships: Find unexpected connections between different types of violent crime
22
+ ● Urban vs. Rural Effects: Discover how urbanization relates to different crime patterns
23
+ ● Policy Insights: Understand which states face similar challenges and might benefit from shared approaches
24
+ Clustering: The process of grouping similar data points together. It's like organizing your music library - songs naturally group by genre, but clustering might reveal unexpected groups like "workout songs" or "rainy day music" that cross traditional genre boundaries.
25
+ Core Techniques We'll Master
26
+ K-Means Clustering: A method that divides data into exactly K groups (where you choose the number K). It's like being asked to organize 50 students into exactly 4 study groups based on their academic interests.
27
+ Hierarchical Clustering: A method that creates a tree-like structure showing how data points relate to each other at different levels. It's like a family tree, but for data - showing which states are "cousins" and which are "distant relatives" in terms of crime patterns.
28
+ Both K-Means and Hierarchical Clustering are examples of unsupervised learning.
29
+
30
+ 2. K-Means Clustering
31
+
32
+ What it does: Divides data into exactly K groups by finding central points (centroids).
33
+ Central Points (Centroids): The "center" or average point of each group. Think of it like the center of a basketball team huddle - it's the point that best represents where all the players are standing.
34
+ Example: If you have a cluster of high-crime states, the centroid might represent "average murder rate of 8.5, average assault rate of 250, average urban population of 70%."
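+ A minimal sketch of inspecting centroids after fitting (assuming scaled data in data_scaled, as in the preprocessing section later in this document):
+ python
+ from sklearn.cluster import KMeans
+ kmeans = KMeans(n_clusters=4, random_state=42, n_init=20).fit(data_scaled)
+ print(kmeans.cluster_centers_)  # one row of averaged (standardized) crime values per cluster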
35
+ USArrests Example: Analyzing crime data across 50 states, you might discover 4 distinct state safety profiles:
36
+ ● High Crime States (above average in murder, assault, and rape rates)
37
+ ● Urban Safe States (high urban population but lower violent crime rates)
38
+ ● Rural Traditional States (low urban population, moderate crime rates)
39
+ ● Mixed Profile States (high in some crime types but not others)
40
+ How to Read K-Means Results:
41
+ ● Scatter Plot: Points (states) colored by cluster membership
42
+ ○ Well-separated colors indicate distinct state profiles
43
+ ○ Mixed colors suggest overlapping crime patterns
44
+ ● Cluster Centers: Average crime characteristics of each state group
45
+ ● Elbow Plot: Helps choose optimal number of state groupings
46
+ Cluster Membership: Which group each data point belongs to. Like being assigned to a team - each state gets assigned to exactly one crime profile group.
47
+ Example: Texas might be assigned to "High Crime States" while Vermont is assigned to "Rural Traditional States."
48
+ Scatter Plot: A graph where each point represents one observation (in our case, one state). Points that are close together have similar characteristics.
49
+ Elbow Plot: A graph that helps you choose the right number of clusters. It's called "elbow" because you look for a bend in the line that looks like an elbow joint.
50
+ Key Parameters:
51
+ python
52
+ # Essential parameters from the lab
53
+ KMeans(
54
+ n_clusters=4, # Number of state safety profiles to discover
55
+ random_state=42, # For reproducible results
56
+ n_init=20 # Run algorithm 20 times, keep best result
57
+ )
58
+ Parameters: Settings that control how the algorithm works. Like settings on your phone - you can adjust them to get different results.
59
+ n_clusters: How many groups you want to create. You have to decide this ahead of time.
60
+ random_state: A number that ensures you get the same results every time you run the analysis. Like setting a specific starting point so everyone gets the same answer.
61
+ n_init: How many times to run the algorithm. The computer tries multiple starting points and picks the best result. More tries = better results.
62
+
63
+ 3. Hierarchical Clustering
64
+ What it does: Creates a tree structure (dendrogram) showing how data points group together at different levels.
65
+ Dendrogram: A tree-like diagram that shows how groups form at different levels. Think of it like a family tree, but for data. At the bottom are individuals (states), and as you go up, you see how they group into families, then extended families, then larger clans.
66
+ Example: At the bottom level, you might see Vermont and New Hampshire grouped together. Moving up, they might join with Maine to form a "New England Low Crime" group. Moving up further, this group might combine with other regional groups.
67
+ USArrests Example: Analyzing state crime patterns might reveal:
68
+ ● Level 1: High Crime vs. Low Crime states
69
+ ● Level 2: Within high crime: Urban-driven vs. Rural-driven crime patterns
70
+ ● Level 3: Within urban-driven: Assault-heavy vs. Murder-heavy profiles
71
+ How to Read Dendrograms:
72
+ ● Height: Distance between groups when they merge
73
+ ○ Higher merges = very different crime profiles
74
+ ○ Lower merges = similar crime patterns
75
+ ● Branches: Each split shows a potential state grouping
76
+ ● Cutting the Tree: Draw a horizontal line to create clusters
77
+ Height: In a dendrogram, height represents how different two groups are. Think of it like difficulty level - it takes more "effort" (higher height) to combine very different groups.
78
+ Example: Combining two very similar states (like Vermont and New Hampshire) happens at low height. Combining very different groups (like "High Crime States" and "Low Crime States") happens at high height.
79
+ Cutting the Tree: Drawing a horizontal line across the dendrogram to create a specific number of groups. Like slicing a layer cake - where you cut determines how many pieces you get.
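+ A minimal sketch of cutting the tree into 4 groups (assuming a linkage_matrix built as in the hierarchical clustering code later in this document):
+ python
+ from scipy.cluster.hierarchy import fcluster
+ labels = fcluster(linkage_matrix, t=4, criterion='maxclust')  # cluster label (1-4) for each state
+ print(labels)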
80
+ Three Linkage Methods:
81
+ ● Complete Linkage: Measures distance between most different states (good for distinct profiles)
82
+ ● Average Linkage: Uses average distance between all states (balanced approach)
83
+ ● Single Linkage: Uses closest states (tends to create chains, often less useful)
84
+ Linkage Methods: Different ways to measure how close or far apart groups are. It's like different ways to measure the distance between two cities - you could use the distance between the farthest suburbs (complete), the average distance between all neighborhoods (average), or the distance between the closest points (single).
85
+ Example: When deciding if "High Crime Group" and "Medium Crime Group" should merge, complete linkage looks at the most different states between the groups, while average linkage looks at the typical difference.
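+ A minimal sketch comparing the three linkage methods on the same scaled data (data_scaled is an assumption carried over from the preprocessing step):
+ python
+ from scipy.cluster.hierarchy import linkage
+ for method in ['complete', 'average', 'single']:
+     Z = linkage(data_scaled, method=method)
+     print(method, Z[-1, 2])  # height of the final merge under each linkage method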
86
+ Choosing Between K-Means and Hierarchical:
87
+ ● Use K-Means when: You want to segment states into a specific number of safety categories for policy targeting
88
+ ● Use Hierarchical when: You want to explore the natural structure of crime patterns without assumptions
89
+ Segmentation: Dividing your data into groups for specific purposes. Like organizing students into study groups - you might want exactly 4 groups so each has a teaching assistant.
90
+ Exploratory Analysis: Looking at data to discover patterns without knowing what you'll find. Like being an explorer in uncharted territory - you're not looking for a specific destination, just seeing what interesting things you can discover.
91
+
92
+ 4. Data Exploration
93
+ Step 1: Understanding Your Data
94
+ Essential Checks (from the USArrests example):
95
+ python
96
+ # Check the basic structure
97
+ print(data.shape) # How many observations and variables?
98
+ print(data.columns) # What variables do you have?
99
+ print(data.head()) # What do the first few rows look like?
100
+
101
+ # Examine the distribution
102
+ print(data.mean()) # Average values
103
+ print(data.var()) # Variability
104
+ print(data.describe()) # Full statistical summary
105
+ Observations: Individual data points we're studying. In our case, each of the 50 US states is one observation.
106
+ Variables: The characteristics we're measuring for each observation. In USArrests, we have 4 variables: Murder rate, Assault rate, Urban Population percentage, and Rape rate.
107
+ Example: For California (one observation), we might have Murder=9.0, Assault=276, UrbanPop=91, Rape=40.6 (four variables).
108
+ Distribution: How values are spread out. Like looking at test scores in a class - are most scores clustered around the average, or spread out widely?
109
+ Variability (Variance): How much the values differ from each other. High variance means values are spread out; low variance means they're clustered together.
110
+ Why This Matters: The USArrests data showed vastly different scales:
111
+ ● Murder: Average 7.8, Variance 19
112
+ ● Assault: Average 170.8, Variance 6,945
113
+ ● This scale difference would dominate any analysis without preprocessing
114
+ Scales: The range and units of measurement for different variables. Like comparing dollars ($50,000 salary) to percentages (75% approval rating) - they're measured very differently.
115
+ Example: Assault rates are in the hundreds (like 276 per 100,000) while murder rates are single digits (like 7.8 per 100,000). Without adjustment, assault would seem much more important just because the numbers are bigger.
116
+ Step 2: Data Preprocessing
117
+ Standardization (Critical for clustering):
118
+ python
119
+ from sklearn.preprocessing import StandardScaler
120
+
121
+ # Always scale when variables have different units
122
+ scaler = StandardScaler()
123
+ data_scaled = scaler.fit_transform(data)
124
+ Standardization: Converting all variables to the same scale so they can be fairly compared. Like converting all measurements to the same units - instead of comparing feet to meters, you convert everything to inches.
125
+ StandardScaler: A tool that transforms data so each variable has an average of 0 and standard deviation of 1. Think of it like grading on a curve - it makes all variables equally important.
126
+ Example: After standardization, a murder rate of 7.8 might become 0.2, and an assault rate of 276 might become 1.5. Now they're on comparable scales.
127
+ When to Scale:
128
+ ● ✅ Always scale when variables have different units (dollars vs. percentages)
129
+ ● ✅ Scale when variances differ by orders of magnitude
130
+ ● ❓ Consider not scaling when all variables are in the same meaningful units
131
+ Orders of Magnitude: When one number is 10 times, 100 times, or 1000 times bigger than another. In USArrests, assault variance (6,945) is about 365 times bigger than murder variance (19) - that's two orders of magnitude difference.
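+ A quick check of that ratio (assuming the unscaled USArrests data frame is loaded as data):
+ python
+ print(data.var()['Assault'] / data.var()['Murder'])  # roughly 365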
132
+ Step 3: Exploratory Analysis
133
+ For K-Means Clustering:
134
+ python
135
+ # Try different numbers of clusters to find optimal K
136
+ inertias = []
137
+ K_range = range(1, 11)
138
+ for k in K_range:
139
+ kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)
140
+ kmeans.fit(data_scaled)
141
+ inertias.append(kmeans.inertia_)
142
+
143
+ # Plot elbow curve
144
+ plt.plot(K_range, inertias, 'bo-')
145
+ plt.xlabel('Number of Clusters (K)')
146
+ plt.ylabel('Within-Cluster Sum of Squares')
147
+ plt.title('Elbow Method for Optimal K')
148
+ Inertias: A measure of how tightly grouped each cluster is. Lower inertia means points in each cluster are closer together (better clustering). It's like measuring how close teammates stand to each other - closer teammates indicate better team cohesion.
149
+ Within-Cluster Sum of Squares: The total distance from each point to its cluster center. Think of it as measuring how far each student sits from their group's center - smaller distances mean tighter, more cohesive groups.
150
+ Elbow Method: A technique for choosing the best number of clusters. You plot the results and look for the "elbow" - the point where adding more clusters doesn't help much anymore.
151
+ For Hierarchical Clustering:
152
+ python
153
+ # Create dendrogram to explore natural groupings
154
+ from sklearn.cluster import AgglomerativeClustering
155
+ from ISLP.cluster import compute_linkage
156
+ from scipy.cluster.hierarchy import dendrogram
157
+
158
+ hc = AgglomerativeClustering(distance_threshold=0, n_clusters=None, linkage='complete')
159
+ hc.fit(data_scaled)
160
+ linkage_matrix = compute_linkage(hc)
161
+
162
+ plt.figure(figsize=(12, 8))
163
+ dendrogram(linkage_matrix, color_threshold=-np.inf, above_threshold_color='black')
164
+ plt.title('Hierarchical Clustering Dendrogram')
165
+ AgglomerativeClustering: A type of hierarchical clustering that starts with individual points and gradually combines them into larger groups. Like building a pyramid from the bottom up.
166
+ distance_threshold=0: A setting that tells the algorithm to build the complete tree structure without stopping early.
167
+ Linkage Matrix: A mathematical representation of how the tree structure was built. Think of it as the blueprint showing how the dendrogram was constructed.
168
+ Step 4: Validation Questions
169
+ Before proceeding with analysis, ask:
170
+ 1. Do the variables make sense together? (e.g., don't cluster height with income)
171
+ 2. Are there obvious outliers that need attention?
172
+ 3. Do you have enough data points? (Rule of thumb: at least 10x more observations than variables)
173
+ 4. Are there missing values that need handling?
174
+ Outliers: Data points that are very different from all the others. Like a 7-foot-tall person in a group of average-height people - they're so different they might skew your analysis.
175
+ Example: If most states have murder rates between 1-15, but one state has a rate of 50, that's probably an outlier that needs special attention.
176
+ Missing Values: Data points where we don't have complete information. Like a student who didn't take one of the tests - you need to decide how to handle that gap in the data.
177
+ Rule of Thumb: A general guideline that works in most situations. For clustering, having at least 10 times more observations than variables helps ensure reliable results.
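+ A minimal sketch of these pre-flight checks (assuming the data frame is named data):
+ python
+ print(data.isnull().sum())             # any missing values to handle?
+ print(data.shape[0] / data.shape[1])   # observations per variable (aim for at least 10)
+ print(data.describe())                 # scan the min/max rows for obvious outliers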
178
+
app/.DS_Store CHANGED
Binary files a/app/.DS_Store and b/app/.DS_Store differ
 
app/main.py CHANGED
@@ -29,9 +29,24 @@ st.set_page_config(
29
  page_title="Data Science Course App",
30
  page_icon="📚",
31
  layout="wide",
32
- initial_sidebar_state="expanded"
 
33
  )
34
 
35
  # Custom CSS
36
  def load_css():
37
  try:
@@ -56,11 +71,6 @@ def load_css():
56
  margin-bottom: 1rem;
57
  }
58
 
59
- /* Sidebar styling */
60
- .sidebar .sidebar-content {
61
- background-color: #f8f9fa;
62
- }
63
-
64
  /* Button styling */
65
  .stButton>button {
66
  width: 100%;
@@ -125,16 +135,17 @@ def sidebar_navigation():
125
  st.rerun()
126
 
127
  st.markdown("---")
128
- st.subheader("Course Progress")
129
- progress = st.progress(st.session_state.current_week / 10)
130
- st.write(f"Week {st.session_state.current_week} of 10")
131
 
132
- st.markdown("---")
133
- st.subheader("Quick Links")
134
- for week in range(1, 11):
135
- if st.button(f"Week {week}", key=f"week_{week}"):
136
- st.session_state.current_week = week
137
- st.rerun()
138
 
139
  def show_week_content():
140
  # Debug print to show current week
 
29
  page_title="Data Science Course App",
30
  page_icon="📚",
31
  layout="wide",
32
+ initial_sidebar_state="expanded",
33
+ menu_items={
34
+ 'Get Help': None,
35
+ 'Report a bug': None,
36
+ 'About': None
37
+ }
38
  )
39
 
40
+ # Disable URL paths and hide Streamlit elements
41
+ st.markdown("""
42
+ <style>
43
+ #MainMenu {visibility: hidden;}
44
+ footer {visibility: hidden;}
45
+ .stDeployButton {display: none;}
46
+ .viewerBadge_container__1QSob {display: none;}
47
+ </style>
48
+ """, unsafe_allow_html=True)
49
+
50
  # Custom CSS
51
  def load_css():
52
  try:
 
71
  margin-bottom: 1rem;
72
  }
73
 
74
  /* Button styling */
75
  .stButton>button {
76
  width: 100%;
 
135
  st.rerun()
136
 
137
  st.markdown("---")
138
+ st.subheader("Course Content")
 
 
139
 
140
+ # Create a container for week buttons
141
+ week_container = st.container()
142
+
143
+ # Add week buttons with custom styling
144
+ with week_container:
145
+ for week in range(1, 11):
146
+ if st.button(f"Week {week}", key=f"week_{week}"):
147
+ st.session_state.current_week = week
148
+ st.rerun()
149
 
150
  def show_week_content():
151
  # Debug print to show current week
app/pages/__pycache__/week_5.cpython-311.pyc CHANGED
Binary files a/app/pages/__pycache__/week_5.cpython-311.pyc and b/app/pages/__pycache__/week_5.cpython-311.pyc differ
 
app/pages/__pycache__/week_7.cpython-311.pyc CHANGED
Binary files a/app/pages/__pycache__/week_7.cpython-311.pyc and b/app/pages/__pycache__/week_7.cpython-311.pyc differ
 
app/pages/week_7.py CHANGED
@@ -6,16 +6,21 @@ import seaborn as sns
6
  import plotly.express as px
7
  import plotly.graph_objects as go
8
  from plotly.subplots import make_subplots
9
 
10
  # Set up the style for all plots
11
  plt.style.use('default')
12
  sns.set_theme(style="whitegrid", palette="husl")
13
 
14
- def load_titanic_data():
15
- """Load and return the Titanic dataset"""
16
- url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
17
- df = pd.read_csv(url)
18
- return df
19
 
20
  def create_categorical_plot(df, column, target='Survived'):
21
  """Create an interactive plot for categorical variables"""
@@ -54,284 +59,758 @@ def create_numeric_plot(df, column, target='Survived'):
54
  return fig
55
 
56
  def show():
57
- st.title("Week 7: Data Cleaning and EDA with Categorical Variables")
58
 
59
- # Introduction Section
60
- st.header("Course Overview")
61
- st.write("""
62
- This week, we'll explore data cleaning and exploratory data analysis (EDA) with a focus on categorical variables.
63
- We'll use the Titanic dataset to demonstrate:
64
- - Data cleaning techniques
65
- - Handling missing values
66
- - Analyzing categorical variables
67
- - Creating meaningful visualizations
68
- - Feature engineering
 
69
  """)
70
 
71
- # Learning Path
72
- st.subheader("Learning Path")
73
  st.write("""
74
- 1. Understanding the Dataset: Titanic passenger data
75
- 2. Data Cleaning: Handling missing values and outliers
76
- 3. Categorical Variables: Analysis and visualization
77
- 4. Feature Engineering: Creating new features
78
- 5. Data Visualization: Interactive plots and insights
79
- 6. Practical Applications: Real-world data analysis
80
  """)
81
-
82
  # Load Data
83
- st.header("The Dataset")
84
- st.write("""
85
- We'll be working with the Titanic dataset, which contains information about passengers aboard the Titanic.
86
- The dataset includes both categorical and numerical variables, making it perfect for learning data cleaning and EDA.
 
87
  """)
88
 
89
- df = load_titanic_data()
90
-
91
- # Display basic information
92
- st.subheader("Dataset Overview")
93
- st.write(f"Number of rows: {len(df)}")
94
- st.write(f"Number of columns: {len(df.columns)}")
95
-
96
- # Display missing values
97
- st.subheader("Missing Values Analysis")
98
- missing_values = df.isnull().sum()
99
- fig_missing = px.bar(
100
- x=missing_values.index,
101
- y=missing_values.values,
102
- title='Missing Values by Column',
103
- labels={'x': 'Columns', 'y': 'Number of Missing Values'}
104
  )
105
- fig_missing.update_layout(
106
- title_x=0.5,
107
- title_font_size=20,
108
  plot_bgcolor='rgb(30, 30, 30)',
109
  paper_bgcolor='rgb(30, 30, 30)',
110
  font=dict(color='white')
111
  )
112
- st.plotly_chart(fig_missing)
 
113
 
114
- # Data Cleaning Section
115
- st.header("Data Cleaning")
116
 
117
- # Handle missing values
118
- st.subheader("Handling Missing Values")
 
119
  st.write("""
120
- Let's clean the data by:
121
- 1. Filling missing Age values with median
122
- 2. Filling missing Embarked values with mode
123
- 3. Creating a new feature for Cabin availability
124
  """)
125
 
126
- # Create a copy for cleaning
127
- df_cleaned = df.copy()
128
 
129
- # Fill missing values
130
- df_cleaned['Age'].fillna(df_cleaned['Age'].median(), inplace=True)
131
- df_cleaned['Embarked'].fillna(df_cleaned['Embarked'].mode()[0], inplace=True)
132
- df_cleaned['HasCabin'] = df_cleaned['Cabin'].notna().astype(int)
133
 
134
- # Categorical Variables Analysis
135
- st.header("Categorical Variables Analysis")
 
136
 
137
- # Select categorical column to analyze
138
- categorical_cols = ['Pclass', 'Sex', 'Embarked', 'HasCabin']
139
- selected_col = st.selectbox(
140
- "Select Categorical Variable to Analyze",
141
- categorical_cols
142
- )
 
 
143
 
144
- # Create and display categorical plot
145
- fig_cat = create_categorical_plot(df_cleaned, selected_col)
146
- st.plotly_chart(fig_cat)
147
 
148
- # Numeric Variables Analysis
149
- st.header("Numeric Variables Analysis")
 
 
150
 
151
- # Select numeric column to analyze
152
- numeric_cols = ['Age', 'Fare', 'SibSp', 'Parch']
153
- selected_num_col = st.selectbox(
154
- "Select Numeric Variable to Analyze",
155
- numeric_cols
156
- )
 
157
 
158
- # Create and display numeric plot
159
- fig_num = create_numeric_plot(df_cleaned, selected_num_col)
160
- st.plotly_chart(fig_num)
161
 
162
- # Reference Code Section
163
- st.header("Reference Code")
164
  st.write("""
165
- Below is the reference code for the data cleaning and analysis we just performed.
166
- Study this code to understand how we implemented the analysis.
 
 
167
  """)
168
 
169
- with st.expander("View Reference Code"):
170
  st.code("""
171
- # Data Cleaning
172
- df_cleaned = df.copy()
173
- df_cleaned['Age'].fillna(df_cleaned['Age'].median(), inplace=True)
174
- df_cleaned['Embarked'].fillna(df_cleaned['Embarked'].mode()[0], inplace=True)
175
- df_cleaned['HasCabin'] = df_cleaned['Cabin'].notna().astype(int)
176
 
177
- # Categorical Analysis
178
- def create_categorical_plot(df, column, target='Survived'):
179
- fig = px.bar(
180
- df.groupby(column)[target].mean().reset_index(),
181
- x=column,
182
- y=target,
183
- title=f'Survival Rate by {column}',
184
- labels={target: 'Survival Rate', column: column},
185
- color=target,
186
- color_continuous_scale='RdBu'
187
- )
188
- return fig
189
 
190
- # Numeric Analysis
191
- def create_numeric_plot(df, column, target='Survived'):
192
- fig = px.box(
193
- df,
194
- x=target,
195
- y=column,
196
- title=f'{column} Distribution by Survival',
197
- labels={target: 'Survived', column: column},
198
- color=target,
199
- color_discrete_sequence=px.colors.qualitative.Set1
200
- )
201
- return fig
202
 
203
- # Feature Engineering
204
- df_cleaned['FamilySize'] = df_cleaned['SibSp'] + df_cleaned['Parch'] + 1
205
- df_cleaned['AgeGroup'] = pd.cut(
206
- df_cleaned['Age'],
207
- bins=[0, 12, 18, 35, 60, 100],
208
- labels=['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior']
209
  )
210
- df_cleaned['FarePerPerson'] = df_cleaned['Fare'] / df_cleaned['FamilySize']
211
  """, language="python")
212
 
213
- # Knowledge Check Quiz
214
- st.header("Knowledge Check")
215
- st.write("Test your understanding of the concepts covered in this section.")
216
-
217
- # Initialize session state for quiz if not exists
218
- if 'quiz_submitted' not in st.session_state:
219
- st.session_state.quiz_submitted = False
220
-
221
- # Quiz questions
222
- questions = {
223
- "q1": {
224
- "question": "What is the best way to handle missing values in the 'Age' column?",
225
- "options": [
226
- "Fill with 0",
227
- "Fill with the median age",
228
- "Remove all rows with missing age",
229
- "Fill with the mean age"
230
- ],
231
- "correct": 1
232
- },
233
- "q2": {
234
- "question": "Why do we create the 'HasCabin' feature?",
235
- "options": [
236
- "To reduce the number of missing values",
237
- "To create a binary indicator for cabin availability",
238
- "To make the data more complex",
239
- "To remove the Cabin column"
240
- ],
241
- "correct": 1
242
- },
243
- "q3": {
244
- "question": "What does the FamilySize feature represent?",
245
- "options": [
246
- "Number of siblings only",
247
- "Number of parents only",
248
- "Total family members (including the passenger)",
249
- "Number of children only"
250
- ],
251
- "correct": 2
252
- }
253
- }
254
-
255
- # Display quiz if not submitted
256
- if not st.session_state.quiz_submitted:
257
- answers = {}
258
- for q_id, q_data in questions.items():
259
- st.write(f"**{q_data['question']}**")
260
- answers[q_id] = st.radio(
261
- "Select your answer:",
262
- q_data["options"],
263
- key=q_id
264
- )
265
-
266
- if st.button("Submit Quiz"):
267
- # Calculate score
268
- score = sum(1 for q_id, q_data in questions.items()
269
- if answers[q_id] == q_data["options"][q_data["correct"]])
270
-
271
- # Show results
272
- st.write(f"Your score: {score}/{len(questions)}")
273
-
274
- # Show correct answers
275
- st.write("Correct answers:")
276
- for q_id, q_data in questions.items():
277
- st.write(f"- {q_data['question']}")
278
- st.write(f" Correct answer: {q_data['options'][q_data['correct']]}")
279
-
280
- st.session_state.quiz_submitted = True
281
-
282
- # Reset quiz button
283
- if st.session_state.quiz_submitted:
284
- if st.button("Take Quiz Again"):
285
- st.session_state.quiz_submitted = False
286
- st.rerun()
287
-
288
- # Feature Engineering
289
- st.header("Feature Engineering")
290
- st.write("""
291
- Let's create some new features:
292
- 1. Family Size = SibSp + Parch + 1
293
- 2. Age Groups
294
- 3. Fare per Person
295
  """)
296
 
297
- # Create new features
298
- df_cleaned['FamilySize'] = df_cleaned['SibSp'] + df_cleaned['Parch'] + 1
299
- df_cleaned['AgeGroup'] = pd.cut(
300
- df_cleaned['Age'],
301
- bins=[0, 12, 18, 35, 60, 100],
302
- labels=['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior']
303
  )
304
- df_cleaned['FarePerPerson'] = df_cleaned['Fare'] / df_cleaned['FamilySize']
305
 
306
- # Display new features
307
- st.subheader("New Features Analysis")
308
 
309
- # Family Size Analysis
310
- fig_family = create_categorical_plot(df_cleaned, 'FamilySize')
311
- st.plotly_chart(fig_family)
312
 
313
- # Age Group Analysis
314
- fig_age = create_categorical_plot(df_cleaned, 'AgeGroup')
315
- st.plotly_chart(fig_age)
316
 
317
- # Conclusion
318
- st.header("Conclusion")
319
  st.write("""
320
- Through this analysis, we've learned:
321
- - How to handle missing values in real-world datasets
322
- - Techniques for analyzing categorical variables
323
- - Methods for creating meaningful visualizations
324
- - Feature engineering approaches
325
- - Best practices for data cleaning and EDA
326
  """)
327
 
328
  # Additional Resources
329
  st.header("Additional Resources")
330
  st.write("""
331
- - [Pandas Documentation](https://pandas.pydata.org/docs/)
332
- - [Seaborn Documentation](https://seaborn.pydata.org/)
333
- - [Plotly Documentation](https://plotly.com/python/)
334
- - [Data Cleaning Best Practices](https://towardsdatascience.com/data-cleaning-steps-and-process-8ae2d0f5147)
335
- - [Colab Notebook](https://colab.research.google.com/drive/1ScwSa8WBcOMCloXsTV5TPFoVrcPHXlW2#scrollTo=VDMRGRbSR0gc)
336
- - [Overleaf Project](https://www.overleaf.com/project/68228f4ccb9d18d92c26ba13)
337
  """)
 
6
  import plotly.express as px
7
  import plotly.graph_objects as go
8
  from plotly.subplots import make_subplots
9
+ from sklearn.cluster import KMeans
10
+ from sklearn.neighbors import KNeighborsClassifier
11
+ from sklearn.preprocessing import StandardScaler
12
+ from sklearn.metrics import silhouette_score
13
+ from statsmodels.datasets import get_rdataset
14
+ from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
15
 
16
  # Set up the style for all plots
17
  plt.style.use('default')
18
  sns.set_theme(style="whitegrid", palette="husl")
19
 
20
+ def load_arrests_data():
21
+ """Load and return the US Arrests dataset"""
22
+ USArrests = get_rdataset('USArrests').data
23
+ return USArrests
 
24
 
25
  def create_categorical_plot(df, column, target='Survived'):
26
  """Create an interactive plot for categorical variables"""
 
59
  return fig
60
 
61
  def show():
62
+ st.title("Week 7: Clustering Lab - State Crime Pattern Analysis")
63
 
64
+ # Code Example: Loading and Basic Data Exploration
65
+ with st.expander("Code Example: Loading and Exploring Data"):
66
+ st.code("""
67
+ # Load the data
68
+ from statsmodels.datasets import get_rdataset
69
+ USArrests = get_rdataset('USArrests').data
70
+
71
+ # Basic data exploration
72
+ print("Dataset shape:", USArrests.shape)
73
+ print("\\nVariables:", USArrests.columns.tolist())
74
+ print("\\nFirst 5 states:")
75
+ print(USArrests.head())
76
+
77
+ # Basic statistics
78
+ print("\\nData Summary:")
79
+ print(USArrests.describe())
80
+ """, language="python")
81
+
82
+ # Introduction Section with Learning Objectives
83
+ st.header("Learning Objectives")
84
+ st.markdown("""
85
+ This week, you'll master:
86
+ 1. **Unsupervised Learning**: Discover hidden patterns in crime data without predefined categories
87
+ 2. **K-Means Clustering**: Learn to divide states into distinct safety profiles
88
+ 3. **Hierarchical Clustering**: Create a "family tree" of state crime patterns
89
+ 4. **Data Preprocessing**: Understand why scaling is crucial for fair comparisons
90
  """)
91
 
92
+ # Interactive Overview
93
+ st.header("Lab Overview")
94
  st.write("""
95
+ Welcome to your hands-on clustering lab! You'll be working as a policy analyst for the Department of Justice,
96
+ analyzing crime patterns across US states. Your mission: discover hidden safety profiles that could inform
97
+ federal resource allocation and crime prevention strategies.
98
  """)
99
+
100
  # Load Data
101
+ st.header("Exercise 1: Data Detective Work")
102
+ st.write("Let's start by understanding our dataset - the US Arrests data.")
103
+
104
+ df = load_arrests_data()
105
+
106
+ # Code Example: Data Visualization
107
+ with st.expander("Code Example: Creating Visualizations"):
108
+ st.code("""
109
+ # Create correlation heatmap
110
+ import plotly.express as px
111
+ fig = px.imshow(df.corr(),
112
+ labels=dict(color="Correlation"),
113
+ color_continuous_scale="RdBu")
114
+ fig.show()
115
+
116
+ # Create box plots
117
+ fig = px.box(df, title="Data Distribution")
118
+ fig.show()
119
+ """, language="python")
120
+
121
+ # Interactive Data Exploration
122
+ col1, col2 = st.columns(2)
123
+
124
+ with col1:
125
+ st.subheader("Dataset Overview")
126
+ st.write(f"Number of states: {len(df)}")
127
+ st.write(f"Number of variables: {len(df.columns)}")
128
+ st.write("\nVariables:", df.columns.tolist())
129
+
130
+ # Interactive data summary
131
+ st.subheader("Data Summary")
132
+ summary = df.describe()
133
+ st.dataframe(summary)
134
+
135
+ with col2:
136
+ st.subheader("First 5 States")
137
+ st.dataframe(df.head())
138
+
139
+ # Interactive correlation heatmap
140
+ st.subheader("Correlation Heatmap")
141
+ fig = px.imshow(df.corr(),
142
+ labels=dict(color="Correlation"),
143
+ color_continuous_scale="RdBu")
144
+ st.plotly_chart(fig)
145
+
146
+ # Exercise 2: Scaling Challenge
147
+ st.header("Exercise 2: The Scaling Challenge")
148
+
149
+ # Code Example: Data Scaling
150
+ with st.expander("Code Example: Scaling Data"):
151
+ st.code("""
152
+ # Import StandardScaler
153
+ from sklearn.preprocessing import StandardScaler
154
+
155
+ # Create and fit the scaler
156
+ scaler = StandardScaler()
157
+ df_scaled = scaler.fit_transform(df)
158
+
159
+ # Convert back to DataFrame
160
+ df_scaled = pd.DataFrame(df_scaled,
161
+ columns=df.columns,
162
+ index=df.index)
163
+
164
+ # Compare original vs scaled data
165
+ print("Original data ranges:")
166
+ print(df.describe())
167
+ print("\\nScaled data ranges:")
168
+ print(df_scaled.describe())
169
+ """, language="python")
170
+
171
+ # Explanation of scaling
172
+ st.markdown("""
173
+ ### Why Do We Need Scaling?
174
+
175
+ In our crime data, we have variables measured in very different scales:
176
+ - Murder rates: typically 0-20 per 100,000
177
+ - Assault rates: typically 50-350 per 100,000
178
+ - Urban population: 0-100 percentage
179
+ - Rape rates: typically 0-50 per 100,000
180
+
181
+ Without scaling, variables with larger numbers (like Assault) would dominate our analysis,
182
+ making smaller-scale variables (like Murder) less influential. This would be like comparing
183
+ dollars to cents - the cents would seem insignificant even if they were important!
184
+ """)
185
+
186
+ # Show original data ranges
187
+ st.subheader("Original Data Ranges")
188
+ col1, col2 = st.columns(2)
189
+
190
+ with col1:
191
+ # Create a bar chart of variances
192
+ fig_var = px.bar(
193
+ x=df.columns,
194
+ y=df.var(),
195
+ title="Variance of Each Variable (Before Scaling)",
196
+ labels={'x': 'Crime Variables', 'y': 'Variance'},
197
+ color=df.var(),
198
+ color_continuous_scale='Viridis'
199
+ )
200
+ st.plotly_chart(fig_var)
201
+
202
+ st.write("""
203
+ Notice how Assault has a much larger variance (6,945) compared to Murder (19).
204
+ This means Assault would dominate our clustering if we didn't scale the data!
205
+ """)
206
+
207
+ with col2:
208
+ # Create box plots of original data
209
+ fig_box = px.box(df, title="Original Data Distribution")
210
+ fig_box.update_layout(
211
+ xaxis_title="Crime Variables",
212
+ yaxis_title="Rate per 100,000"
213
+ )
214
+ st.plotly_chart(fig_box)
215
+
216
+ # Explain standardization
217
+ st.markdown("""
218
+ ### What is Standardization?
219
+
220
+ Standardization (also called Z-score normalization) transforms our data so that:
221
+ 1. Each variable has a mean of 0
222
+ 2. Each variable has a standard deviation of 1
223
+
224
+ The formula is: z = (x - μ) / σ
225
+ - x is the original value
226
+ - μ is the mean of the variable
227
+ - σ is the standard deviation of the variable
228
  """)
229
 
230
+ # Scale the data
231
+ scaler = StandardScaler()
232
+ df_scaled = scaler.fit_transform(df)
233
+ df_scaled = pd.DataFrame(df_scaled, columns=df.columns, index=df.index)
234
+
235
+ # Show scaled data
236
+ st.subheader("After Scaling")
237
+
238
+ # Create box plots of scaled data
239
+ fig_scaled = px.box(df_scaled, title="Scaled Data Distribution")
240
+ fig_scaled.update_layout(
241
+ xaxis_title="Crime Variables",
242
+ yaxis_title="Standardized Values"
 
 
243
  )
244
+ st.plotly_chart(fig_scaled)
245
+
246
+ st.write("""
247
+ After scaling, all variables are on the same scale:
248
+ - Mean = 0
249
+ - Standard Deviation = 1
250
+ - Values typically range from -3 to +3
251
+ """)
252
+
253
+ # Show before/after comparison for a few states
254
+ st.write("### Before vs After Scaling (Sample States)")
255
+ comparison_df = pd.DataFrame({
256
+ 'State': df.index[:5],
257
+ 'Original Murder': df['Murder'][:5],
258
+ 'Scaled Murder': df_scaled['Murder'][:5],
259
+ 'Original Assault': df['Assault'][:5],
260
+ 'Scaled Assault': df_scaled['Assault'][:5]
261
+ })
262
+ st.dataframe(comparison_df)
263
+
264
+ st.write("""
265
+ Notice how the relative differences between states are preserved,
266
+ but now all variables contribute equally to our analysis!
267
+ """)
268
+
269
+ # Why scaling matters for clustering
270
+ st.markdown("""
271
+ ### Why Scaling Matters for Clustering
272
+
273
+ In clustering, we measure distances between data points. Without scaling:
274
+ - States might be grouped together just because they have similar assault rates
275
+ - Important differences in murder rates might be ignored
276
+
277
+ With scaling:
278
+ - All variables contribute equally to the distance calculations
279
+ - We can find true patterns in the data, not just patterns in the largest numbers
280
+ """)
281
+
282
+ # Exercise 3: Finding Optimal Clusters
283
+ st.header("Exercise 3: Finding the Right Number of Groups")
284
+
285
+ # Code Example: Elbow Method
286
+ with st.expander("Code Example: Finding Optimal K"):
287
+ st.code("""
288
+ # Calculate inertias for different K values
289
+ inertias = []
290
+ K_values = range(1, 11)
291
+
292
+ for k in K_values:
293
+ kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)
294
+ kmeans.fit(df_scaled)
295
+ inertias.append(kmeans.inertia_)
296
+
297
+ # Create elbow plot
298
+ import plotly.graph_objects as go
299
+ fig = go.Figure()
300
+ fig.add_trace(go.Scatter(
301
+ x=list(K_values),
302
+ y=inertias,
303
+ mode='lines+markers',
304
+ name='Inertia'
305
+ ))
306
+ fig.update_layout(
307
+ title='Finding the Optimal Number of Clusters',
308
+ xaxis_title='Number of Clusters (K)',
309
+ yaxis_title='Within-Cluster Sum of Squares'
310
+ )
311
+ fig.show()
312
+ """, language="python")
313
+
314
+ st.markdown("""
315
+ ### The Elbow Method Explained
316
+
317
+ The elbow method helps us find the optimal number of clusters (K) by looking at how the "within-cluster sum of squares"
318
+ (WCSS) changes as we increase the number of clusters. Think of it like this:
319
+
320
+ - **What is WCSS?** It's a measure of how spread out the points are within each cluster
321
+ - **Lower WCSS** means points are closer to their cluster center (better clustering)
322
+ - **Higher WCSS** means points are more spread out from their cluster center
323
+
324
+ As we increase K:
325
+ 1. WCSS always decreases (more clusters = tighter groups)
326
+ 2. The rate of decrease slows down
327
+ 3. We look for the "elbow" - where adding more clusters doesn't help much anymore
328
+ """)
329
+
330
+ # Calculate inertias for different K values
331
+ inertias = []
332
+ K_values = range(1, 11)
333
+
334
+ for k in K_values:
335
+ kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)
336
+ kmeans.fit(df_scaled)
337
+ inertias.append(kmeans.inertia_)
338
+
339
+ # Create interactive elbow plot
340
+ fig_elbow = go.Figure()
341
+ fig_elbow.add_trace(go.Scatter(
342
+ x=list(K_values),
343
+ y=inertias,
344
+ mode='lines+markers',
345
+ name='Inertia'
346
+ ))
347
+ fig_elbow.update_layout(
348
+ title='Finding the Optimal Number of State Crime Profiles',
349
+ xaxis_title='Number of Clusters (K)',
350
+ yaxis_title='Within-Cluster Sum of Squares',
351
  plot_bgcolor='rgb(30, 30, 30)',
352
  paper_bgcolor='rgb(30, 30, 30)',
353
  font=dict(color='white')
354
  )
355
+ st.plotly_chart(fig_elbow)
356
+
357
+ # Interpretation guide
358
+ st.markdown("""
359
+ ### How to Interpret the Elbow Plot
360
+
361
+ Look at the plot above and ask yourself:
362
+ 1. **Where is the "elbow"?**
363
+ - The point where the line starts to level off
364
+ - Adding more clusters doesn't give much improvement
365
+ - In our case, it's around K=4
366
+
367
+ 2. **What do the numbers mean?**
368
+ - K=1: All states in one group (not useful)
369
+ - K=2: Basic high/low crime split
370
+ - K=3: More nuanced grouping
371
+ - K=4: Our "elbow" - good balance of detail and simplicity
372
+ - K>4: Diminishing returns - more complexity without much benefit
373
+
374
+ 3. **Why not just use more clusters?**
375
+ - More clusters = more complex to interpret
376
+ - Small clusters might not be meaningful
377
+ - Goal is to find the simplest model that captures the main patterns
378
+ """)
379
+
380
+ # Show the actual values
381
+ st.write("### WCSS Values for Each K")
382
+ wcss_df = pd.DataFrame({
383
+ 'Number of Clusters (K)': K_values,
384
+ 'Within-Cluster Sum of Squares': inertias,
385
+ 'Improvement from Previous K': [0] + [inertias[i-1] - inertias[i] for i in range(1, len(inertias))]
386
+ })
387
+ st.dataframe(wcss_df)
388
+
389
+ st.markdown("""
390
+ ### Making the Decision
391
+
392
+ Based on our elbow plot and the numbers above:
393
+ 1. The biggest improvements happen from K=1 to K=4
394
+ 2. After K=4, the improvements get much smaller
395
+ 3. K=4 gives us a good balance of:
396
+ - Capturing meaningful patterns
397
+ - Keeping the model simple enough to interpret
398
+ - Having enough states in each cluster to be meaningful
399
+
400
+ This is why we'll use K=4 for our clustering analysis!
401
+ """)
402
+
403
+ # Exercise 4: K-Means Clustering
404
+ st.header("Exercise 4: K-Means State Profiling")
405
+
406
+ # Code Example: K-Means Clustering
407
+ with st.expander("Code Example: K-Means Implementation"):
408
+ st.code("""
409
+ # Perform K-means clustering
410
+ from sklearn.cluster import KMeans
411
+
412
+ # Create and fit the model
413
+ kmeans = KMeans(
414
+ n_clusters=4, # Number of clusters
415
+ random_state=42, # For reproducibility
416
+ n_init=20 # Number of times to run with different centroids
417
+ )
418
+ cluster_labels = kmeans.fit_predict(df_scaled)
419
+
420
+ # Add cluster labels to original data
421
+ df_clustered = df.copy()
422
+ df_clustered['Cluster'] = cluster_labels
423
+
424
+ # Visualize the clusters
425
+ import plotly.express as px
426
+ fig = px.scatter(df_clustered,
427
+ x='Murder',
428
+ y='Assault',
429
+ color='Cluster',
430
+ hover_data=['UrbanPop', 'Rape'],
431
+ title='State Crime Profiles')
432
+ fig.show()
433
+
434
+ # Show cluster centers
435
+ centers_df = pd.DataFrame(
436
+ kmeans.cluster_centers_,
437
+ columns=df.columns
438
+ )
439
+ print("Cluster Centers:")
440
+ print(centers_df)
441
+ """, language="python")
442
 
443
+ st.markdown("""
444
+ ### What is K-Means Clustering?
445
 
446
+ K-means is an unsupervised learning algorithm that groups similar data points together. Think of it like organizing
447
+ students into study groups based on their interests:
448
+
449
+ 1. **Initialization**:
450
+ - We randomly place K "centers" (centroids) in our data space
451
+ - Each center represents the "average" of its cluster
452
+ - In our case, each center represents a typical crime profile
453
+
454
+ 2. **Assignment**:
455
+ - Each state is assigned to its nearest center
456
+ - "Nearest" is measured by Euclidean distance
457
+ - States with similar crime patterns end up in the same cluster
458
+
459
+ 3. **Update**:
460
+ - Centers move to the average position of their assigned states
461
+ - This process repeats until centers stop moving
462
+ - The algorithm stops when the assignments no longer change; this is a local optimum, which is why we rerun it from several starting points (n_init) - a toy version of this loop is sketched below
463
+ """)
464
+
465
+ # Visualize the process
466
+ st.subheader("K-Means in Action")
467
  st.write("""
468
+ Let's see how K-means works with our state crime data. We'll start with K=4 clusters (you can change K with the slider below) to find distinct crime profiles.
 
 
 
469
  """)
470
 
471
+ # Let user choose number of clusters
472
+ k = st.slider("Choose number of clusters (K)", 2, 6, 4)
473
 
474
+ # Perform K-means clustering
475
+ kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)
476
+ cluster_labels = kmeans.fit_predict(df_scaled)
 
477
 
478
+ # Add cluster labels to original data
479
+ df_clustered = df.copy()
480
+ df_clustered['Cluster'] = cluster_labels
481
 
482
+ # Create interactive scatter plot
483
+ fig = px.scatter(df_clustered,
484
+ x='Murder',
485
+ y='Assault',
486
+ color='Cluster',
487
+ hover_data=['UrbanPop', 'Rape'],
488
+ title='State Crime Profiles')
489
+ st.plotly_chart(fig)
490
 
491
+ # Explain hyperparameters
492
+ st.markdown("""
493
+ ### K-Means Hyperparameters Explained
494
 
495
+ 1. **n_clusters (K)**
496
+ - The number of groups we want to create
497
+ - We chose K=4 based on the elbow method
498
+ - Each cluster represents a distinct crime profile
499
 
500
+ 2. **random_state**
501
+ - Controls the random initialization of centroids
502
+ - Setting it to 42 ensures reproducible results
503
+ - Different values might give slightly different clusters
504
+
505
+ 3. **n_init**
506
+ - Number of times to run the algorithm with different initial centroids
507
+ - We use 20 to find the best possible clustering
508
+ - Higher values give more reliable results but take longer
509
+
510
+ 4. **max_iter**
511
+ - Maximum number of iterations for each run
512
+ - Default is 300, which is usually enough
513
+ - Algorithm stops earlier if it converges
514
+
515
+ 5. **algorithm**
+ - 'lloyd': The standard K-means algorithm (the default in recent scikit-learn versions)
+ - 'elkan': Often faster when clusters are well separated
+ - Older scikit-learn versions also accepted 'auto' and 'full', which are now deprecated aliases
519
+ """)
520
 
521
+ # Show cluster centers
522
+ st.subheader("Cluster Centers (Typical Crime Profiles)")
523
+ centers_df = pd.DataFrame(
524
+ kmeans.cluster_centers_,
525
+ columns=df.columns
526
+ )
527
+ st.dataframe(centers_df)
528
 
 
 
529
  st.write("""
530
+ Each row is the "average" crime profile for that cluster, expressed in standardized (z-score) units
+ because the model was fit on the scaled data (the example below converts them back to original units). For example:
+ - High positive values in Murder and Assault indicate a high-crime cluster
+ - High UrbanPop with low crime values suggests urbanized but relatively safe states
+ - Low values across all metrics (including UrbanPop) point to more rural, low-crime states
534
  """)
535
 
536
+ # Display cluster analysis
537
+ st.subheader("State Crime Profiles Analysis")
538
+
539
+ for cluster_num in range(k):
540
+ cluster_states = df_clustered[df_clustered['Cluster'] == cluster_num]
541
+ st.write(f"\n**CLUSTER {cluster_num}: {len(cluster_states)} states**")
542
+ st.write("States:", ", ".join(cluster_states.index.tolist()))
543
+ st.write("Average characteristics:")
544
+ avg_profile = cluster_states[['Murder', 'Assault', 'UrbanPop', 'Rape']].mean()
545
+ st.write(avg_profile)
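+
+ # A compact alternative to the loop above, assuming df_clustered as created earlier:
+ # the same per-cluster averages as a single groupby table
+ with st.expander("Code Example: Cluster Averages with groupby"):
+     st.code("""
+ # Mean crime profile for each cluster in one summary table
+ cluster_summary = df_clustered.groupby('Cluster')[['Murder', 'Assault', 'UrbanPop', 'Rape']].mean()
+ print(cluster_summary.round(1))
+ """, language="python")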
546
+
547
+ # Explain the results
548
+ st.markdown("""
549
+ ### Interpreting the Results
550
+
551
+ Each cluster represents a distinct crime profile:
552
+ 1. **Cluster Characteristics**
553
+ - Look at the average values for each crime type
554
+ - Compare urban population percentages
555
+ - Identify the defining features of each cluster
556
+
557
+ 2. **State Groupings**
558
+ - States in the same cluster have similar crime patterns
559
+ - Geographic proximity doesn't always mean similar profiles
560
+ - Some states might surprise you with their cluster membership
561
+
562
+ 3. **Policy Implications**
563
+ - Clusters help identify states with similar challenges
564
+ - Can guide resource allocation and policy development
565
+ - Enables targeted interventions based on crime profiles
566
+ """)
567
+
568
+ # Exercise 5: Hierarchical Clustering
569
+ st.header("Exercise 5: Hierarchical Clustering Exploration")
570
+
571
+ # Code Example: Hierarchical Clustering
572
+ with st.expander("Code Example: Hierarchical Clustering"):
573
  st.code("""
574
+ # Create hierarchical clustering
575
+ from scipy.cluster.hierarchy import linkage, dendrogram
 
 
 
576
 
577
+ # Create linkage matrix
578
+ linkage_matrix = linkage(df_scaled, method='complete')
579
 
580
+ # Plot dendrogram
581
+ import plotly.graph_objects as go
582
+ dendro = dendrogram(linkage_matrix, labels=df.index.tolist(), no_plot=True)
583
 
584
+ # icoord/dcoord hold one set of x/y coordinates per link in the tree,
+ # so each link is drawn as its own line segment
+ fig = go.Figure()
+ for xs, ys in zip(dendro['icoord'], dendro['dcoord']):
+     fig.add_trace(go.Scatter(
+         x=xs,
+         y=ys,
+         mode='lines',
+         line=dict(color='white'),
+         showlegend=False
+     ))
591
+ fig.update_layout(
592
+ title='State Crime Pattern Family Tree',
593
+ xaxis_title='States',
594
+ yaxis_title='Distance Between Groups'
595
  )
596
+ fig.show()
597
+
598
+ # Cut the tree to get clusters
599
+ from scipy.cluster.hierarchy import fcluster
600
+ hierarchical_labels = fcluster(linkage_matrix, 4, criterion='maxclust') - 1  # cut into 4 clusters; shift labels to start at 0
601
  """, language="python")
602
 
603
+ st.markdown("""
604
+ ### What is Hierarchical Clustering?
605
+
606
+ Hierarchical clustering creates a tree-like structure (dendrogram) that shows how data points are related at different levels.
607
+ Think of it like building a family tree for states based on their crime patterns:
608
+
609
+ 1. **Bottom-Up Approach (Agglomerative)**:
610
+ - Start with each state as its own cluster
611
+ - Find the two closest states and merge them
612
+ - Continue merging until all states are in one cluster
613
+ - Creates a complete hierarchy of relationships
614
+
615
+ 2. **Distance Measurement**:
616
+ - Complete Linkage: merges clusters based on the maximum distance between their members
+ - Average Linkage: merges clusters based on the average distance between their members
+ - Single Linkage: merges clusters based on the minimum distance between their members
+ - We use complete linkage for more distinct clusters (the code example below compares the three)
620
  """)
621
 
622
+ # Create hierarchical clustering
623
+ linkage_matrix = linkage(df_scaled, method='complete')
624
+
625
+ # Create interactive dendrogram
626
+ fig_dendro = go.Figure()
627
+ dendro = dendrogram(linkage_matrix, labels=df.index.tolist(), no_plot=True)
628
+
629
+ # Each entry in icoord/dcoord describes one link of the tree, so plot the links one by one
+ for xs, ys in zip(dendro['icoord'], dendro['dcoord']):
+     fig_dendro.add_trace(go.Scatter(
+         x=xs,
+         y=ys,
+         mode='lines',
+         line=dict(color='white'),
+         showlegend=False
+     ))
635
+
636
+ fig_dendro.update_layout(
637
+ title='State Crime Pattern Family Tree',
638
+ xaxis_title='States',
639
+ yaxis_title='Distance Between Groups',
640
+ plot_bgcolor='rgb(30, 30, 30)',
641
+ paper_bgcolor='rgb(30, 30, 30)',
642
+ font=dict(color='white')
643
  )
644
+ st.plotly_chart(fig_dendro)
645
+
646
+ # Explain how to read the dendrogram
647
+ st.markdown("""
648
+ ### How to Read the Dendrogram
649
+
650
+ 1. **Height of Connections**:
651
+ - Higher connections = more different groups
652
+ - Lower connections = more similar groups
653
+ - The height shows how different two groups are
654
+
655
+ 2. **Cutting the Tree**:
656
+ - Draw a horizontal line to create clusters
657
+ - Where you cut determines the number of clusters
658
+ - We'll cut at a height that gives us the same number of clusters as K-means (the K chosen with the slider above)
659
+ """)
660
+
661
+ # Cut the tree to get clusters
662
+ hierarchical_labels = fcluster(linkage_matrix, k, criterion='maxclust') - 1
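+
+ # A minimal sketch, assuming df_scaled and k as defined above: scikit-learn's
+ # AgglomerativeClustering performs the same bottom-up clustering without building
+ # the dendrogram by hand
+ with st.expander("Code Example: The scikit-learn Equivalent"):
+     st.code("""
+ from sklearn.cluster import AgglomerativeClustering
+
+ agg = AgglomerativeClustering(n_clusters=k, linkage='complete')
+ sklearn_labels = agg.fit_predict(df_scaled)
+
+ # Cluster numbers are assigned arbitrarily, so compare the groupings themselves
+ # rather than the raw label values
+ print(sklearn_labels)
+ """, language="python")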
663
+
664
+ # Compare K-means and Hierarchical Clustering
665
+ st.header("Comparing K-Means and Hierarchical Clustering")
666
+
667
+ # Create side-by-side comparison
668
+ col1, col2 = st.columns(2)
669
+
670
+ with col1:
671
+ st.subheader("K-Means Clustering")
672
+ fig_kmeans = px.scatter(df_clustered,
673
+ x='Murder',
674
+ y='Assault',
675
+ color='Cluster',
676
+ title=f'K-Means Clustering (K={k})',
677
+ hover_data=['UrbanPop', 'Rape'])
678
+ st.plotly_chart(fig_kmeans)
679
+
680
+ st.markdown("""
681
+ **K-Means Characteristics**:
682
+ - Requires specifying number of clusters upfront
683
+ - Creates clusters of similar size
684
+ - Works well with spherical clusters
685
+ - Faster for large datasets
686
+ - Can be sensitive to outliers
687
+ """)
688
+
689
+ with col2:
690
+ st.subheader("Hierarchical Clustering")
691
+ df_hierarchical = df.copy()
692
+ df_hierarchical['Cluster'] = hierarchical_labels
693
+ fig_hierarchical = px.scatter(df_hierarchical,
694
+ x='Murder',
695
+ y='Assault',
696
+ color='Cluster',
697
+ title=f'Hierarchical Clustering ({k} clusters)',
698
+ hover_data=['UrbanPop', 'Rape'])
699
+ st.plotly_chart(fig_hierarchical)
700
+
701
+ st.markdown("""
702
+ **Hierarchical Clustering Characteristics**:
703
+ - Creates a complete hierarchy of clusters
704
+ - Can handle non-spherical clusters
705
+ - More flexible in cluster shapes
706
+ - Slower for large datasets
707
+ - Less sensitive to outliers
708
+ """)
709
 
710
+ # Show agreement between methods
711
+ st.subheader("Comparing the Results")
712
 
713
+ # Create comparison dataframe
714
+ comparison_df = pd.DataFrame({
715
+ 'State': df.index,
716
+ 'K-Means Cluster': cluster_labels,
717
+ 'Hierarchical Cluster': hierarchical_labels
718
+ })
719
 
720
+ # Count raw label agreements (cluster numbers are assigned arbitrarily by each method,
+ # so this only counts how often the two methods happened to use the same label number)
+ agreements = sum(comparison_df['K-Means Cluster'] == comparison_df['Hierarchical Cluster'])
+ agreement_percentage = (agreements / len(comparison_df)) * 100
+
+ st.write(f"The two methods assigned the same cluster label to {agreements} out of {len(comparison_df)} states ({agreement_percentage:.1f}%). Cluster numbers are arbitrary, though, so see the adjusted Rand index example below for a label-independent comparison.")
725
+
726
+ # Show states where methods disagree
727
+ disagreements = comparison_df[comparison_df['K-Means Cluster'] != comparison_df['Hierarchical Cluster']]
728
+ if not disagreements.empty:
729
+ st.write("States where the methods disagreed:")
730
+ st.dataframe(disagreements)
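+
+ # A label-independent agreement measure, assuming cluster_labels and hierarchical_labels
+ # as computed above: the adjusted Rand index compares the groupings themselves rather
+ # than the arbitrary cluster numbers
+ with st.expander("Code Example: Agreement with the Adjusted Rand Index"):
+     st.code("""
+ from sklearn.metrics import adjusted_rand_score
+
+ # 1.0 means identical groupings; values near 0 mean agreement no better than chance
+ ari = adjusted_rand_score(cluster_labels, hierarchical_labels)
+ print(f"Adjusted Rand index: {ari:.2f}")
+ """, language="python")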
731
+
732
+ st.markdown("""
733
+ ### When to Use Each Method
734
+
735
+ 1. **Use K-Means when**:
736
+ - You know the number of clusters
737
+ - Your data has spherical clusters
738
+ - You need fast computation
739
+ - You want clusters of similar size
740
+
741
+ 2. **Use Hierarchical Clustering when**:
742
+ - You don't know the number of clusters
743
+ - You want to explore the hierarchy
744
+ - Your clusters might be non-spherical
745
+ - You need to handle outliers carefully
746
+
747
+ In our case, both methods found similar patterns, suggesting our clusters are robust!
748
+ """)
749
+
750
+ # Exercise 6: Policy Brief
751
+ st.header("Exercise 6: Policy Brief Creation")
752
+
753
+ # Code Example: Creating Final Visualizations
754
+ with st.expander("Code Example: Creating Policy Brief Visualizations"):
755
+ st.code("""
756
+ # Create a comprehensive visualization
757
+ import plotly.graph_objects as go
758
+ from plotly.subplots import make_subplots
759
+
760
+ # Create subplots
761
+ fig = make_subplots(rows=1, cols=2, subplot_titles=("Murder vs Assault", "Urban Population vs Rape"))
762
+
763
+ # Plot 1: Murder vs Assault by cluster
764
+ for i in range(k):
765
+ cluster_data = df_clustered[df_clustered['Cluster'] == i]
766
+ fig.add_trace(
767
+ go.Scatter(
768
+ x=cluster_data['Murder'],
769
+ y=cluster_data['Assault'],
770
+ mode='markers',
771
+ name=f'Cluster {i}'
772
+ ),
773
+ row=1, col=1
774
+ )
775
+
776
+ # Plot 2: Urban Population vs Rape by cluster
777
+ for i in range(k):
778
+ cluster_data = df_clustered[df_clustered['Cluster'] == i]
779
+ fig.add_trace(
780
+ go.Scatter(
781
+ x=cluster_data['UrbanPop'],
782
+ y=cluster_data['Rape'],
783
+ mode='markers',
784
+ name=f'Cluster {i}'
785
+ ),
786
+ row=1, col=2
787
+ )
788
+
789
+ # Update layout
790
+ fig.update_layout(
791
+ title_text="State Crime Profile Analysis",
792
+ showlegend=True
793
+ )
794
+ fig.show()
795
+ """, language="python")
796
 
 
 
797
  st.write("""
798
+ Based on our analysis, here's a summary of findings and recommendations:
799
+
800
+ **Key Findings:**
801
+ - We identified distinct crime profiles among US states
802
+ - Each cluster represents a unique pattern of crime rates and urban population
803
+ - Some states show surprising similarities despite geographic distance
804
+
805
+ **Policy Recommendations:**
806
+ 1. High-Priority States: Focus on states in high-crime clusters
807
+ 2. Resource Allocation: Distribute federal crime prevention funds based on cluster profiles
808
+ 3. Best Practice Sharing: Encourage states within the same cluster to share successful strategies
809
  """)
810
 
811
  # Additional Resources
812
  st.header("Additional Resources")
813
  st.write("""
814
+ - [Scikit-learn Clustering Documentation](https://scikit-learn.org/stable/modules/clustering.html)
815
+ - [KNN Documentation](https://scikit-learn.org/stable/modules/neighbors.html)
816
  """)