{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "b1c6a137",
      "metadata": {
        "id": "b1c6a137"
      },
      "source": [
        "# Clustering Lab: State Crime Pattern Analysis\n",
        "\n",
        "## Lab Overview\n",
        "\n",
        "Welcome to your hands-on clustering lab! You'll be working as a policy analyst for the Department of Justice, analyzing crime patterns across US states. Your mission: discover hidden safety profiles that could inform federal resource allocation and crime prevention strategies.\n",
        "\n",
        "**Your Deliverable**: A policy brief with visualizations and recommendations based on your clustering analysis.\n",
        "\n",
        "---\n",
        "\n",
        "## Exercise 1: Data Detective Work\n",
        "**Time: 15 minutes | Product: Data Summary Report**\n",
        "\n",
        "### Your Task\n",
        "Before any analysis, you need to understand what you're working with. Create a brief data summary that a non-technical policy maker could understand.\n"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "import pandas as pd\n",
        "import matplotlib.pyplot as plt\n",
        "from statsmodels.datasets import get_rdataset\n",
        "from sklearn.preprocessing import StandardScaler\n",
        "from sklearn.cluster import KMeans, AgglomerativeClustering\n",
        "\n",
        "# Load the data (get_rdataset downloads it, so an internet connection is needed)\n",
        "USArrests = get_rdataset('USArrests').data\n",
        "print(\"Dataset shape:\", USArrests.shape)\n",
        "print(\"\\nVariables:\", USArrests.columns.tolist())\n",
        "print(\"\\nFirst 5 states:\")\n",
        "print(USArrests.head())"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 106
        },
        "id": "mqRVE1hlXK9x",
        "outputId": "5a1bbd64-15cd-4e1c-9344-64a901d8a396"
      },
      "id": "mqRVE1hlXK9x",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Your Investigation\n",
        "Complete this data summary table:\n",
        "\n",
        "| Variable | What it measures | Average Value | Highest State | Lowest State |\n",
        "|----------|------------------|---------------|---------------|--------------|\n",
        "| Murder | Murder arrests per 100,000 residents | ??? | ??? | ??? |\n",
        "| Assault | Assault arrests per 100,000 residents | ??? | ??? | ??? |\n",
        "| UrbanPop | Percentage of population living in urban areas | ??? | ??? | ??? |\n",
        "| Rape | Rape arrests per 100,000 residents | ??? | ??? | ??? |\n",
        "\n",
        "**Deliverable**: Write 2-3 sentences describing the biggest surprises in this data. Which states are not what you expected?\n",
        "\n",
        "---\n",
        "\n",
        "## Exercise 2: The Scaling Challenge\n",
        "**Time: 10 minutes | Product: Before/After Comparison**\n",
        "\n",
        "### Your Task\n",
        "Demonstrate why scaling is critical for clustering crime data.\n"
      ],
      "metadata": {
        "id": "7qkDKTe4XLtG"
      },
      "id": "7qkDKTe4XLtG"
    },
    {
      "cell_type": "code",
      "source": [
        "# Check the scale differences\n",
        "print(\"Original data ranges:\")\n",
        "print(USArrests.describe())\n",
        "\n",
        "print(\"\\nVariances (how spread out the data is):\")\n",
        "print(USArrests.var())\n",
        "\n",
        "# Scale the data: subtract each column's mean and divide by its standard deviation\n",
        "scaler = StandardScaler()\n",
        "USArrests_scaled = scaler.fit_transform(USArrests)\n",
        "scaled_df = pd.DataFrame(USArrests_scaled,\n",
        "                        columns=USArrests.columns,\n",
        "                        index=USArrests.index)\n",
        "\n",
        "print(\"\\nAfter scaling - every variable is centered at 0 with a standard deviation of about 1:\")\n",
        "print(scaled_df.describe())"
      ],
      "metadata": {
        "id": "zQ3VowYNXLeQ"
      },
      "id": "zQ3VowYNXLeQ",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Your Analysis\n",
        "1. **Before scaling**: Which variable would dominate the clustering? Why?\n",
        "2. **After scaling**: Explain in simple terms what StandardScaler did to the data.\n",
        "\n",
        "**Deliverable**: One paragraph explaining why a policy analyst should care about data scaling.\n",
        "\n",
        "---\n",
        "\n",
        "## Exercise 3: Finding the Right Number of Groups\n",
        "**Time: 20 minutes | Product: Recommendation with Visual Evidence**\n",
        "\n",
        "### Your Task\n",
        "Use the elbow method to determine how many distinct crime profiles exist among US states.\n"
      ],
      "metadata": {
        "id": "FnOT700SXLPh"
      },
      "id": "FnOT700SXLPh"
    },
    {
      "cell_type": "code",
      "source": [
        "# Test different numbers of clusters\n",
        "inertias = []\n",
        "K_values = range(1, 11)\n",
        "\n",
        "for k in K_values:\n",
        "    kmeans = KMeans(n_clusters=k, random_state=42, n_init=20)\n",
        "    kmeans.fit(USArrests_scaled)\n",
        "    # inertia_ is the within-cluster sum of squared distances to the centroids\n",
        "    inertias.append(kmeans.inertia_)\n",
        "\n",
        "# Create the elbow plot\n",
        "plt.figure(figsize=(10, 6))\n",
        "plt.plot(K_values, inertias, 'bo-', linewidth=2, markersize=8)\n",
        "plt.xlabel('Number of Clusters (K)')\n",
        "plt.ylabel('Within-Cluster Sum of Squares')\n",
        "plt.title('Finding the Optimal Number of State Crime Profiles')\n",
        "plt.grid(True, alpha=0.3)\n",
        "plt.show()\n",
        "\n",
        "# Print the inertia values\n",
        "for k, inertia in zip(K_values, inertias):\n",
        "    print(f\"K={k}: Inertia = {inertia:.1f}\")"
      ],
      "metadata": {
        "id": "zOQrS9lmXpTF"
      },
      "id": "zOQrS9lmXpTF",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "id": "2e388ef2",
      "metadata": {
        "id": "2e388ef2"
      },
      "source": [
        "### Your Decision\n",
        "Based on your elbow plot:\n",
        "1. **What value of K do you recommend?** (Look for the \"elbow\" where the line starts to flatten)\n",
        "2. **What does this mean in policy terms?** (How many distinct types of state crime profiles exist?)\n",
        "\n",
        "**Deliverable**: A one-paragraph recommendation with your chosen K value and reasoning.\n",
        "\n",
        "---\n",
        "\n",
        "## Exercise 4: K-Means State Profiling\n",
        "**Time: 25 minutes | Product: State Crime Profile Report**\n",
        "\n",
        "### Your Task\n",
        "Create distinct crime profiles and identify which states belong to each category.\n"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Use your chosen K value from Exercise 3\n",
        "optimal_k = 4  # Replace with your chosen value\n",
        "\n",
        "# Perform K-means clustering\n",
        "kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=20)\n",
        "cluster_labels = kmeans.fit_predict(USArrests_scaled)\n",
        "\n",
        "# Add cluster labels to original data\n",
        "USArrests_clustered = USArrests.copy()\n",
        "USArrests_clustered['Cluster'] = cluster_labels\n",
        "\n",
        "# Analyze each cluster (averages are reported in the original, unscaled units)\n",
        "print(\"State Crime Profiles Analysis\")\n",
        "print(\"=\" * 50)\n",
        "\n",
        "for cluster_num in range(optimal_k):\n",
        "    cluster_states = USArrests_clustered[USArrests_clustered['Cluster'] == cluster_num]\n",
        "    print(f\"\\nCLUSTER {cluster_num}: {len(cluster_states)} states\")\n",
        "    print(\"States:\", \", \".join(cluster_states.index.tolist()))\n",
        "    print(\"Average characteristics:\")\n",
        "    avg_profile = cluster_states[['Murder', 'Assault', 'UrbanPop', 'Rape']].mean()\n",
        "    for var, value in avg_profile.items():\n",
        "        print(f\"  {var}: {value:.1f}\")"
      ],
      "metadata": {
        "id": "_5b0nE6KXv1P"
      },
      "id": "_5b0nE6KXv1P",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Your Analysis\n",
        "For each cluster, create a profile:\n",
        "\n",
        "**Cluster 0: \"[Your Creative Name]\"**\n",
        "- **States**: [List them]\n",
        "- **Characteristics**: [Describe the pattern]\n",
        "- **Policy Insight**: [What should federal agencies know about these states?]\n",
        "\n",
        "**Deliverable**: A table summarizing each cluster with creative names and policy recommendations.\n",
        "\n",
        "---\n",
        "\n",
        "## Exercise 5: Hierarchical Clustering Exploration\n",
        "**Time: 25 minutes | Product: Family Tree Interpretation**\n",
        "\n",
        "### Your Task\n",
        "Create a dendrogram to understand how states naturally group together.\n"
      ],
      "metadata": {
        "id": "J1WVGb_nX4ye"
      },
      "id": "J1WVGb_nX4ye"
    },
    {
      "cell_type": "code",
      "source": [
        "from scipy.cluster.hierarchy import dendrogram, linkage\n",
        "\n",
        "# Create hierarchical clustering (complete linkage on the scaled data)\n",
        "linkage_matrix = linkage(USArrests_scaled, method='complete')\n",
        "\n",
        "# Plot the dendrogram\n",
        "plt.figure(figsize=(15, 8))\n",
        "dendrogram(linkage_matrix,\n",
        "           labels=USArrests.index.tolist(),\n",
        "           leaf_rotation=90,\n",
        "           leaf_font_size=10)\n",
        "plt.title('State Crime Pattern Family Tree')\n",
        "plt.xlabel('States')\n",
        "plt.ylabel('Distance Between Groups')\n",
        "plt.tight_layout()\n",
        "plt.show()"
      ],
      "metadata": {
        "id": "Y9a_cbZKX7QX"
      },
      "id": "Y9a_cbZKX7QX",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Your Interpretation\n",
        "1. **Closest Pairs**: Which two states are most similar in crime patterns?\n",
        "2. **Biggest Divide**: Where is the largest split in the tree? What does this represent?\n",
        "3. **Surprising Neighbors**: Which states cluster together that surprised you geographically?\n",
        "\n",
        "### Code to Compare Methods"
      ],
      "metadata": {
        "id": "0PaImqZtX6f3"
      },
      "id": "0PaImqZtX6f3"
    },
    {
      "cell_type": "code",
      "source": [
        "# Compare your K-means results with hierarchical clustering\n",
        "from scipy.cluster.hierarchy import fcluster\n",
        "from sklearn.metrics import adjusted_rand_score\n",
        "\n",
        "# Cut the tree to get the same number of clusters as K-means\n",
        "hierarchical_labels = fcluster(linkage_matrix, optimal_k, criterion='maxclust') - 1\n",
        "\n",
        "# Create comparison\n",
        "comparison_df = pd.DataFrame({\n",
        "    'State': USArrests.index,\n",
        "    'K_Means_Cluster': cluster_labels,\n",
        "    'Hierarchical_Cluster': hierarchical_labels\n",
        "})\n",
        "\n",
        "print(\"Comparison of K-Means vs Hierarchical Clustering:\")\n",
        "print(comparison_df.sort_values('State'))\n",
        "\n",
        "# Cluster numbers are arbitrary in both methods, so comparing labels directly is not meaningful.\n",
        "# Cross-tabulate the assignments and report the adjusted Rand index (1.0 = identical groupings).\n",
        "print(\"\\nCross-tabulation of cluster assignments:\")\n",
        "print(pd.crosstab(comparison_df['K_Means_Cluster'], comparison_df['Hierarchical_Cluster']))\n",
        "ari = adjusted_rand_score(cluster_labels, hierarchical_labels)\n",
        "print(f\"\\nAdjusted Rand index: {ari:.2f}\")"
      ],
      "metadata": {
        "id": "tJQ-C5GFYBRT"
      },
      "id": "tJQ-C5GFYBRT",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "**Deliverable**: A paragraph explaining the key differences between what K-means and hierarchical clustering revealed.\n",
        "\n",
        "---\n",
        "\n",
        "## Exercise 6: Policy Brief Creation\n",
        "**Time: 20 minutes | Product: Executive Summary**\n",
        "\n",
        "### Your Task\n",
        "Synthesize your findings into a policy brief for Department of Justice leadership.\n",
        "\n",
        "### Code Framework for Final Visualization"
      ],
      "metadata": {
        "id": "dx1fNhu4YD7-"
      },
      "id": "dx1fNhu4YD7-"
    },
    {
      "cell_type": "code",
      "source": [
        "# Create a comprehensive visualization\n",
        "fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(15, 12))\n",
        "\n",
        "# Plot 1: Murder vs Assault by cluster\n",
        "# Note: this list supports up to 5 clusters; add more colors if optimal_k is larger\n",
        "colors = ['red', 'blue', 'green', 'orange', 'purple']\n",
        "for i in range(optimal_k):\n",
        "    cluster_data = USArrests_clustered[USArrests_clustered['Cluster'] == i]\n",
        "    ax1.scatter(cluster_data['Murder'], cluster_data['Assault'],\n",
        "               c=colors[i], label=f'Cluster {i}', s=60, alpha=0.7)\n",
        "ax1.set_xlabel('Murder Rate')\n",
        "ax1.set_ylabel('Assault Rate')\n",
        "ax1.set_title('Murder vs Assault by Crime Profile')\n",
        "ax1.legend()\n",
        "ax1.grid(True, alpha=0.3)\n",
        "\n",
        "# Plot 2: Urban Population vs Rape by cluster\n",
        "for i in range(optimal_k):\n",
        "    cluster_data = USArrests_clustered[USArrests_clustered['Cluster'] == i]\n",
        "    ax2.scatter(cluster_data['UrbanPop'], cluster_data['Rape'],\n",
        "               c=colors[i], label=f'Cluster {i}', s=60, alpha=0.7)\n",
        "ax2.set_xlabel('Urban Population %')\n",
        "ax2.set_ylabel('Rape Rate')\n",
        "ax2.set_title('Urban Population vs Rape Rate by Crime Profile')\n",
        "ax2.legend()\n",
        "ax2.grid(True, alpha=0.3)\n",
        "\n",
        "# Plot 3: Cluster size comparison\n",
        "cluster_sizes = USArrests_clustered['Cluster'].value_counts().sort_index()\n",
        "ax3.bar(range(len(cluster_sizes)), cluster_sizes.values, color=colors[:len(cluster_sizes)])\n",
        "ax3.set_xlabel('Cluster Number')\n",
        "ax3.set_ylabel('Number of States')\n",
        "ax3.set_title('Number of States in Each Crime Profile')\n",
        "ax3.set_xticks(range(len(cluster_sizes)))\n",
        "\n",
        "# Plot 4: Average crime rates by cluster\n",
        "cluster_means = USArrests_clustered.groupby('Cluster')[['Murder', 'Assault', 'Rape']].mean()\n",
        "cluster_means.plot(kind='bar', ax=ax4)\n",
        "ax4.set_xlabel('Cluster Number')\n",
        "ax4.set_ylabel('Average Rate')\n",
        "ax4.set_title('Average Crime Rates by Profile')\n",
        "ax4.legend()\n",
        "ax4.tick_params(axis='x', rotation=0)\n",
        "\n",
        "plt.tight_layout()\n",
        "plt.show()"
      ],
      "metadata": {
        "id": "N8bkxURpYHJF"
      },
      "id": "N8bkxURpYHJF",
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Your Policy Brief Template\n",
        "\n",
        "**EXECUTIVE SUMMARY: US State Crime Profile Analysis**\n",
        "\n",
        "**Key Findings:**\n",
        "- We identified [X] distinct crime profiles among US states\n",
        "- [State examples] represent the highest-risk profile\n",
        "- [State examples] represent the lowest-risk profile\n",
        "- Urban population [does/does not] strongly correlate with violent crime\n",
        "\n",
        "**Policy Recommendations:**\n",
        "1. **High-Priority States**: [List and explain why]\n",
        "2. **Resource Allocation**: [Suggest how to distribute federal crime prevention funds]\n",
        "3. **Best Practice Sharing**: [Which states should learn from which others?]\n",
        "\n",
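        "**Methodology Note**: Analysis used unsupervised clustering (K-means and hierarchical) on four variables from the 1973 USArrests dataset (murder, assault, and rape arrests per 100,000 residents, plus percent urban population) across 50 states, with standardization so that no single variable dominates the distance calculation.\n",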
        "\n",
        "**Deliverable**: A complete 1-page policy brief with your clustering insights and specific recommendations.\n"
      ],
      "metadata": {
        "id": "rAy_Ye0WYLK0"
      },
      "id": "rAy_Ye0WYLK0"
    }
  ],
  "metadata": {
    "jupytext": {
      "cell_metadata_filter": "-all",
      "formats": "Rmd,ipynb",
      "main_language": "python"
    },
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.4"
    },
    "colab": {
      "provenance": []
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}