White Hat Security Agent Prompts 600K Dataset by Yatin Taneja

Community Article Published March 23, 2026

Access the full dataset and data viewer on Hugging Face here.

Overview

The White-Hat-Security-Agent-Prompts-600K dataset is a practitioner-perspective security prompts corpus of 596,295 richly contextualized queries, designed to represent how real-world defensive security professionals communicate, interrogate, and reason through active threat scenarios.

Where most security datasets catalogue CVEs, malware signatures, or CTF write-ups, this collection teaches models to operate from inside the defender's mind, receiving complex, multi-layered security challenges the way a Trust & Safety lead, CISO, or threat hunter would actually frame them during live operations.

The Defender's Vantage Point

Every prompt in this dataset is written from an active operational posture. The model is not given sanitized, textbook questions; it is placed inside scenarios that carry all the complexity, urgency, and technical specificity of a live security engagement.

The prompts span the full spectrum of a security professional's working context:

Incident Response Mode: Active compromise, live SCADA breach, exfiltration in progress. Prompts that demand immediate, technically precise, prioritized guidance.
Red Team Simulation: Authorized adversarial scenario planning, threat emulation, and controlled attack-path analysis for enterprise hardening.
Paranoid CISO Review: Deep architectural skepticism, vendor trust assessments, and systemic resilience evaluation across critical infrastructure.
Post-Mortem Analysis: Retrospective forensic dissection of attack chains, attribution analysis, and control gap identification.
Threat Intelligence Briefing: Nation-state TTPs, emerging threat actor profiling, and geopolitical threat vector contextualization.

This is not the perspective of a student asking how encryption works. It is the perspective of a practitioner demanding to know what their next 90 seconds should look like.

Taxonomy & Engineering Architecture

The dataset is generated from a highly granular security taxonomy spanning conventional cybersecurity, AI safety, and emerging frontier threat categories. Each vector carries its own curated threat registry, attacker tooling repertoire, and defensive system landscape.

Security Domains:

Category	Domains
Information Security	Network, Malware, Web, Social Engineering, Cloud, Supply Chain, IoT/OT, Finance & DeFi, Insider Threat, Privacy, Identity & IAM, Mobile, Physical/OPSEC, Critical Infrastructure, Telecom
AI Safety	Adversarial ML, Malicious Intent Detection, Model Alignment
Emerging & Frontier	Quantum Cryptography, Synthetic Biology, Autonomous Systems
Advanced Persistent Threats	Nation-State APT Operations

Combinatorial Engineering

The generation matrix for each domain independently parameterizes:

Threat: Specific, named adversarial capability (e.g., Harvest Now Decrypt Later, Mirai-style Botnets, Hardware Trojans, Flash Loan Attacks)
Attack Vector: The precise technical entry or exploitation pathway
Practitioner Role: The security professional framing and expertise level
Defensive System: The specific control surface or tooling stack in scope
Target Sector: Industry vertical contextualizing the operational environment
Impact Level: Severity stratification from business nuisance to existential risk

This yields a vast combinatorial search space of over 76.8 Million unique threat scenarios across the entire architectural landscape. The 596,295 prompts in this dataset represent a carefully sampled cross-section of that space, curated for maximum contextual diversity.

Architecture & Scale

Summary Statistics:

Total Prompts: 596,295
Unique Threat Categories: 131 specifically named adversarial capabilities (spanning conventional InfoSec, AI Safety, and frontier threats)
Impact Level Tiers: 5 (uniformly distributed across severity spectrum)
Average Prompt Density: ~211 words of domain-specific, operationally grounded context per prompt
Combinatorial Base Volume: Sampled from an exhaustive space of over 76.8 Million unique threat permutations

Impact Level Distribution (approximately uniform by design):

Impact Level	Description
`Catastrophic (Existential / Loss of Life)`	Scenarios threatening human life, national sovereignty, or civilizational systems
`Critical (National Security / Safety Risk)`	Critical infrastructure compromise, government systems, strategic assets
`High (Financial/Reputational Damage)`	Enterprise-scale financial loss, regulatory exposure, brand destruction
`Medium (Business Disruption)`	Operational downtime, data breach, customer-facing degradation
`Low (Nuisance)`	Isolated incidents, minor data exposure, limited blast radius

Data Structure / Schema

The dataset is distributed natively chunked in .parquet files and has been meticulously cleaned to ensure 100% data density.

Column	Type	Description
`batch_index`	int64	Fixed sequence index for reproducible sampling and deduplication
`user_prompt`	string	The full practitioner-framed security prompt, the core content of the dataset
`threat`	string	Named threat category the scenario is centered around (131 unique values)
`impact_level`	string	Severity classification of the underlying threat scenario (5 tiers)

Recommended Use Cases

Security-Specialized LLM Fine-Tuning: Train base models to understand and respond accurately to the technical language, urgency, and operational context of real security engagements, spanning 131 distinct threat categories and over 76 Million unique attack permutations.
SOC Assistant Development: Source material for fine-tuning AI assistants that support Security Operations Center (SOC) analysts with threat-aware, contextually grounded guidance.
Threat-Aware Instruction Following: Train models to calibrate response depth and precision based on the impact_level signal, producing appropriately cautious, detail-rich guidance for Critical and Catastrophic scenarios.
Multi-Domain Security Classification: Use the threat column to train classifiers that can identify which specific adversarial category an incoming query relates to across 131 named threat vectors.
Red Team Scenario Generation Research: Study the linguistic and structural patterns of expert-level red team scenario framing to build systems that can generate or evaluate adversarial test cases.
AI Safety and Alignment Research: The AISafety domain subset provides prompts specifically addressing adversarial ML, prompt injection, model alignment failures, and malicious intent detection, and is directly useful for frontier model safety work.

Developer & Architect

This dataset, its expansive 131-category taxonomy, combinatorial generation matrix, and multi-agent engineering pipeline were designed and built by Yatin Taneja.

In an era where adversaries only have to be right once, security agents must be intelligent everywhere. I believe that the best defense against emerging AI threats requires systems that can think like practitioners, not systems trained on sanitized textbooks. The security professional's mindset is one of radical skepticism, contextual pattern recognition, and adaptive reasoning under pressure. That is precisely what this dataset is built to instill.

The frontier of AI safety work requires models that don't just know what a supply chain attack is; they need to understand what it feels like to be the engineer responsible for stopping one at 2am on a Wednesday.

Weblinks

IM Superintelligence: Visit my central knowledge hub hosting other open datasets and over 2,000 articles exploring Superintelligence, cognitive architectures, quantum computing, distributed networks, and the future of the global education sector, authored through a custom 8-step multi-model agentic infrastructure.
Yatin Taneja | Professional Portfolio: View my professional portfolio for a comprehensive overview of my skills, industry experience, and software prototypes.
LinkedIn: Connect to collaborate on advanced autonomous systems, enterprise AI implementations, or to follow my ongoing research.

License & Usage

This dataset is released under the Creative Commons Attribution 4.0 International License (CC-BY 4.0). You are free to use, share, redistribute, and build upon this dataset for any purpose, including commercial model training and research applications, provided that appropriate credit is given to the original author.

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote