Thoughts on NIST's Draft Cybersecurity Framework Profile for AI
Deployment patterns, evidence primitives, and procurement artifacts: three additions that make the profile work in practice
NIST just released a preliminary draft of a Cybersecurity Framework (CSF) Profile for AI and invited public comment. My thoughts on how to improve that draft are below.
For context, a CSF profile adapts the CSF’s structure (govern, identify, protect, detect, respond, recover) to a particular domain so organizations can align expectations, controls, and evidence without inventing a new framework. This draft is a useful step toward shared language for AI cybersecurity.
My feedback is aimed at making it more operational in three places:
Add a small bridge layer that names recurring genAI deployment patterns;
Define ‘rigorous evaluation’ in terms of evidence primitives, not just ‘testing occurred’ controls; and
Provide a concise buyer packet template that procurement and auditing teams can use.
This is not an argument for replacing the CSF or the profile approach. It’s an argument for adding just enough structure that the profile can drive repeatable decisions and comparable evidence across organizations.
Why the current form makes sense and where it peters out
Most organizations already run cybersecurity programs with familiar components: governance, inventories, access control, monitoring, incident response, and recovery planning. The CSF works because it provides a stable structure for those activities and a vocabulary that can be shared across security teams, executives, vendors, and auditors. A profile becomes valuable when it maps a new technical domain into that structure so teams can reason about expectations and evidence using concepts their organization already understands.
The draft also has a durability constraint. AI systems change quickly; the profile’s language needs to remain relevant as architectures, vendors, and deployment habits shift. That pushes NIST toward technology-neutral statements and away from design prescriptions. This instinct is good.
The practical gap is that a broadly applicable framework can become hard to apply unless the profile includes an intermediate layer that helps teams translate outcomes into concrete expectations for the kinds of systems they actually deploy. Coordination and engineering operate at different levels. The profile can keep the coordination layer and still add a narrow bridge layer that reduces ambiguity.
This is not unique to AI. ISO/IEC 27001 (a widely used standard for running an information security management system) emphasizes organizational discipline and risk-based tailoring rather than prescribing a single technical stack. NIST Special Publication 800-53 (a large catalog of security and privacy controls, widely referenced beyond U.S. government contexts) provides a menu of controls that must be selected, tailored, and assessed per system. In both cases, teams still need a way to interpret what ‘good’ means for a specific system and deployment.
A bridge layer serves that role. A ‘vehicle class’ analogy from automotive can help here: the same high-level safety and security goals apply across vehicle categories, but different vehicle types concentrate different risks, so controls, testing, and evidence emphases differ. For genAI cybersecurity, the equivalent is a small taxonomy of deployment patterns so the CSF functions can be interpreted consistently without turning the profile into a design specification.
Add a bridge layer: a minimal taxonomy of genAI deployments
The draft’s scope is broad, spanning traditional machine learning through generative systems and increasingly autonomous workflows. That breadth is appropriate for a durable profile. I focus the bridge layer on genAI because this is where adoption is accelerating, architectures are shifting quickly, and cybersecurity risk is rising through new combinations: untrusted inputs at scale, tool and connector ecosystems, delegated identities, and long-horizon state.
A predictable pushback is that taxonomies are only useful after a field has settled, that it is premature to name classes now because they will certainly evolve. The automotive autonomy taxonomy is a useful counterexample. SAE’s levels of driving automation (codified in SAE J3016) were developed quickly, in the middle of a fast-moving technology cycle, before the industry had the luxury of decades to converge on shared language. The taxonomy has been revised and it does not eliminate every ambiguity, but it created shared handles early, handles that regulators, standards bodies, and procurement teams could reference when describing capability, responsibility, and verification expectations. That is the point: a taxonomy can be useful even when it is provisional, as long as it is clear, widely usable, and explicitly treated as subject to revision.
For genAI, a useful taxonomy should be small and defined by security-relevant invariants: what the system can read, what it can change, what identities it holds, whether it ingests untrusted content, and whether it carries state across steps.
Here is a starting set of five deployment patterns:
Retrieval-augmented assistants. These systems retrieve from internal knowledge sources and synthesize responses. Risk concentrates at the boundary between data and instructions: injection through retrieved content, leakage through synthesis, and failures of provenance and access control. Emphasis: provenance and access boundaries; separation of instruction and data channels; leakage prevention and monitoring.
Transactional tool-calling copilots. These systems take actions through tools (create/change records, trigger workflows, submit requests). Risk concentrates in authorization, scope, and auditability: over-broad tool permissions, ambiguous intent mapped to irreversible actions, and weak fail-closed behavior. Emphasis: action-level authorization and least privilege; scoped non-human identities and keys; action-level logging with traceability.
Browser-connected assistants. These systems ingest open-web content and may act based on it. Risk concentrates in adversarial content shaping decisions via indirect prompt injection and related manipulation. Emphasis: isolation and transformation of untrusted content; navigation constraints; separation of browsing content from instruction channels; conservative action policies under uncertainty.
Code-generation and code-execution helpers. These systems generate code and sometimes execute it in sandboxes. Risk concentrates in arbitrary execution, dependency surprises, network access and data egress, and proximity to secrets. Emphasis: runtime isolation; restrictive egress; dependency controls; secret handling; observability at the boundary between generated code and the environment.
Long-horizon workflow systems. These systems plan, act, observe, and adapt over many steps, often with memory. Risk concentrates in trajectories rather than single outputs: compounding errors, goal drift, hidden state, memory poisoning, and multi-step exploitation across tools. Emphasis: step-level authorization; explicit state and memory management; recovery logic and kill switches; traceability across the full trajectory.
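To make the idea of “defined by security-relevant invariants” concrete, here is a minimal sketch of the taxonomy as data. The pattern names come from the list above; the boolean flags and the crude risk ordering are my own illustrative assumptions, not anything defined in the draft profile.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeploymentPattern:
    """A pattern is characterized by invariants, not vendor labels."""
    name: str
    reads_internal_data: bool        # what it can read
    can_take_actions: bool           # what it can change
    holds_identities: bool           # delegated / non-human identities
    ingests_untrusted_content: bool  # open-web or externally supplied input
    carries_state: bool              # memory / state across steps

# Illustrative assignments for the five patterns named above.
PATTERNS = [
    DeploymentPattern("retrieval-augmented assistant",     True,  False, False, True,  False),
    DeploymentPattern("transactional tool-calling copilot", True,  True,  True,  False, False),
    DeploymentPattern("browser-connected assistant",        False, True,  False, True,  False),
    DeploymentPattern("code-generation/execution helper",   False, True,  True,  True,  False),
    DeploymentPattern("long-horizon workflow system",       True,  True,  True,  True,  True),
]

def highest_risk(patterns):
    # Crude ordering: the more of these invariants hold, the more
    # concentrated the risk surface (actions + untrusted input + state).
    return max(patterns, key=lambda p: sum(
        [p.can_take_actions, p.ingests_untrusted_content, p.carries_state]))
```

Even a toy encoding like this shows why the invariants matter: long-horizon workflow systems combine all three risk-concentrating properties, which is exactly why their evaluation emphasis differs.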
This taxonomy is not meant to be final. Its purpose is to make the profile more actionable without turning it into a design specification. Once patterns are named, the profile can indicate which control concepts deserve emphasis for each pattern (e.g., authorization, context isolation, non-human identities and keys, connector/protocol supply chain) and teams can scope evaluation so coverage has a shared meaning.
The simplest implementation is an annex: short pattern definitions and a mapping from each pattern to the CSF categories where emphasis shifts, with the expectation that overlays can extend and refine. That is enough to reduce ambiguity while preserving durability. This also keeps debates tractable: patterns are defined by security-relevant invariants rather than by vendor labels or product marketing. And it gives NIST a clean hook for future overlays (for example, distinctions around hosting and isolation, or around additional monitoring and steering mechanisms) without changing the core profile.
Define rigor: evidence, not just the presence of testing
Many compliance programs are effective at confirming that security processes exist and controls operate. That discipline matters, and frameworks like SOC 2 (a widely used service-organization controls attestation) often raise the floor for operational hygiene. The issue for genAI deployments is that ‘testing occurred’ does not convey assurance unless it is tied to a definition of rigor and to evidence that reflects operational reality. GenAI failures often depend on state, interaction sequences, and connector behavior, not just on a model’s isolated responses.
There are adjacent domains where standards and formal regimes go beyond process confirmation and specify what rigor looks like:
Security assurance and evaluation regimes (for example, Common Criteria-style evaluation) connect security claims to structured evidence artifacts and defined evaluation procedures, often with independent assessment.
Cryptographic module validations (the FIPS 140 family) require conformance testing against published requirements and assessment methods before a module can be relied on for certain classes of use.
Safety-critical software standards such as DO-178C (avionics software), IEC 61508 (general functional safety), and ISO 26262 (automotive functional safety) encode verification discipline: traceability from requirements to verification artifacts, explicit verification objectives, review depth and independence as criticality rises, and clear rules for how changes affect prior evidence.
GenAI security should not copy these regimes wholesale. The transferable idea is narrower: remain technology-neutral, but require that claims imply evidence, and connect evaluation depth to risk. A profile can make that legible by encouraging graded evidence rather than a single vague bar: baseline expectations for read-heavy systems; higher expectations when untrusted web content is in the loop; the highest expectations when the system can take irreversible actions, touch sensitive data, or execute code.
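The graded-evidence idea can be sketched as a simple decision rule. The three tier labels and the function below are hypothetical, offered only to show that the grading is mechanical once the risk-relevant properties are declared.

```python
def required_evidence_tier(ingests_untrusted_web: bool,
                           irreversible_actions: bool,
                           touches_sensitive_data: bool,
                           executes_code: bool) -> str:
    """Illustrative three-tier grading: evaluation depth tracks risk.

    Tier names are placeholders, not drawn from the draft profile.
    """
    if irreversible_actions or touches_sensitive_data or executes_code:
        return "highest"   # irreversible actions, sensitive data, or code execution
    if ingests_untrusted_web:
        return "elevated"  # untrusted web content in the loop
    return "baseline"      # read-heavy systems
```

A profile does not need to publish code, of course; the point is that a graded bar can be stated precisely enough that two organizations classify the same system the same way.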
Rigor should also be situated. Many failures appear mid-task, with noisy inputs, with connectors and recovery logic in the loop, and with state carried across steps. Situated evaluation does not mean simulating the world; it means exercising the system’s real seams: the boundary between retrieved content and instruction; between untrusted web content and decision channels; between tool outputs and action selection; between generated code and the environment it touches; between one step and the next in a long-horizon trajectory.
A minimal evidence bundle that a buyer (and auditor) can consume
The draft already shows supply-chain and procurement awareness. The next step is to define a minimal evidence bundle that makes security claims interpretable and comparable without freezing the field. This bundle standardizes evidence primitives, not tools.
Scope and boundaries: what system is assessed, where it runs, what it can access, what connectors/tools it can invoke, and what is out of scope.
Deployment pattern declaration: which patterns apply; if multiple apply, how they interact.
Concrete claims: a small set of testable statements (authorization gates; scoped non-human identities; separation of untrusted content from instruction channels; action-level logging; fail-closed behavior under policy uncertainty).
Threat assumptions: attacker capabilities and influence paths in scope.
Evaluation plan and acceptance criteria: scenarios exercised under realistic conditions; pass/fail criteria; connectors and recovery paths in the loop where relevant; long-horizon workflows included where relevant.
Method characterization and change-impact discipline: what methods tend to reveal and where they are blind; what changes invalidate evidence and how it is refreshed.
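The bundle above is small enough that completeness can be checked mechanically. Here is a minimal sketch of it as a structured record with the check a buyer or auditor might run; field names mirror the list above and are illustrative, not a NIST-defined schema.

```python
# The six evidence primitives listed above, as required fields.
REQUIRED_FIELDS = {
    "scope_and_boundaries",     # system, hosting, access, connectors, out of scope
    "deployment_patterns",      # declared pattern(s) and their interactions
    "claims",                   # small set of testable security statements
    "threat_assumptions",       # attacker capabilities and influence paths
    "evaluation_plan",          # scenarios, acceptance criteria, connectors in the loop
    "method_characterization",  # method blind spots; change-impact / refresh policy
}

def missing_fields(bundle: dict) -> set:
    """Return required fields that are absent or empty in a submitted bundle."""
    return {f for f in REQUIRED_FIELDS if not bundle.get(f)}
```

The interesting work is in the field contents, not the check; but a fixed field list is what makes bundles comparable across vendors.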
This bundle upgrades a vague control into a meaningful statement about what was tested, why, how, and with what confidence.
Make procurement easier: a buyer packet aligned to the profile
Frameworks become operational when they shape procurement. Buyers can demand evidence; vendors converge on producing it; and over time the ecosystem develops stable artifacts that standards can refine.
NIST could accelerate that convergence with a short buyer packet template aligned to the profile. It should be concise enough to be routinely requested and provided, but structured enough to be audited. In practice, it can be compact: system description and intended-use boundaries; a software bill of materials plus an ‘AI bill’ covering models, retrieval corpora, prompt/policy layers, and connectors; the declared deployment pattern(s); a summary of evaluation coverage with acceptance criteria; and a short set of operational commitments (logging, incident response, disclosure timelines, and support for independent reassessment). For action-taking systems, the packet should also include the authorization model (who/what can authorize which actions), non-human identity and key-management practices, and a change-impact policy that states which updates invalidate prior evidence and trigger re-testing.
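As a sketch of how compact the packet can be, here is the template rendered as a fillable structure. The keys mirror the contents listed above; the function and its key names are hypothetical, not a schema from the draft or from NIST.

```python
def blank_buyer_packet(action_taking: bool) -> dict:
    """Illustrative fillable template for the buyer packet described above."""
    packet = {
        "system_description": None,    # intended-use boundaries
        "sbom": None,                  # software bill of materials
        "ai_bill": {                   # the 'AI bill' companion to the SBOM
            "models": [],
            "retrieval_corpora": [],
            "prompt_policy_layers": [],
            "connectors": [],
        },
        "deployment_patterns": [],     # declared pattern(s)
        "evaluation_summary": None,    # coverage plus acceptance criteria
        "operational_commitments": {
            "logging": None,
            "incident_response": None,
            "disclosure_timelines": None,
            "independent_reassessment": None,
        },
    }
    if action_taking:
        # Extra artifacts required of action-taking systems.
        packet.update({
            "authorization_model": None,     # who/what can authorize which actions
            "nhi_and_key_management": None,  # non-human identities and keys
            "change_impact_policy": None,    # which updates invalidate prior evidence
        })
    return packet
```

The conditional block is the template's one piece of logic: read-only systems get a shorter packet, action-taking systems owe the authorization and change-impact artifacts.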
A shared packet format also helps builders. It turns alignment into concrete artifacts teams can plan for and maintain as systems evolve.
Three targeted changes that would strengthen the draft
First, add a genAI deployment-pattern annex as a bridge layer. Name a small set of common patterns, define them by security-relevant invariants, and map each pattern to the control concepts and evaluation emphases that matter most, with the expectation that it will evolve.
Second, define rigorous evaluation in terms of evidence primitives. Encourage situated evaluation and specify minimum evidence elements: scope, claims, threat assumptions, evaluation plan with acceptance criteria, and method characterization.
Third, provide a procurement buyer packet template aligned to the profile. Make it easy for buyers to request, and for vendors to provide, comparable evidence.
Overall, the draft Profile is a useful step toward shared language for AI cybersecurity. The opportunity now is to add a bridge layer for genAI deployments, raise the bar on what counts as rigorous evidence, and provide a buyer packet that makes the profile enforceable in procurement, all while keeping the document technology-neutral and durable.