You can hand a framework a written description of how your agent should behave and get back a suite of tests that catch it misbehaving. Not by hand-authoring every assertion: you describe the behavior, declare the dimensions that matter, and a model compiles the rest into tests that run. The capability is new, and the pattern behind it is about to be everywhere.

It is half of a control.

Tests tell you, after a run, whether the agent behaved. They say nothing about what the agent was allowed to do before it started. For a stateless chatbot answering over non-sensitive data, that gap barely mattered; the worst case was a bad answer you could throw away. For an agent that files expenses, moves tickets, queries financial systems, and sends mail, it is the whole game. These things act across systems their authors do not control, and some of what they do does not undo.

The question is no longer only whether the agent behaved, but what it was allowed to do in the first place. The industry is meeting that second question with the instinct it brought to the first: stop leaving intent in prose only a human can act on, and compile it into something a machine can execute. That showed up first in testing. It belongs just as much in authorization, where, even as the enforcement standards keep shipping, the step that compiles intent into them has barely begun.

Intent That Runs

Start with the half that already works, because it shows the move in its cleanest form.

Microsoft’s ASSERT, short for Adaptive Spec-driven Scoring for Evaluation and Regression Testing and released as part of its open trust stack at Build 2026, is a clean instance of the pattern. You write the behavior you expect. ASSERT derives behavior categories from it, generates single- and multi-turn test cases, runs them against your agent, and uses an LLM judge to score each conversation against your policy, tracing each verdict to the policy stance and the trace action behind it (Microsoft’s writeup). Written intent in, a reviewable test suite out, none of it hand-authored case by case.

Strip away the specifics and the move is general: take intent expressed in language, compile it into a structured artifact, and let that artifact do the work the prose never could. ASSERT makes intent executable for verification, answering, at far less cost, the question teams used to answer one test at a time: did the agent behave the way we said it should?

It says nothing about a different question, one that comes earlier in time and matters more for anything that cannot be undone: what should the agent have been allowed to do at all?

Verification Is Not Enforcement

Those two questions sound like neighbors. They are different controls, and the difference is not academic.

Picture the billing agent. You hand it a task that sounds routine: resolve a customer’s billing dispute. To do that it pulls up the account, reads the invoices, checks payment history, updates the support ticket. Somewhere in the middle, a support note it ingests nudges its reasoning, or it simply generalizes “look at the customer’s records” a little too far, and it queries a second customer’s account to compare. Nothing in its instructions forbade it. It was holding a credential that worked.

In an eval-only world, here is how that plays out. The release-gating suite passed before any of this shipped: this run, with this injected note and this data, was never one of the cases it generated. The agent runs in production, the second customer’s data is read, already in context, possibly summarized into an output you cannot recall, and nothing flags it in the moment. You learn about it later, from a log or an incident. A post-hoc eval over the trace could score it after the fact, but after the fact is exactly the problem. A clean release report sits comfortably beside a real breach.

That is the difference, and it is not really about timing. An eval, even one wired to run inline, produces a verdict, not a binding entitlement decision; an authorization system makes the decision and refuses at the moment of the call. The release-gating and post-hoc evals here also run too late to matter. An agent that reads private data, takes in untrusted content, and can act on the outside world, the combination Simon Willison named the lethal trifecta, does things that do not reverse. You cannot un-send a payment, un-email a list, or un-leak a contract. The usual advice for the trifecta is to break it by removing a leg; an eval removes none. It watches an agent that still has all three, and detection does not give the data back.

A camera is not a lock. Monitoring records the break-in; the lock prevents it. Agents got cameras first because grading is easy: you can score behavior without touching the runtime, while enforcement means deciding, live, inside the path of every call, what the agent may reach. The easier half shipped first. It is not the half that stops anything.

The Missing Step

So put a lock on it. We are not short on locks. Fine-grained authorization, policy engines, runtime decision points, scoped tokens: the machinery to refuse a call at the moment it is made exists and keeps improving. The same Build 2026 trust stack that shipped ASSERT also shipped the Agent Control Specification, a portable standard for deterministic controls at checkpoints across an agent’s workflow. But ACS, like every lock on that list, assumes the controls already exist, authored separately. None of them turns a goal into the controls it should carry. The trouble is upstream. The instruction you started with does not describe a boundary for any of that machinery to enforce.

“Resolve the customer’s billing dispute” is a goal. It names an outcome, not a boundary. To pursue it the agent legitimately needs the customer’s account, their invoices, and the open ticket. It almost certainly does not need payroll, the source repository, HR records, or any other customer. But none of that is written anywhere. The user said what they wanted accomplished and nothing about scope, because to a human the scope goes without saying. To an agent, nothing goes without saying.

Closing that gap is the mission-shaping problem: taking the open-ended request and turning it, before the agent acts, into a structured and bounded statement of what the task is and is not. I have written about why open-world OAuth needs that shaping step rather than a static grant. Applied to the billing agent, the shaper’s proposal looks like this:

{
  "goal": "Resolve the customer's billing dispute",
  "objects": ["customer 1234's account", "their invoices", "support ticket 456"],
  "constraints": ["read-only except updating the ticket", "billing domain only", "this customer only"],
  "success_criteria": ["dispute resolved or escalated to a human", "ticket updated"],
  "mission_expiry": "2026-06-30T18:00:00Z"
}

That object is not yet authorization, and it is not yet trusted. Whatever does the shaping, a model here, produces only a proposal. A trusted authority validates it against policy, narrows it, derives the candidate authority, and records an approval before any of it binds; only an approved mission is active. Skip that and you are enforcing whatever the shaper asked for. What enforcement then checks, on every consequential call, is the authority that survived approval.
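The propose, narrow, approve sequence can be sketched in a few lines of Python. This is an illustration under stated assumptions, not any framework's API: the `Mission` class, `validate_and_narrow`, `approve`, the `POLICY_ALLOWED` set, and the approver name are all hypothetical, and policy is reduced to a set of objects the requester may scope to.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Optional

# Hypothetical policy: the only objects this requester may scope a mission to.
POLICY_ALLOWED = frozenset({"account:1234", "invoice:1234", "ticket:456"})

@dataclass(frozen=True)
class Mission:
    goal: str
    objects: FrozenSet[str]
    approved_by: Optional[str] = None  # None means it is still only a proposal

    @property
    def active(self) -> bool:
        return self.approved_by is not None

def validate_and_narrow(proposal: Mission) -> Mission:
    """Drop anything the shaper asked for that policy does not allow."""
    return replace(proposal, objects=proposal.objects & POLICY_ALLOWED, approved_by=None)

def approve(candidate: Mission, approver: str) -> Mission:
    """Record the approval; only the returned mission is active."""
    return replace(candidate, approved_by=approver)

# The shaper over-asks (account 5678); narrowing removes it before approval.
proposal = Mission(
    "Resolve the customer's billing dispute",
    frozenset({"account:1234", "invoice:1234", "ticket:456", "account:5678"}),
)
candidate = validate_and_narrow(proposal)
mission = approve(candidate, approver="billing-policy-v3")  # hypothetical approver
```

The point of the shape is that nothing the shaper produced is enforceable on its own: `candidate.active` stays false until an approver is recorded, and the second customer's account never reaches the approved object.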

Materialization turns the whole approved mission into something a machine executes: the authority set along with the tenant and actor bounds, the capability bindings, and the policy version it was approved under. It takes the resource systems’ help, since only the billing systems know which resources and actions they actually expose. Simplified, the result is rules an enforcement point can check without interpreting any natural language:

{
  "allow": [
    { "action": "read",   "resource": "account", "where": { "customer": "1234" } },
    { "action": "read",   "resource": "invoice", "where": { "customer": "1234" } },
    { "action": "update", "resource": "ticket",  "where": { "id": "456" } }
  ],
  "default": "deny",
  "expires": "2026-06-30T18:00:00Z"
}

This is the executable authorization: the mission compiled into a boundary a policy engine evaluates on every consequential call. Read account 1234 and the request matches an allow rule and clears. Reach for account 5678, the customer the agent drifted to in the breach, and nothing matches, the default denies, and the call returns nothing. Not data plus a note for a reviewer next week. Nothing. And notice what is not happening: no model is deciding whether to grant. A model may still contribute a signal, a risk score or a content classification, but it cannot hold the authority or widen it. The authority was fixed and approved before the agent started, and the call is evaluated fresh against it: current mission state (a revoked or expired mission fails closed), the resource’s own policy, the actor, and the parameters. The model can advise; the approved authority caps what may be permitted, and the policy decision point (PDP) decides within that cap.
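A minimal sketch of that per-call check, in Python, using the simplified rules above. The function name `is_permitted` and its parameters are assumptions made for illustration. It covers only the mission's own rules plus expiry and revocation; a real decision point would also weigh the resource's own policy and the actor, which are omitted here.

```python
from datetime import datetime, timezone

AUTHORIZATION = {
    "allow": [
        {"action": "read", "resource": "account", "where": {"customer": "1234"}},
        {"action": "read", "resource": "invoice", "where": {"customer": "1234"}},
        {"action": "update", "resource": "ticket", "where": {"id": "456"}},
    ],
    "default": "deny",
    "expires": "2026-06-30T18:00:00Z",
}

def is_permitted(authz, action, resource, attrs, now, revoked=False):
    """Deterministic check: same inputs and rules always give the same answer."""
    expires = datetime.fromisoformat(authz["expires"].replace("Z", "+00:00"))
    if revoked or now >= expires:
        return False  # a revoked or expired mission fails closed
    for rule in authz["allow"]:
        if (
            rule["action"] == action
            and rule["resource"] == resource
            and all(attrs.get(k) == v for k, v in rule["where"].items())
        ):
            return True
    return False  # nothing matched: default deny

now = datetime(2026, 6, 30, 12, 0, tzinfo=timezone.utc)
own_account = is_permitted(AUTHORIZATION, "read", "account", {"customer": "1234"}, now)
other_account = is_permitted(AUTHORIZATION, "read", "account", {"customer": "5678"}, now)
```

Here `own_account` clears and `other_account` does not, with no natural language interpreted anywhere in the path.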

A fixed, pre-approved boundary does not make the trifecta safe. An injected agent can still combine the private reads the mission permits with the outbound actions it permits, and exfiltrate within scope. What the boundary buys is a bounded blast radius: the injection cannot widen what the agent may reach, so exposure is capped at the task’s authority instead of the agent’s whole credential. Capping is not closing. Closing is the rest of the work: private-read, untrusted-input, and external-action authority separately typed, each consequential action evaluated as it happens with its parameters bound in, egress behind the same control, step-up approval on the highest-stakes steps. A mission does not dissolve the trifecta. It makes it governable, which is the most any boundary can honestly claim.

The mission and the rules it produces are worth keeping straight. The mission is the artifact you stand behind: human-readable, the record of what the task was for and why these objects and not others, signed off by its approver and readable later by an auditor. The executable authorization is that mission in force: literal, machine-facing rules an enforcement point runs throughout the task, meaningless on their own without the mission that explains them. The mission is approved once; the enforcement point runs its rules a thousand times. It is the relationship an eval has to a test run: you stand behind the artifact, the machine executes it.

For that record to be defensible, the mission has to carry more than a goal and some objects. It binds to the requester and the user it acts for, the approving authority, the resource context it was scoped against, the policy and derivation versions it was built under, and an expiry. Without those it is a blob of intent, not an authorization artifact anyone could stand behind in a review.
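As a rough illustration, those bindings could be carried as required fields on the mission record, with a check that refuses a record missing any of them. The field names and the `missing_bindings` helper below are hypothetical, chosen to mirror the list above, and are not drawn from any standard.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone

@dataclass(frozen=True)
class MissionRecord:
    goal: str
    requester: str           # the client or agent asking
    on_behalf_of: str        # the user it acts for
    approved_by: str         # the approving authority
    resource_context: str    # what it was scoped against
    policy_version: str
    derivation_version: str
    expires_at: datetime

def missing_bindings(record: MissionRecord) -> list:
    """Names of bindings left empty; an empty list means the record is complete."""
    return [f.name for f in fields(record) if getattr(record, f.name) in (None, "")]

record = MissionRecord(
    goal="Resolve the customer's billing dispute",
    requester="billing-agent",
    on_behalf_of="user-42",
    approved_by="",  # approval never recorded
    resource_context="billing",
    policy_version="v3",
    derivation_version="v1",
    expires_at=datetime(2026, 6, 30, 18, 0, tzinfo=timezone.utc),
)
gaps = missing_bindings(record)
```

A record with gaps like this one is the "blob of intent" case: it may describe the task accurately, but it cannot answer who approved it.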

And when the task needs more than the mission anticipated, say the dispute hinges on a related order in another system, the answer is not a standing grant broad enough to have covered it. It is a governed expansion: a successor mission carrying the wider authority, the prior one completed, its record and lineage kept for audit. Scope tracks the task, not the org chart, and it moves by minting a new bounded mission, never by loosening an old one.
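One way to picture a governed expansion is as a function that never edits the old mission's authority. The sketch below is hypothetical throughout (`Mission`, `expand`, the identifiers), and it assumes the approval of the wider scope has already happened and is simply recorded on the successor.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Optional, Tuple

@dataclass(frozen=True)
class Mission:
    mission_id: str
    allow: FrozenSet[str]
    approved_by: str
    status: str = "active"             # "active" or "completed"
    predecessor: Optional[str] = None  # lineage kept for audit

def expand(prior: Mission, new_id: str, extra: FrozenSet[str],
           approved_by: str) -> Tuple[Mission, Mission]:
    """Close the prior mission and mint a successor with the wider authority."""
    closed = replace(prior, status="completed")  # authority set left untouched
    successor = Mission(
        mission_id=new_id,
        allow=prior.allow | extra,
        approved_by=approved_by,
        predecessor=prior.mission_id,
    )
    return closed, successor

m1 = Mission("m-1", frozenset({"account:1234", "invoice:1234", "ticket:456"}), "approver-a")
m1_closed, m2 = expand(m1, "m-2", frozenset({"order:789"}), approved_by="approver-a")
```

The frozen dataclass is a deliberate choice in this sketch: widening is only expressible as a new object with its own approval and a pointer back to the one it replaced.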

Intent Becomes the Unit of Authorization

Step back and the shift underneath this is bigger than one billing agent. Mainstream enterprise authorization, the role-based kind most companies actually run, ties what you may do to who you are: a role provisioned ahead of time that holds as long as you hold the job. Richer models exist, attribute-based, capability, and contextual policy all predate agents, but the static role is the dominant pattern, and it works when identities map to stable jobs.

Agents do not hold jobs. They take tasks. The same agent, under the same identity, resolves a billing dispute at ten o’clock and drafts a board summary at eleven, and the two tasks have almost nothing in common in what they should be allowed to touch. Give that model a single static role for “the agent” and it has to be the union of everything any task might ever need, broad enough that it is no longer least privilege for any one of them.

Identity still matters, for a narrower job: it establishes who is acting and for whom. The ceiling, the most the agent could ever touch, comes from what that identity is entitled to, plus client registration, delegation, and tenant and resource policy. A grant can never exceed it. But a ceiling is not a scope: it says what is possible, not what is needed now.

What is needed now is a different question, “what are you trying to do,” answered one task at a time, because the task is the only place the answer lives. A session is not a mission: a login that lasts all day is the wrong container for an intent that lasts twenty minutes and wants three specific things. So the mission does not replace identity; it narrows within it. The grant is the intersection: bounded above by that ceiling, bounded to the task by the mission, which adds what conventional deployments usually lack: a durable, approved, machine-readable statement of what this task is for, narrow enough to bound the work and fresh enough to fit it.
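Read literally, "the grant is the intersection" is a set operation. A minimal Python sketch, with hypothetical permission names and a hypothetical `effective_grant` helper:

```python
def effective_grant(identity_ceiling: set, mission_scope: set) -> set:
    """The grant is what the identity may ever do AND what this task needs."""
    return identity_ceiling & mission_scope

# Everything the agent's identity could ever be entitled to (hypothetical).
ceiling = {
    ("read", "account:1234"),
    ("read", "invoice:1234"),
    ("update", "ticket:456"),
    ("read", "board-deck"),
}

# What the billing mission asks for; payroll is outside the ceiling.
mission_scope = {
    ("read", "account:1234"),
    ("read", "invoice:1234"),
    ("update", "ticket:456"),
    ("read", "payroll"),
}

grant = effective_grant(ceiling, mission_scope)
```

The board deck drops out because this task does not need it, and payroll drops out because the identity was never entitled to it. Neither bound alone would have produced the three-item grant.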

Mission Objects Are the Authorization Counterparts to Evals

If you have written an eval, you have already done this once. You took intent in language and compiled it into an artifact a machine acts on. You did it for the verification side. The mission is the authorization counterpart, built with the same compilation pattern, and most teams that have the first do not yet have the second.

Both begin with intent in plain language. Both compile that language into a structured artifact, often with a model, though a rules engine, a form, or a workflow can do the shaping too. Both exist so that something a human meant can be acted on by a machine that would otherwise see only tokens. ASSERT compiles intent into a test that asks, after the run, did the agent behave? On the authorization side, the same kind of intent is shaped into a proposal, validated and narrowed, derived into a candidate authority, then approved and activated as the boundary that asks, before the run, what may the agent do?

If you understand why generated evals are a good idea, you already understand why generated authorization boundaries are. The reason to compile rather than hand-author is the same on both sides: the inputs are open-ended. You stopped hand-writing test suites because you could not enumerate every behavior worth checking; you cannot hand-write a scope for every task a user might dream up either. A goal no one anticipated still needs a boundary, and something has to turn the task into one.

The symmetry is what makes the analogy useful. The asymmetry is what makes it matter:

| | Eval (e.g. ASSERT) | Mission object |
| --- | --- | --- |
| Derived from | written intent | written intent |
| Produced by | a model, reviewed by a human | proposed by a shaper, validated by a state authority, approved by a principal or policy |
| Binds | nothing; produces a non-binding verdict | each consequential action, through the PDP that enforces it |
| Role | detective: did it behave? | preventive: what may it do? |
| Lifecycle | a durable suite, reused across runs | per task, short-lived, tied to one goal |
| Authority | none; it observes and scores | governs the authority set a PDP permits and denies against |
| Cost of being wrong | a false pass greenlights a vulnerable release | this execution is wrongly authorized |

Read the bottom rows together. An eval carries no authority; it produces evidence. A wrong eval is not harmless, a false pass can greenlight a vulnerable release, but it does its damage indirectly, by feeding a deployment decision a human still owns. A mission governs authority directly. It is not advice a later decision weighs; it commits to an authority set that every runtime check is made against, so a wrong mission is wrong at the moment of each action it governs. That is why a mission cannot be an eval turned around. The same construction, once it confers authority rather than evidence, needs what an eval never had: an approval path, a default that denies when intent is unclear, and a lifecycle that ends.

There is a sharper way to feel it. An eval is re-runnable: tune the suite, run it again, watch the score move, all before anything ships. An approved mission is not a rehearsal. You improve the shaping offline, on past tasks, with evals, but once a mission is active its authority is live, and the first action under it is the real thing. It can still be suspended or expired; what it cannot be is quietly re-run against the world the way a test is.

Why Not Just Judge Every Action?

If a mission is so much more fraught than an eval, the obvious shortcut is to skip the artifact. Keep the model in the loop, point it at each action before the agent takes it, and let it decide. An LLM judge as the gate. People are already building this, under the name guardrails.

It does not hold up as authorization, and seeing why is the clearest case for the mission.

Run the judge after the fact and you are back to detection: the action happened, the score came late, the data already moved. So the judge has to run inline, before each call. The moment it does, three problems arrive together.

It makes a security decision probabilistic, turning allow or deny on how a model reads a situation rather than on a rule that, for the same inputs and policy version, returns the same answer every time. And the model is injectable: give the judge enough context to make a real call and you have handed it the same untrusted content steering the agent, so the injection that fools the agent also gets a vote on allowing it. Starve it of that context and it can no longer judge much, collapsing toward a fixed rule, a conventional deterministic policy check with a model’s latency bolted on, and none of a mission’s task-specific authority at that. A grant should be the one thing untrusted input cannot argue with. A model in the decision path is the opposite.

It answers the wrong question. An eval scores behavioral conformance: does this transcript look like the spec? Authorization is an entitlement check: may this principal, acting for this user, touch this resource right now, given delegation, tenant, and revocation? A model reading a conversation cannot know whether this customer’s agent may read that customer’s invoice; that fact lives in the identity and entitlement systems, not the transcript. Give the judge access to them and the authority has moved into them, where it belongs. The model can still help, classifying risk or reading context that the policy then weighs, but it is no longer the thing that decides. That is the line: a model may inform an authorization decision; it must not be the authority that makes it, and it must not be able to widen access on its own.

And it still needs something to judge against. A generic global rule, do not touch other customers, is not least privilege; it cannot say this task may read these three things and nothing else. The per-task specification the judge would check is the mission. Skip the artifact and you have not removed the mission, you have buried it in a prompt and made it non-deterministic and unauditable.

The division of labor is not a turf fight. An eval observes. A guardrail filters. A mission authorizes. An inline judge is a fine extra guardrail and a poor authorization system, because what authorizes has to be deterministic, inspectable, bound to an identity, and fixed before execution, so the untrusted content the agent ingests as it runs cannot alter or widen it. That is the approved mission and the authority set it commits, not a model deciding live.

The Hard Part Is the Same Hard Part

Keeping the model out of the gate raises a fair follow-up. The shaper that produces the mission is often a language model too, the very thing whose judgment we are trying to contain. It does not have to be: a rules engine, a form, or a workflow can shape just as well, and there the question never arises. But take the hardest case, a model shaper: if an inline judge is too injectable to authorize, why is a model trustworthy enough to write the boundary? It sounds like handing the fox the keys.

It would be, if the model’s output were the grant. It is not, and that distinction is the whole design. The shaper proposes a mission intent; a state authority validates and narrows it, derives the candidate authority, and renders it for approval by a person or a policy precise enough to stand in for one; only on approval does the mission become active, and that approved mission, fixed and inspectable, is what gets enforced, not the model’s live reasoning as it runs. The model drafts, something accountable disposes, and the approved artifact is what binds.

The eval world already trusts model-generated artifacts this way: ASSERT writes its tests with a model and scores them with a judge, and the answer to “why trust that” is not faith but review and a measured judge-human agreement rate. The mission side needs the same discipline at a higher bar, and the reason the bar is higher is the point. Grading is allowed to be approximate: a judge whose agreement with human reviewers has been measured for the behaviors it scores can be good enough, though that agreement varies by model and by how fine the policy distinction is. Authorization cannot be approximate; a grant that is right most of the time is a breach the rest of the time. So a model can decide an eval, but on the authorization side it can only draft the artifact an accountable principal or governing policy approves and a deterministic check enforces. Same model, same intent, different tolerance for being wrong.

The difficulties that remain are the ones you would expect from making fuzzy human intent binding:

  • Intent is underspecified. “Resolve the billing dispute” never says which systems. When the ambiguity is material, the shaper must not guess: narrowing silently can misread the task as badly as widening it. It has to clarify or refuse, with the user as the final disambiguator. The preventive bias is to put real ambiguity in front of a person, not to resolve it quietly in either direction.
  • Scope needs domain knowledge. “Billing domain only” has to resolve to the resources and actions the billing systems actually expose. The shaper proposes the shape; the systems that own those resources confirm it is real and allowed. Intent does not get to invent capabilities that do not exist.
  • Tight missions break tasks. The price of any preventive control is false denials, and a mission drawn too tight strands the agent halfway through legitimate work. The answer is the governed widening from earlier, not drawing missions loose enough to never get in the way.

None of these is a reason to wait, but they are a reason to be honest about the size of it. This is a heavier build than an eval harness: shaping, derivation, enforcement, an approval path, a way to widen a mission mid-task. Evals had a head start because grading is cheap to build and its mistakes have a smaller direct blast radius. Authorization has neither advantage, which is exactly why irreversible actions cannot keep waiting for it.

One Source, Many Artifacts

Follow this far enough and a larger architecture appears. One approved mission need not feed only the enforcement rules. The same validated intent and authority set can drive the runtime constraints and the behavioral evals as well, each compiled from the artifact that was already approved, not from the raw prose.

Deriving them from that shared, approved source keeps things aligned. An approved mission carries both an authority set and a goal with constraints and success criteria, so it can generate two kinds of eval: one asking whether the agent stayed within its granted authority, and one, the behavioral eval ASSERT actually writes, asking whether it pursued the goal within those constraints and met the success criteria.

On the authorization side, a flagged behavior means different things depending on what enforcement did. An out-of-bounds attempt the PDP denied is a misbehaving agent and a working control, worth correcting, not a breach. An out-of-bounds action that went through is an enforcement or materialization failure. An in-bounds action that still did harm points elsewhere: the mission was too broad, or the resource’s policy was. And none of it says whether the mission captured the user’s intent to begin with; that is a separate question for a shaping-quality eval or a human-labelled oracle scoring the mission against the original intent, not against itself. Conformance and faithfulness are different checks, and one artifact answers only the first. The loop closes:

flowchart TB
    Goal([User goal]) --> Shape[Mission shaping]
    Shape -->|proposed intent| Auth[State authority<br/>validate, narrow, derive]
    Auth -->|rendered for approval| Approve{Approver<br/>principal or policy}
    Approve -->|approved| Mission[Mission + authority set]
    Mission -->|audience-scoped projection| PDP[Per-call decision / PDP]
    Ctx[Runtime context<br/>mission state, actor, resource policy, parameters] --> PDP
    PDP -->|permit| Exec[Agent execution]
    Exec -->|each consequential action| PDP
    Exec --> Eval[Evaluation, e.g. ASSERT]
    Eval --> Verify([Behavior verification])
    Verify -.->|findings refine the next mission| Shape

The first half constrains the agent before it acts; the second checks what it did. The dotted edge is the part most people skip, but a finding does not name its own fix: read against the cases above, it points at a layer, the agent, enforcement, the mission, or the shaping behind it, and that is where the correction goes, not always into a sharper test.
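The mapping from finding to layer can be written down as a small triage function. This is only a sketch of the three conformance cases described above; the `Layer` enum and `triage` signature are hypothetical, and the separate question of whether the mission was faithful to the user's intent is deliberately out of its scope.

```python
from enum import Enum

class Layer(Enum):
    AGENT = "agent misbehaved; the control worked"
    ENFORCEMENT = "enforcement or materialization failed"
    MISSION = "mission or resource policy was too broad"
    NONE = "no conformance finding"

def triage(in_bounds: bool, permitted: bool, harmful: bool) -> Layer:
    """Point a flagged action at the layer where the correction belongs."""
    if not in_bounds and not permitted:
        return Layer.AGENT        # out-of-bounds attempt, denied
    if not in_bounds and permitted:
        return Layer.ENFORCEMENT  # out-of-bounds action went through
    if in_bounds and permitted and harmful:
        return Layer.MISSION      # allowed by the boundary, still did harm
    return Layer.NONE

denied_attempt = triage(in_bounds=False, permitted=False, harmful=False)
leaked_action = triage(in_bounds=False, permitted=True, harmful=True)
broad_mission = triage(in_bounds=True, permitted=True, harmful=True)
```

Only the first of those three outcomes is a case for correcting the agent; the other two send the fix to the enforcement path or back to the mission.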

The Chain Worth Building

None of this is really about prompts, or policies, or evals as separate wins. It is about a chain. Intent enters as a sentence a person types, and today it dissolves almost immediately: the prompt runs, the agent fans out across tools, and the original goal survives, if at all, as a line in a log someone might grep after something goes wrong. The opportunity is to keep that intent intact and load-bearing the whole way through, from the goal, into the authority the agent is granted, through the actions it takes, into the verdict on whether it behaved.

The pieces are arriving. Executable evaluations are furthest along, because grading is the safest place to start. Executable authorization, with a mission object as its artifact, is arriving second but has to run first, because shipping agents with evals and nothing on the authorization side leaves you with detection on actions that do not undo. A camera pointed at a door with no lock. Worth having, and not the same as safe.

Both rest on the same conviction, and it is worth saying plainly:

Human intent should not stay trapped in natural language. It should become something a machine can enforce and verify. And it should be the same intent on both sides.