You can already hand an agent a written description of how it should behave and get back a suite of tests that catch it misbehaving. No hand-authored assertions, just intent in plain language, compiled into something that runs. That capability shipped this year, and if you build agents you have probably already reached for it.

It is half of a control.

Tests tell you, after a run, whether the agent behaved. They say nothing about what the agent was allowed to do before it started. For a chatbot that gap barely mattered; the worst case was a bad answer you could read and throw away. For an agent that files expenses, moves tickets, queries financial systems, sends mail, and calls APIs nobody listed in advance, it is the whole game. These things act, across systems their authors do not control, and some of what they do does not undo.

So the question is no longer only whether the agent behaved, which we are learning to answer. It is what the agent was allowed to do in the first place, and how anyone would know if it stepped outside that. The industry is meeting the second question with the same instinct it brought to the first: stop leaving intent in prose only a human can act on, and compile it into something a machine can execute. That instinct showed up first in testing. It belongs just as much in authorization, where the equivalent is only starting to take shape.

Intent That Runs

Start with the half that already works, because it shows the move in its cleanest form.

Microsoft’s ASSERT, short for Adaptive Spec-driven Scoring for Evaluation and Regression Testing and released as part of its open trust stack at Build 2026, is a clean instance of the pattern. You write the behavior you expect. ASSERT derives behavior categories from it, generates single- and multi-turn test cases, runs them against your agent, and uses an LLM judge to score each conversation against your policy, tracing every failure back to the input that produced it (Microsoft’s writeup). Written intent in, behavior verification out, with no hand-built suite in the middle.

Strip away the specifics and the move is general: take intent expressed in language, use a model to compile it into a structured artifact, and let that artifact do the work the prose never could. That is the move worth watching, and not only because it improves testing. ASSERT makes intent executable for verification. It answers, mostly on its own, the question teams used to answer by hand: did the agent behave the way we said it should?

It says nothing about a different question, one that comes earlier in time and matters more for anything that cannot be undone: what should the agent have been allowed to do at all?

Verification Is Not Enforcement

Those two questions sound like neighbors. They are different controls, and the difference is not academic.

Picture the billing agent. You hand it a task that sounds routine: resolve a customer’s billing dispute. To do that it pulls up the account, reads the invoices, checks payment history, updates the support ticket. Somewhere in the middle, a support note it ingests contains text that nudges its reasoning, or it simply generalizes “look at the customer’s records” a little too far, and it queries a second customer’s account to compare. Nothing in its instructions forbade that. It was holding a credential that worked.

In an eval-only world, here is how that plays out. The agent finishes. The run is logged. Days later the regression suite runs against the recorded traces, the policy says an agent must not touch accounts outside the one it was assigned, and the judge flags the violation. Now you know. You knew nothing at the moment it mattered. The second customer’s data was already read, already in the model’s context, possibly already summarized into an output that went somewhere you cannot recall. The eval is a true and useful statement about something that has already happened.

And the agent may well have passed every eval you wrote. Evals gate the release, not the run. This run, with this particular injected note and this particular data, was never one of the test cases, and a clean eval report sits perfectly comfortably beside a real breach. Gating the design is not gating the action.

That is what it means to call a control detective. An evaluation observes and scores after the fact. An authorization system decides and refuses at the moment of the call. For a great deal of what agents do, after the fact is simply too late. An agent that reads private data, takes in untrusted content, and can act on the outside world (the combination Simon Willison named the lethal trifecta) routinely does things that do not reverse. You cannot un-send a payment, un-email a list, or un-leak a contract. The usual advice for the trifecta is to break it by taking away a leg; an eval takes away none. It watches an agent that still has all three. Detection tells you it happened. It does not give the data back.

A camera is not a lock. Monitoring records the break-in; the lock prevents it. Agents are getting cameras first, and it is easy to see why: you can grade behavior without touching the runtime, while enforcement means getting inside the path of every call and deciding, live, what the agent may reach. The easier half shipped first. It is not the half that stops anything.

The Missing Step

So put a lock on it. We are not short on locks. Fine-grained authorization, policy engines, runtime decision points, scoped tokens: the machinery to refuse a call at the moment it is made exists and is well understood. The trouble is upstream of all of it. The instruction you started with does not describe a boundary for any of that machinery to enforce.

“Resolve the customer’s billing dispute” is a goal. It names an outcome, not a boundary. To pursue it the agent legitimately needs the customer’s account, their invoices, and the open ticket. It almost certainly does not need payroll, the source repository, HR records, or any other customer. But none of that is written anywhere. The user said what they wanted accomplished and nothing about scope, because to a human the scope goes without saying. To an agent, nothing goes without saying.

Closing that gap is the mission shaping problem: taking the open-ended request and turning it, before the agent acts, into a structured and bounded statement of what the task is and is not. I have written about why open-world OAuth needs that shaping step rather than a static grant. Applied to the billing agent, the shaped artifact, the mission, looks like this:

{
  "goal": "Resolve the customer's billing dispute",
  "objects": ["customer 1234's account", "their invoices", "support ticket 456"],
  "constraints": ["read-only except updating the ticket", "billing domain only", "this customer only"],
  "success_criteria": ["dispute resolved or escalated to a human", "ticket updated"],
  "mission_expiry": "2026-06-30T18:00:00Z"
}

That object is not yet authorization; it is intent made precise. Three steps turn it into a working lock: shaping produces the mission from the prose goal, derivation translates it into authority the target systems understand, and enforcement checks each call against that authority before the call runs.

Derivation is the step worth making concrete, because it is where intent becomes something a machine executes. It also needs the resource systems’ help, since only the billing systems know which resources and actions they actually expose. Given that, the mission above produces rules an enforcement point can check without interpreting anything:

{
  "allow": [
    { "action": "read",   "resource": "account", "where": { "customer": "1234" } },
    { "action": "read",   "resource": "invoice", "where": { "customer": "1234" } },
    { "action": "update", "resource": "ticket",  "where": { "id": "456" } }
  ],
  "default": "deny",
  "expires": "2026-06-30T18:00:00Z"
}

This is the executable authorization: the mission compiled into a boundary a policy engine evaluates on every call. Ask to read account 1234 and the request matches an allow rule and clears. Reach for account 5678, the second customer from the breach a moment ago, and nothing matches, the default denies, and the call returns nothing. Not data plus a note for a reviewer to find next week. Nothing. And notice what is not happening at that instant: no model is consulted. The decision is a lookup against rules a person approved before the agent ever started.
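A minimal sketch of that lookup, assuming the rules have exactly the shape shown above. The function name `is_allowed` and the request shape (an action, a resource type, and a dictionary of attributes) are illustrative assumptions, not any particular policy engine's API.

```python
from datetime import datetime, timezone

# The derived rules from above, copied as-is.
RULES = {
    "allow": [
        {"action": "read", "resource": "account", "where": {"customer": "1234"}},
        {"action": "read", "resource": "invoice", "where": {"customer": "1234"}},
        {"action": "update", "resource": "ticket", "where": {"id": "456"}},
    ],
    "default": "deny",
    "expires": "2026-06-30T18:00:00Z",
}


def is_allowed(rules: dict, action: str, resource: str, attrs: dict, now: datetime) -> bool:
    """Return True only if an unexpired allow rule matches the call."""
    expires = datetime.fromisoformat(rules["expires"].replace("Z", "+00:00"))
    if now >= expires:
        return False  # an expired mission grants nothing
    for rule in rules["allow"]:
        if rule["action"] != action or rule["resource"] != resource:
            continue
        # Every attribute the rule pins must be present and equal on the request.
        if all(attrs.get(key) == value for key, value in rule["where"].items()):
            return True
    # No rule matched: fall back to the default, which is "deny" here.
    return rules["default"] == "allow"


now = datetime(2026, 6, 30, 12, 0, tzinfo=timezone.utc)
own_account = is_allowed(RULES, "read", "account", {"customer": "1234"}, now)    # True
other_account = is_allowed(RULES, "read", "account", {"customer": "5678"}, now)  # False
```

Nothing in the function calls a model or reads the agent's context. It is a pure function of the approved rules, the request, and the clock.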

That last detail is also what makes the trifecta survivable. You cannot keep untrusted content out of an agent whose job is to read the world. What a fixed, pre-approved boundary does is ensure the content cannot widen what the agent may reach or do: the injection can still steer the agent’s reasoning, but it cannot move the wall. The dangerous leg stays in the loop with nothing left to grab. Break the trifecta if you can; when you cannot, a mission is how you survive keeping it.

The mission and the rules it produces are worth keeping straight. The mission is the artifact you stand behind: human-readable, the record of what the task was for and why these objects and not others, the thing a person approved and an auditor can later read. The executable authorization is that mission in force: literal, machine-facing rules an enforcement point runs on every call, fast to check and meaningless on their own without the mission that explains them. You approve the mission once; the enforcement point runs its rules a thousand times. It is the relationship an eval has to a test run. You stand behind the artifact; the machine executes it; neither gets relitigated by hand each time it fires.

And when the task genuinely needs more than the mission anticipated, say the dispute turns out to hinge on a related order in a different system, the answer is not a standing grant broad enough to have covered it from the start. It is a governed request to widen the mission, approved or refused on its own terms, with the original boundary left intact for everything else. The scope tracks the task instead of the org chart.
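One way to picture a governed widening, as a sketch only: the request adds a single rule, it needs a named approver, and it produces a new rule set without touching the original. The `widen` function, the `approved_by` field, and order `789` are hypothetical names introduced for illustration.

```python
import copy
from typing import Optional

BASE_RULES = {
    "allow": [
        {"action": "read", "resource": "account", "where": {"customer": "1234"}},
        {"action": "read", "resource": "invoice", "where": {"customer": "1234"}},
        {"action": "update", "resource": "ticket", "where": {"id": "456"}},
    ],
    "default": "deny",
    "expires": "2026-06-30T18:00:00Z",
}


def widen(rules: dict, extra_rule: dict, approved_by: Optional[str]) -> dict:
    """Return a new rule set with one extra allow rule; refuse if nobody approved it."""
    if not approved_by:
        raise PermissionError("widening request was not approved")
    widened = copy.deepcopy(rules)  # the original boundary is left intact
    widened["allow"].append({**extra_rule, "approved_by": approved_by})
    return widened


# Hypothetical request: the dispute hinges on one related order.
order_rule = {"action": "read", "resource": "order", "where": {"id": "789"}}
widened = widen(BASE_RULES, order_rule, approved_by="billing-supervisor")
```

The widened set still denies by default and still expires with the mission. Only the one approved rule is new.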

Intent Becomes the Unit of Authorization

Step back and the shift underneath this is bigger than one billing agent. Authorization has always answered the question “who are you,” then looked up what that identity was permitted to do. That works when identities map to stable jobs. A billing specialist gets the billing role, the role grants the billing systems, and the mapping holds for as long as the person holds the job.

Agents do not hold jobs. They take tasks. The same agent, under the same identity, resolves a billing dispute at ten o’clock and drafts a board summary at eleven, and the two tasks have almost nothing in common in what they should be allowed to touch. A role provisioned for “the agent” has to be the union of everything any task might ever need, which means it has to be broad, which means it is not least privilege for the one task in front of it. Identity still sets the ceiling. What the agent may ever touch, what the user it acts for is entitled to, which tenant it lives in: that is identity’s job, and a grant can never exceed it. But the ceiling is not the scope. Identity says what is possible. It cannot say what is needed now.

What is needed now is a different question, “what are you trying to do,” and it can only be answered one task at a time, because the task is the only place the answer lives. A session is not a mission: a login that lasts all day is the wrong container for an intent that lasts twenty minutes and wants three specific things. So the mission does not replace identity; it narrows within it. The grant the agent actually gets is the intersection, bounded above by what identity and entitlements permit and bounded to the task by the mission. Roles and policies still do their job; the billing agent still has to be an authenticated principal a policy can reason about. The mission adds the input those systems never had, a machine-readable statement of what this particular task is for, narrow enough to bound the work and fresh enough to fit it.
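The intersection can be sketched with plain sets, assuming permissions are flattened to (action, resource type, resource id) tuples. That flattening and the names `effective_grant`, `ceiling`, and `mission` are simplifications for illustration; real entitlement systems are richer than a set of tuples.

```python
from typing import Set, Tuple

# Hypothetical flattened permission: (action, resource type, resource id).
Permission = Tuple[str, str, str]


def effective_grant(identity_ceiling: Set[Permission], mission_scope: Set[Permission]) -> Set[Permission]:
    """The grant is whatever the mission asks for that identity also permits."""
    return identity_ceiling & mission_scope


# What the agent's identity and the user's entitlements allow at most.
ceiling = {
    ("read", "account", "1234"),
    ("read", "account", "5678"),
    ("read", "invoice", "1234"),
    ("update", "ticket", "456"),
}

# What this one task needs, as stated by the mission.
mission = {
    ("read", "account", "1234"),
    ("read", "invoice", "1234"),
    ("update", "ticket", "456"),
    ("read", "payroll", "1234"),  # requested, but outside the ceiling
}

grant = effective_grant(ceiling, mission)
```

Account 5678 drops out because the mission never asked for it, and payroll drops out because identity never permitted it. Neither input alone produces the right scope.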

Mission Objects Are the Authorization Equivalent of Evals

If you have written an eval, you have already done this once. You took intent in language and compiled it into an artifact a machine acts on. You did it for the verification side. The mission is that same artifact, built the same way, for the authorization side, and most teams that have the first do not yet have the second.

The eval and the mission are the same construction, built from the same raw material, aimed at two different problems.

Both begin with intent in plain language. Both use a model to compile that language into a structured artifact. Both exist so that something a human meant can be acted on by a machine that would otherwise see only tokens. ASSERT compiles intent into a test that asks, after the run, did the agent behave? Mission shaping compiles the same kind of intent into a boundary that asks, before the run, what may the agent do? Most engineers already have a feel for the first. The claim here is that the second is the same kind of object, produced for the authorization side of the house instead of the testing side. If you understand why generated evals are a good idea, you already understand why generated authorization boundaries are. The reason to compile rather than hand-author is the same on both sides: the inputs are open-ended. You stopped writing test suites by hand because you could not enumerate every behavior worth checking. You cannot enumerate, in advance, an authorization scope for every task a user might dream up either. A goal no one anticipated still needs a boundary, and only something that reads the goal can draw one.

The symmetry is what makes the analogy useful. The asymmetry is what makes it matter:

| | Eval (e.g. ASSERT) | Mission object |
| --- | --- | --- |
| Derived from | written intent | written intent |
| Produced by | a model, reviewed by a human | a model (the shaper), approved by a human |
| Runs | after a run; gates the release, not the live action | before each action; gates the action itself |
| Role | detective: did it behave? | preventive: what may it do? |
| Lifecycle | a durable suite, reused across runs | per task, short-lived, tied to one goal |
| Authority | none; it observes and scores | carries authority; it permits and denies |
| Cost of being wrong | a bad grade you can re-run | access that already happened |

Read the bottom rows together, because they are the whole point. An eval carries no authority; it watches and reports, and the worst a wrong eval does is mislead you until you notice and fix it. A mission carries authority; it admits and refuses, and a wrong mission does not mislead anyone, it grants. That is why a mission cannot be an eval turned around. The same construction, once it has teeth, needs things an eval never had to carry: a human or a policy in the approval path, a default that denies when the intent is unclear, and a lifecycle that ends. An eval that is wrong costs you a re-run. A mission that is wrong costs you the breach.

There is a sharper way to feel that. An eval lives in CI. You run it a thousand times, tune the wording, watch the score move, and nothing happens to the world while you do. A mission gets one shot. The moment it binds, the agent acts on it, and there is no staging environment for a grant, because the grant is the permission for the action, not a rehearsal of it. You can iterate your way to a good eval, because a wrong one costs only a re-run. You cannot iterate your way to a good mission in production. You improve the shaper offline, on past tasks, and yes, you use evals to do it, but the grant it issues for the task in front of you binds the first time and every time. That is what a higher bar actually means here.

Why Not Just Judge Every Action?

If a mission is so much more fraught than an eval, the obvious shortcut is to skip the artifact. Keep the model in the loop, point it at each action before the agent takes it, and let it decide. An LLM judge as the gate. People are already building this, under the name guardrails.

It does not hold up as authorization, and seeing why is the clearest case for the mission.

Run the judge after the fact and you are back to detection: the action happened, the score came late, the data already moved. So the judge has to run inline, before each call. The moment it does, three problems arrive together.

It makes a security decision probabilistic. Allow or deny now turns on how a model reads a situation rather than on a rule that returns the same answer every time. And the model is injectable: give the judge enough context to make a real authorization call and you have handed it the same untrusted content that is steering the agent, so the injection that fools the agent also gets a vote on whether to allow it. Starve the judge of that context to keep it safe and it can no longer judge much, collapsing toward a fixed rule, which is the mission with worse latency. A grant should be the one thing in the loop that untrusted input cannot argue with. A model in the decision path is the opposite.

It answers the wrong question. An eval scores behavioral conformance: does this transcript look like the spec? Authorization is an entitlement check: may this principal, acting for this user, touch this resource right now, given delegation, tenant, and revocation? A model reading a conversation cannot know whether this customer’s agent may read that customer’s invoice. The fact is not in the transcript; it is in the identity and entitlement systems. Give the judge access to those systems and the authority has moved into them, where it belongs, and the judge is now a probabilistic front-end to a decision they could make deterministically. Either way the model is not what decides. It is uninformed, or it is redundant.

And it still needs something to judge against. A generic global rule, do not touch other customers, is not least privilege; it cannot say this task may read these three things and nothing else. The per-task specification the judge would check is the mission. Skip the artifact and you have not removed the mission, you have buried it in a prompt and made it non-deterministic and unauditable.

The division of labor is not a turf fight. An eval observes. A guardrail filters. A mission authorizes. An inline judge is a fine extra guardrail and a poor authorization system, because the thing that holds authority has to be deterministic, inspectable, bound to an identity, and approved before any untrusted content was ingested. That thing is the mission, not a model deciding live.

The Hard Part Is the Same Hard Part

Keeping the model out of the gate raises a fair follow-up: the shaper that produces the mission is itself a language model, the very thing whose judgment we are trying to contain. If an inline judge is too injectable to authorize, why is a model trustworthy enough to write the boundary in the first place? It sounds like handing the fox the keys.

It would be, if the model’s output were the grant. It is not, and that distinction is the whole design. The shaper proposes a mission. A human, or a policy precise enough to stand in for one, approves it. The approved mission, fixed and inspectable, is what gets enforced, not the model’s live reasoning as it runs. The model drafts; something accountable disposes; the artifact, not the model, is what holds authority.
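One way to make "the artifact holds authority" concrete is to pin the approval to the exact bytes of the mission, so anything the model or an injection changes afterward is refused. This is a sketch of that idea, not a description of any existing system. The hashing approach and the names `fingerprint`, `approve`, and `load_for_enforcement` are assumptions.

```python
import hashlib
import json


def fingerprint(mission: dict) -> str:
    """Stable SHA-256 over a canonical JSON encoding of the mission."""
    canonical = json.dumps(mission, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def approve(draft: dict, approver: str) -> dict:
    """Record who approved which exact mission. The approver is a person or a policy, not the shaper."""
    return {"approver": approver, "mission_sha256": fingerprint(draft)}


def load_for_enforcement(mission: dict, approval: dict) -> dict:
    """Refuse any mission that differs from the one that was approved."""
    if fingerprint(mission) != approval["mission_sha256"]:
        raise PermissionError("mission does not match the approved artifact")
    return mission


# The shaper proposes a draft; something accountable approves it.
draft = {
    "goal": "Resolve the customer's billing dispute",
    "constraints": ["read-only except updating the ticket", "this customer only"],
}
approval = approve(draft, approver="support-lead")

# A later edit, whatever its source, no longer matches the approval.
tampered = {**draft, "constraints": ["read-only except updating the ticket"]}
```

Loading `draft` with `approval` succeeds. Loading `tampered` raises `PermissionError`, because the constraint that was dropped was part of what the approver signed off on.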

This is not a new trick. It is exactly how the eval world already earns trust in its own model-generated artifacts. ASSERT writes its test cases with a model and scores them with an LLM judge, and the field’s answer to “why would you trust that” is not faith. It is review of the generated suite and a measured rate of agreement between the judge and humans. The mission side needs the same discipline, held to a higher standard, and the reason the standard is higher is worth naming. An eval can let a model score live, because grading is allowed to be approximate: a judge that agrees with humans most of the time is a good judge. Authorization cannot be approximate. A grant that is right most of the time is a breach the rest of the time. That is the real line between them. A model can be the thing that decides an eval; on the authorization side it can only draft the artifact that a human approves and a deterministic check enforces. Same model, same intent, different tolerance for the model being wrong.
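For a sense of what a measured rate of agreement looks like in practice, here is a small sketch over invented labels. Raw agreement is the fraction of cases where the judge and the human gave the same verdict. Cohen's kappa is one common chance-corrected alternative, included as an assumption on my part and not something ASSERT is documented to report.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of cases where judge and human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Agreement corrected for what two raters would match on by chance."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented verdicts on eight conversations, for illustration only.
judge_labels = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass"]
human_labels = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "pass"]

rate = agreement_rate(judge_labels, human_labels)   # 0.875
kappa = cohens_kappa(judge_labels, human_labels)    # 0.75
```

A judge at 0.875 may be a perfectly good grader. The same number read as an authorization error rate would mean one wrong grant in eight.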

The difficulties that remain are real, and they are the ones you would expect from taking fuzzy human intent and trying to make it binding:

  • Intent is underspecified. “Resolve the billing dispute” never says which systems. When the shaper is unsure, the safe move is to narrow or to ask, never to widen. A preventive control should err toward too little and put the ambiguity in front of a person, not guess generously and hope the eval catches it later.
  • Scope needs domain knowledge. “Billing domain only” has to resolve to the actual resources and actions the billing systems expose. The shaper can propose that shape; the systems that own those resources have to confirm it is real and allowed. Intent does not get to invent capabilities that do not exist.
  • Tight missions break tasks. The price of any preventive control is false denials, and a mission drawn too tight will strand the agent halfway through legitimate work. The answer is the governed widening from earlier, not a habit of drawing missions loose enough to never get in the way, which would give back the whole point.
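The first two difficulties suggest a shape for derivation: keep only what the owning system confirms it exposes, and route everything else to a person. The sketch below assumes a hypothetical catalog published by the billing systems and a proposal format of (action, resource, where) triples. Neither comes from a real product.

```python
from typing import Dict, List, Set, Tuple

Proposal = Tuple[str, str, dict]

# Hypothetical catalog from the billing systems: resource -> actions it exposes.
BILLING_CATALOG: Dict[str, Set[str]] = {
    "account": {"read"},
    "invoice": {"read"},
    "ticket": {"read", "update"},
}


def derive(proposed: List[Proposal], catalog: Dict[str, Set[str]]) -> Tuple[dict, List[Proposal]]:
    """Split the shaper's proposals into confirmed rules and items that need a human."""
    confirmed, needs_review = [], []
    for action, resource, where in proposed:
        if action in catalog.get(resource, set()) and where:
            confirmed.append({"action": action, "resource": resource, "where": where})
        else:
            # Unknown resource, unsupported action, or an unscoped request:
            # narrow and ask, never widen.
            needs_review.append((action, resource, where))
    return {"allow": confirmed, "default": "deny"}, needs_review


proposed = [
    ("read", "account", {"customer": "1234"}),
    ("update", "ticket", {"id": "456"}),
    ("refund", "invoice", {"customer": "1234"}),  # action the catalog does not expose
    ("read", "invoice", {}),                      # no scope given
]
rules, needs_review = derive(proposed, BILLING_CATALOG)
```

Two proposals become rules and two go to review. The unscoped invoice read is the interesting one: the catalog supports the action, but without a scope the safe reading is the narrow one.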

None of these is a reason to wait, but they are a reason to be honest about the size of the thing. This is a heavier build than an eval harness: shaping, derivation, enforcement, an approval path, a way to widen a mission mid-task. Evals had a head start because grading is cheap and safe to get wrong. Authorization is neither, which is the same reason the irreversible actions cannot keep waiting for it. The difficulties are the kind evals already worked through, on a problem where being wrong does more than skew a metric.

One Intent, Many Artifacts

Follow this far enough and the outline of a larger architecture appears. A single written intent need not produce only a test, or only a boundary. It can compile into several machine-executable artifacts at once:

flowchart TB
    Intent[Written intent] --> A[Authorization boundaries]
    Intent --> R[Runtime constraints]
    Intent --> E[Behavioral evaluations]

Authorization reads the intent to decide what the agent may do. Runtime enforcement reads it to bound each action as it happens. Evaluation reads it to judge whether the agent behaved. Three artifacts, one source, and the source is the sentence the human actually said.

Deriving them from one intent is not just tidy; it is what keeps them honest about each other. When the boundary and the eval are compiled from the same goal, the eval is testing for behavior the boundary was meant to permit, and any gap between the two is information: either the boundary is drawn wrong or the test is. Author them separately, by different teams from different documents, and they drift. You get agents that sail through their evals while holding access nobody would have approved, or agents blocked at runtime from the very thing the test insists they should do. One source is what lets the detective half keep the preventive half honest, and the loop closes:

flowchart TB
    Goal([User goal]) --> Shape[Mission shaping]
    Shape --> Mission[(Mission object)]
    Mission --> Authz[Authorization enforcement]
    Authz --> Exec[Agent execution]
    Exec --> Eval[Evaluation, e.g. ASSERT]
    Eval --> Verify([Behavior verification])
    Verify -.->|findings refine the next mission| Shape

The first half constrains the agent before it acts. The second half checks what it did. The same sentence drives both, and the dotted edge is the part most people skip: when an eval catches the agent doing something it should never have been able to do, the real fix is upstream. Not only a sharper test, but a tighter mission the next time the task runs.
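When the boundary and the eval come from the same goal, the gap between them can be checked mechanically. The sketch below assumes the rule shape used earlier and a hypothetical call format with `action`, `resource`, and `attrs`. The function names and the sample calls are invented for illustration.

```python
from typing import Dict, List


def matches(rule: dict, call: dict) -> bool:
    """True if the call has the rule's action and resource and satisfies every pinned attribute."""
    attrs = call.get("attrs", {})
    return (
        rule["action"] == call["action"]
        and rule["resource"] == call["resource"]
        and all(attrs.get(key) == value for key, value in rule["where"].items())
    )


def cross_check(rules: dict, expected_calls: List[dict], attempted_calls: List[dict]) -> Dict[str, List[dict]]:
    """Report where the eval's expectations and a recorded trace fall outside the boundary."""
    def allowed(call: dict) -> bool:
        return any(matches(rule, call) for rule in rules["allow"])

    return {
        # The eval expects it, the boundary refuses it: one of the two is drawn wrong.
        "expected_but_denied": [c for c in expected_calls if not allowed(c)],
        # The trace shows the attempt, the boundary refuses it: a finding for the next mission.
        "attempted_outside_boundary": [c for c in attempted_calls if not allowed(c)],
    }


rules = {
    "allow": [
        {"action": "read", "resource": "account", "where": {"customer": "1234"}},
        {"action": "update", "resource": "ticket", "where": {"id": "456"}},
    ],
    "default": "deny",
}
expected = [
    {"action": "read", "resource": "account", "attrs": {"customer": "1234"}},
    {"action": "read", "resource": "order", "attrs": {"customer": "1234"}},
]
attempted = [
    {"action": "read", "resource": "account", "attrs": {"customer": "1234"}},
    {"action": "read", "resource": "account", "attrs": {"customer": "5678"}},
]
report = cross_check(rules, expected, attempted)
```

Here the eval expects an order lookup the boundary never granted, and the trace shows a reach for a second customer. The first is a question about which artifact is wrong. The second is the dotted edge: evidence to carry back into shaping.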

The Chain Worth Building

None of this is really about prompts, or policies, or evals as separate wins. It is about a chain. Intent enters as a sentence a person types, and today it dissolves almost immediately: the prompt runs, the agent fans out across tools, and the original goal survives, if at all, as a line in a log someone might grep after something goes wrong. The opportunity is to keep that intent intact and load-bearing the whole way through, from the goal, into the authority the agent is granted, through the actions it takes, into the verdict on whether it behaved.

The pieces are starting to arrive. Executable evaluations are one, and they are the furthest along, because grading is the safest place to start. Executable authorization, with a mission object as its artifact, is the piece that has to run first in the chain even though it is arriving second, because it is the only one with the power to stop something rather than merely note it.

Until it lands, be clear-eyed about what shipping agents with evals and no executable authorization actually gives you: detection on actions that do not undo. A camera pointed at a door with no lock. Worth having, and not the same as safe.

Evaluation and authorization rest on the same conviction, and it is worth saying plainly:

Human intent should not stay trapped in natural language. It should become something a machine can enforce and verify. And it should be the same intent on both sides.