Classification Is Not Control: What Gartner's Agent Governance Framework Leaves Unanswered

A May 2026 Gartner press release warns that applying uniform governance across AI agents will lead to enterprise failure. The diagnosis is correct. Most organisations deploying agents today apply the same governance rules to a document summarisation tool as they do to an agent that can modify production systems, send communications, or commit funds. The result is uniform noise: it blocks value at the low end while providing false assurance at the high end.

Gartner’s proposed remedy is a four-level autonomy framework. Level 1 agents observe: read-only, output to the requesting user only. Level 2 agents advise: drafts and recommendations, humans execute. Level 3 agents act with approval: they can write, send, and modify, but every action requires explicit human sign-off. Level 4 agents act autonomously: they execute independently within guardrails, with humans reviewing exceptions rather than individual decisions.

The framework is useful. Most enterprises lack any taxonomy for the agents they are running, and classifying them by autonomy level is a better starting point than classifying them by vendor, use case, or cost centre. The framework should change how organisations think about agent inventories.

The problem is what the framework does not address.

Classification is design-time. Failure is runtime.

Gartner’s four tiers tell you what controls to put in place when you design and deploy an agent. Apply scoped data access at Level 1. Add hallucination testing and user training on automation bias at Level 2. Build approval workflows with audit trails at Level 3. Implement continuous monitoring, circuit breakers, and rollback mechanisms at Level 4.

These are all correct recommendations, and all design-time activities: decisions made before the agent runs, documented in policy, checked during onboarding or audit.
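To make the design-time nature of the taxonomy concrete, the four levels and their controls can be written down as plain data. This is a minimal sketch, not Gartner's own artefact: the names `AutonomyLevel`, `CONTROLS_ADDED`, and `required_controls` are hypothetical, and the assumption that controls accumulate from one level to the next is ours, not something the press release states.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    OBSERVE = 1            # read-only, output to the requesting user
    ADVISE = 2             # drafts and recommendations, humans execute
    ACT_WITH_APPROVAL = 3  # every action needs explicit human sign-off
    ACT_AUTONOMOUSLY = 4   # acts within guardrails, humans review exceptions


# Controls introduced at each level, paraphrased from the article text.
CONTROLS_ADDED = {
    AutonomyLevel.OBSERVE: ["scoped data access"],
    AutonomyLevel.ADVISE: ["hallucination testing", "automation-bias training"],
    AutonomyLevel.ACT_WITH_APPROVAL: ["approval workflow", "audit trail"],
    AutonomyLevel.ACT_AUTONOMOUSLY: [
        "continuous monitoring",
        "circuit breaker",
        "rollback mechanism",
    ],
}


def required_controls(level: AutonomyLevel) -> list[str]:
    """Return the controls expected at `level`.

    ASSUMPTION: controls are cumulative, so a Level 3 agent also needs
    the Level 1 and Level 2 controls. The source does not say this.
    """
    controls: list[str] = []
    for lower in AutonomyLevel:
        if lower <= level:
            controls.extend(CONTROLS_ADDED[lower])
    return controls
```

Everything in this structure is decided before the agent runs. Nothing in it observes or constrains what the agent does once it is live.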

The failure modes Gartner identifies are runtime phenomena.

Approval fatigue does not show up in a policy document. It accumulates over weeks as Level 3 agents generate approval requests faster than humans can evaluate them meaningfully. The approval workflow is intact. The human review has become a rubber stamp. The governance control still exists on paper and has stopped functioning in practice.
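Approval fatigue can at least be measured from the approval records themselves. The sketch below is one possible runtime signal, not an established metric: `ApprovalDecision`, `rubber_stamp_signal`, and both thresholds are hypothetical and would need calibrating against a team's real review data.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ApprovalDecision:
    approved: bool
    review_seconds: float  # time between request shown and decision made


def rubber_stamp_signal(
    decisions: list[ApprovalDecision],
    min_review_seconds: float = 10.0,  # hypothetical threshold
    max_approval_rate: float = 0.98,   # hypothetical threshold
) -> bool:
    """Flag a review window where nearly everything is approved
    and the typical review is too fast to be meaningful."""
    if not decisions:
        return False
    approval_rate = sum(d.approved for d in decisions) / len(decisions)
    median_review = median(d.review_seconds for d in decisions)
    return approval_rate >= max_approval_rate and median_review < min_review_seconds
```

Run over a rolling window of recent decisions per reviewer, a check like this surfaces the point at which the workflow is still intact but the review has stopped happening.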

Automation bias is a behavioural shift, not a configuration error. It accumulates gradually as users learn to trust Level 2 advisory outputs and stop applying independent judgment to them. The agent remains correctly classified while the human oversight that justified that classification quietly erodes.

At Level 4, the concern Gartner raises is that “actions are executed at a scale and speed that can outpace human oversight.” Continuous monitoring and circuit breakers are the recommended response. But detection is not the same as prevention. An agent that is detected doing the wrong thing has already done it. The question that comes before detection is whether the action should have been admissible in the first place.

What the research shows

A paper published in April 2026 by researchers at Microsoft Research makes the enforcement gap concrete. Laban, Schnabel, and Neville studied what happens when LLMs are delegated long document-editing tasks across 52 professional domains. Their finding: even frontier models corrupt an average of 25% of document content by the end of long workflows. The errors are sparse, meaning they are difficult to detect in any individual transaction, and they compound over time.

The authors describe these as “silent corruptions”: modifications that pass review because each individual change looks plausible, but that collectively degrade the integrity of the document. The models are operating within their assigned scope, doing exactly what they were delegated to do. The output is wrong in ways that a design-time classification and a governance policy cannot prevent, because the problem only becomes visible across a sequence of actions examined in aggregate. The result is a guardrail that exists and still does not catch what it needs to catch.
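The difference between reviewing each change and reviewing the sequence can be illustrated with a simple similarity measure. This is a rough sketch and not the method used in the paper: `drift_report` is a hypothetical helper, word-level similarity from Python's `difflib` is only a proxy, and it measures how much a document has changed, not whether the changes are wrong.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Word-level similarity in [0, 1] between two document versions."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def drift_report(versions: list[str]) -> list[tuple[float, float]]:
    """For each edit step, return (similarity to the previous version,
    similarity to the original baseline).

    A per-step reviewer only ever sees the first number. The second
    number is what reveals accumulated drift.
    """
    baseline = versions[0]
    report = []
    for previous, current in zip(versions, versions[1:]):
        report.append((similarity(previous, current), similarity(baseline, current)))
    return report
```

If every step looks nearly identical to the one before it while similarity to the baseline keeps falling, the document is drifting in a way no single approval would catch. Whether that drift is intended still needs a human or a task-specific check to decide.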

The gap between what is governed and what is happening

The practical consequence for enterprises is this: Gartner’s framework gives you a better map. It does not close the distance between the map and the territory.

An agent classified at Level 3, with approval workflows and audit trails in place, can still cause problems if the approval process has degraded into a formality, if the audit logs exist but are not monitored in context, or if the scope definition was accurate at deployment and has since drifted as the agent’s inputs changed.

An agent classified at Level 1 can still create downstream risk if its outputs are fed into decision processes faster than humans can evaluate them, which is precisely the automation bias risk Gartner flags at Level 2. The boundaries between levels are not as clean in practice as they appear in a framework.

Gartner is right that organisations need to stop governing all agents the same way. The next question, which the framework leaves unanswered, is how you know whether your governance is actually working once the agents are running. Not whether the controls were designed correctly. Whether they are functioning.

What enforcement-layer governance requires

There is a useful distinction that the tiered framework does not make explicit:

Logs explain what happened. Monitoring detects what is happening. Enforcement determines what is allowed to happen.

These are not the same capability at different maturity levels. They are categorically different things. An organisation can have excellent logging and functional monitoring and still have no enforcement layer, meaning no mechanism that evaluates whether an action is admissible before it executes. That is where most enterprises are today, including many that believe their governance is mature.

Three things the tiered framework does not resolve:

Binding constraints at execution time, not just design time. A classification framework can identify that a Level 4 agent should not approve payments above a defined threshold, access certain systems, or operate without escalation. But the classification itself does not enforce those constraints. Somewhere in the execution path, a mechanism must decide whether an action is admissible before it occurs. Without that enforcement point, governance remains descriptive rather than operational. The policy says what should not happen. The enforcement layer ensures it cannot.

Behavioural testing against realistic inputs, not just functional testing. Gartner recommends security testing at higher autonomy levels. This is necessary but not sufficient. Agents need to be tested against the kinds of inputs they will actually encounter, including edge cases, ambiguous instructions, and scenarios where the correct action is to halt rather than proceed. The Microsoft Research study described above, which introduced the DELEGATE-52 benchmark, demonstrates that degradation is exacerbated by document size, interaction length, and the presence of distractor files. These are normal production conditions, not exceptional ones.

Ownership with operational authority. Gartner specifies that Level 4 requires “clear ownership for agent behaviour.” Ownership without the ability to interrupt execution in real time is accountability that can only be exercised after the fact. The person named as responsible needs to be able to halt what the agent is doing, not only explain what it did.
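The first and third points above, a binding check at execution time and an owner who can halt the agent, can be sketched as a single gate that every action must pass through. This is an illustrative sketch under stated assumptions, not a reference design: `Action`, `EnforcementGate`, `InadmissibleAction`, the payment limit, and the blocked-target list are all hypothetical names and values.

```python
import threading
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class Action:
    kind: str            # e.g. "approve_payment" (hypothetical action type)
    amount: float = 0.0
    target: str = ""


class InadmissibleAction(Exception):
    """Raised before execution when an action violates a constraint."""


@dataclass
class EnforcementGate:
    payment_limit: float
    blocked_targets: frozenset
    # Set by the accountable owner to stop all further actions.
    halted: threading.Event = field(default_factory=threading.Event)

    def halt(self) -> None:
        self.halted.set()

    def check(self, action: Action) -> None:
        """Decide admissibility BEFORE the action has any effect."""
        if self.halted.is_set():
            raise InadmissibleAction("agent halted by owner")
        if action.target in self.blocked_targets:
            raise InadmissibleAction(f"target not permitted: {action.target}")
        if action.kind == "approve_payment" and action.amount > self.payment_limit:
            raise InadmissibleAction("payment exceeds limit, escalate to a human")

    def execute(self, action: Action, effect: Callable[[Action], Any]) -> Any:
        self.check(action)     # enforcement: may refuse
        return effect(action)  # only reached if the action is admissible
```

The design choice that matters is that `effect` is only reachable through `execute`. If the agent can call the underlying system directly, the gate is documentation again. In a real deployment the check would sit in the tool or API layer the agent depends on, not in code the agent controls.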

The classification problem Gartner describes is real, and the framework is a genuine contribution to how organisations should think about agent inventories. Solving the classification problem is the right starting point.

The framework tells you what to govern. The next question is where governance becomes binding. The organisations that avoid agent failure will be the ones that treat the enforcement layer, not the documentation layer, as the object of design.

If you are mapping your agent inventory against a framework like this and want a second perspective on where your enforcement layer has gaps, book the 30-Minute Enforcement Gap Review.
