Who Governs the Agents? OpenAI Frontier and the Fight for the Enterprise Control Plane
What it means when a frontier model provider claims the orchestration, identity, and evaluation layer for enterprise AI. And why the bigger story is the self-learning future it’s preparing for.

Consider the following: an agent handling procurement learns from past interactions that a particular approval step frequently results in delay. Over time, it optimizes for speed by routing around that step, not with any single dramatic violation, but through a gradual drift in how it interprets its own permissions and institutional context. No RBAC rule was broken. The agent is operating within its literal access scope. But its behaviour has shifted in a way that no one authorized and that static governance controls will not catch.
This kind of scenario should frame how we read OpenAI’s launch of Frontier last week. Frontier is an enterprise platform for building, deploying, and managing AI agents. It bundles shared business context (a semantic layer connecting data warehouses, CRM, ticketing, and internal apps), an agent execution environment, identity and governance controls, and built-in evaluation and optimization loops. It is model-agnostic, supporting agents powered by OpenAI, Anthropic, Google, and others. Early customers include Uber, State Farm, Intuit, and Oracle.
The surface-level read of this launch is that OpenAI wants to sell governance infrastructure alongside model intelligence. That is true but incomplete. The deeper signal is in how Frontier is architected: agents that build institutional memory over time, evaluation loops designed for continuous improvement, shared context that accumulates and refines. Frontier is laying the substrate for self-learning agents. And it is doing so at a moment when most enterprise governance is designed for static agents, and most agent observability tooling cannot tell you whether an agent’s behaviour this week is meaningfully different from its behaviour last month. This essay examines three questions that Frontier forces into the open: who should govern the agents, what governance looks like when agents learn on the fly, and what governance can actually see.
Should the model provider govern the agents?
Frontier’s model-agnostic posture is its most interesting design choice. OpenAI is explicitly saying: bring agents from any provider, we will orchestrate and govern them all. That reframes the competitive axis. Instead of winning on model intelligence alone, OpenAI is claiming the substrate on which all agents operate. The models become components; the governance layer becomes the platform.
This creates a structural tension. The entity that operates the governance layer accumulates asymmetric visibility: which agents are deployed, how they behave, what business context they access, where they fail. For non-OpenAI agents, that telemetry flows through an OpenAI-operated layer. When the governance provider is also a frontier model provider with its own agents competing for the same enterprise workloads, the informational asymmetry becomes a strategic concern.
There is also an evaluation incentive problem. When the governance layer and the model provider are the same entity, the incentive to surface unflattering evaluation results about one’s own models gets weaker. Eval loops can quietly favour the provider’s own agents through metric selection, benchmark design, or the framing of what ‘improvement’ means. None of this requires bad faith. It is the natural consequence of bundling evaluation into a platform that also sells the evaluated product.
An independent governance layer has a cleaner incentive structure: it can evaluate all agents against the same criteria without conflicts, surface failure modes without commercial hesitation. The market has not yet decided whether the governance layer will be owned by model providers, by existing enterprise platform vendors, or by independent infrastructure players. That decision will shape enterprise AI risk posture for years.
Frontier also deserves credit for several things the market has needed to hear from a major provider: that governance is a first-class concern and not an appendix, that agent identity with scoped permissions should be the default rather than the exception, and that shared business context belongs in infrastructure rather than scattered across one-off integrations. These are sound principles. The question is what happens when you build them into a platform designed for agents that learn.
The governance demands of self-learning agents
The most important aspect of Frontier is not the governance controls it ships today. It is the infrastructure it lays for agents that learn from deployment.
Frontier describes agents that “build memories, turning past interactions into useful context that improves performance over time” and evaluation loops that help agents “improve with experience.” This is the language of self-learning systems, not static workflow engines. And it reflects a direction that is maturing rapidly in research, with direct implications for what governance and observability need to become.
Self-learning for deployed agents is emerging along three lines. The first is context-level learning: treating the agent’s context as an evolving playbook that accumulates strategies from past interactions, with no weight updates at all. Stanford’s ACE framework showed that agents using smaller open-source models could match frontier-model-powered agents on agentic benchmarks, purely through structured context evolution from deployment feedback. The second is experience distillation: capturing deployment traces, curating them offline, and feeding refined strategies back into future behaviour. Multi-agent variants extend this to collaborative settings with shared experience libraries. The third is trajectory-level reinforcement learning, where models are actually updated based on deployment performance, a direction frontier labs are actively pursuing but not yet common in enterprise.
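To make the first family concrete, here is a minimal sketch of context-level learning, assuming a simple append-and-inject playbook; the entry fields and the learn_from_trace step are illustrative, not ACE’s actual interface or Frontier’s memory API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class PlaybookEntry:
    strategy: str          # distilled lesson, e.g. "confirm vendor tax ID before issuing a PO"
    source_trace_id: str   # provenance: the deployment trace this lesson came from
    added_at: str
    helpful: int = 0       # counters updated by later evaluations
    harmful: int = 0

@dataclass
class Playbook:
    entries: list[PlaybookEntry] = field(default_factory=list)

    def learn_from_trace(self, trace_id: str, outcome_ok: bool, lesson: str) -> None:
        # Context-level learning: append a distilled lesson; no model weights change.
        prefix = "DO: " if outcome_ok else "AVOID: "
        self.entries.append(PlaybookEntry(
            strategy=prefix + lesson,
            source_trace_id=trace_id,
            added_at=datetime.now(timezone.utc).isoformat(),
        ))

    def as_context(self, limit: int = 20) -> str:
        # The evolving playbook is injected into the agent's prompt on every run.
        return "\n".join(f"- {e.strategy}" for e in self.entries[-limit:])
```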
What matters for governance are two things these approaches expose:
The first is a shared dependency on structured deployment traces. All three families need high-quality records of what happened, what worked, what failed, what tools were called, what context was used, and what the outcome was. You cannot do any form of self-learning without this data. Frontier, by integrating execution, evaluation, and context into a single platform, is positioned to capture exactly this kind of structured trace data. An observability tool that watches from the outside, logging API calls and token counts, does not produce it. A governance platform that sits at the execution substrate potentially does.
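What such a structured trace might contain, sketched as a plain record; the fields below are an assumption about what a self-learning pipeline needs, not Frontier’s actual schema.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolCall:
    tool: str
    arguments: dict[str, Any]
    result_summary: str
    error: str | None = None

@dataclass
class DeploymentTrace:
    trace_id: str
    agent_id: str             # the agent's governed identity, not a shared service account
    task: str
    context_refs: list[str]   # which semantic-layer / playbook entries were in context
    tool_calls: list[ToolCall] = field(default_factory=list)
    outcome: str = "unknown"  # "success" | "failure" | "escalated"
    evaluation: dict[str, float] = field(default_factory=dict)  # scores from the eval loop

    def is_learnable(self) -> bool:
        # Only traces with a known outcome and eval scores are usable as learning signal.
        return self.outcome != "unknown" and bool(self.evaluation)
```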
The second, and more consequential, is that these three families create fundamentally different reversibility profiles. Context-level learning is the most governable: if an agent learns something wrong through accumulated context, you can in principle identify and remove the bad entries, roll back to a known-good state, or rebuild the playbook. Experience distillation is harder: curated strategies are abstracted from raw traces, and the provenance chain from a specific deployment experience to a specific strategic recommendation may not be cleanly traceable. Trajectory-level RL is essentially irreversible without retraining: once weights are updated based on deployment data, you cannot selectively unlearn a specific lesson.
This means that the governance architecture you need depends on which kind of self-learning your agents are doing. Static governance (RBAC, permissions, guardrails) is necessary for all three, but sufficient for none. Context-level learning demands provenance tracking and rollback mechanisms for the context layer. Experience distillation demands lineage from traces to strategies, with the ability to audit and override specific distilled recommendations. Trajectory-level RL demands pre-deployment sandboxing and post-deployment behavioural monitoring that can flag when the updated model has drifted outside its intended operational envelope. These are different governance architectures, not variations on a single theme.
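For the most governable of the three, a provenance-and-rollback layer over accumulated context could look something like the following sketch; the snapshot and targeted-removal mechanics are illustrative rather than any vendor’s implementation.

```python
import copy
from dataclasses import dataclass, field

@dataclass
class GovernedContextStore:
    # entry_id -> {"text": ..., "source_trace_id": ..., "added_at": ...}
    entries: dict[str, dict] = field(default_factory=dict)
    # label -> deep copy of entries at a known-good point
    snapshots: dict[str, dict] = field(default_factory=dict)

    def snapshot(self, label: str) -> None:
        # Capture a known-good state before the next learning cycle is applied.
        self.snapshots[label] = copy.deepcopy(self.entries)

    def remove_derived_from(self, bad_trace_id: str) -> list[str]:
        # Targeted rollback: drop every entry whose provenance points at a flagged trace.
        removed = [eid for eid, e in self.entries.items()
                   if e.get("source_trace_id") == bad_trace_id]
        for eid in removed:
            del self.entries[eid]
        return removed

    def rollback(self, label: str) -> None:
        # Coarse rollback: restore the last known-good snapshot wholesale.
        self.entries = copy.deepcopy(self.snapshots[label])
```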
Frontier’s governance controls today are largely static: RBAC, audit trails, explicit permissions, behavioural boundaries set at deployment. That is a gap. But the architecture, because it captures execution traces and maintains agent state, is well-positioned to support adaptive governance as a future capability. The question is how fast it gets there.
Who owns the learning loop?
There is a conflict of interest here that runs deeper than evaluation benchmarks. If Frontier captures your agents’ deployment traces and uses them to improve the platform’s own optimization recommendations, context layer, or eval loops, that is a fundamentally different proposition than if you retain ownership of your traces and control your own improvement pipelines.
The data that makes self-learning possible is also the data that makes the governance platform more valuable over time. Every deployment trace, every success-failure pair, every trajectory that reveals how agents interact with your business processes becomes training signal. The question of who owns that signal, who can use it for improvement, and what happens to it when you leave the platform is the most consequential data governance question in enterprise AI right now. It deserves the same scrutiny that cloud data residency and vendor lock-in received in the past.
Institutional memory as attack surface
Frontier’s shared business context, the semantic layer that gives agents institutional memory, is architecturally valuable and, by construction, a high-value target. But what makes poisoning institutional memory different from poisoning a database is that agents use context interpretively rather than deterministically. A corrupted database field produces a wrong answer in a predictable place. Corrupted institutional memory produces wrong judgments across unpredictable situations, because agents apply context flexibly to novel tasks.
When every agent, regardless of provider, references the same semantic layer, integrity failures propagate broadly and simultaneously. An agent that ‘knows’ something false about how your procurement process works will confidently take wrong actions that look right. And because the corruption lives in the context layer rather than in the agent’s code or model weights, standard application security testing will not surface it.
Practical questions to consider: what integrity controls protect the semantic layer from poisoning, drift, and error accumulation? Can you trace provenance of institutional memory? Can you roll back specific contributions? Is there a mechanism for agents or humans to flag when shared context contradicts observed reality?
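The last of those questions can be made operational in a fairly simple way. The sketch below assumes that context entries and deployment outcomes can be matched by topic and that an evaluator or human reviewer marks contradictions; the matching rule is deliberately naive.

```python
from dataclasses import dataclass

@dataclass
class ContextEntry:
    entry_id: str
    topic: str   # e.g. "procurement.approval_flow"
    claim: str   # what agents are told is true

@dataclass
class ObservedOutcome:
    trace_id: str
    topic: str
    contradicts_claim: bool  # set by an evaluator or a human reviewer

def flag_contradictions(entries: list[ContextEntry],
                        outcomes: list[ObservedOutcome],
                        threshold: int = 3) -> list[str]:
    # Flag context entries whose topic has repeatedly contradicted observed reality.
    counts: dict[str, int] = {}
    for o in outcomes:
        if o.contradicts_claim:
            counts[o.topic] = counts.get(o.topic, 0) + 1
    return [e.entry_id for e in entries if counts.get(e.topic, 0) >= threshold]
```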
The observability ceiling and what model-agnostic implies
Frontier governs agents through traces, tool calls, outputs, and business context. For agents powered by OpenAI’s own models, the platform may have access to richer internal signals. For agents powered by Anthropic, Google, or open-source models, the governance layer almost certainly operates at the trace and output level only.
This deserves a direct statement: model-agnostic governance is shallow governance. You can govern all models equally at the trace level, but you lose everything that makes governance powerful for detecting early-stage problems. Research on internal representations has shown that the richest signals for detecting when an agent is entering a risky regime sit closer to the model’s internal machinery than to its outputs. An agent that is about to fail often shows signs in its activations before those signs appear in its behaviour. A governance platform that cannot see those signals is governing at the perimeter.
The question is whether model providers will expose substrate-adjacent telemetry to external governance layers. This could take several forms: risk scores derived from internal state (a numeric signal indicating anomalous activation patterns), constrained audit endpoints (letting a governance layer query whether specific safety-relevant features fired during a given inference), or confidence calibrations from logit distributions rather than from the model’s self-reported text. None of these would require exposing raw weights or full activation tensors. They would require model providers to treat internal-state telemetry as a product, not just a research artifact.
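A hypothetical shape for such an endpoint, assuming a provider exposed per-inference risk signals without exposing weights or activations; every name and field here is invented for illustration.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class InternalRiskSignal:
    inference_id: str
    risk_score: float            # scalar derived from internal state, 0.0 to 1.0
    flagged_features: list[str]  # coarse labels, e.g. ["high-uncertainty", "policy-adjacent"]
    logit_confidence: float      # calibration from the output distribution, not self-report

class SubstrateTelemetry(Protocol):
    # What a governance layer would call, per inference, on a provider-operated endpoint.
    def risk_signal(self, inference_id: str) -> InternalRiskSignal: ...
    def feature_fired(self, inference_id: str, feature: str) -> bool: ...
```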
If this telemetry remains proprietary, governance platforms will face a structural choice: operate at the shallow, model-agnostic trace level for all agents, or lock into a specific model provider’s ecosystem to access deeper signals. That trade-off is the real lock-in risk in agent governance, and it has not received the attention it deserves.
Implications for the observability ecosystem
There is a crowded market of companies building agent observability and evaluation tooling: LangSmith, Arize, Langfuse, Galileo, Braintrust, Datadog’s LLM module, and others. LangChain’s late-2025 State of Agent Engineering report found that 89% of organizations with production agents have implemented some form of observability, and 62% have detailed tracing. By those numbers, observability is already table stakes.
Frontier pressures this ecosystem by raising the floor. If a model provider bundles governance, identity, evaluation, and context management into a single platform, standalone tools need to offer something above that baseline. For the lighter-weight tools focused on tracing and token-cost monitoring, this is an existential challenge.
But the more important pressure is a capability gap that almost no current observability platform addresses: monitoring agents that change over time. Current tools trace what happened, measure latency and cost, and evaluate output quality. That is necessary but insufficient for self-learning agents. What a self-learning-aware observability platform would need to provide includes, at minimum: behavioural drift baselines that detect when an agent’s action patterns have diverged from its prior behaviour or its intended operational envelope; context/memory lineage tracking that monitors the evolution of the playbooks or institutional memory an agent draws on; rollback and override mechanisms for context-level learning, so that specific learned behaviours can be reversed; and divergence alerts that fire when an agent’s learned strategies conflict with its governance policies, even if no individual action violates a rule.
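A minimal sketch of the first of those capabilities, behavioural drift detection, assuming an agent’s behaviour can be summarized as a distribution over action types per time window; the distance measure and threshold are illustrative choices, not a standard.

```python
from collections import Counter

def action_distribution(actions: list[str]) -> dict[str, float]:
    counts = Counter(actions)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    # Distance between two action-type distributions: 0 = identical, 1 = disjoint.
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def drift_alert(baseline_actions: list[str], recent_actions: list[str],
                threshold: float = 0.15) -> bool:
    # Fires when the recent action mix has diverged from the approved baseline,
    # even if no individual action violated a rule.
    return total_variation(action_distribution(baseline_actions),
                           action_distribution(recent_actions)) > threshold

# The opening scenario in miniature: an agent that starts routing around the
# approval step trips the alert even though every single action was permitted.
baseline = ["lookup", "draft_po", "request_approval", "submit"] * 25
recent   = ["lookup", "draft_po", "submit"] * 30 + ["request_approval"] * 3
assert drift_alert(baseline, recent)
```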
The observability ecosystem’s best response is not to compete on the features Frontier bundles, but to build the adaptive monitoring capabilities that Frontier’s own governance has not yet delivered. The window for this is open now. It will narrow as platform defaults solidify.
Five questions teams should ask governance vendors
Whether you are evaluating Frontier, a competing platform, or building your own, these questions apply to any vendor claiming to be your agent governance layer.
1. Independence and evaluation incentives. Is the governance provider also a model provider? If so, how are evaluation conflicts of interest managed? Can you audit whether the provider’s own agents receive different treatment in benchmarks, optimization loops, or performance reporting?
2. Readiness for self-learning agents. Are governance controls static or adaptive? Does the platform detect behavioural drift as agents accumulate experience? What is the reversibility model: can you roll back context-level learning, audit experience distillation, and sandbox RL-updated models before redeployment?
3. Data ownership in the learning loop. Who owns the deployment traces that make self-learning possible? Can the governance provider use your traces to improve its own platform or other customers’ agents? What happens to your data when you leave?
4. Observability depth. Does the platform govern at the output/trace level only, or does it access internal model signals? For third-party models, what telemetry is available? What is the detection latency for anomalous agent behaviour, and does it rely on output analysis or earlier indicators?
5. Portability and auditability. Can you migrate governance to a different platform? Are policies, eval configurations, and audit data exportable in open formats? Can you audit what the governance platform itself did, through tamper-evident logs independent of the governance provider?
Overall
Frontier is a big move. Its architecture shows us where enterprise AI is heading: toward agents that learn from deployment, accumulate institutional knowledge, and improve with experience. That is a meaningful jump from today’s largely static agents and static governance, and any enterprise planning its AI stack should take the signal seriously.
The structural question of who should govern the agents is unresolved. But the more urgent takeaway is that the self-learning future is arriving, and most of the ecosystem is not ready. Governance tools are built for static agents. Observability platforms trace individual calls but not behavioural trajectories over time. The depth of what governance can actually see is constrained by whether model providers treat internal-state telemetry as a product or keep it as a moat.
For practitioners, the implication is concrete. Start asking how your governance and observability stack handles agents that change over time, and who owns the data that makes that change possible. If the answer to either question is unclear, the gap between your tooling and the agents it needs to govern is about to widen fast.

