Field notes from wiring one of the world's largest multilateral institutions for agents — not "AI features bolted onto a SaaS," but a governed operating system that collapses the cost of institutional capability, rewrites how the organization is structured, and compounds into something vastly more powerful than the institution it replaces.
Most enterprise "AI strategies" are a list of chatbots. Ours isn't. Over the past several months we have been designing, and standing up, the agentic infrastructure for a global, evidence-driven institution that operates in six official languages across 150+ offices, under a mandate where a wrong answer is not a UX problem but a credibility one.
The claim of this piece is bigger than "we added some AI." It is that a correctly built agentic layer changes the economics of the institution itself. It makes capability that used to require armies of expensive specialists cheap and reproducible, and then, because the cost has collapsed, it lets the institution attempt things it could never have afforded to attempt before. That is the actual transformation. Not faster typing. A different cost curve for institutional intelligence, and therefore a different institution.
The work splits into two halves that reinforce each other. The first is plumbing: a governed way for any agent to safely reach the systems the institution actually runs on. The second is a workforce: a fleet of named specialist agents that receive goals, call those systems through the plumbing, verify their own output, and escalate to humans at the right moments. This piece is the field notes — the architecture, the economics, and the principles we refuse to compromise on.
For a century, the binding constraint on a knowledge institution has been the same: expert attention is scarce and expensive. Synthesizing the world's evidence, watching every data stream for the one anomaly that matters, re-writing the same guidance into six languages, reconciling thousands of transactions, checking that every citation in a 400-page document is real and not retracted — all of it is bottlenecked on a small number of highly-trained, highly-paid people who can only be in one place at a time.
Everything an institution doesn't do, it doesn't do because it can't afford the attention. The guideline that's three years out of date. The surveillance signal nobody was looking at on a Sunday night. The country office that gets generic global guidance instead of an analysis tailored to its situation, because tailoring it 150 times by hand is impossible. These are not failures of will. They are failures of unit economics.
Agentic infrastructure attacks exactly that constraint. It does not make experts type faster — it makes the expensive cognitive act reproducible at near-zero marginal cost. Once an agent can run a literature synthesis, the 200th synthesis costs almost nothing more than the first. Once an agent can produce a tailored situational analysis, producing it for all 150 offices costs roughly what producing it for one costs. The scarce thing stops being scarce. And when the scarce thing stops being scarce, the institution can do things that were previously off the table entirely — not 20% more of what it already did, but categorically new work.
That is the transformation. The efficiency savings are real and large, but they are the boring half of the story. The interesting half is everything the institution can now afford to attempt.
There's a useful way to classify any large institution's work: Record (documents, systems of record), Interface (dashboards and collaboration surfaces a human stares at), and Executor (work that actually completes itself). When we ran that lens over this institution, the answer came back roughly 45% Record, 45% Interface, 10% Executor.
That ratio is the diagnosis. Ninety percent of the institution's work is either inert documents or surfaces that require a human's eyeballs and judgment to produce any value at all. Almost nothing completes cognitive work autonomously. A methodologist holds an entire evidence base in their head and hopes to notice when a new trial overturns a recommendation. An epidemiologist eyeballs dashboards for anomalies — which means a human must be looking, awake, and not on leave. An officer re-writes the same brief into six languages through six separate teams. A SitRep gets assembled by hand from raw data and the morning news, every single morning.
Every one of those is a place where expensive attention is being spent on work that has structure — and structured cognitive work is exactly what an agent can do. The Era-2 pattern to kill is "email the analyst and wait." The Era-3 replacement is an agent that holds the whole picture, watches at 3 a.m. on a Sunday, drafts the brief in all six languages at once, and pages a human only when something is genuinely worth their judgment. The in-the-head work — the synthesis, the noticing, the cross-referencing that never touches a screen — is the agentic surface, and it is most of the institution's real cost.
Here is the mechanism that makes capability cheap rather than just automated, and it is the most important architectural decision in the whole program.
The naive approach is to build a bespoke integration for every pairing of N agents and M tools: an N×M explosion of glue code that no one can govern and every team rebuilds. That keeps capability expensive, because every new use case pays the full integration cost again. The escape is a standard. We standardized on the Model Context Protocol (MCP) as the single wire format by which any agent (a copilot in someone's inbox, an orchestrator answering a knowledge question, a scheduled automation flow) discovers and calls a tool or reads a resource.
This is the move that bends the cost curve. It decouples what the agent can do from which agent you bought, and it makes integration compounding instead of throwaway. Wrap one system as an MCP tool — the surveillance database, the document store, the cloud control plane — and it becomes instantly available to every agent in the institution, forever. The first team pays the integration cost; every subsequent team gets it free. Capability accretes. After a year, a new agent isn't built from scratch — it's assembled from a catalogue of tools that already exist, which is why the tenth agent costs a fraction of the first.
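To make the "wrap once, available everywhere" idea concrete, here is a minimal sketch of a shared tool catalogue. It is not the institution's registry and not an MCP SDK: the `Catalogue` class, the tool name `document_store.search`, the agent names, and the count of 20 tools are all hypothetical, chosen only for illustration. In MCP itself, discovery and invocation are the `tools/list` and `tools/call` requests; the two methods below stand in for them.

```python
from typing import Any, Callable, Dict, List


class Catalogue:
    """Hypothetical shared catalogue: a stand-in for an MCP server's tool list."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, handler: Callable[..., Any]) -> None:
        # A system is wrapped exactly once; duplicates are rejected.
        if name in self._tools:
            raise ValueError(f"tool already registered: {name}")
        self._tools[name] = handler

    def list_tools(self) -> List[str]:
        # Discovery: every agent sees the same names.
        return sorted(self._tools)

    def call(self, name: str, **arguments: Any) -> Any:
        # Invocation: look the tool up by name and pass keyword arguments.
        return self._tools[name](**arguments)


def bespoke_integrations(agents: int, tools: int) -> int:
    """Every agent wired to every tool by hand."""
    return agents * tools


def standardized_integrations(agents: int, tools: int) -> int:
    """Each agent and each tool implements the shared protocol once."""
    return agents + tools


catalogue = Catalogue()
# Stubbed handler for illustration; a real one would query the document store.
catalogue.register("document_store.search", lambda query: [f"doc matching '{query}'"])

# Three different kinds of agent, one catalogue, no extra integration work.
results = {
    agent: catalogue.call("document_store.search", query="guidance")
    for agent in ("inbox_copilot", "knowledge_orchestrator", "scheduled_flow")
}

print(bespoke_integrations(30, 20), standardized_integrations(30, 20))
```

With roughly 30 agents and an assumed 20 tools, hand-built glue means 600 integrations to maintain, against 50 protocol implementations under a shared standard. That gap is the cost curve the paragraph above describes.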
But the single most important governance decision was not "which tools." It was how agents reach tools safely inside the tenant. We settled on three rules:
The registry isn't a footnote to the strategy. The registry is the strategy: it is simultaneously the thing that makes capability cheap to add and the thing that makes it safe to add.
We organized the agentic estate into three layers, each with a different user, a different delivery vehicle, and a different risk profile — but all riding the same MCP fabric and the same governed catalogue. This is how the same underlying investment pays off three different ways.
Layer 1 — Everyday productivity (horizontal, in the flow of work). The everyday productivity suite already speaks MCP. The leverage here is small in code and enormous in reach: expose the institution's own systems as tools so they appear inside the tools staff already use. "Draft this brief from the meeting notes." "Summarize my inbox and flag the genuinely urgent." "Build the deck from this dataset." High adoption, low governance risk, no sensitive data. This layer's job is reach and trust — it puts agentic capability in front of every staff member on day one, inside software they already open.
Layer 2 — The knowledge layer (the institution's normative core). This is the part no generic enterprise has, and it is the moat. Its job is to be the one authoritative, always-cited answer. It is built from a vector retriever over the curated corpus, a knowledge graph for institutional memory and terminology, clean ingestion, and external lookups that always return provenance. Every answer cites the source object behind it; the system refuses when it isn't grounded. This is where the institution's century of accumulated expertise — currently locked in a few experts' heads and a million unsearchable documents — becomes a living, queryable asset that every agent and every officer can draw on instantly. The tacit becomes explicit, and the explicit compounds.
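The "cite or refuse" behaviour described above can be sketched as a small gate in front of the answer. This is an illustration under stated assumptions, not the production retriever: the `Passage` and `Answer` types, the 0-to-1 relevance score, and the `MIN_SCORE` threshold are hypothetical, and the answer text is simply quoted from the best passage so the sketch stays runnable.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Passage:
    source_id: str  # identifier of the source object in the curated corpus
    text: str
    score: float    # retriever relevance, assumed to lie in [0, 1]


@dataclass(frozen=True)
class Answer:
    text: Optional[str]   # None when the system refuses
    citations: List[str]  # source_ids backing the answer
    refused: bool


MIN_SCORE = 0.75  # hypothetical grounding threshold


def grounded_answer(passages: List[Passage]) -> Answer:
    """Answer only from passages that clear the threshold; otherwise refuse."""
    support = [p for p in passages if p.score >= MIN_SCORE]
    if not support:
        # Nothing in the corpus grounds an answer, so the system declines.
        return Answer(text=None, citations=[], refused=True)
    # A real system would generate text conditioned on `support`;
    # quoting the best passage keeps this sketch self-contained.
    best = max(support, key=lambda p: p.score)
    return Answer(
        text=best.text,
        citations=[p.source_id for p in support],
        refused=False,
    )


strong = grounded_answer([Passage("doc-1", "A", 0.9), Passage("doc-2", "B", 0.4)])
weak = grounded_answer([Passage("doc-2", "B", 0.4)])
```

The design choice worth noting is that refusal is a first-class return value rather than an exception, so downstream agents can treat "not grounded" as a normal outcome and escalate it.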
Layer 3 — Enterprise workflows (deep, vertical, auditable). This is where the fabric pays for itself in hard hours: narrow, high-volume processes where an agent calls a sequence of governed tools and produces a reviewable artifact. Recurring report assembly. Data-quality audits that open tickets instead of silently editing. Onboarding and offboarding orchestrated across identity systems. The selling point is auditability — every tool call logged through the gateway, every figure traceable to its source.
The point of three layers is that the same governed tool answers a Layer-1 user asking a question, grounds a Layer-2 authoritative answer, and feeds a Layer-3 scheduled flow. One catalogue, three altitudes of use, one bill for the underlying plumbing.
The headline deliverable is not "an assistant." It is a workforce of ~30 named specialist agents, each with a scoped allow-list of tools, an explicit authority level, an explicit list of things it must never do, and a defined escalation trigger. This is the difference between Era 2 (a chatbot you ask) and Era 3 (a workforce that does the work and brings you the result to approve). A few that show the range and the transformational ambition:
The design rule: many specialists beat one mega-bot. Each agent is small enough to reason about, govern, and verify. Together they are a parallel workforce that never sleeps, never forgets, works in every timezone at once, and gets cheaper per unit of work as volume grows, which is the exact inverse of how a human workforce scales.
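The four attributes every agent carries (a scoped allow-list of tools, an authority level, a never-do list, and an escalation trigger) lend themselves to a declarative definition that a gateway can check before any call. The sketch below is one way that might look. The `AgentSpec` fields, the `Authority` levels, the tool names, and the example conditions are hypothetical and are not taken from the actual fleet.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Authority(Enum):
    DRAFT_ONLY = "draft_only"  # produces artifacts for a human to approve
    ACT = "act"                # may write through governed tools


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class AgentSpec:
    name: str
    allowed_tools: FrozenSet[str]  # scoped allow-list
    authority: Authority
    never: FrozenSet[str]          # actions the agent must never take
    escalate_on: FrozenSet[str]    # conditions that hand off to a human


def authorize(spec: AgentSpec, tool: str, action: str, condition: str = "") -> Decision:
    """Check one requested call against the agent's declared limits."""
    if tool not in spec.allowed_tools or action in spec.never:
        return Decision.DENY
    if condition in spec.escalate_on:
        return Decision.ESCALATE
    if action == "write" and spec.authority is Authority.DRAFT_ONLY:
        # Draft-only agents cannot commit changes without a human.
        return Decision.ESCALATE
    return Decision.ALLOW


# Hypothetical spec for an evidence-synthesis agent.
evidence_agent = AgentSpec(
    name="evidence_synthesis",
    allowed_tools=frozenset({"literature.search", "trials.search"}),
    authority=Authority.DRAFT_ONLY,
    never=frozenset({"publish"}),
    escalate_on=frozenset({"retracted_citation"}),
)

print(authorize(evidence_agent, "literature.search", "read"))
```

Because each spec is a small, frozen value, it can be reviewed, versioned, and audited like any other configuration, which is part of what makes a specialist easier to govern than a single general-purpose bot.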
This is the part most "AI strategies" miss entirely, and it is where the deepest transformation lives. When a fleet of agents does the structured cognitive work, the shape of the organization changes, not just its speed.
Today the institution's structure is, in large part, a coping mechanism for the scarcity of attention. There are translation teams because translation is slow and manual. There are layers of analysts because data has to be stitched together by hand. There are reporting units because reports are assembled by humans. Whole reporting lines exist to move work between people who each hold one piece. Take away the scarcity, and most of those structures are revealed as scaffolding around a constraint that no longer exists.
In the revamped model, humans stop being the processors of cognitive work and become its directors and adjudicators. The methodologist no longer re-reads the literature; she sets the question, judges the agent's draft, and owns the call. The epidemiologist no longer stares at dashboards; he adjudicates the signals the agent surfaces. The officer no longer assembles the report; she approves and shapes the one the agent drafted. The org flattens, because you no longer need layers whose entire job was passing partially-finished cognition down a chain. A small number of senior people, each amplified by a fleet, can do what used to require a pyramid.
And critically: the institution can now take on missions that the old structure made impossible, not because anyone decided to be more ambitious, but because the work that was previously the limiting reagent — synthesis, monitoring, translation, reconciliation — now scales with compute instead of headcount. The roadmap stops being "what can we afford to staff?" and becomes "what should we do?" That is a different question, and an institution that gets to ask it is a different institution.
Anyone can point an agent at an API and demo something impressive. The institutions that actually transform — rather than run a pilot that dies in legal review — are the ones that get identity, the read/write seam, citation, and approval right from day zero. Governance isn't the brake on this program; it is the thing that lets the institution press the accelerator. These are the principles we will not compromise:
The compliance trail (what was decided, by which agent, on what evidence) and the security trail (which identity called which tool) are kept separate, and each is retained according to its data classification. The highest-risk agents, meaning anything touching health records or identity, sit in a dedicated risk tier with formal impact assessments before they touch real data, and are never autonomous on identifiable information. The result is an institution that can move fast because every action is attributable, reversible, and logged.
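Keeping the two trails separate is easier to see in code. The following is a minimal sketch, assuming an in-memory gateway: the `Gateway` class, its method names, and the record fields are hypothetical, and a real deployment would write to durable, access-controlled stores with retention set per data classification.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Gateway:
    """Hypothetical gateway that keeps security and compliance records apart."""

    security_trail: List[Dict[str, Any]] = field(default_factory=list)
    compliance_trail: List[Dict[str, Any]] = field(default_factory=list)

    def call_tool(
        self, identity: str, tool: str, handler: Callable[..., Any], **arguments: Any
    ) -> Any:
        # Security trail: which identity called which tool.
        # Logged before the call so failed calls are recorded too.
        self.security_trail.append(
            {"ts": time.time(), "identity": identity, "tool": tool}
        )
        return handler(**arguments)

    def record_decision(self, agent: str, decision: str, evidence: List[str]) -> None:
        # Compliance trail: what was decided, by which agent, on what evidence.
        self.compliance_trail.append(
            {
                "ts": time.time(),
                "agent": agent,
                "decision": decision,
                "evidence": list(evidence),
            }
        )


gateway = Gateway()
count = gateway.call_tool(
    "agent:data_quality", "records.count", lambda table: 42, table="visits"
)
gateway.record_decision(
    "data_quality", "opened ticket for missing rows", ["records.count=42"]
)
```

The two lists never share a record, so each can be given its own retention period and its own set of readers without one policy leaking into the other.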
The reason this gets vastly more powerful over time — and not just incrementally better — is that three things compound at once:
Tools × knowledge × memory is a flywheel, and it is mostly fixed cost. Once the fabric is built, adding the next capability is cheap, and the cheapness is what makes the institution attempt the next ambitious thing, which adds more tools and knowledge and memory, which makes the next thing cheaper still. This is how "cheap" and "vastly more powerful" turn out to be the same sentence: the marginal cost of institutional capability falls toward zero while the stock of capability climbs.
A strategy that can't start on Monday isn't a strategy. So the first move is deliberately small: stand up the evidence-synthesis agent plus the citation gate against one live question, behind the gateway. Literature and trial sources in; a draft evidence table with a fully logged, human-verifiable search trail out. One methodologist, one question, one week.
If that draft survives expert review, the whole fleet has its proof — and, just as importantly, its first compounding memory. Because the real asset being built here isn't any single agent. It's the governed fabric underneath them: the protocol, the gateway, the curated catalogue, the identity model, and the knowledge moat that every future agent inherits for free. Build that once, and the second agent is cheap, the tenth is nearly free, and the hundredth is a configuration choice.
This is what the third era of software looks like inside a serious, regulated, mission-critical organization. Not a chatbot in the corner. A governed operating system in which the marginal cost of capability collapses, the org chart reorganizes around judgment instead of processing, the institution's accumulated expertise becomes a living asset, and the whole thing compounds.
The arithmetic is stark. The old institution scaled its capability linearly with headcount and budget, and its best knowledge was perishable — it left when people left. The revamped institution scales capability with a flywheel that's mostly fixed cost, and its knowledge is permanent and accumulating. One of those curves crosses the other and never looks back.
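The crossing can be made explicit with a deliberately stylized model. The symbols are assumptions introduced here, not measured figures: $n$ is the number of units of cognitive work, $w$ the cost of one unit done by hand, $F$ the fixed cost of building the governed fabric, and $c$ the marginal cost of one unit done by an agent, with $c < w$.

```latex
\begin{align}
  C_{\text{hand}}(n)  &= w\,n \\
  C_{\text{agent}}(n) &= F + c\,n \\
  C_{\text{agent}}(n^{*}) = C_{\text{hand}}(n^{*})
      &\;\Longrightarrow\; n^{*} = \frac{F}{w - c} \\
  \frac{C_{\text{agent}}(n)}{n} &= \frac{F}{n} + c
      \;\xrightarrow{\;n \to \infty\;}\; c
\end{align}
```

Below $n^{*}$ the hand-built institution is cheaper; above it the fabric wins, and its average cost per unit keeps falling toward $c$ while the manual cost per unit stays at $w$. The smaller $c$ is relative to $w$, the earlier the crossing arrives.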
The institutions that wire themselves for agents now will compress cycle times that used to be measured in months down to days, watch streams they could never afford to watch, deliver in every language at once, and take on missions the old cost structure made unthinkable — all with a full audit trail. The ones that don't will keep doing by hand, expensively and slowly, what their peers are about to do by conversation, cheaply and continuously. We are betting — and building — on the first kind.