Nine New Kinds of Software Agents Made Possible
A colleague at the WHO described a problem to me in ninety seconds. Her team handled around two hundred entitlement requests a week, and half the team had just been let go. Twenty-five minutes later, I sent her a URL. The app at the other end didn't just process requests — it decided whether to process them. It read each case, judged whether it was a duplicate, routed the simple ones, and escalated the ones it wasn't sure about.
That app did not exist in 2023. More precisely: it could not have existed. This is the part most people miss. The story of AI in software is usually told as a story about speed — the same things we always built, built faster. That framing is comfortable and it is wrong. The real story is that entirely new categories of software are now possible — shapes of software that had no way to exist two years ago, because the missing ingredient was judgment at runtime.
This is not a forecast. Over the last twelve months, working across the WHO, our own venture portfolio, and consulting clients, we shipped more than twenty-five of these applications into production — each with its own backend, frontend, infrastructure, and deployment pipeline. What follows is a field map of the nine kinds we keep building. Every example below is a real, running system; I'll name it and tell you what it actually does.
The One Shift Everything Else Is Downstream Of
For sixty years, software was a recipe. You specified every step in advance. Inputs were predefined fields. Logic was deterministic rules written at design time. Outputs were predetermined responses. The behaviour of the program was fixed the moment it shipped, and a human had to operate it through a screen built for exactly that purpose.
Agentic software is not a recipe. It is an actor. Its inputs are open-ended natural language. Its logic is judgment and reasoning, decided mid-execution rather than fixed at design time. Its outputs are outcomes, not canned responses. Its interface is optional. And instead of one product serving a million users, a single person can now have one workflow served by one bespoke program.
Five forces sit underneath this. Software can now reason at runtime — the model picks the next code path by reading natural language, instead of following an if/switch decided at design time. The marginal cost of an app has collapsed — six weeks of a team became two days of one person, an order-of-magnitude change, not a percentage. A user interface is no longer required — if the agent does the work, you don't need a screen to operate it. Natural language became a first-class input — forms and menus existed because software needed structure; the model doesn't. And software can operate other software — the DOM is an API, a screenshot is an API. Hold these five. Every category below is one of them, made concrete.
One — Software That Does the Job, Instead of Helping You Do It
Old software helped you work. The new software is the worker. A CRM gave you a "send email" button and you did the rest. An agent now researches the prospect, drafts the message, sends it, and follows up — autonomously.
The relationship to your tools inverts. You stop operating software and start directing it. You review outcomes; you don't perform steps. It feels less like using a tool and more like managing a capable junior: you approve, redirect, reject. This is the category that absorbs the most existing work and creates the most value — and it is also the one that fails most quietly. It works until it silently doesn't, so the discipline is to ship a review gate, not blind trust.
Our own Metamatics Fund is built this way. Its analyzer fans out one model call per market across more than a hundred markets, predicts the competitive dynamics out to 2046, and writes the investment thesis itself — narrative and recommendation, traceable to its inputs, not a number on a dashboard. Investment research, which used to be a building full of analysts, has become an embarrassingly parallel workload that runs with zero human-hours of operation and a single person reviewing the output. The leverage is real. So is the need for that one reviewer.
Two — Software With Judgment, Including the Judgment to Refuse
"I don't know" used to be a crash. Now it is a return value.
Old routing logic was if field X equals Y, send to team Z. New routing reads the actual request, judges whether it's a duplicate, routes it the way a person would, and escalates when it isn't sure. The model decides like a human, not like a hard-coded branch. Crucially, refusal is a first-class behaviour — the agent can decline to act, and that is a feature. This is exactly what lets agents touch consequential work: handle the easy eighty percent confidently, escalate the uncertain twenty percent honestly.
At the WHO, our cso-product classifies whether a draft document qualifies as a normative product — a guideline, a handbook — using a committee of seven specialist modules and a question generator that picks the next best question from a bank of more than two hundred, so the interview gets longer or shorter depending on how certain the system already is. When two categories are close, it writes the reason in plain language: "Close decision between X and Y — review these attribute scores." The pharma tool specs-ai is even stricter: it writes the narrative of an impurity report with a model, but the chemical structures come only from a local database populated from PubChem and validated by RDKit. The rule, written into the code, is "No AI-generated SMILES. No guessing." The failure mode here isn't a logic bug — it's mis-calibration. You tune a threshold, not a rulebook, so you test the edges relentlessly.
Three — Software With No Interface At All
The most useful new applications never get opened. The deliverable is the deliverable.
Every product used to need a screen so a person could operate it. Now you can build a pure agentic backend whose outputs are emails, files, decisions, and actions. Our WHO donor-reporting system takes a loose folder of inputs — financial spreadsheets, project notes — and hands back a finished donor report as a Word document. No login, no dashboard, no app to open; it even runs on an annual calendar rather than waiting for a click. Some of the most defensible companies of the coming decade will ship outcome streams, not dashboards — like OpenClaw, a personal assistant that has no screen at all and simply appears as a contact inside the messengers you already use. The catch is that the risk inverts: when there's no UI, failure becomes invisible. You have to instrument the two questions a screen used to answer for free — did it run, and did it deliver?
Four — Software Configured in Prose
English has become an executable specification.
For decades, code, configuration, and content were three separate things: code defined behaviour, config tweaked parameters, content was what humans read. They are collapsing into one. In donor-reporting, the program logic is a Word file — executive_prompt.docx. Edit that document and you have changed the system's behaviour, with no code commit. In our research agent Hyperthesis, the "code" for how it thinks is a pair of multi-paragraph English prompts with real branching inside them — "If this is the first iteration, focus on breadth" — that you can edit a paragraph of and immediately ship as new behaviour. There is no compiler; the model is the runtime.
This may be the deepest shift of the nine. The bottleneck moves from who can code to who can specify behaviour clearly, which makes domain experts first-class builders for the first time — the lawyer, the analyst, the clinician can author the program directly. The corollary is uncomfortable for most teams: if the prompt is the program, it deserves version control, review, and diffs. The configuration of these systems has even climbed past single files into whole Markdown documents — CLAUDE.md, AGENTS.md — that the model reads at the start of every session to learn the system's invariants. Documentation has been promoted to a runtime code path.
Five — Disposable Software
A piece of software used to be a six-month IT project that had to last years to justify its cost. The economics flipped. Now it can take three hours to build and three hours to throw away.
This sounds trivial and it is not. The unit of digitisation shrank by roughly ten times, which means the entire long tail of useful-but-small software is finally buildable. Most of the disposable apps we built didn't replace an existing product — they replaced "we just don't have a tool for that." The WHO folder holds dozens: info-extractor, where you upload a PDF and type "extract every dosage value mentioned in the clinical trials" and the model figures out how; redacting-app, which judges that "John Smith" is a name but "John Smith Pharmacy" is a company — a distinction no regex can make; keyword-highlight; detect-ai, a one-screen classifier so cheap it was redeployed across three hosting providers on a whim. One workflow, one person, sometimes one week. Deleting the app afterward is success, not waste.
Six — Software That Builds Other Software
Code generators used to produce templates and scaffolding — everything decided up front. Now agents write, deploy, and operate other agents at runtime, sizing the team to the job and dissolving it afterward.
Kybernetist, our operator console for autonomous agents, is the clearest case. Hand it a task and its tool-discovery step asks the model to predict what capabilities it will need, searches the public MCP registries — Smithery, npm — in parallel, then uses the model as a judge to install the best candidate. If the registries come back empty, it falls back to writing the tool itself. It also spawns its own sub-agents on demand — a research agent, a code agent, a scraper, a writer — with the human kept on the loop through Telegram approvals. It is, in effect, a package manager driven by an LLM judge. This is the category that breaks the oldest assumption in software, that team size equals output.
Seven — Software That Operates Other Software
Integration used to mean APIs. No API meant brittle scraping, a six-month project, or simply "we can't." That integration tax shaped enterprise IT for a generation.
It just collapsed. The model reads screens. The DOM is an API. The screenshot is an API. Our sales agent Primespect drives LinkedIn straight through the user's own authenticated browser — no headless browser, no proxy, no stored passwords — reading the prospect's actual recent activity rather than a template variable, and routing the work across three model tiers (a fast model triages, a stronger one writes, a third plans the sequence). Behind it sits a catalogue of more than sixty tools — Maps, Gmail, Drive, Calendar, a browser driver — that the model dispatches to as needed. CursorRemote takes the same idea to its logical end: it reads Cursor's own chat panel through Chrome DevTools and relays it to your phone, so you can approve or redirect a coding agent from Telegram. Visual interfaces have re-emerged as universal APIs, because the model can read a screen the way a person does. Most "we need an integration with X" projects are now an afternoon of work.
Eight — Proactive Software That Runs Without Being Asked
Old software waited to be opened. You launched the app, clicked the button, invoked the function — nothing happened until a human started it. New software runs on a schedule or a trigger, watches continuously, judges whether anything is even worth your attention, and comes to you.
The trigger moved from the human to the software. Our WHO policy-monitor crawls Azure compliance state across every subscription on its own, enriches each raw policy with a plain-English explanation and a remediation step, and produces the narrative report nobody has to sit down and write. The Metamatics Fund analyses its markets and the donor reports assemble themselves on a calendar; nobody launches them. The craft here is restraint: the hard part isn't acting, it's deciding when not to interrupt you. And like all no-UI work, it fails silently — if nobody notices the digest stopped arriving, nobody knows it broke.
Nine — Software as a Committee of Specialist Roles
One prompt, one response, one model — that was the whole shape of an AI feature a year ago. Now a single application is a committee. A fast, cheap model triages. A capable model writes. A third grades. A fourth synthesises.
The model picker is the architecture. Complexity, our consulting swarm, runs a Strategist → Scout → Synthesis pattern at every step: a fast model picks the minimum set of tools, those tools run concurrently, a scout extracts the facts that matter, and a stronger model writes the answer — three model calls per step, dozens of steps per phase, scheduled across a dependency graph so independent work runs in parallel. After each step it runs a quality gate: a model judges whether the output is usable and either passes it, retries it with a hint, or fails it. The agent grades its own work before showing it to you. Primespect blends three different model families by task; step-determine, the WHO salary-step tool, fires one model call per past position on a candidate's record and a final call to aggregate them — hundreds of judgments per person. "Which model do you use?" is the wrong question; the right one is "which committee, in what shape?" The architecture diagram of a modern agentic system has stopped looking like a flowchart and started looking like an org chart.
Real Systems Are Sentences, Not Single Words
No production application is exactly one of these kinds. Each is a stack of four or five. The WHO donor-reporting system has no interface, is configured in prose through its executive prompt, is disposable in the sense that it was built fast for one team's recurring need, and exercises judgment about what each report must cover — four kinds fused into one app. Primespect operates other software through the browser, runs as a committee of models, does the outreach itself, and builds its own sub-agents on the fly. Kybernetist builds other software, runs proactively on schedules, organises itself as a committee, and drives other tools through their screens.
So treat the nine as a vocabulary, not a taxonomy. Their value isn't in filing your idea under one heading. It's in letting you say precisely what you're building — and inheriting, with each word, the specific constraint that comes attached to it.
Where the New Categories Are Not Yet Reliable
Honesty is the whole job, so here are the limits. These categories are excellent for triage and preparation and not yet trustworthy for final calls in medicine, law, or regulation — surface, prepare, and draft, but don't let them decide. Agents also fake outcomes convincingly: ask "did you send the email?" and the answer is "yes" even when nothing was sent, so verify before you depend. Behaviour drifts over weeks in ways you won't notice unless you pin a handful of cases and re-run them; the system that worked in January is subtly wrong by June. And no-UI software fails invisibly — the absence of a screen removes the place where errors used to show up.
There is a single craft principle underneath all four. When the default behaviour of your engine is invention, the discipline of the engineer is to forbid invention in the places that matter. The best systems we build encode this literally. Our economic-modelling tool, Excel, holds one inviolable rule: no number in the model exists without a path to either a quoted sentence in a cited document or an explicit human estimate with a named owner. specs-ai refuses to invent a chemical structure. Kybernetist ships an explicit honesty contract — "No simulated step output. Ever." — and stamps every result with which model actually produced it, so an operator can always see what really ran versus what was meant to run. In 2022 we engineered software to be fast and correct. In 2026 we also engineer it to be honest. It is harder than it sounds. It is also why this work is worth doing.
The New Question — and the Opportunity in It
The old question was "How do I build my current software faster?" The new question is "What kind of software does my problem actually call for now — and which of these nine kinds is it?"
This is the opportunity, and it is unusually accessible. You are almost certainly sitting on at least one workflow that quietly belongs in one of these categories — thirty judgment-heavy minutes you repeat every week, or a task spread across five tools that don't talk to each other. You can build the first version in an afternoon: pick the workflow, write the prompt as if it were the spec, skip the UI, bake in the honesty contract, and ship it to a single user — yourself — for a week. Then decide to keep it, fork it, or throw it away. Disposable is a perfectly good outcome.
If software's identity has changed this completely, the team that ships it has to change too. The old studio meant fifteen to thirty people, nine to eighteen months, and a million dollars to reach version one. Our agent-first model is one to three people plus dozens of agents, one to three weeks, and an order of magnitude less capital — which is how a single studio ships five to ten ventures a year instead of one. That isn't aspirational; it's a description of our last twelve months. Studios that bolt agents onto the old process will lose to studios designed around agents from day zero.
Software stopped being a recipe and became an actor. The categories that exist now did not exist two years ago. The people who notice first will build the next decade.
