Every AI agent that interacts with a website today does it the same way: it pretends to be a person. It takes a screenshot, parses the pixels, figures out where the button is, and clicks it. Or it reads the DOM, hunts for the right element, and fires a synthetic event. Either way, the agent is reverse-engineering a human interface to do machine work. It's slow and fragile. It breaks every time someone moves a button two pixels to the left.
I've spent the last several months building Pagerunner, a tool that connects AI agents to real Chrome sessions over CDP, the Chrome DevTools Protocol. It works. But I'd be lying if I said the current approach doesn't feel like a stopgap. We keep building better machinery to make agents act like humans on sites that were never designed for them. What if the websites themselves learned to talk to agents directly?
That's starting to happen. And it's worth paying attention to, because the next eighteen months will reshape how agents and the web interact.
The structured web
The most interesting development right now is WebMCP. It landed in Chrome 146 behind a feature flag in February 2026, and it's a W3C standard proposal backed by Google and Microsoft. The idea is simple: websites can expose structured tools and actions to AI agents through a browser API called navigator.modelContext.
Two flavours. The declarative version annotates HTML forms with attributes like toolname and tooldescription, so agents can discover and submit forms without parsing visual layout. The imperative version lets sites register tools dynamically via JavaScript, complete with names, descriptions, and JSON Schema for inputs. The browser translates everything into MCP format for agents to consume.
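To make the declarative flavour concrete, here's a minimal sketch of how an agent-side scraper might discover annotated forms. The attribute names (toolname, tooldescription) come from the proposal as described above, but the exact markup shape is an assumption; treat this as illustrative, not spec-exact.

```python
# Sketch: discovering declaratively annotated WebMCP-style tools in a page.
# Attribute names follow the proposal; the form markup is a made-up example.
from html.parser import HTMLParser

class ToolFormParser(HTMLParser):
    """Collect forms that carry WebMCP-style tool annotations."""
    def __init__(self):
        super().__init__()
        self.tools = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and "toolname" in attrs:
            self.tools.append({
                "name": attrs["toolname"],
                "description": attrs.get("tooldescription", ""),
                "action": attrs.get("action", ""),
            })

parser = ToolFormParser()
parser.feed(
    '<form toolname="search_products" '
    'tooldescription="Search the catalogue" action="/search">'
    '<input name="q"></form>'
)
print(parser.tools[0]["name"])  # search_products
```

The point is the shape of the exchange: the agent gets a named, described action rather than pixels to interpret.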
The numbers are striking. Google reports 89% token efficiency improvement over screenshot-based methods. That makes sense. Instead of sending a 1280x720 screenshot to a model and asking "what's on this page?", the agent gets a structured list of available actions. It's the difference between describing a menu to someone over the phone and just handing them the menu.
Chrome only, for now. Edge will likely follow. Firefox and Safari are engaged in the spec process but haven't committed to timelines. Broader rollout is expected by late 2026. The first sites to adopt will probably be Google properties and Microsoft 365.
There's older infrastructure worth mentioning too. Schema.org has had an Action type and potentialAction property for years, originally built for search engine rich results. It's already on millions of websites. A BuyAction or SearchAction on an e-commerce site could, in theory, let an agent discover purchase or search flows without touching the UI. The limitation is that most sites only expose read-oriented actions (search, view) rather than transactional ones. But the wiring is there, waiting to be used.
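Here's what using that existing wiring looks like in practice: a SearchAction in JSON-LD declares a URL template an agent can fill in directly. The JSON-LD shape mirrors the sitelinks-search markup that search engines already consume; the site and template are made up.

```python
# Sketch: turning a schema.org SearchAction into a concrete request URL.
import json

jsonld = json.loads("""
{
  "@context": "https://schema.org",
  "@type": "WebSite",
  "url": "https://example.com",
  "potentialAction": {
    "@type": "SearchAction",
    "target": "https://example.com/search?q={search_term_string}",
    "query-input": "required name=search_term_string"
  }
}
""")

action = jsonld["potentialAction"]
# The placeholder name is declared in the query-input property.
placeholder = action["query-input"].split("name=")[1]
url = action["target"].replace("{%s}" % placeholder, "agent+tooling")
print(url)  # https://example.com/search?q=agent+tooling
```

No DOM, no screenshots: the agent goes straight from markup to request.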
Front doors for agents
The web is starting to build front doors specifically for agents. This is the part I find most exciting.
llms.txt was proposed in autumn 2024 by Jeremy Howard. It's a markdown file served at /llms.txt that gives agents a curated, pre-flattened content roadmap. No navigation, no ads, no JavaScript. Just the content an agent needs in a format it can consume directly. Adoption is growing, especially among API documentation sites. Makes sense: if your docs are the first thing an agent reads when building an integration, you want that experience to be clean.
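The format is deliberately simple: an H1 title, an optional blockquote summary, then H2 sections listing curated links. A minimal reader is a few lines; the sample file below is invented for illustration.

```python
# Sketch: a minimal reader for an llms.txt file, covering just the
# basic shape (H2 sections containing markdown link lists).
import re

sample = """# Example Docs

> Concise API documentation for agents.

## Guides
- [Quickstart](https://example.com/quickstart.md): five-minute setup
- [Auth](https://example.com/auth.md): token handling
"""

sections = {}
current = None
for line in sample.splitlines():
    if line.startswith("## "):
        current = line[3:].strip()
        sections[current] = []
    elif current and (m := re.match(r"- \[(.+?)\]\((.+?)\)", line)):
        sections[current].append((m.group(1), m.group(2)))

print(sections["Guides"][0][0])  # Quickstart
```

An agent can fetch the linked .md files directly, skipping the rendered site entirely.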
Google's A2A protocol takes a different approach. Agents publish "Agent Cards" at /.well-known/agent-card.json, describing their capabilities, supported tasks, and communication preferences. Fifty-plus partners signed on, including Atlassian, PayPal, Salesforce, and SAP. It's agent-to-agent discovery, not agent-to-website, but the pattern is the same: a machine-readable front door that lets automated systems find each other without human intermediation.
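Reading an agent card is just JSON parsing. The field names below (name, capabilities, skills) follow published A2A examples, but treat the exact schema as an assumption and validate against the spec in real code; the card itself is invented.

```python
# Sketch: reading a hypothetical A2A Agent Card and listing its skills.
import json

card = json.loads("""
{
  "name": "invoice-agent",
  "description": "Creates and reconciles invoices",
  "url": "https://agents.example.com/a2a",
  "capabilities": {"streaming": true},
  "skills": [
    {"id": "create_invoice", "name": "Create invoice"}
  ]
}
""")

skill_ids = [s["id"] for s in card["skills"]]
print(skill_ids)  # ['create_invoice']
```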
And then there's agenticweb.md. A single file declaring semantic context, security mechanisms, API endpoints, data formats, even compliance certifications like ISO 27001 and SOC 2. One file that tells an agent everything: who we are, what you can do, what you can't, how to verify our claims. Early days. But the concept is right.
All of these point in the same direction: the web needs a machine-readable layer alongside the human-readable one. We've had robots.txt for crawlers since 1994. We're building the equivalent for agents now.
The hybrid path
I don't think visual automation is going away. Too many sites will never adopt WebMCP or publish agent cards. Internal tools, legacy portals, small business websites. The long tail of the web won't opt in to structured agent interaction, at least not for years.
So the practical architecture looks like a ladder. Start with visual automation as the universal fallback. It works on any site. While the agent is clicking around, passively capture the network traffic through CDP. Analyse the traffic patterns. Which API endpoints does the site call when you click "Submit"? What headers does it send? What does the response look like? Over time, you can generate typed API clients from observed patterns, shifting from visual interaction to direct API calls as confidence builds.
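The pattern-analysis step can be sketched simply: collapse observed requests into endpoint signatures that a typed client could then be generated from. In a real pipeline the records would come from CDP's Network domain; here they're hard-coded, and the endpoints are made up.

```python
# Sketch: inferring API endpoint signatures from observed traffic.
from collections import defaultdict

observed = [
    {"method": "POST", "path": "/api/v1/orders", "body_keys": ["sku", "qty"]},
    {"method": "POST", "path": "/api/v1/orders", "body_keys": ["sku", "qty", "note"]},
    {"method": "GET", "path": "/api/v1/orders/123", "body_keys": []},
]

signatures = defaultdict(set)
for req in observed:
    # Normalise numeric path segments into a parameter slot.
    path = "/".join(
        "{id}" if seg.isdigit() else seg for seg in req["path"].split("/")
    )
    signatures[(req["method"], path)].update(req["body_keys"])

for (method, path), fields in sorted(signatures.items()):
    print(method, path, sorted(fields))
```

Each signature (method, path template, field set) is enough to emit a typed function, which is the "graduating to direct API calls" step.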
And before any of that, check whether the site already speaks agent. Does it have an agent card? Does navigator.modelContext expose tools? Is there an llms.txt or agenticweb.md? If so, use them. They're faster, cheaper, and more reliable than anything you can reverse-engineer.
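The probe itself is a short ladder, checked in order of preference. The fetch is stubbed with a predicate so the logic stays visible; a real implementation would issue HTTP requests, and the preference ordering here is my assumption, not a standard.

```python
# Sketch: "does this site speak agent?" checked before falling back to
# visual automation. Paths follow the respective proposals.
FRONT_DOORS = [
    "/.well-known/agent-card.json",  # A2A discovery
    "/llms.txt",                     # curated content roadmap
    "/agenticweb.md",                # declared context and policies
]

def pick_entry_point(has_path):
    """Return the first structured front door found, else fall back."""
    for path in FRONT_DOORS:
        if has_path(path):
            return path
    return "visual-automation"

# A hypothetical site that only publishes llms.txt:
print(pick_entry_point(lambda p: p == "/llms.txt"))  # /llms.txt
print(pick_entry_point(lambda p: False))             # visual-automation
```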
Tools like Unbrowse and Integuru are already exploring the traffic-capture-to-API-client pipeline. Unbrowse claims interaction times drop from 5-30 seconds (visual) to 50-200 milliseconds (direct API), with token usage falling from ~8,000 to ~200 per action. I haven't verified those numbers independently, but the direction makes intuitive sense. Screenshots are expensive. Structured data is cheap.
The best agent architecture doesn't choose between visual automation and structured APIs. It starts with one and graduates to the other.
What I'm watching
If you're building agent infrastructure today, here's how I'd prioritise:
MCP is the standard. 97 million monthly SDK downloads, adopted by Anthropic, OpenAI, Google, Microsoft, AWS. If you're building agent infrastructure and not speaking MCP, you're building in a vacuum. The recent MCP Apps Extension — a collaboration between Anthropic and OpenAI — adds interactive UI capabilities through sandboxed iframes. Agents can now show a human what they found, not just describe it.
WebMCP is the bet I'm most excited about. Browser-native structured interaction would eliminate entire categories of breakage in agent-web communication. Still behind a feature flag, still Chrome-only, still waiting on site adoption. But when it lands? The agents that can query navigator.modelContext for available tools will be working in a completely different mode from the ones still parsing screenshots.
A2A matters for discovery. Agent cards as a standard would make finding and connecting to services much simpler, and the protocol has real backing — Linux Foundation since June 2025, fifty-plus enterprise partners. VOIX, ANP, and Agent Protocol are interesting to track but too early to build on.
What this means for Pagerunner
Honestly? If WebMCP succeeds, a lot of what Pagerunner does today becomes unnecessary. I think that's fine. The goal was never to build the best screenshot-parsing tool. It was to give agents real access to real web services. If the web learns to offer that access natively, that's a better outcome than anything I could build on top of CDP.
In the meantime, the web as it exists today still needs visual automation and real browser sessions. Still needs the privacy and session management infrastructure we built. The transition from "agents pretend to be humans" to "sites talk to agents directly" won't happen overnight. It'll happen site by site, protocol by protocol, over two or three years.
I'm building for both.