Claude can see your screen now.
That's the problem.

Anthropic's Computer Use ships screenshot-based browsing into Claude Code. Here's why DOM-based browser MCP is faster, cheaper, and more private.

Anthropic's Computer Use announcement tweet drew 14 million views. The pitch: Claude Code can now see your screen, move the mouse, click buttons, and type, like a human sitting at your desk.

Once the novelty wore off, the developer reaction was less enthusiastic. The feature is real. The problems are real too. And they point to a fundamental architectural question: is screenshotting your desktop the right way for an AI agent to use a browser?

The token cost problem

Computer Use works by taking a screenshot, sending it to Claude's vision model for analysis, deciding on an action, executing it, then taking another screenshot. Every screenshot costs roughly 2,000 tokens. Every action is a screenshot-analyze-act loop.

Claude Code users are already hitting rate limits faster than expected. Anthropic acknowledged a caching bug that inflates costs across all Claude Code usage. Computer Use compounds this: every screenshot adds ~2,000 tokens of pixel data carrying no semantic information the agent needs. On an already strained token budget, vision-based browsing is one of the most expensive things you can do.

For a task like filling out a form with ten fields, that's twenty screenshots minimum: navigate, screenshot, click field one, screenshot, type, screenshot, click field two, screenshot, and so on. Each round trip costs tokens. Each round trip takes 30–60 seconds.
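To put numbers on that loop, here is a rough cost model in Python. It is a sketch, not a measurement: the ~2,000 tokens per screenshot and 30–60 seconds per round trip are the estimates quoted above, and the function names are made up for illustration. Counting the steps exactly as listed (navigate, then click and type for each field) gives 21 screenshots, in line with the "twenty minimum" estimate.

```python
# Rough cost model for the screenshot-analyze-act loop.
# Both constants are the article's estimates, not measured values.
TOKENS_PER_SCREENSHOT = 2_000
SECONDS_PER_ROUND_TRIP = (30, 60)


def form_fill_actions(num_fields: int) -> list[str]:
    """List the actions for a form fill: navigate, then click and type per field."""
    actions = ["navigate"]
    for i in range(1, num_fields + 1):
        actions.append(f"click field {i}")
        actions.append(f"type into field {i}")
    return actions


def loop_cost(actions: list[str]) -> dict[str, int]:
    """Each action is followed by one screenshot, so screenshots == actions."""
    screenshots = len(actions)
    low, high = SECONDS_PER_ROUND_TRIP
    return {
        "screenshots": screenshots,
        "tokens": screenshots * TOKENS_PER_SCREENSHOT,
        "min_seconds": screenshots * low,
        "max_seconds": screenshots * high,
    }


cost = loop_cost(form_fill_actions(10))
print(cost)
# {'screenshots': 21, 'tokens': 42000, 'min_seconds': 630, 'max_seconds': 1260}
```

Under these assumptions a ten-field form costs about 42,000 tokens and somewhere between ten and twenty-one minutes of wall-clock time.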

Screenshots vs. the DOM

The core issue isn't bugs or rate limits. Those get fixed. The core issue is the abstraction layer.

A screenshot is a lossy, expensive representation of something that already exists in a structured format: the DOM. When Claude looks at a screenshot of a webpage, it's reverse-engineering pixel coordinates from an image of a data structure it could read directly.
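A small Python sketch makes the point concrete. The page fragment below is hypothetical, and the standard library's `html.parser` is only a stand-in for reading a live DOM, but it shows how the fields an agent needs fall out of the markup directly, with names and types attached and no pixel coordinates involved.

```python
from html.parser import HTMLParser

# Hypothetical page fragment, made up for illustration.
PAGE = """
<form id="signup">
  <input name="email" type="email">
  <input name="birthday" type="date">
  <select name="plan"><option>free</option><option>pro</option></select>
  <button type="submit">Create account</button>
</form>
"""


class FieldCollector(HTMLParser):
    """Collect form controls with the attributes an agent would target."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: list[dict[str, str]] = []

    def handle_starttag(self, tag, attrs):
        # Only form controls are interesting here; everything else is skipped.
        if tag in ("input", "select", "textarea"):
            info = dict(attrs)
            self.fields.append({
                "tag": tag,
                "name": info.get("name") or "",
                "type": info.get("type") or "",
            })


collector = FieldCollector()
collector.feed(PAGE)
for field in collector.fields:
    print(field)
```

Each printed entry is already a selector-ready description of a field, which is the information a vision model would otherwise have to infer from an image.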

Head-to-head benchmarks from the first 48 hours tell the story. DOM-based browser tools won 4 out of 5 tasks against Computer Use. The one Computer Use won was a native desktop interaction that doesn't involve a browser at all.

The speed difference:

| | Computer Use (screenshot) | DOM-based browser MCP |
| --- | --- | --- |
| Action latency | 30–60 seconds | 1–2 seconds |
| Tokens per action | ~2,000 (screenshot) | ~50–200 (DOM selectors) |
| Bot detection | Blocked frequently | Stealth mode, real browser fingerprint |
| Complex form fills | Needs JS workarounds for date pickers, dropdowns | Native DOM interaction |
| Auth state | Re-login every session | Persistent encrypted profiles |
| Rate limit burn | 5-hour limit in ~90 min | Negligible token overhead |

Vision is the right tool for understanding a screen when you have no other way in. But a browser already exposes its full state through well-documented protocols. Using vision to read a webpage is like OCR-ing a spreadsheet instead of opening the CSV.
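Taking the per-action token figures from the comparison table at face value, a quick back-of-the-envelope script shows how far a fixed budget stretches under each approach. The 100,000-token budget is a hypothetical number chosen for illustration, and the helper name is made up.

```python
# Per-action token figures from the comparison table (the article's estimates).
SCREENSHOT_TOKENS = 2_000
DOM_TOKENS_WORST = 200
DOM_TOKENS_BEST = 50

# Hypothetical budget, chosen only to make the comparison concrete.
BUDGET_TOKENS = 100_000


def actions_within_budget(budget_tokens: int, tokens_per_action: int) -> int:
    """Number of whole actions that fit in the budget at a given per-action cost."""
    return budget_tokens // tokens_per_action


screenshot_actions = actions_within_budget(BUDGET_TOKENS, SCREENSHOT_TOKENS)
dom_actions_worst = actions_within_budget(BUDGET_TOKENS, DOM_TOKENS_WORST)
dom_actions_best = actions_within_budget(BUDGET_TOKENS, DOM_TOKENS_BEST)

print(f"screenshot loop: {screenshot_actions} actions")
print(f"DOM-based:       {dom_actions_worst} to {dom_actions_best} actions")
```

With those inputs the screenshot loop fits 50 actions in the budget, while the DOM-based approach fits between 500 and 2,000.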

The privacy question

Computer Use sends screenshots of your desktop to Anthropic's servers. Not just the browser tab you're working in. Anything visible on screen at the time of capture. Dark Reading flagged security vulnerabilities in the research preview, and prompt injection risks are real: a malicious page could render invisible instructions that Claude's vision model reads and follows.

The screenshot model creates a privacy surface that's hard to scope. You can't predict what will be on screen when the next screenshot fires. Notification popups, background tabs, messaging apps. All fair game.

A DOM-based approach avoids this entirely. The agent interacts with a specific browser tab through a protocol. It reads the DOM tree, not a raster image of your monitor. The agent never sees anything outside the tab it's working in.

What this means for browser automation

Those 14 million views confirm what most developers already knew: agent workflows need browser access. The screenshot loop works as a demo. As a daily coding tool, it falls apart fast.

What developers actually need is a browser layer that's fast enough to stay in the flow of a coding session, cheap enough to run continuously, and private enough to trust with authenticated sessions.

Pagerunner is a Rust MCP server with 38 tools that drives Chrome through CDP (Chrome DevTools Protocol), the same protocol Chrome's own DevTools use. Screenshots are available when you need them, but the agent doesn't depend on them. Direct DOM access at sub-second latency, no vision model overhead.
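For a sense of what driving Chrome through CDP looks like on the wire, here is a minimal Python sketch that builds two CDP commands as JSON. `DOM.getDocument` and `DOM.querySelector` are standard CDP methods, but this is not taken from Pagerunner's source: the helper name, the selector, and the placeholder `nodeId` are assumptions, and nothing is actually sent to a browser.

```python
import json
from itertools import count
from typing import Optional

# CDP commands carry a client-chosen id so responses can be matched to requests.
_ids = count(1)


def cdp_command(method: str, params: Optional[dict] = None) -> str:
    """Serialize one CDP command as the JSON text sent over the DevTools WebSocket."""
    return json.dumps({"id": next(_ids), "method": method, "params": params or {}})


# Ask for the document root, then resolve a CSS selector beneath it.
get_root = cdp_command("DOM.getDocument", {"depth": 1})

# nodeId 1 is a placeholder; a real client would use the root nodeId
# returned in the DOM.getDocument response.
find_email = cdp_command(
    "DOM.querySelector",
    {"nodeId": 1, "selector": "input[name=email]"},
)

print(get_root)
print(find_email)
```

Each message is a few dozen bytes of structured text, which is where the gap in tokens per action comes from.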

Named Chrome profiles store AES-256-GCM encrypted session state. You log into GitHub, Jira, your staging environment once. From then on, open_session(profile="work") launches Chrome already authenticated, with no re-login loops or cookie expiry surprises. A local ONNX NER model strips personally identifiable information before anything reaches the LLM, mapping real values back on the way out. Pagerunner also maintains adapters for common sites and a selector cache that learns across sessions, so reliability compounds over time.
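The mask-then-restore idea can be sketched in a few lines of Python. To be clear about the substitution: Pagerunner is described as using a local ONNX NER model, while this example swaps in a plain regular expression that only catches email addresses. The placeholder format and function names are hypothetical.

```python
import re

# Simplified email pattern; a stand-in for a real NER model, not a replacement.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask(text: str) -> tuple[str, dict[str, str]]:
    """Replace each distinct email with a placeholder and return the mapping."""
    mapping: dict[str, str] = {}

    def _swap(match: re.Match) -> str:
        value = match.group(0)
        # Reuse the placeholder if this value was already seen.
        for token, original in mapping.items():
            if original == value:
                return token
        token = f"[EMAIL_{len(mapping) + 1}]"
        mapping[token] = value
        return token

    return EMAIL.sub(_swap, text), mapping


def unmask(text: str, mapping: dict[str, str]) -> str:
    """Put the real values back into text that came from the model."""
    for token, original in mapping.items():
        text = text.replace(token, original)
    return text


page_text = "Assigned to ana@example.com, cc bob@example.org and ana@example.com"
masked, mapping = mask(page_text)
print(masked)    # Assigned to [EMAIL_1], cc [EMAIL_2] and [EMAIL_1]

reply = "Send the report to [EMAIL_2]"
print(unmask(reply, mapping))    # Send the report to bob@example.org
```

The model only ever sees the placeholders, and the mapping never leaves the local machine.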

Because Pagerunner launches a real Chrome instance rather than a headless or puppeted window, sites see a normal browser fingerprint. Cloudflare doesn't block it. CAPTCHAs don't fire.

Pagerunner ships as both an MCP server and a standalone CLI. The MCP interface connects to Claude Code, Cursor, Windsurf, Codex, or any MCP-compatible client, giving agents direct DOM access without screenshots. The CLI covers everything else: scripts, CI pipelines, profile management, and any browser automation that doesn't need an AI in the loop.
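On the MCP side, a client invokes a server tool with a JSON-RPC `tools/call` request. The Python sketch below builds that request for the `open_session(profile="work")` call mentioned earlier. The envelope follows the MCP specification, but the exact argument schema of Pagerunner's tool is assumed from that one example, and the helper name is made up.

```python
import json


def tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Build the JSON-RPC 2.0 request an MCP client sends to invoke a tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    })


# Tool name and argument taken from the article's open_session example.
request = tool_call(1, "open_session", {"profile": "work"})
print(request)
```

In practice the MCP client (Claude Code, Cursor, and so on) builds and sends this for you; the sketch only shows how small the request is.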

Setup

Claude Code:

```bash
brew install enreign/tap/pagerunner
claude mcp add pagerunner -- pagerunner mcp
```

Cursor: add to .cursor/mcp.json:

```json
{
  "mcpServers": {
    "pagerunner": {
      "command": "pagerunner",
      "args": ["mcp"]
    }
  }
}
```

Windsurf: add to .windsurf/mcp.json:

```json
{
  "mcpServers": {
    "pagerunner": {
      "command": "pagerunner",
      "args": ["mcp"]
    }
  }
}
```

Once configured, create a profile and snapshot your auth state:

```bash
pagerunner init --profile work
# Chrome opens. Log into your sites, then:
pagerunner snapshot --profile work
```

From that point on, your agent has a fast, private, authenticated browser. DOM-first, with screenshots available when you actually need them.

Pagerunner is open-source (MIT) and available at github.com/Enreign/pagerunner.
