A founder I advise asked me last month how much faster her team would ship with AI coding agents. She'd seen the demos. An agent writing a full feature in minutes, generating tests, deploying to staging. She wanted a number. "Twice as fast? Three times?" I told her it depends entirely on how you decompose the work. She looked at me like I'd dodged the question. I hadn't. That answer is the most important thing I know about AI-assisted development timelines, and almost nobody talks about why.
The effectiveness curve nobody talks about
METR, a research organisation that evaluates AI model capabilities, has published some of the most rigorous data on how well AI agents perform on real software engineering tasks. The finding that matters most: agent effectiveness decays as task size increases. On small, well-scoped tasks (a single function with a clear spec), current agents achieve roughly ninety per cent effectiveness, completing the work at about nine-tenths the quality and speed of a competent human developer. On large tasks (multi-file changes, ambiguous requirements, integration work across several systems), effectiveness drops to around thirty per cent.
This isn't a minor detail. It's the core dynamic that explains why AI-assisted development feels miraculous on Monday and disappointing on Friday. Monday: you give the agent a small, clear task, and it delivers in minutes. Friday: you give it something larger and fuzzier, and three hours later you're debugging its output and rewriting the parts where its assumptions diverged from yours.
The decay curve is roughly logarithmic. Effectiveness drops steeply from small to medium tasks, then flattens between medium and large. What does that mean in practice? There's a sweet spot. Tasks big enough to be meaningful but small enough for the agent to handle well. Teams that consistently hit it ship faster. Teams that miss it spend just as long reviewing and correcting as they would have spent writing the code themselves.
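As a rough illustration, a logarithmic decay anchored at the two data points above (roughly ninety per cent on a small, hour-scale task; roughly thirty per cent on a large, week-scale one) might look like this. The curve shape and constants here are my own assumptions for illustration, not METR's published model:

```python
import math

def agent_effectiveness(task_hours: float) -> float:
    """Illustrative model: effectiveness decays roughly
    logarithmically with task size. Calibrated so a ~1-hour task
    scores ~0.9 and a ~40-hour task scores ~0.3. The slope
    (about 0.113 per doubling of task size) is an assumption."""
    eff = 0.9 - 0.113 * math.log2(max(task_hours, 1.0))
    return max(0.0, min(1.0, eff))

print(round(agent_effectiveness(1), 2))   # small, well-scoped task -> 0.9
print(round(agent_effectiveness(40), 2))  # large multi-file task -> 0.3
```

The practical point of a curve like this is the flat tail: past a certain size, making the task bigger barely changes how little the agent helps, which is what makes splitting large tasks so valuable.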
Why traditional estimation breaks
The standard approach to estimating development work assumes a single actor: a human developer with predictable throughput. You estimate how long a task would take a competent person, add a buffer for unknowns, and commit to a timeline. This model is imperfect even without AI, but it has a well-understood failure mode (things take longer than expected) and teams have built intuitions for how much buffer to add.
AI-assisted work breaks this model. The actor isn't one person. It's a person-plus-agent system with variable effectiveness. A task the agent handles well might take a quarter of the traditional estimate. A task it handles poorly might take the same time as before, or longer, because the developer now has to understand and debug code they didn't write. The variance is enormous, and it's not random. It correlates systematically with task characteristics (size, ambiguity, domain complexity) that are knowable in advance.
So teams using AI agents consistently mis-estimate in both directions. Small tasks get overestimated ("this will take a day" becomes "the agent did it in two hours"). Large tasks get underestimated ("the agent should speed this up" doesn't account for the review and rework time). The net effect on a project timeline can be surprisingly close to zero.
Time saved on small tasks gets eaten by time lost on large ones. The net improvement depends almost entirely on how you structure the work.
The decomposition premium
This is why my answer to the founder wasn't a number but a strategy. The single most valuable thing a team can do to ship faster with AI is decompose work into smaller units. Decomposition isn't a new idea. It's always been good practice. But the economics have changed. The gap in agent effectiveness between a small task and a large task is so wide that splitting one large task into five small ones can save time even after accounting for the overhead of splitting.
Consider a feature that a human developer would estimate at five days. Give it to an agent as a single task and you might see thirty per cent effectiveness. The agent produces a rough draft; the developer spends three to four days reviewing and completing the work. Total: maybe four days. Modest.
Now decompose that same feature into eight well-scoped subtasks, each half a day for a human. The agent might hit eighty to ninety per cent effectiveness on each one. The developer spends an hour or two reviewing and finishing each subtask instead of half a day. Total: maybe two to two and a half days, including the decomposition work. That improvement only materialises because the work was structured to play to the agent's strengths.
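The arithmetic in the two scenarios above can be sketched as follows. The effectiveness figures come from the text; the review-overhead constant and the simple completion model are my own assumptions:

```python
def human_days_remaining(human_estimate_days: float,
                         effectiveness: float,
                         review_overhead: float = 0.3) -> float:
    """Rough model: the agent completes `effectiveness` of the work;
    the developer finishes the remainder and also spends a review
    overhead on the agent's portion. Constants are illustrative."""
    finishing = human_estimate_days * (1 - effectiveness)
    review = human_estimate_days * effectiveness * review_overhead
    return finishing + review

# One monolithic 5-day task at ~30% effectiveness
monolithic = human_days_remaining(5.0, 0.30)

# Eight half-day subtasks at ~85% effectiveness, plus roughly
# half a day spent doing the decomposition itself
decomposed = 8 * human_days_remaining(0.5, 0.85) + 0.5

print(round(monolithic, 2))  # ~3.95 days: the "maybe four days" case
print(round(decomposed, 2))  # ~2.12 days: the "two to two and a half" case
```

The numbers are toy, but the structure of the gap is the point: decomposition pays even after you charge the splitting overhead against it.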
I've tracked this pattern across several projects. The correlation between decomposition granularity and time savings is the strongest predictor I've found. Teams that throw large, loosely specified tasks at agents see marginal gains. Teams that invest in breaking work into clear, bounded pieces see real ones.
A practical mental model
Look, I don't think the world needs another estimation framework. What it needs is a way of thinking about AI-assisted work that helps you make better calls about timelines. No spreadsheet required.
Here's the model I use. For each task, I ask four questions:
How big is the scope? A single function with a clear spec is small. A multi-file change with integration requirements is large. An architectural refactor touching the data model is extra-large. The agent's contribution will be roughly inverse to this scale.
How fuzzy are the requirements? If they're precise ("write a function that takes X and returns Y"), the agent will do well. If they're vague ("improve the onboarding flow"), the agent will produce something, but you'll spend real time steering it toward what you actually want. Ambiguity is the single largest source of rework in AI-assisted development.
How domain-specific is the logic? Agents do well on common patterns: CRUD operations, API wrappers, test generation. They do poorly on domain-specific logic that requires understanding business rules or institutional conventions not documented in the codebase. If the task is domain-heavy, budget for human time regardless.
Can you verify the output quickly? This one's easy to overlook. An agent might produce correct-looking code that's subtly wrong in ways you won't catch without careful review. If verification is cheap (run the tests, check the output), the agent is a good bet. If verification means reading every line and reasoning about edge cases, factor that cost in.
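The four questions can be turned into a quick rubric. The scoring scheme and thresholds below are my own illustrative assumptions, not a calibrated instrument; the point is only that the four dimensions combine into a single go/no-go judgement:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Each dimension scored 0 (favourable for the agent)
    to 2 (unfavourable). Scoring is an illustrative assumption."""
    scope: int          # 0 = single function, 2 = architectural change
    ambiguity: int      # 0 = precise spec, 2 = vague goal
    domain_weight: int  # 0 = common pattern, 2 = business-rule heavy
    verify_cost: int    # 0 = run the tests, 2 = line-by-line review

def agent_outlook(task: Task) -> str:
    score = task.scope + task.ambiguity + task.domain_weight + task.verify_cost
    if score <= 2:
        return "good bet: hand it to the agent"
    if score <= 5:
        return "agent helps, budget human finishing time"
    return "agent output likely needs heavy rework"

print(agent_outlook(Task(0, 0, 1, 0)))  # small, clear, verifiable
print(agent_outlook(Task(2, 2, 2, 2)))  # large, fuzzy, domain-heavy
```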
For each task, I estimate an optimistic scenario (the agent handles it well), a likely one (the agent helps but needs human finishing), and a pessimistic one (the agent's output isn't useful). The weighted average, leaning toward the likely case, gives me a number I can commit to. It's not scientific. But it's been more accurate than any formula I've tried.
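A minimal sketch of that weighting, in the spirit of a three-point (PERT-style) estimate. The 1-4-1 weights and the example durations are hypothetical; the text doesn't prescribe exact numbers, only that the average should lean toward the likely case:

```python
def weighted_estimate(optimistic: float, likely: float, pessimistic: float,
                      weights: tuple = (1, 4, 1)) -> float:
    """Weighted average of three scenarios, leaning toward the likely
    case. The 1-4-1 weighting is an assumption, not from the text."""
    w_opt, w_lik, w_pes = weights
    total = w_opt + w_lik + w_pes
    return (optimistic * w_opt + likely * w_lik + pessimistic * w_pes) / total

# Hypothetical task: agent nails it in 0.5 days, likely needs 1.5 days
# with human finishing, or 3 days if the agent's output isn't usable.
print(round(weighted_estimate(0.5, 1.5, 3.0), 2))  # ~1.58 days
```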
What I tell founders
When founders ask me "how much faster will we ship with AI?", I give a more specific answer than "it depends." On a well-decomposed project with clear specs, expect a thirty to fifty per cent reduction in development time. On a poorly decomposed project with ambiguous requirements, expect roughly zero net change. The time you save on easy parts gets consumed by the hard parts. The difference isn't the AI. It's the structure of the work.
Most founders don't want to hear that. They want a multiplier. 2x, 3x, 10x. Those multipliers exist, but only for specific types of work: boilerplate generation, test coverage, data migration scripts, API integration layers. For the core product work — the features that define the company, the logic that embodies the business rules — AI is a powerful assistant, not a replacement for human judgement. Teams that understand this plan better and ship better than the ones chasing the 10x headline.
So how long does AI-assisted work actually take? Less than it used to, if you structure work around what agents do well. The same amount, or more, if you don't. The tool has changed. The discipline hasn't.