Research · 9 min read

Agents Just Need APIs

Benchmarking three ways to give AI agents web access.

Browser automation and ad-hoc web extraction have become the two primary ways agents access web data. Intuitively, structured APIs should offer agents a more ergonomic interface. To test that intuition, we measured how reliably, how quickly, and with how few tokens an agent completes a real web task when it uses a structured API instead of browser automation or ad-hoc scraping.

Eval design

We gave Claude Code (running Sonnet 4.6) the following task:

Find 3 round-trip flight options for 1 adult from New York City to San Francisco, departing 2026-06-16 and returning 2026-06-19. Report a total round-trip price in USD for each of the 3 options and at least one source URL per option.

The full prompt — with airport whitelists, per-segment field requirements, and a uniqueness constraint — can be found in the appendix.

Each Claude Code session was given one of three tool stacks and ran 5 times. We applied a cap of 80 turns per run to limit runaway sessions.
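
The run loop described above can be sketched roughly as follows. This is a minimal illustration, not the harness used for the benchmark: the file layout, the `run_stack` helper, and the per-stack tool config file are assumptions, and the CLI flags reflect our understanding of Claude Code's non-interactive mode and should be checked against the installed version.

```python
import subprocess
from pathlib import Path

MAX_TURNS = 80       # per-run cap from the eval design
RUNS_PER_STACK = 5   # each tool stack ran 5 times


def build_command(prompt: str, mcp_config: str) -> list:
    """Build one non-interactive Claude Code invocation (flags are assumptions)."""
    return [
        "claude",
        "-p", prompt,                    # print mode: run once and exit
        "--output-format", "json",       # emit a JSON result with run metadata
        "--max-turns", str(MAX_TURNS),   # stop runaway sessions
        "--mcp-config", mcp_config,      # hypothetical per-stack tool config file
    ]


def run_stack(stack: str, prompt: str, mcp_config: str, out_dir: Path) -> None:
    """Run one tool stack RUNS_PER_STACK times and save each raw JSON result."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for i in range(RUNS_PER_STACK):
        proc = subprocess.run(
            build_command(prompt, mcp_config),
            capture_output=True,
            text=True,
        )
        (out_dir / f"{stack}_run{i}.json").write_text(proc.stdout)
```

Keeping the command builder separate from the loop makes it easy to confirm that every stack gets the same prompt and the same turn cap, with only the tool configuration changing.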

Tool stacks

Note: we picked representative implementations for each category. However, our focus here is on comparing tool types, not particular providers.

Headline results

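| Modality | Success | Cost per run | Input tokens per run | Output tokens per run | Latency per run | Turns per run |
| --- | --- | --- | --- | --- | --- | --- |
| Browser automation | 0/5 | $1.53 | 3,727,282 | 13,089 | 457s (~8min) | 80* |
| Web search + extraction | 0/5 | $0.96 | 969,895 | 18,387 | 451s (~8min) | 33 |
| Structured API | 5/5 | $0.49 | 577,042 | 8,035 | 141s (~2min) | 15 |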

Per-run statistics represent the medians across runs. Averages are shown in the Appendix. * All browser automation runs hit the per-run turn limit.

Agents using browser automation were roughly 3.1× as expensive as those using a structured API and produced zero successful results. Web search + extraction agents came in around 1.9× the cost with the same zero-success result.
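
For readers who want to reproduce the multipliers, here is a small sketch that divides each modality's median cost by the structured-API median. The numbers are the rounded medians from the headline table; the helper name `ratio_to_baseline` is ours. With rounded inputs the web search + extraction ratio comes out slightly above the figure quoted above, which may simply reflect rounding in the published costs.

```python
# Median cost per run (USD), copied from the headline results table.
MEDIAN_COST_USD = {
    "Browser automation": 1.53,
    "Web search + extraction": 0.96,
    "Structured API": 0.49,
}


def ratio_to_baseline(values: dict, baseline: str) -> dict:
    """Express every entry as a multiple of the baseline entry."""
    base = values[baseline]
    return {name: value / base for name, value in values.items()}


cost_ratios = ratio_to_baseline(MEDIAN_COST_USD, "Structured API")
for name, ratio in cost_ratios.items():
    print(f"{name}: {ratio:.2f}x the structured-API cost")
```

The same helper works for any other column of the table, such as input tokens or latency.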

Cost represents model usage only, as estimated by Claude Code's total_cost_usd metadata. You can interpret this as: how expensive is it to use an agent with this tool. It does not include per-call API fees for the underlying tools (e.g., web search + extraction) — adding those would widen the gap further as failing runs also made many more tool calls.

Web search + extraction: clean execution, but imprecise results

All 5 runs completed cleanly within the per-run turn limit. The agents executed the canonical workflow (broad search → targeted extraction → synthesis) without obvious procedural errors. They wrote 5–13 distinct, well-refined queries per run and recovered quickly from dead ends. For example, when booking aggregators returned empty markdown (their fare grids render only after an interactive submission), the agents pivoted within 1–2 turns to schedule sites that publish fully rendered route tables. Every run produced real outbound and return flight numbers (e.g., B6 115/116, AA 179/234, and DL 363/668), pulled from schedule sources that publish JFK→SFO route data.

Zero runs produced specific, date-bound fares. With only web search + extraction tools, the agents could not interact with booking sites (e.g., Expedia) or flight-search sites (e.g., Google Flights). That left them summarizing seasonal averages from blogs and other webpages ("Jun range is typically $174–$524 one-way"), which are not bookable fares.

Browser automation: screenshot, screenshot, screenshot

All 5 browser-automation runs hit the 80-turn cap while consuming ~6.5× the median input tokens of the structured-API runs and ~3.8× those of the web search + extraction runs. In each attempt, the agent followed a consistent pattern: open a flight-search site with a pre-encoded deep link (Sonnet had strong priors here), then enter a screenshot → click → screenshot → fill → screenshot loop to drive the search UI.

Across all 5 runs, 34% of tool calls were screenshots, with the agent re-reading page state after every action. Another 14% were session management (named sessions, tab handling) as the agent tried to isolate one booking flow from another. Actions like clicks and form fills were under 20% of calls.
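
A breakdown like the one above can be computed by bucketing tool-call names from a trajectory. The sketch below is illustrative only: the tool names, the category mapping, and the sample call list are all hypothetical and do not come from the actual runs.

```python
from collections import Counter

# Hypothetical mapping from tool names to categories; real names depend on
# the browser-automation tool in use.
CATEGORY_BY_TOOL = {
    "screenshot": "screenshot",
    "click": "action",
    "fill": "action",
    "open_session": "session",
    "switch_tab": "session",
}


def category_shares(tool_calls: list) -> dict:
    """Return the fraction of calls per category; unknown tools go to 'other'."""
    counts = Counter(CATEGORY_BY_TOOL.get(name, "other") for name in tool_calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Made-up call sequence, used only to exercise the function.
sample_calls = [
    "open_session", "screenshot", "click", "screenshot",
    "fill", "screenshot", "navigate", "switch_tab",
]
shares = category_shares(sample_calls)
```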

The agents also ran into heavy anti-bot protections: all 5 runs were bot-blocked on at least one major site. Two of the five runs eventually navigated to ITA Matrix (the underlying flight-search engine that Google Flights uses), but reached it only after burning 30+ turns on the consumer-SPA frontend, leaving few turns in the budget to extract useful data.

Structured API: more reliable and more consistent

All 5 structured-API runs succeeded, with a median of 15 turns and $0.49 per run. Every run followed the same three-phase pattern:

Limitations and scope

A few limitations are worth flagging.

Appendix

Per-run averages

| Modality | Mean cost | Mean input tokens | Mean output tokens | Mean latency | Mean turns |
| --- | --- | --- | --- | --- | --- |
| Browser automation | $1.54 | 3,713,964 | 12,719 | 452s (~8min) | 80.0 |
| Web search + extraction | $0.99 | 1,139,183 | 18,571 | 410s (~7min) | 36.6 |
| Structured API | $0.65 | 831,070 | 11,508 | 216s (~4min) | 17.6 |

Full task prompt

Find 3 round-trip flight options for 1 adult from New York City (any of JFK, LGA, EWR) to San Francisco (SFO), departing 2026-06-16 and returning 2026-06-19, economy, 0–1 stops. Each option must be a distinct round-trip pairing. For each of the 6 segments, include airline, flight number, origin airport, destination airport, departure time (local), arrival time (local), number of stops, and price contribution if available. Also report a total round-trip price in USD for each of the 3 options and at least one source URL per option.
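
One way to read the prompt's requirements is as a small schema plus a few checks. The sketch below is our interpretation, not part of the eval: the class and field names are hypothetical, each direction of travel is treated as one segment (3 options × 2 directions = 6 segments), and distinctness is approximated by comparing flight-number pairs.

```python
from dataclasses import dataclass
from typing import Optional

NYC_AIRPORTS = {"JFK", "LGA", "EWR"}  # airport whitelist from the prompt
DESTINATION = "SFO"


@dataclass(frozen=True)
class Segment:
    airline: str
    flight_number: str
    origin: str
    destination: str
    departure_local: str
    arrival_local: str
    stops: int
    price_usd: Optional[float] = None  # "price contribution if available"


@dataclass(frozen=True)
class RoundTrip:
    outbound: Segment
    inbound: Segment
    total_price_usd: float
    source_urls: tuple


def validate(options: list) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if len(options) != 3:
        problems.append(f"expected 3 options, got {len(options)}")
    pairings = set()
    for i, opt in enumerate(options, start=1):
        out, back = opt.outbound, opt.inbound
        if out.origin not in NYC_AIRPORTS or out.destination != DESTINATION:
            problems.append(f"option {i}: outbound must be NYC -> SFO")
        if back.origin != DESTINATION or back.destination not in NYC_AIRPORTS:
            problems.append(f"option {i}: return must be SFO -> NYC")
        if any(seg.stops not in (0, 1) for seg in (out, back)):
            problems.append(f"option {i}: each direction must have 0-1 stops")
        if not opt.source_urls:
            problems.append(f"option {i}: at least one source URL is required")
        pairings.add((out.flight_number, back.flight_number))
    if len(pairings) != len(options):
        problems.append("round-trip pairings are not distinct")
    return problems
```

Structural checks like these cover only the mechanical parts of the task; whether a reported price is a real, date-bound fare still needs a reviewer.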

Review methodology

Programmatic metrics (cost, tokens, turns, wall time, per-tool counts) were extracted from each run's metadata (meta.json) and agent trajectory.
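
As a rough illustration of how per-run metadata can be rolled up into the medians and means reported here, consider the sketch below. The directory layout and the metric keys other than `total_cost_usd` are assumptions about what a `meta.json` might contain, and the sample values are toy numbers rather than benchmark data.

```python
import json
import statistics
from pathlib import Path

# Assumed metric keys; only total_cost_usd is named in the writeup.
METRICS = ("total_cost_usd", "num_turns", "duration_ms")


def load_runs(run_dir: Path) -> list:
    """Load every <run_dir>/<run>/meta.json (hypothetical layout)."""
    return [json.loads(p.read_text()) for p in sorted(run_dir.glob("*/meta.json"))]


def summarize(runs: list, metrics=METRICS) -> dict:
    """Compute the median and mean of each metric across runs."""
    summary = {}
    for metric in metrics:
        values = [run[metric] for run in runs]
        summary[metric] = {
            "median": statistics.median(values),
            "mean": statistics.fmean(values),
        }
    return summary


# Toy values: one expensive run pulls the mean above the median.
toy_runs = [{"total_cost_usd": c} for c in (1, 2, 3, 4, 10)]
toy_summary = summarize(toy_runs, metrics=("total_cost_usd",))
```

With only 5 runs per stack, a single outlier moves the mean noticeably while leaving the median unchanged, which is one reason to report both.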

Reviewer judgments (success/failure, rubric score, error type, per-option price specificity) came from 15 independent subagent reviews using Claude Sonnet 4.6, one per run. Each subagent received an identical prompt with the task spec, the success criteria, a rubric, a controlled error vocabulary, and the price-specificity rule (specific numeric or transparently-constructed sum → pass; range, fuzzy, or missing → fail). Reviewers did not see any heuristic flags from earlier passes, so the judgments were independent of prior labeling.

Per-run turn limit: if a run hit the 80-turn cap, its rubric score is 0 regardless of partial content captured before the cap.

Reviewer rubric: each of the 15 runs was scored against a rubric requiring three distinct round-trip pairings, all required per-segment fields, a total price in USD, and a source URL. The price-specificity rule required either a specific numeric total or a transparently-constructed sum (e.g., outbound one-way + return one-way explicitly summed); a price range, fuzzy estimate, or missing price meant failure on that option.
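
The price-specificity rule and the turn-cap rule can be approximated in code. The actual judgments came from LLM reviewers, so the sketch below is only a heuristic stand-in: the regular expressions, the list of fuzzy words, and the function names are our assumptions.

```python
import re

# A dollar amount such as $412, $1,204 or $412.30 (number captured as a group).
_NUM = r"\$\s?(\d[\d,]*(?:\.\d{1,2})?)"
_RANGE = re.compile(rf"{_NUM}\s*(?:-|–|to)\s*{_NUM}")
_SUM = re.compile(rf"{_NUM}\s*\+\s*{_NUM}\s*=\s*{_NUM}")
_SINGLE = re.compile(rf"^\s*{_NUM}\s*(?:USD)?\s*$")
_FUZZY = re.compile(r"\b(?:about|around|approx(?:imately)?|typically|roughly)\b|~", re.I)


def price_is_specific(text) -> bool:
    """Pass a specific amount or an explicit, correct sum; fail everything else."""
    if not text or not text.strip():
        return False                      # missing price
    if _RANGE.search(text) or _FUZZY.search(text):
        return False                      # range or fuzzy estimate
    match = _SUM.search(text)
    if match:
        a, b, total = (float(g.replace(",", "")) for g in match.groups())
        return abs(a + b - total) < 0.005  # the stated sum must add up
    return bool(_SINGLE.match(text))


def final_score(rubric_score: float, num_turns: int, turn_cap: int = 80) -> float:
    """A run that hits the turn cap scores 0 regardless of partial content."""
    return 0 if num_turns >= turn_cap else rubric_score
```

For instance, the seasonal range quoted earlier ("$174–$524 one-way") fails under this heuristic, while an explicit sum of two one-way fares passes as long as the arithmetic is correct.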

Transcripts, per-run reviews, and methodology artifacts are available on request.
