Browser automation and ad-hoc web extraction have become the two primary ways agents access web data. Intuitively, structured APIs should provide a more ergonomic interface for agents. To assess the validity of this intuition, we sought to quantify how much more ergonomic structured APIs are for agents — that is, how reliably, how quickly, and with how few tokens an agent can complete a real web task when using structured APIs instead of browser automation or ad-hoc scraping.
Eval design
We gave Claude Code (running Sonnet 4.6) the following task:
The full prompt — with airport whitelists, per-segment field requirements, and a uniqueness constraint — can be found in the appendix.
Each Claude Code session was given one of three tool stacks and ran 5 times. We applied a cap of 80 turns per run to limit runaway sessions.
Tool stacks
- Structured API — the agent-data CLI, backed by a structured flight-information API endpoint.
- Web search + extraction — Tavily search and extraction tools.
- Browser automation — Playwright CLI.
Note: we picked representative implementations for each category. However, our focus here is on comparing tool types, not particular providers.
Headline results
| Modality | Success | Cost per run | Input tokens per run | Output tokens per run | Latency per run | Turns per run |
|---|---|---|---|---|---|---|
| Browser automation | 0/5 | $1.53 | 3,727,282 | 13,089 | 457s (~8min) | 80* |
| Web search + extraction | 0/5 | $0.96 | 969,895 | 18,387 | 451s (~8min) | 33 |
| Structured API | 5/5 | $0.49 | 577,042 | 8,035 | 141s (~2min) | 15 |
Per-run statistics represent the medians across runs. Averages are shown in the Appendix. * All browser automation runs hit the per-run turn limit.
Agents using browser automation were roughly 3.1× as expensive as those using a structured API and produced zero successful results. Web search + extraction agents came in around 1.9× the cost with the same zero-success result.
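As a quick check, the multiples above can be recomputed from the rounded medians in the table. This is only a sketch: the dictionary keys and the `ratio` helper are names made up for illustration, and because the inputs are rounded, the results may differ slightly from the figures quoted in the text.

```python
# Median per-run figures copied from the headline results table.
MEDIANS = {
    "Browser automation": {"cost_usd": 1.53, "input_tokens": 3_727_282},
    "Web search + extraction": {"cost_usd": 0.96, "input_tokens": 969_895},
    "Structured API": {"cost_usd": 0.49, "input_tokens": 577_042},
}


def ratio(metric, numerator, denominator="Structured API"):
    """Return how many times larger `metric` is for one modality than another."""
    return MEDIANS[numerator][metric] / MEDIANS[denominator][metric]


for name in ("Browser automation", "Web search + extraction"):
    print(
        f"{name}: {ratio('cost_usd', name):.2f}x cost, "
        f"{ratio('input_tokens', name):.2f}x input tokens vs. structured API"
    )
```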
Cost per run comes from the total_cost_usd metadata. You can interpret this as how expensive it is to run an agent with this tool. It does not include per-call API fees for the underlying tools (e.g., web search + extraction); adding those would widen the gap further, as failing runs also made many more tool calls.
Web search + extraction: clean execution, but imprecise results
All 5 attempts completed cleanly within the per-run turn limit. The agents executed the canonical workflow (broad search → targeted extraction → synthesis) without obvious procedural errors. They wrote 5–13 distinct, well-refined queries per run and recovered quickly from dead ends. For example, when booking aggregators returned empty markdown (their fare grids render only after an interactive submission), the agents pivoted within 1–2 turns to schedule sites that publish fully rendered route tables. Every run produced real outbound and return flight numbers (e.g., B6 115/116, AA 179/234, and DL 363/668), pulled from schedule sources that publish JFK→SFO route data.
Zero runs produced specific, date-bound fares. With only web search + extraction tools, the agents could not interact with booking sites (e.g., Expedia) or flight-search sites (e.g., Google Flights), so they fell back on summarizing seasonal averages from blogs and other webpages ("Jun range is typically $174–$524 one-way"), which are not bookable fares.
Browser automation: screenshot, screenshot, screenshot
All 5 browser-automation runs hit the 80-turn cap while consuming roughly 6.5× as many tokens as agents using the structured API and roughly 3.8× as many as those using web search + extraction. In each attempt, the agent followed a consistent pattern: open a flight-search site with a pre-encoded deep link (Sonnet had strong priors here), then enter a screenshot → click → screenshot → fill → screenshot loop to drive the search UI.
Across all 5 runs, 34% of tool calls were screenshots, with the agent re-reading page state after every action. Another 14% were session management (named sessions, tab handling) as the agent tried to isolate one booking flow from another. Actions like clicks and form fills were under 20% of calls.
The agent also encountered heavy anti-bot protections: 5/5 runs were bot-blocked on at least one major site. Two of five runs eventually navigated to ITA Matrix (the underlying flight-search engine that Google Flights uses), but discovered it only after burning 30+ turns on the consumer-SPA frontend, leaving few turns in their budgets to extract useful data.
Structured API: more reliable and more consistent
All 5 structured-API runs succeeded, with a median of 15 turns and $0.49 per run. Every run followed the same three-phase pattern:
- Discovery (3 calls): search across available endpoints for "round trip flight" and read the docs for the relevant API endpoint. The docs return endpoint names, parameter schemas, and response shape in a single structured read.
- Data collection (10–20 calls): the API uses a submit-then-poll pattern. Notably, several runs issued parallel queries across NYC's three airports (JFK, EWR, LGA) when the first batch's return options were too similar — a task-aware, emergent behavior that's only cheap to do when each call is an API hit rather than a browser session.
- Synthesis: API responses contain offers[] arrays with structured outbound and return objects (carrier, flight number, times, stops, price), making synthesis trivial. No additional extraction step, DOM inference, or screenshot parsing was required.
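The data-collection phase above can be sketched as follows. This is a minimal illustration of submit-then-poll with parallel queries across the three NYC airports, not the agent-data CLI or its API: `FakeFlightAPI`, its `submit` and `poll` methods, the response fields, and every value in the offers are hypothetical placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class FakeFlightAPI:
    """In-memory stand-in for a submit-then-poll API (hypothetical)."""

    def __init__(self, polls_until_ready=2):
        self._jobs = {}
        self._polls_until_ready = polls_until_ready

    def submit(self, origin, destination):
        # Register a search job and return its id immediately.
        job_id = f"{origin}-{destination}"
        self._jobs[job_id] = {
            "polls_left": self._polls_until_ready,
            "origin": origin,
            "destination": destination,
        }
        return job_id

    def poll(self, job_id):
        # Report "pending" a few times, then return placeholder offers.
        job = self._jobs[job_id]
        if job["polls_left"] > 0:
            job["polls_left"] -= 1
            return {"status": "pending"}
        offer = {
            "outbound": {"origin": job["origin"], "destination": job["destination"]},
            "return": {"origin": job["destination"], "destination": job["origin"]},
            "price_usd": 0.0,  # placeholder, not a real fare
        }
        return {"status": "done", "offers": [offer]}


def search(api, origin, destination, interval_s=0.01, max_polls=10):
    """Submit one search, then poll until it finishes or the budget runs out."""
    job_id = api.submit(origin, destination)
    for _ in range(max_polls):
        result = api.poll(job_id)
        if result["status"] == "done":
            return result["offers"]
        time.sleep(interval_s)
    raise TimeoutError(f"{job_id} did not finish after {max_polls} polls")


def search_all(api, origins, destination):
    """Run one search per origin airport in parallel and flatten the offers."""
    with ThreadPoolExecutor(max_workers=len(origins)) as pool:
        batches = pool.map(lambda o: search(api, o, destination), origins)
        return [offer for batch in batches for offer in batch]


offers = search_all(FakeFlightAPI(), ["JFK", "EWR", "LGA"], "SFO")
print([o["outbound"]["origin"] for o in offers])
```

Because each result is already a structured object, widening the search to another airport costs one more submit and a few polls rather than another browser session.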
Limitations and scope
A few limitations are worth flagging.
- We tested one task within one domain (flight search), which may not be representative of all web-based work an agent might perform. That said, we expect our core finding — structured APIs beat browser automation and web search + extraction — to hold across tasks where critical data lives behind dynamic UIs.
- Browser automation failed in all cases, but increasing our per-run cap from 80 to 160 or 240 could lead to runs completing successfully. However, we expect the per-run cost to scale roughly linearly with that cap.
- We chose Claude Sonnet 4.6 as a representative model of what most teams would use for a workflow like this. We leave it to future work to explore how performance varies across other models and model families.
Appendix
Per-run averages
| Modality | Mean Cost | Mean Input Tokens | Mean Output Tokens | Mean Latency | Mean Turns |
|---|---|---|---|---|---|
| Browser automation | $1.54 | 3,713,964 | 12,719 | 452s (~8min) | 80.0 |
| Web search + extraction | $0.99 | 1,139,183 | 18,571 | 410s (~7min) | 36.6 |
| Structured API | $0.65 | 831,070 | 11,508 | 216s (~4min) | 17.6 |
Full task prompt
Review methodology
Programmatic metrics (cost, tokens, turns, wall time, per-tool counts) were extracted from each run's metadata (meta.json) and agent trajectory.
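A minimal sketch of how per-run cost could be aggregated is shown below. It assumes each run directory holds a meta.json with a total_cost_usd field; the directory layout, the `load_costs` helper, and the demo costs are all assumptions made for illustration, not the eval's actual files or values.

```python
import json
import statistics
import tempfile
from pathlib import Path


def load_costs(run_dirs):
    """Read total_cost_usd from each run directory's meta.json."""
    return [
        json.loads((Path(d) / "meta.json").read_text())["total_cost_usd"]
        for d in run_dirs
    ]


# Demo with placeholder costs (NOT the eval's actual per-run values).
with tempfile.TemporaryDirectory() as root:
    run_dirs = []
    for i, cost in enumerate([0.40, 0.50, 0.90]):
        run_dir = Path(root) / f"run_{i}"
        run_dir.mkdir()
        (run_dir / "meta.json").write_text(json.dumps({"total_cost_usd": cost}))
        run_dirs.append(run_dir)
    costs = load_costs(run_dirs)

median_cost = statistics.median(costs)
mean_cost = statistics.mean(costs)
print(f"median=${median_cost:.2f} mean=${mean_cost:.2f}")
```

With a small number of runs, a single expensive run pulls the mean above the median, which is one reason to report both.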
Reviewer judgments (success/failure, rubric score, error type, per-option price specificity) came from 15 independent subagent reviews using Claude Sonnet 4.6, one per run. Each subagent received an identical prompt with the task spec, the success criteria, a rubric, a controlled error vocabulary, and the price-specificity rule (specific numeric or transparently-constructed sum → pass; range, fuzzy, or missing → fail). Reviewers did not see any heuristic flags from earlier passes, so the judgments were independent of prior labeling.
Per-run turn limit: if a run hit the 80-turn cap, its rubric score is 0 regardless of partial content captured before the cap.
Reviewer rubric: each of the 15 runs was scored against a rubric requiring three distinct round-trip pairings, all required per-segment fields, a total price in USD, and a source URL. The price-specificity rule required either a specific numeric total or a transparently-constructed sum (e.g., outbound one-way + return one-way explicitly summed); a price range, fuzzy estimate, or missing price meant failure on that option.
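The rubric and the turn-cap rule can be approximated in code. The sketch below is a simplified pass/fail reading, not the reviewers' actual procedure (reviews were done by subagents following a prompt): the field names, the dict shape for a summed price, and the helper functions are hypothetical.

```python
# Hypothetical per-segment field names; the real list is in the task prompt.
REQUIRED_SEGMENT_FIELDS = ("carrier", "flight_number", "departure", "arrival")


def is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def price_is_specific(price):
    """Pass a single numeric total or an explicit sum of numeric parts."""
    if is_number(price):
        return True
    if isinstance(price, dict):  # e.g. outbound one-way + return one-way
        parts, total = price.get("parts", []), price.get("total")
        return (
            len(parts) >= 2
            and all(is_number(p) for p in parts)
            and is_number(total)
            and abs(sum(parts) - total) < 0.005
        )
    return False  # ranges, fuzzy estimates, and missing prices fail


def option_passes(option):
    """Check segment fields, price specificity, and source URL for one option."""
    segments_ok = all(
        option.get(leg, {}).get(field) not in (None, "")
        for leg in ("outbound", "return")
        for field in REQUIRED_SEGMENT_FIELDS
    )
    return (
        segments_ok
        and price_is_specific(option.get("total_price_usd"))
        and bool(option.get("source_url"))
    )


def run_passes(options, hit_turn_cap=False):
    """A run needs three distinct passing pairings and must not hit the cap."""
    if hit_turn_cap:
        return False
    pairings = {
        (o["outbound"]["flight_number"], o["return"]["flight_number"])
        for o in options
        if option_passes(o)
    }
    return len(pairings) >= 3


print(price_is_specific("$174–$524"))  # a range is not a specific price
print(price_is_specific({"parts": [150.0, 160.0], "total": 310.0}))
```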
Transcripts, per-run reviews, and methodology artifacts are available on request.