Backlink: [[2026-05-21]]
> Source: ChatGPT Pro chat.
# Agentic Compute on Object Storage
*The agentic-compute equivalent of the turbopuffer idea: durable, cache-native agent execution for high-volume AI workflows.*
Yes. There is a solid opportunity, but the sharp version is **not** “cheaper Kubernetes for agents.”
> The agentic-compute equivalent of turbopuffer is a durable, cache-native agent runtime where object storage is the source of truth, hot compute is only used while an agent is actively doing work, and high-QPS workspaces can be pinned for predictable latency.
In turbopuffer, the insight is that search data is huge, multi-tenant, and unevenly accessed, so keeping everything in RAM/replicated SSD is wasteful. In agentic compute, the equivalent insight is that agent sessions are **long-running, stateful, bursty, and mostly waiting**. Keeping every agent’s sandbox, browser, repo checkout, memory, and context hot is often as wasteful as keeping every vector index in RAM.
The strongest product wedge is **not just runtime billing**. Tokens still dominate many agent tasks. The wedge is the combination of:
1. Pause/resume agent execution from durable object-storage state.
2. Cache repeated context, tool results, repo state, browser profiles, package layers, and model prompt prefixes.
3. Bill hot compute only while useful work is happening.
4. Pin hot workspaces/orgs/repos/browsers when customers need predictable p95 latency.
5. Route each step to the cheapest adequate model/runtime.
This is a real opportunity because adoption is early but moving fast. The important point is that we are still in the “costs are constraining product ambition” stage. The hard truth: **there is probably not a universal 100x cost reduction in total agent cost**, because LLM tokens often dominate. But there is a credible **3–20x reduction in runtime waste** for idle-heavy agents, and a **20–60% total-cost reduction** for many production agents when runtime, prompt caching, model routing, and checkpointing are combined.
## First-Principles Economics
An agent task has five cost buckets:
```text
C_agent = C_tokens + C_sandbox + C_browser + C_state + C_orchestration
```
Most people obsess over `C_tokens`, but production agents also spend money on sandboxes, browsers, long-lived sessions, retries, logs, screenshots, data fetches, repo clones, package installs, network calls, and idle time.
### 1. Token cost
For a coding or research agent using a frontier model, token cost can dominate.
Example: a 20-minute coding agent task using 150k input tokens and 30k output tokens at $3 / million input tokens and $15 / million output tokens costs:
```text
150k × $3/M + 30k × $15/M = $0.45 + $0.45 = $0.90
```
That is before sandbox, browser, logs, storage, orchestration, and retries.
Prompt caching is one of the biggest levers. If 80% of that 150k-token input is repeatable context and cached input costs 10% of normal input, the input cost can fall from $0.45 to roughly $0.126 before accounting for cache-write amortization.
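The token arithmetic above can be reproduced with a small helper. This is a minimal sketch: the function name and parameters are illustrative, the prices are the example figures from the text rather than any provider's actual rates, and cache-write costs are ignored.

```python
def token_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m,
               cached_fraction=0.0, cached_price_ratio=1.0):
    """Return (input_cost, output_cost) in dollars for one agent task.

    cached_fraction: share of input tokens served from the prompt cache.
    cached_price_ratio: cached-input price as a fraction of the normal input price.
    Cache-write amortization is deliberately left out, as in the text.
    """
    cached = input_tokens * cached_fraction
    uncached = input_tokens - cached
    billable_input = uncached + cached * cached_price_ratio
    input_cost = billable_input * input_price_per_m / 1_000_000
    output_cost = output_tokens * output_price_per_m / 1_000_000
    return input_cost, output_cost


# The 20-minute coding task from the text: 150k input, 30k output tokens.
baseline = token_cost(150_000, 30_000, 3.0, 15.0)
with_cache = token_cost(150_000, 30_000, 3.0, 15.0,
                        cached_fraction=0.8, cached_price_ratio=0.1)

print(f"baseline:   input ${baseline[0]:.3f}, output ${baseline[1]:.3f}")
print(f"with cache: input ${with_cache[0]:.3f}, output ${with_cache[1]:.3f}")
```

Output cost is unchanged by caching, which is why the lever only matters when repeatable input context is a large share of the bill.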
### 2. Sandbox/runtime cost
Runtime is cheaper than frontier tokens on many tasks, but it becomes painful at scale or when agents wait.
A 2-vCPU, 4GB task kept alive for one hour at representative serverless-container prices might cost roughly:
```text
2 × 3600 × 0.000011244 + 4 × 3600 × 0.000001235 ≈ $0.099/hour
```
A 2-vCPU, 4GB sandbox at representative agent-sandbox pricing might cost roughly:
```text
3600 × (0.000028 + 4 × 0.0000045) ≈ $0.166/hour
```
The key variable is utilization:
```text
Utilization = useful active compute seconds / wall-clock session seconds
```
If an agent runs for 60 minutes but only actively uses CPU/browser/tools for 4 minutes, keeping the sandbox hot causes a 15x wall-clock tax before even considering cold starts and retries.
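The hourly figures and the utilization ratio can be checked with a few lines. This is a sketch using the representative per-second prices quoted above. Reading the 0.000028 sandbox figure as the two-vCPU total (0.000014 per vCPU-second) is an assumption made here so that both examples fit one formula.

```python
def hourly_rate(vcpus, mem_gb, price_per_vcpu_s, price_per_gb_s):
    """Dollars per hour for a sandbox billed per vCPU-second and per GB-second."""
    return 3600 * (vcpus * price_per_vcpu_s + mem_gb * price_per_gb_s)


def idle_tax(active_seconds, wall_clock_seconds):
    """Return (utilization, multiplier paid when billing by wall clock)."""
    utilization = active_seconds / wall_clock_seconds
    return utilization, 1 / utilization


serverless = hourly_rate(2, 4, 0.000011244, 0.000001235)
# Assumption: 0.000028/s in the text covers both vCPUs, so 0.000014 per vCPU.
agent_sandbox = hourly_rate(2, 4, 0.000014, 0.0000045)
utilization, multiplier = idle_tax(active_seconds=4 * 60, wall_clock_seconds=60 * 60)

print(f"serverless container: ${serverless:.3f}/hour")
print(f"agent sandbox:        ${agent_sandbox:.3f}/hour")
print(f"utilization {utilization:.1%}, wall-clock multiplier {multiplier:.0f}x")
```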
### 3. Browser cost
Browser agents are even more obviously bursty. A human-like browser agent might spend most of its wall-clock time waiting for pages, auth, rate limits, APIs, or model responses.
A durable runtime can checkpoint the browser profile/session state and only keep browsers hot for active interaction windows. For high-QPS customers, it can pin browser pools for predictable latency, exactly like pinning hot search namespaces.
### 4. Session/runtime pricing is already emerging
The market is already validating that “agent runtime” is its own billable primitive. Model providers are beginning to price containers, managed agents, and session runtime separately from tokens. That cuts both ways: it confirms the opportunity exists, and it means the model providers are also competitors.
### 5. The core arbitrage
> Replace wall-clock hot compute with durable cold state + warm cache + pinned hot capacity only where needed.
For a light back-office agent:
- 60-minute wall-clock task
- 5 minutes of active sandbox/browser work
- 20k input tokens, 5k output tokens on a cheap model
- 2-vCPU, 4GB sandbox
Naive wall-clock sandbox cost:
```text
60min sandbox ≈ $0.166
```
Token cost on a cheap model at $1/M input and $5/M output:
```text
20k × $1/M + 5k × $5/M = $0.045
```
Total naive cost:
```text
$0.166 + $0.045 = $0.211/task
```
If runtime is active for only 5 minutes:
```text
5min sandbox ≈ $0.014
```
Total optimized cost before orchestration/storage:
```text
$0.014 + $0.045 = $0.059/task
```
That is roughly a **3.6x total-cost reduction**. At 1M tasks/month, the difference is about **$152k/month**. At 10M tasks/month, it is about **$1.5M/month**. That is a company.
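The back-office example can be recomputed end to end. This is a sketch using only the figures from the text ($0.166/hour sandbox, $1/M input, $5/M output). The function and constant names are illustrative, and orchestration and storage costs are ignored as above.

```python
SANDBOX_PER_HOUR = 0.166  # representative 2-vCPU, 4GB sandbox rate from the text


def task_cost(billed_minutes, input_tokens, output_tokens,
              in_price_per_m=1.0, out_price_per_m=5.0):
    """Per-task cost in dollars: billed sandbox time plus model tokens."""
    runtime = SANDBOX_PER_HOUR * billed_minutes / 60
    tokens = (input_tokens * in_price_per_m + output_tokens * out_price_per_m) / 1_000_000
    return runtime + tokens


naive = task_cost(billed_minutes=60, input_tokens=20_000, output_tokens=5_000)
optimized = task_cost(billed_minutes=5, input_tokens=20_000, output_tokens=5_000)
saving_per_task = naive - optimized

print(f"naive ${naive:.3f}, optimized ${optimized:.3f}, ratio {naive / optimized:.1f}x")
for tasks_per_month in (1_000_000, 10_000_000):
    print(f"{tasks_per_month:>10,} tasks/month: ${saving_per_task * tasks_per_month:,.0f} saved")
```

Swapping in the coding-agent token counts shows how quickly the ratio shrinks once tokens dominate the bill.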
For heavy coding agents, runtime savings alone are smaller because tokens dominate. But prompt caching, repo-state caching, package-layer caching, test-result caching, model routing, and retry avoidance can still cut total cost materially.
## Blog Post Draft
## Agentic Compute on Object Storage
In late 2025, I was helping a team scale an internal coding-agent deployment. The prototype was magical. Engineers could hand an issue to an agent, watch it explore the repo, edit files, run tests, and open a pull request.
Then the bill arrived.
The model bill was high, but that part was expected. What surprised us was everything around the model: sandboxes sitting idle, browsers waiting on pages, repo clones repeated thousands of times, package installs rerun for the same workspaces, giant prompts resent on every step, and failed agent runs that had to start over from scratch.
The agent was “working” for 25 minutes. But it was only actively computing for a few minutes. The rest of the time it was waiting: on the model, on the network, on tests, on package managers, on web pages, on human review, or on another system.
We were paying hot-compute prices for cold state.
That felt wrong.
Search had a similar problem. The previous generation of vector databases assumed most data needed to live on expensive hot storage. turbopuffer’s insight was that search indexes are huge, multi-tenant, and unevenly accessed. Keep the source of truth in object storage. Cache what is hot. Make cold queries acceptable and warm queries fast.
Agentic compute has the same shape, but with time instead of data.
Most agents are not hot all the time. Most customer workspaces are not active all the time. Most browser sessions are not clicking all the time. Most sandboxes are not running code all the time. Most prompt context is repeated. Most tool results do not need to be recomputed.
So why are we building agent infrastructure as if every agent is a tiny always-on VM?
## The Five Common Computes
Companies usually start with a web app and a relational database. Then, as workloads grow, they pull pieces into specialized systems: queues, caches, warehouses, search engines, stream processors.
Compute is going through the same specialization.
| Category | Examples | Strength | Weakness |
| --- | --- | --- | --- |
| Stateless serverless | Lambda, Workers | Cheap bursty functions | Weak for long-lived stateful sessions |
| Containers/VMs | ECS, Fargate, Kubernetes, Fly | General-purpose execution | Often billed by wall-clock allocation |
| Workflow engines | Temporal, Step Functions | Durable orchestration | Not a full agent execution environment |
| GPU/model APIs | OpenAI, Anthropic, Together, Modal | Dense model inference | Expensive, token-centric, no durable agent state |
| Agent sandboxes/browsers | E2B, Browserbase, managed agent runtimes | Safe tool/browser execution | Still early; runtime, state, and cache layers are fragmented |
Agents do not fit cleanly into any one of these.
An agent is part workflow, part model client, part browser, part sandbox, part file system, part memory system, part audit log, and part distributed scheduler.
It needs low-latency interaction when a human is watching. It needs durable state when no one is watching. It needs untrusted-code isolation. It needs browser identity. It needs model routing. It needs prompt caching. It needs replay. It needs cost controls. It needs per-tenant fairness. And it needs all of this at a price low enough that companies can run agents on every workflow, not just the most valuable ones.
That is a different kind of compute.
## The Agentic Workload Is Mostly Waiting
A typical agent loop looks like this:
```text
think -> call tool -> wait -> observe -> think -> edit -> test -> wait -> retry -> summarize
```
Only some of those steps use local CPU. Some use remote model inference. Some use a browser. Some wait on network. Some wait on a human. Some wait on a queue.
The naive architecture gives each agent a container or VM and keeps it alive for the whole wall-clock session.
That is easy to reason about, but economically wrong.
The useful unit is not “session minutes.” The useful unit is:
```text
active tool seconds
+ active browser seconds
+ model tokens
+ durable state bytes
+ orchestration events
```
Everything else is idle tax.
If an agent runs for one hour but only uses CPU for four minutes, wall-clock billing turns a 4-minute workload into a 60-minute workload. That is a 15x tax on the runtime portion of cost.
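One way to picture metering by active units rather than session minutes is a meter that only accrues time while a tool or browser is actually in use. This is a hypothetical sketch: `UsageMeter` and its fields are invented for illustration, and the injectable clock exists only to keep the example deterministic.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class UsageMeter:
    """Accumulates billable units for one agent run (illustrative only)."""
    active_tool_seconds: float = 0.0
    active_browser_seconds: float = 0.0
    model_tokens: int = 0
    durable_state_bytes: int = 0
    orchestration_events: int = 0

    @contextmanager
    def active(self, kind, clock=time.monotonic):
        """Bill elapsed time only for the duration of the with-block."""
        if kind not in ("tool", "browser"):
            raise ValueError(f"unknown activity kind: {kind}")
        start = clock()
        try:
            yield
        finally:
            elapsed = clock() - start
            if kind == "tool":
                self.active_tool_seconds += elapsed
            else:
                self.active_browser_seconds += elapsed


# Fake clock so the example is deterministic: each call returns the next tick.
ticks = iter([0.0, 2.5, 100.0, 101.0])
meter = UsageMeter()
with meter.active("tool", clock=lambda: next(ticks)):
    pass  # run tests, apply an edit, and so on
# The 97.5 seconds of waiting between the two blocks is never billed.
with meter.active("browser", clock=lambda: next(ticks)):
    pass  # click, type, read the page
meter.model_tokens += 1_200
meter.orchestration_events += 2
print(meter)
```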
Runtime is not always the biggest line item. For deep coding agents, model tokens can dominate. But for browser agents, support agents, research agents, back-office agents, and millions of low-value automation tasks, runtime waste can decide whether the product exists.
## First-Principles Agent Costs
An agent run costs:
```text
tokens
+ sandbox CPU/memory
+ browser time
+ state storage
+ logs/traces/screenshots
+ orchestration
+ retries
```
Token cost is obvious because every provider exposes it. Runtime cost is more subtle because it hides in the shape of execution.
A 20-minute coding agent that uses 150k input tokens and 30k output tokens on a frontier model might spend around a dollar on tokens. Cutting sandbox cost from five cents to one cent is nice, but not transformational.
But the same architecture applied to a high-volume support or browser workflow looks very different.
Suppose an agent spends one hour of wall-clock time completing a task, but only five minutes actively using its sandbox and browser. If you keep the environment hot for the full hour, you pay for the hour. If you checkpoint the environment and resume only during active work, you pay for five minutes plus cheap durable state.
That is the agentic-compute equivalent of moving cold vectors out of RAM.
## Object Storage Native Agent Runtime
The architecture we want is not “serverless functions for agents.” It is an object-storage-native agent runtime.
```text
                    ╔═ agent runtime ═══════════════════════════╗
╔════════════╗      ║                                           ║
║   client   ║ API  ║  ┏━━━━━━━━━━━━━━┓     ┏━━━━━━━━━━━━━━━┓   ║
║  product   ║─────▶║  ┃ Hot compute  ┃────▶┃ Object storage┃   ║
║   human    ║      ║  ┃ cache/pool   ┃     ┃ source truth  ┃   ║
╚════════════╝      ║  ┗━━━━━━━━━━━━━━┛     ┗━━━━━━━━━━━━━━━┛   ║
                    ║         ▲                     ▲           ║
                    ║         │                     │           ║
                    ║  pinned workspaces    event log, fs,      ║
                    ║  hot browsers         traces, snapshots   ║
                    ╚═══════════════════════════════════════════╝
```
The source of truth is not the running container. The source of truth is an append-only agent event log, file-system snapshots, browser/session state, tool results, traces, and cost records in durable storage.
Hot compute is a cache.
If a node dies, the agent does not die. Another node loads the agent continuation, pulls the hot layers it needs, and resumes. If a workspace is queried constantly, pin it. If it is idle, evict it. If a customer needs predictable p95 latency, sell pinned capacity. If they care more about cost, let the runtime cold-resume.
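The "hot compute is a cache" idea can be sketched with an append-only event log that any node can replay. Everything here is hypothetical: `ObjectStore` is an in-memory stand-in for a real object store, and the key layout and event types are invented for illustration.

```python
import json


class ObjectStore:
    """In-memory stand-in for durable object storage (assumption, not a real API)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]

    def list(self, prefix):
        # Zero-padded sequence numbers make lexicographic order match log order.
        return sorted(k for k in self._objects if k.startswith(prefix))


def append_event(store, agent_id, seq, event):
    """Write one immutable event object; existing events are never rewritten."""
    key = f"agents/{agent_id}/events/{seq:012d}.json"
    store.put(key, json.dumps(event).encode())


def resume(store, agent_id):
    """Rebuild agent state on any node by replaying the durable event log."""
    state = {"steps": 0, "files": {}, "done": False}
    for key in store.list(f"agents/{agent_id}/events/"):
        event = json.loads(store.get(key))
        if event["type"] == "file_written":
            state["files"][event["path"]] = event["content_hash"]
        elif event["type"] == "finished":
            state["done"] = True
        state["steps"] += 1
    return state


store = ObjectStore()
append_event(store, "agent-1", 0, {"type": "tool_called", "tool": "run_tests"})
append_event(store, "agent-1", 1, {"type": "file_written", "path": "app.py",
                                   "content_hash": "abc123"})
# The original node dies here; a different node reconstructs the same state.
recovered = resume(store, "agent-1")
print(recovered)
```

A production design would presumably add periodic snapshots so that resume cost does not grow with log length. That part is omitted here.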
This is not just cheaper. It is a better abstraction.
## Pin Workspaces
Search systems have namespaces. Agents have workspaces.
A workspace might be:
- a repo
- a customer tenant
- a browser identity
- a CRM account
- a data room
- a long-running research project
- a customer-support queue
- a user’s personal automation environment
Most workspaces are cold most of the time. A few are hot.
So the runtime should expose two modes:
```text
default workspace: cheap, cold-resumable, best-effort warm latency
pinned workspace: reserved hot cache, predictable latency, predictable concurrency
```
This maps cost to value.
A startup running occasional agents should not pay to keep everything hot. A large customer running thousands of agents per minute on the same repo, browser identity, or tenant should be able to pin that workspace and get consistent latency.
Pinning is the agentic equivalent of keeping a hot search namespace in cache.
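The two modes can be modeled as an LRU cache in which pinned workspaces are exempt from eviction. This is a minimal sketch: `WorkspaceCache`, its methods, and the `cold_load` callback are hypothetical names, and a real runtime would also bound pinned capacity per tenant.

```python
from collections import OrderedDict


class WorkspaceCache:
    """LRU cache of hot workspaces; pinned workspaces are never evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._hot = OrderedDict()  # workspace_id -> loaded state, least recent first
        self._pinned = set()

    def pin(self, workspace_id):
        self._pinned.add(workspace_id)

    def get(self, workspace_id, cold_load):
        """Return (state, 'warm' | 'cold'); cold_load restores from durable storage."""
        if workspace_id in self._hot:
            self._hot.move_to_end(workspace_id)
            return self._hot[workspace_id], "warm"
        state = cold_load(workspace_id)
        self._hot[workspace_id] = state
        self._evict()
        return state, "cold"

    def _evict(self):
        # Walk from least to most recently used, skipping pinned entries.
        for workspace_id in list(self._hot):
            if len(self._hot) <= self.capacity:
                break
            if workspace_id not in self._pinned:
                del self._hot[workspace_id]


def load_from_object_storage(workspace_id):
    return {"workspace": workspace_id}  # placeholder for a real cold resume


cache = WorkspaceCache(capacity=2)
cache.pin("repo-a")
for workspace_id in ("repo-a", "repo-b", "repo-c"):
    cache.get(workspace_id, load_from_object_storage)

# repo-a is the least recently used entry but stays hot because it is pinned.
print(cache.get("repo-a", load_from_object_storage)[1])
print(cache.get("repo-b", load_from_object_storage)[1])
```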
## The Caches That Matter
A serious agent runtime needs more than container snapshots.
It needs several caches working together:
**Prompt prefix cache.** System prompts, tool definitions, policy text, repo maps, schema descriptions, and long-lived instructions should not be paid for at full price every turn.
**Workspace cache.** Repo clones, dependency installs, generated indexes, build artifacts, and test caches should survive across agent runs.
**Browser cache.** Login state, cookies, profiles, device fingerprints, page sessions, and screenshots should be durable and resumable.
**Tool-result cache.** Fetching the same docs, APIs, tickets, emails, PDFs, or database rows repeatedly is wasteful. Many tool calls are pure enough to cache with careful invalidation.
**Model-routing cache.** Not every step deserves the frontier model. Planning might. Formatting does not. Extraction often does not. Verification sometimes does.
**Failure cache.** When an agent fails, the runtime should know exactly what happened, what state was valid, and where to resume. Restarting from zero is the most expensive retry policy.
The winning system will not merely run agents. It will make repeated agent work cheaper every time.
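As one concrete case, a tool-result cache can key on a hash of the tool name and its arguments, with a time-to-live as the simplest form of invalidation. This is a sketch under those assumptions: `ToolResultCache` and `fetch_doc` are hypothetical, and real invalidation would usually need more than a TTL.

```python
import hashlib
import json
import time


class ToolResultCache:
    """Caches results of pure-enough tool calls for a limited time."""

    def __init__(self, clock=time.monotonic):
        self._entries = {}  # key -> (expires_at, result)
        self._clock = clock

    @staticmethod
    def key(tool, args):
        # sort_keys makes the key independent of argument order.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, tool, args, fn, ttl_seconds):
        """Return (result, was_cache_hit); fn runs only on a miss or expiry."""
        k = self.key(tool, args)
        now = self._clock()
        entry = self._entries.get(k)
        if entry is not None and now < entry[0]:
            return entry[1], True
        result = fn(**args)
        self._entries[k] = (now + ttl_seconds, result)
        return result, False


fetch_count = 0


def fetch_doc(url):
    """Stand-in for an expensive fetch; counts how often it really runs."""
    global fetch_count
    fetch_count += 1
    return f"contents of {url}"


fake_now = [0.0]  # mutable fake clock so the example is deterministic
cache = ToolResultCache(clock=lambda: fake_now[0])
args = {"url": "https://example.com/docs"}

_, first_hit = cache.call("fetch_doc", args, fetch_doc, ttl_seconds=60)
_, second_hit = cache.call("fetch_doc", args, fetch_doc, ttl_seconds=60)
fake_now[0] = 61.0  # the entry has expired by now
_, third_hit = cache.call("fetch_doc", args, fetch_doc, ttl_seconds=60)
print(first_hit, second_hit, third_hit, fetch_count)
```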
## Why This Is Hard
This is not a weekend wrapper around Docker.
There are real problems:
- secure isolation for untrusted model-written code
- snapshotting file systems without huge write amplification
- browser identity and anti-abuse complexity
- secrets handling
- SOC 2 / HIPAA / enterprise audit trails
- noisy-neighbor isolation
- per-tenant budgets
- model-provider rate limits
- prompt-cache locality
- replay determinism
- human approval flows
- long-tail package installs
- arbitrary network access
- customer data boundaries
The temptation is to build a generic agent platform. That is probably too broad.
The better wedge is a narrow high-volume workflow where cost is obviously constraining usage: coding agents over many repos, browser agents for web operations, support-ticket agents, compliance-review agents, sales-research agents, or back-office data agents.
The test is simple: does the customer say, “We would run 10x more agents if each run were 3x cheaper and more predictable”?
If yes, you have the turbopuffer-shaped opportunity.
## What the Incumbents Will Do
The model providers will not ignore this. They are already adding containers, managed agents, web search, computer use, prompt caching, and session runtime billing.
That does not kill the opportunity. But it changes the product.
The independent runtime cannot just be “a sandbox.” It must be the neutral execution layer across models, clouds, browsers, tools, customer data, and workflows.
The pitch is:
```text
Bring your models.
Bring your tools.
Bring your cloud.
Bring your security boundary.
We make agent execution durable, cheap, observable, and fast.
```
That is valuable because enterprises will not run every agent through one model provider forever. They will route by cost, latency, quality, data policy, and procurement.
A neutral runtime can sit above that mess.
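Routing by cost, quality, and data policy can be illustrated with a small selector that picks the cheapest model satisfying a step's constraints. All names, capability tiers, and policy labels below are hypothetical. The two price points reuse the cheap and frontier example prices from earlier in this note.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    tier: int                # higher means more capable (illustrative ranking)
    input_per_m: float       # dollars per million input tokens
    output_per_m: float      # dollars per million output tokens
    allowed_data: frozenset  # data-policy labels this model may see


# Hypothetical catalog; only the prices come from the text's examples.
MODELS = [
    Model("cheap", 1, 1.0, 5.0, frozenset({"public", "internal", "restricted"})),
    Model("frontier", 3, 3.0, 15.0, frozenset({"public", "internal"})),
]

# Assumed minimum capability per step kind; a real system would tune or learn this.
REQUIRED_TIER = {"plan": 3, "verify": 2, "extract": 1, "format": 1}


def route(step_kind, data_label, est_input_tokens, est_output_tokens):
    """Pick the cheapest model that meets the step's tier and data policy."""
    need = REQUIRED_TIER[step_kind]
    candidates = [m for m in MODELS
                  if m.tier >= need and data_label in m.allowed_data]
    if not candidates:
        raise LookupError(f"no model allowed for {step_kind!r} on {data_label!r} data")
    return min(candidates,
               key=lambda m: est_input_tokens * m.input_per_m
               + est_output_tokens * m.output_per_m)


print(route("format", "internal", 2_000, 500).name)
print(route("plan", "internal", 50_000, 2_000).name)
```

A planning step on `restricted` data raises `LookupError` in this catalog, which is the kind of policy conflict a neutral runtime would have to surface rather than silently downgrade.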
## The Economics of More Agents
When infrastructure gets cheaper, product ambition changes.
If an agent costs $2, you only run it on important tasks.
If it costs $0.20, you run it on every ticket, every lead, every pull request, every document, every alert, every customer account, every workflow edge case.
If it costs $0.02, you stop thinking of it as an assistant and start thinking of it as background compute.
That is the real opportunity: not making today’s agents slightly cheaper, but making tomorrow’s agent-heavy products economically possible.
Search got dramatically more useful when storage economics improved. Retrieval moved from “special feature” to “default capability.”
Agentic compute will go through the same transition.
Right now, many teams are building agents as if every task deserves a hot VM, a fresh browser, a fresh repo clone, and a full-context frontier-model call.
That will look absurd in a few years.
The future agent runtime is durable by default, cold when idle, warm when useful, pinned when necessary, and ruthless about not recomputing the same work twice.
**That is the turbopuffer idea for agentic compute.**