- Our biggest cost multiplier was "conversational drift" - not the initial call, but what happens when you let users iterate.
In our email marketing tool, a user might say "make it more punchy" → AI rewrites → "actually, more professional" → rewrite → "can we A/B test both versions?" → now you're generating multiple variants. One "simple" email could spiral into 15+ LLM calls.
What worked for us:
1. *Session-level budgets, not request-level.* We cap total tokens per session rather than per call. Users can iterate freely within their budget, but can't inadvertently 10x their usage (rough sketch below).
2. *Explicit "done" signals.* Instead of letting users endlessly refine, we added a clear "I'm happy with this" button that closes the generation loop. Sounds UX-y but it cut our average calls-per-task by 60%.
3. *Cascade to cheaper models for iteration.* First generation uses Claude 3.5. Tweaks and refinements use Haiku. Users can't tell the difference for small edits, and it cut iteration costs ~80% (routing sketch below).
4. *Cache aggressively at the semantic level.* "Make it shorter" and "condense this" should hit the same cache key. We use embeddings to identify semantically similar requests and serve cached results when possible (cache sketch below).
The counterintuitive insight: your biggest cost driver is probably user behavior, not model choice. The difference between GPT-4 and Claude matters less than how you architect the interaction loop.
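A minimal sketch of the session-level budget from point 1, assuming your client reports token usage per call (the class name and the 100k limit are illustrative):

```python
# Per-session token budget: gate and record usage across a whole editing session,
# not per request. Limit and names are illustrative.
class SessionBudget:
    def __init__(self, max_tokens: int = 100_000):
        self.max_tokens = max_tokens
        self.used = 0

    def remaining(self) -> int:
        return max(self.max_tokens - self.used, 0)

    def allows(self, estimated_tokens: int) -> bool:
        # Decide before the call, based on the session total so far.
        return estimated_tokens <= self.remaining()

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.used += prompt_tokens + completion_tokens


budget = SessionBudget()
if budget.allows(estimated_tokens=4_000):
    pass  # make the call, then budget.record(usage.prompt_tokens, usage.completion_tokens)
else:
    pass  # warn the user, degrade, or stop the iteration loop here
```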
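Point 3 can be as simple as a one-function router: full drafts go to the big model, small edits go to the cheap one (model names and the size cutoff are placeholders, not a recommendation):

```python
# Route first-pass generation to the flagship model and small refinements to the cheap one.
def pick_model(is_first_generation: bool, edit_request_chars: int) -> str:
    if is_first_generation or edit_request_chars > 2_000:
        return "flagship-model"   # full drafts and large rewrites
    return "cheap-model"          # "make it punchier"-style tweaks
```

The size cutoff is doing the real work here: anything that reads like a full rewrite should go back to the larger model.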
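And for point 4, a toy semantic cache: embed each request and serve a stored result when a new request is close enough to an old one. The `embed` callable is assumed to return unit-normalised vectors, and the 0.9 threshold is something you would tune:

```python
import numpy as np

# Toy semantic cache keyed on request embeddings rather than exact strings.
class SemanticCache:
    def __init__(self, embed, threshold: float = 0.9):
        self.embed = embed          # callable: str -> unit-normalised np.ndarray
        self.threshold = threshold
        self.keys: list[np.ndarray] = []
        self.values: list[str] = []

    def get(self, request: str):
        if not self.keys:
            return None
        query = self.embed(request)
        sims = np.stack(self.keys) @ query        # cosine similarity for unit vectors
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, request: str, result: str) -> None:
        self.keys.append(self.embed(request))
        self.values.append(result)
```

With this, "make it shorter" and "condense this" land on the same cached entry as long as their embeddings clear the threshold.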
- - Tool calling: This is unavoidable, but I try to structure the tools so that the total number of tool calls for an input is minimised.
- Using UUIDs in the prompt (which can happen if you serialise a data structure that contains UUIDs into a prompt): Just don't use UUIDs, or, if you must, map them onto unique numbers (in memory) before adding them to a prompt (see the sketch after this list).
- Putting everything in one LLM chat history: Use sub-agents with their own chat history, and discard it after the sub-agent finishes (also sketched after this list).
- Structure your system prompt to maximize cached input tokens: You can do this by putting all the variable parts of the system prompt towards the end of it, if possible.
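A sketch of the UUID point: keep an in-memory map from UUIDs to small integers, serialise the integers into the prompt, and translate back when reading the model's output (names are illustrative):

```python
import uuid

# Replace 36-character UUIDs with short integer handles before they hit the prompt.
class IdMapper:
    def __init__(self):
        self._to_handle: dict[uuid.UUID, int] = {}
        self._to_uuid: dict[int, uuid.UUID] = {}

    def shorten(self, value: uuid.UUID) -> int:
        if value not in self._to_handle:
            handle = len(self._to_handle) + 1
            self._to_handle[value] = handle
            self._to_uuid[handle] = value
        return self._to_handle[value]

    def restore(self, handle: int) -> uuid.UUID:
        return self._to_uuid[handle]


mapper = IdMapper()
record = {"id": uuid.uuid4(), "status": "draft"}
prompt_view = {"id": mapper.shorten(record["id"]), "status": record["status"]}
# The raw UUID string costs many tokens every time it appears; "1" costs one.
```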
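The sub-agent point is mostly about lifetime: the sub-task gets its own message list, and only its final result is appended to the parent history (`call_llm` is a stand-in for your actual client):

```python
# Sub-agent with a private chat history that is thrown away when it finishes.
def run_subagent(call_llm, task: str) -> str:
    history = [
        {"role": "system", "content": "You handle exactly one narrow sub-task."},
        {"role": "user", "content": task},
    ]
    return call_llm(history)  # the sub-agent's transcript never leaves this function


def parent_turn(call_llm, parent_history: list, task: str) -> None:
    summary = run_subagent(call_llm, task)
    # The parent context grows by one short message, not by the whole sub-transcript.
    parent_history.append({"role": "assistant", "content": summary})
```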
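And the cache-ordering point, concretely: keep the long static instructions as a byte-identical prefix and append the variable bits last, so provider-side prompt caching can reuse the prefix (the structure here is illustrative):

```python
# Static prefix first, variable parts last, so the cached prefix stays identical across requests.
STATIC_SYSTEM_PROMPT = (
    "You are an email-marketing assistant.\n"
    "<long, unchanging instructions, style guide, tool descriptions...>\n"
)

def build_system_prompt(user_profile: str, today: str) -> str:
    return STATIC_SYSTEM_PROMPT + f"\nUser profile: {user_profile}\nDate: {today}\n"
```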
- In my experience the biggest multiplier isn't any single variable; it's the interaction between them. Fanout × retries × context growth compounds in ways that linear cost models completely miss.
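Toy numbers (all made up) to show why the compounding is multiplicative rather than additive:

```python
# A run with 3-way fanout, 20% retry overhead, and context that doubles over the run
# doesn't cost "a bit more" than the single-call estimate; the factors multiply.
fanout, retry_overhead, context_growth = 3, 1.2, 2.0
print(fanout * retry_overhead * context_growth)  # 7.2x the naive one-call estimate
```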
The fix that worked for us: treat budget as a hard constraint, not a target. When you're approaching the limit, degrade gracefully (shorter context, fewer tool calls, fallback to a smaller model) rather than letting costs explode and cleaning up later.
Also worth tracking: the 90th percentile request often costs 10x the median. A handful of pathological queries can dominate your bill. Capping max tokens per request is crude but effective.
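A sketch of the budget-as-hard-constraint idea with graceful degradation, folding in the per-request cap; thresholds, tiers, and model names are made up:

```python
# Pick a degradation tier from the fraction of the budget already spent,
# so the policy is decided before the call rather than discovered on the invoice.
def degradation_policy(used_tokens: int, budget_tokens: int) -> dict:
    frac = used_tokens / budget_tokens
    if frac < 0.5:
        return {"model": "flagship-model", "max_output_tokens": 1024, "max_tool_calls": 5}
    if frac < 0.8:
        return {"model": "flagship-model", "max_output_tokens": 512, "max_tool_calls": 2,
                "truncate_context": True}
    if frac < 1.0:
        return {"model": "cheap-model", "max_output_tokens": 256, "max_tool_calls": 0,
                "truncate_context": True}
    return {"refuse": True}  # hard stop: partial answer or explicit failure, never a surprise bill
```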
- +1 on interaction terms + tails: fanout × retries × context growth is where linear token math dies.
One thing we do in enzu is make “budget as constraint” executable: we clamp `max_output_tokens` from the budget before the call, and in multi-step/RLM runs we adapt output caps downward as the budget depletes (so it naturally gets shorter/cheaper instead of spiraling). When token counting is unavailable we explicitly enter a “budget degraded” mode rather than pretending estimates are exact.
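For readers who want the pattern without the library, the clamping step looks roughly like this (a generic sketch, not enzu's actual API; see the docs below for the real thing):

```python
# Derive the per-call output cap from the remaining budget before the call is made.
def clamp_output_tokens(remaining_budget: int, requested_cap: int,
                        expected_input_tokens: int, floor: int = 64) -> int:
    affordable = remaining_budget - expected_input_tokens
    if affordable < floor:
        raise RuntimeError("budget exhausted: degrade or stop instead of making the call")
    return min(requested_cap, affordable)
```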
Also agree p90/p95 cost/run matters more than averages; max-output caps are crude but effective.
Docs: https://github.com/teilomillet/enzu/blob/main/docs/PROD_MULT... and https://github.com/teilomillet/enzu/blob/main/docs/BUDGET_CO...
- If you’re trying to estimate before prod, logging these 4 things in a pilot gets you 80% there:
  - tokens/run (in+out)
  - tool calls/run (and fanout)
  - retry rate (timeouts/429s)
  - context length over turns (P50/P95)
Fanout × retries is the classic “bill exploder”, and P95 context growth is the stealth one. The point of “budget as contract” is deciding in advance what happens at the limit (degraded mode / fallback / partial answer / hard fail), not discovering it from the invoice.
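A minimal per-run log record for that pilot, assuming you can read token usage and retry counts off your client; field names are illustrative, and the P50/P95 figures are computed offline from the accumulated file:

```python
import json
import time

# Append one JSON line per run; aggregate percentiles later.
def log_run(path: str, run_id: str, tokens_in: int, tokens_out: int,
            tool_calls: int, fanout: int, retries: int,
            context_tokens_per_turn: list) -> None:
    record = {
        "ts": time.time(),
        "run_id": run_id,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "tool_calls": tool_calls,
        "fanout": fanout,
        "retries": retries,                           # timeouts + 429s
        "context_per_turn": context_tokens_per_turn,  # growth over turns -> P50/P95
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```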
- Background note I wrote (framing + “budget as contract”): https://github.com/teilomillet/enzu/blob/main/docs/BUDGETS_A...