- Our biggest cost multiplier was "conversational drift" - not the initial call, but what happens when you let users iterate.
In our email marketing tool, a user might say "make it more punchy" → AI rewrites → "actually, more professional" → rewrite → "can we A/B test both versions?" → now you're generating multiple variants. One "simple" email could spiral into 15+ LLM calls.
What worked for us:
1. *Session-level budgets, not request-level.* We cap total tokens per session rather than per call. Users can iterate freely within their budget, but can't inadvertently 10x their usage (rough sketch below).
2. *Explicit "done" signals.* Instead of letting users endlessly refine, we added a clear "I'm happy with this" button that closes the generation loop. Sounds UX-y but it cut our average calls-per-task by 60%.
3. *Cascade to cheaper models for iteration.* First generation uses Claude 3.5. Tweaks and refinements use Haiku. Users can't tell the difference for small edits, and it cut iteration costs ~80% (routing sketch below).
4. *Cache aggressively at the semantic level.* "Make it shorter" and "condense this" should hit the same cache key. We use embeddings to identify semantically similar requests and serve cached results when possible (cache sketch below).
The counterintuitive insight: your biggest cost driver is probably user behavior, not model choice. The difference between GPT-4 and Claude matters less than how you architect the interaction loop.
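A minimal sketch of the session-level budget from point 1, assuming your client reports token usage per call (the class name and the 100k limit are illustrative):

```python
# Per-session token budget: gate and record usage across a whole editing session,
# not per request. Limit and names are illustrative.
class SessionBudget:
    def __init__(self, max_tokens: int = 100_000):
        self.max_tokens = max_tokens
        self.used = 0

    def remaining(self) -> int:
        return max(self.max_tokens - self.used, 0)

    def allows(self, estimated_tokens: int) -> bool:
        # Decide before the call, based on the session total so far.
        return estimated_tokens <= self.remaining()

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        self.used += prompt_tokens + completion_tokens


budget = SessionBudget()
if budget.allows(estimated_tokens=4_000):
    pass  # make the call, then budget.record(usage.prompt_tokens, usage.completion_tokens)
else:
    pass  # warn the user, degrade, or stop the iteration loop here
```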
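Point 3 can be as simple as a one-function router: full drafts go to the big model, small edits go to the cheap one (model names and the size cutoff are placeholders, not a recommendation):

```python
# Route first-pass generation to the flagship model and small refinements to the cheap one.
def pick_model(is_first_generation: bool, edit_request_chars: int) -> str:
    if is_first_generation or edit_request_chars > 2_000:
        return "flagship-model"   # full drafts and large rewrites
    return "cheap-model"          # "make it punchier"-style tweaks
```

The size cutoff is doing the real work here: anything that reads like a full rewrite should go back to the larger model.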
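And for point 4, a toy semantic cache: embed each request and serve a stored result when a new request is close enough to an old one. The `embed` callable is assumed to return unit-normalised vectors, and the 0.9 threshold is something you would tune:

```python
import numpy as np

# Toy semantic cache keyed on request embeddings rather than exact strings.
class SemanticCache:
    def __init__(self, embed, threshold: float = 0.9):
        self.embed = embed          # callable: str -> unit-normalised np.ndarray
        self.threshold = threshold
        self.keys: list[np.ndarray] = []
        self.values: list[str] = []

    def get(self, request: str):
        if not self.keys:
            return None
        query = self.embed(request)
        sims = np.stack(self.keys) @ query        # cosine similarity for unit vectors
        best = int(np.argmax(sims))
        return self.values[best] if sims[best] >= self.threshold else None

    def put(self, request: str, result: str) -> None:
        self.keys.append(self.embed(request))
        self.values.append(result)
```

With this, "make it shorter" and "condense this" land on the same cached entry as long as their embeddings clear the threshold.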
- - Tool calling: This is unavoidable, but I try to structure the tools so that the total number of tool calls for an input is minimised.
- Using UUIDs in the prompt (which can happen if you serialise a data structure that contains UUIDs into a prompt): Just don't use UUIDs, or, if you must, map them onto unique numbers (in memory) before adding them to a prompt (see the sketch after this list).
- Putting everything in one LLM chat history: Use sub-agents with their own chat history, and discard it after the sub-agent finishes (also sketched after this list).
- Structure your system prompt to maximize cached input tokens: You can do this by putting all the variable parts of the system prompt towards the end of it, if possible.
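A sketch of the UUID point: keep an in-memory map from UUIDs to small integers, serialise the integers into the prompt, and translate back when reading the model's output (names are illustrative):

```python
import uuid

# Replace 36-character UUIDs with short integer handles before they hit the prompt.
class IdMapper:
    def __init__(self):
        self._to_handle: dict[uuid.UUID, int] = {}
        self._to_uuid: dict[int, uuid.UUID] = {}

    def shorten(self, value: uuid.UUID) -> int:
        if value not in self._to_handle:
            handle = len(self._to_handle) + 1
            self._to_handle[value] = handle
            self._to_uuid[handle] = value
        return self._to_handle[value]

    def restore(self, handle: int) -> uuid.UUID:
        return self._to_uuid[handle]


mapper = IdMapper()
record = {"id": uuid.uuid4(), "status": "draft"}
prompt_view = {"id": mapper.shorten(record["id"]), "status": record["status"]}
# The raw UUID string costs many tokens every time it appears; "1" costs one.
```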
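The sub-agent point is mostly about lifetime: the sub-task gets its own message list, and only its final result is appended to the parent history (`call_llm` is a stand-in for your actual client):

```python
# Sub-agent with a private chat history that is thrown away when it finishes.
def run_subagent(call_llm, task: str) -> str:
    history = [
        {"role": "system", "content": "You handle exactly one narrow sub-task."},
        {"role": "user", "content": task},
    ]
    return call_llm(history)  # the sub-agent's transcript never leaves this function


def parent_turn(call_llm, parent_history: list, task: str) -> None:
    summary = run_subagent(call_llm, task)
    # The parent context grows by one short message, not by the whole sub-transcript.
    parent_history.append({"role": "assistant", "content": summary})
```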
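And the cache-ordering point, concretely: keep the long static instructions as a byte-identical prefix and append the variable bits last, so provider-side prompt caching can reuse the prefix (the structure here is illustrative):

```python
# Static prefix first, variable parts last, so the cached prefix stays identical across requests.
STATIC_SYSTEM_PROMPT = (
    "You are an email-marketing assistant.\n"
    "<long, unchanging instructions, style guide, tool descriptions...>\n"
)

def build_system_prompt(user_profile: str, today: str) -> str:
    return STATIC_SYSTEM_PROMPT + f"\nUser profile: {user_profile}\nDate: {today}\n"
```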
- In my experience the biggest multiplier isn't any single variable; it's the interaction between them. Fanout × retries × context growth compounds in ways that linear cost models completely miss.
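Toy numbers (all made up) to show why the compounding is multiplicative rather than additive:

```python
# A run with 3-way fanout, 20% retry overhead, and context that doubles over the run
# doesn't cost "a bit more" than the single-call estimate; the factors multiply.
fanout, retry_overhead, context_growth = 3, 1.2, 2.0
print(fanout * retry_overhead * context_growth)  # 7.2x the naive one-call estimate
```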
The fix that worked for us: treat budget as a hard constraint, not a target. When you're approaching the limit, degrade gracefully (shorter context, fewer tool calls, fallback to a smaller model) rather than letting costs explode and cleaning up later.
Also worth tracking: the 90th percentile request often costs 10x the median. A handful of pathological queries can dominate your bill. Capping max tokens per request is crude but effective.
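A sketch of the budget-as-hard-constraint idea with graceful degradation, folding in the per-request cap; thresholds, tiers, and model names are made up:

```python
# Pick a degradation tier from the fraction of the budget already spent,
# so the policy is decided before the call rather than discovered on the invoice.
def degradation_policy(used_tokens: int, budget_tokens: int) -> dict:
    frac = used_tokens / budget_tokens
    if frac < 0.5:
        return {"model": "flagship-model", "max_output_tokens": 1024, "max_tool_calls": 5}
    if frac < 0.8:
        return {"model": "flagship-model", "max_output_tokens": 512, "max_tool_calls": 2,
                "truncate_context": True}
    if frac < 1.0:
        return {"model": "cheap-model", "max_output_tokens": 256, "max_tool_calls": 0,
                "truncate_context": True}
    return {"refuse": True}  # hard stop: partial answer or explicit failure, never a surprise bill
```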
- +1 on interaction terms + tails: fanout × retries × context growth is where linear token math dies.
One thing we do in enzu is make “budget as constraint” executable: we clamp `max_output_tokens` from the budget before the call, and in multi-step/RLM runs we adapt output caps downward as the budget depletes (so it naturally gets shorter/cheaper instead of spiraling). When token counting is unavailable we explicitly enter a “budget degraded” mode rather than pretending estimates are exact.
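For readers who want the pattern without the library, the clamping step looks roughly like this (a generic sketch, not enzu's actual API; see the docs below for the real thing):

```python
# Derive the per-call output cap from the remaining budget before the call is made.
def clamp_output_tokens(remaining_budget: int, requested_cap: int,
                        expected_input_tokens: int, floor: int = 64) -> int:
    affordable = remaining_budget - expected_input_tokens
    if affordable < floor:
        raise RuntimeError("budget exhausted: degrade or stop instead of making the call")
    return min(requested_cap, affordable)
```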
Also agree p90/p95 cost/run matters more than averages; max-output caps are crude but effective.
Docs: https://github.com/teilomillet/enzu/blob/main/docs/PROD_MULT... and https://github.com/teilomillet/enzu/blob/main/docs/BUDGET_CO...
- If you’re trying to estimate before prod, logging these 4 things in a pilot gets you 80% there:
  - tokens/run (in+out)
  - tool calls/run (and fanout)
  - retry rate (timeouts/429s)
  - context length over turns (P50/P95)
Fanout × retries is the classic “bill exploder”, and P95 context growth is the stealth one. The point of “budget as contract” is deciding in advance what happens at the limit (degraded mode / fallback / partial answer / hard fail), not discovering it from the invoice.
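A minimal per-run log record for that pilot, assuming you can read token usage and retry counts off your client; field names are illustrative, and the P50/P95 figures are computed offline from the accumulated file:

```python
import json
import time

# Append one JSON line per run; aggregate percentiles later.
def log_run(path: str, run_id: str, tokens_in: int, tokens_out: int,
            tool_calls: int, fanout: int, retries: int,
            context_tokens_per_turn: list) -> None:
    record = {
        "ts": time.time(),
        "run_id": run_id,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "tool_calls": tool_calls,
        "fanout": fanout,
        "retries": retries,                           # timeouts + 429s
        "context_per_turn": context_tokens_per_turn,  # growth over turns -> P50/P95
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```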
- Background note I wrote (framing + “budget as contract”): https://github.com/teilomillet/enzu/blob/main/docs/BUDGETS_A...