What actually breaks when you put an LLM in production
Latency spikes, silent cost creep, and the 2am model outage. The failure modes nobody puts in a demo, and the boring guardrails I design around each one before a feature ships.
Every LLM feature looks great in the demo. You type a question, the model answers, the room nods. Then you ship it to real users and discover that a demo and a product are very different animals.
I've spent the last 18 months putting language models into things that real businesses depend on. The model itself is rarely the hard part. The hard part is everything around it: the request that takes nine seconds instead of one, the bill that quietly triples, the morning the provider returns 500s and your whole feature falls over. None of that shows up in a demo. All of it shows up in production.
Here's what actually breaks, and the unglamorous engineering I do to keep it standing.
The failure modes nobody demos
When a feature works in a notebook, it's tempting to call it done. But a notebook hides three things that production exposes immediately:
- Variance. The same prompt can take 800ms or 8 seconds depending on output length and provider load.
- Volume. One request costs next to nothing and is easy to reason about. Ten thousand a day is a budget line and a rate limit.
- Dependency. Your feature now has a hard dependency on a service you don't control and can't see inside.
Treat each of these as a first-class engineering problem and the feature holds up. Ignore them and you're shipping a demo with a production URL.
Latency you didn't budget for
Token generation is sequential, so latency scales with how much the model says. A chatty system prompt and an open-ended question can turn a snappy feature into a spinner. The fix isn't a faster model; it's designing for the wait.
Stream tokens so the user sees progress immediately. Cap output length where you can. And move anything that doesn't need to be synchronous off the request path entirely: summaries, enrichment, indexing all belong in a queue, not in front of a waiting human.
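A minimal sketch of streaming with a length cap, assuming the provider's streaming response can be consumed as an async iterable of text chunks. The names `streamWithCap`, `onChunk`, `maxChunks` and `fakeStream` are hypothetical, and a provider-side output limit (where one exists) would still be the first line of defence; this is only a client-side backstop.

```javascript
// Forward chunks to the UI as they arrive and stop once the cap is reached.
// `chunks` is any async iterable of text pieces; it stands in for whatever
// streaming interface a real provider SDK exposes (an assumption here).
async function streamWithCap(chunks, onChunk, maxChunks) {
  let sent = 0;
  for await (const chunk of chunks) {
    if (sent >= maxChunks) {
      // Returning from the loop also closes the underlying iterator.
      return { truncated: true, sent };
    }
    onChunk(chunk); // the user sees progress immediately
    sent += 1;
  }
  return { truncated: false, sent };
}

// Stand-in stream for illustration only.
async function* fakeStream() {
  yield "Hello";
  yield ", ";
  yield "world";
  yield "!";
}
```

Called as `streamWithCap(fakeStream(), render, 2)`, only the first two chunks reach `render` and the result reports `truncated: true`, so the UI can show that the answer was cut short rather than pretending it is complete.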
If a user is staring at a spinner, the model isn't slow. Your architecture is.
Silent cost creep
The scariest LLM bug is the one that doesn't throw an error: cost. A prompt that grows by a few hundred tokens per release, multiplied by every request, is a bill that climbs while every test stays green. You don't notice until finance does.
So I account for tokens the way I'd account for money, per request, with a budget and an alarm:
// reject before we ever call the model
const estimate = countTokens(prompt) + maxOutputTokens;
if (estimate > budget.perRequest) {
  metrics.increment("llm.budget.exceeded");
  return fallbackResponse();
}
Every call logs its real token usage against the estimate. A daily rollup tells me which feature is drifting before the invoice does. It's accounting, not machine learning, and it has saved clients real money more than once.
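One way the daily rollup could work, as a sketch rather than the exact tooling described above: average the logged tokens per call for each feature on two days and flag the features whose average grew past a threshold. The record shape `{ feature, day, actualTokens }`, the function names and the 20% default are all assumptions made for illustration.

```javascript
// Average logged tokens per call for each feature on a given day.
// Assumed record shape: { feature, day, actualTokens }, one per model call.
function averageTokensByFeature(records, day) {
  const totals = new Map();
  for (const r of records) {
    if (r.day !== day) continue;
    const t = totals.get(r.feature) ?? { tokens: 0, calls: 0 };
    t.tokens += r.actualTokens;
    t.calls += 1;
    totals.set(r.feature, t);
  }
  const averages = new Map();
  for (const [feature, t] of totals) averages.set(feature, t.tokens / t.calls);
  return averages;
}

// Features whose average tokens per call grew by more than `threshold`
// (a fraction, e.g. 0.2 for 20%) between the two days.
function driftingFeatures(records, baselineDay, currentDay, threshold = 0.2) {
  const before = averageTokensByFeature(records, baselineDay);
  const now = averageTokensByFeature(records, currentDay);
  const drifting = [];
  for (const [feature, avgNow] of now) {
    const avgBefore = before.get(feature);
    if (!avgBefore) continue; // no baseline (new feature or zero usage)
    const growth = (avgNow - avgBefore) / avgBefore;
    if (growth > threshold) drifting.push({ feature, avgBefore, avgNow, growth });
  }
  return drifting;
}
```

Comparing averages per call rather than daily totals keeps a traffic spike from looking like prompt growth; the two are different problems with different fixes.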
The 2am model outage
Providers have bad days. Rate limits tighten without warning, a region degrades, a model version gets deprecated. If your feature assumes the API is always up, it will eventually be down exactly when you're asleep.
The answer is the same pattern we've used for every external dependency for decades: a fallback chain. Try the primary model; on error, drop to a cheaper or alternate one; if that's failing too, degrade gracefully to a cached answer or an honest "try again in a moment", never a stack trace.
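The fallback chain can be sketched in a few lines. This assumes each provider is wrapped in an async function that may throw; `withFallback`, the `{ name, call }` shape and the degraded message are hypothetical names chosen for the example, not any particular SDK's API.

```javascript
// Try each provider in order (primary first). If all of them fail, fall back
// to a cached answer when one exists, otherwise to an honest degraded message.
// `providers` is an array of { name, call }, where `call` is an async function
// wrapping one model request (an assumption for this sketch).
async function withFallback(providers, cachedAnswer) {
  for (const { name, call } of providers) {
    try {
      return { source: name, text: await call() };
    } catch (err) {
      // Record the failure and move on; the user never sees this error.
      console.warn(`llm provider ${name} failed: ${String(err)}`);
    }
  }
  if (cachedAnswer !== undefined) {
    return { source: "cache", text: cachedAnswer };
  }
  return { source: "degraded", text: "Try again in a moment." };
}
```

Returning the `source` alongside the text lets the caller count how often each tier is actually serving traffic, which is usually the first sign that the primary is degrading.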
Designing the guardrails
None of this is novel. It's the same discipline we apply to any system that talks to the outside world: timeouts, retries with backoff, idempotency, circuit breakers, observability. LLMs just make the outside world feel magical enough that people forget to apply it.
The mental shift that helps: stop treating the model as a function that returns the right answer, and start treating it as a remote service that usually returns a useful answer. Once it's a remote service, you already know how to engineer around it.
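Two of the disciplines named above, timeouts and retries with backoff, fit in a short sketch. The helper names and default values here are assumptions, and the timeout only stops waiting; it does not cancel the underlying request. Retrying is also only safe when the wrapped call is idempotent.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Call `fn` up to `attempts` times, each attempt bounded by `timeoutMs`,
// sleeping between failures with exponential backoff and full jitter.
async function callWithRetry(
  fn,
  { attempts = 3, baseDelayMs = 200, timeoutMs = 5000 } = {}
) {
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await withTimeout(fn(), timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < attempts - 1) {
        // Random delay between 0 and baseDelayMs * 2^attempt.
        await sleep(Math.random() * baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}
```

The jitter matters more than it looks: without it, every client that failed at the same moment retries at the same moment, and the provider's bad minute becomes a bad ten.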
What I check before shipping
Before an LLM feature goes live, I walk the same short list every time:
- Is output streamed, and is there a sensible max length?
- Is there a per-request token budget with an alarm on drift?
- Is there a fallback for when the primary model errors or times out?
- Does the UI have an honest degraded state, with no raw errors and no lies?
- Can I see latency, cost and error rate per feature in one place?
- Are prompts versioned so I can roll back a bad change?
None of it is glamorous. All of it is the difference between a feature that demos well and one that holds up. That gap, between the demo and the thing that survives real users, is most of the job.