The Hidden Cost of AI Workloads
I was a few hours into a Claude Code session last month, jumping from one feature to the next, riding the momentum. I'd fix a bug, add a small feature, tweak a test. Every change felt lightweight. But somewhere around the fourth or fifth task, something felt heavier. Responses were slower. Context was creeping up. I finally looked at the token count, and it was enormous. I was dragging the history of every previous task into every new one.
That moment reshaped how I work in Claude Code. Now every small change is a commit, a PR, then a /clear to reset context. Smaller, tighter loops. Far fewer tokens. The work doesn't feel any slower, and the model stays sharper on the task in front of it.
I didn't catch that because I was being careful. I caught it because I happened to look. If I'd had a dashboard showing context size per session, I would have caught it in week one. That's the whole point of this post.
The word "cost" does a lot of work here
When developers talk about AI costs, they usually mean dollars. Dollars matter, but they're a narrow slice. In a production AI workload, cost shows up in at least four shapes.
- Dollar cost is the obvious one. Tokens in, tokens out, model choice, cache hits, cache misses. It lands on an invoice once a month and that's usually when people start paying attention.
- Latency cost is what users feel and what ties up your infrastructure. A 12-second response isn't just bad UX. It's a server that can't serve other traffic, and a user who might hit the button again.
- Architectural cost is the cost of bad patterns. Context bloat, like my example. Tool call loops where the model keeps trying the same wrong action. Fallback chains that silently escalate every request to a bigger model because the small one returned something weird.
- Quality cost is the sneakiest. A prompt tweak that saves 20% on tokens but drops accuracy on the workload you care about most. A model swap that looked identical in spot checks and fails on the long tail. Quality cost usually shows up as a customer complaint or a churn number, which means it shows up late.
None of these are bad on their own. They're just costs, and every design choice trades between them. The problem isn't that costs exist. The problem is not being able to see them.
Why these costs stay hidden
Traditional APM (application performance monitoring) tools weren't built for this. They know about CPU, memory, request count, status codes. They don't know about tokens. They don't know that the same input to the same endpoint can cost three cents today and fifteen cents tomorrow because the context grew.
AI costs also accumulate in places APM doesn't naturally slice. Per user. Per feature. Per conversation. Per prompt template. Your infrastructure dashboard likely aggregates everything to the server level, which is the wrong resolution for almost every interesting AI question.
And quality is subjective. There's no HTTP 500 for "the model guessed when it should have answered" or "the summary missed the key detail." If you only watch uptime and error rates, quality can degrade for months without anything turning red.
The observability playbook still works
Here's the good news. You don't need new principles. The same observability playbook you've used for web services for the last twenty years maps almost directly onto AI workloads.
Metrics. Emit one event per AI call with the model name, the input token count, the output token count, latency, a feature tag, and the outcome (success, error, retry, fallback). That single event, rolled up a few different ways, answers most of the cost questions you'll ever have.
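Concretely, that event can be tiny. Here's a minimal sketch of the one-event-per-call pattern; `record_event` and `call_model` are hypothetical names, the token counts are placeholders, and the sink is just a JSON line, which almost any metrics pipeline (StatsD, Prometheus, a log shipper) can ingest.

```python
# A minimal sketch, not a production implementation. The real model call
# goes where the comment marks it; everything else is the event plumbing.
import json
import time


def record_event(event: dict) -> None:
    print(json.dumps(event))  # stand-in sink: one JSON line per AI call


def call_model(prompt: str, feature: str, model: str) -> str:
    start = time.monotonic()
    outcome = "success"
    tokens_in = tokens_out = 0
    text = ""
    try:
        # response = client.chat.completions.create(...)  # your real call here
        text, tokens_in, tokens_out = "ok", 512, 64       # placeholder values
    except Exception:
        outcome = "error"
        raise
    finally:
        record_event({
            "ts": time.time(),
            "model": model,
            "feature": feature,   # the tag that makes every rollup useful
            "tokens_in": tokens_in,
            "tokens_out": tokens_out,
            "latency_ms": round((time.monotonic() - start) * 1000),
            "outcome": outcome,   # success | error | retry | fallback
        })
    return text
```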
Tracing. Agentic workflows make a mess of traditional logs. A single user request might kick off a model call, a tool call, another model call, and a retry. Distributed tracing makes that readable. Each span shows duration, tokens, and which step it was. When something goes sideways, the trace tells you exactly where.
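Here's roughly what that shape looks like with OpenTelemetry's Python API (without an SDK configured the spans are no-ops, so this runs as-is). The span and attribute names are my own conventions, not a standard.

```python
# A sketch using the OpenTelemetry API (pip install opentelemetry-api).
from opentelemetry import trace

tracer = trace.get_tracer("ai.workload")


def handle_request(user_input: str) -> str:
    with tracer.start_as_current_span("agent.request") as root:
        root.set_attribute("feature", "chat")

        with tracer.start_as_current_span("model.call") as span:
            # plan = client.chat.completions.create(...)  # your real call
            span.set_attribute("tokens.input", 512)   # read from the response
            span.set_attribute("tokens.output", 64)

        with tracer.start_as_current_span("tool.call") as span:
            span.set_attribute("tool.name", "search")
            # result = run_search_tool(...)

        with tracer.start_as_current_span("model.call") as span:
            span.set_attribute("retry", True)         # mark retries explicitly
            # answer = client.chat.completions.create(...)

    return "done"
```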
Logs. Capture prompts and responses with the usual care around PII. You'll need them the first time someone asks "why did the model do that," and you will not remember.
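A minimal sketch of the idea: one JSON log line per call, with naive redaction before anything touches disk. The single email regex is illustrative only; real PII handling depends on your data and your obligations.

```python
import json
import logging
import re

logger = logging.getLogger("ai.calls")

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact(text: str) -> str:
    # Swap known PII patterns for placeholders before logging.
    return EMAIL.sub("[EMAIL]", text)


def log_call(feature: str, prompt: str, response: str) -> None:
    logger.info(json.dumps({
        "feature": feature,
        "prompt": redact(prompt),
        "response": redact(response),
    }))
```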
SLOs. Set budgets not just for latency, but for cost per user session and for quality on a canonical set of inputs. When you burn the budget, something changed and it's worth investigating.
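A sketch of a cost-per-session budget check. The prices and the budget are made-up numbers, not recommendations; substitute your provider's actual rates.

```python
PRICE_PER_1K_IN = 0.003    # dollars per 1K input tokens (assumption)
PRICE_PER_1K_OUT = 0.015   # dollars per 1K output tokens (assumption)
SESSION_BUDGET = 0.25      # dollars per user session (assumption)


def call_cost(tokens_in: int, tokens_out: int) -> float:
    return (tokens_in / 1000) * PRICE_PER_1K_IN + (tokens_out / 1000) * PRICE_PER_1K_OUT


def session_over_budget(events: list[dict]) -> bool:
    # events: the per-call records for one user session.
    total = sum(call_cost(e["tokens_in"], e["tokens_out"]) for e in events)
    return total > SESSION_BUDGET
```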
Alerts. Anomaly detection on "token spend per feature" catches context bloat before it compounds. Alerts on retry rate catch tool call loops. Alerts on p95 latency per model catch a provider having a bad day.
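The token-spend check can start as crude as a z-score against a trailing window. A sketch, with the window size and threshold as arbitrary assumptions:

```python
# Flag a feature whose token spend today sits more than three standard
# deviations above its trailing mean.
from statistics import mean, stdev


def spend_anomaly(daily_totals: list[int], today: int, z: float = 3.0) -> bool:
    history = daily_totals[-14:]  # trailing two weeks
    if len(history) < 7:
        return False              # not enough history to judge
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today > mu         # flat history: any growth is worth a look
    return (today - mu) / sigma > z
```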
Nothing here is exotic. It's the same work developers already do for databases and APIs. The move is recognizing that AI calls deserve the same treatment.
What becomes visible once you instrument it
My Claude Code context story is that pattern in the small. In a production app, it shows up at scale.
A team I talked to recently had a slowly climbing p50 input token count on their chat feature. Nothing was broken. The feature worked. But the curve was going up, not flat. The instrumentation caught it. Their fix was a sliding context window plus summarization of older turns. Token spend on that feature dropped, and no user noticed the change.
Another team had a feature that occasionally ran up a huge number of tool calls in a single session. They'd been told it was "just a long conversation." Their trace data told a different story: the model was hitting an edge case and calling the same tool in a loop. The dashboard made the loop obvious. The fix was a loop guard and a better prompt.
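A loop guard can be as simple as refusing to execute the same tool call with identical arguments more than a few times. A hypothetical sketch, not that team's actual fix; the limit of three is arbitrary.

```python
from collections import Counter

MAX_IDENTICAL_CALLS = 3


def should_halt(tool_calls: list[tuple[str, str]]) -> bool:
    """tool_calls: (tool_name, serialized_args) for the session so far."""
    counts = Counter(tool_calls)
    return any(n >= MAX_IDENTICAL_CALLS for n in counts.values())
```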
In both cases, the dashboard was the difference between catching it in engineering and hearing about it from a customer.
The architect's decisions, now with data
Every AI system has decisions baked into it. Which model to use. How much context to send. When to cache. When to retry. When to fall back. These decisions aren't right or wrong in the abstract. They're trade-offs.
Observability is what turns those trade-offs from guesses into decisions. When you can see context size by feature, you have evidence for where to trim. When you can see latency by model and by feature, you know where routing pays off. When you can see retry rate spike right after a prompt change, you know to roll back.
You're still the architect. Observability just means you're designing with data instead of vibes.
What to do this week
Pick one AI workload and instrument three things: input tokens, output tokens, and latency, tagged with the feature that triggered the call. That's it. One event per call, three numbers, one tag.
Then build one dashboard. Total tokens per feature per day. p50 and p95 latency per model. Cost per user session if you can manage it. Watch it for a week before you change anything.
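Once the events from the earlier sketch are landing somewhere queryable, the first rollups are a few lines. Here's one way, sketched with pandas against a JSON-lines file; the column names match that earlier event shape.

```python
import pandas as pd

# Load the one-event-per-call JSON lines emitted by the earlier sketch.
events = pd.read_json("ai_events.jsonl", lines=True)
events["day"] = pd.to_datetime(events["ts"], unit="s").dt.date

# Total tokens per feature per day.
tokens = (
    events.assign(total=events["tokens_in"] + events["tokens_out"])
          .groupby(["feature", "day"])["total"]
          .sum()
)

# p50 and p95 latency per model.
latency = events.groupby("model")["latency_ms"].quantile([0.5, 0.95])

print(tokens)
print(latency)
```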
You'll probably see something surprising. The point isn't to cut costs or squeeze every last penny out of every call. The point is that the decisions were already happening. Observability just makes them visible while you can still do something about them.