Working with enterprise AI teams, I keep seeing the same pattern: when something goes wrong, they nearly always blame the model. That’s understandable, but it’s also frequently wrong, and it ends up being expensive.
The usual scenario goes like this. Outputs are inconsistent; someone raises it, and the first reaction is to blame the model. Maybe it needs more training data, another fine-tuning run, or a different base model. After weeks of work, the issue is the same or only marginally better. The real problem, often sitting in the retrieval layer, the context window, or the task routing, was never examined.
I’ve seen this happen often enough that I think it’s worth writing about.
Fine-tuning is useful, but it gets overused
To be clear, fine-tuning has legitimate uses. If you need domain adaptation, tone alignment, or safety calibration, it belongs in the workflow. I’m not saying you shouldn’t use it.
The problem is that it has become the automatic answer to every problem, whether or not it’s the right tool. Part of the appeal is that it feels productive. You start a fine-tuning job, something visibly happens, and there is a before and after. It looks like you’re addressing the issue even when you aren’t.
One example: I watched a team debug a contract analysis system whose outputs were unreliable on complex documents. The initial theory was that the model lacked legal reasoning ability, so they ran several tuning iterations. The problem didn’t budge. Eventually, someone noticed that the retrieval layer was fetching the same passages multiple times and appending every copy to the context window. The model was wading through a pile of repeated, low-value text. They adjusted the retrieval ranking and introduced context compression, and the outputs got noticeably better.
The model itself was never changed. And this is a fairly common occurrence.
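To make that failure mode concrete, here is a minimal sketch of the kind of fix involved: deduplicating retrieved chunks and enforcing a token budget before anything reaches the model. The helper name and the crude word-count token estimate are illustrative assumptions, not the team’s actual code.

```python
def build_context(chunks, max_tokens=4000):
    """Deduplicate retrieved chunks and enforce a token budget.

    `chunks` is assumed to be a list of (text, relevance_score)
    pairs returned by a retriever. Both the function and the
    word-count token estimate are illustrative.
    """
    seen = set()
    selected, used = [], 0
    for text, score in sorted(chunks, key=lambda c: c[1], reverse=True):
        key = text.strip().lower()
        if key in seen:  # drop exact duplicates before they eat the budget
            continue
        cost = len(text.split())  # rough token estimate
        if used + cost > max_tokens:
            break
        seen.add(key)
        selected.append(text)
        used += cost
    return "\n\n".join(selected)
```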
Fine-Tuning vs Inference Loop (Image by Author)
What’s happening at inference time
For a long time, inference was just the step where you used the model. Training was where all the interesting decisions happened. That’s changing now.
One reason is that some models began allocating more compute at generation time rather than baking everything into training. Another is that research showed behaviors such as self-checking or rewriting a response can be learned through reinforcement learning. Both point to inference itself as a place where performance can be improved.
What I see now is engineering teams starting to treat inference as something you can actually design around, rather than just a fixed step you accept. How much reasoning depth does this task need? How is memory being managed? How is retrieval being prioritized? These are becoming real questions rather than defaults you don’t think about.
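In practice, that shift often shows up as explicit per-task inference settings instead of a single hardcoded call. A hypothetical sketch, where every field name and default is an assumption:

```python
from dataclasses import dataclass

@dataclass
class InferencePolicy:
    """Per-task inference settings, chosen at request time.
    All names and defaults here are illustrative."""
    reasoning_depth: int = 1        # self-check / revision passes
    max_context_tokens: int = 4000  # hard cap on assembled context
    retrieval_top_k: int = 5        # how many chunks to pull
    verify_output: bool = False     # run a second-pass check

# Cheap defaults for simple tasks, more compute where it pays off.
POLICIES = {
    "account_status": InferencePolicy(),
    "compliance_review": InferencePolicy(
        reasoning_depth=3,
        max_context_tokens=12000,
        retrieval_top_k=20,
        verify_output=True,
    ),
}
```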
The resource allocation problem
What’s often underrated is that most AI systems treat every query the same way. A simple question about account status goes through the same pipeline as a multi-step compliance review that has to reconcile information across several conflicting documents. The same cost, the same pipeline, the same compute.
This doesn’t make much sense when you think about it. In any other engineering discipline, resources are allocated based on the work required. Some teams are beginning to do this with AI, routing lightweight queries to smaller, cheaper models and reserving heavy compute for tasks that truly need it. The economics get better, and the quality of the harder work improves too, since you’re no longer under-resourcing it.
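A minimal sketch of what that routing can look like. The complexity heuristic and the model-tier names are assumptions for illustration; a real router might be a small classifier model or a rules engine.

```python
def route(query: str) -> str:
    """Pick a model tier from a rough complexity estimate.
    Heuristic and tier names are placeholders."""
    signals = [
        len(query) > 500,              # long, detailed request
        "reconcile" in query.lower(),  # cross-document work
        "compliance" in query.lower(), # multi-step review
        query.count("?") > 1,          # compound question
    ]
    if sum(signals) >= 2:
        return "large-reasoning-model"  # expensive, multi-pass
    return "small-fast-model"           # cheap, single-pass
```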
These systems are more layered than people realize
When you look inside a production AI system today, it usually isn’t just one model answering questions. There is a retrieval step, a ranking step, often a verification step, and a summarization step, several stages working in sequence to produce the final output. What matters is not only the capability of the underlying model but how all of those pieces fit together.
A miscalibrated retrieval ranker produces failures that look just like model errors. A context window that grows without restraint quietly degrades reasoning quality while nothing obviously fails. These are systems issues, not model issues, and they need to be addressed with systems thinking.
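Here is a schematic of that layering, with hypothetical helper functions standing in for each stage; none of these names come from a specific framework.

```python
def answer(query: str) -> str:
    """A typical multi-stage inference pipeline, sketched with
    hypothetical helpers. Each stage can fail in ways that look,
    from the outside, like 'the model is bad'."""
    docs = retrieve(query)            # vector / keyword search
    ranked = rerank(query, docs)      # a miscalibrated ranker fails silently here
    context = build_context(ranked)   # dedupe + token budget (see sketch above)
    draft = generate(query, context)  # the model everyone blames
    if not verify(draft, context):    # optional second-pass check
        draft = generate(query, context, retry=True)
    return summarize(draft)
```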
An example of this type of thinking in practice is speculative decoding: a smaller draft model proposes the next several tokens, and a larger model verifies them in a single pass, keeping the ones it agrees with. It started as a latency optimization, but it’s really an example of distributing reasoning across multiple components rather than expecting one model to do everything. Two teams using the same base model but different inference architectures can end up with quite different results in production.
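A simplified greedy version of the idea, under assumed interfaces: `draft.propose(seq, k)` returns k guessed tokens, and `target.greedy_at(seq, proposal)` runs one forward pass over the sequence plus the proposal and returns the target model’s greedy token at each of the k+1 positions. Neither method belongs to a real library.

```python
def speculative_decode(draft, target, prompt, k=4, max_new=64):
    """Greedy speculative decoding, simplified for illustration.
    Accepts the longest prefix of the draft's guesses that the
    target model agrees with, then takes one token from the target."""
    seq = list(prompt)
    produced = 0
    while produced < max_new:
        proposal = draft.propose(seq, k)          # cheap small-model guesses
        checks = target.greedy_at(seq, proposal)  # one large-model pass, k+1 tokens
        n = 0
        while n < k and proposal[n] == checks[n]:
            n += 1                                # longest agreeing prefix
        seq.extend(proposal[:n])
        seq.append(checks[n])                     # target's correction (or bonus) token
        produced += n + 1
    return seq
```

The payoff is that the expensive model runs once per batch of k guesses instead of once per token, while the greedy output matches what it would have produced on its own.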
Production AI Inference Pipeline (Image By Author)
Memory is becoming a real issue
Larger context windows have been useful, but past a certain point more context doesn’t improve reasoning; it degrades it. Retrieval gets noisier, the model loses track of what matters, and inference costs climb. The teams running AI at scale are spending real time on things like paged attention and context compression, which aren’t exciting to talk about but matter a lot operationally.
The goal is the right context, in the right amount, managed deliberately.
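One common pattern is to keep recent material verbatim and fold older material into a running summary. A minimal sketch, where `summarize` is an assumed callable (for example, a call to a small model) that maps text to shorter text:

```python
def compress_history(turns, summarize, keep_recent=6, budget=2000):
    """Keep recent conversation turns verbatim; fold older ones
    into a summary. `summarize` and the word-based budget are
    illustrative stand-ins for a real token counter and model call."""
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = summarize("\n".join(older)) if older else ""
    # shrink from the oldest verbatim turn until the budget fits
    while recent and len((summary + " ".join(recent)).split()) > budget:
        recent = recent[1:]
    parts = [f"Summary of earlier conversation: {summary}"] if summary else []
    return "\n".join(parts + recent)
```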
Takeaway
Model selection matters less than it used to. Capable foundation models are now available from several providers, and the capability gaps have narrowed for most use cases. What actually determines whether a deployment succeeds is the infrastructure around the model: how retrieval is tuned, how compute is allocated, and how the system handles edge cases over time.
The teams that will be in a good position in a few years are the ones treating inference architecture as something worth engineering carefully, rather than assuming a good-enough model will sort everything else out. In my experience, it usually doesn’t.
