yourself how real machine learning products actually run in major tech companies or departments? If yes, this article is for you!
Before discussing scalability, please don't hesitate to read my first article on the basics of machine learning in production.
In that article, I mentioned that I've spent 10 years working as an AI engineer in the industry. Early in my career, I learned that a model in a notebook is just a mathematical hypothesis. It only becomes useful when its output reaches a user, powers a product, or generates money.
I've already shown you what "Machine Learning in Production" looks like for a single project. But today, the conversation is about scale: managing tens, or even hundreds, of ML projects simultaneously. In recent years, we have moved from the Sandbox Era into the Infrastructure Era. "Deploying a model" is now table stakes; the real challenge is ensuring a massive portfolio of models works reliably and safely.
1. Leaving the Sandbox: The Strategy of Availability
To understand ML at scale, you first need to leave the "Sandbox" mindset behind. In a sandbox, you have static data and one model. If it drifts, you see it, you stop it, you fix it.
But once you transition to Scale Mode, you're no longer managing a model; you're managing a portfolio. This is where the CAP Theorem (Consistency, Availability, and Partition Tolerance) becomes your reality. Strictly speaking, CAP is about distributed data stores: during a network partition, a system can guarantee consistency or availability, but not both. The same tradeoff mindset applies to an ML portfolio. In a single-model setup, you can try to balance the tradeoffs, but at scale, you cannot be perfect on all three properties at once. You must choose your battles, and more often than not, Availability becomes the top priority.
Why? Because when you have 100 models running, something is always breaking. If you stopped the service every time a model drifted, your product would be offline 50% of the time.
Since we cannot stop the service, we design models to fail "cleanly." Take a recommendation system: if its model receives corrupted data, it shouldn't crash or show a 404 error. It should fall back to a safe default (like showing the "Top 10 Most Popular" items). The user stays happy and the system stays available, even though the result is suboptimal. But to do this, you need to know when to trigger that fallback, and that leads us to our biggest challenge at scale: monitoring.
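The "fail cleanly" pattern above can be sketched in a few lines. This is a minimal illustration, not a production design: the helper names (`validate`, `recommend`) and the fallback list are assumptions, and a real system would enforce the latency budget before the call rather than after it.

```python
# Sketch of a "fail cleanly" fallback for a recommender.
# All names and thresholds here are illustrative assumptions.
import time

TOP_POPULAR = ["item_1", "item_2", "item_3"]  # precomputed safe default

def validate(features):
    # Minimal sanity check: reject corrupted or missing inputs.
    return isinstance(features, dict) and "user_id" in features

def recommend(features, model, timeout_s=0.05):
    """Return personalized results, or the safe default on any failure."""
    if not validate(features):
        return TOP_POPULAR, "fallback:bad_input"
    try:
        start = time.perf_counter()
        items = model(features)
        # If the call took too long, serve the default instead
        # (a real system would cut the call off up front).
        if time.perf_counter() - start > timeout_s:
            return TOP_POPULAR, "fallback:timeout"
        return items, "primary"
    except Exception:
        # Never surface a 500/404 to the user; degrade gracefully.
        return TOP_POPULAR, "fallback:error"
```

Note that the function also returns *which* path served the request; logging that tag is what tells you how often the fallback fires, which feeds directly into the monitoring question below.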
2. The Monitoring Challenge: Why Traditional Metrics Die at Scale
When I say that at scale our system must fail "cleanly," you might think it's easy: just monitor accuracy. But at scale, accuracy is not enough, and I will tell you exactly why:
- The Lack of Human Consensus: In Computer Vision, for example, monitoring is relatively easy because humans agree on the truth (it's a dog or it isn't). But in a recommendation system or an ad-ranking model, there is no gold standard. If a user doesn't click, is the model bad? Or is the user just not in the mood?
- The Feature Engineering Trap: Because we can't easily measure "truth" through a simple metric, we over-compensate. We add hundreds of features to the model, hoping that "more data" will solve the uncertainty.
- The Theoretical Ceiling: We fight for 0.1% accuracy gains without knowing whether the data is simply too noisy to give more. We are chasing a "ceiling" we can't see.
So let's connect the dots to understand where we are going and why this matters: because monitoring "truth" is nearly impossible at scale (these are the dead zones, where no reliable ground truth exists), we can't rely on simple alerts to tell us to stop. This is exactly why we prioritize Availability and safe fallbacks: we assume the model might be failing without the metrics telling us, so we build a system that can survive that "fuzzy" failure.
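One practical workaround in those dead zones is to stop waiting for labels and monitor the *shape* of the model's output distribution instead: if the scores the model emits today look very different from the scores it emitted at deployment time, something changed, even if you can't yet say whether accuracy dropped. Here is a minimal sketch using the Population Stability Index (PSI); the distributions and the 0.1 / 0.25 thresholds are illustrative, though those cutoffs are commonly cited in practice.

```python
# Sketch: monitor the prediction-score distribution instead of accuracy,
# using the Population Stability Index (PSI). Data and thresholds are
# illustrative assumptions.
import numpy as np

def psi(reference, live, bins=10):
    """PSI between a reference score distribution and live scores."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    live_frac = np.histogram(live, bins=edges)[0] / len(live)
    # Floor the fractions to avoid log(0) on empty bins.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    live_frac = np.clip(live_frac, 1e-6, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

rng = np.random.default_rng(0)
reference = rng.beta(2, 5, 10_000)   # scores captured at deployment time
stable = rng.beta(2, 5, 10_000)      # same behavior today: low PSI
shifted = rng.beta(5, 2, 10_000)     # drifted behavior: high PSI

assert psi(reference, stable) < 0.1    # commonly read as "no change"
assert psi(reference, shifted) > 0.25  # commonly read as "investigate now"
```

A PSI alert doesn't tell you the model is wrong; it tells you the world no longer looks like training, which is exactly the signal you want for deciding when to trip the fallback.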
3. The Engineering Wall
Now that we have covered the strategy and the monitoring challenges, we are still not ready to scale: we have not yet addressed the infrastructure. Scaling requires engineering skills just as much as data science skills.
We cannot talk about scaling without a solid, secure infrastructure. Because the models are complex, and because Availability is our number one priority, we need to think seriously about the architecture we set up.
At this stage, my honest advice is to surround yourself with a team or people who are used to building large infrastructures. You don't necessarily need a massive cluster or a supercomputer, but you do need to think about these three execution basics:
- Cloud vs. Device: A server gives you power and is easy to monitor, but it's expensive; running on-device cuts serving cost and latency, but makes monitoring and updates harder. Your choice depends entirely on Cost vs. Control.
- The Hardware: You simply can't put every model on a GPU; you'd go bankrupt. You need a tiered strategy: run your simple "fallback" models on cheap CPUs, and reserve the expensive GPUs for the heavy "money-maker" models.
- Optimization: At scale, a 1-second lag in your fallback mechanism is a failure. You aren't just writing Python anymore; you must learn to compile and optimize your code for specific chips so the "fail cleanly" switch happens in milliseconds.
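The tiered-hardware idea above can be made concrete with a small serving registry: each model declares the tier it runs on and the latency it must hit, and capacity is planned per tier. Everything here (model names, tiers, budgets) is a made-up sketch, not a real serving stack.

```python
# Sketch of a tiered serving registry: cheap CPU tier for fallback and
# simple models, GPU tier reserved for the heavy revenue models.
# Model names, tiers, and latency budgets are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Deployment:
    model: str
    tier: str               # "cpu" or "gpu"
    latency_budget_ms: float

REGISTRY = [
    Deployment("popularity_fallback", "cpu", 5.0),   # must stay ultra-fast
    Deployment("ads_ranker_v7", "gpu", 30.0),        # the "money-maker"
    Deployment("churn_scorer", "cpu", 50.0),
]

def placements(registry):
    """Group models by hardware tier so capacity can be planned per tier."""
    plan = {}
    for d in registry:
        plan.setdefault(d.tier, []).append(d.model)
    return plan
```

Keeping the fallback model on the cheapest, most boring tier is deliberate: the thing you switch to during a failure should have the fewest ways to fail itself.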
4. Be careful of Label Leakage
So, you've anticipated the failures, worked on availability, sorted the monitoring, and built the infrastructure. You probably think you're finally ready to master scalability. Actually, not yet. There is an issue you simply can't anticipate if you have never worked in a real environment.
Even if your engineering is perfect, label leakage can ruin your strategy and the systems running your models.
In a single project, you might spot leakage in a notebook. But at scale, where data comes from 50 different pipelines, leakage becomes almost invisible.
The Churn Example: Imagine you're predicting which users will cancel their subscription. Your training data has a feature called Last_Login_Date. The model looks perfect, with a 99% F1 score.
But here's what actually happened: the database team set up a trigger that clears the login-date field the moment a user hits the "Cancel" button. Your model sees a Null login date and realizes, "Aha! They canceled!"
In the real world, at the exact millisecond the model needs to make a prediction, before the user cancels, that field isn't Null yet. The model is looking at the answer from the future.
This is a basic example just so you can understand the concept. But believe me, if you have a complex system with real-time predictions (which happens often with IoT), this is incredibly hard to detect. You can only avoid it if you are aware of the problem from the start.
My tips:
- Feature Latency Monitoring: Don't just monitor the value of the data; monitor when it was written vs. when the event actually happened.
- The Millisecond Test: Always ask: "At the exact moment of prediction, does this specific database row actually contain this value yet?"
Of course, these are simple questions, but the best time to evaluate this is during the design phase, before you ever write a line of production code.
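The millisecond test can even be automated once you log write timestamps per feature: any feature whose value was written at or after the moment the prediction would have been made could not have existed at prediction time, and is a leakage suspect. A minimal sketch, with hypothetical feature names and timestamps echoing the churn example above:

```python
# Sketch of an automated "millisecond test": flag features whose value
# was written at or after the moment of prediction. Feature names and
# timestamps are illustrative assumptions.
from datetime import datetime

def leakage_suspects(feature_write_times, prediction_time):
    """Return features that could not have existed at prediction time."""
    return sorted(
        name for name, written_at in feature_write_times.items()
        if written_at >= prediction_time
    )

writes = {
    # Written by the cancel trigger, one second AFTER "Cancel" was hit:
    "last_login_date": datetime(2024, 5, 1, 12, 0, 1),
    "plan_type": datetime(2024, 4, 1),
    "support_tickets_30d": datetime(2024, 4, 30),
}
prediction_time = datetime(2024, 5, 1, 12, 0, 0)

assert leakage_suspects(writes, prediction_time) == ["last_login_date"]
```

The hard part in practice is not this comparison; it is getting reliable write timestamps out of 50 different pipelines in the first place, which is why this check belongs in the design phase.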
5. Finally, The Human Loop
The final piece of the puzzle is Accountability. At scale, our metrics are fuzzy, our infrastructure is complex, and our data is leaky, so we need a "Safety Net."
- Shadow Deployment: This is mandatory at scale. You deploy "Model B" but don't show its results to users. You let it run "in the shadows" for a week, comparing its predictions to the truth that eventually arrives. Only if it's stable do you promote it to live.
- Human-in-the-Loop: For high-stakes models, you need a small team to audit the safe defaults. If your system has fallen back to "Most Popular Items" for three days, a human needs to ask why the main model hasn't recovered.
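The shadow pattern boils down to a simple rule: the champion answers the user, the challenger runs silently, and both predictions are logged for offline comparison. A minimal sketch, where the model functions and the in-memory log are stand-ins for real services:

```python
# Sketch of shadow deployment: the champion serves the user, the shadow
# model runs silently; only the champion's output is returned, and both
# are logged for later comparison. All names here are illustrative.
shadow_log = []

def serve(features, champion, shadow):
    """Return the champion's prediction; record both for offline audit."""
    live = champion(features)
    try:
        silent = shadow(features)  # a shadow failure must never hurt users
    except Exception:
        silent = None
    shadow_log.append({"features": features, "live": live, "shadow": silent})
    return live

def agreement(log):
    """Fraction of audited requests where the shadow matched the champion."""
    scored = [r for r in log if r["shadow"] is not None]
    if not scored:
        return 0.0
    return sum(r["live"] == r["shadow"] for r in scored) / len(scored)
```

Agreement with the champion is only a first gate; the decisive check is comparing the shadow's logged predictions against the ground truth that eventually arrives, exactly as described above.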
And a quick recap before you start working with ML at scale:
- Since we can't be perfect, we choose to stay online (Availability) and fail safely.
- Availability is our number one metric, since monitoring at scale is "fuzzy" and traditional metrics are unreliable.
- We build the infrastructure (Cloud/Hardware) to make these safe failures fast.
- We watch out for "cheating" data (leakage) that makes our fuzzy metrics look too good to be true.
- We use Shadow Deploys to prove the model is safe before it ever touches a customer.
And remember, your scale is only as good as your safety net. Don't let your work be among the 87% of ML projects that fail.
LinkedIn: Sabrine Bendimerad
Medium: https://medium.com/@sabrine.bendimerad1
Instagram: https://tinyurl.com/datailearn

