yourself how real machine learning products actually run in major tech companies or departments? If yes, this article is for you.
Before discussing scalability, feel free to read my first article on the basics of machine learning in production.
In that article, I mentioned that I've spent 10 years working as an AI engineer in industry. Early in my career, I learned that a model in a notebook is just a mathematical hypothesis. It only becomes useful when its output reaches a user, feeds a product, or generates money.
I've already shown you what "Machine Learning in Production" looks like for a single project. But today, the conversation is about Scale: managing tens, or even hundreds, of ML projects simultaneously. In recent years, we have moved from the Sandbox Era into the Infrastructure Era. "Deploying a model" is now a baseline skill; the real challenge is ensuring a massive portfolio of models works reliably and safely.
1. Leaving the Sandbox: The Strategy of Availability
To understand ML at scale, you first need to leave the "Sandbox" mindset behind. In a sandbox, you have static data and one model. If it drifts, you see it, you stop it, you fix it.
But once you transition to Scale Mode, you're no longer managing a model, you're managing a portfolio. This is where the CAP Theorem from distributed systems (Consistency, Availability, and Partition Tolerance) becomes your reality. In a single-model setup, you can try to balance the tradeoffs, but at scale, it's impossible to be perfect across all three properties. You must choose your battles, and more often than not, Availability becomes the top priority.
Why? Because when you have 100 models running, something is always breaking. If you stopped the service every time a model drifted, your product would be offline 50% of the time.
Since we cannot stop the service, we design models to fail "cleanly." Take a recommendation system as an example: if its model receives corrupted data, it shouldn't crash or show a "404 error." It should fall back to a safe default (like showing the "Top 10 Most Popular" items). The user stays happy and the system stays available, even though the result is suboptimal. But to do this, you need to know when to trigger that fallback. And that leads us to our biggest challenge at scale: monitoring.
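To make the "fail cleanly" idea concrete, here is a minimal sketch of a fallback wrapper. Everything in it is an assumption for illustration: the names `recommend_with_fallback`, `is_valid`, and `TOP_POPULAR`, the validation rules, and the idea that the model is a plain Python callable.

```python
from __future__ import annotations

from typing import Callable, Sequence

# Hypothetical precomputed default: the "Top 10 Most Popular" items.
TOP_POPULAR = [f"item_{i}" for i in range(1, 11)]


def is_valid(features: dict) -> bool:
    """Cheap sanity check on the input; these rules are illustrative only."""
    return isinstance(features.get("user_id"), int) and features.get("history") is not None


def recommend_with_fallback(
    features: dict,
    model: Callable[[dict], Sequence[str]],
    default: Sequence[str] = TOP_POPULAR,
) -> tuple[list[str], str]:
    """Return (items, source), where source is "model" or "fallback"."""
    if not is_valid(features):
        # Corrupted input never reaches the model.
        return list(default), "fallback"
    try:
        items = list(model(features))
    except Exception:
        # Any model failure degrades to the safe default instead of an error page.
        return list(default), "fallback"
    if not items:
        return list(default), "fallback"
    return items, "model"
```

Returning the source alongside the items is a deliberate choice in this sketch: it lets you count how often the fallback fires, which is the signal you will want to watch later.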
2. The Monitoring Challenge and Why Traditional Metrics Die at Scale
When I say our systems must fail "cleanly" at scale, you might think it's easy: just monitor accuracy. But at scale, "Accuracy" is not enough, and here is exactly why:
- The Lack of Human Consensus: In Computer Vision, for example, monitoring is easier because humans agree on the truth (it's a dog or it's not). But in a Recommendation System or an Ad-ranking model, there is no "Gold Standard." If a user doesn't click, is the model bad? Or is the user just not in the mood?
- The Feature Engineering Trap: Because we can't easily measure "truth" through a simple metric, we over-compensate. We add hundreds of features to the model, hoping that "more data" will solve the uncertainty.
- The Theoretical Ceiling: We fight for 0.1% accuracy gains without knowing if the data is just too noisy to give more. We are chasing a "ceiling" we can't see.
So let's link all of that to understand where we are going and why this is important: because monitoring "truth" is nearly impossible at scale (Dead Zones), we can't rely on simple alerts to tell us to stop. This is exactly why we prioritize Availability and Safe Fallbacks: we assume the model might be failing without the metrics telling us, so we build a system that can survive that "fuzzy" failure.
3. What About the Engineering Wall?
Now that we have discussed the strategy and monitoring challenges, we are not yet ready to scale, as we have not yet addressed the infrastructure aspect. Scaling requires engineering skills just as much as data science skills.
We cannot talk about scaling if we don't have a solid, secure infrastructure. Because the models are complex, and because Availability is our number one priority, we need to think seriously about the architecture we set up.
At this stage, my honest advice is to surround yourself with people who are used to building large infrastructure. You don't necessarily need a massive cluster or a supercomputer, but you do need to think about these three execution basics:
- Cloud vs. Device: A server gives you power and is easy to monitor, but it's expensive. Your choice depends entirely on Cost vs. Control.
- The Hardware: You simply can't put every model on a GPU; you'd go bankrupt. You need a Tiered Strategy: run your simple "fallback" models on cheap CPUs, and reserve the expensive GPUs for the heavy "money-maker" models.
- Optimization: At scale, a 1-second lag in your fallback mechanism is a failure. You aren't just writing Python anymore; you must learn to compile and optimize your code for specific chips so the "Fail Cleanly" switch happens in milliseconds.
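One way to picture the tiered setup is a latency budget: the heavy model gets a fixed number of milliseconds, and if it misses the deadline the cheap fallback answers instead. The sketch below is only an illustration under assumptions: `predict_within_budget`, the 50 ms default, and the thread pool size are all hypothetical, and a thread-based timeout only bounds how long the caller waits. It does not make the heavy model itself any faster, which is where compilation and chip-specific optimization come in.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared pool for heavy-model calls; the size is an arbitrary illustrative choice.
_POOL = ThreadPoolExecutor(max_workers=4)


def predict_within_budget(model, features, fallback, budget_s=0.05):
    """Return (prediction, source); use the cheap fallback if the model is too slow."""
    future = _POOL.submit(model, features)
    try:
        return future.result(timeout=budget_s), "model"
    except FutureTimeout:
        # cancel() only stops a call that has not started yet;
        # a call that is already running finishes in the background.
        future.cancel()
        return fallback(features), "fallback"
    except Exception:
        # The heavy model raised: degrade instead of failing the request.
        return fallback(features), "fallback"
```

In a real deployment the heavy model would typically sit behind a network call with its own timeout; the in-process pool here just keeps the example self-contained.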
4. Be Careful of Label Leakage
So, you've anticipated the failures, worked on availability, sorted the monitoring, and built the infrastructure. You probably think you're finally ready to master scalability. Actually, not yet. There is an issue you simply can't anticipate if you have never worked in a real environment.
Even if your engineering is perfect, Label Leakage can ruin your strategy and your multi-model systems.
In a single project, you might spot leakage in a notebook. But at scale, where data comes from 50 different pipelines, leakage becomes almost invisible.
The Churn Example: Imagine you're predicting which users will cancel their subscription. Your training data has a feature called Last_Login_Date. The model looks perfect, with a 99% F1 score.
But here's what actually happened: The database team set up a trigger that "clears" the login date field the moment a user hits the "Cancel" button. Your model sees a "Null" login date and realizes, "Aha! They canceled!"
In the real world, at the exact millisecond the model needs to make a prediction, before the user cancels, that field isn't Null yet. During training, the model was looking at the answer from the future.
This is a basic example just so you can understand the concept. But believe me, if you have a complex system with real-time predictions (which happens often with IoT), this is incredibly hard to detect. You can only avoid it if you are aware of the problem from the start.
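The churn story can be reproduced on a toy dataset. The rows, dates, and helper names below are invented for illustration; the point is only that a rule built on the cleared field scores perfectly on the training snapshot and collapses on the data a live system would actually see.

```python
from datetime import datetime

# Hypothetical training snapshot, taken AFTER the cancel trigger has
# already cleared the login date of churned users.
training_rows = [
    {"user": "a", "last_login_date": None, "churned": True},
    {"user": "b", "last_login_date": datetime(2024, 5, 1), "churned": False},
    {"user": "c", "last_login_date": None, "churned": True},
    {"user": "d", "last_login_date": datetime(2024, 5, 3), "churned": False},
]

# The same users as a live system sees them, just BEFORE anyone cancels.
serving_rows = [
    {"user": "a", "last_login_date": datetime(2024, 4, 28), "churned": True},
    {"user": "b", "last_login_date": datetime(2024, 5, 1), "churned": False},
    {"user": "c", "last_login_date": datetime(2024, 4, 30), "churned": True},
    {"user": "d", "last_login_date": datetime(2024, 5, 3), "churned": False},
]


def leaky_rule(row: dict) -> bool:
    """What the model effectively learned: a null login date means churn."""
    return row["last_login_date"] is None


def accuracy(rows: list, rule) -> float:
    """Share of rows where the rule agrees with the churn label."""
    return sum(rule(r) == r["churned"] for r in rows) / len(rows)


def null_rate(rows: list, column: str) -> float:
    """Share of rows where the column is null."""
    return sum(r[column] is None for r in rows) / len(rows)
```

On these toy rows the rule is right every time on `training_rows` and catches none of the churners in `serving_rows`. A large gap in `null_rate` for the same column between training and serving data is one cheap warning sign worth checking, though it will not catch every form of leakage.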
My tips:
- Feature Latency Monitoring: Don't just monitor the value of the data; monitor when it was written vs. when the event actually happened.
- The Millisecond Test: Always ask: "At the exact moment of prediction, does this specific database row actually contain this value yet?"
Of course, these are simple questions, but the best time to ask them is during the design phase, before you ever write a line of production code.
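Both tips can be turned into a small check, assuming each feature value carries two timestamps. The `FeatureValue` record, its field names, and the `audit` helper are hypothetical; a real feature store will expose this information differently, if at all.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class FeatureValue:
    name: str
    value: object
    event_time: datetime  # when the underlying event actually happened
    written_at: datetime  # when the value landed in the store


def available_at(fv: FeatureValue, prediction_time: datetime) -> bool:
    """The Millisecond Test: was this value already written at prediction time?"""
    return fv.written_at <= prediction_time


def write_lag(fv: FeatureValue) -> timedelta:
    """Feature latency: delay between the event and the write."""
    return fv.written_at - fv.event_time


def audit(features: list, prediction_time: datetime, max_lag: timedelta) -> list:
    """Return (feature name, problem) pairs for values that look suspicious."""
    problems = []
    for fv in features:
        if not available_at(fv, prediction_time):
            problems.append((fv.name, "not yet written at prediction time"))
        elif write_lag(fv) > max_lag:
            problems.append((fv.name, "write lag above threshold"))
    return problems
```

Running `audit` over the rows used to build a training set, with `prediction_time` set to the moment each prediction would have been made, is one way to flag features that only exist in hindsight.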
5. Finally, The Human Loop
The final piece of the puzzle is Accountability. At scale, our metrics are fuzzy, our infrastructure is complex, and our data is leaky, so we need a "Safety Net."
- Shadow Deployment: This is mandatory for scale. You deploy "Model B" but don't show its results to users. You let it run "in the shadows" for a week, comparing its predictions to the "Truth" that eventually arrives. Only if it's stable do you promote it to "Live."
- Human-in-the-Loop: For high-stakes models, you need a small team to audit the "Safe Defaults." If your system has fallen back to "Most Popular Items" for three days, a human needs to ask why the main model hasn't recovered.
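A shadow deployment can be sketched in a few lines. The `ShadowRunner` class, its in-memory log, and the accuracy comparison are illustrative assumptions; in practice the log would go to durable storage and the promotion criteria would be richer than a single accuracy number.

```python
class ShadowRunner:
    """Serve the live model's answer while silently recording the shadow model's."""

    def __init__(self, live, shadow):
        self.live = live
        self.shadow = shadow
        self.log = {}  # request_id -> {"live": ..., "shadow": ...}

    def predict(self, request_id, features):
        live_pred = self.live(features)
        try:
            shadow_pred = self.shadow(features)
        except Exception:
            # A shadow failure must never affect what the user sees.
            shadow_pred = None
        self.log[request_id] = {"live": live_pred, "shadow": shadow_pred}
        return live_pred  # users only ever get the live model's output

    def report(self, truth):
        """Compare both models on the requests whose true label has arrived."""
        scored = [(rec, truth[rid]) for rid, rec in self.log.items() if rid in truth]
        n = len(scored)
        if n == 0:
            return {"n": 0, "live_accuracy": None, "shadow_accuracy": None}
        live_hits = sum(rec["live"] == y for rec, y in scored)
        shadow_hits = sum(rec["shadow"] == y for rec, y in scored)
        return {
            "n": n,
            "live_accuracy": live_hits / n,
            "shadow_accuracy": shadow_hits / n,
        }
```

A failed shadow call is logged as `None`, so it counts as a miss in the report. That is a design choice in this sketch: instability in the shadow model should lower its score rather than disappear from the comparison.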
And a quick recap before you start working with ML at scale:
- Since we can't be perfect, we choose to stay online (Availability) and fail safely.
- Availability is our number one metric, since monitoring at scale is "fuzzy" and traditional metrics are unreliable.
- We build the infrastructure (Cloud/Hardware) to make these safe failures fast.
- We watch out for "cheating" data (Leakage) that makes our fuzzy metrics look too good to be true.
- We use Shadow Deploys to prove the model is safe before it ever touches a customer.
And remember, your scale is only as good as your safety net. Don't let your work be among the 87% of projects that fail.
LinkedIn: Sabrine Bendimerad
Medium: https://medium.com/@sabrine.bendimerad1
Instagram: https://tinyurl.com/datailearn
