In my first TDS post, I wrote about how to translate a real-world problem into an integer linear program. In my second, I wrote about how to make that program robust against uncertainty. Both were variations on the same idea: take a fuzzy real-world question, squeeze it into an LP, and let a solver do the rest.
There is a moment in every optimizer’s life, though, when the LP starts to feel a bit too neat. Demand is a number. Travel time is a number. Wind speed is a number. The model accepts the input, returns an optimal solution, and goes on its way. The reality those numbers were supposed to describe (messy, jittery, and occasionally surprising) doesn’t really show up anywhere.
Stochastic programming is the field that takes that discomfort seriously. Instead of pretending the data is exact, it builds the uncertainty directly into the model. The price you pay is a bit more notation; the payoff is decisions that hold up when the world doesn’t cooperate.
This post is a gentle tour of the basics. We’ll see why the obvious approach doesn’t work, walk through the four standard ways to handle uncertainty in a linear program, and finish with a quick sanity check on whether any of this is worth the effort. There’s some math, but it’s the same math you already know from LP, with one extra symbol attached.
Starting point: a fashion company with a bad crystal ball
To make this concrete, we’ll use the running example from Dr. Ruben van Beesten’s lectures (more on that in the credits below). It goes like this.
You run a fashion company that sells winter clothing in Germany. Production happens in Bangladesh, which is cheap but slow: the goods take a few weeks to arrive. So in the fall, you have to decide how much to produce for the upcoming winter season.
Two ways this can go wrong: produce too little, and you lose sales; produce too much, and you’re stuck with stock you can’t sell. The whole question is how much to produce now, and the answer depends on something you don’t actually know yet: winter demand.
If you ignored the uncertainty for a moment and pretended demand was a fixed number, you could write down a vanilla LP:

min cᵀx  subject to  Tx ≥ h,  x ≥ 0
Here x is how much you produce, c is the unit production cost, h is demand, and T is just the identity matrix (one unit produced satisfies one unit of demand). The constraint says: produce at least as much as is demanded.
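As a quick sanity check, here is what that vanilla LP looks like in code. This is a minimal sketch using `scipy.optimize.linprog` (with the HiGHS backend); the costs and demands are invented for illustration, with two products so T is a 2×2 identity.

```python
from scipy.optimize import linprog

c = [2.0, 3.0]  # unit production cost per product
h = [5.0, 4.0]  # (pretend-)known demand per product

# T is the identity, so Tx >= h is just x >= h componentwise.
# linprog expects A_ub @ x <= b_ub, so negate both sides.
res = linprog(
    c=c,
    A_ub=[[-1.0, 0.0], [0.0, -1.0]],
    b_ub=[-5.0, -4.0],
    bounds=[(0, None), (0, None)],
    method="highs",
)

print(res.x)    # optimal production plan: exactly meet demand
print(res.fun)  # optimal cost
```

Unsurprisingly, the solver produces exactly h: with known demand and no recourse, there is nothing to hedge against.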
This is fine if h is actually known. The trouble is that demand isn’t a number, it’s a random variable. Let’s call it ξ. The honest version of the model would look like this:

min cᵀx  subject to  T(ξ)x ≥ h(ξ),  x ≥ 0

(In our example, T is still the identity and h(ξ) is simply the random demand ξ.)
And here we hit a wall. What does it mean for x to satisfy a constraint that depends on a random variable? Is x = 100 feasible if demand might be 80, might be 120, and might be anywhere in between? The problem isn’t hard to solve: it’s ill-defined. The solver doesn’t even know which problem you’re asking it to solve.
Stochastic programming is, in essence, a collection of principled answers to that question. We’ll look at the four most common ones.
Four ways to handle the uncertainty
Each of the four approaches takes the ill-defined LP above and turns it into a well-defined optimization problem. They differ in what they assume you know about the uncertainty, and in how cautious they are about bad outcomes.
1. Robust optimization: prepare for the worst
The most cautious approach. You don’t need to know the full probability distribution of ξ, but only its support, i.e., the set of values it could possibly take. We call this set the uncertainty set, written U. Then you ask: what is the best decision that stays feasible no matter which ξ ∈ U actually shows up?

min cᵀx  subject to  T(ξ)x ≥ h(ξ) for all ξ ∈ U,  x ≥ 0
The constraint now has to hold for every ξ in the uncertainty set. In our fashion example with U = [0, 10], you’d be planning for demand of 10, the worst case, every time.
That’s the strength and the weakness of robust optimization in one sentence. The solution is bulletproof, but it’s also conservative: you’ll often be sitting on inventory you didn’t need, because you planned as if the unlikely worst case were guaranteed. If you’ve read my earlier post on robustifying linear programs, this is exactly the framework that sits behind those four steps.
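For a finite uncertainty set, the robust production plan is easy to compute by hand: each component of x has to cover the componentwise worst case over the whole set. A minimal sketch, with two products and three invented demand scenarios:

```python
# Uncertainty set: three possible demand vectors for two products.
uncertainty_set = [
    [4, 7],
    [9, 3],
    [6, 6],
]

# Robust feasibility (x >= h(xi) for all xi in U) forces each component
# of x up to the worst case of that component over the whole set.
x_robust = [max(column) for column in zip(*uncertainty_set)]

print(x_robust)  # [9, 7] -- covers every scenario, overshoots most of them
```

Note that no single scenario actually demands (9, 7): the robust plan hedges against a combination of worst cases that may never jointly occur, which is exactly where the conservatism comes from.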
2. Chance constraints: relax the worst case
Robust optimization plans for any possible outcome. Chance constraints relax that to: plan for most of them. You pick a probability level α, say 95%, and require the constraint to hold with at least that probability:

P( T(ξ)x ≥ h(ξ) ) ≥ α

This is called a joint chance constraint: all the entries of the constraint vector have to be satisfied simultaneously, with joint probability ≥ α. A weaker variant treats each row separately:

P( Tᵢ(ξ)x ≥ hᵢ(ξ) ) ≥ αᵢ  for each row i
These are individual chance constraints: each constraint i must hold with probability at least αᵢ, but you don’t care about the joint event. Quick exercise: if you set every αᵢ equal to the joint α, which formulation is more conservative?
Answer: the joint version. Satisfying all constraints simultaneously is a stricter requirement than satisfying each one in isolation, so the joint formulation has a smaller feasible region and a worse (higher) optimal cost. Either way, chance constraints give you a knob, α, to dial how cautious you want to be. Crank it to 1, and you’re back to (almost) robust. Drop it to 0.5, and you’re basically flipping a coin on feasibility. Most real applications live somewhere in the 0.9–0.99 range.
There’s a catch worth flagging: chance constraints are hard in general. The probability term inside the constraint is a non-linear, often non-convex function of x, so you usually can’t hand the formulation directly to a standard LP solver. There are tractable special cases (Gaussian noise, certain mixtures of distributions, sample-based approximations), but the general problem is harder than it looks at first glance.
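To make the sample-based idea concrete: in our one-dimensional example, requiring P(x ≥ ξ) ≥ α over a sample of demand draws just means picking the empirical α-quantile. A small sketch; the uniform demand distribution is an assumption for illustration only:

```python
import math
import random

random.seed(0)
alpha = 0.95

# Stand-in for the unknown demand distribution: uniform on [0, 10].
samples = sorted(random.uniform(0, 10) for _ in range(1000))

# Smallest x that covers at least a fraction alpha of the sampled demands:
# the empirical alpha-quantile.
k = math.ceil(alpha * len(samples)) - 1
x_chance = samples[k]

coverage = sum(d <= x_chance for d in samples) / len(samples)
print(x_chance, coverage)  # x is roughly 9.5; coverage >= 0.95 by construction
```

Dialing α up toward 1 pushes x toward the top of the support, i.e., toward the robust solution; dialing it down lets you trade feasibility for cost.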
3. Two-stage recourse models: decide, observe, correct
The first two approaches treat constraint violation as something to avoid, either always (robust) or with high probability (chance). Sometimes that’s the wrong frame. In our fashion example, falling short of demand isn’t catastrophic. It’s annoying. You can usually fix it: produce a small emergency batch in Germany at a higher cost, or ship by air, or just accept the lost sales and move on.
This idea, that violating a constraint isn’t the end of the world because you can take a corrective action later, is the heart of recourse models. In the two-stage version, the timeline looks like this:
- Stage 1 (now): you make a first-stage decision x while ξ is still uncertain.
- Then: ξ is realized, i.e., the random variable becomes a known number.
- Stage 2 (later): you make a second-stage decision y, knowing ξ.
Mathematically, the first stage looks almost like a vanilla LP, except the objective now contains an expected future cost:

minₓ  cᵀx + E[ v(ξ, x) ]  subject to  x ≥ 0
The function v(ξ, x) is the optimal value of the second-stage problem, given that you chose x in the first stage and that ξ turned out to be the realized value:

v(ξ, x) = min_y  q(ξ)ᵀy  subject to  W y ≥ h(ξ) − T(ξ)x,  y ≥ 0
Read this carefully. The right-hand side, h(ξ) − T(ξ) x, is the shortfall, how much your first-stage decision failed to cover, after ξ was revealed. The recourse decision y then closes that gap, at a cost q(ξ)ᵀ y. So the structure is: pay the up-front cost cᵀ x, and on top of it pay the expected cost of cleaning up after the random variable does its thing.
That’s the whole idea. Two-stage recourse models are by far the most common formulation in practice, partly because they capture the actual chronology of decisions in many real problems (production planning, inventory, energy dispatch, scheduling), and partly because they’re relatively well-behaved mathematically.
A couple of pieces of vocabulary you’ll trip over if you read further:
- A model has fixed recourse if the recourse matrix W doesn’t depend on ξ. Many algorithms only work in this case.
- A model has (relatively) complete recourse if there’s always a feasible recourse decision y, no matter what ξ turns out to be and no matter what x you chose. If complete recourse fails, the second-stage problem can be infeasible, which becomes an implicit constraint on the first stage. (This is exactly where Benders’ feasibility cuts come from, but that’s a story for another post.)
4. Multi-stage recourse models: keep going
Sometimes life isn’t two stages. You don’t just decide-observe-correct once and go home; you decide, observe, decide, observe, decide, … over and over. Multi-stage recourse models are the natural extension.
In our fashion example, suppose we’re no longer choosing once in the fall, but three times: in the fall (cheap, in Bangladesh), in early winter (more expensive, in Romania), and in late winter (most expensive, in Germany). Demand is gradually revealed over the season, and at each stage we decide based on what we’ve observed so far.
The notation gets heavier (you end up writing recursive value functions Qₜ with histories ξ[t] = (ξ₁, …, ξₜ) hanging off them), but conceptually nothing new is going on. Each stage is a recourse problem nested inside the previous one. The natural way to picture this is as a scenario tree: each node is a state of the world, each branch is a possible realization of the next random variable, and a scenario is a complete root-to-leaf path.
Example of a three-stage scenario tree, source: course slides by Dr. Ruben van Beesten.
One subtlety. A scenario is the entire trajectory of ξ, not just one realization. Knowing that ξ₂ = 10 doesn’t tell you which scenario you’re in, because ξ₃ hasn’t happened yet. This matters when you start writing the deterministic equivalent (next section), because you have to be careful that your decisions only depend on information that has actually been observed by the time the decision is made. That property is called non-anticipativity: you can’t anticipate the future. The model would happily cheat if you didn’t enforce it explicitly.
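A tiny sketch of what non-anticipativity means in code. Take a three-stage tree where each of ξ₂ and ξ₃ comes out “low” or “high” (the two-branch structure is invented for illustration). A scenario is a full path (ξ₂, ξ₃), but a stage-2 decision may only depend on ξ₂, so scenarios sharing the same ξ₂ must share the same stage-2 decision:

```python
from itertools import product

# Four scenarios, each a complete root-to-leaf path through the tree.
scenarios = list(product(["low", "high"], repeat=2))
print(scenarios)
# [('low', 'low'), ('low', 'high'), ('high', 'low'), ('high', 'high')]

# Non-anticipativity: group scenarios by the history observed at stage 2
# (xi2 alone). Each group must share one stage-2 decision variable.
decision_groups = {}
for xi2, xi3 in scenarios:
    decision_groups.setdefault(xi2, []).append((xi2, xi3))

print(decision_groups)
# {'low': [('low', 'low'), ('low', 'high')],
#  'high': [('high', 'low'), ('high', 'high')]}
```

Writing the deterministic equivalent with one decision variable per group, rather than one per scenario, is precisely how this property gets enforced in practice.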
How do we actually solve a recourse model?
So far we’ve been writing models. To solve them, we typically transform them into something a standard LP solver can chew on. The trick is the deterministic equivalent formulation.
Suppose the random variable ξ has a discrete distribution: it takes finitely many values ξ¹, ξ², …, ξˢ (called scenarios), each with probability pₛ. Then the expected second-stage cost is just a finite sum, and we can write the entire two-stage problem as one big LP by introducing one copy of y per scenario:

min  cᵀx + Σₛ pₛ q(ξˢ)ᵀyˢ
subject to  T(ξˢ)x + W yˢ ≥ h(ξˢ)  for s = 1, …, S
            x ≥ 0,  yˢ ≥ 0  for s = 1, …, S
That’s a regular LP. Big, possibly very big (with S scenarios, you’ve essentially copied the second stage S times), but it’s an LP. You can hand it straight to HiGHS, Gurobi, CPLEX, or whatever solver you like, and it’ll solve it.
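To see this in action, here is the deterministic equivalent of a toy version of the fashion problem, built as one flat LP and handed to HiGHS via `scipy.optimize.linprog`. The costs, demands, and probabilities are invented for illustration: cheap production now (c = 1), expensive emergency recourse later (q = 3).

```python
from scipy.optimize import linprog

c, q = 1.0, 3.0             # first-stage and recourse unit cost
demands = [4.0, 8.0, 10.0]  # scenario demands
probs = [0.3, 0.5, 0.2]     # scenario probabilities

# Variables: [x, y_1, y_2, y_3]. Objective: c*x + sum_s p_s * q * y_s.
objective = [c] + [p * q for p in probs]

# One copy of the demand constraint per scenario: x + y_s >= d_s,
# written as -x - y_s <= -d_s for linprog's A_ub @ z <= b_ub form.
A_ub, b_ub = [], []
for s, d in enumerate(demands):
    row = [-1.0] + [0.0] * len(demands)
    row[1 + s] = -1.0
    A_ub.append(row)
    b_ub.append(-d)

res = linprog(objective, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None)] * 4, method="highs")

print(res.x[0], res.fun)  # optimal x is 8, expected total cost is about 9.2
```

The interesting part is that the optimal x is neither the worst case (10) nor the mean (7.2): the solver balances cheap overproduction now against expensive recourse later.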
Two natural questions follow.
First: what if the distribution of ξ is not discrete? In that case the deterministic equivalent has infinitely many scenarios and isn’t finite-dimensional. The standard fix is sample average approximation: draw a sample of size S from the true distribution, solve the sampled deterministic equivalent, and let S grow until your solution stabilizes statistically. There’s a whole literature on how big S needs to be and what guarantees you get.
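A bare-bones sketch of sample average approximation for a one-product version of our example. For this simple recourse structure the sampled problem has a closed-form solution (the newsvendor quantile at level 1 − c/q), so we can watch the SAA solution stabilize as S grows. The uniform demand distribution is, again, an assumption for illustration only.

```python
import math
import random

random.seed(42)
c, q = 1.0, 3.0  # the sampled problem's optimum sits at the (1 - c/q)-quantile

def saa_solution(num_scenarios):
    """Solve the sampled problem: min_x c*x + (1/S) * sum_s q * max(d_s - x, 0)."""
    sample = sorted(random.uniform(0, 10) for _ in range(num_scenarios))
    k = math.ceil((1 - c / q) * num_scenarios) - 1
    return sample[max(k, 0)]

for S in (10, 100, 10_000):
    print(S, round(saa_solution(S), 3))
# The true optimum is the 2/3-quantile of Uniform(0, 10), i.e. about 6.67;
# the SAA solutions scatter around it and tighten as S grows.
```

In practice you would solve the sampled deterministic equivalent with a solver rather than a quantile formula, but the statistical behavior, noisy for small S and converging as S grows, is the same.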
Second: what if the deterministic equivalent is just too big to solve directly? This is where decomposition methods come in. Benders’ decomposition splits the problem into a master problem in the first-stage variables and a subproblem per scenario, then iteratively passes information between them. For multi-stage models with many stages, the analogous trick is stochastic dual dynamic programming (SDDP), which uses sampling and approximate value functions to avoid building the full scenario tree. Both are advanced enough to deserve their own posts, so I’ll come back to them later.
Is any of this actually worth the trouble?
Honest question. Stochastic programs are messier to formulate, harder to solve, and slower to run than their deterministic cousins. If your real-world problem isn’t very sensitive to uncertainty, you might be better off just plugging the expected demand into a regular LP and calling it a day.
The good news is, you can quantify exactly how much the stochastic formulation buys you. There are two classical metrics, and both are worth knowing.
Define four numbers:

SP  = minₓ { cᵀx + E[ v(ξ, x) ] }
EV  = minₓ { cᵀx + v(E[ξ], x) },  with optimal solution x̄
EEV = cᵀx̄ + E[ v(ξ, x̄) ]
WS  = E[ minₓ { cᵀx + v(ξ, x) } ]
In words: SP is the optimal value of the actual stochastic program. EV is what you get if you replace ξ with its expected value and solve the resulting deterministic problem; call its solution x̄. EEV is the expected cost of implementing that deterministic solution x̄ in the actual stochastic world. And WS (“wait-and-see”) is the expected cost if you got to peek at the realized ξ before deciding x, the cheating-but-best case.
From these four numbers you can build two highly informative quantities:

VSS = EEV − SP
EVPI = SP − WS
VSS is the Value of the Stochastic Solution: how much worse off you’d be if you just solved the deterministic problem with average values and implemented its solution. If VSS is small, the stochastic program isn’t buying you much; the deterministic shortcut is fine.
EVPI is the Expected Value of Perfect Information: how much you’d gain if a benevolent oracle handed you the realized ξ before you had to decide. If EVPI is small, your forecasts already contain most of the information you need; investing in better predictions probably won’t move the needle. If EVPI is large, better data has real value.
Explanation of useful metrics for a stochastic program.
The two metrics ride along on a tidy chain of inequalities (assuming uncertainty only on the right-hand side):

EV ≤ WS ≤ SP ≤ EEV
Read it left to right: cheating-with-the-mean (EV) is at most as bad as cheating-with-the-realization (WS), which is at most as bad as the honest stochastic answer (SP), which is at most as bad as implementing the deterministic solution and living with it (EEV). The chain implies a free upper bound on VSS that you can compute before you ever solve the SP: VSS ≤ EEV − EV. If that gap is tiny, the deterministic shortcut is good enough and you can save yourself the headache.
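Here are all four numbers, and both metrics, computed for a toy instance of the fashion example (first-stage cost c = 1, recourse cost q = 3, three demand scenarios; all numbers invented for illustration). Because the expected cost is piecewise linear in x with kinks at the scenario demands, we can optimize by checking the kinks instead of calling a solver.

```python
c, q = 1.0, 3.0
demands = [4.0, 8.0, 10.0]
probs = [0.3, 0.5, 0.2]

def expected_cost(x):
    """c*x plus the expected recourse cost of covering any shortfall at price q."""
    return c * x + q * sum(p * max(d - x, 0.0) for p, d in zip(probs, demands))

mean_demand = sum(p * d for p, d in zip(probs, demands))  # 7.2

# SP: honest stochastic optimum (check the kinks of the piecewise-linear cost).
SP = min(expected_cost(x) for x in [0.0] + demands)

# EV: solve with the mean; its optimal solution is x_bar = mean_demand.
x_bar = mean_demand
EV = c * x_bar

# EEV: implement x_bar in the actual stochastic world.
EEV = expected_cost(x_bar)

# WS: with perfect foresight you produce exactly the realized demand.
WS = sum(p * c * d for p, d in zip(probs, demands))

VSS = EEV - SP   # about 0.88: the cost of ignoring the uncertainty
EVPI = SP - WS   # about 2.0: the value of a perfect demand forecast
print(SP, EV, EEV, WS, VSS, EVPI)
```

In this toy instance EV and WS happen to coincide (a quirk of this model: the wait-and-see cost is linear in demand, so averaging and optimizing commute), but the full chain EV ≤ WS ≤ SP ≤ EEV still holds.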
Where to go from here
This post stuck to the basics: how to write a stochastic program down. The next natural step is how to solve large ones efficiently. The two big workhorses are:
- Benders’ decomposition — for two-stage models, decomposes the deterministic equivalent into a master problem (in x) plus one subproblem per scenario, and reconciles them with cuts. Particularly elegant when you have lots of scenarios but a relatively small first stage.
- Stochastic Dual Dynamic Programming (SDDP) — for multi-stage models, uses sampling and piecewise-linear approximations of the future value functions. Famously used in hydropower scheduling, where the scenario tree is so big that explicit enumeration is hopeless.
Both deserve their own posts. If there’s interest, I’ll write them up.
Takeaway
If you’re using LPs in any context where the input data is genuinely uncertain (forecasted demand, weather, prices, travel times, or anything else), then your model is making an implicit choice about how to handle that uncertainty. “Just use the mean” is a choice. So is “plan for the worst.” Stochastic programming gives you the vocabulary to make that choice explicit, and the tools to evaluate whether your choice was a good one (hello, VSS).
To summarize the four basic ways to model uncertainty in an LP:
- Robust optimization — plan for the worst case in a given uncertainty set.
- Chance constraints — require feasibility with at least probability α.
- Two-stage recourse — decide, observe, correct; pay an expected recourse cost.
- Multi-stage recourse — the same idea, repeated over time on a scenario tree.
And two metrics worth keeping in your back pocket: VSS (does the stochastic model help?) and EVPI (would better forecasts help?).
Most real problems aren’t deterministic. The good news is your modeling toolkit doesn’t have to be either.
Credits and references
This post is based on lectures by Dr. Ruben van Beesten (Norwegian University of Science and Technology) from his course on Stochastic Programming given in October 2023, which I had the pleasure of attending in Trondheim, Norway. The fashion-company example, the four-way taxonomy of formulations, and the VSS/EVPI framing all come straight from his slides; any clumsiness in the retelling is mine.
The original modeling exercise that motivates much of the recourse-model intuition is from
- Higle, J. L. (2005). Stochastic Programming: Optimization When Uncertainty Matters. In INFORMS TutORials in Operations Research, pp. 30–53.
A couple of further pointers worth knowing about:
- Kleywegt, A. J., Shapiro, A., and Homem-de-Mello, T. (2002). The sample average approximation method for stochastic discrete optimization. SIAM Journal on Optimization, 12(2), 479–502. The standard reference for SAA.
- Higle, J. L., and Sen, S. (1991). Stochastic decomposition: an algorithm for two-stage linear programs with recourse. Mathematics of Operations Research, 16(3), 650–669. One of the few methods that handles non-discrete distributions directly.
And of course, the two earlier posts in this series: Five questions that will help you model integer linear programs better and Four steps to robustify your linear program.

