Weather apps have a simple problem: they show you the forecast, but they don’t tell you when it actually changed.
That might sound trivial. It isn’t.
Modern numerical weather prediction (NWP) systems — like ECMWF IFS — produce remarkably accurate forecasts at ~9 km resolution, updated every few hours. The data is already very good.
The problem is not the forecast.
The problem is attention: knowing when a change in that data is actually meaningful.
I didn’t learn that from software engineering. I learned it years earlier, studying chaos theory at the Instituto Balseiro. It was there, working through dynamical systems, that I first encountered a slightly unsettling idea:
A system can be completely deterministic and still be practically unpredictable.
That idea stayed with me. And years later, when I started building AI systems, I realized that many of them were ignoring it.
The problem with “vibe-based” deltas
When I started seeing how developers were building weather agents, I noticed a pattern:
- Fetch forecast data
- Feed it into an LLM
- Ask: “Did the weather change significantly?”
At first glance, this seems reasonable. From a physics perspective, it is problematic, at least for problems where the decision boundary is already well defined, because it replaces an explicit threshold with a probabilistic interpretation.
In a chaotic system, significance is not a linguistic judgment — it is a threshold defined on variables like temperature, precipitation, or wind speed. It depends on magnitudes, context, and time horizons.
An LLM is a stochastic process. It is very good at generating language, but it is not designed to enforce deterministic boundaries on physical systems.
When you ask an LLM whether a forecast “changed significantly,” you’re asking a probabilistic model to approximate a deterministic rule that you could have defined explicitly. That introduces variability exactly where you want consistency.
The failure modes are subtle:
- Trends inferred from phrasing rather than data
- Inconsistent decisions across similar inputs
- Outputs that cannot be tested or reproduced
In many applications, that might be acceptable. In agriculture, energy, and logistics — where a 3°C drop is a phase transition for a crop, a non-linear spike in energy demand, or an operational disruption — it is not. These decisions need to be stable and explainable.
Which led me to a simple rule:
If you can write an `assert` statement for it, you probably shouldn’t be using a prompt.
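To make the rule concrete, here is a minimal sketch of what that looks like. The function name and the 20-percentage-point threshold are illustrative, not Skygent’s actual configuration:

```python
# Hypothetical sketch: the "did it change significantly?" question as code.
# Threshold and names are illustrative, not Skygent's actual configuration.
RAIN_PROB_THRESHOLD_PP = 20.0  # percentage points

def rain_change_is_significant(prev_prob: float, curr_prob: float) -> bool:
    """Deterministic rule: alert when rain probability moves more than 20pp."""
    return abs(curr_prob - prev_prob) > RAIN_PROB_THRESHOLD_PP

# Same inputs, same decision, every time -- so the rule is directly assertable:
assert rain_change_is_significant(10.0, 50.0) is True
assert rain_change_is_significant(10.0, 25.0) is False
```

The prompt version of this question can answer differently on identical inputs; the function cannot.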
My path to this problem
My career has looked less like a straight line and more like a trajectory in phase space. A Marie Curie PhD in climate dynamics, five years directing R&D at Uruguay’s national meteorology institute — forest fire prevention, seasonal forecasting, climate adaptation — then a shift to production ML at Microsoft and Mercado Libre.
That arc gave me something specific: I already understood the physics of the data, the skill horizons of the models, and what “significant change” actually means in a physical system. Not as a software abstraction — as a measurable delta on a variable with known uncertainty bounds.
When I started building AI systems, the instinct was immediate: this is a threshold problem. Thresholds belong in code, not in prompts.
Skygent is one expression of that perspective — an agent designed not to display forecasts, but to detect meaningful changes in them.
The system runs continuously on real forecast data for user-defined events, evaluating changes every few hours and only triggering alerts when predefined conditions are met. In practice, most evaluation cycles result in no alert — only a small fraction of changes cross the significance threshold. That’s the point: signal, not noise.
The architecture
Skygent follows a clean separation across five layers:
*Architecture diagram of Skygent’s five layers.*
Only one layer calls the LLM.
The Deterministic Gatekeeper
At the core is a Python evaluator. It doesn’t interpret — it calculates. It:
- Compares consecutive Pydantic-validated forecast snapshots
- Evaluates deltas against configurable thresholds
- Incorporates context: event type, variable sensitivity
- Accounts for forecast horizon using established NWP skill limits — a change in a 24-hour forecast does not carry the same reliability as a change in a 10-day forecast
This is where decisions are made. Every alert has a traceable path: which variable changed, by how much, which threshold was crossed. In a corporate or government environment, being able to explain why an alert fired — without saying “the model felt like it” — is not optional.
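The evaluator described above can be sketched as follows. This is an illustrative reconstruction, not Skygent’s actual code: the real system uses Pydantic-validated snapshots, while this sketch uses a stdlib dataclass to stay self-contained, and the field names, thresholds, and horizon discount are all assumptions:

```python
from dataclasses import dataclass

# Illustrative sketch of the deterministic evaluator. The real system validates
# snapshots with Pydantic; a stdlib dataclass keeps this self-contained.
# Field names, thresholds, and the horizon discount are assumptions.

@dataclass
class ForecastSnapshot:
    precipitation_probability_max: float  # %
    temperature_max: float                # deg C
    wind_speed_max: float                 # km/h
    horizon_days: float

BASE_THRESHOLDS = {
    "precipitation_probability_max": 20.0,  # percentage points
    "temperature_max": 3.0,                 # deg C
    "wind_speed_max": 15.0,                 # km/h
}

def horizon_factor(days: float) -> float:
    """Crude skill discount: demand larger deltas at longer horizons."""
    if days <= 3:
        return 1.0
    if days <= 7:
        return 1.5
    return 2.0  # beyond a week, NWP skill drops off sharply

def significant_changes(prev: ForecastSnapshot, curr: ForecastSnapshot) -> dict:
    """Return the variables whose delta crossed a horizon-adjusted threshold."""
    factor = horizon_factor(curr.horizon_days)
    return {
        var: getattr(curr, var) - getattr(prev, var)
        for var, base in BASE_THRESHOLDS.items()
        if abs(getattr(curr, var) - getattr(prev, var)) > base * factor
    }
```

Note how every alert is traceable by construction: the returned dict names the variable and the delta, and the threshold that was crossed is a line of config, not a model weight.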
The Trigger
An alert fires only if a threshold is crossed. If the delta doesn’t cross the boundary, nothing happens. This is a binary, testable condition — not a judgment call.
The Narrator
Only after the decision is made does the LLM enter the pipeline. Its role is strictly limited: take structured JSON data, translate it into natural language.
Structured payload sent to GPT-4o-mini:

```json
{
  "event_name": "Ana's Wedding",
  "variable": "precipitation_probability_max",
  "from_value": 10.0,
  "to_value": 50.0,
  "delta": 40.0,
  "horizon_days": 5.2,
  "confidence": "medium"
}
```
Output:
“Rain probability increased from 10% to 50% for your event window. Confidence is medium due to the 5-day forecast horizon.”
The LLM is not deciding anything. It is explaining.
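A sketch of what that narration step might look like. The helper name and prompt wording are assumptions; the point is that by the time any model is called, the decision is already fixed in the payload, and the prompt only asks for a restatement:

```python
import json

# Hypothetical narrator step (helper name and prompt wording are assumptions).
# The decision is already encoded in the payload; the model only verbalizes it.

def build_narrator_prompt(payload: dict) -> str:
    """Constrain the model to restating facts decided upstream."""
    return (
        "Rewrite this forecast change as one or two plain sentences for the "
        "user. Do not add information, judge significance, or speculate.\n"
        + json.dumps(payload, indent=2)
    )

payload = {
    "event_name": "Ana's Wedding",
    "variable": "precipitation_probability_max",
    "from_value": 10.0,
    "to_value": 50.0,
    "delta": 40.0,
    "horizon_days": 5.2,
    "confidence": "medium",
}
print(build_narrator_prompt(payload))
```

Because the prompt is a pure function of the payload, even this layer is partially testable: you can assert on the prompt’s contents, if not on the model’s phrasing.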
Why this architecture is testable
It is practically impossible to reach 100% test coverage on a pure LLM agent — you cannot write deterministic assertions on probabilistic outputs.
The hybrid approach changes this. The decision logic is pure Python with Pydantic-validated inputs: 204 unit tests, zero LLM dependencies in the test suite. The LLM handles only the narrative tone — the one thing that genuinely benefits from natural language generation.
This is not just a testing convenience. It means every decision the system makes can be explained, reproduced, and verified independently of the LLM.
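A sketch of what those deterministic tests look like in practice. The function and cases here are illustrative, not lifted from Skygent’s suite:

```python
# Illustrative table-driven test of pure threshold logic -- no LLM, no
# network, fully reproducible. Function name and cases are assumptions.

def delta_crosses(prev: float, curr: float, threshold: float) -> bool:
    return abs(curr - prev) > threshold

cases = [
    (10.0, 50.0, 20.0, True),   # 40pp rain-probability jump crosses 20pp
    (10.0, 25.0, 20.0, False),  # 15pp jump does not
    (22.0, 21.4, 3.0, False),   # a 0.6 deg C drift is noise
]
for prev, curr, threshold, expected in cases:
    assert delta_crosses(prev, curr, threshold) is expected
```

Tests like these run in milliseconds and never flake, which is what makes a three-digit test count realistic for the decision layer.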
Event-Driven LLM Invocation
A naive agent calls the LLM on every polling cycle. This one doesn’t.
Skygent evaluates every 6 hours. It only calls the model when a threshold is crossed — roughly once or twice per week per monitored event, compared to ~28 calls per week for a naive polling agent.
At gpt-4o-mini pricing (~$0.0001 per narrative), cost is negligible. More importantly, cost is proportional to actual information: you pay for an LLM call only when something worth communicating happened.
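The arithmetic behind those numbers, with illustrative figures (6-hour cycle, one monitored event, ~$0.0001 per narrative):

```python
# Back-of-envelope: event-driven vs naive per-cycle LLM invocation.
# Figures are illustrative: 6 h cycle, one event, ~$0.0001 per narrative.
cycles_per_week = 7 * 24 // 6      # 28 evaluation cycles per week
naive_calls = cycles_per_week      # LLM called on every cycle
gated_calls = 2                    # ~1-2 threshold crossings per week
cost_per_call = 0.0001             # USD, approximate gpt-4o-mini narrative

print(f"naive: {naive_calls} calls  (~${naive_calls * cost_per_call:.4f}/week)")
print(f"gated: {gated_calls} calls  (~${gated_calls * cost_per_call:.4f}/week)")
```

The absolute savings are trivial at this price point; the structural property is what matters — calls scale with information, not with time.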
A concrete example
Previous snapshot: Rain probability 10%, Max temp 22°C, Wind 15 km/h
Current snapshot: Rain probability 50%, Max temp 21.4°C, Wind 18 km/h
Threshold: Alert if rain probability Δ > 20pp
Evaluation frequency: Every 6 hours
Result: Alert triggered → GPT-4o-mini generates narrative → Telegram delivery
*Screenshot: an example Skygent alert.*
When this pattern breaks
This approach doesn’t apply everywhere. It breaks down when:
- Inputs are unstructured or ambiguous
- Decision boundaries cannot be codified as thresholds
- Reasoning is open-ended
In those cases, LLM-first architectures — ReAct, Plan-and-Execute — make more sense.
One honest caveat: the thresholds in Skygent are configurable defaults — reasonable starting points informed by meteorological practice, but not calibrated against historical forecast errors for specific use cases. Calibration against real outcomes is the natural next step for any vertical deployment. The pattern is sound; the parameters are a starting point.
Closing
The most important decision I made building this system was not choosing a model or a framework.
It was deciding where not to use an LLM.
There is a tendency right now to delegate more and more to language models — to let them figure things out. But some problems already have structure. Some decisions already have boundaries.
When they do, approximating them with language is the wrong move. Encoding them explicitly is better.
In practice, this often comes down to a simple distinction: use LLMs to explain decisions, not to replace well-defined ones.
The full implementation — significance evaluator, LangGraph pipeline, Telegram bot — is available at: github.com/ferariz/skygent
Fernando Arizmendi builds production AI systems at the intersection of rigorous scientific method and applied AI engineering. He is a physicist (B.Sc. & M.Sc.) from Instituto Balseiro, former Marie Curie fellow (Ph.D. studying Climate Dynamics & Complex Systems), and previously directed R&D at Uruguay’s national meteorology institute.
LinkedIn · GitHub
All images by the author. Pipeline diagram generated with Claude (Anthropic).

