How Retrieval Helps in Time Series Forecasting
We all know how it goes: Time-series data is tricky.
Traditional forecasting models are unprepared for incidents like sudden market crashes, black swan events, or rare weather patterns.
Even large foundation models like Chronos can struggle when they haven’t seen that kind of pattern before.
We can mitigate this with retrieval, which lets us ask, “Has anything like this happened before?” and then use those past examples to guide the forecast.
In natural language processing (NLP), this idea is known as Retrieval-Augmented Generation (RAG), and it is now catching on in the time-series forecasting world too.
The model then considers past situations that look similar to the current one, and from there it can make more reliable predictions.
How is retrieval-augmented forecasting (RAF) different from traditional time-series modeling? It adds an explicit memory-access step.
Instead of:
Past -> parameters -> forecast
With retrieval we have:
Current situation -> similarity search -> concrete past episodes -> forecast
Retrieval-Augmented Forecasting Cycle. Image by Author | Napkin AI.
Instead of just using what the model learned during training, the idea is to give it access to a range of similar situations.
It’s like letting a weather model check, “What did past winters like this one look like before?”.
In this article, I explore retrieval-augmented forecasting from first principles and show, with concrete code examples, how retrieval can be used in real forecasting pipelines.
What Is Retrieval-Augmented Forecasting (RAF)?
At a high level, instead of relying only on what a model learned in training, RAF lets it actively look up concrete past situations similar to the current one and use their outcomes to guide its prediction.
Let’s see it more in detail:
- You convert the current situation (e.g., the last few weeks of a stock’s price series) into a query.
- This query is then used to search a database of historical time-series segments to find the most similar patterns.
- These matches don’t need to come from the same stock; the system can also surface similar movements from other stocks or financial products.
It retrieves those patterns along with what happened afterwards. That information is then fed to the forecasting model to help it make better predictions.
This technique is powerful in:
- Zero-shot scenarios: When the model faces something it wasn’t trained on.
- Rare or anomalous events: Like COVID, sudden financial crashes, etc.
- Evolving seasonal trends: Where past data contains helpful patterns, but they shift over time.
RAF doesn’t replace your forecasting model, but instead augments it by giving it extra hints and grounding it in relevant historical examples.
Another example: let’s say you want to forecast energy consumption during an unusually hot week.
Instead of hoping your model recalls how heatwaves affect usage, retrieval finds similar past heatwaves and lets the model consider what happened then.
What Do These Models Actually Retrieve?
The retrieved “knowledge” isn’t only raw data. It’s context that gives the model clues.
Here are some common examples:
Examples of Data Retrieval. Image by Author | Napkin AI.
As you can see, retrieval focuses on meaningful historical situations, like rare shocks, seasonal effects and patterns that have similar structures. These give actionable context for the current forecast.
How Do These Models Retrieve?
To find relevant patterns from the past, these models use structured mechanisms that represent the current situation in a way that makes it easy to search large databases and find the closest matches.
The code snippets in this section are a simplified illustration meant to build intuition; they do not represent production code.
Retrieval methods for time series forecasting. Image by Author | Napkin AI.
Some of these methods are:
Embedding-Based Similarity
This approach converts time series (or patches/windows of a series) into compact vectors, then compares them with distance metrics such as Euclidean distance or cosine similarity.
In simple words: The model turns chunks of time-series data into short summaries and then checks which past summaries look most similar to what’s happening now.
Some retrieval-augmented forecasters (e.g., RAFT) retrieve the most similar historical patches from the training data / entire series and then aggregate retrieved values with attention-like weights.
In simple words: It finds similar situations from the past and averages them, paying more attention to the best matches.
import numpy as np

# Example: embedding-based retrieval for time-series patches
# This is a toy example to show the *idea* behind retrieval.
# In practice:
#   - embeddings are learned by neural networks
#   - similarity search runs over millions of vectors
#   - this logic lives inside a larger forecasting pipeline

def embed_patch(patch: np.ndarray) -> np.ndarray:
    """
    Convert a short time-series window ("patch") into a compact vector.
    Here we use simple statistics (mean, std, min, max) purely for illustration.
    Real-world systems might use:
      - a trained encoder network
      - shape-based representations
      - frequency-domain features
      - latent vectors from a forecasting backbone
    """
    return np.array([
        patch.mean(),  # average level
        patch.std(),   # volatility
        patch.min(),   # lowest point
        patch.max(),   # highest point
    ])

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """
    Measure how similar two vectors are.
    Cosine similarity focuses on *direction* rather than magnitude,
    which is often useful for comparing patterns or shapes.
    """
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# Step 1: Represent the current situation
# A short window representing the current time-series behavior
query_patch = np.array([10, 12, 18, 25, 14, 11])

# Turn it into an embedding
query_embedding = embed_patch(query_patch)

# Step 2: Represent historical situations
# Past windows extracted from historical data
historical_patches = [
    np.array([9, 11, 17, 24, 13, 10]),   # looks similar
    np.array([2, 2, 2, 2, 2, 2]),        # flat, unrelated
    np.array([10, 13, 19, 26, 15, 12]),  # very similar
]

# Convert all historical patches into embeddings
historical_embeddings = [
    embed_patch(patch) for patch in historical_patches
]

# Step 3: Compare and retrieve the most similar past cases
# Compute similarity scores between the current situation
# and each historical example
similarities = [
    cosine_similarity(query_embedding, hist_emb)
    for hist_emb in historical_embeddings
]

# Rank historical patches by similarity
top_k_indices = np.argsort(similarities)[::-1][:2]
print("Most similar historical patches:", top_k_indices)

# Step 4 (conceptual):
# In a retrieval-augmented forecaster, the model would now:
#   - retrieve the *future outcomes* of these similar patches
#   - weight them by similarity (attention-like weighting)
#   - use them to guide the final forecast
# This integration step is model-specific and not shown here.
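To make Step 4 concrete, here is a minimal sketch of the similarity-weighted aggregation that RAFT-style models perform. It continues the toy example above; the "future outcome" arrays are invented for illustration, whereas a real system would load them from stored history.

```python
import numpy as np

# Hypothetical futures: what happened *after* each historical patch.
# In a real system these come from the stored data, not made up.
historical_futures = [
    np.array([12.0, 10.0, 9.0]),   # outcome after patch 0 (similar)
    np.array([2.0, 2.0, 2.0]),     # outcome after patch 1 (flat, unrelated)
    np.array([13.0, 11.0, 10.0]),  # outcome after patch 2 (very similar)
]

# Similarity scores for the same three patches (e.g., cosine similarities)
similarities = np.array([0.97, 0.40, 0.99])

# Softmax turns similarities into attention-like weights that sum to 1
weights = np.exp(similarities) / np.exp(similarities).sum()

# Retrieval-based forecast = weighted average of past outcomes,
# so the best matches contribute the most
retrieval_forecast = sum(w * f for w, f in zip(weights, historical_futures))
print(retrieval_forecast)  # leans toward the two similar patches
```

The softmax here is one simple weighting choice; learned attention inside the forecaster plays the same role in the papers referenced below.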
Retrieval Tools and Libraries
1. FAISS
FAISS is a fast, GPU-capable library for similarity search over dense vectors. It works best with large, in-memory datasets, though its index structures make real-time updates harder to implement.
import faiss
import numpy as np

# Suppose we already have embeddings for historical windows
d = 128  # embedding dimension
xb = np.random.randn(100_000, d).astype("float32")  # historical embeddings
xq = np.random.randn(1, d).astype("float32")        # query embedding

index = faiss.IndexFlatIP(d)  # inner product (often used with normalized vectors for cosine-like behavior)
index.add(xb)

k = 5
scores, ids = index.search(xq, k)
print("Nearest neighbors (ids):", ids)
print("Similarity scores:", scores)

# Some FAISS indexes/algorithms can run on GPU.
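The inner-product index above only behaves like cosine similarity if the vectors are L2-normalized first (FAISS provides faiss.normalize_L2 for exactly this). A quick numpy check of that equivalence, with random stand-in vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(128)
b = rng.standard_normal(128)

# Cosine similarity on the raw vectors
cosine = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# L2-normalize, then take a plain inner product
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
inner = float(a_n @ b_n)

# The two quantities match, which is why "normalize + IndexFlatIP"
# is a common recipe for cosine search in FAISS.
print(abs(cosine - inner) < 1e-9)  # prints True
```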
2. Annoy (nearest-neighbor lookup)
The Annoy library is relatively lightweight and easy to work with.
The best fit for this library is historical data that remains mostly static, since any modification to the dataset requires rebuilding the index.
from annoy import AnnoyIndex
import numpy as np

# Number of values in each embedding vector.
# The "length" of each fingerprint.
f = 64

# Create an Annoy index.
# This object will store many past embeddings and help us quickly find the most similar ones.
# "angular" distance is commonly used to compare patterns
# and behaves similarly to cosine similarity.
ann = AnnoyIndex(f, "angular")

# Add historical embeddings (past situations).
# Each item represents a compressed version of a past time-series window.
# Here we use random numbers just as an example.
for i in range(10000):
    ann.add_item(i, np.random.randn(f).tolist())

# Build the search structure.
# This step organizes the data so similarity searches are fast.
# After this, the index becomes read-only.
ann.build(10)

# Save the index to disk.
# This allows us to load it later without rebuilding everything.
ann.save("hist.ann")

# Create a query embedding.
# This represents the current situation we want to compare
# against past situations.
q = np.random.randn(f).tolist()

# Find the 5 most similar past embeddings.
# Annoy returns the IDs of the closest matches.
neighbors = ann.get_nns_by_vector(q, 5)
print("Nearest neighbors:", neighbors)

# Important note:
# Once the index is built, you cannot add new items.
# If new historical data appears, the index must be rebuilt.
3. Qdrant / Pinecone
Qdrant and Pinecone are like Google for embeddings.
You store lots of vector “fingerprints” (plus extra tags like city/season), and when you have a new fingerprint, you ask:
“Show me the most similar ones, but only from this city/season/store type.”
This is what makes them easier than rolling your own retrieval: they handle fast search and filtering!
Qdrant calls this metadata a payload, and you can filter search results using conditions on it.
# Example only (for intuition). Real code needs a running Qdrant instance + real embeddings.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
collection = "time_series_windows"

# Pretend this is the embedding of the *current* time-series window
query_vector = [0.12, -0.03, 0.98, 0.44]  # shortened for readability

# Filter = "only consider past windows from New York in summer"
# Qdrant builds filters from FieldCondition + MatchValue.
query_filter = models.Filter(
    must=[
        models.FieldCondition(
            key="city",
            match=models.MatchValue(value="New York"),
        ),
        models.FieldCondition(
            key="season",
            match=models.MatchValue(value="summer"),
        ),
    ]
)

# In real usage, you'd call search/query and get back the nearest matches
# plus their payload (metadata) if you request it.
results = client.search(
    collection_name=collection,
    query_vector=query_vector,
    query_filter=query_filter,
    limit=5,
    with_payload=True,  # return metadata so you can inspect what you retrieved
)
print(results)

# What you'd do next (conceptually):
#   - take the matched IDs
#   - load the actual historical windows behind them
#   - feed those windows (or their outcomes) into your forecasting model
Pinecone stores metadata key-value pairs alongside vectors and lets you filter at query time (including $eq) and return metadata.
# Example only (for intuition). Real code needs an API key + an index host.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index(host="INDEX_HOST")

# Pretend this is the embedding of the current time-series window
query_vector = [0.12, -0.03, 0.98, 0.44]  # shortened for readability

# Ask for the most similar past windows, but only where:
#   city == "New York" AND season == "summer"
# Pinecone supports query-time filtering with operators like `$eq`.
res = index.query(
    namespace="windows",
    vector=query_vector,
    top_k=5,
    filter={
        "city": {"$eq": "New York"},
        "season": {"$eq": "summer"},
    },
    include_metadata=True,  # return tags so you can sanity-check matches
    include_values=False,
)
print(res)

# Conceptually next:
#   - use the returned IDs to fetch the underlying historical windows/outcomes
#   - condition your forecast on those retrieved examples
Why do vector DBs help? They let you do similarity search plus “SQL-like WHERE filters” in one step, which is hard to do cleanly with a DIY setup (both Qdrant payload filtering and Pinecone metadata filtering are first-class features in their docs).
Each tool has its trade-offs. For instance, FAISS is great for performance but isn’t suited for frequent updates. Qdrant gives flexibility and real-time filtering. Pinecone is easy to set up but SaaS-only.
Retrieval + Forecasting: How to Combine Them
Once you know what to retrieve, the next step is to combine that information with the current input.
The best way to do this varies with the architecture and the task; there are several strategies (see image below).
Strategies for Combining Retrieval and Forecasting. Image by Author | Napkin AI.
A. Concatenation
Idea: treat retrieved context as “more input” by appending it to the existing sequence (very common in retrieval-augmented generation setups).
Works well with transformer-based models like Chronos and doesn’t require architecture changes.
import torch
# x_current: the model’s usual input sequence (e.g., last N timesteps or tokens)
# shape: [batch, time, d_model] (or [batch, time] if you think in tokens)
x_current = torch.randn(8, 128, 256)
# x_retrieved: retrieved context encoded in the SAME representation space
# e.g., embeddings for similar past windows (or their summaries)
# shape: [batch, retrieved_time, d_model]
x_retrieved = torch.randn(8, 32, 256)
# Simple fusion: just append retrieved context to the end of the input sequence
# Now the model sees: [current history … + retrieved context …]
x_fused = torch.cat([x_current, x_retrieved], dim=1)
# In practice, you’d also add:
# – an attention mask (so the model knows what’s real vs padded)
# – segment/type embeddings (so the model knows which part is retrieved context)
# Then feed x_fused to your transformer.
B. Cross-Attention Fusion
Idea: keep the “current input” and “retrieved context” separate, and let the model attend to retrieved context when it needs it. This is the core “fusion in the decoder via cross-attention” pattern used by retrieval-augmented architectures like FiD.
import torch
# current_repr: representation of the current time-series window
# shape: [batch, time, d_model]
current_repr = torch.randn(8, 128, 256)
# retrieved_repr: representation of retrieved windows (could be concatenated)
# shape: [batch, retrieved_time, d_model]
retrieved_repr = torch.randn(8, 64, 256)
# Think of cross-attention like:
# – Query (Q) comes from the current sequence
# – Keys/Values (K/V) come from retrieved context
Q = current_repr
K = retrieved_repr
V = retrieved_repr
# Attention scores: “How much should each current timestep look at each retrieved timestep?”
scores = torch.matmul(Q, K.transpose(-1, -2)) / (Q.size(-1) ** 0.5)
# Turn scores into weights (so they sum to 1 across retrieved positions)
weights = torch.softmax(scores, dim=-1)
# Weighted sum of retrieved information (this is the “fused” retrieved signal)
retrieval_signal = torch.matmul(weights, V)
# Final fused representation: current info + retrieved info
# (Some models add, some concatenate, some use a learned projection)
fused = current_repr + retrieval_signal
# Then the forecasting head reads from `fused` to predict the future.
C. Mixture-of-Experts (MoE)
Idea: combine two “experts”:
- the retrieval-based forecaster (non-parametric, case-based)
- the base forecaster (parametric knowledge)
A “gate” decides which one to trust more at each time step.
import torch
# base_pred: forecast from the main model (what it “learned in weights”)
# shape: [batch, horizon]
base_pred = torch.randn(8, 24)
# retrieval_pred: forecast suggested by retrieved similar cases
# shape: [batch, horizon]
retrieval_pred = torch.randn(8, 24)
# context_for_gate: summary of the current situation (could be last hidden state)
# shape: [batch, d_model]
context_for_gate = torch.randn(8, 256)
# gate: a number between 0 and 1 saying “how much to trust retrieval”
# (In real models, this is a tiny neural net.)
gate = torch.sigmoid(torch.randn(8, 1))
# Mixture: convex combination
# – if gate ~ 1 -> trust retrieval more
# – if gate ~ 0 -> trust the base model more
final_pred = gate * retrieval_pred + (1 - gate) * base_pred
# In practice:
# – gate might be timestep-dependent: shape [batch, horizon, 1]
# – you might also add training losses to stabilize routing/usage (common in MoE)
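The timestep-dependent gate mentioned in the comments above can be sketched with plain numpy (shapes only; the random arrays stand in for real model outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, horizon = 8, 24

base_pred = rng.standard_normal((batch, horizon))       # base model forecast
retrieval_pred = rng.standard_normal((batch, horizon))  # retrieval-based forecast

# One gate value *per timestep* instead of one per series:
# shape [batch, horizon], each entry squashed into (0, 1) by a sigmoid
gate = 1.0 / (1.0 + np.exp(-rng.standard_normal((batch, horizon))))

# Convex combination at every timestep
final_pred = gate * retrieval_pred + (1.0 - gate) * base_pred

# Sanity check: each forecast entry lies between the two experts'
# predictions at that timestep, because the mixture is convex.
low = np.minimum(base_pred, retrieval_pred)
high = np.maximum(base_pred, retrieval_pred)
print(bool(np.all((final_pred >= low) & (final_pred <= high))))  # True
```

A per-timestep gate lets the model trust retrieval early in the horizon (where close analogues exist) and fall back to the base model further out.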
D. Channel Prompting
Idea: treat retrieved series as extra input channels/features (especially natural in multivariate time series, where each variable is a “channel”).
import torch
# x: multivariate time series input
# shape: [batch, time, channels]
# Example: channels could be [sales, price, promo_flag, temperature, …]
x = torch.randn(8, 128, 5)
# retrieved_series_aligned: retrieved signal aligned to the same time grid
# Example: average of the top-k similar past windows (or one representative neighbor)
# shape: [batch, time, retrieved_channels]
retrieved_series_aligned = torch.randn(8, 128, 2)
# Channel prompting = append retrieved channels as extra features
# Now the model gets “normal channels + retrieved channels”
x_prompted = torch.cat([x, retrieved_series_aligned], dim=-1)
# In practice you’d likely also include:
# – a mask or confidence score for retrieved channels
# – normalization so retrieved signals are on a comparable scale
# Then feed x_prompted into the forecaster.
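The normalization point in the comments above matters in practice: retrieved series often live on a different scale than the main channels. A minimal sketch of per-series z-normalization before concatenation (random arrays stand in for real retrieved signals):

```python
import numpy as np

rng = np.random.default_rng(0)

# Retrieved signal on a very different scale than the main input channels
retrieved = 1000.0 + 50.0 * rng.standard_normal((8, 128, 2))  # [batch, time, channels]

# Z-normalize each retrieved channel per series so the forecaster
# sees comparable magnitudes across all channels
mean = retrieved.mean(axis=1, keepdims=True)
std = retrieved.std(axis=1, keepdims=True) + 1e-9
retrieved_norm = (retrieved - mean) / std

# After normalization, each channel of each series has ~zero mean
# and ~unit variance, ready to be concatenated as extra channels
print(np.allclose(retrieved_norm.mean(axis=1), 0.0, atol=1e-6))  # True
print(np.allclose(retrieved_norm.std(axis=1), 1.0, atol=1e-3))   # True
```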
Some models even combine multiple methods.
A common approach is to retrieve multiple similar series, merge them using attention so the model can focus on the most relevant parts, and then feed them to an expert.
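That combined recipe, retrieve several neighbors, merge them with attention-like weights, then blend with the base model, can be sketched end to end in numpy. The memory, query, and base forecast here are random stand-ins; the fixed gate of 0.6 replaces the small gating network a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(1)

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# Toy memory: 100 historical windows and the 12-step outcomes that followed them
memory_windows = rng.standard_normal((100, 24))
memory_futures = rng.standard_normal((100, 12))

query = rng.standard_normal(24)      # current window
base_pred = rng.standard_normal(12)  # forecast from the base model

# 1) Retrieve the top-3 most similar past windows
sims = np.array([cosine(query, w) for w in memory_windows])
top = np.argsort(sims)[::-1][:3]

# 2) Merge their outcomes with attention-like (softmax) weights
w = np.exp(sims[top]) / np.exp(sims[top]).sum()
retrieval_pred = (w[:, None] * memory_futures[top]).sum(axis=0)

# 3) Gate between the retrieval expert and the base model
gate = 0.6  # in real models, a small network predicts this from context
final_pred = gate * retrieval_pred + (1 - gate) * base_pred
print(final_pred.shape)  # (12,)
```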
Wrap-up
Retrieval-Augmented Forecasting (RAF) lets your model draw on the past in a way that traditional time-series modeling does not.
It acts like an external memory that helps the model navigate unfamiliar situations with more confidence.
It’s simple to experiment with and can deliver meaningful improvements in forecasting tasks.
Retrieval is no longer just academic hype; it is already delivering results in real-world systems.
Thank you for reading!
My name is Sara Nóbrega. I’m an AI engineer focused on MLOps and on deploying machine learning systems into production.
References
[1] J. Liu, Y. Zhang, Z. Wang et al., Retrieval-Augmented Time Series Forecasting (2025), arXiv preprint
Source: https://arxiv.org/html/2505.04163v1
[2] UConn DSIS, TS-RAG: Time-Series Retrieval-Augmented Generation (n.d.), GitHub Repository
Source: https://github.com/UConn-DSIS/TS-RAG
[3] Y. Zhang, H. Xu, X. Chen et al., Memory-Augmented Forecasting for Time Series with Rare Events (2024), arXiv preprint
Source: https://arxiv.org/abs/2412.20810

