TL;DR

- RAG systems break when context grows beyond a few turns.
- The real problem is not retrieval; it's what actually enters the context window.
- A context engine controls memory, compression, re-ranking, and token limits explicitly.
- This is not a concept: it's a working system with measurable behavior, a full working implementation in pure Python, and real benchmark numbers.
The Breaking Point of RAG Systems
I built a RAG system that worked perfectly — until it didn’t.
The moment I added conversation history, everything started breaking. Relevant documents were getting dropped. The prompt overflowed. The model started forgetting things it had said two turns ago. Not because retrieval failed. Not because the prompt was badly written. But because I had zero control over what actually entered the context window.
That’s the problem nobody talks about. Most RAG tutorials stop at: retrieve some documents, stuff them into a prompt, call the model. What happens when your retrieved context is 6,000 characters but your remaining budget is 1,800? What happens when three of your five retrieved documents are near-duplicates, crowding out the only useful one? What happens when turn one of a twenty-turn conversation is still sitting in the prompt, taking up space, long after it stopped being relevant?
These aren’t rare edge cases. This is what happens by default — and it starts breaking within the first few turns.
All results below are from real runs of the system (Python 3.12, CPU-only, no GPU), except where noted as calculated.
The answer is a layer most tutorials skip entirely. Between raw retrieval and prompt construction, there’s a deliberate architectural step: deciding what the model actually sees, how much of it, and in what order. In 2025, Andrej Karpathy gave this a name: context engineering [2]. I’d been building it for months without calling it that.
This is the system I built, from retrieval to memory to compression, with real numbers and code you can run.
Complete code: https://github.com/Emmimal/context-engine/
What Context Engineering Actually Is
It’s worth being precise, because the terms get muddled.
Prompt engineering is the craft of what you say to the model — your system prompt, your few-shot examples, your output format instructions. It shapes how the model reasons.
RAG is a technique for fetching relevant external documents and including them before generation. It grounds the model in facts it wasn’t trained on [1].
Context engineering is the layer in between — the architectural decisions about what information flows into the context window, how much of it, and in what form. It answers: given everything that could go into this prompt, what should actually go in?
All three are complementary. In a well-designed system they each have a distinct job.
Who This Is For
This architecture is worth building if you are working on multi-turn chatbots where context accumulates across turns, RAG systems with large knowledge bases where retrieval noise is a real problem, or AI copilots and agents that need memory to stay coherent.
Skip it for single-turn queries with a small knowledge base — the pipeline overhead doesn’t justify a marginal quality gain. Skip it for latency-critical services under 50ms — embedding generation alone adds ~85ms on CPU. Skip it for fully deterministic domains like legal contract analysis, where keyword-only retrieval is often sufficient and more auditable.
If you have unlimited context windows and unlimited latency, plain RAG works fine. In production, those constraints don’t exist.
Full Pipeline Architecture
A complete context engineering pipeline for RAG systems, combining retrieval, memory management, compression, and token budget control to build efficient and scalable LLM applications. Image by Author.
Component 1: The Retriever
Most RAG implementations pick one retrieval method and call it done. The problem is no single method dominates across all query types. Keyword matching is fast and precise for exact terms. TF-IDF handles term weighting. Dense vector embeddings catch semantic relationships that keywords miss entirely.
Keyword vs. TF-IDF — Same Query, Different Behavior
For the query: “how does memory work in AI agents”
Both methods agree on mem-001 as the top document. But there’s a critical difference: TF-IDF provides more nuanced scoring by weighting term rarity, while keyword retrieval only counts raw overlap. On this query they converge — but they diverge badly on conceptual queries with different wording. This is precisely why hybrid retrieval becomes necessary.
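The difference can be made concrete with a toy corpus. The sketch below, a minimal illustration rather than the engine's actual retriever, shows raw-overlap keyword scoring next to a bare-bones smoothed-IDF weighting; the corpus strings and function names are invented for the example.

```python
import math

def tokenize(text: str) -> set[str]:
    return set(text.lower().split())

def keyword_score(query: str, doc: str) -> int:
    # Raw overlap: how many query tokens appear in the document.
    return len(tokenize(query) & tokenize(doc))

def tfidf_score(query: str, doc: str, corpus: list[str]) -> float:
    # Weight each shared token by its rarity across the corpus (smoothed idf).
    n = len(corpus)
    doc_tokens = tokenize(doc)
    score = 0.0
    for term in tokenize(query):
        if term in doc_tokens:
            df = sum(1 for c in corpus if term in tokenize(c))
            score += math.log((n + 1) / (df + 1)) + 1
    return score

# Toy stand-ins for mem-001 / vec-001 / ctx-001-style documents.
corpus = [
    "memory decay keeps AI agents coherent over long sessions",
    "vector embeddings encode semantic similarity",
    "the context window limits what the model sees",
]
query = "how does memory work in AI agents"

keyword_scores = [keyword_score(query, doc) for doc in corpus]
tfidf_scores = [tfidf_score(query, doc, corpus) for doc in corpus]
```

On this query both scorers agree on the first document, mirroring the convergence described above; the divergence appears once queries stop sharing surface tokens with the right document.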
The Retriever supports three modes: keyword, tfidf, and hybrid. Hybrid mode runs both methods and blends their scores with a single tunable weight:
hybrid_score = alpha * emb_score + (1 - alpha) * tf_score
The alpha=0.65 default weights embeddings slightly more than TF-IDF — empirical, not principled, but tested across different query styles. Keyword-heavy queries perform better around alpha=0.4; paraphrase-style queries benefit from alpha=0.8 or higher.
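The blend is a one-liner; this sketch uses the formula above directly (the function name is illustrative, not the engine's API):

```python
def hybrid_score(emb_score: float, tf_score: float, alpha: float = 0.65) -> float:
    """Blend dense-embedding and TF-IDF scores with one tunable weight.

    alpha=0.65 favours embeddings slightly; lower it for keyword-heavy
    queries, raise it for paraphrase-style queries.
    """
    return alpha * emb_score + (1 - alpha) * tf_score

# Same candidate document, two query styles:
default_blend = hybrid_score(0.8, 0.2)             # 0.65 * 0.8 + 0.35 * 0.2 = 0.59
keyword_heavy = hybrid_score(0.8, 0.2, alpha=0.4)  # 0.4 * 0.8 + 0.6 * 0.2 = 0.44
```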
What Hybrid Retrieval Fixes That TF-IDF Misses
For the query: “how do embeddings compare to TF-IDF for memory in AI agents”
| Mode | Documents Retrieved | Why |
|---|---|---|
| TF-IDF | mem-001, vec-001, ctx-001 | Only keyword-overlapping documents surface |
| Hybrid | mem-001, vec-001, tfidf-001, ctx-001 | Conceptually relevant tfidf-001 now surfaces |
tfidf-001 doesn’t appear in TF-IDF results because it shares few query tokens. Hybrid mode surfaces it because the embedding recognises its conceptual relevance. This is the exact failure mode of traditional RAG at scale.
One implementation note: sentence-transformers is optional. Without it, the system falls back to random embeddings with a warning. Production gets real semantics; development gets a functional stub.
Component 2: The Re-ranker
Retrieval gives you candidates. Re-ranking decides the final order.
The re-ranker applies a two-factor weighted sum blending retrieval score with a tag-based importance value. Documents tagged with memory, context, rag, or embedding receive a tag_importance of 1.4; all others receive 1.0. Both feed into the same formula:
final_score = base_score * 0.68 + tag_importance * 0.32
A tagged document with tag_importance=1.4 contributes 0.448 from that term alone, versus 0.32 for an untagged one — a fixed bonus of 0.128 regardless of retrieval score. The weights reflect a specific prior: retrieval signal is primary, domain relevance is a meaningful secondary signal.
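The formula and weights above translate directly into code. A minimal sketch (the tag set comes from the article; the function name is illustrative):

```python
BOOSTED_TAGS = {"memory", "context", "rag", "embedding"}

def rerank_score(base_score: float, tags: set[str]) -> float:
    # Documents carrying a boosted tag get tag_importance 1.4; others 1.0.
    tag_importance = 1.4 if tags & BOOSTED_TAGS else 1.0
    return base_score * 0.68 + tag_importance * 0.32

# The tag term alone contributes 0.448 vs 0.32: a fixed 0.128 bonus,
# independent of the retrieval score.
tagged = rerank_score(0.5, {"memory"})     # 0.34 + 0.448 = 0.788
untagged = rerank_score(0.5, {"misc"})     # 0.34 + 0.320 = 0.660
```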
Scores Before and After Re-ranking
| Document | Before Re-ranking | After Re-ranking | Change |
|---|---|---|---|
| mem-001 | 0.4161 | 0.7309 | +75.7% |
| rag-001 | outside top 4 | 0.5280 | promoted |
| vec-001 | 0.2880 | 0.5158 | +79.1% |
| tfidf-001 | 0.2164 | 0.4672 | +115.9% |
rag-001 jumps from outside the top four to second position entirely due to its tag boost. These reorderings change which documents survive compression — they’re not cosmetic.
Is the heuristic principled? Not entirely. A cross-encoder re-ranker — scoring each query-document pair with a neural model [7] — would be more accurate. But cross-encoders cost one model call per document. At five documents, the heuristic runs in microseconds. At 500+, a cross-encoder becomes worth the cost.
Component 3: Memory with Exponential Decay
This is the component most tutorials leave out entirely, and the one where naive systems collapse fastest.
Conversational memory has two failure modes: forgetting too fast (losing context that’s still relevant) and forgetting too slow (accumulating noise that crowds out useful information). A sliding window drops old turns abruptly — turn 10 is fully present, turn 11 is gone. That’s not how useful information works.
The solution is exponential decay, where turns fade continuously based on three factors.
The scoring formula:
effective = importance * recency * freshness + relevance_boost
Where each term is:
- recency = e^(−decay_rate × age_seconds) — older turns carry less weight
- freshness = e^(−0.01 × time_since_last_access) — recently referenced turns get a boost
- relevance_boost = (|query ∩ turn| / |query|) × 0.35 — turns with high query-token overlap are retained longer
This mirrors how working memory actually prioritises information [4] — high-importance turns survive longer; off-topic turns fade quickly regardless of when they occurred.
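The three factors combine as shown in the formula above. A minimal sketch, assuming an illustrative decay_rate (the engine exposes it as a tunable; the other constants come from the formulas):

```python
import math

def effective_score(importance: float, age_seconds: float,
                    since_access_seconds: float,
                    query_tokens: set[str], turn_tokens: set[str],
                    decay_rate: float = 1e-4) -> float:
    recency = math.exp(-decay_rate * age_seconds)          # older turns fade
    freshness = math.exp(-0.01 * since_access_seconds)     # recent access boosts
    overlap = len(query_tokens & turn_tokens) / max(len(query_tokens), 1)
    relevance_boost = overlap * 0.35                       # on-topic turns persist
    return importance * recency * freshness + relevance_boost

q = {"memory", "decay", "agents"}
fresh = effective_score(2.5, age_seconds=60, since_access_seconds=5,
                        query_tokens=q, turn_tokens={"memory", "decay"})
stale = effective_score(1.1, age_seconds=6 * 3600, since_access_seconds=3600,
                        query_tokens=q, turn_tokens={"weather", "chennai"})
```

A high-importance on-topic turn from a minute ago scores far above a low-importance off-topic turn from six hours ago, which is exactly the prioritisation described.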
Auto-Importance Scoring
Auto-importance scoring makes this practical without manual annotation. The system scores each turn based on content length, domain keywords, and query overlap:
| Turn Content | Role | Auto-Scored Importance |
|---|---|---|
| "What is context engineering and why is it important?" | user | 2.33 |
| "Explain how memory decay prevents context bloat." | user | 2.50 |
| "What is the weather in Chennai today?" | user | 1.10 |
A weather question scores 1.10 — barely above the floor. A domain question about memory decay scores 2.50 and survives far longer before decaying. In a long conversation, high-importance domain turns stay in memory while low-importance small-talk turns fade first — the exact ordering you want.
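The article names the three signals (content length, domain keywords, query overlap) but not the exact weights, so the sketch below is a plausible shape only; `DOMAIN_KEYWORDS` and every coefficient are assumptions, and it will not reproduce the table's exact scores.

```python
DOMAIN_KEYWORDS = {"context", "memory", "decay", "rag", "embedding", "retrieval"}

def auto_importance(content: str, query: str = "") -> float:
    # Hypothetical weights: only the three signals come from the article.
    tokens = set(content.lower().split())
    score = 1.0                                   # floor
    score += min(len(content) / 200, 0.5)         # longer turns carry more
    score += 0.5 * len(tokens & DOMAIN_KEYWORDS)  # domain terms boost
    if query:
        q = set(query.lower().split())
        score += 0.35 * len(tokens & q) / max(len(q), 1)
    return score

domain_turn = auto_importance("Explain how memory decay prevents context bloat.")
small_talk = auto_importance("What is the weather in Chennai today?")
```

Whatever the exact coefficients, the ordering is the invariant that matters: domain turns must outscore small talk so they decay last.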
Deduplication
Deduplication runs before any turn is stored, as a three-tier check: exact containment (if the new turn is a substring of an existing one, reject), strong prefix overlap (if the first half of both turns match, reject), and token-overlap similarity >= 0.72 (if token overlap is high enough, reject as a paraphrase).
At 0.72, you catch paraphrases without falsely rejecting related-but-distinct questions on the same topic. A follow-up like “Can you explain context engineering and its role in RAG?” after “What is context engineering and how does it help RAG systems?” scores ~72% overlap — deduplication fires, one memory slot saved, room made for genuinely new information.
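The three tiers can be sketched as a single predicate. The overlap metric here (shared tokens over the smaller token set) is one plausible definition, not necessarily the engine's exact formula:

```python
def is_duplicate(new: str, existing: str, threshold: float = 0.72) -> bool:
    a, b = new.lower().strip(), existing.lower().strip()
    # Tier 1: exact containment.
    if a in b or b in a:
        return True
    # Tier 2: strong prefix overlap (first half of both turns match).
    half = min(len(a), len(b)) // 2
    if half > 0 and a[:half] == b[:half]:
        return True
    # Tier 3: token-overlap similarity at or above the threshold.
    ta, tb = set(a.split()), set(b.split())
    overlap = len(ta & tb) / max(min(len(ta), len(tb)), 1)
    return overlap >= threshold

contained = is_duplicate("what is context engineering",
                         "Tell me: what is context engineering?")
distinct = is_duplicate("how does tf-idf weight terms",
                        "what is the weather today")
```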
Token Budget Under Pressure
How token budget is distributed across turns in a context-aware RAG system, balancing system prompts, memory history, and retrieved documents. Image by Author.
Component 4: Context Compression
You have 810 characters of retrieved context. Your remaining budget allows 800 characters. That 10-character gap means something either gets truncated badly or the whole thing overflows.
The Compressor implements three strategies. Truncate is the fastest — cuts each chunk proportionally. Sentence uses greedy sentence-boundary selection. Extractive is query-aware: every sentence across all retrieved documents gets scored by token overlap with the query, ranked by relevance, and greedily selected within budget. Then the selected sentences are served back in their original document order, not relevance rank order [5]. Relevance rank order produces incoherent context. Original order preserves the logical flow of the source material.
Compression Strategy Trade-offs — Same 810-Character Input, 800-Character Budget
| Strategy | Output Size | Compression Ratio | What It Optimises |
|---|---|---|---|
| Truncate | 744 chars | 91.9% | Speed |
| Sentence | 684 chars | 84.4% | Clean boundaries |
| Extractive | 762 chars | 94.1% | Relevance |
Extractive compression preserves meaning better — but saves fewer raw characters. Under tight budgets, it gives you the right content, not just less content.
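The extractive strategy described above (score by query overlap, select greedily, restore original order) can be sketched in a few lines; this is a simplified stand-in for the engine's compressor, with a naive regex sentence splitter:

```python
import re

def extractive_compress(docs: list[str], query: str, max_chars: int) -> str:
    """Query-aware extractive compression: score sentences by query-token
    overlap, greedily select within budget, then restore original order."""
    q = set(query.lower().split())
    sentences = []  # (original_position, sentence, relevance_score)
    pos = 0
    for doc in docs:
        for sent in re.split(r"(?<=[.!?])\s+", doc.strip()):
            if sent:
                score = len(q & set(sent.lower().split()))
                sentences.append((pos, sent, score))
                pos += 1
    # Greedy selection by relevance, within the character budget.
    chosen, used = [], 0
    for p, sent, score in sorted(sentences, key=lambda t: -t[2]):
        if used + len(sent) + 1 <= max_chars:
            chosen.append((p, sent))
            used += len(sent) + 1
    # Serve the winners back in original document order, not rank order.
    return " ".join(s for _, s in sorted(chosen))

docs = ["Memory decay keeps agents coherent. Cats are cute.",
        "Embeddings encode meaning. The weather is nice."]
result = extractive_compress(docs, "memory decay embeddings", max_chars=70)
```

The final sort by position is the detail most implementations miss: it is what keeps the compressed context readable.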
Component 5: The Token Budget Enforcer
Everything feeds into the TokenBudget — a slot-based allocator that tracks usage across named context regions. Token estimation uses the 1 token ≈ 4 characters heuristic for English prose, consistent with OpenAI’s documentation [6].
The order of reservation is the whole design:
```python
def build(self, query: str) -> ContextPacket:
    budget = TokenBudget(total=self.total_token_budget)
    budget.reserve_text("system_prompt", self.system_prompt)  # 1. Fixed
    scored_docs = self._rerank(self._retriever.retrieve(query, ...), query)
    memory_turns = self._memory.get_weighted(query=query)
    budget.reserve_text("history", " ".join(t.content for t in memory_turns))  # 2. Reserved
    remaining_chars = budget.remaining_chars()
    compressor = Compressor(max_chars=remaining_chars, strategy=self.compression_strategy)
    result = compressor.compress([sd.document.content for sd in scored_docs], query=query)
    budget.reserve_text("retrieved_docs", result.text)  # 3. What's left
    return ContextPacket(...)
```
The system prompt is fixed overhead you can’t negotiate away. Memory is what makes multi-turn coherent. Documents are the variable — useful, but the first thing to compress when space runs out. Reserve in the wrong order and documents silently overflow the budget before history is even accounted for. The orchestrator enforces the right order explicitly.
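A slot-based allocator with this interface is small. The sketch below matches the method names used in build() above but its internals are an assumption, using only the 1 token ≈ 4 characters heuristic:

```python
import math

class TokenBudget:
    """Slot-based token allocator over named context regions.

    Uses the 1 token ~= 4 chars heuristic; internals are a sketch,
    only the interface mirrors the engine's build() method."""

    def __init__(self, total: int):
        self.total = total
        self.slots: dict[str, int] = {}

    def reserve_text(self, name: str, text: str) -> int:
        tokens = math.ceil(len(text) / 4)
        self.slots[name] = tokens
        return tokens

    def remaining_tokens(self) -> int:
        return self.total - sum(self.slots.values())

    def remaining_chars(self) -> int:
        return max(self.remaining_tokens(), 0) * 4

budget = TokenBudget(total=800)
budget.reserve_text("system_prompt", "x" * 800)  # 200 tokens of fixed overhead
budget.reserve_text("history", "y" * 400)        # 100 tokens of memory
leftover = budget.remaining_chars()              # everything else is for docs
```

Because each reservation shrinks `remaining_chars()`, reserving in the wrong order would hand the compressor a budget that history later invalidates, which is exactly the failure the explicit ordering prevents.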
What Happens Under Real Token Pressure
This is where naive systems fail — and this engine adapts.
Setup: 5 documents (810 chars total), 200 tokens reserved for system prompt, 800-token total budget. Query: “How do embeddings and TF-IDF compare for memory in agents?”
Turn 1 — no conversation history yet: Documents retrieved: 5, re-ranked. Memory turns: 0. Compression applied: 48% reduction. Result: fits within budget.
Turn 2 — after conversation begins: Documents retrieved: 5, re-ranked. Memory turns: 2, now competing for space. Compression becomes more aggressive: 45% reduction. Result: still fits within budget.
What changed? The system didn’t fail — it adapted. Memory turns consumed part of the budget, so compression on retrieved documents tightened automatically. That’s the point of context engineering: the model always receives something coherent, never a random overflow.
Measuring What It Actually Buys You
The table below compares four approaches on the same query and 800-token total budget. The first three rows are calculated from known inputs using the same 810-character document set; the fourth row reflects actual engine output verified against demo runs.
| Approach | Docs Retrieved | After Compression | Memory | Fits Budget? |
|---|---|---|---|---|
| Naive RAG | 5 (full) | 810 chars, none | None | No (10 chars over) |
| RAG + Truncate | 5 | 360 chars (43%) | None | Yes, but tail content lost |
| RAG + Memory (no decay) | 5 (full) | 810 chars | 3 turns, unfiltered | No (history pushes it over) |
| Full Context Engine | 5, reranked | 400 chars (50%) | 2 turns, decay-filtered | Yes (all constraints met) |
Naive RAG overflows immediately. Truncation fits but blindly cuts the tail. Memory without decay adds noise rather than signal — older turns never fade, and conversation history becomes bloat. The full system re-ranks, compresses intelligently, and includes only turns that still carry information.
Memory Decay by Importance Score
Effective score decay over 24 hours — high-importance context engineering turns survive the full session window while low-importance turns like weather queries fall below the 0.1 threshold at ~12 hr and are dropped. Relevance boost from query-token overlap can temporarily revive aged turns.
Performance Characteristics
Measured on Python 3.12.6, CPU only, no GPU, 5-document knowledge base:
| Operation | Latency | Notes |
|---|---|---|
| Keyword retrieval | ~0.8ms | Simple token matching |
| TF-IDF retrieval | ~2.1ms | Vectorisation + cosine similarity |
| Hybrid retrieval | ~85ms | Embedding generation dominates |
| Re-ranking (5 docs) | ~0.3ms | Tag-weighted scoring |
| Memory decay + filtering | ~0.6ms | Exponential decay calculation |
| Compression (extractive) | ~4.2ms | Sentence scoring + selection |
| Full engine.build() | ~92ms | Hybrid mode dominates |
Hybrid retrieval is the bottleneck. If you need sub-50ms response time, use TF-IDF or keyword mode instead. At 100 requests/sec in hybrid mode you need roughly 9 concurrent workers; with embedding caching, subsequent queries drop to ~2ms per request after the first.
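Embedding caching needs nothing more than a memoised wrapper around the embedding call. A minimal sketch, with `embed()` as a hypothetical stub standing in for the sentence-transformers call:

```python
from functools import lru_cache

@lru_cache(maxsize=4096)
def embed(text: str) -> tuple[float, ...]:
    # Stub for the ~85ms model call; a real version would invoke the
    # sentence-transformers encoder here and return its vector as a tuple.
    return tuple(float(ord(c)) for c in text[:8])

embed("how does memory work in AI agents")  # first call pays the full cost
embed("how does memory work in AI agents")  # repeat is served from the cache
```

Tuples (not lists or numpy arrays) as return values keep the cached entries hashable and immutable; `embed.cache_info()` exposes hit/miss counts for monitoring.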
Honest Design Decisions
alpha=0.65 is empirical, not principled. I tested across a small query set from my knowledge base. For a different domain — legal documents, medical literature, dense code — the right alpha will be different. Keyword-heavy queries do better around 0.4; conceptual or paraphrased queries benefit from 0.8 or higher.
The re-ranking weights (0.68/0.32) are a heuristic. A cross-encoder re-ranker would be more principled [7] but costs one model call per document. For 5 documents, the heuristic runs in microseconds. For 500+ documents, a cross-encoder becomes worth the cost.
Token estimation (1 token ≈ 4 chars) is an approximation. Within ~15% of actual token counts for English prose [6], but misfires for code and non-Latin scripts. For production, swap in tiktoken [8] — it’s a one-line change in compressor.py.
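The heuristic and its production replacement differ by one function body. A sketch of the estimator, with the tiktoken swap shown as a comment since it adds a dependency:

```python
def estimate_tokens(text: str) -> int:
    """Heuristic token count: ~4 characters per token for English prose.

    For production accuracy, replace the body with tiktoken, e.g.:
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    """
    return max(1, round(len(text) / 4))

estimate = estimate_tokens("x" * 400)  # 100 tokens under the heuristic
```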
The extractive compressor scores by query-token recall overlap: how many query tokens appear in the sentence, as a fraction of the query length. This is fast and dependency-free but misses semantic similarity — a sentence that paraphrases the query without sharing any tokens scores zero. Embedding-based sentence scoring would fix that at the cost of an additional model call per compression pass.
Trade-offs and What’s Missing
Cross-encoder re-ranking. The _rerank() interface is already designed to be swapped out. Drop in a BERT-based cross-encoder for meaningfully better pair-wise rankings.
Embedding-based compression. Replace the token-overlap sentence scorer in _extractive() with a small embedding model. Catches semantic relevance that keyword overlap misses. Probably worth it for 100+ document systems.
Adaptive alpha. Classify the query type dynamically and adjust alpha rather than using a fixed 0.65. A short query with rare domain terms probably wants more TF-IDF weight; a long natural-language question wants more embedding weight.
Persistent memory. The current Memory class is in-process only. A lightweight SQLite backend with the same add() / get_weighted() interface would survive restarts and enable cross-session continuity.
Closing
RAG gets you the right documents. Prompt engineering gets you the right instructions. Context engineering gets you the right context.
Prompt engineering decides how the model thinks. Context engineering decides what it gets to think about.
Most systems optimise the former and ignore the latter. That’s why they break.
The full source code with all seven demos is at: https://github.com/Emmimal/context-engine/
References
[1] Lewis, P., Perez, E., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 33, 9459–9474. https://arxiv.org/abs/2005.11401
[2] Karpathy, A. (2025). Context Engineering. https://x.com/karpathy/status/1937902205765607626
[3] Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. JMLR 12, 2825–2830. https://jmlr.org/papers/v12/pedregosa11a.html
[4] Baddeley, A. (2000). The episodic buffer: a new component of working memory? Trends in Cognitive Sciences, 4(11), 417–423.
[5] Mihalcea, R., & Tarau, P. (2004). TextRank: Bringing Order into Texts. EMNLP 2004. https://aclanthology.org/W04-3252/
[6] OpenAI. (2023). Counting tokens with tiktoken. https://github.com/openai/tiktoken
[7] Nogueira, R., & Cho, K. (2019). Passage Re-ranking with BERT. arXiv:1901.04085. https://arxiv.org/abs/1901.04085
[8] OpenAI. (2023). tiktoken: Fast BPE tokeniser for use with OpenAI’s models. https://github.com/openai/tiktoken
[9] Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP 2019. https://arxiv.org/abs/1908.10084
Disclosure
All code in this article was written by me and is original work, developed and tested on Python 3.12.6. Benchmark numbers are from actual demo runs on my local machine (Windows 11, CPU only) and are reproducible by cloning the repository and running demo.py, except where the article explicitly notes numbers are calculated from known inputs. The sentence-transformers library is used as an optional dependency for embedding generation in hybrid retrieval mode. All other functionality runs on the Python standard library and numpy only. I have no financial relationship with any tool, library, or company mentioned in this article.

