User: “What does the green highlighting mean in this document?”
RAG system: “Green highlighted text is interpreted as configuration settings.”
This is the kind of answer we expect today from Retrieval-Augmented Generation (RAG) systems.
Over the past few years, RAG has become one of the central architectural building blocks for knowledge-based language models: Instead of relying exclusively on the knowledge stored in the model, RAG systems combine language models with external document sources.
The term was introduced by Lewis et al. and describes an approach that is widely used to reduce hallucinations, improve the traceability of answers, and enable language models to work with proprietary data.
I wanted to understand why a system selects one specific answer instead of a very similar alternative. This decision is often made at the retrieval stage, long before an LLM comes into play.
For this reason, I conducted three experiments for this article to investigate how different chunk sizes (80, 220, and 500 characters) influence retrieval behavior.
Table of Contents
1 – Why Chunk Size Is More Than Just a Parameter
2 – How Does Chunk Size Influence the Stability of Retrieval Results in Small RAG Systems?
3 – Minimal RAG System Without Output Generation
4 – Three Experiments: Chunk Size as a Variable
5 – Final Thoughts
1 – Why Chunk Size Is More Than Just a Parameter
In a typical RAG pipeline, documents are first split into smaller text segments, embedded into vectors, and stored in an index. When a query is issued, semantically similar text segments are retrieved and then processed into an answer. This final step is often performed in combination with a language model.
Typical components of a RAG system include:
- Document preprocessing
- Chunking
- Embedding
- Vector index
- Retrieval logic
- Optional: Generation of the output
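To make this flow concrete, here is a deliberately minimal sketch of these components, with a toy bag-of-words embedding standing in for a real sentence model (the vocabulary and example sentences are invented for illustration):

```python
import numpy as np

VOCAB = ["green", "yellow", "highlighting", "configuration", "warnings"]

def embed(text: str) -> np.ndarray:
    """Toy bag-of-words embedding (stands in for a real sentence model)."""
    words = text.lower().replace("?", "").replace(".", "").split()
    vec = np.array([float(words.count(w)) for w in VOCAB])
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Chunking (whole sentences here), embedding, and building the index
chunks = ["Green highlighting marks configuration settings.",
          "Yellow highlighting marks warnings."]
index = np.stack([embed(c) for c in chunks])

# Retrieval: rank stored chunks by similarity to the query
query = embed("What does green highlighting mean?")
best = int(np.argmax(index @ query))
assert chunks[best].startswith("Green")
```

A real system replaces `embed` with a sentence-transformer model; the index/retrieval logic stays structurally the same.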
In this article, I focus on the retrieval step. This step depends on several parameters:
- Choice of the embedding model:
The embedding model determines how text is converted into numerical vectors. Different models capture meaning at different levels of granularity and are trained on different objectives. For example, lightweight sentence-transformer models are often sufficient for semantic search, while larger models may capture more nuance but come with higher computational cost.
- Distance or similarity metric:
The distance or similarity metric defines how the closeness between two vectors is measured. Common choices include cosine similarity, dot product, or Euclidean distance. For normalized embeddings, cosine similarity is often used.
- Number of retrieved results (Top-k):
Top-k specifies how many text segments are returned by the retrieval step. A small Top-k can miss relevant context, while a large Top-k increases recall but may introduce noise.
- Overlap between text segments:
Overlap defines how much text is shared between consecutive chunks. It is typically used to avoid losing important information at chunk boundaries. A small overlap reduces redundancy but risks cutting explanations in half, while a larger overlap increases robustness at the cost of storing and processing more similar chunks.
- Chunk size:
Chunk size describes the size of the text units that are extracted from a document and stored as individual vectors. Depending on the implementation, it can be defined in characters, words, or tokens. The size determines how much context a single vector represents.
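A simple character-based chunker with overlap could look like this (a minimal sketch; the actual chunking logic used in the experiments may differ):

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character-based chunks, where consecutive
    chunks share `overlap` characters at their boundary."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

text = "OneLatex separates content creation from formatting. " * 10
chunks = chunk_text(text, chunk_size=220, overlap=40)
# Each chunk starts with the last `overlap` characters of its predecessor
assert chunks[0][-40:] == chunks[1][:40]
```

The overlap ensures that a sentence cut at one chunk boundary reappears at the start of the next chunk, at the cost of some redundancy.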
Small chunks contain very little context and are highly specific. Large chunks include more surrounding information, but at a much coarser level. As a result, chunk size determines which parts of the meaning are actually compared when a query is matched against a chunk.
Chunk size implicitly reflects assumptions about how much context is required to capture meaning, how strongly information may be fragmented, and how clearly semantic similarity can be measured.
With this article, I wanted to explore exactly this through a small RAG system experiment and asked myself:
How do different chunk sizes affect retrieval behavior?
The focus is not on a system intended for production use, but on isolating how different chunk sizes affect the retrieval results.
2 – How Does Chunk Size Influence the Stability of Retrieval Results in Small RAG Systems?
I therefore asked myself the following questions:
- How does chunk size change retrieval results in a small, controlled RAG system?
- Which text segments make it to the top of the ranking when the queries are identical but the chunk sizes differ?
To investigate this, I deliberately defined a simple setup in which all conditions (except chunk size) remain the same:
- Three Markdown documents as the knowledge base
- Three identical, fixed questions
- The same embedding model for vectorizing the texts
The text used in the three Markdown files is based on the documentation of a real tool called OneLatex. To keep the experiment focused on retrieval behavior, the content was slightly simplified and reduced to the core explanations relevant for the questions.
The three questions I used were:
“Q1: What is the main advantage of separating content creation from formatting in OneLatex?”
“Q2: How does OneLatex interpret text highlighted in green in OneNote?”
“Q3: How does OneLatex interpret text highlighted in yellow in OneNote?”
In addition, I deliberately omitted an LLM for output generation.
The reason for this is simple: I did not want an LLM to turn incomplete or poorly matched text segments into a coherent answer. This makes it much clearer what actually happens in the retrieval step, how the retrieval parameters interact, and what role the sentence transformer plays.
3 – Minimal RAG System Without Output Generation
For the experiments, I therefore used a small RAG system with the following components: Markdown documents as the knowledge base, a simple chunking logic with overlap, a sentence transformer model to generate embeddings, and a ranking of text segments using cosine similarity.
As the embedding model, I used all-MiniLM-L6-v2 from the Sentence-Transformers library. This model is lightweight and therefore well-suited for running locally on a personal laptop (I ran it on my Lenovo laptop with 64 GB of RAM). The similarity between a query and a text segment is calculated using cosine similarity; because the embedding vectors are normalized, this reduces to a simple dot product.
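This equivalence between cosine similarity and the dot product can be illustrated with plain NumPy (toy vectors instead of real embeddings; with Sentence-Transformers, `encode(..., normalize_embeddings=True)` returns unit-length vectors directly):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for sentence embeddings
q = np.array([0.2, 0.5, 0.1, 0.7])
c = np.array([0.1, 0.6, 0.2, 0.6])

# After L2 normalization, the plain dot product equals cosine similarity
q_n = q / np.linalg.norm(q)
c_n = c / np.linalg.norm(c)
assert abs(cosine_similarity(q, c) - float(np.dot(q_n, c_n))) < 1e-9
```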
I deliberately kept the system small and therefore did not include any chat history, memory or agent logic, or LLM-based answer generation.
As an “answer,” the system simply returns the highest-ranked text segment. This makes it much clearer which content is actually identified as relevant by the retrieval step.
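Returning the highest-ranked segment then boils down to a dot-product ranking over the stored vectors. A minimal sketch with invented 3-dimensional toy vectors (real embeddings have hundreds of dimensions):

```python
import numpy as np

def top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 3):
    """Rank stored chunk vectors by dot product with the query.
    For normalized vectors, the dot product equals cosine similarity."""
    scores = chunk_vecs @ query_vec
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

# Four toy chunk vectors, normalized to unit length
vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

query = np.array([0.9, 0.1, 0.0])
query /= np.linalg.norm(query)

results = top_k(query, vecs, k=3)
assert results[0][0] == 0  # the chunk closest to the query ranks first
```

The "answer" of the system is simply the text behind `results[0]`.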
The full code for the mini RAG system can be found in my GitHub repository:
→ 🤓 Find the full code in the GitHub Repo 🤓 ←
4 – Three Experiments: Chunk Size as a Variable
For the evaluation, I ran the three commands below via the command line:
# Experiment 1 – Baseline
python main.py --chunk-size 220 --overlap 40 --top-k 3
# Experiment 2 – Small Chunk Size
python main.py --chunk-size 80 --overlap 10 --top-k 3
# Experiment 3 – Big Chunk Size
python main.py --chunk-size 500 --overlap 50 --top-k 3
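A script like main.py presumably maps these flags to parameters roughly as follows (a hypothetical argparse sketch, not necessarily the code in the repository):

```python
import argparse

parser = argparse.ArgumentParser(description="Mini RAG retrieval experiment")
parser.add_argument("--chunk-size", type=int, default=220,
                    help="chunk size in characters")
parser.add_argument("--overlap", type=int, default=40,
                    help="characters shared between consecutive chunks")
parser.add_argument("--top-k", type=int, default=3,
                    help="number of retrieved segments")

# Simulate the second experiment's command line
args = parser.parse_args(["--chunk-size", "80", "--overlap", "10", "--top-k", "3"])
assert (args.chunk_size, args.overlap, args.top_k) == (80, 10, 3)
```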
The setup from Section 3 remains exactly the same: The same three documents, the same three questions, and the same embedding model.
Chunk size defines the number of characters per text segment. In addition, I used an overlap in each experiment to reduce information loss at chunk boundaries. For each experiment, I computed the semantic similarity scores between the query and all chunks and ranked the highest-scoring segments.
Small Chunks (80 characters) – Loss of Context
With very small chunks (chunk-size 80), a strong fragmentation of the content becomes apparent: Individual text segments often contain only sentence fragments or isolated statements without sufficient context. Explanations are split across multiple chunks, so that individual segments contain only parts of the original content.
Formally, the retrieval still works correctly: Semantically similar fragments are found and ranked highly.
However, when we look at the actual content, we see that the results are hardly usable:
The returned chunks are thematically related, but they do not provide a self-contained answer. The system roughly recognizes what the topic is about, but it breaks the content down so strongly that the individual results do not say much on their own.
Medium Chunks (220 characters) – Apparent Stability
With the medium chunks (chunk-size 220), the results improved noticeably. Most of the returned text segments contained complete explanations and were plausible in terms of content. At first glance, the retrieval appeared stable and reliable: It usually returned exactly the information one would expect.
However, a concrete problem became apparent when distinguishing between green and yellow highlighted text. Regardless of whether I asked about the meaning of the green or the yellow highlighting, the system returned the chunk about the yellow highlighting as the top result in both cases. The correct chunk was present, but it was not selected as Top-1.
The reason lies in the closely spaced similarity scores of the two top results:
- Score for Top-1: 0.873
- Score for Top-2: 0.774
The system can hardly distinguish between the two candidates semantically and ultimately selects the chunk with the slightly higher score.
The problem? It does not match the question's content and is simply wrong.
For us as humans, this is very easy to recognize. For a sentence transformer like all-MiniLM-L6-v2, it seems to be a challenge.
What matters here is this: If we only look at the Top-1 result, this error remains invisible. Only by comparing the scores do we see that the system is uncertain in this situation. Since it is forced to make a clear decision in our setup, it returns the Top-1 chunk as the answer.
Large Chunks (500 characters) – Robust Contexts
With the larger chunks (chunk-size 500), the text segments contain much more coherent context. There is also hardly any fragmentation anymore: Explanations are no longer split across multiple chunks.
And indeed, the error in distinguishing between green and yellow no longer occurs. The questions about green and yellow highlighting are now correctly distinguished, and the respective matching chunk is clearly ranked as the top result. We can also see that the similarity scores of the relevant chunks are now more clearly separated.
This makes the ranking more stable and easier to understand. The downside of this setting, however, is the coarser granularity: Individual chunks contain more information and are less finely tailored to specific aspects.
In our setup with three Markdown files, where the content is already thematically well separated, this downside hardly plays a role. With differently structured documentation, such as long continuous texts with multiple topics per section, an excessively large chunk size could lead to irrelevant information being retrieved together with relevant content.
On my Substack Data Science Espresso, I share practical guides and bite-sized updates from the world of Data Science, Python, AI, Machine Learning, and Tech — made for curious minds like yours.
Have a look and subscribe on Medium or on Substack if you want to stay in the loop.
5 – Final Thoughts
The results of the three very simple experiments can be traced back to how retrieval works. Each chunk is represented as a vector, and its proximity to the query is calculated using cosine similarity. The resulting score indicates how similar the question and the text segment are in the semantic space.
What is important here is that the score is not a measure of correctness. It is a measure of relative comparison within the available chunks for a given question in a single run.
When multiple segments are semantically very similar, even minimal differences in the scores can determine which chunk is returned as Top-1. One example of this was the incorrect distinction between green and yellow in the medium chunk size.
One possible extension would be to allow the system to explicitly signal uncertainty. If the scores of the Top-1 and Top-2 chunks are very close, the system could return an “I don’t know” or “I’m uncertain” response instead of forcing a decision.
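Such an uncertainty check could be sketched as follows (the gap threshold of 0.05 and the scores are illustrative choices, not tuned values):

```python
def answer_or_abstain(ranked, min_gap: float = 0.05):
    """Return the top chunk only if its score clearly beats the runner-up.

    `ranked` is a list of (chunk_id, score) pairs sorted by descending score.
    If the Top-1/Top-2 gap is below `min_gap`, abstain instead of guessing.
    """
    if len(ranked) >= 2 and ranked[0][1] - ranked[1][1] < min_gap:
        return "I'm uncertain - the top candidates are too close."
    return ranked[0]

# Close scores -> abstain; clear winner -> answer
assert answer_or_abstain([("yellow", 0.873), ("green", 0.861)]).startswith("I'm")
assert answer_or_abstain([("green", 0.91), ("yellow", 0.70)]) == ("green", 0.91)
```

In a production system, the threshold would have to be calibrated against the score distributions of the actual corpus.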
Based on this small RAG system experiment, it is not really possible to derive a “best chunk size” conclusion.
But what we can observe instead is the following:
- Small chunks: Lead to high variance. Retrieval reacts very precisely to individual terms but quickly loses the overall context.
- Medium chunks: Appear stable at first glance, but can create dangerous ambiguities when multiple candidates are scored almost equally.
- Large chunks: Provide more robust context and clearer rankings, but they are coarser and less precisely tailored.
Chunk size therefore determines how sharply retrieval can distinguish between similar pieces of content.
In this small setup, this did not play a major role. However, when we think about larger RAG systems in production environments, this kind of retrieval instability could become a real problem: As the number of documents grows, the number of semantically similar chunks increases as well. This means that many situations with very small score differences are likely to occur. I can also imagine that such effects are often masked by downstream language models, when an LLM turns incomplete or only partially matching text segments into plausible answers.

