Simplicity is a form of art...

Embeddings in RAG

by Sven Vermeulen, post on Sun 01 March 2026

When I started looking into the architecture of Large Language Models (LLMs), I got confused when I encountered Retrieval Augmented Generation (RAG). Both LLMs themselves and RAG use embeddings (a numerical vector representation of a token), and because of this shared terminology, I wrongly assumed that the embeddings in both are strongly related. It is in fact much simpler: while both use embeddings, they are unrelated to each other.

Note: I'm still dipping my toes into the world of LLMs (and other generative AI, like diffusion-models for image generation), so my posts might be inaccurate. I welcome any feedback or comments on this.

Embeddings in a large language model

LLMs are trained to predict text given a certain input. The predicted text comes in so-called tokens, small text snippets. Each predicted token is added to the input text, and the LLM then predicts the next token, moving forward until it predicts a special token that indicates the end of a text sequence.
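The predict-and-append loop can be sketched in a few lines. This is a toy under loud assumptions: the "model" is a lookup table instead of a neural network, whole words stand in for tokens, and the names `END_OF_TEXT`, `predict_next` and `generate` are made up for this illustration.

```python
# Toy stand-in for an LLM: a lookup table from the last token to the next one.
# A real model scores every token in its vocabulary and then picks one.
END_OF_TEXT = "<eos>"  # hypothetical name for the special end-of-sequence token

NEXT_TOKEN = {
    "Two": "roads",
    "roads": "diverged",
    "diverged": END_OF_TEXT,
}


def predict_next(tokens):
    """Return the next token given the tokens so far (only the last one is used here)."""
    return NEXT_TOKEN.get(tokens[-1], END_OF_TEXT)


def generate(prompt_tokens, max_new_tokens=20):
    """Append predicted tokens until the end-of-text token or the limit is reached."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        token = predict_next(tokens)
        if token == END_OF_TEXT:
            break
        tokens.append(token)  # the prediction becomes part of the next input
    return tokens


print(" ".join(generate(["Two"])))
```

The `max_new_tokens` limit is there because nothing guarantees that the end-of-text token is ever predicted.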

Simple view of LLM

Suppose the text at that point is the following:

Two roads diverged in a yellow wood,
and sorry I

You might know this as the start of "The Road Not Taken", a poem by Robert Frost. If the LLM is trained with this poem, it might be able to predict the next tokens. When I ran this as input through Qwen3-VL 8B, one of the more recent open-weights models released by the Qwen team at Alibaba Cloud, it was able to continue parts of the poem, but eventually strayed off course.

Two roads diverged in a yellow wood,
and sorry I couldn’t travel both
and be one traveler, long I stood.
and looked down one as far as I could
to where it bent in the undergrowth;
Then took the other, as just as fair,
and having perhaps a better claim,
because it was grassy and wanted wear;
though as for that the passing there
had worn them really about the same,

And both that morning equally lay
in leaves no step had trodden black.
And both … The question is — which way does he take? It’s not clear. He says “I took the other”, 
but then says “the passing there had worn them really about the same”. So why did he choose one 
over the other? Is it a matter of chance? Or is there something more symbolic going on?

Note: There is some randomness involved here. Other iterations with the same model and input did result in the poem being quoted correctly, followed by an analysis of the poem.

While generating the output, the model generates one part of text at a time. This part of text is called a token, and the LLM has a built-in tokenizer that converts text into tokens, and tokens back into text. For the Qwen3 models, the Qwen tokenizer is used. If I understand its vocabulary correctly, the text "couldn't travel" would be tokenized into:

[ "couldn", "'t", " ", "travel" ]

Different LLMs can use different tokenization methods, but there is a lot of re-use here. Different LLM models can use the same tokenizer.
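To make the text-to-tokens round trip concrete, here is a toy tokenizer. It is only a sketch: the seven-entry vocabulary is invented, and it uses greedy longest-match, whereas real tokenizers typically apply learned merge rules (such as byte-pair encoding) over a far larger vocabulary.

```python
# Hypothetical mini-vocabulary; the position in the list is the token id.
VOCAB = ["couldn", "'t", " ", "travel", "could", "t", "'"]
TOKEN_TO_ID = {token: token_id for token_id, token in enumerate(VOCAB)}


def encode(text):
    """Split text into token ids by always taking the longest matching vocabulary entry."""
    ids = []
    pos = 0
    while pos < len(text):
        candidates = [token for token in VOCAB if text.startswith(token, pos)]
        if not candidates:
            raise ValueError(f"no token matches {text[pos:]!r}")
        match = max(candidates, key=len)
        ids.append(TOKEN_TO_ID[match])
        pos += len(match)
    return ids


def decode(ids):
    """Turn token ids back into text by concatenating the token strings."""
    return "".join(VOCAB[token_id] for token_id in ids)


ids = encode("couldn't travel")
print(ids, [VOCAB[i] for i in ids])
```

Because the ids only index into the vocabulary, two models that share a tokenizer agree on the ids, but not necessarily on anything that happens after this step.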

These tokens are converted into embeddings, which form the foundational representation for use in LLMs. They are numerical vectors that represent those text tokens. LLMs work with these numerical vectors: LLMs (and AI in general) are software systems that perform heavy computational operations, mostly matrix operations in which each matrix is a massive set of numbers. Text, too, is represented as a huge matrix.

Embeddings are not just a simple index, but are pretrained values. These values enable token mapping based on semantic similarity. When the training material often combines "corona" and "COVID", then these two will have embeddings that allow both terms to be seen as close to each other. But the same is true if there is material combining "corona" and "beer". So the embedding that represents "corona" (assuming it is a single token) would have semantic understanding of both corona being a viral disease (related to COVID-19) as well as an alcoholic beverage.
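A small illustration of "close to each other": with made-up three-dimensional embeddings (real ones have thousands of trained dimensions, and the values below are invented), cosine similarity puts "corona" near both "COVID" and "beer", but far from an unrelated word.

```python
import math

# Invented embeddings; the axes loosely mean (disease, beverage, weather).
EMBEDDINGS = {
    "corona": [0.7, 0.7, 0.0],
    "COVID": [1.0, 0.1, 0.0],
    "beer": [0.0, 1.0, 0.1],
    "rain": [0.0, 0.1, 1.0],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


for other in ("COVID", "beer", "rain"):
    score = cosine_similarity(EMBEDDINGS["corona"], EMBEDDINGS[other])
    print(f"corona vs {other}: {score:.2f}")
```

The single "corona" vector sits between the disease and beverage directions, which is the toy equivalent of one embedding carrying both meanings.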

Unlike tokenizers, which can be reused across different LLM models, the embeddings are unique to each model. Sure, within the same family (e.g. Qwen3) there can be reuse as well, but it is much less common to see this re-use across different families.

The phrase "Two roads" would consist of three tokens ("Two", " ", "roads"), each of which is converted into a corresponding 4096-dimensional embedding vector during processing. The dimension is fixed for a particular LLM: Qwen3 8B for instance uses embeddings of 4096 numbers. So that start would be a matrix with dimensions 3x4096. The entire text itself thus would be represented by a very large matrix, with one dimension being this embedding size (4096 in my case), the other dimension being the number of tokens already used as text (both input and generated output).
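The lookup from tokens to a matrix can be sketched as follows. The three-entry vocabulary and the random values are assumptions for illustration only: a real embedding table holds trained values for the model's entire vocabulary.

```python
import random

EMBEDDING_DIM = 4096  # embedding size mentioned above for Qwen3 8B
VOCAB = {"Two": 0, " ": 1, "roads": 2}  # hypothetical three-entry vocabulary

rng = random.Random(0)
# One row of EMBEDDING_DIM numbers per vocabulary entry (random here, trained in reality).
EMBEDDING_TABLE = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)] for _ in VOCAB
]


def embed(tokens):
    """Look up the embedding row for each token; the result is a tokens x dim matrix."""
    return [EMBEDDING_TABLE[VOCAB[token]] for token in tokens]


matrix = embed(["Two", " ", "roads"])
print(len(matrix), "x", len(matrix[0]))  # 3 x 4096
```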

These matrices are then used as input within the LLM, which then starts doing magic with them (well, not really magic, it's rather maths, multiplying the matrix against other in-LLM stored matrices, iterating over multiple blocks of matrix operations, etc.) to eventually output a (sequence of) embedding(s), which is appended to the input matrix to re-iterate the entire process over and over again.

Embedding-based view of LLM

The maximum number of tokens that a model can handle is also predefined, although there are methods to extend this. For Qwen3 8B, this is 32768 natively, and 131072 with an extension method called YaRN. So, for the native implementation, that means the maximum text size would be represented as a matrix of dimensions 32768x4096.
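To get a feeling for these sizes, the arithmetic can be worked out directly. The 2 bytes per value is an assumption (16-bit floating point numbers), and this only sizes the input matrix itself, not the rest of the memory a model needs.

```python
EMBEDDING_DIM = 4096
NATIVE_CONTEXT = 32768
YARN_CONTEXT = 131072
BYTES_PER_VALUE = 2  # assumption: 16-bit floats


def matrix_size(tokens, dim=EMBEDDING_DIM, bytes_per_value=BYTES_PER_VALUE):
    """Return (number of values, size in bytes) of a tokens x dim matrix."""
    values = tokens * dim
    return values, values * bytes_per_value


for label, tokens in [("native", NATIVE_CONTEXT), ("YaRN", YARN_CONTEXT)]:
    values, size = matrix_size(tokens)
    print(f"{label}: {values:,} values, {size / 2**20:.0f} MiB")
# native: 134,217,728 values, 256 MiB
# YaRN: 536,870,912 values, 1024 MiB
```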

Retrieval Augmented Generation

LLMs are trained with a certain set of data, so once an LLM has finished training, it does not have the ability to learn more. To make it more useful, you want the LLM to have access to recent insights. Nowadays, the hype is all about MCP (Model Context Protocol), which has LLMs trained to understand that they have tools at their disposal, and to know how to call these tools (well, in reality, they are trained to generate output that the software executing the LLM detects, turns into a tool call, and whose outcome it then adds back to the text already generated, allowing the LLM to continue).
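The "detect, call, append, continue" loop in that parenthesis can be sketched like this. Everything here is hypothetical: the `TOOL_CALL:` marker, the `get_weather` tool and the canned `fake_model` are inventions for the example, and this is not the actual MCP message format.

```python
import json

TOOL_PREFIX = "TOOL_CALL:"  # hypothetical marker the runtime looks for


def get_weather(city):
    """Hypothetical tool with a canned answer."""
    return f"It is raining in {city}."


TOOLS = {"get_weather": get_weather}


def fake_model(text):
    """Stand-in for the LLM: request a tool once, then answer using the tool result."""
    if "TOOL_RESULT:" not in text:
        call = {"name": "get_weather", "args": {"city": "Ghent"}}
        return TOOL_PREFIX + json.dumps(call)
    return "Take an umbrella."


def run(prompt, max_steps=5):
    """Let the model generate; execute detected tool calls and feed the results back."""
    text = prompt
    for _ in range(max_steps):
        output = fake_model(text)
        if not output.startswith(TOOL_PREFIX):
            return text + "\n" + output  # plain answer: done
        call = json.loads(output[len(TOOL_PREFIX):])
        result = TOOLS[call["name"]](**call["args"])
        text += f"\n{output}\nTOOL_RESULT: {result}"  # append and continue
    return text


print(run("Do I need an umbrella today?"))
```

Note that the model never executes anything itself: it only produces text, and the surrounding software decides what that text triggers.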

Before MCP the world was (and still is) using Retrieval Augmented Generation (RAG). The idea behind RAG is that, before the LLM responds to a user's query (prompt) it also receives new information from external data sources. With both the user query and information from the sources, the LLM is able to generate more useful output.

When I looked at RAG, I noticed it using embeddings as well prior to the actual retrieval, so I wrongly thought that those are the same embeddings, and that the outcome of the RAG would be an embedding matrix as well, that the LLM then receives and further processes...

Incorrect RAG view

I was misled by documentation on RAG stating things like "the data to be referenced is converted into LLM embeddings", and by the fact that the technology used for RAG retrieval is vector databases specialized for embedding-based operations. Many online resources also looked at RAG as a complete, singular solution with multiple components. So I jumped to the conclusion that these are the same embeddings. But that would mean the RAG solution would be tailored to the LLM being used, because other LLMs (like Llama3, or Mistral) use a different embedding vocabulary.

Instead, what RAG does is take the same prompt, convert it into tokens and embeddings (using its own tokenizer/embedding vocabulary), and then use those to perform a search operation against the data that was added to the RAG database. This data (the recent insights or other documents you want your LLM to know about) is also tokenized and converted into embeddings. However, it is not those embeddings that are brought back to the main LLM, but the plain-text outcome (or other media types that your LLM understands, such as images).
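The key point, that vectors are only used to find entries while plain text is what comes back, can be sketched with a tiny in-memory store. The two-axis `embed_for_search` function and its lexicon are hypothetical stand-ins for a real embedding model, and they have nothing to do with any LLM's own embeddings.

```python
import math
import re

# Hypothetical retrieval-side embedder with two hand-picked axes:
# (financial regulation, container platforms).
LEXICON = {
    "dora": (1.0, 0.0), "regulation": (1.0, 0.0), "financial": (1.0, 0.0),
    "banks": (1.0, 0.0), "kubernetes": (0.0, 1.0), "containers": (0.0, 1.0),
    "pods": (0.0, 1.0),
}


def embed_for_search(text):
    """Average the vectors of all known words; unknown words are ignored."""
    hits = [LEXICON[w] for w in re.findall(r"[a-z0-9]+", text.lower()) if w in LEXICON]
    if not hits:
        return (0.0, 0.0)
    return tuple(sum(axis) / len(hits) for axis in zip(*hits))


DOCUMENTS = [
    "DORA is a regulation for the financial sector.",
    "Kubernetes schedules containers across a cluster.",
]
# The store keeps the vector for searching and the text for returning.
KNOWLEDGE_BASE = [{"text": d, "vector": embed_for_search(d)} for d in DOCUMENTS]


def retrieve(query, top_k=1):
    """Return the plain text of the entries whose vectors are nearest to the query."""
    query_vector = embed_for_search(query)
    ranked = sorted(KNOWLEDGE_BASE, key=lambda e: math.dist(query_vector, e["vector"]))
    return [entry["text"] for entry in ranked[:top_k]]


print(retrieve("Which pods run our workloads?"))
```

The query shares no word with the document it finds, and the caller only ever sees strings, so any LLM can consume the result.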

Why does RAG then use embeddings? Wouldn't a simple search engine be sufficient? Well, RAG's primary advantage is its ability to locate relevant information more effectively through embeddings. Thanks to the embedding representation, RAG can find information that is related to the user query without relying on keyword matches. You could effectively replace the RAG engine with a simple search - and many LLM-powered software applications do support this. For instance, Koboldcpp, which I use to run LLMs locally, supports a simple DuckDuckGo-based websearch as well.

RAG view

The use of embeddings for search operations (again, completely independent of the LLM) allows for contextual understanding. When a user prompts for "What are the ingredients for Corona", a simple keyword-based search operation might incorrectly result in findings of COVID-19, whereas in this case the query is about the Corona beer.
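The Corona example can be played through with two toy scoring functions. The word vectors (axes: viral disease, beverage) and both documents are invented for this sketch; a real system would use a trained embedding model instead of a hand-written table.

```python
import math
import re

# Invented two-axis word vectors: (viral disease, beverage).
WORD_VECTORS = {
    "corona": (0.5, 0.5), "ingredients": (0.0, 1.0),
    "virus": (1.0, 0.0), "infections": (1.0, 0.0),
    "lager": (0.0, 1.0), "brewed": (0.0, 1.0), "hops": (0.0, 1.0),
}

DOCUMENTS = [
    "Corona infections are caused by a virus.",
    "This pale lager is brewed from malt, hops, yeast and water.",
]


def words(text):
    return re.findall(r"[a-z]+", text.lower())


def keyword_score(query, document):
    """Count the words that query and document literally share."""
    return len(set(words(query)) & set(words(document)))


def embed(text):
    """Sum the vectors of all known words in the text."""
    vectors = [WORD_VECTORS[w] for w in words(text) if w in WORD_VECTORS]
    return tuple(sum(axis) for axis in zip(*vectors)) if vectors else (0.0, 0.0)


def semantic_score(query, document):
    """Cosine similarity between the embedded query and the embedded document."""
    a, b = embed(query), embed(document)
    norm = math.hypot(*a) * math.hypot(*b)
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def best(score, query):
    return max(DOCUMENTS, key=lambda document: score(query, document))


QUERY = "What are the ingredients for Corona"
print("keyword :", best(keyword_score, QUERY))
print("semantic:", best(semantic_score, QUERY))
```

The keyword search latches onto the literal word "Corona", while the embedding-based score is pulled towards the beverage document by "ingredients", even though that document never mentions the brand.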

These improved search operations are often called "semantic search", as they have a better understanding of the semantics and meanings of text (through the embeddings), resulting in more contextually relevant insights.

When is it "RAG" and when semantic search

Retrieval Augmented Generation is the process of converting the user query, performing a semantic search against the knowledge base, and appending the best results (e.g. top-3 hits in the knowledge base) to the user input text. This completed input text thus contains both the user query, as well as pieces of insights obtained from the semantic search. The LLM uses this additional information for generating better outcomes. This entire pipeline (retrieving context, augmenting the prompt, and then generating output) is what defines "RAG".
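The three stages (retrieve, augment, generate) fit in one short function. This is a sketch, not a reference implementation: the prompt template is an assumption, and `fake_search` and `fake_generate` are placeholders for a real semantic search and a real LLM call.

```python
def build_prompt(query, snippets):
    """Augment the user query with retrieved snippets, all as plain text."""
    context = "\n".join(f"- {snippet}" for snippet in snippets)
    return f"Context:\n{context}\n\nQuestion: {query}"


def rag_answer(query, search, generate, top_k=3):
    """Retrieve the best hits, augment the prompt, then let the LLM generate."""
    snippets = search(query)[:top_k]         # 1. retrieve
    prompt = build_prompt(query, snippets)   # 2. augment
    return generate(prompt)                  # 3. generate


def fake_search(query):
    """Placeholder search returning hits ordered from best to worst."""
    return ["snippet A", "snippet B", "snippet C", "snippet D"]


def fake_generate(prompt):
    """Placeholder for the LLM call."""
    return f"[answer based on a prompt of {len(prompt)} characters]"


print(build_prompt("What is DORA?", fake_search("What is DORA?")[:3]))
print(rag_answer("What is DORA?", fake_search, fake_generate))
```

Because `search` and `generate` are passed in as functions, swapping the semantic search for a web search, or one LLM for another, leaves the pipeline itself untouched.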

I personally see RAG as being, technology-wise, very similar to a regular search: replace the semantic search with a search engine (which under the hood could also use semantic search anyway) and the outcome is the same. The main difference is that RAG is meant for finding exact truth, information snippets tailored to bring context information accurately, whereas a search-engine-based retrieval would rather bring snippets of data back.

In the market, RAG also focuses on the management of the semantic search (and vector database), optimizing the data that is added to the knowledge base to be LLM-friendly (shorter pieces of accurate data, rather than fully-indexed complete pages which could easily overload the maximum size that an LLM can handle). It prioritizes efficient data management and insights lifecycle control.

For LLMs, it also provides a bit more nuance. A web search would be presented to the LLM as "The following information can be useful to answer the question", whereas RAG results would be presented as actual insights/context. LLMs might be trained to deal differently with that distinction.

Understanding that the semantic search is independent of the LLM of course makes much more sense. It allows companies or organizations to build up a knowledge base and maintain this knowledge independent of the LLMs. Multiple different LLMs can then use RAG to obtain the latest information from this knowledge base - or you can just use the engine for semantic searches alone: you do not need LLMs to get beneficial searches. Many popular web search engines use semantic search under the hood (i.e. when they index pages, they also generate the embeddings from it and store those in their own vector databases to improve search results).

When new embedding algorithms emerge that you want to use, you must re-generate the embeddings for the entire knowledge base. But that will most likely occur much, much less frequently than using new LLM models (given the rapid evolution here).
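Why everything must be re-generated becomes clearer in a sketch: vectors from two different embedders are not comparable (here they even differ in length), so the knowledge base has to keep the plain text and re-embed from it. The embedders `embed_v1` and `embed_v2` are deliberately silly, hypothetical placeholders.

```python
def embed_v1(text):
    """Hypothetical old embedder: one dimension."""
    return [float(len(text))]


def embed_v2(text):
    """Hypothetical new embedder: two dimensions."""
    return [float(len(text)), float(text.count(" ") + 1)]


EMBEDDERS = {"v1": embed_v1, "v2": embed_v2}


def index(texts, version):
    """Embed every text and remember which embedder produced each vector."""
    embed = EMBEDDERS[version]
    return [{"text": t, "version": version, "vector": embed(t)} for t in texts]


def reindex(entries, version):
    """Re-embed every entry from its stored plain text using a different embedder."""
    return index([entry["text"] for entry in entries], version)


old = index(["a b", "c"], "v1")
new = reindex(old, "v2")
print(old[0]["vector"], "->", new[0]["vector"])
```

Storing the embedder version next to each vector is a design choice of this sketch: it makes it easy to spot entries that still need re-embedding when a migration is done in batches.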

Conclusion

RAG is a feature of the software that runs the LLM, allowing for retrieving contextual information from a curated knowledge base. RAG's use of embeddings is related to its semantic search, not to the same embeddings as those used by the LLM. The contextual information is added to the user prompt as text, and only then 'converted' into the embeddings used by the LLM itself.

Feedback? Comments? Don't hesitate to get in touch on Mastodon.

Images are created in Inkscape, using icons from Streamline (GitHub), released under the CC BY 4.0 license, indexed at OpenSVG.

ai ai embedding rag

Hypergovernance is a bad thing, but do not dismiss optimal governance

by Sven Vermeulen, post on Thu 11 September 2025

I once read a blurb about the benefits of bureaucracy, and how it is intended to resist political influences, autocratic leadership, priority-of-the-day decision-making, silo'ed views, and more things that we generally see as "Bad Things™". I'm sad that I can't recall where it was, but its message was similar to what The Benefits Of Bureaucracy: How I Learned To Stop Worrying And Love Red Tape by Rita McGrath presents. When I read it, I was strangely supportive of the message, because I am very much confronted with, and perhaps also often the cause of, bureaucracy and governance-related deliverables in the company that I work for.

regulation dora

Is IT a DORA CIF?

by Sven Vermeulen, post on Mon 27 January 2025

Core to the Digital Operational Resilience Act is the notion of a critical or important function. When a function is deemed critical or important, DORA expects the company or group to take precautions and measures to ensure the resilience of the company and the markets in which it is active.

But what exactly is a function? When do we consider it critical or important? Is there a differentiation between critical and important? Can an IT function be a critical or important function?

regulation dora

Digital Operational Resilience Act

by Sven Vermeulen, post on Sun 12 January 2025

One of the topics that most financial institutions are (still) currently working on is their compliance with a piece of European legislation called DORA. This abbreviation stands for "Digital Operational Resilience Act", and it is a European regulation. European regulations apply automatically and uniformly across all EU countries. This is unlike another recent piece of legislation called NIS2, the "Network and Information Security" directive. As an EU directive, NIS2 requires the EU countries to transpose the directive into local law. As a result, different EU countries can have a slightly different implementation.

The DORA regulation applies to the EU financial sector, and has some strict requirements in it that companies' IT stakeholders are affected by. It doesn't often sugar-coat things like some frameworks do. This has the advantage that its "interpretation flexibility" is quite reduced - but not zero of course. Yet, that advantage is also a disadvantage: financial entities might have had different strategies covering their resiliency, and now need to adjust their strategy.

regulation dora

Diagrams are no communication channel

by Sven Vermeulen, post on Thu 05 September 2024

IT architects generally use architecture-specific languages or modeling techniques to document their thoughts and designs. ArchiMate, the framework I have the most experience with, is a specialized enterprise architecture modeling language. It is maintained by The Open Group, an organization known for its broad architecture framework titled TOGAF.

My stance, however, is that architects should not use the diagrams from their architecture modeling framework to convey their message to every stakeholder out there...

architecture architecture

Sustainability in IT

by Sven Vermeulen, post on Sun 25 September 2022

For one of the projects I'm currently involved in, we want to have a better view on sustainability within IT and see what we (IT) can contribute in light of the sustainability strategy of the company. For IT infrastructure, one would think that selecting more power-efficient infrastructure is the way to go, as well as selecting products whose manufacturing process takes special attention to sustainability.

There are other areas to consider as well, though. Reusability of IT infrastructure and optimal resource consumption are at least two other attention points that deserve plenty of attention. But let's start at the manufacturing process...

architecture sustainability

Getting lost in the frameworks

by Sven Vermeulen, post on Fri 26 August 2022

The IT world is littered with frameworks, best practices, reference architectures and more. In an ever-lasting attempt to standardize IT, we often get lost in too many standards or specifications. For consultants, this is a gold-mine, as they jump in to support companies - for a fee, naturally - in adopting one or more of these frameworks or specifications.

While having references and specifications isn't a bad thing, there are always pros and cons.

architecture framework cmmi iso

Containers are the new IaaS

by Sven Vermeulen, post on Sat 21 May 2022

At work, as with many other companies, we're actively investing in new platforms, including container platforms and public cloud. We use Kubernetes-based container platforms both on-premises and in the cloud, but are also very adamant that the container platforms should only be used for application workloads that are correctly designed for cloud-native deployments: we do not want to see vendors packaging full operating systems in a container and then shouting they are now container-ready.

architecture kubernetes container iaas infrastructure virtual-machine

Defining what an IT asset is

by Sven Vermeulen, post on Sun 13 February 2022

One of the main IT processes that a company should strive to have in place is a decent IT asset management system. It facilitates knowing what assets you own, where they are, who the owner is, and provides a foundation for numerous other IT processes.

However, when asking "what is an IT asset", it gets kind of fuzzy...

architecture asset-management cobit itil

An IT conceptual data model

by Sven Vermeulen, post on Mon 17 January 2022

This time a much shorter post, as I've been asked to share this information recently and found that it, by itself, is already useful enough to publish. It is a conceptual data model for IT services.

architecture cdm asset-management configuration-management