LLMs · 9 min read · February 26, 2026

What is RAG? Retrieval-Augmented Generation Explained

Understand how RAG (Retrieval-Augmented Generation) works, why it solves LLM hallucination, and how to build your own RAG system from scratch.


Soumyajit Sarkar

Partner & CTO, Greensolz

The Problem RAG Solves

Large language models have two fundamental limitations:

  • Knowledge cutoff — they only know what was in their training data. Ask a model about events after its cutoff date and it can't help.
  • Hallucination — when they don't know something, they confidently make up plausible-sounding answers instead of saying "I don't know."

RAG — Retrieval-Augmented Generation — solves both problems by giving the LLM access to external knowledge at inference time. Instead of relying solely on training data, the system retrieves relevant documents and includes them in the prompt, grounding the response in factual information.

How RAG Works: The Three-Stage Pipeline

Stage 1: Indexing (Offline)

Before the system can retrieve information, it needs to process and store your documents:

  1. Document loading — ingest PDFs, web pages, databases, APIs, or any text source
  2. Chunking — split documents into smaller pieces (typically 200-1000 tokens). Chunk size matters: too small loses context, too large reduces retrieval precision
  3. Embedding — convert each chunk into a dense vector using an embedding model (OpenAI text-embedding-3, Cohere embed-v3, or open-source alternatives like BGE)
  4. Storage — store vectors in a vector database (Pinecone, Chroma, Qdrant, Weaviate, or pgvector)
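
The four indexing steps can be sketched end to end in plain Python. The `embed` function below is a hypothetical stand-in (a character-trigram hashing trick) so the example runs without an API key; a real system would call an embedding model such as OpenAI text-embedding-3, and the "vector store" would be a real vector database rather than a list:

```python
# Offline indexing sketch: chunk -> embed -> store.

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy stand-in for a real embedding model: hash character trigrams into a vector."""
    vec = [0.0] * dim
    t = text.lower()
    for i in range(len(t) - 2):
        vec[hash(t[i:i + 3]) % dim] += 1.0
    return vec

def chunk(text: str, size: int = 40, overlap: int = 8) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step) if words[i:i + size]]

# The 'vector store' here is just an in-memory list of (vector, chunk_text) pairs.
doc = "RAG systems retrieve relevant chunks and include them in the prompt. " * 20
store = [(embed(c), c) for c in chunk(doc)]
```

In production the only structural change is swapping the stubs: `embed` becomes an API or model call, and `store` becomes an upsert into Pinecone, Chroma, Qdrant, or pgvector.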

Stage 2: Retrieval (Online)

When a user asks a question:

  1. Query embedding — convert the user's question into a vector using the same embedding model
  2. Similarity search — find the K most similar document chunks using cosine similarity or dot product
  3. Reranking (optional) — use a cross-encoder model to re-score retrieved chunks for more accurate relevance ranking
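
The core of the retrieval step is a nearest-neighbor search over stored vectors. A brute-force version with cosine similarity looks like this (the 2-D "embeddings" are made up for illustration; a real system would use a vector database's approximate index rather than scanning every vector):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], store: list[tuple], k: int = 3) -> list[str]:
    """store: list of (vector, chunk_text) pairs; returns the k most similar chunks."""
    scored = sorted(store, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in scored[:k]]

# Hand-made 2-D 'embeddings' for illustration:
store = [
    ([1.0, 0.0], "about cats"),
    ([0.0, 1.0], "about finance"),
    ([0.9, 0.1], "cat care tips"),
]
print(top_k([1.0, 0.1], store, k=2))  # -> ['cat care tips', 'about cats']
```

With normalized embeddings, cosine similarity and dot product give the same ranking, which is why many systems normalize once at indexing time and use the cheaper dot product.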

Stage 3: Generation

Combine the retrieved context with the user's question into a prompt:

  1. Prompt construction — "Based on the following context, answer the user's question. Context: [retrieved chunks]. Question: [user query]"
  2. LLM generation — the model generates a response grounded in the retrieved information
  3. Citation — include source references so users can verify the answer
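
A minimal version of that prompt assembly, with numbered citations so the model can reference sources, might look like this (the chunk format and prompt wording are illustrative, not a fixed standard):

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt from retrieved chunks with [n]-style citations."""
    context = "\n\n".join(
        f"[{i + 1}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_prompt(
    "What is RAG?",
    [{"source": "intro.md", "text": "RAG retrieves documents and grounds the answer."}],
)
```

The instruction to admit when context is insufficient matters: without it, the model tends to fall back on parametric knowledge and the hallucination problem creeps back in.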

Key Components Deep Dive

Embeddings

Embeddings are the heart of RAG. They map text to points in a high-dimensional space where semantically similar texts are close together. "How to train a neural network" and "Steps for building a deep learning model" would have similar embeddings even though they share few words.

The quality of your embedding model directly determines retrieval quality. OpenAI's text-embedding-3-large (3072 dimensions) is a strong choice. For open-source, BGE-large and E5-large perform well.

Vector Databases

Vector databases are optimized for similarity search at scale. They use algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) to find nearest neighbors efficiently without scanning every vector.

  • Pinecone — fully managed, scales automatically, great for production
  • Chroma — open-source, easy to start with, good for prototyping
  • Qdrant — open-source with advanced filtering capabilities
  • pgvector — PostgreSQL extension, good if you're already using Postgres

Chunking Strategies

How you split documents matters enormously:

  • Fixed-size chunks — simple but may split mid-sentence or mid-concept
  • Recursive text splitting — split by paragraphs, then sentences, then characters. Preserves natural boundaries
  • Semantic chunking — use embedding similarity to detect topic boundaries
  • Overlap — include 10-20% overlap between chunks to preserve context at boundaries
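
Recursive splitting can be sketched in a few lines, in the spirit of LangChain's RecursiveCharacterTextSplitter: try paragraph breaks first, then sentence breaks, then fall back to a hard character cut (sizes here are in characters; the separator list is an assumption, real splitters use a longer one):

```python
def recursive_split(text: str, max_len: int = 200, seps=("\n\n", ". ")) -> list[str]:
    """Split text at the most natural boundary available, recursing on oversized pieces."""
    if len(text) <= max_len:
        return [text]
    if seps:
        sep, rest = seps[0], seps[1:]
        parts = [p for p in text.split(sep) if p.strip()]
        if len(parts) > 1:
            chunks = []
            for part in parts:
                chunks.extend(recursive_split(part, max_len, rest))
            return chunks
        # This separator didn't help; try the next, finer-grained one.
        return recursive_split(text, max_len, rest)
    # Last resort: hard cut every max_len characters.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

This sketch omits overlap for brevity; adding it means re-attaching the tail of each chunk to the head of the next, as in the indexing example earlier.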

Advanced RAG Techniques

Hybrid Search

Combine vector similarity search with traditional keyword search (BM25). Vector search excels at semantic matching; keyword search catches exact terms, names, and codes. Most production RAG systems use both with a weighted combination.
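
One common way to fuse the two score sets is to min-max normalize each per query, then blend with a weight alpha. The scores below are made up to show the mechanics; alpha = 0.7 is a typical starting point, not a universal constant:

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so vector and keyword scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid(vector_scores: dict, keyword_scores: dict, alpha: float = 0.7) -> list[str]:
    """Blend normalized scores; alpha weights semantic vs keyword relevance."""
    v, k = minmax(vector_scores), minmax(keyword_scores)
    docs = set(v) | set(k)
    fused = {d: alpha * v.get(d, 0.0) + (1 - alpha) * k.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

ranking = hybrid(
    {"doc1": 0.92, "doc2": 0.85, "doc3": 0.40},   # cosine similarities
    {"doc2": 12.5, "doc3": 9.1, "doc1": 0.3},     # BM25-style keyword scores
)
```

An alternative that avoids score normalization entirely is Reciprocal Rank Fusion, which combines the two result lists by rank position instead of raw score.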

Query Transformation

Improve retrieval by transforming the user's query:

  • Query expansion — use an LLM to generate multiple variations of the query
  • HyDE — generate a hypothetical answer first, then use its embedding to search (surprisingly effective)
  • Step-back prompting — ask a broader question to retrieve more comprehensive context
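
HyDE's control flow is simple: instead of embedding the question, embed a hypothetical answer generated by an LLM. In this sketch `llm` and `embed` are hard-coded stubs standing in for real model calls; only the flow is the point:

```python
def llm(prompt: str) -> str:
    # Stand-in for a real completion call.
    return "RAG combines retrieval with generation to ground LLM answers."

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model (returns a toy 2-D vector).
    return [float(len(text) % 7), float(text.count(" "))]

def hyde_query_vector(question: str) -> list[float]:
    """HyDE: generate a hypothetical answer, then search with ITS embedding."""
    hypothetical = llm(f"Write a short passage that answers: {question}")
    return embed(hypothetical)  # use this instead of embed(question)
```

The intuition behind its effectiveness: a hypothetical answer lives in the same region of embedding space as the real answer passages, whereas a short question often does not.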

Multi-step RAG (Agentic RAG)

For complex questions, a single retrieval step isn't enough. An AI agent can:

  1. Break the question into sub-questions
  2. Retrieve information for each sub-question
  3. Synthesize a comprehensive answer from multiple retrieval rounds
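
The loop structure of those three steps is straightforward; in this sketch `decompose`, `retrieve`, and `synthesize` are deterministic stubs where a real agent would make LLM and vector-store calls:

```python
def decompose(question: str) -> list[str]:
    # In practice an LLM proposes sub-questions; hard-coded here.
    return [f"{question} (definition)", f"{question} (examples)"]

def retrieve(sub_q: str) -> list[str]:
    # Stand-in for a vector-store query.
    return [f"chunk answering '{sub_q}'"]

def synthesize(question: str, evidence: list[str]) -> str:
    # Stand-in for a final LLM call over all gathered evidence.
    return f"Answer to '{question}' based on {len(evidence)} retrieved chunks."

def agentic_rag(question: str) -> str:
    """One retrieval round per sub-question, then a single synthesis pass."""
    evidence = []
    for sub_q in decompose(question):
        evidence.extend(retrieve(sub_q))
    return synthesize(question, evidence)
```

Real agentic systems also add a stopping check between rounds (is the gathered evidence sufficient?) so the loop can retrieve again with a refined query instead of synthesizing too early.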

When to Use RAG vs Fine-Tuning

| Aspect | RAG | Fine-Tuning |
| --- | --- | --- |
| Knowledge freshness | Always current (just update docs) | Frozen at training time |
| Source attribution | Easy — can cite exact sources | Difficult — knowledge baked into weights |
| Cost | Per-query retrieval + LLM cost | High upfront training cost |
| Best for | Factual Q&A, customer support, docs | Style/tone adaptation, domain expertise |
| Hallucination | Significantly reduced | Still possible |

In practice, the best systems combine both: fine-tune for domain-specific reasoning, use RAG for factual grounding.

Building Your First RAG System

A minimal RAG system in Python requires about 30 lines of code with LangChain:

  1. Load documents with a document loader
  2. Split into chunks with RecursiveCharacterTextSplitter
  3. Create embeddings and store in Chroma
  4. Create a retrieval chain with an LLM
  5. Query and get grounded answers
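
Since LangChain's APIs change between releases, here is a library-free sketch of the same five steps, with stub `embed` and `generate` functions so it runs anywhere; each stub maps onto a LangChain component (loader/splitter, embedding model, vector store, LLM chain):

```python
import math

def embed(text: str, dim: int = 32) -> list[float]:
    vec = [0.0] * dim  # toy hashing 'embedding', not a real model
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def generate(prompt: str) -> str:
    return f"(LLM answer grounded in prompt of {len(prompt)} chars)"  # stub

docs = [
    "RAG retrieves relevant chunks and adds them to the prompt.",
    "Fine-tuning bakes knowledge into model weights.",
]
store = [(embed(d), d) for d in docs]          # steps 1-3: load, split, embed, store

def ask(question: str, k: int = 1) -> str:     # steps 4-5: retrieve, then generate
    qv = embed(question)
    best = sorted(store, key=lambda item: cosine(qv, item[0]), reverse=True)[:k]
    context = "\n".join(text for _, text in best)
    return generate(f"Context:\n{context}\n\nQuestion: {question}")

answer = ask("How does RAG ground answers?")
```

Swapping the stubs for real components (an embedding model, a vector database, an LLM client) changes the individual lines but not the shape of the pipeline.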

Learn to build production RAG systems in our RAG Systems Deep Dive lesson, which covers advanced retrieval strategies, evaluation, and deployment. Get full access to all 31 lessons and start building AI applications today.

