LLMs · 9 min read · February 26, 2026

What is RAG? Retrieval-Augmented Generation Explained

Understand how RAG (Retrieval-Augmented Generation) works, why it solves LLM hallucination, and how to build your own RAG system from scratch.


Soumyajit Sarkar

Partner & CTO, Greensolz

The Problem RAG Solves

Large language models have two fundamental limitations:

  • Knowledge cutoff — they only know what was in their training data. Ask a model about events after its cutoff date and it can't help.
  • Hallucination — when they don't know something, they confidently make up plausible-sounding answers instead of saying "I don't know."

RAG — Retrieval-Augmented Generation — solves both problems by giving the LLM access to external knowledge at inference time. Instead of relying solely on training data, the system retrieves relevant documents and includes them in the prompt, grounding the response in factual information.

How RAG Works: The Three-Stage Pipeline

Stage 1: Indexing (Offline)

Before the system can retrieve information, it needs to process and store your documents:

  1. Document loading — ingest PDFs, web pages, databases, APIs, or any text source
  2. Chunking — split documents into smaller pieces (typically 200-1000 tokens). Chunk size matters: too small loses context, too large reduces retrieval precision
  3. Embedding — convert each chunk into a dense vector using an embedding model (OpenAI text-embedding-3, Cohere embed-v3, or open-source alternatives like BGE)
  4. Storage — store vectors in a vector database (Pinecone, Chroma, Qdrant, Weaviate, or pgvector)
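
The four indexing steps can be sketched end to end in plain Python. The `embed` function below is a hypothetical stand-in (a character-trigram hashing trick) so the example runs without an API key; a real system would call an embedding model such as OpenAI text-embedding-3, and the "vector store" would be a real vector database rather than a list:

```python
# Offline indexing sketch: chunk -> embed -> store.

def embed(text: str, dim: int = 64) -> list[float]:
    """Toy stand-in for a real embedding model: hash character trigrams into a vector."""
    vec = [0.0] * dim
    t = text.lower()
    for i in range(len(t) - 2):
        vec[hash(t[i:i + 3]) % dim] += 1.0
    return vec

def chunk(text: str, size: int = 40, overlap: int = 8) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step) if words[i:i + size]]

# The 'vector store' here is just an in-memory list of (vector, chunk_text) pairs.
doc = "RAG systems retrieve relevant chunks and include them in the prompt. " * 20
store = [(embed(c), c) for c in chunk(doc)]
```

In production the only structural change is swapping the stubs: `embed` becomes an API or model call, and `store` becomes an upsert into Pinecone, Chroma, Qdrant, or pgvector.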

Stage 2: Retrieval (Online)

When a user asks a question:

  1. Query embedding — convert the user's question into a vector using the same embedding model
  2. Similarity search — find the K most similar document chunks using cosine similarity or dot product
  3. Reranking (optional) — use a cross-encoder model to re-score retrieved chunks for more accurate relevance ranking
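
The core of the retrieval step is a nearest-neighbor search over stored vectors. A brute-force version with cosine similarity looks like this (the 2-D "embeddings" are made up for illustration; a real system would use a vector database's approximate index rather than scanning every vector):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: list[float], store: list[tuple], k: int = 3) -> list[str]:
    """store: list of (vector, chunk_text) pairs; returns the k most similar chunks."""
    scored = sorted(store, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in scored[:k]]

# Hand-made 2-D 'embeddings' for illustration:
store = [
    ([1.0, 0.0], "about cats"),
    ([0.0, 1.0], "about finance"),
    ([0.9, 0.1], "cat care tips"),
]
print(top_k([1.0, 0.1], store, k=2))  # -> ['cat care tips', 'about cats']
```

With normalized embeddings, cosine similarity and dot product give the same ranking, which is why many systems normalize once at indexing time and use the cheaper dot product.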

Stage 3: Generation

Combine the retrieved context with the user's question into a prompt:

  1. Prompt construction — "Based on the following context, answer the user's question. Context: [retrieved chunks]. Question: [user query]"
  2. LLM generation — the model generates a response grounded in the retrieved information
  3. Citation — include source references so users can verify the answer
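
A minimal version of that prompt assembly, with numbered citations so the model can reference sources, might look like this (the chunk format and prompt wording are illustrative, not a fixed standard):

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt from retrieved chunks with [n]-style citations."""
    context = "\n\n".join(
        f"[{i + 1}] ({c['source']}) {c['text']}" for i, c in enumerate(chunks)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources as [n]. If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

prompt = build_prompt(
    "What is RAG?",
    [{"source": "intro.md", "text": "RAG retrieves documents and grounds the answer."}],
)
```

The instruction to admit when context is insufficient matters: without it, the model tends to fall back on parametric knowledge and the hallucination problem creeps back in.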

Key Components Deep Dive

Embeddings

Embeddings are the heart of RAG. They map text to points in a high-dimensional space where semantically similar texts are close together. "How to train a neural network" and "Steps for building a deep learning model" would have similar embeddings even though they share few words.

The quality of your embedding model directly determines retrieval quality. OpenAI's text-embedding-3-large (3072 dimensions) is a strong choice. For open-source, BGE-large and E5-large perform well.

Vector Databases

Vector databases are optimized for similarity search at scale. They use algorithms like HNSW (Hierarchical Navigable Small World) or IVF (Inverted File Index) to find nearest neighbors efficiently without scanning every vector.

  • Pinecone — fully managed, scales automatically, great for production
  • Chroma — open-source, easy to start with, good for prototyping
  • Qdrant — open-source with advanced filtering capabilities
  • pgvector — PostgreSQL extension, good if you're already using Postgres

Chunking Strategies

How you split documents matters enormously:

  • Fixed-size chunks — simple but may split mid-sentence or mid-concept
  • Recursive text splitting — split by paragraphs, then sentences, then characters. Preserves natural boundaries
  • Semantic chunking — use embedding similarity to detect topic boundaries
  • Overlap — include 10-20% overlap between chunks to preserve context at boundaries
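
Recursive splitting can be sketched in a few lines, in the spirit of LangChain's RecursiveCharacterTextSplitter: try paragraph breaks first, then sentence breaks, then fall back to a hard character cut (sizes here are in characters; the separator list is an assumption, real splitters use a longer one):

```python
def recursive_split(text: str, max_len: int = 200, seps=("\n\n", ". ")) -> list[str]:
    """Split text at the most natural boundary available, recursing on oversized pieces."""
    if len(text) <= max_len:
        return [text]
    if seps:
        sep, rest = seps[0], seps[1:]
        parts = [p for p in text.split(sep) if p.strip()]
        if len(parts) > 1:
            chunks = []
            for part in parts:
                chunks.extend(recursive_split(part, max_len, rest))
            return chunks
        # This separator didn't help; try the next, finer-grained one.
        return recursive_split(text, max_len, rest)
    # Last resort: hard cut every max_len characters.
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]
```

This sketch omits overlap for brevity; adding it means re-attaching the tail of each chunk to the head of the next, as in the indexing example earlier.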

Advanced RAG Techniques

Hybrid Search

Combine vector similarity search with traditional keyword search (BM25). Vector search excels at semantic matching; keyword search catches exact terms, names, and codes. Most production RAG systems use both with a weighted combination.
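
One common way to fuse the two score sets is to min-max normalize each per query, then blend with a weight alpha. The scores below are made up to show the mechanics; alpha = 0.7 is a typical starting point, not a universal constant:

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so vector and keyword scores are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid(vector_scores: dict, keyword_scores: dict, alpha: float = 0.7) -> list[str]:
    """Blend normalized scores; alpha weights semantic vs keyword relevance."""
    v, k = minmax(vector_scores), minmax(keyword_scores)
    docs = set(v) | set(k)
    fused = {d: alpha * v.get(d, 0.0) + (1 - alpha) * k.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

ranking = hybrid(
    {"doc1": 0.92, "doc2": 0.85, "doc3": 0.40},   # cosine similarities
    {"doc2": 12.5, "doc3": 9.1, "doc1": 0.3},     # BM25-style keyword scores
)
```

An alternative that avoids score normalization entirely is Reciprocal Rank Fusion, which combines the two result lists by rank position instead of raw score.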

Query Transformation

Improve retrieval by transforming the user's query:

  • Query expansion — use an LLM to generate multiple variations of the query
  • HyDE — generate a hypothetical answer first, then use its embedding to search (surprisingly effective)
  • Step-back prompting — ask a broader question to retrieve more comprehensive context
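
HyDE's control flow is simple: instead of embedding the question, embed a hypothetical answer generated by an LLM. In this sketch `llm` and `embed` are hard-coded stubs standing in for real model calls; only the flow is the point:

```python
def llm(prompt: str) -> str:
    # Stand-in for a real completion call.
    return "RAG combines retrieval with generation to ground LLM answers."

def embed(text: str) -> list[float]:
    # Stand-in for a real embedding model (returns a toy 2-D vector).
    return [float(len(text) % 7), float(text.count(" "))]

def hyde_query_vector(question: str) -> list[float]:
    """HyDE: generate a hypothetical answer, then search with ITS embedding."""
    hypothetical = llm(f"Write a short passage that answers: {question}")
    return embed(hypothetical)  # use this instead of embed(question)
```

The intuition behind its effectiveness: a hypothetical answer lives in the same region of embedding space as the real answer passages, whereas a short question often does not.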

Multi-step RAG (Agentic RAG)

For complex questions, a single retrieval step isn't enough. An AI agent can:

  1. Break the question into sub-questions
  2. Retrieve information for each sub-question
  3. Synthesize a comprehensive answer from multiple retrieval rounds
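
The loop structure of those three steps is straightforward; in this sketch `decompose`, `retrieve`, and `synthesize` are deterministic stubs where a real agent would make LLM and vector-store calls:

```python
def decompose(question: str) -> list[str]:
    # In practice an LLM proposes sub-questions; hard-coded here.
    return [f"{question} (definition)", f"{question} (examples)"]

def retrieve(sub_q: str) -> list[str]:
    # Stand-in for a vector-store query.
    return [f"chunk answering '{sub_q}'"]

def synthesize(question: str, evidence: list[str]) -> str:
    # Stand-in for a final LLM call over all gathered evidence.
    return f"Answer to '{question}' based on {len(evidence)} retrieved chunks."

def agentic_rag(question: str) -> str:
    """One retrieval round per sub-question, then a single synthesis pass."""
    evidence = []
    for sub_q in decompose(question):
        evidence.extend(retrieve(sub_q))
    return synthesize(question, evidence)
```

Real agentic systems also add a stopping check between rounds (is the gathered evidence sufficient?) so the loop can retrieve again with a refined query instead of synthesizing too early.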

When to Use RAG vs Fine-Tuning

| Aspect | RAG | Fine-Tuning |
| --- | --- | --- |
| Knowledge freshness | Always current (just update docs) | Frozen at training time |
| Source attribution | Easy — can cite exact sources | Difficult — knowledge baked into weights |
| Cost | Per-query retrieval + LLM cost | High upfront training cost |
| Best for | Factual Q&A, customer support, docs | Style/tone adaptation, domain expertise |
| Hallucination | Significantly reduced | Still possible |

In practice, the best systems combine both: fine-tune for domain-specific reasoning, use RAG for factual grounding.

Building Your First RAG System

A minimal RAG system in Python requires about 30 lines of code with LangChain:

  1. Load documents with a document loader
  2. Split into chunks with RecursiveCharacterTextSplitter
  3. Create embeddings and store in Chroma
  4. Create a retrieval chain with an LLM
  5. Query and get grounded answers
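
Since LangChain's APIs change between releases, here is a library-free sketch of the same five steps, with stub `embed` and `generate` functions so it runs anywhere; each stub maps onto a LangChain component (loader/splitter, embedding model, vector store, LLM chain):

```python
import math

def embed(text: str, dim: int = 32) -> list[float]:
    vec = [0.0] * dim  # toy hashing 'embedding', not a real model
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def generate(prompt: str) -> str:
    return f"(LLM answer grounded in prompt of {len(prompt)} chars)"  # stub

docs = [
    "RAG retrieves relevant chunks and adds them to the prompt.",
    "Fine-tuning bakes knowledge into model weights.",
]
store = [(embed(d), d) for d in docs]          # steps 1-3: load, split, embed, store

def ask(question: str, k: int = 1) -> str:     # steps 4-5: retrieve, then generate
    qv = embed(question)
    best = sorted(store, key=lambda item: cosine(qv, item[0]), reverse=True)[:k]
    context = "\n".join(text for _, text in best)
    return generate(f"Context:\n{context}\n\nQuestion: {question}")

answer = ask("How does RAG ground answers?")
```

Swapping the stubs for real components (an embedding model, a vector database, an LLM client) changes the individual lines but not the shape of the pipeline.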

Learn to build production RAG systems in our RAG Systems Deep Dive lesson, which covers advanced retrieval strategies, evaluation, and deployment. Get full access to all 31 lessons and start building AI applications today.

