You're going to learn how to build a Retrieval-Augmented Generation (RAG) system with LangChain that lets an LLM reference your own documents instead of just hallucinating answers.

What is RAG and why it matters

RAG addresses a fundamental limitation: LLMs are trained on fixed datasets, so they can't access your private documents, recent information, or domain-specific knowledge. RAG closes that gap by:

  1. Storing your documents in a searchable format
  2. Retrieving relevant chunks when someone asks a question
  3. Generating an answer using those chunks as context

Think of it like giving the AI a library card instead of expecting it to memorize every book.

The RAG pipeline: 6 steps

Step 1: Load documents

First, get your content into LangChain's Document format.

from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader

# Load a PDF
pdf_loader = PyPDFLoader("https://example.com/research-paper.pdf")
pdf_docs = pdf_loader.load()

# Load a webpage
web_loader = WebBaseLoader("https://python.langchain.com/docs/introduction/")
web_docs = web_loader.load()

Each Document object has:

  • page_content - the actual text
  • metadata - source info (URL, page number, filename)

That metadata is crucial for debugging and showing sources later.
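
A quick way to see what each loader gives you (the exact keys vary by loader):

# Inspect the metadata attached to each Document
print(pdf_docs[0].metadata)   # PyPDFLoader sets e.g. 'source' and 'page'
print(web_docs[0].metadata)   # WebBaseLoader sets e.g. 'source' and 'title'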

Step 2: Split into chunks

Long documents need to be broken into smaller pieces. Too big = fuzzy retrieval. Too small = missing context.

from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=1000,      # ~1000 characters per chunk
    chunk_overlap=150,    # 150 chars overlap to preserve context
    separator="\n"
)

chunks = splitter.split_documents(pdf_docs)
print(f"Created {len(chunks)} chunks")

Pro tip: Start with 800-1500 characters and 10-20% overlap, then tune based on your content.
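
If you're unsure where to land in that range, a quick sweep over your own corpus shows how size affects chunk count (a minimal sketch; the sizes are just the suggested range above):

for size in (800, 1000, 1500):
    s = CharacterTextSplitter(
        chunk_size=size,
        chunk_overlap=size // 8,  # ~12% overlap, inside the 10-20% guideline
        separator="\n",
    )
    print(f"chunk_size={size}: {len(s.split_documents(pdf_docs))} chunks")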

Step 3: Create embeddings

Embeddings convert text into vectors (lists of numbers) that capture semantic meaning. Similar concepts have similar vectors.

from langchain_ibm import WatsonxEmbeddings
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames

embed_params = {
    # Cap inputs at the model's 512-token limit instead of erroring on long text
    EmbedTextParamsMetaNames.TRUNCATE_INPUT_TOKENS: 512,
    # Return the input text alongside each embedding (useful for debugging)
    EmbedTextParamsMetaNames.RETURN_OPTIONS: {"input_text": True},
}

embedding_model = WatsonxEmbeddings(
    model_id="ibm/slate-125m-english-rtrvr-v2",
    url="https://us-south.ml.cloud.ibm.com",
    project_id="your-project-id",
    params=embed_params,
)
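
To see this in action, embed two related questions and compare their vectors; the cosine similarity here is computed by hand to avoid extra dependencies (a quick sanity check, not a benchmark):

v1 = embedding_model.embed_query("How do I split long documents?")
v2 = embedding_model.embed_query("What is a good chunking strategy?")
print(len(v1))  # embedding dimension (768 for this slate model)

# Cosine similarity: values near 1.0 mean semantically similar text
dot = sum(a * b for a, b in zip(v1, v2))
norm = (sum(a * a for a in v1) * sum(b * b for b in v2)) ** 0.5
print(dot / norm)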

Step 4: Store in a vector database

Vector stores let you do semantic search - find chunks by meaning, not just keywords.

from langchain_community.vectorstores import Chroma

# Create the vector store and embed all chunks
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model
)

Chroma is easy for local development. For production, consider Pinecone, Weaviate, or Qdrant.
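
You can query the store directly before wiring up a retriever; similarity_search is part of LangChain's standard vector store interface:

# Semantic search: returns the k most similar chunks to the query
results = vectorstore.similarity_search("How does text splitting work?", k=2)
for doc in results:
    print(doc.metadata, "->", doc.page_content[:100])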

Step 5: Set up retrieval

Convert your vector store into a retriever that can fetch relevant chunks.

# Basic retriever - returns top 4 most similar chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

# Test it
docs = retriever.invoke("What is LangChain used for?")
print(docs[0].page_content)  # Most relevant chunk
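
Plain top-k is only one option. The same as_retriever call supports other search types; two common variants (tune the numbers to your data):

# MMR trades a little relevance for diversity among returned chunks
mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20},  # pick 4 diverse chunks from the top 20
)

# Drop anything below a similarity threshold instead of always returning k
threshold_retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5},
)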

Step 6: Build the QA chain

Now connect retrieval to your LLM so it can answer questions using your documents.

from langchain.chains import RetrievalQA
from langchain_ibm import WatsonxLLM

# Instantiate the LLM (any watsonx.ai generation model id available in your
# project works here; granite-13b-instruct-v2 is just an example)
llm = WatsonxLLM(
    model_id="ibm/granite-13b-instruct-v2",
    url="https://us-south.ml.cloud.ibm.com",
    project_id="your-project-id",
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",           # "stuff" = put all retrieved docs in the prompt
    retriever=retriever,
    return_source_documents=True  # Include sources in the response
)

# Ask a question
result = qa_chain.invoke({"query": "What are the main benefits of LangChain?"})
print(result["result"])
print(f"Sources: {result['source_documents']}")

Advanced: Hierarchical retrieval

The Parent Document Retriever solves a chunking dilemma: you want small chunks for precise retrieval, but you need larger context for the LLM.

from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

# Small chunks for embedding (precise retrieval)
child_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=20)

# Large chunks for context (what the LLM sees)
parent_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=20)

vectorstore = Chroma(
    collection_name="hierarchical_chunks",
    embedding_function=embedding_model
)

store = InMemoryStore()  # Stores parent documents

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

retriever.add_documents(pdf_docs)

# Now retrieval finds small chunks but returns large context
docs = retriever.invoke("Explain the architecture")
print(docs[0].page_content)  # Returns the full parent chunk

How it works:

  1. Small chunks get embedded for precise similarity search
  2. Retrieved small chunks point to their parent IDs
  3. Full parent chunks are returned to the LLM for context
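
You can verify this yourself by comparing what the vector store holds (child chunks) against what the retriever hands back (parents); a small check, assuming the setup above:

child = vectorstore.similarity_search("Explain the architecture")[0]
parent = retriever.invoke("Explain the architecture")[0]
print(len(child.page_content))   # ~400 chars: the small chunk that was embedded
print(len(parent.page_content))  # ~2000 chars: the parent chunk the LLM sees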

Common pitfalls and fixes

Problem 1: Retrieves wrong chunks

Fix: Increase k (return more chunks) or improve your chunking strategy. Try RecursiveCharacterTextSplitter, which respects paragraph boundaries, as sketched below.
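
It's a drop-in replacement for CharacterTextSplitter: it tries each separator in order (paragraphs, then lines, then words) before falling back to raw character splits:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=150,
    separators=["\n\n", "\n", " ", ""],  # tried in order, most structural first
)
chunks = splitter.split_documents(pdf_docs)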

Problem 2: Still hallucinates despite correct retrieval

Fix: Tighten the prompt. Add instructions like:

Answer only using the provided context. If the answer isn't in the context, say "I don't know based on the provided documents."
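
With RetrievalQA and the "stuff" chain, one way to enforce this is a custom prompt passed via chain_type_kwargs (a sketch; the template must keep the {context} and {question} variables the chain fills in):

from langchain.prompts import PromptTemplate

grounded_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Answer only using the provided context. If the answer isn't in the "
        "context, say \"I don't know based on the provided documents.\"\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    ),
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": grounded_prompt},
)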

Problem 3: Slow retrieval

Fix:

  • Index fewer documents initially
  • Use a faster embedding model
  • Add caching with Chroma(persist_directory="./db") (see the sketch below)
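
Persisting the index means embeddings are computed once and reloaded on later runs; a sketch of the caching pattern from the list above:

# First run: embed the chunks and write the index to disk
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./db",
)

# Later runs: load the existing index without re-embedding anything
vectorstore = Chroma(
    persist_directory="./db",
    embedding_function=embedding_model,
)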

Problem 4: Can't see what's being retrieved

Fix: Always set return_source_documents=True during development. Log retrieved chunks to see what the LLM is actually reading.
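
A lightweight way to do that is a small helper that hits the retriever directly, before any generation happens (the helper name is just illustrative):

def debug_retrieve(query):
    """Print what the retriever returns so you can eyeball relevance."""
    docs = retriever.invoke(query)
    for i, doc in enumerate(docs):
        print(f"[{i}] {doc.metadata.get('source', '?')}: {doc.page_content[:120]!r}")
    return docs

debug_retrieve("What are the main benefits of LangChain?")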

Testing your RAG system

# Quick test script
test_queries = [
    "What is this document about?",
    "What are the key findings?",
    "Who are the authors?",
]

for query in test_queries:
    result = qa_chain.invoke({"query": query})
    print(f"\nQ: {query}")
    print(f"A: {result['result']}")
    print(f"Sources: {[doc.metadata for doc in result['source_documents']]}")

Production checklist

Before deploying:

  • [ ] Test with edge cases (empty queries, irrelevant questions)
  • [ ] Add error handling for retrieval failures (a minimal wrapper is sketched below)
  • [ ] Implement query logging for debugging
  • [ ] Set up monitoring for retrieval quality
  • [ ] Cache embeddings to avoid recomputing
  • [ ] Add rate limiting if using external APIs
  • [ ] Test with different chunk sizes on your domain
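
A minimal wrapper covering the first three items, as referenced above (empty-query guard, error handling, and query logging; the logger setup is just one example):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("rag")

def answer(query: str) -> str:
    if not query or not query.strip():
        return "Please ask a question."
    try:
        result = qa_chain.invoke({"query": query})
        sources = [d.metadata.get("source") for d in result["source_documents"]]
        logger.info("query=%r sources=%s", query, sources)
        return result["result"]
    except Exception:
        logger.exception("RAG pipeline failed for query=%r", query)
        return "Sorry, something went wrong while retrieving an answer."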

Quick recap

  • RAG lets LLMs reference your documents instead of guessing
  • Pipeline: Load → Split → Embed → Store → Retrieve → Generate
  • Chunking is critical: 800-1500 chars, 10-20% overlap is a good start
  • Metadata preserves source info for debugging and citations
  • Hierarchical retrieval balances precision and context
  • Always log what's being retrieved during development

RAG transforms LLMs from impressive text generators into reliable knowledge assistants grounded in your data.