You're going to learn how to build a Retrieval-Augmented Generation (RAG) system with LangChain that lets an LLM reference your own documents instead of just hallucinating answers.
What is RAG and why it matters
RAG solves a fundamental problem: LLMs are trained on fixed datasets and can't access your private documents, recent information, or domain-specific knowledge. It addresses this by:
- Storing your documents in a searchable format
- Retrieving relevant chunks when someone asks a question
- Generating an answer using those chunks as context
Think of it like giving the AI a library card instead of expecting it to memorize every book.
The RAG pipeline: 6 steps
Step 1: Load documents
First, get your content into LangChain's Document format.
from langchain_community.document_loaders import PyPDFLoader, WebBaseLoader
# Load a PDF
pdf_loader = PyPDFLoader("https://example.com/research-paper.pdf")
pdf_docs = pdf_loader.load()
# Load a webpage
web_loader = WebBaseLoader("https://python.langchain.com/docs/introduction/")
web_docs = web_loader.load()
Each Document object has:
- page_content - the actual text
- metadata - source info (URL, page number, filename)
That metadata is crucial for debugging and showing sources later.
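For example, you can print the first document to see both fields (the exact metadata keys depend on the loader):

```python
# Inspect the first loaded document
print(pdf_docs[0].page_content[:200])   # first 200 characters of text
print(pdf_docs[0].metadata)             # e.g. source path and page number for PyPDFLoader
```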
Step 2: Split into chunks
Long documents need to be broken into smaller pieces. Too big = fuzzy retrieval. Too small = missing context.
from langchain.text_splitter import CharacterTextSplitter
splitter = CharacterTextSplitter(
    chunk_size=1000,     # ~1000 characters per chunk
    chunk_overlap=150,   # 150 chars overlap to preserve context
    separator="\n"
)
chunks = splitter.split_documents(pdf_docs)
print(f"Created {len(chunks)} chunks")
Pro tip: Start with 800-1500 characters and 10-20% overlap, then tune based on your content.
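If you're not sure where to land, a quick sketch like this (the sizes are just examples) shows how the settings change chunk counts on your own documents:

```python
# Compare chunk counts for a few candidate sizes with ~15% overlap
for size in (800, 1000, 1500):
    s = CharacterTextSplitter(chunk_size=size, chunk_overlap=int(size * 0.15), separator="\n")
    print(f"chunk_size={size}: {len(s.split_documents(pdf_docs))} chunks")
```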
Step 3: Create embeddings
Embeddings convert text into vectors (lists of numbers) that capture semantic meaning. Similar concepts have similar vectors.
from langchain_ibm import WatsonxEmbeddings
from ibm_watsonx_ai.metanames import EmbedTextParamsMetaNames
embed_params = {
    EmbedTextParamsMetaNames.TRUNCATE_INPUT_TOKENS: 3,  # max tokens kept per input - raise this so full chunks get embedded
    EmbedTextParamsMetaNames.RETURN_OPTIONS: {"input_text": True},  # return the input text alongside each embedding
}
embedding_model = WatsonxEmbeddings(
    model_id="ibm/slate-125m-english-rtrvr-v2",
    url="https://us-south.ml.cloud.ibm.com",
    project_id="your-project-id",
    params=embed_params,
)
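To convince yourself the embeddings capture meaning, you can compare a few phrases directly. This is just a sanity check using numpy for cosine similarity; it assumes the embedding_model above is configured with working credentials:

```python
import numpy as np

v1 = np.array(embedding_model.embed_query("How do I reset my password?"))
v2 = np.array(embedding_model.embed_query("Steps to recover account access"))
v3 = np.array(embedding_model.embed_query("Best pizza toppings"))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(v1, v2))  # related phrases: should score higher...
print(cosine(v1, v3))  # ...than this unrelated pair
```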
Step 4: Store in a vector database
Vector stores let you do semantic search - find chunks by meaning, not just keywords.
from langchain_community.vectorstores import Chroma
# Create the vector store and embed all chunks
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model
)
Chroma is easy for local development. For production, consider Pinecone, Weaviate, or Qdrant.
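You can also query the store directly before wiring up a chain - a quick way to confirm semantic search is working:

```python
# Returns the k chunks whose embeddings are closest to the query
hits = vectorstore.similarity_search("What problem does LangChain solve?", k=3)
for hit in hits:
    print(hit.metadata, "->", hit.page_content[:120])
```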
Step 5: Set up retrieval
Convert your vector store into a retriever that can fetch relevant chunks.
# Basic retriever - returns top 4 most similar chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
# Test it
docs = retriever.invoke("What is LangChain used for?")
print(docs[0].page_content) # Most relevant chunk
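If the top results come back nearly identical to each other, the same as_retriever call supports other search types. A variant using maximal marginal relevance (MMR), which trades a little similarity for diversity:

```python
# Consider 20 candidates, return 4 that are relevant but not redundant
mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 4, "fetch_k": 20}
)
```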
Step 6: Build the QA chain
Now connect retrieval to your LLM so it can answer questions using your documents.
from langchain.chains import RetrievalQA
from langchain_ibm import WatsonxLLM

# Define the LLM (same watsonx.ai URL and project as the embeddings above)
llm = WatsonxLLM(
    model_id="ibm/granite-13b-instruct-v2",  # swap in any watsonx.ai generation model you have access to
    url="https://us-south.ml.cloud.ibm.com",
    project_id="your-project-id",
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" = put all retrieved docs in the prompt
    retriever=retriever,
    return_source_documents=True  # Include sources in response
)
# Ask a question
result = qa_chain.invoke({"query": "What are the main benefits of LangChain?"})
print(result["result"])
print(f"Sources: {result['source_documents']}")
Advanced: Hierarchical retrieval
The Parent Document Retriever solves a chunking dilemma: you want small chunks for precise retrieval, but you need larger context for the LLM.
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
# Small chunks for embedding (precise retrieval)
child_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=20)
# Large chunks for context (what the LLM sees)
parent_splitter = CharacterTextSplitter(chunk_size=2000, chunk_overlap=20)
vectorstore = Chroma(
    collection_name="hierarchical_chunks",
    embedding_function=embedding_model
)
store = InMemoryStore()  # Stores parent documents
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retriever.add_documents(pdf_docs)
# Now retrieval finds small chunks but returns large context
docs = retriever.invoke("Explain the architecture")
print(docs[0].page_content) # Returns the full parent chunk
How it works:
- Small chunks get embedded for precise similarity search
- Retrieved small chunks point to their parent IDs
- Full parent chunks are returned to the LLM for context
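You can see the mechanism by checking what the underlying vector store matched and comparing it to the docs returned above:

```python
# The vector store itself only holds the small child chunks
child_hits = vectorstore.similarity_search("Explain the architecture")
print(len(child_hits[0].page_content))  # roughly the 400-char child size
print(len(docs[0].page_content))        # the ~2000-char parent returned by the retriever
```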
Common pitfalls and fixes
Problem 1: Retrieves wrong chunks
Fix: Increase k (return more chunks) or improve your chunking strategy. Try RecursiveCharacterTextSplitter, which respects paragraph boundaries.
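A drop-in swap for the splitter from Step 2, keeping the same size and overlap:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Tries paragraph breaks first, then lines, then words, before cutting mid-word
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
chunks = splitter.split_documents(pdf_docs)
```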
Problem 2: Still hallucinates despite correct retrieval
Fix: Tighten the prompt. Add instructions like:
Answer only using the provided context. If the answer isn't in the context, say "I don't know based on the provided documents."
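With the chain from Step 6, one way to apply this is to pass a custom prompt through chain_type_kwargs. The template wording below is just an example:

```python
from langchain.prompts import PromptTemplate

grounded_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=(
        "Answer only using the provided context. If the answer isn't in the context, "
        "say \"I don't know based on the provided documents.\"\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    ),
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": grounded_prompt},
)
```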
Problem 3: Slow retrieval
Fix:
- Index fewer documents initially
- Use a faster embedding model
- Persist the index with Chroma(persist_directory="./db") so chunks aren't re-embedded on every run
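A common pattern is to build the index once and reload it on later runs (the directory name is arbitrary):

```python
# First run: embed and persist to disk
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./db"
)

# Later runs: reload without re-embedding
vectorstore = Chroma(
    persist_directory="./db",
    embedding_function=embedding_model
)
```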
Problem 4: Can't see what's being retrieved
Fix: Always set return_source_documents=True during development. Log retrieved chunks to see what the LLM is actually reading.
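During development, something as simple as this makes retrieval visible (the query is just an example):

```python
result = qa_chain.invoke({"query": "What are the key findings?"})

# Log exactly what context the LLM received
for i, doc in enumerate(result["source_documents"]):
    print(f"--- chunk {i}: {doc.metadata} ---")
    print(doc.page_content[:300])
```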
Testing your RAG system
# Quick test script
test_queries = [
    "What is this document about?",
    "What are the key findings?",
    "Who are the authors?",
]

for query in test_queries:
    result = qa_chain.invoke({"query": query})
    print(f"\nQ: {query}")
    print(f"A: {result['result']}")
    print(f"Sources: {[doc.metadata for doc in result['source_documents']]}")
Production checklist
Before deploying:
- [ ] Test with edge cases (empty queries, irrelevant questions)
- [ ] Add error handling for retrieval failures (see the sketch after this list)
- [ ] Implement query logging for debugging
- [ ] Set up monitoring for retrieval quality
- [ ] Cache embeddings to avoid recomputing
- [ ] Add rate limiting if using external APIs
- [ ] Test with different chunk sizes on your domain
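For the error-handling item, here's a minimal sketch of a guarded query path; the fallback messages and the broad Exception catch are illustrative, not a prescription:

```python
def answer(query: str) -> str:
    """Run the QA chain and fail gracefully if retrieval or generation breaks."""
    if not query or not query.strip():
        return "Please ask a non-empty question."
    try:
        result = qa_chain.invoke({"query": query})
        return result["result"]
    except Exception as exc:  # network errors, rate limits, empty index, etc.
        print(f"RAG query failed: {exc}")  # swap for real logging/monitoring in production
        return "Sorry, I couldn't answer that right now."
```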
Quick recap
- RAG lets LLMs reference your documents instead of guessing
- Pipeline: Load → Split → Embed → Store → Retrieve → Generate
- Chunking is critical: 800-1500 chars, 10-20% overlap is a good start
- Metadata preserves source info for debugging and citations
- Hierarchical retrieval balances precision and context
- Always log what's being retrieved during development
RAG transforms LLMs from impressive text generators into reliable knowledge assistants grounded in your data.