An LLM by itself is brilliant but blind to your specific data: your company docs, your PDFs, last week's emails. RAG (Retrieval-Augmented Generation) is the technique that fixes this. It's the architecture behind most "chat with your docs" products. You'll build one in your browser, with no API key.
LLMs are remarkable, but they're not omniscient. There are three failure modes that even GPT-4 and Claude hit immediately when you ask them about your data. Understand these, and you'll understand why RAG exists.
When an LLM doesn't know the answer, it doesn't say "I don't know" — it confidently makes one up. The output sounds plausible but is fabricated.
Every LLM was trained on data up to some date. It doesn't know about events, products, or research published after that. Ask it about today's news — it can't help.
The biggest gap: your internal documents, customer notes, codebase, and last quarter's reports. The model has never seen any of it. On its own, it can only draw on what was in its training data.
Below is a working RAG system. Pick a document, ask a question, and watch the system find the most relevant chunks and compose an answer. The search uses real TF-IDF cosine similarity, which matches on weighted keywords rather than meaning. It is the same kind of math that powered many production retrieval systems before the embedding era.
1. Pick a document from the tabs.
2. Click a suggested question or type your own.
3. Watch the chunks get scored; the top 3 light up amber.
4. The answer is composed from the highest-scoring sentences.

Note: production RAG typically uses neural embeddings (Module 8) for retrieval. This demo uses simpler TF-IDF math, but the architecture is identical.
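To make the scoring concrete, here is a minimal sketch of TF-IDF cosine similarity in plain Python. It is not the demo's actual code: the tokenizer, the smoothed IDF formula, and the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_chunks`) are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict of term -> weight) per text."""
    tokenized = [tokenize(t) for t in texts]
    n = len(tokenized)
    # Document frequency: how many texts contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumed variant): rare terms get larger weights.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {term: (freq / len(tokens)) * idf[term] for term, freq in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(question, chunks):
    """Return (score, chunk) pairs sorted from most to least similar."""
    # The question is vectorized together with the chunks so they share one IDF table.
    vectors = tfidf_vectors(chunks + [question])
    question_vector = vectors[-1]
    scored = [(cosine(question_vector, v), c) for v, c in zip(vectors, chunks)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling `rank_chunks("How many days do refunds take?", chunks)` on a few short policy sentences puts the chunk that shares the most distinctive words with the question first. A chunk with no words in common scores exactly 0, which is the main limitation that neural embeddings address.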
Whether you're using LangChain, LlamaIndex, or hand-rolling it, these five steps happen in nearly every retrieval-augmented system. Together they are the RAG pipeline.
Split each document into smaller pieces (usually 200-800 tokens). Each chunk should be self-contained enough to answer a question on its own.
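A minimal sketch of the chunking step might look like the following. It approximates tokens with whitespace-separated words, and the overlap between consecutive chunks is an added assumption (a common way to avoid cutting an answer in half), not something the pipeline requires. The name `chunk_text` and its defaults are hypothetical.

```python
def chunk_text(text, max_tokens=200, overlap=20):
    """Split text into chunks of at most `max_tokens` words.

    Consecutive chunks share `overlap` words. Words stand in for tokens here;
    a real system would count tokens with the embedding model's tokenizer.
    """
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        # Stop once this chunk has reached the end of the text.
        if start + max_tokens >= len(words):
            break
    return chunks
```

Splitting on sentence or paragraph boundaries instead of a fixed word count usually yields more self-contained chunks, at the cost of uneven chunk sizes.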
Convert each chunk into a vector. In production, use a real embedding model (OpenAI embeddings, Cohere, or open-source like BGE).
Save the vectors in a vector database (Pinecone, Weaviate, Qdrant, Chroma, or pgvector). This lets you search them efficiently at scale.
When a question comes in, embed it the same way. Find the top-K chunks with the highest cosine similarity to the question vector.
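The retrieval step can be sketched as a brute-force scan, assuming the question and chunks have already been embedded as equal-length lists of floats. The function names `cosine_similarity` and `top_k` are illustrative, not part of any library.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(question_vector, chunk_vectors, k=3):
    """Return (chunk_index, score) pairs for the k most similar chunks, best first."""
    scored = (
        (index, cosine_similarity(question_vector, vector))
        for index, vector in enumerate(chunk_vectors)
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

This compares the question against every chunk, which is fine for a few thousand vectors. Vector databases get their speed at scale from approximate nearest-neighbor indexes that avoid the full scan.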
Send the question + retrieved chunks to the LLM as one prompt. Instruct it to answer using only the provided context. Get a grounded answer.
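One way to assemble that prompt is sketched below. The exact wording, the numbered `[1]`, `[2]` chunk labels, and the "say you don't know" fallback are assumptions for illustration; `build_rag_prompt` is a hypothetical helper, and the resulting string would be sent to whichever LLM API you use.

```python
def build_rag_prompt(question, chunks):
    """Combine the question and the retrieved chunks into one grounded prompt."""
    # Number each chunk so the model (and you) can refer back to sources.
    context = "\n\n".join(
        f"[{number}] {chunk}" for number, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Giving the model an explicit way out when the context is insufficient tends to reduce made-up answers, since "only use the context" alone leaves it no acceptable response when retrieval misses.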
Even with RAG, the prompt you send to the LLM largely determines the quality of the output. Four patterns that reliably help:
Vague requests get vague answers. State the format, length, tone, and audience explicitly.
Show the model what good output looks like with 2-3 examples. This often improves quality substantially on structured tasks.
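A few-shot prompt can be built mechanically from example pairs, as in this sketch. The `Input:` / `Output:` labels and the helper name `build_few_shot_prompt` are assumptions; any consistent format works as long as the examples and the new input follow the same pattern.

```python
def build_few_shot_prompt(instruction, examples, new_input):
    """Build a prompt from an instruction, (input, output) example pairs, and a new input."""
    parts = [instruction.strip(), ""]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}")
        parts.append(f"Output: {example_output}")
        parts.append("")  # blank line between examples
    # End with an open "Output:" so the model completes the pattern.
    parts.append(f"Input: {new_input}")
    parts.append("Output:")
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "Classify the sentiment as positive or negative.",
    [("I love it", "positive"), ("Terrible service", "negative")],
    "Works great",
)
```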
Telling the model "You are a senior data engineer" shapes vocabulary, depth, and assumptions toward that domain.
Ask for JSON, a markdown table, or a specific schema. LLMs are much more reliable when given a structure to fill.
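Asking for a structure only pays off if you check that the reply follows it. The sketch below validates a reply that is expected to be a JSON object; the helper name `parse_structured_reply` and the example keys `answer` and `sources` are hypothetical.

```python
import json


def parse_structured_reply(reply, required_keys):
    """Parse a model reply expected to be a JSON object with the given keys.

    Raises ValueError if the reply is not valid JSON, is not an object,
    or is missing any required key.
    """
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("Reply must be a JSON object")
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"Reply is missing keys: {missing}")
    return data
```

When validation fails, a common approach is to retry the request once with the error message appended to the prompt, rather than trying to repair the malformed output by hand.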
You don't always need RAG. Sometimes a clever prompt works. Sometimes you actually need to fine-tune. The choice depends on your problem.
Tell the model what you want, in detail.
Retrieve relevant docs, then prompt.
Continue training the model on your own examples.
You retrieved chunks with real TF-IDF search. You composed answers from those chunks. You learned the five-step RAG pipeline that powers most "chat with your docs" systems. You also know when to skip RAG entirely (just prompt) or go bigger (fine-tune). The architecture decisions are now yours to make.