Course 2 · Module 09 · 80 minutes

Give the LLM
your data.
Get real answers.

An LLM by itself is brilliant but blind to your specific data: your company docs, your PDFs, last week's emails. RAG (Retrieval-Augmented Generation) is the technique that fixes this, and it is the architecture behind most "chat with your docs" products. You'll build one in your browser, with no API key.

You'll ask

Questions of real docs

You'll watch

RAG retrieve answers

You'll choose

Prompt vs RAG vs fine-tune

Start retrieving

Part 01 · Why prompting alone fails

An LLM by itself
has three problems.

LLMs are remarkable, but they're not omniscient. There are three failure modes that even GPT-4 and Claude hit immediately when you ask them about your data. Understand these, and you'll understand why RAG exists.

// Problem 01

Hallucinations

When an LLM doesn't know the answer, it often doesn't say "I don't know". Instead, it confidently makes one up. The output sounds plausible but is fabricated.

It invented a court case that didn't exist and cited it in my legal brief.

// Problem 02

Knowledge cutoff

Every LLM was trained on data up to some date. It doesn't know about events, products, or research published after that. Ask it about today's news — it can't help.

What happened in tech this week? The model can't tell you. Its memory ends in 2024.

// Problem 03

It doesn't know your data

The biggest gap: your internal documents, customer notes, codebase, last quarter's reports. The model has never seen any of it. Unless you put that material in the prompt, it can only draw on what was in its training set.

How many tickets did Acme Corp file last month? The model has no idea what Acme Corp is.

Part 02 · Hands on · Build a Q&A bot

Three documents.
One question.

Below is a working RAG system. Pick a document, ask a question, and watch the system find the most relevant chunks and compose an answer. Retrieval here uses real TF-IDF cosine similarity. Strictly speaking this is keyword-weighted search rather than semantic search, but it is the same kind of math that powered most text retrieval systems before the embedding era.

How to use it.

1. Pick a document from the tabs.
2. Click a suggested question or type your own.
3. Watch the chunks get scored; the top 3 light up amber.
4. The answer is composed from the highest-scoring sentences.

Note: production RAG typically uses neural embeddings (Module 8) for retrieval. This demo uses simpler TF-IDF math, but the architecture is identical.
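The scoring step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the demo's actual code: the tokenizer, the smoothed IDF formula, and the function names (`tokenize`, `tfidf_vectors`, `rank_chunks`) are all choices made for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vectors(texts):
    """Return one sparse {term: weight} vector per text, plus the IDF table."""
    token_lists = [tokenize(t) for t in texts]
    n = len(token_lists)
    # Document frequency: in how many texts does each term appear?
    df = Counter(term for tokens in token_lists for term in set(tokens))
    # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_chunks(question, chunks, top_k=3):
    """Return (chunk_index, score) pairs for the top_k best-matching chunks."""
    vectors, idf = tfidf_vectors(chunks)
    # Question terms that never occur in the chunks carry no signal, so drop them.
    q_vec = {t: tf * idf[t] for t, tf in Counter(tokenize(question)).items() if t in idf}
    scored = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:top_k]]

# Hypothetical chunks, made up for this example.
chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes three to five business days.",
]
ranked = rank_chunks("How many days do refunds take?", chunks)
```

Because TF-IDF matches on shared words, the question above matches the first chunk on "refunds" and "days" but gets no credit for "take" against "takes". That gap is one reason production systems moved to neural embeddings.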

Part 03 · What just happened

The five steps of
every RAG system.

Whether you're using LangChain, LlamaIndex, or hand-rolling it — these five steps happen in every retrieval-augmented system. They are the RAG pipeline.

Chunk

Split each document into smaller pieces (usually 200-800 tokens). Each chunk should be self-contained enough to answer a question on its own.
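A rough sketch of chunking, assuming whitespace-separated words as a stand-in for model tokens (a real tokenizer would count differently). The overlap between neighbouring chunks is an extra assumption of this example, not something the step requires; `chunk_words` is a hypothetical helper name.

```python
def chunk_words(text, chunk_size=200, overlap=20):
    """Split text into chunks of up to chunk_size words, sharing `overlap` words
    between neighbours so a sentence cut at a boundary still appears whole once."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the document.
        if start + chunk_size >= len(words):
            break
    return chunks

# Tiny example so the overlap is easy to see.
sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, chunk_size=4, overlap=1)
```

In practice, splitting on paragraph or heading boundaries first tends to produce more self-contained chunks than a fixed word count alone.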

Embed

Convert each chunk into a vector. In production, use a real embedding model (OpenAI embeddings, Cohere, or open-source like BGE).
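To make the contract of this step concrete (text in, fixed-length vector out), here is a toy stand-in. It is not a real embedding model: it hashes words into buckets and captures no meaning, and `toy_embed` and its `dim` parameter are hypothetical names for this sketch. In a real system you would swap in a call to your chosen embedding model.

```python
import math
import re
import zlib

def toy_embed(text, dim=64):
    """Map text to a unit-length vector of `dim` floats by hashing each word
    into a bucket. A placeholder for a real embedding model."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # crc32 is stable across runs, unlike Python's built-in hash() for strings.
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    # Normalize so a dot product between two vectors equals their cosine similarity.
    return [x / norm for x in vec] if norm else vec

vector = toy_embed("refund policy refund")
```

Whatever model you use, chunks and questions must go through the same one, otherwise their vectors are not comparable.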

Store

Save the vectors in a vector database (Pinecone, Weaviate, Qdrant, Chroma, or pgvector). This lets you search them efficiently at scale.
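For a handful of documents, a dictionary in memory can play the role of the vector database. The class below is a hypothetical sketch for illustration only; it does not mirror the API of Pinecone, Weaviate, Qdrant, Chroma, or pgvector, and it has none of their indexing or persistence.

```python
from dataclasses import dataclass, field

@dataclass
class InMemoryVectorStore:
    """Keeps (vector, text) pairs keyed by chunk id. Illustrative only."""
    dim: int
    _rows: dict = field(default_factory=dict)

    def upsert(self, chunk_id, vector, text):
        """Insert a chunk, or overwrite it if the id already exists."""
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self._rows[chunk_id] = (list(vector), text)

    def get(self, chunk_id):
        """Return the (vector, text) pair stored under chunk_id."""
        return self._rows[chunk_id]

    def vectors(self):
        """Return (chunk_id, vector) pairs for use by a retrieval step."""
        return [(cid, vec) for cid, (vec, _text) in self._rows.items()]

    def __len__(self):
        return len(self._rows)

store = InMemoryVectorStore(dim=2)
store.upsert("doc1-0", [1.0, 0.0], "first version")
store.upsert("doc1-0", [0.0, 1.0], "second version")  # same id, so it overwrites
```

Storing the original text next to each vector matters: retrieval finds vectors, but the LLM needs the text.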

Retrieve

When a question comes in, embed it the same way. Find the top-K chunks with the highest cosine similarity to the question vector.
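A brute-force version of top-K retrieval looks like this. It is a sketch that assumes the vectors are plain Python lists and scans every chunk; vector databases use approximate indexes to avoid that full scan at scale. The names `cosine_similarity` and `top_k` are chosen for this example.

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length dense vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k=3):
    """indexed is a list of (chunk_id, vector) pairs.
    Returns the k best (chunk_id, score) pairs, highest score first."""
    scored = ((cosine_similarity(query_vec, vec), cid) for cid, vec in indexed)
    return [(cid, score) for score, cid in heapq.nlargest(k, scored)]

# Two-dimensional toy vectors so the geometry is easy to check by hand.
indexed = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
best = top_k([1.0, 0.0], indexed, k=2)
```

Here chunk "a" points in exactly the same direction as the query, "c" is 45 degrees away, and "b" is orthogonal, so "a" and "c" are returned in that order.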

Generate

Send the question + retrieved chunks to the LLM as one prompt. Instruct it to answer using only the provided context. Get a grounded answer.
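Assembling that prompt is plain string building. The sketch below stops short of calling a model, since that call depends on your provider; the wording of the instruction and the numbered-context layout are assumptions of this example, and `build_rag_prompt` is a hypothetical helper.

```python
def build_rag_prompt(question, chunks):
    """Combine retrieved chunks and the user's question into one grounded prompt."""
    # Number each chunk so the model (and you) can refer back to sources.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved chunks, made up for this example.
prompt = build_rag_prompt(
    "How long do refunds take?",
    ["Refunds are issued within 14 days of purchase.", "Shipping takes three to five business days."],
)
```

The explicit "say you don't know" instruction is what pushes the model away from the hallucination failure mode described in Part 01.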

Part 04 · The other half of the skill

How you ask
matters.

Even with RAG, the prompt you send to the LLM determines the quality of the output. Four patterns reliably help:

// Pattern 01

Be specific.

Vague requests get vague answers. State the format, length, tone, and audience explicitly.

Before: Write something about climate change.

After: Write a 200-word summary of climate change for a 10-year-old, using analogies they'll recognize.

// Pattern 02

Show examples (few-shot).

Show the model what good output looks like with 2-3 examples. This often improves quality substantially on structured tasks.
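If you send few-shot prompts from code, it helps to build them from a list of labelled examples rather than hand-editing a string. A minimal sketch, using the review examples from this pattern; the instruction text and the `build_few_shot_prompt` name are assumptions of this example.

```python
def build_few_shot_prompt(examples, new_input,
                          instruction="Classify the sentiment of each review as positive or negative."):
    """Lay out labelled examples, then the unlabelled input for the model to complete."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f'"{text}" → {label}')
    # Leave the final label blank: the model's job is to fill it in.
    lines.append(f'"{new_input}" →')
    return "\n".join(lines)

few_shot = build_few_shot_prompt(
    [("Slow delivery", "negative"), ("Best product ever", "positive")],
    "Amazing service!",
)
```

Keeping every example in exactly the same format as the final line is the point: the model continues the pattern it sees.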

Before: Classify this review: "Amazing service!"

After:
"Slow delivery" → negative
"Best product ever" → positive
"Amazing service!" → ?

// Pattern 03

Assign a role.

Telling the model "You are a senior data engineer" shapes vocabulary, depth, and assumptions toward that domain.

Before: Help me debug this SQL.

After: You are a senior database engineer. Review this SQL for performance issues. Suggest indexes.

// Pattern 04

Request structured output.

Ask for JSON, a markdown table, or a specific schema. LLMs are much more reliable when given a structure to fill.

Before: Give me info about Tesla.

After: Return as JSON: { "company": "...", "founded": ..., "ceo": "...", "headquarters": "..." }
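Asking for JSON only pays off if your code checks what comes back, because a model can still return malformed or incomplete output. A small validation sketch using the four keys from the schema above; the `parse_company_json` name and the sample reply are made up for this example.

```python
import json

REQUIRED_KEYS = {"company", "founded", "ceo", "headquarters"}

def parse_company_json(raw):
    """Parse a model reply and confirm it is a JSON object with the expected keys.
    Raises ValueError on malformed JSON (json.JSONDecodeError is a ValueError subclass)."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

# Placeholder reply with invented values, standing in for real model output.
reply = '{"company": "ExampleCo", "founded": 1999, "ceo": "A. Person", "headquarters": "Springfield"}'
record = parse_company_json(reply)
```

When validation fails, a common approach is to retry the request with the error message appended to the prompt.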

Part 05 · The big decision

Three ways to adapt an LLM.
Pick one per problem.

You don't always need RAG. Sometimes a clever prompt works. Sometimes you actually need to fine-tune. The choice depends on your problem.

Cheapest

Just Prompt

Tell the model what you want, in detail.

Cost: ~$0.01 per call

Setup time: Minutes

Custom data? No

Up-to-date? Only training data

Use when: The task only needs general knowledge, such as writing, brainstorming, code from scratch, or explanations of well-known topics.

Most common

RAG

Retrieve relevant docs, then prompt.

Cost: ~$0.05 per call

Setup time: Hours to days

Custom data? Yes

Up-to-date? As fresh as your docs

Use when: You need the model to answer from your specific data, such as company docs, support tickets, codebases, recent news, or a PDF.

Most expensive

Fine-tune

Retrain the model on your examples.

Cost: $100s to $10,000s

Setup time: Days to weeks

Custom data? Built into the model

Up-to-date? Frozen at fine-tune time

Use when: You need consistent behavior, a specific output style, or domain-specific reasoning that prompting can't reliably produce.
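The three "Use when" rules can be written down as a tiny decision helper. This is a deliberate oversimplification for illustration: the two yes/no inputs and the order of the checks are assumptions of this sketch, and real projects sometimes combine approaches (for example, RAG on top of a fine-tuned model).

```python
def choose_approach(needs_your_data, prompting_is_unreliable):
    """Pick the cheapest approach that covers the stated need.

    needs_your_data: the answer depends on documents the model has never seen.
    prompting_is_unreliable: you need a behavior or style that prompts alone
        fail to produce consistently.
    """
    if prompting_is_unreliable:
        return "fine-tune"
    if needs_your_data:
        return "rag"
    return "prompt"

choices = {
    "brainstorm blog titles": choose_approach(False, False),
    "answer from support tickets": choose_approach(True, False),
    "always reply in house style": choose_approach(False, True),
}
```

Start from the cheapest option and move up only when it demonstrably falls short.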

Course 2 · Module 09 complete

You can now build the
"chat with your docs" product.

You retrieved chunks with real TF-IDF similarity search. You composed answers from those chunks. You learned the five-step RAG pipeline behind most modern document Q&A systems. You also know when to skip RAG entirely (just prompt) or go bigger (fine-tune). The architecture decisions are now yours to make.

Up next · Course 2 · Module 10

Evaluation & Metrics

You've trained models, classified images, retrieved documents. But are they any good? How do you measure? Module 10 is about evaluation — the unsexy but absolutely critical skill that separates production ML from toy projects. Interactive confusion matrix and threshold tuning.

Continue to Module 10
