Course 3 · Module 12 · 90 minutes

Brute force
scans a billion vectors.
You have microseconds.

RAG works by embedding everything (queries, docs, code, images) as vectors, then finding the most similar ones. At small scale, you compute cosine similarity against every vector — fine. At billion scale, that's 60+ seconds per query. Production needs sub-100ms. This module covers the indexing structures — HNSW, IVF, product quantization — that bend the curve from O(n) to O(log n).

You'll watch

HNSW descend layers

You'll compare

4 indexing methods

You'll know why

Pinecone exists

Index it all

Part 01 · Why brute force fails

Linear search
doesn't make it past a million.

The math is unforgiving.

Each comparison is a cosine similarity: a dot product over d dimensions plus normalization. For typical 1024-dim embeddings, that's ~2000 FLOPs per vector. Multiply by N vectors, and you get a quick sense of where things break:

At 1K vectors: 2M FLOPs per query → microseconds. Fine. At 1M vectors: 2B FLOPs → tens of ms. Still OK. At 1B vectors: 2T FLOPs → tens of seconds. Catastrophic.

And you don't have one query — you have thousands per second. A search engine that takes 30 seconds per query is just broken. The whole architecture has to change.

The escape: give up exact answers. Approximate nearest neighbor (ANN) algorithms accept ~95-99% recall in exchange for a 1000-100000× speedup. For RAG, where you're already retrieving the "top 5 docs" out of millions, missing one borderline result almost never matters. Trading 1% recall for 10000× speed is the entire business of vector search.

// brute-force latency vs dataset size

Part 02 · The four approaches

Four ways to avoid
scanning everything.

Modern ANN libraries (FAISS, ScaNN, Annoy) implement many algorithms. In practice, four dominate: Flat for tiny datasets, IVF for medium scale, HNSW for low-latency serving, IVF-PQ for billion-scale on a budget. Pick wisely — they have very different cost profiles.

// 01 · baseline

Flat (brute force)

// exact · O(n)

Compare the query to every vector. 100% recall guaranteed. Trivial to implement. Linear time. Beyond ~100K-1M vectors it gets impractical for low-latency apps. Still useful as a ground-truth baseline and for small datasets.

Recall100%

Latency @ 1M~200 ms

Latency @ 100M~20s · broken

// 02 · partitioning

IVF (Inverted File)

// approximate · O(√n) effective

Run k-means on all vectors offline, producing K cluster centroids. At query time, find the nprobe closest centroids and only scan vectors in those clusters. ~10× speedup with K=1000 and nprobe=10, while keeping >95% recall.

Recall (with nprobe=10)~95-98%

Latency @ 100M~50 ms

Build timek-means training

// 03 · the modern default

HNSW

// graph-based · O(log n)

Multi-layer graph. Top layer is sparse — only a few "highway" vectors with long-range connections. Bottom layer holds every vector. Search starts at top, navigates to nearest neighbor, then drops down a layer. Like binary search but for high-dim space.

Recall~98-99.9%

Latency @ 100M~5-20 ms

Memory overheadgraph edges (~50% extra)

// 04 · billion-scale

IVF-PQ

// IVF + product quantization · cheap memory

The billion-scale option. Product quantization compresses each vector ~32-64× by splitting it into chunks and storing only cluster IDs. Combined with IVF clustering, you can serve billions of vectors from a single machine's RAM. Some recall loss, but viable at scale nothing else can touch.

Recall~85-95%

Latency @ 1B~20-50 ms

Memory @ 1B~64 GB (vs ~4 TB raw)

Part 03 · Hands on · HNSW search animation

Watch the query
fall through the layers.

HNSW is the most widely deployed ANN method in 2024. The trick: build a multi-layer graph where each higher layer contains fewer points but with longer-range edges. Search starts at the top with one entry point, greedily walks to the nearest neighbor, then drops to the next layer to refine. Logarithmic depth + greedy navigation = O(log n) search.

What you're seeing.

A small HNSW index with 80 vectors in 2D, distributed across 3 layers. Layer 2 has ~5 nodes (the "highways"), Layer 1 has ~20 (regional roads), Layer 0 has all 80 (local streets). The query (cyan) enters at the top, follows golden edges to the nearest highway, then descends. Try the 4 query positions and watch the path — you'll see the same logarithmic structure search every time.

comparisons: 0 · brute-force would have: 80

// 3-layer HNSW · query descent

// search trace

Press play to watch the descent.

// brute-force comparisons

// hnsw comparisons

—

// speedup

—

Part 04 · Product quantization · billion-scale compression

Cut a vector into chunks.
Replace each chunk with a number.

// product quantization · how it shrinks vectors

The math trick that makes billion-vector indexes fit in RAM.

Without compression, a billion 768-dim FP32 vectors require 1B × 3072 bytes = 3 TB. That's many GPUs just to hold the data, before any indexing overhead. Product quantization (PQ) was the breakthrough that let Google's image-search index fit on commodity hardware in 2010.

The idea is decomposition. A 768-dim vector is one point in a 768-dim space — far too large to quantize directly. But you can split it into M chunks of 96 dims each. Each chunk is a point in a much smaller 96-dim subspace, where running k-means on a few million training vectors gives you 256 representative "centroids" per subspace.

Now each original vector becomes M centroid IDs (one per sub-vector). With M=8 and 256 centroids per chunk, each vector is just 8 bytes — a 384× compression. And distance computation can be done directly on the codes via precomputed lookup tables, with no need to decompress.

Combined with IVF (cluster the codes themselves, search only nearby clusters), FAISS can index a billion vectors on a single machine with sub-100ms latency. That's the unlock that makes serving large RAG systems economically possible.

Part 05 · Vector databases shipping today

The databases
your RAG runs through.

By 2024, "vector database" became a recognized infrastructure category. Different products pick different points on the managed/self-hosted, dedicated/integrated, performance/flexibility tradeoffs.

Pinecone · 2021

Pinecone

// the managed default

TypeFully managed SaaS

IndexProprietary (HNSW-derived)

StrengthsZero ops · automatic scaling

Pricing modelPods / serverless

Best forTeams that don't want to operate infra

The category-defining managed product. You upload vectors via API, get sub-50ms search across millions. No clusters to manage, no indexes to tune. Pricier than self-hosted alternatives but operationally simplest. Most "we shipped RAG in two weeks" stories run on Pinecone.

Open source · 2019

Weaviate

// search engine + vector DB

TypeSelf-hosted or managed

IndexHNSW

DistinctiveBuilt-in hybrid search (BM25 + vector)

SchemaGraphQL · class-based

Best forHybrid keyword + semantic search

Open-source with a managed cloud option. Native hybrid search is the killer feature: combine keyword (BM25) and vector scores in a single query. Useful when you need both literal keyword matches and semantic similarity — common for product search, knowledge bases.

Open source · 2021

Qdrant

// performance-focused, Rust

TypeSelf-hosted or managed

Built inRust (compiled, fast)

IndexHNSW with filtering

DistinctiveFast metadata filtering

Best forHigh-throughput self-hosted

Among the fastest open-source options thanks to Rust. Particularly good at filtered search — finding vectors that match both a similarity threshold AND structured filters (date range, category, user_id). Many AI startups self-host Qdrant for cost reasons.

PostgreSQL · 2021

pgvector

// just an extension

TypePostgreSQL extension

IndexHNSW + IVFFlat

Why it mattersNo new infrastructure

Scale ceiling~10M vectors (well)

Best forTeams already on Postgres

The most pragmatic option for many teams. Add vector search to your existing Postgres database with a single extension. JOIN against your normal tables. No new operational burden. Doesn't beat dedicated vector DBs on raw performance — but for <10M vector workloads, the simplicity wins. AWS, Google Cloud, Supabase all offer hosted pgvector.

Course 3 · Module 12 complete

You understand why RAG
actually scales.

You saw why brute force collapses past 1M vectors. You watched HNSW's hierarchical descent give you 80→20 comparisons (and at billion scale: 1B→30). You know how product quantization compresses 3 TB of vectors into 64 GB. The next time someone says "we built a custom RAG on a billion documents," you'll know exactly which 3 acronyms made it possible.

Up next · Course 3 · Module 13

Production AI Systems

You've built the model, you've built the agent, you've built the index. Now ship it without it breaking. Eval suites, drift detection, guardrails, observability, regression testing, prompt versioning, fallback strategies. The unglamorous-but-critical work that separates "demo" from "production." Interactive: build an eval harness for a real model.

Continue to Module 13

Brute forcescans a billion vectors.You have microseconds.

Linear searchdoesn't make it past a million.