AI Skill Course Course 3 · Expert
Module 12 of 14
Course 3 · Module 12 · 90 minutes

Brute force
scans a billion vectors.
You have microseconds.

RAG works by embedding everything (queries, docs, code, images) as vectors, then finding the most similar ones. At small scale, you compute cosine similarity against every vector — fine. At billion scale, that's 60+ seconds per query. Production needs sub-100ms. This module covers the indexing structures — HNSW, IVF, product quantization — that bend the curve from O(n) to O(log n).

You'll watch
HNSW descend layers
You'll compare
4 indexing methods
You'll know why
Pinecone exists
Index it all
The nearest-neighbor problem query Given a query vector, find the k closest from N stored vectors. N = 1,000,000,000 · query latency budget: ~10 ms → you need ≤ 30 comparisons, not 1 billion
Part 01 · Why brute force fails

Linear search
doesn't make it past a million.

The math is unforgiving.

Each comparison is a cosine similarity: a dot product over d dimensions plus normalization. For typical 1024-dim embeddings, that's ~2000 FLOPs per vector. Multiply by N vectors, and you get a quick sense of where things break:

At 1K vectors: 2M FLOPs per query → microseconds. Fine. At 1M vectors: 2B FLOPs → tens of ms. Still OK. At 1B vectors: 2T FLOPs → tens of seconds. Catastrophic.

And you don't have one query — you have thousands per second. A search engine that takes 30 seconds per query is just broken. The whole architecture has to change.

The escape: give up exact answers. Approximate nearest neighbor (ANN) algorithms accept ~95-99% recall in exchange for a 1000-100000× speedup. For RAG, where you're already retrieving the "top 5 docs" out of millions, missing one borderline result almost never matters. Trading 1% recall for 10000× speed is the entire business of vector search.

// brute-force latency vs dataset size
60s 10s 1s 100ms 10ms 1ms 100μs 1K 100K 1M 100M 10B 100 ms · production limit 200μs 20ms 200ms 20s ~30 min Brute-force search · O(n·d) HNSW · O(log n)
Part 02 · The four approaches

Four ways to avoid
scanning everything.

Modern ANN libraries (FAISS, ScaNN, Annoy) implement many algorithms. In practice, four dominate: Flat for tiny datasets, IVF for medium scale, HNSW for low-latency serving, IVF-PQ for billion-scale on a budget. Pick wisely — they have very different cost profiles.

// 01 · baseline

Flat (brute force)

// exact · O(n)
scan every vector query → N distance computations

Compare the query to every vector. 100% recall guaranteed. Trivial to implement. Linear time. Beyond ~100K-1M vectors it gets impractical for low-latency apps. Still useful as a ground-truth baseline and for small datasets.

Recall100%
Latency @ 1M~200 ms
Latency @ 100M~20s · broken
// 02 · partitioning

IVF (Inverted File)

// approximate · O(√n) effective
cluster vectors · search only nearby clusters → only search 1 of K clusters

Run k-means on all vectors offline, producing K cluster centroids. At query time, find the nprobe closest centroids and only scan vectors in those clusters. ~10× speedup with K=1000 and nprobe=10, while keeping >95% recall.

Recall (with nprobe=10)~95-98%
Latency @ 100M~50 ms
Build timek-means training
// 03 · the modern default

HNSW

// graph-based · O(log n)
hierarchical layered graph · search descends L2 L1 L0

Multi-layer graph. Top layer is sparse — only a few "highway" vectors with long-range connections. Bottom layer holds every vector. Search starts at top, navigates to nearest neighbor, then drops down a layer. Like binary search but for high-dim space.

Recall~98-99.9%
Latency @ 100M~5-20 ms
Memory overheadgraph edges (~50% extra)
// 04 · billion-scale

IVF-PQ

// IVF + product quantization · cheap memory
compress vectors · index in clusters d=1024 floats 4096 bytes PQ 64 bytes 64× smaller IVF

The billion-scale option. Product quantization compresses each vector ~32-64× by splitting it into chunks and storing only cluster IDs. Combined with IVF clustering, you can serve billions of vectors from a single machine's RAM. Some recall loss, but viable at scale nothing else can touch.

Recall~85-95%
Latency @ 1B~20-50 ms
Memory @ 1B~64 GB (vs ~4 TB raw)
Part 03 · Hands on · HNSW search animation

Watch the query
fall through the layers.

HNSW is the most widely deployed ANN method in 2024. The trick: build a multi-layer graph where each higher layer contains fewer points but with longer-range edges. Search starts at the top with one entry point, greedily walks to the nearest neighbor, then drops to the next layer to refine. Logarithmic depth + greedy navigation = O(log n) search.

What you're seeing.

A small HNSW index with 80 vectors in 2D, distributed across 3 layers. Layer 2 has ~5 nodes (the "highways"), Layer 1 has ~20 (regional roads), Layer 0 has all 80 (local streets). The query (cyan) enters at the top, follows golden edges to the nearest highway, then descends. Try the 4 query positions and watch the path — you'll see the same logarithmic structure search every time.

comparisons: 0 · brute-force would have: 80
// 3-layer HNSW · query descent
// search trace
Press play to watch the descent.
// brute-force comparisons
80
// hnsw comparisons
// speedup
Part 04 · Product quantization · billion-scale compression

Cut a vector into chunks.
Replace each chunk with a number.

// product quantization · how it shrinks vectors
original vector · d = 768 0.34 -0.12 0.78 0.21 ... 0.49 -0.65 0.11 0.93 768 × 4 bytes (FP32) = 3072 bytes step 1 · split into M = 8 subvectors of 96 dims each sub₁ sub₂ sub₃ sub₄ sub₅ sub₆ sub₇ sub₈ step 2 · each sub → closest of 256 centroids (1 byte) 37 182 5 241 96 128 14 203 final compressed vector · 8 bytes total The compression Original: 3072 bytes PQ (M=8): 8 bytes PQ (M=96): 96 bytes Ratio: 32–384× smaller distance computation works directly on the codes

The math trick that makes billion-vector indexes fit in RAM.

Without compression, a billion 768-dim FP32 vectors require 1B × 3072 bytes = 3 TB. That's many GPUs just to hold the data, before any indexing overhead. Product quantization (PQ) was the breakthrough that let Google's image-search index fit on commodity hardware in 2010.

The idea is decomposition. A 768-dim vector is one point in a 768-dim space — far too large to quantize directly. But you can split it into M chunks of 96 dims each. Each chunk is a point in a much smaller 96-dim subspace, where running k-means on a few million training vectors gives you 256 representative "centroids" per subspace.

Now each original vector becomes M centroid IDs (one per sub-vector). With M=8 and 256 centroids per chunk, each vector is just 8 bytes — a 384× compression. And distance computation can be done directly on the codes via precomputed lookup tables, with no need to decompress.

Combined with IVF (cluster the codes themselves, search only nearby clusters), FAISS can index a billion vectors on a single machine with sub-100ms latency. That's the unlock that makes serving large RAG systems economically possible.

Part 05 · Vector databases shipping today

The databases
your RAG runs through.

By 2024, "vector database" became a recognized infrastructure category. Different products pick different points on the managed/self-hosted, dedicated/integrated, performance/flexibility tradeoffs.

Pinecone · 2021

Pinecone

// the managed default
TypeFully managed SaaS
IndexProprietary (HNSW-derived)
StrengthsZero ops · automatic scaling
Pricing modelPods / serverless
Best forTeams that don't want to operate infra

The category-defining managed product. You upload vectors via API, get sub-50ms search across millions. No clusters to manage, no indexes to tune. Pricier than self-hosted alternatives but operationally simplest. Most "we shipped RAG in two weeks" stories run on Pinecone.

Open source · 2019

Weaviate

// search engine + vector DB
TypeSelf-hosted or managed
IndexHNSW
DistinctiveBuilt-in hybrid search (BM25 + vector)
SchemaGraphQL · class-based
Best forHybrid keyword + semantic search

Open-source with a managed cloud option. Native hybrid search is the killer feature: combine keyword (BM25) and vector scores in a single query. Useful when you need both literal keyword matches and semantic similarity — common for product search, knowledge bases.

Open source · 2021

Qdrant

// performance-focused, Rust
TypeSelf-hosted or managed
Built inRust (compiled, fast)
IndexHNSW with filtering
DistinctiveFast metadata filtering
Best forHigh-throughput self-hosted

Among the fastest open-source options thanks to Rust. Particularly good at filtered search — finding vectors that match both a similarity threshold AND structured filters (date range, category, user_id). Many AI startups self-host Qdrant for cost reasons.

PostgreSQL · 2021

pgvector

// just an extension
TypePostgreSQL extension
IndexHNSW + IVFFlat
Why it mattersNo new infrastructure
Scale ceiling~10M vectors (well)
Best forTeams already on Postgres

The most pragmatic option for many teams. Add vector search to your existing Postgres database with a single extension. JOIN against your normal tables. No new operational burden. Doesn't beat dedicated vector DBs on raw performance — but for <10M vector workloads, the simplicity wins. AWS, Google Cloud, Supabase all offer hosted pgvector.

Part 06 · Knowledge check

Five questions on what
you just indexed.

Aim for 4/5. Wrong answers explain themselves.

Question 01 of 05

0/5

Continue
Course 3 · Module 12 complete

You understand why RAG
actually scales.

You saw why brute force collapses past 1M vectors. You watched HNSW's hierarchical descent give you 80→20 comparisons (and at billion scale: 1B→30). You know how product quantization compresses 3 TB of vectors into 64 GB. The next time someone says "we built a custom RAG on a billion documents," you'll know exactly which 3 acronyms made it possible.

Up next · Course 3 · Module 13

Production AI Systems

You've built the model, you've built the agent, you've built the index. Now ship it without it breaking. Eval suites, drift detection, guardrails, observability, regression testing, prompt versioning, fallback strategies. The unglamorous-but-critical work that separates "demo" from "production." Interactive: build an eval harness for a real model.

Continue to Module 13