RAG works by embedding everything (queries, docs, code, images) as vectors, then finding the most similar ones. At small scale, you compute cosine similarity against every vector — fine. At billion scale, that's 60+ seconds per query. Production needs sub-100ms. This module covers the indexing structures — HNSW, IVF, product quantization — that bend the curve from O(n) to O(log n).
Index it allEach comparison is a cosine similarity: a dot product over d dimensions plus normalization. For typical 1024-dim embeddings, that's ~2000 FLOPs per vector. Multiply by N vectors, and you get a quick sense of where things break:
At 1K vectors: 2M FLOPs per query → microseconds. Fine. At 1M vectors: 2B FLOPs → tens of ms. Still OK. At 1B vectors: 2T FLOPs → tens of seconds. Catastrophic.
And you don't have one query — you have thousands per second. A search engine that takes 30 seconds per query is just broken. The whole architecture has to change.
The escape: give up exact answers. Approximate nearest neighbor (ANN) algorithms accept ~95-99% recall in exchange for a 1000-100000× speedup. For RAG, where you're already retrieving the "top 5 docs" out of millions, missing one borderline result almost never matters. Trading 1% recall for 10000× speed is the entire business of vector search.
Modern ANN libraries (FAISS, ScaNN, Annoy) implement many algorithms. In practice, four dominate: Flat for tiny datasets, IVF for medium scale, HNSW for low-latency serving, IVF-PQ for billion-scale on a budget. Pick wisely — they have very different cost profiles.
Compare the query to every vector. 100% recall guaranteed. Trivial to implement. Linear time. Beyond ~100K-1M vectors it gets impractical for low-latency apps. Still useful as a ground-truth baseline and for small datasets.
Run k-means on all vectors offline, producing K cluster centroids. At query time, find the nprobe closest centroids and only scan vectors in those clusters. ~10× speedup with K=1000 and nprobe=10, while keeping >95% recall.
Multi-layer graph. Top layer is sparse — only a few "highway" vectors with long-range connections. Bottom layer holds every vector. Search starts at top, navigates to nearest neighbor, then drops down a layer. Like binary search but for high-dim space.
The billion-scale option. Product quantization compresses each vector ~32-64× by splitting it into chunks and storing only cluster IDs. Combined with IVF clustering, you can serve billions of vectors from a single machine's RAM. Some recall loss, but viable at scale nothing else can touch.
HNSW is the most widely deployed ANN method in 2024. The trick: build a multi-layer graph where each higher layer contains fewer points but with longer-range edges. Search starts at the top with one entry point, greedily walks to the nearest neighbor, then drops to the next layer to refine. Logarithmic depth + greedy navigation = O(log n) search.
A small HNSW index with 80 vectors in 2D, distributed across 3 layers. Layer 2 has ~5 nodes (the "highways"), Layer 1 has ~20 (regional roads), Layer 0 has all 80 (local streets). The query (cyan) enters at the top, follows golden edges to the nearest highway, then descends. Try the 4 query positions and watch the path — you'll see the same logarithmic structure search every time.
Without compression, a billion 768-dim FP32 vectors require 1B × 3072 bytes = 3 TB. That's many GPUs just to hold the data, before any indexing overhead. Product quantization (PQ) was the breakthrough that let Google's image-search index fit on commodity hardware in 2010.
The idea is decomposition. A 768-dim vector is one point in a 768-dim space — far too large to quantize directly. But you can split it into M chunks of 96 dims each. Each chunk is a point in a much smaller 96-dim subspace, where running k-means on a few million training vectors gives you 256 representative "centroids" per subspace.
Now each original vector becomes M centroid IDs (one per sub-vector). With M=8 and 256 centroids per chunk, each vector is just 8 bytes — a 384× compression. And distance computation can be done directly on the codes via precomputed lookup tables, with no need to decompress.
Combined with IVF (cluster the codes themselves, search only nearby clusters), FAISS can index a billion vectors on a single machine with sub-100ms latency. That's the unlock that makes serving large RAG systems economically possible.
By 2024, "vector database" became a recognized infrastructure category. Different products pick different points on the managed/self-hosted, dedicated/integrated, performance/flexibility tradeoffs.
The category-defining managed product. You upload vectors via API, get sub-50ms search across millions. No clusters to manage, no indexes to tune. Pricier than self-hosted alternatives but operationally simplest. Most "we shipped RAG in two weeks" stories run on Pinecone.
Open-source with a managed cloud option. Native hybrid search is the killer feature: combine keyword (BM25) and vector scores in a single query. Useful when you need both literal keyword matches and semantic similarity — common for product search, knowledge bases.
Among the fastest open-source options thanks to Rust. Particularly good at filtered search — finding vectors that match both a similarity threshold AND structured filters (date range, category, user_id). Many AI startups self-host Qdrant for cost reasons.
The most pragmatic option for many teams. Add vector search to your existing Postgres database with a single extension. JOIN against your normal tables. No new operational burden. Doesn't beat dedicated vector DBs on raw performance — but for <10M vector workloads, the simplicity wins. AWS, Google Cloud, Supabase all offer hosted pgvector.
Aim for 4/5. Wrong answers explain themselves.
You saw why brute force collapses past 1M vectors. You watched HNSW's hierarchical descent give you 80→20 comparisons (and at billion scale: 1B→30). You know how product quantization compresses 3 TB of vectors into 64 GB. The next time someone says "we built a custom RAG on a billion documents," you'll know exactly which 3 acronyms made it possible.
Continue to Module 13