Course 3 · Module 06 · 85 minutes

Vision.
Language.
The same space.

Until 2021, image models and text models were separate creatures living in separate codebases. CLIP changed that. Then LLaVA, GPT-4V, Claude with vision, and Gemini went all the way: unified models that handle pictures and words as one stream. The trick is shockingly simple: encode both into the same embedding space, then let the transformer do what it does.

You'll explore

A CLIP-style space

You'll walk through

LLaVA, step by step

You'll see

How GPT-4V works

Cross the modalities

Part 01 · The unifying trick

Every modality
becomes a vector.

One embedding space to rule them all.

Text tokens become vectors. Image patches become vectors. Audio chunks become vectors. Video frames become vectors. Once everything is a vector in the same space, the rest of the architecture doesn't care where they came from.

The model that proved this was CLIP (OpenAI, 2021). Train two encoders jointly on 400 million (image, caption) pairs: a vision encoder for images, a text encoder for captions. The training objective: make the embedding of an image close to the embedding of its caption, and far from other captions.
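The objective described above can be sketched as a symmetric cross-entropy over a batch's similarity matrix. This is a minimal illustration in plain Python, not CLIP's actual training code: the toy 2-D embeddings and the helper names (`normalize`, `cross_entropy`, `clip_loss`) are hypothetical, and the temperature is fixed at 0.07 here, whereas CLIP learns it during training.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the correct entry."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row i of each list is assumed to be a true pair."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and caption j, divided by temperature
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    # image -> text: each image should pick out its own caption
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # text -> image: each caption should pick out its own image
    loss_t2i = sum(cross_entropy([logits[i][j] for i in range(n)], j)
                   for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch: three pairs in a 2-D space (hypothetical values for illustration).
aligned_images = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
# Same captions, but each one is now paired with the wrong image.
shuffled_texts = [aligned_texts[1], aligned_texts[2], aligned_texts[0]]

loss_aligned = clip_loss(aligned_images, aligned_texts)
loss_shuffled = clip_loss(aligned_images, shuffled_texts)
print(f"aligned pairs: {loss_aligned:.4f}   mismatched pairs: {loss_shuffled:.4f}")
```

The loss is low when each image sits closest to its own caption and high when the pairs are mismatched, which is the pressure that pulls matching images and captions together during training.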

The result: a shared embedding space (512-dimensional in the base model) where "a photo of a cat" lives next to actual photos of cats. You can search images with text. You can classify images with text. You can guide image generation with text: Stable Diffusion uses CLIP's text encoder for exactly this.

Every modern vision-language model (GPT-4V, Claude with vision, Gemini, LLaVA) inherits this trick. Get every modality into the same vector space, then let attention do the work.

// CLIP training · contrastive

Part 02 · Hands on · Embedding space search

Type a description.
Watch images align.

Below is a CLIP-style search simulation across 6 images. Type a phrase: the model "embeds" your text, computes cosine similarity to each image's embedding, and ranks them. The top match gets a glow. This embed-and-compare pattern is the idea behind text-based photo search, image search engines, and the text conditioning in Stable Diffusion.

How the simulation works.

Each image has a hand-curated "concept profile": its position in semantic space. Your typed query gets tokenized and matched against these profiles, with synonym expansion. The result approximates CLIP-style ranking. In a real CLIP model the same thing happens, except both your text and the images are first encoded by 12-24 layer transformers into 512-dim vectors, and the similarity is a real dot product between the normalized vectors (that is, cosine similarity).
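The ranking step itself is small enough to sketch. The example below assumes the embeddings already exist: the three image names and their 3-D vectors are hypothetical stand-ins, and `query_emb` stands in for the output of a text encoder.

```python
import math

def cosine_similarity(a, b):
    """Dot product of two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(query_emb, image_embs):
    """Return (name, score) pairs sorted from best to worst match."""
    scored = [(name, cosine_similarity(query_emb, emb))
              for name, emb in image_embs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical 3-D embeddings; the axes loosely mean (animal, outdoors, food).
image_embs = {
    "cat_on_sofa": [0.9, 0.1, 0.0],
    "mountain_lake": [0.1, 0.9, 0.0],
    "pizza": [0.0, 0.1, 0.9],
}
# Stand-in for the embedding of the query "a photo of a cat".
query_emb = [0.8, 0.2, 0.1]

ranking = rank_images(query_emb, image_embs)
best_name, best_score = ranking[0]
for name, score in ranking:
    print(f"{name:15s} {score:.3f}")
```

In a real system the image embeddings would be computed once and stored, so a query costs one text encoding plus a similarity lookup.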

// Try a query

Part 03 · How images become tokens

Tokens for words. Tokens for pixels.

// vision transformer · ViT

Vision Transformers don't reinvent anything.

The pivotal 2020 paper was titled "An Image is Worth 16x16 Words." That title is the whole insight. If you split an image into a grid of 16×16-pixel patches (a 14×14 grid for a 224×224 input), each patch becomes one "token", and you can run the exact same transformer architecture you'd use for text.

Each patch (say, 16×16×3 pixels) gets flattened to a vector and linearly projected to the model's hidden dimension. For ViT-Large at 224×224 input that's 14×14 = 196 patches, each becoming a 1024-dim vector.
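The patch arithmetic above can be checked with a short sketch. It assumes the image height and width are exact multiples of the patch size, uses random pixel values and random weights, and projects to a toy hidden size of 8 (the hypothetical `DEMO_HIDDEN_DIM`) rather than ViT-Large's 1024 so that plain Python stays fast.

```python
import random

def patchify(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened patches, row by row."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in image[top:top + patch_size]:
                for pixel in row[left:left + patch_size]:
                    flat.extend(pixel)  # append this pixel's channel values
            patches.append(flat)
    return patches

def linear_project(vectors, weights):
    """Multiply each input vector by an (in_dim x out_dim) weight matrix."""
    out_dim = len(weights[0])
    return [[sum(v[i] * weights[i][j] for i in range(len(v)))
             for j in range(out_dim)] for v in vectors]

rng = random.Random(0)
IMAGE_SIZE, PATCH_SIZE, CHANNELS = 224, 16, 3
DEMO_HIDDEN_DIM = 8  # toy value; ViT-Large uses 1024

# Random stand-in for a 224 x 224 RGB image.
image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(IMAGE_SIZE)]
         for _ in range(IMAGE_SIZE)]

patches = patchify(image, PATCH_SIZE)
patch_dim = PATCH_SIZE * PATCH_SIZE * CHANNELS  # 16 * 16 * 3 = 768

# Random stand-in for the learned projection matrix.
weights = [[rng.uniform(-0.01, 0.01) for _ in range(DEMO_HIDDEN_DIM)]
           for _ in range(patch_dim)]
tokens = linear_project(patches, weights)

print(len(patches), len(patches[0]), len(tokens), len(tokens[0]))
```

Each of the 196 patches holds 768 raw values before projection; only the output width changes when the hidden size is raised to 1024.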

Then add positional embeddings (where in the grid each patch came from) and run standard self-attention. Patches attend to other patches. After 12-24 layers, you have a rich representation of the image expressed as a sequence of tokens.
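A stripped-down sketch of these two steps follows. It makes several simplifying assumptions that a real ViT does not: a single attention head, no learned query/key/value projections (the input is used directly for all three), no class token, and no MLP or normalization layers. The 2×2 grid of patch tokens and the position embeddings are hypothetical values.

```python
import math

def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def add_positions(tokens, pos_embs):
    """Add one position embedding to each token, element by element."""
    return [[t + p for t, p in zip(tok, pos)] for tok, pos in zip(tokens, pos_embs)]

def self_attention(x):
    """Single-head scaled dot-product attention with Q = K = V = x."""
    d = len(x[0])
    outputs, weights = [], []
    for q in x:
        # Score this token against every token, scaled by sqrt(dimension).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all tokens.
        outputs.append([sum(w[j] * x[j][c] for j in range(len(x)))
                        for c in range(d)])
    return outputs, weights

# Four patch tokens from a hypothetical 2 x 2 grid; patches 0 and 2 look identical.
patch_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical position embeddings, one per grid cell (learned in a real ViT).
pos_embs = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1], [0.1, 0.1, 0.0]]

x = add_positions(patch_tokens, pos_embs)
attended, attn_weights = self_attention(x)
```

Patches 0 and 2 have identical content, but after the position embeddings are added their vectors differ, which is how the model can tell where in the grid each patch came from.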

This is why CLIP works at all — and why the next step (jamming these "image tokens" directly into a language model) is so natural. Once images are token sequences, the LLM doesn't need to know the difference.

Part 04 · Hands on · LLaVA architecture

Watch an image
become prompt context.

LLaVA (Large Language and Vision Assistant, 2023) was the first open-source vision-language model that just worked. Its architecture is breathtakingly simple. Walk through it step by step — see exactly how an image becomes tokens an LLM can read alongside text.

The example.

An image of a cat + the question "What is in this image?" arrives at LLaVA. We'll trace what happens to the image at every step. Click any pill below or use Next to advance. The text path is shown separately at the final stage.
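The core of that path can be sketched in a few lines: vision-encoder outputs pass through a small projector and are then concatenated with the text token embeddings. Everything below is an illustrative assumption rather than LLaVA's code: the dimensions are toy sizes, the weights are random instead of trained, the projector is a two-layer MLP with GELU (the kind used in LLaVA-1.5), and the image tokens are simply placed before the question tokens, whereas the real prompt template also contains system text.

```python
import math
import random

def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(vectors, weights, bias):
    """Apply y = xW + b to each vector; weights is (in_dim x out_dim)."""
    return [[sum(v[i] * weights[i][j] for i in range(len(v))) + bias[j]
             for j in range(len(bias))] for v in vectors]

def random_matrix(rows, cols, rng):
    """Random stand-in for learned weights or encoder outputs."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def project_image_features(features, w1, b1, w2, b2):
    """Two-layer MLP projector: linear -> GELU -> linear."""
    hidden = [[gelu(h) for h in row] for row in linear(features, w1, b1)]
    return linear(hidden, w2, b2)

rng = random.Random(0)
VISION_DIM, LLM_DIM = 8, 16            # toy sizes, hypothetical
NUM_PATCHES, NUM_TEXT_TOKENS = 6, 5    # toy sequence lengths, hypothetical

# Stand-ins for the vision encoder's patch features and the LLM's text embeddings.
image_features = random_matrix(NUM_PATCHES, VISION_DIM, rng)
text_embeddings = random_matrix(NUM_TEXT_TOKENS, LLM_DIM, rng)

w1, b1 = random_matrix(VISION_DIM, LLM_DIM, rng), [0.0] * LLM_DIM
w2, b2 = random_matrix(LLM_DIM, LLM_DIM, rng), [0.0] * LLM_DIM

# Map image features into the LLM's embedding width, then build one sequence.
image_tokens = project_image_features(image_features, w1, b1, w2, b2)
llm_input = image_tokens + text_embeddings

print(len(image_tokens), len(text_embeddings), len(llm_input))
```

After projection the image tokens have the same width as the text embeddings, so the language model receives one sequence and needs no special handling for the image part.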


Part 05 · Multimodal in the wild

The models you've
already shown a picture.

Every major chat AI now sees. The exact recipes differ — different vision encoders, projector types, training data — but the core LLaVA-style pattern holds across the board.

OpenAI · 2023

GPT-4V / GPT-4o

// "vision" baked in

Architecture: Unified multimodal transformer

Vision encoder: Custom (undisclosed)

Modalities (4o): text + image + audio + video

Image tokens: 85 base + 170 per 512×512 "tile" (high detail)

Released: Sep 2023 (V), May 2024 (4o)

The model that made vision-language mainstream. GPT-4o ("omni") integrates audio natively too — same transformer, just more token types. OCR, chart-reading, screenshot understanding all come essentially for free from the language model treating images as tokens.

Anthropic · 2024

Claude 3 / 3.5 with vision

// strong document understanding

Architecture: Unified multimodal

Strengths: Documents, charts, screenshots

Max image size: ~8000×8000 pixels

Family: Haiku, Sonnet, Opus

Best at: structured visual content

Anthropic's vision models are known for being especially strong at reading complex documents, charts, tables, and code screenshots. Computer use (Claude controlling a screen via vision) extends this further — same architecture, used as a feedback loop.

Google · 2023+

Gemini

// natively multimodal from day one

Architecture: Multimodal from pre-training

Modalities: text + image + audio + video

Context: Up to 2M tokens

Video understanding: Hour+ of video

Variants: Ultra, Pro, Flash, Nano

Pitched as the first model trained on multiple modalities simultaneously from scratch rather than having them bolted on afterward. The long-context and video features are the differentiators: analyze hours of video in one shot, useful for film and sports analysis, surveillance, and similar tasks.

UW-Madison / Microsoft · 2023

LLaVA

// the open-source recipe

Architecture: CLIP encoder + projector + LLM

Vision encoder: CLIP ViT-L/14

Projector: linear layer (original), 2-layer MLP (LLaVA-1.5)

LLM backbone: Vicuna / Llama

Training data: ~158k instruction samples (tiny!)

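Showed that you don't need to train a vision-language model from scratch. Take a pretrained CLIP, take a pretrained LLM, connect them with a small projector (a single linear layer in the original, a 2-layer MLP in LLaVA-1.5), and fine-tune on a small instruction dataset. The result came surprisingly close to commercial models on the authors' own benchmark. This recipe is now standard.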

Course 3 · Module 06 complete

Pixels and words
now share a space.

You explored a CLIP-style embedding space directly. You walked through LLaVA stage by stage. You can explain how GPT-4V "sees" — image patches become tokens, get projected into the LLM's embedding space, and the LLM processes them like words. The reason "multimodal" works isn't a separate magic vision system. It's that every modality, once embedded as vectors, is the same kind of thing.

Up next · Course 3 · Module 07

Long Context

Attention is O(n²). A 1-million-token context window means 10¹² attention scores per layer per head — enough to crush any GPU. Yet Gemini does 2M, Claude does 200k, and even local models handle 128k. How? Flash Attention, sparse attention, sliding window, RoPE scaling, Mamba/SSMs — the bag of tricks behind modern long context. Interactive: compare attention strategies for long sequences.

Continue to Module 07
