AI Skill Course · Course 3 · Expert
Module 06 of 14 · 85 minutes

Vision.
Language.
The same space.

Until 2021, image models and text models were separate creatures living in separate codebases. CLIP changed that. Then LLaVA, GPT-4V, Claude with vision, and Gemini went all the way: unified models that handle pictures and words as one stream. The trick is shockingly simple: encode both into the same embedding space, then let the transformer do what it does.

You'll explore: a CLIP-style space
You'll walk through: LLaVA, step by step
You'll see: how GPT-4V works
[Figure: Shared embedding space. An image passes through a vision encoder and text passes through a text encoder; both land as vectors in one shared ℝ⁵¹² space. "a cat" and a cat photo end up near each other by cosine similarity. That's the entire trick.]
Part 01 · The unifying trick

Every modality
becomes a vector.

One embedding space to rule them all.

Text tokens become vectors. Image patches become vectors. Audio chunks become vectors. Video frames become vectors. Once everything is a vector in the same space, the rest of the architecture doesn't care where they came from.

The model that proved this was CLIP (OpenAI, 2021). Train two encoders jointly on 400 million (image, caption) pairs: a vision encoder for images, a text encoder for captions. The training objective: make the embedding of an image close to the embedding of its caption, and far from other captions.

The result: a shared embedding space (512-dimensional in the base CLIP models) where "a photo of a cat" lives next to actual photos of cats. You can search images with text. You can classify images with text. You can guide image generation with text: Stable Diffusion v1 conditions its image generator on the output of CLIP's text encoder for exactly this.

Every modern vision-language model (GPT-4V, Claude with vision, Gemini, LLaVA) inherits this trick. Get every modality into the same vector space, then let attention do the work.

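// CLIP training · contrastive
[Figure: an image (cat photo) passes through the image encoder to give v_img ∈ ℝ⁵¹²; its caption ("a cat") passes through the text encoder to give v_txt ∈ ℝ⁵¹². In the shared embedding space, v_img and v_txt are pulled together and other captions are pushed away.]

\[ L = -\log \frac{\exp\left(\mathrm{sim}(v_{img}, v_{txt}) / \tau\right)}{\sum_{j} \exp\left(\mathrm{sim}(v_{img}, v_{txt}^{(j)}) / \tau\right)} \]

Here sim is cosine similarity, τ is a temperature, and the sum runs over every caption in the batch (the matching caption plus all the "other" ones). CLIP averages this image-to-text loss with the mirror-image text-to-image loss.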
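To make the "pull together, push away" objective concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss in plain Python. The function names, the toy 2-dimensional vectors, and the temperature value of 0.07 are illustrative assumptions, not CLIP's actual training code.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def clip_contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, caption) pairs.

    image_vecs[i] and text_vecs[i] are assumed to be a true pair; every other
    combination in the batch acts as a negative.
    """
    imgs = [l2_normalize(v) for v in image_vecs]
    txts = [l2_normalize(v) for v in text_vecs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of image i and caption j, divided by temperature
    logits = [[dot(imgs[i], txts[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        # For row i the correct "class" is column i (its own partner).
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_sum = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_sum - row[i]
        return total / len(rows)

    image_to_text = cross_entropy(logits)
    text_to_image = cross_entropy([list(col) for col in zip(*logits)])
    return 0.5 * (image_to_text + text_to_image)

# Toy batch of two pairs: correctly paired captions vs. swapped captions.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
shuffled = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
print(aligned < shuffled)  # True: matched pairs give the lower loss
```

Training nudges both encoders so that real batches look like the aligned case: each image scores highest against its own caption.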
Part 02 · Hands on · Embedding space search

Type a description.
Watch images align.

Below is a CLIP-style search simulation across 6 images. Type a phrase: the model "embeds" your text, computes cosine similarity to each image's embedding, and ranks them. The top match gets a glow. The same embed-and-compare idea underlies text search over photo libraries and image search engines, and it is closely related to how Stable Diffusion ties text to images.

How the simulation works.

Each image has a hand-curated "concept profile", which is its position in semantic space. Your typed query gets tokenized and matched against these profiles, with synonym expansion. The result is realistic CLIP-style ranking. In a real CLIP model the same thing happens, except both your text and the images get encoded by 12-24 layer transformers into 512-dim vectors first, and the similarity is the dot product of the two length-normalized vectors (that is, their cosine similarity).
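The ranking step itself is tiny. The sketch below assumes the embeddings already exist; the 3-dimensional vectors, file names, and query vector are made-up stand-ins for what real encoders would produce.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(query_vec, image_vecs):
    """Return (name, score) pairs sorted from best match to worst."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in image_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings; the axes loosely mean (animal, outdoors, vehicle).
IMAGE_VECS = {
    "cat_on_sofa.jpg": [0.9, 0.1, 0.0],
    "dog_in_park.jpg": [0.8, 0.6, 0.0],
    "car_on_road.jpg": [0.0, 0.5, 0.9],
}
# Stand-in for the text embedding of "a car on the street".
QUERY_VEC = [0.1, 0.4, 0.9]

ranking = rank_images(QUERY_VEC, IMAGE_VECS)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy numbers the car image ranks first, the dog second (it shares the "outdoors" axis), and the cat last.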

Part 03 · How images become tokens

Tokens for words. Tokens for pixels.

// vision transformer · ViT
[Figure: an image (e.g., 224×224) is split into 16×16-pixel patches, giving N = 196 patches (a 14×14 grid). A linear projection turns them into N vectors of dimension d_model, the same shape as N text tokens. The same transformer (attention, FFN, norms; 12-24 layers) then processes patches like words.]

Vision Transformers don't reinvent anything.

The pivotal 2020 paper was titled "An Image is Worth 16x16 Words." That title is the whole insight. If you cut a 224×224 image into 16×16-pixel patches, you get a 14×14 grid, each patch becomes one "token", and you can run the exact same transformer architecture you'd use for text.

Each patch (say, 16×16×3 pixels) gets flattened to a vector and linearly projected to the model's hidden dimension. For ViT-Large at 224×224 input that's 14×14 = 196 patches, each becoming a 1024-dim vector.
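The patch arithmetic can be checked directly. This is a minimal sketch using nested Python lists as a stand-in for an image tensor; `patchify` is a hypothetical helper, and the learned linear projection to 1024 dimensions is left out.

```python
def patchify(image, patch_size):
    """Split an H x W x C nested-list image into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            flat = []
            for row in range(top, top + patch_size):
                for col in range(left, left + patch_size):
                    flat.extend(image[row][col])  # append this pixel's channel values
            patches.append(flat)
    return patches

# A dummy 224 x 224 RGB image filled with zeros.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
patches = patchify(image, patch_size=16)
print(len(patches), len(patches[0]))  # 196 768
```

Each of the 196 patches is a flat vector of 16 × 16 × 3 = 768 raw values; the linear projection then maps those 768 values to the model's hidden size.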

Then add positional embeddings (where in the grid each patch came from) and run standard self-attention. Patches attend to other patches. After 12-24 layers, you have a rich representation of the image expressed as a sequence of tokens.
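As a rough illustration of that step, the sketch below adds positional embeddings to four toy patch tokens and runs one round of single-head self-attention. It is deliberately simplified: queries, keys, and values are the tokens themselves (no learned projections), and all numbers are made up.

```python
import math

def add_positions(tokens, position_embeddings):
    """Element-wise sum of each patch token and its positional embedding."""
    return [[t + p for t, p in zip(tok, pos)]
            for tok, pos in zip(tokens, position_embeddings)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product attention with identity Q/K/V projections."""
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        attn = softmax(scores)  # how much this patch attends to every patch
        weights.append(attn)
        outputs.append([sum(w * tok[d] for w, tok in zip(attn, tokens))
                        for d in range(dim)])
    return outputs, weights

# Four patch tokens from a 2x2 grid, with hypothetical positional embeddings.
patch_tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
position_embeddings = [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]]

x = add_positions(patch_tokens, position_embeddings)
outputs, weights = self_attention(x)
print([round(w, 3) for w in weights[0]])
```

The first patch puts more weight on the second patch (similar content) than on the third or fourth, which is the "patches attend to other patches" behavior in miniature.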

This is a large part of why CLIP's vision encoder pairs so neatly with a text transformer, and why the next step (jamming these "image tokens" directly into a language model) is so natural. Once images are token sequences, the LLM doesn't need to know the difference.

Part 04 · Hands on · LLaVA architecture

Watch an image
become prompt context.

LLaVA (Large Language and Vision Assistant, 2023) was one of the first open-source vision-language models that just worked. Its architecture is breathtakingly simple. Walk through it step by step to see exactly how an image becomes tokens an LLM can read alongside text.

The example.

An image of a cat + the question "What is in this image?" arrives at LLaVA. We'll trace what happens to the image at every step. Click any pill below or use Next to advance. The text path is shown separately at the final stage.

Part 05 · Multimodal in the wild

The models you've
already shown a picture.

Every major chat AI now sees. The exact recipes differ — different vision encoders, projector types, training data — but the core LLaVA-style pattern holds across the board.

OpenAI · 2023

GPT-4V / GPT-4o

// "vision" baked in
Architecture: Unified multimodal transformer
Vision encoder: Custom (undisclosed)
Modalities (4o): text + image + audio + video
Image tokens: ~170 per 512×512 "tile", plus 85 base tokens per image
Released: Sep 2023 (V), May 2024 (4o)

The model that made vision-language mainstream. GPT-4o ("omni") integrates audio natively too — same transformer, just more token types. OCR, chart-reading, screenshot understanding all come essentially for free from the language model treating images as tokens.
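For a feel of how tiling turns into token counts, here is a small estimator. It assumes the accounting OpenAI published for GPT-4-class vision models (fit within 2048×2048, shrink the short side to 768, then 170 tokens per 512-pixel tile plus 85 base tokens); treat those constants, and the function name, as assumptions that may not match current models or pricing.

```python
import math

def estimate_image_tokens(width, height, detail="high"):
    """Rough token estimate for one image under the assumed tiling rules."""
    base_tokens, tile_tokens, tile_size = 85, 170, 512
    if detail == "low":
        return base_tokens  # low detail: a fixed cost, no tiles
    # Step 1: shrink (never enlarge) so the image fits inside 2048 x 2048.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Step 2: shrink (never enlarge) so the shortest side is at most 768.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # Step 3: count the 512 x 512 tiles needed to cover the resized image.
    tiles = math.ceil(width / tile_size) * math.ceil(height / tile_size)
    return base_tokens + tile_tokens * tiles

print(estimate_image_tokens(1024, 1024))         # 765
print(estimate_image_tokens(2048, 4096))         # 1105
print(estimate_image_tokens(4096, 8192, "low"))  # 85
```

A 1024×1024 image is resized to 768×768, which needs four tiles: 4 × 170 + 85 = 765 tokens.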

Anthropic · 2024

Claude 3 / 3.5 with vision

// strong document understanding
Architecture: Unified multimodal
Strengths: Documents, charts, screenshots
Max image size: ~8000×8000 pixels
Family: Haiku, Sonnet, Opus
Best at: structured visual content

Anthropic's vision models are known for being especially strong at reading complex documents, charts, tables, and code screenshots. Computer use (Claude controlling a screen via vision) extends this further — same architecture, used as a feedback loop.

Google · 2023+

Gemini

// natively multimodal from day one
Architecture: Multimodal from pre-training
Modalities: text + image + audio + video
Context: Up to 2M tokens
Video understanding: Hour+ of video
Variants: Ultra, Pro, Flash, Nano

Pitched as the first model trained on multiple modalities simultaneously from scratch, rather than having vision bolted on afterward. The long-context and video features are differentiating: it can analyze hours of video in one shot, which is useful for film and sports analysis, surveillance, and similar tasks.

UW–Madison / Microsoft · 2023

LLaVA

// the open-source recipe
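Architecture: CLIP encoder + projector + LLM
Vision encoder: CLIP ViT-L/14
Projector: linear layer (original); 2-layer MLP (LLaVA-1.5)
LLM backbone: Vicuna / Llama
Training data: ~158k instruction samples (tiny!), after ~595k caption pairs for projector alignment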

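Showed that you don't need to train a vision-language model from scratch. Take a pretrained CLIP, take a pretrained LLM, and train a small projector between them (a single linear layer in the original, a 2-layer MLP in LLaVA-1.5). The projector is trained first with both pretrained models frozen; a second stage then fine-tunes the projector and the LLM together on the small instruction dataset. The result was surprisingly competitive with commercial models on the paper's own evaluations. This recipe is now standard.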
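The projector idea can be sketched in a few lines. This toy version uses random weights and tiny dimensions (8 and 12) purely for illustration; in a real LLaVA-style setup the vision features would come from CLIP ViT-L/14, the target size would be the LLM's hidden dimension, and the weights would be learned. The class and function names are hypothetical.

```python
import math
import random

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear(x, weight, bias):
    """y = W x + b for a single vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

class Projector:
    """Two-layer MLP mapping vision features into the LLM's embedding space."""

    def __init__(self, vision_dim, llm_dim, rng):
        self.w1 = random_matrix(llm_dim, vision_dim, rng)
        self.b1 = [0.0] * llm_dim
        self.w2 = random_matrix(llm_dim, llm_dim, rng)
        self.b2 = [0.0] * llm_dim

    def __call__(self, patch_feature):
        hidden = [gelu(h) for h in linear(patch_feature, self.w1, self.b1)]
        return linear(hidden, self.w2, self.b2)

def build_llm_input(patch_features, text_embeddings, projector):
    """Project each image patch, then place image tokens ahead of the text tokens."""
    image_tokens = [projector(feature) for feature in patch_features]
    return image_tokens + text_embeddings

rng = random.Random(0)
VISION_DIM, LLM_DIM = 8, 12  # toy sizes, far smaller than any real model
patch_features = [[rng.random() for _ in range(VISION_DIM)] for _ in range(4)]
text_embeddings = [[rng.random() for _ in range(LLM_DIM)] for _ in range(3)]

projector = Projector(VISION_DIM, LLM_DIM, rng)
sequence = build_llm_input(patch_features, text_embeddings, projector)
print(len(sequence), len(sequence[0]))  # 7 12
```

After projection, the four image tokens and three text tokens have the same width, so the LLM can attend over all seven as one sequence.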

Part 06 · Knowledge check

Five questions on what
you just unified.

Aim for 4/5. Wrong answers explain themselves.

Course 3 · Module 06 complete

Pixels and words
now share a space.

You explored a CLIP-style embedding space directly. You walked through LLaVA stage by stage. You can explain how a model like GPT-4V plausibly "sees" (its internals are undisclosed): image patches become tokens, get projected into the LLM's embedding space, and the LLM processes them like words. The reason "multimodal" works isn't a separate magic vision system. It's that every modality, once embedded as vectors, is the same kind of thing.

Up next · Course 3 · Module 07

Long Context

Attention is O(n²). A 1-million-token context window means 10¹² attention scores per layer per head — enough to crush any GPU. Yet Gemini does 2M, Claude does 200k, and even local models handle 128k. How? Flash Attention, sparse attention, sliding window, RoPE scaling, Mamba/SSMs — the bag of tricks behind modern long context. Interactive: compare attention strategies for long sequences.
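The 10¹² figure is just n² for n = 1,000,000. A quick check, assuming (hypothetically) that each score were stored as a 2-byte float and the full matrix were materialized:

```python
def attention_score_count(n_tokens):
    """Number of query-key scores in one full attention matrix."""
    return n_tokens * n_tokens

def score_matrix_bytes(n_tokens, bytes_per_score=2):
    """Memory for one head's full score matrix at the given precision."""
    return attention_score_count(n_tokens) * bytes_per_score

n = 1_000_000
print(attention_score_count(n))      # 1000000000000
print(score_matrix_bytes(n) / 1e12)  # 2.0 (terabytes, per layer per head)
```

Two terabytes for a single head in a single layer is why naive attention cannot simply be scaled up.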
