Until 2021, image models and text models were separate creatures living in separate codebases. CLIP changed that. Then LLaVA, GPT-4V, Claude with vision, and Gemini went all the way: unified models that handle pictures and words as one stream. The trick is shockingly simple: encode both into the same embedding space, then let the transformer do what it does.
Cross the modalities
Text tokens become vectors. Image patches become vectors. Audio chunks become vectors. Video frames become vectors. Once everything is a vector in the same space, the rest of the architecture doesn't care where they came from.
The model that proved this was CLIP (OpenAI, 2021). Train two encoders jointly on 400 million (image, caption) pairs: a vision encoder for images, a text encoder for captions. The training objective: make the embedding of an image close to the embedding of its caption, and far from other captions.
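The objective described above can be sketched as a symmetric contrastive loss. This is a minimal NumPy illustration, not CLIP's actual training code: the function names, the random toy embeddings, and the fixed temperature of 0.07 are all assumptions made for the example (in a real model the temperature is typically a learned parameter).

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of (image, caption) pairs.

    Row i of image_emb and row i of text_emb are assumed to be a true pair;
    every other combination in the batch is treated as a negative.
    """
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature        # (batch, batch)
    diag = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> caption
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # caption -> image
    return (loss_i2t + loss_t2i) / 2

# Toy data (hypothetical): 4 pairs in an 8-dim space instead of 512.
rng = np.random.default_rng(0)
captions = rng.normal(size=(4, 8))
aligned_images = captions + 0.01 * rng.normal(size=(4, 8))  # images near their captions
shuffled_images = aligned_images[[1, 2, 3, 0]]              # every pairing is wrong

loss_aligned = clip_style_loss(aligned_images, captions)
loss_shuffled = clip_style_loss(shuffled_images, captions)
print(loss_aligned < loss_shuffled)
```

The loss is low when each image sits closest to its own caption and high when the pairings are scrambled, which is the pressure that pulls matching images and captions together during training.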
The result: a shared 512-dimensional space where "a photo of a cat" lives next to actual photos of cats. You can search images with text. You can classify images with text. You can guide image generation with text — Stable Diffusion uses CLIP's text encoder for exactly this.
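Classifying images with text can be sketched as follows: embed one prompt per label, then pick the label whose prompt is most similar to the image. The 3-dim vectors below are invented stand-ins for real 512-dim embeddings, and the temperature of 0.01 is an assumed value, so treat this as an illustration of the mechanism only.

```python
import numpy as np

def normalize(v):
    """Scale vectors to unit length along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def zero_shot_classify(image_emb, prompt_embs, labels, temperature=0.01):
    """Return the label whose prompt embedding is most similar to the image."""
    sims = normalize(prompt_embs) @ normalize(image_emb)   # cosine similarities
    logits = sims / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                   # softmax over labels
    return labels[int(np.argmax(probs))], probs

# Hypothetical embeddings for prompts like "a photo of a cat".
labels = ["cat", "dog", "car"]
prompt_embs = np.array([[0.9, 0.1, 0.0],
                        [0.1, 0.9, 0.0],
                        [0.0, 0.1, 0.9]])
image_emb = np.array([0.8, 0.3, 0.1])   # hypothetical embedding of a cat photo

label, probs = zero_shot_classify(image_emb, prompt_embs, labels)
print(label, probs.round(3))
```

No classifier is trained here: adding a new class only requires writing a new prompt and embedding it.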
Every modern vision-language model (GPT-4V, Claude with vision, Gemini, LLaVA) inherits this trick. Get every modality into the same vector space, then let attention do the work.
Below is a CLIP-style search simulation across 6 images. Type a phrase: the model "embeds" your text, computes cosine similarity to each image's embedding, and ranks them. The top match gets a glow. The same embed-and-compare idea underlies text-based photo search, image search engines, and the text conditioning in Stable Diffusion.
Each image has a hand-curated "concept profile" that stands in for its position in semantic space. Your typed query gets tokenized and matched against these profiles, with synonym expansion. The result approximates CLIP-style ranking. In a real CLIP model the same thing happens, except both your text and the images are first encoded by 12-24 layer transformers into 512-dim vectors, and the similarity is the cosine similarity (the dot product of the normalized vectors) between them.
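A simulation along these lines might look like the sketch below. Everything in it is hypothetical: the concept list, the synonym table, the file names, and the profile values are invented for illustration, and it uses four images rather than six to stay short.

```python
import numpy as np

# Hypothetical concept vocabulary and synonym table (all values invented).
CONCEPTS = ["cat", "dog", "beach", "city", "food", "night"]
SYNONYMS = {"kitten": "cat", "puppy": "dog", "ocean": "beach", "sea": "beach",
            "skyline": "city", "pizza": "food", "dark": "night"}

# One hand-written "concept profile" per image, in CONCEPTS order.
IMAGES = {
    "cat_on_sofa.jpg":   [1.0, 0.0, 0.0, 0.0, 0.0, 0.1],
    "dog_at_beach.jpg":  [0.0, 1.0, 0.8, 0.0, 0.0, 0.0],
    "city_at_night.jpg": [0.0, 0.0, 0.0, 1.0, 0.0, 0.9],
    "pizza.jpg":         [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
}

def embed_query(text):
    """Tokenize on whitespace, expand synonyms, and count matched concepts."""
    vec = np.zeros(len(CONCEPTS))
    for word in text.lower().split():
        concept = SYNONYMS.get(word, word)
        if concept in CONCEPTS:
            vec[CONCEPTS.index(concept)] += 1.0
    return vec

def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def search(text):
    """Return (score, image name) pairs sorted from best to worst match."""
    q = embed_query(text)
    scored = [(cosine(q, np.array(profile)), name) for name, profile in IMAGES.items()]
    return sorted(scored, reverse=True)

ranking = search("a puppy by the sea")
print(ranking[0][1])
```

Here "puppy" and "sea" map onto the dog and beach concepts, so the dog-at-the-beach profile ranks first. A real model replaces the hand-written profiles and synonym table with learned encoders.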
The pivotal 2020 paper, which introduced the Vision Transformer (ViT), was titled "An Image is Worth 16x16 Words." That title is the whole insight. If you split a 224×224 image into 16×16-pixel patches, you get a 14×14 grid, each patch becomes one "token," and you can run the exact same transformer architecture you'd use for text.
Each patch (16×16 pixels × 3 color channels = 768 values) gets flattened to a vector and linearly projected to the model's hidden dimension. For ViT-Large at 224×224 input that's 14×14 = 196 patches, each becoming a 1024-dim vector.
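The patch arithmetic can be checked with a few lines of NumPy. This is a sketch under stated assumptions: the image is random noise, the projection weights are random rather than learned, and `patchify` is a helper name chosen for this example.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)           # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # stand-in for a real RGB image
patches = patchify(image, patch_size=16)   # 14 x 14 grid of 16x16x3 patches

hidden_dim = 1024                          # hidden width quoted for ViT-Large
W = rng.normal(scale=0.02, size=(patches.shape[1], hidden_dim))  # random, not learned
b = np.zeros(hidden_dim)
tokens = patches @ W + b                   # one 1024-dim vector per patch

print(patches.shape, tokens.shape)         # (196, 768) (196, 1024)
```

The reshape and transpose only rearrange pixels. The single matrix multiply at the end is the whole "linear projection" step.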
Then add positional embeddings (where in the grid each patch came from) and run standard self-attention. Patches attend to other patches. After 12-24 layers, you have a rich representation of the image expressed as a sequence of tokens.
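As a rough sketch of this step, the code below adds positional embeddings to patch tokens and applies one single-head self-attention layer. It deliberately omits parts of a real ViT block (multiple heads, layer normalization, residual connections, the feed-forward sublayer, any class token), uses random weights, and shrinks the width to 64 to keep it light.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # how strongly each patch attends to each other patch
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

rng = np.random.default_rng(0)
num_patches, dim = 196, 64                    # toy width; ViT-Large uses 1024
patch_tokens = rng.normal(size=(num_patches, dim))
pos_emb = rng.normal(scale=0.02, size=(num_patches, dim))  # learned in a real model
x = patch_tokens + pos_emb                    # inject "where in the grid" information

Wq, Wk, Wv = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))
out, weights = self_attention(x, Wq, Wk, Wv)
print(out.shape, weights.shape)               # (196, 64) (196, 196)
```

Each row of `weights` is a distribution over all 196 patches, so every output token is a mixture of information from the whole image.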
This is why CLIP works at all — and why the next step (jamming these "image tokens" directly into a language model) is so natural. Once images are token sequences, the LLM doesn't need to know the difference.
LLaVA (Large Language and Vision Assistant, 2023) was one of the first open-source vision-language models that just worked. Its architecture is breathtakingly simple. Walk through it step by step and see exactly how an image becomes tokens an LLM can read alongside text.
An image of a cat + the question "What is in this image?" arrives at LLaVA. We'll trace what happens to the image at every step. The text path is shown separately at the final stage.
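The image path can be traced as a sequence of array shapes. The sketch below is a shape-level illustration only: the vision encoder output is replaced by random features, the projector weights are random, the token ids for the question are made up, and the widths are scaled down (the comments note the assumed real-world sizes).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes so the sketch runs instantly. Assumed real-world counterparts:
# 576 patch features of width 1024 from the vision encoder, LLM width 4096.
num_patches, vision_dim, llm_dim, vocab_size = 576, 32, 64, 1000

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def project(feats, w1, b1, w2, b2):
    """2-layer MLP mapping vision features into the LLM's embedding width."""
    return gelu(feats @ w1 + b1) @ w2 + b2

# Step 1: the vision encoder turns the image into one feature vector per patch.
vision_feats = rng.normal(size=(num_patches, vision_dim))   # random stand-in

# Step 2: the projector maps each patch feature to the LLM's width.
w1 = rng.normal(scale=0.02, size=(vision_dim, llm_dim))
b1 = np.zeros(llm_dim)
w2 = rng.normal(scale=0.02, size=(llm_dim, llm_dim))
b2 = np.zeros(llm_dim)
image_tokens = project(vision_feats, w1, b1, w2, b2)

# Step 3: the question is embedded through the LLM's ordinary lookup table.
embedding_table = rng.normal(size=(vocab_size, llm_dim))
question_ids = np.array([12, 87, 5, 301, 44, 9])            # hypothetical token ids
text_tokens = embedding_table[question_ids]

# Step 4: image tokens and text tokens form one sequence for the LLM.
llm_input = np.concatenate([image_tokens, text_tokens], axis=0)
print(image_tokens.shape, text_tokens.shape, llm_input.shape)
```

After the concatenation, nothing in the array marks which rows came from pixels and which from words. The LLM attends over all of them the same way.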
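Every major chat AI now accepts images. The exact recipes differ (different vision encoders, projector types, training data), and the internals of the proprietary models are not fully public, but the core LLaVA-style pattern of turning images into tokens for a language model is the common thread.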
GPT-4V was the model that made vision-language mainstream. GPT-4o ("omni") integrates audio natively too: same transformer, just more token types. OCR, chart-reading, and screenshot understanding largely fall out of the language model treating images as tokens.
Anthropic's Claude models with vision are known for being especially strong at reading complex documents, charts, tables, and code screenshots. Computer use (Claude controlling a screen via vision) extends this further: same architecture, used as a feedback loop.
Gemini was pitched as the first model trained on multiple modalities simultaneously from scratch rather than having vision bolted on. Its long context and video support are the differentiators: it can analyze hours of video in one shot, which is useful for film and sports analysis, surveillance, and similar tasks.
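LLaVA showed that you don't need to train a vision-language model from scratch. Take a pretrained CLIP vision encoder, take a pretrained LLM, and connect them with a small projector (a single linear layer in the original LLaVA, a 2-layer MLP in LLaVA-1.5) trained on a comparatively small dataset, followed by instruction fine-tuning with the vision encoder kept frozen. The result was competitive with commercial models on several benchmarks. This recipe is now standard.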
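To get a feel for how small the connector is relative to the models it joins, here is a back-of-the-envelope parameter count. The widths (1024 for the vision encoder, 4096 for the LLM) and the 7-billion-parameter LLM size are assumed round figures for illustration, not numbers taken from this module.

```python
def linear_params(d_in, d_out):
    """Weights plus biases of one fully connected layer."""
    return d_in * d_out + d_out

vision_dim = 1024            # assumed width of the vision encoder's patch features
llm_dim = 4096               # assumed hidden width of a 7B-class LLM
llm_params = 7_000_000_000   # assumed round figure for the LLM itself

# 2-layer MLP: vision_dim -> llm_dim -> llm_dim
projector_params = linear_params(vision_dim, llm_dim) + linear_params(llm_dim, llm_dim)

print(projector_params)                          # 20979712
print(f"{projector_params / llm_params:.2%}")    # 0.30%
```

Under these assumptions the projector is roughly 21 million parameters, a fraction of a percent of the language model, which is why training it is cheap compared with pretraining either side.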
You explored a CLIP-style embedding space directly. You walked through LLaVA stage by stage. You can explain how GPT-4V "sees" — image patches become tokens, get projected into the LLM's embedding space, and the LLM processes them like words. The reason "multimodal" works isn't a separate magic vision system. It's that every modality, once embedded as vectors, is the same kind of thing.