Computers can't read English. They can only do math on numbers. So before any model can understand a sentence, two things must happen: the text gets broken into pieces (tokens), and each piece becomes a vector of numbers (embedding). Get those two right and everything from search to ChatGPT becomes possible.
Decode the language
Before any NLP model can do anything useful, raw text has to become numbers. The two-step pipeline below underpins semantic search engines, chatbots, and large language models, including ChatGPT, Claude, and Gemini.
Break the text into pieces called tokens. Sometimes a token is a whole word ("cat"). Sometimes it's a subword ("anti" + "freeze"). Sometimes a single character.
Each unique token gets an integer ID from a fixed vocabulary (often tens of thousands of entries; GPT-2's has about 50,000). The model only ever sees these numbers, never the original text.
Each token ID becomes a long vector (often 768 or 4096 numbers). These vectors are learned so similar words end up with similar vectors.
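To make the three steps concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the six-entry `VOCAB`, the whitespace tokenizer, and the random 4-number vectors standing in for learned embeddings. A real model uses a trained tokenizer and learned weights, but the shape of the pipeline is the same.

```python
import random

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
EMBED_DIM = 4  # real models often use 768 or 4096

# One vector per vocabulary entry. Random values stand in for learned weights.
rng = random.Random(0)
EMBEDDINGS = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real tokenizer)."""
    return text.lower().split()

def to_ids(tokens):
    """Map each token to its integer ID; unknown tokens map to <unk>."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in tokens]

def embed(ids):
    """Look up one vector per token ID."""
    return [EMBEDDINGS[i] for i in ids]

tokens = tokenize("The cat sat on the mat")
ids = to_ids(tokens)
vectors = embed(ids)
print(tokens)                         # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(ids)                            # [1, 2, 3, 4, 1, 5]
print(len(vectors), len(vectors[0]))  # 6 4
```

Notice that both occurrences of "the" map to the same ID and therefore the same vector. At this stage the lookup knows nothing about context.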
Below is a simplified tokenizer working in real time. Type any text and see how it breaks into tokens, each with its own ID and color. Try common words, made-up words, code, even other languages — and notice the token count. (That's also how API providers charge you: per token.)
In this simplified tokenizer, common English words get one token each, while uncommon or technical words get broken into subword pieces. Capital letters, punctuation, and emojis each get their own tokens; production tokenizers differ in the details, but the same idea holds. The total token count is what matters in practice: GPT-4, Claude, and most other APIs charge per token.
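Subword splitting can be sketched in a few lines. The example below uses greedy longest-match against a tiny hand-picked vocabulary, which is not the algorithm production tokenizers use (they typically learn their vocabulary with methods such as byte-pair encoding). The `VOCAB` set and the `split_word` helper are hypothetical names made up for this sketch.

```python
# Hypothetical subword vocabulary, chosen only for illustration.
VOCAB = {"anti", "freeze", "cat", "un", "happi", "ness", "token", "s"}

def split_word(word, vocab):
    """Greedily take the longest vocabulary entry at each position.
    Characters that match nothing become single-character tokens."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry starts here: fall back to one character.
            pieces.append(word[start])
            start += 1
    return pieces

def tokenize(text, vocab=VOCAB):
    tokens = []
    for word in text.lower().split():
        tokens.extend(split_word(word, vocab))
    return tokens

tokens = tokenize("cat antifreeze unhappiness")
print(tokens)       # ['cat', 'anti', 'freeze', 'un', 'happi', 'ness']
print(len(tokens))  # 6
```

Three words became six tokens. That count, not the word count, is the unit a per-token API would bill.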
36 words plotted in 3D space using simulated embeddings. Real embeddings live in 768 or more dimensions, which is impossible to visualize directly, but the same clustering happens: animals near animals, royalty near royalty, foods near foods.
Click and drag the cube to rotate. Click any word to select it — its 5 nearest semantic neighbors light up, connected by dashed cyan lines. Notice how "king", "queen", "prince" cluster together; "happy", "sad", "joy" form their own region; "computer" and "code" are practically on top of each other. This is what neural networks "see" when they read.
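"Nearest neighbors" has a precise meaning here: the words whose vectors point in the most similar direction, usually measured with cosine similarity. The sketch below uses seven hand-written 3D vectors (made up for illustration, not taken from any trained model) to show the lookup.

```python
import math

# Hand-written 3D vectors, purely illustrative. Real embeddings are learned.
EMBEDDINGS = {
    "king":     [0.9, 0.1, 0.1],
    "queen":    [0.8, 0.2, 0.1],
    "prince":   [0.9, 0.2, 0.0],
    "happy":    [0.1, 0.1, 0.9],
    "joy":      [0.2, 0.1, 0.8],
    "computer": [0.1, 0.9, 0.1],
    "code":     [0.1, 0.8, 0.2],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word, k=5):
    """Return the k words most similar to `word`, excluding the word itself."""
    query = EMBEDDINGS[word]
    others = [w for w in EMBEDDINGS if w != word]
    others.sort(key=lambda w: cosine(query, EMBEDDINGS[w]), reverse=True)
    return others[:k]

print(nearest("king", k=2))      # ['queen', 'prince']
print(nearest("computer", k=1))  # ['code']
```

With real embeddings the only changes are the number of dimensions and the size of the table; the distance computation is identical.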
Because embeddings live in a coherent vector space, you can add and subtract them. And the results are often poetic.
Subtract the "male" component of "king" by removing "man." Add the "female" component by adding "woman." The closest vector in embedding space to the result (once the three input words themselves are excluded from the search) is, almost spookily, "queen." This is real: it works on actual trained embeddings like Word2Vec and GloVe.
This isn't magic. It's geometry. The gender axis is roughly consistent across word pairs because the model learned (from billions of sentences) that "king" relates to "queen" the same way "man" relates to "woman." So the two vector differences are approximately the same.
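Here is the arithmetic as a small sketch. The vectors are hand-built with three hypothetical axes (royalty, gender, food) so that the analogy holds exactly; trained embeddings only satisfy it approximately, and the axes are never this cleanly labeled. The `analogy` helper is a name invented for this example.

```python
import math

# Hand-built vectors with axes [royalty, gender, food]. Hypothetical values.
VECS = {
    "king":     [1.0,  1.0, 0.0],
    "queen":    [1.0, -1.0, 0.0],
    "man":      [0.0,  1.0, 0.0],
    "woman":    [0.0, -1.0, 0.0],
    "prince":   [0.8,  0.9, 0.0],
    "princess": [0.8, -0.9, 0.0],
    "apple":    [0.0,  0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), skipping the inputs."""
    target = [x - y + z for x, y, z in zip(VECS[a], VECS[b], VECS[c])]
    candidates = [w for w in VECS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, VECS[w]))

print(analogy("king", "man", "woman"))    # queen
print(analogy("prince", "man", "woman"))  # princess
```

Skipping the input words matters with real embeddings: the result of the arithmetic usually stays close to the starting word, so "king" itself would often win otherwise.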
Every product that "understands meaning" is doing some version of this. Three big ones:
"Find documents about AI risks" — not just ones with the exact words. Search engines embed your query and every document, then find the closest vectors. Works across synonyms and paraphrasing.
"You liked this movie — here are similar ones." Movies, products, songs, articles — all get embedded as vectors. Recommendations are just nearest neighbor lookups in vector space.
Every LLM starts by embedding the input tokens, processes those vectors through dozens of layers, and outputs new vectors. The whole transformer architecture (next module) operates entirely in embedding space.
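Semantic search, the first of these, can be sketched with the same tools. The example below embeds a query and each document by averaging made-up word vectors, then ranks documents by cosine similarity. Averaging is a simple stand-in chosen for this sketch; production systems typically use a trained sentence or document encoder. The `WORD_VECS` table and document list are hypothetical.

```python
import math

# Hypothetical word vectors with axes [AI, risk, cooking].
WORD_VECS = {
    "ai":           [1.0, 0.0, 0.0],
    "artificial":   [0.9, 0.1, 0.0],
    "intelligence": [0.9, 0.0, 0.1],
    "machine":      [0.8, 0.0, 0.1],
    "learning":     [0.8, 0.0, 0.0],
    "risks":        [0.0, 1.0, 0.0],
    "dangers":      [0.1, 0.9, 0.0],
    "pasta":        [0.0, 0.0, 1.0],
    "recipes":      [0.0, 0.1, 0.9],
}

def embed(text):
    """Average the vectors of known words; unknown words are skipped."""
    vecs = [WORD_VECS[w] for w in text.lower().split() if w in WORD_VECS]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query, docs):
    """Rank documents from most to least similar to the query."""
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)

DOCS = [
    "pasta recipes",
    "machine learning tutorial",
    "dangers of artificial intelligence",
]
print(search("ai risks", DOCS)[0])  # dangers of artificial intelligence
```

The top result shares no words with the query. It wins because its vector points in nearly the same direction, which is the whole point of searching by meaning instead of by keyword.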
You tokenized text and saw it become numbered pieces. You explored a 3D embedding space and felt how meaning becomes distance. You learned why "king − man + woman = queen." Everything modern in NLP — every LLM, every search engine, every recommendation system — is built on these two ideas. You now have the foundation.