Part 1: Tokenization - The First Step Every LLM Takes | Gopikrishna Pavuluri

Part 1: Tokenization - The First Step Every LLM Takes


Before we jump into the key details, here are a few keywords I will be using throughout these articles. It is best to understand them first.

The Language of Tokenization

  • Token : an atomic unit of text. Be careful here: a token is not necessarily a word. A token might be a word, a piece of a word, or a single character.
  • Tokenization : the process of cutting text into tokens.
  • Tokenizer : the program that performs the tokenization process and owns the vocabulary.
  • Vocabulary : the fixed, finite set of all tokens the tokenizer knows about.
  • Token ID : the integer that identifies a token within the vocabulary.
  • Encoding / decoding : text → IDs, and IDs → text.

If you get confused, hold on to these terms. They will make sense as I go through the explanations.

The core problem: neural networks don’t eat text

The transformer underneath a large language model is essentially a stack of matrix multiplications. It operates on vectors of numbers. But the data is a string. Something has to bridge that gap, and that bridge has two halves.

Tokenization

Chop the string into discrete units (tokens) and map each to an integer ID.

Diagram showing raw text being split into tokens and mapped to integer IDs
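
To make the text → IDs step concrete, here is a minimal sketch of encoding and decoding. The four-entry vocabulary and the `encode` / `decode` helper names are assumptions made up for illustration; a real tokenizer has tens of thousands of entries and a smarter splitting rule.

```python
import re

# Hypothetical toy vocabulary: token string -> integer ID.
vocab = {"the": 0, "cat": 1, "sat": 2, " ": 3}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text):
    """Cut text into pieces (words and single spaces), then map each piece to its ID."""
    pieces = re.findall(r"\s|\S+", text)
    return [vocab[p] for p in pieces]

def decode(ids):
    """Map each ID back to its token string and join the pieces."""
    return "".join(id_to_token[i] for i in ids)

ids = encode("the cat sat")
print(ids)          # [0, 3, 1, 3, 2]
print(decode(ids))  # the cat sat
```

Note that this toy version raises a `KeyError` for any piece that is not in the vocabulary, which is exactly the problem the options discussed later try to solve.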

Embedding

Look up each ID in a table to get a learnable vector.

Diagram showing integer token IDs being looked up in an embedding table to produce dense vectors
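
The lookup itself is just row indexing. The sketch below uses plain Python lists so it runs without any libraries; the sizes (`vocab_size = 5`, `d_model = 4`) and the random initial values are assumptions for illustration. In a real model the table is a trained parameter matrix with far larger dimensions.

```python
import random

random.seed(0)
vocab_size, d_model = 5, 4  # hypothetical sizes, tiny compared to a real model

# One row per token ID. Here the rows are random; in a real model
# they are learned during training.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(d_model)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Replace each integer ID with the corresponding row of the table."""
    return [embedding_table[i] for i in token_ids]

vectors = embed([0, 3, 1, 3, 2])
print(len(vectors), len(vectors[0]))  # 5 4
print(vectors[1] == vectors[3])       # True: the same ID always gets the same vector
```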

People often get confused about tokenization and embedding, but keep them separate in your head: tokenization assigns integer IDs to text, and embedding turns those integers into vectors the model operates on.

What are the different options to represent text as tokens?

Option 1: One word = one token

The most intuitive idea: split text on spaces and treat every word as a token. But English has roughly 170,000 words, and most of them appear rarely. The model assigns a probability to every entry in the vocabulary at every step, so most of those slots are nearly useless.

Word tokenization leads to a huge vocabulary where most words appear rarely
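
A small sketch of word-level tokenization shows both symptoms: rare entries in the vocabulary and no way to handle a word that was never seen. The toy corpus, the `encode_words` helper, and the `-1` "unknown" ID are all assumptions for illustration.

```python
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the log"
counts = Counter(corpus.split())

# One ID per distinct word, assigned in alphabetical order.
word_vocab = {word: i for i, word in enumerate(sorted(counts))}

def encode_words(text, unk_id=-1):
    """Map each space-separated word to its ID, or unk_id if it is not in the vocabulary."""
    return [word_vocab.get(w, unk_id) for w in text.split()]

rare = [w for w, c in counts.items() if c == 1]
print(len(word_vocab), len(rare))      # 7 4  (more than half the words appear only once)
print(encode_words("the cat jumped"))  # [6, 0, -1]  ("jumped" was never seen)
```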

Option 2: One character = one token

The vocabulary shrinks to ~100–200 characters and the rare-token problem disappears. But sequence length explodes: “The cat sat” goes from 3 tokens to 11 (spaces count as characters). The cost of self-attention in a transformer grows quadratically with sequence length, and individual characters carry almost no meaning, so the model has to relearn language from raw atoms.

Character tokenization causes sequence length to explode
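
The arithmetic behind that claim is easy to check. The sketch below counts tokens both ways and then compares the number of position pairs, n × n, as a rough stand-in for attention cost; it is a back-of-the-envelope illustration, not a measurement of a real model.

```python
text = "The cat sat"

word_tokens = text.split()  # ['The', 'cat', 'sat']
char_tokens = list(text)    # every character, spaces included

print(len(word_tokens), len(char_tokens))  # 3 11

# Self-attention relates every position to every other position,
# so the work grows with the square of the sequence length.
word_pairs = len(word_tokens) ** 2
char_pairs = len(char_tokens) ** 2
print(word_pairs, char_pairs)  # 9 121
```

So a sequence that is about 3.7 times longer costs about 13 times more pairwise comparisons.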

Option 3: Subword tokenization — the sweet spot

Break text into pieces smaller than words but larger than characters. Common words stay as single tokens; rare words split into recognizable parts (unhappiness → un + happiness). The vocabulary is learned from data using Byte Pair Encoding (BPE): start with characters, then repeatedly merge the most frequent adjacent pair until the vocabulary reaches a target size. GPT-2's vocabulary has 50,257 tokens. GPT-4's has ~100,000.

BPE merge steps from characters to subword tokens
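
The merge loop can be sketched in a few lines. This is a toy version under stated assumptions: it starts from characters as described above, treats each word separately, and breaks ties by taking the first pair it encountered. The `train_bpe` name and the sample word list are made up for illustration, and production tokenizers add many details this leaves out.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy sketch)."""
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; on a tie, the pair seen first wins.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, replacing the chosen pair with one merged symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus

words = ["low", "low", "lower", "lowest", "newest", "newest"]
merges, corpus = train_bpe(words, 3)
print(merges)        # [('l', 'o'), ('lo', 'w'), ('e', 's')]
print(list(corpus))  # [('low',), ('low', 'e', 'r'), ('low', 'es', 't'), ('n', 'e', 'w', 'es', 't')]
```

After three merges the frequent word "low" is already a single token, while the rarer "lower" is still split into pieces. The final vocabulary is the starting characters plus one new symbol per merge.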


Acknowledgments: Grammar and prose lightly edited with Claude (Anthropic) assistance.


References

  1. Hashimoto, T. & Liang, P. (2026). CS336 Language Modeling from Scratch. Stanford University, Spring 2026.
  2. Karpathy, A. (2023). Let’s build the GPT Tokenizer. YouTube. — Excellent hands-on walkthrough of BPE from scratch.
  3. Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016. — The original BPE paper applied to NLP.
  4. Radford, A. et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI. — GPT-2 paper.