Tokenization in LLMs

Jérémy Martin
March 13, 2025
AI

Tokenization

We begin by training the model. The neural layers initially consist of randomly initialized parameters, and as training iterations proceed, these parameters are updated so that the model produces increasingly informed responses.

The dataset used to train the model is fundamental, because it is through successive corrections that the model learns. Large language models like ChatGPT use hundreds of gigabytes of text for their training, which must be cleaned beforehand. Once a large quantity of high-quality text is obtained, it is split into pieces called tokens, drawn from an inventory of tens of thousands of distinct units. Each token represents a specific string of characters to which a meaning and a numerical identifier can be assigned. Sometimes these are entire words like "eat," but sometimes they are only fragments of words. During training, the model is given a certain number of tokens as context, and the response to predict is the next token. The total number of distinct tokens defines the size of the "vocabulary."
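To make this setup concrete, here is a minimal Python sketch (the token IDs and the context size are invented values for illustration, not taken from any real model) showing how a tokenized text is turned into context/next-token training pairs:

    # Minimal sketch: build (context, next-token) training pairs from token IDs.
    # The token IDs and the context size are arbitrary example values.

    token_ids = [4321, 9876, 1234, 5678, 42, 7, 9876, 1234]  # a tokenized text
    context_size = 4  # how many tokens the model sees before predicting the next one

    pairs = []
    for i in range(len(token_ids) - context_size):
        context = token_ids[i : i + context_size]   # tokens given to the model
        target = token_ids[i + context_size]        # the token it must predict
        pairs.append((context, target))

    for context, target in pairs:
        print(f"context={context} -> next token={target}")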

Various algorithms can be used to determine which character sequences become tokens. The most classic one, byte pair encoding (BPE), consists of the following steps:

  1. Initialize the vocabulary with an initial set of tokens corresponding to all individual characters.
  2. Find the most frequent pair of adjacent tokens in the training corpus and merge it into a new, longer token.
  3. Add this merged token to the vocabulary and apply the merge throughout the text.
  4. Repeat steps 2 and 3 until the desired vocabulary size is reached.

This allows capturing both frequent words and less frequent subwords or word segments.
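To make these steps concrete, here is a minimal, unoptimized Python sketch of the BPE training loop. The toy corpus and the number of merges are arbitrary choices for the example; real tokenizers operate on bytes and are considerably more efficient.

    from collections import Counter

    def get_pair_counts(words):
        """Count occurrences of adjacent token pairs across the corpus."""
        counts = Counter()
        for tokens, freq in words.items():
            for pair in zip(tokens, tokens[1:]):
                counts[pair] += freq
        return counts

    def merge_pair(words, pair):
        """Replace every occurrence of `pair` by a single merged token."""
        merged = {}
        a, b = pair
        for tokens, freq in words.items():
            new_tokens, i = [], 0
            while i < len(tokens):
                if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                    new_tokens.append(a + b)
                    i += 2
                else:
                    new_tokens.append(tokens[i])
                    i += 1
            merged[tuple(new_tokens)] = merged.get(tuple(new_tokens), 0) + freq
        return merged

    # Toy corpus: word -> frequency, each word split into characters (step 1).
    corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    words = {tuple(w): f for w, f in corpus.items()}
    vocab = {c for w in corpus for c in w}

    for _ in range(10):  # number of merges controls the final vocabulary size (step 4)
        pair_counts = get_pair_counts(words)
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair (step 2)
        words = merge_pair(words, best)               # apply merge everywhere (step 3)
        vocab.add("".join(best))                      # add merged token to vocab (step 3)
        print(f"merged {best} -> {''.join(best)}")

    print(sorted(vocab))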

Tokens are used to transform a sequence of words into a list of numbers.

For example, the sentence:

the sun is shining

is transformed into the list of tokens:

[the, sun, is, shining]

which in turn is transformed into a list of integers (token IDs):

[4321, 9876, 1234, 5678]

This is a lossless encoding because from the list of identifiers, one can retrieve the tokens and thus the original sentence.
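As a minimal sketch of this round trip, here is a toy vocabulary in Python (the token-to-ID mapping is invented for the example, not taken from a real tokenizer), showing that encoding to IDs and decoding back loses no information:

    # Toy vocabulary: token -> ID (the IDs are arbitrary example values).
    vocab = {"the": 4321, "sun": 9876, "is": 1234, "shining": 5678}
    inverse_vocab = {token_id: token for token, token_id in vocab.items()}

    def encode(tokens):
        """Map each token to its numerical identifier."""
        return [vocab[token] for token in tokens]

    def decode(token_ids):
        """Map each identifier back to its token: the encoding is lossless."""
        return [inverse_vocab[token_id] for token_id in token_ids]

    tokens = ["the", "sun", "is", "shining"]
    ids = encode(tokens)
    print(ids)          # [4321, 9876, 1234, 5678]
    print(decode(ids))  # ['the', 'sun', 'is', 'shining']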

Each frequently used word will be encoded by a single token, while rarer words are split into two or more tokens.

Tokens also support more systematic learning of word forms, for example by separating a word's root from its prefix or ending:

  • resolution = re + solution
  • sustainable = sustain + able
  • proactive = pro + active
  • dreamer = dream + er

We now consider ℝᴺ, a vector space of dimension N, equipped with its canonical basis, denoted eᵢ for i = 1, …, N. Thus, e₂, for example, is the vector that has a 1 in the second position and zeros everywhere else.

We now map the token identifiers to the canonical basis vectors of ℝᴺ. For example, the token "the" is associated with the identifier 4321, and then with the vector e₄₃₂₁.
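Here is a minimal NumPy sketch of this one-hot mapping. Note that the code uses 0-based indexing, whereas the text numbers basis vectors from 1, and the vocabulary size is an arbitrary example value:

    import numpy as np

    N = 10_000  # vocabulary size (example value)

    def one_hot(token_id, size=N):
        """Return the basis vector e_{token_id}: 1 at the token's index, 0 elsewhere."""
        e = np.zeros(size)
        e[token_id] = 1.0
        return e

    e_4321 = one_hot(4321)          # basis vector associated with the token "the"
    print(e_4321.shape)             # (10000,)
    print(e_4321[4321], e_4321[0])  # 1.0 0.0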

The approximate number of tokens per word depends on the language and on the complexity of the word. In general, for English, a word corresponds on average to 1.3 tokens. Here are some points to better understand:

Simple Words

Short and common words, like the, cat, or is, often count as a single token.

Complex Words

Longer or compound words, like incredible or transformation, can be broken down into multiple tokens. For example, incredible could be decomposed into in, cred, ible (3 tokens).

Languages with Long Words

Some languages, like German, where compound words are common, may see a greater decomposition into tokens for a single word.

General Estimates

  • Short and simple words: Approximately 1 token per word.
  • Medium words: Approximately 1-2 tokens per word.
  • Long and compound words: Approximately 2-3 tokens per word.

On average, for a quick estimate:

One word = approximately 1.3 tokens.
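For a quick back-of-the-envelope estimate based on this rule of thumb, one could do something like the following; the 1.3 ratio is only the average quoted above and will vary with the text and the tokenizer:

    TOKENS_PER_WORD = 1.3  # rough average for English text (see estimates above)

    def estimate_tokens(text, ratio=TOKENS_PER_WORD):
        """Estimate the token count of a text from its word count."""
        word_count = len(text.split())
        return round(word_count * ratio)

    print(estimate_tokens("the sun is shining"))  # ~5 tokens for 4 words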

BPE (Byte Pair Encoding)

There are several procedures for choosing the tokens of a corpus: small iterative programs that extract the words or subwords that recur most often. One of them is Byte Pair Encoding (BPE), presented below.

The algorithm starts by counting the occurrences of each character and continues by merging the most frequent adjacent pairs. It is a greedy process: at each step it applies the locally best merge, which yields a compact (though not necessarily optimal) set of words and subwords covering the dataset with relatively few tokens.

The next step is to create the vocabulary of our model: a dictionary containing the most frequent tokens extracted from the dataset by BPE (or another technique of your choice). A dictionary is a data structure that stores a key-value pair for each entry. In our case, each entry's key is an integer index starting at 0, and the corresponding value is a token.

Since neural networks only accept numerical inputs, we use the vocabulary itself as a lookup table between tokens and their corresponding IDs. When we do this, we must save the vocabulary for later use so that the model's output can always be decoded from IDs back to words. This is known as a pre-trained vocabulary, an essential component accompanying published pre-trained models. Without the vocabulary, understanding the model's output (the IDs) would be impossible. For GPT-2 and GPT-3, the vocabulary contains roughly 50,000 tokens; more recent models such as GPT-4o use even larger vocabularies.
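Here is a minimal sketch of such a vocabulary and of saving it for later decoding; the tokens, IDs, and file name are invented for the example:

    import json

    # Index -> token mapping produced by BPE (toy values for the example).
    vocab = {0: "the", 1: "sun", 2: "is", 3: "shin", 4: "ing"}
    token_to_id = {token: idx for idx, token in vocab.items()}

    # Save the vocabulary alongside the model so its outputs (IDs) can always be decoded.
    with open("vocab.json", "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False)

    model_output_ids = [0, 1, 2, 3, 4]           # what the model produces
    decoded = [vocab[i] for i in model_output_ids]
    print(decoded)  # ['the', 'sun', 'is', 'shin', 'ing']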

Embedding (Vector Representation)

The next step is to associate each canonical basis vector with a new vector, variously called a token vector, embedding vector, vector representation, state vector, or hidden state vector.

We denote by φ the embedding function, which maps a space of dimension N to a space of dimension n, with N > n. For example, for GPT-3, N = 50,257 and n = 12,288. We call n the embedding dimension.

The vectors vᵢ = φ(eᵢ) ∈ ℝⁿ are thus obtained by embedding for each token.

The goal of embedding is to group words by semantic category. For example, words describing colors will be grouped together, and so on. Mathematically, this corresponds to a linear map:

φ : ℝᴺ ⟶ ℝⁿ

eᵢ ⟼ vᵢ

Vector space theory tells us that any linear map from ℝᴺ to ℝⁿ can be represented by a matrix with n rows and N columns. Thus, φ can be represented by a matrix A whose i-th column is exactly the vector vᵢ. This matrix is called the embedding matrix. In terms of vectors, we simply have:

vᵢ = A eᵢ.
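A small NumPy sketch of this relation, with toy dimensions (N = 6, n = 3): multiplying A by the basis vector eᵢ simply selects the i-th column of A, which is why, in practice, an embedding is implemented as a table lookup rather than an actual matrix product.

    import numpy as np

    N, n = 6, 3                      # toy vocabulary size and embedding dimension
    rng = np.random.default_rng(0)
    A = rng.normal(size=(n, N))      # embedding matrix: n rows, N columns

    i = 4                            # token ID (0-based here)
    e_i = np.zeros(N)
    e_i[i] = 1.0                     # canonical basis vector e_i

    v_i = A @ e_i                    # v_i = A e_i
    print(np.allclose(v_i, A[:, i])) # True: the product just selects column i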

Once all the token vectors have been computed, we introduce a notion of similarity between two vectors, based on the cosine of the angle between them.

The cosine similarity between two vectors a and b (sometimes loosely called the cosine distance) is given by:

s_C(a, b) = (a ⋅ b) / (||a|| ⋅ ||b||)

where a ⋅ b is the dot product of the two vectors and ||a|| is the norm of a.

Cosine similarity measures the angle between the two vectors, which acts as a semantic distance between the corresponding words:

  • When the angle between the vectors is small, that is, close to zero degrees, the cosine similarity is close to 1, indicating that the vectors point in the same direction and the words are closely related.
  • When the angle between the vectors is large, that is, close to 90°, the cosine is close to 0, indicating that the words represented by the vectors are essentially unrelated.
  • When the angle between the vectors is very large, that is, close to 180°, the cosine is close to -1, indicating that the vectors point in opposite directions, meaning that the words have opposite meanings.
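Here is a minimal NumPy implementation of this similarity measure; the example vectors are arbitrary:

    import numpy as np

    def cosine_similarity(a, b):
        """Cosine of the angle between a and b: dot product divided by the product of norms."""
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    a = np.array([1.0, 0.0, 0.0])
    b = np.array([1.0, 1.0, 0.0])
    c = np.array([-1.0, 0.0, 0.0])

    print(round(cosine_similarity(a, b), 2))  # 0.71: small angle, related directions
    print(round(cosine_similarity(a, c), 2))  # -1.0: opposite directions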

Vectors also have a norm, that is, a length measured from the origin of the coordinate system. In this interpretation, the larger the norm, the "stronger" the vector, and the higher the confidence attached to the similarity computation. For example, imagine a space ℝ⁴ whose dimensions are: oracle, adventurer, young, mythology.

In this coordinate space (oracle, adventurer, young, mythology), consider three words (wise, hero, deity) represented by vectors V₁, V₂, and V₃.

If, in a text, we encounter the words "prophecy," "divination," and "prediction" (which belong to the semantic field of the word oracle), as well as the words "pantheon" (evoking Olympus) and "epic" (evoking mythology), we will associate with this text the vector W = [3, 0, 1, 1] by counting the occurrences of each category.

Thus, we have the following vectors Vᵢ:

V₁ = [1, 0, 0, 0]

V₂ = [0, 1, 0, 1]

V₃ = [1, 0, 1, 1]

And so:

cos(θ₁) = (W ⋅ V₁) / (||W|| × ||V₁||) = (3×1 + 0×0 + 1×0 + 1×0) / (√(3² + 0² + 1² + 1²) × √(1² + 0² + 0² + 0²)) ≈ 0.90

cos(θ₂) = (W ⋅ V₂) / (||W|| × ||V₂||) = (3×0 + 0×1 + 1×0 + 1×1) / (√(3² + 0² + 1² + 1²) × √(0² + 1² + 0² + 1²)) ≈ 0.21

cos(θ₃) = (W ⋅ V₃) / (||W|| × ||V₃||) = (3×1 + 0×0 + 1×1 + 1×1) / (√(3² + 0² + 1² + 1²) × √(1² + 0² + 1² + 1²)) ≈ 0.87

The vector closest to W is V₁, with a cosine similarity of 0.90, but vector V₃ follows closely behind. Additionally, V₃ has a larger norm than V₁ (√3 versus 1), which under the interpretation above means a higher confidence, so depending on the configuration policy it might be preferred over V₁. This example illustrates the subtle trade-offs that sometimes need to be made when setting up this type of model.
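The numbers above can be verified in a few lines of NumPy, reusing the cosine similarity function from the earlier sketch:

    import numpy as np

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Dimensions: (oracle, adventurer, young, mythology)
    W = np.array([3, 0, 1, 1])
    V1 = np.array([1, 0, 0, 0])   # wise
    V2 = np.array([0, 1, 0, 1])   # hero
    V3 = np.array([1, 0, 1, 1])   # deity

    for name, V in [("V1", V1), ("V2", V2), ("V3", V3)]:
        print(name, round(cosine_similarity(W, V), 2), "norm =", round(np.linalg.norm(V), 2))
    # V1 0.9  norm = 1.0
    # V2 0.21 norm = 1.41
    # V3 0.87 norm = 1.73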

Finally, we can build the embedding of a sentence by concatenating the embedding vectors of its tokens.

Written by
Jérémy Martin
Research Director