Text Tokenization

focusing on GPT tokenizers

In earlier chapters we limited the discussion to tokenizers that produce either a list of words or a list of characters. It is important, though, to understand the connection between tokenization and modeling for various NLP tasks.

Model fine-tuning

Over the last few years we have seen the rise of pretrained language models such as BERT, GPT-2, and GPT-3. These models are trained on large amounts of text data and are then used to perform downstream tasks such as text classification, text generation, and so on. When you fine-tune a language model, you must feed the pretrained model text processed with the same tokenizer it was trained with.
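
As a minimal sketch of this requirement (the model name and the toy input below are illustrative assumptions), the Hugging Face transformers library pairs each pretrained checkpoint with its own tokenizer:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the checkpoint together with the tokenizer it was pretrained with
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# During fine-tuning, every batch goes through the original tokenizer
# (assumes PyTorch is installed for return_tensors="pt")
inputs = tokenizer("This movie was great!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])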

Model training from scratch

If a language model is not available in the language you are interested in, or if your corpus is very different from the one your language model was trained on, you will most likely want to retrain the model from scratch using a tokenizer adapted to your data. That will require training a new tokenizer on your dataset. Follow this guide to learn how to do that.
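
As a rough sketch of what that looks like with the Hugging Face tokenizers library (the file name corpus.txt and the vocabulary size are illustrative assumptions):

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a new BPE tokenizer on your own corpus
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(vocab_size=30000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt: your text data

print(tokenizer.encode("she sells seashells by the seashore").tokens)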

Tokenizer types

Let's review the main tokenizer types and highlight this connection for each one.

Word level tokenization

Word-level tokenizers are intuitive; however, they suffer from the problem of unknown words, tagged as Out-Of-Vocabulary (OOV) tokens. They also tend to produce large vocabularies, assigning different tokens to “can’t” and “cannot”, and they have issues with abbreviations, e.g. “U.S.A.”.

from spacy.lang.en import English
nlp = English()
# Create a Tokenizer with the default settings for English
# including punctuation rules and exceptions
tokenizer = nlp.tokenizer

text = "I have a new GPU!"

doc = tokenizer(text)

for token in doc:
    print(token.text)
I
have
a
new
GPU
!

Character level tokenization

Character-level tokenizers are more flexible, as they bypass the OOV issue; however, to capture the context of each word we need much longer sequences, which hurts performance.
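
At its simplest, character-level tokenization is just splitting the string into its characters; the snippet below is a minimal illustration:

text = "I have a new GPU!"
tokens = list(text)   # every character, including spaces, becomes a token
print(tokens)
print(len(tokens))    # 17 character tokens vs. 6 word-level tokens above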

Byte Pair Encoding (BPE)

In BPE, one token can correspond to a character, an entire word or more, or anything in between; on average a token corresponds to about 0.7 words. The idea behind BPE is to tokenize frequently occurring words at the word level and rarer words at the subword level. GPT-3 uses a variant of BPE.
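
To make the idea concrete, here is a minimal sketch of the BPE training loop on a toy corpus (an illustration only; the corpus, helper names, and number of merges are assumptions, and GPT-2's byte-level BPE differs in the details): start from characters and repeatedly merge the most frequent adjacent pair of symbols.

from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in corpus.items():
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        key = tuple(new_symbols)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: each word split into characters, mapped to its frequency
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

for step in range(5):
    best_pair = get_pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, best_pair)
    print(f"merge {step + 1}: {best_pair}")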

Let's see a tokenizer in action. We will use the Hugging Face transformers library and the GPT-2 tokenizer. Note that the tokenizer is also referred to as the encoder, as it is used to encode text into tokens.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokens = tokenizer.tokenize("she sells seashells by the seashore")
print(tokens)
print(tokenizer.convert_tokens_to_string(tokens))
print(tokenizer.convert_tokens_to_ids(tokens))
['she', 'Ġsells', 'Ġseas', 'hell', 's', 'Ġby', 'Ġthe', 'Ġse', 'ash', 'ore']
she sells seashells by the seashore
[7091, 16015, 21547, 12758, 82, 416, 262, 384, 1077, 382]

If you are perplexed by the Ġ, note that this character is what the encoder produces for the space character. The encoder does not work well with raw spaces, so it replaces spaces and other whitespace characters with other Unicode characters. More specifically, it takes the control and whitespace characters in code points 0-255 and shifts them up by 256 to make them printable, so space (code point 32) becomes Ġ (code point 288).
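
A quick way to confirm the shift (a minimal illustration, not the encoder's actual lookup table):

# Space (code point 32) is shifted up by 256 to become a printable character
print(ord(" "))       # 32
print(chr(32 + 256))  # 'Ġ' (code point 288)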

Below is another subword tokenization scheme, WordPiece, which is used by the BERT tokenizer.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.tokenize("I have a new GPU!")
['i', 'have', 'a', 'new', 'gp', '##u', '!']

The “##” prefix means that the rest of the token should be attached to the previous one without a space when decoding (reversing the tokenization).
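
As a quick check (a minimal example; the lower-casing comes from the uncased checkpoint):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("I have a new GPU!")
# convert_tokens_to_string re-attaches the "##" pieces to the preceding token
print(tokenizer.convert_tokens_to_string(tokens))  # i have a new gpu !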

The tiktokenizer webapp runs various tokenizers in your browser. It is a great tool for understanding the impact of tokenizers on language model complexity. Copy and paste the following code into the text box.

# A simple Python program to calculate the area of a rectangle

# Function to calculate the area of a rectangle
def calculate_rectangle_area(length, width):
  """
  Calculates the area of a rectangle.

  Args:
    length: The length of the rectangle.
    width: The width of the rectangle.

  Returns:
    The area of the rectangle.
  """
  area = length * width
  return area

# Get input from the user
length = float(input("Enter the length of the rectangle: "))
width = float(input("Enter the width of the rectangle: "))

# Calculate the area
area = calculate_rectangle_area(length, width)

# Print the result
print("The area of the rectangle is:", area)

We are going to test two tokenizers: cl100k_base, which has a vocabulary of about 100,000 tokens, and gpt2, which has a vocabulary of approximately 50,000 tokens. The first is used by the GPT-3.5 and GPT-4 models, while the second is used by the GPT-2 model.

Select gpt2 from the list of tokenizers and observe the number of tokens produced. For the code above, this would be 186 tokens. The same text results in 149 tokens with the cl100k_base tokenizer. Why is that, and what is the impact on model complexity?

  1. gpt2 needs more tokens to express the same words due to its less expressive vocabulary. This is exactly what happens when you don't know a word, e.g. “intricate”: you use a longer phrase, e.g. “very complicated”, to express the same thing.

  2. gpt2 will, however, result in a smaller model, since a smaller vocabulary means smaller embedding and output layers in the transformer architecture. On the other hand, the longer token sequences demand more context memory: the same Python code occupies more of the context window under the gpt2 tokenizer, the model is likely to respond more slowly since longer sequences increase attention and memory requirements, and it is more likely to hallucinate when its context window is exceeded.
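
You can also reproduce these token counts locally with the tiktoken library (an illustrative sketch; the short stand-in string below replaces the full rectangle-area program, so the exact counts will differ from the webapp):

import tiktoken

# Stand-in for the rectangle-area program pasted into the webapp above
code = '''
def calculate_rectangle_area(length, width):
    """Calculates the area of a rectangle."""
    return length * width
'''

for name in ("gpt2", "cl100k_base"):
    enc = tiktoken.get_encoding(name)
    print(f"{name}: {len(enc.encode(code))} tokens")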
