2.1.1: Tokenization
Tokenization is the process of converting text into smaller units, typically words or subwords, that can be processed by machine learning models. In natural language processing (NLP), tokens are the basic building blocks for understanding and generating language. By mapping text to these units (and, ultimately, to numerical IDs), tokenization turns raw text into a format that can be fed into a neural network.
- Word-level tokenization splits text into words. Example: "I love AI" → ["I", "love", "AI"].
- Subword-level tokenization (used in models like GPT) splits words into smaller parts, or subwords. This is more efficient and handles unknown words better, since the model can understand smaller pieces of a word. Example: "delightful" → ["delight", "ful"].
- Character-level tokenization splits text into individual characters. Example: "AI" → ["A", "I"].
Hands-On: Tokenization
Tokenization Example
from transformers import AutoTokenizer
# Load the tokenizer for GPT-2
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# Tokenize a sentence
text = "Hello, how are you?"
tokens = tokenizer.tokenize(text)
token_ids = tokenizer.encode(text)
print("Tokens:", tokens)
print("Token IDs:", token_ids)Output:
Tokens: ['Hello', ',', 'Ġhow', 'Ġare', 'Ġyou', '?']
Token IDs: [15496, 11, 703, 389, 345, 30]

Explanation:
- tokenizer.tokenize(text): Splits the text into tokens. The Ġ prefix in GPT-2's output marks a token that is preceded by a space.
- tokenizer.encode(text): Converts the text into the corresponding numerical token IDs.
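As a quick check that the tokens and the IDs are two views of the same thing, the short follow-up sketch below reuses the tokenizer from the example above: convert_tokens_to_ids maps the string tokens to their IDs, and decode turns the IDs back into the original text.

# Round-trip: tokens -> IDs, and IDs -> text
ids_from_tokens = tokenizer.convert_tokens_to_ids(tokens)
decoded_text = tokenizer.decode(token_ids)

print("IDs from tokens:", ids_from_tokens)  # matches the token IDs above
print("Decoded text:", decoded_text)        # "Hello, how are you?"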
Tokenization: Additional Resources
- Blog: Tokenization in NLP
- Video: Word Embeddings Explained