2.3.1: Deep Dive Hands-On

Deep Dive: Text Generation with GPT-2 and the Tokenizer

We’ll start by loading a pretrained model (GPT-2) and running a simple text generation task. We’ll use Hugging Face’s transformers library for this.

Step 1: Install the required libraries

You’ll need Python installed on your machine along with the following packages:

  • transformers (from Hugging Face)
  • torch (PyTorch backend)

To install these, run:

pip install transformers torch
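To confirm the installation, you can import both packages and print their versions (a quick sanity check; the version numbers you see will depend on your environment):

import torch
import transformers

# If both imports succeed, the packages are installed correctly
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)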

Step 2: Load a Pretrained GPT-2 Model and Tokenizer
Here’s a simple example to load the GPT-2 model and tokenizer, then generate text based on a prompt.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = 'gpt2'  # You can change to 'gpt2-medium', 'gpt2-large', etc., for a larger model
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

# Set model to evaluation mode
model.eval()

# Encode the prompt
prompt = "In the near future, artificial intelligence will"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate text
outputs = model.generate(inputs['input_ids'], max_length=50, num_return_sequences=1, no_repeat_ngram_size=2)

# Decode the generated text
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Step 3: Run the Code and Observe the Output
When you run the script, the GPT-2 model will generate a continuation of the prompt:

In the near future, artificial intelligence will

Let’s break down what happens when the script runs:

  1. Tokenizer Encodes: The tokenizer will first convert this text into token IDs.
  2. Model Generates: The model uses these token IDs to generate the continuation of the sentence.
  3. Tokenizer Decodes: The output token IDs are converted back into a string of text.

For example, the output might look like:

"In the near future, artificial intelligence will be able to predict our every move, revolutionize industries, and improve the quality of life. With advancements in machine learning algorithms and deep learning techniques, AI will be a central part of our daily lives."

Step 4: Experiment with Different Prompts
You can modify the prompt to see how the model responds to different inputs. For example, try:

  • “Once upon a time, in a land far away,”
  • “The economy of the future will be driven by”
  • “The secret to a successful business is”

This will give you a sense of how the GPT-2 model can generate creative and contextually relevant text. Feel free to experiment with different parameters like max_length, num_return_sequences, or other settings to customize the output.


Let’s break down tokenization and the components involved, as well as explain the different parameters used in the code.

Tokenizer in the Code

The tokenizer is responsible for converting human-readable text into tokens that the model can understand. In Hugging Face’s transformers library, the tokenizer is used to:

  • Convert text into token IDs that the model can process.
  • Convert token IDs back into human-readable text (decoding).

Recall how the model and tokenizer were loaded in Step 2:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained model and tokenizer
model_name = 'gpt2'  # You can change to 'gpt2-medium', 'gpt2-large', etc., for a larger model
model = GPT2LMHeadModel.from_pretrained(model_name)
tokenizer = GPT2Tokenizer.from_pretrained(model_name)

GPT2Tokenizer.from_pretrained(model_name): This loads the tokenizer associated with the GPT-2 model. It is specific to this model and knows how to convert text into tokens and vice versa.
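To make this concrete, here is a minimal round-trip sketch: the tokenizer splits a string into subword tokens, maps them to integer IDs, and then decodes those IDs back into text. (The exact tokens and IDs you see depend on GPT-2’s vocabulary.)

text = "Tokenization converts text into tokens."

# Split the text into GPT-2's subword tokens
# (tokens starting with 'Ġ' mark a leading space in GPT-2's byte-level BPE)
tokens = tokenizer.tokenize(text)
print(tokens)

# Map each token to its integer ID in the vocabulary
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(token_ids)

# Decode the IDs back into a human-readable string
print(tokenizer.decode(token_ids))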


Key Parameters Used in the Code

a. Encoding the Input Text

inputs = tokenizer(prompt, return_tensors="pt")
  • prompt: This is the initial text input that you want the model to complete or generate further text from.
  • tokenizer(prompt): This will convert the prompt text into token IDs that GPT-2 can understand.
  • return_tensors="pt": This specifies that the output should be in the form of PyTorch tensors. This is required because PyTorch is used for processing the data inside the model. (If you’re using TensorFlow, you’d use return_tensors="tf").

The result inputs is a dictionary of PyTorch tensors along these lines (the IDs below are illustrative; the exact values depend on GPT-2’s vocabulary, so print inputs to see the real ones):

{
    'input_ids': tensor([[ 818,  262, 1474, 2003,   11, 11666, 4430,  481]]),
    'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]])
}
  • input_ids: The integer token IDs that represent the prompt, one per token (a token is often a word or a piece of a word).
  • attention_mask: This tells the model which positions contain real tokens (1) and which are padding to be ignored (0).
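With a single prompt the attention_mask is all 1s, so its purpose is easier to see with a batch of prompts of different lengths. The sketch below pads the shorter prompt (GPT-2 has no padding token by default, so it reuses the end-of-sequence token), and the padded positions show up as 0s in the mask:

# GPT-2's tokenizer has no pad token, so reuse the EOS token for padding
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["Hello world", "In the near future, artificial intelligence will"],
    padding=True,              # pad the shorter prompt to the length of the longer one
    return_tensors="pt",
)
print(batch['input_ids'])       # padded positions hold the pad (EOS) token ID
print(batch['attention_mask'])  # 1 = real token, 0 = padding the model should ignore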

b. Generating Text

outputs = model.generate(inputs['input_ids'], max_length=50, num_return_sequences=1, no_repeat_ngram_size=2)
  • inputs['input_ids']: The token IDs for the prompt are passed as input to the model.
  • max_length=50: This limits the total number of tokens (words/subwords) in the generated text, counting both the prompt and the continuation. In this case, the output is capped at 50 tokens.
  • num_return_sequences=1: This defines how many different sequences of text you want the model to generate. In this case, the model will generate 1 sequence.
  • no_repeat_ngram_size=2: This parameter prevents any sequence of 2 consecutive tokens (a 2-gram) from appearing more than once in the generated text. For example, if the phrase “artificial intelligence” has already appeared, the model will not produce that exact pair of tokens again, which cuts down on repetitive output.
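In practice it also helps to pass the attention_mask along with the input IDs and to set pad_token_id explicitly (GPT-2 has no pad token, so the end-of-sequence token is commonly reused); otherwise generate() may print a warning. A hedged variant of the same call:

outputs = model.generate(
    inputs['input_ids'],
    attention_mask=inputs['attention_mask'],  # tell the model which positions are real tokens
    pad_token_id=tokenizer.eos_token_id,      # GPT-2 has no pad token, so reuse EOS
    max_length=50,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
)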

c. Decoding the Output

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
  • outputs[0]: The first (and here, only) generated sequence: a tensor of token IDs containing the prompt followed by the model’s continuation, one predicted token per generation step.
  • tokenizer.decode(...): This converts the token IDs back into human-readable text.
  • skip_special_tokens=True: This removes special tokens like the end-of-sequence token (typically used in transformers models to indicate the end of the generated text).

In other words, outputs[0] is simply a sequence of token IDs, and the decoder turns that sequence back into human-readable text.
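Note that outputs[0] contains the prompt tokens followed by the newly generated ones. If you only want the continuation, a short sketch like this slices off the prompt before decoding:

prompt_length = inputs['input_ids'].shape[1]   # number of tokens in the prompt

# Decode only the tokens that come after the prompt
continuation = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(continuation)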

Conclusion

  • Tokenization is the process of converting text into tokens (IDs) that a model can understand and process.
  • The Tokenizer is a crucial component in transforming text for a model and back into text after processing.
  • Parameters like max_length, num_return_sequences, and no_repeat_ngram_size control the length, the number of sequences, and the amount of repetition in the generated output.

More hands-on examples

Let’s dive into more hands-on examples to reinforce the concepts of tokenization and model generation.

1. Experimenting with Different Prompt Types

Let’s start by experimenting with different types of prompts to see how GPT-2 responds.

Example 1: Story Prompt

prompt = "Once upon a time, in a land far away,"

Example 2: Business Scenario

prompt = "The future of artificial intelligence in business is"

Example 3: Philosophical Question

prompt = "What is the meaning of life?"

For each of these, run the same code and observe the text generated by GPT-2.
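One way to compare the prompts side by side is to loop over them with the same generation settings, as in this minimal sketch:

prompts = [
    "Once upon a time, in a land far away,",
    "The future of artificial intelligence in business is",
    "What is the meaning of life?",
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        pad_token_id=tokenizer.eos_token_id,
        max_length=50,
        no_repeat_ngram_size=2,
    )
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    print("-" * 40)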

2. Play with Generation Parameters

You can adjust parameters to experiment with how the model generates text:

  1. Change max_length:

    • If you increase max_length to 100 or 200, the model will generate a longer continuation of the prompt.

    • Note that we have also added min_length to ensure the output (prompt plus continuation) is at least 100 tokens long.

      outputs = model.generate(inputs['input_ids'], min_length=100, max_length=200, num_return_sequences=1, no_repeat_ngram_size=2)
  2. Experiment with num_return_sequences:

    • If you set num_return_sequences to 3, you are asking the model for 3 different continuations of the same prompt.

      outputs = model.generate(inputs['input_ids'], min_length=100, max_length=200, num_return_sequences=3, no_repeat_ngram_size=2)

    • With the default greedy decoding, however, this call raises an error, because greedy decoding can only produce a single sequence:

      ValueError: Greedy methods without beam search do not support num_return_sequences different than 1 (got 3). Replace num_return_sequences with num_beams

    • To get several distinct continuations, either enable sampling with do_sample=True or use beam search by setting num_beams to at least 3 (see the sketch after this list).

  3. Experiment with temperature and top_k:

    • temperature controls randomness. A lower temperature (e.g., 0.7) generates more focused, deterministic text, while a higher temperature (e.g., 1.5) generates more creative, diverse output.

    • top_k restricts the sampling to the top-k most likely next tokens, which controls diversity.

    • Both parameters only take effect when sampling is enabled, so include do_sample=True in the call:

      outputs = model.generate(inputs['input_ids'], min_length=100, max_length=200, num_return_sequences=1, do_sample=True, temperature=0.9, top_k=50)
      • Lower temperature: More predictable output.
      • Higher temperature: More random, creative output.
  4. Experiment with top_p (nucleus sampling):

    • top_p restricts sampling to the smallest set of tokens whose cumulative probability is at least p (e.g., top_p=0.9 means it will sample from the smallest set of tokens that together cover 90% of the probability mass). Like temperature and top_k, it requires do_sample=True.
    outputs = model.generate(inputs['input_ids'], min_length=100, max_length=200, num_return_sequences=1, do_sample=True, top_p=0.9, temperature=0.8)
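The sketch below ties the last three items together: it turns on sampling with do_sample=True (required for temperature, top_k, and top_p to take effect, and for returning several sampled sequences), then prints each continuation. The specific values are just starting points to tweak:

outputs = model.generate(
    inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,            # enable sampling so temperature/top_k/top_p are used
    temperature=0.9,           # <1.0 = more focused, >1.0 = more random
    top_k=50,                  # sample only from the 50 most likely next tokens
    top_p=0.9,                 # nucleus sampling over 90% of the probability mass
    min_length=100,
    max_length=200,
    num_return_sequences=3,    # sampling allows several sequences per prompt
    no_repeat_ngram_size=2,
)

for i, output in enumerate(outputs, start=1):
    print(f"--- Sequence {i} ---")
    print(tokenizer.decode(output, skip_special_tokens=True))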

3. Try Decoding the Output

You’ll see different outputs for each of the above changes. Use the tokenizer’s decode method to see how the generated tokens look.

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

This allows you to experiment and see how changing each parameter affects the model’s output.


Summary of Parameters

Here’s a quick recap of the parameters we explored:

  • max_length: Controls the total length of the generated text in tokens, prompt included.
  • num_return_sequences: Controls how many different outputs you want to generate.
  • no_repeat_ngram_size: Prevents repetitive sequences (n-grams) from appearing in the generated text.
  • do_sample: Enables sampling, which is required for temperature, top_k, and top_p to have an effect.
  • temperature: Controls the randomness of the text generation. Higher = more randomness.
  • top_k: Limits sampling to the top-k tokens by probability. Controls diversity.
  • top_p: Nucleus sampling; limits the set of tokens to a cumulative probability p.
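As a wrap-up, here is a small helper that bundles the whole pipeline; the function name generate_text and its default values are illustrative choices for this sketch, not part of the transformers API:

def generate_text(prompt, max_length=50, num_return_sequences=1, **generate_kwargs):
    """Tokenize a prompt, generate continuations with GPT-2, and decode them."""
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        pad_token_id=tokenizer.eos_token_id,
        max_length=max_length,
        num_return_sequences=num_return_sequences,
        **generate_kwargs,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Example: three sampled continuations of the original prompt
for text in generate_text("In the near future, artificial intelligence will",
                          max_length=80, num_return_sequences=3,
                          do_sample=True, temperature=0.9, top_p=0.9):
    print(text)
    print("-" * 40)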