2.2: Controlling GenAI Model Output

Temperature

  • Purpose: Controls the randomness of the predictions. It’s a hyperparameter used to scale the logits (the raw scores the model produces before they are converted into probabilities) prior to sampling.
  • How it works: The model computes a score (logit) for each token in the vocabulary, and dividing these scores by the temperature before the softmax reshapes the resulting probabilities. A small numeric sketch follows the code example below.
    • Low temperature (<1.0): Makes the model more deterministic by sharpening the distribution, widening the gap between high-probability and low-probability tokens. This makes the model more likely to choose the most probable token.
    • High temperature (>1.0): Makes the model more random by flattening the distribution. This results in more diverse, creative, and sometimes less coherent text.

Example

  • Temperature = 0.7: The model will likely choose the more predictable or likely tokens.
  • Temperature = 1.5: The model will take more risks, leading to more unexpected, diverse outputs.
# Example of lower temperature (more deterministic); do_sample=True enables sampling
outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True, temperature=0.7)

# Example of higher temperature (more creative/random)
outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True, temperature=1.5)
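
To see concretely what temperature does to the distribution, here is a minimal standalone sketch in plain Python (not tied to any particular library; the logits below are made up purely for illustration):

# A minimal sketch of temperature scaling applied to made-up logits
import math

def softmax_with_temperature(logits, temperature):
    # Divide the raw logits by the temperature, then apply the softmax
    scaled = [logit / temperature for logit in logits]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical raw scores for three candidate tokens

print(softmax_with_temperature(logits, 1.0))  # baseline distribution
print(softmax_with_temperature(logits, 0.7))  # sharper: the top token dominates more
print(softmax_with_temperature(logits, 1.5))  # flatter: probabilities move closer together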

Top-k Sampling

  • Purpose: Limits the number of tokens to sample from, making the generation process more efficient and sometimes more coherent.
  • How it works: Instead of considering all possible tokens (the entire vocabulary), top-k sampling restricts the set of possible next tokens to the top-k most likely tokens based on their probability scores.
    • k = 1: This would make the model behave deterministically, always picking the single most probable token (equivalent to greedy decoding).
    • k = 50: The model will sample from the top 50 tokens with the highest probabilities.

Example

  • Top-k = 10: The model will only consider the 10 tokens with the highest probabilities when selecting the next word.
  • Top-k = 100: The model will consider the top 100 tokens, giving it more variety.
# Example with top-k sampling (restricted to the 50 most likely tokens)
outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True, top_k=50)
  • Effect of Top-k: By limiting the token options to the top-k, the model’s output tends to be more controlled and less random than pure sampling from all tokens. A rough sketch of this filtering step is shown below.
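
To make the filtering step concrete, here is a rough sketch of the top-k idea (the token probabilities are invented for illustration; this is not the transformers implementation):

# A rough sketch of top-k filtering over a toy distribution
def top_k_filter(probs, k):
    # Keep only the k highest-probability tokens, then renormalize so they sum to 1
    kept = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

probs = {'the': 0.4, 'a': 0.3, 'cat': 0.2, 'zebra': 0.07, 'xylophone': 0.03}
print(top_k_filter(probs, 2))  # only 'the' and 'a' remain, renormalized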

Top-p (Nucleus Sampling)

  • Purpose: Similar to top-k, but instead of limiting to a fixed number of tokens, top-p limits the tokens considered based on their cumulative probability.
  • How it works: The model samples only from the smallest set of tokens whose cumulative probability meets or exceeds a threshold p (where p is between 0 and 1). This dynamic method is often referred to as nucleus sampling.
    • p = 0.9: The model will consider the smallest set of tokens whose cumulative probability is at least 90%. This results in considering a variable number of tokens based on how steep the probability distribution is.
    • p = 1.0: Applies no truncation at all; the model may sample from the entire vocabulary (the same as top-k with k equal to the vocabulary size).

Example

  • Top-p = 0.9: The model considers the smallest set of tokens whose combined probability is at least 90%. This prevents very unlikely tokens from being considered while still allowing more diversity.
  • Top-p = 0.95: The model will sample from a slightly larger set of tokens.
# Example with top-p (nucleus) sampling
outputs = model.generate(inputs['input_ids'], max_length=50, do_sample=True, top_p=0.9)
  • Effect of Top-p: Nucleus sampling tends to generate more coherent and diverse text than top-k sampling, as the model chooses from a set whose size adjusts dynamically to the shape of the probability distribution. A rough sketch of this filtering step is shown below.
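
And a matching sketch of the nucleus (top-p) idea, again using invented probabilities rather than the actual library code (a small tolerance is used so that floating-point rounding doesn’t push a sum like 0.4 + 0.3 + 0.2 just below 0.9):

# A rough sketch of nucleus (top-p) filtering over a toy distribution
def top_p_filter(probs, p):
    # Walk through tokens from most to least likely, keeping them until the
    # cumulative probability reaches the threshold p, then renormalize
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p - 1e-9:  # tolerance for floating-point rounding
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

probs = {'the': 0.4, 'a': 0.3, 'cat': 0.2, 'zebra': 0.07, 'xylophone': 0.03}
print(top_p_filter(probs, 0.9))  # keeps 'the', 'a', 'cat' (0.4 + 0.3 + 0.2 = 0.9)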

Temperature, Top-k, and Top-p Combined

You can combine these parameters to fine-tune the model’s output. For example:

outputs = model.generate(
    inputs['input_ids'],
    max_length=50,
    do_sample=True,
    temperature=0.8,
    top_k=50,
    top_p=0.9
)

This will give you:

  • A lower temperature (0.8), making the generation more predictable.
  • Top-k sampling with the top 50 tokens.
  • Top-p sampling restricted to the smallest set of tokens whose cumulative probability is at least 90% (when top-k and top-p are combined, both filters are applied).

By tuning these parameters, you can experiment with how controlled or creative the generated text is.
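
The snippets above assume that model, tokenizer, and inputs already exist. A self-contained version might look like the following sketch, which uses GPT-2 purely as an example model; note that in the Hugging Face transformers API the sampling parameters only take effect when do_sample=True:

# A self-contained sketch; GPT-2 is used only as an example model
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2')

inputs = tokenizer('Once upon a time', return_tensors='pt')

outputs = model.generate(
    inputs['input_ids'],
    max_length=50,
    do_sample=True,        # required for temperature / top_k / top_p to have any effect
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS to silence a warning
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))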


Summary of Differences

  • Temperature: Adjusts the randomness of the sampling. Higher temperature means more diverse output; lower means more predictable.
  • Top-k Sampling: Limits the number of candidate tokens to the top-k most likely tokens.
  • Top-p (Nucleus) Sampling: Limits the candidate tokens to those whose cumulative probability is at least p (a probability threshold), providing more flexible diversity control.

Details

Confused? Let’s break down top-k and top-p with simpler examples.


Top-k Sampling (Simplified)

Imagine the model is choosing the next word from a list of 5 possible words, each with a probability:

Word          Probability
“apple”       0.5
“banana”      0.3
“cherry”      0.1
“date”        0.05
“elderberry”  0.05

Top-k = 2:

With top-k=2, the model will only consider the top 2 most probable words. So it will only consider “apple” and “banana”. The model ignores the words “cherry”, “date”, and “elderberry” because they are less likely.

If the model needs to choose the next word, it will only sample from these 2 words: “apple” and “banana”. This makes the sampling process more controlled and focused.

Top-k = 3:

If top-k=3, it will consider “apple”, “banana”, and “cherry”. This is a little more diverse but still limited to the top 3.
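
A couple of lines of Python make the same point with this toy distribution (a quick illustration, not library code):

# Selecting the top-k words from the toy distribution above
words = {'apple': 0.5, 'banana': 0.3, 'cherry': 0.1, 'date': 0.05, 'elderberry': 0.05}

print(sorted(words, key=words.get, reverse=True)[:2])  # ['apple', 'banana']
print(sorted(words, key=words.get, reverse=True)[:3])  # ['apple', 'banana', 'cherry']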


Top-p (Nucleus Sampling) (Simplified)

Now, let’s look at top-p (nucleus sampling), which works a bit differently.

Let’s use the same words and probabilities:

Word          Probability
“apple”       0.5
“banana”      0.3
“cherry”      0.1
“date”        0.05
“elderberry”  0.05

Top-p = 0.8:

With top-p=0.8, the model will add up the probabilities from the most likely words until the total probability is greater than or equal to 0.8.

  • “apple” = 0.5
  • “banana” = 0.3
  • Total = 0.8

At this point, the model has already reached 0.8 probability. So it will stop and consider only “apple” and “banana”.

This is different from top-k because it doesn’t limit to a fixed number of tokens. It dynamically chooses the most likely words until the total probability reaches the given threshold (in this case, 0.8).

Top-p = 0.9:

If we set top-p=0.9, the model will keep adding tokens until the cumulative probability reaches at least 0.9.

  • “apple” = 0.5
  • “banana” = 0.3
  • “cherry” = 0.1
  • Total = 0.9

Now, the model will consider “apple”, “banana”, and “cherry”.
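
The same walkthrough, expressed as a short sketch that accumulates probabilities until the threshold is met (the small tolerance guards against floating-point rounding, since 0.5 + 0.3 can land just below 0.8):

# Accumulating probabilities until the top-p threshold is reached
words = [('apple', 0.5), ('banana', 0.3), ('cherry', 0.1), ('date', 0.05), ('elderberry', 0.05)]

def nucleus_pool(words, p):
    # words are already sorted from most to least likely
    pool, cumulative = [], 0.0
    for word, prob in words:
        pool.append(word)
        cumulative += prob
        if cumulative >= p - 1e-9:  # tolerance for floating-point rounding
            break
    return pool

print(nucleus_pool(words, 0.8))  # ['apple', 'banana']             (0.5 + 0.3 = 0.8)
print(nucleus_pool(words, 0.9))  # ['apple', 'banana', 'cherry']   (0.5 + 0.3 + 0.1 = 0.9)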


Key Difference between Top-k and Top-p

  • Top-k restricts you to a fixed number of the most likely tokens.
    • Example: top-k=2 would only allow the model to choose from the top 2 words.
  • Top-p (Nucleus sampling) restricts you to the smallest set of tokens whose cumulative probability is greater than or equal to p.
    • Example: top-p=0.8 means the model will sample from the tokens that, together, have at least 80% probability.

Summary

  • Top-k: Always limits to a fixed number of tokens (e.g., top 3, top 5).
  • Top-p: Dynamically limits to the smallest set of tokens whose cumulative probability is at least p (e.g., 80% or 90%).

FAQ

1. Let’s work through the scenario where every word has a probability of 0.7 and you’re using top-p sampling with a threshold of 0.8.

Scenario: Let’s assume the following token probabilities (note that a real softmax distribution always sums to 1, so identical values of 0.7 are a hypothetical used purely to illustrate the mechanics):

Word          Probability
“apple”       0.7
“banana”      0.7
“cherry”      0.7
“date”        0.7
“elderberry”  0.7

Top-p = 0.8: In top-p sampling, the model keeps adding tokens to the pool until their cumulative probability meets or exceeds the top-p threshold (0.8).

Step-by-step breakdown:

  • “apple” = 0.7
  • “banana” = 0.7 (cumulative probability = 0.7 + 0.7 = 1.4)

At this point, the cumulative probability is 1.4, which exceeds the 0.8 threshold. So the sampling pool will be limited to these two words: “apple” and “banana”.

Since the total probability already exceeds 0.8 after the first two words, the model will include both “apple” and “banana” in the selection pool.

Key Points:

  • Top-p sampling doesn’t strictly limit the number of tokens — it selects tokens whose cumulative probability is at least the threshold (0.8 in this case).
  • If all tokens have the same probability (0.7), then the model will keep adding tokens until the cumulative probability reaches the top-p threshold.
  • In this case, the model will sample from the first two words (“apple” and “banana”), as their cumulative probability (1.4) exceeds the threshold of 0.8.

Final Conclusion: If every word has the same probability of 0.7, and you’re using top-p = 0.8, the model will include all words up to the point where the cumulative probability exceeds 0.8. In this case, it will stop at the second word, and you’ll end up with a pool of two words to choose from.
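
A quick sketch of the same walkthrough (keeping in mind that identical values of 0.7 are only a hypothetical, since a real softmax distribution sums to 1):

# Re-running the top-p walkthrough with equal scores of 0.7 (illustrative only)
scores = [('apple', 0.7), ('banana', 0.7), ('cherry', 0.7), ('date', 0.7), ('elderberry', 0.7)]

pool, cumulative = [], 0.0
for word, score in scores:
    pool.append(word)
    cumulative += score
    if cumulative >= 0.8:  # top-p threshold
        break

print(pool)        # ['apple', 'banana']
print(cumulative)  # 1.4: already past the 0.8 threshold after two words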


2. Let’s now look at how top-k sampling works in this case where every word has a probability of 0.7.

Scenario: We have the same token probabilities:

Word          Probability
“apple”       0.7
“banana”      0.7
“cherry”      0.7
“date”        0.7
“elderberry”  0.7

Top-k = 2: In top-k sampling, the model selects the top-k most probable tokens. The number k is fixed, meaning the model will consider exactly the top k tokens based on their probabilities.

How it works:

  • Regardless of the probabilities, the model will pick the top 2 most probable tokens.
  • In this case, since all the words have the same probability of 0.7, the model will choose the first 2 tokens (how ties are broken depends on the implementation; a stable sort simply keeps their original order in the list).

What Happens Here:

  • Since top-k=2, the model will keep exactly 2 tokens; because every token has the same probability (0.7), the ones kept end up being the first 2 in the list.
  • The model doesn’t care about the cumulative probability here; it only cares about the number of tokens, which is fixed at 2 in this case.

Key Points:

  • Top-k simply selects the top k most probable words — it doesn’t dynamically sum probabilities like top-p.
  • In the case where all words have the same probability, top-k just picks the first k words in the list.
  • Top-k is not influenced by the cumulative probability — it just selects a fixed number of top tokens.
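
A short sketch of this tie-breaking behaviour: Python’s sorted() is stable, so equal scores keep their original order and the “first” k tokens are selected (real implementations may break ties differently):

# Top-k with tied scores: the stable sort keeps the original order, so the first 2 survive
scores = [('apple', 0.7), ('banana', 0.7), ('cherry', 0.7), ('date', 0.7), ('elderberry', 0.7)]

top_2 = sorted(scores, key=lambda item: item[1], reverse=True)[:2]
print([word for word, _ in top_2])  # ['apple', 'banana']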