1. Understanding LLMs & Text Generation
How LLMs Generate Text
LLMs don’t “think” like humans. They predict the most probable next word (token) based on previous words.
Step 1: Convert Text to Tokens
Example (Word-based tokenization):
Sentence: "The cat sat on the mat."
Tokens: ["The", "cat", "sat", "on", "the", "mat", "."]Example (Sub-word tokenization, used in LLaMA models):
Sentence: "Artificial intelligence"
Tokens: ["Art", "ificial", "intelli", "gence"]Why sub-word tokenization?
- Handles new words by breaking them into smaller known parts.
- Reduces vocabulary size, improving efficiency.
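To make this concrete, here is a toy greedy longest-match sub-word tokenizer. It is a simplified stand-in for real learned tokenizers like BPE; the vocabulary below is made up purely for illustration:
# Toy sub-word tokenizer: greedy longest-match against a tiny made-up vocabulary.
# Real tokenizers (e.g. BPE in LLaMA models) learn their vocabulary from data.
VOCAB = {"Art", "ificial", "intelli", "gence"}
def subword_tokenize(word, vocab):
    tokens, start = [], 0
    while start < len(word):
        # Try the longest possible piece first, fall back to a single character
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab or end - start == 1:
                tokens.append(piece)
                start = end
                break
    return tokens
print(subword_tokenize("Artificial", VOCAB))  # ['Art', 'ificial']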
Step 2: Assign Probability to Next Token
Example: Predicting the next token for the phrase: "The capital of France is"
Token Probabilities:
"Paris" → 85%
"London" → 5%
"Berlin" → 3%
"Rome" → 2%📌 The model chooses “Paris” because it has the highest probability.
Step 3: Decoding Strategies (Choosing the Next Word)
Once we have probabilities, we need a decoding strategy to pick the best next word.
Decoding Strategies
With Simple Examples & Ollama Implementation
1. Greedy Search (Always Pick the Highest Probability Token)
How it works: Greedy search always selects the single token with the highest probability at each step.
Example:
Input: "The cat sat on the"
Greedy Output: "mat mat mat mat..."
Pros:
- Computationally efficient
- Simple and fast to implement
Cons:
- Often produces repetitive or degenerate text (as in the example above)
- Can get stuck in local maxima: the locally best token may lead to a globally worse sequence
Use cases: Greedy search is suitable for applications where computational efficiency is crucial, such as real-time language translation or text summarization.
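To see the mechanics, here is a toy greedy decoding loop over a hypothetical next-token distribution (the probabilities are invented), showing how repetition arises. The Ollama example after it approximates greedy decoding by setting the temperature to 0:
# Hypothetical model: returns fixed next-token probabilities, for illustration only
def fake_next_token_probs(context):
    return {"mat": 0.6, "sofa": 0.3, "floor": 0.1}
context = ["The", "cat", "sat", "on", "the"]
for _ in range(4):
    probs = fake_next_token_probs(context)
    # Greedy: always take the argmax, so the same token keeps winning
    context.append(max(probs, key=probs.get))
print(" ".join(context))  # The cat sat on the mat mat mat mat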
import ollama
# temperature 0 makes decoding effectively greedy (Ollama's default temperature is 0.8)
response = ollama.chat(model="llama3:8b", messages=[{"role": "user", "content": "Complete the sentence: The cat sat on the"}], options={"temperature": 0})
print(response['message']['content']) # Greedy-style decoding
2. Beam Search (Looks at Multiple Possibilities)
How it works: Beam search maintains a set of top-scoring candidate sequences (beams). At each step, it expands each beam by adding possible next tokens and selects the top-scoring beams to continue.
Example:
Input: "The cat sat on the"
Beam Search Output: "mat because it was comfortable."
Pros:
- More accurate than greedy search
- Considers whole candidate sequences rather than one token at a time, which can reduce repetition
Cons:
- Computationally expensive
- Requires careful tuning of beam size and other hyperparameters
Use cases: Beam search is suitable for applications where high accuracy is crucial, such as machine translation, speech recognition, or text generation.
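Here is a toy beam search over the same kind of hypothetical next-token distribution (the vocabulary and probabilities are made up; real implementations score candidates with the model and use log-probabilities, as sketched here):
import math
# Hypothetical next-token distribution, for illustration only
def fake_next_token_probs(sequence):
    return {"mat": 0.5, "sofa": 0.3, "floor": 0.2}
def beam_search(start, num_beams=3, steps=3):
    # Each beam is a (token list, cumulative log-probability) pair
    beams = [(start, 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, p in fake_next_token_probs(seq).items():
                candidates.append((seq + [token], score + math.log(p)))
        # Keep only the top-scoring beams for the next step
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams
for seq, score in beam_search(["The", "cat", "sat", "on", "the"]):
    print(" ".join(seq), round(score, 2))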
Note: Ollama's generation options do not expose beam search. It is available in frameworks such as Hugging Face transformers, e.g. model.generate(input_ids, num_beams=3, early_stopping=True).
3. Top-k Sampling (Adds Randomness)
How it works: Instead of always picking the single most probable token, Top-K sampling keeps the K tokens with the highest probabilities and samples the next token randomly from that subset.
Example (Top-3 Sampling):
Input: "The cat sat on the"
Possible Outputs: "sofa / couch / floor" (randomly picked from top 3)
Pros:
- Introduces randomness, making generated text more diverse and creative
- Can handle rare or unseen tokens
Cons:
- May sample low-quality tokens when the probability distribution is flat
- Requires careful tuning of K; a poorly chosen value can produce incoherent outputs
Use cases: Top-K sampling is suitable for applications where diversity and creativity are important, such as language generation, chatbots, or creative writing.
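A minimal top-k sampling sketch over a hypothetical distribution (the tokens and probabilities are invented for illustration); the Ollama call after it uses the real top_k option:
import random
next_token_probs = {"mat": 0.4, "sofa": 0.25, "couch": 0.2, "floor": 0.1, "roof": 0.05}
def top_k_sample(probs, k=3):
    # Keep only the k most probable tokens, then sample proportionally to their weights
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens, weights = zip(*top)
    return random.choices(tokens, weights=weights)[0]
print(top_k_sample(next_token_probs, k=3))  # e.g. "sofa" (random pick among top 3)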
response = ollama.chat(model="llama3:8b", messages=[{"role": "user", "content": "Complete the sentence: The cat sat on the"}], options={"top_k": 50})
print(response['message']['content']) # Top-k Sampling
4. Top-p (Nucleus) Sampling (More Dynamic)
How it works: Instead of a fixed K, Top-p (nucleus) sampling keeps the smallest set of top tokens whose cumulative probability reaches p and samples the next token from that subset.
Example (Top-p = 0.9):
Input: "The cat sat on the"
Top-p Output: "mat because it was sunny outside."
Pros:
- Adapts the candidate pool to the shape of the distribution, unlike a fixed K
- Encourages diversity and creativity while keeping text natural
Cons:
- May not produce optimal results
- Requires careful tuning of p
Use cases: Top-p sampling is suitable for applications where adaptive diversity is important, such as language generation, chatbots, or creative writing.
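And a minimal top-p (nucleus) sketch over the same made-up distribution, showing how the candidate pool grows until the cumulative probability reaches p:
import random
next_token_probs = {"mat": 0.4, "sofa": 0.25, "couch": 0.2, "floor": 0.1, "roof": 0.05}
def top_p_sample(probs, p=0.9):
    # Sort by probability and keep tokens until their cumulative mass reaches p
    items = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    pool, cumulative = [], 0.0
    for token, prob in items:
        pool.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    tokens, weights = zip(*pool)
    return random.choices(tokens, weights=weights)[0]
print(top_p_sample(next_token_probs, p=0.9))  # sampled from {mat, sofa, couch, floor}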
response = ollama.chat(model="llama3:8b", messages=[{"role": "user", "content": "Complete the sentence: The cat sat on the"}], options={"top_p": 0.9})
print(response['message']['content']) # Top-p Sampling
Real-World Example: AI Chat Assistant
Using Ollama LLaMA 8B
Step 1: Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3:8b
Step 2: Implement Chatbot
import ollama
def chatbot(prompt):
    # Send a single-turn prompt and return the model's reply
    response = ollama.chat(model="llama3:8b", messages=[{"role": "user", "content": prompt}])
    return response['message']['content']
# Example conversation
user_input = "Tell me a fun fact about space!"
print(chatbot(user_input))
Understanding Softmax: How Probabilities are Computed
We have discussed this in detail in previous topics, but to summarize: Softmax is a mathematical function that converts raw scores (logits) into probabilities.
Simple Example:
Let’s say the model predicts these scores:
Raw Scores: [3.2, 2.1, 5.8, 1.4]
Softmax converts them into probabilities:
Probabilities: [≈7%, ≈2%, ≈90%, ≈1%]
Softmax Formula
\[ P(y_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \]
Where:
- \( x_i \) is the raw score (logit) for token \( i \)
- \( e^{x_i} \) makes every value positive and amplifies differences between scores
- The denominator normalizes the values so the probabilities sum to 1
Python Code
import numpy as np
def softmax(logits):
    exp_values = np.exp(logits - np.max(logits))  # Subtract the max for numerical stability
    return exp_values / np.sum(exp_values)
logits = [3.2, 2.1, 5.8, 1.4]
probabilities = softmax(logits)
print(probabilities) # ≈ [0.067, 0.022, 0.900, 0.011]
Best Decoding Strategy for Different Use Cases
| Use Case | Best Decoding Strategy |
|---|---|
| Factual Answers | Greedy or Beam Search |
| Creative Writing | Top-k or Top-p Sampling |
| Chatbots | Top-p Sampling (for variety) |
| Summarization | Beam Search |
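To apply this table with Ollama, here is a small helper sketch. The option values are illustrative, not tuned recommendations, and since Ollama does not expose beam search, the factual/summarization cases fall back to near-greedy decoding via temperature 0:
import ollama
# Illustrative option presets per use case; tune these for your application
PRESETS = {
    "factual": {"temperature": 0},              # near-greedy, deterministic
    "creative": {"top_k": 50, "top_p": 0.95},   # diverse sampling
    "chatbot": {"top_p": 0.9},                  # varied but coherent
}
def generate(prompt, use_case="chatbot"):
    response = ollama.chat(model="llama3:8b",
                           messages=[{"role": "user", "content": prompt}],
                           options=PRESETS[use_case])
    return response['message']['content']
print(generate("Summarize photosynthesis in one sentence.", use_case="factual"))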