1. Understanding LLMs & Text Generation

How LLMs Generate Text

LLMs don’t “think” like humans. They generate text one token at a time, predicting the most probable next token given the tokens that came before.

Step 1: Convert Text to Tokens

Example (Word-based tokenization):

Sentence:  "The cat sat on the mat."
Tokens:    ["The", "cat", "sat", "on", "the", "mat", "."]

Example (Sub-word tokenization, used in LLaMA models):

Sentence:  "Artificial intelligence"
Tokens:    ["Art", "ificial", "intelli", "gence"]

Why sub-word tokenization?

  • Handles new words by breaking them into smaller known parts.
  • Reduces vocabulary size, improving efficiency.
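As a quick hands-on check, here is a minimal sketch using the tiktoken library. Note that tiktoken implements OpenAI’s BPE tokenizers rather than LLaMA’s, so the exact splits differ, but the sub-word principle is the same.

import tiktoken  # OpenAI's BPE tokenizers (pip install tiktoken); not LLaMA's, but same idea

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Artificial intelligence")
print(ids)                             # a short list of integer token IDs
print([enc.decode([i]) for i in ids])  # sub-word pieces, e.g. ['Art', 'ificial', ' intelligence']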

Step 2: Assign Probability to Next Token

Example: Predicting the next token for the phrase: "The capital of France is"

Token Probabilities:
"Paris" → 85%
"London" → 5%
"Berlin" → 3%
"Rome" → 2%

📌 Under greedy decoding, the model chooses “Paris” because it has the highest probability.
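In code, this step is just a lookup over the distribution. A minimal sketch using the illustrative probabilities above (the remaining 5% would be spread across every other token in the vocabulary):

probs = {"Paris": 0.85, "London": 0.05, "Berlin": 0.03, "Rome": 0.02}
best = max(probs, key=probs.get)  # pick the token with the highest probability
print(best)  # Paris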

Step 3: Decoding Strategies (Choosing the Next Word)

Once we have probabilities, we need a decoding strategy to decide which token to emit next.


Decoding Strategies

With Simple Examples & Ollama Implementation

1. Greedy Search (Always Pick the Highest Probability Token)

How it works: Greedy search selects the token with the highest probability at each step. It never looks ahead and never backtracks.

Example:

Input: "The cat sat on the"
Greedy Output: "mat mat mat mat..."

Pros:

  • Computationally efficient
  • Simple and fast to implement

Cons:

  • May miss higher-probability sequences that require a locally worse first pick
  • Can get stuck in repetitive loops (local maxima)

Use cases: Greedy search is suitable for applications where computational efficiency is crucial, such as real-time language translation or text summarization.

import ollama

# Ollama samples with temperature 0.8 by default; setting temperature to 0
# makes decoding effectively greedy (always take the most probable token).
response = ollama.chat(
    model="llama3:8b",
    messages=[{"role": "user", "content": "Complete the sentence: The cat sat on the"}],
    options={"temperature": 0},
)
print(response['message']['content'])  # Greedy decoding
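To see how the repetition above arises, here is a toy sketch of the greedy loop itself, with an invented next-token table in which “mat” keeps predicting “mat”:

def next_probs(last_token):
    # Invented, hand-made table: once "mat" appears, "mat" stays the argmax.
    if last_token == "mat":
        return {"mat": 0.5, ".": 0.3, "and": 0.2}
    return {"mat": 0.6, "sofa": 0.4}

tokens = ["the"]
for _ in range(4):
    probs = next_probs(tokens[-1])
    tokens.append(max(probs, key=probs.get))  # argmax at every step
print(" ".join(tokens))  # the mat mat mat mat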

2. Beam Search (Looks at Multiple Possibilities)

How it works: Beam search maintains a set of top-scoring candidate sequences (beams). At each step, it expands each beam by adding possible next tokens and selects the top-scoring beams to continue.

Example:

Input: "The cat sat on the"
Beam Search Output: "mat because it was comfortable."

Pros:

  • Often finds higher-probability overall sequences than greedy search
  • Explores several hypotheses in parallel instead of committing early to one

Cons:

  • Computationally expensive
  • Requires careful tuning of beam size and other hyperparameters

Use cases: Beam search is suitable for applications where high accuracy is crucial, such as machine translation, speech recognition, or text generation.

Note: Ollama’s API is sampling-based and does not expose beam search; num_beams is a text-generation parameter from libraries such as Hugging Face transformers, not an Ollama option. The toy implementation below shows the mechanics instead.
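Here is a toy beam search over an invented next-token table. Watch how the beam keeps “sofa” alive even though “mat” looks best at step one, and ends up with a higher-probability sequence than greedy would find:

import math

def next_probs(tokens):
    # Invented table: "mat" is the best first pick, but "sofa" leads to a
    # better overall sequence ("sofa because", p = 0.4 * 0.9 = 0.36).
    if not tokens:
        return {"mat": 0.5, "sofa": 0.4, "floor": 0.1}
    if tokens[-1] == "sofa":
        return {"because": 0.9, ".": 0.1}
    return {".": 0.6, "again": 0.4}

def beam_search(num_beams=2, steps=2):
    beams = [([], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = [
            (seq + [tok], score + math.log(p))
            for seq, score in beams
            for tok, p in next_probs(seq).items()
        ]
        # Keep only the num_beams highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams

for seq, score in beam_search():
    print(" ".join(seq), f"p={math.exp(score):.2f}")
# sofa because p=0.36  <- beam search's winner
# mat . p=0.30         <- the sequence greedy would have produced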

3. Top-k Sampling (Adds Randomness)

How it works: Instead of always picking the single most probable token, top-k sampling keeps only the K most probable tokens and samples the next token randomly from that subset.

Example (Top-3 Sampling):

Input: "The cat sat on the"
Possible Outputs: "sofa / couch / floor" (randomly picked from top 3)

Pros:

  • Introduces controlled randomness, making text more diverse and creative
  • Cuts off the long tail of very unlikely tokens, avoiding nonsense picks

Cons:

  • Output is non-deterministic, so results vary between runs
  • Requires careful tuning of K: too small approaches greedy, too large can admit odd tokens

Use cases: Top-K sampling is suitable for applications where diversity and creativity are important, such as language generation, chatbots, or creative writing.

response = ollama.chat(
    model="llama3:8b",
    messages=[{"role": "user", "content": "Complete the sentence: The cat sat on the"}],
    options={"top_k": 50},
)
print(response['message']['content'])  # Top-k sampling
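Under the hood, top-k is a truncate-then-sample step. A minimal sketch with an invented distribution:

import random

# Invented distribution, for illustration only.
probs = {"mat": 0.45, "sofa": 0.25, "floor": 0.15, "moon": 0.10, "idea": 0.05}

def top_k_sample(probs, k=3):
    # Keep only the k most probable tokens, then sample among them.
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    tokens, weights = zip(*top)
    return random.choices(tokens, weights=weights)[0]

print([top_k_sample(probs) for _ in range(5)])  # e.g. ['mat', 'sofa', 'mat', 'floor', 'mat']
# With k=3, "moon" and "idea" can never be picked.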

4. Top-p (Nucleus) Sampling (More Dynamic)

How it works: Instead of a fixed K, top-p (nucleus) sampling keeps the smallest set of top tokens whose cumulative probability reaches p (the “nucleus”) and samples the next token from that set.

Example (Top-p = 0.9):

Input: "The cat sat on the"
Top-p Output: "mat because it was sunny outside."

Pros:

  • Adapts the candidate set to the shape of the distribution, unlike a fixed K
  • Encourages diversity while keeping text natural

Cons:

  • Output is non-deterministic, so results vary between runs
  • Requires careful tuning of p

Use cases: Top-p sampling is suitable for applications where adaptivity and diversity are important, such as language generation, chatbots, or creative writing.

response = ollama.chat(
    model="llama3:8b",
    messages=[{"role": "user", "content": "Complete the sentence: The cat sat on the"}],
    options={"top_p": 0.9},
)
print(response['message']['content'])  # Top-p (nucleus) sampling
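The nucleus step itself is easy to sketch: sort by probability, accumulate until the mass reaches p, and sample from that nucleus (same invented distribution as in the top-k sketch):

import random

probs = {"mat": 0.45, "sofa": 0.25, "floor": 0.15, "moon": 0.10, "idea": 0.05}

def top_p_sample(probs, p=0.9):
    cumulative, nucleus = 0.0, []
    # Add tokens in descending probability until the cumulative mass reaches p.
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    tokens, weights = zip(*nucleus)
    return random.choices(tokens, weights=weights)[0]

print([top_p_sample(probs) for _ in range(5)])
# Here the nucleus holds 4 tokens (0.45 + 0.25 + 0.15 + 0.10 = 0.95 >= 0.9);
# with a sharper distribution it would shrink automatically.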

Real-World Example: AI Chat Assistant

Using Ollama with LLaMA 3 8B

Step 1: Install Ollama

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3:8b

Step 2: Implement Chatbot

import ollama

def chatbot(prompt):
    response = ollama.chat(model="llama3:8b", messages=[{"role": "user", "content": prompt}])
    return response['message']['content']

# Example conversation
user_input = "Tell me a fun fact about space!"
print(chatbot(user_input))
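The function above is stateless: every call starts a fresh conversation. A small extension (a sketch, using the same ollama client) keeps the message history so the model remembers earlier turns:

import ollama

history = []

def chatbot_with_memory(prompt):
    history.append({"role": "user", "content": prompt})
    response = ollama.chat(model="llama3:8b", messages=history)
    reply = response['message']['content']
    history.append({"role": "assistant", "content": reply})  # remember the reply
    return reply

print(chatbot_with_memory("Tell me a fun fact about space!"))
print(chatbot_with_memory("Explain that fact in one sentence."))  # refers back to turn one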

Understanding Softmax: How Probabilities are Computed

We have discussed this in detail in previous topics, but to summarize: softmax is a mathematical function that converts raw scores (logits) into probabilities.

Simple Example:

Let’s say the model predicts these scores:

Raw Scores: [3.2, 2.1, 5.8, 1.4]

Softmax converts them into probabilities:

Probabilities: [6.7%, 2.2%, 90%, 1.1%]

Softmax Formula

\[ P(y_i) = \frac{e^{x_i}}{\sum_j e^{x_j}} \]

Where:

  • \( x_i \) is the raw score for token \( i \)
  • \( e^{x_i} \) makes every score positive, and dividing by the sum normalizes the values so they add up to 1

Python Code

import numpy as np

def softmax(logits):
    exp_values = np.exp(logits - np.max(logits))  # Subtract max for numerical stability (cancels in the ratio)
    return exp_values / np.sum(exp_values)

logits = [3.2, 2.1, 5.8, 1.4]
probabilities = softmax(logits)
print(probabilities)  # Output: ≈ [0.067 0.022 0.900 0.011]

Best Decoding Strategy for Different Use Cases

Use Case            Best Decoding Strategy
Factual Answers     Greedy or Beam Search
Creative Writing    Top-k or Top-p Sampling
Chatbots            Top-p Sampling (for variety)
Summarization       Beam Search