AI / LLMs · Teaching Reference

LLM Fundamentals

Tokens, context windows, temperature, prompt engineering, hallucinations, and when to use RAG vs fine-tuning — the core concepts behind how large language models work.

Concept 1 · Tokens & Context Window

The unit of language
📝
Raw Text
"Hello, world!"
✂️
Tokenizer
BPE / WordPiece splits
🔢
Token IDs
[15496, 11, 995, 0]
🪟
Context Window
8K – 1M+ token limit
🕸️
Self-Attention
Each token attends to all earlier tokens
🎲
Next Token
Predicted one at a time
What is a token?
Not a word — a subword unit. "Unbelievable" might be 4 tokens; "cat" is 1. A rule of thumb: ~4 characters ≈ 1 token, or ~¾ of a word. Every character you type — spaces, punctuation, newlines — costs tokens. Models never see letters, only integer IDs from a fixed vocabulary (32K–128K entries).
Context window
The hard limit on how many tokens the model can "see" at once — prompt + history + response combined. GPT-4: 128K. Claude: up to 200K. Gemini: 1M+. Exceeding this limit means older content gets dropped. Longer contexts also cost more to compute: attention is O(n²) in sequence length.
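To build intuition, run a tokenizer yourself. Here is a minimal sketch using the open-source tiktoken library with its GPT-2 vocabulary, which yields exactly the IDs in the diagram above; the 8K window and the 512-token reply reserve are illustrative numbers, not any particular model's limits.

import tiktoken  # pip install tiktoken; any BPE tokenizer behaves similarly

enc = tiktoken.get_encoding("gpt2")            # GPT-2's BPE vocabulary
ids = enc.encode("Hello, world!")
print(ids)                                     # [15496, 11, 995, 0]
print([enc.decode([i]) for i in ids])          # ['Hello', ',', ' world', '!']

# Budget check: prompt + reserved response space must fit the window.
window = 8192                                  # model-dependent limit
prompt = "Summarize the following text in three bullet points: ..."
reserved_for_reply = 512
print(len(enc.encode(prompt)) + reserved_for_reply <= window)   # True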
Autoregressive generation
LLMs generate one token at a time, left to right. Each new token is appended to the context and the model runs again. By the ~¾-word rule, a 500-word response is roughly 670 tokens, and therefore roughly 670 forward passes. This is why generation feels fast at first and can slow down as the response grows.
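Here is the loop in miniature: a schematic Python sketch, not any real model's API. The toy "model" just scores (last token + 1) mod 10 highest, so greedy decoding counts upward; a real transformer returns logits over its whole vocabulary.

def generate(model, tokens, max_new_tokens=20, eos_id=None):
    # One forward pass per new token: score the vocab, pick, append, repeat.
    for _ in range(max_new_tokens):
        logits = model(tokens)                                      # scores over the vocab
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        if next_id == eos_id:                                       # stop at end-of-sequence
            break
        tokens.append(next_id)                                      # context grows each step
    return tokens

# Toy "model": favours (last token + 1) mod 10, so the output counts upward.
toy = lambda toks: [1.0 if i == (toks[-1] + 1) % 10 else 0.0 for i in range(10)]
print(generate(toy, [0]))   # [0, 1, 2, ..., 9, 0, 1, ...]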

Concept 2 · Temperature & Sampling

Controlling randomness
📥
Input Context
All tokens so far
⚙️
Forward Pass
Transformer layers compute
📊
Raw Logits
Score for each vocab token
🌡️
Temperature
Low = precise · High = creative
📐
Softmax
Logits → probabilities
🎯
Sample
top-p / top-k filter
🔤
New Token
Appended → loop repeats
Temperature — the creativity dial
After ranking every possible next word by likelihood, the model has to pick one. Temperature controls how adventurous that pick is. Low (0–0.4): almost always picks the most predictable word — precise and consistent, great for code or factual Q&A. Medium (0.7): occasionally picks the 2nd or 3rd most likely word — feels more natural, less robotic. High (1.2+): freely picks surprising words — more creative, but more prone to errors. Most apps run between 0.2 and 0.8.
Top-p — cutting off the long tail
Even at a sensible temperature, the model still assigns tiny probabilities to thousands of bizarre words. Top-p (p = 0.9) says: only consider words until their combined probability reaches 90%, then ignore everything else. When the model is confident, only a handful of words qualify. When it's uncertain, more options stay in. Either way, the truly absurd options get cut.
Top-k — a simpler cap
Only consider the top K most likely next words and ignore the rest (k = 40 is common). Easy to reason about: at k = 1 the model always picks the single most likely word; at k = 50 it has variety to work with. Most systems apply temperature first, then top-p and top-k together as a safety net to keep outputs sensible.
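Putting the three dials together: a minimal NumPy sketch of one sampling step. The function name and default values are illustrative; production inference engines implement the same filters on the GPU, but the logic is this.

import numpy as np

def sample_next_token(logits, temperature=0.7, top_k=40, top_p=0.9):
    # Temperature: divide logits before softmax. Low T sharpens the
    # distribution toward the argmax; high T flattens it.
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)

    # Softmax: logits -> probabilities (shift by max for numerical stability).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-k: zero out everything below the k-th largest probability.
    if top_k < len(probs):
        probs[probs < np.sort(probs)[-top_k]] = 0.0

    # Top-p: keep the smallest set of tokens whose cumulative mass >= p.
    order = np.argsort(probs)[::-1]
    keep = np.cumsum(probs[order]).searchsorted(top_p) + 1
    probs[order[keep:]] = 0.0

    probs /= probs.sum()                     # renormalise, then sample
    return int(np.random.choice(len(probs), p=probs))

print(sample_next_token([2.0, 1.0, 0.5, -1.0]))   # usually 0, sometimes 1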

Prompt Engineering

The craft of designing inputs that reliably elicit the output you want — without changing the model weights. Four techniques every practitioner needs.

⚙️

System Prompt

A privileged instruction block prepended before the user turn. Sets the model's persona, rules, output format, and constraints. Every production LLM deployment uses one. Users typically never see it, but the model treats it as a trusted, high-priority directive.

Example
You are a precise SQL assistant.
Always validate column names.
Never guess; ask for clarification.
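In practice the system prompt is simply the first entry in a role-tagged message list. A sketch under assumptions: the field names below follow the widely used OpenAI-style chat format; other APIs differ in detail but share the shape.

messages = [
    {"role": "system", "content": (
        "You are a precise SQL assistant. "
        "Always validate column names. "
        "Never guess; ask for clarification."
    )},
    {"role": "user", "content": "Total revenue per region for 2024?"},
]
# The model receives the system turn first and treats it as a
# higher-priority directive than anything in the user turn.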
🎯

Zero-Shot

Just ask — no examples provided. Works well for tasks the model has seen extensively in training (translation, summarization, basic Q&A). The model draws entirely on pattern-matched knowledge. Fast and low-effort, but output format and tone can be unpredictable.

Example
Summarize the following text
in three bullet points:
[text here]
📋

Few-Shot

Provide 2–5 input→output examples before the real query. The model infers the pattern and applies it. Dramatically improves format consistency, tone matching, and edge-case handling without any fine-tuning. The examples act as "in-context learning": adaptation that happens at inference time, with no weight updates.

Example
Q: Capital of France? A: Paris
Q: Capital of Japan? A: Tokyo
Q: Capital of Brazil? A: Brasília
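Assembled as a single prompt string, the few-shot pattern looks like this; the final unanswered line invites the model to continue the pattern (the Kenya question is an illustrative addition).

few_shot_prompt = """Q: Capital of France? A: Paris
Q: Capital of Japan? A: Tokyo
Q: Capital of Brazil? A: Brasília
Q: Capital of Kenya? A:"""   # the model completes the pattern: " Nairobi"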
💭

Chain-of-Thought

Prompt the model to show its reasoning before the final answer. "Think step by step" or "Let's reason through this:" unlocks better performance on multi-step math, logic, and planning tasks. Reasoning tokens are "working memory" — the model uses them to stay accurate over long derivations.

Example
Q: 17 × 24 = ?
Think step by step.
17×20=340, 17×4=68, 340+68=408

Hallucinations — When Models Make Things Up

LLMs generate plausible-sounding text. "Plausible" and "true" are not the same thing. Understanding why hallucinations happen is the first step to mitigating them.

🧩

Why It Happens

LLMs are trained to predict the next most likely token — not to retrieve verified facts. They interpolate patterns from training data. When asked about something outside that distribution, they extrapolate confidently into fiction. There is no internal "truth check."

no ground truth · pattern completion · training cutoff
🔬

Common Types

Factual: wrong dates, names, statistics stated confidently. Citation: fabricated paper titles, authors, URLs that don't exist. Reasoning: logical errors that look correct. Intrinsic: contradicts the source document it was given.

factual errors · fake citations · logic errors · source drift
🛡️

How to Mitigate

RAG: ground answers in retrieved documents the model can cite. Low temperature: reduces wild extrapolation. System prompt: "If unsure, say so." Chain-of-thought: slower reasoning catches more errors. Verification: ask the model to check its own answer.

RAG grounding · low temp · self-check · citations
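The verification idea can be scripted as a two-pass pattern. A minimal sketch under assumptions: llm stands for any text-in, text-out completion call, and the prompt wording is illustrative.

def answer_with_self_check(llm, question, source):
    # Pass 1: draft an answer grounded only in the supplied source.
    draft = llm(f"Using ONLY this source:\n{source}\n\nQ: {question}\nA:")
    # Pass 2: ask the model to audit its own draft against the source.
    verdict = llm(
        f"Source:\n{source}\n\nAnswer to check: {draft}\n\n"
        "Does the answer make any claim the source does not support? "
        "Reply SUPPORTED or UNSUPPORTED, with a one-line reason."
    )
    return draft, verdict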

❌ Hallucinated Answer

Q: Who invented the telephone? Alexander Graham Bell patented it in 1876. He was born in Edinburgh in 1850 and studied at Oxford University, where he first developed his acoustic theories under Prof. James Whitmore.

✅ Grounded Answer (with RAG)

Q: Who invented the telephone? Based on the provided source: Alexander Graham Bell patented the telephone in 1876. He was born in Edinburgh in 1847 and educated at the University of Edinburgh. [Source: encyclopedia excerpt, para. 2]

Prompting vs RAG vs Fine-tuning

Three levers for improving LLM output. They are not mutually exclusive — real systems often combine all three. The key is knowing which problem each one solves.

✍️

Prompting

Change what you ask, not the model. Use system prompts, few-shot examples, and chain-of-thought to steer behavior at inference time. Zero cost, instant iteration, fully reversible. The right starting point for every task — exhaust this before spending on the others.

Cost: Free
Data needed: None
Iteration: Instant
Knowledge: Training only
fast iteration · format control · no infra
🗄️

RAG

Inject private or up-to-date data into the context at query time by retrieving relevant documents from a vector store. No training required. Data stays fresh — add new documents without touching the model. Answers are grounded and citable. Best for document Q&A, knowledge bases, and current events.

Cost: Retrieval only
Data needed: Documents
Iteration: Fast
Knowledge: Your corpus
private data · always fresh · citable
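A toy end-to-end sketch of the retrieve-then-prompt step. The hashing "embedding" below is a stand-in so the example runs without a model; real systems use a learned embedding model and a vector database instead.

import numpy as np

def embed(text, dim=256):
    # Toy hashing embedding, a stand-in for a real embedding model.
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

docs = [
    "Bell patented the telephone in 1876.",
    "Edinburgh is the capital of Scotland.",
    "The transistor was invented at Bell Labs in 1947.",
]
doc_vecs = np.stack([embed(d) for d in docs])

question = "Who invented the telephone?"
sims = doc_vecs @ embed(question)          # cosine similarity (unit vectors)
top = np.argsort(sims)[::-1][:2]           # retrieve the top-2 passages

context = "\n".join(docs[i] for i in top)
prompt = ("Answer using only the sources below and cite them.\n\n"
          f"{context}\n\nQ: {question}")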
🎛️

Fine-tuning

Adjust the model weights on your labeled dataset. Bakes domain knowledge, style, and behavior directly into the model — consistent output without lengthy prompts. Best when you have thousands of high-quality examples and a narrow, well-defined task. Expensive to train and re-train as data changes.

Cost: High (GPU hours)
Data needed: 1K–100K examples
Iteration: Slow (retrain)
Knowledge: Baked in
domain expert · consistent style · short prompts
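Training data for chat-model fine-tuning is typically one JSON object per line. The record below follows the OpenAI-style chat JSONL convention; the SQL content is an illustrative example, not a required schema.

import json

record = {
    "messages": [
        {"role": "system", "content": "You are a precise SQL assistant."},
        {"role": "user", "content": "Total revenue per region for 2024?"},
        {"role": "assistant", "content":
            "SELECT region, SUM(revenue) AS total "
            "FROM sales WHERE year = 2024 GROUP BY region;"},
    ]
}
with open("train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")   # one example per line; repeat 1K+ times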
Related Reading
Agentic AI – Teaching Diagram · How agents use LLMs in loops to complete complex tasks
Model Context Protocol – Teaching Diagram · How agents connect to tools and external data via MCP
RAG Pipeline – Teaching Diagram · Deep dive into retrieval-augmented generation end to end