AI / LLMs · Teaching Reference

LLM Fundamentals

Tokens, context windows, temperature, prompt engineering, hallucinations, and when to use RAG vs fine-tuning — the core concepts behind how large language models work.

Concept 1 · Tokens & Context Window

The unit of language
📝
Raw Text
"Hello, world!"
✂️
Tokenizer
BPE / WordPiece splits
🔢
Token IDs
[15496, 11, 995, 0]
🪟
Context Window
8K – 1M+ token limit
🕸️
Self-Attention
Each token attends to all earlier tokens
🎲
Next Token
Predicted one at a time
What is a token?
Not a word — a subword unit. "Unbelievable" might be 4 tokens; "cat" is 1. A rule of thumb: ~4 characters ≈ 1 token, or ~¾ of a word. Every character you type — spaces, punctuation, newlines — costs tokens. Models never see letters, only integer IDs from a fixed vocabulary (32K–128K entries).
Context window
The hard limit on how many tokens the model can "see" at once — prompt + history + response combined. GPT-4: 128K. Claude: up to 200K. Gemini: 1M+. Exceeding this limit means older content gets dropped. Longer contexts also cost more to compute: attention is O(n²) in sequence length.
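To build intuition, run a tokenizer yourself. Here is a minimal sketch using the open-source tiktoken library with its GPT-2 vocabulary, which yields exactly the IDs in the diagram above; the 8K window and the 512-token reply reserve are illustrative numbers, not any particular model's limits.

import tiktoken  # pip install tiktoken; any BPE tokenizer behaves similarly

enc = tiktoken.get_encoding("gpt2")            # GPT-2's BPE vocabulary
ids = enc.encode("Hello, world!")
print(ids)                                     # [15496, 11, 995, 0]
print([enc.decode([i]) for i in ids])          # ['Hello', ',', ' world', '!']

# Budget check: prompt + reserved response space must fit the window.
window = 8192                                  # model-dependent limit
prompt = "Summarize the following text in three bullet points: ..."
reserved_for_reply = 512
print(len(enc.encode(prompt)) + reserved_for_reply <= window)   # True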
Autoregressive generation
LLMs generate one token at a time, left to right. Each new token is appended to the context and the model runs again. By the ~¾-word rule, a 500-word response is roughly 670 tokens, and therefore roughly 670 forward passes. This is why generation feels fast at first and can slow down as the response grows.
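Here is the loop in miniature: a schematic Python sketch, not any real model's API. The toy "model" just scores (last token + 1) mod 10 highest, so greedy decoding counts upward; a real transformer returns logits over its whole vocabulary.

def generate(model, tokens, max_new_tokens=20, eos_id=None):
    # One forward pass per new token: score the vocab, pick, append, repeat.
    for _ in range(max_new_tokens):
        logits = model(tokens)                                      # scores over the vocab
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy pick
        if next_id == eos_id:                                       # stop at end-of-sequence
            break
        tokens.append(next_id)                                      # context grows each step
    return tokens

# Toy "model": favours (last token + 1) mod 10, so the output counts upward.
toy = lambda toks: [1.0 if i == (toks[-1] + 1) % 10 else 0.0 for i in range(10)]
print(generate(toy, [0]))   # [0, 1, 2, ..., 9, 0, 1, ...]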

Concept 2 · Temperature & Sampling

Controlling randomness
📥
Input Context
All tokens so far
⚙️
Forward Pass
Transformer layers compute
📊
Raw Logits
Score for each vocab token
🌡️
Temperature
Low = precise · High = creative
📐
Softmax
Logits → probabilities
🎯
Sample
top-p / top-k filter
🔤
New Token
Appended → loop repeats
Temperature — the creativity dial
After ranking every possible next word by likelihood, the model has to pick one. Temperature controls how adventurous that pick is. Low (0–0.4): almost always picks the most predictable word — precise and consistent, great for code or factual Q&A. Medium (0.7): occasionally picks the 2nd or 3rd most likely word — feels more natural, less robotic. High (1.2+): freely picks surprising words — more creative, but more prone to errors. Most apps run between 0.2 and 0.8.
Top-p — cutting off the long tail
Even at a sensible temperature, the model still assigns tiny probabilities to thousands of bizarre words. Top-p (p = 0.9) says: only consider words until their combined probability reaches 90%, then ignore everything else. When the model is confident, only a handful of words qualify. When it's uncertain, more options stay in. Either way, the truly absurd options get cut.
Top-k — a simpler cap
Only consider the top K most likely next words and ignore the rest (k = 40 is common). Easy to reason about: at k = 1 the model always picks the single most likely word; at k = 50 it has variety to work with. Most systems apply temperature first, then top-p and top-k together as a safety net to keep outputs sensible.
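Putting the three dials together: a minimal NumPy sketch of one sampling step. The function name and default values are illustrative; production inference engines implement the same filters on the GPU, but the logic is this.

import numpy as np

def sample_next_token(logits, temperature=0.7, top_k=40, top_p=0.9):
    # Temperature: divide logits before softmax. Low T sharpens the
    # distribution toward the argmax; high T flattens it.
    logits = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-8)

    # Softmax: logits -> probabilities (shift by max for numerical stability).
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-k: zero out everything below the k-th largest probability.
    if top_k < len(probs):
        probs[probs < np.sort(probs)[-top_k]] = 0.0

    # Top-p: keep the smallest set of tokens whose cumulative mass >= p.
    order = np.argsort(probs)[::-1]
    keep = np.cumsum(probs[order]).searchsorted(top_p) + 1
    probs[order[keep:]] = 0.0

    probs /= probs.sum()                     # renormalise, then sample
    return int(np.random.choice(len(probs), p=probs))

print(sample_next_token([2.0, 1.0, 0.5, -1.0]))   # usually 0, sometimes 1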

Prompt Engineering

The craft of designing inputs that reliably elicit the output you want — without changing the model weights. Four techniques every practitioner needs.

⚙️

System Prompt

A privileged instruction block prepended before the user turn. Sets the model's persona, rules, output format, and constraints. Every production LLM deployment uses one. Users typically never see it, but the model treats it as a trusted, high-priority directive.

Example
You are a precise SQL assistant.
Always validate column names.
Never guess; ask for clarification.
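In practice the system prompt is simply the first entry in a role-tagged message list. A sketch under assumptions: the field names below follow the widely used OpenAI-style chat format; other APIs differ in detail but share the shape.

messages = [
    {"role": "system", "content": (
        "You are a precise SQL assistant. "
        "Always validate column names. "
        "Never guess; ask for clarification."
    )},
    {"role": "user", "content": "Total revenue per region for 2024?"},
]
# The model receives the system turn first and treats it as a
# higher-priority directive than anything in the user turn.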
🎯

Zero-Shot

Just ask — no examples provided. Works well for tasks the model has seen extensively in training (translation, summarization, basic Q&A). The model draws entirely on pattern-matched knowledge. Fast and low-effort, but output format and tone can be unpredictable.

Example
Summarize the following text
in three bullet points:
[text here]
📋

Few-Shot

Provide 2–5 input→output examples before the real query. The model infers the pattern and applies it. Dramatically improves format consistency, tone matching, and edge-case handling without any fine-tuning. The examples act as "in-context learning": adaptation that happens at inference time, with no weight updates.

Example
Q: Capital of France? A: Paris
Q: Capital of Japan? A: Tokyo
Q: Capital of Brazil? A: Brasília
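Assembled as a single prompt string, the few-shot pattern looks like this; the final unanswered line invites the model to continue the pattern (the Kenya question is an illustrative addition).

few_shot_prompt = """Q: Capital of France? A: Paris
Q: Capital of Japan? A: Tokyo
Q: Capital of Brazil? A: Brasília
Q: Capital of Kenya? A:"""   # the model completes the pattern: " Nairobi"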
💭

Chain-of-Thought

Prompt the model to show its reasoning before the final answer. "Think step by step" or "Let's reason through this:" unlocks better performance on multi-step math, logic, and planning tasks. Reasoning tokens are "working memory" — the model uses them to stay accurate over long derivations.

Example
Q: 17 × 24 = ?
Think step by step.
17×20=340, 17×4=68, 340+68=408

Hallucinations — When Models Make Things Up

LLMs generate plausible-sounding text. "Plausible" and "true" are not the same thing. Understanding why hallucinations happen is the first step to mitigating them.

🧩

Why It Happens

LLMs are trained to predict the next most likely token — not to retrieve verified facts. They interpolate patterns from training data. When asked about something outside that distribution, they extrapolate confidently into fiction. There is no internal "truth check."

no ground truth · pattern completion · training cutoff
🔬

Common Types

Factual: wrong dates, names, statistics stated confidently. Citation: fabricated paper titles, authors, URLs that don't exist. Reasoning: logical errors that look correct. Intrinsic: contradicts the source document it was given.

factual errors · fake citations · logic errors · source drift
🛡️

How to Mitigate

RAG: ground answers in retrieved documents the model can cite. Low temperature: reduces wild extrapolation. System prompt: "If unsure, say so." Chain-of-thought: slower reasoning catches more errors. Verification: ask the model to check its own answer.

RAG grounding · low temp · self-check · citations
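The verification idea can be scripted as a two-pass pattern. A minimal sketch under assumptions: llm stands for any text-in, text-out completion call, and the prompt wording is illustrative.

def answer_with_self_check(llm, question, source):
    # Pass 1: draft an answer grounded only in the supplied source.
    draft = llm(f"Using ONLY this source:\n{source}\n\nQ: {question}\nA:")
    # Pass 2: ask the model to audit its own draft against the source.
    verdict = llm(
        f"Source:\n{source}\n\nAnswer to check: {draft}\n\n"
        "Does the answer make any claim the source does not support? "
        "Reply SUPPORTED or UNSUPPORTED, with a one-line reason."
    )
    return draft, verdict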

❌ Hallucinated Answer

Q: Who invented the telephone? Alexander Graham Bell patented it in 1876. He was born in Edinburgh in 1850 and studied at Oxford University, where he first developed his acoustic theories under Prof. James Whitmore.

✅ Grounded Answer (with RAG)

Q: Who invented the telephone? Based on the provided source: Alexander Graham Bell patented the telephone in 1876. He was born in Edinburgh in 1847 and educated at the University of Edinburgh. [Source: encyclopedia excerpt, para. 2]

Prompting vs RAG vs Fine-tuning

Three levers for improving LLM output. They are not mutually exclusive — real systems often combine all three. The key is knowing which problem each one solves.

✍️

Prompting

Change what you ask, not the model. Use system prompts, few-shot examples, and chain-of-thought to steer behavior at inference time. Zero cost, instant iteration, fully reversible. The right starting point for every task — exhaust this before spending on the others.

Cost: Free
Data needed: None
Iteration: Instant
Knowledge: Training only
fast iteration · format control · no infra
🗄️

RAG

Inject private or up-to-date data into the context at query time by retrieving relevant documents from a vector store. No training required. Data stays fresh — add new documents without touching the model. Answers are grounded and citable. Best for document Q&A, knowledge bases, and current events.

Cost: Retrieval only
Data needed: Documents
Iteration: Fast
Knowledge: Your corpus
private data · always fresh · citable
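A toy end-to-end sketch of the retrieve-then-prompt step. The hashing "embedding" below is a stand-in so the example runs without a model; real systems use a learned embedding model and a vector database instead.

import numpy as np

def embed(text, dim=256):
    # Toy hashing embedding, a stand-in for a real embedding model.
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

docs = [
    "Bell patented the telephone in 1876.",
    "Edinburgh is the capital of Scotland.",
    "The transistor was invented at Bell Labs in 1947.",
]
doc_vecs = np.stack([embed(d) for d in docs])

question = "Who invented the telephone?"
sims = doc_vecs @ embed(question)          # cosine similarity (unit vectors)
top = np.argsort(sims)[::-1][:2]           # retrieve the top-2 passages

context = "\n".join(docs[i] for i in top)
prompt = ("Answer using only the sources below and cite them.\n\n"
          f"{context}\n\nQ: {question}")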
🎛️

Fine-tuning

Adjust the model weights on your labeled dataset. Bakes domain knowledge, style, and behavior directly into the model — consistent output without lengthy prompts. Best when you have thousands of high-quality examples and a narrow, well-defined task. Expensive to train and re-train as data changes.

Cost: High (GPU hours)
Data needed: 1K–100K examples
Iteration: Slow (retrain)
Knowledge: Baked in
domain expert · consistent style · short prompts
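Training data for chat-model fine-tuning is typically one JSON object per line. The record below follows the OpenAI-style chat JSONL convention; the SQL content is an illustrative example, not a required schema.

import json

record = {
    "messages": [
        {"role": "system", "content": "You are a precise SQL assistant."},
        {"role": "user", "content": "Total revenue per region for 2024?"},
        {"role": "assistant", "content":
            "SELECT region, SUM(revenue) AS total "
            "FROM sales WHERE year = 2024 GROUP BY region;"},
    ]
}
with open("train.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")   # one example per line; repeat 1K+ times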
Related Reading
Agentic AI – Teaching Diagram · How agents use LLMs in loops to complete complex tasks
Model Context Protocol – Teaching Diagram · How agents connect to tools and external data via MCP
RAG Pipeline – Teaching Diagram · Deep dive into retrieval-augmented generation end to end