Tokens, context windows, temperature, prompt engineering, hallucinations, and when to use RAG vs fine-tuning — the core concepts behind how large language models work.
Top-p, or nucleus sampling (say p = 0.9), says: only consider words until their combined probability reaches 90%, then ignore everything else. When the model is confident, only a handful of words qualify. When it's uncertain, more options stay in. Either way, the truly absurd options get cut.
Top-k restricts the choice to the k most likely words (k = 40 is common). Easy to reason about: at k = 1 the model always picks the single most likely word; at k = 50 it has variety to work with. Most systems apply temperature first, then top-p and top-k together as a safety net to keep outputs sensible.
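A minimal sketch of that decoding pipeline, assuming NumPy and a toy five-word vocabulary with made-up logits; real decoders run the same filtering over tens of thousands of tokens.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, top_k=40, top_p=0.9, rng=None):
    """Apply temperature, then top-k and top-p filtering, then sample one token."""
    if rng is None:
        rng = np.random.default_rng()

    # Temperature: divide logits before softmax; <1 sharpens, >1 flattens.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    probs /= probs.sum()

    # Top-k: zero out everything outside the k most likely tokens.
    keep = np.argsort(probs)[::-1][:top_k]
    mask = np.zeros_like(probs, dtype=bool)
    mask[keep] = True
    probs = np.where(mask, probs, 0.0)
    probs /= probs.sum()

    # Top-p (nucleus): keep the smallest set whose cumulative probability
    # reaches p, cutting the long tail of implausible tokens.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    nucleus = order[:cutoff]
    final = np.zeros_like(probs)
    final[nucleus] = probs[nucleus]
    final /= final.sum()

    return rng.choice(len(final), p=final)

# Toy vocabulary and made-up logits, purely for illustration.
vocab = ["the", "a", "cat", "dog", "xylophone"]
logits = [2.0, 1.5, 1.0, 0.5, -3.0]
print(vocab[sample_next_token(logits, top_k=3, top_p=0.9)])
```

Lowering temperature or k shrinks the pool the final draw comes from; raising them widens it.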
The craft of designing inputs that reliably elicit the output you want — without changing the model weights. Four techniques every practitioner needs.
A privileged instruction block placed before the user's turn. Sets the model's persona, rules, output format, and constraints. Every production LLM deployment uses one. Users typically never see it, but the model treats it as a trusted, high-priority directive.
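A sketch of what that looks like in the common chat-message format, here via the OpenAI Python SDK; the model name, persona, and rules are placeholder assumptions, not part of the description above.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    {
        "role": "system",  # the privileged block: persona, rules, output format
        "content": (
            "You are a support assistant for Acme Corp. "
            "Answer only from the provided policy excerpts. "
            "Respond in JSON with keys 'answer' and 'confidence'. "
            "If you are unsure, say so."
        ),
    },
    {"role": "user", "content": "Can I return an opened item after 30 days?"},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)
```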
Just ask — no examples provided. Works well for tasks the model has seen extensively in training (translation, summarization, basic Q&A). The model draws entirely on pattern-matched knowledge. Fast and low-effort, but output format and tone can be unpredictable.
Provide 2–5 input→output examples before the real query. The model infers the pattern and applies it. Dramatically improves format consistency, tone matching, and edge-case handling without any fine-tuning. The examples act as in-context learning that happens at inference time.
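A sketch of a few-shot prompt for a hypothetical sentiment-labeling task; the instruction, example reviews, and labels are invented to show the shape, not a prescribed format.

```python
# Hypothetical labeled examples the model should imitate.
examples = [
    ("The battery died after two days.", "negative"),
    ("Setup took thirty seconds and it just works.", "positive"),
    ("It arrived on time. It is a keyboard.", "neutral"),
]

def build_few_shot_prompt(query: str) -> str:
    lines = ["Label the sentiment of each review as positive, negative, or neutral.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The real query goes last, in exactly the same shape as the examples.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)

print(build_few_shot_prompt("The screen is gorgeous but it overheats."))
```

Dropping the examples and sending only the final two lines would be the zero-shot version of the same task.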
Prompt the model to show its reasoning before the final answer. "Think step by step" or "Let's reason through this:" unlocks better performance on multi-step math, logic, and planning tasks. Reasoning tokens are "working memory" — the model uses them to stay accurate over long derivations.
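A sketch of a chain-of-thought prompt plus a helper that pulls the final answer out of the reasoning; the question and the "Answer:" convention are assumptions for illustration, not a standard.

```python
question = (
    "A train leaves at 09:10 and arrives at 11:45. "
    "How long is the journey in minutes?"
)

# Ask for visible reasoning first, then a clearly marked final line.
prompt = (
    f"{question}\n\n"
    "Think step by step, showing each intermediate calculation. "
    "Then give the final answer on its own line, prefixed with 'Answer:'."
)

def extract_answer(model_output: str) -> str:
    """Return the last line starting with 'Answer:', or the whole output as a fallback."""
    for line in reversed(model_output.splitlines()):
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    return model_output.strip()

print(prompt)
```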
LLMs generate plausible-sounding text. "Plausible" and "true" are not the same thing. Understanding why hallucinations happen is the first step to mitigating them.
LLMs are trained to predict the next most likely token — not to retrieve verified facts. They interpolate patterns from training data. When asked about something outside that distribution, they extrapolate confidently into fiction. There is no internal "truth check."
Factual: wrong dates, names, statistics stated confidently. Citation: fabricated paper titles, authors, URLs that don't exist. Reasoning: logical errors that look correct. Intrinsic: contradicts the source document it was given.
RAG: ground answers in retrieved documents the model can cite. Low temperature: reduces wild extrapolation. System prompt: "If unsure, say so." Chain-of-thought: slower reasoning catches more errors. Verification: ask the model to check its own answer.
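A sketch combining two of those mitigations, grounding plus self-verification, as a generate-then-check pass; `complete` is a stand-in for whatever single-prompt completion call your stack provides.

```python
from typing import Callable

def answer_with_self_check(question: str, source: str,
                           complete: Callable[[str], str]) -> dict:
    # Pass 1: answer strictly from the supplied source text.
    draft = complete(
        "Using only the source below, answer the question. "
        "If the source does not contain the answer, say so.\n\n"
        f"Source:\n{source}\n\nQuestion: {question}"
    )
    # Pass 2: ask the model to verify its own draft against the same source.
    verdict = complete(
        "Check the draft answer strictly against the source. "
        "Reply 'SUPPORTED' if every claim is backed by the source, "
        "otherwise 'UNSUPPORTED' followed by the unsupported claims.\n\n"
        f"Source:\n{source}\n\nQuestion: {question}\nDraft answer: {draft}"
    )
    return {"answer": draft, "verdict": verdict}
```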
Three levers for improving LLM output. They are not mutually exclusive — real systems often combine all three. The key is knowing which problem each one solves.
Change what you ask, not the model. Use system prompts, few-shot examples, and chain-of-thought to steer behavior at inference time. Zero cost, instant iteration, fully reversible. The right starting point for every task — exhaust this before spending on the others.
Inject private or up-to-date data into the context at query time by retrieving relevant documents from a vector store. No training required. Data stays fresh — add new documents without touching the model. Answers are grounded and citable. Best for document Q&A, knowledge bases, and current events.
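A minimal sketch of that loop: score documents against the query, keep the top few, and build a grounded, citable prompt. Word-overlap similarity stands in for the learned embeddings and vector store a real system would use, and the documents are invented.

```python
from collections import Counter
import math

documents = [
    "Refunds are available within 30 days of purchase with a receipt.",
    "Our headquarters moved to Austin in 2021.",
    "Premium support is available 24/7 for enterprise customers.",
]

def similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts: a crude stand-in for embeddings."""
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(wa[w] * wb[w] for w in wa)
    norm = math.sqrt(sum(v * v for v in wa.values())) * math.sqrt(sum(v * v for v in wb.values()))
    return dot / norm if norm else 0.0

def build_rag_prompt(question: str, k: int = 2) -> str:
    # Retrieve the k most relevant documents, then ground the prompt in them.
    top = sorted(documents, key=lambda d: similarity(question, d), reverse=True)[:k]
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(top))
    return (
        "Answer using only the numbered sources below and cite them like [1].\n\n"
        f"{context}\n\nQuestion: {question}"
    )

print(build_rag_prompt("How long do I have to request a refund?"))
```

Adding a new document means appending to the store; the model itself never changes.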
Adjust the model weights on your labeled dataset. Bakes domain knowledge, style, and behavior directly into the model — consistent output without lengthy prompts. Best when you have thousands of high-quality examples and a narrow, well-defined task. Expensive to train and re-train as data changes.
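A sketch of preparing a fine-tuning dataset in the chat-style JSONL convention several hosted fine-tuning services accept; the system prompt, examples, and file name are invented. In practice you would need thousands of rows like these for the trade-off above to pay off.

```python
import json

training_examples = [
    {"messages": [
        {"role": "system", "content": "You write release notes in Acme's house style."},
        {"role": "user", "content": "Summarize: fixed crash when exporting empty reports."},
        {"role": "assistant", "content": "Fixed: exporting an empty report no longer crashes the app."},
    ]},
    {"messages": [
        {"role": "system", "content": "You write release notes in Acme's house style."},
        {"role": "user", "content": "Summarize: added dark mode toggle in settings."},
        {"role": "assistant", "content": "New: dark mode can now be switched on from Settings."},
    ]},
]

with open("train.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")  # one JSON object per line
```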