What Is an LLM?

A complete plain-language guide to large language models: how they work, what they can do, and where they still fail
April 4, 2026 by Vishal

Quick answer: What is an LLM?

A large language model (LLM) is a type of artificial intelligence trained on vast amounts of text to predict and generate human language. Capable of answering questions, writing, reasoning, translating, and coding across dozens of tasks without being explicitly programmed for any of them, LLMs are the technology behind ChatGPT, Claude, Gemini, GitHub Copilot, and most other AI products you interact with today.


You have almost certainly used a large language model today. Maybe you asked a chatbot to summarise a document. Maybe autocomplete finished your sentence in a way that felt eerily accurate. Maybe GitHub Copilot suggested a function before you finished typing its name.

Behind every one of those interactions is an LLM — a system so computationally expensive to build that training a single frontier model costs tens of millions of dollars and requires thousands of specialised chips running for weeks. Yet the interface is so natural that millions of people use it without thinking about the engineering underneath.

This guide explains what a large language model actually is. Not the marketing version — the real version. What it is trained on, how it learns, what it genuinely can and cannot do, why it sometimes invents facts with serene confidence, and where the technology is heading next.

No machine learning background is required. Every technical term is defined when it first appears. By the end, you will have the literacy to evaluate LLM claims critically, use these tools effectively, and understand why the questions being asked about them matter.


What Is an LLM? — Three Definitions for Three Audiences

The one-sentence definition

A large language model is a neural network that is trained on a massive text corpus to predict the most likely next token in a sequence and that, by doing this at sufficient scale, acquires a broad ability to understand and generate human language across an enormous range of tasks.

Breaking down the name: 

  • Large: billions to trillions of learned numerical parameters. GPT-3 has 175 billion. GPT-4 is estimated at over one trillion (using a mixture-of-experts architecture). Llama 3 comes in sizes from 8 billion to 70 billion. "Large" is relative to prior generations — a 2017 state-of-the-art model had tens of millions of parameters.

  • Language: text is the primary input and output modality. Modern LLMs are increasingly multimodal (accepting images, audio, and video as well), but language — written text — remains the core capability and the training foundation.

  • Model: a mathematical function. Specifically, a function that maps a sequence of input tokens to a probability distribution over the next token. The "intelligence" is encoded in billions of numerical weights that were learned, not hand-coded.

The intuitive definition: autocomplete at civilisational scale

At its core, an LLM is trained on one deceptively simple task: given all the text so far, predict the most likely next word. That is it. No more.

But to predict the next word accurately across the full diversity of human writing — research papers, legal contracts, novels, Python tutorials, Reddit threads, ancient philosophy, medical records — the model must implicitly learn grammar, facts, reasoning patterns, cultural context, coding conventions, mathematical notation, and stylistic register. Understanding is not the goal; it is a side effect of doing prediction very well at very large scale.

Analogy: imagine someone who has read every book, article, forum post, and website ever written — in every language, across every domain. Ask them a question, and they respond based on patterns absorbed from all of it. That is roughly what an LLM does, with an important caveat: it absorbed those patterns through statistical matching, not through lived experience or sensory grounding. The boundary between "deep pattern matching" and "understanding" is the subject of genuine philosophical debate — we return to it in Section 9.

The technical characterisation

An LLM is a transformer-based neural network pre-trained on a large text corpus using a self-supervised next-token prediction objective, subsequently fine-tuned for instruction following and alignment using supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).

If that sentence is opaque, each component has its own section later in this guide. The key technical point for now: the transformer architecture (covered in depth in our companion article, How Transformers Work) is the engine. The training pipeline — data, objective, fine-tuning — is what shapes the engine into a useful assistant.

Terminology disambiguation: LLM vs. AI vs. chatbot

These terms are used interchangeably in public discourse. They are not the same thing.


| Term | What it means | Examples |
|---|---|---|
| AI | The broadest category. Any system exhibiting intelligent behaviour. | Chess engines, recommendation systems, spam filters, LLMs |
| Machine learning | AI that learns from data rather than hand-coded rules. | Random forests, neural networks, gradient boosting |
| Deep learning | ML using multi-layer neural networks. | CNNs for vision, RNNs for sequences, transformers |
| LLM | A large transformer model trained on text. | GPT-4, Claude 3, Llama 3, Gemini, Mistral |
| Generative AI | AI that produces new content (text, image, audio, video). | LLMs plus Stable Diffusion, Suno, Runway |
| Foundation model | A large model trained broadly and adapted to many tasks. | GPT-4, CLIP, DALL-E 3 (often overlaps with LLM) |
| Chatbot | A product interface built on top of an LLM. | ChatGPT, Claude.ai, Gemini (product), Copilot |
| AGI | Hypothetical AI with general human-level intelligence. | Does not yet exist. No consensus on definition. |


A Short History — How LLMs Went from Theory to Product in Eight Years

The pre-transformer era (pre-2017)

The history of language models begins long before the current wave. For decades, the dominant approach was statistical: count how often word B follows word A in training data, and predict accordingly. N-gram models were effective for short, formulaic text — autocorrect, simple autocomplete — but collapsed on anything requiring real context.
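The counting approach described above can be sketched in a few lines. This is a toy bigram model under simplifying assumptions (whitespace splitting, a made-up corpus, hypothetical function names), not a reconstruction of any historical system:

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Made-up corpus, for illustration only.
corpus = "the cat sat on the mat . the cat ate the fish ."
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" and "fish" once each
print(predict_next(model, "dog"))  # None: no context beyond the previous word, no fallback
```

The model only ever looks one word back, which is exactly why this family of methods struggled with anything requiring real context.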

Word2Vec and GloVe (2013–2014) introduced neural word embeddings — dense vectors that captured semantic relationships geometrically. "King minus man plus woman approximately equals queen" became the canonical demonstration. These were a genuine advance, but they encoded static word meanings, not contextual understanding. "Bank" in "river bank" and "bank" in "bank account" mapped to the same vector.

Recurrent neural networks (RNNs) and Long Short-Term Memory networks (LSTMs) addressed the context problem by processing text sequentially, maintaining a running summary of what had been read. They powered the best machine translation and speech recognition systems of the mid-2010s. But they were fundamentally limited: their hidden state could only hold so much, sequential processing meant they could not use modern parallel hardware efficiently, and very long-range dependencies remained unreliable.

The transformer moment (2017)

"Attention Is All You Need" (Vaswani et al., Google Brain, 2017) replaced sequential recurrence with a mechanism that let every token directly attend to every other token simultaneously. Parallelisable, scalable, and remarkably general. The transformer became the universal architecture for AI within five years of its publication.

Why it was the keystone: the transformer has no built-in notion of what tokens represent. Point it at text, and it learns language. Point it at image patches, and it learns vision. Point it at amino acid sequences, and it learns protein structure. Attention-based architectures from the same family underpin GPT-4, AlphaFold 2, and DALL-E 3.

The scale breakthrough (2018–2020)

BERT (Google, 2018) showed that a bidirectional transformer trained on masked token prediction achieved state-of-the-art on eleven natural language understanding benchmarks simultaneously — without task-specific architecture changes. The world noticed that scale worked reliably.

GPT-2 (OpenAI, 2019) generated text so coherent that OpenAI initially declined to release it fully, citing misuse concerns. GPT-3 (2020, 175 billion parameters) crossed a qualitative threshold: it demonstrated in-context learning — the ability to perform new tasks from just a few examples in the prompt, with no weight updates. This was not programmed. It emerged from scale. Researchers who built it were surprised.

Scaling laws (Kaplan et al., OpenAI, 2020) established that performance follows smooth power laws in model size, training data, and compute. This turned the training of large models into something closer to a science, with predictable returns on investment, and accelerated the field dramatically.

The alignment turn (2021–2022)

A pre-trained LLM is a text engine. It continues whatever you give it. It does not follow instructions, answer helpfully, or decline harmful requests. InstructGPT (OpenAI, 2022) changed this through Reinforcement Learning from Human Feedback (RLHF): human raters ranked model responses, a reward model was trained on those rankings, and GPT-3 was fine-tuned to maximise reward scores. The result was dramatically more helpful, less harmful, and more honest behaviour.

ChatGPT (November 2022) packaged this into a consumer chat product. One million users in five days. One hundred million in two months, at the time the fastest consumer product adoption ever recorded. LLMs moved from research curiosity to household topic in a single quarter.

The competitive era (2023–present)

GPT-4 (March 2023): multimodal input, dramatically stronger reasoning, estimated 1+ trillion parameters in a mixture-of-experts architecture. Gemini (Google, December 2023): natively multimodal from training, with a 1-million-token context window in the 1.5 Pro version. Claude 3 (Anthropic, 2024): constitutional AI alignment, 200,000-token context. Llama 3 (Meta, 2024): open-weight models at 8B and 70B parameters enabling a vibrant open-source ecosystem.

Milestone timeline

2018  BERT (Google): Bidirectional encoder; state-of-the-art on 11 NLP benchmarks

2019  GPT-2 (OpenAI): Coherent long-form generation; release delayed over misuse concerns

2020  GPT-3 (OpenAI): 175B params; in-context learning emerges from scale

2022  InstructGPT / ChatGPT: RLHF alignment; 100M users in 2 months

2022  Chinchilla (DeepMind): Compute-optimal training scales parameters and data in equal proportion

2023  GPT-4 (OpenAI): Multimodal; passes bar exam at ~90th percentile

2023  Llama 1/2 (Meta): Open-weight LLMs; enables open-source ecosystem

2024  Gemini 1.5 Pro (Google): 1M-token context; natively multimodal

2024  Claude 3 / Llama 3: 200K context; open weights at frontier capability


How an LLM Is Built — the Complete Training Pipeline

Building a frontier LLM involves four distinct phases, each with its own objectives, techniques, and costs. Understanding the pipeline is essential to understanding both the capabilities and the failure modes — many limitations arise not from accidents of engineering but from the structure of the process itself.

Phase 1: Data collection and curation

The training corpus is the foundation of everything. Modern frontier LLMs train on 1 to 15 trillion tokens — more text than any human could read in thousands of lifetimes. The primary sources:

  • Common Crawl: petabytes of raw web text scraped from billions of pages. The largest source, but also the noisiest — containing spam, gibberish, duplicate content, and objectionable material in vast quantities.

  • Books and long-form writing: digitised books (Books3, Project Gutenberg, BooksCorpus), providing coherent long-range structure that web text often lacks.

  • Scientific literature: ArXiv preprints, PubMed abstracts, academic papers — critical for technical and scientific capability.

  • Code repositories: GitHub and other code hosts. Models trained on more code tend to reason more systematically, plausibly because code's explicit logical structure teaches the model something about sequential, rule-following thought.

  • Wikipedia and encyclopaedic sources: high-quality factual content across domains.

  • Multilingual corpora: text in dozens of languages. The balance of languages in training data directly determines multilingual capability.

Raw data is unusable as collected. Quality filtering is essential and difficult:

  • Deduplication removes near-identical content. MinHash-style algorithms identify and eliminate repetitive web text; without this step, certain pages appear thousands of times and dominate the model's learned patterns.

  • Quality classifiers filter low-quality pages using heuristics (short documents, high symbol density, non-linguistic content) and trained classifiers.

  • Toxicity filtering removes overtly harmful content, though this is imperfect — models still encounter bias-laden and objectionable text in training.

  • The data mix — the ratio of web text to books to code to scientific papers — is itself a hyperparameter. Teams experiment with different ratios because the mix shapes what capabilities emerge.
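To make the deduplication step more concrete, here is a minimal MinHash sketch. It assumes three-word shingles, 64 seeded SHA-1 hashes, and hypothetical function names; production pipelines differ in many details (shingle size, hash family, banding for fast lookup), so treat it as an illustration of the idea only.

```python
import hashlib

def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash value over all items."""
    signature = []
    for seed in range(num_hashes):
        salt = str(seed).encode()
        signature.append(min(
            int.from_bytes(hashlib.sha1(salt + item.encode()).digest()[:8], "big")
            for item in items
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Made-up documents: two near-duplicates and one unrelated text.
doc_a = "large language models are trained on vast amounts of text from the web"
doc_b = "large language models are trained on vast amounts of text from the internet"
doc_c = "the recipe calls for two cups of flour and a pinch of salt"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b))  # high: candidate near-duplicate
print(estimated_jaccard(sig_a, sig_c))  # low: unrelated
```

A pipeline would then drop one document from each pair whose estimated similarity exceeds some chosen threshold; the threshold itself is a tuning decision.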

All text is then tokenised: broken into sub-word fragments (using Byte-Pair Encoding or SentencePiece), each mapped to an integer ID. GPT-4 uses approximately 100,000 token types. After tokenisation, the entire corpus is a very long sequence of integers — the only format the model actually processes.
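The core of Byte-Pair Encoding is simple enough to sketch: start from individual characters and repeatedly merge the most frequent adjacent pair. The version below is a toy under stated assumptions (whitespace-separated words, no byte-level handling, hypothetical function names) and is not how any particular production tokeniser is implemented.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(text, num_merges):
    """Learn up to `num_merges` merge rules, starting from individual characters."""
    words = Counter(tuple(w) for w in text.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)       # the two learned merge rules
print(dict(words))  # "low" is now a single symbol inside every word
```

After enough merges, frequent words become single tokens while rare words stay split into pieces, which is the behaviour the paragraph above describes.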

Phase 2: Pre-training

Pre-training is the expensive phase. A transformer model with randomly initialised weights is trained to minimise one objective: next-token prediction. Given the sequence of tokens up to position N, predict the probability distribution over what token comes at position N+1.

The training loss is cross-entropy: the negative log-probability assigned to the correct next token. If the model assigns high probability to the right answer, loss is low. If it assigns high probability to the wrong answer, loss is high. Backpropagation computes how much each of the model's billions of parameters contributed to the loss, and the optimiser (usually Adam) adjusts them all to reduce it. Repeat across trillions of tokens.
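As a worked example of the loss just described, the sketch below computes cross-entropy for a single prediction. The probability tables are invented for illustration; a real model produces a distribution over its whole vocabulary.

```python
import math

def cross_entropy(probs, correct_token):
    """Negative log-probability the model assigned to the correct next token."""
    return -math.log(probs[correct_token])

# Hypothetical model outputs for the prompt "The capital of France is".
confident_right = {"Paris": 0.90, "Lyon": 0.05, "Rome": 0.05}
confident_wrong = {"Paris": 0.01, "Lyon": 0.09, "Rome": 0.90}

print(round(cross_entropy(confident_right, "Paris"), 3))  # 0.105
print(round(cross_entropy(confident_wrong, "Paris"), 3))  # 4.605
```

Training averages this quantity over every token position in a batch and nudges the weights in whichever direction lowers it.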

The self-supervision insight is what makes this tractable: the labels — the correct next tokens — come from the data itself. Every sentence in the corpus is simultaneously training data and label. No human annotation is required. This is why training on internet-scale data is feasible: there is no annotation bottleneck.
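The "data labels itself" idea can be shown directly. In this sketch, every position in a token sequence yields one training example whose label is simply the token that comes next. The token IDs and the function name are hypothetical.

```python
def next_token_examples(token_ids, context_length):
    """Slide over a token sequence, producing (context, target) training pairs."""
    examples = []
    for i in range(1, len(token_ids)):
        context = token_ids[max(0, i - context_length):i]
        examples.append((context, token_ids[i]))
    return examples

tokens = [11, 42, 7, 93, 5]  # hypothetical token IDs for one short sentence
for context, target in next_token_examples(tokens, context_length=3):
    print(context, "->", target)
# [11] -> 42
# [11, 42] -> 7
# [11, 42, 7] -> 93
# [42, 7, 93] -> 5
```

No human wrote any of those targets; they fall out of the text itself, which is why the approach scales to trillions of tokens.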

What the model learns from this objective is remarkable. To accurately predict the next token across the full diversity of human writing, the model must implicitly encode:

  • Grammar and syntax — predicting grammatically valid continuations

  • Factual knowledge — predicting what comes after "The capital of France is..."

  • Reasoning patterns — predicting that a logical conclusion follows from its premises

  • Social and cultural context — predicting what a character would say in a given situation

  • Code structure — predicting that a function body follows its signature

  • Mathematical conventions — predicting the next step in a derivation

None of these are trained for explicitly. They emerge because they are all predictively useful. This emergence-from-prediction is the central surprise of the LLM era.

Compute cost: training GPT-3 is estimated to have cost $4–12 million in cloud compute. GPT-4 is estimated at $50–100 million or more. Training runs on clusters of thousands of NVIDIA H100 or Google TPU v4 accelerators running continuously for weeks. This cost structure determines who can build frontier models.



Phase 3: Post-training alignment

A pre-trained model is a powerful text engine, not an assistant. It continues text rather than following instructions. It may helpfully complete a coding task, or unhelpfully continue a prompt in ways the user did not intend. Post-training shapes it into something useful and safe.

Supervised Fine-Tuning (SFT): a curated dataset of (prompt, ideal response) pairs, written by human specialists covering a wide range of tasks — question answering, coding assistance, creative writing, sensitive topic handling. The model is trained to produce the ideal responses. This teaches it the format, register, and style of helpful assistant behaviour. Relatively modest in compute compared to pre-training, but critical for usability.

Reinforcement Learning from Human Feedback (RLHF): human raters compare pairs of model responses to the same prompt and indicate which is better. A reward model is trained on these preference pairs to predict human preference scores. The LLM is then fine-tuned using Proximal Policy Optimisation (PPO) — a reinforcement learning algorithm — to generate responses that maximise the reward model's score. This allows alignment with complex human preferences that are difficult to specify through examples alone.
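The reward-model step can be illustrated with one commonly used pairwise loss (a Bradley-Terry style formulation). The article does not specify which loss any given lab uses, so treat this as an assumption; the reward values are invented.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: small when the human-preferred response gets the higher reward."""
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) is the same quantity as -log(sigmoid(margin)).
    return math.log1p(math.exp(-margin))

print(round(preference_loss(2.0, 0.0), 3))  # reward model agrees with the rater: low loss
print(round(preference_loss(0.0, 0.0), 3))  # no preference: log(2), about 0.693
print(round(preference_loss(0.0, 2.0), 3))  # reward model disagrees: high loss
```

Minimising this loss over many rated pairs teaches the reward model to score responses the way raters would, and that score is what the reinforcement learning stage then optimises.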

Direct Preference Optimisation (DPO): a simpler alternative to RLHF that directly fine-tunes the model on preference pairs without a separate reward model. More stable, cheaper, and increasingly common at frontier labs.
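For comparison, here is a sketch of the DPO objective for a single preference pair, following the published formulation as best understood here. The log-probabilities are invented, and `beta` (how strongly the policy is held close to the reference model) is set to an arbitrary illustrative value.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses to the same prompt."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model, in log space.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# Policy identical to the reference: no preference learned yet.
print(round(dpo_loss(-12.0, -15.0, -12.0, -15.0), 3))  # log(2), about 0.693
# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0) < dpo_loss(-12.0, -15.0, -12.0, -15.0))  # True
```

Note that no reward model appears anywhere: the preference signal is expressed directly in terms of the model's own probabilities.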

Constitutional AI (Anthropic): the model critiques and revises its own outputs according to a stated set of principles — the "constitution" — then reinforcement learning is applied to the self-revised data. This makes alignment principles explicit and auditable, and reduces the volume of human-generated preference data required.

Red-teaming: human specialists and automated systems systematically attempt to elicit harmful, false, or policy-violating outputs. Findings feed back into additional fine-tuning rounds. No model passes red-teaming completely — it is an ongoing process, not a one-time gate.

Phase 4: Evaluation

How do you know if the model is actually good? This is harder than it sounds.

  • Benchmark suites: standardised test sets measuring specific capabilities. MMLU tests knowledge across 57 academic subjects. HumanEval tests code generation. GSM8K tests multi-step maths. GPQA tests graduate-level science. Strong benchmark performance is meaningful but not sufficient — a model can be optimised for benchmarks without being genuinely capable.

  • Human preference evaluations: blind comparison studies (such as the LMSYS Chatbot Arena leaderboard) where human raters choose between anonymised model responses. These correlate better with real-world usefulness than benchmark scores but are expensive and subject to rater biases.

  • Contamination problem: if benchmark questions appear in the training corpus, the model may appear to perform better than it actually would on truly unseen questions. Contamination detection is an active methodological challenge, and the field continuously develops new held-out benchmarks.

  • Alignment evaluation: refusal quality (refusing harmful requests without over-refusing benign ones), factual accuracy on verifiable claims, calibration (does expressed confidence match actual accuracy?), robustness to adversarial prompting.


How an LLM Works at Inference — From Your Prompt to the Response

Training is complete. The model's weights are fixed. You type a message and press enter. Here is exactly what happens.

Tokenisation: your words become numbers

Your message is split into tokens: word fragments mapped to integer IDs. The tokeniser has a fixed vocabulary (GPT-4's has approximately 100,000 entries). Common words are single tokens. Rare words and names are split into sub-word pieces. Numbers and punctuation are tokenised too, and a leading space is usually folded into the token for the word that follows it.

Practical implications of tokenisation:

  • Cost: most LLM APIs charge per token consumed. A 1,000-word message is roughly 1,300 tokens.

  • Language inequality: languages with smaller training corpora have less efficient tokenisation — Swahili requires more tokens to express the same content than English, making it more expensive and often lower quality.

  • Surprising failures: tasks that seem simple to humans can be hard for models because of how they tokenise. Counting the letter "r" in "strawberry" requires reasoning about sub-word tokens, which don't map one-to-one with individual characters.
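The "strawberry" problem is easier to see with a toy tokeniser. The sketch below uses greedy longest-match against a tiny invented vocabulary, which is a simplification: real BPE tokenisers apply learned merge rules, and their actual split of this word may differ.

```python
def greedy_tokenise(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            tokens.append(text[i])  # unknown character: emit it on its own
            i += 1
    return tokens

toy_vocab = {"str", "aw", "berry"}  # hypothetical sub-word vocabulary
tokens = greedy_tokenise("strawberry", toy_vocab)
print(tokens)                    # ['str', 'aw', 'berry']
print("strawberry".count("r"))   # 3
```

The model receives three opaque integer IDs rather than ten letters. The fact that one "r" sits inside the first piece and two inside the last is not visible in the input; it has to be inferred from patterns seen in training.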

The context window: the model's working memory

The context window is the maximum number of tokens the model can consider in a single forward pass — its entire working memory for the interaction. Your system prompt, conversation history, any documents you attached, and your current message all share this space.

Context window sizes over time

GPT-2 (2019): 1,024 tokens (~750 words)

GPT-3 (2020): 2,048 tokens (~1,500 words)

GPT-4 Turbo (2023): 128,000 tokens (~96,000 words)

Claude 3 (2024): 200,000 tokens (~150,000 words)

Gemini 1.5 Pro (2024): 1,000,000 tokens (~750,000 words)


Why context windows are limited: attention — the mechanism at the heart of every transformer — computes relationships between every pair of tokens. This is an O(N²) operation: doubling the context length quadruples the compute and memory required. Extending context windows to one million tokens requires substantial engineering to avoid making inference prohibitively expensive.
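The quadratic growth is easy to verify with arithmetic. This sketch counts only the token-to-token scores a single attention head computes in one layer; it ignores every optimisation real systems use, so it shows the scaling trend rather than actual cost.

```python
def attention_pairs(context_length):
    """Number of token-to-token scores one attention head computes in one layer."""
    return context_length * context_length

for n in (1_024, 2_048, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n):,} pairwise scores")

# Doubling the context quadruples the work.
print(attention_pairs(2_048) / attention_pairs(1_024))  # 4.0
```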

What happens at the limit: when a very long conversation fills the context window, the oldest content has to be dropped or summarised, and the model can no longer reference it. It has, architecturally, forgotten it. This is why a chatbot may seem to 'forget' what you said at the start of a very long exchange. It is not a personality quirk. It is a consequence of the architecture.
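One simple way an application might handle a full window is to keep only the most recent messages that fit. The sketch below assumes that policy (real products may summarise or retrieve instead) and uses a crude word-based token estimate in place of a real tokeniser; all names are hypothetical.

```python
def fit_to_window(messages, max_tokens, count_tokens):
    """Keep the most recent messages whose combined token count fits the budget."""
    kept, used = [], 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))

def rough_token_count(text):
    """Crude stand-in for a tokeniser: roughly 1.3 tokens per word."""
    return round(len(text.split()) * 1.3)

history = ["my name is Ada", "I like chess", "what is my name"]
print(fit_to_window(history, max_tokens=9, count_tokens=rough_token_count))
# ['I like chess', 'what is my name']
```

The message containing the name no longer reaches the model, so it cannot answer the final question from context.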

The forward pass: computing the next token

Your tokenised message flows through the model's N transformer layers one after another, with all tokens in the context processed in parallel within each layer. In each layer, self-attention lets every token attend to every other token, building a rich, contextual representation of each token in light of everything else in the context. The feed-forward network in each layer then processes each token's representation independently. Residual connections and layer normalisation keep the computation stable across dozens or hundreds of layers.
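A stripped-down version of self-attention fits in a few lines. This sketch makes several simplifying assumptions: a single head, no learned query/key/value projections (the input vectors play all three roles), and no causal mask, whereas GPT-style models restrict each token to attending to itself and earlier tokens. The input vectors are invented.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Each output is a weighted average of all input vectors."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot product between this token and every token in the context.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, vectors))
                        for d in range(dim)])
    return outputs

# Two toy 2-dimensional "token" vectors.
for row in self_attention([[1.0, 0.0], [0.0, 1.0]]):
    print([round(x, 3) for x in row])
```

Each output vector is pulled mostly toward its own input but mixes in some of the other token, which is the "representation in light of everything else" described above.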

The final layer produces a vector for the last token position. This vector is projected through an output matrix into a probability distribution over all ~100,000 tokens in the vocabulary. The model does not "think about" the answer; it computes which token statistically tends to follow this sequence, given everything it learned during training.
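The final projection step ends in a softmax, which turns one raw score per vocabulary entry into a probability distribution. The sketch uses a three-token "vocabulary" with invented scores; a real model does the same over roughly 100,000 entries.

```python
import math

def softmax(logits):
    """Convert a {token: raw score} mapping into probabilities that sum to 1."""
    peak = max(logits.values())  # subtracting the maximum avoids overflow
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical raw scores for the token after "The capital of France is".
logits = {"Paris": 5.0, "Lyon": 2.0, "banana": -1.0}
probs = softmax(logits)
for token, p in probs.items():
    print(f"{token:>7}: {p:.3f}")
```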

Sampling: why AI responses are probabilistic

The model does not always output the single most likely token (greedy decoding). Most deployments sample from the probability distribution, introducing controlled randomness. This is why running the same prompt twice can produce different responses.

  • Temperature: a parameter that divides the model's raw scores (logits) before they are converted to probabilities and sampled. Temperature below 1 sharpens the distribution: more confident, more conservative, near-deterministic as it approaches 0. Temperature above 1 flattens it: more creative, more variable, higher risk of incoherence. "AI creativity" is substantially this one parameter.

  • Top-p sampling: sample only from the smallest set of tokens whose cumulative probability reaches threshold p (e.g. 0.9). Prevents very low-probability "wild card" tokens from occasionally derailing the output.

  • Top-k sampling: restrict sampling to the k highest-probability tokens at each step. Simpler than top-p, often used in combination with temperature.
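The three controls above can be combined in one small sampling function. This is a sketch under stated assumptions: the order of operations (temperature, then top-k, then top-p) is one common arrangement rather than a universal rule, the function name is hypothetical, and temperature must be greater than zero.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token from raw scores after temperature, top-k and top-p filtering."""
    # Temperature: divide the raw scores before the softmax.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda pair: pair[1], reverse=True)
    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in ranked:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept
    tokens = [tok for tok, _ in ranked]
    weights = [p for _, p in ranked]
    # random.choices renormalises the remaining weights for us.
    return rng.choices(tokens, weights=weights, k=1)[0]

logits = {"Paris": 5.0, "Lyon": 2.0, "banana": -1.0}  # hypothetical raw scores
print(sample_next_token(logits, top_k=1))  # Paris: top_k=1 is greedy decoding
print(sample_next_token(logits, temperature=2.0, rng=random.Random(0)))
```

With `top_k=1` the function always returns the most likely token. Raising the temperature gives the less likely tokens a larger share of the probability mass.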

The response is generated one token at a time: sample a token, append it to the sequence, run the forward pass again, sample the next token. This is why responses stream in small chunks, token by token: they literally are being generated that way.
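The loop itself is short. In the sketch below, `toy_model` is a hypothetical stand-in for an LLM that looks only at the last token; a real model would run a full forward pass over the whole sequence at each step.

```python
import random

def generate(next_token_probs, prompt_tokens, max_new_tokens, stop_token, rng):
    """Autoregressive loop: sample a token, append it, and run the model again."""
    sequence = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(sequence)  # one "forward pass"
        tokens, weights = zip(*probs.items())
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == stop_token:
            break
        sequence.append(token)
    return sequence

def toy_model(sequence):
    """Hypothetical stand-in for an LLM: a lookup table keyed on the last token."""
    table = {
        "the": {"cat": 0.6, "dog": 0.4},
        "cat": {"sat": 1.0},
        "dog": {"sat": 1.0},
        "sat": {"<end>": 1.0},
    }
    return table.get(sequence[-1], {"<end>": 1.0})

print(generate(toy_model, ["the"], max_new_tokens=10, stop_token="<end>",
               rng=random.Random(0)))
```

Each iteration feeds the model's own previous output back in as input, so an early sampling choice ("cat" or "dog") shapes everything that follows.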

Prompting and prompt engineering

A typical API call contains three components: a system prompt (persistent instructions — 'You are a helpful coding assistant specialising in Python'), a conversation history (alternating user and assistant turns), and the current user message. All three are concatenated into one long token sequence. The model has no separate 'instruction module' — the system prompt is just more tokens that it attends to when generating each response.
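The concatenation step might look something like the following. The angle-bracket role markers are invented for illustration; each model family defines its own special tokens and chat template, and API providers apply them for you.

```python
def build_prompt(system_prompt, history, user_message):
    """Flatten the system prompt, prior turns and the new message into one string."""
    lines = [f"<system>{system_prompt}</system>"]
    for role, text in history:
        lines.append(f"<{role}>{text}</{role}>")
    lines.append(f"<user>{user_message}</user>")
    lines.append("<assistant>")  # the model continues generating from here
    return "\n".join(lines)

prompt = build_prompt(
    "You are a helpful coding assistant specialising in Python",
    [("user", "How do I reverse a list?"), ("assistant", "Use my_list[::-1].")],
    "And a string?",
)
print(prompt)
```

From the model's point of view there is only this one sequence. The system prompt carries weight because alignment training taught the model to treat text in that position as instructions, not because a separate component enforces it.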

Prompt engineering is the practice of crafting inputs that reliably elicit better outputs. Evidence-based techniques:

  • Chain-of-thought prompting: ask the model to 'think step by step before answering.' This dramatically improves multi-step reasoning — the intermediate steps serve as working memory for the model's computation.

  • Few-shot prompting: provide 2–5 examples of the input-output format you want before your actual question. The model pattern-matches from examples without any weight updates.

  • Explicit output format specification: 'Respond in JSON with keys: summary, key_points, action_items.' Specificity about structure dramatically improves consistency.

  • Role assignment: 'You are an expert cardiologist reviewing a patient case.' Activates relevant patterns in the model's training distribution.
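Few-shot prompting in particular is just string assembly. The helper below is a hypothetical sketch; the "Input:/Output:" labels and the example reviews are invented, and any consistent format would serve.

```python
def few_shot_prompt(instruction, examples, query):
    """Build a prompt with worked examples followed by the real question."""
    parts = [instruction, ""]
    for text, label in examples:
        parts.append(f"Input: {text}")
        parts.append(f"Output: {label}")
        parts.append("")
    parts.append(f"Input: {query}")
    parts.append("Output:")  # left open for the model to complete
    return "\n".join(parts)

prompt = few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("Loved it, would buy again.", "positive"),
     ("Broke after two days.", "negative")],
    "Exceeded my expectations.",
)
print(prompt)
```

Ending the prompt on an open "Output:" line makes the most likely continuation a label in the same format as the examples.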


What LLMs Can Do — Capabilities Mapped Honestly

Language tasks: the native domain

LLMs are strongest on the tasks most represented in their training data: reading, writing, and transforming text.

  • Text generation: drafting emails, articles, reports, marketing copy, social posts, product descriptions. At their best, indistinguishable from competent professional writing. Quality is highest for conventional formats and genres well-represented in training data.

  • Summarisation: condensing long documents into key points. Genuinely valuable for research papers, legal documents, meeting transcripts, news archives. Quality degrades on very long inputs and on highly technical content where the model may not recognise what is important versus incidental.

  • Translation: competitive with dedicated neural machine translation for major language pairs (English, Spanish, French, German, Mandarin, Japanese, Arabic). Weaker on low-resource languages with limited training data.

  • Question answering: answering factual questions, explaining concepts, synthesising information from provided context. Strongest when the answer is in the context window; weaker when relying solely on parametric knowledge from training.

  • Classification and extraction: categorising text, extracting named entities, identifying sentiment, parsing structured data from unstructured text — pulling dates and amounts from invoices, extracting contract terms, labelling customer support tickets.

Reasoning and problem-solving

Frontier LLMs have crossed into territory that would have been considered impossible for language models just five years ago.

  • Chain-of-thought reasoning: when prompted to think step by step, models perform substantially better on multi-step problems. GPT-4 passes the LSAT and the GRE verbal at above the 80th percentile. It passes the US Medical Licensing Exam and the bar exam. These are not easy tests — they were designed to filter humans who haven't sufficiently mastered professional knowledge.

  • Mathematical reasoning: strong on textbook-level problems (GPT-4 achieves competitive scores on AMC 10/12). OpenAI's o1 and o3 extended-thinking models score far higher on competition mathematics such as the AIME, and in 2025 experimental reasoning systems were reported to reach gold-medal-level scores on International Mathematical Olympiad problems, a benchmark that was considered out of reach for AI systems as recently as 2023.

  • Code generation and debugging: writing functional code from natural language descriptions, explaining what existing code does, identifying bugs, suggesting fixes. GitHub has reported that Copilot writes roughly 40–50% of the code in files where it is enabled.

  • In-context learning: the ability to perform new tasks from just a few examples in the prompt, with no weight updates. Show the model three examples of a classification task it's never seen, and it performs the fourth correctly. This was not trained for explicitly — it emerged at scale.

  • Creative and analogical reasoning: generating novel metaphors, creative writing under constraints, connecting ideas across disparate domains. Genuinely impressive — an emergent capability that surprised researchers and philosophers alike.

Emergent capabilities that surprised researchers

Several capabilities appeared in larger models that were absent in smaller ones — crossing qualitative thresholds rather than simply performing better on existing tasks:

  • Few-shot learning emerged around GPT-3 scale (175B parameters) and was not predicted by extrapolating from smaller models' performance.

  • Chain-of-thought reasoning emerged as a stable, reliable capability around GPT-3.5/4 scale — smaller models showed inconsistent benefit from step-by-step prompting.

  • Theory of mind performance (correctly predicting what a character with different knowledge would believe) has been reported in some studies of frontier models at rates comparable to human adults, a result that has prompted genuine philosophical debate.

  • Instruction following — understanding and executing complex, multi-part, conditionally-specified instructions in natural language — improved discontinuously with scale and alignment training.


What LLMs Cannot Do — the Real Limitations

Overclaiming about LLM capabilities causes real harm: in professional contexts, in policy, and in public expectations. This section is as important as the capabilities section before it. Understanding failure modes is a prerequisite to using these tools responsibly.

Hallucination: the confident fabrication problem

Hallucination is when an LLM generates text that is fluent, confident, and factually false. Invented academic citations. Wrong dates and statistics. Fictional legal cases. Plausible-sounding but incorrect medical information. Detailed biographical facts about people who do not exist.

Why it happens architecturally: the model is optimised to predict the most likely next token, not to verify truth. A fluent, plausible-sounding false sentence may be statistically more probable under the model's distribution than a correct but awkward one. The model has no internal fact-checking mechanism. It has no access to external reality — only to patterns in its training data. Hallucinated content looks syntactically and stylistically identical to accurate content.

The citation problem is a canonical example: LLMs regularly generate plausible-looking academic references — correct author name formats, realistic journal names, plausible volume and page numbers — for papers that do not exist. The components come from training data; the specific combination is fabricated by sampling from the distribution.

Case study: Mata v. Avianca (2023)

A lawyer used ChatGPT to research legal precedents for a federal court filing. ChatGPT provided detailed citations to multiple cases — docket numbers, court names, quotes from decisions. None of the cited cases existed. The lawyer submitted the brief without verifying the citations. When the court requested the cited cases, they could not be produced. The judge sanctioned the attorneys. The case became a widely-cited example of professional harm from hallucination and has influenced legal industry guidelines on AI use.


Mitigation but not elimination: Retrieval-Augmented Generation (RAG) reduces hallucination by grounding responses in retrieved source documents. Larger, better-calibrated models hallucinate less frequently. But no current LLM is reliably hallucination-free, and the failure is characteristically hard to detect — it does not announce itself.
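The basic shape of Retrieval-Augmented Generation can be sketched without any model at all: find relevant text, then put it in the prompt. The retriever below ranks by shared words, which is a deliberately crude assumption; real systems typically use embedding similarity search. The documents and function names are invented.

```python
def retrieve(query, documents, top_n=2):
    """Rank documents by how many query words they share (a crude retriever)."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(doc.lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_n] if score > 0]

def grounded_prompt(query, documents):
    """Build a prompt that asks the model to answer only from retrieved sources."""
    sources = retrieve(query, documents)
    context = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(sources))
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

docs = [
    "The Eiffel Tower is in Paris.",
    "Mount Everest is the tallest mountain on Earth.",
    "Python was created by Guido van Rossum.",
]
print(grounded_prompt("who created python", docs))
```

Grounding narrows what the model has to get right, but it does not remove the failure mode: the model can still misread or go beyond the sources it was given.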

Knowledge cutoff and temporal blindness

LLMs are trained on static datasets with a hard cutoff date. They have no knowledge of events after training ended. Ask about last week's news and the model either correctly says it does not know, or confabulates a plausible-sounding answer based on prior patterns — the second outcome is more dangerous than the first.

Mitigation: tool use (web browsing, real-time API access), RAG with fresh document retrieval, regular model retraining. These are engineering workarounds for an architectural constraint, not solutions to it.

Systematic generalisation failure

LLMs excel at interpolation — performing well on inputs similar to their training distribution. They are unreliable at extrapolation — applying rules to genuinely novel input combinations.

The arithmetic example: GPT-4 can often multiply two five-digit numbers. It fails reliably on twelve-digit numbers, even though the multiplication algorithm is identical. The model approximates from learned patterns; it does not execute algorithms. Multi-digit arithmetic is harder for LLMs than for a 1970s calculator.
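For contrast, here is what executing the algorithm looks like. The sketch below implements schoolbook long multiplication on decimal strings; the function name long_multiply is an illustrative choice. The same loop handles five digits or twelve, because it follows the procedure step by step rather than recalling similar examples.

```python
def long_multiply(a: str, b: str) -> str:
    """Multiply two non-negative decimal strings digit by digit."""
    result = [0] * (len(a) + len(b))  # least significant digit first
    for i, digit_a in enumerate(reversed(a)):
        carry = 0
        for j, digit_b in enumerate(reversed(b)):
            total = result[i + j] + int(digit_a) * int(digit_b) + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(b)] += carry
    digits = "".join(map(str, reversed(result))).lstrip("0")
    return digits or "0"

print(long_multiply("12345", "67890"))
print(long_multiply("123456789012", "987654321098"))
```

An LLM given access to a calculator or code-execution tool can delegate to a procedure like this, which is one reason tool use matters for arithmetic.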

This is not a bug to be patched. The transformer architecture is a pattern matcher of extraordinary sophistication. Pattern matching at sufficient scale produces behaviour that looks like reasoning. But pattern matching and algorithmic reasoning are not the same thing, and the difference becomes visible precisely in the cases that matter most — novel, high-stakes situations without prior examples.

No persistent memory

An LLM has no memory between conversations. Each new session starts with blank context. It does not know you from previous interactions unless that history is explicitly injected into the context window. Within a long conversation, early content is eventually pushed out of the window and forgotten entirely.
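The window effect can be sketched as follows. This is a simplified illustration: counting whitespace-separated words stands in for a real tokenizer, the 16-token limit is deliberately tiny, and fit_to_window is a hypothetical helper rather than any product's actual logic.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def fit_to_window(messages, max_tokens):
    """Keep the most recent messages that fit in the window; drop older ones."""
    kept, used = [], 0
    for message in reversed(messages):
        cost = count_tokens(message)
        if used + cost > max_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))

history = [
    "My name is Priya and I am allergic to peanuts.",
    "Can you suggest a dinner recipe?",
    "Something quick, under thirty minutes please.",
    "Also make it vegetarian.",
]
window = fit_to_window(history, max_tokens=16)
print(window)
```

With this limit the first message, including the allergy, no longer fits and is never shown to the model. Nothing signals that it was dropped.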

Workarounds: external vector databases, conversation summarisation pipelines, user profile stores that retrieve and inject relevant history. These are infrastructure layers sitting outside the model — they approximate persistent memory but are not equivalent to it.

Calibration: confident when wrong

Calibration is the correspondence between expressed confidence and actual accuracy. A well-calibrated model that says "I am 80% confident" should be right about 80% of the time. LLMs are often poorly calibrated — expressing the same fluent confidence whether stating a well-established fact or confabulating a detail outside their training distribution.
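One common way to put a number on this is expected calibration error: group answers by stated confidence, then compare average confidence with actual accuracy in each group. The sketch below is a minimal version with equal-width bins; the bin count and the data are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=5):
    """Average |accuracy - confidence| across bins, weighted by bin size."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece

# Invented data: ten answers, each stated with 0.9 confidence.
well_calibrated = expected_calibration_error([0.9] * 10, [1] * 9 + [0])
overconfident = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
print(round(well_calibrated, 3), round(overconfident, 3))
```

A score near zero means stated confidence tracks accuracy. The second case, 90% confidence with 50% accuracy, is the pattern described above.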

The dangerous failure mode: the model is most likely to be confidently wrong on precisely the inputs where you most need it to flag uncertainty — obscure facts, recent events, domain-specific technical details. The model doesn't know what it doesn't know, and it doesn't sound different when it doesn't.

Bias and representation failures

LLMs trained on internet-scale text inherit the biases present in that text — skewed toward Western, English-language, majority-group perspectives. Measurable performance disparities by race, gender, nationality, and language are well-documented in the research literature.

Mitigation through RLHF and Constitutional AI has reduced the most obvious surface-level failures but has not eliminated them — and mitigation itself can introduce new failure modes, such as over-cautious refusals that disproportionately affect minority groups seeking information about their own communities.


LLMs in the Wild — Real-World Applications by Domain

Software development

Software development is the domain where LLM productivity gains are best-evidenced and most immediately measurable.

  • Code completion and generation: GitHub Copilot, Cursor, Replit Ghostwriter. A randomised controlled trial by GitHub found that developers using Copilot completed a well-defined task about 55% faster. Copilot is estimated to write 40–50% of code in repositories where it is actively used.

  • Code review and bug detection: identifying logical errors, security vulnerabilities (SQL injection, buffer overflows, insecure random number generation), and style violations at the pull request stage.

  • Test generation: automatically writing unit tests for existing functions. One of the least-enjoyed developer tasks, now largely automatable for well-structured code.

  • Legacy modernisation: translating COBOL and FORTRAN to modern languages, generating documentation for undocumented codebases, explaining what inherited code does.

Healthcare and medicine

Healthcare presents both the highest potential value and the highest risk of LLM deployment. The gap between current capability and required reliability is largest here.

  • Clinical documentation: ambient AI listening to patient consultations and generating structured clinical notes (Microsoft DAX Copilot, Nuance, Nabla). Early deployments report a 2–3 hour reduction per physician per day in documentation burden, which is a significant contributor to physician burnout.

  • Medical literature synthesis: summarising clinical trial results, systematic reviews, and drug interaction data for clinicians. Speeds literature review substantially, but requires expert verification — hallucinated clinical data is dangerous.

  • Drug discovery support: analysing molecular literature, hypothesising protein targets, generating candidate structures in combination with specialised molecular tools.

  • Patient communication: drafting discharge instructions and medication guides in plain language tailored to patient literacy level. Significant liability and accuracy concerns remain unresolved.

Legal and professional services

  • Contract review: identifying key clauses, unusual provisions, and missing standard terms in bulk. Harvey AI and similar products report 10–40× speed improvement on first-pass contract review for standard commercial agreements.

  • Legal research: identifying relevant case law, statutory provisions, regulatory guidance. Requires strict verification — the Mata v. Avianca case (Section 6.1) illustrates the catastrophic failure mode when verification is skipped.

  • Document drafting: first drafts of NDAs, employment agreements, terms of service. High quality for standard forms; unreliable for jurisdiction-specific or highly negotiated provisions.

Education

  • Personalised tutoring: Khanmigo (Khan Academy) uses Socratic prompting — asking guiding questions rather than providing direct answers — to help students work through problems while preserving the cognitive engagement that drives learning.

  • Writing feedback: detailed, actionable feedback on essay structure, argument quality, and evidence use — available at scale without the instructor bottleneck.

  • Language learning: conversational practice, grammar correction, vocabulary explanation in context (Duolingo Max, Speak). Removes the barrier of needing a human conversation partner for low-stakes practice.

  • Academic integrity: LLMs can generate passing essays. This is a structural challenge to text-based assessment that institutions are actively grappling with — responses range from AI detection tools (imperfect) to redesigning assessments around process and in-person demonstration.

Enterprise productivity

  • Meeting intelligence: transcription, summarisation, and action-item extraction from recorded meetings. Products: Otter.ai, Fireflies.ai, Microsoft Copilot in Teams. Reduces meeting overhead and improves follow-through on decisions.

  • Customer support: handling tier-1 queries (order status, basic troubleshooting, FAQs) with conversational understanding far beyond keyword-matching. Escalation to human agents on complex or sensitive cases.

  • Knowledge base Q&A: answering employee questions from internal documentation. Reduces support tickets and dramatically speeds onboarding for new employees who can query the knowledge base in natural language.

  • Writing and drafting: first drafts of reports, proposals, emails, presentations. Studies consistently report 30–40% time savings on writing-heavy knowledge work tasks.


The LLM Ecosystem — Models, Providers, and How They Differ

The three-layer stack

"ChatGPT said X" conflates three distinct layers that make different decisions:

  • The base model: the neural network with its trained weights. The intellectual and computational asset of the lab that built it. Determines the fundamental capability ceiling.

  • The API: access to the model over HTTP, with token-based pricing, safety filters applied, system prompt capability, and usage limits. What developers build products on. OpenAI's API, Anthropic's API, Google AI Studio.

  • The product: the user interface, memory management, tool integrations, and product decisions built on top of the API. ChatGPT, Claude.ai, Gemini (consumer product), Copilot. Makes decisions about conversation history, safety thresholds, feature rollout.

When a model 'refuses' a request, it may be refusing because of a weight-level learned behaviour (baked in during RLHF), because of a safety filter at the API layer, or because of a system prompt instruction set by the product. These are meaningfully different — only the first is attributable to the model itself.

Closed vs. open-weight models

The distinction between closed and open models is one of the most consequential in the current AI landscape:

  • Closed models (GPT-4, Claude 3, Gemini Ultra, Grok): weights not released. Accessible only through API or consumer product. Provider controls updates, safety measures, and pricing. Typically strongest at frontier capability. Users have no visibility into training data, architecture details, or weight values.

  • Open-weight models (Llama 3, Mistral, Qwen 2, Gemma, Falcon): weights published for download. Can be run locally, fine-tuned, modified, deployed without API dependency. Privacy-preserving (data never leaves your hardware). Typically somewhat behind closed models on overall capability benchmarks, though the gap has narrowed dramatically.

The policy debate: open weights enable research, auditability, customisation, and access for under-resourced communities. They also enable misuse — fine-tuning to remove safety guardrails is straightforward for someone with the technical capability. This tension is unresolved and shapes ongoing regulatory conversations.

How to choose a model: the key trade-off dimensions

  • Capability: benchmark performance (MMLU, MATH, HumanEval), human preference ratings (Chatbot Arena Elo score), performance on your specific task domain. No single model dominates on all dimensions.

  • Context window: critical for document-level tasks. If your use case involves long documents, call transcripts, or large codebases, context window size may be the binding constraint.

  • Cost: varies by 100× or more across the market. Frontier closed models: $10–60 per million output tokens. Mid-tier models: $1–5. Smaller self-hosted open-weight models: fractions of a cent. Cost drives architecture at scale.

  • Latency: smaller models are faster. For real-time applications (voice assistants, interactive tools), latency may matter more than benchmark performance.

  • Deployment options: API-only (closed models), self-hostable (open-weight), edge-deployable (quantised small models on consumer hardware). Data sensitivity, network availability, and compliance requirements drive this choice.

  • Safety and alignment: refusal behaviour, jailbreak resistance, factual accuracy, bias levels. Harder to measure than capability benchmarks. Varies substantially across models and across model families from the same provider.
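To see why the cost dimension drives architecture at scale, it helps to run the arithmetic. The sketch below uses prices from inside the ranges quoted in the Cost bullet; the workload (50,000 requests a day, 400 output tokens each) is an assumption chosen for illustration, and the function counts output tokens only.

```python
def monthly_cost_usd(requests_per_day, output_tokens_per_request,
                     price_per_million_output_tokens, days=30):
    """Output-token cost only; input tokens and hosting are ignored."""
    tokens = requests_per_day * output_tokens_per_request * days
    return tokens / 1_000_000 * price_per_million_output_tokens

# Assumed workload: 50,000 requests a day, 400 output tokens per request.
frontier = monthly_cost_usd(50_000, 400, 30.0)  # $30 is inside the $10-60 range
mid_tier = monthly_cost_usd(50_000, 400, 3.0)   # $3 is inside the $1-5 range

print(f"frontier: ${frontier:,.0f} per month")
print(f"mid-tier: ${mid_tier:,.0f} per month")
```

Under these assumptions the same workload costs $18,000 a month on the frontier price and $1,800 on the mid-tier price, which is why many systems route only the hardest requests to the largest model.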


Understanding vs. Stochastic Parrots — the Philosophical Debate

This section engages with a question that thoughtful users of LLMs inevitably arrive at: do these systems understand language, or are they doing something fundamentally different that merely resembles understanding?

The stochastic parrot critique

Bender et al. (2021) introduced the term "stochastic parrots" to describe LLMs as systems that manipulate linguistic form without any underlying meaning or comprehension — producing plausible text through statistical pattern matching rather than understanding.

The core argument rests on grounding: an LLM has never seen a cup, felt cold, experienced surprise, or navigated a physical space. Its "knowledge" that coffee is hot is a statistical correlation between tokens in training data, not a representation grounded in sensory or causal experience of the world. This lack of grounding, the argument goes, makes the comparison to human understanding fundamentally misleading.

Supporting evidence comes from the failure modes catalogued in Section 6: systematic generalisation failures, hallucination, brittle performance on adversarially crafted inputs that break statistical patterns. These are characteristic of pattern matchers, not reasoners.

The emergence counterargument

The opposing position holds that if a system reliably performs tasks that require understanding — answering novel questions correctly, writing valid mathematical proofs, explaining its reasoning coherently, passing theory of mind tests — then the distinction between "genuine" understanding and "functional" understanding may be philosophically significant but practically thin.

The empirical record: OpenAI reported that GPT-4 scored around the 90th percentile on a simulated bar exam. It scored above the passing threshold on US Medical Licensing Exam questions. It solves novel analogical reasoning problems. In 2025, an experimental OpenAI reasoning model achieved gold-medal-level performance on International Mathematical Olympiad problems. If these tasks require understanding, and the model performs them, what exactly is the stochastic parrot label explaining?

John Searle's Chinese Room argument (1980), which holds that a system can manipulate symbols according to rules without understanding their meaning, predates LLMs but is directly relevant. A standard response, often called the systems reply, is that understanding may belong to the system as a whole even if no individual component has it. This debate remains philosophically unresolved.

A pragmatic framing for 2025

For most practical decisions, the question of whether LLMs "truly understand" matters less than accurately characterising what they can and cannot do. A system that reliably solves X and reliably fails at Y is useful in contexts requiring X and dangerous in contexts requiring Y — regardless of the philosophical status of its internal representations.

Where the question has practical stakes: AI safety and alignment. If an LLM does not understand its instructions but only pattern-matches to them, its behaviour in genuinely novel situations — situations structurally unlike its training distribution — is less predictable than if it has learned abstract principles that generalise. Whether LLMs have learned generalisable principles or surface-pattern correlations is actively debated in alignment research and has direct implications for how much trust to place in model behaviour in high-stakes deployments.

The honest answer in 2025

We do not have consensus scientific tools to determine whether and to what degree LLMs have internal representations that deserve the label 'understanding.' The question is live, contested, and unresolved. Anyone who tells you it is definitively settled — in either direction — is overstating the evidence.


Safety, Risks, and Responsible Use

LLMs are powerful tools with real failure modes and real risks. Neither dismissal ("just a tool, no different from a search engine") nor catastrophism ("existential threat requiring immediate halt") serves careful thinking. This section addresses the risks that are already manifesting or well-evidenced in the near term.

Disinformation and synthetic text

LLMs dramatically lower the cost of producing persuasive, targeted written disinformation at scale: fabricated news articles, fake reviews, astroturfed social media content, and impersonation of real people's writing styles. Such campaigns were previously limited by the human cost of writing at scale; that constraint no longer holds.

Documented deployment: multiple influence operations reported by Meta Threat Intelligence, EU DisinfoLab, and the Stanford Internet Observatory have used LLM-generated content in political influence campaigns. The 2024 global election cycle saw significant activity. Provenance standards such as C2PA, watermarking schemes, and detection tools exist but are imperfect and not universally deployed.

Hallucination-caused professional harm

The legal, medical, and financial professions have all documented incidents of professional harm from acting on hallucinated LLM outputs without verification. The Mata v. Avianca case is the canonical legal example. Medical misinformation from LLMs is harder to document at individual case level but represents a clear population-level risk given the scale of people using AI for health queries.

The mitigation imperative: LLMs should not be used as primary information sources for high-stakes professional decisions without expert verification of specific factual claims. This is not a temporary limitation pending better models — it is a fundamental characteristic of the architecture that grounding through RAG reduces but does not eliminate.

Bias and discrimination at scale

When LLMs are integrated into consequential decisions such as CV screening, credit assessment, triage, and bail recommendations, training data biases become institutionalised at scale. Performance disparities by race, gender, and nationality in LLM outputs are well documented. Deploying LLMs in high-stakes decisions without bias audits, human oversight, and explicit accountability mechanisms is a known, avoidable risk.

Dual-use risks

LLMs can provide detailed information on dangerous topics — synthesis of hazardous substances, social engineering scripts, cyberattack code — to users who could not previously have easily accessed it. Safety training reduces (but does not eliminate) this risk. No frontier model is fully robust to determined adversarial prompting — jailbreaking techniques remain an active attack surface. This is a genuine arms-race dynamic with no complete defensive solution.

Labour displacement

Near-term displacement is documented in specific task categories: entry-level writing tasks (content marketing, routine correspondence, first-draft documentation), certain code review functions, basic customer service, data entry and extraction. These tasks are not disappearing immediately, but the time required to perform them is declining, which implies declining demand for human labour on them at the margin.

The augmentation nuance: in many domains, LLMs increase the productivity of workers rather than replacing them. Developers using Copilot produce more code rather than being replaced by it. Translators using LLMs produce more work per day. But content writers competing with AI-generated marketing copy face structural wage pressure. The distribution of effects is highly uneven across occupations and skill levels.

Environmental cost

Training frontier LLMs consumes substantial energy. GPT-3 training was estimated at roughly 1,300 MWh of electricity, with associated emissions estimated at around 550 tonnes of CO2-equivalent. Inference, which means running the model for every user query across millions of daily interactions, adds ongoing consumption. Data centre water use for cooling is a significant and underreported resource cost.

The efficiency trend is positive: newer architectures (Mixture of Experts, quantisation, distillation) achieve comparable performance at lower energy cost per unit of capability. But total consumption is growing faster than efficiency gains as deployment scales — making the net environmental trajectory uncertain.


Where LLMs Are Headed — Near-Term Trajectory

Multimodal by default

The next generation of LLMs processes text, images, audio, and video within a single model, not as separate modules but as unified representations trained jointly. GPT-4o and Gemini 1.5 already demonstrate this, and Claude 3.5 handles text and images together. The trajectory is toward any-to-any capability: accepting input and generating output in any combination of modalities. See our companion guide, What Is Multimodal AI?, for the full treatment.

Longer context and better memory

Context windows have grown from 1,024 tokens in GPT-2 (2019) to 1,000,000 tokens in Gemini 1.5 Pro (2024), a roughly 1,000× expansion in five years. The quadratic cost of attention is being addressed through FlashAttention, sparse attention variants, and hybrid state-space-transformer architectures. External memory systems (RAG, vector databases, long-term memory stores) extend effective context beyond what the model processes natively.
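The arithmetic behind the quadratic bottleneck is worth seeing once. The sketch below counts entries in a naive full self-attention score matrix, one per pair of tokens; the helper name is illustrative. Techniques such as FlashAttention avoid storing this matrix in full, so treat this as a picture of the scaling pressure, not a description of any production system.

```python
def attention_score_entries(context_tokens):
    """Entries in one naive self-attention score matrix: one per token pair."""
    return context_tokens ** 2

gpt2_window = 1_024
gemini_window = 1_000_000

# How much longer the window is, and how much larger the score matrix becomes.
growth = gemini_window / gpt2_window
score_growth = (attention_score_entries(gemini_window)
                / attention_score_entries(gpt2_window))

print(f"context growth: {growth:,.0f}x")
print(f"pairwise score growth: {score_growth:,.0f}x")
```

A window about 977 times longer implies roughly 950,000 times as many pairwise scores per attention layer, which is why longer context required new attention methods rather than only more hardware.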

Agentic AI: from answering to acting

LLMs are increasingly deployed as agents — systems that take multi-step actions in the world: browsing the web, writing and executing code, sending emails, booking appointments, controlling software interfaces. This requires planning, tool use, error recovery, and sustained coherent behaviour over long horizons. Current agentic systems are impressive in controlled demonstrations and brittle in real-world deployment — reliability is the central unsolved challenge.

The stakes shift significantly with agency: a model that can only generate text makes mistakes that a human can catch before they propagate. An agent that takes actions in the world — sending emails, executing transactions, modifying files — makes mistakes that may be irreversible. The reliability bar required for safe agency is substantially higher than for assisted writing.
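One common safeguard is to place a human approval step in front of irreversible actions. The sketch below is a minimal illustration: the tool names, the fixed plan, and the approve callback are all hypothetical. In a real agent the next step would come from the model at each turn, not from a hard-coded list.

```python
# Hypothetical set of actions that cannot be undone once taken.
IRREVERSIBLE = {"send_email", "execute_payment", "delete_file"}

def run_agent(plan, tools, approve):
    """Execute planned steps in order; irreversible ones need approval first."""
    log = []
    for tool_name, argument in plan:
        if tool_name in IRREVERSIBLE and not approve(tool_name, argument):
            log.append((tool_name, "skipped: not approved"))
            continue
        try:
            log.append((tool_name, tools[tool_name](argument)))
        except Exception as error:  # a real agent would attempt recovery here
            log.append((tool_name, f"failed: {error}"))
            break
    return log

# Stand-in tools that only return strings.
tools = {
    "search": lambda query: f"results for {query}",
    "send_email": lambda to: f"sent to {to}",
}
plan = [("search", "flight prices"), ("send_email", "boss@example.com")]

# The approver here rejects everything, so the email is never sent.
log = run_agent(plan, tools, approve=lambda name, arg: False)
print(log)
```

The read-only search runs freely while the email waits for a person. The gate costs speed, which is the trade-off between autonomy and reliability described above.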

Smaller, faster, on-device

Quantisation, distillation, and architecture improvements are delivering capable models to consumer hardware. Examples include Apple Intelligence (on-device models on iPhone and Mac), Gemini Nano (Pixel phones), and quantised Llama models on consumer GPUs. On-device inference means privacy (data does not leave the device), offline capability, and lower latency. This expands LLM access to contexts where cloud round-trip latency, cost, or data sensitivity is prohibitive.
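Quantisation itself is a simple idea: store each weight in fewer bits and accept a small rounding error. The sketch below shows symmetric 8-bit quantisation of a handful of invented weights using one shared scale. Production schemes are more elaborate (per-channel or per-group scales, 4-bit formats), so this is a minimal illustration only.

```python
def quantize_int8(weights):
    """Symmetric quantisation: map floats to integers in [-127, 127]."""
    # One shared scale for the whole list; fall back to 1.0 if all weights are zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.82, -1.27, 0.003, 0.5, -0.41]  # invented example values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(quantized)
print(max_error)
```

Each weight now fits in one byte instead of four (for 32-bit floats), and the rounding error per weight is at most half the scale. Small weights, such as 0.003 here, lose the most relative precision.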

Extended reasoning and test-time compute

OpenAI's o1 and o3 models demonstrated that generating an extended chain-of-thought reasoning trace before answering — spending more compute at inference time — dramatically improves performance on hard reasoning tasks. Performance scales with inference compute as well as training compute. This "think before you answer" paradigm is a qualitatively new axis of capability improvement, distinct from scaling model parameters, and is spreading across frontier model families.
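The general principle, that more inference compute can buy accuracy, can be illustrated with a different and much simpler technique: sampling several answers and taking a majority vote, often called self-consistency. This is not how o1 or o3 work internally. The solver below is a random stand-in for a model, and its 60% accuracy is an assumption.

```python
import random
from collections import Counter

def noisy_solver(rng, correct_answer=42, p_correct=0.6):
    """Stand-in for one sampled model answer: right with probability p_correct."""
    if rng.random() < p_correct:
        return correct_answer
    return rng.choice([41, 43, 44])

def majority_vote(rng, n_samples):
    """Sample n answers and return the most common one."""
    answers = [noisy_solver(rng) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

def accuracy(n_samples, trials=2000, seed=0):
    """Fraction of trials in which the voted answer is correct."""
    rng = random.Random(seed)
    hits = sum(majority_vote(rng, n_samples) == 42 for _ in range(trials))
    return hits / trials

single = accuracy(1)   # one sample per question
voted = accuracy(15)   # fifteen samples per question, then vote
print(single, voted)
```

Fifteen samples cost fifteen times the compute of one, and the voted answer is correct far more often than a single sample. The trade only works when errors are scattered; a model that is consistently wrong in the same way gains nothing from voting.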


Conclusion — What LLMs Are, and What They Are Not

Walk back through what we've covered. An LLM is a transformer trained on next-token prediction at scale. That simple objective, applied through billions of parameters and trillions of training tokens, produces a system capable of language, reasoning, coding, and creative work that would have seemed impossible to the researchers building LSTMs a decade ago.

It is also a system that hallucinates facts, cannot verify its own outputs, has no persistent memory, fails systematically on novel algorithmic tasks, and encodes the biases of its training data. These are not bugs to be fixed in the next version; they are characteristics of the architecture and training regime. Some will be mitigated by better engineering; others are more fundamental.

The thesis in two sentences

LLMs are the most capable text-processing and text-generating systems ever built. They are not minds, not oracles, and not reliable sources of factual truth — they are extraordinarily powerful tools whose outputs require human judgment to use well, particularly in high-stakes contexts.


The practical framework: use LLMs aggressively for tasks where the cost of occasional errors is manageable and the value of speed and scale is high — first drafts, code prototypes, research summaries, brainstorming, translation, accessibility tools. Use them cautiously for tasks where errors are consequential — medical advice, legal conclusions, factual claims in published work, high-stakes decisions. Verify specific factual claims independently. Never submit AI-generated professional work without expert review.

The broader point: LLMs are becoming infrastructure. They are embedded in the software tools, search engines, productivity suites, and communication platforms that hundreds of millions of people use daily — often invisibly. AI literacy — understanding what these systems are, what they can do, and where they fail — is becoming a basic competency for participating effectively in professional and civic life. This guide is one contribution to that literacy.

The next article in this series covers how LLMs are deployed as agents — which is where the capabilities and the risks both escalate sharply.