AI · LLM

How LLMs Actually Work: From Transformer to Reasoning Models

A structural deep dive into LLMs — Transformer, Self-Attention, the training pipeline, how they differ from earlier AI, and where the next decade is headed.

Who should read this

Summary: LLMs are neither a “giant search engine,” nor “just a statistical autocomplete,” nor “something that thinks like a person.” This article is for readers who want to understand LLMs as a structure — starting from the Transformer architecture, Self-Attention, and the full training pipeline; then comparing LLMs against classical ML and early deep learning; then looking at the evolution path through multimodal, agentic, and reasoning models.

This piece is written for developers, researchers, and decision makers who want to understand LLMs beyond casual use — from a design, adoption, and critique perspective. It keeps math minimal and focuses on the structure of the ideas and the generational differences between them.


1. Why start with “structure”?

Since ChatGPT, the term Large Language Model has become familiar, but public understanding of how it actually works has, if anything, grown more confused. “It’s a giant search engine,” “it’s just statistical autocomplete,” and “it thinks like a human” coexist as claims, despite being incompatible.

This article aims directly for the middle ground. Understanding an LLM means looking at its internal architecture and its training mechanics. And to put that understanding in context, it’s worth asking explicitly: how does an LLM differ from prior generations of AI (classical ML, early deep learning), and where is the next decade headed?


2. A structural tour of the LLM

2.1 The foundation: the Transformer

Virtually all modern LLMs sit on the Transformer, proposed by Google researchers in the 2017 paper “Attention Is All You Need.” GPT, Claude, Gemini, and the LLaMA family all share this skeleton.

The core idea of the Transformer is simple:

“Don’t process a sentence sequentially. Let every word look at every other word, all at once.”

RNNs and LSTMs read one token at a time. That approach is slow (no parallelization) and suffers from the long-term dependency problem — earlier tokens get forgotten as sentences grow. The Transformer solves this with a mechanism called Self-Attention.

2.2 Self-Attention: the heart of the LLM

Self-Attention computes, for every token, how related it is to every other token in the sequence.

Each token is projected into three vectors:

  • Query (Q): “what am I looking for?”
  • Key (K): “what identity do I carry?”
  • Value (V): “what information do I offer?”

Via dot products between queries and keys, scaled and passed through a softmax, every token ends up with a weighted average of the value vectors in its context. For example, in “He went to the bank. Then he withdrew money,” whether “bank” means financial institution or riverbank is resolved through attention between the two words — in a causal decoder, the later token “money” attends back to “bank,” so the representations used for subsequent predictions carry the financial sense.
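The mechanism above fits in a few lines of plain Python. This is a deliberately minimal sketch — toy 2-dimensional vectors, a single head, no batching and no causal mask:

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(Q, K, V):
    """Scaled dot-product attention for one head.
    Q, K, V are lists of vectors, one per token."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # How similar is this token's query to every key?
        scores = [dot(q, k) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)          # attention weights sum to 1
        # Weighted average of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out

# Three toy tokens with 2-d query/key/value vectors.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ctx = attention(Q, K, V)
```

Because each output row is a convex combination of the value vectors, every result stays inside the range spanned by the values — attention mixes information, it never invents it.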

Running many of these in parallel gives you Multi-Head Attention, where each head learns a different relation type (syntactic, semantic, coreferential, and so on).

2.3 Tokenization and embedding

LLMs don’t handle raw characters or words. Input first goes through a tokenizer that splits text into IDs drawn from a fixed vocabulary of thousands to hundreds of thousands of tokens. BPE (Byte Pair Encoding) and SentencePiece are the common schemes. “understanding” might become subwords like “under,” “stand,” “ing.”
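The BPE idea — repeatedly merge the most frequent adjacent pair — can be sketched on a toy corpus. Real tokenizers learn tens of thousands of merges from huge corpora and store them as a fixed merge table; this is only the core loop:

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count every adjacent pair and return the most common one.
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def bpe_merge(tokens, pair):
    # Replace every occurrence of `pair` with a single merged token.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from individual characters and apply a few merges.
tokens = list("low lower lowest")
for _ in range(4):
    tokens = bpe_merge(tokens, most_frequent_pair(tokens))
```

After a few merges, frequent character runs like “low” collapse into single subword tokens, which is exactly why common words become one token while rare words split into pieces.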

Each token ID is then mapped into an embedding vector — typically thousands of dimensions (e.g. 4,096 or 12,288) of real numbers. The space is learned such that semantically similar tokens sit near each other. Practically, what the LLM manipulates as “language” is actually trajectories of points in a high-dimensional geometric space.
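“Near each other” is measurable with cosine similarity between embedding vectors. The 4-dimensional vectors below are invented purely for illustration — real embeddings have thousands of dimensions and are learned, not hand-written:

```python
import math

def cosine(a, b):
    # Cosine of the angle between two vectors: 1.0 = same direction.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical embeddings: related words point in similar directions.
king   = [0.90, 0.80, 0.10, 0.20]
queen  = [0.88, 0.82, 0.15, 0.21]
banana = [0.10, 0.20, 0.90, 0.70]

sim_related   = cosine(king, queen)
sim_unrelated = cosine(king, banana)
```

In a trained model the same comparison is what makes “king” and “queen” interchangeable in many contexts while “banana” is not.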

2.4 Positional encoding

Self-Attention is order-invariant on its own. To distinguish “the dog bit the man” from “the man bit the dog,” positional information has to be injected separately. Early Transformers used sinusoidal functions; modern LLMs mostly use relative positional schemes like RoPE (Rotary Positional Embedding) or ALiBi. This turns out to be critical for scaling to long contexts.
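For intuition, here is the original sinusoidal scheme in a few lines. Note this is the early approach; RoPE and ALiBi, which modern models favor, work differently and are not shown here:

```python
import math

def sinusoidal_pe(pos, d_model):
    """Positional encoding from the original Transformer paper:
    pairs of (sin, cos) at geometrically decreasing frequencies."""
    pe = []
    for i in range(d_model // 2):
        freq = 1.0 / (10000 ** (2 * i / d_model))
        pe.append(math.sin(pos * freq))
        pe.append(math.cos(pos * freq))
    return pe

pe0 = sinusoidal_pe(0, 8)   # encoding for position 0
pe5 = sinusoidal_pe(5, 8)   # encoding for position 5
```

Each position gets a unique, deterministic vector that is added to the token embedding, so the attention layers can tell “dog bit man” from “man bit dog.”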

2.5 Stacking blocks: depth creates abstraction

A single Transformer block looks roughly like this:

  1. Multi-Head Self-Attention
  2. Residual connection + Layer Normalization
  3. Feed-Forward Network (FFN) — typically a 2-layer MLP that expands dimensionality ~4× before projecting back
  4. Residual connection + Layer Normalization

These blocks stack tens to hundreds deep. GPT-3 ran 96 layers; today’s frontier models go further. Depth empirically produces abstraction: lower layers learn syntactic patterns, middle layers handle semantic relations, and upper layers handle world knowledge and reasoning patterns.
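The four-step wiring can be sketched structurally. The attention and FFN below are toy stand-ins (an identity map and a scalar multiply), chosen only to make the residual + LayerNorm plumbing runnable; this uses the post-norm ordering of the original paper:

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one token vector to zero mean and unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    # Residual connection: elementwise sum of input and sublayer output.
    return [x + y for x, y in zip(a, b)]

def transformer_block(x, attn, ffn):
    """Steps 1-4: attention -> residual + norm -> FFN -> residual + norm."""
    x = layer_norm(add(x, attn(x)))
    x = layer_norm(add(x, ffn(x)))
    return x

# Stand-in sublayers so the structure runs end to end.
attn = lambda x: x                       # placeholder "attention"
ffn = lambda x: [4.0 * v for v in x]     # placeholder feed-forward

out = transformer_block([1.0, 2.0, 3.0, 4.0], attn, ffn)
```

Stacking is then just function composition: feed `out` into the next block, tens to hundreds of times.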

2.6 The four-stage training stack: Pre-training → SFT → RLHF → Reasoning

An LLM’s “intelligence” is not the product of one training run. It’s a stack.

(1) Pre-training. Trillions of tokens from the web, books, papers, and code are used for the simple task of next-token prediction. This is where the model internalizes grammar, factual knowledge, reasoning patterns, code syntax, and multilingual capability. It’s the most expensive stage — thousands of GPUs over months.

(2) Supervised Fine-Tuning (SFT). Tens of thousands to hundreds of thousands of human-written “good question / good answer” pairs teach instruction following.

(3) RLHF / RLAIF. Reinforcement learning from human (or AI) preferences aligns the model — shaping it toward “helpful, harmless, honest” behavior. More efficient variants like DPO and Constitutional AI are increasingly standard.

(4) Post-training: Reasoning (the newest stage). Since OpenAI’s o1 family and DeepSeek-R1, generating long chain-of-thought traces and refining them with reinforcement learning is becoming the default. This is the shift from an instant-answer machine to a machine that thinks — and it’s the premise behind harness engineering, which controls the full agent runtime.
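The stage (1) objective is nothing more than cross-entropy on the next token. A toy example, with a hypothetical 4-word vocabulary (the words and probabilities are made up for illustration):

```python
import math

def cross_entropy(probs, target_id):
    # Loss is -log of the probability assigned to the true next token.
    return -math.log(probs[target_id])

# Model's distribution over a 4-token vocabulary for "the cat sat on the ___"
# (hypothetical vocab: ["mat", "dog", "sun", "car"]).
probs = [0.70, 0.15, 0.10, 0.05]

loss_confident = cross_entropy(probs, 0)  # true token was "mat"
loss_surprised = cross_entropy(probs, 3)  # true token was "car"
```

Minimizing this loss over trillions of tokens is the entire pre-training signal; everything the model “knows” is compressed into whatever helps it guess the next token better.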


3. What’s actually different from earlier AI

Calling LLMs “just a bigger deep-learning model” is half right and half wrong. The differences land at different layers:

3.1 Against classical ML

Linear regression, SVMs, decision trees, random forests — these share a few properties:

  • Feature engineering is required. Domain experts hand-design what to look at.
  • One model per task. Spam classifiers do spam; price predictors do prices.
  • Performance saturates with data. Beyond a point, more data stops helping.

LLMs flip all three. They discover features on their own, one model handles translation, summarization, coding, reasoning, and dialogue, and performance keeps improving with more data and parameters — the Scaling Law.
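The Scaling Law can be made concrete with a Chinchilla-style power law in parameters N and training tokens D. The constants below are illustrative — roughly the fitted values reported by Hoffmann et al. (2022), but treat them as assumptions of this sketch, not authoritative numbers:

```python
def scaling_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted loss: an irreducible floor E plus two power-law terms
    that shrink as parameters (N) and data (D) grow."""
    return E + A / N**alpha + B / D**beta

# Same data budget, 10x more parameters -> lower predicted loss.
loss_small = scaling_loss(N=1e9, D=1e11)
loss_large = scaling_loss(N=1e10, D=1e11)
```

The key property is that the curve keeps bending down rather than saturating — which is exactly the behavior classical ML and early deep learning did not show.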

3.2 Against early deep learning (CNN, RNN, LSTM)

Deep learning after AlexNet (2012) removed manual feature engineering — a revolution. But models stayed task-specific. Image-classification CNNs couldn’t translate; seq2seq translators couldn’t understand images.

LLMs shift three things:

(1) Generality. A single pretrained model, with only prompt changes, performs hundreds of tasks. What made this possible is in-context learning — learning a new task from examples in the prompt, without touching the weights.

(2) Emergent abilities. Past a certain scale (roughly 10B–100B parameters), abilities that are completely absent in smaller models — multi-step arithmetic, logical reasoning, code debugging — appear abruptly. The phenomenon itself still lacks a full theoretical explanation.

(3) Implicit world modeling. Predicting the next token at very high quality requires modeling, to some degree, how the world described by the text actually behaves. Recent work (Othello-GPT, Anthropic’s interpretability research) shows that LLMs develop internal structures resembling spatial concepts, temporal concepts, and even rough theory-of-mind representations — all emergent.

3.3 Summary: a paradigm shift

| Axis | Classical ML | Early deep learning | LLM |
| --- | --- | --- | --- |
| Features | Hand-designed | Auto-extracted | Auto + general |
| Scope | Single task | Single domain | General purpose |
| Training | Per-task supervised | Per-task supervised | Pre-training + fine-tuning + RLHF |
| Data scale | MB–GB | GB–TB | TB–PB |
| Performance curve | Early saturation | Mid saturation | Scaling Law (keeps rising) |
| Usage | Retraining | Transfer learning | Prompting |

4. Where it’s headed — a view on the next decade

Today’s LLM is not a finished form. On the technology curve, it’s arguably still early. Here’s the view on the next 5–10 years. Each direction is tightly coupled to the hardware paradigm shift of the LLM era — where the model performance curve bends is ultimately decided by silicon, memory, and power constraints.

4.1 Full multimodal integration

GPT-4o, Gemini, and Claude handle images, audio, and video today, but the structure is still text-centric with other modalities attached. Going forward, text, image, speech, video, 3D, and sensor data will be processed in the same token space — native multimodality as the default. The stakes are not convenience: they’re about giving LLMs physical intuition and spatial reasoning that language alone cannot deliver.

4.2 Agentic AI

The default mode of LLMs so far has been single-turn Q&A. Going forward, they extend into agents that actively use tools, plan, and carry out long-horizon tasks. Products like Claude Code and Cowork are already pointing the direction. How agents reshape team structure and development process is covered separately in Software Development Process and Collaboration in the AI Agent Era.

4.3 Reasoning models enter the mainstream

OpenAI’s o1 and o3, DeepSeek-R1, and Claude’s extended thinking all demonstrated that “think longer, solve better” holds for LLMs. Call it test-time compute scaling. Alongside pretraining scale, inference-time compute is becoming a primary performance variable. Math, science, and coding — domains that require rigor — will accelerate fastest along this axis.

4.4 Efficiency: MoE, quantization, distillation

Running GPT-4-class models on a home GPU is no longer science fiction. Mixture of Experts (MoE) (only a subset of parameters activated per token), quantization (down to 4-bit and 2-bit), and distillation together keep improving the size-vs-performance frontier year over year. The evidence: Claude Haiku, Gemini Flash, and Llama 3.2 now match the frontier models of two years ago.
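Round-to-nearest symmetric quantization, the simplest of these techniques, fits in a few lines. Production schemes add per-group scales, outlier handling, and calibration — none of which this sketch attempts:

```python
def quantize(weights, bits=4):
    """Symmetric round-to-nearest quantization: map floats to small
    signed integers plus one shared scale factor."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate floats from the integers and the scale.
    return [v * scale for v in q]

w = [0.12, -0.53, 0.91, -0.07, 0.33]        # toy weight values
q, scale = quantize(w, bits=4)
w_hat = dequantize(q, scale)
```

Storing a 4-bit integer per weight instead of a 16- or 32-bit float is where the memory savings come from; the reconstruction error is bounded by half the scale step.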

4.5 Long-term memory and continual learning

A fundamental limitation of today’s LLMs is that they don’t learn new things after training. “Memory” exists only temporarily inside the context window. Going forward, per-user, per-session long-term memory — beyond RAG — and continual learning techniques for safely updating weights will be core research directions.

4.6 Interpretability and alignment

The stronger these models get, the more it matters to know why they answered what they did. Anthropic’s mechanistic interpretability work, and sparse autoencoder research that decomposes internal features, show that looking inside the LLM is becoming feasible. Long-term, this is the foundation of AI safety and the precondition for models to be trustworthy partners.

4.7 Extension into the physical world — Embodied AI

Embodied AI, where an LLM serves as the brain of a robot (Google RT-2, Figure 01, Physical Intelligence), extends models “from understanding the world through language” to “acting inside the world.” This is likely one of the biggest inflection points in AI history. The moment abstract reasoning acquired through language combines with physical manipulation, the concept of “a robot” as we’ve imagined it gets redefined.


5. Conclusion: where LLMs stop being a tool

LLMs live somewhere between the cynical view (“giant autocomplete”) and the optimistic one (“precursor to AGI”). Technically, they are emergent systems built on a simple computational structure (the Transformer) layered with astronomical data and training technique.

If prior-generation AI was “a tool that solves a specific problem,” LLMs are closer to “a general-purpose engine that intervenes in almost any intellectual task through language.” That difference is qualitative (a different kind), not quantitative (just bigger).

The evolution ahead moves less toward bigger models and more toward smarter training, deeper reasoning, safer alignment, and more physical extension. What’s worth watching isn’t benchmark scores so much as the structural question: how far, and in what way, do these systems enter human cognitive activity?

To understand the structure of an LLM, in the end, is to understand the shape of the intellectual partner we’ll share the next decade with.


Based on the Transformer architecture, the modern LLM training pipeline, and the main research trends through 2025–2026.