The Transformer Architecture: A Technical Deep Dive for Practitioners

Every large language model you have used, from ChatGPT to Claude to Gemini, owes its existence to a single paper published by eight Google researchers in 2017. "Attention Is All You Need" introduced the transformer architecture, and in doing so, rendered the previous generation of sequence models, the recurrent neural networks and LSTMs that had dominated NLP for a decade, essentially obsolete. Understanding how transformers work is no longer optional for anyone building with or building on top of modern AI systems.

This article is written for practitioners: engineers and technical leaders who need to understand the architecture deeply enough to make informed decisions about model selection, fine-tuning, deployment, and debugging. We will walk through each component of the transformer, explain why it exists, and show how the pieces fit together. Where mathematical notation helps, we use it. Where pseudocode is clearer, we use that instead.

Why Transformers Replaced RNNs

To appreciate the transformer's design, you need to understand the problem it solved. Recurrent neural networks process sequences one token at a time, maintaining a hidden state that accumulates information as it reads through the input. This sequential processing creates two fundamental problems.

The first is the vanishing gradient problem. As the hidden state passes through many time steps, gradients used for training become exponentially small, making it difficult for the model to learn relationships between tokens that are far apart in the input. LSTMs and GRUs mitigated this with gating mechanisms, but they did not eliminate it. In practice, even well-tuned LSTMs struggle to maintain useful information across more than a few hundred tokens.

The second problem is parallelization. Because each time step depends on the output of the previous one, RNNs cannot process multiple tokens simultaneously. Training is inherently sequential, which means it is slow. When Google's researchers were training on thousands of GPUs, this sequential bottleneck became intolerable.

The transformer addresses both problems with a single insight: replace recurrence with attention. Instead of processing tokens sequentially and hoping information survives the journey through the hidden state, allow every token to directly attend to every other token in the sequence. This is the core idea, and everything else in the architecture supports it.

The High-Level Architecture

The original transformer follows an encoder-decoder structure. The encoder reads the input sequence and produces a rich representation of it. The decoder generates the output sequence one token at a time, attending both to the encoder's representation and to the tokens it has already generated.

Modern large language models typically use only the decoder half of this architecture (GPT, Llama, Mistral) or only the encoder half (BERT, RoBERTa). The full encoder-decoder design remains common in sequence-to-sequence tasks like machine translation and summarization (T5, BART). But the core components (self-attention, feed-forward layers, layer normalization, and positional encoding) appear in all variants.

Let us examine each component in detail.

Input Embeddings

Before the transformer can process text, it needs to convert discrete tokens into continuous vectors. This is done through an embedding layer: a lookup table that maps each token in the vocabulary to a dense vector of dimension d_model. In the original paper, d_model was 512. In modern large models, it ranges from 4,096 (Llama 3 8B) to 12,288 (GPT-4, estimated).

The embedding is learned during training. Initially random, these vectors gradually organize themselves so that semantically related tokens end up near each other in the embedding space. The word "king" will be closer to "queen" than to "bicycle," and the relationship between "king" and "queen" will be roughly parallel to the relationship between "man" and "woman."

In pseudocode, the embedding step looks like this:

// Token embedding
// vocab_size: number of unique tokens (e.g., 32,000 for Llama)
// d_model: embedding dimension (e.g., 4096)

embedding_matrix = random_init(vocab_size, d_model)

function embed(token_ids):
    // token_ids: array of integer IDs, shape [seq_len]
    // returns: array of vectors, shape [seq_len, d_model]
    return embedding_matrix[token_ids] * sqrt(d_model)

The multiplication by the square root of d_model is a scaling factor from the original paper. It ensures that the embedding magnitudes are in a reasonable range relative to the positional encodings that will be added next.
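The lookup-and-scale step above can be sketched as runnable NumPy. The sizes here are toy values for illustration, not any real model's:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64   # toy sizes for illustration

# each row is one token's embedding vector
embedding_matrix = rng.normal(0.0, 1.0 / np.sqrt(d_model), (vocab_size, d_model))

def embed(token_ids):
    # look up each token's row, then apply the sqrt(d_model) scaling
    return embedding_matrix[token_ids] * np.sqrt(d_model)

vectors = embed(np.array([5, 42, 7]))
print(vectors.shape)  # (3, 64)
```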

Positional Encoding

Here is a subtle but critical problem: the transformer's attention mechanism is permutation-invariant. If you scramble the order of input tokens, the attention computation produces exactly the same result (just scrambled in the same way). The model has no inherent notion of position, which means "the cat sat on the mat" would be indistinguishable from "mat the on sat cat the."

Positional encoding fixes this by injecting information about each token's position in the sequence. The original paper used fixed sinusoidal functions:

function positional_encoding(position, d_model):
    encoding = zeros(d_model)
    for i in range(0, d_model, 2):
        encoding[i]     = sin(position / (10000 ^ (i / d_model)))
        encoding[i + 1] = cos(position / (10000 ^ (i / d_model)))
    return encoding

// Final input to the transformer:
input = embed(token_ids) + positional_encoding(positions, d_model)

The sinusoidal approach has an elegant property: the positional encoding for position p + k can be expressed as a linear function of the encoding for position p. This means the model can potentially learn to attend to relative positions rather than absolute ones.

Modern models have largely moved to learned positional embeddings (GPT-2, GPT-3) or rotary positional embeddings, known as RoPE (Llama, Mistral, Qwen). RoPE encodes relative position information directly into the attention computation by rotating the query and key vectors, which allows models to generalize to sequence lengths longer than those seen during training. This is one reason why modern models can handle context windows of 128K or even 1M tokens.
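A minimal NumPy sketch of the rotation idea follows. It uses interleaved (even, odd) dimension pairs as in the original RoPE paper; Llama's implementation instead pairs dimension i with i + d/2, and this toy version ignores batching and heads. The demo at the bottom shows the property that matters: the query-key dot product depends only on the relative offset between positions.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate each (even, odd) pair of dims of x by a position-dependent angle."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)        # one frequency per pair
    angles = positions[:, None] * theta[None, :]     # [seq_len, d/2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(1, 8)), rng.normal(size=(1, 8))

# the score depends only on the relative offset (5 - 3 == 12 - 10)
s1 = rope(q, np.array([3])) @ rope(k, np.array([5])).T
s2 = rope(q, np.array([10])) @ rope(k, np.array([12])).T
print(np.allclose(s1, s2))  # True
```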

Self-Attention: The Core Mechanism

Self-attention is the operation that gives the transformer its power. It allows each token in a sequence to look at every other token and decide how much attention to pay to each one. The result is a new representation of each token that incorporates contextual information from the entire sequence.

The mechanism works through three learned projections: queries, keys, and values. Think of it this way: each token generates a query ("what am I looking for?"), a key ("what do I contain?"), and a value ("what information do I provide?"). Attention scores are computed by comparing each query against all keys, and these scores determine how much of each value to include in the output.

The computation proceeds as follows:

function self_attention(X, W_Q, W_K, W_V):
    // X: input matrix, shape [seq_len, d_model]
    // W_Q, W_K, W_V: learned weight matrices, shape [d_model, d_k]
    
    Q = X @ W_Q    // Queries, shape [seq_len, d_k]
    K = X @ W_K    // Keys,    shape [seq_len, d_k]
    V = X @ W_V    // Values,  shape [seq_len, d_k]
    
    // Compute attention scores
    scores = (Q @ K.transpose()) / sqrt(d_k)  // shape [seq_len, seq_len]
    
    // Apply softmax to get attention weights
    weights = softmax(scores, dim=-1)  // each row sums to 1
    
    // Weighted sum of values
    output = weights @ V  // shape [seq_len, d_k]
    
    return output

The division by the square root of d_k is critical for training stability. Without it, when d_k is large, the dot products between queries and keys can become very large in magnitude, pushing the softmax function into regions where its gradients are extremely small. This scaling factor keeps the dot products in a range where the softmax produces useful gradients.

The attention weights matrix has shape [seq_len, seq_len]. Entry (i, j) represents how much token i attends to token j. This matrix is dense: every token attends to every other token, which gives the transformer its ability to capture long-range dependencies but also creates its quadratic computational cost. Processing a sequence of length n requires O(n^2) attention computations, which is why context length has historically been limited and why efficient attention variants are an active area of research.

Causal Masking in Decoder Models

In decoder-only models like GPT, a crucial modification is applied: causal masking. When generating text, a token should only be able to attend to tokens that came before it, not tokens that come after. Otherwise, the model would be "cheating" by looking at the answer while generating it.

This is implemented by setting the upper-triangular portion of the attention scores matrix to negative infinity before the softmax, effectively zeroing out attention to future tokens:

function causal_self_attention(X, W_Q, W_K, W_V):
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    
    scores = (Q @ K.transpose()) / sqrt(d_k)
    
    // Causal mask: -infinity at strictly upper-triangular positions (j > i,
    // i.e., future tokens); the diagonal stays unmasked so each token can
    // still attend to itself
    mask = strictly_upper_triangular_matrix(seq_len, value=-infinity)
    scores = scores + mask
    
    weights = softmax(scores, dim=-1)
    output = weights @ V
    
    return output
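In NumPy the mask is a single `np.triu` call. Feeding uniform (all-zero) scores through it makes the effect visible: row i spreads its attention evenly over tokens 0 through i and gives exactly zero weight to the future. This is a toy illustration, not tied to any particular model:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
# -inf strictly above the diagonal: entry (i, j) is masked whenever j > i
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

weights = softmax(np.zeros((seq_len, seq_len)) + mask)
# row i is uniform over the first i+1 positions and zero afterwards
print(weights.round(2))
```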

Multi-Head Attention

A single attention head can only focus on one type of relationship at a time. In a sentence like "The animal didn't cross the street because it was too tired," a single attention head might learn to connect "it" to "animal" based on semantic similarity. But understanding the sentence fully also requires tracking syntactic structure, resolving coreference, and understanding causal relationships. A single set of Q, K, V projections is not enough to capture all of these simultaneously.

Multi-head attention solves this by running multiple attention heads in parallel, each with its own learned projections. Each head can specialize in different types of relationships. In practice, researchers have observed that different heads learn to track different linguistic phenomena: some focus on adjacent tokens (local syntax), others on subject-verb agreement across long distances, and still others on semantic similarity.

function multi_head_attention(X, num_heads):
    d_k = d_model / num_heads
    heads = []
    
    for h in range(num_heads):
        // Each head has its own Q, K, V projections
        head_output = self_attention(X, W_Q[h], W_K[h], W_V[h])
        heads.append(head_output)
    
    // Concatenate all head outputs
    concatenated = concat(heads, dim=-1)  // shape [seq_len, d_model]
    
    // Final linear projection
    output = concatenated @ W_O  // shape [seq_len, d_model]
    
    return output

The original transformer used 8 attention heads with d_k = 64 each (8 x 64 = 512 = d_model). Modern large models use many more: GPT-3 uses 96 heads, and Llama 3 70B uses 64 heads with grouped-query attention, a variant where multiple query heads share a single key-value head to reduce memory usage during inference.
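In real implementations the per-head loop is replaced by one batched computation: project once with full-width matrices, reshape into heads, and attend to all heads in parallel. A NumPy sketch with toy dimensions:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    seq_len, d_model = X.shape
    d_k = d_model // num_heads

    def heads(W):  # project, then split the last dim into [num_heads, seq_len, d_k]
        return (X @ W).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = heads(W_Q), heads(W_K), heads(W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # [heads, seq, seq]
    out = softmax(scores) @ V                          # [heads, seq, d_k]
    concat = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 32))                            # seq_len=6, d_model=32
W_Q, W_K, W_V, W_O = (rng.normal(size=(32, 32)) for _ in range(4))
mha_out = multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads=4)
print(mha_out.shape)  # (6, 32)
```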

Feed-Forward Network

After the attention layer, each token's representation passes through a position-wise feed-forward network. This is a simple two-layer MLP applied independently to each token position:

function feed_forward(x):
    // x: single token representation, shape [d_model]
    hidden = activation(x @ W1 + b1)  // shape [d_ff]
    output = hidden @ W2 + b2          // shape [d_model]
    return output

The inner dimension d_ff is typically 4x the model dimension. So for a model with d_model = 4096, the feed-forward layer expands to 16,384 dimensions and then projects back down. This expansion-contraction pattern allows the model to compute complex nonlinear functions of the attention output.

The choice of activation function has evolved. The original paper used ReLU. GPT-2 switched to GELU. Modern models like Llama and Mistral use SwiGLU, a gated variant that replaces the simple activation with a gating mechanism that further improves performance.

An important insight about the feed-forward layer: research suggests it functions as a key-value memory. The first layer's weight matrix stores "keys" (patterns to match), and the second layer stores "values" (information to retrieve). When a token's representation matches a pattern in the first layer, the corresponding information from the second layer is added to its representation. This is how factual knowledge is stored in language models.

Layer Normalization and Residual Connections

Training deep neural networks is notoriously difficult. As networks get deeper, gradients can vanish or explode, internal statistics shift between layers, and optimization becomes unstable. The transformer addresses these challenges with two mechanisms: residual connections and layer normalization.

Residual connections add the input of each sub-layer to its output. This creates a direct path for gradients to flow backward through the network, mitigating the vanishing gradient problem:

// Residual connection around attention
attention_output = multi_head_attention(x)
x = x + attention_output

// Residual connection around feed-forward
ff_output = feed_forward(x)
x = x + ff_output

Layer normalization normalizes the activations across the feature dimension, stabilizing the distribution of values flowing through the network:

function layer_norm(x, gamma, beta):
    // statistics are computed over the feature dimension of each token
    mean = mean(x)
    variance = variance(x)
    x_normalized = (x - mean) / sqrt(variance + epsilon)
    return gamma * x_normalized + beta

The placement of layer normalization matters. The original paper applied it after each sub-layer ("post-norm"). Modern models almost universally apply it before each sub-layer ("pre-norm"), which improves training stability and allows for deeper networks. Some recent architectures use RMSNorm, a simplified variant that omits the mean subtraction and the learned bias, reducing computation with minimal impact on performance.
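RMSNorm is small enough to write out in full. A NumPy sketch showing the simplification relative to the layer_norm pseudocode above:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # no mean subtraction and no bias: just divide by the root mean square
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gamma * (x / rms)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = rms_norm(x, gamma=np.ones(4))
print(np.sqrt(np.mean(y * y)))  # ~1.0: the output has unit RMS
```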

Putting It All Together: A Complete Transformer Block

function transformer_block(x):
    // Pre-norm + Multi-head attention + Residual
    normed = layer_norm(x)
    attention_out = multi_head_attention(normed)
    x = x + attention_out
    
    // Pre-norm + Feed-forward + Residual
    normed = layer_norm(x)
    ff_out = feed_forward(normed)
    x = x + ff_out
    
    return x

function transformer(token_ids):
    // Embed tokens and add positional information
    x = embed(token_ids) + positional_encoding(positions)
    
    // Pass through N transformer blocks
    for layer in range(N):
        x = transformer_block(x)
    
    // Final layer norm
    x = layer_norm(x)
    
    // Project to vocabulary size for next-token prediction (weight tying:
    // the embedding matrix is reused; some models learn a separate matrix)
    logits = x @ embedding_matrix.transpose()
    
    return logits

The number of transformer blocks N varies dramatically by model size: GPT-2 Small uses 12, GPT-3 uses 96, and Llama 3 405B uses 126. Each additional layer adds capacity for the model to learn more complex representations, but also adds proportionally to compute cost and memory requirements.
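These hyperparameters make back-of-the-envelope sizing easy. Each block carries roughly 4 x d_model^2 attention parameters and 8 x d_model^2 feed-forward parameters (with the 4x expansion), so for GPT-2 Small the arithmetic lands close to the reported 124M; biases, norms, and rounding account for the remainder:

```python
# rough parameter count for GPT-2 Small: d_model=768, 12 layers,
# 50257-token vocabulary, 1024-position learned positional embedding
d_model, n_layers, vocab, ctx = 768, 12, 50257, 1024

per_layer = 4 * d_model**2 + 8 * d_model**2      # attention + feed-forward
total = n_layers * per_layer + vocab * d_model + ctx * d_model

print(f"{total / 1e6:.0f}M")  # 124M
```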

Key Variants and Modern Innovations

Grouped-Query Attention (GQA)

Standard multi-head attention requires storing separate key and value tensors for each head, which becomes a memory bottleneck during inference, especially with long sequences. Grouped-query attention, used in Llama 3 and Mistral, shares key-value heads across multiple query heads. With 64 query heads and 8 key-value heads, memory usage drops by 8x with minimal quality loss.

Flash Attention

The standard attention computation materializes the full [seq_len, seq_len] attention matrix in GPU memory, which is prohibitive for long sequences. Flash Attention, developed by Tri Dao and collaborators, restructures the computation to work in blocks that fit in the GPU's fast SRAM, avoiding the need to ever materialize the full matrix in slower HBM. This is not a change to the mathematical operation, just a more efficient implementation, but it enables 2-4x speedups and makes long context windows practical.

Mixture of Experts (MoE)

Models like Mixtral and DeepSeek V3 replace the feed-forward layer with a routing mechanism that selects a subset of "expert" feed-forward networks for each token. This allows the model to have many more total parameters while only activating a fraction of them per token, dramatically improving the ratio of capability to inference cost. Understanding this architecture is essential for evaluating models like those in our 2025 LLM rankings.

Why This Matters for Practitioners

Understanding the transformer architecture is not just an academic exercise. It directly informs practical decisions that engineers face daily. When you encounter a model that struggles with long-range dependencies in your specific use case, you now understand that this is likely an attention or positional encoding limitation. When you see that a model performs poorly on factual recall, you know that the feed-forward layers may not have sufficient capacity to store the relevant knowledge. When you need to choose between a dense model and a mixture-of-experts model, you understand the trade-offs involved.

The transformer architecture is now seven years old, which is ancient by deep learning standards. Yet no viable replacement has emerged. The innovations since 2017 have been refinements, not revolutions: better attention implementations, more efficient positional encodings, improved normalization schemes, and clever parameter-sharing strategies. The fundamental insight, that attention can replace recurrence, and that the resulting architecture scales beautifully, remains as powerful as ever.

For those interested in seeing how these architectural choices play out in practice across today's leading models, our comparison of Claude 3.5 Sonnet and GPT-4o examines two different implementations of transformer-based systems and how their architectural differences manifest in real-world performance.