Summary: The transformer block: where everything comes together
A transformer is not just stacked attention. Stacked attention alone does not work: gradients vanish through deep stacks, activations drift in scale, and the representations lose rank rapidly with depth (often loosely described as collapsing into one big linear transformation). This summary is the five-minute version of why, and of what real transformers do instead.
Real transformers wrap attention inside a block that adds four pieces, each fixing a specific gap: position encoding (which fixes attention’s order-blindness), a feed-forward network (which adds per-token nonlinearity), residual connections (which enable gradient flow and preserve information), and layer normalization (which keeps activations stable). Without all four, attention cannot be scaled into a working transformer.
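The order-blindness is easy to check numerically. The following is a minimal sketch using only the Python standard library, assuming a single attention head with random weights and no position signal; the helper names (`matmul`, `self_attention`, `rand_matrix`) are illustrative and not taken from the lesson. It shuffles three token embeddings and confirms that the outputs are shuffled in exactly the same way, so attention alone cannot tell one ordering from another.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no position signal."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)

rng = random.Random(0)
d_model = 4  # toy size, chosen only for illustration

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

w_q, w_k, w_v = (rand_matrix(d_model, d_model) for _ in range(3))
tokens = rand_matrix(3, d_model)          # three token embeddings
perm = [2, 0, 1]                          # a reordering of the sequence
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens, w_q, w_k, w_v)
out_shuffled = self_attention(shuffled, w_q, w_k, w_v)

# Row i of the shuffled output should equal row perm[i] of the original output.
order_blind = all(
    math.isclose(a, b, abs_tol=1e-9)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
print(order_blind)  # True
```

Each token's output vector is identical regardless of where it sits in the sequence, which is the gap position encoding is there to close.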
The full lesson walks through each piece in detail, presents the complete block diagram, explains stacking, and ends by explaining what every component on the canonical “Attention Is All You Need” architecture diagram represents and how modern variants substitute on the same axes.
Core ideas
- Stacked attention alone does not work. Three structural problems: gradients vanish through deep stacks, activations drift in scale layer by layer, and stacked attention loses representational rank rapidly with depth (Dong et al. 2021) without a per-token mechanism that mixes across the channel dimension. The block design fixes all three. (Common shorthand says “stacked attention is linear”; that is technically inaccurate because softmax is nonlinear. The real failure is rank collapse, which the FFN’s per-token feature mixing prevents.)
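- Position encoding fixes attention’s order-blindness. Self-attention has no built-in notion of order: permute the input tokens and the outputs are permuted in the same way (permutation-equivariant, often loosely called permutation-invariant). In the original paper, a sinusoidal position encoding is added once, to the embeddings, before the first block runs. Modern variants include learned embeddings, RoPE, and ALiBi; the last two inject position inside attention itself rather than at the embeddings. The mechanism varies; the goal is the same.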
- The feed-forward network adds per-token feature mixing and pointwise nonlinearity. A two-layer MLP applied independently per token. Projects up to `d_ff` (typically 4 × `d_model`), applies an activation (ReLU, GELU, SwiGLU), projects back to `d_model`. Attention mixes across tokens; the FFN mixes across the channel dimension within each token. Without it, stacked attention loses representational rank with depth.
- Residual connections do double duty. Each sub-layer outputs `input + sub-layer(input)` rather than replacing the input. (1) Gradients flow back through the addition unchanged, which lets deep stacks train. (2) Original information is preserved through the stream so subsequent layers still see it.
- LayerNorm prevents scale drift. Rescales activations across the feature dimension of a single token to mean 0 and variance 1, then applies a learned scale and shift. Per-token (unlike BatchNorm), so it works for variable-length sequences. Modern variants: RMSNorm (cheaper, no mean-centering); DeepNorm (stabilizes very deep stacks). Modern implementations also apply LayerNorm before each sub-layer (Pre-LN) rather than after, which trains more stably at scale.
- The full block, in order. Input → multi-head attention → Add+Norm (residual + LayerNorm) → feed-forward network → Add+Norm → output. Shape in equals shape out, which is what allows blocks to stack.
- Stacking is the architecture. A real transformer is N copies of this block stacked vertically. Layer count, head count per block, `d_model`, and `d_ff` are the four primary architecture knobs.
- The FFN holds most of the per-block parameters. Roughly two thirds of per-block parameter count, which surprises most readers who have just learned attention and assume that is where the parameters live. The model is doing real work in both places.
- Modern variants compose from these pieces. Almost every named modern transformer variant is a substitution at one of the boxes: position encoding (sinusoidal, learned, RoPE, ALiBi), normalization (LayerNorm, RMSNorm, DeepNorm), FFN activation (ReLU, GELU, SwiGLU), and head sharing (full multi-head, MQA, GQA from the multi-head attention lesson). Once you know the boxes, every variant is legible.
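The block order and the parameter split described above can be sketched in plain Python. This is a rough illustration under stated assumptions, not a reference implementation: it uses a single attention head instead of multi-head, ReLU as the activation, the original post-norm order (Add+Norm after each sub-layer), no biases, no learned LayerNorm scale or shift, and no position encoding. All function names and the toy sizes (`d_model = 8`, `d_ff = 32`) are made up for the example.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum: the residual connection."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    """Normalize each token (row) to mean 0 and variance 1.
    The learned scale and shift are omitted for brevity."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, p):
    """Single-head self-attention (a simplification of multi-head)."""
    q, k, v = matmul(x, p["w_q"]), matmul(x, p["w_k"]), matmul(x, p["w_v"])
    scale = math.sqrt(len(q[0]))
    weights = [softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]) for qi in q]
    return matmul(matmul(weights, v), p["w_o"])

def feed_forward(x, p):
    """Per-token MLP: project up to d_ff, apply ReLU, project back to d_model."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, p["w_up"])]
    return matmul(hidden, p["w_down"])

def transformer_block(x, p):
    x = layer_norm(add(x, attention(x, p)))      # attention, then Add+Norm
    x = layer_norm(add(x, feed_forward(x, p)))   # FFN, then Add+Norm
    return x

def init_params(d_model, d_ff, rng):
    def mat(rows, cols):
        return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "w_q": mat(d_model, d_model), "w_k": mat(d_model, d_model),
        "w_v": mat(d_model, d_model), "w_o": mat(d_model, d_model),
        "w_up": mat(d_model, d_ff), "w_down": mat(d_ff, d_model),
    }

def size(m):
    return len(m) * len(m[0])

rng = random.Random(0)
d_model, d_ff, n_layers, seq_len = 8, 32, 3, 5   # toy sizes; d_ff = 4 * d_model
layers = [init_params(d_model, d_ff, rng) for _ in range(n_layers)]
x = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(seq_len)]

# Stacking: shape in equals shape out, so blocks chain directly.
y = x
for p in layers:
    y = transformer_block(y, p)

attn_params = sum(size(layers[0][k]) for k in ("w_q", "w_k", "w_v", "w_o"))
ffn_params = sum(size(layers[0][k]) for k in ("w_up", "w_down"))
ffn_share = ffn_params / (attn_params + ffn_params)

print(len(y), len(y[0]))                              # 5 8
print(attn_params, ffn_params, round(ffn_share, 3))   # 256 512 0.667
```

With `d_ff = 4 * d_model` and biases ignored, attention contributes 4 × `d_model`² weights and the FFN contributes 8 × `d_model`², which is where the two-thirds figure comes from. Swapping `layer_norm` for an RMSNorm, or ReLU for another activation, changes one function and leaves the rest of the block untouched.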
What changes for you
Before this lesson, the architecture diagram in “Attention Is All You Need” was a box of arrows you could not parse. Now every box has a name and a job: this one is the engine that mixes tokens, this one is the per-token nonlinearity, these are the residual paths, these are the normalization stops. When you read a model card for a modern open-source transformer and see RMSNorm, RoPE, SwiGLU, GQA, or any combination, you can locate each substitution on the same diagram you now know cold. That is the synthesis the previous four lessons were aiming at: the building blocks of every transformer you will use.
This lesson closes the Stanford CME 295 Lecture 1 adaptation. From here, the natural next steps are the variants and optimizations modern models build on top of this architecture, plus the training and inference techniques (RLHF, prompting, fine-tuning, quantization) that turn a trained transformer into a useful tool. Those are tracks of their own.
Attention is the engine.
The block is the machine.