The transformer block: where everything comes together

What you’ll learn

This is lesson 3 of Phase 2 (How models think: the transformer architecture) in Track 5 (AI Foundations). The full Stanford CME 295 course materials are at cme295.stanford.edu. Phase 2 so far gave you the single-head attention mechanism and its multi-head generalization.

This lesson assembles them into a complete transformer block: the repeating unit that is stacked many times to build a real model. A block is more than attention. It wraps attention with three other ingredients (a feed-forward network, residual connections, and layer normalization), and position encoding is added to the input before the first block runs. The lesson covers each of those pieces and why each one is structurally required: attention alone is order-blind, has no per-token nonlinearity, suffers rank collapse without an FFN, and cannot be stacked deep without residual connections and normalization. It then distinguishes the modern Pre-LN ordering from the Post-LN ordering of the original 2017 paper and walks one full forward pass through a Pre-LN block. When you finish, you will be able to read the canonical “Attention Is All You Need” diagram and name every component, and you will have the vocabulary to recognize the variants modern models build on top of it (RMSNorm, RoPE, SwiGLU, GQA, and so on).
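
As a preview of the two orderings named above, here is a minimal sketch in dependency-free Python. It is an illustration under stated assumptions, not the course's reference implementation: attention is single-head and unmasked, layer normalization has no learned gain or bias, the FFN uses ReLU, and all weights are random. The helper names (`pre_ln_block`, `post_ln_block`, `make_params`, and so on) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two same-shaped matrices (the residual connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance.
    Assumption: no learned gain or bias, to keep the sketch short."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(x, wq, wk, wv, wo):
    """Single-head, unmasked self-attention (assumption: one head for brevity)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(matmul(weights, v), wo)

def ffn(x, w1, w2):
    """Per-token feed-forward network: expand, apply ReLU, project back."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def pre_ln_block(x, p):
    """Pre-LN: normalize the input of each sublayer, then add the residual."""
    x = add(x, attention(layer_norm(x), *p["attn"]))
    x = add(x, ffn(layer_norm(x), *p["ffn"]))
    return x

def post_ln_block(x, p):
    """Post-LN (2017 ordering): add the residual first, then normalize the sum."""
    x = layer_norm(add(x, attention(x, *p["attn"])))
    x = layer_norm(add(x, ffn(x, *p["ffn"])))
    return x

def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def make_params(d_model, d_ff, seed=0):
    rng = random.Random(seed)
    return {
        "attn": [random_matrix(d_model, d_model, rng) for _ in range(4)],
        "ffn": [random_matrix(d_model, d_ff, rng), random_matrix(d_ff, d_model, rng)],
    }

params = make_params(d_model=4, d_ff=16)
# 3 tokens of width 4, assumed to be already embedded and position-encoded.
tokens = random_matrix(3, 4, random.Random(1))
out = pre_ln_block(tokens, params)
print(len(out), len(out[0]))  # 3 4
```

The block maps a sequence of token vectors to a sequence of the same shape, which is what allows blocks to be stacked. In the Pre-LN version the residual path carries `x` from input to output without passing through a normalization, whereas in the Post-LN version every residual sum is normalized before moving on.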

Where this fits

This is lesson 3 of Phase 2, How models think: the transformer architecture. The previous lessons introduced attention and multi-head attention. This lesson combines them into one complete transformer block. The next two lessons, How models know word order (revisited from Phase 1) and then Position embeddings and RoPE, cover how the field moved position encoding from the input embedding into the attention computation itself. That move is the first of three places where modern LLMs genuinely diverge from the 2017 architecture. The other two, normalization choice and attention efficiency, are both covered later in Phase 2.

Before you start

Prerequisites: the attention lesson and the multi-head attention lesson are required. This lesson builds directly on the query, key, and value vectors and the split-run-concatenate pattern from those two lessons. The Phase 1 embeddings lesson is also useful, because position encoding modifies the embedding output before the first block runs.

By the end, you’ll be able to

Identify the four pieces that wrap around attention to make a complete transformer block (position encoding, feed-forward network, residual connections, layer normalization)
Explain why attention alone is not enough and what each wrapping piece adds, including the rank-collapse argument that makes the FFN structurally required
Walk a single forward pass through one Pre-LN block, in order, and trace what changes at each step
Distinguish Pre-LN (modern default) from Post-LN (original 2017 ordering) and explain why the field shifted
Read the “Attention Is All You Need” architecture diagram and name every component on it

Time and difficulty

Read time: about 25 minutes
Practice time: about 15 minutes (a forward-pass trace by hand, plus annotating the canonical architecture diagram)
Difficulty: standard