
File talk:Full GPT architecture.png


There is a mistake here in how the normalization is applied: the norm layer drawn before the MLP is actually applied inside the MLP path, and the residual add happens outside of it. Essentially, the outgoing arrow (the residual branch) should leave one step earlier, before the norm.

This can be seen clearly in the source code at: https://github.com/openai/gpt-2/blob/master/src/model.py

def block(x, scope, *, past, hparams):
    with tf.variable_scope(scope):
        nx = x.shape[-1].value
        a, present = attn(norm(x, 'ln_1'), 'attn', nx, past=past, hparams=hparams)
        x = x + a  # look here: the residual add happens outside the sublayer
        m = mlp(norm(x, 'ln_2'), 'mlp', nx*4, hparams=hparams)  # look here: this norm is inside the MLP path
        x = x + m
        return x, present
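
For comparison, here is a minimal sketch (NumPy, not the actual GPT-2 code; layer_norm, attention and mlp are placeholder names, not the functions from model.py) of the pre-norm wiring that block() implements. The point is that the residual branch is taken from x before the norm, and only the path into each sublayer is normalized:

import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (stand-in for GPT-2's norm()).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, attention, mlp):
    # The skip connection branches off from x BEFORE the norm; the norm sits
    # inside the sublayer path, and the add happens outside it.
    x = x + attention(layer_norm(x))   # matches: x = x + a
    x = x + mlp(layer_norm(x))         # matches: x = x + m
    return x

# Example wiring with identity sublayers, just to show the data flow:
x = np.random.randn(4, 8)
y = pre_norm_block(x, attention=lambda h: h, mlp=lambda h: h)

So the corrected diagram would take the skip connection from the input of the second norm, not from its output.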