
File talk:Full GPT architecture.png


There is a mistake here in how the normalization is applied: the norm layer drawn before the MLP is actually applied inside the MLP path, and the residual add happens outside of it. Essentially, the outgoing arrow (the residual branch) should leave one step earlier, before the norm.

This can be seen clearly in the source code at: https://github.com/openai/gpt-2/blob/master/src/model.py

def block(x, scope, *, past, hparams):
    with tf.variable_scope(scope):
        nx = x.shape[-1].value
        a, present = attn(norm(x, 'ln_1'), 'attn', nx, past=past, hparams=hparams)
        x = x + a  # look here: the residual add happens outside the sublayer
        m = mlp(norm(x, 'ln_2'), 'mlp', nx*4, hparams=hparams)  # look here: this norm is inside the MLP path
        x = x + m
        return x, present
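
For comparison, here is a minimal sketch (NumPy, not the actual GPT-2 code; layer_norm, attention and mlp are placeholder names, not the functions from model.py) of the pre-norm wiring that block() implements. The point is that the residual branch is taken from x before the norm, and only the path into each sublayer is normalized:

import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (stand-in for GPT-2's norm()).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, attention, mlp):
    # The skip connection branches off from x BEFORE the norm; the norm sits
    # inside the sublayer path, and the add happens outside it.
    x = x + attention(layer_norm(x))   # matches: x = x + a
    x = x + mlp(layer_norm(x))         # matches: x = x + m
    return x

# Example wiring with identity sublayers, just to show the data flow:
x = np.random.randn(4, 8)
y = pre_norm_block(x, attention=lambda h: h, mlp=lambda h: h)

So the corrected diagram would take the skip connection from the input of the second norm, not from its output.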