Breaking down GPT-2 and the Transformer

What is GPT-2?

Figure 1: Transformer decoder block
Figure 2: Overview of the GPT-2 process
Figure 3: Matrix of token embeddings; the embedding length depends on the model size
Figure 4: Matrix of positional embeddings
Figure 5: The input embedding is the sum of the token embedding and the positional embedding
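The sum described in Figure 5 can be sketched in a few lines of plain Python. This is a minimal illustration, not GPT-2's actual code: the helper names `make_embedding_table` and `input_embeddings` are hypothetical, the tables hold random values in place of learned weights, and the toy sizes are assumptions chosen for readability.

```python
import random

def make_embedding_table(num_rows, dim, seed):
    """Build a num_rows x dim table of small random values.

    Stand-in for a learned embedding matrix (assumption: random, not trained).
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]

def input_embeddings(token_ids, token_table, position_table):
    """Return one vector per position: token embedding + positional embedding."""
    return [
        [t + p for t, p in zip(token_table[token_id], position_table[position])]
        for position, token_id in enumerate(token_ids)
    ]

# Toy sizes (assumptions). The smallest GPT-2 uses a 50257-token vocabulary,
# 1024 positions, and 768-dimensional embeddings.
VOCAB_SIZE, MAX_POSITIONS, EMBED_DIM = 10, 8, 4
token_table = make_embedding_table(VOCAB_SIZE, EMBED_DIM, seed=0)
position_table = make_embedding_table(MAX_POSITIONS, EMBED_DIM, seed=1)

token_ids = [3, 7, 3]  # the same token appears at positions 0 and 2
x = input_embeddings(token_ids, token_table, position_table)
```

Because the positional embedding differs per position, the two occurrences of token 3 end up with different input vectors, which is how the model can tell them apart.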

Self-attention mechanism

Figure 6: Brief illustration of a masked self-attention layer
Figure 7: Analogy between self-attention and sticky-note matching
Figure 8: The dot product of the query vector with each key vector gives a score, which softmax turns into a probability
Figure 9: Illustration of calculating the weighted vector corresponding to “it”
Figure 10: The first layer of the feed-forward neural network
Figure 11: The second layer of the feed-forward neural network
Figure 12: Matrices inside a transformer decoder block
Figure 13: Reshaping the long vector to get the Q, K, V vectors for each attention head
Figure 14: Concatenation of the outputs of the attention heads
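The attention steps in these figures (splitting into heads, scoring a query against keys, weighting the values, and concatenating the head outputs) can be sketched as follows. This is a rough illustration under stated assumptions: the function names are hypothetical, the learned projection matrices that produce Q, K, V and the final output projection are left out, and the input vectors are made up.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def split_heads(vector, num_heads):
    """Cut one long vector into num_heads equal chunks, one per attention head."""
    head_dim = len(vector) // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

def masked_attention_head(queries, keys, values):
    """Masked attention for one head: position i only sees positions 0..i."""
    head_dim = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Scaled dot-product score of this query against each visible key.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(head_dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(head_dim)
        ])
    return outputs

def multi_head_attention(q_vecs, k_vecs, v_vecs, num_heads):
    """Run every head separately, then concatenate the head outputs per position."""
    q_heads = [split_heads(v, num_heads) for v in q_vecs]
    k_heads = [split_heads(v, num_heads) for v in k_vecs]
    v_heads = [split_heads(v, num_heads) for v in v_vecs]
    result = [[] for _ in q_vecs]
    for h in range(num_heads):
        head_out = masked_attention_head(
            [q[h] for q in q_heads],
            [k[h] for k in k_heads],
            [v[h] for v in v_heads],
        )
        for position, vec in enumerate(head_out):
            result[position].extend(vec)  # concatenation of head outputs
    return result

# Made-up vectors for three positions; reused as Q, K and V for simplicity.
vectors = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, -0.5], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(vectors, vectors, vectors, num_heads=2)
```

The mask is implemented here by only looping over positions `0..i`; implementations that work on whole matrices typically get the same effect by setting the scores of future positions to a large negative value before the softmax.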

Enjoying the landscape of the whole GPT-2 model

Figure 15: The whole picture of GPT-2





Zheng Zhang, Ph.D. student in Computer Science

