Attention please
@stevhliu | Aug 10, 2025
It's kind of wild that attention is a quantifiable concept. I don't have a way of measuring how I personally pay attention, but I imagine it's very primitive.
If you told me to take out the trash on Monday night because it's really full and pickup day is Tuesday, my brain would probably compile it to something like "trash, Monday night". Just the essential keywords.
But for transformer models, attention is calculable.
An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
This post tries to very simply and very plainly explain how the original self-attention (and several variants of it, but not all) works while eschewing as much jargon and fancy language as possible.
#self-attention
Self-attention (scaled dot-product attention) computes how much each word in a sequence should "pay attention" to every other word in that same sequence. It's what makes transformers contextually aware.
You need 3 matrices to compute self-attention: query (Q), key (K), and value (V). They are created by multiplying the word embeddings by 3 learned weight matrices (Wq, Wk, Wv).
- Query (Q) is compared to all the Ks (including itself) to calculate how much attention to pay to each word. Q is the information you're looking for.
- Key (K) is multiplied by Q (dot product) to produce the attention scores for each word. K is the information a word contains.
The attention scores are scaled by dividing by the square root of the vector dimension. For example, if the dimension is 64, divide the attention scores by 8. Scaling prevents the scores from growing too large as the dimension increases.
Apply a softmax function to the scores to convert them into probabilities that add up to 1. Scaling also smooths out the softmax function by preventing any one value from dominating the rest.
- Value (V) weights each word with an attention score to determine what information every other word offers. V is the information each word contributes.
Try calculating the self-attention score for the word Fear by hand in the following sequence to really get a feel for how it works.
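The steps above can also be sketched in a few lines of NumPy. This is a minimal illustration, not a real model: the 4×8 "embeddings" and the weight matrices are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project each word embedding into a query, key, and value vector.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d = Q.shape[-1]
    # Compare every query to every key (dot product), then scale by sqrt(d).
    scores = Q @ K.T / np.sqrt(d)
    # Softmax turns each row of scores into weights that sum to 1.
    weights = softmax(scores)
    # Each output is a weighted sum of the value vectors.
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))  # 4 "words", embedding size 8
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # one contextualized vector per word: (4, 8)
```

Every word's output is a blend of all the value vectors, weighted by how relevant each other word is to it.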
#causal self-attention
Causal self-attention is used in decoder models like GPT. These models predict the next word in a sequence, so it's important to mask future words to prevent the model from cheating.
In the mask, words that should be blocked are set to -∞, and the softmax of -∞ becomes ~0.
Try calculating the causal self-attention score for the word is by hand in the following sequence.
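Here's a minimal sketch of the mask, again with made-up random Q and K. The returned attention weights show the -∞ entries turning into zeros.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention_weights(Q, K):
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    # Mask the upper triangle: word i may not see any word j > i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # softmax of -inf becomes 0, so future words get zero weight.
    return softmax(scores)

rng = np.random.default_rng(0)
Q, K = (rng.normal(size=(4, 8)) for _ in range(2))
weights = causal_attention_weights(Q, K)
print(np.round(weights, 2))
# The first word can only attend to itself, so its row is [1. 0. 0. 0.]
```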
#multi-head attention
Self-attention is a single head of attention. Multi-head attention (MHA) adds more heads to learn other things about a sentence from different perspectives.
The calculations are the same as self-attention, but the embedding size is split by the number of heads. Each head independently computes scaled dot-product attention on its slice of the data. At the end, the outputs are concatenated and multiplied by a weight matrix to blend all the information together.
MHA is still used by models today such as DeepSeek-R1.
Try calculating the multi-head attention score for the word Fear by hand in the following sequence.
#multi-query attention
Multi-query attention (MQA) is the same as MHA except all Qs share the same K and V.
MQA has the advantage of being more memory efficient and faster at decoding. Each head doesn't need to store a separate K and V. This makes an especially big difference for really long sequences.
Gemma 2B uses MQA.
Try calculating the multi-query attention score for the word Fear by hand in the following sequence.
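A sketch of the sharing, with placeholder shapes: the only change from MHA is that one K and one V serve every query head.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_query_attention(X, Wq, Wk, Wv, n_heads):
    n, d_model = X.shape
    d_head = d_model // n_heads
    # Several query heads: (n_heads, n, d_head)...
    Qh = (X @ Wq).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    # ...but a single K and V, shared by every query head.
    K = X @ Wk  # (n, d_head)
    V = X @ Wv  # (n, d_head)
    scores = Qh @ K.T / np.sqrt(d_head)  # broadcasts to (n_heads, n, n)
    heads = softmax(scores) @ V          # (n_heads, n, d_head)
    return heads.transpose(1, 0, 2).reshape(n, d_model)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq = rng.normal(size=(8, 8))
Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(2))  # d_head = 8 // 2 = 4
print(multi_query_attention(X, Wq, Wk, Wv, n_heads=2).shape)  # (4, 8)
```

Storing one K and V instead of n_heads of them is exactly where the memory savings come from at decoding time.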
#grouped-query attention
Grouped-query attention (GQA) is similar to MHA and MQA except each group of Qs shares the same K and V. The Ks and Vs are different for each group.
GQA combines the best of MHA and MQA. It's faster than MHA because it still has fewer Ks and Vs, and it's more expressive than MQA because it has more than one K and V.
Many modern large language models like Llama 3 and gpt-oss-20b use GQA.
Try calculating the grouped-query attention score for the word Fear by hand in the following sequence.
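A sketch of the grouping, with placeholder shapes: with 4 query heads in 2 groups, there are 2 K/V heads, and each one is repeated to line up with its 2 query heads.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(X, Wq, Wk, Wv, n_heads, n_groups):
    n, d_model = X.shape
    d_head = d_model // n_heads
    heads_per_group = n_heads // n_groups
    Qh = (X @ Wq).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    # One K and V per GROUP instead of one per head (MHA) or one total (MQA).
    Kg = (X @ Wk).reshape(n, n_groups, d_head).transpose(1, 0, 2)
    Vg = (X @ Wv).reshape(n, n_groups, d_head).transpose(1, 0, 2)
    # Repeat each group's K and V so they line up with that group's query heads.
    Kh = np.repeat(Kg, heads_per_group, axis=0)
    Vh = np.repeat(Vg, heads_per_group, axis=0)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh
    return heads.transpose(1, 0, 2).reshape(n, d_model)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq = rng.normal(size=(8, 8))                          # 4 query heads, d_head = 2
Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(2))  # 2 groups -> 2 K/V heads
print(grouped_query_attention(X, Wq, Wk, Wv, n_heads=4, n_groups=2).shape)  # (4, 8)
```

Setting n_groups equal to n_heads recovers MHA, and n_groups=1 recovers MQA, which is why GQA sits between the two.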
#summary
Self-attention is powerful, but it's also computationally expensive when sequences start getting longer. Newer attention algorithms, like FlashAttention and PagedAttention, handle longer sequences with far less memory overhead.
The key takeaway is that self-attention is a weighted sum of contextual information of all words. The weights reflect each surrounding word's relevance to the current word.