import numpy as np
import matplotlib.pyplot as plt
# Settings
d = 4   # embedding dimension
T = 3   # sequence length
np.random.seed(42)
# Random Gaussian vectors for Q, K
Q = np.random.randn(T, d)
K = np.random.randn(T, d)
# Compute attention scores (dot product only)
dot_products = Q @ K.T
scaled_dot_products = dot_products / np.sqrt(d)
def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)
# Compute softmax
attn_noscale = softmax(dot_products)
attn_scaled = softmax(scaled_dot_products)

attn_noscale, attn_scaled
Understanding the Division by √d in the Attention Mechanism
In this notebook, we explore why the dot-product attention mechanism includes a scaling factor of $\frac{1}{\sqrt{d}}$. We use an example with embedding dimension $d = 4$, sequence length $T = 3$, and assume the query and key vectors are sampled from independent standard Gaussian distributions.
- $Q, K \in \mathbb{R}^{T \times d}$
- Each row $q_i \sim \mathcal{N}(0, I_d)$, $k_j \sim \mathcal{N}(0, I_d)$
We compute: \[ \text{score}_{ij} = q_i k_j^T = \sum_{\ell=1}^{d} q_{i\ell} k_{j\ell} \]
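As a quick sanity check, the short snippet below (reusing `Q`, `K`, `T`, and `dot_products` from the first code cell) confirms that the matrix product `Q @ K.T` computes exactly these coordinate-wise sums:

# Sanity check: each entry of Q @ K.T equals the coordinate-wise sum above
scores_manual = np.array([[np.sum(Q[i] * K[j]) for j in range(T)] for i in range(T)])
print(np.allclose(scores_manual, dot_products))   # True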
Each component in the sum, $q_{i\ell} k_{j\ell}$, is the product of two independent standard normal variables. So:
🔍 What is the distribution of $Z = XY$, where $X, Y \sim \mathcal{N}(0,1)$ are independent?
- The product of two independent standard normal variables follows the so-called product-normal distribution (which is not itself Gaussian).
- Mean: $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y] = 0$
- Variance: $\text{Var}[XY] = \mathbb{E}[X^2]\,\mathbb{E}[Y^2] = 1$ (verified numerically below)
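These two moments are easy to check with a quick Monte Carlo simulation; the snippet below is a sketch, and the sample size `n` is an arbitrary choice:

# Monte Carlo check of the moments of Z = X * Y, with X, Y independent N(0, 1)
n = 1_000_000                     # arbitrary sample size
X = np.random.randn(n)
Y = np.random.randn(n)
Z = X * Y
print(Z.mean(), Z.var())          # ≈ 0 and ≈ 1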
⏬ Apply this to the dot product
Let: \[ S = \sum_{\ell=1}^d q_{i\ell} k_{j\ell} \]
Then:
- Each term has mean 0 and variance 1
- The terms are i.i.d. (since the entries of $Q$ and $K$ are independent)
- So: \[ \mathbb{E}[S] = 0, \quad \text{Var}[S] = \sum_{\ell=1}^d \text{Var}[q_{i\ell} k_{j\ell}] = d \]
Thus, the unscaled dot product has variance exactly $d$, growing linearly with the embedding dimension.
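This linear growth is easy to verify empirically; the sketch below (the test dimensions and sample count are arbitrary choices) estimates the variance of the unscaled dot product for several values of $d$:

# Empirical variance of the unscaled dot product for several dimensions
for dim in [4, 16, 64, 256]:              # arbitrary test dimensions
    q = np.random.randn(10_000, dim)
    k = np.random.randn(10_000, dim)
    s = np.sum(q * k, axis=1)             # 10,000 independent dot products
    print(dim, s.var())                   # ≈ dim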
🧮 Why Divide by $\sqrt{d}$?
If we define the scaled score as: \[ \text{scaled\_score}_{ij} = \frac{q_i \cdot k_j}{\sqrt{d}} \]
Then: \[ \text{Var}[\text{scaled\_score}_{ij}] = \frac{1}{d} \cdot \text{Var}[q_i \cdot k_j] = \frac{1}{d} \cdot d = 1 \]
So the scaling ensures:
- The variance of the attention logits is constant (independent of the dimension $d$), as verified below
- This keeps the softmax numerically stable across different embedding sizes
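Repeating the same experiment with the $1/\sqrt{d}$ scaling (again a sketch with arbitrary dimensions and sample count) shows the variance staying near 1 for every $d$:

# With the 1/sqrt(d) scaling, the variance stays ≈ 1 for every dimension
for dim in [4, 16, 64, 256]:              # arbitrary test dimensions
    q = np.random.randn(10_000, dim)
    k = np.random.randn(10_000, dim)
    s = np.sum(q * k, axis=1) / np.sqrt(dim)
    print(dim, s.var())                   # ≈ 1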
🧠 Intuition
Without scaling:
- As $d$ grows, the variance of the dot product grows linearly
- Softmax becomes extremely sharp → one large value dominates, others vanish
- This leads to poor gradient flow
With scaling:
- The dot-product distribution is normalized to unit variance
- Softmax stays smooth and expressive (see the sketch below)
- Learning dynamics and stability improve
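To make the sharpening effect concrete, the sketch below compares the largest softmax weight over a set of keys with and without scaling as $d$ grows; the dimensions and the number of keys (16) are arbitrary choices, and the numbers vary from run to run since each dimension uses a single random draw:

# Largest softmax weight over 16 keys, with and without 1/sqrt(d) scaling
for dim in [4, 64, 1024]:                 # arbitrary dimensions
    q = np.random.randn(dim)
    keys = np.random.randn(16, dim)       # arbitrary number of keys
    logits = keys @ q
    p_raw = softmax(logits)
    p_scaled = softmax(logits / np.sqrt(dim))
    print(dim, p_raw.max(), p_scaled.max())  # unscaled max tends toward 1 as dim grows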
fig, axs = plt.subplots(1, 2, figsize=(10, 4))

for i in range(T):
    axs[0].plot(attn_noscale[i], label=f'Q{i}')
    axs[1].plot(attn_scaled[i], label=f'Q{i}')

axs[0].set_title('Attention without Scaling')
axs[1].set_title('Attention with Scaling (1/√d)')

for ax in axs:
    ax.set_xlabel('Key Index')
    ax.set_ylabel('Attention Weight')
    ax.legend()
    ax.grid(True)

plt.tight_layout()
plt.show()
Summary
- Without scaling, attention scores can be overly large, leading to softmax outputs that are near one-hot.
- This results in vanishing gradients and unstable training.
- Scaling by $1/\sqrt{d}$ normalizes the variance of the dot product, improving gradient flow and model stability.