Understanding the Division by √d in the Attention Mechanism

In this notebook, we explore why the dot-product attention mechanism includes a scaling factor of $\frac{1}{\sqrt{d}}$. We use an example with embedding dimension $d = 4$, sequence length $T = 3$, and assume the entries of the query and key vectors are sampled from independent standard Gaussian distributions.

We compute: \[ \text{score}_{ij} = q_i k_j^T = \sum_{\ell=1}^{d} q_{i\ell} k_{j\ell} \]

Each term in the sum, $q_{i\ell}\, k_{j\ell}$, is the product of two independent standard normal variables. So:

🔍 What is the distribution of $Z = XY$ where $X, Y \sim \mathcal{N}(0,1)$ are independent?

  • The product of two independent standard normal variables follows the normal product distribution (it is not itself Gaussian; its density involves a modified Bessel function).
  • Mean: $\mathbb{E}[XY] = \mathbb{E}[X]\,\mathbb{E}[Y] = 0$
  • Variance: $\text{Var}[XY] = \mathbb{E}[X^2]\,\mathbb{E}[Y^2] - (\mathbb{E}[XY])^2 = 1$
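
A quick Monte Carlo check confirms these moments (a minimal sketch; the sample size of one million and the seed are arbitrary choices, not part of the original notebook):

import numpy as np

# Sample many independent pairs (X, Y) of standard normals and inspect
# the empirical moments of their product Z = X * Y.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = rng.standard_normal(1_000_000)
z = x * y

print(f"mean of XY: {z.mean():+.4f}")  # close to 0
print(f"var  of XY: {z.var():.4f}")    # close to 1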

⏬ Apply this to the dot product

Let: \[ S = \sum_{\ell=1}^d q_{i\ell} k_{j\ell} \]

Then:

  • Each term has mean 0 and variance 1.
  • The terms are i.i.d. (since all entries of $Q$ and $K$ are independent).
  • So: \[ \mathbb{E}[S] = 0, \quad \text{Var}[S] = \sum_{\ell=1}^d \text{Var}[q_{i\ell} k_{j\ell}] = d \]

Thus, the unscaled dot product has variance $d$, growing linearly with the embedding dimension.
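
This linear growth is easy to verify numerically (a minimal sketch; the dimensions and the 100,000 samples per dimension are illustrative choices):

import numpy as np

# For each dimension d, draw many independent (q, k) pairs with standard
# normal entries and estimate the variance of their dot product.
rng = np.random.default_rng(0)
for d in (4, 16, 64, 256):
    q = rng.standard_normal((100_000, d))
    k = rng.standard_normal((100_000, d))
    dots = np.sum(q * k, axis=1)  # one dot product per row
    print(f"d={d:4d}   Var[q.k] approx {dots.var():8.2f}")  # roughly equal to d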


🧮 Why Divide by $\sqrt{d}$?

If we define the scaled score as: \[ \text{scaled\_score}_{ij} = \frac{q_i \cdot k_j}{\sqrt{d}} \]

Then: \[ \text{Var}[\text{scaled\_score}_{ij}] = \frac{1}{d} \cdot \text{Var}[q_i \cdot k_j] = \frac{1}{d} \cdot d = 1 \]

So the scaling ensures:

  • The variance of the attention logits is constant (independent of the dimension $d$).
  • The softmax stays numerically stable across different embedding sizes.
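
Repeating the same experiment with the scaled scores shows the variance staying near 1 for every dimension (again a sketch with illustrative sample sizes):

import numpy as np

# Same setup as above, but each dot product is divided by sqrt(d).
rng = np.random.default_rng(0)
for d in (4, 16, 64, 256):
    q = rng.standard_normal((100_000, d))
    k = rng.standard_normal((100_000, d))
    scaled = np.sum(q * k, axis=1) / np.sqrt(d)
    print(f"d={d:4d}   Var[q.k / sqrt(d)] approx {scaled.var():.3f}")  # close to 1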


🧠 Intuition

Without scaling:

  • As $d$ grows, the variance of the dot product grows linearly.
  • The softmax becomes extremely sharp: one large logit dominates and the others vanish.
  • This leads to poor gradient flow.

With scaling:

  • The dot-product distribution is normalized to unit variance.
  • The softmax stays smooth and expressive.
  • Learning dynamics and stability improve.
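
The contrast is easiest to see at a larger dimension than the $d = 4$ example below (a sketch; $d = 512$ and the five keys are arbitrary illustrative choices):

import numpy as np

# One query against five keys in a high-dimensional space: without scaling
# the softmax is essentially one-hot, with scaling it stays spread out.
rng = np.random.default_rng(0)
d = 512
q = rng.standard_normal(d)
keys = rng.standard_normal((5, d))
logits = keys @ q

def softmax_1d(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()

print("softmax(logits):          ", np.round(softmax_1d(logits), 3))
print("softmax(logits / sqrt(d)):", np.round(softmax_1d(logits / np.sqrt(d)), 3))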



import numpy as np
import matplotlib.pyplot as plt

# Settings
d = 4
T = 3
np.random.seed(42)

# Random Gaussian vectors for Q, K
Q = np.random.randn(T, d)
K = np.random.randn(T, d)

# Compute attention scores (dot product only)
dot_products = Q @ K.T
scaled_dot_products = dot_products / np.sqrt(d)

def softmax(x):
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

# Row-wise softmax over the keys (each row sums to 1)
attn_noscale = softmax(dot_products)
attn_scaled = softmax(scaled_dot_products)

# Display the attention weights
print("Attention without scaling:\n", attn_noscale)
print("Attention with scaling:\n", attn_scaled)

# Plot the attention weights for each query, with and without scaling
fig, axs = plt.subplots(1, 2, figsize=(10, 4))
for i in range(T):
    axs[0].plot(attn_noscale[i], label=f'Q{i}')
    axs[1].plot(attn_scaled[i], label=f'Q{i}')

axs[0].set_title('Attention without Scaling')
axs[1].set_title('Attention with Scaling (1/√d)')
for ax in axs:
    ax.set_xlabel('Key Index')
    ax.set_ylabel('Attention Weight')
    ax.legend()
    ax.grid(True)

plt.tight_layout()
plt.show()

Summary

  • Without scaling, the magnitude of the attention scores grows with the embedding dimension, pushing the softmax outputs toward near one-hot.
  • This results in vanishing gradients and unstable training.
  • Scaling the logits by $1/\sqrt{d}$ normalizes the variance of the dot product, improving gradient flow and model stability.