Interactive Gradient Descent Visualization: How Derivatives Drive Weight Updates
Gradient descent is the backbone of modern machine learning. But understanding why it works — how a simple derivative tells the model which direction to adjust — can be tricky without seeing it in action. This interactive visualization lets you step through the process one iteration at a time.
What You'll See
The visualization below presents three carefully chosen scenarios that reveal the core mechanics of gradient descent:
- Scenario 1 — Underprediction: When the prediction ŷ is below the target y=1, the derivative dL/dw is negative. Following the update rule w = w - α·(dL/dw), subtracting a negative number increases w, pushing the prediction upward.
- Scenario 2 — Overprediction: When ŷ overshoots the target y=0, the derivative is positive. The update rule subtracts a positive value, decreasing w and pulling the prediction back down.
- Scenario 3 — Oscillation: With an excessively large learning rate (α=4.0), the weight overshoots the optimal value repeatedly. The derivative alternates between positive and negative, causing the weight to bounce back and forth before eventually converging.
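The sign logic in scenarios 1 and 2 is easy to verify numerically. Here's a minimal sketch of one update step for the single-neuron model; the inputs w=0.5, x=1.0, b=0.0 are illustrative values, not the demo's actual starting point:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_and_update(w, x, b, y, lr):
    """One gradient-descent step for a single sigmoid neuron with squared-error loss."""
    z = w * x + b
    y_hat = sigmoid(z)
    # dL/dw = 2(y_hat - y) * sigmoid'(z) * x, where sigmoid'(z) = y_hat * (1 - y_hat)
    dL_dw = 2 * (y_hat - y) * y_hat * (1 - y_hat) * x
    return w - lr * dL_dw, dL_dw

# Scenario 1: underprediction (y_hat < y=1) -> negative gradient -> w increases
w1, g1 = grad_and_update(w=0.5, x=1.0, b=0.0, y=1.0, lr=0.5)
print(g1 < 0, w1 > 0.5)  # True True

# Scenario 2: overprediction (y_hat > y=0) -> positive gradient -> w decreases
w2, g2 = grad_and_update(w=0.5, x=1.0, b=0.0, y=0.0, lr=0.5)
print(g2 > 0, w2 < 0.5)  # True True
```

In both cases the sign of the gradient alone determines the direction of the update; the learning rate only scales its size.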
The Math Behind Each Step
For a single-neuron model with sigmoid activation σ(z), the chain rule gives us:
z = w · x + b
ŷ = σ(z) = 1 / (1 + e^(-z))
L = (ŷ - y)²
Backward pass (chain rule):
dL/dw = dL/dŷ · dŷ/dz · dz/dw
= 2(ŷ - y) · σ(z)(1 - σ(z)) · x
Weight update:
w_new = w - α · dL/dw
The visualization shows every single number in this chain — you can verify each multiplication yourself and see exactly how the gradient flows from the loss back to the weight.
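You can reproduce that multiplication chain in a few lines. The starting values below (w=0.5, x=1.5, b=0.0, y=1.0, α=0.1) are illustrative, chosen only to make each factor easy to inspect:

```python
import math

# Illustrative starting point, not the demo's actual values
w, x, b, y = 0.5, 1.5, 0.0, 1.0
alpha = 0.1

# Forward pass
z = w * x + b                     # pre-activation
y_hat = 1 / (1 + math.exp(-z))    # sigmoid activation
L = (y_hat - y) ** 2              # squared-error loss

# Backward pass: each factor of the chain rule, computed separately
dL_dyhat = 2 * (y_hat - y)        # dL/dŷ
dyhat_dz = y_hat * (1 - y_hat)    # dŷ/dz  (sigmoid derivative)
dz_dw = x                         # dz/dw
dL_dw = dL_dyhat * dyhat_dz * dz_dw

# Weight update
w_new = w - alpha * dL_dw
```

Because the prediction underpredicts here (ŷ < 1), dL_dw comes out negative and w_new is larger than w, exactly as in scenario 1.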
Interactive Demo
Click through the iterations or hit Auto Play to watch the optimization unfold. The left panel shows the loss function curve L(w) with the tangent line (slope = gradient), while the right panel shows the sigmoid curve with the current prediction's position.
Key Takeaways
When the prediction is too low, the loss curve slopes downward to the right. Moving w in the positive direction (increasing it) reduces the loss.
When the prediction is too high, the loss curve slopes upward to the right. Moving w in the negative direction (decreasing it) reduces the loss.
Too large a learning rate causes the weight to overshoot the minimum, leading to oscillation. Too small and convergence is painfully slow.
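The oscillation effect can be reproduced directly. This sketch uses a target of y=0.5 so the loss has a finite minimum at w=0 (with y=1 the sigmoid's optimum runs off to infinity); the values x=2.0, w=1.0, and the step counts are assumptions chosen to make the overshoot visible:

```python
import math

def grad(w, x=2.0, y=0.5):
    # Gradient of (sigmoid(w*x) - y)^2 w.r.t. w; the minimum is at w = 0 when y = 0.5
    y_hat = 1 / (1 + math.exp(-w * x))
    return 2 * (y_hat - y) * y_hat * (1 - y_hat) * x

def trajectory(lr, w=1.0, steps=15):
    ws = [w]
    for _ in range(steps):
        w = w - lr * grad(w)
        ws.append(w)
    return ws

big = trajectory(lr=4.0)    # large step: w repeatedly overshoots 0 and flips sign
small = trajectory(lr=0.5)  # small step: w shrinks toward 0 without ever flipping

flips = sum(1 for a, b in zip(big, big[1:]) if a * b < 0)
print(flips)  # several sign changes with the large learning rate, none with the small one
```

Counting sign changes in the weight trajectory is a simple way to detect oscillation: the large learning rate produces many flips around the minimum, while the small one converges monotonically, just more slowly.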
Why This Matters for AI
Every modern AI system — from ChatGPT to image generators — relies on gradient descent at its core. The same principle shown here with a single weight scales to models with billions of parameters. Each parameter gets its own gradient computed via the chain rule, and they all update simultaneously.
Understanding this fundamental mechanism helps you:
- Debug training issues (exploding/vanishing gradients)
- Choose appropriate learning rates and optimizers
- Understand why techniques like batch normalization and learning rate scheduling work
- Build intuition for how neural networks learn from data