Step-by-Step Derivation of Gradients for Logistic Regression using the Chain Rule
01 Mar 2025
Reading time ~2 minutes
We want to compute the gradients of the Binary Cross-Entropy Loss function for Logistic Regression using the chain rule.
Define the Loss Function
The Binary Cross-Entropy Loss (or Log Loss) is:
\[J(w, b)= -\frac{1}{m} \sum_{i=1}^{m}\left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]\]where:
- \(\hat{y}^{(i)} = \sigma(z^{(i)})\) is the predicted probability.
- \(z^{(i)} = \mathbf{X}^{(i)} \mathbf{w} + b\) is the linear combination of the \(i\)-th example's features and the weights, plus the bias.
- \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is the sigmoid function.
Our goal is to compute:
- Gradient of the loss with respect to \(\mathbf{w} \colon \frac{\partial J}{\partial \mathbf{w}}\).
- Gradient of the loss with respect to \(b \colon \frac{\partial J}{\partial b}\).
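Before deriving anything, it helps to see these quantities in code. Below is a minimal NumPy sketch (the helper names `sigmoid`, `predict_proba`, and `bce_loss` and the toy numbers are my own, not from any particular library) that computes \(\hat{y}\) and the loss \(J(\mathbf{w}, b)\) for a small batch of examples.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    """Predicted probabilities y_hat = sigma(X w + b) for all m examples."""
    return sigmoid(X @ w + b)

def bce_loss(y_hat, y, eps=1e-12):
    """Binary cross-entropy J(w, b), averaged over the m examples.

    eps clips probabilities away from 0 and 1 so the logs stay finite.
    """
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

# Tiny made-up dataset: m = 4 examples, n = 2 features.
X = np.array([[0.5, 1.2], [1.0, -0.7], [-0.3, 0.8], [2.0, 0.1]])
y = np.array([1.0, 0.0, 1.0, 1.0])
w = np.zeros(2)
b = 0.0

y_hat = predict_proba(X, w, b)
print(bce_loss(y_hat, y))  # log(2) ≈ 0.6931, since every probability is 0.5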
Compute Gradients using the Chain Rule
We apply the chain rule step by step.
Step 1: Differentiate the Loss Function
For a single example \((\mathbf{X}^{(i)}, y^{(i)})\), the loss function is:
\[J^{(i)}(w, b) = - \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]\]To get \(\frac{\partial J^{(i)}}{\partial z^{(i)}}\), we first differentiate w.r.t. \(\hat{y}^{(i)}\).
Because \(\frac{d}{dx} \log_a (x) = \frac{1}{x \ln (a)}\) (and the logarithm here is the natural log, so \(\ln a = 1\)):
\[\frac {\partial J^{(i)}} {\partial \hat{y}^{(i)}} = - \frac {y^{(i)}} {\hat{y}^{(i)}} + \frac {1 - y^{(i)}} {1 - \hat{y}^{(i)}}\]Next, using the derivative of the sigmoid function:
\[\frac{d}{dz} \sigma(z) = \sigma(z) (1 - \sigma(z))\]we compute:
\[\frac{\partial \hat{y}^{(i)}}{\partial z^{(i)}} = \hat{y}^{(i)} (1 - \hat{y}^{(i)})\]By the chain rule:
\[\begin{align} \frac {\partial J^{(i)}} {\partial z^{(i)}} & = \frac {\partial J^{(i)}} {\partial \hat{y}^{(i)}} \cdot \frac{\partial \hat{y}^{(i)}}{\partial z^{(i)}} \\ & = \left( - \frac {y^{(i)}} {\hat{y}^{(i)}} + \frac {1 - y^{(i)}} {1 - \hat{y}^{(i)}} \right) \cdot \hat{y}^{(i)} (1 - \hat{y}^{(i)}) \\ & = -y^{(i)}(1 - \hat{y}^{(i)}) + \hat{y}^{(i)}(1 - y^{(i)}) \\ & = -y^{(i)} + y^{(i)}\hat{y}^{(i)} + \hat{y}^{(i)} - \hat{y}^{(i)}y^{(i)} \\ & = \hat{y}^{(i)} - y^{(i)} \end{align}\]
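As a quick sanity check on this result, the snippet below (a minimal sketch with arbitrary example values; `loss_of_z` is my own helper name) compares \(\hat{y}^{(i)} - y^{(i)}\) against a numerical finite-difference derivative of the per-example loss with respect to \(z^{(i)}\).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_of_z(z, y):
    """Per-example loss J^(i), written directly as a function of z^(i)."""
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))

z, y = 0.37, 1.0          # arbitrary example values
h = 1e-6                  # finite-difference step size

analytic = sigmoid(z) - y                                        # y_hat - y
numeric = (loss_of_z(z + h, y) - loss_of_z(z - h, y)) / (2 * h)  # central difference

print(analytic, numeric)  # the two values should agree to several decimal places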
Step 2: Compute \(\frac{\partial J}{\partial \mathbf{w}}\)
Using the chain rule:
\[\frac {\partial J} {\partial \mathbf{w}} = \frac {1} {m} \sum_{i=1}^{m} \frac {\partial J^{(i)}} {\partial z^{(i)}} \cdot \frac {\partial z^{(i)}} {\partial \mathbf{w}}\]Since \(z^{(i)} = \mathbf{X}^{(i)} \mathbf{w} + b\), we have:
\[\frac {\partial z^{(i)}} {\partial \mathbf{w}} = \mathbf{X}^{(i)}\]Substituting \(\frac{\partial J^{(i)}}{\partial z^{(i)}} = \hat{y}^{(i)} - y^{(i)}\) and stacking all \(m\) examples into the design matrix \(\mathbf{X}\) gives the vectorized form:
\[\frac {\partial J} {\partial \mathbf{w}} = \frac {1} {m} \mathbf{X}^T \left( \sigma(\mathbf{X} \mathbf{w} + b) - \mathbf{y} \right)\]
Step 3: Compute \(\frac{\partial J}{\partial b}\)
Similarly, using the chain rule:
\[\frac {\partial J} {\partial b} = \frac {1} {m} \sum_{i=1}^{m} \frac {\partial J^{(i)}} {\partial z^{(i)}} \cdot \frac {\partial z^{(i)}} {\partial b}\]Since \(\frac {\partial z^{(i)}} {\partial b} = 1\), we get:
\[\frac {\partial J} {\partial b} = \frac {1} {m} \sum_{i=1}^{m} \left( \sigma(\mathbf{X}^{(i)} \mathbf{w} + b) - y^{(i)} \right)\]
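In code, both gradients are one line each. The sketch below (the function name `gradients` and the variable names `dw` and `db` are my own) implements \(\frac{1}{m}\mathbf{X}^T(\hat{\mathbf{y}} - \mathbf{y})\) and \(\frac{1}{m}\sum_{i=1}^{m} (\hat{y}^{(i)} - y^{(i)})\) for a toy dataset.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(X, y, w, b):
    """Vectorized gradients of the averaged binary cross-entropy loss.

    X: (m, n) design matrix, y: (m,) labels in {0, 1},
    w: (n,) weights, b: scalar bias.
    """
    m = X.shape[0]
    y_hat = sigmoid(X @ w + b)   # predicted probabilities
    error = y_hat - y            # dJ/dz for every example
    dw = X.T @ error / m         # (1/m) X^T (y_hat - y), shape (n,)
    db = np.sum(error) / m       # (1/m) sum of (y_hat - y), scalar
    return dw, db

# Quick shape check with made-up data.
X = np.array([[0.5, 1.2], [1.0, -0.7], [-0.3, 0.8], [2.0, 0.1]])
y = np.array([1.0, 0.0, 1.0, 1.0])
dw, db = gradients(X, y, np.zeros(2), 0.0)
print(dw.shape, db)  # (2,) and a scalar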
Final Gradient Descent Update
Using the gradients, we update the parameters:
\[\begin{align} \mathbf{w} & := \mathbf{w} - \alpha \frac {1} {m} \mathbf{X}^{T} \left( \sigma(\mathbf{X} \mathbf{w} + b) - \mathbf{y} \right) \\ b & := b - \alpha \frac {1} {m} \sum_{i=1}^{m} \left( \sigma(\mathbf{X}^{(i)} \mathbf{w} + b) - y^{(i)} \right) \end{align}\]where \(\alpha\) is the learning rate.
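Putting the pieces together, a plain batch gradient descent loop looks like the following sketch. The hyperparameters `alpha` and `n_iters` and the toy data are illustrative choices, not values from the derivation above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up toy data: m = 4 examples, n = 2 features.
X = np.array([[0.5, 1.2], [1.0, -0.7], [-0.3, 0.8], [2.0, 0.1]])
y = np.array([1.0, 0.0, 1.0, 1.0])

alpha, n_iters = 0.1, 1000          # illustrative hyperparameters
m = X.shape[0]
w, b = np.zeros(X.shape[1]), 0.0

for _ in range(n_iters):
    error = sigmoid(X @ w + b) - y  # y_hat - y, i.e. dJ/dz for every example
    w -= alpha * (X.T @ error) / m  # w := w - alpha * dJ/dw
    b -= alpha * np.sum(error) / m  # b := b - alpha * dJ/db

print(sigmoid(X @ w + b))           # probabilities should move toward y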
Summary
- We applied the chain rule step by step to compute gradients.
- The derivative of the sigmoid function played a key role.
- The gradients were then used for Gradient Descent updates.