Step-by-Step Derivation of Gradients for Logistic Regression using the Chain Rule

01 Mar 2025

Reading time ~2 minutes

We want to compute the gradients of the Binary Cross-Entropy Loss function for Logistic Regression using the chain rule.

Define the Loss Function

The Binary Cross-Entropy Loss (or Log Loss) is:

\[J(\mathbf{w}, b) = -\frac{1}{m} \sum_{i=1}^{m}\left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]\]

where:

  • \(\hat{y}^{(i)} = \sigma(z^{(i)})\) is the predicted probability.
  • \(z^{(i)} = \mathbf{X}^{(i)} \mathbf{w} + b\) is the linear combination of inputs and weights.
  • \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is the sigmoid function.
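
As a quick illustration of these definitions, here is a minimal NumPy sketch of the forward pass and the loss. The toy data, shapes, and variable names (`X`, `y`, `w`, `b`) are assumptions for illustration, not part of the original derivation:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical toy data: m = 4 examples, n = 2 features
X = np.array([[0.5, 1.2], [1.0, -0.3], [-0.7, 0.8], [2.0, 0.1]])
y = np.array([1.0, 0.0, 0.0, 1.0])
w = np.zeros(2)
b = 0.0

z = X @ w + b        # z^{(i)} = X^{(i)} w + b, one entry per example
y_hat = sigmoid(z)   # predicted probabilities
# Binary cross-entropy loss J(w, b)
J = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
print(J)             # log(2) ≈ 0.693 when w = 0, b = 0
```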

Our goal is to compute:

  1. Gradient of the loss with respect to \(\mathbf{w} \colon \frac{\partial J}{\partial \mathbf{w}}\).
  2. Gradient of the loss with respect to \(b \colon \frac{\partial J}{\partial b}\).

Compute Gradients using the Chain Rule

We apply the chain rule step by step.

Step 1: Differentiate the Loss Function

For a single example \((\mathbf{X}^{(i)}, y^{(i)})\), the loss function is:

\[J^{(i)}(\mathbf{w}, b) = - \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]\]

Our target is \(\frac{\partial J^{(i)}}{\partial z^{(i)}}\). First, differentiate with respect to \(\hat{y}^{(i)}\), using \(\frac{d}{dx} \ln (x) = \frac{1}{x}\) (the logarithms here are natural):

\[\frac {\partial J^{(i)}} {\partial \hat{y}^{(i)}} = - \frac {y^{(i)}} {\hat{y}^{(i)}} + \frac {1 - y^{(i)}} {1 - \hat{y}^{(i)}}\]

Using the property of the sigmoid function:

\[\frac{d}{dz} \sigma(z) = \sigma(z) (1 - \sigma(z))\]

we compute:

\[\frac{\partial \hat{y}^{(i)}}{\partial z^{(i)}} = \hat{y}^{(i)} (1 - \hat{y}^{(i)})\]

By the chain rule:

\[\begin{align} \frac {\partial J^{(i)}} {\partial z^{(i)}} & = \frac {\partial J^{(i)}} {\partial \hat{y}^{(i)}} \cdot \frac{\partial \hat{y}^{(i)}}{\partial z^{(i)}} \\ & = (- \frac {y^{(i)}} {\hat{y}^{(i)}} + \frac {1 - y^{(i)}} {1 - \hat{y}^{(i)}}) \cdot \hat{y}^{(i)} (1 - \hat{y}^{(i)}) \\ & = -y^{(i)}(1 - \hat{y}^{(i)}) + \hat{y}^{(i)}(1 - y^{(i)}) \\ & = -y^{(i)} + y^{(i)}\hat{y}^{(i)} + \hat{y}^{(i)} - \hat{y}^{(i)}y^{(i)} \\ & = \hat{y}^{(i)} - y^{(i)} \end{align}\]
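
As a sanity check on the identity \(\frac{\partial J^{(i)}}{\partial z^{(i)}} = \hat{y}^{(i)} - y^{(i)}\), here is a minimal sketch comparing it against a central finite-difference approximation. The sample values and step size `eps` are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    # Per-example binary cross-entropy J^{(i)} as a function of z^{(i)}
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z, y, eps = 0.7, 1.0, 1e-6
analytic = sigmoid(z) - y                                  # y_hat - y
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)
print(analytic, numeric)   # the two values should agree closely
```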

Step 2: Compute \(\frac{\partial J}{\partial \mathbf{w}}\)

Using the chain rule:

\[\frac {\partial J} {\partial \mathbf{w}} = \frac {1} {m} \sum_{i=1}^{m} \frac {\partial J^{(i)}} {\partial z^{(i)}} \cdot \frac {\partial z^{(i)}} {\partial \mathbf{w}}\]

Since \(z^{(i)} = \mathbf{X}^{(i)} \mathbf{w} + b\), where \(\mathbf{X}^{(i)}\) is the \(i\)-th row of \(\mathbf{X}\), we have:

\[\frac {\partial z^{(i)}} {\partial \mathbf{w}} = \mathbf{X}^{(i)}\]

Substituting \(\frac{\partial J^{(i)}}{\partial z^{(i)}} = \hat{y}^{(i)} - y^{(i)}\) and stacking all \(m\) examples into matrix form gives:

\[\frac {\partial J} {\partial \mathbf{w}} = \frac {1} {m} \mathbf{X}^T \left( \sigma(\mathbf{X} \mathbf{w} + b) - \mathbf{y} \right)\]
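
A minimal NumPy sketch of this vectorized gradient (the variable names `X`, `y`, `w`, `b` mirror the notation above and are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_w(X, y, w, b):
    # dJ/dw = (1/m) X^T (sigmoid(X w + b) - y)
    m = X.shape[0]
    return X.T @ (sigmoid(X @ w + b) - y) / m
```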

Step 3: Compute \(\frac{\partial J}{\partial b}\)

Similarly, using the chain rule:

\[\frac {\partial J} {\partial b} = \frac {1} {m} \sum_{i=1}^{m} \frac {\partial J^{(i)}} {\partial z^{(i)}} \cdot \frac {\partial z^{(i)}} {\partial b}\]

Since \(\frac {\partial z^{(i)}} {\partial b} = 1\), we get:

\[\frac {\partial J} {\partial b} = \frac {1} {m} \sum_{i=1}^{m} \left( \sigma(\mathbf{X}^{(i)} \mathbf{w} + b) - y^{(i)} \right)\]
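
And the corresponding sketch for the bias gradient, which is just the mean residual (same assumed variable names as above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_b(X, y, w, b):
    # dJ/db = (1/m) sum_i (sigmoid(X^{(i)} w + b) - y^{(i)})
    return np.mean(sigmoid(X @ w + b) - y)
```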

Final Gradient Descent Update

Using the gradients, we update parameters:

\[\begin{align} \mathbf{w} & := \mathbf{w} - \alpha \frac {1} {m} \mathbf{X}^{T} \left( \sigma(\mathbf{X} \mathbf{w} + b) - \mathbf{y} \right) \\ b & := b - \alpha \frac {1} {m} \sum_{i=1}^{m} \left( \sigma(\mathbf{X}^{(i)} \mathbf{w} + b) - y^{(i)} \right) \end{align}\]

where \(\alpha\) is the learning rate.
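
Putting the pieces together, a minimal batch gradient-descent loop might look like the sketch below. The learning rate, iteration count, and toy data are assumptions for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, alpha=0.1, n_iters=1000):
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iters):
        residual = sigmoid(X @ w + b) - y      # y_hat - y
        w -= alpha * (X.T @ residual) / m      # dJ/dw
        b -= alpha * np.mean(residual)         # dJ/db
    return w, b

# Hypothetical toy data: label 1 when x1 + x2 > 1
X = np.array([[0.0, 0.2], [0.3, 0.4], [0.9, 0.8], [1.2, 1.5]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = fit(X, y)
print(sigmoid(X @ w + b))   # predicted probabilities after training
```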

Summary

  • We applied the chain rule step by step to compute gradients.
  • The derivative of the sigmoid function played a key role.
  • The gradients were then used for Gradient Descent updates.


machine-learning