Backpropagation for deep neural networks

Jan 21, 2017

Backpropagation is an algorithm used to calculate partial derivatives of the cost function with regards to weights, by propagating the errors from the output layer back to the first one

Feed forward

I’m going to use the notation θjk(l)\theta_{jk}^{(l)} to indicate the weight of the connection from neuron kthk^{th} in layer (l1)th(l - 1)^{th} to neuron jthj^{th} in layer lthl^{th}.

The activation aj(l)a_{j}^{(l)} for the neuron jthj^{th} in layer lthl^{th} is going to be:

aj(l)=g(kθjk(l)ak(l1)) a_j^{(l)} = g(\sum_k \theta_{jk}^{(l)} a_k^{(l-1)})

Where a0(l)=1a_0^{(l)} = 1 for every level, to account for the bias term.

For example, you can have:

a1(2)=g(θ10(2)x0+θ11(2)x1+θ12(2)x2+θ13(2)x3)a2(2)=g(θ20(2)x0+θ21(2)x1+θ22(2)x2+θ23(2)x3)a3(2)=g(θ30(2)x0+θ31(2)x1+θ32(2)x2+θ33(2)x3) \begin{matrix} a_1^{(2)} = g(\theta_{10}^{(2)}x_0 + \theta_{11}^{(2)}x_1 + \theta_{12}^{(2)}x_2 + \theta_{13}^{(2)}x_3) \cr a_2^{(2)} = g(\theta_{20}^{(2)}x_0 + \theta_{21}^{(2)}x_1 + \theta_{22}^{(2)}x_2 + \theta_{23}^{(2)}x_3) \cr a_3^{(2)} = g(\theta_{30}^{(2)}x_0 + \theta_{31}^{(2)}x_1 + \theta_{32}^{(2)}x_2 + \theta_{33}^{(2)}x_3) \end{matrix}

So you can write in vectorized form as: a(l)=g(Θ(l)a(l1)) \mathbf{a}^{(l)} = g(\Theta^{(l)} \mathbf{a}^{(l-1)})

When calculating the activation for one layer, you first need to calculate the quantity: z(l)=Θ(l)a(l1) \mathbf{z}^{(l)} = \Theta^{(l)} \mathbf{a}^{(l-1)}

This is called the weighted input to the neurons in layer ll.

So much for the feed forward phase.

Cost function

We can think of neural networks as a class of parametric non linear functions from a vector x\vec{x} of input variables to a vector y\vec{y} of output variables. So we can find weights (parameters) as in polynomial curve fitting by minimizing the sum-of-squares error function. So we define a cost function J as

J(w)=12n=0Ny(xn,w)tn2 J(\mathbf{w}) = \frac{1}{2} \sum_{n=0}^N {\parallel \mathbf{y}(\mathbf{x_n},\mathbf{w}) - \mathbf{t_n} \parallel}^2

Where N is the number of training examples, tn\mathbf{t_n} is the desired output for the training sample n, and y(xn,w)\mathbf{y}(\mathbf{x_n},\mathbf{w}) is the calculated output for the corresponding sample. Notice as the cost function is considered dependent only on weights and not on inputs and/or ground truth, as they are given and cannot be changed.

To use the gradient descend, we need to calculate the partial derivatives Jnw\frac{\partial J_n}{\partial w} for a single training example and then recover the Jw\frac{\partial J}{\partial w} by averaging over the entire training set. The backpropagation algorithm will be used to calculate such derivatives.



Consider the idea to change the weighted input for the neuron jthj^{th} in layer ll of a small quantity Δzj(l)\Delta z_j^{(l)}. The neuron will output g(zj(l)+Δzj(l))g(z_j^{(l)} + \Delta z_j^{(l)}) instead of g(zj(l))g(z_j^{(l)}), causing an overall change to cost J of the amount Jzj(l)Δzj(l)\frac{\partial J}{\partial z_j^{(l)}} \Delta z_j^{(l)}

If the rate of change of the cost w.r.t. zj(l)z_j^{(l)} is small, then for a small Δzj(l)\Delta z_j^{(l)} the cost won’t change too much. In this case we say the neuron is nearly optimal. So the quantity Jzj(l)\frac{\partial J}{\partial z_j^{(l)}} measures, somehow, how much the neuron is not optimized, and we call it the error δj(l)\delta_j^{(l)} of the neuron. So by definition we have:

δj(l)Jzj(l) \delta_j^{(l)} \equiv \frac{\partial J}{\partial z_j^{(l)}}

Varying zj(l)z_j^{(l)} while keeping all other things fixed has some repercussions on the next layer. For a neuron k in layer (l + 1) one can write using the chain rule:

Jzj(l)k=Jzk(l+1)zk(l+1)zj(l) \frac{\partial J}{\partial z_j^{(l)}} |_k = \frac{\partial J}{\partial z_k^{(l+1)}} \frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}

Which is to say: the contribution of neuron k at the rate of change of cost J caused by neuron j (in the previous layer) is how the cost changes with regards to the weighted sum of neuron k, multiplied for how that said weighted sum changes with regards to the weighted sum of neuron j (chain rule differentiation of function composition).

We can then sum up all the contributions at the level l+1 and state the following:

Jzj(l)=kJzk(l+1)zk(l+1)zj(l) \frac{\partial J}{\partial z_j^{(l)}} = \sum_k \frac{\partial J}{\partial z_k^{(l+1)}} \frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}

This is the most important equation of backpropagation, as it put in relation the error δj(l)\delta_j^{(l)} with the errors in the next layer. We can rewrite it as:

Jzj(l)δj(l)=kδk(l+1)zk(l+1)zj(l) \frac{\partial J}{\partial z_j^{(l)}} \equiv \delta_j^{(l)} = \sum_k \delta_k^{(l+1)} \frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}

delta backpropagation

Regarding the quantity zk(l+1)zj(l)\frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}, we can calculate it starting from the following equation:

zk(l+1)=jθkj(l+1)aj(l)=jθkj(l+1)g(zj(l)) z_k^{(l+1)} = \sum_j \theta_{kj}^{(l+1)} a_j^{(l)} = \sum_j \theta_{kj}^{(l+1)} g(z_j^{(l)})

If we differentiate w.r.t. zj(l)z_j^{(l)} we obtain

zk(l+1)zj(l)=θkj(l+1)g(zj(l)) \frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}} = \theta_{kj}^{(l+1)} g\prime(z_j^{(l)})

So we can finally write

δj(l)=kδk(l+1)θkj(l+1)g(zj(l)) \delta_j^{(l)} = \sum_k \delta_k^{(l+1)} \theta_{kj}^{(l+1)} g\prime(z_j^{(l)})