Backwards & Backprop by choice

This week I really dug into the background of DL - in particular loss functions, backprop and implementing them in vanilla Python (look ma … no libraries!). It was about going backwards in more ways than one - and I am glad of it.

I started my journey in this arena with fast.ai - an amazing program for getting up and running in Deep Learning fast. Going from little exposure to running your own CNNs so quickly was a great way to get a feel for what is going on in some of the key areas of DL and a sense of what is possible. It is based on the fastai library, which is, in turn, built on PyTorch - so you are necessarily working several levels of abstraction above the code and algorithms the DL models are built on. This week was without abstraction - I went to the (now very famous) CS231n lectures and worked to understand some of the major building blocks, implementing them in Python and paying close (and painful) attention to NumPy array dimensions and the like.

Backpropagation: a way of computing, through recursive application of the chain rule, the influence of every single intermediate value on the final loss function

The most impactful part for me was digging into backprop with lecture 4 of CS231n (so much so that I have watched both the 2016 and 2017 versions …). Visualizing backprop in terms of computational graphs was mind-blowing to me - the process of forward and backward propagation went from ‘oh yeah, makes sense to go in both directions and I can see that the chain rule helps’ to ‘oh so THAT’S how it is working’ at a level of grokking granularity. I realise how much working with specific examples helps me understand. Whether jumping in and assembling CNNs in fastai or calculating specific numerical gradients in CS231n, courses that are structured around specific working problems have the most value to my learning.

For example in the image below:

Activations (green) are calculated on the forward pass through the gates. Gradients (red) are calculated on the backward pass. Assuming that Z on the chart comes directly before the Loss calculation, we can calculate the gradient of the Loss with respect to Z (dL/dZ) directly. Using backprop we now want to calculate the gradients with respect to X and Y (so as to know how to adjust them to minimize the overall Loss). However, we cannot calculate their impact on the overall Loss (dL/dX or dL/dY) directly, as they are separated from the Loss value by Z. So to know their impact we need to use the chain rule to combine 1) their relationship to Z (dZ/dX or dZ/dY) and 2) Z’s relationship to the Loss (dL/dZ).

The lectures used the terminology of ‘local gradients’ (those between 2 consecutive points, such as dZ/dX) and ‘global gradients’ (those between the overall Loss and a particular point, such as dL/dX). In these terms, the global gradient at any point is equal to its local gradient multiplied by the upstream gradient (the global gradient of the point one step closer to the Loss) - or:

dL/dX = dL/dZ * dZ/dX

Generalized view - from CS231n
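To make the local * upstream rule concrete, here is a minimal sketch in vanilla Python. It assumes the gate is a simple multiply (Z = X * Y) and picks an arbitrary upstream gradient of 2.0. The gate type, function names and numbers are my own illustration, not taken from the lecture slide.

```python
# Hypothetical single gate Z = X * Y sitting directly before the loss.

def multiply_forward(x, y):
    """Forward pass: compute the activation Z."""
    return x * y

def multiply_backward(x, y, dL_dZ):
    """Backward pass: global gradient = upstream gradient * local gradient."""
    dZ_dX = y               # local gradient of Z with respect to X
    dZ_dY = x               # local gradient of Z with respect to Y
    dL_dX = dL_dZ * dZ_dX   # chain rule
    dL_dY = dL_dZ * dZ_dY
    return dL_dX, dL_dY

x, y = 3.0, -4.0
z = multiply_forward(x, y)                      # -12.0
dL_dZ = 2.0                                     # assumed upstream gradient
dL_dX, dL_dY = multiply_backward(x, y, dL_dZ)
print(dL_dX, dL_dY)                             # -8.0 6.0
```

Note that the backward function needs the forward-pass inputs (x and y) to compute its local gradients, which is why the activations get cached on the way forward.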

This really comes to life when you look at a Sigmoid Function in terms of a computational graph like the below. The gradient of the loss with respect to the value immediately to the right of the add gate is 0.20. This is calculated as the local gradient of the *-1 gate (the derivative of multiplying by -1, which is -1) multiplied by the upstream gradient (-0.20).

Source: CS231n April 2018 -  Lecture 4
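The sigmoid graph can also be walked through gate by gate in plain Python. This is a rough sketch, assuming the input values I recall from the lecture example (w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3); the variable names are my own.

```python
import math

# Assumed inputs (recalled from the lecture example, not stated in this post).
w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

# Forward pass, one gate at a time.
a = w0 * x0          # multiply gate
b = w1 * x1          # multiply gate
c = a + b            # add gate
d = c + w2           # add gate (the value to its right is d)
e = d * -1.0         # *-1 gate
f = math.exp(e)      # exp gate
g = f + 1.0          # +1 gate
out = 1.0 / g        # 1/x gate: the sigmoid output

# Backward pass: each gradient = local gradient * upstream gradient.
d_out = 1.0                       # gradient of the output with respect to itself
d_g = (-1.0 / g ** 2) * d_out     # local gradient of 1/x is -1/x^2
d_f = 1.0 * d_g                   # local gradient of +1 is 1
d_e = math.exp(e) * d_f           # local gradient of exp is exp
d_d = -1.0 * d_e                  # local gradient of *-1 is -1
d_w2 = 1.0 * d_d                  # add gate passes the gradient through
d_c = 1.0 * d_d
d_w0, d_x0 = x0 * d_c, w0 * d_c   # multiply gate swaps the inputs
d_w1, d_x1 = x1 * d_c, w1 * d_c

print(f"{d_e:.2f} {d_d:.2f}")     # -0.20 0.20
```

As a sanity check, d_d should match the closed-form sigmoid derivative out * (1 - out), which collapses the whole chain into a single gate.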


I was thinking of building a cool backprop animation in Unity as I was so taken by the power of viewing the sigmoid function above … but focused on doing more of the assignments and writing code instead. A less cool looking, but likely more rewarding outcome ;-)

Also - in the event they are of help to anyone else (and inspired by Tess Fernandez) - I am going to try and use images from my notes as I work each week…