Multi-Layer Networks and Backpropagation

As you saw in the previous Perceptron lesson, the simplest neural networks share a limitation with other linear classifiers: they cannot solve the XOR problem. The Neural Networks and Deep Learning online book explains how multiple perceptrons can be connected to each other to overcome this problem. This approach should seem fairly natural, considering that neural networks are inspired by how neurons in the human brain are connected. The output of one layer is the input of the next layer: this is called a feed-forward neural network. In our examples, all layers are fully connected (which means that the output of each neuron in Layer n is fed to every neuron in Layer n+1). How to train these multi-layer networks remained an open problem for years, until in 1986 Rumelhart et al. published an article about the backpropagation algorithm.
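To make the idea of stacking perceptrons concrete, here is a minimal sketch of a two-layer, fully connected feed-forward network that computes XOR. The weights and biases are hand-picked for illustration (they are an assumption, not values from the lesson or the book), and the helper names step, neuron and xor_network are hypothetical.

```python
def step(z):
    """Perceptron step activation: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0


def neuron(inputs, weights, bias):
    """A single perceptron: weighted sum of the inputs plus bias, then step."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def xor_network(x1, x2):
    # Hidden layer: both neurons receive every input (fully connected).
    h_or = neuron([x1, x2], [1, 1], -0.5)     # behaves like OR
    h_nand = neuron([x1, x2], [-1, -1], 1.5)  # behaves like NAND
    # Output layer: receives every hidden output and behaves like AND.
    return neuron([h_or, h_nand], [1, 1], -1.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))
```

No single perceptron can produce this truth table, but the hidden layer turns the inputs into two intermediate signals that the output neuron can separate with a straight line.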

This course material will provide a brief overview of the backpropagation algorithm. Links will be provided for further reading. The steps of backpropagation are:

Perform a forward pass (just like when predicting)
Compute the error (e.g. squared error)
Starting with the last layer, the output layer, compute how much each weight contributed to the total error - these are the gradients you already know.
…then compute the same for the second-to-last layer (the last hidden layer).
…and continue this process until the gradients for the weights of all layers have been computed.
Finally, take a gradient step. (Multiply the gradients by a chosen learning_rate, for example 0.001, and subtract the result from the current weights.)
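The steps above can be sketched in plain Python for a tiny network with two inputs, two hidden neurons and one output. This is only an illustrative sketch under several assumptions: a sigmoid activation in every neuron, the squared error 0.5 * (y - target)^2, a single training example, and arbitrary starting weights. The function names and the learning rate of 0.5 (larger than the 0.001 mentioned above, so that the effect of one step is visible) are choices made for this example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, W2, b2):
    """Forward pass: returns the hidden activations and the network output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y


def squared_error(y, target):
    return 0.5 * (y - target) ** 2


def backprop_step(x, target, W1, b1, W2, b2, learning_rate):
    # Step 1: forward pass.
    h, y = forward(x, W1, b1, W2, b2)
    # Steps 2-3: derivative of the error at the output neuron's weighted sum,
    # using d(sigmoid)/dz = y * (1 - y), then the output-layer gradients.
    delta_out = (y - target) * y * (1.0 - y)
    grad_W2 = [delta_out * hi for hi in h]
    grad_b2 = delta_out
    # Step 4: pass the error back through W2 to the hidden layer.
    delta_hidden = [delta_out * w * hi * (1.0 - hi) for w, hi in zip(W2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    grad_b1 = delta_hidden
    # Final step: subtract learning_rate * gradient from every weight.
    new_W1 = [[w - learning_rate * g for w, g in zip(row, grad_row)]
              for row, grad_row in zip(W1, grad_W1)]
    new_b1 = [b - learning_rate * g for b, g in zip(b1, grad_b1)]
    new_W2 = [w - learning_rate * g for w, g in zip(W2, grad_W2)]
    new_b2 = b2 - learning_rate * grad_b2
    return new_W1, new_b1, new_W2, new_b2


# Arbitrary example values (assumptions made for this sketch).
x, target = [0.05, 0.10], 0.01
W1, b1 = [[0.15, 0.20], [0.25, 0.30]], [0.35, 0.35]
W2, b2 = [0.40, 0.45], 0.60

_, y_before = forward(x, W1, b1, W2, b2)
W1, b1, W2, b2 = backprop_step(x, target, W1, b1, W2, b2, learning_rate=0.5)
_, y_after = forward(x, W1, b1, W2, b2)

loss_before = squared_error(y_before, target)
loss_after = squared_error(y_after, target)
print("error before:", loss_before, "error after:", loss_after)
```

Note that the hidden-layer gradients reuse delta_out, which was already computed for the output layer. Real training repeats this step many times over many examples.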

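Notice that gradient descent cannot take steps on a flat surface: the step function has a slope of zero everywhere except at the threshold, so its gradient gives no direction to move in. For this reason, Rumelhart et al. replaced the perceptron's step function with a smooth, S-shaped activation function (the logistic sigmoid in their paper; tanh is a similarly shaped alternative).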
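A quick numerical check illustrates the difference. The sketch below estimates the slope of the step function and of tanh at a few sample points with a central difference; the helper numerical_slope and the sample points are assumptions made for this example.

```python
import math


def step(z):
    return 1.0 if z > 0 else 0.0


def numerical_slope(f, z, eps=1e-4):
    """Central-difference estimate of the derivative of f at z."""
    return (f(z + eps) - f(z - eps)) / (2 * eps)


points = (-2.0, -0.5, 0.5, 2.0)
step_slopes = [numerical_slope(step, z) for z in points]
tanh_slopes = [numerical_slope(math.tanh, z) for z in points]

for z, s, t in zip(points, step_slopes, tanh_slopes):
    print(f"z={z:+.1f}  step slope={s:.4f}  tanh slope={t:.4f}")
```

Away from the threshold the step function's slope is exactly zero, so an error signal multiplied by it vanishes. The slope of tanh is positive everywhere, so some gradient always flows back to the weights.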

Matt Mazur provides an excellent step-by-step example of backpropagation with numbers. In practice, we don't usually have to write our own implementation of backpropagation like Mazur has done. Neural network frameworks and libraries, such as TensorFlow, ship with their own efficient implementations of backpropagation. In TensorFlow, the gradients can be calculated using automatic differentiation (autodiff).
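As a small illustration of autodiff, the sketch below uses tf.GradientTape from TensorFlow 2 to differentiate a toy loss with respect to a single weight. The toy loss w * w and the starting value 3.0 are assumptions chosen so the result is easy to verify by hand (the derivative of w^2 is 2w); the example requires TensorFlow 2 to be installed.

```python
import tensorflow as tf

# A single trainable weight with an arbitrary starting value.
w = tf.Variable(3.0)

# Operations executed inside the tape are recorded for differentiation.
with tf.GradientTape() as tape:
    loss = w * w

# d(loss)/dw, computed by automatic differentiation.
grad = tape.gradient(loss, w)
print(float(grad))

# One gradient step with a chosen learning rate.
learning_rate = 0.1
w.assign_sub(learning_rate * grad)
print(float(w))
```

The same pattern scales to full networks: pass a list of variables to tape.gradient and it returns one gradient per variable.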

Read Chapter 2, "How the backpropagation algorithm works", of the Neural Networks and Deep Learning book. Even if some of the mathematics in the chapter feels a bit overwhelming, read the chapter through and try to understand the intuition behind backpropagation. After reading the chapter, you should be able to answer the question "Why would we perform backpropagation instead of just computing the partial derivatives of all weights one by one?"