DL weekly study notes: batch normalization, PyTorch’s APIs and distribution visualization

Updated on May 3. Code reproduction of Andrej Karpathy’s “building makemore part 3” lecture can be found at https://github.com/gangfang/makemore/blob/main/makemore_part3.ipynb

This week’s notes include some older notes, written for Andrej’s “Building makemore Part 3: Activations, Gradients & BatchNorm” YouTube video.

Principles:

We want stable gradients (neither exploding nor vanishing) flowing through the non-linearities throughout the network. Therefore, we want roughly Gaussian distributions of activations throughout the neural network, achieved through normalization (primary) and initialization (secondary).

Recap: 3 things

  • Intro to BN
  • Build layer APIs similar to PyTorch’s
  • Intro to diagnostic tools/plots: activation distributions, activation-gradient distributions, parameter-gradient distributions, and the update-to-data ratio over iterations

Notes:

  • In March 3’s study notes, I wrote about using different initialization methods for training. Note that the hockey-stick-shaped loss graph looks appealing, but it is not exactly desirable: the hockey shape comes from the drastic decrease of the loss in the first few iterations, which dwarfs all the subsequent ones, and those later iterations are the ones that do the hard/useful work. A cleverer initialization makes the loss start out at a much lower value, rendering that initial drop in the hockey-shaped loss graph unnecessary; that is what Andrej talked about in lec4’s “fixing the initial loss” section.
  • The cleverer initialization I mentioned focuses on the weights that calculate logits, as their magnitude directly affects the logits. And we want to avoid excessive initial loss due to these weights.
    • #24/wk16 We can “focus” on the weights in this last layer because we are talking about initialization, which we directly control. The scheme is to set the biases of the last layer all to 0 and to scale all of its weights down by a multiplication factor of something like 0.01.
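      A minimal sketch of that scheme (assuming, as in the lecture’s MLP, that W2/b2 are the last layer’s weights/biases; n_hidden and vocab_size are placeholder sizes):
      import torch
      n_hidden, vocab_size = 100, 27
      W2 = torch.randn((n_hidden, vocab_size)) * 0.01   # shrink the weights that produce the logits
      b2 = torch.zeros(vocab_size)                      # start the logit biases at exactly 0
      # logits = h @ W2 + b2 start near zero -> near-uniform softmax -> no excessive initial loss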
  • We also consider the earlier layers’ weights, but the aim there is different: preventing the saturated-tanh problem.
    • There are graphs we can plot to detect the saturated-tanh problem. See the fourth segment of the Building makemore Part 3 video.
    • The problem gets nastier as the network gets deeper, so we have to be careful with the initialization. Again, the plots help visualize it.
  • The key idea of Kaiming initialization is to maintain a consistent and desirable variance of activations (i.e., outputs after the activation function) across layers. In Andrej’s lecture, only the input distribution and the distribution of the linear output are compared, but in Kaiming initialization the nonlinear activation function is also taken into account. The purpose of maintaining the variance is to prevent vanishing or exploding gradients. The mechanics behind this is that varying variances lead to very small or very large signals in the forward pass, and vanishing or exploding signals lead to vanishing or exploding gradients in different manners, depending on the activation function used. For example, if tanh is used, we get vanishing gradients when the signals are exceedingly large (saturation), while exploding gradients are less of a problem.
    • “fan in” and “fan out” are terms used to refer to the number of input and output connections of a layer, respectively.
      It makes sense that Kaiming initialization incorporates fan in or fan out, because the result of a matrix multiplication depends directly on the number of terms in the summation.
    • Vanishing gradients make convergence very slow or even halt the entire training process. Exploding gradients make the training process unstable.
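    A sketch of the Kaiming scaling (the 5/3 gain is the tanh gain used in the lecture; the layer sizes here are made up):
    import torch
    fan_in, fan_out = 30, 200                               # number of input / output connections of the layer
    gain = 5/3                                              # gain for tanh; ReLU would use sqrt(2)
    W = torch.randn(fan_in, fan_out) * gain / fan_in**0.5   # keeps activation variance roughly constant across layers
    # PyTorch provides this as torch.nn.init.kaiming_normal_ / kaiming_uniform_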
  • w = torch.randn(10, 200) can be visualized as the connections between a layer of 10 neurons and next layer of 200 neurons
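    For example (row-vector convention; the batch size of 32 is arbitrary):
    import torch
    w = torch.randn(10, 200)   # w[i, j] is the connection from input neuron i to output neuron j
    x = torch.randn(32, 10)    # a batch of 32 examples with 10 features each
    out = x @ w                # shape (32, 200): each output neuron sums 10 weighted inputs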
  • #24/wk16 Modern innovations in neural network optimization make initialization easier. They also made training much deeper neural networks possible around 2015:
    • improved activation functions like relu
    • residual connections
    • normalization of different kinds
    • better optimization techniques like Adam
  • A saturation level of around 5% (the fraction of tanh outputs stuck near ±1) is an acceptable number.
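    One way to measure it (the 0.99 threshold and the made-up pre-activations are my choices for illustration):
    import torch
    h = torch.tanh(torch.randn(32, 100) * 3)              # tanh outputs of deliberately too-wide pre-activations
    saturation = (h.abs() > 0.99).float().mean().item()   # fraction of activations in the flat tails
    print(saturation)                                     # we’d like this to be around 0.05 or lower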
  • The case of stacking linear layers: without non-linearities, no matter how many layers a neural network has, it essentially behaves like a single-layer linear model. Viewed end to end, the transformation from input to output is still linear, so stacking does not increase representational power.
    • Mathematically, if a first linear layer computes y = Wx + b and a second linear layer transforms y into z using W' and b', then z = W'y + b' = W'(Wx + b) + b'. This can be rearranged into a new linear transformation z = (W'W)x + (W'b + b'), which is still linear. It doesn’t create polynomial transformations because the new layer only ever multiplies the inputs by the weights, which are constants; the inputs are never multiplied by themselves.
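      A quick numerical check of this collapse (row-vector convention as in the other snippets in these notes; shapes and values are arbitrary):
      import torch
      x = torch.randn(32, 10)
      W, b = torch.randn(10, 20), torch.randn(20)
      W2, b2 = torch.randn(20, 5), torch.randn(5)
      two_layers = (x @ W + b) @ W2 + b2                         # two stacked linear layers
      collapsed = x @ (W @ W2) + (b @ W2 + b2)                   # one equivalent linear layer
      print(torch.allclose(two_layers, collapsed, atol=1e-5))    # True, up to floating-point error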
  • The loss didn’t continue to drop after using BN because the bottleneck is no longer optimization but likely the architecture; we might need a context length of more than 3 (i.e., using more than 3 prior characters as inputs).
    • I have to be careful about using the word “architecture” though; it could mean the topology of the network, or something else. In this case, Andrej suspected the limiting factor is the context length, which is easy to understand: to further increase the accuracy of the predictions, more information needs to go into training and inference.
Batch normalization
  • BN normalizes each neuron’s values across the examples in a batch, rather than across the neurons of a single example.
  • Andrej said a BN layer is usually placed right after a multiplication (i.e., a linear or convolutional layer), which implies normalizing the pre-activation. But there are debates about this placement. And many BN layers can be used in a single network.
  • Batch normalization normalizes the results of computations, whereas initialization techniques apply to the parameters, the building blocks of those computations. It makes sense that less care needs to be paid to initialization when BN is used. In fact, as the network gets deeper, it becomes more difficult to initialize the network parameters well.
  • #24/wk17 BN introduces non-determinism into the forward pass computation through the random minibatch construct.
    • But how was the minibatch not introducing randomness into the forward pass before BN was used? The dimension of hpreact (hidden-layer pre-activations) is 32 examples by 100 neurons. The randomness the minibatch introduces only affects the selection of the 32 examples, not the forward-pass computation for any individual example.
    • The non-determinism introduced by BN during the forward pass affects the outputs of the BN layer. Specifically, the values a neuron takes after undergoing BN are influenced by which 32 examples were randomly selected for the minibatch, and by the pre-BN values of all examples in that minibatch (which in turn depend on things like the weight initialization). The influence comes through subtracting the mean over all examples from each neuron’s value and then dividing by the std over all examples:
      (hpreact - hpreact.mean(0,keepdim=True)) / hpreact.std(0,keepdim=True)
    • this is by and large seen as an undesirable characteristic because of the mathematical coupling of random examples in the batch.
    • However, people are still using it because of the regularizing effect (a side effect) BN has. The trained network is less likely to overfit training examples because BN influences values in random directions for those examples
    • BN dictates the use of batching, which is a problem at inference time, when we may want to feed in a single example and there is no batch to compute statistics over. The solution is to use a mean and standard deviation calculated over the entire training set, obtained either after training as a separate calibration step or during training as a running estimate, outside of the gradient-based optimization.
      • the running estimate is calculated using an exponential moving average (EMA): new estimate = decay × old estimate + (1 − decay) × new data. The decay is chosen as 0.999 in Andrej’s lecture.
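        In code, the update looks roughly like this (variable names are how I remember the lecture’s notebook; bnmeani/bnstdi are the current minibatch’s mean and std, and the running buffers start at the values expected at initialization):
        import torch
        n_hidden = 100
        bnmean_running = torch.zeros((1, n_hidden))   # pre-activations are roughly zero-mean at init
        bnstd_running = torch.ones((1, n_hidden))     # ...and roughly unit-std
        bnmeani = torch.zeros((1, n_hidden))          # stand-ins for the current batch's statistics
        bnstdi = torch.ones((1, n_hidden))
        with torch.no_grad():                         # the running estimates are not trained by gradient descent
            bnmean_running = 0.999 * bnmean_running + 0.001 * bnmeani
            bnstd_running = 0.999 * bnstd_running + 0.001 * bnstdi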
    • with pre-activation normalization, the linear layer’s bias term is redundant because we have a BN bias term of the same shape and function. The regular bias is added before normalization, so subtracting the per-neuron mean cancels it out; the BN bias is added after normalization and is the one that actually does the work. So when we have the BN bias, we don’t need the regular bias.
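      A sketch of a pre-activation BN step without the linear bias (variable names follow my recollection of the lecture’s notebook; the sizes are placeholders):
      import torch
      emb_dim, n_hidden, batch_size = 30, 100, 32
      embcat = torch.randn(batch_size, emb_dim)          # stand-in for the concatenated character embeddings
      W1 = torch.randn(emb_dim, n_hidden) * (5/3) / emb_dim**0.5
      bngain = torch.ones((1, n_hidden))                 # BN scale (gamma)
      bnbias = torch.zeros((1, n_hidden))                # BN shift (beta): plays the role of the bias
      hpreact = embcat @ W1                              # no "+ b1": it would be cancelled by the mean subtraction
      hpreact = bngain * (hpreact - hpreact.mean(0, keepdim=True)) / hpreact.std(0, keepdim=True) + bnbias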
Primitive layers in PyTorch
  • In ResNet code, I see this repeating pattern (note that the second conv chains on out, not x):
out = self.conv1(x)     # convolution: the weighted-sum ("linear") part
out = self.bn1(out)     # batch norm on the pre-activation
out = self.relu(out)    # non-linearity

out = self.conv2(out)   # operates on out; conv2(x) would discard the first conv/bn/relu
out = self.bn2(out)
out = self.relu(out)
...
  • The linear layer:
    • I previously thought a layer could only consist of neurons (because of the visual picture of a neural network), which would mean only non-linearities could be layers. But in engineering, layers are defined computationally, so a linear computation can also be a layer.
    • the parameters in the layer are initialized according to best practice by default (scaled by roughly 1/sqrt(fan_in))
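      A minimal sketch of such a computational layer, mirroring a small subset of torch.nn.Linear’s API in the spirit of the lecture’s classes (PyTorch’s exact defaults differ in detail, but both scale the weights by about 1/sqrt(fan_in)):
import torch

class Linear:
    # a layer is just a computation (y = x @ W + b) plus bookkeeping for its parameters
    def __init__(self, fan_in, fan_out, bias=True):
        self.weight = torch.randn((fan_in, fan_out)) / fan_in**0.5   # Kaiming-style scaling
        self.bias = torch.zeros(fan_out) if bias else None

    def __call__(self, x):
        self.out = x @ self.weight
        if self.bias is not None:
            self.out = self.out + self.bias
        return self.out

    def parameters(self):
        return [self.weight] + ([] if self.bias is None else [self.bias])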

Exercises

  • E01: I did not get around to seeing what happens when you initialize all weights and biases to zero. Try this and train the neural net. You might think either that 1) the network trains just fine or 2) the network doesn’t train at all, but actually it is 3) the network trains but only partially, and achieves a pretty bad final performance. Inspect the gradients and activations to figure out what is happening and why the network is only partially training, and what part is being trained exactly.
    • AN: the network trains, but only partially, and achieves a pretty bad final performance. All weights, biases and their gradients are zero and remain zero, except the biases of the last layer, which is a batch norm layer. It is easy to see that the zero weights at the first layer bring all values to zero as they multiply the inputs, and the subsequent portion of the forward pass sees zeros throughout. In the backward pass, the only gradients that are not zero are those for the biases of the last layer, which contribute directly to the final loss without being multiplied by a zero (i.e., a zero weight). Therefore the only parameters that are able to change at all are these biases, whereas all the other parameters are stuck at zero because their gradients are stuck at zero the whole time.
      • There is also the symmetry issue where all neurons in a given layer learn the same features, effectively making them redundant.
    • the accurate language describing gradient is this: “the gradient of the loss with respect to each parameter”. GPT4: The gradient of a function is a vector containing all of the partial derivatives of that function with respect to its inputs. In the case of neural networks, when we speak about “the gradient,” we’re often referring to the gradient of the loss function with respect to the trainable parameters of the model. It’s not accurate to say there is a gradient ‘of’ each parameter; instead, we compute the gradient ‘with respect to’ each parameter. This means we’re looking at how changes in each parameter affect the output of the loss function.
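    A quick way to confirm this diagnosis, assuming the lecture-style setup where parameters is the list of all trainable tensors and loss.backward() has just been called (an assumption about the notebook, not code from it):
    for i, p in enumerate(parameters):
        # per the observation above, this prints 0.0 for everything except the last layer's trainable offsets
        print(i, tuple(p.shape), p.grad.abs().max().item())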
  • E02: BatchNorm, unlike other normalization layers like LayerNorm/GroupNorm etc. has the big advantage that after training, the batchnorm gamma/beta can be “folded into” the weights of the preceding Linear layers, effectively erasing the need to forward it at test time. Set up a small 3-layer MLP with batchnorms, train the network, then “fold” the batchnorm gamma/beta into the preceding Linear layer’s W,b by creating a new W2, b2 and erasing the batch norm. Verify that this gives the same forward pass during inference. i.e. we see that the batchnorm is there just for stabilizing the training, and can be thrown out after training is done! pretty cool.
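    A sketch of the folding math for E02 (row-vector convention; the function and variable names are mine, and mean/var stand for the BatchNorm’s running statistics):
import torch

def fold_bn_into_linear(W, b, gamma, beta, mean, var, eps=1e-5):
    # BN(x @ W + b) = gamma * (x @ W + b - mean) / sqrt(var + eps) + beta
    #               = x @ (W * scale) + ((b - mean) * scale + beta),  where scale = gamma / sqrt(var + eps)
    scale = gamma / torch.sqrt(var + eps)
    return W * scale, (b - mean) * scale + beta

# quick check against the unfolded linear + batchnorm forward pass
x = torch.randn(32, 10)
W, b = torch.randn(10, 20), torch.randn(20)
gamma, beta = torch.randn(20), torch.randn(20)
mean, var = torch.randn(20), torch.rand(20) + 0.1
W2, b2 = fold_bn_into_linear(W, b, gamma, beta, mean, var)
bn_out = gamma * ((x @ W + b) - mean) / torch.sqrt(var + 1e-5) + beta
print(torch.allclose(x @ W2 + b2, bn_out, atol=1e-5))   # True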
