Code reproduction of Andrej Karpathy’s “building makemore part 4” lecture:
https://github.com/gangfang/makemore/blob/main/makemore_part4_manual_backprop.ipynb
- Training neural networks is a craft in itself.
- applying the exponential to logits maintains their relative order and exaggerates the differences, so this is not necessarily a distortion
- GPT: While the exponential function does change the scale of the differences, it does not alter the relative order of the logits. The highest logit remains the highest in the exponential scale, and the lowest remains the lowest. This means that while the scale of the outputs changes, the decision of which class is most likely (i.e., which logit is highest) does not change.
- shift invariance property: mathematically, the softmax function is invariant to adding any constant to all elements of the input vector z (see the sketch below)
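A minimal sketch of that property, assuming a small made-up logit vector `logits` and an arbitrary shift `c`:

```python
import torch
import torch.nn.functional as F

# Adding a constant c to every logit leaves the softmax probabilities
# unchanged: the factor exp(c) appears in both the numerator and the
# denominator and cancels out. Implementations exploit exactly this by
# subtracting the row max before exponentiating, for numerical stability.
logits = torch.tensor([2.0, -1.0, 0.5])
c = 3.0

p1 = F.softmax(logits, dim=0)
p2 = F.softmax(logits + c, dim=0)

print(torch.allclose(p1, p2))  # True
```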
- training/test discrepancy should be avoided: an example of such a discrepancy is the use of the biased estimator of the variance during training and the unbiased estimator of the variance at inference in the original BatchNorm paper (sketched below). https://chat.openai.com/share/8fe5b01d-7055-4fca-b8eb-40cfe0bc922e
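A minimal sketch of the two estimators, assuming a made-up batch `x` of shape (32, 10); they differ only in whether they divide by n or n - 1:

```python
import torch

# The biased variance estimator divides by n; the unbiased one by n - 1.
# The original BatchNorm paper normalizes with the biased estimate during
# training but uses an unbiased estimate for the inference-time statistics.
x = torch.randn(32, 10)   # batch of 32 examples, 10 features
n = x.shape[0]

var_biased = x.var(dim=0, unbiased=False)   # divides by n
var_unbiased = x.var(dim=0, unbiased=True)  # divides by n - 1

# the two estimates differ exactly by the factor n / (n - 1)
print(torch.allclose(var_unbiased, var_biased * n / (n - 1)))  # True
```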
- Broadcasting in the forward pass means reuse, and therefore in the backward pass we use summation to aggregate the derivatives. Conversely, summation in the forward pass is equivalent to multiplying by a matrix of all ones, and therefore in the backward pass we broadcast the gradient back with `torch.ones_like()` (see the sketch below).
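A minimal sketch of that duality, checked against autograd; the names and shapes (`x` of shape (N, H), a row vector `b` that gets broadcast across the rows) are illustrative:

```python
import torch

N, H = 4, 3
x = torch.randn(N, H, requires_grad=True)
b = torch.randn(1, H, requires_grad=True)

out = x + b                    # b is broadcast (reused) across the N rows
s = out.sum(1, keepdim=True)   # summation across the columns
loss = (s ** 2).mean()
loss.backward()

ds = 2 * s / N                        # upstream gradient flowing into s
# summation in the forward pass -> broadcast in the backward pass
dout = ds * torch.ones_like(out)
# broadcasting in the forward pass -> summation in the backward pass
db = dout.sum(0, keepdim=True)

print(torch.allclose(dout, x.grad))  # True
print(torch.allclose(db, b.grad))    # True
```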
- in classification, the computational purpose of the label is indexing: the labels index into the PyTorch tensor of log-probabilities to compute the cross-entropy loss, as in `loss = F.cross_entropy(logits, Yb)` where `Yb` is the label vector (see the sketch below).
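A minimal sketch of that indexing view, assuming made-up shapes (`logits` of shape (N, C), integer labels `Yb`):

```python
import torch
import torch.nn.functional as F

# The cross-entropy loss only reads the log-probability at the label
# position of each row; the labels are used purely as indices.
N, C = 5, 27
logits = torch.randn(N, C)
Yb = torch.randint(0, C, (N,))

log_probs = F.log_softmax(logits, dim=1)
manual_loss = -log_probs[range(N), Yb].mean()

print(torch.allclose(manual_loss, F.cross_entropy(logits, Yb)))  # True
```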
- the intuition of `dlogits` (for a given row in the tensor, i.e. one example): given that `d_i = p_i` when `i != y` and `d_i = p_i - 1` when `i == y`, `dlogits` is the same as `softmax(logits, 1)` at all positions except the label position, which has the value `p_y - 1`. Since this is the derivative of the loss with respect to the logits, the effect of training is that the logits move in the opposite direction of `dlogits`.
- Based on the above, the logits not at the label position are pushed down (their entries in `dlogits` are probabilities, hence positive), while the logit at the label position is pushed up (probability minus one, hence negative).
- the amount of force behind these pushes is proportional to the probability for non-label positions, while for the label position it grows as the probability shrinks (its magnitude is `1 - p_y`). It is easy to see that the higher the probability of a non-label class, the more incorrect it is, and therefore the stronger the downward force applied so that it decreases to a greater extent. Conversely, the lower the probability of the label class, the more incorrect it is, and therefore the stronger the upward force applied so that it increases to a greater extent.
- note that we have `d_i = p_i` for `i != y` and `d_y = p_y - 1`; since the probabilities sum to 1, the sum of `dlogits` over a row is 0. This means that, overall, the forces neutralize each other (verified in the sketch below). https://youtu.be/q8SA3rM6ckI?si=_kYbDyylq8WWpDt-&t=5630
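A minimal sketch tying the above together, checked against autograd; note the extra division by N, which accounts for the mean reduction over the batch that the per-example derivation above leaves out:

```python
import torch
import torch.nn.functional as F

N, C = 8, 27
logits = torch.randn(N, C, requires_grad=True)
Yb = torch.randint(0, C, (N,))

loss = F.cross_entropy(logits, Yb)
loss.backward()

# dlogits = softmax(logits) with 1 subtracted at the label position,
# divided by N because the loss averages over the batch
dlogits = F.softmax(logits.detach(), dim=1)
dlogits[range(N), Yb] -= 1
dlogits /= N

print(torch.allclose(dlogits, logits.grad))  # True
print(dlogits.sum(1).abs().max().item())     # ~0: the forces neutralize per row
```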