Building Makemore
- I am starting to understand why DL is increasingly an engineering discipline. There is a lot of low-level manipulation, like operating on tensors. It's very similar to the work I do at Amazon, albeit in different programming languages and with different details. Therefore, I think I need to understand common tooling like Brazil more deeply.
- torch.multinomial(…) doesn't care whether the input probability tensor (the first parameter) is normalized or not; it normalizes it internally.
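A minimal sketch of that point (the weight values below are arbitrary): sampling from unnormalized weights should give the same result as sampling from the explicitly normalized probabilities, given the same generator state.

```python
import torch

weights = torch.tensor([10.0, 30.0, 60.0])   # sums to 100, not 1
probs = weights / weights.sum()               # explicit normalization

g = torch.Generator().manual_seed(2147483647)
s1 = torch.multinomial(weights, num_samples=5, replacement=True, generator=g)

g = torch.Generator().manual_seed(2147483647)
s2 = torch.multinomial(probs, num_samples=5, replacement=True, generator=g)

print(s1, s2)  # expected to match: multinomial normalizes the weights internally
```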
- The probabilities in the tensor are the "parameters" of the bigram model. The bigram model is considered parametric rather than nonparametric, even though the number of its parameters depends on the vocabulary size, because its structural complexity doesn't grow with the size of the training data.
- The training of a bigram model is essentially counting.
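A sketch of "training = counting" for a character-level bigram model. The `words` list here is a hypothetical stand-in for the names dataset; everything else follows the counting idea above.

```python
import torch

words = ["emma", "olivia", "ava"]            # hypothetical stand-in for the dataset

chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0                                 # '.' marks start/end of a name
itos = {i: s for s, i in stoi.items()}

N = torch.zeros((len(stoi), len(stoi)), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1          # counting is the whole "training" step

# Row-normalize the counts to get the parameters: P[i, j] = P(next=j | current=i).
P = (N + 1).float()                           # +1 smoothing so no probability is exactly 0
P /= P.sum(dim=1, keepdim=True)
```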
- Critiquing the bigram approach: the predictions from the model don't look like the observed data (i.e., the provided names) because the bigram approach is too local: it focuses only on two adjacent characters and completely ignores the global characteristics of a name. So even though the data-generating distribution is approximated from the training data (not extracted, because the data-generating distribution can only be extracted from the entire English vocabulary) and sampled from, the predictions aren't good. The distribution is so localized that broader patterns aren't captured.
- Given the above, how should the training error be constructed?
My guess is that a training loss will only be used in the NN approach, because there isn't a training phase to begin with in the bigram model. Nope: if I initialize the tensor with random values, then I do have a training error.
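A sketch of that last point, assuming a 27-symbol vocabulary (26 letters plus '.') and hypothetical bigram index pairs `xs`/`ys`: with randomly initialized parameters there is already a training loss, namely the average negative log likelihood of the observed bigrams under the current (random) model.

```python
import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)
W = torch.randn((27, 27), generator=g, requires_grad=True)  # random init, 27 = 26 letters + '.'

# Hypothetical bigram pairs: current-char index -> next-char index.
xs = torch.tensor([0, 5, 13, 13, 1])
ys = torch.tensor([5, 13, 13, 1, 0])

xenc = F.one_hot(xs, num_classes=27).float()
logits = xenc @ W                                              # interpret as log-counts
probs = logits.exp() / logits.exp().sum(dim=1, keepdim=True)   # softmax
loss = -probs[torch.arange(len(ys)), ys].log().mean()          # negative log likelihood
print(loss.item())   # high at random init; gradient steps on W would reduce it
```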
The No Free Lunch Theorem and OOD generalization
I was wondering if the NFL theorem is an implication of the impossibility of OOD generalization. The NFL theorem is about the impossibility of a universal learning algorithm, which sounds similar to the goal of out-of-distribution generalization. But I think there are nuances that make them different. With OOD generalization, the task the learning algorithm is attempting to address remains the same even when there is a distribution shift from the training set to the test set. That constraint on the task already makes NFL not applicable.