building makemore
- the counting approach to bigrams is unsupervised. this model and GPT are both generative and predict the next token. however, the bigram model is very primitive, and I can't wait to see the comparison between it and the transformer architecture.
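for context, roughly what the counting approach from part 1 looks like in code (a tiny inline word list stands in for the dataset, and details like the +1 smoothing may not match the video exactly):

```python
import torch

# toy word list standing in for the real dataset
words = ["emma", "olivia", "ava", "isabella", "sophia"]

# character vocabulary; '.' marks both the start and the end of a word
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}
V = len(stoi)

# N[i, j] counts how often char j follows char i
N = torch.zeros((V, V), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1

# normalize each row into a probability distribution over the next char
P = (N + 1).float()                 # +1 smoothing so nothing has zero probability
P /= P.sum(dim=1, keepdim=True)

# generate a new "word" by repeatedly sampling the next char
g = torch.Generator().manual_seed(2147483647)
ix, out = 0, []
while True:
    ix = torch.multinomial(P[ix], num_samples=1, generator=g).item()
    if ix == 0:
        break
    out.append(itos[ix])
print("".join(out))
```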
before part 2:
# Doing the same thing but with an ANN
# imagine what the model looks like before watching the video:
# input is a char, output is the next char
# one-hot encoding to make each output neuron a binary classifier
# softmax to get the prob dist
# N (the counts matrix from part 1) is not needed anymore because the model serves the same purpose
# The probabilities in N and the weights in this new ANN shouldn't be confused. when the ANN is trained, the weights
# are adjusted as the NLL goes down. but to use NLL, the output should be a prob dist instead of the next char, shouldn't it?
# but this ANN will output the same result, because the counting approach is already the best we can get: the
# data-generating dist is approximated in the best possible way by counting (confirmed by Andrej)
# training: since the bigram model is unsupervised, how am I gonna train it with an ANN?
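and here is a minimal sketch of the neural bigram these notes are guessing at (same toy setup as the counting sketch above; hyperparameters like the learning rate and step count are my guesses, not necessarily the video's). the training question resolves because the training set is just every bigram pair: the current char is the input and the next char is the label, so the labels come straight from the text itself.

```python
import torch
import torch.nn.functional as F

# same toy setup as the counting sketch
words = ["emma", "olivia", "ava", "isabella", "sophia"]
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}
V = len(stoi)

# "training set": every bigram, current char index -> next char index
xs, ys = [], []
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        xs.append(stoi[ch1])
        ys.append(stoi[ch2])
xs, ys = torch.tensor(xs), torch.tensor(ys)

# a single linear layer of V x V weights, no hidden layer, no bias
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((V, V), generator=g, requires_grad=True)

for step in range(200):
    # forward pass: one-hot input, linear layer, softmax over the next char
    xenc = F.one_hot(xs, num_classes=V).float()         # (num_bigrams, V)
    logits = xenc @ W                                    # interpretable as log-counts
    counts = logits.exp()
    probs = counts / counts.sum(dim=1, keepdim=True)     # this is softmax
    # negative log likelihood of the actual next chars under the model
    loss = -probs[torch.arange(len(ys)), ys].log().mean()

    # backward pass and a plain gradient step
    W.grad = None
    loss.backward()
    W.data += -10.0 * W.grad

print(loss.item())

# sampling works exactly like the counting model, but each row comes from softmax(W)
with torch.no_grad():
    ix, out = 0, []
    while True:
        p = F.softmax(W[ix], dim=0)
        ix = torch.multinomial(p, num_samples=1, generator=g).item()
        if ix == 0:
            break
        out.append(itos[ix])
print("".join(out))
```

with enough steps the loss settles at roughly the same NLL the counting table gives, which is the point above: counting already squeezes everything a single bigram table can out of the data.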