DL implementation study – Jan 5, 2024

building makemore

  • the counting approach to the bigram model is unsupervised in the sense that no separate labels are needed. this model and GPT are both generative and predict the next token (a character here). however, the bigram is very primitive and i can’t wait to see the comparison between this and the transformer arch. a quick sketch of the counting version is below.
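
a minimal sketch of the counting version, assuming character-level bigrams as in makemore (a tiny stand-in word list instead of the real names.txt):

import torch

words = ["emma", "olivia", "ava"]          # stand-in for names.txt (assumption)
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0                              # start/end token, as in makemore
itos = {i: s for s, i in stoi.items()}

# N[i, j] counts how often char j follows char i
N = torch.zeros((len(stoi), len(stoi)), dtype=torch.int32)
for w in words:
    cs = ["."] + list(w) + ["."]
    for c1, c2 in zip(cs, cs[1:]):
        N[stoi[c1], stoi[c2]] += 1

# normalize each row into a prob dist, then sample the next char from it
P = (N + 1).float()
P /= P.sum(1, keepdim=True)
ix, out = 0, []
while True:
    ix = torch.multinomial(P[ix], num_samples=1).item()
    if ix == 0:
        break
    out.append(itos[ix])
print("".join(out))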

before part 2:

# Doing the same thing but with an ANN

# imagine what the model looks like before watching the video:

# input is a char, output is the next

# one-hot encoding to make each output neuron a binary classifier

# softmax to get the prob dist
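
# a sketch of how i picture that so far (my assumption: a single 27x27 weight matrix, one-hot input, softmax output):

import torch
import torch.nn.functional as F

vocab = 27                                       # 26 letters + "." (assumption)
W = torch.randn((vocab, vocab), requires_grad=True)

x = torch.tensor([5])                            # index of the current char, e.g. stoi["e"]
xenc = F.one_hot(x, num_classes=vocab).float()   # one-hot encoding of the input
logits = xenc @ W                                # (1, 27): one score per possible next char
probs = F.softmax(logits, dim=1)                 # the prob dist over the next char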

# the count matrix N is not needed anymore because the ANN serves the same purpose

# The probabilities in N and the weights in this new ANN shouldn’t be confused. when the ANN is trained, the weights are adjusted as the nll goes down.

# but to use the nll, the output should be a prob dist rather than the next char directly, shouldn’t it?
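
# a toy nll computation, assuming the output is a prob dist over the next char (made-up numbers):

import torch

probs = torch.tensor([[0.1, 0.7, 0.2]])    # toy dist over a 3-char vocab
ys = torch.tensor([1])                     # the actual next char was index 1
nll = -probs[torch.arange(len(ys)), ys].log().mean()
print(nll)                                 # lower when the true next char gets high prob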

# but this ANN will output the same result, because the counting approach is already the best we can get: the data-generating dist is approximated in the best possible way by counting (confirmed by Andrej)
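
# one way i can see that claim (toy numbers): if the trained weights end up at the log-counts, softmax gives back exactly the counting probs

import torch
import torch.nn.functional as F

N = torch.tensor([[3., 1., 0.],            # toy count matrix (rows: current char)
                  [0., 2., 2.],
                  [1., 1., 1.]])
P_count = (N + 1) / (N + 1).sum(1, keepdim=True)   # smoothed counting probs

W = (N + 1).log()                          # suppose training lands on the log-counts
P_nn = F.softmax(W, dim=1)
print(torch.allclose(P_count, P_nn))       # True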

# training: since the bigram is unsupervised (no separate labels), how am i gonna train it with an ANN? my guess is sketched below
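
# my guess (an assumption ahead of the video): the "labels" come from the text itself as (current char, next char) index pairs, then gradient descent on the nll

import torch
import torch.nn.functional as F

xs = torch.tensor([0, 5, 13, 13, 1])       # current chars, e.g. ".emma"
ys = torch.tensor([5, 13, 13, 1, 0])       # next chars; the labels come for free

W = torch.randn((27, 27), requires_grad=True)
for step in range(100):
    xenc = F.one_hot(xs, num_classes=27).float()
    probs = F.softmax(xenc @ W, dim=1)
    loss = -probs[torch.arange(len(ys)), ys].log().mean()   # nll
    W.grad = None
    loss.backward()
    W.data -= 10 * W.grad                  # plain gradient descent
print(loss.item())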
