makemore repo with the completed bigram model (lecture 2): https://github.com/gangfang/makemore
Notes:
- Context of scaling: the size of the counting table grows exponentially with the number of previous characters used to predict the next character (i.e., 27 rows when conditioning on one character, 27*27 rows for two, 27*27*27 for three, and so on; see the sketch below). This is essentially the brute-force approach, and its cost quickly becomes overwhelming. With this perspective, consider GPT: even though these models are large, their impressive performance is arguably achieved through efficiency rather than brute force. The reasoning then goes that with even more efficient models, performance beyond GPTs' can be achieved, which is why further architectural innovation matters. Note that the bigram model itself is dumb because it considers so little context when making predictions; the brute-force approach would expand the context a model considers all the way to infinity, but that is simply inefficient.
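A minimal sketch of the table-size blowup, plus the bigram (one-character context) counting table from the lecture. The tiny `words` list and the `stoi` mapping below are illustrative stand-ins, not the actual makemore dataset code:

```python
import torch

# With a vocabulary of 27 tokens (26 letters plus a '.' boundary token,
# as in makemore), an n-gram counting table needs 27**n rows, one per
# possible context -- exponential in the context length.
VOCAB = 27

for n in range(1, 6):
    print(f"context of {n} char(s): {VOCAB**n:>10,} rows x {VOCAB} columns")
# context of 1 char(s):         27 rows x 27 columns
# context of 2 char(s):        729 rows x 27 columns
# context of 3 char(s):     19,683 rows x 27 columns
# ...

# The bigram case (n = 1) still fits in a single 27x27 tensor:
words = ["emma", "olivia", "ava"]  # placeholder for the real names dataset
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0  # '.' marks both the start and end of a word

N = torch.zeros((VOCAB, VOCAB), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for ch1, ch2 in zip(chs, chs[1:]):
        N[stoi[ch1], stoi[ch2]] += 1  # count each observed bigram
```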
Questions:
- Do exercises for lec2
- Think about how this MLP model differs from the one built in lec2
- Leave a comment