DL study – Jan 23, 2024

I was listening to the podcast between Lex Fridman and Ilya Sutskever; here are some takeaways:

Designing DL models: beyond instilling inductive biases, the other main thread of DL architecture innovation lies in efficiency improvement. Ilya Sutskever mentioned in the Lex podcast that the success of Transformers comes from the architecture being highly trainable, in the sense that it utilizes GPUs extremely well and runs fast. The invariant of DL research is trainability: a model has to be trainable. This reflects Ilya's orientation toward things that work.
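
To make the "utilizes GPUs well" point concrete for myself: the core of a Transformer layer is scaled dot-product attention, which is essentially two dense matrix multiplies plus a softmax, exactly the kind of workload GPUs are built for. A minimal NumPy sketch (my own toy shapes and names, not from the podcast):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: two batched matmuls and a softmax."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, seq, seq)
    return softmax(scores) @ V                          # (batch, seq, d_v)

# Toy shapes: batch of 2 sequences, 8 tokens each, 16-dimensional heads.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(2, 8, 16)) for _ in range(3))
print(attention(Q, K, V).shape)   # (2, 8, 16)
```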

Fundamentally simple ideas: gradient descent and backprop are very simple but extremely powerful, because they solve the fundamental problem of FINDING the neural circuit that satisfies given constraints. And the path to AGI is discovering THE simple program that can generalize to everything. We still don't know whether the brain does backprop or not.
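
As a reminder of how little machinery is actually involved, here is a toy sketch of mine (not from the podcast): a two-layer network fitting XOR with hand-written backprop and plain gradient descent.

```python
import numpy as np

# XOR: a function no single linear layer can represent.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
lr = 0.5

for step in range(5000):
    # Forward pass
    h = np.tanh(X @ W1 + b1)
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))      # sigmoid output
    # Backward pass (the chain rule, written out by hand)
    dp = (p - y) / len(X)                     # d(cross-entropy loss)/d(logit)
    dW2, db2 = h.T @ dp, dp.sum(0)
    dh = dp @ W2.T * (1 - h ** 2)             # back through tanh
    dW1, db1 = X.T @ dh, dh.sum(0)
    # Gradient descent
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.round(p.ravel(), 2))   # should end up close to [0, 1, 1, 0]
```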

Deep double descent: this is the phenomenon where an ANN doesn't overfit when it has either far fewer or far more parameters than the number of training examples. Performance is worst when the parameter count and the training-set size are close. The intuition is that the mismatch in either direction makes the model less sensitive to small perturbations (e.g., noise) in the data. This relates to the question I got in my chat about why overparameterized DL models perform well without overfitting, though the answer there comes from a different angle (a small numerical sketch follows the link below):

In short, we found that deep networks exhibit a bias towards fitting data with “simple” functions (as shown in Image 2). To the extent that these simple functions align with the learning task (Image 3), sample efficient learning is possible, and we say that the network has a good inductive bias for this task. If the learning task requires fitting a “complex” function, then learning is not sample-efficient. In our work, we give precise definitions of simple and complex functions.

https://brain.harvard.edu/hbi_news/how-neural-networks-escape-perils-of-overparameterization/
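
To see the interpolation-threshold shape for myself, here is a small numerical sketch (my own, not from the article above): random ReLU features fit with minimum-norm least squares. The test error typically peaks when the number of features is close to the number of training samples and drops again in the heavily overparameterized regime.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 10
w_true = rng.normal(size=d)

def make_data(n):
    X = rng.normal(size=(n, d))
    y = np.tanh(X @ w_true) + 0.1 * rng.normal(size=n)   # nonlinear target + noise
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

for n_features in [10, 25, 50, 75, 90, 100, 110, 150, 300, 1000, 3000]:
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)     # fixed random projection
    phi_tr = np.maximum(X_tr @ W, 0)                      # random ReLU features
    phi_te = np.maximum(X_te @ W, 0)
    # lstsq returns the minimum-norm solution when the system is underdetermined.
    coef, *_ = np.linalg.lstsq(phi_tr, y_tr, rcond=None)
    test_mse = np.mean((phi_te @ coef - y_te) ** 2)
    print(f"{n_features:5d} features  test MSE {test_mse:.3f}")
```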

Current tech stack in DL: the stack of deep learning is starting to be quite deep. If you look at it, it goes all the way from the ideas, to the systems, to building the datasets, distributed programming, building the actual cluster, GPU programming, and putting it all together.

Simulation: models trained in simulation can be transferred to and used in the real world, and vice versa. This resonates with the idea I heard that the origin of consciousness lies in locomotion: organisms developed consciousness in order to navigate the physical world. From this perspective, an AI without some embodiment in some space doesn't seem to need consciousness, but simulation might be a path to it.


Two more ideas from Ilya (https://youtu.be/YEUclZdj_Sc?si=seOeeu6u7kZbWJT9):

When we debate whether LLMs can understand, and hear the argument that they can't because they are simply next-token-predicting parrots, Ilya's response is that next-token prediction is merely an avenue towards learning the human mind. Essentially, next-token prediction gives us a lever to pull, but the lever itself is of less significance. What matters is that the machine learns general intelligence THROUGH next-token prediction, similar to me learning to dance THROUGH the techniques.
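
The "lever" itself is nothing more than a cross-entropy loss on the next token. A minimal sketch in my own notation (toy vocabulary, random logits standing in for a model's output):

```python
import numpy as np

def next_token_loss(logits, targets):
    """Average cross-entropy of the next token under the model.

    logits:  (seq_len, vocab_size) unnormalized scores for the *next* token
    targets: (seq_len,) integer ids of the token that actually came next
    """
    logits = logits - logits.max(axis=-1, keepdims=True)               # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy example: vocab of 5 tokens, the true continuation is tokens 2, 4, 1, 3.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))          # what a model would output at each position
targets = np.array([2, 4, 1, 3])
print(next_token_loss(logits, targets))   # everything upstream is trained to shrink this number
```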

And the extrapolation in this case is a super-intelligent human who doesn't exist but whom the machine "creates" by extrapolating beyond the people it has learned to model; hence the machine is super-intelligent.

Reliability and controllability are the current and next big challenges.


Can we imagine a future where all programs are learning programs and inherently probabilistic? Again I find that a comparison can be made between traditional hand-coded systems and learning systems. Both kinds of systems need design, and in both the theory is: the simpler and more general, the better. Following Ilya Sutskever's proposition that any program a human can find by hand can also be found by machine learning, and provided sample-efficient learning is possible, there is indeed a possible future where all programs are learned. The human in the loop then designs these learning programs instead of writing deterministic logic. A toy contrast is sketched below.
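
A toy contrast between the two kinds of programs, entirely my own illustration (the task and the perceptron are arbitrary choices): the same decision rule written by hand and then recovered by a learner from labelled examples.

```python
import numpy as np

# Hand-coded program: is a point above the line y = 2x + 1?
def above_line(x, y):
    return y > 2 * x + 1

# Learned program: a perceptron discovers (roughly) the same decision rule
# from labelled examples instead of from hand-written logic.
rng = np.random.default_rng(0)
X = rng.uniform(-5, 5, size=(2000, 2))
labels = (X[:, 1] > 2 * X[:, 0] + 1).astype(float)

w, b = np.zeros(2), 0.0
for _ in range(20):                      # a few passes of the perceptron update rule
    for xi, t in zip(X, labels):
        pred = float(w @ xi + b > 0)
        w += (t - pred) * xi
        b += (t - pred)

test = rng.uniform(-5, 5, size=(500, 2))
agree = np.mean((test @ w + b > 0) == above_line(test[:, 0], test[:, 1]))
print(f"learned program agrees with the hand-coded one on {agree:.1%} of inputs")
```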
