Generalization and the Scaling Laws – Jan 31, 2024

Traditionally, the central challenge in ML has been generalization, or more precisely out-of-distribution (OOD) generalization. The problem of generalization only arises when the model is asked to predict on previously unobserved inputs. However, with the current approach to LLM training, which uses the “entire Internet”, there doesn’t seem to be any unobserved data left, and therefore OOD generalization as a problem is gone. But is it really that simple?

One problem with the above argument is the precise meaning of “entire internet”. The term is far too vague to base any conclusion upon. How exactly is the text data cleaned and used? What is kept and what is left out? What data preparation it takes to reach the performance of GPT-4 is a question I can’t answer. And as we follow the scaling laws and expand the training set, what data exactly do we add, and how? What are the invariants as the training set grows?
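For reference, a commonly cited parametric form of the scaling laws is the Chinchilla fit from Hoffmann et al. (2022); the exact constants, and whether GPT-4 follows this curve, are not something I can speak to, so treat this only as the published shape of the relationship:

```latex
% Chinchilla-style scaling law (Hoffmann et al., 2022):
% expected loss as a function of parameter count N and training tokens D.
% E is the irreducible loss; A, B, \alpha, \beta are fitted constants.
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

The second term says that adding more tokens keeps reducing the loss, but the formula is silent about what those tokens contain, which is exactly the question above.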

The next question is: if we include all the textual data on the internet, can we say we have provided the entire knowledge of humanity for training? Is anything left out? One thing that struck me when learning about AlphaGeometry is that GPT-4 had a 0% success rate on Olympiad-level geometry problems. Why is that? If we have provided all knowledge of humanity, why couldn’t GPT-4, or GPT-5, solve those problems? It was mentioned that there was too little textual data on solutions to such problems. Does that imply the density of data should differ across areas? Or do we have to build in specific mechanisms to support reasoning?

Another thing is the structure of the inference process. Since current models are optimized for fluency rather than accuracy, interacting with the model in a way that guides it to compose a solution step by step improves accuracy. Such a technique is well suited to the mathematical problems mentioned above. In the context of mathematical problem solving, generalization is the model’s ability to solve a new problem after being trained on a set of training problems. A model has to know the theorems needed for a particular problem, which is the abstraction part, and then apply, mix, and match them to come up with a solution to the new problem. The compositional nature of this process has so far been best induced by chain-of-thought prompting (a rough sketch follows below), meaning these models are capable of it, but not on their own yet.
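As a concrete illustration, here is a minimal sketch of the difference between asking for an answer directly and guiding the model step by step. The `query_model` function is a hypothetical placeholder for whatever LLM API you use, and the prompt wording and the example problem are my own assumptions, not taken from any of the systems discussed above:

```python
# Minimal sketch: direct vs. chain-of-thought prompting.
# `query_model` is a hypothetical placeholder; swap in a real LLM client.

def query_model(prompt: str) -> str:
    raise NotImplementedError("Replace with a real LLM API call.")

def direct_prompt(problem: str) -> str:
    # Asks only for the final answer; rewards fluent guessing.
    return f"Solve the following problem. Give only the final answer.\n\n{problem}"

def chain_of_thought_prompt(problem: str) -> str:
    # Guides the model to compose the solution step by step:
    # recall the relevant theorems (abstraction), then apply and
    # combine them before committing to an answer.
    return (
        "Solve the following problem step by step.\n"
        "1. List the definitions or theorems that are relevant.\n"
        "2. Apply them one at a time, showing each intermediate result.\n"
        "3. Only then state the final answer.\n\n"
        f"{problem}"
    )

if __name__ == "__main__":
    problem = (
        "In triangle ABC, angle A = 50 degrees and angle B = 60 degrees. "
        "Find angle C."
    )
    print(chain_of_thought_prompt(problem))
    # answer = query_model(chain_of_thought_prompt(problem))
```

The point of the second prompt is that it makes the abstraction step (which theorems apply?) and the composition step (apply them in sequence) explicit, rather than leaving them buried inside a single fluent completion.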
