DL weekly study note: Bengio’s AI scientist, Bayesian or Frequentist, and more

Bengio’s AI scientist talk:

  • Model-based machine learning: This term is derived from model-based reinforcement learning. What is model-based reinforcement learning? Well, the agent acts in the environment in a way that maximizes rewards. My guess is that in a model-based RL scheme the reward function is provided by the model and might change as the environment changes. AFTER-READING: Model-based RL constructs a model of the environment, which allows future states and rewards to be predicted. These predictions let the policy be planned, which is more sample efficient than learning the policy purely through the agent's interactions with the environment (i.e., model-free RL). See the code sketch after this list.
  • Model’s optimal capacity << Inference machine’s optimal capacity: This is a major motivation behind separating the world model from the inference machine. The number of abstractions, concepts, scientific principles and so on is small, so the world model’s size should be correspondingly small. But to answer questions, we need a lot of learning from a lot more data and therefore the inference machine is much larger. Separation of these two components prevents the world model from being overfitted and the inference machine from being underfitted.
    • But one question is: how is the world model trained? If the core of DL is learning representations from pixel-level data, then the world model should be trained on a sea of data, like an LLM's training, so that abstract features can be obtained, shouldn't it? But that probably isn't the case. Let me study further.
  • Re the scaling mentioned above: I am reading the results of the Inverse Scaling Prize (link) and those examples seem minor in significance, as in they feel like corner cases. But I intend to complete the reading and draw more conclusions then.
    • When Bengio and his team use this prize to support their proposal of separating the world model and the inference machine, the support feels fairly weak.
  • The problem with maximum-likelihood estimation: Only one explanation, one understanding, of the data can be obtained, because once the params are optimized, the hypothesis sits at a single point in the hypothesis space. But we might want diversity, and there might be multiple places in the hypothesis space that are similarly good. Can I dig deeper into this idea?
    • The solution is to be Bayesian. But the toy example Bengio used to illustrate his point is an edge case; how strong can this argument be?
  • Non-parametric:
  • Relationship between the world model and the inference machine: The world model is a GFN that approximates the Bayesian posterior over theories and can generate theories according to the posterior distribution. The question-answering machine samples from this world model.
    • Wait, does the GFN serve as only the world model, or as the combo of the world model and the inference machine?
  • What exactly does Bengio mean when he said that something like an LLM (with universal approximation properties) can be used to approximate the Bayesian posterior? I understand that obtaining the posterior predictive distribution requires integration over the entire parameter space, which is, if I understand correctly, computationally intractable. But was he talking about the posterior predictive distribution or the Bayesian posterior? And why does training an LLM long enough get us the Bayesian posterior? – AN: He was definitely talking about the posterior being intractable, and the cause is that we don't know how to compute the normalizing constant. Why do we not know how to compute the normalizing constant? – AN: Because the normalizing constant, or P(x1, …, xm), in a real-world scenario is simply unknown; it is not a toy example where I have a table of the joint distribution of x and theta and can calculate everything. Back to the original question, rephrased: how can I tell when the LLM is approximating the Bayesian posterior and not something else? What is the criterion? – AN: The approximation is done by making the posterior proportional to the product of the prior and the likelihood. The LLM generates theories, which then get turned into the prior distribution over theories and the likelihood of the evidence given a theory. And a GFN can achieve this. (See the math block after this list.)
  • The inference machine is also intractable because it involves marginalization, summing over all theta/theories.
  • Quick thing to point out: The population view in the Bayesian approach is about theta, not y. Theta, as one theory (derived from maximum likelihood) or as a distribution over theories (Bayesian machine learning), explains the data, which includes y.
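A minimal sketch of the model-based RL idea from the first bullet above. This is my own toy illustration, not from Bengio's talk or any specific library; all names (true_step, LearnedModel, plan) are made up for this example. The only point is that the action is chosen by querying a learned model of the dynamics (planning), rather than by learning a policy purely from real interactions.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_step(state, action):
    """Unknown real environment (1-D toy): reward is highest near state 0."""
    next_state = state + action + rng.normal(scale=0.05)
    return next_state, -next_state ** 2

class LearnedModel:
    """Stand-in for a learned world model: predicts next state and reward."""
    def predict(self, state, action):
        # A deterministic approximation of the true dynamics above.
        next_state = state + action
        return next_state, -next_state ** 2

def plan(model, state, horizon=5, n_candidates=100):
    """Random-shooting planner: evaluates candidate action sequences in the model only."""
    best_return, best_first_action = -np.inf, 0.0
    for _ in range(n_candidates):
        actions = rng.uniform(-1.0, 1.0, size=horizon)
        s, total = state, 0.0
        for a in actions:
            s, r = model.predict(s, a)
            total += r
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action

state, model = 2.0, LearnedModel()
for t in range(10):
    action = plan(model, state)               # plan inside the model, no real samples used
    state, reward = true_step(state, action)  # then take a single real step
    print(f"step {t}: state={state:+.3f}, reward={reward:.3f}")
```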
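To make the intractability points concrete, here is the standard math I keep coming back to (my own notation, not taken verbatim from the talk): the posterior over theories is proportional to prior times likelihood, the normalizing constant is an integral over all theories, and question answering marginalizes over theories again.

```latex
% Posterior over theories \theta given evidence x_1, ..., x_m
p(\theta \mid x_1, \dots, x_m)
  = \frac{p(x_1, \dots, x_m \mid \theta)\, p(\theta)}{p(x_1, \dots, x_m)},
\qquad
p(x_1, \dots, x_m) = \int p(x_1, \dots, x_m \mid \theta)\, p(\theta)\, d\theta .

% Question answering (the inference machine) marginalizes over theories:
p(y \mid x, x_1, \dots, x_m)
  = \int p(y \mid x, \theta)\, p(\theta \mid x_1, \dots, x_m)\, d\theta .
```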

Why greatness can’t be planned: https://youtu.be/lhYGXYeMq_E?si=EaFHRz0d1MAqQ68G

  • Implications for LLMs: LLMs do the opposite, convergent thinking. Content generated by these foundational models looks similar (YT video). And this got me thinking that Sam Altman's statement that AI would disrupt creative jobs first is probably incorrect. These models produce images, articles, code, etc., but these artifacts, surprisingly, often aren't a result of genuine creativity. I wrote the article "being a dancer doesn't guarantee originality" last year, and I think this is the exact phenomenon: simply creating stuff isn't a sign of creativity, the type Stanley talks about, the type that is truly novel and interesting. A lot of creators make stuff that is derivative, and foundational models do the same. This is a monumental obstacle for current mainstream AI: at some point these AI-generated, derivative artifacts, and the expansion of this paradigm into other domains, might cease to attract the same amount of attention because they become commonplace, the way everything naturally progresses, and accelerated by the paradigm's own features. Of course, this assumes the current paradigm stops making exponential gains, say if the scaling laws stop working.
  • Stanley’s discussions of interestingness, subjectivity, divergence, and diversity resonate with me because these are ideas that have appealed to me but that I couldn’t articulate. Paul Graham’s advice on learning the things that interest you really stuck with me almost 10 years ago.

Bayesian Statistics

  • Why Can’t Maximum Likelihood Estimation Be Used to Estimate the Full Distribution of Theta?

    Because in point estimation, of which maximum likelihood is one principle, we assume the true theta is unknown and fixed, implying there is simply no full distribution of theta. Theta hat is an estimator that approximates this one fixed value, so it makes sense that it is a point estimator (i.e., a single best guess).

    This is beautifully contrasted with Bayesian estimation, where the true theta is unknown and uncertain, implying it is a random variable and there is a distribution over it.
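    In symbols (my own shorthand, with D standing for the observed data), the two treatments of theta look like this:

```latex
% Maximum likelihood: a single point in parameter space
\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta}\; p(D \mid \theta)

% Bayesian estimation: a full distribution over theta
p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)
```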

  • How Does the Integration Over the Parameters Protect Against Overfitting in the Bayesian Setting?

    The integration over the parameters averages over all possible models (i.e., param values), weighted by their probability. This tends to prevent overfitting because not only are the configurations that fit the training data exceptionally well (but might suffer from overfitting) considered, but so are all the other possible configurations. It is also a form of regularization: it penalizes complexity, because simpler models, with broader posterior distributions, are less likely to fit the noise in the training data too closely.
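    A toy sketch of this averaging effect, assuming closed-form Bayesian linear regression with polynomial features; the dataset, the alpha/beta values, and all names are my own illustration, not from the material above. The Bayesian predictive mean averages over parameter settings weighted by the posterior, which for this model coincides with a ridge-like, regularized fit, while the maximum-likelihood (least-squares) fit is free to chase the noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny, noisy dataset: easy to overfit with a degree-8 polynomial.
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)
Phi = np.vander(x, 9, increasing=True)           # polynomial features x^0 .. x^8

# Maximum likelihood = ordinary least squares: one point in parameter space.
w_mle = np.linalg.lstsq(Phi, y, rcond=None)[0]

# Bayesian posterior with Gaussian prior N(0, alpha^-1 I) and noise precision beta.
alpha, beta = 1.0, 1.0 / 0.3 ** 2
S_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
S = np.linalg.inv(S_inv)                          # posterior covariance
m = beta * S @ Phi.T @ y                          # posterior mean

# Predictions: MLE uses the single w_mle; the Bayesian prediction averages over
# all w weighted by the posterior, which here reduces to Phi_new @ m.
x_new = np.linspace(0, 1, 5)
Phi_new = np.vander(x_new, 9, increasing=True)
print("MLE predictions:     ", Phi_new @ w_mle)
print("Bayesian predictions:", Phi_new @ m)
```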

Bayesian or Frequentist, Which Are You?

  • Frequentist and Bayesian are two perspectives. They are two perspectives because the objects in the real world that we deal with are the same fixed few: evidence, likelihoods, and so on. But it is how we look at them that distinguishes these perspectives. The Bayesian perspective is a conditional perspective: we condition on some observations, assume they are fixed, and make inferences based on them. On the contrary, the frequentist perspective is an unconditional perspective, where we average over all possible data sets so that the procedure works well on all different kinds of data sets.
  • The frequentists are pessimists while the Bayesians are optimists. The frequentists think the world is unknowable and a model is a simplification of it, which helps us make fewer mistakes. The Bayesians, on the other hand, try to get as much knowledge from a dataset as possible.
  • Machine learning is a mixture of multiple themes, but the underlying core is still statistical inference. It uses both frequentist and Bayesian approaches, with an emphasis on prediction and methodology.
  • Each of these perspectives comes with its own set of properties.
  • When we want to turn randomness into a number, we calculate the expectation.
  • The frequentist expectation, R(θ) = E_θ[ℓ(δ(X), θ)], takes the expected value over the entire sample space of X: not the X that you saw, but the X that you might see. You are looking at the other possible data you could have gotten, i.e., the unconditional perspective.
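A worked contrast, in my own notation following the standard decision-theoretic setup (loss ℓ, decision rule δ): the frequentist risk fixes θ and averages over the data sets you might see, while the Bayesian quantity conditions on the observed x and averages over θ.

```latex
% Frequentist risk: fix theta, average over the data you might see
R(\theta) = \mathbb{E}_{X \sim p(\cdot \mid \theta)}\!\left[\ell(\delta(X), \theta)\right]
          = \int \ell(\delta(x), \theta)\, p(x \mid \theta)\, dx

% Bayesian (posterior) expected loss: fix the observed x, average over theta
\rho(\delta \mid x) = \mathbb{E}\!\left[\ell(\delta(x), \theta) \mid X = x\right]
                    = \int \ell(\delta(x), \theta)\, p(\theta \mid x)\, d\theta
```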

Basics

  • What does it mean when we say the expectation of a random variable with respect to theta (i.e., E_θ(X))?
    This really means the expected value of a random variable X with respect to a probability distribution parameterized by θ. θ defines/describes/characterizes the probability distribution, which in turn describes the behavior of X (or of multiple random variables).
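    In symbols (my own example, using a Gaussian for concreteness):

```latex
% Expectation of X under the distribution indexed by theta
\mathbb{E}_{\theta}(X) = \int x\, p(x; \theta)\, dx

% Example: if X \sim \mathcal{N}(\mu, \sigma^2) with \theta = (\mu, \sigma^2),
% then \mathbb{E}_{\theta}(X) = \mu
```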
