Published on Mar 17th, 2024.
Notes:
- In March 3’s study notes, I wrote about using different initialization methods for training. The hockey-stick-shaped loss graph looks appealing, but it is not actually desirable: the hockey-stick shape comes from the drastic drop in loss over the first few iterations, which dwarfs all the subsequent ones — and those later iterations are the ones doing the hard/useful work. A cleverer initialization makes the loss start out at a much lower value, rendering that initial drop unnecessary; that is what Andrej talked about in lec4’s “fixing the initial loss” section.
- The cleverer initialization I mentioned focuses on the weights that produce the logits, since their magnitude directly affects the logits, and we want to avoid an excessively high initial loss caused by them. We also adjust the earlier-layer weights, but with a different aim: preventing the saturated tanh problem. A sketch of the logit-weight fix follows below.
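A minimal sketch of that fix, assuming a makemore-style character model (vocabulary of 27); the batch size, hidden size, and variable names here are illustrative, not taken from the lecture code:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size = 27          # makemore's character vocabulary ('.' + 26 letters)
n_hidden, batch = 200, 32
h = torch.randn(batch, n_hidden)             # stand-in hidden activations
y = torch.randint(0, vocab_size, (batch,))   # stand-in targets

# Naive init: logits are large, so the initial loss is far above the
# "uniform guess" baseline of -ln(1/27) ≈ 3.3
W2 = torch.randn(n_hidden, vocab_size)
b2 = torch.randn(vocab_size)
print(F.cross_entropy(h @ W2 + b2, y).item())   # typically 10+

# Cleverer init: shrink the logit weights and zero the bias so the initial
# softmax is nearly uniform and the loss starts near 3.3
W2 = torch.randn(n_hidden, vocab_size) * 0.01
b2 = torch.zeros(vocab_size)
print(F.cross_entropy(h @ W2 + b2, y).item())   # ≈ 3.3
```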
- There are graphs we can plot to detect the saturated tanh problem. See the fourth segment of the “Building makemore Part 3” video.
- The problem gets nastier as the network gets deeper, so we have to be careful with the initialization. Again, plots help us visualize it; a sketch of one such check follows below.
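A rough sketch of the kind of diagnostic I mean; the sizes, the single linear layer, and the 0.97 saturation threshold are my own assumptions for illustration:

```python
import torch
import matplotlib.pyplot as plt

torch.manual_seed(0)
x = torch.randn(1000, 100)      # a batch of unit-variance inputs
W = torch.randn(100, 100)       # too-large weights -> pre-activations blow up
h = torch.tanh(x @ W)

# Fraction of activations stuck in tanh's flat tails, where the local
# gradient (1 - h^2) is nearly zero and those neurons barely learn
print("saturated:", (h.abs() > 0.97).float().mean().item())

# Histogram of activations: saturation shows up as two spikes at -1 and +1
plt.hist(h.view(-1).tolist(), bins=50)
plt.title("tanh activations")
plt.show()
```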
- The key idea of Kaiming initialization is to maintain a consistent, desirable variance of activations across layers. In Andrej’s lecture, only the input distribution and the distribution of the linear output are compared, but in Kaiming initialization the nonlinear activation function is also taken into account. The purpose of maintaining the variance is to prevent vanishing or exploding gradients. The mechanics: varying variances lead to very small or very large signals in the forward pass, and vanishing or exploding signals can in turn produce vanishing or exploding gradients, in ways that depend on the activation function. For example, with tanh we get vanishing gradients when the signals are exceedingly large, while exploding gradients are less of a problem. A small sketch of this idea follows below.
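A rough sketch, assuming a stack of tanh layers and the 5/3 gain that PyTorch suggests for tanh; the depth and layer width are arbitrary choices for illustration:

```python
import torch

torch.manual_seed(0)
fan_in = 500
gain = 5 / 3  # the gain PyTorch suggests for tanh

def run(scaled: bool) -> None:
    h = torch.randn(1000, fan_in)                 # unit-variance input
    for i in range(5):
        W = torch.randn(fan_in, fan_in)
        if scaled:
            W = W * gain / fan_in**0.5            # Kaiming-style scaling
        h = torch.tanh(h @ W)
        sat = (h.abs() > 0.97).float().mean().item()
        print(f"  layer {i}: std {h.std().item():.2f}, saturated {sat:.0%}")

print("naive init (pre-activations grow like sqrt(fan_in), tanh saturates):")
run(scaled=False)
print("Kaiming-style init (activation statistics stay roughly steady):")
run(scaled=True)
```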
- Vanishing gradients make convergence very slow or can even halt training entirely; exploding gradients make training unstable.
- “fan in” and “fan out” are terms for the number of input and output connections of a layer, respectively. w = torch.randn(10, 200) can be visualized as the connections between a layer of 10 neurons and a next layer of 200 neurons. It makes sense that Kaiming initialization incorporates fan in (or fan out), because the result of a matrix multiplication depends directly on the number of terms in the summation; see the small demo below.
- Modern innovations in neural network optimization make initialization easier.
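A quick check of that claim, reusing the same 10-by-200 weight matrix (the batch size is arbitrary):

```python
import torch

torch.manual_seed(0)
x = torch.randn(1000, 10)    # 1000 examples, 10 input features
w = torch.randn(10, 200)     # fan_in = 10, fan_out = 200
y = x @ w                    # each output element sums fan_in products

# A sum of fan_in roughly unit-variance terms has std ≈ sqrt(fan_in) ≈ 3.16;
# scaling w by 1/sqrt(fan_in) would bring the output back to unit std.
print(y.std().item())
```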
Notes on “GPT-4 Can’t Reason”
- My current stance on LLMs aligns with the author’s: “that LLM-based systems are not mere stochastic parrots but build genuine abstractions and can exhibit creativity”. What I want to get out of this reading is to understand why the author says “these systems are still severely limited in their reasoning abilities”.
- The author set a narrow, well-defined subject of study before the entire discussion: “Can GPT4 reason?” This makes a systematic enquiry possible, and easier.
- First-order logic and higher-order logic: first-order logic extends propositional logic with predicates and quantifiers, which allows expression of more complex statements about objects and their relationships. Higher-order logic is an extension of first-order logic that allows quantification not only over individual variables but also over predicates and functions.
- Core argument of the author: there are a bunch of a priori considerations, but the most compelling one is that reasoning is computationally hard.
- [ ] I should study this: “In fact, in the general case (first-order or higher-order logic), it is algorithmically undecidable, i.e., every bit as unsolvable as the halting problem”
In my view, the most compelling a priori considerations against the plausibility of reliably robust LLM reasoning turn on computational complexity results. Reasoning is a (very) computationally hard problem. In fact, in the general case (first-order or higher-order logic), it is algorithmically undecidable, i.e., every bit as unsolvable as the halting problem. Thus, by Church’s thesis, we cannot expect any algorithm, LLMs included, to solve arbitrary reasoning problems in a sound and complete way.
- What makes the analysis grounded is the extensive context it provides: both sides of the debate, methodology, clarifications of objectives, and so on. This requires substantial depth and breadth of knowledge on the writer’s part.
- Definition of reasoning: the process of making and justifying arguments. An argument consists of a conclusion and a set of premises from which the conclusion is derived.
We say that a set of premises S logically entails (or logically implies) a conclusion p iff p is true whenever all the sentences in S are true, in which case the argument is said to be valid.
…
Deduction is the process of making and justifying non-ampliative arguments.
- Humans use rules of inference implicitly. Logic systems make them formal. The reason behind the formality, I think, is to ensure soundness.
- In deductive logic, rules of inference are the logical principles that justify the steps in a proof/reasoning from premises to conclusions. My understanding: in going from premises to conclusion in an argument, a proof (or chain of reasoning) justifies each connection by appealing to those inference rules.
All mathematical proofs are deductive, and mathematical reasoning in general is predominantly deductive
- Three primary categories of reasoning: deduction, induction, and abduction
- Induction and abduction are similar but different in that induction seeks to generalize while abduction seeks explanations
Abduction consists mostly in making and justifying arguments that explain a set of facts. If one day I come home early from work and I see a plumber’s van parked in my neighbors’ driveway, I might conclude that my neighbors are having some plumbing work done in their house. The premise here is “There is a plumbing van parked in my neighbors’ driveway” and the conclusion is “My neighbors are having plumbing work done in their house.” This is sometimes called “inference to the best explanation,” because the conclusion serves to explain the premise(s). This is also a form of ampliative reasoning — the conclusion does not follow logically from the premises. There are many alternative explanations of a given set of facts or observations.
- The author ventured into induction and abduction because he wanted to explain why he rules them out and focuses solely on deductive reasoning when examining GPT4’s reasoning ability.
For present purposes it suffices to say that we will focus on deduction, because it is the type of reasoning that underpins most logico-mathematical thought and for which we have clear normative standards of evaluation.
- I am trying to understand why planning involves reasoning, as the author states below. The obvious part is that the result of planning is a solution to a problem, but unlike general problem solving, that solution is meant to be followed and executed. In that sense, planning requires reasoning just as general problem solving does. A simple example can also illustrate this dependency: “consider the situation of planning a meal that must adhere to dietary restrictions, such as a gluten-free diet. … the planning relies on the logical deduction that since wheat contains gluten, it must not be used in a gluten-free diet and other ingredients are used instead.” (generated by ChatGPT)
Tackling complex problems requires planning, and planning itself requires reasoning.
- In his first conclusion, the author pointed out the statement quoted below. The statement itself is true, but the consequences are uncertain. The author’s pessimistic prediction rests on the assumption that bugs in code must be avoided because correctness is paramount. However, in domains where gains in development velocity offset the damage introduced by buggy code, I can see LLMs being adopted for generating systems.
- Can we have more recommendation-engine-like software systems operating in the future?
it has the potential to proliferate buggy code at scale.
…
we need sorting algorithms that work on all inputs, not just most of them, we need Amazon’s cart to charge customers the right amount every time, not just most of the time, and so on. Computation-heavy and reasoning-heavy applications are not like recommendation engines. They need to be sound.
- One question remains: how were these 21 categories of questions identified, and how were the questions for each category constructed?
- It came to me, after more training with my contemporary dance teacher Katie, that once you know a subject inside out, or even just well enough, you can start to generate content. In contemporary dance, we generate movements; in this study, the author generated questions.