(updated on Apr 22)
Introduction
The higher-level cognitive abilities of ChatGPT have always been fascinating to me. This topic has sparked numerous debates since OpenAI launched ChatGPT, but most comments are one-sided. Recently I came across Konstantine Arkoudas’s pre-print paper GPT-4 Can’t Reason (arxiv) and was impressed by the clever scoping of the problem statement, which looks at deductive reasoning only, and by the balanced view presented. I then decided to reproduce all experiments in the paper, from first to last. There are two motivations: first, going through all the experiments lets me observe, first-hand, how ChatGPT performs in deductive reasoning; second, I want to see whether improvements have been made to ChatGPT and, if so, in what form.
In this blog post, I document my reproduction of all 21 problems, eight months after the publication of the paper. The overall methodology is to simply ask GPT4 the exact same reasoning problems, except for the Murder or Suicide problem due to memorization. The result is that GPT4 has improved in its ability to solve these problems: 10/21 are solved correctly vs. 2/21 in the original experiment. The most significant contributing factor is GPT4’s ability to solve problems programmatically. The rest of the post describes the methodology of the reproduction, the detailed results for every problem GPT4 attempted along with the chat histories, and finally a conclusion section.
I would like to use this opportunity to thank Konstantine Arkoudas for his awesome work, both in his systematic investigations and in his attempts to educate. To this day, he has published four articles on this topic, with in-depth introductions to concepts in logic and reasoning and explanations of his own views, alongside all his original tests and results. Readers can gain a deeper understanding of the context of the experiment and the rationale for focusing only on deductive reasoning by reading the original paper (arxiv). All of his articles can be found on Medium.
Experiment methodology
This is the first round of tests, where I simply copy and paste the deductive reasoning problems and examine the solutions. Here is how I mark the correctness of a solution given by GPT-4, in both the original experiment and my reproduction: [- -] is prepended to each experiment. The first mark signals the correctness of the original answer, the second that of the reproduction. For example, [X ✓] means GPT4 didn’t solve a problem eight months ago but solved it successfully in my reproduction. Besides the two main categories, i.e., correct and incorrect, a third category is needed to accommodate cases where GPT4 provides a partially correct or partially incorrect answer. This happens when, for example, GPT4 gives a correct answer upon human correction after an initially incorrect attempt, or produces a wrong answer based on largely sound reasoning. “*” is used to mark such cases: ✓* indicates a globally correct answer with minor mistakes in reasoning; X* indicates GPT4 somehow gets to the correct answer but through unsound reasoning. I admit this evaluation approach is imprecise because of the degree of arbitrariness involved, but it lowers the barrier to interpreting the results, so I stuck with it.
With the overall strategy outlined, four questions still need to be answered for a comprehensive setup of the experiment:
First, how should GPT4’s access to Python and its programmatic solutions to problems be treated? We can find clues in the author’s explanation of why some questions that do not conform to the classical deductive reasoning paradigm, like the first two, are included (in the last paragraph of the “2. What is Reasoning?” section): he views computation as a form of deduction. This matters when deciding whether restrictions should be placed on GPT4’s access to Python. The answer is no, for two reasons: first, GPT4’s ability to write correct and useful Python programs should be considered an element of its deductive reasoning capacity, given the author’s view; second, GPT4 had access to Python in the author’s original experiments, so the comparison is only fair if it is allowed to use Python in the reproduction as well. However, answers aided by Python are marked with a “py” following the correctness mark, e.g., Xpy for an answer that is incorrect even with access to Python.
Second, is memorization a concern? I assume it isn’t: the training data of the default GPT-4 version only goes up to April 2023, and the original paper was published in August 2023. It is reasonable to say that the experiments appearing in the original paper and the related discussions were not included in the training of the current version of GPT-4, which I am using to run these problem-solving tests. Therefore, the risk that improvements in GPT4’s reasoning come from memorization can be assumed to be minimal. However, this assumption would be invalidated if a second round of tests on perturbed problems showed a decrease in performance. In fact, there is already evidence that the set of questions is somewhat memorized by GPT4; see the Murder or Suicide problem.
Third, which version of GPT4 should be used? My answer is simply the default GPT4 version provided via the ChatGPT UI. I saw a Hacker News thread on the paper (HN link) in which people complained that the author wasn’t careful about specifying the version of GPT4. I don’t think this discussion is terribly relevant to my reproduction, because the comparison is valid as long as my setup is the same as the author’s, which is Q&A via the ChatGPT UI. I actually can’t use the latest versions via the API because those models have training data as recent as Dec 2023. The default version has been trained on data up to April 2023, which addresses the memorization concern.
Fourth, how is GPT4 prompted? It’s received wisdom that CoT (i.e., chain-of-thought) prompting elicits better reasoning from GPT4. I did not do that, because the responses are already in a chain-of-thought style even without an explicit instruction in the questions asked. The second half of the equation is the style of follow-up questions. My principle is to minimize hinting (e.g., saying “you are wrong” or “think in such and such way”). When GPT4 makes mistakes, I ask it to expand on the problematic statement, quoting the mistake directly without further indication. The purpose is to minimize human input so that GPT4 relies on its innate reasoning abilities for corrections.
Experiment results
This table shows a summary of results:
+----------------------+---------------------+-----------------------+
| Problem              | Original Experiment | Reproduced Experiment |
+----------------------+---------------------+-----------------------+
| Total Problems       | 21                  | 21                    |
+----------------------+---------------------+-----------------------+
| Completely Correct   | 1                   | 8                     |
+----------------------+---------------------+-----------------------+
| Partially Correct    | 1                   | 2                     |
+----------------------+---------------------+-----------------------+
| Partially Incorrect  | 2                   | 6                     |
+----------------------+---------------------+-----------------------+
| Completely Incorrect | 16                  | 5                     |
+----------------------+---------------------+-----------------------+
| Not Applicable       | 1                   | 0                     |
+----------------------+---------------------+-----------------------+
Fun fact: I used GPT4 to convert a textual description of the results into this ASCII table. It got the basic arithmetic wrong.
Here are the 21 problems:
- [X ✓py] Simple Arithmetic: chat history. GPT4 failed without access to Python: chat history.
- [X ✓py] Counting: chat history. GPT4 failed without access to Python: chat history.
- [X ✓] Medical common sense: chat history
- [X ✓] Elementary Logic: chat history
- [✓ ✓] Quantifier Semantics round 1: chat history
- [X X] Quantifier Semantics round 2:
The question is whether [forall x . P(x) <==> Q(x)] holds if and only if the biconditional [(forall x . P(x)) <==> (forall x . Q(x))] holds. The stated equivalence does not hold. GPT4 gave the correct answer but provided incorrect reasoning. I asked this question twice and it gave two contradictory versions of the reasoning:
- Version 1: GPT4 thinks “=>” holds but the converse doesn’t (chat history). It eventually provided the correct reasoning after human correction.
- Version 2: GPT4 thinks “=>” doesn’t hold but the converse does. (chat history)
- [n/a X] Quantifier Semantics round 3: chat history
This is the right-hand side of the quantifier semantics round 2 statement, applied to the domain of integers. The author didn’t ask GPT4 this, but I did. GPT4 failed to comprehend the “for all” quantifier. It treated the predicates as filters and then reasoned about the equivalence between even integers (i.e., half of the integer domain after filtering) and odd integers, which isn’t the correct interpretation of the original quantification. To be fair, this one is confusing for humans too despite its deceptively simple appearance. A small illustration of the intended quantifier reading versus the filter reading is given after this list.
- [X X*] Simple graph coloring: chat history
GPT4 messed up the connectedness of the graph: it assumed the graph is complete, which it is not, and reached its answer based on that wrong assumption. I pointed out the mistake about completeness and it found one of the three missing connections. I then prompted it to discover more mistakes, i.e., missing connections, but GPT4 apparently interpreted this as mistakes in the colorability of the graph. From that point on, it produced incorrect colorability answers.
- [Xpy ✓py] Subset Sum: chat history
GPT4 provided reasonable justifications for not executing a manual approach (i.e., without Python): chat history.
- [X* ✓] Elementary Discrete Math: chat history
GPT4 provided a conditional proof of the claim, which is essentially a disproof because the claim is not universally true. I consider this correct since GPT4 is likely tuned to answer questions in a positive light, and its reasoning is sound.
- [X Xpy*] Simple Scheduling: chat history
The initial answer is incorrect, but it corrected itself in the second attempt when asked to generate all the cases.
- [X ✓*] Russell’s Paradox: 1st attempt’s chat history, 2nd attempt’s chat history
GPT4 attempted to write a program for this problem but failed. It subsequently used its native reasoning to answer the question correctly. The programming mistake was importing non-existent modules from sympy: GPT4 hallucinated constructs that don’t actually exist, this time in code.
In another attempt, GPT4 managed not to hallucinate in its program, but the program itself was wrong.
- [X X] Blocks World: chat history
The conclusion is universally true, but GPT4 provided an inconclusive answer.
- [X X] Spatial Reasoning: chat history
Telling left from right. GPT4’s internal world model seems to assume the initial orientation is always north-facing, so left is west and right is east. GPT4 failed to re-orient based on the question and drew incorrect conclusions.
- [X Xpy*] Spatial Reasoning round 2: chat history
A furniture arrangement problem. GPT4 consistently failed to comply with the 4th constraint, “D is above all others.”, even with corrections. It only considered horizontal adjacency when attempting to satisfy the 5th constraint, “E and D are adjacent”, until vertical adjacency was given as a hint. Moreover, the Python program it generated does not represent the 3 x 3 grid at all. However, I still saw an improvement in its answer over the one from eight months ago. The failure patterns are quite different: the previous answer obtained by Konstantine was guesswork, whereas the answer I saw is more systematic, following a logical path and only failing at particular constraints. The difference could be explained by GPT4 having been tuned to produce longer, more chain-of-thought-like answers as a way to make its reasoning more robust.
- [X Xpy*] Spatial Reasoning round 3: chat history
A seating puzzle. GPT4 analyzed all the constraints correctly but produced a wrong program, which led to the wrong answer; the incorrect part of the Python program can be seen in the chat history. GPT4 corrected itself after probing and produced the correct answer.
It can be observed, especially in these spatial reasoning questions, that GPT4 uses a brute-force approach to enumerate all possibilities before applying a filter. This is a smart move, as a brute-force approach is generally less complex; a minimal sketch of this generate-and-filter pattern is given after this list.
- [X Xpy*] Temporal reasoning: chat history
This is an interesting case where GPT4 over-corrected itself to the point of incorrectness. It saw an “inconsistency” between its intermediate conclusions and attempted to refine away that discrepancy, but it should have simply taken the intersection of those conclusions to reach the final answer.
- [X* X*] Murder or Suicide: chat history based on the original puzzle: chat history; chat history based on a puzzle with modified names: chat history
This logic puzzle cannot be used because GPT4 appears to have been trained on it. The author claimed authorship of the puzzle, but that doesn’t seem to be true: the exact same puzzle can be found in this paper from the last century (link).
Even so, interesting observations can still be made: GPT4 got the correct answer to the original puzzle while failing when the characters’ names and the location were changed. This highlights the memorization aspect of these models.
- [X X] Wason Selection Problem: chat history
This is a modified version of the original Wason Selection Problem, and GPT4 failed it unequivocally.
- [X ✓] Entropy: chat history
This is a question requiring elementary information theory and one-step reasoning. GPT4 got it on the first try.
- [✓ ✓] Simple Compiler Correctness: chat history
In both experiments GPT4 got the inductive structure of the proof correct. However, it made the same mistake in both: it confused a value with a stack and refused to use exec([], n::S) = n to complete the reasoning of the inductive step. Also, I think the author of the original paper didn’t specify how sequence concatenation is handled, so the fact that GPT4 reached the correct conclusion despite that omission means it was making logical leaps. Overall, I agree with the author that this type of problem is common in programming language theory and GPT4 has very likely memorized it.
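To make the quantifier misreading in Quantifier Semantics round 3 concrete, here is a minimal sketch, written by me for this post rather than taken from the paper or the chat histories, contrasting the intended “for all” reading with the filter-style reading described above. The finite range stands in for the integers, so it only illustrates the two readings; the predicate names P and Q and the range are my own choices.

# Intended reading vs. filter-style misreading of the round 3 statement.
# The finite range below stands in for the (infinite) domain of integers.
domain = range(-100, 101)

def P(x):  # P(x): x is even
    return x % 2 == 0

def Q(x):  # Q(x): x is odd
    return x % 2 == 1

# Intended reading: (forall x . P(x)) <==> (forall x . Q(x)).
# Neither universal claim is true, so the biconditional holds.
forall_P = all(P(x) for x in domain)   # False: not every integer is even
forall_Q = all(Q(x) for x in domain)   # False: not every integer is odd
print(forall_P == forall_Q)            # True

# Filter-style misreading: compare the two filtered subsets instead.
evens = {x for x in domain if P(x)}
odds = {x for x in domain if Q(x)}
print(evens == odds)                   # False: a different question entirely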
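As for the generate-and-filter pattern noted under the spatial reasoning items, here is a minimal sketch of what that approach looks like in Python. It is not GPT4’s actual program (that is in the chat histories); the people and constraints below are made up purely for illustration.

from itertools import permutations

# A hypothetical seating puzzle: four people sit in a row of four seats, subject to
#   1. Alice sits in an end seat;
#   2. Bob sits immediately to the left of Carol;
#   3. Dana does not sit next to Alice.
people = ["Alice", "Bob", "Carol", "Dana"]

def satisfies(seating):
    pos = {name: i for i, name in enumerate(seating)}
    return (
        pos["Alice"] in (0, len(seating) - 1)      # constraint 1
        and pos["Bob"] + 1 == pos["Carol"]         # constraint 2
        and abs(pos["Dana"] - pos["Alice"]) != 1   # constraint 3
    )

# Brute force: generate every arrangement, then filter by the constraints.
solutions = [s for s in permutations(people) if satisfies(s)]
print(solutions)
# [('Alice', 'Bob', 'Carol', 'Dana'), ('Dana', 'Bob', 'Carol', 'Alice')]

The enumeration is exponential in the number of seats, but for puzzle-sized inputs that is irrelevant, and the filtering code maps one-to-one onto the stated constraints, which is presumably why this style reduces GPT4’s chance of programming errors.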
Conclusions
As the table in the previous section shows, there have been significant improvements in GPT4’s deductive reasoning abilities. The number of completely correct solutions went way up, with the number of completely incorrect ones dropping correspondingly. The number of partially correct and partially incorrect solutions also increased substantially, indicating GPT4 gets more of its chain of reasoning right.
The crucial factors behind the increased number of completely correct solutions are GPT4’s improved ability to identify problems that can be solved programmatically, and to actually program correctly. An interesting observation is that GPT4 tends to write brute-force algorithms when programming its way to a solution, and this reduces the chance of errors in its programs.
Regarding the partially correct or partially incorrect solutions, as mentioned in the methodology section, GPT4 automatically produces responses in a chain-of-thought style, which reduces its chance of making logical leaps that lead to egregiously incorrect answers. It is also more receptive to human-provided corrections.
What are the implications of these findings for putting LLM-based chatbots into applications? I will answer this in a subsequent post. Stay tuned.
Appendix
Undecidability as an a priori argument for the implausibility of LLM reasoning
In Konstantine’s paper, he presents the major a priori consideration for the implausibility of robust LLM reasoning as follows:
In fact, in the general case (first-order or higher-order logic), it is algorithmically undecidable, i.e., every bit as unsolvable as the halting problem
This statement can be understood literally: deciding logical entailment in first-order (let alone higher-order) logic is undecidable, so, just as with the halting problem, no machine can solve it completely in finite time for arbitrary inputs. Even decidable reasoning problems can be computationally intractable; one such example, subset sum, was used in the experiments and is NP-hard. Therefore, LLM-based programs, and even future programs based on a different paradigm, whether built on top of LLMs or not, cannot solve these problems in full generality.
However, in practical settings, we don’t expect to have an oracle. These hard problems can in practice very often be solved efficiently enough, perhaps because of simpler inputs or exploitable structure in the real-world problem itself. Besides, these problems haven’t been solved by humans collectively anyway.
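To make this concrete with the subset sum problem used in the experiments: the general problem is NP-hard, yet a brute-force search over all subsets, which is roughly what a programmatic solution amounts to here, dispatches puzzle-sized instances instantly. This is my own sketch; the numbers are made up, and the actual instance GPT4 solved is in the chat history.

from itertools import combinations

# Brute-force subset sum: try every subset and check its total.
# Exponential in the number of elements, yet instantaneous for small inputs.
def subset_sum(numbers, target):
    for r in range(len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return subset
    return None

print(subset_sum([3, 34, 4, 12, 5, 2], 9))   # (4, 5)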
The author later asks the following question in an attempt to compare an LLM-based system with the process employed by humans:
Is it impossible to build something like an LLM-based system with the reasoning ability of a well-trained engineer of average intelligence?
He answers this question in the negative after explaining how humans make intellectual progress. The process he describes is equivalent to Kenneth Stanley’s open-ended search, which is indirect and not objective-driven: we humans follow our interests and collect stepping stones along the way, which eventually lead to solutions to problems nobody set out to solve.