New Study Challenges Findings on Apple’s LLM ‘Reasoning Collapse’

Apple’s recent AI research paper, “The Illusion of Thinking,” has drawn attention for its stark claim that even the most sophisticated Large Reasoning Models (LRMs) collapse on sufficiently complex tasks. Not everyone agrees with that framing.

Today, Alex Lawsen, a researcher at Open Philanthropy, published a detailed rebuttal arguing that many of Apple’s most eye-catching findings come down to flaws in experimental design rather than genuine limits on reasoning. The paper also credits Anthropic’s Claude Opus model as a co-author.

The Rebuttal: Less “Illusion of Thinking,” More “Illusion of Evaluation”

Lawsen’s critique, aptly titled “The Illusion of the Illusion of Thinking,” does not dispute that today’s LRMs struggle with complex planning tasks. He argues, however, that Apple’s paper conflates practical output limits and flawed evaluation setups with genuine reasoning failures.

Below are the three principal issues that Lawsen highlights:

  1. Apple overlooked token budget constraints:
    At the points where Apple reports that models “collapse” on Tower of Hanoi puzzles involving 8+ disks, models like Claude were already maxing out their token generation limits. A full Hanoi solution for n disks requires 2^n − 1 moves, so the output Apple demanded grows exponentially with disk count (see the rough sketch after this list). Lawsen points to explicit outputs in which the models state: “The pattern continues, but I’ll stop here to save tokens.”
  2. Unsolvable puzzles were misclassified as failures:
    Apple’s River Crossing test included mathematically unsolvable instances (for example, configurations with 6+ actor/agent pairs and a boat too small for any valid crossing sequence to exist). Lawsen emphasizes that models were penalized for correctly recognizing this and declining to attempt a solution.
  3. Evaluation scripts did not distinguish reasoning errors from output truncation:
    Apple’s automated grader scored models solely on complete move lists, even when a complete list could not fit within the token limit. Lawsen argues that this rigid evaluation counted partial answers, or answers that described the solving strategy rather than enumerating every move, as total failures.
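
To make the scale concrete, here is a rough back-of-the-envelope illustration; it is not taken from either paper, and the tokens-per-move figure is an assumption used purely to show the order of magnitude:

```lua
-- An n-disk Tower of Hanoi solution has 2^n - 1 moves, so the move list
-- Apple's grader expects grows exponentially with the number of disks.
-- The tokens-per-move value is an assumed ballpark, for illustration only.
local tokens_per_move = 10

for _, n in ipairs({ 8, 10, 12, 15 }) do
  local moves = 2 ^ n - 1
  print(string.format("%2d disks: %6.0f moves (~%.0f output tokens)",
                      n, moves, moves * tokens_per_move))
end
```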

Alternative Testing: Allow the Model to Code

To support his argument, Lawsen re-evaluated some Tower of Hanoi tests using a different approach: instead of requiring exhaustive move lists, he asked models to generate a recursive Lua function that produces the solution.
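
To give a concrete sense of what that looks like, below is a minimal sketch of a recursive Hanoi solver in Lua. The function name, signature, and move format are illustrative assumptions, not code from Lawsen’s paper.

```lua
-- Minimal sketch of a recursive Tower of Hanoi solver of the kind Lawsen
-- asked models to produce. Names and output format are illustrative
-- assumptions; this is not code taken from Lawsen's paper.
local function hanoi(n, from, to, via, moves)
  moves = moves or {}
  if n > 0 then
    hanoi(n - 1, from, via, to, moves)    -- move the n-1 smaller disks aside
    moves[#moves + 1] = { disk = n, from = from, to = to }
    hanoi(n - 1, via, to, from, moves)    -- stack them back on top of disk n
  end
  return moves
end

-- A 15-disk instance yields 2^15 - 1 = 32767 moves without the model
-- having to write each one out by hand.
print(#hanoi(15, "A", "C", "B"))  --> 32767
```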

The outcome? Models such as Claude, Gemini, and OpenAI’s o3 had no difficulty generating algorithmically correct solutions for 15-disk Hanoi tasks, far surpassing the complexity level at which Apple reported zero successes.

Lawsen concludes that when artificial output constraints are removed, LRMs demonstrate a robust capacity for reasoning in high-complexity contexts, particularly in algorithm generation.

Why This Debate Matters

At first glance, this may seem like typical nitpicking in AI research. However, the implications are significant. The Apple paper has been widely referenced as evidence that current LLMs lack scalable reasoning ability, which, as I have noted, may not be the fairest interpretation of the study.

Lawsen’s rebuttal suggests the reality is more nuanced: while LLMs do struggle with long token-by-token enumeration under current deployment constraints, their reasoning capabilities may not be as fragile as the original paper suggests, or at least not as fragile as many readers claimed it showed.

Certainly, this does not absolve LRMs of criticism. Lawsen himself acknowledges that true algorithmic generalization remains difficult, and his re-evaluations are still preliminary. He also offers recommendations for future work on the topic:

  1. Develop evaluations that differentiate between reasoning capabilities and output constraints
  2. Confirm puzzle solvability prior to assessing model performance
  3. Utilize complexity metrics that reflect computational difficulty rather than merely solution length
  4. Consider diverse solution representations to separate algorithmic comprehension from execution

The pivotal question isn’t whether LRMs can reason, but whether our evaluations can effectively separate reasoning from basic typing.

In essence, his main contention is straightforward: before declaring reasoning as fundamentally flawed, it is worthwhile to re-evaluate the criteria by which it is being assessed.

H/T: Fabrício Carraro.