Defining Difficulty In the Age of Reason
If a Parrot can recite the complete works of Socrates, is he a flying philosopher?
Author's Note: This is the second post in a series exploring a new approach to LLM reasoning benchmarks; start with Part 1.
Studying reasoning with humans is frustratingly messy. Their performance varies with hunger, sleep, mood, and whether they're annoyed at Sandy's latest Instagram post. You can't ask a human to solve the same problem 50 times with different working memory constraints just to see what happens; they will most likely just glare at you and go back to scrolling their phone. And when you try to measure how long they've thought about something, there is no way to control how much of that time was actually spent thinking about what to have for lunch later.
But LLMs offer something unprecedented in the study of reasoning: controllable minds that we can test systematically. We can run the same logical puzzle through a thousand different models, at different parameter sizes, with different prompting strategies and different working memory sizes, and actually measure how reasoning scales. We can directly observe the thought streams, measure token usage, and even inject or modify thoughts!
It's like having a laboratory for studying the thinking process itself.
Just one small catch: we have to understand both universal reasoning challenges AND the failure modes specific to our laboratory. While LLMs share some universal cognitive limitations with humans, they also have their own failure modes that create uniquely artificial categories of reasoning difficulty.
Universal Difficulties
Every reasoning system - human or artificial - faces three fundamental challenges:
Load - How much information you need to track simultaneously. Sorting 3 words versus sorting 24 words. Remembering 2 facts versus juggling 15 interconnected pieces of information. This hits universal working memory limits regardless of whether you're made of neurons or transformers.
Depth - How many sequential steps the reasoning chain requires. Simple arithmetic versus nested expressions six levels deep. Following one logical implication versus chaining together multi-hop inferences where each step depends on the previous ones. You can't skip ahead by pattern matching - you have to actually walk the logical path.
Interference - How much irrelevant information competes for attention, or how many exceptions break the expected patterns. Counting fruits when half of the objects are vegetables. Following rules that work 90% of the time but have specific exceptions that matter. This tests whether you actually understand underlying principles or just learned to follow the most common patterns.
These factors add to create the Base Load that stresses any reasoning system. A human struggling with a 50-term nested arithmetic expression is hitting real working memory limits. So is an AI.
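To make the three axes concrete, here are some purely illustrative prompts (not taken from the test suite) that each turn up one dial on a simple sorting task:

```python
# Illustrative only - these are not the suite's actual tasks.
baseline = "Sort alphabetically: cat dog ant"

# Load: many more items to hold in working memory at once (24 here).
high_load = ("Sort alphabetically: fig kiwi date plum pear lime ant yak owl "
             "elk cod bee fox hen jay koi ram sow wren zebu newt crab moth ibis")

# Depth: the same operation, chained so each step depends on the previous one.
high_depth = "Sort alphabetically, then reverse the order, then drop every second word: cat dog ant"

# Interference: distractors that look relevant but must be ignored.
high_interference = "Sort only the animals alphabetically: cat carrot dog potato ant onion"
```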
But here's where it gets interesting.
Why AI Fails Differently
LLMs have unique vulnerabilities that can multiply this base difficulty exponentially. Understanding these architectural blind spots is the key to designing tests that actually probe reasoning rather than accidentally testing memorization.
Tokenization Mismatch happens when the task requires operating on units that don't align with how the model actually sees the world. You want to count letters in "Hello World" but the model sees maybe 2-3 tokens, not 11 characters. You ask it to reverse "abcdef" but "def" might be a single token while "abc" is three separate ones. Suddenly familiar operations become surprisingly difficult.
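If you want to see this mismatch for yourself, a tokenizer library makes it visible in a few lines. The sketch below assumes OpenAI's tiktoken package and the cl100k_base vocabulary; other models split text differently, but the mismatch itself is universal:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for text in ["Hello World", "abcdef"]:
    tokens = enc.encode(text)
    pieces = [enc.decode([t]) for t in tokens]
    print(f"{text!r}: {len(text)} characters -> {len(tokens)} tokens {pieces}")
# "Hello World" comes out as a couple of tokens, nowhere near 11 separate
# characters - so "count the letters" forces the model below its native unit.
```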
Distribution Shift occurs when you package familiar tasks in unfamiliar formats. The model learned sorting from `[1, 2, 3]` examples, but you give it `Apple banana CHERRY apple` and ask for newline-separated lowercase output with duplicates preserved. The underlying procedure is identical, but the packaging breaks all the pattern-matching shortcuts.
Attention Decay strikes when important connections are separated by too many tokens. In deeply nested expressions, the model needs to remember which closing parenthesis matches which opening one, but they're separated by 30+ tokens of arithmetic. The attention mechanism struggles to maintain these long-range dependencies consistently.
The magic happens when these architectural amplifiers combine with universal cognitive load.
When 2+2 Becomes Impossible
Let me show you how Base Load × Architectural Multipliers = Exponential Difficulty with two examples that look deceptively simple.
Take something as basic as `(2 + 3) * 4`. Every model handles this perfectly.

But change it to `(2 + 3)* 4` and suddenly some models struggle - not because the math is harder (we didn't actually change anything numerically), but because `) *` is actually encoded as two tokens, `)` and `*`, while `)*` is a single token. We have successfully changed how the input is perceived by the LLM, despite not changing the problem.
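You can inspect the two expressions the same way; the sketch below again assumes tiktoken with a GPT-style vocabulary, and the exact splits will vary between model families, which is rather the point:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for expr in ["(2 + 3) * 4", "(2 + 3)* 4"]:
    pieces = [enc.decode([t]) for t in enc.encode(expr)]
    print(f"{expr!r} -> {len(pieces)} tokens: {pieces}")
# Identical arithmetic, different token boundaries: the model literally
# receives two different inputs for the "same" problem.
```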
Now scale this up: `((((15 + 23) × 4) - 8) + ((7 × 12) - 5))`.
You've got:
- Base Load: 8 numbers, 4 levels of nesting, multiple operations
- Tokenization Chaos: Operators, brackets, and numbers are mixed together, causing unpredictable tokenization boundaries
- Attention Decay: Matching parentheses across 20+ tokens
- Distribution Shift: Training data rarely contains this exact nesting pattern
What looks like straightforward arithmetic becomes genuinely challenging because the model has to actually follow mathematical rules rather than pattern-match to memorized solutions.
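Generating these expressions on demand is straightforward, which is what makes it possible to scale the difficulty systematically. Here is a minimal sketch of a generator with a controllable nesting depth (a hypothetical helper, not the suite's actual code):

```python
import random

def nested_expression(depth: int, rng: random.Random) -> str:
    """Build a fully parenthesized arithmetic expression `depth` levels deep."""
    if depth == 0:
        return str(rng.randint(2, 25))
    op = rng.choice(["+", "-", "*"])
    left = nested_expression(depth - 1, rng)
    right = nested_expression(depth - 1, rng)
    return f"({left} {op} {right})"

rng = random.Random(42)
expr = nested_expression(4, rng)
print(expr, "=", eval(expr))  # eval is fine here: we generated the string ourselves
```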
Next, consider word sorting. Ask any model to sort `apple cherry banana` and it succeeds instantly!
Input: apple cherry banana
Expected Output: apple banana cherry
But what if we re-frame the problem just a little bit? Make the sort case-insensitive, repeat some inputs, and require that the output be returned with one word per line:
Input: bob BOB Bob bob ALICE Bob bob
Expected Output: alice
bob
bob
bob
bob
bob
bob
Suddenly this simple task is not so simple:
- Base Load: 7 items with case variations and duplicates
- Tokenization Chaos: `bob`, `BOB`, and `Bob` are completely different tokens to the model
- Distribution Shift: The output should be newline-separated lowercase - neither the bracketed lists seen during training nor the space-separated list from the input
- Interference: Case differences look important but are not.
A human sees "sort 'alice' and these variations of 'bob'" and handles it trivially. The model sees a sequence of potentially unrelated tokens that need parsing, transformation, and output in an unfamiliar format with exactly the kind of repetition it's been trained NOT to produce. Models that can flawlessly handle complex tasks get tripped up by what looks like elementary school homework.
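For reference, the entire transformation the model is being asked to perform fits in a couple of lines of Python, which is exactly what makes the failures so telling:

```python
words = "bob BOB Bob bob ALICE Bob bob".split()
print("\n".join(sorted(w.lower() for w in words)))
# prints "alice" followed by six lines of "bob" - lowercase,
# newline-separated, duplicates preserved
```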
Degrees of Understanding
This approach to designing reasoning challenges reveals three levels of capability:
Brittle Pattern Matching - Performance collapses when format changes slightly. The model memorized transformations but doesn't understand the underlying procedure.
Robust Procedural Knowledge - Performance degrades gracefully under increasing load and format pressure. The model grasps the basic approach but struggles with execution under stress.
True Generalization - Consistent rule application regardless of surface presentation. The model genuinely understands the underlying principles and can adapt them flexibly.
A model that truly understands sorting shouldn't care whether the input is `[1, 2, 3]` or `bob BOB Bob bob`. A model that genuinely grasps arithmetic shouldn't be thrown by tokenization differences between `* (` and `*(`.
But pattern matching systems? They'll confidently apply memorized transformations until the format shifts just enough to break their internal templates. Then they'll either fail catastrophically or succeed in ways that reveal they're following completely different internal logic than what the task actually requires.
The Laboratory of Mind
This is why my new test suite generates thousands of parameterized examples across systematic difficulty ramps. When models improve on simple 3-word sorting, I can challenge them with 15-word mixed-case runs with duplicates. When they master basic arithmetic, I can test 6-level nested expressions with large numbers.
The test evolves with the models, maintaining that sweet spot where universal cognitive load creates genuine difficulty while architectural vulnerabilities prevent pattern matching shortcuts.
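As a rough sketch of what such a ramp looks like in practice (the parameter names here are illustrative, not the suite's real interface), the same word-sorting task can be regenerated at any difficulty level:

```python
import random

def word_sort_case(n_words: int, mixed_case: bool, duplicates: bool, seed: int = 0):
    """Generate a (prompt, expected_answer) pair for the word-sorting task."""
    rng = random.Random(seed)
    base = ["apple", "banana", "cherry", "date", "fig", "grape", "kiwi", "lemon",
            "mango", "olive", "peach", "pear", "plum", "quince", "raisin"]
    pool = rng.choices(base, k=n_words) if duplicates else rng.sample(base, n_words)
    words = [rng.choice([w.lower(), w.upper(), w.title()]) if mixed_case else w
             for w in pool]
    prompt = " ".join(words)
    answer = "\n".join(sorted(w.lower() for w in words))
    return prompt, answer

print(word_sort_case(3, mixed_case=False, duplicates=False))   # the easy rung
print(word_sort_case(15, mixed_case=True, duplicates=True))    # the 15-word mixed-case rung
```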
But more importantly, this gives us an unprecedented window into reasoning itself. By systematically varying both universal challenges and artificial-specific stress factors, we can begin to understand which aspects of intelligence are fundamental and which are implementation details. Every time a model fails at `bob BOB Bob bob` sorting but succeeds at complex code generation, we learn something about the nature of procedural knowledge versus pattern recognition. Every time increasing arithmetic depth causes graceful degradation versus catastrophic failure, we understand something about how reasoning systems handle cognitive load.
It's not just about evaluating AI reasoning capabilities - we're using artificial minds as instruments to study the structure of reasoning itself.
More details on the project (it's something you can run today!) are available on GitHub.
Next up: diving deeper into the specific tasks I designed to probe the artificial mind, and why a short series of opening angle brackets like `< < < < < < < < < < < <` can totally break models that can otherwise handle complex reasoning perfectly (hint: thanks to HTML and XML, the LLM has seen plenty of opening brackets during training - but never in a row like this).