It's Like Strawberries

The Strawberry Test: Why AI's Hidden Reasoning Gaps Matter for Business Leaders

How a simple letter-counting exercise reveals the critical difference between pattern recognition and genuine understanding in AI systems

The Problem We're Not Seeing

AI systems are still learning to think, not just pattern-match. And that changes everything about how we should deploy them in our organisations.

Last week, I was reviewing our AI implementation strategy when I stumbled back into something that fundamentally shifted my perspective on how we evaluate AI capabilities. It started with what seemed like a trivial question: "How many r's are in the word strawberry?"

If you're thinking "three," you're absolutely right. But a year ago, this simple question became the litmus test that exposed a critical flaw in how we understand AI reasoning. The implications for business leaders go far beyond letter counting.

The Strawberry Test Origins

In 2024, a Reddit post in r/singularity highlighted something peculiar: leading AI models consistently failed at counting letters in simple words. When asked "How many r's in strawberry?", most large language models confidently answered "2." The correct answer, of course, is 3 (s-t-r-a-w-b-e-r-r-y).

The Reddit community immediately began theorising. Was it tokenisation? Prompt engineering issues? Training data limitations? The technical explanations were fascinating, diving deep into how AI systems break down language into processable chunks and how that might interfere with character-level analysis.

But here's what caught my attention as a technology leader: the community's response was to create increasingly complex prompts to "fix" the problem. One particularly elaborate prompt began with: "When I give you a problem, I don't want you to solve it by guessing based on your training data. Instead, solve it by observing the problem, reasoning about it, and paying attention to the smallest details..."

This approach treated the symptom, not the disease.

Fast-Forward to Today: Have We Really Solved It?

Fast-forward to 2025: every major LLM now gets the strawberry question right. OpenAI's models, Google's Gemini, Anthropic's Claude, even smaller open-source models. Problem solved, right?

Not quite.

Curious about whether we'd actually addressed the underlying reasoning gap or simply trained our way around a specific test case, I decided to probe a bit deeper. I started with the standard question:

Me: "How many r's in the word strawberry?"

AI: "There are 3 r's in the word 'strawberry'. Let me break it down: s-t-r-a-w-b-e-r-r-y. The r's appear in positions 3, 8, and 9."

Perfect. But then I tried a slight variation:

Me: "How many r's in 'the word strawberry'?"

AI: "There are 3 r's in 'the word strawberry'. The r's all appear in the word 'strawberry' itself: s-t-r-a-w-b-e-r-r-y. The other words in the phrase ('the word') don't contain any r's."

Wait. "The other words... don't contain any r's"? Let me check that.

Me: "Does the word 'word' contain an r?"

AI: "Yes, the word 'word' contains one r. The word 'word' is spelled: w-o-r-d. So going back to your previous question about 'the word strawberry' there are actually 4 r's total: 1 r in 'word', 3 r's in 'strawberry'. I apologise for the error in my previous response."

What This Reveals About AI Reasoning

This interaction illuminated something profound about current AI systems: they're incredibly sophisticated pattern-matching engines that have been fine-tuned to handle specific test cases, but they still struggle with contextual reasoning and systematic analysis.

The technical explanation involves tokenisation, the process by which AI models break text down into manageable chunks for processing. When an AI encounters "strawberry," it might tokenise it as ["straw", "berry"] or ["str", "aw", "berry"], making letter-by-letter analysis more complex than it appears. This is, admittedly, a simplified picture that skips over byte-pair encoding (BPE) and related details.
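As a rough illustration (the token boundaries below are an assumption for demonstration, not any specific model's actual vocabulary), compare what a character-level analysis sees with what a tokenised model receives:

```python
word = "strawberry"

# Character-level view: the letters are directly observable.
print(list(word))       # ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']
print(word.count("r"))  # 3

# Hypothetical token-level view: the model receives chunks, not letters.
tokens = ["straw", "berry"]  # illustrative split only
# To count r's, the model must reconstruct spellings it never observes
# directly, which is why character-level questions can be surprisingly hard.
print(sum(t.count("r") for t in tokens))  # 3, but only if spelling is recovered correctly
```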

But the deeper issue isn't technical. It's conceptual. The AI had clearly been trained or fine-tuned to correctly answer "How many r's in strawberry?" When faced with the slightly modified phrase "the word strawberry," it applied the same learned response pattern without systematically analysing the new context.

One particularly popular YouTuber used to run every newly released LLM through the same set of specific tests on his channel. Over time the models started passing every test, and even he began to question whether the vendors were feeding the answers to those questions into the LLMs before release.

Think of it like teaching someone to pass a driving test versus actually learning to drive. They might execute a perfect three-point turn in the exact conditions they practised, but struggle when the street is narrower, there's a parked car, or the weather conditions are different.

The Business Implications

For business leaders evaluating AI implementations, this distinction between memorised responses and genuine reasoning capabilities has significant implications:

1. Training Data Versus Generalisation

Many AI systems excel at problems similar to those in their training data but falter when faced with novel variations. If your AI customer service system has been trained on common support queries, how does it handle edge cases or unprecedented situations? The difference between pattern recognition and reasoning determines whether your system degrades gracefully or fails spectacularly.

2. Context Sensitivity

The strawberry test reveals how AI systems can miss crucial context when applying learned patterns. In business applications, this could manifest as recommendation engines that miss important user context, financial models that fail to account for market conditions outside their training data, or automated decision systems that apply rules too rigidly.

3. Evaluation Methodology

Most AI evaluation focuses on benchmark performance. But benchmarks can become targets that systems optimise for without developing genuine capabilities. The strawberry test was effectively "solved" through training, but the underlying reasoning limitation persisted. Are your AI evaluation methods testing true capability or just benchmark performance?

4. Robustness and Reliability

Systems that rely on pattern matching rather than reasoning are inherently brittle. They work well within their trained domain but can fail unexpectedly when conditions change. For mission-critical applications, understanding this distinction is crucial for risk management.

A Framework for Better AI Evaluation

Based on these insights, I've developed a framework for evaluating AI systems that goes beyond standard benchmarks:

Test for Reasoning, Not Just Performance

Instead of asking whether your AI gets the "right" answer, ask whether it can explain its reasoning and handle variations of the same problem. Can your AI system articulate why it made a particular decision? When you slightly modify the input, does the output change appropriately?

Probe Edge Cases and Context Shifts

Deliberately test your AI systems with variations of standard inputs. If your system performs well on customer queries about "product returns," how does it handle "returning products" or "product return policies"? These linguistic variations reveal whether the system understands concepts or just matches patterns.
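A lightweight way to operationalise this (and the reasoning test above) is a small consistency harness: send paraphrased variants of the same query and flag divergent answers. The sketch below assumes an `ask` function wrapping whatever LLM client you already use; the variants are illustrative only:

```python
from collections import Counter
from typing import Callable

# Illustrative paraphrases of one support query; substitute your own.
VARIANTS = [
    "What is your policy on product returns?",
    "How do I go about returning a product?",
    "Explain the product return policy.",
]

def consistency_check(ask: Callable[[str], str], variants: list[str]) -> Counter:
    """Send semantically equivalent prompts and tally the distinct answers.

    A system that understands the concept should converge on one answer;
    a pattern matcher may drift as the surface wording shifts.
    """
    return Counter(ask(v).strip().lower() for v in variants)

# Usage, where ask_model wraps your LLM API of choice:
# counts = consistency_check(ask_model, VARIANTS)
# if len(counts) > 1:  # the answers disagree -> flag this query family for review
```

In practice you would compare normalised or semantically matched answers rather than raw strings, but the principle is the same: same meaning in, same decision out.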

Evaluate Systematic Thinking

Look for evidence that your AI system can break down complex problems systematically. Can it identify when it needs more information? Does it recognise when a problem falls outside its capabilities? These meta-cognitive abilities distinguish reasoning systems from sophisticated pattern matchers.

Monitor for Training Data Leakage

Be wary of AI systems that perform suspiciously well on test cases that might have appeared in training data. The strawberry test's "solution" might simply reflect training on similar examples rather than improved reasoning capabilities.
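A pragmatic, if imperfect, probe is to compare performance on published test items against lightly perturbed rewrites of the same items; a large gap hints at memorisation rather than capability. The sketch below assumes you can score answers automatically and supply your own `ask` and `perturb` functions:

```python
from typing import Callable

def contamination_gap(
    ask: Callable[[str], str],
    items: list[tuple[str, str]],      # (prompt, expected_answer) pairs
    perturb: Callable[[str], str],     # e.g. reword the prompt, rename entities
) -> tuple[float, float]:
    """Accuracy on original benchmark items vs perturbed rewrites.

    A system that memorised the originals tends to drop sharply on the
    rewrites; a system that reasons about them degrades far less.
    """
    def accuracy(pairs: list[tuple[str, str]]) -> float:
        return sum(expected.lower() in ask(prompt).lower()
                   for prompt, expected in pairs) / len(pairs)

    return accuracy(items), accuracy([(perturb(p), a) for p, a in items])

# A result like (0.95, 0.60) would justify a closer look at training overlap.
```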

The Strategic Question

The strawberry test ultimately asks a strategic question that every technology leader must answer: Are we building AI systems that can think, or just very sophisticated systems that can mimic thinking?

For routine, well-defined tasks, pattern matching might be sufficient. But as we push AI into more complex, dynamic environments where genuine reasoning matters, this distinction becomes critical.

The organisations that recognise this difference and design their AI strategies accordingly will build more robust, adaptable, and ultimately valuable AI capabilities. Those that mistake sophisticated pattern matching for reasoning may find their AI systems failing precisely when they need them most.

Moving Forward: Questions for Leaders

As you evaluate AI implementations in your organisation, consider these questions:

For existing AI systems: How do your current AI tools handle variations of the problems they were designed to solve? Have you tested them on edge cases and context shifts?

For AI procurement: When evaluating new AI solutions, are you testing genuine reasoning capabilities or just benchmark performance? How do vendors demonstrate that their systems can handle novel situations?

For AI strategy: Are you building AI capabilities that can adapt to changing business conditions, or are you creating sophisticated but brittle automation?

For risk management: Do you understand where your AI systems rely on pattern matching versus reasoning? How might this affect their reliability in critical situations?

The strawberry test might seem trivial, but it reveals something fundamental about the current state of AI: we're still in the early stages of building systems that can truly reason rather than just recognise patterns.

The Computational Reality Check

Here's the uncomfortable truth that emerged from my strawberry test experiments: achieving genuine reasoning requires significantly more computational resources than pattern matching. To get a reliable answer of 1 r in "word" and 3 r's in "strawberry" (totalling 4 in "the word strawberry"), the AI system would need to engage in multiple reasoning cycles, process more tokens, and potentially re-evaluate its initial response several times.

This means longer processing times, higher computational costs, and substantially more infrastructure requirements. When I forced the AI to reason systematically through the problem, the response time increased noticeably, and the token usage multiplied as the system worked through its logic step by step.
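To make the trade-off tangible, here is a back-of-envelope sketch; every number in it is an illustrative assumption rather than measured performance or published pricing:

```python
# Rough cost/latency comparison of a single-pass answer vs a multi-step
# reasoning loop. All figures are illustrative assumptions, not vendor pricing.

PRICE_PER_1K_TOKENS = 0.01       # hypothetical blended price in USD
SECONDS_PER_100_TOKENS = 1.0     # hypothetical generation speed

single_pass_tokens = 150         # one short, confident answer
reasoning_tokens = 150 + 4 * 400 # initial pass plus ~4 re-evaluation cycles

for label, tokens in [("single pass", single_pass_tokens),
                      ("with reasoning", reasoning_tokens)]:
    cost = tokens / 1000 * PRICE_PER_1K_TOKENS
    latency = tokens / 100 * SECONDS_PER_100_TOKENS
    print(f"{label:>15}: {tokens:>5} tokens  ~${cost:.4f}  ~{latency:.1f}s")

# Roughly a tenfold token multiplier, compounded across every request your
# organisation makes: that is the infrastructure bill for "thinking twice".
```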

The question becomes: are we prepared to wait for systems to think and rethink until they arrive at genuinely reasoned answers? Are we willing to accept the latency and cost implications of true AI reasoning?

More fundamentally, are we expecting too much of AI at this stage until the computational power in data centres catches up to the reasoning demands we're placing on these systems?

The Infrastructure-Intelligence Trade-off

This reveals a critical trade-off that business leaders must navigate: the balance between AI speed and AI reasoning quality. Current AI implementations are optimised for fast, confident responses that satisfy most use cases. But genuine reasoning requires computational patience—the ability to slow down, reconsider, and systematically work through problems.

The organisations that understand this distinction and plan accordingly won't just avoid the pitfalls of brittle AI systems. They'll position themselves to leverage AI's genuine reasoning capabilities as they continue to evolve, while also managing realistic expectations about current computational constraints.

Are you testing your AI implementations for true comprehension or just sophisticated mimicry? Are you prepared for the infrastructure investment that genuine AI reasoning will require? The answers might determine whether you're building the foundation for genuine AI-powered transformation or setting yourself up for unexpected failures when the patterns your systems recognise no longer match the problems you need them to solve.

Brett Harrison - VirCIO Australia

