AI Fundamentals | Education | March 16, 2026

AI Gets a D: What the WSU ChatGPT Study Reveals About AI Accuracy

A new Washington State University study finds ChatGPT earns a D grade on complex reasoning tasks — only 60% accurate when adjusted for guessing, and just 16.4% accurate at identifying false statements.

ChatGPT can write a flawless cover letter, summarize a legal document, and generate code in seconds. It sounds smart. It reads smart. But a landmark new study from Washington State University reveals a deeply inconvenient truth: when it comes to complex reasoning and conceptual accuracy, today's most popular AI chatbot earns a grade that would alarm most students — a D.

The study, led by WSU Professor Mesut Cicek and published in March 2026, is one of the most methodologically rigorous evaluations of ChatGPT's reasoning capabilities to date. Its findings have sweeping implications — not just for AI researchers, but for every business, educator, and professional who has incorporated AI-generated answers into their daily workflows.

How the Study Was Conducted

The WSU research team designed a rigorous, repeatable test to evaluate not just whether ChatGPT gets answers right — but whether it gets them right *consistently*.

The methodology was straightforward but illuminating:

  • 700+ scientific hypotheses drawn from published research papers were fed into ChatGPT
  • The AI was asked to classify each statement as true or false
  • Each query was repeated 10 times to measure consistency across identical prompts
  • The experiment was run in both 2024 and 2025 to track improvement over time

This approach was designed to expose a specific kind of failure that standard benchmarks miss: the gap between *linguistic plausibility* and *conceptual accuracy*. ChatGPT is extraordinarily good at producing answers that sound correct. The question the WSU team asked is whether those answers actually *are* correct — and whether the model arrives at the same answer every time it's asked.
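
To make the protocol concrete, here is a minimal sketch of how a repeated-query test along these lines could be scripted. The classify_hypothesis function is a hypothetical stand-in for an actual ChatGPT call (the article does not publish the study's code), and the random answer it returns is purely a placeholder so the sketch runs on its own.

    import random  # placeholder only; a real run would call a chat model API

    def classify_hypothesis(prompt: str) -> str:
        """Hypothetical stand-in for a ChatGPT call that returns 'true' or 'false'."""
        return random.choice(["true", "false"])

    hypotheses = [
        "Hypothesis 1 text ...",
        "Hypothesis 2 text ...",
    ]  # the WSU team used 700+ statements drawn from published papers

    RUNS_PER_PROMPT = 10  # each query repeated to measure consistency
    answers_by_prompt = {}
    for statement in hypotheses:
        prompt = (
            "Is the following hypothesis true or false? Answer with one word.\n\n"
            + statement
        )
        answers_by_prompt[statement] = [
            classify_hypothesis(prompt) for _ in range(RUNS_PER_PROMPT)
        ]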

The Results: What a D Grade Actually Means

The raw numbers look better than the D grade implies — at first glance.

In 2024, ChatGPT answered the hypothesis questions correctly 76.5% of the time. By 2025, that figure had improved to 80%. On the surface, that sounds like a passing grade. But the WSU researchers applied a crucial correction: they adjusted for the baseline accuracy expected from random guessing.

For true/false binary questions, random guessing produces a 50% accuracy rate. After correcting for this baseline, ChatGPT's *true* accuracy — the portion of correct answers attributable to actual reasoning rather than chance — drops to approximately 60%. In academic grading terms, that is firmly a D.
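
The article doesn't spell out the exact adjustment the researchers used, but the reported figures are consistent with the standard correction for guessing, which subtracts the chance baseline and rescales what remains:

    def corrected_accuracy(observed: float, chance: float = 0.5) -> float:
        """Standard correction for guessing on binary (true/false) items."""
        return (observed - chance) / (1 - chance)

    print(corrected_accuracy(0.80))   # 2025 figure: 0.60 — roughly a D
    print(corrected_accuracy(0.765))  # 2024 figure: 0.53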

Beyond the overall accuracy number, the study surfaced an even more troubling finding about performance on false hypotheses specifically. In 2025, ChatGPT correctly identified false statements only 16.4% of the time. The model was dramatically better at confirming true things than at detecting falsehoods — a pattern that carries significant real-world risk, particularly in high-stakes domains like medicine, law, and scientific research where identifying incorrect claims is just as important as confirming correct ones.

Perhaps most alarming for practitioners: the model's answers were *inconsistent* across the 10 identical runs of each query. Across the full question set, ChatGPT gave the same answer on every run for only 73% of prompts. In other words, roughly 1 in 4 questions yielded different answers depending on which "roll of the dice" the model happened to produce in that session.
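
One plausible way to compute that consistency figure, assuming per-prompt answer lists like the ones collected in the sketch above, is to count the fraction of prompts whose ten runs all agree:

    def consistency_rate(answers_by_prompt: dict[str, list[str]]) -> float:
        """Fraction of prompts for which every repeated run gave the same answer."""
        stable = sum(
            1 for answers in answers_by_prompt.values() if len(set(answers)) == 1
        )
        return stable / len(answers_by_prompt)

    # With the WSU numbers this would come out to roughly 0.73:
    # about 1 in 4 prompts produced at least one conflicting answer.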

Fluency vs. Intelligence: The Core Problem AI Hasn't Solved

Professor Cicek's analysis cuts to the heart of a fundamental limitation in today's large language models: linguistic fluency is not the same as conceptual intelligence.

ChatGPT and models like it are trained on vast datasets of human-generated text. They learn to predict what the next word in a sequence *should* be, optimizing for responses that read as coherent, authoritative, and contextually appropriate. This training produces remarkable fluency. It does not, however, teach the model to reason from first principles.

When ChatGPT encounters a complex hypothesis — say, a claim about the causal relationship between two biochemical processes — it doesn't "think through" the logic. It pattern-matches to the statistical regularities in its training data and generates the response most likely to satisfy the prompt. When the correct answer is a common, widely-documented fact, this approach works well. When the correct answer requires nuanced multi-step reasoning, or when the question is designed to probe subtle distinctions that might not have clean precedent in the training data, the statistical approach breaks down.

This distinction matters enormously for anyone relying on AI tools for substantive work. The fluency of the output creates a false confidence signal — the answer sounds authoritative even when it's wrong. Unlike a confident but mistaken human expert, ChatGPT provides no tone, uncertainty, or hedging that might alert the reader to question the response.

What This Means for the AGI Timeline

The WSU study also touches on the broader question of artificial general intelligence — the theoretical point at which AI systems can reason, learn, and generalize across domains the way humans do.

Professor Cicek's assessment is direct: based on these results, AGI is "farther off than some are predicting."

This is a significant statement to make in 2026, when major AI labs — OpenAI, Anthropic, Google DeepMind — are actively competing to claim proximity to AGI-level capabilities. Many executives, investors, and technologists have accepted a narrative in which AGI is a near-term inevitability, perhaps 3-5 years away.

The WSU study suggests the gap between current AI capabilities and genuine general reasoning is wider than the hype implies. A system that correctly identifies false scientific hypotheses only 16.4% of the time, and gives different answers to the same question one in four times, does not demonstrate the kind of stable, reliable reasoning we associate with general intelligence — human or otherwise.

This doesn't mean current AI tools aren't extraordinarily useful. It means the limitations need to be clearly understood and accounted for, especially in high-stakes applications.

Where AI Tools Actually Struggle Most

The WSU findings align with a growing body of research documenting specific domains where today's AI models consistently underperform:

  • Complex multi-step reasoning — Tasks requiring chains of inference where each step depends on the last
  • Detecting falsehoods — Models are significantly better at confirming true statements than identifying incorrect ones
  • Consistency — Identical prompts can produce meaningfully different outputs across sessions
  • Domain-specific nuance — Even in specialized domains, models may produce plausible-sounding but factually incorrect claims
  • Causal reasoning — Understanding *why* something is true, not just *that* it is true, remains a persistent weakness

A separate WSU-led study on ChatGPT's performance in financial contexts found similar patterns: the model performed well on multiple-choice licensing exam questions but "faltered when asked to explain its reasoning on more advanced topics" — a finding consistent with the hypothesis-testing study. The more a task requires genuine understanding rather than pattern recognition, the more AI performance degrades.

For professionals using AI in medical, legal, financial, or scientific contexts, these findings represent a direct warning: AI-generated outputs need verification, not trust. The model's persuasive tone is not a signal of correctness.

How to Work With AI Given Its Accuracy Limits

Understanding ChatGPT's accuracy limitations isn't an argument against using AI tools — it's an argument for using them *strategically*.

Here's what the evidence suggests about how to get the most out of current AI while managing its failure modes:

Treat AI output as a first draft, not a final answer. For tasks that require factual accuracy — research, analysis, technical documentation — AI-generated content should be treated as a starting point that requires human verification, not a finished product.

Verify claims in high-stakes domains. When AI is used in medical, legal, financial, or scientific contexts, any specific claims or conclusions should be independently verified against primary sources. The WSU study's false-hypothesis accuracy of 16.4% means AI is essentially unreliable as a fact-checker in these domains without human oversight.

Use structured prompting to expose uncertainty. Rather than accepting binary yes/no answers at face value, prompting AI to "explain your reasoning step by step" or "identify any assumptions you're making" can surface logical gaps that would otherwise be hidden by confident-sounding output.
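
As one illustration (this template is not from the study; it's a generic example of the approach), a verification-oriented prompt might look like:

    Claim: [paste the claim here]

    Before giving a final verdict:
    1. Restate the claim in your own words.
    2. List any assumptions you are making.
    3. Reason through the claim step by step.
    4. Only then answer "true" or "false," and say how confident you are and why.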

Leverage AI where consistency matters less. For creative tasks, brainstorming, summarization of content you've already verified, and drafting first-pass text, the variability problem is far less consequential. The 27% answer inconsistency rate matters enormously for scientific questions; it matters very little for generating ideas.

The FireStart Applied AI Program is built on exactly this philosophy — teaching participants not just how to use AI tools, but how to deploy them with clear-eyed awareness of their capabilities and limitations. Understanding *where* AI is reliable and *where* it isn't is one of the most valuable skills any practitioner can develop in 2026. The WSU study is a timely reminder that that understanding has to be grounded in evidence, not hype.

Want to learn more about AI?

Join FireStart for free — access Guides, try Ember AI, and start learning today.