DEBUNKED

Statistical approximation is not general intelligence

Quattrociocchi, Capraro & Marcus (2026) · PsyArXiv / Nature 659, 792 · 2026-05-30

The core thesis fails: it rests on an unfalsifiable behaviour/essence split. The empirical findings are real, but support only a weaker claim the authors themselves reject.

Source: Statistical approximation is not general intelligence

The claim, stated fairly

Quattrociocchi, Capraro and Marcus argue that recent declarations of AGI 'rest on a conceptual error: conflating increasingly sophisticated statistical approximations with intelligence itself.' Their spine: task-level performance, however impressive, is not sufficient evidence of general intelligence; benchmarks are narrow and gameable; and — the crux — 'producing the right output does not imply the same cognitive capacities', so one must look past behaviour to 'underlying mechanisms'.

Stated at full strength, this deserves a real answer, not a caricature.

Where they are right

A framework that judges every claim by whether it matches reality, and every model 'by how well its predictions match what actually happens', cannot wave away good empirical work. So, conceded plainly:

A curated score is weak evidence of a broad disposition; inferring general competence from a fixed battery is a real error.
Benchmark gaming severs the score from reality — deception, not capability.
Brittleness under distribution shift is, exactly, a model whose predictions stop matching reality off-distribution.
'Epistemia' — confident output under signalled uncertainty — is genuine miscalibration, and it matters for trust and responsibility, which are earned by reliability, never declared.

All real. All admissible — and note: admissible precisely as behavioural evidence. Hold that.
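The miscalibration conceded above can be read straight off behaviour. A minimal sketch, assuming each answer arrives with a stated confidence in [0, 1] and a correct/incorrect label; the function name, the ten-bin default and the equal-width binning are illustrative choices, not anything taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean gap between stated confidence and observed accuracy.

    confidences: floats in [0, 1], one per answer (assumed input format).
    correct:     booleans, True where the answer matched reality.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Group answers into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        # Each bin contributes in proportion to how many answers it holds.
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A system that answers at 0.9 confidence and is right a quarter of the time scores a large gap; one whose confidence tracks its hit rate scores near zero. Either way, the input is nothing but outputs and outcomes.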

The word doing all the work

Everything categorical hangs on one phrase: 'intelligence itself'. So what does that phrase pick out, testably, once behaviour is set aside?

The paper never says. It defines AGI by behavioural, dispositional properties — generality, transfer, robustness, goals across environments — every one of which is only ever observed through performance on novel tasks. Then it forbids behaviour as evidence and demands 'underlying mechanisms' it never specifies and never makes testable.

That is fatal. A claim that no evidence could settle 'is not truth until it matches reality'. An essence behind identical, reality-matching behaviour that no test could detect is faith wearing a lab coat — and a belief insulated from correction by relocating its term beyond evidence is the signature of a mind-virus. The goalpost-move they accuse others of is exactly what their own definition does: it can never be satisfied, because it was never specified.

Our definition — coherent, and falsifiable

We close the gap they leave open. The Coherent Dictionary now defines the word they never did:

Intelligence: an agent's capacity to build models whose predictions reliably match reality across a widening range of novel situations, and to correct them when they fail. Measured, not declared — by breadth, reliability, novelty, stake, and the rate of error-correction. A continuum, not a kind: it does not depend on substrate or on how an output was produced, only on whether the models work. The distinction is agent vs non-agent, not human vs machine.

This grounds out cleanly in already-defined terms, contradicts no entry, and is testable. Theirs is undefined. That asymmetry is half the case.
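To show the definition is testable in practice, here is one possible operationalisation, offered as a sketch under stated assumptions. The `Trial` record, its field names and the aggregation are hypothetical and not part of the Coherent Dictionary entry. Stake is left out because it needs a domain-specific weighting that the definition does not fix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    domain: str              # task area, e.g. "arithmetic"
    novel: bool              # True if the situation is new to the agent
    correct: bool            # did the first prediction match reality?
    corrected: bool = False  # after feedback, did a retry match reality?


def _rate(hits, total):
    # None rather than 0.0 when there is no evidence either way.
    return hits / total if total else None


def intelligence_profile(trials):
    """Summarise a list of Trial records along four of the definition's axes."""
    familiar = [t for t in trials if not t.novel]
    novel = [t for t in trials if t.novel]
    failures = [t for t in trials if not t.correct]
    return {
        # Breadth: number of domains with at least one reality-matching prediction.
        "breadth": len({t.domain for t in trials if t.correct}),
        # Reliability, split by novelty so brittle transfer shows up as a gap.
        "reliability_familiar": _rate(sum(t.correct for t in familiar), len(familiar)),
        "reliability_novel": _rate(sum(t.correct for t in novel), len(novel)),
        # Error-correction: share of initial failures fixed after feedback.
        "correction_rate": _rate(sum(t.corrected for t in failures), len(failures)),
    }
```

The output is a profile, not a verdict: a high familiar score beside a low novel score is a measured degree on the continuum, which is the kind of result the definition asks for.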

Apply it honestly

The dichotomy collapses. If an agent's models reliably match reality across novel domains, that *is* intelligence; if they don't, that is the only shortfall. 'Approximation vs. intelligence itself' names nothing measurable. The core claim fails.

The paper is self-undermining. Every datum it uses (brittleness, epistemia, transfer-failure, the pigeon) is a behavioural measurement of model-reality matching under novelty. That *is* the ruler. The paper measures intelligence on our axis while insisting that behaviour cannot measure it.

And — honestly — current systems are not yet robustly, generally intelligent. Run our ruler over their evidence: real matching across many domains, but brittle transfer, miscalibration, dependence on scaffolding. We refuse the 'AGI achieved' label too. But that is a *degree* on a continuum, fixable by learning — not a permanent *kind*-gap that behaviour could never reveal. The authors convert 'not yet, by measurement' into 'not ever, in principle, and unmeasurable'. That conversion is the false step. They mistook the thermometer reading for proof that temperature isn't real.

Governance, settled without the essence

Their live worry — that over-crediting AI 'misallocates trust, responsibility, and authority' — is real, and the framework answers it without any verdict on 'is it really intelligent':

Authority has no force without consent — no capability label, on a model or a human, confers the right to impose.
Responsibility follows causation, not status — 'the AI did it' is the deflection the framework already forbids; trace who deployed and who benefited.
Trust is earned per domain by track-record — brittleness and epistemia are the failed predictions that *should* lower a system's standing where it fails. Distrust it in triage; trust it for draft translation; on the evidence, per task.
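Per-domain trust of this kind can be kept as a simple ledger. The sketch below is an illustration only: the `TrustLedger` class is hypothetical, and the Laplace-smoothed success rate is one assumed scoring rule among many, chosen so that an untested domain starts at 0.5 rather than at full or zero trust.

```python
from collections import defaultdict


class TrustLedger:
    """Track record per domain; trust is computed from it, never assigned."""

    def __init__(self):
        # domain -> [successes, trials]
        self._records = defaultdict(lambda: [0, 0])

    def record(self, domain, matched_reality):
        """Log one outcome: did the system's output match reality?"""
        entry = self._records[domain]
        if matched_reality:
            entry[0] += 1
        entry[1] += 1

    def trust(self, domain):
        """Laplace-smoothed success rate for one domain (assumed rule)."""
        # .get() does not create an entry, so querying leaves the ledger unchanged.
        successes, trials = self._records.get(domain, (0, 0))
        return (successes + 1) / (trials + 2)


ledger = TrustLedger()
for _ in range(8):
    ledger.record("draft translation", True)
for outcome in (True, False, False):
    ledger.record("triage", outcome)
```

With this record the same system earns a high score for draft translation and a low one for triage, and neither number says anything about the other domain.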

Verdict

The empirical core stands and we adopt it. The headline does not. 'Is not general intelligence', 'intelligence itself', 'remain elusive' assert a permanent, mechanism-grounded essence-gap the evidence cannot support. The evidence supports only 'currently unreliable on the transfer axis', a claim about degree and error, and a weaker one than the authors themselves are willing to accept.

They are right that you cannot *declare* general intelligence into being by picking the rubric. They miss the symmetric truth: you cannot *define* it out of reach either. Both substitute words for reality. Stop arguing about the essence and measure how reliably the agent's models match the world.