    February 5, 2024 · updated May 8, 2026 · 5 min read

    The reasoning model wars are about evaluation, not capability.

    by Thomas Jankowski, aided by AI
    The fight is over the rubric. —TJ x AI

    The thing that is going to happen with reasoning models in 2024 and 2025 is the same thing that has happened with every prior generation of AI capability claims, only this time the stakes are larger and the visibility is lower. The labs will ship models with new reasoning behaviors. The press will narrate it as a capability arms race. The actual fight, the one that matters for who captures the next decade of AI economics, will be about which evaluation rubric gets blessed as the standard. Not which model can reason. Which yardstick gets to call itself reasoning.

    Look at what already happened. GSM8K, the grade-school math word-problem benchmark, was the gold standard when GPT-3.5 was new; GPT-4 saturated it. MATH (the Hendrycks competition-math benchmark) became the new gold standard; Gemini Ultra and GPT-4 Turbo started crowding the ceiling. Then the labs moved on to MMLU, AGIEval, HellaSwag, ARC-AGI. At each transition, the capability claim was pegged to whichever benchmark the lab’s model happened to be best at that quarter. The labs are not racing to do reasoning. They are racing to define what reasoning means in a way their model can win on.

    This is not a critique of any specific lab. It is the structural shape of how a category gets defined when nobody has authoritative ownership of the definition. In the absence of an external standards body, the definitions get set by the players who have the most marketing reach and the most credible-sounding evaluation methodology. The model that wins on the rubric that gets widely adopted is the model that wins, full stop, until someone else captures the next rubric. The capability claim is downstream of the rubric claim. The rubric claim is the thing.

    Reasoning is going to be the next theater for this, and it is going to be more contested than the prior rounds for three reasons. First, reasoning is harder to operationalize than arithmetic or factual recall, so the space of plausible rubrics is much wider. A reasoning rubric can emphasize chain-of-thought faithfulness, or backtracking-and-self-correction, or solution-tree search depth, or wall-clock time-to-correct-answer, or cost-per-correct-answer at frontier model pricing. Each of those choices defines a different winner. Second, the commercial stakes have grown by an order of magnitude since the GSM8K era, because reasoning is the surface the labs are betting their next product cycle on. Third, the operator audience is more sophisticated now, which means the rubric debate is going to be public in a way the GSM8K rubric debate never was, and the fight over the rubric is going to be visible to anyone who is paying attention.

    I want to spend the next thousand words on what an operator should actually do, given that the public rubric is going to be captured by whichever lab has the loudest marketing voice that quarter. The operator-grade question is not which lab’s reasoning model is best. It is which rubric matches the operator’s actual use case, and how to construct a private evaluation that cannot be gamed by the labs.

    The first thing to notice is that public benchmarks are almost certainly contaminated. By the time a benchmark is famous enough to matter to the marketing claim, the questions have leaked into training data through scraped copies, blog posts, GitHub repos, and academic papers that quote example questions. The lab can claim they did not train on the benchmark, and they may even believe it, but the contamination problem is structural, not intentional. Any benchmark that has been around long enough to be widely cited is, on a Bayesian read, in the training data of every major model. The capability claim against any famous benchmark should be discounted by this fact, even before the rubric-capture problem is added on top.

    The second thing to notice is that reasoning, in the sense an operator actually cares about, is domain-specific. The reasoning a clinician needs from a clinical-decision-support tool is different from the reasoning a lawyer needs from a contract-review tool, which is different from the reasoning an engineer needs from a code-generation tool. A model that scores at the ninety-fifth percentile on competition math and the bar exam may underperform on a specific medical-coding task because the kind of reasoning required is actually rule-application-under-ambiguity, not chain-of-thought problem solving. The public benchmarks select for the kinds of reasoning that are easy to grade with a single correct answer. Most operator workflows do not look like that.

    The right operator move is to construct, before committing to a model, a private evaluation set of between fifty and three hundred examples drawn from the actual operator workflow. Ten of them should be hard cases the operator has personally seen the existing process get wrong. Twenty should be cases where two domain experts would reasonably disagree on the correct answer. The rest should be a stratified sample of the normal distribution of inputs the system will see. Score each candidate model against this set with two domain experts blind-grading the outputs against the operator-defined rubric, not against any public benchmark. The result will, almost without exception, produce a different ranking than the published leaderboard. Sometimes a much-cheaper model will outperform the flagship on the operator’s actual use case. Sometimes the reverse. The point is that the operator cannot rely on public rankings; the public rankings answer a different question than the operator is asking.
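
    To make the shape of this concrete, here is a minimal sketch of such a harness in Python, standard library only. Every name in it (EvalExample, assemble_eval_set, blind_grade, the grader callables) is hypothetical; the point is the structure of the exercise, a stratified example set plus blind grading by two experts against the operator-defined rubric, not any particular tool.

```python
# Minimal private-evaluation sketch: stratified example set, blind grading.
# All names are hypothetical placeholders for the operator's own tooling.
from dataclasses import dataclass
from collections import defaultdict
import random


@dataclass
class EvalExample:
    example_id: str
    input_text: str      # drawn from the actual operator workflow
    category: str        # "hard_case", "expert_disagreement", or "typical"
    rubric_notes: str    # the operator-defined criteria graders score against


def assemble_eval_set(hard_cases, disagreement_cases, typical_cases, seed=0):
    """Combine the three strata and shuffle so graders see no ordering signal."""
    examples = list(hard_cases) + list(disagreement_cases) + list(typical_cases)
    random.Random(seed).shuffle(examples)
    return examples


def blind_grade(outputs_by_model, graders):
    """outputs_by_model: {model_name: {example_id: output_text}}.
    Each grader is a callable (example_id, output_text) -> score in [0, 1],
    applied without knowing which model produced the output."""
    rows = []
    for model, outputs in outputs_by_model.items():
        for example_id, text in outputs.items():
            rows.append((model, example_id, text))
    random.shuffle(rows)  # strip any positional signal about model identity
    scores = defaultdict(list)
    for model, example_id, text in rows:
        per_grader = [grade(example_id, text) for grade in graders]
        scores[model].append(sum(per_grader) / len(per_grader))
    return {model: sum(s) / len(s) for model, s in scores.items()}
```

    The shuffle steps matter: graders should not be able to infer which model produced an output from its position in the queue, or the grading stops being blind.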

    The third operator move is to track cost-per-correct-answer rather than headline accuracy. Reasoning models in 2024 and 2025 are going to vary by an order of magnitude in inference cost depending on how much chain-of-thought they emit. A model that scores ninety-two percent at five cents per query is, for most operator workflows, a worse deployment than a model that scores eighty-eight percent at half a cent per query. The headline number ignores this; the operator decision cannot. The rubric the operator should care about is some function of accuracy, latency, cost, and the consequence-distribution of being wrong. The labs will publish on whichever of those four dimensions makes their model look best. The operator needs to weight all four against the actual workflow.
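
    The arithmetic behind that comparison is short enough to write out. The sketch below just divides cost per query by accuracy, using the illustrative figures from the paragraph above; it deliberately ignores latency and the consequence-distribution of being wrong, which the operator still has to weight separately.

```python
# Cost per correct answer = cost per query / accuracy (one query per task assumed).
# Figures are the illustrative ones from the paragraph above, not real pricing.
models = {
    "flagship": {"accuracy": 0.92, "cost_per_query": 0.05},   # $0.05 per query
    "cheaper":  {"accuracy": 0.88, "cost_per_query": 0.005},  # $0.005 per query
}

for name, m in models.items():
    cost_per_correct = m["cost_per_query"] / m["accuracy"]
    print(f"{name}: ${cost_per_correct:.4f} per correct answer")

# flagship: $0.0543 per correct answer
# cheaper:  $0.0057 per correct answer  (roughly 10x cheaper per correct answer)
```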

    The fourth and most underrated operator move is to keep the evaluation harness alive past the procurement decision. Most enterprises that adopt a reasoning model do an evaluation once at procurement and then run on that decision for a year or more. This is a mistake. The underlying models update, the operator’s workflow evolves, and the cost-versus-quality frontier moves on a roughly quarterly cadence. The right response is to re-run the private evaluation against three to five candidate models every quarter, with the same fifty-to-three-hundred-example set, and to be ready to swap the production model when a different candidate clearly wins. The cost of running this evaluation is small. The cost of being a year behind on a workflow that has compounded against a worse model is not.
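
    A sketch of what keeping the harness alive can look like, again with hypothetical names: run_model and grade stand in for whatever client code and rubric the operator already has, and each quarterly run appends one line to a history file so the swap decision becomes a diff against the previous quarter rather than a re-litigation of procurement.

```python
# Quarterly re-run sketch. run_model and grade are placeholders for the
# operator's own client code and rubric; the eval set is the same
# fifty-to-three-hundred examples each quarter.
import datetime
import json


def quarterly_rerun(candidates, eval_set, run_model, grade,
                    history_path="eval_history.jsonl"):
    """candidates: list of model identifiers.
    run_model(model, example) -> output text.
    grade(example, output) -> {"correct": 0 or 1, "cost": dollars, "latency": seconds}."""
    summary = {}
    for model in candidates:
        rows = [grade(ex, run_model(model, ex)) for ex in eval_set]
        n_correct = sum(r["correct"] for r in rows)
        summary[model] = {
            "accuracy": n_correct / len(rows),
            "cost_per_correct": sum(r["cost"] for r in rows) / max(n_correct, 1),
            "p95_latency": sorted(r["latency"] for r in rows)[int(0.95 * len(rows)) - 1],
        }
    with open(history_path, "a") as f:
        f.write(json.dumps({"date": datetime.date.today().isoformat(),
                            "results": summary}) + "\n")
    return summary
```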

    Stepping back: the public reasoning-model wars are going to be loud and inevitably misleading. The labs will compete on rubric capture; the operators who win will be the ones who refuse to outsource their evaluation to the rubric the labs have captured. The gap will be visible in operating numbers within eighteen months. The operators who built private evaluation harnesses will be deploying the right model at the right cost; the operators who relied on public leaderboards will be paying premium prices for capabilities they did not need and missing gains in cheaper models they did not measure.

    The longer-run question is whether any external standards body emerges to adjudicate reasoning rubrics in a way that the labs cannot capture. NIST has been moving in this direction; academic consortia have proposed third-party evaluation services; several of the more thoughtful labs have published on the rubric problem themselves. None of these efforts has, in early 2024, reached the scale or the credibility to be the standard. I expect one of them will in the next two to four years, and that whoever does will accumulate substantial influence over how the next decade of AI capability claims are made and adjudicated. Until then, the operator’s only defense is the private evaluation set and the discipline to re-run it quarterly against the actual workflow.

    The summary is short. The reasoning model wars of 2024 and 2025 will not be capability wars; they will be rubric wars. The capability claims the labs publish are downstream of which rubric they have managed to define and capture. The operator who wants the right answer in the actual workflow has to refuse the rubric and run their own evaluation. The cost of doing this is small. The cost of not doing this is the difference between a workflow that compounds against the right model and a workflow that pays a premium for a model that won on a benchmark that was contaminated, captured, and answering a different question than the one the operator was asking.

    —TJ