    March 24, 2024 · updated May 8, 2026 · 4 min read

    Stop benchmarking on benchmarks the lab made.

    by Thomas Jankowski, aided by AI

    When the model lab also made the benchmark, the benchmark is marketing.

    The pattern is mechanical. A frontier lab releases a model. The release blog post includes a new evaluation suite the lab built, often given a serious-sounding name and a one-paragraph methodology that does not survive close reading. The lab ranks its model first on the new suite by a comfortable margin. The press reports the ranking as evidence of capability. The competing labs run the same suite later, post lower scores, and the discourse treats the gap as real. By the time anyone realizes the suite was constructed in a way that advantaged its constructor, the news cycle has moved on, and the constructor has shipped the next model with the next bespoke benchmark.

    The pattern works because the audience has not yet internalized the conflict of interest. The frontier lab has a strong incentive to publish a benchmark on which it ranks first. It has the engineering staff to construct the benchmark. It has privileged access to its own model's behavior during training, which lets it shape the benchmark distribution to favor capabilities it knows the model has. The competing labs have no such advantage, because they were not in the room when the benchmark was being designed. The result is that the published number is not a measurement; it is a selection.

    The defense the labs typically offer is that the benchmark is open-source, the methodology is published, and the competition is welcome to run it. The defense is technically correct and substantively misleading. An open-source benchmark constructed by one lab is still constructed by that lab. The space of possible benchmarks is enormous, and the specific benchmark that gets shipped is the one that produced the favorable ranking. Other constructions, equally defensible, would produce different rankings. The lab does not publish those.

    Three categories of 2024-2025 benchmark fall into this trap.

    Category one: agentic-task evaluations

    The first category is the agentic-task evaluation suite. A frontier lab releases a model with new tool-use and multi-step reasoning capabilities, alongside a benchmark that tests exactly those capabilities on a curated set of tasks. The tasks are chosen post-hoc, after the model's behavior on candidate tasks has been observed. Tasks the model handles well make it into the suite. Tasks the model handles poorly do not, on the methodological argument that they were "not representative of real-world agent use cases" or "subject to evaluation noise."

    The published score is high. The methodology paragraph is plausible. The tasks the model fails on are not in the suite, and there is no public artifact establishing what fraction of candidate tasks were considered and rejected.
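
    A minimal sketch of how much post-hoc selection can move a number, with entirely hypothetical pass rates: filter a candidate pool down to the tasks the constructor's model passes, and the constructor's published score becomes perfect by construction while the competitor's barely moves.

```python
import random

random.seed(0)

# Hypothetical: the constructor's model truly passes ~60% of candidate tasks,
# a competitor truly passes ~55%. Neither number reaches the published suite.
TRUE_PASS = {"constructor": 0.60, "competitor": 0.55}
N_CANDIDATES = 1000

# Simulate pass/fail per candidate task for each model (independent draws).
results = {
    name: [random.random() < p for _ in range(N_CANDIDATES)]
    for name, p in TRUE_PASS.items()
}

# Post-hoc selection: keep only the tasks the constructor's model passed.
kept = [i for i in range(N_CANDIDATES) if results["constructor"][i]]

for name in TRUE_PASS:
    full = sum(results[name]) / N_CANDIDATES
    published = sum(results[name][i] for i in kept) / len(kept)
    print(f"{name}: full-pool pass rate {full:.2f}, published-suite score {published:.2f}")

# Typical output: the constructor scores 1.00 on its own suite by construction,
# the competitor stays near its true rate. The gap is manufactured by the
# filter, not by capability.
```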

    The operator-level read of an agentic-task evaluation suite, when the lab that built the suite ranks first, is to treat the score as an upper bound rather than a measurement, and to construct a private evaluation against the operator's own task distribution before making any procurement decision.

    Category two: reasoning-trace rubrics

    The second category is the reasoning-trace rubric. The lab publishes a benchmark that scores not only the final answer but the quality of the chain-of-thought leading to the answer. The rubric is described as a step toward measuring "how the model thinks." The judging is done by another model from the same lab, fine-tuned on examples of "good reasoning."

    The setup is closed-loop. The model whose reasoning is being judged was trained on data that overlapped, often heavily, with the data the judging model was trained on. The judging model has learned to recognize the reasoning style of its sibling model as canonical. Other labs' reasoning styles are scored as off-distribution, which the rubric labels as "lower-quality reasoning" without further interrogation.

    The published score for the constructor lab is high. The competing labs score lower. The gap is presented as a measurement of reasoning quality. It is not. It is a measurement of stylistic similarity to a fine-tuning corpus that the constructor controls.

    The operator-level read is to discount reasoning-trace scores entirely when the judge model is from the same lab as the model being scored, and to require an independent judge for any rubric purporting to measure reasoning quality.
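
    A sketch of that filter in practice, with hypothetical model names and scores: keep only rubric results where the judge and the judged model come from different labs, and discard the rest before comparing anything.

```python
from dataclasses import dataclass

@dataclass
class RubricScore:
    judged_model: str     # model whose reasoning trace was scored
    judged_provider: str  # lab that trained the judged model
    judge_model: str      # model used as the rubric judge
    judge_provider: str   # lab that trained the judge
    score: float          # rubric score in [0, 1]

def independent_scores(scores: list[RubricScore]) -> list[RubricScore]:
    """Keep only scores where judge and judged model come from different labs.

    A same-lab judge has been tuned toward its sibling's reasoning style, so
    its score tracks stylistic similarity rather than reasoning quality.
    """
    return [s for s in scores if s.judge_provider != s.judged_provider]

# Hypothetical records; every name and number here is illustrative only.
scores = [
    RubricScore("lab-a-frontier", "lab-a", "lab-a-judge", "lab-a", 0.91),
    RubricScore("lab-b-frontier", "lab-b", "lab-a-judge", "lab-a", 0.74),
    RubricScore("lab-a-frontier", "lab-a", "neutral-judge", "eval-org", 0.78),
    RubricScore("lab-b-frontier", "lab-b", "neutral-judge", "eval-org", 0.77),
]

for s in independent_scores(scores):
    print(f"{s.judged_model} judged by {s.judge_model}: {s.score:.2f}")
```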

    Category three: long-context retrieval

    The third category is the long-context retrieval evaluation. The model is given a million-token document and asked to retrieve specific facts from it. The benchmark tests retrieval at varying depths, with varying needle-haystack ratios, and reports a high accuracy number. The accompanying paper presents the result as evidence of robust long-context capability.

    The construction of the haystack is the tell. The synthetic documents in the benchmark are produced by a procedure the lab designed. The needles are drawn from a distribution that procedure favors. The model was trained, in part, on synthetic data with similar structural characteristics. The model is therefore being evaluated on a distribution close to its training distribution, which is why it scores high. On real-world documents whose structural fingerprint the model has not seen, retrieval performs noticeably worse, and the gap is documented in the operator literature but not in the lab's own benchmark releases.

    The operator-level read is to assume that long-context-retrieval claims built on synthetic haystacks generalize poorly to real documents, and to evaluate the model on the operator's own document corpus before relying on the published number.
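
    A sketch of what an own-corpus check could look like, assuming a generic `ask_model(context, question)` client that stands in for whatever API the operator actually uses: plant a known fact at several depths inside real documents and measure how often the model retrieves it.

```python
def build_probe(document: str, needle: str, depth: float) -> str:
    """Insert a known fact ('needle') into a real document at a relative depth.

    depth=0.0 places the needle near the start, depth=1.0 near the end.
    Splitting on whitespace keeps the insertion crude but tokenizer-free.
    """
    words = document.split()
    pos = int(len(words) * depth)
    return " ".join(words[:pos] + [needle] + words[pos:])

def retrieval_rate(documents, ask_model, needle, question, expected):
    """Fraction of probes answered correctly across documents and depths.

    `ask_model(context, question) -> str` is a stand-in for the operator's
    own model client; it is an assumption of this sketch, not a real API.
    """
    depths = [0.1, 0.25, 0.5, 0.75, 0.9]
    hits = 0
    for doc in documents:
        for depth in depths:
            answer = ask_model(build_probe(doc, needle, depth), question)
            hits += int(expected.lower() in answer.lower())
    return hits / (len(documents) * len(depths))
```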

    What to substitute

    The substitute for lab-built benchmarks is independent evaluation. The independent suites that have held up under scrutiny since 2023 share a small set of properties. They are constructed by parties without a stake in any specific model's ranking. They are versioned with a chain of custody that documents every change. They publish raw model outputs alongside scores, so that downstream observers can re-score under different rubrics. They include held-out task distributions that are not disclosed in advance, so that no lab can train against them.
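
    Publishing raw outputs is the property that makes independent verification cheap. A sketch of what re-scoring could look like, assuming a hypothetical JSONL format with `model`, `output`, and `reference` fields; the point is that the rubric is swappable once the raw outputs exist.

```python
import json

def rescore(raw_outputs_path: str, rubric) -> dict:
    """Re-score published raw model outputs under a rubric of your choosing.

    Assumes a JSONL file where each line carries the model name, its raw
    output, and the reference answer. The format is illustrative, not the
    schema of any particular suite.
    """
    totals, counts = {}, {}
    with open(raw_outputs_path) as f:
        for line in f:
            record = json.loads(line)
            model = record["model"]
            totals[model] = totals.get(model, 0.0) + rubric(record["output"], record["reference"])
            counts[model] = counts.get(model, 0) + 1
    return {m: totals[m] / counts[m] for m in totals}

# A stricter rubric than the one the suite may have shipped with: exact match.
strict = lambda output, reference: float(output.strip() == reference.strip())
# print(rescore("raw_outputs.jsonl", strict))
```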

    Most of the suites that meet these criteria are run by academic groups or independent benchmark organizations rather than by frontier labs. Most of the suites that get the press coverage are run by frontier labs. The asymmetry is the problem.

    The operator decision in 2025 is to weight the independent suites more heavily than the lab-released ones, and to treat any lab-released benchmark accompanying a model release as a marketing artifact until proven otherwise. The default should be skepticism. The burden of proof should be on the constructor, not on the audience.
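
    One hedged way to operationalize that weighting, with a discount factor that is a policy choice rather than a measured quantity: down-weight any suite built by the same lab as the model being scored.

```python
def weighted_capability_score(results, discount=0.25):
    """Aggregate per-suite scores for one model, down-weighting suites built
    by the model's own lab.

    `results` is a list of (suite_constructor, model_provider, score) tuples;
    the 0.25 discount is an illustrative default, not a measured value.
    """
    total, weight = 0.0, 0.0
    for constructor, provider, score in results:
        w = discount if constructor == provider else 1.0
        total += w * score
        weight += w
    return total / weight if weight else 0.0

# Hypothetical: a model that aces its own lab's suite but is middling on
# independent ones. The aggregate tracks the independent scores.
print(weighted_capability_score([
    ("lab-a", "lab-a", 0.95),        # lab-built suite, discounted
    ("eval-org", "lab-a", 0.71),     # independent suite
    ("university-x", "lab-a", 0.68), # independent suite
]))
```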

    The press will continue to cover the lab-released suites because the lab-released suites come with the lab's communications budget attached. The operators reading the press should learn to discount accordingly.

    —TJ