SINCE MAY 2026

Aggregating from major public AI benchmarks

ONE SCORE.
INFINITE CLARITY.

AGI Ranker measures how close each frontier AI is to AGI. One transparent score (0-100) per model, distilled from 14 public benchmarks. Score 100 marks the AGI threshold.

Independent verification preferred over lab self-reports. Scores we can't verify are flagged, not invented. Every correction is logged publicly.

TOP MODEL

FRONTIER MODELS

AGI Score ≥ 65

BENCHMARKS AGGREGATED

Live benchmarks (TBA excluded)

Distance to AGI

pts

Top model's gap to AGI Score = 100

Top Model Breakdown

Thinking

44% weight

Fluid Reasoning + World Knowledge

Doing

43% weight

Agency (tools, planning, coding) + Multimodal

Communicating

13% weight

Language Production + Multimodal output

REAL-TIME RANKINGS

The AGI Ranker Leaderboard

100 = AGI

#	Model	AGI Score	Arena Elo	Key Strength	Coverage	Price $/M tok	Details

Scores normalized to best-human performance (100 = AGI genesis; above 100 = super-human / ASI direction). See full methodology →

CUSTOMIZE THE FORMULA

Interactive Model Explorer

Adjust domain weights and instantly see how the AGI Score changes. Transparency at its core.

Total: 100% - normalized to 100% before scoring

SELECT MODELS TO COMPARE (max 4)

Domain Proficiency Radar

TRANSPARENT METHODOLOGY

How the AGI Score is Calculated

A transparent, reproducible composite index designed for maximum signal and minimum noise.

Data Ingestion & Curation

We aggregate scores from major public AI benchmarks: GPQA Diamond, HLE, ARC-AGI-2, SWE-bench Verified/Pro, LiveBench, MMMU-Pro, Tau-bench, OSWorld, BrowseComp, Aider Polyglot, AIME 2025, and more. Each cell is cited so any score can be traced back to its origin. We do not run benchmarks ourselves - we aggregate what others have already published. Source quality is tiered: independent benchmark leaderboards (Tier 1) count fully at 1.00×; third-party evaluators with provider involvement (Tier 2) at 0.85×; self-reports from labs with verified track records (Tier 3) at 0.75×; non-verified labs and commentary content (Tier 4) are excluded from scoring entirely.
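As a rough illustration of the source tiering described above, the sketch below maps each tier to its stated multiplier and drops Tier 4 cells. The names `TIER_MULTIPLIER` and `tier_multiplier`, and the sample cells, are hypothetical and not taken from the site's code or dataset.

```python
from typing import Optional

# Multipliers as stated in the text; Tier 4 has no entry because it is excluded.
TIER_MULTIPLIER = {1: 1.00, 2: 0.85, 3: 0.75}


def tier_multiplier(tier: int) -> Optional[float]:
    """Return the weight multiplier for a source tier, or None when the tier is excluded."""
    return TIER_MULTIPLIER.get(tier)


# Hypothetical cells, for illustration only.
cells = [
    {"benchmark": "GPQA Diamond", "tier": 1},
    {"benchmark": "SWE-bench Verified", "tier": 3},
    {"benchmark": "HLE", "tier": 4},
]

# Keep only cells whose tier is allowed into scoring.
scored = [
    (cell["benchmark"], tier_multiplier(cell["tier"]))
    for cell in cells
    if tier_multiplier(cell["tier"]) is not None
]
print(scored)
```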

Note on the AA Intelligence Index (removed June 2026): We previously carried Artificial Analysis's composite Intelligence Index as a Language signal. We have removed it from the AGI Score. Their v4.1 revision turned it into an explicitly agentic composite (GDPval-AA, Terminal-Bench 2.1, τ³-Banking, SciCode, HLE, GPQA and more) that re-bundles benchmarks we already score directly - keeping it would double-count those signals across components and mislabel agency as language. Language now rests on LiveBench; a dedicated language/writing benchmark is on our roadmap to restore a second independent source. We continue to track the AA Index as an external reference.

Normalization to Best-Human

Each raw benchmark score is rescaled so 100 = best-human performance on that task. AGI Score = 100 thus marks the genesis of AGI per our working definition: an AI that surpasses the best human on every purely brain-based intellectual task. Embodiment, sensory acquisition, and lived experience are explicitly out of scope - those are separate problems. Scores climb past 100 as the AI grows from genesis into super-human (ASI-direction) territory; we preserve those scores rather than clamping, so the index stays informative in an ASI / post-AGI world.
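A minimal sketch of the rescaling, assuming a simple linear ratio against the best-human reference (the text states only that 100 equals best-human performance, so the exact mapping is an assumption). The function name and the sample numbers are hypothetical.

```python
def normalize_to_best_human(raw: float, best_human: float) -> float:
    """Rescale a raw score so that best-human performance maps to 100.

    Assumes a linear ratio. Values above 100 are kept rather than clamped,
    matching the stated policy of preserving super-human scores.
    """
    if best_human <= 0:
        raise ValueError("best_human must be positive")
    return 100.0 * raw / best_human


# Hypothetical benchmark where the best human scores 90 (in raw points).
print(normalize_to_best_human(45, 90))  # half of best-human
print(normalize_to_best_human(99, 90))  # above best-human, not clamped
```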

5-Component Aggregation

Benchmarks roll up into 5 cognitive components AGI must master: Agency (35%), Fluid Reasoning (29%), World Knowledge (15%), Multimodal Perception (11%), Language Production (10%). Within each component, the score is the weighted mean of all the model's contributing benchmarks; each benchmark's effective weight is sub_weight × source_tier_multiplier. Sub-weights are fixed per benchmark and reflect each benchmark's relative importance within a component (e.g., ARC-AGI-2 carries 25% of Reasoning, GPQA carries 20%). The saturation rule kicks in dynamically: if the IQR of normalized scores among the top-10 ranked models on a benchmark drops below 5 percentage points, that benchmark's sub-weight is halved and the freed weight is redistributed to non-saturated benchmarks in the same component - keeping the component score responsive to whichever benchmarks still discriminate frontier models. Source tiering applies a multiplier per cell based on source independence (T1 1.00×, T2 0.85×, T3-Verified-Lab 0.75×; T4 excluded from scoring entirely). Component weights are user-adjustable in the Explorer below.
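The component roll-up and the saturation rule can be sketched as follows. This is an illustration under assumptions, not the site's implementation: the IQR uses Python's inclusive quartile method, the freed weight is redistributed in proportion to each non-saturated benchmark's sub-weight, and all function names and inputs are hypothetical.

```python
from statistics import quantiles


def iqr(values):
    """Interquartile range using inclusive quartiles (an assumed convention)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def apply_saturation(sub_weights, top10_scores, threshold_pp=5.0):
    """Halve saturated benchmarks' sub-weights and hand the freed weight
    to non-saturated benchmarks, proportionally to their sub-weights."""
    saturated = {
        b for b, scores in top10_scores.items()
        if b in sub_weights and iqr(scores) < threshold_pp
    }
    live = [b for b in sub_weights if b not in saturated]
    if not saturated or not live:
        return dict(sub_weights)
    freed = sum(sub_weights[b] / 2 for b in saturated)
    live_total = sum(sub_weights[b] for b in live)
    return {
        b: (w / 2 if b in saturated else w + freed * w / live_total)
        for b, w in sub_weights.items()
    }


def component_score(cells, sub_weights):
    """Weighted mean over present benchmarks.

    cells maps benchmark -> (normalized_score, tier_multiplier);
    effective weight = sub_weight * tier_multiplier.
    """
    num = den = 0.0
    for bench, (score, mult) in cells.items():
        weight = sub_weights[bench] * mult
        num += weight * score
        den += weight
    return num / den if den else None
```

For example, with hypothetical sub-weights `{"A": 40, "B": 60}` where only `A` is saturated, `apply_saturation` would return `A` at 20 and `B` at 80, so the component's total weight is unchanged.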

AGI Score = Σ (wᵢ × Cᵢ) / Σ wᵢ

where Cᵢ = component score (shrunk toward median if thin-coverage), wᵢ = component weight
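The formula above can be sketched in a few lines. The default weights are the ones stated in the text; skipping components with no data and renormalizing over the rest is an assumption, and the names `DEFAULT_WEIGHTS` and `agi_score` are hypothetical.

```python
# Component weights as stated in the methodology (they sum to 100).
DEFAULT_WEIGHTS = {
    "Agency": 35,
    "Fluid Reasoning": 29,
    "World Knowledge": 15,
    "Multimodal Perception": 11,
    "Language Production": 10,
}


def agi_score(components, weights=DEFAULT_WEIGHTS):
    """Weighted mean of component scores: sum(w_i * C_i) / sum(w_i).

    components maps component name -> score, or None when the model has no
    data for that component. Missing components are skipped (an assumption).
    """
    present = {name: c for name, c in components.items() if c is not None}
    total_weight = sum(weights[name] for name in present)
    if total_weight == 0:
        raise ValueError("no component scores available")
    return sum(weights[name] * c for name, c in present.items()) / total_weight


# Hypothetical model with only the two core components measured.
print(agi_score({"Agency": 90.0, "Fluid Reasoning": 70.0, "World Knowledge": None}))
```

Because the denominator is the sum of the weights actually used, custom weights do not need to add up to 100 beforehand.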

Sparse-Data Handling (Asymmetric Pull-Down + Renormalize)

Benchmark coverage is uneven. Within each component we compute the weighted mean over present benchmarks (renormalizing the denominator). Then asymmetric pull-down shrinkage: if a model's coverage in a component is below 60% AND the raw score is above the population median, we pull it toward the median proportionally to coverage shortfall. Low scores with thin coverage stay low - no reward for hiding weaknesses. High scores with thin coverage get discounted toward "typical." Models need ≥3 components and ≥5 benchmarks (with Reasoning + Agency required) to be ranked. Cross-component coverage floor (v1.4): a model with 2+ components below their coverage thresholds gets demoted from RANKED to PROVISIONAL - even if its present cells are exceptional. Thresholds: 30% sub-weight for World Knowledge / Multimodal / Language; 40% for the core components (Fluid Reasoning and Agency, the load-bearing AGI dimensions per the locked definition). This prevents selection-bias rankings: a model can only claim near-AGI position if its evidence is reasonably broad across capability dimensions. The saturation rule separately protects against dead benchmarks: any benchmark whose IQR among the top-10 ranked models drops below 5pp gets its sub-weight halved and redistributed to non-saturated benchmarks in the same component.
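A sketch of the asymmetric pull-down and the cross-component coverage floor, under stated assumptions: the text says the pull is proportional to the coverage shortfall but does not give the exact formula, so the shortfall is taken here as `(threshold - coverage) / threshold`. A component with no data is treated as zero coverage, which is also an assumption. All names are hypothetical.

```python
def pull_down(raw, coverage, median, threshold=0.60):
    """Shrink a thin-coverage score toward the population median, downward only.

    Scores at or below the median, or with coverage at or above the
    threshold, are returned unchanged.
    """
    if coverage >= threshold or raw <= median:
        return raw
    shortfall = (threshold - coverage) / threshold  # assumed form of "proportional"
    return raw - shortfall * (raw - median)


# Coverage floors as stated: 40% for core components, 30% for the rest.
COVERAGE_FLOOR = {
    "Fluid Reasoning": 0.40,
    "Agency": 0.40,
    "World Knowledge": 0.30,
    "Multimodal Perception": 0.30,
    "Language Production": 0.30,
}


def thin_components(coverage):
    """List components whose sub-weight coverage is below their floor."""
    return [c for c, floor in COVERAGE_FLOOR.items() if coverage.get(c, 0.0) < floor]


def is_provisional(coverage, max_thin=1):
    """True when more thin components exist than the RANKED tier allows."""
    return len(thin_components(coverage)) > max_thin


print(pull_down(90.0, 0.30, 70.0))  # high score, thin coverage: discounted
print(pull_down(50.0, 0.30, 70.0))  # low score, thin coverage: unchanged
```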

Specialty Rankings & Custom View

Above the leaderboard, six tabs switch the ranking between the canonical AGI Score and task-focused views. Each specialty (Coding, Reasoning, Knowledge, Tool Use) re-ranks models using only the benchmarks that test that capability, with within-set sub-weights renormalized to 100%. Specialty scores apply the same source-tier weighting and asymmetric pull-down shrinkage as the main score - so a high score on thin coverage in a specialty is pulled down toward that specialty's population median, and a low score is never pulled up. 100 in a specialty means matching best-human on its benchmarks; above 100 = super-human on those specific tasks. This is not AGI - AGI requires the full battery, which only the canonical AGI Score measures. The Custom tab (visible when you adjust the Explorer's weight sliders) shows the AGI Score under your settings; the AGI tab always reverts to default weights, keeping the canonical view unambiguous.

Calibration Constants & Operational Discipline

reproducibility addendum

Full reproducibility requires three things beyond the formula: the exact numeric thresholds we apply, the within-set sub-weights used in specialty views, and the discipline we hold against common mis-attribution patterns. All disclosed below.

Current as of v1.4. These are operational values, not permanent invariants. Future methodology versions may revise specific constants as the methodology evolves and as benchmark coverage matures; substantive changes will be noted in the version history.

Numeric thresholds

Constant	Value	Used in
Asymmetric shrinkage coverage threshold (main AGI Score & specialties with 3+ benchmarks)	60%	Step 4
Asymmetric shrinkage coverage threshold (specialties with 2 benchmarks, e.g. Reasoning)	85%	Step 5
Saturation IQR threshold (top-10 ranked models)	5 pp	Step 4
Coverage floor - Fluid Reasoning & Agency (core)	40%	Step 4
Coverage floor - Knowledge / Multimodal / Language	30%	Step 4
Max thin components allowed in the RANKED tier	1	Step 4
Eval-date grace window vs. model release	30 days	below

Specialty sub-weight splits

Within each specialty tab, only the relevant benchmarks contribute, with within-set sub-weights renormalized to sum to 100. The splits are fixed. Minimum cells required for a model to enter the specialty ranking: 2 for Coding / Knowledge / Tool Use; 1 for Reasoning (which has only 2 benchmarks total). Models below the minimum but with at least one relevant cell appear in a separate "insufficient evidence" section below the ranking.

Note on Coding methodology (v1.5): Real-world coding in 2026 is agentic - it happens via tools like Claude Code, Codex, and Cursor, not bare LLM completion. The Coding specialty composition reflects this: SWE-bench Verified and Pro (run with agentic harnesses by all labs that submit) plus Terminal-Bench (a general agentic terminal-task benchmark). Aider Polyglot was dropped from the composition in v1.5.0 (zero variant-distinct roster coverage after the Round 14 cleanup).

Contamination note for SWE-bench Verified: Public-test-set SWE-bench scores can be inflated by training-data leakage. Private-test-set evaluators (e.g., vals.ai) tend to report 5-15pp lower numbers than public-leaderboard aggregators for the same model. We prefer T1 private-test sources where available; see the Corrections Log for the DeepSeek V4 Pro GPQA correction (0.901 self-report → 0.729 independent T2) as a concrete example, on a different benchmark, of an independent measurement replacing an inflated figure.

Harness disclosure for Terminal-Bench: Agentic-benchmark scores depend on the scaffold used to execute tasks. Our Terminal-Bench cells come primarily from vals.ai (daytona / Terminus-2 harness, pass@1 over 89 tasks). benchlm.ai provides triangulating coverage with a different scaffold. Real users running each model with its native agentic tool (Claude Code for Anthropic, Codex for OpenAI, etc.) may experience different relative performance than benchmark scores predict. Treat the score as one input; consult practitioner reports for workflow-specific decisions.

Specialty	Benchmarks & within-set sub-weights
Coding	SWE-bench Verified 35 · SWE-bench Pro 30 · Terminal-Bench 35
Reasoning	ARC-AGI-2 70 · AIME 2025 30
Knowledge	HLE 40 · GPQA Diamond 35 · MMMU-Pro 25
Tool Use	OSWorld 30 · BrowseComp 25 · Tau-bench retail 22.5 · Tau-bench airline 22.5
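To make the table concrete, here is a sketch of a specialty score before shrinkage. The splits and minimum-cell rules are copied from the text above; measuring coverage as the share of within-set sub-weight that is present is an assumption, and `specialty_raw` is a hypothetical name.

```python
# Within-set sub-weights copied from the table above.
SPECIALTY_SPLITS = {
    "Coding": {"SWE-bench Verified": 35, "SWE-bench Pro": 30, "Terminal-Bench": 35},
    "Reasoning": {"ARC-AGI-2": 70, "AIME 2025": 30},
    "Knowledge": {"HLE": 40, "GPQA Diamond": 35, "MMMU-Pro": 25},
    "Tool Use": {
        "OSWorld": 30,
        "BrowseComp": 25,
        "Tau-bench retail": 22.5,
        "Tau-bench airline": 22.5,
    },
}

# Minimum cells required to enter each specialty ranking.
MIN_CELLS = {"Coding": 2, "Knowledge": 2, "Tool Use": 2, "Reasoning": 1}


def specialty_raw(specialty, cells):
    """Return (score, coverage) before shrinkage, or None if below the minimum.

    cells maps benchmark -> (normalized_score, tier_multiplier). Benchmarks
    outside the specialty are ignored; weights renormalize over present cells.
    """
    split = SPECIALTY_SPLITS[specialty]
    present = {b: v for b, v in cells.items() if b in split}
    if len(present) < MIN_CELLS[specialty]:
        return None  # would go to the "insufficient evidence" section
    num = sum(split[b] * mult * score for b, (score, mult) in present.items())
    den = sum(split[b] * mult for b, (_, mult) in present.items())
    coverage = sum(split[b] for b in present) / sum(split.values())
    return num / den, coverage
```

A model with only an ARC-AGI-2 cell would still enter the Reasoning ranking at 70% coverage, which falls under the 85% shrinkage threshold listed for two-benchmark specialties.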

Value view: how value for money is derived

The Value tab pairs each model's published capability for the selected area (Overall = the AGI Score; otherwise the specialty score above - the exact same number that area's tab shows) with an estimated API cost, so capability can be weighed against price.

Estimated cost is a blended price per 1M tokens at a typical input:output mix for that workload (coding is input-heavy over a large cached context; reasoning is output-heavy; etc.), using each lab's published list prices and its cached-input rate where available (otherwise about 10% of the input rate for the cached share). It is an estimate of typical cost, not a measured per-task bill.
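One way to picture the blended estimate, assuming a simple linear mix of cached input, uncached input, and output prices (the exact per-workload mixes are not given in the text). The function name, parameters, and sample prices are hypothetical; the 10% fallback for the cached rate is the approximation stated above.

```python
def blended_price(input_price, output_price, input_share,
                  cached_share=0.0, cached_price=None):
    """Estimate a blended price per 1M tokens.

    input_share is the fraction of tokens that are input; cached_share is
    the fraction of input tokens served from cache. When no cached rate is
    published, fall back to about 10% of the input rate.
    """
    if cached_price is None:
        cached_price = 0.10 * input_price
    input_blend = cached_share * cached_price + (1 - cached_share) * input_price
    return input_share * input_blend + (1 - input_share) * output_price


# Hypothetical list prices: $4 per 1M input tokens, $20 per 1M output tokens.
print(blended_price(4.0, 20.0, input_share=0.75))
print(blended_price(4.0, 20.0, input_share=0.75, cached_share=0.5))
```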

Value for money is the extra capability a model delivers over the weakest option shown, per estimated dollar: (score − lowest score in view) / cost. We subtract that floor on purpose - a capability score has no true zero (a coding score of 0 is not "no value"), so raw score-per-dollar would over-reward the cheapest model no matter how capable it is. Subtracting the floor measures the extra capability you actually buy.

Each area surfaces three picks: Top (highest capability), Budget (cheapest to run), and Best value (the most extra capability per dollar). The ranking shows the picks rather than a raw value number, which is ambiguous to read in isolation.
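The value-for-money formula and the three picks can be sketched as below. The model names, scores, and costs are made up for illustration, and the function names are hypothetical.

```python
def value_for_money(models):
    """(score - lowest score in view) / cost for each model.

    models maps name -> (score, estimated_cost_per_1m_tokens).
    """
    floor = min(score for score, _ in models.values())
    return {name: (score - floor) / cost for name, (score, cost) in models.items()}


def picks(models):
    """Return the Top, Budget, and Best value picks for a view."""
    value = value_for_money(models)
    return {
        "Top": max(models, key=lambda name: models[name][0]),
        "Budget": min(models, key=lambda name: models[name][1]),
        "Best value": max(value, key=value.get),
    }


# Hypothetical view: (score, estimated $ per 1M tokens).
view = {"A": (90.0, 10.0), "B": (80.0, 2.0), "C": (70.0, 1.0)}
print(value_for_money(view))
print(picks(view))
```

In this made-up view the cheapest model scores zero on value because it sets the floor, which is the intended effect of subtracting the lowest score.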

Variant-attribution discipline

A cell enters scoring only when the source page's row label identifies the variant unambiguously. The lab's named max-tier configuration (OpenAI Pro, Anthropic thinking, etc.) must appear in the source label. Effort-knob suffixes are not variants.

Accept rows labeled e.g. "Claude Opus 4.7 (thinking)" or "GPT-5.5 Pro".

Reject rows like "GPT-5.5 (xHigh)", "GPT-5.5 (High)", "DeepSeek V4 Pro (Max)" - these are base or default config with an effort knob, not the separately-priced Pro / thinking SKU.

Reject generic family rows where a single row on the source cannot distinguish base from Pro / thinking.
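The accept and reject rules above could be approximated with a small label check. This is a heuristic sketch, not the review process itself: it assumes the expected variant label is known in advance, and the list of effort-knob suffixes extends the examples in the text (xHigh, High, Max) with Medium and Low as an assumption.

```python
import re

# Effort-knob suffixes; "medium" and "low" are assumed additions.
EFFORT_SUFFIX = re.compile(r"\((?:xhigh|high|medium|low|max)\)\s*$", re.IGNORECASE)


def accept_row(row_label: str, expected_variant_label: str) -> bool:
    """Accept a source row only if its label names the variant unambiguously."""
    label = " ".join(row_label.split())  # collapse stray whitespace
    if EFFORT_SUFFIX.search(label):
        return False  # effort knob on a base config, not a separate SKU
    # Generic family rows fail here because they lack the variant marker.
    return label.casefold() == expected_variant_label.casefold()


print(accept_row("Claude Opus 4.7 (thinking)", "Claude Opus 4.7 (thinking)"))
print(accept_row("GPT-5.5 (xHigh)", "GPT-5.5 Pro"))
print(accept_row("GPT-5.5", "GPT-5.5 Pro"))
```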

Eval-date sanity check

A cell whose eval_date predates the model's release_date by more than 30 days is rejected as a mis-attribution. The 30-day grace window covers pre-release lab evaluations; anything older almost certainly tested a preceding model that happens to share part of the name.
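The date check reduces to one comparison. The sketch below uses the `eval_date` and `release_date` names from the text; the function name and the sample dates are hypothetical.

```python
from datetime import date, timedelta

GRACE_WINDOW = timedelta(days=30)


def eval_date_ok(eval_date: date, release_date: date) -> bool:
    """Reject cells evaluated more than 30 days before the model's release."""
    return release_date - eval_date <= GRACE_WINDOW


release = date(2026, 6, 16)  # hypothetical release date
print(eval_date_ok(date(2026, 5, 17), release))  # exactly 30 days early
print(eval_date_ok(date(2026, 5, 16), release))  # 31 days early
```

Evaluations dated after release always pass, since the difference is then negative.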

Version history

v1.11.9 · 2026-06-19 · GLM 5.1 demoted to Provisional. Two BenchLM cells are voided: SWE-bench Pro was a Provider-exact relay of Z.AI's own figure (not an independent eval), and Tau-bench Retail no longer exists on BenchLM (replaced by tau-Telecom, a different benchmark we do not score). GLM 5.1 now rests on four independent T1 cells only. Provisional AGI Score about 71.9. No imputation.

v1.11.8 · 2026-06-19 · GLM 5.2 harvest: vals.ai + LiveBench. Five independent T1 cells now ground GLM 5.2: GPQA Diamond 0.86, SWE-bench Verified 0.83, and Terminal-Bench 2.1 0.68 from vals.ai (replacing earlier Artificial Analysis figures on the overlapping benchmarks), HLE 0.40 from our standardized AA source, and LiveBench 0.76. LM Arena Text Elo 1471 is published. Stays Provisional: agency sub-weight coverage sits just below the 40% core floor alongside absent multimodal data. No imputation.

v1.11.7 · 2026-06-17 · GLM 5.2 graduates to Provisional. Same day it launched, Artificial Analysis published independent results, so GLM 5.2 moves from Awaiting Verification to Provisional on three T1 cells: GPQA Diamond 0.89, Terminal-Bench 2.1 0.75, and HLE 0.40. The independent numbers came in below the lab's launch self-reports (which we had excluded) - exactly why we wait for them. It needs a fifth independent cell to reach Ranked. No imputation.

v1.11.6 · 2026-06-17 · GLM 5.2 added (Awaiting Verification). Z.AI released GLM 5.2 on June 16. So far only the lab's own launch-day numbers are public (its model card and a benchmark aggregator relaying the same figures), which we exclude as unverified self-reports. GLM 5.2 is listed as Awaiting Verification - no score - until an independent evaluator publishes results. No imputation.

v1.11.5 · 2026-06-16 · Hero badge. The launch-month pill now reads "Since May 2026" - a founding mark rather than a stale-looking date stamp. No scoring change.

v1.11.4 · 2026-06-16 · Count consistency. The hero subhead and structured data still read "15 benchmarks" after yesterday's removal of the AA Index; both now read 14, matching the live count. No scoring change.

v1.11.3 · 2026-06-16 · AA Intelligence Index removed from the score. Artificial Analysis shipped Intelligence Index v4.1, an explicitly agentic composite (GDPval, Terminal-Bench 2.1, τ³-Banking, SciCode, HLE, GPQA and more). Because it re-bundles benchmarks we already score directly, keeping it as our Language signal would double-count those results and mislabel agency as language - so we removed it from the AGI Score (14 live benchmarks now). Language rests on LiveBench; a dedicated language benchmark is on the roadmap. Most scores rise ~0.3-0.5 (the Index had been a slight drag); the ranking is unchanged. We still track the AA Index as an external reference.

v1.11.2 · 2026-06-16 · Cleaner number type. Scores and stats now render in Inter with tabular figures - more legible in the dense tables than the previous display face, and consistent across the site. Headings and the wordmark are unchanged. No change to any score.

v1.11.1 · 2026-06-16 · Value view polish. Refined the value-for-money UI and method after testing: the ranking now leads with the picks (Top / Best value / Budget) instead of a raw value index that was hard to read at a glance, the cost column is labelled $/1M tokens for clarity, and the methodology page now spells out exactly how value for money is derived. No change to any score.

v1.11.0 · 2026-06-15 · Value view. The Value tab ranks models by capability against an estimated API cost. Pick a capability area - Overall (the AGI Score) or Coding, Reasoning, Knowledge, Tool Use - and the table shows the exact same score that area's own tab shows, next to an estimated cost at that area's typical token mix (cache-aware). Sort by capability or by value for money, and read the picks at a glance: top capability, best value, and budget. A value-frontier graph plots the same models. Also renamed Google to Google DeepMind, and renamed the Agentic specialty to Tool Use (coding is also agentic, so the old name was ambiguous; Tool Use covers computer use, web browsing and tool/function calling beyond code). No change to the canonical AGI Score or any benchmark data.

v1.10.1 · 2026-06-15 · Full data refresh, plus MiniMax M2.7 (Provisional). A complete re-harvest of every model from independent sources corrected several stale or mislabeled cells (a DeepSeek score that was far too low, a Qwen coding number drawn from the wrong benchmark, and more), filled gaps, and upgraded many cells to higher-trust independent measurements. Effect on the board: the top stays a near-tie between ChatGPT 5.5 and Claude Opus 4.8; Qwen 3.7 Max earns a canonical rank as new coverage completes its profile; Gemini 3 Pro and DeepSeek V4 Pro rise on corrected data. MiniMax M2.7 (the prior MiniMax flagship) joins as Provisional. Every cell traces to an independent source. No imputation.

v1.10.0 · 2026-06-15 · MiniMax M3 added, Ranked at #6. The new MiniMax flagship enters scored entirely from independent sources (Artificial Analysis, vals.ai, benchlm, LiveBench) rather than the lab's own numbers. It is strong on multimodal and general knowledge, weaker on language, and its independently-measured score lands well below its launch claims, which is exactly why we wait for third-party data. Priced low, so it places well in the Value view. No imputation.

v1.9.3 · 2026-06-14 · Accessibility 100. Underlined the last in-text link that was distinguishable by color alone. Lighthouse: Accessibility 100, SEO 100, Best Practices 100 (desktop Performance 98).

v1.9.2 · 2026-06-14 · Accessibility and SEO pass. Form controls and the sort menu now carry proper labels, in-text links are underlined rather than color-only, the page has a main landmark for screen readers, and muted text is lightened to readable contrast. Two action links became buttons so search engines crawl every link. Lighthouse accessibility rises from 70 toward the high 90s, with no change to any ranking or score.

v1.9.1 · 2026-06-14 · Copy fix: 15 live benchmarks. The homepage, methodology, and page metadata now read 15 live benchmarks, matching the count the leaderboard has been computing for a while. The static text had lagged the live data by one. No scoring change.

v1.9.0 · 2026-06-12 · Sensitivity bands on every score, plus a Claude Fable 5 update. Each score on the main leaderboard now shows a sensitivity band (for example 87.02 ±3.6): we remove each of a model's benchmarks one at a time, re-run the entire scoring pipeline, and report how far the score moves. It is not a confidence interval - it shows how much a score depends on which benchmarks exist, and it widens honestly when coverage is thin. When two models' bands overlap, read their order as a statistical tie. Separately, Claude Fable 5 gains its Humanity's Last Exam result from our standardized independent source, lifting its provisional score to about 95.8. It stays Provisional by explicit editorial hold, with the reason published in the open dataset and shown on its badge: ARC-AGI-2 has not yet been run on Fable 5, and every ranked model's reasoning score includes that benchmark, so ranking it today would compare unlike baskets. It ranks the day that result publishes. No imputation, and no corner-cutting in either direction.

v1.8.1 · 2026-06-10 · Site visibility upgrade. The full technical methodology now lives at its own address (/methodology) rather than only inside a popup. Every model page now carries a written summary and a static benchmark table with source attribution, readable even without JavaScript - the same numbers as the interactive view. Added robots.txt and per-page structured data so search engines can find and understand every page. The open dataset (models.json) is now formally licensed CC BY 4.0 - cite it freely with attribution.

v1.8.0 · 2026-06-09 · Claude Fable 5 added, on launch day. Anthropic's new generally-available flagship (its Mythos-class model with production safeguards) enters as Provisional. It posts the strongest agentic-coding and knowledge results on the board, but day-one coverage is uneven across components, so it is not yet canonically ranked. Two launch figures arrived above their benchmarks' human ceilings from a single source; we are holding both pending independent confirmation rather than letting them inflate the score. Fable lifts to ranked once independent evaluations complete its profile. No imputation, even for the year's most anticipated launch.

v1.7.3 · 2026-06-09 · ARC-AGI-2 cleanup: one effort-tier fix, one version fix. Completing the High-effort standardization, ChatGPT 5.4's ARC-AGI-2 is now read at the same High tier as the rest of the column. Separately, a GLM ARC-AGI-2 result we had attributed to GLM 5.1 in fact belonged to the previous version, GLM 5; with no GLM 5.1 result published on that benchmark, the cell is removed rather than guessed. The correction lifts GLM 5.1 and takes it off the Reasoning leaderboard, where it no longer has a qualifying result. No imputation.

v1.7.2 · 2026-06-09 · ARC-AGI-2 read at one effort level. Reasoning models can be run at several effort settings, and one model on the board had been carried at a higher tier than the rest. We now standardize ARC-AGI-2 on the High-effort tier so the column compares like-for-like. Claude Opus 4.8 (thinking) gains its ARC-AGI-2 result and joins the Reasoning leaderboard. Effect on the top: ChatGPT 5.5 and Claude Opus 4.8 stay within a fraction of a point, with ChatGPT 5.5 nominally first. No imputation.

v1.7.1 · 2026-06-09 · SWE-bench Verified standardized on one independent source. Every model's SWE-bench Verified result now comes from the same independent third-party leaderboard (vals.ai), replacing a mix that leaned on a benchmark host whose public table has not refreshed since February. The result is consistent, current agentic-coding measurement across the whole column - and it tightens the top of the board: Claude Opus 4.8 (thinking) and ChatGPT 5.5 now sit within 0.2 points, effectively tied, with Opus 4.8 nominally first. No imputation.

v1.7.0 · 2026-06-05 · HLE standardized on one independent source; Opus 4.8 joins the ranked board; Opus 4.6 retired. Humanity's Last Exam is now sourced consistently from Artificial Analysis (independent, no-tools) for every model it covers, replacing a patchwork of sources that disagreed by up to 20 points on the same model. Claude Opus 4.8 enters the ranked leaderboard at #2 now that independent reasoning and language scores complete its coverage. Per our scope rule (latest two versions per lab), Claude Opus 4.6 leaves the board. No imputation.

v1.6.1 · 2026-06-04 · Value view polish. Added a reset for the workload-mix slider (back to the standard 75% input / 25% output), spaced out overlapping model labels on the scatter, and paused weight customization while the Value view is open (weights don't change price-performance).

v1.6.0 · 2026-06-04 · New: the Value view - capability per dollar. A price column (input/output API cost per million tokens) plus a Value tab that plots AGI Score against blended API price and highlights the value frontier - the best score available at each price point. A workload-mix slider weights input vs output cost for your use case. Pricing covers current frontier models; superseded or preview-only models without public pricing are labeled as such. Best value among frontier models - we do not track budget tiers. AGI Score remains the default view.

v1.5.3 · 2026-05-31 · Source upgrade for Claude Opus 4.8 (thinking). Its SWE-bench Verified result is now drawn from an independent third-party evaluation that corroborates the launch-day figure, replacing the lab self-report we carried at launch. Same value, higher-trust source; the model stays Provisional pending independent reasoning data. No imputation.

v1.5.2 · 2026-05-29 · Claude Opus 4.8 (thinking) added the day after launch, as Provisional. We hold its GPQA Diamond and AA Intelligence Index (independent T1) plus launch-day agentic results (SWE-bench Verified/Pro, OSWorld); reasoning-component coverage is still too thin for a canonical rank, so no AGI Score position yet. It lifts once independent reasoning and second-source agentic benchmarks publish. Claude Opus 4.6 stays on the board until then. No imputation.

v1.5.1 · 2026-05-23 · Roster expansion to 18 models. Added Qwen 3.7 Max (Provisional), Cursor Composer 2.5 (System entry - scores reflect the full product harness, not weights alone), and MiMo V2.5 Pro. Lifted Claude Opus 4.7 and Claude Opus 4.6 (thinking) from awaiting-verification to Provisional after new variant-distinct evaluations surfaced.

v1.5.0 · 2026-05-17 · Coding specialty restructured to reflect agentic-coding reality. Real coding in 2026 happens via tools like Claude Code, Codex, Cursor - not bare LLM completion. Composition: SWE-bench Verified 35 / SWE-bench Pro 30 / Terminal-Bench 35 (replacing Aider Polyglot, which had zero variant-distinct roster coverage after Round 14 cleanup). 9 of 15 roster models now have Terminal-Bench data (Round 25 harvest from vals.ai T1 + benchlm.ai T2 second-source). Added contamination note for SWE-bench Verified and harness disclosure for Terminal-Bench on the methodology page. Top of leaderboard shifts: GPT-5.5 takes #1 in Coding by ~0.7pp over Claude Opus 4.7 (thinking) - within statistical noise of the benchmarks' error bars, which is itself worth surfacing honestly. Each tab now answers its question directly.

v1.4.15 · 2026-05-15 · Homepage OG card gains a call-to-action: emerald pill-shaped "See live rankings →" in the bottom-right corner, replacing the previous "Built with obsessive care for truth" tagline. Closes the last issue OpenGraph debugger flagged ("Missing call-to-action in your image"). The CTA is visual-only inside the PNG (not a clickable link - it's part of the social-share image), but it telegraphs the action a viewer should take if they click through.

v1.4.14 · 2026-05-15 · Homepage OG image fixed: regenerated at native 1200x630 / 408KB (was 2400x1260 / 1.14MB - exceeded WhatsApp's <600KB ceiling and was over-spec for OG's recommended dimensions). Same fix already applied to per-model OG cards in v1.4.11; this brings the homepage card into line. Also extended page title from 47 to 51 characters ("AGI Ranker - Open AGI Score for Frontier AI Models") to land in the optimal 50-60 char window for SERP and OG previews.

v1.4.13 · 2026-05-13 · Developer-mode analytics toggle. Visit /?dev=1 on any browser/device to disable Vercel Analytics tracking for that browser (sets the localStorage va-disable flag Vercel respects); /?dev=0 to re-enable. Confirmation toast appears for ~3.5s. URL is cleaned after action so the param doesn't persist on refresh or share. Designed for Barak's own repeat visits to not inflate metrics, without requiring browser DevTools console access (which is unrealistic on mobile).

v1.4.12 · 2026-05-13 · "Last updated" freshness indicator added to the leaderboard header line. Auto-bumps every commit that touches models.json - the date comes from the HTTP Last-Modified header Vercel sets on the file, so no manual maintenance. Complements the existing "Most recent eval" date (which signals source-side freshness) with a "Last updated" date (which signals our integration activity). Visual cue for returning users that the site is actively maintained.

v1.4.11 · 2026-05-13 · Per-model OG cards properly delivered to social-media crawlers. Pre-generated 15 per-model HTML files at /model/{slug}.html with per-model meta tags (title, og:image, og:description, canonical URL). Social-media bots don't execute JavaScript, so the v1.4.10 client-side meta updates never reached them - they saw only the homepage card. With per-model HTML now served via Vercel cleanUrls, OG/Twitter/WhatsApp/Slack previews show the correct model-specific card and copy. Also reduced PNG file size from 1.14MB to ~380KB (native 1200x630 instead of 2x device scale) to fit WhatsApp's <600KB ceiling.

v1.4.10 · 2026-05-13 · Per-model OG cards. Each of the 15 model URLs now has its own social-share preview image at /og/{apiName}.png, showing the model's AGI Score, tier badge, 5-component breakdown, and rank within the canonical view. Social shares of /model/{slug} URLs (X, LinkedIn, Slack, Discord, iMessage) now display model-specific imagery rather than the homepage card. Cards generated via tools/regenerate_model_og_cards.py - re-run whenever scores materially change.

v1.4.9 · 2026-05-13 · Capability heatmap shipped. New section between Corrections Log and By Capability Area showing every scoreable model on every cognitive component in one grid. Color-coded by score (rose → amber → emerald), sorted by AGI Score descending, with PROVISIONAL rows visually flagged. Each model name links to its /model/{slug} detail page. Designed to be screenshot-shareable - one image tells the strengths-and-weaknesses story across the roster.

v1.4.8 · 2026-05-13 · Hero clarity rewrite. The subhead now states what AGI Ranker measures (how close each frontier AI is to AGI) and defines the score scale (0-100, with 100 as the AGI threshold) up-front instead of burying that context in the modal. Added a second smaller paragraph surfacing the three trust differentiators: independent-verification preference, no-imputation policy, public Corrections Log. Improves mass-market readability without weakening the rigor signal.

v1.4.7 · 2026-05-13 · Per-model URL routing shipped. Each model in the roster now has a shareable, SEO-indexable URL at /model/{apiName}. The existing detail view opens automatically when the URL is loaded directly, and clicking "View" on a leaderboard row updates the URL via History API. Document title and meta description update dynamically per model. sitemap.xml added with all 15 model URLs.

v1.4.6 · 2026-05-13 · Corrections Log refreshed to reflect Round 20: source-tier upgrades table now lists the Claude Opus 4.7 (thinking) SWE-V T2→T1 swap alongside the headline DeepSeek correction. New section added for the four Round 20 vals.ai T1 additions (GPT-5.5, GPT-5.4, Gemini 3.1 Pro Preview, Claude Opus 4.6 Thinking on SWE-bench Verified). Summary stats updated: 14 cells corrected, 23 cells from independent verification.

v1.4.5 · 2026-05-13 · Round 20 follow-up: Claude Opus 4.7 (thinking) SWE-bench Verified source-tier upgrade. The vals.ai Settings panel confirmed the unsuffixed "Claude Opus 4.7" row at vals.ai was tested with "Thinking Type: Adaptive" (= Anthropic thinking variant per our convention). Replaced the existing T2 benchlm.ai cell (0.876) with the T1 vals.ai cell (0.820). Lower value, higher trust weight - exactly the methodology pattern: independent T1 verification supersedes T2 aggregation.

v1.4.4 · 2026-05-13 · Round 20 vals.ai harvest integrated. Four T1 SWE-bench Verified cells added: GPT-5.5 (0.826), GPT-5.4 (0.782), Gemini 3.1 Pro Preview (0.788), Claude Opus 4.6 Thinking (0.782). The Coding specialty now has 9 eligible models (up from 6) - ChatGPT 5.5, ChatGPT 5.4, and Gemini 3.1 Pro Preview moved from "insufficient evidence" into the main Coding ranking.

v1.4.3 · 2026-05-13 · Reasoning tab now shows a data-sparsity disclaimer. With only 2 benchmarks (ARC-AGI-2 and AIME 2025) and most models having only one measured, single-cell rankings here can be misleading. We kept minBenchmarks at 1 (rather than raising to 2, which would leave only ~2 models ranked) and added a visible amber disclaimer so users understand the limitation.

v1.4.2 · 2026-05-12 · Specialty eligibility raised to minimum 2 cells for Coding, Knowledge, and Agentic (Reasoning stays at 1 because it has only 2 benchmarks total and is already protected by the 85% shrinkage threshold). Single-cell models on hard benchmarks were being unfairly dragged to the bottom; they now appear in a separate "insufficient evidence" section.
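As a rough sketch of the eligibility rule, assuming each model's cells are stored as a mapping from benchmark name to a score or `None`: the function name `split_by_eligibility`, the data layout, and the placeholder models and benchmarks below are illustrative assumptions, not the production schema.

```python
from typing import Optional

# Minimum measured cells per specialty, per the v1.4.2 and v1.4.3 entries.
MIN_BENCHMARKS = {"Coding": 2, "Knowledge": 2, "Agentic": 2, "Reasoning": 1}

# benchmark name -> score, or None when the cell is unmeasured
Cells = dict[str, Optional[float]]


def split_by_eligibility(
    specialty: str, roster: dict[str, Cells]
) -> tuple[list[str], list[str]]:
    """Split models into (ranked, insufficient_evidence) for one specialty."""
    minimum = MIN_BENCHMARKS[specialty]
    ranked: list[str] = []
    insufficient: list[str] = []
    for model, cells in roster.items():
        measured = sum(1 for value in cells.values() if value is not None)
        (ranked if measured >= minimum else insufficient).append(model)
    return ranked, insufficient


# Placeholder models, benchmarks, and values; not taken from the leaderboard.
roster = {
    "model-a": {"bench-1": 0.80, "bench-2": 0.55},
    "model-b": {"bench-1": 0.78, "bench-2": None},
}
ranked, insufficient = split_by_eligibility("Coding", roster)
print(ranked, insufficient)
```

Under the Reasoning minimum of 1, both placeholder models would be ranked, which is why that tab carries the data-sparsity disclaimer instead.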

v1.4.1 · 2026-05-12 · Reasoning specialty (and any future 2-benchmark specialty) now uses an 85% shrinkage coverage threshold instead of the flat 60%. Prevents a single high-weight cell from producing an inflated specialty score.

v1.4.0 · 2026-05-12 · Cross-component coverage floor introduced. Three-section visibility layout (RANKED / PROVISIONAL / AWAITING). 40% core-component coverage requirement (Reasoning & Agency) added.

CORRECTIONS LOG

Independent verification, public corrections

Every cell on the leaderboard cites its source. When we find a mis-attribution, an inflated self-report contradicted by independent measurement, or a stale evaluation that predates the model itself, we correct it - and log the change here.

Headline correction

DeepSeek V4 Pro · GPQA Diamond

0.901 → 0.729 · −17 pp

Lab self-report (T3, 0.901) replaced by independent third-party measurement (T2 benchlm.ai, 0.729). Same model, different evidence.

Promotion story

Kimi K2.6 enters scoring

Four cells migrated from T4 (excluded) to T2 (independent verification): GPQA Diamond, SWE-bench Verified, SWE-bench Pro, MMMU-Pro.

These cells were previously stuck in lab-self-report territory. With independent evidence, K2.6 climbed into the canonical top three.

Source-tier upgrades

Cells where a previously lower-trust source was replaced by a higher-trust independent measurement. The headline DeepSeek correction and the Kimi K2.6 promotion (both above) are the most consequential examples. Two value-changing upgrades are worth surfacing in detail:

Model	Benchmark	Change	Note
DeepSeek V4 Pro	GPQA Diamond	0.901 (T3) → 0.729 (T2)	independent third-party measurement (benchlm.ai)
Claude Opus 4.7 (thinking)	SWE-bench Verified	0.876 (T2) → 0.820 (T1)	private-test-set evaluation (vals.ai)

Additional T3 → T2 swaps across the Coding and Knowledge batteries for multiple models are tracked in models.json.
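For reference, the two value-changing upgrades in the table above can be expressed in percentage points with a one-line helper. `delta_pp` is a hypothetical name used only for this illustration.

```python
def delta_pp(old: float, new: float) -> float:
    """Change between two 0-1 scores, in percentage points (one decimal)."""
    return round((new - old) * 100, 1)


# Values from the source-tier upgrades table.
deepseek_gpqa = delta_pp(0.901, 0.729)  # T3 -> T2
opus_swe_v = delta_pp(0.876, 0.820)     # T2 -> T1
print(deepseek_gpqa, opus_swe_v)
```

The first result rounds to the −17 pp shown in the headline correction.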

Variant-attribution corrections

Cells where the source row label didn't unambiguously identify the variant (Pro / thinking / base). Each was either re-attributed to the correct column on the source page or nulled per the variant-distinct evidence policy.

Model	Benchmark	Change	Reason
ChatGPT 5.5 (Pro)	BrowseComp	0.844 → 0.901	corrected to actual Pro column on release page
ChatGPT 5.5 (Pro)	FrontierMath	0.517 → 0.524	corrected to actual Pro column on release page
ChatGPT 5.4 (Pro)	SWE-bench Pro	0.577 → null	value was from the GPT-5.4 base column
Claude Opus 4.6 (thinking)	SWE-bench Pro	0.534 → null	source row didn't distinguish thinking from base
Claude Opus 4.6 (thinking)	OSWorld	0.727 → null	source row didn't distinguish thinking from base
Claude Opus 4.7 (base)	GPQA Diamond	0.942 → null	source row didn't distinguish base from thinking
ChatGPT 5.5, 5.5 (Pro), 5.4, 5.4 (Pro)	AA Intelligence Index	57-60 → null (4 cells)	AA family row not variant-distinct
Claude Opus 4.7 (base)	AA Intelligence Index	57 → null	AA "(max)" attribution ambiguous

Stale-source removals

Cells where the cited source no longer hosts the value (typically because the source's leaderboard has rotated to newer models). Per "verifiable evidence only," these cells are nulled rather than left citing an unfetchable source. Note that ChatGPT 5.4's SWE-bench Verified cell was subsequently restored from an independent T1 source (see additions table below).

Model	Benchmark	Change	Reason
ChatGPT 5.4	SWE-bench Verified	0.728 → null	no longer on swebench.com leaderboard
ChatGPT 5.4 (Pro)	SWE-bench Verified	0.728 → null	no longer on swebench.com leaderboard

Cells added from independent T1 verification

Cells that were previously null in our schema and have now been populated from a T1 independent third-party evaluation (1.00× trust weight). These additions reflect new measurements arriving in the public ecosystem and do not displace any prior values.

Model	Benchmark	Change	Source
ChatGPT 5.5	SWE-bench Verified	null → 0.826	vals.ai T1
ChatGPT 5.4	SWE-bench Verified	null → 0.782	vals.ai T1 (restores the previously nulled stale-source cell)
Gemini 3.1 Pro (preview)	SWE-bench Verified	null → 0.788	vals.ai T1
Claude Opus 4.6 (thinking)	SWE-bench Verified	null → 0.782	vals.ai T1 (first variant-distinct SWE-V cell for this AWAITING model)

Cells corrected

Cells from independent verification

Model entered scoring

Values imputed

Spotted a value that disagrees with the cited source? Or know of a published independent measurement we should be tracking?

CAPABILITY HEATMAP

Strengths and weaknesses, at a glance

Each cell shows a model's score on one of the five cognitive components. The AGI Score (rightmost column) blends these by parent weights (shown beneath each header). 100 marks the AGI threshold; values above are super-human on that dimension.

Model	Agency 35%	Reasoning 29%	Knowledge 15%	Multimodal 11%	Language 10%	AGI Score

Scale: 0 to 30 · 30 to 60 · 60 to 85 · 85+ · no data · Provisional rows shown with italic name and amber chip
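The blend and the color scale described in this section can be sketched as follows. The weights come from the heatmap header; the function names, the treatment of band boundaries (lower bound inclusive), and the example component scores are assumptions made for illustration. Handling of missing components is not specified in the text, so this sketch requires all five.

```python
from typing import Optional

# Component weights (percent) as shown in the heatmap header.
WEIGHTS = {
    "Agency": 35,
    "Reasoning": 29,
    "Knowledge": 15,
    "Multimodal": 11,
    "Language": 10,
}


def agi_score(components: dict[str, float]) -> float:
    """Weighted average of the five component scores, weights normalized to 100%."""
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return weighted_sum / total_weight


def scale_bucket(score: Optional[float]) -> str:
    """Map a score to its legend band; lower bounds are assumed inclusive."""
    if score is None:
        return "no data"
    if score >= 85:
        return "85+"
    if score >= 60:
        return "60 to 85"
    if score >= 30:
        return "30 to 60"
    return "0 to 30"


# Made-up component scores for a hypothetical model.
example = {
    "Agency": 60.0,
    "Reasoning": 50.0,
    "Knowledge": 80.0,
    "Multimodal": 70.0,
    "Language": 90.0,
}
score = agi_score(example)
print(round(score, 1), scale_bucket(score))
```

Dividing by the weight total keeps the result on the same 0-100 scale even if the weights are later adjusted and no longer sum to exactly 100.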

By Capability Area

Top performers in each of the three public capability framings. The overall AGI Score blends these by their parent weights (Thinking 44 / Doing 43 / Communicating 13).