v1.11.9 · 2026-06-19 · GLM 5.1 demoted to Provisional. Two BenchLM cells are voided: SWE-bench Pro was a Provider-exact relay of Z.AI's own figure (not an independent eval), and Tau-bench Retail no longer exists on BenchLM (replaced by tau-Telecom, a different benchmark we do not score). GLM 5.1 now rests on four independent T1 cells only. Provisional AGI Score about 71.9. No imputation.
v1.11.8 · 2026-06-19 · GLM 5.2 harvest: vals.ai + LiveBench. Five independent T1 cells now ground GLM 5.2: GPQA Diamond 0.86, SWE-bench Verified 0.83, and Terminal-Bench 2.1 0.68 from vals.ai (replacing earlier Artificial Analysis figures on the overlapping benchmarks), HLE 0.40 from our standardized AA source, and LiveBench 0.76. LM Arena Text Elo 1471 is published. Stays Provisional: agency sub-weight coverage sits just below the 40% core floor alongside absent multimodal data. No imputation.
v1.11.7 · 2026-06-17 · GLM 5.2 graduates to Provisional. Same day it launched, Artificial Analysis published independent results, so GLM 5.2 moves from Awaiting Verification to Provisional on three T1 cells: GPQA Diamond 0.89, Terminal-Bench 2.1 0.75, and HLE 0.40. The independent numbers came in below the lab's launch self-reports (which we had excluded) - exactly why we wait for them. It needs a fifth independent cell to reach Ranked. No imputation.
v1.11.6 · 2026-06-17 · GLM 5.2 added (Awaiting Verification). Z.AI released GLM 5.2 on June 16. So far only the lab's own launch-day numbers are public (its model card and a benchmark aggregator relaying the same figures), which we exclude as unverified self-reports. GLM 5.2 is listed as Awaiting Verification - no score - until an independent evaluator publishes results. No imputation.
v1.11.5 · 2026-06-16 · Hero badge. The launch-month pill now reads "Since May 2026" - a founding mark rather than a stale-looking date stamp. No scoring change.
v1.11.4 · 2026-06-16 · Count consistency. The hero subhead and structured data still read "15 benchmarks" after yesterday's removal of the AA Index; both now read 14, matching the live count. No scoring change.
v1.11.3 · 2026-06-16 · AA Intelligence Index removed from the score. Artificial Analysis shipped Intelligence Index v4.1, an explicitly agentic composite (GDPval, Terminal-Bench 2.1, τ³-Banking, SciCode, HLE, GPQA and more). Because it re-bundles benchmarks we already score directly, keeping it as our Language signal would double-count those results and mislabel agency as language - so we removed it from the AGI Score (14 live benchmarks now). Language rests on LiveBench; a dedicated language benchmark is on the roadmap. Most scores rise ~0.3-0.5 (the Index had been a slight drag); the ranking is unchanged. We still track the AA Index as an external reference.
v1.11.2 · 2026-06-16 · Cleaner number type. Scores and stats now render in Inter with tabular figures - more legible in the dense tables than the previous display face, and consistent across the site. Headings and the wordmark are unchanged. No change to any score.
v1.11.1 · 2026-06-16 · Value view polish. Refined the value-for-money UI and method after testing: the ranking now leads with the picks (Top / Best value / Budget) instead of a raw value index that was hard to read at a glance, the cost column is labelled $/1M tokens for clarity, and the methodology page now spells out exactly how value for money is derived. No change to any score.
v1.11.0 · 2026-06-15 · Value view. The Value tab ranks models by capability against an estimated API cost. Pick a capability area - Overall (the AGI Score) or Coding, Reasoning, Knowledge, Tool Use - and the table shows the exact same score that area's own tab shows, next to an estimated cost at that area's typical token mix (cache-aware). Sort by capability or by value for money, and read the picks at a glance: top capability, best value, and budget. A value-frontier graph plots the same models. Also renamed Google to Google DeepMind, and renamed the Agentic specialty to Tool Use (coding is also agentic, so the old name was ambiguous; Tool Use covers computer use, web browsing and tool/function calling beyond code). No change to the canonical AGI Score or any benchmark data.
v1.10.1 · 2026-06-15 · Full data refresh, plus MiniMax M2.7 (Provisional). A complete re-harvest of every model from independent sources corrected several stale or mislabeled cells (a DeepSeek score that was far too low, a Qwen coding number drawn from the wrong benchmark, and more), filled gaps, and upgraded many cells to higher-trust independent measurements. Effect on the board: the top stays a near-tie between ChatGPT 5.5 and Claude Opus 4.8; Qwen 3.7 Max earns a canonical rank as new coverage completes its profile; Gemini 3 Pro and DeepSeek V4 Pro rise on corrected data. MiniMax M2.7 (the prior MiniMax flagship) joins as Provisional. Every cell traces to an independent source. No imputation.
v1.10.0 · 2026-06-15 · MiniMax M3 added, Ranked at #6. The new MiniMax flagship enters scored entirely from independent sources (Artificial Analysis, vals.ai, benchlm, LiveBench) rather than the lab's own numbers. It is strong on multimodal and general knowledge, weaker on language, and its independently-measured score lands well below its launch claims, which is exactly why we wait for third-party data. Priced low, so it places well in the Value view. No imputation.
v1.9.3 · 2026-06-14 · Accessibility 100. Underlined the last in-text link that was distinguishable by color alone. Lighthouse: Accessibility 100, SEO 100, Best Practices 100 (desktop Performance 98).
v1.9.2 · 2026-06-14 · Accessibility and SEO pass. Form controls and the sort menu now carry proper labels, in-text links are underlined rather than color-only, the page has a main landmark for screen readers, and muted text is lightened to readable contrast. Two action links became buttons so search engines crawl every link. Lighthouse accessibility rises from 70 toward the high 90s, with no change to any ranking or score.
v1.9.1 · 2026-06-14 · Copy fix: 15 live benchmarks. The homepage, methodology, and page metadata now read 15 live benchmarks, matching the count the leaderboard has been computing for a while. The static text had lagged the live data by one. No scoring change.
v1.9.0 · 2026-06-12 · Sensitivity bands on every score, plus a Claude Fable 5 update. Each score on the main leaderboard now shows a sensitivity band (for example 87.02 ±3.6): we remove each of a model's benchmarks one at a time, re-run the entire scoring pipeline, and report how far the score moves. It is not a confidence interval - it shows how much a score depends on which benchmarks exist, and it widens honestly when coverage is thin. When two models' bands overlap, read their order as a statistical tie. Separately, Claude Fable 5 gains its Humanity's Last Exam result from our standardized independent source, lifting its provisional score to about 95.8. It stays Provisional by explicit editorial hold, with the reason published in the open dataset and shown on its badge: ARC-AGI-2 has not yet been run on Fable 5, and every ranked model's reasoning score includes that benchmark, so ranking it today would compare unlike baskets. It ranks the day that result publishes. No imputation, and no corner-cutting in either direction.
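The leave-one-out procedure behind the sensitivity band can be sketched as follows. This is a minimal illustration, not the production pipeline: `score` is a hypothetical stand-in (a plain weighted mean) for the full scoring pipeline, the benchmark names and weights in the usage note are made up, and reporting the band as the largest absolute shift is an assumption about how "how far the score moves" is summarized.

```python
from typing import Callable, Dict

Cells = Dict[str, float]    # benchmark name -> normalized result in [0, 1]
Weights = Dict[str, float]  # benchmark name -> weight


def score(cells: Cells, weights: Weights) -> float:
    """Hypothetical stand-in for the scoring pipeline: weighted mean on a 0-100 scale."""
    total_weight = sum(weights[b] for b in cells)
    return 100.0 * sum(weights[b] * v for b, v in cells.items()) / total_weight


def sensitivity_band(
    cells: Cells,
    weights: Weights,
    scorer: Callable[[Cells, Weights], float] = score,
) -> float:
    """Drop each benchmark in turn, re-score, and return the largest shift (assumed summary)."""
    full = scorer(cells, weights)
    shifts = []
    for dropped in cells:
        remaining = {b: v for b, v in cells.items() if b != dropped}
        if not remaining:
            continue  # a single-benchmark model has nothing left to re-score
        shifts.append(abs(scorer(remaining, weights) - full))
    return max(shifts, default=0.0)
```

With three equally weighted hypothetical cells of 0.9, 0.8 and 0.4, the full score is 70 and the largest leave-one-out shift comes from dropping the 0.4 cell, so the band would read 70 ±15. Fewer cells mean each removal moves the score further, which is why the band widens when coverage is thin.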
v1.8.1 · 2026-06-10 · Site visibility upgrade. The full technical methodology now lives at its own address (/methodology) rather than only inside a popup. Every model page now carries a written summary and a static benchmark table with source attribution, readable even without JavaScript - the same numbers as the interactive view. Added robots.txt and per-page structured data so search engines can find and understand every page. The open dataset (models.json) is now formally licensed CC BY 4.0 - cite it freely with attribution.
v1.8.0 · 2026-06-09 · Claude Fable 5 added, on launch day. Anthropic's new generally-available flagship (its Mythos-class model with production safeguards) enters as Provisional. It posts the strongest agentic-coding and knowledge results on the board, but day-one coverage is uneven across components, so it is not yet canonically ranked. Two launch figures arrived above their benchmarks' human ceilings from a single source; we are holding both pending independent confirmation rather than letting them inflate the score. Fable lifts to ranked once independent evaluations complete its profile. No imputation, even for the year's most anticipated launch.
v1.7.3 · 2026-06-09 · ARC-AGI-2 cleanup: one effort-tier fix, one version fix. Completing the High-effort standardization, ChatGPT 5.4's ARC-AGI-2 is now read at the same High tier as the rest of the column. Separately, a GLM ARC-AGI-2 result we had attributed to GLM 5.1 in fact belonged to the previous version, GLM 5; with no GLM 5.1 result published on that benchmark, the cell is removed rather than guessed. The correction lifts GLM 5.1 and takes it off the Reasoning leaderboard, where it no longer has a qualifying result. No imputation.
v1.7.2 · 2026-06-09 · ARC-AGI-2 read at one effort level. Reasoning models can be run at several effort settings, and one model on the board had been carried at a higher tier than the rest. We now standardize ARC-AGI-2 on the High-effort tier so the column compares like-for-like. Claude Opus 4.8 (thinking) gains its ARC-AGI-2 result and joins the Reasoning leaderboard. Effect on the top: ChatGPT 5.5 and Claude Opus 4.8 stay within a fraction of a point, with ChatGPT 5.5 nominally first. No imputation.
v1.7.1 · 2026-06-09 · SWE-bench Verified standardized on one independent source. Every model's SWE-bench Verified result now comes from the same independent third-party leaderboard (vals.ai), replacing a mix that leaned on a benchmark host whose public table has not refreshed since February. The result is consistent, current agentic-coding measurement across the whole column - and it tightens the top of the board: Claude Opus 4.8 (thinking) and ChatGPT 5.5 now sit within 0.2 points, effectively tied, with Opus 4.8 nominally first. No imputation.
v1.7.0 · 2026-06-05 · HLE standardized on one independent source; Opus 4.8 joins the ranked board; Opus 4.6 retired. Humanity's Last Exam is now sourced consistently from Artificial Analysis (independent, no-tools) for every model it covers, replacing a patchwork of sources that disagreed by up to 20 points on the same model. Claude Opus 4.8 enters the ranked leaderboard at #2 now that independent reasoning and language scores complete its coverage. Per our scope rule (latest two versions per lab), Claude Opus 4.6 leaves the board. No imputation.
v1.6.1 · 2026-06-04 · Value view polish. Added a reset for the workload-mix slider (back to the standard 75% input / 25% output), spaced out overlapping model labels on the scatter, and paused weight customization while the Value view is open (weights don't change price-performance).
v1.6.0 · 2026-06-04 · New: the Value view - capability per dollar. Adds a price column (input/output API cost per million tokens) plus a Value tab that plots AGI Score against blended API price and highlights the value frontier - the best score available at each price point. A workload-mix slider weights input vs output cost for your use case. Pricing covers current frontier models; superseded or preview-only models without public pricing are labeled as such. "Best value" here means best among frontier models; we do not track budget tiers. AGI Score remains the default view.
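A minimal sketch of the blended price and the value frontier described in this entry. The `Model` type, its field names, and the example prices are hypothetical; the default 75% input / 25% output mix is the standard setting mentioned in v1.6.1. The frontier rule used here (keep a model only if it scores higher than every cheaper model) is one reasonable reading of "the best score available at each price point", not a confirmed description of the site's code.

```python
from typing import List, NamedTuple


class Model(NamedTuple):
    name: str
    score: float         # AGI Score, 0-100
    input_price: float   # USD per 1M input tokens
    output_price: float  # USD per 1M output tokens


def blended_price(m: Model, input_share: float = 0.75) -> float:
    """Weight input and output cost by the workload mix (input_share in [0, 1])."""
    return input_share * m.input_price + (1.0 - input_share) * m.output_price


def value_frontier(models: List[Model], input_share: float = 0.75) -> List[Model]:
    """Walk models from cheapest to priciest, keeping each one that beats all cheaper scores."""
    ranked = sorted(models, key=lambda m: (blended_price(m, input_share), -m.score))
    frontier: List[Model] = []
    best_so_far = float("-inf")
    for m in ranked:
        if m.score > best_so_far:
            frontier.append(m)
            best_so_far = m.score
    return frontier
```

Moving the slider only changes `input_share`, so the frontier can shift between output-heavy and input-heavy workloads while every model's score stays fixed.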
v1.5.3 · 2026-05-31 · Source upgrade for Claude Opus 4.8 (thinking). Its SWE-bench Verified result is now drawn from an independent third-party evaluation that corroborates the launch-day figure, replacing the lab self-report we carried at launch. Same value, higher-trust source; the model stays Provisional pending independent reasoning data. No imputation.
v1.5.2 · 2026-05-29 · Claude Opus 4.8 (thinking) added the day after launch, as Provisional. We hold its GPQA Diamond and AA Intelligence Index (independent T1) plus launch-day agentic results (SWE-bench Verified/Pro, OSWorld); reasoning-component coverage is still too thin for a canonical rank, so no AGI Score position yet. It lifts once independent reasoning and second-source agentic benchmarks publish. Claude Opus 4.6 stays on the board until then. No imputation.
v1.5.1 · 2026-05-23 · Roster expansion to 18 models. Added Qwen 3.7 Max (Provisional), Cursor Composer 2.5 (System entry - scores reflect the full product harness, not weights alone), and MiMo V2.5 Pro. Lifted Claude Opus 4.7 and Claude Opus 4.6 (thinking) from awaiting-verification to Provisional after new variant-distinct evaluations surfaced.
v1.5.0 · 2026-05-17 · Coding specialty restructured to reflect agentic-coding reality. Real coding in 2026 happens via tools like Claude Code, Codex, and Cursor - not bare LLM completion. Composition: SWE-bench Verified 35 / SWE-bench Pro 30 / Terminal-Bench 35 (replacing Aider Polyglot, which had zero variant-distinct roster coverage after Round 14 cleanup). 9 of 15 roster models now have Terminal-Bench data (Round 25 harvest from vals.ai T1 + benchlm.ai T2 second-source). Added contamination note for SWE-bench Verified and harness disclosure for Terminal-Bench on the methodology page. Top of leaderboard shifts: GPT-5.5 takes #1 in Coding by ~0.7pp over Claude Opus 4.7 (thinking) - within statistical noise of the benchmarks' error bars, which is itself worth surfacing honestly. Each tab now answers its question directly.
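The 35 / 30 / 35 composition can be illustrated with a small weighted-mean sketch. The dictionary keys are hypothetical identifiers, and renormalizing over whichever benchmarks a model actually has is an assumption made for illustration; the real pipeline also applies eligibility and shrinkage rules that this sketch leaves out.

```python
from typing import Dict, Optional

# Weights from this entry; the key names are hypothetical.
CODING_WEIGHTS = {
    "swe_bench_verified": 35,
    "swe_bench_pro": 30,
    "terminal_bench": 35,
}


def coding_composite(cells: Dict[str, Optional[float]]) -> Optional[float]:
    """Weighted mean of the available Coding cells (values in [0, 1]).

    Missing or None cells are skipped and the remaining weights are
    renormalized (an assumption, not the confirmed production rule).
    """
    covered = {b: w for b, w in CODING_WEIGHTS.items() if cells.get(b) is not None}
    if not covered:
        return None
    total_weight = sum(covered.values())
    return sum(w * cells[b] for b, w in covered.items()) / total_weight
```

For hypothetical cells of 0.80, 0.60 and 0.70, the composite is (35 x 0.80 + 30 x 0.60 + 35 x 0.70) / 100 = 0.705.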
v1.4.15 · 2026-05-15 · Homepage OG card gains a call-to-action: emerald pill-shaped "See live rankings →" in the bottom-right corner, replacing the previous "Built with obsessive care for truth" tagline. Closes the last issue OpenGraph debugger flagged ("Missing call-to-action in your image"). The CTA is visual-only inside the PNG (not a clickable link - it's part of the social-share image), but it telegraphs the action a viewer should take if they click through.
v1.4.14 · 2026-05-15 · Homepage OG image fixed: regenerated at native 1200x630 / 408KB (was 2400x1260 / 1.14MB - exceeded WhatsApp's <600KB ceiling and was over-spec for OG's recommended dimensions). Same fix already applied to per-model OG cards in v1.4.11; this brings the homepage card into line. Also extended page title from 47 to 51 characters ("AGI Ranker - Open AGI Score for Frontier AI Models") to land in the optimal 50-60 char window for SERP and OG previews.
v1.4.13 · 2026-05-13 · Developer-mode analytics toggle. Visit /?dev=1 in any browser on any device to disable Vercel Analytics tracking for that browser (sets the localStorage va-disable flag Vercel respects); visit /?dev=0 to re-enable it. A confirmation toast appears for ~3.5s. The URL is cleaned after the action so the param doesn't persist on refresh or share. Designed so that Barak's own repeat visits do not inflate metrics, without requiring browser DevTools console access (which is unrealistic on mobile).
v1.4.12 · 2026-05-13 · "Last updated" freshness indicator added to the leaderboard header line. It auto-bumps on every commit that touches models.json - the date comes from the HTTP Last-Modified header Vercel sets on the file, so it needs no manual maintenance. It complements the existing "Most recent eval" date (which signals source-side freshness) with a "Last updated" date (which signals our integration activity). A visual cue for returning users that the site is actively maintained.
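As an illustration of turning a Last-Modified header into a display date, here is a short sketch using only the Python standard library. The function name and the sample header value are hypothetical, and the live site presumably does the equivalent in the browser rather than in Python.

```python
from email.utils import parsedate_to_datetime


def last_updated(last_modified_header: str) -> str:
    """Parse an HTTP-date (e.g. a Last-Modified value) and return YYYY-MM-DD."""
    return parsedate_to_datetime(last_modified_header).date().isoformat()


# Hypothetical header value, as a server might send for models.json.
print(last_updated("Wed, 13 May 2026 10:15:00 GMT"))
```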
v1.4.11 · 2026-05-13 · Per-model OG cards properly delivered to social-media crawlers. Pre-generated 15 per-model HTML files at /model/{slug}.html with per-model meta tags (title, og:image, og:description, canonical URL). Social-media bots don't execute JavaScript, so the v1.4.10 client-side meta updates never reached them - they saw only the homepage card. With per-model HTML now served via Vercel cleanUrls, OG/Twitter/WhatsApp/Slack previews show the correct model-specific card and copy. Also reduced PNG file size from 1.14MB to ~380KB (native 1200x630 instead of 2x device scale) to fit WhatsApp's <600KB ceiling.
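Because crawlers read only the static markup, each per-model file has to carry its own meta tags. The sketch below shows one way such a file could be rendered. It is not the actual generator: the function name, the `https://example.com` base URL, the choice of tags, and the assumption that the page slug and the OG image name are the same string are all hypothetical.

```python
from html import escape


def render_model_page(slug: str, title: str, description: str,
                      base_url: str = "https://example.com") -> str:
    """Return a static HTML shell whose meta tags are readable without JavaScript."""
    page_url = escape(f"{base_url}/model/{slug}", quote=True)
    image_url = escape(f"{base_url}/og/{slug}.png", quote=True)
    t = escape(title, quote=True)        # escape so quotes and & cannot break attributes
    d = escape(description, quote=True)
    return "\n".join([
        "<!doctype html>",
        '<html lang="en">',
        "<head>",
        f"<title>{t}</title>",
        f'<meta name="description" content="{d}">',
        f'<link rel="canonical" href="{page_url}">',
        f'<meta property="og:title" content="{t}">',
        f'<meta property="og:description" content="{d}">',
        f'<meta property="og:image" content="{image_url}">',
        f'<meta property="og:url" content="{page_url}">',
        "</head>",
        "<body></body>",
        "</html>",
    ])
```

Writing one such file per model to /model/{slug}.html gives every share URL a model-specific preview even for bots that never run the client-side code.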
v1.4.10 · 2026-05-13 · Per-model OG cards. Each of the 15 model URLs now has its own social-share preview image at /og/{apiName}.png, showing the model's AGI Score, tier badge, 5-component breakdown, and rank within the canonical view. Social shares of /model/{slug} URLs (X, LinkedIn, Slack, Discord, iMessage) now display model-specific imagery rather than the homepage card. Cards generated via tools/regenerate_model_og_cards.py - re-run whenever scores materially change.
v1.4.9 · 2026-05-13 · Capability heatmap shipped. New section between Corrections Log and By Capability Area showing every scoreable model on every cognitive component in one grid. Color-coded by score (rose → amber → emerald), sorted by AGI Score descending, with PROVISIONAL rows visually flagged. Each model name links to its /model/{slug} detail page. Designed to be screenshot-shareable - one image tells the strengths-and-weaknesses story across the roster.
v1.4.8 · 2026-05-13 · Hero clarity rewrite. The subhead now states what AGI Ranker measures (how close each frontier AI is to AGI) and defines the score scale (0-100, with 100 as the AGI threshold) up-front instead of burying that context in the modal. Added a second smaller paragraph surfacing the three trust differentiators: independent-verification preference, no-imputation policy, public Corrections Log. Improves mass-market readability without weakening the rigor signal.
v1.4.7 · 2026-05-13 · Per-model URL routing shipped. Each model in the roster now has a shareable, SEO-indexable URL at /model/{apiName}. The existing detail view opens automatically when the URL is loaded directly, and clicking "View" on a leaderboard row updates the URL via History API. Document title and meta description update dynamically per model. sitemap.xml added with all 15 model URLs.
v1.4.6 · 2026-05-13 · Corrections Log refreshed to reflect Round 20: source-tier upgrades table now lists the Claude Opus 4.7 (thinking) SWE-V T2→T1 swap alongside the headline DeepSeek correction. New section added for the four Round 20 vals.ai T1 additions (GPT-5.5, GPT-5.4, Gemini 3.1 Pro Preview, Claude Opus 4.6 Thinking on SWE-bench Verified). Summary stats updated: 14 cells corrected, 23 cells from independent verification.
v1.4.5 · 2026-05-13 · Round 20 follow-up: Claude Opus 4.7 (thinking) SWE-bench Verified source-tier upgrade. The vals.ai Settings panel confirmed the unsuffixed "Claude Opus 4.7" row at vals.ai was tested with "Thinking Type: Adaptive" (= Anthropic thinking variant per our convention). Replaced the existing T2 benchlm.ai cell (0.876) with the T1 vals.ai cell (0.820). Lower value, higher trust weight - exactly the methodology pattern: independent T1 verification supersedes T2 aggregation.
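The supersession rule in this entry (an independent T1 cell replaces a T2 aggregator cell, even when the T1 value is lower) can be sketched as below. The `Cell` type and `select_cell` helper are hypothetical names; the two example cells are the ones quoted in the entry.

```python
from typing import NamedTuple, Optional, Sequence


class Cell(NamedTuple):
    source: str
    tier: int     # 1 = independent verification (T1), 2 = aggregation (T2)
    value: float


def select_cell(candidates: Sequence[Cell]) -> Optional[Cell]:
    """Pick the highest-trust cell; the reported value plays no part in the choice."""
    if not candidates:
        return None
    return min(candidates, key=lambda c: c.tier)


chosen = select_cell([
    Cell("benchlm.ai", 2, 0.876),
    Cell("vals.ai", 1, 0.820),
])
print(chosen)
```

Sorting on tier alone is the point of the design: a higher number from a lower-trust source never wins.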
v1.4.4 · 2026-05-13 · Round 20 vals.ai harvest integrated. Four T1 SWE-bench Verified cells added: GPT-5.5 (0.826), GPT-5.4 (0.782), Gemini 3.1 Pro Preview (0.788), Claude Opus 4.6 Thinking (0.782). The Coding specialty now has 9 eligible models (up from 6) - ChatGPT 5.5, ChatGPT 5.4, and Gemini 3.1 Pro Preview moved from "insufficient evidence" into the main Coding ranking.
v1.4.3 · 2026-05-13 · Reasoning tab now shows a data-sparsity disclaimer. With only 2 benchmarks (ARC-AGI-2 and AIME 2025) and most models having only one measured, single-cell rankings here can be misleading. We kept minBenchmarks at 1 (rather than raising to 2, which would leave only ~2 models ranked) and added a visible amber disclaimer so users understand the limitation.
v1.4.2 · 2026-05-12 · Specialty eligibility raised to minimum 2 cells for Coding, Knowledge, and Agentic (Reasoning stays at 1 because it has only 2 benchmarks total and is already protected by the 85% shrinkage threshold). Single-cell models on hard benchmarks were being unfairly dragged to the bottom; they now appear in a separate "insufficient evidence" section.
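A minimal sketch of the eligibility split described here. The per-specialty minimums come from this entry; the function name, the shape of the input, and the decision to leave models with zero cells out of both lists are assumptions for illustration.

```python
from typing import Dict, List, Tuple

# Minimum measured cells needed to be ranked in each specialty (from this entry).
MIN_CELLS = {"Coding": 2, "Knowledge": 2, "Agentic": 2, "Reasoning": 1}


def partition(specialty: str, cell_counts: Dict[str, int]) -> Tuple[List[str], List[str]]:
    """Split models into (ranked, insufficient evidence) for one specialty.

    cell_counts maps model name -> number of measured benchmarks in that specialty.
    Models with no cells at all appear in neither list (an assumption).
    """
    floor = MIN_CELLS[specialty]
    ranked = [m for m, n in cell_counts.items() if n >= floor]
    insufficient = [m for m, n in cell_counts.items() if 0 < n < floor]
    return ranked, insufficient
```

For example, a hypothetical model with a single Coding cell would land in the "insufficient evidence" list for Coding, while the same single cell would still qualify it for Reasoning.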
v1.4.1 · 2026-05-12 · Reasoning specialty (and any future 2-benchmark specialty) now uses an 85% shrinkage coverage threshold instead of the flat 60%. Prevents a single high-weight cell from producing an inflated specialty score.
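To see why the higher threshold matters for a two-benchmark specialty, consider the sketch below. The function names and the 70 / 30 example weights are hypothetical, and treating "coverage" as the covered share of sub-weight that triggers shrinkage when it falls below the threshold is an assumption about how the threshold is applied; the shrinkage formula itself is not shown.

```python
from typing import Dict, Set


def shrinkage_threshold(n_benchmarks: int) -> float:
    """85% for two-benchmark specialties, otherwise the flat 60%."""
    return 0.85 if n_benchmarks == 2 else 0.60


def needs_shrinkage(sub_weights: Dict[str, float], measured: Set[str]) -> bool:
    """True when the covered share of sub-weight falls below the specialty's threshold."""
    total = sum(sub_weights.values())
    covered = sum(w for b, w in sub_weights.items() if b in measured)
    return covered / total < shrinkage_threshold(len(sub_weights))


# Hypothetical weights: one cell alone covers 70% of the specialty.
reasoning = {"ARC-AGI-2": 70, "AIME 2025": 30}
print(needs_shrinkage(reasoning, {"ARC-AGI-2"}))
```

Under a flat 60% threshold, that single 70%-weight cell would clear the bar and pass through unshrunk; at 85% it does not.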
v1.4.0 · 2026-05-12 · Cross-component coverage floor introduced. Three-section visibility layout (RANKED / PROVISIONAL / AWAITING). 40% core-component coverage requirement (Reasoning & Agency) added.
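The coverage floor and the three-section layout can be sketched as follows. This is a simplification under stated assumptions: coverage is taken to be the measured share of each core component's sub-weight, the sub-weights and benchmark names in the usage note are invented, and other conditions that affect ranking (such as editorial holds) are ignored.

```python
from typing import Dict, Set

CORE_COMPONENTS = ("Reasoning", "Agency")
CORE_FLOOR = 0.40  # 40% core-component coverage requirement


def coverage(sub_weights: Dict[str, float], measured: Set[str]) -> float:
    """Share of a component's sub-weight backed by a measured benchmark."""
    total = sum(sub_weights.values())
    return sum(w for b, w in sub_weights.items() if b in measured) / total


def section(components: Dict[str, Dict[str, float]], measured: Set[str]) -> str:
    """Place a model in RANKED, PROVISIONAL or AWAITING (simplified)."""
    if not measured:
        return "AWAITING"  # no verified results yet
    for name in CORE_COMPONENTS:
        if coverage(components[name], measured) < CORE_FLOOR:
            return "PROVISIONAL"  # scored, but below the floor on a core component
    return "RANKED"
```

With hypothetical Agency sub-weights of 35 and 65, a model that has only the 35-weight benchmark sits at 35% coverage and stays PROVISIONAL; adding the other benchmark lifts it over the floor.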