AGI Score Methodology

Every formula, weight, and trust-tier rule behind the AGI Score, publicly documented. The same methodology renders interactively on the leaderboard; this page is the complete reference. Raw data: models.json (CC BY 4.0).

How the score is built

TRANSPARENT METHODOLOGY

How the AGI Score is Calculated

A transparent, reproducible composite index designed for maximum signal and minimum noise.

Data Ingestion & Curation

We aggregate scores from major public AI benchmarks: GPQA Diamond, HLE, ARC-AGI-2, SWE-bench Verified/Pro, LiveBench, MMMU-Pro, Tau-bench, OSWorld, BrowseComp, Aider Polyglot, AIME 2025, and more. Each cell is cited so any score can be traced back to its origin. We do not run benchmarks ourselves - we aggregate what others have already published. Source quality is tiered: independent benchmark leaderboards (Tier 1) count fully at 1.00×; third-party evaluators with provider involvement (Tier 2) at 0.85×; self-reports from labs with verified track records (Tier 3) at 0.75×; non-verified labs and commentary content (Tier 4) are excluded from scoring entirely.

Note on the AA Intelligence Index (removed June 2026): We previously carried Artificial Analysis's composite Intelligence Index as a Language signal. We have removed it from the AGI Score. Their v4.1 revision turned it into an explicitly agentic composite (GDPval-AA, Terminal-Bench 2.1, τ³-Banking, SciCode, HLE, GPQA and more) that re-bundles benchmarks we already score directly - keeping it would double-count those signals across components and mislabel agency as language. Language now rests on LiveBench; a dedicated language/writing benchmark is on our roadmap to restore a second independent source. We continue to track the AA Index as an external reference.

Normalization to Best-Human

Each raw benchmark score is rescaled so 100 = best-human performance on that task. AGI Score = 100 thus marks the genesis of AGI per our working definition: an AI that surpasses the best human on every purely brain-based intellectual task. Embodiment, sensory acquisition, and lived experience are explicitly out of scope - those are separate problems. Scores climb past 100 as the AI grows from genesis into super-human (ASI-direction) territory; we preserve those scores rather than clamping, so the index stays informative in an ASI / post-AGI world.
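
As a rough illustration of the rescaling, the sketch below assumes a plain linear rescale against a per-benchmark best-human reference. The function name and the example numbers are hypothetical and are not taken from models.json; the text above only fixes that 100 means best-human and that scores above 100 are not clamped.

```python
def normalize_to_best_human(raw_score: float, best_human_score: float) -> float:
    """Rescale a raw benchmark score so that 100 equals best-human performance.

    Assumption: a simple linear rescale. Values above 100 are returned as-is
    (no clamping), matching the rule that super-human scores are preserved.
    """
    if best_human_score <= 0:
        raise ValueError("best_human_score must be positive")
    return 100.0 * raw_score / best_human_score


# Hypothetical benchmark where the best human scores 0.90.
below_human = normalize_to_best_human(0.45, 0.90)  # half of best-human
above_human = normalize_to_best_human(0.99, 0.90)  # stays above 100
```

Leaving the result unclamped is what lets the same scale keep discriminating between models once they pass the human reference.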

5-Component Aggregation

Benchmarks roll up into 5 cognitive components AGI must master: Agency (35%), Fluid Reasoning (29%), World Knowledge (15%), Multimodal Perception (11%), Language Production (10%). Within each component, the score is the weighted mean of all the model's contributing benchmarks; each benchmark's effective weight is sub_weight × source_tier_multiplier. Sub-weights are fixed per benchmark and reflect each benchmark's relative importance within a component (e.g., ARC-AGI-2 carries 25% of Reasoning, GPQA carries 20%). The saturation rule kicks in dynamically: if the IQR of normalized scores among the top-10 ranked models on a benchmark drops below 5 percentage points, that benchmark's sub-weight is halved and the freed weight is redistributed to non-saturated benchmarks in the same component - keeping the component score responsive to whichever benchmarks still discriminate frontier models. Source tiering applies a multiplier per cell based on source independence (T1 1.00×, T2 0.85×, T3-Verified-Lab 0.75×; T4 excluded from scoring entirely). Component weights are user-adjustable in the Explorer below.
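
The component roll-up and the saturation rule described above could be sketched as follows. This is a minimal illustration, not the site's implementation: the function names are hypothetical, the quartile method (Python's default "exclusive" method) is an assumption, and so is the choice to redistribute freed weight in proportion to the remaining sub-weights, since the text does not say how the freed weight is split.

```python
from statistics import quantiles

# Source-tier multipliers from the text; Tier 4 is excluded from scoring.
TIER_MULTIPLIER = {1: 1.00, 2: 0.85, 3: 0.75}


def component_score(cells):
    """Weighted mean over the benchmarks a model has in one component.

    `cells` is a list of (normalized_score, sub_weight, source_tier) tuples.
    Effective weight = sub_weight * tier multiplier. Returns None if no
    scoreable cell is present.
    """
    num = den = 0.0
    for score, sub_weight, tier in cells:
        if tier not in TIER_MULTIPLIER:  # Tier 4 or unknown: not scored
            continue
        weight = sub_weight * TIER_MULTIPLIER[tier]
        num += weight * score
        den += weight
    return num / den if den > 0 else None


def is_saturated(top10_scores, threshold_pp=5.0):
    """True when the IQR of the top-10 normalized scores is below the threshold."""
    q1, _, q3 = quantiles(top10_scores, n=4)  # assumed quartile method
    return (q3 - q1) < threshold_pp


def apply_saturation(sub_weights, saturated):
    """Halve saturated sub-weights and hand the freed weight to the rest.

    Assumption: freed weight is shared in proportion to the non-saturated
    benchmarks' existing sub-weights. If every benchmark is saturated, the
    weights are simply halved.
    """
    new = dict(sub_weights)
    freed = 0.0
    for name in saturated:
        freed += new[name] / 2
        new[name] /= 2
    live = [name for name in new if name not in saturated]
    live_total = sum(new[name] for name in live)
    if live_total > 0:
        for name in live:
            new[name] += freed * new[name] / live_total
    return new
```

Because the component score divides by the sum of effective weights, a lower-tier source both counts for less and shrinks the denominator, so it never drags a score toward zero on its own.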

AGI Score = Σ (wᵢ × Cᵢ) / Σ wᵢ

where Cᵢ = component score (shrunk toward median if thin-coverage), wᵢ = component weight
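
A direct reading of the formula above in code, as a sketch. The default weights are the ones stated in the aggregation step; the function name is hypothetical, and dropping a missing component from both sums (so the denominator renormalizes) is an assumption consistent with the Σ wᵢ denominator, not a confirmed detail.

```python
DEFAULT_WEIGHTS = {
    "Agency": 35,
    "Fluid Reasoning": 29,
    "World Knowledge": 15,
    "Multimodal Perception": 11,
    "Language Production": 10,
}


def agi_score(component_scores, weights=DEFAULT_WEIGHTS):
    """Compute sum(w_i * C_i) / sum(w_i) over the components that have a score.

    `component_scores` maps component name -> score (already shrunk where
    coverage is thin) or None when the component has no data.
    """
    present = {k: v for k, v in component_scores.items() if v is not None}
    total_weight = sum(weights[k] for k in present)
    if total_weight == 0:
        return None
    return sum(weights[k] * v for k, v in present.items()) / total_weight
```

Passing a different `weights` mapping corresponds to adjusting the sliders; the canonical score always uses the defaults.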

Sparse-Data Handling (Asymmetric Pull-Down + Renormalize)

Benchmark coverage is uneven. Within each component we compute the weighted mean over present benchmarks (renormalizing the denominator). Then asymmetric pull-down shrinkage: if a model's coverage in a component is below 60% AND the raw score is above the population median, we pull it toward the median proportionally to coverage shortfall. Low scores with thin coverage stay low - no reward for hiding weaknesses. High scores with thin coverage get discounted toward "typical." Models need ≥3 components and ≥5 benchmarks (with Reasoning + Agency required) to be ranked. Cross-component coverage floor (v1.4): a model with 2+ components below their coverage thresholds gets demoted from RANKED to PROVISIONAL - even if its present cells are exceptional. Thresholds: 30% sub-weight for World Knowledge / Multimodal / Language; 40% for the core components (Fluid Reasoning and Agency, the load-bearing AGI dimensions per the locked definition). This prevents selection-bias rankings: a model can only claim near-AGI position if its evidence is reasonably broad across capability dimensions. The saturation rule separately protects against dead benchmarks: any benchmark whose IQR among the top-10 ranked models drops below 5pp gets its sub-weight halved and redistributed to non-saturated benchmarks in the same component.
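
The pull-down step can be sketched in a few lines, using the linear form given under Core Principles: final = median + (raw - median) × (coverage / 0.6). The function name is hypothetical and the numbers in the example are illustrative only.

```python
def pull_down(raw, median, coverage, threshold=0.60):
    """Asymmetric shrinkage toward the population median.

    Only scores that are both thinly covered (coverage < threshold) and
    above the median are moved. Everything else is returned unchanged.
    """
    if coverage < threshold and raw > median:
        return median + (raw - median) * (coverage / threshold)
    return raw


# Illustrative values: population median 70, model covered on 30% of sub-weight.
discounted = pull_down(90.0, 70.0, 0.30)  # high score, thin coverage: pulled down
unchanged = pull_down(50.0, 70.0, 0.30)   # low score, thin coverage: stays low
```

With coverage at half the threshold, the model keeps half of its lead over the median; at zero coverage it would keep none.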

Specialty Rankings & Custom View

Above the leaderboard, six tabs switch the ranking between the canonical AGI Score and task-focused views. Each specialty (Coding, Reasoning, Knowledge, Tool Use) re-ranks models using only the benchmarks that test that capability, with within-set sub-weights renormalized to 100%. Specialty scores apply the same source-tier weighting and asymmetric pull-down shrinkage as the main score - so a model with thin coverage in a specialty is pulled toward the population median for that specialty, never the other way. 100 in a specialty means matching best-human on its benchmarks; above 100 = super-human on those specific tasks. This is not AGI - AGI requires the full battery, which only the canonical AGI Score measures. The Custom tab (visible when you adjust the Explorer's weight sliders) shows the AGI Score under your settings; the AGI tab always reverts to default weights, keeping the canonical view unambiguous.

Calibration Constants & Operational Discipline

reproducibility addendum

Full reproducibility requires three things beyond the formula: the exact numeric thresholds we apply, the within-set sub-weights used in specialty views, and the discipline we hold against common mis-attribution patterns. All disclosed below.

Current as of v1.4. These are operational values, not permanent invariants. Future methodology versions may revise specific constants as the methodology evolves and as benchmark coverage matures; substantive changes will be noted in the version history.

Numeric thresholds

Constant	Value	Used in
Asymmetric shrinkage coverage threshold (main AGI Score & specialties with 3+ benchmarks)	60%	Step 4
Asymmetric shrinkage coverage threshold (specialties with 2 benchmarks, e.g. Reasoning)	85%	Step 5
Saturation IQR threshold (top-10 ranked models)	5 pp	Step 4
Coverage floor - Fluid Reasoning & Agency (core)	40%	Step 4
Coverage floor - Knowledge / Multimodal / Language	30%	Step 4
Max thin components allowed in the RANKED tier	1	Step 4
Eval-date grace window vs. model release	30 days	below

Specialty sub-weight splits

Within each specialty tab, only the relevant benchmarks contribute, with within-set sub-weights renormalized to sum to 100. The splits are fixed. Minimum cells required for a model to enter the specialty ranking: 2 for Coding / Knowledge / Tool Use; 1 for Reasoning (which has only 2 benchmarks total). Models below the minimum but with at least one relevant cell appear in a separate "insufficient evidence" section below the ranking.

Note on Coding methodology (v1.5): Real-world coding in 2026 is agentic - it happens via tools like Claude Code, Codex, Cursor, not bare LLM completion. The Coding specialty composition reflects this: SWE-bench Verified and Pro (run with agentic harnesses by all labs that submit) plus Terminal-Bench (general agentic terminal-task benchmark). Aider Polyglot dropped from the composition in v1.5.0 (zero variant-distinct roster coverage after Round 14 cleanup).

Contamination note for SWE-bench Verified: Public-test-set SWE-bench scores can be inflated by training-data leakage. Private-test-set evaluators (e.g., vals.ai) tend to report 5-15pp lower numbers than public-leaderboard aggregators for the same model. We prefer T1 private-test sources where available; see the Corrections Log for the DeepSeek V4 Pro GPQA correction (0.901 self-report → 0.729 independent T2) as a concrete example of the contamination correction in action.

Harness disclosure for Terminal-Bench: Agentic-benchmark scores depend on the scaffold used to execute tasks. Our Terminal-Bench cells come primarily from vals.ai (daytona / Terminus-2 harness, pass@1 over 89 tasks). benchlm.ai provides triangulating coverage with a different scaffold. Real users running each model with its native agentic tool (Claude Code for Anthropic, Codex for OpenAI, etc.) may experience different relative performance than benchmark scores predict. Treat the score as one input; consult practitioner reports for workflow-specific decisions.

Specialty	Benchmarks & within-set sub-weights
Coding	SWE-bench Verified 35 · SWE-bench Pro 30 · Terminal-Bench 35
Reasoning	ARC-AGI-2 70 · AIME 2025 30
Knowledge	HLE 40 · GPQA Diamond 35 · MMMU-Pro 25
Tool Use	OSWorld 30 · BrowseComp 25 · Tau-bench retail 22.5 · Tau-bench airline 22.5
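
The specialty splits and minimum-cell rule above might be applied as in the sketch below. The function name and the input shape are hypothetical; coverage is assumed to be the share of within-set sub-weight that is present, and the returned score is the raw weighted mean before any shrinkage is applied.

```python
SPECIALTY_SPLITS = {
    "Coding": {"SWE-bench Verified": 35, "SWE-bench Pro": 30, "Terminal-Bench": 35},
    "Reasoning": {"ARC-AGI-2": 70, "AIME 2025": 30},
    "Knowledge": {"HLE": 40, "GPQA Diamond": 35, "MMMU-Pro": 25},
    "Tool Use": {"OSWorld": 30, "BrowseComp": 25,
                 "Tau-bench retail": 22.5, "Tau-bench airline": 22.5},
}
MIN_CELLS = {"Coding": 2, "Knowledge": 2, "Tool Use": 2, "Reasoning": 1}


def specialty_raw(specialty, cells):
    """Raw specialty score and coverage for one model.

    `cells` maps benchmark name -> (normalized_score, tier_multiplier).
    Returns (score, coverage), or None when the model has fewer cells than
    the specialty's minimum ("insufficient evidence").
    """
    split = SPECIALTY_SPLITS[specialty]
    present = {b: c for b, c in cells.items() if b in split}
    if len(present) < MIN_CELLS[specialty]:
        return None
    num = den = 0.0
    for bench, (score, tier_mult) in present.items():
        weight = split[bench] * tier_mult
        num += weight * score
        den += weight
    coverage = sum(split[b] for b in present) / 100.0  # assumed definition
    return num / den, coverage
```

A model with only ARC-AGI-2 in Reasoning would come back with 70% coverage, which is below the 85% threshold for two-benchmark specialties and therefore subject to pull-down if its score is above the median.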

Value view: how value for money is derived

The Value tab pairs each model's published capability for the selected area (Overall = the AGI Score; otherwise the specialty score above - the exact same number that area's tab shows) with an estimated API cost, so capability can be weighed against price.

Estimated cost is a blended price per 1M tokens at a typical input:output mix for that workload (coding is input-heavy over a large cached context; reasoning is output-heavy; etc.), using each lab's published list prices and its cached-input rate where available (otherwise about 10% of the input rate for the cached share). It is an estimate of typical cost, not a measured per-task bill.
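
One way to read the blended-price estimate, as a hedged sketch: the function and parameter names are hypothetical, the per-workload input share and cached share are inputs here because the exact mixes are not listed on this page, and the prices in the example are made up.

```python
def blended_cost_per_mtok(input_price, output_price, input_share,
                          cached_share=0.0, cached_input_price=None):
    """Estimated blended price per 1M tokens.

    input_share:  fraction of tokens that are input (the rest are output).
    cached_share: fraction of input tokens billed at the cached-input rate.
    When no cached rate is published, fall back to 10% of the input price.
    """
    if cached_input_price is None:
        cached_input_price = 0.10 * input_price
    input_cost = ((1.0 - cached_share) * input_price
                  + cached_share * cached_input_price)
    return input_share * input_cost + (1.0 - input_share) * output_price


# Hypothetical list prices: $4 input, $20 output per 1M tokens, 75/25 mix.
no_cache = blended_cost_per_mtok(4.0, 20.0, input_share=0.75)
half_cached = blended_cost_per_mtok(4.0, 20.0, input_share=0.75, cached_share=0.5)
```

The result is an estimate of typical cost for a workload shape, so two models can swap order when the input share or cached share changes.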

Value for money is the extra capability a model delivers over the weakest option shown, per estimated dollar: (score − lowest score in view) / cost. We subtract that floor on purpose - a capability score has no true zero (a coding score of 0 is not "no value"), so raw score-per-dollar would over-reward the cheapest model no matter how capable it is. Subtracting the floor measures the extra capability you actually buy.

Each area surfaces three picks: Top (highest capability), Budget (cheapest to run), and Best value (the most extra capability per dollar). The ranking shows the picks rather than a raw value number, which is ambiguous to read in isolation.
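
The value formula and the three picks can be illustrated with a short sketch. The function name and the three example models are hypothetical; the formula is the one stated above, (score − lowest score in view) / cost.

```python
def value_picks(models):
    """Value for money and the three picks for one capability area.

    `models` maps name -> (capability_score, estimated_cost_per_1m_tokens).
    """
    floor = min(score for score, _ in models.values())
    value = {name: (score - floor) / cost
             for name, (score, cost) in models.items()}
    return {
        "value": value,
        "top": max(models, key=lambda n: models[n][0]),     # highest capability
        "budget": min(models, key=lambda n: models[n][1]),  # cheapest to run
        "best_value": max(value, key=value.get),            # most extra capability per dollar
    }


# Hypothetical view with three models.
picks = value_picks({"A": (90.0, 10.0), "B": (80.0, 2.0), "C": (70.0, 1.0)})
```

In this example raw score-per-dollar would favor C (70 per dollar), while the floor-subtracted measure gives C a value of 0 and picks B, which is the behavior the floor subtraction is meant to produce.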

Variant-attribution discipline

A cell enters scoring only when the source page's row label identifies the variant unambiguously. The lab's named max-tier configuration (OpenAI Pro, Anthropic thinking, etc.) must appear in the source label. Effort-knob suffixes are not variants.

Accept rows labeled e.g. "Claude Opus 4.7 (thinking)" or "GPT-5.5 Pro".

Reject rows like "GPT-5.5 (xHigh)", "GPT-5.5 (High)", "DeepSeek V4 Pro (Max)" - these are base or default config with an effort knob, not the separately-priced Pro / thinking SKU.

Reject generic family rows where a single row on the source cannot distinguish base from Pro / thinking.

Eval-date sanity check

A cell whose eval_date predates the model's release_date by more than 30 days is rejected as a mis-attribution. The 30-day grace window covers pre-release lab evaluations; anything older almost certainly tested a preceding model that happens to share part of the name.
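
The date check is simple enough to state as code. This sketch uses the eval_date and release_date field names from the text; the function name and the example dates are hypothetical.

```python
from datetime import date, timedelta

GRACE_WINDOW = timedelta(days=30)


def eval_date_ok(eval_date: date, release_date: date) -> bool:
    """Accept a cell unless its eval predates the release by more than 30 days."""
    return (release_date - eval_date) <= GRACE_WINDOW


# Hypothetical model released on 2026-06-16.
release = date(2026, 6, 16)
pre_release_eval = eval_date_ok(date(2026, 5, 20), release)  # 27 days early: kept
stale_eval = eval_date_ok(date(2026, 4, 1), release)         # 76 days early: rejected
```

Evaluations dated after the release give a negative difference and always pass this check.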

Version history

v1.11.9 · 2026-06-19 · GLM 5.1 demoted to Provisional. Two BenchLM cells are voided: SWE-bench Pro was a Provider-exact relay of Z.AI's own figure (not an independent eval), and Tau-bench Retail no longer exists on BenchLM (replaced by tau-Telecom, a different benchmark we do not score). GLM 5.1 now rests on four independent T1 cells only. Provisional AGI Score about 71.9. No imputation.

v1.11.8 · 2026-06-19 · GLM 5.2 harvest: vals.ai + LiveBench. Five independent T1 cells now ground GLM 5.2: GPQA Diamond 0.86, SWE-bench Verified 0.83, and Terminal-Bench 2.1 0.68 from vals.ai (replacing earlier Artificial Analysis figures on the overlapping benchmarks), HLE 0.40 from our standardized AA source, and LiveBench 0.76. LM Arena Text Elo 1471 is published. Stays Provisional: agency sub-weight coverage sits just below the 40% core floor alongside absent multimodal data. No imputation.

v1.11.7 · 2026-06-17 · GLM 5.2 graduates to Provisional. Same day it launched, Artificial Analysis published independent results, so GLM 5.2 moves from Awaiting Verification to Provisional on three T1 cells: GPQA Diamond 0.89, Terminal-Bench 2.1 0.75, and HLE 0.40. The independent numbers came in below the lab's launch self-reports (which we had excluded) - exactly why we wait for them. It needs a fifth independent cell to reach Ranked. No imputation.

v1.11.6 · 2026-06-17 · GLM 5.2 added (Awaiting Verification). Z.AI released GLM 5.2 on June 16. So far only the lab's own launch-day numbers are public (its model card and a benchmark aggregator relaying the same figures), which we exclude as unverified self-reports. GLM 5.2 is listed as Awaiting Verification - no score - until an independent evaluator publishes results. No imputation.

v1.11.5 · 2026-06-16 · Hero badge. The launch-month pill now reads "Since May 2026" - a founding mark rather than a stale-looking date stamp. No scoring change.

v1.11.4 · 2026-06-16 · Count consistency. The hero subhead and structured data still read "15 benchmarks" after yesterday's removal of the AA Index; both now read 14, matching the live count. No scoring change.

v1.11.3 · 2026-06-16 · AA Intelligence Index removed from the score. Artificial Analysis shipped Intelligence Index v4.1, an explicitly agentic composite (GDPval, Terminal-Bench 2.1, τ³-Banking, SciCode, HLE, GPQA and more). Because it re-bundles benchmarks we already score directly, keeping it as our Language signal would double-count those results and mislabel agency as language - so we removed it from the AGI Score (14 live benchmarks now). Language rests on LiveBench; a dedicated language benchmark is on the roadmap. Most scores rise ~0.3-0.5 (the Index had been a slight drag); the ranking is unchanged. We still track the AA Index as an external reference.

v1.11.2 · 2026-06-16 · Cleaner number type. Scores and stats now render in Inter with tabular figures - more legible in the dense tables than the previous display face, and consistent across the site. Headings and the wordmark are unchanged. No change to any score.

v1.11.1 · 2026-06-16 · Value view polish. Refined the value-for-money UI and method after testing: the ranking now leads with the picks (Top / Best value / Budget) instead of a raw value index that was hard to read at a glance, the cost column is labelled $/1M tokens for clarity, and the methodology page now spells out exactly how value for money is derived. No change to any score.

v1.11.0 · 2026-06-15 · Value view. The Value tab ranks models by capability against an estimated API cost. Pick a capability area - Overall (the AGI Score) or Coding, Reasoning, Knowledge, Tool Use - and the table shows the exact same score that area's own tab shows, next to an estimated cost at that area's typical token mix (cache-aware). Sort by capability or by value for money, and read the picks at a glance: top capability, best value, and budget. A value-frontier graph plots the same models. Also renamed Google to Google DeepMind, and renamed the Agentic specialty to Tool Use (coding is also agentic, so the old name was ambiguous; Tool Use covers computer use, web browsing and tool/function calling beyond code). No change to the canonical AGI Score or any benchmark data.

v1.10.1 · 2026-06-15 · Full data refresh, plus MiniMax M2.7 (Provisional). A complete re-harvest of every model from independent sources corrected several stale or mislabeled cells (a DeepSeek score that was far too low, a Qwen coding number drawn from the wrong benchmark, and more), filled gaps, and upgraded many cells to higher-trust independent measurements. Effect on the board: the top stays a near-tie between ChatGPT 5.5 and Claude Opus 4.8; Qwen 3.7 Max earns a canonical rank as new coverage completes its profile; Gemini 3 Pro and DeepSeek V4 Pro rise on corrected data. MiniMax M2.7 (the prior MiniMax flagship) joins as Provisional. Every cell traces to an independent source. No imputation.

v1.10.0 · 2026-06-15 · MiniMax M3 added, Ranked at #6. The new MiniMax flagship enters scored entirely from independent sources (Artificial Analysis, vals.ai, benchlm, LiveBench) rather than the lab's own numbers. It is strong on multimodal and general knowledge, weaker on language, and its independently-measured score lands well below its launch claims, which is exactly why we wait for third-party data. Priced low, so it places well in the Value view. No imputation.

v1.9.3 · 2026-06-14 · Accessibility 100. Underlined the last in-text link that was distinguishable by color alone. Lighthouse: Accessibility 100, SEO 100, Best Practices 100 (desktop Performance 98).

v1.9.2 · 2026-06-14 · Accessibility and SEO pass. Form controls and the sort menu now carry proper labels, in-text links are underlined rather than color-only, the page has a main landmark for screen readers, and muted text is lightened to readable contrast. Two action links became buttons so search engines crawl every link. Lighthouse accessibility rises from 70 toward the high 90s, with no change to any ranking or score.

v1.9.1 · 2026-06-14 · Copy fix: 15 live benchmarks. The homepage, methodology, and page metadata now read 15 live benchmarks, matching the count the leaderboard has been computing for a while. The static text had lagged the live data by one. No scoring change.

v1.9.0 · 2026-06-12 · Sensitivity bands on every score, plus a Claude Fable 5 update. Each score on the main leaderboard now shows a sensitivity band (for example 87.02 ±3.6): we remove each of a model's benchmarks one at a time, re-run the entire scoring pipeline, and report how far the score moves. It is not a confidence interval - it shows how much a score depends on which benchmarks exist, and it widens honestly when coverage is thin. When two models' bands overlap, read their order as a statistical tie. Separately, Claude Fable 5 gains its Humanity's Last Exam result from our standardized independent source, lifting its provisional score to about 95.8. It stays Provisional by explicit editorial hold, with the reason published in the open dataset and shown on its badge: ARC-AGI-2 has not yet been run on Fable 5, and every ranked model's reasoning score includes that benchmark, so ranking it today would compare unlike baskets. It ranks the day that result publishes. No imputation, and no corner-cutting in either direction.

v1.8.1 · 2026-06-10 · Site visibility upgrade. The full technical methodology now lives at its own address (/methodology) rather than only inside a popup. Every model page now carries a written summary and a static benchmark table with source attribution, readable even without JavaScript - the same numbers as the interactive view. Added robots.txt and per-page structured data so search engines can find and understand every page. The open dataset (models.json) is now formally licensed CC BY 4.0 - cite it freely with attribution.

v1.8.0 · 2026-06-09 · Claude Fable 5 added, on launch day. Anthropic's new generally-available flagship (its Mythos-class model with production safeguards) enters as Provisional. It posts the strongest agentic-coding and knowledge results on the board, but day-one coverage is uneven across components, so it is not yet canonically ranked. Two launch figures arrived above their benchmarks' human ceilings from a single source; we are holding both pending independent confirmation rather than letting them inflate the score. Fable lifts to ranked once independent evaluations complete its profile. No imputation, even for the year's most anticipated launch.

v1.7.3 · 2026-06-09 · ARC-AGI-2 cleanup: one effort-tier fix, one version fix. Completing the High-effort standardization, ChatGPT 5.4's ARC-AGI-2 is now read at the same High tier as the rest of the column. Separately, a GLM ARC-AGI-2 result we had attributed to GLM 5.1 in fact belonged to the previous version, GLM 5; with no GLM 5.1 result published on that benchmark, the cell is removed rather than guessed. The correction lifts GLM 5.1 and takes it off the Reasoning leaderboard, where it no longer has a qualifying result. No imputation.

v1.7.2 · 2026-06-09 · ARC-AGI-2 read at one effort level. Reasoning models can be run at several effort settings, and one model on the board had been carried at a higher tier than the rest. We now standardize ARC-AGI-2 on the High-effort tier so the column compares like-for-like. Claude Opus 4.8 (thinking) gains its ARC-AGI-2 result and joins the Reasoning leaderboard. Effect on the top: ChatGPT 5.5 and Claude Opus 4.8 stay within a fraction of a point, with ChatGPT 5.5 nominally first. No imputation.

v1.7.1 · 2026-06-09 · SWE-bench Verified standardized on one independent source. Every model's SWE-bench Verified result now comes from the same independent third-party leaderboard (vals.ai), replacing a mix that leaned on a benchmark host whose public table has not refreshed since February. The result is consistent, current agentic-coding measurement across the whole column - and it tightens the top of the board: Claude Opus 4.8 (thinking) and ChatGPT 5.5 now sit within 0.2 points, effectively tied, with Opus 4.8 nominally first. No imputation.

v1.7.0 · 2026-06-05 · HLE standardized on one independent source; Opus 4.8 joins the ranked board; Opus 4.6 retired. Humanity's Last Exam is now sourced consistently from Artificial Analysis (independent, no-tools) for every model it covers, replacing a patchwork of sources that disagreed by up to 20 points on the same model. Claude Opus 4.8 enters the ranked leaderboard at #2 now that independent reasoning and language scores complete its coverage. Per our scope rule (latest two versions per lab), Claude Opus 4.6 leaves the board. No imputation.

v1.6.1 · 2026-06-04 · Value view polish. Added a reset for the workload-mix slider (back to the standard 75% input / 25% output), spaced out overlapping model labels on the scatter, and paused weight customization while the Value view is open (weights don't change price-performance).

v1.6.0 · 2026-06-04 · New: the Value view - capability per dollar. A price column (input/output API cost per million tokens) plus a Value tab that plots AGI Score against blended API price and highlights the value frontier - the best score available at each price point. A workload-mix slider weights input vs output cost for your use case. Pricing covers current frontier models; superseded or preview-only models without public pricing are labeled as such. Best value among frontier models - we do not track budget tiers. AGI Score remains the default view.

v1.5.3 · 2026-05-31 · Source upgrade for Claude Opus 4.8 (thinking). Its SWE-bench Verified result is now drawn from an independent third-party evaluation that corroborates the launch-day figure, replacing the lab self-report we carried at launch. Same value, higher-trust source; the model stays Provisional pending independent reasoning data. No imputation.

v1.5.2 · 2026-05-29 · Claude Opus 4.8 (thinking) added the day after launch, as Provisional. We hold its GPQA Diamond and AA Intelligence Index (independent T1) plus launch-day agentic results (SWE-bench Verified/Pro, OSWorld); reasoning-component coverage is still too thin for a canonical rank, so no AGI Score position yet. It lifts once independent reasoning and second-source agentic benchmarks publish. Claude Opus 4.6 stays on the board until then. No imputation.

v1.5.1 · 2026-05-23 · Roster expansion to 18 models. Added Qwen 3.7 Max (Provisional), Cursor Composer 2.5 (System entry - scores reflect the full product harness, not weights alone), and MiMo V2.5 Pro. Lifted Claude Opus 4.7 and Claude Opus 4.6 (thinking) from awaiting-verification to Provisional after new variant-distinct evaluations surfaced.

v1.5.0 · 2026-05-17 · Coding specialty restructured to reflect agentic-coding reality. Real coding in 2026 happens via tools like Claude Code, Codex, Cursor - not bare LLM completion. Composition: SWE-bench Verified 35 / SWE-bench Pro 30 / Terminal-Bench 35 (replacing Aider Polyglot, which had zero variant-distinct roster coverage after Round 14 cleanup). 9 of 15 roster models now have Terminal-Bench data (Round 25 harvest from vals.ai T1 + benchlm.ai T2 second-source). Added contamination note for SWE-bench Verified and harness disclosure for Terminal-Bench on the methodology page. Top of leaderboard shifts: GPT-5.5 takes #1 in Coding by ~0.7pp over Claude Opus 4.7 (thinking) - within statistical noise of the benchmarks' error bars, which is itself worth surfacing honestly. Each tab now answers its question directly.

v1.4.15 · 2026-05-15 · Homepage OG card gains a call-to-action: emerald pill-shaped "See live rankings →" in the bottom-right corner, replacing the previous "Built with obsessive care for truth" tagline. Closes the last issue OpenGraph debugger flagged ("Missing call-to-action in your image"). The CTA is visual-only inside the PNG (not a clickable link - it's part of the social-share image), but it telegraphs the action a viewer should take if they click through.

v1.4.14 · 2026-05-15 · Homepage OG image fixed: regenerated at native 1200x630 / 408KB (was 2400x1260 / 1.14MB - exceeded WhatsApp's <600KB ceiling and was over-spec for OG's recommended dimensions). Same fix already applied to per-model OG cards in v1.4.11; this brings the homepage card into line. Also extended page title from 47 to 51 characters ("AGI Ranker - Open AGI Score for Frontier AI Models") to land in the optimal 50-60 char window for SERP and OG previews.

v1.4.13 · 2026-05-13 · Developer-mode analytics toggle. Visit /?dev=1 on any browser/device to disable Vercel Analytics tracking for that browser (sets the localStorage va-disable flag Vercel respects); /?dev=0 to re-enable. Confirmation toast appears for ~3.5s. URL is cleaned after action so the param doesn't persist on refresh or share. Designed for Barak's own repeat visits to not inflate metrics, without requiring browser DevTools console access (which is unrealistic on mobile).

v1.4.12 · 2026-05-13 · "Last updated" freshness indicator added to the leaderboard header line. Auto-bumps every commit that touches models.json - the date comes from the HTTP Last-Modified header Vercel sets on the file, so no manual maintenance. Complements the existing "Most recent eval" date (which signals source-side freshness) with a "Last updated" date (which signals our integration activity). Visual cue for returning users that the site is actively maintained.

v1.4.11 · 2026-05-13 · Per-model OG cards properly delivered to social-media crawlers. Pre-generated 15 per-model HTML files at /model/{slug}.html with per-model meta tags (title, og:image, og:description, canonical URL). Social-media bots don't execute JavaScript, so the v1.4.10 client-side meta updates never reached them - they saw only the homepage card. With per-model HTML now served via Vercel cleanUrls, OG/Twitter/WhatsApp/Slack previews show the correct model-specific card and copy. Also reduced PNG file size from 1.14MB to ~380KB (native 1200x630 instead of 2x device scale) to fit WhatsApp's <600KB ceiling.

v1.4.10 · 2026-05-13 · Per-model OG cards. Each of the 15 model URLs now has its own social-share preview image at /og/{apiName}.png, showing the model's AGI Score, tier badge, 5-component breakdown, and rank within the canonical view. Social shares of /model/{slug} URLs (X, LinkedIn, Slack, Discord, iMessage) now display model-specific imagery rather than the homepage card. Cards generated via tools/regenerate_model_og_cards.py - re-run whenever scores materially change.

v1.4.9 · 2026-05-13 · Capability heatmap shipped. New section between Corrections Log and By Capability Area showing every scoreable model on every cognitive component in one grid. Color-coded by score (rose → amber → emerald), sorted by AGI Score descending, with PROVISIONAL rows visually flagged. Each model name links to its /model/{slug} detail page. Designed to be screenshot-shareable - one image tells the strengths-and-weaknesses story across the roster.

v1.4.8 · 2026-05-13 · Hero clarity rewrite. The subhead now states what AGI Ranker measures (how close each frontier AI is to AGI) and defines the score scale (0-100, with 100 as the AGI threshold) up-front instead of burying that context in the modal. Added a second smaller paragraph surfacing the three trust differentiators: independent-verification preference, no-imputation policy, public Corrections Log. Improves mass-market readability without weakening the rigor signal.

v1.4.7 · 2026-05-13 · Per-model URL routing shipped. Each model in the roster now has a shareable, SEO-indexable URL at /model/{apiName}. The existing detail view opens automatically when the URL is loaded directly, and clicking "View" on a leaderboard row updates the URL via History API. Document title and meta description update dynamically per model. sitemap.xml added with all 15 model URLs.

v1.4.6 · 2026-05-13 · Corrections Log refreshed to reflect Round 20: source-tier upgrades table now lists the Claude Opus 4.7 (thinking) SWE-V T2→T1 swap alongside the headline DeepSeek correction. New section added for the four Round 20 vals.ai T1 additions (GPT-5.5, GPT-5.4, Gemini 3.1 Pro Preview, Claude Opus 4.6 Thinking on SWE-bench Verified). Summary stats updated: 14 cells corrected, 23 cells from independent verification.

v1.4.5 · 2026-05-13 · Round 20 follow-up: Claude Opus 4.7 (thinking) SWE-bench Verified source-tier upgrade. The vals.ai Settings panel confirmed the unsuffixed "Claude Opus 4.7" row at vals.ai was tested with "Thinking Type: Adaptive" (= Anthropic thinking variant per our convention). Replaced the existing T2 benchlm.ai cell (0.876) with the T1 vals.ai cell (0.820). Lower value, higher trust weight - exactly the methodology pattern: independent T1 verification supersedes T2 aggregation.

v1.4.4 · 2026-05-13 · Round 20 vals.ai harvest integrated. Four T1 SWE-bench Verified cells added: GPT-5.5 (0.826), GPT-5.4 (0.782), Gemini 3.1 Pro Preview (0.788), Claude Opus 4.6 Thinking (0.782). The Coding specialty now has 9 eligible models (up from 6) - ChatGPT 5.5, ChatGPT 5.4, and Gemini 3.1 Pro Preview moved from "insufficient evidence" into the main Coding ranking.

v1.4.3 · 2026-05-13 · Reasoning tab now shows a data-sparsity disclaimer. With only 2 benchmarks (ARC-AGI-2 and AIME 2025) and most models having only one measured, single-cell rankings here can be misleading. We kept minBenchmarks at 1 (rather than raising to 2, which would leave only ~2 models ranked) and added a visible amber disclaimer so users understand the limitation.

v1.4.2 · 2026-05-12 · Specialty eligibility raised to minimum 2 cells for Coding, Knowledge, and Agentic (Reasoning stays at 1 because it has only 2 benchmarks total and is already protected by the 85% shrinkage threshold). Single-cell models on hard benchmarks were being unfairly dragged to the bottom; they now appear in a separate "insufficient evidence" section.

v1.4.1 · 2026-05-12 · Reasoning specialty (and any future 2-benchmark specialty) now uses an 85% shrinkage coverage threshold instead of the flat 60%. Prevents a single high-weight cell from producing an inflated specialty score.

v1.4.0 · 2026-05-12 · Cross-component coverage floor introduced. Three-section visibility layout (RANKED / PROVISIONAL / AWAITING). 40% core-component coverage requirement (Reasoning & Agency) added.

Technical methodology

The AGI Score is a composite metric that aggregates publicly available benchmark results into a single number tracking how close each frontier AI model is to AGI (defined as Score = 100, the genesis of AGI per our locked v0.7 definition).

Core Principles

Source-tier weighting + trust floor: Tier 1 - fully independent third-party verified (1.00×). Tier 2 - third-party evaluators with some model-provider involvement (0.85×). Tier 3 - self-reports from Verified labs with a track record of holding up under independent re-runs (0.75×). Tier 4 - self-reports from Not-verified labs and aggregator blogs / video - excluded from scoring entirely. Verified labs (initial set): Anthropic, OpenAI, Google DeepMind, Meta, DeepSeek. Not-verified (path to upgrade exists): Moonshot/Kimi, xAI, Mistral, Alibaba/Qwen, Zhipu/GLM, Baidu/ERNIE, Cohere. Promotion criterion: any Not-verified lab whose published numbers match independent third-party measurements (Vellum, vals.ai, AA, Scale AI) within ±2pp on 3+ benchmarks gets promoted. Source watchlist re-evaluated weekly; new model releases trigger same-day or next-day harvest cycles.
Asymmetric pull-down shrinkage: within each component (and within each specialty) we compute the weighted mean over present benchmarks. If coverage is below 60% AND the raw score is above the population median, we pull the score toward the median in proportion to the coverage shortfall: final = median + (raw - median) × (coverage / 0.6). Low scores with thin coverage stay low (no reward for hiding weaknesses); high scores with thin coverage get discounted toward typical. The same rule applies to the main AGI Score (across components) and to specialty rankings (within each specialty's benchmark set); two-benchmark specialties such as Reasoning use an 85% threshold in place of 60% (v1.4.1).
Cross-component coverage floor (v1.4): shrinkage alone can't always neutralize the "few-but-exceptional cells" artifact - if the only benchmarks a model was tested on happen to be its strongest, the score can be inflated by selection bias. We add a tier-level guard: a model with 2+ components below their coverage thresholds is demoted from RANKED to PROVISIONAL. Thresholds are 30% of sub-weight for World Knowledge / Multimodal / Language and 40% for the core components (Fluid Reasoning + Agency, the load-bearing AGI dimensions). The model's score is still computed and shown, but it is flagged with the "Provisional" badge and listed in a separate section below the canonical RANKED list, explicitly indicating uneven evidence across capability dimensions.
Three-section visibility (v1.4): the main leaderboard shows only RANKED models by default - the canonical answer to "who's closest to AGI on currently-available evidence." A toggle above the table reveals PROVISIONAL models (score computed but coverage uneven) and AWAITING VERIFICATION models (insufficient variant-distinct data to score) as separate sections below. Casual visitors see the consensus ranking; power users opt in to see the full picture with caveats. Same data, same methodology, layered presentation.
Saturation rule: if the IQR of normalized scores among the top-10 ranked models on a benchmark drops below 5pp, that benchmark's sub-weight is halved and the freed weight is redistributed to non-saturated benchmarks in the same component. Keeps the score responsive to whichever benchmarks still discriminate as frontier models converge.
Human baselines: Each benchmark is rescaled so 100 = best-human performance on that task. AGI Score = 100 = genesis of AGI per our locked v0.7 definition: an AI that surpasses the best human on every purely brain-based intellectual task (embodiment and sensory experience explicitly out of scope). Above 100 = super-human (ASI direction). Scores aren't capped, so the index stays informative post-AGI.
No imputation: If a model lacks a published score on a benchmark, that cell stays empty. We never estimate, interpolate, or infer-by-similar-model. The AWAITING VERIFICATION tier exists specifically so models with insufficient variant-distinct evidence are flagged rather than scored on inflated lab claims.
Coverage transparency: Each model surfaces its data depth - both overall (benchmarks/15, components/5) and specialty-specific (cells per specialty) when a specialty tab is active. Users can immediately see which rankings rest on thin data.
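Three of the rules above (pull-down shrinkage, the coverage floor, and the saturation rule) can be sketched in a few lines of Python. This is an illustrative sketch, not the production pipeline: the function names, the dictionary shapes, the proportional redistribution of freed weight, and the quartile method used for the IQR are all assumptions that the text does not specify. The AWAITING VERIFICATION tier is not modeled here.

```python
from statistics import quantiles

# Core components use a 40% floor; the other three use 30% (v1.4).
CORE_COMPONENTS = {"Fluid Reasoning", "Agency"}


def shrink(raw, coverage, population_median, threshold=0.6):
    """Asymmetric pull-down: only thin-coverage scores above the median move."""
    if coverage >= threshold or raw <= population_median:
        return raw
    return population_median + (raw - population_median) * (coverage / threshold)


def visibility_tier(coverage_by_component):
    """Coverage floor: 2+ components under their floor means PROVISIONAL."""
    below = sum(
        1
        for name, coverage in coverage_by_component.items()
        if coverage < (0.40 if name in CORE_COMPONENTS else 0.30)
    )
    return "PROVISIONAL" if below >= 2 else "RANKED"


def apply_saturation(weights, top10_scores, iqr_floor=5.0):
    """Halve saturated sub-weights and hand the freed weight to the rest.

    Assumptions: freed weight is redistributed in proportion to existing
    sub-weights, and the IQR uses the default method of
    statistics.quantiles. The methodology text specifies neither.
    """
    def iqr(scores):
        q1, _, q3 = quantiles(scores, n=4)
        return q3 - q1

    saturated = {
        b for b in weights
        if b in top10_scores and iqr(top10_scores[b]) < iqr_floor
    }
    live_total = sum(w for b, w in weights.items() if b not in saturated)
    if not saturated or live_total == 0:
        return dict(weights)  # nothing to move, or nowhere to move it
    freed = sum(weights[b] / 2 for b in saturated)
    return {
        b: w / 2 if b in saturated else w + freed * w / live_total
        for b, w in weights.items()
    }


# High score on thin coverage is pulled halfway to the median.
print(shrink(raw=90.0, coverage=0.3, population_median=70.0))  # 80.0
# Low score on thin coverage is left alone.
print(shrink(raw=50.0, coverage=0.3, population_median=70.0))  # 50.0
```

Passing threshold=0.85 to shrink reproduces the stricter setting used for two-benchmark specialties.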

Sensitivity bands (leave-one-benchmark-out)

Every AGI Score on the canonical leaderboard carries a sensitivity band (for example, 87.02 ±3.6). It is computed by removing each of the model's benchmarks one at a time and re-running the entire scoring pipeline; the band summarizes how far the score moves across those recomputations. It is not a standard deviation or a confidence interval: benchmarks are not random samples, so the band measures how much a score depends on its benchmark composition, not measurement error. Models with broad coverage hold tight bands; models scored on few cells swing wide, so the band widens with thin data. When two models' bands overlap, treat their order as unresolved rather than as a real gap. Where sources publish their own error margins we record them in the open dataset as groundwork for a future measurement-error layer; they do not yet affect scoring.
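The leave-one-benchmark-out procedure can be sketched as below. The text does not say how the recomputed scores are summarized into a single ± figure, so this sketch assumes the largest absolute deviation from the full score. The scoring pipeline is passed in as a callable (score_fn, a hypothetical name), and a plain mean stands in for it in the usage lines.

```python
from statistics import mean


def sensitivity_band(cells, score_fn):
    """Return (full_score, band) for one model.

    cells maps benchmark name -> normalized score. score_fn is the whole
    scoring pipeline applied to such a mapping. The band is taken to be
    the largest absolute move seen when dropping one benchmark at a
    time, which is an assumption of this sketch.
    """
    full = score_fn(cells)
    moves = []
    for dropped in cells:
        reduced = {b: s for b, s in cells.items() if b != dropped}
        if reduced:  # a single-cell model has nothing left to rescore
            moves.append(abs(score_fn(reduced) - full))
    return full, max(moves, default=0.0)


# Toy stand-in for the real pipeline: an unweighted mean of the cells.
def toy_pipeline(cells):
    return mean(cells.values())


score, band = sensitivity_band({"a": 80.0, "b": 90.0, "c": 100.0}, toy_pipeline)
print(f"{score:.2f} ±{band:.1f}")  # 90.00 ±5.0
```

Because every recomputation calls the full pipeline, any shrinkage or reweighting triggered by the missing benchmark is reflected in the band.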

Specialty Rankings

Four specialty leaderboards are accessible via tabs above the main leaderboard:

Coding: SWE-bench Verified (40%), SWE-bench Pro (30%), Aider Polyglot (30%).
Reasoning: ARC-AGI-2 (70%), AIME 2025 (30%) - pure abstract reasoning, knowledge-free.
Knowledge: GPQA Diamond (35%), HLE knowledge slice (40%), MMMU-Pro knowledge slice (25%).
Tool Use (general agency beyond coding - operating a computer, browsing, and tool/function calling): OSWorld (30%), BrowseComp (25%), Tau-bench retail (22.5%), Tau-bench airline (22.5%).

Specialty scores apply the same Case-B Bayesian trust weighting and asymmetric pull-down shrinkage as the AGI Score. Specialty tabs show every model with at least one variant-distinct cell in the set (including models that are AWAITING or INSUFFICIENT on the canonical view), because AGI Score and specialty scores are independent claims. Models below a specialty's eligibility minimum (v1.4.2) appear in a separate "insufficient evidence" section rather than in the main ranking. 100 on a specialty means matching best-human on those benchmarks; above 100 = super-human on those tasks specifically. This is NOT AGI - AGI requires the full battery, which only the canonical AGI Score measures.
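As a rough illustration of how the specialty weights combine, the sketch below computes the raw weighted mean over the benchmarks a model actually has, together with its coverage measured as the share of sub-weight present. It uses the Coding weights listed above; the cell values are made up, the function name is hypothetical, and trust weighting and shrinkage are left out for brevity.

```python
CODING_WEIGHTS = {
    "SWE-bench Verified": 0.40,
    "SWE-bench Pro": 0.30,
    "Aider Polyglot": 0.30,
}


def specialty_score(cells, weights):
    """Weighted mean over present benchmarks, plus sub-weight coverage.

    cells maps benchmark -> normalized score; a missing or None entry is
    an empty cell and is never imputed. Returns (raw, coverage), with
    raw set to None when no benchmark in the set is present.
    """
    present = {b: w for b, w in weights.items() if cells.get(b) is not None}
    coverage = sum(present.values())
    if coverage == 0:
        return None, 0.0
    raw = sum(cells[b] * w for b, w in present.items()) / coverage
    return raw, coverage


# Illustrative cells only: SWE-bench Pro is missing for this model.
raw, coverage = specialty_score(
    {"SWE-bench Verified": 82.0, "Aider Polyglot": 90.0},
    CODING_WEIGHTS,
)
print(f"raw={raw:.2f} coverage={coverage:.0%}")
```

Dividing by the present weight rather than by 1.0 is what keeps an empty cell from counting as a zero.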

Custom View

The Explorer below the leaderboard lets users adjust the five component weights. When weights deviate from default, a Custom tab appears and the leaderboard shifts to showing the AI Score under those weights (the AGI label is reserved for canonical weights only - no ambiguity). The AGI tab is the "reset to canonical" action: clicking it always reverts weights to default.
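The labeling rule can be sketched as follows. The canonical component weights are not stated in this section, so the equal weights below are placeholders for illustration only, and both function names are hypothetical.

```python
import math

# PLACEHOLDER defaults: the real canonical weights are not given here.
CANONICAL_WEIGHTS = {
    "Fluid Reasoning": 0.2,
    "Agency": 0.2,
    "World Knowledge": 0.2,
    "Multimodal": 0.2,
    "Language": 0.2,
}


def score_label(weights, canonical=CANONICAL_WEIGHTS):
    """'AGI Score' only under canonical weights; otherwise 'AI Score'."""
    is_custom = any(
        not math.isclose(weights[name], canonical[name], abs_tol=1e-9)
        for name in canonical
    )
    return "AI Score" if is_custom else "AGI Score"


def composite(component_scores, weights):
    """Weighted mean of component scores, renormalizing the weights."""
    total = sum(weights.values())
    return sum(component_scores[name] * w for name, w in weights.items()) / total


custom = dict(CANONICAL_WEIGHTS, Agency=0.4)
print(score_label(CANONICAL_WEIGHTS))  # AGI Score
print(score_label(custom))             # AI Score
```

Comparing with a tolerance rather than exact equality avoids flipping the label on floating-point noise from a slider.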

Active Data Sources (14 live benchmarks)

• GPQA Diamond
• Humanity's Last Exam (HLE)
• ARC-AGI-2
• AIME 2025
• SWE-bench Verified
• SWE-bench Pro
• Aider Polyglot
• OSWorld
• Terminal-Bench
• BrowseComp
• Tau-bench (retail + airline)
• MMMU-Pro
• LiveBench
• LMSYS Arena Elo (shown but not in score)

AA Intelligence Index: external reference, not scored
TBA: GAIA, FrontierMath, SimpleBench