AGI Score Methodology
Every formula, weight, and trust-tier rule behind the AGI Score, publicly documented. The same methodology renders interactively on the leaderboard; this page is the complete reference. Raw data: models.json (CC BY 4.0).
How the score is built
How the AGI Score is Calculated
A transparent, reproducible composite index designed for maximum signal and minimum noise.
Note on the AA Intelligence Index (removed June 2026): We previously carried Artificial Analysis's composite Intelligence Index as a Language signal. We have removed it from the AGI Score. Their v4.1 revision turned it into an explicitly agentic composite (GDPval-AA, Terminal-Bench 2.1, τ³-Banking, SciCode, HLE, GPQA and more) that re-bundles benchmarks we already score directly - keeping it would double-count those signals across components and mislabel agency as language. Language now rests on LiveBench; a dedicated language/writing benchmark is on our roadmap to restore a second independent source. We continue to track the AA Index as an external reference.
sub_weight × source_tier_multiplier. Sub-weights are fixed per benchmark and reflect each benchmark's relative importance within a component (e.g., ARC-AGI-2 carries 25% of Reasoning, GPQA carries 20%). The saturation rule kicks in dynamically: if the IQR of normalized scores among the top-10 ranked models on a benchmark drops below 5 percentage points, that benchmark's sub-weight is halved and the freed weight is redistributed to non-saturated benchmarks in the same component - keeping the component score responsive to whichever benchmarks still discriminate frontier models. Source tiering applies a multiplier per cell based on source independence (T1 1.00×, T2 0.85×, T3-Verified-Lab 0.75×; T4 excluded from scoring entirely). Component weights are user-adjustable in the Explorer below.

| Constant | Value | Used in |
|---|---|---|
| Asymmetric shrinkage coverage threshold (main AGI Score & specialties with 3+ benchmarks) | 60% | Step 4 |
| Asymmetric shrinkage coverage threshold (specialties with 2 benchmarks, e.g. Reasoning) | 85% | Step 5 |
| Saturation IQR threshold (top-10 ranked models) | 5 pp | Step 4 |
| Coverage floor - Fluid Reasoning & Agency (core) | 40% | Step 4 |
| Coverage floor - Knowledge / Multimodal / Language | 30% | Step 4 |
| Max thin components allowed in the RANKED tier | 1 | Step 4 |
| Eval-date grace window vs. model release | 30 days | below |
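
The saturation threshold in the constants table can be illustrated with a short sketch. It assumes the top-10 scores arrive already ordered by rank, that the IQR uses inclusive quartiles, and that freed weight is redistributed in proportion to the remaining benchmarks' sub-weights; none of these three details is specified here, and the function names are hypothetical.

```python
from statistics import quantiles

SATURATION_IQR_PP = 5.0  # saturation IQR threshold from the constants table
TOP_N = 10               # top-10 ranked models


def iqr(values):
    """Interquartile range using inclusive quartiles (an assumed convention)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def is_saturated(top_scores):
    """top_scores: normalized scores of the top-ranked models on one benchmark."""
    return len(top_scores) >= 2 and iqr(top_scores) < SATURATION_IQR_PP


def apply_saturation(sub_weights, top_scores_by_benchmark):
    """Halve saturated sub-weights and hand the freed weight to the others.

    Redistribution is proportional to the original sub-weights of the
    non-saturated benchmarks (an assumption, not a documented rule).
    """
    saturated = {
        bench
        for bench, scores in top_scores_by_benchmark.items()
        if bench in sub_weights and is_saturated(scores[:TOP_N])
    }
    live = [bench for bench in sub_weights if bench not in saturated]
    if not saturated or not live:
        return dict(sub_weights)

    adjusted = dict(sub_weights)
    freed = 0.0
    for bench in saturated:
        adjusted[bench] = sub_weights[bench] / 2
        freed += sub_weights[bench] / 2

    live_total = sum(sub_weights[bench] for bench in live)
    for bench in live:
        adjusted[bench] += freed * sub_weights[bench] / live_total
    return adjusted
```

Because the freed weight stays inside the component, the sub-weights still sum to their original total after the adjustment.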

| Specialty | Benchmarks & within-set sub-weights |
|---|---|
| Coding | SWE-bench Verified 35 · SWE-bench Pro 30 · Terminal-Bench 35 |
| Reasoning | ARC-AGI-2 70 · AIME 2025 30 |
| Knowledge | HLE 40 · GPQA Diamond 35 · MMMU-Pro 25 |
| Tool Use | OSWorld 30 · BrowseComp 25 · Tau-bench retail 22.5 · Tau-bench airline 22.5 |
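
As a rough illustration of how the sub-weights in the table might combine with source tiers, the sketch below computes a weighted mean over the benchmarks a model actually has. It assumes the effective weight of a cell is sub_weight × source_tier_multiplier and that coverage is the share of sub-weight present; the function name and the cell layout are hypothetical, and shrinkage is not applied here.

```python
# Tier multipliers as stated in the methodology; T4 is excluded from scoring.
TIER_MULTIPLIER = {"T1": 1.00, "T2": 0.85, "T3": 0.75}

# Sub-weights copied from the specialty table (two specialties shown).
SPECIALTY_WEIGHTS = {
    "Reasoning": {"ARC-AGI-2": 70, "AIME 2025": 30},
    "Tool Use": {
        "OSWorld": 30,
        "BrowseComp": 25,
        "Tau-bench retail": 22.5,
        "Tau-bench airline": 22.5,
    },
}


def specialty_score(specialty, cells):
    """Return (raw_score, coverage) for one model.

    cells maps benchmark name -> (normalized_score, tier). Missing
    benchmarks stay missing (no imputation); T4 cells are skipped.
    """
    weights = SPECIALTY_WEIGHTS[specialty]
    weighted_sum = 0.0
    weight_total = 0.0
    covered = 0.0
    for bench, sub_weight in weights.items():
        cell = cells.get(bench)
        if cell is None:
            continue
        score, tier = cell
        if tier not in TIER_MULTIPLIER:
            continue  # T4 or unknown tier: not scored
        effective = sub_weight * TIER_MULTIPLIER[tier]
        weighted_sum += effective * score
        weight_total += effective
        covered += sub_weight
    if weight_total == 0:
        return None, 0.0
    return weighted_sum / weight_total, covered / sum(weights.values())
```

For example, a model with only an OSWorld result would get a Tool Use coverage of 0.30, which is what later triggers the pull-down shrinkage for above-median scores.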

A score whose eval_date predates the model's release_date by more than 30 days is rejected as a mis-attribution. The 30-day grace window covers pre-release lab evaluations; anything older almost certainly tested a preceding model that happens to share part of the name.

/model/{apiName}. The existing detail view opens automatically when the URL is loaded directly, and clicking "View" on a leaderboard row updates the URL via History API. Document title and meta description update dynamically per model. sitemap.xml added with all 15 model URLs.

Technical methodology
The AGI Score is a composite metric that aggregates publicly available benchmark results into a single number tracking how close each frontier AI model is to AGI (defined as Score = 100, the genesis of AGI per our locked v0.7 definition).
Core Principles
- Source-tier weighting + trust floor: Tier 1 - fully independent third-party verified (1.00×). Tier 2 - third-party evaluators with some model-provider involvement (0.85×). Tier 3 - self-reports from Verified labs with a track record of holding up under independent re-runs (0.75×). Tier 4 - self-reports from Not-verified labs and aggregator blogs / video - excluded from scoring entirely. Verified labs (initial set): Anthropic, OpenAI, Google DeepMind, Meta, DeepSeek. Not-verified (path to upgrade exists): Moonshot/Kimi, xAI, Mistral, Alibaba/Qwen, Zhipu/GLM, Baidu/ERNIE, Cohere. Promotion criterion: any Not-verified lab whose published numbers match independent third-party measurements (Vellum, vals.ai, AA, Scale AI) within ±2pp on 3+ benchmarks gets promoted. Source watchlist re-evaluated weekly; new model releases trigger same-day or next-day harvest cycles.
- Asymmetric pull-down shrinkage: within each component (and within each specialty) we compute the weighted mean over present benchmarks. If coverage is below 60% AND the raw score is above the population median, we pull the score toward the median in proportion to the coverage shortfall: final = median + (raw - median) × (coverage / 0.6). Low scores with thin coverage stay low (no reward for hiding weaknesses); high scores with thin coverage get discounted toward typical. The same rule applies to the main AGI Score (across components) and to specialty rankings (within each specialty's benchmark set).
- Cross-component coverage floor (v1.4): shrinkage alone can't always neutralize the "few-but-exceptional cells" artifact - if a model's only-tested-on benchmarks happen to be its strongest, the score can be inflated by selection bias. We add a tier-level guard: a model with 2+ components below their coverage thresholds is demoted from RANKED to PROVISIONAL. Thresholds are 30% sub-weight for World Knowledge / Multimodal / Language and 40% for the core components (Fluid Reasoning + Agency, the load-bearing AGI dimensions). The model's score is still computed and shown, but flagged with the "Provisional" badge and shown in a separate section below the canonical RANKED list - explicitly indicating uneven evidence across capability dimensions.
- Three-section visibility (v1.4): the main leaderboard shows only RANKED models by default - the canonical answer to "who's closest to AGI on currently-available evidence." A toggle above the table reveals PROVISIONAL models (score computed but coverage uneven) and AWAITING VERIFICATION models (insufficient variant-distinct data to score) as separate sections below. Casual visitors see the consensus ranking; power users opt in to see the full picture with caveats. Same data, same methodology, layered presentation.
- Saturation rule: if the IQR of normalized scores among the top-10 ranked models on a benchmark drops below 5pp, that benchmark's sub-weight is halved and the freed weight is redistributed to non-saturated benchmarks in the same component. Keeps the score responsive to whichever benchmarks still discriminate as frontier models converge.
- Human baselines: Each benchmark is rescaled so 100 = best-human performance on that task. AGI Score = 100 = genesis of AGI per our locked v0.7 definition: an AI that surpasses the best human on every purely brain-based intellectual task (embodiment and sensory experience explicitly out of scope). Above 100 = super-human (ASI direction). Scores aren't capped, so the index stays informative post-AGI.
- No imputation: If a model lacks a published score on a benchmark, that cell stays empty. We never estimate, interpolate, or infer-by-similar-model. The AWAITING VERIFICATION tier exists specifically so models with insufficient variant-distinct evidence are flagged rather than scored on inflated lab claims.
- Coverage transparency: Each model surfaces its data depth - both overall (benchmarks/15, components/5) and specialty-specific (cells per specialty) when a specialty tab is active. Users can immediately see which rankings rest on thin data.
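
The shrinkage formula and the coverage-floor guard from the principles above can be sketched in a few lines. The function names are hypothetical, coverage is assumed to be expressed as a fraction between 0 and 1, and the AWAITING VERIFICATION tier is left out because its criteria are not spelled out here.

```python
def shrink(raw, median, coverage, threshold=0.60):
    """Asymmetric pull-down: only above-median scores with thin coverage move.

    Implements final = median + (raw - median) * (coverage / threshold).
    """
    if coverage < threshold and raw > median:
        return median + (raw - median) * (coverage / threshold)
    return raw


# Coverage floors per component, as fractions of sub-weight.
COVERAGE_FLOOR = {
    "Fluid Reasoning": 0.40,
    "Agency": 0.40,
    "World Knowledge": 0.30,
    "Multimodal": 0.30,
    "Language": 0.30,
}
MAX_THIN_COMPONENTS = 1  # max thin components allowed in the RANKED tier


def tier(component_coverage):
    """Return 'RANKED' or 'PROVISIONAL' from per-component coverage."""
    thin = [
        name
        for name, floor in COVERAGE_FLOOR.items()
        if component_coverage.get(name, 0.0) < floor
    ]
    return "RANKED" if len(thin) <= MAX_THIN_COMPONENTS else "PROVISIONAL"
```

With a population median of 60, a raw score of 90 at 30% coverage becomes 60 + 30 × (0.3 / 0.6) = 75, while a raw score of 40 at the same coverage stays at 40.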
Sensitivity bands (leave-one-benchmark-out)
Every AGI Score on the canonical leaderboard carries a sensitivity band (for example, 87.02 ±3.6). It is computed by removing each of the model's benchmarks one at a time and re-running the entire scoring pipeline; the band summarizes how far the score moves across those recomputations. It is not a standard deviation or a confidence interval - benchmarks are not random samples, so the band measures how much a score depends on its benchmark composition, not measurement error. Models with broad coverage hold tight bands; models scored on few cells swing wide - the band widens honestly with thin data. When two models' bands overlap, treat their order as a statistical tie. Where sources publish their own error margins we record them in the open dataset as groundwork for a future measurement-error layer; they do not yet affect scoring.
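
A leave-one-benchmark-out band can be sketched as below. The text does not say which statistic summarizes the recomputed scores, so this sketch assumes the largest absolute deviation from the full score; `score_fn` is a hypothetical stand-in for the entire scoring pipeline, and `bands_overlap` is an illustrative helper for the tie rule.

```python
def sensitivity_band(benchmarks, score_fn):
    """Return (full_score, band) for one model.

    benchmarks maps benchmark name -> normalized score. score_fn takes
    such a mapping and returns a single score. The band is the largest
    absolute change seen when any one benchmark is removed (assumed).
    """
    full = score_fn(benchmarks)
    deviations = []
    for left_out in benchmarks:
        reduced = {k: v for k, v in benchmarks.items() if k != left_out}
        if not reduced:
            continue  # nothing left to score
        deviations.append(abs(score_fn(reduced) - full))
    band = max(deviations) if deviations else 0.0
    return full, band


def bands_overlap(score_a, band_a, score_b, band_b):
    """True when the two intervals score ± band intersect."""
    return abs(score_a - score_b) <= band_a + band_b
```

Using a plain mean as `score_fn`, a model with scores 80, 90, and 100 has a full score of 90 and a band of 5, since dropping either extreme moves the mean by 5.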
Specialty Rankings
Four specialty leaderboards are accessible via tabs above the main leaderboard:
- Coding: SWE-bench Verified (40%), SWE-bench Pro (30%), Aider Polyglot (30%).
- Reasoning: ARC-AGI-2 (70%), AIME 2025 (30%) - pure abstract reasoning, knowledge-free.
- Knowledge: GPQA Diamond (35%), HLE knowledge slice (40%), MMMU-Pro knowledge slice (25%).
- Tool Use (general agency beyond coding - operating a computer, browsing, and tool/function calling): OSWorld (30%), BrowseComp (25%), Tau-bench retail (22.5%), Tau-bench airline (22.5%).
Specialty scores apply the same Case-B Bayesian trust weighting and asymmetric pull-down shrinkage as the AGI Score. Specialty rankings show every model with at least one variant-distinct cell in the set (including AWAITING and INSUFFICIENT models on the canonical view) - AGI Score and specialty scores are independent claims. 100 on a specialty means matching best-human on those benchmarks; above 100 = super-human on those tasks specifically. This is NOT AGI - AGI requires the full battery, which only the canonical AGI Score measures.
Custom View
The Explorer below the leaderboard lets users adjust the five component weights. When weights deviate from default, a Custom tab appears and the leaderboard shifts to showing the AI Score under those weights (the AGI label is reserved for canonical weights only - no ambiguity). The AGI tab is the "reset to canonical" action: clicking it always reverts weights to default.
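
The label switch described above can be sketched as follows. The canonical component weights are not listed in this section, so the equal weights below are placeholders only, not the real defaults; the function names are likewise hypothetical.

```python
import math

COMPONENTS = ("Fluid Reasoning", "Agency", "World Knowledge", "Multimodal", "Language")

# PLACEHOLDER values: the real canonical weights are not given in this section.
PLACEHOLDER_CANONICAL = {name: 0.2 for name in COMPONENTS}


def composite(component_scores, weights):
    """Weighted mean of the five component scores under the given weights."""
    total = sum(weights[name] for name in COMPONENTS)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(component_scores[name] * weights[name] for name in COMPONENTS) / total


def score_label(weights, canonical=PLACEHOLDER_CANONICAL):
    """'AGI Score' only under canonical weights; otherwise 'AI Score'."""
    same = all(math.isclose(weights[name], canonical[name]) for name in COMPONENTS)
    return "AGI Score" if same else "AI Score"
```

Dividing by the weight total keeps the composite on the same scale even if a user's custom weights do not sum to 1.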
Active Data Sources (14 live benchmarks)
• Humanity's Last Exam (HLE)
• ARC-AGI-2
• AIME 2025
• SWE-bench Verified
• SWE-bench Pro
• Aider Polyglot
• OSWorld
• BrowseComp
• Tau-bench (retail + airline)
• MMMU-Pro
• LiveBench
• LMSYS Arena Elo (shown but not in score)
AA Intelligence Index: external reference, not scored
TBA: GAIA, FrontierMath, SimpleBench