The single most important credibility artifact before any live contract deployment is the benchmark report. It measures the judgment layer Pelion has built against a large historical dataset and publishes the results reproducibly. If the numbers don’t support the thesis, the protocol should pause, not ship.
The dataset
Source. Polymarket resolved markets from the last 18 months. This is the most complete public dataset of real-money prediction markets with recorded outcomes and recorded controversy.
Scope. Resolved markets (where Polymarket assigned a final outcome) plus disputed markets (where the resolution was contested). Markets that never resolved are excluded from the main analysis but noted separately.
Data captured per market. Question text. Resolution criteria (the specific rubric Polymarket used). UMA’s recorded outcome. Actual-world outcome where determinable from external sources. Market size (volume and open interest at resolution). A sketch of this record shape appears at the end of this section.
Category breakdown. Political events. Sports. Economic indicators. Entertainment. Crypto-specific events. The benchmark reports accuracy per category separately. This matters because category-specific accuracy profiles are what real integrators actually care about.
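A minimal sketch of what one dataset record could look like, assuming only the per-market fields listed above. The class name, field names, and category enum are illustrative, not the benchmark’s actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Category(Enum):
    POLITICAL = "political"
    SPORTS = "sports"
    ECONOMIC = "economic"
    ENTERTAINMENT = "entertainment"
    CRYPTO = "crypto"


@dataclass(frozen=True)
class MarketRecord:
    """One resolved (or disputed) Polymarket market in the benchmark dataset."""
    question: str                 # question text as shown to traders
    resolution_criteria: str      # the specific rubric Polymarket used
    uma_outcome: str              # what UMA actually returned at the time
    ground_truth: Optional[str]   # actual-world outcome, None if not determinable
    volume_usd: float             # market size: traded volume at resolution
    open_interest_usd: float      # market size: open interest at resolution
    category: Category
    disputed: bool                # True if the resolution was contested
    ambiguous_ground_truth: bool  # flagged and scored separately when True
```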
The evaluation harness
A Python harness that takes a resolver function and runs it over the dataset. The resolver function signature is the same as pelion.judgment.FrontierModelClient.judge, so the harness can swap in any implementation that conforms.
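The actual judge signature belongs to pelion.judgment.FrontierModelClient and is not reproduced here; the protocol below is only a guess at its shape (question and resolution criteria in, outcome and confidence out), meant to show how the harness can treat any conforming implementation interchangeably.

```python
from typing import NamedTuple, Protocol


class Verdict(NamedTuple):
    outcome: str        # e.g. "YES" / "NO" / a named outcome token
    confidence: float   # 0.0 - 1.0, expected to be calibrated


class Resolver(Protocol):
    """Anything with this shape can be plugged into the harness.

    Assumed to mirror pelion.judgment.FrontierModelClient.judge;
    the real signature may differ.
    """

    def judge(self, question: str, resolution_criteria: str) -> Verdict: ...
```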
Per-question metrics recorded. The resolver’s outcome. The resolver’s confidence. Latency (wall-clock time to resolution). Cost (API tokens consumed, converted to dollars).
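A minimal sketch of the per-question loop, recording the four metrics just listed. The function name, record attributes, and cost-tracking attribute are assumptions for illustration, not the harness’s real API.

```python
import time


def run_benchmark(resolver, dataset):
    """Run one resolver over the dataset, recording per-question metrics.

    resolver: anything matching the Resolver protocol sketched above.
    dataset:  iterable of MarketRecord-like objects.
    Cost tracking is simplified; the real harness presumably derives it
    from API token usage.
    """
    rows = []
    for market in dataset:
        start = time.monotonic()
        verdict = resolver.judge(market.question, market.resolution_criteria)
        rows.append({
            "question": market.question,
            "outcome": verdict.outcome,             # the resolver's outcome
            "confidence": verdict.confidence,       # the resolver's confidence
            "latency_s": time.monotonic() - start,  # wall-clock time to resolution
            "cost_usd": getattr(resolver, "last_call_cost_usd", None),  # if exposed
            "correct": (market.ground_truth is not None
                        and verdict.outcome == market.ground_truth),
        })
    return rows
```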
Aggregate metrics computed. Accuracy by category. Calibration curve (is 80% confidence actually 80% accurate, and so on). Disagreement rate across resolvers. Cost per correct resolution.
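A sketch of how the calibration curve could be computed from per-question results, assuming ten equal-width confidence buckets; the bucket count and output format are choices made here for illustration.

```python
def calibration_curve(results, n_buckets=10):
    """results: iterable of (confidence, was_correct) pairs.

    Returns a list of (bucket_low, mean_confidence, accuracy, count), so the
    0.8 bucket shows whether ~80%-confidence calls were right ~80% of the time.
    """
    buckets = [[] for _ in range(n_buckets)]
    for confidence, was_correct in results:
        idx = min(int(confidence * n_buckets), n_buckets - 1)
        buckets[idx].append((confidence, was_correct))

    curve = []
    for i, items in enumerate(buckets):
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        curve.append((i / n_buckets, mean_conf, accuracy, len(items)))
    return curve
```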
The harness is reproducible. Given the dataset and the resolver implementation, anyone can re-run and get the same numbers. This is a hard requirement for investor-facing claims.
Resolvers under test
Four resolvers run on the same dataset for comparison.
Each frontier model individually. Claude, GPT, Gemini, each as a single-model resolver. Establishes the baseline for what a single model can do without aggregation.
The full frontier-model council. All three models aggregated via majority vote, with confidence reported as the mean confidence over the majority camp. This is what FrontierModelClient produces today. The comparison to individual models quantifies how much the council structure adds. A sketch of this aggregation rule follows the baselines below.
UMA’s recorded outcome as a baseline. The value UMA actually returned at the time. This is the competitive bar Pelion must clear.
Actual-world ground truth as the reference. Determined from external sources (Reuters, AP, government records, etc.) at the time the benchmark is run, not at the time Polymarket resolved. For most questions, ground truth is unambiguous. For questions where it’s not, the question is flagged and scored separately.
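A sketch of the aggregation rule described for the council: majority vote on the outcome, with confidence taken as the mean over the models that voted with the majority. This mirrors the description above, not the FrontierModelClient implementation itself.

```python
from collections import Counter


def aggregate_council(verdicts):
    """Majority vote on the outcome; confidence = mean confidence of the
    models in the majority camp.

    verdicts: list of (outcome, confidence) tuples, one per model.
    """
    outcome_counts = Counter(outcome for outcome, _ in verdicts)
    majority_outcome, _ = outcome_counts.most_common(1)[0]
    majority_camp = [conf for outcome, conf in verdicts if outcome == majority_outcome]
    return majority_outcome, sum(majority_camp) / len(majority_camp)


# Example: two models say YES (0.9, 0.8), one says NO (0.6) -> ("YES", 0.85)
```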
Metrics and what good looks like
Accuracy on standard questions. At or above 99%. Polymarket’s standard markets (clear events, unambiguous resolution) should be trivial for frontier models. Anything below 99% indicates a structural issue with the prompt or evidence handling.
Accuracy on contested questions. Measurably better than UMA’s recorded outcomes. The Zelensky-type markets (where UMA got it wrong due to whale manipulation) are the test cases that matter most. Pelion needs to resolve these correctly in the retroactive evaluation.
Calibration. The confidence scores produced by the resolver should match actual accuracy. An 80% confidence call should be right 80% of the time, give or take a few percentage points. The calibration curve is plotted with specific bucket thresholds; the target is within 5% of ideal calibration in each bucket.
Cost. At or below the $1-per-resolution ceiling used in the failure criteria below (compare with UMA’s $750+ bond threshold for disputing). The benchmark validates the economics alongside the accuracy.
Publication commitments
Public repository. The harness code and the full dataset (or clear pointers for any licensed portions) live in a public repo. Reproducibility is explicit.
Results report. A written report summarizing findings per category, with visualizations of calibration curves and accuracy-by-category breakdowns. Linked from this site when complete.
Raw data release. The per-question results (resolver outcomes, confidences, latencies, costs) are published alongside the report. Not just summary statistics. This allows independent analysis.
Errata protocol. If errors are found in the report after publication, corrections are published with full version history. No silent edits.
What failure would look like
The benchmark should not be a ceremonial exercise that validates the thesis. If the numbers don’t support the thesis, the protocol should pause rather than ship. Concrete failure modes that would trigger a pause, with a gate check sketched after the list:
Accuracy below 95% on standard questions. Indicates the base frontier-model approach is worse than assumed. The published research suggests this won’t happen; if it does, the design must change.
Accuracy below UMA on contested questions. Indicates Pelion’s specific framing doesn’t improve on the incumbent. Would require rethinking the scoring and aggregation mechanisms.
Calibration off by more than 10%. Indicates the confidence scores produced are meaningless. Users cannot rely on them to sort high-confidence verdicts from low-confidence ones.
Cost per resolution above $1. Indicates the economic model doesn’t work at scale. Would require retrieval cost optimization or a different model mix.
None of these are expected. But the benchmark exists specifically to find out, rigorously, before live contracts are deployed.
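A sketch of how the pause triggers above could be checked mechanically at the end of a benchmark run. The metric names are hypothetical; the thresholds are the ones listed in this section.

```python
def should_pause(metrics: dict) -> list[str]:
    """Return the failure modes tripped by a benchmark run.

    An empty list means no pause trigger fired. Expected keys (illustrative
    names, not the harness's real schema): standard_accuracy,
    contested_accuracy, uma_contested_accuracy, max_calibration_error,
    cost_per_resolution_usd.
    """
    triggers = []
    if metrics["standard_accuracy"] < 0.95:
        triggers.append("standard-question accuracy below 95%")
    if metrics["contested_accuracy"] < metrics["uma_contested_accuracy"]:
        triggers.append("contested-question accuracy below UMA baseline")
    if metrics["max_calibration_error"] > 0.10:
        triggers.append("calibration off by more than 10%")
    if metrics["cost_per_resolution_usd"] > 1.00:
        triggers.append("cost per resolution above $1")
    return triggers
```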
Relationship to the live protocol
The benchmark tests the exact same judgment code path that production miners will use. FrontierModelClient is the reference implementation for what each Pelion miner runs. Validators score miners against live questions that have ground truth available (or synthetic backtest questions). The benchmark’s methodology becomes the validator’s methodology.
This is intentional. The benchmark is not a separate piece of marketing infrastructure. It is the first field test of the judgment layer, using the same code, the same data shape, and the same metrics.