ForecastBench Replication

Can LLMs Predict the Future?

Evaluating frontier models against prediction markets & superforecasters

Builders Retreat · March 2026
Daniel Hails
Source: hails.info
1 / 16

Motivation

Forecasting is hard

Term-premium-adjusted forward curves vs the actual 3-Month Treasury Bill rate, 1985–2030, showing how market-implied forecasts consistently fail to predict realized rates

40 years of forward curves vs reality. Markets are systematically wrong about future rates. Can LLMs do better?

Sources: FRED TB3MS; GSW (FEDS 2006-28); ACM term premium (NY Fed)
2 / 16

Context

AI agents are getting autonomous fast

METR autonomy evaluation on linear scale showing AI agent task completion time growing from seconds to 12 hours, with Claude Opus 4.6 at the top

METR autonomy evals: time horizon of tasks AI agents can complete has gone from seconds to 12 hours. Autonomous agents need to predict consequences of their actions.

Source: METR autonomy evaluations (metr.org)
3 / 16

The Benchmark

ForecastBench: Can LLMs forecast?

ForecastBench leaderboard showing Brier Index over time, with superforecaster baseline at 70.8 and best LLM trending toward parity by May 2027

LLM forecasting ability is improving steadily. Projected superforecaster parity: May 2027 (95% CI: May 2026 – July 2028). Contamination-free: 500 live questions every 2 weeks.

Source: Karger et al. (2025), forecastbench.org
4 / 16

Experiment 1

Replication: Gemini-2.5-Flash

Ran all 24 question sets via OpenRouter. Raw Brier scores closely match the official leaderboard.

Metric            Ours (N=22,829)   Leaderboard   Delta
Brier (dataset)   0.185             0.170         +0.015
Brier (market)    0.132             0.146         −0.014
Brier (overall)   0.181             0.158         +0.023

Cohen's d = 0.12 (negligible). Statistically significant but practically identical.
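The Brier scores in the table are just the mean squared error between a probability forecast and the realized 0/1 outcome. A minimal sketch (the function name is ours, not ForecastBench's code):

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes.

    Lower is better: 0.0 is perfect; a constant 0.5 forecast scores 0.25.
    """
    pairs = list(zip(forecasts, outcomes))
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)

# A 0.7 forecast on an event that happens contributes (0.7 - 1)^2 = 0.09
print(round(brier_score([0.7, 0.2], [1, 0]), 4))  # 0.065
```

The leaderboard's Brier Index rescales this so that higher is better; the raw score above is what the replication table reports.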

Source: ForecastBench leaderboard; our replication via OpenRouter
5 / 16

Our Contribution

BrierBench: an OpenReward environment

We packaged ForecastBench as a reusable evaluation environment on OpenReward. Any LLM agent can be benchmarked with a single API call — zero-shot, with web research, or with structured forecast strategies.

  • 3 strategies: zero_shot, zero_shot_research, forecast_skill
  • $10–$175 full eval cost ($10 Gemini-Flash, $175 Grok/Sonnet)
  • 10 min wall time (concurrency=100)
Source: hails.info
6 / 16

Experiment 1

Score distribution: market vs dataset

Histogram of per-question Brier scores for market and dataset splits, showing most predictions cluster near zero

Market questions cluster near 0 (easy wins). Dataset questions spread wider — time-series forecasting is harder than binary market questions.

Source: our replication (N=22,829)
7 / 16

Experiment 1

Brier score per question set over time for Gemini-2.5-Flash, with horizontal reference lines for superforecaster and crowd baselines

Per-question-set Brier scores over time. The leaderboard baseline (0.158) is close to our mean. Recent sets trend higher due to unresolved long-horizon forecasts.

Source: ForecastBench leaderboard; our replication
8 / 16

BrierBench

Zero-shot vs Research accuracy over time

BrierBench evaluation showing zero-shot and zero-shot-research Brier scores by question date, with model cutoff line

Brier scores by question date. Research strategy (web search) improves accuracy. Post-cutoff questions are harder — no training data leakage.

Source: BrierBench evaluation via OpenReward
9 / 16

Forecast Skill

Strategy selection tree

%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#3d2022', 'primaryTextColor': '#e8e6e1', 'primaryBorderColor': '#c44e52', 'lineColor': '#555550', 'secondaryColor': '#2a3a56', 'tertiaryColor': '#2d2d2d', 'fontFamily': 'IBM Plex Sans', 'fontSize': '13px', 'nodeBorder': '#c44e52', 'mainBkg': '#3d2022', 'clusterBkg': '#2d2d2d'}}}%%
graph TD
    A{Binary?} -->|Yes| B{Stock direction?}
    A -->|No| C{Continuous or discrete?}

    B -->|Yes| D[binary_stock_direction]
    B -->|No| E{Decomposable?}

    E -->|Yes| F[scenario_decomposition]
    E -->|No| G{Needs simulation?}

    G -->|Yes| H[monte_carlo_binary]
    G -->|No| I[scenario_decomposition]

    C -->|Continuous| J{Price/index GBM?}
    C -->|Discrete| K{Rare events?}

    J -->|Yes| L[mixture_lognormal_cdf]
    J -->|No| M{Naturally lognormal?}

    M -->|Yes| N[direct_lognormal_cdf]
    M -->|No| O{Needs simulation?}

    O -->|Yes| P[monte_carlo_cdf]
    O -->|No| Q[mixture_normal_cdf]

    K -->|Yes| R[poisson_pmf]
    K -->|No| S{Independent trials?}

    S -->|Yes| T[poisson_binomial_pmf]
    S -->|No| U[negative_binomial_pmf]

    style D fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style F fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style H fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style I fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style L fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style N fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style P fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style Q fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style R fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style T fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style U fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
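The decision tree above can be walked as a plain function over question features. This is an illustrative sketch: the field names (`binary`, `kind`, `gbm_price`, etc.) are ours, not the actual BrierBench question schema.

```python
def select_strategy(q: dict) -> str:
    """Walk the forecast_skill decision tree for one question's features."""
    if q["binary"]:
        if q.get("stock_direction"):
            return "binary_stock_direction"
        if q.get("decomposable"):
            return "scenario_decomposition"
        # not decomposable: simulate if possible, else decompose anyway
        return "monte_carlo_binary" if q.get("needs_simulation") else "scenario_decomposition"
    if q["kind"] == "continuous":
        if q.get("gbm_price"):  # price/index well modeled by geometric Brownian motion
            return "mixture_lognormal_cdf"
        if q.get("lognormal"):  # naturally lognormal quantity
            return "direct_lognormal_cdf"
        return "monte_carlo_cdf" if q.get("needs_simulation") else "mixture_normal_cdf"
    # discrete counts
    if q.get("rare_events"):
        return "poisson_pmf"
    return "poisson_binomial_pmf" if q.get("independent_trials") else "negative_binomial_pmf"

print(select_strategy({"binary": True, "stock_direction": True}))  # binary_stock_direction
```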
            
Source: BrierBench forecast_skill strategy
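As an example of one leaf, a monte_carlo_binary strategy reduces to simulating the underlying process many times and counting hits. The parameterization here is ours, not the BrierBench implementation:

```python
import random

def monte_carlo_binary(simulate_once, n=10_000, seed=0):
    """Estimate P(event) by running a (rng -> bool) simulator n times."""
    rng = random.Random(seed)
    return sum(simulate_once(rng) for _ in range(n)) / n

# P(at least one success in 3 independent 20% trials); exact: 1 - 0.8**3 = 0.488
p = monte_carlo_binary(lambda rng: any(rng.random() < 0.2 for _ in range(3)))
```

The estimate converges at a rate of roughly 1/sqrt(n), so 10,000 draws pin the probability down to about ±0.005 here.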
10 / 16

Experiment 2

GPT-5.4 vs Live Markets

Scatter plot of GPT-5.4 forecast vs market price

Closest to consensus. Mean |δ| = 0.220. Diagonal = perfect agreement.

Sources: Manifold, Metaculus, INFER, Polymarket via ForecastBench
11 / 16

Experiment 2

Gemini-3.1-Pro vs Live Markets

Scatter plot of Gemini-3.1-Pro forecast vs market price

Conservative bias — systematically predicts lower. Bias = −0.083.

Sources: Manifold, Metaculus, INFER, Polymarket via ForecastBench
12 / 16

Experiment 2

Grok-4.20-beta vs Live Markets

Scatter plot of Grok-4.20-beta forecast vs market price showing extreme contrarian positions

Most contrarian. Mean |δ| = 0.274. Loves extreme probabilities (92–95%).

Sources: Manifold, Metaculus, INFER, Polymarket via ForecastBench
13 / 16

Key Finding

Grok is a weird contrarian

  • 0.274 mean |delta| vs markets (GPT-5.4: 0.220, Gemini: 0.224)
  • 92% frequently assigned probability: loves extreme confidence
  • +0.06 positive bias vs markets: predicts higher than the crowd

Superforecasters still win (Brier Index 70.8 vs best LLM 64.2). Contrarianism hurts more than it helps.
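The mean |delta| and bias figures are simple summary statistics over (model forecast, market price) pairs; a sketch under that assumption, with illustrative numbers:

```python
def vs_market(model_probs, market_prices):
    """Disagreement vs the market: mean |delta| and signed bias."""
    deltas = [m - p for m, p in zip(model_probs, market_prices)]
    mean_abs_delta = sum(abs(d) for d in deltas) / len(deltas)
    bias = sum(deltas) / len(deltas)  # > 0: model sits above the crowd
    return mean_abs_delta, bias

mad, bias = vs_market([0.92, 0.10], [0.60, 0.30])
print(round(mad, 2), round(bias, 2))  # 0.26 0.06
```

Mean |delta| captures how far a model strays from consensus regardless of direction; bias captures whether it strays systematically high (Grok) or low (Gemini).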

Source: hails.info
14 / 16

Bonus: Applied Forecasting

EF 2025 Batch: Unicorn predictions

Horizontal bar chart showing Grok-4.20-beta zero-shot unicorn probability predictions for EF 2025 batch companies, ranging from 27% (Buckeye AI) to 9% (Krew)

Grok-4.20-beta zero-shot predictions for which EF 2025 companies become unicorns by 2031.

Source: BrierBench via OpenReward, x-ai/grok-4.20-beta
15 / 16

Massive caveats

  • Single zero-shot Grok run — no ensemble, no calibration, no fine-tuning
  • One question set only — results may not generalise across time periods
  • Retroactive forecasting — model may have indirect knowledge of some outcomes
  • EF unicorn predictions are entertainment, not investment advice
  • Base rate for unicorns is ~1% — all predictions are likely too high

Do not make decisions based on these numbers. This is a methodology demo, not a forecasting product.

Source: hails.info
16 / 16