ForecastBench Replication

Can LLMs Predict the Future?

Evaluating frontier models against prediction markets & superforecasters

Builders Retreat · March 2026
Daniel Hails
Source: hails.info
1 / 16

Motivation

Forecasting is hard

Term-premium-adjusted forward curves vs the actual 3-Month Treasury Bill rate, 1985–2030, showing how market-implied forecasts consistently fail to predict realized rates

40 years of forward curves vs reality. Markets are systematically wrong about future rates. Can LLMs do better?

Sources: FRED TB3MS; GSW (FEDS 2006-28); ACM term premium (NY Fed)
2 / 16

Context

AI agents are getting autonomous fast

METR autonomy evaluation on linear scale showing AI agent task completion time growing from seconds to 12 hours, with Claude Opus 4.6 at the top

METR autonomy evals: time horizon of tasks AI agents can complete has gone from seconds to 12 hours. Autonomous agents need to predict consequences of their actions.

Source: METR autonomy evaluations (metr.org)
3 / 16

The Benchmark

ForecastBench: Can LLMs forecast?

ForecastBench leaderboard showing Brier Index over time, with superforecaster baseline at 70.8 and best LLM trending toward parity by May 2027

LLM forecasting ability is improving steadily. Projected superforecaster parity: May 2027 (95% CI: May 2026 – July 2028). Contamination-free: 500 live questions every 2 weeks.

Source: Karger et al. (2025), forecastbench.org
4 / 16

Experiment 1

Replication: Gemini-2.5-Flash

Ran all 24 question sets via OpenRouter. Raw Brier scores closely match the official leaderboard.

Metric            Ours (N=22,829)   Leaderboard   Delta
Brier (dataset)   0.185             0.170         +0.015
Brier (market)    0.132             0.146         −0.014
Brier (overall)   0.181             0.158         +0.023

Cohen's d = 0.12 (negligible). Statistically significant but practically identical.
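The Brier scores in the table are just the mean squared error between a probability forecast and the realized 0/1 outcome. A minimal sketch (the function name is ours, not ForecastBench's code):

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between probability forecasts and 0/1 outcomes.

    Lower is better: 0.0 is perfect; a constant 0.5 forecast scores 0.25.
    """
    pairs = list(zip(forecasts, outcomes))
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)

# A 0.7 forecast on an event that happens contributes (0.7 - 1)^2 = 0.09
print(round(brier_score([0.7, 0.2], [1, 0]), 4))  # 0.065
```

The leaderboard's Brier Index rescales this so that higher is better; the raw score above is what the replication table reports.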

Source: ForecastBench leaderboard; our replication via OpenRouter
5 / 16

Our Contribution

BrierBench: an OpenReward environment

We packaged ForecastBench as a reusable evaluation environment on OpenReward. Any LLM agent can be benchmarked with a single API call — zero-shot, with web research, or with structured forecast strategies.

  • 3 strategies: zero_shot, zero_shot_research, forecast_skill
  • $10–$175 full eval cost ($10 Gemini-Flash, $175 Grok/Sonnet)
  • 10 min wall time (concurrency=100)
Source: hails.info
6 / 16

Experiment 1

Score distribution: market vs dataset

Histogram of per-question Brier scores for market and dataset splits, showing most predictions cluster near zero

Market questions cluster near 0 (easy wins). Dataset questions spread wider — time-series forecasting is harder than binary market questions.

Source: our replication (N=22,829)
7 / 16

Experiment 1

Brier score per question set over time for Gemini-2.5-Flash, with horizontal reference lines for superforecaster and crowd baselines

Per-question-set Brier scores over time. The leaderboard baseline (0.158) is close to our mean. Recent sets trend higher due to unresolved long-horizon forecasts.

Source: ForecastBench leaderboard; our replication
8 / 16

BrierBench

Zero-shot vs Research accuracy over time

BrierBench evaluation showing zero-shot and zero-shot-research Brier scores by question date, with model cutoff line

Brier scores by question date. Research strategy (web search) improves accuracy. Post-cutoff questions are harder — no training data leakage.

Source: BrierBench evaluation via OpenReward
9 / 16

Forecast Skill

Strategy selection tree

%%{init: {'theme': 'dark', 'themeVariables': {'primaryColor': '#3d2022', 'primaryTextColor': '#e8e6e1', 'primaryBorderColor': '#c44e52', 'lineColor': '#555550', 'secondaryColor': '#2a3a56', 'tertiaryColor': '#2d2d2d', 'fontFamily': 'IBM Plex Sans', 'fontSize': '13px', 'nodeBorder': '#c44e52', 'mainBkg': '#3d2022', 'clusterBkg': '#2d2d2d'}}}%%
graph TD
    A{Binary?} -->|Yes| B{Stock direction?}
    A -->|No| C{Continuous or discrete?}

    B -->|Yes| D[binary_stock_direction]
    B -->|No| E{Decomposable?}

    E -->|Yes| F[scenario_decomposition]
    E -->|No| G{Needs simulation?}

    G -->|Yes| H[monte_carlo_binary]
    G -->|No| I[scenario_decomposition]

    C -->|Continuous| J{Price/index GBM?}
    C -->|Discrete| K{Rare events?}

    J -->|Yes| L[mixture_lognormal_cdf]
    J -->|No| M{Naturally lognormal?}

    M -->|Yes| N[direct_lognormal_cdf]
    M -->|No| O{Needs simulation?}

    O -->|Yes| P[monte_carlo_cdf]
    O -->|No| Q[mixture_normal_cdf]

    K -->|Yes| R[poisson_pmf]
    K -->|No| S{Independent trials?}

    S -->|Yes| T[poisson_binomial_pmf]
    S -->|No| U[negative_binomial_pmf]

    style D fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style F fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style H fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style I fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style L fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style N fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style P fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style Q fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style R fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style T fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
    style U fill:#2a3a56,stroke:#4c72b0,color:#7da1d4
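The decision tree above can be walked as a plain function over question features. This is an illustrative sketch: the field names (`binary`, `kind`, `gbm_price`, etc.) are ours, not the actual BrierBench question schema.

```python
def select_strategy(q: dict) -> str:
    """Walk the forecast_skill decision tree for one question's features."""
    if q["binary"]:
        if q.get("stock_direction"):
            return "binary_stock_direction"
        if q.get("decomposable"):
            return "scenario_decomposition"
        # not decomposable: simulate if possible, else decompose anyway
        return "monte_carlo_binary" if q.get("needs_simulation") else "scenario_decomposition"
    if q["kind"] == "continuous":
        if q.get("gbm_price"):  # price/index well modeled by geometric Brownian motion
            return "mixture_lognormal_cdf"
        if q.get("lognormal"):  # naturally lognormal quantity
            return "direct_lognormal_cdf"
        return "monte_carlo_cdf" if q.get("needs_simulation") else "mixture_normal_cdf"
    # discrete counts
    if q.get("rare_events"):
        return "poisson_pmf"
    return "poisson_binomial_pmf" if q.get("independent_trials") else "negative_binomial_pmf"

print(select_strategy({"binary": True, "stock_direction": True}))  # binary_stock_direction
```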
            
Source: BrierBench forecast_skill strategy
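As an example of one leaf, a monte_carlo_binary strategy reduces to simulating the underlying process many times and counting hits. The parameterization here is ours, not the BrierBench implementation:

```python
import random

def monte_carlo_binary(simulate_once, n=10_000, seed=0):
    """Estimate P(event) by running a (rng -> bool) simulator n times."""
    rng = random.Random(seed)
    return sum(simulate_once(rng) for _ in range(n)) / n

# P(at least one success in 3 independent 20% trials); exact: 1 - 0.8**3 = 0.488
p = monte_carlo_binary(lambda rng: any(rng.random() < 0.2 for _ in range(3)))
```

The estimate converges at a rate of roughly 1/sqrt(n), so 10,000 draws pin the probability down to about ±0.005 here.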
10 / 16

Experiment 2

GPT-5.4 vs Live Markets

Scatter plot of GPT-5.4 forecast vs market price

Closest to consensus. Mean |δ| = 0.220. Diagonal = perfect agreement.

Sources: Manifold, Metaculus, INFER, Polymarket via ForecastBench
11 / 16

Experiment 2

Gemini-3.1-Pro vs Live Markets

Scatter plot of Gemini-3.1-Pro forecast vs market price

Conservative bias — systematically predicts lower. Bias = −0.083.

Sources: Manifold, Metaculus, INFER, Polymarket via ForecastBench
12 / 16

Experiment 2

Grok-4.20-beta vs Live Markets

Scatter plot of Grok-4.20-beta forecast vs market price showing extreme contrarian positions

Most contrarian. Mean |δ| = 0.274. Loves extreme probabilities (92–95%).

Sources: Manifold, Metaculus, INFER, Polymarket via ForecastBench
13 / 16

Key Finding

Grok is a weird contrarian

  • 0.274 mean |delta| vs markets (GPT-5.4: 0.220, Gemini: 0.224)
  • 92% frequently assigned probability: loves extreme confidence
  • +0.06 positive bias vs markets: predicts higher than the crowd

Superforecasters still win (Brier Index 70.8 vs best LLM 64.2). Contrarianism hurts more than it helps.
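The mean |delta| and bias figures are simple summary statistics over (model forecast, market price) pairs; a sketch under that assumption, with illustrative numbers:

```python
def vs_market(model_probs, market_prices):
    """Disagreement vs the market: mean |delta| and signed bias."""
    deltas = [m - p for m, p in zip(model_probs, market_prices)]
    mean_abs_delta = sum(abs(d) for d in deltas) / len(deltas)
    bias = sum(deltas) / len(deltas)  # > 0: model sits above the crowd
    return mean_abs_delta, bias

mad, bias = vs_market([0.92, 0.10], [0.60, 0.30])
print(round(mad, 2), round(bias, 2))  # 0.26 0.06
```

Mean |delta| captures how far a model strays from consensus regardless of direction; bias captures whether it strays systematically high (Grok) or low (Gemini).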

Source: hails.info
14 / 16

Bonus: Applied Forecasting

EF 2025 Batch: Unicorn predictions

Horizontal bar chart showing Grok-4.20-beta zero-shot unicorn probability predictions for EF 2025 batch companies, ranging from 27% (Buckeye AI) to 9% (Krew)

Grok-4.20-beta zero-shot predictions for which EF 2025 companies become unicorns by 2031.

Source: BrierBench via OpenReward, x-ai/grok-4.20-beta
15 / 16

Massive caveats

  • Single zero-shot Grok run — no ensemble, no calibration, no fine-tuning
  • One question set only — results may not generalise across time periods
  • Retroactive forecasting — model may have indirect knowledge of some outcomes
  • EF unicorn predictions are entertainment, not investment advice
  • Base rate for unicorns is ~1% — all predictions are likely too high

Do not make decisions based on these numbers. This is a methodology demo, not a forecasting product.

Source: hails.info
16 / 16