A benchmark for interval forecasting with LLMs. Given a forecasting question, the model outputs (lower, median, upper) that should contain the true outcome with a specified coverage (e.g., 90%). Supports OpenAI, Anthropic, Gemini, OpenRouter, and retrieval-augmented / agentic prompting.
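Concretely, each prediction is a (lower, median, upper) triple that is later checked against the resolved outcome. An illustrative sketch of the contract (hypothetical numbers, not from the benchmark):

```python
# Illustrative only -- hypothetical interval and outcome.
lower, median, upper = 11.0, 18.0, 27.0   # model's 90% interval and point estimate
outcome = 19.0                            # resolved value, revealed at scoring time

assert lower <= median <= upper
covered = lower <= outcome <= upper       # across many questions this should hit ~90%
```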
```bash
python3 -m venv venv && source venv/bin/activate
python3 -m pip install -U pip && python3 -m pip install -r requirements.txt
```

Create a `.env` with the keys you need:
```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
OPENROUTER_API_KEY=sk-or-...
```

`OPENAI_API_KEY` is also used for retrieval (query embeddings).
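A quick sanity check that the keys are visible to Python, assuming the scripts load `.env` via python-dotenv (which may or may not be how this repo does it):

```python
import os
from dotenv import load_dotenv  # python-dotenv; an assumption, not confirmed by the repo

load_dotenv()  # reads .env from the current directory
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY", "OPENROUTER_API_KEY"):
    print(key, "set" if os.getenv(key) else "MISSING")
```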
Run GPT-5.1 on 100 questions, then score:
```bash
python3 scripts/cli/run_predictions.py \
  --config configs/openforesight_2025_09_01/openai_gpt51_medium_v3_100.json

python3 scripts/cli/evaluate_interval_benchmark.py \
  --predictions experiments/openforesight_2025_09_01/openai_gpt51_medium_v3_100/predictions.csv \
  --benchmark data/openforesight_2025_09_01/merged_filtered_final_questions.jsonl \
  --out experiments/openforesight_2025_09_01/openai_gpt51_medium_v3_100/scored.csv
```

Metrics (coverage, MLIS, MAPE) are printed to stdout and saved as `metrics.json`.
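A minimal sketch of these metrics, assuming MLIS is a mean interval (Winkler) score computed on log-transformed values and MAPE is taken against the median; the exact definitions live in `scripts/evaluation/`:

```python
import math

def score(rows, alpha=0.10):
    """rows: iterable of (lower, median, upper, outcome); alpha=0.10 targets 90% coverage."""
    covered, interval_scores, ape = [], [], []
    for lo, med, hi, y in rows:
        covered.append(lo <= y <= hi)
        # Interval (Winkler) score on log values: width penalty plus a miss
        # penalty scaled by 2/alpha. Lower is better. Assumes positive values.
        l, u, t = math.log(lo), math.log(hi), math.log(y)
        interval_scores.append((u - l) + (2 / alpha) * max(l - t, 0) + (2 / alpha) * max(t - u, 0))
        ape.append(abs(med - y) / abs(y))
    n = len(covered)
    return {
        "coverage": sum(covered) / n,
        "MLIS": sum(interval_scores) / n,  # one plausible reading of "MLIS"
        "MAPE": sum(ape) / n,
    }

print(score([(11.0, 18.0, 27.0, 19.0)]))
```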
Experiments are driven by a JSON config. Minimal example:
```json
{
  "experiment_folder": "experiments/my_run/",
  "benchmark": "data/openforesight_2025_09_01/merged_filtered_final_questions.jsonl",
  "provider": "openai",
  "model": "gpt-5.1",
  "reasoning_effort": "medium",
  "prompt_template": "forecasting_xml_v3",
  "out": "experiments/my_run/predictions.csv",
  "debug_jsonl": "experiments/my_run/debug.jsonl",
  "num_samples": 100,
  "sample_seed": 20251222,
  "parallel_workers": 20,
  "resume": true
}
```

Run with: `python3 scripts/cli/run_predictions.py --config path/to/config.json`
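Under the hood, the runner presumably samples `num_samples` questions with `sample_seed` and fans requests out across `parallel_workers`. A rough sketch of that control flow; `predict` and the exact sampling logic are illustrative, not the repo's API:

```python
import json
import random
from concurrent.futures import ThreadPoolExecutor

def predict(cfg, question):
    """Placeholder: the real provider dispatch lives in scripts/providers/."""
    raise NotImplementedError

def run(config_path):
    with open(config_path) as f:
        cfg = json.load(f)
    with open(cfg["benchmark"]) as f:
        questions = [json.loads(line) for line in f]   # one question per JSONL line
    rng = random.Random(cfg["sample_seed"])            # fixed seed -> reproducible subset
    subset = rng.sample(questions, cfg["num_samples"])
    with ThreadPoolExecutor(max_workers=cfg["parallel_workers"]) as pool:
        return list(pool.map(lambda q: predict(cfg, q), subset))
```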
Add a retrieval block:
"retrieval": {
"enabled": true,
"mode": "agentic",
"use_chunks": true,
"db_path": "data/vector_store",
"chunk_table_name": "article_chunks",
"max_iterations": 5,
"max_results_per_query": 5
}rag— retrieves k articles once before prompting.agentic— gives the model asearch_articlestool to query iteratively.
The vector store must be prebuilt (see scripts/retrieval/).
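A minimal sketch of building and querying the store, assuming LanceDB with OpenAI embeddings; the embedding model name and table schema here are guesses based on the config above, not the repo's actual choices:

```python
import lancedb
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY for query embeddings, as noted above

def embed(texts):
    # "text-embedding-3-small" is an assumed model; check scripts/retrieval/ for the real one.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

# Build (one-off): index article chunks into data/vector_store.
db = lancedb.connect("data/vector_store")
chunks = ["...article chunk 1...", "...article chunk 2..."]  # placeholder text
table = db.create_table(
    "article_chunks",
    data=[{"text": t, "vector": v} for t, v in zip(chunks, embed(chunks))],
    mode="overwrite",  # replace any existing table of the same name
)

# Query: top-5 nearest chunks, i.e. what max_results_per_query caps per search.
hits = table.search(embed(["example search query"])[0]).limit(5).to_list()
```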
| Template | Description |
|---|---|
| `forecasting_xml_v2` | Zero-shot: title only. |
| `forecasting_xml_v3` | Adds background + resolution criteria. Recommended. |
| `forecasting_xml_v3_rag` | V3 + retrieved articles prepended. |
| `forecasting_xml_v3_agentic` | V3 with a `search_articles` tool. |
| `forecasting_xml_v3_agentic_no_conf` | Agentic without the 90% target (tests inherent calibration). |
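The `xml` in the template names suggests the model is asked to emit tagged values. A hedged sketch of the extraction step that `scripts/parsing/` performs, assuming tags like `<lower>`/`<median>`/`<upper>` (the actual tag names may differ):

```python
import re

def parse_interval(response: str):
    """Pull (lower, median, upper) out of an XML-tagged model response."""
    def grab(tag):
        m = re.search(rf"<{tag}>\s*([-+0-9.,eE]+)\s*</{tag}>", response)
        return float(m.group(1).replace(",", "")) if m else None
    return grab("lower"), grab("median"), grab("upper")

print(parse_interval("<lower>12</lower><median>15.5</median><upper>22</upper>"))
# (12.0, 15.5, 22.0)
```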
```
configs/         JSON run/eval configs
data/            Benchmark files + vector store
experiments/     Outputs: predictions, debug logs, scored CSVs, metrics
plots/           Publication figures (PNG + PDF)
scripts/
  agentic/       Agentic loop (tool calls, fallback extraction)
  cli/           Entry points (run_predictions, evaluate, compare)
  data/          Dataset loaders
  evaluation/    Scoring metrics
  parsing/       Extract (lower, median, upper) from responses
  plotting/      Figure scripts
  prompts/       Prompt templates
  providers/     Provider SDK wrappers
  retrieval/     LanceDB indexer + retriever
```
- Resume: with `resume: true`, re-running skips questions already scored in `out` (sketched below).
- Rate limits: Anthropic often needs `parallel_workers: 20`; OpenAI tolerates 50–100.
- Question IDs are always preserved (never renumbered on subset runs).
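Resume presumably works by diffing question IDs against the existing output file; a sketch under that assumption (the `question_id` column name is a guess):

```python
import csv
import os

def already_done(out_path):
    """IDs already present in the output CSV, so a re-run can skip them."""
    if not os.path.exists(out_path):
        return set()
    with open(out_path, newline="") as f:
        return {row["question_id"] for row in csv.DictReader(f)}  # column name is a guess
```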