A benchmark for interval forecasting with LLMs. Given a forecasting question, the model outputs (lower, median, upper) that should contain the true outcome with a specified coverage (e.g., 90%). Supports OpenAI, Anthropic, Gemini, OpenRouter, and retrieval-augmented / agentic prompting.
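Concretely, each prediction is a (lower, median, upper) triple that is later checked against the resolved outcome. An illustrative sketch of the contract (hypothetical numbers, not from the benchmark):

```python
# Illustrative only -- hypothetical interval and outcome.
lower, median, upper = 11.0, 18.0, 27.0   # model's 90% interval and point estimate
outcome = 19.0                            # resolved value, revealed at scoring time

assert lower <= median <= upper
covered = lower <= outcome <= upper       # across many questions this should hit ~90%
```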
```bash
python3 -m venv venv && source venv/bin/activate
python3 -m pip install -U pip && python3 -m pip install -r requirements.txt
```

Create a `.env` with the keys you need:
```
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
OPENROUTER_API_KEY=sk-or-...
```

`OPENAI_API_KEY` is also used for retrieval (query embeddings).
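A quick sanity check that the keys are visible to Python, assuming the scripts load `.env` via python-dotenv (which may or may not be how this repo does it):

```python
import os
from dotenv import load_dotenv  # python-dotenv; an assumption, not confirmed by the repo

load_dotenv()  # reads .env from the current directory
for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY", "OPENROUTER_API_KEY"):
    print(key, "set" if os.getenv(key) else "MISSING")
```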
Run GPT-5.1 on 100 questions, then score:
```bash
python3 scripts/cli/run_predictions.py \
  --config configs/openforesight_2025_09_01/openai_gpt51_medium_v3_100.json

python3 scripts/cli/evaluate_interval_benchmark.py \
  --predictions experiments/openforesight_2025_09_01/openai_gpt51_medium_v3_100/predictions.csv \
  --benchmark data/openforesight_2025_09_01/merged_filtered_final_questions.jsonl \
  --out experiments/openforesight_2025_09_01/openai_gpt51_medium_v3_100/scored.csv
```

Metrics (coverage, MLIS, MAPE) are printed to stdout and saved as `metrics.json`.
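A minimal sketch of these metrics, assuming MLIS is a mean interval (Winkler) score computed on log-transformed values and MAPE is taken against the median; the exact definitions live in `scripts/evaluation/`:

```python
import math

def score(rows, alpha=0.10):
    """rows: iterable of (lower, median, upper, outcome); alpha=0.10 targets 90% coverage."""
    covered, interval_scores, ape = [], [], []
    for lo, med, hi, y in rows:
        covered.append(lo <= y <= hi)
        # Interval (Winkler) score on log values: width penalty plus a miss
        # penalty scaled by 2/alpha. Lower is better. Assumes positive values.
        l, u, t = math.log(lo), math.log(hi), math.log(y)
        interval_scores.append((u - l) + (2 / alpha) * max(l - t, 0) + (2 / alpha) * max(t - u, 0))
        ape.append(abs(med - y) / abs(y))
    n = len(covered)
    return {
        "coverage": sum(covered) / n,
        "MLIS": sum(interval_scores) / n,  # one plausible reading of "MLIS"
        "MAPE": sum(ape) / n,
    }

print(score([(11.0, 18.0, 27.0, 19.0)]))
```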
Experiments are driven by a JSON config. Minimal example:
```json
{
  "experiment_folder": "experiments/my_run/",
  "benchmark": "data/openforesight_2025_09_01/merged_filtered_final_questions.jsonl",
  "provider": "openai",
  "model": "gpt-5.1",
  "reasoning_effort": "medium",
  "prompt_template": "forecasting_xml_v3",
  "out": "experiments/my_run/predictions.csv",
  "debug_jsonl": "experiments/my_run/debug.jsonl",
  "num_samples": 100,
  "sample_seed": 20251222,
  "parallel_workers": 20,
  "resume": true
}
```

Run with: `python3 scripts/cli/run_predictions.py --config path/to/config.json`
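Under the hood, the runner presumably samples `num_samples` questions with `sample_seed` and fans requests out across `parallel_workers`. A rough sketch of that control flow; `predict` and the exact sampling logic are illustrative, not the repo's API:

```python
import json
import random
from concurrent.futures import ThreadPoolExecutor

def predict(cfg, question):
    """Placeholder: the real provider dispatch lives in scripts/providers/."""
    raise NotImplementedError

def run(config_path):
    with open(config_path) as f:
        cfg = json.load(f)
    with open(cfg["benchmark"]) as f:
        questions = [json.loads(line) for line in f]   # one question per JSONL line
    rng = random.Random(cfg["sample_seed"])            # fixed seed -> reproducible subset
    subset = rng.sample(questions, cfg["num_samples"])
    with ThreadPoolExecutor(max_workers=cfg["parallel_workers"]) as pool:
        return list(pool.map(lambda q: predict(cfg, q), subset))
```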
Add a retrieval block:
"retrieval": {
"enabled": true,
"mode": "agentic",
"use_chunks": true,
"db_path": "data/vector_store",
"chunk_table_name": "article_chunks",
"max_iterations": 5,
"max_results_per_query": 5
}rag— retrieves k articles once before prompting.agentic— gives the model asearch_articlestool to query iteratively.
The vector store must be prebuilt (see scripts/retrieval/).
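A minimal sketch of building and querying the store, assuming LanceDB with OpenAI embeddings; the embedding model name and table schema here are guesses based on the config above, not the repo's actual choices:

```python
import lancedb
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY for query embeddings, as noted above

def embed(texts):
    # "text-embedding-3-small" is an assumed model; check scripts/retrieval/ for the real one.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

# Build (one-off): index article chunks into data/vector_store.
db = lancedb.connect("data/vector_store")
chunks = ["...article chunk 1...", "...article chunk 2..."]  # placeholder text
table = db.create_table(
    "article_chunks",
    data=[{"text": t, "vector": v} for t, v in zip(chunks, embed(chunks))],
    mode="overwrite",  # replace any existing table of the same name
)

# Query: top-5 nearest chunks, i.e. what max_results_per_query caps per search.
hits = table.search(embed(["example search query"])[0]).limit(5).to_list()
```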
| Template | Description |
|---|---|
| `forecasting_xml_v2` | Zero-shot: title only. |
| `forecasting_xml_v3` | Adds background + resolution criteria. Recommended. |
| `forecasting_xml_v3_rag` | V3 + retrieved articles prepended. |
| `forecasting_xml_v3_agentic` | V3 with a `search_articles` tool. |
| `forecasting_xml_v3_agentic_no_conf` | Agentic without the 90% target (tests inherent calibration). |
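The `xml` in the template names suggests the model is asked to emit tagged values. A hedged sketch of the extraction step that `scripts/parsing/` performs, assuming tags like `<lower>`/`<median>`/`<upper>` (the actual tag names may differ):

```python
import re

def parse_interval(response: str):
    """Pull (lower, median, upper) out of an XML-tagged model response."""
    def grab(tag):
        m = re.search(rf"<{tag}>\s*([-+0-9.,eE]+)\s*</{tag}>", response)
        return float(m.group(1).replace(",", "")) if m else None
    return grab("lower"), grab("median"), grab("upper")

print(parse_interval("<lower>12</lower><median>15.5</median><upper>22</upper>"))
# (12.0, 15.5, 22.0)
```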
```
configs/         JSON run/eval configs
data/            Benchmark files + vector store
experiments/     Outputs: predictions, debug logs, scored CSVs, metrics
plots/           Publication figures (PNG + PDF)
scripts/
  agentic/       Agentic loop (tool calls, fallback extraction)
  cli/           Entry points (run_predictions, evaluate, compare)
  data/          Dataset loaders
  evaluation/    Scoring metrics
  parsing/       Extract (lower, median, upper) from responses
  plotting/      Figure scripts
  prompts/       Prompt templates
  providers/     Provider SDK wrappers
  retrieval/     LanceDB indexer + retriever
```
- Resume: with `resume: true`, re-running skips questions already scored in `out` (sketched below).
- Rate limits: Anthropic often needs `parallel_workers: 20`; OpenAI tolerates 50–100.
- Question IDs are always preserved (never renumbered on subset runs).
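Resume presumably works by diffing question IDs against the existing output file; a sketch under that assumption (the `question_id` column name is a guess):

```python
import csv
import os

def already_done(out_path):
    """IDs already present in the output CSV, so a re-run can skip them."""
    if not os.path.exists(out_path):
        return set()
    with open(out_path, newline="") as f:
        return {row["question_id"] for row in csv.DictReader(f)}  # column name is a guess
```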