QuantSightBench - Evaluating LLM Quantitative Forecasting with Prediction Intervals

A benchmark for interval forecasting with LLMs. Given a forecasting question, the model outputs a (lower, median, upper) triple whose [lower, upper] interval should contain the true outcome with a specified coverage (e.g., 90%). Supports OpenAI, Anthropic, Gemini, and OpenRouter, plus retrieval-augmented and agentic prompting.
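As a concrete (made-up) illustration, not an actual benchmark question: a prediction and its coverage check come down to the following; the field names are illustrative, not the benchmark's schema.

# Hypothetical prediction for a single question (all values made up).
prediction = {"lower": 120.0, "median": 180.0, "upper": 260.0}
outcome = 205.0  # true value once the question resolves

# This interval covers the outcome; over many questions, a
# well-calibrated 90% model should cover ~90% of true outcomes.
covered = prediction["lower"] <= outcome <= prediction["upper"]
print(covered)  # True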

Installation

python3 -m venv venv && source venv/bin/activate
python3 -m pip install -U pip && python3 -m pip install -r requirements.txt

Create a .env with the keys you need:

OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...
OPENROUTER_API_KEY=sk-or-...

OPENAI_API_KEY is also used for retrieval (query embeddings).
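If you drive the scripts from your own Python code rather than the CLI, the keys just need to end up in the environment. A minimal sketch assuming the python-dotenv package (the repo's own loading mechanism may differ):

import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current directory into os.environ
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY not set"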

Quickstart

Run GPT-5.1 on 100 questions, then score:

python3 scripts/cli/run_predictions.py \
    --config configs/openforesight_2025_09_01/openai_gpt51_medium_v3_100.json

python3 scripts/cli/evaluate_interval_benchmark.py \
    --predictions experiments/openforesight_2025_09_01/openai_gpt51_medium_v3_100/predictions.csv \
    --benchmark data/openforesight_2025_09_01/merged_filtered_final_questions.jsonl \
    --out experiments/openforesight_2025_09_01/openai_gpt51_medium_v3_100/scored.csv

Metrics (coverage, MLIS, MAPE) are printed to stdout and saved as metrics.json.
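For intuition, here is a sketch of how such metrics can be computed from (lower, median, upper) predictions and true outcomes. The authoritative definitions live in scripts/evaluation/; in particular, the log-scaled Winkler interval score below is an assumption about what MLIS means:

import numpy as np

def interval_metrics(lower, median, upper, truth, alpha=0.10):
    lower, median, upper, truth = map(np.asarray, (lower, median, upper, truth))

    # Coverage: fraction of true outcomes inside [lower, upper].
    coverage = float(np.mean((truth >= lower) & (truth <= upper)))

    # Winkler interval score on log-transformed values (assumed MLIS):
    # interval width plus 2/alpha penalties for misses on either side.
    lo, hi, y = np.log(lower), np.log(upper), np.log(truth)
    scores = (
        (hi - lo)
        + (2 / alpha) * np.maximum(lo - y, 0)
        + (2 / alpha) * np.maximum(y - hi, 0)
    )
    mlis = float(np.mean(scores))

    # MAPE of the median point forecast, in percent.
    mape = float(np.mean(np.abs((truth - median) / truth)) * 100)

    return {"coverage": coverage, "MLIS": mlis, "MAPE": mape}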

Running your own experiment

Experiments are driven by a JSON config. Minimal example:

{
  "experiment_folder": "experiments/my_run/",
  "benchmark":         "data/openforesight_2025_09_01/merged_filtered_final_questions.jsonl",
  "provider":          "openai",
  "model":             "gpt-5.1",
  "reasoning_effort":  "medium",
  "prompt_template":   "forecasting_xml_v3",
  "out":               "experiments/my_run/predictions.csv",
  "debug_jsonl":       "experiments/my_run/debug.jsonl",
  "num_samples":       100,
  "sample_seed":       20251222,
  "parallel_workers":  20,
  "resume":            true
}

Run with: python3 scripts/cli/run_predictions.py --config path/to/config.json

Retrieval (optional)

Add a retrieval block:

"retrieval": {
  "enabled":               true,
  "mode":                  "agentic",
  "use_chunks":            true,
  "db_path":               "data/vector_store",
  "chunk_table_name":      "article_chunks",
  "max_iterations":        5,
  "max_results_per_query": 5
}
The mode field selects the retrieval strategy:
  • rag — retrieves k articles once, before prompting.
  • agentic — gives the model a search_articles tool it can query iteratively.

The vector store must be prebuilt (see scripts/retrieval/).
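For orientation, a sketch of what a single chunk lookup against that store could look like, using the db_path and chunk_table_name from the config above; the embedding model name is an assumption, and the real retriever lives in scripts/retrieval/:

import lancedb
from openai import OpenAI

client = OpenAI()  # query embeddings use OPENAI_API_KEY

def search_articles(query: str, k: int = 5) -> list[dict]:
    # Embed the query text; the model name here is an assumption.
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding

    # Matches "db_path" and "chunk_table_name" in the retrieval block.
    db = lancedb.connect("data/vector_store")
    table = db.open_table("article_chunks")
    return table.search(emb).limit(k).to_list()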

Prompt templates

Template                            Description
forecasting_xml_v2                  Zero-shot: title only.
forecasting_xml_v3                  Adds background + resolution criteria. Recommended.
forecasting_xml_v3_rag              V3 + retrieved articles prepended.
forecasting_xml_v3_agentic          V3 with a search_articles tool.
forecasting_xml_v3_agentic_no_conf  Agentic without the 90% target (tests inherent calibration).

Repo layout

configs/       JSON run/eval configs
data/          Benchmark files + vector store
experiments/   Outputs: predictions, debug logs, scored CSVs, metrics
plots/         Publication figures (PNG + PDF)
scripts/
  agentic/     Agentic loop (tool calls, fallback extraction)
  cli/         Entry points (run_predictions, evaluate, compare)
  data/        Dataset loaders
  evaluation/  Scoring metrics
  parsing/     Extract (lower, median, upper) from responses (see sketch below)
  plotting/    Figure scripts
  prompts/     Prompt templates
  providers/   Provider SDK wrappers
  retrieval/   LanceDB indexer + retriever
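The template names suggest the model answers in XML tags; purely as a hypothetical sketch of what scripts/parsing/ does (the actual tag names and fallback behavior are defined there), extraction might look like:

import re

# Hypothetical tag names; the real templates define their own format.
_PATTERN = re.compile(
    r"<lower>\s*([-\d.,]+)\s*</lower>.*?"
    r"<median>\s*([-\d.,]+)\s*</median>.*?"
    r"<upper>\s*([-\d.,]+)\s*</upper>",
    re.DOTALL,
)

def parse_interval(text: str):
    """Extract (lower, median, upper) from an XML-tagged response."""
    m = _PATTERN.search(text)
    if m is None:
        return None  # agentic runs use fallback extraction (scripts/agentic/)
    return tuple(float(g.replace(",", "")) for g in m.groups())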

Notes

  • Resume: with resume: true, re-running skips questions already scored in out (see the sketch after this list).
  • Rate limits: Anthropic often needs parallel_workers: 20; OpenAI tolerates 50–100.
  • Question IDs are always preserved (never renumbered on subset runs).
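The resume behavior boils down to skipping question IDs already present in the output CSV; a hypothetical sketch (the question_id column name is an assumption):

import csv
import os

def completed_ids(out_path: str) -> set:
    """Question IDs already scored in the predictions CSV."""
    if not os.path.exists(out_path):
        return set()
    with open(out_path, newline="") as f:
        return {row["question_id"] for row in csv.DictReader(f)}

# With "resume": true, the runner effectively filters:
# questions = [q for q in questions if q["id"] not in completed_ids(cfg["out"])]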
