
feat(vortex-geo): geometry Bbox zone-map statistic + distance-filter pruning#8646

Open
HarukiMoriarty wants to merge 4 commits into develop from nemo/geo-bounds-zonemap


Summary

Adds spatial chunk pruning to Vortex. A new GeometryBounds aggregate stores a per-chunk minimum bounding rectangle (MBR) as a zone-map statistic, and a stats-rewrite rule uses it to skip chunks that cannot satisfy an ST_Distance(geom, const) <= r filter.

Limitations

  • Only <= and < are pruned. > and >= are soundly prunable via the symmetric farthest-corner bound but are intentionally omitted (presumably rarely needed).
  • Pruning is always sound, but its benefit depends heavily on the write order of the geometry column: chunks are skipped only when the layout is spatially clustered (e.g. a Hilbert/Z-order sort), so that chunk MBRs are tight and non-overlapping.
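
To illustrate the two bounds mentioned above, here is a minimal sketch for a point constant and planar Euclidean distance. The `Bbox` type and all function names are hypothetical and are not taken from the PR; the PR itself only implements the near (`<=` / `<`) side.

```rust
/// Axis-aligned bounding box of a chunk (hypothetical type for this sketch).
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Bbox {
    pub xmin: f64,
    pub ymin: f64,
    pub xmax: f64,
    pub ymax: f64,
}

/// Lower bound on the distance from `(px, py)` to any geometry inside `b`:
/// the distance to the nearest point of the box (0 if the point is inside).
pub fn min_distance(b: &Bbox, px: f64, py: f64) -> f64 {
    let dx = (b.xmin - px).max(px - b.xmax).max(0.0);
    let dy = (b.ymin - py).max(py - b.ymax).max(0.0);
    dx.hypot(dy)
}

/// Upper bound on that distance: the distance to the farthest corner.
pub fn max_distance(b: &Bbox, px: f64, py: f64) -> f64 {
    let dx = (px - b.xmin).abs().max((px - b.xmax).abs());
    let dy = (py - b.ymin).abs().max((py - b.ymax).abs());
    dx.hypot(dy)
}

/// `distance <= r` cannot hold for any row if even the nearest point is too far.
pub fn can_skip_le(b: &Bbox, px: f64, py: f64, r: f64) -> bool {
    min_distance(b, px, py) > r
}

/// `distance > r` cannot hold for any row if even the farthest corner is within `r`.
pub fn can_skip_gt(b: &Bbox, px: f64, py: f64, r: f64) -> bool {
    max_distance(b, px, py) <= r
}
```

For a chunk box `(0, 0, 10, 10)` and a query point `(13, 14)`, the nearest-point bound is 5, so a `<= 4.9` filter could skip the chunk while a `<= 5.0` filter could not.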

Testing

8 new vortex-geo tests:

  • Point bbox accumulated across batches.
  • Polygon bbox computed over all ring vertices.
  • Empty group → null.
  • Registry self-declaration.
  • Only <=/< prune, while >/>=/==/!= do not (parameterized).
  • Distance symmetry.
  • Non-distance comparisons are ignored.
  • An end-to-end falsify test.

Performance

SF=1

┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Query ┃ duckdb:parquet (base) ┃   duckdb:vortex ┃ duckdb:vortex-geo-native ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1     │                39.8ms │  15.5ms (0.39x) │            3.1ms (0.08x) │
│ 2     │               147.8ms │  39.9ms (0.27x) │           47.4ms (0.32x) │
│ 3     │                66.1ms │  24.0ms (0.36x) │            5.1ms (0.08x) │
│ 4     │               563.1ms │  60.2ms (0.11x) │          109.7ms (0.19x) │
│ 5     │               356.8ms │ 284.1ms (0.80x) │          287.9ms (0.81x) │
│ 6     │               677.3ms │  93.9ms (0.14x) │          149.5ms (0.22x) │
│ 7     │               174.0ms │  70.3ms (0.40x) │           96.7ms (0.56x) │
│ 8     │               142.4ms │  48.4ms (0.34x) │           66.1ms (0.46x) │
│ 9     │                18.7ms │  17.0ms (0.91x) │           19.3ms (1.03x) │
└───────┴───────────────────────┴─────────────────┴──────────────────────────┘

SF=3

┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Query ┃ duckdb:parquet (base) ┃   duckdb:vortex ┃ duckdb:vortex-geo-native ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1     │                47.8ms │  36.4ms (0.76x) │            4.0ms (0.08x) │
│ 2     │               143.8ms │  61.0ms (0.42x) │          113.1ms (0.79x) │
│ 3     │                83.1ms │  68.2ms (0.82x) │            7.2ms (0.09x) │
│ 4     │               557.2ms │  67.4ms (0.12x) │          146.2ms (0.26x) │
│ 5     │               949.6ms │ 882.9ms (0.93x) │          897.8ms (0.95x) │
│ 6     │               674.5ms │ 124.5ms (0.18x) │          231.7ms (0.34x) │
│ 7     │               332.5ms │ 330.9ms (1.00x) │          277.2ms (0.83x) │
│ 8     │               183.7ms │ 162.0ms (0.88x) │          209.2ms (1.14x) │
│ 9     │                28.5ms │  25.9ms (0.91x) │           28.7ms (1.01x) │
└───────┴───────────────────────┴─────────────────┴──────────────────────────┘

SF=10

┏━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Query ┃ duckdb:parquet (base) ┃   duckdb:vortex ┃ duckdb:vortex-geo-native ┃
┡━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1     │               161.6ms │ 109.2ms (0.68x) │            7.8ms (0.05x) │
│ 2     │               358.4ms │ 187.9ms (0.52x) │          329.1ms (0.92x) │
│ 3     │               270.7ms │ 237.7ms (0.88x) │           14.3ms (0.05x) │
│ 4     │               912.7ms │ 131.0ms (0.14x) │          157.8ms (0.17x) │
│ 5     │                 3.37s │   3.10s (0.92x) │            3.14s (0.93x) │
│ 6     │                 1.25s │ 301.7ms (0.24x) │          501.0ms (0.40x) │
│ 7     │                 1.27s │ 914.1ms (0.72x) │          969.6ms (0.76x) │
│ 8     │               693.5ms │ 918.7ms (1.32x) │          683.1ms (0.99x) │
│ 9     │                36.3ms │  35.7ms (0.98x) │           44.5ms (1.22x) │
└───────┴───────────────────────┴─────────────────┴──────────────────────────┘

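Takeaway: when the data is pre-sorted, bounding-box pruning significantly reduces the data DuckDB has to read in, which makes Q1 and Q3 much faster (at SF=10, Q1 drops from 109.2ms on duckdb:vortex to 7.8ms on duckdb:vortex-geo-native, and Q3 from 237.7ms to 14.3ms).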

Single-table `ST_DWithin(col, literal, radius)` filters over native-geometry
Vortex scans now push down as `vortex.geo.distance(..) <= radius`, evaluated
natively via a `GeoDistance` point fast path. No query rewriting is needed;
any query using `ST_DWithin` benefits.

Spatial's own `ST_DWithin` cannot be pushed down as-is: its bind folds the
radius into opaque bind data, while its SPATIAL_JOIN optimization requires
exactly that folded 2-argument form. The pieces:

- `duckdb_vx_register_st_dwithin_override` shadows `ST_DWithin` in the user
  catalog with a copy whose bind is cleared, so bound calls keep the radius
  as `children[2]`. Unpushed occurrences still execute correctly through
  spatial's own 3-column code path.
- The expression converter lowers the 3-argument `st_dwithin` (and
  `st_distance`) to `GeoDistance`, guarded by the scan's fields: geometry
  columns must be native (`vortex.geo.wkb` columns fall back to DuckDB).
  `can_push_expression` dry-runs the lowering so DuckDB never installs an
  ExpressionFilter the scan would later drop.
- `RestoreStDWithin`, in the existing Vortex optimizer pass, rebinds every
  remaining 3-argument `st_dwithin` (join conditions, unpushed filters)
  through spatial's original entry, restoring the folded form its
  spatial-join optimization requires.
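
The lowering and its dry run can be sketched on a toy expression tree. Everything below (the `Expr` enum, `lower`, and the slice of native column names) is a hypothetical illustration and not the converter's real types; it covers only the 3-argument `st_dwithin` case and leaves out `st_distance`.

```rust
/// Toy expression tree (hypothetical; not DuckDB's or Vortex's real types).
#[derive(Clone, Debug, PartialEq)]
pub enum Expr {
    Column(String),
    GeomLiteral(String),
    Literal(f64),
    Call { name: String, args: Vec<Expr> },
    /// Lowered form: distance(lhs, rhs) <= radius.
    DistanceLe { lhs: Box<Expr>, rhs: Box<Expr>, radius: Box<Expr> },
}

/// Lower `st_dwithin(a, b, r)` to `distance(a, b) <= r`.
/// Returns `None` when the call has a different shape or references a
/// geometry column that is not stored natively, so the caller can fall back.
pub fn lower(expr: &Expr, native_columns: &[&str]) -> Option<Expr> {
    let Expr::Call { name, args } = expr else {
        return None;
    };
    // Only the 3-argument form keeps the radius visible as `args[2]`.
    if name.as_str() != "st_dwithin" || args.len() != 3 {
        return None;
    }
    // Guard: every geometry column argument must be a native geometry column.
    for arg in &args[..2] {
        if let Expr::Column(col) = arg {
            if !native_columns.contains(&col.as_str()) {
                return None;
            }
        }
    }
    Some(Expr::DistanceLe {
        lhs: Box::new(args[0].clone()),
        rhs: Box::new(args[1].clone()),
        radius: Box::new(args[2].clone()),
    })
}

/// Dry-run the lowering: claim the filter only if it would really convert.
pub fn can_push_expression(expr: &Expr, native_columns: &[&str]) -> bool {
    lower(expr, native_columns).is_some()
}
```

Having `can_push_expression` reuse `lower` keeps the "can I push this?" answer and the actual conversion from drifting apart, which is the property the dry run in the commit message is after.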

duckdb-bench registers the override once per connection, after
`LOAD spatial`; it is a no-op when spatial is absent.

SF1.0: Q1 47ms (parquet) / 16ms (vortex WKB) / 5.4ms (native, pushed);
Q8 keeps SPATIAL_JOIN at ~89ms on the native lane.

Signed-off-by: Nemo Yu <zyu379@wisc.edu>
- Move the ST_DWithin restore pass to the optimizer extension's
  pre_optimize_function. DuckDB runs all extensions' pre-optimize hooks
  before any post-optimize pass, so this guarantees the restore precedes
  spatial's spatial-join optimization without relying on registration
  order. Since it now runs before filter pushdown, restrict it to join
  conditions so filters keep the visible radius they need to push.
- Compare the function name with `==`; DuckDB identifiers are already
  case-insensitive.
- Terser doc comment on `duckdb_vx_register_st_dwithin_override`.
- Drop a debug print from the geo function converter.

Signed-off-by: Nemo Yu <zyu379@wisc.edu>
…-filter pruning

Add a `GeometryBounds` aggregate that computes a per-chunk 2D minimum bounding
rectangle (`Struct<xmin, ymin, xmax, ymax>`) for native geometry columns and
stores it as a zone-map statistic, plus a `GeoDistanceBoundsPrune` stats-rewrite
rule that skips a chunk when its MBR is disjoint from a
`GeoDistance(geom, const) <= r` query box.
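
As a rough illustration of what such an aggregate computes, the sketch below folds batches of vertices into a running bounding box. The `Bbox` and `BoundsAccumulator` names are hypothetical, and skipping NaN coordinates is an assumption of this sketch rather than documented behavior of `GeometryBounds`.

```rust
/// 2D bounding rectangle, mirroring `Struct<xmin, ymin, xmax, ymax>`.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Bbox {
    pub xmin: f64,
    pub ymin: f64,
    pub xmax: f64,
    pub ymax: f64,
}

/// Running bounding box over all vertices seen so far (hypothetical type).
#[derive(Debug, Default)]
pub struct BoundsAccumulator {
    bbox: Option<Bbox>,
}

impl BoundsAccumulator {
    /// Fold one batch of vertices (points, or all ring vertices of polygons).
    pub fn update(&mut self, vertices: &[(f64, f64)]) {
        for &(x, y) in vertices {
            // Assumption of this sketch: NaN coordinates do not contribute.
            if x.is_nan() || y.is_nan() {
                continue;
            }
            self.bbox = Some(match self.bbox {
                None => Bbox { xmin: x, ymin: y, xmax: x, ymax: y },
                Some(b) => Bbox {
                    xmin: b.xmin.min(x),
                    ymin: b.ymin.min(y),
                    xmax: b.xmax.max(x),
                    ymax: b.ymax.max(y),
                },
            });
        }
    }

    /// `None` for an empty group, which corresponds to a null statistic.
    pub fn finish(&self) -> Option<Bbox> {
        self.bbox
    }
}
```

Feeding the batches `[(1, 5), (3, 2)]` and `[(-2, 4)]` yields the box `(-2, 2, 3, 5)`, while an accumulator that never sees a vertex finishes as `None`.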

- vortex-array: aggregates self-declare as default zone stats via a
  `zone_stat_default` vtable/plugin hook; the zoned writer discovers them from
  the session registry in deterministic (id-sorted) order.
- vortex-geo: `GeometryBounds` covers Point/Polygon/MultiPolygon; the prune rule
  guards on `is_native_geometry`, ignores a NaN radius, and only handles the
  near forms `<=` / `<`. Missing stats bind to null, so older files degrade to
  no pruning.
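
The guards listed above can be put together in a small decision function. This is a sketch under assumptions: the names (`can_prune`, `CmpOp`, `Bbox`) are hypothetical, the query box is taken to be the constant's bounding box expanded by the radius on every side, and a missing statistic is modeled as `None`.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Bbox {
    pub xmin: f64,
    pub ymin: f64,
    pub xmax: f64,
    pub ymax: f64,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum CmpOp {
    Lt,
    Le,
    Gt,
    Ge,
    Eq,
    Ne,
}

/// Returns `true` only when the chunk provably contains no matching row.
/// Returning `false` is always safe: the chunk is simply scanned.
pub fn can_prune(
    column_is_native: bool,
    chunk_stat: Option<Bbox>, // `None` models a missing (null) statistic
    constant: Bbox,           // bounding box of the constant geometry
    op: CmpOp,
    radius: f64,
) -> bool {
    // Guard: only native geometry columns carry the statistic.
    if !column_is_native {
        return false;
    }
    // Only the near forms are handled; every other comparison is left alone.
    if !matches!(op, CmpOp::Le | CmpOp::Lt) {
        return false;
    }
    // A NaN radius gives no usable bound.
    if radius.is_nan() {
        return false;
    }
    // Older files without the statistic degrade to no pruning.
    let Some(chunk) = chunk_stat else {
        return false;
    };
    // Query box: the constant's box grown by `radius` on every side.
    let q = Bbox {
        xmin: constant.xmin - radius,
        ymin: constant.ymin - radius,
        xmax: constant.xmax + radius,
        ymax: constant.ymax + radius,
    };
    // Disjoint boxes mean some axis gap exceeds `radius`, so no row can match.
    chunk.xmax < q.xmin || chunk.xmin > q.xmax || chunk.ymax < q.ymin || chunk.ymin > q.ymax
}
```

The box-versus-box test is looser than an exact nearest-point distance, so it may keep a chunk that a tighter bound would skip, but it never skips a chunk that contains a match.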

Signed-off-by: Nemo Yu <zyu379@wisc.edu>
Morton-sort each generated table by its geometry column's bounding-box center so
every lane (parquet, vortex-WKB, vortex-geo-native) reads spatially-clustered
data and the geometry zone-map prune can actually skip chunks. Idempotent via a
parquet marker; stale derived vortex files from pre-sort parquet are deleted so
the existence-keyed conversions regenerate.
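
For readers unfamiliar with Morton (Z-order) sorting, a minimal sketch follows. It assumes each row is reduced to its bounding-box center and quantized to a 16-bit grid over a known extent; the function names and the 16-bit resolution are choices of this sketch, not details taken from the benchmark code.

```rust
/// Spread the low 16 bits of `v` so they occupy the even bit positions.
pub fn spread_bits(v: u32) -> u32 {
    let mut v = v & 0x0000_ffff;
    v = (v | (v << 8)) & 0x00ff_00ff;
    v = (v | (v << 4)) & 0x0f0f_0f0f;
    v = (v | (v << 2)) & 0x3333_3333;
    v = (v | (v << 1)) & 0x5555_5555;
    v
}

/// Map `v` from `[lo, hi]` onto the integer grid `0..=65535`.
pub fn quantize(v: f64, lo: f64, hi: f64) -> u32 {
    if hi <= lo {
        return 0;
    }
    let t = ((v - lo) / (hi - lo)).clamp(0.0, 1.0);
    (t * 65535.0) as u32
}

/// Morton key of a bounding-box center: x bits in even positions, y bits in odd.
/// `extent` is `(xmin, ymin, xmax, ymax)` of the whole table.
pub fn morton_key(cx: f64, cy: f64, extent: (f64, f64, f64, f64)) -> u32 {
    let (xmin, ymin, xmax, ymax) = extent;
    spread_bits(quantize(cx, xmin, xmax)) | (spread_bits(quantize(cy, ymin, ymax)) << 1)
}

/// Sort centers so that rows close on the Z-order curve end up adjacent.
pub fn morton_sort(centers: &mut [(f64, f64)], extent: (f64, f64, f64, f64)) {
    centers.sort_by_key(|&(x, y)| morton_key(x, y, extent));
}
```

Sorting by this key places spatially nearby rows in the same chunks, which is what keeps per-chunk MBRs tight enough for the zone-map prune to skip anything.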

Signed-off-by: Nemo Yu <zyu379@wisc.edu>
@HarukiMoriarty HarukiMoriarty added the changelog/feature A new feature label Jul 2, 2026
codspeed-hq Bot commented Jul 2, 2026

Merging this PR will degrade performance by 15.74%

⚡ 1 improved benchmark
❌ 2 regressed benchmarks
✅ 1554 untouched benchmarks
⏩ 42 skipped benchmarks (see footnote 1)

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|------|-----------|------|------|------------|
| Simulation | chunked_bool_canonical_into[(1000, 10)] | 16.5 µs | 26.8 µs | -38.39% |
| Simulation | chunked_varbinview_canonical_into[(1000, 10)] | 154.8 µs | 191.2 µs | -19.07% |
| Simulation | chunked_varbinview_opt_into_canonical[(1000, 10)] | 220.5 µs | 183.7 µs | +20% |

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing nemo/geo-bounds-zonemap (22d686f) with nemo/geo-native-pushdown (666e132)

Footnotes

  1. 42 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Base automatically changed from nemo/geo-native-pushdown to develop July 3, 2026 19:15
@robert3005 robert3005 requested a review from a team July 3, 2026 19:15