feat(vortex-geo): geometry Bbox zone-map statistic + distance-filter pruning #8646
Open
HarukiMoriarty wants to merge 4 commits into
Single-table `ST_DWithin(col, literal, radius)` filters over native-geometry Vortex scans now push down as `vortex.geo.distance(..) <= radius`, evaluated natively via a `GeoDistance` point fast path, with no query rewriting; any query using `ST_DWithin` benefits.

Genuine `ST_DWithin` cannot push as-is: spatial's bind folds the radius into opaque bind data, while its SPATIAL_JOIN optimization requires exactly that folded 2-argument form. The pieces:

- `duckdb_vx_register_st_dwithin_override` shadows `ST_DWithin` in the user catalog with a copy whose bind is cleared, so bound calls keep the radius as `children[2]`. Unpushed occurrences still execute correctly through spatial's own 3-column code path.
- The expression converter lowers the 3-argument `st_dwithin` (and `st_distance`) to `GeoDistance`, guarded by the scan's fields: geometry columns must be native (`vortex.geo.wkb` columns fall back to DuckDB). `can_push_expression` dry-runs the lowering so DuckDB never installs an ExpressionFilter the scan would later drop.
- `RestoreStDWithin`, in the existing Vortex optimizer pass, rebinds every remaining 3-argument `st_dwithin` (join conditions, unpushed filters) through spatial's original entry, restoring the folded form its spatial-join optimization requires.

duckdb-bench registers the override once per connection, after `LOAD spatial`; it is a no-op when spatial is absent.

SF1.0: Q1 47ms (parquet) / 16ms (vortex WKB) / 5.4ms (native, pushed); Q8 keeps SPATIAL_JOIN at ~89ms on the native lane.

Signed-off-by: Nemo Yu <zyu379@wisc.edu>
- Move the ST_DWithin restore pass to the optimizer extension's pre_optimize_function. DuckDB runs all extensions' pre-optimize hooks before any post-optimize pass, so this guarantees the restore precedes spatial's spatial-join optimization without relying on registration order. Since it now runs before filter pushdown, restrict it to join conditions so filters keep the visible radius they need to push.
- Compare the function name with `==`; DuckDB identifiers are already case-insensitive.
- Terser doc comment on `duckdb_vx_register_st_dwithin_override`.
- Drop a debug print from the geo function converter.

Signed-off-by: Nemo Yu <zyu379@wisc.edu>
…-filter pruning

Add a `GeometryBounds` aggregate that computes a per-chunk 2D minimum bounding rectangle (`Struct<xmin, ymin, xmax, ymax>`) for native geometry columns and stores it as a zone-map statistic, plus a `GeoDistanceBoundsPrune` stats-rewrite rule that skips a chunk when its MBR is disjoint from a `GeoDistance(geom, const) <= r` query box.

- vortex-array: aggregates self-declare as default zone stats via a `zone_stat_default` vtable/plugin hook; the zoned writer discovers them from the session registry in deterministic (id-sorted) order.
- vortex-geo: `GeometryBounds` covers Point/Polygon/MultiPolygon; the prune rule guards on `is_native_geometry`, ignores a NaN radius, and only handles the near forms `<=` / `<`. Missing stats bind to null, so older files degrade to no pruning.

Signed-off-by: Nemo Yu <zyu379@wisc.edu>
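The prune condition described in this commit can be sketched as follows. This is a minimal illustration, not the code in this PR: the `Mbr` type and the `from_points` and `can_skip_chunk` names are hypothetical, and the real aggregate and `GeoDistanceBoundsPrune` rule work on Vortex arrays and expressions rather than plain slices. It assumes the query box is the square of half-width `r` around the constant point.

```rust
/// Hypothetical stand-in for the per-chunk statistic
/// `Struct<xmin, ymin, xmax, ymax>`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Mbr {
    pub xmin: f64,
    pub ymin: f64,
    pub xmax: f64,
    pub ymax: f64,
}

impl Mbr {
    /// Fold a chunk's vertices into one bounding rectangle.
    /// An empty group yields `None`, mirroring the null statistic.
    pub fn from_points(points: &[(f64, f64)]) -> Option<Mbr> {
        let (&(x0, y0), rest) = points.split_first()?;
        let mut b = Mbr { xmin: x0, ymin: y0, xmax: x0, ymax: y0 };
        for &(x, y) in rest {
            b.xmin = b.xmin.min(x);
            b.ymin = b.ymin.min(y);
            b.xmax = b.xmax.max(x);
            b.ymax = b.ymax.max(y);
        }
        Some(b)
    }
}

/// Returns true when no row in the chunk can satisfy
/// `distance(geom, q) <= r` (or `< r`), so the chunk may be skipped.
pub fn can_skip_chunk(stat: Option<Mbr>, q: (f64, f64), r: f64) -> bool {
    // Missing statistic (for example an older file): never prune.
    let Some(b) = stat else { return false };
    // A NaN radius is ignored: never prune.
    if r.is_nan() {
        return false;
    }
    // Query box: the square of half-width r centered on q.
    let (qxmin, qymin, qxmax, qymax) = (q.0 - r, q.1 - r, q.0 + r, q.1 + r);
    // Disjoint rectangles cannot contain a point within distance r of q.
    b.xmax < qxmin || b.xmin > qxmax || b.ymax < qymin || b.ymin > qymax
}
```

The box test is conservative: a chunk whose MBR only touches a corner of the query box is kept even though it may lie outside the circle of radius `r`, which costs some pruning but never drops a matching row.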
Morton-sort each generated table by its geometry column's bounding-box center so every lane (parquet, vortex-WKB, vortex-geo-native) reads spatially-clustered data and the geometry zone-map prune can actually skip chunks. Idempotent via a parquet marker; stale derived vortex files from pre-sort parquet are deleted so the existence-keyed conversions regenerate.

Signed-off-by: Nemo Yu <zyu379@wisc.edu>
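For readers unfamiliar with Morton ordering, the idea behind this sort can be sketched as below. This is an illustration under stated assumptions, not the benchmark's implementation: the function names are hypothetical, and it assumes each bounding-box center is quantized to 32 bits per axis against a known dataset extent before the bits are interleaved.

```rust
/// Spread the 32 bits of `v` so that they occupy the even bit
/// positions of a u64 (bit i moves to bit 2*i).
fn spread_bits(v: u32) -> u64 {
    let mut x = v as u64;
    x = (x | (x << 16)) & 0x0000_FFFF_0000_FFFF;
    x = (x | (x << 8)) & 0x00FF_00FF_00FF_00FF;
    x = (x | (x << 4)) & 0x0F0F_0F0F_0F0F_0F0F;
    x = (x | (x << 2)) & 0x3333_3333_3333_3333;
    x = (x | (x << 1)) & 0x5555_5555_5555_5555;
    x
}

/// Interleave x (even bits) and y (odd bits) into a Z-order key.
pub fn morton_key(x: u32, y: u32) -> u64 {
    spread_bits(x) | (spread_bits(y) << 1)
}

/// Map a coordinate in [min, max] onto the full u32 range.
/// Degenerate extents collapse to 0 (an assumption of this sketch).
pub fn quantize(v: f64, min: f64, max: f64) -> u32 {
    if !(max > min) {
        return 0;
    }
    let t = ((v - min) / (max - min)).clamp(0.0, 1.0);
    (t * u32::MAX as f64) as u32
}

/// Sort bounding-box centers by Morton key within the extent
/// `(xmin, ymin, xmax, ymax)`.
pub fn morton_sort(centers: &mut [(f64, f64)], extent: (f64, f64, f64, f64)) {
    let (xmin, ymin, xmax, ymax) = extent;
    centers.sort_by_key(|&(cx, cy)| {
        morton_key(quantize(cx, xmin, xmax), quantize(cy, ymin, ymax))
    });
}
```

Rows that are adjacent after this sort tend to be close in space, so each chunk's MBR stays small and a distance filter's query box is disjoint from most chunks.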
Merging this PR will degrade performance by 15.74%
Summary
Adds spatial chunk-pruning to Vortex. A new `GeometryBounds` aggregate stores a per-chunk minimum bounding rectangle (MBR) as a zone-map statistic, and a stats-rewrite rule uses it to skip chunks that cannot satisfy an `ST_Distance(geom, const) <= r` filter.

Limitation

Only `<=` / `<` are pruned. `>` / `>=` are soundly prunable via the symmetric farthest-corner bound but are intentionally omitted (rarely used?).

Testing

8 new vortex-geo tests: Point bbox across batches; Polygon bbox over all ring vertices; empty group → null; registry self-declaration; only `<=`/`<` prune while `>`/`>=`/`==`/`!=` don't (parameterized); distance symmetry; non-distance comparisons ignored; and an end-to-end falsify.
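For reference, the farthest-corner bound mentioned under Limitation could look like the sketch below. It is not part of this PR; the function names and the tuple layout `(xmin, ymin, xmax, ymax)` are hypothetical. The reasoning is that every geometry in a chunk lies inside the chunk's MBR, so its distance to the query point is at most the distance from that point to the farthest MBR corner.

```rust
/// Distance from `q` to the farthest corner of the rectangle
/// `(xmin, ymin, xmax, ymax)`: an upper bound on the distance from
/// `q` to anything contained in the rectangle.
pub fn farthest_corner_distance(b: (f64, f64, f64, f64), q: (f64, f64)) -> f64 {
    let (xmin, ymin, xmax, ymax) = b;
    let dx = (q.0 - xmin).abs().max((q.0 - xmax).abs());
    let dy = (q.1 - ymin).abs().max((q.1 - ymax).abs());
    dx.hypot(dy)
}

/// Returns true when no row can satisfy `distance(geom, q) > r`
/// (`strict == true`) or `distance(geom, q) >= r` (`strict == false`).
pub fn can_skip_far(b: (f64, f64, f64, f64), q: (f64, f64), r: f64, strict: bool) -> bool {
    let far = farthest_corner_distance(b, q);
    // A NaN radius makes both comparisons false, so nothing is pruned.
    if strict { far <= r } else { far < r }
}
```

A production rule would also need to decide how to treat floating-point rounding at the boundary, which this sketch ignores.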
Performance
SF=1
SF=3
SF=10
Takeaway: when the data is pre-sorted, bounding-box pruning significantly reduces the intermediate results DuckDB reads in, making Q1 and Q3 significantly faster.