Fix aggregation flow with remote initiator#1872
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e72ebfe177
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (getClusterName(context).empty() // Not cluster request | ||
| && !(context->getSettingsRef()[Setting::object_storage_remote_initiator] // Not request with remote initiator | ||
| && !context->getSettingsRef()[Setting::object_storage_remote_initiator_cluster].value.empty())) |
There was a problem hiding this comment.
Keep unsupported storages at FetchColumns
When object_storage_remote_initiator is set for a configuration that does not support cluster mode, such as icebergLocal or other local data-lake storages where getClusterName is empty because isClusterSupported is false, this new exception makes getQueryProcessingStage report WithMergeableState. IStorageCluster::read still immediately falls back to pure_storage for those configurations, and StorageObjectStorage::read ignores processed_stage, so aggregate queries can be planned as if partial aggregation happened remotely even though only raw rows were read. Please only return the distributed stage when the subsequent read path will actually use the remote/cluster execution path.
Useful? React with 👍 / 👎.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 437a47dc4a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| is_remote = settings[Setting::object_storage_remote_initiator] | ||
| && (!settings[Setting::object_storage_cluster].value.empty() | ||
| || !settings[Setting::object_storage_remote_initiator_cluster].value.empty()); |
There was a problem hiding this comment.
Preserve validation for missing remote initiator cluster
When object_storage_remote_initiator=1 is set on the fallback s3/object-storage table functions but both object_storage_cluster and object_storage_remote_initiator_cluster are empty, this now makes is_remote false and routes the query through BaseSimple. That bypasses the existing IStorageCluster::read validation that raises BAD_ARGUMENTS for this invalid configuration, so a misconfigured query silently runs locally instead of failing or using a remote initiator.
Useful? React with 👍 / 👎.
437a47d to
0878f8f
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0878f8f10d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| is_remote = settings[Setting::object_storage_remote_initiator] | ||
| && (!settings[Setting::object_storage_cluster].value.empty() | ||
| || !settings[Setting::object_storage_remote_initiator_cluster].value.empty()); |
There was a problem hiding this comment.
Preserve validation for bare remote initiator requests
When object_storage_remote_initiator=1 is set on the fallback s3/object-storage table function without object_storage_cluster or object_storage_remote_initiator_cluster, this new guard leaves is_remote false, so executeImpl builds the simple storage and bypasses the validation added in StorageObjectStorageCluster/IStorageCluster. That makes an invalid remote-initiator request run locally instead of raising the documented exception, which is easy to hit when the companion cluster setting is omitted or misspelled.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
object_storage_remote_initiator_cluster without cluster name or object_storage_remote_initiator_cluster call an exception before.
| query_id = query_id, | ||
| ) | ||
|
|
||
| assert result == "67802152770\n" |
There was a problem hiding this comment.
To be honest, I'm not really fond of these magic constants as it's difficult to see for myself if that's actually the value we should be expecting, but I trust that you've verified it to be actually valid
There was a problem hiding this comment.
Me too, but I reused existed data files, https://github.com/Altinity/ClickHouse/blob/antalya-26.3/tests/integration/test_s3_cluster/data/clickhouse/part1.csv and others. Number means nothing, just summa of values in all files.
|
Stateless test |
Audit update for PR #1872
PR: Altinity/ClickHouse#1872 — Fix aggregation flow with remote initiator Confirmed defectsMedium — Silent fallback bypasses remote-initiator validation for
|
That's a good one :D |
|
Add fixes for both defects |
Audit update for PR #1872
PR: Altinity/ClickHouse#1872 — Fix aggregation flow with remote initiator Previously-flagged defects, now fixedVerified against commits Confirmed defects (still open at HEAD)No confirmed defects in reviewed scope. |
PR #1872 CI Verification — Fix aggregation flow with remote initiator
VerdictCI is clean of regressions. The PR is safe to merge after a rerun of the remaining red checks, all of which are pre-existing flakes / image-scanner findings unrelated to the code change. Summary (current SHA
|
| Status | Count |
|---|---|
| SUCCESS | 160 |
| FAILURE | 7 (3 jobs + 4 derived rollup rows) |
| SKIPPED | 147 (build matrix exclusions) |
Failing jobs
| Job | Failing item | Type |
|---|---|---|
Integration tests (arm_binary, distributed plan, 4/4) |
test_storage_s3_queue/test_0.py::test_move_after_processing[another_bucket-AzureQueue] |
Pre-existing flaky |
Stress test (arm_asan, s3) |
Cannot start clickhouse-server |
Pre-existing flaky / infra |
GrypeScanServer / Grype Scan |
CVE scan of server image | Container image scan, unrelated to source change |
Flakiness evidence (last 60 days)
| Test | Total fails | Unique PRs | First seen | Last seen |
|---|---|---|---|---|
test_move_after_processing[another_bucket-AzureQueue] |
48 | 26 | 2026-04-13 | 2026-06-10 |
Cannot start clickhouse-server (stress) |
88 | 34 | 2026-04-30 | 2026-06-10 |
Both have failed across many unrelated PRs and master, well predating PR #1872. Neither involves remote-initiator or aggregation code paths.
The Grype scan flagging vulnerabilities in the produced container image is a packaging/dependency concern handled separately (base image / OS package CVEs); it is not blocked by this PR's source changes.
PR's own new integration test — passing
tests/integration/test_s3_cluster/test.py::test_object_storage_remote_initiator_aggregation
Result on current SHA 532219ad: PASSED in all 3 integration jobs that ran it.
Recommendations
- Rerun
Integration tests (arm_binary, distributed plan, 4/4)andStress test (arm_asan, s3)to clear flaky reds. - Address
Grype Scanseparately if the team requires zero high-severity findings on the produced image; otherwise it can be ignored as not regression-related. - Merge once jobs are green; no PR-introduced regressions detected.
…with_remote_initiator
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Fix aggregation flow with remote initiator
Documentation entry for user-facing changes
With
object_storage_remote_initiatorbut withoutobject_storage_clustersettingStorageObjectStorageCluster::getQueryProcessingStagereturnedQueryProcessingStage::Enum::FetchColumns, as result nodes sent all rows on initiator and aggregation executed on initiator.Now method returns
QueryProcessingStage::Enum::WithMergeableStateis proper cases, and pre-aggregation executed on nodes.CI/CD Options
Exclude tests:
Regression jobs to run: