[MINOR] Refactor broadcast join context and table ID type#12309
[MINOR] Refactor broadcast join context and table ID type#12309philo-he wants to merge 11 commits into
Conversation
|
Run Gluten Clickhouse CI on x86 |
There was a problem hiding this comment.
Pull request overview
This PR refactors broadcast-join plumbing across Gluten’s Substrait layer and ClickHouse backend by switching broadcast table/hash-table identifiers from string keys to numeric IDs, and by unifying “BroadCast*” naming to “Broadcast*”. It also updates the ClickHouse JNI + native join builder path to carry an explicit “is BHJ” flag instead of inferring join type from string prefixes.
Changes:
- Change join parameter generation to emit an integer
buildHashTableId/buildTableId(instead of prefixed strings). - Rename ClickHouse broadcast join context/builder symbols from
BroadCast*toBroadcast*, and update JNI/native builder APIs accordingly. - Adjust benchmark plan/test code to match the new identifier format.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| gluten-substrait/src/main/scala/org/apache/gluten/execution/JoinExecTransformer.scala | Updates join-parameter internal API to use an Int build table ID. |
| gluten-substrait/src/main/scala/org/apache/gluten/execution/BroadcastNestedLoopJoinExecTransformer.scala | Switches broadcast nested-loop build table ID to Int. |
| backends-velox/src/main/scala/org/apache/gluten/execution/HashJoinExecTransformer.scala | Implements new join-parameter internal signature for Velox SHJ. |
| backends-clickhouse/src/main/scala/org/apache/gluten/execution/CHHashJoinExecTransformer.scala | Renames/reshapes broadcast join context type for ClickHouse. |
| backends-clickhouse/src/main/scala/org/apache/gluten/execution/CHBroadcastNestedLoopJoinExecTransformer.scala | Updates CH BNLJ context + join-parameter emission for native side. |
| backends-clickhouse/src/main/scala/org/apache/gluten/execution/CHBroadcastBuildSideCache.scala | Changes cache key type from String to Int. |
| backends-clickhouse/src/main/scala/org/apache/spark/sql/execution/joins/ClickHouseBuildSideRelation.scala | Migrates build-side relation to use the new broadcast context type. |
| backends-clickhouse/src/main/java/org/apache/gluten/vectorized/StorageJoinBuilder.java | Updates JNI signatures and passes explicit isBhj to native builder. |
| cpp-ch/local-engine/local_engine_jni.cpp | Updates native JNI entrypoints for the new Java signatures and builder name. |
| cpp-ch/local-engine/Join/BroadcastJoinBuilder.h | Renames namespace and updates APIs to use numeric keys + explicit is_bhj. |
| cpp-ch/local-engine/Join/BroadcastJoinBuilder.cpp | Implements new API: uses is_bhj instead of string prefix checks. |
| cpp-ch/local-engine/Parser/AdvancedParametersParseUtil.h | Changes storage_join_key to an integer type. |
| cpp-ch/local-engine/Parser/RelParsers/JoinRelParser.cpp | Renames builder usage to BroadcastJoinBuilder. |
| cpp-ch/local-engine/Parser/RelParsers/CrossRelParser.cpp | Renames builder usage to BroadcastJoinBuilder. |
| cpp/velox/benchmarks/data/plan/q17_joins.json | Updates plan data to use numeric buildHashTableId. |
| backends-clickhouse/src/test/scala/org/apache/spark/sql/execution/benchmarks/CHStorageJoinBenchmark.scala | Renames local variables to broadcast* spelling. |
| backends-clickhouse/src/test/scala/org/apache/spark/sql/execution/benchmarks/CHHashBuildBenchmark.scala | Updates imports/types and JNI call to numeric hash-table ID. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // isBHJ, isNullAwareAntiJoin, buildTableId. | ||
| def genJoinParametersInternal(): (Int, Int, Int) |
| private var existHashTableData: Long = 0L | ||
| private var existBroadCastHashJoinContext: BroadCastHashJoinContext = null |
| } | ||
|
|
||
| private def couldReuseHashTableData(broadCastContext: BroadCastHashJoinContext): Boolean = { | ||
| private def couldReuseHashTableData(broadcastContext: BroadCastHashJoinContext): Boolean = { |
| buildSideRelationCache | ||
| .get( | ||
| broadCastContext.buildHashTableId, | ||
| (broadcast_id: String) => { | ||
| broadcastContext.buildTableId, | ||
| (broadcastId: String) => { | ||
| val (pointer, relation) = |
| joinParametersStr | ||
| .append("isBHJ=") | ||
| .append(1) | ||
| .append(0) | ||
| .append("\n") |
| .append(0) | ||
| .append("\n") | ||
| .append("buildHashTableId=") | ||
| .append("buildBroadcastTableId=") |
| Int64 right_table_bytes = -1; | ||
| Int64 partitions_num = -1; | ||
| String storage_join_key; | ||
| Int32 storage_join_key = -1; | ||
|
|
|
Run Gluten Clickhouse CI on x86 |
1 similar comment
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
1 similar comment
|
Run Gluten Clickhouse CI on x86 |
facb2d1 to
58d4c0f
Compare
|
Run Gluten Clickhouse CI on x86 |
2 similar comments
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
|
@zzcclp, could you take a look? |
| .get( | ||
| broadCastContext.buildHashTableId, | ||
| (broadcast_id: String) => { | ||
| broadcastContext.buildTableId, | ||
| (broadcastId: String) => { | ||
| val (pointer, relation) = | ||
| broadcast.value | ||
| .asInstanceOf[ClickHouseBuildSideRelation] | ||
| .buildHashTable(broadCastContext) | ||
| logDebug(s"Create bhj $broadcast_id = 0x${pointer.toHexString}") | ||
| .buildHashTable(broadcastContext) | ||
| logDebug(s"Create bhj $broadcastId = 0x${pointer.toHexString}") | ||
| BroadcastHashTable(pointer, relation) |
| override def genJoinParameters(): Any = { | ||
| val (isBHJ, isNullAwareAntiJoin, buildHashTableId): (Int, Int, String) = (0, 0, "") | ||
| val (isBHJ, isNullAwareAntiJoin, buildHashTableId): (Int, Int, Int) = (0, 0, 0) | ||
|
|
| joinParametersStr | ||
| .append("isBHJ=") | ||
| .append(1) | ||
| .append(0) | ||
| .append("\n") | ||
| .append("buildHashTableId=") | ||
| .append("buildBroadcastTableId=") | ||
| .append(buildBroadcastTableId) | ||
| .append("\n") |
| JNIEXPORT jlong JNICALL Java_org_apache_gluten_vectorized_HashJoinBuilder_nativeBuild( // NOLINT | ||
| JNIEnv* env, | ||
| jobject wrapper, | ||
| jstring tableId, | ||
| jint tableId, | ||
| jlongArray batchHandles, | ||
| jobjectArray joinKeys, | ||
| jobjectArray filterBuildColumns, |
|
Run Gluten Clickhouse CI on x86 |
|
Run Gluten Clickhouse CI on x86 |
| override def genJoinParameters(): Any = { | ||
| val (isBHJ, isNullAwareAntiJoin, buildHashTableId): (Int, Int, String) = (0, 0, "") | ||
| val (isBHJ, isNullAwareAntiJoin, buildHashTableId): (Int, Int, Int) = (0, 0, 0) |
| JNIEnv* env, | ||
| jobject wrapper, | ||
| jstring tableId, | ||
| jint tableId, |
| private def couldReuseHashTableData(broadcastContext: BroadcastJoinContext): Boolean = { | ||
| if (existHashTableData != 0) { | ||
| existBroadCastHashJoinContext.joinType == broadCastContext.joinType && | ||
| existBroadCastHashJoinContext.hasMixedFiltCondition == broadCastContext.hasMixedFiltCondition | ||
| existBroadcastHashJoinContext.joinType == broadcastContext.joinType && | ||
| existBroadcastHashJoinContext.hasMixedFiltCondition == broadcastContext.hasMixedFiltCondition | ||
| } |
|
Run Gluten Clickhouse CI on x86 |
| tryAssign(kvs, "isBHJ", info.is_broadcast); | ||
| tryAssign(kvs, "isSMJ", info.is_smj); | ||
| tryAssign(kvs, "buildHashTableId", info.storage_join_key); | ||
| tryAssign(kvs, "buildBroadcastTableId", info.storage_join_key); | ||
| tryAssign(kvs, "isNullAwareAntiJoin", info.is_null_aware_anti_join); |
| /** This is called from c++ side. */ | ||
| def get(broadcastHashtableId: String): Long = { | ||
| Option(buildSideRelationCache.getIfPresent(broadcastHashtableId)) | ||
| Option(buildSideRelationCache.getIfPresent(broadcastHashtableId.toInt)) | ||
| .map(_.pointer) | ||
| .getOrElse(0) | ||
| } |
5931c1d to
aa3af94
Compare
|
Run Gluten Clickhouse CI on x86 |
aa3af94 to
34d6723
Compare
|
Run Gluten Clickhouse CI on x86 |
34d6723 to
9d29f03
Compare
|
Run Gluten Clickhouse CI on x86 |
| .append("isBHJ=") | ||
| .append(1) | ||
| .append(0) |
|
Run Gluten Clickhouse CI on x86 |
@JkSelf, thanks for the comment. In theory, we can keep using string key in this PR, converted from build plan ID. Let me think about this. And even if Gluten finally uses Int key, we still can convert it to string at the Velox boundary. So I think you can keep using the string cache key in that Velox PR. |
What changes are proposed in this pull request?
How was this patch tested?
Existing tests.
Was this patch authored or co-authored using generative AI tooling?
No.