Skip to content

[MINOR] Refactor broadcast join context and table ID type#12309

Open
philo-he wants to merge 11 commits into
apache:mainfrom
philo-he:refactor-broadcast-join
Open

[MINOR] Refactor broadcast join context and table ID type#12309
philo-he wants to merge 11 commits into
apache:mainfrom
philo-he:refactor-broadcast-join

Conversation

@philo-he

@philo-he philo-he commented Jun 16, 2026

Copy link
Copy Markdown
Member

What changes are proposed in this pull request?

  • Rename BroadCastHashJoinContext → BroadcastJoinContext and BroadCastJoinBuilder → BroadcastJoinBuilder to fix the camelCase typos consistently across Scala, Java, and C++ files.
  • Change build table ID from String to Int — previously buildPlan.id (an Int) was wrapped into strings like "BuiltBHJBroadcastTable-123" or "BuiltBNLJBroadcastTable-123". Now the integer ID is passed directly, simplifying the cache key type and the JNI boundary.
  • Replace string-prefix join type detection with an explicit isBhj flag — StorageJoinBuilder previously distinguished BHJ from BNLJ by checking buildHashTableId.startsWith("BuiltBNLJBroadcastTable-"), which was fragile. The new isBhj: Boolean field in BroadcastJoinContext makes the intent explicit.

How was this patch tested?

Existing tests.

Was this patch authored or co-authored using generative AI tooling?

No.

Copilot AI review requested due to automatic review settings June 16, 2026 23:38
@github-actions github-actions Bot added CORE works for Gluten Core VELOX CLICKHOUSE labels Jun 16, 2026
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors broadcast-join plumbing across Gluten’s Substrait layer and ClickHouse backend by switching broadcast table/hash-table identifiers from string keys to numeric IDs, and by unifying “BroadCast*” naming to “Broadcast*”. It also updates the ClickHouse JNI + native join builder path to carry an explicit “is BHJ” flag instead of inferring join type from string prefixes.

Changes:

  • Change join parameter generation to emit an integer buildHashTableId / buildTableId (instead of prefixed strings).
  • Rename ClickHouse broadcast join context/builder symbols from BroadCast* to Broadcast*, and update JNI/native builder APIs accordingly.
  • Adjust benchmark plan/test code to match the new identifier format.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
gluten-substrait/src/main/scala/org/apache/gluten/execution/JoinExecTransformer.scala Updates join-parameter internal API to use an Int build table ID.
gluten-substrait/src/main/scala/org/apache/gluten/execution/BroadcastNestedLoopJoinExecTransformer.scala Switches broadcast nested-loop build table ID to Int.
backends-velox/src/main/scala/org/apache/gluten/execution/HashJoinExecTransformer.scala Implements new join-parameter internal signature for Velox SHJ.
backends-clickhouse/src/main/scala/org/apache/gluten/execution/CHHashJoinExecTransformer.scala Renames/reshapes broadcast join context type for ClickHouse.
backends-clickhouse/src/main/scala/org/apache/gluten/execution/CHBroadcastNestedLoopJoinExecTransformer.scala Updates CH BNLJ context + join-parameter emission for native side.
backends-clickhouse/src/main/scala/org/apache/gluten/execution/CHBroadcastBuildSideCache.scala Changes cache key type from String to Int.
backends-clickhouse/src/main/scala/org/apache/spark/sql/execution/joins/ClickHouseBuildSideRelation.scala Migrates build-side relation to use the new broadcast context type.
backends-clickhouse/src/main/java/org/apache/gluten/vectorized/StorageJoinBuilder.java Updates JNI signatures and passes explicit isBhj to native builder.
cpp-ch/local-engine/local_engine_jni.cpp Updates native JNI entrypoints for the new Java signatures and builder name.
cpp-ch/local-engine/Join/BroadcastJoinBuilder.h Renames namespace and updates APIs to use numeric keys + explicit is_bhj.
cpp-ch/local-engine/Join/BroadcastJoinBuilder.cpp Implements new API: uses is_bhj instead of string prefix checks.
cpp-ch/local-engine/Parser/AdvancedParametersParseUtil.h Changes storage_join_key to an integer type.
cpp-ch/local-engine/Parser/RelParsers/JoinRelParser.cpp Renames builder usage to BroadcastJoinBuilder.
cpp-ch/local-engine/Parser/RelParsers/CrossRelParser.cpp Renames builder usage to BroadcastJoinBuilder.
cpp/velox/benchmarks/data/plan/q17_joins.json Updates plan data to use numeric buildHashTableId.
backends-clickhouse/src/test/scala/org/apache/spark/sql/execution/benchmarks/CHStorageJoinBenchmark.scala Renames local variables to broadcast* spelling.
backends-clickhouse/src/test/scala/org/apache/spark/sql/execution/benchmarks/CHHashBuildBenchmark.scala Updates imports/types and JNI call to numeric hash-table ID.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +304 to +305
// isBHJ, isNullAwareAntiJoin, buildTableId.
def genJoinParametersInternal(): (Int, Int, Int)
Comment on lines 47 to 48
private var existHashTableData: Long = 0L
private var existBroadCastHashJoinContext: BroadCastHashJoinContext = null
}

private def couldReuseHashTableData(broadCastContext: BroadCastHashJoinContext): Boolean = {
private def couldReuseHashTableData(broadcastContext: BroadCastHashJoinContext): Boolean = {
Comment on lines 58 to 62
buildSideRelationCache
.get(
broadCastContext.buildHashTableId,
(broadcast_id: String) => {
broadcastContext.buildTableId,
(broadcastId: String) => {
val (pointer, relation) =
Comment on lines 100 to 103
joinParametersStr
.append("isBHJ=")
.append(1)
.append(0)
.append("\n")
.append(0)
.append("\n")
.append("buildHashTableId=")
.append("buildBroadcastTableId=")
Comment thread cpp-ch/local-engine/local_engine_jni.cpp Outdated
Comment on lines 35 to 38
Int64 right_table_bytes = -1;
Int64 partitions_num = -1;
String storage_join_key;
Int32 storage_join_key = -1;

@philo-he philo-he marked this pull request as draft June 17, 2026 00:10
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

1 similar comment
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.

Comment thread cpp-ch/local-engine/Parser/AdvancedParametersParseUtil.h
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

1 similar comment
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@philo-he philo-he force-pushed the refactor-broadcast-join branch from facb2d1 to 58d4c0f Compare June 17, 2026 23:22
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

2 similar comments
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@philo-he philo-he marked this pull request as ready for review June 18, 2026 04:53
Copilot AI review requested due to automatic review settings June 18, 2026 04:53
@philo-he

Copy link
Copy Markdown
Member Author

@zzcclp, could you take a look?

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 24 out of 24 changed files in this pull request and generated 4 comments.

Comment on lines 59 to 67
.get(
broadCastContext.buildHashTableId,
(broadcast_id: String) => {
broadcastContext.buildTableId,
(broadcastId: String) => {
val (pointer, relation) =
broadcast.value
.asInstanceOf[ClickHouseBuildSideRelation]
.buildHashTable(broadCastContext)
logDebug(s"Create bhj $broadcast_id = 0x${pointer.toHexString}")
.buildHashTable(broadcastContext)
logDebug(s"Create bhj $broadcastId = 0x${pointer.toHexString}")
BroadcastHashTable(pointer, relation)
Comment on lines 121 to 123
override def genJoinParameters(): Any = {
val (isBHJ, isNullAwareAntiJoin, buildHashTableId): (Int, Int, String) = (0, 0, "")
val (isBHJ, isNullAwareAntiJoin, buildHashTableId): (Int, Int, Int) = (0, 0, 0)

Comment on lines 100 to 106
joinParametersStr
.append("isBHJ=")
.append(1)
.append(0)
.append("\n")
.append("buildHashTableId=")
.append("buildBroadcastTableId=")
.append(buildBroadcastTableId)
.append("\n")
Comment on lines 960 to 966
JNIEXPORT jlong JNICALL Java_org_apache_gluten_vectorized_HashJoinBuilder_nativeBuild( // NOLINT
JNIEnv* env,
jobject wrapper,
jstring tableId,
jint tableId,
jlongArray batchHandles,
jobjectArray joinKeys,
jobjectArray filterBuildColumns,
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copilot AI review requested due to automatic review settings June 18, 2026 17:08
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 3 comments.

Comment on lines 121 to +122
override def genJoinParameters(): Any = {
val (isBHJ, isNullAwareAntiJoin, buildHashTableId): (Int, Int, String) = (0, 0, "")
val (isBHJ, isNullAwareAntiJoin, buildHashTableId): (Int, Int, Int) = (0, 0, 0)
Comment thread cpp/velox/jni/VeloxJniWrapper.cc Outdated
JNIEnv* env,
jobject wrapper,
jstring tableId,
jint tableId,
Comment on lines +76 to 80
private def couldReuseHashTableData(broadcastContext: BroadcastJoinContext): Boolean = {
if (existHashTableData != 0) {
existBroadCastHashJoinContext.joinType == broadCastContext.joinType &&
existBroadCastHashJoinContext.hasMixedFiltCondition == broadCastContext.hasMixedFiltCondition
existBroadcastHashJoinContext.joinType == broadcastContext.joinType &&
existBroadcastHashJoinContext.hasMixedFiltCondition == broadcastContext.hasMixedFiltCondition
}
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 2 comments.

Comment on lines 138 to 142
tryAssign(kvs, "isBHJ", info.is_broadcast);
tryAssign(kvs, "isSMJ", info.is_smj);
tryAssign(kvs, "buildHashTableId", info.storage_join_key);
tryAssign(kvs, "buildBroadcastTableId", info.storage_join_key);
tryAssign(kvs, "isNullAwareAntiJoin", info.is_null_aware_anti_join);
Comment on lines +78 to 83
/** This is called from c++ side. */
def get(broadcastHashtableId: String): Long = {
Option(buildSideRelationCache.getIfPresent(broadcastHashtableId))
Option(buildSideRelationCache.getIfPresent(broadcastHashtableId.toInt))
.map(_.pointer)
.getOrElse(0)
}
@philo-he philo-he force-pushed the refactor-broadcast-join branch from 5931c1d to aa3af94 Compare June 18, 2026 21:56
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copilot AI review requested due to automatic review settings June 18, 2026 22:08
@philo-he philo-he force-pushed the refactor-broadcast-join branch from aa3af94 to 34d6723 Compare June 18, 2026 22:08
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

@philo-he philo-he force-pushed the refactor-broadcast-join branch from 34d6723 to 9d29f03 Compare June 18, 2026 22:11
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated 1 comment.

Comment on lines 101 to +102
.append("isBHJ=")
.append(1)
.append(0)
@JkSelf

JkSelf commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

@philo-he Thanks for your work.

We are working on using Velox’s HashTableCache to cache pre-built hash tables here so we can revert a custom commit in our internal Velox repo. Since HashTableCache currently uses String as the cache key, can we continue using a string for our cache key.

Copilot AI review requested due to automatic review settings June 19, 2026 04:28
@github-actions

Copy link
Copy Markdown

Run Gluten Clickhouse CI on x86

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 25 out of 25 changed files in this pull request and generated no new comments.

@philo-he

Copy link
Copy Markdown
Member Author

@philo-he Thanks for your work.

We are working on using Velox’s HashTableCache to cache pre-built hash tables here so we can revert a custom commit in our internal Velox repo. Since HashTableCache currently uses String as the cache key, can we continue using a string for our cache key.

@JkSelf, thanks for the comment.
The String key was previously used both as a cache key and as a type descriptor (via starts_with("BuiltBNLJBroadcastTable-")). This PR separates them — the isBhj flag is used to distinguish join operator type, so the key only needs to be a unique identifier.

In theory, we can keep using string key in this PR, converted from build plan ID. Let me think about this. And even if Gluten finally uses Int key, we still can convert it to string at the Velox boundary. So I think you can keep using the string cache key in that Velox PR.

@philo-he philo-he changed the title [CORE] Refactor broadcast join context and table ID type [MINOR] Refactor broadcast join context and table ID type Jun 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLICKHOUSE CORE works for Gluten Core VELOX

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants