Skip to content

[fix](variant) Keep first duplicate Variant JSON path#63082

Open
eldenmoon wants to merge 1 commit intoapache:masterfrom
eldenmoon:codex/cir-19941-duplicate-json-path
Open

[fix](variant) Keep first duplicate Variant JSON path#63082
eldenmoon wants to merge 1 commit intoapache:masterfrom
eldenmoon:codex/cir-19941-duplicate-json-path

Conversation

@eldenmoon
Copy link
Copy Markdown
Member

Add BE config variant_enable_duplicate_json_path_check to keep the first duplicate JSON object key during Variant parsing when enabled.

Copilot AI review requested due to automatic review settings May 8, 2026 10:21
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a backend config to change how VARIANT parsing handles duplicate JSON object keys: when enabled, it keeps the first occurrence instead of erroring on duplicate paths. It also adds test coverage for this behavior, while removing several existing VARIANT nested regression suites and adding a new export/load regression for wide VARIANT data.

Changes:

  • Add BE config variant_enable_duplicate_json_path_check (default off) and implement “keep first duplicate key” behavior during JSON traversal for VARIANT extraction.
  • Add BE unit tests and a VARIANT regression test + data file to validate duplicate-key behavior (insert + stream load).
  • Add an export/load regression test for VARIANT with many potential subcolumns; remove multiple VARIANT nested regression suites.

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
regression-test/suites/variant_p0/nested2.groovy Deleted an existing VARIANT nested-type conflict regression suite.
regression-test/suites/variant_p0/nested/sql/q01.sql Deleted SQL outputs for a removed nested VARIANT load suite.
regression-test/suites/variant_p0/nested/nested_in_top_array.groovy Deleted a regression suite covering nested-in-top-array VARIANT behavior.
regression-test/suites/variant_p0/nested/load.groovy Deleted a (mostly commented) nested VARIANT load regression suite.
regression-test/suites/variant_p0/nested.groovy Deleted a large VARIANT nested regression suite.
regression-test/suites/variant_p0/duplicate_json_path.groovy Added regression coverage for keeping the first duplicate JSON key when the new BE config is enabled.
regression-test/suites/export_p2/test_export_variant_10k_columns.groovy Added export→S3→reload regression for wide/sparse VARIANT; currently has correctness and cleanup issues.
regression-test/data/variant_p0/duplicate_json_path.json Added stream-load input with duplicate JSON keys.
be/test/storage/segment/variant_util_test.cpp Added unit tests validating “keep first duplicate key” behavior in both subcolumn-only and doc-mode parsing.
be/src/util/json/json_parser.cpp Added duplicate-key skipping logic during object traversal when the config is enabled.
be/src/common/config.h Declared the new BE config flag.
be/src/common/config.cpp Defined the new BE config flag with default false.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread be/src/util/json/json_parser.cpp Outdated
Comment on lines 111 to 132
if (!config::variant_enable_duplicate_json_path_check) {
for (auto it = object.begin(); it != object.end(); ++it) {
const auto& [key, value] = *it;
traverse_object_member(key, value);
}
return;
}

std::unordered_set<PathInData, PathInData::Hash> visited_member_paths;
visited_member_paths.reserve(object.size());
for (auto it = object.begin(); it != object.end(); ++it) {
const auto& [key, value] = *it;
check_key_length(key);
ctx.builder.append(key, false);
PathInData member_path(ctx.builder.get_parts());
if (!visited_member_paths.insert(std::move(member_path)).second) {
ctx.builder.pop_back();
continue;
}
traverse(value, ctx);
ctx.builder.pop_back();
}
Comment on lines +18 to +22
suite("duplicate_json_path", "p0") {
def customBeConfig = [
variant_enable_duplicate_json_path_check: true
]
setBeConfigTemporary(customBeConfig) {
@eldenmoon eldenmoon force-pushed the codex/cir-19941-duplicate-json-path branch from a7bc101 to abc6911 Compare May 8, 2026 10:42
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found one additional blocking correctness issue beyond the already-known review threads. The enabled duplicate-path check still allows flat dotted keys and nested object paths to produce the same downstream Variant path string, so the load/WAL failure this PR is meant to prevent can still occur for inputs such as {"a.b":1,"a":{"b":2}}.

Critical checkpoint conclusions:

  • Goal/test: The goal is to optionally keep the first duplicate Variant JSON path and avoid duplicate/conflicting subcolumn paths; current tests cover repeated object member names but not dotted-key versus nested-path conflicts, and the implementation does not fully accomplish the goal.
  • Scope: The change is small and focused, but the local duplicate key check is too narrow for Variant path identity.
  • Concurrency/lifecycle: No new concurrency or special lifecycle management is introduced; the mutable config follows existing BE config patterns.
  • Configuration: A disabled-by-default mutable BE config is added; behavior is observed at parse time.
  • Compatibility: No storage/protocol format change was found.
  • Parallel paths: Existing review thread already covers the JSONB conversion path separately; this comment covers a distinct subcolumn/doc-value path conflict.
  • Tests: Added unit/regression coverage misses the dotted-key conflict case; no tests were run by me in this automated review.
  • Observability/performance: No additional observability requirement found; the extra check is not on an obviously problematic hot path when disabled.
  • User focus: No additional user-provided review focus was supplied.

Comment thread be/src/util/json/json_parser.cpp Outdated
@eldenmoon eldenmoon force-pushed the codex/cir-19941-duplicate-json-path branch 4 times, most recently from 961fd01 to 99aa5b6 Compare May 9, 2026 07:00
@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@eldenmoon
Copy link
Copy Markdown
Member Author

/review

### What problem does this PR solve?

Issue Number: CIR-19941

Related PR: None

Problem Summary: Variant JSON parsing could emit duplicate or conflicting downstream subcolumn paths when duplicate object keys, flat dotted keys, nested object paths, or array/scalar leaf values normalize to the same Variant path. This could later fail load or WAL replay with subcolumn type size mismatch. Add a disabled-by-default BE config to enable duplicate JSON path checking; when enabled, duplicate leaf paths keep the first value while object entries that expand to distinct child paths are preserved. The duplicate-check setting is captured in ParseConfig so parser and materialization use the same per-call behavior.

### Release note

Add BE config variant_enable_duplicate_json_path_check to keep the first duplicate Variant JSON leaf path when enabled; default is false.

### Check List (For Author)

- Test: Unit Test and Regression test
    - Unit Test: ./run-be-ut.sh --run --filter='VariantUtilTest.ParseDuplicateJsonPaths*'
    - Build: source ./custom_env.sh; export DORIS_PARALLELISM="${DORIS_PARALLELISM:-39}"; export BUILD_TYPE="${BUILD_TYPE:-ASAN}"; export USE_MEM_TRACKER=ON; ./build.sh --be -j"${DORIS_PARALLELISM}"
    - Regression test: env -u http_proxy -u https_proxy -u HTTP_PROXY -u HTTPS_PROXY -u all_proxy -u ALL_PROXY NO_PROXY=127.0.0.1,localhost,172.19.0.1 no_proxy=127.0.0.1,localhost,172.19.0.1 ./run-regression-test.sh --run --conf tmp/regression-conf.auto.groovy -d variant_p0 -s duplicate_json_path
    - Review: GPT-5.5 xhigh subagent final review found no P0/P1 blocking issues
- Behavior changed: Yes. When variant_enable_duplicate_json_path_check is enabled, duplicate downstream leaf paths keep the first value and duplicate object entries can merge distinct child paths; default is false.
- Does this need documentation: No
@eldenmoon eldenmoon force-pushed the codex/cir-19941-duplicate-json-path branch from 99aa5b6 to 102aad0 Compare May 9, 2026 08:15
Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review conclusion: request changes. I did not add duplicate inline comments because the remaining blocking concern is already covered by an existing thread on be/src/util/json/json_parser.cpp: traverseObjectAsJsonb() still preserves duplicate object keys inside JSONB output when variant_enable_duplicate_json_path_check is enabled, so nested arrays/objects converted through the JSONB path do not consistently keep the first duplicate key.

Critical checkpoints:

  • Goal/test: The PR aims to make duplicate Variant JSON paths keep the first value when enabled. The added BE unit and regression tests cover subcolumn/doc materialization, flat dotted keys, object/scalar conflicts, insert, stream load, and compaction, but the JSONB conversion path remains uncovered and inconsistent with the config comment.
  • Scope/focus: The code is mostly focused on Variant parsing and materialization; no unrelated current PR file changes were found in the GitHub PR diff.
  • Concurrency/lifecycle: No new concurrent state or non-trivial lifecycle management is introduced. The new mutable config is read per parse/materialization call into ParseConfig; no lock-order or thread lifecycle issue was found.
  • Configuration: A mutable BE config is added and propagated to the main deserialize/materialize paths, but behavior is not applied uniformly to JSONB serialization of nested arrays/objects.
  • Compatibility/storage: No serialized format or FE/BE protocol format change was found. Default behavior remains disabled, so compatibility risk is limited.
  • Parallel paths: The subcolumn/doc-value path is updated, but the JSONB path remains a parallel path with inconsistent duplicate handling.
  • Tests/results: Tests are meaningful for the updated paths and use ordered query output. Missing coverage remains for JSONB duplicate-key handling under the new config.
  • Observability/performance: No extra observability appears necessary. The duplicate path checks add per-row hash-set work only when enabled; no additional performance blocker found.

User focus: no additional user-provided review focus was specified.

Existing review context: The earlier flat-dot/nested path-collision concern appears addressed in the current head by parser-level get_path() deduplication plus path normalization/materialization tests. The earlier regression-test deletion concern appears stale for the current GitHub PR file list. The JSONB duplicate-key concern remains applicable and should be resolved before approval.

@eldenmoon
Copy link
Copy Markdown
Member Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 29950 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 102aad0ae5ef513aadeb242cd64a157741da5c17, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17592	3882	3850	3850
q2	q3	10688	879	619	619
q4	4687	460	343	343
q5	7660	1351	1169	1169
q6	276	167	137	137
q7	932	963	754	754
q8	10721	1458	1329	1329
q9	7169	5396	5353	5353
q10	6302	2083	1845	1845
q11	474	264	259	259
q12	668	420	295	295
q13	18140	3349	2777	2777
q14	292	281	256	256
q15	q16	900	872	795	795
q17	981	1112	758	758
q18	6618	5814	5725	5725
q19	1386	1294	1201	1201
q20	542	411	267	267
q21	4756	2360	1895	1895
q22	450	371	323	323
Total cold run time: 101234 ms
Total hot run time: 29950 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4765	4622	4622	4622
q2	q3	4671	4857	4220	4220
q4	2128	2165	1398	1398
q5	4994	5023	5320	5023
q6	201	171	130	130
q7	2047	1787	1667	1667
q8	3491	3108	3321	3108
q9	8403	8428	8454	8428
q10	4503	4558	4272	4272
q11	608	412	392	392
q12	721	744	509	509
q13	3260	3572	2946	2946
q14	303	307	278	278
q15	q16	801	792	704	704
q17	1353	1310	1287	1287
q18	8008	6965	7180	6965
q19	1210	1162	1173	1162
q20	2239	2196	1921	1921
q21	6116	5335	4818	4818
q22	525	471	401	401
Total cold run time: 60347 ms
Total hot run time: 54251 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 170427 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 102aad0ae5ef513aadeb242cd64a157741da5c17, data reload: false

query5	4307	654	526	526
query6	327	225	201	201
query7	4296	560	314	314
query8	344	239	233	233
query9	8862	4049	4044	4044
query10	453	356	313	313
query11	5810	2448	2209	2209
query12	193	137	133	133
query13	1303	595	441	441
query14	6283	5396	5065	5065
query14_1	4397	4418	4340	4340
query15	213	203	181	181
query16	1016	478	427	427
query17	1118	747	618	618
query18	2758	482	352	352
query19	228	206	161	161
query20	144	134	129	129
query21	228	145	119	119
query22	13711	13665	13282	13282
query23	17226	16368	16117	16117
query23_1	16202	16135	16158	16135
query24	7349	1747	1356	1356
query24_1	1357	1346	1358	1346
query25	561	495	438	438
query26	1168	322	169	169
query27	2699	602	337	337
query28	4382	1992	1951	1951
query29	978	661	544	544
query30	313	244	202	202
query31	1125	1082	959	959
query32	102	79	75	75
query33	578	359	320	320
query34	1203	1129	625	625
query35	780	793	690	690
query36	1321	1368	1219	1219
query37	157	114	96	96
query38	3213	3143	3030	3030
query39	936	938	896	896
query39_1	872	875	876	875
query40	244	171	145	145
query41	71	69	71	69
query42	118	113	110	110
query43	324	338	288	288
query44	
query45	217	207	197	197
query46	1049	1215	737	737
query47	2315	2365	2234	2234
query48	407	422	307	307
query49	667	568	452	452
query50	716	284	220	220
query51	4417	4312	4256	4256
query52	106	108	95	95
query53	265	300	210	210
query54	331	289	282	282
query55	95	93	86	86
query56	316	324	328	324
query57	1404	1407	1298	1298
query58	314	277	271	271
query59	1583	1619	1485	1485
query60	339	334	320	320
query61	161	159	159	159
query62	666	625	560	560
query63	240	201	210	201
query64	2346	803	659	659
query65	
query66	1698	526	396	396
query67	30031	29987	29849	29849
query68	
query69	473	339	297	297
query70	994	1047	1015	1015
query71	311	283	271	271
query72	2886	2813	1982	1982
query73	864	794	422	422
query74	5098	4929	4738	4738
query75	2769	2678	2336	2336
query76	2282	1144	765	765
query77	423	418	360	360
query78	12885	13031	12423	12423
query79	1512	980	695	695
query80	1335	574	506	506
query81	491	278	240	240
query82	1300	160	123	123
query83	352	289	258	258
query84	263	145	111	111
query85	937	491	439	439
query86	446	339	344	339
query87	3435	3365	3242	3242
query88	3529	2664	2667	2664
query89	445	377	345	345
query90	1906	195	176	176
query91	177	168	136	136
query92	78	76	68	68
query93	982	957	567	567
query94	731	335	309	309
query95	669	458	364	364
query96	1035	764	347	347
query97	2725	2680	2571	2571
query98	238	227	233	227
query99	1138	1128	993	993
Total cold run time: 254426 ms
Total hot run time: 170427 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 92.65% (63/68) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.53% (20635/38548)
Line Coverage 37.17% (194838/524170)
Region Coverage 33.48% (151809/453373)
Branch Coverage 34.55% (66265/191806)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants