Skip to content

[fix](parquet) Fix parquet row group column lookup crash and row group reader lifetime#63102

Open
Gabriel39 wants to merge 2 commits intoapache:masterfrom
Gabriel39:fix_0509
Open

[fix](parquet) Fix parquet row group column lookup crash and row group reader lifetime#63102
Gabriel39 wants to merge 2 commits intoapache:masterfrom
Gabriel39:fix_0509

Conversation

@Gabriel39
Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

*** tablet id: 0 ***
*** Aborted at 1778254939 (unix time) try "date -d @1778254939" if you
are using GNU date ***
*** Current BE git commitID: c6e6ecc ***
*** SIGSEGV address not mapped to object (@0x13) received by PID 40355
(TID 77840 OR 0x7ef6079ff640) from PID 19; stack trace: ***
0# doris::signal::(anonymous namespace)::FailureSignalHandler(int,
siginfo_t*, void*) at
/home/zcp/repo_center/doris_branch-4.1/doris/be/src/common/signal_handler.h:420
1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0]
in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
2# JVM_handle_linux_signal in
/usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so
 3# 0x00007F036C55C520 in /lib/x86_64-linux-gnu/libc.so.6
4# std::_Hash_bytes(void const*, unsigned long, unsigned long) in
/mnt/hdd01/PERFORMANCE_ENV/be/lib/doris_be
5# std::__detail::_Map_base<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> >,
std::pair<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > const, unsigned int>,
std::allocator<std::pair<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const, unsigned int> >,
std::__detail::_Select1st,
std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > >,
std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash,
std::__detail::_Prime_rehash_policy,
std::__detail::_Hashtable_traits<true, false, true>,
true>::operator[](std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&) in
/mnt/hdd01/PERFORMANCE_ENV/be/lib/doris_be
6# doris::RowGroupReader::_read_column_data(doris::Block*,
std::vector<std::__cxx11::basic_string<char, std::char_traits<char>,
std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > > > const&, unsigned
long, unsigned long*, bool*, doris::FilterMap&) at
/home/zcp/repo_center/doris_branch-4.1/doris/be/src/format/parquet/vparquet_group_reader.cpp:496
7# doris::RowGroupReader::next_batch(doris::Block*, unsigned long,
unsigned long*, bool*) at
/home/zcp/repo_center/doris_branch-4.1/doris/be/src/format/parquet/vparquet_group_reader.cpp:404
8# doris::ParquetReader::get_next_block(doris::Block*, unsigned long*,
bool*) at
/home/zcp/repo_center/doris_branch-4.1/doris/be/src/format/parquet/vparquet_reader.cpp:724
9# doris::IcebergTableReader::get_next_block_inner(doris::Block*,
unsigned long*, bool*) at
/home/zcp/repo_center/doris_branch-4.1/doris/be/src/format/table/iceberg_reader.cpp:185
10# doris::TableFormatReader::get_next_block(doris::Block*, unsigned
long*, bool*) at
/home/zcp/repo_center/doris_branch-4.1/doris/be/src/format/table/table_format_reader.h:80
11# doris::FileScanner::_get_block_wrapped(doris::RuntimeState*,
doris::Block*, bool*) at
/home/zcp/repo_center/doris_branch-4.1/doris/be/src/exec/scan/file_scanner.cpp:474
12# doris::FileScanner::_get_block_impl(doris::RuntimeState*,
doris::Block*, bool*) at
/home/zcp/repo_center/doris_branch-4.1/doris/be/src/exec/scan/file_scanner.cpp:407
13# doris::Scanner::get_block(doris::RuntimeState*, doris::Block*,
bool*) in /mnt/hdd01/PERFORMANCE_ENV/be/lib/doris_be
14# doris::Scanner::get_block_after_projects(doris::RuntimeState*,
doris::Block*, bool*) at
/home/zcp/repo_center/doris_branch-4.1/doris/be/src/exec/scan/scanner.cpp:91
15#
doris::ScannerScheduler::_scanner_scan(std::shared_ptr<doris::ScannerContext>,
std::shared_ptr<doris::ScanTask>) at
/home/zcp/repo_center/doris_branch-4.1/doris/be/src/exec/scan/scanner_scheduler.cpp:186
16# std::_Function_handler<bool (),
doris::ScannerScheduler::submit(std::shared_ptr<doris::ScannerContext>,
std::shared_ptr<doris::ScanTask>)::$_0::operator()()
const::{lambda()apache#1}>::_M_invoke(std::_Any_data const&) at
/usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:292
17# doris::ScannerSplitRunner::process_for(std::chrono::duration<long,
std::ratio<1l, 1000000000l> >) at
/home/zcp/repo_center/doris_branch-4.1/doris/be/src/exec/scan/scanner_scheduler.cpp:427
18# doris::PrioritizedSplitRunner::process() at
/home/zcp/repo_center/doris_branch-4.1/doris/be/src/exec/scan/task_executor/time_sharing/prioritized_split_runner.cpp:103
19# doris::TimeSharingTaskExecutor::_dispatch_thread() at
/home/zcp/repo_center/doris_branch-4.1/doris/be/src/exec/scan/task_executor/time_sharing/time_sharing_task_executor.cpp:558
20# doris::Thread::supervise_thread(void*) at
/home/zcp/repo_center/doris_branch-4.1/doris/be/src/util/thread.cpp:461
21# start_thread at ./nptl/pthread_create.c:442
22# 0x00007F036C6408D0 at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:83

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
- [ ] This is a refactor/code format and no logic has been changed.
        - [ ] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason?  -->

- Behavior changed:
    - [ ] No.
    - [ ] Yes. <!-- Explain the behavior change -->

- Does this need documentation?
    - [ ] No.
- [ ] Yes. <!-- Add document PR link here. eg:
apache/doris-website#1214 -->

- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR
should merge into -->
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Gabriel39
Copy link
Copy Markdown
Contributor Author

/review

@Gabriel39 Gabriel39 changed the title Fix 0509 [fix](parquet) Fix parquet row group column lookup crash and row group reader lifetime May 9, 2026
@Gabriel39
Copy link
Copy Markdown
Contributor Author

/review

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review completed for the Parquet row-group reader changes. I did not find a blocking correctness issue in the touched code.

Critical checkpoint conclusions:

  • Goal and evidence: The PR addresses row-group reader crashes caused by unsafe column lookup/lifetime assumptions by copying reader inputs that outlive the constructor frame, avoiding unordered_map operator[] insertion for block/reader lookup, and ordering column-reader destruction before _read_ranges. No test evidence is included in the PR.
  • Scope and clarity: The behavioral changes are small and focused on Parquet row-group lookup/lifetime safety. The added _fill_partition_columns / _fill_missing_columns helpers are currently unused, which is not blocking but should be cleaned up or wired in later if intended.
  • Concurrency: No new shared mutable state, threads, atomics, or lock ordering are introduced.
  • Lifecycle: The main lifecycle change is appropriate: _read_table_columns and _lazy_read_ctx are now owned by RowGroupReader, and _column_readers is declared after _read_ranges so it is destroyed first. Existing raw observer pointers still rely on the owning reader/scanner lifetime, matching surrounding code.
  • Configuration and compatibility: No new config, protocol, storage format, or compatibility concern.
  • Parallel paths: The patch covers the changed Parquet row-group direct lookup paths. ORC/table-format fill paths are not changed by this PR.
  • Conditional checks and error handling: New lookup checks return Status::InternalError with block context instead of crashing or silently inserting map entries. Status results are propagated.
  • Tests: No regression/unit/manual test result is provided. A targeted external Parquet/Iceberg regression covering missing/block-index lookup and dict-filter/lazy-read paths would reduce residual risk.
  • Observability: Existing error messages include column and block structure; no additional metrics/logging appears necessary for this crash fix.
  • Transaction/persistence/data writes: Not applicable; this is a read-path-only change.
  • FE/BE variable passing: Not applicable.
  • Performance: Lookup changes replace repeated operator[] with checked find; no meaningful hot-path regression identified.
  • User focus: No additional user-provided review focus was specified.

@Gabriel39
Copy link
Copy Markdown
Contributor Author

run buildall

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review opinion: no blocking issues found. The changes address the two targeted failure modes: row group readers now own the copied read/lazy-read metadata they may outlive, and parquet block column lookup no longer mutates the map or falls through to position 0 when a column is absent.

Critical checkpoint conclusions:

  • Goal/test: the code change appears to accomplish the stated crash/lifetime fix. I did not find an added targeted regression/unit test in this PR, so coverage for the specific crash scenario remains the main residual risk.
  • Scope: the functional change is small and focused on ownership/destruction order plus safer lookup. The new helper methods are localized.
  • Concurrency: no new shared mutable state or locking paths are introduced; row group reader state remains per-reader.
  • Lifecycle: the important lifecycle issue is handled by copying _read_table_columns/_lazy_read_ctx and declaring _column_readers after _read_ranges so readers are destroyed before the referenced ranges.
  • Configuration/compatibility/protocol: no new config, storage format, FE-BE protocol, or rolling-upgrade compatibility concern found.
  • Parallel paths: analogous parquet dict/lazy read block accesses in this file were updated where relevant; no distinct parquet row group reader instance requiring the same fix was found.
  • Error handling: new lookup failures return Status::InternalError with block context instead of unchecked map indexing; no ignored Status introduced in the modified paths.
  • Data correctness: no visible-version, delete-bitmap, transaction, or persistence path is touched.
  • Memory/performance: copying the column-name and lazy-read context is acceptable for row group reader lifetime safety; no hot-path pathological overhead found.
  • Observability: error messages include column and block structure details, sufficient for this failure class.
  • User focus: no additional user-provided review focus was present.

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 30027 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 09adfa5b3b94854daee572c408c56c46bf6b38a3, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	18148	4219	4028	4028
q2	q3	10711	887	610	610
q4	4658	468	343	343
q5	7466	1348	1151	1151
q6	186	170	142	142
q7	935	966	763	763
q8	9330	1414	1303	1303
q9	5583	5383	5371	5371
q10	6260	2082	1849	1849
q11	464	271	264	264
q12	639	416	311	311
q13	18090	3317	2804	2804
q14	290	291	262	262
q15	q16	866	865	796	796
q17	996	1113	722	722
q18	6518	5752	5616	5616
q19	1177	1243	1050	1050
q20	530	407	267	267
q21	4642	2432	2034	2034
q22	476	401	341	341
Total cold run time: 97965 ms
Total hot run time: 30027 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4764	4739	4783	4739
q2	q3	4671	4778	4191	4191
q4	2157	2174	1421	1421
q5	5013	5033	5352	5033
q6	201	170	137	137
q7	2057	1819	1606	1606
q8	3333	3153	3105	3105
q9	8477	8547	8472	8472
q10	4528	4500	4278	4278
q11	616	425	423	423
q12	692	759	523	523
q13	3282	3572	2918	2918
q14	325	318	273	273
q15	q16	939	788	722	722
q17	1328	1299	1279	1279
q18	8187	7135	7079	7079
q19	1159	1152	1118	1118
q20	2230	2217	1950	1950
q21	6200	5482	4884	4884
q22	550	515	443	443
Total cold run time: 60709 ms
Total hot run time: 54594 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 172041 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 09adfa5b3b94854daee572c408c56c46bf6b38a3, data reload: false

query5	4303	656	514	514
query6	334	226	202	202
query7	4312	564	317	317
query8	336	232	214	214
query9	8846	4079	4049	4049
query10	458	339	306	306
query11	5771	2406	2184	2184
query12	183	140	129	129
query13	1287	616	437	437
query14	6015	5393	5092	5092
query14_1	4376	4370	4382	4370
query15	214	208	213	208
query16	959	448	423	423
query17	1071	745	618	618
query18	2439	484	346	346
query19	223	213	177	177
query20	138	137	133	133
query21	216	139	123	123
query22	13669	13981	14499	13981
query23	17316	16599	16194	16194
query23_1	16262	16312	16346	16312
query24	7371	1736	1319	1319
query24_1	1338	1336	1346	1336
query25	569	485	425	425
query26	1296	318	167	167
query27	2725	601	323	323
query28	4463	1959	1935	1935
query29	986	622	527	527
query30	299	235	203	203
query31	1091	1050	921	921
query32	80	71	68	68
query33	517	341	296	296
query34	1177	1123	638	638
query35	762	817	662	662
query36	1317	1328	1139	1139
query37	147	100	90	90
query38	3188	3155	3063	3063
query39	923	931	882	882
query39_1	868	896	874	874
query40	234	154	135	135
query41	61	61	56	56
query42	107	107	104	104
query43	323	322	287	287
query44	
query45	206	202	187	187
query46	1071	1183	714	714
query47	2309	2305	2175	2175
query48	402	412	301	301
query49	635	515	419	419
query50	704	285	214	214
query51	4374	4295	4206	4206
query52	104	104	93	93
query53	257	286	213	213
query54	318	271	252	252
query55	92	89	91	89
query56	291	312	303	303
query57	1409	1376	1287	1287
query58	299	266	272	266
query59	1585	1653	1471	1471
query60	356	341	336	336
query61	157	160	163	160
query62	671	614	538	538
query63	255	202	205	202
query64	2415	842	677	677
query65	
query66	1736	519	426	426
query67	30104	30025	29937	29937
query68	
query69	454	348	327	327
query70	1025	1053	972	972
query71	325	310	282	282
query72	3119	2949	2641	2641
query73	841	758	443	443
query74	5087	4905	4767	4767
query75	2817	2707	2352	2352
query76	2290	1139	761	761
query77	429	419	364	364
query78	13238	12952	12580	12580
query79	2817	926	779	779
query80	1767	591	493	493
query81	523	278	238	238
query82	971	156	120	120
query83	346	278	248	248
query84	264	141	108	108
query85	906	511	452	452
query86	453	318	324	318
query87	3432	3362	3213	3213
query88	3553	2684	2661	2661
query89	443	381	342	342
query90	1883	202	181	181
query91	184	172	144	144
query92	76	75	76	75
query93	1244	952	548	548
query94	723	330	293	293
query95	682	470	359	359
query96	1082	748	320	320
query97	2689	2719	2549	2549
query98	241	238	228	228
query99	1080	1137	986	986
Total cold run time: 256292 ms
Total hot run time: 172041 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE UT Coverage Report

Increment line coverage 21.62% (24/111) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.52% (20631/38546)
Line Coverage 37.16% (194778/524150)
Region Coverage 33.51% (151931/453421)
Branch Coverage 34.54% (66254/191828)

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 33.33% (37/111) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.68% (27807/37738)
Line Coverage 57.58% (300864/522475)
Region Coverage 54.86% (251084/457672)
Branch Coverage 56.34% (108396/192399)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants