[fix](parquet) Fix parquet row group column lookup crash and row group reader lifetime#63102
[fix](parquet) Fix parquet row group column lookup crash and row group reader lifetime#63102Gabriel39 wants to merge 2 commits intoapache:masterfrom
Conversation
*** tablet id: 0 *** *** Aborted at 1778254939 (unix time) try "date -d @1778254939" if you are using GNU date *** *** Current BE git commitID: c6e6ecc *** *** SIGSEGV address not mapped to object (@0x13) received by PID 40355 (TID 77840 OR 0x7ef6079ff640) from PID 19; stack trace: *** 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/zcp/repo_center/doris_branch-4.1/doris/be/src/common/signal_handler.h:420 1# PosixSignals::chained_handler(int, siginfo*, void*) [clone .part.0] in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so 2# JVM_handle_linux_signal in /usr/lib/jvm/java-17-openjdk-amd64/lib/server/libjvm.so 3# 0x00007F036C55C520 in /lib/x86_64-linux-gnu/libc.so.6 4# std::_Hash_bytes(void const*, unsigned long, unsigned long) in /mnt/hdd01/PERFORMANCE_ENV/be/lib/doris_be 5# std::__detail::_Map_base<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned int>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, unsigned int> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) in /mnt/hdd01/PERFORMANCE_ENV/be/lib/doris_be 6# doris::RowGroupReader::_read_column_data(doris::Block*, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&, unsigned long, unsigned long*, bool*, doris::FilterMap&) at /home/zcp/repo_center/doris_branch-4.1/doris/be/src/format/parquet/vparquet_group_reader.cpp:496 7# doris::RowGroupReader::next_batch(doris::Block*, unsigned long, unsigned long*, bool*) at /home/zcp/repo_center/doris_branch-4.1/doris/be/src/format/parquet/vparquet_group_reader.cpp:404 8# doris::ParquetReader::get_next_block(doris::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_branch-4.1/doris/be/src/format/parquet/vparquet_reader.cpp:724 9# doris::IcebergTableReader::get_next_block_inner(doris::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_branch-4.1/doris/be/src/format/table/iceberg_reader.cpp:185 10# doris::TableFormatReader::get_next_block(doris::Block*, unsigned long*, bool*) at /home/zcp/repo_center/doris_branch-4.1/doris/be/src/format/table/table_format_reader.h:80 11# doris::FileScanner::_get_block_wrapped(doris::RuntimeState*, doris::Block*, bool*) at /home/zcp/repo_center/doris_branch-4.1/doris/be/src/exec/scan/file_scanner.cpp:474 12# doris::FileScanner::_get_block_impl(doris::RuntimeState*, doris::Block*, bool*) at /home/zcp/repo_center/doris_branch-4.1/doris/be/src/exec/scan/file_scanner.cpp:407 13# doris::Scanner::get_block(doris::RuntimeState*, doris::Block*, bool*) in /mnt/hdd01/PERFORMANCE_ENV/be/lib/doris_be 14# doris::Scanner::get_block_after_projects(doris::RuntimeState*, doris::Block*, bool*) at /home/zcp/repo_center/doris_branch-4.1/doris/be/src/exec/scan/scanner.cpp:91 15# doris::ScannerScheduler::_scanner_scan(std::shared_ptr<doris::ScannerContext>, std::shared_ptr<doris::ScanTask>) at /home/zcp/repo_center/doris_branch-4.1/doris/be/src/exec/scan/scanner_scheduler.cpp:186 16# std::_Function_handler<bool (), doris::ScannerScheduler::submit(std::shared_ptr<doris::ScannerContext>, std::shared_ptr<doris::ScanTask>)::$_0::operator()() const::{lambda()apache#1}>::_M_invoke(std::_Any_data const&) at /usr/local/ldb-toolchain-v0.26/bin/../lib/gcc/x86_64-pc-linux-gnu/15/include/g++-v15/bits/std_function.h:292 17# doris::ScannerSplitRunner::process_for(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) at /home/zcp/repo_center/doris_branch-4.1/doris/be/src/exec/scan/scanner_scheduler.cpp:427 18# doris::PrioritizedSplitRunner::process() at /home/zcp/repo_center/doris_branch-4.1/doris/be/src/exec/scan/task_executor/time_sharing/prioritized_split_runner.cpp:103 19# doris::TimeSharingTaskExecutor::_dispatch_thread() at /home/zcp/repo_center/doris_branch-4.1/doris/be/src/exec/scan/task_executor/time_sharing/time_sharing_task_executor.cpp:558 20# doris::Thread::supervise_thread(void*) at /home/zcp/repo_center/doris_branch-4.1/doris/be/src/util/thread.cpp:461 21# start_thread at ./nptl/pthread_create.c:442 22# 0x00007F036C6408D0 at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:83 - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [ ] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
|
/review |
There was a problem hiding this comment.
Code review completed for the Parquet row-group reader changes. I did not find a blocking correctness issue in the touched code.
Critical checkpoint conclusions:
- Goal and evidence: The PR addresses row-group reader crashes caused by unsafe column lookup/lifetime assumptions by copying reader inputs that outlive the constructor frame, avoiding unordered_map operator[] insertion for block/reader lookup, and ordering column-reader destruction before
_read_ranges. No test evidence is included in the PR. - Scope and clarity: The behavioral changes are small and focused on Parquet row-group lookup/lifetime safety. The added
_fill_partition_columns/_fill_missing_columnshelpers are currently unused, which is not blocking but should be cleaned up or wired in later if intended. - Concurrency: No new shared mutable state, threads, atomics, or lock ordering are introduced.
- Lifecycle: The main lifecycle change is appropriate:
_read_table_columnsand_lazy_read_ctxare now owned byRowGroupReader, and_column_readersis declared after_read_rangesso it is destroyed first. Existing raw observer pointers still rely on the owning reader/scanner lifetime, matching surrounding code. - Configuration and compatibility: No new config, protocol, storage format, or compatibility concern.
- Parallel paths: The patch covers the changed Parquet row-group direct lookup paths. ORC/table-format fill paths are not changed by this PR.
- Conditional checks and error handling: New lookup checks return
Status::InternalErrorwith block context instead of crashing or silently inserting map entries.Statusresults are propagated. - Tests: No regression/unit/manual test result is provided. A targeted external Parquet/Iceberg regression covering missing/block-index lookup and dict-filter/lazy-read paths would reduce residual risk.
- Observability: Existing error messages include column and block structure; no additional metrics/logging appears necessary for this crash fix.
- Transaction/persistence/data writes: Not applicable; this is a read-path-only change.
- FE/BE variable passing: Not applicable.
- Performance: Lookup changes replace repeated
operator[]with checkedfind; no meaningful hot-path regression identified. - User focus: No additional user-provided review focus was specified.
|
run buildall |
There was a problem hiding this comment.
Review opinion: no blocking issues found. The changes address the two targeted failure modes: row group readers now own the copied read/lazy-read metadata they may outlive, and parquet block column lookup no longer mutates the map or falls through to position 0 when a column is absent.
Critical checkpoint conclusions:
- Goal/test: the code change appears to accomplish the stated crash/lifetime fix. I did not find an added targeted regression/unit test in this PR, so coverage for the specific crash scenario remains the main residual risk.
- Scope: the functional change is small and focused on ownership/destruction order plus safer lookup. The new helper methods are localized.
- Concurrency: no new shared mutable state or locking paths are introduced; row group reader state remains per-reader.
- Lifecycle: the important lifecycle issue is handled by copying
_read_table_columns/_lazy_read_ctxand declaring_column_readersafter_read_rangesso readers are destroyed before the referenced ranges. - Configuration/compatibility/protocol: no new config, storage format, FE-BE protocol, or rolling-upgrade compatibility concern found.
- Parallel paths: analogous parquet dict/lazy read block accesses in this file were updated where relevant; no distinct parquet row group reader instance requiring the same fix was found.
- Error handling: new lookup failures return
Status::InternalErrorwith block context instead of unchecked map indexing; no ignoredStatusintroduced in the modified paths. - Data correctness: no visible-version, delete-bitmap, transaction, or persistence path is touched.
- Memory/performance: copying the column-name and lazy-read context is acceptable for row group reader lifetime safety; no hot-path pathological overhead found.
- Observability: error messages include column and block structure details, sufficient for this failure class.
- User focus: no additional user-provided review focus was present.
TPC-H: Total hot run time: 30027 ms |
TPC-DS: Total hot run time: 172041 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)