HBASE-30061 Add EWMA-based BlockCompressedSizePredicator#8075
Open
apurtell wants to merge 2 commits into
Apache9 reviewed Apr 17, 2026
    import org.junit.Test;
    import org.junit.experimental.categories.Category;

    @Category({ IOTests.class, SmallTests.class })
PreviousBlockCompressionRatePredicator has three algorithmic deficiencies that cause compressed blocks to systematically undershoot the configured block size target: integer division truncation, single-sample estimation, and no smoothing of the estimated compression ratio.

EWMABlockSizePredicator addresses these issues with double-precision arithmetic and an exponentially weighted moving average (EWMA) of the compression ratio. This produces compressed HFile blocks that are closer to the configured target block size.

The ratio is smoothed with a default alpha of 0.5, which adapts quickly to changing data while dampening single-block variance. After 3 blocks, the EWMA has closed 87.5% of the gap to the true ratio (1 - 0.5^3 = 0.875). Alpha = 0.5 is chosen because HFile blocks within a single file tend to have similar compression ratios (same column family, similar data distribution), and fast adaptation matters more than long-term smoothing since predicator state is per-file.

Also adds HFileBlockPerformanceEvaluation to microbenchmark HFileBlock-related concerns.
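The smoothing described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class name `EwmaRatioSketch`, its method names, and the choice to seed the estimate with the first observed ratio are all assumptions made for the example.

```java
/** Hypothetical sketch of EWMA ratio smoothing; not the PR's EWMABlockSizePredicator. */
public class EwmaRatioSketch {
  private final double alpha;   // smoothing factor in (0, 1]
  private double ewmaRatio;     // smoothed uncompressed/compressed ratio
  private boolean initialized;  // false until the first block is observed

  public EwmaRatioSketch(double alpha) {
    if (!(alpha > 0.0 && alpha <= 1.0)) {
      throw new IllegalArgumentException("alpha must be in (0, 1]");
    }
    this.alpha = alpha;
  }

  /** Feed the sizes of a block that was just written. */
  public void update(int uncompressedSize, int compressedSize) {
    if (compressedSize <= 0) {
      return; // nothing usable to learn from
    }
    // Double-precision division keeps the fractional part of the ratio.
    double sample = (double) uncompressedSize / compressedSize;
    if (!initialized) {
      ewmaRatio = sample; // assumption: cold start seeds with the first sample
      initialized = true;
    } else {
      ewmaRatio = alpha * sample + (1.0 - alpha) * ewmaRatio;
    }
  }

  /** Current smoothed ratio; 1.0 (no compression assumed) before any sample. */
  public double ratio() {
    return initialized ? ewmaRatio : 1.0;
  }

  /** Uncompressed bytes to buffer so the compressed block lands near the target. */
  public long adjustedUncompressedLimit(int targetBlockSize) {
    return Math.round(targetBlockSize * ratio());
  }

  /** Share of the estimate contributed by the n newest samples on top of a prior estimate. */
  public static double weightOfLastN(double alpha, int n) {
    return 1.0 - Math.pow(1.0 - alpha, n);
  }
}
```

With alpha = 0.5, `weightOfLastN(0.5, 3)` evaluates to 0.875, which is where the 87.5% figure comes from.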
Pull request overview
This PR introduces a new EWMA-based BlockCompressedSizePredicator to better hit configured HFile block-size targets under compression by using double-precision ratio estimation with smoothing. It also adds a dedicated unit test suite for the new predicator and a diagnostics microbenchmark tool for evaluating block encoding/compression/predicator behavior.
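To illustrate why double-precision ratio estimation matters, here is a small sketch of how integer division can truncate a compression ratio and shrink the resulting block. The class and method names are hypothetical, and the arithmetic is an assumed simplification rather than the exact code of either predicator.

```java
/** Hypothetical illustration of ratio truncation; names are not from HBase. */
public class RatioTruncationSketch {
  /** Integer division drops the fractional part: 110000 / 40000 becomes 2. */
  static int integerRatio(int uncompressed, int compressed) {
    return uncompressed / compressed;
  }

  /** Double division keeps it: 110000 / 40000 is 2.75. */
  static double doubleRatio(int uncompressed, int compressed) {
    return (double) uncompressed / compressed;
  }

  /** Uncompressed size limit derived from a target compressed size and a ratio. */
  static long uncompressedLimit(int targetBlockSize, double ratio) {
    return (long) (targetBlockSize * ratio);
  }
}
```

For a 65536-byte target and a true ratio of 2.75, the truncated ratio gives a limit of 131072 uncompressed bytes, while the double ratio gives 180224. A block cut at 131072 bytes would compress to roughly 47.7 KB at that ratio, well under the 64 KB target.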
Changes:
- Add `EWMABlockSizePredicator` with configurable EWMA alpha to smooth compression ratio estimation.
- Add `TestEWMABlockSizePredicator` unit tests covering cold-start, smoothing behavior, and configurability.
- Add `HFileBlockPerformanceEvaluation` diagnostics utility to benchmark predicator accuracy and read/write throughput.
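The smoothing behavior mentioned in the list above can be pictured with a small comparison of a single-sample estimate against an EWMA over the same series of ratios. This is only a sketch under assumed names (`SmoothingComparisonSketch`, `lastSample`, `ewma`); it does not reproduce the PR's tests.

```java
/** Hypothetical comparison of single-sample and EWMA estimates. */
public class SmoothingComparisonSketch {
  /** Single-sample estimate: only the most recent ratio counts. */
  static double lastSample(double[] ratios) {
    return ratios[ratios.length - 1];
  }

  /** EWMA over the series, seeded with the first ratio (an assumption). */
  static double ewma(double[] ratios, double alpha) {
    double estimate = ratios[0];
    for (int i = 1; i < ratios.length; i++) {
      estimate = alpha * ratios[i] + (1.0 - alpha) * estimate;
    }
    return estimate;
  }
}
```

For the series {4.0, 4.0, 4.0, 2.0}, where the last block compresses unusually poorly, the single-sample estimate drops straight to 2.0 while the EWMA with alpha = 0.5 moves only to 3.0. Setting alpha = 1.0 makes the EWMA collapse to the single-sample behavior.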
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/EWMABlockSizePredicator.java | New EWMA-based block-size predicator with configurable smoothing factor. |
| hbase-server/src/test/java/org/apache/hadoop/hbase/io/hfile/TestEWMABlockSizePredicator.java | New unit tests validating EWMA ratio math, smoothing, and edge cases. |
| hbase-diagnostics/src/main/java/org/apache/hadoop/hbase/HFileBlockPerformanceEvaluation.java | New CLI tool to measure predicator accuracy and HFile throughput across configurations. |