
#845 Fix in-place processing to retain bytes from start and end offsets of the original file when the processing strategy is 'InPlace' #849

Merged

yruslan merged 3 commits into master from feature/845-in-place-processing-retain-start-end-offsets on May 14, 2026

Conversation

@yruslan
Collaborator

yruslan commented May 14, 2026

Summary by CodeRabbit

  • Bug Fixes

    • In-place processing now preserves and emits unmodified leading and trailing bytes when file start/end offsets are used, ensuring correct output composition.
  • Improvements

    • Stream handling updated to track skipped regions so boundary bytes are read and written reliably across single-node and distributed workflows.
  • Tests

    • Updated tests to validate preserved boundary bytes when start/end offsets are applied.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 14, 2026

Warning

Rate limit exceeded

@yruslan has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 48 minutes and 46 seconds before requesting another review.


⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e03f56a2-d22c-433c-8f63-5bb187339fe6

📥 Commits

Reviewing files that changed from the base of the PR and between 8234cee and 78b6660.

📒 Files selected for processing (1)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/stream/FSStream.scala

Walkthrough

The PR extends in-place COBOL file processing to preserve unmodified file start- and end-offset boundary bytes. FSStream gains boundary tracking via a skipped flag and methods to retrieve skipped bytes; CobolProcessorInPlace conditionally emits those bytes around stream processing; SparkCobolProcessor threads a retainStartAndEndOffsets flag through distributed file processing and performs parallel boundary writes.
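The flow described in the walkthrough can be sketched as a minimal, self-contained model. This is an illustration only: the names `SimpleBoundaryStream`, `readEffectiveBytes`, and `processInPlace` are stand-ins for the actual Cobrix classes, and the real implementation operates on files rather than byte arrays. The idea is the same: leading bytes are emitted verbatim, only the effective region is transformed, and trailing bytes are appended unchanged.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}

// Hypothetical, simplified model of the fix: when a file is processed in place
// with file_start_offset / file_end_offset, the bytes outside the effective
// region must be copied to the output unmodified, while only the middle
// region is transformed.
class SimpleBoundaryStream(data: Array[Byte], startOffset: Int, endOffset: Int) {
  private var skipped = false
  private val in = new ByteArrayInputStream(data)

  /** Returns the leading bytes once; subsequent calls return nothing. */
  def getSkippedStartBytes: Array[Byte] = {
    if (!skipped) {
      skipped = true
      val b = new Array[Byte](startOffset)
      in.read(b, 0, startOffset)
      b
    } else Array.empty[Byte]
  }

  /** Reads the effective (middle) region of the data. */
  def readEffectiveBytes: Array[Byte] = {
    val effectiveLen = data.length - startOffset - endOffset
    val b = new Array[Byte](effectiveLen)
    in.read(b, 0, effectiveLen)
    b
  }

  /** Returns the trailing bytes after the effective region is consumed. */
  def getSkippedEndBytes: Array[Byte] = {
    val b = new Array[Byte](endOffset)
    val actual = in.read(b, 0, endOffset)
    if (actual <= 0) Array.empty[Byte] else b.take(actual)
  }
}

object InPlaceSketch {
  /** Copies boundary bytes verbatim and transforms only the middle region. */
  def processInPlace(data: Array[Byte], startOffset: Int, endOffset: Int)(transform: Byte => Byte): Array[Byte] = {
    val stream = new SimpleBoundaryStream(data, startOffset, endOffset)
    val out = new ByteArrayOutputStream()
    out.write(stream.getSkippedStartBytes)              // header preserved as-is
    out.write(stream.readEffectiveBytes.map(transform)) // effective region transformed
    out.write(stream.getSkippedEndBytes)                // trailer preserved as-is
    out.toByteArray
  }
}
```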

Changes

Preserve file start/end offset bytes in in-place COBOL processing

| Layer / File(s) | Summary |
|---|---|
| **FSStream boundary byte tracking and retrieval**<br>`cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/stream/FSStream.scala` | FSStream introduces a mutable `skipped` flag to record whether start-offset bytes have been consumed, and implements `getSkippedStartBytes` and `getSkippedEndBytes` to conditionally retrieve unmodified boundary bytes. Start-offset skipping in `next` occurs only once, guarded by `!skipped`, and `skipFully` sets the flag to prevent redundant skipping. |
| **CobolProcessorInPlace conditional boundary output for FSStream**<br>`cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorInPlace.scala` | CobolProcessorInPlace pattern-matches on the input stream and, when it is an FSStream, writes the skipped-start bytes before calling `StreamProcessor.processStreamInPlace` and the skipped-end bytes after, preserving boundary regions in the output. |
| **SparkCobolProcessor distributed boundary preservation flag and wiring**<br>`spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessor.scala`, `spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala` | SparkCobolProcessor derives a `retainStartAndEndOffsets` flag from `CobolProcessingStrategy`, threads it through the RDD-based file processing pipeline, and conditionally performs extra boundary-byte writes in parallel tasks before and after main processing when the flag is enabled. |
| **Test assertions for boundary byte preservation**<br>`cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/CobolProcessorBuilderSuite.scala`, `spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala` | Test assertions are updated to expect the complete output including unmodified boundary bytes, and the Spark reader test configuration is extended to supply `file_start_offset` and `file_end_offset` options for round-trip verification. |
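The "skip only once" semantics described for FSStream can be illustrated with a simplified reader. All names here are stand-ins for the real FSStream internals: the point is that the start-offset bytes are consumed at most once, whether they are first retrieved explicitly via `getSkippedStartBytes` or first skipped implicitly by a read, so the two paths cannot double-skip.

```scala
// Illustrative sketch of the skip-once guard: a mutable `skipped` flag
// ensures the start-offset bytes are consumed exactly once, regardless of
// which method touches them first. This is not the actual FSStream code.
class SkipOnceReader(data: Array[Byte], startOffset: Int) {
  private var skipped = false
  private var pos = 0

  private def skipStartIfNeeded(): Unit = {
    if (!skipped) {        // guard: skip the start offset at most once
      pos += startOffset
      skipped = true
    }
  }

  /** Retrieves the leading bytes and marks them as consumed. */
  def getSkippedStartBytes: Array[Byte] = {
    if (!skipped) {
      skipped = true
      pos = startOffset
      data.slice(0, startOffset)
    } else Array.empty[Byte]
  }

  /** Reads n bytes of the effective region, skipping the header first if needed. */
  def next(n: Int): Array[Byte] = {
    skipStartIfNeeded()
    val b = data.slice(pos, pos + n)
    pos += b.length
    b
  }
}
```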

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly Related PRs

  • AbsaOSS/cobrix#844: Both PRs implement honoring file_start_offset/file_end_offset for in-place processing by extending/modifying FSStream and wiring those offsets into the in-place processing flow (including boundary byte handling).
  • AbsaOSS/cobrix#800: Modifies CobolProcessorInPlace.scala's in-place stream handling; this PR builds on that foundation to add boundary-byte preservation for FSStream.
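The Spark-side wiring mentioned in the walkthrough, deriving a `retainStartAndEndOffsets` flag from the processing strategy, can be sketched as follows. The `ProcessingStrategySketch` ADT is a hypothetical stand-in for the real `CobolProcessingStrategy`; only the in-place strategy needs the original boundary bytes kept.

```scala
// Hypothetical sketch: only the in-place strategy requires boundary
// retention, so the flag is derived once and threaded through the pipeline.
// The real Cobrix type and the Spark/RDD wiring differ.
sealed trait ProcessingStrategySketch
case object InPlaceStrategy extends ProcessingStrategySketch
case object CopyStrategy extends ProcessingStrategySketch

object BoundaryFlag {
  /** True only for in-place processing, where the output is the same file. */
  def retainStartAndEndOffsets(strategy: ProcessingStrategySketch): Boolean =
    strategy match {
      case InPlaceStrategy => true
      case CopyStrategy    => false
    }
}
```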

Poem

A rabbit hops through offset streams,
I tuck the skipped bytes neat between,
Start and end returned once more,
Files stitched back as they were before,
Hooray for preserved boundary dreams! 🐇✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main objective: fixing in-place processing to retain bytes from start and end offsets, which is directly reflected in changes across multiple files. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@github-actions

github-actions Bot commented May 14, 2026

JaCoCo code coverage report - 'cobol-parser'

| | Coverage | Δ |
|---|---|---|
| Overall Project | 90.98% | -0.1% 🍏 |
| Files changed | 69.63% | 🍏 |

| File | Coverage | Δ |
|---|---|---|
| CobolProcessorInPlace.scala | 100% | 🍏 |
| FSStream.scala | 73.06% | -16.84% 🍏 |

@github-actions

github-actions Bot commented May 14, 2026

JaCoCo code coverage report - 'spark-cobol'

| | Coverage | Δ |
|---|---|---|
| Overall Project | 83.4% | -0.04% 🍏 |
| Files changed | 90.7% | 🍏 |

| File | Coverage | Δ |
|---|---|---|
| FileStreamer.scala | 99.07% | 🍏 |
| SparkCobolProcessor.scala | 96.36% | -0.56% 🍏 |

Contributor

coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/stream/FSStream.scala (1)

53-66: ⚡ Quick win

Clarify the semantics of getSkippedEndBytes or add defensive checks.

The method closes the stream (line 63) when not at the effective end, which could cause silent failures if called prematurely. The else-branch semantics are unclear: should this method be called only when byteIndex >= effectiveSize, or is the close a defensive cleanup?

Consider either:

  1. Documenting that this method must only be called after reading all effective bytes, or
  2. Throwing an exception instead of silently closing when called prematurely.
♻️ Option: Add explicit precondition check
```diff
 def getSkippedEndBytes: Array[Byte] = {
-  if (byteIndex >= effectiveSize && !isClosed) {
+  if (byteIndex < effectiveSize) {
+    throw new IllegalStateException(s"Cannot retrieve end bytes before reaching effective end (at $byteIndex of $effectiveSize)")
+  }
+  if (!isClosed) {
     val b = new Array[Byte](fileEndOffset.toInt)
     val actual = bytesStream.read(b, 0, fileEndOffset.toInt)
     if (actual <= 0) {
       Array.empty[Byte]
     } else {
       b.take(actual)
     }
   } else {
-    close()
     Array.empty[Byte]
   }
 }
```
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/stream/FSStream.scala`
around lines 53 - 66, The getSkippedEndBytes method currently closes the stream
and returns empty bytes when called before byteIndex >= effectiveSize, which can
hide incorrect usage; update getSkippedEndBytes to enforce a clear precondition:
check that byteIndex >= effectiveSize and if not, either throw an
IllegalStateException with a descriptive message (e.g., referencing
getSkippedEndBytes, byteIndex and effectiveSize) or document the method contract
at its declaration so callers know it must be called only after reading all
effective bytes; ensure the code references the existing symbols
(getSkippedEndBytes, byteIndex, effectiveSize, close, bytesStream,
fileEndOffset) so the precondition is implemented in place of silently closing
the stream.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/stream/FSStream.scala`:
- Around line 38-51: Validate fileStartOffset/fileEndOffset before casting to
Int in getSkippedStartBytes and getSkippedEndBytes: check that each offset is
within 0 to Int.MaxValue and reject (e.g., throw IllegalArgumentException with a
clear message) if out of bounds to avoid silent overflow and huge array
allocation; only perform the toInt cast after this validation and then proceed
with array allocation and reads as before. Ensure the same guard (non-negative
and <= Int.MaxValue) is applied to both getSkippedStartBytes and
getSkippedEndBytes and include the offset value in the exception message for
debugging.
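The validation the review asks for can be sketched as a small helper. The name `toValidatedInt` is hypothetical; the point is simply that a `Long` offset must be range-checked before narrowing to `Int`, because `toInt` silently overflows and the subsequent array allocation would then be wrong.

```scala
// Illustrative guard: validate a Long offset before narrowing to Int to
// avoid silent overflow and incorrect array allocation. Helper name is
// hypothetical, not part of the Cobrix API.
object OffsetGuard {
  def toValidatedInt(offset: Long, name: String): Int = {
    require(offset >= 0 && offset <= Int.MaxValue,
      s"$name must be between 0 and ${Int.MaxValue}, got $offset")
    offset.toInt // safe: range was checked above
  }
}
```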


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c89094d1-0b52-4fa4-a511-9611bbbcd248

📥 Commits

Reviewing files that changed from the base of the PR and between e7a09b5 and 0b65967.

📒 Files selected for processing (6)
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/processor/impl/CobolProcessorInPlace.scala
  • cobol-parser/src/main/scala/za/co/absa/cobrix/cobol/reader/stream/FSStream.scala
  • cobol-parser/src/test/scala/za/co/absa/cobrix/cobol/processor/CobolProcessorBuilderSuite.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessor.scala
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala
  • spark-cobol/src/test/scala/za/co/absa/cobrix/spark/cobol/SparkCobolProcessorSuite.scala
💤 Files with no reviewable changes (1)
  • spark-cobol/src/main/scala/za/co/absa/cobrix/spark/cobol/source/streaming/FileStreamer.scala

@yruslan
Collaborator Author

yruslan commented May 14, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai Bot commented May 14, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@yruslan force-pushed the feature/845-in-place-processing-retain-start-end-offsets branch from 8234cee to 78b6660 on May 14, 2026 at 09:19
@yruslan merged commit d307a35 into master on May 14, 2026
7 checks passed
@yruslan deleted the feature/845-in-place-processing-retain-start-end-offsets branch on May 14, 2026 at 10:07
