Skip to content

[core] Fix orphan files clean deleting data files during concurrent snapshot expiration#7715

Open
heye1005 wants to merge 1 commit intoapache:masterfrom
heye1005:fix-orphan-clean-concurrent-expiration
Open

[core] Fix orphan files clean deleting data files during concurrent snapshot expiration#7715
heye1005 wants to merge 1 commit intoapache:masterfrom
heye1005:fix-orphan-clean-concurrent-expiration

Conversation

@heye1005
Copy link
Copy Markdown

Purpose

Fix #7710.

LocalOrphanFilesClean.clean() collects used files by reading all snapshots via safelyGetAllSnapshots(). However, this method silently catches FileNotFoundException — if concurrent snapshot expiration deletes all snapshots between listing and reading, usedFiles ends up empty, and all candidate data files get deleted as orphans.

This patch skips orphan file deletion when usedFiles is empty to prevent accidental data loss.

Tests

Added LocalOrphanFilesCleanTest#testSkipCleaningWhenAllSnapshotsDeleted.

@JingsongLi
Copy link
Copy Markdown
Contributor

Hi @heye1005 It is better to fix it reading full ref files for latest snapshot.

@heye1005 heye1005 force-pushed the fix-orphan-clean-concurrent-expiration branch 3 times, most recently from a91ce0d to 5d525d4 Compare May 1, 2026 10:21
@heye1005
Copy link
Copy Markdown
Author

heye1005 commented May 1, 2026

Hi @heye1005 It is better to fix it reading full ref files for latest snapshot.

Thanks for the review! Updated the PR based on your suggestion.

The root cause here is that safelyGetAllSnapshots() lists the snapshot directory first, then reads each file. If expiration deletes the snapshot file (along with its manifest-list) between these two steps, the read silently fails and returns an empty list — all old data files then get treated as orphans.

This PR has two changes:

  1. Replaced the list-then-read in safelyGetAllSnapshots() with latestSnapshot() + ID range traversal. Expiration never deletes the latest snapshot, so the snapshot file read is safe. But the manifest-list of that snapshot can still be deleted before we read it.

  2. Added collectLatestSnapshotFiles() — re-reads the latest snapshot right before deletion and removes its referenced files from candidates. By then expiration is usually done, so the current latest snapshot's manifest-list is intact. This is applied to Local, Flink, and Spark subclasses.

This doesn't 100% eliminate the race — if expiration happens to delete the latest snapshot's manifest-list during the pre-deletion check, it would still fail. But that window is just milliseconds, so practically very unlikely.

To fully solve this, I think we'd need either some form of locking between orphan clean and expiration, or delay manifest-list deletion (similar to how orphan clean already requires data files to be older than olderThanMillis). Happy to discuss further.

… snapshot expiration

Use latestSnapshot() + ID range to read all snapshots instead of
safelyGetAllSnapshots() which lists then reads, leaving a race window.

This closes apache#7710.
@heye1005 heye1005 force-pushed the fix-orphan-clean-concurrent-expiration branch from 5d525d4 to 1d66c8e Compare May 1, 2026 10:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] remove_orphan_files may accidentally delete data in multitasking concurrent scenarios

2 participants