Skip to content

Cpjump1 example training data access refactor#29

Open
wli51 wants to merge 3 commits into
WayScience:mainfrom
wli51:cpjump1-data-access-refactor
Open

Cpjump1 example training data access refactor#29
wli51 wants to merge 3 commits into
WayScience:mainfrom
wli51:cpjump1-data-access-refactor

Conversation

@wli51
Copy link
Copy Markdown
Collaborator

@wli51 wli51 commented May 27, 2026

Change way of example CPJUMP1 data access uses existing manifest and metadata files from WayScience/JUMP-single-cell.

Addresses issue #26

Note that this PR only adds an additional 0.*.ipynb for example data download and does not yet replace the old data access notebook and subsequent training, which I decided to save for a separate PR to keep the size in check.

@review-notebook-app
Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@wli51 wli51 requested a review from MattsonCam May 27, 2026 21:52
Copy link
Copy Markdown
Member

@MattsonCam MattsonCam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM @wli51 , good job!

Comment on lines +39 to +47
def download_wide_manifest_channels(
wide_manifest,
dest_dir,
channel_columns=None,
overwrite=False,
):
"""
Download S3 TIFFs for each channel and write a local file_index.csv with paths.
"""
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just pip install one of this repo from pypi to accomplish this instead:
https://github.com/WayScience/jump_image_data_downloader

If you decide to use this repo, then I think it would be useful to pin the version. Also, it uses download parallelization.

Alternatively, Dave recently found a repo that downloads the JUMP data and includes more datasets:
https://github.com/broadinstitute/monorepo/tree/main/libs/jump_portrait

Comment on lines +127 to +132
# Split plates into train (75%) and test (25%) with seed
train_plates, test_plates = train_test_split(
unique_plates,
test_size=0.25,
random_state=42
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this works, but you could also use hash splitting, which would ensure samples remain in their respective splits even if data is added or removed (such as with QC). Here is an example if interested:
https://github.com/WayScience/nuclear_speckles_analysis/blob/main/splitters/HashSplitter.py


def main(argv: Optional[list[str]] = None) -> int:
"""
Command-line interface to building and ouputting the CPJUMP1 manifest.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Command-line interface to building and ouputting the CPJUMP1 manifest.
Command-line interface to building and outputting the CPJUMP1 manifest.

Command-line interface to building and ouputting the CPJUMP1 manifest.
By default, it prints a summary and preview of the manifest.
Use --output or --stdout to write the full manifest to a file or stdout.
May or may not be useful.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would remove this last line. Alternatively, you could explain the use case and let the user decide if it is useful to them or not

negcon_u2os_24_manifest.head()


# ## Arrange as wide to be in anticipated format dor virtual stain flow datasets and also the format the download helper expects
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# ## Arrange as wide to be in anticipated format dor virtual stain flow datasets and also the format the download helper expects
# ## Arrange as wide is the anticipated format in virtual stain flow datasets and also the format the download helper expects this format

Not sure if this is the intended message or not

print(f"Train samples: {len(train_manifest_wide)}, Test samples: {len(test_manifest_wide)}")


# ## Write final splitted download manifest with metadata and download all needed data
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider making this more concise

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants