Cpjump1 example training data access refactor #29
…ed example dataset acquisition + utilities for converting manifest to formats required by virtual stain flow datasets
| def download_wide_manifest_channels( | ||
| wide_manifest, | ||
| dest_dir, | ||
| channel_columns=None, | ||
| overwrite=False, | ||
| ): | ||
| """ | ||
| Download S3 TIFFs for each channel and write a local file_index.csv with paths. | ||
| """ |
I think you can just pip install this repo from PyPI to accomplish this instead:
https://github.com/WayScience/jump_image_data_downloader
If you decide to use this repo, then I think it would be useful to pin the version. Also, it uses download parallelization.
Alternatively, Dave recently found a repo that downloads the JUMP data and includes more datasets:
https://github.com/broadinstitute/monorepo/tree/main/libs/jump_portrait
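If the helper stays in this repo instead, here is a minimal sketch of what download parallelization could look like. It is not the API of either linked repo: `download_all`, the injected `fetch` callable, and the default worker count are hypothetical names chosen for illustration, and the actual S3 transfer is left to whatever client the PR already uses.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Iterable


def download_all(
    uris: Iterable[str],
    dest_dir: Path,
    fetch: Callable[[str, Path], None],  # hypothetical: copies one URI to a local path
    max_workers: int = 8,                # assumption: tune to bandwidth and rate limits
    overwrite: bool = False,
) -> list[Path]:
    """Fetch every URI into dest_dir using a thread pool; return local paths in input order."""
    dest_dir.mkdir(parents=True, exist_ok=True)

    def _one(uri: str) -> Path:
        # Assumes file names are unique across URIs; otherwise include more of the key.
        target = dest_dir / uri.rsplit("/", 1)[-1]
        if overwrite or not target.exists():
            fetch(uri, target)
        return target

    # Threads suit this workload because downloads are I/O-bound.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_one, uris))
```

Passing `fetch` in as an argument keeps the sketch independent of any particular S3 client and makes it easy to test with a stub.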
| # Split plates into train (75%) and test (25%) with seed | ||
| train_plates, test_plates = train_test_split( | ||
| unique_plates, | ||
| test_size=0.25, | ||
| random_state=42 | ||
| ) |
I think this works, but you could also use hash splitting, which would ensure samples remain in their respective splits even if data is added or removed (such as with QC). Here is an example if interested:
https://github.com/WayScience/nuclear_speckles_analysis/blob/main/splitters/HashSplitter.py
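To illustrate the idea, here is a minimal sketch of hash splitting at the plate level. It is not the linked `HashSplitter` implementation: `hash_split`, the `salt` value, and the plate IDs are hypothetical and only show the principle that each plate's split depends on its own ID and nothing else.

```python
import hashlib


def hash_split(plate_id: str, test_fraction: float = 0.25, salt: str = "cpjump1") -> str:
    """Assign a plate to 'train' or 'test' from a stable hash of its ID."""
    digest = hashlib.sha256(f"{salt}:{plate_id}".encode("utf-8")).hexdigest()
    # Map the first 8 hex digits to a number in [0, 1).
    bucket = int(digest[:8], 16) / 16**8
    return "test" if bucket < test_fraction else "train"


# Hypothetical plate IDs, for illustration only.
plates = ["plate_A", "plate_B", "plate_C", "plate_D"]
assignments = {plate: hash_split(plate) for plate in plates}
```

Because the assignment ignores the rest of the dataset, dropping a plate during QC or adding a new one does not move any other plate between splits. The trade-off is that the 75/25 ratio is only approximate, especially with a small number of plates.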
| def main(argv: Optional[list[str]] = None) -> int: | ||
| """ | ||
| Command-line interface to building and ouputting the CPJUMP1 manifest. |
| Command-line interface to building and ouputting the CPJUMP1 manifest. | |
| Command-line interface to building and outputting the CPJUMP1 manifest. |
| Command-line interface to building and ouputting the CPJUMP1 manifest. | ||
| By default, it prints a summary and preview of the manifest. | ||
| Use --output or --stdout to write the full manifest to a file or stdout. | ||
| May or may not be useful. |
I think I would remove this last line. Alternatively, you could explain the use case and let the user decide whether it is useful to them.
| negcon_u2os_24_manifest.head() | ||
| # ## Arrange as wide to be in anticipated format dor virtual stain flow datasets and also the format the download helper expects |
| # ## Arrange as wide to be in anticipated format dor virtual stain flow datasets and also the format the download helper expects | |
| # ## Arrange as wide, which is the format anticipated by virtual stain flow datasets and expected by the download helper |
Not sure if this is the intended message or not.
| print(f"Train samples: {len(train_manifest_wide)}, Test samples: {len(test_manifest_wide)}") | ||
| # ## Write final splitted download manifest with metadata and download all needed data |
Consider making this more concise
Changes the example CPJUMP1 data access to use the existing manifest and metadata files from WayScience/JUMP-single-cell.
Addresses issue #26
Note that this PR only adds an additional 0.*.ipynb for example data download and does not yet replace the old data access notebook and subsequent training, which I decided to save for a separate PR to keep the size in check.