Run the BigQuery status MERGE once per download run instead of once per chunk#81
Run the BigQuery status MERGE once per download run instead of once per chunk#81mihow wants to merge 1 commit into
Conversation
… chunk The download pipeline MERGEd each chunk's results into training_images immediately after downloading it. A BigQuery MERGE is billed mainly for scanning its target table, and training_images in global_all_leps_2605 is ~24M rows / ~7 GB with no partitioning or clustering, so every chunk re-scanned the whole table. At the default chunk size of 10,000 that is ~2,400 full-table scans (~17 TB, ~$85 at $5/TB) for a single download pass. The downloads table already records every result via free batch loads, and resume already reads from it (get_pending_rows LEFT JOINs against it), so the per-chunk MERGE was only keeping training_images.fetch_status cosmetically current mid-run. This moves the MERGE to a single call at the end of the run, sourced from the (deduplicated) downloads table — one target scan instead of thousands. Also adds a cost note at the top of the file for future readers (human and AI) and rewrites the merge tests for the new contract. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Claude says: One design point worth surfacing explicitly, since this pipeline intentionally uses BigQuery as both a status table and a work queue, and I don't want to imply this PR leaves all of that untouched. Both of those roles actually live on
The one thing that genuinely changes: If we later decide we do want |
Summary
The new SquashFS download pipeline works well — it downloads images in parallel, records every result in the downloads table, and packs them into chunks. While keeping an eye on our BigQuery usage this week I noticed the cost climbing, and traced it back to a single spot in
download_images.py: the step that writes download status back intotraining_imagesruns once after every chunk. Because of the way BigQuery bills aMERGE, that step is much more expensive than it looks, and it scales with the number of chunks.This PR moves that one step so it runs a single time at the end of a run instead of once per chunk. The live status and the download queue both keep working unchanged (see "What stays live" below) — the only thing that really changes is the BigQuery bill, and one property of the
training_imagestable that I want to call out explicitly rather than gloss over.This isn't an obvious mistake — it's a genuine BigQuery gotcha that catches a lot of people, so I wanted to write up the why in detail below (and I added a comment block at the top of the file so the next person, human or AI, has it on hand).
The BigQuery gotcha (the "why")
A
MERGE(andUPDATE/DELETE) is billed mainly for scanning the target table — the table you're writing into — not for the size of the source rows you match against. So the cost of one MERGE intotraining_imagesis roughly "scan all oftraining_images," whether the chunk you're merging has 10,000 rows or 3.training_imagesinglobal_all_leps_2605is about 24M rows / ~7 GB, with no partitioning or clustering, so each MERGE scans the full ~7 GB. The per-chunk loop runs one MERGE per chunk:--chunk-size 10000, a full pass is ~2,400 chunksWhat stays live, and what becomes end-of-run
This is the important part, because the pipeline deliberately uses BigQuery as both a status table and a work queue, and I don't want to quietly change that. The key is that those two roles live on
training_images_downloads, not ontraining_images:get_pending_rows()builds the worklist byLEFT JOINingtraining_imagesagainsttraining_images_downloadsand taking the rows whered.dataset_source_uuid IS NULL(not yet attempted). That "already attempted" filter comes from the downloads table, which is appended per chunk via the free batch-load tier — untouched by this PR. Resume and dedup work exactly as before.training_images_downloadsimmediately (for free), you can still watch a run progress in real time:training_images.fetch_statusis no longer updated mid-run. It becomes a denormalized end-of-run cache — pending/NULL while a run is in flight, then updated in one shot when the run finishes. So if anything readstraining_images.fetch_statusand needs it current during a run (a dashboard, another job), that specific property is now eventually-consistent. The live answer is ontraining_images_downloads.The per-chunk MERGE was the only thing keeping
training_images.fetch_statuslive mid-run, and nothing in the pipeline itself depends on that — it reads the queue from the downloads table.What this PR does
Moves the MERGE out of the per-chunk loop and runs it once at the end of the run, sourced from the already-populated downloads table (deduplicated per
dataset_source_uuid). That's one ~7 GB target scan per run instead of ~2,400 — the same end state intraining_images, at a tiny fraction of the cost.List of Changes
training_images, but the bill for it drops from thousands of full-table scans to one per runmerge_chunk_into_training_images()call from inside the loop; added a singlemerge_downloads_into_training_images()call after the looptraining_images_downloads, deduplicated withARRAY_AGG(... LIMIT 1)/GROUP BY dataset_source_uuid(the table is append-only, and--force-redownloadcan add a second row), preferring a successful outcome over a failed onedownload_images.pywith table sizes, the target-scan billing rule, and a pointer toINFORMATION_SCHEMA.JOBS_BY_PROJECTfor checking real spendAlternative approaches (open to your call)
I went with the smallest change that fits the current design, but there are a few directions worth weighing — happy to switch if you prefer one:
--num-jobs 10that's10 full scans per pass ($0.35) instead of ~2,400. Good default.download_images.pyentirely and run it as a single step in the array's "after all tasks" stage (next tojob_bq_pack_per_task.sh). That's exactly one7 GB scan per pass ($0.035). Cheapest, but needs a small change to the job orchestration. If we're touching the job scripts anyway, this is the one I'd lean toward long-term.training_images.fetch_statuscould be read through a view that joins the two, and we'd never write status back. Biggest change, but removes the DML entirely. Probably overkill for now, but worth knowing it's an option.training_imagesondataset_source_uuid. If we ever do wanttraining_images.fetch_statuslive mid-run, clustering the target on the join key lets BigQuery prune blocks instead of full-scanning, which makes a per-chunk MERGE much cheaper (not free, but pruned). The right move only if a future consumer needs live status ontraining_imagesitself.What I'd still like to verify
INFORMATION_SCHEMA.JOBS_BY_PROJECThistory — not a controlled before/after experiment. A good confirmation would be to run one small download pass on a--table-prefix test_table and comparetotal_bytes_billedbefore/after this change.--force-redownloadis used heavily.Tests
47/47 in
tests/dataset_tools/test_download_images.pypass. The merge tests now pin the new contract (single MERGE, no temp tables, sourced from the downloads table, deduped, retry-on-conflict).A note on lint: I matched the existing file's formatting style rather than running
black/isortacross the whole file, to keep this diff focused on the actual change. The pre-existingflake8findings in this file are unchanged by this PR.