Skip to content

Run the BigQuery status MERGE once per download run instead of once per chunk#81

Open
mihow wants to merge 1 commit into
feature/bq-squashfs-pipelinefrom
fix/bq-merge-once-per-run
Open

Run the BigQuery status MERGE once per download run instead of once per chunk#81
mihow wants to merge 1 commit into
feature/bq-squashfs-pipelinefrom
fix/bq-merge-once-per-run

Conversation

@mihow

@mihow mihow commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

Summary

The new SquashFS download pipeline works well — it downloads images in parallel, records every result in the downloads table, and packs them into chunks. While keeping an eye on our BigQuery usage this week I noticed the cost climbing, and traced it back to a single spot in download_images.py: the step that writes download status back into training_images runs once after every chunk. Because of the way BigQuery bills a MERGE, that step is much more expensive than it looks, and it scales with the number of chunks.

This PR moves that one step so it runs a single time at the end of a run instead of once per chunk. The live status and the download queue both keep working unchanged (see "What stays live" below) — the only thing that really changes is the BigQuery bill, and one property of the training_images table that I want to call out explicitly rather than gloss over.

This isn't an obvious mistake — it's a genuine BigQuery gotcha that catches a lot of people, so I wanted to write up the why in detail below (and I added a comment block at the top of the file so the next person, human or AI, has it on hand).

The BigQuery gotcha (the "why")

A MERGE (and UPDATE/DELETE) is billed mainly for scanning the target table — the table you're writing into — not for the size of the source rows you match against. So the cost of one MERGE into training_images is roughly "scan all of training_images," whether the chunk you're merging has 10,000 rows or 3.

training_images in global_all_leps_2605 is about 24M rows / ~7 GB, with no partitioning or clustering, so each MERGE scans the full ~7 GB. The per-chunk loop runs one MERGE per chunk:

  • At the default --chunk-size 10000, a full pass is ~2,400 chunks
  • 2,400 MERGEs × ~7 GB ≈ ~17 TB scanned for a single download pass
  • On-demand BigQuery is $5/TB, so that's ~$85 per pass (rough estimate from reading the code and the table size, not a precise bill), and it climbs every time the pipeline runs.

What stays live, and what becomes end-of-run

This is the important part, because the pipeline deliberately uses BigQuery as both a status table and a work queue, and I don't want to quietly change that. The key is that those two roles live on training_images_downloads, not on training_images:

  • The download queue ("what to fetch next") is unchanged. get_pending_rows() builds the worklist by LEFT JOINing training_images against training_images_downloads and taking the rows where d.dataset_source_uuid IS NULL (not yet attempted). That "already attempted" filter comes from the downloads table, which is appended per chunk via the free batch-load tier — untouched by this PR. Resume and dedup work exactly as before.
  • Live progress is unchanged — it just lives on the downloads table. Because every chunk appends its results to training_images_downloads immediately (for free), you can still watch a run progress in real time:
    SELECT fetch_status, COUNT(*)
    FROM `leps-ai.global_all_leps_2605.training_images_downloads`
    GROUP BY 1;
    That table's own docstring already states its purpose: "track fetch progress without DML updates on the base training_images table." That's exactly the role it keeps playing.
  • What does change: training_images.fetch_status is no longer updated mid-run. It becomes a denormalized end-of-run cache — pending/NULL while a run is in flight, then updated in one shot when the run finishes. So if anything reads training_images.fetch_status and needs it current during a run (a dashboard, another job), that specific property is now eventually-consistent. The live answer is on training_images_downloads.

The per-chunk MERGE was the only thing keeping training_images.fetch_status live mid-run, and nothing in the pipeline itself depends on that — it reads the queue from the downloads table.

What this PR does

Moves the MERGE out of the per-chunk loop and runs it once at the end of the run, sourced from the already-populated downloads table (deduplicated per dataset_source_uuid). That's one ~7 GB target scan per run instead of ~2,400 — the same end state in training_images, at a tiny fraction of the cost.

List of Changes

# Change (what it does) How (implementation)
1 Status still lands in training_images, but the bill for it drops from thousands of full-table scans to one per run Removed the per-chunk merge_chunk_into_training_images() call from inside the loop; added a single merge_downloads_into_training_images() call after the loop
2 The end-of-run MERGE reads from the downloads table instead of a fresh per-chunk temp table New function sources from training_images_downloads, deduplicated with ARRAY_AGG(... LIMIT 1) / GROUP BY dataset_source_uuid (the table is append-only, and --force-redownload can add a second row), preferring a successful outcome over a failed one
3 Future readers (human and AI) see the cost model before adding BigQuery calls here Added a "COST NOTE" comment block at the top of download_images.py with table sizes, the target-scan billing rule, and a pointer to INFORMATION_SCHEMA.JOBS_BY_PROJECT for checking real spend
4 Tests reflect the new contract Rewrote the merge tests: one MERGE query, no temp-table load/delete, sourced from the downloads table, deduped, retries on serialization conflicts. 47/47 pass.

Alternative approaches (open to your call)

I went with the smallest change that fits the current design, but there are a few directions worth weighing — happy to switch if you prefer one:

  • A — End-of-run MERGE per task (this PR). Each array task runs one MERGE when it finishes. Simple, no orchestration changes. With --num-jobs 10 that's 10 full scans per pass ($0.35) instead of ~2,400. Good default.
  • B — One MERGE after the whole array finishes. Move the MERGE out of download_images.py entirely and run it as a single step in the array's "after all tasks" stage (next to job_bq_pack_per_task.sh). That's exactly one 7 GB scan per pass ($0.035). Cheapest, but needs a small change to the job orchestration. If we're touching the job scripts anyway, this is the one I'd lean toward long-term.
  • C — Don't MERGE at all; derive status from the downloads table. Since the downloads table is already the source of truth for progress and the queue, training_images.fetch_status could be read through a view that joins the two, and we'd never write status back. Biggest change, but removes the DML entirely. Probably overkill for now, but worth knowing it's an option.
  • Orthogonal — cluster training_images on dataset_source_uuid. If we ever do want training_images.fetch_status live mid-run, clustering the target on the join key lets BigQuery prune blocks instead of full-scanning, which makes a per-chunk MERGE much cheaper (not free, but pruned). The right move only if a future consumer needs live status on training_images itself.

What I'd still like to verify

  • This is based on reading the code and the table size plus the INFORMATION_SCHEMA.JOBS_BY_PROJECT history — not a controlled before/after experiment. A good confirmation would be to run one small download pass on a --table-prefix test_ table and compare total_bytes_billed before/after this change.
  • The dedup in the end-of-run MERGE assumes "prefer downloaded > corrupted > failed" is the right tiebreak when a uuid has multiple attempts. That matches the resume logic, but worth a sanity check if --force-redownload is used heavily.

Tests

47/47 in tests/dataset_tools/test_download_images.py pass. The merge tests now pin the new contract (single MERGE, no temp tables, sourced from the downloads table, deduped, retry-on-conflict).

A note on lint: I matched the existing file's formatting style rather than running black/isort across the whole file, to keep this diff focused on the actual change. The pre-existing flake8 findings in this file are unchanged by this PR.

… chunk

The download pipeline MERGEd each chunk's results into training_images
immediately after downloading it. A BigQuery MERGE is billed mainly for
scanning its target table, and training_images in global_all_leps_2605 is
~24M rows / ~7 GB with no partitioning or clustering, so every chunk
re-scanned the whole table. At the default chunk size of 10,000 that is
~2,400 full-table scans (~17 TB, ~$85 at $5/TB) for a single download pass.

The downloads table already records every result via free batch loads, and
resume already reads from it (get_pending_rows LEFT JOINs against it), so the
per-chunk MERGE was only keeping training_images.fetch_status cosmetically
current mid-run. This moves the MERGE to a single call at the end of the run,
sourced from the (deduplicated) downloads table — one target scan instead of
thousands.

Also adds a cost note at the top of the file for future readers (human and AI)
and rewrites the merge tests for the new contract.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 6, 2026

Copy link
Copy Markdown

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1dce4dce-011d-4781-b38b-c14cebc9cb22

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/bq-merge-once-per-run

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@mihow

mihow commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

Claude says: One design point worth surfacing explicitly, since this pipeline intentionally uses BigQuery as both a status table and a work queue, and I don't want to imply this PR leaves all of that untouched.

Both of those roles actually live on training_images_downloads, not on training_images — and that table is unchanged by this PR:

  • The queue ("what to fetch next") is unchanged. get_pending_rows() builds the worklist by LEFT JOINing training_images against training_images_downloads and keeping rows where d.dataset_source_uuid IS NULL. That "already attempted" filter comes from the downloads table, which is appended per chunk via the free batch-load tier. Resume and dedup behave exactly as before.
  • Live progress is unchanged — it just lives on the downloads table. Every chunk appends its results to training_images_downloads immediately, so you can still watch a run in real time:
    SELECT fetch_status, COUNT(*)
    FROM `leps-ai.global_all_leps_2605.training_images_downloads`
    GROUP BY 1;
    That matches the table's stated purpose in its own docstring: "track fetch progress without DML updates on the base training_images table."

The one thing that genuinely changes: training_images.fetch_status is no longer updated mid-run. It becomes a denormalized end-of-run cache — pending/NULL while a run is in flight, then updated in one shot at the end. So if anything reads training_images.fetch_status and needs it current during a run, that property is now eventually-consistent, and the live answer is on training_images_downloads instead.

If we later decide we do want training_images.fetch_status itself live mid-run, the cheap way to get it back is to cluster training_images on dataset_source_uuid so the MERGE prunes instead of full-scanning — noted as the "orthogonal" option in the PR description. I've also expanded the PR body with a "What stays live, and what becomes end-of-run" section covering all of this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant