ci: add retry/backoff to GHCR docker pull in integration-test workflow (PLT-753) #3622

Queued
amir-deris wants to merge 2 commits into main from amir/plt-753-add-retry-for-ci-ghcr-auth

@amir-deris amir-deris commented Jun 22, 2026

Problem

The Integration Test matrix jobs intermittently fail at the "Load prebuilt seid and pull Docker images" step with transient GHCR errors, before any test runs:

Get "https://ghcr.io/token?...&scope=...rpcnode:pull...":
  context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Head "https://ghcr.io/v2/.../localnode/manifests/...":
  net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Root cause

PR #3582 switched image distribution from a 1 GB artifact download to a GHCR docker pull, and that pull step ran a bare docker pull with no retry wrapper. The Docker client automatically retries only layer blob downloads; it does NOT retry the initial auth-token fetch or the manifest HEAD request, which is exactly where the failures occur. When ~40 matrix jobs start simultaneously and all hit ghcr.io/token, a briefly slow auth response times out, docker pull exits with status 1, and with no retry loop the whole step (and job) fails.

Fix

Wrap the pulls in a retry-with-backoff loop so the token/manifest request is also retried (5 attempts, linear backoff 5/10/15/20s):

pull_with_retry() {
  local ref="$1" attempt
  for attempt in 1 2 3 4 5; do
    if docker pull "$ref"; then return 0; fi
    # Back off only when another attempt remains (5/10/15/20s).
    if [ "$attempt" -lt 5 ]; then
      echo "docker pull $ref failed (attempt $attempt), retrying in $((attempt*5))s..."
      sleep $((attempt*5))
    fi
  done
  echo "docker pull $ref failed after 5 attempts"; return 1
}
# Pull both images under the tag of the current workflow run.
pull_with_retry "${GHCR_LOCALNODE}:${{ github.run_id }}"
pull_with_retry "${GHCR_RPCNODE}:${{ github.run_id }}"

Tagging logic is unchanged.
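
For reviewers who want to exercise the retry behavior without Docker, here is a minimal sketch of a generic variant. It is not the code in this PR: the name retry_with_backoff, its argument order, and the RETRY_SLEEP override are all hypothetical, and the random jitter is an added technique that the PR does not use. The jitter is there because many jobs failing at the same moment would otherwise retry at the same moment too. The sketch assumes the base delay is at least 1 second.

```shell
# Illustrative only; not the helper added by this PR.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
# RETRY_SLEEP is a hypothetical hook: set it to "true" to skip real sleeping in tests.
retry_with_backoff() {
  local max_attempts="$1" base_delay="$2"
  shift 2
  local attempt=1 delay
  while [ "$attempt" -le "$max_attempts" ]; do
    # Run the wrapped command; success ends the loop immediately.
    if "$@"; then
      return 0
    fi
    # Sleep only when another attempt remains.
    if [ "$attempt" -lt "$max_attempts" ]; then
      # Linear backoff plus up to base_delay-1 seconds of jitter.
      # In shells without $RANDOM the jitter falls back to 0.
      delay=$(( attempt * base_delay + ${RANDOM:-0} % base_delay ))
      echo "attempt $attempt/$max_attempts failed: $*; retrying in ${delay}s" >&2
      "${RETRY_SLEEP:-sleep}" "$delay"
    fi
    attempt=$((attempt + 1))
  done
  echo "giving up after $max_attempts attempts: $*" >&2
  return 1
}
```

Under these assumptions, a call such as retry_with_backoff 5 5 docker pull "$ref" would behave like pull_with_retry, with slightly randomized delays.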

@amir-deris amir-deris self-assigned this Jun 22, 2026
@amir-deris amir-deris changed the title from "Added retry for ghcr pull" to "ci: add retry/backoff to GHCR docker pull in integration-test workflow (PLT-753)" Jun 22, 2026
cursor Bot commented Jun 22, 2026

PR Summary

Low Risk
CI-only workflow change with no application or runtime behavior impact; worst case is slightly longer job time on repeated pull failures.

Overview
Integration matrix jobs now pull localnode and rpcnode images from GHCR through a pull_with_retry helper instead of a single bare docker pull, so transient auth/manifest timeouts are retried (5 attempts with linear 5s backoff steps) before the step fails.

Image tagging to sei-chain/localnode and sei-chain/rpcnode is unchanged; only the pull step is hardened against flaky GHCR responses when many jobs start at once.
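
To put a rough number on the "slightly longer job time" mentioned above, the following sketch adds up the backoff sleeps. It assumes the 5/10/15/20s schedule stated in the PR description (no sleep after the final attempt). It counts only the sleeps, not the time each failed docker pull spends waiting on its own timeout. The variable names are illustrative.

```shell
# Sum the sleeps between attempts for a linear backoff schedule.
attempts=5   # total pull attempts per image
step=5       # backoff step in seconds
total=0
attempt=1
# There is one sleep after each failed attempt except the last.
while [ "$attempt" -lt "$attempts" ]; do
  total=$((total + attempt * step))
  attempt=$((attempt + 1))
done
echo "worst-case backoff sleep per image: ${total}s"
echo "worst-case backoff sleep for both images: $((total * 2))s"
```

With these values the sleeps add up to 50s per image, so at most 100s across the localnode and rpcnode pulls.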

Reviewed by Cursor Bugbot for commit 5a8fe49.

github-actions Bot commented Jun 22, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

Build | Format | Lint | Breaking | Updated (UTC)
✅ passed | ✅ passed | ✅ passed | ✅ passed | Jun 23, 2026, 11:04 AM

@amir-deris amir-deris requested review from bdchatham and masih June 22, 2026 21:38
codecov Bot commented Jun 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.12%. Comparing base (b8776ed) to head (5a8fe49).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3622      +/-   ##
==========================================
- Coverage   58.65%   58.12%   -0.54%     
==========================================
  Files        2225     2150      -75     
  Lines      183467   174156    -9311     
==========================================
- Hits       107606   101221    -6385     
+ Misses      66144    63945    -2199     
+ Partials     9717     8990     -727     
Flag | Coverage Δ
sei-db | 70.41% <ø> (ø)
sei-db-state-db | ?

Flags with carried forward coverage won't be shown.
114 files have indirect coverage changes.

@masih masih enabled auto-merge June 23, 2026 11:03
@masih masih added this pull request to the merge queue Jun 23, 2026
Any commits made after this event will not be merged.
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 23, 2026
@masih masih added this pull request to the merge queue Jun 23, 2026
Any commits made after this event will not be merged.