Populate /role_scripts/standby for remote replica coordinator by souravbiswassanto · Pull Request #62 · kubedb/postgres-init-docker

souravbiswassanto · 2026-06-23T10:27:42Z

Summary

In init_scripts/run.sh, the REMOTE_REPLICA=true branch now also copies the standby scripts into /role_scripts/standby/, in addition to /run_scripts/role/.

Problem

The pg-coordinator's AddRoleBasedScripts() restores postgres startup scripts after a crash by copying from /role_scripts/<raftRole>/ (e.g. /role_scripts/standby/). For HA clusters the init container populates this directory, but for remote replicas it only populated /run_scripts/role/ and left /role_scripts/ empty. This caused AddRoleBasedScripts() to silently fail, leaving postgres unable to restart after recovery.
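As a rough illustration of the change, the copy step could look like the sketch below. This is not the code from run.sh: the function name populate_standby_scripts and its parameters are hypothetical, and the location of the standby scripts inside the init image is an assumption passed in by the caller.

```shell
#!/bin/sh
# Hypothetical helper (not the actual run.sh code): copy the standby scripts
# into both the directory postgres starts from and the directory the
# coordinator restores from after a crash.
populate_standby_scripts() {
    src="$1"       # directory holding the standby scripts (assumed layout)
    run_dir="$2"   # e.g. /run_scripts/role
    role_dir="$3"  # e.g. /role_scripts/standby

    mkdir -p "$run_dir" "$role_dir" || return 1
    # "src/." copies the directory contents, including dotfiles.
    cp -r "$src"/. "$run_dir"/ || return 1
    cp -r "$src"/. "$role_dir"/ || return 1
}
```

In the REMOTE_REPLICA=true branch this would be called with /run_scripts/role and /role_scripts/standby as the two destinations, so that a later restore from /role_scripts/standby/ finds the same files.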

Test plan

Deploy a remote replica with the new pg-coordinator (remote replica mode)
Trigger a recovery scenario (timeline divergence)
Verify the coordinator can successfully call AddRoleBasedScripts() and postgres restarts with the correct standby scripts

Signed-off-by: souravbiswassanto <saurov@appscode.com>

…imeline

When a former standby is started via the primary role script (most importantly the remote-replica -> standalone-HA promotion), standby.signal is present and pg_ctl start brings postgres up in recovery. Previously start.sh removed standby.signal and then ran CREATE DATABASE / ALTER USER writes before the trailing pg_ctl promote; the writes fail under read-only recovery, so on the loop's retry postgres started directly as a primary on the EXISTING timeline and the trailing promote was a no-op. The new HA primary thus stayed on the same timeline as its old source cluster, which on failback forces a full pg_basebackup instead of pg_rewind (a whole-day op at multi-TB scale).

Fix: as soon as postgres has started, if standby.signal is present, run pg_ctl promote (which ends recovery, increments the timeline, and clears standby.signal) and wait until pg_is_in_recovery() is false, BEFORE any write. The writes then run against the promoted primary on the new timeline.

Scope: only affects a node started via the primary script with standby.signal present (the promotion case). A normal primary start (no standby.signal) and the live-standby fast-failover path (promoted by the coordinator via gRPC, not start.sh) are unchanged.

Signed-off-by: souravbiswassanto <saurov@appscode.com>
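The promote-before-writes ordering described in this commit might be sketched as follows. This is an illustration, not the actual start.sh: the function name promote_if_standby, the PG_CTL and PSQL override variables, and the 120-second default are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sketch: promote a former standby before any write is issued.
# PG_CTL and PSQL can be overridden; they default to the standard binaries.
promote_if_standby() {
    pgdata="$1"
    timeout="${2:-120}"   # seconds to wait for recovery to end (assumed value)

    # Normal primary start: no standby.signal, nothing to promote.
    [ -f "$pgdata/standby.signal" ] || return 0

    # Ends recovery and moves the cluster onto a new timeline.
    "${PG_CTL:-pg_ctl}" -D "$pgdata" promote || return 1

    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # psql -tAc prints a bare "t" or "f" for a boolean result.
        in_recovery=$("${PSQL:-psql}" -U postgres -tAc 'SELECT pg_is_in_recovery()' 2>/dev/null)
        if [ "$in_recovery" = "f" ]; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

A caller would run promote_if_standby "$PGDATA" right after pg_ctl start succeeds and only then issue the CREATE DATABASE / ALTER USER statements, so the writes never hit a read-only server.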

…romotion

The post-promote wait must not depend on connecting as the postgres superuser: if that role is missing/renamed the psql probe fails and the loop burns its full 120s timeout. Poll pg_controldata's cluster state (in archive recovery -> in production) instead, which needs no DB connection or role.

Signed-off-by: souravbiswassanto <saurov@appscode.com>
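A minimal sketch of a connection-free wait, assuming the standard pg_controldata output line "Database cluster state:", is shown below. The function name wait_until_in_production and the PG_CONTROLDATA override variable are hypothetical and not taken from the actual script.

```shell
#!/bin/sh
# Hypothetical sketch: wait for promotion to finish by reading the control
# file instead of opening a database connection.
wait_until_in_production() {
    pgdata="$1"
    timeout="${2:-120}"   # seconds; mirrors the 120s limit mentioned above

    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        # LC_ALL=C keeps the pg_controldata labels in English so the match
        # does not depend on the server locale.
        state=$(LC_ALL=C "${PG_CONTROLDATA:-pg_controldata}" "$pgdata" 2>/dev/null \
            | sed -n 's/^Database cluster state:[[:space:]]*//p')
        if [ "$state" = "in production" ]; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

Because the check reads only the control file in the data directory, it keeps working when the postgres role has been renamed or dropped.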

Apply the same promote-before-writes fix already in PG17 to the other supported major versions (13, 14, 15, 16, 18) so the remote-replica -> standalone-HA timeline bump works regardless of PostgreSQL version. When standby.signal is present at startup (a former standby being promoted), pg_ctl promote runs before the CREATE DATABASE / ALTER USER writes, forking a new timeline so failback uses pg_rewind instead of a full pg_basebackup. Versions < 13 are out of scope.

Signed-off-by: souravbiswassanto <saurov@appscode.com>

souravbiswassanto added 4 commits June 23, 2026 15:36

Populate /role_scripts/standby for remote replica coordinator support

28acbca

Signed-off-by: souravbiswassanto <saurov@appscode.com>

souravbiswassanto wants to merge 4 commits into master from remote-replica-rewind
