Populate /role_scripts/standby for remote replica coordinator#62

Open
souravbiswassanto wants to merge 4 commits into master from remote-replica-rewind

@souravbiswassanto

Member

Summary

  • In init_scripts/run.sh, the REMOTE_REPLICA=true branch now also copies standby scripts into /role_scripts/standby/ in addition to /run_scripts/role/

Problem

The pg-coordinator's AddRoleBasedScripts() restores postgres startup scripts after a crash by copying from /role_scripts/<raftRole>/ (e.g. /role_scripts/standby/). For HA clusters the init container populates this directory, but for remote replicas it only populated /run_scripts/role/ and left /role_scripts/ empty. This caused AddRoleBasedScripts() to silently fail, leaving postgres unable to restart after recovery.
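A minimal sketch of the idea, not the actual contents of init_scripts/run.sh: the helper name `populate_standby_scripts` and its three-argument layout are hypothetical, and the real script may organize the copy differently. It only illustrates that a remote replica needs the standby scripts in both destinations.

```shell
# Hypothetical helper (not taken from init_scripts/run.sh).
# Copies the standby role scripts into both directories so that the
# coordinator can later restore them from /role_scripts/<raftRole>/.
populate_standby_scripts() {
    src_dir="$1"           # where the standby scripts live (assumed layout)
    run_role_dir="$2"      # e.g. /run_scripts/role
    role_scripts_dir="$3"  # e.g. /role_scripts/standby

    # Create both targets; fail early if either cannot be created.
    mkdir -p "$run_role_dir" "$role_scripts_dir" || return 1

    # "src/." copies the directory contents rather than the directory itself.
    cp -r "$src_dir"/. "$run_role_dir"/ || return 1
    cp -r "$src_dir"/. "$role_scripts_dir"/ || return 1
}
```

In the REMOTE_REPLICA=true branch this would be called with /run_scripts/role and /role_scripts/standby as the two targets, so that a later restore from /role_scripts/standby/ finds the files it expects.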

Test plan

  • Deploy a remote replica with the new pg-coordinator (remote replica mode)
  • Trigger a recovery scenario (timeline divergence)
  • Verify the coordinator can successfully call AddRoleBasedScripts() and postgres restarts with the correct standby scripts

Signed-off-by: souravbiswassanto <saurov@appscode.com>
…imeline

When a former standby is started via the primary role script (most importantly
the remote-replica -> standalone-HA promotion), standby.signal is present and
pg_ctl start brings postgres up in recovery. Previously start.sh removed
standby.signal and then ran CREATE DATABASE / ALTER USER writes before the
trailing pg_ctl promote; the writes fail under read-only recovery, so on the
loop's retry postgres started directly as a primary on the EXISTING timeline and
the trailing promote was a no-op. The new HA primary thus stayed on the same
timeline as its old source cluster, which on failback forces a full
pg_basebackup instead of pg_rewind (a whole-day op at multi-TB scale).

Fix: as soon as postgres has started, if standby.signal is present, run
pg_ctl promote (which ends recovery, increments the timeline, and clears
standby.signal) and wait until pg_is_in_recovery() is false, BEFORE any write.
The writes then run against the promoted primary on the new timeline.
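The ordering described above could be sketched roughly as follows. This is an illustration under assumptions, not the code in start.sh: the function name `promote_if_standby` and the `PG_CTL` override variable are hypothetical, while `pg_ctl promote` with `-D` and `-t` is the standard PostgreSQL command.

```shell
# Hypothetical sketch of "promote before any write".
# PG_CTL is an assumed override so the function can be exercised without
# a running server; it defaults to the real pg_ctl binary.
promote_if_standby() {
    pgdata="$1"

    # standby.signal present means postgres came up in recovery, so any
    # CREATE DATABASE / ALTER USER issued now would fail as read-only.
    if [ -f "$pgdata/standby.signal" ]; then
        # promote ends recovery and switches to a new timeline;
        # -t bounds how long pg_ctl waits for the promotion to finish.
        "${PG_CTL:-pg_ctl}" -D "$pgdata" -t 120 promote || return 1
    fi

    # No standby.signal: a normal primary start, nothing to do.
    return 0
}
```

The bootstrap writes would run only after this function returns successfully and the server reports that it has left recovery.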

Scope: only affects a node started via the primary script with standby.signal
present (the promotion case). A normal primary start (no standby.signal) and the
live-standby fast-failover path (promoted by the coordinator via gRPC, not
start.sh) are unchanged.

Signed-off-by: souravbiswassanto <saurov@appscode.com>
…romotion

The post-promote wait must not depend on connecting as the postgres superuser:
if that role is missing/renamed the psql probe fails and the loop burns its
full 120s timeout. Poll pg_controldata's cluster state (in archive recovery ->
in production) instead, which needs no DB connection or role.
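A rough sketch of such a poll, assuming the "Database cluster state:" line that pg_controldata prints. The function names, the `PG_CONTROLDATA` and `POLL_INTERVAL` override variables, and the attempt-count parameter are hypothetical and not taken from the actual script.

```shell
# Hypothetical helpers; PG_CONTROLDATA and POLL_INTERVAL are assumed
# overrides used for illustration and testing.

# Print the cluster state from the control file, e.g. "in production".
cluster_state() {
    # LC_ALL=C keeps the label in English so the sed pattern matches.
    LC_ALL=C "${PG_CONTROLDATA:-pg_controldata}" -D "$1" 2>/dev/null |
        sed -n 's/^Database cluster state:[[:space:]]*//p'
}

# Poll until the state is "in production" or the attempts run out.
# Needs no database connection and no role, only read access to PGDATA.
wait_until_in_production() {
    pgdata="$1"
    max_attempts="${2:-120}"
    attempt=0
    while [ "$attempt" -lt "$max_attempts" ]; do
        if [ "$(cluster_state "$pgdata")" = "in production" ]; then
            return 0
        fi
        sleep "${POLL_INTERVAL:-1}"
        attempt=$((attempt + 1))
    done
    return 1  # still in recovery (or state unreadable) after all attempts
}
```

With the defaults this gives roughly the same 120s budget as before, but a missing or renamed superuser role no longer causes the wait to time out.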

Signed-off-by: souravbiswassanto <saurov@appscode.com>
Apply the same promote-before-writes fix already in PG17 to the other supported
major versions (13, 14, 15, 16, 18) so the remote-replica -> standalone-HA timeline
bump works regardless of PostgreSQL version. When standby.signal is present at
startup (a former standby being promoted), pg_ctl promote runs before the
CREATE DATABASE / ALTER USER writes, forking a new timeline so failback uses
pg_rewind instead of a full pg_basebackup. Versions < 13 are out of scope.

Signed-off-by: souravbiswassanto <saurov@appscode.com>