diff --git a/AGENTS.md b/AGENTS.md
index ce008ad47..76f9c9842 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -10,6 +10,7 @@
 |--------|----------|
 | Architecture and components | [docs/md/architecture.md](docs/md/architecture.md) |
 | Database layout and migrations | [docs/md/database.md](docs/md/database.md) |
+| Major migration operations runbook | [docs/md/major-migration-runbook.md](docs/md/major-migration-runbook.md) |
 | Local dev, tests, OpenAPI | [README.md](README.md) |
 | Commits, PRs, contribution style | [CONTRIBUTING.md](CONTRIBUTING.md) |
 
@@ -30,7 +31,7 @@ Prefer these sources over guessing when behavior or schema matters.
 | Evaluation | `evaluator/`, topic names in code and `conf/` |
 | Advisory sync | `tasks/vmaas_sync/` |
 | Migrations | `database_admin/migrations/` (verify naming against existing migrations) |
-| Migration flow and session flags | `database_admin/update.go`, [docs/md/database.md#migrations](docs/md/database.md#migrations) |
+| Migration flow, session flags, ops runbook | `database_admin/update.go`, [docs/md/database.md#migrations](docs/md/database.md#migrations), [docs/md/major-migration-runbook.md](docs/md/major-migration-runbook.md) |
 | Database schema and SQL | `database_admin/schema/` |
 | Containers and local orchestration | `docker-compose.yml`, `docker-compose.test.yml`, `Dockerfile*` |
 | Scheduled jobs | `tasks/` |
@@ -89,13 +90,15 @@ Response to User
 
 ---
 
-## Database migrations: `terminate_db_sessions`
+## Database migrations (major DDL)
 
-When advising on migrations or deploy config, use [docs/md/database.md#migrations](docs/md/database.md#migrations). Summary for agents:
+When advising on migrations or deploy config, use [docs/md/database.md#migrations](docs/md/database.md#migrations) for overview and [docs/md/major-migration-runbook.md](docs/md/major-migration-runbook.md) for ops procedure, troubleshooting, and flag reference.
 
-**Default:** do **not** set `terminate_db_sessions`. It defaults to `false`; normal deploys must stay unchanged.
+**Deploy model:** One **db-migration** Job per deploy runs migrations; app pods only **check-for-db** init (poll schema). Failed migration → new pods fail init, old pods keep serving.
 
-**What it does:** After `NOLOGIN` on app DB users, database-admin optionally runs `pg_terminate_backend` on open `listener` / `evaluator` / `manager` / `vmaas_sync` sessions, then waits until `pg_stat_activity` shows none, then runs DDL. Code: `prepareForMigration()` in `database_admin/update.go`.
+**Session handling:** Before DDL, database-admin sets app users (`listener`, `evaluator`, `manager`, `vmaas_sync`) to `NOLOGIN`, optionally terminates lingering backends (`terminate_db_sessions`), polls `pg_stat_activity` until clear (`waitForSessionClosed` in `database_admin/update.go`; fails after 5 consecutive query errors — does not proceed silently), then runs DDL and restores `LOGIN`. `NOLOGIN` stops new connections but does not close existing ones.
+
+**`terminate_db_sessions`:** Default **off** (`false`); normal deploys must stay unchanged. When enabled on the **db-migration Job only** via `DATABASE_ADMIN_CONFIG=terminate_db_sessions=true`, runs `pg_terminate_backend` on open app-user sessions, then waits again until `pg_stat_activity` is clear. Remove after deploy. Do not enable on manager/listener/evaluator pods. Other flags (`schema_migration`, `force_migration_version`, etc.) are documented in the runbook.
 
 **Recommend `terminate_db_sessions=true` only when:**
 
@@ -109,6 +112,6 @@ When advising on migrations or deploy config, use [docs/md/database.md#migration
 - The user is working locally or in CI
 - There is no session-lock symptom — it forcibly drops client connections and is not a safe default
 
-**How to set (production):** `DATABASE_ADMIN_CONFIG=terminate_db_sessions=true` on the db-migration Job for that deploy only; remove afterward. Do not enable on manager/listener/evaluator pods.
+**Logging:** Key lines — `Advisory lock acquired`, `Waiting for N sessions`, `App database sessions cleared`, `Starting schema migration to version X`. Stuck at only `Getting advisory lock` → advisory lock 123 held elsewhere. Use `message:` filters in Kibana, not `kubernetes.container_name`. Full log sequence in the runbook.
 
-**Related:** Session wait logic and `pg_stat_activity` queries are in `database_admin/update.go`. Deploy layout (single migration Job, `check-for-db` init) is in `deploy/clowdapp.yaml`. Expected migration log sequence (advisory lock → sessions cleared → DDL start) is in [docs/md/database.md#migration-log-sequence](docs/md/database.md#migration-log-sequence).
+**When advising users:** Point to the runbook for before/during/after steps, Kibana queries, and Postgres diagnostics. Deploy layout (single migration Job, `check-for-db` init) is in `deploy/clowdapp.yaml`.
diff --git a/docs/md/architecture.md b/docs/md/architecture.md
index 342b919bf..4fda9f266 100644
--- a/docs/md/architecture.md
+++ b/docs/md/architecture.md
@@ -52,10 +52,11 @@ description of the component and data layout are in [separate page](database.md)
 
 - **database-admin** - Executes database initialization and migrations. It needs all rights for the database. It also
 creates database users for all components and updates passwords for them, so it reads passwords for admin and all
-components from environment variables. Before DDL it sets app users (`listener`, `evaluator`, `manager`, `vmaas_sync`)
-to `NOLOGIN`, optionally terminates lingering sessions when `terminate_db_sessions=true`, waits until no app sessions
-remain, runs migrations, then restores `LOGIN`. See [Database migrations](database.md#migrations) for when to enable
-session termination. Using container CLI it's possible to manually manage database
+components from environment variables. In production a **db-migration** Job runs migrations once per deploy; other pods
+wait in **check-for-db** init until the schema is current. Before DDL it sets app users (`listener`, `evaluator`,
+`manager`, `vmaas_sync`) to `NOLOGIN`, optionally terminates lingering sessions when `terminate_db_sessions=true`,
+waits until no app sessions remain, runs migrations, then restores `LOGIN`. See [Database migrations](database.md#migrations)
+and the [major migration runbook](major-migration-runbook.md). Using container CLI it's possible to manually manage database
 (`./scripts/psql.sh`). See [component environment variables](../../conf/database_admin.env)
 
 ### Components cooperation schema
diff --git a/docs/md/database.md b/docs/md/database.md
index 1229d5626..fbf064dc3 100644
--- a/docs/md/database.md
+++ b/docs/md/database.md
@@ -19,60 +19,25 @@ The ERD image below may lag `database_admin/schema/create_schema.sql`; for syste
 
 ## Migrations
 
-Schema changes live in `database_admin/migrations/` and are applied by the **database-admin** component (`database_admin/update.go`). In production, a single **db-migration** ClowdApp Job runs migrations; other pods wait in a `check-for-db` init container until the schema matches.
+Schema changes live in `database_admin/migrations/` and are applied by **database-admin** (`database_admin/update.go`).
 
-### Pre-migration session handling
+In production:
 
-Before running DDL, database-admin blocks app database users from new logins and waits for existing sessions to drain:
+- A single **db-migration** ClowdApp Job runs `migrate` once per deploy (`completions: 1`, `parallelism: 1`).
+- Manager, listener, evaluator, and other components use a **check-for-db** init container that polls until the schema matches (`database_admin/check-upgraded.sh`).
+
+Before DDL, database-admin blocks app database users from new logins and waits for existing sessions to drain:
 
 1. `ALTER USER … NOLOGIN` for `listener`, `evaluator`, `manager`, `vmaas_sync`
-2. Optionally (see below) `pg_terminate_backend` on remaining app sessions
+2. Optionally `pg_terminate_backend` on remaining app sessions when `terminate_db_sessions=true`
 3. Poll `pg_stat_activity` until no app-user sessions remain
 4. Run `MigrateUp`
 5. `ALTER USER … LOGIN` to restore access
 
 `NOLOGIN` stops **new** connections but does **not** close existing ones. Lingering sessions can hold locks and block DDL on large or sensitive migrations.
 
-### `terminate_db_sessions` flag
-
-| | |
-|---|---|
-| **Config key** | `terminate_db_sessions` (boolean, default `false`) |
-| **Where to set** | `DATABASE_ADMIN_CONFIG` / `POD_CONFIG` on the db-migration Job only |
-| **Example** | `terminate_db_sessions=true` |
-
-When enabled, database-admin calls `pg_terminate_backend` on all open sessions for the four app users above (excluding its own connection), then waits again until `pg_stat_activity` is clear.
-
-**Set `terminate_db_sessions=true` when:**
-
-- The migration runs heavy or long-held DDL (e.g. `ALTER TABLE` on large partitioned tables, structural changes that need exclusive locks)
-- A previous migration appeared stuck after “Blocking writing users” with app sessions still in `pg_stat_activity`
-- Operations explicitly plan a major migration deploy and want to force-close stale app connections
-
-**Leave unset (default `false`) when:**
-
-- Routine deploys and normal migrations (additive columns, new tables, typical index changes)
-- Local development, CI, and test runs
-- There is no evidence of session-related blocking — the flag forcibly disconnects clients and should not be the default
-
-Remove the flag after the major migration deploy completes; subsequent deploys should not need it.
-
-### Migration log sequence
-
-When the db-migration Job runs, expect these log lines in order (Kibana: `@log_stream: patchman-*` and `kubernetes.container_name: db-migration`):
-
-1. `Getting advisory lock`
-2. `Advisory lock acquired` — if missing, another process holds advisory lock 123
-3. `Migrating the database`
-4. `Blocking writing users during the migration`
-5. `Terminating active app database sessions` / `Terminated session pid=...` — only when `terminate_db_sessions=true`
-6. `Waiting for N sessions: ...` — repeats until sessions drain
-7. `App database sessions cleared`
-8. `Starting schema migration to version X`
-9. Silence during DDL (normal)
-10. `Reverting components privileges`
-11. `Releasing advisory lock`
-
-### Other `DATABASE_ADMIN_CONFIG` options
-
-See `deploy/clowdapp.yaml` parameters and `database_admin/config.go`: `schema_migration`, `force_migration_version`, `reset_schema`, `update_users`, `unlock_users`, `update_db_config`.
+| Topic | Document |
+|-------|----------|
+| Major DDL deploy procedure, troubleshooting, SQL diagnostics | [major-migration-runbook.md](major-migration-runbook.md) |
+| `DATABASE_ADMIN_CONFIG` flags (including `terminate_db_sessions`) and log sequence | [major-migration-runbook.md](major-migration-runbook.md) |
+| ClowdApp parameters | `deploy/clowdapp.yaml`, `database_admin/config.go` |
diff --git a/docs/md/major-migration-runbook.md b/docs/md/major-migration-runbook.md
new file mode 100644
index 000000000..20744581c
--- /dev/null
+++ b/docs/md/major-migration-runbook.md
@@ -0,0 +1,299 @@
+# Major database migration runbook
+
+Operational guide for deploying schema migrations that run heavy DDL (for example large `ALTER TABLE` on partitioned tables).
+
+See also [database.md — Migrations](database.md#migrations) for config reference.
+
+---
+
+## How deploy works
+
+```
+New deploy triggered
+        ↓
+db-migration Job starts (completions: 1, parallelism: 1)
+        ‖ (in parallel)
+New app pods start → check-for-db init polls schema every 5s (up to ~5 min)
+        ↓
+Job: advisory lock → block users → [terminate sessions] → MigrateUp (DDL)
+        ↓
+Job succeeds → check-for-db init passes → rollout continues
+Job fails → new pods fail init → old pods keep serving
+```
+
+- **One migrator per deploy** — only the Job runs `migrate`; app pods only poll the schema in their **check-for-db** init container.
+- **Job limits** — `MIGRATION_TIMEOUT` (default 7200s / 2h), `MIGRATION_MAX_RETRIES=3` with 5s sleep between attempts (`database_admin/entrypoint.sh`).
+- **Advisory lock** — `pg_advisory_lock(123)` ensures a single migration process even if something else triggers database-admin.
+
+---
+
+## When to use this runbook
+
+Use for migrations that need exclusive locks or long DDL runtime. Routine migrations (new tables, additive columns, typical indexes) follow the normal deploy; do **not** set `terminate_db_sessions` by default.
+
+---
+
+## Before deploy
+
+1. **Review the migration** — identify tables that need `ACCESS EXCLUSIVE` locks and expected runtime.
+2. **Set target schema** (if not migrating to latest):
+   ```
+   DATABASE_ADMIN_CONFIG=schema_migration=161
+   ```
+   on the **db-migration Job** only (via app-interface / ClowdApp `DATABASE_ADMIN_CONFIG`).
+3. **Major DDL only** — enable session termination:
+   ```
+   DATABASE_ADMIN_CONFIG=terminate_db_sessions=true
+   ```
+   Can be combined: `terminate_db_sessions=true;schema_migration=161`
+4. **Communicate** — brief app errors are expected while sessions are terminated and during DDL; clients reconnect after `LOGIN` is restored.
+5. **Optional** — scale down listener/evaluator if a previous deploy showed DDL blocked by lingering connections even with the flag.
+
+---
+
+## `DATABASE_ADMIN_CONFIG` flags
+
+Set on the **db-migration Job** via `DATABASE_ADMIN_CONFIG` (passed as `POD_CONFIG`). Multiple keys are semicolon-separated, e.g. `terminate_db_sessions=true;schema_migration=161`.
+
+Config keys are defined in `database_admin/config.go`. ClowdApp comments in `deploy/clowdapp.yaml` may use older names (`schema_version`, `force_schema_version`) — the code keys are `schema_migration` and `force_migration_version`.
+
+### `schema_migration`
+
+| | |
+|---|---|
+| **Config key** | `schema_migration` (integer, default `-1`) |
+| **Where** | `DATABASE_ADMIN_CONFIG` on the db-migration Job |
+| **Effect** | Target schema version to migrate to. `-1` means latest available migration file. Values `>= 0` migrate only up to that version. Also used by `check-for-db` / `migrateAction` to decide whether deployment should proceed. |
+
+**Set when:** you need to pin or cap the migration version (stage validation, staged rollout, or blocking auto-upgrade past a known-good version).
+
+**Leave at `-1` when:** normal production deploy should apply all pending migrations.
+
+**Note:** If current DB version equals `schema_migration` but newer migration files exist, deploy is **blocked** until `schema_migration` is raised — intentional safety gate.
+
+### `force_migration_version`
+
+| | |
+|---|---|
+| **Config key** | `force_migration_version` (integer, default `-1`, inactive when `<= 0`) |
+| **Where** | `DATABASE_ADMIN_CONFIG` on the db-migration Job |
+| **Effect** | Before `MigrateUp`, calls `migrate.Force(version)` — sets `schema_migrations.version` and clears `dirty`. Used to recover from a failed migration left in dirty state. Migration then continues per `schema_migration`. |
+
+**Set when:** `schema_migrations.dirty = true` after a failed migration and engineering/DBA has confirmed it is safe to reset the version marker (and any partial DDL has been handled).
+
+**Leave unset when:** schema is clean (`dirty = false`). Misuse can mark a broken schema as valid.
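
As a rough illustration of the version rules described for `schema_migration` and `force_migration_version`, the Go sketch below encodes them as pure functions. The function names (`targetVersion`, `forceActive`, `deployBlocked`) are hypothetical and are not taken from `database_admin`; the real checks may be structured differently. Version `163` is an invented "latest" value used only for the demo.

```go
package main

import "fmt"

// targetVersion picks the version to migrate to. Per the runbook,
// schema_migration = -1 means "latest available migration file" and
// values >= 0 cap the migration at that version.
// Hypothetical helper, not the actual database_admin code.
func targetVersion(schemaMigration, latest int) int {
	if schemaMigration < 0 {
		return latest
	}
	return schemaMigration
}

// forceActive reports whether force_migration_version would be applied.
// The runbook describes the flag as inactive when the value is <= 0.
func forceActive(forceMigrationVersion int) bool {
	return forceMigrationVersion > 0
}

// deployBlocked models the safety gate: the database already sits at the
// pinned version, but newer migration files exist.
func deployBlocked(current, schemaMigration, latest int) bool {
	return schemaMigration >= 0 && current == schemaMigration && latest > schemaMigration
}

func main() {
	fmt.Println(targetVersion(-1, 163))       // 163: unpinned, migrate to latest
	fmt.Println(targetVersion(161, 163))      // 161: pinned below latest
	fmt.Println(forceActive(-1))              // false: default leaves Force unused
	fmt.Println(deployBlocked(161, 161, 163)) // true: raise schema_migration to proceed
}
```

Keeping the rules as side-effect-free functions makes the "blocked until raised" gate easy to reason about before touching a real environment.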
+
+### `reset_schema`
+
+| | |
+|---|---|
+| **Config key** | `reset_schema` (boolean, default `false`) |
+| **Where** | `DATABASE_ADMIN_CONFIG` on the db-migration Job |
+| **Effect** | `DROP SCHEMA public CASCADE` and recreate empty `public` schema before migration logic runs. **Destructive** — wipes all application data. |
+
+**Set when:** local/dev database rebuild only, or explicit empty-environment bootstrap under controlled conditions.
+
+**Never set in production** unless performing a deliberate full data reset with sign-off.
+
+### `update_users`
+
+| | |
+|---|---|
+| **Config key** | `update_users` (boolean, default `false`) |
+| **Where** | `DATABASE_ADMIN_CONFIG` (db-migration Job; also common in local `conf/database_admin.env`) |
+| **Effect** | Runs `create_users.sql`, then after migration sets passwords for `listener`, `evaluator`, `manager`, `vmaas_sync` from environment variables. |
+
+**Set when:** initial environment setup or refreshing DB role definitions/passwords (typical in local docker and first-time deploy).
+
+**Leave off when:** users already exist and passwords are managed separately — normal prod Job runs usually rely on this being set only where needed in app-interface.
+
+### `unlock_users`
+
+| | |
+|---|---|
+| **Config key** | `unlock_users` (boolean, default `false`) |
+| **Where** | `DATABASE_ADMIN_CONFIG` on the db-migration Job |
+| **Effect** | `ALTER USER … LOGIN` for app users **before** migration, without running DDL. Recovery helper if a previous migration left users at `NOLOGIN`. |
+
+**Set when:** app users are stuck at `NOLOGIN` after an aborted migration and you need to restore login without running a full migrate.
+
+**Leave off for normal deploys** — migration flow blocks and unblocks users automatically.
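
The flags in this section all arrive in one semicolon-separated `DATABASE_ADMIN_CONFIG` string. A minimal Go sketch of splitting such a string into key/value pairs follows. `parseAdminConfig` is a hypothetical name, and the actual parsing in `database_admin/config.go` may differ, for example in how it treats whitespace, malformed entries, or unknown keys.

```go
package main

import (
	"fmt"
	"strings"
)

// parseAdminConfig splits "key=value;key=value" into a map.
// Assumption for this sketch: entries without "=" or with an empty key
// are skipped rather than reported as errors.
func parseAdminConfig(raw string) map[string]string {
	settings := make(map[string]string)
	for _, entry := range strings.Split(raw, ";") {
		parts := strings.SplitN(strings.TrimSpace(entry), "=", 2)
		if len(parts) != 2 || strings.TrimSpace(parts[0]) == "" {
			continue
		}
		settings[strings.TrimSpace(parts[0])] = strings.TrimSpace(parts[1])
	}
	return settings
}

func main() {
	cfg := parseAdminConfig("terminate_db_sessions=true;schema_migration=161")
	fmt.Println(cfg["terminate_db_sessions"], cfg["schema_migration"]) // true 161
}
```

Values stay as strings here; converting `true` or `161` to typed settings is left out to keep the sketch short.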
+
+### `update_db_config`
+
+| | |
+|---|---|
+| **Config key** | `update_db_config` (boolean, default `false`) |
+| **Where** | `DATABASE_ADMIN_CONFIG` (db-migration Job; also in local `conf/database_admin.env`) |
+| **Effect** | Re-runs `database_admin/config.sql` (PostgreSQL settings such as `work_mem` for the application). |
+
+**Set when:** applying or refreshing DB-level settings from `config.sql` after deploy.
+
+**Leave off when:** only schema migration is needed.
+
+### `terminate_db_sessions`
+
+| | |
+|---|---|
+| **Config key** | `terminate_db_sessions` (boolean, default `false`) |
+| **Where** | `DATABASE_ADMIN_CONFIG` on the **db-migration Job** only |
+| **Effect** | After `NOLOGIN` on app users, runs `pg_terminate_backend` on open `listener` / `evaluator` / `manager` / `vmaas_sync` sessions, then waits until `pg_stat_activity` is clear |
+
+**Enable when:** heavy DDL, prior stuck migration after “Blocking writing users”, or planned maintenance window.
+
+**Leave off when:** routine release, local/CI, no session-blocking symptoms.
+
+**Remove after** the major migration deploy completes.
+
+`NOLOGIN` alone does not close existing connections — that is why this flag exists.
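
The "wait until `pg_stat_activity` is clear" step, together with the abort after 5 consecutive failed checks mentioned elsewhere in this change, can be sketched in Go as a polling loop. This is an illustration under stated assumptions, not the code in `database_admin/update.go`: `sessionCounter`, `waitUntilDrained`, `drainingCounter`, and `failingCounter` are hypothetical names, and the database query is replaced by an injected function so the sketch runs without Postgres.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// sessionCounter stands in for the pg_stat_activity query that counts open
// app-user sessions. Hypothetical type, used only for this sketch.
type sessionCounter func() (int, error)

const (
	maxConsecutiveErrors = 5                // abort after 5 failed checks in a row
	demoInterval         = time.Millisecond // short demo value; the runbook describes a one-second cadence
)

// waitUntilDrained polls count until it reports zero sessions. It returns an
// error instead of proceeding when the check itself keeps failing.
func waitUntilDrained(count sessionCounter, interval time.Duration) error {
	failures := 0
	for {
		n, err := count()
		switch {
		case err != nil:
			failures++
			if failures >= maxConsecutiveErrors {
				return fmt.Errorf("failed to check app database sessions after %d attempts: %w", failures, err)
			}
		case n == 0:
			return nil
		default:
			failures = 0 // a successful query resets the error streak
			fmt.Printf("Waiting for %d sessions\n", n)
		}
		time.Sleep(interval)
	}
}

// drainingCounter reports the given session counts in order, then zero.
func drainingCounter(counts ...int) sessionCounter {
	i := 0
	return func() (int, error) {
		if i < len(counts) {
			n := counts[i]
			i++
			return n, nil
		}
		return 0, nil
	}
}

// failingCounter always errors, simulating lost access to pg_stat_activity.
func failingCounter() (int, error) {
	return 0, errors.New("connection refused")
}

func main() {
	fmt.Println(waitUntilDrained(drainingCounter(2, 1), demoInterval))
	fmt.Println(waitUntilDrained(failingCounter, demoInterval))
}
```

Injecting the counter keeps the loop testable and makes the key design point visible: a failing check aborts the migration rather than being treated as "no sessions".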
+
+---
+
+## During deploy
+
+### Where to watch logs
+
+Kibana — filter by log stream and message text (field names vary by environment; adjust `@log_stream` as needed):
+
+```kql
+@log_stream: patchman-* and message: *advisory lock*
+```
+
+Migration progress:
+
+```kql
+@log_stream: patchman-* and (message: "Migrating the database" or message: "Starting schema migration" or message: "App database sessions cleared")
+```
+
+Init containers polling for schema (may appear on manager/listener/evaluator streams):
+
+```kql
+@log_stream: patchman-* and message: *DB migration in progress*
+```
+
+### Expected log sequence (db-migration Job)
+
+| Step | Log line | Notes |
+|------|----------|--------|
+| 1 | `Getting advisory lock` | |
+| 2 | `Advisory lock acquired` | **Missing** → another holder of advisory lock 123 |
+| 3 | `Migrating the database` | |
+| 4 | `Blocking writing users during the migration` | `NOLOGIN` on app DB users |
+| 5 | `Terminating active app database sessions` | Only if `terminate_db_sessions=true` |
+| 6 | `Terminated session pid=... user=...` | Per terminated backend |
+| 7 | `Waiting for N sessions: ...` | Repeats each second until drain |
+| 8 | `App database sessions cleared` | |
+| 9 | `Starting schema migration to version X` | DDL begins |
+| 10 | *(silence)* | Normal during long DDL |
+| 11 | `Reverting components privileges` | `LOGIN` restored |
+| 12 | `Releasing advisory lock` | |
+
+### If stuck
+
+| Last log seen | Likely cause | Action |
+|---------------|--------------|--------|
+| Only `Getting advisory lock` | Another process holds advisory lock 123 | See [Advisory lock diagnostics](#advisory-lock-diagnostics); check for duplicate migration Job or stale pod |
+| `Waiting for N sessions` (repeating) | App connections still open | Enable or verify `terminate_db_sessions=true`; scale down listener/evaluator; inspect `pg_stat_activity` |
+| Past `Starting schema migration`, long silence | DDL waiting on table lock | Find blockers on target table; scale down apps; see [DDL lock diagnostics](#ddl-lock-diagnostics) |
+| `failed to check app database sessions after 5 attempts` | DB connectivity or permissions on `pg_stat_activity` | Fix admin DB access; do not ignore — migration aborted intentionally |
+| Job failed, new pods `CrashLoopBackOff` on init | Migration failed or timed out | Old pods still serve; fix migration state before retrying |
+
+---
+
+## After deploy
+
+1. Verify schema: `SELECT version, dirty FROM schema_migrations;` — `dirty` must be `false`.
+2. Remove `terminate_db_sessions` from `DATABASE_ADMIN_CONFIG` (or set `false`).
+3. Confirm app pods passed `check-for-db` and are ready.
+4. Smoke-test manager API and a sample evaluation path if the migration touched core tables.
+
+---
+
+## Rollback
+
+- **Application rollback** — deploy previous image tag; if schema already migrated forward, old code may be incompatible with new schema. Coordinate with engineering before rolling back app only.
+- **Failed migration (`dirty = true`)** — do not re-deploy blindly. Inspect `schema_migrations`, Job logs, and whether DDL partially applied. May require `force_migration_version` (see `database_admin/config.go`) under DBA/engineering guidance.
+- **Stuck advisory lock** — identify holder PID; terminate only after confirming it is a stale migration pod, not an active legitimate migration.
+
+---
+
+## Postgres diagnostics
+
+### Advisory lock diagnostics
+
+Advisory lock id **123** is hardcoded in `database_admin/update.go`.
+
+```sql
+-- Who holds advisory lock 123?
+SELECT l.pid, a.usename, a.state, a.application_name, left(a.query, 120) AS query
+FROM pg_locks l
+JOIN pg_stat_activity a ON a.pid = l.pid
+WHERE l.locktype = 'advisory'
+  AND l.classid = 0
+  AND l.objid = 123;
+```
+
+### App session diagnostics
+
+```sql
+-- Open sessions for patchman app users
+SELECT pid, usename, state, wait_event_type, wait_event, left(query, 80) AS query
+FROM pg_stat_activity
+WHERE usename IN ('listener', 'evaluator', 'manager', 'vmaas_sync')
+ORDER BY usename, pid;
+```
+
+### DDL lock diagnostics
+
+Replace `system_inventory` with the table your migration touches. `blocked_locks` is the waiting lock (typically the db-migration DDL); `blocking_locks` is the granted lock on the same resource from another session. The JOIN already matches them on the same `relation`, so filter on `blocked_locks.relation`:
+
+```sql
+SELECT blocked.pid AS blocked_pid,
+       blocked.usename AS blocked_user,
+       left(blocked.query, 80) AS blocked_query,
+       blocking.pid AS blocking_pid,
+       blocking.usename AS blocking_user,
+       left(blocking.query, 80) AS blocking_query
+FROM pg_stat_activity blocked
+JOIN pg_locks blocked_locks ON blocked_locks.pid = blocked.pid AND NOT blocked_locks.granted
+JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
+  AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
+  AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
+  AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
+  AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
+  AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
+  AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
+  AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
+  AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
+  AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
+  AND blocking_locks.pid != blocked_locks.pid
+JOIN pg_stat_activity blocking ON blocking.pid = blocking_locks.pid
+WHERE blocking_locks.granted
+  AND blocked_locks.relation = 'system_inventory'::regclass;
+```
+
+### Migration state
+
+```sql
+SELECT version, dirty FROM schema_migrations;
+```
+
+---
+
+## Job parameters
+
+| Parameter | Default | Where | Purpose |
+|-----------|---------|-------|---------|
+| `MIGRATION_TIMEOUT` | `7200` | ClowdApp Job `activeDeadlineSeconds` | Max Job runtime (seconds) |
+| `MIGRATION_MAX_RETRIES` | `3` | db-migration Job env | Migrate command retries on failure (`entrypoint.sh`, 5s between attempts) |
+
+---
+
+## Related code and deploy files
+
+| Topic | Location |
+|-------|----------|
+| Migration flow, session wait/terminate | `database_admin/update.go` |
+| Migrate retries | `database_admin/entrypoint.sh` |
+| Init schema poll | `database_admin/check-upgraded.sh` |
+| ClowdApp Job and `check-for-db` init | `deploy/clowdapp.yaml` |