20 changes: 19 additions & 1 deletion AGENTS.md
@@ -9,7 +9,8 @@
| Topic | Location |
|--------|----------|
| Architecture and components | [docs/md/architecture.md](docs/md/architecture.md) |
| Database layout | [docs/md/database.md](docs/md/database.md) |
| Database layout and migrations | [docs/md/database.md](docs/md/database.md) |
| Major migration operations runbook | [docs/md/major-migration-runbook.md](docs/md/major-migration-runbook.md) |
| Local dev, tests, OpenAPI | [README.md](README.md) |
| Commits, PRs, contribution style | [CONTRIBUTING.md](CONTRIBUTING.md) |

@@ -30,6 +31,7 @@ Prefer these sources over guessing when behavior or schema matters.
| Evaluation | `evaluator/`, topic names in code and `conf/` |
| Advisory sync | `tasks/vmaas_sync/` |
| Migrations | `database_admin/migrations/` (verify naming against existing migrations) |
| Migration flow, flags, ops runbook | [docs/md/major-migration-runbook.md](docs/md/major-migration-runbook.md) |
| Database schema and SQL | `database_admin/schema/` |
| Containers and local orchestration | `docker-compose.yml`, `docker-compose.test.yml`, `Dockerfile*` |
| Scheduled jobs | `tasks/` |
@@ -85,3 +87,19 @@ Manager Component (REST API)
Response to User
```

---

## Database migrations (major DDL)

Authoritative ops guide: [docs/md/major-migration-runbook.md](docs/md/major-migration-runbook.md). Summary for agents:

**Deploy model:** One **db-migration** Job per deploy runs migrations; app pods only **check-for-db** init (poll schema). Failed migration → new pods fail init, old pods keep serving.

**Session handling:** `waitForSessionClosed` polls `pg_stat_activity` correctly; fails after 5 consecutive query errors (does not proceed silently).

**`terminate_db_sessions`:** Default **off**. Set on the **db-migration Job only** for major DDL when `NOLOGIN` is not enough. Remove after deploy. Other flags (`schema_migration`, `force_migration_version`, etc.) are documented in the runbook.

**Logging:** Key lines — `Advisory lock acquired`, `Waiting for N sessions`, `App database sessions cleared`, `Starting schema migration to version X`. Stuck at only `Getting advisory lock` → advisory lock 123 held elsewhere. Use `message:` filters in Kibana, not `kubernetes.container_name`.

**When advising users:** Point to the runbook for before/during/after steps, Kibana queries, and Postgres diagnostics. Do not recommend `terminate_db_sessions` for routine deploys.
4 changes: 3 additions & 1 deletion docs/md/architecture.md
@@ -52,7 +52,9 @@ description of the component and data layout are in [separate page](database.md)

- **database-admin** - Executes database initialization and migrations. It needs all rights for the database. It also
creates database users for all components and updates passwords for them, so it reads passwords for admin and all
components from environment variables. Using container CLI it's possible to manually manage database
components from environment variables. In production a **db-migration** Job runs migrations once per deploy; other pods
wait in **check-for-db** init until the schema is current. See [Database migrations](database.md#migrations) and the
[major migration runbook](major-migration-runbook.md). Using container CLI it's possible to manually manage database
(`./scripts/psql.sh`). See [component environment variables](../../conf/database_admin.env)

### Components cooperation schema
17 changes: 17 additions & 0 deletions docs/md/database.md
@@ -16,3 +16,20 @@ Main database tables description:
The ERD image below may lag `database_admin/schema/create_schema.sql`; for systems it may not reflect the split between **system_inventory** (host profile / upload payload) and **system_patch** (evaluation caches and aggregates).

![](graphics/db_diagram.png)

## Migrations

Schema changes live in `database_admin/migrations/` and are applied by **database-admin** (`database_admin/update.go`).

In production:

- A single **db-migration** ClowdApp Job runs `migrate` once per deploy (`completions: 1`, `parallelism: 1`).
- Manager, listener, evaluator, and other components use a **check-for-db** init container that polls until the schema matches (`database_admin/check-upgraded.sh`).

Before DDL, database-admin sets app users (`listener`, `evaluator`, `manager`, `vmaas_sync`) to `NOLOGIN`, waits for sessions to drain, optionally terminates lingering backends (`terminate_db_sessions`), runs `MigrateUp`, then restores `LOGIN`.

@xbhouse xbhouse Jun 23, 2026

Contributor

this is probably me stating the obvious but since the terminate_db_sessions flag is in another branch with additional docs there, this will just need to be rebased once that's merged :)


| Topic | Document |
|-------|----------|
| Major DDL deploy procedure, troubleshooting, SQL diagnostics | [major-migration-runbook.md](major-migration-runbook.md) |
| `DATABASE_ADMIN_CONFIG` flags and log sequence | [major-migration-runbook.md](major-migration-runbook.md) |
| ClowdApp parameters | `deploy/clowdapp.yaml`, `database_admin/config.go` |
298 changes: 298 additions & 0 deletions docs/md/major-migration-runbook.md
@@ -0,0 +1,298 @@
# Major database migration runbook

Operational guide for deploying schema migrations that run heavy DDL (for example large `ALTER TABLE` on partitioned tables).

See also [database.md — Migrations](database.md#migrations) for config reference.

---

## How deploy works

```
New deploy triggered
db-migration Job starts (completions: 1, parallelism: 1)
‖ (in parallel)
New app pods start → check-for-db init polls schema every 5s (up to ~5 min)
Job: advisory lock → block users → [terminate sessions] → MigrateUp (DDL)
Job succeeds → check-for-db init passes → rollout continues
Job fails → new pods fail init → old pods keep serving
```

- **One migrator per deploy:** only the Job runs `migrate`; app pods only poll the schema in their `check-for-db` init container.
- **Job limits:** `MIGRATION_TIMEOUT` (default 7200s / 2h), `MIGRATION_MAX_RETRIES=3` with 5s sleep between attempts (`database_admin/entrypoint.sh`).
- **Advisory lock:** `pg_advisory_lock(123)` ensures a single migration process even if something else triggers database-admin.
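
The init-container side of this flow is a bounded poll. The sketch below only illustrates that shape in Python; the real check is the shell script `database_admin/check-upgraded.sh`, and `wait_for_schema`, `is_current`, and the injectable `sleep` are hypothetical names introduced here. The 5s interval and ~5 min budget are taken from the diagram above.

```python
import time
from typing import Callable


def wait_for_schema(
    is_current: Callable[[], bool],
    interval_s: float = 5.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_current() until it returns True or the time budget is used up.

    Returns True when the schema is current, False on timeout.
    """
    waited = 0.0
    while True:
        if is_current():
            return True
        if waited >= timeout_s:
            return False  # in this sketch, the init container would fail here
        sleep(interval_s)
        waited += interval_s


# Example with a fake check that succeeds on the third poll (no real sleeping).
answers = iter([False, False, True])
print(wait_for_schema(lambda: next(answers), sleep=lambda s: None))
```

Passing `sleep` as a parameter keeps the example quick to run; a real poll would use the default `time.sleep`.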

---

## When to use this runbook

Use for migrations that need exclusive locks or long DDL runtime. Routine migrations (new tables, additive columns, typical indexes) follow the normal deploy; do **not** set `terminate_db_sessions` by default.

---

## Before deploy

1. **Review the migration** — identify tables that need `ACCESS EXCLUSIVE` locks and expected runtime.
2. **Set target schema** (if not migrating to latest):
   ```
   DATABASE_ADMIN_CONFIG=schema_migration=161
   ```
   on the **db-migration Job** only (via app-interface / ClowdApp `DATABASE_ADMIN_CONFIG`).
3. **Major DDL only** — enable session termination:
   ```
   DATABASE_ADMIN_CONFIG=terminate_db_sessions=true
   ```
   Can be combined: `terminate_db_sessions=true;schema_migration=161`
4. **Communicate** — brief app errors are expected while sessions are terminated and during DDL; clients reconnect after `LOGIN` is restored.
5. **Optional** — scale down listener/evaluator if a previous deploy showed DDL blocked by lingering connections even with the flag.

---

## `DATABASE_ADMIN_CONFIG` flags

Set on the **db-migration Job** via `DATABASE_ADMIN_CONFIG` (passed as `POD_CONFIG`). Multiple keys are semicolon-separated, e.g. `terminate_db_sessions=true;schema_migration=161`.

Config keys are defined in `database_admin/config.go`. ClowdApp comments in `deploy/clowdapp.yaml` may use older names (`schema_version`, `force_schema_version`) — the code keys are `schema_migration` and `force_migration_version`.
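
To make the value format concrete, here is a minimal Python sketch that splits a semicolon-separated `key=value` string. It is not the parser from `database_admin/config.go`; `parse_pod_config` is a hypothetical helper, and the handling of bare keys and surrounding whitespace is an assumption.

```python
from typing import Dict


def parse_pod_config(raw: str) -> Dict[str, str]:
    """Split 'key=value;key=value' into a dict.

    Assumption: empty items are skipped and a bare key maps to ''.
    """
    config: Dict[str, str] = {}
    for item in raw.split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition("=")
        config[key.strip()] = value.strip()
    return config


cfg = parse_pod_config("terminate_db_sessions=true;schema_migration=161")
# Defaults mirror the tables in this section: false and -1.
terminate = cfg.get("terminate_db_sessions", "false") == "true"
target = int(cfg.get("schema_migration", "-1"))
print(terminate, target)
```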

### `schema_migration`

| | |
|---|---|
| **Config key** | `schema_migration` (integer, default `-1`) |
| **Where** | `DATABASE_ADMIN_CONFIG` on the db-migration Job |
| **Effect** | Target schema version to migrate to. `-1` means latest available migration file. Values `>= 0` migrate only up to that version. Also used by `check-for-db` / `migrateAction` to decide whether deployment should proceed. |

**Set when:** you need to pin or cap the migration version (stage validation, staged rollout, or blocking auto-upgrade past a known-good version).

**Leave at `-1` when:** normal production deploy should apply all pending migrations.

**Note:** If current DB version equals `schema_migration` but newer migration files exist, deploy is **blocked** until `schema_migration` is raised — intentional safety gate.
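
The three rules above (`-1` means latest, `>= 0` caps the target, and an equal pin with newer files blocks the deploy) can be sketched as a small decision function. This is an illustration of the documented behaviour only, not the code in `database_admin/update.go`; `plan_migration` and its return strings are hypothetical, and cases the text does not cover are reported as `"unspecified"`.

```python
def plan_migration(current: int, latest_file: int, schema_migration: int = -1) -> str:
    """Return what a deploy would do under the rules described above."""
    pinned = schema_migration >= 0
    target = schema_migration if pinned else latest_file
    if current < target:
        return f"migrate to {target}"
    if pinned and current == schema_migration and latest_file > current:
        return "blocked"  # safety gate: raise schema_migration to proceed
    if current == target:
        return "up to date"
    return "unspecified"  # not described in the runbook text


print(plan_migration(160, 165))        # unpinned: go to the latest file
print(plan_migration(160, 165, 161))   # pinned below latest: stop at the pin
print(plan_migration(161, 165, 161))   # at the pin, newer files exist
```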

### `force_migration_version`

| | |
|---|---|
| **Config key** | `force_migration_version` (integer, default `-1`, inactive when `<= 0`) |
| **Where** | `DATABASE_ADMIN_CONFIG` on the db-migration Job |
| **Effect** | Before `MigrateUp`, calls `migrate.Force(version)` — sets `schema_migrations.version` and clears `dirty`. Used to recover from a failed migration left in dirty state. Migration then continues per `schema_migration`. |

**Set when:** `schema_migrations.dirty = true` after a failed migration and engineering/DBA has confirmed it is safe to reset the version marker (and any partial DDL has been handled).

**Leave unset when:** schema is clean (`dirty = false`). Misuse can mark a broken schema as valid.

### `reset_schema`

| | |
|---|---|
| **Config key** | `reset_schema` (boolean, default `false`) |
| **Where** | `DATABASE_ADMIN_CONFIG` on the db-migration Job |
| **Effect** | `DROP SCHEMA public CASCADE` and recreate empty `public` schema before migration logic runs. **Destructive** — wipes all application data. |

**Set when:** local/dev database rebuild only, or explicit empty-environment bootstrap under controlled conditions.

**Never set in production** unless performing a deliberate full data reset with sign-off.

### `update_users`

| | |
|---|---|
| **Config key** | `update_users` (boolean, default `false`) |
| **Where** | `DATABASE_ADMIN_CONFIG` (db-migration Job; also common in local `conf/database_admin.env`) |
| **Effect** | Runs `create_users.sql`, then after migration sets passwords for `listener`, `evaluator`, `manager`, `vmaas_sync` from environment variables. |

**Set when:** initial environment setup or refreshing DB role definitions/passwords (typical in local docker and first-time deploy).

**Leave off when:** users already exist and passwords are managed separately. For production Job runs, set it in app-interface only where it is needed.

### `unlock_users`

| | |
|---|---|
| **Config key** | `unlock_users` (boolean, default `false`) |
| **Where** | `DATABASE_ADMIN_CONFIG` on the db-migration Job |
| **Effect** | `ALTER USER … LOGIN` for app users **before** migration, without running DDL. Recovery helper if a previous migration left users at `NOLOGIN`. |

**Set when:** app users are stuck at `NOLOGIN` after an aborted migration and you need to restore login without running a full migrate.

**Leave off for normal deploys** — migration flow blocks and unblocks users automatically.

### `update_db_config`

| | |
|---|---|
| **Config key** | `update_db_config` (boolean, default `false`) |
| **Where** | `DATABASE_ADMIN_CONFIG` (db-migration Job; also in local `conf/database_admin.env`) |
| **Effect** | Re-runs `database_admin/config.sql` (PostgreSQL settings such as `work_mem` for the application). |

**Set when:** applying or refreshing DB-level settings from `config.sql` after deploy.

**Leave off when:** only schema migration is needed.

### `terminate_db_sessions`

| | |
|---|---|
| **Config key** | `terminate_db_sessions` (boolean, default `false`) |
| **Where** | `DATABASE_ADMIN_CONFIG` on the **db-migration Job** only |
| **Effect** | After `NOLOGIN` on app users, runs `pg_terminate_backend` on open `listener` / `evaluator` / `manager` / `vmaas_sync` sessions, then waits until `pg_stat_activity` is clear. |

**Enable when:** heavy DDL, prior stuck migration after “Blocking writing users”, or planned maintenance window.

**Leave off when:** routine release, local/CI, no session-blocking symptoms.

**Remove after** the major migration deploy completes.

`NOLOGIN` alone does not close existing connections — that is why this flag exists.
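
The wait that follows (poll until no app sessions remain, abort after 5 consecutive query errors) can be sketched as below. This is an illustrative Python outline, not the Go `waitForSessionClosed` implementation; `count_sessions` stands in for a query against `pg_stat_activity`, and the one-second pause after an error is an assumption.

```python
import time
from typing import Callable


def wait_for_sessions_closed(
    count_sessions: Callable[[], int],
    max_errors: int = 5,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Return once count_sessions() reports 0; raise after max_errors failures in a row."""
    errors = 0
    while True:
        try:
            open_sessions = count_sessions()
        except Exception as exc:
            errors += 1
            if errors >= max_errors:
                # Abort instead of proceeding to DDL with unknown session state.
                raise RuntimeError(
                    f"failed to check app database sessions after {max_errors} attempts"
                ) from exc
            sleep(interval_s)
            continue
        errors = 0  # only consecutive failures count
        if open_sessions == 0:
            return
        sleep(interval_s)


# Fake session counts: two open, one open, then drained.
counts = iter([2, 1, 0])
wait_for_sessions_closed(lambda: next(counts), sleep=lambda s: None)
print("sessions cleared")
```

Resetting the error counter after each successful query is what makes the limit apply to consecutive failures rather than to the total.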

---

## During deploy

### Where to watch logs

Kibana — filter by log stream and message text (field names vary by environment; adjust `@log_stream` as needed):

```kql
@log_stream: patchman-* and message: *advisory lock*
```

Migration progress:

```kql
@log_stream: patchman-* and (message: "Migrating the database" or message: "Starting schema migration" or message: "App database sessions cleared")
```

Init containers polling for schema (may appear on manager/listener/evaluator streams):

```kql
@log_stream: patchman-* and message: *DB migration in progress*
```

### Expected log sequence (db-migration Job)

| Step | Log line | Notes |
|------|----------|--------|
| 1 | `Getting advisory lock` | |
| 2 | `Advisory lock acquired` | **Missing** → another holder of advisory lock 123 |
| 3 | `Migrating the database` | |
| 4 | `Blocking writing users during the migration` | `NOLOGIN` on app DB users |
| 5 | `Terminating active app database sessions` | Only if `terminate_db_sessions=true` |
| 6 | `Terminated session pid=... user=...` | Per terminated backend |
| 7 | `Waiting for N sessions: ...` | Repeats each second until drain |
| 8 | `App database sessions cleared` | |
| 9 | `Starting schema migration to version X` | DDL begins |
| 10 | *(silence)* | Normal during long DDL |
| 11 | `Reverting components privileges` | `LOGIN` restored |
| 12 | `Releasing advisory lock` | |

### If stuck

| Last log seen | Likely cause | Action |
|---------------|--------------|--------|
| Only `Getting advisory lock` | Another process holds advisory lock 123 | See [Advisory lock diagnostics](#advisory-lock-diagnostics); check for duplicate migration Job or stale pod |
| `Waiting for N sessions` (repeating) | App connections still open | Enable or verify `terminate_db_sessions=true`; scale down listener/evaluator; inspect `pg_stat_activity` |
| Past `Starting schema migration`, long silence | DDL waiting on table lock | Find blockers on target table; scale down apps; see [DDL lock diagnostics](#ddl-lock-diagnostics) |
| `failed to check app database sessions after 5 attempts` | DB connectivity or permissions on `pg_stat_activity` | Fix admin DB access; do not ignore — migration aborted intentionally |
| Job failed, new pods `CrashLoopBackOff` on init | Migration failed or timed out | Old pods still serve; fix migration state before retrying |
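
For quick triage of exported Job logs, the two tables above can be combined into a small lookup: find the furthest step that appears, then read the matching hint. The Python sketch below is a hypothetical helper, not part of the repository; the patterns assume the log lines appear verbatim as listed and that `N` in `Waiting for N sessions` is a decimal number.

```python
import re
from typing import Iterable, Optional

# (step name, pattern) in the order the runbook expects them to appear.
STEPS = [
    ("Getting advisory lock", r"Getting advisory lock"),
    ("Advisory lock acquired", r"Advisory lock acquired"),
    ("Migrating the database", r"Migrating the database"),
    ("Blocking writing users", r"Blocking writing users during the migration"),
    ("Terminating sessions", r"Terminating active app database sessions"),
    ("Terminated session", r"Terminated session pid="),
    ("Waiting for sessions", r"Waiting for \d+ sessions"),
    ("Sessions cleared", r"App database sessions cleared"),
    ("Starting schema migration", r"Starting schema migration to version"),
    ("Reverting privileges", r"Reverting components privileges"),
    ("Releasing advisory lock", r"Releasing advisory lock"),
]

# Hints paraphrased from the "If stuck" table.
HINTS = {
    "Getting advisory lock": "advisory lock 123 may be held elsewhere",
    "Waiting for sessions": "app connections still open",
    "Starting schema migration": "long DDL, or DDL waiting on a table lock",
}


def furthest_step(log_lines: Iterable[str]) -> Optional[str]:
    """Return the name of the latest expected step found in the log, if any."""
    furthest = -1
    for line in log_lines:
        for index, (_, pattern) in enumerate(STEPS):
            if re.search(pattern, line):
                furthest = max(furthest, index)
    return STEPS[furthest][0] if furthest >= 0 else None


sample = [
    "Getting advisory lock",
    "Advisory lock acquired",
    "Migrating the database",
    "Blocking writing users during the migration",
    "Waiting for 3 sessions: listener, manager",
]
step = furthest_step(sample)
print(step, "->", HINTS.get(step, "no hint in the table"))
```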

---

## After deploy

1. Verify schema: `SELECT version, dirty FROM schema_migrations;` — `dirty` must be `false`.
2. Remove `terminate_db_sessions` from `DATABASE_ADMIN_CONFIG` (or set `false`).
3. Confirm app pods passed `check-for-db` and are ready.
4. Smoke-test manager API and a sample evaluation path if the migration touched core tables.

---

## Rollback

- **Application rollback** — deploy previous image tag; if schema already migrated forward, old code may be incompatible with new schema. Coordinate with engineering before rolling back app only.
- **Failed migration (`dirty = true`)** — do not re-deploy blindly. Inspect `schema_migrations`, Job logs, and whether DDL partially applied. May require `force_migration_version` (see `database_admin/config.go`) under DBA/engineering guidance.
- **Stuck advisory lock** — identify holder PID; terminate only after confirming it is a stale migration pod, not an active legitimate migration.

---

## Postgres diagnostics

### Advisory lock diagnostics

Advisory lock id **123** is hardcoded in `database_admin/update.go`.

```sql
-- Who holds advisory lock 123?
SELECT l.pid, a.usename, a.state, a.application_name, left(a.query, 120) AS query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'advisory'
AND l.classid = 0
AND l.objid = 123;
```

### App session diagnostics

```sql
-- Open sessions for patchman app users
SELECT pid, usename, state, wait_event_type, wait_event, left(query, 80) AS query
FROM pg_stat_activity
WHERE usename IN ('listener', 'evaluator', 'manager', 'vmaas_sync')
ORDER BY usename, pid;
```

### DDL lock diagnostics

Replace `system_inventory` with the table your migration touches:

Contributor

hmm i don't see a table name referenced in the query following this instruction. is the diagnostic query meant to filter on the table name?


```sql
SELECT blocked.pid AS blocked_pid,
blocked.usename AS blocked_user,
left(blocked.query, 80) AS blocked_query,
blocking.pid AS blocking_pid,
blocking.usename AS blocking_user,
left(blocking.query, 80) AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_locks blocked_locks ON blocked_locks.pid = blocked.pid AND NOT blocked_locks.granted
JOIN pg_locks blocking_locks ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_stat_activity blocking ON blocking.pid = blocking_locks.pid
WHERE NOT blocking_locks.granted;

```

@xbhouse xbhouse Jun 23, 2026

Contributor

compared this to the query AWS provided in slack and used cursor to help me understand 😅 it looks like the final WHERE clause might need to be blocked_locks.granted vs blocking_locks.granted, or it could be omitted since the JOIN already includes the NOT blocked_locks.granted condition

### Migration state

```sql
SELECT version, dirty FROM schema_migrations;
```

---

## Job parameters

| Parameter | Default | Where | Purpose |
|-----------|---------|-------|---------|
| `MIGRATION_TIMEOUT` | `7200` | ClowdApp Job `activeDeadlineSeconds` | Max Job runtime (seconds) |
| `MIGRATION_MAX_RETRIES` | `3` | db-migration Job env | Migrate command retries on failure (`entrypoint.sh`, 5s between attempts) |
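
The retry behaviour in the table can be pictured as a bounded loop with a fixed pause. The Python sketch below is an illustration only; the actual loop is shell code in `database_admin/entrypoint.sh`. `run_with_retries` and `run_migrate` are hypothetical names, and treating `3` as the total number of attempts (rather than retries after the first failure) is an assumption.

```python
import time
from typing import Callable


def run_with_retries(
    run_migrate: Callable[[], int],
    max_attempts: int = 3,
    pause_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call run_migrate() until it returns 0 or max_attempts is reached.

    Returns the exit code of the last attempt.
    """
    exit_code = 1
    for attempt in range(1, max_attempts + 1):
        exit_code = run_migrate()
        if exit_code == 0:
            break
        if attempt < max_attempts:
            sleep(pause_s)  # pause only between attempts
    return exit_code


# Fake migrate command: fails twice, then succeeds.
codes = iter([1, 1, 0])
print(run_with_retries(lambda: next(codes), sleep=lambda s: None))
```

`MIGRATION_TIMEOUT` is a separate limit: as `activeDeadlineSeconds` it bounds the whole Job, including all attempts.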

---

## Related code and deploy files

| Topic | Location |
|-------|----------|
| Migration flow, session wait/terminate | `database_admin/update.go` |
| Migrate retries | `database_admin/entrypoint.sh` |
| Init schema poll | `database_admin/check-upgraded.sh` |
| ClowdApp Job and `check-for-db` init | `deploy/clowdapp.yaml` |