Add instance health checks #234

Open
sjmiller609 wants to merge 12 commits into main from hypeship/add-healthcheck-policy

sjmiller609 (Collaborator) commented May 16, 2026

Summary

  • add instance health_check policy and health_status response fields for http, tcp, and exec probes
  • add a health check controller owned by the instance manager, with timing, thresholds, start-period handling, and runtime status persistence
  • start health checks while instances are Initializing or Running, but keep the public health status at starting until the instance reaches Running
  • add HTTP healthcheck assertions to TestCreateInstanceWithNetwork so the VM-starting network path waits for persisted healthy status
  • wire the controller into the api process and document lifecycle semantics in lib/healthcheck/README.md

Tests

  • go test ./lib/healthcheck
  • go test ./lib/instances -run TestCreateInstanceWithNetwork -count=0
  • go test ./lib/instances -run 'TestHealthCheck|TestValidateCreateRequestHealthCheck|TestValidateUpdateInstanceRequest|TestManagerUpdateInstanceHealthCheckOnlyPublishesLifecycleUpdate|TestLifecycleEventMetrics_ObserveSubscribersQueueDepthAndDrops|TestLifecycleSubscribers'
  • go test ./cmd/api/api -run 'TestCreateInstance_MapsHealthCheckPolicy|TestUpdateInstance_MapsHealthCheckPatch|TestCreateInstance_MapsAutoStandbyPolicy|TestUpdateInstance_MapsAutoStandbyPatch'
  • go test ./cmd/api -run TestDoesNotExist
  • go test ./lib/providers

Notes

  • go test ./lib/instances -run TestCreateInstanceWithNetwork -count=1 was attempted twice; both runs failed before instance creation because the existing nginx image readiness wait still saw image status pending after 60s.
  • go test ./cmd/api/api is currently blocked by Docker Hub unauthenticated pull rate limits and local network bridge permissions in existing integration tests.
  • make generate-wire is currently blocked because the checked-in wire binary was built with Go 1.24 and this package now requires Go 1.25, so wire_gen.go was updated by hand following the shape of the existing generated code; go test ./cmd/api -run TestDoesNotExist passes.

Note

Medium Risk
Adds a new asynchronous health-check controller that probes running/initializing instances and persists runtime status, plus new API surface for configuring checks; timing/state handling and background scheduling increase behavioral and concurrency risk.

Overview
Adds a first-class workload health dimension to instances via a new lib/healthcheck package (policy normalization/validation, probe execution, and status tracking) plus persisted runtime state.

Extends the Instances API to accept and return health_check and to report health_status, wiring request/response mapping and validation into create/update flows and resetting runtime on policy changes.

Introduces an instances.HealthCheckController that subscribes to lifecycle events, schedules probes while instances are Initializing/Running (masking the public status as starting until Running), runs HTTP/TCP/exec checks, and persists health runtime; the API process now wires and runs this controller. Separately, metadata writes are made atomic via temp-file + rename, and tests/integration paths are updated to cover health-check behavior and lifecycle consumer metrics.

Reviewed by Cursor Bugbot for commit a32c4c8. Bugbot is set up for automated code reviews on this repo.

github-actions (bot) commented May 16, 2026

✱ Stainless preview builds for hypeman

This PR will update the hypeman SDKs with the following commit message.

feat: Add instance health checks

Edit this comment to update it. It will appear in the SDK's changelogs.

hypeman-openapi

Your SDK build had at least one new note diagnostic, which is a regression from the base state.
generate ✅

New diagnostics (5 note)
💡 Model/Recommended: We recommend you use a model for `#/components/schemas/HealthCheck`
💡 Model/Recommended: We recommend you use a model for `#/components/schemas/HealthCheckExec`
💡 Model/Recommended: We recommend you use a model for `#/components/schemas/HealthCheckHTTP`
💡 Model/Recommended: We recommend you use a model for `#/components/schemas/HealthCheckTCP`
💡 Model/Recommended: We recommend you use a model for `#/components/schemas/InstanceHealthStatus`
hypeman-typescript

Your SDK build had at least one new note diagnostic, which is a regression from the base state.
generate ✅ build ✅ lint ❗ test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/e2bb156dae1ddb8a5bb88644354fdcb5897f3c53/dist.tar.gz
New diagnostics (5 note)
💡 Model/Recommended: We recommend you use a model for `#/components/schemas/HealthCheck`
💡 Model/Recommended: We recommend you use a model for `#/components/schemas/HealthCheckExec`
💡 Model/Recommended: We recommend you use a model for `#/components/schemas/HealthCheckHTTP`
💡 Model/Recommended: We recommend you use a model for `#/components/schemas/HealthCheckTCP`
💡 Model/Recommended: We recommend you use a model for `#/components/schemas/InstanceHealthStatus`
hypeman-go

Your SDK build had at least one new note diagnostic, which is a regression from the base state.
generate ✅ build ✅ lint ✅ test ✅

go get github.com/stainless-sdks/hypeman-go@ee2f11657bfaecf38ef3f2b3b10fa128ff05f4b9
New diagnostics (5 note)
💡 Model/Recommended: We recommend you use a model for `#/components/schemas/HealthCheck`
💡 Model/Recommended: We recommend you use a model for `#/components/schemas/HealthCheckExec`
💡 Model/Recommended: We recommend you use a model for `#/components/schemas/HealthCheckHTTP`
💡 Model/Recommended: We recommend you use a model for `#/components/schemas/HealthCheckTCP`
💡 Model/Recommended: We recommend you use a model for `#/components/schemas/InstanceHealthStatus`

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-05-17 18:18:22 UTC

@sjmiller609 sjmiller609 marked this pull request as ready for review May 17, 2026 17:08
Comment thread lib/instances/health_check_controller.go Outdated
@firetiger-agent

Monitoring Plan: Instance Health Checks (PR #234)

This PR adds a new health-check subsystem to hypeman: POST /instances and PUT /instances/{id} now accept an optional health_check policy (HTTP, TCP, or exec probes), and a new HealthCheckController goroutine runs alongside the existing AutoStandbyController to drive periodic probes and persist runtime status. The GET /instances/{id} response gains health_check + health_status fields.

The main risks are: (1) validation errors in toDomainHealthCheck/NormalizePolicy surfacing as unexpected 400s for existing callers whose request bodies happen to conflict with the new fields, (2) the new controller goroutine panicking or leaking timers under high instance churn, and (3) exec probes firing guest-agent commands on instances that lack a guest agent, producing error-log noise. The API 5xx error-rate baseline is 0.013–0.018% (30–35 errors/hr out of ~190K–280K req/hr); the 400 error baseline is ~267 in the latest 4-hour window. Status updates will be posted automatically on this PR as monitoring progresses.

Key risks to watch:

  • Spike in HTTP 400 responses on /instances endpoints (invalid_health_check errors from new validation path)
  • Unhandled panics or nil pointer dereference errors in HealthCheckController.Run or timer callbacks
  • API 5xx error rate exceeding 0.05% (3× normal baseline) sustained for >15 min
  • Log signals: "failed to set health check runtime" errors appearing, or the "health check controller started" message missing after deploy

Comment thread lib/instances/health_check_controller.go
Comment thread lib/instances/types.go
Comment thread lib/instances/health_check_controller.go

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 36c8c34.

Comment thread lib/healthcheck/status.go
Comment thread lib/instances/health_check_controller.go