
Add storage backend probe to /health (closes #73) #119

Open
larsborn wants to merge 11 commits into git-pkgs:main from larsborn:feat/storage-health-probe

Conversation


larsborn commented on May 12, 2026

Summary

  • /health now performs an active write → size-check → read → verify → delete round-trip against the configured storage backend, in addition to the existing database check. Closes #73 (Add storage backend probe to health check).
  • The result is cached for a configurable interval (`health.storage_probe_interval`, env `PROXY_HEALTH_STORAGE_PROBE_INTERVAL`, default 30s; "0" disables caching). The probe runs under a detached `context.WithTimeout(context.Background(), 10s)` so a client disconnect can't poison the cache; see the sketch after this list.
  • Response shape changes from plain text ("ok" / "database error") to JSON:
    {"status":"ok","checks":{"database":{"status":"ok"},"storage":{"status":"ok"}}}
    Status codes are unchanged (200 healthy / 503 unhealthy). Failures include an error field and (for storage) a step label.
  • New metric: proxy_health_probe_failures_total{step="write|size|read|verify|delete"}, following the existing proxy_integrity_failures_total pattern.
  • Probe path layout: .healthcheck/<unix-nano>-<crypto/rand hex> — unique per call, collision-safe under concurrent replicas. Object is deleted after verify; delete failures surface as probe failures.
  • Transition-only logging (ok↔error), so Kubernetes-rate probing doesn't spam logs.
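
A minimal sketch of the caching wrapper and probe-path generation described above (illustrative names such as `healthCache` and the `probe` callback, not necessarily the exact code in `health.go`):

```go
package server

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"log/slog"
	"sync"
	"time"
)

// probePath builds ".healthcheck/<unix-nano>-<random hex>" so concurrent
// replicas never collide on the same probe object.
func probePath() (string, error) {
	buf := make([]byte, 8)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return fmt.Sprintf(".healthcheck/%d-%s", time.Now().UnixNano(), hex.EncodeToString(buf)), nil
}

// healthCache serves a cached probe verdict for `interval`; the probe
// callback stands in for the five-step round-trip.
type healthCache struct {
	mu       sync.Mutex
	probe    func(ctx context.Context) error
	interval time.Duration // 0 disables caching
	timeout  time.Duration // e.g. 10 * time.Second
	log      *slog.Logger
	last     time.Time
	lastErr  error
	healthy  bool // initialize to true so a healthy first probe stays quiet
}

func (c *healthCache) check() error {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.interval > 0 && !c.last.IsZero() && time.Since(c.last) < c.interval {
		return c.lastErr // serve the cached verdict
	}

	// Detached from the request context: a client disconnect mid-probe
	// cannot poison the cache with a spurious failure.
	ctx, cancel := context.WithTimeout(context.Background(), c.timeout)
	defer cancel()

	err := c.probe(ctx)
	c.last, c.lastErr = time.Now(), err

	// Transition-only logging: one log line per ok ↔ error flip.
	if ok := err == nil; ok != c.healthy {
		if ok {
			c.log.Info("storage health probe recovered")
		} else {
			c.log.Error("storage health probe failing", "err", err)
		}
		c.healthy = ok
	}
	return err
}
```

A failing storage check would then presumably render along the lines of `{"status":"error","checks":{"database":{"status":"ok"},"storage":{"status":"error","step":"delete","error":"..."}}}`, per the error/step fields described above.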

Behavioral notes / breaking changes

  • Response shape: any monitor that grep'd the body for "ok" will break. Status-code-based monitors keep working. Documented in README's new ### Health Check subsection and in the regenerated Swagger.
  • Probe-object cleanup: if Delete fails, the probe object is left under .healthcheck/. With a 30s TTL and a continuously-failing delete that's ~3 KB/hour per replica. The proxy_health_probe_failures_total{step="delete"} counter surfaces this. A future Storage.List extension would enable a startup sweep — explicitly out of scope here.
  • Deliberate spec deviation: health.go calls rc.Close() explicitly (not deferred) between ReadAll and Delete so the file handle is released before deletion. On Windows the deferred-close ordering caused Delete to fail with "file in use" — caught when wiring up TestHealthEndpoint against the real filesystem backend. Commented in the source.
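
A condensed fragment showing the shape of that fix (steps 3–5 of the probe; the `Retrieve`/`Delete` signatures here are assumptions, not necessarily the project's actual storage interface):

```go
// Step 3: read the probe object back.
rc, err := s.Retrieve(ctx, path)
if err != nil {
	return &probeError{step: "read", err: err}
}
data, readErr := io.ReadAll(rc)
// Close explicitly rather than `defer rc.Close()`: a deferred close would
// run only after Delete below, and on Windows deleting a still-open file
// fails with "file in use".
if cerr := rc.Close(); cerr != nil && readErr == nil {
	readErr = cerr
}
if readErr != nil {
	return &probeError{step: "read", err: readErr}
}
// Step 4: verify the payload round-tripped intact.
if !bytes.Equal(data, payload) {
	return &probeError{step: "verify", err: errors.New("payload mismatch")}
}
// Step 5: clean up; a failed delete is itself a probe failure.
if err := s.Delete(ctx, path); err != nil {
	return &probeError{step: "delete", err: err}
}
return nil
```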

Untested

I have not validated this against a remote backend (S3/Azure).


Copilot AI left a comment


Pull request overview

This PR enhances the /health endpoint to actively verify storage backend health (write → size-check → read → verify → delete) alongside the existing database check, and updates the endpoint contract to return a structured JSON report (while keeping HTTP status code semantics: 200 healthy / 503 unhealthy). It also introduces caching for storage probe results to limit backend load and adds a Prometheus counter to surface probe failures by step.
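
Because the status-code semantics are preserved, an existing external monitor needs nothing beyond something like the following (an illustrative stand-alone checker, not part of this PR; the URL is a placeholder):

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://localhost:8080/health") // example address
	if err != nil {
		fmt.Fprintln(os.Stderr, "health check failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK { // 503 means unhealthy
		fmt.Fprintln(os.Stderr, "unhealthy:", resp.Status)
		os.Exit(1)
	}
	fmt.Println("healthy")
}
```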

Changes:

  • Updated /health to return JSON HealthResponse with per-subsystem status and error details, including storage probe step labeling.
  • Added a cached storage backend probe (configurable interval + fixed timeout) with transition-only logging.
  • Added proxy_health_probe_failures_total{step=...} metric plus documentation/config updates (README, example config, swagger, architecture docs).

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `README.md` | Documents JSON `/health` response and new probe failure metric; updates endpoint description. |
| `internal/server/server.go` | Wires health cache into `Server` and updates `/health` handler to return `HealthResponse` JSON. |
| `internal/server/server_test.go` | Updates health endpoint tests for JSON shape and adds DB short-circuit test. |
| `internal/server/health.go` | Implements storage probe logic, caching, timeout behavior, and transition logging. |
| `internal/server/health_test.go` | Adds unit tests for probe steps, cache semantics, concurrency, logging transitions, and metrics increment behavior. |
| `internal/metrics/metrics.go` | Introduces and registers `proxy_health_probe_failures_total` and a helper increment function. |
| `internal/config/config.go` | Adds `HealthConfig` and env var wiring for `PROXY_HEALTH_STORAGE_PROBE_INTERVAL`. |
| `docs/swagger/swagger.json` | Regenerates swagger to reflect the JSON `/health` schema (`HealthResponse`). |
| `docs/swagger/docs.go` | Regenerates embedded swagger template with the JSON `/health` schema (`HealthResponse`). |
| `docs/architecture.md` | Updates architecture documentation to describe the new `/health` storage probing behavior. |
| `config.example.yaml` | Adds `health.storage_probe_interval` example configuration. |
| `cmd/proxy/main.go` | Updates env var help text to include `PROXY_HEALTH_STORAGE_PROBE_INTERVAL`. |
Files not reviewed (1)
  • docs/swagger/docs.go: Language not supported


Comment thread: `internal/server/health.go`, lines +61 to +69

```go
// 1. Store
size, _, err := s.Store(ctx, path, bytes.NewReader(payload))
if err != nil {
	return &probeError{step: "write", err: err}
}
// 2. Size check
if size != int64(len(payload)) {
	return &probeError{step: "size", err: fmt.Errorf("wrote %d bytes, expected %d", size, len(payload))}
}
```
Comment thread: `internal/config/config.go`

```go
	// Set to "0" to probe on every /health request (useful for low-traffic deployments).
	StorageProbeInterval string `json:"storage_probe_interval" yaml:"storage_probe_interval"`
}
```
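
Presumably the string is resolved to a `time.Duration` along these lines (a sketch; `resolveProbeInterval` and the exact default handling are assumptions based on the PR description):

```go
package config

import (
	"fmt"
	"time"
)

const defaultProbeInterval = 30 * time.Second // default per the PR description

// resolveProbeInterval maps the config string onto a probe cache TTL.
func resolveProbeInterval(raw string) (time.Duration, error) {
	switch raw {
	case "":
		return defaultProbeInterval, nil
	case "0":
		return 0, nil // caching disabled: probe on every /health request
	}
	d, err := time.ParseDuration(raw) // accepts "30s", "2m", ...
	if err != nil {
		return 0, fmt.Errorf("health.storage_probe_interval: %w", err)
	}
	if d < 0 {
		return 0, fmt.Errorf("health.storage_probe_interval must not be negative, got %s", d)
	}
	return d, nil
}
```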

Comment thread: `README.md`

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `proxy_storage_operation_duration_seconds` | histogram | `operation` | Storage read/write latency |
| `proxy_storage_errors_total` | counter | `operation` | Storage read/write failures |
| `proxy_active_requests` | gauge | | In-flight requests |
| `proxy_health_probe_failures_total` | counter | `step` | Storage health probe failures by failing step (`write`, `size`, `read`, `verify`, `delete`) |
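
For orientation, a counter of this shape is conventionally declared and bumped like so (a sketch of what `internal/metrics/metrics.go` plausibly contains; the helper name is an assumption):

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// HealthProbeFailures counts storage health probe failures by failing step.
var HealthProbeFailures = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "proxy_health_probe_failures_total",
		Help: "Storage health probe failures by failing step.",
	},
	[]string{"step"},
)

// IncHealthProbeFailure records a failure for one step, e.g. "verify".
func IncHealthProbeFailure(step string) {
	HealthProbeFailures.WithLabelValues(step).Inc()
}
```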
Comment thread: `docs/architecture.md`

```diff
 - Templates are embedded in the binary via `//go:embed`
 - Enrichment API for package metadata, vulnerability scanning, and outdated detection
-- Health, stats, and Prometheus metrics endpoints
+- Health, stats, and Prometheus metrics endpoints. `/health` runs an active write → read → verify → delete probe against the storage backend and returns a structured JSON response (`HealthResponse`) with `"ok"` / `"error"` status per subsystem. Probe results are cached (default 30 s, configurable via `health.storage_probe_interval`) to avoid overwhelming remote backends.
```