Add storage backend probe to /health (closes #73) #119
Open
larsborn wants to merge 11 commits into
Conversation
…TestServer. Also fix a Windows file-locking issue in storageProbe: close the reader explicitly before Delete so the file handle is released prior to os.Remove.
Pull request overview
This PR enhances the /health endpoint to actively verify storage backend health (write → size-check → read → verify → delete) alongside the existing database check, and updates the endpoint contract to return a structured JSON report (while keeping HTTP status code semantics: 200 healthy / 503 unhealthy). It also introduces caching for storage probe results to limit backend load and adds a Prometheus counter to surface probe failures by step.
Changes:

- Updated `/health` to return a JSON `HealthResponse` with per-subsystem status and error details, including storage probe step labeling.
- Added a cached storage backend probe (configurable interval + fixed timeout) with transition-only logging.
- Added the `proxy_health_probe_failures_total{step=...}` metric plus documentation/config updates (README, example config, swagger, architecture docs).
Reviewed changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| README.md | Documents JSON /health response and new probe failure metric; updates endpoint description. |
| internal/server/server.go | Wires health cache into Server and updates /health handler to return HealthResponse JSON. |
| internal/server/server_test.go | Updates health endpoint tests for JSON shape and adds DB short-circuit test. |
| internal/server/health.go | Implements storage probe logic, caching, timeout behavior, and transition logging. |
| internal/server/health_test.go | Adds unit tests for probe steps, cache semantics, concurrency, logging transitions, and metrics increment behavior. |
| internal/metrics/metrics.go | Introduces and registers proxy_health_probe_failures_total and helper increment function. |
| internal/config/config.go | Adds HealthConfig and env var wiring for PROXY_HEALTH_STORAGE_PROBE_INTERVAL. |
| docs/swagger/swagger.json | Regenerates swagger to reflect JSON /health schema (HealthResponse). |
| docs/swagger/docs.go | Regenerates embedded swagger template with JSON /health schema (HealthResponse). |
| docs/architecture.md | Updates architecture documentation to describe new /health storage probing behavior. |
| config.example.yaml | Adds health.storage_probe_interval example configuration. |
| cmd/proxy/main.go | Updates env var help text to include PROXY_HEALTH_STORAGE_PROBE_INTERVAL. |
Files not reviewed (1)
- docs/swagger/docs.go: Language not supported
Comment on lines +61 to +69
```go
// 1. Store
size, _, err := s.Store(ctx, path, bytes.NewReader(payload))
if err != nil {
	return &probeError{step: "write", err: err}
}
// 2. Size check
if size != int64(len(payload)) {
	return &probeError{step: "size", err: fmt.Errorf("wrote %d bytes, expected %d", size, len(payload))}
}
```
```go
// Set to "0" to probe on every /health request (useful for low-traffic deployments).
StorageProbeInterval string `json:"storage_probe_interval" yaml:"storage_probe_interval"`
}
```
| `proxy_storage_operation_duration_seconds` | histogram | `operation` | Storage read/write latency |
| `proxy_storage_errors_total` | counter | `operation` | Storage read/write failures |
| `proxy_active_requests` | gauge | | In-flight requests |
| `proxy_health_probe_failures_total` | counter | `step` | Storage health probe failures by failing step (`write`, `size`, `read`, `verify`, `delete`) |
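A stdlib-only stand-in for the step-labeled counter in the last row (the project presumably registers a Prometheus `CounterVec` with a single `step` label; this sketch only shows the labeling shape, not the real metrics code):

```go
package main

import (
	"fmt"
	"sync"
)

// stepCounter mimics proxy_health_probe_failures_total{step=...}: one
// monotonically increasing count per failing probe step.
type stepCounter struct {
	mu sync.Mutex
	m  map[string]int
}

// Inc increments the counter for the given step label.
func (c *stepCounter) Inc(step string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[step]++
}

func main() {
	c := &stepCounter{m: map[string]int{}}
	c.Inc("write")
	c.Inc("delete")
	c.Inc("delete")
	fmt.Println(c.m["write"], c.m["delete"]) // 1 2
}
```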
- Templates are embedded in the binary via `//go:embed`
- Enrichment API for package metadata, vulnerability scanning, and outdated detection
- Health, stats, and Prometheus metrics endpoints. `/health` runs an active write → read → verify → delete probe against the storage backend and returns a structured JSON response (`HealthResponse`) with `"ok"` / `"error"` status per subsystem. Probe results are cached (default 30 s, configurable via `health.storage_probe_interval`) to avoid overwhelming remote backends.
Summary
`/health` now performs an active write → size-check → read → verify → delete round-trip against the configured storage backend, in addition to the existing database check. Closes "Add storage backend probe to health check" #73.

- Probe results are cached (`health.storage_probe_interval`, env `PROXY_HEALTH_STORAGE_PROBE_INTERVAL`, default `30s`; `"0"` disables caching). The probe runs under a detached `context.WithTimeout(context.Background(), 10s)` so a client disconnect can't poison the cache.
- The endpoint switches from plain text (`"ok"` / `"database error"`) to JSON: `{"status":"ok","checks":{"database":{"status":"ok"},"storage":{"status":"ok"}}}`. Failing checks carry an `error` field and (for storage) a `step` label.
- New metric: `proxy_health_probe_failures_total{step="write|size|read|verify|delete"}`, following the existing `proxy_integrity_failures_total` pattern.
- Probe object key: `.healthcheck/<unix-nano>-<crypto/rand hex>`, unique per call and collision-safe under concurrent replicas. The object is deleted after verify; delete failures surface as probe failures.

Behavioral notes / breaking changes
- Monitors that parse the plain-text body `"ok"` will break; status-code-based monitors keep working. Documented in the README's new `### Health Check` subsection and in the regenerated Swagger.
- If `Delete` fails, the probe object is left under `.healthcheck/`. With a 30s TTL and a continuously failing delete, that's ~3 KB/hour per replica. The `proxy_health_probe_failures_total{step="delete"}` counter surfaces this. A future `Storage.List` extension would enable a startup sweep; that is explicitly out of scope here.
- `health.go` calls `src.Close()` explicitly (not deferred) between `ReadAll` and `Delete` so the file handle is released before deletion. On Windows the deferred-close ordering caused `Delete` to fail with "file in use"; this was caught when wiring up `TestHealthEndpoint` against the real filesystem backend. Commented in the source.

Untested
I have not validated this against a remote backend (S3/Azure).