
Add storage backend probe to /health (closes #73) #119

Open
larsborn wants to merge 11 commits into git-pkgs:main from larsborn:feat/storage-health-probe

Conversation


larsborn commented on May 12, 2026

Summary

  • /health now performs an active write → size-check → read → verify → delete round-trip against the configured storage backend, in addition to the existing database check. Closes #73 (Add storage backend probe to health check).
  • The result is cached for a configurable interval (`health.storage_probe_interval`, env `PROXY_HEALTH_STORAGE_PROBE_INTERVAL`, default 30s; "0" disables caching). The probe runs under a detached `context.WithTimeout(context.Background(), 10s)` so a client disconnect can't poison the cache; see the sketch after this list.
  • Response shape changes from plain text ("ok" / "database error") to JSON:
    {"status":"ok","checks":{"database":{"status":"ok"},"storage":{"status":"ok"}}}
    Status codes are unchanged (200 healthy / 503 unhealthy). Failures include an error field and (for storage) a step label.
  • New metric: proxy_health_probe_failures_total{step="write|size|read|verify|delete"}, following the existing proxy_integrity_failures_total pattern.
  • Probe path layout: .healthcheck/<unix-nano>-<crypto/rand hex> — unique per call, collision-safe under concurrent replicas. Object is deleted after verify; delete failures surface as probe failures.
  • Transition-only logging (ok↔error), so Kubernetes-rate probing doesn't spam logs.
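
A minimal sketch of the caching wrapper and probe-path generation described above (illustrative names such as `healthCache` and the `probe` callback, not necessarily the exact code in `health.go`):

```go
package server

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"log/slog"
	"sync"
	"time"
)

// probePath builds ".healthcheck/<unix-nano>-<random hex>" so concurrent
// replicas never collide on the same probe object.
func probePath() (string, error) {
	buf := make([]byte, 8)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return fmt.Sprintf(".healthcheck/%d-%s", time.Now().UnixNano(), hex.EncodeToString(buf)), nil
}

// healthCache serves a cached probe verdict for `interval`; the probe
// callback stands in for the five-step round-trip.
type healthCache struct {
	mu       sync.Mutex
	probe    func(ctx context.Context) error
	interval time.Duration // 0 disables caching
	timeout  time.Duration // e.g. 10 * time.Second
	log      *slog.Logger
	last     time.Time
	lastErr  error
	healthy  bool // initialize to true so a healthy first probe stays quiet
}

func (c *healthCache) check() error {
	c.mu.Lock()
	defer c.mu.Unlock()

	if c.interval > 0 && !c.last.IsZero() && time.Since(c.last) < c.interval {
		return c.lastErr // serve the cached verdict
	}

	// Detached from the request context: a client disconnect mid-probe
	// cannot poison the cache with a spurious failure.
	ctx, cancel := context.WithTimeout(context.Background(), c.timeout)
	defer cancel()

	err := c.probe(ctx)
	c.last, c.lastErr = time.Now(), err

	// Transition-only logging: one log line per ok ↔ error flip.
	if ok := err == nil; ok != c.healthy {
		if ok {
			c.log.Info("storage health probe recovered")
		} else {
			c.log.Error("storage health probe failing", "err", err)
		}
		c.healthy = ok
	}
	return err
}
```

A failing storage check would then presumably render along the lines of `{"status":"error","checks":{"database":{"status":"ok"},"storage":{"status":"error","step":"delete","error":"..."}}}`, per the error/step fields described above.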

Behavioral notes / breaking changes

  • Response shape: any monitor that grep'd the body for "ok" will break. Status-code-based monitors keep working. Documented in README's new ### Health Check subsection and in the regenerated Swagger.
  • Probe-object cleanup: if Delete fails, the probe object is left under .healthcheck/. With a 30s TTL and a continuously-failing delete that's ~3 KB/hour per replica. The proxy_health_probe_failures_total{step="delete"} counter surfaces this. A future Storage.List extension would enable a startup sweep — explicitly out of scope here.
  • Deliberate spec deviation: health.go calls rc.Close() explicitly (not deferred) between ReadAll and Delete so the file handle is released before deletion. On Windows the deferred-close ordering caused Delete to fail with "file in use" — caught when wiring up TestHealthEndpoint against the real filesystem backend. Commented in the source.
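
A condensed fragment showing the shape of that fix (steps 3–5 of the probe; the `Retrieve`/`Delete` signatures here are assumptions, not necessarily the project's actual storage interface):

```go
// Step 3: read the probe object back.
rc, err := s.Retrieve(ctx, path)
if err != nil {
	return &probeError{step: "read", err: err}
}
data, readErr := io.ReadAll(rc)
// Close explicitly rather than `defer rc.Close()`: a deferred close would
// run only after Delete below, and on Windows deleting a still-open file
// fails with "file in use".
if cerr := rc.Close(); cerr != nil && readErr == nil {
	readErr = cerr
}
if readErr != nil {
	return &probeError{step: "read", err: readErr}
}
// Step 4: verify the payload round-tripped intact.
if !bytes.Equal(data, payload) {
	return &probeError{step: "verify", err: errors.New("payload mismatch")}
}
// Step 5: clean up; a failed delete is itself a probe failure.
if err := s.Delete(ctx, path); err != nil {
	return &probeError{step: "delete", err: err}
}
return nil
```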

Untested

I have not validated this against a remote backend (S3/Azure).


Copilot AI left a comment


Pull request overview

This PR enhances the /health endpoint to actively verify storage backend health (write → size-check → read → verify → delete) alongside the existing database check, and updates the endpoint contract to return a structured JSON report (while keeping HTTP status code semantics: 200 healthy / 503 unhealthy). It also introduces caching for storage probe results to limit backend load and adds a Prometheus counter to surface probe failures by step.
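
Because the status-code semantics are preserved, an existing external monitor needs nothing beyond something like the following (an illustrative stand-alone checker, not part of this PR; the URL is a placeholder):

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get("http://localhost:8080/health") // example address
	if err != nil {
		fmt.Fprintln(os.Stderr, "health check failed:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK { // 503 means unhealthy
		fmt.Fprintln(os.Stderr, "unhealthy:", resp.Status)
		os.Exit(1)
	}
	fmt.Println("healthy")
}
```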

Changes:

  • Updated /health to return JSON HealthResponse with per-subsystem status and error details, including storage probe step labeling.
  • Added a cached storage backend probe (configurable interval + fixed timeout) with transition-only logging.
  • Added proxy_health_probe_failures_total{step=...} metric plus documentation/config updates (README, example config, swagger, architecture docs).

Reviewed changes

Copilot reviewed 11 out of 12 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `README.md` | Documents JSON `/health` response and new probe failure metric; updates endpoint description. |
| `internal/server/server.go` | Wires health cache into `Server` and updates `/health` handler to return `HealthResponse` JSON. |
| `internal/server/server_test.go` | Updates health endpoint tests for JSON shape and adds DB short-circuit test. |
| `internal/server/health.go` | Implements storage probe logic, caching, timeout behavior, and transition logging. |
| `internal/server/health_test.go` | Adds unit tests for probe steps, cache semantics, concurrency, logging transitions, and metrics increment behavior. |
| `internal/metrics/metrics.go` | Introduces and registers `proxy_health_probe_failures_total` and a helper increment function. |
| `internal/config/config.go` | Adds `HealthConfig` and env var wiring for `PROXY_HEALTH_STORAGE_PROBE_INTERVAL`. |
| `docs/swagger/swagger.json` | Regenerates swagger to reflect the JSON `/health` schema (`HealthResponse`). |
| `docs/swagger/docs.go` | Regenerates embedded swagger template with the JSON `/health` schema (`HealthResponse`). |
| `docs/architecture.md` | Updates architecture documentation to describe the new `/health` storage probing behavior. |
| `config.example.yaml` | Adds `health.storage_probe_interval` example configuration. |
| `cmd/proxy/main.go` | Updates env var help text to include `PROXY_HEALTH_STORAGE_PROBE_INTERVAL`. |
Files not reviewed (1)
  • docs/swagger/docs.go: Language not supported


Comment thread: `internal/server/health.go`, lines +61 to +69

```go
// 1. Store
size, _, err := s.Store(ctx, path, bytes.NewReader(payload))
if err != nil {
	return &probeError{step: "write", err: err}
}
// 2. Size check
if size != int64(len(payload)) {
	return &probeError{step: "size", err: fmt.Errorf("wrote %d bytes, expected %d", size, len(payload))}
}
```
Comment thread: `internal/config/config.go`

```go
	// Set to "0" to probe on every /health request (useful for low-traffic deployments).
	StorageProbeInterval string `json:"storage_probe_interval" yaml:"storage_probe_interval"`
}
```
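
Presumably the string is resolved to a `time.Duration` along these lines (a sketch; `resolveProbeInterval` and the exact default handling are assumptions based on the PR description):

```go
package config

import (
	"fmt"
	"time"
)

const defaultProbeInterval = 30 * time.Second // default per the PR description

// resolveProbeInterval maps the config string onto a probe cache TTL.
func resolveProbeInterval(raw string) (time.Duration, error) {
	switch raw {
	case "":
		return defaultProbeInterval, nil
	case "0":
		return 0, nil // caching disabled: probe on every /health request
	}
	d, err := time.ParseDuration(raw) // accepts "30s", "2m", ...
	if err != nil {
		return 0, fmt.Errorf("health.storage_probe_interval: %w", err)
	}
	if d < 0 {
		return 0, fmt.Errorf("health.storage_probe_interval must not be negative, got %s", d)
	}
	return d, nil
}
```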

Comment thread: `README.md`

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `proxy_storage_operation_duration_seconds` | histogram | `operation` | Storage read/write latency |
| `proxy_storage_errors_total` | counter | `operation` | Storage read/write failures |
| `proxy_active_requests` | gauge | | In-flight requests |
| `proxy_health_probe_failures_total` | counter | `step` | Storage health probe failures by failing step (`write`, `size`, `read`, `verify`, `delete`) |
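
For orientation, a counter of this shape is conventionally declared and bumped like so (a sketch of what `internal/metrics/metrics.go` plausibly contains; the helper name is an assumption):

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// HealthProbeFailures counts storage health probe failures by failing step.
var HealthProbeFailures = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "proxy_health_probe_failures_total",
		Help: "Storage health probe failures by failing step.",
	},
	[]string{"step"},
)

// IncHealthProbeFailure records a failure for one step, e.g. "verify".
func IncHealthProbeFailure(step string) {
	HealthProbeFailures.WithLabelValues(step).Inc()
}
```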
Comment thread: `docs/architecture.md`

```diff
 - Templates are embedded in the binary via `//go:embed`
 - Enrichment API for package metadata, vulnerability scanning, and outdated detection
-- Health, stats, and Prometheus metrics endpoints
+- Health, stats, and Prometheus metrics endpoints. `/health` runs an active write → read → verify → delete probe against the storage backend and returns a structured JSON response (`HealthResponse`) with `"ok"` / `"error"` status per subsystem. Probe results are cached (default 30 s, configurable via `health.storage_probe_interval`) to avoid overwhelming remote backends.
```