refactor(request-log): gateway is sole writer of request log#205
refactor(request-log): gateway is sole writer of request log#205behinddwalls wants to merge 2 commits into
Conversation
## Summary ### Why? The request log had two persistence paths: the gateway wrote some entries directly, while the orchestrator ran the `log`-topic consumer that wrote all downstream entries to storage. Having the orchestrator persist the request log blurs ownership — the orchestrator is a pipeline that should only emit events, not own the request-log table. Concentrating all writes in the gateway gives a single owner for the request log and keeps the orchestrator free of request-log storage writes. ### What? The request-log persistence consumer moves from the orchestrator to the gateway: - Move `submitqueue/orchestrator/controller/log/` → `submitqueue/gateway/controller/log/` (importpath, doc comment, and default consumer group `orchestrator-log` → `gateway-log`). Logic is unchanged. - Orchestrator: `TopicKeyLog` becomes publish-only (subscription dropped), the log controller registration and import are removed, controller count 11 → 10. It still publishes via `submitqueue/core/request.PublishLog`. - Gateway: builds a consumer (generic + mysql classifiers), registers the moved log controller on `TopicKeyLog` with a subscription (group `gateway-log`), starts it, and drains it with `Stop(30000)` on shutdown — preserving the 128+SIGTERM graceful-exit contract. - Add `HOSTNAME=gateway-dev` to both gateway compose files for a stable subscriber name; update the workflow RFC and gateway README. - Tests: add a gateway integration test that publishes to the log topic (as the orchestrator does) and asserts the gateway consumer persists it, and an e2e test that lands a request and asserts Status advances to `started` — exercising the publish→consume→persist path across both services. The gateway keeps its two synchronous direct writes (`accepted` on Land, `cancelling` on Cancel) for read-your-write visibility at RPC return. Both are gateway writes, so the invariant holds: only the gateway persists the request log; the orchestrator only publishes. This works because gateway and orchestrator already share the same queue and app databases. ## Test Plan - ✅ `bazel build` of both servers + the moved package - ✅ `make test` — unit tests pass (incl. the moved log_test) - ✅ `make check-gazelle`, `make check-tidy`, `make lint` (fmt + license) - ✅ `make integration-test-submitqueue-gateway` — new `TestRequestLogConsumer` verifies the gateway consumer persists a log entry published to the log topic - ✅ `make e2e-test` — new `TestLandRequest_PersistsStartedLogViaGatewayConsumer` verifies an orchestrator-published `started` log is persisted by the gateway and readable via Status; both services still exit 128+SIGTERM on shutdown
cadd1b7 to
54f4c7b
Compare
sbalabanov
left a comment
There was a problem hiding this comment.
Eventually() in tests comment is the only blocking comment.
Side note: we should probably describe in a design doc the ownership of queues/databases, tiering of queues, and the flow of messages across queues.
Side note 2: dlq orchestrator controllers need to change to PublishLog()
|
|
||
| s.log.Logf("Published 'started' log for sqid=%s; waiting for gateway consumer to persist it", sqid) | ||
|
|
||
| require.Eventually(t, func() bool { |
There was a problem hiding this comment.
I think we do not do "Eventually()" with hard-coded timeouts even in integration tests?
There was a problem hiding this comment.
The log consumer runs inside the gateway-service container, so this suite can only observe persistence black-box via the Status RPC — there's no in-process channel/HookSignal to wait on across the container boundary (unlike the messagequeue suite). In 43670be I extracted named constants (persistTimeout/persistPollInterval) and added a comment explaining the bounded poll is the deterministic-enough analog here. If you'd prefer, I can instead restructure the test to run the consumer in-process for a signal-based wait — happy to do that if you think it's worth the added complexity (it would need a distinct consumer group to avoid contending with the container's consumer).
- Make the log-consumer subscriber name unique per instance (hostname+PID) so co-located gateway processes don't contend for the same partition lease. - Report the gRPC server error first in the shutdown errors.Join (it is the primary failure; consumer-stop is secondary cleanup). - Clarify in the README that the gateway is the sole owner (writer and reader) of the request log; Status/Cancel read directly, orchestrator only publishes. - Extract named poll constants (persistTimeout/persistPollInterval) in the gateway integration and e2e suites with a comment explaining that the in-container consumer is observed black-box via Status, so a bounded poll is used in lieu of an in-process channel/HookSignal wait. Follow-ups split out: design doc (#211) and DLQ PublishLog() (#212). Co-Authored-By: Oz <oz-agent@warp.dev>
Summary
Why?
The request log had two persistence paths: the gateway wrote some entries
directly, while the orchestrator ran the
log-topic consumer that wrote alldownstream entries to storage. Having the orchestrator persist the request log
blurs ownership — the orchestrator is a pipeline that should only emit events,
not own the request-log table. Concentrating all writes in the gateway gives a
single owner for the request log and keeps the orchestrator free of request-log
storage writes.
What?
The request-log persistence consumer moves from the orchestrator to the gateway:
submitqueue/orchestrator/controller/log/→submitqueue/gateway/controller/log/(importpath, doc comment, and default consumer group
orchestrator-log→gateway-log). Logic is unchanged.TopicKeyLogbecomes publish-only (subscription dropped), thelog controller registration and import are removed, controller count 11 → 10.
It still publishes via
submitqueue/core/request.PublishLog.log controller on
TopicKeyLogwith a subscription (groupgateway-log),starts it, and drains it with
Stop(30000)on shutdown — preserving the128+SIGTERM graceful-exit contract.
HOSTNAME=gateway-devto both gateway compose files for a stablesubscriber name; update the workflow RFC and gateway README.
orchestrator does) and asserts the gateway consumer persists it, and an e2e
test that lands a request and asserts Status advances to
started—exercising the publish→consume→persist path across both services.
The gateway keeps its two synchronous direct writes (
acceptedon Land,cancellingon Cancel) for read-your-write visibility at RPC return. Both aregateway writes, so the invariant holds: only the gateway persists the request
log; the orchestrator only publishes. This works because gateway and
orchestrator already share the same queue and app databases.
Test Plan
bazel buildof both servers + the moved packagemake test— unit tests pass (incl. the moved log_test)make check-gazelle,make check-tidy,make lint(fmt + license)make integration-test-submitqueue-gateway— newTestRequestLogConsumerverifies the gateway consumer persists a log entry published to the log topic
make e2e-test— newTestLandRequest_PersistsStartedLogViaGatewayConsumerverifies an orchestrator-published
startedlog is persisted by the gatewayand readable via Status; both services still exit 128+SIGTERM on shutdown
Issues
Stack