Skip to content

refactor(request-log): gateway is sole writer of request log#205

Open
behinddwalls wants to merge 2 commits into
mainfrom
request-log-gateway-consumer
Open

refactor(request-log): gateway is sole writer of request log#205
behinddwalls wants to merge 2 commits into
mainfrom
request-log-gateway-consumer

Conversation

@behinddwalls
Copy link
Copy Markdown
Collaborator

@behinddwalls behinddwalls commented Jun 5, 2026

Summary

Why?

The request log had two persistence paths: the gateway wrote some entries
directly, while the orchestrator ran the log-topic consumer that wrote all
downstream entries to storage. Having the orchestrator persist the request log
blurs ownership — the orchestrator is a pipeline that should only emit events,
not own the request-log table. Concentrating all writes in the gateway gives a
single owner for the request log and keeps the orchestrator free of request-log
storage writes.

What?

The request-log persistence consumer moves from the orchestrator to the gateway:

  • Move submitqueue/orchestrator/controller/log/submitqueue/gateway/controller/log/
    (importpath, doc comment, and default consumer group orchestrator-log
    gateway-log). Logic is unchanged.
  • Orchestrator: TopicKeyLog becomes publish-only (subscription dropped), the
    log controller registration and import are removed, controller count 11 → 10.
    It still publishes via submitqueue/core/request.PublishLog.
  • Gateway: builds a consumer (generic + mysql classifiers), registers the moved
    log controller on TopicKeyLog with a subscription (group gateway-log),
    starts it, and drains it with Stop(30000) on shutdown — preserving the
    128+SIGTERM graceful-exit contract.
  • Add HOSTNAME=gateway-dev to both gateway compose files for a stable
    subscriber name; update the workflow RFC and gateway README.
  • Tests: add a gateway integration test that publishes to the log topic (as the
    orchestrator does) and asserts the gateway consumer persists it, and an e2e
    test that lands a request and asserts Status advances to started
    exercising the publish→consume→persist path across both services.

The gateway keeps its two synchronous direct writes (accepted on Land,
cancelling on Cancel) for read-your-write visibility at RPC return. Both are
gateway writes, so the invariant holds: only the gateway persists the request
log; the orchestrator only publishes. This works because gateway and
orchestrator already share the same queue and app databases.

Test Plan

  • bazel build of both servers + the moved package
  • make test — unit tests pass (incl. the moved log_test)
  • make check-gazelle, make check-tidy, make lint (fmt + license)
  • make integration-test-submitqueue-gateway — new TestRequestLogConsumer
    verifies the gateway consumer persists a log entry published to the log topic
  • make e2e-test — new TestLandRequest_PersistsStartedLogViaGatewayConsumer
    verifies an orchestrator-published started log is persisted by the gateway
    and readable via Status; both services still exit 128+SIGTERM on shutdown

Issues

Stack

  1. @ refactor(request-log): gateway is sole writer of request log #205
  2. docs(workflow): document queue ownership and database ownership model #213

## Summary

### Why?

The request log had two persistence paths: the gateway wrote some entries
directly, while the orchestrator ran the `log`-topic consumer that wrote all
downstream entries to storage. Having the orchestrator persist the request log
blurs ownership — the orchestrator is a pipeline that should only emit events,
not own the request-log table. Concentrating all writes in the gateway gives a
single owner for the request log and keeps the orchestrator free of request-log
storage writes.

### What?

The request-log persistence consumer moves from the orchestrator to the gateway:

- Move `submitqueue/orchestrator/controller/log/` → `submitqueue/gateway/controller/log/`
  (importpath, doc comment, and default consumer group `orchestrator-log` →
  `gateway-log`). Logic is unchanged.
- Orchestrator: `TopicKeyLog` becomes publish-only (subscription dropped), the
  log controller registration and import are removed, controller count 11 → 10.
  It still publishes via `submitqueue/core/request.PublishLog`.
- Gateway: builds a consumer (generic + mysql classifiers), registers the moved
  log controller on `TopicKeyLog` with a subscription (group `gateway-log`),
  starts it, and drains it with `Stop(30000)` on shutdown — preserving the
  128+SIGTERM graceful-exit contract.
- Add `HOSTNAME=gateway-dev` to both gateway compose files for a stable
  subscriber name; update the workflow RFC and gateway README.
- Tests: add a gateway integration test that publishes to the log topic (as the
  orchestrator does) and asserts the gateway consumer persists it, and an e2e
  test that lands a request and asserts Status advances to `started` —
  exercising the publish→consume→persist path across both services.

The gateway keeps its two synchronous direct writes (`accepted` on Land,
`cancelling` on Cancel) for read-your-write visibility at RPC return. Both are
gateway writes, so the invariant holds: only the gateway persists the request
log; the orchestrator only publishes. This works because gateway and
orchestrator already share the same queue and app databases.

## Test Plan

- ✅ `bazel build` of both servers + the moved package
- ✅ `make test` — unit tests pass (incl. the moved log_test)
- ✅ `make check-gazelle`, `make check-tidy`, `make lint` (fmt + license)
- ✅ `make integration-test-submitqueue-gateway` — new `TestRequestLogConsumer`
  verifies the gateway consumer persists a log entry published to the log topic
- ✅ `make e2e-test` — new `TestLandRequest_PersistsStartedLogViaGatewayConsumer`
  verifies an orchestrator-published `started` log is persisted by the gateway
  and readable via Status; both services still exit 128+SIGTERM on shutdown
@behinddwalls behinddwalls force-pushed the request-log-gateway-consumer branch from cadd1b7 to 54f4c7b Compare June 5, 2026 03:59
@behinddwalls behinddwalls marked this pull request as ready for review June 5, 2026 04:12
@behinddwalls behinddwalls requested review from a team and sbalabanov as code owners June 5, 2026 04:12
Copy link
Copy Markdown
Contributor

@sbalabanov sbalabanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually() in tests comment is the only blocking comment.
Side note: we should probably describe in a design doc the ownership of queues/databases, tiering of queues, and the flow of messages across queues.
Side note 2: dlq orchestrator controllers need to change to PublishLog()

Comment thread example/submitqueue/gateway/server/main.go Outdated
Comment thread example/submitqueue/gateway/server/main.go Outdated
Comment thread submitqueue/gateway/README.md Outdated

s.log.Logf("Published 'started' log for sqid=%s; waiting for gateway consumer to persist it", sqid)

require.Eventually(t, func() bool {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do not do "Eventually()" with hard-coded timeouts even in integration tests?

Copy link
Copy Markdown
Collaborator Author

@behinddwalls behinddwalls Jun 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log consumer runs inside the gateway-service container, so this suite can only observe persistence black-box via the Status RPC — there's no in-process channel/HookSignal to wait on across the container boundary (unlike the messagequeue suite). In 43670be I extracted named constants (persistTimeout/persistPollInterval) and added a comment explaining the bounded poll is the deterministic-enough analog here. If you'd prefer, I can instead restructure the test to run the consumer in-process for a signal-based wait — happy to do that if you think it's worth the added complexity (it would need a distinct consumer group to avoid contending with the container's consumer).

- Make the log-consumer subscriber name unique per instance (hostname+PID)
  so co-located gateway processes don't contend for the same partition lease.
- Report the gRPC server error first in the shutdown errors.Join (it is the
  primary failure; consumer-stop is secondary cleanup).
- Clarify in the README that the gateway is the sole owner (writer and reader)
  of the request log; Status/Cancel read directly, orchestrator only publishes.
- Extract named poll constants (persistTimeout/persistPollInterval) in the
  gateway integration and e2e suites with a comment explaining that the
  in-container consumer is observed black-box via Status, so a bounded poll is
  used in lieu of an in-process channel/HookSignal wait.

Follow-ups split out: design doc (#211) and DLQ PublishLog() (#212).

Co-Authored-By: Oz <oz-agent@warp.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants