TestJumpToDateEndpoint: Guard the parallel boundary subtests against millisecond collisions.#872
Conversation
…t millisecond collisions. Both subtests sample `time.Now()` and immediately probe a homeserver that stamps events using its own millisecond clock; when the two readings land in the same millisecond, `/timestamp_to_event`'s inclusive boundary semantics return an event the subtest does not want. `should_find_event_after_given_timestamp` races the trailing createRoom state event ahead of the message that a forward search is expected to find; `should_find_nothing_before_the_earliest_timestamp` races `m.room.create` against a backward search expected to find nothing. Pause around the relevant samples so the next homeserver stamp lands in a strictly later millisecond. Both flakes have been observed across multiple homeserver implementations on otherwise green runs. Closes matrix-org#868. Signed-off-by: Jason Volk <jason@zemos.net>
| as.MustJoinRoom(t, roomID, []spec.ServerName{ | ||
| deployment.GetFullyQualifiedHomeserverName(t, "hs1"), | ||
| }) | ||
|
|
There was a problem hiding this comment.
Feels like we need time.Sleep(tsBoundaryGuard) here as well
It's not strictly necessary here but for consistency sake and to make it clear that it's not the main part of the thing we're testing.
| as.MustJoinRoom(t, roomID, []spec.ServerName{ | ||
| deployment.GetFullyQualifiedHomeserverName(t, "hs1"), | ||
| }) | ||
|
|
There was a problem hiding this comment.
Feels like we need time.Sleep(tsBoundaryGuard) here as well
(the join could be in the ms as the messages we're sending)
…the remaining samples A whole-millisecond pause always carries a timestamp into the next millisecond bucket, so `tsBoundaryGuard` only needs to be 1ms. Extend the guard to the two same-timestamp topological subtests and to `createTestRoom`'s `eventB` sample for consistency, even though no case exercises a forward search from `eventB` yet. Signed-off-by: Jason Volk <jason@zemos.net>
| // eventA at a shared boundary millisecond. Nothing exercises this yet, but | ||
| // keep the helper honest for callers that eventually will. |
There was a problem hiding this comment.
Not very relevant here
| // eventA at a shared boundary millisecond. Nothing exercises this yet, but | |
| // keep the helper honest for callers that eventually will. | |
| // eventA at a shared boundary millisecond. |
| // tsBoundaryGuard is the pause inserted around each time.Now() sample in this | ||
| // file's helpers and subtests so that no homeserver-stamped origin_server_ts | ||
| // can collide with the sample at millisecond granularity. /timestamp_to_event | ||
| // returns the boundary event inclusively (forward picks the earliest event | ||
| // with ts >= query, backward picks the latest with ts <= query), so a shared | ||
| // millisecond between sample and event lets the wrong event win the boundary. | ||
| // time.Sleep pauses for at least its argument, and adding a whole millisecond | ||
| // to a timestamp always carries it into the next millisecond bucket, so 1ms is | ||
| // enough to separate the sample from every event stamped after the pause. |
There was a problem hiding this comment.
Extra fluff (especially LLM sounding)
| // tsBoundaryGuard is the pause inserted around each time.Now() sample in this | |
| // file's helpers and subtests so that no homeserver-stamped origin_server_ts | |
| // can collide with the sample at millisecond granularity. /timestamp_to_event | |
| // returns the boundary event inclusively (forward picks the earliest event | |
| // with ts >= query, backward picks the latest with ts <= query), so a shared | |
| // millisecond between sample and event lets the wrong event win the boundary. | |
| // time.Sleep pauses for at least its argument, and adding a whole millisecond | |
| // to a timestamp always carries it into the next millisecond bucket, so 1ms is | |
| // enough to separate the sample from every event stamped after the pause. | |
| // tsBoundaryGuard is a pause inserted around (before and after) where we create events | |
| // so that `time.Now()` samples and subseqent event `origin_server_ts` don't collide at | |
| // the same millisecond granularity. /timestamp_to_event returns the boundary event | |
| // inclusively (forward picks the earliest event with ts >= query, backward picks the | |
| // latest with ts <= query), so a shared millisecond between events, means the wrong | |
| // event can be picked. Adding one whole millisecond to a timestamp always carries it | |
| // into the next millisecond bucket, so 1ms is enough to separate the sample from every | |
| // event stamped after the pause. |
| // state. | ||
| time.Sleep(tsBoundaryGuard) | ||
|
|
||
| timeBeforeEventA := time.Now() |
There was a problem hiding this comment.
| timeBeforeEventA := time.Now() | |
| timeBeforeEventA := time.Now() | |
| time.Sleep(tsBoundaryGuard) |
| @@ -358,7 +401,7 @@ | |||
| timeAfterEventB := time.Now() | |||
There was a problem hiding this comment.
| timeAfterEventB := time.Now() | |
| time.Sleep(tsBoundaryGuard) | |
| timeAfterEventB := time.Now() |
| time.Sleep(tsBoundaryGuard) | ||
| timeBeforeEventB := time.Now() |
There was a problem hiding this comment.
Actually, we probably need a time.Sleep(tsBoundaryGuard) after the sample for the same reasons (timeBeforeEventB should actually come 1ms before we send event B in all cases)
| time.Sleep(tsBoundaryGuard) | |
| timeBeforeEventB := time.Now() | |
| time.Sleep(tsBoundaryGuard) | |
| timeBeforeEventB := time.Now() | |
| time.Sleep(tsBoundaryGuard) |
(comment above could use work as well)
| @@ -358,7 +401,7 @@ | |||
| timeAfterEventB := time.Now() | |||
|
|
|||
| eventA = &eventTime{EventID: eventAID, BeforeTimestamp: timeBeforeEventA, AfterTimestamp: timeAfterEventA} | |||
There was a problem hiding this comment.
No need for timeAfterEventA if we're going this route.
| eventA = &eventTime{EventID: eventAID, BeforeTimestamp: timeBeforeEventA, AfterTimestamp: timeAfterEventA} | |
| eventA = &eventTime{EventID: eventAID, BeforeTimestamp: timeBeforeEventA, AfterTimestamp: timeBeforeEventB} |
Or we could keep with the previous pattern and just need to move the delays there instead and get rid of timeBeforeEventB
| // Just a sanity check that we're not leaking anything from the `/timestamp_to_event` endpoint | ||
| t.Run("should not be able to query a private room you are not a member of", func(t *testing.T) { | ||
| t.Parallel() | ||
| timeBeforeRoomCreation := time.Now() |
There was a problem hiding this comment.
For consistency sake (make sure we promote good patterns for future tests)
| timeBeforeRoomCreation := time.Now() | |
| time.Sleep(tsBoundaryGuard) |
| // Just a sanity check that we're not leaking anything from the `/timestamp_to_event` endpoint | ||
| t.Run("should not be able to query a public room you are not a member of", func(t *testing.T) { | ||
| t.Parallel() | ||
| timeBeforeRoomCreation := time.Now() |
There was a problem hiding this comment.
For consistency sake (make sure we promote good patterns for future tests)
| timeBeforeRoomCreation := time.Now() | |
| time.Sleep(tsBoundaryGuard) |
|
|
||
| t.Run("when looking backwards before the room was created, should be able to find event that was imported", func(t *testing.T) { | ||
| t.Parallel() | ||
| timeBeforeRoomCreation := time.Now() |
There was a problem hiding this comment.
| timeBeforeRoomCreation := time.Now() | |
| time.Sleep(tsBoundaryGuard) |
Closes #868.
The two
parallelboundary subtestsshould_find_event_after_given_timestampandshould_find_nothing_before_the_earliest_timestampflake against any homeserver fast enough to handle acreateRoomround-trip inside a single millisecond. The race is in the test, not the implementation:/timestamp_to_eventis millisecond-granular by spec and resolves the boundary inclusively (forward picks the earliest event withts >= query, backward picks the latest withts <= query), so when atime.Now()sample collides with anorigin_server_tsthe homeserver just stamped, the inclusive match returns an event the subtest does not want.should_find_event_after_given_timestampcreateTestRoomreadstime.Now()immediately afterMustCreateRoomreturns. The trailing state event of thepublic_chatpreset's createRoom batch (typicallym.room.guest_access, aftercreate,member/join,power_levels,join_rules,history_visibility) lands in the same millisecond as that sample on a fast deploy. The forward search anchored ontimeBeforeEventAthen iterates events withts >= samplefrom the earliest, hits the trailing state event beforeMessage A, and the assertion fails with the wrongevent_id.should_find_nothing_before_the_earliest_timestampThe subtest samples
timeBeforeRoomCreation := time.Now()and immediately callscreateTestRoom. Them.room.createevent the homeserver writes in response is stamped at the same millisecond. The backward search at that boundary inclusively returns the create event when the test wants 404.Solution
The race direction is different for the two subtests, so the placement of the guard differs:
createTestRoom, the offending event has already been stamped beforetime.Now()is called. The sleep goes before the sample (betweenMustCreateRoomreturning andtimeBeforeEventA := time.Now()), advancing the sample past the trailing state event's millisecond.time.Now(). The sleep goes after the sample (betweentimeBeforeRoomCreation := time.Now()andcreateTestRoom), pushingm.room.createinto a strictly later millisecond.A 2 ms pause is long enough to step past the 1 ms clock resolution exposed on every supported platform and short enough to be invisible against the normal homeserver round-trip. The constant
tsBoundaryGuardand a paragraph above it document the invariant so neither sleep gets removed as cosmetic later.The wire contract is unchanged:
/timestamp_to_eventcontinues to be millisecond-granular and continues to resolve the boundary inclusively. Only the test pacing changes.Empirical evidence
A local soak against tuwunel (
benchbuild,complement_count=N,complement_parallel=1, single docker invocation per batch, freshOldDeployper iteration):should_find_event_after_given_timestampshould_find_nothing_before_the_earliest_timestamp89 flakes vs 0 across the 200 paired trials. The unpatched tester image is the prior
tuwunel-changesHEAD (9e8d447); the patched tester image is the same Complement plus this commit. The testee binary, the homeserver deployment shape (hs1 + hs2 + AS bridge), and the test invocation flags are identical between batches.The same flake fired three times in matrix-construct/tuwunel CI in the days before we routed the two subtests onto a
complement_skipallowlist:25346770228, job74331724982(2026-05-04): both subtests failed.25363319978, job74376393945(2026-05-05):should_find_event_after_given_timestampfailed.25367139913, job74452002935(2026-05-05): both subtests failed.The patch only adjusts
tests/room_timestamp_to_event_test.go. Theshould_not_be_able_to_query_a_*_room_you_are_not_a_member_ofsubtests sample the sametimeBeforeRoomCreationpattern but assert HTTP 403, which the homeserver returns regardless of which event a backward search would surface, so the boundary does not matter there. Theparallel/federation/when_looking_backwards_before_the_room_was_created,_should_be_able_to_find_event_that_was_importedsubtest has the same pattern but its dominant failure mode is federation backfill, not the boundary; the guard is harmless if added later but is out of scope here.The matching guard is backported to the matrix-construct fork at
matrix-construct/complement@28df927on itstuwunel-changesbranch and is wired into matrix-construct/tuwunel CI (complement_refindocker/Dockerfile.complementis bumped to that hash, and the correspondingcomplement_skipentries are dropped) so the patched path is exercised against tuwunel ahead of merge here.Pull Request Checklist