Skip to content

TestJumpToDateEndpoint: Guard the parallel boundary subtests against millisecond collisions.#872

Open
jevolk wants to merge 2 commits into
matrix-org:mainfrom
matrix-construct:complement-868-ms-boundary-guard
Open

TestJumpToDateEndpoint: Guard the parallel boundary subtests against millisecond collisions.#872
jevolk wants to merge 2 commits into
matrix-org:mainfrom
matrix-construct:complement-868-ms-boundary-guard

Conversation

@jevolk
Copy link
Copy Markdown
Contributor

@jevolk jevolk commented May 6, 2026

Closes #868.

The two parallel boundary subtests should_find_event_after_given_timestamp and should_find_nothing_before_the_earliest_timestamp flake against any homeserver fast enough to handle a createRoom round-trip inside a single millisecond. The race is in the test, not the implementation: /timestamp_to_event is millisecond-granular by spec and resolves the boundary inclusively (forward picks the earliest event with ts >= query, backward picks the latest with ts <= query), so when a time.Now() sample collides with an origin_server_ts the homeserver just stamped, the inclusive match returns an event the subtest does not want.

should_find_event_after_given_timestamp

createTestRoom reads time.Now() immediately after MustCreateRoom returns. The trailing state event of the public_chat preset's createRoom batch (typically m.room.guest_access, after create, member/join, power_levels, join_rules, history_visibility) lands in the same millisecond as that sample on a fast deploy. The forward search anchored on timeBeforeEventA then iterates events with ts >= sample from the earliest, hits the trailing state event before Message A, and the assertion fails with the wrong event_id.

should_find_nothing_before_the_earliest_timestamp

The subtest samples timeBeforeRoomCreation := time.Now() and immediately calls createTestRoom. The m.room.create event the homeserver writes in response is stamped at the same millisecond. The backward search at that boundary inclusively returns the create event when the test wants 404.

Solution

The race direction is different for the two subtests, so the placement of the guard differs:

  • For the forward race in createTestRoom, the offending event has already been stamped before time.Now() is called. The sleep goes before the sample (between MustCreateRoom returning and timeBeforeEventA := time.Now()), advancing the sample past the trailing state event's millisecond.
  • For the backward race in the failing subtest, the offending event is stamped after time.Now(). The sleep goes after the sample (between timeBeforeRoomCreation := time.Now() and createTestRoom), pushing m.room.create into a strictly later millisecond.

A 2 ms pause is long enough to step past the 1 ms clock resolution exposed on every supported platform and short enough to be invisible against the normal homeserver round-trip. The constant tsBoundaryGuard and a paragraph above it document the invariant so neither sleep gets removed as cosmetic later.

The wire contract is unchanged: /timestamp_to_event continues to be millisecond-granular and continues to resolve the boundary inclusively. Only the test pacing changes.

Empirical evidence

A local soak against tuwunel (bench build, complement_count=N, complement_parallel=1, single docker invocation per batch, fresh OldDeploy per iteration):

Subtest Unpatched (N=100) Patched (N=100)
should_find_event_after_given_timestamp 23 fail / 77 pass 0 fail / 100 pass
should_find_nothing_before_the_earliest_timestamp 66 fail / 34 pass 0 fail / 100 pass

89 flakes vs 0 across the 200 paired trials. The unpatched tester image is the prior tuwunel-changes HEAD (9e8d447); the patched tester image is the same Complement plus this commit. The testee binary, the homeserver deployment shape (hs1 + hs2 + AS bridge), and the test invocation flags are identical between batches.

The same flake fired three times in matrix-construct/tuwunel CI in the days before we routed the two subtests onto a complement_skip allowlist:

The patch only adjusts tests/room_timestamp_to_event_test.go. The should_not_be_able_to_query_a_*_room_you_are_not_a_member_of subtests sample the same timeBeforeRoomCreation pattern but assert HTTP 403, which the homeserver returns regardless of which event a backward search would surface, so the boundary does not matter there. The parallel/federation/when_looking_backwards_before_the_room_was_created,_should_be_able_to_find_event_that_was_imported subtest has the same pattern but its dominant failure mode is federation backfill, not the boundary; the guard is harmless if added later but is out of scope here.

The matching guard is backported to the matrix-construct fork at matrix-construct/complement@28df927 on its tuwunel-changes branch and is wired into matrix-construct/tuwunel CI (complement_ref in docker/Dockerfile.complement is bumped to that hash, and the corresponding complement_skip entries are dropped) so the patched path is exercised against tuwunel ahead of merge here.

Pull Request Checklist

…t millisecond collisions.

Both subtests sample `time.Now()` and immediately probe a homeserver
that stamps events using its own millisecond clock; when the two
readings land in the same millisecond, `/timestamp_to_event`'s
inclusive boundary semantics return an event the subtest does not
want. `should_find_event_after_given_timestamp` races the trailing
createRoom state event ahead of the message that a forward search is
expected to find; `should_find_nothing_before_the_earliest_timestamp`
races `m.room.create` against a backward search expected to find
nothing. Pause around the relevant samples so the next homeserver
stamp lands in a strictly later millisecond. Both flakes have been
observed across multiple homeserver implementations on otherwise
green runs.

Closes matrix-org#868.

Signed-off-by: Jason Volk <jason@zemos.net>
@jevolk jevolk requested review from a team as code owners May 6, 2026 20:25
Comment thread tests/room_timestamp_to_event_test.go Outdated
as.MustJoinRoom(t, roomID, []spec.ServerName{
deployment.GetFullyQualifiedHomeserverName(t, "hs1"),
})

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like we need time.Sleep(tsBoundaryGuard) here as well

It's not strictly necessary here but for consistency sake and to make it clear that it's not the main part of the thing we're testing.

as.MustJoinRoom(t, roomID, []spec.ServerName{
deployment.GetFullyQualifiedHomeserverName(t, "hs1"),
})

Copy link
Copy Markdown
Collaborator

@MadLittleMods MadLittleMods Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like we need time.Sleep(tsBoundaryGuard) here as well

(the join could be in the ms as the messages we're sending)

Comment thread tests/room_timestamp_to_event_test.go
Comment thread tests/room_timestamp_to_event_test.go Outdated
…the remaining samples

A whole-millisecond pause always carries a timestamp into the next
millisecond bucket, so `tsBoundaryGuard` only needs to be 1ms. Extend
the guard to the two same-timestamp topological subtests and to
`createTestRoom`'s `eventB` sample for consistency, even though no case
exercises a forward search from `eventB` yet.

Signed-off-by: Jason Volk <jason@zemos.net>
Comment on lines +390 to +391
// eventA at a shared boundary millisecond. Nothing exercises this yet, but
// keep the helper honest for callers that eventually will.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not very relevant here

Suggested change
// eventA at a shared boundary millisecond. Nothing exercises this yet, but
// keep the helper honest for callers that eventually will.
// eventA at a shared boundary millisecond.

Comment on lines +352 to +360
// tsBoundaryGuard is the pause inserted around each time.Now() sample in this
// file's helpers and subtests so that no homeserver-stamped origin_server_ts
// can collide with the sample at millisecond granularity. /timestamp_to_event
// returns the boundary event inclusively (forward picks the earliest event
// with ts >= query, backward picks the latest with ts <= query), so a shared
// millisecond between sample and event lets the wrong event win the boundary.
// time.Sleep pauses for at least its argument, and adding a whole millisecond
// to a timestamp always carries it into the next millisecond bucket, so 1ms is
// enough to separate the sample from every event stamped after the pause.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extra fluff (especially LLM sounding)

Suggested change
// tsBoundaryGuard is the pause inserted around each time.Now() sample in this
// file's helpers and subtests so that no homeserver-stamped origin_server_ts
// can collide with the sample at millisecond granularity. /timestamp_to_event
// returns the boundary event inclusively (forward picks the earliest event
// with ts >= query, backward picks the latest with ts <= query), so a shared
// millisecond between sample and event lets the wrong event win the boundary.
// time.Sleep pauses for at least its argument, and adding a whole millisecond
// to a timestamp always carries it into the next millisecond bucket, so 1ms is
// enough to separate the sample from every event stamped after the pause.
// tsBoundaryGuard is a pause inserted around (before and after) where we create events
// so that `time.Now()` samples and subseqent event `origin_server_ts` don't collide at
// the same millisecond granularity. /timestamp_to_event returns the boundary event
// inclusively (forward picks the earliest event with ts >= query, backward picks the
// latest with ts <= query), so a shared millisecond between events, means the wrong
// event can be picked. Adding one whole millisecond to a timestamp always carries it
// into the next millisecond bucket, so 1ms is enough to separate the sample from every
// event stamped after the pause.

// state.
time.Sleep(tsBoundaryGuard)

timeBeforeEventA := time.Now()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
timeBeforeEventA := time.Now()
timeBeforeEventA := time.Now()
time.Sleep(tsBoundaryGuard)

@@ -358,7 +401,7 @@
timeAfterEventB := time.Now()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
timeAfterEventB := time.Now()
time.Sleep(tsBoundaryGuard)
timeAfterEventB := time.Now()

Comment thread tests/room_timestamp_to_event_test.go
Comment on lines +392 to +393
time.Sleep(tsBoundaryGuard)
timeBeforeEventB := time.Now()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we probably need a time.Sleep(tsBoundaryGuard) after the sample for the same reasons (timeBeforeEventB should actually come 1ms before we send event B in all cases)

Suggested change
time.Sleep(tsBoundaryGuard)
timeBeforeEventB := time.Now()
time.Sleep(tsBoundaryGuard)
timeBeforeEventB := time.Now()
time.Sleep(tsBoundaryGuard)

(comment above could use work as well)

@@ -358,7 +401,7 @@
timeAfterEventB := time.Now()

eventA = &eventTime{EventID: eventAID, BeforeTimestamp: timeBeforeEventA, AfterTimestamp: timeAfterEventA}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for timeAfterEventA if we're going this route.

Suggested change
eventA = &eventTime{EventID: eventAID, BeforeTimestamp: timeBeforeEventA, AfterTimestamp: timeAfterEventA}
eventA = &eventTime{EventID: eventAID, BeforeTimestamp: timeBeforeEventA, AfterTimestamp: timeBeforeEventB}

Or we could keep with the previous pattern and just need to move the delays there instead and get rid of timeBeforeEventB

// Just a sanity check that we're not leaking anything from the `/timestamp_to_event` endpoint
t.Run("should not be able to query a private room you are not a member of", func(t *testing.T) {
t.Parallel()
timeBeforeRoomCreation := time.Now()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency sake (make sure we promote good patterns for future tests)

Suggested change
timeBeforeRoomCreation := time.Now()
time.Sleep(tsBoundaryGuard)

// Just a sanity check that we're not leaking anything from the `/timestamp_to_event` endpoint
t.Run("should not be able to query a public room you are not a member of", func(t *testing.T) {
t.Parallel()
timeBeforeRoomCreation := time.Now()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency sake (make sure we promote good patterns for future tests)

Suggested change
timeBeforeRoomCreation := time.Now()
time.Sleep(tsBoundaryGuard)


t.Run("when looking backwards before the room was created, should be able to find event that was imported", func(t *testing.T) {
t.Parallel()
timeBeforeRoomCreation := time.Now()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
timeBeforeRoomCreation := time.Now()
time.Sleep(tsBoundaryGuard)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TestJumpToDateEndpoint/parallel boundary subtests flake at millisecond granularity

2 participants