Fix state not showing up fast enough with TestDelayedEvents (de-flake v2)#869
Conversation
Follow-up to #830 As experienced in https://github.com/element-hq/synapse-rust-apps/actions/runs/24910122124/job/72949760158?pr=360 (element-hq/synapse-rust-apps#360) ``` ❌ TestDelayedEvents/delayed_state_events_are_kept_on_server_restart (10.12s) delayed_event_test.go:425: StopServer hs1 delayed_event_test.go:429: StartServer hs1 delayed_event_test.go:443: CSAPI.MustDo GET http://127.0.0.1:32978/_matrix/client/v3/rooms/%21MbDncghrqxTzEmQhCP:hs1/state/com.example.test/1 returned non-2xx code: 404 Not Found - body: {"errcode":"M_NOT_FOUND","error":"Event not found."} ``` This is most likely caused because the main process handles processing delayed events (and serving `/state`) but the state will be persisted on one of event_persister workers so the main process might be serving stale data until the invalidation comes through.
Traditional `/sync` expects you to use `state` as the base and layer state from the `timeline` on top (flawed because of state res). So we have to check in the `timeline` instead of `state` for the update. We can also consider this good here because we don't expect any state to be rejected because of state res issues.
Use a sync filter so `timeline` doesn't mess with us Don't use incremental sync because no need
| // Check that the `state` section for `roomID` has an event which passes the check function. | ||
| // | ||
| // Note that the `state` section of a sync response only contains the change in state up | ||
| // to the start of the `timeline` and will not contain the entire state of the room for | ||
| // incremental or `lazy_load_members` syncs. | ||
| func SyncStateHas(roomID string, check func(gjson.Result) bool) SyncCheckOpt { |
There was a problem hiding this comment.
This was the reason the previous PR didn't work, #865
Just re-formatted this a bit 🤷 (can revert if not useful)
|
|
||
| // And includes the correct content | ||
| // | ||
| // FIXME: This assertion seems superfluous to this test and should be it's own test |
There was a problem hiding this comment.
Added some FIXME for the messiness I'd like to see cleaned up but not doing in this PR so the purpose is more focused and understandable.
|
|
||
| stateKey := "to_send_on_timeout" | ||
|
|
||
| // Schedule a delayed event |
There was a problem hiding this comment.
Added comments to understand these tests better
| // No more delayed events | ||
| matchDelayedEvents(t, user, delayedEventsNumberEqual(0)) |
There was a problem hiding this comment.
Moved this check until after we've actually waited for the state to show up and we know the delayed event was processed.
| // Check for the state change from the delayed state event (using `MustSyncUntil` to | ||
| // account for any processing or worker replication delays) | ||
| user.MustSyncUntil(t, client.SyncReq{Filter: NoTimelineSyncFilter}, client.SyncStateHas(roomID, func(ev gjson.Result) bool { | ||
| return ev.Get("type").Str == eventType && ev.Get("state_key").Str == stateKey | ||
| })) |
There was a problem hiding this comment.
The main fix to all of these tests is using MustSyncUntil to wait and account for any processing or worker/replication delays.
| // A filter for `/sync` that excludes timeline events. | ||
| // | ||
| // This is useful if you want to see `state` in the `/sync` response without the pesky | ||
| // de-duplication with `timeline` that traditional `/sync` does. |
There was a problem hiding this comment.
Note that you can use state_after query param nowadays, though not sure how widespread its implementation is amongst servers
There was a problem hiding this comment.
I think this makes sense to use in this case as we control the sync requests but requires a new SyncStateAfterHas(...). I'll create a follow-up PR to update. Less custom filters and tribal knowledge ⏩
|
Thanks for the review @erikjohnston 🐠 |
Fix state not showing up fast enough with
TestDelayedEvents(de-flake tests)Re-introducing the changes from #865 which were reverted in #867
Follow-up to #830
As experienced when running this test against the worker-based Synapse setup we use alongside the Synapse Pro Rust apps, https://github.com/element-hq/synapse-rust-apps/actions/runs/24910122124/job/72949760158?pr=360 (https://github.com/element-hq/synapse-rust-apps/pull/360)
Also updating the rest of the
TestDelayedEventstests to better account for state changes that might not show up immediately because processing/worker/replication delayWhy does this flake happen?
Discussed in #865 (comment)
Dev notes
MSC4140 Synapse implementation added in element-hq/synapse#17326
Testing with Synapse:
Pull Request Checklist