Message order in Matrix: right now, we are deliberately inconsistent

December 04, 2024 [Matrix, Tech]

After lots of conversations with Element colleagues about message order in Matrix, and lots of surprises for me, I wanted to write down what I had learned before I forgot, and also write down some principles I think we should try to follow. A lot of this is just my half-formed opinions, and while I am very grateful to everyone who helped educate me about all of this, it in no way represents any kind of policy or consensus from Element or Matrix or anyone else :-)

Finding messages

If you're writing a Matrix client (e.g. a chat app), you need to ask the server for messages that have been sent in a room. To do this, you need to download the "events", which are just messages plus other things you might need to know about. Messages are one type of "timeline" event, meaning that they appear in the main display area of a room, showing you what people said.

Via /sync

The first and most common way to do this is to ask for the latest stuff that's happened, by hitting the /sync API:

GET https://example.com/_matrix/client/v3/sync?since=s3333

{
    "rooms": {"join": {"!roomid:example.com": {"timeline": {
        "events": [
            { "content": {"body": "How many roads?", ... }, ... },
            { "content": {"body": "Forty-two.", ... }, ... }
        ],
        "prev_batch": "s2222", ...
    }}}}, ...
}
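As a rough sketch of what a client does with a response shaped like this, the snippet below pulls the timeline events and the prev_batch token out for one room. The function name timeline_from_sync is made up for illustration, and the sample response is trimmed to the fields shown above.

```python
def timeline_from_sync(sync_response, room_id):
    """Return (events, prev_batch) for one joined room in a /sync response."""
    room = sync_response.get("rooms", {}).get("join", {}).get(room_id, {})
    timeline = room.get("timeline", {})
    return timeline.get("events", []), timeline.get("prev_batch")


# Abbreviated version of the example response above.
response = {
    "rooms": {"join": {"!roomid:example.com": {"timeline": {
        "events": [
            {"content": {"body": "How many roads?"}},
            {"content": {"body": "Forty-two."}},
        ],
        "prev_batch": "s2222",
    }}}},
}

events, prev_batch = timeline_from_sync(response, "!roomid:example.com")
bodies = [event["content"]["body"] for event in events]
```

The events come back in the order the server listed them, which for /sync is arrival order on the homeserver.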

(Note: we're not talking about "state" events here, which deal with e.g. who is a member of this room. Things get even more interesting when you start thinking about them, because the order in which they happen is critical to deciding who is banned, and similar issues.)

Via /messages or similar

The second way to get events is via one of the other APIs such as /messages, /context, or /relations:

GET https://example.com/_matrix/client/v3/rooms/!xyz%3Aexample.com/messages?dir=b&from=s2222

{
    "chunk": [
        { "content": {"body": "How many roads?", ... }, ... },
        { "content": {"body": "Forty-two.", ... }, ... }
    ],
    "start": "s2222",
    "end": "t1111", ...
}

This seems unremarkable at first glance: the /sync response even contains a prev_batch token which we can use as the from query parameter to /messages so we can page back through messages to find older ones.
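To make the hand-off concrete, here is a small sketch that builds the /messages URL from a prev_batch token. The helper name messages_url is an assumption for illustration. It percent-encodes the whole room ID, including the leading "!", which is slightly stricter than the example URL above.

```python
from urllib.parse import quote, urlencode


def messages_url(homeserver, room_id, from_token, limit=None):
    """Build a backwards-paginating /messages URL starting at from_token."""
    params = {"dir": "b", "from": from_token}
    if limit is not None:
        params["limit"] = str(limit)
    # The room ID is a path segment, so escape every reserved character in it.
    room = quote(room_id, safe="")
    return f"{homeserver}/_matrix/client/v3/rooms/{room}/messages?{urlencode(params)}"


url = messages_url("https://example.com", "!xyz:example.com", "s2222")
```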

So what is the problem?

These APIs return messages in a different order.

The /sync API returns events in an order "according to the arrival time of the event on the homeserver".

The spec for /messages says it returns events "in chronological order. (The exact definition of chronological is dependent on the server implementation.)".

For /context it also mentions chronological order.

For /relations it contradicts itself, stating both "events will be returned in chronological order" (when talking about the dir parameter) and events will be "ordered topologically" (in the chunk section). My guess is that the dir parameter docs were erroneously copied from elsewhere, and topological ordering was intended.

Topological ordering: events in a Matrix room are stored in a mathematical structure known as a directed acyclic graph. Topological ordering means using this graph structure (which is independent of the timing of when messages arrived on the server) to decide an order. This order is easy to calculate consistently, but it can be illogical from a common-sense point of view.
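A toy example may help show how the two orders can differ. The sketch below is not how any particular homeserver does it: it assumes a simple rule (sort by depth in the graph, breaking ties by event ID), and the event IDs and graph are invented. Events B1 and B2 were sent locally, while C1 was sent on another server during a netsplit and arrived last.

```python
def depths(prev_events):
    """Depth of each event: 1 for a root, otherwise 1 + its deepest parent."""
    memo = {}

    def depth(event_id):
        if event_id not in memo:
            parents = prev_events[event_id]
            memo[event_id] = 1 + max((depth(p) for p in parents), default=0)
        return memo[event_id]

    for event_id in prev_events:
        depth(event_id)
    return memo


def topological_order(prev_events):
    """One valid topological order: by depth, ties broken by event ID."""
    d = depths(prev_events)
    return sorted(prev_events, key=lambda event_id: (d[event_id], event_id))


# Each event lists the events it points back to (its parents in the graph).
prev_events = {"A": [], "B1": ["A"], "B2": ["B1"], "C1": ["A"]}

arrival_order = ["A", "B1", "B2", "C1"]  # what a syncing client saw
topo = topological_order(prev_events)    # ["A", "B1", "C1", "B2"]
```

Because a child is always deeper than its parents, sorting by depth never puts an event before something it refers to. Here the late arrival C1 lands before B2, even though a syncing client saw it after B2.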

Synapse and (I think) other homeservers actually use topological order for /messages and /context as well as /relations. I am not convinced that this complies with the spec, since topological order is very much not chronological, by my understanding of the word¹.

Why is this a problem?

Imagine I have two Matrix clients, both logged in as me. I leave the first client open, polling the /sync API and fetching events in order of their arrival on the homeserver. I close the second client, and open it later. It will run a /sync, but it will only receive the latest messages. If I scroll the room upwards, it will fetch more messages using the /messages API.

The two clients will show me the messages in a different order. Normally, the orders are similar or identical, but if two homeservers were disconnected for a while (a "netsplit" occurred) they can be very different.

Which order is correct? I would generally argue that the first client is most likely to fit with your intuition (because messages that you saw later are further down the screen), but it's definitely arguable. In actual fact, when messages were sent effectively in parallel, there is no correct order. What I am hoping for is a consistent order, as far as possible.

I would strongly argue that these two clients should show messages (and other events) in the same order.

I do feel honour-bound at this point to say that I spoke to a colleague recently who disagreed with this principle, and said that because of the different usage of these two clients, it was OK, or even useful, that they showed different results. I definitely disagree, but it's worth pointing out that this is a debatable point.

It's also worth saying that even a lone client can exhibit this inconsistency, if it doesn't store all messages forever. If it deletes some messages to save storage space, when the user pages up to read those messages, they will be fetched using /messages, so will appear in a different order from what the user saw originally when they were fetched via /sync. There is currently no API that can re-fetch messages in the same order they were first received over /sync.

How big a problem is this?

Does it really matter if a few messages are in a different order? On the face of it, maybe not. In most cases, the differences are minor, and when they are more significant this is the result of a significant problem like a netsplit or malicious behaviour.

I will admit that, even though I said I was not talking about state events (the important events that define e.g. who is a member of the room), part of my motivation here is that I want the order of state events to be consistent, because there are times when a user really wants to examine the history of what happened², and doesn't want that to change under them.

However, I personally think that even if we ignore state events, we should do our level best to order messages consistently. How should a user interpret a change in order? What does it mean to them? Most likely, if they notice, they will figure that Matrix is just a bit flaky. In the worst case, we might "gaslight" them: they remember things happening in a particular order, but when they check back they find that the evidence contradicts them.

If we accept that clients should display a linear view of what happened (despite the fact that in reality things may have happened in parallel) then I think we should work hard to make that view consistent.

How to fix it?

Use topological order everywhere?

One way that a client could "fix" this problem without any spec changes at all would be to ignore /sync timeline responses completely, and repeatedly call /messages to get messages to display. This would ensure that messages are displayed in a consistent order, but it has several critical disadvantages.

Firstly, this is clearly not the intention of the spec authors, and is inefficient since it involves throwing away information that the server worked to produce.

Secondly, assuming that messages appear in topological order, if some old messages arrive late (e.g. due to a netsplit), this will mean that messages appear "in the past", high up the timeline, even though the user has not read them. To make this happen, the client would need to keep repeating old calls to /messages, to check whether the past has changed, and the client would need to find a way to display these late-arriving changes to the user.

Persistent sync order

I believe that the order messages arrive from /sync is the correct order for a client: as soon as the homeserver has a message, it should hand it over to the client, and the client should show that it arrived by rendering it at the bottom of the timeline.

I also believe that all of my clients should see the same order of messages.

So the logical conclusion is that the homeserver should be able to provide a back-paginatable view of messages in the order they were provided via /sync. By extension, if no client happened to be syncing at the time, this means the order in which they would have been provided, which is essentially the order in which they arrived at the homeserver.

One way to implement this would be to change the /messages and other APIs to return messages in this order. In the case of the /messages and /context APIs, I even think this would comply with the spec as it is now. One possible implementation for homeservers would be to mark each message with a timestamp when they first saw it, and sort their responses based on this timestamp. Spec issue #852 actually proposes this change for /messages.

Note: this might make it difficult for homeservers that process incoming events in parallel, requiring some kind of synchronisation mechanism to assign timestamps, or some other mechanism to provide a consistent order. The exact order of events that arrive very close to each other is not important though, so long as the order is consistent.
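Here is a minimal sketch of the idea, under some loud assumptions: everything lives in memory in one process, a lock stands in for whatever synchronisation a real homeserver would need, and a sequence number is used in place of a wall-clock timestamp so that two events can never tie. The class and method names are hypothetical.

```python
import itertools
import threading


class ArrivalOrderStore:
    """Toy event store that remembers the order in which events arrived."""

    def __init__(self):
        self._lock = threading.Lock()
        self._next = itertools.count(1)
        self._events = []  # (sequence, event_id), ascending by sequence

    def receive(self, event_id):
        """Record an incoming event and return its arrival sequence number."""
        with self._lock:
            seq = next(self._next)
            self._events.append((seq, event_id))
            return seq

    def page_back(self, before=None, limit=10):
        """Return up to `limit` events older than `before`, newest first,
        plus a token to pass as `before` for the next page."""
        with self._lock:
            older = [e for e in self._events if before is None or e[0] < before]
        page = older[max(0, len(older) - limit):][::-1]
        next_token = page[-1][0] if page else None
        return [event_id for _, event_id in page], next_token


store = ArrivalOrderStore()
for event_id in ["$a", "$b", "$c", "$d", "$e"]:
    store.receive(event_id)

first_page, token = store.page_back(limit=2)                 # newest two
second_page, token2 = store.page_back(before=token, limit=2)
```

Asking for the same page twice gives the same answer, which is the property argued for here: back-pagination replays the order that /sync would have delivered.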

It is worth noting that whenever a client is syncing, the homeserver already chooses an order for the events it provides over /sync, proving that a linear order is possible in principle.

An alternative is to continue providing events in any order, but add some kind of order number that allows clients to sort events into /sync order. MSC4033 proposes this.
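To illustrate the client side of this alternative, the sketch below merges events from /sync and /messages into one timeline. The order field and its values are hypothetical stand-ins for whatever number the server would attach; this is not the format proposed in MSC4033.

```python
def merge_by_order(*batches):
    """Merge batches of events into one list sorted by their `order` value,
    keeping a single copy of any event that appears in several batches."""
    by_id = {}
    for batch in batches:
        for event in batch:
            by_id.setdefault(event["event_id"], event)
    return sorted(by_id.values(), key=lambda event: event["order"])


# Hypothetical events: /sync delivered the newest two, /messages the rest.
from_sync = [
    {"event_id": "$c", "order": 30},
    {"event_id": "$d", "order": 40},
]
from_messages = [
    {"event_id": "$c", "order": 30},
    {"event_id": "$a", "order": 10},
    {"event_id": "$b", "order": 20},
]

timeline = [e["event_id"] for e in merge_by_order(from_sync, from_messages)]
```

With an order number on every event, it no longer matters which API an event came from or in what order the responses listed it.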

Of course, it's much worse than this

Feel free to stop reading here. I've made my main point.

But if you want to know how difficult this problem really is, read on!

State resolution can change the past

So far, we've been assuming that if the homeserver just passes on messages to the client as soon as it receives them, this will result in a reasonably sensible order and an accurate reflection of the timeline. In fact, this is not really true, because whether we like it or not, history as we understand it can change.

In particular, when the homeserver performs "state resolution" (an evaluation of which state events are considered valid based on who is a member of the room and what permissions they have), some events can be effectively removed because it turns out the person who created them didn't have permission to do so. Because Matrix is distributed, it definitely does happen that a homeserver evaluates an event as valid at one point (and passes it on to clients) and then later has to change its mind and decide it is not valid³.

Currently, we don't have a way of telling clients that messages should be removed because of a change like this. (For state events, there are mechanisms to tell clients they need to update their state, but not actually to delete state events, as I understand it.)

I think we need to allow homeservers to send "deny" items to clients, which tell a client to delete events, to cover this case. (Note: I don't use the word "event" here since these items would by necessity be created by the homeserver, not a client, and would not be signed by a client device.)

I also think these deny items should be part of the linear history of a room, as opposed to being signals to edit that history retrospectively. This way, clients can show clearly the history of what they displayed to the user, and why it changed when it did. The alternative is to "lie" to the user, "pretending" that these events never existed, when the user actually saw them. How the client presents this to the user would certainly need to be explored. For example, the events might disappear from the timeline but some kind of "detailed history" view might show that they used to exist and were later denied.
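As a sketch of what this could look like on the client, the snippet below keeps deny items in the same linear history as events and derives two views from it. None of this exists in the spec: the "kind" field, the item shape and the function names are all invented for illustration.

```python
def visible_timeline(history):
    """Event IDs the main timeline would show: received and not later denied."""
    denied = {item["event_id"] for item in history if item["kind"] == "deny"}
    return [
        item["event_id"]
        for item in history
        if item["kind"] == "event" and item["event_id"] not in denied
    ]


def detailed_history(history):
    """Everything that happened from this user's perspective, in order."""
    verbs = {"event": "received", "deny": "denied"}
    return [f'{verbs[item["kind"]]} {item["event_id"]}' for item in history]


# Hypothetical linear history: $2 was shown to the user, then later denied.
history = [
    {"kind": "event", "event_id": "$1"},
    {"kind": "event", "event_id": "$2"},
    {"kind": "deny", "event_id": "$2"},
    {"kind": "event", "event_id": "$3"},
]

shown = visible_timeline(history)
audit = detailed_history(history)
```

The history itself is only ever appended to, so the "detailed history" view can always explain why the main timeline changed.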

I think it's really important that we show the user what really happened from their perspective (in a persistent form). Anything else is confusing and betrays users' trust.

When the homeserver backpaginates

So far, we've been assuming that the homeserver has access to all the events, but that is not always true. If a client asks for some events that a homeserver does not have, the homeserver can ask another homeserver for them.

Now, when these new, old events arrive, they are very strange: they are new to this homeserver, but we only had to fetch them from the other server because they are old! Clients don't want to display them at the bottom of the timeline, because they were only requested when a user scrolled a long way up to the top of the timeline.

So these new, old events need to be inserted further back in the linear history that we are building, not at the end.

This is a difficult problem because we need to figure out where they need inserting - it may not be at the start because we may have multiple gaps.

However, I still argue that we should solve this difficult problem on the server, and present it as straightforward to the client i.e. the server should respond to a request for these events by returning them as if they already existed in the linear timeline, and from that moment on always returning them at the same point when asked.

So from a client's point of view, these events should be indistinguishable from events that were on the server already.

Different people are just going to see things in different orders

It would be nice if everyone had the same view. But from different people's point of view, things genuinely happened in different orders. If I typed and sent my message while someone else's message was travelling to me over the Internet, I will think that my message was first, and they will think theirs was first.

With two users on the same homeserver, we could choose to make the homeserver the arbiter, picking one message to come first. (Note: we don't currently do this, but we could.)

But, when two users are on different homeservers, this problem is unsolvable. An important part of the design of Matrix is that two homeservers can disagree on the exact order of messages, and still interoperate with each other. This is what the long words are for: "directed acyclic graph", "eventual consistency" and "state resolution".

So we can never give everyone a consistent view of the order of messages.

In this article I am arguing that a single person should always get a consistent order when they come back to a room or look via a different client.

I also think I would like to argue that users on the same homeserver should see a view consistent with each other, but I have not developed that argument here.

Addendum: receipts

The spec for read receipts states that a receipt means that "the user has read up to a given event".

In order to understand this, it is, of course, critically important to know which events are before or after this event i.e. what their order is. In practice, when existing homeservers report the read status of a room they use the order in which they received the message as the order for receipts, which I believe is a good order for this purpose.

As it stands, for an arbitrary event, the spec does not provide a way for a client to determine which events are before or after it in this ordering, making it essentially impossible for clients to handle read receipts in a fully-correct way, or to resolve receipts consistently with the homeserver. The information is hidden from the client! See "Deciding whether a room or thread is unread" for more detail.

Ordering events in server-arrival order would improve this situation, but to make it easy for client authors to get receipts right, I believe we need an order number for each event, making it easy to compare any two events' order, without constructing a timeline and placing them on it. This is MSC4033.
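If such an order number existed, deciding what is unread could be as simple as the sketch below. The order_of mapping (event ID to order number) is hypothetical, as are the function names; the point is only that comparing two events needs no timeline at all.

```python
def is_unread(order_of, event_id, receipt_event_id):
    """True if event_id comes after the event the read receipt points at."""
    return order_of[event_id] > order_of[receipt_event_id]


def unread_events(order_of, receipt_event_id):
    """All events after the receipt, oldest first."""
    unread = (e for e in order_of if is_unread(order_of, e, receipt_event_id))
    return sorted(unread, key=order_of.get)


# Hypothetical order numbers supplied by the server for each event.
order_of = {"$a": 10, "$b": 20, "$c": 30, "$d": 40}

unread = unread_events(order_of, "$b")
```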

Addendum: linear timeline

After I wrote the first draft of this post, some colleagues and I discussed it, and thought up some reasons why a "persistent sync order" or "linearised timeline" has problems we would need to solve.

The problems come when the homeserver has a gap in its timeline, and then uses backpagination (or "backfilling") to fetch events within that gap.

Quoting Rich vdH's summary of our conversation:

Suppose Homeserver A has been participating in a busy room for a long time (10 years, for argument's sake).

A netsplit happens; during the netsplit, homeserver B sends 200 messages. Eventually, the netsplit heals. Homeserver A ends up receiving some, but definitely not all, of those 200 messages. We now have a "gap" in the DAG.

Time passes, and another 50 messages or so get sent.

Now, a user comes online. First of all, they only see the recent 50 messages. But they scroll up, and get to the point of the "gap". Assuming we backfill homeserver B's messages, where do they fit in the timeline?

1. at the beginning (ie, 10 years ago)?

2. at the end (ie, the user doesn't actually see the messages until they scroll back down to "now")?

3. or do we try and slot them in at the right point in the timeline?

I think the only plausible answer is 3, but it means that you could scroll past the same point in the timeline twice and see different messages (because we might not have been able to backfill the message the first time the user scrolled past).

.... Which I think really means that we have to say to the client "<some messages missing here>" and the client needs to reflect that in the UI.
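One way a client might model this is sketched below: the timeline holds an explicit gap item, which is rendered as a placeholder and replaced in place once backfilled events turn up. The item shapes and function names are invented for illustration, and the choice to put any remaining gap before the backfilled events assumes backfill works backwards from the newer side.

```python
MISSING = "<some messages missing here>"


def render(timeline):
    """Turn timeline items into display strings, showing gaps explicitly."""
    return [MISSING if item["kind"] == "gap" else item["body"] for item in timeline]


def fill_gap(timeline, gap_index, backfilled, more_missing=False):
    """Return a new timeline with the gap replaced by backfilled events.
    If older events are still missing, keep a gap before the new events."""
    if timeline[gap_index]["kind"] != "gap":
        raise ValueError("no gap at that position")
    replacement = list(backfilled)
    if more_missing:
        replacement.insert(0, {"kind": "gap"})
    return timeline[:gap_index] + replacement + timeline[gap_index + 1:]


timeline = [
    {"kind": "event", "body": "before the netsplit"},
    {"kind": "gap"},
    {"kind": "event", "body": "after the netsplit"},
]
before = render(timeline)

backfilled = [{"kind": "event", "body": "sent during the netsplit"}]
after = render(fill_gap(timeline, 1, backfilled))
```

The user can still see different things on two visits to the same spot, but the first visit was honest about it.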

Thanks

Thank you to Erik J, Rich vdH, Kegan D, Florian H and others for discussions leading up to this article, and for help with writing it. All mistakes, misunderstandings and naiveties are my own.

Footnotes

¹ The ambiguity of the term "chronological" was raised in 2019.

² On "the history of what happened", since there is no definitive order of events in general, it is impossible to provide a linear history that always agrees with the outcome chosen by state resolution. What I am looking for is that we should provide a linear history that does not change for the user when they check back on it later.

³ Note: technically, no event is "invalid" in principle - it simply belongs in a particular place in the graph. When we say an event has become invalid, we really mean that the event is not included in what the server considers the current state.