This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

libera.ems.host stopped accepting incoming events from matrix.org in many rooms #15216

Closed
progval opened this issue Mar 6, 2023 · 13 comments

Comments

@progval
Contributor

progval commented Mar 6, 2023

Description

libera.ems.host seemingly stopped accepting incoming federation events, from at least matrix.org, today between 8:11 and 8:56 UTC.

Here is the first rejected event:

matrix.org GET https://matrix.org/_matrix/federation/v1/event/$djGQj8bZThAAm9ZPS9xFvkZCBeU4OhLyY243f1KomdU?
-----------------------------------------------------------------------------------------
200
{
  "origin": "matrix.org",
  "origin_server_ts": 1678120631456,
  "pdus": [
    {
      "auth_events": [
        "$ZAsOgVox_c7NZkM07GKUZr2GDnnW5UhCOShVg0nhiSg",
        "$T42kX3b2HrywJafrDyPjRs4DsyX9rM29ueO6pbh9NeI",
        "$oFm6st-e9n71w3A4waSi_3WTyUdhcH72CdCsv7iEimQ"
      ],
      "content": <redacted>,
      "depth": 17864,
      "hashes": { "sha256": "mId5f+fEKKHMudWOWCcIlWlqlxzjH6/GsItaMfsk0pY" },
      "origin": "matrix.org",
      "origin_server_ts": 1678103484477,
      "prev_events": [ "$hWe9ucbEuiFBIS2J2xoLZ6FKdtJAHK968fYu3O-sNbQ" ],
      "room_id": "!lAZsOCrTyhrhnitJGE:matrix.org",
      "sender": "@<redacted>:matrix.org",
      "type": "m.room.message",
      "signatures": { "matrix.org": { "ed25519:a_RXGa": <redacted> } },
      "unsigned": { "age": 17146979 }
    }
  ]
}
libera.chat GET https://libera.ems.host/_matrix/federation/v1/event/$djGQj8bZThAAm9ZPS9xFvkZCBeU4OhLyY243f1KomdU?
-----------------------------------------------------------------------------------------
404
""
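The timestamps in the matrix.org response can be cross-checked. `origin_server_ts` is milliseconds since the Unix epoch, and the PDU's `unsigned.age` should equal the response's `origin_server_ts` minus the event's own. A minimal sketch using only the values copied from the response above (the helper name `ms_to_utc` is hypothetical):

```python
from datetime import datetime, timedelta, timezone

# Values copied from the matrix.org response above (milliseconds since the Unix epoch).
TXN_TS_MS = 1678120631456    # top-level "origin_server_ts" of the response
EVENT_TS_MS = 1678103484477  # "origin_server_ts" of the PDU itself
EVENT_AGE_MS = 17146979      # "unsigned": {"age": ...} of the PDU

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def ms_to_utc(ms: int) -> datetime:
    """Convert a Matrix millisecond timestamp to an aware UTC datetime.
    Integer timedelta arithmetic avoids float rounding of the milliseconds."""
    return EPOCH + timedelta(milliseconds=ms)


print("event sent:   ", ms_to_utc(EVENT_TS_MS).isoformat())
print("response sent:", ms_to_utc(TXN_TS_MS).isoformat())
# The age reported by matrix.org should match the gap between the two timestamps.
print("age matches:  ", TXN_TS_MS - EVENT_TS_MS == EVENT_AGE_MS)
```

For this event that works out to 11:51:24 UTC on 6 March 2023, with the response generated at 16:37:11 UTC the same day, and the reported age matches the difference exactly.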

The prev event ($hWe9ucbEuiFBIS2J2xoLZ6FKdtJAHK968fYu3O-sNbQ) originates from libera.ems.host itself, but its own prev event is $dCXcoHqS35OXY8ktVE2TKHQ2Vz_1BLqePLvtub4CYE8, which originates from matrix.org and was successfully federated:

libera.chat GET https://libera.ems.host/_matrix/federation/v1/event/$dCXcoHqS35OXY8ktVE2TKHQ2Vz_1BLqePLvtub4CYE8?
-----------------------------------------------------------------------------------------
403
{ "errcode": "M_FORBIDDEN", "error": "Host not in room." }

(The request is sent from synapse.test.progval.net, which joined the room after the shenanigans started.)
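The chain described above can be written down as a tiny event graph where each event lists its `prev_events`. Walking backwards from an incoming event to find the ancestors a server does not yet hold is, in broad terms, the gap a homeserver has to fill (for example by fetching missing events over federation) before the new event connects to what it already has. This is a toy sketch: which events each server actually held is not known from the issue, so the `known` sets are assumptions, and `missing_ancestors` and `PREV_EVENTS` are hypothetical names.

```python
from collections import deque
from typing import Dict, List, Set

NEW = "$djGQj8bZThAAm9ZPS9xFvkZCBeU4OhLyY243f1KomdU"          # matrix.org message that libera.ems.host 404s on
LIBERA_PREV = "$hWe9ucbEuiFBIS2J2xoLZ6FKdtJAHK968fYu3O-sNbQ"  # its prev event, sent by libera.ems.host
MATRIX_PREV = "$dCXcoHqS35OXY8ktVE2TKHQ2Vz_1BLqePLvtub4CYE8"  # prev of that, sent by matrix.org

# event_id -> prev_events. The prev_events of MATRIX_PREV are not shown in
# the issue, so the walk simply ends there in this toy graph.
PREV_EVENTS: Dict[str, List[str]] = {
    NEW: [LIBERA_PREV],
    LIBERA_PREV: [MATRIX_PREV],
    MATRIX_PREV: [],
}


def missing_ancestors(new_event: str, known: Set[str], graph: Dict[str, List[str]]) -> List[str]:
    """Breadth-first walk back along prev_events from `new_event`, collecting
    ancestors absent from `known` and not walking past events already held."""
    missing: List[str] = []
    seen: Set[str] = set()
    queue = deque(graph.get(new_event, []))
    while queue:
        event_id = queue.popleft()
        if event_id in seen or event_id in known:
            continue
        seen.add(event_id)
        missing.append(event_id)
        queue.extend(graph.get(event_id, []))
    return missing


# Hypothetical receiver holding only MATRIX_PREV: it still lacks LIBERA_PREV.
print(missing_ancestors(NEW, {MATRIX_PREV}, PREV_EVENTS))
```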

Steps to reproduce

I don't know.

Homeserver

libera.ems.host

Synapse Version

1.77.0

Installation Method

I don't know

Database

postgresql

Workers

I don't know

Platform

I don't know

Configuration

No response

Relevant log output

I don't have access to logs

Anything else that would be useful to know?

No response

@progval
Contributor Author

progval commented Mar 6, 2023

Possibly related: federation of this room from libera.ems.host to other homeservers (at least envs.net and synapse.test.progval.net, but not matrix.org) seems to be down too. I don't see any mention of incoming PDUs (or EDUs) from Libera in my server logs.

@progval
Contributor Author

progval commented Mar 6, 2023

It seems that sending messages from synapse.test.progval.net made libera.ems.host successfully backfill messages; it's just not getting them directly from matrix.org.

@mattcen

mattcen commented Mar 6, 2023

FWIW, I'm experiencing the same behaviour with #everythingopen:matrix.org; when I send a message (or a react) from my matrix.mattcen.com homeserver, the other prior messages get backfilled to libera.ems.host.

@progval
Contributor Author

progval commented Mar 7, 2023

It's now happening in #libera-matrix:libera.chat, which is rather inconvenient, as that's where people ask about the issue.

@alkisg

alkisg commented Mar 7, 2023

I've been participating (via the bridge) in various IRC channels for a while, until I realized no one was receiving what I said! Not knowing when it's working and when it's not is a big issue!

It's essential that the IRC <=> Matrix bridge works reliably for, e.g., a decade, so that IRC members are attracted to Matrix; otherwise I'm afraid that many of the people who have already switched to Matrix will switch back to IRC...

@progval
Contributor Author

progval commented Mar 7, 2023

A matrix.org user in #libera-matrix:libera.chat reports that this is also an issue in PMs with the appservice bot.

@progval progval changed the title libera.ems.host stopped accepting incoming events in #swh-team:matrix.org libera.ems.host stopped accepting incoming events from matrix.org in many rooms Mar 7, 2023
@DMRobertson
Contributor

DMRobertson commented Mar 7, 2023

From trawling logs, matrix.org's outbound federation loop for libera.chat:

  • saw 503 responses from libera.chat to /send starting at 2023-03-06 08:44:50,041
  • got a 200 response from libera.chat at 09:35:51,955
  • entered "catchup mode" for transactions to libera at this point
  • last logged for libera.chat at 2023-03-06 09:36:22,791 on that day

Then matrix.org was redeployed today (7th March) at 1300 UTC. We see the log line

2023-03-07 13:08:54,028 - synapse.federation.sender.per_destination_queue - 486 - INFO - federation_transaction_transmission_loop-10 - Catching up destination libera.chat with 50 PDUs

but then no signs of activity thereafter.
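Trawling logs like this can be partly mechanised. The sketch below assumes the `asctime - logger - lineno - level - request - message` layout visible in the log line above (a deployment's actual logging format may differ) and reports the latest line whose message mentions a given destination. `parse_line`, `last_activity` and `LOG_RE` are hypothetical helpers, not part of Synapse.

```python
import re
from datetime import datetime
from typing import Iterable, Optional

# Assumed layout, inferred from the log line quoted above:
# "<asctime> - <logger> - <lineno> - <level> - <request> - <message>"
LOG_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<logger>\S+) - (?P<lineno>\d+) - (?P<level>[A-Z]+) - "
    r"(?P<request>\S+) - (?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Split one log line into named fields, or return None if it doesn't match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # "%f" accepts the 3-digit millisecond part and pads it to microseconds.
    fields["ts"] = datetime.strptime(fields["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return fields


def last_activity(lines: Iterable[str], destination: str) -> Optional[datetime]:
    """Timestamp of the latest parsed line whose message mentions `destination`
    (a simple substring match), or None if there is none."""
    latest: Optional[datetime] = None
    for line in lines:
        fields = parse_line(line)
        if fields and destination in fields["message"]:
            if latest is None or fields["ts"] > latest:
                latest = fields["ts"]
    return latest


SAMPLE = (
    "2023-03-07 13:08:54,028 - synapse.federation.sender.per_destination_queue - 486 - INFO - "
    "federation_transaction_transmission_loop-10 - Catching up destination libera.chat with 50 PDUs"
)
print(last_activity([SAMPLE], "libera.chat"))
```

In practice the input would be an open log file (iterating over it yields lines); a destination whose latest timestamp stops advancing is the "no signs of activity" symptom described above.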

TL;DR: I think matrix.org's federation transaction loop is stuck in "catchup mode" for libera. Cause currently unknown.
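For context on "catchup mode": roughly speaking, after transmission to a destination has failed, the sender stops replaying every queued PDU. Once the destination is reachable again, it looks up rooms shared with that destination that have seen new events since the last successful transmission, and it sends only the newest event in each room, leaving the receiver to fetch anything older. The following is a heavily simplified, hypothetical model of that idea. The class and method names are invented, and the batch size of 50 is only borrowed from the log line above. It is not Synapse's implementation and makes no attempt to reproduce the bug behind this issue.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

BATCH_SIZE = 50  # borrowed from the "with 50 PDUs" log line; illustrative only


@dataclass
class DestinationCatchUp:
    """Toy catch-up state for one destination (hypothetical, not Synapse's code)."""

    # room_id -> (stream_ordering, event_id) of the newest event seen in that room.
    latest_in_room: Dict[str, Tuple[int, str]] = field(default_factory=dict)
    # Highest stream ordering the destination is known to have received.
    last_successful_stream_ordering: int = 0

    def record_event(self, room_id: str, stream_ordering: int, event_id: str) -> None:
        """Remember an event, keeping only the newest one per room."""
        current = self.latest_in_room.get(room_id)
        if current is None or stream_ordering > current[0]:
            self.latest_in_room[room_id] = (stream_ordering, event_id)

    def next_batch(self) -> List[Tuple[int, str]]:
        """Newest event of each room that changed since the last success,
        oldest first, capped at BATCH_SIZE."""
        pending = sorted(
            entry
            for entry in self.latest_in_room.values()
            if entry[0] > self.last_successful_stream_ordering
        )
        return pending[:BATCH_SIZE]

    def catch_up(self, send: Callable[[List[str]], bool]) -> int:
        """Send batches until nothing is pending or a send fails.
        `send` returns True on success. Returns the number of PDUs sent."""
        sent = 0
        while True:
            batch = self.next_batch()
            if not batch:
                break  # fully caught up
            if not send([event_id for _, event_id in batch]):
                break  # destination failed again: stay in catch-up, retry later
            # Advance the cursor only after a successful send.
            self.last_successful_stream_ordering = batch[-1][0]
            sent += len(batch)
        return sent


if __name__ == "__main__":
    queue = DestinationCatchUp()
    for i in range(120):
        queue.record_event(f"!room{i}:example.org", i + 1, f"$event{i}")

    batch_sizes: List[int] = []

    def fake_send(event_ids: List[str]) -> bool:
        batch_sizes.append(len(event_ids))
        return True  # pretend the remote accepted the transaction

    print(queue.catch_up(fake_send), batch_sizes)
```

The design point the model illustrates is that the cursor only moves after a successful send; if it never advances, the destination stays in catch-up indefinitely, which is the kind of symptom reported here.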

@progval
Contributor Author

progval commented Mar 7, 2023

Could https://status.matrix.org/ be updated to reflect this?

@Half-Shot
Collaborator

Good plan, applied: https://status.matrix.org/incidents/h6t60zyv4r73

@progval
Contributor Author

progval commented Mar 7, 2023

Thanks! (Federation from libera.ems.host isn't affected though)

@DMRobertson
Contributor

DMRobertson commented Mar 7, 2023

The underlying cause was #15220.

Federation should now have resumed, and the status page has been updated to say as much. We'll keep an eye on it. Thanks for spotting and reporting.

@progval
Contributor Author

progval commented Mar 7, 2023

Indeed, it seems to be working again, and messages were backfilled in various rooms I'm in.

@DMRobertson
Contributor

Thanks for confirming.

For completeness, the proper fix for this is #15248 which should be landing in Synapse 1.79 tomorrow.

Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants