
bug: filter delay errors #4459

Closed
chair28980 opened this issue Dec 13, 2023 · 42 comments
Assignees
Labels
E:Chat (see https://github.com/waku-org/pm/issues/105 for details), status-waku-integ (all issues relating to the Status Waku integration)

Comments

@chair28980

chair28980 commented Dec 13, 2023

Quoting @richard-ramos from the Discord conversation:

  • There seems to be an issue with filter in which messages are pushed to clients with a noticeable delay.
  • Up until yesterday, there was a job that automagically deployed changes in nwaku master to shards.fleet. The DB got broken because of that.
  • QA mentioned that communities are not loading the message history (currently debugging this).

Quoting @chaitanyaprem from the Discord conversation:

As of now, I see a lot of subscribe requests when I start my desktop app (but those could just be for different contentTopics) and unsubscribes only per pubsubTopic (which could simply include all contentTopics that were subscribed).

Unable to debug further because contentTopics are not logged and there is no API to confirm which FilterSubscriptions are currently active.

Maybe it would be good to have some Admin APIs to query the list of Filter Subscriptions, and similar APIs for monitoring other protocols as well. This would help node operators, and in turn the Waku team, to debug issues.

For now, I will try to figure out another way to determine what could be causing this issue. At the same time, I recommend using unsubscribeAll while shutting down or exiting status-go.

A couple more observations we noticed during this session:
If peers are subscribed but are not available/alive, we try to dial them for each filterPush, which seems like a bad strategy. If peers are down for a few minutes, this leads to a lot of dials from the filterFullNode (which doesn't look optimal).

Secondly, the idle/inactive period after which a filterFullNode considers a subscription invalid may need to be reduced to maybe 30 minutes or so (not sure what the optimal value is at this point). Letting such idle subscriptions hang around for longer could be a minor attack vector on a service node: an attacker could spawn peers randomly, keep hitting the node with subscriptions, and never unsubscribe, leaving the filterFullNode wasting resources dialing peers for each message. Not sure if this is a big issue, but something to think about.
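A minimal sketch of the idle-subscription pruning being suggested here, assuming a hypothetical subscription registry on the filter full node (the names below are illustrative, not the actual nwaku/go-waku implementation):

```go
package filter

import (
	"sync"
	"time"
)

// idleTimeout is the illustrative 30-minute window suggested above.
const idleTimeout = 30 * time.Minute

// subscription tracks when a subscriber was last seen (hypothetical type).
type subscription struct {
	peerID   string
	lastSeen time.Time
}

// registry is a toy stand-in for the full node's subscription store.
type registry struct {
	mu   sync.Mutex
	subs map[string]*subscription // keyed by peerID
}

// touch records activity (subscribe, ping, successful push) for a peer.
func (r *registry) touch(peerID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if s, ok := r.subs[peerID]; ok {
		s.lastSeen = time.Now()
	}
}

// pruneIdle drops subscriptions inactive for longer than idleTimeout, so the
// full node stops dialing unreachable subscribers on every filterPush.
func (r *registry) pruneIdle() {
	r.mu.Lock()
	defer r.mu.Unlock()
	for id, s := range r.subs {
		if time.Since(s.lastSeen) > idleTimeout {
			delete(r.subs, id)
		}
	}
}

// startPruner runs pruneIdle periodically until stop is closed.
func (r *registry) startPruner(stop <-chan struct{}) {
	ticker := time.NewTicker(time.Minute)
	go func() {
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				r.pruneIdle()
			case <-stop:
				return
			}
		}
	}()
}
```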

QA Test Scenario Description

Scenario 1
1.
2.
Scenario 2
1.
2.

@chair28980 chair28980 changed the title fix: Use unsubscribeAll when exiting or shutting down status-go Filter delay bugs Dec 13, 2023
@chair28980 chair28980 changed the title Filter delay bugs bug: filter delay errors Dec 13, 2023
@chair28980 chair28980 added the E:Chat See https://github.com/waku-org/pm/issues/105 for details label Dec 13, 2023
@chaitanyaprem
Contributor

> As of now, I see a lot of subscribe requests when I start my desktop app (but those could just be for different contentTopics) and unsubscribes only per pubsubTopic (which could simply include all contentTopics that were subscribed).

This can be ignored, as it is very much possible that an unsubscribe is sent for each pubsubTopic separately, covering all contentTopics that were subscribed independently. But I did notice that when I bring my status-desktop down, the number of subscriptions on the fleet node does not decrease, whereas I do see a new subscription when I bring status-desktop up. This indicates that some filter subscriptions are not being explicitly unsubscribed when the node goes down, leaving open subscriptions on the fleet node.

@chaitanyaprem
Contributor

chaitanyaprem commented Dec 14, 2023

> Unable to debug further because contentTopics are not logged and there is no API to confirm which FilterSubscriptions are currently active.
>
> Maybe it would be good to have some Admin APIs to query the list of Filter Subscriptions, and similar APIs for monitoring other protocols as well. This would help node operators, and in turn the Waku team, to debug issues.

Created this issue, which should help with such debugging in the future.

@chaitanyaprem
Contributor

> At the same time, I recommend using unsubscribeAll while shutting down or exiting status-go.

This probably needs to be done in status-go, or #4459 (comment) needs to be fixed, since when desktop is brought down gracefully the filter subscription is not cleared in nwaku. This can be solved by sending a filter unsubscribeAll when status-go is shutting down, or by unsubscribing from the topics that are currently not being unsubscribed.

@kaichaosun kaichaosun self-assigned this Dec 15, 2023
@chaitanyaprem
Contributor

> > At the same time, I recommend using unsubscribeAll while shutting down or exiting status-go.
>
> This probably needs to be done in status-go, or #4459 (comment) needs to be fixed, since when desktop is brought down gracefully the filter subscription is not cleared in nwaku. This can be solved by sending a filter unsubscribeAll when status-go is shutting down, or by unsubscribing from the topics that are currently not being unsubscribed.

Do note that, as per @richard-ramos, we should not wait for the response to unsubscribeAll; rather, this should be done in a fire-and-forget manner.
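A minimal sketch of what fire-and-forget could look like on shutdown, assuming a go-waku-style filter client that exposes an UnsubscribeAll method (the interface below is hypothetical, defined only for illustration and not the real API):

```go
package wakushutdown

import (
	"context"
	"log"
	"time"
)

// filterClient is a hypothetical stand-in for the light client's filter API.
type filterClient interface {
	UnsubscribeAll(ctx context.Context) error
}

// unsubscribeAllOnShutdown fires the unsubscribe request without blocking
// shutdown on the response, as suggested in this thread. The request is
// best-effort: if the process exits before it completes, the service node's
// idle-subscription cleanup has to take care of the leftover state.
func unsubscribeAllOnShutdown(c filterClient) {
	go func() {
		// Bound the attempt so a dead service node can't hold the goroutine forever.
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()

		if err := c.UnsubscribeAll(ctx); err != nil {
			// Fire and forget: log and move on, never block shutdown on this.
			log.Printf("filter unsubscribeAll failed (ignored): %v", err)
		}
	}()
}
```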

@chair28980 chair28980 added the status-waku-integ All issues relating to the Status Waku integration. label Dec 19, 2023
@chaitanyaprem
Contributor

I have not been able to reproduce this issue when I tested with 0.15.0-rc.10 of desktop.
I am running 3 desktop instances, 2 of them in light mode and 1 in relay mode.
I have tested by sending a sequence of 10 messages from each of the 3 in a group chat and a community.

I have not noticed any significant delays in receiving the messages. The maximum delay I observed was ~3 seconds.

@churik @anastasiyaig I would need more information about when the delay with Filter is noticed (it was mentioned that a delay of about 60 seconds was observed).
Can we retest using the latest build, which includes the fix for the lightpush delay?

@anastasiyaig
Contributor

We did have a try yesterday; I don't see any delay from my end. However, I know that the 'founder' of that issue was the autotests on the mobile side. Maybe we can try to launch them again? @pavloburykh wdyt?

Can we keep this issue open for a while? Perhaps massive dogfooding will confirm its absence/presence; so far I can't reproduce it either.

@pavloburykh

pavloburykh commented Dec 27, 2023

Hey @anastasiyaig @chaitanyaprem!

> Can we keep this issue open for a while? Perhaps massive dogfooding will confirm its absence/presence; so far I can't reproduce it either.

Totally agree. Let's keep it open for a while, as our e2e tests are still catching issues that might be related.

Here is an example from nightly e2e run (December 27):

  1. Sender's device sends a stack of 4 messages

  2. Receiver's device receives only 1 of 4 messages after 30 seconds of waiting

Receiver_device.log.zip
Sender_device.log.zip

I am currently on vacation till January 2. I will try to investigate and provide more details when I am back. But from what I noticed, this issue is somewhat random: most of the time messages are delivered instantly, but sometimes a delay occurs.

Please note that I do not take into consideration cases involving online/offline handling or returning from the background, as those are separate issues, described here by @churik.

@chaitanyaprem
Contributor

chaitanyaprem commented Dec 27, 2023

> 2. Receiver's device receives only 1 of 4 messages after 30 seconds of waiting

Hmm, this sounds like the delay is caused by lightpush, as I pointed out here. Which status-go commit/version is used for the nightly build?

@pavloburykh

> Hmm, this sounds like the delay is caused by lightpush, as I pointed out here. Which status-go commit/version is used for the nightly build?

@chaitanyaprem

"version": "v0.171.39",
"commit-sha1": "debfe1cab368140b16d2383efc68007d3c28a4ef",
"src-sha256": "0cbj3796qz69ylwcvdw5jsl1cdyrcfk81d7shjkakvjh59rv5sjk"

@chaitanyaprem
Contributor

> Receiver_device.log.zip
> Sender_device.log.zip

After going through the logs, it looks like the connection to the filter node was dropped after 1 or 2 messages were received. This could be the reason why the rest of the messages did not get delivered. However, I could not find a reason why the connection was terminated and not re-established. I don't have enough info to know whether other filter peers were connected at this point.

I am also noticing that the node is connecting to the go-waku.test and go-waku.prod fleets. Not sure if that is also causing some issue. @richard-ramos, I don't think it is a good idea to connect these fleets to the shards.test fleet.

@chaitanyaprem
Contributor

> Hmm, this sounds like the delay is caused by lightpush, as I pointed out here. Which status-go commit/version is used for the nightly build?
>
> @chaitanyaprem
>
> "version": "v0.171.39",
> "commit-sha1": "debfe1cab368140b16d2383efc68007d3c28a4ef",
> "src-sha256": "0cbj3796qz69ylwcvdw5jsl1cdyrcfk81d7shjkakvjh59rv5sjk"

Looks like the lightpush delay fix is already present in this version, so that cannot be the cause of the issue.

@mariia-skrypnyk

mariia-skrypnyk commented Jan 4, 2024

Hello everyone! Sharing fresh logs from the failed e2e runs, which are still catching data delivery issues.
Please take a look, @chaitanyaprem.

Here is an example from PR e2e run (January 3):

Sender's device sends "something to reply to" message to the community channel
Receiver's device doesn't get this message

Sender.log.zip
Receiver.log.zip

@richard-ramos
Member

Can we try with this PR? status-im/status-mobile#18395 adds extra logging to go-waku and also updates it to the latest version.

@pavloburykh

> Can we try with this PR? status-im/status-mobile#18395 adds extra logging to go-waku and also updates it to the latest version.

Thanks @richard-ramos! We will re-run the tests and provide new logs based on the results.

@pavloburykh

pavloburykh commented Jan 5, 2024

@richard-ramos I have checked the e2e results in status-im/status-mobile#18395. This time I do not see cases where data has not been delivered at all, although there are a couple of examples where data was delivered with a rather long delay. For example:

  1. User A sends contact request to User B
  2. User B receives contact request only after 2 minutes of waiting.

UserA_CR_sender.log
UserB_CR_receiver.log

@chaitanyaprem
Contributor

chaitanyaprem commented Jan 8, 2024

> @richard-ramos I have checked the e2e results in status-im/status-mobile#18395. This time I do not see cases where data has not been delivered at all, although there are a couple of examples where data was delivered with a rather long delay. For example:
>
> 1. User A [sends](https://app.eu-central-1.saucelabs.com/tests/90a5b7d76161424bb816825e8e4ff0d6?auth=a30526c1ef2662d8dcfbd5d9ec94f735#69) contact request to User B
>
> 2. User B [receives](https://app.eu-central-1.saucelabs.com/tests/b981848b126041cf8508108a96f7fea4?auth=7fea4ae9f7eab52d7795dfcc830e4339#72) contact request only after **2 minutes** of waiting.
>
> UserA_CR_sender.log UserB_CR_receiver.log

Thanks for the logs.
After going through the receiver logs, it seems the receiver node got disconnected from both filter peers within a span of 600 ms (which falls within the range of the message send time). Hence, the message seems to have been retrieved via Store in the next store sync cycle, which is the cause of the delay noticed.

Taking a further look at both the receiver and sender logs for peer disconnections, I noticed a lot of disconnections in both.

Oddly, the interval between disconnections towards the fleet nodes is the same in both receiver and sender.
Towards peers 16Uiu2HAmGwcE8v7gmJNEWFtZtojYpPMTHy2jBLL6xRk33qgDxFWX and 16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31, the interval between disconnections seems to be 10 seconds.

Towards peer 16Uiu2HAm8mUZ18tBWPXDQsaF7PbCKYA35z7WB2xNZH2EVq1qS8LJ, the interval between disconnections seems to be 5 seconds. It is almost as if a timer is running or some periodic event is causing these disconnections. I don't see any disconnection logs in the fleet for these peers, so it could be that the disconnection is noticed only on the client side and the fleet does not notice it. Either the network at the test clients is that flaky, or something in the code is causing this.
Any thoughts, @richard-ramos, @vitvly, @cammellos, @kaichaosun, on any timer in the code that could trigger this behaviour?
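For what it's worth, one quick way to check whether these disconnections really are periodic is to extract the disconnect timestamps from the client log and print the deltas. A rough sketch, assuming only the `MM-DD|HH:MM:SS.mmm` timestamp format visible in the log lines quoted in this thread (the grep pattern used to select the lines is up to whoever runs it):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"time"
)

// Matches the MM-DD|HH:MM:SS.mmm timestamps used in the geth/status-go logs quoted here.
var tsRe = regexp.MustCompile(`(\d{2}-\d{2})\|(\d{2}:\d{2}:\d{2}\.\d{3})`)

func main() {
	// Usage (illustrative): grep the disconnect log lines for one peer ID from
	// geth.log and pipe the matching lines to this program on stdin.
	var prev time.Time
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		m := tsRe.FindStringSubmatch(scanner.Text())
		if m == nil {
			continue
		}
		// The log format has no year; assume the current year, which is fine for deltas.
		ts, err := time.Parse("2006 01-02 15:04:05.000",
			fmt.Sprintf("%d %s %s", time.Now().Year(), m[1], m[2]))
		if err != nil {
			continue
		}
		if !prev.IsZero() {
			fmt.Printf("%s  (+%v since previous disconnect)\n",
				ts.Format("01-02 15:04:05.000"), ts.Sub(prev))
		}
		prev = ts
	}
}
```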

@richard-ramos
Member

That's very curious. Looking at the instances of Network().ClosePeer in the status-go code, it only happens in `func (w *Waku) identifyAndConnect(ctx context.Context, isLightClient bool, peerInfo peer.AddrInfo)`, which is executed at the beginning while attempting to determine whether the peers obtained from the enrtree can be used.

@richard-ramos
Member

There are some interesting log lines here:

DEBUG[01-05|13:34:01.324|github.com/status-im/status-go/vendor/go.uber.org/zap/sugar.go:163]                                                 protocol mux failed: stream reset (took 1.080462861s) 

....

DEBUG[01-05|13:34:03.302|github.com/status-im/status-go/vendor/go.uber.org/zap/sugar.go:163]                                                 protocol EOF: 16Uiu2HAm8mUZ18tBWPXDQsaF7PbCKYA35z7WB2xNZH2EVq1qS8LJ (took 557.447009ms) 

These are all part of the streamhandler in basichost: https://github.com/libp2p/go-libp2p/blob/c681c4a9b825089b83a2be806a8cca7ebd3bca13/p2p/host/basic/basic_host.go#L415-L417

@chaitanyaprem
Contributor

chaitanyaprem commented Jan 16, 2024

Update: After going through the fleet logs for this peer ID and the date reported, we have identified that the fleet (nwaku nodes) is disconnecting connections from the emulator due to reaching the IP colocation limit.
This causes the connections to all filter nodes to be dropped constantly, which is why the messages are not received via Filter; the client then waits for the next store query, via which the messages are eventually fetched.

Attached image: Screenshot 2024-01-16 at 11 29 11 AM

Suggested solution:
Increase or disable the IP colocation limit on the shards.test fleet as per waku-org/nwaku#2323 so that CI doesn't get affected.
Since disabling it completely would leave the node open to attack, it is recommended to increase this limit to 100.
Once the fleet is updated with the new release of nwaku 0.24.0 (scheduled to be released 22 Jan), this parameter can be used to specify the colocation limit:

--ip-colocation-limit=100

@chaitanyaprem
Contributor

Related issue in the infra repo: status-im/infra-status#27, which should address this.

@pavloburykh

Hey team! The situation with data delivery has definitely improved after deploying status-im/infra-status#27

But unfortunately we still encounter some issues. Here is an example caught by today's e2e run:

Contact request has not been delivered to receiver

  1. User A sends contact request to User B
  2. User B has not received contact request

User_A_CR_sender.log
User_B_CR_receiver.log

cc @chaitanyaprem

@anastasiyaig
Contributor

I think I saw that recently too but was not able to reproduce it consistently... status-im/status-desktop#13270

@chaitanyaprem
Contributor

> Hey team! The situation with data delivery has definitely improved after deploying status-im/infra-shards#27
>
> But unfortunately we still encounter some issues. Here is an example caught by today's e2e run:
>
> Contact request has not been delivered to receiver
>
> 1. User A [sends](https://app.eu-central-1.saucelabs.com/tests/bb180f9bdd3c41b38a9a918f7588337f?auth=211c1e1f895e0ab346684834835141cc#68) contact request to User B
>
> 2. User B [has not received](https://app.eu-central-1.saucelabs.com/tests/7738311bf53d41b3a7564f5103956c21?auth=b46b5e9a430413473f06d6e11c1998f4#59) contact request
>
> User_A_CR_sender.log User_B_CR_receiver.log
>
> cc @chaitanyaprem

A quick look at the logs shows that the receiver did not have any filter peers, which is why the contact request was not received.
I could see the message being received by the fleet.
Looking into why the receiver did not have any peer connections, which seems to be the root cause.

@anastasiyaig
Contributor

> Looking into why the receiver did not have any peer connections, which seems to be the root cause.

That is actually something I was observing a lot yesterday. The app was saying there is no internet connection when that was not the case at all, and the peer count was jumping like crazy (from 1 to 6 and back).

@chaitanyaprem
Contributor

Also, one more observation I noticed: on mobile there is no notification showing we are offline even if the peer count is 0.
I guess that is also something which needs to be fixed in the mobile app. Not sure whom to report this to.

@pavloburykh

> Also, one more observation I noticed: on mobile there is no notification showing we are offline even if the peer count is 0. I guess that is also something which needs to be fixed in the mobile app. Not sure whom to report this to.

@chaitanyaprem thank you for this observation. I will create a related issue in the mobile repo.

@chaitanyaprem
Contributor

chaitanyaprem commented Jan 29, 2024

> A quick look at the logs shows that the receiver did not have any filter peers, which is why the contact request was not received.
> I could see the message being received by the fleet.
> Looking into why the receiver did not have any peer connections, which seems to be the root cause.

After looking into the logs further, I found something suspicious: after losing all peer connections, discv5 seems to have been stopped and not restarted, which prevents the node from discovering and connecting to peers.

It could be due to the issue @qfrank has identified here, or something else.
When you rerun those changes, I want to see the CI results as well.

In the meantime, I will try to find out if there is something else going on.
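A minimal sketch of the kind of safeguard implied here (restart discovery when the node has lost all peers), assuming a hypothetical discovery handle and peer-count source; none of these names are the actual status-go/go-waku API:

```go
package discwatch

import (
	"context"
	"log"
	"time"
)

// discovery is a hypothetical handle to discv5 (illustrative only).
type discovery interface {
	Start(ctx context.Context) error
	IsRunning() bool
}

// ensureDiscoveryRuns periodically restarts discovery whenever the node has no
// peers, which is the failure mode suspected in this comment.
func ensureDiscoveryRuns(ctx context.Context, d discovery, peerCount func() int) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if peerCount() == 0 && !d.IsRunning() {
				if err := d.Start(ctx); err != nil {
					log.Printf("discv5 restart failed: %v", err)
				}
			}
		}
	}
}
```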

@pavloburykh

> When you rerun those changes, I want to see the CI results as well.

Hi @chaitanyaprem! Here is an example of a failed e2e from this PR:

Message has not been delivered in community channel

  1. User A sends a message in community channel
  2. User B has not received this message in community channel

User_A_message_Sender.log
User_B_message_Receiver.log

@chaitanyaprem
Contributor

> > When you rerun those changes, I want to see the CI results as well.
>
> Hi @chaitanyaprem! Here is an example of a failed e2e from this PR:
>
> Message has not been delivered in community channel
>
> 1. User A [sends](https://app.eu-central-1.saucelabs.com/tests/fd4e8074a8c045b1b90febb07826b9e6?auth=fe47e9f771c3f03cf7c69db13da0b3ab#195) a message in community channel
>
> 2. User B [has not received](https://app.eu-central-1.saucelabs.com/tests/3e4bb2ef7cf6489e976f5d9489edde3f?auth=a4236b3bb42593a9e60667f76046f4d3#124) this message in community channel
>
> User_A_message_Sender.log User_B_message_Receiver.log

Thanks!
Looking at the logs, the message was received by the fleet and the client has proper peer connections (which means this is different from what I had observed before).
For some reason this message did not make it to the client. Will debug further and update.

@chaitanyaprem
Contributor

chaitanyaprem commented Jan 30, 2024

> > > When you rerun those changes, I want to see the CI results as well.
> >
> > Hi @chaitanyaprem! Here is an example of a failed e2e from this PR:
> >
> > Message has not been delivered in community channel
> >
> > 1. User A [sends](https://app.eu-central-1.saucelabs.com/tests/fd4e8074a8c045b1b90febb07826b9e6?auth=fe47e9f771c3f03cf7c69db13da0b3ab#195) a message in community channel
> >
> > 2. User B [has not received](https://app.eu-central-1.saucelabs.com/tests/3e4bb2ef7cf6489e976f5d9489edde3f?auth=a4236b3bb42593a9e60667f76046f4d3#124) this message in community channel
> >
> > User_A_message_Sender.log User_B_message_Receiver.log
>
> Thanks! Looking at the logs, the message was received by the fleet and the client has proper peer connections (which means this is different from what I had observed before). For some reason this message did not make it to the client. Will debug further and update.

After analyzing the logs further, it looks like for some reason the fleet node thinks the subscription is not there, whereas the app thinks it is subscribed. One possible reason I can think of is that a subscribe and an unsubscribe for the same contentTopic got jumbled and went out of sequence to the fleet node, which may have caused this.
Here is a log from one of the fleet nodes for a contentTopic where an unsubscribe came within a few msecs after a subscribe.

Screenshot 2024-01-30 at 11 50 42 AM

Below are status-go logs showing an unsubscribe and a subscribe around the same time, which may cause the requests to go out of sequence (see the sketch after the log excerpt).
@kaichaosun @vitvly, I am not sure if something in status-go is causing this kind of race condition, leading to subscribe and unsubscribe intermingling for the same contentTopic. Please take a look.

DEBUG[01-29|10:25:10.369|github.com/status-im/status-go/wakuv2/filter_manager.go:208]                                                        filter subscription success              filterId=6e211475e02507936a817f748d8a35e1660e888185e804780649f88f15131e9f peer=16Uiu2HAmGwcE8v7gmJNEWFtZtojYpPMTHy2jBLL6xRk33qgDxFWX pubsubTopic=/waku/2/rs/16/32 contentTopics=[/waku/1/0xeb0a33be/rfc26]
DEBUG[01-29|10:25:18.308|github.com/status-im/status-go/protocol/transport/filters_manager.go:565]                                           registering filter for                   namespace=filtersManager chatID=0x0312b80bd0e8afffeeabbc2a86ccaba3f86fb1c24e87c6caf2f9c5deac71f428863a0e0225-cde0-461e-916a-ddd15eef6c9a type=public ContentTopic=0xeb0a33be PubsubTopic=
DEBUG[01-29|10:25:18.309|github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/protocol/filter/client.go:239]            sending FilterSubscribeRequest           peerID=16Uiu2HAmGwcE8v7gmJNEWFtZtojYpPMTHy2jBLL6xRk33qgDxFWX request="request_id:\"9606495a1da897068e4338281581301a8eb60152d459691a465e04cb97cd8a2e\"  filter_subscribe_type:UNSUBSCRIBE  pubsub_topic:\"/waku/2/rs/16/32\"  content_topics:\"/waku/1/0xeb0a33be/rfc26\""
DEBUG[01-29|10:25:18.309|github.com/status-im/status-go/wakuv2/filter_manager.go:198]                                                        filter subscribe to filter node          filterId=18ba8884c635a265185075a1c7511b443c077267fe24fb1b6ec0612d50b8f72c pubsubTopic=/waku/2/rs/16/32 contentTopics=[/waku/1/0xeb0a33be/rfc26]
DEBUG[01-29|10:25:18.309|github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/protocol/filter/client.go:239]            sending FilterSubscribeRequest           peerID=16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31 request="request_id:\"cdc7e5125c735389157568e403f364c65e3ac485ee826f0b32cc1a2da5f0c11c\"  filter_subscribe_type:SUBSCRIBE  pubsub_topic:\"/waku/2/rs/16/32\"  content_topics:\"/waku/1/0xeb0a33be/rfc26\""
DEBUG[01-29|10:25:18.309|github.com/status-im/status-go/wakuv2/filter_manager.go:198]                                                        filter subscribe to filter node          filterId=18ba8884c635a265185075a1c7511b443c077267fe24fb1b6ec0612d50b8f72c pubsubTopic=/waku/2/rs/16/32 contentTopics=[/waku/1/0xeb0a33be/rfc26]
DEBUG[01-29|10:25:18.310|github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/protocol/filter/client.go:239]            sending FilterSubscribeRequest           peerID=16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31 request="request_id:\"9aad6cedddcfe011894e508d5f5b25125d95ae88ebc6a23b9fd1ec9323abf6ea\"  filter_subscribe_type:SUBSCRIBE  pubsub_topic:\"/waku/2/rs/16/32\"  content_topics:\"/waku/1/0xeb0a33be/rfc26\""
DEBUG[01-29|10:25:18.310|github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/protocol/filter/client.go:239]            sending FilterSubscribeRequest           peerID=16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31 request="request_id:\"fa6dd07f2dcdf2691226b4828736f2f63ddc8bfd828114b3043610a898c956fd\"  filter_subscribe_type:UNSUBSCRIBE  pubsub_topic:\"/waku/2/rs/16/32\"  content_topics:\"/waku/1/0xeb0a33be/rfc26\""
DEBUG[01-29|10:25:18.334|github.com/status-im/status-go/wakuv2/filter_manager.go:208]                                                        filter subscription success              filterId=18ba8884c635a265185075a1c7511b443c077267fe24fb1b6ec0612d50b8f72c peer=16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31 pubsubTopic=/waku/2/rs/16/32 contentTopics=[/waku/1/0xeb0a33be/rfc26]
DEBUG[01-29|10:25:18.336|github.com/status-im/status-go/wakuv2/filter_manager.go:208]                                                        filter subscription success              filterId=18ba8884c635a265185075a1c7511b443c077267fe24fb1b6ec0612d50b8f72c peer=16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31 pubsubTopic=/waku/2/rs/16/32 contentTopics=[/waku/1/0xeb0a33be/rfc26]
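A minimal sketch of one way to avoid this kind of reordering: serialize all filter subscription changes for a given contentTopic through a single worker, so a subscribe can never overtake an earlier unsubscribe on the wire. The types below are hypothetical, not the actual filter_manager API:

```go
package filterseq

import "context"

// op is a pending subscription change for one contentTopic (hypothetical type).
type op struct {
	subscribe    bool
	pubsubTopic  string
	contentTopic string
}

// sender abstracts the actual FilterSubscribeRequest call (hypothetical).
type sender interface {
	Send(ctx context.Context, o op) error
}

// topicSerializer funnels ops for a single contentTopic through one goroutine,
// preserving the order in which the app issued them.
type topicSerializer struct {
	ops chan op
}

func newTopicSerializer(ctx context.Context, s sender) *topicSerializer {
	t := &topicSerializer{ops: make(chan op, 16)}
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			case o := <-t.ops:
				// A real implementation would retry or surface errors; the point
				// here is only the strict per-topic ordering of requests.
				_ = s.Send(ctx, o)
			}
		}
	}()
	return t
}

// Enqueue queues a subscribe/unsubscribe; requests never race each other on the wire.
func (t *topicSerializer) Enqueue(o op) {
	t.ops <- o
}
```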

@chaitanyaprem
Contributor

> > Looking into why the receiver did not have any peer connections, which seems to be the root cause.
>
> That is actually something I was observing a lot yesterday. The app was saying there is no internet connection when that was not the case at all, and the peer count was jumping like crazy (from 1 to 6 and back).

This could be due to the same issue mentioned here. Do you have any debug logs from around this time? I can take a look and confirm.

@pavloburykh

> This could be due to the same issue mentioned here. Do you have any debug logs from around this time? I can take a look and confirm.

Hey @chaitanyaprem! Just to make sure I understand correctly: you need some more logs, specifically from this PR, status-im/status-mobile#18492, correct?

@anastasiyaig
Contributor

@pavloburykh in case you have the option to apply the debug log level for messaging (geth.log), that is what @chaitanyaprem is looking for. Higher log levels sadly do not allow us to understand the nature of the issue.

@chaitanyaprem
Contributor

> > This could be due to the same issue mentioned here. Do you have any debug logs from around this time? I can take a look and confirm.
>
> Hey @chaitanyaprem! Just to make sure I understand correctly: you need some more logs, specifically from this PR, status-im/status-mobile#18492, correct?

No, I am assuming the issue @anastasiyaig faced could be similar to what was observed in this case (which was fixed by @qfrank). But I can only confirm that if I have logs from @anastasiyaig from when she faced the issue.

@pavloburykh

> @pavloburykh in case you have the option to apply the debug log level for messaging (geth.log), that is what @chaitanyaprem is looking for. Higher log levels sadly do not allow us to understand the nature of the issue.

Hey @anastasiyaig! We have the debug level applied by default in mobile PR builds.

@anastasiyaig
Contributor

> But I can only confirm that if I have logs from @anastasiyaig from when she faced the issue.

I just reproduced it again.
I opened the 0.15 version and sent a contact request to it. Waited for ~10 minutes and nothing arrived.
I restarted the app, and the contact request and messages are there now.

Attaching a fresh geth.log @chaitanyaprem
geth.log.zip

@chaitanyaprem
Contributor

> > But I can only confirm that if I have logs from @anastasiyaig from when she faced the issue.
>
> I just reproduced it again. I opened the 0.15 version and sent a contact request to it. Waited for ~10 minutes and nothing arrived. I restarted the app, and the contact request and messages are there now.
>
> Attaching a fresh geth.log @chaitanyaprem geth.log.zip

Based on the updated logs you shared (I am unable to upload them here because the file size is too big), it looks like the reason for the message loss is peer scores going down.
The peer scores are being reduced because of message validation failures for messages received from those peers.
This is very similar to what @mprakhov faced during Doha; similar behaviour is visible in the logs he shared as well.

I am not sure why this could be happening.
It is as though the node does not have the proper keys for whatever is used for message validation on the pubsubTopic.

Any thoughts @cammellos @vitvly @richard-ramos ?

DEBUG[01-30|13:55:29.040|go.uber.org/zap/sugar.go:163]   validation done; took 29.292µs
DEBUG[01-30|13:55:29.040|go.uber.org/zap/sugar.go:163]   message validation failed; dropping message from 16Uiu2HAmBDbMWFiG9ki8sDw6fYtraSxo4oHU9HbuN43S2HVyq1FD
DEBUG[01-30|13:55:29.040|go.uber.org/zap/sugar.go:163]   validation done; took 8.333µs
DEBUG[01-30|13:55:29.041|go.uber.org/zap/sugar.go:163]   message validation failed; dropping message from 16Uiu2HAmBDbMWFiG9ki8sDw6fYtraSxo4oHU9HbuN43S2HVyq1FD
DEBUG[01-30|13:55:29.041|go.uber.org/zap/sugar.go:163]   validation done; took 9.416µs

@richard-ramos
Member

I believe this problem might be happening due to status-im/status-desktop#13267. I suspect that a shard assignment removal or change happened earlier, but the private key originally used for the community was not removed, so the topic validation in relay would still apply the wrong key and end up penalizing peers by mistake.

If the problem is due to that, then it is likely fixed in #4627 (this is already in desktop master).
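To illustrate the mechanism being described: gossipsub topic validators that reject (rather than ignore) messages count against the forwarding peer's score, so a stale community key would silently penalize honest peers until they are pruned. A minimal sketch with go-libp2p-pubsub, where decryptPayload is a hypothetical helper standing in for the community-key check (not the real status-go validator):

```go
package validation

import (
	"context"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/peer"
)

// decryptPayload is a hypothetical stand-in for the community-key validation
// that status-go performs on shard pubsub topics; not the real implementation.
func decryptPayload(key []byte, payload []byte) ([]byte, bool) {
	// ... symmetric decryption / validation would live here ...
	return payload, len(key) > 0
}

// registerShardValidator attaches a validator to a shard topic. If the node holds
// a stale or wrong key, every message fails validation and is rejected, and the
// forwarding peer's gossipsub score drops; eventually the peer is pruned, which
// matches the behaviour suspected above.
func registerShardValidator(ps *pubsub.PubSub, topic string, key []byte) error {
	return ps.RegisterTopicValidator(topic,
		func(ctx context.Context, from peer.ID, msg *pubsub.Message) pubsub.ValidationResult {
			if _, ok := decryptPayload(key, msg.GetData()); !ok {
				// Reject counts against the sender's score; Ignore would not.
				return pubsub.ValidationReject
			}
			return pubsub.ValidationAccept
		})
}
```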

@kaichaosun
Contributor

The changes mentioned by @richard-ramos have been merged to status-desktop master; the release binary for this PR (status-im/status-desktop#13372) should be good to use for testing.

@pavloburykh @anastasiyaig could you confirm whether this issue has been resolved or not?

@chaitanyaprem
Contributor

> The changes mentioned by @richard-ramos have been merged to status-desktop master; the release binary for this PR (status-im/status-desktop#13372) should be good to use for testing.
>
> @pavloburykh @anastasiyaig could you confirm whether this issue has been resolved or not?

The changes mentioned by @richard-ramos fix only one of the issues observed above, i.e. why relay peer scores go down, causing message loss.

The issue with Filter subscribe/unsubscribe is a separate issue and still needs to be analyzed to figure out where the problem is. Ref #4459 (comment)

This issue is becoming a dumping ground for multiple problems. Let us close it and open separate issues to track them independently and avoid confusion.

@chaitanyaprem
Contributor

Created a separate issue for the Filter unsubscribe and subscribe going out of sequence: #4659

@chaitanyaprem
Contributor

chaitanyaprem commented Feb 1, 2024

Created a separate issue for the relay message loss observed due to lower peer scores: #4660

Hence, closing this one to avoid more issues piling up here.

Please open new issues for anything else that is observed.
