
bug: filter delay errors #4459

Closed
chair28980 opened this issue Dec 13, 2023 · 42 comments
Assignees
Labels
E:Chat (see https://github.com/waku-org/pm/issues/105 for details), status-waku-integ (all issues relating to the Status Waku integration)

Comments

@chair28980

chair28980 commented Dec 13, 2023

Quoting @richard-ramos from the Discord conversation:

  • There seems to be an issue with filter in which messages are pushed to clients with a noticeable delay.
  • Up until yesterday, there was a job that automagically deployed changes in nwaku master to shards.fleet. The DB got broken because of that.
  • QA mentioned that communities are not loading the message history (currently debugging this).

Quoting @chaitanyaprem from the Discord conversation:

As of now, I see a lot of subscribe requests when I start my desktop app (but those could just be for different contentTopics) and unsubscribes only per pubsubTopic (which could simply include all contentTopics that were subscribed).

Unable to debug further because contentTopics are not logged and there is no API to confirm which FilterSubscriptions are currently active.

Maybe it would be good to have some Admin APIs to query the list of Filter Subscriptions, and similar APIs for monitoring other protocols as well. This would help node operators, and in turn the Waku team, to debug issues.

For now, I will try to figure out another way to determine what could be causing this issue. At the same time, I recommend using unsubscribeAll while shutting down or exiting status-go.

A couple more observations we noticed during this session:
If peers are subscribed but are not available/alive, we try to dial them for each filterPush, which seems like a bad strategy. If peers are down for a few minutes, this leads to a lot of dials from the filterFullNode (which doesn't look optimal).

Secondly, the idle/inactive period after which a filterFullNode considers a subscription invalid may need to be reduced to maybe 30 minutes or so (not sure what the optimal value is at this point). Letting such idle subscriptions hang around for longer could be a minor attack vector on a service node: an attacker could spawn peers randomly, keep hitting the node with subscriptions, and never unsubscribe, leaving the filterFullNode wasting resources dialing peers for each message. Not sure if this is a big issue, but something to think about.
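A minimal sketch of the idle-subscription pruning being suggested here, assuming a hypothetical subscription registry on the filter full node (the names below are illustrative, not the actual nwaku/go-waku implementation):

```go
package filter

import (
	"sync"
	"time"
)

// idleTimeout is the illustrative 30-minute window suggested above.
const idleTimeout = 30 * time.Minute

// subscription tracks when a subscriber was last seen (hypothetical type).
type subscription struct {
	peerID   string
	lastSeen time.Time
}

// registry is a toy stand-in for the full node's subscription store.
type registry struct {
	mu   sync.Mutex
	subs map[string]*subscription // keyed by peerID
}

// touch records activity (subscribe, ping, successful push) for a peer.
func (r *registry) touch(peerID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if s, ok := r.subs[peerID]; ok {
		s.lastSeen = time.Now()
	}
}

// pruneIdle drops subscriptions inactive for longer than idleTimeout, so the
// full node stops dialing unreachable subscribers on every filterPush.
func (r *registry) pruneIdle() {
	r.mu.Lock()
	defer r.mu.Unlock()
	for id, s := range r.subs {
		if time.Since(s.lastSeen) > idleTimeout {
			delete(r.subs, id)
		}
	}
}

// startPruner runs pruneIdle periodically until stop is closed.
func (r *registry) startPruner(stop <-chan struct{}) {
	ticker := time.NewTicker(time.Minute)
	go func() {
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				r.pruneIdle()
			case <-stop:
				return
			}
		}
	}()
}
```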

QA Test Scenario Description

Scenario 1
1.
2.
Scenario 2
1.
2.

@chair28980 chair28980 changed the title fix: Use unsubscribeAll when exiting or shutting down status-go Filter delay bugs Dec 13, 2023
@chair28980 chair28980 changed the title Filter delay bugs bug: filter delay errors Dec 13, 2023
@chair28980 chair28980 added the E:Chat See https://github.com/waku-org/pm/issues/105 for details label Dec 13, 2023
@chaitanyaprem
Contributor

> As of now, I see a lot of subscribe requests when I start my desktop app (but those could just be for different contentTopics) and unsubscribes only per pubsubTopic (which could simply include all contentTopics that were subscribed).

This can be ignored, as it is very much possible that an unsubscribe is sent for each pubsubTopic separately, covering all contentTopics that were subscribed independently. But I did notice that when I bring my status-desktop down, the number of subscriptions on the fleet node does not decrease, whereas I do see a new subscription when I bring status-desktop up. This indicates that some filter subscriptions are not being explicitly unsubscribed when the node goes down, leaving open subscriptions on the fleet node.

@chaitanyaprem
Contributor

chaitanyaprem commented Dec 14, 2023

> Unable to debug further because contentTopics are not logged and there is no API to confirm which FilterSubscriptions are currently active.
>
> Maybe it would be good to have some Admin APIs to query the list of Filter Subscriptions, and similar APIs for monitoring other protocols as well. This would help node operators, and in turn the Waku team, to debug issues.

Created this issue, which should help with such debugging in the future.

@chaitanyaprem
Contributor

> At the same time, I recommend using unsubscribeAll while shutting down or exiting status-go.

This probably needs to be done in status-go, or #4459 (comment) needs to be fixed, since when desktop is brought down gracefully the filter subscription is not cleared in nwaku. This can be solved by sending a filter unsubscribeAll when status-go is shutting down, or by unsubscribing from the topics that are currently not being unsubscribed.

@kaichaosun kaichaosun self-assigned this Dec 15, 2023
@chaitanyaprem
Contributor

> > At the same time, I recommend using unsubscribeAll while shutting down or exiting status-go.
>
> This probably needs to be done in status-go, or #4459 (comment) needs to be fixed, since when desktop is brought down gracefully the filter subscription is not cleared in nwaku. This can be solved by sending a filter unsubscribeAll when status-go is shutting down, or by unsubscribing from the topics that are currently not being unsubscribed.

Do note that, as per @richard-ramos, we should not wait for the response to unsubscribeAll; rather, this should be done in a fire-and-forget manner.
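A minimal sketch of what fire-and-forget could look like on shutdown, assuming a go-waku-style filter client that exposes an UnsubscribeAll method (the interface below is hypothetical, defined only for illustration and not the real API):

```go
package wakushutdown

import (
	"context"
	"log"
	"time"
)

// filterClient is a hypothetical stand-in for the light client's filter API.
type filterClient interface {
	UnsubscribeAll(ctx context.Context) error
}

// unsubscribeAllOnShutdown fires the unsubscribe request without blocking
// shutdown on the response, as suggested in this thread. The request is
// best-effort: if the process exits before it completes, the service node's
// idle-subscription cleanup has to take care of the leftover state.
func unsubscribeAllOnShutdown(c filterClient) {
	go func() {
		// Bound the attempt so a dead service node can't hold the goroutine forever.
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()

		if err := c.UnsubscribeAll(ctx); err != nil {
			// Fire and forget: log and move on, never block shutdown on this.
			log.Printf("filter unsubscribeAll failed (ignored): %v", err)
		}
	}()
}
```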

@chair28980 chair28980 added the status-waku-integ All issues relating to the Status Waku integration. label Dec 19, 2023
@chaitanyaprem
Contributor

I have not been able to reproduce this issue when I tested with 0.15.0-rc.10 of desktop.
I am running 3 desktop instances, 2 of them in light mode and 1 in relay mode.
I have tested by sending a sequence of 10 messages from each of the 3 in a group chat and a community.

I have not noticed any significant delays in receiving the messages. The maximum delay I observed was ~3 seconds.

@churik @anastasiyaig I would need more information about when the delay with Filter is noticed (it was mentioned that a delay of about 60 seconds was observed).
Can we retest using the latest build, which includes the fix for the lightpush delay?

@anastasiyaig
Contributor

We did have a try yesterday; I don't see any delay from my end. However, I know that the 'founder' of that issue was the autotests on the mobile side. Maybe we can try to launch them again? @pavloburykh wdyt?

Can we keep this issue open for a while? Perhaps massive dogfooding will confirm its absence/presence; so far I can't reproduce it either.

@pavloburykh

pavloburykh commented Dec 27, 2023

Hey @anastasiyaig @chaitanyaprem!

> Can we keep this issue open for a while? Perhaps massive dogfooding will confirm its absence/presence; so far I can't reproduce it either.

Totally agree. Let's keep it open for a while, as our e2e tests are still catching issues that might be related.

Here is an example from nightly e2e run (December 27):

  1. Sender's device sends a stack of 4 messages

  2. Receiver's device receives only 1 of 4 messages after 30 seconds of waiting

Receiver_device.log.zip
Sender_device.log.zip

I am currently on vacation till January 2. I will try to investigate and provide more details when I am back. But from what I noticed, this issue is somewhat random: most of the time messages are delivered instantly, but sometimes a delay occurs.

Please note that I do not take into consideration cases involving online/offline handling or returning from the background, as those are separate issues, described here by @churik.

@chaitanyaprem
Contributor

chaitanyaprem commented Dec 27, 2023

> 2. Receiver's device receives only 1 of 4 messages after 30 seconds of waiting

Hmm, this sounds like the delay is caused by lightpush, as I pointed out here. Which status-go commit/version is used for the nightly build?

@pavloburykh

> Hmm, this sounds like the delay is caused by lightpush, as I pointed out here. Which status-go commit/version is used for the nightly build?

@chaitanyaprem

"version": "v0.171.39",
"commit-sha1": "debfe1cab368140b16d2383efc68007d3c28a4ef",
"src-sha256": "0cbj3796qz69ylwcvdw5jsl1cdyrcfk81d7shjkakvjh59rv5sjk"

@chaitanyaprem
Contributor

> Receiver_device.log.zip
> Sender_device.log.zip

After going through the logs, it looks like the connection to the filter node was dropped after 1 or 2 messages were received. This could be the reason why the rest of the messages did not get delivered. However, I could not find a reason why the connection was terminated and not re-established. I don't have enough info to know whether other filter peers were connected at this point.

I am also noticing that the node is connecting to the go-waku.test and go-waku.prod fleets. Not sure if that is also causing some issue. @richard-ramos, I don't think it is a good idea to connect these fleets to the shards.test fleet.

@chaitanyaprem
Contributor

> Hmm, this sounds like the delay is caused by lightpush, as I pointed out here. Which status-go commit/version is used for the nightly build?
>
> @chaitanyaprem
>
> "version": "v0.171.39",
> "commit-sha1": "debfe1cab368140b16d2383efc68007d3c28a4ef",
> "src-sha256": "0cbj3796qz69ylwcvdw5jsl1cdyrcfk81d7shjkakvjh59rv5sjk"

Looks like the lightpush delay fix is already present in this version, so that cannot be the cause of the issue.

@mariia-skrypnyk

mariia-skrypnyk commented Jan 4, 2024

Hello everyone! Sharing fresh logs from the failed e2e runs, which are still catching data delivery issues.
Please take a look, @chaitanyaprem.

Here is an example from PR e2e run (January 3):

Sender's device sends "something to reply to" message to the community channel
Receiver's device doesn't get this message

Sender.log.zip
Receiver.log.zip

@richard-ramos
Member

Can we try with this PR? status-im/status-mobile#18395 adds extra logging to go-waku and also updates it to the latest version.

@pavloburykh

> Can we try with this PR? status-im/status-mobile#18395 adds extra logging to go-waku and also updates it to the latest version.

Thanks @richard-ramos! We will re-run the tests and provide new logs based on the results.

@pavloburykh

pavloburykh commented Jan 5, 2024

@richard-ramos I have checked the e2e results in status-im/status-mobile#18395. This time I do not see cases where data has not been delivered at all, although there are a couple of examples where data was delivered with a rather long delay. For example:

  1. User A sends contact request to User B
  2. User B receives contact request only after 2 minutes of waiting.

UserA_CR_sender.log
UserB_CR_receiver.log

@chaitanyaprem
Contributor

chaitanyaprem commented Jan 8, 2024

> @richard-ramos I have checked the e2e results in status-im/status-mobile#18395. This time I do not see cases where data has not been delivered at all, although there are a couple of examples where data was delivered with a rather long delay. For example:
>
> 1. User A [sends](https://app.eu-central-1.saucelabs.com/tests/90a5b7d76161424bb816825e8e4ff0d6?auth=a30526c1ef2662d8dcfbd5d9ec94f735#69) contact request to User B
>
> 2. User B [receives](https://app.eu-central-1.saucelabs.com/tests/b981848b126041cf8508108a96f7fea4?auth=7fea4ae9f7eab52d7795dfcc830e4339#72) contact request only after **2 minutes** of waiting.
>
> UserA_CR_sender.log UserB_CR_receiver.log

Thanks for the logs.
After going through the receiver logs, it seems the receiver node got disconnected from both filter peers within a span of 600 ms (which falls within the range of the message send time). Hence, the message seems to have been retrieved via Store in the next store sync cycle, which is the cause of the delay noticed.

Taking a further look at both the receiver and sender logs for peer disconnections, I noticed a lot of disconnections in both.

Oddly, the interval between disconnections towards the fleet nodes is the same in both receiver and sender.
Towards peers 16Uiu2HAmGwcE8v7gmJNEWFtZtojYpPMTHy2jBLL6xRk33qgDxFWX and 16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31, the interval between disconnections seems to be 10 seconds.

Towards peer 16Uiu2HAm8mUZ18tBWPXDQsaF7PbCKYA35z7WB2xNZH2EVq1qS8LJ, the interval between disconnections seems to be 5 seconds. It is almost as if a timer is running or some periodic event is causing these disconnections. I don't see any disconnection logs in the fleet for these peers, so it could be that the disconnection is noticed only on the client side and the fleet does not notice it. Either the network at the test clients is that flaky, or something in the code is causing this.
Any thoughts, @richard-ramos, @vitvly, @cammellos, @kaichaosun, on any timer in the code that could trigger this behaviour?
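For what it's worth, one quick way to check whether these disconnections really are periodic is to extract the disconnect timestamps from the client log and print the deltas. A rough sketch, assuming only the `MM-DD|HH:MM:SS.mmm` timestamp format visible in the log lines quoted in this thread (the grep pattern used to select the lines is up to whoever runs it):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"time"
)

// Matches the MM-DD|HH:MM:SS.mmm timestamps used in the geth/status-go logs quoted here.
var tsRe = regexp.MustCompile(`(\d{2}-\d{2})\|(\d{2}:\d{2}:\d{2}\.\d{3})`)

func main() {
	// Usage (illustrative): grep the disconnect log lines for one peer ID from
	// geth.log and pipe the matching lines to this program on stdin.
	var prev time.Time
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		m := tsRe.FindStringSubmatch(scanner.Text())
		if m == nil {
			continue
		}
		// The log format has no year; assume the current year, which is fine for deltas.
		ts, err := time.Parse("2006 01-02 15:04:05.000",
			fmt.Sprintf("%d %s %s", time.Now().Year(), m[1], m[2]))
		if err != nil {
			continue
		}
		if !prev.IsZero() {
			fmt.Printf("%s  (+%v since previous disconnect)\n",
				ts.Format("01-02 15:04:05.000"), ts.Sub(prev))
		}
		prev = ts
	}
}
```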

@richard-ramos
Member

That's very curious. Looking at the instances of Network().ClosePeer in the status-go code, it only happens in `func (w *Waku) identifyAndConnect(ctx context.Context, isLightClient bool, peerInfo peer.AddrInfo)`, which is executed at the beginning while attempting to determine whether the peers obtained from the enrtree can be used.

@richard-ramos
Member

There are some interesting log lines here:

DEBUG[01-05|13:34:01.324|github.com/status-im/status-go/vendor/go.uber.org/zap/sugar.go:163]                                                 protocol mux failed: stream reset (took 1.080462861s) 

....

DEBUG[01-05|13:34:03.302|github.com/status-im/status-go/vendor/go.uber.org/zap/sugar.go:163]                                                 protocol EOF: 16Uiu2HAm8mUZ18tBWPXDQsaF7PbCKYA35z7WB2xNZH2EVq1qS8LJ (took 557.447009ms) 

These are all part of the streamhandler in basichost: https://github.com/libp2p/go-libp2p/blob/c681c4a9b825089b83a2be806a8cca7ebd3bca13/p2p/host/basic/basic_host.go#L415-L417

@chaitanyaprem
Contributor

chaitanyaprem commented Jan 16, 2024

Update: After going through the fleet logs for this peer ID and the date reported, we have identified that the fleet (nwaku nodes) is disconnecting connections from the emulator due to reaching the IP colocation limit.
This causes the connections to all filter nodes to be dropped constantly, which is why the messages are not received via Filter; the client then waits for the next store query, via which the messages are eventually fetched.

Attached image: Screenshot 2024-01-16 at 11 29 11 AM

Suggested solution:
Increase or disable the IP colocation limit on the shards.test fleet as per waku-org/nwaku#2323 so that CI doesn't get affected.
Since disabling it completely would leave the node open to attack, it is recommended to increase this limit to 100.
Once the fleet is updated with the new release of nwaku 0.24.0 (scheduled to be released 22 Jan), this parameter can be used to specify the colocation limit:

--ip-colocation-limit=100

@chaitanyaprem
Contributor

Related issue in the infra repo: status-im/infra-status#27, which should address this.

@pavloburykh

Hey team! The situation with data delivery has definitely improved after deploying status-im/infra-status#27

But unfortunately we still encounter some issues. Here is an example caught by today's e2e run:

Contact request has not been delivered to receiver

  1. User A sends contact request to User B
  2. User B has not received contact request

User_A_CR_sender.log
User_B_CR_receiver.log

cc @chaitanyaprem

@anastasiyaig
Contributor

I think I saw that recently too but was not able to reproduce it consistently... status-im/status-desktop#13270

@chaitanyaprem
Contributor

> Hey team! The situation with data delivery has definitely improved after deploying status-im/infra-shards#27
>
> But unfortunately we still encounter some issues. Here is an example caught by today's e2e run:
>
> Contact request has not been delivered to receiver
>
> 1. User A [sends](https://app.eu-central-1.saucelabs.com/tests/bb180f9bdd3c41b38a9a918f7588337f?auth=211c1e1f895e0ab346684834835141cc#68) contact request to User B
>
> 2. User B [has not received](https://app.eu-central-1.saucelabs.com/tests/7738311bf53d41b3a7564f5103956c21?auth=b46b5e9a430413473f06d6e11c1998f4#59) contact request
>
> User_A_CR_sender.log User_B_CR_receiver.log
>
> cc @chaitanyaprem

A quick look at the logs shows that the receiver did not have any filter peers, which is why the contact request was not received.
I could see the message being received by the fleet.
Looking into why the receiver did not have any peer connections, which seems to be the root cause.

@anastasiyaig
Contributor

> Looking into why the receiver did not have any peer connections, which seems to be the root cause.

That is actually something I was observing a lot yesterday. The app was saying there is no internet connection when that was not the case at all, and the peer count was jumping like crazy (from 1 to 6 and back).

@chaitanyaprem
Contributor

Also, one more observation I noticed: on mobile there is no notification showing we are offline even if the peer count is 0.
I guess that is also something which needs to be fixed in the mobile app. Not sure whom to report this to.

@pavloburykh

> Also, one more observation I noticed: on mobile there is no notification showing we are offline even if the peer count is 0. I guess that is also something which needs to be fixed in the mobile app. Not sure whom to report this to.

@chaitanyaprem thank you for this observation. I will create a related issue in the mobile repo.

@chaitanyaprem
Contributor

chaitanyaprem commented Jan 29, 2024

> A quick look at the logs shows that the receiver did not have any filter peers, which is why the contact request was not received.
> I could see the message being received by the fleet.
> Looking into why the receiver did not have any peer connections, which seems to be the root cause.

After looking into the logs further, I found something suspicious: after losing all peer connections, discv5 seems to have been stopped and not restarted, which prevents the node from discovering and connecting to peers.

It could be due to the issue @qfrank has identified here, or something else.
When you rerun those changes, I want to see the CI results as well.

In the meantime, I will try to find out if there is something else going on.
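A minimal sketch of the kind of safeguard implied here (restart discovery when the node has lost all peers), assuming a hypothetical discovery handle and peer-count source; none of these names are the actual status-go/go-waku API:

```go
package discwatch

import (
	"context"
	"log"
	"time"
)

// discovery is a hypothetical handle to discv5 (illustrative only).
type discovery interface {
	Start(ctx context.Context) error
	IsRunning() bool
}

// ensureDiscoveryRuns periodically restarts discovery whenever the node has no
// peers, which is the failure mode suspected in this comment.
func ensureDiscoveryRuns(ctx context.Context, d discovery, peerCount func() int) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if peerCount() == 0 && !d.IsRunning() {
				if err := d.Start(ctx); err != nil {
					log.Printf("discv5 restart failed: %v", err)
				}
			}
		}
	}
}
```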

@pavloburykh

> When you rerun those changes, I want to see the CI results as well.

Hi @chaitanyaprem! Here is an example of a failed e2e from this PR:

Message has not been delivered in community channel

  1. User A sends a message in community channel
  2. User B has not received this message in community channel

User_A_message_Sender.log
User_B_message_Receiver.log

@chaitanyaprem
Contributor

> > When you rerun those changes, I want to see the CI results as well.
>
> Hi @chaitanyaprem! Here is an example of a failed e2e from this PR:
>
> Message has not been delivered in community channel
>
> 1. User A [sends](https://app.eu-central-1.saucelabs.com/tests/fd4e8074a8c045b1b90febb07826b9e6?auth=fe47e9f771c3f03cf7c69db13da0b3ab#195) a message in community channel
>
> 2. User B [has not received](https://app.eu-central-1.saucelabs.com/tests/3e4bb2ef7cf6489e976f5d9489edde3f?auth=a4236b3bb42593a9e60667f76046f4d3#124) this message in community channel
>
> User_A_message_Sender.log User_B_message_Receiver.log

Thanks!
Looking at the logs, the message was received by the fleet and the client has proper peer connections (which means this is different from what I had observed before).
For some reason this message did not make it to the client. Will debug further and update.

@chaitanyaprem
Contributor

chaitanyaprem commented Jan 30, 2024

> > > When you rerun those changes, I want to see the CI results as well.
> >
> > Hi @chaitanyaprem! Here is an example of a failed e2e from this PR:
> >
> > Message has not been delivered in community channel
> >
> > 1. User A [sends](https://app.eu-central-1.saucelabs.com/tests/fd4e8074a8c045b1b90febb07826b9e6?auth=fe47e9f771c3f03cf7c69db13da0b3ab#195) a message in community channel
> >
> > 2. User B [has not received](https://app.eu-central-1.saucelabs.com/tests/3e4bb2ef7cf6489e976f5d9489edde3f?auth=a4236b3bb42593a9e60667f76046f4d3#124) this message in community channel
> >
> > User_A_message_Sender.log User_B_message_Receiver.log
>
> Thanks! Looking at the logs, the message was received by the fleet and the client has proper peer connections (which means this is different from what I had observed before). For some reason this message did not make it to the client. Will debug further and update.

After analyzing the logs further, it looks like for some reason the fleet node thinks the subscription is not there, whereas the app thinks it is subscribed. One possible reason I can think of is that a subscribe and an unsubscribe for the same contentTopic got jumbled and went out of sequence to the fleet node, which may have caused this.
Here is a log from one of the fleet nodes for a contentTopic where an unsubscribe came within a few msecs after a subscribe.

Screenshot 2024-01-30 at 11 50 42 AM

Below are status-go logs showing an unsubscribe and a subscribe around the same time, which may cause the requests to go out of sequence (see the sketch after the log excerpt).
@kaichaosun @vitvly, I am not sure if something in status-go is causing this kind of race condition, leading to subscribe and unsubscribe intermingling for the same contentTopic. Please take a look.

DEBUG[01-29|10:25:10.369|github.com/status-im/status-go/wakuv2/filter_manager.go:208]                                                        filter subscription success              filterId=6e211475e02507936a817f748d8a35e1660e888185e804780649f88f15131e9f peer=16Uiu2HAmGwcE8v7gmJNEWFtZtojYpPMTHy2jBLL6xRk33qgDxFWX pubsubTopic=/waku/2/rs/16/32 contentTopics=[/waku/1/0xeb0a33be/rfc26]
DEBUG[01-29|10:25:18.308|github.com/status-im/status-go/protocol/transport/filters_manager.go:565]                                           registering filter for                   namespace=filtersManager chatID=0x0312b80bd0e8afffeeabbc2a86ccaba3f86fb1c24e87c6caf2f9c5deac71f428863a0e0225-cde0-461e-916a-ddd15eef6c9a type=public ContentTopic=0xeb0a33be PubsubTopic=
DEBUG[01-29|10:25:18.309|github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/protocol/filter/client.go:239]            sending FilterSubscribeRequest           peerID=16Uiu2HAmGwcE8v7gmJNEWFtZtojYpPMTHy2jBLL6xRk33qgDxFWX request="request_id:\"9606495a1da897068e4338281581301a8eb60152d459691a465e04cb97cd8a2e\"  filter_subscribe_type:UNSUBSCRIBE  pubsub_topic:\"/waku/2/rs/16/32\"  content_topics:\"/waku/1/0xeb0a33be/rfc26\""
DEBUG[01-29|10:25:18.309|github.com/status-im/status-go/wakuv2/filter_manager.go:198]                                                        filter subscribe to filter node          filterId=18ba8884c635a265185075a1c7511b443c077267fe24fb1b6ec0612d50b8f72c pubsubTopic=/waku/2/rs/16/32 contentTopics=[/waku/1/0xeb0a33be/rfc26]
DEBUG[01-29|10:25:18.309|github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/protocol/filter/client.go:239]            sending FilterSubscribeRequest           peerID=16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31 request="request_id:\"cdc7e5125c735389157568e403f364c65e3ac485ee826f0b32cc1a2da5f0c11c\"  filter_subscribe_type:SUBSCRIBE  pubsub_topic:\"/waku/2/rs/16/32\"  content_topics:\"/waku/1/0xeb0a33be/rfc26\""
DEBUG[01-29|10:25:18.309|github.com/status-im/status-go/wakuv2/filter_manager.go:198]                                                        filter subscribe to filter node          filterId=18ba8884c635a265185075a1c7511b443c077267fe24fb1b6ec0612d50b8f72c pubsubTopic=/waku/2/rs/16/32 contentTopics=[/waku/1/0xeb0a33be/rfc26]
DEBUG[01-29|10:25:18.310|github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/protocol/filter/client.go:239]            sending FilterSubscribeRequest           peerID=16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31 request="request_id:\"9aad6cedddcfe011894e508d5f5b25125d95ae88ebc6a23b9fd1ec9323abf6ea\"  filter_subscribe_type:SUBSCRIBE  pubsub_topic:\"/waku/2/rs/16/32\"  content_topics:\"/waku/1/0xeb0a33be/rfc26\""
DEBUG[01-29|10:25:18.310|github.com/status-im/status-go/vendor/github.com/waku-org/go-waku/waku/v2/protocol/filter/client.go:239]            sending FilterSubscribeRequest           peerID=16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31 request="request_id:\"fa6dd07f2dcdf2691226b4828736f2f63ddc8bfd828114b3043610a898c956fd\"  filter_subscribe_type:UNSUBSCRIBE  pubsub_topic:\"/waku/2/rs/16/32\"  content_topics:\"/waku/1/0xeb0a33be/rfc26\""
DEBUG[01-29|10:25:18.334|github.com/status-im/status-go/wakuv2/filter_manager.go:208]                                                        filter subscription success              filterId=18ba8884c635a265185075a1c7511b443c077267fe24fb1b6ec0612d50b8f72c peer=16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31 pubsubTopic=/waku/2/rs/16/32 contentTopics=[/waku/1/0xeb0a33be/rfc26]
DEBUG[01-29|10:25:18.336|github.com/status-im/status-go/wakuv2/filter_manager.go:208]                                                        filter subscription success              filterId=18ba8884c635a265185075a1c7511b443c077267fe24fb1b6ec0612d50b8f72c peer=16Uiu2HAmAR24Mbb6VuzoyUiGx42UenDkshENVDj4qnmmbabLvo31 pubsubTopic=/waku/2/rs/16/32 contentTopics=[/waku/1/0xeb0a33be/rfc26]
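A minimal sketch of one way to avoid this kind of reordering: serialize all filter subscription changes for a given contentTopic through a single worker, so a subscribe can never overtake an earlier unsubscribe on the wire. The types below are hypothetical, not the actual filter_manager API:

```go
package filterseq

import "context"

// op is a pending subscription change for one contentTopic (hypothetical type).
type op struct {
	subscribe    bool
	pubsubTopic  string
	contentTopic string
}

// sender abstracts the actual FilterSubscribeRequest call (hypothetical).
type sender interface {
	Send(ctx context.Context, o op) error
}

// topicSerializer funnels ops for a single contentTopic through one goroutine,
// preserving the order in which the app issued them.
type topicSerializer struct {
	ops chan op
}

func newTopicSerializer(ctx context.Context, s sender) *topicSerializer {
	t := &topicSerializer{ops: make(chan op, 16)}
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			case o := <-t.ops:
				// A real implementation would retry or surface errors; the point
				// here is only the strict per-topic ordering of requests.
				_ = s.Send(ctx, o)
			}
		}
	}()
	return t
}

// Enqueue queues a subscribe/unsubscribe; requests never race each other on the wire.
func (t *topicSerializer) Enqueue(o op) {
	t.ops <- o
}
```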

@chaitanyaprem
Contributor

> > Looking into why the receiver did not have any peer connections, which seems to be the root cause.
>
> That is actually something I was observing a lot yesterday. The app was saying there is no internet connection when that was not the case at all, and the peer count was jumping like crazy (from 1 to 6 and back).

This could be due to the same issue mentioned here. Do you have any debug logs from around this time? I can take a look and confirm.

@pavloburykh

> This could be due to the same issue mentioned here. Do you have any debug logs from around this time? I can take a look and confirm.

Hey @chaitanyaprem! Just to make sure I understand correctly: you need some more logs, specifically from this PR, status-im/status-mobile#18492, correct?

@anastasiyaig
Contributor

@pavloburykh in case you have the option to apply the debug log level for messaging (geth.log), that is what @chaitanyaprem is looking for. Higher log levels sadly do not allow us to understand the nature of the issue.

@chaitanyaprem
Contributor

> > This could be due to the same issue mentioned here. Do you have any debug logs from around this time? I can take a look and confirm.
>
> Hey @chaitanyaprem! Just to make sure I understand correctly: you need some more logs, specifically from this PR, status-im/status-mobile#18492, correct?

No, I am assuming the issue @anastasiyaig faced could be similar to what was observed in this case (which was fixed by @qfrank). But I can only confirm that if I have logs from @anastasiyaig from when she faced the issue.

@pavloburykh

> @pavloburykh in case you have the option to apply the debug log level for messaging (geth.log), that is what @chaitanyaprem is looking for. Higher log levels sadly do not allow us to understand the nature of the issue.

Hey @anastasiyaig! We have the debug level applied by default in mobile PR builds.

@anastasiyaig
Contributor

> But I can only confirm that if I have logs from @anastasiyaig from when she faced the issue.

I just reproduced it again.
I opened the 0.15 version and sent a contact request to it. Waited for ~10 minutes and nothing arrived.
I restarted the app, and the contact request and messages are there now.

Attaching a fresh geth.log @chaitanyaprem
geth.log.zip

@chaitanyaprem
Contributor

> > But I can only confirm that if I have logs from @anastasiyaig from when she faced the issue.
>
> I just reproduced it again. I opened the 0.15 version and sent a contact request to it. Waited for ~10 minutes and nothing arrived. I restarted the app, and the contact request and messages are there now.
>
> Attaching a fresh geth.log @chaitanyaprem geth.log.zip

Based on the updated logs you shared (I am unable to upload them here because the file size is too big), it looks like the reason for the message loss is peer scores going down.
The peer scores are being reduced because of message validation failures for messages received from those peers.
This is very similar to what @mprakhov faced during Doha; similar behaviour is visible in the logs he shared as well.

I am not sure why this could be happening.
It is as though the node does not have the proper keys for whatever is used for message validation on the pubsubTopic.

Any thoughts @cammellos @vitvly @richard-ramos ?

DEBUG[01-30|13:55:29.040|go.uber.org/zap/sugar.go:163]   validation done; took 29.292µs
DEBUG[01-30|13:55:29.040|go.uber.org/zap/sugar.go:163]   message validation failed; dropping message from 16Uiu2HAmBDbMWFiG9ki8sDw6fYtraSxo4oHU9HbuN43S2HVyq1FD
DEBUG[01-30|13:55:29.040|go.uber.org/zap/sugar.go:163]   validation done; took 8.333µs
DEBUG[01-30|13:55:29.041|go.uber.org/zap/sugar.go:163]   message validation failed; dropping message from 16Uiu2HAmBDbMWFiG9ki8sDw6fYtraSxo4oHU9HbuN43S2HVyq1FD
DEBUG[01-30|13:55:29.041|go.uber.org/zap/sugar.go:163]   validation done; took 9.416µs

@richard-ramos
Member

I believe this problem might be happening due to status-im/status-desktop#13267. I suspect that a shard assignment removal or change happened earlier, but the private key originally used for the community was not removed, so the topic validation in relay would still apply the wrong key and end up penalizing peers by mistake.

If the problem is due to that, then it is likely fixed in #4627 (this is already in desktop master).
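To illustrate the mechanism being described: gossipsub topic validators that reject (rather than ignore) messages count against the forwarding peer's score, so a stale community key would silently penalize honest peers until they are pruned. A minimal sketch with go-libp2p-pubsub, where decryptPayload is a hypothetical helper standing in for the community-key check (not the real status-go validator):

```go
package validation

import (
	"context"

	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/peer"
)

// decryptPayload is a hypothetical stand-in for the community-key validation
// that status-go performs on shard pubsub topics; not the real implementation.
func decryptPayload(key []byte, payload []byte) ([]byte, bool) {
	// ... symmetric decryption / validation would live here ...
	return payload, len(key) > 0
}

// registerShardValidator attaches a validator to a shard topic. If the node holds
// a stale or wrong key, every message fails validation and is rejected, and the
// forwarding peer's gossipsub score drops; eventually the peer is pruned, which
// matches the behaviour suspected above.
func registerShardValidator(ps *pubsub.PubSub, topic string, key []byte) error {
	return ps.RegisterTopicValidator(topic,
		func(ctx context.Context, from peer.ID, msg *pubsub.Message) pubsub.ValidationResult {
			if _, ok := decryptPayload(key, msg.GetData()); !ok {
				// Reject counts against the sender's score; Ignore would not.
				return pubsub.ValidationReject
			}
			return pubsub.ValidationAccept
		})
}
```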

@kaichaosun
Contributor

The changes mentioned by @richard-ramos have been merged to status-desktop master; the release binary for this PR (status-im/status-desktop#13372) should be good to use for testing.

@pavloburykh @anastasiyaig could you confirm whether this issue has been resolved or not?

@chaitanyaprem
Contributor

> The changes mentioned by @richard-ramos have been merged to status-desktop master; the release binary for this PR (status-im/status-desktop#13372) should be good to use for testing.
>
> @pavloburykh @anastasiyaig could you confirm whether this issue has been resolved or not?

The changes mentioned by @richard-ramos fix only one of the issues observed above, i.e. why relay peer scores go down, causing message loss.

The issue with Filter subscribe/unsubscribe is a separate issue and still needs to be analyzed to figure out where the problem is. Ref #4459 (comment)

This issue is becoming a dumping ground for multiple problems. Let us close it and open separate issues to track them independently and avoid confusion.

@chaitanyaprem
Contributor

Created a separate issue for the Filter unsubscribe and subscribe going out of sequence: #4659

@chaitanyaprem
Contributor

chaitanyaprem commented Feb 1, 2024

Created a separate issue for the relay message loss observed due to lower peer scores: #4660

Hence, closing this one to avoid more issues piling up here.

Please open new issues for anything else that is observed.
