[BUG] Serialization issue with RequestCacheStats, negative longs unsupported, use writeLong or writeZLong [-7386] #13343

Closed
weily2 opened this issue Apr 23, 2024 · 15 comments · Fixed by #13553

@weily2

weily2 commented Apr 23, 2024

Describe the bug

Hello, after upgrading the cluster from 2.11.1 to 2.13.0, I see these messages in opensearch.log.

[2024-04-23T03:18:03,647][WARN ][o.o.t.OutboundHandler ] [node-hot-06] send message failed [channel: Netty4TcpChannel{localAddress=/A.B.C.40:9300, remoteAddress=/A.B.C.69:60146}]
java.lang.IllegalStateException: Negative longs unsupported, use writeLong or writeZLong for negative numbers [-7386]
    at org.opensearch.core.common.io.stream.StreamOutput.writeVLong(StreamOutput.java:310) ~[opensearch-core-2.13.0.jar:2.13.0]
    at org.opensearch.index.cache.request.RequestCacheStats.writeTo(RequestCacheStats.java:103) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.core.common.io.stream.StreamOutput.writeOptionalWriteable(StreamOutput.java:974) ~[opensearch-core-2.13.0.jar:2.13.0]
    at org.opensearch.action.admin.indices.stats.CommonStats.writeTo(CommonStats.java:285) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.admin.indices.stats.ShardStats.writeTo(ShardStats.java:146) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.core.common.io.stream.StreamOutput.writeOptionalWriteable(StreamOutput.java:974) ~[opensearch-core-2.13.0.jar:2.13.0]
    at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$NodeResponse.writeTo(TransportBroadcastByNodeAction.java:651) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.OutboundMessage.writeMessage(OutboundMessage.java:104) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.OutboundMessage.serialize(OutboundMessage.java:81) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:235) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:221) ~[opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.OutboundHandler$SendContext.get(OutboundHandler.java:275) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.OutboundHandler.internalSend(OutboundHandler.java:197) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:192) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.OutboundHandler.sendResponse(OutboundHandler.java:159) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:96) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:72) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:482) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:456) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceivedDecorate(SecuritySSLRequestHandler.java:206) [opensearch-security-2.13.0.0.jar:2.13.0.0]
    at org.opensearch.security.transport.SecurityRequestHandler.messageReceivedDecorate(SecurityRequestHandler.java:317) [opensearch-security-2.13.0.0.jar:2.13.0.0]
    at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceived(SecuritySSLRequestHandler.java:154) [opensearch-security-2.13.0.0.jar:2.13.0.0]
    at org.opensearch.security.OpenSearchSecurityPlugin$6$1.messageReceived(OpenSearchSecurityPlugin.java:828) [opensearch-security-2.13.0.0.jar:2.13.0.0]
    at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor$interceptHandler$1.messageReceived(RollupInterceptor.kt:114) [opensearch-index-management-2.13.0.0.jar:2.13.0.0]
    at org.opensearch.performanceanalyzer.transport.PerformanceAnalyzerTransportRequestHandler.messageReceived(PerformanceAnalyzerTransportRequestHandler.java:43) [opensearch-performance-analyzer-2.13.0.0.jar:2.13.0.0]
    at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:480) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:913) [opensearch-2.13.0.jar:2.13.0]
    at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.13.0.jar:2.13.0]
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
    at java.base/java.lang.Thread.run(Thread.java:1583) [?:?]

Does anyone have an idea of what is wrong?

Related component

Cluster Manager

To Reproduce

  1. Call the API http://A.B.C.35/_cat/shards?v. The response is:
    [screenshot of the _cat/shards response]
    The docs and store fields for the shards on A.B.C.34 and A.B.C.40 are None, and I can find the error logs shown above on 34 and 40.
  2. After calling http://A.B.C.34/_cat/shards?v, the response is:
    [screenshot of the _cat/shards response]
    The docs and store fields for the shards on A.B.C.35 and A.B.C.40 are None, and I can find the error logs shown above on 35 and 40.
  3. Calling /_nodes/stats returns the following response:
    [screenshot of the _nodes/stats response]

Expected behavior

Both the docs and store fields should be populated (not None) on all nodes.

Additional Details

No response

weily2 added the bug and untriaged labels Apr 23, 2024
weily2 closed this as completed Apr 23, 2024
weily2 reopened this Apr 23, 2024
@peternied
Member

[Triage - attendees 1 2 3 4 5 6]
@weily2 Thanks for creating this issue, this should definitely be addressed. Would you like to create a pull request to fix the issue?

peternied added the Storage label (Issues and PRs relating to data and metadata storage) Apr 24, 2024
peternied changed the title from "[BUG] Negative longs unsupported, use writeLong or writeZLong for negative numbers [-7386]" to "[BUG] Serialization issue with RequestCacheStats, negative longs unsupported, use writeLong or writeZLong [-7386]" Apr 24, 2024
@dbwiddis
Member

dbwiddis commented Apr 24, 2024

The stack trace points to the total memory size stat becoming negative in RequestCacheStats.

The error occurs when this negative value is communicated over transport; the result is that the node handling the REST request successfully returns its own stats to the caller (with the negative value included), while other nodes fail to return theirs. This is what leads to the symptom of only seeing stats from the node the REST request is directed to.

We need to investigate further what changed in RequestCacheStats (particularly the total memory field) between 2.11.1 and 2.13.0. From the changelog I see several PRs associated with tiered caching and/or pluggable caches. I've spent a bit of time digging into them, but a lot has changed. I'm guessing something is being subtracted (onRemoval) that was never added (onCache). @msfroh, @sgup432, and @sohami do you have any ideas here?
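
For readers following the stack trace: below is a minimal sketch of the failure mode. The class and variable names are illustrative stand-ins (this is not the OpenSearch source); only the writeVLong contract and the error message are taken from the log above.

    // Illustrative sketch, not OpenSearch code: writeVLong's variable-length
    // encoding only supports non-negative values, so a request-cache memory
    // size that has drifted below zero cannot be serialized over transport.
    final class VLongContractSketch {

        // Mirrors the contract (not the implementation) of StreamOutput#writeVLong.
        static void writeVLong(long value) {
            if (value < 0) {
                throw new IllegalStateException(
                    "Negative longs unsupported, use writeLong or writeZLong for negative numbers [" + value + "]");
            }
            // ... the 7-bits-per-byte varint encoding would follow here ...
        }

        public static void main(String[] args) {
            long requestCacheMemorySize = -7386L; // the value seen in the log above
            writeVLong(requestCacheMemorySize);   // throws, so this node's ShardStats never reach the coordinating node
        }
    }

Because the exception aborts serialization of the whole node-level response, the coordinating node only ends up with stats for its own shards, which matches the _cat/shards symptom in the reproduction steps.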

@sgup432
Contributor

sgup432 commented Apr 25, 2024

@dbwiddis Let me try taking a look at this. We did change things around the RequestCache with the tiered caching feature in 2.12/2.13.

@sgup432
Contributor

sgup432 commented May 1, 2024

@weily2 I need more info/help on this while I try to reproduce it on my end.
I had a couple of hunches about where things might be going wrong. One was a race condition on removal, but theoretically it looks alright, and I also verified this by running a concurrent test locally. The other was that the size calculated when an item is cached somehow differs from the size calculated when it is removed, but that doesn't seem to be the case for now.

It seems you were trying to upgrade a domain from 2.11 to 2.13 and saw these errors. Did you see them during the upgrade or after it was complete? Also, were you performing a rolling upgrade?
Can you give more precise reproduction steps so that I can try the same on my end?

@lyradc

lyradc commented May 1, 2024

I am also seeing this error. The errors keep occurring until the affected node is restarted, after which they resume anywhere from 4 hours to 3 days later. The problem compounds as more nodes in the cluster begin presenting this error and more of the data becomes unavailable.

Given that the call stack includes a reference to org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor, I deleted all rollups. The errors resumed about 4.5 hours later. I then disabled rollups in the persistent cluster settings.

After disabling rollups via cluster settings, the errors ceased without node restarts. I will continue monitoring over the next few days to see whether they occur again with rollups disabled.

@lyradc

lyradc commented May 1, 2024

Disabling rollups didn't stop the 'Negative longs unsupported' errors.

@sgup432
Contributor

sgup432 commented May 1, 2024

@lyradc Thanks for confirming.
But this shouldn't be related just to rollups. The stack trace merely shows RollupInterceptor in the handler list (like the others), and it may not even be executed.
Do you mind confirming a few things about your setup?

  1. Are you running a 2.13 cluster?
  2. Is it possible to see a list of your indices along with their layout/settings?
  3. Do you perform any index closing/deleting operations, manually or automatically?

@lyradc

lyradc commented May 1, 2024

@sgup432

  1. Yes, running 2.13 clusters
  2. I should be able to get you the indices' schema/settings tomorrow.
  3. Not using index closing. Using ISM to handle index deletion for old indices.

@sgup432
Contributor

sgup432 commented May 2, 2024

@lyradc It would also help if you could share the steps to reproduce so that I can try them on my end.

@artemus717

Is there an API to clear the RequestCache on a node?

@sgup432
Contributor

sgup432 commented May 6, 2024

@artemus717 There is an API for that, but it will not work in this scenario where the stat has gone negative.
For now, you can turn off the request cache for your indices via the dynamic setting index.requests.cache.enable and then restart the node.
We already have the fix, which should hopefully be picked up in the next release.
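
A minimal sketch of that workaround, assuming an index named my-index and an unsecured endpoint on localhost:9200 (both placeholders; index.requests.cache.enable is the dynamic setting named above):

    curl -XPUT "http://localhost:9200/my-index/_settings" \
      -H 'Content-Type: application/json' \
      -d '{"index.requests.cache.enable": false}'

Setting the value back to true (or removing the setting) re-enables the request cache once a fixed version is deployed.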

sgup432 added the Search:Performance label and removed the Cluster Manager and Storage labels May 6, 2024
@sgup432
Contributor

sgup432 commented May 6, 2024

I have raised the PR for this fix.

It turns out that the issue occurs when an IndexShard is deleted and then reallocated on the same node: whenever stale entries from the older shard are removed, the removal is accounted against the new shard, which has the same shardId.

I was able to reproduce it in an integration test by recreating the above scenario.
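
A hypothetical sketch of that accounting pattern (the map, method names, and shard-id strings below are illustrative stand-ins, not the actual RequestCacheStats or cache code):

    // Per-shard stats keyed only by shard id: evictions of a deleted shard
    // copy's stale entries get charged to the re-allocated copy, whose
    // counters start at zero and therefore go negative.
    import java.util.HashMap;
    import java.util.Map;

    final class ShardIdAccountingSketch {
        private final Map<String, Long> memorySizeByShardId = new HashMap<>();

        void onCached(String shardId, long bytes) {
            memorySizeByShardId.merge(shardId, bytes, Long::sum);
        }

        void onRemoval(String shardId, long bytes) {
            // Keyed only by shard id, so the old (deleted) shard copy cannot be
            // told apart from the newly allocated one.
            memorySizeByShardId.merge(shardId, -bytes, Long::sum);
        }

        long memorySize(String shardId) {
            return memorySizeByShardId.getOrDefault(shardId, 0L);
        }

        public static void main(String[] args) {
            ShardIdAccountingSketch stats = new ShardIdAccountingSketch();
            stats.onCached("[logs][0]", 7386);              // old shard copy caches an entry
            stats.memorySizeByShardId.put("[logs][0]", 0L); // shard deleted and re-allocated on the same node: stats reset, stale cache entry remains
            stats.onRemoval("[logs][0]", 7386);             // stale entry cleaned up later, charged to the new copy
            System.out.println(stats.memorySize("[logs][0]")); // prints -7386
        }
    }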

@sgup432
Contributor

sgup432 commented May 6, 2024

Note:
Please don't close this issue even after the PR is merged. I would like to wait and confirm the fix to ensure we are not missing anything.

@weily2
Author

weily2 commented May 7, 2024

> @weily2 I need more info/help on this while I try to reproduce it on my end. I had a couple of hunches about where things might be going wrong. One was a race condition on removal, but theoretically it looks alright, and I also verified this by running a concurrent test locally. The other was that the size calculated when an item is cached somehow differs from the size calculated when it is removed, but that doesn't seem to be the case for now.
>
> It seems you were trying to upgrade a domain from 2.11 to 2.13 and saw these errors. Did you see them during the upgrade or after it was complete? Also, were you performing a rolling upgrade? Can you give more precise reproduction steps so that I can try the same on my end?

@sgup432 Sorry for the late reply. I did perform a rolling upgrade. Right after the upgrade the logs were normal, but these errors appeared a few hours after the upgrade was complete.

@sgup432
Contributor

sgup432 commented May 8, 2024

@weily2 @lyradc @artemus717 The fix for this will be available in the 2.14 release.
