Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Display all file cache metrics in NodeStats and use the correct human-readable field name #13232

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

lukas-vlcek
Copy link
Contributor

@lukas-vlcek lukas-vlcek commented Apr 16, 2024

Description

I think this is a copy&paste issue from previous fileCacheReserved test?
Also the order of parameters in builder.humanReadableField() should be different.

Question 1): Do we need CHANGELOG record for this PR? (I think we do, because this is a bugfix, right?)

Question 2): Why we display cache_utilized field only if its value != 0? The point is that if you want to design a client handling nodes stats response then you may not be even aware that such field exists (how can you learn it does exist if documentation is incomplete?). Can't we just print this field even if the value is 0? (See example below)

Similarly we can ask why we do not want to print cache_reserved_in_bytes if !=-1?
Sounds like the idea is that if these two fields have the default value we do not want to print them. Can I ask why? (Yes, -1 is not ideal default value).

      "fs" : {
        "timestamp" : 1713278935920,
        "total" : {
          "total_in_bytes" : 1995218165760,
          "free_in_bytes" : 476316594176,
          "available_in_bytes" : 476316594176,
          "cache_reserved_in_bytes" : 0,
          "cache_utilized_in_bytes" : 0
        },
        "data" : [
          {
            "path" : "/Users/lukas.vlcek/projects/OpenSearch-worktrees/OpenSearch-main/build/testclusters/runTask-0/data/nodes/0",
            "mount" : "/System/Volumes/Data (/dev/disk3s5)",
            "type" : "apfs",
            "total_in_bytes" : 1995218165760,
            "free_in_bytes" : 476316594176,
            "available_in_bytes" : 476316594176,
            "cache_reserved_in_bytes" : 0,
            "cache_utilized_in_bytes" : 0
          }
        ]
      }

Related Issues

Introduced in #6350
Back-ported in #6485

Check List

  • [ ] New functionality includes testing.
    • All tests pass
  • [ ] New functionality has been documented.
    • [ ] New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • [ ] Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@lukas-vlcek
Copy link
Contributor Author

This PR will need backport 2.x as well.

Copy link
Contributor

✅ Gradle check result for feb966c: SUCCESS

Copy link

codecov bot commented Apr 16, 2024

Codecov Report

Attention: Patch coverage is 57.69231% with 11 lines in your changes missing coverage. Please review.

Project coverage is 71.73%. Comparing base (6d0484a) to head (4466d34).
Report is 167 commits behind head on main.

Files Patch % Lines
...rc/main/java/org/opensearch/monitor/fs/FsInfo.java 30.76% 4 Missing and 5 partials ⚠️
...c/main/java/org/opensearch/monitor/fs/FsProbe.java 71.42% 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13232      +/-   ##
============================================
+ Coverage     71.69%   71.73%   +0.03%     
+ Complexity    62385    62346      -39     
============================================
  Files          5148     5148              
  Lines        293301   293321      +20     
  Branches      42383    42389       +6     
============================================
+ Hits         210285   210405     +120     
+ Misses        65607    65561      -46     
+ Partials      17409    17355      -54     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@lukas-vlcek lukas-vlcek force-pushed the fix_node_stats_fileCacheUtilized branch 2 times, most recently from 48b11ef to db1486a Compare April 17, 2024 14:39
Copy link
Contributor

❌ Gradle check result for 48b11ef:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for db1486a:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

CHANGELOG-3.0.md Outdated Show resolved Hide resolved
@reta
Copy link
Collaborator

reta commented Jul 3, 2024

However, why is it running against 2.16.0 when the node-selector is now restricted to 3.0.0 and above?:

Oh I see, there is no bug here: what is happening is that total is always built by 3.0.0 node (this is what we want), but data section comes from each node individually, in this case 3.0.0 and 2.16.0 nodes (and we cannot control that). I think the only reliable way is to exclude everything that is not 3.0.0:

"File Cache stats":
  - skip:
      version: " - 2.16.99"
      reason: "file cache statistics fields were added in 2.7 (#6485) but the fix (#13232) was released in 2.16.0"

@lukas-vlcek
Copy link
Contributor Author

@reta I think you are right. Let me push update...

@lukas-vlcek lukas-vlcek force-pushed the fix_node_stats_fileCacheUtilized branch from a431ec3 to 9603f47 Compare July 3, 2024 18:35
Copy link
Contributor

github-actions bot commented Jul 3, 2024

❌ Gradle check result for 9603f47: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented Jul 4, 2024

❌ Gradle check result for 10cb157: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@lukas-vlcek lukas-vlcek force-pushed the fix_node_stats_fileCacheUtilized branch from 10cb157 to 9603f47 Compare July 4, 2024 14:14
@lukas-vlcek
Copy link
Contributor Author

@reta FYI: There is a bug... I need to find it.

Copy link
Contributor

github-actions bot commented Jul 4, 2024

❌ Gradle check result for 9603f47: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented Jul 8, 2024

✅ Gradle check result for 75e04e4: SUCCESS

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Addming missing tests for file caches in NodeStats FsInfo.
This commit changes the test value for `cache_utilized` field.
Going forward both file cache fields are present in the node stats REST response.

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
In FsInfo.Path object the value of fileCacheUtilized is now initialized
like any other fields, that means it is -1, which means the value hasn't been
populated yet.

To make this possible it was necessary to fix FsProbeTest and use
a real FileCache object to enable internal logic to handle initialization
and safeguarding of the value properly.

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Use String concatenation instead of StringBuilder.

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Adding a second public ctor for FsInfo.Path (to be used in tests only).

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Add FsInfo.Path ctor asserts.

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
Deprecating `FsInfo.Path#getFSInfo(NodePath nodePath)` method because without
FileCache instance it produces incomplete Path object that does not have initialized
some file cache related variables.

WIP: We still need better tests running multinode cluster with FileCache.

Signed-off-by: Lukáš Vlček <lukas.vlcek@aiven.io>
@lukas-vlcek lukas-vlcek force-pushed the fix_node_stats_fileCacheUtilized branch from 75e04e4 to 4466d34 Compare July 9, 2024 15:50
Copy link
Contributor

github-actions bot commented Jul 9, 2024

✅ Gradle check result for 4466d34: SUCCESS

@dblock
Copy link
Member

dblock commented Jul 12, 2024

@lukas-vlcek Would you be interested in taking the API spec for these fields too? This is opensearch-project/opensearch-api-specification#156 - I started https://github.com/dblock/opensearch-api-specification/tree/nodes-stats but there's a lot of internal knowledge of what these fields are that I will take forever to dig up :(

@lukas-vlcek
Copy link
Contributor Author

@dblock I would love to... but I am currently at 🌴 🚴‍♂️ ☀️ I will be back in 2 weeks.

@opensearch-trigger-bot
Copy link
Contributor

This PR is stalled because it has been open for 30 days with no activity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch stalled Issues that have stalled Storage:Snapshots v2.16.0 Issues and PRs related to version 2.16.0
Projects
Status: Now(This Quarter)
Development

Successfully merging this pull request may close these issues.

6 participants