[feature] collects windows metrics from wmi_exporter #6001

Closed
ilyam8 opened this issue May 13, 2019 · 16 comments · Fixed by netdata/go.d.plugin#220
Labels: area/collectors, collectors/go.d, new collector

@ilyam8
Member

ilyam8 commented May 13, 2019

Feature idea summary

#92 is very popular and important, but it won't be implemented in the near future; it is a lot of work.

What we can do is parse Prometheus wmi_exporter (give it a ⭐ btw) metrics and convert them to netdata format (charts 😄). This can be done relatively fast.

The result will be something like cgroups plugin.

Expected behavior

new go.d.plugin module that collects windows metrics from wmi_exporter
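
As a rough illustration of the "convert to netdata format" step (not the go.d.plugin implementation, which is written in Go), the sketch below builds the text lines an external netdata plugin prints on stdout for one value. The chart id `wmi.memory_available`, the dimension name `available`, and both function names are made up for the example, and only a minimal subset of the `CHART` fields is filled in.

```python
def chart_definition(chart_id, title, units, dim_id):
    """Lines that declare a chart with one dimension (printed once).

    Assumption: only the leading CHART fields (id, name, title, units)
    are given; the name is left empty and the rest use defaults.
    """
    return [
        f'CHART {chart_id} "" "{title}" "{units}"',
        f"DIMENSION {dim_id} {dim_id} absolute 1 1",
    ]


def chart_update(chart_id, dim_id, value):
    """Lines that push one collected value (printed on every collection)."""
    return [f"BEGIN {chart_id}", f"SET {dim_id} = {int(value)}", "END"]


if __name__ == "__main__":
    # Hypothetical chart: available memory reported by the exporter.
    lines = chart_definition("wmi.memory_available", "Available Memory", "bytes", "available")
    lines += chart_update("wmi.memory_available", "available", 1024.0)
    print("\n".join(lines))
```

Values are cast to integers here because `SET` carries integer samples; a real collector would scale fractional metrics with the dimension's multiplier and divisor instead of truncating them.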

@ilyam8 ilyam8 added area/external new collector Issues to create new collector modules/plugins labels May 13, 2019
@ilyam8 ilyam8 self-assigned this May 13, 2019
@ilyam8 ilyam8 added this to the v1.16-rc0 milestone May 13, 2019
@cakrit cakrit added the size:5 label May 13, 2019
@ilyam8
Member Author

ilyam8 commented May 20, 2019

Hi @irvintim, will you be able to test this collector?

@irvintim

irvintim commented May 20, 2019 via email

@ilyam8
Member Author

ilyam8 commented May 20, 2019

Nice, ping me when you are ready then 👍

@irvintim

@ilyam8 I am caught up following my outage, and would like to start testing this collector. Looking over what I think you are doing, I need to install the wmi_exporter on my Windows test server, and then on a linux box running netdata use your collector go plugin to receive the data from the windows box. Does the collector assign this data to a specific host representing the windows box, or is it shown as data for the netdata server itself? If you have any specific instructions for me that would be great.

Thanks for your efforts!!

Tim

@ilyam8
Member Author

ilyam8 commented May 30, 2019

I need to install the wmi_exporter on my Windows test server, and then on a linux box running netdata use your collector go plugin to receive the data from the windows box.

Yes @irvintim

Here you go

windows server:

  • download latest wmi_exporter (link)
  • run wmi_exporter on the Windows server

wmi_exporter-0.7.999-preview.2-386.exe --collectors.enabled="cpu,memory,net,logical_disk,os,system"

  • ensure that /metrics endpoint is reachable and reports metrics, check in the browser http://<server ip address>:9182/metrics

linux server:

  • update netdata to the latest
  • add a new job to the wmi collector configuration file (cd /etc/netdata; ./edit-config go.d/wmi.conf)
  • restart netdata.service

After that you will see your Windows servers on the netdata dashboard as wmi job_name
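
To check the endpoint from the linux side without a browser, something like the following Python sketch could be used. It downloads the `/metrics` page with an explicit timeout and parses the Prometheus text format into `(name, labels, value)` tuples. The function names and the regular expressions are illustrative only and cover just the common line shapes; this is not how the wmi collector itself parses metrics.

```python
import re
import urllib.request

# Metric name, optional {labels}, then the value. Anything after the value
# (such as an optional timestamp) is ignored.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


def fetch_metrics(url, timeout=5):
    """Download the metrics page; raises an error if it is unreachable."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Placeholder address: replace with the Windows server's IP.
    page = fetch_metrics("http://127.0.0.1:9182/metrics")
    print(len(parse_metrics(page)), "samples")
```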

@irvintim

@ilyam8:

I followed your instructions but am not seeing the wmi job_name section in the dashboard.

The error_log has the following error:
2019-05-31 04:31:10: go.d ERROR: main[main] skipping 'wmi': yaml: unmarshal errors: line 131: cannot unmarshal !!seq into orchestrator.moduleConfig

Here is the config that I am using:

[root@ca2test netdata]# grep -v '^#' /etc/netdata/go.d/wmi.conf 
[ GLOBAL ]
update_every: 1
autodetection_retry: 300
priority: 70000


[ JOBS ]
jobs:
 - name: sutter
   url: http://10.10.0.11:9182/metrics

When I go to the URL listed above I get the following -- this is just a snippet of the first few lines, lots more data is returned:

# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0.0009766
...

Version: Your netdata version: v1.15.0-31-nightly

@ilyam8
Member Author

ilyam8 commented May 30, 2019

@irvintim

#[ GLOBAL ]
update_every: 1
autodetection_retry: 300
priority: 70000


#[ JOBS ]
jobs:
 - name: sutter
   url: http://10.10.0.11:9182/metrics
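
The only change from the failing config is that the `[ GLOBAL ]` and `[ JOBS ]` banner lines are commented out again. Left uncommented, a line such as `[ JOBS ]` reads as a YAML flow sequence, which would be consistent with the `!!seq` in the error message. A small Python sketch for spotting such lines before restarting netdata follows; `find_uncommented_banners` is a hypothetical helper, not part of netdata.

```python
import re

# A line that consists only of "[ ... ]", i.e. a banner with its "#" removed.
BANNER_RE = re.compile(r"^\s*\[.*\]\s*$")


def find_uncommented_banners(config_text):
    """Return (line_number, text) for every bare "[ ... ]" line."""
    return [
        (number, line)
        for number, line in enumerate(config_text.splitlines(), start=1)
        if BANNER_RE.match(line)
    ]


if __name__ == "__main__":
    with open("/etc/netdata/go.d/wmi.conf") as handle:
        for number, line in find_uncommented_banners(handle.read()):
            print(f"line {number}: {line!r} should start with '#'")
```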

@irvintim

That did the trick.

  1. I am now getting data in Netdata for that server, except for CPU data. Back on the server itself, wmi_exporter is outputting errors about CPU details:
ERRO[14616] collector cpu failed after 0.000000s: counter not found  source="exporter.go:172"
ERRO[14617] collector cpu failed after 0.000000s: could not find counter "Clock Interrupts/sec" on instance  source="exporter.go:172"
ERRO[14618] collector cpu failed after 0.000000s: counter not found  source="exporter.go:172"
ERRO[14619] collector cpu failed after 0.000000s: could not find counter "Clock Interrupts/sec" on instance

I'll have to dig into that to see what's up there.

  2. Semantically, personally I would rather have this data treated like it's another netdata server, and have this windows server appear as one of the hosts in the drop down list of streaming netdata servers. Having these metrics shown in the list of metrics for the receiving netdata server is a bit confusing.

I'll dig into the CPU issue, and also get this test running on a couple more servers and let them fly for a few days to see how things pan out.

Thanks!

Tim

@irvintim

I tried wmi_exporter 32 and 64 bit (it's a 64-bit machine), but got the same error. I have opened an issue on the wmi_exporter github project.

@ilyam8
Member Author

ilyam8 commented May 31, 2019

Semantically, personally I would rather have this data treated like it's another netdata server, and have this windows server appear as one of the hosts in the drop down list of streaming netdata servers. Having these metrics shown in the list of metrics for the receiving netdata server is a bit confusing

wmi_exporter is technically just a remote source of metrics (same as apache, nginx, etc.), so it is just another section on the dashboard; that is how it works now.

We could add windows icon https://fontawesome.com/v4.7.0/icon/windows (instead of puzzle-piece) , it would look better imo

@irvintim

@ilyam8 some follow up from our testing:

  1. The CPU issue I mentioned earlier is a limitation in wmi_exporter, the project maintainers have identified the issue and are implementing a fix. (The gist is that older versions of windows don't have all the metrics and the current code would bomb out if a metric didn't exist -- they are updating their code to ignore metrics that don't exist).
  2. I put one of our test servers under heavy load and let it run all night. The wmi_exporter code would timeout often, so the netdata graph is very choppy as lots of datapoints are missed. Not sure if there is any solution to this, but I'll make the wmi_exporter team aware of this result.
  3. I do agree that the windows logo would be better than the puzzle piece.
  4. Otherwise, the testing has gone well -- I am running this on a couple of Windows 2019 servers and a Windows 2016 R2 Server. I will keep the test running for a while to gather more info.

I do understand your point on the presentation of the data in a section of the main server dashboard, but ultimately the solution we are looking for would treat each host independently and not intermix data from one server into the dashboard of another server. But what you have done here is a great start.
To make our goal possible, I have 3 ideas to spitball:

  1. Update Netdata code to treat certain locally collected metrics as if they were a separate netdata server -- this would also allow for docker containers, or libvirt VMs, to be split out into their own server tabs.
  2. Update wmi_exporter (if they are interested) to allow for a push model in addition to their current pull model, and push data in netdata streaming format. This way the wmi_exporter would appear to netdata to simply be another netdata server.
  3. Write a shim program to run on the windows box that pulls data from wmi_exporter and pushes that data to netdata on a separate server in the netdata streaming format.

Option 3 is probably the easiest to implement but unnecessarily uses resources and that's not a good thing for a monitoring service.
Option 2 is my favorite if they are willing, I will look at their code in the near future to see how feasible this is, but I know the netdata team has warned against trying to use the netdata streaming format with outside sources or destinations since that format may change.
So, Option 1 might be the safest bet -- but I don't know how much work that will be to implement right now, again I'll look at the code in a couple days when I've cleared out a few higher priority items.

@ilyam8
Member Author

ilyam8 commented May 31, 2019

I put one of our test servers under heavy load and let it run all night. The wmi_exporter code would timeout often, so the netdata graph is very choppy as lots of datapoints are missed. Not sure if there is any solution to this, but I'll make the wmi_exporter team aware of this result.

👍

Have you tried increasing the HTTP timeout and lowering the collection frequency?

#[ JOBS ]
jobs:
 - name          : sutter
   url           : http://10.10.0.11:9182/metrics
   update_every  : 5
   timeout       : 5

... but ultimately the solution we are looking for would treat each host independently and not intermix data from one server into the dashboard of another server.

I understand your point; it was a quick implementation, just to have something.

Update wmi_exporter (if they are interested) to allow for a push model in addition to their current pull model, and push data in netdata streaming format.

No way they will do it, and I agree with them: they are building a Prometheus exporter, and it is not responsible for pushing metrics in netdata format.

@ilyam8
Member Author

ilyam8 commented May 31, 2019

Update Netdata code to treat certain locally collected metrics as if they were a separate netdata server -- this would also allow for docker containers, or libvirt VMs, to be split out into their own server tabs.

I guess this could be done on the cloud side @cakrit

@irvintim

I had already set the timeout to 5, as wmi_exporter was responding in > 1 second each time and so I was getting no data initially.

I just set the update_every value to 5, I'll let you know if that changes anything.

Someone responded to my issue on the wmi_exporter site and suggested we try: https://github.com/leoluk/perflib_exporter . From the README on perflib_exporter:

perflib_exporter is a Prometheus exporter for Windows system performance. It queries performance data using the low-level HKEY_PERFORMANCE_DATA registry API instead of the high-level WMI or PDH interfaces.

The registry API will return metrics for all perflib providers in a single binary blob that we have to parse ourselves. This makes it very efficient - querying all metrics (~20-30k) takes about ~40ms or ~300ms with cold cache. The providers enabled by default take ~10ms to query.

Its raison d'être is a bug in WMI which causes collection times to spike every ~16 minutes - see prometheus-community/windows_exporter#89 (comment) for details.

@irvintim

The update_every setting of 5 has resolved the timeout issues with wmi_exporter. The netdata graphs are no longer showing missing datapoints. However, the netdata error.log file is getting a ton of these:

2019-06-01 03:07:16: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.
2019-06-01 03:07:17: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.
2019-06-01 03:07:18: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.
2019-06-01 03:07:20: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.
2019-06-01 03:07:21: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.
2019-06-01 03:07:22: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.
2019-06-01 03:07:25: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.
2019-06-01 03:07:26: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.
2019-06-01 03:07:30: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.

@ilyam8
Member Author

ilyam8 commented May 31, 2019

file is getting a ton of these

it is ok, not an error actually (plugin logs it if collection takes more than 1 second)
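
A simplified model of why that message appears, assuming a scheduler that ticks at a fixed interval and drops a tick while the previous collection is still running. This is only a sketch of the idea, not the go.d.plugin scheduler; `skipped_ticks` and its parameters are invented for the illustration.

```python
def skipped_ticks(tick_seconds, run_seconds, total_seconds):
    """Return the tick times dropped because a collection was still running."""
    skipped = []
    busy_until = 0.0  # time at which the current collection finishes
    tick = 0
    while tick < total_seconds:
        if tick < busy_until:
            # Previous run has not finished yet: skip this tick.
            skipped.append(tick)
        else:
            # Idle: start a new collection that lasts run_seconds.
            busy_until = tick + run_seconds
        tick += tick_seconds
    return skipped


if __name__ == "__main__":
    print(skipped_ticks(1, 2.5, 6))
```

With a 1-second tick and a collection that takes 2.5 seconds, `skipped_ticks(1, 2.5, 6)` returns `[1, 2, 4, 5]`; when the collection finishes within one tick, nothing is skipped.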

@ilyam8 ilyam8 added area/collectors Everything related to data collection collectors/go.d and removed area/external labels Apr 14, 2022