[feature] collects windows metrics from wmi_exporter #6001

ilyam8 · 2019-05-13T11:23:33Z

Feature idea summary

#92 is very popular and important, but it won't be implemented in the near future; it is a lot of work.

What we can do is parse the metrics from the Prometheus wmi_exporter (give it a ⭐ btw) and convert them to netdata format (charts 😄). This can be done relatively quickly.

The result will be something like the cgroups plugin.
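
As a rough illustration of the idea, the sketch below parses a few lines of Prometheus text exposition format and groups the samples into chart-like structures, one per metric name, with one dimension per label set. The metric names in the sample and the `group_into_charts` helper are made up for illustration; this is not how the go.d.plugin module is implemented.

```python
import re
from collections import defaultdict

# Matches `name{labels} value` or `name value`; a simplification of the
# Prometheus text format (no timestamps, no escaped quotes in label values).
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')


def parse_samples(text):
    """Yield (metric_name, labels_string, value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match:
            name, labels, value = match.groups()
            yield name, labels or "", float(value)


def group_into_charts(text):
    """Group samples by metric name: {chart: {dimension: value}}."""
    charts = defaultdict(dict)
    for name, labels, value in parse_samples(text):
        # Metrics without labels get a single dimension called "value".
        charts[name][labels or "value"] = value
    return dict(charts)


# Hypothetical metrics, for illustration only.
SAMPLE = '''\
# HELP example_cpu_time_total Hypothetical metric, for illustration only.
# TYPE example_cpu_time_total counter
example_cpu_time_total{core="0",mode="idle"} 1200.5
example_cpu_time_total{core="0",mode="user"} 300.25
example_memory_free_bytes 4096
'''

for chart, dims in group_into_charts(SAMPLE).items():
    print(chart, dims)
```

Each metric name becomes one chart and each label combination one dimension, which is roughly the shape a dashboard section needs.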

Expected behavior

A new go.d.plugin module that collects Windows metrics from wmi_exporter.


ilyam8 · 2019-05-20T14:02:03Z

Hi @irvintim, will you be able to test this collector?

irvintim · 2019-05-20T14:08:04Z

Yes we'd be very interested in testing this. I am out of the office for this week but we could plan to do some testing starting next week.


ilyam8 · 2019-05-20T14:14:04Z

Nice, ping me when you are ready then 👍

irvintim · 2019-05-30T17:35:24Z

@ilyam8 I am caught up following my outage, and would like to start testing this collector. Looking over what I think you are doing, I need to install the wmi_exporter on my Windows test server, and then on a linux box running netdata use your collector go plugin to receive the data from the windows box. Does the collector assign this data to a specific host representing the windows box, or is it shown as data for the netdata server itself? If you have any specific instructions for me that would be great.

Thanks for your efforts!!

Tim

ilyam8 · 2019-05-30T18:05:44Z

I need to install the wmi_exporter on my Windows test server, and then on a linux box running netdata use your collector go plugin to receive the data from the windows box.

Yes @irvintim

Here you go

windows server:

download latest wmi_exporter (link)
run wmi_exporter on the Windows server

wmi_exporter-0.7.999-preview.2-386.exe --collectors.enabled="cpu,memory,net,logical_disk,os,system"

ensure that the /metrics endpoint is reachable and reports metrics; check in the browser: http://<server ip address>:9182/metrics
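
The same check can be scripted instead of done in a browser. This is a minimal sketch using only the Python standard library; the `count_samples` and `check_endpoint` helpers and the 5-second timeout are illustrative choices, and the address is the placeholder from the step above.

```python
import urllib.request


def count_samples(body):
    """Count sample lines, ignoring blanks and '# HELP' / '# TYPE' comments."""
    return sum(
        1
        for line in body.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )


def check_endpoint(url, timeout=5.0):
    """Fetch the endpoint and return the number of sample lines it reports."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        if response.status != 200:
            raise RuntimeError(f"unexpected HTTP status {response.status}")
        body = response.read().decode("utf-8", errors="replace")
    return count_samples(body)


# Example call (replace the placeholder with the real address):
# print(check_endpoint("http://<server ip address>:9182/metrics"))
```

A result of zero samples, a timeout, or a connection error means the exporter is not yet usable as a source for the collector.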

linux server:

update netdata to the latest
add a new job to the wmi collector configuration file (cd /etc/netdata; ./edit-config go.d/wmi.conf)
restart netdata.service

After that, you will see your Windows servers on the netdata dashboard as wmi job_name.

irvintim · 2019-05-30T22:16:43Z

@ilyam8:

I followed your instructions but am not seeing the wmi job_name section in the dashboard.

The error_log has the following error:
2019-05-31 04:31:10: go.d ERROR: main[main] skipping 'wmi': yaml: unmarshal errors: line 131: cannot unmarshal !!seq into orchestrator.moduleConfig

Here is the config that I am using:

[root@ca2test netdata]# grep -v '^#' /etc/netdata/go.d/wmi.conf 
[ GLOBAL ]
update_every: 1
autodetection_retry: 300
priority: 70000


[ JOBS ]
jobs:
 - name: sutter
   url: http://10.10.0.11:9182/metrics

When I go to the URL listed above I get the following -- this is just a snippet of the first few lines, lots more data is returned:

# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0.0009766
...

Version: Your netdata version: v1.15.0-31-nightly

ilyam8 · 2019-05-30T22:25:28Z

@irvintim

#[ GLOBAL ]
update_every: 1
autodetection_retry: 300
priority: 70000


#[ JOBS ]
jobs:
 - name: sutter
   url: http://10.10.0.11:9182/metrics
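
The only difference from the failing config is that the `[ GLOBAL ]` and `[ JOBS ]` headers are commented out. Left uncommented, a bracketed line is read by a YAML parser as a sequence, which is consistent with the `!!seq` in the error message. Below is a small sketch that flags such lines before restarting netdata; the `find_bracket_headers` helper is hypothetical and not part of netdata.

```python
import re

# A line consisting only of "[ SOMETHING ]" with no leading '#'.
HEADER_RE = re.compile(r'^\s*\[[^\]]*\]\s*$')


def find_bracket_headers(config_text):
    """Return (line_number, line) pairs for uncommented bracketed headers."""
    return [
        (number, line)
        for number, line in enumerate(config_text.splitlines(), start=1)
        if HEADER_RE.match(line)
    ]


# The config from the report above, shortened.
BROKEN = """\
[ GLOBAL ]
update_every: 1

[ JOBS ]
jobs:
 - name: sutter
   url: http://10.10.0.11:9182/metrics
"""

# Same config with the headers commented out, as in the fix.
FIXED = BROKEN.replace("[ ", "#[ ")

print(find_bracket_headers(BROKEN))
print(find_bracket_headers(FIXED))
```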

irvintim · 2019-05-30T23:12:24Z

That did the trick.

I am getting data now in Netdata for that server, all except CPU data. Back on the server itself, wmi_exporter is outputting errors about CPU details:

ERRO[14616] collector cpu failed after 0.000000s: counter not found  source="exporter.go:172"
ERRO[14617] collector cpu failed after 0.000000s: could not find counter "Clock Interrupts/sec" on instance  source="exporter.go:172"
ERRO[14618] collector cpu failed after 0.000000s: counter not found  source="exporter.go:172"
ERRO[14619] collector cpu failed after 0.000000s: could not find counter "Clock Interrupts/sec" on instance

I'll have to dig into that to see what's up there.

Semantically, personally I would rather have this data treated like it's another netdata server, and have this windows server appear as one of the hosts in the drop down list of streaming netdata servers. Having these metrics shown in the list of metrics for the receiving netdata server is a bit confusing.

I'll dig into the CPU issue, and also get this test running on a couple more servers and let them fly for a few days to see how things pan out.

Thanks!

Tim

irvintim · 2019-05-30T23:46:24Z

I tried wmi_exporter 32 and 64 bit (it's a 64-bit machine), but got the same error. I have opened an issue on the wmi_exporter github project.

ilyam8 · 2019-05-31T06:21:17Z

Semantically, personally I would rather have this data treated like it's another netdata server, and have this windows server appear as one of the hosts in the drop down list of streaming netdata servers. Having these metrics shown in the list of metrics for the receiving netdata server is a bit confusing

wmi_exporter is technically just a remote source of metrics (same as apache, nginx, etc.), so it is just another section on the dashboard. That is how it is now.

We could add windows icon https://fontawesome.com/v4.7.0/icon/windows (instead of puzzle-piece) , it would look better imo

irvintim · 2019-05-31T18:01:13Z

@ilyam8 some follow up from our testing:

The CPU issue I mentioned earlier is a limitation in wmi_exporter, the project maintainers have identified the issue and are implementing a fix. (The gist is that older versions of windows don't have all the metrics and the current code would bomb out if a metric didn't exist -- they are updating their code to ignore metrics that don't exist).
I put one of our test servers under heavy load and let it run all night. The wmi_exporter code would timeout often, so the netdata graph is very choppy as lots of datapoints are missed. Not sure if there is any solution to this, but I'll make the wmi_exporter team aware of this result.
I do agree that the windows logo would be better than the puzzle piece.
Otherwise, the testing has gone well -- I am running this on a couple of Windows 2019 servers and a Windows 2016 R2 Server. I will keep the test running for a while to gather more info.

I do understand your point on the presentation of the data in a section of the main server dashboard, but ultimately the solution we are looking for would treat each host independently and not intermix data from one server into the dashboard of another server. But what you have done here is a great start.
To make our goal possible, I have 3 ideas to spitball:

Update Netdata code to treat certain locally collected metrics as if they were a separate netdata server -- this would also allow for docker containers, or libvirt VMs to be split out into their own server tabs.
Update wmi_exporter (if they are interested) to allow for a push model in addition to their current pull model, and push data in netdata streaming format. This way the wmi_exporter would appear to netdata to simply be another netdata server.
Write a shim program to run on the windows box that pulls data from wmi_exporter and pushes that data to netdata on a separate server in the netdata streaming format.

Option 3 is probably the easiest to implement but unnecessarily uses resources and that's not a good thing for a monitoring service.
Option 2 is my favorite if they are willing, I will look at their code in the near future to see how feasible this is, but I know the netdata team has warned against trying to use the netdata streaming format with outside sources or destinations since that format may change.
So, Option 1 might be the safest bet -- but I don't know how much work that will be to implement right now, again I'll look at the code in a couple days when I've cleared out a few higher priority items.

ilyam8 · 2019-05-31T19:00:18Z

I put one of our test servers under heavy load and let it run all night. The wmi_exporter code would timeout often, so the netdata graph is very choppy as lots of datapoints are missed. Not sure if there is any solution to this, but I'll make the wmi_exporter team aware of this result.

👍

Have you tried increasing the HTTP timeout and lowering the collection frequency?

#[ JOBS ]
jobs:
 - name          : sutter
   url           : http://10.10.0.11:9182/metrics
   update_every  : 5
   timeout       : 5

, but ultimately the solution we are looking for would treat each host independently and not intermix data from one server into the dashboard of another server.

i understand your point, it was a quick implementation, just to have something.

Update wmi_exporter (if they are interested) to allow for a push model in addition to their current pull model, and push data in netdata streaming format.

There is no way they will do it, and I agree: they are building a Prometheus exporter, and it is not responsible for pushing metrics in netdata format.

ilyam8 · 2019-05-31T19:07:12Z

Update Netdata code to treat certain locally collected metrics as if they were a separate netdata server -- this would also allow for docker containers, or libvirt VMs to be split out into their own server tabs.

I guess this could be done on the cloud side @cakrit

irvintim · 2019-05-31T19:45:00Z

I had already set the timeout to 5, as wmi_exporter was responding in > 1 second each time and so I was getting no data initially.

I just set the update_every value to 5, I'll let you know if that changes anything.

Someone responded to my issue on the wmi_exporter site and suggested we try: https://github.com/leoluk/perflib_exporter . From the README on perflib_exporter:

perflib_exporter is a Prometheus exporter for Windows system performance. It queries performance data using the low-level HKEY_PERFORMANCE_DATA registry API instead of the high-level WMI or PDH interfaces.

The registry API will return metrics for all perflib providers in a single binary blob that we have to parse ourselves. This makes it very efficient - querying all metrics (~20-30k) takes about ~40ms or ~300ms with cold cache. The providers enabled by default take ~10ms to query.

Its raison d'être is a bug in WMI which causes collection times to spike every ~16 minutes - see prometheus-community/windows_exporter#89 (comment) for details.

irvintim · 2019-05-31T20:09:12Z

The update_every setting of 5 has resolved the timeout issues with wmi_exporter. The netdata graphs are no longer showing missing datapoints. However, the netdata error.log file is getting a ton of these:

2019-06-01 03:07:16: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.
2019-06-01 03:07:17: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.
2019-06-01 03:07:18: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.
2019-06-01 03:07:20: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.
2019-06-01 03:07:21: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.
2019-06-01 03:07:22: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.
2019-06-01 03:07:25: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.
2019-06-01 03:07:26: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.
2019-06-01 03:07:30: go.d ERROR: wmi[win_server1] Skip the tick due to previous run hasn't been finished.

ilyam8 · 2019-05-31T20:14:55Z

file is getting a ton of these

It is OK, not actually an error (the plugin logs it if collection takes more than 1 second).
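
To see how this message can appear while the graphs stay complete, here is a deliberately simplified model, not go.d.plugin's actual scheduler. It assumes a 1-second tick, starts a collection only on ticks that are multiples of `update_every`, and counts a tick as skipped whenever the previous collection is still running. The function name and the 2.5-second collection time are illustrative assumptions.

```python
def simulate_ticks(total_seconds, collect_duration, update_every=1):
    """Return (started, skipped) tick lists for a simplified 1-second ticker.

    started: ticks at which a collection run begins.
    skipped: ticks that arrive while the previous run is still in progress.
    """
    started, skipped = [], []
    busy_until = 0.0
    for tick in range(total_seconds):
        if tick < busy_until:
            skipped.append(tick)  # previous run hasn't finished
            continue
        if tick % update_every == 0:
            started.append(tick)
            busy_until = tick + collect_duration
    return started, skipped


# A slow exporter (2.5 s per scrape) polled every 5 seconds:
started, skipped = simulate_ticks(10, collect_duration=2.5, update_every=5)
print("started:", started)
print("skipped:", skipped)
```

In this model every scheduled collection still starts on time, so no data points are lost, yet the ticks that fall inside a slow scrape are reported as skipped.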

ilyam8 added area/external new collector Issues to create new collector modules/plugins labels May 13, 2019

ilyam8 mentioned this issue May 13, 2019

Windows Support #92

Open

ilyam8 self-assigned this May 13, 2019

ilyam8 added this to the v1.16-rc0 milestone May 13, 2019

cakrit added the size:5 label May 13, 2019

ilyam8 mentioned this issue May 14, 2019

WMI exporter collector netdata/go.d.plugin#220

Merged

4 tasks

ilyam8 closed this as completed in netdata/go.d.plugin#220 May 20, 2019

cakrit mentioned this issue Jun 3, 2019

WMI collector improvements #6196

Closed

ilyam8 added area/collectors Everything related to data collection collectors/go.d and removed area/external labels Apr 14, 2022
