
fatal error: runtime.unlock: lock count with Go >= 1.10 #870

Closed

daenney opened this issue Mar 29, 2018 · 88 comments

@daenney
Contributor

daenney commented Mar 29, 2018

Host operating system: output of uname -a

Linux endor 4.13.0-37-generic #42~16.04.1-Ubuntu SMP Wed Mar 7 16:03:28 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 0.16.0-rc.0 (branch: HEAD, revision: 002c1ca02917406cbecc457162e2bdb1f29c2f49)
  build user:       root@5ff5455ac873
  build date:       20180309-15:09:26
  go version:       go1.10

Used the release artifact at: https://github.com/prometheus/node_exporter/releases/download/v0.16.0-rc.0/node_exporter-0.16.0-rc.0.linux-amd64.tar.gz

node_exporter command line flags

None, the defaults for 0.16 match my needs

Are you running node_exporter in Docker?

No

What did you do that produced an error?

Just ran it for a couple of days

What did you expect to see?

For it not to crash

What did you see instead?

Mar 28 19:47:54 endor node_exporter[18076]: fatal error: runtime·unlock: lock count
Mar 28 19:48:03 endor systemd[1]: node_exporter.service: Main process exited, code=killed, status=11/SEGV

That fatal error line was logged about a thousand times, all at 19:47:54 according to systemd.

@mschroeder223

I have the same issue as @daenney

Host operating system: output of uname -a

Linux lb1 4.13.0-26-generic #29~16.04.2-Ubuntu SMP Tue Jan 9 22:00:44 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 0.16.0-rc.0 (branch: HEAD, revision: 002c1ca02917406cbecc457162e2bdb1f29c2f49)
  build user:       root@5ff5455ac873
  build date:       20180309-15:09:26
  go version:       go1.10

Observation: I have node_exporter installed on 6 machines at this point. The two machines it keeps failing on are the only two vms I am currently using. The others are bare metal and are running great.

@brian-brazil
Contributor

If it's reproducible, could you try bisecting which collectors are enabled to see which it is?
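
For example, one bisection step might look like this (using the --no-collector.<name> flag syntax from 0.16, as shown later in this thread; the split into halves is arbitrary):

node_exporter --no-collector.arp --no-collector.bcache --no-collector.bonding \
  --no-collector.conntrack --no-collector.cpu --no-collector.diskstats

If it still crashes, the culprit is among the collectors left enabled; otherwise swap the halves and repeat until one collector remains.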

@SuperQ
Member

SuperQ commented Mar 29, 2018

Also, running with debug level logs would help identify if there is a specific collector problem.
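
Something along these lines (flag name per node_exporter 0.16's logging flags):

node_exporter --log.level=debug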

@daenney
Contributor Author

daenney commented Mar 30, 2018

@mschroeder223 That's interesting. In my case the machine it blows up on is bare metal.

I've set systemd to restart node_exporter for now, but looking at the logs a new error surfaced too:
Mar 30 09:15:14 endor systemd[1]: node_exporter.service: Main process exited, code=killed, status=5/TRAP. Unsure what that's about.

I've updated my unit file to run with debug logs; I'll check on it in 24 hours and see what gives.

@mschroeder223

It took about 4.5 hours for it to crash this morning; this is the output that was captured: https://gist.github.com/mschroeder223/16e48f64a8b17888eae0e1d597a83504

I see several collectors running that don't need to be enabled, so I'm going to disable them and see if it makes any difference.

@grobie
Member

grobie commented Mar 31, 2018

There is quite a lot going on here. I believe the "lock count" panic is a red herring: it's the result of the Go runtime raising a panic while handling another panic (see "panic during panic" in the log).

This issue appears to be closely related to golang/go#24059, for which a fix was included in Go 1.10.1. As far as I understand, an actual panic is masked by a bug in the panic-handling code in the Go runtime.

I suggest building a new release candidate with Go 1.10.1 and letting @mschroeder223 and @daenney run it for a while. Hopefully we'll be able to catch the actual panic afterwards.

@SuperQ
Member

SuperQ commented Mar 31, 2018

I'll work on building a new RC release.

@mschroeder223

I don't know if this will help isolate the problem, but between the two servers that are continuously crashing, I updated one to exclude some unused collectors:
node_exporter --no-collector.infiniband --no-collector.zfs --no-collector.wifi --no-collector.nfs --no-collector.nfsd

while the other server, which continued to crash all weekend, kept running with plain:
node_exporter

Also, one more crash log from the weekend, though with no real trace history:
https://gist.github.com/mschroeder223/ca7243393e3c13d619e8ba0f517ca9b9

@SuperQ
Member

SuperQ commented Apr 5, 2018

I've published v0.16.0-rc.1. This is now built with Go 1.10.1. Please give it a try and see if it fixes the crash.

@daenney
Contributor Author

daenney commented Apr 6, 2018

Pushed it out today; I'll check back in 24 hours to see whether any crashes have shown up. @grobie's line of thought sounds rather plausible, so let's hope that's it.

@JBauto

JBauto commented Apr 12, 2018

I'm still getting this error even with v0.16.0-rc.1. It runs for a while and then it crashes.

uname -a

Linux node8 4.4.0-116-generic #140-Ubuntu SMP Mon Feb 12 21:23:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

node_exporter --version

node_exporter, version 0.16.0-rc.1 (branch: HEAD, revision: f3ff44fcca5bb20a181cd7b99734afeae553c78e)
  build user:       root@448aa887a7ca
  build date:       20180404-17:47:46
  go version:       go1.10.1

journalctl -u node_exporter

Apr 09 14:13:47 node8 systemd[1]: Started Node Exporter.
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg="Starting node_exporter (version=0.16.0-rc.1, branch=HEAD, revision=f3ff44fcca5bb20a181cd7b99734afeae553c78e)" source="node_exporter.go:82"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg="Build context (go=go1.10.1, user=root@448aa887a7ca, date=20180404-17:47:46)" source="node_exporter.go:83"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg="Enabled collectors:" source="node_exporter.go:90"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - arp" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - bcache" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - bonding" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - conntrack" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - cpu" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - diskstats" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - edac" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - entropy" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - filefd" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - filesystem" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - hwmon" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - infiniband" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - ipvs" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - loadavg" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - mdadm" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - meminfo" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - netdev" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - netstat" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - nfs" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - nfsd" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - sockstat" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - stat" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - textfile" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - time" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - timex" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - uname" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - vmstat" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - wifi" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - xfs" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - zfs" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg="Listening on :9100" source="node_exporter.go:111"
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count

@grobie
Member

grobie commented Apr 12, 2018

We really need a stack trace here to debug that further. Please check your journalctl output at that time without the -u node_exporter filter and paste the output.

@JBauto

JBauto commented Apr 12, 2018

Without the filter it doesn't show anything more useful, I think.

Apr 09 14:13:47 node8 ansible-command[1576231]: Invoked with warn=True executable=None _uses_shell=True _raw_params=systemctl restart node_exporter removes=None creates=None chdir=None stdin=None
Apr 09 14:13:47 node8 systemd[1]: Stopped Node Exporter.
Apr 09 14:13:47 node8 systemd[1]: Started Node Exporter.
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg="Starting node_exporter (version=0.16.0-rc.1, branch=HEAD, revision=f3ff44fcca5bb20a181cd7b99734afeae553c78e)" source="node_exporter.go:82"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg="Build context (go=go1.10.1, user=root@448aa887a7ca, date=20180404-17:47:46)" source="node_exporter.go:83"
Apr 09 14:13:47 node8 sudo[1576228]: pam_unix(sudo:session): session closed for user root
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg="Enabled collectors:" source="node_exporter.go:90"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - arp" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - bcache" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - bonding" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - conntrack" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - cpu" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - diskstats" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - edac" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - entropy" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - filefd" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - filesystem" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - hwmon" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - infiniband" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - ipvs" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - loadavg" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - mdadm" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - meminfo" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - netdev" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - netstat" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - nfs" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - nfsd" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - sockstat" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - stat" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - textfile" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - time" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - timex" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - uname" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - vmstat" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - wifi" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - xfs" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg=" - zfs" source="node_exporter.go:97"
Apr 09 14:13:47 node8 node_exporter[1576234]: time="2018-04-09T14:13:47+01:00" level=info msg="Listening on :9100" source="node_exporter.go:111"
Apr 09 14:14:47 node8 sshd[1576083]: Received disconnect from IP_ADDRESS port 43820:11: disconnected by user
Apr 09 14:14:47 node8 sshd[1576083]: Disconnected from IP_ADDRESS port 43820
Apr 09 14:14:47 node8 sshd[1576003]: pam_unix(sshd:session): session closed for user manager
Apr 09 14:14:47 node8 systemd-logind[1082]: Removed session 501.
Apr 09 14:14:47 node8 systemd[1]: Stopping User Manager for UID 1000...
Apr 09 14:14:47 node8 systemd[1576005]: Stopped target Default.
Apr 09 14:14:47 node8 systemd[1576005]: Reached target Shutdown.
Apr 09 14:14:47 node8 systemd[1576005]: Starting Exit the Session...
Apr 09 14:14:47 node8 systemd[1576005]: Stopped target Basic System.
Apr 09 14:14:47 node8 systemd[1576005]: Stopped target Sockets.
Apr 09 14:14:47 node8 systemd[1576005]: Stopped target Paths.
Apr 09 14:14:47 node8 systemd[1576005]: Stopped target Timers.
Apr 09 14:14:47 node8 systemd[1576005]: Received SIGRTMIN+24 from PID 1576309 (kill).
Apr 09 14:14:47 node8 systemd[1576007]: pam_unix(systemd-user:session): session closed for user manager
Apr 09 14:14:47 node8 systemd[1]: Stopped User Manager for UID 1000.
Apr 09 14:14:47 node8 systemd[1]: Removed slice User Slice of manager.
Apr 09 14:17:01 node8 CRON[1576450]: pam_unix(cron:session): session opened for user root by (uid=0)
Apr 09 14:17:01 node8 CRON[1576451]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 09 14:17:01 node8 CRON[1576450]: pam_unix(cron:session): session closed for user root
Apr 09 15:17:01 node8 CRON[1579817]: pam_unix(cron:session): session opened for user root by (uid=0)
Apr 09 15:17:01 node8 CRON[1579818]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 09 15:17:01 node8 CRON[1579817]: pam_unix(cron:session): session closed for user root
Apr 09 16:17:01 node8 CRON[1581532]: pam_unix(cron:session): session opened for user root by (uid=0)
Apr 09 16:17:01 node8 CRON[1581533]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 09 16:17:01 node8 CRON[1581532]: pam_unix(cron:session): session closed for user root
Apr 09 17:17:01 node8 CRON[1583185]: pam_unix(cron:session): session opened for user root by (uid=0)
Apr 09 17:17:01 node8 CRON[1583186]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 09 17:17:01 node8 CRON[1583185]: pam_unix(cron:session): session closed for user root
Apr 09 18:17:01 node8 CRON[1584837]: pam_unix(cron:session): session opened for user root by (uid=0)
Apr 09 18:17:01 node8 CRON[1584838]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 09 18:17:01 node8 CRON[1584837]: pam_unix(cron:session): session closed for user root
Apr 09 18:30:42 node8 dhclient[1269]: DHCPREQUEST of IP_ADDRESS on eno1 to IP_ADDRESS port 67 (xid=0x61fb15b0)
Apr 09 18:30:42 node8 dhclient[1269]: DHCPACK of IP_ADDRESS from IP_ADDRESS
Apr 09 18:30:42 node8 dhclient[1269]: bound to IP_ADDRESS -- renewal in 42306 seconds.
Apr 09 18:44:44 node8 systemd[1]: Starting Cleanup of Temporary Directories...
Apr 09 18:44:44 node8 systemd-tmpfiles[1585610]: [/usr/lib/tmpfiles.d/var.conf:14] Duplicate line for path "/var/log", ignoring.
Apr 09 18:44:44 node8 systemd[1]: Started Cleanup of Temporary Directories.
Apr 09 19:17:01 node8 CRON[1586506]: pam_unix(cron:session): session opened for user root by (uid=0)
Apr 09 19:17:01 node8 CRON[1586507]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 09 19:17:01 node8 CRON[1586506]: pam_unix(cron:session): session closed for user root
Apr 09 20:17:01 node8 CRON[1589398]: pam_unix(cron:session): session opened for user root by (uid=0)
Apr 09 20:17:01 node8 CRON[1589399]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 09 20:17:01 node8 CRON[1589398]: pam_unix(cron:session): session closed for user root
Apr 09 21:17:01 node8 CRON[1593667]: pam_unix(cron:session): session opened for user root by (uid=0)
Apr 09 21:17:01 node8 CRON[1593668]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 09 21:17:01 node8 CRON[1593667]: pam_unix(cron:session): session closed for user root
Apr 09 22:17:01 node8 CRON[1597959]: pam_unix(cron:session): session opened for user root by (uid=0)
Apr 09 22:17:01 node8 CRON[1597960]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 09 22:17:01 node8 CRON[1597959]: pam_unix(cron:session): session closed for user root
Apr 09 23:17:01 node8 CRON[1602223]: pam_unix(cron:session): session opened for user root by (uid=0)
Apr 09 23:17:01 node8 CRON[1602224]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 09 23:17:01 node8 CRON[1602223]: pam_unix(cron:session): session closed for user root
Apr 10 00:17:01 node8 CRON[1606511]: pam_unix(cron:session): session opened for user root by (uid=0)
Apr 10 00:17:01 node8 CRON[1606512]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Apr 10 00:17:01 node8 CRON[1606511]: pam_unix(cron:session): session closed for user root
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count
Apr 10 00:49:26 node8 node_exporter[1576234]: fatal error: runtime·unlock: lock count

@discordianfish
Member

I can confirm this is still happening with rc.1:

2018-04-10T16:48:52.441683157Z time="2018-04-10T16:48:52Z" level=info msg="Starting node_exporter (version=0.16.0-rc.1, branch=master, revision=9e485f11578b40915db8668ef9060a014d9cea2f)" source="node_exporter.go:82"
...
2018-04-10T16:48:52.442221476Z time="2018-04-10T16:48:52Z" level=info msg=" - zfs" source="node_exporter.go:97"
2018-04-10T16:48:52.442223509Z time="2018-04-10T16:48:52Z" level=info msg="Listening on :9100" source="node_exporter.go:111"
2018-04-13T10:06:30.639476572Z fatal error: runtime·unlock: lock count
2018-04-13T10:06:30.63951748Z fatal error: runtime·unlock: lock count
2018-04-13T10:06:30.639523112Z fatal error: runtime·unlock: lock count
2018-04-13T10:06:30.639526776Z fatal error: runtime·unlock: lock count
2018-04-13T10:06:30.639530495Z fatal error: runtime·unlock: lock count
2018-04-13T10:06:30.639535581Z fatal error: runtime·unlock: lock count
2018-04-13T10:06:30.639539318Z fatal error: runtime·unlock: lock count
2018-04-13T10:06:30.639542781Z fatal error: runtime·unlock: lock count
2018-04-13T10:06:30.639546352Z fatal error: runtime·unlock: lock count
2018-04-13T10:06:30.639549837Z fatal error: runtime·unlock: lock count
2018-04-13T10:06:30.63955346Z fatal error: runtime·unlock: lock count
2018-04-13T10:06:30.639556917Z fatal error: runtime·unlock: lock count

That happened 95,268 times within two seconds. No more details or stack trace :-/

@discordianfish
Member

Oh, I didn't know somebody ran into this a long time ago already and filed an upstream issue: golang/go#15438
The suggestions there were to check unsafe and cgo code. I don't think we use any unsafe code on linux/amd64, so it might be cgo, or still an upstream issue.
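
For what it's worth, a rough way to list which dependencies actually contain cgo files (standard go list templating, nothing node_exporter-specific; it ignores build-tag subtleties):

go list -f '{{ join .Deps "\n" }}' . | xargs go list -f '{{ if .CgoFiles }}{{ .ImportPath }}{{ end }}' | grep .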

@discordianfish
Member

There is also a related issue here: #228

@daenney
Contributor Author

daenney commented Apr 14, 2018

I can confirm I'm still experiencing this. Node Exporter crashes about once every two days now with that error. I'm running in debug mode, but not getting any stack traces in the journal either.

@daenney
Contributor Author

daenney commented Apr 15, 2018

Actually... this seems to have gotten a bit worse for me. Before, once the fatal error: runtime.unlock: lock count happened, node_exporter would crash and systemd would then restart it. Now it just gets stuck in that state, so unless I automate checking whether node_exporter is still running and restart it manually if not, I'm missing metrics.

@daenney daenney changed the title fatal error: runtime.unlock: lock count (0.16.0-rc.0) fatal error: runtime.unlock: lock count (0.16.0-rc.*) Apr 15, 2018
@discordianfish
Member

Although we don't make much use of cgo on Linux, it would make sense to rule this out.
@daenney: Can you deploy a cgo-disabled version of node_exporter and see if the problem goes away? You can build it yourself by running CGO_ENABLED=0 go build in the repo, or you can use my build:

node_exporter.gz

@daenney
Contributor Author

daenney commented Apr 16, 2018

Alright. Built one with Go 1.10.1, with HEAD at 768be13.

$ go get github.com/prometheus/node_exporter
$ cd ~/Development/go/src/github.com/prometheus/node_exporter
$ env CGO_ENABLED=0 go build

It's deployed now; let's see what happens. It might take some time for the bug to manifest, though.
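
As a quick sanity check that cgo really is disabled, the resulting linux/amd64 binary should be statically linked (generic tooling, not specific to node_exporter):

$ ldd ./node_exporter
        not a dynamic executable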

@daenney
Contributor Author

daenney commented Apr 19, 2018

So far so good. It's been running for roughly three days now: no errors, no restarts by systemd since I deployed it, and no missing data in any graphs (and no complaints from Prometheus that the target is unreachable).

This seems to suggest it is indeed related to cgo somehow. What can I do to help narrow this down? Could we do some kind of build matrix, building a node_exporter with only one cgo-using extension at a time? Or build it with cgo, disable all cgo-powered extensions, and then enable them one at a time? Would it help to build it with cgo but on an older Go version (say 1.8)?

@discordianfish
Member

The only collector enabled on Linux that uses any cgo itself is timex, but it only reads some constants and I doubt that's the problem. You could still try running the official release with that collector disabled. Though I suspect it's rather related to some stdlib code that uses a C implementation instead of the native Go one.
Besides that, yes, using an older Go version and seeing whether it still happens would help isolate this too.

@SuperQ
Member

SuperQ commented Apr 19, 2018

@daenney Let's try Go 1.9.5 + CGO first. Go 1.9 seemed stable enough with node_exporter 0.15.x, so it should be the same with 0.16. If that still crashes, we can narrow it down to code changes in node_exporter.

@mdlayher
Contributor

mdlayher commented Aug 3, 2018

A note from our conversation on that thread:

Keep in mind that just because bisect points at a commit does not mean the commit is bad. It could be a perfectly correct commit that is just exposing an issue that already existed. I could certainly see how thrashing OS threads more often by forcing termination when a locked goroutine exits could expose other issues (e.g. memory corruption) elsewhere.

I am curious if we have a Cgo bug somewhere in one of the collectors and the interaction with the WiFi collector is causing it all to explode.
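
For context, the Go 1.10 behavior being referenced (per the Go 1.10 release notes) can be reduced to a tiny sketch: a goroutine that exits while still locked to its OS thread now causes that thread to be terminated rather than returned to the scheduler's pool.

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		runtime.LockOSThread()
		// Intentionally no UnlockOSThread: under Go >= 1.10 the runtime
		// terminates this goroutine's OS thread when the goroutine exits.
	}()
	wg.Wait()
	fmt.Println("locked goroutine exited; its OS thread was torn down, not reused")
}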

@marcan

marcan commented Aug 3, 2018

The fact that it crashes with just the wifi collector suggests this isn't an issue with another collector. Does the node_exporter core rely on Cgo for anything?

@daenney
Contributor Author

daenney commented Aug 3, 2018

Kind of an aside here, but given that the node exporter seems to primarily target backend systems, should we consider disabling the WiFi collector by default, at least until the issue is resolved?

@SuperQ
Member

SuperQ commented Aug 3, 2018

While I would like to turn it off, I think it would make too many eyes stop looking into it and it would never get fixed.

I'll consider it.

SuperQ added a commit that referenced this issue Aug 6, 2018
Disable the wifi collector by default due to suspected crashing issues and goroutine leaks.
* #870
* #1008

Signed-off-by: Ben Kochie <superq@gmail.com>
SuperQ added a commit that referenced this issue Aug 7, 2018
* Disable wifi collector by default

Disable the wifi collector by default due to suspected crashing issues and goroutine leaks.
* #870
* #1008

Signed-off-by: Ben Kochie <superq@gmail.com>
xeji added a commit to NixOS/nixpkgs that referenced this issue Aug 9, 2018
@daenney
Contributor Author

daenney commented Aug 14, 2018

Reading up on golang/go#25128, it seems @mdlayher has applied a fix to netlink in the form of mdlayher/netlink@3d8cc9a that makes the error disappear.

Even though the WiFi collector is now disabled by default, it seems worthwhile to upgrade to at least that commit so that those who do need the WiFi collector and enable it get a more stable experience.
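
My rough understanding of the approach in that commit (a sketch of the general pattern, not the actual mdlayher/netlink code): funnel every syscall through a single goroutine that stays locked to one OS thread for the socket's lifetime, so any per-thread state, and the Go >= 1.10 thread-termination semantics, are confined to that one thread.

package main

import "runtime"

// lockedWorker runs all submitted functions on one goroutine that is
// locked to a single OS thread for its entire lifetime.
type lockedWorker struct {
	funcs chan func()
	done  chan struct{}
}

func newLockedWorker() *lockedWorker {
	w := &lockedWorker{funcs: make(chan func()), done: make(chan struct{})}
	go func() {
		// Lock and never unlock: when this goroutine exits, Go >= 1.10
		// terminates the thread, discarding any thread-local state
		// instead of handing the thread back to the scheduler.
		runtime.LockOSThread()
		for f := range w.funcs {
			f()
		}
		close(w.done)
	}()
	return w
}

// do runs f on the locked thread and waits for it to complete.
func (w *lockedWorker) do(f func()) {
	ready := make(chan struct{})
	w.funcs <- func() { f(); close(ready) }
	<-ready
}

// stop shuts the worker down, which terminates its OS thread.
func (w *lockedWorker) stop() {
	close(w.funcs)
	<-w.done
}

func main() {
	w := newLockedWorker()
	w.do(func() { /* e.g. a sendmsg/recvmsg pair on a netlink socket */ })
	w.stop()
}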

@discordianfish
Member

@daenney Now that we've isolated this issue a bit further, maybe it makes sense to update the title and description of this issue, since it seems like it might take a while to get this fixed upstream and released.

@daenney daenney changed the title fatal error: runtime.unlock: lock count (0.16.0-rc.*) fatal error: runtime.unlock: lock count when WiFi collector is enabled Aug 18, 2018
@daenney
Contributor Author

daenney commented Aug 18, 2018

@discordianfish Mmm, indeed. I updated it to "when WiFi collector is enabled", only to realise that this is no longer the case: with the changes @mdlayher has made, the issue appears resolved.

Since there is an upstream issue about the particulars of this bug, should we consider closing this? Is there anything left to do on the Prometheus/node_exporter side of things?

@daenney daenney changed the title fatal error: runtime.unlock: lock count when WiFi collector is enabled fatal error: runtime.unlock: lock count with Go >= 1.10 Aug 18, 2018
@SuperQ SuperQ added this to the v0.17.0 milestone Sep 11, 2018
@discordianfish
Member

@daenney Agreed, it's probably fine to close this.

majewsky added a commit to majewsky/system-configuration that referenced this issue Nov 16, 2018
oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this issue Apr 9, 2024
Changes since v0.16.0-rc.2
* Remove gmond collector prometheus#852
* Build with Go 1.9[0]
* Fix /proc/net/dev/ interface name handling prometheus#910

[0]: prometheus#870

Signed-off-by: Ben Kochie <superq@gmail.com>
oblitorum pushed a commit to shatteredsilicon/node_exporter that referenced this issue Apr 9, 2024
* Disable wifi collector by default

Disable the wifi collector by default due to suspected crashing issues and goroutine leaks.
* prometheus#870
* prometheus#1008

Signed-off-by: Ben Kochie <superq@gmail.com>