CPU steal stuck at 100% #1210

Closed

vfuse opened this issue Jan 17, 2018 · 16 comments
Comments

@vfuse

vfuse commented Jan 17, 2018

Occasionally psutil returns a CPU steal time of 100%; the only way to get it back to the correct value is to reboot the system.

cpu:
{
    "0": {
        "guest": 0.0,
        "guest_nice": 0.0,
        "idle": 0.0,
        "iowait": 0.0,
        "irq": 0.0,
        "nice": 0.0,
        "softirq": 0.0,
        "steal": 100.0,
        "system": 0.0,
        "user": 0.0
    },
    "1": {
        "guest": 0.0,
        "guest_nice": 0.0,
        "idle": 0.0,
        "iowait": 0.0,
        "irq": 0.0,
        "nice": 0.0,
        "softirq": 0.0,
        "steal": 100.0,
        "system": 0.0,
        "user": 0.0
    }
}

top - 10:25:55 up 46 days, 20:48, 1 user, load average: 0,34, 0,19, 0,15
Tasks: 120 total, 1 running, 119 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0,0 us, 0,0 sy, 0,0 ni, 99,7 id, 0,3 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem : 8173956 total, 496288 free, 6969612 used, 708056 buff/cache
KiB Swap: 2097148 total, 349612 free, 1747536 used. 894700 avail Mem

Running version 5.4.3 on AWS Ubuntu 16.04 xenial.

@lbguilherme

I currently own the VPS affected by this problem, in case something needs to be tested on the machine itself.

@giampaolo
Owner

Mmm can you try this and paste the output?

import time, psutil
for x in range(10):
    print(psutil.cpu_times())
    time.sleep(1)

@lbguilherme

$ python
Python 2.7.12 (default, Nov 20 2017, 18:23:56) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import time, psutil
>>> for x in range(10):
...       print(psutil.cpu_times())
...       time.sleep(1)
... 
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577169.14, iowait=16110.07, irq=0.0, softirq=6806.69, steal=14416055402.18, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577171.03, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14395838578.8, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577172.96, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14379585838.44, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577174.86, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14350251624.13, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577176.76, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14330034800.75, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577178.68, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14312989243.78, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577180.61, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14298322136.63, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577182.51, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14273744821.94, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577184.44, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14245996240.83, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577186.36, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14231725541.97, guest=0.0, guest_nice=0.0)

@wiggin15
Collaborator

wiggin15 commented Feb 4, 2018

Can you please also post the results of:

python -c "import psutil; print psutil._psplatform.CLOCK_TICKS"
cat /proc/stat

@lbguilherme

Timeline of events:

  1. Started querying CPU data every 1 minute.
  2. A few months later, the 100% steal problem started. Every single read reported 100% CPU (steal).
  3. After a few days, I rebooted the machine. The problem went away.
  4. A few months later, it started again.
  5. I contacted @vfuse and this issue was opened.
  6. I rebooted the machine again. The problem went away.
  7. @wiggin15 asked to run a command. Here is the output:
$ python -c "import psutil; print psutil._psplatform.CLOCK_TICKS"
100
$ cat /proc/stat
cpu  12326678 4215 2516722 191279683 472016 0 256286 318860 0 0
cpu0 5912132 2099 1173877 96225380 148732 0 364 146317 0 0
cpu1 6414546 2115 1342844 95054303 323284 0 255922 172542 0 0
intr 3667149875 43 9 0 0 2365 0 0 0 2 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 187140193 539604218 0 0 66275271 0 147668695 424863922 0 0 66381535 0 363 265905 17790286 1120025059 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 3067462634
btime 1516735396
processes 6283265
procs_running 4
procs_blocked 0
softirq 1353534844 0 63826704 614262166 549351704 0 0 33399117 58254856 0 34440297

@wiggin15
Collaborator

wiggin15 commented Feb 5, 2018

From what I see in the logs requested by @giampaolo, the "steal" value actually decreases every second instead of going up (the values are supposed to be cumulative). Looking at the first two results:

steal=14416055402.18
steal=14395838578.8

When we compute the percentage, we divide the difference in the specific field (steal) by the total difference of the CPU times. In this case almost all of the difference comes from the decrease in steal time, so we return 100%:

all_delta = -20216821.48000145
field_delta = -20216823.380001068
field_perc = (100 * -20216823.380001068) / (-20216821.48000145) = 100.00000939811245
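
For illustration, here is a minimal runnable sketch of that calculation (this is not psutil's actual code); the two snapshots are the first two lines of the log pasted above:

# Sketch only: reproduce the percentage arithmetic from two snapshots.
from collections import namedtuple

scputimes = namedtuple("scputimes", "user nice system idle iowait irq softirq steal guest guest_nice")

t1 = scputimes(316284.26, 144.61, 64877.16, 7577169.14, 16110.07, 0.0, 6806.69, 14416055402.18, 0.0, 0.0)
t2 = scputimes(316284.26, 144.61, 64877.16, 7577171.03, 16110.08, 0.0, 6806.69, 14395838578.8, 0.0, 0.0)

all_delta = sum(t2) - sum(t1)       # total elapsed CPU time -- negative here!
field_delta = t2.steal - t1.steal   # per-field (steal) delta, also negative
field_perc = (100 * field_delta) / all_delta
print(field_perc)                   # ~100.0, even though the machine is idle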

A decrease in the cumulative steal time should not happen, but apparently it can occur erroneously:
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest/

psutil should probably ignore negative differences if values in "/proc/stat" decrease. Something like:

all_delta = max(0, all_delta)
field_delta = max(0, field_delta)

"top" is doing this: https://github.com/thlorenz/procps/blob/faa41f864a599854ceafa4ea634b29a6924bbbe6/deps/procps/top/top.c#L5017

@giampaolo
Owner

giampaolo commented Feb 5, 2018

Wow! Thanks for the fine brain work, Arnon! psutil already provides some logic to prevent numbers from wrapping:
https://github.com/giampaolo/psutil/blob/master/psutil/_common.py#L477
We may reuse that (or a variant of it), even though it may make more sense to restrict this to Linux + steal time only.
@lbguilherme are you running Xen?

@wiggin15
Collaborator

wiggin15 commented Feb 5, 2018

I think we need to use the same logic as in "top", which is to trim ("TRIMZ") each delta separately (they do it for all values, not just "steal") and then sum up the "trimmed" deltas to get the total to divide by. This way negative differences won't impact other fields when we use sum in _cpu_tot_time. Not sure if _WrapNumbers can help here. I will work on a pull request.
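
A rough sketch of that "trim then sum" idea (illustrative only: the helper name is made up and this is not the code of the eventual pull request):

# Sketch only: per-field percentages with each delta trimmed at zero,
# like top's TRIMZ macro, before summing the total to divide by.
def cpu_percent_fields(t1, t2):
    deltas = {}
    for field in t1._fields:
        delta = getattr(t2, field) - getattr(t1, field)
        deltas[field] = max(0.0, delta)   # a delta is never allowed to go negative
    all_delta = sum(deltas.values())      # total built from the trimmed deltas
    if all_delta == 0:
        return dict.fromkeys(deltas, 0.0)
    return {field: 100.0 * d / all_delta for field, d in deltas.items()}

With the two snapshots pasted earlier, the bogus negative steal delta is trimmed to zero, so the roughly 1.9 seconds of idle time dominates the total and steal drops back to about 0% instead of 100%.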
@giampaolo I wonder if it's possible to assign this issue to me or if assignees must be project collaborators.

@giampaolo
Owner

Arnon: I don't know, let me check.

@giampaolo
Owner

Yes, you need to be a collaborator. I just made you one (then I suppose you can assign the issue to yourself).

@lbguilherme

@giampaolo Yes, it is running KVM on Xen (AWS).

wiggin15 self-assigned this Feb 5, 2018
@wiggin15
Collaborator

wiggin15 commented Feb 5, 2018

Yes, you need to be a collaborator. I just made you one (then I suppose you can assign the issue to yourself).

Thanks!

@giampaolo
Owner

Not sure if _WrapNumbers can help here. I will work on a pull request.

The only thing is that wrap_numbers (the public function defined later) accepts a dict, whereas here we're dealing with a list. Maybe you can simply translate the CPU times into an OrderedDict and convert it back to a list. Grep for wrap_numbers to get an idea of how it is used.
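
If you went that route, the translation could look roughly like this (illustrative only: it assumes the wrap_numbers(dict, name) helper in psutil/_common.py linked above, the cache name is made up, and rebuilding scputimes tuples from the returned values is left out):

# Sketch only: route per-CPU times through wrap_numbers via an OrderedDict.
from collections import OrderedDict
import psutil
from psutil._common import wrap_numbers

percpu = psutil.cpu_times(percpu=True)                    # list of scputimes
as_dict = OrderedDict(("cpu%s" % i, t) for i, t in enumerate(percpu))
wrapped = wrap_numbers(as_dict, "psutil.cpu_times")       # made-up cache name
as_list = list(wrapped.values())                          # back to a list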

@giampaolo
Owner

giampaolo commented Feb 5, 2018

Actually I'm not entirely sure wrap_numbers is what we really need. That was designed for situations where a big number goes back to zero and then starts increasing, e.g.:

1000
5
10
15
....

...in which case wrap_numbers will change it to:

1000
1005
1010
1015

This case is different as AFAIU we can have numbers like this:

1000
955
1000
1100
...

I suppose what should happen in this case is that 955 should be translated to 1000 (wrap_numbers would translate it to 1955, which is wrong).
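
A tiny made-up illustration of the difference between the two behaviours (not psutil code): the wrap style treats a drop as a counter reset and carries the old value on top, while what this case needs is to raise the dip back to the highest value seen so far:

# Sketch only: the two ways of handling a value that goes down.
def wrap_style(values):
    # Imitates the wrap_numbers behaviour described above: a drop is treated
    # as a counter reset, so the previous value is added on top from then on.
    out, offset, prev = [], 0, None
    for v in values:
        if prev is not None and v < prev:
            offset += prev
        out.append(v + offset)
        prev = v
    return out

def clamp_style(values):
    # A dip is simply raised back to the highest value seen so far.
    out, highest = [], float("-inf")
    for v in values:
        highest = max(highest, v)
        out.append(highest)
    return out

print(wrap_style([1000, 955, 1000, 1100]))    # [1000, 1955, 2000, 2100]
print(clamp_style([1000, 955, 1000, 1100]))   # [1000, 1000, 1000, 1100]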

@wiggin15
Collaborator

Patch in #1214 merged. Please close this issue.

@agbesi1845

okay

giampaolo added a commit that referenced this issue May 3, 2021
…etimes go backwards

Signed-off-by: Giampaolo Rodola <g.rodola@gmail.com>