CPU steal stuck at 100% #1210

Closed

vfuse opened this issue Jan 17, 2018 · 16 comments
Comments

@vfuse

vfuse commented Jan 17, 2018

Occasionally psutil returns a CPU steal time of 100%; the only way to get it back to the correct value is to reboot the system.

cpu:
{
    "0": {
        "guest": 0.0,
        "guest_nice": 0.0,
        "idle": 0.0,
        "iowait": 0.0,
        "irq": 0.0,
        "nice": 0.0,
        "softirq": 0.0,
        "steal": 100.0,
        "system": 0.0,
        "user": 0.0
    },
    "1": {
        "guest": 0.0,
        "guest_nice": 0.0,
        "idle": 0.0,
        "iowait": 0.0,
        "irq": 0.0,
        "nice": 0.0,
        "softirq": 0.0,
        "steal": 100.0,
        "system": 0.0,
        "user": 0.0
    }
}

top - 10:25:55 up 46 days, 20:48, 1 user, load average: 0,34, 0,19, 0,15
Tasks: 120 total, 1 running, 119 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0,0 us, 0,0 sy, 0,0 ni, 99,7 id, 0,3 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem : 8173956 total, 496288 free, 6969612 used, 708056 buff/cache
KiB Swap: 2097148 total, 349612 free, 1747536 used. 894700 avail Mem

Running version 5.4.3 on AWS Ubuntu 16.04 xenial.

@lbguilherme

I currently own the VPS affected by this problem, in case something needs to be tested on the machine itself.

@giampaolo
Owner

Mmm can you try this and paste the output?

import time, psutil
for x in range(10):
    print(psutil.cpu_times())
    time.sleep(1)

@lbguilherme

$ python
Python 2.7.12 (default, Nov 20 2017, 18:23:56) 
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import time, psutil
>>> for x in range(10):
...       print(psutil.cpu_times())
...       time.sleep(1)
... 
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577169.14, iowait=16110.07, irq=0.0, softirq=6806.69, steal=14416055402.18, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577171.03, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14395838578.8, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577172.96, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14379585838.44, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577174.86, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14350251624.13, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577176.76, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14330034800.75, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577178.68, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14312989243.78, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577180.61, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14298322136.63, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577182.51, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14273744821.94, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577184.44, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14245996240.83, guest=0.0, guest_nice=0.0)
scputimes(user=316284.26, nice=144.61, system=64877.16, idle=7577186.36, iowait=16110.08, irq=0.0, softirq=6806.69, steal=14231725541.97, guest=0.0, guest_nice=0.0)

@wiggin15
Collaborator

wiggin15 commented Feb 4, 2018

Can you please also post the results of:

python -c "import psutil; print psutil._psplatform.CLOCK_TICKS"
cat /proc/stat

@lbguilherme

Timeline of events:

  1. Started querying CPU data every 1 minute.
  2. A few months later, the 100% steal problem started. Every single read reported 100% CPU (steal).
  3. After a few days, I rebooted the machine. The problem went away.
  4. A few months later, it started again.
  5. I contacted @vfuse and this issue was opened.
  6. I rebooted the machine again. The problem went away.
  7. @wiggin15 asked to run a command. Here is the output:
$ python -c "import psutil; print psutil._psplatform.CLOCK_TICKS"
100
$ cat /proc/stat
cpu  12326678 4215 2516722 191279683 472016 0 256286 318860 0 0
cpu0 5912132 2099 1173877 96225380 148732 0 364 146317 0 0
cpu1 6414546 2115 1342844 95054303 323284 0 255922 172542 0 0
intr 3667149875 43 9 0 0 2365 0 0 0 2 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 187140193 539604218 0 0 66275271 0 147668695 424863922 0 0 66381535 0 363 265905 17790286 1120025059 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 3067462634
btime 1516735396
processes 6283265
procs_running 4
procs_blocked 0
softirq 1353534844 0 63826704 614262166 549351704 0 0 33399117 58254856 0 34440297

@wiggin15
Collaborator

wiggin15 commented Feb 5, 2018

From what I see in the logs requested by @giampaolo, the "steal" value actually decreases every second instead of going up (the values are supposed to be cumulative). Looking at the first two results:

steal=14416055402.18
steal=14395838578.8

When we compute the percentage, we divide the difference in the specific field (steal) by the total difference of the CPU times. In this case almost all of the difference comes from the decrease in steal time, so we return 100%:

all_delta = -20216821.48000145
field_delta = -20216823.380001068
field_perc = (100 * -20216823.380001068) / (-20216821.48000145) = 100.00000939811245
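
For illustration, here is a minimal runnable sketch of that calculation (this is not psutil's actual code); the two snapshots are the first two lines of the log pasted above:

# Sketch only: reproduce the percentage arithmetic from two snapshots.
from collections import namedtuple

scputimes = namedtuple("scputimes", "user nice system idle iowait irq softirq steal guest guest_nice")

t1 = scputimes(316284.26, 144.61, 64877.16, 7577169.14, 16110.07, 0.0, 6806.69, 14416055402.18, 0.0, 0.0)
t2 = scputimes(316284.26, 144.61, 64877.16, 7577171.03, 16110.08, 0.0, 6806.69, 14395838578.8, 0.0, 0.0)

all_delta = sum(t2) - sum(t1)       # total elapsed CPU time -- negative here!
field_delta = t2.steal - t1.steal   # per-field (steal) delta, also negative
field_perc = (100 * field_delta) / all_delta
print(field_perc)                   # ~100.0, even though the machine is idle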

A decrease in the cumulative steal time should not happen, but apparently it can occur erroneously:
https://0xstubs.org/debugging-a-flaky-cpu-steal-time-counter-on-a-paravirtualized-xen-guest/

psutil should probably ignore negative differences if values in "/proc/stat" decrease. Something like:

all_delta = max(0, all_delta)
field_delta = max(0, field_delta)

"top" is doing this: https://github.com/thlorenz/procps/blob/faa41f864a599854ceafa4ea634b29a6924bbbe6/deps/procps/top/top.c#L5017

@giampaolo
Owner

giampaolo commented Feb 5, 2018

Wow! Thanks for the fine brain work, Arnon! psutil already provides some logic to prevent numbers from wrapping:
https://github.com/giampaolo/psutil/blob/master/psutil/_common.py#L477
We may reuse that (or a variant of it), even though it may make more sense to restrict this to Linux + steal time only.
@lbguilherme are you running Xen?

@wiggin15
Collaborator

wiggin15 commented Feb 5, 2018

I think we need to use the same logic as in "top", which is to trim ("TRIMZ") each delta separately (they do it for all values, not just "steal") and then sum up the "trimmed" deltas to get the total to divide by. This way negative differences won't impact other fields when we use sum in _cpu_tot_time. Not sure if _WrapNumbers can help here. I will work on a pull request.
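
A rough sketch of that "trim then sum" idea (illustrative only: the helper name is made up and this is not the code of the eventual pull request):

# Sketch only: per-field percentages with each delta trimmed at zero,
# like top's TRIMZ macro, before summing the total to divide by.
def cpu_percent_fields(t1, t2):
    deltas = {}
    for field in t1._fields:
        delta = getattr(t2, field) - getattr(t1, field)
        deltas[field] = max(0.0, delta)   # a delta is never allowed to go negative
    all_delta = sum(deltas.values())      # total built from the trimmed deltas
    if all_delta == 0:
        return dict.fromkeys(deltas, 0.0)
    return {field: 100.0 * d / all_delta for field, d in deltas.items()}

With the two snapshots pasted earlier, the bogus negative steal delta is trimmed to zero, so the roughly 1.9 seconds of idle time dominates the total and steal drops back to about 0% instead of 100%.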
@giampaolo I wonder if it's possible to assign this issue to me or if assignees must be project collaborators.

@giampaolo
Owner

Arnon: I don't know, let me check.

@giampaolo
Owner

Yes, you need to be a collaborator. I just made you one (then I suppose you can assign the issue to yourself).

@lbguilherme

@giampaolo Yes, it is running KVM on Xen (AWS).

wiggin15 self-assigned this Feb 5, 2018
@wiggin15
Collaborator

wiggin15 commented Feb 5, 2018

Yes, you need to be a collaborator. I just made you one (then I suppose you can assign the issue to yourself).

Thanks!

@giampaolo
Owner

Not sure if _WrapNumbers can help here. I will work on a pull request.

The only thing is that wrap_numbers (the public function defined later) accepts a dict, whereas here we're dealing with a list. Maybe you can simply translate the CPU times into an OrderedDict and convert it back to a list. Grep for wrap_numbers to get an idea of how it is used.
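
If you went that route, the translation could look roughly like this (illustrative only: it assumes the wrap_numbers(dict, name) helper in psutil/_common.py linked above, the cache name is made up, and rebuilding scputimes tuples from the returned values is left out):

# Sketch only: route per-CPU times through wrap_numbers via an OrderedDict.
from collections import OrderedDict
import psutil
from psutil._common import wrap_numbers

percpu = psutil.cpu_times(percpu=True)                    # list of scputimes
as_dict = OrderedDict(("cpu%s" % i, t) for i, t in enumerate(percpu))
wrapped = wrap_numbers(as_dict, "psutil.cpu_times")       # made-up cache name
as_list = list(wrapped.values())                          # back to a list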

@giampaolo
Owner

giampaolo commented Feb 5, 2018

Actually I'm not entirely sure wrap_numbers is what we really need. That was designed for situations where a big number goes back to zero and then starts increasing, e.g.:

1000
5
10
15
....

...in which case wrap_numbers will change it to:

1000
1005
1010
1015

This case is different as AFAIU we can have numbers like this:

1000
955
1000
1100
...

I suppose what should happen in this case is that 955 should be translated to 1000 (wrap_numbers would translate it to 1955, which is wrong).
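
A tiny made-up illustration of the difference between the two behaviours (not psutil code): the wrap style treats a drop as a counter reset and carries the old value on top, while what this case needs is to raise the dip back to the highest value seen so far:

# Sketch only: the two ways of handling a value that goes down.
def wrap_style(values):
    # Imitates the wrap_numbers behaviour described above: a drop is treated
    # as a counter reset, so the previous value is added on top from then on.
    out, offset, prev = [], 0, None
    for v in values:
        if prev is not None and v < prev:
            offset += prev
        out.append(v + offset)
        prev = v
    return out

def clamp_style(values):
    # A dip is simply raised back to the highest value seen so far.
    out, highest = [], float("-inf")
    for v in values:
        highest = max(highest, v)
        out.append(highest)
    return out

print(wrap_style([1000, 955, 1000, 1100]))    # [1000, 1955, 2000, 2100]
print(clamp_style([1000, 955, 1000, 1100]))   # [1000, 1000, 1000, 1100]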

@wiggin15
Collaborator

Patch in #1214 merged. Please close this issue.

@agbesi1845

okay

giampaolo added a commit that referenced this issue May 3, 2021
…etimes go backwards

Signed-off-by: Giampaolo Rodola <g.rodola@gmail.com>