disk-io - add iops and improve match condition #772

Open. Wants to merge 4 commits into base: main.
53 changes: 29 additions & 24 deletions check-plugins/disk-io/README.rst
@@ -6,20 +6,21 @@ Overview

Checks disk bandwidth over a period of time. The check tracks the maximum bandwidth and alerts if the bandwidth over the last n measurements is above a certain percentage (by default 80/90% over the last 5 measurements). This works similarly to Load5, but at the disk I/O level.

On Linux, the check plugin by default tries to find "important" disks automatically and returns only useful perfdata information, so as not to waste disk space in a time series database with unnecessary disk information (as in earlier versions). To do this, it looks for disks that are mounted to a folder.
On Linux, the check plugin by default tries to find "important" disks automatically and returns only useful perfdata information, so as not to waste disk space in a time series database with unnecessary disk information (as in earlier versions). To do this, it looks for disks that are mounted to a folder. If you want to monitor more disks than the automatic scan provides, use the match parameter. The plugin then generates a list of all disks, including the "important" ones, and acts on those matching the provided regex. This is necessary on systems with, for example, ZFS pools: the pools are not recognised automatically, so you need to monitor the raw disks with the match option. As a starting point, the following regex matches most disks: ``^(nvme[0-9]{1,}n[0-9]{1,}$|[sv]d[a-z][0-9]{1,}|md|dm)``
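
To see which device names the starting-point regex picks up, it can be tried against a few sample names. This is only a minimal sketch: the device names are made up for illustration, and ``re.search`` is assumed as a stand-in for the plugin's own matching helper.

```python
import re

# Starting-point regex quoted in the paragraph above
PATTERN = re.compile(r'^(nvme[0-9]{1,}n[0-9]{1,}$|[sv]d[a-z][0-9]{1,}|md|dm)')

# Hypothetical device names, for illustration only
SAMPLES = ['sda', 'sda1', 'vdb2', 'nvme0n1', 'nvme0n1p1', 'md0', 'dm-3', 'loop0']


def matching(names, pattern=PATTERN):
    """Return the names the pattern matches (re.search is an assumption)."""
    return [name for name in names if pattern.search(name)]


if __name__ == '__main__':
    print(matching(SAMPLES))
```

With these samples, the whole disk ``sda``, the NVMe partition ``nvme0n1p1`` and ``loop0`` are skipped, while ``sda1``, ``vdb2``, ``nvme0n1``, ``md0`` and ``dm-3`` match.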

Disk I/O always starts at 10 MiB/sec, but the check stores the highest measured bandwidth and adjusts the ``RWmax/s`` value accordingly. For this reason, the check takes some time to warm up its (cached) readings: it will throw some warnings and criticals during the first major disk activities above 10 MiB/sec until the maximum bandwidth of the disk has been determined.
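
The warm-up behaviour can be pictured as a running maximum with a 10 MiB/sec floor. The following is a rough sketch under that assumption, not the plugin's actual ``get_max_bandwidth()``; the names ``MIN_BANDWIDTH``, ``update_max`` and ``warm_up`` are hypothetical.

```python
MIB = 1024 * 1024
MIN_BANDWIDTH = 10 * MIB  # starting value of 10 MiB/sec, as described above


def update_max(stored_max, current_bandwidth):
    """Return the new maximum: never below the floor and never decreasing."""
    return max(MIN_BANDWIDTH, stored_max, current_bandwidth)


def warm_up(samples):
    """Feed bandwidth samples (bytes/sec) in order.

    Returns the stored maximum after each sample.
    """
    history = []
    current_max = MIN_BANDWIDTH
    for sample in samples:
        current_max = update_max(current_max, sample)
        history.append(current_max)
    return history


if __name__ == '__main__':
    # Hypothetical measurements: the maximum only grows once 10 MiB/sec is exceeded
    print([value // MIB for value in warm_up([1 * MIB, 30 * MIB, 20 * MIB, 45 * MIB])])
```

In this sketch the stored maximum stays at 10 MiB/sec until a sample exceeds it, and afterwards only ever increases.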

Example: The (shortened) result of ``./disk-io --count 5 --warning 80 --critical 90`` could look like this:

.. code-block:: text

/dev/dm-4: 0.0B/s read1, 48.7KiB/s write1, 48.7KiB/s total, 227.9MiB/s max

Name ! RWmax/s ! R1/s ! W1/s ! R5/s ! W5/s ! RW5/s
-----+---------+----------+----------+----------+----------+--------------------
dm-0 ! 44.9MiB ! 42.8MiB ! 17.2MiB ! 23.1MiB ! 18.6MiB ! 36.3MiB [CRITICAL]
dm-1 ! 10.0MiB ! 4.7KiB ! 4.0KiB ! 2.0KiB ! 6.8KiB ! 8.7KiB
/dev/dm-0: 0.0B/s read1, 380.0KiB/s write1, 380.0KiB/s total, 10.0MiB/s max, 0/s readops, 89/s writeops

Name ! MntPnts ! DvMppr ! RWmax/s ! R1/s ! W1/s ! R5/s ! W5/s ! RW5/s ! R1/s ! W1/s ! R5/s ! W5/s
-----+---------+-------------+---------+---------+----------+---------+---------+--------------------+------+------+------+------
dm-0 ! / ! ubuntu-root ! 44.9MiB ! 42.8MiB ! 17.2MiB ! 23.1MiB ! 18.6MiB ! 36.3MiB [CRITICAL] ! 0 ! 89 ! 0 ! 71
md0 ! /boot ! ! 10.0MiB ! 0.0B ! 0.0B ! 0.0B ! 0.0B ! 0.0B ! 0 ! 0 ! 0 ! 0
dm-2 ! /var ! ubuntu-var ! 10.0MiB ! 0.0B ! 0.0B ! 0.0B ! 0.0B ! 0.0B ! 0 ! 0 ! 0 ! 0
dm-1 ! /home ! ubuntu-home ! 10.0MiB ! 0.0B ! 0.0B ! 0.0B ! 0.0B ! 0.0B ! 0 ! 0 ! 0 ! 0
...

The first line always shows the disk with the currently highest bandwidth (here ``dm-0``).
@@ -30,6 +31,8 @@ The table columns mean:
* R1, W1: The current bandwidth is 42.8 MB/sec read and 17.2 MB/sec write.
* R5, W5: The bandwidth from now to 5 measured values in the past is 23.1 MB/sec read and 18.6 MB/sec write.
* First line in the table, RW5: Compared to the current values, there was a higher bandwidth for a while. Since a maximum of 44.9 MB/sec bandwidth has been measured for this disk so far, a mean bandwidth (RW5) value of 36.3 MB/sec results in a warning (``36.3 MB/sec >= 44.9 MB/sec * 80%``). The current value of 42.8 MB/sec doesn't matter; it is only a peak. The check alerts because there is unusually high disk I/O over a certain amount of time.
* R1, W1 (IOPS columns at the end of the table): The current IOPS for read and write.
* R5, W5 (IOPS columns at the end of the table): The IOPS from now to 5 measured values in the past for read and write.
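
The threshold arithmetic in the RW5 bullet can be worked through in a few lines. This is a minimal sketch: ``state_for`` and ``bandwidth_state`` are hypothetical helpers, and the ``>=`` comparison is taken from the formula in the bullet, not from the plugin's library code.

```python
def state_for(value, warn, crit):
    """Return 'CRITICAL', 'WARNING' or 'OK' for a value against two thresholds."""
    if value >= crit:
        return 'CRITICAL'
    if value >= warn:
        return 'WARNING'
    return 'OK'


def bandwidth_state(rw_mean, rw_max, warn_pct=80, crit_pct=90):
    """Compare the mean bandwidth with percentages of the maximum bandwidth."""
    return state_for(rw_mean, rw_max * warn_pct / 100, rw_max * crit_pct / 100)


if __name__ == '__main__':
    # Values from the bullet: mean 36.3, maximum 44.9, default thresholds 80/90
    print(bandwidth_state(36.3, 44.9))
```

With a maximum of 44.9, the warning threshold is about 35.9 and the critical threshold about 40.4, so a mean of 36.3 lands in the warning range.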

Hints:

@@ -96,32 +99,32 @@ Just check disk ``dm-0`` (if listed as ``/dev/dm-0``):

.. code-block:: bash

./disk-io --match='.*dm-0$'
./disk-io --match='dm-0$'

Match all disks except ``vdc``, ``vdh`` and ``vdz``:

.. code-block:: bash

./disk-io --match='^(?:(?!.*vdc|.*vdh|.*vdz).)*$'
./disk-io --match='^(?:(?!vdc|vdh|vdz).)*$'
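
The exclusion pattern relies on a negative lookahead that is re-checked at every character position, so a name is rejected as soon as ``vdc``, ``vdh`` or ``vdz`` occurs anywhere in it. A small sketch, assuming ``re.search`` semantics and made-up device names:

```python
import re

# Exclusion pattern from the example above
EXCLUDE = re.compile(r'^(?:(?!vdc|vdh|vdz).)*$')

# Hypothetical device names, for illustration only
SAMPLES = ['vda', 'vdb1', 'vdc', 'vdc1', '/dev/vdh', 'vdz', 'dm-0']


def kept(names, pattern=EXCLUDE):
    """Return the names that do not contain any of the excluded disk names."""
    return [name for name in names if pattern.search(name)]


if __name__ == '__main__':
    print(kept(SAMPLES))
```

Because the lookahead runs at each position, the leading ``.*`` inside it is not needed, and names such as ``/dev/vdh`` or ``vdc1`` are still excluded.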

Match the partitions of sd and vd disks (but not the raw disks themselves), NVMe namespaces, and all md and dm devices:

.. code-block:: bash

./disk-io --match='^(nvme[0-9]{1,}n[0-9]{1,}$|[sv]d[a-z][0-9]{1,}|md|dm)'

Example Output:

.. code-block:: text

/dev/dm-8: 5.6KiB/s read1, 2.2MiB/s write1, 2.2MiB/s total, 10.0MiB/s max

Name ! MntPnts ! DvMppr ! RWmax/s ! R1/s ! W1/s ! R5/s ! W5/s ! RW5/s
-----+----------------+------------------+---------+--------+---------+--------+---------+---------
dm-0 ! / ! rl-root ! 10.0MiB ! 0.0B ! 426.0B ! 0.0B ! 343.0B ! 343.0B
vda2 ! /boot ! ! 10.0MiB ! 0.0B ! 0.0B ! 0.0B ! 0.0B ! 0.0B
vda1 ! /boot/efi ! ! 10.0MiB ! 0.0B ! 0.0B ! 0.0B ! 0.0B ! 0.0B
dm-5 ! /var ! rl-var ! 10.0MiB ! 0.0B ! 586.0B ! 0.0B ! 1.1KiB ! 1.1KiB
dm-8 ! /data ! rl-lv_data ! 10.0MiB ! 5.6KiB ! 2.2MiB ! 8.3KiB ! 2.3MiB ! 2.3MiB
dm-6 ! /tmp ! rl-tmp ! 10.0MiB ! 0.0B ! 4.8KiB ! 0.0B ! 7.1KiB ! 7.1KiB
dm-7 ! /home ! rl-home ! 10.0MiB ! 0.0B ! 0.0B ! 0.0B ! 0.0B ! 0.0B
dm-2 ! /var/tmp ! rl-var_tmp ! 10.0MiB ! 0.0B ! 0.0B ! 0.0B ! 0.0B ! 0.0B
dm-4 ! /var/log ! rl-var_log ! 10.0MiB ! 0.0B ! 51.8KiB ! 0.0B ! 51.2KiB ! 51.2KiB
dm-3 ! /var/log/audit ! rl-var_log_audit ! 10.0MiB ! 0.0B ! 918.0B ! 0.0B ! 876.0B ! 876.0B
/dev/dm-0: 0.0B/s read1, 380.0KiB/s write1, 380.0KiB/s total, 10.0MiB/s max, 0/s readops, 89/s writeops

Name ! MntPnts ! DvMppr ! RWmax/s ! R1/s ! W1/s ! R5/s ! W5/s ! RW5/s ! R1/s ! W1/s ! R5/s ! W5/s
-----+---------+-------------+---------+------+----------+------+----------+----------+------+------+------+------
dm-0 ! / ! ubuntu-root ! 10.0MiB ! 0.0B ! 380.0KiB ! 0.0B ! 305.0KiB ! 305.0KiB ! 0 ! 89 ! 0 ! 71
md0 ! /boot ! ! 10.0MiB ! 0.0B ! 0.0B ! 0.0B ! 0.0B ! 0.0B ! 0 ! 0 ! 0 ! 0
dm-2 ! /var ! ubuntu-var ! 10.0MiB ! 0.0B ! 0.0B ! 0.0B ! 0.0B ! 0.0B ! 0 ! 0 ! 0 ! 0
dm-1 ! /home ! ubuntu-home ! 10.0MiB ! 0.0B ! 0.0B ! 0.0B ! 0.0B ! 0.0B ! 0 ! 0 ! 0 ! 0

Top 5 processes that generate the most I/O traffic:
1. nfsd: 149.2GiB/5.7TiB (r/w)
@@ -149,8 +152,10 @@ Per (matched) disk, where <disk> is the block device name:
Name, Type, Description
<disk>_busy_time, Continuous Counter, Time spent doing actual I/Os (in milliseconds).
<disk>_read_bytes, Continuous Counter, Number of bytes read.
<disk>_read_count, Continuous Counter, Number of read operations.
<disk>_read_time, Continuous Counter, Time spent reading from disk (in milliseconds).
<disk>_write_bytes, Continuous Counter, Number of bytes written.
<disk>_write_count, Continuous Counter, Number of write operations.
<disk>_write_time, Continuous Counter, Time spent writing to disk (in milliseconds).


83 changes: 60 additions & 23 deletions check-plugins/disk-io/disk-io
@@ -138,16 +138,19 @@ def get_max_bandwidth(disk, current_bandwidth):
return max_bandwidth


def get_rate(ts1, ts2, r1, r2, w1, w2):
def get_rate(ts1, ts2, rr1, rr2, wr1, wr2, r1, r2, w1, w2):
"""Given two read-, write- and timestamp-values, return the read- and write-rate
plus bandwidth.
plus bandwidth and iops.
"""
timediff = abs(ts1 - ts2) # in seconds
if timediff == 0:
return 0, 0, 0, 0
return 0, 0, 0, 0, 0, 0
rr = abs(int(float(rr1 - rr2) / timediff))
wr = abs(int(float(wr1 - wr2) / timediff))
r = abs(int(float(r1 - r2) / timediff))
w = abs(int(float(w1 - w2) / timediff))
return timediff, r, w, r + w

return timediff, rr, wr, rr + wr, r, w
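
For readability, the new version of ``get_rate()`` can be assembled from the added lines of this hunk into a standalone function. The body follows the diff; the sample values in the usage part are hypothetical.

```python
def get_rate(ts1, ts2, rr1, rr2, wr1, wr2, r1, r2, w1, w2):
    """Given two samples, return the time difference in seconds, the read and
    write byte rates, the total bandwidth, and the read and write IOPS.

    rr*/wr* are read/write byte counters, r*/w* are read/write operation counters.
    """
    timediff = abs(ts1 - ts2)  # in seconds
    if timediff == 0:
        return 0, 0, 0, 0, 0, 0
    rr = abs(int(float(rr1 - rr2) / timediff))  # bytes read per second
    wr = abs(int(float(wr1 - wr2) / timediff))  # bytes written per second
    r = abs(int(float(r1 - r2) / timediff))     # read operations per second
    w = abs(int(float(w1 - w2) / timediff))     # write operations per second
    return timediff, rr, wr, rr + wr, r, w


if __name__ == '__main__':
    # Hypothetical counters, sampled 60 seconds apart
    print(get_rate(1060, 1000, 6000000, 0, 3000000, 0, 1200, 1200, 5340, 0))
    # (60, 100000, 50000, 150000, 0, 89)
```

Note that the argument order matters: the byte counters come first, followed by the operation counters, which is also the order used at the call sites further down in the diff.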


def top(count):
@@ -214,6 +217,8 @@ def main():
bd TEXT NOT NULL,
dmd TEXT,
mp TEXT,
read_count INT DEFAULT 0,
write_count INT DEFAULT 0,
busy_time INT DEFAULT 0,
read_bytes INT DEFAULT 0,
read_merged_count INT DEFAULT 0,
@@ -244,25 +249,35 @@ def main():

# analyze and enrich data, store it to database
real_disks = lib.disk.get_real_disks()

# if the match argument is supplied, try the match on all disks from psutil's disk_io_counters;
# skip disks that are already included in real_disks
if args.MATCH:
for disk in disk_io_counters.keys():
if lib.base.coe(lib.txt.match_regex(compiled_regex, disk)) and not any(disk in x['bd'] for x in real_disks):
real_disks.append({'bd': disk, 'dmd': '', 'mp': ''})
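
The effect of the added block can be tried in isolation. The following is a rough sketch, assuming ``re.search`` in place of ``lib.txt.match_regex`` and made-up disk names; ``extend_with_matches`` is a hypothetical helper that mirrors the substring test ``disk in x['bd']`` from the lines above.

```python
import re


def extend_with_matches(real_disks, counter_names, pattern):
    """Append disk names that match the pattern and are not yet listed.

    A name counts as "already listed" if it is a substring of any 'bd' value,
    which is the same test as in the diff.
    """
    regex = re.compile(pattern)
    result = list(real_disks)
    for name in counter_names:
        already_listed = any(name in entry['bd'] for entry in result)
        if regex.search(name) and not already_listed:
            result.append({'bd': name, 'dmd': '', 'mp': ''})
    return result


if __name__ == '__main__':
    # Hypothetical input: one mounted partition plus the names psutil might report
    disks = [{'bd': '/dev/sda1', 'dmd': '', 'mp': '/boot'}]
    extended = extend_with_matches(disks, ['sda', 'sda1', 'sdb', 'sdb1', 'loop0'], '^sd')
    print([entry['bd'] for entry in extended])
```

Because the test is a substring check, ``sda`` is treated as already covered by ``/dev/sda1`` in this sketch and is not appended, while ``sdb`` and ``sdb1`` are both added.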

for disk in real_disks:
psutil_name = os.path.basename(disk['bd'])

# disks we have to match
if args.MATCH \
and all((
not lib.base.coe(lib.txt.match_regex(compiled_regex, disk['bd'])),
not lib.base.coe(lib.txt.match_regex(compiled_regex, psutil_name)),
not lib.base.coe(lib.txt.match_regex(compiled_regex, disk['dmd'])),
not lib.base.coe(lib.txt.match_regex(compiled_regex, disk['mp'])),
)):
continue

psutil_name = os.path.basename(disk['bd'])
if psutil_name not in disk_io_counters:
continue

data = {}
data['bd'] = disk['bd']
data['dmd'] = disk['dmd']
data['mp'] = disk['mp']
# read_count and write_count are used to calculate the IOPS
data['read_count'] = getattr(disk_io_counters[psutil_name], 'read_count', 0)
data['write_count'] = getattr(disk_io_counters[psutil_name], 'write_count', 0)
data['busy_time'] = getattr(disk_io_counters[psutil_name], 'busy_time', 0)
data['read_bytes'] = getattr(disk_io_counters[psutil_name], 'read_bytes', 0)
data['read_merged_count'] = getattr(disk_io_counters[psutil_name], 'read_merged_count', 0)
@@ -300,13 +315,17 @@ def main():
lib.base.oao('Waiting for more data.', state)

# calculate current rates (like "load1")
timediff, read_bytes_per_second1, write_bytes_per_second1, bandwidth1 = get_rate(
timediff, read_bytes_per_second1, write_bytes_per_second1, bandwidth1, read_per_second1, write_per_second1 = get_rate(
data[0]['timestamp'],
data[1]['timestamp'],
data[0]['read_bytes'],
data[1]['read_bytes'],
data[0]['write_bytes'],
data[1]['write_bytes'],
data[0]['read_count'],
data[1]['read_count'],
data[0]['write_count'],
data[1]['write_count']
)
if timediff <= 0:
# often happens after a reboot
@@ -318,28 +337,34 @@ def main():

if bandwidth1 > busiest_disk:
# get the current busiest disk for the first line of the message
msg = '{}: {}/s read1, {}/s write1, {}/s total, {}/s max'.format(
msg = '{}: {}/s read1, {}/s write1, {}/s total, {}/s max, {}/s readops, {}/s writeops'.format(
disk['bd'],
lib.human.bytes2human(read_bytes_per_second1),
lib.human.bytes2human(write_bytes_per_second1),
lib.human.bytes2human(bandwidth1),
lib.human.bytes2human(bandwidth_max),
read_per_second1,
write_per_second1
)
if args.MATCH:
msg += ' (disks matching `{}`).'.format(args.MATCH)
busiest_disk = bandwidth1

# calculate read/write rate over the entire period (like "load15")
# calculate read/write rate over the entire period (like "load5")
if len(data) != args.COUNT:
# not enough data yet
continue
timediff, read_bytes_per_second15, write_bytes_per_second15, bandwidth15 = get_rate(
timediff, read_bytes_per_second5, write_bytes_per_second5, bandwidth5, read_per_second5, write_per_second5 = get_rate(
data[0]['timestamp'],
data[args.COUNT - 1]['timestamp'],
data[0]['read_bytes'],
data[args.COUNT - 1]['read_bytes'],
data[0]['write_bytes'],
data[args.COUNT - 1]['write_bytes'],
data[0]['read_count'],
data[args.COUNT - 1]['read_count'],
data[0]['write_count'],
data[args.COUNT - 1]['write_count'],
)
if timediff <= 0:
# often happens after a reboot
@@ -348,7 +373,7 @@ def main():

# get state based on max measured I/O values
local_state = lib.base.get_state(
bandwidth15,
bandwidth5,
bandwidth_max * args.WARN / 100,
bandwidth_max * args.CRIT / 100,
)
@@ -360,21 +385,25 @@ def main():
'dmd': disk['dmd'].replace('/dev/mapper/', ''),
'mp': disk['mp'],
'max': lib.human.bytes2human(bandwidth_max),
'r1': lib.human.bytes2human(read_bytes_per_second1),
'w1': lib.human.bytes2human(write_bytes_per_second1),
'r15': lib.human.bytes2human(read_bytes_per_second15),
'w15': lib.human.bytes2human(write_bytes_per_second15),
't15': lib.human.bytes2human(bandwidth15) + lib.base.state2str(local_state, prefix=' '),
'rr1': lib.human.bytes2human(read_bytes_per_second1),
'wr1': lib.human.bytes2human(write_bytes_per_second1),
'rr5': lib.human.bytes2human(read_bytes_per_second5),
'wr5': lib.human.bytes2human(write_bytes_per_second5),
'tr5': lib.human.bytes2human(bandwidth5) + lib.base.state2str(local_state, prefix=' '),
'r1': read_per_second1,
'w1': write_per_second1,
'r5': read_per_second5,
'w5': write_per_second5,
})

# perfdata
try:
perfdata += lib.base.get_perfdata('{}_busy_time'.format(bd), data[0]['busy_time'], 'c', None, None, 0, None) # pylint: disable=C0301
perfdata += lib.base.get_perfdata('{}_read_bytes'.format(bd), data[0]['read_bytes'], 'c', None, None, 0, None) # pylint: disable=C0301
#perfdata += lib.base.get_perfdata('{}_read_merged_count'.format(bd), data[0]['read_merged_count'], 'c', None, None, 0, None) # pylint: disable=C0301
perfdata += lib.base.get_perfdata('{}_read_count'.format(bd), data[0]['read_count'], 'c', None, None, 0, None) # pylint: disable=C0301
perfdata += lib.base.get_perfdata('{}_read_time'.format(bd), data[0]['read_time'], 'c', None, None, 0, None) # pylint: disable=C0301
perfdata += lib.base.get_perfdata('{}_write_bytes'.format(bd), data[0]['write_bytes'], 'c', None, None, 0, None) # pylint: disable=C0301
#perfdata += lib.base.get_perfdata('{}_write_merged_count'.format(bd), data[0]['write_merged_count'], 'c', None, None, 0, None) # pylint: disable=C0301
perfdata += lib.base.get_perfdata('{}_write_count'.format(bd), data[0]['write_count'], 'c', None, None, 0, None) # pylint: disable=C0301
perfdata += lib.base.get_perfdata('{}_write_time'.format(bd), data[0]['write_time'], 'c', None, None, 0, None) # pylint: disable=C0301
except:
pass
@@ -391,11 +420,15 @@ def main():
'mp',
'dmd',
'max',
'rr1',
'wr1',
'rr5',
'wr5',
'tr5',
'r1',
'w1',
'r15',
'w15',
't15',
'r5',
'w5'
],
header=[
'Name',
@@ -406,7 +439,11 @@ def main():
'W1/s',
'R{}/s'.format(args.COUNT),
'W{}/s'.format(args.COUNT),
'RW{}/s'.format(args.COUNT)
'RW{}/s'.format(args.COUNT),
'R1/s',
'W1/s',
'R{}/s'.format(args.COUNT),
'W{}/s'.format(args.COUNT)
],
)
