Linuxfabrik · leo-pempera · Jul 3, 2024
diff --git a/check-plugins/disk-io/README.rst b/check-plugins/disk-io/README.rst
@@ -6,20 +6,21 @@ Overview
 
 Checks disk bandwidth over a period of time. The check tracks the maximum bandwidth and alerts if the bandwidth over the last n reads is above a certain percentage (by default 80/90% over the last 5 reads). This works similar to Load5, but at the disk I/O level.
 
-On Linux, the check plugin by default tries to find "important" disks automatically and returns only useful perfdata information, so as not to waste disk space in a time series database with unnecessary disk information (as in earlier versions). To do this, it looks for disks that are mounted to a folder.
+On Linux, the check plugin by default tries to find "important" disks automatically and returns only useful perfdata information, so as not to waste disk space in a time series database with unnecessary disk information (as in earlier versions). To do this, it looks for disks that are mounted to a folder. If you want to monitor more disks than the automatic scan provides, use the ``--match`` parameter. The plugin then builds a list of all disks, including the "important" ones, and acts on those matching the provided regex. This is necessary on systems with, for example, ZFS pools: the pools are not recognised automatically, so you need to monitor the raw disks with the ``--match`` option. As a starting point, the following regex matches most disks: ``^(nvme[0-9]{1,}n[0-9]{1,}$|[sv]d[a-z][0-9]{1,}|md|dm)``
 
 Disk I/O always starts at 10 MiB/sec, but stores the highest measured bandwidth, so it adjusts the ``RWmax/s`` value accordingly. For this reason, this check takes some time to warm up its (cached) readings: The check will throw some warnings and criticals during the first major disk activities above 10Mib/sec until the maximum bandwidth of the disk has been determined.
 
 Example: The (shortened) result of ``./disk-io --count 5 --warning 80 --critical 90`` could look like this:
 
 .. code-block:: text
-
-    /dev/dm-4: 0.0B/s read1, 48.7KiB/s write1, 48.7KiB/s total, 227.9MiB/s max
-
-    Name ! RWmax/s ! R1/s     ! W1/s     ! R5/s     ! W5/s     ! RW5/s              
-    -----+---------+----------+----------+----------+----------+--------------------
-    dm-0 ! 44.9MiB ! 42.8MiB  ! 17.2MiB  ! 23.1MiB  ! 18.6MiB  ! 36.3MiB [CRITICAL] 
-    dm-1 ! 10.0MiB ! 4.7KiB   ! 4.0KiB   ! 2.0KiB   ! 6.8KiB   ! 8.7KiB             
+    /dev/dm-0: 0.0B/s read1, 380.0KiB/s write1, 380.0KiB/s total, 10.0MiB/s max, 0/s readops, 89/s writeops
+
+    Name ! MntPnts ! DvMppr      ! RWmax/s ! R1/s    ! W1/s     ! R5/s    ! W5/s    ! RW5/s              ! R1/s ! W1/s ! R5/s ! W5/s
+    -----+---------+-------------+---------+---------+----------+---------+---------+--------------------+------+------+------+------
+    dm-0 ! /       ! ubuntu-root ! 44.9MiB ! 42.8MiB ! 17.2MiB  ! 23.1MiB ! 18.6MiB ! 36.3MiB [CRITICAL] ! 0    ! 89   ! 0    ! 71
+    md0  ! /boot   !             ! 10.0MiB ! 0.0B    ! 0.0B     ! 0.0B    ! 0.0B    ! 0.0B               ! 0    ! 0    ! 0    ! 0
+    dm-2 ! /var    ! ubuntu-var  ! 10.0MiB ! 0.0B    ! 0.0B     ! 0.0B    ! 0.0B    ! 0.0B               ! 0    ! 0    ! 0    ! 0
+    dm-1 ! /home   ! ubuntu-home ! 10.0MiB ! 0.0B    ! 0.0B     ! 0.0B    ! 0.0B    ! 0.0B               ! 0    ! 0    ! 0    ! 0            
     ...
 
 The first line always shows the disk with the currently highest bandwidth (here ``dm-0``).
@@ -30,6 +31,8 @@ The table columns mean:
 * R1, W1: The current bandwidth is 23.6 MB/sec read and 17.2 MB/sec write.
 * R5, W5: The bandwidth from now to 5 measured values in the past is 23.1 MB/sec read and 18.6 MB/sec write.
 * First line in the table, RW5: Compared to the current values, there was a higher bandwidth for a while. Since a maximum of 44.9 MB/sec bandwidth has been measured for this disk so far, a mean bandwidth (RW5) value of 36.3 MB/sec results in a warning (``36.3 MB/sec >= 44.9 MB/sec * 80%``). The current value of 42.8 MB/sec doesn't matter, this is only a peak. The check alerts because there is unusual high disk I/O over a certain amount of time.
+* R1, W1 (last four columns): The current IOPS for read and write.
+* R5, W5 (last four columns): The IOPS from now to 5 measured values in the past for read and write.
 
 Hints:
 
@@ -96,32 +99,32 @@ Just check disk ``dm-0`` (if listed as ``/dev/dm-0``):
 
 .. code-block:: bash
 
-    ./disk-io --match='.*dm-0$'
+    ./disk-io --match='dm-0$'
 
 Match all disks except ``vdc``, ``vdh`` and ``vdz``:
 
 .. code-block:: bash
 
-    ./disk-io --match='^(?:(?!.*vdc|.*vdh|.*vdz).)*$'
+    ./disk-io --match='^(?:(?!vdc|vdh|vdz).)*$'
+
+Match all disks starting with ``sd``, ``vd``, ``md``, ``dm`` and ``nvme``, except the raw disk itself:
+
+.. code-block:: bash
+
+    ./disk-io --match='^(nvme[0-9]{1,}n[0-9]{1,}$|[sv]d[a-z][0-9]{1,}|md|dm)'
 
 Example Output:
 
 .. code-block:: text
 
-    /dev/dm-8: 5.6KiB/s read1, 2.2MiB/s write1, 2.2MiB/s total, 10.0MiB/s max
-
-    Name ! MntPnts        ! DvMppr           ! RWmax/s ! R1/s   ! W1/s    ! R5/s   ! W5/s    ! RW5/s   
-    -----+----------------+------------------+---------+--------+---------+--------+---------+---------
-    dm-0 ! /              ! rl-root          ! 10.0MiB ! 0.0B   ! 426.0B  ! 0.0B   ! 343.0B  ! 343.0B  
-    vda2 ! /boot          !                  ! 10.0MiB ! 0.0B   ! 0.0B    ! 0.0B   ! 0.0B    ! 0.0B    
-    vda1 ! /boot/efi      !                  ! 10.0MiB ! 0.0B   ! 0.0B    ! 0.0B   ! 0.0B    ! 0.0B    
-    dm-5 ! /var           ! rl-var           ! 10.0MiB ! 0.0B   ! 586.0B  ! 0.0B   ! 1.1KiB  ! 1.1KiB  
-    dm-8 ! /data          ! rl-lv_data       ! 10.0MiB ! 5.6KiB ! 2.2MiB  ! 8.3KiB ! 2.3MiB  ! 2.3MiB  
-    dm-6 ! /tmp           ! rl-tmp           ! 10.0MiB ! 0.0B   ! 4.8KiB  ! 0.0B   ! 7.1KiB  ! 7.1KiB  
-    dm-7 ! /home          ! rl-home          ! 10.0MiB ! 0.0B   ! 0.0B    ! 0.0B   ! 0.0B    ! 0.0B    
-    dm-2 ! /var/tmp       ! rl-var_tmp       ! 10.0MiB ! 0.0B   ! 0.0B    ! 0.0B   ! 0.0B    ! 0.0B    
-    dm-4 ! /var/log       ! rl-var_log       ! 10.0MiB ! 0.0B   ! 51.8KiB ! 0.0B   ! 51.2KiB ! 51.2KiB 
-    dm-3 ! /var/log/audit ! rl-var_log_audit ! 10.0MiB ! 0.0B   ! 918.0B  ! 0.0B   ! 876.0B  ! 876.0B  
+    /dev/dm-0: 0.0B/s read1, 380.0KiB/s write1, 380.0KiB/s total, 10.0MiB/s max, 0/s readops, 89/s writeops
+
+    Name ! MntPnts ! DvMppr      ! RWmax/s ! R1/s ! W1/s     ! R5/s ! W5/s     ! RW5/s    ! R1/s ! W1/s ! R5/s ! W5/s
+    -----+---------+-------------+---------+------+----------+------+----------+----------+------+------+------+------
+    dm-0 ! /       ! ubuntu-root ! 10.0MiB ! 0.0B ! 380.0KiB ! 0.0B ! 305.0KiB ! 305.0KiB ! 0    ! 89   ! 0    ! 71
+    md0  ! /boot   !             ! 10.0MiB ! 0.0B ! 0.0B     ! 0.0B ! 0.0B     ! 0.0B     ! 0    ! 0    ! 0    ! 0
+    dm-2 ! /var    ! ubuntu-var  ! 10.0MiB ! 0.0B ! 0.0B     ! 0.0B ! 0.0B     ! 0.0B     ! 0    ! 0    ! 0    ! 0
+    dm-1 ! /home   ! ubuntu-home ! 10.0MiB ! 0.0B ! 0.0B     ! 0.0B ! 0.0B     ! 0.0B     ! 0    ! 0    ! 0    ! 0
 
     Top 5 processes that generate the most I/O traffic:
     1. nfsd: 149.2GiB/5.7TiB (r/w)
@@ -149,8 +152,10 @@ Per (matched) disk, where <disk> is the block device name:
     Name,                               Type,                   Description                                           
     <disk>_busy_time,                   Continous Counter,      Time spent doing actual I/Os (in milliseconds).
     <disk>_read_bytes,                  Continous Counter,      Number of bytes read.
+    <disk>_read_count,                  Continous Counter,      Number of read operations.
     <disk>_read_time,                   Continous Counter,      Time spent reading from disk (in milliseconds).
     <disk>_write_bytes,                 Continous Counter,      Number of bytes written.
+    <disk>_write_count,                 Continous Counter,      Number of write operations.
     <disk>_write_time,                  Continous Counter,      Time spent writing to disk (in milliseconds).
 
 

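The two `--match` patterns from the README above can be checked offline before rolling them out. The following is a minimal sketch, assuming the plugin applies the pattern with search semantics to the bare device name. The `matching` helper and the sample device names are hypothetical and only serve as illustration.

```python
import re

# Patterns copied verbatim from the README examples in this diff.
SUGGESTED = re.compile(r'^(nvme[0-9]{1,}n[0-9]{1,}$|[sv]d[a-z][0-9]{1,}|md|dm)')
EXCLUDE = re.compile(r'^(?:(?!vdc|vdh|vdz).)*$')


def matching(pattern, names):
    """Hypothetical helper: return the device names the pattern matches."""
    return [name for name in names if pattern.search(name)]


# Sample device names, chosen for illustration only.
names = ['sda', 'sda1', 'vdb2', 'nvme0n1', 'nvme0n1p1', 'md0', 'dm-0', 'loop0']

print(matching(SUGGESTED, names))
print(matching(EXCLUDE, ['vda', 'vdc', 'vdh1', 'dm-0']))
```

With these sample names, the suggested pattern selects `sda1`, `vdb2`, `nvme0n1`, `md0` and `dm-0`. For `sd`/`vd` devices it therefore picks partitions and skips the raw disk, while for NVMe it picks the namespace device and skips its partitions.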
diff --git a/check-plugins/disk-io/disk-io b/check-plugins/disk-io/disk-io
@@ -138,16 +138,19 @@ def get_max_bandwidth(disk, current_bandwidth):
     return max_bandwidth
 
 
-def get_rate(ts1, ts2, r1, r2, w1, w2):
+def get_rate(ts1, ts2, rr1, rr2, wr1, wr2, r1, r2, w1, w2):
     """Given two read-, write- and timestamp-values, return the read- and write-rate
-    plus bandwidth.
+    plus bandwidth and iops.
     """
     timediff = abs(ts1 - ts2) # in seconds
     if timediff == 0:
-        return 0, 0, 0, 0
+        return 0, 0, 0, 0, 0, 0
+    rr = abs(int(float(rr1 - rr2) / timediff))
+    wr = abs(int(float(wr1 - wr2) / timediff))
     r = abs(int(float(r1 - r2) / timediff))
     w = abs(int(float(w1 - w2) / timediff))
-    return timediff, r, w, r + w
+
+    return timediff, rr, wr, rr + wr, r, w
 
 
 def top(count):
@@ -214,6 +217,8 @@ def main():
             bd                  TEXT NOT NULL,
             dmd                 TEXT,
             mp                  TEXT,
+            read_count          INT DEFAULT 0,
+            write_count         INT DEFAULT 0,
             busy_time           INT DEFAULT 0,
             read_bytes          INT DEFAULT 0,
             read_merged_count   INT DEFAULT 0,
@@ -244,25 +249,35 @@ def main():
 
     # analyze and enrich data, store it to database
     real_disks = lib.disk.get_real_disks()
+
+    # if the match argument is supplied, try the match on all disks from psutil disk_io_counters
+    # do not try the match if the disk is already included in real_disks
+    if args.MATCH:
+        for disk in disk_io_counters.keys():
+            if lib.base.coe(lib.txt.match_regex(compiled_regex, disk)) and not any(disk in x['bd'] for x in real_disks):
+                real_disks.append({'bd': disk, 'dmd': '', 'mp': ''})
+
     for disk in real_disks:
+        psutil_name = os.path.basename(disk['bd'])
+
         # disks we have to match
         if args.MATCH \
         and all((
-            not lib.base.coe(lib.txt.match_regex(compiled_regex, disk['bd'])),
+            not lib.base.coe(lib.txt.match_regex(compiled_regex, psutil_name)),
             not lib.base.coe(lib.txt.match_regex(compiled_regex, disk['dmd'])),
             not lib.base.coe(lib.txt.match_regex(compiled_regex, disk['mp'])),
         )):
             continue
 
-        psutil_name = os.path.basename(disk['bd'])
         if psutil_name not in disk_io_counters:
             continue
 
         data = {}
         data['bd'] = disk['bd']
         data['dmd'] = disk['dmd']
         data['mp'] = disk['mp']
-        # read_count and write_count are the same value over all disks, so simply ignore them
+        data['read_count'] = getattr(disk_io_counters[psutil_name], 'read_count', 0)
+        data['write_count'] = getattr(disk_io_counters[psutil_name], 'write_count', 0)
         data['busy_time'] = getattr(disk_io_counters[psutil_name], 'busy_time', 0)
         data['read_bytes'] = getattr(disk_io_counters[psutil_name], 'read_bytes', 0)
         data['read_merged_count'] = getattr(disk_io_counters[psutil_name], 'read_merged_count', 0)
@@ -300,13 +315,17 @@ def main():
             lib.base.oao('Waiting for more data.', state)
 
         # calculate current rates (like "load1")
-        timediff, read_bytes_per_second1, write_bytes_per_second1, bandwidth1 = get_rate(
+        timediff, read_bytes_per_second1, write_bytes_per_second1, bandwidth1, read_per_second1, write_per_second1 = get_rate(
             data[0]['timestamp'],
             data[1]['timestamp'],
             data[0]['read_bytes'],
             data[1]['read_bytes'],
             data[0]['write_bytes'],
             data[1]['write_bytes'],
+            data[0]['read_count'],
+            data[1]['read_count'],
+            data[0]['write_count'],
+            data[1]['write_count']
         )
         if timediff <= 0:
             # often happens after a reboot
@@ -318,28 +337,34 @@ def main():
 
         if bandwidth1 > busiest_disk:
             # get the current busiest disk for the first line of the message
-            msg = '{}: {}/s read1, {}/s write1, {}/s total, {}/s max'.format(
+            msg = '{}: {}/s read1, {}/s write1, {}/s total, {}/s max, {}/s readops, {}/s writeops'.format(
                 disk['bd'],
                 lib.human.bytes2human(read_bytes_per_second1),
                 lib.human.bytes2human(write_bytes_per_second1),
                 lib.human.bytes2human(bandwidth1),
                 lib.human.bytes2human(bandwidth_max),
+                read_per_second1,
+                write_per_second1
             )
             if args.MATCH:
                 msg += ' (disks matching `{}`).'.format(args.MATCH)
             busiest_disk = bandwidth1
 
-        # calculate read/write rate over the entire period (like "load15")
+        # calculate read/write rate over the entire period (like "load5")
         if len(data) != args.COUNT:
             # not enough data yet
             continue
-        timediff, read_bytes_per_second15, write_bytes_per_second15, bandwidth15 = get_rate(
+        timediff, read_bytes_per_second5, write_bytes_per_second5, bandwidth5, read_per_second5, write_per_second5 = get_rate(
             data[0]['timestamp'],
             data[args.COUNT - 1]['timestamp'],
             data[0]['read_bytes'],
             data[args.COUNT - 1]['read_bytes'],
             data[0]['write_bytes'],
             data[args.COUNT - 1]['write_bytes'],
+            data[0]['read_count'],
+            data[args.COUNT - 1]['read_count'],
+            data[0]['write_count'],
+            data[args.COUNT - 1]['write_count'],
         )
         if timediff <= 0:
             # often happens after a reboot
@@ -348,7 +373,7 @@ def main():
 
         # get state based on max measured I/O values
         local_state = lib.base.get_state(
-            bandwidth15,
+            bandwidth5,
             bandwidth_max * args.WARN / 100,
             bandwidth_max * args.CRIT / 100,
         )
@@ -360,21 +385,25 @@ def main():
             'dmd': disk['dmd'].replace('/dev/mapper/', ''),
             'mp': disk['mp'],
             'max': lib.human.bytes2human(bandwidth_max),
-            'r1': lib.human.bytes2human(read_bytes_per_second1),
-            'w1': lib.human.bytes2human(write_bytes_per_second1),
-            'r15': lib.human.bytes2human(read_bytes_per_second15),
-            'w15': lib.human.bytes2human(write_bytes_per_second15),
-            't15': lib.human.bytes2human(bandwidth15) + lib.base.state2str(local_state, prefix=' '),
+            'rr1': lib.human.bytes2human(read_bytes_per_second1),
+            'wr1': lib.human.bytes2human(write_bytes_per_second1),
+            'rr5': lib.human.bytes2human(read_bytes_per_second5),
+            'wr5': lib.human.bytes2human(write_bytes_per_second5),
+            'tr5': lib.human.bytes2human(bandwidth5) + lib.base.state2str(local_state, prefix=' '),
+            'r1': read_per_second1,
+            'w1': write_per_second1,
+            'r5': read_per_second5,
+            'w5': write_per_second5,
         })
 
         # perfdata
         try:
             perfdata += lib.base.get_perfdata('{}_busy_time'.format(bd), data[0]['busy_time'], 'c', None, None, 0, None) # pylint: disable=C0301
             perfdata += lib.base.get_perfdata('{}_read_bytes'.format(bd), data[0]['read_bytes'], 'c', None, None, 0, None) # pylint: disable=C0301
-            #perfdata += lib.base.get_perfdata('{}_read_merged_count'.format(bd), data[0]['read_merged_count'], 'c', None, None, 0, None) # pylint: disable=C0301
+            perfdata += lib.base.get_perfdata('{}_read_count'.format(bd), data[0]['read_count'], 'c', None, None, 0, None) # pylint: disable=C0301
             perfdata += lib.base.get_perfdata('{}_read_time'.format(bd), data[0]['read_time'], 'c', None, None, 0, None) # pylint: disable=C0301
             perfdata += lib.base.get_perfdata('{}_write_bytes'.format(bd), data[0]['write_bytes'], 'c', None, None, 0, None) # pylint: disable=C0301
-            #perfdata += lib.base.get_perfdata('{}_write_merged_count'.format(bd), data[0]['write_merged_count'], 'c', None, None, 0, None) # pylint: disable=C0301
+            perfdata += lib.base.get_perfdata('{}_write_count'.format(bd), data[0]['write_count'], 'c', None, None, 0, None) # pylint: disable=C0301
             perfdata += lib.base.get_perfdata('{}_write_time'.format(bd), data[0]['write_time'], 'c', None, None, 0, None) # pylint: disable=C0301
         except:
             pass
@@ -391,11 +420,15 @@ def main():
                 'mp',
                 'dmd',
                 'max',
+                'rr1',
+                'wr1',
+                'rr5',
+                'wr5',
+                'tr5',
                 'r1',
                 'w1',
-                'r15',
-                'w15',
-                't15',
+                'r5',
+                'w5'
             ],
             header=[
                 'Name',
@@ -406,7 +439,11 @@ def main():
                 'W1/s',
                 'R{}/s'.format(args.COUNT),
                 'W{}/s'.format(args.COUNT),
-                'RW{}/s'.format(args.COUNT)
+                'RW{}/s'.format(args.COUNT),
+                'R1/s',
+                'W1/s',
+                'R{}/s'.format(args.COUNT),
+                'W{}/s'.format(args.COUNT)
             ],
         )
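The extended `get_rate()` now returns six positional values, so callers have to keep the order of bytes and operation counts straight. As a rough illustration of the same arithmetic, here is a standalone sketch that works on two sample dicts and returns named fields. `Rates` and `rates_between` are hypothetical names and not part of the plugin; the dict keys mirror the database columns used in the diff.

```python
from collections import namedtuple

# Hypothetical container (not part of the plugin): named fields instead of a bare 6-tuple.
Rates = namedtuple(
    'Rates', 'timediff read_bps write_bps bandwidth read_iops write_iops'
)


def rates_between(newer, older):
    """Return per-second byte and operation rates between two counter samples."""
    timediff = abs(newer['timestamp'] - older['timestamp'])  # in seconds
    if timediff == 0:
        return Rates(0, 0, 0, 0, 0, 0)

    def per_second(key):
        # Same arithmetic as in the diff: absolute counter delta over elapsed time.
        return abs(int(float(newer[key] - older[key]) / timediff))

    read_bps = per_second('read_bytes')
    write_bps = per_second('write_bytes')
    return Rates(
        timediff,
        read_bps,
        write_bps,
        read_bps + write_bps,
        per_second('read_count'),
        per_second('write_count'),
    )


# Made-up sample counters, ten seconds apart.
newer = {'timestamp': 110, 'read_bytes': 2000, 'write_bytes': 5000,
         'read_count': 20, 'write_count': 900}
older = {'timestamp': 100, 'read_bytes': 1000, 'write_bytes': 1000,
         'read_count': 10, 'write_count': 10}

print(rates_between(newer, older))
```

Named fields are only one possible design. The plugin itself keeps positional returns, which is why the argument order at both call sites matters.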