Skip to content

Commit

Permalink
[auto-techsupport] support techsupport generation on potential memory…
Browse files Browse the repository at this point in the history
… leaks (#939)

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
  • Loading branch information
stepanblyschak authored Mar 20, 2022
1 parent 31b8462 commit 61e6c87
Showing 1 changed file with 125 additions and 32 deletions.
157 changes: 125 additions & 32 deletions doc/auto_techsupport_and_coredump_mgmt.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,17 +8,17 @@
* [2. High Level Requirements](#2-high-level-requirements)
* [3. Core Dump Generation in SONiC](#3-core-dump-generation-in-sonic)
* [4. Schema Additions](#4-schema-additions)
* [5. CLI Enhancements](#5-cli-enhancements)
* [6. Design](#6-design)
* [6. CLI Enhancements](#5-cli-enhancements)
* [7. Design](#6-design)
* [6.1 Modifications to coredump-compress script](#61-Modifications-to-coredump-compress-script)
* [6.2 coredump_gen_handler script](#62-coredump_gen_handler-script)
* [6.3 Modifications to generate_dump script](#64-Modifications-to-generate-dump-script)
* [6.4 techsupport_cleanup script](#65-techsupport_cleanup-script)
* [6.5 Warmboot consideration](#65-Warmboot-consideration)
* [6.6 MultiAsic consideration](#66-MultiAsic-consideration)
* [6.7 Design choices for max-techsupport-limit & max-techsupport-limit arguments](#67-Design-choices-for-max-core-limit-&-max-techsupport-limit-arguments)
* [7. Test Plan](#7-Test-Plan)
* [8. SONiC-to-SONiC Upgrade Considerations](#8-SONiC-to-SONiC-Upgrade-Considerations)
* [8. Test Plan](#7-Test-Plan)
* [9. SONiC-to-SONiC Upgrade Considerations](#8-SONiC-to-SONiC-Upgrade-Considerations)


### Revision
Expand All @@ -27,6 +27,7 @@
| 1.0 | 06/22/2021 | Vivek Reddy Karri | Auto Invocation of Techsupport, triggered by a core dump |
| 1.1 | TBD | Vivek Reddy Karri | Add the capability to Register/Deregister app extension to AUTO_TECHSUPPORT_FEATURE table |
| 2.0 | TBD | Vivek Reddy Karri | Extending Support for Kernel Dumps |
| 3.0 | 02/2022 | Stepan Blyshchak | Extending Support for memory usage threshold crossed |

## About this Manual
This document describes the details of the system which facilitates the auto techsupport invocation support in SONiC. The auto invocation is triggered when any process inside the docker crashes and a core dump is generated.
Expand All @@ -36,13 +37,20 @@ Currently, techsupport is run by invoking `show techsupport` either by orchestra

However if the techsupport invocation can be made event-driven based on core dump generation, that would definitely improve the debuggability. That is the overall idea behind this HLD. All the high-level requirements are summarized in the next section

Another use case is to gather more information about the system in case there is a memory usage threshold crossed.
SONiC dump generated after system reboots due to out of memory is not enough for debugging the issue
as all the information about processes and their mem usage, smaps (/proc/PID/smaps) is lost.
Once the system detects abnormal memory usage SONiC dump is generated automatically.

## 2. High Level Requirements
### Global Scope
* Techsupport invocation should also be made event-driven based on core dump generation.
* This is only applicable for the processes running inside the dockers. Does not apply for other processes.
* init_cfg.json will be enhanced to include the "CONFIG" required for this feature (described in section 4) and is enabled by default.
* To provide flexibility, a compile time flag "ENABLE_AUTO_TECH_SUPPORT" should be provided to enable/disable the "CONFIG" for this feature.
* Users should have the abiliity to enable/disable this capability through CLI.
* Techsupport invocation should also be made event-driven based on memory usage threshold crossing.
* The memory usage threshold should be configurable system-wise and per container.

### Configurable Params
* A configurable "rate_limit_interval" should be introduced to limit the number consecutive of techsupport invocations.
Expand All @@ -68,14 +76,59 @@ The naming format and compression is governed by the script `/usr/local/bin/core

Where `<comm>` value in the command name associated with a process. comm value of a running process can be read from `/proc/[pid]/comm` file

## 4. Schema Additions
## 4. Memory usage based techsupport invocation

If the following condition resolves to true:
```
(mem_usage > mem_usage_threshold || ${container}_mem_usage > ${container}_mem_usage_threshold) || mem_free <= mem_free_threshold
```

where ```mem_usage``` is total system memory used (MemAvailable from /proc/meminfo),
```mem_usage_threshold``` configured threshold, (100 - available_mem_threshold),

```${container}_mem_usage``` used memory by $container ("docker stats --no-stream --format {{.MemUsage}}" $container),

```${container}_mem_usage_threshold``` configured memory threshold for $container, (100 - ${container}available_mem_threshold),

```mem_free``` is the total minus mem usage, ```mem_free_threshold``` - mem free threshold.

the SONiC techsupport is automatically generated.

```mem_free_threshold``` is there to invoke dump when there is quite small amount of memory left that is needed to successfully execute "show techsupport". This is going to be 200 MB by default, as at least 80-90 MB takes "show techsupport" execution.

The check will be implemented as a script that is ran by monit periodically:

```
check program mem_checker with path "/usr/bin/mem_threshold_check"
if status != 0 for 10 times within 20 cycles then exec /usr/local/bin/mem_threshold_check_handler"
```

The action is going to be ran only once the mem_check script detects memory usage above threshold.

The "10 times within 20 cycles" part is kept in sync with mem_usage alert from sonic-host monit configuration.
It is possible to make those values configurable however, only the threshold value is considered to be configurable.

The rate limit as well as techsupport maximum limit is applicable to techsupport generated by memory check.

#### 202106 and older

To support thechsupport generation on memory leaks a simple rule to monit is added:

```
check system $HOST
if memory usage > 90% for 10 times within 20 cycles then exec /usr/bin/generate_dump
```

## 5. Schema Additions

### Config DB

#### AUTO_TECHSUPPORT Table
```
key = "AUTO_TECHSUPPORT|global"
state = "enabled" / "disabled" ; Enable this to make the Techsupport Invocation event driven based on core-dump generation
available_mem_threshold = 1*2DIGIT ; Memory threshold; 0 to disable techsupport invocation on mem leak.
min_available_mem = 1*5DIGIT ; Minimum free memory amount in MB when techsupport will be executed.
rate_limit_interval = 1*5DIGIT ; Minimum Time in seconds, between two successive techsupport invocations.
Manual Invocations will be considered as well in the calculation.
Configure 0 to explicitly disable
Expand All @@ -99,6 +152,7 @@ since = 1*32VCHAR; ; This limits the auto-invoke
```
key = feature name
state = "enabled" / "disabled" ; Enable auto techsupport invocation on the critical processes running inside this feature
available_mem_threshold = 1*2DIGIT ; Memory threshold; 0 to disable techsupport invocation on mem leak in this container.
rate_limit_interval = 1*5DIGIT ; Rate limit interval for the corresponding feature. Configure 0 to explicitly disable
```

Expand Down Expand Up @@ -143,6 +197,18 @@ module sonic-auto_techsupport {
type stypes:admin_mode;
}
leaf available_mem_threshold {
description "Enable techsupport invocation on available memory threshold crossing; 0 to disable"
type decimal-repr;
default 10.0;
}
leaf min_available_mem {
description "Minimum free memory amount in MB when techsupport will be executed"
type uint32;
default 200;
}
leaf rate_limit_interval {
description "Minimum time in seconds between two successive techsupport invocations. Configure 0 to explicitly disable";
type uint16;
Expand Down Expand Up @@ -206,6 +272,12 @@ module sonic-auto_techsupport {
type stypes:admin_mode;
}
leaf available_mem_threshold {
description "Enable techsupport invocation on available memory threshold crossing; 0 to disable"
type decimal-repr;
default 10.0;
}
leaf rate_limit_interval {
description "Rate limit interval for the corresponding feature. Configure 0 to explicitly disable";
type uint16;
Expand All @@ -226,27 +298,43 @@ module sonic-auto_techsupport {
#### AUTO_TECHSUPPORT_DUMP_INFO Table
```
key = Techsupport Dump Name
core_dump = 1*64VCHAR ; Core Dump Name
timestamp = 1*12DIGIT ; epoch of this record creation
container_name = 1*64VCHAR ; Container in which the process crashed
event_type = "core" / "memory" ; Type of event caused techsupport invocation
core_dump = 1*64VCHAR ; Core Dump Name
timestamp = 1*12DIGIT ; epoch of this record creation
container_name = 1*64VCHAR ; Container in which the process crashed/mem threshold. Unset when triggered from host.
```

Eg:

```
hgetall "AUTO_TECHSUPPORT_DUMP_INFO|sonic_dump_sonic_20210412_223645"
1) "core_dump"
2) "orchagent.1599047232.39.core"
1) "event_type"
2) "core"
2) "core_dump"
3) "orchagent.1599047232.39.core"
4) "timestamp"
5) "1599047233"
6) "container_name"
7) "swss"
```

```
hgetall "AUTO_TECHSUPPORT_DUMP_INFO|sonic_dump_sonic_20210412_223123"
1) "event_type"
2) "memory"
3) "timestamp"
4) "1599047233"
4) "1612045251"
5) "container_name"
6) "swss"
```


## 5. CLI Enhancements.
## 6. CLI Enhancements.

### config cli
```
config auto-techsupport global state <enabled/disabled>
config auto-techsupport global available-mem-threshold <float upto two decimal places>
config auto-techsupport global min-available-mem <float upto two decimal places>
config auto-techsupport global rate-limit-interval <uint16>
config auto-techsupport global max-techsupport-limit <float upto two decimal places>
config auto-techsupport global max-core-limit <float upto two decimal places>
Expand All @@ -261,26 +349,26 @@ config auto-techsupport-feature delete restapi

```
admin@sonic:~$ show auto-techsupport global
STATE RATE LIMIT INTERVAL (sec) MAX TECHSUPPORT LIMIT (%) MAX CORE SIZE (%) SINCE
------- --------------------------- -------------------------- ------------------ ----------
enabled 180 10.0 5.0 2 days ago
STATE RATE LIMIT INTERVAL (sec) MAX TECHSUPPORT LIMIT (%) MAX CORE SIZE (%) MEM THRESHOLD (%) MEM THRESHOLD (%) SINCE
------- --------------------------- -------------------------- ------------------ ------------------ ------------------- ---------
enabled 180 10.0 5.0 10.0 10.0 2 days ago
admin@sonic:~$ show auto-techsupport-feature
FEATURE NAME STATE RATE LIMIT INTERVAL (sec)
-------------- -------- --------------------------
bgp enabled 600
database enabled 600
dhcp_relay enabled 600
lldp enabled 600
macsec enabled 600
mgmt-framework enabled 600
nat enabled 600
pmon enabled 600
radv enabled 600
restapi disabled 800
sflow enabled 600
snmp enabled 600
swss disabled 800
FEATURE NAME STATE MEM THRESHOLD (%) RATE LIMIT INTERVAL (sec)
-------------- -------- ------------------ --------------------------
bgp enabled 10.0 600
database enabled 10.0 600
dhcp_relay enabled 10.0 600
lldp enabled 10.0 600
macsec enabled 10.0 600
mgmt-framework enabled 10.0 600
nat enabled 10.0 600
pmon enabled 10.0 600
radv enabled 10.0 600
restapi disabled 10.0 800
sflow enabled 10.0 600
snmp enabled 10.0 600
swss disabled 10.0 800
admin@sonic:~$ show auto-techsupport history
Expand Down Expand Up @@ -374,7 +462,7 @@ Enhance the existing techsupport sonic-mgmt test with the following cases.
| 2 | Check if the techsupport cleanup is working as expected |
| 3 | Check if the global rate-& & per-process rate-limit-interval is working as expected |
| 4 | Check if the core-dump cleanup is working as expected |

| 5 | Check if the core-dump generated when reaching memory threshold |
## 8. SONiC-to-SONiC Upgrade Considerations

The default config required for auto_techsupport is present in the init_cfg.json. Therefore, when a clean installation of SONiC is performed, the configuration is found in the config DB and the feature is active.
Expand All @@ -391,6 +479,7 @@ Load this Example config provided below to enable the feature. Each of the field
"rate_limit_interval": "180",
"max_techsupport_limit": "10.0",
"max_core_limit": "5.0",
"available_mem_threashold": "10.0",
"since": "2 days ago"
}
},
Expand Down Expand Up @@ -461,4 +550,8 @@ Load this Example config provided below to enable the feature. Each of the field
}
}
}
```

# Open question

1. Is 10 % free memory/90 % used memory threshold a reasonable default?

0 comments on commit 61e6c87

Please sign in to comment.