smf: sled-agent should never go into maintenance #6626

jclulow · 2024-09-20T21:00:32Z

Under some conditions, sled-agent will hit a critical error and crash. SMF will restart the agent immediately, but only if it does not fail too often. It's relatively easy for a fault that does not clear out within a few seconds to cause sled-agent to restart several times in quick succession and then be moved to the terminal maintenance state, where operator intervention is required to revive it. The agent is a critical component of sled management, and is likely where our future ability to debug a misbehaving sled may live, so it's important that it never come to a discretionary rest.

There are two immediate options to improve this state of affairs:

1. The sled-agent service is presently using the `contract` duration, so the `critical_failure_count` and `critical_failure_period` thresholds are used to determine if a failing instance is eligible for another restart or if it should go into maintenance. From svc.startd(8):
```
Additionally, svc.startd managed services can define the optional
properties listed below in the startd property group.

startd/critical_failure_count
startd/critical_failure_period

   The critical_failure_count and critical_failure_period properties
   together specify the maximum number of service failures allowed in a
   given time interval before svc.startd transitions the service to
   maintenance.  If the number of failures exceeds
   critical_failure_count in any period of critical_failure_period
   seconds, svc.startd will transition the service to maintenance.
```
These properties can be set in the service manifest. If an appropriately large value is chosen for `critical_failure_count` (e.g., 1000000) and an appropriately small value is chosen for `critical_failure_period` (e.g., 1), then the service will essentially never go into the maintenance state.
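As a rough sketch of what that could look like in a manifest: the fragment below assumes the thresholds are integer properties in a `startd` framework property group, and uses the illustrative values from above. It has not been tested against the real sled-agent manifest, and the surrounding service definition is omitted.

```xml
<!-- Sketch only: assumes a 'startd' framework property group with integer
     properties; the values are the illustrative ones from the discussion. -->
<property_group name='startd' type='framework'>
    <!-- keep the existing contract model -->
    <propval name='duration' type='astring' value='contract'/>
    <!-- allow up to 1000000 failures... -->
    <propval name='critical_failure_count' type='integer' value='1000000'/>
    <!-- ...in any 1 second window before maintenance is considered -->
    <propval name='critical_failure_period' type='integer' value='1'/>
</property_group>
```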

This option is probably the most minimal change, but it will require some investigation and testing. I can't recall, for instance, if the throttling applied to wait model services when they are constantly restarting also applies to contract model services which have had adjustments to their critical failure thresholds. It's probably not ideal to have SMF spinning in a tight loop trying to start a process that will be failing for the foreseeable future. If this behaviour is not quite right, we should file an OS ticket and get it fixed.

2. The sled-agent service could be converted from a contract to a wait model service. This is arguably the more correct change, as the process does not daemonise itself today. We're also using the `ctrun` workaround already to attempt to get wait model semantics (rather than empty contract semantics); this workaround has the additional effect of mitigating 13511 ("svc.startd should terminate orphaned contracts for wait model services").

If we switch to the wait model (a duration of `child`, as per svc.startd(8)), then there is by definition no maintenance state to enter. The process will always be restarted, with a governor that prevents a restart loop at more than 1 Hz.
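A minimal sketch of the manifest change for the wait model, assuming the duration is set through a `startd` framework property group. The rest of the service definition, including the start method, is omitted and would need to be checked separately.

```xml
<!-- Sketch only: switches the service to the wait model. -->
<property_group name='startd' type='framework'>
    <!-- 'child' is the wait model duration described in svc.startd(8) -->
    <propval name='duration' type='astring' value='child'/>
</property_group>
```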

I originally anticipated suggesting the first option, but in writing all this out I now believe the second option is actually the one we should probably use.

karencfv added the Sled Agent Related to the Per-Sled Configuration and Management label Sep 22, 2024
