smf: sled-agent should never go into maintenance #6626

Open
jclulow opened this issue Sep 20, 2024 · 0 comments
Labels
Sled Agent Related to the Per-Sled Configuration and Management

Comments

@jclulow
Collaborator

jclulow commented Sep 20, 2024

Under some conditions, sled-agent will hit a critical error and crash. SMF will restart the agent immediately, but only if it does not fail too often. It's relatively easy for a fault that does not clear out within a few seconds to cause sled-agent to restart several times in quick succession and then be moved to the terminal maintenance state, where operator intervention is required to revive it. The agent is a critical component of sled management, and is likely where our future ability to debug a misbehaving sled may live, so it's important that it never come to a discretionary rest.

There are two immediate options to improve this state of affairs:

  • The sled-agent service is presently using the contract duration, so the critical_failure_count and critical_failure_period thresholds are used to determine if a failing instance is eligible for another restart or if it should go into maintenance. From svc.startd(8):

    Additionally, svc.startd managed services can define the optional
    properties listed below in the startd property group.
    
    startd/critical_failure_count
    startd/critical_failure_period
    
       The critical_failure_count and critical_failure_period properties
       together specify the maximum number of service failures allowed in a
       given time interval before svc.startd transitions the service to
       maintenance.  If the number of failures exceeds
       critical_failure_count in any period of critical_failure_period
       seconds, svc.startd will transition the service to maintenance.
    

    These properties can be set in the service manifest. If an appropriately large value is chosen for critical_failure_count (e.g., 1000000) and an appropriately small value is chosen for critical_failure_period (e.g., 1), then the service will essentially never go into the maintenance state.

    This option is probably the most minimal change, but it will require some investigation and testing. I can't recall, for instance, if the throttling applied to wait model services when they are constantly restarting also applies to contract model services which have had adjustments to their critical failure thresholds. It's probably not ideal to have SMF spinning in a tight loop trying to start a process that will be failing for the foreseeable future. If this behaviour is not quite right, we should file an OS ticket and get it fixed.

  • The sled-agent service could be converted from a contract to a wait model service. This is arguably the more correct change, as the process does not daemonise itself today. We're also using the ctrun workaround already to attempt to get wait model semantics (rather than empty contract semantics); this workaround has the additional effect of mitigating bug 13511, "svc.startd should terminate orphaned contracts for wait model services".

    If we switch to the wait model (a duration of child, as per svc.startd(8)), then there is by definition no maintenance state to enter. The process will always be restarted, with a governor that keeps the restart loop from running faster than 1 Hz.
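For reference, the first option might look roughly like the following startd property group in the service manifest. This is a sketch, not the actual sled-agent manifest: the property names and example values come from the text above, while the framework group type and the integer value types are assumptions that should be checked against svc.startd(8) and the existing manifest.

```xml
<!-- Sketch only: not copied from the real sled-agent manifest. -->
<property_group name='startd' type='framework'>
    <!-- Contract model, which the service uses today. -->
    <propval name='duration' type='astring' value='contract' />
    <!-- The 'integer' type is an assumption; values are the examples above. -->
    <propval name='critical_failure_count' type='integer' value='1000000' />
    <propval name='critical_failure_period' type='integer' value='1' />
</property_group>
```

With these values, the service would need more than 1000000 failures within a single second to reach maintenance, which is why it would essentially never get there.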

I originally anticipated suggesting the first option, but in writing all this out I now believe the second option is actually the one we should probably use.
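If we do go with the second option, the manifest change is likely small. A minimal sketch, assuming the startd property group has the usual framework type and that nothing else in the manifest depends on contract semantics:

```xml
<!-- Sketch only: not copied from the real sled-agent manifest. -->
<property_group name='startd' type='framework'>
    <!-- 'child' is the wait model duration described in svc.startd(8). -->
    <propval name='duration' type='astring' value='child' />
</property_group>
```

Since 13511 concerns wait model services, the ctrun workaround would presumably need to stay in place after this change.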

@karencfv karencfv added the Sled Agent Related to the Per-Sled Configuration and Management label Sep 22, 2024