Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

windows restart service #1681

Merged
merged 36 commits into from
Jul 15, 2024
Merged

windows restart service #1681

merged 36 commits into from
Jul 15, 2024

Conversation

zackattack01
Copy link
Contributor

@zackattack01 zackattack01 commented Apr 16, 2024

Here is a basic sequence diagram displaying the enable path for the windows watchdog service. The launcher_watchdog_enabled control flag will trigger the initial configuration and installation, and removal of the flag will trigger removal of the service.

sequenceDiagram
    participant LauncherKolideK2Svc
    Note right of LauncherKolideK2Svc: ./launcher.exe svc ...
    create participant WindowsServiceManager
    LauncherKolideK2Svc->>WindowsServiceManager: if launcher_watchdog_enabled
    create participant LauncherKolideWatchdogSvc
    WindowsServiceManager->>LauncherKolideWatchdogSvc: have we installed the watchdog?
    Note left of LauncherKolideWatchdogSvc: ./launcher.exe watchdog

    alt yes the service already exists
        LauncherKolideK2Svc->>LauncherKolideWatchdogSvc: Restart to ensure latest
    else no the service does not exist
        LauncherKolideK2Svc->>WindowsServiceManager: 1 - create, configure, etc
        LauncherKolideK2Svc->>LauncherKolideWatchdogSvc: 2 - Start
        activate LauncherKolideWatchdogSvc
    end

    loop every n minutes
        LauncherKolideWatchdogSvc->>WindowsServiceManager: Query LauncherKolideK2Svc status
        LauncherKolideWatchdogSvc->>LauncherKolideK2Svc: Start if Stopped
    end
Loading
  • The restart functionality is currently limited to detecting a stopped state, but the idea here is to lay out the foundation for more advanced healthchecking.
  • The watchdog service itself runs as a separate invocation of launcher, writing all logs to sqlite. The main invocation of launcher runs a watchdog controller, which responds to the launcher_watchdog_enabled flag, and publishes all sqlite logs to debug.json.
  • I am very open to any suggestions for better test coverage here. I had started to add more but the vast majority of the logic is dependent specifically on the windows service manager calls, and the amount of stubbing required significantly reduced the value. I suspect adding tests to CI test suite and being able to look across logs after the fact would be a much better approach, I'd like to explore that soon depending on where we land with this

@zackattack01 zackattack01 force-pushed the zack/windows_watchdog branch 2 times, most recently from 1fca6bc to e530c23 Compare May 17, 2024 19:29
@zackattack01 zackattack01 force-pushed the zack/windows_watchdog branch 2 times, most recently from c086936 to faefa20 Compare June 4, 2024 21:31
@zackattack01 zackattack01 marked this pull request as ready for review June 5, 2024 14:20
cmd/launcher/watchdog/controller_windows.go Outdated Show resolved Hide resolved
cmd/launcher/watchdog/controller_windows.go Outdated Show resolved Hide resolved
cmd/launcher/watchdog/controller_windows.go Outdated Show resolved Hide resolved
Copy link
Contributor

@directionless directionless left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initial comments on the windows stuff. Haven't dug into the log or sqlite stuff

Comment on lines 24 to 25
launcherWatchdogServiceName string = `LauncherKolideWatchdogSvc`
launcherServiceName string = `LauncherKolideK2Svc`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are k2 specific names, as such, I would probably move this all into ee

Comment on lines 88 to 91
// do nothing if watchdog is not enabled
if !wc.knapsack.LauncherWatchdogEnabled() {
return
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oo. This is interesting. What happens if there are pending logs, and then watchdog is disabled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was unsure about the right call here- i think if we're expecting that most windows devices will have this enabled someday then removing this enabled check is the right call. I had added because the case of stuck logs seemed extremely rare (and logs would still be recoverable) vs most devices having this disabled and checking logs every 5 minutes for no reason

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is a fine choice. I guess if we get stale logs, hopefully we'll notice the timestamp

cmd/launcher/watchdog/controller_windows.go Outdated Show resolved Hide resolved
cmd/launcher/watchdog/controller_windows.go Outdated Show resolved Hide resolved
cmd/launcher/watchdog/controller_windows.go Outdated Show resolved Hide resolved
}

func runLauncherWatchdogService(ctx context.Context, w *winWatchdogSvc) error {
ticker := time.NewTicker(1 * time.Minute)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this frequency will be okay. Or if there's going to be some memory leak. Well, it's behind a feature flag, so we'll get to find out!

@@ -0,0 +1,30 @@
### Watchdog Service
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could call this file README.md

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or move to docs/architecture?

}
}

func (w *winWatchdogSvc) checkLauncherStatus(ctx context.Context) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to skip this if we're in powersave mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add a link to the docs I'm working on at the end but the short answer is not easily- unless we are willing to:

  • add logic into here to subscribe to power events from this process as well (similar to current launcher functionality)
  • add logic to register directly for power events as a handler

I would love to be wrong here but there does not appear to be any reliable mechanism to gather the current state in real time (without significantly complicating the watchdog)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this powersave issue is the one we should focus on. The PR generally looks good, and the small details can drag out forever, but if we don't have an answer for powersave we probably can't ship.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, updated with logic to integrate our existing power event watcher code 👍

@@ -19,20 +19,23 @@ import (
_ "modernc.org/sqlite"
)

type storeName int
type StoreName int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be exported?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exported for use in the pkg/log/sqlitelogger package, looks like

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm. Feels a little weird. I wonder if we should be create stores and passing them around. But okay...

// TimestampedIteratorDeleterAppenderCloser is an interface to support the storage and retrieval of
// sets of timestamped values. This can be used where a strict key/value interface may not suffice,
// e.g. for writing logs or historical records to sqlite
type TimestampedIteratorDeleterAppenderCloser interface {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat!

if err := wc.logPublisher.ForEach(func(rowid, timestamp int64, v []byte) error {
logRecord := make(map[string]any)
if err := json.Unmarshal(v, &logRecord); err != nil {
wc.slogger.Log(ctx, slog.LevelError, "failed to unmarshal sqlite log", "log", string(v))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we include the err in this log also?

logRecord := make(map[string]any)
if err := json.Unmarshal(v, &logRecord); err != nil {
wc.slogger.Log(ctx, slog.LevelError, "failed to unmarshal sqlite log", "log", string(v))
logsToDelete = append(logsToDelete, rowid)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we can move this line to the top of the ForEach function, since there's no circumstance where we won't want to delete this log -- instead of having the append here and on line 113?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes thank you! will do

return
}

if err.Error() == serviceDoesNotExistError {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree w/ an if err != nil + a switch or if/elseif inside to check the error type

wc.slogger.Log(ctx, slog.LevelError,
"installing launcher watchdog, unable to collect current executable path",
"err", err,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this log since we're returning an error and the calling function logs it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope I can remove, good call thank you!

},
}

if err = restartService.SetRecoveryActions(recoveryActions, 10800); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this would be more readable as a const with a comment --

const serviceResetPeriod = 10800 // 3 hours in seconds

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because this looks like a fun, unimportant thing to quibble over, I would go with

const serviceResetPeriodSeconds = 3 * 60 * 60 // 3 hours in seconds

@@ -0,0 +1,30 @@
### Watchdog Service
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or move to docs/architecture?

@@ -19,20 +19,23 @@ import (
_ "modernc.org/sqlite"
)

type storeName int
type StoreName int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exported for use in the pkg/log/sqlitelogger package, looks like

"strings"
)

func (s *sqliteStore) getColumns() *sqliteColumns {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function feels like maybe it belongs in keyvalue_store_sqlite.go instead, since it applies to both tables? What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that probably makes more sense, I was torn about adding a new shared file but think we can probably wait on that. i'll move there for now!

ee/agent/storage/sqlite/logstore_sqlite.go Outdated Show resolved Hide resolved
ee/agent/storage/sqlite/logstore_sqlite.go Show resolved Hide resolved
@directionless
Copy link
Contributor

side note -- do not merge until 1.7 is stable

@@ -19,20 +19,23 @@ import (
_ "modernc.org/sqlite"
)

type storeName int
type StoreName int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrm. Feels a little weird. I wonder if we should be create stores and passing them around. But okay...

}
)

func NewSqliteLogWriter(ctx context.Context, rootDirectory string, tableName agentsqlite.StoreName) (*SqliteLogWriter, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This signature feels a little off to me. I wonder if it would be cleaner if it took types.LogStore directly.

@@ -0,0 +1,52 @@
package sqlitelogger
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's anything here related to sqlite or logging? It's implementing an io.WriteCloser. Why not in ee/agent/storage/sqlite/logstore_sqlite.go?

(I mean, it's also pretty cool, but I'm not sure why it's split out)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you mean, I think I originally assumed there would be more to do here. Consolidating should clean up those extra exported types too, I will see how that looks. thank you!

switch s {
case StartupSettingsStore:
return "startup_settings"
case WatchdogLogStore:
return "watchdog_logs"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to push other logs here (say early startup logs) how would we do it? Would we make a new store? Accept it as slightly misnamed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think new store will give us more implementation flexibility and we can still re-use all of the new interfaces and wire that in pretty easily

ee/agent/storage/sqlite/logstore_sqlite.go Show resolved Hide resolved
Copy link
Contributor

@RebeccaMahany RebeccaMahany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made it thru everything but ee/watchdog/watchdog_service_windows.go, will come back for that tomorrow AM!

cmd/launcher/launcher.go Outdated Show resolved Hide resolved
ee/watchdog/controller_windows.go Outdated Show resolved Hide resolved
RebeccaMahany
RebeccaMahany previously approved these changes Jul 2, 2024
ee/watchdog/watchdog_service_windows.go Outdated Show resolved Hide resolved
ee/watchdog/watchdog_service_windows.go Show resolved Hide resolved
Co-authored-by: Rebecca Mahany-Horton <rebeccamahany@gmail.com>
@zackattack01 zackattack01 added this pull request to the merge queue Jul 15, 2024
Merged via the queue into main with commit fe97939 Jul 15, 2024
29 checks passed
@zackattack01 zackattack01 deleted the zack/windows_watchdog branch July 15, 2024 13:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants