windows restart service #1681

Open
wants to merge 30 commits into base: main

Conversation

Contributor

@zackattack01 zackattack01 commented Apr 16, 2024

Here is a basic sequence diagram displaying the enable path for the windows watchdog service. The launcher_watchdog_enabled control flag will trigger the initial configuration and installation, and removal of the flag will trigger removal of the service.

sequenceDiagram
    participant LauncherKolideK2Svc
    Note right of LauncherKolideK2Svc: ./launcher.exe svc ...
    create participant WindowsServiceManager
    LauncherKolideK2Svc->>WindowsServiceManager: if launcher_watchdog_enabled
    create participant LauncherKolideWatchdogSvc
    WindowsServiceManager->>LauncherKolideWatchdogSvc: have we installed the watchdog?
    Note left of LauncherKolideWatchdogSvc: ./launcher.exe watchdog

    alt yes the service already exists
        LauncherKolideK2Svc->>LauncherKolideWatchdogSvc: Restart to ensure latest
    else no the service does not exist
        LauncherKolideK2Svc->>WindowsServiceManager: 1 - create, configure, etc
        LauncherKolideK2Svc->>LauncherKolideWatchdogSvc: 2 - Start
        activate LauncherKolideWatchdogSvc
    end

    loop every n minutes
        LauncherKolideWatchdogSvc->>WindowsServiceManager: Query LauncherKolideK2Svc status
        LauncherKolideWatchdogSvc->>LauncherKolideK2Svc: Start if Stopped
    end
  • The restart functionality is currently limited to detecting a stopped state, but the idea here is to lay out the foundation for more advanced healthchecking.
  • The watchdog service itself runs as a separate invocation of launcher, writing all logs to sqlite. The main invocation of launcher runs a watchdog controller, which responds to the launcher_watchdog_enabled flag, and publishes all sqlite logs to debug.json.
  • I am very open to suggestions for better test coverage here. I had started to add more tests, but the vast majority of the logic depends on the Windows service manager calls, and the amount of stubbing required significantly reduced their value. I suspect that adding tests to the CI test suite and being able to look across logs after the fact would be a much better approach. I'd like to explore that soon, depending on where we land with this.
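
The polling loop in the diagram above can be sketched roughly as follows. This is a minimal illustration and not the PR's implementation: `serviceController`, `checkAndRestart`, `runWatchdog`, and `fakeService` are hypothetical names, and the real code queries the Windows service manager rather than an in-memory fake.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// serviceState is a hypothetical, simplified stand-in for the states
// reported by the Windows service manager.
type serviceState int

const (
	stateStopped serviceState = iota
	stateRunning
)

// serviceController is a hypothetical abstraction over the service manager
// calls (query status, start). It is not part of the PR.
type serviceController interface {
	Status() (serviceState, error)
	Start() error
}

// checkAndRestart starts the service only when it reports a stopped state.
// It returns true when a start was attempted and succeeded.
func checkAndRestart(svc serviceController) (bool, error) {
	state, err := svc.Status()
	if err != nil {
		return false, fmt.Errorf("querying service status: %w", err)
	}
	if state != stateStopped {
		return false, nil
	}
	if err := svc.Start(); err != nil {
		return false, fmt.Errorf("starting service: %w", err)
	}
	return true, nil
}

// runWatchdog performs one check per tick until ctx is cancelled.
func runWatchdog(ctx context.Context, svc serviceController, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop() // release the ticker when the loop exits
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if _, err := checkAndRestart(svc); err != nil {
				fmt.Println("watchdog check failed:", err)
			}
		}
	}
}

// fakeService is an in-memory controller used only to exercise the loop.
type fakeService struct {
	state  serviceState
	starts int
}

func (f *fakeService) Status() (serviceState, error) { return f.state, nil }

func (f *fakeService) Start() error {
	f.state = stateRunning
	f.starts++
	return nil
}

func main() {
	svc := &fakeService{state: stateStopped}
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	runWatchdog(ctx, svc, 10*time.Millisecond)
	fmt.Println("start attempts:", svc.starts)
}
```

Keeping the status check in its own function, separate from the ticker loop, is one way to make the restart decision testable without the service manager, which may help with the coverage concern raised in the last bullet.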

@zackattack01 zackattack01 marked this pull request as ready for review June 5, 2024 14:20
Contributor

@directionless directionless left a comment

Initial comments on the windows stuff. Haven't dug into the log or sqlite stuff

Comment on lines 24 to 25
launcherWatchdogServiceName string = `LauncherKolideWatchdogSvc`
launcherServiceName string = `LauncherKolideK2Svc`
Contributor

These are k2 specific names, as such, I would probably move this all into ee

Comment on lines 88 to 91
// do nothing if watchdog is not enabled
if !wc.knapsack.LauncherWatchdogEnabled() {
return
}
Contributor

oo. This is interesting. What happens if there are pending logs, and then watchdog is disabled?

Contributor Author

I was unsure about the right call here. If we expect most Windows devices to have this enabled someday, then removing this enabled check is the right call. I added it because the case of stuck logs seemed extremely rare (and the logs would still be recoverable), whereas without the check most devices would have the watchdog disabled and still check for logs every 5 minutes for no reason.

Contributor

Yeah, this is a fine choice. I guess if we get stale logs, hopefully we'll notice the timestamp

}

func runLauncherWatchdogService(ctx context.Context, w *winWatchdogSvc) error {
ticker := time.NewTicker(1 * time.Minute)
Contributor

I wonder if this frequency will be okay. Or if there's going to be some memory leak. Well, it's behind a feature flag, so we'll get to find out!

@@ -0,0 +1,30 @@
### Watchdog Service
Contributor

You could call this file README.md

Contributor Author

will do!

Contributor

Or move to docs/architecture?

}
}

func (w *winWatchdogSvc) checkLauncherStatus(ctx context.Context) error {
Contributor

Is there a way to skip this if we're in powersave mode?

Contributor Author

I will add a link to the docs I'm working on at the end, but the short answer is: not easily, unless we are willing to do one of the following:

  • add logic into here to subscribe to power events from this process as well (similar to current launcher functionality)
  • add logic to register directly for power events as a handler

I would love to be wrong here but there does not appear to be any reliable mechanism to gather the current state in real time (without significantly complicating the watchdog)

@@ -19,20 +19,23 @@ import (
_ "modernc.org/sqlite"
)

-type storeName int
+type StoreName int
Contributor

This needs to be exported?

Contributor

Exported for use in the pkg/log/sqlitelogger package, looks like

Contributor

Hrm. Feels a little weird. I wonder if we should be creating stores and passing them around. But okay...

// TimestampedIteratorDeleterAppenderCloser is an interface to support the storage and retrieval of
// sets of timestamped values. This can be used where a strict key/value interface may not suffice,
// e.g. for writing logs or historical records to sqlite
type TimestampedIteratorDeleterAppenderCloser interface {
Contributor

Neat!

zackattack01 and others added 2 commits June 6, 2024 09:37
PR feedback: less noisy log levels, comment updates and style fixes

Co-authored-by: seph <[email protected]>
if err := wc.logPublisher.ForEach(func(rowid, timestamp int64, v []byte) error {
logRecord := make(map[string]any)
if err := json.Unmarshal(v, &logRecord); err != nil {
wc.slogger.Log(ctx, slog.LevelError, "failed to unmarshal sqlite log", "log", string(v))
Contributor

Can we include the err in this log also?

logRecord := make(map[string]any)
if err := json.Unmarshal(v, &logRecord); err != nil {
wc.slogger.Log(ctx, slog.LevelError, "failed to unmarshal sqlite log", "log", string(v))
logsToDelete = append(logsToDelete, rowid)
Contributor

It looks like we can move this line to the top of the ForEach function, since there's no circumstance where we won't want to delete this log -- instead of having the append here and on line 113?

Contributor Author

yes thank you! will do
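
The two suggestions in this thread (include `err` in the log, and mark the row for deletion once at the top of the callback) might look roughly like the sketch below. It assumes the callback signature shown in the diff; `logRow`, `forEach`, and `collectLogs` are hypothetical stand-ins for the store and the publisher, and returning `nil` after a failed unmarshal (so iteration continues) is also an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// logRow is a hypothetical in-memory stand-in for one stored log row.
type logRow struct {
	rowid, timestamp int64
	value            []byte
}

// forEach mimics the callback shape seen in the diff: (rowid, timestamp, value).
func forEach(rows []logRow, fn func(rowid, timestamp int64, v []byte) error) error {
	for _, r := range rows {
		if err := fn(r.rowid, r.timestamp, r.value); err != nil {
			return err
		}
	}
	return nil
}

// collectLogs returns the decoded records plus the rowids to delete afterwards.
func collectLogs(rows []logRow) ([]map[string]any, []int64, error) {
	var records []map[string]any
	var logsToDelete []int64

	err := forEach(rows, func(rowid, timestamp int64, v []byte) error {
		// Every visited row is deleted afterwards, whether or not it parses,
		// so it is marked once here instead of in each branch.
		logsToDelete = append(logsToDelete, rowid)

		logRecord := make(map[string]any)
		if err := json.Unmarshal(v, &logRecord); err != nil {
			// Include err so the reason for the failure is visible.
			fmt.Println("failed to unmarshal sqlite log", "log", string(v), "err", err)
			return nil // assumption: skip the bad row and keep iterating
		}
		records = append(records, logRecord)
		return nil
	})
	return records, logsToDelete, err
}

func main() {
	rows := []logRow{
		{rowid: 1, timestamp: 10, value: []byte(`{"msg":"watchdog started"}`)},
		{rowid: 2, timestamp: 11, value: []byte("not json")},
	}
	records, toDelete, err := collectLogs(rows)
	fmt.Println(len(records), len(toDelete), err)
}
```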

return
}

if err.Error() == serviceDoesNotExistError {
Contributor

I agree w/ an if err != nil + a switch or if/elseif inside to check the error type
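
As a rough sketch of that shape: an outer `if err != nil` with a `switch` on the error kind inside, using `errors.Is` rather than comparing `err.Error()` strings. The sentinel `errServiceDoesNotExist`, the `action` type, and `nextAction` are hypothetical. On Windows, `windows.ERROR_SERVICE_DOES_NOT_EXIST` from `golang.org/x/sys/windows` may be usable as the sentinel, but that is an assumption and not something this thread confirms.

```go
package main

import (
	"errors"
	"fmt"
)

// errServiceDoesNotExist is a hypothetical sentinel standing in for the
// "service does not exist" error returned when opening the service.
var errServiceDoesNotExist = errors.New("service does not exist")

// errOther is a hypothetical unrelated failure, used only for the demo.
var errOther = errors.New("access denied")

// action names what the controller would do next (hypothetical type).
type action string

const (
	actionRestart     action = "restart existing watchdog"
	actionInstall     action = "install watchdog"
	actionReportError action = "report error"
)

// nextAction decides what to do based on the result of opening the service.
func nextAction(openErr error) action {
	if openErr != nil {
		switch {
		case errors.Is(openErr, errServiceDoesNotExist):
			// Not installed yet: create and configure the service.
			return actionInstall
		default:
			// Any other failure is surfaced to the caller.
			return actionReportError
		}
	}
	// The service already exists: restart it to ensure the latest version.
	return actionRestart
}

func main() {
	wrapped := fmt.Errorf("opening service: %w", errServiceDoesNotExist)
	fmt.Println(nextAction(nil))
	fmt.Println(nextAction(wrapped))
	fmt.Println(nextAction(errOther))
}
```

Unlike a string comparison, `errors.Is` keeps matching when the error has been wrapped with extra context.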

wc.slogger.Log(ctx, slog.LevelError,
"installing launcher watchdog, unable to collect current executable path",
"err", err,
)
Contributor

Do we need this log since we're returning an error and the calling function logs it?

Contributor Author

nope I can remove, good call thank you!

},
}

if err = restartService.SetRecoveryActions(recoveryActions, 10800); err != nil {
Contributor

I think this would be more readable as a const with a comment --

const serviceResetPeriod = 10800 // 3 hours in seconds

Contributor

because this looks like a fun, unimportant thing to quibble over, I would go with

const serviceResetPeriodSeconds = 3 * 60 * 60 // 3 hours in seconds
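
A third variant, offered only as a sketch and not something proposed in the thread, derives the value from a `time.Duration` so the unit is carried by the type. The `uint32` conversion assumes the reset period parameter is an unsigned 32-bit count of seconds.

```go
package main

import (
	"fmt"
	"time"
)

// serviceResetPeriod expresses the reset window as a duration.
const serviceResetPeriod = 3 * time.Hour

// serviceResetPeriodSeconds is the same window in whole seconds, for an API
// that takes seconds (assumed here to be a uint32).
const serviceResetPeriodSeconds = uint32(serviceResetPeriod / time.Second)

func main() {
	fmt.Println(serviceResetPeriodSeconds) // prints 10800
}
```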

"strings"
)

func (s *sqliteStore) getColumns() *sqliteColumns {
Contributor

This function feels like maybe it belongs in keyvalue_store_sqlite.go instead, since it applies to both tables? What do you think?

Contributor Author

yeah that probably makes more sense, I was torn about adding a new shared file but think we can probably wait on that. i'll move there for now!

colInfo := s.getColumns()
if s == nil || s.conn == nil || colInfo == nil {
return errors.New("store is nil")
}
Contributor

I think you also want a restriction on calling this for a store that isn't the logstore, since scanning data into var timestamp int64 on line 91 would fail with a different table schema?

Contributor Author

oh yeah that's a good call! I can probably expand on getColumns to either gate usage here or make it work for either, i'll take a look
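
One way to gate log-only methods by store type is sketched below. It is an illustration under stated assumptions: the `isLogStore` helper, the `errNotLogStore` sentinel, the `tableName` field, and the `iota` values are hypothetical. Only the `StoreName` type, its two constants, and the table name strings come from the diff.

```go
package main

import (
	"errors"
	"fmt"
)

// StoreName mirrors the exported type from the diff; the iota values are assumed.
type StoreName int

const (
	StartupSettingsStore StoreName = iota
	WatchdogLogStore
)

// String returns the table name for the store, as in the diff.
func (s StoreName) String() string {
	switch s {
	case StartupSettingsStore:
		return "startup_settings"
	case WatchdogLogStore:
		return "watchdog_logs"
	}
	return ""
}

// isLogStore is a hypothetical helper reporting whether the table uses the
// timestamped log schema rather than the key/value schema.
func (s StoreName) isLogStore() bool {
	return s == WatchdogLogStore
}

// errNotLogStore is a hypothetical sentinel for misuse of log-only methods.
var errNotLogStore = errors.New("operation only supported for log stores")

// sqliteStore is reduced here to the one field the gate needs.
type sqliteStore struct {
	tableName StoreName
}

// ForEach refuses to run against tables that lack the log schema, so a
// key/value row is never scanned into an int64 timestamp.
func (s *sqliteStore) ForEach(fn func(rowid, timestamp int64, v []byte) error) error {
	if s == nil {
		return errors.New("store is nil")
	}
	if !s.tableName.isLogStore() {
		return fmt.Errorf("%w: %s", errNotLogStore, s.tableName)
	}
	// The real implementation would query the table and call fn per row.
	return nil
}

func main() {
	noop := func(rowid, timestamp int64, v []byte) error { return nil }
	fmt.Println((&sqliteStore{tableName: WatchdogLogStore}).ForEach(noop))
	fmt.Println((&sqliteStore{tableName: StartupSettingsStore}).ForEach(noop))
}
```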

zackattack01 and others added 2 commits June 6, 2024 13:10
…ovements, add checks for non-LogStore table methods
@directionless
Contributor

side note -- do not merge until 1.7 is stable

}
)

func NewSqliteLogWriter(ctx context.Context, rootDirectory string, tableName agentsqlite.StoreName) (*SqliteLogWriter, error) {
Contributor

This signature feels a little off to me. I wonder if it would be cleaner if it took types.LogStore directly.

@@ -0,0 +1,52 @@
package sqlitelogger
Contributor

I don't think there's anything here related to sqlite or logging? It's implementing an io.WriteCloser. Why not in ee/agent/storage/sqlite/logstore_sqlite.go?

(I mean, it's also pretty cool, but I'm not sure why it's split out)
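
For context, an `io.WriteCloser` wrapped around a timestamped append store can be sketched as below. The `timestampedAppender` interface and its `AppendValue` method are assumptions about the store's shape, and `logStoreWriter` and `memStore` are hypothetical names; none of this is taken from the PR's `sqlitelogger` package.

```go
package main

import (
	"fmt"
	"io"
	"log/slog"
	"time"
)

// timestampedAppender is a hypothetical subset of the log store interface:
// just enough to append a value and close the store.
type timestampedAppender interface {
	AppendValue(timestamp int64, value []byte) error
	Close() error
}

// logStoreWriter adapts a store to io.WriteCloser so it can back a logger.
type logStoreWriter struct {
	store timestampedAppender
}

// Compile-time check that the adapter satisfies io.WriteCloser.
var _ io.WriteCloser = (*logStoreWriter)(nil)

// Write appends one timestamped entry per call.
func (w *logStoreWriter) Write(p []byte) (int, error) {
	// Copy p, because io.Writer implementations must not retain the slice.
	entry := append([]byte(nil), p...)
	if err := w.store.AppendValue(time.Now().Unix(), entry); err != nil {
		return 0, err
	}
	return len(p), nil
}

// Close closes the underlying store.
func (w *logStoreWriter) Close() error { return w.store.Close() }

// memStore is an in-memory store used only for this demonstration.
type memStore struct {
	timestamps []int64
	values     [][]byte
	closed     bool
}

func (m *memStore) AppendValue(timestamp int64, value []byte) error {
	m.timestamps = append(m.timestamps, timestamp)
	m.values = append(m.values, value)
	return nil
}

func (m *memStore) Close() error {
	m.closed = true
	return nil
}

func main() {
	store := &memStore{}
	w := &logStoreWriter{store: store}
	defer w.Close()

	// Any logger that accepts an io.Writer can now write into the store.
	logger := slog.New(slog.NewJSONHandler(w, nil))
	logger.Info("watchdog started", "component", "watchdog")
	fmt.Println("stored entries:", len(store.values))
}
```

Since the adapter needs only an append method and a close method, it could plausibly live next to the store implementation, which is the placement the comment above asks about.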

switch s {
case StartupSettingsStore:
return "startup_settings"
case WatchdogLogStore:
return "watchdog_logs"
Contributor

If we want to push other logs here (say early startup logs) how would we do it? Would we make a new store? Accept it as slightly misnamed?

}

insertSql := fmt.Sprintf(
`INSERT INTO %s (%s, %s) VALUES (?, ?)`,
Contributor

For timestamps, should one of these be a number, not a string? Might not matter, since in sqlite everything is a string...

4 participants