Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: scrub logs #1802

Merged
merged 33 commits into from May 8, 2024
Merged

feature: scrub logs #1802

merged 33 commits into from May 8, 2024

Conversation

saikonen
Copy link
Collaborator

@saikonen saikonen commented Apr 16, 2024

introduces logs scrub run_id/step/task_id command for scrubbing logs of tasks.

by default it deletes both stdout/stderr for the latest attempt of a task.

Final feature set:

  • logs [pathspec] works as previously, with the new default subcommand custom click group
  • logs show [pathspec] is the default subcommand for logs cmd group
  • logs scrub [pathspec] will overwrite logs with a redacted message

also adds support for both show/scrub commands to specify --attempt [number] to target a specific attempt of a task, instead of the latest.

@savingoyal
Copy link
Collaborator

@saikonen we could introduce a top-level delete command that has a sub command - logs? wdyt? also, the default behavior could be for --latest-attempt and users can pass in --all-attempts to wipe out logs for all attempts?

@saikonen
Copy link
Collaborator Author

this now introduces logs as a command group, with a default fallback to the new logs show command in case no subcommmand is specified.

the following command still works the same

python LongLog.py logs 1713532810109387/start/1
python LongLog.py logs 1713532810109387/start/1 --stderr

but now we can also support the following

python LongLog.py logs delete 1713532810109387/start/1

run_id, step_name, task_id = parts
else:
raise CommandException(
"input_path should either be run_id/step_name "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

input-path?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as in the existing 'show' command. should this be reworded to something else? pathspec for example?

echo("Logs have been scrubbed.")

else:
raise CommandException(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the behavior for tasks that haven't finished yet? a few different cases here -

  1. the attempt is still in flight
  2. the attempt has finished but we expect further attempts

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one approach would be that we instead init the datastore with allow_not_done=False ?

Copy link
Collaborator Author

@saikonen saikonen Apr 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bunch of fixes regarding this. How things work now:
we rely on allow_not_done=False and check when initializing the datastore in mode="d" that the latest / specified attempt is done, otherwise we raise an error.

  • trying to logs scrub [pathspec] on a running task will lead to an error.
  • logs scrub [pathspec] --attempt x will work even on a running task, as long as the specified attempt has finished.

so specifically for the above cases:

  1. if attempt is in flight, log scrubbing raises an exception for that specific attempt (or no attempt specified)
  2. attempt has finished, if scrubbing with --attempt it will work. If not specifying an attempt, the latest attempt will point to an attempt that has not yet finished, raising an exception

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: I also tried simply allowing the scrubbing of logs for running tasks, but due to how logs are written in bursts from a buffer, the effects of overwriting a log with the [REDACTED] message were being nullified, as every log write cycle would simply rewrite the scrubbed log lines if they were still in buffer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would definiltely not advocate for allowing the redaction of in-flight logs. THere are just too many complications. We would have to prevent writing to a log that has been redacted which we can't do with past versions of metaflow so there would be this big gapping hole.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the current behavior described @saikonen -- ie: can scrub past completed attempts with --attempt; can scrub latest attempt if finished with no arg and can't scrub any other case.

Copy link
Collaborator

@savingoyal savingoyal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a few comments

Copy link
Contributor

@romain-intel romain-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues with the larger feature -- a few comments here and there.

metaflow/datastore/flow_datastore.py Outdated Show resolved Hide resolved
metaflow/datastore/flow_datastore.py Outdated Show resolved Hide resolved

# Do not allow destructive operations on the datastore if attempt is still in flight
# and we explicitly did not allow operating on running tasks.
if not allow_not_done and not self.has_metadata(self.METADATA_DONE_SUFFIX):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a case where we want to delete logs for running tasks? It seems in delete mode we may want to ingore allow_not_done?

Copy link
Collaborator Author

@saikonen saikonen Apr 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left this in mainly in order to keep the datastore side more generic (allowing delete for in-flight attempts as well, though no use case exists yet). It is definitely not required for this feature, so I could remove it

# NOTE: We need this in order to not introduce breaking changes to existing CLI, as we wanted to
# nest both existing `logs` and the new `logs scrub` under a shared group, but `logs` already has
# a well defined behavior of showing the logs.
class CustomGroup(click.Group):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to check if this is handled by @madhur-ob 's click parsing thing as well (should be but mentioning).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tested, and this does not work as-is with the low level click api due to the fallback logic happening when invoking the command. Calling the explicit subcommand works as expected though, but the group level fallback fails

from metaflow.metaflow_runner import MetaflowAPI
from metaflow.cli import start

if __name__ == "__main__":

    api = MetaflowAPI.from_cli("AttemptLog.py", start)

    # works
    cmd = api().logs().show(input_path="217190/start")
    print(cmd)

    # does not
    cmd = api().logs(input_path="217190/start")
    print(cmd)

Will look into how easy of a fix there could be to support it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool -- we don't need to fix it for this but ya, we need to probbaly have it working at some point. As in, the low level api should be able to handle all our cases.

metaflow/plugins/logs_cli.py Outdated Show resolved Hide resolved
metaflow/plugins/logs_cli.py Outdated Show resolved Hide resolved
metaflow/plugins/logs_cli.py Outdated Show resolved Hide resolved
echo("Logs have been scrubbed.")

else:
raise CommandException(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the current behavior described @saikonen -- ie: can scrub past completed attempts with --attempt; can scrub latest attempt if finished with no arg and can't scrub any other case.

metaflow/plugins/logs_cli.py Outdated Show resolved Hide resolved
metaflow/plugins/logs_cli.py Show resolved Hide resolved
savingoyal
savingoyal previously approved these changes May 2, 2024
show_default=False,
help="Scrub latest/all attempts of a step or task",
)
@click.option(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this also try to scrub in-flight tasks?

Copy link
Contributor

@romain-intel romain-intel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this refactor. I think with a couple name changes, it makes a lot more sense and the code looks clear imho.

metaflow/datastore/flow_datastore.py Outdated Show resolved Hide resolved
metaflow/datastore/flow_datastore.py Outdated Show resolved Hide resolved
metaflow/datastore/flow_datastore.py Outdated Show resolved Hide resolved
metaflow/datastore/flow_datastore.py Outdated Show resolved Hide resolved
@saikonen saikonen changed the title feature: delete logs feature: scrub logs May 8, 2024
@saikonen saikonen merged commit d4cf28a into master May 8, 2024
26 checks passed
@saikonen saikonen deleted the feature/delete-logs branch May 8, 2024 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants