
initial runner api #1732

Open · wants to merge 47 commits into master

Conversation

@madhur-ob (Collaborator) commented Feb 15, 2024

The usage looks something like this:

from metaflow import metaflow_runner

params = [{'alpha': 40}, {'alpha': 20}, {'alpha': 10}, {'alpha': 30}]
with metaflow_runner.Pool('../try.py', num_processes=2) as pool:
    results = pool.map(params)
    for i in results:
        print(i)

where the contents of try.py for the example are:

import time
from metaflow import FlowSpec, Parameter, step

class ParameterFlow(FlowSpec):
    alpha = Parameter('alpha',
                      help='Time to Sleep',
                      default=1)

    @step
    def start(self):
        print('alpha is %f' % self.alpha)
        print(f"Let me sleep for {self.alpha} seconds")
        time.sleep(self.alpha)
        self.next(self.end)

    @step
    def end(self):
        print('alpha is still %f' % self.alpha)

if __name__ == '__main__':
    ParameterFlow()

In the above example, the runs are executed two at a time. They complete in this order:

  • run that sleeps for 20 seconds
  • run that sleeps for 10 seconds
  • run that sleeps for 40 seconds
  • run that sleeps for 30 seconds

The total running time is 60 seconds, broken down as follows:

  • 40 seconds that includes
    • run for 20 seconds
    • run for 10 seconds
    • 10 seconds of the run for 30 seconds
  • last 20 seconds of the run for 30 seconds
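The completion order and 60-second makespan above can be checked with a small greedy simulation of the pool. This is a standalone illustration; `pool_schedule` is a hypothetical helper, not part of the proposed API:

```python
import heapq

def pool_schedule(durations, num_workers):
    """Greedy FIFO scheduling: each of num_workers slots picks up the next
    pending job as soon as it frees up. Returns (completion_order, total_time)."""
    running = []  # min-heap of (finish_time, duration) for in-flight jobs
    order = []
    now = 0
    for d in durations:
        if len(running) == num_workers:
            # all slots busy: advance time to the earliest finish
            now, done = heapq.heappop(running)
            order.append(done)
        heapq.heappush(running, (now + d, d))
    while running:  # drain the remaining jobs
        now, done = heapq.heappop(running)
        order.append(done)
    return order, now

order, total = pool_schedule([40, 20, 10, 30], 2)
print(order)  # [20, 10, 40, 30]
print(total)  # 60
```

This reproduces the order listed above (20, 10, 40, 30) and the 60-second total.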

Assumptions:

  • Considering the use-case of controlling, and not authoring, Metaflow runs... at least for now...
  • Only accept a file which contains the flow, not an instance of a class derived from FlowSpec
  • Multiple concurrent runs can be spun off with the API

Things that need to be discussed

  • What else should the API accept? (e.g. more environment variables, other options such as --with, etc.)
  • This is only the API for Pool; what about just Run?
  • What should a successful run return? Currently, we return the status code...
    • If the run is supposed to produce some data / artefact, how can we pass that further?
      E.g. this API is used as a step in some flow to produce an artefact which is consumed and used by another step
  • Error handling / crashing aspects...

@romain-intel (Contributor)

I'm not going to comment on the implementation (which is fairly straightforward at this time anyways) because I want to align on some UX things first.

These are my initial thoughts on this but feel free to disagree with anything:

  • I think for the initial scope, we can restrict to basically the run command (so python myflow.py ... run ...). We should, however, try to design a UX that is then extensible to other command-line options (resume, dump, step-functions, etc.) that are "flow based"; we can then later also take care of the "metaflow xyz" command line (although that one can probably be addressed in a different way, so I am excluding it from too much consideration for now).
  • It doesn't feel like the current UX would extend to these other options. It is not so much a UX wrapping the "run" command as an abstraction on top of that which takes care of parallel executions of flows. In other words, I think the concept of having something like with Pool(...) is great, but it should probably be built on top of a more direct overlay on the "run" CLI.
  • One aspect that I feel may be important to capture is the hierarchical nature of the command line (or at least something closer to what click does, which may make it easier to map functionality). For example, one option could be the following:
mf_cli = MetaflowCli(flow_file="myflow.py", environment="conda")
mf_cli.run(max_workers=5, alpha=123)

Not saying something exactly like this, but this more cleanly maps to what we have for click and doesn't feel too unnatural (you have a Cli object -- or call it something else -- and from there you can run with various options exactly like you would on the command line). We can separately argue whether we want to infer "environment", for example, but that's a separate conversation :). Anyways, hopefully this shows the idea of having a multi-level approach that maps more closely to click and can support arbitrary options. It could be a very thin layer and should be reusable for various commands.

  • I also think that mf_cli.run should return immediately something akin to a future but also akin to the client's Run object. We could (with some modifications) actually return a Run object and you could do something like this:
my_run = mf_cli.run(max_workers=5, alpha=123)
while not my_run.finished:
    ...

(note this is probably not a good idea since it can loop forever but you get the idea). We should also seriously revisit some of these status things (we had a long conversation in the past about this but it may become more useful as part of this as well -- for example, we can actually return a better "finished" value here because we know if the subprocess dies/goes away and so it may not run forever :)).
You could also get the logs, etc. We could consider adding other functions to list things like "in-process steps", for example. Basically, it's an object that has value during execution as well as after. To use a recent analogy, it's like a "dynamic card" (it updates throughout the execution and is then frozen at the end of execution).

  • With something like that, it then becomes quite easy to write a Pool abstraction on top of it by having the async workers do something akin to what is above and just loop around until it is finished. As discussed, Pool should also return Run objects.
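To make the "Pool on top of a future-returning run()" idea concrete, here is a minimal standard-library sketch. `FakeRun`, `run_flow`, and `pool_map` are hypothetical stand-ins for the Run object and the non-blocking run call, not the actual proposal:

```python
from concurrent.futures import ThreadPoolExecutor

class FakeRun:
    """Hypothetical stand-in for the client's Run object."""
    def __init__(self, params):
        self.params = params
        self.finished = True  # a real Run would flip this when the subprocess exits

def run_flow(params):
    # placeholder for something like mf_cli.run(**params)
    return FakeRun(params)

def pool_map(param_sets, num_processes=2):
    """Sketch of a Pool built on a future-returning run(): submit every
    parameter set, then collect the Run objects in submission order."""
    with ThreadPoolExecutor(max_workers=num_processes) as pool:
        futures = [pool.submit(run_flow, p) for p in param_sets]
        return [f.result() for f in futures]

runs = pool_map([{"alpha": 40}, {"alpha": 20}], num_processes=2)
print([r.params["alpha"] for r in runs])  # [40, 20]
```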

@madhur-ob (Collaborator, Author) commented Feb 21, 2024

The usage looks something like this; we can of course pass blocking=True if we want to wait...

with metaflow_runner.Runner('../try.py', metadata="local") as runner:
    result = runner.run(blocking=False, alpha=5, tags=["abc", "def"], max_workers=5)
    print(result)
    while not result.finished:
        print("waiting for run to finish...")
        time.sleep(0.1)

or perhaps...

mf = metaflow_runner.Runner('../try.py', metadata="local")
result = mf.run(blocking=False, alpha=5, tags=["abc", "def"], max_workers=5)
print(result)
while not result.finished:
    print("waiting for run to finish...")
    time.sleep(0.1)

All this is built on top of a low-level chaining API that is dynamically constructed while traversing through the CLI command tree... Please see the following for usage:

from metaflow.click_api import MetaflowAPI
from metaflow.cli import start
    
api = MetaflowAPI.from_cli("../try.py", start)

command = api(metadata="local").run(
    tags=["abc", "def"],
    decospecs=["kubernetes"],
    max_workers=5,
    alpha=3,
    myfile="path/to/file",
)
print(command)

command = (
    api(metadata="local")
    .kubernetes()
    .step(
        step_name="process",
        code_package_sha="some_sha",
        code_package_url="some_url",
    )
)
print(command)

command = api().tag().add(tags=["abc", "def"])
print(command)

command = getattr(api(decospecs=["retry"]), "argo-workflows")().create()
print(command)
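To illustrate the chaining idea, here is a toy sketch of how attribute access and keyword arguments can be translated into a command line. `ChainedCommand` is a hypothetical class for illustration only, not Metaflow's actual click_api implementation:

```python
class ChainedCommand:
    """Toy chaining API: attribute access becomes a subcommand and keyword
    arguments become CLI options, mirroring api(metadata="local").run(...)."""

    def __init__(self, parts=None):
        self.parts = parts or []

    def __getattr__(self, name):
        # map attribute access (run, tag, ...) to a subcommand
        return ChainedCommand(self.parts + [name.replace("_", "-")])

    def __call__(self, **options):
        parts = list(self.parts)
        for key, value in options.items():
            flag = "--" + key.replace("_", "-")
            if isinstance(value, list):
                for v in value:  # repeat the flag for list-valued options
                    parts += [flag, str(v)]
            else:
                parts += [flag, str(value)]
        return ChainedCommand(parts)

    def __str__(self):
        return " ".join(self.parts)

api = ChainedCommand(["python", "../try.py"])
cmd = api(metadata="local").run(tags=["abc", "def"], max_workers=5)
print(cmd)  # python ../try.py --metadata local run --tags abc --tags def --max-workers 5
```

The real implementation walks click's command tree to validate subcommands and options; this sketch only shows the string-building side of the idea.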

@madhur-ob (Collaborator, Author)

Pushed out a subprocess manager that allows us to tail logs by streaming them into a queue first...
The higher-level runner API uses this subprocess manager...

An example of the higher-level UX is as follows:

import asyncio
from metaflow import metaflow_runner

async def main():
    with metaflow_runner.Runner('../try.py',  metadata="local") as runner:
        result = await runner.run(blocking=False, alpha=5, tags=["abc", "def"], max_workers=5)
        print(result) # this is the Run object
        print(result.finished) # this should be False
        await runner.tail_logs(stream="stdout")
        print(result.finished) # this should be True

        # the below can also be done...

        # while not result.finished:
        #     print("waiting for run to finish...")
        #     await asyncio.sleep(0.1)  # or time.sleep(0.1)

if __name__ == "__main__":
    asyncio.run(main())

the output looks like:

Run('ParameterFlow/1710272507865687')
False
2024-03-13 01:11:47.872 Workflow starting (run-id 1710272507865687):
2024-03-13 01:11:47.878 [1710272507865687/start/1 (pid 85339)] Task is starting.
2024-03-13 01:11:48.081 [1710272507865687/start/1 (pid 85339)] alpha is 5.000000
2024-03-13 01:11:53.112 [1710272507865687/start/1 (pid 85339)] Let me sleep for 5 seconds
2024-03-13 01:11:53.114 [1710272507865687/start/1 (pid 85339)] Task finished successfully.
2024-03-13 01:11:53.121 [1710272507865687/end/2 (pid 85381)] Task is starting.
2024-03-13 01:11:53.359 [1710272507865687/end/2 (pid 85381)] alpha is still 5.000000
2024-03-13 01:11:53.393 [1710272507865687/end/2 (pid 85381)] Task finished successfully.
2024-03-13 01:11:53.394 Done!
True
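The queue-based log tailing described above could look roughly like this minimal asyncio sketch (a hypothetical stand-in, not the PR's actual subprocess manager), shown here with `echo` as the child process:

```python
import asyncio

async def stream_to_queue(stream, queue):
    """Read lines from a subprocess stream and push them into a queue;
    a None sentinel marks EOF."""
    while True:
        line = await stream.readline()
        if not line:  # EOF
            await queue.put(None)
            return
        await queue.put(line.decode().rstrip())

async def tail(cmd):
    proc = await asyncio.create_subprocess_exec(
        *cmd, stdout=asyncio.subprocess.PIPE
    )
    queue = asyncio.Queue()
    reader = asyncio.create_task(stream_to_queue(proc.stdout, queue))
    lines = []
    while (line := await queue.get()) is not None:
        lines.append(line)  # a real tailer would print or persist here
    await reader
    await proc.wait()
    return lines

print(asyncio.run(tail(["echo", "hello"])))  # ['hello']
```

Decoupling the reader task from the consumer via the queue is what lets the runner tail logs while the flow is still executing.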

@romain-intel (Contributor) left a comment

Just looked at the process manager for now.

Review comments were left on metaflow/cli.py, metaflow/subprocess_manager.py, and metaflow/click_api.py (several since resolved).
@madhur-ob (Collaborator, Author)

The latest usage looks like this:

import time
import asyncio
from metaflow import metaflow_runner

async def main():
    async with metaflow_runner.Runner('../try.py', metadata="local") as runner:
        # returns immediately
        result = await runner.run(alpha=5, tags=["abc", "def"], max_workers=5)

        # can access properties of metaflow's Run object..
        print(result.finished)
        while not result.finished:
            print("waiting for run to finish...")
            time.sleep(1)
        print(result.successful)

        # can access properties of the CommandManager as well..
        print(result.log_files)

        await result.wait()
        await result.wait(timeout=2)
        await result.wait(stream="stderr")
        await result.wait(stream="stderr", timeout=3)

        async for _, line in result.stream_logs(stream="stdout"):
            print(line)

if __name__ == "__main__":
    asyncio.run(main())
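The wait(timeout=...) behavior shown above can be approximated with asyncio.wait_for on the child process. `wait_with_timeout` below is a hypothetical helper sketching the idea, not the PR's implementation:

```python
import asyncio

async def wait_with_timeout(proc, timeout=None):
    """Return once the process exits or the timeout elapses, whichever
    comes first; the process keeps running after a timeout."""
    try:
        # cancelling proc.wait() on timeout does not affect the process
        await asyncio.wait_for(proc.wait(), timeout)
    except asyncio.TimeoutError:
        pass  # still running; caller can keep polling or stream logs
    return proc

async def main():
    proc = await asyncio.create_subprocess_exec("sleep", "5")
    await wait_with_timeout(proc, timeout=0.1)
    print(proc.returncode)  # None: still running after the timeout
    proc.kill()
    await proc.wait()

asyncio.run(main())
```

A stream argument as in the PR's wait(stream=...) would additionally tail that stream while waiting; this sketch covers only the timeout half.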

@romain-intel (Contributor) left a comment

Initial comments. Will need a bit more time but looks promising :)

Review comments were left on metaflow/subprocess_manager.py (some since resolved).
        else:
            line = await asyncio.wait_for(f.readline(), timeout_per_line)
    except asyncio.TimeoutError as e:
        raise LogReadTimeoutError(
Contributor: we could also return the position and None?

Contributor: So we would return:

  • the actual line
  • an empty line if EOF or just an actual empty line
  • None if a timeout.

I don't know if we want to distinguish the two cases in the second bullet.

Collaborator (Author): can discuss for more clarity...

        self.run_obj = run_obj

    def __getattr__(self, name: str):
        if hasattr(self.run_obj, name):
Contributor: need to think a bit more about this. We may want more explicit control to make sure we expose what we want. It's also a bit harder to document this way.

Collaborator (Author): can discuss more in the meeting...

@madhur-ob (Collaborator, Author)

New Higher Level UX:

import time
import asyncio
from metaflow import metaflow_runner

if __name__ == "__main__":

    # sync version that calls run() directly
    with metaflow_runner.Runner('../try.py', metadata="local").run(alpha=5, tags=["abc", "def"], max_workers=5) as result:
        print(result.run.finished)
        print(result.stdout)

    # sync version using runner
    with metaflow_runner.Runner('../try.py', metadata="local") as runner:
        result = runner.run(alpha=5, tags=["abc", "def"], max_workers=5)
        print(result.run.finished)
        print(result.stdout)

    # async version that calls async_run() directly
    async def async_run_directly():
        with await metaflow_runner.Runner('../try.py', metadata="local").async_run(alpha=5, tags=["abc", "def"], max_workers=5) as result:
            print(result.run.data)
            print(result.run.finished)

            print(result.stdout)
            time.sleep(3)
            print(result.stdout)

            print((await result.wait()).stdout)
            # OR we can also do:
            #    await result.wait()
            #    print(result.stdout)

            while not result.run.finished:
                print("waiting for run to finish...")
                time.sleep(1)
            print(result.run.successful)

            print((await result.wait(timeout=3, stream="stdout")).run.data)
            print(result.run.finished)

            await result.wait(timeout=3, stream="stdout")

            async for _, line in result.stream_logs(stream="stdout"):
                print(line)

    asyncio.run(async_run_directly())

    # async version using runner
    async def async_runner():
        async with metaflow_runner.Runner('../try.py', metadata="local") as runner:
            result = await runner.async_run(alpha=5, tags=["abc", "def"], max_workers=5)

            print(result.run.data)
            print(result.run.finished)

            print(result.stdout)
            time.sleep(3)
            print(result.stdout)

            print((await result.wait()).stdout)
            # OR we can also do:
            #    await result.wait()
            #    print(result.stdout)

            while not result.run.finished:
                print("waiting for run to finish...")
                time.sleep(1)
            print(result.run.successful)

            print((await result.wait(timeout=3, stream="stdout")).run.data)
            print(result.run.finished)

            await result.wait(timeout=3, stream="stdout")

            async for _, line in result.stream_logs(stream="stdout"):
                print(line)

    asyncio.run(async_runner())

@madhur-ob changed the title from "initial runner map api" to "initial runner api" on May 2, 2024
@romain-intel (Contributor) left a comment

I haven't checked all the logic in click_api.py but if it works for the runner, it should be good enough for now. We can improve as we go along and expose more of it.

Review comments were left on metaflow/_vendor/vendor_any.txt, metaflow/_vendor/vendor_v3_8.txt, metaflow/metaflow_runner.py, metaflow/subprocess_manager.py, and metaflow/click_api.py.