Describe integration with MLflow #3856
base: main
Conversation
Force-pushed from a8256cf to df922b7
Closes #3541. Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Force-pushed from df922b7 to 24c25c2
Note to reviewers: The idea of this page is to complement
The idea is for this page to serve as a brief collection of MLOps use cases, and to use this as a template for future integrations. I'm paging @stichbury as well because I was somewhat careless with the prose in certain parts.
> At the moment it's not straightforward to specify the `run_id`
> to use a specific version of a model for inference within a Kedro pipeline,
> see discussion on [this GitHub issue](https://github.com/Galileo-Galilei/kedro-mlflow/issues/549).
I see people doing it in production for a lot of models, so it is definitely possible ;) I really need to see the code you run to get that error, but essentially you cannot save a model in an existing run, because you would alter it, and that is definitely not the purpose of experiment tracking. However, there is absolutely no problem loading a model from an existing MLflow run (say the spaceflights regressor you've just trained) to run a Kedro pipeline which will log in another MLflow run, or can just be used without logging in MLflow.
> essentially you cannot save a model in an existing run, because you would alter it and this is definitely not the purpose of experiment tracking
That makes a lot of sense. Still chewing on your explanation, will rewrite this part.
```python
"mlflow": ("https://www.mlflow.org/docs/2.12.1/", None),
"kedro-mlflow": ("https://kedro-mlflow.readthedocs.io/en/0.12.2/", None),
```
Not sure what this config actually does; why do we prefer mapping to a specific version over "stable" or "latest"?
Because pinning the version means the links will never break ⭐ and it captures which version of MLflow was used to write the docs.
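For context, the pinned entries above belong in Sphinx's `intersphinx_mapping`; a minimal sketch of the relevant `conf.py` fragment, with everything outside the two mappings assumed:

```python
# docs/source/conf.py (fragment, sketch)
extensions = ["sphinx.ext.intersphinx"]

# Pinned versions: cross-reference links never break, and the mapping
# records which MLflow / kedro-mlflow versions the docs were written against.
intersphinx_mapping = {
    "mlflow": ("https://www.mlflow.org/docs/2.12.1/", None),
    "kedro-mlflow": ("https://kedro-mlflow.readthedocs.io/en/0.12.2/", None),
}
```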
docs/source/integrations/mlflow.md (Outdated)
> - A working Kedro project.
> - The MLflow client properly installed in your project: `pip install mlflow`.
> - (Recommended) An MLflow Tracking Server, which will allow you to quickly inspect the runs with a user-friendly web interface.
Should we mention how to start a basic one with `mlflow ui` etc?
Should be short and wouldn't harm I'd say, good suggestion.
I went back and forth several times on how to structure this page. I like how I ended up making 1 section per use case but I agree there might be other ways. Awaiting @stichbury's take on this.
I found the sectioning useful but agree with @noklam that we usually break into simple and advanced usage. Can I suggest we do the same here and have the structure as follows, but will leave you to decide where the simple/advanced sections fall? Maybe something like this? However, I'm not that attached to this and if you want to stick with what you have, I'd say that's fine, but omit the basic second level "Use cases" header and promote all the following (currently 3rd level) to 2nd level.

- Header
- Prerequisites
- Simple use cases
  - Tracking Kedro pipeline runs in MLflow using Hooks
  - Artifact tracking in MLflow using hooks
- Advanced use cases
  - Complete tracking of Kedro runs in MLflow using …
My biggest gripe with this is that it seems wrong to declare custom hooks as "basic" and using `kedro-mlflow` as "advanced". I can totally see how someone starts with the custom hook ("basic"), then they start making it more complex because they need more functionality, and in the end it becomes way more difficult than just using `kedro-mlflow`.
I can't work on your branch so I've forked and made a PR to commit back to it: #3862. Please take a look, merge what you want, and I can review again when you have the entire page in your preferred final state (see comment about sectioning above).
* Some proposed edits
* Fix some Vale warnings

Signed-off-by: Jo Stichbury <[email protected]>
We agreed to do this 👍🏼 Will make the change today
Thank you, @astrojuanlu, for the excellent manual. Everything is working well. It's a great starting point for exploring Kedro+MLflow. I've left a few minor comments.
> We recommend use of an [MLflow tracking server](https://mlflow.org/docs/latest/tracking/server.html), which enables you to inspect runs through a web interface. This is optional, but strongly recommended, even if you're working locally on a solo project. This documentation assumes that you have a tracking server, whether it's a local one run by yourself or a remote one running on another machine or provided by a cloud vendor.
> You will need to configure your client properly,
> either by setting the `MLFLOW_TRACKING_URI` environment variable
> or by using the {external+mlflow:py:func}`mlflow.set_tracking_uri` function.
I don’t quite understand the idea of that section. I simply run `mlflow ui` and everything works well without any configuration. Perhaps we should include this command in the manual for simplicity, and provide links for those interested in more detailed information.
Yes, I think this is very confusing for users. Actually it is not mandatory to have a tracking server to be able to use the UI, and I think the entire documentation examples should work without one. It is obviously very recommended to have one for many reasons (persistence, collaboration, granular access, separate storage of heavy components from metadata...), but it's like saying "don't use local csv files to store your data but use a database instead" : beginners won't see the immediate need and it's not a big deal for the first use I guess.
You're both completely right, this stems from some misunderstanding I had (I confess my first experience with MLflow was kinda bumpy). I will simplify all this, agreed with @Galileo-Galilei that it's not really needed to introduce this complexity early on.
```python
import mlflow
from kedro.framework.hooks import hook_impl
from mlflow.entities import RunStatus
```
I added the following lines to make it work:

```python
import logging

logger = logging.getLogger(__name__)
```

Perhaps it would be sensible to include these in the manual?
Good catch, will add
> One possible way of doing it is using the {py:meth}`~kedro.framework.hooks.specs.PipelineSpecs.before_pipeline_run` Hook
> to log the `run_params` passed to the Hook.
> An implementation would look as follows:
At this point, it would be beneficial for me to receive a brief explanation of what I should do next. Including a short description that instructs me to create a new file and register it according to the guidelines in the Kedro Hooks introduction would be helpful.
Wanted to avoid adding too much prose here, but at least a link to the Kedro Hooks intro would indeed be useful, will add.
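For readers landing here: registering such a Hook follows the usual Kedro pattern; a sketch of `settings.py`, with `my_project` as a placeholder package name:

```python
# src/my_project/settings.py (fragment, sketch; "my_project" is a placeholder)
from my_project.hooks import MLflowHooks

HOOKS = (MLflowHooks(),)
```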
:::{note}
If you implemented your custom `MLflowHooks` as described in the previous section,
consider disabling them to avoid logging the runs twice.
:::
I think this is confusing for beginners and the warning should be at the beginning of the section. If one reads the section linearly from top to bottom (which is what you usually do 😄) I think it is very confusing to understand that this section suggests to replace the custom hook with `kedro-mlflow` and not add it. Eventually as a beginner I would likely wonder "why the hell did I start to do something that is suggested to be removed in the following section?"

I think an (advanced) example that would be much nicer could be a combination of `kedro-mlflow` and a custom hook (which will for example track something custom that `kedro-mlflow` does not, e.g. the commit sha?), but still benefit from `kedro-mlflow` configuration management because the `kedro-mlflow` hook will run first.
This is a very valid point and goes back to something @stichbury had mentioned to me as well about mixing "explanation" with "how to" content in here.
The first hook is supposed to have pedagogical value but you're right it kind of breaks the flow.
I'll see if I can rework this to keep that first part as an explainer, or if I'll have to drop it completely and do as you say @Galileo-Galilei and evolve from `kedro-mlflow` with custom hooks that are used in combination with the plugin.
> Notice that `kedro-mlflow` assumes 1 Kedro project = 1 MLflow experiment.
> If this does not suit your use case, consider pursuing a more customised approach
> like the one described in the previous section.
Actually this is the default but it is fully customisable in the `mlflow.yml`, so I wouldn't call it an assumption (you can even use a custom resolver to create something very custom and dynamic in the `mlflow.yml` if you want).
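For example, the experiment name can be overridden in `mlflow.yml`; the fragment below is a sketch against the kedro-mlflow 0.12 schema, and the name is a placeholder:

```yaml
# conf/base/mlflow.yml (fragment, sketch)
tracking:
  experiment:
    name: my_custom_experiment  # placeholder; defaults to the package name
```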
Very true. I'll reword this section.
> Notice though that storing artifacts as Python pickles
> prevents you from fully using MLflow native preview capabilities.
Is it really necessary to specify it? Just a nitpick, but maybe it would be nicer to write it the other way around, with a CSV or a PNG example which would render with a nice preview, and say that MLflow offers a preview for some data formats?
Thanks for the review @Galileo-Galilei 🙏🏼 will address your comments ASAP.
Description: See #3541.