Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discuss use cases about skipping cells during execution #429

Open
willingc opened this issue Sep 9, 2019 · 43 comments
Open

Discuss use cases about skipping cells during execution #429

willingc opened this issue Sep 9, 2019 · 43 comments
Labels
enhancement reference:extensions information and ideas about extensions

Comments

@willingc
Copy link
Member

willingc commented Sep 9, 2019

From #425 by @akx:

Passing --skip-tag (can be passed multiple times) will have the executor entirely skip cells endowed with that tag.

That status is recorded in the result notebook (metadata.papermill.status = "skipped"), as well as the tags that contributed to this skipping (metadata.papermill.skip_tags).

cc @JuhaKiili

I'm opening this issue to gather feedback on real world use cases which may benefit from the ability to skip cells during execution. Let's focus on documenting the use cases before moving toward a technical implementation.

@JuhaKiili
Copy link

notebook-generic

We are developing a machine learning platform @ valohai.com

One of our features is a Jupyter notebook plug-in, which will allow our customers to run the notebook from start to finish on a cloud instance asynchronously. We use Papermill for the server-side execution.

One of the feature requests from customers is an ability to tag certain cells as "local only", which means they will not be executed on the server. Here is a direct quote from our customer:

Sometimes we have some cell in the notebook that are just used for a quick eda and don't need to be run on a gpu machine. Would it be possible to add some form of tag so that we can choose which cells of the notebook to execute on vh?

We could programmatically remove the tagged cells before we send the notebook to the server, BUT this would ruin the version control side of things. When our customers come back 6 months later and download the resulting notebook, they still want the disabled cells to be there, even if they were not executed.

@willingc
Copy link
Member Author

willingc commented Sep 10, 2019

Thanks @JuhaKiili.

A few thoughts on notebook functionality that appears to address your use case today:

If you look in the keyboard shortcuts for the notebook, you can disable a cell from running by changing it to a raw cell (ESC+r). You can later reactivate it as a code cell with (ESC+y).

Changing a cell to a raw cell provides the same functionality as skipping a cell during execution.

cc/ @MSeal

@JuhaKiili
Copy link

Changing a cell to a raw cell provides the same functionality as skipping a cell during execution.

Didn't know about raw cells. Thanks for the info!

@MSeal
Copy link
Member

MSeal commented Sep 17, 2019

The raw cells would work. Though I can see that being a bit annoying to manage.

I also noticed the cell that you likely want skipped in that gif is a series of magics which would make conditional execution of the logic within the cell irritating to manage by the user.

While we're exploring options. A way to manage how a notebook executes more precisely can be to use a registered engine: https://papermill.readthedocs.io/en/latest/extending-entry-points.html#developing-a-new-engine.
This would let you modify cells or choose how to (or not) execute them within the papermill context. You could change cells to raw there or skip and add some output message indicating the cell wasn't executed perhaps.

@akx
Copy link
Member

akx commented Sep 24, 2019

@MSeal Engines can't register new parameters during the CLI parsing phase though, so having a custom engine for the #425 use case would also require some other method of passing in the regex/pattern/list-of-names, which also sounds clunky. 😕

@MSeal
Copy link
Member

MSeal commented Oct 14, 2019

We discussed the topic again today in our weekly meetup for nteract and are still of the opinion that it doesn't fit the project to support as a built-in mechanism, especially with some known work-around patterns and the ability to extend papermill for more custom execution patterns. Adding this feature would increase the complexity of describing how papermill operates and why the outcome ran the way it did. It would also break the construct we rely on heavily that rerunning a papermill output in any notebook interface will produce the same result as when it ran in papermill.

Of the solutions available, converting cells to raw that you don't want to have executed if probably the best option. Though if people wish to go the engines route we can help improve the engines abstraction where there are gaps to using the plugin system. e.g. @akx If we need to make a way for plugins to be able to accept plugin kwargs from CLI calls we could open a separate issue on that topic. I think that'd benefit customization in general.

Other maintainers please feel free to add comments or suggestions to this thread as it goes on, since I'm echoing a wider conversation that we've had over the past few months.

@willingc willingc added reference:extensions information and ideas about extensions and removed needs:discussion additional discussion is requested labels Oct 14, 2019
@casperdcl
Copy link
Member

Adding this feature would increase the complexity of describing how papermill operates

could you specify an explicit word/character limit for describing new features? We can then see if it's possible to concisely describe.

why the outcome ran the way it did [...] we rely on heavily that rerunning a papermill output in any notebook interface will produce the same result as when it ran in papermill.

#403 fixes this by removing cells altogether hence producing notebooks that are runnable in any interface with the same results

Of the solutions available, converting cells to raw that you don't want to have executed if probably the best option.

not sure

we can help improve the engines abstraction where there are gaps to using the plugin system.

sounds complicated but still welcome

@MSeal
Copy link
Member

MSeal commented Oct 22, 2019

Sorry for not being highly responsive on this thread, it's been a busy couple weeks. I'm going to try to post the nteract groups response here, though it's flavored a bit in my fashion of wording. I asked the group to also comment on this thread so it's not just myself speaking for the group as a whole.

could you specify an explicit word/character limit for describing new features? We can then see if it's possible to concisely describe.

The complexity of describing is not in the number of words, it's in the fact that we no longer have a linear execution model. There's a lot of users of papermill that this would confuse. Many users it wouldn't confuse at all but in the past when we've had to describe non-linear behavior in other jupyter tooling it put a burden on the maintainers and community to constantly explain why execution happened the way it did. We'd like to not have base papermill have the same explanation and support burden.

#403 fixes this by removing cells altogether hence producing notebooks that are runnable in any interface with the same results

That's fair, though then we have to explain to users that they need to flip cell types when they want to turn them on / off. The current UIs don't lend themselves well to this effort, as it won't tell you how to execute a raw cell, it'll just do nothing when you execute it. For a specialized usecase this is probably fine, but there's many users that don't understand the cell type concept so the change has questionable value to them. If there were a more first-class "disabled" concept that guided the users in the interface I think my objection here would be mute for a general purpose tool.

sounds complicated but still welcome

Agreed, just trying to make extensibility for other solutions to problems such as these not be explicitly blocked by the tooling in place.

@willingc
Copy link
Member Author

Adding an additional point that I find important for the linear execution model: reproducibility of research notebooks. A non-linear execution model with more variability makes it more difficult to account for which cells were run or should be run to reproduce.

I agree @MSeal that a future jupyter/nbformat spec would be a better place for a first-class "disabled" concept.

@rgbkrk
Copy link
Member

rgbkrk commented Oct 22, 2019

The issue here is that it makes the job of maintainability really difficult for us in terms of support costs. Not the literal code, but the expectation that we have to respond to issues related to this not just in papermill but also in the classic notebook UI, jupyterlab, etc. This creates notebooks that run differently in different systems, which goes against our core tenant of reproducibility.

@casperdcl
Copy link
Member

@rgbkrk I'm not convinced by the maintainability argument; however the "core tenant of reproducibility" is definitely not violated by #403

@mpacer
Copy link
Member

mpacer commented Oct 23, 2019

Today, papermill adds one cell and then runs a notebook. It's simple to explain to people.

From the perspective of papermill, skipping cells while leaving them in the notebook is definitely a violation of the "core tenant of reproducibility". While #403 does not do that (by removing the cells altogether), it still introduce an additional level of complexity to the project.

As the person who wrote the tag exclude rules in nbconvert, I can speak to how much complexity they can introduce to how you think about the flow of a notebook (especially when executing it). It's a lot of complexity.

The complexity shows up when conveying how to use these features to users. Getting folks to understand these kinds of features & how to use them has been really challenging.

All that said, it should be possible to use nbconvert to achieve the end that you want from #403, though not with regexes (which admittedly adds additional complexity). Then, you would use nbconvert to remove the cells and then pipe the output notebook directly into papermill with the --stdout flag.

So if you wanted to remove cells with nbconvert with the tag "remove_me" and then pipe it into papermill as follows:

jupyter nbconvert myNotebook.ipynb --stdout --TagRemovePreprocessor.remove_cell_tags={\"remove_me\"} --to notebook | papermill myOutputNotebook.ipynb

@tgandor
Copy link

tgandor commented Nov 19, 2019

There are use cases involving scripted / programmatic usage and ad-hoc command line usage.

A compact, intuitive CLI can be helped by some conventions, like the one discussed - with cell tagging. Think about C preprocessor, with some predefined symbols.

I was thinking about suggesting something like:

import papermill
if papermill.am_i_running_inside_papermill():  # provisional name
    dont_setup_interactive_stuff()
else:
    %matplotlib notebook
    etc()

Of course, you could always have a parameter for that:

{
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "INTERACTIVE = True"
   ]
}

And run your batch processing with -p INTERACTIVE False, but it's 20 keystrokes more, and also this parameter needs to exist in the first place (for it to be usable "Interactively").

The UX criteria for such a feature would be:

  • is it compact? (not requiring extra parameters / scripting outside of papermill)
  • Does it make the notbook look ugly? (... the code inside the notebook ...)
  • does it work without changing the notebook on disk? (version control)
  • does it work correclty on a "dirty" notebook (currently running)?

@ODemidenko
Copy link

as was requested - here are my use cases:
Use case 1:
Notebook is used to do some clusterization. Usually it is done for the current month.
But sometimes I need to re-run it over historical months, which necessitates re-running almost the whole notebook. Thus - papermill.
But for historical computations I need to skip some cells that belong only for the current state information. Now I do this by creating a copy of the notebook, deleting unnecessary cells and feeding in into papermill.

Use case 2:
I have a universal notebook that produces a lot of charts on the particular input (all we can say, about arbitrary set of input subscribers).
I need those charts on different segments, so I run this notebook through papermill, providing different segments it must describe.
Some charts take a lot of time to compute, or need data that I don't have. Then I want to run this notebook only partially. Now I create a copy of the notebook and delete manually unnecessary cases.

Problems that I encounter now, with copies that I create is when I start changing smth in these "extracts" from the original notebook - I then sometimes forget to propagate this changes back to the full notebook.

@ctivanovich
Copy link

ctivanovich commented Apr 10, 2020

Another use case request: I need to gracefully exit a papermill run of a notebook, error free, when, for example, the datasource being imported is invalid or empty. But the immediate Python options like if condition: exit() or assert or raise Exception all lead to errors that get propagated up to papermill, which then get logged as failed tasks in Airflow. A work around would be to set up error catching in the host environment, but that would not be fine-grained enough to avoid catching errors I do want to propagate.

@MSeal
Copy link
Member

MSeal commented Apr 10, 2020

@ctivanovich You can exit(0) and it will terminate without raising an exception in papermill fyi.

@Krever
Copy link

Krever commented Apr 16, 2020

In our case, we are trying to hang the execution of the flow waiting for an event. This means we need to sav execution state, terminate the execution and after some time start where we finished. For now, we didn't find any way to accomplish it easily, but I assume skipping cells would help.

@MSeal
Copy link
Member

MSeal commented Apr 16, 2020

In our case, we are trying to hang the execution of the flow waiting for an event.

Could you explain more what this means? And why you think skipping cells would help?

@Krever
Copy link

Krever commented Apr 17, 2020

Sure, we are experimenting with using jupyter notebooks for business process automation. In such an environment sooner or later you have to wait for some kind of event to happen. This could be either a timer event (e.g. wait 5 days) or user input (send a form to the user and wait for events) or something else. Anyway, this means the process (which in this experiment is expressed as a notebook) can take weeks or months to execute and we cannot rely on an os process to survive.

And so we need a way to suspend a notebook and then resume it once the event arrives. This brings two major problems: persisting notebook state and starting process in a given place. I believe skipping cells could be used as a solution to the second one, we would just skip all cells up to the point where we finished previous execution.

If you think using notebooks for business process automation is crazy then you're probably right. But as I said, its an experiment and we have quite good reasons to hope it will succeed.

@tgandor
Copy link

tgandor commented Apr 17, 2020

@Krever crazy idea indeed. I think you should write your own %%cell_magic which would know if it should execute or not. (Save progress outside the notebook)

@MSeal
Copy link
Member

MSeal commented Apr 18, 2020

@Krever I hear you on taking crazy ideas and proving them out. We did something similarly crazy at the time with https://netflixtechblog.com/scheduling-notebooks-348e6c14cfd6.

In our case we used an external scheduler to manage timer and event triggers to release a notebook. We also tended to split notebooks up to single execution units that can be run end-to-end because it's difficult to rehydrate the full execution state part-way through and error prone to not do so at arbitrary cells. If you are headed in that direction and have a plan for how you want to handle execution patterns within a single notebook, I think you're best bet is going to be to use engine extensions to better manage when you execute what. You can even manipulate when the kernel is up or if you need to run it remotely. This has been successfully applied for domain specific executions where you want a lot more control over execution patterns. Skipping cells would be a way to achieve part of what you need, but if you're investing in complex patterns I think you'd want more control over other aspects as well.

@antonkulaga
Copy link

Guys, I am really surprised that somebody needs to explain why skiping cells is useful! When you do your research you have a lot of cells where you render part of the data for explaratory analysis (i.e. dataframe.head(30)) Such type of cells are totally useless for any machine-learning workflows but are very helpful for everybody who uses the code

This was linked to pull requests May 15, 2020
@MSeal
Copy link
Member

MSeal commented May 15, 2020

What you're not seeing in this thread @antonkulaga is the several thousand users we interface with in a daily basis across our respective companies that use this tool. Making it easier to conditional execute cells does helps some users do more complex tasks. It also make novice developers and non-professional programmers more likely to set themselves up for failure. Papermill is an opinionated tool. One of those opinions is that notebooks when run in a non-iterative environment should be run from beginning to end. If we add deviations from this it complicates support patterns and leads to needing to encode what cells should and shouldn't be used in offline systems, or when passing to another person / toolchain those tools now have to be aware of more configuration and complexity.

That being said, we're not saying we won't do it in papermill directly. We've been delaying making a reversal decision here for two reasons: A) the maintainers of papermill have a bunch of these projects we work on in our spare time so we've been busy and B) we wanted more time to collect usecases and impact it might have if we added the option.

You can also add this with your own extension if you really want it today while we sit on the topic. Plus we presented a few alternatives that are available to combine tooling or patterns to achieve skipped cells. Saying "I am really surprised that somebody needs to explain why skiping cells is useful!" is a little inflammatory when we're keeping things open to discuss the ideas and tradeoffs here despite initially saying we didn't want the feature to be added in the first place.

After nbconvert 6.0 and the new nbclient are in a solid place I'll personally have more time to think through this issue given the context in the thread and in our interactions with the other maintainer's development ecosystems.

@casperdcl
Copy link
Member

casperdcl commented May 15, 2020

sigh.

Papermill is an opinionated tool.

This statement should have been made years ago and featured prominently in the docs. It is perfectly reasonable to be opinionated (e.g. https://github.com/psf/black) provided you say so clearly and in advance. This issue could have been closed a long time ago, or at least a lot of debate avoided.

is a little inflammatory

You imply helpful contributors are being deliberately inflammatory. This is a worrying allegation you seem to resort to every time someone expresses passion equal to your own. It's not in any way helpful, especially since it's abundantly clear that said contributors are genuinely looking to solve a problem, not create one. Furthermore, let us assume some contributors are deliberately being inflammatory. Someone who claims to manage several thousand users daily should be able to accept harsh criticism. If you genuinely believe any conduct to be inappropriate it should be reported.

@MSeal
Copy link
Member

MSeal commented May 16, 2020

sigh

Do we need to sigh?

This statement should have been made years ago and featured prominently in the docs

Fair that it's not documented. I've said this in almost all of the papermill talks I have done, and other like Kyle, M, and Carol have said it different settings but it's not mentioned in the docs.

is a little inflammatory

You imply helpful contributors are being deliberately inflammatory.

I never said it was deliberate. Sometimes it's a language barrier or just not carefully worded. We get a lot of negativity thrown at maintainers on some threads across the projects. I try to mention when it's headed in a bad direction and not jump to reporting people for miscommunication alone.

This is a worrying allegation you seem to resort to every time someone expresses passion equal to your own.

I'm fine with passion. But implying we don't understand and that it's so simply anyone whose reasonable would see, after many responses to the topic here IS frustrating and doesn't help motivate us to meet in the middle. As to my response, your comments in #425. the one place where I feel we pushed back strongly on an idea were called out for you violating our conduct (this for was for the same feature being discussed here). Another primary maintainer responded in that thread and I had the remaining nteract maintainers review my response before I made it. I didn't report it, as I wanted to make clear the issue we were seeing and ask for a correction in behavior because we value contributors and I don't want a hostile project space. This here is nowhere near that level of issue, so I'd ask to not compare the two.

I do not wish to continue this line of discussion in this thread, and would prefer we talk about the subject. On that I will make a concerted effort to work towards a better solution in this tool or in combination with other tooling to enable to intended usecase, but I don't currently have bandwidth to engage on the topic and I worry about adding cell skipping needs clear direction and tooling decisions or can lead to the concerns mentioned above removing value from the tool's simplicity. If another nteract maintainer has more bandwidth to tackle the topic I'll help review and promote changes that are agreed upon.

@MSeal
Copy link
Member

MSeal commented May 16, 2020

If folks have proposals to help mitigate the risks of adding tooling complexity or making the outputs clear about what executed in the different interfaces for viewing notebooks that is likely the next step to helping get support for the capability more natively available.

@captainsafia
Copy link
Member

Thank you, @MSeal.

I want to remind everyone of the Code of Conduct. By participating in the issue discussion here and in any other nteract project, you agree to the guidelines outlined in this Code of Conduct.

Please acknowledge the combined hundreds of hours of time that maintainers invest in this project, despite diminutive comments, and lend everyone kindness in the conversation.

@willingc
Copy link
Member Author

It is perfectly reasonable to be opinionated (e.g. https://github.com/psf/black) provided you say so clearly and in advance. This issue could have been closed a long time ago, or at least a lot of debate avoided.

As a maintainer of black and papermill, I will go ahead and add that papermill is also opinionated.

As for the comments, I'm disappointed that folks post disrespectful, arrogant, and entitled statements. I'm sincerely sorry that you disagree with our decisions and approach. You are welcome to fork the project, but bullying the maintainers is rarely a road to success in open source.

@MSeal
Copy link
Member

MSeal commented May 19, 2020

In discussing the topic in today's weekly nteract meeting, I proposed that we could mark cells to be commented instead of strictly skipped by tag. This would preserve the execution end-to-end without downstream systems needing to understand, leave the code which was skipped in the document, and would mimic how a user would temporarily skip some extra code in a notebook before hitting Run All.

The other option would be to remove the cells being marked, though some of the earlier comments indicated that this would be non-ideal for their use-case.

Thoughts?

@ODemidenko
Copy link

Why not to have both options simultaneously: with one tag cell code will be commented out, with another - it will be removed?
Right now I have only use-cases where it is better to completely remove some cells.
But commenting out will be ok as well.

@MSeal
Copy link
Member

MSeal commented May 21, 2020

Noted. Probably possible to implement both if it's done.

@ianhellstrom
Copy link

@MSeal In my use cases commenting out and removal will both be fine, so I don't have a preference. I would like to see this implemented though as I think it provides good value, especially in data science.

@MSeal
Copy link
Member

MSeal commented Jun 11, 2020

Ok we're going to start with --remove-tagged-cells={tag}, not as a regex but a fixed string match to keep it as simple as possible. I'm opting for remove over skip to imply the output won't contain cells with the given tag. Later on we can add for support commenting functionality via --comment-tagged-cells={tag} if someone still needs that beyond the removal feature.

@mogwai
Copy link

mogwai commented Dec 8, 2020

Fantastic project guys, I tried to write a simple script and hit all sorts of problems with running code as part of tests but papermill made it very easy.

We're using this in fastaudio at the moment to verify the tutorial notebooks aren't broken. The problem is, these notebooks have long running executing code in them that we don't want running in the CI. For example we train models and download large datasets and we don't want that happening in the CI. What we do need to happen, is a quick check of syntax and that the code runs. We've created a method for this:

https://github.com/fastaudio/fastaudio/blob/78bfa46113998eb9b401a98f1b90cfcb8cab0884/src/fastaudio/ci.py#L9-L16

I feel that this feature belongs as a part of papermill. A way of skipping cells based on a state.

@knkumar
Copy link

knkumar commented Feb 12, 2021

+1
would it be possible to parametrize execution of a cell, i.e., update the notebook cell tag based on a variable? I can personally avoid using lot of if statements to control execution flow.

@amardeep
Copy link

Any update on this? I see a message from June 2020 about implementing --remove-tagged-cells={tag} option. Is this still planned and being worked on?

@casperdcl
Copy link
Member

@ianhellstrom @mogwai @knkumar @amardeep you could try the implementation from #403

pip install "git+git://github.com/casperdcl/papermill@tag_regex#egg=papermill"

which will give you:

  -t, --tag-include-regex TEXT    Regular expression for tags of code cells to
                                  include. If specified, will skip untagged
                                  cells.

  -T, --tag-exclude-regex TEXT    Regular expression for tags of code cells to
                                  remove. Takes precedence if in conflict with
                                  `--tag-include-regex`.

diff from current nteract/papermill@main | CI

@MSeal
Copy link
Member

MSeal commented Sep 27, 2021

Any update on this? I see a message from June 2020 about implementing --remove-tagged-cells={tag} option. Is this still planned and being worked on?

Yes, I just had a child and got really distracted on Open Source issues. If someone wants to add a PR with the described approach I'd approve and merge. Otherwise I will put this back on my radar to tackle.

@eromoe
Copy link

eromoe commented Jan 31, 2023

@MSeal Does remove-tagged-cells implement in pm.execute_notebook ? I didn't find it .

@MSeal
Copy link
Member

MSeal commented Jan 31, 2023

No I have not implemented it. Covid + first child killed a lot of my available time. Happy to help merge a PR if someone opens it up though

@edublancas
Copy link
Contributor

my team works on a similar project, and we implemented the remove-tagged-cells feature:

pip install ploomber-engine
ploomber-engine in.ipynb out.ipynb --remove-tagged-cells remove

or

from ploomber_engine import execute_notebook

execute_notebook("in.ipynb", "out.ipynb", remove_tagged_cells="remove")

Docs

@amardeep
Copy link

my team works on a similar project, and we implemented the remove-tagged-cells feature:
Docs

Thanks for the quick implementation Eduardo.

Here is the relevant issue if others would like to chime in with their use cases:
ploomber/ploomber-engine#37

@astrofrog
Copy link

I have implemented @MSeal's suggestion in #738

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement reference:extensions information and ideas about extensions
Projects
None yet