Support gaps for notebooks: Python source task with a notebook upstream chokes trying to deserialize 'nb' part of upstream
#1097
Can confirm. By changing my un/de-serializer definition to:

```python
@unserializer({".h5": _hdfs_to_df, ".html": lambda product: None},
              fallback=True, defaults=defaults)
def default_deserializer(product):
    pass
```

the error is avoided.

Suggestion: lazy load the products of any upstream. That way, upstreams with multiple products that are only partially consumed by downstream tasks don't incur unnecessary deserialization and memory overhead, and "peripheral" products like 'nb' are never deserialized.
hi @marr75, sorry I missed this! Looks like you found a workaround and things are working for you, right? I agree that lazy loading is the best option, but it'd involve more work; ignoring the notebook product is a simpler alternative. I'll discuss with the team and see if we can allocate some resources to this, but if you have time, feel free to submit a PR. Since you don't care about the output notebook, you could switch to the

I noticed the following in your pipeline.yaml:

```yaml
# notebook task, uses jupysql which was poorly supported in the .py notebook(?)
```

Can you open another issue and provide more details? My guess is that the problem is with jupytext, not with ploomber itself. jupysql powers some of our internal pipelines, and I remember encountering issues because jupytext failed to do a roundtrip conversion of scripts with
Related to #1088
I've been working on a reference ploomber pipeline for my team on and off, and I've encountered an issue using serializer/unserializer (which I think should be the default practice to support IoC in pipelines) together with mixed notebook/script and Python-source tasks.
pipeline.yaml as follows:
right now, combine-data is just:
It executes with no issue when resolve-columns-2 is the upstream. It produces the following error when resolve-columns is the upstream:
I'm fairly certain the error is related to trying to use pickle to unserialize the 'nb' sub-product of the resolve-columns upstream (unpickling and HTML aren't friends). I can work around it by not using notebooks, but this would really limit adoption in my org if notebooks were never an option, or if notebooks couldn't be used with Python callables and IoC features.
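The failure mode is easy to reproduce outside Ploomber: `pickle.loads` rejects any input that doesn't start with a valid pickle opcode, which is exactly what happens when a pickle-based fallback deserializer is pointed at the rendered HTML notebook. A minimal standalone illustration (the HTML bytes here are made up):

```python
import pickle

# stand-in for the 'nb' product: an HTML file, not a pickle file
html = b"<html><body>rendered notebook output</body></html>"

try:
    pickle.loads(html)  # '<' is not a valid pickle opcode
except pickle.UnpicklingError as exc:
    print(f"unpickling failed: {exc}")
```

This is why mapping `'.html'` to a no-op (or skipping the product entirely) sidesteps the crash: the HTML bytes never reach `pickle`.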