
🎉 Create function to import run from a previous version of a step #2410

Closed
wants to merge 3 commits

Conversation

Contributor

@pabloarosado pabloarosado commented Mar 13, 2024

Summary

Create a helper function to import the run function from a previous version of a step. This would let us avoid duplicating a lot of code every time we update a data step without changing its code.

Motivation

The ETL currently relies on having a recipe for each dataset. But this means that we end up having many files with the exact same content.

Having duplicated code is not only bad when refactoring or upgrading libraries; it is also inconvenient when searching for previous code, and it can be quite confusing when many files have the same content.

I think it's very valuable to have easy access to multiple versions of a dataset, and its code. But, because of the need to duplicate code, one may be less keen to keep different versions, and prefer to use "latest" (there are other reasons why "latest" is convenient, but that's out of the scope of this PR).

Solution

This PR would let us have data steps with the following content:

```python
from etl.helpers import load_run_from_previous_version

run = load_run_from_previous_version(__file__, "2024-01-01")
```

This way, we would avoid duplicated code. And it would also make it more evident when different versions of a dataset have exactly the same data processing.
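For illustration, such a helper could be sketched with importlib. This is a minimal sketch, not the actual etl.helpers implementation; the version-folder layout (steps living in folders named after their version, e.g. `.../namespace/2024-03-13/step.py`) is an assumption based on the snippet above:

```python
import importlib.util
from pathlib import Path


def load_run_from_previous_version(current_file: str, previous_version: str):
    """Load the `run` function from the same step file in a previous version folder.

    Hypothetical sketch: assumes steps live under version-named folders,
    e.g. .../namespace/2024-03-13/step.py.
    """
    current = Path(current_file).resolve()
    # Swap the version folder (the parent directory) for the previous version.
    previous_file = current.parent.parent / previous_version / current.name
    if not previous_file.exists():
        # Fail early if the old step was moved or archived.
        raise FileNotFoundError(f"Previous version not found: {previous_file}")
    spec = importlib.util.spec_from_file_location(
        f"{current.stem}_{previous_version}", previous_file
    )
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.run
```

Loading the module directly from its file path (rather than via the package import system) is what lets the new step reuse the old code while keeping its own dag dependencies.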

Possible issues

  • Would the new step be loading the correct dependencies? I've checked that, indeed, the new run module imports the correct dependencies stated in the dag (and not the old ones).
  • I've also checked that, if the old step imports other modules within the same folder (e.g. shared.py), the new step still runs well.
  • (Unlikely) If one accidentally archived the old file, the new step would fail to run. We could find a way to raise an error when this happens, instead of waiting for ETL to fail. But even if ETL fails, it would not be a big deal (we would easily find the issue).
  • If one edits the old file, the new step will change too. But (1) one has no direct way to be aware of that, and (2) ETL will not consider the new step as "dirty".
  • What if I just want to edit the new metadata? I've checked that using load_run_from_previous_version with a new metadata file inside the new folder works as desired: the old code runs with the new dependencies and loads the new metadata. This works as long as the metadata is loaded with create_dataset, and not imported directly with paths.metadata_file (see next point).
  • What if I just want to edit the new country harmonization? Unlike with metadata, a new .countries.json file in the new folder is ignored. I think that the underlying reason is that the paths = PathFinder(__file__) object defined in the old file is still loaded in the new file, and therefore all paths are the old ones.

Given these caveats, with the current implementation, load_run_from_previous_version can safely be used in those cases where the code is exactly duplicated (including metadata and countries). In any other case, if someone wants to edit minor things, it may or may not work, so it's safest to duplicate code and edit it.

@pabloarosado pabloarosado self-assigned this Mar 13, 2024
@pabloarosado pabloarosado requested review from Marigold and larsyencken and removed request for Marigold March 13, 2024 10:00
@pabloarosado pabloarosado marked this pull request as ready for review March 13, 2024 10:00
@Marigold
Collaborator

I'd kill for being able to import from other versions... I think it'd also be useful to load other (helper) functions from the module. For instance:

```python
from etl.helpers import load_module_from_previous_version

last_version = load_module_from_previous_version(__file__, "2024-01-01")
last_version.my_func()
```

Also, having an optional module name would be nice in case it differs from the current module's name (I ran into this when trying to import dummy.py from dummy_monster.py).

This is definitely an improvement over the current status quo, but I'd try to brainstorm whether we could do even better than this and use native imports, i.e. have a way to write `from ..2024-01-01 import run`. Some ad hoc ideas:

  • Can we name folders v2024_01_01 instead of 2024-01-01?
  • Perhaps we could create symlinks from importable paths (e.g. that v2024_01_01) to our current paths
  • ?
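The symlink idea could look something like this (a sketch under assumptions: the alias naming scheme `v2024_01_01` and a flat folder of version directories are hypothetical):

```python
from pathlib import Path


def create_import_aliases(channel_dir: str) -> None:
    """Create import-friendly symlinks (v2024_01_01 -> 2024-01-01) next to
    each version folder, so native Python imports could resolve them.

    Hypothetical sketch; folder layout and naming convention are assumptions.
    """
    for version_dir in sorted(Path(channel_dir).iterdir()):
        if version_dir.is_dir() and version_dir.name[:1].isdigit():
            alias = version_dir.with_name("v" + version_dir.name.replace("-", "_"))
            if not alias.exists():
                # Relative symlink, so the tree stays relocatable.
                alias.symlink_to(version_dir.name)
```

With aliases like these in place (plus `__init__.py` files), `from ..v2024_01_01 import run` would become a valid import path.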

cc @larsyencken if you have any ideas

We obviously don't have to do this now, so I'm up for merging this workaround.

Collaborator

@Marigold Marigold left a comment


Super useful. Left a comment there, but it shouldn't block this.

@larsyencken
Collaborator

I don't think I've understood this well enough, although we did talk about it.

The problem it's solving is the inconvenience of having duplicated code and having to identify the right version when searching in VSCode, if I've understood correctly.

The problem it introduces is that you can no longer delete a step, since other steps might be using its code; and if you're using another step's code, you now depend on that code and have to include it in your manifest when checksumming the step. That seems like quite a downside to swallow.

@Marigold What's the use case here that you'd kill for?

@Marigold
Collaborator

> you now depend on that code and have to include it in your manifest for checksumming the step

Oh right, I totally forgot. That changes everything then.

> What's the use case here that you'd kill for?

I might have been too dramatic with that. If the checksum weren't the problem (which it is...), I'd find it really valuable for creating new versions of data and DRYing a lot of code. I guess we can still symlink the shared.py module from the previous version.

@larsyencken
Collaborator

I think you hit upon my preferred approach, if we are keen on sharing more code: put all the logic in a shared.py module. If it needs to be shared even more broadly, push shared.py up the tree, and catch it in checksums. That would only be a minor change to the checksum logic.

If it's above the version folders, then you need to import it from the other side (`from etl.data.steps.garden.faostat import shared`).

It still has the side-effect that if you archive a step, it's no longer an isolated and standalone set of code.

Overall, I feel that DRY is less important than reproducibility, so I still lean towards just having multiple copies of code around for older versions of datasets.

@pabloarosado
Contributor Author

I have moved this discussion to the issue: #2419

@lucasrodes lucasrodes changed the title Create function to import run from a previous version of a step 🎉 Create function to import run from a previous version of a step Mar 18, 2024

stale bot commented May 18, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label May 18, 2024
@stale stale bot closed this May 25, 2024