
🎉 Create function to import run from a previous version of a step #2410

Closed
wants to merge 3 commits

Conversation

Contributor

@pabloarosado pabloarosado commented Mar 13, 2024

Summary

Create a helper function to import the run function from a previous version of a step. This would let us avoid duplicating a lot of code every time we update a data step without changing its code.

Motivation

The ETL currently relies on having a recipe for each dataset. But this means that we end up having many files with the exact same content.

Having duplicated code is not only bad when refactoring or upgrading libraries; it is also inconvenient when searching for previous code, and it can be quite confusing when many files have the same content.

I think it's very valuable to have easy access to multiple versions of a dataset, and its code. But, because of the need to duplicate code, one may be less keen to keep different versions, and prefer to use "latest" (there are other reasons why "latest" is convenient, but that's out of the scope of this PR).

Solution

This PR would let us have data steps with the following content:

```python
from etl.helpers import load_run_from_previous_version

run = load_run_from_previous_version(__file__, "2024-01-01")
```

This way, we would avoid duplicated code. And it would also make it more evident when different versions of a dataset have exactly the same data processing.
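For illustration, such a helper could be sketched with importlib. This is a minimal sketch, not the actual etl.helpers implementation; the version-folder layout (steps living in folders named after their version, e.g. `.../namespace/2024-03-13/step.py`) is an assumption based on the snippet above:

```python
import importlib.util
from pathlib import Path


def load_run_from_previous_version(current_file: str, previous_version: str):
    """Load the `run` function from the same step file in a previous version folder.

    Hypothetical sketch: assumes steps live under version-named folders,
    e.g. .../namespace/2024-03-13/step.py.
    """
    current = Path(current_file).resolve()
    # Swap the version folder (the parent directory) for the previous version.
    previous_file = current.parent.parent / previous_version / current.name
    if not previous_file.exists():
        # Fail early if the old step was moved or archived.
        raise FileNotFoundError(f"Previous version not found: {previous_file}")
    spec = importlib.util.spec_from_file_location(
        f"{current.stem}_{previous_version}", previous_file
    )
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.run
```

Loading the module directly from its file path (rather than via the package import system) is what lets the new step reuse the old code while keeping its own dag dependencies.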

Possible issues

  • Would the new step be loading the correct dependencies? I've checked that, indeed, the new run module imports the correct dependencies stated in the dag (and not the old ones).
  • I've also checked that, if the old step imports other modules within the same folder (e.g. shared.py), the new step still runs well.
  • (Unlikely) If one accidentally archived the old file, the new step would fail to run. We could find a way to raise an error when this happens, instead of waiting for ETL to fail. But even if ETL fails, it would not be a big deal (we would easily find the issue).
  • If one edits the old file, the new step will change too. But (1) one has no direct way to be aware of that, and (2) ETL will not consider the new step as "dirty".
  • What if I just want to edit the new metadata? I've checked that using load_run_from_previous_version with a new metadata file inside the new folder works as desired: the old code runs with the new dependencies and loads the new metadata. This works as long as the metadata is loaded with create_dataset, and not imported directly with paths.metadata_file (see next point).
  • What if I just want to edit the new country harmonization? Unlike with metadata, a new .countries.json file in the new folder is ignored. I think that the underlying reason is that the paths = PathFinder(__file__) object defined in the old file is still loaded in the new file, and therefore all paths are the old ones.

Given these caveats, with the current implementation, load_run_from_previous_version can safely be used in those cases where the code is exactly duplicated (including metadata and countries). In any other case, if someone wants to edit minor things, it may or may not work, so it's safest to duplicate code and edit it.

@pabloarosado pabloarosado self-assigned this Mar 13, 2024
@pabloarosado pabloarosado requested review from Marigold and larsyencken and removed request for Marigold March 13, 2024 10:00
@pabloarosado pabloarosado marked this pull request as ready for review March 13, 2024 10:00
@Marigold
Collaborator

I'd kill for being able to import from other versions... I think it'd also be useful to load other (helper) functions from the module. For instance:

```python
from etl.helpers import load_module_from_previous_version

last_version = load_module_from_previous_version(__file__, "2024-01-01")
last_version.my_func()
```

Also, having an optional module name would be nice in case it differs from the current module's name (I ran into this when trying to import dummy.py from dummy_monster.py).

This is definitely an improvement over the current status quo, but I'd try to brainstorm whether we could do even better than this and use native imports, i.e. have a way to write `from ..2024-01-01 import run`. Some ad hoc ideas:

  • Can we name folders v2024_01_01 instead of 2024-01-01?
  • Perhaps we could create symlinks from importable paths (e.g. that v2024_01_01) to our current paths
  • ?
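The symlink idea could look something like this (a sketch under assumptions: the alias naming scheme `v2024_01_01` and a flat folder of version directories are hypothetical):

```python
from pathlib import Path


def create_import_aliases(channel_dir: str) -> None:
    """Create import-friendly symlinks (v2024_01_01 -> 2024-01-01) next to
    each version folder, so native Python imports could resolve them.

    Hypothetical sketch; folder layout and naming convention are assumptions.
    """
    for version_dir in sorted(Path(channel_dir).iterdir()):
        if version_dir.is_dir() and version_dir.name[:1].isdigit():
            alias = version_dir.with_name("v" + version_dir.name.replace("-", "_"))
            if not alias.exists():
                # Relative symlink, so the tree stays relocatable.
                alias.symlink_to(version_dir.name)
```

With aliases like these in place (plus `__init__.py` files), `from ..v2024_01_01 import run` would become a valid import path.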

cc @larsyencken if you have any ideas

We obviously don't have to do this now, so I'm up for merging this workaround.

Collaborator

@Marigold Marigold left a comment


Super useful. Left a comment there, but it shouldn't block this.

@larsyencken
Collaborator

I don't think I've understood this well enough, although we did talk about it.

The problem it's solving is the inconvenience of having duplicated code and having to identify the right version when searching in VSCode, if I've understood correctly.

The problem it introduces is that you can no longer delete a step, since other steps might be using its code; and if you're using another step's code, you now depend on that code and have to include it in your manifest when checksumming the step. That seems like quite a downside to swallow.

@Marigold What's the use case here that you'd kill for?

@Marigold
Collaborator

> you now depend on that code and have to include it in your manifest for checksumming the step

Oh right, I totally forgot. That changes everything then.

> What's the use case here that you'd kill for?

I might have been too dramatic with that. If the checksum weren't the problem (which it is...), I'd find it really valuable for creating new versions of data and DRYing a lot of code. I guess we can still symlink the shared.py module from the previous version.

@larsyencken
Collaborator

I think you hit upon my preferred approach, if we are keen on sharing more code: put all the logic in a shared.py module. If it needs to be shared even more broadly, push shared.py up the tree, and catch it in checksums. That would only be a minor change to the checksum logic.

If it's above the version folders, then you need to import it from the other side (`from etl.data.steps.garden.faostat import shared`).

It still has the side-effect that if you archive a step, it's no longer an isolated and standalone set of code.

Overall, I feel that DRY is less important than reproducibility, so I still lean towards just having multiple copies of code around for older versions of datasets.

@pabloarosado
Contributor Author

I have moved this discussion to the issue: #2419

@lucasrodes lucasrodes changed the title Create function to import run from a previous version of a step 🎉 Create function to import run from a previous version of a step Mar 18, 2024

stale bot commented May 18, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label May 18, 2024
@stale stale bot closed this May 25, 2024