🎉 Create function to import run from a previous version of a step #2410
Summary
Create a helper function to import the `run` function from a previous version. This could let us avoid duplicating a lot of code every time we update a data step with no changes in the code.
Motivation
The ETL currently relies on having a recipe for each dataset. But this means that we end up having many files with the exact same content.
Having duplicated code is not only bad when refactoring or upgrading libraries; it is also inconvenient when searching for previous code, and it can be quite confusing when many files have the same content.
I think it's very valuable to have easy access to multiple versions of a dataset and its code. But because of the need to duplicate code, one may be less keen to keep different versions and prefer to use "latest" (there are other reasons why "latest" is convenient, but those are out of the scope of this PR).
Solution
This PR would let us have data steps with the following content:
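As a purely illustrative sketch (the helper's import path and exact signature here are assumptions, not necessarily the PR's actual API), a new data step could reduce to something like:

```python
# Hypothetical content of a new data step that re-uses the previous version's code.
from etl.helpers import load_run_from_previous_version  # assumed import location

# Load the run() function defined by the same step in the previous version.
run = load_run_from_previous_version(__file__)
```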
This way, we would avoid duplicated code. And it would also make it more evident when different versions of a dataset have exactly the same data processing.
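One way such a helper could work, as a self-contained sketch (the real helper in this PR may differ; the folder layout `.../<namespace>/<version>/<step>.py` and the signature are assumptions): given the current step's file, load the module with the same name from a sibling folder named after the previous version, and return its `run` function.

```python
import importlib.util
from pathlib import Path


def load_run_from_previous_version(step_file: str, previous_version: str):
    """Hypothetical sketch: return run() from the same step in a previous version.

    Assumes steps are laid out as .../<namespace>/<version>/<step>.py, so the
    previous version's file sits in a sibling folder of the current one.
    """
    current = Path(step_file)
    previous_file = current.parent.parent / previous_version / current.name
    # Load the old file as a module and hand back its run function.
    spec = importlib.util.spec_from_file_location(
        f"previous_{current.stem}", previous_file
    )
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module.run
```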
Possible issues
- The `run` module imports the correct dependencies stated in the dag (and not the old ones).
- If the old code uses auxiliary modules (e.g. `shared.py`), the new step still runs well.
- Using `load_run_from_previous_version` and having a new metadata file inside the new folder works as desired: the old code runs with the new dependencies and loads the new metadata. This works as long as the metadata is loaded with `create_dataset`, and not imported directly with `paths.metadata_file` (see next point).
- A new `.countries.json` file in the new folder is ignored. I think that the underlying reason is that the `paths = PathFinder(__file__)` object defined in the old file is still loaded in the new file, and therefore all paths are the old ones.

Given these caveats, with the current implementation, `load_run_from_previous_version` can safely be used in those cases where the code is exactly duplicated (including metadata and countries). In any other case, if someone wants to edit minor things, it may or may not work, so it's safest to duplicate the code and edit it.
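The `PathFinder` caveat can be demonstrated without the ETL codebase at all: a module-level value computed from `__file__` belongs to the file that defines it, so code loaded from the old version's folder keeps resolving paths relative to the old folder. A minimal sketch using plain `importlib` (the file names here are made up for illustration):

```python
import importlib.util
import tempfile
from pathlib import Path

base = Path(tempfile.mkdtemp())
old, new = base / "2023-01-01", base / "2023-05-16"
old.mkdir()
new.mkdir()

# "Old" step: defines STEP_DIR at module level, analogous to
# `paths = PathFinder(__file__)` in a real data step.
(old / "step.py").write_text(
    "from pathlib import Path\n"
    "STEP_DIR = Path(__file__).parent\n"
    "def run():\n"
    "    return STEP_DIR\n"
)

# "New" step: re-uses run() by loading the old module from disk.
spec = importlib.util.spec_from_file_location("old_step", old / "step.py")
old_step = importlib.util.module_from_spec(spec)
spec.loader.exec_module(old_step)
run = old_step.run

# run() still resolves paths against the old version's directory,
# which is why a new .countries.json next to the new file is ignored.
print(run() == old)  # True
```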