
Dealing with dependencies in pyCSEP #192

Open
wsavran opened this issue Jun 8, 2022 · 8 comments

Comments

@wsavran
Collaborator

wsavran commented Jun 8, 2022

Working through the reproducibility packages and thinking about the testing experiments uncovered the need for a discussion on how we manage dependencies in pyCSEP.

Currently, we only pin dependencies when a conflict or issue is known; once the issue or conflict has been resolved, we remove the pin.

Pros of this approach:

  • Provides the most up-to-date versions of packages and dependencies
  • Plays much more nicely when users are trying to install pyCSEP alongside their working environment (this is the case for most normal users)

Cons of this approach:

  • The environment can drift over time, so simply choosing a version of pyCSEP is not enough to reproduce results
  • We have to deal with inevitable errors in pyCSEP caused by third-party incompatibilities (e.g., a new version of numpy removes a function used by matplotlib)

Goals:

  • Enable reproducible research using the software
  • Provide users with the ability to easily integrate pycsep into their working environment (i.e., although users should create their own environments, we don't want to create dependency issues when they try to install pycsep).

Possible ways to improve reproducibility of the computing environment:

  • Users could be responsible for providing a reproducible environment themselves, e.g., with reproducibility packages
  • Pin versions of dependencies within pycsep (see above) or use a min/max dependency approach (see the sketch after this list)
  • We could provide Docker images associated with each build to freeze the high- and low-level dependencies (this could be built into CI, and adds other options for where to store the environment and how to reproduce it exactly at any time).
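
For the min/max approach, a hypothetical requirements excerpt could look like the following (package names and version bounds are illustrative, not pyCSEP's actual constraints):

```
# Lower bound: oldest version known to work; upper bound: next major
# release, which may introduce breaking changes (bounds are illustrative)
numpy>=1.21,<2.0
matplotlib>=3.3,<4.0
cartopy>=0.20,<1.0
```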

@pjm-usc
Contributor

pjm-usc commented Jun 9, 2022

I believe it is important that each pyCSEP distribution provides a list of package versions that work. The release can leave the packages unspecified, but the user should be able to find a complete list of the combination of versions that works. Maybe this list of versions can be given in a Dockerfile, or with an environment summary from the CI stack.

@mherrmann3
Contributor

mherrmann3 commented Jun 10, 2022

Too bad that PEP 665 got rejected. Luckily, there will be a 'take 2'.
This discussion links to another approach, which unfortunately is also not implemented yet.
So we'll have to wait or do it ourselves.

I had exactly the same thought as Philip: For every release, we (you) could provide a pip freeze / conda env export only for the packages that pycsep requires (maybe within a requirements_pinned.txt/yml). Not elegant, but perhaps sufficient - it's a backup solution in case of dependency issues.
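
As a rough sketch, generating such a file during the release flow could be as simple as (file names follow the suggestion above):

```sh
# Capture the exact package versions from the release build's environment
pip freeze > requirements_pinned.txt        # for pip-based installs
conda env export > requirements_pinned.yml  # for conda-based installs
```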

@wsavran
Collaborator Author

wsavran commented Jun 10, 2022

I like that idea as well. This should be done by CI during the release flow, and it should also create and register a Docker image of the build. What is the reason behind not pinning all dependencies, but only the packages that pycsep requires?

@mherrmann3
Contributor

Great.

What is the reason behind not pinning all dependencies, but only the packages that pycsep requires?

I thought it would keep the requirements more compact. But it's possibly a bad idea: pinning only pycsep's direct dependencies (e.g., numpy) may not guarantee reproducibility as we intend it, since their own dependencies would remain unpinned. So yes, we'll have to report the versions of all packages.

@wsavran
Collaborator Author

wsavran commented Jun 11, 2022

I ask because that was my first thought as well, and how I implemented things for the first iteration of the global experiment. There are caveats, though. A pro of that solution is that it retains cross-platform support: we cannot guarantee the exact same environment across different operating systems using an exact environment specification (different binaries, etc.). I think Docker provides a good solution for cross-platform support.

We could maybe provide both a requirements.yml and an environment.yml, where the former provides pinned direct dependencies for pycsep and the latter provides an exact environment that will run on Ubuntu.
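
To make the distinction concrete, the two files could look roughly like this (versions are placeholders, not actual pyCSEP pins):

```yaml
# requirements.yml: pinned direct dependencies only (cross-platform)
dependencies:
  - numpy=1.22.*
  - matplotlib=3.5.*
  - cartopy=0.20.*

# environment.yml: exact environment from `conda env export` (Ubuntu only);
# the build strings tie each package to a specific platform, e.g.:
#   - numpy=1.22.4=py310h4ef5377_0
```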

@mherrmann3
Contributor

A similar approach is also suggested in this article: Reproducible and upgradable Conda environments with conda-lock

Essentially:

  1. we keep the environment.yml clean—with 'versioned direct' dependencies to be in control of upgrades;
  2. to specify a reproducible environment ('transitively pinned'/'locked dependencies'), call conda env export > environment.lock.yml.

BUT, more interestingly, the article proposes a solution for the several technical difficulties with conda env export in step 2 (most importantly, the possible cross-platform inconsistencies, which we currently circumvent by using Docker): conda-lock. Basically, it defines a set of URLs to download (also speeding up installs). Nice: "you can specify which operating system you want to build the lock file for, so you can create a Linux lock file on other operating systems. By default it generates for Linux, macOS, and 64-bit Windows out of the box".

So we can create kind-of platform-specific proxies for environment.lock.yml (e.g., conda-linux-64.lock, conda-osx-64.lock, conda-win-64.lock), which may allow us to completely abandon Docker (or similar) for reproducibility packages. 🤞

Cool: the conda environment can be created directly from this lock file: conda create --name fromlock --file conda-linux-64.lock.
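
Putting the pieces together, the whole workflow boils down to a few commands (the environment name is illustrative; newer conda-lock versions may need --kind explicit to emit the per-platform .lock files):

```sh
# Install conda-lock and render lock files from the clean environment.yml
# (by default it targets linux-64, osx-64, and win-64)
conda install -c conda-forge conda-lock
conda-lock -f environment.yml

# Recreate the exact environment on, e.g., a Linux machine
conda create --name fromlock --file conda-linux-64.lock
```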

@wsavran
Collaborator Author

wsavran commented Jun 15, 2022

I like that approach of creating platform-specific lock files; we can just do that on release and then provide folks with a reproducible installation. I still think Docker is a solid tool for sharing environments, but conda-lock is worth exploring. If we can do some of the legwork in setting up Dockerfiles, or at least provide a template, I'm pretty happy with the tool. There is also a tool called repo2docker, provided by Jupyter, that we can explore as well.
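
As a starting point, a minimal Dockerfile template consuming one of those lock files might look like this (base image, paths, and names are assumptions, not an agreed-upon setup):

```dockerfile
# Hypothetical template: bake the locked environment into an image
FROM condaforge/mambaforge:latest

# Recreate the locked environment exactly as resolved at release time
COPY conda-linux-64.lock /tmp/conda-linux-64.lock
RUN conda create --name pycsep --file /tmp/conda-linux-64.lock && \
    conda clean --all --yes

# Install pycsep itself without touching the locked dependencies
# (assumes pip is included in the locked environment)
COPY . /opt/pycsep
RUN conda run --name pycsep python -m pip install --no-deps /opt/pycsep

CMD ["conda", "run", "--name", "pycsep", "python"]
```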

@pabloitu
Collaborator

pabloitu commented Nov 8, 2022

I wonder if we could provide wheels for the required dependencies, which could also deal with cartopy/pygeos issues.
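
For reference, one way to sketch that with pip's wheel support (the wheelhouse/ directory and requirements file name are illustrative):

```sh
# Build wheels for all pinned dependencies into a local wheelhouse
pip wheel -r requirements_pinned.txt -w wheelhouse/

# Install pycsep offline, using only the pre-built wheels
pip install --no-index --find-links=wheelhouse/ pycsep
```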
