Add support for OAR Scheduler #1744

Open
wants to merge 10 commits into
base: main

ychiat35

The OAR scheduler is widely used in France, including on mesocentre supercomputers (e.g., GRICAD), INRIA supercomputers, the Grid5000 testbed and other platforms.

This PR adds support for the OAR Scheduler as a plugin. Four main classes have been implemented in oar.py (following the previous implementation made for slurm):

  • OarInfoWatcher: Retrieves job status using the oarstat command (similar to the sinfo command on the Slurm scheduler).
  • OarJob: Represents an OAR job.
  • OarExecutor: Contains the parameters to submit a job on OAR.
  • OarJobEnvironment: Provides mappings with OAR environment variables such as job_id, nodes or array_task_id.

Unit tests were created in test_oar.py and test_auto.py to ensure that the OAR plugin offers the same basic functionalities as the Slurm plugin.

A few notes about the implementation:

  • In the OarExecutor class, OAR parameters are matched with those of submitit (using the _equivalence_dict dictionary). Additional OAR parameters can be set with the additional_parameters dictionary.
  • As the OAR submission file must exist at both the submission time AND at the job launch time (not the case for Slurm), the _make_submission_command method in the OarExecutor class is overridden from PicklingExecutor. The content of the file is read and the job is submitted using the OAR "inline command" instead of using the submission file.
  • OAR job state names differ from the submitit ones. We map the OAR state names to the submitit ones to unify statuses across plugins.
  • Resuming preempted jobs differs between OAR and Slurm. The OAR equivalent command for scontrol (i.e., oarsub) is not available on nodes. To automatically requeue the job after preemption, the original job must be submitted with the idempotent type and must exit with code 99.
  • For job arrays, OAR does not provide a feature to limit the number of concurrently executed jobs, so that limit is not implemented.
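
The parameter and state mappings described in the notes above can be sketched as follows. This is only an illustration: the table contents, the function names `to_oarsub_options` and `to_submitit_state`, and the submitit-side state strings are assumptions, since the actual `_equivalence_dict` and state table from oar.py are not shown in this thread.

```python
from typing import Any, Dict, List, Optional

# Assumed, illustrative tables; the real ones in oar.py may differ.
PARAM_EQUIVALENCE = {"name": "n", "queue": "q"}  # submitit-style name -> oarsub option
STATE_EQUIVALENCE = {
    "Waiting": "PENDING",
    "Running": "RUNNING",
    "Terminated": "COMPLETED",
    "Error": "FAILED",
}
# Exit code that, per the notes above, asks OAR to requeue an idempotent job.
REQUEUE_EXIT_CODE = 99


def to_oarsub_options(
    params: Dict[str, Any],
    additional_parameters: Optional[Dict[str, Any]] = None,
) -> List[str]:
    """Translate generic parameters into a flat list of command-line options."""
    # Rename known parameters, pass unknown ones through unchanged.
    merged = {PARAM_EQUIVALENCE.get(key, key): value for key, value in params.items()}
    # Scheduler-specific extras are appended (and override) last.
    merged.update(additional_parameters or {})
    options: List[str] = []
    for key, value in merged.items():
        flag = f"-{key}" if len(key) == 1 else f"--{key}"
        options.extend([flag, str(value)])
    return options


def to_submitit_state(oar_state: str) -> str:
    """Map an OAR state name to a unified state, defaulting to UNKNOWN."""
    return STATE_EQUIVALENCE.get(oar_state, "UNKNOWN")
```

For example, `to_oarsub_options({"name": "train"}, {"t": "idempotent"})` produces the options for a named job submitted with the idempotent type, which is the precondition for the exit-code-99 requeue.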

Our OAR plugin covers most of submitit's features (e.g., job submission, checkpointing, job arrays). The only feature we did not address is task submission: unlike Slurm, OAR does not provide such a feature. We believe a workaround could be implemented in a later iteration. Meanwhile, we raise a "NotImplemented" error if a user attempts to use this feature.
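
A minimal sketch of such a guard is shown below. The function name `check_oar_parameters` is hypothetical, and treating `tasks_per_node` as the parameter that triggers the error is an assumption; the thread does not show how the plugin detects a task request.

```python
from typing import Any, Dict


def check_oar_parameters(params: Dict[str, Any]) -> None:
    """Reject task-level parallelism, which this sketch assumes OAR cannot express."""
    # Assumption: more than one task per node is what counts as "task submission".
    tasks = params.get("tasks_per_node")
    if tasks is not None and tasks != 1:
        raise NotImplementedError(
            f"tasks_per_node={tasks} is not supported by the OAR plugin"
        )
```

Failing at parameter-validation time, rather than at job launch, lets the user see the limitation before anything is submitted to the cluster.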

@facebook-github-bot added the CLA Signed label Sep 25, 2023
@gwenzek gwenzek self-requested a review September 25, 2023 10:15
@gwenzek
Contributor

gwenzek commented Oct 5, 2023

Hi, thanks for contributing this.
The code looks good to me, but I don't have access to an OAR cluster to test it out, and won't have the knowledge to answer questions about OAR if users have issues.
So I'd rather have this code in a separate repository.
submitit does in fact have a plugin system that allows that. The process isn't documented because you're actually the first external user to make such a PR, but we already have a Meta internal plugin.

The steps are as follows:

  • Create a new repo with the "oar" folder you created.
  • Rename "oar" to "submitit_oar" (this will be the Python package name)
  • In setup.py, declare an entry point:
setup(
    name="submitit_oar",
    install_requires=["submitit>=1.4.6"],
    ...
    entry_points={
        "submitit": "\n".join(
            [
                "",
                "executor = submitit_oar:OarExecutor",
                "job_environment = submitit_oar:OarJobEnvironment",
                "",
            ]
        )
    },
    zip_safe=False,
)
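
To check that the entry point is visible once the package is installed, something like the following can be used. This is a sketch using the standard library's `importlib.metadata`; the helper name `list_plugin_entry_points` is hypothetical, and this is not necessarily how submitit itself discovers plugins.

```python
from importlib.metadata import entry_points
from typing import List, Tuple


def list_plugin_entry_points(group: str = "submitit") -> List[Tuple[str, str]]:
    """Return (name, target) pairs for every installed entry point in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ returns an object with a select() method.
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a plain dict keyed by group name.
        selected = eps.get(group, [])
    # A list is used because several plugins may share the same entry name.
    return [(ep.name, ep.value) for ep in selected]


if __name__ == "__main__":
    for name, target in list_plugin_entry_points():
        print(f"{name} -> {target}")
```

With submitit_oar installed, the output would be expected to include the `executor` and `job_environment` entries declared above.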

If all works well, we can add an entry in the readme that points to your plugin.

@ychiat35
Author

Hello, thanks for your review and your proposal about the plugin. Here is the repository: https://github.com/ychiat35/submitit_oar. I will try to add some CI/CD actions for tests and package releases.

About this point:

The code looks good to me, but I don't have access to an OAR cluster to test it out, and won't have the knowledge to answer questions about OAR if users have issues.

Have you thought about adding CI tests for OAR (and Slurm), similar to what is done for Slurm and SGE clusters in the Dask-jobqueue repository: https://github.com/dask/dask-jobqueue/blob/main/ci/slurm/docker-compose.yml ? It may be a good way to test real jobs launched on OAR/Slurm clusters.

@ychiat35
Author

Hello,

We'd like to inform you that we have moved the submitit_oar plugin into the Grid5000 repositories, at this link: Grid5000/submitit_oar. Additionally, we have released a new version of the plugin on PyPI, accessible here: submitit_oar 1.1.1.

The integration of the submitit_oar plugin has been smooth, and it fits well with submitit's plugin system.

To finalize the pull request, we'd like to confirm whether you're still fine with us submitting a PR to update the readme to mention our plugin.

Thanks a lot for your feedback.


4 participants