Introducing 'Scouting' Status in WMS state transitions #7083

Open
michmx opened this issue Jun 28, 2023 · 3 comments · May be fixed by #7251

michmx commented Jun 28, 2023

In our BelleDIRAC extension, we define scout jobs as a small subset of the main jobs that runs first; the rest of the jobs are executed only once scouting is done. Main jobs have the status ‘Scouting’ while they wait for that subset to run. Scout jobs were presented at the DIRAC Users Workshop in 2021 (link to contribution).

Starting with DIRAC 7.3, the WMS changed how job statuses are updated: the allowed state transitions are now explicitly defined (JobStatus.py#L82).

When we migrated our system to DIRAC 7.3, we noticed error messages like:

2023-03-09 18:20:12 UTC WorkloadManagement/OptimizationMind ERROR: There was a problem processing task 213431:
getNextState: 'Scouting' is not a valid state

Our Scouting state is very similar to Staging: jobs stay in it, before reaching Waiting, until certain conditions are fulfilled. The transitions involving Scouting are:

  • RECEIVED → SCOUTING
  • SCOUTING → CHECKING, WAITING, KILLED, FAILED, STALLED
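
As a rough illustration, the two rules above can be written as a small lookup table. This is only a sketch: `SCOUTING_TRANSITIONS`, `is_allowed` and `path_is_allowed` are hypothetical names, not part of DIRAC, and the table deliberately contains nothing beyond the transitions listed here.

```python
# Job status names as they appear in the WMS ('Scouting' is spelled as in the log).
RECEIVED, SCOUTING, CHECKING, WAITING = "Received", "Scouting", "Checking", "Waiting"
KILLED, FAILED, STALLED = "Killed", "Failed", "Stalled"

# Hypothetical table holding ONLY the transitions listed in this issue,
# not the full WMS transition table.
SCOUTING_TRANSITIONS = {
    RECEIVED: {SCOUTING},
    SCOUTING: {CHECKING, WAITING, KILLED, FAILED, STALLED},
}


def is_allowed(current, candidate):
    """Return True if current -> candidate is one of the listed transitions."""
    return candidate in SCOUTING_TRANSITIONS.get(current, set())


def path_is_allowed(path):
    """Return True if every consecutive pair of states in path is allowed."""
    return all(is_allowed(a, b) for a, b in zip(path, path[1:]))


if __name__ == "__main__":
    print(path_is_allowed([RECEIVED, SCOUTING, CHECKING]))  # listed transitions only
    print(is_allowed(SCOUTING, RECEIVED))                   # going back is not listed
```

A job can enter Scouting only from Received in this table, and it leaves either towards the normal optimizer chain (Checking, Waiting) or towards a final or error state.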

If you agree, we would need to:

  1. Define the state “Scouting” at JOB_STATES:
    https://github.com/DIRACGrid/DIRAC/blob/rel-v7r3/src/DIRAC/WorkloadManagementSystem/Client/JobStatus.py#L48

  2. Enable the transitions

    SCOUTING: State(2, [CHECKING, WAITING, FAILED, STALLED, KILLED], defState=SCOUTING),
    CHECKING: State(2, [SCOUTING, STAGING, WAITING, RESCHEDULED, FAILED, DELETED], defState=CHECKING),
    RECEIVED: State(1, [SCOUTING, CHECKING, WAITING, FAILED, DELETED], defState=RECEIVED),
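
To show the effect of these three entries, here is a self-contained sketch. The `State` and `MiniStateMachine` classes below are simplified stand-ins written for this example, not the DIRAC classes; in particular, falling back to `defState` for a disallowed transition and the exact error text are assumptions modelled on the log message above. States that appear only as targets are filled in with placeholder entries.

```python
RECEIVED, SCOUTING, CHECKING, WAITING = "Received", "Scouting", "Checking", "Waiting"
STAGING, RESCHEDULED, DELETED = "Staging", "Rescheduled", "Deleted"
KILLED, FAILED, STALLED = "Killed", "Failed", "Stalled"


class State:
    """Stand-in state definition: a level, allowed targets and a default state."""

    def __init__(self, level, allowed=None, defState=None):
        self.level = level
        self.allowed = allowed or []
        self.default = defState

    def transition(self, candidate):
        # Assumed rule: a candidate that is not allowed falls back to the default.
        return candidate if candidate in self.allowed else self.default


class MiniStateMachine:
    """Stand-in state machine holding the current state and the state table."""

    def __init__(self, current, states):
        self.current = current
        self.states = states

    def get_next_state(self, candidate):
        if candidate not in self.states:
            raise ValueError(f"getNextState: '{candidate}' is not a valid state")
        return self.states[self.current].transition(candidate)


# The three entries proposed in this issue.
STATES = {
    SCOUTING: State(2, [CHECKING, WAITING, FAILED, STALLED, KILLED], defState=SCOUTING),
    CHECKING: State(2, [SCOUTING, STAGING, WAITING, RESCHEDULED, FAILED, DELETED], defState=CHECKING),
    RECEIVED: State(1, [SCOUTING, CHECKING, WAITING, FAILED, DELETED], defState=RECEIVED),
}
# Placeholder entries (no level, no outgoing transitions) so every target is known.
for state in list(STATES.values()):
    for target in state.allowed:
        STATES.setdefault(target, State(None, [], defState=target))

if __name__ == "__main__":
    print(MiniStateMachine(RECEIVED, STATES).get_next_state(SCOUTING))
    print(MiniStateMachine(SCOUTING, STATES).get_next_state(RECEIVED))
```

Removing the `SCOUTING` key from `STATES` makes `get_next_state(SCOUTING)` raise, which mirrors the "is not a valid state" error seen after the migration.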
Contributor

fstagni commented Jun 28, 2023

Hi,
IIUC (correct me!), in BelleDIRAC you developed a specific Optimizer (in addition to those in https://github.com/DIRACGrid/DIRAC/tree/rel-v7r3/src/DIRAC/WorkloadManagementSystem/Executor) that creates the "scouting jobs" and moves the "master job" status to SCOUTING.
Before I answer your question, I have one myself: is what you have done very Belle2 specific? (I have the impression it is...).

Contributor

iueda commented Jun 29, 2023

Yes and No.

"Scout jobs" are created at the job submission -- when a user submits a set of jobs, our client tool makes a smaller set of shorter jobs as "scout jobs" and submits them (the original and the scout) altogether.

Then, our BelleDIRAC Optimizer (in BelleDIRAC/WorkloadManagementSystem/Executor) changes the status of the original jobs to "Scouting" while waiting for the execution of the scout jobs.
We have an Agent that changes the status of the original jobs from "Scouting" to "Checking" so that they can go through the vanilla Optimizer.

The first one is Belle II specific, in the sense that we copy jobs and expect them to contain some Belle II specific scripts.
The latter two are supposed to be generic, as we have reported in the past.
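
A minimal sketch of the generic part (the Scouting → Checking step done by the Agent) could look like the following. It is not the BelleDIRAC code: the `Job` class, the `scout_ids` link and `release_scouted_jobs` are hypothetical, and the release condition used here (all scout jobs are Done) is an assumption made for the example.

```python
from dataclasses import dataclass, field

SCOUTING, CHECKING, DONE = "Scouting", "Checking", "Done"


@dataclass
class Job:
    """Hypothetical job record: ID, current status and the IDs of its scout jobs."""
    job_id: int
    status: str
    scout_ids: list = field(default_factory=list)


def release_scouted_jobs(jobs, scout_status):
    """Move jobs from Scouting to Checking once all their scout jobs are Done.

    scout_status maps a scout job ID to its current status.
    Assumption: "all scouts Done" is the release condition.
    Returns the IDs of the jobs that were released.
    """
    released = []
    for job in jobs:
        if job.status != SCOUTING:
            continue  # only jobs parked in Scouting are considered
        if job.scout_ids and all(scout_status.get(s) == DONE for s in job.scout_ids):
            job.status = CHECKING  # hand the job back to the vanilla Optimizer chain
            released.append(job.job_id)
    return released


if __name__ == "__main__":
    jobs = [Job(1, SCOUTING, [10, 11]), Job(2, SCOUTING, [12]), Job(3, "Waiting")]
    print(release_scouted_jobs(jobs, {10: DONE, 11: DONE, 12: "Running"}))
```

What should happen when scout jobs fail (for example moving the original jobs to Failed) is left out, since the condition is not spelled out in this thread.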

See the slides from the last DUW:
https://indico.cern.ch/event/1107386/contributions/4846372/

=====
Slide 10: What is included in BelleDIRAC
Extensions of Vanilla systems
WMS

  • Agents and Executor for Scout Jobs.

Slide 13: What else is included in BelleDIRAC
Features for end-users

  • Scout Jobs

Can be included as part of vanilla DIRAC.
Scout job creation performed on BelleDIRAC side.
But agent and executor are under WMS.
So, possible (with some modifications).

Slide 21: Summary
Potential new additions to Vanilla DIRAC:

  • Scout jobs.

=====

Contributor

fstagni commented Jun 29, 2023

What I would suggest is:

  • in v7r3 you add the "SCOUTING" job status
  • in v8.0 you add the generic part of your implementation (optimizer and agent)
