Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v8.0] Introduce Scout Agent and Optimizer #7251

Open
wants to merge 20 commits into
base: integration
Choose a base branch
from

Conversation

qdcampagna
Copy link

@qdcampagna qdcampagna commented Oct 20, 2023

This pull request introduces the Scout Agent and Optimizer. They submit a subset of primary jobs first, then only allows the primary jobs to run once a certain number of Scout jobs are successfully finished. Follow up to #7118. Closes #7083.

BEGINRELEASENOTES

*WorkloadManagement
NEW: Introduce Agent and Optimizer for scouting jobs

ENDRELEASENOTES

Copy link
Contributor

@fstagni fstagni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR.
The main problem I see, that requires change, is that you are calling directly JobDB() to get/set JobParameters. But, JobParameters can be stored also in ElasticSearch (and will only be stored there from DIRAC v9 on). So, use JobStateUpdateClient() instead.

"""Sets defaults
"""

self.jobDB = JobStateUpdateClient()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it was me not being clear, but this can't be a 1-to-1 substitution between JobDB and JobStateUpdateClient, as the 2 don't expose the same methods. For example, JobStateUpdateClient does not expose the selectJobs method used below.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. Does JobStateUpdateClient access only the methods found in JobStateUpdateHandler, or are there others?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only those.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I would do this, for added clarity:

Suggested change
self.jobDB = JobStateUpdateClient()
self.jobStateUpdate = JobStateUpdateClient()

then you'll need at least also

        self.jobMonitoring = JobMonitoringClient()

failedjoblist = []
failedsitelist = []
stalledjoblist = []
scoutjobs = result['Value'].keys()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

beware, this is python3 and not py2 code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is safer:

Suggested change
scoutjobs = result['Value'].keys()
scoutjobs = list(result['Value'])

src/DIRAC/WorkloadManagementSystem/Executor/Scouting.py Outdated Show resolved Hide resolved
src/DIRAC/WorkloadManagementSystem/Executor/Scouting.py Outdated Show resolved Hide resolved
src/DIRAC/WorkloadManagementSystem/Executor/Scouting.py Outdated Show resolved Hide resolved
src/DIRAC/WorkloadManagementSystem/Executor/Scouting.py Outdated Show resolved Hide resolved
src/DIRAC/WorkloadManagementSystem/Executor/Scouting.py Outdated Show resolved Hide resolved
Copy link
Contributor

@fstagni fstagni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, would you add a unit test or two?

Copy link
Contributor

@andresailer andresailer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some documentation about this somewhere?

From the #7083

"Scout jobs" are created at the job submission -- when a user submits a set of jobs, our client tool makes a smaller set of shorter jobs as "scout jobs" and submits them (the original and the scout) altogether.

Would be interesting to know how this is implemented at the job submission level.

Comment on lines +42 to +48
self.totalScoutJobs = Operations().getValue('WorkloadManagement/Scouting/totalScoutJobs', 10)
self.criteriaFailedRate = Operations().getValue('WorkloadManagement/Scouting/criteriaFailedRate', 0.5)
self.criteriaSucceededRate = Operations().getValue('WorkloadManagement/Scouting/criteriaSucceededRate', 0.3)
self.criteriaStalledRate = Operations().getValue('WorkloadManagement/Scouting/criteriaStalledRate', 1.0)
self.criteriaFailed = Operations().getValue('WorkloadManagement/Scouting/criteriaFailed',
int(self.totalScoutJobs * self.criteriaFailedRate))
self.criteriaSucceeded = Operations().getValue('WorkloadManagement/Scouting/criteriaSucceeded',
int(self.totalScoutJobs * self.criteriaSucceededRate))
self.criteriaStalled = Operations().getValue('WorkloadManagement/Scouting/criteriaStalled',
int(self.totalScoutJobs * self.criteriaStalledRate))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this go via the Agent options, or is this used also in other places?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In any case you have to add the agent to the ConfigTemplate file?

def execute(self):
"""The ScoutingJobStatus execution method.
"""
result = self.jobDB.selectJobs({'Status': 'Scouting'})
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JobStateUpdateClient doesn't seem to have a method that does what we want here. However, it seems that getJobs() in JobMonitoringHandler might have this functionality. Would you recommend using getJobs, or implementing this in JobStateUpdateHandler?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use getJobs.

"""Sets defaults
"""

self.jobDB = JobStateUpdateClient()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I would do this, for added clarity:

Suggested change
self.jobDB = JobStateUpdateClient()
self.jobStateUpdate = JobStateUpdateClient()

then you'll need at least also

        self.jobMonitoring = JobMonitoringClient()

failedjoblist = []
failedsitelist = []
stalledjoblist = []
scoutjobs = result['Value'].keys()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is safer:

Suggested change
scoutjobs = result['Value'].keys()
scoutjobs = list(result['Value'])


if len(donejoblist) >= criteriaSucceeded:
self.log.verbose(f'Scout (ID = {scoutid}) are done.')
scoutStatus = {'Status': 'Checking', 'MinorStatus': 'Scouting', 'appstatus': 'Scout Complete'}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
scoutStatus = {'Status': 'Checking', 'MinorStatus': 'Scouting', 'appstatus': 'Scout Complete'}
scoutStatus = {'Status': JobStatus.CHECKING, 'MinorStatus': 'Scouting', 'appstatus': 'Scout Complete'}

and similar in the lines below.

Comment on lines +54 to +55
self.jobLog.info('Skipping optimizer, since no scout \
corresponding to this job group')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would print lots of empty spaces. Better to do

Suggested change
self.jobLog.info('Skipping optimizer, since no scout \
corresponding to this job group')
self.jobLog.info('Skipping optimizer, since no scout '
'corresponding to this job group')

result = jobState.getAttribute('RescheduleCounter')
if not result['OK']:
return result
if result['Value'] == None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if result['Value'] == None:
if result['Value'] is None:

@fstagni
Copy link
Contributor

fstagni commented Apr 23, 2024

Ping! @qdcampagna

@qdcampagna
Copy link
Author

Sorry for the long delay, there were problems with the development server used to test this. We are working on setting up a new one so hopefully soon there will be progress on this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Introducing 'Scouting' Status in WMS state transitions
4 participants