Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prevent infinite reject logging if job is in prerun while deleting #2640

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

vchlum
Copy link
Contributor

@vchlum vchlum commented Apr 18, 2024

Describe Bug or Feature

The server can get into an infinite loop logging messages like:

04/17/2024 09:41:34;0008;Server@pbs-m1;Job;216551.pbs-m1.metacentrum.cz;MOM rejected signal during delete (15001)

The server is not "stuck" because the loop goes through the work task. So it can be terminated. It logs the messages one after another... we got a 20+GB log size before noticing.

It can happen if a job does not reach the mom (e.g. due to an error on mom) and remains in PRERUN state while an attempt to delete the job is invoked.

Describe Your Change

  • A delay was added to resending the signal to a job. It only makes sense to wait a moment before the resend is issued. This should give the job some time to reach the mom. I suggest 5 seconds resend time.
  • The number of attempts is counted so an infinite loop can not happen. I suggest removing the job after 10 attempts anyway.

Link to Design Doc

Attach Test and Valgrind Logs/Output

This happened twice on our server but it is not easy to reproduce, so I made a fake situation with gdb to test the new code:

Thread 1 "pbs_server.bin" hit Breakpoint 1, post_delete_mom1 (pwt=0x5620c7121310) at req_delete.c:1592
1592		int tries = 0;
(gdb) n
1594		preq_sig = pwt->wt_parm1;
(gdb) 
1595		rc = preq_sig->rq_reply.brp_code;
(gdb) 
1597		if (rc && pwt->wt_aux && (rc < PBSE_) && (pwt->wt_aux > PBSE_))
(gdb) 
1599		auxcode = preq_sig->rq_reply.brp_auxcode;
(gdb) 
1600		preq_clt = preq_sig->rq_extra;
(gdb) 
1601		if (preq_clt == NULL) {
(gdb) 
1606		pjob = find_job(preq_sig->rq_ind.rq_signal.rq_jid);
(gdb) 
1607		if (pjob == NULL) {
(gdb) 
1614		release_req(pwt);
(gdb) 
1617		if (rc) {
(gdb) set rc = 15001
(gdb) call set_job_substate(pjob, 41)
(gdb) n
1619			sprintf(log_buffer, "MOM rejected signal during delete (%d)", rc);
(gdb) 
1621				  pjob->ji_qs.ji_jobid, log_buffer);
(gdb) 
1620			log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, LOG_INFO,
(gdb) 
1623			if (rc == PBSE_UNKSIG) {
(gdb) 
1638			} else if (rc == PBSE_UNKJOBID) {
(gdb) 
1639				if (check_job_substate(pjob, JOB_SUBSTATE_PRERUN)) {
(gdb) 
1644					if (preq_clt->rq_reply.brp_count < 10) {
(gdb) 
1645						preq_clt->rq_reply.brp_count++;
(gdb) 
1646						rc = delayed_issue_signal(pjob, sigtj, post_delete_mom1, preq_clt, 5);
(gdb) c
Continuing.

Thread 1 "pbs_server.bin" hit Breakpoint 1, post_delete_mom1 (pwt=0x5620c713e5c0) at req_delete.c:1592
1592		int tries = 0;
(gdb) n
1594		preq_sig = pwt->wt_parm1;
(gdb) 
1595		rc = preq_sig->rq_reply.brp_code;
(gdb) 
1597		if (rc && pwt->wt_aux && (rc < PBSE_) && (pwt->wt_aux > PBSE_))
(gdb) 
1599		auxcode = preq_sig->rq_reply.brp_auxcode;
(gdb) 
1600		preq_clt = preq_sig->rq_extra;
(gdb) 
1601		if (preq_clt == NULL) {
(gdb) 
1606		pjob = find_job(preq_sig->rq_ind.rq_signal.rq_jid);
(gdb) 
1607		if (pjob == NULL) {
(gdb) 
1614		release_req(pwt);
(gdb) 
1617		if (rc) {
(gdb) set rc = 15001
(gdb) call set_job_substate(pjob, 41)
(gdb) n
1619			sprintf(log_buffer, "MOM rejected signal during delete (%d)", rc);
(gdb) 
1621				  pjob->ji_qs.ji_jobid, log_buffer);
(gdb) 
1620			log_event(PBSEVENT_JOB, PBS_EVENTCLASS_JOB, LOG_INFO,
(gdb) 
1623			if (rc == PBSE_UNKSIG) {
(gdb) 
1638			} else if (rc == PBSE_UNKJOBID) {
(gdb) 
1639				if (check_job_substate(pjob, JOB_SUBSTATE_PRERUN)) {
(gdb) 
1644					if (preq_clt->rq_reply.brp_count < 10) {
(gdb) 
1645						preq_clt->rq_reply.brp_count++;
(gdb) 
1646						rc = delayed_issue_signal(pjob, sigtj, post_delete_mom1, preq_clt, 5);
(gdb) p preq_clt->rq_reply.brp_count
$1 = 2

/* prevent infinite loop, try 10 times only */
if (preq_clt->rq_reply.brp_count < 10) {
preq_clt->rq_reply.brp_count++;
rc = delayed_issue_signal(pjob, sigt, post_delete_mom1, preq_clt, 5);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider this delay of 5 seconds be configurable as a server attribute. 5 seconds may not always be the best. It might need more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new server attribute resend_term_delay was added.

@bayucan
Copy link
Contributor

bayucan commented May 28, 2024

Thanks for adding the server attribute. Code looks good.

"resend_term_delay reverting back to default val %ld",
resend_term_delay);
log_event(PBSEVENT_ADMIN, PBS_EVENTCLASS_SERVER,
LOG_NOTICE, msg_daemonname, log_buffer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you setting the internal variable back to the original value, but not the attribute itself. I'm actually not a huge fan of internal variables like this, especially after get_jattr_long() was written. If you have an internal global, you now have two places a value is set. It leads to the opportunity of them being set differently. If what qmgr reports is different than the value that is actually used, it leads to very difficult bugs to find.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your comments. The internal global variable was removed.

resend_term_delay = PBS_RESEND_TERM_DELAY_DEFAULT;

sprintf(log_buffer,
"resend_term_delay reverting back to default val %ld",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember if log_eventf() made it into open PBS. At some point I wrote a function which combines the sprintf()/log_event() calls into one. If that is in OpenPBS, I'd use it. If not, please ignore me :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, log_eventf() is in OpenPBS. I changed it. Thx.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants