Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tpp connections: change round robin to fixed assignment of threads #2641

Merged
merged 2 commits into from
May 23, 2024

Conversation

vchlum
Copy link
Contributor

@vchlum vchlum commented Apr 18, 2024

Describe Bug or Feature

The pbs_comm can crash if a huge amount of requests(/new connections) are issued to pbs_comm. It can be also invoked by a few hundred/thousands invocations of pbs_rmget parallel running for a few minutes (yes. unsupported command, but rm protocol can show the error in tpp). The pbs_comm must utilize the CPUs to reproduce. Also, gss encryption was involved.

The main loop in tpp_transport.c:work() expects the thread-safely obtained conn can only be used in one thread at the same time.

There is a hidden race condition, and honestly, I wasn't able to find the very exact path of how the conn gets in the different thread.

In this stack trace, the fd 519 has a faulty conn structure, because the second handle_disconnect() is called before the first one is finished... at the time of this stack trace, the first handle_disconnect() is gone... As you can see the incorrect sock_fd later causes the crash.

(gdb) thread 5
[Switching to thread 5 (Thread 0x7f4afdb506c0 (LWP 835448))]
#0  handle_disconnect (conn=0x7f4af4020e00) at /tmp/openpbs-src/src/lib/Libpbs/../Libtpp/tpp_transport.c:1748
1748	/tmp/openpbs-src/src/lib/Libpbs/../Libtpp/tpp_transport.c: No such file or directory.
(gdb) bt
#0  handle_disconnect (conn=0x7f4af4020e00) at /tmp/openpbs-src/src/lib/Libpbs/../Libtpp/tpp_transport.c:1748
#1  0x00007f4aff734b0c in handle_incoming_data (conn=0x7f4af4020e00) at /tmp/openpbs-src/src/lib/Libpbs/../Libtpp/tpp_transport.c:1836
#2  0x00007f4aff73435e in work (v=0x55cec68ae180) at /tmp/openpbs-src/src/lib/Libpbs/../Libtpp/tpp_transport.c:1570
#3  0x00007f4aff579134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#4  0x00007f4aff5f97dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) select-frame 1
(gdb) print *conn
$1 = {sock_fd = -201326448, lasterr = -38470256, net_state = 1, ev_mask = 32586, conn_params = 0x7f4af40266e0, send_mbox = {mbox_name = "Conn_519\000", mbox_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, 
        __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\001", '\000' <repeats 22 times>, __align = 0}, mbox_queue = {head = 0x0, tail = 0x0}, max_size = 640000, 
    mbox_size = 0, mbox_eventfd = 520}, scratch = {chunk_link = {ll_prior = 0x0, ll_next = 0x0, ll_struct = 0x0}, data = 0x7f4aec5b59f0 "`\341S\354J\177", len = 8192, pos = 0x7f4aec5b59f0 "`\341S\354J\177"}, curr_send_pkt = 0x0, 
  td = 0x55cec68ae180, ctx = 0x0, extra = **0x0}**

Describe Your Change

I suggest fixing the expected condition - so the sentence 'the conn can always be associated only with one thread in work()' is true.

There was a round-robin for assigning conn to threads. The round-robin does not consider the real utilization of threads, so there is no harm in the following solution: The same fd is always assigned to the same thread.

While investigating, obvious errors were discovered, so the fixes are included:

  • send_tok was not initialized (based on gss doc - it should be)
  • in the case of PBS_GSS_ERR_CONTEXT_ESTABLISH, the old gss context remained in structure - should be changed before the return.
  • conn was set free without locking the mutex and setting related conns_array to free.

Link to Design Doc

Attach Test and Valgrind Logs/Output

I tested with commands like parallel -j 500 ./pbs_rmget -m torque4.grid.cesnet.cz -p 0 ncpus -- {1..100000} for a few hours after the fix with no crash. The pbs_comm utilized CPUs during the test.

Copy link
Collaborator

@subhasisb subhasisb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assignment of fds to threads based on modulo sounds fine to me. Thanks!

@bayucan
Copy link
Contributor

bayucan commented May 23, 2024

Looks fine.

@bayucan bayucan merged commit 3b862f7 into openpbs:master May 23, 2024
6 checks passed
@vchlum vchlum deleted the fix_tpp_threads branch May 24, 2024 05:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants