tpp connections: change round robin to fixed assignment of threads #2641

vchlum · 2024-04-18T09:09:23Z

Describe Bug or Feature

The pbs_comm can crash if a huge amount of requests(/new connections) are issued to pbs_comm. It can be also invoked by a few hundred/thousands invocations of pbs_rmget parallel running for a few minutes (yes. unsupported command, but rm protocol can show the error in tpp). The pbs_comm must utilize the CPUs to reproduce. Also, gss encryption was involved.

The main loop in tpp_transport.c:work() expects the thread-safely obtained conn can only be used in one thread at the same time.

There is a hidden race condition, and honestly, I wasn't able to find the very exact path of how the conn gets in the different thread.

In this stack trace, the fd 519 has a faulty conn structure, because the second handle_disconnect() is called before the first one is finished... at the time of this stack trace, the first handle_disconnect() is gone... As you can see the incorrect sock_fd later causes the crash.

(gdb) thread 5
[Switching to thread 5 (Thread 0x7f4afdb506c0 (LWP 835448))]
#0  handle_disconnect (conn=0x7f4af4020e00) at /tmp/openpbs-src/src/lib/Libpbs/../Libtpp/tpp_transport.c:1748
1748	/tmp/openpbs-src/src/lib/Libpbs/../Libtpp/tpp_transport.c: No such file or directory.
(gdb) bt
#0  handle_disconnect (conn=0x7f4af4020e00) at /tmp/openpbs-src/src/lib/Libpbs/../Libtpp/tpp_transport.c:1748
#1  0x00007f4aff734b0c in handle_incoming_data (conn=0x7f4af4020e00) at /tmp/openpbs-src/src/lib/Libpbs/../Libtpp/tpp_transport.c:1836
#2  0x00007f4aff73435e in work (v=0x55cec68ae180) at /tmp/openpbs-src/src/lib/Libpbs/../Libtpp/tpp_transport.c:1570
#3  0x00007f4aff579134 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
#4  0x00007f4aff5f97dc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
(gdb) select-frame 1
(gdb) print *conn
$1 = {sock_fd = -201326448, lasterr = -38470256, net_state = 1, ev_mask = 32586, conn_params = 0x7f4af40266e0, send_mbox = {mbox_name = "Conn_519\000", mbox_mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 0, 
        __kind = 1, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = '\000' <repeats 16 times>, "\001", '\000' <repeats 22 times>, __align = 0}, mbox_queue = {head = 0x0, tail = 0x0}, max_size = 640000, 
    mbox_size = 0, mbox_eventfd = 520}, scratch = {chunk_link = {ll_prior = 0x0, ll_next = 0x0, ll_struct = 0x0}, data = 0x7f4aec5b59f0 "`\341S\354J\177", len = 8192, pos = 0x7f4aec5b59f0 "`\341S\354J\177"}, curr_send_pkt = 0x0, 
  td = 0x55cec68ae180, ctx = 0x0, extra = **0x0}**

Describe Your Change

I suggest fixing the expected condition - so the sentence 'the conn can always be associated only with one thread in work()' is true.

There was a round-robin for assigning conn to threads. The round-robin does not consider the real utilization of threads, so there is no harm in the following solution: The same fd is always assigned to the same thread.

While investigating, obvious errors were discovered, so the fixes are included:

send_tok was not initialized (based on gss doc - it should be)
in the case of PBS_GSS_ERR_CONTEXT_ESTABLISH, the old gss context remained in structure - should be changed before the return.
conn was set free without locking the mutex and setting related conns_array to free.

Link to Design Doc

Attach Test and Valgrind Logs/Output

I tested with commands like parallel -j 500 ./pbs_rmget -m torque4.grid.cesnet.cz -p 0 ncpus -- {1..100000} for a few hours after the fix with no crash. The pbs_comm utilized CPUs during the test.

subhasisb

The assignment of fds to threads based on modulo sounds fine to me. Thanks!

bayucan · 2024-05-23T23:55:43Z

Looks fine.

vchlum added 2 commits April 18, 2024 10:12

tpp connections: change round robin to fixed assignment of threads

4c54f47

use correct return

c008d86

subhasisb approved these changes May 23, 2024

View reviewed changes

bayucan approved these changes May 23, 2024

View reviewed changes

bayucan merged commit 3b862f7 into openpbs:master May 23, 2024
6 checks passed

vchlum deleted the fix_tpp_threads branch May 24, 2024 05:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tpp connections: change round robin to fixed assignment of threads #2641

tpp connections: change round robin to fixed assignment of threads #2641

vchlum commented Apr 18, 2024 •

edited

subhasisb left a comment

bayucan commented May 23, 2024

tpp connections: change round robin to fixed assignment of threads #2641

tpp connections: change round robin to fixed assignment of threads #2641

Conversation

vchlum commented Apr 18, 2024 • edited

Describe Bug or Feature

Describe Your Change

Link to Design Doc

Attach Test and Valgrind Logs/Output

subhasisb left a comment

Choose a reason for hiding this comment

bayucan commented May 23, 2024

vchlum commented Apr 18, 2024 •

edited