tpp connections: change round robin to fixed assignment of threads #2641
Describe Bug or Feature
pbs_comm can crash when a very large number of requests (new connections) is issued to it. The crash can also be triggered by running a few hundred or a few thousand pbs_rmget invocations in parallel for a few minutes (yes, pbs_rmget is an unsupported command, but the rm protocol can expose the error in tpp). pbs_comm must be heavily utilizing the CPUs for the crash to reproduce. GSS encryption was also involved.
The main loop in tpp_transport.c:work() assumes that a conn, once obtained in a thread-safe way, is used by only one thread at a time.
There is a hidden race condition and, honestly, I was not able to find the exact path by which the conn ends up in a different thread. In this stack trace, fd 519 has a faulty conn structure because a second handle_disconnect() is called before the first one has finished. By the time of this stack trace, the first handle_disconnect() is already gone. As you can see, the incorrect sock_fd later causes the crash.
Describe Your Change
I suggest enforcing the expected condition, so that the statement 'a conn is always associated with only one thread in work()' is true.
Until now, conns were assigned to threads round-robin. Round-robin does not consider the real utilization of threads, so nothing is lost with the following solution: the same fd is always assigned to the same thread.
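The difference between the two assignment policies can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper names pick_thread_fixed() and pick_thread_round_robin() are hypothetical, and mapping the fd with a modulo over the thread count is an assumption about one simple way to make the assignment deterministic.

```c
#include <assert.h>

/*
 * Hypothetical helper (not taken from tpp_transport.c):
 * fixed assignment. The same sock_fd always maps to the same
 * worker thread index, so a conn is never handed to two threads.
 * The modulo mapping is an assumption made for this sketch.
 */
static int
pick_thread_fixed(int sock_fd, int num_threads)
{
	assert(sock_fd >= 0 && num_threads > 0);
	return sock_fd % num_threads;
}

/*
 * Hypothetical helper, shown for comparison: round-robin.
 * The result depends on the call order, not on the fd, so two
 * assignments of the same fd can land on different threads.
 */
static int
pick_thread_round_robin(int *last_thrd, int num_threads)
{
	assert(last_thrd != 0 && num_threads > 0);
	*last_thrd = (*last_thrd + 1) % num_threads;
	return *last_thrd;
}
```

With four worker threads, pick_thread_fixed(519, 4) yields the same index on every call, while repeated calls to pick_thread_round_robin() walk through the indices regardless of which fd is being assigned. Neither policy looks at thread load, which is why switching from one to the other should not make the distribution of work worse.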
While investigating, I also found some obvious errors, so their fixes are included:
Link to Design Doc
Attach Test and Valgrind Logs/Output
I tested with commands like
parallel -j 500 ./pbs_rmget -m torque4.grid.cesnet.cz -p 0 ncpus -- {1..100000}
for a few hours after the fix with no crash. The pbs_comm utilized CPUs during the test.