
DicomServer performance #1777

Merged: 5 commits merged into development on May 23, 2024

Conversation

@gofal (Contributor) commented Apr 30, 2024

Fixes #1776

Checklist

  • The pull request branch is in sync with the latest commit on the fo-dicom/development branch
  • I have updated API documentation
  • I have included unit tests
  • I have updated the change log
  • I am listed in the CONTRIBUTORS file

Changes proposed in this pull request:

  • ConfigureAwait(false) was missing in some time-critical paths of DicomServer (see the sketch below)
  • a blocking await of a notification was changed to fire-and-forget in DicomServer.ListenForConnectionAsync()
  • DicomServices in DicomServer are disposed immediately after their Task has finished
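For illustration, a minimal sketch of what the first bullet refers to; the class and member names are hypothetical, not the actual fo-dicom code, and only the ConfigureAwait(false) calls are the point:

```csharp
using System.Net;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical accept loop, illustrative only (not fo-dicom code).
internal sealed class AcceptLoopSketch
{
    private readonly TcpListener _listener = new(IPAddress.Any, 11112);

    public async Task RunAsync(CancellationToken cancellationToken)
    {
        _listener.Start();
        while (!cancellationToken.IsCancellationRequested)
        {
            // ConfigureAwait(false): do not capture and resume on the original
            // context for every accepted connection; under high load those extra
            // hops add up on time-critical paths.
            using TcpClient client = await _listener
                .AcceptTcpClientAsync(cancellationToken)
                .ConfigureAwait(false);

            // A real server would hand the connection to a DicomService-like
            // handler here; this sketch simply closes it again.
        }
    }
}
```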

```diff
 // We don't actually care about the values inside the channel, they just serve as a notification that a service has connected
-await _servicesChannel.Writer.WriteAsync(numberOfServices, _cancellationToken);
+// Fire and forget
+_ = _servicesChannel.Writer.WriteAsync(numberOfServices, _cancellationToken);
```
@amoerie (Collaborator) commented May 2, 2024

Looking into the details here, I don't believe WriteAsync was ever blocking.

_servicesChannel is an unbounded channel. WriteAsync can only "block" if the capacity of the channel has been reached, in which case it must wait for someone else to read from the channel before it can write another value.
For unbounded channels, WriteAsync always completes synchronously and immediately. (In fact, that is probably why it returns a ValueTask instead of a Task in the first place.)

However, I do admit that this was not obvious. We can make it more explicit at this calling site by doing the following:

```csharp
if (!_servicesChannel.Writer.TryWrite(numberOfServices))
{
    throw new DicomNetworkException("Failed to write to the services channel. This should never happen, because services channels should be created with unlimited capacity");
}
```

There are already similar implementations inside AdvancedDicomClientAssociation.
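For readers following along, a small self-contained illustration of the TryWrite semantics discussed here (standard System.Threading.Channels behavior, not code from the PR): for an unbounded channel, TryWrite only fails once the writer has been completed, while for a bounded channel it can also fail when the channel is full.

```csharp
using System;
using System.Threading.Channels;

// TryWrite on an unbounded channel only fails after the writer is completed;
// on a bounded channel it can also fail when the channel is full.
var unbounded = Channel.CreateUnbounded<int>();
bool a = unbounded.Writer.TryWrite(1);   // true: an unbounded channel always has room

var bounded = Channel.CreateBounded<int>(capacity: 1);
bool b = bounded.Writer.TryWrite(1);     // true: there is room for one item
bool c = bounded.Writer.TryWrite(2);     // false: the bounded channel is full

unbounded.Writer.Complete();
bool d = unbounded.Writer.TryWrite(3);   // false: writing after completion fails

Console.WriteLine($"{a} {b} {c} {d}");   // True True False False
```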

@gofal (Contributor, PR author) replied:

Well, such a high-load scenario is hard to reproduce on a small dev machine, so over the last weeks I did test runs during night shifts directly against Sectra on the server. I did not make the changes based on assumptions but on tests, and believe it or not, WriteAsync actually was blocking.
Before I added the .ConfigureAwait(false), I tried to avoid a context switch altogether and called the synchronous method Writer.TryWrite (I did not throw, but I assumed it would always work properly). But then very often nothing was written into the channel and the overall performance became much worse; this was before I did the optimizations in RemoveUnusedServices.
I cannot tell you the reason; maybe it is some synchronization issue when many threads write to or read from the channel in parallel, where TryWrite returns quickly without actually writing and is supposed to fall back to WriteAsync.

Sorry, but the idea of throwing an exception there is catastrophic! Because of this single failure, the whole loop of the server that accepts incoming requests would quit. That would tear down the entire DicomServer functionality, just because one notification into the channel to clean up unused services failed.

The comparison with AdvancedDicomClientAssociation does not hold. As you know, the performance regression shows up under high load. A single DicomClient does not put its DicomService under high load the way a DicomServer does, so mechanisms that work well in DicomClient do not necessarily scale to DicomServer.

@amoerie (Collaborator) replied:

Don't get me wrong @gofal, obviously I don't think you are making any of this up. But I just fail to understand how what you are describing was caused by Channel.WriteAsync. It does not match my experience and knowledge about Channels at all. Something mysterious is going on, and I would like to know what.

> Sorry, but the idea of throwing an exception there is catastrophic! Because of this single failure, the whole loop of the server that accepts incoming requests would quit. That would tear down the entire DicomServer functionality, just because one notification into the channel to clean up unused services failed.

I know, but that exception should never happen. It's like a switch case where you know some case should never, EVER occur, under any circumstance. The only scenario in which TryWrite CAN return false is if a developer changes the _servicesChannel from an unbounded into a bounded channel in its declaration.

> I cannot tell you the reason; maybe it is some synchronization issue when many threads write to or read from the channel in parallel, where TryWrite returns quickly without actually writing and is supposed to fall back to WriteAsync.

This is what bothers me. I would very much like to know what is actually going on there.

@gofal (Contributor, PR author) replied:

Fair enough that you are so interested in how things work under the hood. But even if you knew exactly how the Channels code works internally today, that may change with any .NET runtime update.
If .NET publishes an API with a method named "TryWrite", then by design we have to assume that it might fail.

> [...] is if a developer changes the _servicesChannel from an unbounded into a bounded channel in its declaration

Now you've got the point :-). We should really avoid code that is fragile in the sense that it only works as long as nobody changes something else. And if someone does change it, the error does not show up at compile time, nor in isolated testing, but only under load, in the worst case only in production.
And you know, when there is a dev here who loves doing mass changes and refactorings without thoroughly testing all of them, it can easily happen that constructs that should never, ever occur suddenly do occur.

@gofal (Contributor, PR author) replied:

So I can tell you that the fire-and-forget construction in this PR runs perfectly stable and performant in production (because we are using it in production right now).

@gofal (Contributor, PR author) replied:

I appreciate all the effort you have put into this repository, just as I appreciate the code, input and issues of all contributors.
I do have a plan for fo-dicom and I have tried to explain it to you often: the project is used in production in medical applications, so stability and reliability are the highest priority. In its stable phase the project gets bugfixes and new features that do not introduce risks. As you noted, software needs refactoring from time to time. So if there are defects/bugs/instabilities that are no longer manageable, or feature requests in issues that are not easy to integrate, then it is time to start the next major version, where massive refactorings, risky new features and general structural changes are fine, and which is released in alpha/beta versions in parallel with updates of the old stable version, until the new version becomes stable.
You joined fo-dicom in the phase of reworking version 4 into version 5. In that phase many refactorings were OK, and as you noticed, refactorings take time before things become stable again. You wrote that you are still using version 4 in production and will move to version 5 in a year or so; medical software moves rather slowly (for good reasons). It is an assumption, but maybe for you this project is still a "future version", where it is fine to try new things or to throw exceptions in critical areas to detect bugs? But this project is now in a stable phase: for example, we in our company, and even Microsoft Cloud for Healthcare, are using it.
I also do not want us to fight each other and waste time that I could spend fixing bugs or managing issues. I am happy that you contribute to fo-dicom, but please accept the plan behind it.

@gofal (Contributor, PR author) replied:

For your investigation:

  • Target platform is net6.0.
  • There is one server listening on one port, and hundreds of incoming associations/connections that send C-STORE requests (8000/hour on average, meaning there are times with more and times with less; there are phases with 10 requests/second, and then, when huge mammo tomos with 8 GB of data are sent, there are even some seconds without any association).
  • No MaxClientsAllowed.
  • There were no errors, just generally reduced throughput.
  • I know from Sectra that they have a limited number of parallel network connections, so only after the network stream of a connection was closed was the next connection in the queue started. This is why disposing of the finished DicomServices, and therefore closing the network stream, was important.

I have detailed logs of incoming association requests and incoming C-STORE requests, and with Notepad++ on the server I could do some analysis. So I had to make some assumptions and prepare builds of fo-dicom that I could then evaluate on the server outside opening hours.

When I changed the WriteAsync to a TryWrite, I did not log the result (true/false) and did not even have an "if" statement, because I also assumed the write would always work properly. But then, in the test run, the overall performance was lower, and in the logs I noticed that the lines "Accepted an incoming client connection, there are now X connected clients" showed much higher numbers of connected clients. The only conclusion was that the cleanup of the unused clients did not work properly, so the connections were not disposed and therefore performance was low; and the cleanup depends entirely on reading from the channel. Because the only code change I had made was replacing WriteAsync with TryWrite, that had to be the reason.
I did not investigate further, but changed the "await WriteAsync" to a fire-and-forget "WriteAsync", and then the throughput improved.

@amoerie (Collaborator) commented May 22, 2024

> I appreciate all the effort you have put into this repository [...] I am happy that you contribute to fo-dicom, but please accept the plan behind it.

Just some notes here:

  • Some of our high-traffic components are still using (a private fork of) v4. That is not out of safety concerns, but caused by limited development resources. We already moved one (very) high-traffic component to v5, but it does not use the network layer yet, only the DicomFile APIs and the rendering logic. Our migration from .NET Framework to .NET Core is a multi-year effort, and we are making the switch from v4 to v5 in the same move.
  • v5 is not a future version for me; we have been running it in production since last year.

> stability and reliability are the highest priority

I mostly agree with this statement, but my approach to this is different. I am more in favour of building more safety nets (benchmarks, tests, code analysis, nullable annotations, etc.) so the risk of changes/refactoring is lowered, while you seem to be a little more conservative and want to avoid changes/risks. We probably need to find a good balance.

Maybe we can take inspiration from other libraries and release beta versions? I'm not sure if we discussed this before. The nice thing about beta versions is that they are opt-in for early adopters, while conservative-minded developers (such as yourself) can stick to the stable releases. It's just an idea though.

> For your investigation: Target platform is net6.0; one server listening on one port, hundreds of incoming associations/connections that send C-STORE requests [...] I had to make some assumptions and prepare builds of fo-dicom that I could then evaluate on the server outside opening hours.

I tried reproducing this issue today. I wrote a benchmark that made 1000 connections to a single server, and then sent a single C-STORE request. Then, I compared the numbers before and after I made the change to WriteAsync like you did. Unfortunately, there was no difference. So I can't seem to reproduce this yet. My benchmark probably does not come close enough to your circumstances to reproduce the problem.

> When I changed the WriteAsync to a TryWrite [...] I did not investigate further, but changed the "await WriteAsync" to a fire-and-forget "WriteAsync", and then the throughput improved.

I actually have a theory about what the problem is. Channels have an option called "AllowSynchronousContinuations", which means that if you call WriteAsync and it immediately succeeds in writing a value, it may synchronously execute the reading end. In other words, the code inside "RemoveUnusedServicesAsync" would be executed on the same thread that called WriteAsync, before that thread continues with the rest of its work. This is a pitfall we've encountered before with TaskCompletionSource, which is why we have a class called TaskCompletionSourceFactory that should always be used: it explicitly sets a similar option to avoid this problem.
I'm not sure if you are in a position to re-test this (since I cannot reproduce it), but if my theory is correct, adding the option AllowSynchronousContinuations = false to the UnboundedChannelOptions near line 38 in DicomServer should have the same positive result.
It looks like this option is correctly set to false everywhere, except when initializing _servicesChannel.
I can imagine a scenario where there are a lot of unused services to remove and a new connection comes in; the WriteAsync call is then "blocked" while it disposes of all the unused services. In truth, WriteAsync completes immediately, but because any attached continuations happen synchronously, it may take a while before we get past the WriteAsync call. So we were both right, isn't that wonderful. 😅
This theory also explains why discarding the WriteAsync call solved your problem: any attached continuations would also not run synchronously in that case. It also explains why TryWrite did not solve your problem.
All of this makes me relatively confident that this is the root cause.
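A sketch of the suggested option change (the field name and element type follow the snippet earlier in the thread; the SingleReader/SingleWriter values are assumptions, and only AllowSynchronousContinuations = false is the point):

```csharp
using System.Threading.Channels;

// Sketch only: the point is AllowSynchronousContinuations = false;
// the SingleReader/SingleWriter values are assumed, not taken from fo-dicom.
internal sealed class ServicesChannelSketch
{
    private readonly Channel<int> _servicesChannel = Channel.CreateUnbounded<int>(
        new UnboundedChannelOptions
        {
            SingleReader = true,                  // assumed: one cleanup loop reads
            SingleWriter = false,                 // assumed: many connections write
            AllowSynchronousContinuations = false // keep reader continuations off the writer's thread
        });

    public ChannelWriter<int> Writer => _servicesChannel.Writer;
}
```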

So, I made this mistake. I never caught it because it does not show up in any benchmarks or tests. (This is not a good case for my approach, haha!)
I would propose the same "solution" as we did for TaskCompletionSource: make a ChannelFactory that is responsible for creating channels and always sets this option. That ensures it is never forgotten.
It is not airtight, but it is better than what we have today: hoping that every developer remembers.
(P.S. You could argue that AllowSynchronousContinuations has a bad default value, but maybe they were afraid of breaking changes 🙃)
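A rough sketch of what such a ChannelFactory could look like (hypothetical; the eventual fo-dicom implementation may differ):

```csharp
using System.Threading.Channels;

// Hypothetical ChannelFactory in the spirit of the existing TaskCompletionSourceFactory:
// a single place that creates channels and always pins AllowSynchronousContinuations to false.
internal static class ChannelFactory
{
    public static Channel<T> CreateUnbounded<T>(bool singleReader = false, bool singleWriter = false)
        => Channel.CreateUnbounded<T>(new UnboundedChannelOptions
        {
            SingleReader = singleReader,
            SingleWriter = singleWriter,
            AllowSynchronousContinuations = false
        });

    public static Channel<T> CreateBounded<T>(int capacity)
        => Channel.CreateBounded<T>(new BoundedChannelOptions(capacity)
        {
            AllowSynchronousContinuations = false
        });
}
```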

@gofal (Contributor, PR author) replied:

Great that you have found the root cause. You started this conversation because you wanted to understand what was going on, and now you seem to know.
The current solution, just fire-and-forgetting the WriteAsync, is simple and works perfectly. It should be easy to understand for devs who read this (I could extend the comment above to "fire and forget the notification, no need to await it"). So I guess there is no need to propose another solution, is there?

@amoerie (Collaborator) commented May 23, 2024

Your method is proven to work, while my theory, however likely it seems to me, is not.
I also agree it is simple enough, so I'll just approve and merge this. No need to add the comment.
In an additional PR I would like to add the ChannelFactory so we can avoid this mistake altogether in the future. I'll do that immediately.

Edit: the default value of AllowSynchronousContinuations is false, which means my theory cannot be true. I'm still in the dark, but I concede. I won't be able to figure this out without being able to reproduce it, and I just cannot make the time investment. I do conclude that we need performance benchmarks that can fail a pull request pipeline if performance degrades too much.

@amoerie (Collaborator) commented May 2, 2024

I think we should look into enabling Code Analysis (https://learn.microsoft.com/en-us/dotnet/fundamentals/code-analysis/overview); there's a check in there that would have caught the missing ConfigureAwait calls. But that should probably be a separate PR, I'll take a stab at it.

Edit: made a first attempt here: #1782
The minimum set of analysis checks revealed a number of other "issues" as well; I fixed them where they made sense. I would categorize all of the changes as "low risk".
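For reference, a minimal sketch of how such a check could be switched on via .NET code analysis configuration; the rule ID is CA2007 ("Consider calling ConfigureAwait on the awaited task"), while the chosen severity is just an assumption, not what the linked PR necessarily does:

```ini
# .editorconfig sketch: flag awaits that do not call ConfigureAwait
[*.cs]
dotnet_diagnostic.CA2007.severity = warning
```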

amoerie self-requested a review on May 23, 2024 05:25
amoerie merged commit 048e9d4 into development on May 23, 2024
4 checks passed
Linked issue: DicomServer performance regression (#1776)