Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Performance fixes in GameProfilingSystem and ChromeTracingProfileWriter #2050

Draft
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

manio143
Copy link
Member

PR Details

Description

This is a continuation of #1953 to fix performance issues with the Profiler.
As part of this work I ended up producing #2049.

From #1953

  • Fix busy-wait in GameProfilingSystem and make surrounding code clearer.
  • Add manual flushing to the JsonWriter, so the internal buffer doesn't grow indefinitely and put it on the fast path by skipping validation.

Additionally:

  • Refactor GameProfilingSystem to extract all string formatting related logic into a private class GameProfilingSystemStrings
  • Modify a lot of formatting code to allocate less
  • Interchange between two different tasks in the GameProfilingSystem depending on the filtering mode
  • Add cancellation tokens to the Profiler subscribers
  • Remove semaphore lock from the Profiler in favor of ConcurrentDictionary with a flag checked using Interlocked
  • Move from Unbounded channels to Bounded to reduce allocation and incredible GC pressure
  • Add Task.Delay to consumption of events in GameProfilingSystem to further reduce allocation by waiting on channels

Related Issue

Closes #1953

Motivation and Context

The Profiler should be very efficient. After all if by measuring performance we decrease it, it's not a great system.
These changes aim to improve mainly GC pressure by minimizing allocations.
There's still further field for improvement that needs some deeper investigation - I was aiming to achieve perf parity between Fps profiler mode and CpuEvents -> it's close but there's still a bit of a difference. Furthermore there's still difference between measuring FPS without Profiling turned on and with.

Types of changes

  • Refactoring
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist

  • My change requires a change to the documentation.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • I have built and run the editor to try this change out.

@Eideren
Copy link
Collaborator

Eideren commented Nov 23, 2023

I don't think we have tests for the profiler, could you run those changes ? Also pinging @froce for review, let us know if you don't have time, I should have time to take a look in your stead.

@Eideren Eideren requested a review from froce November 23, 2023 02:24
@manio143
Copy link
Member Author

I was running the game (Starbreach) with those changes to validate the allocations are reduced (using allocation tool in VS). Didn't mark the checkbox for the editor as I didn't invoke it, but will give it a try later.

Comment on lines -67 to +76
internal static ProfilingEventChannel Create(UnboundedChannelOptions options)
internal static ProfilingEventChannel Create(bool singleReader = false, bool singleWriter = false)
{
var channel = Channel.CreateUnbounded<ProfilingEvent>(options);
// bounded channel is supposed to have lower allocation overhead than unbounded
var channel = Channel.CreateBounded<ProfilingEvent>(new BoundedChannelOptions(capacity: short.MaxValue)
{
SingleReader = singleReader,
SingleWriter = singleWriter,
FullMode = BoundedChannelFullMode.DropWrite, // optimize caller to not block
});
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about this. Allocation behavior for bounded and unbounded channels should be roughly the same. BoundedChannel grows a deque, UnboundedChannel grows a linked list of arrays. And they should both stabilize quickly after enabling the profiler(*). If you saw the channels constantly growing, that means the system couldn't keep up and we'd need to fix that. The tradeoff is that BoundedChannel is safe against that and we can notice dropped events, but UnboundedChannel is more efficient, because SingleConsumerUnboundedChannel exists and has no bounded equivalent. (We'd propbably want SingleProducerSingleConsumerBoundedChannel ... maybe in the future)

If we go with the BoundedChannel, we should expose the itemDropped delegate or have some other notification that events were dropped.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you may be right, I ran BenchmarkDotNet and Unbounded channel Single Produce Single Consumer does seem to perform significantly better than bounded. I must have misread the initialization allocations for ones occuring later on.

finally { subscriberChannelLock.Release(); }
SingleReader = true,
SingleWriter = true,
FullMode = BoundedChannelFullMode.DropNewest, // dropping newer events so that we don't mess with events continuity
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this also be DropWrite? I think the timeline will be messed up anyway.

@@ -376,17 +385,24 @@ public static void AppendTime([NotNull] StringBuilder builder, long accumulatedT

public static void AppendTime([NotNull] StringBuilder builder, TimeSpan accumulatedTimeSpan)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be moved to GameProfilingSystem or a utility class?

Comment on lines +391 to +393
accumulatedTimeSpan.TotalMinutes.TryFormat(buffer, out _, "000.000");
builder.Append(buffer);
builder.Append("m ");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

increase buffer size and combine these to

Suggested change
accumulatedTimeSpan.TotalMinutes.TryFormat(buffer, out _, "000.000");
builder.Append(buffer);
builder.Append("m ");
accumulatedTimeSpan.TotalMinutes.TryFormat(buffer, out _, "000.000m ");
builder.Append(buffer);

or is it clearer your way?

// depends on the timer/update logic, but it does not actually need the profiling events.

while (Enabled)
while (await profilerChannel.WaitToReadAsync(cancellationToken))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to support cancellation here. If we're running this task and request cancellation, we've already called Profiler.Unsubscribe(), which will complete the channel, ending this loop. (And we must call Unsubscribe() to let the Profiler know that the channel should be removed). And if we don't support cancellation the call will also no longer allocate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose the only scenario where ending this task would be delayed is if we have built up a lot of events in the channel and now must wait to consume them all. In normal conditions that shouldn't happen much and the task will end very soon after completing the channel.

@manio143
Copy link
Member Author

I'm gonna try and create some BenchmarkDotNet samples to ensure my evidence wasn't incorrectly collected.

@manio143
Copy link
Member Author

manio143 commented Dec 3, 2023

Benchmark description

I wanted to check there's a difference between bounded and unbounded channel for 1P, 1C scenario and what is the impact of including cancellation token on allocations.

Code
[MemoryDiagnoser]
[SimpleJob]
public class ProfilerBenchmark
{
    private static ProfilingKey ProfilingKey = new ProfilingKey("Test");

    private readonly Channel<ProfilingEvent> boundedChannel = Channel.CreateBounded<ProfilingEvent>(new BoundedChannelOptions(short.MaxValue)
    {
        SingleReader = true,
        SingleWriter = true,
        FullMode = BoundedChannelFullMode.DropWrite,
    });

    private readonly Channel<ProfilingEvent> unboundedChannel = Channel.CreateUnbounded<ProfilingEvent>(new UnboundedChannelOptions
    {
        SingleReader = true,
        SingleWriter = true,
    });

    private readonly CancellationTokenSource CTS = new();

    [Params(1000, 10_000)]
    public int N { get; set; }

    /// <summary>
    /// Reads events from the stream, uses in a calculation and returns a result to ensure the compiler doesn't optimize out relevant parts.
    /// </summary>
    private async Task<TimeSpan> Consume(ChannelReader<ProfilingEvent> eventReader)
    {
        TimeSpan sum = default;
        int i = 0;
        // await foreach (var e in eventReader.ReadAllAsync()) // without cancellation token
        await foreach (var e in eventReader.ReadAllAsync(CTS.Token))
        {
            sum += e.ElapsedTime;
            if (++i == N) break;
        }
        return sum;
    }

    private void Produce(ChannelWriter<ProfilingEvent> writer)
    {
        for (int i = 0; i < N; i++)
        {
            writer.TryWrite(new ProfilingEvent(1, ProfilingKey, ProfilingMessageType.Begin, default, default, 1, null, default));
        }
    }

    [Benchmark]
    public async Task Bounded()
    {
        Task consumer = Consume(boundedChannel.Reader);
        Produce(boundedChannel.Writer);
        await consumer;
    }

    [Benchmark]
    public async Task Unbounded()
    {
        Task consumer = Consume(unboundedChannel.Reader);
        Produce(unboundedChannel.Writer);
        await consumer;
    }
}

Machine

BenchmarkDotNet v0.13.10, Windows 10 (10.0.19045.3693/22H2/2022Update)
AMD Ryzen 5 2600X, 1 CPU, 12 logical and 6 physical cores
.NET SDK 8.0.100
  [Host]     : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2
  Job-LYCRLG : .NET 8.0.0 (8.0.23.53103), X64 RyuJIT AVX2

Runtime=.NET 8.0

Without cancellation token

Method Job Toolchain IterationCount LaunchCount WarmupCount N Mean Error StdDev Allocated
Bounded DefaultJob Default Default Default Default 1000 459.7 us 20.31 us 59.88 us 822 B
Unbounded DefaultJob Default Default Default Default 1000 128.9 us 2.47 us 2.31 us 838 B
Bounded DefaultJob Default Default Default Default 10000 4,797.2 us 230.61 us 679.96 us 802 B
Unbounded DefaultJob Default Default Default Default 10000 1,119.2 us 9.72 us 8.62 us 841 B

With cancellation token

Method Job IterationCount LaunchCount WarmupCount N Mean Error StdDev Allocated
Bounded DefaultJob Default Default Default 1000 418.9 us 15.79 us 46.56 us 1050 B
Unbounded DefaultJob Default Default Default 1000 126.1 us 2.48 us 3.48 us 941 B
Bounded DefaultJob Default Default Default 10000 4,949.8 us 318.49 us 939.09 us 8028 B
Unbounded DefaultJob Default Default Default 10000 1,141.3 us 16.94 us 15.85 us 946 B

@Eideren
Copy link
Collaborator

Eideren commented Jan 7, 2024

Is this ready for merge ?

@manio143
Copy link
Member Author

manio143 commented Jan 7, 2024

No, it needs further work. Might need to convert it to draft for now

@Eideren Eideren marked this pull request as draft January 8, 2024 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants