
Add support for enhanced barriers in D3D12. #91769

Open · wants to merge 1 commit into master from d3d12_enhanced_barriers
Conversation

DarioSamo (Contributor)

Enhanced barriers are a relatively modern feature that wasn't part of the initial D3D12 specification. This feature brings barriers in D3D12 much closer in functionality to Vulkan (with a few caveats this PR works around).

The current D3D12 driver was designed around legacy barriers, and this PR keeps that path intact. Enhanced barriers are only used if the driver reports them as supported. To enable this, you may need to use the Agility SDK as indicated here. If you want to double-check whether support is reported, you can verify the return value of this line.

With enhanced barriers implemented, the D3D12 driver can finally use the information provided by the recently introduced Acyclic Rendering Graph to group barriers together as much as possible and allow more commands to execute in parallel.
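As a hypothetical sketch of what that grouping enables (illustrative types only, not the actual ARG or driver API): instead of issuing one barrier submission per resource transition, the graph can accumulate the transitions a set of commands needs and hand them to the driver in a single call.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative only: Transition and BarrierBatch are made-up names, not the
// actual Godot render graph or D3D12 driver types.
struct Transition {
	uint32_t resource_id;
	uint32_t layout_before;
	uint32_t layout_after;
};

struct BarrierBatch {
	std::vector<Transition> pending;
	uint32_t driver_calls = 0;

	void add(uint32_t p_id, uint32_t p_before, uint32_t p_after) {
		pending.push_back({ p_id, p_before, p_after });
	}

	// One driver submission for the whole group, instead of one per resource;
	// commands recorded between flushes are free to overlap on the GPU.
	void flush() {
		if (!pending.empty()) {
			driver_calls++; // Stands in for a single enhanced-barrier call.
			pending.clear();
		}
	}
};
```

The batching itself is what the render graph reordering buys: fewer, larger synchronization points rather than one per transition.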

So far this has yielded a real performance increase on an NVIDIA GeForce RTX 3090 Ti. A 3D project with nearly every post-processing effect enabled went from ~2.57 ms on master to ~2.16 ms on d3d12_enhanced_barriers, an improvement very much in line with the gains we saw when the render graph was introduced.

The second part of this PR reworks the fallback the D3D12 driver had to perform into something more generic. When relaxed format casting is unavailable, the driver has to make copies of textures, transition them to the proper layouts, and replace them in descriptor sets without the render graph knowing any of that is going on. This fallback was instead re-implemented entirely at the RenderingDevice level, allowing the ARG to reorganize the operations and gain a performance boost as well. It also allows enhanced barriers to work with this fallback without issue, which would've been impossible to solve properly otherwise.

Some small changes had to be made to RenderingDeviceDriver to provide more detail about the operations and states in which resources are used, to comply with D3D12's restrictions. No regressions are expected, but it should be tested thoroughly regardless.

@DarioSamo DarioSamo requested review from a team as code owners May 9, 2024 15:17
@AThousandShips AThousandShips added this to the 4.x milestone May 9, 2024
@DarioSamo DarioSamo requested a review from a team as a code owner May 9, 2024 16:07
@RandomShaper (Member) left a comment


Admittedly, this is a somewhat naive review because it's hard to follow everything going on here in full detail. As such, besides attesting that the code changes look solid and sensible, I can only raise a few nitpicks:

  • driver_clear_resources_with_views and the other bool member above could become bitfields.
  • Also for driver_clear_resources_with_views, I'd suggest a rename to the more compact driver_uses_views_for_clears. Or, alternatively, inverting its meaning and naming it driver_clears_via_copy_engine, or something similar. (The rename would carry over to API_TRAIT_CLEAR_RESOURCES_WITH_VIEWS.)
  • I've spotted a few pieces of code style that deviate from the latest best or established practices in the codebase. That was already the case in some of these files, so I think we'd rather do a style-fixing pass later. They are things such as uninitialized local variables and the omission of the (superfluous, I know) public keyword in the specification of inheritance across structs.
  • Finally, there's the thread_local LocalVector technique. That's kind of a safer alloca(), isn't it? I guess we should discuss rules about when to prefer that over the ALLOCA_ARRAY() utility macro that drivers use, which may mean we let the drivers do the same, but in any case come up with something consistent.

@DarioSamo (Contributor, Author)

DarioSamo commented May 10, 2024

* `driver_clear_resources_with_views` and the other bool member above could become bitfields.

Admittedly, the purpose of these is just to store one flag and stay consistent throughout the execution of the graph without constantly calling api_trait_get and paying the cost of the virtual function. I'm not strongly for or against the approach chosen here (the alternatives being just querying the API or using a bitfield), but I went with whatever requires the least amount of knowledge from the reader. The graph is only created once per RenderingDevice, so there aren't many memory savings to be found.

* Also for `driver_clear_resources_with_views`, I'd suggest a rename to the more compact `driver_uses_views_for_clears`. Or, alternatively, inverting its meaning and naming it `driver_clears_via_copy_engine`, or something similar. (The rename would carry over to `API_TRAIT_CLEAR_RESOURCES_WITH_VIEWS`.)

This I agree with. It'd be best to change the trait to something that more clearly represents that Vulkan is capable of doing this. It might be more common that we find APIs that don't.

* Finally, there's the `thread_local LocalVector` technique. That's kind of a safer `alloca()`, isn't it? I guess we should discuss rules about when to prefer that over the `ALLOCA_ARRAY()` utility macro that drivers use, which may mean we let the drivers do the same, but in any case come up with something consistent.

Yes, the usage of thread_local is very deliberate across this and my other PRs, as it's an effective way to avoid using stack memory and to reuse an already allocated vector across calls while keeping the method safe under multi-threading. It's important to note that thread_local also implies static, so these vectors do not get re-allocated on each call and are merely reused.

This will be a pretty common pattern among some other optimization PRs that will be coming later. Considering there's no upper bound on some of these calls, relying on stack memory could become a problem at some point that would leave no other possible fix.
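As a rough sketch of the pattern described above (illustrative names, with std::vector standing in for LocalVector, and made-up Barrier/record_barriers types for the example):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch of the thread_local reuse pattern; not engine code.
struct Barrier {
	uint32_t resource_id;
};

static std::vector<Barrier> &scratch_barriers() {
	// One instance per thread, constructed once (thread_local implies static
	// storage duration per thread) and reused across calls.
	thread_local std::vector<Barrier> barriers;
	return barriers;
}

size_t record_barriers(const uint32_t *p_ids, size_t p_count) {
	std::vector<Barrier> &barriers = scratch_barriers();
	barriers.clear(); // Keeps the capacity, so no reallocation once grown.
	for (size_t i = 0; i < p_count; i++) {
		barriers.push_back({ p_ids[i] });
	}
	return barriers.capacity();
}
```

Unlike alloca(), the growth is heap-backed and not bounded by stack size, and unlike a plain static, each thread gets its own vector, so concurrent callers don't race.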

@RandomShaper (Member)

[...] but I went with whatever requires the least amount of knowledge from the reader. The graph is only created once per RenderingDevice so there's not much in the way of memory savings to be found.

I believe anyone hoping to understand this area of the engine is aware of bitfields. It's not a great saving, but why miss the opportunity to pack a few similarly-purposed booleans together?

@DarioSamo (Contributor, Author)

I believe anyone hoping to understand this area of the engine is aware of bitfields. It's not a great saving, but why miss the opportunity to pack a few similarly-purposed booleans together?

It's not something I gave much thought to, to be honest, as it wouldn't make much of a difference, and from my POV it seemed easier to understand. I have no reason to argue for or against it as there's no measurable performance impact. I'll go with whatever the project prefers.

@RandomShaper (Member)

RandomShaper commented May 13, 2024

Yes, the usage of thread_local is very deliberate across this and my other PRs as it's an effective way to avoid using stack memory and reusing an already allocated vector across calls while also allowing the method to remain safe when using multi-threading. It's important to note thread_local also implies the existence of static, so these vectors do not get re-allocated on each call and merely get reused.

The technique is cool. I'd only like to raise the need to discuss its usage at some point in the future. Lately, I've been trying to remove or minimize runtime allocations in some PRs, and I don't quite feel good about adding them to other areas of the engine. I believe you are more experienced than me regarding their real impact in this context, so I'm all ears (or eyes, in this case). My point is that something as thin as the rendering driver shouldn't be the culprit of a spike in frame time because it happened to need to invoke the allocator.

I have an idea (again, not for now, not for this PR): we could use something similar to your technique that would hit the sweet spot between alloca() and thread_local worst-case arrays. That would be arena-backed arrays. The worst-case allocation would be a chunk of memory per frame-in-flight, used as an arena from which the arrays would be allocated. LocalVector could be extended to support custom allocators, and we'd have an arena allocator class. That would have the safety of a proper array class while also being growable, but would cause far fewer runtime allocations (and a project setting could control the initial size, so users can tweak it so their games never need to allocate at runtime if they really want to, checking the max frame arena size the profiler would report).

UPDATE: It doesn't really need to be per-frame, since it's throwaway data, but the idea is mostly the same.
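A minimal sketch of that arena idea, under the assumption of a fixed-size chunk that is reset wholesale once the throwaway data is consumed (FrameArena and its members are made-up names; a real version would back a LocalVector-style custom allocator instead):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch of an arena from which throwaway arrays could be carved.
class FrameArena {
public:
	explicit FrameArena(size_t p_capacity) :
			buffer(p_capacity) {}

	void *allocate(size_t p_bytes, size_t p_align = alignof(std::max_align_t)) {
		size_t aligned = (offset + p_align - 1) & ~(p_align - 1);
		if (aligned + p_bytes > buffer.size()) {
			return nullptr; // A real version could grow or fall back to the heap.
		}
		offset = aligned + p_bytes;
		return buffer.data() + aligned;
	}

	void reset() { offset = 0; } // Throwaway data: one cheap reset, no frees.
	size_t high_water_mark() const { return offset; } // What a profiler could report.

private:
	std::vector<uint8_t> buffer;
	size_t offset = 0;
};
```

The single up-front chunk is the only runtime allocation; every per-call array inside a frame is just a pointer bump, which is the property being asked for above.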

@clayjohn (Member) left a comment


This seems fine to me to merge for 4.3. It should have a pretty small impact outside of D3D12 and will be a huge benefit for D3D12 users (it brings performance in line with the Vulkan backend).

Approving from my side, but it would be good for @RandomShaper to confirm that he is happy with the details before merging

@clayjohn clayjohn modified the milestones: 4.x, 4.3 May 14, 2024
@AThousandShips (Member) left a comment


Style looks good!

Review thread on servers/rendering/rendering_device_driver.h (outdated, resolved)
@RandomShaper (Member)

I'm not sure my feedback (the directly actionable part) has been addressed yet.

Also, this may be better rebased on top of #91416, which already includes the upgrade of the Agility SDK to 613, but in a more comprehensive manner.

@DarioSamo (Contributor, Author)

I'm not sure my feedback (the directly actionable part) has been addressed yet.

Would you say this covers the copy engine suggestion and the bitfields? I feel the arena allocator would be out of scope for this PR until we get a common class we can reuse to adopt that strategy elsewhere.

@DarioSamo (Contributor, Author)

DarioSamo commented May 15, 2024

I believe anyone hoping to understand this area of the engine is aware of bitfields. It's not a great saving, but why miss the opportunity to pack a few similarly-purposed booleans together?

Upon trying the suggestion, one annoying thing I realized is that a bitfield declaration does not allow giving it an initialization value in the class itself, and instead requires a union to do so. Do we feel it's worth doing this:

union {
  struct {
    bool driver_honors_barriers : 1;
    bool driver_clear_resources_with_views : 1;
  };

  bool driver_all = false;
};

Instead of just this?

bool driver_honors_barriers = false;
bool driver_clear_resources_with_views = false;

The other alternative is just doing the initialization within the class's constructor initializer list manually, but that breaks a bit with the style of the rest of the class, where you can read it directly to find out what the defaults are.

@DarioSamo DarioSamo force-pushed the d3d12_enhanced_barriers branch 3 times, most recently from cc5d782 to 97ddfd1 on May 15, 2024 14:21
@DarioSamo (Contributor, Author)

I've added the rename to use copy engine. We should decide between the booleans and the bitfields for the reasons outlined above.

@RandomShaper (Member)

The lack of bitfield initializers is indeed unfortunate, but the style ship has already sailed in that regard. See this:

struct Task {
	TaskID self = -1;
	Callable callable;
	void (*native_func)(void *) = nullptr;
	void (*native_group_func)(void *, uint32_t) = nullptr;
	void *native_func_userdata = nullptr;
	String description;
	Semaphore done_semaphore; // For user threads awaiting.
	bool completed : 1;
	bool pending_notify_yield_over : 1;
	Group *group = nullptr;
	SelfList<Task> task_elem;
	uint32_t waiting_pool = 0;
	uint32_t waiting_user = 0;
	bool low_priority = false;
	BaseTemplateUserdata *template_userdata = nullptr;
	int pool_thread_index = -1;

	void free_template_userdata();

	Task() :
			completed(false),
			pending_notify_yield_over(false),
			task_elem(this) {}
};

So, until we upgrade to a version of C++ that allows in-place initializers for bitfields, I'd suggest embracing that style as the convention.

To clarify: yes, that's indeed the last piece of feedback I was talking about. Everything else belongs to future iterations, in any case.
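For reference, applied to the two flags from this PR, that convention would look roughly like this (sketch only, not the actual class definition; frames_in_flight is a made-up extra member to show that non-bitfield members keep their in-place initializers):

```cpp
#include <cassert>

// Sketch of the constructor-initializer convention for bitfield members,
// using the flag names from this PR.
struct DriverTraits {
	bool driver_honors_barriers : 1;
	bool driver_clear_resources_with_views : 1;
	int frames_in_flight = 2; // Non-bitfield members keep in-place initializers.

	DriverTraits() :
			driver_honors_barriers(false),
			driver_clear_resources_with_views(false) {}
};
```

The defaults for the bitfield members move into the constructor, which is the style trade-off being discussed above.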

Enables support for enhanced barriers if available.

Gets rid of the implementation of [CROSS_FAMILY_FALLBACK] in the D3D12 driver. The logic has been reimplemented at a higher level in RenderingDevice itself.

This fallback is only used if the RenderingDeviceDriver reports the API traits and the texture-format sharing capability correctly. Aliases created in this way can only be used for sampling, never for writing. In most cases, the formats that do not support sharing do not support unordered access/storage writes in the first place.
@DarioSamo (Contributor, Author)

I've addressed the bitfield suggestion.

@@ -245,13 +255,34 @@ class RenderingDevice : public RenderingDeviceCommons {
r.layer_count = layers;
return r;
}

TextureFormat texture_format() const {
Member


In Godot, the naming convention in cases like this involves adding a get_ prefix. However, since there are already other cases of this other style in the code, I guess a future PR (maybe by @akien-mga or @AThousandShips) could do a uniformity pass.

@akien-mga (Member)

akien-mga commented May 21, 2024

As this is relatively big and I'm planning a 4.3-beta1 release this week, I'm scheduling this to be merged after beta1 and before beta2, so we can hopefully find potential regressions before making the beta2 build, or at least have a "stable" beta1 baseline to compare against if users report D3D12 or RenderGraph regressions in beta2.
