#12502 Remove limit on RenderLayers. #13317

tychedelia · 2024-05-10T17:52:33Z

Objective

Remove the limit of RenderLayer by using a growable mask using SmallVec.

Changes adopted from @UkoeHB's initial PR here #12502 that contained additional changes related to propagating render layers.

Changes

Solution

The main thing needed to unblock this is removing RenderLayers from our shader code. This primarily affects DirectionalLight. We are now computing a skip field on the CPU that is then used to skip the light in the shader.

Testing

Checked a variety of examples and did a quick benchmark on many_cubes. There were some existing problems identified during the development of the original pr (see: https://discord.com/channels/691052431525675048/1220477928605749340/1221190112939872347). This PR shouldn't change any existing behavior besides removing the layer limit (sans the comment in migration about all layers no longer being possible).

Changelog

Removed the limit on RenderLayers by using a growable bitset that only allocates when layers greater than 64 are used.

Migration Guide

RenderLayers::all() no longer exists. Entities expecting to be visible on all layers, e.g. lights, should compute the active layers that are in use.

tychedelia · 2024-05-10T18:35:30Z

@james7132 I think I know how to use tracy, would you mind pointing me in the direction of what I should benchmark or any documentation for this relative to rendering prs?

crates/bevy_dev_tools/src/ui_debug_overlay/mod.rs

examples/3d/render_to_texture.rs

crates/bevy_pbr/src/light/mod.rs

crates/bevy_pbr/src/render/mesh_view_types.wgsl

crates/bevy_render/src/view/visibility/render_layers.rs

Co-authored-by: robtfm <[email protected]>

superdump · 2024-05-13T20:34:52Z

crates/bevy_pbr/src/render/light.rs

 {
+ let gpu_light = &mut gpu_lights.directional_lights[light_index];


Minor but gpu_light is only used inside the if block, right?

superdump · 2024-05-13T20:40:27Z

crates/bevy_pbr/src/render/light.rs

+
+ // Check if the light intersects with the view.
+ if !view_layers.intersects(&light.render_layers) {
+ gpu_light.skip = 1u32;


So this makes it so that if the render layers of the view don’t intersect the render layers of the light then the light is disabled in the view. Makes sense.

superdump

Assuming performance is ok, and I don’t see why it wouldn’t be given that the shader code is doing one fewer operation than it would have been, then this is not controversial. So I’m just going to say it’s not controversial.

james7132

I would appreciate a bit more perf investigation but I won't block on it.

I'll raise an objection on this basis. A SmallVec is still going to be 6x larger as a component and involves other base costs (i.e. the capacity check on every access). This is quite a lot to be adding to where we were previously just fetching and doing bitwise operations on a u32. This is also expected to be on the majority of entities that should be rendered, so these costs can easily scale with scene size. My biggest concern is that once there's enough layers to spill onto the heap, the performance of visibility calculation will tank as the cache hit rate drops.

Expanding the fixed size to u64 or even u128 is likely less controversial, and are likely already more than sufficient given that Unity/Unreal already ship production games with a hard 32 render layer maximum. A u128 is enough to give a Fortnite-like game of 100+ players their own unique view and still render a shared world state, though I personally would not implement it that way, and it would still use less memory/fewer instructions to process than a SmallVec.

I'd definitely suggest checking this against many_cubes, potentially with a variety of layers settings used.

james7132 · 2024-05-13T23:01:01Z

crates/bevy_render/src/view/visibility/render_layers.rs

 #[reflect(Component, Default, PartialEq)]
-pub struct RenderLayers(LayerMask);
+pub struct RenderLayers(SmallVec<[u64; 1]>);


Suggested change

pub struct RenderLayers(SmallVec<[u64; 1]>);

pub struct RenderLayers(SmallVec<[u64; 2]>);

With the union feature on, we can fit 2 u64s on the stack before it starts to grow bigger than a base Vec (on 64-bit platforms), alongside the one usize for the capacity check. This should support at least 128 layers before it spills onto the heap.

Thanks, this is a good call out on this feature.

tychedelia · 2024-05-14T02:14:51Z

I definitely appreciate concerns about performance and do not want to regress the base case here. I'm happy to do some more testing as best I can.

This is also expected to be on the majority of entities that should be rendered, so these costs can easily scale with scene size.

I'm not sure I understand this comment. The baseline I'd expect is that neither the view nor the entity has a render layer, in which case we're just checking that the default equals the default. In fact, because we're providing a Default impl for &RenderLayers that points to a static, we could theoretically just start doing a ptr::eq for the common case here where the view and the entity point to the same default layer.

My biggest concern is that once there's enough layers to spill onto the heap, the performance of visibility calculation will tank as the cache hit rate drops.

That's a fair concern that I would also like to understand better, but it's also worth pointing out that the status quo is that the user simply cannot go beyond 64 32 layers. Assuming the non-heap scenario doesn't regress too much, it's strictly better to be able to do something than not to do it. Plus, there are also scenarios, like the one I'm specifically trying to unblock for myself, with many layers and very few entities.

Expanding the fixed size to u64 or even u128 is likely less controversial, and are likely already more than sufficient

In my use case I need to spawn a camera for every node in a node graph which is necessarily unbounded. There are absolutely other ways that I could work around the limit, but they are very complex. I recognize this is more niche and so don't want to ram through changes solely on my behalf, but it's a pretty hard blocker for me personally.

I'll try to collect some more evidence with regards to the impact here.

tychedelia · 2024-05-14T03:05:44Z

Micro benchmark on intersects shows a 70% regression. This isn't too surprising since we are are doing an iter versus just a bitwise op.

check_visibility shows a ~5% regression.
frames shows no regression

Tested my M2 mbp.

For many_cubes, I attached an explicit layer to the camera and spawned each cube in one of 3 possible layers.

robtfm · 2024-05-14T10:06:59Z

i see a 2.5% regression in check_visibility without adding any layers, from 428ns to 440ns. adding if self.0.as_ptr() == other.0.as_ptr() { return true } to the start of the intersect test reduces that to 1.5%.

i don't think 12 nanoseconds (or 7 nanoseconds with the fast path) per frame for one thread for 160k entities is significant, but i don't write bullet hell games ... if this is a concern i guess we'd have to use a feature flag, which would be messy.

separately, running this myself i was getting a panic in prepare_lights where you replaced directional_shadow_enabled_count with MAX_DIRECTIONAL_LIGHTS on line 1141.
the change makes sense - we want to take as many shadow casters for the view as we can, even if there are shadow casters on other layers - but it looks like that needs an additional check for light.shadows_enabled as well to avoid trying to access cascades when they may not exist. in case shadows_enabled is false, we can break as the shadow casters are sorted to the front.

james7132 · 2024-05-14T15:20:29Z

Can you check how much of an impact spilling onto the heap would have in the worst case?

tychedelia · 2024-05-14T16:28:10Z

adding if self.0.as_ptr() == other.0.as_ptr() { return true } to the start of the intersect test reduces that to 1.5%.

Thanks for validating this, I've added it to intersects.

if this is a concern i guess we'd have to use a feature flag, which would be messy.

Barring some stuff around the edges regarding constness, we could do an ugly feature flag for this.

separately, running this myself i was getting a panic in prepare_lights where you replaced directional_shadow_enabled_count

Added a guard for this. Curious why I didn't hit it on my Mac.

Can you check how much of an impact spilling onto the heap would have in the worst case?

layer 10k vs layer 1. A 100% slowdown in check_visibility that results in only a 10% slowdown on frames.

A more realistic layer 256 vs layer 1 shows a 20% slowdown in check_visibility.

robtfm · 2024-05-14T17:15:43Z

Curious why I didn't hit it on my Mac.

I ran cargo run --release --example many_cubes -- --benchmark

Will check its fixed shortly

robtfm · 2024-05-14T19:50:58Z

works for me too now.

tychedelia · 2024-05-14T20:52:02Z

Don't want to claim to be a perf expert, but the numbers seem pretty reasonable to me? I'm sure there are scenarios we haven't covered yet.

If the performance hit is acceptable, I don't think the additional maintence burden is worth it, but I will just say that I was able to implement a feature flag like so that "just works":

#[cfg(feature = "unlimited_render_layers")]
#[path = "unlimited_render_layers.rs"]
mod render_layers;
#[cfg(not(feature = "unlimited_render_layers"))]
mod render_layers;

alice-i-cecile · 2024-05-14T21:43:42Z

I don't think feature flagging this is worth it. That's a nasty bit of maintenance burden, and IMO the flexibility here is worth the performance cost.

Brezak · 2024-05-16T00:30:36Z

I know this is pretty far into development but have we considered a tagged pointer to a slice with it's size stored inline?

The render layer struct would be a single pointer whose address would define what it contains:

0bXX...XX00 A address to 4 byte aligned bitset with it's size stored in the allocation.
0bXX...XX01 A 62/30 bit bitset stored in the address bits.
0b11...1111 A full bitset.

A intersection would be performed as follows:

If neither bitset is allocated on the heap do a bitwise and of the addresses and return the result.
If one bitset is heap allocated and the other is full return a clone of the allocated one.
If both bitsets are heap allocated return a heap allocated bitset.

If most our users only use the lower render layers or all render layers intersections and clones would operate without allocating.

Pointer provenance shouldn't be an issue since we only ever dereference pointers whose addresses we haven't modified in any way.

alice-i-cecile · 2024-05-16T16:15:44Z

The powers that SME have decided that we should merge this as is, and tackle perf follow-ups in future PRs if they become meaningful (or someone wants to get nerd-sniped).

Merging!

# Objective - in example `render_to_texture`, #13317 changed the comment on the existing light saying lights don't work on multiple layers, then add a light on multiple layers explaining that it will work. it's confusing ## Solution - Keep the original light, with the updated comment ## Testing - Run example `render_to_texture`, lighting is correct

tychedelia added 8 commits May 10, 2024 10:51

bevyengine#12502 Remove limit on RenderLayers.

77f544b

Cargo fmt.

89e8ed2

Add const versions of with/without.

8d992d8

Fml const.

d70895c

CI.

7010bfd

CI.

322ae90

CI.

3b58a42

CI.

a90b431

james7132 added A-Rendering Drawing game state to the screen C-Usability A simple quality-of-life change that makes Bevy easier to use S-Needs-Benchmarking This set of changes needs performance benchmarking to double-check that they help labels May 10, 2024

tychedelia mentioned this pull request May 10, 2024

Unify RenderLayers and TargetCamera (Part 1: RenderGroups) [ADOPT ME] #12502

Open

5 tasks

Add tests for resizing.

8252d2d

alice-i-cecile added X-Uncontroversial This work is generally agreed upon D-Straightforward Simple bug fixes and API improvements, docs, test and examples labels May 10, 2024

Cargo fmt.

afefa42

robtfm reviewed May 11, 2024

View reviewed changes

tychedelia and others added 9 commits May 11, 2024 11:44

Update examples/3d/render_to_texture.rs

7263b05

Co-authored-by: robtfm <[email protected]>

Remove unecessary copies.

a4bf84f

Move skip field to end of struct.

e593983

Fix iter.

8810fbd

Cargo fmt.

1c8e99d

Clippy.

b862aee

Fix mistaken import.

a5aed1c

Ci.

65c8e96

Add none_with const constructor.

1461498

tychedelia requested a review from robtfm May 11, 2024 19:51

tychedelia added 2 commits May 11, 2024 12:56

Ci.

92508e1

Fix none_with logic.

9ef5181

superdump reviewed May 13, 2024

View reviewed changes

james7132 reviewed May 13, 2024

View reviewed changes

alice-i-cecile added S-Waiting-on-Author The author needs to make changes or address concerns before this can be merged and removed S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it labels May 13, 2024

alice-i-cecile added X-Contentious There are nontrivial implications that should be thought through and removed X-Uncontroversial This work is generally agreed upon labels May 14, 2024

Add check for lights enabled; Check ptr eq in intersects.

aae7525

alice-i-cecile added S-Ready-For-Final-Review This PR has been approved by the community. It's ready for a maintainer to consider merging it and removed S-Needs-SME Decision or review from an SME is required labels May 16, 2024

alice-i-cecile added this pull request to the merge queue May 16, 2024

Merged via the queue into bevyengine:main with commit 4c3b767 May 16, 2024
33 checks passed

mockersf mentioned this pull request May 16, 2024

example render_to_texture: remove extra light #13398

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

#12502 Remove limit on RenderLayers. #13317

#12502 Remove limit on RenderLayers. #13317

tychedelia commented May 10, 2024

tychedelia commented May 10, 2024

superdump May 13, 2024

superdump May 13, 2024

superdump left a comment

james7132 left a comment

james7132 May 13, 2024

tychedelia May 14, 2024

tychedelia commented May 14, 2024 •

edited

tychedelia commented May 14, 2024 •

edited

robtfm commented May 14, 2024

james7132 commented May 14, 2024

tychedelia commented May 14, 2024

robtfm commented May 14, 2024 •

edited

robtfm commented May 14, 2024

tychedelia commented May 14, 2024 •

edited

alice-i-cecile commented May 14, 2024 •

edited

Brezak commented May 16, 2024 •

edited

alice-i-cecile commented May 16, 2024

		{
		let gpu_light = &mut gpu_lights.directional_lights[light_index];

	pub struct RenderLayers(SmallVec<[u64; 1]>);
	pub struct RenderLayers(SmallVec<[u64; 2]>);

#12502 Remove limit on RenderLayers. #13317

#12502 Remove limit on RenderLayers. #13317

Conversation

tychedelia commented May 10, 2024

Objective

Solution

Testing

Changelog

Migration Guide

tychedelia commented May 10, 2024

superdump May 13, 2024

Choose a reason for hiding this comment

superdump May 13, 2024

Choose a reason for hiding this comment

superdump left a comment

Choose a reason for hiding this comment

james7132 left a comment

Choose a reason for hiding this comment

james7132 May 13, 2024

Choose a reason for hiding this comment

tychedelia May 14, 2024

Choose a reason for hiding this comment

tychedelia commented May 14, 2024 • edited

tychedelia commented May 14, 2024 • edited

robtfm commented May 14, 2024

james7132 commented May 14, 2024

tychedelia commented May 14, 2024

robtfm commented May 14, 2024 • edited

robtfm commented May 14, 2024

tychedelia commented May 14, 2024 • edited

alice-i-cecile commented May 14, 2024 • edited

Brezak commented May 16, 2024 • edited

alice-i-cecile commented May 16, 2024

tychedelia commented May 14, 2024 •

edited

tychedelia commented May 14, 2024 •

edited

robtfm commented May 14, 2024 •

edited

tychedelia commented May 14, 2024 •

edited

alice-i-cecile commented May 14, 2024 •

edited

Brezak commented May 16, 2024 •

edited