Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add debug utilities for Vulkan #90993

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

darksylinc
Copy link
Contributor

@darksylinc darksylinc commented Apr 21, 2024

Features:

  • Debug-only tracking of objects by type. See get_driver_allocs_by_object_type et al.
  • Debug-only Breadcrumb info for debugging GPU crashes and device lost
  • Performance report per frame from get_perf_report
  • Some VMA calls had to be modified in order to insert the necessary memory callbacks

Functionality marked as "debug-only" is only available in debug or dev builds.

Misc fixes:

  • Shaders are more aggressively unloaded
  • Early break optimization in RenderingDevice::uniform_set_create

============================

The work was performed by collaboration of TheForge and Google. I am merely splitting it up into smaller PRs and cleaning it up.

PR comments

This is the first PR from the TheForge changed. This one is the most important one because:

  1. It sets the foundation for the rest of the upcoming changes.
  2. 99% of these changes should be harmless. They are not compiled in non-debug and non-dev builds.
  3. They're quite verbose in terms of lines of code, so getting this out of the way really reduces the mental load.

Update

Thanks everyone for the help! Most of the issues have been addressed. Almost all CI passes now. Windows now compiles.

3 outstanding things remain:

  1. Turning breadcrumbs into a single uint32_t. I want the enum to still be defined in rendering_device_commons.h and document it better. Done.
  2. CI is complaining about draw_list_begin changing its signature (this is correct; although I'm not sure if the error is triggering because the bindings are wrong?). Help & Tips on this are appreciated.
  3. I've been talking with Dario, and it appears TheForge added RenderingDevice::shader_destroy_modules which must be called by hand to free (unused) Shader modules after creating the PSO. We both agree it seems like adding a shotgun to the API just to save few kilobytes of RAM (maybe a few megabytes?). I will remove it from this PR and ask TheForge for further info (perhaps it's a lot more RAM than we think?). It may also be a better idea to destroy shader modules after each PSO creation (unless they're shared), instead of doing it at higher level. Done for now. We'll revisit this later.

Comment on lines 274 to 311
enum BreadcrumbMarker {
NONE = 0,
// Environment
REFLECTION_PROBES,
SKY_PASS,
// Light mapping
LIGHTMAPPER_PASS,
// Shadows
SHADOW_PASS_DIRECTIONAL,
SHADOW_PASS_CUBE,
// Geometry passes
OPAQUE_PASS,
ALPHA_PASS,
TRANSPARENT_PASS,
// Screen effects
POST_PROCESSING_PASS,
BLIT_PASS,
UI_PASS,
// Other
DEBUG_PASS
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel the responsibility of RenderingDevice should stay as abstract as possible in this case and not have knowledge of how the rendering pipeline works. Perhaps we can keep the breadcrumbs as a typedef of just uint32_t and the renderers (Forward+, Mobile, or future ones) can assign the semantic meaning to the numbers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Breadcrumbs don't really add knowledge.

They're just tagging an ID. The idea is that if the app crashes we look at the report and see "what was being executed?" oh breadcrumb says ID = 2, which corresponds to... let me see... SKY_PASS.

It's up to the rendering pipelines to actually use those IDs.

In fact some parts of the code do the following:

breadcrumb = LIGHTMAPPER_PASS | (pass_id << 16)

In other words, it doesn't give knowledge about the upper levels; it's just a way of standardizing some common tags that are more likely to be used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree with both of you.

Dario's comment is more about a separation of responsibilities. In the RenderingDevice API, we should just use uint32_t's and then in the Renderer itself, we should use an enum to standardize the common tags

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But wouldn't that mean that:

  1. Mobile and Forward+ would have their own copies of the enum, which can get out of sync. The main issue about that is that it can cause silly mistakes when debugging that can waste hours (i.e. user got id = 7, dev looked it up what that means, but saw the wrong version).
  2. Users who want to use the RenderingDevice directly have no information about what that id is for, and even if they do, they would not adhere to any convention.

Btw I looked again and turns out only 16 bits of each breadcrumb argument are used:

uint32_t breadcrumb = (p_phase << 16) | (p_breadcrumb_data & ((1 << 16) - 1));

Perhaps we could change this argument signature to...?

// original:
void draw_list_begin(RID p_framebuffer, InitialAction p_initial_color_action, FinalAction p_final_color_action, InitialAction p_initial_depth_action, FinalAction p_final_depth_action, const Vector<Color> &p_clear_color_values, float p_clear_depth, uint32_t p_clear_stencil, const Rect2 &p_region, RDD::BreadcrumbMarker p_phase, uint32_t p_breadcrumb_data) {}

// My proposal:
enum BreadcrumbMarker {
   // Stays the same.
}

uint32_t breadcrumb( BreadcrumbMarker p_phase, uint16_t p_data ) {
  return (p_phase << 16) | (p_data & 0xFFFF)
}

void draw_list_begin(RID p_framebuffer, InitialAction p_initial_color_action, FinalAction p_final_color_action, InitialAction p_initial_depth_action, FinalAction p_final_depth_action, const Vector<Color> &p_clear_color_values, float p_clear_depth, uint32_t p_clear_stencil, const Rect2 &p_region, uint32_t p_breadcrumb_data) {}

This way the BreadcrumbMarker enum is still visible to everyone, but it is made clear the underlying datatype is an arbitrary uint32_t.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking back on this. I like your suggested approach. I really don't love the RDD having high level concepts like POST_PROCESSING_PASS or SHADOW_PASS_CUBE, but at the same time, to make this feature useful we want the debug utility to print out the human readable name for the pass and not force users to look it up. You are right that looking up an ID would be a huge annoyance and significantly reduce the value of breadcrumbs.

PipelineCacheRD *pipeline = nullptr;

pipeline = &shader->pipelines[cull_variant][primitive][shader_version];
PipelineCacheRD *pipeline = &shader->pipelines[cull_variant][primitive][shader_version];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as a heads up, we will have a merge conflict on this case as this code will be replaced in #90400. Seeing as it's for aesthetic reasons I'd forego this change and the removal of the comment below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh. This is exactly why I included it (so someone could notice it, it wasn't an aesthetic change but more like bait).

There are more upcoming changes that will affect RenderForwardMobile::_render_list_template, as there is an optimization that uses Spec Constants for key in sorting the draw calls for better draw call batching.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will probably have to ask for your help when the time comes.

core/os/memory.cpp Outdated Show resolved Hide resolved
core/os/memory.cpp Outdated Show resolved Hide resolved
servers/rendering/rendering_device.cpp Outdated Show resolved Hide resolved
servers/rendering/rendering_device.cpp Outdated Show resolved Hide resolved
servers/rendering/rendering_device.cpp Outdated Show resolved Hide resolved
drivers/vulkan/rendering_device_driver_vulkan.cpp Outdated Show resolved Hide resolved
drivers/vulkan/rendering_device_driver_vulkan.cpp Outdated Show resolved Hide resolved
core/os/memory.h Show resolved Hide resolved
servers/rendering/rendering_device_driver.h Show resolved Hide resolved
servers/rendering/rendering_device_driver.h Show resolved Hide resolved
doc/classes/RenderingDevice.xml Outdated Show resolved Hide resolved
doc/classes/RenderingDevice.xml Outdated Show resolved Hide resolved
drivers/vulkan/rendering_device_driver_vulkan.cpp Outdated Show resolved Hide resolved
drivers/vulkan/rendering_device_driver_vulkan.cpp Outdated Show resolved Hide resolved
servers/rendering/rendering_device.cpp Show resolved Hide resolved
drivers/vulkan/rendering_device_driver_vulkan.cpp Outdated Show resolved Hide resolved
drivers/vulkan/rendering_device_driver_vulkan.cpp Outdated Show resolved Hide resolved
drivers/vulkan/rendering_device_driver_vulkan.cpp Outdated Show resolved Hide resolved
servers/rendering/rendering_device.h Outdated Show resolved Hide resolved
servers/rendering/rendering_device.cpp Outdated Show resolved Hide resolved
@darksylinc darksylinc force-pushed the matias-TheForge branch 5 times, most recently from d71ead1 to a7aa977 Compare April 27, 2024 20:25
@darksylinc
Copy link
Contributor Author

I've made an update. Look at the "3 outstanding things remain:" in the top post where I ask for a bit of help 😄

@darksylinc darksylinc force-pushed the matias-TheForge branch 3 times, most recently from 7e15b93 to 23c4298 Compare April 28, 2024 16:13
doc/classes/RenderingDevice.xml Outdated Show resolved Hide resolved
doc/classes/RenderingDevice.xml Outdated Show resolved Hide resolved
doc/classes/RenderingDevice.xml Outdated Show resolved Hide resolved
doc/classes/RenderingDevice.xml Outdated Show resolved Hide resolved
doc/classes/RenderingDevice.xml Outdated Show resolved Hide resolved
doc/classes/RenderingDevice.xml Outdated Show resolved Hide resolved
servers/rendering/rendering_device.cpp Outdated Show resolved Hide resolved
@darksylinc
Copy link
Contributor Author

@DarioSamo pointed me how to make a backwards compatible C# binding (by editing rendering_device.compat.inc).

However CI is still failing.

I'm not sure if I did something wrong, but I noticed that we may be in a bit of a pickle: draw_list_begin already had a compatibility method and I suspect this is causing problems:

Previous version before my PR:

DrawListID draw_list_begin(RID p_framebuffer, InitialAction p_initial_color_action, FinalAction p_final_color_action, InitialAction p_initial_depth_action, FinalAction p_final_depth_action, const Vector<Color> &p_clear_color_values = Vector<Color>(), float p_clear_depth = 1.0, uint32_t p_clear_stencil = 0, const Rect2 &p_region = Rect2());

New Version:

DrawListID draw_list_begin(RID p_framebuffer, InitialAction p_initial_color_action, FinalAction p_final_color_action, InitialAction p_initial_depth_action, FinalAction p_final_depth_action, const Vector<Color> &p_clear_color_values = Vector<Color>(), float p_clear_depth = 1.0, uint32_t p_clear_stencil = 0, const Rect2 &p_region = Rect2(), uint32_t p_breadcrumb = 0);

That is, my version just adds uint32_t p_breadcrumb = 0 at the end.

Old version before PR #84976:

DrawListID draw_list_begin(RID p_framebuffer, InitialAction p_initial_color_action, FinalAction p_final_color_action, InitialAction p_initial_depth_action, FinalAction p_final_depth_action, const Vector<Color> &p_clear_color_values, float p_clear_depth, uint32_t p_clear_stencil, const Rect2 &p_region, const TypedArray<RID> &p_storage_textures);

That is, this version used to have const TypedArray<RID> &p_storage_textures at the end, but it got removed.

The problem is that the very old version used to have const TypedArray<RID> &p_storage_textures, then it got removed, and now I'm adding a new parameter uint32_t p_breadcrumb = 0 which are clashing.

I'm not sure if this is simply a CI bug where the compat test gets confused, or it will fail in real C#.

The only solution I can think of is to add draw_list_begin2 and mark draw_list_begin as deprecated.

@darksylinc darksylinc requested review from a team as code owners May 5, 2024 17:16
@darksylinc darksylinc force-pushed the matias-TheForge branch 2 times, most recently from c90205a to 5b6d8cf Compare May 5, 2024 19:27
@darksylinc
Copy link
Contributor Author

All checks passed!

@darksylinc darksylinc force-pushed the matias-TheForge branch 2 times, most recently from 5cc5c3a to 8c7c196 Compare May 15, 2024 04:02
@clayjohn clayjohn modified the milestones: 4.x, 4.3 May 15, 2024
Copy link
Member

@clayjohn clayjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me to merge for Beta 1 pending the small changes suggested by Bruvzg.

I would like to get this in to Beta 1 as it will help massively for debugging Vulkan issues. We fixed a lot of the significant bugs in 4.3 already, the rest are going to be nearly impossible to track down without this additional info. So IMO, its best to get this in now so that users of 4.3 that still run into Vulkan bugs can give us a sensible report by running a debug build

drivers/vulkan/rendering_device_driver_vulkan.cpp Outdated Show resolved Hide resolved
AThousandShips

This comment was marked as resolved.

@AThousandShips AThousandShips dismissed their stale review May 17, 2024 10:29

My bad missed there was a matching one in the 4.3 cycle

--------
Validate extension JSON: Error: Field 'classes/RenderingDevice/methods/draw_list_begin/arguments/9': type changed value in new API, from "Array" to "int".
Validate extension JSON: Error: Field 'classes/RenderingDevice/methods/draw_list_begin/arguments/9': default_value changed value in new API, from "Array[RID]([])" to "0".
Validate extension JSON: Error: Field 'classes/RenderingDevice/methods/draw_list_begin/arguments/9': type changed value in new API, from "typedarray::RID" to "int".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only one of these should be kept I think, the type part

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last two are complaints from 4.2 -> 4.3
But the last one is a complaint from 4.1 (or 4.0?) -> 4.3

I was a bit confused by it because the same issue should've happened when going 4.1 -> 4.2

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm there seems to be a clash with #84976, which is a 4.2->4.3 change that removed the 9th argument to draw_list_begin.

Now this PR adds a new 9th argument with a different type.

I think it's the first time we have this case, not sure if the script we use to check for compatibility issues is able to handle that. CC @dsnopek

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since both changes were in 4.3, I would expect it to treat this like the 9th argument changed type, completely ignoring the previous removal of the argument.

The weird thing is these two lines:

Validate extension JSON: Error: Field 'classes/RenderingDevice/methods/draw_list_begin/arguments/9': type changed value in new API, from "Array" to "int".
Validate extension JSON: Error: Field 'classes/RenderingDevice/methods/draw_list_begin/arguments/9': type changed value in new API, from "typedarray::RID" to "int".

I'm not sure how it can change from both Array and typedarray::RID.

Looking at the reference extension_api.json, it seems like typedarray::RID is the correct one.

What happens if you remove this line from the 4.2-stable.expected:

Validate extension JSON: Error: Field 'classes/RenderingDevice/methods/draw_list_begin/arguments/9': type changed value in new API, from "Array" to "int".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also the CI:

The following validation errors no longer occur (compared to 4.2-stable):
Validate extension JSON: Error: Field 'classes/RenderingDevice/methods/draw_list_begin/arguments/9': type changed value in new API, from "Array" to "int".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In godot-4.0-stable/gdextension/extension_api.json:

{
	"name": "storage_textures",
	"type": "Array",
	"default_value": "Array[RID]([])"
}

In godot-4.1-stable/gdextension/extension_api.json & godot-4.2-stable/gdextension/extension_api.json:

{
	"name": "storage_textures",
	"type": "typedarray::RID",
	"default_value": "Array[RID]([])"
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try removing this line as well:

Validate extension JSON: Error: Field 'classes/RenderingDevice/methods/draw_list_begin/arguments/9': type changed value in new API, from "Array" to "typedarray::RID".

Or change it to fit the new value

Copy link
Member

@akien-mga akien-mga May 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into it myself, I think Matias' time is better spent on rendering issues than on debugging our CI API compat testing framework.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO our validation scripts are buggy and need to be fixed to properly handle APIs changing multiple times across versions.

But for now this should be an ok workaround to silence both the error if not adding the line, and the false positive warning when adding it in the 4.2 file.

diff --git a/misc/extension_api_validation/4.0-stable_4.1-stable.expected b/misc/extension_api_validation/4.0-stable_4.1-stable.expected
index 5c3bf07fb2..7388c90e2e 100644
--- a/misc/extension_api_validation/4.0-stable_4.1-stable.expected
+++ b/misc/extension_api_validation/4.0-stable_4.1-stable.expected
@@ -349,3 +349,11 @@ Validate extension JSON: Error: Hash changed for 'classes/EditorUndoRedoManager/
 Validate extension JSON: Error: Hash changed for 'classes/UndoRedo/methods/create_action', from 0AEC1BFC to E87757EB. This means that the function has changed and no compatibility function was provided.
 
 Added a optional parameters with default values. No adjustments should be necessary.
+
+
+GH-90993
+--------
+Validate extension JSON: Error: Field 'classes/RenderingDevice/methods/draw_list_begin/arguments/9': type changed value in new API, from "Array" to "int".
+
+The change to `int` actually happened from 4.2 to 4.3, but our current validation script doesn't handle well the case where an argument changed type twice in different versions.
+This argument already changed in 4.1 in GH-76418, so the script gets confused.
diff --git a/misc/extension_api_validation/4.2-stable.expected b/misc/extension_api_validation/4.2-stable.expected
index 54f95cde58..cf80419cf9 100644
--- a/misc/extension_api_validation/4.2-stable.expected
+++ b/misc/extension_api_validation/4.2-stable.expected
@@ -367,7 +367,6 @@ Optional arguments added. Compatibility methods registered.
 
 GH-90993
 --------
-Validate extension JSON: Error: Field 'classes/RenderingDevice/methods/draw_list_begin/arguments/9': type changed value in new API, from "Array" to "int".
 Validate extension JSON: Error: Field 'classes/RenderingDevice/methods/draw_list_begin/arguments/9': default_value changed value in new API, from "Array[RID]([])" to "0".
 Validate extension JSON: Error: Field 'classes/RenderingDevice/methods/draw_list_begin/arguments/9': type changed value in new API, from "typedarray::RID" to "int".
 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!!!

I've applied your patch, rebased to latest master, and pushed.

Now all CI passes!

@darksylinc darksylinc requested a review from a team as a code owner May 18, 2024 14:44
@darksylinc darksylinc force-pushed the matias-TheForge branch 2 times, most recently from 698b419 to 2446b41 Compare May 18, 2024 16:23
Comment on lines +84 to +93
void *Memory::realloc_aligned_static(void *p_memory, size_t p_bytes, size_t p_prev_bytes, size_t p_alignment) {
if (p_memory == nullptr) {
return alloc_aligned_static(p_bytes, p_alignment);
}

void *ret = alloc_aligned_static(p_bytes, p_alignment);
memcpy(ret, p_memory, p_prev_bytes);
free_aligned_static(p_memory);
return ret;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say this has to be on realloc_static(), to avoid pessimizing for the case where the address changes.

@akien-mga akien-mga modified the milestones: 4.3, 4.x May 28, 2024
Features:
- Debug-only tracking of objects by type. See
get_driver_allocs_by_object_type et al.
 - Debug-only Breadcrumb info for debugging GPU crashes and device lost
 - Performance report per frame from get_perf_report
- Some VMA calls had to be modified in order to insert the necessary
memory callbacks

Functionality marked as "debug-only" is only available in debug or dev
builds.

Misc fixes:
 - Early break optimization in RenderingDevice::uniform_set_create

============================

The work was performed by collaboration of TheForge and Google. I am
merely splitting it up into smaller PRs and cleaning it up.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants