Rasterize indices only so that the alpha composition can be done in python with more flexibility. #120

Closed · wants to merge 5 commits

Conversation

@liruilong940607 (Collaborator) commented Feb 2, 2024

Motivated by @hangg7's question on how to implement the distortion loss with gsplat, I'm abstracting the volrend integral calculation out of the rasterizer function, so that things like ND features, alpha-channel rendering, and the distortion loss can be implemented in python with torch's autodiff.

Add a new function rasterize_indices, which only returns indices and therefore never needs to be differentiable. With this, all gradients in the rasterization stage can be managed by native torch.

Note that rasterize_indices still applies early stopping, so the returned indices are only those that actually contribute to the volrend integral.
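For concreteness, here is a minimal sketch of how the alpha composition (the volrend integral) could then be written in native torch over the packed intersections. The names `gaussian_ids` / `pixel_ids`, and the assumption that intersections come back sorted by pixel and front-to-back within each pixel, are illustrative assumptions about the output of rasterize_indices, not its final API:

```python
import torch
from torch import Tensor


def composite_in_torch(
    gaussian_ids: Tensor,  # [M,] id of the gaussian for each intersection (assumed)
    pixel_ids: Tensor,     # [M,] flattened pixel id for each intersection (assumed)
    alphas: Tensor,        # [N,] per-gaussian opacity
    colors: Tensor,        # [N, D] per-gaussian features
    img_height: int,
    img_width: int,
) -> Tensor:
    # Gathers are plain torch indexing, so gradients flow back to alphas/colors.
    alpha = alphas[gaussian_ids].clamp(max=0.999)  # [M,]
    feats = colors[gaussian_ids]                   # [M, D]

    # Transmittance T_i = prod_{j<i, same pixel} (1 - alpha_j), computed as a
    # segmented exclusive cumsum in log space. Assumes pixel_ids is sorted and
    # intersections are ordered front-to-back within each pixel.
    log_1ma = torch.log1p(-alpha)
    csum = torch.cumsum(log_1ma, dim=0) - log_1ma         # global exclusive sum
    first = torch.ones_like(pixel_ids, dtype=torch.bool)  # start of each pixel segment
    first[1:] = pixel_ids[1:] != pixel_ids[:-1]
    seg_id = torch.cumsum(first.to(torch.long), dim=0) - 1
    trans = torch.exp(csum - csum[first][seg_id])          # reset the sum at each segment start

    # Volume rendering weights, then scatter-accumulate features onto their pixels.
    weights = alpha * trans                                # [M,]
    img = feats.new_zeros(img_height * img_width, feats.shape[-1])
    img.index_add_(0, pixel_ids, weights[:, None] * feats)
    return img.reshape(img_height, img_width, -1)
```

Anything else that needs the per-intersection weights, e.g. an alpha channel (accumulate `weights` alone) or a depth/distortion term, falls out of the same few lines, which is the flexibility this PR is after.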

In most cases this is inevitably slower than the rasterize_gaussians function, which fuses everything into a single kernel pass. However, it can be faster in the ND case (when D is large), since the ND features are then processed in parallel by native torch under the rasterize_indices approach.

Here are some profiling results (N is the number of gaussians, D is the ND color dimension):

| Impl. | N=1e3; D=3 | N=1e4; D=3 | N=1e5; D=3 | N=1e6; D=3 | N=1e3; D=32 | N=1e3; D=256 |
|---|---|---|---|---|---|---|
| rasterize gaussians | 1452 it/s | 535 it/s | 56 it/s | 4.5 it/s | 206 it/s | 31 it/s |
| rasterize indices | 588 it/s | 376 it/s | 53 it/s | 4.5 it/s | 472 it/s | 258 it/s |

The command line to get the above profiling (on NVIDIA TITAN RTX):

```bash
CUDA_LAUNCH_BLOCKING=1 python tests/test_rasterize.py --profile --N 1000 --D 256
```

@liruilong940607 (Collaborator, Author):
The distortion loss is also implemented in about 5 lines of python in the test_rasterize.py file (maybe it should move somewhere else); a hedged reconstruction is sketched below.
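For reference, this is roughly what those few lines can look like (the actual code lives in test_rasterize.py and may differ). `_exclusive_sum_per_ray` is a hypothetical helper standing in for any per-ray exclusive prefix sum over packed samples, e.g. the segmented cumsum sketched earlier or a nerfacc helper; the per-sample uniform term of the Mip-NeRF 360 loss is dropped here since the gaussian contributions are point-like:

```python
import torch
from torch import Tensor


def _exclusive_sum_per_ray(x: Tensor, ray_indices: Tensor) -> Tensor:
    """Hypothetical helper: exclusive prefix sum of x within each ray segment.
    Assumes ray_indices is sorted so that a ray's samples are contiguous."""
    csum = torch.cumsum(x, dim=0) - x                       # global exclusive sum
    first = torch.ones_like(ray_indices, dtype=torch.bool)  # start of each ray segment
    first[1:] = ray_indices[1:] != ray_indices[:-1]
    seg_id = torch.cumsum(first.to(torch.long), dim=0) - 1
    return csum - csum[first][seg_id]                        # reset at each segment start


def _distortion_loss(
    weights: Tensor, t_mids: Tensor, ray_indices: Tensor, n_rays: int
) -> Tensor:
    # Bidirectional term of the Mip-NeRF 360 distortion loss over packed samples:
    #   2 * sum_i w_i * (t_i * sum_{j<i} w_j - sum_{j<i} w_j t_j)
    loss_bi = 2 * weights * (
        t_mids * _exclusive_sum_per_ray(weights, ray_indices)
        - _exclusive_sum_per_ray(weights * t_mids, ray_indices)
    )
    loss = torch.zeros(n_rays, device=weights.device, dtype=weights.dtype)
    loss.index_add_(0, ray_indices, loss_bi)  # accumulate per-sample terms onto rays
    return loss.mean()
```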

@kerrj (Collaborator) commented Feb 7, 2024

Should we make this the default for rasterizing ND things?

```python
    opacity: Float[Tensor, "*batch 1"],
    img_height: int,
    img_width: int,
) -> Tensor:
```

Review comment (Collaborator): return type should be `Tuple[Tensor, Tensor]`

```python
    Returns:
        A Tensor:

        - **gaussian_ids** (Tensor): Packed (flattened) gaussian ids for intersects. [M,]
```

Review comment (Collaborator): Can you elaborate a bit in the docstrings on the return types? M is >> the number of pixels, right, since every gaussian gets counted once for each pixel it is seen in? If there are any guarantees on the formatting of the tensors, that would be nice to note too (e.g. sorted by pixel or sorted by gaussian).

```python
def _distortion_loss(
    weights: Tensor, t_mids: Tensor, ray_indices: Tensor, n_rays: int
) -> Tensor:
    from nerfacc import accumulate_along_rays, exclusive_sum
```

Review comment (Collaborator): add nerfacc as a dependency...?

Review comment (Collaborator): +1 on this dependency!

@kerrj (Collaborator) commented Feb 7, 2024

Testing on a 4090 I get pretty much the same speed between the default ND rasterization and the index-based one: `CUDA_LAUNCH_BLOCKING=1 python tests/test_rasterize.py --profile --N 1000 --D 256` produces 429 vs. 468 it/s, and N=1e6 produces 12 it/s for both.

@liruilong940607 (Collaborator, Author) commented Feb 7, 2024

> Testing on a 4090 I get pretty much the same speed between the default ND rasterization and the index-based one: `CUDA_LAUNCH_BLOCKING=1 python tests/test_rasterize.py --profile --N 1000 --D 256` produces 429 vs. 468 it/s, and N=1e6 produces 12 it/s for both.

Thanks for the further profiling. Yeah, with large N the sorting becomes the main bottleneck, so both approaches converge to a similar speed. I'm pretty surprised the difference is so minor in the small-N regime on the 4090, though. On a 3090 Ti I get 73 it/s vs. 354 it/s with nerfacc==0.5.3.

@kerrj (Collaborator) commented Feb 8, 2024

Interesting, I also tried it in nerfstudio and it runs out of memory (it requests something like 70 GB). Is there a way to get the footprint down, or is this just unavoidable because the indices array is so large?

@liruilong940607 (Collaborator, Author):

> Interesting, I also tried it in nerfstudio and it runs out of memory (it requests something like 70 GB). Is there a way to get the footprint down, or is this just unavoidable because the indices array is so large?

It is unavoidable if we want the *exact* solution. It is expected to consume a lot of memory at the start of training, when everything is transparent and each pixel probably has dozens of GSs intersecting it. At the end of training the footprint is much smaller, though, as the GSs become more opaque. A rough back-of-envelope estimate is sketched below.
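For a sense of scale, here is the kind of arithmetic involved (all numbers below are illustrative assumptions, not measurements from the nerfstudio run):

```python
# Rough footprint of the packed intersections early in training (illustrative numbers).
H, W = 1080, 1920              # one training view
hits_per_pixel = 150           # early on, many nearly-transparent GSs survive early stopping
M = H * W * hits_per_pixel     # ~3.1e8 packed intersections

bytes_per_hit = (
    8 + 8        # int64 gaussian id + int64 pixel id
    + 4 * 3      # float32 RGB gathered per intersection (D = 3)
    + 4 * 3      # roughly: alpha, transmittance, weight intermediates
)
total = 2 * M * bytes_per_hit    # autograd's saved tensors roughly double the forward buffers
print(f"~{total / 1e9:.0f} GB")  # already tens of GB for a single view at D = 3
```

With larger D or more hits per pixel this climbs quickly toward the kind of request reported above, which is why the footprint is hard to avoid for the exact solution.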

```cpp
std::tuple<torch::Tensor, torch::Tensor, torch::Tensor>
rasterize_indices_tensor(
    const std::tuple<int, int, int> tile_bounds,
    const std::tuple<int, int, int> block,
```

Review comment (Collaborator): can we update this interface so that we hide the block size from the user, as in #129?
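For illustration, hiding that detail could look roughly like this on the Python side; the 16x16 tile size matches the rasterizer's usual block, but the helper name is hypothetical and the final interface is whatever #129 settles on:

```python
BLOCK_X, BLOCK_Y = 16, 16  # tile size used by the rasterizer (assumed fixed)


def _tile_setup(img_height: int, img_width: int):
    """Hypothetical helper: derive the launch geometry from the image size
    so callers never pass tile_bounds/block themselves."""
    tile_bounds = (
        (img_width + BLOCK_X - 1) // BLOCK_X,   # tiles along x
        (img_height + BLOCK_Y - 1) // BLOCK_Y,  # tiles along y
        1,
    )
    block = (BLOCK_X, BLOCK_Y, 1)
    return tile_bounds, block
```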
