
[Bug]: sparse element-wise multiplication returns wrong indptr on CUDA #1273

Open
ClaudiaComito opened this issue Nov 23, 2023 · 5 comments
Labels
bug Something isn't working sparse stale

Comments

@ClaudiaComito
Contributor

ClaudiaComito commented Nov 23, 2023

What happened?

While running our unit tests with PyTorch 2.1, the sparse module tests failed on GPU (see error message below). Tests passed on CPU.

The failure occurs with any number of processes, on a single GPU as well as multi-GPU.

Tested with CUDA only, not yet with ROCm.

Tagging @Mystic-Slice in case he wants to explore.

Python was actually 3.11 and PyTorch 2.1.0; I will update the issue template.

Code snippet triggering the error

heat.sparse.tests.test_arithmetics.TestArithmetics.test_mul
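For reference, the failing test can be run in isolation with the standard unittest loader; this is just a sketch, the exact launcher/MPI setup is assumed and may differ from our CI:

import unittest

# Hypothetical one-off runner for the failing test; wrap with mpirun/srun as needed.
suite = unittest.defaultTestLoader.loadTestsFromName(
    "heat.sparse.tests.test_arithmetics.TestArithmetics.test_mul"
)
unittest.TextTestRunner(verbosity=2).run(suite)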

Error message or erroneous outcome

======================================================================
FAIL: test_mul (heat.sparse.tests.test_arithmetics.TestArithmetics.test_mul)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/p/scratch/haf/comito1/devel/heat/heat/sparse/tests/test_arithmetics.py", line 750, in test_mul
    self.assertTrue(
AssertionError: tensor(False, device='cuda:0') is not true

Version

1.3.x

Python version

None

PyTorch version

None

MPI version

No response

@ClaudiaComito ClaudiaComito added bug Something isn't working sparse labels Nov 23, 2023
@ClaudiaComito ClaudiaComito changed the title [Bug]: sparse element-wise multiplication returns wrong indptr [Bug]: sparse element-wise multiplication returns wrong indptr on CUDA Nov 23, 2023
@ClaudiaComito ClaudiaComito mentioned this issue Nov 23, 2023
@Mystic-Slice
Collaborator

I tried reproducing it on my local machine.
There seems to be a change in the behavior of sparse torch tensors in version 2.1.1.

Code:

import torch

A = [[0, 0],
     [1, 0],
     [0, 2]]

B = [[1, 0],
     [0, 0],
     [2, 3]]

# Element-wise, the only mathematically nonzero entry of A * B is at (2, 1).
a = torch.tensor(A, device='cuda:0').float().to_sparse_csr()
b = torch.tensor(B, device='cuda:0').float().to_sparse_csr()

print(a * b)

Output Torch 2.0.0:

(torch2.0.0) mystic-slice@MysticSlice:/mnt/e/Opensource/heat$ python3 dummy.py
tensor(crow_indices=tensor([0, 0, 0, 1]),
       col_indices=tensor([1]),
       values=tensor([6.]), device='cuda:0', size=(3, 2), nnz=1, layout=torch.sparse_csr)

Output Torch 2.1.1:

(torch2.1.1) mystic-slice@MysticSlice:/mnt/e/Opensource/heat$ python3 dummy.py
tensor(crow_indices=tensor([0, 0, 1, 2]),
       col_indices=tensor([0, 1]),
       values=tensor([0., 6.]), device='cuda:0', size=(3, 2), nnz=2, layout=torch.sparse_csr)

The zero value produced by the multiplication (a stored nonzero in a meeting a zero in b) is kept as an explicitly stored value in the new version, and this happens only when run on the GPU.
I couldn't find any reference to this change in the release notes.
I think this is something PyTorch has to sort out, because a tensor's behaviour should not depend on the device it lives on.
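If it does turn out to be intended behaviour, a possible workaround on our side would be to prune the explicitly stored zeros before comparing results. A minimal sketch (prune_explicit_zeros_csr is a hypothetical helper, not part of Heat):

def prune_explicit_zeros_csr(t):
    # Round-trip through COO, drop stored zeros, and rebuild the CSR tensor so
    # that CPU and CUDA results have identical crow_indices/col_indices/values.
    coo = t.to_sparse_coo().coalesce()
    mask = coo.values() != 0
    pruned = torch.sparse_coo_tensor(
        coo.indices()[:, mask], coo.values()[mask], size=t.shape, device=t.device
    )
    return pruned.to_sparse_csr()

# e.g. prune_explicit_zeros_csr(a * b) should match the torch 2.0.0 output above.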

What do you think we should do, @ClaudiaComito?

@ClaudiaComito
Contributor Author

Brilliant, @Mystic-Slice, thanks for looking into this!

I think you should go ahead and report it to PyTorch. When we merge support for PyTorch 2.1, we'll skip that test until a fix is out. Does that sound reasonable?
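A minimal sketch of what that skip could look like (placement and version check are assumptions, not the actual Heat test code):

import unittest
import torch

class TestArithmetics(unittest.TestCase):
    # Crude lexicographic version check; a proper version parse would be safer.
    @unittest.skipIf(
        torch.cuda.is_available() and torch.__version__.split("+")[0] >= "2.1",
        "torch >= 2.1 keeps explicit zeros in sparse CSR element-wise mul on CUDA",
    )
    def test_mul(self):
        ...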

@Mystic-Slice
Collaborator

Yeah. Sounds good.
I will raise an issue in the PyTorch repo.

@ClaudiaComito
Contributor Author

Reported here. Thanks again, @Mystic-Slice!

Contributor

This issue is stale because it has been open for 60 days with no activity.

@github-actions github-actions bot added the stale label Jan 29, 2024