Fix mem size mismatch from split/chunk in const folding #125199
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/125199
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure
As of commit 9e9ccf6 with merge base 2c8237c: NEW FAILURE - The following job has failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot merge
This PR needs to be approved by an authorized maintainer before merge.
This pull request was exported from Phabricator. Differential Revision: D56663931
Summary: The chunk/split ops on weights/constants are folded in an FX pass, and each output tensor keeps the storage size of the original tensor (3x its actual size in the case of chunk(3)). The backend, however, computes the on-device memory size from the tensor's shape/stride/dtype. This causes a mismatch when copying weights/constants to the device: the memory allocated on device is always smaller than the weight/constant being copied, which results in a runtime error when loading the weight/constant (T172125529). This diff fixes the issue by cloning the tensors after const folding so that each tensor has the correct storage size.

Test Plan:
Before this change (18432 = 48 * 64 * 2 * 3):
```
RuntimeError: Failed to load constant getitem_idx0 split (remaining=18432) at fbcode/caffe2/torch/fb/acc_runtime/afg/afg_bindings.cpp:3422: Request failed because an invalid parameter
```
```
buck2 run mode/opt //caffe2/torch/fb/acc_runtime/afg/tests:test_operators-artemis -- -r test_mem_size_mismatch
```
```
Ran 1 test in 7.048s

OK
```
Reviewed By: jfix71

Differential Revision: D56663931
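For context, a minimal standalone sketch of the view-vs-storage behavior the summary describes; the 144x64 fp16 weight shape is an assumption chosen only to match the 18432-byte figure (48 * 64 * 2 * 3) from the test plan, not the actual model:

```python
import torch

# Assumed shape: an fp16 weight whose full storage is 144*64*2 = 18432 bytes.
w = torch.randn(144, 64, dtype=torch.float16)
a, b, c = torch.chunk(w, 3, dim=0)            # each chunk is (48, 64), a view into w's storage

# The chunk's logical size is 48*64*2 = 6144 bytes, but it still references the
# whole 18432-byte storage, so the serialized constant is 3x its actual size.
print(a.shape, a.untyped_storage().nbytes())  # torch.Size([48, 64]) 18432

# Cloning after const folding gives the tensor its own right-sized storage.
a = a.clone()
print(a.untyped_storage().nbytes())           # 6144
```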
This pull request was exported from Phabricator. Differential Revision: D56663931
@pytorchbot merge -f 'Landed internally' (Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)
Merge started. Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team
Looking at the unrelated Windows timeout failure on #125199, it looks like we don't have a timeout value set for C++ tests at the moment. In this case, a C++ test on Windows timed out after 2+ hours:
```
2024-05-02T23:35:34.0639067Z Running cpp/c10_TypeList_test 1/1 ... [2024-05-02 23:35:34.059021]
2024-05-02T23:35:34.0641108Z Executing ['pytest', 'C:\\actions-runner\\_work\\pytorch\\pytorch\\build\\win_tmp\\build\\torch\\test\\c10_TypeList_test.exe', '-m', 'not serial', '-v', '-vv', '-rfEX', '-n', '2', '--junit-xml-reruns', 'test-reports\\python-pytest\\test\\run_test\\test\\run_test-c898ddeff8f33cbf.xml', '-x', '--reruns=2'] ... [2024-05-02 23:35:34.062137]
2024-05-03T02:45:33.7862004Z Process SpawnPoolWorker-2:
2024-05-03T02:45:33.7927201Z Traceback (most recent call last):
2024-05-03T02:45:33.7928032Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 315, in _bootstrap
2024-05-03T02:45:33.7928722Z     self.run()
2024-05-03T02:45:33.7929722Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\process.py", line 108, in run
2024-05-03T02:45:33.7931639Z     self._target(*self._args, **self._kwargs)
2024-05-03T02:45:33.7932435Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\pool.py", line 114, in worker
2024-05-03T02:45:33.7933338Z     task = get()
2024-05-03T02:45:33.7933946Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\queues.py", line 365, in get
2024-05-03T02:45:33.7935219Z     res = self._reader.recv_bytes()
2024-05-03T02:45:33.7935897Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 221, in recv_bytes
2024-05-03T02:45:33.7936609Z     buf = self._recv_bytes(maxlength)
2024-05-03T02:45:33.7937302Z   File "C:\Jenkins\Miniconda3\lib\multiprocessing\connection.py", line 310, in _recv_bytes
2024-05-03T02:45:33.7938316Z     waitres = _winapi.WaitForMultipleObjects(
2024-05-03T02:45:33.7938766Z KeyboardInterrupt
```
Retrying was working, but it was already too late to finish the job. I'm setting the same default `THRESHOLD * 3` timeout value here for C++ tests.

Pull Request resolved: #125517
Approved by: https://github.com/clee2000
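As a rough illustration of that follow-up change (the helper name and the `THRESHOLD` value below are assumptions for the sketch, not the actual run_test.py code), the idea is simply to cap how long a C++ test binary may run instead of letting it hang for hours:

```python
import subprocess

THRESHOLD = 600                   # assumed per-test threshold in seconds
CPP_TEST_TIMEOUT = THRESHOLD * 3  # same default multiplier mentioned above

def run_cpp_test(test_binary: str) -> int:
    """Run a C++ test executable under pytest with a hard timeout (hypothetical helper)."""
    cmd = ["pytest", test_binary, "-v", "--reruns=2"]
    try:
        return subprocess.run(cmd, timeout=CPP_TEST_TIMEOUT).returncode
    except subprocess.TimeoutExpired:
        # Fail the test quickly instead of stalling the whole Windows job for 2+ hours.
        return 1
```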