Describe the bug
I am using the transformers Trainer + accelerate to fine-tune language models. I noticed that after training, even after calling gc.collect() and torch.cuda.empty_cache(), the trainable layers stick around in GPU memory (on all ranks).
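For context, a minimal sketch of the kind of post-training check that exposes the symptom (the names and the exact call site are illustrative, not my actual script):

```python
import gc
import torch

def report_cuda_memory(tag: str) -> None:
    # Per-rank allocated bytes; a leak shows up as a number that never drops.
    print(f"{tag}: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")

# Hypothetical: trainer.train() has finished and every reference I hold to
# the Trainer, model, and optimizer has been deleted.
gc.collect()
torch.cuda.empty_cache()
report_cuda_memory("post-training")  # stays high if the parameters leak
```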
By attaching a debugger and weakrefs to the torch parameters, I was able to trace it at least as far as _hp_mapping and the methods added on top of the params:

DeepSpeed/deepspeed/utils/mixed_precision_linkage.py
Lines 33 to 37 in f32ad3e

These references never drop to zero, even when the optimizer is completely destroyed.
I am a bit puzzled why, and I wonder whether the memory is leaking because torch.Tensor / torch.nn.Parameter are not pure-Python implementations, so adding methods on top of them can cause unaccounted-for reference leaks.
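To illustrate the pattern I suspect (a sketch only: `get_full_hp_param` below is a placeholder, not the real DeepSpeed helper), binding a method onto the parameter itself stores the parameter inside its own `__dict__`, creating a reference cycle that keeps a weakref alive:

```python
import gc
import types
import weakref
import torch

def get_full_hp_param(self):
    # Placeholder for the real helper; only the binding pattern matters here.
    return self._hp_mapping

p = torch.nn.Parameter(torch.zeros(4))
p._hp_mapping = None
# The bound method references p, and p.__dict__ holds the bound method:
# a reference cycle rooted at the parameter itself.
p.get_full_hp_param = types.MethodType(get_full_hp_param, p)

ref = weakref.ref(p)
del p
print(ref() is not None)  # True: the cycle alone keeps the parameter alive
gc.collect()
print(ref() is not None)  # whether the collector frees it is the open question
```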
To Reproduce
Train a model with a DeepSpeed ZeRO stage 2 config, no offload.
Note: some of the auto values are filled in by the HF Trainer. I'll try to get these values soon.
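For reference, a sketch of the kind of config I mean (my guess at a minimal shape, not my exact file; the auto fields are the ones the HF Trainer resolves):

```python
# Illustrative ZeRO stage 2 config, no offload; values are assumptions.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        # no offload_optimizer / offload_param entries -> no offload
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# Hypothetical wiring: passed to the HF Trainer via
# TrainingArguments(deepspeed=ds_config, ...)
```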
Expected behavior
Once training finishes and the Trainer/optimizer are destroyed, gc.collect() and torch.cuda.empty_cache() should release the trainable layers from GPU memory on all ranks.
ds_report output
Please run `ds_report` to give us details about your setup.

Screenshots
System info (please complete the following information):
Launcher context
Launch with `accelerate launch --use_deepspeed`
Docker context
N/A
Additional context
N/A