Describe the bug
I am using the transformers Trainer + accelerate to fine-tune language models. I noticed that after training, even after calling gc.collect() and torch.cuda.empty_cache(), the trainable layers stick around in GPU memory (on all ranks).
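For context, a minimal sketch of the kind of post-training check that exposes the symptom (the names and the exact call site are illustrative, not my actual script):

```python
import gc
import torch

def report_cuda_memory(tag: str) -> None:
    # Per-rank allocated bytes; a leak shows up as a number that never drops.
    print(f"{tag}: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")

# Hypothetical: trainer.train() has finished and every reference I hold to
# the Trainer, model, and optimizer has been deleted.
gc.collect()
torch.cuda.empty_cache()
report_cuda_memory("post-training")  # stays high if the parameters leak
```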
By attaching a debugger and weakrefs to the torch parameters, I was able to trace it at least as far as _hp_mapping and the methods added on top of the params:

DeepSpeed/deepspeed/utils/mixed_precision_linkage.py
Lines 33 to 37 in f32ad3e

These references never drop to zero, even when the optimizer is completely destroyed.
I am a bit puzzled why, and I wonder whether the memory is leaking because torch.Tensor / torch.nn.Parameter are not pure-Python implementations, so adding methods on top of them can cause unaccounted-for reference leaks.
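To illustrate the pattern I suspect (a sketch only: `get_full_hp_param` below is a placeholder, not the real DeepSpeed helper), binding a method onto the parameter itself stores the parameter inside its own `__dict__`, creating a reference cycle that keeps a weakref alive:

```python
import gc
import types
import weakref
import torch

def get_full_hp_param(self):
    # Placeholder for the real helper; only the binding pattern matters here.
    return self._hp_mapping

p = torch.nn.Parameter(torch.zeros(4))
p._hp_mapping = None
# The bound method references p, and p.__dict__ holds the bound method:
# a reference cycle rooted at the parameter itself.
p.get_full_hp_param = types.MethodType(get_full_hp_param, p)

ref = weakref.ref(p)
del p
print(ref() is not None)  # True: the cycle alone keeps the parameter alive
gc.collect()
print(ref() is not None)  # whether the collector frees it is the open question
```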
To Reproduce
Train a model with a DeepSpeed ZeRO stage 2 config, no offload.
Note: some of the auto values are filled in by the HF Trainer. I'll try to get these values soon.
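For reference, a sketch of the kind of config I mean (my guess at a minimal shape, not my exact file; the auto fields are the ones the HF Trainer resolves):

```python
# Illustrative ZeRO stage 2 config, no offload; values are assumptions.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        # no offload_optimizer / offload_param entries -> no offload
    },
    "bf16": {"enabled": "auto"},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# Hypothetical wiring: passed to the HF Trainer via
# TrainingArguments(deepspeed=ds_config, ...)
```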
Expected behavior
Once training finishes and the Trainer/optimizer are destroyed, gc.collect() and torch.cuda.empty_cache() should release the trainable layers from GPU memory on all ranks.
ds_report output
Please run `ds_report` to give us details about your setup.

Screenshots
System info (please complete the following information):
Launcher context
Launch with `accelerate launch --use_deepspeed`
Docker context
N/A
Additional context
N/A