
[CLI]: GPUs Hanging when distributed training caused by wandb.watch #7423

Open
nmd2k opened this issue Apr 19, 2024 · 5 comments
Labels
a:cli Area: Client c:integration Component: Integration c:watch

Comments


nmd2k commented Apr 19, 2024

Describe the bug

I found that during distributed training, calling wandb.watch with the argument log="all" leads to the GPUs hanging (the rank 0 GPU is loaded but not doing any work, and the rank 1 process makes no further progress).

(with log="all") [screenshot]

The integrated code:

if accelerator.is_main_process and args.report_to == "wandb":
    wandb.watch(model, log="all", log_freq=args.logging_steps)

The problem goes away when the log="all" argument is removed, so it seems something is wrong with logging the model parameters.

(without log="all") [screenshot]
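
For clarity, a minimal sketch of the two configurations being compared; `accelerator`, `args`, and `model` are the same names used in the snippet above, and the comment about wandb's default logging mode is an assumption based on wandb.watch's documented default of gradients-only logging:

# Sketch of the two watch() variants described in this report.
if accelerator.is_main_process and args.report_to == "wandb":
    # Reported to hang during distributed training:
    wandb.watch(model, log="all", log_freq=args.logging_steps)

    # Reported to work: omit log="all" so wandb falls back to its default
    # logging mode (gradients only) instead of also logging every
    # parameter tensor at each log_freq step.
    # wandb.watch(model, log_freq=args.logging_steps)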

Additional Files

No response

Environment

wandb = 0.16.5
transformers = 4.39.0
pytorch = 2.2.1
accelerate = 0.29.3

Additional Context

No response

kptkin added the a:cli (Area: Client), c:watch, and c:integration (Component: Integration) labels on Apr 19, 2024
@fmamberti-wandb

Hi @nmd2k, thank you for reporting this and for letting us know that you are experiencing the issue only with log="all".

Would you mind sharing some additional information to help us reproduce and troubleshoot the issue:

  • The debug.log and debug-internal.log files, which you can find in the ./wandb/run-<date_time>-<run_id>/logs folder
  • What is your experiment environment setup? Are you running the training locally or on a remote resource? If so, which kind, and how is the training initiated? Are you running via a Jupyter Notebook or through a script?
  • A code snippet for your training experiment would also be useful.

@fmamberti-wandb

Hi @nmd2k , I wanted to follow up on this request. Please let us know if we can be of further assistance or if your issue has been resolved.


nmd2k commented Apr 25, 2024

Hi, sorry for the late reply.
I used the official Hugging Face example to fine-tune LLMs (on a simple next-token prediction task); the only modification is adding the watch call at line L563:

if accelerator.is_main_process and args.report_to == "wandb":
    wandb.watch(model, log="all", log_freq=args.logging_steps)
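
For anyone trying to reproduce this without the full example script, here is a minimal self-contained sketch following the same pattern; the tiny model, synthetic data, and project name are illustrative stand-ins, not taken from the issue:

import torch
import wandb
from accelerate import Accelerator

# Illustrative stand-ins for the example script's model, data, and args.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-5)
data = [(torch.randn(16), torch.tensor(0)) for _ in range(64)]
train_dataloader = torch.utils.data.DataLoader(data, batch_size=2)

accelerator = Accelerator()
model, optimizer, train_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader
)

if accelerator.is_main_process:
    wandb.init(project="watch-hang-repro")  # hypothetical project name
    # The modification described above (around L563 of the example):
    wandb.watch(model, log="all", log_freq=1)

for inputs, labels in train_dataloader:
    loss = torch.nn.functional.cross_entropy(model(inputs), labels)
    accelerator.backward(loss)  # wandb's hooks run during backward on rank 0
    optimizer.step()
    optimizer.zero_grad()

Launched with something like accelerate launch repro.py on two or more GPUs; per the report above, the stall appears only when log="all" is active.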

Here is the debug.log:

2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_setup.py:_flush():76] Current SDK version is 0.16.3
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_setup.py:_flush():76] Configure stats pid to 2311240
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_setup.py:_flush():76] Loading settings from /home/dungnm31/.config/wandb/settings
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_setup.py:_flush():76] Loading settings from /home/dungnm31/foundation-models/wandb/settings
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_setup.py:_flush():76] Loading settings from environment variables: {'project': 'foundation-models-exp2', 'api_key': '***REDACTED***'}
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_setup.py:_flush():76] Applying setup settings: {'_disable_service': False}
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_setup.py:_flush():76] Inferring run settings from compute environment: {'program_relpath': 'train/ft_no_trainer.py', 'program_abspath': '/home/dungnm31/foundation-models/train/ft_no_trainer.py', 'program': '/home/dungnm31/foundation-models/train/ft_no_trainer.py'}
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_init.py:_log_setup():526] Logging user logs to /home/dungnm31/foundation-models/wandb/run-20240425_134526-u8vu5idg/logs/debug.log
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_init.py:_log_setup():527] Logging internal logs to /home/dungnm31/foundation-models/wandb/run-20240425_134526-u8vu5idg/logs/debug-internal.log
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_init.py:init():566] calling init triggers
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_init.py:init():573] wandb.init called with sweep_config: {}
config: {}
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_init.py:init():616] starting backend
2024-04-25 13:45:26,331 INFO    MainThread:2311240 [wandb_init.py:init():620] setting up manager
2024-04-25 13:45:26,332 INFO    MainThread:2311240 [backend.py:_multiprocessing_setup():105] multiprocessing start_methods=fork,spawn,forkserver, using: spawn
2024-04-25 13:45:26,334 INFO    MainThread:2311240 [wandb_init.py:init():628] backend started and connected
2024-04-25 13:45:26,336 INFO    MainThread:2311240 [wandb_init.py:init():720] updated telemetry
2024-04-25 13:45:26,443 INFO    MainThread:2311240 [wandb_init.py:init():753] communicating run to backend with 90.0 second timeout
2024-04-25 13:45:27,111 INFO    MainThread:2311240 [wandb_run.py:_on_init():2262] communicating current version
2024-04-25 13:45:27,410 INFO    MainThread:2311240 [wandb_run.py:_on_init():2271] got version response upgrade_message: "wandb version 0.16.6 is available!  To upgrade, please run:\n $ pip install wandb --upgrade"

2024-04-25 13:45:27,410 INFO    MainThread:2311240 [wandb_init.py:init():804] starting run threads in backend
2024-04-25 13:45:30,400 INFO    MainThread:2311240 [wandb_run.py:_console_start():2241] atexit reg
2024-04-25 13:45:30,400 INFO    MainThread:2311240 [wandb_run.py:_redirect():2096] redirect: wrap_raw
2024-04-25 13:45:30,400 INFO    MainThread:2311240 [wandb_run.py:_redirect():2161] Wrapping output streams.
2024-04-25 13:45:30,400 INFO    MainThread:2311240 [wandb_run.py:_redirect():2186] Redirects installed.
2024-04-25 13:45:30,401 INFO    MainThread:2311240 [wandb_init.py:init():847] run started, returning control to user process
2024-04-25 13:45:30,402 INFO    MainThread:2311240 [wandb_run.py:_config_callback():1343] config_cb None None {'dataset_name_or_path': '/cm/archive/dungnm31/data/foundation-model/data_test.jsonl', 'dataset_config_name': None, 'model_name_or_path': 'mistralai/Mistral-7B-v0.1', 'lora': False, 'config_name': None, 'tokenizer_name': None, 'use_slow_tokenizer': False, 'per_device_train_batch_size': 2, 'per_device_eval_batch_size': 2, 'learning_rate': 2.5e-05, 'adam_beta1': 0.9, 'adam_beta2': 0.999, 'adam_epsilon': 1e-07, 'weight_decay': 3e-05, 'num_train_epochs': 3, 'gradient_accumulation_steps': 1, 'gradient_checkpointing': False, 'lr_scheduler_type': 'linear', 'warmup_steps': 50, 'output_dir': '/cm/archive/dungnm31/foundation-exps/mistral7B-LR-finalv1-1', 'seed': 42, 'preprocessing_num_workers': 50, 'max_length': 1024, 'prompt_template': 'llama', 'trust_remote_code': True, 'logging_steps': 1, 'eval_steps': 50, 'save_steps': 50, 'save_total_limit': 10, 'resume_from_checkpoint': None, 'report_to': 'wandb', 'low_cpu_mem_usage': False, 'metric_for_best_model': 'loss'}
2024-04-25 13:45:30,447 INFO    MainThread:2311240 [wandb_watch.py:watch():51] Watching

Here is the debug-internal.log:
debug-internal.log


fzp0424 commented May 10, 2024

Same issue. The whole training process stalls for several minutes every fixed number of steps (the rank 0 GPU sits at 0% utilization). Everything returns to normal once I delete the wandb_watch["all"] setting.

@luisbergua (Contributor)

Hey @nmd2k @fzp0424, thanks for sharing these details! Would you have any problem with setting os.environ["WANDB_WATCH"] = "all" instead of passing it as an argument, and seeing if you face the same issue?
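
For reference, a sketch of what that suggestion looks like in code; whether the environment variable is picked up in this no_trainer-style script is exactly what is being asked here, so this is only illustrative:

import os

# Suggested alternative: set the environment variable before the wandb /
# transformers integration starts, instead of calling wandb.watch(...)
# manually under accelerator.is_main_process.
os.environ["WANDB_WATCH"] = "all"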
