Support for Intel GPUs #999
Comments
I don't believe PyTorch supports Intel GPUs natively. You might need to install some third-party packages to enable this. See https://pytorch.org/tutorials/recipes/intel_extension_for_pytorch.html for an example; this is an official Intel package for better support on Intel CPUs and GPUs. That said, I cannot guarantee that all features of torchtune will work with this extension. Let me know how it goes!
@RdoubleA I will give it a shot and let you know. I will start by adjusting a single-GPU recipe and config file to see if I can get it to work.
I am relatively new to using torchtune. From what I understand, it is designed to facilitate LLM training on consumer-grade NVIDIA GPUs. However, the process involves some deep abstractions, making it more complex than it initially seems. Here are the steps that are necessary to achieve this:
Here's an example implementation:

import torch
import intel_extension_for_pytorch as ipex

# Initialize the model, criterion, and optimizer
model = Model()
criterion = ...
optimizer = ...
model.train()

# Move the model and loss criterion to XPU before calling ipex.optimize()
model = model.to("xpu")
criterion = criterion.to("xpu")

# Option 1: optimize the model and optimizer for Float32
model, optimizer = ipex.optimize(model, optimizer=optimizer)
# Option 2 (use instead of option 1): optimize the model and optimizer for BFloat16
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

# Prepare the dataloader
dataloader = ...

for input, target in dataloader:
    # Move each batch to the same device as the model
    input = input.to("xpu")
    target = target.to("xpu")
    optimizer.zero_grad()
    # For Float32 (option 1)
    output = model(input)
    # For BFloat16 (option 2, use instead of the plain forward pass above)
    with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
        output = model(input)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
It would be great if the maintainer would point us in the proper direction. I will take this up again this week.
@RdoubleA Intel GPUs are supported in PyTorch version 2.30
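One way to check what a particular PyTorch install actually exposes is to probe for the device module at runtime. The following is a minimal sketch, not torchtune or PyTorch code: `pick_device` is a hypothetical helper name, and it assumes that a usable backend shows up as an attribute on `torch` (such as `torch.cuda` or `torch.xpu`) with an `is_available()` function. It falls back to `"cpu"` when PyTorch is missing or no accelerator is visible.

```python
import importlib
import importlib.util


def pick_device(preferred=("cuda", "xpu")):
    """Return the first usable device type among `preferred`, else "cpu".

    Hypothetical helper for illustration; not part of torchtune or PyTorch.
    """
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed at all
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return "cpu"  # PyTorch is installed but cannot be imported
    for name in preferred:
        backend = getattr(torch, name, None)  # e.g. torch.cuda or torch.xpu
        is_available = getattr(backend, "is_available", None)
        try:
            if callable(is_available) and is_available():
                return name
        except Exception:
            # A broken driver stack can raise here; try the next candidate.
            continue
    return "cpu"
```

Calling `pick_device()` on the machine in question would show whether this build sees the Intel GPUs at all. On installs that rely on Intel Extension for PyTorch, `intel_extension_for_pytorch` may need to be imported first so that the XPU backend is registered.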
We should be able to support Intel GPUs! We are using the Intel Developer Cloud. Please advise.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy
Python 3.9.18 (tags/v3.9.18-26-g6b320c3b2f6-dirty:6b320c3b2f6, Sep 28 2023, 00:35:27)
[GCC 13.2.0] :: Intel Corporation on linux
(null)Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution
Notebook commands:
!echo "List of Intel GPUs available on the system:"
!xpu-smi discovery 2> /dev/null
!echo "Intel Xeon CPU used by this notebook:"
!lscpu | grep "Model name"
List of Intel GPUs available on the system:
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information |
+-----------+--------------------------------------------------------------------------------------+
| 0 | Device Name: Intel(R) Data Center GPU Max 1100 |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0029-0000-002f0bda8086 |
| | PCI BDF Address: 0000:29:00.0 |
| | DRM Device: /dev/dri/card0 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 1 | Device Name: Intel(R) Data Center GPU Max 1100 |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-003a-0000-002f0bda8086 |
| | PCI BDF Address: 0000:3a:00.0 |
| | DRM Device: /dev/dri/card2 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 2 | Device Name: Intel(R) Data Center GPU Max 1100 |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-009a-0000-002f0bda8086 |
| | PCI BDF Address: 0000:9a:00.0 |
| | DRM Device: /dev/dri/card3 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 3 | Device Name: Intel(R) Data Center GPU Max 1100 |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-00ca-0000-002f0bda8086 |
| | PCI BDF Address: 0000:ca:00.0 |
| | DRM Device: /dev/dri/card4 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
Intel Xeon CPU used by this notebook:
Model name: Intel(R) Xeon(R) Platinum 8480+
I concluded that Intel GPUs don't seem to be supported because I originally tried to run my training job across the 4 GPUs and got the following:
$ tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/recipes/configs/llama3/8B_qlora_single_device.yaml
Running with torchrun...
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757]
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757] *****************************************
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757] *****************************************
Traceback (most recent call last):
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_distributed.py", line 652, in <module>
sys.exit(recipe_main())
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/config/_parse.py", line 50, in wrapper
sys.exit(recipe_main(conf))
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_distributed.py", line 641, in recipe_main
init_process_group(backend="gloo" if cfg.device == "cpu" else "nccl")
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
return func(*args, **kwargs)
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
func_return = func(*args, **kwargs)
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1312, in init_process_group
default_pg, _ = _new_process_group_helper(
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1533, in _new_process_group_helper
backend_class = ProcessGroupNCCL(
ValueError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
E0517 11:41:43.068940 23389872468672 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 400561) of binary: /opt/intel/oneapi/intelpython/bin/python3.9
Traceback (most recent call last):
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/bin/tune", line 8, in <module>
sys.exit(main())
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 49, in main
parser.run(args)
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 43, in run
args.func(args)
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 177, in _run_cmd
self._run_distributed(args)
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 88, in _run_distributed
run(args)
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_distributed.py FAILED
Failures:
[1]:
time : 2024-05-17_11:41:43
host : idc-beta-batch-pvc-node-18
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 400562)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-05-17_11:41:43
host : idc-beta-batch-pvc-node-18
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 400563)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-05-17_11:41:43
host : idc-beta-batch-pvc-node-18
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 400565)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Root Cause (first observed failure):
[0]:
time : 2024-05-17_11:41:43
host : idc-beta-batch-pvc-node-18
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 400561)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
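The traceback shows the recipe choosing its backend with `"gloo" if cfg.device == "cpu" else "nccl"`, so every non-CPU device is routed to NCCL, which fails here because no CUDA devices are visible. A device-aware choice could look roughly like the sketch below. This is only an illustration under stated assumptions: `select_backend` is a hypothetical function, not torchtune code, and the `"ccl"` entry assumes Intel's oneCCL bindings for PyTorch are installed and imported (`oneccl_bindings_for_pytorch`) so that the backend is registered with `torch.distributed`. None of this has been verified against the torchtune recipes.

```python
def select_backend(device_type):
    """Map a device type string to a torch.distributed backend name.

    Assumed mapping, for illustration only:
      "cpu"  -> "gloo"
      "cuda" -> "nccl"
      "xpu"  -> "ccl" (needs Intel's oneCCL bindings for PyTorch imported)
    """
    backends = {"cpu": "gloo", "cuda": "nccl", "xpu": "ccl"}
    # Accept strings such as "xpu:0" by dropping the device index.
    key = str(device_type).split(":", 1)[0].lower()
    if key not in backends:
        raise ValueError(f"No distributed backend known for device {device_type!r}")
    return backends[key]
```

With a helper like this, the call in the recipe would become something like `init_process_group(backend=select_backend(cfg.device))`. An unknown device then fails with an explicit error rather than falling through to NCCL.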
u2b3e96b2fc320ef8c781f51df67225d@idc-beta-batch-pvc-node-18:~$ tune run lora_finetune_single_device --config /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/recipes/configs/llama3/8B_qlora_single_device.yaml
INFO:torchtune.utils.logging:Running LoRAFinetuneRecipeSingleDevice with resolved config:
batch_size: 2
checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/original/
  checkpoint_files:
  model_type: LLAMA3
  output_dir: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/
  recipe_checkpoint: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset
  train_on_input: true
device: cuda
dtype: bf16
enable_activation_checkpointing: true
epochs: 1
gradient_accumulation_steps: 16
log_every_n_steps: 1
log_peak_memory_stats: false
loss:
  _component_: torch.nn.CrossEntropyLoss
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/tmp/qlora_finetune_output/
model:
  _component_: torchtune.models.llama3.qlora_llama3_8b
  apply_lora_to_mlp: true
  apply_lora_to_output: false
  lora_alpha: 16
  lora_attn_modules:
  lora_rank: 8
optimizer:
  _component_: torch.optim.AdamW
  lr: 0.0003
  weight_decay: 0.01
output_dir: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/tmp/qlora_finetune_output/
profiler:
  _component_: torchtune.utils.profiler
  enabled: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model
Traceback (most recent call last):
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/bin/tune", line 8, in <module>
sys.exit(main())
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 49, in main
parser.run(args)
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 43, in run
args.func(args)
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 179, in _run_cmd
self._run_single_device(args)
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 93, in _run_single_device
runpy.run_path(str(args.recipe), run_name="__main__")
File "/opt/intel/oneapi/intelpython/lib/python3.9/runpy.py", line 288, in run_path
return _run_module_code(code, init_globals, run_name,
File "/opt/intel/oneapi/intelpython/lib/python3.9/runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/opt/intel/oneapi/intelpython/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 550, in <module>
sys.exit(recipe_main())
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/config/_parse.py", line 50, in wrapper
sys.exit(recipe_main(conf))
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 543, in recipe_main
recipe = LoRAFinetuneRecipeSingleDevice(cfg=cfg)
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 100, in __init__
self._device = utils.get_device(device=cfg.device)
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/utils/_device.py", line 117, in get_device
device = _setup_cuda_device(device)
File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/utils/_device.py", line 44, in _setup_cuda_device
raise RuntimeError(
RuntimeError: The local rank is larger than the number of available GPUs.
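The wording of this error can be misleading on a machine like this one. The resolved config has `device: cuda`, and if the PyTorch build sees no CUDA devices, the device count is presumably 0, so even local rank 0 is out of range. The sketch below reproduces that kind of bound check in isolation. It is a simplified stand-in, not torchtune's actual `_setup_cuda_device`: `resolve_cuda_index` is a hypothetical name, and `visible_gpus` plays the role that `torch.cuda.device_count()` is assumed to play in the real code.

```python
def resolve_cuda_index(local_rank, visible_gpus, requested_index=None):
    """Return the device index to use, or raise if it is out of range.

    Simplified, hypothetical stand-in for the check that raised the error
    above; `visible_gpus` stands for the number of CUDA devices PyTorch sees.
    """
    # Fall back to the local rank when no explicit index was requested.
    index = local_rank if requested_index is None else requested_index
    if index >= visible_gpus:
        raise RuntimeError(
            "The local rank is larger than the number of available GPUs."
        )
    return index
```

With four visible devices, `resolve_cuda_index(0, 4)` returns index 0, while `resolve_cuda_index(0, 0)` raises the same message as in the traceback. That matches a setup where the four Intel GPUs are present but are not visible through the CUDA device API.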