Support for Intel GPUs #999

Open · raymondbernard opened this issue May 17, 2024 · 5 comments

@raymondbernard commented May 17, 2024

We should be able to support Intel GPUs! We are using the Intel Developer Cloud. Please advise.

Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy
Python 3.9.18 (tags/v3.9.18-26-g6b320c3b2f6-dirty:6b320c3b2f6, Sep 28 2023, 00:35:27)
[GCC 13.2.0] :: Intel Corporation on linux
(null)Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution

import torch
print(torch.__version__)
2.3.0+cu121

Notebook commands:
!echo "List of Intel GPUs available on the system:"
!xpu-smi discovery 2> /dev/null
!echo "Intel Xeon CPU used by this notebook:"
!lscpu | grep "Model name"

List of Intel GPUs available on the system:
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information |
+-----------+--------------------------------------------------------------------------------------+
| 0 | Device Name: Intel(R) Data Center GPU Max 1100 |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0029-0000-002f0bda8086 |
| | PCI BDF Address: 0000:29:00.0 |
| | DRM Device: /dev/dri/card0 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 1 | Device Name: Intel(R) Data Center GPU Max 1100 |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-003a-0000-002f0bda8086 |
| | PCI BDF Address: 0000:3a:00.0 |
| | DRM Device: /dev/dri/card2 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 2 | Device Name: Intel(R) Data Center GPU Max 1100 |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-009a-0000-002f0bda8086 |
| | PCI BDF Address: 0000:9a:00.0 |
| | DRM Device: /dev/dri/card3 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 3 | Device Name: Intel(R) Data Center GPU Max 1100 |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-00ca-0000-002f0bda8086 |
| | PCI BDF Address: 0000:ca:00.0 |
| | DRM Device: /dev/dri/card4 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
Intel Xeon CPU used by this notebook:
Model name: Intel(R) Xeon(R) Platinum 8480+

I discovered that Intel GPUs don't seem to be supported: I originally tried to run my training job across the 4 GPUs and got the following:

$ tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/recipes/configs/llama3/8B_qlora_single_device.yaml
Running with torchrun...
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757]
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757] *****************************************
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757] *****************************************
(The same traceback is emitted, interleaved, by each failing worker rank; shown once below.)

Traceback (most recent call last):
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_distributed.py", line 652, in <module>
    sys.exit(recipe_main())
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/config/_parse.py", line 50, in wrapper
    sys.exit(recipe_main(conf))
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_distributed.py", line 641, in recipe_main
    init_process_group(backend="gloo" if cfg.device == "cpu" else "nccl")
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1312, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1533, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(
ValueError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
E0517 11:41:43.068940 23389872468672 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 400561) of binary: /opt/intel/oneapi/intelpython/bin/python3.9
Traceback (most recent call last):
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 177, in _run_cmd
    self._run_distributed(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 88, in _run_distributed
    run(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_distributed.py FAILED

Failures:
[1]:
time : 2024-05-17_11:41:43
host : idc-beta-batch-pvc-node-18
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 400562)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-05-17_11:41:43
host : idc-beta-batch-pvc-node-18
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 400563)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-05-17_11:41:43
host : idc-beta-batch-pvc-node-18
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 400565)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-05-17_11:41:43
host : idc-beta-batch-pvc-node-18
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 400561)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

u2b3e96b2fc320ef8c781f51df67225d@idc-beta-batch-pvc-node-18:~$ tune run lora_finetune_single_device --config /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/recipes/configs/llama3/8B_qlora_single_device.yaml
INFO:torchtune.utils.logging:Running LoRAFinetuneRecipeSingleDevice with resolved config:

batch_size: 2
checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/original/
  checkpoint_files:
  - consolidated.00.pth
  model_type: LLAMA3
  output_dir: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/
  recipe_checkpoint: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset
  train_on_input: true
device: cuda
dtype: bf16
enable_activation_checkpointing: true
epochs: 1
gradient_accumulation_steps: 16
log_every_n_steps: 1
log_peak_memory_stats: false
loss:
  _component_: torch.nn.CrossEntropyLoss
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/tmp/qlora_finetune_output/
model:
  _component_: torchtune.models.llama3.qlora_llama3_8b
  apply_lora_to_mlp: true
  apply_lora_to_output: false
  lora_alpha: 16
  lora_attn_modules:
  - q_proj
  - v_proj
  - k_proj
  - output_proj
  lora_rank: 8
optimizer:
  _component_: torch.optim.AdamW
  lr: 0.0003
  weight_decay: 0.01
output_dir: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/tmp/qlora_finetune_output/
profiler:
  _component_: torchtune.utils.profiler
  enabled: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model

Traceback (most recent call last):
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 179, in _run_cmd
    self._run_single_device(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 93, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "/opt/intel/oneapi/intelpython/lib/python3.9/runpy.py", line 288, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/opt/intel/oneapi/intelpython/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/opt/intel/oneapi/intelpython/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 550, in <module>
    sys.exit(recipe_main())
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/config/_parse.py", line 50, in wrapper
    sys.exit(recipe_main(conf))
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 543, in recipe_main
    recipe = LoRAFinetuneRecipeSingleDevice(cfg=cfg)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 100, in __init__
    self._device = utils.get_device(device=cfg.device)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/utils/_device.py", line 117, in get_device
    device = _setup_cuda_device(device)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/utils/_device.py", line 44, in _setup_cuda_device
    raise RuntimeError(
RuntimeError: The local rank is larger than the number of available GPUs.

@RdoubleA (Contributor)

I don't believe PyTorch supports Intel GPUs natively. You might need to install a third-party package to enable this; see https://pytorch.org/tutorials/recipes/intel_extension_for_pytorch.html for an example. It is an official Intel package that provides better support for Intel CPUs and GPUs, although I cannot guarantee that all torchtune features will work with the extension. Let me know how it goes!
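
A quick way to confirm the extension is working before touching any torchtune recipe is to check that it registers the XPU devices. A minimal sketch, assuming intel_extension_for_pytorch is installed in the notebook environment:

import torch
# Importing the extension registers the "xpu" device type with PyTorch.
import intel_extension_for_pytorch as ipex

print(torch.__version__)           # PyTorch build
print(ipex.__version__)            # IPEX build (should match the PyTorch minor version)
print(torch.xpu.is_available())    # True if the XPU driver/runtime stack is visible
print(torch.xpu.device_count())    # should report 4 on the node shown above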

@raymondbernard (Author)

@RdoubleA -- I will give it a shot and let you know. I will start by adjusting a single-GPU recipe and config file to see if I can get it to work.

@raymondbernard (Author)

I am relatively new to torchtune. From what I understand, it is designed to facilitate LLM training on consumer-grade Nvidia GPUs. However, the process involves some deep abstractions, making it more complex than it initially seems. Here are the steps that appear necessary to run it on Intel GPUs:

  1. Import intel_extension_for_pytorch as ipex.
  2. Use the ipex.optimize function for additional performance enhancements, which applies optimizations to both the model and the optimizer.
  3. Utilize Auto Mixed Precision (AMP) with the BFloat16 data type.
  4. Convert input tensors, the loss criterion, and the model to the XPU.

Here's an example implementation:

import torch
import intel_extension_for_pytorch as ipex

# Initialize the model, criterion, and optimizer (placeholders here)
model = Model()
criterion = ...
optimizer = ...

model.train()

# Move the model and loss criterion to XPU before calling ipex.optimize()
model = model.to("xpu")
criterion = criterion.to("xpu")

# Optimize the model and optimizer: pick ONE of the two variants below.
# Float32:
model, optimizer = ipex.optimize(model, optimizer=optimizer)
# BFloat16:
# model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

# Prepare the dataloader
dataloader = ...

for input, target in dataloader:
    input = input.to("xpu")
    target = target.to("xpu")
    optimizer.zero_grad()

    # Forward pass: again pick the variant matching the ipex.optimize() call above.
    # Float32:
    output = model(input)
    # BFloat16 (Auto Mixed Precision):
    # with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
    #     output = model(input)

    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
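
(The Float32 and BFloat16 variants above are alternatives taken from Intel's IPEX training examples, not steps to run back to back. If this pattern were folded into torchtune, the ipex.optimize() call would presumably have to happen wherever the recipe builds the model and optimizer, and the autocast context inside its training loop, rather than in user code; I have not verified how cleanly that fits the recipe abstractions.)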

@raymondbernard (Author)

It would be great if the maintainers could point us in the proper direction. I will take this up again this week.

@raymondbernard (Author) commented May 21, 2024

@RdoubleA Intel GPUs are supported in PyTorch 2.3.0:
https://github.com/pytorch/pytorch?tab=readme-ov-file#intel-gpu-support
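
For what it's worth, a PyTorch build with the native (prototype) XPU backend would let the device be selected without any IPEX-specific calls. A minimal sketch, assuming such a build is installed (the stock 2.3.0+cu121 wheel shown earlier is a CUDA build and will not have it):

import torch

# torch.xpu is only present/usable in builds with Intel GPU support enabled.
device = torch.device("xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu")

x = torch.randn(4, 4, device=device, dtype=torch.bfloat16)
layer = torch.nn.Linear(4, 4, dtype=torch.bfloat16).to(device)
print(layer(x).device)  # expect xpu:0 when the backend is usable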
