Support for Intel GPUs #999

Open · raymondbernard opened this issue May 17, 2024 · 5 comments

@raymondbernard commented May 17, 2024

We should be able to support Intel GPUs! We are using the Intel Developer Cloud. Please advise.

Distributor ID: Ubuntu
Description: Ubuntu 22.04.4 LTS
Release: 22.04
Codename: jammy
Python 3.9.18 (tags/v3.9.18-26-g6b320c3b2f6-dirty:6b320c3b2f6, Sep 28 2023, 00:35:27)
[GCC 13.2.0] :: Intel Corporation on linux
(null)Type "help", "copyright", "credits" or "license" for more information.
Intel(R) Distribution for Python is brought to you by Intel Corporation.
Please check out: https://software.intel.com/en-us/python-distribution

import torch
print(torch.__version__)
2.3.0+cu121

Notebook commands:
!echo "List of Intel GPUs available on the system:"
!xpu-smi discovery 2> /dev/null
!echo "Intel Xeon CPU used by this notebook:"
!lscpu | grep "Model name"

List of Intel GPUs available on the system:
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information |
+-----------+--------------------------------------------------------------------------------------+
| 0 | Device Name: Intel(R) Data Center GPU Max 1100 |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0029-0000-002f0bda8086 |
| | PCI BDF Address: 0000:29:00.0 |
| | DRM Device: /dev/dri/card0 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 1 | Device Name: Intel(R) Data Center GPU Max 1100 |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-003a-0000-002f0bda8086 |
| | PCI BDF Address: 0000:3a:00.0 |
| | DRM Device: /dev/dri/card2 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 2 | Device Name: Intel(R) Data Center GPU Max 1100 |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-009a-0000-002f0bda8086 |
| | PCI BDF Address: 0000:9a:00.0 |
| | DRM Device: /dev/dri/card3 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 3 | Device Name: Intel(R) Data Center GPU Max 1100 |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-00ca-0000-002f0bda8086 |
| | PCI BDF Address: 0000:ca:00.0 |
| | DRM Device: /dev/dri/card4 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
Intel Xeon CPU used by this notebook:
Model name: Intel(R) Xeon(R) Platinum 8480+

I discovered that Intel GPUs don't seem to be supported: I originally tried to run my training job across the 4 GPUs and got the following:

$ tune run --nnodes 1 --nproc_per_node 4 lora_finetune_distributed --config /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/recipes/configs/llama3/8B_qlora_single_device.yaml
Running with torchrun...
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757]
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757] *****************************************
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0517 11:41:38.025980 23389872468672 torch/distributed/run.py:757] *****************************************
(The same traceback is emitted, interleaved, by each failing worker rank; shown once below.)

Traceback (most recent call last):
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_distributed.py", line 652, in <module>
    sys.exit(recipe_main())
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/config/_parse.py", line 50, in wrapper
    sys.exit(recipe_main(conf))
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_distributed.py", line 641, in recipe_main
    init_process_group(backend="gloo" if cfg.device == "cpu" else "nccl")
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 89, in wrapper
    func_return = func(*args, **kwargs)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1312, in init_process_group
    default_pg, _ = _new_process_group_helper(
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1533, in _new_process_group_helper
    backend_class = ProcessGroupNCCL(
ValueError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
E0517 11:41:43.068940 23389872468672 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 400561) of binary: /opt/intel/oneapi/intelpython/bin/python3.9
Traceback (most recent call last):
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 177, in _run_cmd
    self._run_distributed(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 88, in _run_distributed
    run(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_distributed.py FAILED

Failures:
[1]:
time : 2024-05-17_11:41:43
host : idc-beta-batch-pvc-node-18
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 400562)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-05-17_11:41:43
host : idc-beta-batch-pvc-node-18
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 400563)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
time : 2024-05-17_11:41:43
host : idc-beta-batch-pvc-node-18
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 400565)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
time : 2024-05-17_11:41:43
host : idc-beta-batch-pvc-node-18
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 400561)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

u2b3e96b2fc320ef8c781f51df67225d@idc-beta-batch-pvc-node-18:~$ tune run lora_finetune_single_device --config /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/recipes/configs/llama3/8B_qlora_single_device.yaml
INFO:torchtune.utils.logging:Running LoRAFinetuneRecipeSingleDevice with resolved config:

batch_size: 2
checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
  checkpoint_dir: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/original/
  checkpoint_files:
  - consolidated.00.pth
  model_type: LLAMA3
  output_dir: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/
  recipe_checkpoint: null
compile: false
dataset:
  _component_: torchtune.datasets.alpaca_cleaned_dataset
  train_on_input: true
device: cuda
dtype: bf16
enable_activation_checkpointing: true
epochs: 1
gradient_accumulation_steps: 16
log_every_n_steps: 1
log_peak_memory_stats: false
loss:
  _component_: torch.nn.CrossEntropyLoss
lr_scheduler:
  _component_: torchtune.modules.get_cosine_schedule_with_warmup
  num_warmup_steps: 100
max_steps_per_epoch: null
metric_logger:
  _component_: torchtune.utils.metric_logging.DiskLogger
  log_dir: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/tmp/qlora_finetune_output/
model:
  _component_: torchtune.models.llama3.qlora_llama3_8b
  apply_lora_to_mlp: true
  apply_lora_to_output: false
  lora_alpha: 16
  lora_attn_modules:
  - q_proj
  - v_proj
  - k_proj
  - output_proj
  lora_rank: 8
optimizer:
  _component_: torch.optim.AdamW
  lr: 0.0003
  weight_decay: 0.01
output_dir: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/tmp/qlora_finetune_output/
profiler:
  _component_: torchtune.utils.profiler
  enabled: false
resume_from_checkpoint: false
seed: null
shuffle: true
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /home/u2b3e96b2fc320ef8c781f51df67225d/Training/AI/GenAI/torchtune/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model

Traceback (most recent call last):
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/bin/tune", line 8, in <module>
    sys.exit(main())
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 179, in _run_cmd
    self._run_single_device(args)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/_cli/run.py", line 93, in _run_single_device
    runpy.run_path(str(args.recipe), run_name="__main__")
  File "/opt/intel/oneapi/intelpython/lib/python3.9/runpy.py", line 288, in run_path
    return _run_module_code(code, init_globals, run_name,
  File "/opt/intel/oneapi/intelpython/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/opt/intel/oneapi/intelpython/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 550, in <module>
    sys.exit(recipe_main())
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/config/_parse.py", line 50, in wrapper
    sys.exit(recipe_main(conf))
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 543, in recipe_main
    recipe = LoRAFinetuneRecipeSingleDevice(cfg=cfg)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/recipes/lora_finetune_single_device.py", line 100, in __init__
    self._device = utils.get_device(device=cfg.device)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/utils/_device.py", line 117, in get_device
    device = _setup_cuda_device(device)
  File "/home/u2b3e96b2fc320ef8c781f51df67225d/.local/lib/python3.9/site-packages/torchtune/utils/_device.py", line 44, in _setup_cuda_device
    raise RuntimeError(
RuntimeError: The local rank is larger than the number of available GPUs.

@RdoubleA (Contributor)

I don't believe PyTorch supports Intel GPUs natively. You might need to install a third-party package to enable this; see https://pytorch.org/tutorials/recipes/intel_extension_for_pytorch.html for an example. It is an official Intel package that provides better support for Intel CPUs and GPUs, although I cannot guarantee that all torchtune features will work with the extension. Let me know how it goes!
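
A quick way to confirm the extension is working before touching any torchtune recipe is to check that it registers the XPU devices. A minimal sketch, assuming intel_extension_for_pytorch is installed in the notebook environment:

import torch
# Importing the extension registers the "xpu" device type with PyTorch.
import intel_extension_for_pytorch as ipex

print(torch.__version__)           # PyTorch build
print(ipex.__version__)            # IPEX build (should match the PyTorch minor version)
print(torch.xpu.is_available())    # True if the XPU driver/runtime stack is visible
print(torch.xpu.device_count())    # should report 4 on the node shown above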

@raymondbernard (Author)

@RdoubleA -- I will give it a shot and let you know. I will start by adjusting a single-GPU recipe and config file to see if I can get it to work.

@raymondbernard (Author)

I am relatively new to torchtune. From what I understand, it is designed to facilitate LLM training on consumer-grade Nvidia GPUs. However, the process involves some deep abstractions, making it more complex than it initially seems. Here are the steps that appear necessary to run it on Intel GPUs:

  1. Import intel_extension_for_pytorch as ipex.
  2. Use the ipex.optimize function for additional performance enhancements, which applies optimizations to both the model and the optimizer.
  3. Utilize Auto Mixed Precision (AMP) with the BFloat16 data type.
  4. Convert input tensors, the loss criterion, and the model to the XPU.

Here's an example implementation:

import torch
import intel_extension_for_pytorch as ipex

# Initialize the model, criterion, and optimizer (placeholders here)
model = Model()
criterion = ...
optimizer = ...

model.train()

# Move the model and loss criterion to XPU before calling ipex.optimize()
model = model.to("xpu")
criterion = criterion.to("xpu")

# Optimize the model and optimizer: pick ONE of the two variants below.
# Float32:
model, optimizer = ipex.optimize(model, optimizer=optimizer)
# BFloat16:
# model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

# Prepare the dataloader
dataloader = ...

for input, target in dataloader:
    input = input.to("xpu")
    target = target.to("xpu")
    optimizer.zero_grad()

    # Forward pass: again pick the variant matching the ipex.optimize() call above.
    # Float32:
    output = model(input)
    # BFloat16 (Auto Mixed Precision):
    # with torch.xpu.amp.autocast(enabled=True, dtype=torch.bfloat16):
    #     output = model(input)

    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
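
(The Float32 and BFloat16 variants above are alternatives taken from Intel's IPEX training examples, not steps to run back to back. If this pattern were folded into torchtune, the ipex.optimize() call would presumably have to happen wherever the recipe builds the model and optimizer, and the autocast context inside its training loop, rather than in user code; I have not verified how cleanly that fits the recipe abstractions.)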

@raymondbernard (Author)

It would be great if the maintainers could point us in the proper direction. I will take this up again this week.

@raymondbernard (Author) commented May 21, 2024

@RdoubleA Intel GPUs are supported in PyTorch 2.3.0:
https://github.com/pytorch/pytorch?tab=readme-ov-file#intel-gpu-support
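
For what it's worth, a PyTorch build with the native (prototype) XPU backend would let the device be selected without any IPEX-specific calls. A minimal sketch, assuming such a build is installed (the stock 2.3.0+cu121 wheel shown earlier is a CUDA build and will not have it):

import torch

# torch.xpu is only present/usable in builds with Intel GPU support enabled.
device = torch.device("xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu")

x = torch.randn(4, 4, device=device, dtype=torch.bfloat16)
layer = torch.nn.Linear(4, 4, dtype=torch.bfloat16).to(device)
print(layer(x).device)  # expect xpu:0 when the backend is usable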
