
Releases: huggingface/accelerate

v0.30.1: Bugfixes

10 May 17:47

Patchfix

  • Fix duplicate environment variable check in multi-cpu condition thanks to @yhna940 in #2752
  • Fix issue with missing values in the SageMaker config leading to not being able to launch in #2753
  • Fix CPU OMP num threads setting thanks to @jiqing-feng in #2755
  • Fix FSDP checkpoint unable to resume when using offloading and sharded weights due to CUDA OOM when loading the optimizer and model #2762
  • Fixed an incorrect conditional check when configuring enable_cpu_affinity thanks to @statelesshz in #2748
  • Fix stacklevel in logging so log functions report the actual user call site (instead of the call site inside the logger wrapper) thanks to @luowyang in #2730
  • Fix support for multiple optimizers when using LOMO thanks to @younesbelkada in #2745

Full Changelog: v0.30.0...v0.30.1

v0.30.0: Advanced optimizer support, MoE DeepSpeed support, add upcasting for FSDP, and more

03 May 15:29

Core

  • We've simplified the tqdm wrapper to make it fully passthrough: instead of tqdm(main_process_only, *args), it is now just tqdm(*args), and you can pass main_process_only as a kwarg (see the sketch after this list).
  • We've added support for advanced optimizer usage.
  • Enable BF16 autocast to everything during FP8 and enable FSDP by @muellerzr in #2655
  • Support dataloader send_to_device calls to use non-blocking by @drhead in #2685
  • allow gather_for_metrics to be more flexible by @SunMarc in #2710
  • Add CANN version info to the accelerate env command for NPU by @statelesshz in #2689
  • Add MLU rng state setter by @ArthurinRUC in #2664
  • Device-agnostic testing for hooks, utils, and big_modeling by @statelesshz in #2602
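
A minimal sketch of the new passthrough tqdm usage mentioned above; the kwarg name follows the accelerate.utils.tqdm signature, and the loop body is a placeholder:

from accelerate.utils import tqdm

# Positional args are forwarded straight to tqdm.tqdm; by default only the
# main process renders the progress bar.
for batch in tqdm(range(100), desc="training", main_process_only=True):
    pass  # placeholder for the actual training step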

Documentation

  • Through collaboration between @fabianlim (lead contributor), @stas00, @pacman100, and @muellerzr, we have a new concept guide out for FSDP and DeepSpeed that explicitly details how the two interoperate and clearly explains how each of them works. This was a monumental effort by @fabianlim to ensure everything is as accurate as possible for users. I highly recommend visiting this new documentation, available here
  • New distributed inference examples have been added thanks to @SunMarc in #2672
  • Fixed some docs for using internal trackers by @brentyi in #2650

DeepSpeed

  • Accelerate can now handle MoE models when using deepspeed, thanks to @pacman100 in #2662
  • Allow "auto" for gradient clipping in YAML by @regisss in #2649
  • Introduce a DeepSpeed-specific Docker image by @muellerzr in #2707. To use, pull the gpu-deepspeed tag: docker pull huggingface/accelerate:cuda-deepspeed-nightly

Megatron

Big Modeling

  • Add strict arg to load_checkpoint_and_dispatch by @SunMarc in #2641
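
A minimal sketch of the new strict argument; the model, config, and checkpoint path below are hypothetical placeholders:

import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

# Build an empty-weight model skeleton, then load and dispatch the checkpoint
config = AutoConfig.from_pretrained("gpt2")
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

model = load_checkpoint_and_dispatch(
    model,
    checkpoint="/path/to/checkpoint",  # hypothetical path
    device_map="auto",
    strict=True,  # raise on missing/unexpected keys instead of loading silently
)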

Bug Fixes

  • Fix up state with xla + performance regression by @muellerzr in #2634
  • Parenthesis on xpu_available by @muellerzr in #2639
  • Fix is_train_batch_min type in DeepSpeedPlugin by @yhna940 in #2646
  • Fix backend check by @jiqing-feng in #2652
  • Fix the RNG state of the sampler's generator to be synchronized, ensuring correct sharding of the dataset across GPUs, by @pacman100 in #2694
  • Block AMP for MPS device by @SunMarc in #2699
  • Fixed issue when doing multi-gpu training with bnb when the first gpu is not used by @SunMarc in #2714
  • Fixup free_memory to deal with garbage collection by @muellerzr in #2716
  • Fix sampler serialization failing by @SunMarc in #2723
  • Fix deepspeed offload device type in the arguments to be more accurate by @yhna940 in #2717

Full Changelog

New Contributors

Full Changelog: https://github.com/huggingface/acce...


v0.29.3: Patchfix

17 Apr 15:46
  • Fixes issue with backend refactor not working on CPU-based distributed environments by @jiqing-feng: #2670
  • Fixes issue where load_checkpoint_and_dispatch needs a strict argument by @SunMarc: #2641

Full Changelog: v0.29.2...v0.29.3

v0.29.2: Patchfix

09 Apr 12:04
  • Fixes missing parenthesis in the xpu check #2639
  • Fixes XLA and performance degradation on init with the state #2634

v0.29.1: Patchfix

05 Apr 17:09

Fixed an import which would cause the accelerate CLI to fail if pytest wasn't installed

v0.29.0: NUMA affinity control, MLU Support, and DeepSpeed Improvements

05 Apr 14:27

Core

  • Accelerate can now optimize NUMA affinity, which can help increase throughput on NVIDIA multi-GPU systems. To enable it, either follow the prompt during accelerate config, set the ACCELERATE_CPU_AFFINITY=1 env variable, or set it manually as follows:
from accelerate.utils import set_numa_affinity

# For GPU 0
set_numa_affinity(0)

Big thanks to @stas00 for the recommendation, request, and feedback during development

  • Allow for setting deterministic algorithms in set_seed by @muellerzr in #2569 (see the sketch after this list)
  • Fixed the test script for TPU v2/v3 by @vanbasten23 in #2542
  • Cambricon MLU device support introduced by @huismiling in #2552
  • A big refactor of PartialState and AcceleratorState was performed to allow for easier future-proofing and to simplify adding new devices, by @muellerzr in #2576
  • Fixed a reproducibility issue in distributed environments with Dataloader shuffling when using BatchSamplerShard by @universuen in #2584
  • notebook_launcher can use multiple GPUs in Google Colab if using a custom instance that supports multiple GPUs by @StefanTodoran in #2561
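
A minimal sketch of the deterministic option noted above, assuming the flag added in #2569; the seed value is arbitrary:

from accelerate.utils import set_seed

# Seed all RNGs and additionally request deterministic algorithms in PyTorch
set_seed(42, deterministic=True)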

Big Model Inference

  • Add a log message for the RTX 4000 series when performing multi-GPU inference with device_map, which can lead to hanging, by @SunMarc in #2557
  • Fix load_checkpoint_in_model behavior when unexpected keys are in the checkpoint by @fxmarty in #2588

DeepSpeed

  • Fix issue with the mapping of main_process_ip and master_addr when not using "standard" as the deepspeed launcher by @asdfry in #2495
  • Improve deepspeed env gen by checking for bad keys, by @muellerzr and @ricklamers in #2565
  • We now support custom DeepSpeed env files. As with normal DeepSpeed, set it with the DS_ENV_FILE environment variable, by @muellerzr in #2566 (see the sketch after this list)
  • Resolve ZeRO-3 Initialization Failure in already-started distributed environments by @sword865 in #2578
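
A minimal sketch of pointing Accelerate at a custom DeepSpeed env file; the path below is hypothetical, and the file format is assumed to be DeepSpeed's usual KEY=VALUE lines:

import os

# Hypothetical env file with variables to forward to every rank at launch time
with open("/tmp/custom_deepspeed_env", "w") as f:
    f.write("NCCL_DEBUG=INFO\n")

# Tell Accelerate/DeepSpeed where to find it before launching
os.environ["DS_ENV_FILE"] = "/tmp/custom_deepspeed_env"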

What's Changed

New Contributors

Full Changelog: v0.28.0...v0.29.0

v0.28.0: DataLoaderConfig, XLA improvements, FSDP + QLORA foundations, Gradient Synchronization Tweaks, and Bug Fixes

12 Mar 16:58

Core

  • Introduce a DataLoaderConfiguration and begin deprecation of arguments in the Accelerator
+from accelerate import DataLoaderConfiguration
+dl_config = DataLoaderConfiguration(split_batches=True, dispatch_batches=True)
-accelerator = Accelerator(split_batches=True, dispatch_batches=True)
+accelerator = Accelerator(dataloader_config=dl_config)
  • Allow gradients to be synced on each data batch while performing gradient accumulation, useful when training with FSDP, by @fabianlim in #2531
from accelerate import Accelerator
from accelerate.utils import GradientAccumulationPlugin

plugin = GradientAccumulationPlugin(
    num_steps=2,
    sync_each_batch=True,  # sync gradients every batch instead of only at the end of accumulation
)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)

Torch XLA

  • Support for XLA on the GPU by @anw90 in #2176
  • Enable gradient accumulation on TPU in #2453

FSDP

  • Support downstream FSDP + QLoRA through tweaks allowing configuration of buffer precision by @pacman100 in #2544

Launch changes

What's Changed

New Contributors

Full Changelog: v0.27.2...v0.28.0

v0.27.0: PyTorch 2.2.0 Support, PyTorch-Native Pipeline Parallelism, DeepSpeed XPU support, and Bug Fixes

09 Feb 16:30

PyTorch 2.2.0 Support

With the latest release of PyTorch 2.2.0, we've verified that there are no breaking changes when using it with Accelerate.

PyTorch-Native Pipeline Parallel Inference

With this release we are excited to announce support for pipeline-parallel inference by integrating PyTorch's PiPPy framework (so there's no need to use Megatron or DeepSpeed)! This supports automatic model-weight splitting across devices using an API similar to device_map="auto". This is still under heavy development, but the inference side is stable enough that we are ready for a release. Read more about it in our docs and check out the example zoo.

Requires pippy version 0.2.0 or later (pip install torchpippy -U)

Example usage (combined with accelerate launch or torchrun):

import torch
from transformers import AutoModelForSequenceClassification
from accelerate import PartialState, prepare_pippy

model = AutoModelForSequenceClassification.from_pretrained("gpt2")

# Example input used to trace and split the model; any correctly shaped batch works
input = torch.randint(0, model.config.vocab_size, (1, 64))
model = prepare_pippy(model, split_points="auto", example_args=(input,))

input = input.to("cuda:0")
with torch.no_grad():
    output = model(input)

# The outputs are only on the final process by default
# You can pass `gather_output=True` to prepare_pippy to
# make them available on all processes
if PartialState().is_last_process:
    output = torch.stack(tuple(output[0]))
    print(output.shape)

DeepSpeed

This release provides support for utilizing DeepSpeed on XPU devices thanks to @faaany

What's Changed

New Contributors

Full Changelog: v0.26.1...v0.27.0

v0.26.1: Patch Release

11 Jan 15:26

What's Changed

  • Raise error when using batches of different sizes with dispatch_batches=True by @SunMarc in #2325

Full Changelog: v0.26.0...v0.26.1

v0.26.0 - MS-AMP Support, Critical Regression Fixes, and More

11 Jan 14:55

Support for MS-AMP

This release adds support for MS-AMP (the Microsoft Automatic Mixed Precision library) to Accelerate as an alternative backend for doing FP8 training on appropriate hardware. It is the default backend of choice. Read more in the docs here. Introduced in #2232 by @muellerzr
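
A minimal sketch of opting into FP8 training; the toy model, optimizer, and hyperparameters are placeholders, and per the note above MS-AMP is picked up as the default backend when installed on supported hardware:

import torch
from accelerate import Accelerator

# Request fp8 mixed precision; with MS-AMP installed it is used as the
# default FP8 backend on appropriate hardware.
accelerator = Accelerator(mixed_precision="fp8")

model = torch.nn.Linear(64, 64)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)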

Core

In the prior release a new sampler for the DataLoader was introduced which, while showing no statistical difference in results across seeds, could yield a different end accuracy when repeating the same seed, which alarmed some users. We have now disabled this behavior by default, as it required some additional setup, and brought back the original implementation. To use the new sampling technique (which can provide more accurate repeated results), pass use_seedable_sampler=True to the Accelerator. We will be propagating this up to the Trainer soon.
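
A minimal sketch of opting back into the seedable sampler; the toy dataset and batch size are placeholders:

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Opt in to the seedable sampler so repeated runs with the same seed match exactly
accelerator = Accelerator(use_seedable_sampler=True)

dataset = TensorDataset(torch.randn(128, 4))
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=8, shuffle=True))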

Big Model Inference

  • NPU support was added thanks to @statelesshz in #2222
  • When generating an automatic device_map we've made it possible to not return grouped key results if desired in #2233
  • We now handle corner cases better when users pass device_map="cuda" etc thanks to @younesbelkada in #2254

FSDP and DeepSpeed

  • Many improvements to the docs have been made thanks to @stas00. Along with this, we've made it easier to adjust the config for the sharding strategy and other config values thanks to @pacman100 in #2288

  • A regression in Accelerate 0.23.0 occurred that showed learning is much slower on multi-GPU setups compared to a single GPU. #2304 has now fixed this thanks to @pacman100

  • The DeepSpeed integration now also handles auto values better when making a configuration in #2313

Bits and Bytes

  • Params4bit added to bnb classes in set_module_tensor_to_device() by @poedator in #2315

Device Agnostic Testing

For developers, we've made it much easier to run the tests on different devices with no change to the code thanks to @statelesshz in #2123 and #2235

Bug Fixes

Major Contributors

  • @statelesshz for their work on device-agnostic testing and NPU support
  • @stas00 for many docfixes when it comes to DeepSpeed and FSDP

General Changelog

New Contributors
