Releases: huggingface/accelerate
v0.30.1: Bugfixes
Patchfix
- Fix duplicate environment variable check in multi-cpu condition thanks to @yhna940 in #2752
- Fix issue with missing values in the SageMaker config leading to not being able to launch in #2753
- Fix CPU OMP num threads setting thanks to @jiqing-feng in #2755
- Fix FSDP checkpoint unable to resume when using offloading and sharded weights due to CUDA OOM when loading the optimizer and model #2762
- Fix an incorrect conditional check when configuring enable_cpu_affinity thanks to @statelesshz in #2748
- Fix stacklevel in logging to log the actual user call site (instead of the call site inside the logger wrapper) of log functions thanks to @luowyang in #2730
- Fix support for multiple optimizers when using LOMO thanks to @younesbelkada in #2745
Full Changelog: v0.30.0...v0.30.1
v0.30.0: Advanced optimizer support, MoE DeepSpeed support, add upcasting for FSDP, and more
Core
- We've simplified the tqdm wrapper to make it fully passthrough: there is no need for tqdm(main_process_only, *args); it is now just tqdm(*args), and you can pass in is_main_process as a kwarg.
- We've added support for advanced optimizer usage:
- Schedule free optimizer introduced by Meta by @muellerzr in #2631
- LOMO optimizer introduced by OpenLMLab by @younesbelkada in #2695
- Enable BF16 autocast to everything during FP8 and enable FSDP by @muellerzr in #2655
- Support dataloader send_to_device calls to use non-blocking by @drhead in #2685
- allow gather_for_metrics to be more flexible by @SunMarc in #2710
- Add cann version info to command accelerate env for NPU by @statelesshz in #2689
- Add MLU rng state setter by @ArthurinRUC in #2664
- device agnostic testing for hooks&utils&big_modeling by @statelesshz in #2602
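The passthrough behavior described for the tqdm wrapper can be pictured with a small sketch. This is not Accelerate's implementation: `passthrough_progress`, `bar_factory`, `fake_bar`, and the `is_main` flag are hypothetical names, and the keyword name `main_process_only` is assumed for illustration. The point is only that positional arguments are forwarded untouched while the process check travels as a keyword.

```python
from typing import Any, Callable


def passthrough_progress(
    *args: Any,
    bar_factory: Callable[..., Any],
    main_process_only: bool = True,  # assumed keyword name
    is_main: bool = True,  # hypothetical stand-in for the real process check
    **kwargs: Any,
) -> Any:
    """Forward everything to bar_factory, disabling the bar off the main process."""
    if main_process_only and not is_main:
        kwargs["disable"] = True
    return bar_factory(*args, **kwargs)


def fake_bar(*args: Any, **kwargs: Any):
    """Stand-in for a real progress-bar constructor; it just records its inputs."""
    return args, kwargs


# Positional arguments reach the factory unchanged.
main_call = passthrough_progress(range(3), desc="train", bar_factory=fake_bar)
# On a non-main process the bar is disabled instead of being skipped.
worker_call = passthrough_progress(range(3), bar_factory=fake_bar, is_main=False)
```

Because the flag is keyword-only, call sites written for a plain progress bar keep working when routed through a wrapper of this shape.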
Documentation
- Through collaboration between @fabianlim (lead contributor), @stas00, @pacman100, and @muellerzr we have a new concept guide out for FSDP and DeepSpeed, explicitly detailing how the two interoperate and explaining fully and clearly how each of them works. This was a monumental effort by @fabianlim to ensure that everything is as accurate as possible for users. I highly recommend visiting this new documentation, available here
- New distributed inference examples have been added thanks to @SunMarc in #2672
- Fixed some docs for using internal trackers by @brentyi in #2650
DeepSpeed
- Accelerate can now handle MoE models when using deepspeed, thanks to @pacman100 in #2662
- Allow "auto" for gradient clipping in YAML by @regisss in #2649
- Introduce a deepspeed-specific Docker image by @muellerzr in #2707. To use, pull the gpu-deepspeed tag: docker pull huggingface/accelerate:cuda-deepspeed-nightly
Megatron
- Megatron plugin can support NPU by @zhangsheng377 in #2667
Big Modeling
Bug Fixes
- Fix up state with xla + performance regression by @muellerzr in #2634
- Parenthesis on xpu_available by @muellerzr in #2639
- Fix is_train_batch_min type in DeepSpeedPlugin by @yhna940 in #2646
- Fix backend check by @jiqing-feng in #2652
- Fix the rng states of sampler's generator to be synchronized for correct sharding of dataset across GPUs by @pacman100 in #2694
- Block AMP for MPS device by @SunMarc in #2699
- Fixed issue when doing multi-gpu training with bnb when the first gpu is not used by @SunMarc in #2714
- Fixup free_memory to deal with garbage collection by @muellerzr in #2716
- Fix sampler serialization failing by @SunMarc in #2723
- Fix deepspeed offload device type in the arguments to be more accurate by @yhna940 in #2717
Full Changelog
- Schedule free optimizer support by @muellerzr in #2631
- Fix up state with xla + performance regression by @muellerzr in #2634
- Parenthesis on xpu_available by @muellerzr in #2639
- add third-party device prefix to execution_device by @faaany in #2612
- add strict arg to load_checkpoint_and_dispatch by @SunMarc in #2641
- device agnostic testing for hooks&utils&big_modeling by @statelesshz in #2602
- Docs fix for using internal trackers by @brentyi in #2650
- Allow "auto" for gradient clipping in YAML by @regisss in #2649
- Fix is_train_batch_min type in DeepSpeedPlugin by @yhna940 in #2646
- Don't use deprecated Repository anymore by @Wauplin in #2658
- Fix test_from_pretrained_low_cpu_mem_usage_measured failure by @yuanwu2017 in #2644
- Add MLU rng state setter by @ArthurinRUC in #2664
- fix backend check by @jiqing-feng in #2652
- Megatron plugin can support NPU by @zhangsheng377 in #2667
- Revert "fix backend check" by @muellerzr in #2669
- tqdm: *args should come ahead of main_process_only by @rb-synth in #2654
- Handle MoE models with DeepSpeed by @pacman100 in #2662
- Fix deepspeed moe test with version check by @pacman100 in #2677
- Pin DS...again.. by @muellerzr in #2679
- fix backend check by @jiqing-feng in #2670
- Deprecate tqdm args + slight logic tweaks by @muellerzr in #2673
- Enable BF16 autocast to everything during FP8 + some tweaks to enable FSDP by @muellerzr in #2655
- Fix the rng states of sampler's generator to be synchronized for correct sharding of dataset across GPUs by @pacman100 in #2694
- Simplify test logic by @pacman100 in #2697
- Add source code for DataLoader Animation by @muellerzr in #2696
- Block AMP for MPS device by @SunMarc in #2699
- Do a pip freeze during workflows by @muellerzr in #2704
- add cann version info to command accelerate env by @statelesshz in #2689
- Add version checks for the import of DeepSpeed moe utils by @pacman100 in #2705
- Change dataloader send_to_device calls to non-blocking by @drhead in #2685
- add distributed examples by @SunMarc in #2672
- Add diffusers to req by @muellerzr in #2711
- fix bnb multi gpu training by @SunMarc in #2714
- allow gather_for_metrics to be more flexible by @SunMarc in #2710
- Add Upcasting for FSDP in Mixed Precision. Add Concept Guide for FSPD and DeepSpeed. by @fabianlim in #2674
- Segment out a deepspeed docker image by @muellerzr in #2707
- Fixup free_memory to deal with garbage collection by @muellerzr in #2716
- fix sampler serialization by @SunMarc in #2723
- Fix sampler failing test by @SunMarc in #2728
- Docs: Fix build main documentation by @SunMarc in #2729
- Fix Documentation in FSDP and DeepSpeed Concept Guide by @fabianlim in #2725
- Fix deepspeed offload device type by @yhna940 in #2717
- FEAT: Add LOMO optimizer by @younesbelkada in #2695
- Fix tests on main by @muellerzr in #2739
New Contributors
- @brentyi made their first contribution in #2650
- @regisss made their first contribution in #2649
- @yhna940 made their first contribution in #2646
- @Wauplin made their first contribution in #2658
- @ArthurinRUC made their first contribution in #2664
- @jiqing-feng made their first contribution in #2652
- @zhangsheng377 made their first contribution in #2667
- @rb-synth made their first contribution in #2654
- @drhead made their first contribution in #2685
Full Changelog: https://github.com/huggingface/acce...
v0.29.3: Patchfix
- Fixes issue with backend refactor not working on CPU-based distributed environments by @jiqing-feng: #2670
- Fixes issue where load_checkpoint_and_dispatch needs a strict argument by @SunMarc: #2641
Full Changelog: v0.29.2...v0.29.3
v0.29.2: Patchfix
v0.29.1: Patchfix
Fixed an import which would cause running accelerate CLI to fail if pytest wasn't installed
v0.29.0: NUMA affinity control, MLU Support, and DeepSpeed Improvements
Core
- Accelerate can now optimize NUMA affinity, which can help increase throughput on NVIDIA multi-GPU systems. To enable it, either follow the prompt during accelerate config, set the ACCELERATE_CPU_AFFINITY=1 env variable, or set it manually using the following:
from accelerate.utils import set_numa_affinity
# For GPU 0
set_numa_affinity(0)
Big thanks to @stas00 for the recommendation, request, and feedback during development
- Allow for setting deterministic algorithms in set_seed by @muellerzr in #2569
- Fixed the test script for TPU v2/v3 by @vanbasten23 in #2542
- Cambricon MLU device support introduced by @huismiling in #2552
- A big refactor was performed to the PartialState and AcceleratorState to allow for easier future-proofing and simplification of adding new devices by @muellerzr in #2576
- Fixed a reproducibility issue in distributed environments with Dataloader shuffling when using BatchSamplerShard by @universuen in #2584
- notebook_launcher can use multiple GPUs in Google Colab if using a custom instance that supports multiple GPUs by @StefanTodoran in #2561
Big Model Inference
- Add log message for RTX 4000 series when performing multi-gpu inference with device_map which can lead to hanging by @SunMarc in #2557
- Fix load_checkpoint_in_model behavior when unexpected keys are in the checkpoint by @fxmarty in #2588
DeepSpeed
- Fix issue with the mapping of main_process_ip and master_addr when not using standard as deepspeed launcher by @asdfry in #2495
- Improve deepspeed env gen by checking for bad keys, by @muellerzr and @ricklamers in #2565
- We now support custom deepspeed env files. Like normal deepspeed, set it with the DS_ENV_FILE environmental variable by @muellerzr in #2566
- Resolve ZeRO-3 Initialization Failure in already-started distributed environments by @sword865 in #2578
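To make the custom env file item concrete, here is a minimal sketch of reading such a file. It assumes the file holds one NAME=VALUE pair per line, which is the convention DeepSpeed documents for its env file; the helper name `load_env_file` and its handling of comments and blank lines are illustrative assumptions, not Accelerate's or DeepSpeed's actual parser.

```python
import os
from pathlib import Path
from typing import Dict, Optional


def load_env_file(path: Optional[str] = None) -> Dict[str, str]:
    """Parse NAME=VALUE lines from `path`, falling back to $DS_ENV_FILE.

    Hypothetical helper: blank lines, '#' comments, and lines without '='
    are skipped (an assumption made for this sketch).
    """
    path = path or os.environ.get("DS_ENV_FILE")
    if not path or not Path(path).is_file():
        return {}
    env: Dict[str, str] = {}
    for raw_line in Path(path).read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values may themselves contain '='.
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env
```

A launcher could merge the returned mapping into the environment it passes to worker processes; the keys shown in any real file are up to the user.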
What's Changed
- Fix test_script.py on TPU v2/v3 by @vanbasten23 in #2542
- Add mapping main_process_ip and master_addr when not using standard as deepspeed launcher by @asdfry in #2495
- split_between_processes for Dataset by @geronimi73 in #2433
- Include working driver check by @muellerzr in #2558
- 🚨🚨🚨Move to using tags rather than latest for docker images and consolidate image repos 🚨 🚨🚨 by @muellerzr in #2554
- Add Cambricon MLU accelerator support by @huismiling in #2552
- Add NUMA affinity control for NVIDIA GPUs by @muellerzr in #2535
- Add log message for RTX 4000 series when performing multi-gpu inference with device_map by @SunMarc in #2557
- Improve deepspeed env gen by @muellerzr in #2565
- Allow for setting deterministic algorithms by @muellerzr in #2569
- Unpin deepspeed by @muellerzr in #2570
- Rm uv install by @muellerzr in #2577
- Allow for custom deepspeed env files by @muellerzr in #2566
- [docs] Missing functions from API by @stevhliu in #2580
- Update data_loader.py to Ensure Reproducibility in Multi-Process Environments with Dataloader Shuffle by @universuen in #2584
- Refactor affinity and make it stateful by @muellerzr in #2579
- Refactor and improve model estimator tool by @muellerzr in #2581
- Fix load_checkpoint_in_model behavior when unexpected keys are in the checkpoint by @fxmarty in #2588
- Guard stateful objects by @muellerzr in #2572
- Expound PartialState docstring by @muellerzr in #2589
- [docs] Fix kwarg docstring by @stevhliu in #2590
- Allow notebook_launcher to launch to multiple GPUs from Colab by @StefanTodoran in #2561
- Fix warning log for unused checkpoint keys by @fxmarty in #2594
- Resolve ZeRO-3 Initialization Failure in Pre-Set Torch Distributed Environments (huggingface/transformers#28803) by @sword865 in #2578
- Refactor PartialState and AcceleratorState by @muellerzr in #2576
- Allow for force unwrapping by @muellerzr in #2595
- Pin hub for tests by @muellerzr in #2608
- Default false for trust_remote_code by @muellerzr in #2607
- fix llama example for pippy by @SunMarc in #2616
- Fix links in Quick Tour by @muellerzr in #2617
- Link to bash in env reporting by @muellerzr in #2623
- Unpin hub by @muellerzr in #2625
New Contributors
- @asdfry made their first contribution in #2495
- @geronimi73 made their first contribution in #2433
- @huismiling made their first contribution in #2552
- @universuen made their first contribution in #2584
- @StefanTodoran made their first contribution in #2561
- @sword865 made their first contribution in #2578
Full Changelog: v0.28.0...v0.29.0
v0.28.0: DataLoaderConfig, XLA improvements, FSDP + QLORA foundations, Gradient Synchronization Tweaks, and Bug Fixes
Core
- Introduce a DataLoaderConfiguration and begin deprecation of arguments in the Accelerator
+from accelerate import DataLoaderConfiguration
+dl_config = DataLoaderConfiguration(split_batches=True, dispatch_batches=True)
-accelerator = Accelerator(split_batches=True, dispatch_batches=True)
+accelerator = Accelerator(dataloader_config=dl_config)
- Allow gradients to be synced each data batch while performing gradient accumulation, useful when training in FSDP by @fabianlim in #2531
from accelerate import Accelerator, GradientAccumulationPlugin

# Accumulate over 2 steps, but sync gradients on every batch
plugin = GradientAccumulationPlugin(
    num_steps=2,
    sync_each_batch=True,
)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)
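The effect of syncing on each batch can be sketched in plain Python. This only illustrates the scheduling idea and is not Accelerate's internal logic: the function name `should_sync_gradients` is hypothetical, and the rule that the last batch of an epoch always syncs is an assumption made for the example.

```python
from typing import List


def should_sync_gradients(
    batch_index: int,
    num_batches: int,
    num_steps: int,
    sync_each_batch: bool = False,
) -> bool:
    """Decide whether gradients are synchronized after this (0-based) batch."""
    if sync_each_batch:
        # Communicate on every batch; accumulation still spans num_steps batches.
        return True
    is_last_batch = batch_index + 1 == num_batches  # assumed end-of-epoch flush
    return (batch_index + 1) % num_steps == 0 or is_last_batch


def sync_schedule(num_batches: int, num_steps: int, sync_each_batch: bool = False) -> List[bool]:
    """Sync decision for every batch in one epoch."""
    return [
        should_sync_gradients(i, num_batches, num_steps, sync_each_batch)
        for i in range(num_batches)
    ]


default_schedule = sync_schedule(num_batches=5, num_steps=2)
eager_schedule = sync_schedule(num_batches=5, num_steps=2, sync_each_batch=True)
```

Syncing on every batch costs more communication. In a sharded setup such as FSDP, the usual motivation is that gradients are reduced right away, so they need not be held unsharded between batches.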
Torch XLA
FSDP
- Support downstream FSDP + QLORA support through tweaks by allowing configuration of buffer precision by @pacman100 in #2544
launch changes
What's Changed
- Fix model metadata issue check by @muellerzr in #2435
- Use py 3.9 by @muellerzr in #2436
- Fix seedable sampler logic and expound docs by @muellerzr in #2434
- Fix tied_pointers_to_remove type by @fxmarty in #2439
- Make test assertions more idiomatic by @akx in #2420
- Prefer is_torch_tensor over hasattr for torch.compile. by @PhilJd in #2387
- Enable more Ruff lints & fix issues by @akx in #2419
- Fix warning when dispatching model by @SunMarc in #2442
- Make torch xla available on GPU by @anw90 in #2176
- Include pippy_file_path by @muellerzr in #2444
- [Big deprecation] Introduces a DataLoaderConfig by @muellerzr in #2441
- Check for None by @muellerzr in #2452
- Fix the pytest version to be less than 8.0.1 by @BenjaminBossan in #2461
- Fix wrong is_namedtuple implementation by @fxmarty in #2475
- Use grad-accum on TPU by @muellerzr in #2453
- Add pre-commit configuration by @akx in #2451
- Replace os.path.sep.join path manipulations with a helper by @akx in #2446
- DOC: Fixes to Accelerator docstring by @BenjaminBossan in #2443
- Context manager fixes by @akx in #2450
- Fix TPU with new XLA device type by @will-cromar in #2467
- Free mps memory by @SunMarc in #2483
- [FIX] allow Accelerator to detect distributed type from the "LOCAL_RANK" env variable for XPU by @faaany in #2473
- Fix CI tests due to pathlib issues by @muellerzr in #2491
- Remove all cases of torchrun in tests and centralize as accelerate launch by @muellerzr in #2498
- Fix link typo by @SunMarc in #2503
- [docs] Accelerator API by @stevhliu in #2465
- Docstring fixup by @muellerzr in #2504
- [docs] Divide training and inference by @stevhliu in #2466
- add custom dtype INT2 by @SunMarc in #2505
- quanto compatibility for cpu/disk offload by @SunMarc in #2481
- [docs] Quicktour by @stevhliu in #2456
- Check if hub down by @muellerzr in #2506
- Remove offline stuff by @muellerzr in #2509
- Fixed 0MiB bug in convert_file_size_to_int by @StoyanStAtanasov in #2507
- Fix edge case in infer_auto_device_map when dealing with buffers by @SunMarc in #2511
- [docs] Fix typos by @omahs in #2490
- fix typo in launch.py (----main_process_port to --main_process_port) by @DerrickWang005 in #2516
- Add copyright + some ruff lint things by @muellerzr in #2523
- Don't manage PYTORCH_NVML_BASED_CUDA_CHECK when calling accelerate.utils.imports.is_cuda_available() by @luiscape in #2524
- Quanto compatibility with QBitsTensor by @SunMarc in #2526
- Remove unnecessary env=os.environ.copy()s by @akx in #2449
- Launch mpirun from accelerate launch for multi-CPU training by @dmsuehir in #2493
- Enable using dash or underscore for CLI args by @muellerzr in #2527
- Update the default behavior of zero_grad(set_to_none=None) to align with PyTorch by @yongchanghao in #2472
- Update link to dynamo/compile doc by @WarmongeringBeaver in #2533
- Check if the buffers fit GPU memory after device map auto inferred by @notsyncing in #2412
- [Refactor] Refactor send_to_device to treat tensor-like first by @vmoens in #2438
- Overdue email change... by @muellerzr in #2534
- [docs] Troubleshoot by @stevhliu in #2538
- Remove extra double-dash in error message by @drscotthawley in #2541
- Allow Gradients to be Synced Each Data Batch While Performing Gradient Accumulation by @fabianlim in #2531
- Update FSDP mixed precision setter to enable fsdp+qlora by @pacman100 in #2544
- Use uv instead of pip install for github CI by @muellerzr in #2546
New Contributors
- @anw90 made their first contribution in #2176
- @StoyanStAtanasov made their first contribution in #2507
- @omahs made their first contribution in #2490
- @DerrickWang005 made their first contribution in #2516
- @luiscape made their first contribution in #2524
- @dmsuehir made their first contribution in #2493
- @yongchanghao made their first contribution in #2472
- @WarmongeringBeaver made their first contribution in #2533
- @vmoens made their first contribution in #2438
- @drscotthawley made their first contribution in #2541
- @fabianlim made their first contribution in #2531
Full Changelog: v0.27.2...v0.28.0
v0.27.0: PyTorch 2.2.0 Support, PyTorch-Native Pipeline Parallelism, DeepSpeed XPU support, and Bug Fixes
PyTorch 2.2.0 Support
With the latest release of PyTorch 2.2.0, we've guaranteed that there are no breaking changes regarding it
PyTorch-Native Pipeline Parallel Inference
With this release we are excited to announce support for pipeline-parallel inference by integrating PyTorch's PiPPy framework (so no need to use Megatron or DeepSpeed)! This supports automatic model-weight splitting to each device using a similar API to device_map="auto". This is still under heavy development, however the inference side is stable enough that we are ready for a release. Read more about it in our docs and check out the example zoo.
Requires pippy of version 0.2.0 or later (pip install torchpippy -U)
Example usage (combined with accelerate launch or torchrun):
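import torch
from accelerate import PartialState, prepare_pippy
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
model.eval()
# Example input used to trace the model (random token ids; the shape is illustrative)
input = torch.randint(low=0, high=model.config.vocab_size, size=(2, 1024), dtype=torch.int64)
model = prepare_pippy(model, split_points="auto", example_args=(input,))
input = input.to("cuda:0")
with torch.no_grad():
    output = model(input)
# The outputs are only on the final process by default
# You can pass in `gather_outputs=True` to prepare_pippy to
# make them available on all processes
if PartialState().is_last_process:
    output = torch.stack(tuple(output[0]))
    print(output.shape)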
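As a rough illustration of what splitting a model across devices means, the sketch below partitions a stack of layers into contiguous stages of near-equal size. It is not how PiPPy or `split_points="auto"` chooses split points (that choice may weigh layers differently); `split_layers` is a hypothetical helper that only shows the contiguous-stage idea.

```python
from typing import List


def split_layers(num_layers: int, num_stages: int) -> List[range]:
    """Assign layer indices to contiguous pipeline stages, as evenly as possible."""
    if num_stages <= 0 or num_stages > num_layers:
        raise ValueError("need 1 <= num_stages <= num_layers")
    base, extra = divmod(num_layers, num_stages)
    stages: List[range] = []
    start = 0
    for stage in range(num_stages):
        # The first `extra` stages take one additional layer each.
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# Ten layers over four devices: the remainder goes to the earliest stages.
stages = split_layers(num_layers=10, num_stages=4)
```

Each stage would live on one device, with activations handed from one stage to the next during the forward pass.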
DeepSpeed
This release provides support for utilizing DeepSpeed on XPU devices thanks to @faaany
What's Changed
- Convert model.hf_device_map back to Dict by @SunMarc in #2326
- Fix model memory issue by @muellerzr in #2327
- Fixed typos in readme files of docs folder. by @rishit5 in #2329
- Disable P2P in just the 4000 series by @muellerzr in #2332
- Avoid duplicating memory for tied weights in dispatch_model, and in forward with offloading by @fxmarty in #2330
- Show DeepSpeed option when multi-XPU is selected in accelerate config by @faaany in #2346
- FIX: add oneCCL environment variable for non-MPI launcher (accelerate launch) by @faaany in #2339
- device agnostic test_accelerator/test_multigpu by @wangshuai09 in #2343
- Fix mpi4py/failing deepspeed test issues by @muellerzr in #2353
- Fix block_size picking in megatron_lm_gpt_pretraining example. by @nilq in #2342
- Fix dispatch_model with tied weights test on T4 by @fxmarty in #2354
- bugfix to allow usage of TE or MSAMP in FP8RecipeKwargs by @sudhakarsingh27 in #2355
- Pin DeepSpeed until patch by @muellerzr in #2366
- Remove init_hook_kwargs by @fxmarty in #2365
- device agnostic optimizer testing by @statelesshz in #2363
- add_hook_to_module and remove_hook_from_module compatibility with fx.GraphModule by @fxmarty in #2369
- Adding requires_grad to kwargs when registering empty parameters. by @BlackSamorez in #2376
- Add adapter_only option to save_fsdp_model and load_fsdp_model to only save/load PEFT weights by @AjayP13 in #2321
- device agnostic cli/data_loader/grad_sync/kwargs_handlers/memory_utils testing by @wangshuai09 in #2356
- Fix batch_size sanity check logic for split_batches by @izhx in #2344
- Pin Torch version to <2.2.0 by @Rocketknight1 in #2394
- Address PIP-632 deprecation of distutils by @AieatAssam in #2388
- [don't merge yet] unpin torch by @ydshieh in #2406
- Revert "[don't merge yet] unpin torch" by @muellerzr in #2407
- Fix CI due to pytest by @muellerzr in #2408
- Added activateEnviroment.sh to readme by @TJ-Solergibert in #2409
- Fix XPU inference by @notsyncing in #2383
- Fix the size of int and bool type when computing module size by @notsyncing in #2411
- Adding Local SGD support for NPU by @statelesshz in #2415
- Unpin torch by @muellerzr in #2418
- Use Ruff for formatting too by @akx in #2400
- torch-native pipeline parallelism for big models by @muellerzr in #2345
- Update FSDP docs by @pacman100 in #2430
- Make output end up on all GPUs at the end by @muellerzr in #2423
- Migrate pippy examples over and run tests by @muellerzr in #2424
- [FIX] fix the wrong nproc_per_node in the multi gpu test by @faaany in #2422
- Fix fp8 things by @muellerzr in #2403
- [FIX] allow Accelerator to prepare models in eval mode for XPU&CPU by @faaany in #2426
- [Fix] make all tests pass on XPU by @faaany in #2427
New Contributors
- @rishit5 made their first contribution in #2329
- @faaany made their first contribution in #2346
- @wangshuai09 made their first contribution in #2343
- @nilq made their first contribution in #2342
- @BlackSamorez made their first contribution in #2376
- @AjayP13 made their first contribution in #2321
- @Rocketknight1 made their first contribution in #2394
- @AieatAssam made their first contribution in #2388
- @ydshieh made their first contribution in #2406
- @notsyncing made their first contribution in #2383
- @akx made their first contribution in #2400
Full Changelog: v0.26.1...v0.27.0
v0.26.1: Patch Release
What's Changed
Full Changelog: v0.26.0...v0.26.1
v0.26.0 - MS-AMP Support, Critical Regression Fixes, and More
Support for MS-AMP
This release adds support for the MS-AMP (Microsoft Automatic Mixed Precision Library) into Accelerate as an alternative backend for doing FP8 training on appropriate hardware. It is the default backend of choice. Read more in the docs here. Introduced in #2232 by @muellerzr
Core
The prior release introduced a new sampler for the DataLoader. Although it shows no statistical difference in results across seeds, repeating the same seed would result in a different end accuracy, which was scary to some users. We have now disabled this behavior by default, as it required some additional setup, and brought back the original implementation. To use the new sampling technique (which can provide more accurate repeated results), pass use_seedable_sampler=True to the Accelerator. We will be propagating this up to the Trainer soon.
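The idea behind a seedable sampler can be sketched without any framework: derive the shuffle order from a fixed seed plus the epoch number, so a repeated run visits samples in the same order. The class below is a hypothetical illustration using Python's `random` module; it is not the sampler Accelerate ships, and the `seed + epoch` scheme is an assumption chosen for clarity.

```python
import random
from typing import Iterator, List


class SeedableSampler:
    """Yield dataset indices in an order fully determined by (seed, epoch)."""

    def __init__(self, data_len: int, seed: int = 0) -> None:
        self.data_len = data_len
        self.seed = seed
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        """Change the epoch so each epoch gets its own reproducible shuffle."""
        self.epoch = epoch

    def __iter__(self) -> Iterator[int]:
        # A private generator keeps the order independent of global RNG state.
        rng = random.Random(self.seed + self.epoch)
        indices: List[int] = list(range(self.data_len))
        rng.shuffle(indices)
        return iter(indices)

    def __len__(self) -> int:
        return self.data_len


run_a = list(SeedableSampler(10, seed=42))
run_b = list(SeedableSampler(10, seed=42))
```

Two samplers built with the same seed produce the same order, which is the repeatability property the new sampling technique is aiming for.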
Big Model Inference
- NPU support was added thanks to @statelesshz in #2222
- When generating an automatic device_map we've made it possible to not return grouped key results if desired in #2233
- We now handle corner cases better when users pass device_map="cuda" etc thanks to @younesbelkada in #2254
FSDP and DeepSpeed
- Many improvements to the docs have been made thanks to @stas00. Along with this we've made it easier to adjust the config for the sharding strategy and other config values thanks to @pacman100 in #2288
- A regression introduced in Accelerate 0.23.0 made learning much slower on multi-GPU setups than on a single GPU. #2304 has now fixed this thanks to @pacman100
- The DeepSpeed integration now also handles auto values better when making a configuration in #2313
Bits and Bytes
Device Agnostic Testing
For developers, we've made it much easier to run the tests on different devices with no change to the code thanks to @statelesshz in #2123 and #2235
Bug Fixes
- Check notebook launcher for 3090+ by @muellerzr in #2212
- Fix dtype bug when offload_state_dict=True and dtype is specified by @fxmarty in #2116
- fix tqdm wrapper to print when process id ==0 by @kashif in #2223
- fix BFloat16 is not supported on MPS (#2226) by @jxysoft in #2227
- Fix MpDeviceLoaderWrapper not having attribute batch_sampler by @vanbasten23 in #2242
- [deepspeed] fix setting auto values for comm buffers by @stas00 in #2295
- Fix infer_auto_device_map when tied weights share the same prefix name by @fxmarty in #2324
- Fixes bug in swapping weights when replacing with Transformer-Engine layers by @sudhakarsingh27 in #2305
- Fix breakpoint API in test_script.py on TPU. by @vanbasten23 in #2263
- Bring old seed technique back by @muellerzr in #2319
Major Contributors
- @statelesshz for their work on device-agnostic testing and NPU support
- @stas00 for many docfixes when it comes to DeepSpeed and FSDP
General Changelog
- add missing whitespace by @stas00 in #2206
- MNT Delete the delete doc workflows by @BenjaminBossan in #2217
- Update docker images by @muellerzr in #2213
- Add allgather check for xpu by @abhilash1910 in #2199
- Check notebook launcher for 3090+ by @muellerzr in #2212
- Fix dtype bug when offload_state_dict=True and dtype is specified by @fxmarty in #2116
- fix tqdm wrapper to print when process id ==0 by @kashif in #2223
- [data_loader] expand the error message by @stas00 in #2221
- Update the 'Frameworks using Accelerate' section to include Amphion by @RMSnow in #2225
- [Docs] Add doc for cpu/disk offload by @SunMarc in #2231
- device agnostic testing by @statelesshz in #2123
- Make cleaning optional for device map by @muellerzr in #2233
- Add npu support to big model inference by @statelesshz in #2222
- fix the DS failing test by @pacman100 in #2237
- Fix nb tests by @muellerzr in #2230
- fix BFloat16 is not supported on MPS (#2226) by @jxysoft in #2227
- Fix MpDeviceLoaderWrapper not having attribute batch_sampler by @vanbasten23 in #2242
- [Big-Modeling] Harmonize device check to handle corner cases by @younesbelkada in #2254
- Support log_images for aim tracker by @Justin900429 in #2257
- Integrate MS-AMP Support for FP8 as a seperate backend by @muellerzr in #2232
- refactor deepspeed dataloader prepare logic by @pacman100 in #2238
- device agnostic deepspeed&fsdp testing by @statelesshz in #2235
- Solve CUDA issues by @muellerzr in #2272
- Uninstall DVC in the Trainer tests by @muellerzr in #2271
- Rm DVCLive from test reqs as latest version causes failures by @muellerzr in #2279
- typo fix by @stas00 in #2276
- Add condition before using check_tied_parameters_on_same_device by @SunMarc in #2218
- [doc] FSDP improvements by @stas00 in #2274
- [deepspeed docs] auto-values aren't being covered by @stas00 in #2286
- Improve FSDP config usability by @pacman100 in #2288
- [doc] language fixes by @stas00 in #2292
- Bump tj-actions/changed-files from 22.2 to 41 in /.github/workflows by @dependabot in #2300
- add back dvclive to tests by @dberenbaum in #2280
- Fixes bug in swapping weights when replacing with Transformer-Engine layers by @sudhakarsingh27 in #2305
- Fix breakpoint API in test_script.py on TPU. by @vanbasten23 in #2263
- make test_state_checkpointing device agnostic by @statelesshz in #2290
- [deepspeed] documentation by @stas00 in #2296
- Add more missing items by @muellerzr in #2309
- Update docs: Add warning for device_map=None for load_checkpoint_and_dispatch by @PhilJd in #2308
- [deepspeed] fix setting auto values for comm buffers by @stas00 in #2295
- DeepSpeed refactoring by @pacman100 in #2313
- Fix DeepSpeed related regression by @pacman100 in #2304
- Update test_deepspeed.py by @pacman100 in #2323
- Bring old seed technique back by @muellerzr in #2319
- Fix batch_size sanity check in prepare_data_loader by @izhx in #2310
- Params4bit added to bnb classes in set_module_tensor_to_device() by @poedator in #2315
- Fix infer_auto_device_map when tied weights share the same prefix name by @fxmarty in #2324
New Contributors
- @fxmarty made their first contribution in #2116
- @RMSnow made their first contribution in #2225
- @jxysoft made their first contribution in #2227
- @vanbasten23 made their first contribution in #2242
- @Justin900429 made their first contribution in #2257
- @dependabot made their first contribution in #2300
- @sudhakarsingh27 ma...