08 Jun 03:11

Bug Fixes

Fix backward compatibility issue caused by missing eval metrics class

What's Changed:

Fix backward compatibility issue caused by missing eval metrics class by @bigning in #3385

Full Changelog: v0.23.1...release/v0.23.2

Contributors

bigning

07 Jun 15:03

mvpatel2000

v0.23.1

7a533cb

v0.23.1

What's New

1. PyTorch 2.3.1 Upgrade

Composer now supports PyTorch 2.3.1.

What's Changed

Torch 2.3.1 Upgrade by @mvpatel2000 in #3367
Fix monkeypatch imports by @mvpatel2000 in #3375
Remove unnecessary state dict and load_state_dict functions by @eracah in #3361
Adding checkpoint backwards compatibility tests after 0.23.0 release by @bigning in #3377
prepare_fsdp_module documentation fix by @KuuCi in #3379
Composer version bump to v0.23.1 by @snarayan21 in #3380
Clear caplog and use as context manager in test_logging by @snarayan21 in #3382

Full Changelog: v0.23.0...v0.23.1

Contributors

bigning, eracah, and 3 other contributors

05 Jun 20:34

bigning

v0.23.0

bf2cb35

v0.23.0

What's New

1. Parallelism V2 + Tensor Parallel (#3335)

Composer now supports PyTorch's implementation of tensor parallelism. As part of this, we've revamped and simplified how Composer does distributed training. Previously, Composer accepted an fsdp_config argument in the Trainer:

trainer = Trainer(model, fsdp_config = {'sharding_strategy': 'FULL_SHARD'})

As we generalize to more forms of parallelism, we've deprecated fsdp_config in favor of parallelism_config:

trainer = Trainer(
    model = model,
    ...
    parallelism_config = {
        'fsdp': {
            'sharding_strategy': 'FULL_SHARD',
            'data_parallel_shard_degree': 2,      # Size of shard dimension
            'data_parallel_replicate_degree': 2,  # Size of replicate dimension
        },
        'tp_config': {
            'tensor_parallel_degree': 2,          # Size of TP dimension
            'layer_plan': ...  # describes how to TP layers
        }
    }
)

As part of this change, we now default to using DTensor for parallelism with PyTorch FSDP. PyTorch has deprecated ShardedTensor, so this migrates to the new backend which avoids various checkpointing bugs.

See the docs for tensor parallel for more information. Note that tensor parallel is still experimental and may be subject to breaking API changes. Some checkpointing features may not yet work with this form of parallelism.
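For configs that still pass fsdp_config, the migration described above amounts to nesting the old dictionary under the 'fsdp' key. A minimal sketch, assuming a hypothetical helper named to_parallelism_config (it is not a Composer API and is shown only to illustrate the shape of the new config):

```python
from typing import Any, Optional


def to_parallelism_config(fsdp_config: Optional[dict[str, Any]] = None) -> dict[str, Any]:
    """Nest a legacy fsdp_config under the 'fsdp' key of a parallelism_config.

    Hypothetical helper for illustration only; not part of Composer.
    """
    parallelism_config: dict[str, Any] = {}
    if fsdp_config is not None:
        # Copy so that later edits do not mutate the caller's dictionary.
        parallelism_config['fsdp'] = dict(fsdp_config)
    return parallelism_config


legacy = {'sharding_strategy': 'FULL_SHARD'}
print(to_parallelism_config(legacy))
```

The result can then be passed as parallelism_config in place of the deprecated fsdp_config argument.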

2. MLFlow API Simplification

Previously, MLFlow logger required a tracking URI and an absolute user path when using MLFlow with Databricks:

mlflow_logger = MLFlowLogger(
    tracking_uri = 'databricks',
    experiment_name = '/Users/[email protected]/my-first-project/'
)

trainer = Trainer(
    model = model,
    ...
    loggers = mlflow_logger,
)

Now, if you are using Databricks secrets as an environment variable, Composer will autopopulate tracking_uri and the experiment_name prefix:

trainer = Trainer(
    model = model,
    ...
    loggers = MLFlowLogger(experiment_name='my-first-project'),
)

3. Wallclock Save Interval

Composer now supports setting a save interval in wallclock time:

trainer = Trainer(
    model = model,
    ...
    save_interval='30m',
)

Note that most durations, such as max_duration, do not accept wallclock time. The initial version of this feature is limited to a subset of time features such as save_interval.
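To make the '30m' notation concrete, here is a small standalone sketch that converts a wallclock string into a timedelta. The accepted grammar (optional hours, minutes, and seconds written as h, m, and s) is an assumption made for illustration; these notes do not spell out which wallclock formats Composer accepts, and wallclock_to_timedelta is a hypothetical name, not Composer's parser.

```python
import re
from datetime import timedelta

# Assumed grammar: optional '<int>h', '<int>m', '<int>s' parts, in that order.
_WALLCLOCK_RE = re.compile(r'(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?')


def wallclock_to_timedelta(interval: str) -> timedelta:
    """Parse strings such as '30m' or '1h30m' into a timedelta (illustrative only)."""
    match = _WALLCLOCK_RE.fullmatch(interval)
    if not interval or match is None:
        raise ValueError(f'not a wallclock interval: {interval!r}')
    hours, minutes, seconds = (int(part) if part else 0 for part in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


print(wallclock_to_timedelta('30m').total_seconds())
```

Strings with other units, such as a batch count, are rejected by this sketch rather than silently treated as time.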

Bug Fixes

Don't close the engine if it's already closed in #3143
Fix HF tests with Pin in #3248
Fix backwards compatibility tests in #3252
Fix unexpected remote checkpointing downloading in #3271
Fix HSDP with ShardDegree < 8 in #3313

What's Changed

Remove CPU offload for DDP/single-gpu by @mvpatel2000 in #3242
Adding more checkpoint backwards compatability tests by @snarayan21 in #3244
Don't close the engine if its already closed by @dakinggg in #3143
Replace evaluator.dataloader.device_eval_batch_size with evaluator.device_eval_microbatch_size by @ShashankMosaicML in #3247
Fix HF tests with Pin by @mvpatel2000 in #3248
Remove ICL metrics by @mvpatel2000 in #3243
Add offset and length arguments for checkpoint validation functions by @irenedea in #3246
Fix backwards compatibility tests, raise error for torch version mismatch by @snarayan21 in #3252
Bump cryptography from 41.0.5 to 42.0.6 by @dependabot in #3256
Bump databricks-sdk from 0.25.1 to 0.27.0 by @dependabot in #3257
Improve GCS Object Store by @mvpatel2000 in #3251
add retry to gcs.upload_file by @bigning in #3232
Add unit test support for full state dict + load_weights_only and save_weights_only by @eracah in #3260
will/bump_aws_ofi_nccl by @willgleich in #3253
Fix daily GCS tests by @mvpatel2000 in #3268
Fix: SAM not working with FSDP/DeepSpeed and LR scheduler. by @Joqsan in #3259
Add upload timeout patch to mlflow on azure by @dakinggg in #3265
Add option to stagger uploads based on local rank by @dakinggg in #3275
explicit close by @dakinggg in #3276
Update NCCL_ASYNC_ERROR_HANDLING env variable by @priba in #3267
new dist_cp save planner to fix issue that each rank needs to download all checkpoint files by @bigning in #3271
Bump to torch 2.2.2 by @mvpatel2000 in #3283
Fix UCObjectStore.list_objects by @dakinggg in #3284
Update peft version by @dakinggg in #3287
replace load_fsdp_monolith_ with load_monolith_ by @milocress in #3288
Return PyTorch Latest by @mvpatel2000 in #3290
Fix daily tests by filtering a warning by @mvpatel2000 in #3291
remove orig_params check by @milocress in #2981
[ckpt-rewr] Get Model State Dict Util Function by @eracah in #3250
Skip compression check with symlink files by @mvpatel2000 in #3300
Monkeypatch Device Mesh ND Slicing by @mvpatel2000 in #3302
Bump coverage[toml] from 7.4.4 to 7.5.1 by @dependabot in #3305
Bump databricks-sdk from 0.27.0 to 0.27.1 by @dependabot in #3306
Update transformers requirement from !=4.34.0,<4.41,>=4.11 to >=4.11,!=4.34.0,<4.42 by @dependabot in #3307
Allow overwrite on upload retry in remote uploader downloader by @irenedea in #3310
Update platform references by @aspfohl in #3304
Fix cometml unit tests by @j316chuck in #3314
Fix HSDP with ShardDegree < 8 by @bigning in #3313
Update docstring for get_model_state_dict by @eracah in #3318
Tensor Parallelism Integration by @mvpatel2000 in #3269
Bugfixes to FSDP + TP by @mvpatel2000 in #3323
Wct save interval by @KuuCi in #3264
Wrap ChunkedEncodingError from UCObjectStore by @irenedea in #3321
Add checkpoint events to mosaicml logger by @b-chu in #3316
Bump timeout to fix daily tests by @j316chuck in #3325
Fix FSDP ckpt by filtering User Waring by @j316chuck in #3327
Revert TP integration by @dakinggg in #3328
Bump databricks-sdk from 0.27.1 to 0.28.0 by @dependabot in #3331
Bump sphinxcontrib-katex from 0.9.6 to 0.9.10 by @dependabot in #3333
Update peft requirement from <0.11,>=0.10.0 to >=0.10.0,<0.12 by @dependabot in #3332
Bump coverage[toml] from 7.5.1 to 7.5.2 by @dependabot in #3330
Update protobuf requirement from <5.27 to <5.28 by @dependabot in #3329
Improving memory snapshot by @cli99 in #3315
Add A10 to speed monitor by @mvpatel2000 in #3336
change ComposerModel output type by @hyenal in #3341
Remove evaluator state by @snarayan21 in #3339
[ckpt-rewr] Generate Metadata State Dict API by @eracah in #3311
Tensor Parallelism v2 by @mvpatel2000 in #3335
Migrate Type Hints for PEP 585 by @mvpatel2000 in #3344
[checkpoint v2] add remote uploader class by @bigning in #3303
Raise errors on all ranks for checkpoint download failures by @irenedea in #3345
Add return type annotation when init doesn't take any argument by @antoinebrl in #3347
[ckpt-rewr] Get Optim State Dict Util API by @eracah in #3299
Fix type check issue with device train microbatch size by @mvpatel2000 in https://github.com/...

Contributors

bigning, eracah, and 17 other contributors

01 May 16:59

snarayan21

v0.22.0

fe7964f

v0.22.0

What's New

🔥 Support for PyTorch v2.3.0

Composer now supports the recently-released PyTorch version 2.3.0! Please raise any issues with us so we can address them.

Bug Fixes

Fixing checks for device microbatch size for sequence parallelism in #3200
Fixing token logging in #3206
Search for run name in MLFlowLogger in #3215
Fix FQN names with activation checkpointing in #3210
Strict weight matching for checkpoint loading in #3219

What's Changed

Bump transformers by @dakinggg in #3197
Add deprecation warnings for ICL datasets/helper functions/metrics by @bmosaicml in #3125
Bump traitlets from 5.14.2 to 5.14.3 by @dependabot in #3204
Raise LR schedule warnings only when necessary by @snarayan21 in #3207
Add torch 2.3 support by @mvpatel2000 in #3209
Add torch 2.3 CI/CD by @mvpatel2000 in #3211
Fix daily test images by @mvpatel2000 in #3212
Try FAv2 2.5.7 from source by @mvpatel2000 in #3213
Update tests by @mvpatel2000 in #3217
Fix torch 2.3 GPU tests by @mvpatel2000 in #3218
Use flash-attn 2.5.8 with no build isolation in docker images by @snarayan21 in #3224
Add a torch.cuda.empty_cache() in utils.save_checkpoint by @bfontain in #3216
Require 2 steps for GS object store by @mvpatel2000 in #3228
Add rename_metrics to Mlflow logger by @hanlint in #3225
Fix daily tests by @mvpatel2000 in #3229
Change precision for daily tests by @mvpatel2000 in #3231
Create new Mlflow run by default and introduce run_group by @chenmoneygithub in #3208
Fix daily test pt 4 by @mvpatel2000 in #3233
Deprecate and bump version to 0.22 by @mvpatel2000 in #3230
Fix daily tests v5 by @mvpatel2000 in #3234
Fix daily v6 by @mvpatel2000 in #3235
fix daily tests v7 by @mvpatel2000 in #3236
Raise the daily test timeout by @dakinggg in #3241
Accelerate GPU tests by @mvpatel2000 in #3237
Make sharded checkpoint loading backwards-compatible by @snarayan21 in #3240

Full Changelog: v0.21.3...v0.22.0

Contributors

bfontain, hanlint, and 6 other contributors

19 Apr 15:41

mvpatel2000

v0.21.3

d39a5e0

v0.21.3

Bug Fixes

1. Increased Robustness to Checkpoint Loading

We've patched several edge cases in loading sharded checkpoints, especially with DTensors, which should decrease memory usage when loading checkpoints. We've also hardened the retry logic against cloud object store failures, making checkpoint loading more robust to transient network issues.
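As a rough illustration of what retrying against transient failures looks like, the sketch below retries a callable with exponential backoff. It is a generic pattern under stated assumptions, not Composer's retry implementation: the retry helper, its parameters, and the choice of ConnectionError as the transient error type are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar('T')


def retry(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on ConnectionError with exponential backoff (illustrative only)."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * 2**attempt)  # 0.5s, 1s, 2s, ...
    raise ValueError('attempts must be at least 1')


calls = {'n': 0}


def flaky_download() -> str:
    # Stand-in for an object store download that fails twice, then succeeds.
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('transient failure')
    return 'checkpoint-bytes'


print(retry(flaky_download, sleep=lambda _: None))
```

Injecting the sleep function keeps the sketch easy to test without waiting for real delays.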

What's Changed

Raise daily test timeout by @mvpatel2000 in #3172
fix remote file naming by @cli99 in #3173
[fix] DTensor + SHARD_GRAD_OP + use_orig_params by @bigning in #3175
Bump db sdk by @dakinggg in #3176
Build latest pytorch nightly images by @dakinggg in #3179
Add FP8 TransformerEngine activation checkpointing by @cli99 in #3156
Enabling the computation of validation loss and other metrics when using sequence parallelism by @ShashankMosaicML in #3183
Update mosaic_fsdp_utils.py by @vchiley in #3185
Fix the FSDP.optim_state_dict_to_load OOM by @bigning in #3184
Revert "Update mosaic_fsdp_utils.py" by @vchiley in #3187
Bump databricks-sdk from 0.24.0 to 0.25.1 by @dependabot in #3190
Add version tag to local builds by @mvpatel2000 in #3188
Update NeptuneLogger by @AleksanderWWW in #3165
Filter neptune warning in doctests by @mvpatel2000 in #3195
Removal of metrics deepcopy before computing the metrics by @gregjauvion in #3180
Fix MLFlow Tag Name for Resumption by @KuuCi in #3194
Fix mistral gating by @dakinggg in #3199
Bump version to 0.21.3 by @mvpatel2000 in #3198

New Contributors

@gregjauvion made their first contribution in #3180

Full Changelog: v0.21.2...v0.21.3

Contributors

bigning, vchiley, and 8 other contributors

03 Apr 21:14

mvpatel2000

v0.21.2

082d4e0

v0.21.2

Bug Fixes

1. Enable torch 2.2.2 (#3161)

Composer currently monkeypatches PyTorch for nightly versions in order to fix upstream bugs. With the release of torch 2.2.2, these monkeypatches were mistakenly applied to the stable release due to incorrect gating on imports. This release fixes the gating, enabling torch 2.2.2.
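The sketch below shows one way to gate nightly-only patches so that a new stable patch release is not caught by accident. It is an assumption-laden illustration, not Composer's actual check: the function names, the 2.3.0 threshold, and the reliance on a '.dev' marker in the version string are all hypothetical.

```python
import re


def release_tuple(version: str) -> tuple[int, ...]:
    """Extract (major, minor, patch) from a version string such as '2.3.0.dev20240301+cu121'."""
    match = re.match(r'(\d+)\.(\d+)\.(\d+)', version)
    if match is None:
        raise ValueError(f'unrecognized version: {version!r}')
    return tuple(int(part) for part in match.groups())


def needs_nightly_patches(torch_version: str) -> bool:
    """Return True only for development builds at or above the assumed 2.3.0 threshold."""
    is_dev_build = '.dev' in torch_version
    return is_dev_build and release_tuple(torch_version) >= (2, 3, 0)


for candidate in ('2.2.2', '2.2.2+cu121', '2.3.0.dev20240301+cu121'):
    print(candidate, needs_nightly_patches(candidate))
```

A gate that only asks whether the installed version is newer than the last known stable release would also match 2.2.2, which is one way this kind of check can go wrong.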

2. MPS Metric Computation on CPU (#3105)

Due to bugs in computing torchmetrics on Mac devices, we move metric computation onto the CPU. Previously, the batch data was not always moved to the CPU along with the metrics; this release fixes that.

Thank you to @hyenal for this contribution!

3. Batch Sampler Support (#3124)

Composer now supports dataloaders that specify a batch sampler. Previously, specifying one in the dataloader resulted in an error.

Thank you to @Ghelfi for this contribution!
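The related change in this release is titled "Call set_epoch on Dataloader.batch_sampler if defined". A minimal sketch of that idea follows, using stand-in classes instead of real PyTorch objects so it runs on its own. The helper name, the fallback to sampler, and the stand-in classes are assumptions for illustration, not Composer's code.

```python
from typing import Any


def set_epoch_if_supported(dataloader: Any, epoch: int) -> bool:
    """Call set_epoch on the batch sampler if it defines one, else on the sampler.

    Returns True if some sampler was updated. Illustrative only.
    """
    for attr in ('batch_sampler', 'sampler'):
        candidate = getattr(dataloader, attr, None)
        set_epoch = getattr(candidate, 'set_epoch', None)
        if callable(set_epoch):
            set_epoch(epoch)
            return True
    return False


class EpochAwareBatchSampler:
    """Stand-in for a user-defined batch sampler that reshuffles per epoch."""

    def __init__(self) -> None:
        self.epoch = 0

    def set_epoch(self, epoch: int) -> None:
        self.epoch = epoch


class FakeLoader:
    """Stand-in for a dataloader exposing batch_sampler and sampler attributes."""

    def __init__(self, batch_sampler: Any = None, sampler: Any = None) -> None:
        self.batch_sampler = batch_sampler
        self.sampler = sampler


loader = FakeLoader(batch_sampler=EpochAwareBatchSampler())
print(set_epoch_if_supported(loader, epoch=3), loader.batch_sampler.epoch)
```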

What's Changed

Make codequality callable by @mvpatel2000 in #3133
Explicitly print checkpoint downloading exception by @bigning in #3131
Change release actions by @mvpatel2000 in #3136
Passing rank and num_replicas to dist.get_sampler by @ShashankMosaicML in #3137
Fix broadcast by @mvpatel2000 in #3138
Compressor fixes by @mbway in #3142
In case of MPS device also copy batch to CPU by @hyenal in #3105
Composer object store download retry by @bigning in #3140
Bump databricks-sdk from 0.22.0 to 0.23.0 by @dependabot in #3144
Update transformers requirement from !=4.34.0,<4.39,>=4.11 to >=4.11,!=4.34.0,<4.40 by @dependabot in #3148
Update protobuf requirement from <3.21 to <5.27 by @dependabot in #3147
Bump traitlets from 5.14.1 to 5.14.2 by @dependabot in #3145
Bump to 0.21 by @mvpatel2000 in #3150
Fixing sequence parallel error conditions and adding type float for microbatch_size in typehints by @ShashankMosaicML in #3139
Fix torch monkeypatch version check by @dakinggg in #3155
Update torchmetrics requirement from <1.3.2,>=0.10.0 to >=0.10.0,<1.3.3 by @dependabot in #3157
Bump gitpython from 3.1.42 to 3.1.43 by @dependabot in #3160
Prevent crash if signal handler cannot be set by @mbway in #3152
Pin pillow for code quality workflow by @dakinggg in #3162
Fix torch version check by @dakinggg in #3161
add more retry to checkpoint downloading by @bigning in #3164
Append to gpu rank log files instead of throwing error by @jjanezhang in #3166
Call set_epoch on Dataloader.batch_sampler if defined by @Ghelfi in #3124
Bump version to 0.21.2 by @mvpatel2000 in #3168

New Contributors

@hyenal made their first contribution in #3105
@Ghelfi made their first contribution in #3124

Full Changelog: v0.21.1...v0.21.2

Contributors

bigning, mbway, and 7 other contributors

22 Mar 01:08

mvpatel2000

v0.21.1

1b87a07

v0.21.1

Bug Fixes

1. Fix to HSDP checkpoint loading

The previous release broke checkpoint loading when using HSDP with multiple replicas. This patch release fixes checkpoint loading.

What's Changed

Fix broadcast by @mvpatel2000 in #3138

Full Changelog: v0.21.0...v0.21.1

Contributors

mvpatel2000

21 Mar 21:19

mvpatel2000

v0.21.0

c36d3e1

v0.21.0

What's New

1. Aggregate Memory Monitoring (#3042)

The Memory Monitor callback now supports aggregating memory statistics across nodes. Summary statistics for a run's memory usage across the cluster can help considerably when debugging straggler nodes or non-homogeneous workloads. The memory monitor can now aggregate and log combined values at a user-specified frequency.

Example:

from composer import Trainer
from composer.callbacks import MemoryMonitor

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    callbacks=[
        MemoryMonitor(
            dist_aggregate_batch_interval=10,  # aggregate every 10 batches
        )
    ],
)
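To illustrate what aggregating across ranks means, the sketch below reduces a list of per-rank memory readings to summary statistics. The choice of min, max, and mean is an assumption, since these notes do not list which statistics Composer logs, and a real implementation would gather the values with distributed collectives rather than from a local list.

```python
from statistics import mean


def aggregate_memory(per_rank_gb: list[float]) -> dict[str, float]:
    """Summarize per-rank memory readings (illustrative only)."""
    if not per_rank_gb:
        raise ValueError('need at least one reading')
    return {
        'min': min(per_rank_gb),
        'max': max(per_rank_gb),
        'avg': mean(per_rank_gb),
    }


# Hypothetical readings in GB; the rank using far more memory stands out in 'max'.
print(aggregate_memory([10.0, 12.0, 20.0]))
```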

2. Advanced Compression Options (#3118)

Large model checkpoints can be expensive to store and transfer. In this release, we've upgraded our compression support to accept several new formats, which give better compression-time tradeoffs by using CLI tools. To use compression, append a compression extension to your checkpoint filename. We now support the following extensions:

bz2
gz
lz4
lzma
lzo
xz
zst

Example:

from composer import Trainer

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    save_filename='ep{epoch}-ba{batch}-rank{rank}.pt.lz4',
)

Thank you to @mbway for adding this support!
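The extension-based selection above can be sketched as a simple suffix check. This is an illustration under stated assumptions, not Composer's implementation: compression_extension is a hypothetical helper, and only the list of extensions comes from these notes.

```python
from pathlib import PurePosixPath
from typing import Optional

# Extensions listed in the release notes above.
SUPPORTED_EXTENSIONS = frozenset({'bz2', 'gz', 'lz4', 'lzma', 'lzo', 'xz', 'zst'})


def compression_extension(filename: str) -> Optional[str]:
    """Return the compression extension of a checkpoint filename, or None if uncompressed."""
    suffix = PurePosixPath(filename).suffix.lstrip('.')
    return suffix if suffix in SUPPORTED_EXTENSIONS else None


print(compression_extension('ep{epoch}-ba{batch}-rank{rank}.pt.lz4'))
print(compression_extension('ep{epoch}-ba{batch}-rank{rank}.pt'))
```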

What's Changed

Rename composer_run_name tag to run_name when logging to MLflow by @jerrychen109 in #3040
enable aggregate mem monitoring by @vchiley in #3042
Bump junitparser from 3.1.1 to 3.1.2 by @dependabot in #3056
Add SHARD_GRAD_OP to device mesh error check by @mvpatel2000 in #3058
Add torch 2.2.1 support by @mvpatel2000 in #3059
Use testing repo actions for linting by @b-chu in #3060
Link autoresume docs back to watchdog by @aspfohl in #3052
Deprecate get_state and remove deprecations by @b-chu in #3017
Bump version to 0.20.1 by @mvpatel2000 in #3061
Remove s3_bucket pytest cli flag by @b-chu in #3064
Remove s3_bucket flag from gpu test by @b-chu in #3065
Clean Up OOM Observer Remote Uploader Download path by @j316chuck in #3070
Fix daily test for iteration by @b-chu in #3068
Remove "generation_length" in favor of "generation_kwargs" by @maxisawesome in #3014
Bump packaging by @mvpatel2000 in #3072
Use ci-testing repo for CPU and GPU tests by @b-chu in #3062
Add new torch monkeypatches to Composer by @mvpatel2000 in #3063
Add initial support for neuron devices by @bfontain in #3049
Stripping whitespaces as default for QATask ICL eval by @ksreenivasan in #3073
Add ICL base class to all by @mvpatel2000 in #3079
pass prelimiter into ALL ICL datasets by @eitanturok in #3069
Bump sentencepiece from 0.1.99 to 0.2.0 by @dependabot in #3083
Add Iteration related Events to callbacks by @b-chu in #3077
Add Iteration related Events by @b-chu in #3076
Bump CI/CD to v3 by @mvpatel2000 in #3086
Add docstring to _iteration_length by @b-chu in #3088
Check FSDP module has _device_mesh before getting it by @eracah in #3091
Bump minor version in base image by @mvpatel2000 in #3092
Enforce async logging flush in mlflow logger at post_close call by @chenmoneygithub in #3093
Warning log to info log by @aspfohl in #3096
Bump transformers by @dakinggg in #3095
Change style for splitting on commas by @b-chu in #3078
Remove slash by @b-chu in #3098
Allowing for fractional number of samples per rank by @ShashankMosaicML in #3075
Output eval logging (batch level) by @maxisawesome in #2977
Replace errors with warnings for eval args by @mvpatel2000 in #3100
Ability to load sharded checkpoints with remote symlink load_path by @eracah in #3097
Improvements to NeptuneLogger by @AleksanderWWW in #3085
Revert "Improvements to NeptuneLogger" by @mvpatel2000 in #3111
Bump mlflow min pin by @dakinggg in #3110
Fix rounding issue in interval calculation by @dakinggg in #3109
Bump coverage[toml] from 7.4.1 to 7.4.3 by @dependabot in #3102
Uses v0.0.4 of ci-testing by @b-chu in #3112
Add versioned deprecation warning by @irenedea in #2984
Update Flash Attention to 2.5.5 by @Skylion007 in #3113
Setting the max duration to current timestamp in the same units as cu… by @ShashankMosaicML in #3090
Making default_split_batch public by @ShashankMosaicML in #3116
Adding log exception to Mosaic Logger by @jjanezhang in #3089
Add checks to schedulers by @b-chu in #3115
Removed default attrs from exception class in the attrs dict by @jjanezhang in #3126
Bump coverage[toml] from 7.4.3 to 7.4.4 by @dependabot in #3121
Refactor initialization by @Practicinginhell in #3127
Bump databricks sdk version by @dakinggg in #3128
Update packaging requirement from <23.3,>=21.3.0 to >=21.3.0,<24.1 by @dependabot in #3122
Remove rng from save_weights_only ckpt by @eracah in #3129
More compression options by @mbway in #3118
Only broadcast distcp files by @mvpatel2000 in #3130
Bump version to 0.21 by @mvpatel2000 in #3132

New Contributors

@ksreenivasan made their first contribution in #3073
@eitanturok made their first contribution in #3069
@Practicinginhell made their first contribution in #3127
@mbway made their first contribution in #3118

Full Changelog: v0.20.1...v0.21.0

Contributors

Skylion007, mbway, and 19 other contributors

27 Feb 19:51

mvpatel2000

v0.20.1

118c7f2

v0.20.1

What's New

1. Torch 2.2.1 Support

Composer now supports torch 2.2.1! We've raised the pin to allow the latest torch, and we've upstreamed all torch monkeypatches so Composer can run out of the box with the latest and greatest torch features.

What's Changed

Add torch 2.2.1 support by @mvpatel2000 in #3059
Bump version to 0.20.1 by @mvpatel2000 in #3061

Contributors

mvpatel2000

23 Feb 18:39

j316chuck

v0.20.0

9ecea4f

v0.20.0

What's New

1. New Neptune Logger

Composer now supports logging training data to neptune.ai using the NeptuneLogger. To get started:

from composer.loggers import NeptuneLogger

neptune_project = 'test_project'
neptune_api_token = 'test_token'

neptune_logger = NeptuneLogger(
    project=neptune_project,
    api_token=neptune_api_token,
    rank_zero_only=False,
    mode='debug',
    upload_artifacts=True,
)

We also have an example project demonstrating all the awesome things you can do with this integration!

Additional information on the NeptuneLogger can be found in the docs.

2. OOM observer callback with memory visualizations

Composer now has an OOM observer callback. When a model runs out of memory, this callback produces a trace that identifies memory allocations, which can be critical for designing strategies to reduce memory usage.

Example:

from composer import Trainer
from composer.callbacks import OOMObserver
# constructing trainer object with this callback
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    callbacks=[
        OOMObserver(
            folder="traces",
            overwrite=True,
            filename="rank{rank}_oom",
            remote_filename="oci://bucket_name/{run_name}/oom_traces/rank{rank}_oom",
        )
    ],
)

[Figure: OOM visualization]

3. Log all gpu rank stdout/err to MosaicML platform

Composer has expanded its integration with the MosaicML platform. Now, we can view the stdout/stderr of every GPU rank with MCLI logs, enabling more comprehensive analysis of jobs.

Example:

mcli logs <run-name> --node x --gpu x

Note that this defaults to node rank 0 if --node is not provided.

Also, we can find the logs of any global gpu rank with the command:

mcli logs <run-name> --global-gpu-rank x
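The two addressing schemes are related by simple arithmetic when every node has the same number of GPUs and ranks are numbered contiguously node by node. The sketch below assumes exactly that layout; the helper name is hypothetical, and the mapping MCLI uses internally is not described in these notes.

```python
def split_global_rank(global_rank: int, gpus_per_node: int) -> tuple[int, int]:
    """Map a global GPU rank to (node rank, local GPU rank) under a contiguous layout."""
    if gpus_per_node <= 0 or global_rank < 0:
        raise ValueError('expected gpus_per_node > 0 and global_rank >= 0')
    return divmod(global_rank, gpus_per_node)


# With 8 GPUs per node, global rank 11 is GPU 3 on node 1.
node, gpu = split_global_rank(11, gpus_per_node=8)
print(node, gpu)
```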

Bug Fixes

Only save RNG on rank 0 by @mvpatel2000 in #2998
[Auto-microbatch fix] FSDP reshard and cleanup after OOM to fix the cuda memory leak by @bigning in #3030
Fix skip_first for profiler during resumption by @bigning in #2986
Race condition fix in checkpoint loading util by @jessechancy in #3001

What's Changed

Remove .ci folder and move FILE_HEADER and CODEOWNERS by @irenedea in #2957
Modify UCObjectStore.list_objects to lists all files recursively by @irenedea in #2959
Refactor MemorySnapshot by @cli99 in #2960
Log all gpu rank stdout/err to MosaicML platform by @jjanezhang in #2839
Add Torch 2.2 tests by @mvpatel2000 in #2970
Memory snapshot dump pickle by @cli99 in #2968
Neptune logger by @AleksanderWWW in #2447
Fix torch pins in tests by @mvpatel2000 in #2973
Add a register_model_with_run_id api to MLflowLogger by @dakinggg in #2967
Remove bespoke codeowners by @mvpatel2000 in #2971
Add a BEFORE_LOAD event by @snarayan21 in #2974
More torch 2.2 fixes by @mvpatel2000 in #2975
Adding the step argument to logger.log_table by @ShashankMosaicML in #2961
Fix daily tests for torch 2.2 by @mvpatel2000 in #2980
Format load_path with name by @mvpatel2000 in #2978
Bump to 0.19.1 by @mvpatel2000 in #2979
Fix UC object store bugfix by @nancyhung in #2982
[Bugfix][UC] Add back the full object path by @nancyhung in #2988
Minor cleanup of UC get_object_size by @dakinggg in #2989
Pin UC to earlier version by @dakinggg in #2990
Revert "fix skip_first for resumption" by @bigning in #2991
Broadcast files for HSDP by @mvpatel2000 in #2914
Bump ipykernel from 6.29.0 to 6.29.2 by @dependabot in #2994
Bump yamllint from 1.33.0 to 1.34.0 by @dependabot in #2995
Refactor update_metric by @maxisawesome in #2965
Add azure integration test by @mvpatel2000 in #2996
Fix Profiler schedule skip_first by @bigning in #2992
Remove planner validation by @mvpatel2000 in #2985
Fix load for non-HSDP device mesh by @mvpatel2000 in #2997
Update NCCL arg since torch deprecated old one by @mvpatel2000 in #3000
Add bias argument to LPLN by @mvpatel2000 in #2999
Revert "Add bias argument to LPLN" by @mvpatel2000 in #3003
Revert "Update NCCL arg since torch deprecated old one" by @mvpatel2000 in #3004
Add torch 2.3 image for aws cluster by @j316chuck in #3002
Patch torch 2.3 aws naming by @j316chuck in #3006
Add debug log before training loop starts by @mvpatel2000 in #3005
Deprecate ffcv code by @j316chuck in #3007
Remove log for mosaicml logger by @mvpatel2000 in #3008
[EASY] Always log 1st batch when resuming training by @bigning in #3009
Use reusable actions for linting by @b-chu in #2948
Make CodeEval respect device_eval_batch_size by @josejg in #2969
Use Mosaic constant for GPU file prefix by @jjanezhang in #3018
Fall back to normal logging when gpu prefix is not present by @jjanezhang in #3020
Revert "Use reusable actions for linting" to fix CI/CD by @mvpatel2000 in #3023
Change to pull_request_target by @b-chu in #3025
Bump gitpython from 3.1.41 to 3.1.42 by @dependabot in #3031
Bump yamllint from 1.34.0 to 1.35.1 by @dependabot in #3034
Update torchmetrics requirement from <1.3.1,>=0.10.0 to >=0.10.0,<1.3.2 by @dependabot in #3035
Bump pypandoc from 1.12 to 1.13 by @dependabot in #3033
Add tensorboard images support by @Menduist in #3021
Add sorted to logs for checkpoint broadcast by @mvpatel2000 in #3036
Friendlier device mesh error by @mvpatel2000 in #3039
Upgrade to python3.11 for torch nightly by @j316chuck in #3038
Download symlink once by @mvpatel2000 in #3043
Add min size to OCI download by @mvpatel2000 in #3044
Lint fix by @mvpatel2000 in #3045
Revert "Change to pull_request_target " by @mvpatel2000 in #3047
Bump composer version 0.19.2 by @j316chuck in #3048
Update XLA support by @bfontain in #2964
Bump composer version 0.20.0 by @j316chuck in #3051
Update ruff. Fix PLE & LOG lints by @Skylion007 in #3050

New Contributors

@AleksanderWWW made their first contribution in #2447
@ShashankMosaicML made their first contribution in #2961
@nancyhung made their first contribution in #2982
@bigning made their first contribution in #2986
@jessechancy made their first contribution in #3001
@josejg made their first contribution in #2969
@Menduist made their first contribution in #3021
@bfontain made their first contribution in #2964

**Full Chang...

Contributors

Skylion007, bigning, and 17 other contributors

Releases: mosaicml/composer
