Releases: mosaicml/composer

v0.23.2

08 Jun 03:11

Bug Fixes

  • Fix backward compatibility issue caused by missing eval metrics class

What's Changed:

  • Fix backward compatibility issue caused by missing eval metrics class by @bigning in #3385

Full Changelog: v0.23.1...release/v0.23.2

v0.23.1

07 Jun 15:03

What's New

1. PyTorch 2.3.1 Upgrade

Composer now supports PyTorch 2.3.1.

What's Changed

Full Changelog: v0.23.0...v0.23.1

v0.23.0

05 Jun 20:34

What's New

1. Parallelism V2 + Tensor Parallel (#3335)

Composer now supports PyTorch's implementation of tensor parallelism. As part of this, we've revamped and simplified how Composer does distributed training. Previously, the Trainer accepted an fsdp_config argument:

trainer = Trainer(model, fsdp_config = {'sharding_strategy': 'FULL_SHARD'})

As we generalize to more forms of parallelism, we've deprecated fsdp_config in favor of parallelism_config:

trainer = Trainer(
    model = model,
    ...
    parallelism_config = {
        'fsdp': {
            'sharding_strategy': 'FULL_SHARD',
            'data_parallel_shard_degree': 2,      # Size of shard dimension
            'data_parallel_replicate_degree': 2,  # Size of replicate dimension
        },
        'tp_config': {
            'tensor_parallel_degree': 2,          # Size of TP dimension
            'layer_plan': ...  # describes how to TP layers
        }
    }
)

As part of this change, we now default to using DTensor for parallelism with PyTorch FSDP. PyTorch has deprecated ShardedTensor, so this change migrates to the new backend, which avoids various checkpointing bugs.

See the docs for tensor parallel for more information. Note that tensor parallel is still experimental and may be subject to breaking API changes. Not all checkpointing features may work with this form of parallelism.
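
For readers migrating an existing configuration, the mapping described above can be sketched as follows. The helper names `to_parallelism_config` and `mesh_size` are hypothetical and not part of Composer. The size check assumes that the three degrees multiply to the number of devices, as the dimensions of a device mesh do. The dictionary keys are copied from the example above.

```python
from typing import Any, Dict, Optional


def to_parallelism_config(
    fsdp_config: Dict[str, Any],
    tp_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: nest a legacy fsdp_config under the 'fsdp' key."""
    config: Dict[str, Any] = {'fsdp': dict(fsdp_config)}
    if tp_config is not None:
        # Key name copied from the release-note example above.
        config['tp_config'] = dict(tp_config)
    return config


def mesh_size(config: Dict[str, Any]) -> int:
    """Assumption: the device count is the product of all parallel degrees."""
    fsdp = config['fsdp']
    tp = config['tp_config']
    return (
        fsdp['data_parallel_shard_degree']
        * fsdp['data_parallel_replicate_degree']
        * tp['tensor_parallel_degree']
    )


# Legacy config from the first example, wrapped in the new layout.
migrated = to_parallelism_config({'sharding_strategy': 'FULL_SHARD'})

# Degrees from the second example: 2 (shard) x 2 (replicate) x 2 (TP).
example = to_parallelism_config(
    {
        'sharding_strategy': 'FULL_SHARD',
        'data_parallel_shard_degree': 2,
        'data_parallel_replicate_degree': 2,
    },
    {'tensor_parallel_degree': 2},
)
devices_needed = mesh_size(example)
```

Under that assumption, the example configuration above would span eight devices.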

2. MLFlow API Simplification

Previously, MLFlow logger required a tracking URI and an absolute user path when using MLFlow with Databricks:

mlflow_logger = MLFlowLogger(
    tracking_uri = 'databricks',
    experiment_name = '/Users/[email protected]/my-first-project/'
)

trainer = Trainer(
    model = model,
    ...
    loggers = mlflow_logger,
)

Now, if you are using Databricks secrets as an environment variable, Composer will autopopulate tracking_uri and the experiment_name prefix:

trainer = Trainer(
    model = model,
    ...
    loggers = MLFlowLogger(experiment_name='my-first-project'),
)
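
To illustrate what populating the experiment_name prefix means, here is a minimal sketch. It is not Composer's implementation: the function `resolve_experiment_name` and the `user` argument are hypothetical, and the `/Users/<user>/` layout is taken from the earlier example.

```python
def resolve_experiment_name(experiment_name: str, user: str) -> str:
    """Hypothetical: prepend the Databricks user folder to a relative name."""
    if experiment_name.startswith('/'):
        # Already an absolute workspace path; leave it untouched.
        return experiment_name
    return f'/Users/{user}/{experiment_name}'


resolved = resolve_experiment_name('my-first-project', 'someone@example.com')
unchanged = resolve_experiment_name('/Shared/my-first-project', 'someone@example.com')
```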

3. Wallclock Save Interval

Composer now supports setting a save interval in wallclock time:

trainer = Trainer(
    model = model,
    ...
    save_interval='30m',
)

Note that most durations, such as max_duration, do not accept wallclock time. The initial version of this feature is limited to a subset of time settings such as save_interval.
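
The idea of a wallclock-based interval can be sketched in plain Python. This is an illustration only: `parse_wallclock` and `should_save` are hypothetical helpers, and the set of accepted unit suffixes is an assumption rather than Composer's actual grammar.

```python
from datetime import timedelta

# Assumed unit suffixes; Composer's accepted formats may differ.
_UNITS = {'s': 'seconds', 'm': 'minutes', 'h': 'hours'}


def parse_wallclock(text: str) -> timedelta:
    """Parse strings such as '30m' into a timedelta."""
    value, unit = text[:-1], text[-1:]
    if unit not in _UNITS or not value.isdigit():
        raise ValueError(f'Unsupported wallclock interval: {text!r}')
    return timedelta(**{_UNITS[unit]: int(value)})


def should_save(elapsed: timedelta, last_save: timedelta, interval: timedelta) -> bool:
    """Save once at least one full interval has passed since the last save."""
    return elapsed - last_save >= interval


interval = parse_wallclock('30m')
due = should_save(timedelta(minutes=45), timedelta(minutes=10), interval)
not_due = should_save(timedelta(minutes=35), timedelta(minutes=10), interval)
```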

Bug Fixes

  • Don't close the engine if it's already closed in #3143
  • Fix HF tests with Pin in #3248
  • Fix backwards compatibility tests in #3252
  • Fix unexpected remote checkpointing downloading in #3271
  • Fix HSDP with ShardDegree < 8 in #3313

What's Changed


v0.22.0

01 May 16:59

What's New

🔥 Support for PyTorch v2.3.0

Composer now supports the recently-released PyTorch version 2.3.0! Please raise any issues with us so we can address them.

Bug Fixes

  • Fixing checks for device microbatch size for sequence parallelism in #3200
  • Fixing token logging in #3206
  • Search for run name in MLFlowLogger in #3215
  • Fix FQN names with activation checkpointing in #3210
  • Strict weight matching for checkpoint loading in #3219

What's Changed

Full Changelog: v0.21.3...v0.22.0

v0.21.3

19 Apr 15:41

Bug Fixes

1. Increased Robustness to Checkpoint Loading

We've patched several edge cases in loading sharded checkpoints, especially with DTensors, which should decrease memory usage when loading checkpoints. We've also hardened the retry logic against cloud object store failures, making checkpoint loading more robust to transient network issues.

What's Changed

New Contributors

Full Changelog: v0.21.2...v0.21.3

v0.21.2

03 Apr 21:14

Bug Fixes

1. Enable torch 2.2.2 (#3161)

Composer currently monkeypatches PyTorch for nightly versions in order to fix upstream bugs. With the release of torch 2.2.2, these monkeypatches were mistakenly applied to the stable release due to incorrect gating on imports. This release fixes the gating, enabling torch 2.2.2.
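
As a rough illustration of version-based gating, the sketch below applies patches only to nightly builds. It is not Composer's code: `is_nightly` and `should_apply_patches` are hypothetical names, and the check relies on the assumption that nightly builds carry a `.dev<date>` segment in their version string while stable releases do not.

```python
def is_nightly(torch_version: str) -> bool:
    """Nightly builds look like '2.4.0.dev20240405+cu121'."""
    return '.dev' in torch_version


def should_apply_patches(torch_version: str) -> bool:
    # Gate on the version string so that stable releases are never patched.
    return is_nightly(torch_version)


stable = should_apply_patches('2.2.2')
stable_cuda = should_apply_patches('2.2.2+cu121')
nightly = should_apply_patches('2.4.0.dev20240405+cu121')
```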

2. MPS Metric Computation on CPU (#3105)

Due to bugs in computing torchmetrics on Mac devices, we move metric computation onto CPU. Previously, data was not always moved to CPU properly; this release fixes that.

Thank you to @hyenal for this contribution!

3. Batch Sampler Support (#3105)

Composer now supports dataloaders that specify a batch sampler, which previously resulted in an error.

Thank you to @Ghelfi for this contribution!
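
For context, a batch sampler is an iterable that yields one list of dataset indices per batch. The class below is a hypothetical, dependency-free sketch of that contract; it is not taken from Composer or from the linked PR.

```python
from typing import Iterator, List


class ChunkBatchSampler:
    """Hypothetical batch sampler that yields consecutive chunks of indices."""

    def __init__(self, num_samples: int, batch_size: int) -> None:
        if batch_size <= 0:
            raise ValueError('batch_size must be positive')
        self.num_samples = num_samples
        self.batch_size = batch_size

    def __iter__(self) -> Iterator[List[int]]:
        for start in range(0, self.num_samples, self.batch_size):
            stop = min(start + self.batch_size, self.num_samples)
            yield list(range(start, stop))

    def __len__(self) -> int:
        # Number of batches, counting a final partial batch.
        return (self.num_samples + self.batch_size - 1) // self.batch_size


sampler = ChunkBatchSampler(num_samples=10, batch_size=4)
batches = list(sampler)
```

With PyTorch installed, an object like this can be passed to `torch.utils.data.DataLoader(dataset, batch_sampler=sampler)`, and the resulting dataloader handed to the Trainer.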

What's Changed

New Contributors

Full Changelog: v0.21.1...v0.21.2

v0.21.1

22 Mar 01:08

Bug Fixes

1. Fix to HSDP checkpoint loading

The previous release broke checkpoint loading when using HSDP with multiple replicas. This patch release fixes checkpoint loading.

What's Changed

Full Changelog: v0.21.0...v0.21.1

v0.21.0

21 Mar 21:19

What's New

1. Aggregate Memory Monitoring (#3042)

The Memory Monitor callback now supports aggregating memory statistics across nodes. Summary statistics for a run's memory usage across the cluster can dramatically help debug straggler nodes or non-homogeneous workloads. The memory monitor can now aggregate and log combined values at a user-specified frequency.

Example:

from composer import Trainer
from composer.callbacks import MemoryMonitor

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    callbacks=[
        MemoryMonitor(
            dist_aggregate_batch_interval=10,  # aggregate every 10 batches
        )
    ],
)
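
To make the idea of aggregation concrete, here is a minimal sketch of combining per-rank statistics into summary values. The function `aggregate_memory_stats`, the metric name `peak_allocated_gb`, and the choice of min/max/mean are all assumptions for illustration; the values Composer actually logs may differ.

```python
from statistics import mean
from typing import Dict, List


def aggregate_memory_stats(per_rank: List[Dict[str, float]]) -> Dict[str, float]:
    """Hypothetical: reduce one stats dict per rank into min/max/mean values."""
    if not per_rank:
        raise ValueError('per_rank must contain at least one entry')
    summary: Dict[str, float] = {}
    for key in per_rank[0]:
        values = [stats[key] for stats in per_rank]
        summary[f'{key}/min'] = min(values)
        summary[f'{key}/max'] = max(values)
        summary[f'{key}/mean'] = mean(values)
    return summary


# Two ranks with different peak usage; a large max/min gap hints at imbalance.
summary = aggregate_memory_stats([
    {'peak_allocated_gb': 10.0},
    {'peak_allocated_gb': 14.0},
])
```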

2. Advanced Compression Options (#3118)

Large model checkpoints can be expensive to store and transfer. In this release, we've upgraded our compression support to accept several new formats, which offer better tradeoffs between compression ratio and time by using CLI tools. To use compression, append a compression extension to your checkpoint filename. We now support the following extensions:

  • bz2
  • gz
  • lz4
  • lzma
  • lzo
  • xz
  • zst

Example:

from composer import Trainer

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    save_filename='ep{epoch}-ba{batch}-rank{rank}.pt.lz4',
)

Thank you to @mbway for adding this support!
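
The extension-based convention above can be sketched as a small lookup. `compression_extension` is a hypothetical helper, not a Composer function; the set of extensions is copied from the list above, and the values used to fill the filename template are placeholders.

```python
from typing import Optional

# Extensions listed in this release note.
SUPPORTED_EXTENSIONS = {'bz2', 'gz', 'lz4', 'lzma', 'lzo', 'xz', 'zst'}


def compression_extension(filename: str) -> Optional[str]:
    """Return the compression extension of a checkpoint name, if any."""
    if '.' not in filename:
        return None
    suffix = filename.rsplit('.', 1)[-1]
    return suffix if suffix in SUPPORTED_EXTENSIONS else None


# Fill in the template from the example with placeholder values.
name = 'ep{epoch}-ba{batch}-rank{rank}.pt.lz4'.format(epoch=1, batch=100, rank=0)
extension = compression_extension(name)
uncompressed = compression_extension('ep1-ba100-rank0.pt')
```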

What's Changed

New Contributors

Full Changelog: v0.20.1...v0.21.0

v0.20.1

27 Feb 19:51

What's New

1. Torch 2.2.1 Support

Composer now supports torch 2.2.1! We've raised the pin to allow the latest torch, and we've upstreamed all torch monkeypatches so Composer can run out of the box with the latest and greatest torch features.

What's Changed

v0.20.0

23 Feb 18:39
9ecea4f

What's New

1. New Neptune Logger

Composer now supports logging training data to neptune.ai using the NeptuneLogger. To get started:

from composer.loggers import NeptuneLogger

neptune_project = 'test_project'
neptune_api_token = 'test_token'

neptune_logger = NeptuneLogger(
    project=neptune_project,
    api_token=neptune_api_token,
    rank_zero_only=False,
    mode='debug',
    upload_artifacts=True,
)

We also have an example project demonstrating all the awesome things you can do with this integration!

Additional information on the NeptuneLogger can be found in the docs.

2. OOM observer callback with memory visualizations

Composer now has an OOM observer callback. When a model runs out of memory, this callback helps produce a trace which identifies memory allocations, which can be critical to designing strategies to mitigate memory usage.

Example:

from composer import Trainer
from composer.callbacks import OOMObserver
# constructing trainer object with this callback
trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    eval_dataloader=eval_dataloader,
    optimizers=optimizer,
    max_duration="1ep",
    callbacks=[
        OOMObserver(
            folder="traces",
            overwrite=True,
            filename="rank{rank}_oom",
            remote_filename="oci://bucket_name/{run_name}/oom_traces/rank{rank}_oom",
        )
    ],
)

[Screenshot: OOM visualization]

3. Log all gpu rank stdout/err to MosaicML platform

Composer has expanded its integration with the MosaicML platform. Now, we can view the stdout and stderr of every GPU rank with MCLI logs to enable more comprehensive analysis of jobs.

Example:

mcli logs <run-name> --node x --gpu x 

Note, this defaults to node rank 0 if --node is not provided.

Also, we can find the logs of any global gpu rank with the command:

mcli logs <run-name> --global-gpu-rank x
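
The relationship between a (node, gpu) pair and a global GPU rank can be sketched as below. This assumes every node has the same number of GPUs and that ranks are assigned node by node; the default of 8 GPUs per node and both function names are illustrative assumptions, not MCLI behavior.

```python
from typing import Tuple


def global_gpu_rank(node: int, gpu: int, gpus_per_node: int = 8) -> int:
    """Assumed layout: ranks are numbered node by node."""
    if not 0 <= gpu < gpus_per_node:
        raise ValueError('gpu must be in [0, gpus_per_node)')
    return node * gpus_per_node + gpu


def node_and_gpu(global_rank: int, gpus_per_node: int = 8) -> Tuple[int, int]:
    """Inverse mapping: recover (node, local gpu) from a global rank."""
    return divmod(global_rank, gpus_per_node)


rank = global_gpu_rank(node=1, gpu=3)
location = node_and_gpu(rank)
```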

Bug Fixes

What's Changed

New Contributors
