Releases: ml-energy/zeus

v0.9.1

07 May 04:07
cf8324c

What's new

  • For GPU power draw, we use nvmlDeviceGetFieldValues, which gives us instant power draw (instead of average power draw) for any microarchitecture.

v0.9.0: Batch size optimizer and big cleanups

06 May 16:07
0ae4de1

What's new

  • The batch size optimizer is now a full-fledged server that can be deployed independently, with Docker Compose, or on Kubernetes + KubeFlow.
  • GPU abstraction: We created an abstraction layer over GPU vendors (NVIDIA and AMD). We're on our way to supporting AMD GPUs.
  • Completely revamped documentation under https://ml.energy/zeus.

Deprecated

  • See #20 (ZeusDataLoader, ZeusMaster, and the C++ Zeus monitor)

v0.8.0: Energy-efficient large model training

13 Oct 21:34
076df3d

This release features Perseus, an optimizer for energy-efficient large model training.

See the Perseus docs for details.

v0.7.1: Moved to under `ml-energy`!

24 Sep 04:10
6082db4

We moved our repository under the ml-energy organization. No feature changes :)

v0.7.0: Python-based power monitor

24 Aug 21:22

What's New

  • We used to have a C++ power monitor under zeus_monitor, but we've deprecated that. There's no need for high-speed polling because NVML power counters don't update that quickly anyway.
    • To poll power consumption programmatically, use zeus.monitor.power.PowerMonitor.
  • CLI power & energy monitor:
    • python -m zeus.monitor power
    • python -m zeus.monitor energy
  • We switched from the old setup.py to the new package metadata standard pyproject.toml.
  • Docker image sizes are drastically smaller now! The compressed image used to be 8.48 GB, but now it's down to 2.71 GB.

v0.6.1: `approx_instant_energy`

07 Aug 21:18

What's New

approx_instant_energy in ZeusMonitor

  • Sometimes, the NVML energy counter's update period is longer than the measurement window, in which case energy consumption may be returned as 0.0. In this case, when approx_instant_energy=True, ZeusMonitor will approximate the energy consumption of the window as instant power consumption multiplied by the duration of the measurement window: $$\textrm{Energy} = \int_0^T \textrm{Power}(t) dt \approx \textrm{Power}(T) \cdot T$$
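In plain Python, the approximation boils down to one instant power reading multiplied by the window length (an illustrative sketch, not Zeus code):

```python
def approx_instant_energy(power_watts: float, window_seconds: float) -> float:
    """Approximate window energy (J) as instant power (W) times duration (s)."""
    return power_watts * window_seconds

# If the GPU reports 250 W at the end of a 0.5 s window whose energy
# counter never ticked, the window's energy is approximated as 125 J.
print(approx_instant_energy(250.0, 0.5))  # → 125.0
```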

v0.6.0: `OptimumSelector`

28 Jul 21:03

What's New

OptimumSelector

  • Until now, the optimal power limit for GlobalPowerLimitOptimizer was always the one that minimizes the Zeus time-energy cost. Not everyone wants that.
  • Now, OptimumSelector is an abstract base class with which you can implement your own optimal power limit selection policy.
  • Pre-implemented ones are Time, Energy, ZeusCost, and MaxSlowdownConstraint. These are thoroughly tested.
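To give a feel for the selector pattern, here is a simplified standalone sketch. The class names mirror Zeus's, but the interface (a plain dict mapping power limit to time and energy) and the max_power parameter are hypothetical, and ZeusCost below uses one common form of the time-energy cost, η·Energy + (1−η)·MaxPower·Time:

```python
from abc import ABC, abstractmethod

# Simplified stand-in: each profile entry maps a power limit (W)
# to a (time in s, energy in J) measurement for one training step.

class OptimumSelector(ABC):
    @abstractmethod
    def select(self, profile: dict[int, tuple[float, float]]) -> int:
        """Return the chosen power limit."""

class Time(OptimumSelector):
    def select(self, profile):
        return min(profile, key=lambda pl: profile[pl][0])

class Energy(OptimumSelector):
    def select(self, profile):
        return min(profile, key=lambda pl: profile[pl][1])

class ZeusCost(OptimumSelector):
    """eta_knob trades off energy (eta) against time (1 - eta)."""
    def __init__(self, eta_knob: float, max_power: float) -> None:
        self.eta = eta_knob
        self.max_power = max_power

    def select(self, profile):
        def cost(pl):
            t, e = profile[pl]
            return self.eta * e + (1 - self.eta) * self.max_power * t
        return min(profile, key=cost)

profile = {300: (10.0, 3000.0), 250: (11.0, 2600.0), 200: (14.0, 2500.0)}
print(Time().select(profile))    # 300: fastest step
print(Energy().select(profile))  # 200: least energy
```

A custom policy is then just another subclass overriding select.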

wait_steps

  • Now, you can specify wait_steps in GlobalPowerLimitOptimizer, and it'll wait for the specified number of steps before profiling and optimizing.
  • wait_steps is set to 1 by default because users may have torch.backends.cudnn.benchmark = True, and DataLoader workers usually need time to warm up before ramping up to their normal fetch throughput.

Breaking Changes

  • GlobalPowerLimitOptimizer now takes an instance of OptimumSelector in its constructor, instead of eta_knob. If you want to recover the functionality of v0.5.0, modify your code like this:
    # Before
    plo = GlobalPowerLimitOptimizer(..., eta_knob=0.5, ...)
    # After
    from zeus.optimizer.power_limit import ZeusCost
    
    plo = GlobalPowerLimitOptimizer(..., optimum_selector=ZeusCost(eta_knob=0.5), ...)

v0.5.0: Big refactor, `GlobalPowerLimitOptimizer`

12 Jul 03:34

What's New

Callback-based architecture

  • zeus.callback.Callback is the new backbone for Zeus components
  • GlobalPowerLimitOptimizer is the shiny new way to online-profile and optimize the power limit of DNN training.
  • EarlyStopController monitors and manages all sorts of conditions to determine whether training should stop.

Extensive testing

  • tests/ is richer than ever: with deep component tests and exhaustive parametrization, there are now around 1,500 test cases.
  • Notably, zeus.util.testing.ReplayZeusMonitor exposes the same public API as ZeusMonitor but replays the measurement window logs produced by ZeusMonitor instead of doing actual measurement. With this, Zeus can now be tested without any real GPUs.
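The replay idea can be sketched standalone. Everything below (the class, the JSON-lines log format, the begin/end window methods) is a hypothetical illustration of the pattern, not the real ReplayZeusMonitor in zeus.util.testing:

```python
import io
import json

class ReplayMonitor:
    """Serves recorded measurement windows from a log instead of real GPUs."""

    def __init__(self, log_file):
        # One JSON object per line: {"window": ..., "time": ..., "energy": ...}
        self._records: dict[str, list[dict]] = {}
        for line in log_file:
            rec = json.loads(line)
            self._records.setdefault(rec["window"], []).append(rec)
        self._cursor: dict[str, int] = {}

    def begin_window(self, name: str) -> None:
        self._cursor.setdefault(name, 0)

    def end_window(self, name: str) -> dict:
        # Replay the next recorded measurement for this window name.
        i = self._cursor[name]
        self._cursor[name] = i + 1
        return self._records[name][i]

log = io.StringIO(
    '{"window": "step", "time": 0.5, "energy": 120.0}\n'
    '{"window": "step", "time": 0.4, "energy": 100.0}\n'
)
monitor = ReplayMonitor(log)
monitor.begin_window("step")
print(monitor.end_window("step")["energy"])  # → 120.0
```

Code under test calls the same begin/end methods it would call on a live monitor, so no GPU is needed in CI.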

v0.4.0: `ZeusMonitor`

21 Jun 01:35

What's New

v0.3.0: `ZeusMonitorContext` for in-training-loop profiling

05 Dec 20:54

What's New

  • ZeusMonitorContext allows users to profile their per-iteration energy and time consumption.
    • It's aimed at those who want a quick feel for the energy consumption of their DNN training job with a couple of additional lines (as opposed to modified lines).
    • Documentation and integration example: here
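The "a couple additional lines" style of in-loop profiling follows the standard context-manager pattern, sketched here with timing only (the measure helper is hypothetical, not Zeus's API; real energy readings require NVML and a GPU):

```python
import time
from contextlib import contextmanager

@contextmanager
def measure(name: str, results: dict):
    """Record how long the wrapped block took, keyed by name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[name] = time.perf_counter() - start

results = {}
for step in range(3):
    # Only the `with` line is added; the loop body itself is unmodified.
    with measure(f"step_{step}", results):
        sum(i * i for i in range(10_000))  # stand-in for one training iteration

print(sorted(results))  # → ['step_0', 'step_1', 'step_2']
```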