Releases: ml-energy/zeus
v0.9.1
v0.9.0: Batch size optimizer and big cleanups
What's New
- The batch size optimizer is now a full-fledged server that can be deployed independently, with Docker Compose, or on Kubernetes + KubeFlow.
- GPU abstraction: We created an abstraction layer over GPU vendors (NVIDIA and AMD). We're on our way to supporting AMD GPUs.
- Completely revamped documentation under https://ml.energy/zeus.
Deprecated
- See #20 (`ZeusDataLoader`, `ZeusMaster`, and the C++ Zeus monitor).
v0.8.0: Energy-efficient large model training
This release features Perseus, an optimizer for energy-efficient large model training.
See the Perseus docs for details.
v0.7.1: Moved under `ml-energy`!
We moved our repository under the `ml-energy` organization. No feature changes :)
v0.7.0: Python-based power monitor
What's New
- We used to have a C++ power monitor under `zeus_monitor`, but it is now deprecated. There is no need for high-speed polling because NVML power counters do not update that quickly anyway.
- To poll power consumption programmatically, use `zeus.monitor.power.PowerMonitor`.
- CLI power & energy monitor:
  - `python -m zeus.monitor power`
  - `python -m zeus.monitor energy`
- We switched from the old `setup.py` to `pyproject.toml`, the new package metadata standard.
- Docker image sizes are drastically smaller now! The compressed image used to be 8.48 GB, but it is now down to 2.71 GB.
v0.6.1: `approx_instant_energy`
What's New
`approx_instant_energy` in `ZeusMonitor`
- Sometimes, the NVML energy counter update period is longer than the measurement window, in which case energy consumption may be returned as 0.0. In this case, when `approx_instant_energy=True`, `ZeusMonitor` will approximate the energy consumption of the window as instant power consumption multiplied by the duration of the measurement window:

$$\textrm{Energy} = \int_0^T \textrm{Power}(t)\, dt \approx \textrm{Power}(T) \cdot T$$
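The fallback can be sketched in plain, GPU-free Python. The function name, arguments, and unit choices below are illustrative (NVML reports cumulative energy in millijoules and power in milliwatts), not the Zeus implementation:

```python
def window_energy(counter_delta_mj: float, instant_power_mw: float, duration_s: float) -> float:
    """Energy of a measurement window in joules.

    When the NVML energy counter did not tick during the window
    (counter_delta_mj == 0.0), approximate energy as instant power
    times the window duration: E ~= P(T) * T.
    """
    if counter_delta_mj > 0.0:
        return counter_delta_mj / 1000.0  # mJ -> J
    return (instant_power_mw / 1000.0) * duration_s  # mW -> W, then W * s = J
```

For example, a 4 ms window during which the counter never ticked, at an instant reading of 250 W (250,000 mW), is approximated as 250 W × 0.004 s = 1.0 J.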
v0.6.0: `OptimumSelector`
What's New
`OptimumSelector`
- Until now, the optimal power limit for `GlobalPowerLimitOptimizer` was the one that minimizes the Zeus time-energy cost. Not everyone would want that.
- Now, `OptimumSelector` is an abstract base class with which you can implement your own optimal power limit selection policy.
- Pre-implemented ones are `Time`, `Energy`, `ZeusCost`, and `MaxSlowdownConstraint`. These are thoroughly tested.

`wait_steps`
- You can now specify `wait_steps` in `GlobalPowerLimitOptimizer`, and it will wait for the specified number of steps before profiling and optimizing. `wait_steps` is set to 1 by default because users may have `torch.backends.cudnn.benchmark = True`, and `DataLoader` workers usually need time to warm up before ramping up to their normal fetch throughput.
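The selection-policy idea can be illustrated with a minimal, GPU-free sketch. The class and field names below (`Profile`, `Selector`, `MinEnergy`, `MaxSlowdown`) are hypothetical stand-ins, not the Zeus API; they only show what "pluggable optimum selection" over per-power-limit profiling results looks like:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Profile:
    """Illustrative per-power-limit profiling result."""
    power_limit: int  # watts
    time: float       # seconds per step
    energy: float     # joules per step


class Selector(ABC):
    """Illustrative counterpart of an abstract selection policy."""
    @abstractmethod
    def select(self, profiles: list[Profile]) -> int: ...


class MinEnergy(Selector):
    """Pick the power limit that minimizes energy per step."""
    def select(self, profiles):
        return min(profiles, key=lambda p: p.energy).power_limit


class MaxSlowdown(Selector):
    """Pick the lowest-energy limit whose slowdown vs. the fastest
    profiled limit stays within a given factor."""
    def __init__(self, factor: float):
        self.factor = factor

    def select(self, profiles):
        fastest = min(p.time for p in profiles)
        ok = [p for p in profiles if p.time <= self.factor * fastest]
        return min(ok, key=lambda p: p.energy).power_limit
```

A custom policy is then just another `Selector` subclass handed to the optimizer.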
Breaking Changes
`GlobalPowerLimitOptimizer` now takes an instance of `OptimumSelector` in its constructor, instead of `eta_knob`. If you want to recover the functionality of v0.5.0, modify your code like this:

```python
# Before
plo = GlobalPowerLimitOptimizer(..., eta_knob=0.5, ...)

# After
from zeus.optimizer.power_limit import ZeusCost

plo = GlobalPowerLimitOptimizer(..., optimum_selector=ZeusCost(eta_knob=0.5), ...)
```
v0.5.0: Big refactor, `GlobalPowerLimitOptimizer`
What's New
Callback-based architecture
- `zeus.callback.Callback` is the new backbone for Zeus components.
- `GlobalPowerLimitOptimizer` is the shiny new way to online-profile and optimize the power limit of DNN training.
- `EarlyStopController` monitors and manages all sorts of conditions to determine whether training should stop.
Extensive testing
- `tests/` is richer than ever. With deep component tests and exhaustive parametrization, there are now around 1,500 test cases.
- In particular, `zeus.util.testing.ReplayZeusMonitor` exposes the same public API as `ZeusMonitor` but replays the measurement window logs produced by `ZeusMonitor`, instead of doing actual measurement. With this, Zeus can now be tested without any actual GPUs.
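The replay idea can be sketched in a few GPU-free lines. This is not the `ReplayZeusMonitor` implementation or its log format, just an illustration of substituting logged window results for live measurement:

```python
class ReplayMonitor:
    """Replays logged (window_name, elapsed_seconds, energy_joules)
    rows instead of measuring real hardware -- hypothetical sketch."""

    def __init__(self, log_rows):
        self._rows = {name: (t, e) for name, t, e in log_rows}
        self._open = set()

    def begin_window(self, name):
        self._open.add(name)

    def end_window(self, name):
        self._open.remove(name)
        return self._rows[name]  # (elapsed_seconds, energy_joules)
```

Code under test calls the same begin/end methods it would on a live monitor, so no GPU is needed in CI.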
v0.4.0: `ZeusMonitor`
What's New
- Just measuring energy with Zeus used to be non-trivial. Now, `ZeusMonitor` is the only way to measure time and energy consumed by an arbitrary set of GPUs while executing an arbitrary range of code. There should be one -- and preferably only one -- obvious way to do it.
- `ZeusDataLoader` was refactored to build around `ZeusMonitor`.
- `ZeusMonitor` is quite thoroughly tested now.
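The measurement-window idea can be shown with a minimal, GPU-free sketch. `EnergyWindow` and its `read_energy` callback are hypothetical stand-ins (`ZeusMonitor` itself reads real GPU counters); the sketch only shows bracketing an arbitrary range of code and reporting elapsed time and consumed energy:

```python
import time


class EnergyWindow:
    """Hypothetical sketch of a measurement window.

    read_energy: callable returning a cumulative energy counter in joules.
    """

    def __init__(self, read_energy):
        self._read_energy = read_energy
        self._windows = {}

    def begin_window(self, name):
        # Snapshot time and the energy counter at window start.
        self._windows[name] = (time.monotonic(), self._read_energy())

    def end_window(self, name):
        # Return (elapsed_seconds, energy_joules) for the window.
        t0, e0 = self._windows.pop(name)
        return time.monotonic() - t0, self._read_energy() - e0
```

Usage mirrors the bracketing pattern: begin a named window, run the code of interest, end the window, and read off time and energy.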
v0.3.0: `ZeusMonitorContext` for in-training-loop profiling
What's New
- `ZeusMonitorContext` allows users to profile their per-iteration energy and time consumption.
- It's aimed at those who would like to get a feel for the energy consumption of their DNN training job with a couple of additional lines (as opposed to modified lines).
- Documentation and integration example: here
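The "a couple additional lines" style can be illustrated with a hypothetical context manager (not the `ZeusMonitorContext` API; `read_energy` is an assumed stand-in for a cumulative energy counter in joules):

```python
import time
from contextlib import contextmanager


@contextmanager
def profile_iteration(read_energy, results):
    """Hypothetical sketch: wrap one training iteration and append
    (elapsed_seconds, energy_joules) to `results`."""
    t0, e0 = time.monotonic(), read_energy()
    try:
        yield
    finally:
        results.append((time.monotonic() - t0, read_energy() - e0))
```

In a training loop, the only addition is the `with profile_iteration(...):` line around the iteration body, rather than modifying the body itself.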