
Cross mini-batch gradients average #826

Open
guoxuesong opened this issue Apr 11, 2017 · 4 comments
Comments

@guoxuesong

guoxuesong commented Apr 11, 2017

GPU RAM is always a scarce resource.

Although momentum acts as a kind of cross-mini-batch gradient average, I found it is not good enough. Would you consider adding update rules that support cross-mini-batch gradient averaging?

I tried to write some code like this:

from collections import OrderedDict

import numpy as np
import theano
import theano.tensor as T
from lasagne.updates import sgd


def apply_momentum(updates, params=None, momentum=0.9, average=5):
    N = average

    if params is None:
        params = updates.keys()
    updates = OrderedDict(updates)

    for param in params:
        value = param.get_value(borrow=True)
        # running sum of the last N proposed updates
        sumval = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                               broadcastable=param.broadcastable)
        velocity = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                                 broadcastable=param.broadcastable)
        # per-parameter batch counter, cycling through 0 .. N-1
        count = theano.shared(np.zeros(value.shape, dtype='int8'),
                              broadcastable=param.broadcastable)
        updates[count] = (count + 1) % N

        avg = (sumval + updates[param]) / N
        x = momentum * velocity + avg

        # accumulate for N batches, then apply the averaged step and reset
        updates[sumval] = T.switch(T.eq(count, N - 1),
                                   T.zeros_like(sumval),
                                   sumval + updates[param])
        updates[velocity] = T.switch(T.eq(count, N - 1), x - param, velocity)
        updates[param] = T.switch(T.eq(count, N - 1), x, param)

    return updates


def momentum(loss_or_grads, params, learning_rate, momentum=0.9, average=5):
    updates = sgd(loss_or_grads, params, learning_rate)
    return apply_momentum(updates, momentum=momentum, average=average)
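The cycle this code implements (collect the steps from N small batches, average them, then take one momentum step) can be sketched in plain NumPy; the helper name below is hypothetical, not part of the Theano code above:

```python
import numpy as np

def averaged_momentum_step(param, velocity, grad_buffer, lr=0.0001, rho=0.9):
    # Average the gradients collected from N small batches, then take a
    # single momentum SGD step, mimicking one update on a batch N times
    # larger without ever holding the large batch in GPU memory.
    avg_grad = np.mean(grad_buffer, axis=0)
    velocity = rho * velocity - lr * avg_grad
    return param + velocity, velocity
```

With `rho=0` this reduces to plain SGD on the averaged gradient; the Theano version above folds the same schedule into the update dictionary via the `count` switches.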

I know the count part needs optimization; that was just quick coding so I could test the idea.

Here are the results of the test:

control nodrop minibatch=5 x100 100 iters 0.164350 95.32 lr=0.0001 momentum=0.999
control nodrop minibatch=1 x100 100 iters 0.120086 96.25 lr=0.00001 momentum=0.999
control nodrop minibatch=1 x100 100 iters 2.347241 10.10 lr=0.0001 momentum=0.999
cross nodrop minibatch=1(10) x100 100 iters 0.337935 89.74 lr=0.0001 momentum=0.999
cross nodrop minibatch=1(5) x100 100 iters 0.214818 94.15 lr=0.0001 momentum=0.999
cross nodrop minibatch=1(5) x100 500 iters 0.235046 94.52 lr=0.0001 momentum=0.999
cross nodrop minibatch=1(5) x100 100 iters 0.137648 95.73 lr=0.0001 momentum=0.995
cross nodrop minibatch=1(5) x100 500 iters 0.064822 97.97 lr=0.0001 momentum=0.995
cross nodrop minibatch=1(5) x100 100 iters 0.421069 86.62 lr=0.00001 momentum=0.999

The test uses the CNN from the Lasagne example mnist.py, with dropout removed (I may run a test with dropout enabled later). `minibatch=N` gives the mini-batch size (the default in mnist.py is 500), and `minibatch=N(M)` means `M` was used as `average` in the code above. `x100` means every epoch runs only 100 mini-batches rather than the full training set, and `100/500 iters` is the number of epochs. The two floating-point values are the test loss and test accuracy, and the rightmost value is the momentum used.

I also wrote an adam variant in the same way. It works fine, so I should not run short of GPU memory for a while.

@f0k
Member

f0k commented Apr 19, 2017

Hey, sorry for the late reply. Your implementation looks correct, except that having a separate count variable for each parameter seems wasteful -- a single np.zeros((), dtype='int8') should suffice. For integrating this into Lasagne, it would be nice to find a more modular implementation that can be combined with any of the existing update rules to make it accumulate gradients over multiple batches. As you said, this would make it possible to replicate experiments that require a larger batch size than fits in GPU memory (except when using batch normalization, which will get noisier mean/std estimates when gradients are accumulated from multiple smaller batches).

@f0k
Member

f0k commented Apr 19, 2017

Thinking again: while a modular implementation would be feasible (it would need an interface like updates = lasagne.updates.with_accumulated_gradients(loss, params, 5, lasagne.updates.nesterov, learning_rate=1e-5, momentum=0.9) to be able to inject the accumulated gradients into the call to the original update function), I guess it is more efficient to split this into two training functions: one with a simple update dictionary that accumulates gradients into buffers, and another one, without any inputs or outputs, that applies the update and resets the accumulators. You would then change your training loop to call the accumulator training function for every batch, and the application training function every Nth batch. We could provide a helper function for this in lasagne.updates as well, something like:

def accumulate_gradients(loss_or_grads, params):
    grads = get_gradients(...)  # as in the other update functions
    accumulate = OrderedDict()
    reset = OrderedDict()
    for param, grad in zip(params, grads):
        accu = theano.shared(np.zeros_like(...))  # as in some other update functions
        accumulate[accu] = accu + grad
        reset[accu] = accu * 0
    return accumulate, reset

Then you could do:

accu, reset = lasagne.updates.accumulate_gradients(loss, params)
apply = lasagne.updates.adam(accu.keys(), params, ...)
apply.update(reset)
train_fn1 = theano.function([input_var, target_var], loss, updates=accu)
train_fn2 = theano.function([], [], updates=apply)

And voilà, there's one function to call on every minibatch and another to call every now and then. (It now sums the gradients instead of averaging them, but this could be fixed as well using a counter and Welford's method.)
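The counter-plus-running-average fix mentioned above can be sketched in plain Python (a hypothetical helper, not part of Lasagne): rather than storing a raw sum and dividing at the end, the buffer always holds the mean of the gradients seen so far:

```python
import numpy as np

class RunningMeanGrad:
    """Welford-style running mean: after each update the buffer holds a
    valid average, so the apply function can use it at any time."""
    def __init__(self, shape):
        self.mean = np.zeros(shape)
        self.count = 0

    def update(self, grad):
        # mean += (g - mean) / n avoids keeping a separate sum that the
        # apply step would have to divide by the batch count.
        self.count += 1
        self.mean += (np.asarray(grad) - self.mean) / self.count
        return self.mean

    def reset(self):
        self.mean[:] = 0.0
        self.count = 0
```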

@dataintensiveapplication

Hi, I'm really interested in the same approach for adam. I tried to implement @f0k's solution, but it didn't work for me. Could you share the full adam implementation you've done?

Thank you!

@guoxuesong
Author

guoxuesong commented Jul 14, 2017

@dataintensiveapplication the following is my code; enjoy it, and happy coding.

#!/usr/bin/env python
#coding:utf-8
# vi:tabstop=4:shiftwidth=4:expandtab:sts=4

from collections import OrderedDict

import numpy as np

import theano
import theano.tensor as T
from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
from lasagne import utils
import lasagne

def get_or_compute_grads(loss_or_grads, params):
    """Helper function returning a list of gradients

    Parameters
    ----------
    loss_or_grads : symbolic expression or list of expressions
        A scalar loss expression, or a list of gradient expressions
    params : list of shared variables
        The variables to return the gradients for

    Returns
    -------
    list of expressions
        If `loss_or_grads` is a list, it is assumed to be a list of
        gradients and returned as is, unless it does not match the length
        of `params`, in which case a `ValueError` is raised.
        Otherwise, `loss_or_grads` is assumed to be a cost expression and
        the function returns `theano.grad(loss_or_grads, params)`.

    Raises
    ------
    ValueError
        If `loss_or_grads` is a list of a different length than `params`, or if
        any element of `params` is not a shared variable (while we could still
        compute its gradient, we can never update it and want to fail early).
    """
    if any(not isinstance(p, theano.compile.SharedVariable) for p in params):
        raise ValueError("params must contain shared variables only. If it "
                         "contains arbitrary parameter expressions, then "
                         "lasagne.utils.collect_shared_vars() may help you.")
    if isinstance(loss_or_grads, list):
        if not len(loss_or_grads) == len(params):
            raise ValueError("Got %d gradient expressions for %d parameters" %
                             (len(loss_or_grads), len(params)))
        return loss_or_grads
    else:
        return theano.grad(loss_or_grads, params, disconnected_inputs='warn')


def sgd(loss_or_grads, params, learning_rate, grads_clip=False):
    """Stochastic Gradient Descent (SGD) updates

    Generates update expressions of the form:

    * ``param := param - learning_rate * gradient``

    Parameters
    ----------
    loss_or_grads : symbolic expression or list of expressions
        A scalar loss expression, or a list of gradient expressions
    params : list of shared variables
        The variables to generate update expressions for
    learning_rate : float or symbolic scalar
        The learning rate controlling the size of update steps
    grads_clip : bool or float, optional
        If a positive float ``c``, clip each gradient to ``[-c, c]``;
        if True, clip to ``[-1, 1]``; if False (default), do not clip

    Returns
    -------
    OrderedDict
        A dictionary mapping each parameter to its update expression
    """
    grads = get_or_compute_grads(loss_or_grads, params)
    updates = OrderedDict()

    for param, grad in zip(params, grads):
        if isinstance(grads_clip, float) and grads_clip > 0:
            grad = T.clip(grad, utils.floatX(-grads_clip),
                          utils.floatX(grads_clip))
        elif grads_clip:
            grad = T.clip(grad, utils.floatX(-1.0), utils.floatX(1.0))
        updates[param] = param - learning_rate * grad

    return updates

def apply_momentum(updates, params=None, momentum=0.9, average=1):

    N=average

    if params is None:
        params = updates.keys()
    updates = OrderedDict(updates)

    #count = theano.shared(np.zeros((1,), dtype='int8'))
    #updates[count]=(count+1) % N
    for param in params:
        value = param.get_value(borrow=True)
        sumval = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                                 broadcastable=param.broadcastable)
        velocity = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                                 broadcastable=param.broadcastable)
        count = theano.shared(np.zeros(value.shape, dtype='int8'),
                                 broadcastable=param.broadcastable)
        updates[count] = (count + 1) % N

        avg = (sumval + updates[param]) / N

        x = momentum * velocity + avg

        updates[sumval] = T.switch(T.eq(count, N - 1),
                                   T.zeros_like(sumval),
                                   sumval + updates[param])
        updates[velocity] = T.switch(T.eq(count, N - 1), x - param, velocity)
        updates[param] = T.switch(T.eq(count, N - 1), x, param)

    return updates

def momentum(loss_or_grads, params, learning_rate, momentum=0.9, average=1, grads_clip=False):

    # pass grads_clip through, otherwise the argument would be ignored
    updates = sgd(loss_or_grads, params, learning_rate, grads_clip=grads_clip)
    return apply_momentum(updates, momentum=momentum, average=average)

def adamax(loss_or_grads, params, learning_rate=0.002, beta1=0.9,
           beta2=0.999, epsilon=1e-8, average=1, grads_clip=False, noise=False):

    N=average

    all_grads = get_or_compute_grads(loss_or_grads, params)
    t_prev = theano.shared(utils.floatX(0.))
    updates = OrderedDict()

    # Using theano constant to prevent upcasting of float32
    one = T.constant(1)

    t = t_prev + 1
    a_t = learning_rate/(one-beta1**t)

    srng = RandomStreams(lasagne.random.get_rng().randint(1, 2147462579))

    for param, g_t in zip(params, all_grads):
        value = param.get_value(borrow=True)
        if isinstance(noise, float) and noise > 0:
            g_t = g_t + srng.normal(value.shape, avg=0.0, std=noise)
        elif noise:
            g_t = g_t + srng.normal(value.shape, avg=0.0, std=1e-8)

        if isinstance(grads_clip, float) and grads_clip > 0:
            g_t = T.clip(g_t, utils.floatX(-grads_clip),
                         utils.floatX(grads_clip))
        elif grads_clip:
            g_t = T.clip(g_t, utils.floatX(-1.0), utils.floatX(1.0))
        m_prev = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                               broadcastable=param.broadcastable)
        u_prev = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                               broadcastable=param.broadcastable)
        sumval = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                                 broadcastable=param.broadcastable)
        count = theano.shared(np.zeros(value.shape, dtype='int8'),
                                 broadcastable=param.broadcastable)
        updates[count] = (count + 1) % N

        avg = (sumval + g_t) / N

        m_t = beta1 * m_prev + (one - beta1) * avg
        u_t = T.maximum(beta2 * u_prev, abs(avg))
        step = a_t * m_t / (u_t + epsilon)

        updates[sumval] = T.switch(T.eq(count, N - 1),
                                   T.zeros_like(sumval),
                                   sumval + g_t)
        updates[m_prev] = T.switch(T.eq(count, N - 1), m_t, m_prev)
        updates[u_prev] = T.switch(T.eq(count, N - 1), u_t, u_prev)
        updates[param] = T.switch(T.eq(count, N - 1), param - step, param)

    count = theano.shared(0)
    updates[count] = (count + 1) % N
    updates[t_prev] = T.switch(T.eq(count, N - 1), t, t_prev)
    return updates
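As a sanity check on the schedule this code encodes: `updates[count] = (count + 1) % N` together with the `T.eq(count, N - 1)` switch means the parameters change only on every Nth call of the compiled training function. A tiny pure-Python simulation (hypothetical helper):

```python
def update_schedule(N, num_batches):
    # Mirrors updates[count] = (count + 1) % N with count starting at 0:
    # the parameter update fires exactly when count == N - 1, i.e. on
    # the Nth, 2Nth, ... batch.
    count = 0
    fired = []
    for _ in range(num_batches):
        fired.append(count == N - 1)
        count = (count + 1) % N
    return fired
```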
