Data source module for working with mini-batches from data sets #801

Open · wants to merge 8 commits into master

Conversation

Britefury (Contributor)

The data_source module supports extracting mini-batches from data sets in a flexible manner, accommodating a range of network training schemes. It is the first step towards building a training loop that would replace the one submitted in PR #759.

For a quick overview of the effect, see the changes to examples/mnist.py. Notice that the iterate_minibatches function is gone. It is replaced by wrapping the training, validation and test sets in ArrayDataSource instances and invoking their batch_iterator methods to acquire mini-batches of data. examples/mnist_batch_map.py takes this further, replacing the training and evaluation loops with invocations of the mean_batch_map method that performs the loop and the mean in one go.
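For a rough idea of what that looks like (a sketch only; the exact argument order of mean_batch_map is an assumption, not taken from the diff):

train_ds = ArrayDataSource([X_train, y_train])
# One evaluation pass: apply the function to each mini-batch and average the results
val_loss = train_ds.mean_batch_map(val_fn, batch_size=500)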

The changes to the mnist examples don't tell the full story, as it would appear that data_source adds a fair bit of code to save only a few lines. The main value becomes more apparent when we consider further uses.

Module overview

The ArrayDataSource class wraps a data set and builds iterators that generate mini-batches:

# Data; inputs and targets
X = np.random.normal(size=(10000, 10))
y = np.random.randint(0, 10, size=(10000,))

ds = ArrayDataSource([X, y])

# Get mini-batches
for batch_X, batch_y in ds.batch_iterator(batch_size=64):
    eval_func(batch_X, batch_y)

To shuffle the samples in the data set:

for batch_X, batch_y in ds.batch_iterator(batch_size=64, shuffle_rng=lasagne.random.get_rng()):
    train_func(batch_X, batch_y)

To create an iterator that loops infinitely over the data set and shuffles samples:

loss = 0.0
for i, (batch_X, batch_y) in enumerate(ds.batch_iterator(
        batch_size=64, shuffle_rng=lasagne.random.get_rng(), circular=True)):
    loss += train_func(batch_X, batch_y)
    if (i + 1) % report == 0:
        print('Loss = {:.6f}'.format(loss / (report * 64)))
        loss = 0.0
    if i >= end:
        break

(uses of infinite looping / circular will be explored later)

ArrayDataSource is sufficiently flexible to permit the use of array-like objects. Such objects must support __len__ and __getitem__, with __getitem__ receiving an integer array of sample indices. While NumPy arrays would be the most common form of input data, other sources that read or generate data on demand are also supported.
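For illustration, an array-like source that loads samples on demand might look like this (a sketch; the class and the one-.npy-file-per-sample layout are hypothetical, not part of the module):

import numpy as np

class NpyFileSource(object):
    """Hypothetical array-like object that loads samples lazily from .npy files."""
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, indices):
        # `indices` is an integer array of sample indices chosen by the iterator
        return np.stack([np.load(self.paths[i]) for i in indices], axis=0)

# Usable wherever a NumPy array would be:
# ds = ArrayDataSource([NpyFileSource(image_paths), y])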

Should you wish to define a function that creates the batch iterator, use CallableDataSource. If you want to wrap an iterator, use IteratorDataSource.
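Roughly (a sketch based on the descriptions above; the exact constructor and callback signatures are assumptions):

# CallableDataSource: wrap a function that builds a fresh batch iterator on demand
def make_batches(batch_size, **kwargs):
    for start in range(0, len(X), batch_size):
        yield [X[start:start + batch_size], y[start:start + batch_size]]

callable_ds = CallableDataSource(make_batches)

# IteratorDataSource: wrap an iterator you already have
iterator_ds = IteratorDataSource(make_batches(batch_size=64))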

CompositeDataSource allows you to combine the above data sources in flexible ways that support a variety of training schemes as required by different models. Let's take a GAN as an example. Typically, we want to iterate over our real data set X, extracting different batches for training the discriminator and the generator. To extract two different batches simultaneously:

disc_ds = ArrayDataSource([X])
gen_ds = ArrayDataSource([X])
gan_ds = CompositeDataSource([disc_ds, gen_ds])

for batch_X_disc, batch_X_gen in gan_ds.batch_iterator(
        batch_size=64, shuffle_rng=lasagne.random.get_rng()):
    train_disc(batch_X_disc)
    train_gen(batch_X_gen)

While the same random number generator will be used to shuffle both disc_ds and gen_ds, each instance will use a different permutation.

A more complex model would be the one presented in Improved Techniques for Training GANs by Salimans et al., in which they use GANs for semi-supervised learning. In addition to using different random orders for training the discriminator and generator as above, they also train the discriminator with labeled samples. Since the labeled data set is much smaller than the unlabeled data set, they loop over the labeled data set repeatedly to make up a complete epoch of the unlabeled data set. We can implement this like so:

lab_ds = ArrayDataSource([X_labeled, y_labeled])
disc_ds = ArrayDataSource([X_unlabeled])
gen_ds = ArrayDataSource([X_unlabeled])
imp_gan_ds = CompositeDataSource([lab_ds.with_params(circular=True), disc_ds, gen_ds])

for b_lab_X, b_lab_y, b_unlab_X_disc, b_unlab_X_gen in imp_gan_ds.batch_iterator(
        batch_size=64, shuffle_rng=lasagne.random.get_rng()):
    train_disc(b_lab_X, b_lab_y, b_unlab_X_disc)
    train_gen(b_unlab_X_gen)

The invocation of with_params(circular=True) applies the circular option to the labeled data source lab_ds, causing the labeled samples to be repeated. They will be shuffled in a different order on each repetition. The unlabeled samples will be shuffled differently for the discriminator and the generator.

Future work

The batch_map and mean_batch_map methods have a batch_limit parameter that allows you to limit the number of batches that will be processed. I think it might be a good idea to replace this with a sample_limit parameter that specifies the limit in terms of the number of samples instead of the number of batches. How to handle situations where the number of samples is not divisible by the batch size would need to be figured out.

An additional class could add data augmentation into the pipeline; I think that this would make sense.

Conclusion

I hope that you consider this module to be useful and sufficient for inclusion in Lasagne. It is my intention to use it as a foundation for building a replacement for the training loop in PR #759.

…he definition of data sources from which batches of samples can be extracted.
…_source.ArrayDataSource`, simplifying code.

Implemented new `mnist_batch_map` example that uses `mean_batch_map` method of `data_source.ArrayDataSource` to further replace training and evaluation loops.
Britefury mentioned this pull request on Feb 16, 2017
f0k (Member) commented on Feb 17, 2017

Hey, looks like a good start, thank you for the nice documentation! Sorry for being so unresponsive about the training loop, this will take time to think through which I currently don't have :( I'm sad enough that we still didn't manage to finish release 0.2 (help wanted!), which we should before the end of the month so we can start development on 0.3 by switching to the new GPU backend before Theano drops it.

Some first comments:

  • Distinguishing ArrayDataSource, CallableDataSource and IteratorDataSource seems very reasonable.
  • The interface seems good as well, although I guess we should support just shuffle=True for users who don't care. We could still have shuffle accept a RNG to minimize the number of keyword arguments. And instead of circular I'd have a slight preference for forever or epochs=-1 (where the latter would allow specifying the number of repetitions you want). circular sounds like it would repeat the same permutation over and over. And finally, for some purposes it'd be good to return the remainder batch (the last len(datapoints) % batchsize items) instead of dropping it; there should be an option for that.
  • I'm a bit puzzled about CompositeDataSource. It makes sense to have it so that you can combine multiple callables or arrays, but I'm not sure about the shuffling behaviour pulling different items from the different sources, instead of the same items. If you want that, itertools.izip on the generators would be the more obvious option. (And in your case, where both sources are the same, you could just double the batchsize and feed the two halves to the two training functions.) In short, I would expect CompositeDataSource(ArrayDataSource([x]), ArrayDataSource([y])) to be equivalent to ArrayDataSource([x, y]) (see the sketch after this list).
  • with_params: Hmm, this raises some additional questions. What if you pass circular=False to the outer batch_iterator call? Will this overwrite what you gave to with_params? I find it a dangerous path to mix the dataset representations (DataSource) with the iterators over them. If we want to combine iterators, we should just use itertools.izip, or instantiate the iterators before the loop and call next() on them whenever we need something. Combining datasets (as in CompositeDataSource) should not influence iteration.
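To make the expectation in the CompositeDataSource point concrete, a sketch of the expected (not current) behaviour, using the list-based constructor from the examples above:

# Expectation: both of these would yield matching (x, y) batches, i.e. the
# composite would pull the *same* shuffled indices from both wrapped sources.
composite = CompositeDataSource([ArrayDataSource([x]), ArrayDataSource([y])])
combined = ArrayDataSource([x, y])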

I think it might be a good idea to replace this with a sample_limit parameter

Hmm, why?

An additional class to add data augmentation into the pipeline. I think that this would make sense.

I'm unsure how much of this belongs into Lasagne. I had long been planning to tidy up my data iteration module and release it (as a separate project). It would be good to have something in Lasagne to reduce the boilerplate needed to run an experiment, because not every user plans to write their own framework, but it cannot be as full-fledged as a library dedicated to data iteration. It should just provide the means to either train from simple array-based sources or plug in whatever more complex data iteration code you have.

print(" validation accuracy:\t\t{:.2f} %".format(
val_acc / val_batches * 100))
print(" training loss:\t\t{:.6f}".format(
train_loss / X_train.shape[0]))
f0k (Member)
This is only correct if len(X_train) % batchsize == 0. Dividing the sum of batch averages by the batch count is safer.
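In code, the suggested form would be roughly the following (a sketch, not part of the actual diff; train_ds, train_fn and batchsize stand in for the names used in the example):

# Accumulate per-batch mean losses and divide by the batch count; this stays
# correct even when the last batch is smaller than batchsize.
train_loss = 0.0
train_batches = 0
for batch_X, batch_y in train_ds.batch_iterator(batch_size=batchsize):
    train_loss += train_fn(batch_X, batch_y)  # train_fn assumed to return the batch mean loss
    train_batches += 1
print(" training loss:\t\t{:.6f}".format(train_loss / train_batches))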

Britefury (Contributor, Author)
Given that the remainder batch will always be returned, this should be fine.

Britefury (Contributor, Author)

The interface seems good as well, although I guess we should support just shuffle=True for users who don't care. We could still have shuffle accept a RNG to minimize the number of keyword arguments.

Done. shuffle now accepts None/False for in-order iteration; to shuffle, pass a RandomState, or pass True to use Lasagne's default RNG.
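For example (a sketch of the updated keyword, using ds and np as in the examples above):

# In-order iteration
ds.batch_iterator(batch_size=64, shuffle=None)
# Shuffle using Lasagne's default RNG
ds.batch_iterator(batch_size=64, shuffle=True)
# Shuffle using a specific RandomState
ds.batch_iterator(batch_size=64, shuffle=np.random.RandomState(12345))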

And instead of circular I'd have a slight preference for forever or epochs=-1 (where the latter would allow specifying the number of repetitions you want). circular sounds like it would repeat the same permutation over and over.

Done. circular has been replaced with epochs, which specifies the number of repetitions, or -1 to repeat indefinitely.
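A sketch of the replacement keyword (again using ds from above):

# Two shuffled passes over the data set
ds.batch_iterator(batch_size=64, shuffle=True, epochs=2)
# Loop over the data set indefinitely
ds.batch_iterator(batch_size=64, shuffle=True, epochs=-1)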

And finally, for some purposes it'd be good to return the remainder batch (the last len(datapoints) % batchsize items ) instead of dropping it, there should be an option for that.

The remainder batch is always returned; I have updated the ArrayDataSource to make this more obvious. Should there be an option to discard it?

I'm a bit puzzled about CompositeDataSource. It makes sense to have it so that you can combine multiple callables or arrays, but I'm not sure about the shuffling behaviour pulling different items from the different sources, instead of the same items. If you want that, itertools.izip on the generators would be the more obvious option.

CompositeDataSource does use itertools.izip underneath. The main benefit is that the batch_map and mean_batch_map methods are inherited from AbstractDataSource, making it nice and easy to iterate-map-reduce :). You could use itertools.izip and the batch_map function to achieve the same effect, so CompositeDataSource is mainly syntactic/API sugar. Would some improved documentation and examples help?
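For reference, the izip-based version of the GAN example would look roughly like this (a sketch; it assumes a single-array ArrayDataSource yields length-1 batch lists, as in the examples above):

from itertools import izip  # on Python 3, use the built-in zip instead

disc_iter = disc_ds.batch_iterator(batch_size=64, shuffle=True)
gen_iter = gen_ds.batch_iterator(batch_size=64, shuffle=True)

for (batch_X_disc,), (batch_X_gen,) in izip(disc_iter, gen_iter):
    train_disc(batch_X_disc)
    train_gen(batch_X_gen)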

(And in your case, where both sources are the same, you could just double the batchsize and feed the two halves to the two training functions.)

Yes, if both sources were repeated infinitely. If you wanted a complete epoch for each, but with different permutations, and you cared that each side should see each example once and only once, then permuting separately would be desirable.

In short, I would expect CompositeDataSource(ArrayDataSource([x]), ArrayDataSource([y])) to be equivalent to ArrayDataSource([x, y]).

Hmmm... I see where you are coming from. Would there be a way of structuring the API or documenting the system to explain this better? I guess my intention is that an ArrayDataSource will iterate through the elements 'in sync'. I also want to provide a way of iterating over different sections 'out of sync' (this has to happen when different parts of the data set have different lengths, such as in semi-supervised learning scenarios).

with_params: Hmm, this raises some additional questions. What if you pass circular=False to the outer batch_iterator call? Will this overwrite what you gave to with_params? I find it a dangerous path to mix the dataset representations (DataSource) with the iterators over them

with_params overrides keyword arguments. As for the risks, I can see where you are coming from. Another option would be to pass settings such as epochs=-1 to the constructor of ArrayDataSource and store them as attributes. Thinking about it, this looks like a good option.

I think it might be a good idea to replace this with a sample_limit parameter

Hmm, why?

Sometimes I have to change my batch size due to memory constraints, etc. It may be more natural to specify such limits in terms of the number of samples rather than the number of batches, since the latter depends on the batch size.

An additional class to add data augmentation into the pipeline. I think that this would make sense.

I'm unsure how much of this belongs into Lasagne. I had long been planning to tidy up my data iteration module and release it (as a separate project). It would be good to have something in Lasagne to reduce the boilerplate needed to run an experiment, because not every user plans to write their own framework, but it cannot be as full-fledged as a library dedicated to data iteration. It should just provide the means to either train from simple array-based sources or plug in whatever more complex data iteration code you have.

I was thinking of something very simple, along the lines of:

ads = ArrayDataSource([X, y])
augmented = ads.map(users_augmentation_function)

The user would still have to implement the augmentation themselves; we could provide them with a simple hook. The documentation could provide a basic data augmentation example. The idea is more to inform the user that 'this is an easy way to add data augmentation into your pipeline' than to provide them with a complete implementation.
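For instance, users_augmentation_function could be as simple as the following (a hypothetical example, assuming image batches with the width on the last axis and np as above):

def users_augmentation_function(batch_X, batch_y):
    # Randomly flip roughly half of the images horizontally
    flip = np.random.binomial(1, 0.5, size=len(batch_X)).astype(bool)
    batch_X = batch_X.copy()
    batch_X[flip] = batch_X[flip][..., ::-1]
    return [batch_X, batch_y]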

…ayDataSource` to the constructor.

Removed `ApplyParamsDataSource` and `with_params` method of `AbstractDataSource` as they are now redundant.
Britefury (Contributor, Author) commented on Feb 20, 2017

@f0k I've moved the epochs parameter from the num_samples and batch_iterator methods of ArrayDataSource to its constructor. This eliminated the only use case for the ApplyParamsDataSource class and the with_params method, so they have been removed. This should address the API pitfalls.
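With that change, the semi-supervised GAN example above would become roughly (a sketch of the revised interface):

lab_ds = ArrayDataSource([X_labeled, y_labeled], epochs=-1)  # repeat the small labeled set
disc_ds = ArrayDataSource([X_unlabeled])
gen_ds = ArrayDataSource([X_unlabeled])
imp_gan_ds = CompositeDataSource([lab_ds, disc_ds, gen_ds])

for b_lab_X, b_lab_y, b_unlab_X_disc, b_unlab_X_gen in imp_gan_ds.batch_iterator(
        batch_size=64, shuffle=True):
    train_disc(b_lab_X, b_lab_y, b_unlab_X_disc)
    train_gen(b_unlab_X_gen)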

Britefury (Contributor, Author)

@f0k Is there anything more that you'd like me to do with this?
