Data source module for working with mini-batches from data sets #801
base: master
Conversation
…he definition of data sources from which batches of samples can be extracted.
…_source.ArrayDataSource`, simplifying code. Implemented new `mnist_batch_map` example that uses `mean_batch_map` method of `data_source.ArrayDataSource` to further replace training and evaluation loops.
Hey, looks like a good start, thank you for the nice documentation! Sorry for being so unresponsive about the training loop, this will take time to think through which I currently don't have :( I'm sad enough that we still didn't manage to finish release 0.2 (help wanted!), which we should before the end of the month so we can start development on 0.3 by switching to the new GPU backend before Theano drops it. Some first comments:
Hmm, why?
I'm unsure how much of this belongs in Lasagne. I had long been planning to tidy up my data iteration module and release it (as a separate project). It would be good to have something in Lasagne to reduce the boilerplate needed to run an experiment, because not every user plans to write their own framework, but it cannot be as full-fledged as a library dedicated to data iteration. It should just provide the means to either train from simple array-based sources or plug in whatever more complex data iteration code you have.
print(" validation accuracy:\t\t{:.2f} %".format( | ||
val_acc / val_batches * 100)) | ||
print(" training loss:\t\t{:.6f}".format( | ||
train_loss / X_train.shape[0])) |
This is only correct if `len(X_train) % batchsize == 0`. Dividing the sum of batch averages by the batch count is safer.
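For illustration, a minimal sketch of the averaging suggested here; the `batch_iterator` call and its keyword names are assumptions based on this PR's description, not the exact code in the diff, and `train_ds`, `train_fn`, `batchsize` and `rng` are placeholders from the surrounding mnist example:

```python
# Sketch only: accumulate per-batch mean losses and divide by the batch count,
# so the result stays correct even when the last batch is smaller.
train_loss = 0.0
train_batches = 0
for X_batch, y_batch in train_ds.batch_iterator(batchsize, shuffle=rng):  # assumed signature
    train_loss += train_fn(X_batch, y_batch)  # train_fn returns the batch-mean loss
    train_batches += 1
print("  training loss:\t\t{:.6f}".format(train_loss / train_batches))
```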
Given that the remainder batch will always be returned, this should be fine.
Done.
Done.
The remainder batch is always returned; I have updated the
Yes... if both sources were repeated infinitely. If you wanted a complete epoch for each, but with different permutations, and you cared enough that each side should see each example once and only once, then permuting separately would be desirable.
Hmmm... I see where you are coming from. Would there be a way of structuring the API or documenting the system to explain this better? I guess my intention is that an
Sometimes I have to change my batch size due to memory constraints, etc. It may be more natural to specify such limits in terms of the number of samples rather than the number of batches, which depends on the batch size.
I was thinking of something very simple, along the lines of:

ads = ArrayDataSource([X, y])
augmented = ads.map(users_augmentation_function)

The user would still have to implement the augmentation themselves; we could provide them with a simple hook. The documentation could provide a basic data augmentation example (see the sketch below). More something to inform the user 'this is an easy way to add data augmentation into your pipeline' rather than to provide them with a complete implementation.
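A hedged sketch of what such a hook might look like, assuming `map` applies the user's function to each batch of arrays; the flip-based augmentation and the array layout are purely illustrative and not part of this PR:

```python
import numpy as np

def users_augmentation_function(X_batch, y_batch):
    # Illustrative only: flip roughly half of the images horizontally.
    # Assumes X_batch has shape (batch, channels, rows, cols).
    flip = np.random.rand(len(X_batch)) < 0.5
    X_batch = X_batch.copy()
    X_batch[flip] = X_batch[flip, :, :, ::-1]
    return X_batch, y_batch

ads = ArrayDataSource([X, y])
augmented = ads.map(users_augmentation_function)
```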
…ayDataSource` to the constructor. Removed `ApplyParamsDataSource` and `with_params` method of `AbstractDataSource` as they are now redundant.
@f0k I've moved the

@f0k Is there anything more that you'd like me to do with this?
The `data_source` module supports extracting mini-batches from data sets in a flexible manner, supporting a range of network training schemes. It is the first part in building towards a training loop that would replace the one submitted in PR #759.

For a quick overview of the effect, see the changes to `examples/mnist.py`. Notice that the `iterate_minibatches` function is gone. It is replaced by wrapping the training, validation and test sets in `ArrayDataSource` instances and invoking their `batch_iterator` methods to acquire mini-batches of data. `examples/mnist_batch_map.py` takes this further, replacing the training and evaluation loops with invocations of the `mean_batch_map` method that performs the loop and the mean in one go (see the sketch below).
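A rough sketch of the kind of change involved; the exact `batch_iterator` and `mean_batch_map` signatures here are assumptions based on the description rather than the code in this PR, and `train_fn`, `val_fn`, the arrays and `num_epochs` come from the mnist example:

```python
import numpy as np

# Wrap the arrays once, instead of hand-writing iterate_minibatches.
train_ds = ArrayDataSource([X_train, y_train])
val_ds = ArrayDataSource([X_val, y_val])

for epoch in range(num_epochs):
    rng = np.random.RandomState(epoch)
    # Training loop driven by the data source (assumed keyword names).
    for X_batch, y_batch in train_ds.batch_iterator(batchsize=500, shuffle=rng):
        train_fn(X_batch, y_batch)

    # mnist_batch_map-style evaluation: loop and mean in one call
    # (assumed signature and return layout).
    val_err, val_acc = val_ds.mean_batch_map(val_fn, batchsize=500)
```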
The changes to the `mnist` examples don't tell the full story, as it would appear that `data_source` adds a fair bit of code to save only a few lines. The main value becomes more apparent when we consider further uses.

Module overview
The `ArrayDataSource` wraps a data set and builds iterators that generate mini-batches. It can also shuffle the samples in the data set, or build an iterator that loops infinitely over the data set while shuffling samples; uses of infinite looping / circular iteration will be explored later. The sketch below illustrates these three uses.
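A minimal sketch of these three uses, assuming `batch_iterator` takes a batch size plus `shuffle` and `circular` keyword arguments; the exact argument names are guesses based on this description, not the module's actual signatures:

```python
import numpy as np

ds = ArrayDataSource([X_train, y_train])

# 1. Plain in-order mini-batches.
for X_batch, y_batch in ds.batch_iterator(batchsize=128):
    ...

# 2. Shuffled mini-batches, driven by a seeded RNG.
rng = np.random.RandomState(12345)
for X_batch, y_batch in ds.batch_iterator(batchsize=128, shuffle=rng):
    ...

# 3. An infinitely repeating, shuffled iterator (assumed `circular` flag).
infinite = ds.batch_iterator(batchsize=128, shuffle=rng, circular=True)
X_batch, y_batch = next(infinite)
```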
`ArrayDataSource` is sufficiently flexible to permit the use of array-like objects. Such objects must support `__len__` and `__getitem__`, with `__getitem__` receiving an integer array of sample indices. While NumPy arrays would be the most common form of input data, other sources that read or generate data on demand are also supported (see the sketch below).

Should you wish to define a function that creates the batch iterator, use `CallableDataSource`. If you want to wrap an iterator, use `IteratorDataSource`.
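As an illustration of the array-like protocol described above, a hypothetical on-demand source might look like this; the class, the `load_image` helper and `image_paths` are illustrative placeholders, with only the `__len__`/`__getitem__` protocol taken from the description:

```python
import numpy as np

class OnDemandImages(object):
    """Array-like source that loads samples only when they are indexed."""

    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, indices):
        # `indices` is an integer array of sample indices for one mini-batch.
        return np.stack([load_image(self.paths[i]) for i in indices])

ds = ArrayDataSource([OnDemandImages(image_paths), y])
```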
`CompositeDataSource` allows you to combine the above data sources in flexible ways that support a variety of training schemes as required by different models. Let's take a GAN as an example. Typically, we want to iterate over our real data set `X`, extracting different batches for training the discriminator and the generator, so we need to extract two different batches simultaneously (see the sketch below). While the same random number generator will be used to shuffle both `disc_ds` and `gen_ds`, different permutations will be used by each instance.
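A hedged sketch of what this could look like; the way the child sources are combined and the layout of the batches yielded by `batch_iterator` are assumptions, with only the names `disc_ds` and `gen_ds` taken from the description:

```python
import numpy as np

rng = np.random.RandomState(12345)

# Two views of the same real data set; each gets its own permutation.
disc_ds = ArrayDataSource([X])
gen_ds = ArrayDataSource([X])
gan_ds = CompositeDataSource([disc_ds, gen_ds])

for batch in gan_ds.batch_iterator(batchsize=64, shuffle=rng):
    # Assumed layout: one batch of real samples per child source.
    X_for_disc, X_for_gen = batch
    ...  # discriminator step uses X_for_disc, generator step uses X_for_gen
```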
A more complex model would be the one presented in "Improved Techniques for Training GANs" by Salimans et al., in which they use GANs for semi-supervised learning. In addition to using different random orders for training the discriminator and generator as above, they also train the discriminator with labeled samples. Given that the labeled data set is much smaller than the unlabeled data set, they loop over the labeled data set repeatedly to make up a complete epoch of the unlabeled data set. We can implement this along the lines of the sketch below.
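A hedged sketch under the same caveats as above: the constructor arguments, the batch layout and the seeding are assumptions, while `lab_ds` and `with_params(circular=True)` come from the description:

```python
import numpy as np

rng = np.random.RandomState(12345)

# Small labelled set plus two views of the large unlabelled set.
lab_ds = ArrayDataSource([X_lab, y_lab])
unlab_disc_ds = ArrayDataSource([X_unlab])
unlab_gen_ds = ArrayDataSource([X_unlab])

# The labelled source is repeated (circular) so it can cover a full epoch of
# the unlabelled data.
semi_ds = CompositeDataSource([lab_ds.with_params(circular=True),
                               unlab_disc_ds, unlab_gen_ds])

for batch in semi_ds.batch_iterator(batchsize=128, shuffle=rng):
    # Assumed layout: labelled arrays first, then one unlabelled batch for the
    # discriminator and one for the generator.
    X_l, y_l, X_u_disc, X_u_gen = batch
    ...
```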
The invocation of `with_params(circular=True)` applies the `circular` parameter to the labeled data source `lab_ds`, causing the labeled samples to be repeated. They will be shuffled in a different order on each repetition. The unlabeled samples will be differently shuffled for the discriminator and the generator.

Future work
The `batch_map` and `mean_batch_map` methods have a `batch_limit` parameter that allows you to limit the number of batches that will be processed. I think it might be a good idea to replace this with a `sample_limit` parameter that specifies the limit in terms of the number of samples instead of the number of batches. How to handle situations where the number of samples is not divisible by the batch size would need to be figured out.

An additional class to add data augmentation into the pipeline could also be provided; I think that this would make sense.
Conclusion
I hope that you consider this module to be useful and sufficient for inclusion in Lasagne. It is my intention to use it as a foundation for building a replacement for the training loop in PR #759.