
Feature alpha dropout #855

Open · wants to merge 8 commits into master

Conversation

rfratila

This is my first PR to Lasagne, so I'm definitely looking for suggestions. To complement SELU activation, I've made a new AlphaDropoutLayer as specified in the paper: https://arxiv.org/abs/1706.02515

@rfratila
Author

Due to the similarities and common code with DropoutLayer, should it extend from DropoutLayer and override get_output_for()?

Member

@f0k left a comment

Welcome, and thank you for the PR! It seems a good strategy to attract contributors is to wait for an interesting paper to come out and not implement it right away ;)

I've got some detailed comments, and a general note on the implementation. You're right that it shares a lot of code with the DropoutLayer. Making AlphaDropoutLayer inherit from DropoutLayer will not immediately reduce the duplication, though; we'd need some refactoring. Your current implementation requires the mask to be known, so we cannot simply call the super get_output_for(), we'd still have to implement it ourselves. It would be possible to formulate alpha dropout in terms of the original dropout by shifting the input, calling the super get_output_for(), and then shifting and scaling the output, but Theano will not be able to optimize this.

So I'd suggest the following:

  1. Create a read-only property self.q that encapsulates the conversion from p.
  2. In DropoutLayer, create a method apply_dropout(self, input, const=0) that rescales the input, creates the mask, multiplies the input with it, and adds const * (1 - mask) if const != 0. This can be called from the DropoutLayer (with const=0) and the AlphaDropoutLayer (with const=self.alpha).
  3. Make AlphaDropoutLayer inherit from DropoutLayer and override get_output_for(). It will have minimal code duplication (a rough sketch follows below).

A nuisance is that a separate AlphaDropoutLayer means the spatial_dropout and dropout_locations convenience functions would have to be duplicated. But merging everything into a single DropoutLayer implementation would be ugly as well, as AlphaDropoutLayer with alpha=0 is still not the same as DropoutLayer, so we'd need at least two additional parameters for the DropoutLayer if we were to unify the implementations.
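
For illustration, here is a rough sketch of what that refactoring could look like. It is not the final implementation: the shared_axes handling, broadcasting, and docstrings of the real DropoutLayer are omitted, and the constructor details (including whether alpha dropout should disable the 1/q rescaling) are assumptions.

import theano.tensor as T
from theano.sandbox.rng_mrg import MRG_RandomStreams as RandomStreams
from lasagne.layers import Layer


class DropoutLayer(Layer):
    def __init__(self, incoming, p=0.5, rescale=True, **kwargs):
        super(DropoutLayer, self).__init__(incoming, **kwargs)
        self._srng = RandomStreams()
        self.p = p
        self.rescale = rescale

    @property
    def q(self):
        # retain probability, always derived from self.p (suggestion 1)
        return T.constant(1) - self.p

    def apply_dropout(self, input, const=0):
        # optionally rescale, sample a binary mask, zero out dropped units,
        # and optionally replace them with `const` instead of zero (suggestion 2)
        if self.rescale:
            input /= self.q
        mask = self._srng.binomial(input.shape, p=self.q, dtype=input.dtype)
        if const == 0:
            return input * mask
        return input * mask + const * (T.constant(1) - mask)

    def get_output_for(self, input, deterministic=False, **kwargs):
        if deterministic or self.p == 0:
            return input
        return self.apply_dropout(input)


class AlphaDropoutLayer(DropoutLayer):
    # suggestion 3: inherit from DropoutLayer and only override get_output_for()
    def __init__(self, incoming, p=0.5, alpha=None, **kwargs):
        # rescale=False is an assumption: the affine correction in
        # get_output_for() restores zero mean and unit variance instead
        super(AlphaDropoutLayer, self).__init__(incoming, p=p, rescale=False,
                                                **kwargs)
        if alpha is None:
            alpha = (-1.0507009873554804934193349852946 *
                     1.6732632423543772848170429916717)
        self.alpha = alpha

    def get_output_for(self, input, deterministic=False, **kwargs):
        if deterministic or self.p == 0:
            return input
        dropped = self.apply_dropout(input, const=self.alpha)
        a = T.pow(self.q + T.square(self.alpha) * self.q * self.p, -0.5)
        b = -a * self.p * self.alpha
        return a * dropped + b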

Sets values to zero with probability p. Will also converge the remaining
neurons to mean 0 and a variance of 1 unit. See notes for disabling dropout
during testing.
Parameters
Member

Missing blank line before the Parameters heading.

``False`` or not specified, dropout (and possibly scaling) is enabled.
Usually, you would use ``deterministic=False`` at train time and
``deterministic=True`` at test time.
References
Member

Missing blank line.

class AlphaDropoutLayer(Layer):
"""Dropout layer.

Sets values to zero with probability p. Will also converge the remaining
Member

Actually it sets values to alpha.

super(AlphaDropoutLayer, self).__init__(incoming, **kwargs)
self._srng = RandomStreams()
self.p = p
self.alpha = -1.7580993408473766
Member

I think alpha should be a parameter for the layer.

Member

Oh, and we should use all available digits, because we can. And maybe call it differently, since it's called alpha prime in the paper, computed as minus lambda times alpha? Not sure.

The default value would be -1.0507009873554804934193349852946 * 1.6732632423543772848170429916717.
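
Written out, with illustrative variable names mirroring the attributes used later in this thread (scale for the paper's lambda, scale_neg for the paper's alpha):

scale = 1.0507009873554804934193349852946      # lambda in the paper
scale_neg = 1.6732632423543772848170429916717  # alpha in the paper
alpha_prime = -scale * scale_neg               # ≈ -1.7580993408473766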

Author

Good idea, I'll default it to that and create an additional parameter. The difference between the calculated alpha and what I have is just floating-point precision.

mask_shape = tuple(1 if a in shared_axes else s
                   for a, s in enumerate(mask_shape))

mask = self._srng.uniform(mask_shape, dtype=input.dtype) < q
Member

This will result in a binary mask. Why not use binomial as in the original dropout layer?
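
For comparison, the mask in the existing DropoutLayer is sampled with binomial, roughly along these lines (q being the retain probability):

mask = self._srng.binomial(mask_shape, p=q, dtype=input.dtype)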

@@ -21,3 +21,6 @@ Noise layers
.. autoclass:: GaussianNoiseLayer
    :members:

.. autoclass:: AlphaDropoutLayer
    :members:

Member

I'd place it after the dropout layers.

@@ -14,6 +14,7 @@
    "spatial_dropout",
    "dropout_locations",
    "GaussianNoiseLayer",
    "AlphaDropoutLayer",
Member

@f0k Jun 28, 2017

I'd place it after the dropout layers.

bcast = tuple(bool(s == 1) for s in mask_shape)
mask = T.patternbroadcast(mask, bcast)

a = T.pow(q + self.alpha ** 2 * q * (1 - q), -0.5)
Member

Should use p, not 1 - q, otherwise it might result in an unwanted upcast again. Should also use T.square instead of ** 2 to make the graph smaller.


a = T.pow(q + self.alpha ** 2 * q * (1 - q), -0.5)

b = -a * (1 - q) * self.alpha
Member

Again, should use p, not 1-q.
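
Putting both points together (using p for the drop probability, assuming it is available as self.p in this scope, and T.square to keep the graph smaller), the two expressions would become something like:

a = T.pow(q + T.square(self.alpha) * q * self.p, -0.5)
b = -a * self.p * self.alpha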

@rfratila
Author

I apologize for all of the commits, @f0k. I'm having some trouble squashing them together. Any help with this would be very much appreciated.

Thanks

@f0k
Member

f0k commented Jun 30, 2017

Thank you for continuing, and sorry for the delay!

I'm having some trouble squashing them together. Any help with this would be very much appreciated.

I guess the easiest way to get rid of all the merge commits in between will be to squash everything into a single commit for now. In the end I'd like two commits -- one refactoring the dropout layer and one introducing alpha dropout -- but we can take care of that later. To squash, do:

git reset --soft edbbc12  # set pointer to the first commit of the PR, leave other changes staged
git commit --amend --no-edit  # tack all other changes to the first commit
git push --force  # overwrite the current PR with the squashed version

I have made a local backup of the current state of your PR in case anything goes wrong.

I will review the code afterwards, as I don't know if the reviews will stick after squashing.

@rfratila
Author

rfratila commented Jun 30, 2017

Thanks for the assist! I think everything is still working as I left it. Should be ready for another round of reviews, @f0k.

@rfratila
Author

@f0k Have you had a chance to look at the changes?

**kwargs)
if alpha is None:
    self.alpha = -1.0507009873554804934193349852946 * \
        1.6732632423543772848170429916717

Would it make sense to mention in a comment that these two values are fixed-point solutions to equations (4) and (5) in [1] for zero mean and unit variance input, and that the analytic expressions for them are given in equation (14), also in [1]?

Furthermore, this value for alpha should be consistent with a SELU input layer. Would a possible option be to pass a SELU instance for the parameter alpha and take the value from there, e.g.:

elif isinstance(alpha, SELU):
    self.alpha = - alpha.scale * alpha.scale_neg

?

Author

You bring up good points. I will update the documentation accordingly. Regarding the option to pass a SELU instance and calculate alpha from it: since AlphaDropout is meant to work with SELU, it makes sense for it to be able to accept a SELU instance directly. I will make this change.

@f0k
Member

f0k commented Jul 31, 2017

Sorry, had to finish my thesis first. You're including the commits of the SELU PR in your PR; can you please rebase onto master instead? It's more difficult to review otherwise. Cheers!

git fetch upstream master
git rebase upstream/master
git push --force  # assuming that the rebase went through without conflicts, otherwise git rebase --abort and let me know

@rfratila
Author

No worries and congrats! I rebased onto master, resolved a few conflicts and pushed. Everything is as it was.

@f0k
Member

f0k commented Aug 1, 2017

Oops, sorry, introduced a conflict by merging #871. Will review and can then rebase your PR onto the new master myself, resolving the conflict. Thank you for rebasing onto yesterday's master!

Member

@f0k left a comment

Thank you for rebasing, much easier to review now! There are still a couple of issues with the code and the documentation. The latter will need some thinking to explain well what the layer does.

def test_alpha(self, input_layer):
    input = theano.shared(numpy.ones((100, 100)))
    from lasagne.layers.noise import AlphaDropoutLayer
    layer = AlphaDropoutLayer(input_layer, alpha=-1.7580993408473766)
Member

But that's just the default value for alpha. It should test another value, otherwise you never know whether that value was actually respected or it took the default anyway :)

input = theano.shared(numpy.ones((100, 100)))
from lasagne.layers.noise import AlphaDropoutLayer
from lasagne.nonlinearities import selu # SELU instance
layer = AlphaDropoutLayer(input_layer, alpha=selu)
Member

Same here, it should be a SELU instance with nonstandard scales. You can then test whether self.alpha was set correctly, in addition to checking the output.
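
A rough sketch of what such tests could look like, written as methods of the existing test class. The function names and the SELU constructor arguments (scale, scale_neg) are assumptions based on the snippets in this thread, not the final test code:

def test_alpha_nondefault(self, input_layer):
    # use a value that differs from the default alpha' so the test would
    # fail if the argument were silently ignored
    from lasagne.layers.noise import AlphaDropoutLayer
    layer = AlphaDropoutLayer(input_layer, alpha=-2.0)
    assert layer.alpha == -2.0

def test_alpha_from_selu(self, input_layer):
    # pass a SELU instance with nonstandard scales and check that alpha
    # is derived as -scale * scale_neg, in addition to checking the output
    from lasagne.layers.noise import AlphaDropoutLayer
    from lasagne.nonlinearities import SELU
    selu = SELU(scale=2.0, scale_neg=3.0)
    layer = AlphaDropoutLayer(input_layer, alpha=selu)
    assert layer.alpha == -(2.0 * 3.0)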


self.q = one - self.p
if self.rescale:
    input /= self.q
Member

self.q should not be set here, it should become a property of the layer (i.e., not an attribute):

@property
def q(self):
    return T.constant(1) - self.p

Then you never need to set self.q and can use it in apply_dropout() and in get_output_for().

mask = T.patternbroadcast(mask, bcast)

if const != 0:
    return (input * mask) + (const * (1 - mask))
Member

This has to use T.constant(1) instead of 1 again.
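
i.e., something like:

return (input * mask) + (const * (T.constant(1) - mask))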

"""Dropout layer.

Sets values to alpha. Will also converge the remaining
neurons to mean 0 and a variance of 1 unit. See notes for disabling dropout
Member

Sorry, this explanation is a bit imprecise. Can you be more explicit what the layer does? (If you can't find anything, I'll try that myself.)

incoming : a :class:`Layer` instance or a tuple
the layer feeding into this layer, or the expected input shape
p : float or scalar tensor
The probability of setting a value to zero
Member

Only that it's not set to zero...

alpha : float or SELU instance
The default values are fixed point solutions to equations (4) and (5)
in [1] for zero mean and unit variance input. The analytic expressions
for them are given in equation (14) also in [1].
Member

Should probably first state what role alpha plays for the operation of the layer. Also the reference must be given as [1]_, not [1].

-----
The alpha dropout layer is a regularizer that randomly sets input values to
zero and also applies a normalization function to bring the mean to 0
and the variance to 1; see [1]_ for why this might improve
Member

That used to reference the original dropout paper. Either the reference or the text should be changed.

a = T.pow(self.q + T.square(self.alpha) * self.q * self.p, -0.5)
b = -a * self.p * self.alpha

return T.cast(a * mask + b, input.dtype)
Member

Why do you need this cast? Can you trace back why it is necessary? If possible, it would be better to cast earlier, otherwise Theano might execute part of the graph on CPU and then only cast and copy the results back to GPU afterwards.

Author

@rfratila Aug 1, 2017

I believe it returns a float64, so I just want to make sure the dtype of the output is the same as the dtype of the input.
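
For reference, one way to move the cast earlier rather than casting the full output (a sketch; it assumes the float64 indeed comes from the scalar a and b terms) would be to cast those scalars once so the elementwise part of the graph stays in the input's dtype:

a = T.cast(T.pow(self.q + T.square(self.alpha) * self.q * self.p, -0.5),
           input.dtype)
b = T.cast(-self.p * self.alpha, input.dtype) * a
return a * mask + b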

@rfratila
Author

rfratila commented Aug 3, 2017

@f0k I have made the changes that address all your comments. Let me know what you think.

@f0k
Member

f0k commented Aug 7, 2017

Thank you, the code looks good now! For the docstring I'll need to have a closer look, but unfortunately I'll be abroad without web access for the next two weeks -- sorry to keep you waiting, I'll get back to this when I return!

@rfratila
Author

Have you had a chance to review the docstring in more detail @f0k ? Please let me know if anything is inaccurate. Thanks!
