This is a quick evaluation of the performance of different activation functions on ImageNet-2012.

The architecture is similar to CaffeNet, but with the following differences:

  1. Images are resized to a small side of 128 for speed reasons.
  2. The fc6 and fc7 layers have 2048 neurons instead of 4096.
  3. Networks are initialized with LSUV-init (see the sketch after this list).
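
LSUV-init pre-initializes each layer with an orthonormal matrix and then rescales it, layer by layer, until the variance of the layer's output on a data batch is close to one. Below is a minimal NumPy sketch of that idea for fully connected layers; the layer widths, batch size, and tolerance are illustrative assumptions, not the exact CaffeNet setup used here.

```python
# Minimal NumPy sketch of the LSUV idea: orthonormal pre-init, then per-layer
# rescaling so that the layer output variance on a data batch is ~1.
# Layer widths, batch size, and tolerance are illustrative assumptions.
import numpy as np

def orthonormal(rows, cols, rng):
    """Orthonormal (rows, cols) matrix from the QR decomposition of a square Gaussian."""
    n = max(rows, cols)
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q[:rows, :cols]

def lsuv_init(weights, x, tol=0.02, max_iter=10):
    """Rescale each weight matrix until its output variance on the batch is ~1."""
    for i, w in enumerate(weights):
        for _ in range(max_iter):
            var = (x @ w).var()                  # variance of this layer's output
            if abs(var - 1.0) < tol:
                break
            weights[i] = w = w / np.sqrt(var)    # rescale towards unit variance
        x = np.maximum(x @ w, 0.0)               # pass through ReLU to the next layer
    return weights

rng = np.random.default_rng(0)
dims = [128, 256, 256, 64]                       # toy layer widths (assumed)
weights = [orthonormal(m, n, rng) for m, n in zip(dims[:-1], dims[1:])]
weights = lsuv_init(weights, rng.standard_normal((32, dims[0])))
```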

Because LRN layers add nothing to accuracy, they were removed for speed reasons in further experiments. *The ELU curves are not smooth because of an incorrectly set test-set size; however, the results from 310K to 320K iterations were obtained with the corrected set size.*

## Activations

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| ReLU | 0.470 | 2.36 | with LRN layers |
| ReLU | 0.471 | 2.36 | no LRN, as in the rest of the table |
| TanH | 0.401 | 2.78 | |
| 1.73 TanH(2x/3) | 0.423 | 2.66 | as recommended in Efficient BackProp, LeCun98 |
| ArcSinH | 0.417 | 2.71 | |
| VLReLU | 0.469 | 2.40 | y = max(x, x/3) |
| RReLU | 0.478 | 2.32 | |
| Maxout | 0.482 | 2.30 | sqrt(2) narrower layers, 2 pieces; same complexity as ReLU |
| Maxout | 0.517 | 2.12 | same-width layers, 2 pieces |
| PReLU | 0.485 | 2.29 | |
| ELU | 0.488 | 2.28 | alpha = 1, as in the paper |
| ELU | 0.485 | 2.29 | alpha = 0.5 |
| (ELU + LReLU) / 2 | 0.486 | 2.28 | alpha = 1, slope = 0.05 |
| Shifted Softplus | 0.486 | 2.29 | shifted BNLL aka softplus, y = log(1 + exp(x)) - log(2); same as ELU, as expected |
| SELU (Scaled ELU) | 0.470 | 2.38 | 1.05070 * ELU(x, alpha = 1.6732) |
| FReLU = ReLU + (learned) bias | 0.488 | 2.27 | |
| FELU = ELU + (learned) bias | 0.489 | 2.28 | |
| None | 0.389 | 2.93 | no non-linearity, with max-pooling |
| None, no max-pooling | 0.035 | 6.28 | no non-linearity, strided convolution |
| APL2 | 0.471 | 2.38 | 2 linear pieces; unlike the other activations, the current author's implementation learns different parameters for each x, y position of a neuron |
| APL5 | 0.465 | 2.39 | 5 linear pieces; unlike the other activations, the current author's implementation learns different parameters for each x, y position of a neuron |
| ConvReLU, FCMaxout2 | 0.490 | 2.26 | ReLU in the convolutional layers, 2-piece Maxout (sqrt(2) narrower) in the FC layers; inspired by Kaggle and Investigation of Maxout Networks for Speech Recognition* |
| ConvELU, FCMaxout2 | 0.499 | 2.22 | ELU in the convolutional layers, 2-piece Maxout (sqrt(2) narrower) in the FC layers |

> \* "The above analyses show that the bottom layers seem to waste a large portion of the additional parametrisation (figure 2 (a,e)) thus could be replaced, for example, by smaller ReLU layers. Similarly, maxout units in higher layers seem to use piecewise-linear components in a more active way suggesting the use of larger pools."
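
For reference, the less standard activations from the table above can be written directly from their Comments entries; the NumPy sketch below does that. The function names and the explicit maxout weights are my own illustrative shorthand, not identifiers from this repository.

```python
# Elementwise activations from the table above, written out in NumPy.
# Formulas follow the Comments column; function names are illustrative shorthand.
import numpy as np

def vlrelu(x):                      # "very leaky" ReLU: y = max(x, x/3)
    return np.maximum(x, x / 3.0)

def scaled_tanh(x):                 # 1.73 * tanh(2x/3), as in Efficient BackProp, LeCun98
    return 1.73 * np.tanh(2.0 * x / 3.0)

def elu(x, alpha=1.0):              # ELU: x for x > 0, alpha * (exp(x) - 1) otherwise
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x):                        # SELU: 1.05070 * ELU(x, alpha = 1.6732)
    return 1.05070 * elu(x, alpha=1.6732)

def shifted_softplus(x):            # shifted BNLL / softplus: log(1 + exp(x)) - log(2)
    return np.log1p(np.exp(x)) - np.log(2.0)

def elu_lrelu_mean(x, alpha=1.0, slope=0.05):   # the (ELU + LReLU) / 2 row
    return 0.5 * (elu(x, alpha) + np.where(x > 0, x, slope * x))

def maxout2(x, w1, b1, w2, b2):     # maxout with 2 linear pieces (one FC map per piece)
    return np.maximum(x @ w1 + b1, x @ w2 + b2)
```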

## BN and activations

| Name | Accuracy | LogLoss | Comments |
|---|---|---|---|
| ReLU | 0.499 | 2.21 | |
| RReLU | 0.500 | 2.20 | |
| PReLU | 0.503 | 2.19 | |
| ELU | 0.498 | 2.23 | |
| Maxout | 0.487 | 2.28 | |
| Sigmoid | 0.475 | 2.35 | |
| None | 0.384 | 2.96 | |
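
For these rows, batch normalization is added to the same CaffeNet. As a reminder of what the BN transform does, here is a minimal NumPy sketch of BN followed by a ReLU; the "layer output, then BN, then non-linearity" ordering and the toy shapes are assumptions, since the exact prototxt is not reproduced in this file.

```python
# Minimal NumPy sketch of batch normalization followed by a ReLU.
# gamma/beta are the learned scale and shift; eps keeps the division stable.
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch dimension, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).standard_normal((32, 256))   # toy pre-activation batch
gamma, beta = np.ones(256), np.zeros(256)
y = np.maximum(batch_norm(x, gamma, beta), 0.0)            # BN, then ReLU
```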

Plots: CaffeNet128 test accuracy, test loss, and train loss.

Previous results on small datasets like CIFAR (see LSUV-init, Table 3) look somewhat contradictory to the ImageNet results so far.

The Maxout net has two linear pieces, and each layer is sqrt(2) times narrower than in the *ReLU networks, so the overall number of parameters is the same.
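
A quick sanity check of that parameter count, with illustrative widths rather than the exact CaffeNet fc6/fc7 shapes:

```python
# Shrinking both the input and output width of a layer by sqrt(2) halves its
# parameter count, so two maxout pieces cost about the same as one ReLU layer.
import math

n_in, n_out = 2048, 2048                                    # illustrative widths
relu_params = n_in * n_out
piece_in, piece_out = round(n_in / math.sqrt(2)), round(n_out / math.sqrt(2))
maxout_params = 2 * piece_in * piece_out
print(relu_params, maxout_params)                           # ~4.19M for both (up to rounding)
```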

P.S. The logs are merged from many "save-resume" cycles, because the networks were trained at night, so a plot of "Accuracy vs. seconds" would give weird results.