Regularization

Bagging

Bootstrap AGGregatING. Reduces generalization error by combining several models: train several different models separately, then have all of them vote on the output for each test example. It is an example of model averaging, also known as an ensemble method. Averaging works because different models usually do not make exactly the same errors on the test set. By contrast, boosting constructs an ensemble with higher capacity than the individual models.
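
A minimal sketch of the bagging procedure, assuming a generic `train_model` function (not defined here) that returns a prediction callable and that class labels are non-negative integers: each ensemble member is fit on a bootstrap sample, and test predictions are combined by majority vote.

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    """Draw a bootstrap sample (sampling with replacement) of the training set."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def bagging_predict(models, X_test):
    """Majority vote over the predictions of the individual models."""
    preds = np.stack([m(X_test) for m in models])  # shape: (n_models, n_test)
    # For each test example, pick the most frequently predicted class label.
    return np.array([np.bincount(col).argmax() for col in preds.T])

# Usage sketch (train_model is an assumed placeholder):
# rng = np.random.default_rng(0)
# models = [train_model(*bootstrap_sample(X_train, y_train, rng)) for _ in range(10)]
# y_pred = bagging_predict(models, X_test)
```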

Dropout

Technique for reducing overfitting in large networks. It randomly drops units (along with their connections) from the neural network during training. [Dropout_A_Simple_Way_to_Prevent_Neural_Networks_from_Overfitting]

[Dropout_A_Simple_Way_to_Prevent_Neural_Networks_from_Overfitting]: https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
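
A minimal sketch of what "dropping units" means for one layer's activations during training, assuming \(p\) denotes the probability of retaining a unit (the convention used in the paper). At test time no units are dropped; the next section shows how \(p\) is reused there.

```python
import numpy as np

def dropout(activations, p, rng):
    """Randomly drop units (set them to zero), keeping each with probability p."""
    mask = rng.random(activations.shape) < p  # 1 with probability p, else 0
    return activations * mask

# Usage sketch, applied during training only:
# rng = np.random.default_rng(0)
# h = np.maximum(0, x @ W + b)      # hidden layer activations (ReLU assumed)
# h = dropout(h, p=0.5, rng=rng)
```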

Dropout rate

Here \(p\) is the probability of retaining a unit: \(p = 1\) implies no dropout, and lower values of \(p\) mean more dropout. Typical values of \(p\) for hidden units are in the range 0.5 to 0.8. For input layers, the choice depends on the kind of input; for real-valued inputs (image patches or speech frames), a typical value is 0.8. For hidden layers, the choice of \(p\) is coupled with the choice of the number of hidden units \(n\): a smaller \(p\) requires a larger \(n\), which slows down training and can lead to underfitting, while a large \(p\) may not produce enough dropout to prevent overfitting.
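
To illustrate how \(p\) enters at training versus test time, here is a sketch following the paper's weight-scaling rule: inputs dropped with retention probability \(p\) during training, and the corresponding weights multiplied by \(p\) at test time. The ReLU and layer shapes are illustrative assumptions.

```python
import numpy as np

def hidden_layer_train(x, W, b, p, rng):
    """Training pass: keep each input unit with probability p."""
    mask = rng.random(x.shape) < p
    return np.maximum(0, (x * mask) @ W + b)

def hidden_layer_test(x, W, b, p):
    """Test pass: no dropping; scale the weights by p instead."""
    return np.maximum(0, x @ (W * p) + b)
```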

Learning rate and Momentum

Dropout introduces a lot of noise in the gradients compared to standard stochastic gradient descent, so many gradients tend to cancel each other out. To compensate, use 10-100 times the learning rate that would be optimal for a standard network. While momentum values of about 0.9 are common for standard nets, values of 0.95-0.99 work well with dropout.
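
A sketch of a plain SGD-with-momentum update, only to show where these hyperparameters plug in; the scaled-up values in the usage comment reflect the recommendations above, and `base_lr` is an assumed placeholder for whatever rate was optimal without dropout.

```python
def sgd_momentum_step(w, velocity, grad, lr, momentum):
    """One update: v <- momentum * v - lr * grad; w <- w + v."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Usage sketch:
# lr = 10 * base_lr    # 10-100x the standard learning rate
# momentum = 0.95      # 0.95-0.99 instead of the usual 0.9
# w, v = sgd_momentum_step(w, v, grad, lr, momentum)
```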

Max-norm Regularization

A large momentum and learning rate can cause the network weights to grow very large. Max-norm regularization constrains the norm of the vector of incoming weights at each hidden unit to be bounded above by a constant \(c\), typically chosen in the range 3-4.
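
A minimal sketch of the constraint, assuming a weight matrix `W` of shape (n_inputs, n_hidden) so that column j holds the incoming weights of hidden unit j; after each gradient update, any column whose L2 norm exceeds \(c\) is rescaled back onto the ball of radius \(c\).

```python
import numpy as np

def max_norm_constrain(W, c=3.0):
    """Rescale each column of W so its L2 norm is at most c."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)        # shape: (1, n_hidden)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))   # shrink only if norm > c
    return W * scale

# Usage sketch: apply after every weight update.
# W = max_norm_constrain(W, c=3.0)
```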