ResNet

Is learning better networks as easy as stacking more layers?

  • Vanishing/exploding gradients → hamper convergence from the beginning (toy illustration below).
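
A toy numeric sketch (not the ResNet architecture; all names and sizes are illustrative) of why backpropagation through many saturating layers makes the gradient norm collapse:

```python
import numpy as np

# Backprop through a chain of sigmoid layers multiplies many local derivatives
# together, so the gradient norm shrinks (or explodes) roughly exponentially
# with depth.
rng = np.random.default_rng(0)
depth = 30
x = rng.normal(size=(1, 64))
weights = [rng.normal(scale=0.5, size=(64, 64)) for _ in range(depth)]

activations = [x]
for W in weights:                          # forward pass
    activations.append(1.0 / (1.0 + np.exp(-activations[-1] @ W)))

grad = np.ones((1, 64))                    # pretend dL/d(output) = 1
for layer in range(depth, 0, -1):          # backward pass
    a = activations[layer]
    grad = (grad * a * (1.0 - a)) @ weights[layer - 1].T
    if layer % 10 == 1:
        print(f"layer {layer:2d}: |grad| = {np.linalg.norm(grad):.2e}")
```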

Plain ("traditional") deep networks show a degradation problem: the deeper network has higher training error, not just higher test error.
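
ResNet's answer to this degradation is the residual block: the stacked layers learn a residual F(x), and an identity shortcut adds x back, so an extra block can at worst approximate the identity. A minimal sketch, assuming PyTorch; this simplified basic block is illustrative, not the reference implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Simplified residual block: output = ReLU(F(x) + shortcut(x))."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # When the spatial size or channel count changes, project the shortcut
        # with a 1x1 convolution so the shapes match.
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)   # identity (or projected) skip connection
        return F.relu(out)

block = BasicBlock(64, 128, stride=2)
print(block(torch.randn(1, 64, 56, 56)).shape)   # torch.Size([1, 128, 28, 28])
```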

Batch Normalization: normalize each layer's inputs over the mini-batch so they keep a stable, consistent distribution (zero mean and unit variance, followed by a learned scale and shift).
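
A minimal NumPy sketch of the batch-norm computation at training time (function name and eps value are illustrative):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x: (batch, features); gamma, beta: learnable (features,) parameters.
    """
    mean = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                        # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * x_hat + beta                # restore representational power

x = np.random.randn(32, 4) * 5.0 + 3.0         # skewed, badly scaled inputs
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1
```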

SGD: Stochastic Gradient Descent
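
A minimal sketch of the SGD update rule (function and variable names are illustrative): each step moves the parameters a small step against the gradient of the loss, estimated on a random mini-batch.

```python
import numpy as np

# One SGD step: w <- w - lr * grad, where grad is the (mini-batch) gradient.
def sgd_step(w, grad, lr=0.1):
    return w - lr * grad

# Toy usage: minimize f(w) = ||w||^2, whose gradient is 2w.
w = np.array([3.0, -2.0])
for _ in range(50):
    w = sgd_step(w, grad=2 * w)
print(w)   # close to [0, 0]
```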

ReLU: Rectified Linear Unit
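
For reference, ReLU is simply max(0, x); a tiny illustrative sketch:

```python
import numpy as np

# ReLU(x) = max(0, x): passes positive values unchanged and zeroes out
# negatives, so its gradient is 1 for positive inputs.
def relu(x):
    return np.maximum(0.0, x)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))   # [0.  0.  0.  1.5]
```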