ResNet
Is learning better networks as easy as stacking more layers?
- Vanishing/exploding gradients → hamper convergence from the beginning.
- Plain (non-residual) networks: the deeper network has higher training error, not just higher test error (the degradation problem, which is not explained by overfitting).
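ResNet's answer to the degradation problem is the residual block: instead of learning a mapping H(x) directly, the stacked layers learn the residual F(x) = H(x) - x, and an identity shortcut adds x back. A minimal sketch, assuming PyTorch (channel counts and layer sizes are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Residual block: output = ReLU(F(x) + x), with an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))   # first conv + BN + ReLU
        out = self.bn2(self.conv2(out))         # second conv + BN, no ReLU yet
        return F.relu(out + x)                  # add the identity shortcut, then ReLU
```

The identity shortcut adds no extra parameters, so a deeper stack of such blocks can at worst imitate its shallower counterpart by driving the residuals toward zero.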
Batch Normalization: forces the input of each layer to maintain a stable, consistent distribution, which largely addresses vanishing/exploding gradients.
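A minimal sketch of the training-time computation, assuming NumPy; `gamma` and `beta` are the learnable per-feature scale and shift:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift.

    x: array of shape (batch, features); gamma, beta: per-feature parameters.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, unit variance per feature
    return gamma * x_hat + beta              # restore representational freedom
```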
SGD: Stochastic Gradient Descent
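For reference, the update SGD applies on each mini-batch (a minimal sketch; `lr` is an illustrative learning rate):

```python
def sgd_step(params, grads, lr=0.1):
    """One SGD step: theta <- theta - lr * grad, using the mini-batch gradient."""
    return [p - lr * g for p, g in zip(params, grads)]
```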
ReLU: Rectified Linear Unit
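ReLU keeps positive activations and zeroes out the rest; a one-line sketch with NumPy:

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: max(0, x)."""
    return np.maximum(x, 0)
```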