DL weekly study notes: initialization and normalization

  • We can “focus” on the weights of this last layer because we are talking about initialization, which we control directly. The scheme is to set all the biases of the last layer to 0 and to scale the weights down by multiplying them by a small factor such as 0.01 (a minimal sketch follows this list).
  • Modern innovations in neural network optimization make initialization easier.
    • residual connections
    • normalization of different kinds
    • better optimization techniques like Adam
  • Batch normalization normalizes the hidden layers’ pre-activations, whereas initialization techniques apply to the parameters, not to the pre-activations directly.
    • Since the two are tightly related, i.e. the pre-activations are computed from the parameters, it makes sense that less care needs to be paid to initialization when BN is used (see the BN sketch after this list). #here In fact, as the network gets deeper, it becomes more difficult to initialize its parameters well.
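
A minimal sketch of the last-layer trick described above, assuming PyTorch (the notes don’t name a framework) and an illustrative 784→128→10 classifier:

```python
import torch
import torch.nn as nn

# Small MLP; sizes and framework are assumptions for illustration.
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10),  # the "last layer" whose init we control directly
)

last = model[-1]
with torch.no_grad():
    last.bias.zero_()       # set all last-layer biases to 0
    last.weight.mul_(0.01)  # scale the default-initialized weights down by ~0.01

# The initial logits are now close to 0, so the softmax starts near uniform
# and the initial cross-entropy loss is close to -log(1/10).
x = torch.randn(32, 784)
print(model(x).std())  # roughly 0.01x the unscaled value
```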
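
And a sketch of where BN sits relative to the pre-activation, again assuming PyTorch with illustrative layer sizes: BN standardizes z = Wx + b per feature across the batch before the nonlinearity, so the activation scale no longer depends strongly on how the weights were initialized.

```python
import torch
import torch.nn as nn

layer = nn.Linear(256, 128)
bn = nn.BatchNorm1d(128)

x = torch.randn(64, 256)
z = layer(x)           # pre-activation; its scale depends on the weight init
h = torch.relu(bn(z))  # BN normalizes z per feature across the batch,
                       # then applies its learnable gamma/beta

print(z.std().item(), bn(z).std().item())  # the second is ~1 regardless of init
```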
