Gradient Descent
Gradient Descent
Function: $J(\theta_0, \theta_1)$
Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
Gradient Descent:
 $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ (for $j = 0, 1$)
Outline:
 Start with some $\theta_0, \theta_1$ (e.g. $\theta_0 = 0$, $\theta_1 = 0$)
 Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$
 End up at a minimum
Simultaneous Update
Repeat until convergence:
 $\text{temp}_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$
 $\text{temp}_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$
 $\theta_0 := \text{temp}_0$
 $\theta_1 := \text{temp}_1$
 $\alpha$: learning rate
 for sufficiently small $\alpha$, $J(\theta_0, \theta_1)$ should decrease after each iteration
 if $\alpha$ is too small:
  $J$ decreases slowly
 if $\alpha$ is too large:
  $J$ may increase
  overshoot the minimum
  fail to converge, or even diverge
 gradient descent takes smaller steps automatically as it approaches a local minimum, because the derivative term shrinks; there is no need to decrease $\alpha$ over time
 $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$: derivative term
Note:
 Gradient Descent can only converge to a local minimum
 Declare convergence if $J$ decreases by less than a small threshold $\varepsilon$ (e.g. $10^{-3}$) in a single iteration
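A minimal sketch of this loop in Python; the function names and the toy cost $J(\theta) = (\theta - 3)^2$ are illustrative, not from the notes:

```python
def gradient_descent(J, grad_J, theta, alpha=0.1, eps=1e-3, max_iters=10_000):
    """Repeat theta := theta - alpha * dJ/dtheta until J barely decreases."""
    prev_cost = J(theta)
    for _ in range(max_iters):
        theta = theta - alpha * grad_J(theta)   # simultaneous update
        cost = J(theta)
        if prev_cost - cost < eps:              # declare convergence
            break
        prev_cost = cost
    return theta

# Toy example: J(theta) = (theta - 3)^2 has its minimum at theta = 3
theta_min = gradient_descent(J=lambda t: (t - 3.0) ** 2,
                             grad_J=lambda t: 2.0 * (t - 3.0),
                             theta=0.0)
print(theta_min)  # approximately 3
```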
Gradient Vanishing/Exploding
In a deep network, activations (and hence gradients) can grow or shrink exponentially with the number of layers.
 if the weights are slightly bigger than 1 (or the identity)
  activations explode
 if the weights are slightly smaller than 1
  activations vanish
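A quick NumPy illustration of the effect; the 50-layer linear "network" whose weights just scale the activations by a constant $c$ is a deliberately simplified assumption:

```python
import numpy as np

x = np.random.default_rng(0).standard_normal(100)
for c in (1.5, 0.5):        # weight scale slightly > 1 vs slightly < 1
    a = x
    for _ in range(50):     # 50 layers, each multiplying activations by c
        a = c * a
    print(c, np.abs(a).mean())  # c=1.5 -> ~1e8 (explodes), c=0.5 -> ~1e-15 (vanishes)
```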
Gradient Checking
 use only to debug, not during training (too slow)
 if the check fails, inspect individual components to identify the bug
 remember the regularization term when computing the gradients
 does not work with dropout (turn dropout off before checking)
 run at random initialization, and perhaps again after some training
Formula
 $f'(\theta) \approx \frac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}$
 $f(\theta + \varepsilon) - f(\theta - \varepsilon)$: height
 $2\varepsilon$: length
Grad Check
 Reshape $W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}$ into a big vector $\theta$
 Reshape $dW^{[1]}, db^{[1]}, \dots, dW^{[L]}, db^{[L]}$ into a big vector $d\theta$
for each i:
 $d\theta_{\text{approx}}[i] = \frac{J(\theta_1, \dots, \theta_i + \varepsilon, \dots) - J(\theta_1, \dots, \theta_i - \varepsilon, \dots)}{2\varepsilon}$
Check
 check $\frac{\|d\theta_{\text{approx}} - d\theta\|_2}{\|d\theta_{\text{approx}}\|_2 + \|d\theta\|_2}$ (take $\varepsilon = 10^{-7}$)
  $\approx 10^{-7}$: great
  $\approx 10^{-3}$: wrong
 check individual components to identify the bug
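A sketch of the whole check in NumPy; `J` here is any cost function of the flattened parameter vector, and the quadratic example is illustrative:

```python
import numpy as np

def grad_check(J, theta, dtheta, eps=1e-7):
    """Compare the backprop gradient dtheta with a two-sided numerical estimate."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps                                        # J(..., theta_i + eps, ...)
        minus[i] -= eps                                       # J(..., theta_i - eps, ...)
        dtheta_approx[i] = (J(plus) - J(minus)) / (2 * eps)   # height / length
    num = np.linalg.norm(dtheta_approx - dtheta)
    return num / (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta))

# Toy example: J(theta) = sum(theta^2), analytic gradient 2*theta
theta = np.array([1.0, -2.0, 3.0])
print(grad_check(lambda t: np.sum(t ** 2), theta, 2 * theta))  # ~1e-9: great
```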
Batch Gradient Descent
Batch Gradient Descent
Each step of Gradient Descent uses all $n$ training examples.
MiniBatch Gradient Descent
Each step of Gradient Descent uses only a subset (a mini-batch) of the training examples.
 Split $X = [x^{(1)}, x^{(2)}, \dots, x^{(n)}]$ (and $Y$, the same way) into mini-batches:
 $X^{\{1\}} = [x^{(1)}, \dots, x^{(1000)}]$ (mini-batch size m = 1000, e.g.)
 $X^{\{2\}} = [x^{(1001)}, \dots, x^{(2000)}]$ (m = 1000, e.g.)
 ...
 $X^{\{n/m\}} = [x^{(n-999)}, \dots, x^{(n)}]$ (m = 1000, e.g.)
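A sketch of the split in NumPy, assuming examples are stored as columns of X and Y as in the notes; shuffling before splitting is a common extra step, not stated above:

```python
import numpy as np

def make_minibatches(X, Y, batch_size=1000, seed=0):
    """Shuffle the columns of X, Y together and cut them into mini-batches."""
    n = X.shape[1]
    perm = np.random.default_rng(seed).permutation(n)
    X, Y = X[:, perm], Y[:, perm]
    return [(X[:, t:t + batch_size], Y[:, t:t + batch_size])
            for t in range(0, n, batch_size)]
```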
MiniBatch Size
 size = n: Batch Gradient Descent
  takes too long per iteration
 size = 1: Stochastic Gradient Descent
  loses the speedup from vectorization
 1 < size < n: in-between (the typical choice)
  keeps the speedup from vectorization
  makes progress without waiting for the whole training set
MiniBatch Gradient Descent in Neural Network
for t = 1, ..., n/m // 1 epoch: 1 pass through the training set
 Forward Propagation on $X^{\{t\}}$
 Compute cost $J^{\{t\}}$
 Backward Propagation to compute gradients on $(X^{\{t\}}, Y^{\{t\}})$, then update $W := W - \alpha\,dW$, $b := b - \alpha\,db$
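One epoch of that loop as a Python sketch, reusing `make_minibatches` from above; `forward_prop`, `compute_cost`, and `back_prop` are hypothetical placeholders for your own network code:

```python
def train_one_epoch(X, Y, params, alpha, forward_prop, compute_cost, back_prop):
    """One epoch of mini-batch gradient descent over X^{1}, ..., X^{n/m}."""
    for X_t, Y_t in make_minibatches(X, Y, batch_size=1000):
        cache = forward_prop(X_t, params)        # forward pass on X^{t}
        cost = compute_cost(cache, Y_t)          # cost J^{t} on this mini-batch
        grads = back_prop(cache, Y_t, params)    # gradients dW^{[l]}, db^{[l]}
        for key in params:                       # W := W - alpha*dW, b := b - alpha*db
            params[key] -= alpha * grads["d" + key]
```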
Adam
Exponentially Weighted Average
Formula
 $v_t = \beta v_{t-1} + (1 - \beta)\,\theta_t$
 $v_t$: averaging over approximately the last $\frac{1}{1-\beta}$ days' temperature
e.g. $\beta = 0.5$: 2 days' average
e.g. $\beta = 0.98$: 50 days' average
e.g. $\beta = 0.9$: 10 days' average
 Expanded: $v_t = (1-\beta)\,\theta_t + (1-\beta)\beta\,\theta_{t-1} + (1-\beta)\beta^2\,\theta_{t-2} + \dots$
Bias Correction
$v_t$ is not a good estimate during the initial phase (it starts from $v_0 = 0$)
$\frac{v_t}{1-\beta^t}$ is more accurate during the initial phase
 $v_t$: weighted average of the data
 $1 - \beta^t$: removes the bias
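A short sketch of the weighted average with the correction applied; the constant series is illustrative, chosen to make the cold-start bias easy to see:

```python
def ewa(data, beta=0.9):
    """Exponentially weighted average with bias correction."""
    v, out = 0.0, []
    for t, theta in enumerate(data, start=1):
        v = beta * v + (1 - beta) * theta     # v_t = beta*v_{t-1} + (1-beta)*theta_t
        out.append(v / (1 - beta ** t))       # corrected: v_t / (1 - beta^t)
    return out

print(ewa([10.0, 10.0, 10.0]))  # [10.0, 10.0, 10.0]: no cold-start bias
```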
Aim
 damp the oscillations of (mini-batch) gradient descent
Momentum
 for iteration = 1, ..., n:
  Forward Propagation on $X^{\{t\}}$
  Compute cost
  Backward Propagation to compute gradients $dW$, $db$ on the current mini-batch
  Compute $v_{dW} = \beta v_{dW} + (1-\beta)\,dW$, $v_{db} = \beta v_{db} + (1-\beta)\,db$
  Update $W := W - \alpha\,v_{dW}$, $b := b - \alpha\,v_{db}$
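The update for one parameter as a sketch; one step only, with the caller keeping `v_dW` between iterations:

```python
import numpy as np

def momentum_step(W, dW, v_dW, alpha=0.01, beta=0.9):
    """One Momentum step: average the gradients, then move along the average."""
    v_dW = beta * v_dW + (1 - beta) * dW   # v_dW = beta*v_dW + (1-beta)*dW
    W = W - alpha * v_dW                   # W := W - alpha * v_dW
    return W, v_dW
```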
RMSprop
 for iteration = 1, ..., n:
  Forward Propagation on $X^{\{t\}}$
  Compute cost
  Backward Propagation to compute gradients $dW$, $db$ on the current mini-batch
  Compute $s_{dW} = \beta s_{dW} + (1-\beta)\,dW^2$, $s_{db} = \beta s_{db} + (1-\beta)\,db^2$
  Update $W := W - \alpha \frac{dW}{\sqrt{s_{dW}} + \varepsilon}$, $b := b - \alpha \frac{db}{\sqrt{s_{db}} + \varepsilon}$
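The corresponding one-parameter sketch; the small `eps` in the denominator guards against division by zero:

```python
import numpy as np

def rmsprop_step(W, dW, s_dW, alpha=0.01, beta=0.999, eps=1e-8):
    """One RMSprop step: shrink the step in directions that oscillate a lot."""
    s_dW = beta * s_dW + (1 - beta) * dW ** 2      # average of squared gradients
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)     # W := W - alpha*dW/(sqrt(s_dW)+eps)
    return W, s_dW
```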
Adam
Adaptive Moment Estimation
 for t = 1, ..., n // t: iteration
  Forward Propagation on $X^{\{t\}}$
  Compute cost
  Backward Propagation to compute gradients $dW$, $db$ on the current mini-batch
  Compute $v_{dW} = \beta_1 v_{dW} + (1-\beta_1)\,dW$, $s_{dW} = \beta_2 s_{dW} + (1-\beta_2)\,dW^2$ (likewise for $db$)
  Bias-correct: $v_{dW}^{\text{corrected}} = \frac{v_{dW}}{1-\beta_1^t}$, $s_{dW}^{\text{corrected}} = \frac{s_{dW}}{1-\beta_2^t}$
  Update $W := W - \alpha \frac{v_{dW}^{\text{corrected}}}{\sqrt{s_{dW}^{\text{corrected}}} + \varepsilon}$
Hyperparameters
 $\alpha$: needs to be tuned
 $\beta_1$: first moment
  Momentum term
  default 0.9
 $\beta_2$: second moment
  RMSprop term
  default 0.999
 $\varepsilon$: default $10^{-8}$
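Putting the pieces together, a one-parameter Adam step with the default hyperparameters above; a sketch, not a drop-in optimizer:

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: Momentum + RMSprop, both bias-corrected. t is 1-based."""
    v = beta1 * v + (1 - beta1) * dW          # first moment (Momentum term)
    s = beta2 * s + (1 - beta2) * dW ** 2     # second moment (RMSprop term)
    v_hat = v / (1 - beta1 ** t)              # bias correction
    s_hat = s / (1 - beta2 ** t)
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s
```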
Learning Rate Decay
 larger learning rate during the initial steps
 smaller learning rate as training approaches convergence
Decay Rate
 $\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}}\,\alpha_0$
 for $\alpha_0 = 0.2$, decay_rate = 1:

epoch  α
1      0.1
2      0.067
3      0.05
4      0.04
...    ...
Other Rate Decay
 $\alpha = 0.95^{\text{epoch\_num}} \cdot \alpha_0$
  Exponential Decay
  decays exponentially quickly
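Both schedules side by side; the printed inverse-decay values reproduce the table above:

```python
alpha0, decay_rate = 0.2, 1.0

def alpha_inverse(epoch):        # alpha = alpha0 / (1 + decay_rate * epoch)
    return alpha0 / (1 + decay_rate * epoch)

def alpha_exponential(epoch):    # alpha = 0.95**epoch * alpha0
    return 0.95 ** epoch * alpha0

for epoch in (1, 2, 3, 4):
    print(epoch, round(alpha_inverse(epoch), 3), round(alpha_exponential(epoch), 3))
# inverse decay: 0.1, 0.067, 0.05, 0.04 (matches the table)
```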