Gradient Descent

Gradient Descent

Function: $J(\theta_0, \theta_1)$

Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$

Gradient Descent:

  • $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$ (update $j = 0$ and $j = 1$ simultaneously)

Outline:

  • Start with some initial $\theta_0, \theta_1$
  • Keep changing $\theta_0, \theta_1$ to reduce $J(\theta_0, \theta_1)$
  • End up at a minimum

Simultaneous Update

Repeat until convergence:

$$\text{temp}_0 := \theta_0 - \alpha \frac{\partial}{\partial \theta_0} J(\theta_0, \theta_1)$$

$$\text{temp}_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_0, \theta_1)$$

$$\theta_0 := \text{temp}_0, \quad \theta_1 := \text{temp}_1$$

  • $\alpha$: learning rate
    • for sufficiently small $\alpha$, $J(\theta_0, \theta_1)$ should decrease after each iteration
    • if $\alpha$ is too small:
      • slow decrease of the cost
    • if $\alpha$ is too large:
      • $J(\theta_0, \theta_1)$ may increase
      • overshoot the minimum
      • fail to converge, or even diverge
    • gradient descent takes smaller steps automatically as it approaches a local minimum (the derivative term shrinks), so there is no need to decrease $\alpha$ over time
  • $\frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$: derivative term

Note:

  • Gradient Descent can only converge to a local minimum
  • Declare convergence if $J(\theta_0, \theta_1)$ decreases by less than some small threshold $\epsilon$ (e.g. $10^{-3}$) in a single iteration
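
A minimal sketch of this loop, assuming a toy quadratic cost $J(\theta) = (\theta_0 - 3)^2 + (\theta_1 + 1)^2$; the learning rate and convergence threshold are illustrative choices:

```python
import numpy as np

# Toy cost J(theta) = (theta_0 - 3)^2 + (theta_1 + 1)^2 and its gradient
def cost(theta):
    return (theta[0] - 3) ** 2 + (theta[1] + 1) ** 2

def grad(theta):
    return np.array([2 * (theta[0] - 3), 2 * (theta[1] + 1)])

def gradient_descent(theta, alpha=0.1, tol=1e-3, max_iters=10_000):
    for _ in range(max_iters):
        # simultaneous update: compute all partial derivatives before touching theta
        new_theta = theta - alpha * grad(theta)
        # declare convergence if J decreases by less than tol in a single iteration
        if cost(theta) - cost(new_theta) < tol:
            return new_theta
        theta = new_theta
    return theta

theta_min = gradient_descent(np.array([0.0, 0.0]))
print(theta_min)  # approaches [3, -1]
```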

Gradient Vanishing/Exploding

In a very deep network, activations (and with them the gradients) can end up increasing or decreasing exponentially with the number of layers.

  • if the weights $W^{[l]}$ are slightly bigger than the identity (entries > 1)
    • activations explode
  • if the weights $W^{[l]}$ are slightly smaller than the identity (entries < 1)
    • activations vanish
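
A small sketch of this effect, assuming a deep linear network whose weight matrices are all `weight_scale * I`:

```python
import numpy as np

def forward_deep_linear(x, weight_scale, num_layers=50):
    """Propagate x through num_layers linear layers with W[l] = weight_scale * I;
    the activations scale roughly like weight_scale ** num_layers."""
    a = x
    for _ in range(num_layers):
        a = weight_scale * a  # no non-linearity, purely to show the scaling
    return a

x = np.ones(3)
print(forward_deep_linear(x, 1.5))   # explodes: ~ 1.5**50 ≈ 6e8
print(forward_deep_linear(x, 0.5))   # vanishes: ~ 0.5**50 ≈ 9e-16
```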

Gradient Checking

  • use only to debug, not during training (it is far too slow)
  • check individual components of the gradient to identify which $dW^{[l]}$ or $db^{[l]}$ has the bug
  • remember regularization: include the regularization term in $J$ when checking
    • does not work with dropout (turn dropout off while checking)
  • run at random initialization (and perhaps again after some training)

Formula

$$f'(\theta) \approx \frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$$

  • $f(\theta + \epsilon) - f(\theta - \epsilon)$: height
  • $2\epsilon$: length (width of the interval)

Grad Check

  • Reshape $W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}$ into a big vector $\theta$
  • Reshape $dW^{[1]}, db^{[1]}, \dots, dW^{[L]}, db^{[L]}$ into a big vector $d\theta$
  • for each $i$:

    $d\theta_{approx}[i] = \frac{J(\theta_1, \dots, \theta_i + \epsilon, \dots) - J(\theta_1, \dots, \theta_i - \epsilon, \dots)}{2\epsilon}$

  • Check

    $\frac{\|d\theta_{approx} - d\theta\|_2}{\|d\theta_{approx}\|_2 + \|d\theta\|_2}$

    • check (with $\epsilon = 10^{-7}$):
      • $\approx 10^{-7}$: great
      • $\geq 10^{-3}$: likely a bug, worry
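
A sketch of the check, assuming the parameters and backprop gradients have already been reshaped into the vectors $\theta$ and $d\theta$; `cost_fn` is an assumed black-box cost function:

```python
import numpy as np

def grad_check(cost_fn, theta, dtheta, epsilon=1e-7):
    """Compare the analytic gradient dtheta against a two-sided numerical estimate.
    theta: all parameters (W[1], b[1], ...) reshaped into one big vector.
    dtheta: the backprop gradients reshaped the same way."""
    dtheta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus, theta_minus = theta.copy(), theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        dtheta_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)
    # relative difference: ~1e-7 is great, >= 1e-3 probably means a bug
    return np.linalg.norm(dtheta_approx - dtheta) / (
        np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta))

# usage with a toy quadratic cost whose gradient is known analytically
cost_fn = lambda t: np.sum(t ** 2)
theta = np.array([1.0, -2.0, 0.5])
print(grad_check(cost_fn, theta, 2 * theta))  # tiny (~1e-9), far below the 1e-3 threshold
```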

Batch Gradient Descent

Batch Gradient Descent

Each step of Gradient Descent uses all $n$ training examples.

Mini-Batch Gradient Descent

Each step of Gradient Descent uses only a subset (a mini-batch) of the training examples.

    • $X^{\{1\}} = [x^{(1)}, \dots, x^{(1000)}]$, $Y^{\{1\}} = [y^{(1)}, \dots, y^{(1000)}]$ (mini-batch size $m = 1000$, e.g.)
    • $X^{\{2\}} = [x^{(1001)}, \dots, x^{(2000)}]$, $Y^{\{2\}} = [y^{(1001)}, \dots, y^{(2000)}]$ ($m = 1000$, e.g.)
    • ...
    • $X^{\{t\}}, Y^{\{t\}}$: the $t$-th mini-batch ($m = 1000$, e.g.)

Mini-Batch Size

  • size = n (the whole training set): Batch Gradient Descent
    • takes too long per iteration
  • size = 1: Stochastic Gradient Descent
    • loses the speedup from vectorization
  • size = m with 1 < m < n: In-Between
    • keeps the speedup from vectorization
    • makes progress without waiting for the whole training set to be processed

Mini-Batch Gradient Descent in Neural Network

for t = 1, ..., n/m // 1 epoch: 1 pass through the training set

  Forward Propagation on $X^{\{t\}}$

  Compute cost $J^{\{t\}} = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \frac{\lambda}{2m} \sum_{l} \|W^{[l]}\|_F^2$

  Backward Propagation (using $X^{\{t\}}, Y^{\{t\}}$) to compute the gradients, then update $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$, $b^{[l]} := b^{[l]} - \alpha\, db^{[l]}$

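A sketch of one epoch, assuming the training examples are stored as columns of `X` and that a hypothetical `forward_backward` helper performs forward prop, cost computation, and backprop on one mini-batch:

```python
import numpy as np

def minibatch_epoch(X, Y, params, forward_backward, alpha=0.01, m=1000):
    """One epoch: split the n examples into mini-batches of size m and take
    one gradient-descent step per mini-batch.
    forward_backward(X_t, Y_t, params) -> (cost, grads) is a placeholder."""
    n = X.shape[1]                       # columns are training examples
    perm = np.random.permutation(n)      # shuffle before slicing into mini-batches
    for t in range(0, n, m):
        idx = perm[t:t + m]
        X_t, Y_t = X[:, idx], Y[:, idx]
        cost, grads = forward_backward(X_t, Y_t, params)
        for key in params:               # gradient-descent update per mini-batch
            params[key] -= alpha * grads["d" + key]
    return params

# usage with a dummy linear model: J = ||W X - Y||^2 / (2m)
def fb(X_t, Y_t, params):
    err = params["W"] @ X_t - Y_t
    m_t = X_t.shape[1]
    return np.sum(err ** 2) / (2 * m_t), {"dW": err @ X_t.T / m_t}

X = np.random.randn(3, 5000)
Y = np.array([[1.0, -2.0, 0.5]]) @ X
params = minibatch_epoch(X, Y, {"W": np.zeros((1, 3))}, fb, alpha=0.1, m=1000)
```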
Adam

Exponentially Weighted Average

Formula

$$v_t = \beta v_{t-1} + (1 - \beta)\theta_t$$

  • $v_t$: approximately averaging over the last $\frac{1}{1 - \beta}$ days' temperature
  • e.g. $\beta = 0.5$: ≈ 2 days' average
  • e.g. $\beta = 0.98$: ≈ 50 days' average
  • e.g. $\beta = 0.9$: ≈ 10 days' average
  • ...

Bias Correction

  • $v_t = \beta v_{t-1} + (1 - \beta)\theta_t$ is not a good estimate during the initial phase (it is biased toward 0 because $v_0 = 0$)

  • $v_t^{corrected} = \frac{v_t}{1 - \beta^t}$ is more accurate during the initial phase

      • $\beta v_{t-1} + (1 - \beta)\theta_t$: weighted average of the data
      • $1 - \beta^t$: removes the bias; as $t$ grows, $\beta^t \to 0$ and the correction no longer has an effect
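
A sketch of the weighted average with and without bias correction, using made-up temperature values:

```python
import numpy as np

def exp_weighted_average(data, beta=0.9, bias_correction=True):
    """v_t = beta * v_{t-1} + (1 - beta) * theta_t, averaging over roughly
    1 / (1 - beta) past values; bias correction divides by (1 - beta**t)
    so early estimates are not pulled toward 0."""
    v, out = 0.0, []
    for t, theta in enumerate(data, start=1):
        v = beta * v + (1 - beta) * theta
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return np.array(out)

temps = np.array([10.0, 12.0, 11.0, 15.0, 14.0])
print(exp_weighted_average(temps, beta=0.9))                          # tracks the data early on
print(exp_weighted_average(temps, beta=0.9, bias_correction=False))   # starts biased toward 0
```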

Aim

  • damp the oscillations of mini-batch gradient descent so that learning moves faster toward the minimum

Momentum

  • for t = 1, ..., n // t: iteration
    • Forward Propagation on $X^{\{t\}}$
    • Compute cost $J^{\{t\}}$
    • Backward Propagation to compute the gradients
      • Compute $dW$, $db$ on the current mini-batch
    • $v_{dW} = \beta v_{dW} + (1 - \beta)\, dW$
    • $v_{db} = \beta v_{db} + (1 - \beta)\, db$
    • $W := W - \alpha\, v_{dW}$, $b := b - \alpha\, v_{db}$
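
A sketch of the momentum update for one mini-batch, assuming `params`, `grads`, and `v` are dictionaries keyed by parameter name:

```python
import numpy as np

def momentum_step(params, grads, v, alpha=0.01, beta=0.9):
    """One momentum update: v = beta*v + (1-beta)*grad, then step along v."""
    for key in params:
        v[key] = beta * v[key] + (1 - beta) * grads["d" + key]
        params[key] -= alpha * v[key]
    return params, v

# usage on a single-parameter toy example
params = {"W1": np.array([1.0])}
grads = {"dW1": np.array([0.5])}
v = {"W1": np.zeros(1)}
params, v = momentum_step(params, grads, v)
print(params["W1"])  # 1.0 - 0.01 * 0.05 = 0.9995
```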

RMSprop

  • for t = 1, ..., n // t: iteration
    • Forward Propagation on $X^{\{t\}}$
    • Compute cost $J^{\{t\}}$
    • Backward Propagation to compute the gradients
      • Compute $dW$, $db$ on the current mini-batch
    • $s_{dW} = \beta_2 s_{dW} + (1 - \beta_2)\, dW^2$ (element-wise square)
    • $s_{db} = \beta_2 s_{db} + (1 - \beta_2)\, db^2$
    • $W := W - \alpha \frac{dW}{\sqrt{s_{dW}} + \epsilon}$, $b := b - \alpha \frac{db}{\sqrt{s_{db}} + \epsilon}$
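
A sketch of the RMSprop update under the same assumed dictionary layout:

```python
import numpy as np

def rmsprop_step(params, grads, s, alpha=0.01, beta2=0.999, eps=1e-8):
    """One RMSprop update: s is an exponentially weighted average of the
    squared gradients, and each step divides the gradient element-wise by sqrt(s)."""
    for key in params:
        g = grads["d" + key]
        s[key] = beta2 * s[key] + (1 - beta2) * g ** 2
        params[key] -= alpha * g / (np.sqrt(s[key]) + eps)
    return params, s

# usage: directions with large squared gradients get their steps scaled down
params, s = rmsprop_step({"W1": np.array([1.0])}, {"dW1": np.array([0.5])}, {"W1": np.zeros(1)})
```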

Adam

Adaptive Moment Estimation

  • for t = 1, ..., n // t: iteration
    • Forward Propagation on $X^{\{t\}}$
    • Compute cost $J^{\{t\}}$
    • Backward Propagation to compute the gradients
      • Compute $dW$, $db$ on the current mini-batch
    • $v_{dW} = \beta_1 v_{dW} + (1 - \beta_1)\, dW$, $v_{db} = \beta_1 v_{db} + (1 - \beta_1)\, db$ // Momentum
    • $s_{dW} = \beta_2 s_{dW} + (1 - \beta_2)\, dW^2$, $s_{db} = \beta_2 s_{db} + (1 - \beta_2)\, db^2$ // RMSprop
    • $v^{corrected}_{dW} = \frac{v_{dW}}{1 - \beta_1^t}$, $v^{corrected}_{db} = \frac{v_{db}}{1 - \beta_1^t}$ // bias correction
    • $s^{corrected}_{dW} = \frac{s_{dW}}{1 - \beta_2^t}$, $s^{corrected}_{db} = \frac{s_{db}}{1 - \beta_2^t}$
    • $W := W - \alpha \frac{v^{corrected}_{dW}}{\sqrt{s^{corrected}_{dW}} + \epsilon}$, $b := b - \alpha \frac{v^{corrected}_{db}}{\sqrt{s^{corrected}_{db}} + \epsilon}$

Hyperparameters

  • $\alpha$: learning rate
    • needs to be tuned
  • $\beta_1$: first moment
    • Momentum term
    • default 0.9
  • $\beta_2$: second moment
    • RMSprop term
    • default 0.999
  • $\epsilon$:
    • default $10^{-8}$
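
A sketch of one Adam step with the default hyperparameters above; the dictionary layout for `params`, `grads`, `v`, and `s` is again an assumption:

```python
import numpy as np

def adam_step(params, grads, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t = iteration count, starting at 1): momentum-style
    first moment v and RMSprop-style second moment s, both bias-corrected."""
    for key in params:
        g = grads["d" + key]
        v[key] = beta1 * v[key] + (1 - beta1) * g
        s[key] = beta2 * s[key] + (1 - beta2) * g ** 2
        v_hat = v[key] / (1 - beta1 ** t)   # bias correction of the first moment
        s_hat = s[key] / (1 - beta2 ** t)   # bias correction of the second moment
        params[key] -= alpha * v_hat / (np.sqrt(s_hat) + eps)
    return params, v, s

# usage: at t = 1 the bias-corrected step is roughly alpha in the gradient direction
params = {"W1": np.array([1.0])}
grads = {"dW1": np.array([0.5])}
v, s = {"W1": np.zeros(1)}, {"W1": np.zeros(1)}
params, v, s = adam_step(params, grads, v, s, t=1)
```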

Learning Rate Decay

  • bigger learning rate during the initial steps
  • smaller learning rate as training approaches convergence

Decay Rate

$$\alpha = \frac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}}\, \alpha_0$$

  • for $\alpha_0 = 0.2$, decay_rate = 1:

| epoch | $\alpha$ |
| ----- | -------- |
| 1     | 0.1      |
| 2     | 0.067    |
| 3     | 0.05     |
| 4     | 0.04     |
| ...   | ...      |

Other Rate Decay

    • Exponential Decay: $\alpha = 0.95^{\text{epoch\_num}}\, \alpha_0$ (0.95 is just an example base)
      • decays exponentially quickly
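
A sketch of both schedules, assuming $\alpha_0 = 0.2$ and decay_rate = 1 as in the table above:

```python
def decayed_lr(alpha0, epoch, decay_rate=1.0, schedule="inverse"):
    """Learning-rate schedules from the notes: inverse decay
    alpha0 / (1 + decay_rate * epoch) and exponential decay 0.95**epoch * alpha0."""
    if schedule == "inverse":
        return alpha0 / (1 + decay_rate * epoch)
    return 0.95 ** epoch * alpha0  # exponential decay

for epoch in range(1, 5):
    print(epoch, round(decayed_lr(0.2, epoch), 3))  # 0.1, 0.067, 0.05, 0.04
```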

Local Optima
