# Gradient Descent

## Gradient Descent

Function: $J(w, b)$

Goal: $\min_{w, b} J(w, b)$

Gradient Descent:

• $w := w - \alpha \dfrac{\partial J(w, b)}{\partial w}$
• $b := b - \alpha \dfrac{\partial J(w, b)}{\partial b}$

Outline:

• Start with some initial $w, b$
• Keep changing $w, b$ to reduce $J(w, b)$
• End up at a minimum

### Simultaneous Update

Repeat until convergence:

$w := w - \alpha \dfrac{\partial J(w, b)}{\partial w}$, $b := b - \alpha \dfrac{\partial J(w, b)}{\partial b}$ (update $w$ and $b$ simultaneously)

• $\alpha$: learning rate
  • for sufficiently small $\alpha$, $J$ should decrease after each iteration
  • if $\alpha$ is too small:
    • slow decrease
  • if $\alpha$ is too large:
    • $J$ may increase
    • overshoot the minimum
    • fail to converge, or even diverge
  • Gradient Descent automatically takes smaller steps as it approaches a local minimum (the derivative shrinks), so $\alpha$ does not need to be decreased over time
• $\dfrac{\partial J}{\partial w}$, $\dfrac{\partial J}{\partial b}$: derivatives (the direction and size of each step)

Note:

• Gradient Descent can only converge to a local minimum
• Declare convergence if $J$ decreases by less than a small threshold (e.g. $10^{-3}$) in a single iteration
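
As a concrete illustration, here is a minimal NumPy sketch of these update rules on a tiny least-squares fit; the data, the learning rate of 0.1, and the convergence threshold are illustrative choices, not from the notes above:

```python
import numpy as np

# Illustrative cost: J(w, b) = mean((w*x + b - y)^2) for a tiny linear fit.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])            # generated by w = 2, b = 1

def cost(w, b):
    return np.mean((w * x + b - y) ** 2)

w, b = 0.0, 0.0                                # start with some initial w, b
alpha = 0.1                                    # learning rate
prev_J = cost(w, b)

for i in range(10_000):
    err = w * x + b - y
    dw = np.mean(2 * err * x)                  # dJ/dw
    db = np.mean(2 * err)                      # dJ/db
    w, b = w - alpha * dw, b - alpha * db      # simultaneous update
    J = cost(w, b)
    if prev_J - J < 1e-6:                      # declare convergence on a tiny decrease
        break
    prev_J = J

print(w, b, J)                                 # approaches w = 2, b = 1
```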

### Gradient Vanishing/Exploding

In a deep network, activations (and with them the gradients) can end up increasing or decreasing exponentially with the number of layers.

• if the weights are slightly bigger than 1 (or the identity)
  • activations explode exponentially with depth
• if the weights are slightly smaller than 1 (or the identity)
  • activations vanish exponentially with depth
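
A quick sketch of this effect, assuming a purely linear network with identical weight matrices $W^{[l]} = c\,I$; the depth of 50 and the scales 1.5 and 0.5 are illustrative:

```python
import numpy as np

def final_activation_norm(scale, num_layers=50, width=4):
    """Propagate an input through num_layers linear layers with W = scale * I."""
    W = scale * np.eye(width)
    a = np.ones(width)                   # input activation a^[0]
    for _ in range(num_layers):
        a = W @ a                        # a^[l] = W a^[l-1] (no non-linearity)
    return np.linalg.norm(a)

print(final_activation_norm(1.5))        # ~1.3e9: explodes exponentially with depth
print(final_activation_norm(0.5))        # ~1.8e-15: vanishes exponentially with depth
```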

### Gradient Checking

• use only to debug, not during training (it is too slow)
• if the check fails, look at the individual components of $d\theta$ to identify the buggy layer or parameter
• remember regularization: include the regularization term in $J$ when checking
• doesn't work with dropout (turn dropout off while checking)
• run at random initialization, and perhaps again after some training

Two-sided (numerical) approximation of the derivative: $g(\theta) \approx \dfrac{f(\theta + \varepsilon) - f(\theta - \varepsilon)}{2\varepsilon}$

• $f(\theta + \varepsilon) - f(\theta - \varepsilon)$: height
• $2\varepsilon$: width

#### Grad Check

• Reshape $W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}$ into a big vector $\theta$
• Reshape $dW^{[1]}, db^{[1]}, \dots, dW^{[L]}, db^{[L]}$ into a big vector $d\theta$
• for each $i$:
  • $d\theta_{\text{approx}}[i] = \dfrac{J(\theta_1, \dots, \theta_i + \varepsilon, \dots) - J(\theta_1, \dots, \theta_i - \varepsilon, \dots)}{2\varepsilon}$
• Check whether $d\theta_{\text{approx}} \approx d\theta$:
  • check $\dfrac{\lVert d\theta_{\text{approx}} - d\theta \rVert_2}{\lVert d\theta_{\text{approx}} \rVert_2 + \lVert d\theta \rVert_2}$
  • $\approx 10^{-7}$: great
  • $\approx 10^{-3}$: likely a bug, worry
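
A small sketch of this check on an illustrative quadratic cost with a known analytic gradient; the cost, the dimensions, and $\varepsilon = 10^{-7}$ are assumptions for the example:

```python
import numpy as np

# Illustrative cost J(theta) = 0.5 * theta^T A theta, whose analytic gradient is A theta.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A = A @ A.T                                   # make A symmetric
theta = rng.standard_normal(5)

def J(t):
    return 0.5 * t @ A @ t

dtheta = A @ theta                            # "backprop" gradient to be checked

eps = 1e-7
dtheta_approx = np.zeros_like(theta)
for i in range(theta.size):
    plus, minus = theta.copy(), theta.copy()
    plus[i] += eps                            # J(theta_1, ..., theta_i + eps, ...)
    minus[i] -= eps                           # J(theta_1, ..., theta_i - eps, ...)
    dtheta_approx[i] = (J(plus) - J(minus)) / (2 * eps)

# Relative difference: ~1e-7 is great, ~1e-3 points to a bug.
diff = (np.linalg.norm(dtheta_approx - dtheta)
        / (np.linalg.norm(dtheta_approx) + np.linalg.norm(dtheta)))
print(diff)
```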

## Batch Gradient Descent

### Batch Gradient Descent

Each step of Gradient Descent uses all of the training examples.

### Mini-Batch Gradient Descent

Each step of Gradient Descent uses a subset (a mini-batch) of the training examples.

• $X^{\{1\}}, Y^{\{1\}}$ (m = 1000 examples, e.g.)
• $X^{\{2\}}, Y^{\{2\}}$ (m = 1000 examples, e.g.)
• ...
• $X^{\{n/m\}}, Y^{\{n/m\}}$ (m = 1000 examples, e.g.)

#### Mini-Batch Size

• size = n: Batch Gradient Descent
  • takes too long per iteration (the whole training set per step)
• size = 1: Stochastic Gradient Descent
  • loses the speedup from vectorization
• size = m, with 1 < m < n (e.g. m = 1000): in-between
  • keeps the speedup from vectorization
  • makes progress without waiting for the whole training set

#### Mini-Batch Gradient Descent in Neural Network

for t = 1, ..., n/m  // 1 epoch: 1 pass through the training set

Forward Propagation on $X^{\{t\}}$

Compute cost $J^{\{t\}}$

Backward Propagation (using $X^{\{t\}}, Y^{\{t\}}$) to compute the gradients, then update: $W^{[l]} := W^{[l]} - \alpha\, dW^{[l]}$, $b^{[l]} := b^{[l]} - \alpha\, db^{[l]}$
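
A sketch of this loop for one epoch, using logistic regression as a stand-in "network" so that forward and backward propagation fit in a few lines; the data shapes, mini-batch size of 1000, and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, features = 10_000, 20                     # n training examples
X = rng.standard_normal((features, n))
true_w = rng.standard_normal((1, features))
Y = (true_w @ X > 0).astype(float)           # labels from an illustrative linear rule

W = np.zeros((1, features))
b = 0.0
alpha, m = 0.1, 1000                         # learning rate, mini-batch size

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for t in range(n // m):                      # 1 epoch: 1 pass through the training set
    Xt = X[:, t * m:(t + 1) * m]             # X^{t}
    Yt = Y[:, t * m:(t + 1) * m]             # Y^{t}

    A = sigmoid(W @ Xt + b)                  # Forward Propagation on X^{t}
    cost = -np.mean(Yt * np.log(A + 1e-12) + (1 - Yt) * np.log(1 - A + 1e-12))

    dZ = A - Yt                              # Backward Propagation: gradients on this mini-batch
    dW = dZ @ Xt.T / m
    db = np.mean(dZ)

    W -= alpha * dW                          # one gradient descent step per mini-batch
    b -= alpha * db
    print(t, round(cost, 3))                 # cost computed mini-batch by mini-batch
```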

## Adam

### Exponentially Weighted Average

#### Formula

$v_t = \beta v_{t-1} + (1 - \beta)\, \theta_t$

• $v_t$: averaging over approximately the last $\frac{1}{1 - \beta}$ days' temperature

##### $\beta$

• e.g. $\beta = 0.5$: 2 days' average
• e.g. $\beta = 0.98$: 50 days' average
• e.g. $\beta = 0.9$: 10 days' average
• ...

#### Bias Correction

• $v_t$ is not a good estimate during the initial phase (it starts from $v_0 = 0$, so the early values are biased toward 0)
• $\dfrac{v_t}{1 - \beta^t}$ is more accurate during the initial phase
  • $v_t$: weighted average of the data
  • dividing by $1 - \beta^t$ removes the bias
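
A short sketch of the weighted average with and without bias correction; the noisy "temperature" series and $\beta = 0.9$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 10 + rng.standard_normal(100)         # noisy daily "temperatures" around 10

beta = 0.9                                    # ~10 days' average
v = 0.0
v_plain, v_corrected = [], []
for t, theta_t in enumerate(theta, start=1):
    v = beta * v + (1 - beta) * theta_t       # v_t = beta * v_{t-1} + (1 - beta) * theta_t
    v_plain.append(v)
    v_corrected.append(v / (1 - beta ** t))   # bias correction for the initial phase

print(round(v_plain[0], 2), round(v_corrected[0], 2))   # ~1.0 vs ~10: the correction removes the early bias
```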

#### Aim

• damp the oscillations in the gradient descent updates (by averaging the gradients)

### Momentum

• for iteration t = 1, ..., n
  • Forward Propagation on $X^{\{t\}}$
  • Compute cost $J^{\{t\}}$
  • Backward Propagation to compute the gradients $dW$, $db$
  • Compute $v_{dW}$, $v_{db}$ on the current mini-batch:
    • $v_{dW} = \beta v_{dW} + (1 - \beta)\, dW$
    • $v_{db} = \beta v_{db} + (1 - \beta)\, db$
  • $W := W - \alpha\, v_{dW}$, $b := b - \alpha\, v_{db}$
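
A minimal sketch of the Momentum update on a single parameter matrix; `compute_gradient` is a hypothetical placeholder standing in for backprop on the current mini-batch, and the cost $J(W) = \tfrac{1}{2}\lVert W \rVert^2$ is illustrative:

```python
import numpy as np

def compute_gradient(W):
    """Hypothetical placeholder for backprop on the current mini-batch.
    Here it returns the gradient of the illustrative cost J(W) = 0.5 * ||W||^2."""
    return W

W = np.ones((3, 3))
v_dW = np.zeros_like(W)                     # velocity, initialized to 0
alpha, beta = 0.1, 0.9

for t in range(100):
    dW = compute_gradient(W)                # gradient on the current mini-batch
    v_dW = beta * v_dW + (1 - beta) * dW    # exponentially weighted average of the gradients
    W -= alpha * v_dW                       # W := W - alpha * v_dW

print(np.abs(W).max())                      # -> close to 0, the minimum of the cost
```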

### RMSprop

• for iteration t = 1, ..., n
  • Forward Propagation on $X^{\{t\}}$
  • Compute cost $J^{\{t\}}$
  • Backward Propagation to compute the gradients $dW$, $db$
  • Compute $S_{dW}$, $S_{db}$ on the current mini-batch:
    • $S_{dW} = \beta_2 S_{dW} + (1 - \beta_2)\, dW^2$ (element-wise square)
    • $S_{db} = \beta_2 S_{db} + (1 - \beta_2)\, db^2$
  • $W := W - \alpha \dfrac{dW}{\sqrt{S_{dW}} + \varepsilon}$, $b := b - \alpha \dfrac{db}{\sqrt{S_{db}} + \varepsilon}$
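
The same kind of sketch for the RMSprop update; again `compute_gradient` is a hypothetical placeholder and the cost is illustrative:

```python
import numpy as np

def compute_gradient(W):
    """Hypothetical placeholder for backprop on the current mini-batch
    (gradient of the illustrative cost J(W) = 0.5 * ||W||^2)."""
    return W

W = np.ones((3, 3))
S_dW = np.zeros_like(W)                     # running average of the squared gradients
alpha, beta2, eps = 0.01, 0.999, 1e-8

for t in range(100):
    dW = compute_gradient(W)
    S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2      # element-wise square
    W -= alpha * dW / (np.sqrt(S_dW) + eps)          # scale each step by 1 / sqrt(S_dW)

print(np.abs(W).max())                      # -> essentially 0, the minimum of the cost
```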

### Adam

Adaptive Moment Estimation: Momentum and RMSprop combined, with bias correction.

• for t = 1, ..., n // t: iteration
  • Forward Propagation on $X^{\{t\}}$
  • Compute cost $J^{\{t\}}$
  • Backward Propagation to compute the gradients $dW$, $db$
  • Compute $v_{dW}, v_{db}, S_{dW}, S_{db}$ on the current mini-batch:
    • $v_{dW} = \beta_1 v_{dW} + (1 - \beta_1)\, dW$, $v_{db} = \beta_1 v_{db} + (1 - \beta_1)\, db$
    • $S_{dW} = \beta_2 S_{dW} + (1 - \beta_2)\, dW^2$, $S_{db} = \beta_2 S_{db} + (1 - \beta_2)\, db^2$
    • $v_{dW}^{\text{corrected}} = \dfrac{v_{dW}}{1 - \beta_1^t}$, $S_{dW}^{\text{corrected}} = \dfrac{S_{dW}}{1 - \beta_2^t}$ (likewise for $db$)
  • $W := W - \alpha \dfrac{v_{dW}^{\text{corrected}}}{\sqrt{S_{dW}^{\text{corrected}}} + \varepsilon}$, $b := b - \alpha \dfrac{v_{db}^{\text{corrected}}}{\sqrt{S_{db}^{\text{corrected}}} + \varepsilon}$

#### Hyperparameters

• $\alpha$: needs to be tuned
• $\beta_1$: first moment
  • Momentum term
  • default 0.9
• $\beta_2$: second moment
  • RMSprop term
  • default 0.999
• $\varepsilon$: default $10^{-8}$
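
A sketch of the full Adam update using the default hyperparameters above; `compute_gradient` is again a hypothetical placeholder for backprop on the current mini-batch, and the cost is illustrative:

```python
import numpy as np

def compute_gradient(W):
    """Hypothetical placeholder for backprop on the current mini-batch
    (gradient of the illustrative cost J(W) = 0.5 * ||W||^2)."""
    return W

W = np.ones((3, 3))
v_dW = np.zeros_like(W)                     # first moment (Momentum term)
S_dW = np.zeros_like(W)                     # second moment (RMSprop term)
alpha = 0.01                                # needs to be tuned
beta1, beta2, eps = 0.9, 0.999, 1e-8        # defaults

for t in range(1, 201):                     # t starts at 1 for the bias correction
    dW = compute_gradient(W)
    v_dW = beta1 * v_dW + (1 - beta1) * dW
    S_dW = beta2 * S_dW + (1 - beta2) * dW ** 2
    v_corr = v_dW / (1 - beta1 ** t)        # bias-corrected first moment
    S_corr = S_dW / (1 - beta2 ** t)        # bias-corrected second moment
    W -= alpha * v_corr / (np.sqrt(S_corr) + eps)

print(np.abs(W).max())                      # ends within a few multiples of alpha of the minimum at 0
```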

## Learning Rate Decay

• bigger learning rate during the initial steps
• smaller learning rate as training approaches convergence

### Decay Rate

$\alpha = \dfrac{1}{1 + \text{decay\_rate} \times \text{epoch\_num}}\, \alpha_0$

• for $\alpha_0 = 0.2$, decay_rate = 1:

| epoch | $\alpha$ |
| ----- | -------- |
| 1     | 0.1      |
| 2     | 0.067    |
| 3     | 0.05     |
| 4     | 0.04     |
| ...   | ...      |

### Other Rate Decay

• Exponential Decay: $\alpha = k^{\text{epoch\_num}}\, \alpha_0$ with $k < 1$ (e.g. $k = 0.95$)
  • $\alpha$ decays exponentially quickly
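
A short sketch that evaluates both schedules; $\alpha_0 = 0.2$ and decay_rate = 1 match the table above, while the exponential base of 0.95 is an illustrative choice:

```python
# Learning-rate schedules; alpha_0 = 0.2 and decay_rate = 1 as in the table above.
alpha_0, decay_rate = 0.2, 1.0

def inverse_decay(epoch_num):
    return alpha_0 / (1 + decay_rate * epoch_num)

def exponential_decay(epoch_num, k=0.95):       # k = 0.95 is an illustrative base
    return (k ** epoch_num) * alpha_0

for epoch in range(1, 5):
    print(epoch, round(inverse_decay(epoch), 3), round(exponential_decay(epoch), 3))
# inverse decay reproduces the table: 0.1, 0.067, 0.05, 0.04
```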