Neural Network

input layer (layer 0) $\rightarrow$ hidden layer (layer 1) $\rightarrow \dots \rightarrow$ hidden layer (layer $n-1$) $\rightarrow$ output layer (layer $n$)

Chain Rule

For a single neuron the computation is a chain $x \rightarrow z = w^{T}x + b \rightarrow a = \sigma(z) \rightarrow \mathcal{L}(a, y)$, and derivatives propagate backwards through it:

$\dfrac{\partial \mathcal{L}}{\partial z} = \dfrac{\partial \mathcal{L}}{\partial a} \cdot \dfrac{\partial a}{\partial z}$, $\quad \dfrac{\partial \mathcal{L}}{\partial w} = \dfrac{\partial \mathcal{L}}{\partial z} \cdot \dfrac{\partial z}{\partial w}$, $\quad \dfrac{\partial \mathcal{L}}{\partial b} = \dfrac{\partial \mathcal{L}}{\partial z} \cdot \dfrac{\partial z}{\partial b}$
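
A small numerical sanity check of this chain rule on one example, comparing the analytic gradient $(a - y)x$ with a finite-difference estimate (the code, values, and the cross-entropy loss used here are illustrative choices, not from the original notes):

```python
import numpy as np

def loss(w, b, x, y):
    # chain: x -> z = w*x + b -> a = sigmoid(z) -> L = -(y*log(a) + (1-y)*log(1-a))
    z = w * x + b
    a = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

w, b, x, y = 0.5, -0.2, 1.5, 1.0
z = w * x + b
a = 1.0 / (1.0 + np.exp(-z))

# chain rule: dL/dw = (dL/da) * (da/dz) * (dz/dw) = (a - y) * x
analytic = (a - y) * x
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
print(analytic, numeric)  # the two estimates should agree to several decimal places
```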

Activation Function

  • Sigmoid function: $a = \sigma(z) = \dfrac{1}{1 + e^{-z}}$

    • mainly used in the output layer for binary classification
  • tanh function: $a = \tanh(z) = \dfrac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$

  • ReLU function: $a = \max(0, z)$

    • ReLU: rectified linear unit
    • if $z < 0$: $a = 0$, derivative $g'(z) = 0$
    • if $z > 0$: $a = z$, derivative $g'(z) = 1$
    • if $z = 0$: the derivative is undefined; in practice take 0 or 1
  • Leaky ReLU function: $a = \max(cz, z)$

    • if $z < 0$: $a = cz$, derivative $g'(z) = c$
    • if $z > 0$: $a = z$, derivative $g'(z) = 1$
    • if $z = 0$: the derivative is undefined; in practice take $c$ or 1
    • $c$ is very small (e.g. 0.01)
  • identity (linear) activation function: $a = z$
    • rarely useful: stacking layers with linear activations still gives a linear model
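
A minimal NumPy sketch of these activations and the ReLU-style derivative used later in back propagation (helper names are ours, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    # a = 1 / (1 + e^(-z)); output in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # zero-centered output in (-1, 1)
    return np.tanh(z)

def relu(z):
    # a = max(0, z)
    return np.maximum(0, z)

def leaky_relu(z, c=0.01):
    # a = max(c*z, z), small slope c for z < 0
    return np.maximum(c * z, z)

def relu_derivative(z):
    # g'(z) = 0 for z < 0 and 1 for z > 0 (the value at z = 0 is taken as 0 here)
    return (z > 0).astype(float)
```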

Gradient Descent

  • initialize the parameters randomly
    • $W$: np.random.randn(n, m) * 0.01
    • $b$: np.zeros((n, 1))
  • repeat until convergence:
    • compute the gradients $dW$, $db$ by forward and back propagation
    • $W := W - \alpha \, dW$
    • $b := b - \alpha \, db$ (where $\alpha$ is the learning rate)
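
A minimal sketch of initialization and one update step, assuming the gradients $dW$, $db$ have already been computed elsewhere (function names are ours):

```python
import numpy as np

def initialize(n_out, n_in):
    # small random weights break symmetry; biases can start at zero
    W = np.random.randn(n_out, n_in) * 0.01
    b = np.zeros((n_out, 1))
    return W, b

def gradient_descent_step(W, b, dW, db, alpha=0.01):
    # move the parameters against the gradient, scaled by the learning rate
    W = W - alpha * dW
    b = b - alpha * db
    return W, b
```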

Forward Propagation

input: $a^{[l-1]}$

$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$, $\quad a^{[l]} = g^{[l]}(z^{[l]})$

output: $a^{[l]}$, cache $z^{[l]}$

Back Propagation

input: $da^{[l]}$

$dz^{[l]} = da^{[l]} * g^{[l]\prime}(z^{[l]})$, $\quad dW^{[l]} = dz^{[l]} \, a^{[l-1]T}$, $\quad db^{[l]} = dz^{[l]}$, $\quad da^{[l-1]} = W^{[l]T} dz^{[l]}$

output: $da^{[l-1]}$, $dW^{[l]}$, $db^{[l]}$
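
A hedged NumPy sketch of these two passes for a single layer, vectorized over a batch of $m$ examples (hence the $1/m$ factors); the function names are ours, and the activation $g$ and its derivative are passed in:

```python
import numpy as np

def layer_forward(A_prev, W, b, g):
    # z[l] = W[l] a[l-1] + b[l];  a[l] = g(z[l])
    Z = W @ A_prev + b
    A = g(Z)
    cache = (A_prev, W, Z)
    return A, cache

def layer_backward(dA, cache, g_prime):
    # dz[l] = da[l] * g'(z[l]);  dW[l] = dz[l] a[l-1]^T / m;  db[l] = mean of dz[l]
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    dZ = dA * g_prime(Z)
    dW = dZ @ A_prev.T / m
    db = np.sum(dZ, axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ                     # da[l-1] = W[l]^T dz[l]
    return dA_prev, dW, db
```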

Weight Initialization

  • set $\mathrm{variance}(w_i) = \dfrac{1}{n}$

    • prevents $z = \sum_{i} w_i x_i$ from blowing up or shrinking toward zero
    • $n$ is the number of input features
  • $W^{[l]}$ = np.random.randn(shape) * np.sqrt(1 / n[l-1])

    • ReLU activation: use np.sqrt(2 / n[l-1]) (He initialization)
    • tanh activation: use np.sqrt(1 / n[l-1]) (Xavier initialization)
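
A minimal NumPy sketch of this scaling applied to a whole network, given a list of layer sizes (the layer_dims and initialize_parameters names are ours):

```python
import numpy as np

def initialize_parameters(layer_dims, activation="relu"):
    # scale: sqrt(2/n) for ReLU (He initialization), sqrt(1/n) for tanh (Xavier initialization)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        scale = np.sqrt((2.0 if activation == "relu" else 1.0) / n_prev)
        params["W" + str(l)] = np.random.randn(n_curr, n_prev) * scale
        params["b" + str(l)] = np.zeros((n_curr, 1))
    return params
```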

Cost Function

$J(W, b) = \dfrac{1}{m} \sum_{i=1}^{m} \mathcal{L}(\hat{y}^{(i)}, y^{(i)})$, where for binary classification $\mathcal{L}(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\big)$

Regularization

L2 Norm

  • add the penalty $\dfrac{\lambda}{2m} \sum_{l=1}^{L} \lVert W^{[l]} \rVert_F^2$ to the cost $J$
  • Frobenius Norm: $\lVert W^{[l]} \rVert_F^2 = \sum_{i=1}^{n^{[l]}} \sum_{j=1}^{n^{[l-1]}} \big( W^{[l]}_{ij} \big)^2$
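
A hedged sketch of the L2-regularized cross-entropy cost, vectorized over $m$ examples; AL is the output activation and weights is a list of the $W^{[l]}$ matrices (names ours):

```python
import numpy as np

def compute_cost(AL, Y, weights, lambd=0.0):
    # cross-entropy: J = -(1/m) * sum(y*log(a) + (1-y)*log(1-a))
    m = Y.shape[1]
    cross_entropy = -np.sum(Y * np.log(AL) + (1 - Y) * np.log(1 - AL)) / m
    # L2 penalty: (lambda / 2m) * sum over layers of the squared Frobenius norm of W[l]
    l2_term = (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)
    return cross_entropy + l2_term
```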

Dropout Regularization

Intuition: the network cannot rely on any one feature, so it has to spread out the weights.

parameter:

  • keep_prob for different layers
    • the probability of keeping a unit in each layer
    • keep_prob = 1: keep all units (no dropout in that layer)
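
A minimal sketch of inverted dropout for one layer's activations; dividing by keep_prob keeps the expected value of the activations unchanged (names are illustrative):

```python
import numpy as np

def apply_dropout(A, keep_prob=0.8):
    # drop each unit with probability 1 - keep_prob
    mask = np.random.rand(*A.shape) < keep_prob
    A = A * mask
    # inverted dropout: rescale so the expected activation stays the same
    A = A / keep_prob
    return A, mask   # the mask is reused when back-propagating through this layer
```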

Deep Neural Network

```mermaid
graph LR
x1((x1)) --> a11((a11)) 
x2((x2)) --> a11((a11))
x3((x3)) --> a11((a11))
x1((x1)) --> a12((a12))
x2((x2)) --> a12((a12))
x3((x3)) --> a12((a12))
x1((x1)) --> a13((a13))
x2((x2)) --> a13((a13))
x3((x3)) --> a13((a13))
x1((x1)) --> a14((a14))
x2((x2)) --> a14((a14))
x3((x3)) --> a14((a14))
x1((x1)) --> a15((a15))
x2((x2)) --> a15((a15))
x3((x3)) --> a15((a15))
a11((a11)) --> a21((a21))
a12((a12)) --> a21((a21))
a13((a13)) --> a21((a21))
a14((a14)) --> a21((a21))
a15((a15)) --> a21((a21))
a11((a11)) --> a22((a22))
a12((a12)) --> a22((a22))
a13((a13)) --> a22((a22))
a14((a14)) --> a22((a22))
a15((a15)) --> a22((a22))
a11((a11)) --> a23((a23))
a12((a12)) --> a23((a23))
a13((a13)) --> a23((a23))
a14((a14)) --> a23((a23))
a15((a15)) --> a23((a23))
a11((a11)) --> a24((a24))
a12((a12)) --> a24((a24))
a13((a13)) --> a24((a24))
a14((a14)) --> a24((a24))
a15((a15)) --> a24((a24))
a11((a11)) --> a25((a25))
a12((a12)) --> a25((a25))
a13((a13)) --> a25((a25))
a14((a14)) --> a25((a25))
a15((a15)) --> a25((a25))
a21((a21)) --> a31((a31))
a22((a22)) --> a31((a31))
a23((a23)) --> a31((a31))
a24((a24)) --> a31((a31))
a25((a25)) --> a31((a31))
a21((a21)) --> a32((a32))
a22((a22)) --> a32((a32))
a23((a23)) --> a32((a32))
a24((a24)) --> a32((a32))
a25((a25)) --> a32((a32))
a21((a21)) --> a33((a33))
a22((a22)) --> a33((a33))
a23((a23)) --> a33((a33))
a24((a24)) --> a33((a33))
a25((a25)) --> a33((a33))
a31((a31)) --> a41((a41))
a32((a32)) --> a41((a41))
a33((a33)) --> a41((a41))
```
  • $L = 4$ (number of layers: hidden layers and output layer)
  • $n^{[i]}$ = number of units in layer $i$
    • $n^{[1]} = 5$, $n^{[2]} = 5$, $n^{[3]} = 3$, $n^{[4]} = n^{[L]} = 1$, $n^{[0]} = n_x = 3$
  • $a^{[i]}$ = activations in layer $i$
  • $z^{[i]}$ = linear output cached in layer $i$ (before being put into the activation function)
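
Writing the same notation out in code for the network in the diagram (the variable names are ours):

```python
# 3 inputs, hidden layers of 5, 5 and 3 units, and 1 output unit
layer_dims = [3, 5, 5, 3, 1]      # n[0], n[1], n[2], n[3], n[4]
L = len(layer_dims) - 1           # L = 4 (hidden layers plus the output layer)
```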

Parameters

$W^{[l]}$ and $b^{[l]}$ (learned by gradient descent)

Hyperparameters

Control the ultimate parameters $W^{[l]}$ and $b^{[l]}$:

  • learning rate
  • iterations of gradient descent
  • number of hidden layers
  • number of hidden units
  • choice of activation function

Parameters

for layer $l$: $W^{[l]}$, $b^{[l]}$

Forward Propagation

input: $a^{[l-1]}$

$z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$, $\quad a^{[l]} = g^{[l]}(z^{[l]})$

output: $a^{[l]}$, cache $z^{[l]}$

Backward Propagation

input: $da^{[l]}$, cache $z^{[l]}$

$dz^{[l]} = da^{[l]} * g^{[l]\prime}(z^{[l]})$, $\quad dW^{[l]} = dz^{[l]} \, a^{[l-1]T}$, $\quad db^{[l]} = dz^{[l]}$, $\quad da^{[l-1]} = W^{[l]T} dz^{[l]}$

output: $da^{[l-1]}$, $dW^{[l]}$, $db^{[l]}$
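
Chaining these per-layer blocks gives the forward pass over all $L$ layers; a hedged sketch assuming ReLU in the hidden layers and a sigmoid output, with params holding W1..WL and b1..bL (names ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

def L_model_forward(X, params):
    L = len(params) // 2           # params has one W and one b per layer
    A = X
    caches = []
    for l in range(1, L + 1):
        W, b = params["W" + str(l)], params["b" + str(l)]
        Z = W @ A + b                           # z[l] = W[l] a[l-1] + b[l]
        A = sigmoid(Z) if l == L else relu(Z)   # sigmoid output layer, ReLU hidden layers
        caches.append((Z, W, b))                # cached for back propagation
    return A, caches
```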

Dimension

Single Example

    • $W^{[l]}$: $(n^{[l]}, n^{[l-1]})$
    • $b^{[l]}$: $(n^{[l]}, 1)$
    • $z^{[l]}$, $a^{[l]}$: $(n^{[l]}, 1)$
    • $dW^{[l]}$, $db^{[l]}$: same dimensions as $W^{[l]}$ and $b^{[l]}$

Vectorization

    • $Z^{[l]}$, $A^{[l]}$: $(n^{[l]}, m)$
    • $dZ^{[l]}$, $dA^{[l]}$: $(n^{[l]}, m)$
    • $b^{[l]}$: $(n^{[l]}, 1)$ -- broadcast --> $(n^{[l]}, m)$
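
These shapes can be verified mechanically; a small sketch with assert on randomly chosen sizes (all names and numbers are illustrative):

```python
import numpy as np

n_prev, n_curr, m = 3, 5, 10            # n[l-1], n[l], number of examples
W = np.random.randn(n_curr, n_prev)     # (n[l], n[l-1])
b = np.zeros((n_curr, 1))               # (n[l], 1)
A_prev = np.random.randn(n_prev, m)     # (n[l-1], m)

Z = W @ A_prev + b                      # b broadcasts from (n[l], 1) to (n[l], m)
assert Z.shape == (n_curr, m)
```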

Circuit Theory

Informally: there are functions that a small $L$-layer deep neural network can compute, but that a shallower network would require exponentially more hidden units to compute.
