Notes for CNN for Visual Recognition
13 Apr 2016
Loss function
$L = \frac{1}{N} \sum_{i=1}^N \sum_{j \neq y_i} \max(0, f_j(x_i, W) - f_{y_i}(x_i, W) + 1) + \lambda R(W)$
In common use:
- L2 regularization $R(W) = \sum_k \sum_l W_{k,l}^2$
- L1 regularization $R(W) = \sum_k \sum_l \vert W_{k,l} \vert$
- Elastic net (L1+L2) $R(W) = \sum_k \sum_l \beta_1 W_{k,l}^2 + \beta_2 \vert W_{k,l} \vert$
- Dropout: randomly set some neurons to zero in the forward pass; forces the network to learn a redundant representation; activations need to be scaled at test time (see the sketch below).
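A minimal numpy sketch of both conventions, assuming a keep probability `p = 0.5` and an arbitrary activation array `h`:

```python
import numpy as np

def dropout_train(h, p=0.5):
    # vanilla dropout: keep each unit with probability p during training
    mask = np.random.rand(*h.shape) < p
    return h * mask

def dropout_test(h, p=0.5):
    # scale at test time so expected activations match training
    return h * p
```

A common variant, "inverted dropout", divides by `p` at train time instead, leaving the test-time forward pass untouched.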
Loss function types:
- Softmax $L_i = -\log \left( \frac{e^{s_{y_i}}}{\sum_j e^{s_j}} \right)$
- SVM (hinge loss) $L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$
In practice: always use the analytic gradient, but check the implementation against the numerical gradient. This is called a gradient check; a sketch follows.
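A sketch of a gradient check on the SVM loss above; the shapes are illustrative, and the centered difference uses a conventional step $h = 10^{-5}$:

```python
import numpy as np

def svm_loss(W, x, y):
    # multiclass SVM (hinge) loss for one example x with correct class y
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0
    return margins.sum()

def svm_grad(W, x, y):
    # analytic gradient: x for each class with positive margin,
    # minus (number of such classes) * x for the correct class
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0
    indicator = (margins > 0).astype(float)
    dW = np.outer(indicator, x)
    dW[y] = -indicator.sum() * x
    return dW

def numerical_gradient(f, W, h=1e-5):
    # centered-difference approximation of df/dW, one entry at a time
    grad = np.zeros_like(W)
    it = np.nditer(W, flags=['multi_index'])
    while not it.finished:
        idx = it.multi_index
        old = W[idx]
        W[idx] = old + h; fp = f(W)
        W[idx] = old - h; fm = f(W)
        W[idx] = old
        grad[idx] = (fp - fm) / (2 * h)
        it.iternext()
    return grad

W = 0.01 * np.random.randn(3, 5)
x, y = np.random.randn(5), 1
diff = np.abs(svm_grad(W, x, y) - numerical_gradient(lambda W: svm_loss(W, x, y), W))
print(diff.max())  # should be tiny, e.g. ~1e-10
```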
Activation functions
- Sigmoid $\sigma (x) = \frac{1}{(1+e^{-x})}$
- saturated neurons “kill” the gradients
- Sigmoid outputs are not zero-centered
- exp() is a bit computationally expensive
- tanh(x)
- squashes numbers to range [-1, 1]
- zero centered
- still kills gradients when saturated
- ReLU (Rectified Linear Unit) $f(x) = \max(0, x)$
- does not saturate
- very computationally efficient
- converges much faster than sigmoid/tanh in practice
- not zero-centered output
- Leaky ReLU $f(x) = \max(\alpha x, x)$
- does not saturate
- computationally efficient
- converges much faster than sigmoid/tanh in practice
- will not “die”
- Exponential Linear Units (ELU) $f(x) = x$ if $x > 0$; $f(x) = \alpha (e^x - 1)$ if $x \le 0$
- all benefits of ReLU
- does not die
- closer to zero mean outputs
- computation requires exp()
- Maxout “neuron” $\max(w_1^T x + b_1, w_2^T x + b_2)$
- generalizes ReLU and Leaky ReLU
- linear regime, does not saturate, does not die
- doubles the number of parameters
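Each of the activations above is a one-liner in numpy (tanh is just np.tanh); the `alpha` values below are typical defaults, not prescriptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def maxout(x, w1, b1, w2, b2):
    # maxout replaces a fixed nonlinearity with the max of two affine maps
    return np.maximum(w1.T.dot(x) + b1, w2.T.dot(x) + b2)

x = np.linspace(-3, 3, 7)
print(relu(x))
print(elu(x))
```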
In practice:
- Use ReLU. Be careful with learning rates
- Try out Leaky ReLU / Maxout / ELU
- Try out tanh but do not expect much
- Do not use sigmoid
Weight initialization
- Small random numbers: works okay for small networks, but can lead to non-homogeneous distributions of activations across the layers of a network
- Xavier initialization: scale the weights by $1/\sqrt{n}$, where $n$ is the number of inputs, so the variance of activations stays roughly constant across layers (see the sketch below)
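A sketch contrasting the two schemes for one fully connected layer (the layer sizes are made up):

```python
import numpy as np

n_in, n_out = 784, 100  # hypothetical layer sizes

# small random numbers: fine for shallow nets, but activations can shrink
# toward zero layer after layer in deep ones
W_small = 0.01 * np.random.randn(n_in, n_out)

# Xavier initialization: scale by 1/sqrt(n_in) to keep activation
# variance roughly constant from layer to layer
W_xavier = np.random.randn(n_in, n_out) / np.sqrt(n_in)

# for ReLU units, He et al. recommend an extra factor of 2
W_he = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
```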
Batch normalization:
- Improves gradient flow through the network
- Allows higher learning rates
- Reduces the strong dependence on initialization
- Acts as a form of regularization
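A sketch of the train-time forward pass for a batch of shape (N, D); the backward pass and the test-time running mean/variance are omitted:

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # normalize each feature over the batch, then apply a learnable
    # per-feature scale (gamma) and shift (beta)
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 100)
out = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
print(out.mean(axis=0)[:3], out.std(axis=0)[:3])  # roughly 0 and 1 per feature
```

At test time the batch statistics are replaced by running averages collected during training.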
Hyperparameters to play with:
- network architecture
- learning rate, its decay schedule, update type
- regularization
Step update
- Gradient descent
x += -learning_rate * dx
- Momentum update
v = mu * v - learning_rate * dx # integrate velocity
x += v # integrate position
- physical interpretation as ball rolling down the loss function + friction (mu coefficient)
- mu is usually ~0.5, 0.9, or 0.99 (sometimes annealed over time)
- Nesterov momentum update
- $v_t = \mu v_{t-1} - \epsilon \nabla f(\theta_{t-1} + \mu v_{t-1})$
- $\theta_t = \theta_{t-1} + v_t$
- use the variable transform $\phi_{t-1} = \theta_{t-1} + \mu v_{t-1}$
- $v_t = \mu v_{t-1} - \epsilon \nabla f(\phi_{t-1})$
- $\phi_t = \phi_{t-1} - \mu v_{t-1} + (1 + \mu) v_t$
- code
v_prev = v
v = mu * v - learning_rate * dx
x += -mu * v_prev + (1 + mu) * v
- AdaGrad update
cache += dx**2
x += - learning_rate * dx / (np.sqrt(cache) + 1e-7)
- added element-wise scaling of the gradient based on the historical sum of squares in each dimension
- RMSProp update
cache = decay_rate * cache + (1 - decay_rate) * dx**2
x += - learning_rate * dx / (np.sqrt(cache) + 1e-7)
- Adam update
m = beta1 * m + (1-beta1) * dx # update first moment
v = beta2 * v + (1-beta2) * (dx**2) # update second moment
mb = m/(1-beta1**t) # correct bias
vb = v/(1-beta2**t) # correct bias
x += -learning_rate * mb / (np.sqrt(vb) + 1e-7)
Learning rate decay over time:
- step decay: e.g. decay learning rate by half every few epochs
- exponential decay: $\alpha = \alpha_0 e^{-kt}$
- 1/t decay: $\alpha = \alpha_0 / (1+kt)$
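The three schedules in code, with made-up values for $\alpha_0$ and $k$:

```python
import numpy as np

alpha_0, k = 1e-2, 0.1  # hypothetical initial rate and decay constant

def step_decay(t, drop_every=10):
    # e.g. halve the learning rate every `drop_every` epochs
    return alpha_0 * 0.5 ** (t // drop_every)

def exponential_decay(t):
    return alpha_0 * np.exp(-k * t)

def one_over_t_decay(t):
    return alpha_0 / (1 + k * t)
```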
Second order optimization methods:
- second order Taylor: $J(\theta) = J(\theta_0) + (\theta-\theta_0)^T \nabla _{\theta}J(\theta_0) + \frac{1}{2} (\theta - \theta_0) ^T H (\theta - \theta_0)$
- solve for critical point: $\hat{\theta} = \theta_0 - H^{-1} \nabla_{\theta} J(\theta_0)$
- Quasi-Newton methods (e.g. BFGS): instead of inverting the Hessian, approximate the inverse Hessian with rank-1 updates over time
- L-BFGS (Limited memory BFGS): does not form/store the full inverse Hessian
- usually works very well in full-batch, deterministic mode
- does not transfer very well to mini-batch setting
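A minimal sketch of one Newton step on a made-up quadratic $J(\theta) = \frac{1}{2} \theta^T A \theta - b^T \theta$, where a single step lands on the exact minimum:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])  # Hessian (positive definite)
b = np.array([1.0, 1.0])

theta0 = np.zeros(2)
grad = A.dot(theta0) - b                   # gradient at theta0
theta = theta0 - np.linalg.solve(A, grad)  # Newton step: theta0 - H^{-1} grad

print(np.allclose(A.dot(theta), b))  # True: critical point satisfies A theta = b
```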
In practice:
- Adam is a good default choice in most cases
- Try L-BFGS if you can afford to do full-batch updates
Convolutional Neural Network (CNN)
A ConvNet is a sequence of convolution layers interspersed with activation functions.
$f(x,y) * g(x,y) = \sum_{n_1 = -\infty}^{\infty} \sum_{n_2 = -\infty}^{\infty} f(n_1, n_2) g(x - n_1, y - n_2)$
The Conv Layer:
- Accepts a volume of size $W_1 \times H_1 \times D_1$
- Requires four hyperparameters:
- Number of filters $K$,
- their spatial extent $F$,
- the stride $S$,
- the amount of zero padding $P$.
- Produces a volume of size $W_2 \times H_2 \times D_2$ where:
- $W_2 = (W_1 - F + 2P)/S + 1$
- $H_2 = (H_1 - F + 2P)/S + 1$ (i.e. width and height are computed equally by symmetry)
- $D_2 = K$
- With parameter sharing, it introduces $F \cdot F \cdot D_1$ weights per filter, for a total of $(F \cdot F \cdot D_1) \cdot K$ weights and $K$ biases.
- In the output volume, the $d$-th depth slice (of size $W_2 \times H_2$) is the result of performing a valid convolution of the $d$-th filter over the input volume with a stride of $S$, and then offset by the $d$-th bias.
- Common settings: $F=3$, $S=1$, $P=1$.
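The output-shape arithmetic above as a small helper; the integer division assumes the hyperparameters "fit", i.e. $(W_1 - F + 2P)$ is divisible by $S$:

```python
def conv_output_size(W1, H1, D1, K, F, S, P):
    # output volume of a conv layer, per the formulas above
    W2 = (W1 - F + 2 * P) // S + 1
    H2 = (H1 - F + 2 * P) // S + 1
    return W2, H2, K

# a 32x32x3 input with K=10 filters, F=3, S=1, P=1 preserves spatial size
print(conv_output_size(32, 32, 3, K=10, F=3, S=1, P=1))  # (32, 32, 10)
```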
Pooling layer:
- make the representations smaller and more manageable
- operates over each activation map independently
Generally, the pooling layer:
- Accepts a volume of size $W_1 \times H_1 \times D_1$
- Requires two hyperparameters:
- their spatial extent $F$,
- the stride $S$,
- Produces a volume of size $W_2 \times H_2 \times D_2$ where:
- $W_2 = (W_1 - F)/S + 1$
- $H_2 = (H_1 - F)/S + 1$
- $D_2 = D_1$
- Introduces zero parameters since it computes a fixed function of the input
- Note that it is not common to use zero-padding for Pooling layers
- Common settings: $F = 2$, $S = 2$; or $F = 3$, $S = 2$.
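A naive max-pooling forward pass over a single (H, W, D) volume, just to make the zero-parameter, per-slice behavior concrete:

```python
import numpy as np

def maxpool_forward(x, F=2, S=2):
    # slide an F x F window with stride S over each depth slice independently
    H, W, D = x.shape
    H2, W2 = (H - F) // S + 1, (W - F) // S + 1
    out = np.zeros((H2, W2, D))
    for i in range(H2):
        for j in range(W2):
            window = x[i * S:i * S + F, j * S:j * S + F, :]
            out[i, j, :] = window.max(axis=(0, 1))
    return out

x = np.random.randn(4, 4, 3)
print(maxpool_forward(x).shape)  # (2, 2, 3)
```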
How to stack convolutions:
- Replace large convolutions (5x5, 7x7) with stacks of 3x3 convolutions
- 1x1 “bottleneck” convolutions are very efficient
- Can factor NxN convolutions into 1xN and Nx1
- All of the above give fewer parameters, less compute, and more nonlinearity
Convolution Theorem: the convolution of $f$ and $g$ is equal to the elementwise product of their Fourier transforms: $F(f * g) = F(f) F(g)$. Using the FFT, we can compute the DFT of an $N$-dimensional vector in $O(N \log N)$ time.
Implementing convolutions with the FFT:
- Compute FFT of weights: F(W)
- Compute FFT of image: F(X)
- Compute elementwise product: F(W)F(X)
- Compute inverse FFT: $Y=F^{-1}(F(W)F(X))$
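A 1-D sketch of the recipe (the 2-D image case uses np.fft.fft2 the same way); zero-padding to length len(x) + len(w) - 1 turns the DFT's circular convolution into the linear one:

```python
import numpy as np

def fft_convolve(x, w):
    n = len(x) + len(w) - 1
    X = np.fft.fft(x, n)   # F(X), zero-padded
    W = np.fft.fft(w, n)   # F(W), zero-padded
    return np.real(np.fft.ifft(X * W))  # Y = F^{-1}(F(W) F(X))

x, w = np.random.randn(64), np.random.randn(5)
print(np.allclose(fft_convolve(x, w), np.convolve(x, w)))  # True
```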
Segmentation:
- Semantic segmentation
- classify all pixels
- fully convolutional models, downsample then upsample
- learnable upsampling: fractionally strided convolution
- skip connections can help
- Instance segmentation
- detect instance, generate mask
- similar pipelines to object detection
Attention:
- Soft attention:
- easy to implement: produce distribution over input locations, reweight features and feed as input
- attend to arbitrary input locations using spatial transformer networks
- Hard attention:
- attend to a single input location
- cannot use gradient descent
- need reinforcement learning
Unsupervised learning:
- Autoencoders
- Traditional: feature learning, reconstruct input, not used much anymore
- Variational: generate samples, Bayesian meets deep learning
- Generative adversarial networks: generate samples
Recurrent Neural Networks (RNN)
$h_t = f_w(h_{t-1}, x_t)$, where $h_t$ is the new state, $h_{t-1}$ is the old state, $f_w$ is some function with parameters $w$, and $x_t$ is the input vector at the current time step.
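A common concrete choice for $f_w$ is a tanh over affine maps of the old state and the input; the sizes below are made up:

```python
import numpy as np

def rnn_step(h_prev, x, W_hh, W_xh, b):
    # one step of a vanilla RNN: new state from old state and current input
    return np.tanh(W_hh.dot(h_prev) + W_xh.dot(x) + b)

H, D = 4, 3  # hypothetical hidden and input sizes
W_hh, W_xh, b = np.random.randn(H, H), np.random.randn(H, D), np.zeros(H)
h = np.zeros(H)
for x in np.random.randn(5, D):  # unroll over 5 time steps
    h = rnn_step(h, x, W_hh, W_xh, b)
print(h)
```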