1. What is Gradient Descent?
Answer:
Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. At the same time, every state-of-the-art deep learning library contains implementations of various algorithms to optimize gradient descent (e.g. Lasagne's, Caffe's, and Keras' documentation). Gradient descent is a way to minimize an objective function J(θ), parameterized by a model's parameters θ ∈ R^d, by updating the parameters in the opposite direction of the gradient of the objective function ∇θJ(θ) w.r.t. the parameters. The learning rate η determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley.
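As a rough illustration (not part of the original answer), here is a minimal runnable sketch of the update rule θ = θ − η ⋅ ∇θJ(θ) on a toy one-dimensional objective J(θ) = (θ − 3)²; the function name grad_J, the learning rate, and the step count are all illustrative choices:

# Toy objective J(theta) = (theta - 3)^2 with gradient dJ/dtheta = 2 * (theta - 3).
def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0          # initial parameter value
eta = 0.1            # learning rate (step size)
for _ in range(100):
    theta = theta - eta * grad_J(theta)   # step in the opposite direction of the gradient

print(theta)         # converges towards the minimum at theta = 3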
Batch gradient descent computes the gradient of the cost function w.r.t. the parameters θ for the entire training dataset: θ = θ − η ⋅ ∇θJ(θ). As we need to calculate the gradients for the whole dataset to perform just one update, batch gradient descent can be very slow and is intractable for datasets that don't fit in memory. Batch gradient descent also doesn't allow us to update our model online, i.e. with new examples on-the-fly.
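A minimal sketch of batch gradient descent on an assumed synthetic least-squares problem; the data, the learning rate of 0.1, and the epoch count are illustrative assumptions, not taken from the original answer:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))                 # full training set of 500 examples
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=500)

theta = np.zeros(3)
eta = 0.1
for epoch in range(200):
    # One gradient over the ENTIRE dataset per parameter update.
    grad = X.T @ (X @ theta - y) / len(y)     # gradient of the mean squared error
    theta = theta - eta * grad

print(theta)                                  # close to true_theta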
Stochastic gradient descent (SGD), in contrast, performs a parameter update for each training example x(i) and label y(i): θ = θ − η ⋅ ∇θJ(θ; x(i); y(i)). Batch gradient descent performs redundant computations for large datasets, as it recomputes gradients for similar examples before each parameter update. SGD does away with this redundancy by performing one update at a time. It is therefore usually much faster and can also be used to learn online.
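A corresponding SGD sketch, reusing X, y, eta, and rng from the batch sketch above; the per-example loop and the shuffling each epoch are the only changes, and the hyperparameters remain illustrative:

theta = np.zeros(3)
for epoch in range(20):
    indices = rng.permutation(len(y))         # shuffle the examples each epoch
    for i in indices:
        x_i, y_i = X[i], y[i]
        # Gradient of the loss on a SINGLE example (x_i, y_i).
        grad = x_i * (x_i @ theta - y_i)
        theta = theta - eta * grad            # one update per training example

print(theta)                                  # fluctuates slightly around true_theta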
Mini-batch gradient descent takes the best of both worlds and performs an update for every mini-batch of n training examples: θ = θ − η ⋅ ∇θJ(θ; x(i:i+n); y(i:i+n)). This way, it a) reduces the variance of the parameter updates, which can lead to more stable convergence; and b) can make use of highly optimized matrix operations common to state-of-the-art deep learning libraries that make computing the gradient w.r.t. a mini-batch very efficient. Common mini-batch sizes range between 50 and 256, but can vary for different applications. Mini-batch gradient descent is typically the algorithm of choice when training a neural network, and the term SGD is usually employed even when mini-batches are used.
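And a mini-batch version of the same sketch, again reusing X, y, eta, and rng from above, with an assumed batch size of 50 (within the commonly cited 50 to 256 range):

theta = np.zeros(3)
batch_size = 50
for epoch in range(50):
    indices = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        batch = indices[start:start + batch_size]
        X_b, y_b = X[batch], y[batch]
        # Gradient averaged over one mini-batch of batch_size examples.
        grad = X_b.T @ (X_b @ theta - y_b) / len(y_b)
        theta = theta - eta * grad

print(theta)                                  # close to true_theta, with less noisy updates than SGD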