Stochastic Gradient Descent (SGD)
Stochastic Gradient Descent (SGD) is an optimization method used here to train a linear classifier: the implementation described in this section is a linear Support Vector Machine (SVM) fitted with SGD learning.
The gradient of the loss is estimated one sample at a time, and the model is updated after each estimate; in effect, the algorithm takes a step down the cost function for every training example.
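To make the per-sample update concrete, the following is a minimal NumPy sketch of one epoch of SGD on a hinge-loss (linear SVM) objective with an L2 penalty. The learning rate `eta` and regularization strength `alpha` are illustrative choices, not values taken from the text.

```python
import numpy as np

def sgd_hinge_epoch(X, y, w, b, eta=0.01, alpha=1e-4):
    """One pass of per-sample SGD for a linear SVM (hinge loss + L2 penalty).

    X: (n_samples, n_features) array, y: labels in {-1, +1}.
    eta (learning rate) and alpha (regularization) are illustrative defaults.
    """
    for xi, yi in zip(X, y):
        margin = yi * (np.dot(w, xi) + b)
        # Sub-gradient of the hinge loss: non-zero only when the margin is violated.
        if margin < 1:
            w -= eta * (alpha * w - yi * xi)
            b += eta * yi
        else:
            w -= eta * alpha * w
    return w, b
```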
SGD has been successfully applied to large-scale and sparse machine learning problems often encountered in text classification and NLP. When the data is sparse, the classifiers in this module easily scale to problems with large training sets and a large number of features.
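As a sketch of the sparse text-classification use case, the example below assumes a scikit-learn-style `SGDClassifier` (an assumption about the surrounding module, not something stated above); the hashed text features form a sparse matrix that the classifier consumes directly.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy corpus; real text-classification datasets have far more documents.
docs = ["cheap watches buy now", "meeting agenda attached",
        "win a free prize today", "quarterly report draft"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Hashing the text yields a sparse matrix, which SGD handles efficiently.
vectorizer = HashingVectorizer(n_features=2**18)
X = vectorizer.transform(docs)

clf = SGDClassifier(loss="hinge")  # hinge loss -> linear SVM
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["free prize now"])))
```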
Because SGD updates its parameters online, one sample at a time, it provides three major advantages over standard (batch) SVM training:
- Lower memory consumption.
- Scalability: larger datasets can be used for training.
- Re-trainability: a model can be saved, used, and further updated with new data (see the sketch after this list).
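The re-trainability point can be illustrated as follows, again assuming a scikit-learn-style estimator with `partial_fit` and using `joblib` for persistence (both assumptions for the sake of the example).

```python
import joblib
import numpy as np
from sklearn.linear_model import SGDClassifier

X1 = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y1 = np.array([1, 0, 1, 0])

# Initial training; all classes must be declared up front for partial_fit.
clf = SGDClassifier(loss="hinge")
clf.partial_fit(X1, y1, classes=np.array([0, 1]))

# Save the model to disk, reload it later...
joblib.dump(clf, "sgd_model.joblib")
clf = joblib.load("sgd_model.joblib")

# ...and continue training on a new batch without starting over.
X2 = np.array([[0.9, 0.9], [0.1, 0.1]])
y2 = np.array([1, 0])
clf.partial_fit(X2, y2)
```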
In contrast, when trained on a fixed batch of data, SGD tends to perform slightly worse than standard SVM training, since it is more prone to getting stuck in local minima.