Here are some more notes from my recent reading. The following are key challenges in machine learning that motivate the solutions offered by deep learning.
When a function depends on many variables, the number of distinct configurations of those variables (regions of input space) can be combinatorially larger than the number of training examples. Many ML algorithms implicitly rely on having at least one training example near each region a test point might land in (why? see next…), but for high-dimensional data sets this assumption breaks down.
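A quick back-of-the-envelope sketch of why this happens (my own toy numbers, not from the book): if each input variable is discretized into 10 bins, the number of regions is 10^d for d variables, which quickly dwarfs any realistic training set.

```python
# Toy illustration of the curse of dimensionality: counting regions when each
# of d variables is split into 10 bins.
for d in (1, 2, 5, 10, 20):
    regions = 10 ** d
    print(f"{d:>2} variables -> {regions:,} regions")

# With, say, 1,000,000 training examples, most regions in the 10-variable case
# already contain no example at all, so "one example per region" cannot hold.
```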
“What is the best forecast for weather tomorrow?” Well, the weather today, but just a bit different.
Basically, the smoothness (or local constancy) prior assumes the function being learned does not change much within a small region, so a test point should get roughly the same output as nearby training points. ML algorithms such as k-nearest neighbors and decision trees lean heavily on this assumption.
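A minimal sketch of what I mean, using a toy 1-nearest-neighbor predictor of my own (not code from the book): its output is piecewise constant, equal to the label of whichever training point is closest, which is local constancy taken to the extreme.

```python
import numpy as np

# Tiny 1-D training set (toy data).
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.0, 4.0, 9.0])

def nearest_neighbor_predict(x):
    """Return the label of the training point nearest to x."""
    distances = np.abs(X_train[:, 0] - x)
    return y_train[np.argmin(distances)]

for x in (0.1, 0.9, 1.4, 2.6):
    print(f"f({x}) = {nearest_neighbor_predict(x)}")

# Every test point simply inherits the output of a nearby training point,
# which only works if the true function really is locally (near-)constant.
```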
A key finding has been that a very large number of regions can be defined with a much smaller number of examples, so long as dependencies between regions are introduced via assumptions about the underlying data-generating distribution. In particular, one assumes the data was generated by a composition of functions, possibly arranged in a hierarchy.
Hmm, so is this why they call it “deep”?
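Seems so. A minimal sketch of the idea, with toy functions of my own choosing (not the book's notation): the overall function is a composition f3(f2(f1(x))), and stacking layers is exactly this kind of nesting.

```python
import numpy as np

def f1(x):
    # First stage: a simple linear map.
    return 2.0 * x + 1.0

def f2(h):
    # Second stage: an elementwise nonlinearity.
    return np.maximum(0.0, h)

def f3(h):
    # Third stage: another linear map.
    return 0.5 * h - 3.0

def composed(x):
    """The full hierarchy: later stages operate on features built by earlier ones."""
    return f3(f2(f1(x)))

print(composed(np.array([-2.0, 0.0, 2.0])))
```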
The manifold idea further reduces the solution space from a combinatorially huge set of possibilities by observing that in most data sets (particularly in NLP), probable examples are concentrated along a relatively small set of likely paths between neighboring points rather than spread across the whole space. Perhaps like a conditional probability for predicting the next word given the known words (context)?
Treating those likely paths as a lower-dimensional manifold embedded in the higher-dimensional input space allows for much more efficient algorithms and provides a means to tackle larger and more challenging problems, as in the sketch below.
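A minimal sketch with toy data of my own: points described by three coordinates but actually generated from a single underlying parameter t, so they lie (near) a 1-D manifold embedded in 3-D space.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.uniform(0.0, 4.0 * np.pi, size=1000)   # one intrinsic degree of freedom

# Embed the 1-D parameter as a slightly noisy helix in 3-D.
X = np.stack([np.cos(t), np.sin(t), 0.1 * t], axis=1)
X += 0.01 * rng.normal(size=X.shape)

print(X.shape)  # (1000, 3): 3 ambient dimensions, roughly 1 intrinsic dimension

# An algorithm that exploits this structure only has to model variation along t,
# not the full 3-D volume of possible inputs.
```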
From Chapter 5.11 of Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville.