A previous post ended with the promise of a fabulous exposition of the cornerstones of machine learning theory as we know it - concentration inequalities, PAC learning, and uniform deviation. But in the horizon, a whole new outlook towards learning appears to be taking shape. This has been my highlight of #ITA2019 this year. Much of this new theory is motivated by the recent success of deep neural networks.
Let me first define supervised learning - the learning paradigm that has been the building block of much of thehype progress in ML lately! In supervised learning, one is presented with a collection of labelled data and the objective is to arrive at a pattern to explain the label. The data takes the form of a collection of ordered pairs (𝑋,𝑌), where 𝑋 is referred to as the feature and 𝑌 as the label. For example, suppose a credit agency has labelled data on a customer where the feature - 𝑋= (𝑋₁, 𝑋₂) is the credit score (𝑋₁), the credit length (𝑋₂) and 𝑌 indicates whether the customer defaulted on a loan. In this case, learning a pattern would allow us to predict whether a new customer with a certain credit score and credit length would be likely to default on a loan.
Such a pattern may be represented by a function 𝒉(𝑋), so that 𝑌=𝒉(𝑋) is the prediction corresponding to 𝑋. Suppose the space of functions 𝒉 belongs to some universal class ℌ. For example, in this case, one could choose ℌ to be the collection of linear classifiers, wherein each 𝒉 ∈ ℌ, depending on its coefficients would declare `default' based on a linear combination of 𝑋₁ and 𝑋₂ exceeding a threshold. One could, in contrast, work with a more complicated ℌ' consisting perhaps of degree - 2 polynomials of 𝑋₁ and 𝑋₂. 𝒉 is said to make a classification error if 𝒉(𝑋)≠𝑌. Now the choice of ℌ also dictates an algorithm for finding a 'good' classifier 𝒉* ∈ ℌ. The algorithm simply declares 𝒉* to be that 𝒉 which makes the least error on the given data set (or the error sum minimizer). 𝒉* is referred to as the empirical risk minimizer (ERM).
The metric of interest to us is the classification error of 𝒉* on unseen data points. This error consists of two components, bias, and generalization error. While the bias is directly related to how well 𝒉* fits the given labelled data, the generalization error captures the difference in performances of 𝒉* between the given and unseen data.
The more accommodating or complex ℌ is, better is 𝒉* able to fit the given data and therefore smaller is the bias. On the other hand, narrowing down to the 'best' classifier within ℌ requires more samples when the ℌ is more complicated resulting in a larger generalization error. In the example above, the degree - 2 polynomial class ℌ' can be expected to have a smaller bias and a larger generalization error than the linearly classifying ℌ. This trade-off is also referred to as the bias-variance trade-off. Practitioners attempt to find the ℌ that hits the sweet spot and it was widely believed that this is the best ℌ that one can hope for.
However, more recently, almost everyone has witnessed a striking outlier to this theory - neural networks. Neural networks (NNs) are simply another class of classifiers ℌ. The NNs that have achieved breakthrough classification errors in various tasks such as image classification, speech recognition, etc. consist of billions of parameters, even as the provided data sets are of the order of mere millions. Thus the NN class being unimaginably complex, known theory would predict a very large generalization error, but differing from what we see in practice.
This is the phenomenon of 'overparametrization', wherein generalization error decreases indefinitely even as ℌ is made more complex. Overparametrization has now been shown to occur in multiple settings with different classifiers. What is being realized is that studying complex classifiers such as billion parameter-NNs via the ERM framework is incorrect.
Such an NN class by simple virtue of its complexity is such that 𝒉* labels correctly every point in the data set! Clearly such a 𝒉* will not generalize well to new data points. In NNs, although ERM is the motivation, stochastic gradient descent (SGD) is the real workhorse. NN optimization landscape being non-convex, SGD does not converge to the global minima 𝒉*. The classifier that SGD does converge to, 𝒉₁, is often far away from the ERM 𝒉*. 𝒉₁, in fact, is heavily dependent on the initialization.
Yet 𝒉₁ is special in that it surprisingly generalizes unlike 𝒉*! Under the lens of overparametrization, SGD is believed to perform an implicit regularization. The `effective complexity' of 𝒉₁ is thus considerably smaller than what a naive calculation with the number of parameters would lead one to believe. While this theory is still a work in progress and I'm hazy about the details, there is growing interest along this frontier. That this might, at last, be that theory which explains NNs. The interest was certainly commensurate with the level of attendance in these talks!
For further reading, I leave the reader with webpages of researchers whose talks on overparameterization I found to be very interesting - Mikhail Belkin, Nathan Srebro, Jason Lee.
Let me first define supervised learning - the learning paradigm that has been the building block of much of the
Such a pattern may be represented by a function 𝒉(𝑋), so that 𝑌=𝒉(𝑋) is the prediction corresponding to 𝑋. Suppose the space of functions 𝒉 belongs to some universal class ℌ. For example, in this case, one could choose ℌ to be the collection of linear classifiers, wherein each 𝒉 ∈ ℌ, depending on its coefficients would declare `default' based on a linear combination of 𝑋₁ and 𝑋₂ exceeding a threshold. One could, in contrast, work with a more complicated ℌ' consisting perhaps of degree - 2 polynomials of 𝑋₁ and 𝑋₂. 𝒉 is said to make a classification error if 𝒉(𝑋)≠𝑌. Now the choice of ℌ also dictates an algorithm for finding a 'good' classifier 𝒉* ∈ ℌ. The algorithm simply declares 𝒉* to be that 𝒉 which makes the least error on the given data set (or the error sum minimizer). 𝒉* is referred to as the empirical risk minimizer (ERM).
The metric of interest to us is the classification error of 𝒉* on unseen data points. This error consists of two components, bias, and generalization error. While the bias is directly related to how well 𝒉* fits the given labelled data, the generalization error captures the difference in performances of 𝒉* between the given and unseen data.
The more accommodating or complex ℌ is, better is 𝒉* able to fit the given data and therefore smaller is the bias. On the other hand, narrowing down to the 'best' classifier within ℌ requires more samples when the ℌ is more complicated resulting in a larger generalization error. In the example above, the degree - 2 polynomial class ℌ' can be expected to have a smaller bias and a larger generalization error than the linearly classifying ℌ. This trade-off is also referred to as the bias-variance trade-off. Practitioners attempt to find the ℌ that hits the sweet spot and it was widely believed that this is the best ℌ that one can hope for.
However, more recently, almost everyone has witnessed a striking outlier to this theory - neural networks. Neural networks (NNs) are simply another class of classifiers ℌ. The NNs that have achieved breakthrough classification errors in various tasks such as image classification, speech recognition, etc. consist of billions of parameters, even as the provided data sets are of the order of mere millions. Thus the NN class being unimaginably complex, known theory would predict a very large generalization error, but differing from what we see in practice.
This is the phenomenon of 'overparametrization', wherein generalization error decreases indefinitely even as ℌ is made more complex. Overparametrization has now been shown to occur in multiple settings with different classifiers. What is being realized is that studying complex classifiers such as billion parameter-NNs via the ERM framework is incorrect.
Such an NN class by simple virtue of its complexity is such that 𝒉* labels correctly every point in the data set! Clearly such a 𝒉* will not generalize well to new data points. In NNs, although ERM is the motivation, stochastic gradient descent (SGD) is the real workhorse. NN optimization landscape being non-convex, SGD does not converge to the global minima 𝒉*. The classifier that SGD does converge to, 𝒉₁, is often far away from the ERM 𝒉*. 𝒉₁, in fact, is heavily dependent on the initialization.
Yet 𝒉₁ is special in that it surprisingly generalizes unlike 𝒉*! Under the lens of overparametrization, SGD is believed to perform an implicit regularization. The `effective complexity' of 𝒉₁ is thus considerably smaller than what a naive calculation with the number of parameters would lead one to believe. While this theory is still a work in progress and I'm hazy about the details, there is growing interest along this frontier. That this might, at last, be that theory which explains NNs. The interest was certainly commensurate with the level of attendance in these talks!
For further reading, I leave the reader with webpages of researchers whose talks on overparameterization I found to be very interesting - Mikhail Belkin, Nathan Srebro, Jason Lee.