Sunday, June 9, 2019

A Tale of Taste, Learning and Information

A prominent sense of taste is key to learning anything well. Paraphrasing Jiro Ono, arguably the greatest sushi chef, "If your sense of taste is lower than that of the customers, how will you impress them?". For his chefs in training, he believes that it is the sensitivity of their palates that determine the heights that they will eventually attain.

This makes sense. The more sensitive the artist is to differences, more is learnt in each iteration. The more that the writer is affected by the images conjured by a piece of text, greater is the number of crumbled drafts that end up in the dustbin, and the resulting improvement. The more that the composer's heart rises with a particular rhythm of notes, more is the impetus to innovate.

Learning continues for as long as discrimination is possible and promptly ends when one is no longer able to do so. It may be said that the limits of discrimination determine the limits of learning. It is particularly fascinating that this maxim, of breaking down learning to the act of testing between alternatives, has resulted in, perhaps, in one of our greatest conceptual leaps.

Towards illustrating this, consider the following learning task. Suppose we have 𝑛 distinct integers arranged in some unknown order. Our task is to learn this arrangement, via continually querying for the larger number at two different locations.

Notice that there are 𝑛! possible ways in which these integers can be arranged. Also notice that there are 2ⁿ possibilities corresponding to an 𝑛-length binary sequence. We may thus draw a one-one mapping between each integer arrangement and a log₂(𝑛!) long binary sequence. On the other hand, using any known sorting algorithm of complexity O(𝑛 log₂𝑛), O(𝑛 log₂𝑛) queries suffice in determining the order. Now log₂(𝑛!) = O(𝑛 log₂𝑛). Therefore each of the O(𝑛 log₂𝑛) queries can be viewed as resolving one element of uncertainty in this binary sequence.

In other words, learning here is equivalent to the cumulative reduction in uncertainty that results from the tests. In order to generalize this example, let us retrace the steps we just took. Notice that once the learning task was framed, a space of alternatives was automatically created. Creation of this space is implicit in the definition of any such learning task. Thus, it makes sense to capture and characterize the complexity of the particular learning task via the size of this alternative space.

Further, as we just saw, testing is one natural way to identify the particular alternative in that space. Several such testing strategies might be possible. The minimum number of tests across strategies depends on the particular alternative space and the type of test that is used. Under a given alternative space, the more complicated that each test is, less is the number of tests required. But upon fixing these, the minimum number of tests is directly related to the size of the alternative space.

The notion of 'bit' allows us to quantify both these quantities. First, the size of a particular alternative space, in bits, is simply the length of the smallest binary strings to which the space can be mapped to, injectively. Second, each test is worth one bit if it resolves between two alternatives. That is, a test to which the answer is either a 'yes' or a 'no'. With both the reserve of information and the amount exchanged in each transaction quantified by the number of 'bits', information can be treated as a currency.

Now what makes a currency particularly useful is the abstraction that it affords. For example, treating money from a both a fruit vendor and a venture capitalist in exactly the same manner allows banks to conduct commerce at scale. In the same vein, the universal nature of bits allow for algorithms to process, store and transmit these alternatives independent of the learning task. For instance, the same algorithm to learn the integer arrangement may be applied to learning an alphabet arrangement. Thus it is the 'bit' that has allowed us to raise information to the status of a real physical entity, enabling the current 'Information Age'.

This duality between testing and learning has since been used in other domains too. In machine learning, for instance, in deriving the VC dimension which quantifies the complexity of hypothesis classes. VC dimension here is related to the number of tests required to distinguish between hypotheses. Similarly, this duality is found in several reductions that occur in statistics and theoretical computer science. And in several such cases, the duality has been found to be tight, that is, the bounds derived from testing matches that of the task's difficulty.

Testing paves way, and determines the limits of learning. Understanding this relationship better has been one of the cornerstones of 20th century research.

Sunday, May 5, 2019

Sustainability in of Itself

Sustainability is one term that we've simultaneously become adept at ignoring, while becoming increasingly hard to do so. And sure enough, while on an endless barrage of binge watching Youtube videos I stumbled upon something on sustainability, but which also somehow managed to catch my eye.

It was on the social dynamics of enabling action towards combating climate change, global warming and sustainability in general. With a looming crisis such as this, our response has been to simply divert our attention. As a species, we're terrible at dealing with panic, and solving these issues calls for a study into the social aspect of it as much as the technological.

In one study undertaken at the Climate Lab, UCLA, the effects of different types of persuasion on the benefits of saving electricity were compared -
1. Letting the participants know how much money it enables them to save
2. How it provides a better future for their children
3. Having a little a competition between participants of the study.

The result turned out to be that reducing bills caused the least push toward action while competition provided the greatest. In hindsight, this makes sense. The decisions we make are more informed by the change they bring to our place in society than by lofty ideals.

Therefore if global warming is to be tackled, and more generally sustainable living be encouraged, mechanisms that provides such social rewards must be made more widespread. What is called for is to have sustainable lifestyle be perceived as a reward in itself. To make this certain virtue as something of a fashion statement.

Root of the problem is that right now, we reward the elevation in social status that consumption brings, even if done indiscriminately. Wearing new clothes elevates ones social position while re-using the current wardrobe does not. What if we have a means to instead socially reward re-use? This could extend to re-using other devices. Perhaps a resource sharing score as opposed to individual ownership? This also extends to diet, rewarding discriminative eating.

At this point I can almost here cries of disapproval from many. Such cutting down is often perceived as a diminishment. I believe there are primarily two reasons why such a view exists and I shall try to address these in the remainder of the post.

First, the notion that this inherently involves a sacrifice to one’s lifestyle. But this is not true. Less is sometimes more. What I want to emphasize here is that beyond a certain level of consumption, it’s the quality of objects at hand, their interplay, which if delicately done makes the whole much larger than the sum of its parts. There’s examples of this in cultures whose values provide some useful guidance as a starting point.

Japanese is one. The emphasis on such minimalism is intimately tied to the Japanese Shinto heritage. For instance, 'Shibui', a Japanese word that roughly refers to objects that are extremely simple, but which retain their subtle elements, making them beautiful. This principle is what is said to make sushi absolutely delectable. An awareness of the subtlety around us, the whispers as opposed to the shouts, can make our lives profoundly rewarding.

Second, about how it might impact our present economy. The criticism is that our economy is dependent on ever increasing consumption demand, that being what drives production, translating to better services and prosperity. Now I could put on a more socialist hat and go on about how the present structure benefits only the very few at the top. But I realize that’s not the most popular opinion to have and will not, for now.

But consider this. A more sustainably woke society will also spur the growth of entirely new industries dedicated to creating and managing sustainable products. New industries that should offset the fall in consumption. We already see this motto become increasingly popular. Apple is one, with their MacBook Air made entirely out of recycled aluminum. Instead, we’d like to see this taken to the next step. A longer refresh cycle of these products, even at the cost of making them more expensive, provided they last longer.

Taking this further, if we are to move to a subscription basis rather than ownership. For then it is in the company’s interest to not have to replace products, and therefore makes business sense. This also provides better service to the customers since at no point are they stuck to using a defective product, as is often the case in the years post warranty.

Moreover, with the current reach of social media, this change in attitude can be rapidly enabled. Facebook/Instgram/Twitter could create a portal where people may enter information about how long they use a particular product. Social scientists could help create a score on the impact of each, and a badge could be displayed next to each’s profile, seeding such a lifestyle.

Similar, more sustainable models may be studied for the other major contributors - the food industry and transport. I believe it’s only with such structural social changes that we can truly solve climate change, and more generally, sustainability. As much as we’d like it to be, Teslas aren’t the solution.

I must, however, post a disclaimer that much of this is an armchair analysis. A more rigorous analysis is necessary, and I welcome any thoughts/criticisms. But the possibilities appear to be very exciting and this is one angle that I intend to pursue, and study more about in the time to come.

Tuesday, February 26, 2019

Overparametrization

A previous post ended with the promise of a fabulous exposition of the cornerstones of machine learning theory as we know it - concentration inequalities, PAC learning, and uniform deviation. But in the horizon, a whole new outlook towards learning appears to be taking shape. This has been my highlight of #ITA2019 this year. Much of this new theory is motivated by the recent success of deep neural networks.

Let me first define supervised learning - the learning paradigm that has been the building block of much of the hype progress in ML lately! In supervised learning, one is presented with a collection of labelled data and the objective is to arrive at a pattern to explain the label. The data takes the form of a collection of ordered pairs (𝑋,𝑌), where 𝑋 is referred to as the feature and 𝑌 as the label. For example, suppose a credit agency has labelled data on a customer where the feature -  𝑋= (𝑋₁, 𝑋₂)  is the credit score (𝑋₁), the credit length (𝑋₂) and 𝑌 indicates whether the customer defaulted on a loan. In this case, learning a pattern would allow us to predict whether a new customer with a certain credit score and credit length would be likely to default on a loan.

Such a pattern may be represented by a function 𝒉(𝑋), so that 𝑌=𝒉(𝑋) is the prediction corresponding to 𝑋. Suppose the space of functions 𝒉 belongs to some universal class ℌ. For example, in this case, one could choose ℌ to be the collection of linear classifiers, wherein each 𝒉 ∈ ℌ, depending on its coefficients would declare `default' based on a linear combination of 𝑋₁ and 𝑋₂ exceeding a threshold. One could, in contrast, work with a more complicated ℌ' consisting perhaps of degree - 2 polynomials of 𝑋₁ and 𝑋₂. 𝒉 is said to make a classification error if 𝒉(𝑋)≠𝑌. Now the choice of ℌ also dictates an algorithm for finding a 'good' classifier 𝒉* ∈ ℌ. The algorithm simply declares 𝒉* to be that 𝒉 which makes the least error on the given data set (or the error sum minimizer). 𝒉* is referred to as the empirical risk minimizer (ERM).

The metric of interest to us is the classification error of 𝒉* on unseen data points. This error consists of two components, bias, and generalization error. While the bias is directly related to how well 𝒉* fits the given labelled data, the generalization error captures the difference in performances of 𝒉* between the given and unseen data.

The more accommodating or complex ℌ is, better is 𝒉* able to fit the given data and therefore smaller is the bias. On the other hand, narrowing down to the 'best' classifier within ℌ requires more samples when the ℌ is more complicated resulting in a larger generalization error. In the example above, the degree - 2 polynomial class ℌ' can be expected to have a smaller bias and a larger generalization error than the linearly classifying ℌ. This trade-off is also referred to as the bias-variance trade-off. Practitioners attempt to find the ℌ that hits the sweet spot and it was widely believed that this is the best ℌ that one can hope for.

However, more recently, almost everyone has witnessed a striking outlier to this theory - neural networks. Neural networks (NNs) are simply another class of classifiers ℌ. The NNs that have achieved breakthrough classification errors in various tasks such as image classification, speech recognition, etc. consist of billions of parameters, even as the provided data sets are of the order of mere millions. Thus the NN class being unimaginably complex, known theory would predict a very large generalization error, but differing from what we see in practice.

This is the phenomenon of 'overparametrization', wherein generalization error decreases indefinitely even as ℌ is made more complex. Overparametrization has now been shown to occur in multiple settings with different classifiers. What is being realized is that studying complex classifiers such as billion parameter-NNs via the ERM framework is incorrect.

Such an NN class by simple virtue of its complexity is such that 𝒉* labels correctly every point in the data set! Clearly such a 𝒉* will not generalize well to new data points. In NNs, although ERM is the motivation, stochastic gradient descent (SGD) is the real workhorse. NN optimization landscape being non-convex, SGD does not converge to the global minima 𝒉*. The classifier that SGD does converge to, 𝒉₁, is often far away from the ERM 𝒉*. 𝒉₁, in fact, is heavily dependent on the initialization.

Yet 𝒉₁ is special in that it surprisingly generalizes unlike 𝒉*! Under the lens of overparametrization, SGD is believed to perform an implicit regularization. The `effective complexity' of 𝒉₁ is thus considerably smaller than what a naive calculation with the number of parameters would lead one to believe. While this theory is still a work in progress and I'm hazy about the details, there is growing interest along this frontier. That this might, at last, be that theory which explains NNs. The interest was certainly commensurate with the level of attendance in these talks!

For further reading, I leave the reader with webpages of researchers whose talks on overparameterization I found to be very interesting - Mikhail BelkinNathan SrebroJason Lee.