Sunday, June 9, 2019

A Tale of Taste, Learning and Information

A keen sense of taste is key to learning anything well. Paraphrasing Jiro Ono, arguably the greatest sushi chef, "If your sense of taste is lower than that of the customers, how will you impress them?" For his chefs in training, he believes that it is the sensitivity of their palates that determines the heights they will eventually attain.

This makes sense. The more sensitive the artist is to differences, the more is learnt in each iteration. The more the writer is affected by the images conjured by a piece of text, the greater the number of crumpled drafts that end up in the dustbin, and the greater the resulting improvement. The more the composer's heart rises with a particular rhythm of notes, the greater the impetus to innovate.

Learning continues for as long as discrimination is possible and promptly ends when one is no longer able to discriminate. It may be said that the limits of discrimination determine the limits of learning. It is particularly fascinating that this maxim, of breaking learning down into the act of testing between alternatives, has resulted in perhaps one of our greatest conceptual leaps.

To illustrate this, consider the following learning task. Suppose we have 𝑛 distinct integers arranged in some unknown order. Our task is to learn this arrangement by repeatedly querying which of two locations holds the larger number.

Notice that there are 𝑛! possible ways in which these integers can be arranged. Also notice that an 𝑛-length binary sequence has 2ⁿ possibilities. We may thus draw a one-one mapping between each integer arrangement and a binary sequence of length log₂(𝑛!). On the other hand, using any known sorting algorithm of complexity O(𝑛 log₂𝑛), O(𝑛 log₂𝑛) queries suffice to determine the order. Now log₂(𝑛!) = O(𝑛 log₂𝑛). Therefore each of the O(𝑛 log₂𝑛) queries can be viewed as resolving roughly one bit of uncertainty in this binary sequence.
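To make this concrete, here is a small sketch of my own (not from the original argument) that counts the pairwise queries used by merge sort on a random arrangement and compares them against the information-theoretic floor of ⌈log₂(𝑛!)⌉:

```python
# Sketch: merge sort as a sequence of pairwise "which is larger?" queries,
# compared against the ceil(log2(n!)) lower bound on the number of queries.
import math
import random

def merge_sort(arr, counter):
    """Sort arr, counting pairwise comparisons in counter[0]."""
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid], counter)
    right = merge_sort(arr[mid:], counter)
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        counter[0] += 1            # one query: which of the two elements is larger?
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:]); merged.extend(right[j:])
    return merged

n = 64
arrangement = random.sample(range(1000), n)
queries = [0]
merge_sort(arrangement, queries)
print("queries used:        ", queries[0])
print("ceil(log2(n!)) bound:", math.ceil(math.log2(math.factorial(n))))
```

The two numbers come out within a small constant of each other, which is the sense in which each query resolves about one bit.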

In other words, learning here is equivalent to the cumulative reduction in uncertainty that results from the tests. In order to generalize this example, let us retrace the steps we just took. Notice that once the learning task was framed, a space of alternatives was automatically created. Creation of this space is implicit in the definition of any such learning task. Thus, it makes sense to capture and characterize the complexity of the particular learning task via the size of this alternative space.

Further, as we just saw, testing is one natural way to identify the particular alternative in that space. Several such testing strategies might be possible. The minimum number of tests across strategies depends on the particular alternative space and the type of test that is used. For a given alternative space, the more complicated each test is, the fewer tests are required. But upon fixing these, the minimum number of tests is directly related to the size of the alternative space.

The notion of a 'bit' allows us to quantify both these quantities. First, the size of a particular alternative space, in bits, is simply the length of the shortest binary strings to which the space can be mapped injectively. Second, each test is worth one bit if it resolves between two alternatives, that is, a test to which the answer is either a 'yes' or a 'no'. With both the reserve of information and the amount exchanged in each transaction quantified in bits, information can be treated as a currency.
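As a toy illustration of "one bit per test" (a sketch of my own, with an arbitrary example), here is how one alternative out of N can be identified using only yes/no questions of the form "is it in the upper half of the remaining candidates?"; at most ⌈log₂ N⌉ such tests are needed:

```python
# Sketch: identifying one alternative among N with yes/no tests.
import math

def identify(secret, n):
    """Locate `secret` in {0, ..., n-1} using halving yes/no tests."""
    lo, hi, tests = 0, n, 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tests += 1                     # one yes/no test resolves one bit
        if secret >= mid:              # "is it in the upper half?"
            lo = mid
        else:
            hi = mid
    return lo, tests

n = 1000
found, tests = identify(secret=737, n=n)
print(found, tests, math.ceil(math.log2(n)))   # 737, 10, 10
```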

Now what makes a currency particularly useful is the abstraction that it affords. For example, treating money from both a fruit vendor and a venture capitalist in exactly the same manner allows banks to conduct commerce at scale. In the same vein, the universal nature of bits allows algorithms to process, store and transmit these alternatives independently of the learning task. For instance, the same algorithm used to learn the integer arrangement may be applied to learning an arrangement of letters. Thus it is the 'bit' that has allowed us to raise information to the status of a real physical entity, enabling the current 'Information Age'.

This duality between testing and learning has since been used in other domains too. In machine learning, for instance, it appears in the VC dimension, which quantifies the complexity of hypothesis classes; the VC dimension is related to the number of tests required to distinguish between hypotheses. Similarly, this duality is found in several reductions that occur in statistics and theoretical computer science. And in several such cases, the duality has been found to be tight, that is, the bounds derived from testing match the task's difficulty.
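Here is a toy rendering of this flavour of duality (my own sketch, not a standard routine): brute-force checking whether a hypothesis class can pass the "test" of realizing every labelling of a point set. Thresholds on the real line shatter any single point but no pair, so their VC dimension is 1:

```python
# Sketch: checking shattering for threshold classifiers h_t(x) = 1{x >= t}.
from itertools import product

def threshold_labellings(points):
    """All distinct labellings that some threshold induces on the points."""
    cuts = sorted(points) + [max(points) + 1]
    return {tuple(1 if x >= t else 0 for x in points) for t in cuts}

def is_shattered(points):
    realizable = threshold_labellings(points)
    return all(labels in realizable for labels in product([0, 1], repeat=len(points)))

print(is_shattered([3.0]))        # True  -> a single point is shattered
print(is_shattered([3.0, 7.0]))   # False -> the labelling (1, 0) cannot be realized
```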

Testing paves the way for, and determines the limits of, learning. Understanding this relationship better has been one of the cornerstones of 20th-century research.

Sunday, May 5, 2019

Sustainability in and of Itself

Sustainability is one term that we've become adept at ignoring, even as it becomes increasingly hard to do so. And sure enough, while binge-watching an endless barrage of YouTube videos, I stumbled upon something on sustainability that somehow managed to catch my eye.

It was on the social dynamics of enabling action towards combating climate change, global warming and sustainability in general. With a looming crisis such as this, our response has been to simply divert our attention. As a species, we're terrible at dealing with panic, and solving these issues calls for studying the social aspects as much as the technological ones.

In one study undertaken at the Climate Lab, UCLA, the effects of different types of persuasion towards saving electricity were compared:
1. Letting the participants know how much money it enables them to save
2. Telling them how it provides a better future for their children
3. Having a little competition between the participants of the study.

The result turned out to be that reducing bills caused the least push toward action while competition provided the greatest. In hindsight, this makes sense. The decisions we make are more informed by the change they bring to our place in society than by lofty ideals.

Therefore, if global warming is to be tackled, and more generally sustainable living encouraged, mechanisms that provide such social rewards must be made more widespread. What is called for is to have a sustainable lifestyle be perceived as a reward in itself, to make this virtue something of a fashion statement.

The root of the problem is that right now, we reward the elevation in social status that consumption brings, even when done indiscriminately. Wearing new clothes elevates one's social position while re-using the current wardrobe does not. What if we had a means to instead socially reward re-use? This could extend to re-using other devices. Perhaps a resource-sharing score as opposed to individual ownership? This also extends to diet, rewarding more discriminating eating.

At this point I can almost hear cries of disapproval from many. Such cutting down is often perceived as a diminishment. I believe there are primarily two reasons why such a view exists, and I shall try to address these in the remainder of the post.

First, the notion that this inherently involves a sacrifice to one's lifestyle. But this is not true. Less is sometimes more. What I want to emphasize here is that beyond a certain level of consumption, it is the quality of the objects at hand and their interplay which, if delicately done, makes the whole much larger than the sum of its parts. There are examples of this in cultures whose values provide some useful guidance as a starting point.

Japanese culture is one. Its emphasis on such minimalism is intimately tied to the Japanese Shinto heritage. Take 'Shibui', for instance, a Japanese word that roughly refers to objects that are extremely simple yet retain subtle elements that make them beautiful. This principle is what is said to make sushi absolutely delectable. An awareness of the subtlety around us, the whispers as opposed to the shouts, can make our lives profoundly rewarding.

Second, about how it might impact our present economy. The criticism is that our economy is dependent on ever increasing consumption demand, that being what drives production, translating to better services and prosperity. Now I could put on a more socialist hat and go on about how the present structure benefits only the very few at the top. But I realize that’s not the most popular opinion to have and will not, for now.

But consider this. A more sustainably woke society will also spur the growth of entirely new industries dedicated to creating and managing sustainable products, industries that should offset the fall in consumption. We already see this motto becoming increasingly popular. Apple is one example, with a MacBook Air whose enclosure is made entirely of recycled aluminum. We'd like to see this taken to the next step: a longer refresh cycle for these products, even at the cost of making them more expensive, provided they last longer.

Taking this further, suppose we move to a subscription basis rather than ownership. It would then be in the company's interest not to have to replace products, and durability would make business sense. This also provides better service to the customers, since at no point are they stuck using a defective product, as is often the case in the years past warranty.

Moreover, with the current reach of social media, this change in attitude can be rapidly enabled. Facebook/Instagram/Twitter could create a portal where people may enter information about how long they use a particular product. Social scientists could help create a score for the impact of each, and a badge could be displayed next to each person's profile, seeding such a lifestyle.

Similar, more sustainable models may be studied for the other major contributors - the food industry and transport. I believe it’s only with such structural social changes that we can truly solve climate change, and more generally, sustainability. As much as we’d like it to be, Teslas aren’t the solution.

I must, however, post a disclaimer that much of this is an armchair analysis. A more rigorous analysis is necessary, and I welcome any thoughts/criticisms. But the possibilities appear to be very exciting and this is one angle that I intend to pursue, and study more about in the time to come.

Tuesday, February 26, 2019

Overparametrization

A previous post ended with the promise of a fabulous exposition of the cornerstones of machine learning theory as we know it - concentration inequalities, PAC learning, and uniform deviation. But on the horizon, a whole new outlook towards learning appears to be taking shape. This has been my highlight of #ITA2019 this year. Much of this new theory is motivated by the recent success of deep neural networks.

Let me first define supervised learning - the learning paradigm that has been the building block of much of the hype and progress in ML lately! In supervised learning, one is presented with a collection of labelled data and the objective is to arrive at a pattern that explains the labels. The data takes the form of a collection of ordered pairs (𝑋,𝑌), where 𝑋 is referred to as the feature and 𝑌 as the label. For example, suppose a credit agency has labelled data on a customer, where the feature 𝑋 = (𝑋₁, 𝑋₂) consists of the credit score (𝑋₁) and the credit length (𝑋₂), and 𝑌 indicates whether the customer defaulted on a loan. In this case, learning a pattern would allow us to predict whether a new customer with a certain credit score and credit length would be likely to default on a loan.

Such a pattern may be represented by a function 𝒉(𝑋), so that 𝑌 = 𝒉(𝑋) is the prediction corresponding to 𝑋. Suppose the function 𝒉 belongs to some class ℌ. For example, in this case one could choose ℌ to be the collection of linear classifiers, wherein each 𝒉 ∈ ℌ, depending on its coefficients, would declare 'default' based on a linear combination of 𝑋₁ and 𝑋₂ exceeding a threshold. One could, in contrast, work with a more complicated ℌ' consisting perhaps of degree-2 polynomials of 𝑋₁ and 𝑋₂. 𝒉 is said to make a classification error if 𝒉(𝑋) ≠ 𝑌. Now the choice of ℌ also dictates an algorithm for finding a 'good' classifier 𝒉* ∈ ℌ. The algorithm simply declares 𝒉* to be that 𝒉 which makes the least error on the given data set (the error sum minimizer). 𝒉* is referred to as the empirical risk minimizer (ERM).
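For concreteness, here is a minimal sketch of ERM on the credit example, with made-up data and a small, gridded class of linear classifiers; the feature ranges, coefficients and the data-generating rule are illustrative assumptions of mine, not anything from an actual credit data set:

```python
# Sketch: ERM over a tiny class of linear classifiers h_{w,b}(X) = 1{w1*X1 + w2*X2 >= b}.
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform([300, 0], [850, 30], size=(200, 2))   # (credit score, credit length), made up
Y = (0.01 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1, 200) < 8).astype(int)  # default indicator

def empirical_risk(w1, w2, b):
    preds = (w1 * X[:, 0] + w2 * X[:, 1] >= b).astype(int)
    return np.mean(preds != Y)

# ERM: pick the hypothesis in the (finite, gridded) class with the least training error.
grid = itertools.product(np.linspace(-0.05, 0.05, 11),
                         np.linspace(-1, 1, 11),
                         np.linspace(-10, 10, 21))
h_star = min(grid, key=lambda h: empirical_risk(*h))
print("ERM hypothesis:", h_star, "training error:", empirical_risk(*h_star))
```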

The metric of interest to us is the classification error of 𝒉* on unseen data points. This error consists of two components: bias and generalization error. While the bias is directly related to how well 𝒉* fits the given labelled data, the generalization error captures the difference in the performance of 𝒉* between the given and unseen data.

The more accommodating or complex ℌ is, the better 𝒉* is able to fit the given data and therefore the smaller the bias. On the other hand, narrowing down to the 'best' classifier within ℌ requires more samples when ℌ is more complicated, resulting in a larger generalization error. In the example above, the degree-2 polynomial class ℌ' can be expected to have a smaller bias and a larger generalization error than the linearly classifying ℌ. This trade-off is also referred to as the bias-variance trade-off. Practitioners attempt to find the ℌ that hits the sweet spot, and it was widely believed that this is the best ℌ one can hope for.
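To see the trade-off numerically, here is a quick sketch that uses polynomial regression as a stand-in for increasingly complex classes ℌ (the data-generating function and noise level are assumptions of mine): training error keeps falling with the degree, while test error eventually rises.

```python
# Sketch: bias-variance trade-off as polynomial degree grows.
import numpy as np

rng = np.random.default_rng(1)
def sample(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + rng.normal(0, 0.3, n)   # assumed ground truth plus noise
    return x, y

x_train, y_train = sample(30)
x_test, y_test = sample(1000)

for degree in [1, 3, 9, 15]:
    coeffs = np.polyfit(x_train, y_train, degree)        # ERM within degree-d polynomials
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train {train_err:.3f}  test {test_err:.3f}")
```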

However, more recently, almost everyone has witnessed a striking outlier to this theory - neural networks. Neural networks (NNs) are simply another class of classifiers ℌ. The NNs that have achieved breakthrough classification errors in various tasks such as image classification, speech recognition, etc. consist of billions of parameters, even as the provided data sets are of the order of mere millions. With the NN class being unimaginably complex, known theory would predict a very large generalization error, quite unlike what we see in practice.

This is the phenomenon of 'overparametrization', wherein the generalization error keeps decreasing even as ℌ is made more complex. Overparametrization has now been shown to occur in multiple settings with different classifiers. What is being realized is that studying complex classifiers such as billion-parameter NNs via the ERM framework is incorrect.

Such an NN class, by simple virtue of its complexity, is such that 𝒉* correctly labels every point in the data set! Clearly such an 𝒉* will not generalize well to new data points. In NNs, although ERM is the motivation, stochastic gradient descent (SGD) is the real workhorse. The NN optimization landscape being non-convex, SGD does not converge to the global minimum 𝒉*. The classifier that SGD does converge to, 𝒉₁, is often far away from the ERM 𝒉*. 𝒉₁, in fact, is heavily dependent on the initialization.

Yet 𝒉₁ is special in that, unlike 𝒉*, it surprisingly generalizes! Under the lens of overparametrization, SGD is believed to perform an implicit regularization. The 'effective complexity' of 𝒉₁ is thus considerably smaller than what a naive count of the number of parameters would lead one to believe. While this theory is still a work in progress and I'm hazy about the details, there is growing interest along this frontier, and hope that this might, at last, be the theory that explains NNs. The interest was certainly commensurate with the level of attendance at these talks!
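Here is a toy sketch of implicit regularization in an overparametrized linear model, a common stand-in for NNs in this literature (the dimensions and the ground-truth weights are arbitrary choices of mine): many weight vectors fit the training data perfectly, but gradient descent started from zero converges to the minimum-norm interpolator, which generalizes far better than an arbitrary interpolating solution.

```python
# Sketch: minimum-norm interpolation vs. an arbitrary interpolator, d >> n.
import numpy as np

rng = np.random.default_rng(2)
n, d = 40, 400                                  # far more parameters than samples
w_true = np.zeros(d); w_true[:5] = 1.0          # assumed simple ground truth
X = rng.normal(size=(n, d))
y = X @ w_true

w_min_norm = np.linalg.pinv(X) @ y              # limit of gradient descent from zero init
null_dir = rng.normal(size=d)
null_dir -= np.linalg.pinv(X) @ (X @ null_dir)  # component in the null space of X
w_other = w_min_norm + 5.0 * null_dir           # also fits the training data perfectly

X_test = rng.normal(size=(1000, d))
for name, w in [("min-norm", w_min_norm), ("arbitrary interpolator", w_other)]:
    train_err = np.mean((X @ w - y) ** 2)
    test_err = np.mean((X_test @ w - X_test @ w_true) ** 2)
    print(f"{name:>22}: train {train_err:.2e}  test {test_err:.2f}")
```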

For further reading, I leave the reader with webpages of researchers whose talks on overparametrization I found to be very interesting - Mikhail Belkin, Nathan Srebro, Jason Lee.




Wednesday, June 28, 2017

Communicating Using Chessboards!

Any blog such as this must necessarily have a few puzzles. Here's one, which I came across in a recent talk by Paul Cuff. Solving it has been one of my more enlightening moments in the past few months (another equally enlightening moment was standing on the weighing scale to discover how much weight I'd gained).

The puzzle is as follows. There are two parties, Alice and Bob. Alice has access to a chessboard with coins placed on each square. Using this, Alice wishes to send a 6-bit message to Bob. Alice is allowed to flip one coin (among the 64 coins) and upon flipping, the chessboard is made visible to Bob. The question is then to find a strategy that Alice and Bob can agree upon so that Bob can obtain the message on looking at the board.

As a primer, consider the case where the initial configuration (heads up or tails up) of the coins on the board is already known to Bob. A simple strategy here is to map the 6-bit message set to the 2^6=64 squares, and Alice flips the coin on the square corresponding to the message she wishes to send. Bob then looks at the board, figures out which square has the flipped coin and voila, decodes the message. Sounds too trivial? Read on.
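Just to pin down this primer case in code (a sketch of my own, and emphatically not the solution to the real puzzle below): Alice flips the coin on the square indexed by her 6-bit message, and Bob recovers it by spotting the square that changed.

```python
# Sketch: the trivial strategy when Bob knows the initial board.
import random

def alice(board, message):
    """Flip the coin on square `message` (an integer in 0..63)."""
    new_board = board.copy()
    new_board[message] ^= 1
    return new_board

def bob(initial_board, observed_board):
    """Return the index of the single square that differs."""
    return next(i for i in range(64) if initial_board[i] != observed_board[i])

initial = [random.randint(0, 1) for _ in range(64)]   # known to both Alice and Bob
message = 0b101101                                    # the 6-bit message, i.e. 45
assert bob(initial, alice(initial, message)) == message
```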

Now suppose that the initial configuration of the coins on the board is not known to Bob. Can Alice still send a 6-bit message? Sounds impossible? Turns out it is possible! Find a strategy for this!

A disclaimer: This problem might take some time to solve. However, the reward/enlightenment on solving it far outweighs the effort involved in doing so. As some of you may have noticed, this problem is notoriously similar to "Dirty Paper Coding".

Update: While browsing the web, I found this list of puzzles on Prof. Young-Han Kim's webpage. It includes a variant of this puzzle and other equally elegant ones. I encourage the enthusiast to take a look.

Find the solution in the comments, solved by one of the legendary puzzle enthusiasts of UCSD, Anant Dhayal. So as to prevent any accidental viewing of the comments, I've filled some space with the famous poem by Robert Frost (which I've incidentally alluded to here).

The Road Not Taken

Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;

Then took the other, as just as fair,
And having perhaps the better claim,
Because it was grassy and wanted wear;
Though as for that the passing there
Had worn them really about the same,

And both that morning equally lay
In leaves no step had trodden black.
Oh, I kept the first for another day!
Yet knowing how way leads on to way,
I doubted if I should ever come back.

I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.

Friday, February 17, 2017

How to Make a Blog Popular?

In order to keep up with the expectations of the viewers (which I actually need not since it's mostly a null set), I've desisted from making any non-technical posts about the adventure that settling in at San Diego has been. As to the delay in posting here, suffice it to say that the clear sky, the beaches and the mild temperatures make for a very relaxed start to things. Now that it's been a few months, perhaps it's time to introduce my first research problem.

This problem is centrally motivated by increasing the popularity of this blog itself! Now, what makes a blog post? At the very basic level, one typically talks about two or more topics and, in an elegant or novel fashion, draws connections between them. However, as the recent elections have proven, popularity is driven more by the topics being of interest to viewers than by the elegance itself (think Donald Trump).

Thus, one natural strategy is to first form a data-driven estimate of the set of people that are interested in various topics (say we have K different topics in total). Then, to use these estimates to make decisions on which topic to write a post on.

How do I go about doing this? Since I'd like to be a theoretician for the moment, at least until the point that I go looking for jobs, I'll simply do what every other theoretician does: assume that I already have these estimates. An interesting question now arises. To understand this, think of the following case first: if blog posts are to consist of only one topic, the optimal strategy is to simply choose the most popular topic.

However, if two topics are to be included in the post, picking the first- and second-ranked topics might be sub-optimal. Moreover, this sub-optimality arises more often than not. Consider the case that the first rank goes to the topic "Kim Kardashian". Given this, it's entirely reasonable for the next rank to go to "Kanye West". Including these two topics might then involve a large number of viewers common to both, preventing us from fully leveraging the presence of the second topic. One might, for instance, be better off including two very diverse topics such as "Kim Kardashian" and "Information Theory" to maximise the number of interested viewers.

More formally, we can think of each topic as corresponding to a set of users. Thus, we're interested in the two sets out of K that, when taken together, give the largest union. This problem has been widely studied under the title of "Data Summarization". In one famous result, it was shown that an algorithm which picks one topic to be the first-ranked topic and the other to be the one that creates the largest union with the former gives a solution that's within a factor of (1-1/e) of the optimal value. It was also shown that aspiring for anything more than this factor makes the problem NP-hard.
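To make the greedy step concrete, here is a small sketch with made-up topic audiences (the topic names and viewer sets are purely illustrative): take the most popular topic first, then add the topic whose audience creates the largest union with it.

```python
# Sketch: two-step greedy choice for the largest union of audiences.
audiences = {
    "Kim Kardashian":     {1, 2, 3, 4, 5, 6},
    "Kanye West":         {1, 2, 3, 4, 5, 7},   # mostly the same viewers
    "Information Theory": {8, 9, 10},           # small but disjoint audience
}

first = max(audiences, key=lambda t: len(audiences[t]))
second = max((t for t in audiences if t != first),
             key=lambda t: len(audiences[first] | audiences[t]))

print(first, "+", second, "->", len(audiences[first] | audiences[second]), "viewers")
# Picks "Information Theory" (union of 9 viewers) over "Kanye West" (union of 7).
```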

However, if we impose additional constraints on our sets of interest, there's hope to find the optimal solution in complexity that's polynomial in K. We have promising results on some toy problems in this direction. But it remains to be seen if significant results can be obtained. Stay tuned for an update on either the toy problem if there's no success/a link to a paper if we do succeed.

Thursday, August 4, 2016

On Idling Around

Why, oh why is it that I'm wasting the present moment? Where has this nature to idle around come from?

Although, to be fair, I've been idling around a lot since childhood. And I know I'm being really fair since I'm writing this from a geographic coordinate and altitude that's exactly where I idled around a lot in my early childhood (subtracting any tectonic plate shifts, of course). I'm talking about my native home in Kerala, in my bedroom, from where the sights of coconut, banana and papaya trees full of papayas haven't changed one bit. What has changed, however, is the fact that I now have access to a fast internet connection.

What's also changed is that I'm now educated enough to use this internet to gather data, research papers and potentially* do world-changing work. Except that my mind has not warped itself to this new reality and instead prefers to take comfort in the soft warm bed and blankets, and watch the monsoon drops hit the mud-tiled roof as the clouds darken, turning the green surroundings into a deeper shade of green.

The heavy rains also flood me with countless childhood memories – of cricket with cousins, of collecting mangoes from the nearby tree, making a hole in them as we squeezed the pulpy, sweet juice into our mouths, our cuffs stained yellow. How exactly am I to break away from this vivid, colorful image and study probability instead?

For in it there's a beauty more profound, but harder to obtain. She expects you to pursue her relentlessly while herself showing not the slightest interest, hard to flatter and turns down all attempts at gaining her hand. Although internally, she enjoys these nerdy suitors spending hours to please her and can't help herself from giving a faint smile, a slight blush now and then. This sometimes catches the eye of the suitor who's overjoyed, starts pursuing with the fullest might and not before long, inexplicable beauty is within reach!

What's absolutely essential to appreciate this subtle beauty is to turn one's head away from the mundane to the magical. To turn away from the various BuzzFeeds of the internet and instead surround oneself with activities involving mental effort, so that some learning happens in the perceptrons within! In essence, it's time I realize that unlike in my childhood, I stand to lose something if I idle around – the more long-lasting joy of cracking problems! This childhood artefact will have to be cleared out to have any hope of gaining her hand!

*the word potentially is among the most abused words in English

Tuesday, July 5, 2016

Life, Perception and Exponential Functions

Motivation to work is important. As we move forward, the sources from which we derive motivation change. In childhood, doing well in a set of exams is all that's required, and the motivations were simple - the rod in early childhood, some delight in learning, and competition in later stages. However, in college, there's usually no fixed goal to pursue. One tries a bunch of things, realizes the sub-field closest to one's calling and proceeds. Even so, in undergraduate college, things like maintaining a good GPA provide some form of a concrete goal.

But the cocoon is broken as one leaves for the next venture, be it a job at a company, a startup or graduate studies. One enters a stage where the goals to aspire for and the means to achieve them are no longer well-defined. Here we typically have to work with the perceived rewards in mind. And so I make the argument that our ability to perceive these rewards is central to success. It is here that I believe some vestiges of evolution lie in our way.

Isn't it surprising that sounds are reported to us in decibels instead of on a linear scale? It turns out that we, in fact, perceive sounds on a logarithmic, not linear, scale. This phenomenon is not restricted to just sounds. It carries forward to all forms of sensory perception, memory, and our spatial sense. Young children, when made to place numbers from 1 to 10 on a scale, place 3 right about the middle. Since log 3 ~ (log 10)/2, when seen in the logarithmic light, this is no longer as surprising.
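A quick back-of-the-envelope check of that last claim (nothing more than the arithmetic above):

```python
# Sketch: where does 3 sit on a logarithmic number line from 1 to 10?
import math

position = math.log10(3) / math.log10(10)
print(f"3 sits {position:.3f} of the way along a log scale from 1 to 10")
# ~0.477, just shy of the midpoint -- consistent with where children place it.
```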

In a recent work, Lav R. Varshney and John Z. Sun, using ideas from information theory, attempt to explain this. Their work shows that the logarithm, as the quantization function for sensory inputs, produces the least distortion if we assume the inputs to follow a power-law distribution. In practice, many natural inputs do follow power laws, such as word frequencies in the English language. In plain language, the logarithm allows us to best store and process information. To put it less technically, but at the cost of all rigor, our ancestors benefitted more from knowing whether they were facing one or two lions than from knowing whether there were 99 or 100 lions.

Now when it comes to long-term rewards, I claim that we perceive them to be at best linear in time. A simple thought experiment suggests this. Suppose we were to pose someone the question, "What technological advances do you predict by the end of 2100?". The intuitive way of answering this would be to provide a figure by extrapolating the progress that's happened in the previous 100 years. However, a more critical analysis reveals that, since innovations happen on top of existing innovations, progress more likely follows an exponential curve. Indeed, history suggests the same. Take a man from 1000 B.C. to 500 A.D. and he might see roughly what he expects. But take him from 500 A.D. to 2000 A.D., and he's bound to be held speechless for a few days at the least and permanently at the most.

Also, why else do we express such surprise at rags-to-riches stories, or glance at that super successful multi-billionaire start-up kid with smoldering jealousy? The famous 10,000-hours rule, that one can become a world-class expert in a reasonable topic of choice by putting in 10,000 hours of effort, might, in fact, have a lot of truth to it. Progress always happens in baby steps. In steps of 1, 2, 4, 8, 16, 32, and before you know it, the emperor runs out of money.

In a similar sense, seemingly meaningless investments suddenly start to give results. In a venture like a PhD, where one's success depends very much on the extent of knowledge one accumulates, this becomes crucial. Since knowledge is built upon previous knowledge, it's reasonable to assume that the utility of studying a certain topic also grows exponentially.

After this reasonable digression, getting back to the topic at hand - motivation - one strong motivation for writing this post is that it, in some ways, allows me (and the readers) to realize the value in seemingly inconsequential knowledge quests. What we have is an exponential reward function f(t), which we perceive only through its logarithm, a merely linear g(t). Exponentially large rewards remain within our reach if we only alter our perception!
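A toy rendering of that closing claim (the growth rate is an assumption of mine, purely for illustration): a reward that compounds exponentially, f(t) = (1 + r)^t, seen through a logarithmic lens as g(t) = log f(t), grows only linearly, so the later payoff looks far smaller than it really is.

```python
# Sketch: exponential reward f(t) vs. its logarithmically perceived version g(t).
import math

r = 0.10                               # assumed compounding rate per unit time
for t in [1, 5, 10, 20, 40]:
    f = (1 + r) ** t                   # actual reward
    g = math.log(f)                    # how a logarithmic perceiver "feels" it
    print(f"t={t:3d}  actual={f:8.1f}  perceived={g:5.2f}")
```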