In order to keep up with the expectations of the viewers (which I actually need not since it's mostly a null set), I've desisted from making any non-technical posts about the adventure that settling in at San Diego has been. As to the delay in posting here, suffice it to say that the clear sky, the beaches and the mild temperatures make for a very relaxed start to things. Now that it's been a few months, perhaps it's time to introduce my first research problem.

This problem has central motivations in increasing the popularity of this blog itself! Now, what makes a blog post? At the very basic level, one typically talks about 2 or more topics and in an elegant/novel fashion draws connections between them. However, as the recent elections have proven, popularity is driven more by the topics being of interest to viewers rather than the elegance itself (think Donald Trump).

Thus, one natural strategy is to first form a data-driven estimate of the set of people that are interested in various topics (say we have K different topics in total). Then, to use these estimates to make decisions on which topic to write a post on.

How do I go about doing this? Since I'd like to be a theoretician for the moment, at least until the point that I go looking for jobs, I'll simply do what every other theoretician does, assume that I already have these estimates. An interesting question now arises. To understand this, think of the following case first: If blog posts are to consist of only one topic, the optimal strategy is to simply choose the most popular topic.

However, say if two topics are to be included in the post, picking the first and the second-ranked topics might be sub-optimal. Moreover, this sub-optimality arises more often than not. Consider the case that the first rank goes to the topic "Kim Kardashian". Given this, it's entirely reasonable for the next rank to go to "Kanye West". Then, including these two topics might involve a large number of viewers which are in common to the two, thus disallowing us to leverage the presence of the second topic. One might, for instance, be better off including two very diverse topics such as "Kim Kardashian" and "Information Theory" to maximise the number of interested viewers.

More formally, we can think of each topic as corresponding to a set of users. Thus, we're interested in the two sets out of K, that when taken together gives the largest union. This problem has been widely studied under the title of "Data-Summarization". In one famous result, it was shown that an algorithm which picks one topic to be the first-ranked topic and the other to be that which creates the largest union with the former, gives us a solution that's within a factor of (1-1/e) of the optimal value. It was also shown that aspiring for anything more than this factor makes it an NP-hard problem.

However, if we impose additional constraints on our sets of interest, there's hope to find the optimal solution in complexity that's polynomial in K. We have promising results on some toy problems in this direction. But it remains to be seen if significant results can be obtained. Stay tuned for an update on either the toy problem if there's no success/a link to a paper if we do succeed.

This problem has central motivations in increasing the popularity of this blog itself! Now, what makes a blog post? At the very basic level, one typically talks about 2 or more topics and in an elegant/novel fashion draws connections between them. However, as the recent elections have proven, popularity is driven more by the topics being of interest to viewers rather than the elegance itself (think Donald Trump).

Thus, one natural strategy is to first form a data-driven estimate of the set of people that are interested in various topics (say we have K different topics in total). Then, to use these estimates to make decisions on which topic to write a post on.

How do I go about doing this? Since I'd like to be a theoretician for the moment, at least until the point that I go looking for jobs, I'll simply do what every other theoretician does, assume that I already have these estimates. An interesting question now arises. To understand this, think of the following case first: If blog posts are to consist of only one topic, the optimal strategy is to simply choose the most popular topic.

However, say if two topics are to be included in the post, picking the first and the second-ranked topics might be sub-optimal. Moreover, this sub-optimality arises more often than not. Consider the case that the first rank goes to the topic "Kim Kardashian". Given this, it's entirely reasonable for the next rank to go to "Kanye West". Then, including these two topics might involve a large number of viewers which are in common to the two, thus disallowing us to leverage the presence of the second topic. One might, for instance, be better off including two very diverse topics such as "Kim Kardashian" and "Information Theory" to maximise the number of interested viewers.

More formally, we can think of each topic as corresponding to a set of users. Thus, we're interested in the two sets out of K, that when taken together gives the largest union. This problem has been widely studied under the title of "Data-Summarization". In one famous result, it was shown that an algorithm which picks one topic to be the first-ranked topic and the other to be that which creates the largest union with the former, gives us a solution that's within a factor of (1-1/e) of the optimal value. It was also shown that aspiring for anything more than this factor makes it an NP-hard problem.

However, if we impose additional constraints on our sets of interest, there's hope to find the optimal solution in complexity that's polynomial in K. We have promising results on some toy problems in this direction. But it remains to be seen if significant results can be obtained. Stay tuned for an update on either the toy problem if there's no success/a link to a paper if we do succeed.

## No comments:

## Post a Comment