Wednesday, December 19, 2012

Do you really have big data?

There's a religious war brewing in the ML community. One on side are those who think large clusters are the way to handle big data. On the other side are those who think that high-end desktops offer sufficient power with less complication. I perceive this debate being driven by differing motivations for scaling.
  1. For the data. Essentially somebody wants to learn with as much data as possible to get the best possible model. A large cluster is likely to be the operational store for this data and marshalling the data onto a high-end desktop is infeasible, so instead the algorithm must run on the cluster. The archetypal problem is ad targeting (high economic value and high data volume) using text features (so that a bilinear model can do well, but features are zipf distributed so large data provides meaningful improvement).
  2. For the compute. The amount of data might be relatively modest but the learning problem is computationally intense. The archetypal problem here is deep learning with neural networks (high compute cost) on a natural data set (which are typically sized to fit comfortably on a desktop, although that's changing) in raw representation (thus requiring non-convex ``feature discovery'' style optimization).
So even if you are only scaling for the compute, why not use a cluster? A high-end desktop with multiple cores and multiple GPUs essentially is a (small) cluster, but one with a high-speed interconnect. This high-speed interconnect is key when using SGD to solve a non-convex optimization problem. SGD is an amazingly versatile optimization strategy but has an Achilles' heel: it generates many small updates which in a distributed context means high synchronization cost. On a single machine it is easier to mitigate this problem (e.g., via minibatching and pipelining) than on a cluster. On a commodity-switched cluster, at the moment, we pretty much only know how to solve convex optimization problems well at scale (using batch algorithms).

The relatively limited bandwidth to the single desktop means that single-machine workflows might start with data wrangling on a large cluster against an operational store (e.g., via Hive), but at some point the data is subsampled to a size managable with a single machine. This subsampling might actually start very early, sometimes at the point of data generation in which case it becomes implicit (e.g., the use of editorial judgements rather than behavioural exhaust).

Viewed in this light the debate is really: how much data do you need to build a good solution to a particular problem, and it is better to solve a more complicated optimization problem on less data or a less complicated optimization problem on more data. The pithy summarizing question is ``do you really have big data?''. This is problem dependent, allowing the religious war to persist.

In this context the following example is interesting. I took mnist and trained different predictors on it, where I varied the number of hidden units. There are direct connections from input to output so zero hidden units means a linear predictor. This is a type of ``model complexity dial'' that I was looking for in a previous post (although it is far from ideal since the step from 0 to 1 changes things from convex to non-convex). Unsurprisingly adding hidden units improves generalization but also increases running time. (NB: I chose the learning rate, number of passes, etc. to optimize the linear predictor, and then reused the same settings while just adding hidden units.)


Now imagine a computational constraint: the 27 seconds it takes to train the linear predictor is all you get. I randomly subsampled the data in each case to make the training time around 27 seconds in all cases, and then I assessed generalization on the entire test set (so you can compare these numbers to the previous chart).


For mnist it appears that throwing data away to learn a more complicated model is initially a good strategy. In general I expect this to be highly problem dependent, and as a researcher or practitioner it's a good idea to have an intuition about where your preferred approach is likely to be competitive. YMMV.

Friday, December 14, 2012

Spectral Methods for Latent Models

There were several hot trends at NIPS this year and I'll give my broad take on the conference in a later post. Right now I want to dig into a particular line of research that is moving quickly and looks extremely promising: Spectral Algorithms for Latent Variable Models. This was my third major exposure to the topic and it finally clicked; as we shall see, it is appropriate that I required 3 observations.

The tl;dr version is as follows:
  1. Spectral methods for latent variable models are based upon the method of moments rather than maximum likelihood. The hope is that the method of moments can avoid the E-step (``inference'') during learning and thus be more computationally efficient.
  2. Moments of observations with related latent values have low-rank structure which when decoded identifies the model.
  3. If observations are sufficiently high-dimensional, third moments are sufficient to identify the model (the ``blessing of dimensionality''). In particular directions arising from an orthogonal decomposition of a symmetric tensor derived from the first three moments (``spikey directions'') reveal the latent structure.
Everything here is based upon the pioneering work of a group of researchers including Animashree Anandkumar, Dean P. Foster, Rong Ge, Daniel Hsu, Furong Huang, Sham M. Kakade, Yi-Kai Liu, Matus Telgarsky, and possibly others I'm overlooking (sorry in advance!); Anandkumar et. al. provided the specific inspiration for this blog post.

The Method of Moments

A probabilistic latent variable model is in some sense not that different than any other parametric probabilistic model, since one never actually observes the parameters of the probability distribution, only realizations. For instance following the Wikipedia example, if you are given a sample $\{ x_i \}$ drawn iid from a Gamma distribution \[
\begin{aligned}
X &\sim \Gamma (\alpha, \beta), \\
\frac{d \Gamma (\alpha, \beta)}{d \mu} &\doteq g (x; \alpha, \beta) = \frac{x^{\alpha-1} e^{-x/\beta}}{\beta^\alpha \Gamma (\alpha)},
\end{aligned}
\] then you can estimate $\alpha$ and $\beta$ from the sample either using maximum likelihood or by the method of moments. To use the method of moments, relate the parameters you care about $(\alpha, \beta)$ to expectations of observable quantities, \[
\begin{aligned}
\mathbb{E}[X] &= \alpha \beta, \\
\mathbb{E}[X^2] &= \beta^2 \alpha (\alpha + 1),
\end{aligned}
\] then replace the expectations with sample estimates and solve for the parameters.

Historically the method of moments was superseded by maximum likelihood mainly because of statistical efficiency, which in machine learning terms is a sample complexity issue. Nowadays, however, we have lots of data and are essentially compute bound (as evidence for this assertion, consider the Big Learning NIPS workshop). So the bottom line is, if the method of moments is computationally more tractable this trumps sample complexity issues. One reason to believe method of moments based approaches will be substantially cheaper computationally is that maximum likelihood for latent variable models essentially looks like EM, and the E-step (``inference'') is expensive; whereas the method of moments avoids the E-step during learning.

Observations with Related Latent Structure

Consider the following simple latent variable model: flip a biased coin, on the basis of that coin flip select one of two biased coins, then flip that coin and report the heads or tails, \[
\begin{aligned}
Z &\sim \mathcal{B} (\pi, 1), \\
X | Z &\sim \mathcal{B} (q_Z, 1).
\end{aligned}
\] There are 3 unknowns here, so intuitively we need 3 equations. Let's start forming moments, starting with the mean value, \[
\begin{aligned}
\mu &\doteq \mathbb{E}[X] = \pi q_0 + (1 - \pi) q_1. \\
\end{aligned}
\] So far so good, now let's try a second moment and consider the expected value of the product of two observations, \[
\begin{aligned}
\mathbb{E}[X_1 X_2] &= \pi^2 q_0^2 + 2 \pi (1 - \pi) q_0 q_1 + (1 - \pi) q_1^2 = \mu^2.
\end{aligned}
\] Oops. Actually we didn't need all that algebra: since the observations are iid, all of the higher order moments will be powers of $\mu$, and there is no additional information in products of observations of the above form. This is not a limitation of the method of moments, as maximum likelihood will also fail given only this information. Fundamentally given only the above information there is no way to distinguish between a mixture of two very biased coins and a different mixture of two midly biased coins.

Suppose instead you had different information: you are told that pairs of observations share the same (unknown!) latent value. \[
\begin{aligned}
\mathbb{E}[X_1 X_2 | Z_1 = Z_2] = \pi q_0^2 + (1 - \pi) q_1^2.
\end{aligned}
\] Aha! A second piece of information. Need we need just 1 more, so consider triples of observations that share the same (unknown!) latent value, \[
\begin{aligned}
\mathbb{E}[X_1 X_2 X_3 | Z_1 = Z_2 = Z_3] = \pi q_0^3 + (1 - \pi) q_1^3.
\end{aligned}
\] Now we have 3 equations and assuming that $q_0 \neq q_1$ we can estimate both $q_0$, $q_1$, and $\pi$, \[
\begin{aligned}
\mathbb{E}[X] &= \pi q_0 + (1 - \pi) q_1, \\
\mathbb{E}[X_1 X_2 | Z_1 = Z_2] &= \pi q_0^2 + (1 - \pi) q_1^2, \\
\mathbb{E}[X_1 X_2 X_3 | Z_1 = Z_2 = Z_3] &= \pi q_0^3 + (1 - \pi) q_1^3,
\end{aligned}
\] where in practice we replace expectations with sample means.

The key point is that sets of observations that have related latent structure are the key to identifiability. In the above example the latent value was exactly the same, but (surprise!) it turns out sufficient to know that the latent value was drawn from the same distribution (aka, ``from the same document'').

Blessing of Dimensionality

Do we need to go to higher order for more complicated latent structures? Let's modify the model so that there are three latent states. \[
\begin{aligned}
Z &\sim \mathrm{Trinomial} (\pi), \\
X | Z &\sim \mathcal{B} (q_Z, 1).
\end{aligned}
\] Now we have five unknowns ($\pi_0, \pi_1, q_0, q_1, q_2$) so it would appear we have to go to fifth order statistics to identify the parameters. However it turns out this is a consequence of the observations being single coin flips. If the observations are of sufficiently high dimension, then (surprise!) third order statistics suffice. Let's modify the model again so that the observations are vector valued, \[
\begin{aligned}
Z &\sim \mathrm{Trinomial} (\pi), \\
\mathbb{E}[X | Z] &\sim \mu_Z.
\end{aligned}
\] Note we can recover the previous binomal observation model by utilizing the one-hot encoding $\mu_Z = (1 - q_Z, q_Z)$, but now we''ll consider $\mu_Z$ to be $d > 2$ dimensional. At first blush this appears to provide no advantage, since we have extra information per observation but also extra parameters to estimate. However the information content in the higher moments grows faster than the number of extra parameters, allowing third order moments to suffice. To see this expand out the moments, \[
\begin{aligned}
\mathbb{E}[X] &= \sum \pi_i \mu_i, \\
\mathbb{E}[X_1 \otimes X_2 | Z_1 = Z_2] &= \sum \pi_i \mu_i \otimes \mu_i, \\
\mathbb{E}[X_1 \otimes X_2 \otimes X_3 | Z_1 = Z_2 = Z_3] &= \sum \pi_i \mu_i \otimes \mu_i \otimes \mu_i.
\end{aligned}
\] Naive parameter counting suggests that the first and second moments could be sufficient to identify the model, but the $\mu_i$ are not orthogonal so we can't just read off the low-rank structure of the covariance matrix with, e.g., SVD. However we can use SVD on the covariance to construct an orthonormal basis for the span of $\mu_i$, and in that basis the trivariance tensor has a unique orthogonal decomposition whose eigenvectors determine the $\mu_i$.

The previous argument fails if the $\mu_i$ are not linearly independent, and in particular this implies the dimensionality of the observations has to be at least the cardinality of the latent variable. Fortunately, we often have very high dimensional data in machine learning, a circumstance that usually creates problems (a ``curse of dimensionality'') but here creates an opportunity to identify rich latent structure from low order moments (a ``blessing of dimensionality'').

For more complicated latent models, the details of how the first three moments are manipulated to extract the desired latent parameters changes, but the basic strategy is to reduce the problem to an orthogonal decomposition on a tensor constructed using the first three moments. This tensor decomposition is a particular optimization problem, and due to its broad applicability I suspect it could be the new ``$L_2$-penalized hinge loss'', i.e., a fair bit of machine learning in the near future could be characterized as figuring out how to (approximately) solve this particular optimization problem quickly.

Sunday, December 2, 2012

Model Complexity, Data Resources, and Computational Constraints

Abu-Mostofa in one of his awesome video lectures (Lecture 8 @ 44:45 into the video) makes the point ``match the model complexity to the data resources, not the target complexity.'' However in big data machine learning this is not what is done. For example, if you want to win a Kaggle competition involving a dataset with circa 105 rows and circa 102 columns you use a giant collection of boosted decision trees (ensembled with other stuff). Scale those numbers by 4 orders of magnitude, however, and (primal) linear predictors or nearly naive-Bayes style approaches are dominant. More data, simpler models. Huh?

What's happening is that computational constraints dominate. You might consider that a pure engineering issue of needing to scale up the more complicated models with distributed learning, GPUs, etc. However, you could also use all the extra horsepower to feed even more data to a simpler model. So it begs the question: is it worth throwing away some of your data resources in order to learn a more complex model? Or is it better to throw away some of your model complexity to retain more data resources? Note I'm just considering the quality of the final model given the amount of data and computation that went into producing the model: in practice things like model evaluation time when used to predict are critical for consumer-facing internet services but let's ignore that.

In my recent career I just assumed it was better to trade model complexity (less) for data resources (more), but I never really investigated or questioned this. Recently I started playing around on Kaggle and I noticed there are many interesting data sets that are not that big, and the top-heavy tournament-style payoffs incent the use of models more complicated than what I've typically used professionally. That got me itching to add some more powerful techniques to vee-dub and now there's a neural network reduction but guess what? It's s-l-o-w. Vee-dub has distributed learning so voila I have a distributed neural network learning system but if I'm going to eat an entire cluster for a problem should I use relatively fast linear learning and more data, or relatively slow neural network learning and therefore less data? There's a data-model complexity frontier here, and the correct tradeoff is unclear.

There are lots of examples where, for a fixed algorithm, more data is better, e.g., Agarwal et. al.; and there are lots of examples where, for a fixed data set, more complicated models are better than simpler ones, e.g., mnist. Imagine a 3-dimensional plot where the z-axis is ``model awesomeness'', the x-axis being data resources, and the y-axes being model complexity. These results are basically one-dimensional axis-parallel statements about this two-dimensional function. Are there any results that compare across both dimensions? What would be awesome would be a study that compared, e.g., terascale linear learning to gigascale deep learning on the same problem, attempting to use the same computational resources with each technique.

I suspect that for terascale learning problems hugging close to the linear predictor is a good idea (for now!), but there is room for some modest improvement in modeling power given currently available data resources and computational constraints. That suggests that the neural network reduction as written is undesirable for this regime, because it doesn't have direct connections from the input to the output layer. If I add in those extra connections than hopefully there will be no harm and some modest benefit from adding a relatively small number of hidden units (although I've had difficulty learning low-rank latent models simultaneously with a linear predictor so nothing is guaranteed).

For the Kaggle zone, I'm pretty sure being able to throw dozens of machines at circa 106 rows with circa 103 hidden units is going to be awesome.

Tuesday, November 27, 2012

Unpimp Your Sigmoid

Nikos and I figured out how to implement a neural network as a vee-dub reduction. The basic idea is to create fake examples for the hidden units such that the gradient of the loss function computed by the core corresponds to backpropagation. The result strongly resembles the iteratively reweighted least squares approach to neural network learning of Warner and Misra. One difference is that Warner and Misra focused on a particular second-order learning scheme, whereas we don't specify the learning algorithm, we just consider the vee-dub core as a GLM solver and construct the augmented design matrix online. This allows us to leverage the different learning algorithms in vee-dub, which are the SGD variants and L-BFGS. SGD is typically faster than L-BFGS in the single box setting, especially with the enhancements implemented in vee-dub. So the main reason to entertain using L-BFGS is for distributed learning.

Some notes:
  1. No frills. One hidden layer and the activation function is tanh, the only configuration is the number of hidden units.
  2. Composition with other reductions should work but is not extensively tested. In isolation the reduction does binary classification and regression.
  3. It's not a complete dog but it's not as fast as possible either: more optimizations are possible. Nonetheless if your problem is high-dimensional and sparse the basic infrastructure of vee-dub (i.e., parsing, hashing, etc.) should produce a significant win with respect to neural network implementations that assume dense feature vectors.
  4. It should work with any available loss function and all available learning algorithm variants.
  5. Quadratics and ngrams work and are applied at the input-to-hidden layer.

Those who enjoy solving toy problems will be pleased to know that vee-dub can now solve 3-parity with two hidden units.
pmineiro@pmineirovb-931% ./make-parity 3
-1 |f  1:1 2:-1 3:-1
-1 |f  1:-1 2:1 3:-1
1 |f  1:1 2:1 3:-1
-1 |f  1:-1 2:-1 3:1
1 |f  1:1 2:-1 3:1
1 |f  1:-1 2:1 3:1
-1 |f  1:1 2:1 3:1
1 |f  1:-1 2:-1 3:-1
pmineiro@pmineirovb-932% ./make-parity 3 | ../vowpalwabbit/vw --nn 2 --passes 2000 -k -c --cache_file cache -f model -l 10 --invariant                          final_regressor = model
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
decay_learning_rate = 1
randomly initializing neural network output weights and hidden bias
creating cache_file = cache
Warning: you tried to make two write caches.  Only the first one will be made.
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
1.550870   1.550870            3         3.0   1.0000  -1.0000        4
1.919601   2.288332            6         6.0   1.0000   0.7762        4
2.011137   2.120980           11        11.0   1.0000  -1.0000        4
2.154878   2.298620           22        22.0   1.0000   0.3713        4
2.354256   2.553635           44        44.0  -1.0000   1.0000        4
2.286332   2.216827           87        87.0  -1.0000   1.0000        4
2.222494   2.158657          174       174.0   1.0000   0.8935        4
1.716414   1.210335          348       348.0  -1.0000  -0.9598        4
1.368982   1.021549          696       696.0   1.0000   0.9744        4
1.151838   0.934694         1392      1392.0   1.0000   1.0000        4
0.976327   0.800816         2784      2784.0   1.0000   1.0000        4
0.756642   0.536958         5568      5568.0   1.0000   1.0000        4
0.378355   0.000000        11135     11135.0  -1.0000  -1.0000        4

finished run
number of examples = 16000
weighted example sum = 1.6e+04
weighted label sum = 0
average loss = 0.2633
best constant = -6.25e-05
total feature number = 64000
pmineiro@pmineirovb-933% ./make-parity 3 | ../vowpalwabbit/vw -i model -t -p /dev/stdout --quiet
-1.000000
-1.000000
1.000000
-1.000000
1.000000
1.000000
-1.000000
1.000000
With -q ff, I can do 4-parity with 2 hidden units. Woot.

Going more realistic, I tried mnist. The task is multiclass classification of handwritten digits, so I used the one-against-all reduction with the neural network reduction. The raw pixel values are decent features because the digits are size-normalized and centered. An example line looks like
pmineiro@pmineirovb-658% zcat test.data.gz | head -1
 |p 202:0.328125 203:0.72265625 204:0.62109375 205:0.58984375 206:0.234375 207:0.140625 230:0.8671875 231:0.9921875 232:0.9921875 233:0.9921875 234:0.9921875 235:0.94140625 236:0.7734375 237:0.7734375 238:0.7734375 239:0.7734375 240:0.7734375 241:0.7734375 242:0.7734375 243:0.7734375 244:0.6640625 245:0.203125 258:0.26171875 259:0.4453125 260:0.28125 261:0.4453125 262:0.63671875 263:0.88671875 264:0.9921875 265:0.87890625 266:0.9921875 267:0.9921875 268:0.9921875 269:0.9765625 270:0.89453125 271:0.9921875 272:0.9921875 273:0.546875 291:0.06640625 292:0.2578125 293:0.0546875 294:0.26171875 295:0.26171875 296:0.26171875 297:0.23046875 298:0.08203125 299:0.921875 300:0.9921875 301:0.4140625 326:0.32421875 327:0.98828125 328:0.81640625 329:0.0703125 353:0.0859375 354:0.91015625 355:0.99609375 356:0.32421875 381:0.50390625 382:0.9921875 383:0.9296875 384:0.171875 408:0.23046875 409:0.97265625 410:0.9921875 411:0.2421875 436:0.51953125 437:0.9921875 438:0.73046875 439:0.01953125 463:0.03515625 464:0.80078125 465:0.96875 466:0.2265625 491:0.4921875 492:0.9921875 493:0.7109375 518:0.29296875 519:0.98046875 520:0.9375 521:0.22265625 545:0.07421875 546:0.86328125 547:0.9921875 548:0.6484375 572:0.01171875 573:0.79296875 574:0.9921875 575:0.85546875 576:0.13671875 600:0.1484375 601:0.9921875 602:0.9921875 603:0.30078125 627:0.12109375 628:0.875 629:0.9921875 630:0.44921875 631:0.00390625 655:0.51953125 656:0.9921875 657:0.9921875 658:0.203125 682:0.23828125 683:0.9453125 684:0.9921875 685:0.9921875 686:0.203125 710:0.47265625 711:0.9921875 712:0.9921875 713:0.85546875 714:0.15625 738:0.47265625 739:0.9921875 740:0.80859375 741:0.0703125
Here are some results. \[
\begin{array}{c|c|c}
\mbox{Model } &\mbox{ Test Errors } &\mbox{ Notes } \\ \hline
\mbox{Linear } &\mbox{ 848 } & \\
\mbox{Ngram } &\mbox{ 436 } & \verb!--ngram 2 --skips 1! \\
\mbox{Quadratic } &\mbox{ 301 } & \verb!-q pp! \\
\mbox{NN } &\mbox{ 273 } & \verb!--nn 40!
\end{array}
\] Quadratic gives good results but is s-l-o-w to train (each example has > 10000 features). NGram gives some of the boost of quadratic and is fairly zippy (due to the encoding these are horizontal n-grams; vertical n-grams would presumably also help). A 40 hidden unit neural network does better than Quadratic and is also faster to train. The command line invocation looks something like this:
pmineiro@pmineirovb-74% ./vw --oaa 10 -l 0.5 --loss_function logistic --passes 10 --hash all -b 22 --adaptive --invariant --random_weights 1 --random_seed 14 --nn 40 -k -c --cache_file nncache train.data.gz -f nnmodel

Sunday, October 21, 2012

Bagging!

My recent Kaggle competition experience impressed upon me the need for faster model selection. My original plan was to create a cross-validating vee-dub plugin (reduction), but Nikos convinced me that bagging was both more versatile and more intellectually coherent. The basic idea is to multiplex the weight vector inside vee-dub and learn an ensemble of models simultaneously, each one experiencing a different bootstrapped version of the input sequence. There are three benefits that result:
  1. Model selection. Out-of-bag estimates function analogously to cross-validated estimates. Monitoring out-of-bag estimates instead of progressive validation loss is a better idea when looping over a data set multiple times.
  2. Confidence intervals. When developing a standalone decision system, a point prediction is usually sufficient; but when developing a subsystem to be integrated into a larger decision system, interval predictions provide a richer summary to downstream consumers.
  3. Better results. Ensembling can improve model generalization due to a bias-variance tradeoff. I haven't seen any real lift on the data sets I've tried so far, however, and this is understood and expected: linear models have low variance and so don't benefit from bagging. To benefit from bagging you have to do something whose final output is more sensitive to differences in the training data, e.g., neural networks, trees, or linear models with automated data-driven feature selection.
Model selection was my original motivation, so here's an example leveraging the KDD Cup 98 data. The contest provided hundreds of features for prediction but the winning entries only used a handful, indicative that model selection was very important. The contest is as follows: you are asked to decide to whom to send solicitation (snail) mail; each mail costs 68 cents. The training set consists of pairs $(x_i, r_i)$ of features $x_i$ and response donation amounts $r_i$ corresponding to previous mail campaigns. The evaluation set consists of just the $x_i$ (except that the contest is over so we can actually peek and see the $r_i$ for the evaluation set); submissions submitted an estimated donation $\hat r_i$ for each instance in the evaluation set and then were scored according to $1_{\hat r_i > 0.68} (r_i - 0.68)$. In other words, even though the decision was whether-or-not to mail somebody, the decision was transmitted via the estimated response being greater than or less than 68 cents; otherwise, there was no evaluation of the accuracy of the estimated response, and the actual total profit experienced was the scoring criterion.

We can model this as an importance-weighted classification problem, transforming pairs $(x_i, r_i)$ into triples $(x_i, y_i, c_i)$ according to \[
(x_i, r_i) \to (x_i, 1_{r_i > 0.68}, |r_i - 0.68|).
\] For the evaluation set we can decode the decision whether-or-not to mail directly as the model's predicted class. In this case the constant predictor corresponds to a policy that mails no-one or mails everyone; as anybody who has ever had a mailbox will attest, it is more profitable to mail everyone than to mail no-one so ``mail everybody'' is the baseline policy to beat. (Parenthetically: maybe instead of machine learning trade-secret spam filters, we should just machine learn and openly distribute a model of who actually wants to buy Viagra and save everybody else from being bothered.)

I put together a naive ``kitchen-sink'' model by taking every feature in the training set, encoding dates numerically as months since January 1900, encoding numerics using splines with empirical deciles as knot-points, and encoding all other variables categorically. I also put together a model that has 10 features in it gleaned from looking at the write-ups of the winners (encoded using the same strategy as the kitchen-sink). Here are the results: \[
\begin{array}{c|c|c|c}
\mbox{Model } &\mbox{ Training Set Profit } &\mbox{ Training Set Profit (out-of-bag) } &\mbox{ Evaluation Set Profit } \\ \hline
\mbox{Mail everybody } & \$10788.54 & \$10704 \pm \$529 & \$10560.08 \\
\mbox{10 features } & \$17596.35 & \$14617 \pm \$434 & \$14896.07 \\
\mbox{Kitchen sink } & \$24353.80 & \$493 \pm \$80 & \$154.92 \\
\end{array}
\] So the kitchen sink severely overfits the training set, but out-of-bag estimates detect this. Woot.

Unfortunately the bagging reduction is not really faster than running multiple vee-dubs. It is faster for simple models where I/O overhead is substantial, but for more complicated models amortizing the I/O is only a modest benefit. It is more convenient to have everything orchestrated internally, and the reduction should compose with the distributed learning capabilities of vee-dub, but in the laptop zone it's not a huge win.

Of course once you have bagging, the natural next step is to try some unstable learning methods. Stay tuned!




Tuesday, October 2, 2012

Kaggle Competition Tips

I recently entered into a Kaggle competition for the first time. Overall it was positive experience and I recommend it to anyone interested in applied machine learning.

It was a private competition so I can only discuss generalities, but fortunately there are many. The experience validated all of the machine learning folk wisdom championed by Pedro Domingos, although the application of these principles is modified by the top-heavy metric-driven nature of the payouts. As in poker, where strategic adjustments are required between tournament and table play, machine learning competitions incent techniques that would be improper in a normal industrial setting.

Evaluation

For the competition we started by understanding the evaluation metric. In real-life (aka ``table play''), the goal of a machine learning system is often under-specified (e.g., ``improve relevance'' or``detect analomies'') or intractable to optimize directly due to aggregate impact and uncontrolled external factors (e.g., ``improve long-term customer retention and satisfaction''). Under these conditions proxy metrics are used, but hopefully nobody takes them too seriously because everybody understands the proxy metric was chosen for convenience.

In a competition (aka ``tournament play'') the evaluation metric determines how much you get paid. Ergo, it needs to be taken very seriously. The first step is to attempt to reproduce the evaluation metric using a simple model (e.g., the constant model), which is typically a known function applied to an unknown data set. Methods for assessing generalization error ($k$-fold cross-validation, out-of-bag estimation) can be used but it is important to remember that you are not trying to generalize to a future data set drawn from the same distribution, but rather a particular fixed data set used for evaluation. Note there is absolutely no problem with over-fitting the evaluation data set if possible. Whether this is possible depends upon how many intermediate submissions you are allotted and how submissions are scored, and whether you can see the evaluation features.



Feature Engineering

Here tournament and table play machine learning are more aligned, as feature engineering plays a critical role in both. However since the margin of victory in competitions is very slim, tournament feature engineering is more frantic. Put simply, if it appears to improve the model at all, it stays in: issues of interpretability, implementation complexity, maintainability, or evaluation time are unimportant in tournaments, except to the extent that they will cause problems for the modeler during the duration of the competition. In table play features are subject to a more nuanced cost-benefit analysis which accounts for these other factors.

Since feature engineering is model selection, it is important to monitor the evaluation metric. During the feature engineering phase we used vee-dub exclusively in order to turn models around as quickly as possible.

Time to un-pimp your features.
It's worth investing some time during the competition improving the speed at which you can turn around an evaluation estimate. I made a few, but I missed a big one which I only realized after the contest ended: I was doing 10-fold cross-fold validation by invoking vee-dub 10 times with different inputs, but this is best done internally in vee-dub via the reduction interface in order to amortize the I/O. Hopefully I can get that implemented in the near future.

Data shaping was also very important and relatively inefficient. By data shaping, I mean choosing how to represent a particular feature to the learning algorithm (e.g., try taking the log of this thing and setting some spline knots here, here, and here). Being able to specify this kind of data shaping directly to vee-dub would facilitate fast experimentation and parameter sweeping, again by amortizing I/O.

Finally, there was certain ``features'' we found in the data that were clearly artifacts of how the competition data was collected and unlikely to ``truly generalize.'' For tournament play, the issue is simply whether they would generalize to the evaluation set, so we prepared paired submissions to assess. In table play, these features would be not be utilized at all unless the predictive lift was significant, and even then they would be cautiously assessed with additional data due to the clear lack of face validity.

By the way, we did discover (apparently) real features as well which should generate insight for the contest promoter. While the incentives associated with competitions are not perfect, mostly it still boils down to doing good work.

Model Averaging

Our winning submission consisted of several machine learned models trained using completely different procedures and averaged together. This is now standard tournament technique, although I've never used it explicitly in table play (implicitly yes, via boosting and bagging; but averaging a tree and a neural network is not something I've ever done in a commercial setting). I was absolutely shocked how effective this was: average a model with another model which is demonstrably worse, and the result is better then either one, thanks to a bias-variance tradeoff. Getting something for nothing quickly becomes an addictive habit, but the best results come from combining models that were trained using completely different methods (i.e., models that do not covary), and it becomes harder to find completely different methods that generate competitive results (e.g., we tried metric-based methods but these were so poor that we did not use them in the final ensemble).

Gettin' Real

The best part about tournament play is the intensity of competition. In table play, you are typically working cooperatively with other machine learning experts to produce a good model that will benefit the business. Although you might be competing with other businesses also leveraging machine learning technology, the relationship between the fortunes of your employer and the quality of your machine learning model is obscured. In tournament play, the teams doing better than you on the leaderboard directly indicate the suboptimality of your current technique; and if you do happen to find yourself at the top of the heap, the minuscule and ephemeral nature of your lead will drive you to find further improvements and insights. I found myself working far more than I originally anticipated.

Happy Kaggling!

Wednesday, August 1, 2012

Scalable Semi-supervised Deep Learning

While I've always been sympathetic to the goals of deep learning researchers, in practice deep learning techniques have not played a significant role in my career thus far. I've chosen to work for internet companies that have large amounts of data (behavioural data exhaust) in mostly discrete or textual form (e.g., text search ads, dating profiles, and tweets). Thus I have relied exclusively on shallow learning algorithms (linear or mildly nonlinear) and hand-tuned feature construction.

If deep learning were great, there would be no need to pay people like myself to massage the raw data into a form that is more amenable to shallow learning algorithms. I also don't think history will judge the ability of human feature engineering kindly. Textual data is an exception because it's easy to work with: naive encodings (e.g., bag of bigrams) are a great starting point, and there are many sources of side-information available that can be joined up against tokens. The computer vision community, in contrast, seems to face a much harder problem. Although historically their labeled data sets were small by internet standards, this has been mitigated in recent years due to crowdsourcing and public database definition efforts. So the problem seems to be that pixels are a harder initial representation to work with than text. This is intuitive: text is occasionally misspelled, polysemous, or idiomatic; but pixels have to deal with lighting, pose, occlusion, etc.

Lately a team of deep learning and computer vision people at Stanford, Google, and New York University have been exciting progress, as evidenced by this presentation and this video. The basic idea is to leverage large-scale unsupervised techniques to generate features and then leverage those features for multiple supervised learning problems, i.e., a semi-supervised architecture. For the unsupervised part they use sparse encoder that has a neural network interpretation. Notably they show excellent results across a variety of problems not just in computer vision.

Another popular approach to unsupervised learning is generative probabilistic modeling. Historically this has been computationally expensive since the neither of the two most popular technique families (Monte Carlo and variational methods) are speed daemons. If generative models can be scaled efficiently, maybe the sparse encoder could get some competition.

Therefore I'm very excited about the Hoffman et. al. paper about Stochastic Variational Inference (SVI). SVI is essentially the analog of stochastic gradient descent (SGD) for variational methods. Note SVI is already a proven technique as the basis for the Vowpal Wabbit (VW) Latent Dirichlet Allocation (LDA) implementation, with one important caveat: the VW LDA implementation is extremely fast but only uses one core. This caveat and the analogy with SGD should raise a red flag here. SGD typically dominates when applicable on a single core. Multi-core SGD also typically dominates when the gradient computation has a bit of heft, as it does in the generative models to which we'd like to apply SVI. Distributed SGD, on the other hand, is not a straightforward slam dunk, since i/o considerations start to dominate. Correspondingly, distributed SVI might not work well in practice, at least not for certain generative models without implementation tricks. Nonetheless it is worth having a go, since SVI is broadly applicable. (Note some of cool stuff the Google guys did can be summarized as ``making distributed SGD work'').

Another promising direction is the resurgence of linear algebra methods for latent models such as LDA. Remember when semantic modeling and SVD were synonymous? (If so, you also remember Excite and Lycos.) Amazingly, Anandkumar et. al. show that the latent topic structure in LDA can be recovered using two SVDs and third order (trigram) statistics. There has been significant effort in making scalable distributed linear algebra libraries, so this line of research could ultimately yield the fastest latent factor model implementations.

Monday, July 30, 2012

Technology Jobs

Let me preface this by saying that this in no way reflects my experiences at Microsoft thus far, or indeed any work experience I've had in the past 10 years. Nonetheless for some reason I woke from a dream early this morning and this was in my head: those just starting out in the technology industry might appreciate it.
By the way that arxiv paper is Classic Nintendo Games are (NP-)Hard by Aloupis et. al.

Saturday, July 28, 2012

New Job

When parts of Yahoo research were traded to Microsoft, a unique opportunity opened up. Consequently, today was my first week at Microsoft as part of the Cloud and Information Services Laboratory (CISL). The team will focus on advancing big-data clustered machine learning, and some truly superlative people are involved. Now you know as much as I do! Hopefully you will organically encounter CISL over time via our external impacts.

Microsoft is without exaggeration 3 to 4 orders of magnitude larger than the typical company I've worked at in the past 7 years. Since there is super-linear scaling in creativity with scale for cities, I'm hoping something similar holds for workplaces :)

Tuesday, July 10, 2012

How to Best a Human with a Spreadsheet

I gave a presentation today on practical tips for solving contextual bandit problems on the internet. Unless one faces a complete cold start situation, there is an existing system making decisions which you are expected to improve. Such systems are often not explicitly constructed decision theoretically, and a common case is a pile of heuristics developed by a set of plucky humans armed with spreadsheet software. This led naturally to the topic of how to beat such systems.

It's important to note what is likely not to work: taking the same information used by the human-spreadsheet algorithm (HSA), applying the latest whiz-bang Machine Learning algorithm to it, and hoping to do better. You might get lucky but especially in a consulting context it's good to stack the deck in your favor. The reason why this is not a good strategy is that Machine Learning algorithms are at their core optimization algorithms, but HSA is also an optimization algorithm, and given enough time HSA can be quite competitive so it's important to understand its weaknesses.

HSA tends to use counting estimators driven by a small number of features (a typical number is 5). These features can themselves be nonlinear hand-tuned combinations of small numbers of base pieces of information (again, typically 5), which can often be found in other ``tabs'' in the spreadsheet. Therefore I find it useful to think of HSA as a low-dimensional deep-learning algorithm.

The deep-learning in HSA is not state-of-the-art but given that it has typically had several man-years of running time, it can be pretty good. Therefore you are asking for trouble hoping to best HSA purely by improved deep-learning. The true Achilles heel of HSA is the low-dimensionality: taking the union of all the base pieces of information in all the tabs and you are generally looking at less than 20 pieces of information. Furthermore, the pieces of information are typically dense in the original data stream.

This suggests a strategy for the win: look for additional pieces of information available to the decision system not considered by the HSA. In particular look for sparse features which individually occur too rarely to support counting statistics, but which in concert provide lots of information (raw text fields can have this property).

Another potential weakness of HSA is the manual aspect and the interaction with nonstationarity. Sometimes the heuristics in the serving system are not frequently updated, in which case automation and processing more recent data can provide lift. Sometimes the HSA contains separate different seasonal models driven by subsets of the data, in which case you do better by training on all the data and including temporal, seasonal, and holiday features (recency weighting the historical data is a good idea here: also make sure any cross-validation is done temporally out-of-sample).

In any event I would advise a healthy respect for HSA-driven systems. Naturally when I first started I figured such systems would be easy targets and I've learned humility ``the hard way.''

Monday, July 2, 2012

P-U Learning and Pointwise Classification

Once you become aware of something you see it everywhere; in particular, once I learned about P-U problems I started seeing examples of people (implicitly!) treating P-U datasets like ordinary datasets. Surprisingly this seems to generally work out, and I found a paper by Elkan and Noto called Learning Classifiers from Only Positive and Unlabeled Data which helps explain why.

In the Elkan and Noto model, there is a distribution $D$ on $X \times \{ 0, 1 \}$ sampling from which would constitute a traditional data set. However instead we sample from a distribution $D^\prime$ on $X \times \{ 0, 1 \}$ defined via
  1. Draw $(x, y) \sim D$.
  2. If $y = 0$, then $s = 0$.
  3. If $y = 1$, then $s = 1$ with probability $p (s = 1 | y = 1)$ independent of $x$; otherwise $s = 0$.
  4. Output $(x, s)$.
Note these are stronger assumptions then are utilized when optimizing AUC for a P-U problem; in that case one only has to assume the ability to iid sample from the positive label and unlabeled distributions. Here we make two stronger assumptions: the generation of unlabeled and positive examples is ``simultaneous'', and the positive label censorship process is independent of the features.

Caveats aside, the above observation model yields some interesting results. First is that \[
p (s = 1 | x) = p (y = 1 | x) p (s = 1 | y = 1),
\] i.e., the probability of a positive label in the observed data set is proportional to the probability of a positive label in the underlying unobserved data set. That means just training a classifier on a proper scoring rule using the P-U dataset should approximate something proportional to the true probability. Depending upon the overall decision problem that might be sufficient for good performance, and I suspect this is why naive use of pointwise classification sometimes works out.

Elkan and Noto uncover several other interesting implications of the above observation model. They note \[
p (s = 1 | y = 1) = \mathbb{E}_{(x, s) \sim D^\prime} \left[ p (s = 1 | x) \,\bigl|\, s = 1 \right],
\] i.e., once a classifier has been trained to approximate $p (s = 1 | x)$, the proportionality constant $p (s = 1 | y = 1)$ can be estimated by drawing another sample from $D^\prime$ and averaging the classifier output on the positive examples. This implies the classifier can be (approximately) calibrated with respect to the (unobservable!) underlying distribution.

Furthermore, they show how to relate an arbitrary expectation wrt $D$ in terms of an expectation wrt $D^\prime$, \[
\begin{aligned}
&\mathbb{E}_{(x, y) \sim D}\left[ f (x, y) \right] \\
&= \mathbb{E}_{(x, s) \sim D^\prime}\left[ f (x, 1) 1_{s=1} + \bigl( w (x) f (x, 1) + (1 - w (x)) f (x, 0) \bigr) 1_{s=0} \right],
\end{aligned}
\] where \[
w (x) = p (y = 1 | x, s = 0) = \frac{1 - p (s = 1 | y = 1)}{p (s = 1 | y = 1)} \frac{p (s = 1 | x)}{1 - p (s = 1 | x)}
\] is a weighting function which treats unlabeled examples as a probabilistic mixture of a positive and negative example. Note computing the weighting function only requires the classifier trained on the P-U dataset $p (s = 1 | x)$, and the normalization constant $p (s = 1 | y = 1)$ estimated via the previous paragraph's procedure. This is really cool stuff: basically leveraging this one can solve many different learning problems after an initial density estimation preprocessing step to convert the P-U dataset into a normal dataset.

When Should I Use This?

In general I would be careful applying this technique.

In a situation where positive examples are gathered in one manner and unlabeled examples in another, the simultaneity assumption of the Elkan and Noto observation model is not a good fit; whereas optimizing AUC naturally applies to this situation.

For link prediction on a fixed graph, the underlying distribution is a (large!) sum of indicator functions so the simultaneity assumption seems safe. However the conditional independence $s \perp x | y$ assumption might be dangerous. For instance at Facebook, women might be more likely to initiate and accept friend requests, which would make the label censorship probability depend upon the features.

Personally, I think it is most likely that I would use this technique when I have a complicated objective function and I want to leverage the change of measure result to importance-weight the data. My intuition says that even if the observation model is inaccurate that the importance-weighting would probably do more good than harm (relative to not importance weighting and treating unlabeled examples as negative examples).

Of course, if a calibrated estimator is required, then optimizing AUC is not sufficient and the Elkan and Noto technique becomes more attractive. Sometimes calibration is essential, e.g., at eHarmony I had to generate estimates that were consumed by a linear program. However often calibration seems essential but is not. You can do selective classification and active learning without calibration; you can combine previously trained models into an integrated decision system without calibrating the models; and you can even price a generalized second price style keyword auction without a calibrated CTR estimator.

Friday, June 29, 2012

My ICML 2012 Notables

I've already devoted entire blog posts to some of the ICML 2012 papers, but there are some other papers that caught my attention for which I only have a quick comment.

  • Online Structured Prediction via Coactive Learning: read the full blog post.
  • Predicting Accurate Probabilities with a Ranking Loss: read the full blog post.
  • Training Restricted Boltzmann Machines on Word Observations. I haven't used RBMs in over a decade, for practical text classification problems a bag-of-bigrams representation is often sufficient, and LDA is my go-to technique for unsupervised feature extraction for text. So why do I like this paper? First, the computational efficiency improvement appears substantial, which is always of interest: I like deep learning in theory, but in practice I'm very impatient. Second the idea of discovering higher order structure in text (5-grams!) is intriguing. Third (like LDA) the technique is clearly more generally applicable and I wonder what it would do on a social graph. That all suggests there is some chance that I might actually try this on a real problem.
  • Fast Prediction of New Feature Utility: I'm constantly in the situation of trying to chose which features to try next, and correlating with the negative gradient of the loss function makes intuitive sense.
  • Plug-in Martingales for Testing Exchangeability On-Line: how awesome would it be if VW in online learning mode could output a warning that says ``the input data does not appear to be generated by an exchangeable distribution; try randomly shuffling your data to improve generalization.''
  • Dimensionality Reduction by Local Discriminative Gaussians: This seems imminently practical. The major limitation is that it is a supervised dimensionality reduction technique, so it would apply to cases where there is one problem with a deficit of labeled data and a related problem using the same features with an abundance of labeled data (which is a special case of Transfer Learning). I usually find myself in the ``few labeled data and lots of unlabeled data'' case demanding an unsupervised technique, but that could be because I don't ask myself the following question often enough: ``is there a related problem which has lots of training data associated with it?''
  • Finding Botnets Using Minimal Graph Clusterings: Very entertaining. I was asked in a job interview once how I would go about identifying and filtering out automated traffic from search logs. There's no ``right answer'', and black-letter machine learning techniques don't obviously apply, so creativity is at a premium.

Wednesday, June 27, 2012

Rank+IR

Mennon et. al. have a paper at ICML 2012 called Predicting Accurate Probabilities with a Ranking Loss. The basic idea is to train a classifier on a ranking loss (e.g., AUC), then post-process the classifier scores with isotonic regression to calibrate the classifier. In contrast with training a classifier using a proper scoring rule (e.g., logistic regression), this procedure non-parametrically explores the space of link functions and the claim is this leads to better results. Note exploring the space of link functions non-parametrically is intuitively ``safe'' from a model complexity standpoint because this is a one-dimensional procedure which operates on the scores output by the underlying classifier.

It turns out we accidentally backed into this at eHarmony. When I joined the production system delivered matches sequentially so we started with a ranking loss. Later the production system switched to using a linear program to deliver matches, and the easiest thing to do was to add a calibration step at the end of the training pipeline, and we did isotonic regression with linear interpolation. We wanted to switch to directly training for classification with a proper scoring rule, but we started subsampling the negatives so we needed to continue to calibrate the classifier and therefore it never happened. The whole time we suspected we were being ``incoherent.'' Hey, it's better to be lucky than good. Now, if I find myself in a similar situation in the future, I'll be able to articulate a rationale for the approach.

The meta-lesson here is if you are an applied machine learning practitioner and you see a paper with Charles Elkan's name on it, you should read it. I've yet to be disappointed.

Tuesday, June 26, 2012

A Thought on Link Prediction

I'm reading a paper by Richard et. al. from the ICML 2012 paper list called Estimation of Simultaneously Sparse and Low Rank Matrices. I'm not sure why until now I was conflating these two ideas but in retrospect they are clearly different and one might want to optimize for both. Since the Latent Feature Log Linear (LFLL) model of Mennon and Elkan is in spirit a low-rank matrix factorization algorithm I was wondering how to simultaneously enforce sparsity in it; I think using an $L_1$ regularizer on the latent features might be worth trying.

However the paper also got me thinking about link prediction. Here's a quote from the paper:
Link prediction - the matrix $A$ is the adjacency matrix of a partially observed graph; entries are 0 for both not-existing and undiscovered links. The search space is unrestricted as before and the matrix $S$ contains the scores for link prediction; the ideal loss function is the empirical average of the zero-one loss for each coefficient, \[
l_{E} (S, A) = \frac{1}{|E|} \sum_{(i,j) \in E} 1_{(A_{ij} - 1/2) \cdot S_{ij} \leq 0}.
\]
So I read that as, ``this is a P-U problem that we are reducing to pointwise classification.'' However my preferred method for P-U problems is to reduce to ranking (AUC loss). What would that look like for link prediction?
  1. Instances are edges (i.e., pairs of vertices plus dyadic specific information).
  2. Reduction of AUC is to pairwise classification, so pairs of edges, or pairs of pairs of vertices.
  3. Each positive (observed) edge in the adjacency graph would be paired with an unlabeled (unobserved or unexplored) edge, the latter perhaps drawn uniformly from all possible edges; or possibly from all possible edges given one vertex (``per-vertex AUC'').
  4. The final classification model could be purely latent (e.g., pure LFLL), purely explicitly feature driven (e.g., bilinear implemented with VW), or a combination (e.g., LFLL with side information).
    1. In my experience LLFL with side information is very tricky to train, unlike pure LLFL.

Next time I run into a link prediction problem I'm going to give this a whirl.

Monday, June 25, 2012

Coactive Learning

Shivaswamy and Joachims have a paper called Online Structured Prediction via Coactive Learning at ICML 2012 this year. Joachims, of course, is associated with a classic line of research which I'll summarize thusly: attempting to impute absolute relevance scores from behavioral data exhaust is not effective, and that imputing relative preferences leveraging an attentional model (e.g., serial scan) is more effective. This is one of those ``deep tricks'' that you can carry with you into many different situations.

So the classic example is when you have a search engine result, and you get just one click at a particular position $p$, and your attentional model assumes that the user considered every result up to that position plus one more. Therefore the partial preferences $\forall x \in [1, p + 1], x \neq p: r_p > r_x$ are revealed and added to the (ranking) training set.

Later in my career I began to appreciate stochastic contextual bandits, specifically the importance of debiasing the historical state-action density in order to get consistent estimates. That left me with an inconsistency: on the one hand, optimizing a search engine with click feedback is definitely Learning Through Exploration, since you only get information about the relative preferences of (a subset of) the items presented. On the other hand I'm not attempting to debias the historical state-action density when I'm doing straight Joachims.

I was hoping this paper would resolve this difficulty for me. It did, but not in the way I expected; the contextual bandit literature is only referred to in the introduction for comparative purposes. Instead the authors make the following assumptions:
  1. User loss is convex in (linear) utility differences.
  2. Users only suggest improvements (i.e., user feedback always points ``downhill'').
  3. Users only suggest significant improvements (i.e., feedback states have a utility increment at least proportional to the increment to the optimum).
Under these circumstances it is sensible that a Perceptron-style algorithm achieves a good regret bound. The authors also explore relaxations of these assumptions (e.g., improvements are only significant in expectation, or feedback occasionally points downhill) and the resulting degradation of the regret guarantee.

I suspect the analysis does not look how I anticipated because, subject to the conditions of the previous paragraph, the user feedback can be chosen adversarially. Nonetheless it could be interesting to consider a ``contextual bandit style'' formulation, e.g., instead of learning the reward associated with the chosen arm, one learns the difference between the reward of the chosen arm and another arm. A good place to start would be the literature on contextual bandits with controlled side information, but a key difference here is that the user feedback is not under control of the algorithm.

Thursday, June 7, 2012

Stealth No More!

The startup I'm currently at publicly launched today. It's a social image sharing site called LoveIt. This is a crowded space at the moment, but we've tried to throw in some innovative new features. One machine learning related bit that I worked on is the recommendation system; here's an example screenshot with the recommendations in the bottom right hand side.



The image for a mashup by DJ Earworm (who is totally awesome!). In this case the second recommendation is a music collection which is very sensible, but the first recommendation is more questionable (focusing on the costume ball aspect). Hopefully the system will get better as we generate more behavioral data exhaust. I have noticed image recommendation is more forgiving than text recommendation: images have less precise meaning so people are more willing to invent why a quirky recommendation makes sense.

Conceptually the system is heavily Elkan inspired. The implementation is a combination of Elasticsearch and Vowpal Wabbit, strung together with Erlang. The tricky part is getting it to compute something quickly (circa 100ms), and both Elasticsearch and Vowpal Wabbit are excellent pieces of software in this regard!

The Bigger Picture

When I first started on the internet, the most common demand for machine learning I encountered was for optimizing performance marketing (the other big one would have been algorithmic search, but southern California wasn't a major player in that space). Nowadays there are many big smart companies focused on the science of advertising. In my opinion, if you have some machine learning acumen and some plucky post-series-A startup claiming to revolutionize internet advertising with a new algorithm attempts to recruit you, run the other way! There are probably still many smaller exits to be had in this space selling to the major ad networks, but unless you have a large equity share it won't change your life.

Fortunately there is a new nexus of ubiquitous machine learning need: content recommendation, personalization, summarization, and visualization. This is driven by the intersection of several trends, including the rise in user-generated content, social networks, and smartphones. For example, Twitter has turned everybody into an intelligence analyst lost in a sea of intercepts. Technologies that can scan all of Twitter and surface the (personalized) good stuff in real-time would be very interesting. Furthermore, as Google has proven, if you position yourself as a trusted discovery tool for users you can easily monetize. Thus if you get a recruiting call from a startup claiming to attack such problems, my advice is to seriously consider it.

Monday, May 21, 2012

Instruction Theory?

In learning theory we often talk about an environment which is oblivious to our algorithm, or an environment which is fully aware of our algorithm and attempting to cause it to do badly. What about the case where the environment is fully aware of our algorithm and attempting to help it succeed?

Here's a concrete example. Suppose you are trying to communicate a message to a receiver, and this message is one of a finite set of hypotheses. You are forced to communicate to the receiver by sending a sequence of feature-label pairs $(X \times Y)^*$; the receiver will decode the message via ERM on the hypothesis set using the supplied data. How many examples does it take, and how should you chose them? If this sounds corny, consider that evolution works by reuse, so if the capability to learn from experience developed due to other selection pressure, it might be co-opted to service communication a la Cognitive Linguistics.

Intuitively, even if the hypothesis being communicated was learned from experience, it's not good strategy to retransmit the exact data used to learn the hypothesis. In fact, it seems like the best strategy would be not using real data at all; by constructing an artificial set of training examples favorable structure can be induced, e.g., the problem can be realizable. (Funny aside: I TA-d for a professor once who confided that he sometimes lies to undergraduate in introductory courses in order to get them ``closer to the truth''; the idea was, if they took an upper division class he would have the ability to refine their understanding, and if not they were actually better off learning a simplified distortion).

Some concepts from learning theory are backwards in this setup. For instance, Littlestone's dimension indicates the maximum number of errors made in a realizable sequential prediction scenario when the examples are chosen adversarially (and generalizes to the agnostic case). We can chose the examples helpfully here (what's the antonym of adversarial?), but actually we want errors so that many of the hypothesis are incorrect and can be quickly eliminated. Unfortunately we might encounter a condition where the desired-to-be-communicated hypothesis disagrees with at most one other hypothesis on any point. Littlestone finds this condition favorable since a mistake would eliminate all but one hypothesis, and otherwise no harm no foul; but in our situation this is worst-case behaviour, because it makes it difficult to isolate the target hypothesis with examples. In other words, we can chose the data helpfully, but if the set of hypotheses is chosen adversarially this could be still very difficult.

Inventing an optimal fictitious sequence of data might be computationally too difficult for the sender. In this case active learning algorithms might provide good heuristic solutions. Here label complexity corresponds to data compression between the original sequence of data used to learn the hypothesis and the reduced sequence of data used to transmit the hypothesis.

There is fertile ground for variations, e.g., the communication channel is noisy, the receiver does approximate ERM, or the communication is scored on the difference in loss between communicated and received hypothesis rather than 0-1 loss on hypotheses.

Wednesday, May 16, 2012

Active Minimax Forecasting

This is a continuation of my previous post where I noted that the Minimax Forecaster is tantalizing from an active learning perspective. Fortunately I noticed a paper by Abernethy et. al. called A Stochastic View of Optimal Regret through Minimax Duality. In this paper the authors unwrap online convex optimization in a general fashion by leveraging von Neumann's minimax theorem. By doing this they derive a general formula for the value of an online game to the adversary. The intuition of the previous post is that differences in game value between observing and not observing a particular outcome will be key to making active learning decisions in an adversarial setting, so this formula is very interesting.

Abernethy et. al. start out with a more general setup than the previous post, but I'll adapt their conventions to the previous post where there are differences. There is a game with $T$ rounds in it. There is a set $\mathcal{F}$ of experts. On each round, each expert $f$ produces a prediction $f_t$, player produces a prediction $p_t$, adversary simultaneously produces an outcome $y_t$, and player suffers an instantaneous loss $l (p_t, y_t)$. The experts are static (their predictions to do not depend upon previously observed outcomes), so essentially each expert is an sequence $f_{1:T}$. The player wants to generate a sequence of predictions $p_{1:T}$ which minimizes worst-case regret \[
\sup_{y_{1:T} \in \mathcal{Y}^T} L (p_{1:T}, y_{1:T}) - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:T}),
\] where $L (p_{1:T}, y_{1:T}) = \sum_s l (p_s, y_s)$ is the total loss. (To ease notation supremums and infimums will capture the entire rest of the expression unless explicitly limited by parenthesis.) The Minimax Forecaster of the previous post is an explicit solution to a specific case of this problem, whereas Abernethy et. al. are concerned with characterizing such games in general. They consider the value of the game to the adversary under optimal play, \[
\begin{aligned}
R^* (\mathcal{F}) &= \inf_{p_1 \in \mathcal{P}} \sup_{y_1 \in \mathcal{Y}} \ldots \inf_{p_T \in \mathcal{P}} \sup_{y_T \in \mathcal{Y}} \sum_{t=1}^T l (p_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T l (f_t, y_t).
\end{aligned}
\] The amazing result is that this is the same as \[
\begin{aligned}
R^* (\mathcal{F}) &= \sup_{Y \in \mathbb{P} (\mathcal{Y}^T)} \mathbb{E}_{y_{1:T} \sim Y} \left[ \sum_{t=1}^T \inf_{p_t \in \mathcal{P}} \biggl( \mathbb{E}_{\tilde y_t} \left[ l (p_t, \tilde y_t) | y_{1:t-1} \right] \biggr) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T l (f_t, y_t) \right],
\end{aligned}
\] where the supremum is over distributions of outcome sequences $\mathbb{P} (\mathcal{Y}^T)$. In other words, this looks like a game played with an oblivious (non-stationary) environment, but played in the worst possible environment. This is a nifty result, and leads in subsequent work to interpolating between the IID and adversarial settings by constraining the supremum over sequence distributions.

Now with active learning, we make the adversary more powerful by choosing not to observe some of the outcomes (represented by the variable $z_s \in \{ 0, 1 \}$). I can upper bound the game value to the adversary as \[
\begin{aligned}
R^* (\mathcal{F} | z_{1:T}) &\leq \sup_{Y \in \mathbb{P} (\mathcal{Y}^T)} \mathbb{E}_{y_{1:T} \sim Y} \left[ \sum_{t=1}^T \inf_{p_t \in \mathcal{P}} \biggl( \mathbb{E}_{\tilde y_t} \left[ l (p_t, \tilde y_t) | \Omega_{t-1} \right] \biggr) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T l (f_t, y_t) \right],
\end{aligned}
\] where $\Omega_t = \{ y_s | s \leq t, z_s = 1 \}$ denotes observed outcomes. This is intuitively pleasing because the inner conditional expectation represents the player's knowledge. To derive the upper bound follow the procedure in Appendix A of the paper, after transforming the game value expression whenever $z_s = 0$ from \[
\inf_{p_1 \in \mathcal{P}} \sup_{y_1 \in \mathcal{Y}} \cdots \inf_{p_s \in \mathcal{P}} \sup_{y_s \in \mathcal{Y}} \cdots \inf_{p_T \in \mathcal{P}} \sup_{y_T \in \mathcal{Y}} \sum_{t=1}^T l (p_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T l (f_t, y_t),
\] to \[
\inf_{p_1 \in \mathcal{P}} \sup_{y_1 \in \mathcal{Y}} \cdots \inf_{p_s \in \mathcal{P}} \cdots \inf_{p_T \in \mathcal{P}} \sup_{y_T \in \mathcal{Y}} \sup_{y_s \in \mathcal{Y}} \sum_{t=1}^T l (p_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^T l (f_t, y_t),
\] i.e., by letting the adversary defer selection of the unobserved values until the end of the game. This is just an upper bound because in reality the adversary has to chose the outcome for round $s$ on round $s$ so possibly I'm being too generous to the adversary.

Minimax Forecasting

If we design a Minimax Forecaster on the extensive form game associated with the bound, we get an algorithm which optimizes the bound. Consider the case where all outcomes are observed except for round $s$. Repeating the backward induction steps from the original minimax forecaster paper yields the expressions, \[
\begin{aligned}
p_t^* &= \frac{1}{2} \biggl( 1 - R^* (\mathcal{F}, y_{1:t-1}0 | z_s = 0) + R^* (\mathcal{F}, y_{1:t-1}1 | z_s = 0) \biggr). & (t > s)
\end{aligned}
\] and \[
\begin{aligned}
&R^* (\mathcal{F}, y_{1:t} | z_s = 0) \\
&= \frac{T - t}{2} + \mathbb{E}_{\sigma_{t+1:T}}\left[ \sup_{y_s \in \mathcal{Y}} |p_s - y_s| - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:t} \sigma_{t+1:T}) \right] & (t > s) \\
&= R^* (\mathcal{F}, y_{1:t}) \\
&\quad + \mathbb{E}_{\sigma_{t+1:T}}\left[ \left| p_s - \frac{1}{2} \left(1 + \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 0 \sigma_{s+1:T}) \right) - \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 1 \sigma_{s+1:T}) \right) \right) \right| \right].
\end{aligned}
\] Thus the residual game value having played $p_s$ and not observed outcome $s$ is equal to the fully observed residual game value plus a penalty related to expected absolute loss averaged over all Rademacher distributed playouts. This implies the optimal choice of $p_s$ is a median \[
\begin{aligned}
p_s^* &= \mathop{\operatorname{median}} \left( \frac{1}{2} \left( 1 + \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 0 \sigma_{s+1:T}) \right) - \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 1 \sigma_{s+1:T}) \right) \right) \right).
\end{aligned}
\] In the case where there are no additional rounds of play between $s$ and $T$, there is only one point in the distribution so the median is that point, so the optimal play $p_s^*$ is the same as in the fully observed case. In the case where there is one additional round of play between $s$ and $T$, there are two points in the distribution so the mean is a median, and again the optimal play $p_s^*$ is the same as in the fully observed game (consistent with the previous blog post). In the case of more than one additional round of play between $s$ and $T$, the mean is not necessarily a median (i.e., does not necessary minimize expected absolute loss) so optimal play $p_s^*$ is different. Maybe this is why Mathematica ``pooped out'' when I tried to solve a longer version of the unobserved game.

Once we've ``hopped over'' the unobserved point and determined $p_s^*$, we can continue the backwards induction; but I think the interesting bit is the penalty term, which says what the value of observing the outcome on round $s$ given all previous and subsequent rounds will be observed, \[
\mathbb{E}_{\sigma_{t+1:T}}\left[ \left| p_s - \frac{1}{2} \left(1 + \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 0 \sigma_{s+1:T}) \right) - \inf_{f \in \mathcal{F}} \left( L (f_{1:T}, y_{1:s-1} 1 \sigma_{s+1:T}) \right) \right) \right| \right].
\] This penalty term would be important for deciding whether or not to query the outcome on round $s$. It'd be nice to generalize it to the case where some of the previous outcomes are not observed and also with respect to potentially unobserved future outcomes. Of course, planning the latter might be exponentially difficult.

For this to be practical, it would be nice if medians and expected absolute losses could be cheaply and accurately estimated, e.g., using random playouts. Also perhaps deciding whether to query the outcome needs to be done on a game value bound, e.g., assuming all subsequent observations will be observed rather than taking into account future decisions to observe the outcome. Furthermore, the technique is still fundamentally transductive since the expert predictions for the planning horizon $f_{1:T}$ need to be known. Even with all those caveats, however, there might be an interesting application for this, e.g., in recommendation systems.



































Sunday, May 6, 2012

The Minimax Forecaster and Transductive Active Learning

I've been making my way down the NIPS 2011 paper list, and found this nice paper Efficient Online Learning via Randomized Rounding by Cesa-Bianchi and Shamir. This paper is about improving and extending the Minimax Forecaster which is described in Prediction, Learning, and Games. (I own a copy but I confess to not having made it very far into that book.) The Minimax Forecaster uses a different strategy for online learning than mirror descent which is essentially what I (and everybody else?) use everyday. This different setting provides an opportunity to think about adversarial active learning.

Here's the basic setup. There is a game with $T$ rounds in it. There is a set $\mathcal{F}$ of experts. On each round, each expert $f$ produces a prediction $f_t$, player produces a prediction $p_t$, adversary simultaneously produces an outcome $y_t$, and player suffers an instantaneous loss $l (p_t, y_t)$. The experts are static (their predictions to do not depend upon previously observed outcomes), so essentially each expert is an sequence $f_{1:T}$. The player wants to generate a sequence of predictions $p_{1:T}$ which minimizes worst-case regret \[
\sup_{y_{1:T} \in \mathcal{Y}^T} \biggl( L (p_{1:T}, y_{1:T}) - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:T}) \biggr),
\] where $L (p_{1:T}, y_{1:T}) = \sum_s l (p_s, y_s)$ is the total loss. When the observations are binary $\mathcal{Y} = \{ 0, 1 \}$, then an instantaneous loss of $|p_t - y_t|$ corresponds to expected 0-1 loss when the player is randomizing decisions. Amazingly this case yields a closed-form expression for the optimal prediction \[
\begin{aligned}
p^*_t &= \frac{1}{2} \biggl( 1 + R^* (\mathcal{F}, y_{1:t-1}1) - R^* (\mathcal{F}, y_{1:t-1}0) \biggr) \\
&= \frac{1}{2} \biggl( 1 + \mathbb{E}_{\sigma_{t+1:T}} \left[ \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:t-1} 0 \sigma_{t+1:T}) - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:t-1} 1 \sigma_{t+1:T}) \right] \biggr),
\end{aligned}
\] where temporal sequence concatenation is denoted lexically, $\sigma_t$ is $\mathrm{Bournelli}(1/2)$ distributed a la Rademacher averages, and $R^* (\mathcal{F}, y_{1:t})$ is the residual game value after some rounds of play, \[
\begin{aligned}
R^* (\mathcal{F}, y_{1:t}) &= \frac{1}{2} \biggl( 1 + R^* (\mathcal{F}, y_{1:t-1}0) + R^* (\mathcal{F}, y_{1:t-1}1) \biggr) \\
&= \frac{T - t}{2} - \mathbb{E}_{\sigma_{t+1:T}}\left[ \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:t} \sigma_{t+1:T}) \right].
\end{aligned}
\] Essentially what's happening here is that player is able to make adversary indifferent to playing either option on each round by playing a constant plus the difference between the residual game values associated with playing each option; this causes the residual game value to be a constant plus the average value of continuing after playing each option. Unwrapping the game value recursively leads to the Rademacher style averages. One observation of the paper is that such expectations can be approximated by sampling to achieve a high probability regret bound, aka random playout.

In practice even to do random playout you need to know $f_{1:T}$ for the various experts. When mapping this to a contextual prediction setting, this corresponds to knowing the sequence of features in advance (but not the labels). Thus this is essentially a transductive technique. Some recommendation problems are naturally transductive, and the paper discusses an application to collaborative filtering.

Active Learning?

In principle the setup can be modified to consider active learning. Each round, in addition to generating a prediction, player must make a decision $z_t \in \{ 0, 1 \}$ whether or not to observe $y_t$. If $z_t = 0$, player cannot use the value of $y_t$ in subsequent predictions. Since it is always better for the player to observe $y_t$, there has to be some penalty for doing so, thus consider a constant penalty $\alpha$ per observation. The player wants to generate a sequence of predictions $p_{1:T}$ and queries $z_{1:T}$ which minimizes worst-case regret \[\sup_{y_{1:T} \in \mathcal{Y}^T} \biggl( \sum_s \alpha z_s + L (p_{1:T}, y_{1:T}) - \inf_{f \in \mathcal{F}} L (f_{1:T}, y_{1:T}) \biggr).
\] Concise general closed-form expressions have eluded me thus far, but there is a non-trivial case which yields nice answers: the two-round game.

It never makes sense to observe the final outcome $y_T$, so $z_T = 0$. In the two-round game, then, the question is whether to observe $y_1$. If $y_1$ is not observed (i.e., $z_1 = 0$), player must ballistically plan both predictions without intermediate feedback, \[
\begin{aligned}
(p_1^*, p_2^*) &= \mathop{\operatorname{arg\,inf}}\limits_{p_{1:2} \in \mathcal{P}^2} \sup_{y_{1:2} \in \mathcal{Y}^2} \left( |p_1 - y_1| + |p_2 - y_2| - \inf_{f \in \mathcal{F}} L (f_{1:2}, y_{1:2}) \right).
\end{aligned}
\] This can be solved with Mathematica: here's the incantation.
Minimize[{ z, 
           p1 + p2 - inf00 <= z, 
           p1 + (1 - p2) - inf01 <= z, 
           (1 - p1) + p2 - inf10 <= z, 
           (1 - p1) + (1 - p2) - inf11 <= z }, 
           { p1, p2, z }] // Simplify
This has solution \[
\begin{aligned}
p_1^* &= \frac{1}{2} \left( 1 + \frac{1}{2} \sum_{y_2=0}^1 \left( \inf_{f \in \mathcal{F}} L (f_{1:2}, 0y_2) - \inf_{f \in \mathcal{F}} L (f_{1:2}, 1y_2) \right) \right) & (z_1 = 0), \\
p_2^* &= \frac{1}{2} \left(1 + \frac{1}{2} \sum_{y_1=0}^1 \left( \inf_{f \in \mathcal{F}} L (f_{1:2}, y_10) - \inf_{f \in \mathcal{F}} L (f_{1:2}, y_11) \right) \right) & (z_1 = 0),
\end{aligned}
\] with game value \[
\begin{aligned}
&R (\mathcal{F}, \emptyset | z_1 = 0) \\
&= 1 - \frac{1}{2} \min\left\{ \inf_{f \in \mathcal{F}} L (f_{1:2}, 00) + \inf_{f \in \mathcal{F}} L (f_{1:2}, 11), \inf_{f \in \mathcal{F}} L (f_{1:2}, 01) + \inf_{f \in \mathcal{F}} L (f_{1:2}, 10) \right\}. \\
\end{aligned}
\] Now compare this to the case of $z_1 = 1$, which is the same as the fully observed Minimax Forecaster. \[
\begin{aligned}
p_1^* &= \frac{1}{2} \left( 1 + \frac{1}{2} \sum_{y_2=0}^1 \left( \inf_{f \in \mathcal{F}} L (f_{1:2}, 0y_2) - \inf_{f \in \mathcal{F}} L (f_{1:2}, 1y_2) \right) \right) & (z_1 = 1), \\
p_2^* &= \frac{1}{2} \left(1 + \inf_{f \in \mathcal{F}} L (f_{1:2}, y_10) - \inf_{f \in \mathcal{F}} L (f_{1:2}, y_11) \right) & (z_1 = 1).
\end{aligned}
\] The first round prediction $p_1^*$ is the same whether or not $z_1 = 0$ or $z_1 = 1$, but the second round prediction $p_2^*$ is different. If $z_1 = 0$, then $p_2^*$ is computed by averaging over possible histories; whereas if $z_1 = 1$, then $p_2^*$ is computing using the actual observed history. (Aside: perhaps constant-time Radacher averages will be quantum computing's killer app.)

To decide whether to observe $y_1$ or not, we need to know how much better it is to do so, i.e., the difference in game values. When $z_1=1$ this is the same as the fully observed Minimax Forecaster, \[
\begin{aligned}
&R (\mathcal{F}, \emptyset | z_1 = 1) \\
&= 1 - \frac{1}{4} \left( \inf_{f \in \mathcal{F}} L (f_{1:T}, 00) + \inf_{f \in \mathcal{F}} L (f_{1:T}, 01) + \inf_{f \in \mathcal{F}} L (f_{1:T}, 10) + \inf_{f \in \mathcal{F}} L (f_{1:T}, 11) \right),
\end{aligned}
\] therefore the difference in game value is \[
\begin{aligned}
&R^* (\mathcal{F}, \emptyset | z_t = 0) - R^* (\mathcal{F}, \emptyset | z_t = 1) \\
&= \frac{1}{4} \left| \inf_{f \in \mathcal{F}} L (f_{1:2}, 00) - \inf_{f \in \mathcal{F}} L (f_{1:2}, 01) - \left( \inf_{f \in \mathcal{F}} L (f_{1:2}, 10) - \inf_{f \in \mathcal{F}} L (f_{1:2}, 11) \right) \right|. \\
\end{aligned}
\] This looks like a difference of future differences. If the game value difference exceeds $\alpha$, then we should decide $z_1 = 1$, otherwise not. So, for instance, if every expert predicts the same value on the first round, then the difference of future differences will be zero and we should not observe $y_1$. That certainly sounds like active learning.

So what should a general case $T$ round solution look like? Intuitively, one would hope that if all the experts that have done well in the past predict the same thing on the current instance, that the value of observing $y_t$ for that instance would go down. That is roughly what agnostic active learning does in the IID setting. Here the future is also important, but analogously if all the experts that are in the running for the infimum at the end of the horizon agree on a value, it should be that observing $y_t$ has less value. As we near to the end of the planning horizon, that will be driven mostly by having done well in the past.

Wednesday, April 25, 2012

Selective Classification and Active Learning

There's a gem of a paper by El-Yaniv and Wiener called Agnostic Selective Classification. The punchline of this paper is that selective classification is the test-time analog to active learning.

In selective classification, a classifier has the option of rejecting the input and abstaining from predicting. The idea is reject rarely while achieving high classification accuracy. This is multi-objective optimization, so the concept of a Pareto frontier arises, known as the risk-coverage curve in this context.

This can be formalized as follows: a selective classifier is a pair $(h, g)$ of functions $g: X \to \{ 0, 1 \}$ and $h: X \to Y$, where $g$ indicates whether or not to reject the input and $h$ is the classification rule if the input is accepted, \[
(h, g) (x) = \begin{cases} \mathrm{reject} & \mathrm{if}\; g (x) = 0, \\ h (x) & \mathrm{otherwise} \end{cases}.
\] The hypotheses $h$ come from a set $\mathcal{H}$; the paper is less clear on what set we are competing with for $g$, but they end up with a weaker notion of optimality so it doesn't matter. The features $X$ and labels $Y = \{ 0, 1 \}$ are jointly distributed according to $D$. In the passive batch setting, somebody hands us a bag of data sampled i.i.d. from $D$ and a rejection budget $b$, and then we have to produce a pair $(h, g)$ which is ideally near the Pareto frontier, \[
\begin{aligned}
\mathop{\operatorname{arg\,min\,}}\limits_{h \in \mathcal{H}} \mathbb{E}_{(x, y) \sim D}&[ 1_{g (x) = 1} 1_{h (x) \neq y}] \\
&\mathrm{s.t.} \\
E_{(x ,y) \sim D}&[1_{g (x) = 1}] \leq b.
\end{aligned}
\] This is really hard so El-Yaniv and Wiener consider a different problem. Let $h^* = \mathop{\operatorname{arg\,min\,}}_{h \in \mathcal{H}} \mathbb{E}_{(x, y) \sim D}[ 1_{h (x) \neq y}]$ be an optimal classifier. A pair $(h, g)$ is called a weakly optimal if \[
\mathbb{E}_{(x, y) \sim D}[ 1_{g (x) = 1} 1_{h (x) \neq y}] \leq \mathbb{E}_{(x, y) \sim D}[ 1_{g (x) = 1} 1_{h^* (x) \neq y}],
\] i.e., in the region of the input space where the selective classifier accepts, it does at least as well as an optimal hypothesis. This is not as good as the being on the Pareto frontier, but it's better than a sharp stick in the eye!

Now for the nifty part. The authors show that a weakly optimal selective classifier can (with high probability) be constructed as follows. Let $\tilde h$ be the minimizer of empirical risk $R (h)$, and define the empirically disagreeing hypothesis at $x$ via \[
h'_x \doteq \mathop{\operatorname{arg\,min\,}}\limits_{\substack{h \in \mathcal{H}, h (x) \neq \tilde h (x)}} R (h).
\] Then the disbelief index is the difference in empirical risk $R (h'_x) - R (\tilde h)$, and the selective classifier $(g, \tilde h)$ defined by \[
g (x) = 0 \iff R (\tilde h_x) - R (\tilde h) \leq \Delta
\] is weakly optimal, where $\Delta$ is a function of the training set size, VC dimension of $\mathcal{H}$, and probability bound $\delta$. In other words, refuse to classify the input if it is not empirically expensive to disagree with the empirical minimizer at the current input, otherwise classify according to the empirical minimizer.

If this looks familiar, it's because the disbelief index is the same quantity that arises in AALwC when deciding the probability of querying the label. This establishes a pleasant correspondence between ``asking for the label'' when training and ``refusing to predict'' when testing.

It also indicates that Vowpal Wabbit can easily be used as a selective classifier, since it already computes a fast approximation to the disbelief index for active learning purposes.

Sunday, April 22, 2012

Active Learning with a Drifting Distribution

We now have a good idea how to do agnostic active learning in the IID setting.  In between the IID setting and the fully adversarial setting, there is the situation where the environment is changing over time but is oblivious to our learning algorithm.   This is the topic of a recent paper by Liu Yang called Active Learning with a Drifting Distribution.

Passive Learning with a Drifting Distribution


I wasn't sufficiently familiar with passive learning with a drifting distribution, so I had to spend some time with one of the citations: On the Complexity of Learning from Drifting Distributions by Barve and Long.

In passive learning in the IID setting we can get arbitrarily good estimates of true expectations via empirical averages. This means for hypothesis sets of finite VC dimension that with high probability we can get arbitrarily close to optimal with enough data.

Once the world is allowed to start changing over time, we may or may not have enough information to track the changes.  There are multiple modeling choices here to formally characterize how the world drifts, including infrequent large changes, and possibly large changes with a slowly changing direction and rate.  Barve and Long study the case that the world changes slowly.  In particular, the world generates a countable sequence of training data consisting of feature label pairs $(X \times Y)^{\mathbb{N}}$.  These samples are independent but not identically distributed, and $D_t$ denotes the distribution of $(x_t, y_t)$.  The $D_t$ are constrained via the total variation distance \[
\sup_U |D_t (U) - D_{t+1} (U)| \leq \gamma,
\] where $U$ ranges over all measurable events, i.e., for anything to which you can assign a probability, consecutive distributions $D_t$ and $D_{t+1}$ will differ in the probability they assign by at most $\gamma$.   In this scenario, if the world wants to change significantly it has to supply us with training data associated with intermediate states, so the learning algorithm can succeed. Barve and Long show a sufficient condition for agnostic learning is \[
\gamma = O \bigl( \frac{\epsilon^3}{d \ln (1 / \epsilon)} \bigr).
\] where $d$ is the VC dimension of the hypothesis set and $\epsilon$ is our target regret.  This is achieved by doing ERM on a window of the most recent examples, where the window size is determined by $\epsilon$ and $\gamma$.  Since regret at time $t$ is relative to $D_t$, this is a really cool result.  Note the joint distribution of $X$ and $Y$ is allowed to change over time, so both the distribution of input features and the  conditional label density (``concept'') are changing.

Unfortunately this result says that we cannot get arbitrarily accurate with unlimited data.  In hindsight this is not surprising since we only use recent history for our ERM even with unlimited data.  An accompanying necessary condition without the $\ln (1 / \epsilon)$ term in the denominator indicates this limitation is fundamental, and not merely an artifact of the ``windowed ERM'' strategy.

Back to Active Learning


Returning to Liu Yang, the first observation is that the above assumption of ``slow drift'' is not powerful enough for active learning.  In active learning, we want to throw away some (most!) of the labels, which is equivalent to speeding up the world (scaling $\gamma$).  In particular, if we want to get invited to the VIP section of the Active Learning Club, we'd really like a sub-linear label complexity, which would be like passive learning with $\gamma \to \infty$, and we've seen above that means essentially no accuracy.  (However, since linear label complexity is the worst-case lower bound for agnostic active learning in the IID case, the above ``slow drift'' condition might still be worth investigating.)

Yang instead assumes that the various $D_t$ are drawn from a set of distributions $\mathbb{D}$ that are all close to each other.  This is formalized via the minimal $\epsilon$-cover $\mathbb{D}_{\epsilon}$ of $\mathbb{D}$, defined as the smallest subset of $\mathbb{D}$ which is within $\epsilon$ of any member of $\mathbb{D}$, as measured by the total variation distance.  In particular Yang assumes $\forall \epsilon > 0, |\mathbb{D}_{\epsilon}| < \infty$, and gets more precise bounds by assuming $\forall \epsilon > 0, |\mathbb{D}_\epsilon| < c \cdot \epsilon^{-m}$ for some constants $c, m \in [0, \infty)$.

Essentially, instead of being allowed to change arbitrarily over time, Yang confines the world to bounce around inside a region of distribution space characterized by a finite set of prototype distributions.  To achieve a certain level of accuracy, we need only consider the world shifting between the prototype distributions, which in turn suggests that under suitable conditions we will eventually have accumulated enough information that we can forgo looking at some of the labels.

Yang further restricts the conditional distribution of labels given features to be fixed, and only allows the distribution of features to change over time.  This second assumption means that it is meaningful to talk about the realizable case, and in the realizable case an algorithm called CAL achieves a sublinear mistake bound and label complexity in this kind of drifting environment, if the (largest) disagreement coefficient is sufficiently small.  The quick summary of CAL is that it only queries the label if there is a empirically perfect hypothesis which predicts each label; otherwise it infers the unobserved label is the label that is associated with the remaining empirically perfect hypotheses.  CAL uses the entire history, but leverages realizability to succeed, and note the conditional distribution (and therefore realizability) is constant here.

Extending the realizable analysis, Yang next considers the strictly benign noise case, defined as there being an optimal hypothesis which is always conditionally more likely to be correct than incorrect (for all distributions in $\mathbb{D}$).  In this case an agnostic variant of CAL called ACAL achieves a sub-linear mistake bound and label complexity.  Like CAL, ACAL will infer labels, but does so via an error difference instead of demanding empirical perfection.  In Yang's paper ACAL uses windowed data where periodically all history is discarded and the window size is increased geometrically, but this is not due to drift; rather is it due to the online setting with an unbounded stream, because standard specifications of ACAL assume a particular label budget.  Therefore ACAL is effectively using ``all the data''.

Some Takeaways

This paper (and the discussed citation) helped me appreciate that active learning in changing environments will only be ``efficient'' (from a label complexity standpoint) if the environment is somehow globally constrained in how it can change.

One research strategy, exemplified by this paper, is to take an existing active learning algorithm for the IID case and adapt it to some model of world drift.  Of course I'd like to see AALwC manhandled in this fashion: easier said than done!  A mashup of AALwC with Barve and Long would suggest running AALwC over the most recent window of data and not incrementing the label counter once the window is full.  This would have linear label complexity but sometimes constant factors matter (like on my Mechanical Turk invoices!).  It's less clear to me what should happen if AALwC is to be used under the stricter Yang assumptions of world drift.  The fact that ACAL is essentially using all the historical data suggests that AALwC might work under these conditions by merely slowing the schedule associated with decreasing the query probability.

A different research strategy would be to understand active learning in the adversarial setting and then apply different online to batch scenarios to convert a retrospective guarantee into multiple prospective guarantees.  That approach holds the promise of yielding a single algorithm that does as well as possible under a variety of different world drift scenarios.  If somebody achieves that, they will be justifiably famous!