Saturday, December 5, 2020

Distributionally Robust Contextual Bandit Learning

This blog post is about improved off-policy contextual bandit learning via distributional robustness. I'll provide some theoretical background and also outline the implementation in vowpal wabbit. Some of this material is in a NeurIPS expo talk video, and additional material is in the accepted paper.


In off-policy learning in contextual bandits, our goal is to produce the best policy possible from historical data, and we have no control over the historical logging policy which generated the data. (Note that production systems running in closed-loop configurations are nonetheless in practice doing off-policy learning, because of delays between inference, training, and model update.) Off-policy learning reduces to optimization of a policy value estimator, analogously to supervised learning; however, the accuracy of policy value estimation depends upon the mismatch between the policy being evaluated and the policy that generated the data, and therefore can be quite different for different policies (unlike supervised learning, where differences in estimator resolution across the hypothesis class are less pronounced in practice). To appreciate this effect, consider the IPS policy value estimator, $$ \hat{V}(\pi; \mu) = \frac{1}{N} \sum_{n=1}^{N} \frac{\pi(a_n|x_n)}{\mu(a_n|x_n)} r_n, $$ where $\mu$ is the historical policy, $\pi$ is the policy being estimated, and our historical data consists of tuples $\{ (x_n, a_n, r_n) \}$. The importance weight $\frac{\pi(a_n|x_n)}{\mu(a_n|x_n)}$ can be quite large if $\pi$ frequently takes an action that $\mu$ rarely takes, causing the estimator to be highly sensitive to a few examples with large importance weights. Even if we initialize learning with $\pi = \mu$, as optimization progresses $\pi$ will induce increasingly different distributions over actions than $\mu$, as the learning algorithm encounters rare events with high reward. To combat this overfitting we will introduce regularization.
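As a concrete (hypothetical) illustration of this sensitivity, here is the IPS estimator in numpy; a single logged action that $\mu$ rarely took but $\pi$ takes often can dominate the estimate:

```python
import numpy as np

def ips_estimate(pi_probs, mu_probs, rewards):
    # IPS value estimate: average of importance-weighted rewards,
    # where w_n = pi(a_n|x_n) / mu(a_n|x_n)
    w = pi_probs / mu_probs
    return float(np.mean(w * rewards))

# Nine "typical" events plus one rare logged action (mu prob 0.01)
# that pi takes frequently (prob 0.9): the single event dominates.
pi_probs = np.array([0.5] * 9 + [0.9])
mu_probs = np.array([0.5] * 9 + [0.01])
rewards  = np.array([1.0] * 10)
# mean importance weight is (9 * 1 + 90) / 10 = 9.9, so the estimate is 9.9
# even though every observed reward is 1
```

Note that if $\pi = \mu$ all importance weights are 1 and the estimate is just the average reward; the pathology only appears as $\pi$ drifts away from $\mu$.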

Distributionally Robust Optimization

Distributionally robust optimization is a generic method for regularizing machine learning objectives. The basic idea is to consider the observed data as one possible distribution of data (the “empirical distribution”), and then to optimize a worst-case outcome over all distributions that are “sufficiently close” to the empirical distribution. In the case of IPS we can find the smallest policy value estimate over a set of distributions that are close in KL divergence to the empirical distribution, $$ \begin{alignat}{2} &\!\min_{P \in \Delta} & \qquad & \mathbb{E}_P\left[w r\right], \\ &\text{subject to} & & \mathbb{E}_P\left[w\right] = 1, \\ & & & \mathbb{E}_N \left[\log\left( P\right) \right] \geq \phi. \end{alignat} $$ where $w \doteq \frac{\pi(a|x)}{\mu(a|x)}$. It turns out you can do this cheaply (in the dual), and the value of $\phi$ can be computed from a desired asymptotic confidence level. These results follow from classic work in the field of Empirical Likelihood.

The above problem finds a lower bound; finding an upper bound is analogous, resulting in the confidence intervals from the paper:

Empirical Likelihood Confidence Intervals are tighter than Gaussian intervals.  Not shown: coverage of Empirical Likelihood CIs is better calibrated than Binomial (Clopper-Pearson).

When we do distributionally robust optimization, we are actually optimizing the lower bound on the policy value. The green curve in the above plot is a Clopper-Pearson interval, which does have guaranteed coverage, but is so wide that optimizing the lower bound wouldn't do much until the amount of data is large. The tighter intervals generated by the blue Empirical Likelihood curve imply that lower bound optimization will induce an interesting policy ordering with only modest data.

In practice, even when the empirical mean (empirical IPS value) is fixed, the lower bound is:

  • smaller when the policy generates value via few events with importance weights much larger than 1 and many events with importance weights near zero; and
  • larger when the policy generates value via many events with importance weights near 1.
This is precisely the kind of regularization we desired. Intuitively, any estimator which is sensitive to a few of the observed examples (aka high leverage) will have a larger penalty because it is “cheap”, as measured by KL divergence, to reduce the probability of those points.

Implementation in Vowpal Wabbit

To activate the functionality, add the --cb_dro flag to your contextual bandit command line in VW. Note it only affects training, so if you are only predicting you will not see a difference. Hopefully with the default hyperparameters you will see an improvement in the quality of your learned policy, such as in this gist.

Internally VW is solving the lower bound optimization problem from above on every example. There are some modifications:

  1. As stated above this would be too expensive computationally, but switching from the KL divergence to the Cressie-Read power divergence allows us to derive a closed form solution which is fast to compute.
  2. As stated above the lower bound requires remembering all policy decisions over all time. Instead we accumulate the sufficient statistics for the Cressie-Read power divergence in $O(1)$ time and space.
  3. To track nonstationarity we use exponentially weighted moving averages of the sufficient statistics. The hyperparameter --cb_dro_tau specifies the decay time constant.
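To make the sufficient-statistics accumulation concrete, here is a hedged sketch. The class name DroStats and the hyperparameters tau and rho are illustrative, and the penalty is the asymptotic chi-squared (Cressie-Read family) approximation, mean $- \sqrt{2 \rho \, \mathrm{var} / n}$, rather than VW's exact closed form:

```python
import math

class DroStats:
    # O(1) accumulator of exponentially weighted sufficient statistics
    # for a DRO-style lower bound on policy value (illustrative sketch only).
    def __init__(self, tau=0.999, rho=0.1):
        self.tau, self.rho = tau, rho
        self.n = 0.0        # effective sample size
        self.sum_wr = 0.0   # decayed sum of importance-weighted rewards
        self.sum_wr2 = 0.0  # decayed sum of their squares

    def update(self, w, r):
        # exponentially weighted moving accumulation: O(1) time and space
        self.n = self.tau * self.n + 1.0
        self.sum_wr = self.tau * self.sum_wr + w * r
        self.sum_wr2 = self.tau * self.sum_wr2 + (w * r) ** 2

    def lower_bound(self):
        mean = self.sum_wr / self.n
        var = max(self.sum_wr2 / self.n - mean ** 2, 0.0)
        # high variance in w*r (a few high-leverage events) widens the penalty
        return mean - math.sqrt(2.0 * self.rho * var / self.n)
```

When every event has importance weight near 1 the variance term vanishes and the bound equals the empirical mean, matching the intuition in the previous section.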
As always, YMMV.

Friday, July 17, 2020

Convex-concave games off the shelf

If you need to solve a convex optimization problem nowadays, you are in great shape. Any problem of the form $$
\begin{alignat}{2}
&\!\inf_z & \qquad & f(z) \\
& \text{subject to} & & h(z) = 0 \\
& & & g(z) \preceq 0
\end{alignat}
$$ where $f$ and $g$ are convex and $h$ is affine can be attacked by several excellent freely available software packages: my current favorite is cvxpy, which is a joy to use. If you have a lot of variables and not a lot of constraints, you can instead solve a dual problem. It ends up looking like $$
\begin{alignat}{2}
&\!\sup_{x} & \qquad & L(x) \\
& \text{subject to} & & \tilde{h}_x(x) = 0 \\
& & & \tilde{g}_x(x) \preceq 0
\end{alignat}
$$ where $L$ is concave, assuming that you get lucky and can analytically eliminate all the primal variables $z$ such that only the dual variables $x$ remain. But what if you can't eliminate all the primal variables, but only most of them? You might end up with a problem like $$
\begin{alignat}{2}
&\!\sup_{x} \inf_y & \qquad & L(x, y) \\
& \text{subject to} & & \tilde{h}_x(x) = 0 \\
& & & \tilde{g}_x(x) \preceq 0 \\
& & & \tilde{h}_y(y) = 0 \\
& & & \tilde{g}_y(y) \preceq 0
\end{alignat}
$$ where $\tilde{g}_x$ and $\tilde{g}_y$ are convex, and $\tilde{h}_x$ and $\tilde{h}_y$ are affine, $L(x, \cdot)$ is convex in $y$ given fixed $x$, and $L(\cdot, y)$ is concave in $x$ given fixed $y$. It feels like this problem should be easier to solve than the original problem if many primal variables have been analytically eliminated. Unfortunately, none of my favorite convex optimization toolkits will accept a problem of this form. This is despite the viability of interior-point methods for such problems. Bummer.

One thing I tried was to solve the inner infimum using a standard toolkit, compute the gradient of the solution with respect to the outer parameters via the sensitivity map, and then use a first-order method for the outer supremum. This did not work for me: it works for toy problems, but on real problems the outer supremum has very slow convergence, suggesting ill-conditioning.

What I need is the power of interior-point methods to handle ill-conditioning via second-order information. I'm able to achieve this via sequential quadratic minimax programming: first, locally approximate the objective $L(x, y)$ with a quadratic around the current point and linearize the constraints. $$
\begin{alignat}{2}
&\!\sup_{\delta x} \inf_{\delta y} & \qquad & \frac{1}{2} \left(\begin{array}{c} \delta x \\ \delta y \end{array}\right)^\top \left(\begin{array}{cc} P_{xx} & P_{yx}^\top \\ P_{yx} & P_{yy} \end{array}\right) \left(\begin{array}{c} \delta x \\ \delta y \end{array} \right) + \left(\begin{array}{c} q_x \\ q_y \end{array} \right)^\top \left(\begin{array}{c} \delta x \\ \delta y \end{array} \right) \\
& \text{subject to} & & \left(\begin{array}{cc} A_x & 0 \\ 0 & A_y \end{array} \right) \left(\begin{array}{c} \delta x \\ \delta y \end{array}\right) = \left(\begin{array}{c} b_x \\ b_y \end{array}\right) \\
& & & \left(\begin{array}{cc} G_x & 0 \\ 0 & G_y \end{array} \right) \left(\begin{array}{c} \delta x \\ \delta y \end{array}\right) \preceq \left(\begin{array}{c} h_x \\ h_y \end{array}\right)
\end{alignat}
$$ The Wolfe dual converts this problem into a standard QP: $$
\begin{alignat}{2}
&\!\sup_{\delta x, \delta y, \lambda, \mu} & \qquad & \frac{1}{2} \left(\begin{array}{c} \delta x \\ \delta y \\ \lambda \\ \mu \end{array}\right)^\top \left(\begin{array}{cccc} P_{xx} & 0 & 0 & 0 \\ 0 & -P_{yy} & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{array}\right) \left(\begin{array}{c} \delta x \\ \delta y \\ \lambda \\ \mu \end{array}\right) + \left(\begin{array}{c} q_x \\ 0 \\ -b_y \\ -h_y \end{array} \right)^\top \left(\begin{array}{c} \delta x \\ \delta y \\ \lambda \\ \mu \end{array} \right) \\
& \text{subject to} & & \left(\begin{array}{cccc} A_x & 0 & 0 & 0 \\ P_{yx} & P_{yy} & A_y^\top & G_y^\top \end{array} \right) \left(\begin{array}{c} \delta x \\ \delta y \\ \lambda \\ \mu \end{array}\right) = \left(\begin{array}{c} b_x \\ -q_y \end{array}\right) \\
& & & \left(\begin{array}{cccc} G_x & 0 & 0 & 0 \\ 0 & 0 & 0 & -I \end{array} \right) \left(\begin{array}{c} \delta x \\ \delta y \\ \lambda \\ \mu \end{array}\right) \preceq \left(\begin{array}{c} h_x \\ 0 \end{array}\right)
\end{alignat}
$$ If you solve this for $(\delta x, \delta y)$ you get a Newton step for your original problem. The step acceptance criterion here is tricky: if the iterate is feasible you want to leverage the saddle point condition (see equation (11) of Essid et al.). If the iterate is infeasible more sophistication is required, but fortunately my constraints were actually linear so doing an initial exact inner solve allowed me to iterate while staying feasible. (Note: if you solve a more general convex problem on each step, you don't need to linearize the $x$ constraints.)
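In the unconstrained case the quadratic subproblem collapses to a single linear solve for the stationary point, which is the essence of the Newton step. A minimal numpy sketch (the function name saddle_step is my own, not from any toolkit):

```python
import numpy as np

def saddle_step(Pxx, Pyx, Pyy, qx, qy):
    # Newton step for the unconstrained quadratic saddle problem
    #   sup_{dx} inf_{dy} 1/2 [dx;dy]^T [[Pxx, Pyx^T],[Pyx, Pyy]] [dx;dy]
    #                     + [qx;qy]^T [dx;dy]
    # with Pxx negative definite (concave in dx) and Pyy positive definite
    # (convex in dy): the saddle point is the unique stationary point.
    P = np.block([[Pxx, Pyx.T], [Pyx, Pyy]])
    q = np.concatenate([qx, qy])
    z = np.linalg.solve(P, -q)
    return z[:len(qx)], z[len(qx):]

dx, dy = saddle_step(np.array([[-1.0]]), np.array([[0.0]]), np.array([[1.0]]),
                     np.array([1.0]), np.array([-2.0]))
# stationarity here reads: -dx + 1 = 0 and dy - 2 = 0
```

The constrained version replaces this single solve with the QP above, but the second-order flavor is the same.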


Friday, May 10, 2019

ICLR 2019 Thoughts

ICLR 2019 was reminiscent of the early NeurIPS days (sans skiing): a single track of talks, vibrant poster sessions, and a large mid-day break. The Tuesday morning talks were on climate change, modeling proteins, generating music, and modeling the visual cortex. Except for climate change, these were all hot topics at NeurIPS in the late 1990s. History doesn't repeat, but it does rhyme.

My favorite talk was by Pierre-Yves Oudeyer, whose research in curiosity-based learning spans both human subjects and robotics. Pierre's presentation was an entertaining tour de force of cognitive science, and I highly recommend watching the video (starts about 9 minutes 30 seconds). These ideas have extensively influenced the reinforcement learning community: the well-known Achilles' heel of reinforcement learning is sample complexity, and recently practitioners have attacked it with ideas inspired by curiosity-based learning (e.g., the Burda et al. poster at the conference). Furthermore the view “exploration is for building world models” is reflected in recent theoretical results in contextual decision processes.

The strangest moment for me at the conference was seeing the GLUE poster. Apparently with the latency of conference review and publication, GLUE is just now being presented. Of course, it is already obsolete, so the presenters had another poster about their new dataset called SuperGLUE. Things are moving so quickly that the former “fast path” of conference proceedings is now noticeably behind.

Here's some stuff that caught my attention:
  • Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach: A couple years ago Zhang et al. stunned the community by demonstrating convnets can fit ImageNet labels to randomly generated images, destroying the common belief that convnets generalized well due to capacity control. Here Zhou et al. show that an MDL-style generalization bound applies, i.e., networks whose representation can be compressed after training have tighter deviation bounds. This is a (training) data-dependent bound, and they seal the argument by noting networks trained on randomized training data do not compress as well.
  • Episodic Curiosity through Reachability: One of many curiosity-based exploration posters, Savinov et al. propose a combination of a memory and something akin to a policy cover, with promising results. Also cool: the poster includes QR codes which trigger videos of agents learning to move via different algorithms.
  • Supervised Policy Update for Deep Reinforcement Learning: Vuong et al. presented a plausible improvement over TRPO and PPO by convexifying the constrained policy optimization.

Friday, March 8, 2019

RL will disrupt OR

Operations research (OR) is in the initial stages of a revolution driven by reinforcement learning (RL).

When I was at eHarmony years ago, we used classical OR techniques to drive the matchmaking process. Machine learning played a critical role, but was limited to specifying parameters to the OR solver. In essence, machine learning was used to estimate the value function, and the OR solver then produced a policy.

OR has historically focused on highly tractable specializations of convex optimization. In an age of scarce compute, this made perfect sense: indeed, at that time eHarmony was pushing the limits of what was possible using high-end commercial OR solvers. However, compute is less scarce now: in predictive modeling convex optimization has been spectacularly superseded by non-convex techniques (aka “deep learning”). A similar revolution is unfolding in OR. The recipe is: develop a generative model of the problem (aka a “simulator”) and then directly optimize a policy on simulated data using RL techniques.

All models are wrong, but some models are useful. At first blush it seems implausible that using advanced RL techniques on a crude approximation of the real world would yield substantial benefits. However traditional OR techniques also perform extremely aggressive optimization (to machine precision (!)) on an approximation of the real world. OR models are useful despite typically making tremendous simplifications such as replacing all random variables with their expected values (or in more sophisticated setups, high probability bounds).

The simplifications for the RL techniques involve the assumptions in the generative model, such as a particular parametric model for probability of an airplane service event. Early research results suggest that for some economically important problems, relatively crude simulators coupled with RL techniques can induce superior policies to those developed using traditional OR techniques. Furthermore, simulators admit expressing increasingly refined approximations of reality without the constraints imposed by classical OR formulations.

Reaction time is a factor in this so please pay attention.
Reaction time. You should almost never take seriously anyone's explanation for why something works. Nonetheless, I'll give you my intuition as to why RL will eventually dominate OR.
“Classical OR techniques are optimized to try to avoid bad events ballistically, whereas RL trained policies are optimized to react to events as they occur.”
If this is true then it doesn't matter if the simulation exactly gets the probability of tail events right, as long as they are all present in the simulation and somewhat rare, because the “use of remediation actions” portion of the learned policy will be conditioned on events as they actually occur in practice. (If the events are not rare, then getting the co-occurrence statistics right could matter.)

If this explanation has merit, then the upside to RL will be large for scenarios where classic OR optimization is frequently re-run in order to react to new events, because RL will have the “reaction time advantage”.

Tuesday, March 6, 2018

pytorch PU learning trick

I'm often using positive-unlabeled learning nowadays. In particular for observational dialog modeling, next utterance classification is a standard technique for training and evaluating models. In this setup the observed continuation of the conversation is considered a positive (since a human said it, it is presumed a reasonable thing to say at that point in the conversation) and other randomly chosen utterances are treated as unlabeled (they might be reasonable things to say at that point in the conversation).

Suppose you have a model whose final layer is a dot product between a vector produced only from context and a vector produced only from response. I use models of this form as “level 1” models because they facilitate precomputation of a fast serving index, but note the following trick will not apply to architectures like bidirectional attention. Anyway, for these models you can be more efficient during training by drawing the negatives from the same mini-batch. This is a well-known trick, but I couldn't find anybody talking about how to do this explicitly in pytorch.

Structure your model to have a leftforward and a rightforward like this:
class MyModel(nn.Module):
    # leftforward and rightforward are defined elsewhere in the module;
    # each produces a fixed-size embedding from its input.

    def forward(self, leftinput, rightinput):
        leftvec = self.leftforward(leftinput)
        rightvec = self.rightforward(rightinput)
        return torch.mul(leftvec, rightvec).sum(dim=1, keepdim=True)

At training time, compute the leftforward and rightforward for your mini-batch distinctly:
criterion = BatchPULoss()
model = MyModel()

leftvectors = model.leftforward(batch.leftinput)
rightvectors = model.rightforward(batch.rightinput)

(loss, preds) = criterion.fortraining(leftvectors, rightvectors)
# "preds" contains the highest scoring right for each left,
# so for instance, calculate "mini-batch precision at 1"
gold_labels = torch.arange(0, batch.batch_size).long().cuda()
n_correct += (preds == gold_labels).sum()
Finally use this loss:
import torch

class BatchPULoss():
    def __init__(self):
      self.loss = torch.nn.CrossEntropyLoss()

    def fortraining(self, left, right):
      # all pairwise dot products: entry (i, j) scores left i against right j,
      # so the diagonal holds the positives and the off-diagonal the in-batch negatives
      outer =, right.t())
      labels = torch.autograd.Variable(torch.arange(0, outer.shape[0]).long().cuda())
      loss = self.loss(outer, labels)
      _, preds = torch.max(outer, dim=1)
      return (loss, preds)

    def __call__(self, *args, **kwargs):
      return self.loss(*args, **kwargs)
At training time you call the fortraining method but if you have fixed distractors for evaluation you can also call it directly just like CrossEntropyLoss.

Friday, December 15, 2017

NIPS Conversation AI Workshop

I only attended NIPS for the Conversation AI workshop, so my thoughts are limited to that. I really liked the subtitle of the workshop: "today's practice and tomorrow's potential." Since I'm on a product team trying to build chatbots that are actually effective, it struck me as exactly the right tone.

Several presentations were related to the Alexa prize. When reading these papers, keep in mind that contestants were subject to extreme sample complexity constraints. Semifinalists had circa 500 on-policy dialogs and finalists less than 10 times more. This is because 1) the Alexa chat function is not the primary purpose of the device so not all end users participated and 2) they had to distribute the chats to all contestants.

The result of sample complexity constraints is a “bias against variance”, as I've discussed before. In the Alexa prize, that meant the winners had the architecture of “learned mixture over mostly hand-specified substrategies.” In other words, the (scarce) on-policy data was limited to adjusting the mixture weights. (The MILA team had substrategies that were trained unsupervised on forum data, but it looks like the other substrategies were providing most of the benefit.) Sample complexity constraints are pervasive in dialog, but nonetheless the conditions of the contest were more extreme than what I encounter in practice so if you find yourself with more on-policy data consider more aggressive usage.

Speaking of sample complexity constraints, we have found pre-training representations on MT tasks a la CoVE is extremely effective in practice for multiple tasks. We are now playing with ELMo-style pre-training using language modeling as the pre-training task (very promising: no parallel corpus needed!).

Another sample complexity related theme I noticed at the workshop was the use of functional role dynamics. Roughly speaking, this is modeling the structure of the dialog independent of the topic. Once topics are abstracted, the sample complexity of learning what are reasonably structured conversations seems low. Didericksen et al. combined a purely structural L1 model with a simple topically-sensitive L2 (tf-idf) to build a retrieval based dialog simulator. Analogously for their Alexa prize submission, Serban et al. learned a dialog simulator from observational data which utilized only functional role and sentiment information and then applied Q-learning: this was more effective than off-policy REINFORCE with respect to some metrics.

Overall the workshop gave me enough optimism to continue plugging away despite the underwhelming performance of current dialog systems.

Thursday, August 10, 2017

ICML 2017 Thoughts

ICML 2017 has just ended. While Sydney is remote for those in Europe and North America, the conference center
is a wonderful venue (with good coffee!), and the city is a lot of fun. Everything went smoothly and the
organizers did a great job.

You can get a list of papers that I liked from my Twitter feed, so instead I'd like to discuss some broad themes
I sensed.

  • Multitask regularization to mitigate sample complexity in RL. Both in video games and in dialog, it is useful to add extra (auxiliary) tasks in order to accelerate learning.
  • Leveraging knowledge and memory. Our current models are powerful function approximators, but in NLP especially we need to go beyond "the current example" in order to exhibit competence.
  • Gradient descent as inference. Whether it's inpainting with a GAN or BLEU score maximization with an RNN, gradient descent is an unreasonably good inference algorithm.
  • Careful initialization is important. I suppose traditional optimization people would say "of course", but we're starting to appreciate the importance of good initialization for deep learning. In particular, start close to linear with eigenvalues close to 1. (Balduzzi et al., Poole et al.)
  • Convolutions are as good as, and faster than, recurrent models for NLP. Nice work out of Facebook on causal convolutions for seq2seq. This aligns with my personal experience: we use convolutional NLP models in production for computational performance reasons.
  • Neural networks are overparameterized. They can be made much sparser without losing accuracy (Molchanov et al., Lobacheva et al.).
  • maluuba had the best party. Woot!
Finally, I kept thinking the papers are all “old”. While there were lots of papers I was seeing for the first time, it nonetheless felt like the results were all dated because I've become addicted to “fresh results” on arxiv.

Wednesday, July 26, 2017

Rational Inexuberance

Recently Yoav Goldberg had a famous blog rant. I appreciate his concern, because the situation is game-theoretically dangerous: any individual researcher receives a benefit for aggressively positioning their work (as early as possible), but the field as a whole risks another AI winter as rhetoric and reality become increasingly divergent. Yoav's solution is to incorporate public shaming in order to align local incentives with aggregate outcomes (c.f., reward shaping).

I feel there is a better way, as exemplified by a recent paper by Jia and Liang. In this paper the authors corrupt the SQuAD dataset with distractor sentences which have no effect on human performance, but which radically degrade the performance of the systems on the leaderboard. This reminds me of work by Paperno et al. on a paragraph completion task which humans perform with high skill and for which all state-of-the-art NLP approaches fail miserably. Both of these works clearly indicate that our current automatic systems only bear a superficial (albeit economically valuable) resemblance to humans.

This approach to honest self-assessment of our capabilities is not only more scholarly, but also more productive, as it provides concrete tasks to consider. At minimum, this will result in improved technological artifacts. Furthermore iterating this kind of goal-setting-and-goal-solving procedure many many times might eventually lead to something worthy of the moniker Artificial Intelligence.

(You might argue that the Yoav Goldberg strategy is more entertaining, but the high from the Yoav Goldberg way is a "quick hit", whereas having a hard task to think about has a lot of "replay value".)

Monday, July 17, 2017

Tiered Architectures, Counterfactual Learning, and Sample Complexity

I'm on a product team now, and once again I find myself working on a tiered architecture: an “L1” model selects some candidates which are passed to an “L2” model which reranks and filters the candidates which are passed to an “L3”, etc. The motivation for this is typically computational, e.g., you can index a DSSM model pretty easily but indexing a BIDAF model is more challenging. However I think there are potential sample complexity benefits as well.

I worry about sample complexity in counterfactual setups, because I think it is the likely next source for AI winter. Reinforcement learning takes a tremendous amount of data to converge, which is why all the spectacular results from the media are in simulated environments, self-play scenarios, discrete optimization of a sub-component within a fully supervised setting, or other situations where there is essentially infinite data. In real life data is limited.

So when I read Deep Reinforcement Learning in Large Discrete Action Spaces by Dulac-Arnold et. al., I noticed that the primary motivation was computational, but figured another (more important?) benefit might be statistical. Tiered architectures cannot overcome worst-case sample complexity bounds, but I think in practice they are a good strategy for counterfactual setups.

Tiered architectures admit semi-supervised approaches, because an L1 model can often be initialized using unsupervised techniques (e.g., word embeddings, sentence embeddings, inverted indices with tf-idf). Learning the L2 model utilizing this L1 model only has a sample complexity based upon the number of candidates produced by the L1 model, rather than the total number of candidates. Of course, learning the L1 still has a sample complexity based upon the total number of candidates, but if the unsupervised initialization is good then it is ok that the L1 learns slowly. Furthermore in practice the L1 hypothesis class is simpler (because of computational reasons), which mitigates the sample complexity.

There was a workshop called “coarse-to-fine inference” at NIPS 2017 which presumably explored these ideas, but I didn't attend it and their website is down. Hopefully there will be another one; I will attend!

Saturday, March 25, 2017

Why now is the time for dialog

I'm working on a task-oriented dialog product and things are going surprisingly well from a business standpoint. It turns out that existing techniques are sufficient to shift some portion of commercial dialog interactions from human-mediated to machine-mediated, with tremendous associated cost savings which exceed the cost of developing the automatic systems. Here's the thing that is puzzling: the surplus is so large that, as far as I can tell, it would have been viable to do this 10 years ago with then-current techniques. All the new fancy AI stuff helps, but only to improve the margins. So how come these businesses didn't appear 10 years ago?

I suspect the answer is that a format shift has occurred away from physical transactions and voice mediated interactions to digital transactions and chat mediated interactions.

The movement away from voice is very important: if we had to try and do this using ASR, even today, it probably wouldn't work. Fortunately, today you chat with your cable company rather than talking to them. That shift was motivated by cost savings: a human agent can handle multiple concurrent chat sessions more easily than multiple concurrent voice conversations. However it requires most of your customers to have a computer, smartphone, or other device rather than an old-school telephone.

The continuing dominance of e-commerce over physical stores is also a factor (RIP Sears). In e-commerce, human salespersons increasingly assist customers in transactions via live chat interfaces. Once again, what starts as a more effective way of deploying human resources becomes the vector by which automation increasingly handles the workload.

The end game here is that the number of people employed in retail goes down, but that their compensation goes up. That is because the machines will increasingly handle the routine aspects of these domains, leaving only the long tail of extremely idiosyncratic issues for the humans to resolve. Handling these non-routine issues will require more skill and experience and therefore demand higher compensation (also, an increasing part of the job will be to structure the torso of non-routine issues into something that the machines can handle routinely, i.e., teaching the machines to handle more; this is analogous to programming and will also demand higher compensation).

Wednesday, February 15, 2017

Software Engineering vs Machine Learning Concepts

Not all core concepts from software engineering translate into the machine learning universe. Here are some differences I've noticed.

Divide and Conquer A key technique in software engineering is to break a problem down into simpler subproblems, solve those subproblems, and then compose them into a solution to the original problem. Arguably, this is the entire job, recursively applied until the solution can be expressed in a single line in whatever programming language is being used. The canonical pedagogical example is the Tower of Hanoi.

Unfortunately, in machine learning we never exactly solve a problem. At best, we approximately solve a problem. This is where the technique needs modification: in software engineering the subproblem solutions are exact, but in machine learning errors compound and the aggregate result can be complete rubbish. In addition apparently paradoxical situations can arise where a component is “improved” in isolation yet aggregate system performance degrades when this “improvement” is deployed (e.g., due to the pattern of errors now being unexpected by downstream components, even if they are less frequent).

Does this mean we are doomed to think holistically (which doesn't sound scalable to large problems)? No, but it means you have to be defensive about subproblem decomposition. The best strategy, when feasible, is to train the system end-to-end, i.e., optimize all components (and the composition strategy) together rather than in isolation. Often this is not feasible, so another alternative (inspired by Bayesian ideas) is to have each component report some kind of confidence or variance along with the output in order to facilitate downstream processing and integration.

In practice, when systems get to a particular scope, there needs to be decomposition in order to divide the work up amongst many people. The fact that this doesn't work right now in machine learning is a problem, as elegantly described by Leon Bottou in his ICML 2015 invited talk.

Speaking of another concept that Leon discussed $\ldots$

Correctness In software engineering, an algorithm can be proven correct, in the sense that given particular assumptions about the input, certain properties will be true when the algorithm terminates. In (supervised) machine learning, the only guarantee we really have is that if the training set is an iid sample from a particular distribution, then performance on another iid sample from the same distribution will be close to that on the training set and not too far from optimal.

Consequently, anyone who practices machine learning for a living has an experimental mindset. Oftentimes I am asked whether option A or option B is better, and most of the time my answer is “I don't know, let's try both and see what happens.” Maybe the most important thing that people in machine learning know is how to assess a model in a way that is predictive of generalization. Even that is a “feel” thing: identifying and preventing leakage between training and validation sets (e.g., by stratified and temporal sampling) is something you learn by screwing up a few times; ditto for counterfactual loops. Kaggle is great for learning about the former, but the latter seems to require making mistakes on a closed-loop system to really appreciate.

Experimental “correctness” is much weaker than the guarantees from other software, and there are many ways for things to go badly. For example, in my experience it is always temporary: models go stale; it just always seems to happen. Ergo, you need to plan to be continually (hence, automatically) retraining models.

Reuse This one is interesting. Reuse is the key to leverage in traditional software engineering: it's not just more productive to reuse other code, but every line of code you write yourself is an opportunity to inject defects. Thus, reuse not only allows you to move faster but also to make fewer mistakes; in return, you must pay the price of learning how to operate a piece of software written by others (when done well, this price has been lowered through good organization, documentation, and community support).

Some aspects of machine learning exhibit exactly the same tradeoff. For instance, if you are writing your own deep learning toolkit, recognize that you are having fun. There's nothing wrong with having fun, and pedagogical activities are arguably better than playing video games all day. However, if you are trying to get something done, you should absolutely attempt to reuse as much technology as you can, which means you should be using a standard toolkit. You will move faster and make fewer mistakes, once you learn how to operate the standard toolkit.

Machine learning toolkits are “traditional software”, however, and are designed to be reused. What about model reuse? That can be good as well, but the caveats about decomposition above still apply. So maybe you use a model which produces features from a user profile as inputs to your model. Fine, but you should version the model you depend upon and not blindly upgrade without assessment or retraining. Reusing the internals of another model is especially dangerous as most machine learning models are not identifiable, i.e., have various internal symmetries which are not determined by the training procedure. Couple an embedding to a tree, for instance, and when the next version of the embedding is a rotation of the previous one, you can watch your performance go to crap immediately.
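Here is a toy illustration of that non-identifiability hazard (pure Python, synthetic data): a rotation of an embedding preserves everything an inner-product-based training loss can see, yet flips the output of a downstream axis-aligned split such as a tree node.

```python
import math, random

random.seed(0)
emb = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(200)]  # "version 1"

def rotate(p, theta=math.pi / 2):
    """Rotate a 2-d point; stands in for a retrained, symmetry-shifted embedding."""
    x, y = p
    return (math.cos(theta) * x - math.sin(theta) * y,
            math.sin(theta) * x + math.cos(theta) * y)

emb2 = [rotate(p) for p in emb]  # "version 2": same geometry, rotated

def dot(p, q):
    return p[0] * q[0] + p[1] * q[1]

# Any loss built on inner products (or distances) cannot tell the versions apart...
assert all(abs(dot(emb[i], emb[j]) - dot(emb2[i], emb2[j])) < 1e-9
           for i in range(20) for j in range(20))

# ...but a tree that learned "split on coordinate 0 at 0.0" now disagrees often.
disagreement = sum((p[0] > 0) != (q[0] > 0) for p, q in zip(emb, emb2)) / len(emb)
```

Nothing in the embedding's own objective penalizes the rotation; only the coupled downstream consumer notices.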

Basically, model reuse creates strong coupling between components which can be problematic if one component is changed.

Testing I find the role of software testing in machine learning to be the trickiest issue of all. Without a doubt testing is necessary, but the challenge in using something like property-based testing is that the concept that is being captured by the machine learning component is not easily characterized by properties (otherwise, you would write it using non-ml software techniques). To the extent there are some properties that the ml component should exhibit, you can test for these, but unless you incorporate these into the learning procedure itself (e.g., via parameter tying or data augmentation) you are likely to have some violations of the property that are not necessarily indicative of defects.

Having an “extra-test” data set with minimal acceptable quality is a good idea: this could be easy examples that “any reasonable model” should get correct. There's also self-consistency: at Yahoo they used to ship models with a set of input-output pairs that were computed with the model when it was put together, and if the loaded model didn't reproduce the pairs, the model load was cancelled. (That should never happen, right? Surprise! Maybe you are featurizing the inputs using a library with a different version or something.)
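A minimal sketch of that self-consistency check (all function names hypothetical): ship a few (input, expected output) pairs alongside the model, and refuse to load if the freshly deserialized model disagrees.

```python
import math

def verify_model(model, consistency_pairs, tol=1e-6):
    """Return True iff the loaded model reproduces its shipped outputs."""
    return all(math.isclose(model(x), y, rel_tol=tol, abs_tol=tol)
               for x, y in consistency_pairs)

def load_model(weights, consistency_pairs):
    # stand-in for real deserialization plus featurization
    model = lambda x: weights[0] * x + weights[1]
    if not verify_model(model, consistency_pairs):
        raise RuntimeError("model failed self-consistency check; load cancelled")
    return model

# pairs computed when the model was put together
pairs = [(0.0, 1.0), (2.0, 5.0)]
model = load_model([2.0, 1.0], pairs)  # reproduces the pairs, so it loads
```

A version skew in the featurization library would surface here as a refused load rather than silently wrong predictions.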

Monitoring the metrics (proxy and true) of deployed models is also good for detecting problems. If the proxy metric (i.e., the thing on which you actually trained your model and estimated generalization performance) is going south, the inputs to your model are changing somehow (e.g., nonstationary environment, change in feature extraction pipeline); but if the proxy metric is stable while the true metric is going south, the problem might be in how the outputs of your model are being leveraged.
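As a sketch of that triage logic (window sizes and thresholds are made up), comparing recent metric values to a baseline for both the proxy and true metrics suggests where to look first:

```python
def degraded(history, window=7, tol=0.05):
    """Flag if the recent mean of a metric drops more than tol below baseline."""
    baseline = sum(history[:-window]) / len(history[:-window])
    recent = sum(history[-window:]) / window
    return recent < baseline - tol

def triage(proxy_history, true_history):
    if degraded(proxy_history):
        return "inputs changed (nonstationarity or feature-pipeline change)"
    if degraded(true_history):
        return "check how the model's outputs are leveraged downstream"
    return "no alarm"

proxy = [0.80] * 14                   # training objective: stable
true = [0.90] * 7 + [0.70] * 7        # business metric: going south
alarm = triage(proxy, true)
```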

Unfortunately what I find is many software systems with machine learning components are tested in a way that would make traditional software engineers cringe: we look at the output to see if it is reasonable. Crazy! As machine learning becomes a more pervasive part of software engineering, this state of affairs must change.

Friday, January 27, 2017

Reinforcement Learning and Language Support

What is the right way to specify a program that learns from experience? Existing general-purpose programming languages are designed to facilitate the specification of any piece of software. So we can just use these programming languages for reinforcement learning, right? Sort of.

Abstractions matter

An analogy with high performance serving might be helpful. An early influential page on high performance serving was the C10K problem page by Dan Kegel, which surveyed strategies for handling ten thousand simultaneous client connections.

I don't know what the state-of-the-art is in high performance serving now; I'm a bit out of that game. The main point is that all programming languages are not created equal, in that they create different cognitive burdens on the programmer and different computational burdens at runtime. I could use an existing language (at that time, C++) in one of two ways (cooperative scheduling vs. pre-emptive scheduling), or I could use a different language (Erlang) that was designed to mitigate the tradeoff.

Imperative specification with automatic credit assignment

As previously stated, the difference between the programs we'd like to specify now, versus the ones specified in the past, is that we want our programs to be able to learn from experience. As with high-performance serving, we'd like to balance the cognitive burden on the programmer with the computational burden imposed at runtime (also, possibly, the statistical burden imposed at runtime; computational burdens correspond to resources such as time or space, whereas the statistical burden corresponds to data resources).

Within the current “AI summer”, one idea that has become popular is automatic differentiation. Full AD means that essentially any language construct can be used to define a function, and the computation to compute the gradient of the function with respect to the input is provided “for free.” A language equipped with AD which is computing a (sub-)differentiable function can learn from experience in the sense of moving closer to a local optimum of a loss function. Deep learning toolkits implement AD to various degrees, with some frameworks (e.g., Chainer) aggressively pursuing the idea of allowing arbitrary language constructs when specifying the forward computation.
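To make the “for free” claim concrete, here is a minimal forward-mode AD sketch using dual numbers (this is not any particular toolkit's API): a value and its derivative flow together through arbitrary Python control flow, here a loop.

```python
class Dual:
    """A number carrying (value, derivative) through ordinary arithmetic."""
    def __init__(self, val, grad=0.0):
        self.val, self.grad = val, grad

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.grad + o.grad)
    __radd__ = __add__

    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # product rule
        return Dual(self.val * o.val, self.val * o.grad + self.grad * o.val)
    __rmul__ = __mul__

def f(x):
    # arbitrary language constructs are fine: this loop computes x ** 3
    y = Dual(1.0)
    for _ in range(3):
        y = y * x
    return y

dfdx = f(Dual(2.0, 1.0)).grad  # derivative of x**3 at x = 2
```

Reverse-mode AD (what deep learning toolkits actually use) records the computation instead of propagating derivatives forward, but the programmer-facing idea is the same: write ordinary code, get gradients.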

The ability to use arbitrary language constructs becomes increasingly important as inference becomes more complicated. Simple inference (e.g., classification or ranking) is easy to reason about but beyond that it quickly becomes a major source of defects to 1) specify how the output of a machine learning model is used to synthesize a complete system and 2) specify how the data obtained from running that complete system is used to update the model.

The problem is clearly visible in the field of structured prediction. “Structured prediction”, of course, is a somewhat ridiculous term analogous to the term “nonlinear systems analysis”; in both cases, a simpler version of the problem was solved initially (classification and linear systems analysis, respectively) and then an umbrella term was created for everything else. Nonetheless, Hal Daume has a good definition of structured prediction, which is making multiple predictions on a single example and experiencing a joint (in the decisions) loss. (He also has a Haiku version of this definition.)

Because inference in structured prediction is complicated, the ideas of imperative specification and automated credit assignment were essentially reinvented for structured prediction. The technique is outlined in an arXiv paper by Chang et al., but fans of Chainer will recognize this as the analog of “define-by-run” for structured prediction. (Note the optimization strategy here is not gradient descent, at least not on the forward computation, but rather something like a policy gradient method which translates to a discrete credit assignment problem over the predictions made by the forward computation.)

One way to view episodic RL is structured prediction with bandit feedback: structured prediction is fully observed, analogous to supervised learning, in that it is possible to compute the loss of any sequence of decisions given a particular input. In reinforcement learning you have bandit feedback, i.e., you only learn about the loss associated with the sequence of decisions actually taken. While this isn't the only way to view episodic RL, it does facilitate connecting with some of the ideas of the paper mentioned in the previous paragraph.
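To make “bandit feedback” concrete, here is a sketch of the standard inverse propensity scoring (IPS) value estimator: even though only the reward of the action actually taken is observed, any policy pi can be estimated from data logged under a policy mu, provided mu's action probabilities were recorded.

```python
def ips_estimate(data, pi):
    """data: iterable of (x, a, r, mu_prob) logged tuples.
    pi(a, x) -> probability that the evaluated policy takes a in context x."""
    total, n = 0.0, 0
    for x, a, r, mu_prob in data:
        total += (pi(a, x) / mu_prob) * r  # importance-weighted reward
        n += 1
    return total / n

# toy example: mu was uniform over 2 actions; pi deterministically takes action 0
logged = [(None, 0, 1.0, 0.5), (None, 1, 0.0, 0.5)]
pi = lambda a, x: 1.0 if a == 0 else 0.0
v_hat = ips_estimate(logged, pi)
```

The importance weights are exactly where the variance problems live: if pi favors actions mu rarely took, a few tuples dominate the estimate.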

A Motivating Example

Here's an example which will hopefully clarify things. Suppose we want to build an interactive question-answering system, in which users pose questions, and then the system can optionally ask a (clarifying) question to the user or else deliver an answer. We can view this as an episodic RL problem, where the user statements are observations, system questions are actions, system answers are more actions, and the episode ends as soon as we deliver an answer.

What I'd like to do is specify the computation something like this pseudo-python:
def interactive_qa_episode():
  uq = get_user_question()
  qapairs = []
  sysaction = get_next_system_action(uq, qapairs)
  while (sysaction.is_question):
    ua = get_user_answer(sysaction.utterance)
    qapairs.append((sysaction, ua))
    sysaction = get_next_system_action(uq, qapairs)
It is pretty clear what is going on here: we get a user question, conditionally ask questions, and then deliver an answer. Before the advent of machine learning, an implementer of such a system would attempt to fill out the unspecified functions above: in particular, get_next_system_action is tricky to hand specify. What we would like to do is learn this function instead.

It would be nice to use decorators to achieve this. First, to learn we need some idea of doing better or worse, so assume after delivering an answer there is some way to decide how satisfied the user is with the session (which, ceteris paribus, should be monotonically decreasing with the number of questions asked, to encourage expediency):
def interactive_qa_episode():
  uq = get_user_question()
  qapairs = []
  sysaction = get_next_system_action(uq, qapairs)
  while (sysaction.is_question):
    ua = get_user_answer(sysaction.utterance)
    qapairs.append((sysaction, ua))
    sysaction = get_next_system_action(uq, qapairs)
  # this next line is the only change to the original function
  reward = deliverAnswer(sysaction.utterance)
All too easy! Pseudo-code is so productive. We can even imagine updating reward multiple times, with the decorator keeping track of the reward deltas for improved credit assignment.

Now some magic metaprogramming kicks in and converts this into a model being trained with an RL algorithm (e.g., a value iteration method such as q-learning, or a policy iteration method such as bandit LOLS). Or does it? We still haven't said which functions are to be learned and which are hand-specified. The default will be hand-specified, so we will decorate one function.
def get_next_system_action(uq, qapairs):
Now we get into some thorny issues. We need to specify this function ultimately in terms of a parameterized model like a neural network; we'll have to say what the initial representation is that is computed from variables like uq and qapairs; and we'll have to say how the output of the model is mapped onto an actual decision. Just to keep moving, let's assume there is a fixed small set of system questions and system answers.
action_table = [ ... ] # list containing action mapping
def get_next_system_action(uq, qapairs):
  not_allowed_action_ids = [ sysa.action_id for (sysa, _) in qapairs ]
  action_id = categorical_choice(uq=uq,
                                 qapairs=qapairs,
                                 not_allowed_action_ids=not_allowed_action_ids,
                                 tag='nextsystemaction')
  return action_table[action_id]
categorical_choice is the representation of a forced choice from one of a set of possibilities. For small action spaces, this could be directly implemented as an output per action, but for large action spaces this might be implemented via action embeddings with an information-retrieval style cascading pipeline.
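For the small-action-space case, categorical_choice might be sketched as a masked softmax with one output per action (the scores would come from the learned model; the tag-based joining with a model specification is elided, and at least one action must remain allowed):

```python
import math, random

def categorical_choice(scores, not_allowed_action_ids, rng=random):
    """Sample an action id from softmax(scores), masking disallowed actions."""
    masked = [s if i not in not_allowed_action_ids else float("-inf")
              for i, s in enumerate(scores)]
    m = max(masked)  # max-subtraction for numerical stability
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) == 0.0, so masked -> 0
    z = sum(exps)
    probs = [e / z for e in exps]
    r, cum = rng.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    # numerical fallback: return the last allowed action
    return max(i for i, p in enumerate(probs) if p > 0.0)
```

Sampling (rather than argmax) matters here because exploration is what generates the data the learner needs.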

Great right? Well some problems remain.
  • The best model structure (i.e., policy class) for the choice requires some specification by the programmer, e.g., a convolutional text network vs. an iterated attention architecture. Ideally this specification is distinct from the specification of inference, so that many modeling ideas can be tried. That's the purpose of the tag argument, to join with a separate specification of the learning parameters. (If not provided, sane default tags could be generated during compilation.)
  • As indicated in the previous post, bootstrapping is everything. So an initial implementation of get_next_system_action needs to be provided. Maybe this reduces to providing an initial setting of the underlying model, but maybe it doesn't depending upon the initialization scenario. Note if initialization is done via simulation or off-policy learning from historical data, these could be supported by facilitating the mockup of the I/O functions get_user_question and get_user_answer. Another common scenario is that a not-learned function is provided as a reference policy with which the learned function should compete.
Can't I do this with Chainer already? Sort of. If you use a particular RL algorithm, definitely. For instance, q-learning reduces reinforcement learning to regression, so if you code that inline, you get something Chainer could handle. However the goal is to specify inference without leaking details about the learning algorithm, so I'd rather not code that inline. An alternative is to compile to Chainer, akin to cfront in the early days of C++.

Ultimately, however, I would hope to have a different compilation strategy. There's more at stake than just implementing the learning algorithm: there are all the issues mentioned in my previous post that have convinced me that the implementation should be able to leverage a reinforcement learning service.

Saturday, January 21, 2017

Reinforcement Learning as a Service

I've been integrating reinforcement learning into an actual product for the last 6 months, and therefore I'm developing an appreciation for what are likely to be common problems. In particular, I'm now sold on the idea of reinforcement learning as a service, of which the decision service from MSR-NY is an early example (limited to contextual bandits at the moment, but incorporating key system insights).

Service, not algorithm Supervised learning is essentially observational: some data has been collected and subsequently algorithms are run on it. (Online supervised learning doesn't necessarily work this way, but mostly online techniques have been used for computational reasons after data collection.) In contrast, counterfactual learning is very difficult to do observationally. Diverse fields such as economics, political science, and epidemiology all attempt to make counterfactual conclusions using observational data, essentially because this is the only data available (at an affordable cost). When testing a new medicine, however, the standard is to run a controlled experiment, because with control over the data collection more complicated conclusions can be made with higher confidence.

Analogously, reinforcement learning is best done “in the loop”, with the algorithm controlling the collection of data which is used for learning. Because of this, a pure library implementation of a reinforcement learning algorithm is unnatural, because of the requisite state management. For example, rewards occur after actions are taken, and these need to be ultimately associated with each other for learning. (One of my first jobs was at a sponsored search company called Overture, and maintaining the search-click join was the full-time job of a dozen engineers: note this was merely an immediate join for a singleton session!)
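A toy sketch of that state management burden (all names hypothetical): decisions and rewards arrive separately and must be joined on an event key before any learning can happen, with a timeout path for rewards that never arrive (e.g., no click).

```python
class RewardJoiner:
    """Join asynchronously arriving (decision, reward) events by key."""
    def __init__(self):
        self.pending = {}   # event_id -> (context, action), awaiting reward
        self.joined = []    # completed (context, action, reward) triples

    def log_decision(self, event_id, context, action):
        self.pending[event_id] = (context, action)

    def log_reward(self, event_id, reward):
        if event_id in self.pending:
            context, action = self.pending.pop(event_id)
            self.joined.append((context, action, reward))

    def expire(self, event_ids, default_reward=0.0):
        """Close out events whose reward window elapsed."""
        for eid in event_ids:
            if eid in self.pending:
                context, action = self.pending.pop(eid)
                self.joined.append((context, action, default_reward))

joiner = RewardJoiner()
joiner.log_decision("evt-1", context="x", action="a")
joiner.log_reward("evt-1", reward=1.0)
```

A production version of this needs durable storage, deduplication, and distributed coordination, which is exactly why a service is more natural than a library.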

Ergo, packaging reinforcement learning as a service makes more sense. This facilitates distributed coordination of the model updates, the serving (exploration) distribution, and the data collection. This scenario is a natural win for cloud computing providers. However, in practice there needs to be an offline client mode (e.g., for mobile and IoT applications); furthermore, this capability would be utilized even in a pure datacenter environment because of low latency decision requirements. (More generally, there would probably be a “tiered learning” architecture analogous to the tiered storage architectures utilized in cloud computing platforms. Brendan McMahan has been thinking along these lines under the rubric of federated learning.)

Bootstrapping is everything It is amazing how clarifying it is to try to solve an actual problem. I now appreciate that reinforcement learning has been oversold a bit. In particular, the sample complexity requirements for reinforcement learning are quite high. (That's fancy talk for saying it takes a lot of data to converge.) When you are working in a simulated environment that's not such a concern, because you have the equivalent of infinite training data, so we see dramatic results in simulated environments.

When reinforcement learning is done on live traffic with real users, you have less data than you think, because you always start with a test fraction of traffic and you don't get more until you are better (catch-22). So I actually spend a lot of my time developing initial serving policies, unfortunately somewhat idiosyncratically: imitation learning can be great with the right data assets, but heuristic strategies are also important. I suspect initialization via not-smartly-initialized-RL in a simulated environment is another possibility (in dialog, simulators aren't so good, so I haven't leveraged this strategy yet).

This creates some design questions for RL as a service.
  • Assuming there is an initial serving policy, how do I specify it? In the decision service you pass in the action that the initial serving policy would take which is fine for contextual bandits, but for a multi-step epoch this could be cumbersome because the initial serving policy needs to maintain state. It would make sense for the service to make it easier to manage this.
  • How does the service help me put together the initial serving policy? Considering my experience so far, here are some possible ways to develop an initial serving policy:
    • An arbitrary program (“heuristic”). Sometimes this is the easiest way to cold start, or this might be the current “champion” system.
    • Imitation learning. Assumes suitable data assets are available.
    • Off-policy learning from historical data. This can be better than imitation learning if the historical policy was suitably randomized (e.g., the exhaust of previous invocations of RL as a service).
    • Bootstrapping via simulation. In dialog this doesn't seem viable, but if a good simulator is available (e.g., robotics and game engines?), this could be great. Furthermore this would involve direct reuse of the platform, albeit on generated data.

Language is the UI of programming I think ideas from credit-assignment compilation would not only address the question of how to specify the initial policy, but also provide the most natural interface for utilizing RL as a service. I'll do another post exploring that.

Friday, January 13, 2017

Generating Text via Adversarial Training

There was a really cute paper at the GAN workshop this year, Generating Text via Adversarial Training by Zhang, Gan, and Carin. In particular, they make a couple of unusual choices that appear important. (Warning: if you are not familiar with GANs, this post will not make a lot of sense.)
  1. They use a convolutional neural network (CNN) as a discriminator, rather than an RNN. In retrospect this seems like a good choice, e.g. Tong Zhang has been crushing it in text classification with CNNs. CNNs are a bit easier to train than RNNs, so the net result is a powerful discriminator with a relatively easy optimization problem associated with it.
  2. They use a smooth approximation to the LSTM output in their generator, but actually this kind of trick appears everywhere so isn't so remarkable in isolation.
  3. They use a pure moment matching criterion for the saddle point optimization (estimated over a mini-batch). GANs started with a pointwise discrimination loss and more recent work has augmented this loss with moment matching style penalties, but here the saddle point optimization is pure moment matching. (So technically the discriminator isn't a discriminator. They actually refer to it as discriminator or encoder interchangeably in the text, this explains why.)
  4. They are very smart about initialization. In particular the discriminator is pre-trained to distinguish between a true sentence and the same sentence with two words swapped in position. (During initialization, the discriminator is trained using a pointwise classification loss). This is interesting because swapping two words preserves many of the $n$-gram statistics of the input, i.e., many of the convolutional filters will compute the exact same value. (I've had good luck recently using permuted sentences as negatives for other models, now I'm going to try swapping two words.)
  5. They update the generator more frequently than the discriminator, which is counter to the standard folklore which says you want the discriminator to move faster than the generator. Perhaps this is because the CNN optimization problem is much easier than the LSTM one; the use of a purely moment matching loss might also be relevant.
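Point 4 is easy to replicate for other models; here is a sketch of the two-word-swap negative generator (a hypothetical helper, not the paper's code). Swapping two word positions preserves the sentence's bag-of-words statistics and most of its $n$-grams, which is what makes it a hard negative for a convolutional discriminator.

```python
import random

def swap_two_words(sentence, rng=random):
    """Return the sentence with two distinct word positions exchanged."""
    words = sentence.split()
    if len(words) < 2:
        return sentence  # nothing to swap
    i, j = rng.sample(range(len(words)), 2)  # two distinct positions
    words[i], words[j] = words[j], words[i]
    return " ".join(words)
```

During discriminator pre-training, each true sentence would be paired with its swapped version under a pointwise classification loss.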

The old complaint about neural network papers was that you couldn't replicate them. Nowadays it is often easier to replicate neural network papers than other papers, because you can just fork their code on github and run the experiment. However, I still find it difficult to ascertain the relative importance of the various choices that were made. For the choices enumerated above: what is the sensitivity of the final result to these choices? Hard to say, but I've started to assume the sensitivity is high, because when I have tried to tweak a result after replicating it, it usually goes to shit. (I haven't tried to replicate this particular result yet.)

Anyway this paper has some cool ideas and hopefully it can be extended to generating realistic-looking dialog.

Saturday, December 17, 2016

On the Sustainability of Open Industrial Research

I'm glad OpenAI exists: the more science, the better! Having said that, there was a strange happenstance at NIPS this year. OpenAI released OpenAI Universe, their second big release of a platform for measuring and training counterfactual learning algorithms. This is the kind of behaviour you would expect from an organization which is promoting the general advancement of AI without consideration of financial gain. At the same time, Google, Facebook, and Microsoft all announced analogous platforms. Nobody batted an eyelash at the fact that three for-profit organizations were tripping over themselves to give away basic research technologies.

A naive train of thought says that basic research is a public good, subject to the free-rider problem, and therefore will be underfunded by for-profit organizations. If you think this is a strawman position, you haven't heard of the Cisco model for innovation. When this article was written:
…Cisco has no “pure” blue-sky research organization. Rather, when Cisco invests research dollars, it has a specific product in mind. The company relies on acquisitions to take the place of pure research …
Articles like that used to worry me a lot. So why (apparently) is this time different?

Factor 1: Labor Market Scarcity

Informal discussions with my colleagues generally end up at this explanation template. Specific surface forms include:
  • “You can't recruit the best people without good public research.” On its face, I think this statement is true, but the logic is somewhat circular. You certainly can't recruit the best researchers without good public research, but why do you want them in the first place? So is the statement more like “With good public research, you can recruit the best people, and then convince them to do some non-public research.” (?) A lot of grad students do seem to graduate and then “disappear”, so there is probably some truth to this.
  • “The best people want to publish: it's a perk that you are paying them.” Definitely, getting public recognition for your work is rewarding, and it makes total sense for knowledge workers to want to balance financial capital and social capital. Public displays of competence are transferable to a new gig, for instance. But this line of thought assumes that public research is a cost for employers that they chose to pay in lieu of, e.g., higher salaries.
I not only suspect this factor is just part of the picture: I strongly hope that it is. Because if it is the whole picture, as soon as the labor market softens, privately funded public research will experience a big pullback, which would suck.

Factor 2: Positive Externalities

This argument is: “researchers improve the productivity of those nearby such that it is worth paying them just to hang out.” In this line of thinking even a few weeks lead time on the latest ideas, plus the chance to talk in person with thought leaders in order to explain the nuances of the latest approaches, is worth their entire salary. There is some truth to this, e.g., Geoffrey Hinton performed some magic for the speech team here back in the day. The problem I have with this picture is that, in practice, it can be easier to communicate and collaborate with somebody across the planet than with somebody downstairs. It's also really hard to measure, so if I had to convince the board of directors to fund a research division based upon this, I think I would fail.

This is another favorite argument that comes up in conversation, by the way. It's funny to hear people characterize the current situation as “we're scarce and totally awesome.” As Douglas Adams points out, there is little benefit to having a sense of perspective.

Factor 3: Quality Assurance

The idea here is basically “contributing to the public research discussion ensures the high quality of ideas within the organization.” The key word here is contributing, as the alternative strategy is something more akin to free-riding, e.g., sending employees to conferences to attend but not contribute.

There is definite value in preparing ideas for public consumption. Writing the related work section of a paper is often an enlightening experience, although honestly it tends to happen after the work has been done, rather than before. Before is more like a vague sense that there is no good solution to whatever the problem is, hopefully informed by a general sense of where the state-of-the-art is. Writing the experiment section, in my experience, is more of a mixed bag: you often need to dock with a standard metric or benchmark task that seems at best idiosyncratic and at worst unrelated to the thrust of your work, and therefore forces particular hacks to get over the finish line. (Maybe this is why everybody is investing so heavily in defining the next generation of benchmark tasks.)

The funny thing is most of the preceding benefits occur during the preparation for publication. Plausibly, at that point, you could throw the paper away and still experience the benefits (should we call these “the arxiv benefits”?). Running the reviewer gauntlet is a way of measuring whether you are doing quality work, but it is a noisy signal. Quality peer feedback can suggest improvements and new directions, but is a scarce resource. Philanthropic organizations that want to advance science should attack this scarcity, e.g., by funding high quality dedicated reviewers or inventing a new model for peer feedback.

I don't find this factor very compelling as a rationale for funding basic research, i.e., if I were the head of a research department arguing for funding from the board of directors, I wouldn't heavily leverage this line of attack. Truth is less important than perception here, and I think the accounting department would rather test the quality of their ideas in the marketplace of products.

Factor 4: Marketing

A company can use their basic research accolades as a public display of the fitness and excellence of their products. The big players definitely make sure their research achievements are discussed in high profile publications such as the New York Times. However this mostly feels like an afterthought to me. What seems to happen is that researchers are making their choices on what to investigate, some of it ends up being newsworthy, and another part of the organization has dedicated individuals whose job it is to identify and promote newsworthy research. IBM is the big exception, e.g., Watson going after Jeopardy.

This is arguably sustainable (IBM has been at it for a while), but it creates activity that looks like big pushes around specific sensational goals, rather than distribution of basic research tools and techniques. In other words, it doesn't look like what was happening at this year's NIPS.

Factor 5: Monopolies

I find this explanation agreeable: technology has created more natural monopolies, and natural monopolies fund research, cf. Bell Labs and Xerox PARC. All market positions are subject to disruption and erosion, but Microsoft, Google, and Facebook all have large competitive moats in their respective areas (OS, search, and social), so they are currently funding public basic research. This factor predicts that as Amazon's competitive moats in retail (and cloud computing) widen, they will engage in more public basic research, something we have seen recently.

For AI (née machine learning) in particular, the key monopoly is data (which derives from customer relationships). Arguably the big tech giants would love for AI technologies to be commodities, because they would then be in the best position to exploit such technologies due to their existing customer relationships. Conversely, if a privately discovered disruptive AI technology were to emerge, it would be one of the “majors” being disrupted by a start-up. So the major companies get both benefits and insurance from a vibrant public research ecosystem around AI.

Nonetheless, a largish company with a decent defensive moat might look at the current level of public research activity and say, “hey good enough, let's free ride.” (Not explicitly, perhaps, but implicitly). Imagine you are in charge of Apple or Salesforce, what do you do? I don't see a clear “right answer”, although both companies appear to be moving in the direction of more open basic research.

Factor 6: Firms are Irrational

Tech firms are ruled by founder-emperors whose personal predilections can decide policies such as whether you can bring a dog to work. The existence of a research department with a large budget, in practice, can be similarly motivated. All the above factors are partially true but difficult to measure, so it comes down to a judgement call, and as long as a company is kicking ass, deference to the founder(s) will be extreme.

If this factor is important, however, then when the company hits a rough patch, or experiences a transition at the top, things can go south quickly. There have been examples of that in the last 10 years for sure.

Friday, December 16, 2016

Dialogue Workshop Recap

Most of the speakers have sent me their slides, which can be found on the schedule page. Overall the workshop was fun and enlightening. Here are some major themes that I picked up on.

Evaluation There is no magic bullet, but check out Helen's slides for a nicely organized discussion of metrics. Many different strategies were on display in the workshop:
  • Milica Gasic utilized crowdsourcing for some of her experiments. She also indicated the incentives of crowdsourcing can lead to unnatural participant behaviours.
  • Nina Dethlefs used a combination of objective (BLEU) and subjective (“naturalness”) evaluation.
  • Vlad Serban has been a proponent of next utterance classification as a useful intrinsic metric.
  • Antoine Bordes (and the other FAIR folks) are heavily leveraging simulation and engineered tasks.
  • Jason Williams used imitation metrics (from hand labeled dialogs) as well as simulation.
As Helen points out, computing metrics from customer behaviour is probably the gold standard for industrial task-oriented systems, but this is a scarce resource. (Even within the company that has the customer relationship, by the way: at my current gig they will not let me flight something without demonstrating limited negative customer experience impact.)

Those who have been around longer than I have seen several waves of enthusiasm and pessimism regarding simulation for dialogue. Overall I think the takeaway is that simulation can be a useful tool, as long as one is cognizant of the limitations.

Antoine quickly adapted his talk in response to Nina's, with a fun slide that said “Yes, Nina, we are bringing simulation back.” The FAIR strategy is something like this: “Here are some engineered dialog tasks that appear to require certain capabilities to perform well, such as multi-hop reasoning, interaction with a knowledge base, long-term memory, etc. At the moment we have no system that can achieve 100% accuracy on these engineered tasks, so we will use these tasks to drive research into architectures and optimization strategies. We also monitor performance on other external tasks (e.g., DSTC) to see if our learning generalizes beyond the engineered task set.” Sounds reasonable.

Personally, as a result of the workshop, I'm going to invest more heavily in simulators in the near-term.

Leveraging Linguistics Fernando Pereira had the killer comment about how linguistics is a descriptive theory which need not have explicit correspondence to implementation: “when Mercury goes around the Sun, it is not running General Relativity.” Nonetheless, linguistics seems important not only for describing what behaviours a competent system must capture, but also for motivating and inspiring what kinds of automata we need to achieve it.

Augmenting or generating data sets seems like a natural way to leverage linguistics. As an example, in the workshop I learned that 4 year old native English speakers are sensitive to proper vs. improper word order given simple sentences containing some nonsense words (but with morphological clues, such as capitalization and the -ed suffix). Consequently, I'm trying a next utterance classification run on a large dialog dataset where some of the negative examples are token-permuted versions of the true continuation, to see if this changes anything.
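As a rough sketch of what I mean (the helper name and tokenization are my own invention, not from any particular codebase), generating a token-permuted hard negative could look like this:

```python
import random

def permuted_negative(continuation_tokens, rng):
    """Return a shuffled copy of the true continuation's tokens, to use
    as a hard negative in next utterance classification.  (Hypothetical
    helper, just to illustrate the data augmentation idea.)"""
    tokens = list(continuation_tokens)
    # Only reshuffle when a different ordering is actually possible.
    if len(set(tokens)) > 1:
        while tokens == list(continuation_tokens):
            rng.shuffle(tokens)
    return tokens

rng = random.Random(42)
true_next = ["the", "installer", "crashed", "after", "the", "update"]
negative = permuted_negative(true_next, rng)
# `negative` has the same tokens as `true_next`, but in a different order.
```

A classifier that only looks at bag-of-words statistics would score the permuted negative identically to the true continuation, so these negatives specifically probe sensitivity to word order.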

Raquel Fernandez's talk focused on adult-child language interactions, and I couldn't help but think about potential relevance to training artificial systems. In fact, current dialog systems are acting like the parent (i.e., the expert), e.g., by suggesting reformulations to the user. But this is laughable, because our systems are stupid: shouldn't we be acting like the child?

The most extreme use of linguistics was the talk by Eshghi and Kalatzis, where they develop a custom incremental semantic parser for dialog and then use the resulting logical forms to drive the entire dialog process. Once the parser is built, the amount of training data required is extremely minimal, but the parser is presumably built from looking at a large number of dialogs.

Nina Dethlefs discussed some promising experiments with AMR. I've been scared of AMR personally. First, it is very expensive to get the annotations. However, if that were the only problem, we could imagine a human-genome-style push to generate a large number of them. The bigger problem is the relatively poor inter-annotator agreement (it was just Nina and her students, so they could come to agreement via side communication). Nonetheless I could imagine a dialog system which is designed and built using a small number of prototypical semantic structures. It might seem a bit artificial and constrained, but so does the graphical user interface with the current canonical set of UX elements, which users learn to productively interact with.

Angeliki Lazaridou's talk reminded me that communication is fundamentally a cooperative game, which explains why arguing on the internet is a waste of time.

Neural Networks: Game Changer? I asked variants of the following question to every panel: “what problems have neural networks mitigated, and what problems remain stubbornly unaddressed?” This was, essentially, the content of Marco Baroni's talk. Overall I would say: there's enthusiasm now that we are no longer afraid of non-convex loss functions (along these lines, check out Julien Perez's slides).

However, we currently have only vague ideas on how to realize the competencies that are apparently required for high quality dialog. I say apparently because the history of AI is full of practitioners assuming sufficient capabilities are necessary for some task, and recent advances in machine translation suggest that savant-parrots might be able to do surprisingly well. In fact, during the discussion period there was some frustration that heuristic hand-coded strategies are still superior to machine learning based approaches, with the anticipation that this may continue to be true for the Alexa prize. I'm positive about the existence of superior heuristics, however: not only do they provide a source of inspiration and ideas for data-driven approaches, but learning methods that combine imitation learning and reinforcement learning should be able to beneficially exploit them.

Entity Annotation Consider the apparently simple and ubiquitous feature engineering strategy: add additional sparse indicator features which indicate semantic equivalence of tokens or token sequences. So maybe “windows 10” and “windows anniversary edition” both get the same feature. Jason Williams indicated his system is greatly improved by this, but he's trying to learn from $O(10)$ labeled dialogues, so I nodded. Antoine Bordes indicated this helps on some bAbI dialog tasks, but those tasks only have $O(1000)$ dialogues, so again I nodded. Then Vlad Serban indicated this helps for next utterance classification on the Ubuntu Dialog Corpus. At this point I thought, “wait, that's $O(10^5)$ dialogs.”
Apparently, knowing a turtle and a tortoise are the same thing is tricky.
In practice, I'm ok with manual feature engineering: it's how I paid the rent during the linear era. But now I wonder: does it take much more data to infer such equivalences? Will we never infer this, no matter how much data, given our current architectures?
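To make the feature engineering strategy concrete, here is a minimal sketch (the equivalence table, feature names, and function are invented for illustration, not from any of the speakers' systems) of adding a shared sparse indicator feature for surface variants of the same entity:

```python
# Hand-curated equivalence classes (invented examples): every surface
# form of an entity maps to one canonical class name.
EQUIV_CLASSES = {
    "windows 10": "WIN_OS",
    "windows anniversary edition": "WIN_OS",
    "turtle": "TURTLE",
    "tortoise": "TURTLE",
}

def equivalence_features(text):
    """Return sparse indicator features: one shared feature fires for
    every known phrase found in the text, so different surface forms
    of the same entity look identical to the learner."""
    lowered = text.lower()
    return {"EQUIV=" + cls
            for phrase, cls in EQUIV_CLASSES.items()
            if phrase in lowered}

feats = equivalence_features("I upgraded to Windows Anniversary Edition")
# feats == {"EQUIV=WIN_OS"}
```

The point of the workshop anecdote is that such hand-built tables keep helping even at $O(10^5)$ dialogs, i.e., the learner is apparently not inferring these equivalences from data alone.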

Spelling The speakers were roughly evenly split between “dialog” and “dialogue”. I prefer the latter, as it has more panache.

Monday, December 12, 2016

NIPS 2016 Reflections

It was a great conference. The organizers had to break with tradition to accommodate the rapid growth in submissions and attendance, but despite my nostalgia, I feel the changes were beneficial. In particular, leveraging parallel tracks and eliminating poster spotlights allowed for more presentations while ending the day before midnight, and the generous space allocation per poster really improved the poster session. The workshop organizers apparently thought of everything in advance: I didn't experience any hiccups (although, we only had one microphone, so I got a fair bit of exercise during discussion periods).

Here are some high-level themes I picked up on.

Openness. Two years ago Amazon started opening up their research, and they are now a major presence at the conference. This year at NIPS, Apple announced they would be opening up their research practices. Clearly, companies are finding it in their best interests to fund open basic research, which runs counter to folk-economic reasoning that basic research appears to be a pure public good and therefore will not be funded privately due to the free-rider problem. A real economist would presumably say that is simplistic undergraduate thinking. Still I wonder, to what extent are companies being irrational? Conversely, what real-world aspects of basic research are not well modeled as a public good? I would love for an economist to come to NIPS to give an invited talk on this issue.

Simulation. A major theme I noticed at the conference was the use of simulated environments. One reason was articulated by Yann LeCun during his opening keynote: (paraphrasing) “simulation is a plausible strategy for mitigating the high sample complexity of reinforcement learning.” But another reason is scientific methodology: for counterfactual scenarios, simulated environments are the analog of datasets, in that they allow for a common metric, reproducible experimentation, and democratization of innovation. Simulators are of course not new and have had waves of enthusiasm and pessimism in the past, and there are a lot of pitfalls which basically boil down to overfitting the simulator (both in a micro sense of getting a bad model, but also in a macro sense of focusing scientific attention on irrelevant aspects of a problem). Hopefully we can learn from the past and be cognizant of the dangers. There's more than a blog post worth of content to say about this, but here are two things I heard at the dialog workshop along these lines: first, Jason Williams suggested that relative performance conclusions based upon simulation can be safe, but that absolute performance conclusions are suspect; and second, Antoine Bordes advocated for using an ensemble of realizable simulated problems with dashboard scoring (i.e., multiple problems for which perfect performance is possible, which exercise apparently different capabilities, and for which there is currently no single approach that is known to handle all the problems).

Without question, simulators are proliferating. I noticed the following discussed at the conference this year:
and I probably missed some others.

By the way, the alternatives to simulation aren't perfect either: some of the discussion in the dialogue workshop was about how the incentives of crowdsourcing induce unnatural behaviour in participants of crowdsourced dialogue experiments.

GANs The frenzy of GAN research activity from other conferences (such as ICLR) colonized NIPS in a big way this year. This is related to simulation, albeit more towards the mitigating-sample-complexity theme than the scientific-methodology theme. The quirks of getting the optimization to work are being worked out, which should enable some interesting improvements in RL in the near-term (in addition to many nifty pictures). Unfortunately for NLU tasks, generating text from GANs is currently not as mature as generating sounds or images, but there were some posters addressing this.

Interpretable Models The idea that a model should be able to “explain itself” is very popular in industry, but this is the first time I have seen interpretability receive significant attention at NIPS. Impending EU regulations have certainly increased interest in the subject. But there are other reasons as well: as Irina Rish pointed out in her invited talk on (essentially) mindreading, recent advances in representation learning could better facilitate scientific inquiry if the representations were more interpretable.

Papers I noticed

Would you trust a single reviewer on yelp? I wouldn't. Therefore, I think we need some way to crowdsource what people thought were good papers from the conference. I'm just one jet-lagged person with two eyeballs (btw, use a bigger font, people! it gets harder to see the screen every year …), plus everything comes out on arxiv first, so if I read it already I don't even notice it at the conference. That makes this list weird, but here you go.

Also this paper was not at the conference, as far as I know, but I found out about it during the coffee break and it's totally awesome:
  • Understanding deep learning requires rethinking generalization. TL;DR: convnets can shatter the standard image training sets when the pixels are permuted or even randomized! Of course, generalization is poor in this case, but it indicates they are way more flexible than their “local pixel statistics composition” architecture suggests. So why do they work so well?

Saturday, December 3, 2016

Learning Methods for Dialog Workshop at NIPS This Saturday

The schedule for the workshop has been finalized, and I'm pretty excited. We managed to convince some seasoned researchers in dialog, who don't normally attend NIPS, to give invited talks. We're also devoting some time to “Building Complete Systems”, because it's easy to focus on the trees instead of the forest, especially when the tree is something really interesting like a neural network trained on a bunch of GPUs. But don't worry, there's plenty of “NIPS red meat” in the schedule as well.

See you on Saturday!

Monday, September 19, 2016

NIPS dialogue workshop

I'm co-organizing a workshop on dialogue at NIPS 2016. NIPS is not a traditional forum for dialogue research, but there is an increasing number of people (like myself!) in machine learning who are becoming interested in dialogue, so the time seemed right. From a personal perspective, dialogue is interesting because 1) it smells like AI, 2) recent advances in (deep learning) NLP techniques suggest the problem is more tractable and 3) corporate interest means both money and data will be plentiful. Honestly, the first point is very important: it was impossible to explain to my kids the minutiae on which I previously worked, whereas now I can show them videos like this. However, there are a lot of issues in dialogue that aren't going to be demolished merely by using a flexible hypothesis class, so I felt the need to educate myself about the activities of veteran dialogue researchers, and the best way to ensure that was to organize a workshop and invite some of them.

Hopefully you'll join the conversation.

Friday, July 8, 2016

Update on dialogue progress

In a recent blog post I discussed two ideas for moving dialogue forward; both ideas are related to the need to democratize access to the data required to evaluate a dialog system. It turns out both ideas have already been advanced to some degree:
  1. Having computers “talk” to each other instead of with people: Marco Baroni is on it.
  2. Creating an open platform for online assessment: Maxine Eskenazi is on it.
This is good to see.