Showing posts with label Conference. Show all posts
Showing posts with label Conference. Show all posts

Monday, December 12, 2016

NIPS 2016 Reflections

It was a great conference. The organizers had to break with tradition to accommodate the rapid growth in submissions and attendance, but despite my nostalgia, I feel the changes were beneficial. In particular, leveraging parallel tracks and eliminating poster spotlights allowed for more presentations while ending the day before midnight, and the generous space allocation per poster really improved the poster session. The workshop organizers apparently thought of everything in advance: I didn't experience any hiccups (although, we only had one microphone, so I got a fair bit of exercise during discussion periods).

Here are some high-level themes I picked up on.

Openness. Two years ago Amazon started opening up their research, and they are now a major presence at the conference. This year at NIPS, Apple announced they would be opening up their research practices. Clearly, companies are finding it in their best interests to fund open basic research, which runs counter to folk-economic reasoning that basic research appears to be a pure public good and therefore will not be funded privately due to the free-rider problem. A real economist would presumably say that is simplistic undergraduate thinking. Still I wonder, to what extent are companies being irrational? Conversely, what real-world aspects of basic research are not well modeled as a public good? I would love for an economist to come to NIPS to give an invited talk on this issue.

Simulation. A major theme I noticed at the conference was the use of simulated environments. One reason was articulated by Yann LeCun during his opening keynote: (paraphrasing) ``simulation is a plausible strategy for mitigating the high sample complexity of reinforcement learning.'' But another reason is scientific methodology: for counterfactual scenarios, simulated environments are the analog of datasets, in that they allow for a common metric, reproducible experimentation, and democratization of innovation. Simulators are of course not new and have had waves of enthusiasm and pessimism in the past, and there are a lot of pitfalls which basically boil down to overfitting the simulator (both in a micro sense of getting a bad model, but also in a macro sense of focusing scientific attention on irrelevant aspects of a problem). Hopefully we can learn from the past and be cognizant of the dangers. There's more than a blog post worth of content to say about this, but here are two things I heard at the dialog workshop along these lines: first, Jason Williams suggested that relative performance conclusions based upon simulation can be safe, but that absolute performance conclusions are suspect; and second, Antoine Bordes advocated for using an ensemble of realizable simulated problems with dashboard scoring (i.e., multiple problems for which perfect performance is possible, which exercise apparently different capabilities, and for which there is currently no single approach that is known to handle all the problems).

Without question, simulators are proliferating. I noticed the following discussed at the conference this year:
and I probably missed some others.

By the way, the alternatives to simulation aren't perfect either: some of the discussion in the dialogue workshop was about how the incentives of crowdsourcing induces unnatural behaviour in participants of crowdsourced dialogue experiments.

GANs The frenzy of GAN research activity from other conferences (such as ICLR) colonized NIPS in a big way this year. This is related to simulation, albeit more towards the mitigating-sample-complexity theme than the scientific-methodology theme. The quirks of getting the optimization to work are being worked out, which should enable some interesting improvements in RL in the near-term (in addition to many nifty pictures). Unfortunately for NLU tasks, generating text from GANs is currently not as mature as generating sounds or images, but there were some posters addressing this.

Interpretable Models The idea that model should be able to “explain itself” is very popular in industry, but this is the first time I have seen interpretability receive significant attention at NIPS. Impending EU regulations have certainly increased interest in the subject. But there are other reasons as well: as Irina Rish pointed out in her invited talk on (essentially) mindreading, recent advances in representation learning could better facilitate scientific inquiry if the representations were more interpretable.

Papers I noticed

Would you trust a single reviewer on yelp? I wouldn't. Therefore, I think we need some way to crowdsource what people thought were good papers from the conference. I'm just one jet-lagged person with two eyeballs (btw, use bigger font people! it gets harder to see the screen every year …), plus everything comes out on arxiv first so if I read it already I don't even notice it at the conference. That makes this list weird, but here you go.

Also this paper was not at the conference, as far as I know, but I found out about it during the coffee break and it's totally awesome:
  • Understanding deep learning requires rethinking generalization. TL;DR: convnets can shatter the standard image training sets when the pixels are permuted or even randomized! Of course, generalization is poor in this case, but it indicates they are way more flexible than their “local pixel statistics composition” architecture suggests. So why do they work so well?

Monday, July 4, 2016

ICML 2016 Thoughts

ICML is too big for me to ``review'' it per se, but I can provide a myopic perspective.

The heavy hitting topics were Deep Learning, Reinforcement Learning, and Optimization; but there was a heavy tail of topics receiving attention. It felt like deep learning was less dominant this year; but the success of deep learning has led to multiple application specific alternative venues (e.g., CVPR, EMNLP), and ICLR is also a prestigious venue; so deep learning at ICML this year was heavyweight in either the more theoretical or multimodal works. Arguably, reinforcement learning and optimization both should partially count towards deep learning's footprint; reinforcement learning has been this way for at least a year, but optimization has recently developed more interest in non-convex problems, especially the kind that are empirically tractable in deep learning (sometimes, although seemingly innocuous architecture changes can spoil the pudding; I suppose one dream of the optimization community would be the identification of a larger-than-convex class of problems which are still tractable, to provide guidance).

Here are some papers I liked:
  1. Strongly-Typed Recurrent Neural Networks
    The off-putting title makes sense if you are into type theory, or if you've ever been a professional Haskell programmer and have had to figure out wtf a monad is. tl;dr: if you put units of measurement on the various components of a recurrent neural network, you'll discover that you are adding apples and oranges. T-LSTM, a modification of the standard LSTM to fix the problem, behaves similarly empirically; but is amenable to analysis. Theorem 1 was the nice part for me: the modified architectures are shown to compute temporal convolutions with dynamic pooling. Could type consistency provide a useful prior on architectures? That'd be a welcome development.
  2. Ask Me Anything:
    Dynamic Memory Networks for Natural Language Processing
    and Dynamic Memory Networks for Visual and Textual Question Answering
    More titles I'm not over the moon about: everybody seems to be equating “memory” = “attention over current example substructure”. If you ask for the layperson's definition, they would say that memory is about stuff you can't see at the moment (note: Jason started this particular abuse of terminology with End-to-End Memory Networks). Pedantry aside, undeniably these iterated attention architectures have become the state of the art in question-answering style problems and the baseline to beat. Note since the next step in iterated attention is to incorporate previously seen and stored examples, the use of the term “memory” will soon become less objectionable.
  3. From Softmax to Sparsemax:
    A Sparse Model of Attention and Multi-Label Classification
    This is an alternative to the softmax layer (“link function”) used as the last layer of a neural network. Softmax maps $\mathbb{R}^n$ onto the (interior of the) simplex, whereas sparsemax projects onto the simplex. One big difference is that sparsemax can “hit the corners”, i.e., zero out some components. Empirically the differences in aggregate task performance when swapping softmax with sparsemax are modest and attributable to the selection pressures on experimental sections. So why care? Attention mechanisms are often implemented with softmax, and it is plausible that a truly sparse attention mechanism might scale better (either computationally or statistically) to larger problems (such as those involving actual memory, c.f., previous paragraph).
  4. Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization
    I find Inverse RL unintuitive: didn't Vapnik say not to introduce difficult intermediate problems? Nonetheless, it seems to work well. Perhaps requiring the learned policy to be “rational” under some cost function is a useful prior which mitigates sample complexity? I'm not sure, I have to noodle on it. In the meantime, cool videos of robots doing the dishes!
  5. Dueling Network Architectures for Deep Reinforcement Learning.
    Best paper, so I'm not adding any value by pointing it out to you. However, after reading it, meditate on why learning two things is better than learning one. Then re-read the discussion section. Then meditate on whether a similar variance isolation trick applies to your current problem.

From the workshops, some fun stuff I heard:
  1. Gerald Tesauro dusted off his old Neurogammon code, ran it on a more powerful computer (his current laptop), and got much better results. Unfortunately, we cannot conclude that NVIDIA will solve AI for us if we wait long enough. In 2 player games or in simulated environments more generally, computational power equates to more data resources, because you can simulate more. In the real world we have sample complexity constraints: you have to perform actual actions to get actual rewards. However, in the same way that cars and planes are faster than people because they have unfair energetic advantages (we are 100W machines; airplanes are much higher), I think “superhuman AI”, should it come about, will be because of sample complexity advantages, i.e., a distributed collection of robots that can perform more actions and experience more rewards (and remember and share all of them with each other). So really Boston Dynamics, not NVIDIA, is the key to the singularity. (In the meantime … buy my vitamins!)
  2. Ben Recht talked about the virtues of random hyperparameter optimization and an acceleration technique that looks like a cooler version of sub-linear debugging. This style, in my experience, works.
  3. Leon Bottou pointed out that first order methods are now within constant factors of optimal convergence, with the corollary that any putative improvement has to be extremely cheap to compute since it can only yield a constant factor. He also presented a plausible improvement on batch normalization in the same talk.

Wednesday, April 6, 2016

Thoughts on reviewing

During ICML reviews I noticed that my personal take on reviewing is becoming increasingly distinct from my peers. Personally, I want to go to a conference and come away with renewed creativity and productivity. Thus, I like works that are thought provoking, groundbreaking, or particularly innovative; even if the execution is a bit off. However, I suspect most reviewers feel that accepting a paper is a validation of the quality and potential impact of the work. There's no right answer here, as far as I can tell. Certainly great work should be accepted and presented, but the problem is, there really isn't that much of it per unit time. Therefore, like a producer on a Brittany Spears album, we are faced with the problem of filling in the rest of the material. The validation mindset leads to the bulk of accepted papers being extremely well executed marginal improvements. It would be nice if the mix were tilted more towards the riskier novel papers.

The validation mindset leads to reviews that are reminiscent of food critic reviews. That might sound objectionable, given that food quality is subjective and science is about objective truth: but the nips review experiment suggests that the ability of reviewers to objectively recognize the greatness of a paper is subjectively overrated. Psychologists attempting to “measure” mental phenomena have struggled formally with the question of “what is a measurement” and lack of inter-rater reliability is a bad sign (also: test-retest reliability is important, but it is unclear how to assess this as the reviewers will remember a paper). So I wonder: how variable are the reviews among food critics for a good restaurant, relative to submitted papers to a conference? I honestly don't know the answer.

What I do know is that, while I want to be informed, I also want to be inspired. That's why I go to conferences. I hope reviewers will keep this in mind when they read papers.

Tuesday, December 15, 2015

NIPS 2015 Review

NIPS 2015 was bigger than ever, literally: at circa 3700 attendees this was roughly twice as many attendees as last year, which in turn was roughly twice as many as the previous year. This is clearly unsustainable, but given the frenzied level of vendor and recruiting activities, perhaps there is room to grow. The main conference is single track, however, and already 3 days long: so even more action is moving to the poster sessions, which along with the workshops creates the feel of a diverse collection of smaller conferences. Obviously, my view of the action will be highly incomplete and biased towards my own interests.

Reinforcement learning

Reinforcement learning continues to ascend, extending the enthusiasm and energy from ICML. The “Imagenet moment” for RL was the Deepmind work in the Arcade Learning Environment. In a talk in the Deep RL workshop, Michael Bowling presented evidence that the big boost in performance could be mostly characterized as 1) decoding the screen better with convnets and 2) using multiple previous frames as input. This was not to detract from the breakthrough, but rather to point out that a hard part of RL (partial feedback over long action sequences) is not addressed by this advance. What's interesting is currently no system in good at playing Pitfall, which involves long action sequences before reward is encountered. The Bowling quote is that we are good at games where “you wiggle the joystick randomly and you get some reward.”

However, the community is not standing still: with so much enthusiasm and human talent now thinking in this direction, progress will hopefully accelerate. For instance, an idea I saw recurring was: rewards are partially observed (and sparse!), but sensory inputs are continuously observed. Therefore decompose the prediction of future rewards into a combination of 1) predicting future sensory inputs conditioned on action sequences, and 2) predicting reward given sensory input. From a sample complexity standpoint, this makes a lot of sense. As Honglak Lee pointed out in his talk at the Deep RL workshop, the same technology powering Transformer Networks can be learned to predict future sensory input conditioned on action sequences, which can be leveraged for simulated play-out. (If you know about POMDPs, then the decomposition perhaps makes less sense, because you cannot necessarily predict reward from the current sensory state; but we have to crawl before we can walk, and maybe ideas from sequence-to-sequence learning can be composed with this kind of decomposition to enable some modeling of unobservable world state.)

Another popular reinforcement learning topic was need for better exploration strategies. I suspect this is the really important part: how do we explore in a manner which is relevant to regret with respect to our hypothesis class (which can be relatively small, redundant, and full of structural assumptions), rather than the world per se (which is impossibly big)? This is how things play out in contextual bandits: if all good policies want the same action than exploration is less important. At the conference the buzzword was “intrinsic motivation”, roughly meaning “is there a useful progress proxy that can be applied on all those action sequences where no reward is observed?”. Given a decomposition of reward prediction into (action-sequence-conditional sensory input prediction + sensory-reward prediction), then discovering novel sensory states is useful training data, which roughly translates into an exploration strategy of “boldly go where you haven't gone before” and hope it doesn't kill you.

Finally, I have some anecdotal evidence that reinforcement learning is on the path towards a mature industrial technology: at ICML when I talked to Deepmind people they would say they were working on some technical aspect of reinforcement learning. This time around I got answers like “I'm doing RL for ads” or “I'm doing RL for recommendations”. That's a big change.

Other Stuff

There were a variety of other interesting topics at the conference around which I'm still collecting my thoughts.
  1. I really like the best paper Competitive Distribution Estimation: Why is Good-Turing Good, and I suspect it is relevant for extreme classification.
  2. Brown and Sandholm are doing amazing things with their Heads-up No-Limit Poker Player. This is one of those “we probably aren't learning about how humans solve the problem, but it's still really cool technology.” Navel gazing isn't everything!
  3. I still like primal approximations to kernels (in extreme classification we have to hug close to the linear predictor), so I liked Spherical Random Features for Polynomial Kernels.
  4. I want to try Online F-Measure Optimization. F-measure is an important metric in extreme classification but just computing it is a pain in the butt, forget about optimizing it directly. Maybe that's different now.
  5. Automated machine learning aka AutoML is heating up as a topic. One near-term goal is to eliminate the need for expertise in typical supervised learning setups. The poster Efficient and Robust Automated Machine Learning is an interesting example. The AutoML challenge at the CIML workshop and ongoing challenges are also worth keeping an eye on. IBM also had a cool AutoML product demo at their party (parenthetically: what is the word for these things? they are clearly recruiting functions but they masquerade as college parties thrown by a trustafarian with nerdy friends).
  6. Memory systems, exemplified at the conference by the End-to-End Memory Networks paper, and at the workshops by the RAM workshop. I especially like attention as a mechanism for mitigating sample complexity: if you are not attending to something you are invariant to it, which greatly mitigates data requirements, assuming of course that you are ignoring irrelevant stuff. Is it somehow less expensive statistically to figure out what is important rather than how it is important, preserving precious data resources for the latter? I'm not sure, but Learning Wake-Sleep Recurrent Attention Models is on my reading list.
  7. Highway networks look pretty sweet. The idea of initializing with the identity transformation makes a lot of sense. For instance, all existing deep networks can be considered highway networks with an uncountable number of identity transformation layers elided past a certain depth, i.e., incompletely optimized “infinitely deep” highway networks.
  8. Extreme classification is still an active area, and the workshop was reasonably well-attended considering we were opposite the RAM workshop (which was standing-room-only fire-code-violating popular). I especially liked Charles Elkan's talk, which I could summarize as “we just need to compute a large collection of sparse GLMs, I'm working on doing that directly.” My own work with hierarchical spectral approaches does suggest that the GLM would have excellent performance if we could compute it, so I like this line of attack (also, conceivably, I could compose the two techniques). Also interesting: for squared loss, if the feature dimensionality is small, exact loss gradients can be computed in label-sparsity time via Efficient Exact Gradient Update for training Deep Networks with Very Large Sparse Targets. This is great for typical neural networks with low-dimensional bottlenecks before the output layer (unfortunately, it is not usable as-is for large sparse GLMs, but perhaps a modification of the trick could work?).
  9. Path-SGD
    could be a cool trick for better optimization of deep networks via eliminating one pesky invariant.
  10. The Self-Normalized Estimator for Counterfactual Learning. If you like reinforcement learning, you should love counterfactual estimation, as the latter provides critical insights for the former. I need to play around with the proposed estimator, but it looks plausibly superior.
  11. Taming the Wild: A Unified Analysis of Hogwild Style Algorithms. While I had plenty of empirical evidence that hogwild and matrix factorization play well together, this analysis claims that they should play well together. Neat!
  12. Last but not least, a shoutout to the Machine Learning Systems workshop for which CISL colleague Markus Weimer co-organized. While not quite fire-code-violating, it was standing room only.

Tuesday, October 13, 2015

KDD Cup 2016 CFP

The KDD Cup is soliciting ideas for their next competition. Things have gotten tricky for the KDD Cup, because CJ's class keeps winning. Essentially we have learned that lots of feature engineering and large ensembles do well in supervised learning tasks. But really CJ has done us a favor by directly demonstrating that certain types of supervised learning are extremely mature. If KDD Cup were Kaggle, that would be fine because such models can still have tremendous economic value. However the point of KDD Cup is to advance research, hence the pickle.

There is, of course, no shortage of research directions that would make plausible competition subjects. The challenge, so to speak, is how to organize the challenge. With supervised learning, the game is clear: here's a labeled training set, here's an unlabeled test set, submit your answers. There's some sophistication possible in running the leaderboard, but mostly the supervised learning competition is a straightforward setup. Additional complexity, however, would require some innovation. Here are some examples.
  1. Nonstationary environments. In real life the environment is changing either obliviously or adversarially. A competition could explore this, but presumably can't release the test set in order to simulate the “fog of war”. So this means submissions need to be executable, a protocol for scoring answers has to defined, etc. Somebody would have to do some infrastructure work to make all that happen.
  2. Automated training In this case, the competition wouldn't even release a training set! Instead submissions would be algorithms which were capable of taking a training set and producing a model which could be evaluated on a test set. Clearly infrastructure work is required to facilitate this.
  3. Computational constraints Unholy ensembles don't win in real life because nobody would deploy such a model. Real models are subject to both space and time constraints. Soeren Sonnenburg organized a large scale learning challenge several years ago which tried to assess performance under computational and sample complexity constraints. It was an admirable first effort, but there are some problems. A big one: it's difficult to come up with a ranking function (also in real life: you can usually negotiate for a bit more server memory and/or latency if you can demonstrate a lift in performance, but the tradeoffs aren't clear). There were other minor infelicities, e.g., participants had to time their own algorithms. Furthermore the competition didn't address space complexity of the resulting model, which in my experience is very important: models that are too large don't fit on production machines (given everything else going on) and/or take too long to update. So in this area there's definitely room to innovate in competition design.
  4. Partial feedback Call it contextual bandits, reinforcement learning, … heck, call it banana. With almost every problem I've worked on, there was a closed loop where the actions of the algorithm determined the data was collected. A competition could release partially observed history to initialize a policy, but a real test should involve online operation where actions generate feedback which updates the model, etc.
A common thread above is to the need to define interfaces into the run-time environment of the competition, and of course the implementation of the run-time environment. But in some cases there is also a need to define the objective function.

Sunday, September 20, 2015

ECML-PKDD 2015 Review

ECML-PKDD was a delight this year. Porto is definitely on the short list of the best European cities in which to have a conference. The organizers did a wonderful job injecting local charm into the schedule, e.g., the banquet at Taylor's was a delight. It's a wine city, and fittingly wine was served throughout the conference. During the day I stuck to coffee: jet lag, soft lights, and whispered mathematics are sedating enough without substituting coffee for alcohol. There is no question, however, that poster sessions are far better with a bit of social lubrication.

The keynotes were consistently excellent. Some standouts for me were:
  • Pedros Domingos presented his latest take on sum-product networks as a class of nonconvex functions for which finding a global maximum is tractable. Machine learning was (is?) obsessed with convex functions because it is a large class for which finding the global maximum is tractable. Lately the deep learning community has convincingly argued that convexity is too limiting, and as a result we are all getting more comfortable with more ``finicky'' optimization procedures. Perhaps what we need is a different function class?
  • Hendrik Blockeel talked about declarative machine learning. I work in a combination systems-ML group and I can tell you systems people love this idea. All of them learned about how relational algebra ushered in a declarative revolution in databases via SQL, and see the current state of affairs in machine learning as a pre-SQL mess.
  • Jure Leskovec did an unannounced change of topic, and delivered a fabulous keynote which can paraphrased as: ``hey you machine learning people could have a lot of impact on public policy, but first you need to understand the principles and pitfalls of counterfactual estimation.'' I couldn't agree more, c.f., Gelman. (Jure also gave the test-of-time paper talk about Kronecker graphs.)
  • Natasa Milic-Frayling detailed (with some disdain) the miriad of techniques that digital web and mobile advertising firms use to track and profile users. It was all very familiar because I worked in computational advertising for years, but the juxtaposition of the gung-ho attitude of ad networks with the European elevated respect for privacy was intriguing from a sociological perspective.
There were also some papers with which I'm going to spend quality time.

Wednesday, September 2, 2015

LearningSys NIPS Workshop CFP

CISL is the research group in which I work at Microsoft. The team brings together systems and machine learning experts, with the vision of having these two disciplines inform each other. This is also the vision for the LearningSys workshop, which was accepted for NIPS 2015, and is co-organized by Markus Weimer from CISL.

If this sounds like your cup of tea, check out the CFP and consider submitting your work.

Also, CISL is hiring: so if this is really your cup of tea, send your resume to me (to the address in top-right corner of my blog); or introduce yourself at the workshop in Montreal.

Monday, August 10, 2015

Paper Reviews vs. Code Reviews

Because I'm experiencing the NIPS submission process right now, the contrast with the ICLR submission process is salient. The NIPS submission process is a more traditional process, in which first (anonymous) submissions are sent to (anonymous) reviewers who provide feedback, and then authors have a chance to respond to the feedback. The ICLR submission process is more fluid: non-anonymous submissions are sent to anonymous reviewers who provide feedback, and then authors and reviewers enter into a cycle where the author updates the arxiv submission and reviewers provide further feedback. (ICLR also has public reviews, but I'm not going to talk about those). Note in the traditional model the reviewers have to imagine what my (promised) changes will look like in the final version.

The traditional model is from an age where papers were actual physical objects that were sent (via snail mail!) to reviewers who marked them up with ink pens, and hopefully advances in technology allow us to develop a more effective process. I think we should look to software engineering for inspiration. Having worked both as a researcher and as a software engineer, I appreciate the exoskeleton-robot distinction. In this context, the exoskeleton posture of science lends itself to a ballistic concept of paper review, where a completed unit of work is accepted or rejected; engineering is more about collaborative continuous improvement. Truthfully, most journals and some conferences utilize a more fluid review process, where there are “conditional accepts” (changes need to be re-reviewed) and “shepherds” (reviewers who commit to guiding a paper through several rounds of reviews). These processes place more burden on the reviewers, who are providing the valuable service of helping someone improve their work, without compensation or recognition. Naturally conferences might be hesitant to demand this from their volunteer reviewer pool.

The solution is technology that eases the cognitive and logistical burdens on all parties. Code reviews have the same broad goal as paper reviews: improvement of quality via peer feedback. Here are some things we can learn from code reviews:
  1. Incremental review. Rarely a programmer develop a complicated piece of software de novo. Most reviews are about relatively small changes to large pieces of software. To ease the cognitive burden on the reviewer, the changes are reviewed, rather than the entire new version. Visualization technology is used to improve change review productivity.
    1. The initial submission of a paper is distinct from the typical code review in this respect, but the subsequent cycle of revisions aligns with this nicely.
  2. Modular updates. When a programmer makes several distinct changes to a piece of software, the changes are (to the extent possible) designed to be commutable and independently reviewed. Technology is used to facilitate the composition of (the accepted subset of) changes.
    1. This aligns nicely with the review process, as different reviewers' feedback are analogous to issues.
  3. Minimal clean changes. Smart programmers will craft their changes to be most intelligible under review. This means little “easy” things like avoiding lexical changes which are semantically equivalent. It also means a tension between cleaning up the control flow of a program and creating a large change.

Add to this the ability to preserve the anonymity of all parties involved, and you would have a pretty sweet paper review platform that would plausibly accelerate the pace of all scientific fields while improving quality.

This is precisely the kind of public good that the government should fund. Somebody in academia: write a grant!

Tuesday, July 14, 2015

ICML 2015 Review

This year's location was truly superlative: the charming northern French city of Lille, where the locals apparently subsist on cheese, fries, and beer without gaining weight. A plethora of vendors and recruiters were in attendance, handing out sweet swag to starving grad students. Honestly it's hard to feel bad for ML grad students nowadays: getting a PhD in English indicates true selfless love of knowledge, while being a machine learning grad student is more like being a college basketball player.

The conference was not lacking for entertainment: in case you haven't been paying attention, the enormous success of deep learning has generated some controversy about inventorship. Between Stigler's Law of Eponymy and Sayre's Law, this is of course not surprising, but when they announced the deep learning panel would have some of the contesting luminaries together on stage, everybody prepped the popcorn. I hope they videotaped it because it did not disappoint.

As far as trends: first, “deep” is eating everything, e.g., Deep Exponential Families. However, you knew that already. Second, reinforcement learning is heating up, leveraging advances in deep learning and GPU architecture along with improved optimization strategies. Third, as Leon Bottou's excellent keynote suggested, the technological deficiencies of machine learning are becoming increasingly important as the core science advances: specifically, productivity of humans in creating machine learning models needs to advance, and the integration of machine learning with large software systems needs to be made less fragile.

Furthermore, the increasing importance of non-convex objective functions is creating some “anti”-trends. First, distributed optimization is becoming less popular, as a box with 4 GPUs and 1TB of RAM is a pretty productive environment (especially for non-convex problems). Considering I work in the Cloud and Information Services Lab, you can draw your own conclusions about the viability of my career. Second, there were many optimization papers on primal-dual algorithms, which although very cool, appear potentially less impactful than primal-only algorithms, as the latter have a better chance of working on non-convex problems.

Here's a list of papers I plan to read closely. Since I was very jet-lagged this is by no means an exhaustive list of cool papers at the conference, so check out the complete list.

  1. Unsupervised Domain Adaptation by Backpropagation. The classical technique considers the representation to be fixed and reweights the data to simulate a data set drawn from the target domain. The deep way is to change the representation so that source and target domain are indistinguishable. Neat!
  2. Modeling Order in Neural Word Embeddings at Scale. Turns out word2vec was underfitting the data, and adding in relative position improves the embedding. Pushing on bias makes sense in hindsight: the original dream of unsupervised pre-training was that model complexity would not be an issue because the data would be unlimited. Surprisingly the pre-training revolution is happening in text, not vision. (Analogously, Marx expected the proletarian revolution would occur in Germany, not Russia.)
  3. Counterfactual Risk Minimization: Learning from Logged Bandit Feedback. Offline policy evaluation involves importance weighting, which can introduce variance; empirical Bernstein tells us how to penalize variance during learning. Peanut butter and jelly! Why didn't I think of that … Trivia tidbit: this paper was the only entry in the Causality track not co-authored by Bernhard Schölkopf.

Ok, that's a short list, but honestly I'd read most of the papers of interest to me already when they appeared on arxiv months ago, so those were the ones I hadn't already noticed.

Monday, May 11, 2015

ICLR 2015 Review

The ambition, quality, and (small) community of ICLR combine to make this my new favorite conference. Recent successes in speech and vision, along with a wave of capital from billionaire founder-emperors and venture capitalists, have created with a sense of optimism and desire to attack Artificial Intelligence. The enthusiasm is contagious. (On a procedural note, the use of Arxiv in the review process made it easy to dialogue with the reviewers: everyone should do this, double blind is a myth nowadays anyway.)

The organizers were insightful in choosing the conference name. Although referred to as “the deep learning conference”, the conference is about learning representations. In the early days of AI (i.e., the 1960s), representations were identified as critical, but at that time representations were hand-constructed. Not only was this (prohibitively) laborious, but solutions were highly specialized to particular problems. The key idea motivating this conference is to use data and learning algorithms to help us design representations, hopefully making the resulting representations both easier to develop and more broadly applicable. Today, deep learning (i.e., layered nonlinearities trained with non-convex optimization techniques) is the leading technology for doing this, but should something better arise this conference is (near-term) future-proofed.

The selection of accepted papers and invited talks was extremely sensible given the above context: deep learning papers were definitely in the majority, but there were also interesting papers leveraging eigensystems, spectral methods, and dictionary learning. The invited talks were diverse and entertaining: Percy Liang's talk on learning latent logical forms for semantic parsing was an excellent example, as his work clearly involves learning representations, yet he jokingly professed unfamiliarity with deep learning during his talk.

There were many good papers, so check out the entire schedule, but these caught my eye.

Neural Machine Translation by Jointly Learning to Align and Translate The result in this paper is interesting, but the paper also excels as an example of the learned representation design process. Deep learning is not merely the application of highly flexible model classes to large amounts of data: if it were that simple, the Gaussian kernel would have solved AI. Instead, deep learning is like the rest of machine learning: navigating the delicate balance between model complexity and data resources, subject to computational constraints. In particular, more data and a faster GPU would not create these kinds of improvements in the standard neural encoder/decoder architecture because of the mismatch between the latent vector representation and the sequence-to-sequence mapping being approximated. A much better approach is to judiciously increase model complexity in a manner that better matches the target. Furthermore, the “art” is not in knowing that alignments are important per se (the inspiration is clearly from existing SMT systems), but in figuring out how to incorporate alignment-like operations into the architecture without destroying the ability to optimize (using SGD). Kudos to the authors.

Note that while a representation is being learned from data, clearly the human designers have gifted the system with a strong prior via the specification of the architecture (as with deep convolutional networks). We should anticipate this will continue to be the case for the near future, as we will always be data impoverished relative to the complexity of the hypothesis classes we'd like to consider. Anybody who says to you “I'm using deep learning because I want to learn from the raw data without making any assumptions” doesn't get it. If they also use the phrase “universal approximator”, exit the conversation and run away as fast as possible, because nothing is more dangerous than an incorrect intuition expressed with high precision (c.f., Minsky).

NICE: Non-linear Independent Components Estimation The authors define a flexible nonlinearity which is volume preserving and invertible, resulting in a generative model for which inference (and training), sampling, and inpainting are straightforward. It's one of these tricks that's so cool, you want to find a use for it.

Qualitatively characterizing neural network optimization problems The effectiveness of SGD is somewhat mysterious, and the authors dig into the optimization landscapes encountered by actual neural networks to gain intuition. The talk and poster had additional cool visualizations which are not in the paper.

Structured prediction There were several papers exploring how to advance deep neural networks beyond classification into structured prediction. Combining neural networks with CRFs is a popular choice, and Chen et. al. had a nice poster along these lines with good results on Pascal VOC 2012. Jaderberg et. al. utilized a similar strategy to tackle the (variadic and extensible output) problem of recognizing words in natural images.

Extreme classification There were several papers proposing methods to speed up learning classification models where the number of output is very large. Vijayanarasimhan et. al. attempt to parsimoniously approximate dot products using hashing, whereas Vincent provides an exact expression for (the gradient of) certain loss functions which avoids computing the outputs explicitly. I'll be digging into these papers in the next few weeks to understand them better. (Also, in theory, you can use our label embedding technique to avoid the output layer entirely when training extreme deep classifiers on the GPU, but I haven't implemented it yet so YMMV.)

Tuesday, April 21, 2015

Extreme Multi-Label Classification

Reminder: there is still time to submit to the Extreme Classification Workshop at ICML this year.

Multi-label classification is interesting because it is a gateway drug to structured prediction. While it is possible to think about multi-label as multi-class over the power set of labels, this approach falls apart quickly unless the number of labels is small or the number of active labels per instance is limited. The structured prediction viewpoint is that multi-label inference involves a set of binary predictions subject to a joint loss, which satisfies the haiku definition of structured prediction.

Nikos and I independently discovered what Reed and Hollmén state eloquently in a recent paper:
Competitive methods for multi-label data typically invest in learning labels together. To do so in a beneficial way, analysis of label dependence is often seen as a fundamental step, separate and prior to constructing a classifier. Some methods invest up to hundreds of times more computational effort in building dependency models, than training the final classifier itself. We extend some recent discussion in the literature and provide a deeper analysis, namely, developing the view that label dependence is often introduced by an inadequate base classifier ...
Reed and Hollmén use neural network style nonlinearities, while Nikos and I use a combination of randomized embeddings and randomized kernel approximations, but our conclusion is similar: given a flexible and well-regularized generic nonlinearity, label dependencies can be directly modeled when constructing the classifier; furthermore, this is both computationally and statistically more efficient than current state-of-the-art approaches.

The use of neural network style nonlinearities for multi-label is extremely reasonable for this setting, imho. Advancing the successes of deep learning into structured prediction is currently a hot topic of research, and it is partially tricky because it is unclear how to render an arbitrary structured prediction problem onto a structure which is amenable to (SGD) optimization (c.f., LSTMs for sequential inference tasks). Fortunately, although multi-label has a structured prediction interpretation, existing deep architectures for multi-class require only slight modifications to apply to multi-label. (“Then why are you using randomized methods?”, asks the reader. The answer is that randomized methods distribute very well and I work in a Cloud Computing laboratory.)

Sunday, April 12, 2015

Extreme Classification CFP

The CFP for the Extreme Classification Workshop 2015 is out. We'd really appreciate your submission. We also have some really cool invited speakers and (imho) this is a hot area, so regardless of whether you submit material you should attend the workshop, we're going to have some fun.

Monday, December 15, 2014

NIPS 2014

With a new venue and a deep attitude, NIPS was a blast this year, kudos to the organizers.

Let's start with the “talk of the conference”. I mean this in the spirit of Time's “Man of the Year”, i.e., I'm not condoning the content, just noting that it was the most impactful. And of course the winner is ... Ilya Sutsveker's talk Sequence to Sequence Learning with Neural Networks. The swagger was jaw-dropping: as introductory material he declared that all supervised vector-to-vector problems are now solved thanks to deep feed-forward neural networks, and then proceeded to declare that all supervised sequence-to-sequence problems are now solved thanks to deep LSTM networks. Everybody had something to say about this talk. On the positive side, the inimitable John Hershey told me over drinks that LSTM has allowed his team to sweep away years of cruft in their speech cleaning pipeline while getting better results. Others with less charitable interpretations of the talk probably don't want me blogging their intoxicated reactions.

It is fitting that the conference was in Montreal, underscoring that the giants of deep learning have transitioned from exiles to rockstars. As I learned the hard way, you have to show up to the previous talk if you want to get into the room when one of these guys is scheduled at a workshop. Here's an actionable observation: placing all the deep learning posters next to each other in the poster session is a bad idea, as it creates a ridiculous traffic jam. Next year they should be placed at the corners of the poster session, just like staples in a grocery store, to facilitate the exposure of other material.

Now for my personal highlights. First let me point out that the conference is so big now that I can only experience a small part of it, even with the single-track format, so you are getting a biased view. Also let me congratulate Anshu for getting a best paper award. He was an intern at Microsoft this summer and the guy is just super cool.

Distributed Learning

Since this is my day job, I'm of course paranoid that the need for distributed learning is diminishing as individual computing nodes (augmented with GPUs) become increasingly powerful. So I was ready for Jure Leskovec's workshop talk. Here is a killer screenshot.
Jure said every grad student is his lab has one of these machines, and that almost every data set of interest fits in RAM. Contemplate that for a moment.

Nonetheless there was some good research in this direction.

Other Trends

Randomized Methods: I'm really hot for randomized algorithms right now so I was glad to see healthy activity in the space. LOCO (mentioned above) was one highlight. Also very cool was Radagrad, which is a mashup of Adagrad and random projections. Adagrad in practice is implemented via a diagonal approximation (e.g., in vowpal wabbit), but Krummenacher and McWilliams showed that an approximation to the full Adagrad metric can be tractably obtained via random projections. It densifies the data, so perhaps it is not appropriate for text data (and vowpal wabbit focuses on sparse data currently), but the potential for dense data (i.e., vision, speech) and nonlinear models (i.e., neural networks) is promising.

Extreme Learning Clearly someone internalized the most important lesson from deep learning: give your research program a sexy name. Extreme learning sounds like the research area for those who like skateboarding and consuming a steady supply of Red Bull. What it actually means is multiclass and multilabel classification problems where the number of classes is very large. I was pleased that Luke Vilnis' talk on generalized eigenvectors for large multiclass problems was well received. Anshu's best paper winning work on approximate maximum inner product search is also relevant to this area.

Discrete Optimization I'm so clueless about this field that I ran into Jeff Bilmes at baggage claim and asked him to tell me his research interests. However assuming Ilya is right, the future is in learning problems with more complicated output structures, and this field is pushing in an interesting direction.

Probabilistic Programming Rob Zinkov didn't present (afaik), but he showed me some sick demos of Hakaru, the probabilistic programming framework his lab is developing.

Facebook Labs I was happy to see that Facebook Labs is tackling ambitious problems in text understanding, image analysis, and knowledge base construction. They are thinking big ... extreme income inequality might be bad for the long-term stability of western democracy, but its causing a golden age in AI research.

In Conclusion

Best. Conference. Ever. I can't wait until next year.

Monday, June 30, 2014

ICML 2014 Review

ICML 2014 went well, kudos to the organizers. The location (Beijing) and overlap with CVPR definitely impacted the distribution of attendees, so the conference felt different than last year. (I also learned that my blog is blocked from China, collateral damage from some kind of spat between Google and the Chinese government).

Deep learning was by far the most popular conference track, to the extent that the conference room for this track was overwhelmed and beyond standing room only. I missed several talks I wanted to attend because there was no physical possibility of entrance. This is despite the fact that many deep learning luminaries and their grad students were at CVPR. Fortunately Yoshua Bengio chose ICML and via several talks provided enough insight into deep learning to merit another blog post. Overall the theme is: having conquered computer vision, deep learning researchers are now turning their attention to natural language text, with some notable early successes, e.g., paragraph vector. And of course the brand is riding high, which explains some of the paper title choices, e.g., “deep boosting”. There was also a conference track titled “Neural Theory and Spectral Methods” ... interesting bedfellows!

ADMM suddenly became popular (about 18 months ago given the latency between idea, conference submission, and presentation). By this I don't mean using ADMM for distributed optimization, although there was a bit of that. Rather there were several papers using ADMM to solve constrained optimization problems that would otherwise be vexing. The take-home lesson is: before coming up with a customized solver for whatever constrained optimization problem which confronts you, try ADMM.

Now for the laundry list of papers (also note the papers described above):
  1. Input Warping for Bayesian Optimization of Non-stationary Functions. If you want to get the community's attention, you have to hit the numbers, so don't bring a knife to a gunfight.
  2. Nuclear Norm Minimization via Active Subspace Selection. The inimitable Cho-Jui Hsieh has done it again, this time applying ideas from active variable methods to nuclear norm regularization.
  3. Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. A significant improvement in the computational complexity required for agnostic contextual bandits.
  4. Efficient programmable learning to search. Additional improvements in imperative programming since NIPS. If you are doing structured prediction, especially in industrial settings where you need to put things into production, you'll want to investigate this methodology. First, it eases the burden of specifying a complicated structured prediction task. Second, it reduces the difference between training and evaluation, which not only means faster deployment, but also less defects introduced between experiments and the production system.
  5. Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels. It is good to confirm quasi-random numbers can work better for randomized feature maps.
  6. A Single-Pass Algorithm for Efficiently Recovering Sparse Cluster Centers of High-dimensional Data. I'll need to spend some quality time with this paper.
  7. Multiresolution Matrix Factorization. Nikos and I have had good luck learning discriminative representations using classical matrix decompositions. I'm hoping this new decomposition technique can be analogously adapted.
  8. Sample-based Approximate Regularization. I find data-dependent regularization promising (e.g., dropout on least-squares is equivalent to a scale-free L2 regularizer), so this paper caught my attention.
  9. Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm. No experiments in the paper, so maybe this is a ``pure theory win'', but it looks interesting.

Friday, February 21, 2014

Stranger in a Strange Land

I attended the SIAM PP 2014 conference this week, because I'm developing an interest in MPI-style parallel algorithms (also, it was close by). My plan was to observe the HPC community, try to get a feel how their worldview differs from my internet-centric “Big Data” mindset, and broaden my horizons. Intriguingly, the HPC guys are actually busy doing the opposite. They're aware of what we're up to, but they talk about Hadoop like it's some giant livin' in the hillside, comin down to visit the townspeople. Listening to them mapping what we're up to into their conceptual landscape was very enlightening, and helped me understand them better.

The Data Must Flow

One of the first things I heard at the conference was that “map-reduce ignores data locality”. The speaker, Steve Plimpton, clearly understood map-reduce, having implemented MapReduce for MPI. This was a big clue that they mean something very different by data locality (i.e., they do not mean “move the code to the data”).

A typical MPI job consists of loading a moderate amount of initial state into main memory, then doing an extreme amount of iterative computation on that state, e.g., simulating biology, the weather, or nuclear explosions. Data locality in this context means rearranging the data such that synchronization requirements between compute nodes is mitigated.

Internet companies, on the other hand, generally have a large amount of data which parameterizes the computation, to which they want to apply a moderate amount of computation (e.g., you only need at most 30 passes over the data to get an excellent logistic regression fit). While we do some iterative computations the data-to-computation ratio is such that dataflow programming, moderately distorted, is a good match for what we desire. This difference is why the CTO of Cray felt compelled to point out that Hadoop “does I/O all the time”.

Failure Is Not An Option

The HPC community has a schizophrenic attitude towards fault-tolerance. In one sense they are far more aware and worried about it, and in another sense they are oblivious.

Let's start with obliviousness. The dominant programming model for HPC today provides the abstraction of a reliable machine, i.e., a machine that does not make errors. Current production HPC systems deliver on this promise via error detection combined with global checkpoint-restart. The hardware vendors do this in an application-agnostic fashion: periodically they persist the entire state of every node to durable storage, and when they detect an error they restore the most recent checkpoint.

There are a couple problems which threaten this approach. The first is fundamental: as systems become more parallel, mean time between failure decreases, but checkpoint times do not (more nodes means more I/O capacity but also more state to persist). Thanks to constant factor improvements in durable storage due to SSDs and NVRAM, the global checkpoint-restart model has gained two or three years of runway, but it looks like a different strategy will soon be required.

The second is that error detection is itself error prone. ECC only guards against the most probable types of errors, so if a highly improbable type of error occurs it is not detected; and other hardware (and software) componentry can introduce additional undetected errors. These are called silent corruption in the HPC community, and due to their nature the frequency at which they occur is not well known, but it is going to increase as parallelism increases.

Ultimately, what sounds like a programmer's paradise (“I don't have to worry about failures, I just program my application using the abstraction of a reliable machine”) becomes a programmer's nightmare (“there is no way to inform the system about inherent fault-tolerance of my computation, or to write software to mitigate the need for expensive general-purpose reliability mechanisms which don't even always work.”). Paraphrasing one panelist, “... if an ECC module detects a double-bit error then my process is toast, even if the next operation on that memory cell is a write.”

Silent But Not Deadly

Despite the dominant programming model, application developers in the community are highly aware of failure possibilities, including all of the above but also issues such as numerical rounding. In fact they think about failure far more than I ever have: the most I've ever concerned myself with is, “oops I lost an entire machine from the cluster.” Meanwhile I'm not only not checking for silent corruption, I'm doing things like buying cheap RAM, using half-precision floating point numbers, and ignoring suddenly unavailable batches of data. How does anything ever work?

One answer, of course, is that typical total number core-hours of a machine learning compute task is so small that extremely unlikely things generally do not occur. While it takes a lot of computers to recognize a cat, the total core-hours is still less than 106. Meanwhile the Sequoia at LLNL has 100K compute nodes (1.6M cores) so a simulation which takes a week will have somewhere between 102-104 more core-hours of exposure. Nonetheless the ambition in the machine learning community is to scale up, which begs the question: should we be worried about data corruption? I think the answer is: probably not to the same level as the HPC community.

I saw a presentation on self-stabilizing applications, which was about designing algorithms such that randomly injected incorrect calculations were fixed by later computation. The third slide indicated “some applications are inherently self-stabilizing without further modification. For instance, convergent fixed point methods, such as Newton's method.” Haha! Most of machine learning is “the easy case” (as is, e.g., PageRank). Not that surprising, I guess, given that stochastic gradient descent algorithms appear to somehow work despite bugs.

Remember the butterfly effect? That was inspired by observed choatic dynamics in weather simulation. Predicting the weather is not like machine learning! One question is whether there is anything in machine learning or data analytics akin to weather simulation. Model state errors during training are corrected by contractive dynamics, and errors in single inputs or intermediate states at evaluation time only affect one decision, so their impact is bounded. However, model state errors at evaluation time affect many decisions, so it's worth being more careful. For example, one could ship a validation set of examples with each model to a production system, and when a model is loaded the output on the validation set is computed: if it doesn't match desired results, the new model should be rejected. Mostly however machine learning can afford to be cavalier, because there are statistical limits to the information content of the input data and we want to generalize to novel situations. Furthermore, the stakes are lower: a mistargeted advertisement is less dangerous than a mistargeted nuclear weapon.

Anything To Declare?

There appeared to be at least two distinct subcamps in the HPC community. In one camp were those who wanted to mostly preserve the abstraction of a reliable machine, possibly moving failure handling up the stack a bit into the systems software but still mostly keeping the application programmer out of it. As I heard during a panel discussion, this camp wants “a coherent architecture and strategy, not a bag of tricks.” In the other camp were those who wanted more application-level control over reliability strategies, in order to exploit specific aspects of their problem and avoiding the large penalty of global checkpoint restore. For example, maybe you have a way to check the results of a computation in software, and redo some work if it doesn't pass (aka Containment Domains). You would like to say “please don't do an expensive restore, I'll handle this one”. Current generation HPC systems do not support that.

At the application level being declarative appears key. The current HPC abstraction is designed to make an arbitrary computation reliable, and is therefore expensive. By declaring computational intent, simpler models of reliability can be employed. For instance, map-reduce is a declarative framework: the computation is said to have a particular structure (data-parallel map followed by associative reduce) which admits localized fault handling (when a node fails, only the map output associated with that node need be recomputed, and this can be done speculatively). These simpler models of reliability aren't just cheaper they are also faster (less redundant work when an error occurs). However, they do not work for general purpose computations.

Putting together a collection of special purpose computation frameworks with associated reliability strategies either sounds great or horrible depending upon which camp you are in. I'm sure some in the HPC community look at the collection of projects in the Apache Foundation with fear and loathing. Others, however, are saying that in fact a small number of computation patterns capture the majority of work (e.g., numerical linear algebra, stencil/grid computations, and Monte Carlo), so that a collection of bespoke strategies could be viable.

Cathedral vs. Bazaar

In the internet sector, the above level of disagreement about the way forward would be considered healthy. Multiple different open source projects would emerge, eventually the best ideas would rise to the top, and the next generation of innovation would leverage the lessons and repeat the cycle. Meanwhile in the HPC world, the MPI spec has yet to adopt any of the competing proposals for fault-tolerance. Originally there was hope for 3.0, then 3.1, and now it looks like 4.0 is the earliest possibility.

Compared to the Apache Foundation, the cathedral vs. bazaar analogy is apt. However the standards committee is a bit more conservative than the community as a whole, which is racing ahead with prototype designs and implementations that relax the abstraction of a reliable machine, e.g., redundant MPI and fault-tolerant MPI. There is also a large body of computation specific strategies under the rubric of “Algorithm Based Fault Tolerance”.


There are some lessons to be learned from this community.

The first is that declarative programming is going to win, at least with respect to the distributed control flow (non-distributed portions will still be dominated by imperative specifications, but for example learning algorithms specified via linear algebra can be declarative all the way down). Furthermore, distributed declarative expressive power will not be general purpose. The HPC community has been trying to support general purpose computation with a fault-free abstraction, and this is proving expensive. Some in the HPC community are now calling for restricted expressiveness declarative models that admit less expensive fault-tolerance strategies (in the cloud we have to further contend with multi-tenancy and elasticity). Meanwhile the open source community has been embracing more expressive but still restricted models of computation, e.g., Giraph and GraphLab. More declarative frameworks with different but limited expressiveness will arise in the near-term, and creating an easy way to run them all in one unified cluster, and to specify a task that spans all of them, will be a necessity.

The second is that, if you wait long enough, extremely unlikely things are guaranteed to happen. Mostly we ignore this in the machine learning community right now, because our computations are short: but we will have to worry about this given our need and ambition to scale up. Generic strategies such as containment domains and skeptical programming are therefore worth understanding.

The third is that Bulk Synchronous Parallel has a lot of headroom. There's a lot of excitment in the machine learning community around parameter servers, which is related to async PGAS in HPC (and also analogous to relaxations of BSP, e.g., stale synchronous parallel). However BSP works at petascale today, and is easy to reason about and program (e.g., BSP is what Vowpal Wabbit does when it cajoles Hadoop into doing a distributed logistic regression). With an optimized pipelined implementation of allreduce, BSP algorithms look attractive, especially if they can declare semantics about how to make progress given partial responses (e.g., due to faults or multi-tenancy issues) and how to leverage newly available additional resources (e.g., due to multi-tenancy).

I could have sworn there was a fourth takeaway but unfortunately I have forgotten it, perhaps due to an aberrant thermal neutron.

Thursday, December 12, 2013

NIPSplosion 2013

NIPS was fabulous this year, kudos to all the organizers, area chairs, reviewers, and volunteers. Between the record number of attendees, multitude of corporate sponsors, and the Mark Zuckerburg show, this year's conference is most notable for sheer magnitude. It's past the point that one person can summarize it effectively, but here's my retrospective, naturally heavily biased towards my interests.

The keynote talks were all excellent, consistent with the integrative “big picture” heritage of the conference. My favorite was by Daphne Koller, who talked about the “other online learning”, i.e., pedagogy via telecommunications. Analogous to how moving conversations online allows us to precisely characterize the popularity of Snooki, moving instruction online facilitates the use of machine learning to improve human learning. Based upon the general internet arc from early infovore dominance to mature limbic-stimulating pablum, it's clear the ultimate application of the Coursera platform will be around courtship techniques, but in the interim a great number of people will experience more substantial benefits.

As far as overall themes, I didn't detect any emergent technologies, unlike previous years where things like deep learning, randomized methods, and spectral learning experienced a surge. Intellectually the conference felt like a consolidation phase, as if the breakthroughs of previous years were still being digested. However, output representation learning and extreme classification (large cardinality multiclass or multilabel learning) represent interesting new frontiers and hopefully next year there will be further progress in these areas.

There were several papers about improving the convergence of stochastic gradient descent which appeared broadly similar from a theoretical standpoint (Johnson and Zhang; Wang et. al.; Zhang et. al.). I like the control variate interpretation of Wang et. al. the best for generating an intuition, but if you want to implement something than Figure 1 of Johnson and Zhang has intelligible pseudocode.

Covariance matrices were hot, and not just for PCA. The BIG & QUIC algorithm of Hseih et. al. for estimating large sparse inverse covariance matrices was technically very impressive and should prove useful for causal modeling of biological and neurological systems (presumably some hedge funds will also take interest). Bartz and Müller had some interesting ideas regarding shrinkage estimators, including the “orthogonal complement” idea that the top eigenspace should not be shrunk since the sample estimate is actually quite good.

An interesting work in randomized methods was from McWilliams et. al., in which two random feature maps are then aligned with CCA over unlabeled data to extract the “useful” random features. This is a straightforward and computationally inexpensive way to leverage unlabeled data in a semi-supervised setup, and it is consistent with theoretical results from CCA regression. I'm looking forward to trying it out.

The workshops were great, although as usual there are so many interesting things going on simultaneously that it made for difficult choices. I bounced between extreme classification, randomized methods, and big learning the first day. Michael Jordan's talk in big learning was excellent, particularly the part juxtaposing decreasing computational complexity of various optimization relaxations with increasing statistical risk (both effects due to the expansion of the feasible set). This is starting to get at the tradeoff between data and computation resources. Extreme classification (large cardinality multiclass or multilabel learning) is an exciting open area which is important (e.g., for structured prediction problems that arise in NLP) and appears tractable in the near-term. Two relevant conference papers were Frome et. al. (which leveraged word2vec to reduce extreme classification to regression with nearest-neighbor decode) and Cisse et. al. (which exploits the near-disconnected nature of the label graph often encountered in practice with large-scale multi-label problems).

The second day I mostly hung out in spectral learning but I saw Blei's talk in topic modeling. Spectral learning had a fun discussion session. The three interesting questions were
  1. Why aren't spectral techniques more widely used?
  2. How can spectral methods be made more broadly easily applicable, analogous to variational Bayes or MCMC for posterior inference?
  3. What are the consequences of model mis-specification, and how can spectral methods be made more robust to model mis-specification?
With respect to the first issue, I think what's missing is rock solid software that can easily found, installed, and experimented with. Casual practitioners do not care about theoretical benefits of algorithms, in fact they tend to view “theoretical” as a synonym for “putative”. Progress on the second issue would be great, c.f., probabilistic programming. Given where hardware is going, the future belongs to the most declarative. The third issue is a perennial Bayesian issue, but perhaps has special structure for spectral methods that might suggest, e.g., robust optimization criterion.

Tuesday, December 10, 2013

The Flipped Workshop

This year at NIPS one of the great keynotes was by Daphne Koller about Coursera and the flipped classroom. On another day I was at lunch with Chetan Bhole from Amazon, who pointed out that all of us go to conferences to hear each other's lectures: since the flipped classroom is great, we should apply the concept to the conference.

I love this idea.

It's impractical to consider moving an entire conference over to this format (at least until the idea gains credibility), but the workshops provide an excellent experimental testbed, since the organizers are plenary. Here's how it would work: for some brave workshop, accepted submissions to the workshop (and invited speakers!) would have accompanying videos, which workshop participants would be expected to watch before the workshop. (We could even use Coursera's platform perhaps, to get extra things like mastery questions and forums.) During the workshop, speakers only spend 2 minutes or so reminding the audience who they are and what was the content of their video. Then, it becomes entirely interactive Q-and-A, presumably heavily whiteboard or smartboard driven.

Feel free to steal this idea. Otherwise, maybe I'll try to organize a workshop just to try this idea out.

Saturday, June 22, 2013

ICML 2013: Sparse, Deep, and Random

ICML 2013 was a great conference this year, kudos to the organizers.  It's too big for an individual to get a comprehensive view of everything but I did notice three trends.

First, sparsity as a structural constraint seemed ubiquitous.  Since I know very little about this sub-field, I paid closely attention to the first two minutes of talks where foundational issues are typically (quickly!) discussed, e.g., “why do people care about sparsity at all?”.  I heard general motivations such as computational convenience and intelligibility.  I also heard somewhat specific motivations, e.g., Anandkumar et. al. showed that for particular generative model structures the parameters can be identified via sparse coding techniques; and Ruvolo and Eaton advocated sparse coding of models in multi-task learning to facilitate knowledge transfer between tasks.

Second, deep learning continues to enjoy a resurgence.  Two talks in particular suggested some important future directions.  The first was a talk by Coates about deep learning on the following architecture: 16 machines with 4 GPUs each wired up via infiniband.  I've complained on this blog about how the high communication costs of SGD make it a poor distributed learning algorithm but Coates et. al. are attacking this problem directly with hardware.  This is clearly the near future.  The bottom line is that we don't really have better training algorithms for neural networks, but the economics of the problems they are solving are so important that to the extent it is possible to “throw hardware at it”, hardware will be thrown.  The second talk was On the Difficulty of Training Recurrent Neural Networks by Pascanu et. al., which discussed some improvements to gradient-based learning in the recurrent setting.  It's clear that the deep learning guys, having dominated the “natural UI” problems so important in the mobile space (e.g., speech recognition and image tagging), are now looking to dominate sequential prediction tasks (which should increase in importance as autonomous systems become more ubiquitous).  They will have strong competition from the kernel guys: Le Song gave an amazing talk on Hilbert Space Embedding of Conditional Distributions with applications to sequential prediction.

Speaking of kernel guys, the third theme was random, and in particular Alex Smola's talk on fast randomized approximation of kernel learning (“FastFood”) was a real treat.  Presumably the combination of randomized computational techniques with Hilbert space representations of conditional distributions will result in strong algorithms for sequential prediction and other latent modeling tasks.  Another standout in this area was Mahoney's talk on Revisiting the Nyström Method for Improved Large-Scale Machine Learning.

Note unlike the first two themes (sparse and deep), I wouldn't say random is a broadly popular theme yet.  I happen to be personally very excited by it, and I think the implications for machine learning are large especially in the distributed setting.  Basically the numerical linear algebra guys who have advanced these randomized algorithms have been working on “architecture aware computing” for many years now, something that the machine learning community is only starting to appreciate.  For a glimpse of what this might mean to you, consider David Gleich's talk on Tall and Skinny QR Factorizations in Hadoop.

 I couldn't find any online material, which is unfortunate because this is really cool, and if you've ever put a machine learning algorithm into an existing production system you will love this right away.  The basic idea is that you, the end user, program “normally” and make calls into a learning algorithm which is implemented as a coroutine.  This has two advantages: first, the algorithm naturally experiences the distribution of decisions induced by the program, so the “dataset collection” problem and associated errors is mitigated (this is extra important for sequential prediction); and second, the training and evaluation time code paths are the same so implementation in production is both easier and less error-prone.  Note the evaluation time overhead is minimal in this setup so there is no temptation to rewrite the algorithm for production.  Testing time overhead is introduced but this can be mitigated by decorating the code with additional annotations that allow the learning algorithm to memoize appropriately.  This is so hot that I want to volunteer for some sequential prediction tasks in my neighborhood just to try it out.

Wednesday, January 2, 2013

NIPS 2012 Trends

Rather than a laundry list of papers, I thought I would comment on some trends that I observed at NIPS this year.

Deep Learning is Back

For the true faithful deep learning never left, but for everybody else several recent developments have coalesced in their favor.

First, data sets have gotten bigger. Bigger data sets mean more complex model families can be considered without overfitting. Once data sets get too big than computational constraints come into play, but in the zone of 105 to 106 rows and 102 to 103 columns deep learning computational costs are tolerable, and this zone contains many data sets of high economic value.

Second, data sets have gotten more public. Call this the Kaggle effect if you will, although purely academic projects like ImageNet are also important. Once larger interesting data sets are public meaningful technique comparisons are possible. Here's a quick paper reading hint: that section of the paper that talks about how the approach of the paper is better than all the other approaches, you can skip that section, because the numbers in that section are subject to a particular selection pressure: the authors keep experimenting with their technique until it is demonstrably better, while they do not apply the same enthusiasm to the competitive techniques. On the other hand if there is a situation where the proponents of technique A push as hard as they can on a data set, and proponents of technique B push as hard as they can on a data set, then knowing who does better is more interesting. The deep learning community benefits from these kinds of match-ups because, at the end of the day, they are very empirically oriented.

Third, data sets have gotten more diverse. Linear methods work well if you have enough intuition about the domain to choose features and/or kernels. In the absence of domain knowledge, nonconvex optimization can provide a surrogate.

These trends are buoyed by the rise of multicore and GPU powered computers. While deep learning is typically synonymous with deep neural networks, we can step back and say deep learning is really about learning via nonconvex optimization, typically powered by SGD. Unfortunately SGD does poorly in the distributed setting because of high bandwidth requirements. A single computer with multicores or multiple GPU cards is essentially a little cluster with a high-speed interconnect which helps workaround some of the limitations of SGD (along with pipeling and mini-batching). I think the near-future favors the GPU approach to deep learning over the distributed approach (as exemplified by DistBelief), since there is economic pressure to increase the memory bandwidth to the GPU for computer gaming. I'm partial to the distributed approach to deep learning because in practice the operational store of data is often a cluster so in situ manipulations are preferable. Unfortunately I think it will require a very different approach, one where the nonconvexity is chosen with the explicit design goal of allowing efficient distributed optimization. Until there's a breakthrough along those lines, my money is on the GPUs.

Probabilistic Programming

Probabilistic programming is a style of modeling in which users declaratively encode a generative model and some desired posterior summaries, and then a system converts that specification into an answer.
Declarative systems are the paragon of purity in computer science. In practice declarative systems face adoption hurdles because unless the domain in question is well-abstracted, end users inevitably find the limitation of the domain specific language unbearable. When the domain is well-abstracted, declarative systems can thrive if there are broadly applicable general strategies and optimizations, because even the most experienced and talented programmer will find the declarative framework more productive (at the very least, for prototyping, and quite possibly for finished product).

So here's some good news: for Bayesians a large amount of machine learning is well-abstracted as posterior summarization via Monte Carlo. Furthermore, the no u-turn sampler looks like a broadly applicable strategy, and certain other techniques like automatic differentiation and symbolic model simplification offer the promise of both correctness and (relative) speed. Overall, then, this looks like a slam dunk.

Spectral Methods for Latent Models

I blogged about this extensively already. The tl;dr is that spectral methods promise more scalable latent model learning by eliding the E-step. In my experience topic models extract great features for subsequent supervised classification in many domains (not just text!), so this is an exciting development practically speaking. Also, the view of topic models as extracting higher-order moment eigenvalues gives some intuition about why they have broad utility.