Learning Tractable but Expressive Models
University of Washington, Seattle
Inference is the hardest part of learning. Learning most powerful
models requires repeated intractable inference, and approximate inference
often interacts badly with parameter optimization. At inference time, an
intractable accurate model can effectively become an inaccurate model due
to approximate inference. All these problems would be avoided if we learned
only tractable models, but standard tractable model classes - like thin
junction trees and mixture models - are insufficiently expressive for most
applications. However, in recent years a series of surprisingly expressive
tractable model classes have been developed, including arithmetic circuits,
feature trees, sum-product networks, and tractable Markov logic. I will
give an overview of these representations, algorithms for learning them,
and their startling successes in challenging applications.
Modeling, Inference and Estimation using Random Maximum A-Posteriori Perturbations
Toyota Technological Institute at Chicago
Learning and inference in complex models drives much of the research in machine learning applications, from computer vision, natural language processing, to computational biology. The inference problem in such cases involves assessing the likelihood of possible
structures, whether objects, parsers, or molecular structures. Although it is often feasible to only find the most likely or maximum
a-posteriori (MAP) assignment rather than considering all possible assignments, MAP inference is limited when there are other likely
assignments. In a fully probabilistic treatment, all possible alternative assignments are considered thus requiring summing over the
assignments with their respective weights -- evaluating the partition function -- which is considerably harder. The main surprising result
of our work is that MAP inference (maximization) can be used to approximate and bound the partition function (weighted counting). This
leads us to a new approximate inference framework that is based on MAP-statistics, thus does not depend on pseudo-probabilities,
contrasting the current framework of Bethe approximations which lacks statistical meaning. This approach excels in regimes where there are several but not exponentially many prominent assignments. For example, this happens in cases where observations carry strong signals (local evidence) but are also guided by strong consistency constraints (couplings).
Structured Prediction using Linear Programming Relaxations: Algorithms and Theory
New York University
Predicting structured objects such as parse trees or protein folds is
often formulated as combinatorial optimization, where the goal is to
find the most likely structure given the available evidence. Linear
programming relaxations are a powerful tool for solving these
optimization problems. Learning for structured prediction corresponds
to inverse combinatorial optimization, finding parameters for the
model such that for each of the data points, the optimal solution is
the desired structure. This talk will survey existing algorithms and
theory relating to learning for structured prediction using linear
Learning Approximate Inference Policies for Fast Prediction
Johns Hopkins University
In many domains, our best models are computationally intractable. This problem will only get worse as we manage to build more richly detailed models of specific domains. Fortunately, the practical goal of artificial or natural intelligence is not to do perfect detailed inference, but rather to answer specific questions by reasoning from observed data. Thus, we should seek policies for fast approximate inference that will actually achieve low expected loss on our target task. The target task is a distribution not only over test-time data but also over which variables will be observed and queried. The loss function may explicitly penalize for runtime (or data acquisition).
This story leaves open an engineering question: What space of policies should we search? I will review a range of options and point out past work for each. Among others, I will show our own recent successes using message-passing approximate inference policies for graphical models. The form of these policies is determined by the structure of our intractable and surely mismatched domain model, but we tune the parameters to minimize loss (Stoyanov & Eisner, 2012).
One may search a space of sophisticated policies or a space of crude hacks, but crucially, one should tune the policy parameters to optimize expected error and runtime. This expectation can be taken over training data (empirical risk minimization), or over samples from a posterior belief about the target task (which I will call imputed risk minimization). The latter case requires some sort of prior model, but this is necessary when the empirical risk estimate suffers from sparse, non-independent, or out-of-domain training data.
Marginalization-Based Parameter Learning in Graphical Models
Rochester Institute of Technology
Likelihood-based learning of graphical models faces challenges of computational complexity and robustness to model error. This talk will discuss methods that directly maximize a measure of the accuracy of predicted marginals, in the context of a particular approximate inference algorithm. Experiments suggest that marginalization-based learning, by compensating for both model and inference approximations at training time, can perform better than likelihood-based approximations when the model being fit is approximate in nature.