Sylvain Arlot (CNRS - Ecole Normale Supérieure)
Optimal model selection with V-fold cross-validation: how should V be chosen?
V-fold cross-validation is a simple and efficient method for estimator selection when the goal is to minimize the final prediction error. Nevertheless, few theoretical results exist about the influence of V on the risk of the final estimator, in particular results taking into account the variability of the cross-validation criterion as a function of V. We will focus on the case-example of model selection for least-squares density estimation. First, a non-asymptotic oracle inequality holds for V-fold cross-validation (and its bias corrected version, V-fold penalization), with a leading constant 1+o(1) for V-fold penalization, and the second order term in the constant decreases when V increases. Second, making an exact variance computation allows to quantify the improvement we can expect when V increases. In particular, this computation enlightens why the improvement is larger when V goes from 2 to 10 than when V goes from 10 to 100, for instance. Simulation experiments confirm these theoretical results for realistic values of the sample size.
This talk is based upon a collaboration with Matthieu Lerasle (CNRS - Universite de Nice, France). Preprint: http://arxiv.org/abs/1210.5830
Francois Caron (INRIA Bordeaux - Sud-Ouest)
Bayesian nonparametric models for bipartite graphs
In this talk I will present a novel Bayesian nonparametric model for bipartite graphs, based on the theory of completely random measures. The model is able to handle a potentially infinite number of nodes and has appealing properties; in particular, it may exhibit a power-law behavior for some values of the parameters. I derive a posterior characterization, a generative process for network growth, and a simple Gibbs sampler for posterior simulation. Finally, the model is shown to provide a good fit to several large real-world bipartite social networks
Nicolas Chopin (CREST - ENSAE)
EP-ABC: Expectation-Propagation for Likelihood-Free Inference
Many models of interest in the natural and social sciences have no closed-form likelihood function, which means that they cannot be treated using the usual techniques of statistical inference. In the case where such models can be efficiently simulated, Bayesian inference is still possible thanks to the Approximate Bayesian Computation (ABC) algorithm. Although many refinements have been suggested, ABC inference is still far from routine. ABC is often excruciatingly slow due to very low acceptance rates. In addition, ABC requires introducing a vector of "summary statistics", the choice of which is relatively arbitrary, and often require some trial and error, making the whole process quite laborious for the user. We introduce in this work the EP-ABC algorithm, which is an adaptation to the likelihood-free context of the variational approximation algorithm known as Expectation Propagation (Minka, 2001). The main advantage of EP-ABC is that it is faster by a few orders of magnitude than standard algorithms, while producing an overall approximation error which is typically negligible. A second advantage of EP-ABC is that it replaces the usual global ABC constraint on the vector of summary statistics computed on the whole dataset, by n local constraints of the form that apply separately to each data-point. As a consequence, it is often possible to do away with summary statistics entirely. In that case, EP-ABC approximates directly the evidence (marginal likelihood) of the model. Comparisons are performed in three real-world applications which are typical of likelihood-free inference, including one application inneuroscience which is novel, and possibly too challenging for standard ABC techniques. I will also mention briefly new applications of EP-ABC we are currently working on, in population genetics and in spatial modelling. (joint work withSimon Barthelmé, plus some on-going work with Jean-Michel Marin, Pierre Pudlo, Magal Beffi, and Erlis Ruli)
Aurélien Garivier (Institut Mathématique de Toulouse, Université Paul Sabatier)
Dynamic resource allocation as an estimation problem
An agent sequentially chooses actions in a set of possible options. Each action leads to a stochastic reward, whose distribution is unknown. What dynamic allocation rule should he choose so as to maximize his cumulated reward?
The study of "bandit problems" (the word refers to the paradigmatic situation of a gambler facing a row of slot-machines and deciding which one to choose in order to maximize his rewards), dating back to the 1930s, was originally motivated by medical trials. In has recently raised a renewed interest because of computer-driven applications, from computer experiments to recommender systems and Big Data.
One possible solution, called UCB, relies on upper confidence bounds of the expected rewards associated to all actions. In this presentation, I shall explain why fine confidence bounds are required in order to obtain efficient allocation rules, and I shall present the statistical challenges involved in their construction.
Zaid Harchaoui (INRIA Grenoble - Rhône-Alpes)
Large-scale learning with conditional gradient algorithms
We consider convex optimization problems arising in machine learning in large-scale settings. For several important learning problems, such as e.g. noisy matrix completion or multi-class classification, state-of-the-art optimization approaches such as composite minimization (a.k.a. proximal-gradient) algorithms are difficult to apply and do not scale up to large datasets. We propose three extensions of the conditional gradient algorithm (a.k.a. Frank-Wolfe's algorithm), suitable for large-scale problems, and establish their finite-time convergence guarantees. Promising experimental results are presented on large-scale real-world datasets.
Guillaume Obozinski (Ecole des Ponts - Paristech)
Convex relaxation for Combinatorial Penalties
The last years have seen the emergence of the field of structured sparsity, which aims at identifying a model of small complexity given some a priori knowledge on its possible structure. Specifically, models with structured sparsity are models in which the set of non-zero parameters --- often corresponding to a set of selected variables --- is not only assumed to be small, but also to display structured patterns. Two important examples are group sparsity, where groups of parameters are simultaneously zero or non-zero ,and hierarchical sparsity, were variables can only be selected following a prescribed partial order encoded by a directed acyclic graph.
A common approach to the problem is to add to the empirical risk an implicit or explicit penalization of the structure of the non-zero patterns.
In this talk, I will consider a generic formulation in which allowed structures are encoded by a combinatorial penalty, and show that when combined with continuous regularizer such as an Lp norm, a tightest convex relaxation can be constructed and used a regularizer. The formulation considered allows to treat in a unified framework several a priori disconnected approaches such as norms based on overlapping groups, norms based on latent representations such as block-coding and submodular functions, and to obtain generic consistency and support recovery results for the corresponding estimators obtained as minimizers of the regularized empirical risk.
Peter Richtarik (University of Edinburgh)
Mini-batch primal and dual methods for support vector machines
In this work we address the issue of using mini-batches in stochastic optimization of support vector machines (SVMs). We show that the same quantity, the spectral norm of the data, controls the parallelization speedup obtained for both primal stochastic subgradient descent (SGD) and stochastic dual coordinate ascent (SCDA) methods and use it to derive novel variants of mini-batched SDCA. Our guarantees for both methods are expressed in terms of the original nonsmooth primal problem based on the hinge-loss.
Igor Pruenster (University of Torino & Collegio Carlo Alberto)
On some distributional properties of Gibbs-type priors
Gibbs-type priors represent a natural (maybe the most natural?) generalization of the Dirichlet process and can be intuitively characterized in terms of their predictive structure. This talk will deal with some of their distributional properties and highlight the corresponding implications in terms of Bayesian nonparametric modeling.
Aarti Singh (Carnegie Mellon University)
Optimal convex optimization under Tsybakov noise through reduction to active learning
We demonstrate that the complexity of convex minimization is only determined by the rate of growth of the function around its minimizer, as quantified by a Tsybakov-like noise condition (TNC) with exponent k. Specifically, we demonstrate optimal first-order stochastic optimization rates that depend precisely on the TNC exponent k which include as special cases the classical rates for convex (k tending to infinity), strongly convex (k=2) and uniformly convex optimization (k > 2). Even faster rates (nearly exponential) can be attained if the exponent 1 < k < 2. Our analysis is based on a reduction of convex optimization to active learning as both problems involve minimizing the number of queries needed to identify the minimizer or the decision boundary, respectively, by using feedback gained from prior queries. Our results demonstrate that the complexity of convex optimization in d-dimensions is precisely the same as the complexity of active learning in 1 dimension. First, we port a lower bound proof from active learning to demonstrate the complexity of optimizing convex function satisfying TNC, under both stochastic first-order oracle as well as derivative-free (zeroth order) setting. We then demonstrate that the optimal rate with TNC exponent under first-order oracle can be achieved by an epoch-gradient descent algorithm, as well as a coordinate descent algorithm that uses a 1-dimensional active learning algorithm as subroutine. We also show that it is possible to adapt to the unknown TNC exponent and hence the degree of convexity.
Yee Whye Teh (Oxford University)
Fast MCMC sampling for Markov jump processes and extensions
Markov jump processes (or continuous-time Markov chains) are a simple and important class of continuous-time dynamical systems. In this talk, we tackle the problem of simulating from the posterior distribution over the unobserved paths in these models given some observations. Our approach is an auxiliary variable Gibbs sampler, and is based on the idea of uniformization. This sets up a Markov chain over paths by alternately sampling a finite set of virtual jump times given the current path and then sampling a new path given the set of extant and virtual jump times using a standard hidden Markov model forward filtering-backward sampling algorithm. Our method is exact and does not involve approximations like time-discretization. We demonstrate how our sampler extends naturally to MJP-based models like Markov-modulated Poisson processes and continuous-time Bayesian networks and show significant computational benefits over state-of-the-art MCMC samplers for these models (Joint work with Vinayak Rao)