Statistical and computational aspects of aggregating data summaries

Many statistical procedures aggregate information from summaries of a data set, such as subsamples, bootstrap samples or random projections.  Although the bootstrap itself is probably the best-known such method, there are many other examples, including bagging [1,2,3] for regression or classification (with random forests [4] as a special case), Stability Selection for variable selection [5,6] and random projection ensemble classification [7].  Intuitively, such procedures allow the statistician to understand the stability of observed effects under perturbations of the original data, and appear to be particularly valuable for complex, high-dimensional data. Although these methods are typically embarrassingly parallelisable, they can still be computationally intensive.
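As a rough illustration of the aggregation idea, the sketch below bags a 1-nearest-neighbour classifier: the base classifier is refitted on bootstrap resamples of the training data and the predictions are combined by majority vote. It is a minimal example under stated assumptions rather than the method of any of the cited papers; the function names (`one_nn_predict`, `bagged_predict`), the binary 0/1 labels, the number of resamples and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np


def one_nn_predict(X_train, y_train, X_test):
    """Label each test point with the label of its nearest training point."""
    # Squared Euclidean distances, shape (n_test, n_train).
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d2.argmin(axis=1)]


def bagged_predict(X_train, y_train, X_test, n_resamples=200, seed=0):
    """Majority vote of 1-NN classifiers fitted on bootstrap resamples.

    Assumes binary labels coded as 0 and 1 (an illustrative assumption).
    """
    rng = np.random.default_rng(seed)
    n = len(y_train)
    votes = np.zeros(len(X_test))
    for _ in range(n_resamples):
        # Bootstrap resample: draw n indices with replacement.
        idx = rng.integers(0, n, size=n)
        votes += one_nn_predict(X_train[idx], y_train[idx], X_test)
    # Predict class 1 where more than half of the resamples voted for it.
    return (votes / n_resamples > 0.5).astype(int)


# Toy data (hypothetical): two well-separated Gaussian clusters in the plane.
rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
                     rng.normal(10.0, 1.0, size=(20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
X_test = np.array([[0.0, 0.0], [10.0, 10.0]])

predictions = bagged_predict(X_train, y_train, X_test)
```

Each pass through the loop is independent of the others, which is why procedures of this kind parallelise so readily; the cost is that the base procedure must be run once per resample.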

This project will explore when and why methods such as these can be expected to succeed. The analysis will combine statistical perspectives with the inherent computational trade-offs. It is hoped that the analysis will suggest other statistical challenges where the aggregation of data summaries can prove an effective tool.

  • [1] Breiman, L. (1996) Bagging predictors. Mach. Learn., 24, 123–140.
  • [2] Hall, P. and Samworth, R. J. (2005) Properties of bagged nearest neighbour classifiers. J. Roy. Statist. Soc. Ser. B, 67, 363–379.
  • [3] Samworth, R. J. (2012) Optimal weighted nearest neighbour classifiers. Ann. Statist., 40, 2733–2763.
  • [4] Breiman, L. (2001) Random forests. Mach. Learn., 45, 5–32.
  • [5] Meinshausen, N. and Bühlmann, P. (2010) Stability selection (with discussion). J. Roy. Statist. Soc. Ser. B, 72, 417–473.
  • [6] Shah, R. D. and Samworth, R. J. (2013) Variable selection with error control: another look at stability selection. J. Roy. Statist. Soc. Ser. B, 75, 55–80.
  • [7] Cannings, T. I. and Samworth, R. J. (2015) Random projection ensemble classification. Available at

Who's involved

Papers, Publications & Software

Isotonic regression in general dimensions

Qiyang Han, Tengyao Wang, Sabyasachi Chatterjee, Richard J. Samworth
Published 30/08/2017

InspectChangepoint Software

Tengyao Wang, Richard J. Samworth