[HAL] BigDataFr recommends: Random Forests for Big Data



Abstract

[…] Big Data is one of the major challenges of statistical science and has numerous consequences from algorithmic and theoretical viewpoints. Big Data always involves massive data, but it often also includes online data and data heterogeneity. Recently, some statistical methods have been adapted to process Big Data, such as linear regression models, clustering methods and bootstrapping schemes. Based on decision trees combined with aggregation and bootstrap ideas, random forests were introduced by Breiman in 2001. They are a powerful nonparametric statistical method that handles regression problems, as well as two-class and multi-class classification problems, in a single versatile framework. Focusing on classification problems, this paper proposes a selective review of available proposals that deal with scaling random forests to Big Data problems. These proposals rely on parallel environments or on online adaptations of random forests. We also describe how related quantities, such as out-of-bag error and variable importance, are addressed in these methods. […]
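To make the abstract's vocabulary concrete, here is a minimal sketch of the bootstrap, aggregation and out-of-bag ideas it mentions. It is not the paper's method or any library's implementation: the trees are reduced to depth-one stumps that each use one randomly chosen feature, the data set is synthetic, and all function names are hypothetical and chosen for illustration only.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag

def fit_stump(X, y, idx, feature):
    """Pick the threshold and left-side label with the fewest errors on idx."""
    best = None
    for t in sorted({X[i][feature] for i in idx}):
        for left_label in (0, 1):
            errors = sum(
                (left_label if X[i][feature] <= t else 1 - left_label) != y[i]
                for i in idx
            )
            if best is None or errors < best[0]:
                best = (errors, t, left_label)
    _, t, left_label = best
    return feature, t, left_label

def predict_stump(stump, x):
    feature, t, left_label = stump
    return left_label if x[feature] <= t else 1 - left_label

def fit_forest(X, y, n_trees=50, seed=0):
    """Simplified stand-in for a forest: one bootstrap sample per stump."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        in_bag, oob = bootstrap_indices(n, rng)
        feature = rng.randrange(p)  # assumption: one random feature per stump
        forest.append((fit_stump(X, y, in_bag, feature), oob))
    return forest

def oob_error(forest, X, y):
    """Each sample is scored only by the stumps that did not train on it."""
    votes = [[0, 0] for _ in X]
    for stump, oob in forest:
        for i in oob:
            votes[i][predict_stump(stump, X[i])] += 1
    scored = [(i, v) for i, v in enumerate(votes) if sum(v) > 0]
    if not scored:
        return float("nan")
    wrong = sum((1 if v[1] > v[0] else 0) != y[i] for i, v in scored)
    return wrong / len(scored)

# Synthetic two-class data: feature 0 is informative, feature 1 is noise.
data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [1 if x[0] > 0.5 else 0 for x in X]

forest = fit_forest(X, y)
err = oob_error(forest, X, y)
print(f"out-of-bag error estimate: {err:.3f}")
```

Because each bootstrap sample leaves out roughly a third of the observations, the out-of-bag error gives an error estimate without a separate test set. That convenience is one of the quantities the abstract says must be reconsidered when forests are built in parallel or online.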

Read paper
By Robin Genuer 1,2, Jean-Michel Poggi 3,4, Christine Tuleau-Malot 5, Nathalie Villa-Vialaneix 6
Source: hal-archives-ouvertes.fr

1 ISPED – Institut de Santé Publique, d’Epidémiologie et de Développement
2 SISTM – Statistics In System biology and Translational Medicine, Epidémiologie et Biostatistique [Bordeaux], Inria Bordeaux – Sud-Ouest
3 UPD5 – Université Paris Descartes – Paris 5
4 LM-Orsay – Laboratoire de Mathématiques d’Orsay
5 JAD – Laboratoire Jean Alexandre Dieudonné
6 MIAT INRA – Unité de Mathématiques et Informatique Appliquées de Toulouse
