BigDataFr recommends: Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means
Excerpt
We analyze a compression scheme for large data sets that randomly keeps a small percentage of the components of each data sample. The benefit is that the output is a sparse matrix and therefore subsequent processing, such as PCA or K-means, is significantly faster, especially in a distributed-data setting. Furthermore, the sampling is single-pass and applicable to streaming data. The sampling mechanism is a variant of previous methods proposed in the literature combined with a randomized preconditioning to smooth the data. [..]
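To make the scheme in the excerpt concrete, here is a minimal NumPy sketch. It rests on assumptions rather than the paper's exact construction: the excerpt says only "randomized preconditioning", so random sign flips followed by an orthonormal Hadamard transform stand in for it. The sampling is assumed to be uniform and without replacement within each sample. The function names `hadamard`, `precondition`, and `sparsify` are hypothetical, and the data dimension is assumed to be a power of two.

```python
import numpy as np


def hadamard(p):
    """Orthonormal Hadamard matrix of size p (p must be a power of two)."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    H = np.ones((1, 1))
    while H.shape[0] < p:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(p)


def precondition(X, rng):
    """Smooth each column of X (shape p x n): random sign flips, then a Hadamard mix.

    Assumption: this particular transform is illustrative; the excerpt does not
    specify which preconditioner is used.
    """
    p = X.shape[0]
    signs = rng.choice([-1.0, 1.0], size=p)
    return hadamard(p) @ (signs[:, None] * X), signs


def sparsify(Y, keep_fraction, rng):
    """Keep m randomly chosen entries of each column of Y and zero out the rest."""
    p, n = Y.shape
    m = max(1, int(round(keep_fraction * p)))
    S = np.zeros_like(Y)
    for j in range(n):  # single pass; each sample is handled independently
        idx = rng.choice(p, size=m, replace=False)
        S[idx, j] = Y[idx, j]
    return S, m


rng = np.random.default_rng(0)
X = rng.standard_normal((16, 5))   # 5 samples of dimension 16, stored as columns
Y, signs = precondition(X, rng)    # orthonormal mixing preserves column norms
S, m = sparsify(Y, 0.25, rng)      # keep 4 of 16 entries per sample
```

Because each entry survives with probability `m / p` under this sampling, rescaling the kept entries by `p / m` gives an unbiased estimate of the preconditioned sample. In practice the result would be stored in a sparse matrix format rather than a dense array padded with zeros.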
By Farhad Pourkamali-Anaraki, Stephen Becker
Source: arxiv.org