Feature selection for high-dimensional integrated data

Charles Zheng, Scott Schwartz, Robert S. Chapkin, Raymond J. Carroll, Ivan Ivanov

Research output: Chapter in Book/Report/Conference proceeding › Chapter

1 Scopus citation


Motivated by the problem of identifying correlations between genes or features of two related biological systems, we propose a model of feature selection in which only a subset of the predictors, Xt, depends on the multidimensional variate Y, while the remaining predictors constitute a “noise set” Xu that is independent of Y. Using Monte Carlo simulations, we investigated the relative performance of two methods, thresholding and singular-value decomposition (SVD), in combination with stochastic optimization to determine “empirical bounds” on the small-sample accuracy of an asymptotic approximation. We demonstrate the utility of the thresholding and SVD feature selection methods with respect to a recent infant intestinal gene expression and metagenomics dataset.
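The abstract does not spell out how the two selection methods are defined, so the following Python sketch is illustrative only and should not be read as the authors' procedure. It assumes that "thresholding" ranks each predictor by its largest absolute sample correlation with any column of Y, and that the SVD method ranks predictors by the magnitude of their loading on the leading left singular vector of the cross-correlation matrix. The simulation design (sample size, all-ones loading matrix, noise level) and every function name are hypothetical choices made for the example; the stochastic optimization step mentioned in the abstract is not covered.

```python
import numpy as np


def simulate(n=300, p_signal=5, p_noise=45, q=3, seed=0):
    """Simulate predictors X = [X_t, X_u] and a q-dimensional variate Y.

    Assumption: X_t depends on Y through a fixed all-ones loading matrix
    plus Gaussian noise; X_u is pure noise, independent of Y.
    """
    rng = np.random.default_rng(seed)
    Y = rng.standard_normal((n, q))
    B = np.ones((q, p_signal))                      # hypothetical loadings
    X_t = Y @ B + 0.5 * rng.standard_normal((n, p_signal))
    X_u = rng.standard_normal((n, p_noise))         # the "noise set"
    return np.hstack([X_t, X_u]), Y


def cross_correlation(X, Y):
    """Return the p x q matrix of sample correlations between X and Y columns."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0)
    return Xs.T @ Ys / X.shape[0]


def select_by_threshold(C, k):
    """Keep the k predictors with the largest absolute correlation with any Y column."""
    score = np.abs(C).max(axis=1)
    return np.sort(np.argsort(score)[-k:])


def select_by_svd(C, k):
    """Keep the k predictors with the largest loading on the leading left singular vector."""
    U, _, _ = np.linalg.svd(C, full_matrices=False)
    score = np.abs(U[:, 0])
    return np.sort(np.argsort(score)[-k:])


if __name__ == "__main__":
    X, Y = simulate()
    C = cross_correlation(X, Y)
    print("thresholding picks:", select_by_threshold(C, 5))
    print("SVD picks:         ", select_by_svd(C, 5))
```

In this toy design the dependent predictors occupy the first five columns, so comparing the selected indices against 0 to 4 gives a simple accuracy measure that could be averaged over Monte Carlo replicates.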
Original language: English (US)
Title of host publication: Proceedings of the 2012 SIAM International Conference on Data Mining
Publisher: Society for Industrial & Applied Mathematics (SIAM)
ISBN (Print): 9781611972320
State: Published - Dec 18 2013
Externally published: Yes

Bibliographical note

KAUST Repository Item: Exported on 2020-10-01
Acknowledged KAUST grant number(s): KUS-C1-016-04
Acknowledgements: We are indebted to the Texas A&M Brazos Computing Cluster and Institute of Developmental and Molecular Biology for access to computing resources, and to professors David B. Dahl, Mohsen Pourahmadi, and Joel Zinn for helpful discussions. The infant microarray-metagenomics data was provided courtesy of Sharon M. Donovan, of the Division of Nutritional Sciences, U. of Illinois, Urbana, IL. This publication is based in part on work supported by Award No. KUS-C1-016-04, made by King Abdullah University of Science and Technology (KAUST).
This publication acknowledges KAUST support, but has no KAUST affiliated authors.

