Machine learning methods, and in particular random forests (RFs), are a promising alternative to standard single-SNP analyses in genome-wide association studies (GWAS). RFs provide variable importance measures (VIMs) to rank SNPs according to their predictive power. However, in contrast to the established genome-wide significance threshold, no clear criteria exist to determine how many SNPs should be selected for downstream analyses (i.e. how to perform feature selection while controlling the false-positive rate).
We propose a new variable selection approach for RF using a recurrent relative variable importance measure (r2VIM) and have implemented it in the r2VIM software package. The importance value of each feature (SNP) is calculated relative to the minimal importance score observed in the same RF run. This is repeated for several RF runs, and only SNPs with a large relative VIM in every run are selected as important. Evaluations on simulated GWAS data show that the new method controls the number of false positives under the null hypothesis. Under a simple alternative hypothesis with several independent main effects, it is only slightly less powerful than logistic regression. In an experimental GWAS data set, it identifies the same strong signal as linear regression. In another, underpowered GWAS data set, the approach selects none of the SNPs, which indicates good control of false-positive selections.
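The selection rule described above can be illustrated with a minimal sketch. It is not the r2VIM package itself: the function name `select_recurrent`, the threshold parameter `factor` and its value, and the use of the absolute value of the minimal importance as the reference are assumptions made for illustration. The input is assumed to be a matrix of raw importance scores with one row per RF run and one column per SNP.

```python
import numpy as np


def select_recurrent(vims, factor=3.0):
    """Select features whose relative importance is large in every run.

    vims   : array-like of shape (n_runs, n_snps) with raw importance scores
             (assumed to come from repeated RF runs on the same data).
    factor : hypothetical threshold on the relative importance.
    Returns (indices of selected features, matrix of relative importances).
    """
    vims = np.asarray(vims, dtype=float)
    # Reference per run: absolute value of the smallest observed importance
    # (assumption: the minimum reflects the noise level of that run).
    min_abs = np.abs(vims.min(axis=1, keepdims=True))
    if np.any(min_abs == 0):
        raise ValueError("minimal importance is zero in at least one run")
    rel = vims / min_abs
    # Keep a feature only if it passes the threshold in all runs.
    keep = np.all(rel >= factor, axis=0)
    return np.flatnonzero(keep), rel


# Toy importance scores for 3 runs and 4 SNPs (made-up numbers).
vims = [
    [-0.5, 0.2, 4.0, 1.0],
    [-1.0, 0.5, 3.5, 3.2],
    [-0.8, -0.1, 5.0, 0.4],
]
selected, rel = select_recurrent(vims, factor=3.0)
print(selected)
```

In this toy input, the fourth SNP exceeds the threshold in the second run only, so the requirement that a SNP pass in all runs excludes it. If no SNP passes in every run, the returned index array is empty, which corresponds to the behaviour reported for the underpowered data set.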