Background: Advances in technology have facilitated the generation of gene expression data from large numbers of samples and the development of “Big Data” approaches to analysing gene expression in basic and biomedical systems. Even so, these data sets still comprise relatively few samples and tens of thousands of variables (gene expression values). A variety of approaches have been developed for searching these gene spaces to select the most informative variables that can accurately distinguish one class of subjects or samples from another. However, there is still a need for new approaches that can accurately distinguish biologically different classes of subjects with similar gene expression profiles. We describe a new approach for selecting the most informative differentially expressed genes that addresses this problem: a method for identifying significant differentially expressed clusters of genes using a process of Recursive Cluster Elimination (RCE) that is based on an ensemble clustering approach. We refer to this approach as SVM-RCE-EC (Ensemble Clustering). We show that SVM-RCE-EC improves gene selection and classification accuracy compared with other methods, including the traditional SVM-RCE approach, and that this improvement is particularly evident on difficult data sets that are poorly resolved by other approaches.
Methods: To implement SVM-RCE-EC, we first applied an ensemble clustering method to identify robust gene clusters. We then applied Support Vector Machines (SVMs) with cross-validation to score (rank) those clusters of genes according to their contribution to classification accuracy. The least significant clusters are progressively removed by the RCE procedure, while the most significant clusters are retained, until the most robust, significantly differentially expressed genes between the two classes are identified. We compare the classification performance of SVM-RCE-EC to a variety of published classification algorithms.
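The score-and-eliminate loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: the function names, the `drop_fraction` parameter, and the toy data are hypothetical, the gene clusters are assumed to have already been produced by the ensemble clustering step, and a leave-one-out nearest-centroid classifier stands in for the cross-validated SVM so that the example needs only the Python standard library.

```python
from typing import Callable, Dict, List, Sequence

Cluster = List[int]  # column indices of the genes in one cluster


def loo_centroid_accuracy(X: Sequence[Sequence[float]],
                          y: Sequence[int],
                          genes: Cluster) -> float:
    """Stand-in scorer (NOT an SVM): leave-one-out accuracy of a
    nearest-centroid classifier restricted to the genes of one cluster."""
    labels = sorted(set(y))
    correct = 0
    for i in range(len(X)):
        best_label, best_dist = None, float("inf")
        for label in labels:
            # Class centroid computed without the held-out sample i.
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == label]
            centroid = [sum(r[g] for r in rows) / len(rows) for g in genes]
            dist = sum((X[i][g] - c) ** 2 for g, c in zip(genes, centroid))
            if dist < best_dist:
                best_label, best_dist = label, dist
        if best_label == y[i]:
            correct += 1
    return correct / len(X)


def recursive_cluster_elimination(
    clusters: Dict[str, Cluster],
    score_cluster: Callable[[Cluster], float],
    drop_fraction: float = 0.3,   # assumed value, not taken from the paper
    min_clusters: int = 1,
) -> List[Dict[str, float]]:
    """Score every surviving cluster, drop the lowest-scoring ones, repeat.
    Returns one {cluster name: score} dict per round."""
    surviving = dict(clusters)
    history: List[Dict[str, float]] = []
    while True:
        scores = {name: score_cluster(genes) for name, genes in surviving.items()}
        history.append(scores)
        if len(surviving) <= min_clusters:
            break
        # Remove at least one cluster per round, but never go below min_clusters.
        n_drop = max(1, int(len(surviving) * drop_fraction))
        n_drop = min(n_drop, len(surviving) - min_clusters)
        for name in sorted(scores, key=scores.get)[:n_drop]:
            del surviving[name]
    return history


# Toy data (hypothetical): 6 samples x 4 genes, two classes.
# Genes 0 and 1 separate the classes; genes 2 and 3 do not.
X = [
    [5.0, 5.1, 1.0, 2.0],
    [5.2, 4.9, 2.0, 1.0],
    [4.8, 5.0, 1.5, 1.5],
    [1.0, 1.1, 1.0, 2.1],
    [1.2, 0.9, 2.0, 1.1],
    [0.8, 1.0, 1.6, 1.4],
]
y = [1, 1, 1, 0, 0, 0]
clusters = {"informative": [0, 1], "noise": [2, 3]}

history = recursive_cluster_elimination(
    clusters, lambda genes: loo_centroid_accuracy(X, y, genes)
)
print(list(history[-1]))  # names of the clusters that survive the last round
```

Passing the scorer in as a function keeps the elimination loop independent of the classifier, so a cross-validated SVM could be substituted for the stand-in without changing the loop.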
Results and Conclusion: Using gene clusters selected by the ensemble method enhances classification performance compared with other methods and identifies sets of significant genes that appear to be more biologically meaningful to the system being analyzed. We show that SVM-RCE-EC outperforms several other methods on data representing highly similar sample classes that are difficult to distinguish, and is comparable to other methods on data where the classes are more easily separated. The improved performance of SVM-RCE-EC on difficult data sets is likely because the significant clusters, as determined by the ensemble approach, capture the native structure of the data, whereas SVM-RCE leaves that determination to the user. This hypothesis is supported by the observation that the performance of the clusters generated by SVM-RCE-EC is more robust.
Availability: The Matlab version of SVM-RCE-EC is available upon request from the first author and on GitHub