The completion of the human genome sequence and advances in high-throughput technologies have enabled the quantification of the expression of thousands of genes for precision medicine. The problem with gene expression data is that the number of genes (variables) greatly exceeds the number of samples, rendering standard statistical models, which typically require more samples than variables, unsatisfactory. Several machine learning algorithms (functions) have therefore been proposed in the literature to tackle this problem of small sample size relative to the number of variables, commonly referred to as the “curse of dimensionality”.
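The failure of standard models when variables outnumber samples can be illustrated with a minimal sketch (the sample and variable counts below are illustrative, not taken from any dataset in this thesis): when p > n, ordinary least squares can fit pure noise perfectly, so the fitted model carries no predictive value.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 1000                 # far more "genes" (p) than samples (n)
X = rng.normal(size=(n, p))     # expression matrix of pure noise
y = rng.normal(size=n)          # outcome with no real signal

# Minimum-norm least-squares solution via the pseudoinverse.
# With p > n and X of full row rank, an exact fit always exists.
beta = np.linalg.pinv(X) @ y
residuals = y - X @ beta

# The noise outcome is reproduced essentially perfectly (overfitting).
print(np.max(np.abs(residuals)))
```

The near-zero residuals show that a conventional regression fit is meaningless in this regime, which is why specialized algorithms are needed.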
Nevertheless, there is no universal algorithm that performs best on all gene expression datasets. The varying performance of these algorithms across datasets is a clear indication that certain data characteristics are associated with algorithm performance. To determine an optimal function for a given gene expression dataset, several algorithms are often compared and the one with the smallest cross-validated error is chosen. This approach, however, introduces selection bias, because an algorithm may achieve the smallest cross-validated error purely by chance. To combat this, a number of selection-bias correction methods have been proposed, but no such method is guaranteed to be effective when several suboptimal functions are compared on a dataset with a small sample size.
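The selection bias described above can be demonstrated with a small simulation (the sample size and number of competing algorithms are arbitrary choices for illustration): when many algorithms are compared on labels that contain no signal, the “winner” still appears to beat chance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 40                          # small sample, as in many expression studies
y = rng.integers(0, 2, n)       # class labels with no real signal

# Compare 50 "algorithms": each is just a random predictor,
# standing in for models whose apparent skill is pure luck.
errors = [
    float(np.mean(np.random.default_rng(seed).integers(0, 2, n) != y))
    for seed in range(50)
]

print(f"average error of one algorithm: {np.mean(errors):.2f}")   # near 0.50
print(f"error of the selected 'best':   {min(errors):.2f}")        # well below 0.50
```

The minimum error across the candidates is noticeably smaller than 0.5 even though no predictor has any skill, which is exactly the optimism that selection-bias correction methods try to remove.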
Alternative approaches combine the predictions of several algorithms (super-learners) with the goal of improving predictive performance. Such approaches are rarely accepted in medical applications because they are often considered black boxes: their models are hard to interpret, and they utilize the entire genome instead of a selected profile, making practical application time-consuming and costly. Hence, traditional algorithms that can perform variable selection, yielding a gene signature and an interpretable model, are often preferred over super-learners; but the question of which of these algorithms is optimal for a given dataset remains unanswered.
In this thesis, we have identified gene expression data characteristics that are associated with the performance of commonly used traditional machine learning algorithms, using publicly available microarray gene expression data. With the identified data characteristics, we systematically varied the variables and assessed their effects on the performance of these functions using simulations. Additionally, we analyzed our simulation results to provide predictive models for selecting an optimal algorithm for diagnostic or prognostic analysis on any given dataset, with little or no bias. Application of our models to several real-life gene expression datasets showed high correlations between the predicted and actually achieved performance of the functions. One such model was used to select an optimal algorithm that was subsequently utilized to identify and validate prognostic biomarkers for disease severity in respiratory syncytial virus (RSV) infected infants. The identified 84-gene signature might serve as the basis for the management of RSV disease in pediatric wards.