The implication of the gut microbiome in a wide variety of diseases has led to its exploration as a biomarker for diagnostics and prognostics. However, single microbes rarely constitute reliable biomarkers, likely reflecting the microbiome's complexity and extensive cross-feeding. Rather, the biomarker patterns we seek are expressed across the entire community. Supervised machine learning (SML) techniques are well suited to learning these patterns from supplied data, and the resulting models can serve as diagnostic and prognostic tools.
Yet, despite growth in the use of these tools, their operation and best practice remain largely opaque to the biological community. Educational material is lacking in the biological literature, and what exists in the computer science literature relies on ill-explained jargon and complex mathematics.
Our aim here is to present an overview of machine learning that is accessible to biologists. We cover what SML (both prediction and classification) is, and the conceptual differences between various prominent algorithms, e.g. random forests and neural networks. A common pitfall in machine learning is over-fitting, a phenomenon wherein a predictive model learns not only the trends in the supplied data, but also its noise. This harms performance on unseen data encountered after learning, for instance in the clinic, where the noise differs. We explore a best-practice pipeline designed to avoid over-fitting, covering components such as feature selection and generation, cross-validation, and the use of training, validation and test data sets. We also explore the challenges that microbiome data pose for SML.
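As a minimal sketch of why held-out evaluation matters, the toy example below fits a 1-nearest-neighbour classifier to purely random features and labels. The data are synthetic and unrelated to the case study, the function names are hypothetical, and 1-nearest-neighbour is chosen only for brevity. Because every sample is its own nearest neighbour, accuracy on the training data is perfect, whereas cross-validated accuracy on held-out samples is expected to hover around chance: the model has learned only noise.

```python
import random


def nearest_neighbour_predict(train_X, train_y, x):
    """Return the label of the training sample closest to x (1-nearest-neighbour)."""
    best_label, best_dist = None, float("inf")
    for row, label in zip(train_X, train_y):
        dist = sum((a - b) ** 2 for a, b in zip(row, x))  # squared Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def count_correct(train_X, train_y, eval_X, eval_y):
    """Number of evaluation samples whose predicted label matches the true label."""
    return sum(
        nearest_neighbour_predict(train_X, train_y, x) == y
        for x, y in zip(eval_X, eval_y)
    )


def k_fold_accuracy(X, y, k=5):
    """Held-out accuracy pooled over k folds; each sample is tested exactly once."""
    n = len(y)
    correct = 0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th sample forms the test fold
        train_X = [X[i] for i in range(n) if i not in test_idx]
        train_y = [y[i] for i in range(n) if i not in test_idx]
        test_X = [X[i] for i in sorted(test_idx)]
        test_y = [y[i] for i in sorted(test_idx)]
        correct += count_correct(train_X, train_y, test_X, test_y)
    return correct / n


# Synthetic "noise only" data (an assumption for illustration):
# 60 samples, 20 features, with labels independent of the features.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(60)]
y = [rng.randint(0, 1) for _ in range(60)]

train_accuracy = count_correct(X, y, X, y) / len(y)  # evaluated on the training data
cv_accuracy = k_fold_accuracy(X, y, k=5)             # evaluated on held-out folds

print(f"accuracy on training data: {train_accuracy:.2f}")
print(f"5-fold cross-validated accuracy: {cv_accuracy:.2f}")
```

The gap between the two numbers is the signature of over-fitting. In a full pipeline, steps such as feature selection would also be carried out inside each training fold, with a final test set held back until all modelling choices are fixed.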
This exploration of machine learning is illustrated through a clinical weight-loss case study. People meet with varying success on any given dietary intervention, in part, we hypothesise, due to the microbiome. Obese participants were administered one of three diets: a high-protein, Mediterranean, or low-glycemic-index diet. Faecal samples were taken prior to a three-month dietary intervention, and weight loss was recorded thereafter. We predict weight-loss success on a given dietary intervention, aiming to tailor clinical strategy to the individual.