Machine learning is a branch of computer science that has the potential to transform epidemiologic sciences. Amid a growing focus on “Big Data,” it offers epidemiologists new tools to tackle problems for which classical methods are not well-suited. In order to critically evaluate the value of integrating machine learning algorithms with existing methods, however, it is essential to address language and technical barriers between the two fields that can make it difficult for epidemiologists to read and assess machine learning studies.
Here, we provide an overview of the concepts and terminology used in machine learning literature, which encompasses a diverse set of tools with goals ranging from prediction to classification to clustering. We provide a brief introduction to 5 common machine learning algorithms and 4 ensemble-based approaches. We then summarize epidemiologic applications of machine learning techniques in the published literature. We recommend approaches to incorporate machine learning in epidemiologic research and discuss opportunities and challenges for integrating machine learning and existing epidemiologic research methods.
Machine learning is a branch of computer science that broadly aims to enable computers to “learn” without being directly programmed. It has origins in the artificial intelligence movement of the 1950s and emphasizes practical objectives and applications, particularly prediction and optimization. Computers “learn” in machine learning by improving their performance at tasks through “experience” (2, p. xv). In practice, “experience” usually means fitting to data; hence, there is not a clear boundary between machine learning and statistical approaches.
Indeed, whether a given methodology is considered “machine learning” or “statistical” often reflects its history as much as genuine differences, and many algorithms (e.g., least absolute shrinkage and selection operator (LASSO), stepwise regression) may or may not be considered machine learning depending on whom you ask. Still, despite methodological similarities, machine learning is philosophically and practically distinguishable. At the risk of (considerable) oversimplification, machine learning generally emphasizes predictive accuracy over hypothesis-driven inference, usually focusing on large, high-dimensional (i.e., having many covariates) data sets. Regardless of the precise distinction between approaches, in practice, machine learning offers epidemiologists important tools. In particular, a growing focus on “Big Data” emphasizes problems and data sets for which machine learning algorithms excel while more commonly used statistical approaches struggle.
This primer provides a basic introduction to machine learning with the aim of providing readers a foundation for critically reading studies based on these methods and a jumping-off point for those interested in using machine learning techniques in epidemiologic research. The “Concepts and Terminology” section of this paper presents concepts and terminology used in the machine learning literature.
The “Machine Learning Algorithms” section provides a brief introduction to 5 common machine learning algorithms: artificial neural networks, decision trees, support vector machines, naïve Bayes, and k-means clustering. These are important and commonly used algorithms that epidemiologists are likely to encounter in practice, but they are by no means comprehensive of this large and highly diverse field. The following two sections, “Ensemble Methods” and “Epidemiologic Applications,” extend this examination to ensemble-based approaches and epidemiologic applications in the published literature. “Brief Recommendations” provides some recommendations for incorporating machine learning into epidemiologic practice, and the last section discusses opportunities and challenges.
For epidemiologists seeking to integrate machine learning techniques into their research, language and technical barriers between the two fields can make reading source materials and studies challenging. Some machine learning concepts lack statistical or epidemiologic parallels, and machine learning terminology often differs even where the underlying concepts are the same. Here we briefly review basic machine learning principles and provide a glossary of machine learning terms and their statistical/epidemiologic equivalents.
Supervised, unsupervised, and semisupervised learning
Machine learning is broadly classifiable by whether the computer’s learning (i.e., model-fitting) is “supervised” or “unsupervised.” Supervised learning is akin to the type of model-fitting that is standard in epidemiologic practice: The value of the outcome (i.e., the dependent variable), often called its “label” in machine learning, is known for each observation. Data with specified outcome values are called “labeled data.” Common supervised learning techniques include standard epidemiologic approaches such as linear and logistic regression, as well as many of the most popular machine learning algorithms (e.g., decision trees, support vector machines).
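Supervised learning can be sketched in just a few lines of code. The example below is a deliberately minimal illustration, not any particular published algorithm: a nearest-class-mean classifier fitted to a hypothetical set of labeled BMI values (all data values are invented for illustration). The “supervision” is the known outcome label attached to each training observation.

```python
# Minimal supervised-learning sketch: labeled (covariate, outcome) pairs.
# Data values and the 0/1 outcome are hypothetical, for illustration only.
labeled = [(21.0, 0), (23.5, 0), (24.0, 0), (29.0, 1), (31.5, 1), (33.0, 1)]

def fit_centroids(data):
    """'Learn' from labeled data: the mean covariate value for each class."""
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(centroids, x):
    """Assign a new observation to the class with the closest learned mean."""
    return min(centroids, key=lambda y: abs(x - centroids[y]))

centroids = fit_centroids(labeled)
print(predict(centroids, 22.0))  # falls nearer the class-0 mean -> 0
print(predict(centroids, 32.0))  # falls nearer the class-1 mean -> 1
```

The key point is the division of labor: a fitting step that uses the labels, followed by a prediction step that does not.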
In unsupervised learning, the algorithm attempts to identify natural relationships and groupings within the data without reference to any outcome or the “right answer.” Unsupervised learning approaches share similarities in goals and structure with statistical approaches that attempt to identify unspecified subgroups with similar characteristics (e.g., “latent” variables or classes). Clustering algorithms, which group observations on the basis of similar data characteristics (e.g., both oranges and beach balls are round), are common unsupervised learning implementations. Examples include k-means clustering and expectation-maximization clustering using Gaussian mixture models.
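To make the contrast with supervised learning concrete, the sketch below implements k-means clustering (Lloyd's algorithm) on a toy one-dimensional data set. Note that the input is a bare list of values with no outcome labels anywhere; the data, the choice of k = 2, and the starting centers are illustrative assumptions.

```python
# k-means clustering (unsupervised): observations are grouped purely by
# similarity; no outcome labels are used. Data and k = 2 are illustrative.
def kmeans_1d(points, centers, n_iter=10):
    """Lloyd's algorithm on 1-D data: assign each point to its nearest
    center, then move each center to the mean of its assigned points."""
    for _ in range(n_iter):
        clusters = {i: [] for i in range(len(centers))}
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(v) / len(v) if v else centers[i]
                   for i, v in clusters.items()]
    return centers, clusters

points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centers, clusters = kmeans_1d(points, centers=[0.0, 10.0])
print(centers)  # the two centers migrate toward the two natural groupings
```

In practice the number of clusters is unknown and must be chosen by the analyst, which is one reason interpreting unsupervised results requires care.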
Semisupervised learning fits models to both labeled and unlabeled data. Labeling data (i.e., outcomes) is often time-consuming and expensive, particularly for large data sets. Semisupervised learning supplements limited labeled data with an abundance of unlabeled data, with the goal of improving model performance; studies show that unlabeled data can help build a better classifier, but appropriate model selection is critical. For example, in a study of Web page classification, Nigam et al. fit a naïve Bayes classifier to labeled data and then used that classifier to probabilistically label unlabeled observations (i.e., fill in missing outcome data). They then trained a new classifier on the resulting, fully labeled data set, achieving a 30% increase in Web page classification accuracy on data outside of the training set. Semisupervised learning bears some similarity to statistical approaches for missing data and censoring (e.g., multiple imputation), but it focuses on imputing missing outcomes rather than missing covariates.
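The self-training pattern described above (fit to labeled data, impute labels for unlabeled data, retrain on everything) can be sketched as follows. This is not the Nigam et al. implementation; it substitutes a simple nearest-class-mean classifier, and all data values are invented for illustration.

```python
# Self-training sketch of semisupervised learning: scarce labeled data,
# abundant unlabeled data. Classifier and data values are illustrative.
def class_means(data):
    """Mean covariate value per class in (x, y) pairs."""
    by_class = {}
    for x, y in data:
        by_class.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in by_class.items()}

def predict(means, x):
    return min(means, key=lambda y: abs(x - means[y]))

labeled = [(1.0, 0), (2.0, 0), (8.0, 1)]    # scarce labeled observations
unlabeled = [1.5, 2.5, 7.0, 7.5, 9.0]       # abundant unlabeled observations

means = class_means(labeled)                           # 1: fit to labeled data
pseudo = [(x, predict(means, x)) for x in unlabeled]   # 2: impute the labels
means = class_means(labeled + pseudo)                  # 3: retrain on all data
```

Step 2 is where this parallels multiple imputation, except that what is imputed is the outcome rather than a covariate.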
Classification versus regression algorithms
Within the domain of supervised learning, machine learning algorithms can be further divided into classification or regression applications, depending upon the nature of the response variable. In general, in the machine learning literature, classification refers to prediction of categorical outcomes, while regression refers to prediction of continuous outcomes. We use this terminology throughout this primer and are explicit when referring to specific regression algorithms (e.g., logistic regression). Many machine learning algorithms that were developed to perform classification have been adapted to also address regression problems, and vice versa.
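The point that one algorithm can serve both tasks can be shown with a deliberately tiny example: a 1-nearest-neighbor rule (chosen here only for brevity; the data are invented) performs classification when the outcome is categorical and regression when it is continuous, with no change to the algorithm itself.

```python
# The same rule handles classification and regression; only the outcome
# type differs. Training data below are illustrative.
def nearest(train, x):
    """Return the outcome of the training point whose covariate is closest to x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

# classification: categorical outcome
clf_data = [(1.0, "low"), (5.0, "high")]
print(nearest(clf_data, 1.4))   # -> "low"

# regression: continuous outcome, same rule
reg_data = [(1.0, 10.0), (5.0, 50.0)]
print(nearest(reg_data, 1.4))   # -> 10.0
```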
Generative versus discriminative algorithms
Machine learning algorithms, both supervised and unsupervised, can be discriminative or generative (11, 12). Discriminative algorithms directly model the conditional probability of an outcome, Pr(y|x) (the probability of y given x), in a set of observed data—for example, the probability that a subject has type 2 diabetes mellitus given a certain body mass index (BMI; weight (kg)/height (m)2). Most statistical approaches familiar to epidemiologists (e.g., linear and logistic regression) are discriminative, as are most of the algorithms discussed in this primer.
In contrast, while generative algorithms can also compute the conditional probability of an outcome, this computation occurs indirectly. Generative algorithms first model the joint probability distribution, Pr(x, y) (the probabilities associated with all possible combinations of x and y), or, continuing our example, a probabilistic model that accounts for all observed combinations of BMIs and diabetes outcomes (Table 2). This joint probability distribution can be transformed into a conditional probability distribution in order to classify data, as Pr(y|x) = Pr(x, y)/Pr(x). Because the joint probability distribution models the underlying data-generating process, generative models can also be used, as their name suggests, for directly generating new simulated data points reflecting the distribution of the covariates and outcome in the modeled population (11). However, because they model the full joint distribution of outcomes and covariates, generative models are generally more complex and require more assumptions to fit than discriminative algorithms (12, 13). Examples of generative algorithms include naïve Bayes and hidden Markov models.
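Both uses of a generative model, deriving Pr(y|x) = Pr(x, y)/Pr(x) and simulating new data, can be demonstrated directly. The joint probabilities below are made up for illustration (they are not the values from Table 2), with BMI reduced to a binary indicator for simplicity.

```python
import random

# A joint distribution Pr(x, y) over a binary BMI indicator and diabetes
# status. All probability values are invented for illustration.
joint = {("high_bmi", "diabetes"): 0.15, ("high_bmi", "no_diabetes"): 0.25,
         ("low_bmi", "diabetes"): 0.05, ("low_bmi", "no_diabetes"): 0.55}

def conditional(joint, x):
    """Derive Pr(y | x) from the joint via Pr(x, y) / Pr(x)."""
    px = sum(p for (xi, _), p in joint.items() if xi == x)  # marginal Pr(x)
    return {y: p / px for (xi, y), p in joint.items() if xi == x}

print(conditional(joint, "high_bmi"))  # {'diabetes': 0.375, 'no_diabetes': 0.625}

# Because the full joint is modeled, new observations can be simulated:
sample = random.choices(list(joint), weights=list(joint.values()), k=5)
```

A discriminative model would estimate the conditional line directly and could not produce the final, simulated sample.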
Reinforcement learning
In reinforcement learning, systems learn to excel at a task over time through trial and error (14). Reinforcement learning techniques take an iterative approach to learning by obtaining positive or negative feedback based on performance of a given task on some data (whether prediction, classification, or another action) and then self-adapting and attempting the task again on new data (though old data may be reencountered) (15). Depending on how it is implemented, this approach can be akin to supervised learning, or it may represent a semisupervised approach (as in generative adversarial neural networks (16)).
Reinforcement learning algorithms often optimize the use of early, “exploratory” versions of a model—that is, task attempts—that perform poorly to gain information to perform better on future attempts, and then become less labile as the model “learns” more (15). Medical and epidemiologic applications of reinforcement learning have included modeling the effect of sequential clinical treatment decisions on disease progression (17) (e.g., optimizing first- and second-line therapy decisions for schizophrenia management (18)) and personalized, adaptive medication dosing strategies. For example, Nemati et al. (19) used reinforcement learning with artificial neural networks in a cohort of intensive-care-unit patients to develop individualized heparin dosing strategies that evolve as a patient’s clinical phenotype changes, in order to maximize the amount of time that blood drug levels remain within the therapeutic window.
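The trial-and-error loop, with exploration traded off against exploitation, can be sketched with a toy "bandit" learner. This is not the model of Nemati et al.; it is a minimal epsilon-greedy sketch in which an agent repeatedly chooses between two hypothetical dose options, receives reward feedback, and updates its value estimates. The reward probabilities are invented and hidden from the learner.

```python
import random

# Epsilon-greedy bandit sketch of reinforcement learning. "Doses" and
# their reward probabilities are hypothetical and unknown to the learner.
random.seed(0)
reward_prob = {"dose_a": 0.2, "dose_b": 0.8}   # true environment (hidden)
value = {a: 0.0 for a in reward_prob}           # learned value estimates
counts = {a: 0 for a in reward_prob}

for t in range(2000):
    if random.random() < 0.1:                         # explore 10% of the time
        action = random.choice(list(reward_prob))
    else:                                             # otherwise exploit the
        action = max(value, key=value.get)            # current best estimate
    reward = 1.0 if random.random() < reward_prob[action] else 0.0
    counts[action] += 1
    value[action] += (reward - value[action]) / counts[action]  # running mean

best = max(value, key=value.get)
print(best, value)
```

Early attempts perform poorly but generate the feedback that steers later attempts, which is the defining feature the paragraph above describes; schemes that shrink the exploration rate over time make the model progressively less labile.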
Artificial neural networks
Artificial neural networks (ANNs) are inspired by the signaling behavior of neurons in biological neural networks. ANNs, which consist of a population of neurons interconnected through complex signaling pathways, use this structure to analyze complex interactions between a group of measurable covariates in order to predict an outcome. ANNs possess layers of “neurons” connected by “axons” (20) (Figure 1A). These layers are grouped into 1) an input layer, 2) one or more middle “hidden” layers, and 3) an output layer. The neurons in the input and output layers correspond to the independent and dependent variables, respectively. Neurons in adjacent layers communicate with each other through activation functions, which convert the weighted sum of a neuron’s inputs into an output (Figure 1B). Depending on the type of activation function, the output can be dichotomous (“1” when the weighted sum exceeds a given threshold and “0” otherwise) or continuous. The weighted sum of a neuron’s inputs is somewhat analogous to coefficients in linear or logistic regression.
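A single forward pass through a tiny network makes the layered structure concrete. The sketch below uses two input neurons, two hidden neurons, and one output neuron with a logistic (sigmoid) activation; all weights, biases, and input values are arbitrary illustrative numbers, and no training (weight fitting) is shown.

```python
import math

# Forward pass through a minimal ANN: input layer (2 covariates), one
# hidden layer (2 neurons), output layer (1 neuron). Weights are arbitrary.
def sigmoid(z):
    """Logistic activation: maps a weighted sum to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of inputs (analogous to regression coefficients),
    passed through the activation function."""
    return sigmoid(sum(x * w for x, w in zip(inputs, weights)) + bias)

x = [0.5, 1.5]                                  # input layer: covariate values
hidden = [neuron(x, [0.8, -0.4], 0.1),          # hidden layer, neuron 1
          neuron(x, [-0.3, 0.9], -0.2)]         # hidden layer, neuron 2
output = neuron(hidden, [1.2, -0.7], 0.05)      # output neuron: value in (0, 1)
print(output)
```

With a step-function activation in place of the sigmoid, each neuron would instead emit the dichotomous “1”/“0” output described above; training consists of adjusting the weights so the final output tracks the observed outcome.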