© 2003 SAGE Publications
Dental Data Mining: Potential Pitfalls and Practical Issues
Presented at "Dental Informatics & Dental Research: Making the Connection", a conference held in Bethesda, MD, USA, June 12–13, 2003, sponsored by the University of Pittsburgh Center for Dental Informatics and supported in part by award 1R13DE014611-01 from the National Institute of Dental and Craniofacial Research/National Library of Medicine.
Center for Health and Community, Center to Address Disparities in Children's Oral Health, Department of Preventive and Restorative Dental Sciences, Division of Oral Epidemiology and Dental Public Health, University of California, San Francisco, CA 94143-1361, USA; sgansky{at}itsa.ucsf.edu
Knowledge Discovery and Data Mining (KDD) have become popular buzzwords. But what exactly is data mining? What are its strengths and limitations? Classic regression, artificial neural network (ANN), and classification and regression tree (CART) models are common KDD tools. Some recent reports (e.g., Kattan et al., 1998) show that ANN and CART models can perform better than classic regression models: CART models excel at covariate interactions, while ANN models excel at nonlinear covariates. Model prediction performance is examined with validation procedures and by evaluating concordance, sensitivity, specificity, and likelihood ratios. To aid interpretation, various plots of predicted probabilities are used, such as lift charts, receiver operating characteristic curves, and cumulative captured-response plots. A dental caries study is used as an illustrative example. This paper compares the performance of logistic regression with the KDD methods of CART and ANN in analyzing data from the Rochester caries study. With careful analysis, such as validation with sufficient sample size and the use of proper competitors, problems of naïve KDD analyses (Schwarzer et al., 2000) can be avoided.
Key Words: models, statistical; decision support techniques; neural networks (computer); dental caries; oral health
Informatics, in general, and dental informatics, in particular, are disciplines encompassing a variety of research areas, from molecular biology to library science to public health surveillance. Many dental informatics application areas utilize knowledge discovery and data mining (KDD): the semi-automatic discovery of patterns, associations, anomalies, and statistically significant structures in data (Fayyad et al., 1996, p. 6). KDD operates at the intersection of artificial intelligence, machine learning, computer science, engineering, and statistics. KDD has been named a Top Ten emerging technology that will change the world (Waldrop, 2001). However, KDD is not alchemy: it does not turn lead into gold (i.e., bad data or flawed study designs into incredible, novel insights). Rather, KDD is a discipline using modern computing tools to solve problems. In current business applications, KDD touches lives daily when customers swipe supermarket savings cards, sending buying habits to data warehouses. This gave retailers the (apocryphal?) data mining discovery of diapers and beer sharing men's late-night supermarket baskets. In the future, similar encounters in clinicians' offices might collect health information in data warehouses (subject to patient confidentiality protections), which can be mined to identify at-risk patients and better treatment modalities. Such possibilities are gradually becoming reality (e.g., Page et al., 2002).
Some potential oral health applications for KDD include: large surveys (e.g., NHANES), longitudinal cohort studies (e.g., Veterans Administration Longitudinal Study on Aging), disease registries (e.g., National Cancer Institute's Surveillance, Epidemiology and End Results [SEER] program; birth defects registry; craniofacial treatment outcomes registry), health services research (e.g., claims data, fraud detection), provider and workforce databases, digital diagnostics (e.g., radiology, microbiology), and molecular biology (e.g., polymerase chain reactions, microarrays).
KDD learning methods (see the Glossary in the Appendix) can be unsupervised (grouping into similar, heretofore undetermined, classes based on similarities) or supervised (prediction using already-determined classes, such as disease status). Unsupervised methods include hierarchical cluster analysis and k-means; supervised methods include regression, tree models (e.g., classification and regression trees [CART], boosting, bagging, and ensemble methods), multivariate adaptive regression splines, artificial neural networks (ANNs), support vector machines, and random forests (Hastie et al., 2001). In oral health research, CART was used to predict caries (Stewart and Stamm, 1991). ANN was used in clinical decision-making for third molar extraction (Brickley and Shepherd, 1996, 1997; Brickley et al., 1998; Goodey et al., 2000), oral cancer risk assessment (Speight et al., 1995), predicting dental age from photomicrographs (Amariti et al., 2000), predicting growth classification from lateral cephalograms (Lux et al., 1998), and assessing correlations of tooth enamel chemical elements (Nilsson et al., 1996). However, KDD methods have rarely been compared with one another in oral health studies; compared with regression models, ANN might better identify non-linearities, while CART may better find interactions (Kattan et al., 1998). Logistic regression models linear relationships between predictors (inputs) and the log odds of a binary response (output) (e.g., Harrell, 2001). The binary logit model can be written as:
$$\operatorname{logit}(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_P x_{Pi} + e_i$$
where $x_{pi}$ is the value of the $p$-th predictor (input) for the $i$-th person.
CART models (e.g., Stewart and Stamm, 1991; Hastie et al., 2001) adapt well to fit interactions, since they group individuals with similar probabilities of caries (to produce terminal nodes with the highest purity or homogeneity of outcome classes). Unlike logit models, CART models are robust to outliers and do not require specific data transformations or hierarchical interaction specification. CART models are step-function-type likelihood approximations (analogous to Riemann sums approximating integrals); these models are highly interpretable for easy clinician use. ANNs, extremely flexible weighted combinations of non-linear functions, use a hidden layer with hidden units/nodes/neurons and activation functions to link inputs to the hidden layer and from the hidden layer to outputs. A feed-forward or multilayer perceptron ANN is:
$$g_0^{-1}\left(E[y_i]\right) = w_0 + \sum_{r=1}^{R} w_r H_r$$
where
$$H_r = \tanh\left(w_{0r} + \sum_{p=1}^{P} w_{pr} x_{pi}\right)$$
with $r = 1, 2, \ldots, R$ indexing the neurons, $H_r$ denoting the $r$-th neuron, $w_{pr}$ denoting the coefficients of the $p$-th input $x_{pi}$ for the $r$-th neuron, $w_r$ denoting the weight of the $r$-th neuron in the output, $w_0$ and $w_{0r}$ denoting constants, and $g_0^{-1}$ denoting the inverse activation function (in this case, $\tanh^{-1}$). In ANN terminology (Schwarzer et al., 2000), a P-R-S model has P inputs (predictors), 1 hidden layer with R neurons, and S outputs (outcomes). Neurons are a function of weighted sums of inputs plus a constant ("bias"), $w_{0r}$. Similarly, outputs are a function of weighted sums of neurons plus a bias, $w_0$; logistic and hyperbolic tangent functions are common activation functions. (Logistic regression is a P-0-1 feed-forward ANN with logistic activation function.) Weight decay, a penalty on model complexity added to the optimization criterion, can be used to guard against overfitting. In a simulation study of a 1-15-1 ANN, weight decays of 0, 0.002, and 0.005 were used, with 0.005 stabilizing prediction (Schwarzer et al., 2000). Varying random seeds, R, and weight decays helps stabilize global optimization (Ripley, 1996). ANNs are iteratively optimized with training data, and the final model is then applied to validation data so that future performance can be assessed. Training estimates weights, but they have no clear interpretation; thus, ANNs have very poor interpretability. Since ANNs with large R can fit any arbitrary surface, care is needed not to overfit the training data. Common mistakes with ANN are: too many parameters for the sample size, not using validation, not using a model complexity penalty, incorrect misclassification estimation, implausible probability functions, incorrectly described network complexity, inadequate flexible statistical competitors (e.g., CART), and insufficient comparisons with statistical competitors (e.g., receiver operating characteristic curves) (Schwarzer et al., 2000).
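The structure of a P-R-1 feed-forward network can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study: it assumes tanh hidden neurons and a logistic output (so that the output is a probability), it takes the weight decay penalty to be the decay constant times the sum of squared weights (a common form), and every weight and input value below is hypothetical.

```python
import math


def logistic(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a P-R-1 feed-forward network for one record.

    x              : list of P inputs
    hidden_weights : R lists of P weights (one list per neuron)
    hidden_biases  : R constants ("biases"), one per neuron
    output_weights : R weights linking neurons to the single output
    output_bias    : constant ("bias") of the output
    """
    hidden = [
        math.tanh(bias + sum(w * xp for w, xp in zip(weights, x)))
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return logistic(output_bias + sum(w * h for w, h in zip(output_weights, hidden)))


def logit_model(x, betas, intercept):
    """P-0-1 special case: no hidden layer, i.e., logistic regression."""
    return logistic(intercept + sum(b * xp for b, xp in zip(betas, x)))


def weight_decay_penalty(weights, decay):
    """Assumed penalty form: decay times the sum of squared weights."""
    return decay * sum(w * w for w in weights)


# Hypothetical 5-2-1 network; all weights and inputs are made up for illustration.
x = [4.2, 3.1, 0.05, 1.2, 4.8]
hidden_weights = [[0.3, 0.2, -1.0, 0.1, 0.0], [-0.1, 0.4, 0.5, -0.2, 0.1]]
hidden_biases = [-1.5, -0.5]
output_weights = [1.2, 0.8]
output_bias = -1.0

p_hat = forward(x, hidden_weights, hidden_biases, output_weights, output_bias)
all_weights = [w for row in hidden_weights for w in row] + output_weights
penalty = weight_decay_penalty(all_weights, decay=0.005)
print(f"predicted probability: {p_hat:.3f}, weight-decay penalty: {penalty:.4f}")
```

In a fitting routine, the penalty would be added to the criterion being minimized, so that larger decay values shrink the weights and smooth the fitted surface.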
Although KDD is probably best known for analytic algorithms, KDD is an iterative process with the following steps regarding data: collect and store, pre-process, analyze, validate, and implement (Fig. 1). Validation can use split-sample (if sample size > 3000), cross-validation (if sample size < 3000), bootstrap (resampling with replacement), or jackknife (leave-one-out) methods. Finally, implementation could involve changes in the KDD process, new clinical interventions, or changes in health policy.
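The two resampling schemes named among the validation methods can be illustrated with a short Python sketch. The outcome vector is hypothetical, the function names are illustrative only, and the statistic being re-computed (a simple prevalence) stands in for whatever model would be re-fitted in practice.

```python
import random


def bootstrap_resamples(data, n_resamples, seed=0):
    """Draw resamples of the original size, with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [[data[rng.randrange(n)] for _ in range(n)] for _ in range(n_resamples)]


def jackknife_samples(data):
    """Leave each observation out once, giving n samples of size n - 1."""
    return [data[:i] + data[i + 1:] for i in range(len(data))]


def mean(values):
    return sum(values) / len(values)


# Hypothetical binary outcomes (1 = caries, 0 = no caries).
outcomes = [0, 1, 0, 0, 1, 1, 0, 0, 0, 1]

boot_means = [mean(sample) for sample in bootstrap_resamples(outcomes, 500)]
jack_means = [mean(sample) for sample in jackknife_samples(outcomes)]

print(f"original prevalence: {mean(outcomes):.2f}")
print(f"bootstrap range: {min(boot_means):.2f} to {max(boot_means):.2f}")
print(f"jackknife range: {min(jack_means):.2f} to {max(jack_means):.2f}")
```

The spread of the re-computed statistic across resamples gives a rough sense of how stable the original estimate is.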
Data quality and study design issues remain paramount: limitations inherent in study designs remain when KDD is used. For example, a tooth implant registry cannot examine the impact of bone quality on implant failure if bone quality is not measured well (or at all). Similarly, a cross-sectional study of stress and temporomandibular joint disorders still assesses only associations, not causality. Goals of this paper are to demystify knowledge discovery and data mining (KDD) by explaining the process, to identify possible pitfalls and practical issues, and to compare the performance of KDD methods (logit, CART, and ANN) in analyzing Rochester caries study data.
Rochester Caries Study
In upstate New York (the Rochester and Finger Lakes areas), first- and second-graders, caries-free at baseline, had stimulated saliva collected and dental exams without radiographs performed every six months for up to 6 years for a larger longitudinal caries risk assessment study (Billings et al., 2003); this was a follow-up to the same research team's earlier cross-sectional investigation (Leverett et al., 1993a) and two-year longitudinal study (Leverett et al., 1993b). This example analysis predicts primary tooth caries (output) according to a subset of predictors (inputs) which may have non-linearity or interactions. Salivary assays assessed mutans streptococci (MS) and lactobacillus (LB) levels (colony-forming units per milliliter, CFU/mL), fluoride (F), calcium (Ca), and phosphate (P) levels. Data for 466 children with 2 years of follow-up were analyzed with input variables selected on the basis of published discriminant analysis models (Leverett et al., 1993b): log10 MS, log10 LB, F (parts per million, ppm), Ca (millimole per liter, mmol/L), and P (mmol/L). The output (response) variable was caries incidence (at least one decayed or filled surface) on primary teeth at 24 months of follow-up. Earlier analyses showed 18-month measures to be more predictive of 24-month caries incidence than baseline, six-month, or 12-month measures.
KDD methods
Training and validation were performed with a 70%/30% randomly split sample stratified on primary dentition caries. All methods used the same training data to develop the prediction models and the hold-out (not used to develop the models) validation data to score or validate the models. Additionally, five-fold cross-validation [CV(5)] was performed, randomly forming 5 groups leading to 5 analyses, each with 4/5 of the total data (i.e., each 5th was left out of one analysis). Results were then aggregated to calculate mean square error (MSE), also called the Brier score (B), between observed and expected output:
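$$B = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{\pi}_i\right)^2$$
where $n$ is the sample size, $y_i$ is the observed output, and $\hat{\pi}_i$ is the expected output (predicted probability) for the $i$-th person.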
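A stratified 70%/30% split, five-fold cross-validation, and the Brier score can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure actually run on the Rochester data: the labels are hypothetical, the function names are invented here, and the "model" is a placeholder that simply predicts the training-sample prevalence for every held-out child.

```python
import random


def brier_score(observed, predicted):
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    return sum((y - p) ** 2 for y, p in zip(observed, predicted)) / len(observed)


def stratified_split(labels, train_fraction=0.7, seed=0):
    """Return (train_indices, validation_indices), split within each outcome class."""
    rng = random.Random(seed)
    train, valid = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(train_fraction * len(idx))
        train.extend(idx[:cut])
        valid.extend(idx[cut:])
    return sorted(train), sorted(valid)


def k_fold_indices(n, k=5, seed=0):
    """Randomly partition indices 0..n-1 into k mutually exclusive folds."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    return [sorted(idx[fold::k]) for fold in range(k)]


# Hypothetical outcomes: 6 children with caries, 14 without.
labels = [1] * 6 + [0] * 14

train_idx, valid_idx = stratified_split(labels)

# CV(5): each fold is held out once; squared errors are pooled across folds.
squared_errors = []
for held_out in k_fold_indices(len(labels), k=5):
    train = [i for i in range(len(labels)) if i not in held_out]
    prevalence = sum(labels[i] for i in train) / len(train)  # placeholder model
    squared_errors.extend((labels[i] - prevalence) ** 2 for i in held_out)

cv_brier = sum(squared_errors) / len(squared_errors)
print(f"train/validation sizes: {len(train_idx)}/{len(valid_idx)}")
print(f"CV(5) Brier score: {cv_brier:.3f}, root MSE: {cv_brier ** 0.5:.3f}")
```

Replacing the placeholder with a fitted logit, CART, or ANN model would give cross-validated scores that are directly comparable across methods.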
Visualization
Cumulative captured-response curves are similar to ROC curves, but graph sensitivity vs. the percent testing positive (identified as high-risk). Thus, sensitivity for KDD methods can be compared for a specific percent-positive cut-off, which may be useful when resources for those labeled high-risk might be limited. A related graph is the lift chart, which displays the gain each KDD method has over baseline vs. the percent testing positive. To visualize the input contribution, we divided ANN predicted probabilities into quintiles (fifths) and showed the distributions of the standardized predictors in each quintile via boxplots.
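One point on a cumulative captured-response curve and on a lift chart can be computed as in the following Python sketch. The data are hypothetical, and lift is taken here to be the ratio of the event rate among those flagged to the overall event rate, which is one common definition and an assumption on our part.

```python
def captured_response_and_lift(observed, predicted, fraction):
    """Flag the top `fraction` of records by predicted risk as positive.

    Returns (captured, lift):
      captured = share of all events found among the flagged records (sensitivity)
      lift     = event rate among flagged records / overall event rate
    """
    n = len(observed)
    order = sorted(range(n), key=lambda i: predicted[i], reverse=True)
    n_flagged = max(1, round(fraction * n))
    flagged = order[:n_flagged]
    events_flagged = sum(observed[i] for i in flagged)
    total_events = sum(observed)
    captured = events_flagged / total_events
    lift = (events_flagged / n_flagged) / (total_events / n)
    return captured, lift


# Hypothetical outcomes (1 = caries) and predicted probabilities for 10 children.
observed = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
predicted = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1, 0.5, 0.05]

for fraction in (0.1, 0.3, 0.5, 1.0):
    captured, lift = captured_response_and_lift(observed, predicted, fraction)
    print(f"top {fraction:.0%}: captured {captured:.2f}, lift {lift:.2f}")
```

Plotting the captured share against the percent flagged for each method traces out the curves being compared; at 100% flagged, every method captures all events and has a lift of 1.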
Logistic regression yielded a model with only the two bacterial level variables as significant predictors: the log10 MS odds ratio (OR) was 1.27 (95% confidence interval [CI], 1.10–1.46), while the log10 LB OR was 1.36 (95% CI, 1.19–1.57), meaning that each log10 increase in MS related to 27% greater odds of having primary tooth caries 6 months later, and each log10 increase in LB related to 36% greater odds of carious primary dentition.
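Because an odds ratio multiplies odds rather than probabilities, the corresponding change in probability depends on where a child starts. The short Python sketch below works this through for the reported log10 MS odds ratio of 1.27; the baseline probabilities are hypothetical values chosen only for illustration.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def prob(o):
    """Convert odds back to a probability."""
    return o / (1.0 + o)


def shift_probability(baseline_p, odds_ratio):
    """Probability after multiplying the baseline odds by an odds ratio."""
    return prob(odds(baseline_p) * odds_ratio)


OR_MS = 1.27  # log10 MS odds ratio reported in the text

# Hypothetical baseline probabilities of caries.
for baseline in (0.10, 0.30, 0.50):
    new_p = shift_probability(baseline, OR_MS)
    relative = (new_p / baseline - 1.0) * 100
    print(f"baseline {baseline:.2f} -> {new_p:.3f} ({relative:.1f}% relative change)")
```

The relative change in probability is always smaller than 27% and shrinks as the baseline probability grows, which is why the odds ratio should be read on the odds scale.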
The resultant training classification tree is presented in Fig. 2.
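The recursive partitioning behind such a tree repeats one basic step: searching an input for the binary split that leaves the purest child nodes. The Python sketch below shows that single step under stated assumptions: it uses the Gini index as the purity measure (one common choice; the study's software may have used another criterion), and the input values and outcomes are hypothetical.

```python
def gini(labels):
    """Gini impurity of a node with binary (0/1) outcomes; 0 means pure."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Find the threshold on one input that minimizes weighted child impurity.

    Returns (threshold, impurity); records with value <= threshold go left.
    threshold is None if no split improves on the unsplit node.
    """
    n = len(values)
    best_threshold, best_impurity = None, gini(labels)
    # The largest value is skipped so that the right child is never empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best_impurity:
            best_threshold, best_impurity = threshold, impurity
    return best_threshold, best_impurity


# Hypothetical log10 MS levels and caries outcomes for 8 children.
log10_ms = [2.0, 3.1, 3.5, 4.2, 5.0, 5.6, 6.1, 6.8]
caries = [0, 0, 0, 0, 1, 0, 1, 1]

threshold, impurity = best_split(log10_ms, caries)
print(f"split at log10 MS <= {threshold}: weighted Gini {impurity:.4f}")
```

A full tree applies the same search to every input at every node and then recurses into each child until a stopping rule is met.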
Cumulative captured-response curves (Fig. 3
Although ANN weights do not provide the direct interpretation (e.g., ORs), the predicted probability from an ANN model can be categorized (binned), and the distribution of predictors in each category can be graphed. Fig. 4
CV(5) results showed extremely similar root MSE values for the 3 methods: 0.365 for logit, 0.363 for CART, and 0.362 for ANN. AUC (c index) from ROC curves differed somewhat: 0.553 for CART, 0.680 for logit, and 0.707 for ANN. The AUC is the probability that, for one randomly chosen child with caries and one without, the child with caries receives the higher predicted probability (i.e., the pair is correctly ordered).
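The c index can be computed directly as a concordance proportion over all pairs formed by one child with caries and one without. The Python sketch below does this by brute force; the outcomes and predicted probabilities are hypothetical, and ties in predicted probability are counted as half-concordant, a common convention assumed here.

```python
def c_index(observed, predicted):
    """Proportion of case/non-case pairs in which the case has the higher prediction."""
    cases = [p for y, p in zip(observed, predicted) if y == 1]
    non_cases = [p for y, p in zip(observed, predicted) if y == 0]
    concordant = 0.0
    for p_case in cases:
        for p_non in non_cases:
            if p_case > p_non:
                concordant += 1.0
            elif p_case == p_non:
                concordant += 0.5  # ties count as half
    return concordant / (len(cases) * len(non_cases))


# Hypothetical outcomes (1 = caries) and predicted probabilities for 10 children.
observed = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
predicted = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1, 0.5, 0.05]

print(f"c index: {c_index(observed, predicted):.3f}")
```

A value of 0.5 corresponds to predictions no better than chance, and 1.0 to perfect separation of children with and without caries.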
Limitations of the example presented include the relatively small number of predictors utilized. Moreover, other factors potentially related to caries, such as salivary flow rate and pH, were not included. Sufficient salivary flow rate was an inclusion criterion for the study. Buffering capacity (pH) was not an important predictor in earlier work for the precursor study. However, investigators thought that the relationships between predictors and response might be non-linear and include interactions (earlier analyses showed interactions between bacterial counts and salivary chemistry measures). Additionally, the logistic regression models considered did not include interaction or non-linear terms, which might have produced logistic regression models approaching the predictive accuracy of artificial neural networks. Boosting (re-weighting to emphasize misclassifications) or bootstrap aggregation (bagging) could have improved the performance of the tree models.
Knowledge discovery and data mining (KDD) is not a panacea but rather a process with useful tools; KDD does not obviate the need for careful monitoring of data quality and study design issues. Multiple methods should be used to assess sensitivity to one particular method; prediction results from various methods should be compared according to receiver operating characteristic (ROC) curves, cumulative captured-response curves, or lift charts. Care should be taken to avoid common mistakes made with artificial neural networks (ANNs) (e.g., Schwarzer et al., 2000). Validation (internal and external) is essential: "The major cause of unreliable models is overfitting the data" (Harrell, 2001, p. 249). Graphic displays can greatly help interpretations and demystify the "black box" nature of some KDD methods, such as ANNs. KDD methods may provide advantages over traditional statistical methods in dental data.
Glossary
Artificial neural network (ANN) model: multilayer non-linear "black box" mathematical model to predict output from inputs
Bagging (bootstrap aggregation): ensemble tree model method to reduce misclassification error using bootstrap (with replacement) resampling
Boosting (e.g., adaptive resampling and combining [ARCing] or adaptive boosting [AdaBoost]): ensemble tree model method to reduce misclassification error using increased weights for misclassified observations to allow for better prediction in subsequent trees on those records
Bootstrap: drawing (resampling) a large number (e.g., 500 to 10,000) of new sets of data with the original sample size ($n$) from the original data with replacement and re-analyzing those bootstrap resamples to simulate variability and assess robustness
Classification and regression tree (CART) model: recursive partitioning method (re-assessing all inputs at each stage) to split the data into 2 groups at each stage based on inputs that minimize the output class misclassification error
Cross-validation or K-fold cross-validation [CV(K)]: randomly dividing the data into K mutually exclusive and exhaustive subsets (e.g., 5 or 10), re-analyzing each subset, and aggregating across the K subsets to estimate robustness
Ensemble tree model or committee of trees: classifier using majority vote (modal) class assignment or mean predicted probability from a group of tree models grown under different conditions to reduce classification error
Hierarchical clustering: groups records together based on closeness/similarity, starting with each record in its own cluster and ending with all records in one cluster (or vice versa) and allowing the reader to choose classification from those in the middle; formed in either a step-down (divisive) or step-up (agglomerative) direction
Input: predictor, explanatory, or independent variable
Jackknife: assessing analysis robustness by leaving out one observation (i.e., sample size is $n - 1$), analyzing the data again, repeating $n$ times until each observation has been left out once, and then comparing with the original analysis with all the data; equivalent to $n$-fold cross-validation [CV($n$)]
K-means clustering: iteratively determines K groups based on closeness/similarity to group center (mean) and minimal within-group variability
k-nearest neighbor (knn) clustering: iteratively identifies groups based on the k closest neighbors to each point, assigning modal or majority class among the k neighbors
Multivariate adaptive regression splines (MARS): iterative modeling method using combinations of linear basis functions of inputs (predictors) to fit non-linear relationships smoothly
Output: response, outcome, or dependent variable
Random Forests: ensemble tree method using randomly selected subsets of inputs while also providing interpretability through summary measures of input variable importance
Regression model (linear or logistic): classic statistical model to predict output value or probability (linearly or log-linearly) from inputs
Split sample: randomly grouping data into training and testing samples, stratifying on output, building prediction models with the training sample, and testing that resultant model with the holdout testing sample to provide an unbiased error estimate and assess robustness
Supervised learning: modeling in which the output class is known
Support vector machines (SVM): computationally intensive "black box" method to find the non-linear multidimensional boundary (hyperplane) transformed as a linear hyperplane that best splits classes
Unsupervised learning: modeling in which the output class is not known; data are clustered according to similar input variables
The author is grateful to Dr. John Featherstone for his insights about caries risk, to Dr. Jin Whan Jung for his inspiration to utilize KDD tools, to Dr. Ronald Billings for generously providing the Rochester caries study data, and to Dr. Jane Weintraub for her helpful suggestions and comments on an earlier draft of the paper. Any ambiguities, omissions, or errors that remain are solely my own. This research was supported in part by cooperative agreement US DHHS/NIH/NIDCR, NCMHD U54DE14251-01. The Rochester caries study was performed with support from US DHHS/NIH/NIDCR R01DE08946 (R.J. Billings, Principal Investigator).
Publication supported by Software of Excellence (Auckland, NZ)
Advances in Dental Research, Vol. 17, No. 1, 109-114 (2003)

In the binary logit model, $\pi_i$ is the probability of the $i$-th person having the response ($y_i$), the $\beta$s are the corresponding parameters for the $P$ predictor variables, and $e_i$ is the error for the $i$-th person. For example, if log10 MS and fluoride levels relate linearly to the log odds of developing caries, this model would fit well. Logistic regression coefficients ($\beta$s) are easy to interpret (as natural logarithms of odds ratios), a very desirable property. If the actual likelihood surface is not a hyperplane, logistic regression will not fit well, since it misses bumps or non-linearities.




