© 2003 SAGE Publications
Bayesian Machine Learning and Its Potential Applications to the Genomic Study of Oral Oncology
Presented at "Dental Informatics & Dental Research: Making the Connection", a conference held in Bethesda, MD, USA, June 12–13, 2003, sponsored by the University of Pittsburgh Center for Dental Informatics and supported in part by award 1R13DE014611-01 from the National Institute of Dental and Craniofacial Research/National Library of Medicine.
1 Department of Biostatistics, Boston University School of Public Health; Correspondence: * corresponding author, marco_ramoni{at}harvard.edu
With the completion of the Human Genome Project and the growing computational challenges presented by the large amount of genomic data available today, machine learning is becoming an integral part of biomedical research and plays a major role in the emerging fields of bioinformatics and computational biology. This situation offers unparalleled opportunities and unprecedented challenges to machine learning research in general and to Bayesian learning methods in particular. This paper outlines some of the opportunities and challenges of this endeavor, describes where the efforts of "cracking the code of life" can most benefit from a Bayesian approach, and identifies some potential applications of Bayesian machine learning methods to the genomic analysis of squamous cell carcinomas of the head and neck.
Key Words: functional genomics, bioinformatics, machine learning, Bayesian statistics, oral cancer, head and neck squamous cell carcinoma (HNSCC)
From the beginning of the Human Genome Project, the international effort to characterize the human genome through a complete mapping and sequencing of its DNA, it was clear that the management and analysis of the vast amount of information to be generated would require the development and use of appropriate computational methods. While the completion of the Project in 2003 brought several surprises and changed our views about many aspects of the genome, it did not change the perception of the enormity of the task of decoding it. The availability of a complete reference sequence of the human genome has facilitated the development of new technologies, such as DNA microarrays and large-scale genotyping, that provide us with multiple views of the genome and multiple access points to its nature. These access points may be instrumental in understanding the code of life, but, to deliver on this promise, they need to be integrated into a single, global picture, able to capture the interplay between structure and function of the genome. The genome is not static, and very important functions are revealed only by its dynamic behavior. Even basic biological processes, such as the cell cycle, can be deconvoluted and understood only by observing the genome in action and by studying the behavior of genes over time. Today, microarray technology allows us to take snapshots of the expression of every gene at a particular instant, and we are given the opportunity and the challenge to combine these snapshots into "movies" showing the global behavior of the genome. These global views of the behavior of the genome reveal clusters of genes displaying similar behaviors and suggest new roles for some genes by associating their behavior with that of others. These global functional views offer the even greater opportunity to go beyond similarities and dive into the control mechanisms of the genome underpinning the regulation processes and the functional interplay among genes.
This paper outlines how Bayesian methods can offer interesting and unique solutions to these critical problems involved in the decoding of the semantics of the genome: integration, functional analysis, and control identification. It also briefly reviews the current state of the art in the genomic analysis of oral cancer to identify where Bayesian methods can be most relevant.
Classic statistics provides methods to analyze data, from simple descriptive measures to complex and sophisticated models. The available data are processed, and then conclusions are drawn about a hypothetical population, of which the available data are supposed to be a representative sample. It is not hard to imagine situations, however, in which data are not the only available source of information about the population. Suppose, for example, that we need to guess the outcome of an experiment that consists of tossing a coin. How many biased coins have we ever seen? Probably not many, and hence we are ready to believe that the coin is fair and that the outcome of the experiment can be either heads or tails with the same probability. On the other hand, imagine that someone tells us that the coin was forged so that it is more likely to land "heads". How can we take this information into account in the analysis of our data? This question becomes critical when we consider data in domains of application for which knowledge corpora have been developed. Scientific and medical data are both examples of this situation. Bayesian methods provide a principled way to incorporate this external information into the data analysis process. For this to happen, however, Bayesian methods have to change entirely the vision of the data analysis process with respect to the classic approach. In a Bayesian approach, the data analysis process starts with a given probability distribution. Since this distribution is given before any data are considered, it is called the prior distribution. In our previous example, we would represent the fairness of the coin as a uniform prior probability distribution, assigning probability 0.5 to each of the two sides of the coin. On the other hand, if we learn, from some external source of information, that the coin is biased, then we can model a prior probability distribution that assigns a higher probability to the event that the coin lands "heads".
The Bayesian data analysis process consists of using the sample data to update this prior distribution into a posterior distribution. (See Ramoni and Sebastiani, 1999, for an introduction to Bayesian data analysis.) The basic tool for this updating is a theorem proved by Thomas Bayes, an 18th-century clergyman. The role of Bayes' theorem in this approach is so critical that the whole approach is named after it. The ability to integrate data with external information, a trademark of Bayesian analysis, provides a natural framework for integrating the various forms of information available about the genome, and it has already been exploited to develop linkage models of complex diseases, such as autism (Wang et al., 2000; Vieland et al., 2001). The intuition behind this approach is that the conclusions of one study (expressed as posterior probabilities) can be used as prior probabilities for another study, with a principled and seamless integration of the flow of information. Functional differences and similarities between genes can be integrated with information about their structural properties, as long as the conclusions of each component are expressed in terms of posterior probability. When searching for genes associated with a particular disease, for instance, one can update the posterior probability of linkage (PPL) using various phenotypic variables. These phenotypes can be clinical manifestations of a disease or complex patterns of gene expression, inferred through microarray analysis, that are common to subgroups of patients. In this way, each common pattern of gene expression can be used as a phenotype in the genetic studies. The power of the Bayesian approach is that an otherwise undetectable linkage may be established with the use of complex phenotypic information (Vieland et al., 2003).
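The coin example and the idea of reusing one study's posterior as the next study's prior can be sketched numerically. The snippet below is only an illustration under stated assumptions: it places a Beta prior on the probability of heads (a conjugate choice the text does not prescribe), and the prior parameters and toss counts are hypothetical.

```python
from fractions import Fraction

def update_beta(prior, heads, tails):
    """Conjugate update: a Beta(a, b) prior on P(heads) becomes Beta(a + heads, b + tails)."""
    a, b = prior
    return (a + heads, b + tails)

def mean_heads(dist):
    """Mean of a Beta(a, b) distribution, i.e. the expected probability of heads."""
    a, b = dist
    return Fraction(a, a + b)

# Hypothetical priors: one centred on a fair coin, one encoding "biased toward heads".
fair_prior = (5, 5)     # prior mean 1/2
biased_prior = (8, 2)   # prior mean 4/5

# Study 1 observes 7 heads and 3 tails (made-up data).
study1 = update_beta(fair_prior, heads=7, tails=3)

# Study 2 starts from the posterior of study 1 and observes 6 heads and 4 tails.
study2 = update_beta(study1, heads=6, tails=4)

# Analysing all 20 tosses at once from the original prior gives the same posterior.
pooled = update_beta(fair_prior, heads=13, tails=7)

# The same data analysed under the "biased" prior lead to a different posterior.
biased_post = update_beta(biased_prior, heads=7, tails=3)

print(mean_heads(fair_prior), mean_heads(study1), mean_heads(biased_post))
print(study2 == pooled)
```

Chaining the two updates and pooling the data give identical posteriors, which is the sense in which the flow of information from study to study is seamless in this simple conjugate setting.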
The aim of functional genomics is to understand the function of genes as parts of the entire human genome. Current research is mainly focused on the understanding of gene expression mechanisms, i.e., the processes inducing a particular gene to be transcribed and ultimately to code for a protein. The identification of the genes expressed in, say, a cancer cell line or in a dystrophic muscle can cast a new light on the genetic basis of a disease and lead to potential remedies. The long-term promise of this endeavor is to "reverse-engineer" the regulatory mechanisms underlying genomic control systems and their interaction with external conditions, pathogenic agents, and pharmaceutical products.
Differential expression analysis
The simplest functional genomic study we can conduct is a comparative experiment aimed at identifying those genes that are differentially expressed between two biological conditions, such as normal vs. cancerous tissues. For instance, we can compare the gene expression levels of a cancer cell line against those of a healthy cell line and identify the genes differentially expressed in the two cell lines. Early analyses of these array data identified differentially expressed genes by taking the ratio of the intensities and choosing an arbitrary threshold value above or below which the genes were taken to be differentially expressed (Schena et al., 1995). More sophisticated techniques take into account the noise in the gene expression data measured with microarrays by modeling the intensity values with probability distributions. In the first statistical analysis of these data, Chen et al. (1997) proposed a method to identify statistically significant changes between two conditions, under several distributional assumptions. Bayesian approaches to this problem have been emerging over the past few years. Newton et al. (2001) proposed a Bayesian approach to differential analysis using a hierarchical model that helps to identify differentially expressed genes on the basis of the posterior odds of their average expression change. A similar approach has been proposed by Baldi and Long (2001), who model expression values as independent log-normal distributions, parameterized by means and variances with conjugate prior distributions. Microarray experiments are typically characterized by a small sample size, due to the high costs of the technology and the intrinsic paucity of some biological samples. The choice of appropriate distributional assumptions may be critical if reproducible results are to be achieved at low sample size.
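The early ratio-and-threshold analysis described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any cited paper: the gene names, the intensity values, and the default two-fold cut-off are hypothetical, and the sketch deliberately ignores measurement noise, which is the weakness the later statistical and Bayesian methods address.

```python
def fold_change_calls(condition, reference, threshold=2.0):
    """Label each gene 'up', 'down', or 'unchanged' from the ratio of two intensities.

    The threshold is arbitrary, as in the early analyses: a gene is called 'up'
    when the ratio reaches the threshold and 'down' when it reaches 1 / threshold.
    """
    calls = {}
    for gene, intensity in condition.items():
        ratio = intensity / reference[gene]
        if ratio >= threshold:
            calls[gene] = "up"
        elif ratio <= 1.0 / threshold:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls

# Made-up intensities for three hypothetical genes.
tumor = {"geneA": 820.0, "geneB": 95.0, "geneC": 310.0}
normal = {"geneA": 200.0, "geneB": 400.0, "geneC": 290.0}

calls = fold_change_calls(tumor, normal)
print(calls)
```

A single noisy measurement can move a gene across the cut-off, so with few replicates the resulting gene lists may not be reproducible.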
Bayesian Analysis of Differential Gene Expression (BADGE) (available from http://genomethods.org/badge) is a Bayesian method for the analysis of microarray data designed to yield high reproducibility at low sample size. BADGE models gene expression measurements by log-normal and gamma distributions and uses model averaging to compute the posterior probability of differential expression and to build molecular classification models. BADGE accounts for gene expression variability without arbitrary normalization, and provides a common framework for both detection of genes with different expression and molecular classification.
Functional clustering
A specific property of Bayesian statistics can provide a principled solution to this problem. Contrary to its classic counterpart, Bayesian hypothesis-testing computes directly the probability of a hypothesis rather than the probability of committing an error in assuming it. Within this framework, we have developed a Bayesian clustering method able to identify the set of most probable processes responsible for sequences of observations (Ramoni et al., 2002a,b). The idea underpinning this Bayesian approach is that the observed data are generated by processes. The aim of the algorithm is to find the set of processes most likely, a posteriori, to have generated the sequences of observations in the database.
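The notion of directly computing the probability of a hypothesis can be illustrated with a toy comparison between two hypotheses: two sequences were generated by the same process, or by two different processes. This is a deliberately simplified stand-in and not the clustering method cited above: it assumes binary observations treated as exchangeable, a uniform Beta(1, 1) prior on each process, and equal prior probability for the two hypotheses. All sequences are made up.

```python
from math import lgamma, exp

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(ones, zeros, a=1.0, b=1.0):
    """Log marginal likelihood of a binary sequence under a Beta(a, b) prior on P(1)."""
    return log_beta(a + ones, b + zeros) - log_beta(a, b)

def prob_same_process(seq_x, seq_y, prior_same=0.5):
    """Posterior probability that two binary sequences share one generating process."""
    x1, x0 = sum(seq_x), len(seq_x) - sum(seq_x)
    y1, y0 = sum(seq_y), len(seq_y) - sum(seq_y)
    # Hypothesis "same": both sequences are explained by a single process.
    log_same = log_marginal(x1 + y1, x0 + y0)
    # Hypothesis "different": each sequence has its own independent process.
    log_diff = log_marginal(x1, x0) + log_marginal(y1, y0)
    w_same = prior_same * exp(log_same)
    w_diff = (1.0 - prior_same) * exp(log_diff)
    return w_same / (w_same + w_diff)

# Hypothetical sequences: a and b look alike, c behaves in the opposite way.
seq_a = [1, 1, 1, 0, 1, 1, 1, 1]
seq_b = [1, 1, 0, 1, 1, 1, 1, 1]
seq_c = [0, 0, 0, 1, 0, 0, 0, 0]

p_ab = prob_same_process(seq_a, seq_b)
p_ac = prob_same_process(seq_a, seq_c)
print(round(p_ab, 3), round(p_ac, 3))
```

A clustering procedure built on this idea would merge sequences whose "same process" hypothesis is the more probable one and keep the others apart, so the number of clusters is decided by posterior probability rather than by an arbitrary similarity cut-off.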
We have applied this method to cluster observations on 517 genes in a study of the responses of human fibroblasts to serum. The data were collected with the use of competitive cDNA microarrays. These microarrays measure the expression level of a gene simultaneously in a basal or control condition and in an experimental condition. The overall expression induced by the experimental condition is measured as the ratio of the two intensity levels, and these are the data used as input by clustering algorithms.
Fig. 1
Decoding control
Clustering methods deliver a functional portrait of the genome by breaking its components into classes of genes that behave in similar ways under the same experimental conditions, thereby discovering expression patterns of co-regulation. A particular Bayesian formalism, called a Bayesian network, can help us to go one step further and try to understand the types of control and dependency relationships underpinning the global expression process. Bayesian networks are not new to genetic research. As a matter of fact, networks based on directed acyclic graphs originated from the genetics studies of Sewall Wright (1921), who developed a method called Path Analysis, a recognized ancestor of Bayesian networks. The application of Bayesian networks to functional genomics, on the other hand, is very recent. Bayesian networks hold the promise of answering very interesting questions in functional genomics, and, in principle, they seem to be the right technology to take advantage of the massively parallel analysis of whole-genome data to discover how genes interact, control each other, and align themselves in pathways of activation. While clustering algorithms attempt to locate groups of genes that have similar expression patterns over a set of experiments to discover genes that are co-regulated, Bayesian networks dive into the regulatory circuitry of genetic expression to discover the web of dependencies among genes. A Bayesian network has two components: a directed acyclic graph, in which nodes represent stochastic variables and arrows represent dependencies among the variables; and a joint probability distribution for the network variables (Pearl, 1988). The graph encodes conditional independencies among the variables that are used to factorize the joint probability distribution into modules. Each node in the network is associated with a conditional probability distribution that shapes the association between the node and all other nodes with arrows pointing to it.
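The factorization of the joint distribution into one conditional distribution per node can be made concrete with a very small network. The sketch below assumes a hypothetical three-gene chain (G1 → G2 → G3) with binary "on/off" states and invented probabilities; it is meant only to show how the graph and the conditional tables together define the joint distribution.

```python
from itertools import product

# Directed acyclic graph: each node lists its parents (hypothetical three-gene chain).
parents = {"G1": (), "G2": ("G1",), "G3": ("G2",)}

# Conditional probability that a gene is "on" (1) given the states of its parents.
# All numbers are invented for illustration.
p_on = {
    "G1": {(): 0.3},
    "G2": {(0,): 0.1, (1,): 0.8},
    "G3": {(0,): 0.2, (1,): 0.9},
}

def joint_probability(state):
    """Joint probability of a full assignment as a product of per-node conditionals."""
    prob = 1.0
    for node, node_parents in parents.items():
        parent_state = tuple(state[p] for p in node_parents)
        p1 = p_on[node][parent_state]
        prob *= p1 if state[node] == 1 else 1.0 - p1
    return prob

nodes = list(parents)
all_states = [dict(zip(nodes, values)) for values in product((0, 1), repeat=len(nodes))]

# The factorized joint is a proper distribution: it sums to 1 over all assignments.
total = sum(joint_probability(s) for s in all_states)

# Marginal probability that G3 is on, obtained by summing the joint.
p_g3_on = sum(joint_probability(s) for s in all_states if s["G3"] == 1)

print(round(total, 6), round(p_g3_on, 6))
```

The three small tables hold five numbers in total, whereas an unrestricted joint distribution over three binary genes needs seven; the saving grows quickly as the number of genes increases and each gene keeps only a few parents.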
The advantage of a Bayesian network is that it breaks down an otherwise unmanageable joint probability distribution over the domain variables into a set of smaller components, easier to define and cheaper to use. However, the fact that the network breaks down the overall association into modules does not lead to information disintegration. Being parts of the same network, the components of the Bayesian network can be interrelated according to well-established algorithms for probabilistic inference. The promise of Bayesian networks in functional genomics goes even further, since intensive research efforts have been devoted, during the past decade, to defining conditions under which Bayesian networks actually uncover the causal model underlying the data (Pearl, 1995). The most ambitious question, therefore, is: Given a set of microarray data, can we discover a causal model of interaction among different genes? The challenge is the one commonly faced by sound statistical methods when applied to microarray data: a large number of variables with a small number of measurements. In the context of Bayesian networks, this situation results in the inability to discriminate among the sets of possible models, since the small amount of data is insufficient to identify a single most probable model. Friedman et al. (2000) address these problems using partial models of Bayesian networks and a measure of confidence in a learned model. The strategy they follow is to search a space of under-specified models, each composed of a set of Bayesian networks, and to select a class of models rather than a single one. They also adopt a measure of confidence based on bootstrapping to evaluate the reliability of each discovered dependency in the database, to avoid the risk of ascribing a causal role to a gene when not enough information is actually available to support the claim.
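The bootstrap idea behind such a confidence measure can be sketched as follows. This is a simplified stand-in rather than the procedure of Friedman et al.: instead of relearning a whole network on every resample, it checks one pairwise dependency with an assumed criterion (empirical mutual information above a hypothetical cut-off of 0.1 nats). The data, the cut-off, the number of resamples, and the seed are all invented for illustration.

```python
import random
from math import log

def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint, px, py = {}, {}, {}
    for x, y in zip(xs, ys):
        joint[(x, y)] = joint.get((x, y), 0) + 1
        px[x] = px.get(x, 0) + 1
        py[y] = py.get(y, 0) + 1
    return sum((c / n) * log(c * n / (px[x] * py[y])) for (x, y), c in joint.items())

def bootstrap_confidence(xs, ys, n_boot=500, cutoff=0.1, seed=0):
    """Fraction of bootstrap resamples in which the dependency is detected."""
    rng = random.Random(seed)
    n = len(xs)
    hits = 0
    for _ in range(n_boot):
        # Resample the experiments (array indices) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        mi = mutual_information([xs[i] for i in idx], [ys[i] for i in idx])
        if mi > cutoff:
            hits += 1
    return hits / n_boot

# Twelve hypothetical experiments with discretized (0/1) expression states.
gene_x = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
gene_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]   # tracks gene_x closely
gene_z = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]   # unrelated to gene_x

conf_xy = bootstrap_confidence(gene_x, gene_y)
conf_xz = bootstrap_confidence(gene_x, gene_z)
print(conf_xy, conf_xz)
```

A dependency that appears in most resamples is supported by the data as a whole, while one that appears only occasionally is likely to hinge on a few experiments and should not be given a causal reading.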
Hartemink et al. (2002) tackled the under-determination problem by turning the unsupervised search for the most probable network structure into a supervised one. They drew on established biological knowledge to select a small number of networks and then limited their comparisons to these networks only.
We have taken a slightly different approach, adopting the strategy used in differential gene expression analysis and converting the ratio measures generated by cDNA microarrays into discrete variables by thresholding the measures at 2 folds up and 2 folds down, the same thresholds used by the authors of the original paper.
Fig. 2
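The discretization step can be sketched as below, together with the kind of joint count table from which conditional probabilities for a network would be estimated. Whether the 2-fold boundaries are inclusive, the integer coding of the three states, and the gene names and ratios are assumptions made only for this illustration.

```python
from collections import Counter
from math import log2

def discretize(ratio, fold=2.0):
    """Map an expression ratio to -1 (down), 0 (unchanged), or +1 (up).

    Boundaries are treated as inclusive here, which is an assumption.
    """
    level = log2(ratio)
    cut = log2(fold)
    if level >= cut:
        return 1
    if level <= -cut:
        return -1
    return 0

# Made-up expression ratios for two hypothetical genes over six arrays.
ratios = {
    "regulator": [2.6, 3.1, 0.9, 1.2, 0.4, 2.2],
    "target":    [2.1, 4.0, 1.1, 0.8, 0.3, 1.4],
}

states = {gene: [discretize(r) for r in values] for gene, values in ratios.items()}

# Joint counts of (regulator state, target state) across arrays; normalizing
# these counts per regulator state gives an estimate of P(target | regulator).
pair_counts = Counter(zip(states["regulator"], states["target"]))

print(states)
print(dict(pair_counts))
```

Working with three discrete states keeps each conditional table small, which matters when only a handful of arrays are available to estimate it.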
Although the use of Bayesian networks in functional genomics is still in its infancy, the simple comparison of the network in Fig. 2
Cancer studies have been one of the research areas most affected by the introduction of genome-wide functional analysis. In oral medicine, most microarray studies today are focused on Head and Neck Squamous Cell Carcinomas (HNSCC) (see Kuo et al., 2003, for a general review). These carcinomas account for almost 90% of all malignancies affecting the head and the neck. With its heterogeneous nature and proven genetic bases, HNSCC is an ideal candidate for genomic analysis. Although the microarray-based investigation of HNSCC started quite early, the vast majority of genomic studies of HNSCC have followed a supervised experimental design. Several studies have used cDNA microarrays to compare HNSCC with normal epithelial tissues (Leethanakul et al., 2000a,b; Villaret et al., 2000; Al Moustafa et al., 2002; Gonzales-Moles et al., 2002). Interest in the genomic profile of HNSCC relative to normal samples was so high that representative cDNA libraries from patient sets, composed of normal and malignant squamous epithelium, were generated to extend the Head and Neck Cancer Genome Anatomy Project (HN-CGAP) (Leethanakul et al., 2003). The use of high-density oligonucleotide microarrays to study oral cancer was pioneered at the Harvard School of Dental Medicine by Alevizos and colleagues (Alevizos et al., 2001). Using laser-capture microdissection (Ohyama et al., 2000), they compared tumor and normal oral epithelial cells using Affymetrix GeneChip® high-density oligonucleotide microarrays. Following a supervised but multi-class design, Ha and colleagues (Ha et al., 2003) compared four groups of samples: malignant lesions; pre-malignant lesions; distant, histopathologically normal mucosa from patients with pre-malignant or malignant lesions; and normal mucosa from the upper aerodigestive tract of patients with non-cancer diagnoses.
Using a slightly different experimental design, Vigneswaran and colleagues (Vigneswaran et al., 2003) validated their previously used cDNA microarray by characterizing the genomic profile of metastatic lesions in oral squamous cell carcinoma. Belbin and colleagues (Belbin et al., 2002) applied an unsupervised design to cDNA microarrays containing 9216 clones to analyze gene expression profiles of tumor blocks obtained from 17 patients. Using unsupervised clustering, they were able to identify two distinct subgroups of patients by means of 375 genes differentially expressed between the two groups. Mendez and colleagues (Mendez et al., 2002) used Affymetrix GeneChip® microarrays to examine the expression profiles of 26 invasive squamous cell carcinomas of the oral cavity and oropharynx, 2 pre-malignant lesions, and 18 normal oral tissue samples. They were able to confirm that oral carcinomas are distinguishable from normal oral tissue based on genome-wide transcriptional expression patterns, but they were unable to account for other differences among the tumor tissues.
The study of HNSCC offers excellent research opportunities for the Bayesian approaches we have outlined in this paper. Molecular studies have demonstrated the presence of multiple structural abnormalities in HNSCC, such as microsatellite instability in tumor suppressor genes and loss of heterozygosity at numerous chromosomal locations. The presence of these well-characterized structural abnormalities lends itself to an integrative genomic approach, such as the one described in "Bayesian Methods" (above), able to combine this structural information with functional data derived from microarray studies. The combination of these two sources of information would capitalize on our understanding of the structural bases of HNSCC to explain the complex signals coming from gene expression studies. Bayesian clustering methods, in contrast, can introduce a new dimension to this integration process. They can leverage the combination of genomic and phenotypic information to understand the interplay between clinical outcomes and identify new classifications able to disambiguate the well-known heterogeneity of HNSCC. Bayesian networks can take this process of integration even further, seamlessly combining structural, functional, and phenotypic information into a single, coherent molecular landscape. Their ability to discover control mechanisms can be used to identify downstream functional regulations induced by structural abnormalities and explain gene expression changes in genes with no structural changes and their long-range effects on clinical phenotypes.
Publication supported by Software of Excellence (Auckland, NZ)
Advances in Dental Research, Vol. 17, No. 1, 104-108 (2003)



