© 2003 SAGE Publications
Overview of Bioinformatics and its Application to Oral Genomics
Presented at "Dental Informatics & Dental Research: Making the Connection", a conference held in Bethesda, MD, USA, June 12–13, 2003, sponsored by the University of Pittsburgh Center for Dental Informatics and supported in part by award 1R13DE014611-01 from the National Institute of Dental and Craniofacial Research/National Library of Medicine.
Biomedical Informatics Research Fellow, Harvard School of Dental Medicine, Department of Oral Medicine, Infection, and Immunity, 188 Longwood Avenue, Boston, MA 02115; wkuo{at}genetics.med.harvard.edu
The "informatics revolution" in both bioinformatics and dental informatics will eventually change the way we practice dentistry. This convergence will play a pivotal role in creating a bridge of opportunity by integrating scientific and clinical specialties to promote the advances in treatment, risk assessment, diagnosis, therapeutics, and oral health-care outcome. Bioinformatics has been an emerging field in the biomedical research community and has been gaining momentum in dental medicine. This area has created a steady stream of large and complex genomic data, which has transformed the way a clinical or basic science researcher approaches genomic research. This application to dental medicine, termed "oral genomics", can aid in the molecular understanding of the genes and proteins, their interactions, pathways, and networks that are responsible for the development and progression of oral diseases and disorders. As the result of the Human Genome Project, new advances have prompted high-throughput technologies, such as DNA microarrays, which have become accepted tools in the biomedical research community. This manuscript reviews the two most commonly used microarray technologies, basic microarray data analysis, and the results from several ongoing oral cancer genomic studies.
Key Words: oral genomics, bioinformatics, dental informatics, oral cancer, cross-platform
Oral genomics is a broad category describing the development and application of genomic information that is likely to lead to qualitative change in the way in which dental medicine is practiced in terms of diagnostics, risk assessment, therapeutics, and oral health-care outcome. The post-genomic era has brought with it a change in the way basic experiments are conducted, enabling biomedical researchers to examine biological systems more comprehensively. These approaches to comprehensive molecular analysis will provide opportunities to enhance our framework of knowledge of oral health, craniofacial development and malformation, and the pathogenesis of oral diseases. Since the inception of the Human Genome Project (HGP) in 1988 (Palca, 1988), the 3.2 billion basepairs which make up the human genome have been sequenced to near perfection. These sequences contain the blueprint for the mechanisms controlling the behavior of each cell. The small variations in the DNA sequence that lead to different characteristics, such as facial features, hair color, or height, are known as polymorphisms, which also can cause or contribute to the development of many oral diseases and craniofacial syndromes. The HGP has prompted new research strategies and experimental technologies that have generated a steady stream of genomic data and have transformed the study of various life processes. Many model organisms have been sequenced to provide insights on human evolution and the differences and similarities in their genomic sequences.
Bioinformatics is a discipline that has become an essential part of the biomedical research community. Its role involves deciphering genomic, transcriptomic, and proteomic data generated by high-throughput experimental technologies and organizing information gathered from traditional biology. The application of this information, both clinical and genomic, will require the convergence of dental informatics and bioinformatics, which share common methodological challenges, to help us understand how the enormous amount of data can be translated into an improved overall understanding and more opportunities for application to oral diseases (Fig. 1).
The idea is that "informatics", both dental informatics and bioinformatics, can create a bridge of opportunity by integrating the clinical specialties in dentistry and the dental basic sciences to develop new hypotheses and ideas (Fig. 2).
The ability to explore the gene expression profiles by microarray technology has revolutionized the approach in which we study genes. By applying this technology in oral genomics, we can enhance our knowledge of biological pathways, networks, and molecular systems of the diseases and disorders we face as a profession (Slavkin, 2001; Yeager, 2001; Kuo et al., 2002c, 2003b; Wright and Hart, 2002). Unfortunately, despite the capabilities of this technology, to date there have been few published studies using microarrays to generate novel insights into oral genomics. In this manuscript, we will review the two most commonly used microarray technologies, data analysis, and the results from several ongoing oral cancer genomic studies.
Microarray technologies have become a common tool in the biomedical research community for measuring gene expression profiles in a global fashion. Instead of studying genes individually, one can examine the expression of thousands of genes simultaneously. Microarray methods were initially developed to study differential gene expression using complex populations of RNA. In general, microarrays consist of unique DNA sequences, called probes, which are biochemically fixed to a glass slide. The labeled sample derived from the mRNA of the tissue of interest is called the "target". Since mRNA is easily degraded, it must be converted to a more stable form, complementary DNA (cDNA). This cDNA is the target that is deposited onto the glass slide, where hybridization occurs. Hybridization on DNA microarrays is the process in which complementary sequences bind to each other under the correct conditions. The base "A" is complementary to "T", and "C" is complementary to "G". The two most commonly used microarray platforms are robotically spotted cDNA arrays and short photolithographic oligonucleotide arrays, which differ in their design, protocol, and analysis. A hybrid of the two platforms, the long oligonucleotide array, has recently gained momentum; its probes range from 30 to 80 basepairs in length. As with any new technology, there is an evaluation period in terms of performance before these platforms become accepted in the research community. (Refer to the Table.)
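The base-pairing rule behind hybridization can be illustrated with a few lines of Python. This is only a toy sketch: the probe sequence is made up, and the check demands a perfect match, whereas real hybridization tolerates some mismatches depending on the experimental conditions.

```python
# Watson-Crick pairing for DNA: A pairs with T, and C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq: str) -> str:
    """Return the strand that would pair with `seq`, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_perfect_match(probe: str, target: str) -> bool:
    """True if `target` is exactly the reverse complement of `probe`.

    Assumption of this sketch: only perfect complementarity counts.
    """
    return reverse_complement(probe) == target.upper()


probe = "AAGCTG"  # hypothetical 6-base probe, far shorter than a real one
print(reverse_complement(probe))
print(is_perfect_match(probe, "CAGCTT"))
```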
cDNA microarrays require the construction of polymerase chain-reaction (PCR) products robotically printed in a two-dimensional grid onto a glass slide. The PCR products are double-stranded sequences amplified from expressed sequence tags (ESTs), which range from 100 to 1000 basepairs in length. The number of cDNAs that can be spotted onto a glass slide exceeds 35,000. cDNA arrays use a two-fluorescent-dye approach in which the two RNA samples are labeled with either a Cy5 or a Cy3 dye. Once the samples are labeled, they are hybridized to the slide, washed, scanned, and quantified for further computational analysis. In addition to affordability, cDNA microarrays provide the opportunity for the simultaneous analysis of two similar biological samples. cDNA arrays also allow for the discovery of novel genes, since ESTs of unknown function can be spotted. They also provide the user with flexibility in terms of adding new cDNA clones and creating smaller customized arrays for specific investigations. The disadvantages of spotted arrays are their inability to measure absolute levels of gene expression, a measurement that Affymetrix GeneChips™ can provide, and the variability in spot quality from slide to slide. Another disadvantage is their inability to control for non-specific hybridization, which can cause inaccurate gene expression measurements.
Short oligonucleotide microarrays, or GeneChips™ (with probes 25 basepairs in length), are manufactured by Affymetrix (Santa Clara, CA, USA), utilizing a photolithographic approach similar to the fabrication of microchips in a computer. This approach permits the creation of a high-density array which can contain up to 1,000,000 unique oligonucleotide features covering more than 39,000 transcript variants. Each gene is represented by at least one set of 11–20 different "probe pairs". A probe pair consists of a perfect-match (PM) probe and a mismatch (MM) probe; in the MM probe, the 13th position is designed not to match the target sequence. 
The goal is to control for non-specific hybridization and reduce the noise in the data analysis as compared with cDNA microarrays. The information across all the paired PM and MM probes in a set is integrated by a proprietary algorithm in the Affymetrix Microarray Suite software. In contrast to cDNA arrays, where two samples are hybridized to the same array, each mRNA preparation for an Affymetrix array is hybridized to a separate Affymetrix GeneChip™. After hybridization, the GeneChip™ is washed, stained, scanned, and quantified for further analysis. The advantage of Affymetrix GeneChips™ is their accepted usage in the research community and their integrated platform, in which analytical tools are provided to the user in addition to the arrays themselves. The major disadvantages are their high cost and their inability to compare the expression levels of two related biological samples simultaneously.
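The Affymetrix algorithm is proprietary and is not reproduced here. As a rough illustration of how PM and MM intensities might be combined, the Python sketch below computes a plain average of the PM minus MM differences over one probe set. The probe sequence, the intensities, and the choice of the complementary base as the substituted base in the MM probe are all assumptions made for the example.

```python
from statistics import mean
from typing import Sequence

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm: str, position: int = 13) -> str:
    """Build an MM probe by changing one base (1-indexed) of the PM probe.

    Assumption: the substituted base is the complement of the original.
    """
    i = position - 1
    return pm[:i] + COMPLEMENT[pm[i]] + pm[i + 1:]


def average_difference(pm: Sequence[float], mm: Sequence[float]) -> float:
    """Mean of (PM - MM) over all probe pairs of one probe set.

    A simple stand-in, not the proprietary Microarray Suite calculation.
    """
    if len(pm) != len(mm):
        raise ValueError("each PM intensity needs a matching MM intensity")
    return mean([p - m for p, m in zip(pm, mm)])


pm_probe = "ACGTACGTACGTACGTACGTACGTA"  # made-up 25-mer
mm_probe = mismatch_probe(pm_probe)      # differs only at position 13

# Invented intensities for four probe pairs (real sets have 11 to 20 pairs).
pm_signal = [1200.0, 980.0, 1510.0, 870.0]
mm_signal = [300.0, 260.0, 410.0, 250.0]
print(mm_probe)
print(average_difference(pm_signal, mm_signal))
```

A high MM signal relative to PM lowers the estimate, which is how the mismatch probe discounts non-specific hybridization in this simplified picture.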
As many researchers in the microarray field have recognized, careful experimental design is the most important factor for an effective microarray experiment. The most challenging aspect of a microarray workflow is handling the large quantity of data generated by this technology and extracting meaningful results from it, which requires machine learning and statistical approaches (bioinformatics). There have been numerous publications discussing the interpretation and analytical challenges presented by large datasets (Quackenbush, 2001; Churchill, 2002). We will briefly review several main topics in microarray data analysis. In a typical microarray analysis, the first step is pre-processing of the data, which includes normalization and filtering. This step is performed before expression data from the experiments are compared. Normalization is necessary to account for and to minimize systematic and experimental variations in the calculation of gene expression data (Schuchhardt et al., 2000; Schadt et al., 2001; Yang et al., 2002). It attempts to isolate the biological information by removing the impact of non-biological influences on the data. Normalization is complex, and its discussion is beyond the scope of this paper. Briefly, there are two main approaches: intensity-dependent normalization and intensity-independent normalization. After normalization, the next hurdle to overcome is the number of genes, which generally exceeds the number of observations by at least one order of magnitude. Substantial variable reduction is usually necessary before any machine learning or statistical algorithms can be applied. One way to overcome this problem is to filter the data, which reduces the number of genes prior to analysis. For example, genes that have a low overall variance across all the samples to be studied can be filtered out, since they are of limited interest. New data analysis techniques are continuously being developed, and statistical approaches present their own challenges. 
As mentioned previously, a microarray experiment involves many variables and a small number of replicates for each data point, which leads to inaccurate estimates of variance. A significant bottleneck is the identification of meaningful groups of genes that show statistically significant differential expression between experiments. Several approaches apply widely used parametric statistical tests, such as Student's t test (Tsai et al., 2003) and ANOVA (analysis of variance) (Draghici et al., 2003), or non-parametric tests, such as the Mann-Whitney U test or the Kruskal-Wallis test (Kuo et al., 2002b), to every individual gene.
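The pre-processing and per-gene testing steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure used in any of the cited studies: the intensities are invented, the normalization is a simple intensity-independent one (log2 transform followed by subtraction of each sample's median), the variance cutoff of 0.1 is arbitrary, and only the pooled two-sample t statistic is computed, without a p-value or any correction for multiple testing.

```python
import math
from statistics import mean, median, variance

# Invented raw intensities: gene -> 3 normal samples followed by 3 tumor samples.
raw = {
    "geneA": [100, 110, 90, 400, 420, 380],
    "geneB": [200, 210, 190, 205, 195, 200],
    "geneC": [800, 760, 840, 210, 190, 200],
    "geneD": [50, 52, 48, 51, 49, 50],
    "geneE": [300, 310, 290, 305, 300, 295],
}
N_NORMAL = 3


def normalize(data):
    """Log2-transform, then subtract each sample's median log intensity."""
    genes = list(data)
    n_samples = len(data[genes[0]])
    logged = {g: [math.log2(v) for v in data[g]] for g in genes}
    medians = [median([logged[g][j] for g in genes]) for j in range(n_samples)]
    return {g: [logged[g][j] - medians[j] for j in range(n_samples)] for g in genes}


def filter_low_variance(data, min_variance):
    """Keep only genes whose variance across all samples reaches the cutoff."""
    return {g: v for g, v in data.items() if variance(v) >= min_variance}


def student_t(a, b):
    """Two-sample t statistic with pooled variance (equal variances assumed)."""
    n_a, n_b = len(a), len(b)
    pooled = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / n_a + 1 / n_b))


normalized = normalize(raw)
kept = filter_low_variance(normalized, min_variance=0.1)
for gene, values in kept.items():
    # Positive t means higher expression in tumor than in normal samples.
    t = student_t(values[N_NORMAL:], values[:N_NORMAL])
    print(gene, round(t, 2))
```

With only three replicates per group, the variance estimates in the denominator are unstable, which is the problem of small replicate numbers noted in the text.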
Most microarray data analyses are carried out by supervised or unsupervised machine learning approaches. In unsupervised learning, the data are analyzed with no a priori assumption about the identity of the samples analyzed. Typical examples of unsupervised learning techniques used in microarray analysis are hierarchical cluster analysis (Eisen et al., 1998), k-means clustering (Tavazoie et al., 1999), and self-organizing maps (SOM) (Tamayo et al., 1999). Unsupervised clustering of microarray data has been widely used in the classification of tumors, for example, leukemia (Golub et al., 1999), breast (Perou et al., 2000), and prostate (Dhanasekaran et al., 2001) cancer. An illustration of k-means clustering applied to an oral cancer microarray dataset is shown in Fig. 3.
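To make the idea of unsupervised clustering concrete, the following Python sketch implements a basic k-means procedure on invented expression profiles. It is not the analysis behind the figure; the data, the Euclidean distance, and the farthest-first choice of starting centroids (used here so that the result is deterministic) are assumptions of the example.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first(points, k):
    """Start from the first point, then repeatedly add the point that is
    farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        )
    return [list(c) for c in centroids]


def kmeans(points, k, max_iter=100):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its members."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Invented log-ratio profiles of six genes across four samples.
profiles = [
    [0.1, 0.0, 2.0, 2.1],
    [0.0, 0.2, 1.9, 2.2],
    [-0.1, 0.1, 2.1, 1.8],
    [0.0, 0.1, -2.0, -1.9],
    [0.2, -0.1, -2.1, -2.0],
    [0.1, 0.0, -1.8, -2.2],
]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

No sample or gene labels are passed to the algorithm; the groups emerge from the expression values alone, which is what distinguishes this from the supervised methods.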
Supervised learning techniques, in contrast, integrate into the data analysis some biological information for the samples being analyzed. This approach is used to identify subsets of genes capable of predicting a diagnosis or a clinical outcome. Examples of supervised learning techniques include nearest-neighbor algorithms (Theilhaber et al., 2002), linear discriminant analysis (Mendez et al., 2002), classification trees (Xu et al., 2002), and support vector machines (Brown et al., 2000). After data analysis, the genes that have been identified require biological validation. As discussed, a microarray experiment is a multi-step process, which is prone to errors, biases, and sometimes overinterpretation (Brown et al., 2000). In addition, quality issues in the data will significantly affect the final results. Validation is generally carried out at the RNA level by one of three methods: Northern blot, real-time PCR, or in situ hybridization on tissue sections.
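A nearest-neighbor classifier is perhaps the simplest supervised technique to write down. The Python sketch below is illustrative only: the training profiles, class labels, and the choice of k = 3 with Euclidean distance are invented for the example and do not come from the cited studies.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_profiles, train_labels, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest
    training samples."""
    ranked = sorted(
        range(len(train_profiles)),
        key=lambda i: euclidean(train_profiles[i], query),
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented expression values for three genes in six labeled samples.
train_profiles = [
    [0.1, 0.0, -0.2],
    [0.0, 0.2, 0.1],
    [-0.1, 0.1, 0.0],
    [2.0, 1.8, 2.2],
    [2.1, 2.0, 1.9],
    [1.9, 2.2, 2.0],
]
train_labels = ["normal", "normal", "normal", "cancer", "cancer", "cancer"]

new_sample = [1.7, 1.9, 2.1]  # hypothetical unlabeled sample
print(knn_predict(train_profiles, train_labels, new_sample))
```

Unlike clustering, the known sample labels drive the result here, so the classifier should be evaluated on samples that were not used for training.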
Oral squamous cell carcinoma (OSCC) is an aggressive epithelial malignancy and is the sixth most common cancer among US males. The five-year survival rate has not changed for the past four decades and remains less than 50%. Dental clinicians and pathologists are faced with two major problems in the management of OSCC: the heterogeneity of the disease and the lack of conventional histological and clinical features. For example, when a patient presents to the clinic with a dysplastic lesion, the cytological features of dysplasia provide little value in predicting which dysplastic lesion may be more or less aggressive over time (diagnostic dilemma). Therefore, new prognostic and predictive factors are needed for classification of the different stages of this disease, such as discrimination among normal, dysplastic, and malignant cells. This was illustrated in a recent publication (Kuo et al., 2002b). The goal was to investigate differentially expressed genes that best discriminate among normal, dysplastic, and cancer samples by using DNA microarrays. All samples were collected by laser capture microdissection (LCM) (Todd et al., 2002) to minimize biological noise. From the study, we identified several genes and expressed sequence tags (ESTs) from the normal and dysplasia comparison to be associated with cancer. For example, we found DDB2, a damage-specific DNA-binding protein, to be a potential biomarker for the progression of OSCC. The implication is that DDB2 may be a novel target for prevention or therapy. In another study, we applied a hierarchical clustering algorithm using the average-linkage heuristic and a Euclidean distance metric to the gene expression profiles of normal, dysplastic, and cancer samples (Kuo et al., 2003a). We were able to separate the normal and cancer samples based on their expression profiles. An interesting observation was made with the "dysplasia" samples, which did not cluster together as a group. 
They were found in either the "normal" or "cancer" clusters. So, combining results from the two above studies, we can infer that there are subtle changes that occur in the progression of OSCC and that are not captured through histology reports. Furthermore, constructing signaling pathways through the identification of the specific genes and the sequence in which they appear in the transformation of normal to cancer can be beneficial in our understanding of OSCC. In another study, we developed a novel approach to examine the functional relationships between gene pairs in OSCC using Affymetrix GeneChips™ (Kuo et al., 2003c). A difficulty in analyzing microarray data is that our understanding of gene interactions for most biological systems is incomplete. Most studies have simply focused on each gene independently, attempting to find a set of genes whose expression levels change across various conditions or experiments. The results of the study illustrate that analyzing the relationship between a pair of genes provided more information about the mechanism or function underlying OSCC than examining one gene at a time. The general hypothesis is that genes that behave differently in different disease conditions are more likely to be related to a particular disease mechanism. From the 36 samples (8 normal samples, 28 cancer samples), we identified the gene pairs CORO1A and CXCR4, CORO1A and CR2, and CXCR4 and CR2 to be associated with OSCC, based on MeSH terms. Additionally, we identified a gene pair, PIN and IGFBP4, to be associated with a "pre-cancerous condition" based on MeSH terms. The results from this study are interesting but preliminary, and further genomic studies are needed. These studies illustrate that expression profiling using microarrays can be a means to refine conventional and histopathological assessment of OSCC samples, allowing for a more accurate prediction of disease course. 
Other potential OSCC research-related applications involve the exploration of mechanisms of action of current and experimental therapies and the screening of potential clinical markers (Jenssen et al., 2002). These studies also illustrate that oral genomics is still in its infancy, and novel approaches to analysis are still needed.
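Hierarchical clustering with the average-linkage heuristic and Euclidean distance, as mentioned for the normal, dysplastic, and cancer samples, can be sketched in Python as follows. The six profiles are invented and deliberately arranged so that one "dysplasia" sample resembles the normal samples and the other resembles the cancer samples; they are not data from the studies discussed, and cutting the tree at two clusters is an assumption of the example.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(profiles, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: average_linkage(clusters[ab[0]], clusters[ab[1]], profiles),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    return clusters


# Invented expression profiles (three genes) for six hypothetical samples.
names = ["normal1", "normal2", "dysplasia1", "dysplasia2", "cancer1", "cancer2"]
profiles = [
    [0.0, 0.1, -0.1],
    [0.1, -0.1, 0.0],
    [0.3, 0.2, 0.1],
    [1.8, 2.1, 1.7],
    [2.0, 2.2, 1.9],
    [2.1, 1.9, 2.2],
]
clusters = agglomerate(profiles, n_clusters=2)
named = [sorted(names[i] for i in members) for members in clusters]
print(named)
```

In this toy arrangement the dysplasia samples split between the two clusters rather than forming a group of their own, which is the kind of pattern the text describes.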
Microarray data analysis remains challenging, but many research groups have now overcome most of the roadblocks initially encountered. Though this approach has become mainstream in many laboratories, there is still a need for an integrated microarray database that would permit comparison of microarray data generated by different laboratories and microarray platforms (Kuo et al., 2002a). The existence of several microarray platforms for measuring gene expression makes consistency and reproducibility across high-throughput technologies important issues. To illustrate this issue, the two independent OSCC studies presented in the previous section not only utilized two different platforms (cDNA and Affymetrix GeneChips™) but were also conducted in different laboratories. In theory, it should be possible to combine data generated from different laboratories and different platforms. The advantage of doing so is that it reduces the need to duplicate similar experiments. In Fig. 4
Oral genomics will be transformed by the evolution of high-throughput techniques and "informatics". As the functions of more genes are unraveled, the biological relevance of microarray findings will become more transparent, and the link between genotypes and clinical phenotypes will improve. In the meantime, significant improvements in the understanding of microarray findings and their translation into the clinic will result from the use of a multidisciplinary approach in which combinations of analyses from "informatics" are performed. In the future, we anticipate that dental informatics and bioinformatics, together with the incorporation of clinical data into the analysis of genomic information, will increase our understanding of the mechanisms underlying the biological challenges in dentistry. With this approach, we can eventually change the current practice of dentistry, including diagnostics, therapeutics, and prognostics of common oral diseases and disorders. This new approach to dental medicine will be both molecularly informed and informatically empowered.
Publication supported by Software of Excellence (Auckland, NZ)
Advances in Dental Research, Vol. 17, No. 1, 89-94 (2003)



