All methods for detecting mutations rely on the manipulation of DNA, the basic building block of heredity in the cell. DNA consists of two long strands of polynucleotides that twist around each other clockwise in a double helix (Fig. 1-1). Nucleic acid bases attached to the sugar groups of each strand face each other within the helix, perpendicular to its axis. Only four bases exist: the purines adenine and guanine (A and G) and the pyrimidines cytosine and thymine (C and T). During assembly of the double helix, stable pairings of nucleotides from either strand are made between A and T or between G and C. Each base pair (bp) forms one of the billions of rungs in the long, unbroken ladder of DNA that forms a chromosome.
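The pairing rule described above (A with T, G with C) means that either strand fully determines the other. A minimal Python sketch of that idea follows; the function names are illustrative and are not taken from any particular library.

```python
# Watson-Crick pairing as described in the text: A pairs with T, G pairs with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(strand: str) -> str:
    """Return the base-by-base partner of each position in a DNA strand."""
    return "".join(PAIRS[base] for base in strand.upper())

def reverse_complement(strand: str) -> str:
    """Return the opposing strand written in its own reading direction."""
    return complement(strand)[::-1]

if __name__ == "__main__":
    print(complement("ATGC"))          # partner bases, position by position
    print(reverse_complement("ATGC"))  # opposing strand, reversed
```

Because each rung of the ladder is fixed by this rule, tools that design primers or probes typically start from an operation like `reverse_complement`.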

Figure 1-1 **DNA structure.** DNA, the cell’s genetic material, is contained in single compacted strands comprising chromosomes within the cell nucleus. In the DNA double helix, the two intertwined components of its backbone are composed of sugar (deoxyribose) and phosphate molecules, which are connected by pairs of molecules called bases. The sequence of four bases (guanine, adenine, thymine, and cytosine) in the DNA helix determines the specificity of genetic information. The bases face inward from the sugar-phosphate backbone and form pairs with complementary bases on the opposing strand for specific recognition. The arrangement of chemical groups is unique for each base pair, allowing base pairs to be specifically targeted by transcription factors, polymerases, restriction enzymes, and other DNA-binding proteins. (From Chalquist C. http://www.terrapsych.com.)

The functional unit of inherited information in DNA, the gene, usually is represented by a discrete section of sequence necessary to encode a particular protein structure. Gene expression is initiated by forming a copy of the gene in the form of messenger RNA (mRNA), which is constructed base by base from the DNA template by a polymerase enzyme. Once transcribed, an mRNA transcript is modified and the processed product is transported out of the nucleus. In the cytoplasm, proteins are then synthesized, or translated, in macromolecular complexes called ribosomes that read the mRNA sequence and convert the nucleic acid code, based on three-base segments or codons, into a 20 amino acid code to form the corresponding protein.
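The flow from DNA to mRNA to protein can be sketched in a few lines of Python. This is a simplified illustration, not a model of the cellular machinery: it assumes the input is the coding strand (so transcription reduces to replacing T with U), ignores mRNA processing, and uses the standard genetic code with single-letter amino acid symbols. The function names are illustrative.

```python
# Standard genetic code, packed in the conventional U, C, A, G codon order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def transcribe(coding_strand: str) -> str:
    """Simplified transcription: the mRNA matches the coding strand with U for T."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Read codons from the first AUG until a stop codon ('*') or the sequence end."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":  # stop codon ends translation
            break
        protein.append(residue)
    return "".join(protein)

if __name__ == "__main__":
    mrna = transcribe("ATGGCCAAGTAA")
    print(mrna, translate(mrna))
```

The 64 codons map onto 20 amino acids plus stop signals, which is why several codons can specify the same residue.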

Generating Diversity with Alternate Splicing

In higher organisms, most protein-coding gene sequences are interrupted by stretches of noncoding DNA sequences, called introns. In the nucleus, these introns are removed after mRNA transcription to produce a continuous chain of coding sequences, or exons, which subsequently undergo translation into protein. The splicing process requires absolute precision, because the deletion or addition of a single nucleotide at the splice junction would throw the three-base coding sequence out of frame or lead to exon skipping or addition, thus creating abnormal proteins.
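The sensitivity of the reading frame to a single-nucleotide error at a splice junction can be illustrated with a short Python sketch. The sequence below is a made-up example, and `delete_base` is a hypothetical helper standing in for an imprecise splice; it is not a model of the spliceosome.

```python
def codons(mrna: str) -> list:
    """Split a sequence into complete three-base codons, dropping any remainder."""
    usable = len(mrna) - len(mrna) % 3
    return [mrna[i:i + 3] for i in range(0, usable, 3)]

def delete_base(seq: str, index: int) -> str:
    """Remove one nucleotide, as an imprecise splice junction might."""
    return seq[:index] + seq[index + 1:]

if __name__ == "__main__":
    normal = "AUGGCCAAGUGA"            # hypothetical spliced coding sequence
    shifted = delete_base(normal, 3)   # lose one base just after the first codon
    print(codons(normal))
    print(codons(shifted))
```

Every codon downstream of the deletion changes, so the loss of one base alters the remainder of the encoded protein rather than a single residue.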

The dramatic increase in genetic complexity conferred by alternate RNA splicing is underscored by the multiple splice patterns of many medically relevant genes, in which different combinations of exons are chosen for the final mRNA transcript, such that one gene can encode many different proteins (Fig. 1-2). The choice of protein isoform to be expressed from a gene with multiple splicing possibilities is a decision that can be perturbed in disease. To date, errors in splicing mechanisms have been associated with a large group of cancers. These errors include mutations in genes encoding several transcription factors, cell signaling proteins, and membrane proteins. Examples are mutations of the tumor suppressor p53 in more than 12 different types of cancer and of the mutL homolog 1 protein in hereditary nonpolyposis colorectal cancer. When mutations in the splicing site lead to insertion of novel sequences in the mRNA, the encoded protein can be used as a potential clinical marker, as seen for the transcription factor NRSF in persons with small cell lung cancer. Because of their unique expression in cancer cells, these markers can be further explored as new cancer-specific therapeutic targets.

Figure 1-2 **RNA splicing.** Alternate splicing produces multiple related proteins, or isoforms, from a single gene. (From Guttmacher AE, Collins F. Genomic medicine—a primer. N Engl J Med 2002;347:1512–20.)

The Genomics of Cancer

The complete set of DNA sequences carried on all the chromosomes is known as the genome. Although the general map of the genome is shared by all members of a species, the recent sequencing of thousands of individual human genomes has given rise to the new field of genomics, providing us with new tools to reveal the more subtle variations that arise between individuals. These variations are critical, both as a natural engine driving heterogeneity within a species and as a source of predisposition to cancer types. The most common forms of human genetic variations, or alleles, arise as single-nucleotide polymorphisms, or SNPs. Because these allelic dissimilarities are abundant, inherited, and dispersed throughout the genome, SNPs can be used to track racial diversity, personal traits, and susceptibility to common forms of cancer (Fig. 1-3).
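At its simplest, identifying an SNP amounts to comparing two aligned sequences position by position. The sketch below assumes the two sequences are already aligned and of equal length, which real genotyping pipelines cannot take for granted; the function name and the example sequences are illustrative.

```python
def find_snps(reference: str, sample: str) -> list:
    """Return (position, reference_base, sample_base) for each single-base difference.

    Positions are 1-based. Both sequences must already be aligned and equal in length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (i + 1, ref_base, alt_base)
        for i, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]

if __name__ == "__main__":
    print(find_snps("ACGTAC", "ACGTGC"))
```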

Figure 1-3 **Determining cancer susceptibility with single-nucleotide polymorphisms** (**SNPs).** Millions of SNPs exist between individuals, as depicted by the red arrows and the SNP density map of human chromosome 11 (*right*). By contrast, point mutations, deletions, insertions, and rearrangements between normal tissues and tumors or between primary and secondary tumors probably number in the tens to hundreds (or potentially thousands), as depicted by the spectral karyotype image at the bottom of the figure. Because the constitutional genetic polymorphisms are present in all of the tissues of the body, it might be possible to distinguish differences in metastatic versus nonmetastatic tumors and in nontumor tissue before those metastatic cells develop into a solid tumor. (From Hunter K. Host genetics influence tumour metastasis. Nat Rev Cancer 2006;6:141–6.)

How do SNPs arise between individuals? One source of variation in DNA sequence derives from deviations in the strict base-pairing rule underlying the structure, storage, retrieval, and transfer of genetic information. The duplicated genetic information in the two strands of DNA not only permits the repair of a damaged coding sequence but also forms the basis for the replication of DNA. During cell division, polymerase enzymes unwind the DNA strands and copy them, using the base sequences as a template for constructing a new helix so that the dividing cell passes its entire genetic content on to its progeny. Errors in this process are rare, and person-to-person differences constitute only about 0.1% of the human genome. SNPs are inherited if they occur in the germline. Many genetically inherited variations occur in regions that do not encode protein or alter the regulation of nearby genes. Given the disruptive effects that even subtle genetic changes may have on cell function, it is important to distinguish SNPs that represent true mutations from benign polymorphisms.

Our ability to monitor hundreds of thousands of SNPs simultaneously is one of the most important advances in modern medical genetics. Relatively simple genotyping technologies for SNP detection rely largely on the polymerase chain reaction (PCR). In this procedure, two chemically synthesized single-stranded DNA fragments, or primers, are designed to match chromosomal DNA sequences flanking the segment in which an SNP is positioned. With the addition of nucleotide building blocks and a heat-stable DNA polymerase, the primer pairs, or amplimers, initiate synthesis of new DNA strands using the chromosomal material as a template. Each successive copying cycle, initiated by “melting” the resulting double-stranded products with heat, doubles the number of DNA segments in the reaction (Fig. 1-4). The technique is exceptionally sensitive; millions of identical DNA copies can be generated in a matter of hours with PCR using a single DNA molecule as the starting material.
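The doubling arithmetic behind this sensitivity is easy to check. The Python sketch below assumes ideal amplification, in which every template is copied in every cycle; the `efficiency` parameter is an added assumption that allows for less-than-perfect doubling and does not come from the text.

```python
import math

def copies_after(cycles: int, start: int = 1, efficiency: float = 1.0) -> float:
    """Copies after a number of cycles; efficiency 1.0 means perfect doubling."""
    return start * (1 + efficiency) ** cycles

def cycles_needed(target_copies: float, start: int = 1) -> int:
    """Smallest number of perfect-doubling cycles that reaches the target count."""
    return math.ceil(math.log2(target_copies / start))

if __name__ == "__main__":
    print(copies_after(20))          # copies from one molecule after 20 cycles
    print(cycles_needed(1_000_000))  # cycles to exceed one million copies
```

Under these idealized assumptions, about 20 cycles of a few minutes each are enough to turn a single molecule into more than a million copies.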

Figure 1-4 **Amplification of DNA by polymerase chain reaction.** The DNA sequence to be amplified is selected by primers, which are short, synthetic oligonucleotides that correspond to sequences flanking the DNA to be amplified. After an excess of primers is added to the DNA, together with a heat-stable DNA polymerase, the strands of both the genomic DNA and the primers are separated by heating and allowed to cool. A heat-stable polymerase elongates the primers on either strand, thus generating two new, identical, double-stranded DNA molecules and doubling the number of DNA fragments. Each cycle takes just a few minutes and doubles the number of copies of the original DNA fragment.

Other novel methods for large-scale SNP detection include single nucleotide primer extension, allele-specific hybridization, oligonucleotide ligation assay, and invasive signal amplification, which detect polymorphisms directly from genomic DNA without the requirement of PCR amplification. The International HapMap project has been established with the objective of identifying those variations (commonly thought to be on the order of 10 million in our genome) in the human population. This project is already in its third phase (HapMap3) and now includes both SNPs and copy number variations observed in 1184 samples from 11 different human populations. Regardless of the method used to characterize them, the collective SNPs in a selected genomic region characterize a haplotype, or a specific combination of alleles at multiple linked genetic loci along a chromosome that are inherited together.

Even when the SNPs within a given haplotype are not directly involved in a disease, they provide markers for clonality and for the loss or rearrangement of specific chromosomal segments in growing tumors. In the human nucleus, each of the 23 tightly compacted chromosomes has a characteristic size and structure and a distinctive base sequence that carries unique protein coding information. Other noncoding DNA sequences are used for directing the transcription of neighboring genes through complex regulatory circuits involving protein binding and modification of the DNA itself, or shifting of its chromosomal packaging. Although genomic instability generally is considered a consequence of tumor formation rather than the initial trigger of cancer, the loss, gain, or rearrangement of chromosomal segments through deletion or translocation is a common form of neoplastic mutation, as protein-coding segments from different genes are combined or regulatory sequences are brought into new proximity to genes they do not normally control. In persons with chronic myeloid leukemia, for example, recombination events lead to the fusion of the BCR and ABL genes (Philadelphia chromosome). This process results in constitutive activation of the fused gene, leading to loss of proliferative control in myeloid cells and, consequently, cancer. Gross changes in DNA arrangement can be detected by cytogenetic analysis of chromosomal features on metaphase spreads. Fluorescence in situ hybridization provides greater resolution by localizing specific chromosomal DNA sequences corresponding to fluorescently labeled probes (Fig. 1-5) and can be used to track specific alterations in chromosomal structure where known genes are involved.

Figure 1-5 **Detection of chromosomal translocations.** Fluorescence in situ hybridization technology uses a labeled DNA segment as a probe to search for homologous sequences in interphase chromosomes, here the t(9;22)(q34;q11) translocation, which is associated with chronic myeloid leukemia. On the left, patient nuclei were hybridized with probes for chromosome 9 (labeled with SpectrumRed fluorophore) and chromosome 22 (labeled with SpectrumGreen). (Modified from Varella-Garcia M. Molecular cytogenetics in solid tumors: laboratorial tool for diagnosis, prognosis, and therapy. Oncologist 2003;8:45–58.)

The plethora of data arising from genome-wide association studies using currently available techniques poses particular challenges to cancer researchers. Discerning the causal genetic variants among genotype-phenotype associations requires extensive replication, control for underlying genetic differences in population cohorts, and consistent classification of clinical outcomes. New technologies must be met with equivalently sophisticated and rigorous analytical methodologies for the true genetic cause of cancer to be teased out from our variable and often unstable heredity.

Building Gene Libraries

The engineering of genes by recombinant DNA technology evolved from methods initially devised to provide sequences in amounts sufficient for biochemical analysis. The original protocol involves clipping the desired segment from the surrounding DNA and inserting it into a bacterial or viral vector, which is then amplified millions of times in a host bacterium. With use of recombinant DNA technology, genetic engineering routinely can produce industrial quantities of pure, clinically useful products in a cost-effective manner. For diagnostic purposes, it is easier and faster to amplify a known genomic DNA sequence directly from a patient sample with PCR, but the classic approach is still applied to the construction of recombinant DNA libraries.

To be useful, a DNA library must be as complete as possible, with recombinant members, or clones, sufficiently numerous to include all the sequences in an individual genome. For certain kinds of gene-linkage analysis that require long, uninterrupted stretches of DNA, special vectors, such as bacterial or yeast artificial chromosomes, can carry foreign DNA fragments of enormous lengths. Chromosomal segments represented in genomic DNA libraries can contain the structure of an entire gene, including the information that regulates its expression, and formed the starting material for sequencing the human genome.

Many genes associated with cancer originally were identified using partial DNA libraries, which contain only the DNA sequences transcribed by a particular tissue or type of cell. The starting material in this case is mRNA. For cloning purposes, the enzyme reverse transcriptase can convert mRNA into complementary DNA (cDNA). The number of clones in a cDNA library is much smaller than in a genomic library, because a cDNA library represents only the genes expressed by the tissue of interest and contains exclusively the coding portion of genes. This technique has, however, become obsolete for organisms whose genome has now been fully sequenced. New advances in PCR chemistry allow for the direct cloning of increasingly larger cDNA fragments with high specificity and low error rates. Highly accurate PCR technology, coupled with the constantly evolving generation of genomic sequence maps in humans and model organisms, has exponentially expanded the availability of candidate genes to be tested in cancer biology.

Losing Control of the Genome

Mutations that lead to oncogenic transformation of a cell invariably affect the expression of the cell’s genetic information that specifies functional products—either RNA molecules or proteins used for various cellular functions. The primary level of gene control is the transcription of DNA into RNA. Gene regulation, or the control of RNA synthesis, represents a complex process that itself is a frequent target of neoplastic mutation.

DNA regulatory sequences do not encode a product, and yet without them, a cell could not coordinate the expression of the tens of thousands of genes in its nucleus, select only certain genes for expression, and activate or repress them in response to precise internal or external signals. These control centers of the genome contain binding sites for multiple proteins, called transcription factors, which interact to form regulatory networks that control gene transcription. Their function can be altered by signals that induce modifications such as phosphorylation or by interactions with other regulators such as steroid hormones. Many of the cell’s responses to a wide variety of external stimuli, such as neurotransmitters, antigens, cytokines, and growth factors, are mediated through transcription factors binding to DNA regulatory sequences.

Certain regulatory DNA sequences common to many genes are positioned upstream of the transcription start site (Fig. 1-6). Collectively called the “promoter” of a gene, these proximal sequences constitute binding sites for the RNA polymerase and its numerous cofactors. Whereas the position of the promoter with regard to the transcription start site is relatively inflexible, other DNA regulatory elements, known as enhancers, occur in unpredictable locations, often at a considerable distance from the genes they control. Some transcription factors bind to particular regions of enhancers and drive their associated genes in many types of cells, whereas others, which are active in only a limited variety of cells, maintain a tissue-specific pattern of gene expression. Enhancers often are responsible for the aberrant expression of genes induced by the chromosomal translocations associated with specific forms of cancer; for example, a normally quiescent gene promoting cell growth that is dislocated to a position near a strong enhancer may be activated inappropriately, resulting in loss of control of growth.

Figure 1-6 **Mammalian gene structure and expression.** The DNA sequences that are transcribed as RNA are collectively called the gene and include exons (expressed sequences) and introns (intervening sequences). Introns invariably begin with the nucleotide sequence GT and end with AG. An AT-rich sequence in the last exon forms a signal for processing the end of the RNA transcript. Regulatory sequences that make up the promoter and include the TATA box occur close to the site where transcription starts. Enhancer sequences are located at variable distances from the gene. Gene expression begins with the binding of multiple protein factors to enhancer sequences and promoter sequences. These factors help form the transcription-initiation complex, which includes the enzyme RNA polymerase and multiple polymerase-associated proteins. The primary transcript (pre-messenger RNA [mRNA]) includes both exon and intron sequences. Posttranscriptional processing begins with changes at both ends of the RNA transcript. At the 5′ end, enzymes add a special nucleotide cap; at the 3′ end, an enzyme clips the pre-mRNA about 30 base pairs after the AAUAAA sequence in the last exon. Another enzyme adds a polyA tail, which consists of up to 200 adenine nucleotides. Next, spliceosomes remove the introns by cutting the RNA at the boundaries between exons and introns. The process of excision forms lariats of the intron sequences. The spliced mRNA is now mature and can leave the nucleus for protein translation in the cytoplasm. (From Rosenthal N. Regulation of gene expression. N Engl J Med 1994;331:931–2.)

Enhancers and promoters have been assigned specific roles by means of cell culture assays or in transgenic animals in which putative regulatory DNA sequences are linked to test or “reporter” genes, and they are examined for their ability to activate expression of the reporter gene in response to the appropriate signals. By assessing the effects of deleting, adding, or changing DNA sequences within the regulatory element, the precise nucleotides that are critical for recognition by transcription factors can be determined.

The interaction between protein and DNA increasingly is being used to identify transcription factor binding sites in a regulatory region. Whereas electrophoretic mobility shift assays and DNA footprinting were once standard techniques for determining protein-DNA interactions, emerging genome-wide technologies, such as chromatin immunoprecipitation (ChIP) on microarray chip (ChIP-chip) and ChIP on sequencing (ChIP-seq), are revolutionizing the way in which we see the interaction of a transcription factor complex with virtually all of its potential genomic targets in a particular cell state. These strategies use antibodies specific to a candidate protein to pull down the DNA targets that the protein binds. These targets are then identified with the use of microarray (ChIP-chip) or next-generation sequencing (ChIP-seq) technologies (see Fig. 1-14).

Our appreciation of oncogenic perturbations, whether through mutation of genes encoding regulatory proteins, loss of controlled signaling by cell cycle switches, or changes in the target sequences that these proteins recognize, recently has extended to include posttranslational modifications that control protein activity, such as phosphorylation, ubiquitylation, and SUMOylation. Tumor-associated changes in these modifications underscore the multiple levels of control necessary to ensure the correct gene expression so central to the normal function of the cell.

Epigenetics and Cancer

Epigenetics refers to the general control of gene expression that is inherited during cell division, although it is not part of the DNA sequence itself. Epigenetic regulation involves changes in chromatin, a higher order building block of chromosomes that wraps DNA into coils with scaffolding proteins such as histones. Histones are a necessary component of chromosomal compaction and also play a critical role in gene accessibility (Fig. 1-7). Active genetic loci are associated with loosely configured euchromatin, whereas silent loci are condensed in heterochromatin. The state of chromatin configuration (euchromatin or heterochromatin) both controls and is controlled by patterns of histone modifications such as methylation and acetylation on specific DNA sequences. This pattern relates the underlying genetic information to its higher-order structure that determines whether a particular gene regulatory element is available to transcription factors (on or off status). These epigenetic modifications of the nuclear environment that determine the accessibility of a gene can persist during cell division, because inherited epigenetic patterns provide permanent marks for altered chromatin configuration in daughter cells. The pattern of modifications generated by the epigenetic code rivals the complexity of the DNA code itself.

Figure 1-7 **Chromatin packaging of DNA.** The 2 m of DNA in every human cell must be compressed in the nucleus, reaching compaction ratios of 1 : 400,000. This level of compaction is achieved by wrapping the DNA (*blue*) around histone protein complexes (*green*), thus forming nucleosomes that are connected by a thread of free linker DNA. Each nucleosome, together with its linker, packages about 200 base pairs (66 nm) of DNA. The nucleosomes are then coiled into chromatin, a rope of nucleoprotein about 30 nm thick (*bottom left electron micrograph*). To allow DNA to be accessed by transcription and replication apparatus, chromatin is relaxed (*bottom right electron micrograph*). (Courtesy Jakob Waterborg. Copyright 1998 Jakob Waterborg.)

Recent research has linked rearrangement of chromatin and associated DNA methylation with the inactivation of tumor suppressor genes and neoplastic transformation. Defects that could lead to cancer involve perturbations in the “epigenotype” of a particular locus through the silencing of normally active genes or the activation of normally silent genes, which are associated with changes in DNA methylation, histone modification, and chromatin proteins (Fig. 1-8). Changes in the number or density of heterochromatin proteins associated with cancer-related genes such as EZH2 or of euchromatic proteins such as trithorax in persons with leukemia also can be associated with abnormal patterns of methylation in gene promoter regions, as well as with higher order chromosomal structures that are only beginning to be understood. Finally, it is increasingly evident that interactions between the “epigenome,” the genome, and the environment are common targets for mutation and can have profound effects on the gene expression readout of a cancer cell.

Figure 1-8 **Gene accessibility through epigenetics.** The illustration depicts known and possible defects in the epigenome that could lead to disease. A, X is a transcriptionally active gene with sparse DNA methylation (*magenta circles*), an open chromatin structure, interaction with euchromatin proteins (*green protein complex*) and histone modifications such as H3K9 acetylation and H3K4 methylation (*green circles*). Y is a transcriptionally silent gene with dense DNA methylation, a closed chromatin structure, interaction with heterochromatin proteins (*red protein complex*), and histone modifications such as H3K27 methylation (*pink circles*). B, The abnormal cell could switch its epigenotype through the silencing of normally active genes or activation of normally silent genes, with the attendant changes in DNA methylation, histone modification, and chromatin proteins. In addition, the epigenetic lesion could include a change in the number or density of heterochromatin proteins in gene X (such as EZH2 in persons with cancer) or euchromatic proteins in gene Y (such as trithorax in persons with leukemia). There also may be an abnormally dense pattern of methylation in gene promoters (shown in gene X) and an overall reduction in DNA methylation (shown in gene Y) in persons with cancer. The insets show that the higher-order loop configuration may be altered, although currently such structures are only beginning to be understood.

Profiling Tumors

Monitoring global gene expression patterns of cells represents one of the latest breakthroughs in the development of a molecular taxonomy of cancer. Although classic blotting and probe hybridization techniques (e.g., “Northern blot”) are still reliable ways to monitor expression of individual genes, they have limitations, such as unequal hybridization efficiency of individual probes, limited sensitivity for low-copy or small transcripts, and difficulty in analyzing a large number of RNA targets simultaneously. For cancer studies, it is important to be able to compare the expression pattern of all known RNAs, including noncoding RNAs, between cancer cells and normal cells. Thus new genome-wide analytic techniques are the state-of-the-art choice to detect mRNA expression profiles at a single point in time or cell state. Genome-wide profiling of gene expression in tumors delivers an unprecedented view into the biological processes underlying tumor progression by following the changes in a tumor cell’s transcriptional landscape.

By relying on two-color fluorescence-based microarray technology (DNA microarray), simultaneous evaluation of thousands of gene transcripts and their relative expression can provide a snapshot of the “transcriptome,” the full complement of RNA transcripts produced at a specific time during the progression of malignancy.

Transcriptional profiling with use of microarrays typically involves screens of mRNA expression from two sources (such as tumor and normal cells), using cDNA or oligonucleotide libraries that are arranged in extremely high density on microchips. These microchips are probed with a mixture of fluorescently tagged cDNAs generated from the tumor and normal samples, which results in differential staining of each gene spot. The relative intensity of the two different colors reflects the RNA expression level of each gene, as analyzed with a laser confocal scanner (Fig. 1-9). With use of microarrays, single genes that constitute diagnostic, prognostic, or therapeutically relevant markers can be systematically monitored. Alternatively, the entire set of expressed genes can be collectively analyzed using powerful statistical methods to classify tumors by their transcriptional profile. Microarray analysis already has dramatically improved our ability to explore the genetic changes associated with cancer etiology and development and is providing new tools for disease diagnosis and prognostic assessment. For example, DNA microarray analysis of multiple primary breast tumor transcriptomes has revealed a reproducible 70-gene expression signature recently cleared by the U.S. Food and Drug Administration for a PCR-based application in which expression analysis of a relatively small gene group can predict the prognosis of early-stage breast cancers. When applied on a larger scale, these assays can predict response to chemotherapy or optimize pharmaceutical intervention by targeting therapeutic approaches to specific patient populations and, ultimately, to individualized therapy.

Figure 1-9 **Microarray-based expression profiling of tumor tissue. A,** Reference RNA and tumor RNA are labeled by reverse transcription with different fluorescent dyes (*green* for the reference cells and *red* for the tumor cells) and hybridized to a complementary DNA (cDNA) microarray containing robotically printed cDNA clones. B, The slides are scanned with a confocal laser scanning microscope and color images are generated with RNA from the tumor and reference cells for each hybridization. Genes upregulated in the tumors appear red, whereas those with decreased expression appear green. Genes with similar levels of expression in the two samples appear yellow. Genes of interest are selected on the basis of the differences in the level of expression by known tumor classes (e.g., BRCA1-mutation–positive and BRCA2-mutation–positive). Statistical analysis determines whether these differences in the gene expression profiles are greater than would be expected by chance. C, The differences in the patterns of gene expression between tumor classes can be portrayed in the form of a color-coded plot, and the relations between tumors can be portrayed in the form of a multidimensional-scaling plot. Tumors with similar gene-expression profiles cluster close to one another in the multidimensional-scaling plot. D, Particular genes of interest can be further studied through the use of a large number of arrayed, paraffin-embedded tumor specimens, referred to as tissue microarrays. E, Immunohistochemical analyses of hundreds or thousands of these arrayed biopsy specimens can be performed to extend the microarray findings. (From Hedenfalk I, Duggan D, Chen Y, et al. Gene expression profiles in hereditary breast cancer. N Engl J Med 2001;344:539–48.)
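The two-color readout described above is commonly summarized per spot as a log ratio of the tumor (red) signal to the reference (green) signal. The Python sketch below is a hedged illustration of that summary: the pseudocount, the twofold threshold, and the function names are assumptions made for the example, not values prescribed by the text or by any particular analysis package.

```python
import math

def log2_ratio(tumor: float, reference: float, pseudocount: float = 1.0) -> float:
    """Log2 of tumor over reference intensity; the pseudocount avoids division by zero."""
    return math.log2((tumor + pseudocount) / (reference + pseudocount))

def spot_color(ratio: float, threshold: float = 1.0) -> str:
    """Map a log ratio to the colors used in Figure 1-9 (threshold is an assumption)."""
    if ratio >= threshold:
        return "red"      # higher expression in the tumor
    if ratio <= -threshold:
        return "green"    # lower expression in the tumor
    return "yellow"       # similar expression in both samples

if __name__ == "__main__":
    for tumor, reference in [(799, 199), (199, 799), (500, 480)]:
        ratio = log2_ratio(tumor, reference)
        print(tumor, reference, round(ratio, 2), spot_color(ratio))
```

Working on a log scale makes a twofold increase and a twofold decrease symmetric around zero, which simplifies the downstream statistical comparison of tumor classes.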

Recently, a novel high-throughput approach for global transcriptome analysis has been made possible by advances in strategies that allow mass sequencing of DNA fragments. With use of this technique, called RNA-seq, it is now possible to obtain a comprehensive and unbiased analysis of all mRNA transcripts present in cells or tissues (Fig. 1-10). The technique relies on the generation of small fragments of cDNA from any RNA sample, followed by sequencing of these expressed tags from one end (single-end sequencing) or both ends (paired-end sequencing), resulting in reads of 30 to 400 bp. The resulting sequences then can be mapped against the known reference genome or transcriptome of a certain species. Unlike microarray analysis of preselected gene sets, RNA-seq allows the unbiased identification of all genes, or even the presence of different isoforms, expressed in the sample, allowing a comprehensive comparison of transcript levels between normal cells and cancer cells.

Figure 1-10 **Methods for high throughput transcriptome analyses. A,** Schematics of regular protocol for RNA-seq sample preparation, showing poly-A tail–specific messenger RNA (mRNA) isolation followed by fragmentation of RNA into smaller regions, further used for complementary DNA (cDNA) conversion. Polymerase chain reaction fragments are then tethered by adaptors, sequenced by synthesis, and aligned to the reference genome or transcriptome to calculate the relative prevalence of mRNAs (RPKM*). B, Target fragments can be used to map exon-intron boundaries and thus to infer the presence of, and quantify, different mRNA isoforms in the sample of interest, as shown for the muscle-specific gene *Myf6* in this example. C, Data generated using this method also can be compared with analysis of other tissues or samples, allowing assessment of relative quantification of targets, as exemplified here for a highly specific gene (*orange peaks*) for muscle samples. (From Mortazavi A, Williams BA, McCue K, et al. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat Methods 2008;5:621–8.) *Reads per kilobase of exon model per million mapped reads. The RPKM measure of read density reflects the molar concentration of a transcript in the starting sample by normalizing for RNA length and for the total read number in the measurement, as described by Mortazavi et al.
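The RPKM normalization named in the caption follows directly from its definition: reads mapped to a transcript, divided by the exon length in kilobases and by the total mapped reads in millions. A minimal Python sketch is shown below; the function name and the example numbers are illustrative only.

```python
def rpkm(reads_on_gene: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon model per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # 1e9 combines the per-kilobase (1e3) and per-million (1e6) scaling factors.
    return reads_on_gene * 1e9 / (exon_length_bp * total_mapped_reads)

if __name__ == "__main__":
    # Hypothetical gene: 500 reads on a 2,000-bp exon model, 10 million mapped reads.
    print(rpkm(500, 2000, 10_000_000))
```

Dividing by transcript length prevents long genes from appearing more abundant simply because they yield more fragments, and dividing by the total read count makes samples sequenced to different depths comparable.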

The aforementioned technologies can be applied to the analysis of noncoding RNA species as well. Besides the 20,000 protein-coding transcripts used to classify a wide variety of human tumors, hundreds if not thousands of small, noncoding interfering RNA species recently have been discovered with critical functions in multiple biological processes, many of which are directly or indirectly involved in the control of cell proliferation. Known as microRNAs (miRNAs), these short transcripts arise from primary genome-encoded transcripts of variable sizes that are first processed into 70- to 100-nucleotide hairpin-shaped precursors and then cleaved into mature miRNAs of 21 to 23 nucleotides (Fig. 1-11). miRNAs function by base pairing with target mRNAs to inhibit translation and/or promote mRNA degradation. In the context of cancer, miRNAs may act in concert with other effectors such as p53 to inhibit inappropriate cell proliferation. A global decrease in miRNA levels often is observed in human cancers, indicating that small RNAs may have an intrinsic function in tumor suppression. The utility of monitoring the expression of miRNAs in human cancer is just now being explored, but preliminary findings reveal an extraordinary level of diversity in miRNA expression across cancers and a large amount of diagnostic information encoded in a relatively small number of miRNAs. Significant technologic advances in profiling miRNA expression patterns in normal and cancer tissues suggest that miRNA expression signatures may, unexpectedly, classify cancer types more reliably than the corresponding signatures of protein-coding genes. Along with their potential diagnostic value, miRNAs also are being tested for their prognostic use in predicting clinical behaviors of patients with cancer.

Figure 1-11 **Micro RNA (microRNA) production and gene regulation in animal cells.** Mature functional microRNAs of approximately 22 nucleotides are generated from long primary microRNA (*pri-microRNA*) transcripts. First, the pri-microRNAs, which usually contain a few hundred to a few thousand base pairs, are processed in the nucleus into stem-loop precursors (*pre-microRNA*) of approximately 70 nucleotides by the RNase III endonuclease Drosha and DiGeorge syndrome critical region gene 8 (*DGCR8*). The pre-microRNAs are then actively transported into the cytoplasm by exportin 5 and Ran-GTP and further processed into small RNA duplexes of approximately 22 nucleotides by the Dicer RNase III enzyme and its partner Loquacious (*Loqs*), a homologue of the human immunodeficiency virus transactivating response RNA-binding protein. The functional strand of the microRNA duplex is then loaded into the RNA-induced silencing complex (*RISC*). Finally, the microRNA guides the RISC to the target messenger RNA (*mRNA*) for translational repression or degradation of the mRNA. (Adapted from Chen C-J. MicroRNAs as oncogenes and tumor suppressors. N Engl J Med 2005;353:1768–71. Copyright 2005 Massachusetts Medical Society.)
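The base pairing between a microRNA and its target mRNA can be illustrated with a minimal Python sketch. It assumes a common simplification that is not stated in the text: that perfect Watson-Crick complementarity to nucleotides 2 through 8 of the microRNA (the "seed") marks a candidate site. The sequences, function names, and the seed rule itself are illustrative assumptions; real target prediction weighs many additional features.

```python
# Watson-Crick partners for RNA bases.
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence written 5'->3'."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna))


def seed_sites(mirna, mrna, seed_start=2, seed_end=8):
    """Find 0-based mRNA positions that pair perfectly with the seed.

    The seed is taken as microRNA nucleotides seed_start..seed_end
    (1-based, inclusive); this window is an assumed heuristic.
    """
    seed = mirna[seed_start - 1:seed_end]
    site = reverse_complement(seed)  # what the mRNA must contain, 5'->3'
    positions = []
    index = mrna.find(site)
    while index != -1:
        positions.append(index)
        index = mrna.find(site, index + 1)  # allow overlapping matches
    return positions


if __name__ == "__main__":
    # Illustrative input sequences only, both written 5'->3'.
    mirna = "UGAGGUAGUAGGUUGUAUAGUU"
    mrna = "AAACUACCUCAGGGCUACCUCUU"
    print(seed_sites(mirna, mrna))
```

A single microRNA can match several sites on one transcript and sites on many different transcripts, which is one reason a relatively small number of microRNAs can influence large sets of genes.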

Because probe specificity in miRNA microarray analysis is problematic as a result of the small target size, hybridization can be performed first in solution and then quantified using multicolor flow sorting. Real-time PCR also can be used to quantify specific miRNA sets or to capture a more detailed picture of their changing expression profiles in tumor progression. Identification of the miRNAs involved in tumor pathogenesis and elucidation of their action in a specific cancer will be the next necessary steps for their manipulation in a therapeutic setting.

Recent advances in this field have revealed that miRNAs also are involved in cancer initiation and progression, and specific modulation of such RNAs may serve as a therapeutic strategy. Inhibition of key miRNAs using antagomirs (a class of chemically modified anti-miRNA oligonucleotides) has been effective in suppressing tumor growth in mouse models. It remains to be seen if these results can be extended to the treatment of cancer in the clinic, but interference with miRNA function is an attractive new tool for the development of cancer therapies.

The Cancer Proteome

The term “proteome” describes the entire complement of proteins expressed by the genome of a cell, tissue, or organism. More specifically, it is used to describe the set of all the expressed proteins at a given time point in a defined setting, such as a tumor. Like RNA transcription, the synthesis of proteins is a highly regulated process that contributes to the specific proteome of a particular cell and can be perturbed in diseases such as cancer.

During the past decade, protein analytic techniques have progressed to the point that even small numbers of specific proteins expressed in tissues can be used to predict the prognosis of a cancer. The improvement of protein-based assays has made it possible to identify and examine the expression of most proteins and to envision large-scale protein analysis on the level of gene-based screens. Various systematic methodologies have contributed to the current explosion of information on the proteome. These methodologies are now being compared for their suitability as platforms for the generation of databases on protein structural features, interaction maps, activity profiles, and regulatory modifications.

The yeast two-hybrid system has been a popular genetics-based approach for detecting protein-protein interactions inside a cell (Fig. 1-12). One protein fused to the DNA-binding domain of a transcriptional activator (bait) and a different protein fused to the activation domain of a transcriptional activator (prey) are expressed together in yeast cells. If the bait and prey interact, transcription of a reporter gene is induced and detected, typically by a color reaction that reflects the transactivation of the reporter gene and, by proxy, the interaction of the two test proteins. The method also can be used for large-scale screens of protein interactions, to determine RNA-protein interactions, and to detect protein-ligand binding.

Figure 1-12 **Exploring protein-protein interactions with the yeast two-hybrid system.** Two-hybrid technology exploits the fact that transcriptional activators are modular in nature. Two physically distinct functional domains are necessary to get transcription: a DNA-binding domain that binds to the DNA of the promoter and an activation domain that binds to the basal transcription apparatus and activates transcription. A, The known gene encoding protein A is cloned into the “bait” vector, fused to the gene encoding a DNA-binding domain from some transcription factor. When placed into a yeast system with a reporter gene, this fusion protein can bind to the reporter gene promoter, but it cannot activate transcription. B, Separately, a second gene (or a library of complementary DNA fragments encoding potential interactors), protein B, is cloned into the “prey” vector, fused to an activation domain of a different transcription factor. When placed into a yeast strain containing the reporter gene, it cannot activate transcription because it has no DNA-binding domain. C, When the two vectors are placed into the same yeast, a transcription factor is formed that can activate the reporter gene if protein B, made by the second plasmid, binds to protein A. D, Screening a yeast two-hybrid library. The plate on the left holds 96 different yeast strains in patches (or colonies), each of which expresses a different bait protein (*top*). The plate on the right holds 96 patches, each of the same yeast strain (prey strain) that expresses a protein fused to an activation domain (prey). The plate of bait strains and the plate of prey strains are each pressed to the same replica velvet, and the impression is lifted with a plate containing yeast extract peptone dextrose (*YPD*) medium. 
After one day of growth on the YPD plate, during which time the two strains mate to form diploids, the YPD plate is pressed to a new replica velvet, and the impression is lifted with a plate containing diploid selection medium and an indicator such as X-Gal. Blue patches (*dark spots*) on the X-Gal plate indicate that the lacZ reporter is transcribed, suggesting that the prey interacts with the bait at that location. (C from http://www.nature.com/nature/journal/v403/n6770/full/403601a0.html. D from Bartel PL, Fields S, editors. The yeast two-hybrid system, New York: Oxford University Press, 1997.)

As a complementary proteomics tool, mass spectrometry provides an accurate mass measurement of charged peptides, such as those isolated by two-dimensional gel electrophoresis. The instrument measures the mass-to-charge ratio of ionized samples under vacuum, which can be used to determine the sequence identity of peptides. Combined with a specific proteolytic cleavage step, mass spectrometry can be used for peptide mass mapping. Automation of this process has made mass spectrometry the analytic tool of choice for many proteomics projects.
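Peptide mass mapping can be sketched in a few lines of Python: a protein sequence is cleaved in silico, and the expected mass-to-charge ratio of each peptide is computed for comparison with a measured spectrum. The choice of trypsin (cleavage after lysine or arginine, except before proline) is an assumption made here for illustration, because the text specifies only "a specific proteolytic cleavage step". The function names and the example sequence are likewise hypothetical; the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (standard reference values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free N and C termini
PROTON = 1.007276  # mass of each charge-carrying proton


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (assumed trypsin rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def mass_to_charge(mass, charge=1):
    """m/z of a peptide carrying `charge` protons."""
    return (mass + charge * PROTON) / charge


if __name__ == "__main__":
    # Hypothetical sequence used only to exercise the functions.
    for peptide in tryptic_digest("AAKGGRPLKC"):
        mass = peptide_mass(peptide)
        print(peptide, round(mass, 4), round(mass_to_charge(mass, 1), 4))
```

In practice the list of predicted values would be compared, within an instrument-dependent tolerance, against the peaks observed for the digested sample; the sketch ignores missed cleavages and chemical modifications.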

Monoclonal antibodies (mAbs) have been a cornerstone of protein analysis in cancer research and more recently have risen to prominence as cancer therapeutics based on their exquisite specificity for protein targets and their potent interference with protein function. Novel strategies have been developed that not only target antigens highly expressed in cancer cells but also enhance the innate immune response against cancer cells. These antibodies can act via several mechanisms, including antibody-dependent cellular cytotoxicity, complement-mediated cytotoxicity, and antibody-dependent cellular phagocytosis (Fig. 1-13). Laboratory mice have been the animal model of choice for generating a ready source of diverse, high-affinity, and high-specificity mAbs; however, the use of rodent antibodies as therapeutic agents has been restricted by the inherent immunogenicity of mouse proteins in a human setting. The more recent application of transgenic mouse technology to introduce variable regions encoded by human sequences into the corresponding mouse immunoglobulin genes has enabled the generation of “humanized” therapeutic mAbs with reduced immunogenicity. In addition, the generation of bispecific antibodies with dual affinity for tumor antigens, such as TriomAb, has been shown to effectively kill tumor cells by inducing memory T-cell protective immunity. Besides the expected use of mAbs directed to extracellular epitopes (protein regions recognized by the antibody), evidence from mouse models has raised the possibility of using antibodies targeting intracellular epitopes for anticancer therapies. Targeting such antigens would enrich immunotherapy, allowing the use of tumor-specific intracellular mediators of cell survival and proliferation. Numerous mAb-based agents are currently in trial or in use as therapeutics for cancer, and the potential for further optimization of mAbs through genetic engineering promises to open new avenues for in vivo therapy.

Figure 1-13 **Mechanisms for antibody-based therapies used against cancer cells.** Multiple current approaches involve direct cytotoxicity, Fc-mediated immune effector engagement, nonrestricted activation of cytotoxic T-cells, and blockade of inhibitory signaling. The diverse spectrum of action of these therapies will allow the inclusion of various anticancer targets in the near future. (From Weiner LM, Murray JC, Shuptrine CW. Antibody-based immunotherapy of cancer. Cell 2012;148:1081–4.)

From an epigenetics perspective, new techniques are enabling the genome-wide characterization of protein-DNA interactions that can uncover novel transcription factor targets, histone modifications, and DNA methylation patterns within a cancer cell. Combining chromatin immunoprecipitation (ChIP) with microarray analysis (ChIP-on-chip) or with mass sequencing (ChIP-seq) allows genome-wide screening for the binding positions of protein factors on their gene targets. In both assays, a cross-linking reagent is applied in vivo to fix proteins to the DNA with which they are associated in the nucleus; the protein-DNA complexes then can be immunoprecipitated with antibodies specific to the protein under analysis. The bound DNA and appropriate controls are then either fluorescently labeled and applied to microscope slides for microarray analysis or directly sequenced, rendering a simultaneous profile of all the binding positions of specific proteins in the cancer cell’s genome (Fig. 1-14). Global profiling of promoter occupancy in specific cancers, in which protein-DNA interaction profiles discriminate among tumors from patients with different clinical outcomes, is a promising predictive method.

Figure 1-14 **Methods for unbiased identification of transcription factor binding sites.** Chromatin immunoprecipitation followed by sequencing (ChIP-seq) and chromatin immunoprecipitation on a microarray chip (ChIP-chip) can provide location, isolation, and identification of the DNA sequences occupied by specific DNA-binding proteins in cells. Proteins capable of DNA interactions are targeted with specific antibodies. DNA and the associated proteins are cross-linked, and the DNA is sheared into fragments of 150 to 500 base pairs and immunoprecipitated. After reversal of the cross-links, DNA is isolated and either mass sequenced (ChIP-seq) or used as probes in a genomic array (ChIP-chip), and binding sites occupied by the proteins can be identified in the genome. These binding sites may indicate functions of various transcriptional regulators and help identify their target genes during development and disease progression. The types of functional elements identified using these techniques include promoters, enhancers, repressor and silencing elements, insulators, boundary elements, and sequences that control DNA replication. (From Kim TH, Ren B. Genome-wide analysis of protein-DNA interactions. Annu Rev Genomics Hum Genet 2006;7:81-102; and Liu ET, Pott S, Huss M. Q&A: ChIP-seq technologies and the study of gene regulation. BMC Biology 2010;8:56.)
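The comparison of immunoprecipitated DNA with a control sample can be sketched as a simple enrichment calculation. The Python example below is a toy illustration, not a description of any published peak-calling program: the fixed-width bins, the pseudocount, the fold-change threshold, and all read positions are assumptions chosen to keep the example small.

```python
from collections import Counter


def bin_counts(read_starts, bin_size):
    """Count reads per fixed-width genomic bin (bin index = start // size)."""
    return Counter(position // bin_size for position in read_starts)


def enriched_bins(chip_reads, control_reads, bin_size=200,
                  min_fold=2.0, pseudocount=1.0):
    """Return (start, end, fold) for bins enriched in ChIP over control.

    Control counts are scaled to the ChIP library size, and a pseudocount
    is added so that bins with no control reads do not divide by zero.
    """
    if not chip_reads or not control_reads:
        raise ValueError("both read lists must be non-empty")
    chip = bin_counts(chip_reads, bin_size)
    control = bin_counts(control_reads, bin_size)
    scale = len(chip_reads) / len(control_reads)
    hits = []
    for index in sorted(chip):
        fold = (chip[index] + pseudocount) / (control[index] * scale + pseudocount)
        if fold >= min_fold:
            hits.append((index * bin_size, (index + 1) * bin_size, round(fold, 2)))
    return hits


if __name__ == "__main__":
    # Hypothetical read start positions on a single chromosome.
    chip = [1010, 1040, 1075, 1110, 1150, 1180, 1195, 5020, 9010, 300]
    control = [310, 1120, 5030, 9040, 7000]
    print(enriched_bins(chip, control))
```

Production analyses replace the fixed threshold with a statistical model of background read density, but the underlying idea is the same: a binding site appears as a region where immunoprecipitated reads accumulate well beyond what the control predicts.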

After a decade of development, proteomics is still primarily a basic research activity, yet in the near future this technology is likely to have a profound impact on medicine. By defining the collective protein-protein interactions in a cancer cell (its “interactome”), functional relationships between disease-promoting genes may be revealed that provide novel candidates for intervention (Fig. 1-15). Networks of disorder-gene associations already are being built that offer a platform for describing all known phenotype and disease gene associations, often indicating the common genetic origin of many diseases. A precise diagnosis of cancer using proteomics could be envisioned, based on highly discriminating patterns of proteins in easily accessible patient samples. Proteomics information also promises to provide sophisticated mathematical models of the molecular events underlying a process as complex as neoplastic transformation, which will capture the dynamics of the disease with unprecedented power.

Figure 1-15 **Interactome networks and human disease.** Networks are integrated sources of information obtained from biochemical, molecular, proteomic, and other high-throughput analyses. Different networks can be obtained for each organism, organ, or cell. In the first instance, central regulatory “nodes” identify important components in the network. These networks and their data then can be integrated and compared with healthy and disease models, allowing an integrative view of events that is much more powerful than isolated networks. (Modified from Vidal M, Cusick ME, Barabási A. Interactome networks and human disease. Cell 2011;144:986–98.)
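One simple way to look for central nodes in an interaction network is to count each protein's interaction partners. The Python sketch below assumes that the number of partners (the node's degree) is an adequate first proxy for centrality, which is an assumption made here rather than a method stated in the text; the protein labels P1 through P5 and the interaction list are hypothetical.

```python
from collections import defaultdict


def degree_by_node(interactions):
    """Map each protein to its number of distinct interaction partners."""
    partners = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # ignore self-interactions in this simple count
        partners[a].add(b)
        partners[b].add(a)
    return {node: len(others) for node, others in partners.items()}


def hubs(interactions, top_n=1):
    """Return the top_n best-connected proteins, ties broken by name."""
    degrees = degree_by_node(interactions)
    ranked = sorted(degrees, key=lambda node: (-degrees[node], node))
    return ranked[:top_n]


if __name__ == "__main__":
    # Hypothetical pairwise interactions, e.g. from a two-hybrid screen.
    pairs = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
             ("P2", "P3"), ("P4", "P5")]
    print(degree_by_node(pairs))
    print(hubs(pairs))
```

Other centrality measures, such as how often a node lies on the shortest path between two others, can rank proteins differently, so a degree count is best read as a starting point for prioritizing candidates.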

Modeling Cancer in Vivo

Once the mechanistic underpinnings of a particular cancer have been described, creating an animal model to test that mechanism becomes critical to understanding the pathophysiology and designing therapeutic strategies for treatment. Recent advances in manipulating the mouse genome have resulted in more sophisticated models of human cancer. These methodologies can circumvent embryonic death by altering gene expression only after a critical period in development, and they reduce the complexity of gene functional analysis by restricting the pattern of gene activation. Inducible gene expression or silencing also allows acute effects, as opposed to chronic effects, to be assessed. Although species differences in tumor susceptibility and disease remission exist between mice and humans, the tools for genetic manipulation in mice are superior to those in other mammals, and useful information about the function of oncogenes can be gained by targeted expression of mutant protein products in mouse tissues.

Transgenic Models of Cancer

Integrating an oncogene that causes malignancy into the genome of a mouse without altering the mouse’s own genes generates a transgenic, cancer-prone mouse that transmits this trait to its offspring with a dominant pattern of inheritance. The technology for producing transgenic mice joins recombinant DNA methodology with standard techniques that are used today by in vitro fertilization clinics, relying on our understanding of mammalian reproduction and the development of protocols to harvest, manipulate, and reimplant eggs and early embryos (Fig. 1-16). The transgene is constructed so that the gene product will be expressed under appropriate spatial and temporal control. In addition to all the standard signals necessary for efficient transcription and translation of the gene, transgenes contain a promoter, or regulatory region, that drives transcription in either a ubiquitous or tissue-restricted pattern. This process requires an extensive knowledge of genetic regulation in the target cells. A recent advance that circumvents this requirement involves embedding the transgene inside another gene locus that is expressed in the desired pattern. Held in a bacterial artificial chromosome for easier manipulation, this long stretch of DNA surrounding the host gene is likely to carry all the necessary regulatory information to guarantee a predictable expression pattern of the introduced transgene.