The Technology of Analyzing Nucleic Acids in Cancer

Published on 09/04/2015 by admin

Filed under Hematology, Oncology and Palliative Medicine

Last modified 22/04/2025

Figure 23-1 PCR and capillary sequencing of exons. A generalized workflow is shown for the PCR-based amplification of specific exons in the human genome, their sequencing, and separation of the nucleotide sequence by a fluorescent capillary sequencing approach. Nucleotide changes can be identified by the appropriate software, such as PolyPhred, which was widely used for analysis of capillary data. A single base substitution mutation is shown in the trace data example. Pros and cons of the approach are listed and described in the text.
Figure 23-2 Next-generation sequencing (NGS) of whole genomes. A generalized workflow for the production of whole-genome sequencing data by next-generation or massively parallel sequencing is shown. Several pros and cons of this approach are listed and described in the text. Computational identification of the genome-wide differences between tumor and normal genomes requires highly specialized pipelines for each variant type (point mutation, copy number alteration, insertion/deletion variant, structural variant).
Figure 23-3 Read placement distance and orientation are indicative of structural variation of various types. In each figure, the paired end mapping (PEM) orientation and distance on the reference genome (Ref) of the NGS data is shown relative to the short read mapping (SRM) of the experimental data (Exp). (A) The anticipated read mapping distance and orientation are shown. (B) A deletion in the experimental genome. (C) An insertion. (D) An inversion. (E) A complex rearrangement. (F) A translocation. With permission from Quinlan AR, Hall IM. Characterizing complex structural variation in germline and somatic genomes. Trends Genet. 2012;28:43-53.

Challenges to NGS Analysis of Cancer Nucleic Acids

The search for somatic variation in cancer DNA and RNA has a distinct advantage over other complex diseases: The exact comparison of tumor to normal nucleic acids within an individual patient distinctly identifies those alterations that are tumor unique. Furthermore, there are increasing amounts of data from various projects that have begun using NGS methods to catalog large numbers of cancer cases across different tumor types (ICGC [icgc.org], TCGA [cancergenome.nih.gov], PCGP [www.pediatriccancergenomeproject.org]) 21 that can be used to inform individual analyses about previously described alterations. In spite of these decided advantages, there are several significant challenges that confound experimental design and analytical approaches in cancer genomics studies. Several examples of these challenges are described next, along with the ways researchers attempt to overcome them, where applicable.

Tumor Cellularity

Cancerous cells in solid tumors do not exist in isolation in the body. Rather, they are always in close proximity to normal cells of various types, including stromal cells and immune cells, and to the components of the extracellular matrix (ECM). The proportion of tumor cells can be estimated by an experienced pathologist examining the tumor section under hematoxylin and eosin staining, and this estimate is expressed as a “percent tumor nuclei” or “percent tumor cellularity” value. As a result of the association of tumor and normal cells, an isolate of DNA or RNA derived from a solid cancer sample will contain nucleic acids from both tumor and normal cells unless a specific procedure such as flow cytometry or laser capture microdissection (LCM) is used first to significantly enrich the percentage of tumor cells in the isolate. Also, certain tumor types, such as those from prostate or pancreas, are more prone to low tumor cellularity. Based on the pathology estimate, decisions in sequencing must be made in the context of tumor cellularity percentages. Namely, if the tumor cellularity is below 60%, the decision must be made either to enrich the tumor by flow cytometry (more common for blood cancers such as lymphoma or leukemia) or by LCM (used for solid tumors), or to try oversampling the tumor NGS library (increased sequencing coverage) by an amount commensurate with the tumor cellularity estimate. Although sorting or LCM seems the most obvious choice, one limitation of either approach is that significantly reduced yields of DNA or RNA will be obtained. Unless specialized procedures are in hand, the low yield may limit the ability to derive high-quality data from such samples. By contrast, oversampling may be effective for DNA sequencing but will be more expensive to generate and will require adjustment of variant calling parameters, or use of a more sensitive variant caller, to effectively identify somatic variants. Oversampling for RNA-seq from a sample with low tumor cellularity is generally not advised, as the tumor transcripts will be too difficult to discern from those of the normal cells unless LCM or sorting is first used to separate the tumor cells from the adjacent normal/nonmalignant cells.
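
The oversampling decision described above can be illustrated with a rough calculation. The sketch below is not a published formula: it assumes that reads are drawn from tumor and normal cells in proportion to the tumor cellularity estimate and that both are diploid at the locus of interest. The function names are hypothetical.

```python
def required_total_coverage(target_tumor_coverage: float,
                            tumor_cellularity: float) -> float:
    """Total depth needed so that tumor-derived reads alone reach the target.

    Assumption: reads are sampled in proportion to the fraction of tumor nuclei.
    """
    if not 0.0 < tumor_cellularity <= 1.0:
        raise ValueError("tumor_cellularity must be in (0, 1]")
    return target_tumor_coverage / tumor_cellularity


def expected_het_vaf(tumor_cellularity: float) -> float:
    """Expected variant allele fraction of a clonal heterozygous somatic
    mutation in a diploid region: one mutant allele out of two per tumor cell,
    diluted by reads from normal cells."""
    if not 0.0 <= tumor_cellularity <= 1.0:
        raise ValueError("tumor_cellularity must be in [0, 1]")
    return tumor_cellularity / 2.0


if __name__ == "__main__":
    for cellularity in (0.9, 0.6, 0.3):
        depth = required_total_coverage(30, cellularity)
        vaf = expected_het_vaf(cellularity)
        print(f"cellularity={cellularity:.0%} depth={depth:.0f}x het VAF={vaf:.2f}")
```

Under these assumptions, a sample at 30% tumor cellularity needs roughly three times the depth of a pure sample, and a heterozygous mutation is expected in only about 15% of reads, which is why the variant caller's sensitivity settings also need adjusting.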

Heterogeneity (Regional versus Genotypic)

Heterogeneity is a fundamental aspect of the cancer cells found within a single tumor, and it takes two forms: regional and genotypic. Regional heterogeneity reflects the differences that emerge in solid tumors as they grow and progress. It refers to the different regions present in a tumor mass, such as areas of necrosis or areas of invasion (of surrounding normal tissue). Genotypic heterogeneity reflects the fact that cancer cells evolve during the process of tumor progression, so that not all tumor cells share the same somatic genotype. For genotypic heterogeneity, NGS has demonstrated that by comparing the genomes from progression samples (a de novo leukemia compared to its relapse) using high-depth sequencing of somatic mutations, an initiating or “founder” clone can be identified that contains the core mutational load that initiates tumor growth, as well as more advanced clones that combine newer mutations with those in the founder clone. 22 One shared aspect of regional and genotypic heterogeneity is that both become more likely as a tumor mass increases in size, and areas of regional heterogeneity are likely to harbor genotypic heterogeneity as well. So far only two studies have examined this at the DNA level: one study of two advanced-stage renal cell carcinomas that exhibited extreme genotypic heterogeneity 23 and one study of five early-stage (2/3) breast cancers that showed little to no genotypic heterogeneity 24 when sampled and studied at multiple sites.

Ploidy and Copy Number Alterations in DNA

Altered numbers of chromosomes (more or fewer than 2) have been widely observed in cancer cells, likely reflecting errors in chromosomal segregation that occur during rapid division and growth cycles. Observing ploidy alterations requires cytogenetic examination of the tumor cells in metaphase, which may or may not be part of the pathology-based diagnosis for the patient sample. Alternatively, ploidy alterations and large chromosomal arm or subarm amplifications and deletions (somatic copy number alterations or SCNAs) can be inferred from signal strength–based analysis of genotyping array data. 25,26 Ploidy increases and arm or subarm amplifications are important in DNA sequencing of the tumor because these regions will contribute more DNA to the library, and hence more reads will result than for the diploid (or haploid) regions of the genome. Thus increased coverage must be obtained for the tumor library to compensate for amplified regions or ploidy-altered chromosomes so the coverage of diploid genomic regions is sufficient for variant detection. Careful analysis of aligned reads in copy number–altered regions can provide exquisite resolution of the genes involved and of the relative timing of somatic mutation and copy number alteration when both occur in the same locus. 27
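
The effect of amplified regions on the coverage of diploid regions can be approximated as follows. This is a simplified sketch that assumes a pure tumor sample and read counts strictly proportional to copy number. The segment representation and function names are hypothetical.

```python
from typing import Iterable, Tuple

Segment = Tuple[int, float]  # (length in bp, copy number); hypothetical layout


def mean_ploidy(segments: Iterable[Segment]) -> float:
    """Length-weighted average copy number across the genome segments."""
    segments = list(segments)
    total_length = sum(length for length, _ in segments)
    if total_length <= 0:
        raise ValueError("segments must cover at least one base")
    return sum(length * copies for length, copies in segments) / total_length


def depth_in_diploid_regions(average_depth: float,
                             segments: Iterable[Segment]) -> float:
    """Depth expected in two-copy regions when the library as a whole is
    sequenced to average_depth; amplified regions absorb the remainder."""
    return average_depth * 2.0 / mean_ploidy(segments)


if __name__ == "__main__":
    # Two thirds of the genome at two copies, one third amplified to four.
    genome = [(2_000_000_000, 2), (1_000_000_000, 4)]
    print(mean_ploidy(genome))                   # about 2.67
    print(depth_in_diploid_regions(30, genome))  # 22.5
```

In this example a nominal 30-fold library yields only about 22.5-fold depth over the diploid regions, so the overall target would have to rise to about 40-fold to restore 30-fold coverage there.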

FFPE Preservation and Nucleic Acid Integrity

Most pathology assays used in cancer diagnosis and characterization require stability of proteins and cellular structure. Hence, fixation in formalin and embedding in paraffin have been the standard pathology preparation methods for more than 100 years. As this practice is unlikely to change in the near future, and because so many clinically valuable specimens already have been preserved by this method, the study of formalin-fixed, paraffin-embedded (FFPE)-preserved nucleic acid isolates by NGS methods is increasing. The chemical reaction between formaldehyde, proteins, and nucleic acids leads to crosslinking of proteins and nucleic acids, and ultimately the DNA/RNA backbone breaks because of the presence of abasic lesions. 28 This is a random interaction, and DNA/RNA fragmentation increases with longer exposure of nucleic acids to formalin and over time of storage. Therefore, the older a tumor FFPE block, the more advanced the degradation of its nucleic acid components is likely to be. Nonetheless, careful examination of the nucleic acid integrity will identify those samples suitable for library construction for DNA or RNA, based on the average size and distribution of degraded nucleic acid isolated from the sample. In DNA isolates, the average fragment size should be 300 bp or greater or a suitable NGS library is unlikely to result. In RNA, the 28S and 18S rRNA peaks should be visible by gel electrophoresis, with an RNA integrity number (RIN) of at least 5.
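
The two integrity thresholds quoted above can be written as a simple screening check. This is only a sketch: the function names are hypothetical, and a real laboratory workflow would weigh the full size distribution rather than a single number.

```python
MIN_DNA_FRAGMENT_BP = 300  # average fragment size threshold quoted in the text
MIN_RIN = 5.0              # RNA integrity number threshold quoted in the text


def dna_isolate_suitable(mean_fragment_bp: float) -> bool:
    """True if the average DNA fragment size meets the quoted threshold."""
    return mean_fragment_bp >= MIN_DNA_FRAGMENT_BP


def rna_isolate_suitable(rin: float, rrna_peaks_visible: bool) -> bool:
    """True if the 28S and 18S rRNA peaks are visible and the RIN is adequate."""
    return rrna_peaks_visible and rin >= MIN_RIN


if __name__ == "__main__":
    print(dna_isolate_suitable(450))        # True
    print(dna_isolate_suitable(180))        # False
    print(rna_isolate_suitable(6.2, True))  # True
    print(rna_isolate_suitable(6.2, False)) # False
```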

Applications of NGS to Study and Analyze Nucleic Acids

The genomic DNA isolated from cancer cell nuclei can be studied in a variety of ways, several of which are profiled here. Because cancer develops from alterations of the nuclear genome that are distinct from the germline genome, an inherent and powerful comparison can be obtained by studying the paired tumor and normal genomic DNA from individual cancer patients. In discovery efforts, large numbers of such cases can be studied to add information about the frequency of different types of somatic alterations and the genes whose protein products will be altered as a result. Higher level analyses of the pathways affected by somatic alterations in DNA can further inform the resulting tumor biology. Studies of RNA by NGS methods have deepened our understanding of the numerous types of RNAs, their membership, and how they are altered in the course of carcinogenesis, although not all alterations are comprehensible in the biological context. The latter reflects our ignorance of the many roles these molecules play in cellular biology, emphasizing the need for functional studies as a follow-on to NGS-mediated discovery efforts.

Whole-Genome Sequencing

The most comprehensive approach to identifying the somatic alterations present in cancer genomes is whole-genome sequencing (WGS) of the tumor and normal DNAs. In this approach (see Figure 23-2), the isolated high-molecular-weight genomic DNA from each tissue is fragmented by the application of high-frequency sound waves or other physical shearing methods and then enzymatically treated to blunt the fragment ends that result. Finally, short synthetic adapters are added to make a whole-genome library. After limited PCR amplification by primers that correspond to the forward and reverse adapters, a gel-based sizing allows specific size fractions to be isolated (two to four insert sizes are typical to enhance library diversity and genome representation). The more precise the size fraction, the more precisely structural variants can be identified by virtue of their relative position once mapped to the reference genome (see Figure 23-3). Libraries are then quantitated, diluted to the appropriate concentration, and amplified in situ to produce collections of fragments, each of which originated from a single library fragment. Thus the data generated from WGS are “digital” in nature and can be interpreted later in this context, to provide highly precise information about chromosomal amplification and deletion events genome-wide, and the relative frequency of mutations in the tumor cell genomes sampled by DNA isolation. 22 Read pair data are then generated from tumor and normal libraries to a minimum depth of 30-fold, allowing for a mapping rate of around 85% of read pairs; this equates roughly to 120 Gbp of data per genome. Following data generation, the signals obtained from the stepwise sequencing process are interpreted by instrument-specific software, culled for low-quality sequences, paired, and provided to the mapping algorithm for alignment to the Human Reference Genome as outlined earlier.
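
The data volume implied by a 30-fold depth and an 85% mapping rate can be estimated with simple arithmetic. The sketch below assumes a haploid genome size of about 3.1 Gbp, a figure not given in the text. Under that assumption the estimate is about 109 Gbp, in the same range as the roughly 120 Gbp quoted above.

```python
GENOME_SIZE_BP = 3.1e9  # assumption: approximate haploid human genome size


def raw_bases_needed(mapped_depth: float,
                     mapping_rate: float,
                     genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Bases to sequence so that mapped reads reach mapped_depth on average."""
    if not 0.0 < mapping_rate <= 1.0:
        raise ValueError("mapping_rate must be in (0, 1]")
    return mapped_depth * genome_size_bp / mapping_rate


if __name__ == "__main__":
    gbp = raw_bases_needed(30, 0.85) / 1e9
    print(f"{gbp:.0f} Gbp per genome")
```
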
Alignment is done for tumor-specific reads and for normal-specific reads separately. Variants are identified and then compared to one another. 29 There are many specialized algorithms that have been specifically developed to evaluate the somatic variants that are carried by the cancer cell genomes decoded by whole genome sequencing data. Depending upon the algorithm type, one can identify somatic single nucleotide variants (SNVs), focused insertion and deletion events of one to several nucleotides (in/dels), and larger, structural events such as translocations, inversions, deletions, and amplifications. Loss of heterozygosity (LOH) is a common somatic genome event, and there also are algorithms to identify stretches of LOH along chromosomes. Each algorithm has an associated false positive rate, so secondary validation of putative somatic variants is the best practice. The identification of structural variants is particularly prone to a high false positive rate due to the difficulty of identifying these regions, as illustrated in Figure 23-3. Here, the distance and orientation of read pair mapping to the Human Genome Reference for multiple unique read pairs is required to identify a structural event, as indicated in the figure. By using the read pairs that identify the event, and a short read assembly algorithm, one can reassemble the structural variant event to nucleotide resolution. Finally, one makes “sense” of the variants identified genome-wide by annotation, effectively overlaying our current understanding of genes, regulatory regions, and other identified features that help define the tumor-unique profile of genomic alterations.
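
The read pair logic behind structural variant detection (Figure 23-3) can be sketched as a small classifier. This is an illustration under stated assumptions, not the method of any particular caller: the expected insert size, the tolerance, the minimum support, and all names are hypothetical, and the support count ignores genomic position, whereas real tools cluster pairs by breakpoint location.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class ReadPair:
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def classify_pair(pair: ReadPair, expected_insert: int = 400,
                  tolerance: int = 100) -> str:
    """Label one mapped read pair by its distance and orientation on the reference."""
    if pair.chrom1 != pair.chrom2:
        return "translocation"
    if pair.strand1 == pair.strand2:
        return "inversion"
    # Order the two ends by reference coordinate.
    (left_pos, left_strand), (right_pos, _) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)])
    if left_strand == "-":
        # Outward-facing pair; treated here as a complex event (assumption).
        return "complex"
    distance = right_pos - left_pos
    if distance > expected_insert + tolerance:
        return "deletion"   # ends map farther apart than the library insert size
    if distance < expected_insert - tolerance:
        return "insertion"  # ends map closer together than expected
    return "concordant"


def supported_event_types(pairs: Iterable[ReadPair],
                          min_support: int = 2) -> Dict[str, int]:
    """Keep only event types backed by at least min_support discordant pairs."""
    counts = Counter(classify_pair(p) for p in pairs)
    return {event: n for event, n in counts.items()
            if event != "concordant" and n >= min_support}


if __name__ == "__main__":
    pairs = [
        ReadPair("chr1", 1000, "+", "chr1", 1400, "-"),
        ReadPair("chr1", 5000, "+", "chr1", 9000, "-"),
        ReadPair("chr1", 5020, "+", "chr1", 9050, "-"),
        ReadPair("chr15", 700, "+", "chr17", 900, "-"),
    ]
    print(supported_event_types(pairs))
```

Requiring more than one supporting pair mirrors the point made above that multiple unique read pairs are needed before a structural event is called.
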
As sequencing costs have dropped and instrument throughput has increased, the amount of read data and hence the coverage of the tumor genome has increased. This increase has occurred for several reasons: notably, the confidence of detecting somatic variations typically increases with increasing coverage. Furthermore, the heterogeneous nature of cancer cell genomes means that increased coverage provides enhanced characterization of the mutational spectrum within the cells. Tumor progression is at its essence an evolutionary process in which new mutations arise from the fundamental tumor genome (often referred to as the “founder clone”) and expand into new subpopulations of cells. 22,30 Thus, the higher the tumor genome coverage, the more likely it is that subpopulations can be identified.
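
The link between coverage and the ability to see subpopulations can be made concrete with a binomial model. This sketch assumes ideal random sampling of reads with no sequencing error, and the requirement of three supporting reads is a hypothetical threshold chosen for illustration.

```python
from math import comb


def detection_probability(depth: int, variant_allele_fraction: float,
                          min_alt_reads: int = 3) -> float:
    """Probability of observing at least min_alt_reads variant-supporting reads
    at a site covered by depth reads, under a binomial sampling model."""
    if depth < 0 or not 0.0 <= variant_allele_fraction <= 1.0:
        raise ValueError("depth must be >= 0 and the allele fraction in [0, 1]")
    p = variant_allele_fraction
    p_too_few = sum(comb(depth, k) * p**k * (1.0 - p)**(depth - k)
                    for k in range(min(min_alt_reads, depth + 1)))
    return 1.0 - p_too_few


if __name__ == "__main__":
    # A mutation present in a minor subclone, seen in about 5% of reads.
    for depth in (30, 100, 300):
        print(depth, round(detection_probability(depth, 0.05), 3))
```

Under this model the chance of detecting a 5% variant rises steeply between 30-fold and 300-fold depth, consistent with the observation that higher coverage reveals more subpopulations.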

Exome Sequencing

Although WGS data are straightforward to produce and provide comprehensive genome-wide information about somatic alterations, their production remains expensive, and they are difficult to accurately interpret. Much like the early PCR-directed methods used to characterize cancer somatic mutations, technology development efforts in NGS have resulted in an application typically referred to as “hybrid capture” to selectively isolate regions of the genome followed by NGS. 31-33 Sequence-based comparison of the isolated regions between tumor and normal generates specific information about somatic and germline SNVs and indels. Hybrid capture protocols combine the whole-genome library fragments from tumor and normal with a collection of specific probe sequences designed to capture, by hybridization, those fragments in the population that contain the same loci the probes represent. Because capture probes carry covalently attached biotin moieties, the probe:library fragment hybrids can be selectively removed from solution by association with streptavidin-coated magnetic particles and the application of a magnet. Unhybridized fragments are removed with the supernatant, and a secondary wash eliminates many (but not all) spurious hybridization events (typically referred to as “off-target effects”). The resulting captured fragments are eluted from the beads by denaturation, quantitated, and sequenced to about 100-fold average depth. In one commonly used version of hybrid capture, probes representing nearly all of the annotated exons in the human genome (the “exome”) permit selective capture of these exons so they can be sequenced, compared, and annotated with respect to the somatic alterations identified. Exome capture reagents are available from commercial manufacturers, and the associated methods can be automated readily to provide a very high throughput of exome capture reactions, suitable for large-scale cancer discovery.
As an alternative, custom capture probe sets can be designed and manufactured by one of several commercial suppliers to selectively isolate genes/loci of interest and characterize their mutational status in a large number of cancer cases or as a diagnostic reagent to assess specific mutational hotspots. This approach also can be used to select loci from a whole-genome library that carry putative variants, as a means of validation that mutations indeed exist. 22 Hybrid capture becomes of limited utility when the target loci (also referred to as “regions of interest” or ROIs) to be isolated fall below a combined length of approximately 500 kbp, mainly because the amount of off-target hybridization increases as the target space decreases. Because off-target captured fragments contribute to the overall fragment pool that is recovered for sequencing, the amount of sequence data actually mapping to the loci of interest decreases steadily to the point that data generation becomes too expensive to obtain the necessary coverage in desired targets. Below an ROI of about 500 kbp, either PCR and amplicon pooling or multiplex PCR are typically used.
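
The cost argument for small targets can be illustrated numerically. The on-target fractions used below are invented for illustration only, since the text gives no figures; the point is simply that the sequencing required per useful base grows as the on-target fraction falls.

```python
def raw_bases_for_capture(target_bp: int, mean_depth: float,
                          on_target_fraction: float) -> float:
    """Bases to sequence so the target reaches mean_depth, given the fraction
    of sequenced bases that actually map to the target."""
    if not 0.0 < on_target_fraction <= 1.0:
        raise ValueError("on_target_fraction must be in (0, 1]")
    return target_bp * mean_depth / on_target_fraction


def wasted_fraction(on_target_fraction: float) -> float:
    """Share of the sequencing output spent on off-target fragments."""
    return 1.0 - on_target_fraction


if __name__ == "__main__":
    # Hypothetical on-target fractions; real values depend on the probe design.
    for target_bp, on_target in ((5_000_000, 0.6), (500_000, 0.1)):
        bases = raw_bases_for_capture(target_bp, 100, on_target)
        print(f"target={target_bp:>9,} bp  on-target={on_target:.0%}  "
              f"raw={bases / 1e9:.2f} Gbp  wasted={wasted_fraction(on_target):.0%}")
```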

DNA Methylation

One predominant mechanism of transcriptional control in cells is the covalent modification of the cytosine bases in DNA by methyl groups and their derivatives. Understanding the ways that methylation status changes in tumor cells can provide insights into changes in gene expression patterns, as well as new prognostic markers if sufficient clinical data and samples exist. This type of analysis requires comparator normal methylation data, ideally obtained from adjacent nonmalignant tissue genomic DNA isolates. Correlative analyses then can link DNA methylation changes to gene expression changes, providing insights into tumor biology that cannot be obtained by directly sequencing genomic DNA. Several approaches to identifying methylated cytosines in genomic DNA use chemical modification, antibody-based recognition of methylC, or comparative restriction enzyme digestion patterns from exposure to a methylation-sensitive versus non–methylation-sensitive isoschizomer. However, the most widely used approach is bisulfite modification. 34,35 In bisulfite modification, native genomic DNA is treated with sodium bisulfite to convert unmethylated cytosines to uracils (see Figure 23-4). When copied by a DNA polymerase, each unmethylated C will appear as a C to T transition, whereas methylated C residues are untouched and incorporate a G during copying. In the pre-NGS era, regions upstream of genes of interest that were activated or silenced by methylation changes were evaluated by comparing bisulfite treatment plus PCR to PCR alone between tumor and adjacent non-malignant (normal) DNA. In the era of NGS, whole genomic DNA of tumor and adjacent normal tissues are treated with bisulfite after library construction (the adapters are methylated to prevent their conversion) and then processed and sequenced as described earlier. 36 The resulting bisulfite converted reads are aligned in silico to a “bisulfite converted” genome in order to identify unmethylated (and by inference, methylated) C residues (Figure 23-4).
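
The in silico step can be sketched in a few lines. This toy example handles only the forward strand and exact, ungapped matches, which is a deliberate simplification; real bisulfite aligners also treat the reverse strand (seen as G to A changes) and tolerate mismatches. All function names are hypothetical.

```python
from typing import Dict


def bisulfite_convert(sequence: str) -> str:
    """Convert every C to T, mimicking full conversion of unmethylated C."""
    return sequence.upper().replace("C", "T")


def find_offset(reference: str, read: str) -> int:
    """Place the read on the reference by matching fully converted sequences.
    Returns -1 if the converted read does not occur in the converted reference."""
    return bisulfite_convert(reference).find(bisulfite_convert(read))


def call_methylation(reference: str, read: str) -> Dict[int, str]:
    """For each reference C covered by the read, report whether the read
    retained C (methylated) or shows T (unmethylated)."""
    offset = find_offset(reference, read)
    if offset < 0:
        return {}
    calls = {}
    for i, base in enumerate(read.upper()):
        ref_pos = offset + i
        if reference[ref_pos].upper() != "C":
            continue
        if base == "C":
            calls[ref_pos] = "methylated"
        elif base == "T":
            calls[ref_pos] = "unmethylated"
    return calls


if __name__ == "__main__":
    reference = "ACGTCCGA"
    read = "ACGTTCGA"  # the C at position 4 was converted, the others were protected
    print(call_methylation(reference, read))
```
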
Figure 23-4 Chemical conversion of unmethylated cytosine residues by bisulfite. The upper panel indicates the series of reactions that occur between unmethylated cytosine residues and the bisulfite reagent to sulfonate cytosine, eliminate ammonia to generate uracil sulfonate, and then eliminate bisulfite to generate uracil. In the lower panel, methylated cytosines do not participate in the reaction with bisulfite.
Figure 23-5 Basic principles of chromatin immunoprecipitation (ChIP). After crosslinking DNA to proteins by treating live cells with formaldehyde, the DNA is isolated and then sheared to a uniform length distribution. By mixing the sheared DNA with a specific antibody to the protein of interest, crosslinked fragments can be bound and then isolated by immunoprecipitation. Following this step, reversing the crosslinking allows the DNA fragments to be reclaimed. The resulting low yield of DNA is converted to a next-generation sequencing (NGS) library and sequenced. Alignment to the reference genome and peak finding can identify those regions of the genome that are bound by the protein of interest.

Chromatin Immunoprecipitation

Genomic DNA contains myriad regulatory sites that provide the binding sequences recognized by the cell’s transcriptional regulation machinery. In addition, complexing of the genomic DNA with histones provides a fundamental mechanism for permitting access to regions of the DNA so transcription can occur. Changes to binding sites due either to mutation or amplification/deletion can destroy protein binding and effectively silence the gene. Similarly, rearrangements of bound histones can make available or eliminate genes from transcriptional copying. One approach to evaluate these changes is broadly termed chromatin immunoprecipitation, or ChIP (see Figure 23-5). ChIP is based on two fundamental approaches: (1) DNA and protein in close physical proximity can be bound reversibly by the introduction of formalin to growing cell cultures, and (2) the DNA:protein complexes can be precipitated from solution by an antibody-mediated immunoprecipitation step. Before NGS methods, the genomic region of interest was assayed from these immunoprecipitated DNA fragment populations by quantitative polymerase chain reaction (qPCR), allowing a selective look at protein binding status. In the NGS era, however, sequencing of the collective ChIP fraction can identify protein-bound regions. 37,38 These are challenging libraries to produce because of the low yield of DNA from the immunoprecipitation step. Once sequencing data are obtained, the reads are analyzed by first mapping onto the reference genome and then detecting the peaks that indicate factor- or histone-bound loci, as appropriate. The accuracy and sensitivity of these methods largely depend on the quality of the antibody used to effect immunoprecipitation, and on attaining enough coverage that peaks can be distinguished from noise.
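
The peak detection step can be illustrated with a very simple binning approach. This is a sketch only: the bin size and the fold threshold are hypothetical, and established peak callers use statistical models and a control library rather than a fixed multiple of the mean.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def find_peaks(read_starts: Sequence[int], bin_size: int = 200,
               fold_over_mean: float = 5.0) -> List[Tuple[int, int, int]]:
    """Return (bin_start, bin_end, read_count) for bins whose read count is at
    least fold_over_mean times the mean count per bin on one chromosome."""
    if not read_starts:
        return []
    counts = Counter(position // bin_size for position in read_starts)
    n_bins = max(counts) + 1  # bins from coordinate 0 to the last read
    mean_per_bin = len(read_starts) / n_bins
    threshold = fold_over_mean * mean_per_bin
    return sorted((b * bin_size, (b + 1) * bin_size, n)
                  for b, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    background = list(range(0, 10_000, 200))  # one read in each of 50 bins
    enriched = list(range(5_000, 5_030))      # 30 reads piled up at one locus
    print(find_peaks(background + enriched))
```

The example also shows why coverage matters: with too few reads the enriched bin would not rise clearly above the background mean.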

Sequencing Messenger RNA (mRNA)

The pursuit to characterize RNA expression as a biomarker of prognosis, metastasis, therapeutic options, and other clinical metrics lies at the essence of cancer genomics. In particular, before sequencing was transformed by next-generation instrumentation, microarrays served as a mechanism to characterize tumor-specific gene expression in a reasonably comprehensive and measurable way. However, relative to microarrays, sequencing of coding RNAs (called “RNA-seq”) provides data suitable for addressing a multitude of RNA-specific questions that one might use to characterize a tumor. For example, the digital nature of sequencing means that gene expression levels are quantifiable rather than relative, a concept first introduced by Wold and colleagues, who defined the metric known as FPKM (fragments per kilobase of transcript per million mapped reads) to express the normalized value of expression for each transcript detected. 39 Beyond gene expression levels, there are myriad levels of complexity that can be mined from RNA-seq datasets, including information about splice isoforms, allele-specific expression, chimeric or fusion transcripts, and RNA editing. 40,41 Much like ChIP-seq data, described earlier, these data are without context for an individual sample, unless a matched nonmalignant adjacent normal tissue is similarly prepared, sequenced, and evaluated. At present, there is a dearth of knowledge from RNA-seq analysis of normal human tissues to provide even a plausible baseline for these studies as an alternative to the adjacent nonmalignant tissue comparator. Integrating data from DNA and RNA sequencing can be incredibly powerful in the context of cancer genomics, such as DNA amplification leading to increased RNA expression levels for the genes lying in the amplified region (ERBB2 in breast cancer) or chromosomal translocation leading to a fusion gene transcript that drives oncogenesis (t(15;17) and PML-RARα, t(9;22) and BCR-ABL).
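
The FPKM definition quoted above translates directly into a formula: fragments mapped to a transcript, divided by the transcript length in kilobases and by the total mapped fragments in millions. The sketch below is a minimal illustration; the function name and example numbers are hypothetical.

```python
def fpkm(fragments_on_transcript: int, transcript_length_bp: int,
         total_mapped_fragments: int) -> float:
    """Fragments per kilobase of transcript per million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total mapped fragments must be positive")
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments_on_transcript / (kilobases * millions)


if __name__ == "__main__":
    # 500 fragments on a 2 kb transcript in a library of 10 million fragments.
    print(fpkm(500, 2_000, 10_000_000))  # 25.0
```

Normalizing by both length and library size is what allows values to be compared between transcripts of different lengths and between libraries of different depths.
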
There are myriad approaches to producing RNA-seq data from tumor RNA isolates, depending on the yield obtained. Ideally, polyA selection is used for abundant total RNA samples (more than 2 μg) to reduce highly expressed rRNA transcripts, but these yields are not guaranteed with samples from clinical procedures such as fine-needle aspirates or LCM. In this regard, RNA-seq has benefitted from previous microarray studies because enzymatic RNA amplification is widely accepted to represent the RNA expression levels of the native sample. Amplification can be further refined by specialized approaches to RNA that accurately reflect the strandedness of the transcript or adjust for the fragmentation of RNA that is derived from FFPE preserved tissues. There are several methods that subsequently decrease the levels of abundant rRNA transcripts when polyA purification cannot be used, including selective amplification of polyA, use of abundance-based hybridization kinetics and selective double-strand hybrid depletion, or selective depletion using rRNA-targeted locked nucleic acids (LNAs).

Sequencing Noncoding RNAs

In addition to the transcripts from protein-coding genes, many classes of noncoding RNAs (ncRNAs) have been described and have been determined to be expressed at altered levels in cancer cells. 42,43 As for messenger RNAs (mRNAs), the digital nature of NGS can provide exquisite quantitation of expression levels while being comprehensive across a wide dynamic range of expression. Highly specialized library construction techniques for ncRNAs depend largely on a clever approach to selecting the ncRNAs from total RNA isolates and then performing a sizing step that isolates the specific population of interest. Obviously, the sequencing parameters of NGS instruments should be altered to specifically reflect the corresponding sizes of ncRNAs being studied and the types of analyses planned for the resulting data.

Conclusions

As illustrated by this overview, the impact of NGS and associated methods on our ability to characterize the nucleic acids involved in cancer, and hence to generate comprehensive hypotheses about tumor biology, has evolved rapidly. These methods and associated analyses continue to develop and are now coalescing toward diagnostic assays that provide an individualized focus for precise prognostic and therapeutic determinations. Over time, the cancer genomics revolution, fueled by NGS innovation, will profoundly affect the outcomes of cancer patients worldwide.

References

1. Mardis E.R. A decade’s perspective on DNA sequencing technology. Nature. 2011;470:198-203.

2. Sanger F. et al. Nucleotide sequence of bacteriophage phi X174 DNA. Nature. 1977;265:687-695.

3. Sanger F., Coulson A.R. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol. 1975;94:441-448.

4. Sanger F., Coulson A.R., Barrell B.G., Smith A.J., Roe B.A. Cloning in single-stranded bacteriophage as an aid to rapid DNA sequencing. J Mol Biol. 1980;143:161-178.

5. Sanger F., Nicklen S., Coulson A.R. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A. 1977;74:5463-5467.

6. International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004;431:931-945.

7. Rowley J.D., Golomb H.M., Dougherty C. 15/17 translocation, a consistent chromosomal change in acute promyelocytic leukaemia. Lancet. 1977;1:549-550.

8. Rowley J.D. The cytogenetics of acute leukaemia. Clin Haematol. 1978;7:385-406.

9. Rowley J.D. Chromosome abnormalities in leukemia. Haematol Blood Transfus. 1979;23:43-52.

10. Rowley J.D. Ph1-positive leukaemia, including chronic myelogenous leukaemia. Clin Haematol. 1980;9:55-86.

11. Rowley J.D. Identification of the constant chromosome regions involved in human hematologic malignant disease. Science. 1982;216:749-751.

12. Rowley J.D. Biological implications of consistent chromosome rearrangements in leukemia and lymphoma. Cancer Res. 1984;44:3159-3168.

13. Rowley J.D. The Philadelphia chromosome translocation. A paradigm for understanding leukemia. Cancer. 1990;65:2178-2184.

14. Rowley J.D. The role of chromosome translocations in leukemogenesis. Semin Hematol. 1999;36:59-72.

15. Sjoblom T. et al. The consensus coding sequences of human breast and colorectal cancers. Science. 2006;314:268-274.

16. Pao W. et al. EGF receptor gene mutations are common in lung cancers from “never smokers” and are associated with sensitivity of tumors to gefitinib and erlotinib. Proc Natl Acad Sci U S A. 2004;101:13306-13311.

17. Mardis E.R., Wilson R.K. Cancer genome sequencing: a review. Hum Mol Genet. 2009;18:R163-R168.

18. Li H., Ruan J., Durbin R. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 2008;18:1851-1858.

19. Li H., Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26:589-595.

20. Schuster S.C. et al. Complete Khoisan and Bantu genomes from southern Africa. Nature. 2010;463:943-947.

21. Hudson T.J. et al. International network of cancer genome projects. Nature. 2010;464:993-998.

22. Ding L. et al. Clonal evolution in relapsed acute myeloid leukaemia revealed by whole-genome sequencing. Nature. 2012;481:506-510.

23. Gerlinger M. et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med. 2012;366:883-892.

24. Ellis M.J. et al. Whole-genome analysis informs breast cancer response to aromatase inhibition. Nature. 2012;486:353-360.

25. Zhang Q. et al. CMDS: a population-based method for identifying recurrent DNA copy number aberrations in cancer from high-resolution data. Bioinformatics. 2010;26:464-469.

26. Beroukhim R. et al. Assessing the significance of chromosomal aberrations in cancer: methodology and application to glioma. Proc Natl Acad Sci U S A. 2007;104:20007-20012.

27. Nik-Zainal S. et al. The life history of 21 breast cancers. Cell. 2012;149:994-1007.

28. Magana-Schwencke N., Ekert B., Moustacchi E. Biochemical analysis of damage induced in yeast by formaldehyde. I. Induction of single-strand breaks in DNA and their repair. Mutat Res. 1978;50:181-193.

29. Larson D.E. et al. SomaticSniper: identification of somatic point mutations in whole genome sequencing data. Bioinformatics. 2012;28:311-317.

30. Yachida S. et al. Distant metastasis occurs late during the genetic evolution of pancreatic cancer. Nature. 2010;467:1114-1117.

31. Albert T.J. et al. Direct selection of human genomic loci by microarray hybridization. Nat Methods. 2007;4:903-905.

32. Hodges E. et al. Genome-wide in situ exon capture for selective resequencing. Nat Genet. 2007;39:1522-1527.

33. Gnirke A. et al. Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nat Biotechnol. 2009;27:182-189.

34. Harris R.A. et al. Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications. Nat Biotechnol. 2010;28:1097-1105.

35. Frommer M. et al. A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc Natl Acad Sci U S A. 1992;89:1827-1831.

36. Lister R. et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature. 2009;462:315-322.

37. Goren A. et al. Chromatin profiling by directly sequencing small quantities of immunoprecipitated DNA. Nat Methods. 2010;7:47-49.

38. Garber M. et al. A high-throughput chromatin immunoprecipitation approach reveals principles of dynamic gene regulation in mammals. Mol Cell. 2012;47:810-822.

39. Mortazavi A., Williams B.A., McCue K., Schaeffer L., Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:621-628.

40. Garber M., Grabherr M.G., Guttman M., Trapnell C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat Methods. 2011;8:469-477.

41. Hawkins R.D., Hon G.C., Ren B. Next-generation genomics: an integrative approach. Nat Rev Genet. 2010;11:476-486.

42. Iorio M.V., Croce C.M. MicroRNA dysregulation in cancer: diagnostics, monitoring and therapeutics. A comprehensive review. EMBO Mol Med. 2012;4:143-159.

43. Prensner J.R., Chinnaiyan A.M. The emergence of lncRNAs in cancer biology. Cancer Discov. 2011;1:391-407.