The Human Genome

Published on 25/03/2015 by admin

Last modified 22/04/2025

Chapter 74 The Human Genome

The Human Genome Project, culminating in the sequencing of the human genome, made it possible to study virtually any human gene and to explore the roles of genes in both rare and common disorders. It has also become apparent that the genome includes far more than a coded store of information to produce proteins.

The human genome has approximately 25,000 genes that encode the wide variety of proteins found in the human body. Reproductive or germline cells contain 1 copy (N) of this genetic complement and are haploid, whereas somatic (nongermline) cells contain 2 complete copies (2N) and are diploid. Genes are organized into long segments of DNA, which, during cell division, are compacted into intricate structures together with proteins to form chromosomes. Each somatic cell has 46 chromosomes: 22 pairs of autosomes, or nonsex chromosomes, and 1 pair of sex chromosomes (XY in a male, XX in a female). Germ cells (eggs, sperm) contain 22 autosomes and 1 sex chromosome, for a total of 23. At fertilization, the full diploid chromosome complement of 46 is again realized in the embryo.

Most of the genetic material is contained in the cell’s nucleus. The mitochondria (the cell’s energy-producing organelles) contain their own unique genome. The mitochondrial chromosome consists of a double-stranded circular piece of DNA, which contains 16,568 base pairs (bp) and is present in multiple copies per cell. The proteins that compose the mitochondria may be produced either within the mitochondria (from information contained in the mitochondrial genome) or from information contained in the nuclear genome, in which case they are transported into the organelle. All mitochondria are maternally derived; sperm do not usually contribute mitochondria to the developing embryo. Hence, a child’s mitochondrial genetic makeup derives exclusively from his or her biological mother.

Fundamentals of Molecular Genetics

The central tenet of molecular genetics is that information encoded in DNA, predominantly located in the cell nucleus, is transcribed into messenger RNA (mRNA), which is then transported to the cytoplasm, where it is translated into protein. A gene is a unit that includes a regulatory region and a coding region that stores information corresponding to the sequence of amino acids in a specific protein.

DNA consists of a pair of chains of a sugar-phosphate backbone linked by pyrimidine and purine bases to form a double helix (Fig. 74-1). The sugar in DNA is deoxyribose. The pyrimidines are cytosine (C) and thymine (T); the purines are guanine (G) and adenine (A). The bases are linked by hydrogen bonds such that A always pairs with T and G with C. Each strand of the double helix has polarity, with a free phosphate at one end (5′) and an unbonded hydroxyl on the sugar at the other end (3′). The two strands are oriented in opposite polarity in the double helix.
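The pairing and polarity rules above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `reverse_complement` and the example sequence are assumptions made for the sketch, not part of the chapter.

```python
# Watson-Crick pairing as described in the text: A pairs with T, G with C.
PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, written 5' to 3'.

    The input is read 5' to 3'. Because the two strands run in opposite
    polarity, the partner is built by reading the input backwards and
    complementing each base.
    """
    return "".join(PAIRS[base] for base in reversed(strand.upper()))


top = "AAGCTTGC"                    # illustrative sequence, 5' to 3'
bottom = reverse_complement(top)    # partner strand, also written 5' to 3'
print(bottom)                       # prints GCAAGCTT
```

Applying the function twice returns the original strand, which reflects the fact that each strand fully determines the other.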

Figure 74-1 DNA double helix, with sugar-phosphate backbone and nitrogenous bases.

(From Jorde LB, Carey JC, Bamshad MJ, et al, editors: Medical genetics, ed 2, St Louis, 1999, Mosby, p 8.)

The replication of DNA follows the pairing of bases in the parent DNA strand. The original two strands unwind by breaking the hydrogen bonds between base pairs. Free nucleotides, consisting of a base attached to a sugar-phosphate, form new hydrogen bonds with their complementary bases on the parent strand; new phosphodiester bonds are created by the enzyme DNA polymerase. Replication of chromosomes begins simultaneously at multiple sites, forming replication bubbles that expand bidirectionally until the entire DNA molecule (chromosome) is replicated. Errors in DNA replication, or mutations induced by environmental mutagens such as irradiation or chemicals, are detected and potentially corrected by DNA repair systems.

A prototypical gene consists of a regulatory region, segments called exons that encode the amino acid sequence of a protein, and intervening segments called introns (Fig. 74-2). Transcription starts at the promoter region and continues through the entire length of the gene to form mRNA. The introns are removed and the exons spliced together to form a mature message, which is exported to the cytoplasm. There the mRNA is bound to ribosomes and translated into protein.

Figure 74-2 Summary of the steps leading from DNA to proteins. Replication and transcription occur in the cell nucleus. The mRNA is then transported to the cytoplasm, where translation of the mRNA into amino acid sequences composing a protein occurs.

(From Jorde LB, Carey JC, Bamshad MJ, et al, editors: Medical genetics, ed 2, St Louis, 1999, Mosby, p 12.)

Transcription is initiated by attachment of RNA polymerase to the promoter site upstream of the beginning of the coding sequence. Specific proteins bind to the region to either repress or activate transcription by opening up the chromatin, which is a complex of DNA and histone proteins. It is the action of these regulatory proteins (transcription factors) that determines, in large part, when a gene is turned on or off. Some genes are also turned on and off by methylation of cytosine bases that are adjacent to guanines (CpG bases). Methylation is an example of an epigenetic change, meaning a change that can affect gene expression, and possibly the characteristics of a cell or organism, but that does not involve a change in the underlying genetic sequence. Gene regulation is flexible and responsive, with genes being turned on or off during development and in response to internal and external conditions and stimuli.
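As a small illustration of what a CpG site is at the sequence level, the following Python sketch lists every cytosine that is immediately followed by a guanine on the same strand. The function name `cpg_positions` and the example sequence are hypothetical; the sketch only locates candidate sites and says nothing about whether they are actually methylated.

```python
def cpg_positions(seq: str) -> list[int]:
    """Return the 0-based position of each C that is directly followed by G."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i] == "C" and seq[i + 1] == "G"]


promoter = "TTCGACGGCATCG"       # made-up sequence, read 5' to 3'
print(cpg_positions(promoter))   # prints [2, 5, 11]
```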

Transcription proceeds through the full length of the gene, synthesizing mRNA in a 5′ to 3′ direction. RNA, like DNA, is a sugar-phosphate chain with pyrimidines and purines. The sugar in this case is ribose; uracil replaces thymine in RNA. RNA polymerase reads one strand of DNA and synthesizes a complementary RNA sequence. A “cap” consisting of 7-methylguanosine is added to the 5′ end of the RNA in a 5′-5′ bond and, for most transcripts, several hundred adenine bases are enzymatically added to the 3′ end after transcription.
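The copying step can be sketched as follows. This is a simplified model under stated assumptions: the template strand is supplied written 3′ to 5′ so that the resulting RNA reads 5′ to 3′, the function name `transcribe` and the sequence are invented for illustration, and capping and polyadenylation are not modeled.

```python
# Base pairing during transcription: the RNA carries U wherever the
# DNA template carries A, because uracil replaces thymine in RNA.
RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Return the RNA (5' to 3') complementary to a DNA template strand.

    The template is given 3' to 5', so reading it left to right yields
    the RNA in its direction of synthesis.
    """
    return "".join(RNA_PAIR[base] for base in template_3_to_5.upper())


template = "TACGGCATT"        # illustrative template strand, written 3' to 5'
print(transcribe(template))   # prints AUGCCGUAA
```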

mRNA processing occurs in the nucleus and consists of excision of the introns and splicing together of the exons. Specific sequences at the start and end of introns mark the sites where the splicing machinery will act on the transcript. In some cases, there may be tissue-specific patterns to splicing, so that the same primary transcript can produce multiple distinct proteins.
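A toy model of splicing, assuming the exon boundaries are already known, is shown below. The sequences, coordinates, and the function name `splice` are illustrative assumptions; real splice-site recognition is far more involved than joining fixed intervals.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join the listed exon intervals (0-based, end-exclusive) in order."""
    return "".join(pre_mrna[start:end] for start, end in exons)


# A made-up primary transcript: three exons separated by two introns.
exon1, intron1 = "AUGGCC", "GUAAGUUUUCAG"
exon2, intron2 = "GAUUUU", "GUGAGUCCCCAG"
exon3 = "CCCUAA"
pre_mrna = exon1 + intron1 + exon2 + intron2 + exon3

all_exons = [(0, 6), (18, 24), (36, 42)]
print(splice(pre_mrna, all_exons))   # prints AUGGCCGAUUUUCCCUAA

# An alternative pattern that skips the middle exon yields a different
# mature message from the same primary transcript.
skip_exon2 = [(0, 6), (36, 42)]
print(splice(pre_mrna, skip_exon2))  # prints AUGGCCCCCUAA
```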

The processed transcript is next exported to the cytoplasm, where it binds to ribosomes, which are complexes of protein and RNA. The genetic code is then read in triplets of bases, each triplet corresponding with a specific amino acid or providing a signal that terminates translation. The triplet codons are recognized by transfer RNAs (tRNAs) that include complementary anticodons and bind the corresponding amino acid, delivering it to the growing peptide. A new amino acid is enzymatically attached to the peptide; each time an amino acid is added, the ribosome moves one triplet codon step along the mRNA. Eventually a stop codon is reached, at which point translation ends and the peptide is released. In some proteins, there are post-translational modifications such as attachment of sugars (glycosylation); the protein is then delivered to its destination within or outside the cell by trafficking mechanisms that recognize portions of the peptide.
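Reading the message in triplets can be sketched in Python as below. The sketch assumes the standard genetic code, assumes the reading frame begins at the first base of the supplied sequence, and ignores tRNAs, ribosome mechanics, and post-translational modification. The names `CODON_TABLE` and `translate` and the example message are illustrative.

```python
from itertools import product

# Standard genetic code in compact form: codons are enumerated with the
# bases ordered U, C, A, G, and "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA from its first base until a stop codon is reached.

    Returns the peptide as one-letter amino acid codes. Trailing bases
    that do not fill a complete triplet are ignored.
    """
    peptide = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":      # stop codon: release the peptide
            break
        peptide.append(amino_acid)
    return "".join(peptide)


print(translate("AUGGCCGAUUUUCCCUAA"))   # prints MADFP
```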

An emerging layer of complexity in genetic regulation is that of noncoding RNAs. These RNAs are transcribed from DNA but are not translated into proteins. Instead, they serve diverse biologic functions, often in complexes with different proteins. Traditionally, this category has included RNAs that mediate splicing or processing of coding RNAs and RNAs that take part in translation of coding RNAs in ribosomes. MicroRNAs (miRNAs) are representative of a class of small noncoding RNAs (21-23 nucleotides) that control gene expression in the cell by binding directly to specific sets of coding RNAs. This RNA-RNA interaction can lead to degradation of the target coding RNA or inhibition of translation of the protein specified by that coding RNA. A single miRNA, in general, targets and regulates several hundred mRNAs.

Genetic Variation

The process of producing protein from a gene is subject to disruption at multiple levels owing to alterations in the gene sequence (Fig. 74-3). Changes in the regulatory region can lead to altered gene expression, including increased or decreased rates of transcription, failure of gene activation, or activation of the gene at inappropriate times or in inappropriate cells. Changes in the coding sequence can lead to substitution of one amino acid for another (missense or nonsynonymous mutation) or creation of a stop codon in the place of an amino acid codon (nonsense mutation). Some single-base changes do not affect the amino acid (silent, wobble, or synonymous mutation), because several codons may correspond with a single amino acid. Amino acid substitutions can have a profound effect on protein function if the chemical properties of the substituted amino acid are markedly different from the usual one. Other substitutions can have a subtle or no effect on protein function, particularly if the substituted amino acid is chemically similar to the original one.
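The three outcomes of a base substitution within a codon can be illustrated with a short Python sketch. It assumes the standard genetic code and that the reference codon encodes an amino acid; the function name `classify_substitution` is hypothetical. The GAG to GUG example corresponds to the glutamate-to-valine substitution of sickle hemoglobin.

```python
from itertools import product

# Standard genetic code, bases ordered U, C, A, G; "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change as synonymous, nonsense, or missense."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == "*":
        raise ValueError("reference codon must encode an amino acid")
    if alt_aa == ref_aa:
        return "synonymous"   # same amino acid, silent change
    if alt_aa == "*":
        return "nonsense"     # amino acid codon becomes a stop codon
    return "missense"         # one amino acid replaced by another


print(classify_substitution("GAG", "GAA"))   # prints synonymous
print(classify_substitution("GAG", "GUG"))   # prints missense
print(classify_substitution("GAG", "UAG"))   # prints nonsense
```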

Figure 74-3 Various types of intragenic mutations. Promoter mutations alter rate of transcription or disrupt gene regulation. Base changes within exons can have various effects, as shown. Mutations within introns can lead to inclusion of some intronic sequence in the final processed mRNA, or it can lead to exon skipping.

Genetic changes can also include insertions or deletions. Insertion or deletion of a number of bases that is not a multiple of 3 in the coding sequence leads to a frameshift, altering the grouping of bases into triplets. This leads to translation of an incorrect amino acid sequence and often a premature stop to translation. Insertion or deletion of a multiple of 3 bases in the coding sequence inserts or deletes a corresponding number of amino acids, leading to in-frame alterations that maintain the amino acid sequence outside of the deleted or inserted amino acids. Larger-scale insertions or deletions can disrupt a coding sequence or result in complete deletion of an entire gene or group of genes.
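The difference between a frameshift and an in-frame change comes down to whether the number of inserted or deleted bases is divisible by 3. The Python sketch below shows this on a made-up coding sequence; the function names `indel_effect` and `codons` are illustrative assumptions.

```python
def indel_effect(n_bases: int) -> str:
    """Classify an insertion or deletion by the number of bases involved."""
    if n_bases <= 0:
        raise ValueError("n_bases must be a positive integer")
    return "in-frame" if n_bases % 3 == 0 else "frameshift"


def codons(seq: str) -> list[str]:
    """Group a sequence into complete triplets, dropping any leftover bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


original = "AUGGCCGAUUUUCCCUAA"
one_deleted = original[:4] + original[5:]     # 1 base removed from codon 2
three_deleted = original[:3] + original[6:]   # codon 2 removed entirely

print(codons(original))       # ['AUG', 'GCC', 'GAU', 'UUU', 'CCC', 'UAA']
print(codons(one_deleted))    # ['AUG', 'GCG', 'AUU', 'UUC', 'CCU']
print(codons(three_deleted))  # ['AUG', 'GAU', 'UUU', 'CCC', 'UAA']
print(indel_effect(1), indel_effect(3))   # prints frameshift in-frame
```

After the single-base deletion every downstream triplet changes and the original stop codon is no longer read in frame, whereas the three-base deletion removes one codon and leaves the rest intact.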

Mutations usually can be classified as causing a loss of function or a gain of function. Loss-of-function mutations cause a reduction in the level of protein function due to decreased expression or production of a protein that does not work as efficiently. In some cases, loss of protein function from one gene is sufficient to cause disease. Haploinsufficiency describes the situation in which maintenance of a normal phenotype requires the proteins produced by both copies of a gene, and a 50% decrease in gene function results in an abnormal phenotype. Hence, haploinsufficient phenotypes are by definition dominantly inherited. Loss-of-function mutations can also have a dominant negative effect when the abnormal protein product actively interferes with the function of the normal protein product. Both of these situations lead to diseases inherited in a dominant fashion (Chapter 75). In other cases, loss-of-function mutation must be present in both copies of a gene before an abnormal phenotype results. This situation typically results in diseases inherited in a recessive fashion (Chapter 75).

Gain-of-function mutations typically cause dominantly inherited diseases. These mutations can result in production of a protein molecule with an increased ability to perform a normal function or they can confer a novel property on the protein. The gain-of-function mutation in achondroplasia, the most common of the disproportionate, short-limbed short stature disorders, exemplifies the enhanced function of a normal protein. Achondroplasia results from a mutation in the fibroblast growth factor receptor 3 gene (FGFR3), which leads to activation of the receptor, even in the absence of fibroblast growth factor (FGF). In sickle cell disease, an amino acid substitution in the hemoglobin molecule has little effect on the ability of the protein to transport oxygen. However, sickle hemoglobin chains have a novel property; unlike normal hemoglobin, sickle hemoglobin chains aggregate under conditions of deoxygenation, forming fibers that deform the red cells.

Other gain-of-function mutations result in overexpression or inappropriate expression of a gene product. Many cancer-causing genes (oncogenes) are normal regulators of cellular proliferation during development. However, expression of these genes in adult life and/or in cells in which they usually are not expressed can result in neoplasia.

In some cases, changes in gene expression are caused by changes in the number of copies of a gene present in the genome (Fig. 74-4). Although some copy number variations (CNVs) are common and do not appear to cause or predispose to disease, others are clearly disease causing. Charcot-Marie-Tooth disease type 1A, the most common inherited form of chronic peripheral neuropathy of childhood, is caused by duplications of the gene for peripheral myelin protein 22, resulting in overexpression due to the existence of 3 active copies of this gene. Deletions of this same gene—leaving only one active copy—are responsible for a different disorder, hereditary neuropathy with liability to pressure palsies.

Figure 74-4 Array comparative genomic hybridization. Test and reference DNA samples are differentially labeled, mixed, and passed over a target array of probes (e.g., BAC clones or oligonucleotides) containing DNA fragments from across the whole human genome. The experiment is often repeated with reversal of the test and reference dyes to detect dye effects or identify spurious signals. DNA samples hybridize with their corresponding probe, and the ratio of fluorescence from each probe (test:reference) is used to detect regions that vary in copy number between the test and the reference sample (red line: original hybridization; blue line: dye-swapped hybridization). Equal copy number for both the test and reference DNA is identified by equal binding, resulting in a ratio of 1:1. Duplication in a genomic region of the test sample is identified by an increased ratio, and a deletion is identified by a decreased ratio, but a deletion in the test sample is indistinguishable from a duplication in the reference sample. These ratios are usually converted to log₂ scale for further analysis.

(Adapted from Feuk L, Carson AR, Scherer SW: Structural variation in the human genome, Nat Rev Genet 7:85–97, 2006, with permission from Nature Reviews Genetics.)
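The conversion of copy number ratios to a log₂ scale can be worked through numerically. The Python sketch below assumes idealized, noise-free signal in which fluorescence is exactly proportional to copy number and the reference sample carries 2 copies; the function name `log2_ratio` is illustrative, and real array data would show values scattered around these ideals.

```python
import math


def log2_ratio(test_copies: int, reference_copies: int = 2) -> float:
    """Return log2(test:reference) for an idealized array CGH probe."""
    return math.log2(test_copies / reference_copies)


print(log2_ratio(2))             # equal copy number, ratio 1:1 -> 0.0
print(round(log2_ratio(3), 3))   # duplication, 3 copies vs 2  -> 0.585
print(log2_ratio(1))             # deletion, 1 copy vs 2       -> -1.0
```

On this scale equal copy number sits at zero, a single-copy gain such as a duplication leaving 3 active copies appears as a positive value, and a single-copy loss leaving 1 active copy appears as a negative value.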

Deletions and duplications can vary in their extent and can involve several genes, even when they are not visible on a traditional chromosome analysis. Such changes are commonly called microdeletions and microduplications. When several genes in the same deleted or duplicated chromosomal region each play a role in the resulting clinical features, the condition can also be referred to as a contiguous gene disorder.

In some cases, the recognition of a specific constellation of features leads the clinician to suspect a specific microdeletion or microduplication syndrome. Examples of such disorders include Smith-Magenis, DiGeorge, and Williams syndromes. In other cases, the clinician may be alerted to this possibility by an unusually diverse array of clinical features in one patient or the presence of unusual features in a person with a known condition. Owing to the close physical proximity of a series of genes, different deletions involving the short arm of the X chromosome can produce individuals with various combinations of ichthyosis, Kallmann syndrome, ocular albinism, intellectual disability, chondrodysplasia punctata, and short stature.

DNA rearrangements can also take place in somatic cells—meaning cells that do not go on to produce ova or sperm. The best understood are the rearrangements that occur in lymphoid cells. Some rearrangements are required for the formation of functional immunoglobulin in B cells and antigen-recognizing receptors on T cells. Large segments of DNA, which code for the variable and the constant regions of either immunoglobulin or the T-cell receptor, are physically joined at a specific stage in the development of an immunocompetent lymphocyte. These rearrangements take place during development of the lymphoid cell lineage in humans and result in the extensive diversity of immunoglobulin and T-cell receptor molecules. It is as a result of this postgermline DNA rearrangement that no two individuals, not even identical twins, are really identical, because mature lymphocytes from each will have undergone random DNA rearrangements at these loci.

Studies of the human genome sequence reveal that any two individuals differ in about 1 base in 1,000. Some of these differences are silent; some result in changes that explain phenotypic differences (hair or eye color, physical appearance); some have medical significance, causing single gene disorders such as sickle cell anemia or explaining susceptibility to common disorders such as asthma. Genetic variants that occur at a frequency of >1% in a population are often referred to as polymorphisms. These may be silent or subtle or have significant phenotypic effects.

Genotype-Phenotype Correlations in Genetic Disease

The term genotype is used to signify the internally coded, heritable information of an individual and can also be used to refer to which particular alternative version (allele) of a gene is present at a specific location (locus) on a chromosome. A phenotype is the observed structural, biochemical, and physiologic characteristics of an individual, determined by the genotype, and can also refer to the observed structural and functional effects of a mutant allele at a specific locus. Many mutations result in predictable phenotypes. In these cases, physicians can predict clinical outcomes and plan appropriate treatment strategies based on a patient’s genotype.

The long QT syndrome exemplifies a disorder with predictable associations between a patient’s genotype and his or her phenotype (Chapter 429.5). Long QT syndrome is genetically heterogeneous, meaning that mutations in several different genes can cause the same disorder. The risk for cardiac events (syncope, aborted cardiac arrest, or sudden death) is higher with long QT syndrome mutations involving the KCNQ1 gene (63%) or the KCNH2 gene (46%) than among subjects with mutations in the SCN5A gene (18%). In addition, those with mutations involving KCNQ1 experience most of their episodes during exercise and rarely during rest or sleep; those with KCNH2 and SCN5A mutations are more likely to have episodes during sleep or rest, and rarely during exercise. Therefore, mutations in specific genes (genotype) are correlated with specific manifestations (phenotype) of long QT syndrome. These types of relationships are commonly referred to as genotype-phenotype correlations.

Mutations in the fibrillin-1 gene associated with Marfan syndrome represent another example of predictable genotype-phenotype correlations (Chapter 693). Marfan syndrome is characterized by the combination of skeletal, ocular, and aortic manifestations, with the most devastating outcome being aortic root dissection and sudden death. Sixty-five exons make up the fibrillin-1 gene, and mutations have been found in almost all of these exons. The location of the mutation within the gene (genotype) might play a significant role in determining the severity of the condition (phenotype). Neonatal Marfan syndrome is caused by mutations in exons 24-27 and 31-32, whereas milder forms are caused by mutations in exons 59-65 and in exons 37 and 41.

Genotype-phenotype correlations have also been observed in cystic fibrosis (CF) (Chapter 395). Although pulmonary disease is the major cause of morbidity and mortality, CF is a multisystemic disorder that affects not only the epithelia of the respiratory tract but also the exocrine pancreas, intestine, male genital tract, hepatobiliary system, and exocrine sweat glands. CF is caused by mutations in the CF transmembrane conductance regulator (CFTR) gene. More than 1,600 different mutations have been identified. The most common is a deletion of three nucleotides that removes the amino acid phenylalanine (F) at the 508th position on the protein (ΔF508 mutation), which accounts for about 70% of all CF mutations and is associated with severe disease. The best genotype-phenotype correlations in CF are seen in the context of pancreatic function, with most common mutations being classified as either pancreatic sufficient or pancreatic insufficient. Persons with pancreatic sufficiency usually have either 1 or 2 pancreatic sufficient alleles, indicating that pancreatic sufficient alleles are dominant. In contrast, the genotype-phenotype correlation in pulmonary disease is much weaker, and persons with identical genotypes have wide variations in the severity of their pulmonary disease. This finding may be accounted for in part by genetic modifiers or environmental factors.

There are many disorders in which the effects of mutations on phenotype can be modified by changes in the other allele of the same gene, by changes in specific modifier genes, and/or by variations in a number of unspecified genes (genetic background). When sickle cell anemia is co-inherited with the gene for hereditary persistence of fetal hemoglobin, the sickle cell phenotypic expression is less severe. Modifier genes in CF can influence the development of congenital meconium ileus or colonization with Pseudomonas aeruginosa. Modifier genes can also affect the manifestations of Hirschsprung disease, neurofibromatosis type 2, craniosynostosis, and congenital adrenal hyperplasia. The combination of genetic mutations producing glucose-6-phosphate dehydrogenase deficiency and longer versions of the TATAA element in the UDP-glucuronosyltransferase gene promoter exacerbates neonatal physiologic hyperbilirubinemia.