Mapping and Identifying Genes for Monogenic Disorders

Published on 16/03/2015 by admin

Filed under Basic Science

Last modified 22/04/2025


CHAPTER 5 Mapping and Identifying Genes for Monogenic Disorders

The identification of the gene associated with an inherited single gene (monogenic) disorder, as well as having immediate clinical diagnostic application, will enable an understanding of the developmental basis of the pathology with the prospect of possible therapeutic interventions. The molecular basis for more than 2700 disease phenotypes is now known.

The first human disease genes identified were those with a biochemical basis where it was possible to purify and sequence the gene product. The development of recombinant DNA techniques in the 1980s enabled physical mapping strategies and led to a new approach, positional cloning. This describes the identification of a gene purely on the basis of its location, without any prior knowledge of its function. Notable early successes were the identification of the dystrophin gene (mutated in Duchenne muscular dystrophy), the cystic fibrosis transmembrane conductance regulator (CFTR) gene, and the retinoblastoma gene. Patients with chromosome abnormalities or rearrangements have often provided important clues by highlighting the likely chromosomal region of a gene associated with disease.

In the 1990s a genome-wide set of microsatellites was constructed with approximately 1 marker per 10 centimorgans (cM). These 350 markers could be amplified by polymerase chain reaction (PCR) and facilitated genetic mapping studies that led to the identification of thousands of genes. This approach has been superseded by DNA microarrays or ‘single nucleotide polymorphism (SNP) chips’. Although SNPs (p. 67) are less informative than microsatellites, they can be scored automatically and microarrays are commercially available with several million SNPs distributed throughout the genome.

The common step in all approaches to identifying human disease genes is the identification of a candidate gene (Figure 5.1). Candidate genes may be suggested from animal models of disease or by homology, either to a paralogous human gene (e.g., where multigene families exist) or to an orthologous gene in another species. With the sequencing of the human genome now complete, it is also possible to find new disease genes by searching through genetic databases (i.e., ‘in silico’).

Recent developments in sequencing technology mean that exome sequencing (analysis of the coding regions of all known genes) and even whole genome sequencing are now feasible strategies for identifying disease genes by direct identification of the causal mutation in a family (or families) with multiple affected individuals. Consequently, the timescale for identifying human disease genes has decreased dramatically from a period of years (e.g., the search for the cystic fibrosis gene in the 1980s) to weeks or perhaps even days, now that the human genome sequence is available in public databases.

Position-Independent Identification of Human Disease Genes

Before genetic mapping techniques were developed, the first human disease genes were identified through knowledge of the protein product. For disorders with a biochemical basis, this was a particularly successful strategy.

Next-Generation ‘Clonal’ Sequencing

This new sequencing technology shows great promise for elucidating the remaining ~55% of single gene disorders where the genetic aetiology remains unknown (Figure 5.2). The first success was the identification, by ‘exome’ sequencing, of mutations in the DHODH gene that cause Miller syndrome. Around 164,000 regions encompassing exons and their conserved splice sites (a total of 27 Mb) were sequenced in a pair of affected siblings and in probands from two additional families. Non-synonymous variants, splice donor/acceptor variants, or coding insertion/deletion mutations were identified in nearly 5000 genes in each of the two affected siblings. Filtering these variants against public databases (dbSNP and HapMap) yielded novel variants in fewer than 500 genes. Analysis of pooled data from the four affected patients revealed just one gene, DHODH, which contained two mutated alleles in each of the four individuals.
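The filtering logic described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in the Miller syndrome study: the function name `candidate_genes`, the tuple layout, and the toy gene and variant names are all hypothetical, and two heterozygous variants in one gene are simply assumed to lie on different alleles.

```python
from collections import defaultdict


def candidate_genes(variants_by_patient, known_variants, min_alleles=2):
    """Return genes carrying >= min_alleles novel variant alleles in every patient.

    variants_by_patient: dict mapping a patient ID to a list of
        (gene, variant_id, allele_count) tuples, where allele_count is
        1 for a heterozygous call and 2 for a homozygous call.
    known_variants: set of variant IDs already present in public databases.
    """
    genes_per_patient = []
    for variants in variants_by_patient.values():
        novel_alleles = defaultdict(int)
        for gene, variant_id, allele_count in variants:
            if variant_id in known_variants:
                continue  # discard variants already catalogued
            novel_alleles[gene] += allele_count
        # Recessive model: keep genes with at least two novel alleles.
        genes_per_patient.append(
            {gene for gene, n in novel_alleles.items() if n >= min_alleles}
        )
    if not genes_per_patient:
        return set()
    # A candidate must satisfy the model in every affected individual.
    return set.intersection(*genes_per_patient)


# Toy data (hypothetical genes and variant IDs).
patients = {
    "sib1": [("GENE_A", "v1", 1), ("GENE_A", "v2", 1),
             ("GENE_B", "rs100", 2), ("GENE_C", "v3", 1)],
    "sib2": [("GENE_A", "v1", 1), ("GENE_A", "v2", 1), ("GENE_C", "v3", 1)],
    "proband3": [("GENE_A", "v4", 2), ("GENE_D", "v5", 2)],
}
known = {"rs100"}
result = candidate_genes(patients, known)
print(result)
```

In this toy data set only GENE_A survives both the database filter and the requirement for two novel alleles in every affected individual, mirroring how the pooled analysis narrowed the search to a single gene.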

Positional Cloning

Positional cloning describes the identification of a disease gene through its location in the human genome, without prior knowledge of its function. It is also described as reverse genetics as it involves an approach opposite to that of functional cloning, in which the protein is the starting point.

Linkage Analysis

Genetic mapping, or linkage analysis (p. 137), is based on genetic distances measured in centimorgans (cM). A genetic distance of 1 cM separates two loci that show 1% recombination; that is, in 1% of meioses the two loci will not be co-inherited. On average, 1 cM is equivalent to approximately 1 Mb (1 million bases). Linkage analysis is the first step in positional cloning and defines a genetic interval for further analysis.
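As a rough worked example of these definitions, the sketch below converts a count of recombinant meioses into a recombination fraction, an approximate genetic distance, and an approximate physical distance. The function names are hypothetical, the linear 1% = 1 cM relationship is assumed to hold only for small distances, and the 1 Mb per cM figure is the genome-wide average quoted above rather than a property of any particular region.

```python
def recombination_fraction(recombinants, meioses):
    """Fraction of informative meioses in which two loci were not co-inherited."""
    if meioses <= 0:
        raise ValueError("meioses must be positive")
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    return recombinants / meioses


def approx_map_distance_cm(theta):
    """Approximate genetic distance, assuming 1% recombination = 1 cM.

    Only reasonable for small recombination fractions.
    """
    return 100.0 * theta


def approx_physical_distance_mb(distance_cm, mb_per_cm=1.0):
    """Approximate physical distance using an assumed average of 1 Mb per cM."""
    return distance_cm * mb_per_cm


# Hypothetical observation: 3 recombinants in 100 informative meioses.
theta = recombination_fraction(3, 100)
distance_cm = approx_map_distance_cm(theta)
distance_mb = approx_physical_distance_mb(distance_cm)
print(theta, distance_cm, distance_mb)
```

The same arithmetic explains the marker sets mentioned earlier: roughly 350 markers spaced about 10 cM apart span a genetic map of the order of 3500 cM.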

Linkage analysis can be performed for a single, large family or for multiple families, although this assumes that there is no genetic heterogeneity (p. 378). The use of genetic markers located throughout the genome is described as a genome-wide scan. In the 1990s, genome-wide scans used microsatellite markers (a commercial set of 350 markers was popular), but microarrays with several million SNPs now provide greater statistical power.

Autozygosity mapping (also known as homozygosity mapping) is a powerful form of linkage analysis used to map autosomal recessive disorders in consanguineous pedigrees (p. 269). Autozygosity occurs when affected members of a family are homozygous at particular loci because they are identical by descent from a common ancestor.
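The idea can be illustrated with a small Python sketch that scans ordered marker genotypes for stretches where every affected individual is homozygous for the same allele. This is a simplified illustration only: the function name, the `min_markers` threshold, and the toy genotypes are assumptions, and real analyses also weigh marker spacing, allele frequencies, and genotyping error.

```python
def shared_homozygous_runs(genotypes, min_markers=3):
    """Find runs of consecutive markers where all individuals share one homozygous genotype.

    genotypes: dict mapping an individual ID to a list of (allele1, allele2)
        tuples, one per marker, with markers in map order.
    Returns a list of (first_index, last_index) pairs, both inclusive.
    """
    individuals = list(genotypes.values())
    if not individuals:
        return []
    n_markers = len(individuals[0])
    if any(len(g) != n_markers for g in individuals):
        raise ValueError("every individual needs a genotype at every marker")

    def is_shared_homozygous(i):
        first = individuals[0][i]
        if first[0] != first[1]:
            return False  # the first individual is heterozygous here
        return all(g[i] == first for g in individuals)

    runs = []
    start = None
    for i in range(n_markers):
        if is_shared_homozygous(i):
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_markers:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last marker.
    if start is not None and n_markers - start >= min_markers:
        runs.append((start, n_markers - 1))
    return runs


# Toy genotypes for two affected relatives at six ordered markers.
affected = {
    "aff1": [("A", "G"), ("C", "C"), ("T", "T"), ("G", "G"), ("A", "A"), ("C", "T")],
    "aff2": [("A", "A"), ("C", "C"), ("T", "T"), ("G", "G"), ("A", "G"), ("C", "C")],
}
runs = shared_homozygous_runs(affected)
print(runs)
```

Here markers 1 to 3 (zero-based) form the only shared homozygous stretch, which would be the candidate interval in this toy pedigree.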

Linkage of cystic fibrosis (CF) to chromosome 7 was found by testing nearly 50 white families with hundreds of DNA markers. The gene was mapped to a region of 500 kilobases (kb) between markers MET and D7S8 at chromosome band 7q31-32, when it became evident that the majority of CF chromosomes had a particular set of alleles for these markers (shared haplotype) that was found in only 25% of non-CF chromosomes. This finding is described as linkage disequilibrium and suggests a common mutation from a founder effect (p. 378). Extensive physical mapping studies eventually led to the identification of four genes within the genetic interval identified by linkage analysis, and in 1989 a 3-bp deletion was found within the cystic fibrosis transmembrane conductance regulator (CFTR) gene. This mutation (p.Phe508del) was present in approximately 70% of CF chromosomes and 2% to 3% of non-CF chromosomes, consistent with the carrier frequency of 1 in 25 in whites.

Chromosome Abnormalities

Occasionally, individuals are recognized with single-gene disorders who are also found to have structural chromosomal abnormalities. The first clue that the gene responsible for Duchenne muscular dystrophy (DMD) (p. 307) was located on the short arm of the X chromosome was the identification of a number of females with DMD who were also found to have a chromosomal rearrangement between an autosome and a specific region of the short arm of one of their X chromosomes. Isolation of DNA clones spanning the region of the X chromosome involved in the rearrangement led in one such female to more detailed gene-mapping information as well as to the eventual cloning of the DMD or dystrophin gene (p. 307).

At the same time as these observations, a male was reported with three X-linked disorders: DMD, chronic granulomatous disease, and retinitis pigmentosa. He also had an unusual X-linked red cell group known as the McLeod phenotype. It was suggested that he could have a deletion of a number of genes on the short arm of his X chromosome, including the DMD gene, or what is now termed a contiguous gene syndrome. Detailed prometaphase chromosome analysis revealed this to be the case. DNA from this individual was used in vast excess in a competitive reassociation, under special conditions, with DNA from persons with multiple X chromosomes. This enriched for DNA sequences that he lacked and is known as the phenol enhanced reassociation technique, or pERT; it allowed isolation of DNA clones containing portions of the DMD gene.

The co-occurrence of a chromosome abnormality and a single-gene disorder in the same individual is rare, but identification of such individuals is important as it has led to the cloning of several other important disease genes in humans, such as those for tuberous sclerosis (p. 316) and familial adenomatous polyposis (p. 221).

The Human Gene Map

The rate at which single-gene disorders and their genes are being mapped in humans is increasing exponentially (see Figure 1.6, p. 7). Many of the more common and clinically important monogenic disorders have been mapped to produce the ‘morbid anatomy of the human genome’ (Figure 5.3).

FIGURE 5.3 A gene map of the human genome with examples of some of the more common or important single genes and disorders.

α1-AT 14q32 α1-Antitrypsin deficiency
ABO 9q34 ABO blood group
ACTH 2p25 Adrenocorticotrophic hormone deficiency
ADA 20q13.11 Severe combined immunodeficiency, ADA deficiency
AHP 9q34 Acute hepatic porphyria
AIP 11q23.3 Acute intermittent porphyria
AKU 3q2 Alkaptonuria
ALD Xq28 Adrenoleukodystrophy
APKD1 16p13 Adult polycystic kidney disease, locus 1
APKD2 4q21–23 Adult polycystic kidney disease, locus 2
APOB 2p24 Apolipoprotein B
APOE 19q13.2 Apolipoprotein E
ARG1 6q23 Arginase deficiency, argininemia
ARSB 5q11–13 Mucopolysaccharidosis type VI, Maroteaux-Lamy syndrome
AS 15q11–13 Angelman syndrome
ATA 11q22.3 Ataxia telangiectasia
ATIII 1q23–25 Antithrombin III
ATRX Xq13 α-Thalassemia mental retardation
AZF Yq11 Azoospermia factor
BBS2 16q21 Bardet–Biedl syndrome
BLM 15q26.1 Bloom syndrome
BRCA1 17q21 Familial breast/ovarian cancer, locus 1
BRCA2 13q12.3 Familial breast/ovarian cancer, locus 2
BWS 11p15.4 Beckwith–Wiedemann syndrome
C3 19p13.2-13.3 Complement factor 3
C5 9q34.1 Complement factor 5
C6 5p13 Complement factor 6
C7 5p13 Complement factor 7
C9 5p13 Complement factor 9
CAH1 6p21.3 Congenital adrenal hyperplasia, 21-hydroxylase
CBS 21q22.3 Homocystinuria
CEP 10q25.2-26.3 Congenital erythropoietic porphyria
CFTR 7q31.2 Cystic fibrosis transmembrane conductance regulator
CKN2 10q11 Cockayne syndrome 2, late onset
CMH1 14q12 Hypertrophic obstructive cardiomyopathy type 1
CMH2 1q3 Hypertrophic obstructive cardiomyopathy type 2
CMH3 15q22 Hypertrophic obstructive cardiomyopathy type 3
CMT1A 17p11.2 Charcot–Marie–Tooth disease type 1A
CMT1B 1q22 Charcot–Marie–Tooth disease type 1B
CMT2 1p35–36 Charcot–Marie–Tooth disease type 2
COL1A1 17q21.31-22 Collagen type I, α1 chain, osteogenesis imperfecta
COL1A2 7q22.1 Collagen type I, α2 chain, osteogenesis imperfecta
COL2A1 12q13.11-13.2 Collagen type II, Stickler syndrome
COL3A1 2q31 Collagen type III, α1 chain, Ehlers-Danlos syndrome type IV
CYP11B1 8q21 Congenital adrenal hyperplasia, 11β-hydroxylase
DAZ Yq11 Deleted in azoospermia
DFNB1/A3 13q12 Non-syndromic sensorineural deafness, first recessive, third dominant locus
DM 19q13.2-13.3 Myotonic dystrophy
DMD/BMD Xp21.2 Dystrophin, Duchenne and Becker muscular dystrophy
DRPLA 12p13.1-12.3 Dentatorubropallidoluysian disease
EDSVI 1p36.2-36.3 Ehlers-Danlos syndrome type VI
EYA1 8q13.3 Branchio-oto-renal syndrome
F5 1q23 Coagulation protein V
F7 13q34 Coagulation protein VII
F8 Xq28 Coagulation protein VIII, hemophilia A
F9 Xq27.1-27.2 Coagulation protein IX, Christmas disease, hemophilia B
F10 13q34 Coagulation protein X
F11 4q35 Coagulation factor XI
F12 5q33-qter Coagulation factor XII
FAP 5q21-22 Familial adenomatous polyposis, Gardner syndrome
FBN1 15q21.1 Fibrillin-1, Marfan syndrome
FBN2 5q23-31 Fibrillin-2, contractural arachnodactyly
FGFR1 8p11.1-11.2 Fibroblast growth factor receptor 1, Pfeiffer syndrome
FGFR2 10q26 Fibroblast growth factor receptor 2, Crouzon, Pfeiffer, Apert syndrome
FGFR3 4p16.3 Fibroblast growth factor receptor 3, achondroplasia, thanatophoric dysplasia
FH 19p13.1-13.2 Familial hypercholesterolemia
FRAXA (FMR1) Xq27.3 Fragile X mental retardation
FRDA 9q13–21.1 Friedreich ataxia
FSHMD 4q35 Facioscapulohumeral muscular dystrophy
GAL 9p13 Galactosemia
GAP 9q31 Basal cell nevus syndrome, Gorlin syndrome
GLB1 3p21.33 GM1 gangliosidosis
G6PD Xq28 Glucose-6-phosphate dehydrogenase
GUSB 7q21.11 Mucopolysaccharidosis type VII, Sly syndrome
HbB 11p15.5 β-Globin gene
HD 4p16.3 Huntington disease
HEXA 15q23–24 Hexosaminidase A, Tay-Sachs disease
HEXB 5q13 Hexosaminidase B, Sandhoff disease
HFE 6p21.3 Hemochromatosis
HGPRT Xq26-27.2 Hypoxanthine guanine phosphoribosyl transferase, Lesch-Nyhan syndrome
HLA 6p21.3 Major histocompatibility locus
HPE3 7q36 Holoprosencephaly
IDUA 4p16.3 Mucopolysaccharidosis type I, Hurler syndrome
IGKC 2p12 Immunoglobulin κ light chain
IGLC1 22q11 Immunoglobulin λ light chains
INS 11p15.5 Insulin-dependent diabetes mellitus type 2
KRT5 12q11-13 Epidermolysis bullosa simplex, Koebner type
LGMD7 5q31 Limb-girdle muscular dystrophy
MCAD 1p31 Acyl coenzyme-A dehydrogenase, medium chain
MDS 17p13.3 Miller-Dieker lissencephaly syndrome
MEN1 11q13 Multiple endocrine neoplasia syndrome type 1
MHS 19q13.1 Malignant hyperpyrexia susceptibility, locus 1
MITF 3p14.1 Waardenburg syndrome type 2
MJD 14q24.3-31 Machado-Joseph disease, spinocerebellar ataxia type 3
MPS VI 5q11-13 Maroteaux-Lamy syndrome
MSH2 2p15-16 Hereditary non-polyposis colorectal cancer type 1
NCF2 1q25 Chronic granulomatous disease, neutrophil cytosolic factor-2 deficiency
NF1 17q11.2 Neurofibromatosis type I, von Recklinghausen disease
NF2 22q12.2 Neurofibromatosis type II, bilateral acoustic neuroma
NP 11p15.1-15.4 Niemann-Pick disease type A and B
NPC 18q11-12 Niemann-Pick disease type C
NPS 9q34 Nail-patella syndrome
OTC Xp21.1 Ornithine transcarbamylase
p53 17p13.1 p53 protein, Li-Fraumeni syndrome
PKU 12q24.1 Phenylketonuria
PROC 2q13-14 Protein C, coagulopathy disorder
PROS 3p11.1-q11.2 Protein S, coagulopathy disorder
PRNP 20p12-pter Prion disease protein
PWS 15q11 Prader-Willi syndrome
PXMP1 1p21–22 Zellweger syndrome type 2
RB 13q14.1-14.2 Retinoblastoma
RET 10q11.2 Familial medullary thyroid carcinoma, MEN 2A and 2B, familial Hirschsprung disease
RH 1p34–36.2 Rhesus null disease, Rhesus blood group
RP1 8p11-q21 Retinitis pigmentosa, locus 1
RP2 Xp11.3 Retinitis pigmentosa, locus 2
RP3 Xp21.1 Retinitis pigmentosa, locus 3
rRNA   Ribosomal RNA
SCA1 6p23 Spinocerebellar ataxia, locus 1
SCA2 12q24 Spinocerebellar ataxia, locus 2
SPH1 14q22-23.2 Spherocytosis type I
SMA 5q12.2-13.3 Spinal muscular atrophy
SOD1 21q22.1 Superoxide dismutase, familial motor neuron disease
SRY Yp11.3 Sex-determining region Y, testis-determining factor
TBX5 12q21.3-22 Holt-Oram syndrome
TCOF1 5q32-33.1 Treacher-Collins syndrome
TRPS1 8q24.12 Trichorhinophalangeal syndrome
TSC1 9q34 Tuberous sclerosis, locus 1
TSC2 16p13.3 Tuberous sclerosis, locus 2
TYR 11q14-21 Oculocutaneous albinism
USH1A 14q32 Usher syndrome type IA
USH1B 11q13.5 Usher syndrome type IB
USH1C 11p15.1 Usher syndrome type IC
USH2 1q41 Usher syndrome type II
VWS 1q32 van der Woude syndrome
VHL 3p25–26 von Hippel-Lindau syndrome
VWF 12p13.3 von Willebrand disease
WD 13q14.3-21.1 Wilson disease
WRN 8p11.2-12 Werner syndrome
WS1 2q35 Waardenburg syndrome type 1
WT1 11p13 Wilms tumor 1 gene
ZWS1 7q11.23 Zellweger syndrome type 1


The Human Genome Project

Beginning and Organization of the Human Genome Project

The concept of a map of the human genome was proposed as long ago as 1969 by Victor McKusick (see Figure 1.5, p. 7), one of the founding fathers of medical genetics. Human gene mapping workshops were held regularly from 1973 to collate the mapping data. The idea of a dedicated human genome project came from a meeting organized by the US Department of Energy at Santa Fe, New Mexico, in 1986. The US Human Genome Project started in 1991 and is estimated to have cost around 2.7 billion US dollars. Other nations, notably France, the UK, and Japan, soon followed with their own major national human genome programs and were subsequently joined by a number of other countries. These individual national projects were all coordinated by the Human Genome Organization, which has three centers: one for the Americas based in Bethesda, Maryland, one for Europe located in London, and one for the Pacific in Tokyo.

Although the key objective of the Human Genome Project was to sequence all 3 × 10⁹ base pairs of the human genome, this was just one of its six main objectives or areas of work.

Human Gene Maps and Mapping of Human Inherited Diseases

Designated genome mapping centers with earmarked funding were involved in the coordination and production of genetic (recombination) and physical maps of the human genome. The genetic maps initially involved the production of fairly low-resolution index, skeleton, or framework maps, which were based on polymorphic variable-number di-, tri-, and tetranucleotide tandem repeats (p. 17) spaced at approximately 10-cM intervals throughout the genome.

The mapping information from these genetic maps was integrated with high-resolution physical maps (Figure 5.4). Access to the detailed information from these high-resolution genetic and physical maps allowed individual research groups, often interested in a particular inherited disease or group of diseases, to map a disease gene rapidly and precisely to a specific region of a chromosome.


FIGURE 5.4 A summary map of human chromosome 3, estimated to be 210 Mb in size, which integrates physical mapping data covered by 24 YAC contigs and the Genethon genetic map with cumulative map distances.

(From Gemmill RM, Chumakov I, Scott P, et al 1995 A second-generation YAC contig map of human chromosome 3. Nature 377:299–319; with permission.)

Sequencing of the Human Genome

Although sequencing the entire human genome might seem the obvious main focus of the Human Genome Project, initially it was not the straightforward proposal it appeared. The human genome contains large sections of repetitive DNA (p. 15) that were technically difficult to clone and sequence. In addition, it seemed a waste of time to collect sequence data on the entire genome when only a small proportion is made up of expressed sequences or genes, which are the regions most likely to be of greatest medical and biological importance. Furthermore, the sheer magnitude of sequencing all 3 × 10⁹ base pairs of the human genome seemed overwhelming. With conventional sequencing technology, as used in the early 1990s, it was estimated that a single laboratory worker could sequence up to approximately 2000 bp per day.
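The scale of the task follows from simple arithmetic on the two figures quoted above. The short calculation below is only an order-of-magnitude sketch: it assumes a single pass over the genome with no redundancy and, as a further simplifying assumption, a worker sequencing on all 365 days of the year.

```python
GENOME_SIZE_BP = 3 * 10**9   # size of the human genome quoted in the text
BP_PER_WORKER_DAY = 2000     # early-1990s estimate quoted in the text
DAYS_PER_YEAR = 365          # assumption: no days off

# Worker-days needed for a single pass over the genome.
worker_days = GENOME_SIZE_BP // BP_PER_WORKER_DAY

# Equivalent number of years for one worker under the stated assumptions.
worker_years = worker_days / DAYS_PER_YEAR

print(f"{worker_days:,} worker-days, about {worker_years:,.0f} worker-years")
```

On these assumptions a single worker would need about 1.5 million working days, that is, several thousand years, which is why advances in sequencing technology were essential to the project.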

Projects involving sequencing of other organisms with smaller genomes showed how much work was involved as well as how the rate of producing sequence data increased with the development of new DNA technologies. For example, with initial efforts at producing genome sequence data for yeast, it took an international collaboration involving 35 laboratories in 17 countries from 1989 until 1995 to sequence just 315,000 bp of chromosome 3, one of the 16 chromosomes that make up the 14 million base pairs of the yeast genome. Advances in DNA technologies meant, however, that by the middle of 1995 more than half of the yeast genome had been sequenced, with the complete genomic sequence being reported the following year.

Further advances in DNA sequencing technology led to publication of the full sequence of the nematode Caenorhabditis elegans in 1998 and the 50 million base pairs of the DNA sequence of human chromosome 22 at the end of 1999. As a consequence of these technical developments, the ‘working draft’ sequence, covering 90% of the human genome, was published in February 2001. The finished sequence (more than 99% coverage) was announced more than 2 years ahead of schedule in April 2003, the 50th anniversary of the discovery of the DNA double helix. Researchers now have access to the full catalog of 25,000 to 30,000 genes, and the human genome sequence will underpin biomedical research for decades to come.

Although the Human Genome Sequencing Project is complete, a number of new projects have been initiated as a direct consequence, including the Cancer Genome, HapMap (p. 148), and 1000 Genomes (p. 150) projects.

Development of Bioinformatics

Bioinformatics was essential to the overall success of the Human Genome Project. The term covers the establishment of facilities for collecting, storing, organizing, interpreting, analyzing, and communicating the data from the project, so that they can be widely shared by the scientific community at large. It was vital for anyone involved in any aspect of the Human Genome Project to have rapid and easy access to the data arising from it. This need was met by the establishment of a large number of electronic databases available on the World Wide Web (see Appendix). These include protein and DNA sequence databases (e.g., GenBank, EMBL), databases of genetic maps for humans (such as the GDB, Genethon, CEPH, CHLC, and the Whitehead Institute sites) and other species (the Mouse Genome Database and the C. elegans database), linkage analysis programs (e.g., the Rockefeller University website), annotated genome data (Ensembl and UCSC Genome Bioinformatics), and the catalog of inherited diseases in humans (Online Mendelian Inheritance in Man, or OMIM).

These developments in bioinformatics now allow the prospect of identifying coding sequences and determining their likely function(s) from homologies to known genes, leading to the prospect of identifying a new gene without the need for any laboratory experimental work, or what has been called ‘cloning in silico’.
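One very simple flavor of such computational analysis is scanning a DNA sequence for open reading frames (ORFs). The Python sketch below is a toy illustration, not a gene-prediction method: the function name `longest_orf` and the example sequence are hypothetical, only the forward strand is scanned, and introns, splice sites, and homology to known genes, which real annotation pipelines rely on, are ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(dna):
    """Return the longest ATG-to-stop ORF on the forward strand, or '' if none.

    The returned sequence includes the start and stop codons.
    """
    dna = dna.upper()
    best = ""
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon of a potential ORF
            elif start is not None and codon in STOP_CODONS:
                orf = dna[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # look for the next ORF in this frame
    return best


# Hypothetical sequence containing one short ORF in the third reading frame.
sequence = "CCATGAAATTTGGGTAACC"
orf = longest_orf(sequence)
print(orf)
```

A predicted coding sequence of this kind would then be compared against databases of known genes and proteins to suggest a likely function.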

Functional Genomics

The second major way in which model organisms proved to be invaluable in the Human Genome Project was by providing the means to follow the expression of genes and the function of their protein products in normal development as well as their dysfunction in inherited disorders. This is referred to as functional genomics.

The ability to introduce targeted mutations in specific genes, along with the production of transgenic animals (p. 102), for example in the mouse, allows the production of animal models to study the pathodevelopmental basis for inherited human disorders, as well as serve as a test system for the safety and efficacy of gene therapy and other treatment modalities (p. 350). Strategies using different model organisms in a complementary fashion, taking into account factors such as the ease or complexity of producing transgenic organisms and the generation times of different species, allow the possibility of relatively rapid analysis of gene expression, function and interactions in providing an understanding of the complex pathobiology of inherited diseases in humans.

Further Reading

Botstein D, White RL, Skolnick M, Davis RW. Construction of a genetic linkage map in man using restriction fragment length polymorphisms. Am J Hum Genet. 1980;32:314-331.

One of the original papers describing the concept of linked restriction fragment length polymorphisms.

Kerem B, Rommens JM, Buchanan JA, et al. Identification of the cystic fibrosis gene. Genetic analysis. Science. 1989;245:1073-1080.

Original paper describing cloning of the cystic fibrosis gene.

McKusick VA. Mendelian inheritance in man, 12th ed. London: Johns Hopkins University Press; 1998.

A computerized catalog of the dominant, recessive, and X-linked mendelian traits and disorders in humans with a brief clinical commentary and details of the mutational basis, if known. Also available online, updated regularly.

Ng SB, Buckingham KJ, Lee C, et al. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet. 2010;42:30-35.

The first publication describing the use of next generation sequencing to elucidate the genetic aetiology of Miller syndrome.

Royer-Pokora B, Kunkel LM, Monaco AP, et al. Cloning the gene for an inherited human disorder—chronic granulomatous disease—on the basis of its chromosomal location. Nature. 1985;322:32-38.

Original paper describing the identification of a disease gene through contiguous chromosome deletions.

Strachan T, Read AP. Human molecular genetics, 4th ed. London: Garland Science; 2011.

A comprehensive textbook of all aspects of molecular and cellular biology as related to inherited disease in humans.

Sulston J. The common thread: a story of science, politics, ethics and the human genome. London: Joseph Henry Press; 2002.

A personal account of the human genome sequencing project by the man who led the UK team of scientists.