Transcript Document

Genomics Lecture Topics


Genetic mapping studies: two approaches

Classical linkage map/genome-wide association study

Physical map
Cloning and isolating genes the old-fashioned way using
positional cloning



Search for the cystic fibrosis gene
Modern genome sequencing

Shotgun sequencing an entire genome

Sequencing the human genome
Functional/comparative genomics
Classical genetic linkage studies:
Genetic linkage mapping involves determining the statistical association
of specific traits with genetic markers on chromosomes using pedigrees
and crosses.

Use recombination frequencies to determine relative distances
between markers on a chromosome.

Genome–wide Association Studies = GWAS

Humans require 24 different maps, one for each of the 22
autosomes and one each for the X and Y chromosomes.

Marker alleles are used to determine the rate of recombination
(e.g., crossing over) between linked genes.

Linked = on the same chromosome or genome

The unit measured for each linkage map is the recombination
frequency = # recombinants/total progeny

Reported as map units (mu) or centiMorgans (cM).

Will talk about this more starting with chapter 11.
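The recombination frequency calculation above can be sketched in a few lines of Python. The function names and the progeny counts are made up for illustration, and the conversion of 1% recombination to 1 map unit (1 cM) is only a good approximation for closely linked markers.

```python
def recombination_frequency(recombinants: int, total_progeny: int) -> float:
    """Return recombination frequency as a fraction (# recombinants / total progeny)."""
    if total_progeny <= 0:
        raise ValueError("total_progeny must be positive")
    if not 0 <= recombinants <= total_progeny:
        raise ValueError("recombinants must be between 0 and total_progeny")
    return recombinants / total_progeny


def map_units(recombinants: int, total_progeny: int) -> float:
    """Convert recombination frequency to map units (1% recombination = 1 mu = 1 cM)."""
    return 100.0 * recombination_frequency(recombinants, total_progeny)


# Hypothetical cross: 120 recombinant offspring among 1,000 scored.
rf = recombination_frequency(120, 1000)
distance_cm = map_units(120, 1000)
print(f"RF = {rf:.3f}, distance = {distance_cm:.1f} cM")
```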
Different types of markers used in genetic mapping:
1.
Genes can be used as genetic markers, but they are not ideal
choices because they occur infrequently (ca. every 100 kb in
humans).
2.
Greater marker density is usually required.
3.
3 major types of markers are used:
1.
RFLPs (restriction fragment length polymorphisms) = substitutions at a restriction site.
2.
Microsatellites (STR) = short tandem repeats
3.
Single nucleotide polymorphisms (SNPs)
4.
STS = sequence tagged site
Genome-wide mapping:

High density genetic mapping was revolutionized in the 1980s by
the discovery of abundant polymorphic genetic markers like
microsatellites.

Research teams collaborated and added to a common data base.

By 1994, the human genetic map had localized:
5,264 microsatellites to 2,335 chromosome loci
(average density of one marker every 599 kb)

In the process, thousands of sequence tagged sites (STSs) were identified.

STS = couple hundred base pairs of known sequence
High-density genetic map of 5,264 microsatellites localized to each of 23 chromosomes.
From genome-wide mapping to genome sequencing…

For many species with small genomes, such a map would have
provided enough landmarks to begin sequencing the entire genome
using a conventional map and sequence approach.

Human map still lacked resolution, large stretches of uncharted DNA
remained.

Average distance between markers was 600 kb.

Physical mapping was required to assist with the sequencing.
Physical map = map of physically identifiable regions of genomic
DNA constructed without recombination analysis.

Time and effort could be minimized by targeting sequencing efforts
to a specific chromosome (or smaller regions).
Two types of physical maps useful for sequencing a genome:
1. Low Resolution-Cytogenetic/FISH maps
•
Stained chromosomes produce banding patterns composed of bands
that average 6 Mb.
•
Regions are designated by their position relative to the centromere.
“q” = long arm
“p” = short arm
Numbered from the centromere starting with “1”
•
Genes and other sequences are localized to chromosome maps with
probes and by using a technique called fluorescent in situ
hybridization (FISH)
•
Various types of radioactive probes and stains also can be used to
mark specific regions of chromosomes.
•
Provides a physical map of the overall structure of each
chromosome/region.
http://www.mun.ca/biology/scarr/FISH_chromosome_painting.htm
http://www.euchromatin.org/E09.htm
Two types of physical maps useful for sequencing a genome:
2. High Resolution-YAC/BAC Clone Contig Maps
•
Mechanically shear or partially digest genomic DNA with restriction
enzymes and clone large 200-500 kb overlapping fragments to YACs
or BACs.
•
An entire genome or single chromosome can be represented in a
YAC or BAC clone library (depends on starting point).
•
Overlapping YAC/BAC clones can be assembled into a scaffold
without sequencing by DNA fingerprinting using markers like
microsatellites.
•
BAC vectors with a capacity of 300 kb and ability to replicate in E.
coli have become popular for genome sequencing (now routinely
sequenced using the shotgun approach).
Fig. 10.1 2nd edition, YAC contig physical map assembled by
microsatellite mapping (combination YACs + microsatellite mapping)
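As a rough illustration of how overlapping clones can be grouped into contigs from shared markers without sequencing, here is a minimal Python sketch. The clone names, marker names, and the rule "clones sharing at least one marker overlap" are simplifying assumptions, not the procedure used for any real map.

```python
from collections import defaultdict

# Hypothetical clone -> set of microsatellite/STS markers detected in it.
clone_markers = {
    "YAC_A": {"m1", "m2"},
    "YAC_B": {"m2", "m3"},
    "YAC_C": {"m3", "m4"},
    "YAC_D": {"m7", "m8"},
    "YAC_E": {"m8", "m9"},
}


def build_contigs(clone_markers):
    """Group clones into contigs; clones sharing a marker are assumed to overlap."""
    # Index which clones carry each marker.
    marker_to_clones = defaultdict(set)
    for clone, markers in clone_markers.items():
        for marker in markers:
            marker_to_clones[marker].add(clone)

    seen = set()
    contigs = []
    for start in sorted(clone_markers):
        if start in seen:
            continue
        # Depth-first search over the "shares a marker" relation.
        stack, contig = [start], set()
        while stack:
            clone = stack.pop()
            if clone in contig:
                continue
            contig.add(clone)
            for marker in clone_markers[clone]:
                stack.extend(marker_to_clones[marker] - contig)
        seen |= contig
        contigs.append(sorted(contig))
    return contigs


contigs = build_contigs(clone_markers)
print(contigs)
```

In this toy data set, markers m2 and m3 tie the first three clones together, while the last two clones form a separate contig because no marker bridges the gap.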
Cloning, isolating, and sequencing genes:


Locating a gene is easy if the gene product (protein) is identified.
1.
Create a cDNA library using an expression vector.
2.
Probe with antibodies that bind the gene product.
3.
Isolate and sequence positive clones.
If the gene product is unknown, locating and sequencing a gene is
more difficult.
1. Identify a marker (microsatellite, RFLP, SNP) that:
   1. Shows a strong statistical association with the disease
      phenotype in test crosses or a genome-wide association study
      (GWAS).
   2. Is physically linked to the gene on the same chromosome.
2. Use linkage map + physical map and a technique called
   positional cloning to home in on the gene and actually sequence it.
   e.g., cloning and discovery of the cystic fibrosis (CF) gene.
Positional cloning: identification of the cystic fibrosis (CF) gene:

Most common lethal genetic disease in the U.S. (~1 in 2,000).

First human gene identified by positional cloning.

Required 4 years and the work of many laboratories.
Overview of cystic fibrosis:
CF results from defect in protein that regulates the movement of salt
and water in and out of cells.
Causes thick mucus secretions in the lungs, pancreas, and intestines.
Causes lung disease and organ failure, patients experience chronic
bacterial infections.
Life expectancy is about 40 years.
First steps to identifying the CF gene by positional cloning:
1.
Many hundreds of individuals with CF pedigrees were screened with
a large number of RFLPs.
2.
A single recurring RFLP showed weak linkage (statistical
association) to the cystic fibrosis trait.
3.
CF gene was next localized to chromosome 7 using a labeled RFLP
probe and in situ hybridization to condensed chromosomes.
4.
All other known RFLPs from chromosome 7 were simultaneously
screened for linkage to CF.
5.
Two more linked RFLPs were discovered on a 500,000 bp subregion
(31-32) of the long arm of chromosome 7 (7q31-q32).
6.
The data indicated CF locus is within a 500,000 bp region of
chromosome 7.
Steps to identifying the CF gene (cont.):
1.
Section (500 kb) of chromosome 7 containing the CF gene was cut,
cloned, and mapped using a technique called chromosome walking.
1.
End of a cloned sequence is used as a probe to find adjacent
overlapping fragments in a genomic library.
2.
Clones that overlap are mapped with RFLPs to determine the
extent of overlap.
3.
A new labeled probe designed for the second clone is used to
screen the library once again.
4.
Repeat…

Technique does not work well with highly repetitive sequence that is
scattered throughout the genome.

Length of each step in the chromosome walk is limited by the size of
inserts in the library and the size of the overlap.
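The walking procedure above can be mimicked with a toy simulation. In this sketch the "genomic library" is a dictionary of short overlapping strings cut from a made-up region, and "hybridization" is replaced by exact substring matching; the sequences, clone names, and probe length are all hypothetical.

```python
def chromosome_walk(library, start_clone, target, probe_len=6, max_steps=20):
    """Walk from start_clone toward a clone containing `target`.

    library: dict of clone name -> insert sequence (toy data).
    Each step uses the last `probe_len` bases of the current clone as a probe
    and moves to the unvisited clone that contains the probe and extends
    furthest past it. Returns the list of clones visited, or None if stalled.
    """
    path = [start_clone]
    current = start_clone
    for _ in range(max_steps):
        if target in library[current]:
            return path
        probe = library[current][-probe_len:]
        best, best_extension = None, 0
        for name, seq in library.items():
            if name in path:
                continue
            pos = seq.find(probe)
            if pos == -1:
                continue
            # Number of new bases this clone adds beyond the probe.
            extension = len(seq) - (pos + probe_len)
            if extension > best_extension:
                best, best_extension = name, extension
        if best is None:
            return None  # walk stalled: no overlapping clone extends the path
        path.append(best)
        current = best
    return None


# Made-up 60 bp region and four overlapping 20 bp "clones" cut from it.
region = "ATGCCGTAAGCTTGACCTGAATCGGTCATTGCAGGCTAACGTTCGAGTACCATGGACTTA"
library = {
    "c1": region[0:20],
    "c2": region[12:32],
    "c3": region[26:46],
    "c4": region[40:60],
}

path = chromosome_walk(library, "c1", target="CATGGA")
print(path)
```

Each step can advance no further than one insert length minus the overlap, which is why the walk needs one step per clone here. If the probe fell in a sequence repeated elsewhere, the substring match would hit unrelated clones, mirroring the problem with repetitive DNA noted above.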
Fig. 9.10, 2nd edition: Illustration showing how chromosome walking was used to
identify a candidate gene for a disease like cystic fibrosis.
Technique called chromosome jumping also was used:
1.
Use partial restriction digestion to cut a large section of
chromosomal DNA into large overlapping fragments.
2.
Circularize fragments with DNA ligase, bringing ends of DNAs that
previously were distant close together.
3.
Cut the circles with a restriction enzyme yet again to release the
junction region (ends are now inverted).
4.
Clone junction regions to form a jumping library.
5.
Subclone a small fragment of DNA and use as a probe to find the
next junction fragment occurring in the library (same technique as
chromosome walking).
6.
Repeat… and/or start chromosome walking.
7.
Chromosome jumping reaches the target gene faster than walking.
8.
A similar technique called “mate pair” is used in today’s next-generation sequencing.
Chromosome jumping
Preparation of next-generation mate-pair library:
http://www.investigativegenetics.com
Summary of the search for the CF gene:
1.
7 chromosome jumps were made for CF.
2.
Chromosome walks were made from each jump site to identify
overlapping clones.
3.
Clones spanning a total 500 kb eventually were characterized.
4.
Next, cloned DNA was used as a probe against other species using a
restriction digest + Southern blot.
*Genes are more conserved than non-coding sequences and
similar sequences should be found in other species.
5.
Five subclones (or candidates) hybridized with DNA from other organisms.
6.
Two of the subclones were ruled out by linkage analysis, and a third
was a pseudogene (gene-like sequence lacking expression signals).
7.
Remaining two clones were hybridized with mRNA on a Northern blot
to test whether their sequences are transcribed.
8.
One more candidate was eliminated, and the 5th candidate was
sequenced…
Characteristics of the CF gene:
1.
cDNA (mature mRNA of same size) is 6,500 bp.
2.
Genomic DNA: CF gene spans 250 kb and contains 24 exons.
3.
68% of Caucasians with cystic fibrosis show a 3-bp deletion that
results in the loss of phenylalanine (Phe).
4.
Sixty other mutations described.
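To see why a 3-bp deletion removes exactly one amino acid without disturbing the rest of the protein, consider this small Python sketch. The mini-gene below is a made-up toy sequence, not the real CFTR sequence; the codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna: str) -> str:
    """Translate coding-strand DNA from its first base until a stop codon."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        aa = CODON_TABLE[coding_dna[i:i + 3]]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical mini-gene: Met-Ile-Ile-Phe-Gly-Val-stop.
normal = "ATGATCATCTTTGGTGTTTAA"
# Delete the three bases at positions 9-11 (the TTT codon for Phe).
mutant = normal[:9] + normal[12:]

normal_protein = translate(normal)
mutant_protein = translate(mutant)
print(normal_protein, mutant_protein)
```

Because the deletion is a multiple of three, the reading frame downstream is preserved and only the phenylalanine (F) is lost.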
Fig. 4.13, CFTR structure: Cystic Fibrosis Transmembrane Conductance Regulator protein
Shotgun DNA sequencing:

Sequence the entire genome rapidly.

No requirement for a high resolution linkage or physical map.

Just break the genome up into small pieces, sequence it, and find the
gene of interest/do the analysis later.

Reverses the way genetic studies proceed.

It used to be we had to find the gene first to study the cause of
the disease.

Now we can study effects of genes we didn’t even know existed.
Shotgun DNA sequencing---dideoxy method:
1.
Begin with genomic DNA and/or 200-300 kb BAC clone library.
2.
Mechanically shear DNA into ~2 kb overlapping fragments.
3.
Isolate on agarose, purify, and clone into standard plasmid vectors.
4.
Sequence ~500 bp from each end of each 2 kb insert.
5.
Sequence from the middle 1,000 bp of each insert is obtained from
overlapping clones.
6.
Repeat the process so that 4-5x the total length of the genome is
sequenced (dideoxy sequencing is 99.99% accurate).
7.
Results in a contig library with ~97% genome coverage (the
missing 3% is composed mostly of repeated DNA sequence).
8.
Assemble hundreds of thousands of overlapping ~500 bp sequences
with fast computers operating in parallel (supercomputer).
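The assembly step can be illustrated with a deliberately tiny greedy overlap assembler. This is a teaching sketch only: real assemblers handle sequencing errors, both strands, and repeats, and they do not compare every pair of reads naively. The reads and the minimum overlap length are made up.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if < min_len)."""
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left to merge
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Toy 30 bp "genome" and three overlapping reads, given out of order.
genome = "ATGCCGTAAGCTTGACCTGAATCGGTCATT"
reads = [genome[16:30], genome[0:12], genome[8:20]]

assembled = greedy_assemble(reads)
print(assembled)
```

With only three error-free reads and no repeats, the greedy merges recover the original sequence as a single contig.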
2 kb clones present a problem, solved with 10 kb clones:
1.
Many repeated sequences in the genome are in regions spanning ~5
kb in size.
2.
As a result, many 2 kb clones consist entirely of repeated DNA.
3.
Results in a dead stop in the assembly, because there is ambiguity
about where each clone goes.

Repeated sequences occur all over the genome.
4.
On average, 10 kb clones contain less repeated DNA sequence.
5.
Solution is to create and sequence a 10 kb clone library derived from
the same genomic DNA or BAC library.
6.
Complete genome coverage requires combining the sequences from
the 2 kb & 10 kb libraries.
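The ambiguity caused by repeats can be shown with a toy example in which an 8-base repeat stands in for a ~5 kb repeat, a short read stands in for a 2 kb clone, and a longer read stands in for a 10 kb clone. All sequences here are made up.

```python
def placements(read: str, genome: str):
    """Return every start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


repeat = "ACACGTGT"
# unique flank + repeat + unique spacer + repeat + unique flank
genome = "TTGCA" + repeat + "GGATC" + repeat + "CCTAG"

short_read = "CACGTG"      # lies entirely inside the repeat
long_read = "CACGTGTGGA"   # starts in the repeat, ends in unique sequence

short_hits = placements(short_read, genome)
long_hits = placements(long_read, genome)
print(short_hits, long_hits)
```

The short read fits both copies of the repeat, so its position is ambiguous; the longer read reaches unique flanking sequence and fits only one place.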
Fig. 8.13, Shotgun sequencing a genome
Genome                               Date  Size                       Institute                        Method
Homo sapiens mtDNA                   1981  16,569 bp (1 circular)                                      -
Haemophilus influenzae (bacteria)    1995  1,830,137 bp (1 circular)  TIGR                             Shotgun
Mycoplasma genitalium (bacteria)     1995  580,070 bp (1 circular)    TIGR                             Shotgun
Escherichia coli (bacteria)          1997  4,639,221 bp (1 circular)  University of Wisconsin-Madison  Shotgun
Methanococcus jannaschii (Archaeon)  1996  1,739,933 bp (3 circular)  DOE                              Shotgun
Saccharomyces cerevisiae (yeast)     1996  12,067,280 bp (16 linear)  100+ labs                        Mapping
Caenorhabditis elegans (nematode)    1998  97,000,000 bp (6 linear)   Consortium                       Mapping
Genome                               Date  Size                       Institute
Drosophila melanogaster (fruit fly)  2000  180,000,000 bp             UC Berkeley, Celera Genomics
Arabidopsis thaliana (angiosperm)    2000  125,000,000 bp (5 linear)  Consortium
Homo sapiens (human)                 2000  3,400,000,000 bp           Human Genome Project & Celera Genomics
Method: Shotgun w/BAC map; Mapping & Shotgun
Sequencing the human genome:
Two major players:
Human Genome Project (HGP):




Publicly funded international consortium (NIH, DOE, etc.)
Francis Collins, National Human Genome Res. Inst. (NHGRI)
Began in U.S. in 1990 with a goal of 15 years
Genetic and physical mapping approach + dideoxy sequencing
Celera Genomics Corporation (CRA):





Spin-off of Applied Biosystems (ABI)
J. Craig Venter, CEO
Created in 1998 with a goal of 3 years
Direct shotgun approach + dideoxy sequencing (+ HGP’s maps
for validation)
Both groups collected blood and sperm samples from anonymous
male and female donors of different ethnic backgrounds.
J. Craig Venter, Celera Genomics
Francis Collins, Human Genome Project
Milestone: 26 June 2000 White House press conference with Bill Clinton:
HGP:
Started 1990
~22.1 billion nucleotides of sequence data
7-fold coverage
Unfinished (24% completely finished, 50% near-finished)
Celera:
Started 1998
~14.5 billion nucleotides of sequence data
4.6-fold coverage
Complete assembled genome with >99% coverage
First assembled draft of human genome was simultaneously published
in Nature & Science 15 & 16 February 2001 (Nature published 1 day
earlier).
How did Celera et al. assemble the sequences?
Two methods:
Method A:
1.
Assembly of 26.4 million 550 bp sequences (4.6-fold coverage),
without reference to a physical map of any kind.
2.
Covered >99% of the genome.
3.
500 million trillion base-to-base comparisons.
4.
20,000 CPU hours (833 CPU days) on a supercomputer.
Method B:
1.
Used BAC clone scaffold (combined lots of smaller maps) to validate
the whole genome direct shotgun assembly approach.
2.
Also helped resolve ambiguities resulting from the assembly of
short repeated DNA fragments.
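The Method A figures can be checked with simple arithmetic: fold coverage is the total number of bases sequenced divided by the genome size. The sketch below uses only the numbers quoted above; the function and variable names are illustrative.

```python
def fold_coverage(num_reads: float, read_length_bp: float, genome_size_bp: float) -> float:
    """Fold coverage = total bases sequenced / genome size."""
    return num_reads * read_length_bp / genome_size_bp


num_reads = 26.4e6       # sequences assembled in Method A
read_length_bp = 550     # bases per sequence

total_bp = num_reads * read_length_bp     # total bases sequenced
implied_genome_bp = total_bp / 4.6        # genome size implied by 4.6-fold coverage
cpu_days = 20_000 / 24                    # 20,000 CPU hours expressed in days

print(f"Total sequence: {total_bp / 1e9:.2f} billion bp")
print(f"Implied genome size at 4.6x: {implied_genome_bp / 1e9:.2f} billion bp")
print(f"CPU time: {cpu_days:.0f} days")
```

The total of about 14.5 billion bases matches the figure reported for Celera at the June 2000 milestone, and 20,000 CPU hours works out to the 833 CPU days quoted.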
Features of the human genome:

32,000 genes estimated (50,000-100,000 were predicted).

Not many more genes than Drosophila, and only 50% more genes
than Caenorhabditis elegans (nematode worm).

Only 1-1.5% of the genome codes for protein.

50% of the sequence is repeated DNA.

Humans share 223 genes with bacteria that are not found in yeast,
nematodes, or fruit flies.
Next-generation genome sequencing:

The shotgun method is fundamentally the same.

The throughput has increased and the cost has decreased.

Not uncommon to assemble trillions of sequence reads.
Some things to consider:
If error rates are high (454, Illumina), 30-50x genome coverage is
required.
If error rates are low (SOLiD, Ion Torrent), 4-5x coverage is
sufficient.

Costs are falling from $10K to $1K.
Sequencing is no longer the primary need; data storage/retrieval and
computational needs are outpacing everything else.
How much data storage does 1 human genome require?
About 1.5 GB (2 CDs) if you stored only one copy of each letter.
For the raw format 2-30 TB are required.
Less accurate platforms that need 30-50x coverage require more data
storage capacity.
Post-genome sequencing era is very different:

Classical genetics studies started with a phenotype and set out
to identify the gene.

But we now have the ability to start with a gene sequence and
set out to identify the phenotype.

Large data sets require many mathematical tools, which has
given rise to the field of bioinformatics.
Lots of applications:
1.
Identify genes within genomic DNA sequences.
2.
Align and match homologous gene sequences in databases and
seek to determine function.
3.
Predict structure of gene products.
4.
Describe interactions between genes and gene products.
5.
Study gene expression.
1. Identifying genes in DNA sequences:

First step is annotation = identification and description of putative
genes and other important sequences.

Open reading frames (ORFs)
ORF = potential protein coding sequence that begins with a start
codon and ends with a stop codon.

ORFs come in all sizes.

Not all ORFs encode proteins (6-7% do not in yeast).

ORFs with introns can require sophisticated computer
algorithms to detect.
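A minimal ORF scan following the definition above (start codon to the first in-frame stop codon) might look like the sketch below. It assumes ATG as the only start codon, scans only the forward strand, and ignores introns, so it is closer to what works for compact genomes than for intron-rich ones. The test sequence is made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3):
    """Return (start, end, orf_sequence) tuples for forward-strand ORFs.

    An ORF here starts at ATG and runs to the first in-frame stop codon
    (stop included). `end` is exclusive; min_codons counts the stop codon too.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, seq[start:i + 3]))
                start = None
    return sorted(orfs)


dna = "CCATGGCTTAAGGATGAAACCCTGATT"
orfs = find_orfs(dna)
for start, end, orf in orfs:
    print(start, end, orf)
```

Raising `min_codons` is the usual crude way to filter out short ORFs that arise by chance.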
2. Homology searches to assign gene function:

Homology search = identify gene function by searching a database.

Similarities reflect evolutionary relationships and shared function.

Homology searches are performed for nucleotides and amino acids.

GenBank’s BLAST search: http://www.ncbi.nlm.nih.gov/BLAST/

Example, human mtDNA control region sequence:

TTCTCTGTTCTTCATGGGGAAGCAGATTTGGGTACCACCCAAGTATTGACTCACCC
ACAACAACCGCTATGTATTTCGTACATTACTGCCAGCCACCATGAATATTGCACGG
TACCATAAATACTTGACCACCTGTAGTACATAAAAACCCAATCCACATCAAAA
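To give a feel for what a similarity search computes, the sketch below scores a query against a tiny made-up "database" using Smith-Waterman local alignment. This is not BLAST, which uses a faster seed-and-extend heuristic plus statistics; the scoring values, entry names, and the unrelated sequence are assumptions. The first database entry is the first line of the control region sequence above.

```python
def smith_waterman_score(a: str, b: str, match=2, mismatch=-1, gap=-2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical two-entry database.
database = {
    "mtDNA_control_region": "TTCTCTGTTCTTCATGGGGAAGCAGATTTGGGTACCACCCAAGTATTGACTCACCC",
    "unrelated_sequence": "GGCCTTAAGGCCTTAAGGCCTTAAGGCC",
}

# Query: a 20-base stretch taken from the control region entry.
query = database["mtDNA_control_region"][12:32]

hits = sorted(
    ((name, smith_waterman_score(query, seq)) for name, seq in database.items()),
    key=lambda hit: hit[1],
    reverse=True,
)
for name, score in hits:
    print(name, score)
```

The entry containing the query scores the maximum possible (20 matches at 2 points each), while the unrelated entry scores lower; real searches rank database hits in the same spirit.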
Fig. 9.2, Summary of genes in
the yeast genome.
3. Gene function can be identified and studied in other ways:

Gene knockout approach = systematically delete different genes
and observe the phenotypes (PCR + cloning is one method).

Study the transcriptome = complete set of mRNAs in a cell

mRNAs are not stable; their types and levels change with
different experimental conditions.
1.
Sample mRNA at experimental intervals and convert to cDNA
using reverse transcriptase.
2.
Probe unknown cDNAs with DNA microarray of PCR-generated
ORF sequences (requires known sequence for each probe).
3.
Or sequence the entire transcriptome using Next Generation
Sequencing (e.g., Pyrosequencing).
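Whichever platform is used, the comparison between conditions is commonly summarized as a log2 fold change per gene. The sketch below assumes made-up signal values for three hypothetical ORFs and adds a pseudocount of 1 to avoid dividing by zero; real analyses also normalize and test for statistical significance.

```python
import math


def log2_fold_changes(control, treated, pseudocount=1.0):
    """log2(treated / control) per gene, with a pseudocount added to both values."""
    return {
        gene: math.log2(
            (treated.get(gene, 0) + pseudocount) / (control[gene] + pseudocount)
        )
        for gene in control
    }


# Hypothetical signal intensities (or read counts) for three ORFs.
control = {"ORF_1": 99, "ORF_2": 49, "ORF_3": 199}
treated = {"ORF_1": 399, "ORF_2": 49, "ORF_3": 49}

changes = log2_fold_changes(control, treated)
for gene, value in changes.items():
    print(f"{gene}: {value:+.1f}")
```

A value of +2 means roughly four-fold more mRNA in the treated sample, 0 means no change, and -2 means roughly four-fold less.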
Fig. 9.7b, Microarray study
of gene expression
“Proteomics”:
Proteome = complete set of expressed proteins in a cell
Major goals of proteomics:
•
Identify every protein.
•
Determine the sequence and structure of each protein (and its
function).
•
Create a database with the sequence of each protein.
•
Analyze protein levels and interactions in different cell types, at
different times, and at different stages of development.
Rationale:

Genes are two steps removed from disease (DNA → mRNA →
protein).

Most gene products involved in disease are composed of protein.

Understanding protein means understanding disease.