Lecture PowerPoint to accompany Molecular Biology Fourth Edition Robert F. Weaver Chapter 24 Genomics, Proteomics, and Bioinformatics Copyright © The McGraw-Hill Companies, Inc.

Lecture PowerPoint to accompany
Molecular Biology
Fourth Edition
Robert F. Weaver
Chapter 24
Genomics, Proteomics,
and Bioinformatics
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
24.1 Positional Cloning
• Positional cloning is a method for
discovery of genes involved in genetic
traits
• Positional cloning was very difficult in the
absence of genomic information
• Begins with mapping studies to pin down
the location of the gene of interest to a
relatively small region of DNA
Classical Tools of Positional
Cloning
• Mapping depends on a set of landmarks to
which gene position can be related
• Restriction fragment length polymorphisms
(RFLPs) are landmarks: the lengths of restriction
fragments produced by a specific enzyme vary from
one individual to another
• Exon traps use a special vector to help clone
exons only
• CpG Islands are DNA regions containing
unmethylated CpG sequences
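The RFLP idea in the list above can be sketched in a few lines of Python. This is a toy illustration, not a lab protocol: the enzyme (EcoRI, which cuts G^AATTC), the function name fragment_lengths, and both allele sequences are assumptions made up for the example.

```python
def fragment_lengths(seq, site="GAATTC", cut_offset=1):
    """Cut a linear DNA at every occurrence of `site`; return fragment lengths.

    cut_offset=1 models EcoRI, which cuts after the G of GAATTC.
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]

# Two hypothetical alleles of the same region; allele_b has lost the
# second restriction site through a single base change (GAATTC -> GAGTTC).
allele_a = "A" * 10 + "GAATTC" + "C" * 20 + "GAATTC" + "T" * 5
allele_b = "A" * 10 + "GAATTC" + "C" * 20 + "GAGTTC" + "T" * 5

print(fragment_lengths(allele_a))  # [11, 26, 10]
print(fragment_lengths(allele_b))  # [11, 36]
```

The two alleles give different fragment patterns with the same enzyme, which is what makes the site usable as a landmark.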
[Figure: Detecting RFLPs]
[Figure: Exon Trapping]
Identifying the Gene Mutated in
a Human Disease
• Using RFLPs, geneticists mapped the
Huntington disease gene (HD) to a region near
the end of chromosome 4
• They then used an exon trap to identify the gene itself
• Mutation causing the disease is an expansion of
a CAG repeat from the normal range of 11-34
copies to an abnormal range of at least 38 copies
• Extra repeats cause extra glutamines (Gln) to be
inserted into huntingtin, product of the HD gene
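A minimal sketch of how the repeat ranges quoted above could be applied to a sequence. The function names and the test sequence are hypothetical; the thresholds (11-34 normal, at least 38 abnormal) are the ones given on this slide, and copy numbers the slide does not cover are reported as such rather than guessed.

```python
import re

def longest_cag_run(seq):
    """Return the number of CAG copies in the longest uninterrupted CAG run."""
    runs = re.findall(r"(?:CAG)+", seq.upper())
    return max((len(run) // 3 for run in runs), default=0)

def classify_hd_allele(copies):
    """Classify a repeat count using the ranges quoted on the slide."""
    if copies >= 38:
        return "abnormal"
    if 11 <= copies <= 34:
        return "normal"
    return "outside the ranges given"

# Hypothetical allele with 40 CAG copies.
allele = "ATG" + "CAG" * 40 + "CCG"
copies = longest_cag_run(allele)
print(copies, classify_hd_allele(copies))  # 40 abnormal
```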
24.2 Sequencing Genomes
• What information can be gleaned from genome
sequence?
– Location of exact coding regions for all the genes
– Spatial relationships among all the genes and
exact distances between them
• How is a coding region recognized?
– It contains an ORF long enough to code for a
phage protein
– An ORF must
• Start with ATG triplet
• End with stop codon
– Phage or bacterial ORF is the same as a gene’s
coding region
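The ORF rule above (start with ATG, end with a stop codon, be long enough) can be sketched as a simple scan. This is an illustration under stated assumptions: it reads only the given strand, the name find_orfs and the min_codons cutoff are made up here, and real gene finders do considerably more.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=30):
    """Return (start, end) index pairs of ORFs in the three forward frames.

    An ORF starts at the first ATG after a stop and runs to the next
    in-frame stop codon; `end` is exclusive and includes the stop codon.
    min_codons counts the ATG and every codon before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs

# Hypothetical toy sequence: ATG, five GCT codons, then a TAA stop.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=3))  # [(2, 23)]
```

Raising min_codons is the "long enough" filter: with min_codons=10 the toy ORF is rejected.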
Phage X174 Genome
• First genome sequenced was a very simple one,
phage X174
– Completed by Sanger in 1977
– 5375-nt complete
• Note that some of these phage genes overlap
Genome Results
• The base sequences of viruses and organisms
that have been obtained range from:
– Phages
– Bacteria
– Animals
– Plants
• A rough draft and finished versions of the human
genome have also been obtained
• Comparison of the genomes of closely related
and more distantly related organisms can shed
light on the evolution of these species
[Figure: Sequencing Milestones]
The Human Genome Project
• In 1990, geneticists started to map and
ultimately sequence the entire human
genome
• Original plan was systematic and
conservative
– Prepare genetic and physical maps of
genome with markers to allow piecing DNA
sequences together in proper order
– Most sequencing would be done only after
mapping was complete
1998 – Human Genome Project
• Celera, a private, for-profit company,
shocked the genomics community by
announcing it would complete a
rough draft of the human genome by 2000
• The method would be shotgun sequencing:
the whole human genome would
be chopped up and cloned
– Clones sequenced randomly
– Sequences would be pieced together using
computer programs
Vectors for Large-Scale
Genome Projects
• Two high-capacity vectors have been used extensively in
the Human Genome Project
– Mapping was done mostly using yeast artificial
chromosomes (YACs), which accept inserts of about a million base pairs
– Sequencing was done with bacterial artificial chromosomes (BACs),
which accept inserts of about 300,000 bp
• BACs are more stable and easier to work with than YACs
Clone-by-Clone Strategy
• Mapping the human genome requires a
set of landmarks to which we can relate
the positions of genes
• Some of these markers are genes, many
more are nameless stretches of DNA
– RFLPs
– VNTRs, variable number tandem repeats
– STSs (sequence-tagged sites), including expressed sequence tags (ESTs) and microsatellites
Variable Number Tandem
Repeats
• VNTRs derive from minisatellites,
stretches of DNA that contain a short core
sequence repeated over and over in
tandem (head to tail)
• The number of repeats of the core
sequence in a VNTR is likely to be
different from one individual to another
– So VNTRs are highly polymorphic
– This makes them relatively easy to map
– A disadvantage as genetic markers is that they
tend to bunch together at chromosome ends
Sequence-Tagged Sites
• STSs are short
sequences
– 60-1000 bp long
– Detectable by PCR
• Can design short
primers
– Hybridize few hundred
bp apart
– Amplify a predictable
length of DNA
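The statement that a primer pair amplifies a predictable length of DNA can be checked with a toy "in silico PCR". This is a sketch only: it requires exact primer matches, ignores melting temperature and mismatches, and the template and primer sequences are invented for the example.

```python
def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def amplicon_length(template, fwd, rev):
    """Predict the PCR product length, or None if a primer has no exact site.

    fwd and rev are both written 5'->3'; rev anneals to the top strand,
    so its site on the template is its reverse complement.
    """
    start = template.find(fwd)
    if start == -1:
        return None
    rev_site = reverse_complement(rev)
    end = template.find(rev_site, start + len(fwd))
    if end == -1:
        return None
    return end + len(rev_site) - start

# Hypothetical STS: two 10-nt primer sites separated by 50 bp.
template = "TTTT" + "ACGTACGGTC" + "A" * 50 + "GGCATTCCAG" + "CCCC"
fwd = "ACGTACGGTC"
rev = reverse_complement("GGCATTCCAG")
print(amplicon_length(template, fwd, rev))  # 70
```

A product of the expected size indicates that the STS is present in the clone or genomic DNA being tested.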
[Figure: Sequence-Tagged Sites Mapping]
Microsatellites
• STSs are very useful in physical mapping
or locating specific sequences in the
genome
– Worthless as markers in traditional genetic
mapping unless polymorphic
• Microsatellites are a class of STSs that are
highly polymorphic
– Similar to minisatellites
– Consist of a core sequence repeated over
and over many times in a row
– Core here is 2-4 bp long, much shorter than a minisatellite core
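A short sketch of how tandem repeats with a 2-4 bp core could be located in a sequence. The function name, the min_repeats cutoff, and the test sequence are assumptions for illustration; runs of a single base are skipped because they are not repeats of a 2-4 bp core in the sense used here.

```python
import re

def find_microsatellites(seq, min_repeats=5):
    """Return (start, core, copies) for tandem repeats with a 2-4 bp core."""
    # ([ACGT]{2,4}?) tries the shortest core first; \1{n,} then requires
    # at least n further tandem copies of that same core.
    pattern = re.compile(r"([ACGT]{2,4}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        core = match.group(1)
        if len(set(core)) > 1:  # skip homopolymer runs such as AAAAAA
            hits.append((match.start(), core, len(match.group(0)) // len(core)))
    return hits

# Hypothetical sequence containing (CA)8.
print(find_microsatellites("GGT" + "CA" * 8 + "TTG"))  # [(3, 'CA', 8)]
```

Comparing the copy number at the same locus in different individuals is what makes such a site a polymorphic marker.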
Contig
• A set of clones used by geneticists in
physically mapping or sequencing a given
region is called a contig
– Contains contiguous (or overlapping) DNAs
spanning long distances
– Used like putting together a jigsaw puzzle
– Easier to complete with bigger pieces
– Helpful to assemble in overlapping fashion
Shotgun Sequencing
Massive sequencing projects can take two forms:
1. Map-then-sequence strategy
– Produces physical map of genome including STSs
– Sequences clones (mostly BACs) used in mapping
– Places sequences in order to be pieced together
2. Shotgun strategy
– Assembles libraries of clones with different size
inserts
– Sequences the inserts at random
– Relies on computer program to find areas of
overlap among sequences and piece them together
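The "find areas of overlap and piece them together" step can be illustrated with a toy greedy assembler. This is a sketch under strong assumptions (error-free reads, one strand, no repeats); it is not the program used by Celera or the public project, and all names and reads here are invented.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left: the remaining pieces are separate contigs
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Hypothetical reads covering the 14-nt sequence ATGCGTACGTTAGC.
print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))  # ['ATGCGTACGTTAGC']
```

When no overlaps remain, the function returns more than one piece, which corresponds to contigs separated by gaps.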
[Figure: Shotgun-Sequencing Method]
Sequencing Standards
• A “working draft” may be:
– Only 90% complete
– Error rate of up to 1%
• A “final draft” (less consensus on the definition):
– Error rate of less than 0.01%
– Should have as few gaps as possible
• Some researchers hold that a genome is
not completely sequenced until every last
gap is filled
Sequencing the Human
Genome
• First chromosome completed in the
Human Genome Project was chromosome
22 in late 1999
• Second completed was chromosome 21
• These are the 2 smallest human
autosomes, and they provide very valuable sequence
information
Chromosome 22
• Only the long arm (22q) was sequenced
• Short arm (22p) is composed of pure
heterochromatin, likely devoid of genes
• 11 gaps remained in the sequence
– 10 are gaps between contigs likely due to
“unclonable” DNA
– The other is a 1.5-kb region of cloned DNA that
resisted sequencing
Findings from Chromosome 22
1. We must learn to live with gaps in our
sequence
2. 679 annotated genes categorized as:
– 247 Known genes, previously identified
– 150 Related genes, homologous to known
genes
– 148 Predicted genes, sequence homology to
ESTs
– 134 Pseudogenes, sequences are
homologous to known genes, but contain
defects that preclude proper expression
[Figure: Contigs and Gaps]
More From Chromosome 22
3. Coding regions of genes account for only a
tiny fraction of the length of the chromosome
• Annotated genes are 39% of total length
• Exons are only 3%
• Repeat sequences (Alu, LINEs, etc.) are 41%
4. Rate of recombination varies across the
chromosome
• Long regions of low recombination
interspersed with short regions where it is
relatively frequent
[Figure: Repetitive DNA Content]
More From Chromosome 22
5. There are local and long-range duplications
• Immunoglobulin λ locus: 36 gene segments that
can encode variable regions are clustered together
• A 60-kb region is duplicated with greater than 90%
fidelity almost 12 Mb away
• Such duplications are found in only a few copies (low-copy repeats)
6. Large chunks of human chromosome 22q are
conserved in several different mouse
chromosomes
• 113 human genes with mouse orthologs mapped
to mouse chromosomes
Homologs
• Orthologs are homologous genes in
different species that evolved from a
common ancestor
– Mouse orthologs of human chromosome 22q genes
fall in 8 regions on 7 mouse chromosomes
• Paralogs are homologous genes that
evolved by gene duplication within a
species
• Homologs are any kind of homologous
genes, both orthologs and paralogs
[Figure: Regions of Conservation]
Chromosome 21
• Human chromosome 21q and some of 21p
have been sequenced
• Gaps remaining are relatively few and short
• Sequence reveals a relative poverty of genes
– 225 genes
– 59 pseudogenes
• All 24 genes known to be shared between
mouse chromosome 10 and human
chromosome 21 are in the same order in both
chromosomes
The X Chromosome
• The sequence of 151 Mb of human X
chromosome (99.3% of its euchromatin)
revealed 1098 protein-encoding genes
– 168 genes governing X-linked phenotype
– Genes for 173 noncoding RNAs
• Chromosome is rich in LINE1 elements
– These may serve as way stations for the X inactivation
mechanism in female cells
X Chromosome Orthologs
• Comparison of the X chromosome
sequence with the chicken whole genome
confirmed that X (and partner Y) evolved
from an ancestral pair of autosomes
• Comparison of 3 mammalian X
chromosome sequences demonstrates a high
degree of synteny among these
chromosomes
• This synteny likely reflects a high degree of
evolutionary pressure to keep the order of
genes on the X chromosome relatively stable
Human Genome Project Status
• Working draft of human genome reported by 2
groups allowed estimates that genome contains
fewer genes than anticipated – 25,000 to 40,000
• About half the genome is derived from the
action of transposons
• Transposons themselves have contributed
dozens of genes to the genome
• Bacteria also have donated dozens of genes
• Finished draft is much more accurate than
working draft, but there are still gaps
• The sequence also provides information about
gene birth and death during human evolution
Other Vertebrate Genomes
• Comparing human genome with that of other
vertebrates has taught us much about
similarities and differences among genomes
– Comparison has also helped to identify many human
genes
– In future, will likely help identify defective genes
involved in human genetic diseases
• Closely related species like mouse can be used
to find when and where genes are expressed, and so
to predict when and where the human genes are likely
to be expressed
The Minimal Genome
• It is possible to define the essential gene set of a
simple organism
– Mutate one gene at a time
– See which genes are required for life
• In theory, it is also possible to define the minimal
genome: the minimum set of genes required
for life
– Minimal genome is likely larger than the essential gene
set
• In principle, it is possible to place a minimal genome
into a cell lacking genes of its own, creating a new
life form that can live and reproduce under lab
conditions
The Barcode of Life
• A movement has begun to create
a barcode to identify any species of life on earth
• The first such barcode will consist of the
sequence of a 648-bp piece of mitochondrial
COI gene from each organism
• This sequence is sufficient to identify uniquely
almost any organism
• Other sequences will be worked out for plants
and perhaps later for bacteria
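The barcode idea above amounts to matching a query sequence against a reference library. A minimal sketch, assuming the sequences are already aligned and of equal length; the 10-nt strings stand in for real 648-bp COI barcodes, and the species names and function names are invented.

```python
def identity(a, b):
    """Fraction of identical positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def identify(query, references):
    """Return (name, identity) of the reference most similar to the query."""
    best = max(references, key=lambda name: identity(query, references[name]))
    return best, identity(query, references[best])

# Hypothetical mini-library of reference barcodes.
references = {
    "species_A": "ACGTACGTAC",
    "species_B": "ACGTTCGTAA",
}
print(identify("ACGAACGTAC", references))  # ('species_A', 0.9)
```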
24.3 Applications of Genomics:
Functional Genomics
• Functional genomics refers to those areas
that deal with the function or expression of
genomes
• All the transcripts an organism makes at any
given time make up its transcriptome
• Use of genomic information to block
expression systematically is called
genomic functional profiling
• Study of structures and functions of the
protein products of genomes is proteomics
Transcriptomics
• This area is the study of all transcripts an
organism makes at any given time
• Create DNA microarrays and microchips that
hold thousands of cDNAs or oligos
– Hybridize labeled RNAs from cells to these arrays or
chips
– Intensity of hybridization at each spot reveals the
extent of expression of the corresponding gene
• Microarray permits canvassing expression
patterns of many genes at once
• Clustering of expression of genes in time and
space suggests that the products of these genes
collaborate in some process
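One simple way to look for genes whose expression clusters together is to correlate their expression profiles across time points. This is a sketch of one common approach, not necessarily the one behind the slide; the gene names, intensities, and the 0.9 threshold are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def coexpressed_pairs(profiles, threshold=0.9):
    """Return gene pairs whose profiles correlate at or above threshold."""
    genes = sorted(profiles)
    return [(g, h)
            for i, g in enumerate(genes)
            for h in genes[i + 1:]
            if pearson(profiles[g], profiles[h]) >= threshold]

# Hypothetical hybridization intensities at four time points.
profiles = {
    "geneA": [1, 2, 4, 8],
    "geneB": [2, 4, 8, 16],
    "geneC": [8, 4, 2, 1],
}
print(coexpressed_pairs(profiles))  # [('geneA', 'geneB')]
```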
[Figure: Oligonucleotides on a Glass Substrate]
Serial Analysis of Gene
Expression
• Serial Analysis of Gene Expression
(SAGE) allows us to determine:
– Which genes are expressed in a given tissue
– The extent of that expression
• Short tags, characteristic of particular
genes, are generated from cDNAs and
ligated together between linkers
• These ligated tags are then sequenced to
determine which genes are expressed and
how abundantly
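Once the ligated tags have been sequenced and split apart, the final SAGE step is essentially counting. A minimal sketch, assuming the tags have already been extracted; the tag sequences, the tag-to-gene lookup table, and the function name are hypothetical.

```python
from collections import Counter

def sage_profile(tags, tag_to_gene):
    """Return {gene: (tag count, fraction of all tags)}."""
    counts = Counter(tag_to_gene.get(tag, "unassigned") for tag in tags)
    total = sum(counts.values())
    return {gene: (n, n / total) for gene, n in counts.items()}

# Hypothetical 10-nt tags recovered from one tissue.
tag_to_gene = {"AAACCCGGGT": "gene1", "TTTGGGCCCA": "gene2"}
tags = ["AAACCCGGGT"] * 3 + ["TTTGGGCCCA"]
print(sage_profile(tags, tag_to_gene))
# {'gene1': (3, 0.75), 'gene2': (1, 0.25)}
```

Which tags appear answers "which genes are expressed"; how often each appears estimates the extent of expression.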
[Figure: SAGE]
Whole Chromosome
Transcription Mapping
• High-density whole-chromosome transcriptional
mapping studies have shown that a majority of
sequences in cytoplasmic poly(A) RNAs derive
from non-exon regions of human chromosomes
• Almost half of the transcription from these same
chromosomes is nonpolyadenylated
• Results indicate that great majority of stable
nuclear and cytoplasmic transcripts in these
chromosomes come from regions outside exons
• Helps to explain the great differences between
species whose exons are almost identical
[Figure: Transcription Maps]
Genomic Functional Profiling
• Genomic functional profiling can be
performed in several ways
– Deletion analysis, a type of mutation analysis: mutants are
created by replacing genes one at a time with an
antibiotic resistance gene flanked by
oligomers serving as a barcode for that mutant
– A functional profile can be obtained by growing
the whole group of mutants together under
various conditions to see which mutants
disappear most rapidly
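The pooled-growth experiment above can be read out by comparing barcode abundance before and after growth. A minimal sketch with invented counts; it assumes every barcode has a nonzero count in the starting pool, and the function name is hypothetical.

```python
def depletion_ranking(before, after):
    """Rank barcodes from most depleted to most enriched.

    Each barcode's share of the pool after growth is divided by its
    share before growth; a ratio well below 1 means the mutant is
    disappearing under that condition.
    """
    total_before = sum(before.values())
    total_after = sum(after.values())
    ratio = {}
    for barcode, count in before.items():
        share_before = count / total_before
        share_after = after.get(barcode, 0) / total_after
        ratio[barcode] = share_after / share_before
    return sorted(ratio, key=ratio.get)

# Hypothetical barcode counts for three deletion mutants.
before = {"bc1": 100, "bc2": 100, "bc3": 100}
after = {"bc1": 150, "bc2": 140, "bc3": 10}
print(depletion_ranking(before, after))  # ['bc3', 'bc2', 'bc1']
```

Here the mutant tagged bc3 drops out fastest, suggesting that the deleted gene matters under the tested condition.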
RNAi Analysis
• Another means of genomic functional
analysis on complex organisms can be
done by inactivating genes via RNAi
• An application of this approach targeting
the genes involved in early embryogenesis
in C. elegans has identified:
– 661 important genes
– 326 are involved in embryogenesis
Tissue-Specific Functional
Profiling
• Tissue-specific expression profiling can be done by
examining spectrum of mRNAs whose levels are
decreased by an exogenous miRNA
• Then compare to the spectrum of expression of
genes at the mRNA level in various tissues
• If that miRNA causes decrease in levels of mRNAs
naturally low in cells expressing the miRNA
– Suggests that the miRNA is at least a partial cause of
those natural low levels
• This type of analysis has implicated
– miR-124 in destabilizing mRNAs in brain tissue
– miR-1 in destabilizing mRNAs in muscle tissue
Locating Target Sites for
Transcription Factors
• Chromatin immunoprecipitation followed by DNA
microarray analysis can be used to identify DNA-binding
sites for activators and other proteins
• In organisms with small genomes, all of the intergenic
regions can be included in the microarray
• If the genome is large, that is not practical
• To narrow the areas of interest, CpG islands can be used
– These are associated with gene control regions
– If timing/conditions of activator’s activity are known,
control regions of genes known to be activated at
those times, or under those conditions, can be used
In Situ Expression Analysis
• The mouse can be used as a human
surrogate in large-scale expression
studies that would be ethically impossible
to perform on humans
• Scientists have studied the expression of
almost all the mouse orthologs of the
genes on human chromosome 21
– Expression followed through various stages of
embryonic development
– Catalogued the embryonic tissues in which
these genes are expressed
Single-Nucleotide Polymorphisms
• Single-nucleotide polymorphisms can
probably account for many genetic
conditions caused by single genes and
even some by multiple genes
• Might be able to predict response to a
drug
• Haplotype map with over 1 million SNPs
makes it easier to sort out important SNPs
from those with no effect
Structural Variation
• Structural variation is a prominent source of
variation in human genomes
– Insertions
– Deletions
– Inversions
– Rearrangements of DNA chunks
• Some structural variation can in principle
predispose certain people to contract diseases
– Some variation is presumably benign
– Some also is demonstrably beneficial
24.4 Proteomics
• The sum of all proteins produced by an
organism is its proteome
• Study of these proteins, or even of smaller
subsets, is called proteomics
• Such studies give a more accurate picture
of gene expression than transcriptomics
studies do
Protein Separations and Analysis
• Current research in proteomics requires first that
proteins be resolved, sometimes on a massive
scale
– Best tool for separation of many proteins at once is 2D gel electrophoresis
• After separation, proteins must be identified
– Best method of identification involves digestion of
proteins one by one with proteases
– Then identify the peptides by mass spectrometry
• In the future, microchips with antibodies
attached may allow analysis of proteins in
complex mixtures without separation
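The digest-then-weigh identification step can be sketched as follows. Assumptions to note: the slide says only "proteases", so trypsin (cleaving after K or R, but not before P) is chosen here as a common example; the masses are standard monoisotopic residue masses rounded to five decimals; the protein string and function names are invented.

```python
import re

# Standard monoisotopic residue masses in daltons (rounded).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per peptide (free N- and C-termini)

def tryptic_peptides(protein):
    """Split after K or R unless the next residue is P (Python 3.7+)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def match_masses(observed, protein, tolerance=0.01):
    """Return peptides of `protein` whose mass matches an observed mass."""
    return [p for p in tryptic_peptides(protein)
            if any(abs(peptide_mass(p) - m) <= tolerance for m in observed)]

# Hypothetical protein fragment.
print(tryptic_peptides("MKWVTFRPLLKA"))  # ['MK', 'WVTFRPLLK', 'A']
```

In practice the observed peptide masses from the spectrometer are compared with predicted digests of every protein in a database, and the protein with the most matches is the likely identity.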
[Figure: MALDI-TOF Mass Spectrometry]
[Figure: Detecting Protein-Protein Interactions]
Protein Interactions
• Most proteins work with other proteins to
perform their functions
• Several techniques are available to probe
these interactions
• Yeast two-hybrid analysis has been used
for some time; other methods are now
available
– Protein microarrays
– Immunoaffinity chromatography with mass
spectrometry
– Other combinations
24.5 Bioinformatics
• Bioinformatics involves the building and
use of biological databases
– Some of these databases contain the DNA
sequences of genomes
– Essential for mining the massive amounts of
biological data for meaningful knowledge
about gene structure and expression
Finding Regulatory Motifs in
Mammalian Genomes
Using computational biology techniques,
Lander and Kellis have discovered highly
conserved sequence motifs in 4 mammalian
species, including humans:
– In the promoter regions, these motifs probably
represent binding sites for transcription
factors
– In the 3’-UTRs, these motifs probably represent binding
sites for miRNAs
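The notion of a motif conserved across four species can be illustrated with a toy search for short words shared by every sequence. This is not the method used in the study cited above, which worked from genome alignments; the sequences, the word length k, and the function name are all invented for the sketch.

```python
def shared_kmers(seqs, k=6):
    """Return the set of k-letter words present in every sequence."""
    kmer_sets = [{s[i:i + k] for i in range(len(s) - k + 1)} for s in seqs]
    return set.intersection(*kmer_sets)

# Hypothetical promoter fragments from four species.
promoters = ["AATATAAAGG", "CCTATAAACC", "GTATAAAGTT", "TTTTATAAAC"]
print(shared_kmers(promoters))  # {'TATAAA'}
```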
Using the Databases
• The National Center for Biotechnology Information
(NCBI) website contains a vast store of
biological information, including genomic and
proteomic data
• Start with a sequence and discover gene to
which it belongs, then compare that sequence
with that of similar genes
• Query the database with a topic for information
• View structures of proteins in 3D by rotating the
structure on your computer screen