Genome Analysis - University of Leicester


Genomics: sequencing of microbial genomes
This lecture illustrates the strategies used in microbial genome sequencing projects, compares genome content and organisation amongst microbes, and shows how to derive information on gene function across a genome.
Objectives for students:
• Describe the strategies involved in microbial genome sequencing and functional genomics
• Provide examples of information that can be derived from genomics
Genomics: 1
Microbial Genome Sequencing
• Genome Sequencing Projects
– strategy & methods
– annotation
• Comparative genomics
– organisation
– gene content
• Functional genomics
– transcriptome
– proteome
– genome-wide mutation
• Concentrate on strategy & ideas
Genomics: 2
Bacterial genome projects
• Many completed:
– Haemophilus influenzae
– Escherichia coli
– Bacillus subtilis
– Mycoplasma genitalium
– Helicobacter pylori (x2)
– Campylobacter jejuni
– Treponema pallidum
– Neisseria meningitidis
– Neisseria gonorrhoeae
– Vibrio cholerae
– E. coli O157
• Good links to projects:
– http://www.tigr.org/
– http://www.ncbi.nlm.nih.gov/
– http://www.sanger.ac.uk/
– http://www.genomesonline.org/
Genome sequencing progress
• Complete:
– Archaeal: 70 (2007 & 2008: 49 & 55)
– Bacterial: 945 (554 & 728)
– (Eukaryal: 121) (76 & 97)
• Ongoing:
– Prokaryotic: 3498; Archaeal: 111
– (Eukaryotic: 1223)
• Metagenome projects: 200
Genomics: 3
www.genomesonline.org
Genomics: 4
Microbial eukaryote projects
• Complete
– Yeast (Saccharomyces cerevisiae)
– Plasmodium falciparum
– Aspergillus nidulans, A. niger, A. oryzae & A. fumigatus
– Trypanosoma cruzi & T. brucei
– Leishmania
– Entamoeba histolytica
– Giardia lamblia
– Candida albicans & C. glabrata
– Paramecium
• Underway
– Pneumocystis carinii
– Plasmodium vivax
– some complete chromosomes finished
– Other species and isolates from completed list
Genomics: 5
Why bother? To sequence or not to sequence
(considerations in the pre-genome era)
• piecemeal collection of sequenced genes
– slow
– costly
– ever complete?
• genome project
– rational approach
– efficient and rapid
– quality assurance
– address novel questions
• problems/issues
– ownership
– strain choice
– cost
– approach
– data release
– some now less relevant
• Post genomic era
– Comparative genomics
– Functional genomics
Genomics: 6
Genome sequencing strategy
• Strategy choice
• large collaborative cosmid/BAC-based projects
– now better suited for larger genomes
– slow
• small insert shotgun approach
– centralised
– rapid and efficient
– choice for bacteria
• Strain choice
– fresh isolate vs lab strain
– clinical vs environmental
– subsequent genetic analysis
Genomics: 7
Yeast genome sequence strategy
• Yeast chromosomes (16) individually sequenced
• several approaches used
• Make genome library in cosmids
• order cosmid library
– which cosmid overlaps with which
– link cosmid to genome map
– produce tiled set of cosmids
– only sequence minimum number
• Use chromosome-specific probe to identify chromosome-specific cosmids
• sequence cosmid inserts by subcloning
• Solve problems by direct PCR sequencing, walking and other libraries (lambda)
• Telomeres
Genomics: 8
Tiled set
Genomics: 9
Ordering clones
[Figure: overlapping clones c1 to c5 ordered against markers A to J]
Genomics: 10
[Figure: map of overlapping cosmid clones (labels include PH011, 70512, 7202, 70266, 70893, 70265, 70449, 70515, 70871, 70124, 70463) drawn against a numeric position scale]
Genomics: 11
Whole genome/chromosome shotgun strategy (WGS)
• Rapid
• Generation of small insert genomic library
• Library is not initially ordered
• DNA sequence ends of inserts
• Depends on powerful computing to assemble sequence reads
Genomics: 12
Main steps in generating a complete genome sequence
Step                  Minimum time period (weeks)
Isolation             2
Construction          4-6
Shotgun sequencing    2-4
Finishing             12
Annotation            12
Genomics: 13
[Figure: bacterial chromosome → random shearing → size selection → plasmid vector → library of clones → individual clones → sequence end of each clone]
Genomics: 14
Sequencing individual clones → assembly → genome sequence with gaps
Genomics: 15
Automated sequencers: ABI 3700
• Made by Applied Biosystems
• Most widely used automated sequencers:
– 96 capillaries
– robot loading from 384-well plates
• Two to three hours per run
• 600–700 bases per run
[Figure labels: robotic arm and syringe; 96 glass capillaries; 96-well plate; load bar]
Genomics: 16
Automated sequencers: MegaBACE
• Made by Amersham
• 96 capillaries
• Robotic loading from 384-well plate
• Two to four hours per run
• Can read up to 800 bases
Source : GE Healthcare Life Science, Uppsala, Sweden
Genomics: 17
Automatic gel reading
• Top image: confocal detection by the MegaBACE sequencer of fluorescently labeled DNA
• Bottom image: computer image of sequence read by automated sequencer
Genomics: 18
Industrialization of sequencing
• Most genome sequencing projects divide tasks among different teams
– Genome libraries
– Production sequencing
– Finishing
• Sequencing machines run 24/7
• Many tasks performed by robots
The Broad Institute of MIT and Harvard, www.genome.gov
Genomics: 19
The future is here? 454 sequencing
Reprinted by permission from Macmillan Publishers Ltd: [NATURE] (Margulies et al., 437: 376), copyright (2005)
Genomics: 20
454 sequencing: the system
• DNA library preparation: 4.5 hours
• emPCR: 8 hours
• Sequencing: 7.5 hours
• Well diameter: average 44 μm
• 400,000 reads obtained in parallel
• A single cloned, amplified sstDNA bead is deposited per well
• 4 bases (TACG) cycled 100 times
• Chemiluminescent signal generation
• Signal processing to determine base sequence and quality score
Source :454 Sequencing © Roche Diagnostics
Genomics: 21
WGS: Just how much effort?
• individual sequencing reads accumulate
– each read about 500 bp
– computing used to assemble reads
– contiguous sequences called contigs
• Aim for 8-10 read coverage of genome for accuracy
• example:
– H. influenzae
• 19,687 templates
• 24,304 reads assembled
• 11,631,485 bp
• 9
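The effort figures above can be sanity-checked with simple arithmetic: fold coverage is the total number of bases read divided by the genome size. The sketch below is illustrative only; the function names are made up for this example, and the H. influenzae genome size (1.830 Mbp) is taken from the comparative genomics slides later in this lecture.

```python
import math


def fold_coverage(total_bases, genome_size):
    """Average number of times each base of the genome has been read."""
    return total_bases / genome_size


def reads_needed(genome_size, read_length, target_coverage):
    """Number of reads of a given length needed to reach a target coverage."""
    return math.ceil(genome_size * target_coverage / read_length)


# H. influenzae example: 11,631,485 bp assembled over a 1.830 Mbp genome
h_influenzae_coverage = fold_coverage(11_631_485, 1_830_000)

# Reads of about 500 bp needed for 8-fold coverage of the same genome
reads_for_8x = reads_needed(1_830_000, 500, 8)
```

On these numbers the assembled H. influenzae reads correspond to a little over six-fold coverage, which gives a feel for how many 500 bp reads an 8-10 fold target implies.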
Genomics: 22
Sequencing a genome
fragments of sequence
cisahubofevaluatedgen
luatedgeneticsrel ourcesforteach
esforteachershealt atedgene
chershealthprofession
ourcesforteach
chershealthprofession
esforteachershealt
luatedgeneticsrel
hprofessionalsandgeneralpub
hprofessionalsandgeneralpub
tatedgene
cisahubofevaluatedgenc
chershealthprofession
ourcesforteach
luatedgeneticsrel
atedgene
cisahubofevaluatedgenc
vgecisahubof
bofevaluatedgenetics
icsrelatedresourcesforteachershealth
lthprofessionalsandgeneralp
generalpublic
overlaps
contiguous sequence
vgecisahubofevaluatedgeneticsrelatedresourcesforteachershealthprofessionalsandgeneralpublic
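The overlap-and-merge idea on this slide can be sketched as a toy greedy assembler: repeatedly join the two fragments with the longest suffix/prefix overlap. This is a minimal illustration, not how production assemblers work; it assumes error-free fragments and uses only five of the fragments shown above. The function names and the minimum overlap of 3 letters are choices made for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlaps remain."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # remaining contigs are separated by gaps
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Five error-free fragments taken from the slide
READS = [
    "lthprofessionalsandgeneralp",
    "vgecisahubof",
    "generalpublic",
    "icsrelatedresourcesforteachershealth",
    "bofevaluatedgenetics",
]
contigs = greedy_assemble(READS)
```

With these fragments the result is a single contig. Dropping a fragment from the middle leaves two or more contigs, which is the toy equivalent of a gap.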
Genomics: 23
Gaps
[Figure labels: genome; contig; sequence read; library clone; sequence gap; physical gap]
Genomics: 24
Bridging Gaps
[Figure: number of contigs plotted against number of reads; the curve falls towards 1 contig, with regions marked "rapid gap bridging", "difficult gap bridging" and "Finishing"]
• rise in contig number as amount of reads increases
• steady fall as accumulating sequence bridges gaps between contigs
• levels off as new reads more likely in known contig than gap
• start finishing
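The rise and fall of contig number can be approximated with the Lander-Waterman model for randomly placed reads: with N reads of length L on a genome of size G, coverage is c = NL/G and the expected number of contigs is roughly N·e^(-c). The sketch below is a simplification that ignores the minimum overlap needed to detect a join, and the genome size and read length are only example values.

```python
import math


def expected_contigs(n_reads, read_len, genome_size):
    """Lander-Waterman estimate of contig number for randomly placed reads.

    Simplified: the minimum overlap needed to join two reads is ignored.
    """
    coverage = n_reads * read_len / genome_size
    return n_reads * math.exp(-coverage)


GENOME = 1_830_000   # example genome size in bp
READ_LEN = 500       # example read length in bp

early = expected_contigs(1_000, READ_LEN, GENOME)    # few reads, few contigs
peak = expected_contigs(3_660, READ_LEN, GENOME)     # about 1x coverage
late = expected_contigs(10_000, READ_LEN, GENOME)    # gaps being bridged
at_10x = expected_contigs(36_600, READ_LEN, GENOME)  # 10x coverage
```

In this model the contig count peaks at about 1x coverage and then falls. Real projects stop improving earlier than the model predicts because cloning bias and repeats leave gaps that random reads rarely close, which is where finishing starts.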
Genomics: 25
Finishing
• Why are gaps present?
• Gap bridging
– sequence gaps
• choose appropriate clone and walk
– physical gaps
• alternative libraries (which?)
• PCR across gap
• Mistakes/poor sequence
– areas where read coverage is less than 8-10
– repeated sequences, e.g. rRNA
• closure and completion
Genomics: 26
Finished Yet?
atgaatccaagccaaatacttgaaaatttaaaaaaagaattaagtgaaaacgaatacgaaaactatttatcaaatttaaaattcaacgaaaaacaaagcaaagcagatcttttagtttttaatgctccaaatgaactcatgg
ctaaattcatacaaacaaaatacggcaaaaaaatcgcgcatttttatgaagtgcaaagcggaaataaagccatcataaatatacaagcacaaagtgctaaacaaagcaacaaaagcacaaaaatcgacatagctca
tataaaagcacaaagcacgattttaaatccttcttttacttttgaaagttttgttgtaggggattctaacaaatacgcttatggagcatgtaaagccatagcacataaagacaaacttggaaaactttataatccaatctttgtt
tatggacctacaggacttggaaaaacacatttacttcaagcagttggaaatgcaagcttagaaatgggaaaaaaagttatttacgctaccagtgaaaatttcatcaacgattttacttcaaatttaaaaaatggttctttaga
taaatttcatgaaaagtatagaaactgcgatgttttacttatagatgatgtacagtttttaggaaaaaccgataaaattcaagaagaatttttctttatatttaatgaaatcaaaaataacgatggacaaatcatcatgacttca
gacaatccacccaacatgctaaaaggtataaccgaacgcttaaaaagtcgttttgcacatgggatcatagctgatataactccacctcaactagatacaaaaatagccatcataagaaaaaaatgtgaatttaacgata
tcaatctttctaatgatattataaactatatcgctacttctttaggggataatataagagaaatcgaaggtatcatcataagtttaaatgcttatgcaaccatactaggacaagaaatcacactcgaacttgccaaaagtgtg
atgaaagatcatatcaaagaaaagaaagaaaatatcactatagatgacattttatctttggtatgtaaagaatttaacatcaaaccaagcgatgtgaaatccaataaaaaaactcaaaatatagtcacagcaagacgcat
tgtgatttacctagctagggcacttacggctttgactatgccacaacttgcgaattattttgaaatgaaagatcatacagctatttcacataatgttaaaaaaatcacagaaatgatagaaaatgatgcttctttaaaagcaa
aaatcgaagaacttaaaaacaaaattcttgttaaaagtcaaagttaagtgaaaggatgtgaaaaataaattctagagtgtgaaaaaaagaaattaagcaaagtatgataaaatacaaatttgattattttgctttgaaaaat
ttcacaatttcaacaagcttattattacaacgaatttaaaattaaaataaaccaaggagaaaaaatgaagttaagtatcaataaaaatactttagaatctgcagtgattttatgtaatgcttatgtagaaaaaaaagactcaa
gcaccattacttctcatcttttttttcatgctgatgaagataaacttcttattaaagctagtgattatgaaataggtatcaactataaaataaaaaaaatccgcgtagaatcaagtggttttgctactgcaaatgcaaaaagtatt
gcagatgttattaaaagcttaaacaatgaagaagttgttttagaaaccattgataattttttatttgtaagacaaaaaagtacaaaatacaaacttcctatgtttaatcatgaagattttccaaattttccaaatacagaaggaa
aaaaccaatttgacattgattcaagtgatttaagccgttctcttaaaaagatattaccaagtattgatacaaataacccaaaatactccttaaatggtgcatttttagatataaaaacagataaaattaacttcgtaggaactg
atacaaaacgccttgcaatctatactttagaaaaagcaaataatcaagaatttagttttagtatccctaaaaaagctattatggaaatgcaaaaacttttctatgaaaaaatagaaattttttatgatcaaaatatgcttattgc
caaaaatgaaaattttgaattctttacaaaacttatcaatgataaatttccagattatgaaaaagttataccaaaaactttcaaacaagaactcagtttttcaactgaagattttatagatagtcttaaaaaaatcagcgttgtaa
ctgaaaaaatgagacttcattttaacaaagataaaatcatctttgaaggtataagtttagacaatatggaagcaaaaacagaacttgaaattcaaacaggagtaagtgaagaatttaatcttactataaaaatcaaacattt
acttgatttcttaacttctatagaagaagaaaaattcactttaagtgtaaatgaacctaattcagcatttatagtcaaatcccaaggactatcaatgattatcatgcctatgattttgtaataaaacaagtaaaagataaagga
aaaatatgcaagaaaattacggtgcgagtaatattaaagtcctaaaaggcttagaagctgttagaaaacgcccaggtatgtatataggagatacaaacataggcggacttcatcatatgatttatgaagttgtggataat
tctatcgatgaagctatggcaggacattgcgatactatagatgtagaaatcactactgaaggaagctgtatagttagtgataatggtcgtggtattcctgttgatatgcacccaactgaaaatatgccaactttaactgttg
ttttaactgtcctacatgcagggggaaaattcgataaagatacttataaagtttcaggcggtttgcacggtgttggggtttcggttgtaaatgcactctctaaaaaacttgtagctacagttgaaagaaatggagaaattta
tcgtcaagaattttcagaaggtaaagttatcagtgaatttggtgtgataggaaaaagtaaaaaaacaggaacaactatagaattttggcctgatgatcaaatttttgaagtgactgaatttgattatgaaattttggctaaaa
gatttcgtgaacttgcatacttaaatccaaaaatcactataaattttaaagataaccgcgtaggcaaacatgaaagttttcactttgaaggtggaatttctcagtttgttacagacttaaataaaaaagaagctttaactaaag
caattttctttagtgtagatgaagaagatgtgaatgttgaagtagctttgctttacaatgatacttatagtgaaaatttactctcttttgtaaataatattaaaaccccagatggtggaacacacgaagctggttttagaatggg
tttaactcgtgtgataagtaactatatagaagcaaatgcaagtgctagagaaaaggataataaaatcacgggtgatgatgtgcgtgaaggtttgatcgctattgtgagtgtaaaggtacctgaaccacaatttgaagga
caaaccaaaggaaaacttggttcaacttatgtgcgtcctatagtttcaaaagcaagttttgagtatttgactaaatattttgaagaaaatcctatcgaagctaaagctataatgaataaagctttaatggcagctagaggaa
gagaagcagcgaaaaaagctagagaattaacgcgcaaaaaagaaagtttaagcgtaggaactttaccagggaaattagctgattgtcaaagtaaagatccaagtgaaagtgaaatttatcttgtggaaggggattct
gcaggaggttctgcaaaacaaggtagagaaagatctttccaagctatactgcctttgcgtggtaaaattttaaatgttgaaaaagcaagactagataaaattttaaaatctgagcaaattcaaaatatgattaccgcttttg
gctgtggtataggtgaagattttgatctttcaaaacttagatatcataaaatcatcatcatgacagatgcggatgttgatggatctcatatacaaaccttgcttttaactttcttcttccgttttatgaatgaacttgtggcaaatg
gacatatttatctagcacaaccacctttatatctttataaaaaagctaaaaagcaaatttatttaaaagatgaaaaagctttgagcgaatacctgatagaaacgggaatagaaggtttaaactatgaaggtataggaatga
atgatttaaaagattatttaaaaatcgttgcagcttatcgtgcgattttaaaagatcttgaaaagcgttttaatgtgatttctgtgatacgctatatgatagaaaattcaaatttagttaaaggaaataatgaagaattatttagtg
taatcaaacaatttttagaaacacaaggacacaatatcttaaatcattatatcaacgaaaatgaaattcgagctttcgttcaaactcaaaatggcttagaagaacttgtgatcaatgaagaacttttcactcatccactatat
gaagaagcgagttatatttttgataagattaaagatagaagcttggaatttgataaagatattttagaagttcttgaagatgttgaaaccaatgctaaaaaaggtgctactatacaacgctataaaggtttaggggaaatga
atcctgagcaactttgggaaaccacaatggatccaagcgtaagaagacttttaaaaatcactattgaagatgcacaaagtgcaaatgatacctttaatctctttatgggtgatgaggttgaaccaagacgcgattatatc
caagcgcacgctaaagatgtaaagcatttggatgtgtaaaaatttatcattgaagaaatcatttcttcaatgagttttgttttgtaagagtatagctagaggaattcttcttcttgtatcgtatttttctccataatatttttcaagat
aatttaaaattttttcttcatcttcaggttctatttcccaaagtccttcactatcttgcatccatcttatagctgctaaccaagcttttctacttgcatgcatattggtaatgagattggatccatgacaagctaaacaatttgcttcc
actaaaggtgaatcaggatcgataatcaatcctgtatcagggttaatttcaagattttgagcccaacttgcacttaaaaacaatgctaagatcaatataatttttttcatacttaaactccataaacattaactctatggcatgc
attattgatatatcctcctggattccactgtgctaaaaccataggttgactgttaccttgactatcgatagctcttgcccaaatttcataatatccttttgttggtattgatatttgagcactccatttttgccatgctaatctatttaa
tggtttttctacctttgc ………………….
Genomics: 27
Sequencing a genome
contiguous sequence
vgecisahubofevaluatedgeneticsrelatedresourcesforteachershealthprofessionalsandgeneralpublic
annotation
VGEC is a hub of evaluated genetics related resources for
teachers, health professionals and general public.
Genomics: 28
Genome Annotation
• Find ORFs
– look for ATG-Stop (+alternatives)
– over certain size
– overlaps
– computer based (“Glimmer” & “Orpheus”) and trained eye
• ORF function
– Search databases with predicted translated sequences: BLASTX
– Consider level of similarity and context
– Domain comparisons
• Pfam/Prosite
• Other features
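The "ATG to stop, over a certain size" rule can be sketched as a naive ORF scan over all six reading frames. This is a teaching sketch only and is not how Glimmer or Orpheus work; the function names, the default minimum size and the example sequence are assumptions made for illustration. Alternative start codons can be passed in through the `starts` parameter.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3, starts=("ATG",)):
    """Return (strand, start, end, orf) for each start-to-stop ORF.

    Coordinates are 0-based on the strand being scanned; the stop codon
    is included in the ORF and counted towards min_codons.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in starts:
                    start = i            # first start codon opens the ORF
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None         # look for the next ORF in this frame
    return orfs


# Made-up example sequence containing one short ORF on the forward strand
example_orfs = find_orfs("CCATGAAATTTGGGTAACC")
```

A real annotation pipeline would use a much larger size cut-off and would then weigh overlapping candidates, as the slide notes.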
Genomics: 29
www.yeastgenome.org
Genomics: 30
http://www.yeastgenome.org/MAP/GENOMICVIEW/GenomicView.shtml
http://mips.gsf.de/genre/proj/yeast/index.jsp
Genomics: 31
Artemis: sequence viewer and annotation tool from the Sanger
Centre (http://www.sanger.ac.uk/Software/Artemis/)
Genomics: 32
Genomics: 33
Genomics: 34
http://xbase.bham.ac.uk/
xBASE is a database for comparative genome analysis of all
bacterial genome sequences
Chaudhuri RR, Pallen MJ. xBASE, a collection of online
databases for bacterial comparative genomics. Nucleic Acids
Res. 2006 Jan 1;34(Database issue):D335-7.
Genomics: 35
A conceptual diagram of the flux and information in a network-based genome-sequencing project
[Figure: a Coordinator and a Bioinformatics Lab linked to many nodes labelled S. Flows shown: DNA; shotgun templates; shotgun sequences; working draft sequence; finishing instructions; finishing sequences; finished sequence; annotation tasks; annotations; finished annotated sequence]
Genomics: 36
Post Genome Sequence
• Comparative genomics
– comparing genome organisation and content
– genome size
– genome repeats/Tn/phages
– gene content
– minimal gene content
• Functional genomics: ascribing gene function across a genome
– gene function: knowns
– phenotype prediction
– gene function: unknowns
– investigating function
• Bacteria-Yeast
Genomics: 37
Bacteria: Does size matter?
• Link genome size to adaptive capability
– biosynthetic capability
• synthesis of nutrients
– Stress resistance
• resist environmental insults
– structural complexity
• surface structures, sporogenesis
– Regulation: sensing signals and transcriptional responses
• detect change or requirement and respond appropriately
• transcriptional regulation
Genomics: 38
Not just Size but how you use it…..
• Small genomes
– Mycoplasma genitalium
• 580,070 bp
• smallest genome for self-replicating organism
• free living but only just: infects host cells (guess which!)
• few biosynthesis and regulatory systems
• has replication, transcription, translation, metabolism etc. functions
– Borrelia burgdorferi
• 910,725 bp
• Lyme disease
• few cellular biosynthetic systems
– Mycoplasma pneumoniae (0.8 Mbp); Chlamydia trachomatis (1.0 Mbp)
Genomics: 39
bigger genomes
• Haemophilus influenzae
– 1.830 Mbp
– colonises human respiratory tract
– limited environment
• Helicobacter pylori
– 1.667 Mbp
– colonises human stomach
– limited environment
• Campylobacter jejuni
– 1.641 Mbp
– colonises intestine
– limited environment
Genomics: 40
and bigger….
• Escherichia coli (K-12)
– 4.639 Mbp
• Bacillus subtilis
– 4.214 Mbp
– soil/plant organism
– secondary metabolites
• Pseudomonas aeruginosa
– incomplete (5.9 Mbp)
• Yersinia pestis (4.4 Mbp)
• Clostridium spp (4-5 Mbp)
• Mycobacterium tuberculosis
– 4.411 Mbp
– slow growing (double in 24h)
– large proportion of genome on lipid metabolism
• Streptomyces coelicolor (~8 Mbp)
– secondary metabolites –antibiotics!
Genomics: 41
Organisation
• Linear chromosomes
– Borrelia burgdorferi
– Streptomyces coelicolor
• Multiple chromosomes
– Vibrio cholerae
• Plasmids
– Borrelia burgdorferi
– 17 linear & circular plasmids
– 50% genome size
– plasmid replication, “decaying genes”, ?Ag variation
• Transposons, IS elements, phages
– found in most genomes
– Campylobacter has none
• Repeats
Genomics: 42
Replication
• Origin (oriC) and termination (terC) of replication
– oriC often near dnaA gene (replication initiation protein)
– In Borrelia burgdorferi (linear) oriC (& dnaA) in centre
• strand bias
– which strand is each gene on?
– transcription in same direction as replication is more efficient
– variation in level of strand bias
• Mt 55% vs Bs 75% (M. tuberculosis vs B. subtilis)
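Strand bias can be expressed as the fraction of genes transcribed in the same direction as the replication fork that passes through them. The sketch below assumes a circular chromosome with known oriC and terC coordinates and bidirectional replication; the function name, coordinates and gene list are hypothetical and chosen only to illustrate the calculation.

```python
def codirectional_fraction(genes, ori, ter, genome_len):
    """Fraction of genes transcribed in the same direction as replication.

    genes: list of (position, strand) with strand '+' or '-'.
    On the arm running clockwise from ori to ter the fork moves towards
    higher coordinates, so '+' strand genes are co-directional there;
    on the other arm '-' strand genes are co-directional.
    """
    def on_clockwise_arm(pos):
        return (pos - ori) % genome_len < (ter - ori) % genome_len

    same = sum(
        1 for pos, strand in genes
        if (strand == "+") == on_clockwise_arm(pos)
    )
    return same / len(genes)


# Hypothetical 1000 bp chromosome: origin at 0, terminus at 500
toy_genes = [(100, "+"), (200, "-"), (600, "-"), (900, "-")]
toy_bias = codirectional_fraction(toy_genes, ori=0, ter=500, genome_len=1000)
```

Here three of the four toy genes run with the fork, giving a bias of 75%, the same kind of figure as those quoted on the slide.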
Genomics: 43
Gene Content
• Annotation
– sequence similarity
• gene families
• regulators, transport, biosynthesis
– domain matches
• trans-membrane domains, DNA binding
• Paralogues and Orthologues
– Paralogues:
• Members of same family (homologous) in same genome.
• Likely to have different exact function
– Orthologues:
• homologues (same family) in different genomes
• May have identical function
Genomics: 44
Vibrio cholerae as predicted by genome........
Reprinted by permission from Macmillan Publishers Ltd: [NATURE]( Heidelberg et al, 406 ,477-483), copyright (2000)
Genomics: 45
Gene content (cont.)
• ORFans
– significant proportion of genome contains ORFs of unknown function
– some may be orthologues of unknowns in other organisms
– some unique to organism
• important for biology of organism
– examples:
• H. influenzae: 42%
• H. pylori: 33%
• E. coli: 38%
• M. tuberculosis: 60% to 16%
– number decreasing
• Gene size: most about 1 kb
Genomics: 46
Genomic rearrangements
http://www.sanger.ac.uk/resources/software/act/
• Example comparison
• Comparison of: S. enterica Typhi CT18 with S. enterica Typhi Ty2
• inversion that spans terminus
Genomics: 47
Variation by gain and/or loss
• Core regions
– shared by closely related species
• Additional “flexible” gene pool
– variable regions
– acquired from mobile genetic elements
• Genomic Islands
– first described as pathogenicity islands
– in non-pathogens too
– wider role: pathogens, commensals, symbionts, environmental
• Gain of GI sometimes associated with gene loss
– reduction in obligate intracellular pathogens
• Genome organisation as well as genome content correlates with microbial lifestyle
[Figure: from a common bacterial ancestor, genomes change by gene acquisition by HGT (GEI, plasmid), by mutations and rearrangements, and by genome reduction by deletion events. Lifestyles shown: intracellular bacterium, obligate intracellular pathogen, endosymbiont; extracellular bacterium, facultative pathogen, symbiont; all lifestyles]
Genomics: 48
Other tRNA-associated elements:
tRNAPProL
Black arrows=Sal+Ec; white arrows=Sal or Ec; grey=strain/serovar specific
GC is for S. Typhi
Infection and Immunity, May 2002, p. 2351-2360, Vol. 70, No. 5
Genomics: 49
Other tRNA-associated elements:
tRNAArgU
Infection and Immunity, May 2002, p. 2351-2360, Vol. 70, No. 5
Genomics: 50
The supragenome
• The distributed-genome hypothesis (DGH)
• Bacteria have a (supra)genome much larger than the genome of any single bacterium.
• Core and non-core gene sets
– Example: Hiller et al. sequenced 8 strains of Streptococcus pneumoniae + 9 already available
– Core set of genes in all strains
– 20-30% genes non-core (not present in all strains)
• Genetic recombination generates diversity across strains.
• Also for Haemophilus influenzae (Hogg et al.)
– ~1400 in core set and ~1300 non-core in subset of strains
• Hiller et al. Journal of Bacteriology, November 2007, p. 8186-8195, Vol. 189, No. 22
• Hogg et al. Genome Biology 2007, 8:R103 (doi:10.1186/gb-2007-8-6-r103).
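The core and non-core split amounts to set arithmetic over the gene content of each sequenced strain: the core set is the intersection, and everything else in the union is non-core. The sketch below is illustrative; the strain names and gene labels are invented, and deciding whether two genes in different strains are "the same" (orthologue calling) is assumed to have been done already.

```python
def core_and_noncore(strain_genes):
    """Split the pooled gene set into core (in every strain) and non-core.

    strain_genes: dict mapping strain name to an iterable of gene labels.
    """
    gene_sets = [set(genes) for genes in strain_genes.values()]
    supragenome = set().union(*gene_sets)   # every gene seen in any strain
    core = set.intersection(*gene_sets)     # genes present in all strains
    return core, supragenome - core


# Invented gene content for three hypothetical strains
toy_strains = {
    "strain_A": ["g1", "g2", "g3", "g4"],
    "strain_B": ["g1", "g2", "g3", "g5"],
    "strain_C": ["g1", "g2", "g6"],
}
core, noncore = core_and_noncore(toy_strains)
noncore_fraction = len(noncore) / (len(core) + len(noncore))
```

As more strains are added the core can only shrink and the supragenome can only grow, which is why the supragenome ends up much larger than any single genome.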
Genomics: 51
Yeast
• 16 chromosomes totalling 12.068 Mbp
• 5885 ORFs (6275, but 390 unlikely to be translated)
• Few introns: ~4%
• Avg gene size 2 kb (worm ~6 kb and human >30 kb)
• GC varies along chromosome length
– low GC at telomere & centromere
– GC-rich regions correlate with higher recombination
• Tn and remnants in genome
– evidence of hotspots
• 50% ORFs known function
– some exact role unclear
• http://genome-www.stanford.edu/Saccharomyces/
• http://mips.gsf.de/projects/fungi
Genomics: 52
Functional genomics
• Functional genomics: ascribing gene function across a genome
• function and inter-relationships
• strategy
– [bioinformatic analysis: gene identification]
– Transcriptome: expression pattern
– Proteome: expression pattern
– Mutantome: mutant phenotype
– Interactome: protein-protein interactions
[Figure: GENOME → TRANSCRIPTOME (RNA copies of the active protein-coding genes) → PROTEOME (the cell’s repertoire)]
Genomics: 53
Arrays: micro and chip
• Microarrays
– Glass slides with <10000 individual samples applied in known position
– Use of robotics
– Samples can be PCR products or oligos
– example: oligo/PCR product complementary to each ORF
• Chip arrays
– silicon based
– >10,000 sequences
– http://www.affymetrix.com/index.html
• Redundancy
• fluorescent labels
Genomics: 54
[Figure: chip arrays. One cell = one specific sequence; individual sequences (e.g. TGCATA) with bound complementary sample (ACGTAT) are read by laser]
Genomics: 55
Transcriptome
• Genome-wide determination of expression level of each ORF
• when expressed relates to role
• also assess mutants
• compare expression of each ORF in different conditions
• Genome wide expression maps
• global patterns of expression
Genomics: 56
When expressed?
[Figure: a two-ORF example organism, "Bacillus genieae", with orf 1 and orf 2 and their mRNAs. Grow in conditions when only orf 2 is expressed, isolate mRNAs and make a cDNA copy. Array sequences shown: AGGCAT unbound, AATGAA paired with TTACTT]
Genomics: 57
Grow under different conditions → extract mRNA → probe array with labelled copy of mRNA
Genomics: 58
Differentially labelled probes
[Figure: array images for the red channel, the green channel, and the two combined]
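Two-colour array data are commonly summarised per ORF as a log2 ratio of the red and green intensities, so that a two-fold increase and a two-fold decrease sit symmetrically at +1 and -1. The sketch below is a minimal illustration: the intensities are invented, the ORF names are placeholders, and the pseudocount and two-fold threshold are assumptions rather than anything prescribed in the lecture. Real analyses also normalise the channels first.

```python
import math


def log2_ratios(red, green, pseudocount=1.0):
    """log2(red/green) per ORF; the pseudocount avoids division by zero."""
    return {
        orf: math.log2((red[orf] + pseudocount) / (green[orf] + pseudocount))
        for orf in red
    }


def differentially_expressed(ratios, threshold=1.0):
    """ORFs whose expression differs at least 2-fold (|log2 ratio| >= 1)."""
    return sorted(orf for orf, r in ratios.items() if abs(r) >= threshold)


# Invented spot intensities: red = test condition, green = reference
red = {"orf1": 799, "orf2": 99, "orf3": 199}
green = {"orf1": 99, "orf2": 99, "orf3": 799}

ratios = log2_ratios(red, green)
changed = differentially_expressed(ratios)
```

In the combined image, spots like orf1 would appear red, spots like orf3 green, and unchanged spots like orf2 yellow.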
Genomics: 59
http://www.bio.davidson.edu/courses/genomics/chip/chip.html
Genomics: 60
Expression profiling of C. jejuni in low iron
Cj1659 (P19)
Cj0037c
Cj0177
Genomics: 61
Proteome
• Genome-wide determination of protein expression
• Gives information on stimulons
• protein expression linked to function
• assess mutants (regulatory mutants affect several proteins)
• Grow bacteria under defined conditions
• Extract proteins
• 2D-gel electrophoresis
• Protein spot identification
• Mass spectrometry
• peptide size predictions from genome data
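Predicting peptide sizes from genome data means digesting each predicted protein in silico and computing the mass of every peptide, so that measured masses from a gel spot can be matched back to an ORF. The sketch below assumes trypsin as the protease (cleaving after K or R, but not before P) and uses standard monoisotopic residue masses; the function names and test peptides are made up for illustration.

```python
# Monoisotopic residue masses in daltons
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P."""
    peptides, current = [], ""
    for i, aa in enumerate(protein):
        current += aa
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(current)
            current = ""
    if current:
        peptides.append(current)
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


# Made-up protein fragment
peptides = tryptic_digest("MKWVTFRPLLK")
masses = [round(peptide_mass(p), 3) for p in peptides]
```

Comparing such predicted mass lists for every ORF against the masses measured from one spot is the basic idea behind identifying which protein the spot contains.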
Genomics: 62
Defining the Campylobacter proteome: chasing spots
Which protein?
Which conditions?
Which other proteins are co-expressed?
Genomics: 63
C. jejuni iron example
Genomics: 64
http://depts.washington.edu/yeastrc/pages/ms.html
[Figure: 2D gel with axes mol mass and pI; marked spots (*) are digested with protease and analysed by mass spec]
Genomics: 65
Mass Mutagenesis: mutantome
• Mutate every ORF in genome
– organism specific technology
• High throughput analysis of phenotype
– need to analyse many 1000s of mutants under many conditions
• Signature-tagged technology
– enables analysis of mutant pools
– requires array technology for genome-wide projects
• Association of ORF with mutant phenotypes
• Regulators might be pleiotropic
Genomics: 66
Arrays: micro and chip
• Microarrays
– Glass slides with <10000 individual samples applied in known position
– Use of robotics
– Samples can be PCR products or oligos
– example: oligos complementary to each unique Tag
– example: oligo/PCR product complementary to each ORF
• Chip arrays
– silicon based
– >10,000 sequences
– http://www.affymetrix.com/index.html
• Redundancy
• fluorescent labels
Genomics: 67
[Figure: chip arrays. One cell = one specific sequence; individual sequences (e.g. TGCATA) with bound complementary sample (ACGTAT) are read by laser]
Genomics: 68
Signature Tagged
• Tags are short unique DNA sequences
• Tag linked to mutation
• Each individual mutant has unique tag
• Each mutant ORF has unique tag
[Figure: ORF X; chromosomal mutants]
Genomics: 69
[Figure: chromosomal mutants (ORF X) are combined into mutant pools; a test condition is compared with ‘normal’ to ask: functional role?]
Genomics: 70
Bar coding genes
[Figure: “normal”, un-mutated Campylobacter alongside mutant 1, mutant 2, mutant 3, mutant 4 and so on to mutant 1654, each carrying a mutant-specific DNA sequence]
Genomics: 71
Which bar codes are missing?
• Which bar coded mutants are missing?
• Gene involved in process
[Figure: mutant pool → post-treatment mutant pool; copies of barcodes present are read on a bar code array (positions 1 to 100), each position scored + or -. Image credit: www.freedigitalphotos.net/]
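Reading the bar code array comes down to asking which tags give a signal in the starting pool but not in the post-treatment pool. The sketch below is a minimal illustration: the tag names, signal values and cutoff are invented, and a real screen would use replicate arrays and a statistically chosen threshold.

```python
def missing_barcodes(input_signal, output_signal, cutoff=1.0):
    """Tags detected in the input pool but not after treatment.

    Both arguments map tag name to array signal; a tag counts as
    present when its signal is at or above the cutoff.
    """
    present_in = {tag for tag, s in input_signal.items() if s >= cutoff}
    present_out = {tag for tag, s in output_signal.items() if s >= cutoff}
    return sorted(present_in - present_out)


# Invented array signals for four tagged mutants
input_pool = {"tag01": 5.0, "tag02": 4.2, "tag03": 0.1, "tag04": 6.3}
post_treatment = {"tag01": 4.8, "tag02": 0.2, "tag04": 5.9}

lost = missing_barcodes(input_pool, post_treatment)
```

Here only tag02 is reported: its mutant was in the starting pool but did not survive the treatment, so the tagged ORF is a candidate for involvement in the process. tag03 is ignored because it was never detected in the input pool.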
Genomics: 72
Reprinted by permission from Macmillan Publishers Ltd: [NATURE REVIEWS GENETICS] (Mazurkiewicz et al. 7: 929-939), copyright (2006)
Genomics: 73
Interactome
Yeast 2 hybrid
Which proteins can interact?
• Expression library of binding-domain::protein 1 (bait)
• Expression library of activation-domain::protein 2 (prey)
• Test combinations of all genome ORFs
• Which combinations turn on the reporter gene?
http://en.wikipedia.org/wiki/Two-hybrid_screening
Genomics: 74
Protein-protein interaction networks
Parrish et al. 2007. A proteome-wide protein interaction map
for Campylobacter jejuni. Genome Biol 8:R130.
Genomics: 75
Genomotyping or Genomic indexing
• Array of all known genes in microbe
• Genes 1, 2, 3 & 14 form the minimal gene set
• Hybridise array with labelled chromosomal DNA
[Figure: 15-spot gene array (genes 1 to 15) hybridised with DNA from Isolate 1, Isolate 2 and Isolate 3, each giving its own pattern of positive spots]
Genomics: 76