Transcript Document

Introduction to Bioinformatics
Lecture 4: Bioinformatics infrastructure
Centre for Integrative Bioinformatics VU (IBIVU)
Bioinformatics
“Nothing in Biology makes sense except in
the light of evolution” (Theodosius
Dobzhansky (1900-1975))
“Nothing in bioinformatics makes sense
except in the light of Biology”
Divergent evolution
Ancestral sequence: ABCD
ACCD (B → C): mutation
ABD (C → ø): deletion
Pairwise alignment:
ACCD        ACCD
AB─D   or   A─BD
(the first, ACCD / AB─D, is the true alignment)
What can be observed about divergent evolution
[Figure: four panels (a) to (d) following one site, G in the ancestral sequence, in Sequence 1 and Sequence 2]
– One substitution, one visible
– Two substitutions, one visible
– Two substitutions, none visible
– Back mutation, not visible
1: ACCTGTAATC
2: ACGTGCGATC
     *  **
D = 3/10 (fraction of different sites (nucleotides))
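The distance D on this slide can be computed directly from the two aligned sequences. The sketch below is a minimal illustration; the function names are made up here, and it assumes two ungapped sequences of equal length. Note that D only counts visible differences, so multiple substitutions at one site and back mutations are not seen.

```python
def p_distance(seq1, seq2):
    """Fraction of sites that differ between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


def difference_markers(seq1, seq2):
    """Line with '*' under every site where the two sequences differ."""
    return "".join("*" if a != b else " " for a, b in zip(seq1, seq2))


seq1 = "ACCTGTAATC"
seq2 = "ACGTGCGATC"
print(seq1)
print(seq2)
print(difference_markers(seq1, seq2))
print("D =", p_distance(seq1, seq2))
```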
Convergent evolution

• Often with shorter motifs (e.g. active sites)
• Motif (function) has evolved more than once
independently, e.g. starting with two very
different sequences adopting different folds
• Sequences and associated structures remain
different, but (functional) motif can become
identical
• Classical example: serine proteinase and
chymotrypsin
Serine proteinase (subtilisin) and
chymotrypsin





Different evolutionary origins
These proteins chop up other proteins
Similarities in the reaction mechanisms.
Chymotrypsin, subtilisin and carboxypeptidase C
have a catalytic triad of serine, aspartate and
histidine in common: serine acts as a nucleophile,
aspartate as an electrophile, and histidine as a base.
The geometric orientations of the catalytic residues
are similar between families, despite different protein
folds.
The linear arrangements of the catalytic residues
reflect different family relationships. For example the
catalytic triad in the chymotrypsin clan is ordered
HDS, but is ordered DHS in the subtilisin clan and
SDH in the carboxypeptidase clan.
Serine proteinase (subtilisin) and
chymotrypsin
Catalytic triads (order of the residues along the sequence):
chymotrypsin:        H  D  S
serine proteinase:   D  H  S
carboxypeptidase C:  S  D  H
Read http://www.ebi.ac.uk/interpro/potm/2003_5/Page1.htm
A gene codes for a protein
DNA
CCTGAGCCAACTATTGATGAA
transcription
mRNA
CCUGAGCCAACUAUUGAUGAA
translation
Protein
PEPTIDE
Transcription + Translation = Expression
DNA makes mRNA makes Protein
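The DNA → mRNA → protein example above can be reproduced in a few lines. This is a sketch under two assumptions that are not on the slide: the DNA shown is the coding strand (so transcription is just T → U), and the standard genetic code applies. The function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop)
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Coding-strand DNA to mRNA: replace T by U."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon from position 1 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
print(mrna)
print(translate(mrna))
```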
Translation
happens within
the ribosome
Ribosome structure



In the nucleolus, ribosomal RNA
is transcribed, processed, and
assembled with ribosomal
proteins to produce ribosomal
subunits
At least 40 ribosomes must be
made every second in a yeast
cell with a 90-min generation
time (Tollervey et al. 1991). On
average, this represents the
nuclear import of
3100 ribosomal proteins every
second and the export of
80 ribosomal subunits out of the
nucleus every second. Thus, a
significant fraction of nuclear
trafficking is used in the
production of ribosomes.
Ribosomes are made of a small
(‘2’ in Figure) and a large
subunit (‘1’ in Figure)
Large (1) and small (2) subunit fit
together (note this figure mislabels
angstroms as nanometers)
Transcriptional Regulation
Integrated View
Expression..
TF binding site
TF
mRNA
Pol II transcription
TATA
DNA
Epigenetics – Epigenomics:
Gene Expression
• Transcription factors (TFs) are essential for
transcription initiation
• Transcription is done by RNA polymerase II
(eukaryotes)
• mRNA must then move from the nucleus to the
ribosomes (extranuclear) for translation
• In eukaryotes there can be many TF-binding sites
upstream of an ORF that together regulate
transcription
• Nucleosomes (chromatin structures composed of
histones) are structures around which DNA coils.
This blocks access of TFs
Epigenetics – Epigenomics:
Gene Expression
TF binding site
(closed)
mRNA
transcription
TATA
Nucleosome
TF binding site
(open)
Expression

• Because DNA has flexibility, bound TFs can move in
order to interact with Pol II, which is necessary for
transcription initiation (see next slide)
• Recent TF-based initiation theory includes a
wave function (Carlsberg) of TF-binding, which is
supposed to go from left to right. In this way the
TF-binding site nearest to the TATA box would be bound
by a TF which will then in turn bind Pol II.
• It has been suggested that “speckles” have
something to do with this (speckles are observed
protein plaques in the nucleus)
• Current prediction methods for gene co-expression,
e.g. finding a single shared TF binding site, do not
take this TF cooperativity into account (“parking lot
optimisation”)
434 Cro
protein
complex
(phage)
PDB: 3CRO
Zinc finger
DNA recognition
(Drosophila)
PDB: 2DRP
..YRCKVCSRVY THISNFCRHY VTSH...
Zinc-finger DNA binding protein family
Characteristics of the family:
Function:
The DNA-binding motif is found as part of
transcription regulatory proteins.
Structure:
One of the most abundant DNA-binding motifs.
Proteins may contain more than one finger in a
single chain. For example Transcription Factor
TF3A was the first zinc-finger protein discovered
to contain 9 C2H2 zinc-finger motifs (tandem
repeats). Each motif consists of 2 antiparallel
beta-strands followed by an alpha-helix. A
single zinc ion is tetrahedrally coordinated by
conserved histidine and cysteine residues,
stabilising the motif.
Zinc-finger DNA binding protein family
Characteristics of the family:
Binding:
Fingers bind to 3 base-pair subsites and specific
contacts are mediated by amino acids in positions 1, 2, 3 and 6 relative to the start of the alpha-helix.
Contacts mainly involve one strand of the DNA.
Where proteins contain multiple fingers, each
finger binds to adjacent subsites within a larger
DNA recognition site thus allowing a relatively
simple motif to specifically bind to a wide range of
DNA sequences.
This means that the number and the type of zinc
fingers dictates the specificity of binding to DNA
Leucine zipper
(yeast)
PDB: 1YSA
..RA RKLQRMKQLE DKVEE LLSKN YHLENEVARL...
A protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE-------GGILLFHRTHELIKESHAMANDEGGSNNS
* *
* **** ***
A DNA sequence alignment
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc---***
**** **** **
******
Searching for similarities
What is the function of the new gene?
The “lazy” investigation (i.e., no biological
experiments, just bioinformatics techniques):
– Find a set of similar protein sequences to the
unknown sequence
– Identify similarities and differences
– For long proteins: first identify domains
Intermezzo: what is a domain
A domain is a:
• Compact, semi-independent unit
(Richardson, 1981).
• Stable unit of a protein structure that can
fold autonomously (Wetlaufer, 1973).
• Recurring functional and evolutionary
module (Bork, 1992).
“Nature is a ‘tinkerer’ and not an inventor” (Jacob, 1977).
Protein domains recur in different combinations
The DEATH Domain (DD)
http://www.mshri.on.ca/pawson
• Present in a variety of Eukaryotic
proteins involved with cell death.
• Six helices enclose a tightly
packed hydrophobic core.
• Some DEATH domains form
homotypic and heterotypic dimers.
Structural domain organisation can be intricate…
Pyruvate kinase
Phosphotransferase
β barrel regulatory domain
α/β barrel catalytic substrate binding
domain
α/β nucleotide binding domain
1 continuous + 2 discontinuous domains
Evolutionary and functional relationships
Reconstruct evolutionary relation:
•Based on sequence
-Identity (simplest method)
-Similarity
•Homology (common ancestry: the ultimate goal)
•Other (e.g., 3D structure)
Functional relation:
Sequence → Structure → Function
Searching for similarities
Common ancestry is more interesting:
Makes it more likely that genes share
the same function
Homology: sharing a common ancestor
– a binary property (yes/no)
– it’s a nice tool:
When (an unknown) gene X is homologous to (a
known) gene G it means that we gain a lot of
information on X: what we know about G can be
transferred to X as a good suggestion.
Protein Function Prediction
The deluge of genomic information begs the
following question: what do all these genes do?
Many genes are not annotated, and many more are
partially or erroneously annotated. Given a genome
which is partially annotated at best, how do we fill in
the blanks?
For each sequenced genome, the functions of 20%-50%
of the encoded proteins remain unknown!
Protein Function Prediction
We are faced with the problem of predicting
protein function from sequence, genomic,
expression, interaction and structural data.
For all these reasons and many more,
automated protein function prediction is
rapidly gaining interest among
bioinformaticians and computational
biologists
Ways to predict function

Sequence-based function prediction

Structure-based function prediction
– Sequence-structure comparison
– Structure-structure comparison

Motif-based function prediction

Phylogenetic profile analysis

Protein interaction prediction and databases

Functional inference at systems level
Classes of function prediction
methods

Sequence based approaches
– protein A has function X, and protein B is a homolog (ortholog) of
protein A; Hence B has function X

Structure-based approaches
– protein A has structure X, and X has such-and-such structural features;
Hence A’s functional sites are ….

Motif-based approaches
– a group of genes have function X and they all have motif Y; protein
A has motif Y; Hence protein A’s function might be related to X

Function prediction based on “guilt-by-association”
– gene A has function X and gene B is often “associated” with gene A,
B might have function related to X
Sequence-based function prediction
Homology searching
Sequence comparison is a powerful tool for detecting
homologous genes, but it is limited to genomes that are
not too distantly related
Query: 2   LSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLKSEDEMKASEDL 61
           LSD +   V  +W K+       G + L R+   +P+T   F  +      D    S ++
Sbjct: 3   LSDKDKAAVRALWSKIGKSSDAIGNDALSRMIVVYPQTKIYFSHWP-----DVTPGSPNI 57
Query: 62  KKHGATVLTALGGILKKKGHHEAEIKPLAQSHATKHKIPVKYLEFISECIIQVLQSKHPG 121
           K HG  V+  +   + K    +  +  L++ HA K ++     + ++ CI+ V+ +  P
Sbjct: 58  KAHGKKVMGGIALAVSKIDDLKTGLMELSEQHAYKLRVDPSNFKILNHCILVVISTMFPK 117
Query: 122 DFGADAQGAMNKALELFRKDMASNYK 147
           +F  +A  +++K L      +A  Y+
Sbjct: 118 EFTPEAHVSLDKFLSGVALALAERYR 143
We have done homology searching (FASTA, BLAST, PSI-BLAST) in
earlier lectures
Structure-based function prediction

Structure-based methods could possibly detect
remote homologues that are not detectable by
sequence-based methods
– using structural information in addition to sequence
information
– protein threading (sequence-structure alignment) is a
popular method
Structure-based methods could provide
more than just “homology” information
Threading
Template
sequence
Compatibility
score
+
Query
sequence
Template
structure
Structure-based function prediction
Threading

Scoring function for measuring to what extent the query
sequence fits into the template structure

For scoring we have to map an amino acid (query
sequence) onto a local environment (template structure)

We can use the following structural features for scoring:
o Secondary structure
o Is environment inside or outside? – Residue accessible surface
area (ASA)
o Polarity of environment
The best (highest scoring) “thread” through the structure
gives a so-called structural alignment; this looks exactly the
same as a sequence alignment but is based on structure.

Threading – inverse folding
Map sequence to structural environments
Query
What is the optimal thread
for each local
environment?
Find the best compromise
over all environments
Template
?
N
C
environment
•Secondary
structure
•ASA
hydrophobic
hydrophilic
•Polarity of
environment
Fold recognition by
threading
Fold 2
Query
sequence
What is the most
compatible
structure?
The one with the highest
threading score
Fold 1
Fold 3
Compatibility
scores
Fold N
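Fold recognition by threading can be sketched with a deliberately tiny toy model. Everything below is an assumption made for illustration, not the scoring used by any real threading program: each template position is reduced to one environment label ('B' buried, 'E' exposed), hydrophobic residues score +1 when buried and -1 when exposed (and the reverse for the others), and the query is slid along the template without gaps. Real methods also use secondary structure and polarity, and allow gaps via dynamic programming.

```python
# Rough hydrophobic grouping (an assumption of this sketch)
HYDROPHOBIC = set("AVLIMFWC")


def environment_score(residue, environment):
    """Compatibility of one residue with one environment ('B' or 'E')."""
    is_hydrophobic = residue in HYDROPHOBIC
    if environment == "B":
        return 1 if is_hydrophobic else -1
    return -1 if is_hydrophobic else 1


def best_thread(query, template):
    """Slide the query along the template (no gaps); return (score, offset)."""
    best = None
    for offset in range(len(template) - len(query) + 1):
        score = sum(
            environment_score(residue, template[offset + i])
            for i, residue in enumerate(query)
        )
        if best is None or score > best[0]:
            best = (score, offset)
    return best  # None if the template is shorter than the query


def recognise_fold(query, fold_library):
    """Return the most compatible fold and all per-fold threading results."""
    results = {name: best_thread(query, env) for name, env in fold_library.items()}
    results = {name: r for name, r in results.items() if r is not None}
    best_fold = max(results, key=lambda name: results[name][0])
    return best_fold, results


# Hypothetical fold library: one environment string per template structure
FOLDS = {"fold_1": "EBEBEBB", "fold_2": "EEEEBBBB", "fold_3": "BBEEBBE"}
best_fold, results = recognise_fold("LKVEL", FOLDS)
print(best_fold, results)
```

The fold with the highest threading score is reported as the most compatible structure, as on the slide.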
Structure-based function prediction

SCOP (http://scop.berkeley.edu/) is a protein structure
classification database where proteins are grouped into a
hierarchy of families, superfamilies, folds and classes, based on
their structural and functional similarities
Structure-based function prediction

SCOP hierarchy – the top level: 11 classes
Structure-based function prediction
All-alpha protein
membrane protein
All-beta protein
Alpha-beta protein
Coiled-coil protein
Structure-based function prediction

SCOP hierarchy – the second level: 800 folds
Structure-based function prediction

SCOP hierarchy - third level: 1294 superfamilies
Structure-based function prediction

SCOP hierarchy - fourth level: 2327 families
Structure-based function prediction

Using a sequence-structure alignment method, one can predict
that a protein belongs to a
– SCOP family, superfamily or fold
[Figure labels: folds, superfamilies, families]
Proteins predicted to be in the same SCOP family are orthologous
Proteins predicted to be in the same SCOP superfamily are
homologous
Proteins predicted to be in the same SCOP fold are structurally
analogous
Structure-based function prediction

Prediction of ligand binding sites
– For ~85% of ligand-binding proteins, the largest cleft
is the ligand-binding site
– For an additional ~10% of ligand-binding proteins, the second
largest cleft is the ligand-binding site
Structure-based function prediction

Prediction of macromolecular binding site
– there is a strong correlation between macromolecular
binding site (with protein, DNA and RNA) and disordered
protein regions
– disordered regions in a protein sequence can be predicted
using computational methods
http://www.pondr.com/
Motif-based function prediction

Prediction of protein functions based on identified sequence
motifs

PROSITE contains patterns specific for more than a thousand
protein families.

ScanPROSITE scans a protein sequence
for occurrences of the patterns and
profiles stored in PROSITE
Motif-based function prediction

Search PROSITE using ScanPROSITE
MSEGSDNNGDPQQQGAEGEAVGENKMKSRLRK
GALKKKNVFNVKDHCFIARFFKQPTFCSHCKDFIC
GYQSGYAWMGFGKQGFQCQVCSYVVHKRCHEY
VTFICPGKDKGNETLIDSDSPKTQH ……..

The sequence has ASN_GLYCOSYLATION N-glycosylation site:
242 - 245 NETL
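A scan like this can be imitated locally with a regular expression. The sketch below assumes the PROSITE N-glycosylation pattern is N-{P}-[ST]-{P} (written from memory, not taken from the slide) and hand-translates it to Python syntax. It scans only the sequence fragment printed above, which is truncated on the slide, so the reported positions are relative to that fragment and will not equal 242 - 245.

```python
import re

# N-{P}-[ST]-{P} in Python syntax; the lookahead also reports overlapping sites
N_GLYCOSYLATION = re.compile(r"(?=(N[^P][ST][^P]))")


def find_sites(sequence):
    """Return (start, end, matched text) with 1-based, inclusive positions."""
    return [
        (m.start() + 1, m.start() + 4, m.group(1))
        for m in N_GLYCOSYLATION.finditer(sequence)
    ]


# Fragment as printed on the slide (the full protein is longer)
fragment = (
    "MSEGSDNNGDPQQQGAEGEAVGENKMKSRLRK"
    "GALKKKNVFNVKDHCFIARFFKQPTFCSHCKDFIC"
    "GYQSGYAWMGFGKQGFQCQVCSYVVHKRCHEY"
    "VTFICPGKDKGNETLIDSDSPKTQH"
)
for start, end, text in find_sites(fragment):
    print(start, "-", end, text)
```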
Regular expressions
Alignment
ADLGAVFALCDRYFQ
SDVGPRSCFCERFYQ
ADLGRTQNRCDRYYQ
ADIGQPHSLCERYFQ
For short sequence
stretches, regular
expressions are often
more suitable to
describe the
information than
alignments (or profiles)
Regular expression
[AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q
{PG} = not (P or G)
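The pattern notation on this slide maps almost one-to-one onto the regular expressions of a programming language. The converter below is a small sketch, not a full PROSITE parser: the function name is made up, it accepts repeat counts written either as on the slide (x4, [FY]2) or in PROSITE style (x(4)), and it does not handle ranges such as x(2,4) or anchors.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)\)|(\d+))?")


def pattern_to_regex(pattern):
    """Convert a dash-separated pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core = m.group(1)
        count = m.group(2) or m.group(3)
        if core == "x":                  # x = any residue
            core = "."
        elif core.startswith("{"):       # {PG} = not (P or G)
            core = "[^" + core[1:-1] + "]"
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)


regex = pattern_to_regex("[AS]-D-[IVL]-G-x4-{PG}-C-[DE]-R-[FY]2-Q")
print(regex)
for seq in ["ADLGAVFALCDRYFQ", "SDVGPRSCFCERFYQ",
            "ADLGRTQNRCDRYYQ", "ADIGQPHSLCERYFQ"]:
    print(seq, bool(re.fullmatch(regex, seq)))
```

All four sequences of the alignment match the pattern, which is the sense in which the regular expression summarises the alignment.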
Regular expressions
Regular expression           No. of exact matches in DB
D-A-V-I-D                    71
D-A-V-I-[DENQ]               252
[DENQ]-A-V-I-[DENQ]          925
[DENQ]-A-[VLI]-I-[DENQ]      2739
[DENQ]-[AG]-[VLI]2-[DENQ]    51506
D-A-V-E                      1088
Prosite

• In addition to regular expressions, the Prosite
database also contains so-called extended
profiles
• Extended profiles contain more explicit
information than classical profiles, for
example to describe expected gap lengths,
etc.
• This is because some patterns are better
described using regular expressions (e.g.
short motifs), while others are better
formalised using (extended) profiles
Domain swapping
Domain swapping is a structurally viable
mechanism for forming oligomeric assemblies
(Bennett et al., 1995). In domain swapping, a
secondary or tertiary element of a monomeric
protein is replaced by the same element of another
protein.
Domain swapping can range from secondary
structure elements to whole structural domains. It
also represents a model of evolution for functional
adaptation by oligomerization, e.g. of oligomeric
enzymes that have their active site at sub-unit
interfaces (Heringa and Taylor, 1997).
Domain databases
COGS Domain database
The COGs (Clusters of Orthologous Groups) database
is a phylogenetic classification of the proteins encoded
within complete genomes (Tatusov et al., 2001).
It primarily consists of bacterial and archaeal genomes.
Operational definition of orthology is based on
bidirectional best hit
Incorporation of the larger genomes of multicellular
eukaryotes into the COG system is achieved by
identifying eukaryotic proteins that fit into already
existing COGs. Eukaryotic proteins that have orthologs
within different COGs are split into their individual
domains.
The COGs database currently consists of 3166 COGs
including 75,725 proteins from 44 genomes.
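The bidirectional best hit criterion can be sketched as follows. This is only an illustration of the operational definition for two genomes; the protein names and similarity scores are invented, and the actual COG construction (which combines best hits across at least three lineages) is not reproduced here.

```python
def best_hits(scores):
    """For each query, keep the subject with the highest score.

    scores maps (query, subject) pairs to a similarity score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical search scores between genome A and genome B
a_vs_b = {("a1", "b1"): 250, ("a1", "b2"): 90, ("a2", "b2"): 180, ("a3", "b2"): 200}
b_vs_a = {("b1", "a1"): 245, ("b2", "a3"): 205, ("b2", "a2"): 170, ("b3", "a1"): 60}
print(bidirectional_best_hits(a_vs_b, b_vs_a))
```

Here a2 finds b2 as its best hit, but b2 prefers a3, so only (a1, b1) and (a3, b2) are called orthologs.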
COGs: the beginning (1997)
In order to extract the maximum amount of information from the
rapidly accumulating genome sequences, all conserved genes need
to be classified according to their homologous relationships.
Comparison of proteins encoded in seven complete genomes from
five major phylogenetic lineages and elucidation of consistent
patterns of sequence similarities allowed the delineation of 720
clusters of orthologous groups (COGs). Each COG consists of
individual orthologous proteins or orthologous sets of paralogs
from at least three lineages. Orthologs typically have the same
function, allowing transfer of functional information from one
member to an entire COG. This relation automatically yields a
number of functional predictions for poorly characterized genomes.
The COGs comprise a framework for functional and evolutionary
genome analysis.
COG2813: 16S RNA G1207 methylase RsmC
COG members are mapped onto the
genomes included in the DB
PRINTS database
•PRINTS is a compendium of protein fingerprints.
•A fingerprint is a group of conserved motifs used to
characterise a protein family; its diagnostic power (false
positives and false negatives) is refined by iterative scanning
of a SWISS-PROT/TrEMBL composite database.
•Usually the motifs do not overlap, but are separated along a
sequence, though they may be contiguous in 3D-space.
•Fingerprints can encode protein folds and functionalities more
flexibly and powerfully than can single motifs, full diagnostic
potency deriving from the mutual context provided by motif
neighbours
•PRINTS contains the most discriminating groups of regular
expressions for each protein sequence
•Release 31.0 of PRINTS contains 1,550 entries, encoding
9,531 individual motifs.
BETAHAEM: 2 of 5 PRINTS motifs making the fingerprint
INITIAL MOTIF SETS
BETAHAEM1   Length of motif = 17   Motif number = 1
Beta haemoglobin motif I - 1
                   PCODE        ST   INT
GRLLVVYPWTQRYFDSF  HBB1_RAT     29   29
GRLLVVYPWTQRYFDSF  HBB1_MOUSE   29   29
GRLLVVYPWTQRFFEHF  HBB_ALCAA    28   28
GRLLVVYPWTQRFFEHF  HBB_ODOVI    28   28
GRLLVVYPWTQRFFESF  HBB_BOVIN    28   28
GRLLVVYPWTQRFFESF  HBB_ATEGE    29   29
GRLLVVYPWTQRFFESF  HBB_HUMAN    29   29
GRLLVVYPWTQRFFESF  HBB_ANTPA    29   29
ARLLIVYPWTQRFFASF  HBB_ANAPL    29   29
SRCLIVYPWTQRHFSGF  HBB_NOTAN    29   29

BETAHAEM2   Length of motif = 16   Motif number = 2
Beta haemoglobin motif II - 1
                   PCODE        ST   INT
DLSSASAIMGNPKVKA   HBB1_RAT     47   1
DLSSASAIMGNAKVKA   HBB1_MOUSE   47   1
DLSTADAVMHNAKVKE   HBB_ALCAA    46   1
DLSSAGAVMGNPKVKA   HBB_ODOVI    46   1
DLSTADAVMNNPKVKA   HBB_BOVIN    46   1
DLSTPDAVMSNPKVKA   HBB_ATEGE    47   1
DLSTPDAVMGNPKVKA   HBB_HUMAN    47   1
DLSNAGAVMGNAKVKA   HBB_ANTPA    47   1
NLSSPTAILGNPMVRA   HBB_ANAPL    47   1
NLYNAEAILGNANVAA   HBB_NOTAN    47   1
After iteration the
number of
sequences for each
motif can grow
dramatically. Both
the initial motifs
(example here) and
final motifs are
provided to the
user
The PRODOM Database
ProDom is a comprehensive set of protein
domain families automatically generated
from the SWISS-PROT and TrEMBL
sequence databases
The PRODOM Database
ProDom (Corpet et al., 2000) is a database of
protein domain families automatically generated
from SWISSPROT and TrEMBL sequence
databases (Bairoch and Apweiler, 2000) using a
novel procedure based on recursive PSI-BLAST
searches (Altschul et al., 1997).
Release 2001.2 of ProDom contains 283,772
domain families, 101,957 having at least 2
sequence members. ProDom-CG (Complete
Genome) is a version of the ProDom database
which holds genome-specific domain data.
The PROSITE Database
PROSITE is a database of protein families and domains. It consists of
biologically significant sites, patterns and profiles that help to reliably
identify to which known protein family (if any) a new sequence belongs
PROSITE (Hofmann et al., 1999) is a good source of high quality
annotation for protein domain families. A PROSITE sequence family is
represented as a pattern or profile, providing a means of sensitive
detection of common protein domains in new protein sequences.
PROSITE release 16.46 contains signatures specific for 1,098 protein
families or domains. Each of these signatures comes with documentation
providing background information on the structure and function of these
proteins.
The PROSITE Database
A PROSITE sequence family is represented as a
pattern or a profile.
A pattern is given as a regular expression (next
slide)
Compared to classical profiles, the generalised
profiles used in PROSITE carry the same additional
information as Hidden Markov Models (HMMs).
Rationale for regular expressions

“I want to see all sequences that ...
– ... contain a C” -- C
– ... contain a C or an F” -- [CF]
– ... contain a C and an F” -- (C.*F | F.*C) (‘|’ means ‘or’ and ‘.*’ means
don’t care for any length)
– ... contain a C immediately followed by an F” -- CF
– ... contain a C later followed by an F” -- C.*F
– ... begin with a C” -- ^C (‘^’ means ‘starting with’)
– ... do not contain a C” -- {C}
– ... contain at least three Cs” -- C3
– ... contain exactly three Cs” -- C3
– ... have a C at the seventh position” -- .6C
– ... either contain a C, an E, and an F in any order except CFE, unless
there are also at most three Ps, or there is a ....
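The queries above can be tried out with Python's re module. The slide uses a shorthand (C3, .6C, {C}) that is not valid Python syntax, so the patterns below are one possible translation, not the slide's own notation. In particular, "three Cs" is read here as three Cs anywhere in the sequence, not necessarily adjacent, which is an assumption.

```python
import re

QUERIES = {
    "contains a C": r"C",
    "contains a C or an F": r"[CF]",
    "contains a C and an F": r"C.*F|F.*C",
    "contains a C immediately followed by an F": r"CF",
    "contains a C later followed by an F": r"C.*F",
    "begins with a C": r"^C",
    "does not contain a C": r"^[^C]*$",
    "contains at least three Cs": r"C.*C.*C",
    "contains exactly three Cs": r"^[^C]*(C[^C]*){3}$",
    "has a C at the seventh position": r"^.{6}C",
}


def matching_queries(sequence):
    """Names of all queries whose pattern is found in the sequence."""
    return [name for name, pattern in QUERIES.items()
            if re.search(pattern, sequence)]


for seq in ["CAFE", "AAAAAACCC", "GATTA"]:
    print(seq, matching_queries(seq))
```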
Regex limitations

regex cannot remember indeterminate counts !!!
– “I want to see all sequences with ...
☺ ... six Cs followed by six Ts”
– C6T6
☺ ... any number of Cs followed by any number of Ts”
✰ C*T*
☹ ... Cs followed by an equal number of Ts” (This cannot be done..)
✰ CnTn
✰ (CT|CCTT|CCCTTT|C4T4| ... )?
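The limitation is easy to see in code: fixed or unrestricted counts fit in a pattern, but "an equal number" needs a counter outside the regular expression. A minimal sketch (the function name is illustrative): the regex only checks the shape "some Cs, then some Ts", and ordinary code compares the two run lengths.

```python
import re

# Fixed counts and arbitrary counts are fine as regular expressions
SIX_AND_SIX = re.compile(r"C{6}T{6}")
ANY_AND_ANY = re.compile(r"C*T*")


def is_cn_tn(sequence):
    """True if the sequence is n Cs followed by exactly n Ts (n >= 1)."""
    m = re.fullmatch(r"(C+)(T+)", sequence)
    # The comparison of the two run lengths is the part a regex cannot do
    return m is not None and len(m.group(1)) == len(m.group(2))


for seq in ["CT", "CCTT", "CCCTT", "CCCCCCTTTTTT"]:
    print(seq, is_cn_tn(seq))
```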
The PFAM Database
Pfam is a large collection of multiple sequence
alignments and hidden Markov models covering
many common protein domains and families. For
each family in Pfam you can:

• Look at multiple alignments
• View protein domain architectures
• Examine species distribution
• Follow links to other databases
• View known protein structures
• Search with Hidden Markov Model (HMM) for
each alignment
The PFAM Database
Pfam is a database in two parts. The first is the
curated part, Pfam-A, containing over 5193 protein
families. Pfam-A comprises manually crafted multiple
alignments and profile-HMMs.
To give Pfam more comprehensive coverage of
known proteins, a supplement called Pfam-B is
generated automatically. This contains a large
number of small families taken from the PRODOM
database that do not overlap with Pfam-A.
Although of lower quality, Pfam-B families can be
useful when no Pfam-A families are found.
The PFAM Database
Sequence coverage Pfam-A : 73% (Gr)
Sequence coverage Pfam-B : 20% (Bl)
Other (Grey)
A PFAM alignment
CYB_TRYBB/1-197   M...LYKSG..EKRKG..LLMSGC.....LYR.....IYGVGFSLGFFIALQIIC..GVCLAWLFFSCFICSNWYFVLFL
CYB_MARPO/1-208   M.ARRLSILKQPIFSTFNNHLIDY.....PTPSNISYWWGFGSLAGLCLVIQILTGVFLAMHYTPHVDLAFLSVEHIMR.
CYB_HETFR/1-205   MATNIRKTH..PLLKIINHALVDL.....PAPSNISAWWNFGSLLVLCLAVQILTGLFLAMHYTADISLAFSSVIHICR.
CYB_STELO/1-204   M.TNIRKTH..PLMKILNDAFIDL.....PTPSNISSWWNFGSLLGLCLIMQILTGLFLAMHYTPDTTTAFSSVAHICR.
CYB_ASCSU/1-196   ...........MKLDFVNSMVVSL.....PSSKVLTYGWNFGSMLGMVLGFQILTGTFLAFYYSNDGALAFLSVQYIMY.
CYB6_SPIOL/1-210  M.SKVYDWF..EERLEIQAIADDITSKYVPPHVNIFYCLGGITLT..CFLVQVATGFAMTFYYRPTVTDAFASVQYIMT.
CYB6_MARPO/1-210  M.GKVYDWF..EERLEIQAIADDITSKYVPPHVNIFYCLGGITLT..CFLVQVATGFAMTFYYRPTVTEAFSSVQYIMT.
CYB6_EUGGR/1-210  M.SRVYDWF..EERLEIQAIADDVSSKYVPPHVNIFYCLGGITFT..CFIIQVATGFAMTFYYRPTVTEAFLSVKYIMN.

CYB_TRYBB/1-197   WDFDLGFVIRSVHICFTSLLYLLLYIHIFKSITLIILFDTH..IL....VWFIGFILFVFIIIIAFIGYVLPCTMMSYWG
CYB_MARPO/1-208   .DVKGGWLLRYMHANGASMFFIVVYLHFFRGLY....YGSY..ASPRELVWCLGVVILLLMIVTAFIGYVLPWGQMSFWG
CYB_HETFR/1-205   .DVNYGWLIRNIHANGASLFFICIYLHIARGLY....YGSY..LLKE..TWNIGVILLFLLMATAFVGYVLPWGQMSFWG
CYB_STELO/1-204   .DVNYGWFIRYLHANGASMFFICLYAHMGRGLY....YGSY..MFQE..TWNIGVLLLLTVMATAFVGYVLPWGQMSFWG
CYB_ASCSU/1-196   .EVNFGWIFRVLHFNGASLFFIFLYLHLFKGLF....FMSY..RLKK..VWVSGIVILLLVMMEAFMGYVLVWAQMSFWA
CYB6_SPIOL/1-210  .EVNFGWLIRSVHRWSASMMVLMMILHVFRVYL....TGGFKKPREL..TWVTGVVLGVLTASFGVTGYSLPWDQIGYWA
CYB6_MARPO/1-210  .EVNFGWLIRSVHRWSASMMVLMMILHIFRVYL....TGGFKKPREL..TWVTGVILAVLTVSFGVTGYSLPWDQIGYWA
CYB6_EUGGR/1-210  .EVNFGWLIRSIHRWSASMMVLMMILHVCRVYL....TGGFKKPREL..TWVTGIILAILTVSFGVTGYSLPWDQVGYWA

CYB_TRYBB/1-197   LTVFSNIIATVPILGIWLCYWIWGSEFINDFTLLKLHVLHV.LLPFILLIILILHLFCLHYFM
CYB_MARPO/1-208   ATVITSLASAIPVVGDTIVTWLWGGFSVDNATLNRFFSLHY.LLPFIIAGASILHLAALHQYG
CYB_HETFR/1-205   ATVITNLLSAFPYIGDTLVQWIWGGFSIDNATLTRFFAFHF.LLPFLIIALTMLHFLFLHETG
CYB_STELO/1-204   ATVITNLLSAIPYIGTTLVEWIWGGFSVDKATLTRFFAFHF.ILPFIITALAAVHLLFLHETG
CYB_ASCSU/1-196   SVVITSLLSVIPVWGFAIVTWIWSGFTVSSATLKFFFVLHF.LVPWGLLLLVLLHLVFLHETG
CYB6_SPIOL/1-210  VKIVTGVPDAIPVIGSPLVELLRGSASVGQSTLTRFYSLHTFVLPLLTAVFMLMHFLMIRKQG
CYB6_MARPO/1-210  VKIVTGVPEAIPIIGSPLVELLRGSVSVGQSTLTRFYSLHTFVLPLLTAIFMLMHFLMIRKQG
CYB6_EUGGR/1-210  VKIVTGVPEAIPLIGNFIVELLRGSVSVGQSTLTRFYSLHTFVLPLLTATFMLGHFLMIRKQG
INTERPRO combined database
Because the underlying construction and analysis methods of
the above domain family databases are different, the
databases inevitably have different diagnostic strengths and
weaknesses.
The InterPro database (Apweiler et al., 2000) is a
collaboration between many of the domain database curators.
It aims to be a central resource reducing the amount of
duplication between the databases.
Release 3.2 of InterPro contains 3,939 entries, representing
1,009 domains, 2,850 families, 65 repeats and 15
posttranslational modification sites. Entries are accompanied
by regular expressions, profiles, fingerprints and Hidden
Markov Models which facilitate sequence database searches.
Databases integrated in INTERPRO:
The UniProt (Universal Protein Resource) is the world's most comprehensive catalog of
information on proteins. It is a central repository of protein sequence and function created
by joining the information contained in Swiss-Prot, TrEMBL, and PIR.
PROSITE is a database of protein families and domains. It consists of biologically
significant sites, patterns and profiles that help to reliably identify to which known protein
family (if any) a new sequence belongs.
Pfam is a large collection of multiple sequence alignments and hidden Markov models
covering many common protein domains.
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs
used to characterise a protein family; its diagnostic power is refined by iterative scanning of
UniProt. Usually the motifs do not overlap, but are separated along a sequence, though
they may be contiguous in 3D-space. Fingerprints can encode protein folds and
functionalities more flexibly and powerfully than can single motifs, their full diagnostic
potency deriving from the mutual context afforded by motif neighbours.
The ProDom protein domain database consists of an automatic compilation of homologous
domains. Current versions of ProDom are built using a novel procedure based on recursive
PSI-BLAST searches (Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W &
Lipman DJ, 1997, Nucleic Acids Res., 25:3389-3402; Gouzy J., Corpet F. & Kahn D., 1999,
Computers and Chemistry 23:333-340.) Large families are much better processed with this
new procedure than with the former DOMAINER program (Sonnhammer, E.L.L. & Kahn, D.,
1994, Protein Sci., 3:482-492).
Databases integrated in INTERPRO (Cont.):
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation
of genetically mobile domains and the analysis of domain architectures. More than 500 domain
families found in signalling, extracellular and chromatin-associated proteins are detectable.
These domains are extensively annotated with respect to phyletic distributions, functional class,
tertiary structures and functionally important residues. Each domain found in a non-redundant
protein database as well as search parameters and taxonomic information are stored in a
relational database system. User interfaces to this database allow searches for proteins
containing specific combinations of domains in defined taxa.
TIGRFAMs is a collection of protein families, featuring curated multiple sequence alignments,
Hidden Markov Models (HMMs) and annotation, which provides a tool for identifying functionally
related proteins based on sequence homology. Those entries which are "equivalogs" group
homologous proteins which are conserved with respect to function.
PIR Superfamily (PIRSF) is a classification system based on evolutionary relationship of whole
proteins. Members of a superfamily are monophyletic (evolved from a common evolutionary
ancestor) and homeomorphic (homologous over the full-length sequence and sharing a common
domain architecture). A protein may be assigned to one and only one superfamily. Curated
superfamilies contain functional information, domain information, bibliography, and cross-references to other databases, as well as full-length and domain HMMs, multiple sequence
alignments, and phylogenetic tree of seed members. PIRSF can be used for functional
annotation of protein sequences.
SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known
structure. The library is based on the SCOP classification of proteins: each model corresponds
to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs
to. SUPERFAMILY has been used to carry out structural assignments to all completely
sequenced genomes. The results and analysis are available from the SUPERFAMILY website.
Domain structure databases
Several methods of structural classification
have been developed to classify the large
number of protein folds present in the PDB.
The most widely used and comprehensive
databases are CATH, 3Dee, FSSP and SCOP,
which use four unique methods to classify
protein structures at the domain level.
Examples of domain structure
databases
• CATH
• 3DEE
• FSSP
• SCOP
CATH
The CATH domain database assigns domains
based on a consensus approach using the
three algorithms PUU (Holm and Sander, 1994),
DETECTIVE (Swindells, 1995) and DOMAK
(Siddiqui and Barton, 1995) as well as visual
inspection (Jones et al., 1998). The CATH
database release 2.3 contains approximately
30,000 domains ordered into five major levels:
Class; Architecture; Topology/fold; Homologous
superfamily; and Sequence family.
CATH
Class covers α, β, and α/β proteins
Architecture is the overall shape of a domain as defined by the
packing of secondary structural elements, but ignoring their
connectivity.
The topology-level consists of structures with the same
number, arrangement and connectivity of secondary structure
based on structural superposition using SSAP structure
comparison algorithm (Taylor and Orengo, 1989).
A homologous superfamily contains proteins having high
structural similarity and similar functions, which suggests that
they have evolved from a common ancestor.
Finally, the sequence family level consists of proteins with
sequence identities greater than 35%, again suggesting a
common ancestor.
CATH
CATH classifies domains into approximately 700 fold
families; ten of these folds are highly populated and
are referred to as ‘super-folds’.
Super-folds are defined as folds for which there are
at least three structures without significant sequence
similarity (Orengo et al., 1994).
The most populated is the α/β-barrel super-fold.
3Dee
3Dee structural domain repository (Siddiqui et al.,
2001) stores alternative domain definitions for the
same protein and organises the domains into
sequence and structural hierarchies. Most of the
database creation and update processes are
performed automatically using the DOMAK (Siddiqui
and Barton, 1995) algorithm. However, some
domains are manually assigned. It contains non-redundant sets of sequences and structures, multiple
structure alignments for all domain families,
secondary structure and fold name definitions. The
current 3Dee release is now a few years old and
contains 18,896 structural domains.
FSSP
FSSP (Holm and Sander, 1997) is a complete comparison of
all pairs of protein structures in the PDB. It is the basis for the
Dali Domain Dictionary (Dietmann et al., 2001), a numerical
taxonomy of all known structures in the PDB.
The taxonomy is derived automatically from measurements of
structural, functional and sequence similarities.
The database is split into four hierarchical levels
corresponding to super-secondary structural motifs, the
topology of globular domains, remote homologues (functional
families) and sequence families.
FSSP
The top level of the fold classification corresponds to
secondary structure composition and super-secondary
structural motifs. Domains are assigned by the PUU algorithm
(Holm and Sander, 1994) and classified into one of five
‘attractors’, which can be characterised as all-α, all-β, α/β, α-β meander, and antiparallel β-barrels. Domains which are not
clearly assigned to a single attractor are placed in a mixed
class.
In September 2000, the Dali classification contained 17,101
chains, 1,375 fold types and 3,724 domain sequence families.
The database contains definitions of structurally conserved
cores and a library of multiple alignments of distantly related
protein families.
SCOP
The SCOP database (Structural Classification of Proteins) is a
manual classification of protein structure (Murzin et al., 1995).
The classification is at the domain level for many proteins, but
in general, a protein is only split into domains when there is a
clear indication that the individual domains may have existed
as independent proteins.
Therefore, many of the domain definitions in SCOP will be
different to those in the other structural domain databases.
The principal levels of hierarchy are family, superfamily and
fold, split into the traditional four domain classes, all-α, all-β,
α+β and α/β.
Release 1.55 of the SCOP database contains 13,220 PDB
entries, 605 fold types and 31,474 domains.
Gene Ontology (GO)

Not a genome sequence database

Developing three structured, controlled
vocabularies (ontologies) to describe gene
products in terms of:
– biological process
– cellular component
– molecular function
in a species-independent manner
The GO ontology
Gene Ontology Members
FlyBase - database for the fruit fly Drosophila melanogaster
Berkeley Drosophila Genome Project (BDGP) - Drosophila informatics; GO database & software,
Sequence Ontology development
Saccharomyces Genome Database (SGD) - database for the budding yeast Saccharomyces
cerevisiae
Mouse Genome Database (MGD) & Gene Expression Database (GXD) - databases for the
mouse Mus musculus
The Arabidopsis Information Resource (TAIR) - database for the brassica family plant Arabidopsis
thaliana
WormBase - database for the nematode Caenorhabditis elegans
EBI GOA project : annotation of UniProt (Swiss-Prot/TrEMBL/PIR) and InterPro databases
Rat Genome Database (RGD) - database for the rat Rattus norvegicus
DictyBase - informatics resource for the slime mold Dictyostelium discoideum
GeneDB S. pombe - database for the fission yeast Schizosaccharomyces pombe (part of the
Pathogen Sequencing Unit at the Wellcome Trust Sanger Institute)
GeneDB for protozoa - databases for Plasmodium falciparum, Leishmania major, Trypanosoma
brucei, and several other protozoan parasites (part of the Pathogen Sequencing Unit at the
Wellcome Trust Sanger Institute)
Genome Knowledge Base (GK) - a collaboration between Cold Spring Harbor Laboratory and
EBI
TIGR - The Institute for Genomic Research
Gramene - A Comparative Mapping Resource for Monocots
Compugen (with its Internet Research Engine)
The Zebrafish Information Network (ZFIN) - reference datasets and information on Danio rerio
Protein interaction database

There are numerous databases of protein-protein
interactions

DIP is a popular protein-protein interaction database
The DIP database catalogs
experimentally determined
interactions between proteins.
It combines information from a
variety of sources to create a
single, consistent set of
protein-protein interactions.
Protein interaction databases
BIND - Biomolecular Interaction Network Database
DIP - Database of Interacting Proteins
PIM – Hybrigenics
PathCalling Yeast Interaction Database
MINT - a Molecular Interactions Database
GRID - The General Repository for Interaction Datasets
InterPreTS - protein interaction prediction through tertiary structure
STRING - predicted functional associations among genes/proteins
Mammalian protein-protein interaction database (PPI)
InterDom - database of putative interacting protein domains
FusionDB - database of bacterial and archaeal gene fusion events
IntAct Project
The Human Protein Interaction Database (HPID)
ADVICE - Automated Detection and Validation of Interaction by Co-evolution
InterWeaver - protein interaction reports with online evidence
PathBLAST - alignment of protein interaction networks
ClusPro - a fully automated algorithm for protein-protein docking
HPRD - Human Protein Reference Database
Protein interaction database
Network of protein interactions and predicted functional links involving silencing
information regulator (SIR) proteins. Filled circles represent proteins of known function;
open circles represent proteins of unknown function, represented only by their
Saccharomyces genome sequence numbers (http://genome-www.stanford.edu/Saccharomyces). Solid lines show experimentally determined
interactions, as summarized in the Database of Interacting Proteins (ref. 19; http://dip.doe-mbi.ucla.edu). Dashed lines show functional links predicted by the Rosetta Stone
method (ref. 12). Dotted lines show functional links predicted by phylogenetic profiles (ref. 16). Some
predicted links are omitted for clarity.
Network of predicted functional linkages involving the yeast prion protein Sup35 (ref. 20).
The dashed line shows the only experimentally determined interaction. The other
functional links were calculated from genome and expression data (ref. 11) by a
combination of methods, including phylogenetic profiles, Rosetta Stone linkages and
mRNA expression. Linkages predicted by more than one method, and hence particularly
reliable, are shown by heavy lines. Adapted from ref. 11.
STRING - predicted functional
associations among genes/proteins

STRING is a database of predicted functional
associations among genes/proteins.
 Genes of similar function tend to be
maintained in close neighborhood, tend to be
present or absent together, i.e. to have the
same phylogenetic occurrence, and can
sometimes be found fused into a single gene
encoding a combined polypeptide.
 STRING integrates this information from as
many genomes as possible to predict
functional links between proteins.
Berend Snel and Martijn Huynen (RUN) and the group of Peer Bork (EMBL, Heidelberg)
STRING - predicted functional
associations among genes/proteins
STRING is a database of known and predicted protein-protein interactions.
The interactions include direct (physical) and indirect
(functional) associations; they are derived from four
sources:
1. Genomic Context (Synteny)
2. High-throughput Experiments
3. (Conserved) Co-expression
4. Previous Knowledge
STRING quantitatively integrates interaction data from
these sources for a large number of organisms, and
transfers information between these organisms where
applicable. The database currently contains 736429
proteins in 179 species
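The slide does not say how the four sources are integrated quantitatively. As an illustration only, one simple way to combine evidence channels is to treat each score as a probability and to assume the channels are independent. The function name and the score values below are hypothetical, and this is not claimed to be the exact scheme used by STRING.

```python
from math import prod
from typing import Iterable


def combine_scores(scores: Iterable[float]) -> float:
    """Combine per-source confidence scores, assuming independent sources.

    The link is missed only if every source misses it, so the combined
    score is 1 - product(1 - s_i).
    """
    scores = list(scores)
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError("scores must lie between 0 and 1")
    return 1.0 - prod(1.0 - s for s in scores)


# Hypothetical scores for one protein pair: genomic context,
# experiments, co-expression, previous knowledge.
print(combine_scores([0.4, 0.6, 0.2, 0.0]))
```

Under this assumption, several weak but independent lines of evidence can add up to a fairly confident link.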
STRING - predicted functional
associations among genes/proteins
Conserved Neighborhood
This view shows runs of genes that occur repeatedly in close
neighborhood in (prokaryotic) genomes. Genes located together in a run
are linked with a black line (maximum allowed intergenic distance is 300
bp). Note that if there are multiple runs for a given species, these are
separated by white space. If there are other genes in the run that are
below the current score threshold, they are drawn as small white
triangles. Gene fusion occurrences are also drawn, but only if they are
present in a run (see also the Fusion section below for more details).
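The notion of a run can be sketched as follows: genes are sorted by position and a new run starts whenever the intergenic distance exceeds the 300 bp limit mentioned above. The gene tuples and the function name are hypothetical, and the sketch ignores strand and score thresholds.

```python
from typing import List, Tuple

Gene = Tuple[str, int, int]  # (name, start, end), coordinates in bp


def group_into_runs(genes: List[Gene], max_gap: int = 300) -> List[List[str]]:
    """Group genes on one contig into runs of close neighbours."""
    runs: List[List[str]] = []
    prev_end = None
    for name, start, end in sorted(genes, key=lambda g: g[1]):
        # Start a new run when the intergenic distance exceeds max_gap.
        if prev_end is None or start - prev_end > max_gap:
            runs.append([name])
        else:
            runs[-1].append(name)
        prev_end = end if prev_end is None else max(prev_end, end)
    return runs


example = [("a", 100, 900), ("b", 950, 1800), ("c", 2500, 3100), ("d", 3150, 4000)]
print(group_into_runs(example))  # [['a', 'b'], ['c', 'd']]
```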
Functional inference at systems level

Function prediction of individual genes could be made in the
context of biological pathways/networks

Example – phoB is predicted to be a transcription regulator and
it regulates all the genes in the pho-regulon (a group of co-regulated operons); and within this regulon, gene A is interacting
with gene B, etc.
Functional inference at systems level

KEGG is a database of biological pathways and networks
Functional inference at systems level

By homology searching, one can map a known
biological pathway in one organism onto another,
and hence predict gene functions in the context of
biological pathways/networks
Wrapping up

We have seen a number of ways to infer a
putative function for a protein sequence

To gain confidence, it is important to combine
as many different prediction protocols as
possible (the STRING server is an example of
this)
Homework

Give an example of two proteins that have the same
structural fold but different biological functions,
by searching SCOP and Swiss-Prot

What is the biological function of phoR in the two-component system of prokaryotic organisms, based on a
KEGG database search?
Lecture 17:
Protein function
Introduction to Bioinformatics
Domain fusion
For example, vertebrates have a multi-enzyme
protein (GARs-AIRs-GARt) comprising the enzymes
GAR synthetase (GARs), AIR synthetase (AIRs),
and GAR transformylase (GARt).
In insects, the polypeptide appears as GARs-(AIRs)2-GARt. However, GARs-AIRs is encoded
separately from GARt in yeast, and in bacteria each
domain is encoded separately (Henikoff et al.,
1997).
GAR: glycinamide ribonucleotide
AIR: aminoimidazole ribonucleotide
Domain fusion
Genetic mechanisms influencing the layout of
multidomain proteins include gross rearrangements
such as inversions, translocations, deletions and
duplications, homologous recombination, and
slippage of DNA polymerase during replication
(Bork et al., 1992).
Although genetically conceivable, the transition
from two single domain proteins to a multidomain
protein requires that both domains fold correctly
and that they succeed in burying a fraction of the
previously solvent-exposed surface area in a newly
generated inter-domain surface.
Pathways and Pathway Diagrams

Pathways
– Set of nodes (entities)
and edges
(associations)

Pathway Diagrams
– XY coordinates
– Node splitting allowed
– Multiple views of the
same pathway
– Different abstraction
Metabolic
networks
Glycolysis
and
Gluconeogenesis
KEGG database (Japan)
Lecture 16:
Domains, their prediction and
domain databases
Introduction to Bioinformatics
Sequence-Structure-Function
[Diagram: Sequence, Structure, Function]
Sequence to Structure: threading; ab initio prediction and folding impossible but for the smallest structures
Sequence to Function: homology searching (BLAST)
Structure to Function: function prediction from structure very difficult
Functional Genomics – Systems
Biology
[Diagram: Genome, Expressome, Proteome, Metabolome; related fields: metabolomics, fluxomics]
Systems Biology
is the study of the interactions between the components
of a biological system, and how these interactions give
rise to the function and behaviour of that system (for
example, the enzymes and metabolites in a metabolic
pathway). The aim is to quantitatively understand the
system and to be able to predict the system’s time
processes
• the interactions are nonlinear
• the interactions give rise to emergent properties, i.e. properties that cannot be explained by the individual components of the system
• biological processes include many time-scales, many compartments and many interconnected network levels (e.g. regulation, signalling, expression, ...)
Systems Biology
understanding is often achieved through
modeling and simulation of the system’s
components and interactions.
Many times, the ‘four Ms’ cycle is
adopted:
Measuring
Mining
Modeling
Manipulating
‘The
silicon
cell’
(some people think
‘silly-con’ cell)
A system response
Apoptosis: programmed cell death
Necrosis: accidental cell death
Human
Yeast
‘Comparative
metabolomics’
We need to be able to
do automatic pathway
comparison (pathway
alignment)
Important difference
with human pathway
This pathway diagram shows a comparison of pathways in (left) Homo sapiens
(human) and (right) Saccharomyces cerevisiae (baker’s yeast). Changes in
controlling enzymes (square boxes in red) and the pathway itself have occurred
(yeast has one altered (‘overtaking’) path in the graph)
Experimental
• Structural genomics
• Functional genomics
• Protein-protein interaction
• Metabolic pathways
• Expression data
Issues when elucidating function
experimentally
• Partial information (indirect interactions) and subsequent filling of the missing steps
• Negative results (elements that have been shown not to interact, enzymes missing in an organism)
• Putative interactions resulting from computational analyses
Protein function categories

• Catalysis (enzymes)
• Binding – transport (active/passive)
– Protein-DNA/RNA binding (e.g. histones, transcription factors)
– Protein-protein interactions (e.g. antibody-lysozyme) (experimentally determined by yeast two-hybrid (Y2H) or bacterial two-hybrid (B2H) screening)
– Protein-fatty acid binding (e.g. apolipoproteins)
– Protein – small molecules (drug interaction, structure decoding)
• Structural component (e.g. crystallin)
• Regulation
• Signalling
• Transcription regulation
• Immune system
• Motor proteins (actin/myosin)
Catalytic properties of enzymes
Michaelis-Menten equation:
$$V = \frac{V_{max}\,[S]}{K_m + [S]}$$
Reaction scheme: E + S ⇌ ES → E + P, with kcat the rate constant of the second step
[Plot: reaction rate V (moles/s) against substrate concentration [S]; V approaches Vmax, and V = Vmax/2 at [S] = Km]
E = enzyme
S = substrate
ES = enzyme-substrate complex (transition state)
P = product
Km = Michaelis constant
kcat = catalytic rate constant (turnover number)
kcat/Km = specificity constant (useful for comparison)
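As a small numerical illustration of the Michaelis-Menten equation, the sketch below evaluates the rate for a few substrate concentrations. The parameter values are hypothetical and the units are arbitrary but consistent.

```python
def michaelis_menten(s: float, vmax: float, km: float) -> float:
    """Reaction rate V = Vmax * [S] / (Km + [S])."""
    if s < 0 or km <= 0:
        raise ValueError("[S] must be non-negative and Km positive")
    return vmax * s / (km + s)


# Hypothetical enzyme with Vmax = 10 and Km = 2.
for s in (0.0, 2.0, 20.0, 200.0):
    print(s, michaelis_menten(s, vmax=10.0, km=2.0))
```

At [S] = Km the rate is exactly Vmax/2, and at high [S] it levels off towards Vmax, which matches the shape of the curve on the slide.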
Protein interaction domains
http://pawsonlab.mshri.on.ca/html/domains.html
Energy difference upon binding
Examples of protein interactions (and of functional
importance) include:
• Protein – protein (pathway analysis);
• Protein – small molecules (drug interaction, structure decoding);
• Protein – peptides, DNA/RNA
The change in Gibbs free energy of the protein-ligand
binding interaction can be monitored and expressed by
the following equation:
$$\Delta G = \Delta H - T\,\Delta S$$
(ΔH = change in enthalpy, ΔS = change in entropy and T = temperature)
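The free-energy relation can be checked with a short calculation. The numbers below are made up for illustration, and the units (kJ/mol, kJ/(mol K), K) are an assumption of this sketch.

```python
def gibbs_free_energy(delta_h: float, delta_s: float, temperature: float) -> float:
    """Return delta G = delta H - T * delta S.

    delta_h in kJ/mol, delta_s in kJ/(mol K), temperature in K.
    """
    if temperature < 0:
        raise ValueError("temperature in kelvin cannot be negative")
    return delta_h - temperature * delta_s


# Hypothetical binding event: favourable enthalpy, unfavourable entropy.
dg = gibbs_free_energy(delta_h=-50.0, delta_s=-0.1, temperature=298.15)
print(f"delta G = {dg:.2f} kJ/mol")
```

A negative result means that binding is favourable under these conditions; here the enthalpy gain outweighs the entropy loss.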
Protein-protein interaction networks
Protein function

Many proteins combine functions

Some immunoglobulin structures are
thought to have more than 100 different
functions (and active/binding sites)

Alternative splicing can generate (partially)
alternative structures
Protein function & Interaction
Active site /
binding cleft
Shape complementarity
Protein function evolution
Chymotrypsin
How to infer function
• Experiment
• Deduction from sequence
– Multiple sequence alignment – conservation patterns
– Homology searching
• Deduction from structure
– Threading
– Structure-structure comparison
– Homology modelling
Cholesterol Biosynthesis:
Cholesterol biosynthesis primarily occurs in
eukaryotic cells. It is necessary for membrane
synthesis, and is a precursor for steroid hormone
production as well as for vitamin D. While the
pathway had previously been assumed to be
localized in the cytosol and ER, more recent
evidence suggests that a good deal of the
enzymes in the pathway exist largely, if not
exclusively, in the peroxisome (the enzymes
listed in blue in the pathway to the left are
thought to be at least partly peroxisomal).
Patients with peroxisome biogenesis disorders
(PBDs) have a variable deficiency in cholesterol
biosynthesis
Cholesterol Biosynthesis:
from acetyl-CoA to mevalonate
Mevalonate plays a role in epithelial cancers:
it can inhibit EGFR
Epidermal Growth Factor as a
Clinical Target in Cancer
A malignant tumour is the product of uncontrolled cell proliferation.
Cell growth is controlled by a delicate balance between growth-promoting and growth-inhibiting factors. In normal tissue the
production and activity of these factors results in differentiated cells
growing in a controlled and regulated manner that maintains the
normal integrity and functioning of the organ. The malignant cell has
evaded this control; the natural balance is disturbed (via a variety of
mechanisms) and unregulated, aberrant cell growth occurs. A key
driver for growth is the epidermal growth factor (EGF) and the
receptor for EGF (the EGFR) has been implicated in the
development and progression of a number of human solid tumours
including those of the lung, breast, prostate, colon, ovary, head and
neck.
Energy housekeeping:
Adenosine diphosphate (ADP) – Adenosine triphosphate (ATP)
Chemical Reaction
Add Enzymatic Catalysis
Add Gene Expression
Add Inhibition
Metabolic Pathway: Proline
Biosynthesis
Proline as end product effects a negative feedback loop
Transcriptional Regulation
Methionine Biosynthesis in E. coli
Shortcut Representation
High-level Interaction representation
Levels of Resolution
SREBP Pathway
Signal Transduction
Important signalling pathways:
MAP kinase (MAPK) signalling
pathway, or TGF-β pathway
Transport
Phosphate Utilization in Yeast
Multiple Levels of Regulation
• Gene expression
• Protein posttranslational modification
• Protein activity
• Protein intracellular location
• Protein degradation
• Substrate transport
Graphical Representation –
Gene Expression
Protein interaction domains
http://pawsonlab.mshri.on.ca/index.php?option=com_content&task=view&id=30&Itemid=63
Domain function
Active site / binding cleft
Protein-protein (domain-domain) interaction
Shape complementarity
A domain is a:
• Compact, semi-independent unit (Richardson, 1981).
• Stable unit of a protein structure that can fold autonomously (Wetlaufer, 1973).
• Recurring functional and evolutionary module (Bork, 1992).
“Nature is a tinkerer and not an inventor” (Jacob, 1977).
• Smallest unit of function
Delineating domains is essential for:
• Obtaining high resolution structures (x-ray but
particularly NMR – size of proteins)
• Sequence analysis
• Multiple sequence alignment methods
• Prediction algorithms (SS, Class, secondary/tertiary
structure)
• Fold recognition and threading
• Elucidating the evolution, structure and function of
a protein family (e.g. ‘Rosetta Stone’ method)
• Structural/functional genomics
• Cross genome comparative analysis
Domain connectivity
linker
Structural domain organisation can be nasty…
Pyruvate kinase
Phosphotransferase
β barrel regulatory domain
α/β barrel catalytic substrate binding
domain
α/β nucleotide binding domain
1 continuous + 2 discontinuous domains
Domain size
The size of individual structural domains varies widely
– from 36 residues in E-selectin to 692 residues in lipoxygenase-1 (Jones et al., 1998)
– the majority (90%) having less than 200 residues (Siddiqui and Barton, 1995)
– with an average of about 100 residues (Islam et al., 1995).
Small domains (less than 40 residues) are often stabilised by metal ions or disulphide bonds.
Large domains (greater than 300 residues) are likely to consist of multiple hydrophobic cores (Garel, 1992).
Analysis of chain hydrophobicity in
multidomain proteins
Domain characteristics
Domains are genetically mobile units, and
multidomain families are found in all three
kingdoms (Archaea, Bacteria and Eukarya)
underlining the finding that ‘Nature is a tinkerer and
not an inventor’ (Jacob, 1977).
The majority of genomic proteins, 75% in unicellular
organisms and more than 80% in metazoa, are
multidomain proteins created as a result of gene
duplication events (Apic et al., 2001).
Domains in multidomain structures are likely to
have once existed as independent proteins, and
many domains in eukaryotic multidomain proteins
can be found as independent proteins in
prokaryotes (Davidson et al., 1993).
Protein function evolution
- Gene (domain) duplication Active site
Chymotrypsin
Pyruvate phosphate dikinase
• 3-domain protein
• Two domains catalyse 2-step reaction A → B → C
• Third so-called ‘swivelling domain’ actively brings intermediate enzymatic product (B) over 45 Å from one active site to the other
The DEATH Domain
http://www.mshri.on.ca/pawson
• Present in a variety of Eukaryotic
proteins involved with cell death.
• Six helices enclose a tightly
packed hydrophobic core.
• Some DEATH domains form
homotypic and heterotypic dimers.
Detecting Structural Domains

• A structural domain may be detected as a compact, globular substructure with more interactions within itself than with the rest of the structure (Janin and Wodak, 1983).
• Therefore, a structural domain can be determined by two shape characteristics: compactness and its extent of isolation (Tsai and Nussinov, 1997).
• Measures of local compactness in proteins have been used in many of the early methods of domain assignment (Rossmann et al., 1974; Crippen, 1978; Rose, 1979; Go, 1978) and in several of the more recent methods (Holm and Sander, 1994; Islam et al., 1995; Siddiqui and Barton, 1995; Zehfus, 1997; Taylor, 1999).
Detecting Structural Domains
However, these approaches encounter problems
when faced with discontinuous or highly
associated domains, and many definitions
will require manual interpretation.
Consequently, there are discrepancies
between assignments made by domain
databases (Hadley and Jones, 1999).
Detecting Domains using
Sequence only

Even more difficult than prediction from structure!
Integrating protein multiple sequence
alignment, secondary and tertiary structure
prediction in order to predict structural domain
boundaries in sequence data
SnapDRAGON
Richard A. George
George R.A. and Heringa, J. (2002) J. Mol. Biol., 316, 839-851.
Protein structure hierarchical levels
PRIMARY STRUCTURE (amino acid sequence)
VHLTPEEKSAVTALWGKVNVDE
VGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHG
KKVLGAFSDGLAHLDNLKGTFA
TLSELHCDKLHVDPENFRLLGN
VLVCVLAHHFGKEFTPPVQAAY
QKVVAGVANALAHKYH
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
QUATERNARY STRUCTURE
SNAPDRAGON
Domain boundary prediction protocol using sequence information
alone (Richard George)
1. Input: Multiple sequence alignment (MSA) and predicted secondary structure
2. Generate 100 DRAGON 3D models for the protein structure associated with the MSA
3. Assign domain boundaries to each of the 3D models (Taylor, 1999)
4. Sum proposed boundary positions within 100 models along the length of the sequence, and smooth boundaries using a weighted window
George R.A. and Heringa J.(2002) SnapDRAGON - a method to delineate protein structural domains from
sequence data, J. Mol. Biol. 316, 839-851.
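SnapDRAGON
[Figure: SnapDRAGON workflow. The multiple alignment and the predicted secondary structure (e.g. CCHHHCCEEE) are used to generate folds with DRAGON; boundary recognition (Taylor, 1999) is applied to each fold; the boundaries are then summed and smoothed.]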
SNAPDRAGON
Domain boundary prediction protocol using sequence information
alone (Richard George)
1. Input: Multiple sequence alignment (MSA) and predicted secondary structure
– Sequence searches using PSI-BLAST (Altschul et al., 1997)
– followed by sequence redundancy filtering using OBSTRUCT (Heringa et al., 1992)
– and alignment by PRALINE (Heringa, 1999)
– PREDATOR secondary structure prediction program
Domain prediction using DRAGON
Distance Regularisation Algorithm for
Geometry OptimisatioN
(Aszodi & Taylor, 1994)
• Fold proteins based on the requirement that (conserved) hydrophobic residues cluster together.
• First construct a random high dimensional Cα distance matrix.
• Distance geometry is used to find the 3D conformation corresponding to a prescribed target matrix of desired distances between residues.
SNAPDRAGON
Domain boundary prediction protocol using sequence information
alone (Richard George)
2. Generate 100 DRAGON (Aszodi & Taylor, 1994) models for the protein structure associated with the MSA
– DRAGON folds proteins based on the requirement that (conserved) hydrophobic residues cluster together
– (Predicted) secondary structures are used to further estimate distances between residues (e.g. between the first and last residue in a β-strand).
– It first constructs a random high dimensional Cα (and pseudo Cβ) distance matrix
– Distance geometry is used to find the 3D conformation corresponding to a prescribed matrix of desired distances between residues (by gradual inertia projection and based on input MSA and predicted secondary structure)
DRAGON = Distance Regularisation Algorithm for Geometry OptimisatioN
[Figure: DRAGON scheme. Input data: multiple alignment and predicted secondary structure (e.g. CCHHHCCEEE); N × N target matrix; 100 randomised initial N × N Cα distance matrices; N × 3 coordinates; 100 predictions.]
• The Cα distance matrix is divided into smaller clusters.
• Separately, each cluster is embedded into a local centroid.
• The final predicted structure is generated from full embedding of the multiple centroids and their corresponding local structures.
[Figure: PDB structures compared with DRAGON models for lysozyme (4lzm), methyltransferase (1sfe) and phosphatase (2hhm-A)]
Taylor method (1999)
DOMAIN-3D
3. Assign domain boundaries to each of
the 3D models (Taylor, 1999)



• Easy and clever method
• Uses a notion of spin glass theory (disordered magnetic systems) to delineate domains in a protein 3D structure
Steps:
1. Take sequence with residue numbers (1..N)
2. Look at neighbourhood of each residue (first shell)
3. If (“average neighbourhood residue number” > resno) then resno = resno+1, else resno = resno-1
4. If (convergence) then take regions with identical “residue number” as domains and terminate; otherwise repeat from step 2
Taylor,WR. (1999) Protein structural domain identification. Protein Engineering 12 :203-216
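The steps of the Taylor method can be sketched as a simple label-averaging loop. This is a simplified variant and not the published implementation: the contact lists are supplied directly instead of being derived from 3D coordinates, each residue's own label is included in its neighbourhood, and the step towards the neighbourhood mean is capped at one residue number (rather than always being exactly ±1) so that the iteration settles. All names are hypothetical.

```python
from typing import Dict, List


def assign_domains(neighbours: Dict[int, List[int]], max_iter: int = 1000,
                   tol: float = 1e-6) -> Dict[int, int]:
    """Return a domain index for each residue number."""
    labels = {res: float(res) for res in neighbours}
    for _ in range(max_iter):
        new_labels = {}
        for res, nbrs in neighbours.items():
            shell = [labels[res]] + [labels[n] for n in nbrs]
            mean = sum(shell) / len(shell)
            # Move towards the neighbourhood mean by at most one residue number.
            step = max(-1.0, min(1.0, mean - labels[res]))
            new_labels[res] = labels[res] + step
        converged = all(abs(new_labels[r] - labels[r]) < tol for r in labels)
        labels = new_labels
        if converged:
            break
    # Residues whose labels agree (after rounding) form one domain.
    domains: Dict[int, int] = {}
    index_of: Dict[float, int] = {}
    for res in sorted(labels):
        key = round(labels[res], 3)
        domains[res] = index_of.setdefault(key, len(index_of))
    return domains


# Toy contact map: residues 1-4 contact each other, as do residues 5-8.
contacts = {
    1: [2, 3, 4], 2: [1, 3, 4], 3: [1, 2, 4], 4: [1, 2, 3],
    5: [6, 7, 8], 6: [5, 7, 8], 7: [5, 6, 8], 8: [5, 6, 7],
}
print(assign_domains(contacts))
```

In this toy example the two groups share no contacts, so each settles on its own label. In a real structure the domains are linked through the chain and the domain interface, so the choice of neighbourhood cut-off and stopping point matters.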
Taylor method (1999)
[Figure: one update step, repeated until convergence. Residue 41 has neighbours 5, 6, 56, 78 and 89.]
if 41 < (5+6+56+78+89)/5
then Res 41 → 42 (up 1)
else Res 41 → 40 (down 1)
Taylor method (1999)
[Figure: examples of continuous and discontinuous domains]
SNAPDRAGON
Domain boundary prediction protocol using sequence information
alone (Richard George)
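4. Sum proposed boundary positions within 100 models along the length of the sequence, and smooth boundaries using a weighted window (assign central position)
$$\text{Window score} = \sum_{1 \le i \le l} S_i \times W_i$$
[Plot: weight W_i against window position i]
where $W_i = (p - |p - i|)/p^2$ and $p = \tfrac{1}{2}(n+1)$, with n the window length (so that l = n).
It follows that $\sum_{1 \le i \le l} W_i = 1$.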
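The weighting scheme can be written out as a short sketch. It assumes an odd window length n, for which the weights sum exactly to 1, and pads the ends of the score profile with zeros. Both choices are assumptions of this example, not details given on the slide.

```python
from typing import List, Sequence


def triangular_weights(n: int) -> List[float]:
    """Weights W_i = (p - |p - i|) / p**2 with p = (n + 1) / 2, for i = 1..n."""
    if n < 1 or n % 2 == 0:
        raise ValueError("window length must be a positive odd number")
    p = (n + 1) / 2
    return [(p - abs(p - i)) / p ** 2 for i in range(1, n + 1)]


def smooth(scores: Sequence[float], n: int) -> List[float]:
    """Assign each position the weighted sum of the window centred on it."""
    weights = triangular_weights(n)
    half = n // 2
    padded = [0.0] * half + list(scores) + [0.0] * half  # zero padding (assumption)
    return [
        sum(s * w for s, w in zip(padded[k:k + n], weights))
        for k in range(len(scores))
    ]


print(triangular_weights(5))
print(smooth([0, 0, 9, 0, 0], 5))
```

A single sharp peak in the summed boundary counts is spread into a triangular profile whose maximum stays at the central position.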
SNAPDRAGON
Statistical significance:

• Convert peak scores to Z-scores using z = (x - mean)/stdev
• If z > 2 then assign domain boundary
Statistical significance using random models:
• Test hydrophobic collapse given distribution of hydrophobicity over sequence
• Make 5 scrambled multiple alignments (MSAs) and predict their secondary structure
• Make 100 models for each MSA
• Compile mean and stdev from the boundary distribution over the 500 random models
• If observed peak z > 2.0 (i.e. more than 2.0 stdev above the mean of the random models) then assign domain boundary
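The significance test can be sketched as follows. The score lists are hypothetical, and the use of the sample standard deviation is an assumption of this example.

```python
from statistics import mean, stdev
from typing import List, Sequence


def significant_positions(observed: Sequence[float],
                          random_scores: Sequence[float],
                          threshold: float = 2.0) -> List[int]:
    """Return indices whose score lies more than `threshold` standard
    deviations above the mean of the random-model scores."""
    mu = mean(random_scores)
    sigma = stdev(random_scores)  # sample standard deviation (assumption)
    if sigma == 0:
        raise ValueError("random scores show no variation")
    return [i for i, x in enumerate(observed) if (x - mu) / sigma > threshold]


# Hypothetical smoothed boundary scores from the random models and
# from the real alignment.
random_scores = [1, 2, 3, 2, 1, 2, 3, 2]
observed = [2.0, 5.0, 3.0]
print(significant_positions(observed, random_scores))  # [1]
```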
SnapDRAGON prediction
assessment
• Test set of 414 multiple alignments: 183 single and
231 multiple domain proteins.
• Boundary predictions are compared to the region
of the protein connecting two domains (maximally
10 residues from true boundary)
SnapDRAGON prediction assessment
• Baseline method I:
• Divide sequence in equal parts based on number of
domains predicted by SnapDRAGON
• Baseline method II:
• Similar to Wheelan et al., based on domain length
partition density function (PDF)
• PDF derived from 2750 non-redundant structures
(deposited at NCBI)
• Given sequence, calculate probability of one-domain, two-domain, .., protein
• Highest probability taken and sequence split equally
as in baseline method I
Average prediction results per protein

Method       Measure    Continuous set   Discontinuous set   Full set
SnapDRAGON   Coverage   63.9 (± 43.0)    35.4 (± 25.0)       51.8 (± 39.1)
SnapDRAGON   Success    46.8 (± 36.4)    44.4 (± 33.9)       45.8 (± 35.4)
Baseline 1   Coverage   43.6 (± 45.3)    20.5 (± 27.1)       34.7 (± 40.8)
Baseline 1   Success    34.3 (± 39.6)    22.2 (± 29.5)       29.6 (± 36.6)
Baseline 2   Coverage   45.3 (± 46.9)    22.7 (± 27.3)       35.7 (± 41.3)
Baseline 2   Success    37.1 (± 42.0)    23.1 (± 29.6)       31.2 (± 37.9)

Coverage is the % of linkers predicted: TP/(TP+FN)
Success is the % of correct predictions made: TP/(TP+FP)
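The two measures can be computed for a single protein as in the sketch below. It simplifies the assessment by treating each true linker as a single boundary position and counting a prediction as correct when it lies within 10 residues of one. The function name and the positions are hypothetical.

```python
from typing import Sequence, Tuple


def coverage_and_success(predicted: Sequence[int],
                         true_boundaries: Sequence[int],
                         tolerance: int = 10) -> Tuple[float, float]:
    """Return (coverage %, success %) for one protein.

    coverage = TP / (TP + FN): fraction of true boundaries that were found.
    success  = TP / (TP + FP): fraction of predictions that are correct.
    """
    found = [t for t in true_boundaries
             if any(abs(p - t) <= tolerance for p in predicted)]
    correct = [p for p in predicted
               if any(abs(p - t) <= tolerance for t in true_boundaries)]
    coverage = 100.0 * len(found) / len(true_boundaries) if true_boundaries else 0.0
    success = 100.0 * len(correct) / len(predicted) if predicted else 0.0
    return coverage, success


# Two predictions against four true boundaries; only position 48 is a hit.
print(coverage_and_success([48, 120], [50, 205, 300, 400]))  # (25.0, 50.0)
```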
Phylogenetic profile analysis

Function prediction of genes based on “guilt-by-association” – a non-homologous approach

The phylogenetic profile of a protein is a string that
encodes the presence or absence of the protein in
every sequenced genome

Because proteins that participate in a common
structural complex or metabolic pathway are likely to
co-evolve, the phylogenetic profiles of such proteins
are often “similar”
Phylogenetic profile analysis

Phylogenetic profile (against N genomes)
– For each gene X in a target genome (e.g., E. coli),
build a phylogenetic profile as follows
– If gene X has a homolog in genome #i, the ith bit
of X’s phylogenetic profile is “1” otherwise it is “0”
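Building a profile can be sketched as below. The input format (one set of gene names per genome, listing the genes that have a homolog there) is an assumption made for illustration; in practice these sets would come from homology searches.

```python
from typing import List, Set


def phylogenetic_profile(gene: str, homologs_per_genome: List[Set[str]]) -> str:
    """Return a bit string: bit i is '1' if `gene` has a homolog in genome i."""
    return "".join("1" if gene in present else "0"
                   for present in homologs_per_genome)


# Three hypothetical genomes and the target genes with a homolog in each.
genomes = [{"X", "Y"}, {"Y"}, {"X"}]
print(phylogenetic_profile("X", genomes))  # 101
print(phylogenetic_profile("Y", genomes))  # 110
```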
Phylogenetic profile analysis

Example – phylogenetic profiles based on 60
genomes
(rows: genes; columns: genomes)
orf1034:1110110110010111110100010100000000111100011111110110111010101
orf1036:1011110001000001010000010010000000010111101110011011010000101
orf1037:1101100110000001110010000111111001101111101011101111000010100
orf1038:1110100110010010110010011100000101110101101111111111110000101
orf1039:1111111111111111111111111111111111111111101111111111111111101
orf104: 1000101000000000000000101000000000110000000000000100101000100
orf1040:1110111111111101111101111100000111111100111111110110111111101
orf1041:1111111111111111110111111111111101111111101111111111111111101
orf1042:1110100101010010010110000100001001111110111110101101100010101
orf1043:1110100110010000010100111100100001111110101111011101000010101
orf1044:1111100111110010010111010111111001111111111111101101100010101
orf1045:1111110110110011111111111111111101111111101111111111110010101
orf1046:0101100000010001011000000111110000010100000001010010100000000
orf1047:0000000000000001000010000001000100000000000000010000000000000
orf105: 0110110110100010111101101010111001101100101111100010000010001
orf1054:0100100110000001100001000100000000100100100001000100100000000
By correlating the rows (open reading frames (ORFs) or genes) you find
out about joint presence or absence of genes: this
is a signal for a functional connection
Genes with similar phylogenetic profiles have related functions
or are functionally linked – D Eisenberg and colleagues (1999)
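One simple way to compare rows is the Hamming distance, i.e. the number of genomes in which two profiles differ. The short profiles below are made up, and other similarity measures (for example correlation or mutual information) could be used instead.

```python
from typing import Dict, List


def hamming(a: str, b: str) -> int:
    """Number of genomes in which two profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same genomes")
    return sum(x != y for x, y in zip(a, b))


def rank_by_similarity(query: str, profiles: Dict[str, str]) -> List[str]:
    """Other genes sorted from most to least similar profile."""
    others = [name for name in profiles if name != query]
    return sorted(others, key=lambda name: hamming(profiles[query], profiles[name]))


profiles = {"g1": "110100", "g2": "110101", "g3": "001011"}
print(hamming(profiles["g1"], profiles["g2"]))  # 1
print(rank_by_similarity("g1", profiles))       # ['g2', 'g3']
```

Here g1 and g2 differ in only one genome, so they would be candidates for a functional link. g3 has the exact inverse profile of g1, which is the situation referred to as functional complementarity.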
Phylogenetic profile analysis

Phylogenetic profiles contain a great amount of functional
information

Phylogenetic profile analysis can be used to distinguish
orthologous genes from paralogous genes

Subcellular localization: 361 yeast nucleus-encoded
mitochondrial proteins are identified at 50% accuracy with 58%
coverage through phylogenetic profile analysis

Functional complementarity: By examining inverse phylogenetic
profiles, one can find functionally complementary genes that
have evolved through one of several mechanisms of convergent
evolution.
Prediction of protein-protein interactions
Rosetta stone

Gene fusion is an effective method for prediction
of protein-protein interactions
– If proteins A and B are homologous to two domains of a
protein C, A and B are predicted to interact
[Figure: proteins A and B aligned to two domains of the fused protein C]
Though gene fusion has low prediction
coverage, its false-positive rate is low
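The Rosetta stone rule can be sketched as follows. The input structure (for each fused protein, the set of separate proteins that are homologous to each of its domains) is a hypothetical format; in practice it would come from homology searches.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def rosetta_stone_pairs(
    fusion_hits: Dict[str, Dict[int, Set[str]]]
) -> Set[Tuple[str, str]]:
    """Predict interacting pairs from fused ('Rosetta stone') proteins.

    fusion_hits maps each fused protein to {domain index: proteins that are
    homologous to that domain}. Proteins matching different domains of the
    same fused protein are predicted to interact.
    """
    pairs: Set[Tuple[str, str]] = set()
    for domains in fusion_hits.values():
        for d1, d2 in combinations(sorted(domains), 2):
            for a in domains[d1]:
                for b in domains[d2]:
                    if a != b:
                        pairs.add(tuple(sorted((a, b))))
    return pairs


# Protein C has two domains; A matches the first and B the second.
print(rosetta_stone_pairs({"C": {0: {"A"}, 1: {"B"}}}))  # {('A', 'B')}
```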
Domain fusion example
Vertebrates have a multi-enzyme protein (GARs-AIRs-GARt) comprising the enzymes GAR
synthetase (GARs), AIR synthetase (AIRs), and
GAR transformylase (GARt).
In insects, the polypeptide appears as GARs-(AIRs)2-GARt.
In yeast, GARs-AIRs is encoded separately from
GARt.
In bacteria each domain is encoded separately
(Henikoff et al., 1997).
GAR: glycinamide ribonucleotide
AIR: aminoimidazole ribonucleotide