Biozon
A unified knowledge resource on DNA sequences,
proteins, complexes and cellular pathways
Golan Yona
Department of Computer Science
Cornell University
Golan Yona, Cornell University
Ipam04
The human genome project
The first step - The quaternary code of the cell (organism)
It codes proteins and RNA molecules – the basic procedures (many of
them have unknown functions)
Proteins, RNA and DNA can interact (modules)
They form pathways – complex programs with input and output
The unknown(s)
Can we decipher the “meaning” of our genome:
Can we identify the role of the basic procedures (proteins)
Can we predict the interactions between them
Can we identify the complex programs (pathways)
Can we find regularities? Global principles? (the way proteins are
organized into families (p-table), the methods used to compile
complex programs from the basic procedures.)
A very active field
Sequence and structure databases (SwissProt, PDB, GenBank)
Databases of protein domains and families (Prosite, Pfam, InterPro)
Databases of interactions (Bind, DIP)
Databases of pathways (Kegg, MetaCyc)
The challenge of data integration
• The biochemical function of genes depends on their extended
biological context – their relations to other genes, the set of
interactions they form, the pathways they participate in, their
subcellular location and so on.
• This broader biological context is important for the characterization of
new and existing genes, interactions and pathways.
• There is a strong need to corroborate and integrate data from different
resources and different aspects of biological systems for the effective
analysis of genes and other biological entities (from complexes to
protein families and biochemical pathways)
Searching for data
• SwissProt protein …
• PDB structure
• Interactions
• Pathways
• DNA sequence
• The domain structure
• Similar structures and sequences
• Protein families
• Expression data
Maybe for a few genes – but on a large scale…
The Biozon project
(with Aaron Birkland)
A unified biological knowledge resource
• An efficient system for storage, retrieval, manipulation and
exploration of biological data (both at the macro-molecular
and the cellular level)
• Integration of multiple sources (tens of databases plus in-house computations) – the keys are the physical objects
• Emphasis on multi-feature protein and DNA
characterization and classification.
Data types
Source: (prototype)
• Protein sequences (SwissProt, TrEMBL, GenPept, PIR, …)
• Protein structures (PDB, SCOP)
• DNA sequences (GenBank)
• Protein-Protein interactions (BIND)
• Pathways (KEGG)
• Expression data (BodyMap, …)
• GO data
• Complete genomes
Derived (computed)
• Protein domain families
• The domain structure of proteins and families
• Pairwise similarities between proteins and protein families:
– sequence similarities
– structural similarities
– threading-based
– profile-profile
– expression similarity
• 3D models
• Predicted protein-protein interactions
• Assignment of genes to pathways
• Local and global maps of the protein space
Total of 35 million documents and 2.5 billion relations as of Jan 2004
Querying data
• Complex queries – span different data types
– All proteins with known structures that participate in known
interactions
– All pathways that contain proteins with solved structures
– The structures of these proteins
– DNA sequences that encode protein kinases
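Queries of this kind amount to joins across typed relations. A minimal sketch in Python, assuming a toy in-memory store: the identifiers, relation names and helper functions below are hypothetical illustrations and do not reflect Biozon's actual schema or query interface.

```python
# Toy relations between object types; all identifiers are made up for illustration.
protein_structure = {("P1", "S1"), ("P2", "S2"), ("P4", "S3")}    # protein -> solved structure
protein_interaction = {("P1", "I1"), ("P3", "I1"), ("P4", "I2")}  # protein -> interaction
pathway_protein = {("W1", "P1"), ("W1", "P3"), ("W2", "P5")}      # pathway -> member protein


def proteins_with_structure_and_interaction():
    """Proteins that have a known structure and take part in a known interaction."""
    with_structure = {p for p, _ in protein_structure}
    with_interaction = {p for p, _ in protein_interaction}
    return with_structure & with_interaction


def pathways_with_solved_proteins():
    """Map each pathway to the solved structures of its member proteins."""
    result = {}
    for pathway, protein in pathway_protein:
        structures = {s for p, s in protein_structure if p == protein}
        if structures:
            result.setdefault(pathway, set()).update(structures)
    return result
```

Each query is a set intersection or a two-step join, which is why a single integrated store makes them cheap compared with visiting each source database separately.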
Expandable model
• Current: semi-automatic updates
• Stability: consistency and integrity through the use of triggers, time
stamps
• Scalability: The database was designed so as to allow easy integration
of other data types and other databases.
The Amazon.com model
• Optimality: Using a concise representation of computed distances,
recovery upon retrieval (optimal results as opposed to heuristics)
Warehouse of computed data
Methods
Part 1: First identify the functions of the basic procedures (proteins):
• A) Identify the evolutionary building blocks of proteins.
• B) Develop new representations for proteins and methods to measure similarity. Statistical models for protein families.
• C) Embedding techniques to study the geometry of the protein universe. Grammars to study the derivation rules.
• D) Unsupervised learning techniques to find the statistical regularities.
Part 2: Machine learning techniques to predict interactions
Part 3: Algorithms for pathway prediction
Part 4: Learning algorithms to identify the elements of cellular computations
Part 5: The whole picture … BIOZON
The domain structure of a protein
(with Niranjan Nagarajan)
• A domain is considered the fundamental unit of protein
structure, folding, function, evolution and design.
• Compact
• Stable
• Folds independently?
• Has a specific function
A protein is a combination of domains
Protein1
Protein2
Protein3
Why is it important to know the domain
structure:
functional analysis of proteins
structure prediction
structural genomics
protein building blocks
Any signals that might indicate domain boundaries?
• A very weak signal if any in the sequence
– Usually domain delineation is done based on structure (SCOP,
CATH , DSSP)
– But structural information is sparse.
– Best methods available for sequence are manual or semi-manual (Pfam, SMART).
– Fully automatic methods are not as accurate (ProDom, Domo).
• Our assumption: domains were formed early on, combinations
were formed later, but there are traces of the
autonomous units.
• But it is hard to discern signal from noise.
Overview of our system
[Flow diagram: seed sequence → BLAST searches (DNA data, protein data) → intron boundaries, multiple alignment → features (sequence participation, secondary structure, entropy, correlation, contact profile, physio-chemical properties) → neural network → putative predictions → hypothesis evaluation → final prediction]
First step: The domain-information content of an alignment column
• Measures (features) that are believed to
reflect structural properties of proteins
• A total of 20 measures
– Conservation measures (entropy, evolutionary pressure)
– Consistency and correlation measures (maintain
domain integrity: correlation, sequence termination)
– Measures of structural flexibility (indel entropy, correlated
mutations, predicted contact profiles)
– Residue type based measures
– Predicted secondary structure information
– Intron-exon data
Examples
• Class entropy: some positions have preference towards
a class of amino-acids (similar physio-chemical
properties)
• Evolutionary pressure (span): sum of pairwise
similarities
• Correlated mutations: indicative of contacts
Contact profiles
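The conservation measures can be illustrated per alignment column. A minimal sketch in Python, assuming Shannon entropy over residue counts with gaps ignored; the four-way grouping of amino acids into classes is a hypothetical choice for illustration, not the grouping used in the talk.

```python
import math
from collections import Counter

# Hypothetical grouping of amino acids into physio-chemical classes (an assumption).
AA_CLASSES = {
    "hydrophobic": "AVLIMFWC",
    "polar": "STNQYGP",
    "positive": "KRH",
    "negative": "DE",
}
AA_TO_CLASS = {aa: name for name, members in AA_CLASSES.items() for aa in members}


def entropy(symbols):
    """Shannon entropy (bits) of the empirical distribution of symbols."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def column_entropy(column):
    """Entropy over amino acids in one alignment column; gaps ('-') are ignored."""
    return entropy([aa for aa in column if aa != "-"])


def class_entropy(column):
    """Entropy over amino-acid classes in one alignment column."""
    return entropy([AA_TO_CLASS[aa] for aa in column if aa != "-"])
```

A column such as "AVLI" is maximally variable at the residue level but perfectly conserved at the class level, which is the kind of preference class entropy is meant to capture.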
Contact profile
Step2: Maximizing the information content of features
[Figure: distributions of the values of Feature X at boundary positions vs. domain positions]
• Generate distributions of scores for domains and
transitions (boundaries)
• Opt for the most distinct distributions of domain
positions vs. boundary positions, using the JS
divergence measure
• Also indicates which measures are the most
informative.
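The comparison of the two score distributions can be sketched as follows. This is a minimal Python illustration, assuming equal-width histograms over the pooled value range and base-2 logarithms; the binning scheme and the function names are assumptions, not details from the talk.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (bits) between two histograms over the same bins."""
    total_p, total_q = sum(p), sum(q)
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def histogram(values, lo, hi, bins):
    """Counts of values in equal-width bins over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1) if width > 0 else 0
        counts[idx] += 1
    return counts


def feature_separation(domain_values, boundary_values, bins=10):
    """JS divergence between the feature's values at domain vs. boundary positions."""
    lo = min(min(domain_values), min(boundary_values))
    hi = max(max(domain_values), max(boundary_values))
    p = histogram(domain_values, lo, hi, bins)
    q = histogram(boundary_values, lo, hi, bins)
    return js_divergence(p, q)
```

With base-2 logarithms the score ranges from 0 (identical distributions) to 1 (no overlap), so features can be ranked directly by it.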
Step3: The learning system
• A neural network is trained to model
the complex decision boundary surface
• Predicts correctly 94% of domain
positions and 88% of the transitions in
the test set
Step4: Hypothesis evaluation
• First refine predictions
– The initial output of the neural network is smoothed. Each
minimum is considered as a candidate transition point
• Search for the best hypothesis
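The refinement step can be sketched as below. This is a minimal Python illustration, assuming the network emits one score per residue (low values suggesting a transition) and that smoothing is a simple moving average; the window size and the strict-minimum rule are assumptions, not details from the talk.

```python
def smooth(scores, window=5):
    """Moving-average smoothing; the window is truncated at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        chunk = scores[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def candidate_transitions(scores, window=5):
    """Indices of strict local minima of the smoothed per-residue scores."""
    s = smooth(scores, window)
    return [i for i in range(1, len(s) - 1) if s[i] < s[i - 1] and s[i] < s[i + 1]]
```

The candidates returned here are only proposals; choosing among them is left to the hypothesis search.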
The domain generator model
• Finds the best of all possible hypotheses
• We assume a model: random generator
that moves repeatedly between a domain
state and a linker state and emits one
domain or transition at a time according to
different source probability distributions.
• Total probability is the product
[Figure: sequence S partitioned into domains D1, D2, …, Dn separated by transitions T1, T2, …, Tn-1, shown with the NN output]
• Find the partition that maximizes the posterior
probability P(D|S)
• Maximize the product of the likelihood and the
prior
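Written out with Bayes' rule (standard notation, not copied from the slides), the evidence P(S) does not depend on the partition D, which is why maximizing the posterior reduces to maximizing the product of likelihood and prior:

```latex
\[
\hat{D} \;=\; \arg\max_{D} P(D \mid S)
        \;=\; \arg\max_{D} \frac{P(S \mid D)\, P(D)}{P(S)}
        \;=\; \arg\max_{D} P(S \mid D)\, P(D)
\]
```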
Calculating the prior P(D)
• For an arbitrary protein of length L what is the
probability to observe the partition D
• Approximate using a simplified model
Estimated from experimental data
The likelihood
• Assume domains are independent of each other
(additional test can be used to assess independence)
[Figure: sequence S1 with domains D1, D2 and transitions T1, T2; labels: Domain, Transition, D-source, T-source]
Construct minimum spanning tree using pair statistics
Finally..
• Enumerate all possible hypotheses, calculate the posterior
probability for each one, and output the one that maximizes the
prob.
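The exhaustive search can be sketched as follows. This is a minimal Python illustration, assuming each candidate transition is reduced to a single cut point and that the prior and likelihood are supplied as log-probability functions; `log_prior`, `log_likelihood` and `max_cuts` are hypothetical names, and the actual models from the talk are not reproduced here.

```python
import math
from itertools import combinations


def best_partition(length, candidates, log_prior, log_likelihood, max_cuts=None):
    """Score every subset of candidate cut points and return the best partition.

    log_prior(domains) and log_likelihood(domains) are caller-supplied
    (hypothetical) scoring functions over a list of (start, end) segments,
    1-based and inclusive. A cut at position c ends one segment at c and
    starts the next at c + 1.
    """
    candidates = sorted(candidates)
    if max_cuts is None:
        max_cuts = len(candidates)
    best_score, best_domains = -math.inf, None
    for k in range(max_cuts + 1):
        for cuts in combinations(candidates, k):
            bounds = [0, *cuts, length]
            domains = [(a + 1, b) for a, b in zip(bounds, bounds[1:])]
            # Log posterior up to a constant: log prior plus log likelihood.
            score = log_prior(domains) + log_likelihood(domains)
            if score > best_score:
                best_score, best_domains = score, domains
    return best_domains, best_score
```

The number of hypotheses grows as 2^k in the number of candidate points, so the enumeration is only practical because smoothing leaves few candidates; `max_cuts` is an optional cap for longer proteins.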
Examples
PDB ID: 1acc
Domain Definition: 14-735
Predicted Domains: 1-158, 159-583, 584-735
Pfam Definition: 103-544
Embedding
– Global organization
– Reconstruct the geometry of the protein universe
– Look for statistical regularities
MDFFCEKKLYA..
KHGGACDLMYK..
HVIPPYTKMGNC...
AVCSLRRADFVV..
The goal – finding a faithful low dimensional representation of the data
Traditional MDS (multidimensional scaling)
Minimize distortion in pairwise distances
(original distances vs. distances in the host space)
However, it does not necessarily preserve
higher-order structure
Other methods: PCA, IsoMap (Tenenbaum et al.),
LLE (Roweis & Saul)
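The quantity being minimized can be sketched as below. The cost function shown on the slide did not survive in this transcript, so this minimal Python illustration assumes the common unweighted raw stress with Euclidean distances in the host space; the function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def stress(original, embedded):
    """Sum of squared differences between original and host-space distances.

    original: dict mapping index pairs (i, j) to the original distance.
    embedded: list of coordinate tuples in the host space.
    """
    return sum(
        (d - euclidean(embedded[i], embedded[j])) ** 2
        for (i, j), d in original.items()
    )
```

Three mutually equidistant points cannot be placed on a line without distortion, so any one-dimensional embedding of them has nonzero stress; an optimizer would move the coordinates to reduce this value.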
Distributional scaling: geometry-preserving MDS (with Mike Quist)
Classical cost function
The distributional information
[Figure labels: B, C, AB, BC, AC]
Weights defined based on entropy
Distance between distributions is
defined based on the Earth mover’s
distance measure
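For one-dimensional histograms the Earth mover's distance has a simple closed form. A minimal sketch in Python, assuming the distributions being compared are histograms of distances over the same equal-width bins; this setup and the function name are assumptions, not details from the talk.

```python
def earth_movers_distance(p, q, bin_width=1.0):
    """Earth mover's distance between two histograms over the same equal-width bins.

    In one dimension the EMD equals the area between the two cumulative
    distributions, so a single pass over the bins is enough.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    cum_p = cum_q = 0.0
    distance = 0.0
    for a, b in zip(p, q):
        cum_p += a / total_p
        cum_q += b / total_q
        distance += abs(cum_p - cum_q) * bin_width
    return distance
```

Unlike a bin-by-bin comparison, this distance grows with how far the mass has to move, so two distributions that are shifted slightly come out closer than two that are far apart.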
Robustness to clustering errors
Over-classification
misclassification
Global Map of the Protein Space
Acknowledgments
• My students:
Aaron Birkland - Biozon
Niranjan Nagarajan – domain prediction, protein-protein interactions
Umar Syed – the mixture model of stochastic decision trees, function prediction
Mike Quist - embedding
Bill Dirks – analysis of expression data
Liviu Popescu – pathway prediction
Jason Davis - protein-protein interactions
Garmay Leung – structure comparison
Richard Chung – structural profile-profile comparison
Hugh Edwards, Chris Chau, Rob Cronin, Taruna Seth, Bo Fuld, Adi Alon, Arthur Kong, Wilmin Martono,
Keith Jamison, John Tam, Allen Wang, Kuan Chang, William Yeh, Charitha Tillekeratne
Acknowledgments
Collaborations:
• Ran El-Yaniv
• Klara Kedem
• Dave Lin
Funding:
• NSF
• SUN Microsystems