Data representation: techniques and trade-offs Rob Knight Dept. Chem. & Biochem. CU Boulder Two conflicting goals for representing data • Machine-readable – Dot-bracket (structures) – Array of chars.

Data representation:
techniques and trade-offs
Rob Knight
Dept. Chem. & Biochem.
CU Boulder
Two conflicting goals for
representing data
• Machine-readable
– Dot-bracket (structures)
– Array of chars (alignments)
• Allows standardization
across algorithms, high-throughput analyses, but…
• Difficult to relate to expert
knowledge
WTF?
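To make the machine-readable side concrete, here is a minimal sketch of how a dot-bracket string can be turned into a list of base pairs. The function name `parse_dot_bracket` and the example string are illustrative assumptions, not taken from the talk; only plain `(`, `)` and `.` symbols are handled.

```python
def parse_dot_bracket(structure):
    """Return sorted (i, j) base-pair indices (0-based) from a dot-bracket string."""
    stack = []   # positions of '(' still waiting for a partner
    pairs = []
    for pos, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(pos)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol {symbol!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)


# Hypothetical toy structure: two stacked helices separated by unpaired bases.
hairpin = "((..((...))..))"
pairs = parse_dot_bracket(hairpin)
print(pairs)
```

The pair list is easy for an algorithm to consume, but it says nothing about which pairs an expert would consider a named motif, which is the gap the slide points at.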
Two conflicting goals for
representing data
• Human-readable
– Secondary structure pictures
– “Mutated” alignments based
on hand-crafted rules
• Prohibits standardization
across algorithms, high-throughput analyses, but…
• Allows efficient exploitation
of expert knowledge
WTF?
Can’t we all just get along?
• NO - inherent conflicts in representation goals
• Need to either
(a) find middle ground that is acceptable for both
purposes, or
(b) agree to separate representations that cannot be
interconverted without substantial manual
intervention and/or loss of data
Example 1: motif representation
• Conflicting goals:
– Want human-readable, unique specification of
motif
– Want to be able to recapture arbitrary interactions
among bases in the motif
What are we prepared to give up?
• Example: kink-turn motif and reverse-kink-turn motif
• If we discard out-of-order interactions, can use
something like:
– GCWWGASHGASHAGHS{,GAA}GCWWCGWW
…
Leontis et al. 2006
What are we prepared to give up?
• …but if we include the long-range interactions, we
must number the bases and include arbitrary pairs:
G_{i}C_{j+8}WW G_{i+1}A_{j+7}SH G_{i+2}A_{j+6}SH A_{i+3}G_{j+5}HS {,GAA}
G_{i+6}C_{j+1}WW C_{i+7}G_{j}WW - A_{i+5}G_{j+2}SS G_{i+6}A_{j+6}SS
Not really human-readable or machine-readable!
(problem: representing arbitrary graphs is hard)
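One way to see why the linear form breaks down: a bracket-like string can only encode interactions that nest, whereas long-range interactions may interleave with the local ones and then need an explicit edge list. The sketch below checks a list of position pairs for such crossings. The helper names and the toy coordinates are assumptions for illustration and do not encode the kink-turn itself; bases that take part in more than one interaction are a separate complication that this check does not cover.

```python
from itertools import combinations


def crosses(pair_a, pair_b):
    """True if two interactions interleave, i.e. i < k < j < l for (i, j) and (k, l)."""
    i, j = sorted(pair_a)
    k, l = sorted(pair_b)
    return i < k < j < l or k < i < l < j


def is_nested(interactions):
    """True if no two interactions cross, so a bracket-style string could hold them."""
    return not any(crosses(a, b) for a, b in combinations(interactions, 2))


# Toy coordinates (hypothetical): a three-pair helix, then one long-range contact.
helix = [(0, 9), (1, 8), (2, 7)]
with_long_range = helix + [(3, 12)]

print(is_nested(helix))            # in-order interactions only
print(is_nested(with_long_range))  # the long-range pair interleaves with the helix
```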
So what’s the solution?
• Two proposals:
a) Tiered system of less complex -> more complex linear
representations depending on the type of motif (think of
chemical nomenclature for substituents or cycles)
b) Use common names but require deposition of formal
motif definition with unique accession # in central db
Advantages and disadvantages
a) Tiered system:
– More human-readable
– Difficult to parse (unless readability is sacrificed), liable to incomplete or ambiguous specifications
– Probably won’t be able to do text search anyway because of journal formatting
b) Accession system:
– Easy to parse (store machine-readable connect list, incl. ambiguities) so can automate analyses
– Can generate human-readable diagrams as output
– Can generate specification using graphical tools, so users need not be familiar with the file format
– Requires central repository and enforcement of deposition
• Question: is the community prepared to reify and enforce the current motif nomenclature?
Leontis et al. 2006 RNA
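As a rough illustration of the accession proposal, a deposited motif could be stored as a connect list keyed by a unique accession, with the common name kept only as a label. Everything in this sketch is hypothetical: the class names, the accession format `MOTIF-0001`, the field names, and the toy record are assumptions, not an existing database schema or a real motif definition.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Interaction:
    pos_a: int   # 0-based position of the first base in the motif
    pos_b: int   # 0-based position of the second base
    edges: str   # two-letter edge code as used on the slides, e.g. "WW" or "SH"


@dataclass
class MotifRecord:
    accession: str          # hypothetical identifier format
    common_name: str
    sequence_pattern: str   # may contain ambiguity codes
    interactions: list = field(default_factory=list)


class MotifRegistry:
    """In-memory stand-in for a central repository (illustrative only)."""

    def __init__(self):
        self._records = {}

    def deposit(self, record):
        if record.accession in self._records:
            raise KeyError(f"accession {record.accession} already deposited")
        self._records[record.accession] = record

    def lookup(self, accession):
        return self._records[accession]

    def find_by_name(self, name):
        return [r for r in self._records.values() if r.common_name == name]


registry = MotifRegistry()
registry.deposit(
    MotifRecord(
        accession="MOTIF-0001",        # made-up accession
        common_name="example-motif",   # made-up name, not a real motif
        sequence_pattern="GNNA",
        interactions=[Interaction(0, 3, "SH")],
    )
)
print(registry.lookup("MOTIF-0001").interactions)
```

Because the connect list is explicit, crossing or long-range interactions cost nothing extra to store, and a human-readable diagram can be generated from the record rather than typed by hand.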
Example 2: homology
• Fundamental problem: systems that are homologous at one level are not necessarily homologous at other levels
• E.g. bat wings and bird wings: homologous as pentadactyl limbs, but not homologous as wings
• Homology is hierarchical and can partially overlap at any level (e.g. Griffiths 2006)
[Figure labels: frog, bird, rodent and bat forelimbs; mammal forelimbs; tetrapod forelimbs. Ridley “Evolution” 3rd ed.]
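The hierarchy in the figure labels can be written down as a small parent map, which makes "homologous at which level?" a query rather than a yes/no flag. This is a sketch under a simplifying assumption: a strict tree, so it cannot express the partial overlap between levels mentioned above (that would need a more general graph). The names `PARENT`, `lineage` and `homologous_as` are illustrative.

```python
# Parent map built from the figure labels; None marks the most general level.
PARENT = {
    "tetrapod forelimb": None,
    "frog forelimb": "tetrapod forelimb",
    "bird forelimb": "tetrapod forelimb",
    "mammal forelimb": "tetrapod forelimb",
    "rodent forelimb": "mammal forelimb",
    "bat forelimb": "mammal forelimb",
}


def lineage(character):
    """Levels from the most general down to the character itself."""
    path = []
    while character is not None:
        path.append(character)
        character = PARENT[character]
    return path[::-1]


def homologous_as(char_a, char_b):
    """Levels at which both characters are homologous, most general first."""
    levels_b = set(lineage(char_b))
    return [level for level in lineage(char_a) if level in levels_b]


print(homologous_as("bat forelimb", "bird forelimb"))
print(homologous_as("bat forelimb", "rodent forelimb"))
```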
Is one alignment enough?
• Example: Lorsch & Szostak (1994) evolution of polynucleotide kinase from ATP aptamer (15% mut.)
• Some recovered classes retained sequence similarity to the ATP aptamer, but formed new active sites that ignored this similarity
• Therefore, aligning the functional regions is incompatible with aligning the most similar sequences -- need diff. alignments for ancestry and functionality
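A simple way to hold more than one alignment of the same sequences is to store the ungapped sequences once and key each alignment by its purpose. The sketch below does that and checks that every alignment really is a gapped version of the stored sequences. The class name, the purpose labels and the toy sequences are assumptions for illustration, not data from Lorsch & Szostak.

```python
class AlignmentSet:
    """Several alignments of one fixed set of sequences, keyed by purpose."""

    def __init__(self, sequences):
        self.sequences = dict(sequences)   # seq_id -> ungapped sequence
        self.alignments = {}               # purpose -> {seq_id: gapped row}

    def add(self, purpose, rows):
        if set(rows) != set(self.sequences):
            raise ValueError("alignment must cover exactly the stored sequences")
        if len({len(row) for row in rows.values()}) != 1:
            raise ValueError("all rows must have the same length")
        for seq_id, row in rows.items():
            if row.replace("-", "") != self.sequences[seq_id]:
                raise ValueError(f"row for {seq_id} does not match its sequence")
        self.alignments[purpose] = dict(rows)


# Toy sequences (hypothetical).
aln_set = AlignmentSet({"parent": "GGAUCC", "derived": "GAUCCA"})
aln_set.add("ancestry", {"parent": "GGAUCC-", "derived": "-GAUCCA"})
aln_set.add("function", {"parent": "-GGAUCC", "derived": "GAUCCA-"})
print(sorted(aln_set.alignments))
```

Both alignments are valid for the same pair of sequences; which one to use depends on whether the question is about ancestry or about function.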
Multi-level alignments
• Can align aptamer to Class IV and Class V by sequence alone, but cannot align to the others without structural info (except in small regions)
• General problem (for other sequences, not shown here): coarse-grained structure varies too, so cannot match up the “right” stems and loops w/o sequence information
[Figure labels: S1, B1, S2, S2’, B2, S1’]
Solution: iterative approach?
• Need alignment to produce reasonable tree and structure
• Need structure and tree to produce reasonable alignment
• Need sequence sim. to anchor structure sim. b/c structure changes
• Main drawback to current techniques: assume that all parts of sequence are equally important for aln!
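The last drawback can be illustrated with a per-column weight: instead of counting every aligned position equally, a score can emphasize the columns judged important (for example, a functional region). This is only a sketch of the idea, not a method proposed in the talk; the function name, the weights and the toy rows are assumptions.

```python
def weighted_identity(row_a, row_b, weights):
    """Fraction of column weight at which two aligned rows carry the same base.

    Columns where both rows have a gap are skipped; a gap opposite a base
    counts as a mismatch.
    """
    if not (len(row_a) == len(row_b) == len(weights)):
        raise ValueError("rows and weights must have the same length")
    matched = 0.0
    total = 0.0
    for base_a, base_b, weight in zip(row_a, row_b, weights):
        if base_a == "-" and base_b == "-":
            continue
        total += weight
        if base_a == base_b:
            matched += weight
    return matched / total if total else 0.0


row_a = "GGAUCC"
row_b = "GGAACC"   # differs only at column 3
uniform = weighted_identity(row_a, row_b, [1, 1, 1, 1, 1, 1])
emphasized = weighted_identity(row_a, row_b, [1, 1, 1, 5, 1, 1])
print(uniform, emphasized)
```

With uniform weights the single mismatch barely registers; once that column is weighted as important, the same pair of rows scores much lower.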
Conclusions
• Goal in both cases should be to connect expert knowledge with automated approaches -- too big a gap at present
• Motifs: central database with accessions has many advantages, but will the community support it?
• Alignments: probably need to move away from the “annotate one alignment” model towards a “many alignments for the same set of sequences depending on task” model -- needs hierarchical view of homology, new techniques for connecting levels (not Clustal!)