GlycoCT—a unifying sequence format for carbohydrates

S. Herget, R. Ranzinger, K. Maass and C.-W. von der Lieth
Presented by Yingxin Guo
An overview of the sequence formats used in glycobioinformatics
Special structural features
Uniqueness: a central requirement for encoding carbohydrate sequences
• Why
  • Serves as the primary key in a database
  • Simplifies the implementation of exact structure search
• How
  • Apply strict sorting rules
  • Define a controlled vocabulary
  • Support encoding of uncertain linkages and unspecified monosaccharides
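The link between sorting and a usable primary key can be illustrated with a minimal sketch. It is not GlycoCT's rule set: the `Node` class, the residue names and the plain lexical sort are all assumptions made for illustration. The point is only that a fixed ordering makes two differently entered copies of one structure serialize to the same string.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A residue with an illustrative name and its child linkages."""
    name: str
    # Each child is a pair: (linkage position on this residue, child node).
    children: list = field(default_factory=list)


def canonical(node: Node) -> str:
    """Serialize a tree with its children in a fixed, sorted order."""
    parts = sorted(f"({pos}){canonical(child)}" for pos, child in node.children)
    return node.name + "[" + ",".join(parts) + "]"


# The same branched structure, entered with its branches in different orders.
a = Node("Man", [(3, Node("Man")), (6, Node("Gal"))])
b = Node("Man", [(6, Node("Gal")), (3, Node("Man"))])

# The canonical string acts as the primary key, so an exact structure
# search becomes a plain dictionary lookup.
database = {canonical(a): "record-1"}
print(canonical(a))
print(database.get(canonical(b)))
```

Without the `sorted` call, `a` and `b` would produce different strings and the lookup for `b` would miss the stored record.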
General idea of GlycoCT
Basic monosaccharide namespace
Basic residue (RES) entities in GlycoCT
• Substituents and other entities
Modeling the topology
• Residue entities are modeled in the RES section.
• Linkages are modeled in the LIN section.
• Linkages are described with an atom replacement schema.
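A small parser makes the two-section layout concrete. The sample below follows the condensed GlycoCT layout as far as I understand it: `b` and `s` mark basetypes and substituents, and the letters around each linkage (`o`, `d`, `n`) record what happens to the atom at the linkage position. The sample string, the regular expressions and the `parse` function are illustrative assumptions and should be checked against the format specification.

```python
import re

# Illustrative sample: a hexose carrying an N-acetyl substituent at
# position 2 and a second hexose linked 1-4.
SAMPLE = """RES
1b:b-dglc-HEX-1:5
2s:n-acetyl
3b:b-dgal-HEX-1:5
LIN
1:1d(2+1)2n
2:1o(4+1)3d
"""

RES_RE = re.compile(r"^(\d+)([a-z]):(.+)$")
LIN_RE = re.compile(r"^(\d+):(\d+)([a-z])\(([^+]+)\+([^)]+)\)(\d+)([a-z])$")


def parse(text):
    """Split a two-section listing into residue and linkage records."""
    residues, linkages, section = {}, [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line in ("RES", "LIN"):
            section = line
        elif section == "RES":
            rid, kind, name = RES_RE.match(line).groups()
            residues[int(rid)] = (kind, name)
        elif section == "LIN":
            _, parent, ptype, ppos, cpos, child, ctype = LIN_RE.match(line).groups()
            linkages.append((int(parent), ptype, ppos, cpos, int(child), ctype))
    return residues, linkages


residues, linkages = parse(SAMPLE)
print(residues)
print(linkages)
```

Keeping residues and linkages in separate sections means the topology is carried entirely by the residue numbers referenced in the LIN records.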
Encoding linkage
Encoding repeating units
Encoding alternative units
Encoding underdetermined units
Sorting
• Why
  • A central requirement is a unique representation for every carbohydrate.
  • Sorting determines the order in which elements appear.
• How
  • A set of hierarchical rules defines the ordering of residues, linkages and special structural features:
    • Residue comparison algorithm
    • Linkage comparison algorithm
    • Underdetermined subtree comparison algorithm
    • Alternative subtree comparison algorithm
Residue comparison
• Applied when multiple starting points exist.
• Rules (applied in order)
  • Number of child residues.
  • Length of the longest branch.
  • Number of terminal residues.
  • Number of branching points.
  • Lexical order.
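The rule list above maps naturally onto a tuple sort key, where each rule is consulted only if all earlier ones tie. The sketch below rests on several assumptions the slides do not settle: "child residues" is read as all residues below the compared one, larger values are placed first, and the `Residue` class and its names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Residue:
    name: str
    children: list = field(default_factory=list)


def descendants(r):
    """Count every residue below r (assumed reading of 'child residues')."""
    return sum(1 + descendants(c) for c in r.children)


def longest_branch(r):
    """Number of residues on the longest path from r to a terminal residue."""
    return 1 + max((longest_branch(c) for c in r.children), default=0)


def terminals(r):
    """Count residues without children in the subtree rooted at r."""
    if not r.children:
        return 1
    return sum(terminals(c) for c in r.children)


def branch_points(r):
    """Count residues in the subtree that have more than one child."""
    own = 1 if len(r.children) > 1 else 0
    return own + sum(branch_points(c) for c in r.children)


def residue_key(r):
    # Counts are negated so that larger values sort first (an assumed
    # direction); the name is the final, lexical tie-breaker.
    return (-descendants(r), -longest_branch(r), -terminals(r),
            -branch_points(r), r.name)


linear = Residue("A", [Residue("B", [Residue("C")])])
branched = Residue("A", [Residue("B"), Residue("C")])
ordered = sorted([branched, linear], key=residue_key)
print([residue_key(r) for r in ordered])
```

Both candidates have two residues below the root, so the first rule ties and the longest-branch rule decides in favour of the linear chain.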
Linkage comparison
• Decides the internal order of the RES and LIN sections.
• Rules (applied in order)
  • Number of bonds between parent and child residues.
  • Atom linkage position at the parent residue.
  • Atom linkage position at the child residue.
  • Linkage type at the parent residue.
  • Comparison of the child residues with the residue comparison algorithm.
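The linkage rules can be sketched the same way, as a key whose fields follow the order of the list above. The `Linkage` class is hypothetical, ascending order is an assumption, and the child residue's name stands in for the full residue comparison to keep the example short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Linkage:
    bonds: int        # number of bonds between parent and child
    parent_pos: int   # linkage position at the parent residue
    child_pos: int    # linkage position at the child residue
    parent_type: str  # linkage type at the parent residue
    child_name: str   # stand-in for the full residue comparison


def linkage_key(link):
    # Tuple comparison checks the fields left to right, so a later rule
    # is used only when all earlier rules tie.
    return (link.bonds, link.parent_pos, link.child_pos,
            link.parent_type, link.child_name)


links = [
    Linkage(1, 6, 1, "o", "gal"),
    Linkage(1, 3, 1, "o", "man"),
    Linkage(1, 3, 1, "o", "glc"),
]
ordered = sorted(links, key=linkage_key)
print([(link.parent_pos, link.child_name) for link in ordered])
```

The two linkages at position 3 tie on the first four rules, so only the child comparison separates them, and both precede the linkage at position 6.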
Underdetermined subtree & alternative subtree comparison
• The encoding of underdetermined (UND) and alternative (ALT) units is handled separately from the description of the other topological features.
• The rules of the residue and linkage comparison algorithms are applied within each UND and ALT to determine its internal order.
• The reducing residues of UNDs and ALTs are compared with the residue comparison algorithm.
• If two compared UNDs are identical, their parent residues and linkages (the linkages between the UND and the main graph) are compared.
First application and results
• All monosaccharides from CarbBank were translated to the naming defined by GlycoCT.
• The 1439 different names in CarbBank resulted in 474 different basetypes and 29 different substituents, reducing the number of distinct residues by 65%.
• Two main reasons for the reduction
  • The separation of monosaccharides into basetype and substituents
  • The unique encoding of monosaccharides
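The 65% figure follows if basetypes and substituents are counted together as the distinct GlycoCT entities. That reading is an assumption, but it is the one that reproduces the reported number:

```python
carbbank_names = 1439  # distinct monosaccharide names in CarbBank
basetypes = 474        # distinct GlycoCT basetypes after translation
substituents = 29      # distinct GlycoCT substituents after translation

# Assumed reading: both kinds of entity count towards the new total.
glycoct_entities = basetypes + substituents
reduction = 1 - glycoct_entities / carbbank_names

print(f"{glycoct_entities} entities, {reduction:.0%} fewer than CarbBank names")
```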
Conclusion
• GlycoCT offers a superset of the capabilities of all known sequence formats in glycobioinformatics.
• It supports structurally undetermined sequences.
• Its consistent naming scheme for monosaccharides can be easily maintained.