GlycoCT—a unifying sequence format for carbohydrates
GlycoCT—a unifying sequence format
for carbohydrates
S. Herget, R. Ranzinger, K. Maass and C.W. v.d. Lieth
Presented by Yingxin Guo
An overview of the sequence formats
used in glycobioinformatics
Special structural features
Uniqueness—A central requirement for
encoding carbohydrate sequences
Why
Serves as a primary key in databases
Beneficial for the implementation of exact structure search
How
Apply strict sorting rules
Define a controlled vocabulary
Support encoding of uncertain linkages and unspecified
monosaccharides
General idea of GlycoCT
Basic monosaccharide namespace
Basic residue (RES) entities in GlycoCT
Substituents and other entities
Modeling the topology
Residue entities are modeled in RES section.
Linkages are modeled in LIN section.
Atom replacement schema.
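A minimal sketch of how the RES and LIN sections could be held in memory. The example string is an assumption: it follows the condensed GlycoCT notation as commonly published (a GlcNAc disaccharide) and is not taken from these slides. The names `Residue`, `Linkage` and `parse` are hypothetical, and the one-letter linkage types of the atom replacement schema are kept as opaque strings rather than interpreted.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Residue:
    index: int
    kind: str   # one-letter entity type, e.g. 'b' or 's' in the example below
    name: str

@dataclass(frozen=True)
class Linkage:
    index: int
    parent: int
    parent_type: str  # atom replacement letter at the parent side (opaque here)
    parent_pos: str
    child_pos: str
    child: int
    child_type: str   # atom replacement letter at the child side (opaque here)

# Assumed line shape for LIN entries: index:parent+type(parentPos+childPos)child+type
LIN_RE = re.compile(r"^(\d+):(\d+)([a-z])\(([^+]+)\+([^)]+)\)(\d+)([a-z])$")

def parse(text):
    """Split a condensed sequence into residue and linkage records."""
    residues, linkages = [], []
    section = None
    for raw in text.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        if line in ("RES", "LIN"):
            section = line
            continue
        if section == "RES":
            head, name = line.split(":", 1)
            residues.append(Residue(int(head[:-1]), head[-1], name))
        elif section == "LIN":
            match = LIN_RE.match(line)
            if match is None:
                raise ValueError(f"unrecognised LIN line: {line!r}")
            idx, parent, ptype, ppos, cpos, child, ctype = match.groups()
            linkages.append(
                Linkage(int(idx), int(parent), ptype, ppos, cpos, int(child), ctype)
            )
    return residues, linkages

# Illustrative input only (assumed notation, see note above).
EXAMPLE = """
RES
1b:b-dglc-HEX-1:5
2s:n-acetyl
3b:b-dglc-HEX-1:5
4s:n-acetyl
LIN
1:1d(2+1)2n
2:1o(4+1)3d
3:3d(2+1)4n
"""

residues, linkages = parse(EXAMPLE)
for r in residues:
    print(r)
for l in linkages:
    print(l)
```

Keeping residues and linkages in two flat lists mirrors the RES/LIN split: the topology lives entirely in the linkage records, which refer to residues by index.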
Encoding linkage
Encoding repeating units
Encoding alternative units
Encoding underdetermined units
Sorting
Why
One central requirement is to generate a unique representation
for all carbohydrates.
Sorting is used to determine the order of appearance of
elements.
How
A set of hierarchical rules is used in GlycoCT to define the
ordering of residues, linkages and special structural features.
Residue comparison algorithm
Linkage comparison algorithm
Underdetermined subtree comparison algorithm
Alternative subtree comparison algorithm
Residue comparison
Applied when multiple starting points exist.
Rules
Number of child residues.
Length of the longest branch.
Number of terminal residues.
Number of branching points.
Lexical order.
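The five rules above act as successive tie-breakers, which maps naturally onto a tuple sort key. The sketch below is one possible reading, not the reference implementation: the `Node` class and helper names are hypothetical, "number of child residues" is taken to mean direct children, and the direction of each numeric criterion (larger first) is an assumption, since the slides do not state it.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    children: list[Node] = field(default_factory=list)

def longest_branch(node: Node) -> int:
    """Number of residues on the longest path from this node to a leaf."""
    return 1 + max((longest_branch(c) for c in node.children), default=0)

def terminal_count(node: Node) -> int:
    """Number of leaf residues in the subtree rooted at this node."""
    if not node.children:
        return 1
    return sum(terminal_count(c) for c in node.children)

def branching_points(node: Node) -> int:
    """Number of residues in the subtree that have more than one child."""
    own = 1 if len(node.children) > 1 else 0
    return own + sum(branching_points(c) for c in node.children)

def residue_key(node: Node) -> tuple:
    # Rules are applied in order; a later rule only matters on a tie.
    # Negation puts larger counts first (assumed direction).
    return (
        -len(node.children),       # 1. number of child residues (direct, assumed)
        -longest_branch(node),     # 2. length of the longest branch
        -terminal_count(node),     # 3. number of terminal residues
        -branching_points(node),   # 4. number of branching points
        node.name,                 # 5. lexical order
    )

# Two candidate starting points with made-up residue names.
branched = Node("b-dglc-HEX-1:5", [Node("x"), Node("y")])
linear = Node("a-dman-HEX-1:5", [Node("x")])
ordered = sorted([linear, branched], key=residue_key)
print([n.name for n in ordered])
```

Because Python compares tuples element by element, the hierarchy of the rules is encoded simply by their position in the key.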
Linkage comparison
Decides the internal order of the RES and LIN sections
Rules
Number of bonds between parent and
child residues.
Atom linkage position at the parent residue.
Atom linkage position at the child residue.
Linkage type at the parent residue.
Comparison of child residues with residue
comparison algorithm.
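The linkage rules can be sketched the same way, as an ordered tuple of criteria. Everything named here is hypothetical (`Edge`, `linkage_key`, the field names), ascending order is assumed for every criterion, and the child residue comparison is represented by a precomputed `child_key` tuple rather than by a full residue comparison.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    bonds: int         # number of bonds between parent and child
    parent_pos: int    # atom linkage position at the parent residue
    child_pos: int     # atom linkage position at the child residue
    parent_type: str   # linkage type at the parent residue
    child_key: tuple   # stand-in for the result of the residue comparison

def linkage_key(edge: Edge) -> tuple:
    # Criteria in the order listed on the slide; ascending order is assumed.
    return (
        edge.bonds,
        edge.parent_pos,
        edge.child_pos,
        edge.parent_type,
        edge.child_key,
    )

# Made-up linkages leaving one parent residue.
edges = [
    Edge(1, 4, 1, "o", ("b-dglc-HEX-1:5",)),
    Edge(1, 2, 1, "d", ("n-acetyl",)),
    Edge(1, 4, 1, "o", ("a-dman-HEX-1:5",)),
]
ordered = sorted(edges, key=linkage_key)
for e in ordered:
    print(e)
```

The last two edges tie on the first four criteria, so only the child comparison separates them, which is the situation the final rule is there for.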
Underdetermined subtree & Alternative
subtree comparison
The encoding of UND and ALT is handled separately from
the description of the other topological features.
Apply the set of rules from the residue and linkage
comparison algorithm to each UND and ALT to determine
internal order.
The reducing residues of UNDs and ALTs are compared with
the residue comparison.
If two compared UNDs are identical, the parent residues and
linkages (the linkages between the UND and the main graph) are
compared.
First application and results
All the monosaccharides from CarbBank were translated to
the naming defined by GlycoCT.
1439 different names in CarbBank resulted in 474 different
basetypes and 29 different substituents, reducing the number
of distinct residues by 65%.
Two main reasons for the reduction
The separation of monosaccharides into basetype and
substituents
The unique encoding for monosaccharides
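A quick check of the 65% figure, assuming it compares the 1439 CarbBank names with the combined count of basetypes and substituents (this reading is an assumption; the slides do not spell out the calculation).

```python
names_before = 1439   # distinct monosaccharide names in CarbBank
basetypes = 474
substituents = 29

entities_after = basetypes + substituents
reduction = 1 - entities_after / names_before

print(f"{entities_after} distinct entities after translation")
print(f"reduction: {reduction:.1%}")
```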
Conclusion
A superset of capabilities of all known sequence
formats in glycobioinformatics
Support structurally undetermined sequences
The consistent naming scheme for
monosaccharides can be easily maintained.