Transformational Grammars and PROSITE Patterns

Transformational
Grammars
and PROSITE Patterns
Roland Miezianko
CIS 595 - Bioinformatics
Prof. Vucetic
Agenda
• Transformational Grammars
– Definition
– The Chomsky Hierarchy
• Finite State Automata
– FMR-1 Triplet Repeat Region
– Regular Grammar Example
• PROSITE
– Patterns in Regular Grammar Form
Assumptions
• So far, biological sequences have
been treated as one-dimensional
strings of independent and
uncorrelated symbols.
• Need to address interaction
among base pairs to understand
secondary structures.
Secondary Structures
• The 3-D folding of proteins and
nucleic acids involves extensive
physical interactions between
residues that are not adjacent in
primary sequence. [1]
• We require a model of secondary
structure that reflects the
interactions among base pairs.
Modeling Strings
• General theories for modeling
strings of symbols have been
developed by computational
linguists
– Chomsky in 1956, 1959
– Interested in how a brain or
computer program could
algorithmically determine whether
a sentence was grammatical or not
Transformational
Grammars
• Transformational Grammars
consist of:
– Symbols
• Abstract Nonterminal Symbols
• Terminal Symbols
– Rewriting Rules (Productions)
• A --> B
Transformational
Grammars, Example
Example Grammar
Two-letter terminal alphabet: {a, b}
Single nonterminal letter: S
Three Productions:
S->aS
S->bS
S->e
(e=special blank terminal symbol)
Example derivation of our simple grammar:
S->aS->abS->abbS->abb
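The derivation above can be replayed mechanically. The following is a minimal sketch, assuming the three productions are stored in a dictionary and the empty string stands for e; the names `PRODUCTIONS`, `derive`, and `in_language` are illustrative and not part of the original slides.

```python
# Productions of the example grammar: S -> aS | bS | e,
# with the empty string "" standing in for the blank terminal e.
PRODUCTIONS = {"S": ["aS", "bS", ""]}


def derive(choices, start="S"):
    """Rewrite the single nonterminal S once per entry in `choices`.

    `choices` holds indices into PRODUCTIONS["S"]; the return value is
    the list of intermediate strings, starting with the start symbol.
    """
    forms = [start]
    current = start
    for index in choices:
        if "S" not in current:
            raise ValueError("no nonterminal left to rewrite")
        current = current.replace("S", PRODUCTIONS["S"][index], 1)
        forms.append(current)
    return forms


def in_language(text):
    """The grammar generates every string over {a, b}, including the empty one."""
    return all(symbol in "ab" for symbol in text)


# Choose S->aS, then S->bS twice, then S->e.
print("->".join(derive([0, 1, 1, 2])))  # S->aS->abS->abbS->abb
```

Because every production either appends one terminal or ends the derivation, membership in this particular language reduces to checking that the string uses only a and b.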
Chomsky Hierarchy
• Four types of restrictions on a
grammar’s productions result
in four classes of grammars:
– Regular Grammars
– Context-Free Grammars
– Context-Sensitive Grammars
– Unrestricted Grammars
Chomsky Hierarchy
[Figure: nested classes of grammars, regular ⊂ context-free ⊂ context-sensitive ⊂ unrestricted]
Automata
• Each grammar has a corresponding
abstract computational device
called an automaton

Grammar              Parsing Automaton
Regular              Finite State
Context-Free         Push-Down
Context-Sensitive    Linear Bounded
Unrestricted         Turing Machine
FMR-1 Triplet
Repeat Region
• The FMR-1 gene sequence contains
a CGG triplet that is repeated a
number of times
• The number of triplets is highly
variable between individuals
• Increased copy number is
associated with a genetic
disease
FMR-1 Triplet
Repeat Region
• A finite state automaton (FSA)
will match any string from the
“language” that contains strings
such as:
GCG CTG
GCG CGG CTG
GCG CGG CGG CTG
GCG CGG CGG CGG CGG … CTG
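A finite state automaton for the strings listed above can be sketched as a transition table. This is a minimal illustration, assuming the language is exactly GCG, followed by any number of CGG copies, followed by CTG; the state numbering and the names `TRANSITIONS`, `ACCEPT`, and `accepts` are assumptions made for the example.

```python
# Deterministic FSA for GCG (CGG)* CTG, reading one base at a time.
# Keys are (state, base); a missing key means the automaton rejects.
TRANSITIONS = {
    (0, "G"): 1, (1, "C"): 2, (2, "G"): 3,  # leading GCG
    (3, "C"): 4,                            # first base of CGG or CTG
    (4, "G"): 5, (5, "G"): 3,               # one CGG copy, back to the loop state
    (4, "T"): 6, (6, "G"): 7,               # closing CTG
}
ACCEPT = {7}


def accepts(sequence):
    """Return True if the automaton ends in an accepting state."""
    state = 0
    for base in sequence.replace(" ", "").upper():
        state = TRANSITIONS.get((state, base))
        if state is None:
            return False
    return state in ACCEPT


print(accepts("GCG CTG"))          # True
print(accepts("GCG CGG CGG CTG"))  # True
print(accepts("GCG CGG"))          # False
```

State 3 is the loop state: each CGG copy returns to it, so the automaton needs no counter and still accepts any number of repeats.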
FMR-1 Triplet
Repeat Region
• The regular grammar for our finite
state automaton matches any number
of copies of CGG
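One possible regular (right-linear) grammar for this language is S->gW1, W1->cW2, W2->gW3, W3->cW4, W4->gW5 | tW6, W5->gW3, W6->g. The sketch below encodes those productions and uses them both to generate and to recognize strings. The nonterminal names W1 to W6 and the helper names are assumptions chosen for illustration, and the language is assumed to be GCG, any number of CGG copies, then CTG.

```python
# Each production emits one terminal and moves to the next nonterminal
# (None marks the end of the derivation).
GRAMMAR = {
    "S":  [("G", "W1")],
    "W1": [("C", "W2")],
    "W2": [("G", "W3")],
    "W3": [("C", "W4")],
    "W4": [("G", "W5"), ("T", "W6")],  # choice point: another CGG, or CTG
    "W5": [("G", "W3")],
    "W6": [("G", None)],
}


def derive(n_repeats):
    """Generate the string containing `n_repeats` copies of CGG."""
    out, symbol, remaining = [], "S", n_repeats
    while symbol is not None:
        choice = 0
        if symbol == "W4":
            if remaining > 0:
                remaining -= 1   # take W4->gW5 for one more CGG
            else:
                choice = 1       # take W4->tW6 to finish with CTG
        terminal, symbol = GRAMMAR[symbol][choice]
        out.append(terminal)
    return "".join(out)


def generates(sequence, symbol="S"):
    """Return True if `symbol` can derive exactly `sequence`."""
    if not sequence:
        return False  # every production emits at least one terminal
    for terminal, next_symbol in GRAMMAR[symbol]:
        if sequence[0] != terminal:
            continue
        if next_symbol is None:
            if len(sequence) == 1:
                return True
        elif generates(sequence[1:], next_symbol):
            return True
    return False


print(derive(2))             # GCGCGGCGGCTG
print(generates(derive(5)))  # True
```

Every production has the form "nonterminal -> terminal nonterminal" or "nonterminal -> terminal", which is the restriction that makes the grammar regular.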
PROSITE Patterns
• PROSITE database is an
example of a biological
application of regular grammars
– Unlike methods which assign
scores to alignments, PROSITE
patterns either match a sequence
or do not.
PROSITE Patterns
• A pattern consists of a string of
pattern elements separated by dashes
and terminated by a period
– Single letter: that residue
– [ ]: any one of the enclosed letters
– { }: any letter except the enclosed letters
– x: any residue
– x(y): any y residues in a row
PROSITE Patterns
RNP-1 Motif
[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM].
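Because PROSITE patterns are regular, each pattern element maps directly onto a regular-expression fragment. The sketch below is a minimal translator that handles only the elements listed on the previous slide (single letters, [ ], { }, x, and repeat counts in parentheses); the function names and the test sequences are hypothetical, chosen for illustration.

```python
import re

RNP1 = "[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]."


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split an element such as x(3) into its core and optional count.
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = match.group(1), match.group(2)
        if core.lower() == "x":
            fragment = "."                       # any residue
        elif core.startswith("{"):
            fragment = "[^" + core[1:-1] + "]"   # anything but these letters
        else:
            fragment = core                      # single letter or [ ] class
        if count:
            fragment += "{" + count + "}"        # x(3) becomes .{3}
        parts.append(fragment)
    return "".join(parts)


def matches(pattern, sequence):
    """A pattern either matches somewhere in the sequence or it does not."""
    return re.search(prosite_to_regex(pattern), sequence.upper()) is not None


print(prosite_to_regex(RNP1))         # [RK]G[^EDRKHPCG][AGSCI][FY][LIVA].[FYM]
print(matches(RNP1, "MSRGFGFVTFAA"))  # True
print(matches(RNP1, "KGEGFVTF"))      # False
```

The result is a yes-or-no answer rather than a score, which is the behavior described for PROSITE patterns above.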
Conclusion
• Transformational grammars are
useful in developing acceptors
of different length sequences
and for matching specific multisequence regions.
• Higher order grammars in the
Chomsky hierarchy are more
difficult to program and apply
References
[1] Durbin, R., et al. Biological Sequence Analysis: Probabilistic Models of
Proteins and Nucleic Acids. Cambridge University Press, 1998.
[2] Gibson, G. A Primer of Genome Science. Sinauer Associates, Inc.
Publishers, 2002.
[3] Mount, D. Bioinformatics: Sequence and Genome Analysis. Cold
Spring Harbor Laboratory Press, 2001.
[4] PROSITE Database http://us.expasy.org/prosite/
Questions and Answers