Text-mining & ontologies Andrey Rzhetsky A (very) short introduction into text-mining GeneWays as an infogrinder On-line Journals GeneWays Pathways.


Text-mining & ontologies
Andrey Rzhetsky
A (very) short introduction to
text-mining
GeneWays as an infogrinder
On-line Journals
GeneWays
Pathways
Graph: multi-type arcs
and nodes
Typical arcs
1001,'bind'
1004,'suppress'
1011,'replace'
1018,'interact'
1020,'activate'
1022,'stimulate'
1023,'phosphorylate'
1027,'increase'
1028,'associate'
1034,'up-regulate'
1036,'inhibit'
1040,'promote'
1041,'down-regulate'
1043,'trigger'
1049,'block'
1054,'modify'
1057,'digest'
1058,'degrade'
1062,'link'
1071,'cleave'
1072,'release'
1074,'catalyze'
1083,'inactivate'
1106,'repress'
1110,'acetylate'
1117,'methylate'
Typical nodes
17767,'calcium channel antagonists'
20324,'hsp70 chaperone'
17467,'activator protein 1'
5104,'daunorubicin'
13194,'tyrosyl-phosphorylated'
9689,'paroxonase'
4190,'immunodeficiency'
4478,'iga2'
8552,'human fcgammarii'
4472,'iga1'
13151,'ikaros'
9820,'caveolin 1'
7277,'virus-triggered p-dcs'
4366,'complexes pr-3'
12290,'anti-alpha4 mabs'
2258,'gal4-mef2d'
14464,'polyneuropathy'
(the number in each entry is its database ID)
16044,'alk5'
10393,'mek-1 inhibitor'
13262,'pro-matrilysin'
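A graph with multi-type arcs and nodes can be sketched as a small multigraph keyed by the database IDs listed above. This is only an illustration: the IDs and labels are copied from the slides, but the `TypedGraph` class and the three arcs added at the end are hypothetical placeholders, not statements extracted by GeneWays.

```python
from collections import defaultdict

# IDs and labels copied from the "Typical arcs" and "Typical nodes" slides.
ARC_TYPES = {1001: "bind", 1020: "activate", 1036: "inhibit"}
NODES = {9820: "caveolin 1", 16044: "alk5", 20324: "hsp70 chaperone"}


class TypedGraph:
    """Directed multigraph: several arcs of different types may join one pair."""

    def __init__(self):
        # source node ID -> list of (arc type ID, target node ID)
        self._arcs = defaultdict(list)

    def add_arc(self, source_id, arc_type_id, target_id):
        for node_id in (source_id, target_id):
            if node_id not in NODES:
                raise KeyError(f"unknown node ID {node_id}")
        if arc_type_id not in ARC_TYPES:
            raise KeyError(f"unknown arc type ID {arc_type_id}")
        self._arcs[source_id].append((arc_type_id, target_id))

    def arcs_from(self, source_id):
        """Return (arc label, target label) pairs in insertion order."""
        return [(ARC_TYPES[t], NODES[n]) for t, n in self._arcs.get(source_id, [])]


graph = TypedGraph()
# Placeholder arcs for illustration only; they are NOT claims from the talk.
graph.add_arc(9820, 1036, 16044)
graph.add_arc(9820, 1001, 20324)
graph.add_arc(9820, 1020, 16044)  # second arc type between the same pair
```

Storing the arc type on each arc, rather than one arc per node pair, is what lets contradictory statements such as "inhibit" and "activate" sit side by side until they are checked for consistency.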
Checking the internal
consistency of statements
(PLoS ONE, 2006)
AR
Tian Zheng
Statements
George leaves the office when Bill arrives
(B -| G)
Statements
George is in the oval office
(G=1)
Statements
Bill is in the oval office
(B=1)
Together
George leaves the office when Bill arrives
(B -| G)
George is in the oval office
(G=1)
Bill is in the oval office
(B=1)
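The clash among the three statements above can be checked by brute force. The sketch below assumes that (B -| G) is read as a hard rule, "if Bill is in the office, George is not"; that reading, and the names `CONSTRAINTS` and `satisfying_states`, are illustrative assumptions. The talk itself treats statements as noisy rather than as hard constraints.

```python
from itertools import product

# Each statement is a predicate over (b, g), where 1 means "is in the office".
CONSTRAINTS = {
    "B -| G": lambda b, g: not (b == 1 and g == 1),
    "G=1": lambda b, g: g == 1,
    "B=1": lambda b, g: b == 1,
}


def satisfying_states(statement_names):
    """List every (b, g) assignment that satisfies all the named statements."""
    return [
        (b, g)
        for b, g in product((0, 1), repeat=2)
        if all(CONSTRAINTS[name](b, g) for name in statement_names)
    ]


any_two = satisfying_states(["B -| G", "G=1"])
all_three = satisfying_states(["B -| G", "G=1", "B=1"])
```

Under this reading, any two of the statements can hold together, but `all_three` is empty: no state of the world satisfies the full set.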
Problem
Get a consistent tissue/cell/organism-specific
model of molecular interactions
Given
Noisy statements about molecular states (nodes)
Noisy statements about molecular interactions
(arcs)
Inconsistent example
Consistent example
Implementation:
Simple generalization of Bayesian networks
P(nodes|arcs) -- Bayesian network
P(arcs|nodes) -- additional level of modeling
Gibbs sampling
Arc update
Node update
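The alternation of node updates and arc updates can be sketched on the George/Bill toy example. This is a minimal sketch, not the model from the PLoS ONE paper: the variables are two nodes (B, G) and one arc (B -| G), and the numbers `EPS` and `RELIABILITY`, the uniform priors, and all function names are assumptions chosen for illustration.

```python
import itertools
import random

# All numbers are illustrative assumptions, not values from the talk.
EPS = 0.05  # P(G=1 | B=1 and the arc B -| G is present)
RELIABILITY = {"B": 0.8, "G": 0.8, "arc": 0.8}  # chance a text statement is right
OBSERVED = {"B": 1, "G": 1, "arc": 1}  # what the (noisy) statements claim


def joint_weight(state):
    """Unnormalized joint probability; uniform priors on B and arc are omitted."""
    p_g1 = EPS if (state["arc"] and state["B"]) else 0.5
    weight = p_g1 if state["G"] else 1.0 - p_g1
    for name, observed in OBSERVED.items():
        r = RELIABILITY[name]
        weight *= r if state[name] == observed else 1.0 - r
    return weight


def resample(state, name, rng):
    """Draw one variable from its full conditional given all the others."""
    weights = [joint_weight(dict(state, **{name: value})) for value in (0, 1)]
    state[name] = 1 if rng.random() < weights[1] / (weights[0] + weights[1]) else 0


def gibbs_marginals(n_sweeps=20000, burn_in=1000, seed=0):
    rng = random.Random(seed)
    state = dict(OBSERVED)
    counts = dict.fromkeys(state, 0)
    for sweep in range(n_sweeps):
        resample(state, "B", rng)    # node update
        resample(state, "G", rng)    # node update
        resample(state, "arc", rng)  # arc update
        if sweep >= burn_in:
            for name in state:
                counts[name] += state[name]
    kept = n_sweeps - burn_in
    return {name: count / kept for name, count in counts.items()}


def exact_marginals():
    """Enumerate all 8 states; feasible only for a toy model, used as a check."""
    totals = dict.fromkeys(OBSERVED, 0.0)
    norm = 0.0
    for b, g, arc in itertools.product((0, 1), repeat=3):
        state = {"B": b, "G": g, "arc": arc}
        weight = joint_weight(state)
        norm += weight
        for name in state:
            totals[name] += weight * state[name]
    return {name: total / norm for name, total in totals.items()}
```

Because the three statements cannot all hold at once, the posterior marginals fall below the reliability assigned to each individual statement. For a model this small the sampled estimates can be compared directly with `exact_marginals()`.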
Before
After
Difference
Entropy
change
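One way to quantify an entropy change is to compare the Shannon entropy of each binary variable's marginal probability before and after inference. The sketch below assumes that per-variable reading; the original figure is not recoverable, and the function names and example probabilities are hypothetical.

```python
import math


def binary_entropy(p):
    """Shannon entropy, in bits, of a binary variable with P(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def entropy_change(before, after):
    """Per-variable entropy difference (after minus before), in bits."""
    return {name: binary_entropy(after[name]) - binary_entropy(before[name])
            for name in before}


# Hypothetical marginals: one statement becomes certain, one becomes a coin flip.
before = {"x": 0.5, "y": 1.0}
after = {"x": 1.0, "y": 0.5}
change = entropy_change(before, after)
```

A negative value means inference made the variable more certain; a positive value means the evidence pulled it away from its earlier, more confident value.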
And Now For Something
Completely Different…
(with apologies to Monty
Python)
How many events do we
observe?
• (With apologies to Steven Pinker): 9/11
-- one event or two?
It was one coordinated attack, but …
1 --> $3.5 × 10^6
2 --> $7 × 10^6
in insurance paid to Larry
Silverstein
My point: seemingly
incompatible descriptions of
the same phenomenon coexist in publications
Phosphorylation -- how many
events and players? Enzyme
and substrate? Plus ATP?
ADP? Intermediate complex?
A
phosphorylates
B
Timeline: let’s think about
implications for concepts &
ontologies
Big Dipper: now and
100,000 years from now
Things and their
perception change…
Time-stamped texts are like
fossil layers
Fossilized concepts…
Time and meaning
[Figure sequence; surviving labels: "Semantics", "YOU ARE HERE", "Text-miners are here (green)", "SUBFIELD A", "SUBFIELD B"]
Berlin-Kay
Point: with text-mining we can
go 100, 200, … years back.
Even within a 20-year window
we cannot neglect changes in
conceptual semantics
At a minimum:
We need a representation of both A and not-A
Both can be true (with different
probabilities)

Fuzzy sets representing concept
inheritance
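Fuzzy concept inheritance can be sketched with graded "is-a" links, where a concept belongs to an ancestor to the degree of its strongest chain of links (max-min composition, a standard choice for fuzzy relations). The concept names, the degrees, and the `membership` function are hypothetical; the talk does not specify a particular scheme.

```python
def membership(child, ancestor, is_a, _seen=frozenset()):
    """Degree to which `child` is a kind of `ancestor`.

    `is_a` maps (child, parent) pairs to a degree in [0, 1]. A chain is as
    strong as its weakest link (min); the best chain wins (max).
    """
    if child == ancestor:
        return 1.0
    best = 0.0
    for (c, parent), degree in is_a.items():
        if c == child and parent not in _seen:  # skip cycles
            via_parent = membership(parent, ancestor, is_a, _seen | {child})
            best = max(best, min(degree, via_parent))
    return best


# Placeholder hierarchy with made-up degrees.
IS_A = {
    ("concept_a", "concept_b"): 0.9,
    ("concept_b", "concept_c"): 0.6,
    ("concept_a", "concept_c"): 0.3,
}
degree_a_in_c = membership("concept_a", "concept_c", IS_A)
```

Here the indirect chain through `concept_b` (strength 0.6) beats the weak direct link (0.3), so a concept can inherit partially from an ancestor without a crisp yes-or-no decision.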
Even better…
We could introduce binding between verbs and
relations that changes with time
Binding between phrases and their semantics
(mapping to the real or abstract world) can
also change with time
Allow co-existence of concepts associated with
incompatible theories (ether, chi, etc.)
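The proposals above can be sketched as a table of time-stamped bindings from phrases to concepts, in which competing senses may overlap in time with different weights. Everything in this sketch is hypothetical: the `Binding` record, the `meanings` function, the phrase, the senses, the years, and the weights are placeholders, not data from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binding:
    phrase: str
    concept: str
    start_year: int
    end_year: int  # inclusive
    weight: float  # relative prevalence of this sense while active


def meanings(phrase, year, bindings):
    """Return the normalized distribution over concepts for a phrase in a year."""
    active = {}
    for b in bindings:
        if b.phrase == phrase and b.start_year <= year <= b.end_year:
            active[b.concept] = active.get(b.concept, 0.0) + b.weight
    total = sum(active.values())
    return {concept: w / total for concept, w in active.items()} if total else {}


# Placeholder data: two incompatible senses that coexist from 1940 to 1960.
BINDINGS = [
    Binding("term_x", "old_sense", 1850, 1960, 1.0),
    Binding("term_x", "new_sense", 1940, 2000, 3.0),
]
early = meanings("term_x", 1900, BINDINGS)
overlap = meanings("term_x", 1950, BINDINGS)
```

Looking up a phrase by publication year keeps an old text bound to the sense it had when it was written, while the overlap period gives both senses a nonzero probability instead of forcing a single winner.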
Financial support comes from