BioKnOT - Indiana University

BioKnOT
Biological Knowledge through Ontology and TFIDF
By: James Costello
Advisor: Mehmet Dalkilic
Outline
Motivation and Goals
Background
Program Architecture
Populating the Article Database
Developing an Article Scoring Model
BioKnOT demonstration
Summary and Future Work
June 11, 2004
Bioinformatics Capstone Project
Costello
Motivation and Goals
Motivation
Current online text-search methods are not good enough for highly specific research.
Importance
Timeliness
Relevance
Goal of Project
Create an online text retrieval system that allows users to construct their own set of highly specific, timely, and important research articles, custom fit to each user’s needs.
Standard Search Model
$D$ = set of documents
$D'$ = set of documents that meet some search criteria
$D' \subseteq D$
$D' = \{d_1, d_2, \ldots, d_k\}$
where $d_i$ is an individual document and we hope $d_i$ is more interesting than $d_{i+1}$
$|D'|$ = huge number of documents
$|D'|$ for a filtered search on PubMed for “apoptosis” is 65,832 articles
BioKnOT Search Model
$D$ = set of documents
$D'$ = set of documents that meet the initial search criteria
$D' \subseteq D$
$D'_t$ = set of documents that pass the filter
$D'_t \subseteq D'$
$D'_{tu}$ = set of documents that have been ranked based on semantic content from user input
$D'_{tu} \subseteq D'_t$
$D'_{tu} = \{d_1, d_2, \ldots, d_k\}$
$|D'_{tu}|$ = very small and very specific
where $d_i$ is an individual document and $d_i$ is more interesting than $d_{i+1}$
Program Architecture
[Figure: program architecture, showing five pages.
Initial Search Page: Boolean search (example query: apoptosis).
Filter Page: list of terms, Filter Your Search.
User Input Page: user’s sentences, Submit Description.
Results Page: numbered list (1. Article Title …, 2. …) with hyperlinks to the actual online article, View Word Graph (illustration of word relationships in the article), and See All Data (all stored data on the article: title, author(s), …).
Word Weighting Page: terms rated from Bad to Good, Add Word Weights, Refine Your Search.]
Populating the Article Database
Data we need
Author(s)
Article title
Abstract
Journal title
Date and year of publication
Count of how many times the article was cited
URL of online full-text article or PubMed search results
Some type of accession number
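The fields listed above map onto a single record type. The following is a minimal sketch, not the project's actual schema: the class name `ArticleRecord`, the method `searchable_text`, and every field name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ArticleRecord:
    """One entry of the article database (hypothetical field names)."""
    accession: str              # some type of accession number
    authors: List[str]
    title: str
    abstract: str
    journal: str
    pub_date: str               # date and year of publication, kept as text
    times_cited: int = 0        # how many times the article was cited
    url: Optional[str] = None   # full-text URL or PubMed search results URL

    def searchable_text(self) -> str:
        """Title and abstract joined and lowercased."""
        return f"{self.title} {self.abstract}".lower()
```

Keeping `url` optional reflects that an article may be stored before any URL has been found for it.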
Resources Used in Populating the Database
Institute for Scientific Information (ISI) Web of Science
http://bert.lib.indiana.edu:2182/portal.cgi
EndNote 7
PubMed
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
Steps Taken to Populate the Article Database
[Figure: workflow for populating the article database. Article information is exported from ISI’s Web of Science through the EndNote 7 interface, then exported as XML and parsed. A Web bot searches for URL information using the article title and author(s) through the PubMed search interface, which supplies the article abstract. After the PubMed abstract is found, the Web bot searches for the online article URL. Either the PubMed URL or the online article URL is inserted into the article database (> 2,000 articles).]
Initial Search
Boolean search
Searches all articles in the database that have a URL
Searches each article’s title and abstract
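A Boolean search of this kind can be sketched as below. This is an illustration under stated assumptions, not BioKnOT's code: the record layout (dicts with `title`, `abstract`, and `url` keys), the function names, and the `all_of` / `any_of` / `none_of` parameters (standing in for AND, OR, and NOT) are all hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def boolean_search(articles, all_of=(), any_of=(), none_of=()):
    """Return the articles that have a URL and whose title + abstract match.

    articles: iterable of dicts with 'title', 'abstract', and 'url' keys
    (a hypothetical layout chosen for this sketch).
    """
    hits = []
    for art in articles:
        if not art.get("url"):
            continue  # only articles with a URL are searched
        words = tokenize(art["title"] + " " + art["abstract"])
        if not all(t.lower() in words for t in all_of):
            continue  # AND: every required term must be present
        if any_of and not any(t.lower() in words for t in any_of):
            continue  # OR: at least one optional term must be present
        if any(t.lower() in words for t in none_of):
            continue  # NOT: no excluded term may be present
        hits.append(art)
    return hits
```

Matching is on whole tokens, so a query for "cell" does not match "cells"; stemming is left out of the sketch.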
Filter Page
TFIDF
LUCAS
Web Service
http://lair.indiana.edu/research/lucas/index.html
TFIDF Calculations
TF = number of occurrences of a term in a document
IDF = log of the total number of documents over the number of documents that contain the desired term
$$tf_{i,d} = \frac{|d_i|}{\left|\sum_{i}^{k} d_i\right|}$$
$$idf_{i,D} = \log_2 \frac{|D|}{|\{d_i \mid d_i \in D\}|}$$
$$tfidf_{i,d} = (1 + tf_{i,d})\, idf_{i,D} \quad \text{if } tf_{i,d} \geq 1$$
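The TFIDF calculation can be sketched in a few lines. This is a minimal illustration, not the LUCAS implementation. It assumes tf is the raw occurrence count (as in the prose definition), a base-2 logarithm for idf, and a score of 0 when a term is absent, a case the slide does not cover. The function names are hypothetical.

```python
import math
from collections import Counter


def idf(term, documents):
    """log2(|D| / number of documents containing the term)."""
    containing = sum(1 for doc in documents if term in doc)
    if containing == 0:
        return 0.0  # assumption: the slide does not cover unseen terms
    return math.log2(len(documents) / containing)


def tfidf(term, doc, documents):
    """(1 + tf) * idf when tf >= 1; 0.0 otherwise (the zero branch is an assumption)."""
    tf = Counter(doc)[term]  # raw number of occurrences of the term in doc
    if tf < 1:
        return 0.0
    return (1 + tf) * idf(term, documents)
```

Each document is a list of tokens, for example `["cell", "death", "cell"]`. A term that appears in every document gets an idf of 0 and therefore a TFIDF of 0.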
Term Relationship Measurements
Intra-sentence distance
Sentence structure taken into account
Inter-sentence distance
Sentence structure ignored
Ex.
“... and is not present in the mitochondria. Permeability is another...”
“... mitochondrial permeability is an important aspect of apoptosis...”
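The difference between the two measurements can be sketched as follows. The assumptions are mine rather than the slide's: distance is the gap in token positions from the first term to the second, sentences are split on terminal punctuation, and tokens are matched exactly (so "mitochondria" and "mitochondrial" count as different words). All function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase a string and return its alphanumeric tokens in order."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def min_distance(tokens, a, b):
    """Smallest positional gap with a before b; None if no such pair exists."""
    best, last_a = None, None
    for pos, tok in enumerate(tokens):
        if tok == a:
            last_a = pos
        elif tok == b and last_a is not None:
            gap = pos - last_a
            if best is None or gap < best:
                best = gap
    return best


def inter_sentence_distance(text, a, b):
    """Sentence structure ignored: the text is one token stream."""
    return min_distance(tokenize(text), a, b)


def intra_sentence_distance(text, a, b):
    """Sentence structure respected: both terms must share a sentence."""
    gaps = [min_distance(tokenize(s), a, b) for s in split_sentences(text)]
    gaps = [g for g in gaps if g is not None]
    return min(gaps) if gaps else None
```

On the first example sentence pair, "mitochondria" and "permeability" are adjacent when sentence structure is ignored but unrelated when it is respected, because the period separates them.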
Inter-sentence vs. Intra-sentence distance
[Figure: an initial search for the relationship between “cell” and “death” over a set of documents, Doc A to Doc E, whose text contains the terms in different arrangements (“…cell death…”, “…cell. Death…”, “…cell…”, “…death…”). The figure distinguishes the documents used to construct the random model from the documents that are scored and returned to the user.]
Visual Representation of Term Relationships
Graph M: example of a term relationship graph that was specified by the user
Graph N: example of a term relationship graph that was taken from an article’s abstract
Scoring an Article
M = User Defined Term Relationships
N = Term Relationships in an Individual Article’s Abstract
S = Scoring Matrix
P = Presence or Absence of a Term Relationship from M in N
f = Sigmoidal Term Relationship Function
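$$\text{Abstract Score} = \sum_{i,j} P_{M,N}(i,j) \times S_{i,j} \times f_{M_{i,j}}(N_{i,j})$$
$$P_{M,N}(i,j) = \begin{cases} 1 & \text{if } M_{i,j} \times N_{i,j} \neq 0 \\ -1 & \text{otherwise} \end{cases}$$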
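The abstract score can be sketched as a sum over term pairs. This is an illustration under assumptions, not the author's implementation: M, N, and S are stored as dicts keyed by (term i, term j), a missing key counts as 0, only pairs the user actually specified (M ≠ 0) contribute, and the membership function f is passed in as a callable. The function names are hypothetical.

```python
def presence(m_ij, n_ij):
    """P_{M,N}(i,j): 1 if M_ij * N_ij != 0, otherwise -1."""
    return 1 if m_ij * n_ij != 0 else -1


def abstract_score(M, N, S, f):
    """Sum of P * S * f over the term pairs the user specified.

    M, N, S: dicts keyed by (term_i, term_j); missing keys count as 0.
    f: callable f(m_ij, n_ij) returning the term-membership value.
    Assumption: only pairs with M_ij != 0 contribute to the sum.
    """
    total = 0.0
    for pair, m_ij in M.items():
        if m_ij == 0:
            continue
        n_ij = N.get(pair, 0)
        total += presence(m_ij, n_ij) * S.get(pair, 0.0) * f(m_ij, n_ij)
    return total
```

A relationship the user asked for that is missing from the abstract enters the sum with a negative sign, so it lowers the score instead of being ignored.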
Sigmoidal Scoring Function
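$$f_{M_{i,j}}(N_{i,j}) = \begin{cases} 1 & \text{if } x \leq \alpha \\ 1 - \frac{1}{2}\,\frac{x-\alpha}{\beta-\alpha} & \text{if } \alpha < x \leq \beta \\ \frac{1}{2}\,\frac{\beta-\alpha}{x-\alpha} & \text{if } \beta < x \leq \gamma \\ 0 & \text{if } x > \gamma \end{cases}$$
[Figure: % term membership (1, ½, 0) plotted against term distance (α, β, γ).]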
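The membership function can be sketched directly from its four cases. The code assumes the following reading of the slide: 1 for x ≤ α, 1 − ½·(x − α)/(β − α) for α < x ≤ β, ½·(β − α)/(x − α) for β < x ≤ γ, and 0 for x > γ. If the original slide carried exponents that were lost in extraction, the two middle branches would change. The function name and parameter names are hypothetical.

```python
def term_membership(x, alpha, beta, gamma):
    """Piecewise membership value for a term distance x.

    Assumes alpha < beta <= gamma; the middle branches follow the
    reading of the slide stated in the text above.
    """
    if not alpha < beta <= gamma:
        raise ValueError("expected alpha < beta <= gamma")
    if x <= alpha:
        return 1.0
    if x <= beta:
        return 1.0 - 0.5 * (x - alpha) / (beta - alpha)
    if x <= gamma:
        return 0.5 * (beta - alpha) / (x - alpha)
    return 0.0
```

Under this reading the two middle branches agree at x = β, where both give ½, matching the ½ mark on the plot.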
Scoring Matrix (Random Model)
Derived from the TFIDF terms that were defined by the user and the abstracts of all the articles returned by the initial term search.
User-defined term relationships are found in all the abstracts and the log-odds score is taken:
$$\text{LOD Score}(t_i, t_j) = \log_2 \frac{P(t_j \mid t_i, \Delta)}{P(t_i) \times P(t_j)}$$
$P(t_j \mid t_i, \Delta)$ is found by first finding a word, $t_i$, that the user has defined and then opening a 5-word reading frame, $\Delta$, following $t_i$. A second user-defined word, $t_j$, must be present within $\Delta$.
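The log-odds score itself is a one-line calculation once the three probabilities are known. The sketch below follows the formula as written, with a base-2 logarithm. Returning 0 for a relationship that was never observed is an assumption, and the function name `lod_score` is hypothetical.

```python
import math


def lod_score(p_joint, p_i, p_j):
    """log2( P(t_j | t_i, frame) / (P(t_i) * P(t_j)) ).

    p_joint: probability of finding t_j inside the reading frame after t_i
    p_i, p_j: individual probabilities of the two terms
    """
    if p_joint == 0:
        return 0.0  # assumption: unobserved relationships score 0
    return math.log2(p_joint / (p_i * p_j))
```

A positive score means the pair co-occurs within the frame more often than the individual word probabilities would suggest; a negative score means less often.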
Steps to derive the Scoring Matrix
Determine important terms
cell, death, human
Look for relationships of those words in the search space.
Relationships: cell→death, cell→human, death→cell, death→human, human→cell, human→death
Search space (abstract): ← The effects … cell in a human … in cancer. → (20 words)
Once an important term is found, a 5-word reading frame is opened. If a relationship is found within the reading frame, then the distance between the words is taken.
cell→human = 3
If multiple occurrences of the same relationship are found in the search space, the average is taken.
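The reading-frame scan described in these steps can be sketched as below. It is an illustration, not the author's code: the frame is taken to be the 5 tokens that follow an important term, distance is the offset within that frame (so "cell in a human" gives cell→human = 3), and the function name `relationship_distances` is hypothetical.

```python
from collections import defaultdict


def relationship_distances(tokens, terms, frame=5):
    """Average distance for each ordered term pair found within the frame.

    tokens: the search space as a list of lowercase words
    terms: the important (user-defined) terms
    frame: number of words scanned after each important term
    """
    terms = set(terms)
    found = defaultdict(list)
    for pos, tok in enumerate(tokens):
        if tok not in terms:
            continue
        window = tokens[pos + 1 : pos + 1 + frame]  # reading frame after tok
        for offset, other in enumerate(window, start=1):
            if other in terms and other != tok:
                found[(tok, other)].append(offset)
    # Multiple occurrences of the same relationship are averaged.
    return {pair: sum(d) / len(d) for pair, d in found.items()}
```

Pairs are ordered, so cell→human and human→cell are recorded separately, as in the list of six relationships above.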
Steps to derive the Scoring Matrix
Lastly, these relationships, along with the individual word probabilities, can be taken, scored, and structured into a matrix.
P(cell→human) = 2/12 = .167
P(cell) = .03
P(human) = .06
LOD(cell→human) = 1.97
Continue for all relationships

             apoptosis   human    cell
apoptosis    0           1.27     -1.08
human        1.64        0        0
cell         2.35        1.97     0
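As a check on the arithmetic, the cell→human example can be worked through by hand using the probabilities quoted on the slide:

```latex
\frac{P(\text{cell} \to \text{human})}{P(\text{cell}) \times P(\text{human})}
  = \frac{2/12}{0.03 \times 0.06}
  = \frac{0.1667}{0.0018}
  \approx 92.6
\qquad
\log_{10} 92.6 \approx 1.97,
\qquad
\log_{2} 92.6 \approx 6.53
```

The quoted value of 1.97 matches a base-10 logarithm of this ratio. A base-2 logarithm, as the LOD formula is written, would give about 6.53. The source does not say which base the tool actually uses.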
Adding User Weights to Term Matrix
The user is asked to enter a weight for each word relationship found within the user’s expansion statement.
Weights range over [0, 2].
The weight is denoted $r_{i,j}$ for the relationship from term i to term j.
Weights are multiplied by the matrix values to add the user’s input into the random model.
User’s Word Weight Submissions
death → cell …… 1.0
protein → cell …… 0.5
protein → death … 1.5
cell → death … 2.0

Scoring Matrix Before User’s Word Weights
S_{i,j}     cell     death    protein
cell        0.0      2.54     0.0
death       0.98     0.0      0.0
protein     -1.65    3.65     0.0

$$\text{Final Score } S_{i,j} = \begin{cases} \frac{1}{r_{i,j}} \times S_{i,j} & \text{if } S_{i,j} < 0 \\ 0 & \text{if } S_{i,j} = 0 \\ r_{i,j} \times S_{i,j} & \text{if } S_{i,j} > 0 \end{cases}$$

Scoring Matrix After User’s Word Weights
S_{i,j}     cell     death    protein
cell        0.0      5.08     0.0
death       0.98     0.0      0.0
protein     -3.30    5.48     0.0
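The weighting rule can be sketched as follows: positive scores are multiplied by the weight, negative scores are divided by it, and zero stays zero. Two points are assumptions because the slides do not cover them: a weight of 0 on a negative score raises an error (the division is undefined), and pairs without a submitted weight are left unchanged. The function names are hypothetical.

```python
def apply_user_weight(s_ij, r_ij):
    """Final score for one matrix cell given the user's weight r_ij in [0, 2]."""
    if not 0 <= r_ij <= 2:
        raise ValueError("weights range over [0, 2]")
    if s_ij > 0:
        return r_ij * s_ij
    if s_ij < 0:
        if r_ij == 0:
            # Assumption: the slides do not define this case.
            raise ValueError("weight 0 is undefined for a negative score")
        return s_ij / r_ij
    return 0.0


def weight_matrix(S, weights):
    """Apply weights to a matrix stored as {(term_i, term_j): score}.

    Assumption: pairs with no submitted weight keep their original score.
    """
    return {
        pair: apply_user_weight(score, weights[pair]) if pair in weights else score
        for pair, score in S.items()
    }
```

With the four weights submitted in the example, this reproduces the after-weights matrix: 2.54 × 2.0 = 5.08, 3.65 × 1.5 ≈ 5.48, −1.65 / 0.5 = −3.30, and 0.98 × 1.0 = 0.98. Dividing negative scores means a weight above 1 makes an unlikely pairing count less against an article, and a weight below 1 makes it count more.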
Visual Representation of Term Relationships
Graph M: example of a term relationship graph that was specified by the user
Graph N: example of a term relationship graph that was taken from an article’s abstract
Comparing Term Relationship Graphs
In order to compare the word graphs, an adjacency matrix must be created for each graph. The values of $M_{i,j}$ and $N_{i,j}$ are taken from these matrices.
[Figure: adjacency matrices M and N over the terms apoptosis, tumor, fas, and induce. In the extracted text each matrix has a single nonzero entry (5.00 in one, 3.00 in the other); the cell positions are not recoverable.]
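Building an adjacency matrix from a term relationship graph can be sketched as below. The edge-list representation, with a distance attached to each directed edge, and the function name `adjacency_matrix` are assumptions made for illustration.

```python
def adjacency_matrix(terms, edges):
    """Square matrix of term distances; 0.0 means no relationship.

    terms: ordered list of terms (row and column labels)
    edges: dict {(term_i, term_j): distance} for directed relationships
    """
    index = {term: k for k, term in enumerate(terms)}
    matrix = [[0.0] * len(terms) for _ in terms]
    for (a, b), distance in edges.items():
        matrix[index[a]][index[b]] = distance
    return matrix
```

Building M and N with the same `terms` list puts each relationship in the same cell of both matrices, so the two can be compared position by position.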
Results and Refinement
Semantic Score from the equation
$$\sum_{i,j} P_{M,N}(i,j) \times S_{i,j} \times f_{M_{i,j}}(N_{i,j})$$
Support Score in the form of Citation Frequency, which is simply the citation count supplied by ISI’s Web of Science divided by the number of years between the publication date and now.
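The citation frequency is a simple ratio, sketched below. Treating an article published in the current year as one year old, to avoid dividing by zero, is an assumption the slide does not address. The function name is hypothetical.

```python
def citation_frequency(times_cited, pub_year, current_year):
    """Citation count divided by the number of years since publication."""
    years = current_year - pub_year
    if years <= 0:
        years = 1  # assumption: count the publication year as one year
    return times_cited / years
```

For example, an article from 2000 that had been cited 40 times by 2004 gets a citation frequency of 10 citations per year.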
Software Demonstration
BioKnOT
http://biokdd.informatics.indiana.edu/cgi-bin/jccostel/thesis/bioknot.cgi
PubMed
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
Summary
Offers a new and effective way to search research articles.
BioKnOT offers many features that help the user decide which factors are important in retrieving articles.
Currently under submission to the SIGIR Bioinformatics workshop.
Future Work
Add more sophisticated support through citation frequency
Increase efficiency of the scoring method
Conduct a usability analysis
Incorporate BioKnOT into CATPA
Develop a Bioinformatics Knowledge Base locally using BioKnOT
Acknowledgments
Professor Mehmet Dalkilic
Professor Javed Mostafa
Professor Sun Kim