
Discovering Gene-Disease Association using On-line Scientific Text Abstracts
Raj Adhikari
Advisor: Javed Mostafa
Motivation


A central problem in bioinformatics is how to capture information from the vast scientific literature and create an automated system for “knowledge discovery” that can be used in various areas.
I address the special case of gene-disease interactions and show that the frequency and relevance of words in PubMed abstracts can be used to find genes related to a disease.
21 July 2015
Bioinformatics capstone project
2
Goal

Use a combination of statistical methods and a database to:
    Retrieve research abstracts from PubMed.
    Extract relevant information from the free text using statistical methods.
    Measure the accuracy of the results and display the results using a Web-based system.
    Complement and support existing knowledge base systems like GeneCards.
Resources used in creating the database

PubMed
    The US National Library of Medicine's database that contains more than 11 million references to journal articles in the health sciences.
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
GeneCards
    A database of human genes, their products and their involvement in diseases.
    http://bioinfo.weizmann.ac.il/cards/index.shtml
HGNC
    HUGO Gene Nomenclature Committee (approved over 19000 human gene symbols).
    Consistent with OMIM and LocusLink.
    http://www.gene.ucl.ac.uk/nomenclature
Tools used: Perl, CGI, Java, MySQL
Creating the database

Data I used:
    A relatively small list of genes and diseases in humans
    An article set (around 8000 articles). For each PubMed article:
        PMID
        Article Title
        Abstract (filtered with a list of stop words)
    The HUGO dataset
    A list of around 3500 related gene-disease pairs from GeneCards
Populating the database tables






Use the book Genes and Disease at OMIM to generate a list of around 60 diseases and 90 genes.
Search PubMed for each gene-disease pair on the Title/Abstract field.
Use ESearch (a tool that provides access to the PubMed database outside of the web interface) to retrieve data in XML format.
Use the XML::Simple Perl package to parse the XML file.
Filter the text using stop words and store each title and abstract along with the related PMID in a database table.
Add more genes using HUGO.
OMIM: Database of genetic diseases with references to molecular medicine, cell biology, biochemistry and clinical details of the diseases.
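The parsing and filtering steps above can be sketched as follows. This is a minimal Python illustration, not the project's Perl/XML::Simple code. The XML snippet is a simplified, hypothetical stand-in for a retrieved PubMed record, and the stop-word list and the names `STOP_WORDS`, `filter_stop_words`, and `parse_articles` are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for retrieved PubMed XML (not the full real schema).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <PMID>00000001</PMID>
    <ArticleTitle>Mutations of the BRCA2 gene in pancreatic cancer</ArticleTitle>
    <AbstractText>We report that BRCA2 is mutated in a subset of the tumors.</AbstractText>
  </PubmedArticle>
</PubmedArticleSet>
"""

# Tiny illustrative stop-word list (assumption); a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "that", "the", "we"}


def filter_stop_words(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9][a-z0-9-]*", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def parse_articles(xml_text):
    """Return one record per article: PMID plus filtered title and abstract tokens."""
    root = ET.fromstring(xml_text)
    records = []
    for article in root.iter("PubmedArticle"):
        records.append({
            "pmid": article.findtext("PMID"),
            "title": filter_stop_words(article.findtext("ArticleTitle", default="")),
            "abstract": filter_stop_words(article.findtext("AbstractText", default="")),
        })
    return records


records = parse_articles(SAMPLE_XML)
```

Each record could then be written to the article table keyed by its PMID. Lowercasing gene symbols is a simplification chosen for this sketch.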
Populating the database tables

Parse the retrieved text files and create the following tables:

Table structures:

Derivative table:
    Term, PMID, Tfreq, Dfreq, Tfidf, LSI

HUGO table structure:
    HGNC, genesymbol, alias

GeneCards table structure:
    Genesymbol, disease
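As a rough illustration of the table structures above, the sketch below creates them in an in-memory SQLite database from Python. The project itself used MySQL, so SQLite is only a stand-in here. The column names come from the slide, while the table names, column types, key choice, and the reading of Tfreq and Dfreq as term and document frequency are assumptions.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for the project's MySQL database.
conn = sqlite3.connect(":memory:")

# Column names follow the slide; table names and column types are assumptions.
conn.executescript("""
CREATE TABLE derivative (
    term  TEXT    NOT NULL,
    pmid  INTEGER NOT NULL,
    tfreq INTEGER,          -- presumably the term frequency within the article
    dfreq INTEGER,          -- presumably the number of articles containing the term
    tfidf REAL,
    lsi   REAL,
    PRIMARY KEY (term, pmid)
);
CREATE TABLE hugo (
    hgnc       INTEGER NOT NULL,
    genesymbol TEXT    NOT NULL,
    alias      TEXT
);
CREATE TABLE genecards (
    genesymbol TEXT NOT NULL,
    disease    TEXT NOT NULL
);
""")

# Example row with hypothetical values, followed by a lookup by term.
conn.execute(
    "INSERT INTO derivative (term, pmid, tfreq, dfreq, tfidf, lsi) VALUES (?, ?, ?, ?, ?, ?)",
    ("brca2", 1, 2, 1, 0.0, None),
)
rows = conn.execute(
    "SELECT pmid, tfreq FROM derivative WHERE term = ?", ("brca2",)
).fetchall()
```

Storing one row per (term, PMID) pair keeps the pre-computed weights available to later scoring queries without reparsing the abstracts.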
Generating term weights


Basic idea: compare co-occurrence of terms in a document and across a set of documents by generating term weights.
Within a document: Term Frequency (tf)
    tf measures term density within a document.
Across the document set: Inverse Document Frequency (idf)
    idf measures the “informativeness” of a term across a dataset.
    $\mathrm{idf}_i = \log\left(\frac{n}{df_i}\right)$
Thus:
    $\mathrm{tfidf}_i = tf_i \times \log\left(\frac{n}{df_i}\right)$
Here n is the number of documents in the set and df_i is the number of documents that contain term i.
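The term-weighting formulas can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's code: tf is taken as the raw count of a term in a document, the logarithm is the natural log (the slides do not state a base), and the toy corpus and the name `tfidf_weights` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus: PMID -> stop-word-filtered tokens.
docs = {
    1: ["brca2", "pancreatic", "cancer", "brca2"],
    2: ["psen1", "alzheimer", "disease"],
    3: ["brca2", "breast", "cancer"],
}


def tfidf_weights(docs):
    """Return {(term, pmid): tf * log(n / df)} for every term in every document."""
    n = len(docs)
    # df: number of documents that contain the term at least once.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for pmid, tokens in docs.items():
        tf = Counter(tokens)  # raw count of each term within this document
        for term, count in tf.items():
            weights[(term, pmid)] = count * math.log(n / df[term])
    return weights


weights = tfidf_weights(docs)
```

In this toy corpus "pancreatic" appears in one document of three and so receives a higher idf than "cancer", which appears in two.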
Latent Semantic Indexing

Calculating co-occurrence of terms might not suffice because of possible “noise” in the dataset.
Use LSI, a statistical technique, to estimate a latent structure.
Assume some underlying semantic structure in the dataset which could be partially obscured.
Implementation
    Build a term-by-document matrix (tends to be sparse).
    Convert matrix entries to weights, e.g. tfidf.
    Analyze the matrix by singular value decomposition (SVD) to derive a latent semantic structure model.
SVD

Mathematical decomposition of a matrix into the product of three matrices (the singular values are uniquely determined):
    two with orthonormal columns
    one with singular values on the diagonal
Finds the optimal projection into a low-dimensional space
Tool for dimension reduction
SVD
Singular Value Decomposition
$A = U \Sigma V^{T}$
Where:
$U$ has orthonormal, unit-length columns: $U^{T} U = I$
$\Sigma$ is the diagonal matrix of singular values (non-negative real numbers)
$V$ has orthonormal, unit-length columns: $V^{T} V = I$
SVD




Approximate A by A_k, keeping only the first k singular values and the corresponding columns of the U and V matrices.
The new matrix A_k does not exactly match the original term-by-document matrix A. (It gets closer and closer as more singular values are kept.)
This is what we want: we don’t want a perfect fit, since we think some of the 0’s in A should not be 0 and vice versa.
Limitations of SVD: very memory intensive, cannot handle large datasets.
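The effect of keeping only the leading singular values can be illustrated for the simplest case, k = 1. The sketch below is an assumption-laden stand-in, not the project's implementation: instead of calling a full SVD routine it estimates the largest singular value and its vectors by power iteration on $A^{T}A$, using only the Python standard library. The matrix and the function names are hypothetical, and the code assumes a non-zero matrix.

```python
import math


def matvec(m, v):
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Return the transpose of matrix m."""
    return [list(col) for col in zip(*m)]


def top_singular_triplet(a, iterations=200):
    """Estimate the largest singular value of a and its singular vectors (u, sigma, v)."""
    at = transpose(a)
    v = [1.0] * len(a[0])  # starting vector; assumed not orthogonal to the answer
    for _ in range(iterations):
        w = matvec(at, matvec(a, v))  # one step of power iteration on A^T A
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    av = matvec(a, v)
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av]
    return u, sigma, v


def rank_one_approximation(a):
    """Return A_1 = sigma * u * v^T, the k = 1 truncation of the SVD of a."""
    u, sigma, v = top_singular_triplet(a)
    return [[sigma * ui * vj for vj in v] for ui in u]


# Toy 2 x 2 term-by-document matrix (hypothetical): term 2 is absent from document 2.
A = [[1.0, 1.0],
     [1.0, 0.0]]
A1 = rank_one_approximation(A)
```

In `A1` the entry that was 0 in `A` becomes positive, which is the kind of smoothing the slide describes: the low-rank model suggests an association that the raw counts did not record.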
Scoring Matrix Generation



A scoring matrix is generated for each term weighting method using the data stored in the database.
This matrix is used to find the relationships between genes and diseases.
This is a relatively fast process since the weights are pre-computed and stored in a database.
Finding relationships
[Figure: a document-by-term matrix (rows D1 … Dn, columns T1 … Tn) with binary entries, and the term-by-term matrix (rows and columns T1 … Tn) derived from it.]
Use the doc-term matrix to establish relationships between genes and disease
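One way to read the figure is that each entry of the term-by-term matrix sums, over all documents, the product of the two term columns. The Python sketch below follows that reading, which is an assumption rather than a confirmed description of the project's scoring. The matrix values, the generic term labels, and the function names are hypothetical.

```python
# Hypothetical binary document-by-term matrix: rows are documents, columns are terms.
terms = ["diseaseX", "geneA", "geneB", "geneC"]
doc_term = [
    [1, 1, 1, 0],  # D1 mentions diseaseX, geneA and geneB
    [1, 1, 0, 0],  # D2 mentions diseaseX and geneA
    [0, 0, 0, 1],  # D3 mentions geneC only
]


def cooccurrence(doc_term):
    """Term-by-term matrix whose (i, j) entry sums the products of columns i and j."""
    n_terms = len(doc_term[0])
    return [
        [sum(row[i] * row[j] for row in doc_term) for j in range(n_terms)]
        for i in range(n_terms)
    ]


def rank_genes(disease, genes, terms, scores):
    """Sort the candidate genes by their score against the disease, highest first."""
    d = terms.index(disease)
    ranked = [(g, scores[d][terms.index(g)]) for g in genes]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


scores = cooccurrence(doc_term)
ranking = rank_genes("diseaseX", ["geneA", "geneB", "geneC"], terms, scores)
```

With binary entries the score is simply the number of documents that mention both terms. Replacing the 0/1 entries with tfidf or LSI weights gives a weighted variant of the same computation.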
Results
Verification of the relationship



Data from GeneCards and HUGO have been stored in a database.
For each gene, if the symbol is an official gene symbol (according to HUGO), search for the gene symbol in GeneCards and display the disease associated with it.
Otherwise (if the symbol is an alias), use HUGO to find the official gene symbol, search GeneCards using this gene symbol, and display the disease associated with the gene.
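The lookup described above can be sketched with two in-memory dictionaries standing in for the HUGO and GeneCards tables. The disease labels are placeholders rather than real GeneCards entries, and the function names are hypothetical.

```python
# Stand-ins for the database tables. The symbols and aliases are sample HUGO-style rows;
# the disease labels are placeholders, not real GeneCards data.
HUGO = {
    "BRCA2": ["FAD1"],
    "PSEN1": ["S182", "PS1"],
}
GENECARDS = {
    "BRCA2": ["disease A"],
    "PSEN1": ["disease B"],
}


def official_symbols(symbol):
    """Return the official symbol(s) for a symbol that may be official or an alias."""
    if symbol in HUGO:
        return [symbol]
    return [official for official, aliases in HUGO.items() if symbol in aliases]


def associated_diseases(symbol):
    """Look up GeneCards diseases for a symbol, resolving aliases through HUGO first."""
    return {
        official: GENECARDS.get(official, [])
        for official in official_symbols(symbol)
    }


direct = associated_diseases("BRCA2")
via_alias = associated_diseases("S182")
```

A symbol that is neither official nor a known alias yields an empty result, so it cannot be verified this way.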
Verification results
Using gene alias


Make use of gene aliases from HUGO to increase the chances of detecting the correct genes for a given disease.
Method:
    Increment the weight of an official gene symbol by adding the weight of each of its aliases.
    Group the aliases together with the official gene symbol.
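A minimal sketch of the alias-merging method, assuming the weights for one disease are available as a term-to-weight mapping. The weights are hypothetical values chosen for illustration, and `merge_alias_weights` is a name introduced here.

```python
from collections import defaultdict

# Alias -> official symbol, plus hypothetical per-term weights for a single disease.
ALIAS_TO_OFFICIAL = {"S182": "PSEN1", "PS1": "PSEN1", "FAD1": "BRCA2"}
term_weights = {"PSEN1": 1.5, "PS1": 0.75, "S182": 0.25, "BRCA2": 2.0}


def merge_alias_weights(term_weights, alias_to_official):
    """Add each alias weight to its official symbol and return the grouped totals."""
    merged = defaultdict(float)
    for term, weight in term_weights.items():
        merged[alias_to_official.get(term, term)] += weight
    return dict(merged)


merged = merge_alias_weights(term_weights, ALIAS_TO_OFFICIAL)
top = sorted(merged.items(), key=lambda item: item[1], reverse=True)
```

In this toy input the official symbol PSEN1 ranks below BRCA2 on its own weight but moves ahead once its alias weights are added, which is the effect the method aims for.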
Results for Pancreatic Cancer

Top five genes – without considering alias

Top five genes – considering alias
Using gene alias - problems

Problem: HUGO might list the same alias under multiple official gene symbols:
    Such an alias could actually increase the weight of a gene that is not related to the disease.
Example:
HGNC    genesymbol    alias
3585    FANCD2        FAD, FA-D2
1101    BRCA2         FAD, FAD1
9508    PSEN1         FAD, S182, PS1
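Ambiguous aliases such as FAD can be detected before any weights are merged. The sketch below uses the three rows from the example, reading the first column as the HGNC identifier (an assumption based on the HUGO table structure). The function name is hypothetical, and skipping or down-weighting flagged aliases is only one possible response, not necessarily what the project did.

```python
from collections import defaultdict

# Rows from the example above: (HGNC id, official symbol, aliases).
HUGO_ROWS = [
    (3585, "FANCD2", ["FAD", "FA-D2"]),
    (1101, "BRCA2", ["FAD", "FAD1"]),
    (9508, "PSEN1", ["FAD", "S182", "PS1"]),
]


def ambiguous_aliases(rows):
    """Map each alias shared by more than one official symbol to those symbols."""
    owners = defaultdict(list)
    for _hgnc_id, symbol, aliases in rows:
        for alias in aliases:
            owners[alias].append(symbol)
    return {alias: symbols for alias, symbols in owners.items() if len(symbols) > 1}


ambiguous = ambiguous_aliases(HUGO_ROWS)
```

Here only FAD is flagged, because it is listed under all three official symbols.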
Problem using alias
Verification


In addition, the number of PubMed articles containing both a disease and a gene symbol can be an indication of how strong the association between the disease and the gene is.
The same reasoning applies to a gene-gene relationship.
Gene-Gene Relationships

In addition, we can use the doc-term matrix to find gene(s) that are related to any given gene.
[Figure: a document-by-gene matrix (rows D1 … Dn, columns g1 … gn) with binary entries, and the gene-by-gene matrix derived from it, in which the entry for g2 and g3 is 2.]
Using the matrices above, we see that g2 is related to g3 and the weight is 2.
Discovering additional gene-gene relationships

We can make use of the possibility that two genes might be related to each other via a disease, as in:
gene1 -> disease1 -> gene2
gene1 -> disease2 -> gene2

to establish a relationship between gene1 and gene2.
In this case, the fact that gene1 and gene2 are related to each other via two different diseases makes the relationship between them even stronger.
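The indirect links can be sketched by grouping genes under each disease and pairing the genes that share one. The input pairs are hypothetical, and counting shared diseases as a measure of strength is an assumption that follows the reasoning above rather than a documented part of the system.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical gene-disease pairs, e.g. as produced by the scoring step.
pairs = [
    ("gene1", "disease1"),
    ("gene2", "disease1"),
    ("gene1", "disease2"),
    ("gene2", "disease2"),
    ("gene3", "disease2"),
]


def shared_disease_links(pairs):
    """Return {(gene_a, gene_b): shared diseases} for genes linked via a disease."""
    genes_by_disease = defaultdict(set)
    for gene, disease in pairs:
        genes_by_disease[disease].add(gene)
    links = defaultdict(list)
    for disease in sorted(genes_by_disease):
        for a, b in combinations(sorted(genes_by_disease[disease]), 2):
            links[(a, b)].append(disease)
    return dict(links)


links = shared_disease_links(pairs)
```

In this input gene1 and gene2 share two diseases while gene3 is linked to each of them through one, so the gene1-gene2 relationship would be treated as the stronger one.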
Architecture
System Demonstration


http://biokdd.informatics.indiana.edu/radhikar/search.html
Related URLs:
    GeneCards: http://bioinfo.weizmann.ac.il/cards/index.shtml
    HGNC: http://www.gene.ucl.ac.uk/nomenclature/
Summary


Using a combination of statistical methods and a database, the process of establishing gene-disease relationships from literature data is fast and efficient.
With minimal changes, our system can be extended to discover other relationships, such as protein-protein interactions.
Future Work




Extend our system to incorporate the entire Medline dataset.
Incorporate full gene names.
Find a better way to verify the gene-gene relationships.
Incorporate other on-line scientific literature databases.
Acknowledgments




Professor Javed Mostafa
Professor Sun Kim
Professor Memo Dalkilic
Professor Haixu Tang