
Data Integration and Information Retrieval:
Moving from the Ivory Tower into the Corporate Office
Tae W. Ryu
Department of Computer Science
California State University, Fullerton
Summary of Today’s Talk

• Past and current research activities
• Data integration and information retrieval
• Commercial application to the Real Estate business by Mr. Shin
• Questions & answers
A Bioinformatics Project at CSUF

• A bioinformatics research group (BIG) involving several faculty members and students from Computer Science, Biology, Biochemistry, and Mathematics at CSUF and Pomona College in Claremont started in 2001.
• Bioinformatics is the study of biological systems using computers.
DNA: the Molecule of Life

DNA (Deoxyribonucleic Acid)
• DNA is double-stranded
– Base pairs (A-T, G-C) are complementary, known as Watson-Crick base pairs
• A double-stranded DNA sequence can be represented by strings of letters (1D) in either direction
– 5' ... TACTGAA ... 3'
– 3' ... ATGACTT ... 5'
• Length of DNA is measured in base pairs (e.g., 100 kbp)
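The complementary-strand representation above can be sketched in a few lines of Python (an illustration, not part of the talk): the Watson-Crick pairing rule maps each base to its partner, and reading the result in the opposite direction gives the other strand written 5' to 3'.

```python
# Watson-Crick base pairing: A<->T, G<->C
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand_5to3: str) -> str:
    """Return the complementary strand, also read 5' -> 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand_5to3))

print(reverse_complement("TACTGAA"))  # TTCAGTA
```

Applied to the example strand 5'-TACTGAA-3', this yields the second strand of the slide (3'-ATGACTT-5') read in the conventional 5'-to-3' direction.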
Genes and Genetic Code

• What are genes?
– A specific sequence of nucleotides (A, T, G, C) along a chromosome carrying the information for constructing a protein
• Who defined the concept of a gene?
– Mendel, in the 1860s (DNA was elucidated some 75 years later)
• What is the genetic code?
– 3 base pairs in a gene = a codon (representing one amino acid)
• A genome is a complete set of chromosomes.
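The codon idea can be illustrated with a toy translation sketch (only a handful of entries from the standard genetic code are included; a real table has 64 codons):

```python
# Partial genetic code table: codon (3 bases) -> amino acid.
GENETIC_CODE = {
    "ATG": "Met",  # also the start codon
    "TTT": "Phe", "GAA": "Glu", "TGG": "Trp",
    "TAA": "Stop", "TAG": "Stop", "TGA": "Stop",
}

def translate(dna: str) -> list:
    """Split a coding sequence into codons and look up each amino acid."""
    codons = [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]
    return [GENETIC_CODE.get(c, "?") for c in codons]

print(translate("ATGTTTGAATAA"))  # ['Met', 'Phe', 'Glu', 'Stop']
```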
Non-coding Regions in DNA

[Diagram: genes along a chromosome separated by intergenic (non-coding) regions and introns]
• Over 90% of the human genome is non-coding sequence (intergenic regions, or "junk DNA").
• The role of these regions is as yet unknown but is speculated to be very important.
Our Project Goal

• Understand the importance and roles of the non-coding regions (intergenic regions) in DNA
– Build a high-quality integrated data source for the non-coding sequences (intergenic regions) in eukaryotic genomes
– Seek pilot projects for bioinformatics research and education at CSUF
Bioinformatics and Integrated Biological Data

• The major task for bioinformaticians is to make sense out of the biological data
– Typical tasks: modeling; sequence to structure or functional class; structure to function or mechanism
– How?
  • Biology-oriented approach: experiments and DNA manipulation in a wet lab
  • Computer-oriented approach: data mining, pattern recognition and discovery, prediction models, simulation, etc.
• Success in most bioinformatics research requires
– An integrated view of all the relevant data
  • High-quality genomic sequence data and other relevant data
  • The results of analyses, such as patterns produced by other research
– A user-friendly and powerful information retrieval tool
– Data analysis and interpretation
  • Data analysis by data mining and statistical approaches
  • Interpretation by biologists (with strong domain knowledge)
Obstacles to Data Integration

• Data spread over multiple, heterogeneous data sources
– Databases (MySQL, Oracle, SQL Server, etc.)
– Semi-structured sequence files (text or XML)
– HTML on Web sites
– Output of analytic programs (BLAST, PFAM, etc.)
• Not all sources represent biology optimally
– Semantics of sources can differ widely
  • GenBank is sequence-centric, not gene-centric
  • SwissProt is sequence-centric, not domain-centric
– Sources use different terms and definitions
  • Biological ontologies are being built now
• Lack of standards in data representation
– XML is emerging as a standard for data transfer
More Obstacles

• Poor data quality (errors) and incomplete data
– due to errors in labs
– due to the large amount of data that is computer-generated using heuristic algorithms
• Data in the original data sources is constantly changing
• This is a really challenging problem that requires in-depth knowledge of both Computer Science and Molecular Biology
– Several approaches are possible (cross-validation, re-experimentation), but they are still limited
Possible Approaches

• Database approach (conventional)
– Relational or object-oriented database
• Data warehouse (or data mart)
– A data warehouse maintains integrated, high-quality, current (or historical), and consistent data.
– A data mart is a small-scale data warehouse
– Often an important prerequisite for sophisticated data mining
• Ideal approach (a future system)
– A comprehensive information management system with all of the above components, plus a powerful search engine and intelligent information retrieval based on text mining
Virtual Intergenic Data Warehouse

[Diagram: source databases (PROSITE, GenBank, Swiss-Prot, EPD, TRANSFAC, and others) are accessed through wrappers and mediators and fed into a data extraction, cleansing, and reconciliation process that builds the Intergenic Data Warehouse and its metadata; transformed data sets, multi-dimensional views, and cubes then support statistical and data mining tools behind a user interface.]
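The wrapper/mediator idea in the architecture can be sketched as follows. This is a hypothetical illustration: the source record fields (`locus`, `seq`, `ac`) and the common schema are made up for the example and are not the project's real schema. Each wrapper converts one source's native format into the common schema, and the mediator merges the wrapped sources into one uniform stream.

```python
# Hypothetical wrapper functions: one per data source, each mapping the
# source's native record fields onto a common {id, sequence, source} schema.

def genbank_wrapper(record):
    # Pretend a GenBank record exposes 'locus' and 'seq' fields.
    return {"id": record["locus"], "sequence": record["seq"], "source": "GenBank"}

def swissprot_wrapper(record):
    # Pretend a Swiss-Prot record exposes 'ac' and 'sequence' fields.
    return {"id": record["ac"], "sequence": record["sequence"], "source": "Swiss-Prot"}

def mediator(sources):
    """Merge records from all wrapped sources into one uniform list."""
    merged = []
    for wrapper, records in sources:
        merged.extend(wrapper(r) for r in records)
    return merged

rows = mediator([
    (genbank_wrapper, [{"locus": "CEL001", "seq": "TACTGAA"}]),
    (swissprot_wrapper, [{"ac": "P12345", "sequence": "MKT"}]),
])
print([r["id"] for r in rows])  # ['CEL001', 'P12345']
```

A real system would add the cleansing and reconciliation step between the wrappers and the warehouse load; here the mediator only unifies the schema.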
Current Progress

• Intergenic Database (IGDB version 1.1)
– Integrated from GenBank for the Caenorhabditis elegans (nematode), Saccharomyces cerevisiae (baker’s yeast), and Arabidopsis thaliana (mouse-ear cress) genomes
– Mouse, mosquito, and human genomes are under way
• Pattern Summary System (PATSS)
– Summarizes the sequence patterns generated by BLAST
– Pattern visualization with alignment tools
– Distributed BLAST using Web services and clustered computers
• Ontology-based data integration
– Intelligent wrappers and mediators
– Structure description language for data extraction
• Powerful information retrieval system based on a customized search engine with the support of text mining
– Web crawlers and a customized search engine, document indexing
– Text mining, natural language processing
Search Engine: How Does It Work?

[Diagram of a Web crawler: a persistent global work pool of URLs feeds a URL approval guard (which handles spider traps and robots.txt) and an isUrlVisited? check; approved URLs go through an asynchronous UDP DNS prefetch client backed by a caching DNS resolver with per-server queues; page-fetching threads wait for DNS resolution and an available HTTP socket, then send and receive over HTTP under a load monitor and work-thread manager; fetched pages pass an isPageKnown? check against crawl metadata and a hyperlink extractor and normalizer (relative links, links embedded in scripts, images), and land in a text repository and index for text indexing and other analyses.]
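Stripped of the DNS, HTTP, and threading machinery in the diagram, the core crawl loop reduces to a work pool plus a visited check. A minimal single-threaded sketch (the `fetch_page` callback is a stand-in for the whole fetch pipeline):

```python
from collections import deque

def crawl(seed_urls, fetch_page, max_pages=100):
    work_pool = deque(seed_urls)      # persistent global work pool of URLs
    visited = set()                   # backs the isUrlVisited? check
    pages = {}
    while work_pool and len(pages) < max_pages:
        url = work_pool.popleft()
        if url in visited:            # isUrlVisited?
            continue
        visited.add(url)
        text, links = fetch_page(url) # DNS lookup + HTTP send/receive
        pages[url] = text             # goes to the text repository and index
        work_pool.extend(links)       # hyperlink extractor output, re-queued
    return pages

# Toy in-memory "web" standing in for real HTTP fetches:
web = {"a": ("page A", ["b", "c"]), "b": ("page B", ["a"]), "c": ("page C", [])}
result = crawl(["a"], lambda u: web[u])
print(sorted(result))  # ['a', 'b', 'c']
```

A production crawler layers the rest of the diagram on top of this loop: robots.txt and spider-trap handling in the approval guard, per-server politeness queues, and many concurrent fetch threads.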
Search Engine for Web Data Integration and Retrieval

[Diagram of the indexing pipeline: a fresh batch of documents yields (d, t) pairs, which are batch-sorted into (t, d) and merge-purged into (t, d, s) records; new or deleted documents feed a fast (possibly non-compact) stop-press index, while a compact main index (which may be held partly in RAM, possibly preserving the sorted sequence) is built; the query processor answers the user from the main and stop-press indexes, with query logs and text mining alongside. Here t is a token id, d is a document id, and s is a bit specifying whether the document has been deleted or inserted.]
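The batch-sort step of the pipeline can be shown in miniature: emit (token, document) pairs from a fresh batch, sort them so pairs for the same token become adjacent, and merge them into postings lists. This sketch omits the deletion bit s and the compact/stop-press split.

```python
from itertools import groupby

def build_index(docs):
    """docs: {document id: text}. Returns {token: sorted list of doc ids}."""
    # Emit (t, d) pairs; set() deduplicates repeated tokens within a document.
    pairs = [(t, d) for d, text in docs.items() for t in set(text.split())]
    pairs.sort()                                  # batch sort: group by token
    return {t: sorted(d for _, d in group)        # merge into postings lists
            for t, group in groupby(pairs, key=lambda p: p[0])}

index = build_index({1: "data integration", 2: "data mining", 3: "text mining"})
print(index["data"])    # [1, 2]
print(index["mining"])  # [2, 3]
```

Sorting before grouping is what makes the merge a single sequential pass, which is why large-scale indexers sort batches on disk rather than building the index with random access.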
What is Text Mining?

• Text mining is the process of extracting interesting/useful patterns from text documents (a definition given in 1997 by the data mining community).
• Text is the most natural form of storing and exchanging information
– Very high commercial potential
– One study indicates that 80% of a company’s information is contained in text documents such as emails, memos, reports, etc.
• Applications
– Customer profile analysis
  • mining incoming emails for customers’ complaints and feedback
– Information dissemination
  • organizing and summarizing trade news and reports for personalized information services
– Security
  • email or message scanning, spam blocking
– Patent analysis
  • analyzing patent databases for major technology players and trends
– Extracting specific information from the Web (Web mining)
  • More powerful and intelligent search engines
Text Mining Framework

Text documents → Document retrieval → Information extraction → Information mining → Interpretation

• Information extraction: machine-readable dictionaries and lexical knowledge bases are essential.
– Fact extraction
  • pattern matching, lexical analysis, syntactic and semantic structure
– Fact integration and knowledge representation
• Information mining: mostly based on data mining and machine learning techniques
– Episodes and episode rules
– Conceptual clustering and concept hierarchies
– Text categorization
  • clustering, classification (machine learning approaches)
– Text summarization
– Visualization
– Natural language processing (very computationally expensive)
• Commercial products (mostly for categorization, summarization, and visualization)
– iMiner (IBM), TextWise (Syracuse), cMap (Canis), etc.
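As a toy stand-in for the categorization methods listed above, here is a sketch that scores each category by how many of its seed terms appear in a document. The categories and term lists are made up for the example; real systems learn such associations from labeled data.

```python
# Illustrative term sets per category (hypothetical, for the example only).
CATEGORIES = {
    "complaint": {"refund", "broken", "disappointed", "late"},
    "feedback": {"great", "thanks", "love", "suggestion"},
}

def categorize(text):
    """Assign the category whose seed terms overlap the document most."""
    words = set(text.lower().split())
    scores = {c: len(words & terms) for c, terms in CATEGORIES.items()}
    return max(scores, key=scores.get)

print(categorize("the package arrived broken, I want a refund"))  # complaint
```

This mirrors the "customer profile analysis" application from the previous slide in miniature; a practical categorizer would use learned weights and proper tokenization instead of raw term overlap.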
Future Information Management System

[Diagram: browsers and customized windows sit on top of text mining, data mining, and indexing layers, which draw on Web documents, text documents, and databases or a data warehouse.]
Techniques Used for the Real Estate Business by Mr. Shin

• Data integration from multiple data sources
– Database integration
– Information extraction from the Web using a Web crawler
• Customized search engine with the support of text mining
• User-friendly information retrieval tool
Thank You.