Intelligent Meta-Search and Clustering Technology Tamas Doszkocs, Ph.D.

Intelligent Meta-Search and Clustering Technology
http://tamas.nlm.nih.gov/metasearch/
http://toxseek.nlm.nih.gov
Tamas Doszkocs, Ph.D.
Computer Scientist
National Library of Medicine
Characteristics of Web Searching
• Content is created by diverse organizations and individuals
• Information on the Web is inherently heterogeneous
• Content is distributed on multiple servers in multiple locations, in multiple formats and languages, aimed at diverse audiences and purposes
(In its April 2005 survey, Netcraft received responses from 62,286,451 web sites)
• The “Open Web” of billions of static Web pages is indexed and searched via multiple search engines and directories
Problems in Web Searching
• Even the largest of the current search engines
index only a fraction of all Web pages
(As of August 2005, the Wayback Machine of the Internet Archive has indexed 40 billion pages, Google about 8.1 billion,
Yahoo about 20.8 billion)
• The not-so-“Hidden Web” of content databases (e.g.
PubMed, Web of Science) is estimated to be thousands of
times larger than the Open Web.
• Both the Open Web and the Hidden Web are
characterized by problems of information
coverage, quality, overload, relevancy, currency
and completeness, as well as inherent language
ambiguity and incompatible user interfaces
Meta-Searching
• Meta-Search Engines may
simultaneously search multiple Open
Web and Hidden Web sites in order
to increase content coverage,
precision, relevance and/or search
efficiency and effectiveness.
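To illustrate the idea of sending one query to several sources at the same time, here is a minimal Python sketch. The two search functions are hypothetical stand-ins that return canned strings; a real meta-search engine would call each site's own search interface, which the slides do not describe.

```python
from concurrent.futures import ThreadPoolExecutor


def search_open_web(query):
    # Hypothetical stand-in for a call to an Open Web search engine.
    return [f"open-web result for {query!r}"]


def search_hidden_web(query):
    # Hypothetical stand-in for a call to a Hidden Web content database.
    return [f"database record for {query!r}"]


def broadcast_search(query, sources):
    """Send the same query to every source at once; collect results per source."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in sources.items()}
        return {name: future.result() for name, future in futures.items()}


# Illustrative source registry; the names are assumptions, not ToxSeek's.
SOURCES = {"open_web": search_open_web, "hidden_web": search_hidden_web}
results = broadcast_search("benzene toxicity", SOURCES)
```

Running the sources in a thread pool means total wait time is roughly that of the slowest source rather than the sum of all of them.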
Overlap Among 3 Major Search Engines
http://missingpieces.dogpile.com/whitepaper.pdf
http://comparesearchengines.dogpile.com/OverlapAnalysis.pdf
Overlap Among AskJeeves, Google, MSN and Yahoo
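As a rough illustration of what an overlap analysis measures, the following Python sketch counts how many engines returned each URL. The engine names and result lists are made up for the example and are not taken from the cited studies.

```python
from collections import Counter


def overlap_report(results_by_engine):
    """Count how many engines returned each URL, then tally URLs by that count."""
    engines_per_url = Counter()
    for urls in results_by_engine.values():
        engines_per_url.update(set(urls))  # set(): count each engine once per URL
    return Counter(engines_per_url.values())


# Made-up first-page results for three hypothetical engines.
sample = {
    "engine_a": ["u1", "u2", "u3"],
    "engine_b": ["u2", "u3", "u4"],
    "engine_c": ["u3", "u5"],
}
report = overlap_report(sample)
```

In this toy data, `report[1]` is the number of URLs found by only one engine; a high share of such URLs is the usual argument for searching more than one engine.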
Google Isn’t Everything!
http://www.forbes.com/business/free_forbes/2005/0815/056.html?partner=yahoomag
Generations of Meta-Search Engines
• First Generation
  • “Broadcast” or “Federated” search
    – List of results
• Second Generation
  • Merging and Ranking
    – Increased coverage
• Third Generation
  • Result Clustering
    – Focused drill-down
    – Dynamic Query Mods
• Next Generation
  • Semantic and Pragmatic Intelligence
    – tamas.nlm.nih.gov/metasearch/
    – toxseek.nlm.nih.gov
    – http://bestmeta.com
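The "Merging and Ranking" step can be sketched in Python as follows. The slides do not say which merging method is used, so this example substitutes reciprocal rank fusion, one common technique; the constant `k=60` and the sample lists are assumptions for illustration only.

```python
from collections import defaultdict


def merge_and_rank(ranked_lists, k=60):
    """Merge several ranked result lists with reciprocal rank fusion.

    Each URL scores 1 / (k + rank) in every list it appears in; the scores
    are summed, so URLs ranked well by several engines rise to the top.
    """
    scores = defaultdict(float)
    for urls in ranked_lists:
        for rank, url in enumerate(urls, start=1):
            scores[url] += 1.0 / (k + rank)
    # Sort by descending score; break ties alphabetically for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Made-up rankings from three hypothetical engines.
merged = merge_and_rank([["a", "b", "c"], ["b", "c", "d"], ["b", "a"]])
```

Because the method uses only rank positions, it does not need the engines to expose comparable relevance scores.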
Moving Targets:
Nine Search Engines Compared
By Ben Patterson (May 9, 2005)
http://reviews.cnet.com/4520-10572_7-6219242-2.html?tag=txt
Moving Targets
and the need for
Automatic Change Detection and Monitoring
and
Integrating New Capabilities
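One simple way to approach automatic change detection is to fingerprint each source's search page and compare it with a stored value. The Python sketch below is only an assumption about how such monitoring could work; the slides do not describe a specific method, and the page snippets and source names are invented.

```python
import hashlib


def fingerprint(page_text):
    # Collapse whitespace so trivial reformatting is not reported as a change.
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def detect_changes(known_fingerprints, current_pages):
    """Return names of sources whose page fingerprint differs from the stored one."""
    changed = []
    for name, text in current_pages.items():
        if known_fingerprints.get(name) != fingerprint(text):
            changed.append(name)
    return changed


# Invented search-form snippets for two hypothetical engines.
known = {
    "engine_a": fingerprint("<form action='/search'>"),
    "engine_b": fingerprint("<form action='/find'>"),
}
current = {
    "engine_a": "<form  action='/search'>\n",  # whitespace change only
    "engine_b": "<form action='/query'>",      # the form target has moved
}
changed = detect_changes(known, current)
```

A source flagged this way would then need its query wrapper reviewed before it is trusted again.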
The ToxSeek Meta-Search and Clustering Project
• Goals:
– Integrate best-practice Information Retrieval and
Natural Language Processing techniques with AI
heuristics to create an advanced general-purpose meta-search, result clustering and knowledge discovery tool
– Apply ToxSeek to efficiently access diverse biomedical
and environmental health information resources
– Create specialized applications for accessing quality
information sources on HIV/AIDS, consumer health,
homeland security, public health law, library research
and other applications
ToxSeek Features
• Integrates multiple spellcheckers and sophisticated lexical,
morphologic, syntactic and semantic resources
• Merges and ranks the results from heterogeneous
information sources
• Employs an efficient Natural Language Phrase Parser and AI
heuristics to automatically identify Key Concepts and their
Associations in queries and retrieved documents
• Uses the automatically identified Key Concepts and
Associations to create topical Result Clusters
• Supports focused multi-concept drill-down, dynamic query
refinement, multi-media and limited question answering
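The clustering feature can be illustrated with a deliberately crude Python sketch. It treats any non-stopword term outside the query as a "key concept", which is far simpler than the phrase parser and heuristics described above; the stopword list, function names and sample titles are all assumptions.

```python
from collections import defaultdict

# Tiny illustrative stopword list; a real system would use richer lexical resources.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def key_concepts(text, query):
    """Crude stand-in for concept extraction: non-stopword terms not in the query."""
    query_terms = set(query.lower().split())
    words = [w.strip(".,:;()").lower() for w in text.split()]
    return {w for w in words if w and w not in STOPWORDS and w not in query_terms}


def cluster_results(titles, query, min_size=2):
    """Group result titles under each concept shared by at least min_size titles."""
    clusters = defaultdict(list)
    for title in titles:
        for concept in key_concepts(title, query):
            clusters[concept].append(title)
    return {c: ts for c, ts in clusters.items() if len(ts) >= min_size}


# Made-up result titles for the query "benzene".
titles = [
    "Benzene exposure and leukemia risk",
    "Occupational benzene exposure limits",
    "Leukemia treatment options",
]
clusters = cluster_results(titles, "benzene")
```

Selecting a cluster label such as "exposure" and appending it to the query is one way to picture the focused drill-down and dynamic query refinement listed above.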
ToxSeek Implementation
• Production applications and research prototypes
have been implemented for meta-searching
diverse content on:
– Toxicology and Environmental Health
– Consumer Health
– Library Catalogs and Proprietary Databases
– HIV/AIDS
– BioDefense
– Homeland Security
• “Shift Happens…”
– http://library.nps.navy.mil/home/staff/gmarlatt/HSDL%20ALI%20April%202005%20%20final%20rev%207%20april.ppt
ToxSeek Web Search Query: “terrorism”
ToxSeek Query: “police state”
Win the Search Engine Wars
with Intelligent Meta-Search and Clustering Technology
http://tamas.nlm.nih.gov/metasearch/
http://toxseek.nlm.nih.gov
Tamas Doszkocs, Ph.D.
Computer Scientist
National Library of Medicine