Information Retrieval
Ugochukwu Chimbo EJIKEME
Structured Vs Unstructured Data
Corporate information that is not stored in a database
In General
* The structure of the data itself.
* The structure of the container that hosts the data.
* The structure of the access method used to access
the data.
Information Retrieval Systems (IRS)
Information-retrieval systems are used to store
and query textual data such as documents. They
use a simpler data model than do database
systems. Traditional examples of information-retrieval systems are online library catalogs and
online document-management systems such as
those that store newspaper articles.
Characteristics of IRS
Documents are typically described by a set of
keywords.
Information in the database is organized simply as a
collection of unstructured documents.
Transactional requirements are less of a concern than in database systems.
Relevance Ranking
Using Terms (Keywords)
Ranking Using TF-IDF
Similarity-Based Retrieval
Hyperlinks (WEB)
Popularity Ranking (prestige ranking)
PageRank
Combining TF-IDF and Popularity Ranking
Measures
Ranking using TF-IDF
Term Frequency (TF) – Relevance of a document
(d) to a term (t).
“Multiple Keyword” Queries? Sum the term frequencies over the n query terms t_1, ..., t_n:
\[ \sum_{i=1}^{n} TF(d, t_i) \]
Inverse document frequency (IDF)
Query: “Facebook Ugo”???.
Relevance therefore:
Proximity: the closer the query words are to each other in
the document, the higher the rank.
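The ranking idea on this slide can be sketched in Python. This is a minimal illustration, not a definitive implementation. It assumes one common textbook formulation that the slides do not spell out: TF(d, t) = log(1 + n(d, t) / n(d)) and IDF(t) = 1 / n(t), where n(t) is the number of documents containing t. The function names and the sample documents are hypothetical.

```python
import math
from collections import Counter

def term_frequency(doc_tokens, term):
    """TF(d, t) = log(1 + n(d, t) / n(d)); an assumed formulation."""
    if not doc_tokens:
        return 0.0
    occurrences = Counter(doc_tokens)[term]
    return math.log(1 + occurrences / len(doc_tokens))

def inverse_document_frequency(docs, term):
    """IDF(t) = 1 / n(t), where n(t) counts the documents containing t."""
    n_t = sum(1 for tokens in docs.values() if term in tokens)
    return 1.0 / n_t if n_t else 0.0

def relevance(docs, doc_id, query_terms):
    """Sum TF * IDF over all query terms (the multiple-keyword case)."""
    return sum(
        term_frequency(docs[doc_id], t) * inverse_document_frequency(docs, t)
        for t in query_terms
    )

def rank(docs, query_terms):
    """Return document ids ordered from most to least relevant."""
    scores = {d: relevance(docs, d, query_terms) for d in docs}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical collection, tokenized by whitespace for simplicity.
docs = {
    "d1": "facebook is a social network".split(),
    "d2": "ugo posted on facebook about ugo".split(),
    "d3": "a library catalog".split(),
}
ranking = rank(docs, ["facebook", "ugo"])
print(ranking)
```

The rare term "ugo" appears in only one document, so it contributes more to the score than "facebook", which appears in two. Proximity is not modeled in this sketch.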
Similarity-Based Retrieval
Retrieve documents similar to a given document.
Similarity may be defined on the basis of common terms.
Cosine similarity metric
Relevance feedback – start new search based on
user feedback on prior search.
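As a rough sketch of the cosine similarity metric mentioned above, the Python snippet below compares two documents as term-count vectors. Using raw counts rather than TF-IDF weights is a simplifying assumption, and the function name is hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-count vectors."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    # The dot product only needs the terms the two documents share.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

same = cosine_similarity("online library catalog".split(),
                         "online library catalog".split())
partial = cosine_similarity(["a", "b"], ["a", "c"])
disjoint = cosine_similarity(["a", "b"], ["c", "d"])
```

Documents with identical term distributions score close to 1, and documents with no terms in common score 0.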
Hyperlink
Popularity Ranking
Rank “popular” documents higher among the set of
documents that contain the specified keywords.
Determining “Popularity”
Access rate ?
How to get accurate data?
Bookmarks?
Might be private?
Links to related pages?
Use a web crawler to analyze external links.
transfer of prestige
a link from a popular page x to a page y is treated as
conferring more prestige to page y than a link from
a not-so-popular page z.
PageRank
A measure of popularity of a page based on
the popularity of pages that link to the page.
Understanding PageRank.
Random walk model:
The PageRank of a page is the probability that a random
walker is visiting that page at any given point in time.
Drawback:
does not take query keywords into account.
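The random walk model can be sketched as a simple power iteration in Python. This is an illustration under stated assumptions, not the production algorithm. The damping factor of 0.85 (the probability that the walker follows a link instead of jumping to a random page), the uniform treatment of pages without outgoing links, and the tiny link graph are all assumptions that the slides do not specify.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeatedly redistributing rank along links.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {p: (1 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page without outgoing links spreads its rank uniformly.
                share = damping * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank

# Hypothetical graph: x and z both link to y, and y links back to x.
links = {"x": ["y"], "z": ["y"], "y": ["x"]}
ranks = pagerank(links)
```

In this graph y ends up with the highest rank because two pages link to it, and x outranks z because it receives a link from the popular page y. The ranks sum to 1, matching their reading as probabilities.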
Other Measures of Popularity
Click fraction
search engine provides an indirect link through the
search engine site, which records the page click, and
transparently redirects the browser to the original
link.
Anchor text + PageRank
Anchor text + PageRank + TF–IDF measures
The HITS algorithm:
computes popularity using only a set of related pages.
Hubs and Authorities
Hub - A page that stores links to many related pages
(may not in itself contain actual information on a
topic)
Authority - A page that contains actual information on
a topic (may not store links to many related pages).
Each page gets a prestige value as a hub (hub-prestige), and another prestige value as an authority
(authority-prestige).
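The hub and authority values described above are usually computed by mutual reinforcement: a page's authority-prestige is the sum of the hub-prestige of the pages linking to it, and its hub-prestige is the sum of the authority-prestige of the pages it links to. The Python sketch below illustrates that iteration. The fixed iteration count, the normalization step, and the example pages are assumptions.

```python
import math

def hits(links, iterations=50):
    """Return (hub, authority) prestige values for a set of related pages.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority-prestige: sum of hub-prestige of pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub-prestige: sum of authority-prestige of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the values do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical pages: two link collections pointing at two content pages.
links = {"portal": ["paper1", "paper2"], "blog": ["paper1"]}
hub, auth = hits(links)
```

Here "portal" gets the highest hub-prestige because it links to both content pages, and "paper1" gets the highest authority-prestige because both hubs link to it.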
Search Engine Spamming
Practice of creating Web pages, or sets of Web
pages, designed to get a high relevance rank for
some queries, even though the sites are not
actually popular sites.
Synonyms, Homonyms, and
Ontologies
Synonyms
Define alternative words for keywords
E.g., classroom <==> (class or lecture) room
Homonyms
single words with multiple meanings
Concept-based querying
analyze each document to disambiguate each word
in the document, and replace it with the concept
that it represents; disambiguation is usually done by
looking at other surrounding words in the
document.
Ontologies are hierarchical structures that reflect
relationships between concepts.
Common relationships include: is-a, part-of, etc.
Indexing of Documents
Inverted index
maps each keyword Ki to a list Si of the documents
that contain Ki.
Document 1 (d1): 56, 89, 201
Document 2 (d2): 12, 18, 19
Document 3 (d3): 5
Inverted Index = “d1/56,89,201; d2/12,18,19; d3/5”
*May also include Term Frequency in documents.
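A minimal Python sketch of an inverted index follows. It assumes that the numbers in the example above are word positions within each document, which the slide does not state explicitly. The function names, the whitespace tokenization, and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to {doc_id: [positions where it occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word][doc_id].append(position)
    return index

def search(index, keywords):
    """Return the ids of documents that contain every keyword."""
    doc_sets = [set(index.get(k.lower(), {})) for k in keywords]
    return set.intersection(*doc_sets) if doc_sets else set()

docs = {
    "d1": "information retrieval systems",
    "d2": "database systems store information",
    "d3": "retrieval of documents",
}
index = build_inverted_index(docs)
both = search(index, ["information", "systems"])
# The count of occurrences per document is the length of the position list.
count_in_d2 = len(index["information"]["d2"])
```

A multi-keyword query is answered by intersecting the document lists of the individual keywords, so no document has to be scanned at query time.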
Measuring Retrieval Effectiveness
Keywords are maintained in a compressed form
(to keep space usage of the index low).
The index is sometimes stored such that retrieval is
approximate: a few relevant documents may not be
retrieved (called a false drop or false negative), or a
few irrelevant documents may be retrieved (called a
false positive).
Measurement metrics
Precision
Measures the percentage of the retrieved documents that are
relevant to the query.
Recall
Measures the percentage of the documents relevant to
the query that were retrieved.
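The two metrics can be computed directly from the set of retrieved documents and the set of documents known to be relevant. The Python sketch below uses hypothetical document ids, and returning 0.0 for an empty set is an assumed convention.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents retrieved; three are truly relevant, two of them retrieved.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
```

In this example d2 and d4 are false positives, which lowers precision to 2/4, and d5 is a false drop, which lowers recall to 2/3.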
Beyond Page Ranking
Information Extraction
convert information from textual form to a more
structured form.
Sample application: Google Scholar.
Question Answering
system attempts to provide direct answers to
questions posed by users.
Summary
Information-retrieval systems are used to store
and query textual data such as documents.
Queries attempt to locate documents that are of
interest by specifying, for example, sets of
keywords.
Relevance ranking makes use of several types of
information, such as:
◦ Term frequency: how important each term is to
each document.
◦ Inverse document frequency.
◦ Popularity ranking.
Search engine spamming attempts to get (an undeserved)
high ranking for a page.
Synonyms and homonyms complicate the task of
information retrieval. Concept-based querying aims at
finding documents containing specified concepts,
regardless of the exact words (or language) in which the
concept is specified. Ontologies are used to relate concepts
using relationships such as is-a or part-of.
Inverted indices are used to answer keyword queries.
Precision and recall are two measures of the effectiveness
of an information retrieval system.
Techniques have been developed to extract structured
information from textual data and to give direct answers to
simple questions posed in natural language.