Information Retrieval
Ugochukwu Chimbo EJIKEME
Structured vs. Unstructured Data
• Unstructured data: corporate information that is not stored in a database.
• In general, "structure" can refer to:
  ◦ The structure of the data itself.
  ◦ The structure of the container that hosts the data.
  ◦ The structure of the access method used to access the data.
Information Retrieval Systems (IRS)
• Information-retrieval systems are used to store and query textual data such as documents. They use a simpler data model than database systems do. Traditional examples of information-retrieval systems are online library catalogs and online document-management systems such as those that store newspaper articles.
Characteristics of IRS
• Documents are typically described by a set of keywords.
• Information in the database is organized simply as a collection of unstructured documents.
• Transactional requirements are a lesser concern than in database systems.
Relevance Ranking
• Using Terms (Keywords)
  ◦ Ranking Using TF-IDF
  ◦ Similarity-Based Retrieval
• Using Hyperlinks (Web)
  ◦ Popularity Ranking (Prestige Ranking)
  ◦ PageRank
• Combining TF-IDF and Popularity Ranking Measures
Ranking Using TF-IDF
• Term frequency (TF): the relevance of a document (d) to a term (t).
  ◦ Multiple-keyword queries: sum the term frequencies over the n query terms:
    $\sum_{i=1}^{n} TF(d, t_i)$
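• Inverse document frequency (IDF): a term that occurs in many documents carries less weight.
  ◦ Query: “Facebook Ugo”?
• Relevance therefore combines the TF and the IDF of each query term.
• Proximity: the closer the query words are to each other in the document, the higher the rank.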
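The TF-IDF scoring outlined on this slide can be sketched in Python. This is only an illustration under stated assumptions: the slide does not give its formulas, so TF(d, t) = log(1 + n(d,t)/n(d)) and IDF(t) = 1/n(t) are one common textbook formulation assumed here, and the toy corpus and the names `tf`, `idf`, `relevance`, and `rank` are hypothetical.

```python
import math

# Hypothetical toy corpus: document id -> list of words.
DOCS = {
    "d1": "facebook ugo profile ugo".split(),
    "d2": "facebook news facebook login".split(),
    "d3": "library catalog search".split(),
}

def tf(doc_words, term):
    """Term frequency, assumed here to be log(1 + n(d,t) / n(d))."""
    return math.log(1 + doc_words.count(term) / len(doc_words))

def idf(docs, term):
    """Inverse document frequency, assumed here to be 1 / n(t),
    where n(t) is the number of documents that contain the term."""
    n_t = sum(1 for words in docs.values() if term in words)
    return 1 / n_t if n_t else 0.0

def relevance(docs, doc_id, query_terms):
    """Sum TF(d, t) * IDF(t) over all query terms."""
    return sum(tf(docs[doc_id], t) * idf(docs, t) for t in query_terms)

def rank(docs, query_terms):
    """Document ids ordered from most to least relevant."""
    return sorted(docs, key=lambda d: relevance(docs, d, query_terms), reverse=True)

print(rank(DOCS, ["facebook", "ugo"]))
```

In this toy corpus the rarer term "ugo" contributes more to the score than "facebook", which appears in two of the three documents. Proximity is not modeled in this sketch.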
Similarity-Based Retrieval
• Retrieve documents similar to a given document.
  ◦ Similarity may be defined on the basis of common terms.
    ◦ Cosine similarity metric
• Relevance feedback: start a new search based on the user's feedback on a prior search.
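The cosine similarity metric mentioned above can be sketched as follows. This is a minimal sketch, assuming documents are represented by raw term counts; a real system would more likely use TF-IDF weights. The function name `cosine_similarity` is hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Cosine of the angle between two term-count vectors."""
    a, b = Counter(words_a), Counter(words_b)
    # Dot product over the terms the two documents share.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty document is similar to nothing.
    return dot / (norm_a * norm_b)

print(cosine_similarity("web search engine".split(), "web search ranking".split()))
```

A value near 1 means the two documents use the same terms in similar proportions; 0 means they share no terms.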
Hyperlinks
• Popularity ranking
  ◦ Rank “popular” documents higher among the set of documents that contain the specified keywords.
  ◦ Determining “popularity”
    ◦ Access rate? But how can accurate data be obtained?
    ◦ Bookmarks? They might be private.
    ◦ Links to related pages? A web crawler can be used to analyze external links.
• Transfer of prestige
  ◦ A link from a popular page x to a page y is treated as conferring more prestige on page y than a link from a not-so-popular page z.
PageRank
• A measure of the popularity of a page, based on the popularity of the pages that link to it.
• Understanding PageRank
  ◦ Random walk model:
    ◦ The PageRank of a page is the probability that a random walker is visiting that page at any given point in time.
• Drawback:
  ◦ It does not take query keywords into account.
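The random walk model can be illustrated with a short power-iteration sketch. The slide gives no formula, so several things here are assumptions: the damping factor of 0.85 (the probability that the walker follows a link rather than jumping to a random page), the treatment of pages without outgoing links, the fixed iteration count, and the hypothetical three-page graph `LINKS`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by repeatedly redistributing rank along links.

    links maps each page to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the walker jumps to a random page.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Otherwise the walker follows one outgoing link at random.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: from a page with no links, jump anywhere.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical graph: x and z both link to y, and y links back to x.
LINKS = {"x": ["y"], "y": ["x"], "z": ["y"]}
ranks = pagerank(LINKS)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The ranks sum to 1, matching their reading as probabilities. Note that the result depends only on the link graph, which is the drawback named above: no query keywords are involved.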
Other Measures of Popularity
• Click fraction
  ◦ The search engine provides an indirect link through the search engine site, which records the page click and transparently redirects the browser to the original link.
• Anchor text + PageRank
• Anchor text + PageRank + TF-IDF measures
• The HITS algorithm:
  ◦ Computes popularity using a set of related pages only.
  ◦ Hubs and authorities
    ◦ Hub: a page that stores links to many related pages (it may not itself contain actual information on a topic).
    ◦ Authority: a page that contains actual information on a topic (it may not store links to many related pages).
    ◦ Each page gets a prestige value as a hub (hub-prestige) and another prestige value as an authority (authority-prestige).
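The hub-prestige and authority-prestige values can be computed iteratively, as in the sketch below. It follows the usual mutual-reinforcement formulation of HITS (a good hub links to good authorities, and a good authority is linked to by good hubs). The normalization step, the iteration count, and the small graph `LINKS` are assumptions made for illustration.

```python
import math

def normalize(scores):
    """Scale scores so that the sum of their squares is 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: (v / norm if norm else 0.0) for p, v in scores.items()}

def hits(links, iterations=20):
    """Return (hub_prestige, authority_prestige) for a set of related pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority-prestige: sum of the hub-prestige of pages linking in.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for t in targets:
                auth[t] += hub[source]
        # Hub-prestige: sum of the authority-prestige of pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        auth, hub = normalize(auth), normalize(hub)
    return hub, auth

# Hypothetical pages: two link collections pointing at two content pages.
LINKS = {"portal": ["article1", "article2"], "blog": ["article1"]}
hub, auth = hits(LINKS)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

Here "portal" ends up with the highest hub-prestige and "article1" with the highest authority-prestige, while the content pages, which link to nothing, get a hub-prestige of zero.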
Search Engine Spamming
• The practice of creating web pages, or sets of web pages, designed to get a high relevance rank for some queries even though the sites are not actually popular.
Synonyms, Homonyms, and Ontologies
• Synonyms
  ◦ Define alternative words for keywords.
    ◦ E.g., class room <==> (class or lecture) room
• Homonyms
  ◦ Single words with multiple meanings.
• Concept-based querying
  ◦ Analyze each document to disambiguate each word in the document, and replace it with the concept that it represents; disambiguation is usually done by looking at the surrounding words in the document.
• Ontologies are hierarchical structures that reflect relationships between concepts.
  ◦ Common relationships include is-a, part-of, etc.
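The synonym example above, where "class room" is treated as "(class or lecture) room", can be sketched as a simple query expansion. This is a minimal sketch: the synonym table `SYNONYMS` and the functions `expand_query` and `matches` are hypothetical, and no disambiguation of homonyms is attempted.

```python
# Hypothetical synonym table: keyword -> set of alternative words.
SYNONYMS = {"class": {"lecture"}}

def expand_query(terms, synonyms):
    """For each query term, return the set of words accepted in its place."""
    return [{t} | synonyms.get(t, set()) for t in terms]

def matches(doc_words, expanded_query):
    """True if the document contains at least one alternative for every term."""
    words = set(doc_words)
    return all(words & alternatives for alternatives in expanded_query)

query = expand_query(["class", "room"], SYNONYMS)
print(matches("the lecture room is full".split(), query))
```

Each query term becomes an OR over its alternatives, and the terms are then combined with AND, so a document mentioning "lecture room" matches the query "class room".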
Indexing of Documents
• Inverted index
  ◦ Maps each keyword Ki to a list Si of the documents that contain Ki.
  ◦ Example for one keyword across Document 1 (d1), Document 2 (d2), and Document 3 (d3):
      d1: 56, 89, 201
      d2: 12, 18, 19
      d3: 5
  ◦ Inverted index = “d1/56,89,201; d2/12,18,19; d3/5”
  ◦ The index may also include the term frequency in each document.
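An inverted index of this shape can be sketched in a few lines of Python. The sketch assumes that the numbers stored per document are the word positions at which the keyword occurs, which the slide does not state explicitly; the sample documents and function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each keyword to {document id: [positions of the keyword]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(position)
    return index

def docs_containing_all(index, keywords):
    """Answer a keyword query by intersecting the document lists."""
    doc_sets = [set(index.get(k, {})) for k in keywords]
    return set.intersection(*doc_sets) if doc_sets else set()

# Hypothetical documents.
DOCS = {
    "d1": "retrieval of text and retrieval of images",
    "d2": "text search",
    "d3": "image search",
}
index = build_inverted_index(DOCS)
print(index["retrieval"])
print(docs_containing_all(index, ["text", "search"]))
```

The term frequency of a keyword in a document is simply the length of its position list, so it does not need to be stored separately in this representation.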
Measuring Retrieval Effectiveness
• Keywords are maintained in a compressed form (to keep the space usage of the index low).
  ◦ The index is sometimes stored such that retrieval is approximate: a few relevant documents may not be retrieved (called a false drop or false negative), or a few irrelevant documents may be retrieved (called a false positive).
Measurement Metrics
• Precision
  ◦ Measures the percentage of the retrieved documents that are relevant to a given query.
• Recall
  ◦ Measures the percentage of the documents relevant to the query that were retrieved.
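Both metrics can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below returns fractions rather than percentages; the function name and the sample document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = retrieved & relevant  # Relevant documents that were retrieved.
    precision = len(correct) / len(retrieved) if retrieved else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall

# d3 and d4 are false positives; d5 is a false drop (false negative).
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d5"],
)
print(precision, recall)
```

In this example two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were retrieved (recall about 0.67).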
Beyond Page Ranking
• Information extraction
  ◦ Converts information from textual form to a more structured form.
  ◦ Sample application: Google Scholar.
• Question answering
  ◦ The system attempts to provide direct answers to questions posed by users.
Summary
• Information-retrieval systems are used to store and query textual data such as documents.
• Queries attempt to locate documents that are of interest by specifying, for example, sets of keywords.
• Relevance ranking makes use of several types of information, such as:
  ◦ Term frequency: how important each term is to each document.
  ◦ Inverse document frequency.
  ◦ Popularity ranking.
• Search engine spamming attempts to get an (undeserved) high ranking for a page.
• Synonyms and homonyms complicate the task of information retrieval. Concept-based querying aims at finding documents containing specified concepts, regardless of the exact words (or language) in which the concept is specified. Ontologies are used to relate concepts using relationships such as is-a or part-of.
• Inverted indices are used to answer keyword queries.
• Precision and recall are two measures of the effectiveness of an information retrieval system.
• Techniques have been developed to extract structured information from textual data and to give direct answers to simple questions posed in natural language.