Information Retrieval - University of Illinois at Chicago

Sample Searches
• Query: smalheiser
• Response: DBLP Neil Smalheiser
PNAS Abstract Smalheiser et al 97
MBC abstract Smalheiser
…
UIC Dept of Psychiatry Neil Smalheiser
Query: computer science genetics
• Campus program: university, campus,
college and employment resources
• SpringerLink (On-line journals and books
in science, technology and medicine)
• Course ( Advanced topics in computer
science and computational genetics)
• Annual Review of Computer Science
• …
Comments for Slides 1-2
The first two searches are intended to show that Google is reasonably good
at retrieving Web pages when given a few keywords. In the first query,
the name Smalheiser is submitted. Ideally, the home page of Smalheiser
should be retrieved first. Instead, his publications in computer science are
retrieved in the first document. This is followed by some of his publications
in the medical area. Finally, his home page in Psychiatry is retrieved.
The second query asks for important documents in the intersection of the two
areas “computer science” and “genetics”. The first retrieved document seems
to be unrelated to the query. The second retrieved document seems ok.
The third document is a course in both areas.
The examples show that Google is still far from perfect.
Information Retrieval
Document Representation:
Remove stop words:
Eg. In “Automatically Identifying
Gene terms in MEDLINE Abstracts”
Remove “in”
Stemming: “Automatically” becomes “automatic”;
“Identifying” becomes “identify”;
“Abstracts” becomes “abstract”
Comments on last slide
The first two steps in constructing a document
representation consist of eliminating non-content words
and mapping variations of the same word to the same
stem via a process called stemming.
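The two steps can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the three suffix rules are toy assumptions chosen to cover the slide's example title, not a real stemmer, and the function names are hypothetical.

```python
import re

# Toy stop-word list and suffix rules (assumptions; real systems use larger
# lists and a proper stemming algorithm).
STOP_WORDS = {"in", "the", "of", "a", "an", "and"}
SUFFIXES = ("ally", "ing", "s")  # checked in this order


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def content_terms(text):
    """Lower-case the text, drop stop words, and stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


title = "Automatically Identifying Gene terms in MEDLINE Abstracts"
terms = content_terms(title)
print(terms)
```

On the example title this yields the six content terms automatic, identify, gene, term, medline and abstract.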
Document representation
• Document is a set of content words or
terms:
{ automatic, identify, gene, term, medline,
abstract}
Sometimes, keep locations of terms. Eg.
“automatic” first word in title
Comments on last slide
Location information can be of importance in
differentiating the ordering of content
words in a query from other orderings of
the same words. It is also useful in
determining phrases.
Assign weights to terms
Term frequency: no. of times the term occurs in the
document
Document frequency: no. of documents having the term
The weight of a term in a document: proportional to term
frequency, inversely proportional to document frequency
Eg term frequency * log ( N/document frequency)
N = no. of documents in collection
Comments on the last slide
The well-known tf-idf weighting scheme to
assign a weight to a term is given. The
weight is proportional to the term
frequency and inversely proportional to its
document frequency. There are numerous
variations of this formula, but all of them
have the property that higher weights are
given to terms with higher term
frequencies and lower document
frequencies.
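As a small worked illustration, the formula term frequency * log(N / document frequency) can be computed as follows. The three toy documents are assumptions made up for the example, and the natural logarithm is used because the slide does not fix a base.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights


# Toy collection: each document is already a list of stemmed content terms.
docs = [
    ["gene", "gene", "abstract"],
    ["gene", "term"],
    ["medline", "abstract"],
]
w = tf_idf(docs)
print(w[1])
```

In the second document, "term" occurs in only one document of the collection and so receives a larger weight than "gene", which occurs in two.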
Other factors in assigning weights
Terms in title;
Terms in abstract;
Terms in big fonts etc.
get heavier weights
Comments to last slide
If a term occurs in the title, it usually gets a higher
weight than the same term occurring in the main
text. This may also apply to a term appearing in the
abstract. If the term occurs in big fonts or in a way
that attracts the reader's attention, it should also
get a higher weight.
All these situations can be implemented by
assuming that each occurrence of such a term is
equivalent to k occurrences of the same term in
the main text with k >1.
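A minimal sketch of this counting rule is given below. The multipliers 3 and 2 are hypothetical; the slides only require k > 1 for the emphasized locations.

```python
from collections import Counter

# Hypothetical multipliers: one title occurrence counts as 3 body occurrences.
K_BY_FIELD = {"title": 3, "abstract": 2, "body": 1}


def weighted_term_frequency(fields):
    """fields maps a field name to the list of terms found in that field."""
    tf = Counter()
    for field, terms in fields.items():
        k = K_BY_FIELD[field]
        for term in terms:
            tf[term] += k  # each occurrence counts as k body occurrences
    return tf


tf = weighted_term_frequency({
    "title": ["gene"],
    "abstract": ["gene", "term"],
    "body": ["gene", "abstract"],
})
print(tf)
```

The resulting counts can then be used as the term frequency in the weighting formula.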
Query representation
Two common models:
Vector space model: query as a set
of terms, possibly ordered
Boolean Model: Terms connected by
“AND”, “OR” and “NOT”
Comments to the last slide
In the information retrieval literature, it has been shown that
the vector space model is usually better than the Boolean
Model, because if a query contains quite a few terms
which are connected by “AND”s, then there may not be a
document satisfying the query. If the terms are
connected by “OR”s, then there may be too many
unordered documents satisfying the query and the user
has no efficient way to identify the useful documents
from the irrelevant ones.
In practice, it is likely that a hybrid model having features of
both models is used for effective retrieval.
Vector space model
Each dimension of a vector
represents a distinct term;
#dimensions = all terms in the
collection, including proper names
Eg. Automatic identify gene …
( 1, 1, 1, … )
Compute the similarity between a
query and a document
Q = (q_1, …, q_n), D = (d_1, …, d_n)
Dot[Q, D] = \sum_i q_i d_i
(#terms in common, favors long documents)
Norm(D) = \sqrt{\sum_i d_i^2}
Cosine(Q, D) = Dot[Q, D] / (Norm(D) * Norm(Q))
Comments on last slide
When the documents are binary vectors, the Dot product
similarity function obtains the number of terms in
common between the two vectors. When the terms are
weighted, the weights are incorporated into the similarity
function. Clearly, this favors a long document such as an
encyclopedia.
To compensate for this, the norm (length) of a document is
included in the denominator of the similarity function so
that a longer document gets a larger denominator. The
query norm is used to ensure that the Cosine function
returns a value between 0 and 1, if all terms have nonnegative values.
When the two vectors differ from each
other by a positive multiplicative constant, their angular
distance is 0 and the Cosine value is 1.
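The Dot, Norm and Cosine functions can be sketched directly. The vectors below are toy assumptions over a three-term vocabulary; they show the Dot product favoring the longer document while the Cosine does not.

```python
import math


def dot(q, d):
    """Sum of products of corresponding components."""
    return sum(qi * di for qi, di in zip(q, d))


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))


def cosine(q, d):
    """Dot product divided by both norms; 0.0 if either vector is all zeros."""
    denom = norm(q) * norm(d)
    return dot(q, d) / denom if denom else 0.0


q = [1, 1, 0]          # query
short_doc = [1, 1, 0]  # contains exactly the query terms
long_doc = [3, 3, 4]   # longer document with extra material

print(dot(q, short_doc), dot(q, long_doc))
print(cosine(q, short_doc), cosine(q, long_doc))
```

The long document wins on the Dot product (6 against 2), but the short document, which points in the same direction as the query, has the higher Cosine value.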
Boolean model
gene AND abstract;
( sometimes, uses “+” to ensure the term
needs to be present)
gene OR abstract;
gene AND NOT abstract;
( uses “-” to indicate undesired
terms; Eg. +gene –abstract)
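The three Boolean forms map onto set operations over an inverted index. The following is a minimal sketch; the three toy documents and the helper name `posting` are assumptions made for illustration.

```python
# Toy collection: document id -> set of content terms.
docs = {
    1: {"gene", "abstract", "medline"},
    2: {"gene", "term"},
    3: {"abstract", "term"},
}

# Inverted index: term -> set of ids of the documents containing it.
index = {}
for doc_id, terms in docs.items():
    for term in terms:
        index.setdefault(term, set()).add(doc_id)


def posting(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())


gene_and_abstract = posting("gene") & posting("abstract")
gene_or_abstract = posting("gene") | posting("abstract")
gene_and_not_abstract = posting("gene") - posting("abstract")

print(gene_and_abstract, gene_or_abstract, gene_and_not_abstract)
```

Note that the results are plain sets: the Boolean model by itself gives no ordering among the documents that satisfy the query.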
Other features
Phrase search: “information retrieval”
Proximity search: information NEAR retrieval
Date search: 2003
Field search: Eg in the field “Author”, look for
“Neal”
Wildcard search: smal*er
Comments on last slide
Some systems require a query phrase such
as “information retrieval” to be placed in
quotes. This may require a retrieved document to have
exactly such a phrase.
If a document containing the words “retrieval of information”
is desired, the query can be reformulated as
“Information” near “retrieval”.
Filtering operations can be specified by filling in additional
information in specific fields such as the author field.
Wildcard entries such as smal*er, where “*” denotes zero
or more characters, are allowed, provided that “*” does
not occur in the first few characters (say 3); otherwise
the space for searching matching strings will be too
large.
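One way to picture the wildcard restriction is sketched below. The vocabulary is a toy assumption, and `MIN_PREFIX` stands in for the "say 3" of the comment; the literal prefix is used to narrow the candidate words before the full pattern is tested.

```python
import re

MIN_PREFIX = 3  # assumed minimum number of literal characters before the first "*"


def wildcard_matches(pattern, vocabulary):
    """Return the vocabulary words matching a pattern in which * means any string."""
    prefix = pattern.split("*", 1)[0]
    if "*" in pattern and len(prefix) < MIN_PREFIX:
        raise ValueError("'*' occurs too early; the search space would be too large")
    # Translate the pattern: literal pieces are escaped, each * becomes .*
    regex = re.compile(".*".join(re.escape(piece) for piece in pattern.split("*")))
    candidates = [w for w in vocabulary if w.startswith(prefix)]
    return [w for w in candidates if regex.fullmatch(w)]


vocabulary = ["smalheiser", "smaller", "small", "smelter", "retrieval"]
print(wildcard_matches("smal*er", vocabulary))
```

Here only the words beginning with "smal" are examined, which is the reason for requiring a few literal characters before the "*".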
Additional features
Case sensitive: java gets
java, Java, JAVA;
Java gets Java and possibly JAVA ( first capital
letter implies a proper name )
ordered query terms eg. stray dog
spelling error: if no such word,
some search engines suggest similar words
Comments on last slide
Location information in documents and the query
can be used to differentiate stray dog from dog
stray.
If a word does not exist in the index of all words in
the documents, then some search engines may
suggest some neighboring words which differ
from the misspelled word by 1 or 2 characters.
Note that proper names are included in the
index.
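The suggestion step can be sketched as follows, interpreting "differ by 1 or 2 characters" as an edit distance of at most 2. That interpretation, the toy index and the function names are assumptions; actual search engines may use other measures.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def suggest(word, index_words, max_distance=2):
    """Suggest index words close to a query word that is not in the index."""
    if word in index_words:
        return []  # the word exists, nothing to suggest
    scored = sorted((edit_distance(word, w), w) for w in index_words)
    return [w for distance, w in scored if distance <= max_distance]


index_words = {"retrieval", "retrieve", "information", "smalheiser"}
print(suggest("retreival", index_words))
```

Comparing the query word with every index word is only practical for a small index; it is used here to keep the sketch short.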
Directory search
Specify subtree:
computer
finance
hardware software ………..
medicine
…………….
……………….
query “memory” under computer means
computer memory vs human memory in
medicine
Comments on last slide
Directory search may reduce ambiguities.
In the given directory, documents or pages are
classified under each node. For example, there
is a set of documents which are classified under
computer and another class under medicine.
The former class contains documents about
computer memory while the latter class contains
documents about human memory. If the query is
restricted to the class “computer”, then only
documents in the former class relating to
computer memory will be retrieved.
Feedback
identify relevant documents and possibly
irrelevant documents
re-formulate query using terms from relevant
documents and from irrelevant documents;
Query: apple; Rel Doc: computer; Irrel: fruit
Modified query: apple, computer, - fruit
Comments to last slide
The user needs to identify relevant documents and
possibly irrelevant documents. Terms from the
relevant documents may be added to the query,
while terms from the irrelevant documents may
be used to exclude documents having such
terms from being retrieved in the next round. In the
example, the term “computer” is found in the
relevant documents and is added to the query,
while the term “fruit” is found in the irrelevant
documents and it is used to exclude documents
having such a word.
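The apple example can be sketched as a small reformulation routine. The matching rule (at least one included term and no excluded term) and the two toy documents are assumptions for illustration; the slides do not specify how the modified query is evaluated.

```python
def reformulate(query, relevant_terms, irrelevant_terms):
    """Add terms from relevant documents; mark terms from irrelevant ones for exclusion."""
    include = list(query)
    for term in relevant_terms:
        if term not in include:
            include.append(term)
    exclude = [t for t in irrelevant_terms if t not in include]
    return include, exclude


def matches(doc_terms, include, exclude):
    """Assumed rule: the document has an included term and no excluded term."""
    return any(t in doc_terms for t in include) and not any(t in doc_terms for t in exclude)


include, exclude = reformulate(["apple"], ["computer"], ["fruit"])
print(include, exclude)

computer_doc = {"apple", "computer", "laptop"}
fruit_doc = {"apple", "fruit", "pie"}
print(matches(computer_doc, include, exclude), matches(fruit_doc, include, exclude))
```

With the modified query, the document about fruit is no longer retrieved in the next round.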
Web
Surface Web: linked together
Deep Web: Not linked; documents can be
generated dynamically by programs
Quite a few medical databases and biomedical databases are in the Deep Web
Comments on the last slide
The Web is roughly classified into the
Surface web and the Deep Web. The pages
in the former are hyperlinked, while pages
in the latter are accessible only by
submitting queries to query interfaces.
Web crawlers which extract content
information from Surface Web pages are
unable to get into Deep Web pages for
lack of hyperlinks.
Retrieval from the surface Web
Anchor text: belong to the document pointed to.
<a href="http://tigger.uic.edu/htbin/cgiwrap/bin/newsbureau/cgibin/index.cgi">More News</a>
Page rank: importance of a Web page
Rank(P) = \sum_i Rank(Q_i) / out(Q_i)
for every Q_i pointing to P; iterative; Web surfing interpretation
Comments on last slide
There are some differences between
retrieval from the Web and from non-Web
sources. In the former case, words known
as anchor texts which appear together
with the link from a page A to another page
B should be utilized for retrieval.
Specifically, the anchor words should be
used as content words for page B, as they
describe the contents of B as observed by
the user who creates A.
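A minimal sketch of collecting anchor text for the target page is shown below, using Python's standard `html.parser` module on the anchor from the slide. The class name `AnchorTextCollector` is hypothetical, and a real crawler would also resolve relative URLs and handle nested markup.

```python
from html.parser import HTMLParser


class AnchorTextCollector(HTMLParser):
    """Collects, for each link target, the words of the anchor text pointing to it."""

    def __init__(self):
        super().__init__()
        self.anchor_terms = {}  # target URL -> list of anchor words
        self._href = None       # target of the <a> element currently open

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href:
            words = data.lower().split()
            self.anchor_terms.setdefault(self._href, []).extend(words)

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None


page_a = ('<a href="http://tigger.uic.edu/htbin/cgiwrap/bin/newsbureau/'
          'cgibin/index.cgi">More News</a>')
collector = AnchorTextCollector()
collector.feed(page_a)
collector.close()
print(collector.anchor_terms)
```

The words "more" and "news" are recorded under the URL of the page pointed to, so they can be indexed as content words of that page.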
Example to illustrate page rank
[Figure: pages A and B each link to page P; A has two outgoing links and B has three]
Rank(P) = ½ Rank(A) + 1/3 Rank(B)
A lot of pages point to the IBM home
page, implying that it has a very high page
rank.
Comments on last slide
The example illustrates how the page rank
of a page can be computed. In practice, all
pages are initialized with the same rank
and the page rank formula is applied to
compute the page ranks of all pages. This
process is repeated until convergence is
reached. Under some reasonable
assumptions, convergence is guaranteed.
The page rank information is utilized to
rank pages for any user query.
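The iteration can be sketched as follows. The four-page link graph is a toy assumption in which A has two outgoing links and B has three, every page has at least one outgoing link, and every page can reach every other page. Practical variants usually add a damping factor, which the slides do not mention and which is omitted here.

```python
def page_rank(links, tol=1e-12, max_iter=1000):
    """Repeat Rank(P) = sum of Rank(Q) / out(Q) over pages Q linking to P."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # same initial rank for all pages
    for _ in range(max_iter):
        new_rank = {p: 0.0 for p in pages}
        for q, targets in links.items():
            share = rank[q] / len(targets)  # Rank(Q) / out(Q)
            for p in targets:
                new_rank[p] += share
        change = max(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:  # stop once the ranks no longer move
            break
    return rank


# Toy graph (an assumption): page -> list of pages it links to.
links = {
    "A": ["P", "B"],
    "B": ["P", "A", "C"],
    "P": ["A"],
    "C": ["A"],
}
ranks = page_rank(links)
print(ranks)
```

At convergence the rank of P equals one half of the rank of A plus one third of the rank of B, as in the example, and the ranks sum to 1.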
Query: IBM
Thousands of pages have that word,
but among those pages having that
word, IBM home page has largest
rank.
Google utilizes page rank
Comments on last slide
There are a number of ways to utilize page ranks
to rank pages for a given query. One way is to
first retrieve pages which have reasonable
similarities with the query. Then the retrieved
pages are re-ranked in descending order of
page rank. Another way is to compute the
relevance of a page based on a function of the
similarity of the page with the query and its page
rank. Then pages are re-ranked in descending
order of relevance.
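Both ways of using page rank can be sketched on made-up numbers. The similarity and page-rank values, the threshold 0.5 and the weight `alpha` are hypothetical, and the weighted sum is only one possible combining function; the slides do not say which function is used in practice.

```python
# Hypothetical similarity and page-rank values for three pages.
pages = {
    "ibm_home": {"similarity": 0.60, "page_rank": 0.90},
    "ibm_news": {"similarity": 0.80, "page_rank": 0.30},
    "fan_page": {"similarity": 0.10, "page_rank": 0.95},
}


def rerank_by_page_rank(pages, min_similarity=0.5):
    """First way: keep reasonably similar pages, then order them by page rank."""
    retrieved = [p for p, s in pages.items() if s["similarity"] >= min_similarity]
    return sorted(retrieved, key=lambda p: pages[p]["page_rank"], reverse=True)


def rank_by_combined_score(pages, alpha=0.5):
    """Second way: order by a function of similarity and page rank (a weighted sum here)."""
    def relevance(p):
        s = pages[p]
        return alpha * s["similarity"] + (1 - alpha) * s["page_rank"]
    return sorted(pages, key=relevance, reverse=True)


print(rerank_by_page_rank(pages))
print(rank_by_combined_score(pages))
```

In the first method, the page with high rank but little similarity to the query is never retrieved; in the second, it is ranked but placed last.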
Authority and Hub
• Query retrieves documents based on
similarities
• Expand this set by adding their parents
and their children
• Compute A(p) = sum H(q) for each edge
(q,p)
• Compute H(p) = sum A(q) for each edge
(p,q)
Authority and Hub continued
• Normalize A(p) and H(p)
• Repeat until A() and H() converge
• Output pages with top authority scores
( It has been shown that convergence is
guaranteed.)
• www.teoma.com
( This company claims to have an advanced
search capability which is more accurate than
the standard authority and hub technique.)
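The authority and hub computation can be sketched as below. The edge list is a toy assumption standing in for the expanded set of pages, a fixed number of iterations replaces an explicit convergence test, and the hub update uses the authority scores just computed in the same round, which is one common ordering.

```python
import math


def normalize(scores):
    """Scale the scores so that their squares sum to 1."""
    length = math.sqrt(sum(v * v for v in scores.values()))
    return {p: v / length for p, v in scores.items()} if length else scores


def hits(edges, iterations=50):
    """edges is a list of (source, target) links among the expanded set of pages."""
    pages = {p for edge in edges for p in edge}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A(p) = sum of H(q) over edges (q, p)
        auth = {p: 0.0 for p in pages}
        for q, p in edges:
            auth[p] += hub[q]
        auth = normalize(auth)
        # H(p) = sum of A(q) over edges (p, q)
        new_hub = {p: 0.0 for p in pages}
        for p, q in edges:
            new_hub[p] += auth[q]
        hub = normalize(new_hub)
    return auth, hub


# Toy graph (an assumption): h1, h2 and h3 link to a1; h1 also links to a2.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
auth, hub = hits(edges)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Page a1, which is pointed to by all three hubs, gets the top authority score, and h1, which points to both authorities, gets the top hub score.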
Various features of different search engines,
including Google, AltaVista, Hotbot etc
Search Engines for the World Wide Web
By Alfred and Emily Glossbrenner, 3rd
edition, Peachpit Press, 2001.
Metasearch engine
Connects to numerous search engines.
Given query Q, finds suitable search
engines to process the query, invokes the
selected search engines to search and
merges their results.
Comments on last slide
Instead of using a search engine such as
Google, a metasearch engine which
connects to numerous search engines can
be utilized. Upon receiving a user query, a
metasearch engine sends the query ( with
possibly some modifications) to
appropriate search engines and merges
and re-ranks the retrieved documents
returned from the invoked search engines.
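The invoke-and-merge step can be sketched as follows. The two engine functions are hypothetical stand-ins that return fixed ranked lists, and the merging rule (summing 1/position over the engines that return a page) is just one simple possibility; the slides do not say how results are merged, and engine selection is left out.

```python
# Hypothetical stand-ins for component search engines:
# each takes a query and returns a ranked list of URLs.
def engine_a(query):
    return ["u1", "u2", "u3"]


def engine_b(query):
    return ["u2", "u4"]


def metasearch(query, engines):
    """Send the query to every engine and merge the ranked lists."""
    scores = {}
    for engine in engines:
        for position, url in enumerate(engine(query), start=1):
            # A page scores more when it is ranked high and by several engines.
            scores[url] = scores.get(url, 0.0) + 1.0 / position
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda u: (-scores[u], u))


merged = metasearch("information retrieval", [engine_a, engine_b])
print(merged)
```

Page u2 comes first in the merged list because it is returned by both engines.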
Advantages of Metasearch Engines
over Search Engines
Do not need substantial hardware relative
to large search engines;
Large coverage;
up-to-date information.
Comments to last slide
There is no need for substantial hardware,
because the searches are done by the
underlying search engines. The coverage
of a metasearch engine is the union of the
coverages of the individual search
engines. That it may have more up-to-date
information than a large search engine will
be explained by the next few slides.
Up-to-date information
• Search engine crawler gets data
• Builds large index database
• Time consuming to update large index
database
• Metasearch engine connects to numerous
small search engines
Comments to last slide
A search engine utilizes a crawler to extract contents from
Surface Web pages and then builds an index database.
Upon receiving a query, the search engine searches the
index database to determine the pages to return to the
user. Since the contents of Web pages keep on
changing, the index database needs to be updated.
However, the index database is large and refreshing it
may take a long time, say weeks. In contrast, if a
metasearch engine is connected to numerous small
search engines and each of these search engines keeps
its database up-to-date, the metasearch engine may be
able to provide current information.
Utilizes dictionary/ontology
WordNet: ordinary dictionary terms
MeSH hierarchy: medical terms
Person --- (Synonyms: human, people)
Hyponyms: man
woman
May want to include synonyms and hyponyms of query
terms into query
Comments to last slide
Dictionaries or ontologies may be utilized to achieve high
retrieval effectiveness. A common dictionary in a general
domain is WordNet, which provides synonyms, hyponyms
as well as other relationships for each ordinary word. As
an example, if a query contains the word “person”, its
synonyms and hyponyms may be added to the query.
Note that a word may have multiple senses (meanings)
and selections of suitable synonyms and hyponyms are
essential. It is worthwhile to explore the use of the MeSH
hierarchy for effective retrieval in the medical domain.
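Query expansion with synonyms and hyponyms can be sketched with a small hand-written table. The table below holds only the "person" entry from the slide and is an assumption standing in for a real dictionary lookup; it does not use the WordNet or MeSH interfaces, and it ignores word senses.

```python
# Hand-written stand-in for a dictionary lookup (only the slide's example entry).
THESAURUS = {
    "person": {"synonyms": ["human", "people"], "hyponyms": ["man", "woman"]},
}


def expand_query(terms, thesaurus=THESAURUS):
    """Add the synonyms and hyponyms of each query term, without duplicates."""
    expanded = []
    for term in terms:
        entry = thesaurus.get(term, {})
        related = entry.get("synonyms", []) + entry.get("hyponyms", [])
        for candidate in [term] + related:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(expand_query(["person", "memory"]))
```

Terms without an entry, such as "memory" here, are passed through unchanged.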
Difficulty
A word sometimes has many senses
Eg Query: drugs for mental patients
senses for drugs: prescription drugs;
illegal drugs
useful to include antidepressant;
will retrieve a lot of irrelevant documents if
include heroin
Comments to the last slide
The example shows that a correct addition
of a hyponym (antidepressant is a
hyponym of drug) will lead to high retrieval
effectiveness while an incorrect addition
(heroin is also a hyponym of drug) leads to
poor retrieval results.
Natural language Processing
for Information Retrieval
• finds part-of-speech of each word;
• identifies noun phrases;
• identifies proper names;
• recognizes acronyms:
eg. CHF congestive heart failure
• Word sense disambiguation
eg. Apple CPU
Comments to last slide
Natural language processing plays a role in
information retrieval. However, so far, it is
used to identify parts of speech of words,
named entities and phrases only.
Recognition of acronyms is also useful for
information retrieval.
Word Sense Disambiguation
Pine 1 kinds of evergreen tree with needle-shaped leaves
2 waste away through sorrow or illness
Cone 1 solid body which narrows to a point
2 fruit of certain evergreen trees
Find the combination of descriptions which
have the largest number of words in
common.
Comments to last slide
If the query is “pine cone” and each of the two
words has multiple senses, the correct sense
may be identified by finding the combination of
senses whose descriptions have the largest
number of words in common. In this example,
sense 1 of pine and sense 2 of cone have the
words “evergreen tree” in common in their
descriptions. These common words may be
added to the query to improve retrieval
effectiveness.
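The overlap computation for "pine cone" can be sketched as follows. The stop-word list and the crude plural stripping (so that "trees" matches "tree") are assumptions added to make the word comparison work; the sense descriptions are those of the slide.

```python
from itertools import product

STOP_WORDS = {"of", "with", "through", "or", "which", "to", "a"}

SENSES = {
    "pine": {
        1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness",
    },
    "cone": {
        1: "solid body which narrows to a point",
        2: "fruit of certain evergreen trees",
    },
}


def description_words(description):
    """Content words of a sense description, with a crude plural stripping."""
    words = set()
    for word in description.lower().split():
        if word in STOP_WORDS:
            continue
        if word.endswith("s") and len(word) > 3:
            word = word[:-1]  # e.g. "trees" -> "tree"
        words.add(word)
    return words


def disambiguate(query_words, senses=SENSES):
    """Pick the combination of senses whose descriptions share the most words."""
    best_choice, best_common = None, set()
    sense_lists = [senses[w].items() for w in query_words]
    for combination in product(*sense_lists):
        word_sets = [description_words(desc) for _, desc in combination]
        common = set.intersection(*word_sets)
        if best_choice is None or len(common) > len(best_common):
            best_choice = tuple(number for number, _ in combination)
            best_common = common
    return best_choice, best_common


choice, common = disambiguate(["pine", "cone"])
print(choice, sorted(common))
```

The selected combination is sense 1 of pine with sense 2 of cone, and the common words are "evergreen" and "tree".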
Information extraction
Information retrieval obtains whole
documents; often users want small parts
of retrieved documents.
Examples:
From certain papers on heart disease,
extract names of authors;
from experimental sections of papers,
extract tables of interest.
Techniques
(1) Construct rules involving patterns or
keywords to identify parts of interest; utilize a
grammar to extract required information
Eg. To identify terrorist events
useful keywords: kill, bomb etc. use a grammar
to identify the subjects (terrorists) and the
objects (victims)
Comments
Traditionally, information extraction is
achieved by manually constructed rules for
the extraction, after examining numerous
instances of what is desired. In order to
save labor cost, machine learning
techniques are introduced. Rules are
constructed automatically and, based on
positive and negative examples, promising
rules are kept for future extraction
activities.
(2) Use machine learning techniques to
construct rules
Positive and negative examples can
be given to guide the construction
Aim: Reduce manual construction of rules
Machine learning
Example: Pavilion a230 Minitower
AMD ® Athlon XP .. GHz
…….
Pavilion a210n Minitower
Intel ® Celeron … … GHz
Rule: (var1) * ‘®’ ( var2) ‘GHz’
Comments
In this example, the user supplies a few positive
instances to be extracted. Then, the system
automatically constructs the rule with ® and
GHz as landmarks. The words before the
landmarks are captured by variables.
In the Web environment, HTML or XML documents
have tags and they may be used to construct
rules. However, rules involving tags may be site
dependent, implying that new rules may need to
be generated when there is a site change.
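Applying such a landmark rule can be sketched with a regular expression. The pattern below is one possible reading of the rule, with var1 taken as the word before "®" and var2 as the text between "®" and "GHz"; it is written by hand here rather than learned, and the input lines are the slide's own, with their elided parts kept as they appear.

```python
import re

# Hand-written counterpart of the rule: (var1) '®' (var2) 'GHz'
RULE = re.compile(r"(?P<var1>\S+)\s*®\s*(?P<var2>.+?)\s*GHz")

lines = [
    "Pavilion a230 Minitower",
    "AMD ® Athlon XP .. GHz",
    "Pavilion a210n Minitower",
    "Intel ® Celeron … … GHz",
]


def extract(lines):
    """Return (var1, var2) for every line containing both landmarks."""
    records = []
    for line in lines:
        match = RULE.search(line)
        if match:
            records.append((match.group("var1"), match.group("var2")))
    return records


records = extract(lines)
print(records)
```

Lines that lack the landmarks, such as the product names, are skipped, while the two processor lines yield the manufacturer in var1 and the rest of the description in var2.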
Rules involving tags
 Martino Motor Sales 
 Currie Motors Lincoln Memory 
Rule: * ‘Var’ 
Extracted data may not be that structured
Layout of document can be site dependent,
implying that new correct rules need to be
constructed for new sites
Summary
• Information retrieval:
user’s point of view:
eg. phrase, case sensitive
system point of view:
eg. Feedback query construction
Web retrieval vs non-web retrieval
search engine vs metasearch engine
Summary continued
• Natural language processing
Eg. acronym recognition
• Information extraction
rules: manual, machine learning
Can be site dependent