WMES3304/WXGB5008 INFORMATION RETRIEVAL

LAST WEEK
 Retrieval evaluation
 Why?
 How?
 Recall and precision – Venn diagram and
contingency table
WMES3103
INFORMATION
RETRIEVAL
WEEK 5
QUERY LANGUAGES
AND OPERATION
QUERY LANGUAGES
 Will cover the different kinds of queries
sent to text retrieval systems.
 Will show the different types of query that
a user can formulate.
 The most commonly used type of user query is
keyword-based retrieval.
 Different queries are continuously sent to an
IRS.
 Most query languages use the content
(semantics) and the structure of the text (text
syntax) to find the relevant documents.
 At times, the IRS may fail to trace and retrieve
the relevant documents.
 Therefore, we need to use a number of
techniques which will hopefully enhance the
query and this will enable us to retrieve an
acceptable level of relevant documents.
QUERY LANGUAGES
 e.g. use of a thesaurus, synonyms, stemming,
stopwords, etc.
 A keyword is a word that can be retrieved by an
IRS.
 The retrieval unit is the basic element which
can be retrieved by the system as an answer to
a query; retrieval units are also known as documents.
 A retrieval unit can be a file, document, Web
page, paragraph, or some other structural unit
which contains the answer to the query.
Example : Keyword
 Keyword used is: “artificial intelligence”
Example : Retrieval unit
 website
Example : Retrieval unit
 document
TYPES OF QUERY
LANGUAGES
 Keyword-based querying
   Single-word
   Context
   Boolean
   Natural language
 Pattern matching
 Structural queries
   Form-like fixed
   Hypertext
   Hierarchical
KEYWORD-BASED
QUERYING
 Query = formulation of a user information need.
 Query =a keyword or a number of keywords = a
basic query
 Documents containing such keywords are
searched for in the IRS.
 Keyword-based queries are popular because:
 Intuitive
 Easy to express
 Allows for fast ranking.
Single-word query
 Simplest form of query that can be formulated in
an IRS.
 Text document = long sequences of words.
 The IRS will look at the text and search for the
word.
 Result of a word query = a set of documents
containing at least one of the words of the query.
 Set of documents will be ranked according to the
degree of similarity to the query.
Single-word query
 Ranking done via word occurrences inside
the text
 Most popularly used = term frequency =
counts the number of times a word
appears inside a document
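A minimal sketch of term-frequency ranking for a word query, assuming a naive whitespace tokenizer and a small in-memory collection. The names `tokenize`, `rank_by_term_frequency` and the sample documents are illustrative and not part of the slides.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on whitespace (deliberately naive)."""
    return text.lower().split()


def rank_by_term_frequency(query, documents):
    """Return (doc_id, score) pairs for documents that contain at least one
    query word, ordered by the total number of query-word occurrences."""
    query_words = set(tokenize(query))
    scores = {}
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[word] for word in query_words)
        if score > 0:  # keep only documents containing a query word
            scores[doc_id] = score
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample collection.
docs = {
    "d1": "robots and robots in industry",
    "d2": "marine pollution report",
    "d3": "industry robots",
}
ranking = rank_by_term_frequency("robots", docs)
print(ranking)
```

Real systems usually combine term frequency with other factors; this sketch uses the raw count only.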
Context query
 Single-word queries are complemented
with the ability to search for words in a
given context = near other words.
 Words which appear near other words
may indicate a higher possibility of
relevance than if they appear apart.
 2 types of queries
 phrase
 proximity
 Phrase – sequence of single-word queries.
 Proximity – more relaxed version of the
phrase query.
 A sequence of words is given together with a maximum
allowed distance between them.
 Distance measured in characters or words
depending on the system
Example : ABI-INFORM (CD)
 W/n – first keyword must be within n words of the
second keyword.
 computer w/1 data = the word computer must be
within 1 word of the word data = computer generated
data, computer simulated data, data mining computer
 PRE/n – first keyword must precede second keyword
by up to n words.
 European pre/1 community = the word European must
precede the word community by up to 1 word = European
economic community, European flavoured community
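The two proximity operators can be sketched as follows. This is an illustration, not the ABI-INFORM implementation: it assumes that n counts the words allowed between the two keywords, which is inferred from the examples above, and the function names `within` and `precedes` are hypothetical.

```python
def positions(word, tokens):
    """Return every index at which `word` occurs in the token list."""
    return [i for i, token in enumerate(tokens) if token == word]


def within(first, second, n, text):
    """W/n: both words occur, in either order, with at most n words between them."""
    tokens = text.lower().split()
    return any(
        0 < abs(i - j) <= n + 1
        for i in positions(first.lower(), tokens)
        for j in positions(second.lower(), tokens)
    )


def precedes(first, second, n, text):
    """PRE/n: `first` occurs before `second` with at most n words between them."""
    tokens = text.lower().split()
    return any(
        0 < j - i <= n + 1
        for i in positions(first.lower(), tokens)
        for j in positions(second.lower(), tokens)
    )


print(within("computer", "data", 1, "computer generated data"))
print(precedes("European", "community", 1, "European economic community"))
```

Distance is measured in words here; a system that measures distance in characters would compare character offsets instead.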
Example : COMPENDEX
 Search for a phrase = type in each keyword
separated by a space = will search for the phrase
with the 2 keywords next to each other and in the
specified order
 artificial intelligence
 Desired proximity of keywords specified with full
stops between keywords
 back..basics = back to basics, back to the basics
 Keywords must appear in the same sentence = type in
the keywords separated by an underscore
 computer_medicine
Boolean query
 Oldest form of keyword query = use of Boolean
operators
 Typical Boolean query = words + operators.
 Given 2 basic keyword queries : A and B
 A or B – selects all documents with the word A,
the word B, or both.
 A and B – selects all documents with both A and B
 A not B – selects all documents with the word A
but without the word B.
 Represented by Venn diagrams
Boolean operator : AND
robotics AND Malaysia
each document in this set will contain
both the words robotics and Malaysia
Boolean operator : OR
water pollution OR marine pollution
each document in this set will contain one or both of
the keywords - water pollution, marine pollution
Boolean operator : NOT
digital NOT watches
Documents with the word digital will not have the
word watches in them
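The three operators correspond to set intersection, union and difference over the sets of documents that contain each word. The sketch below assumes a toy inverted index built with a whitespace tokenizer; the sample documents and the helper names `build_index` and `postings` are illustrative only.

```python
def build_index(documents):
    """Map each lowercased word to the set of ids of documents containing it."""
    index = {}
    for doc_id, text in documents.items():
        for word in set(text.lower().split()):
            index.setdefault(word, set()).add(doc_id)
    return index


# Hypothetical sample collection.
docs = {
    1: "robotics research in Malaysia",
    2: "robotics in Japan",
    3: "digital watches",
    4: "digital cameras in Malaysia",
}
index = build_index(docs)


def postings(word):
    """Return the set of documents containing `word` (empty set if none)."""
    return index.get(word.lower(), set())


a_and_b = postings("robotics") & postings("Malaysia")  # AND = intersection
a_or_b = postings("robotics") | postings("Malaysia")   # OR  = union
a_not_b = postings("digital") - postings("watches")    # NOT = difference

print(a_and_b, a_or_b, a_not_b)
```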
Natural Language
 User determines the keywords that should
be eliminated and are not useful for
searching.
 Ranking for documents with these
keywords would be very low.
TARGET - Dialog
? target
Input search terms separated by spaces ( e.g. DOG CAT FOOD). You can enhance your TARGET
search with the following options:
PHRASES are enclosed in single quotes (e.g. ‘DOG FOOD’)
SYNONYMS are enclosed in parentheses (e.g. (DOG CANINE))
SPELLING variations are indicated with a ? (e.g. DOG? to search for DOG, DOGS)
Terms that MUST be present are flagged with an asterisk (e.g. DOG *FOOD)
Q = QUIT H = HELP
? komodo dragon food diet nutrition
Your TARGET search request will retrieve up to 50 of the statistically relevant records.
Searching 1997 – 1998 records only
… Processing Complete
Your search retrieved 50 records
Press ENTER to browse results C = Customize display Q = Quit
H = Help
Pattern Matching
 More specific query formulation
 Retrieve pieces of text that have some
property.
 Used in the retrieval of text statistics, data
extraction, etc.
 A pattern is a set of syntactic features that
must occur in a text segment.
 A segment that fulfils the pattern
specifications = pattern match
Pattern Matching
 Interested in documents containing segments
which match the given search pattern.
 Each IRS will allow some degree of search
pattern.
 Very simple or very complex.
 The more powerful the set of patterns allowed,
the more involved are the queries that can be
formulated by the user, and the more complex is
the implementation of the search.
Pattern Matching
 Words – a word in the text, most basic pattern.
 Prefixes – the beginning of a text word – e.g.
prefix comput will retrieve all documents
containing words such as computers,
computing, computation, computational, etc.
 Suffixes – the termination of a text word – e.g.
suffix ters will retrieve all documents containing
words such as monsters, posters, potters,
painters, etc.
Pattern Matching
 Substrings – can appear anywhere within a text
word – e.g. tal will retrieve all documents
containing words such as coastal, talk,
metallic, pedestal, etc.
 Ranges – a pair of strings which matches
any word lying between them in
lexicographical order – e.g. the range between the
words held and hold will retrieve strings
such as hoax, hissing, helm, help, etc.
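The word-level patterns above map directly onto basic string operations. A minimal sketch over a small vocabulary follows; the function names and the vocabulary are illustrative, and the range match is assumed to include both end points, which the slides do not specify.

```python
def match_prefix(prefix, words):
    """Words that begin with `prefix`."""
    return [w for w in words if w.startswith(prefix)]


def match_suffix(suffix, words):
    """Words that end with `suffix`."""
    return [w for w in words if w.endswith(suffix)]


def match_substring(sub, words):
    """Words that contain `sub` anywhere."""
    return [w for w in words if sub in w]


def match_range(low, high, words):
    """Words lying between `low` and `high` in lexicographical order
    (inclusive end points are an assumption of this sketch)."""
    return [w for w in words if low <= w <= high]


# Hypothetical vocabulary drawn from the examples on the slides.
vocabulary = [
    "computers", "computing", "monsters", "painters", "coastal",
    "talk", "metallic", "held", "helm", "hissing", "hoax", "hold", "zebra",
]

print(match_prefix("comput", vocabulary))
print(match_suffix("ters", vocabulary))
print(match_substring("tal", vocabulary))
print(match_range("held", "hold", vocabulary))
```

Note that the suffix ters also matches computers, since a suffix pattern only constrains the end of the word.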
Pattern Matching
 Allowing errors – A word together with an
error threshold
 will retrieve all text words which are similar to
the given word.
 errors are caused by typing, spelling, etc.
 most accepted model is the Levenshtein
distance or edit distance.
Pattern Matching
 Example : The edit distance between COLOR and
COLOUR is 1 (insert U); between SURVEY and SURGERY
it is 2 (replace V with G, insert R).
Therefore, in the query, we must specify the
maximum number of allowed errors for a word
to match the pattern.
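A minimal sketch of the Levenshtein (edit) distance using the standard dynamic-programming recurrence, with a helper that applies the error threshold. The function names are illustrative; insertions, deletions and substitutions are each assumed to cost 1.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string `a` into string `b`."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def matches_with_errors(pattern, word, max_errors):
    """True if `word` is within `max_errors` edits of `pattern`."""
    return edit_distance(pattern, word) <= max_errors


print(edit_distance("COLOR", "COLOUR"))
print(edit_distance("SURVEY", "SURGERY"))
```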
Structural Queries
 Based on structure of the text
 3 structures – fixed, hypertext, hierarchical
 The user will query the text based on the
structure.
 Query languages nowadays integrate both
content and structural queries.
 Example : UM Library OPAC records
 Example of query : fi au ali and subject
malaysia
Query Protocols
 Protocol: a strict set of rules that govern the
exchange of information between computer
devices
 Query languages used automatically by software
applications to query text databases.
 Some are standards for querying CD-ROMs or
as intermediate languages to query library
systems.
 Not intended for human use = referred to as protocols
and not languages.
Query Protocols
 Z39.50 – query bibliographical information using
a standard interface between the client and the
host database manager which is independent of
the client user interface and of the query
database language at the host. Originally used
for bibliographical information based on MARC
format.
 WAIS – Wide Area Information Servers – popular
before Web – network publishing protocol and
can query databases through the Internet.
www.ukoln.ac.uk/dlis/z3950/
Protocols for CD-ROM
 Allows for flexibility in data communication
between primary information providers and end
users.
 Significant cost savings - allows access to a
variety of information without the need to buy,
install, and train users for different data retrieval
applications.
 3 protocols have been recommended :
 CCL (Common Command Language)
 CD-RDx (Compact Disk Read only Data exchange)
 SFQL (Structured Full-text Query Language)
QUERY OPERATIONS
 Users find it difficult to formulate queries which are
well-designed for retrieval purposes because
they do not know the collection make-up and the
retrieval environment.
 Web search engines – users spend a lot of time
reformulating their queries to get effective
retrieval.
 First query formulation – retrieve documents and
examine for relevance - construct new improved
query formulations - retrieve documents and
examine for relevance - process is repeated until
the user is satisfied.
QUERY OPERATIONS
 2 processes involved
 expanding the original query with new terms
 reweighting the terms in the expanded query.
 2 ways of improving initial query formulation
 approaches based on feedback information from the
user
 approaches based on information derived from the
set of documents initially retrieved (called the local set
of documents)
User Relevance Feedback
 Most popular query reformulation strategy.
 User is presented with a list of retrieved
documents, examines them, and marks those
which are relevant.
 Only the top 10 or 20 ranked documents need to
be examined.
 Separates into relevant and non-relevant.
 Select important terms attached to the retrieved
and relevant documents only, and enhance
importance of terms in new query formulation.
User Relevance Feedback
 The new query is expected to move towards the
relevant documents and away from the non-relevant ones.
 Advantages :
 Protects the user from the details of the query
reformulation process because all the user has to do
is mark the relevant documents
 Breaks down the entire search process into a
sequence of small steps which are easier to grasp.
 Provides a control process designed to emphasise
some terms and de-emphasise others.
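Query expansion and term reweighting from user feedback can be sketched in the style of the Rocchio formula. The slides do not name a specific method, so this choice is an assumption of the example, as are the function name `reformulate`, the constants `alpha`, `beta` and `gamma`, and the use of raw term counts as document weights.

```python
from collections import Counter


def reformulate(query_weights, relevant_docs, non_relevant_docs,
                alpha=1.0, beta=0.75, gamma=0.15):
    """Return new term weights: the original query, plus the average of the
    relevant documents, minus the average of the non-relevant documents."""
    new_weights = Counter()
    for term, weight in query_weights.items():
        new_weights[term] += alpha * weight
    for doc in relevant_docs:
        for term, tf in Counter(doc.lower().split()).items():
            # Terms from relevant documents are added (expansion) or boosted.
            new_weights[term] += beta * tf / len(relevant_docs)
    for doc in non_relevant_docs:
        for term, tf in Counter(doc.lower().split()).items():
            # Terms from non-relevant documents are de-emphasised.
            new_weights[term] -= gamma * tf / len(non_relevant_docs)
    # Drop terms whose weight is not positive.
    return {term: w for term, w in new_weights.items() if w > 0}


# Hypothetical feedback: the user marked two documents relevant, one not.
new_query = reformulate(
    {"pollution": 1.0},
    relevant_docs=["marine pollution", "marine oil pollution"],
    non_relevant_docs=["noise pollution"],
)
print(new_query)
```

In this example the query gains the terms marine and oil, while noise is excluded because it appears only in the non-relevant document.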
Automatic Local Analysis
 Documents retrieved for a given query are
examined immediately to determine terms for
query expansion.
 Similar to relevance feedback cycle but done
without the assistance of the user – automatic.
 Local feedback strategies are based on
expanding the query with terms correlated to the
query terms = local clusters built from the local
document set.
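A minimal sketch of automatic local analysis, assuming that correlation is approximated by simple co-occurrence counts within the local document set. The function name `expand_query`, the `extra_terms` parameter and the sample documents are illustrative; real systems typically also normalise these counts and remove stopwords.

```python
from collections import Counter


def expand_query(query_terms, local_docs, extra_terms=2):
    """Expand the query with the terms that co-occur most often with the
    query terms in the initially retrieved (local) documents."""
    query_terms = [term.lower() for term in query_terms]
    cooccurrence = Counter()
    for doc in local_docs:
        counts = Counter(doc.lower().split())
        for q in query_terms:
            if counts[q] == 0:
                continue  # this document does not contain the query term
            for term, tf in counts.items():
                if term not in query_terms:
                    # Correlation score: product of the two term frequencies.
                    cooccurrence[term] += counts[q] * tf
    # Strongest correlation first; ties broken alphabetically.
    ranked = sorted(cooccurrence.items(), key=lambda item: (-item[1], item[0]))
    return query_terms + [term for term, _ in ranked[:extra_terms]]


# Hypothetical local set: the top documents retrieved for "pollution".
local_docs = [
    "marine pollution oil spill",
    "marine pollution oil",
    "marine pollution law",
]
expanded = expand_query(["pollution"], local_docs)
print(expanded)
```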