
LSU/SLIS
Search Engines and Information
Retrieval (IR)
Session 11
LIS 7008
Information Technologies
Agenda
• The search process
• Information retrieval (IR)
• Recommender systems
• Multimedia information retrieval (MMIR)
• Evaluation of information retrieval systems
Questions
• What is the Web? Who defines it?
• Why does Google pay $125 million to scan books?
– You’ll get an idea by the end of this session
Web Information Retrieval
[Diagram: users reach the Web (or at least a portion of it) through browsers and search engines.]
The Memex Machine
Vannevar Bush, 1945. "As We May Think."
DIKW Information Hierarchy
[Pyramid diagram (source: http://www.systems-thinking.org/dikw/dikw.htm): data, information, knowledge, and wisdom plotted against increasing connectedness and increasing understanding. Understanding relations turns data into information; understanding patterns turns information into knowledge; understanding principles turns knowledge into wisdom. Each level is more refined and abstract than the one below. The labels DB, IR, and KS mark where databases, information retrieval, and knowledge systems fit in this hierarchy.]
DIKW Information Hierarchy
• Data
– Symbols: they exist, whether usable or not, meaningful or not
• Information
– Data organized and presented in a particular manner
– Data given meaning by way of relational connection
• Knowledge
– "Justified true belief"; application of data & information
– Intends to be useful
• Wisdom
– Distilled and integrated knowledge
– Evaluated understanding; the process by which we discern, or judge, between right and wrong, good and bad
Slide adapted from http://www.systems-thinking.org/dikw/dikw.htm and
http://www.umiacs.umd.edu/~jimmylin/LBSC690-2007-Spring/content.html (session 5)
A (Facetious) Example
• Data
– 98.6º F, 99.5º F, 100.3º F, 101º F, … (we do not know
what they mean)
• Information
– John’s hourly body temperature: 98.6º F, 99.5º F,
100.3º F, 101º F, …
• Knowledge
– If you have a temperature above 100º F, you most likely
have a fever
• Wisdom
– If you have a fever and don’t feel well, go see a doctor
Slide adapted from http://www.umiacs.umd.edu/~jimmylin/LBSC690-2007-Spring/content.html (session 5)
Databases (DB) vs. Information Retrieval (IR)
• What we're retrieving
– DB: Structured data. Clear semantics based on a formal model.
– IR: Mostly unstructured information. Free text with some metadata.
• Queries we're posing
– DB: Formally (mathematically) defined queries. Unambiguous. SQL.
– IR: Vague, imprecise information needs (often expressed in natural language).
• Results we get
– DB: Exact. Always correct in a formal sense.
– IR: Sometimes relevant, often not.
• Interaction with system
– DB: One-shot queries.
– IR: IR is a process. Interaction is important.
• Other issues
– DB: Concurrency, recovery, atomicity are critical. Efficiency important.
– IR: Effectiveness and usability are critical.
How about searching for a book in the www.lib.lsu.edu catalog?
Information “Retrieval” Tasks
• Find something that you want/need
– The information need may or may not be explicit
• E.g., cat, movie for dog, jaguar
• Known item search
– Find the class home page
• Answer seeking
– Is Lexington or Louisville the capital of Kentucky?
– When did Michael Jackson die?
• This is not database search, because information is unstructured
• Directed exploration
– Who makes videoconferencing systems?
• Self-guided teaching, exploration
The Big Picture
• The four components of the information retrieval environment:
– User (user needs)
– Process
– System (supports the process)
– Data/info (static)
• The user and the search process are what we care about; the system and the data are what computer geeks care about!
Information Retrieval Paradigm
[Diagram: the information retrieval paradigm as a cycle of search, select, browse, and examine, moving between query and document and ending in document delivery.]
IR: Supporting the Search Process
[Diagram: the search process supported by an IR system. The searcher performs source selection and query formulation (predict, nominate, choose); the query drives a search that returns a ranked list; the searcher makes a selection, examines documents, and obtains document delivery. Query reformulation and relevance feedback loop back to the query, and source reselection loops back to source selection.]
Supporting the Search Process
[Diagram: the same search process with the system side shown. Acquisition builds the collection and a document cache; indexing builds the index consulted by the search step; document delivery is served from the cache.]
Human-Machine Synergy
• Machines are good at:
– Doing simple things accurately and quickly
– Scaling to larger collections in sublinear time
• People are better at:
– Accurately recognizing what they are looking for
– Evaluating intangibles such as “quality”
• Both are pretty bad at:
– Mapping consistently between words and concepts
Search Component Model
[Diagram: the search component model. An information need is expressed through query formulation as a query; query processing applies a representation function to produce a query representation. Independently, document processing applies a representation function to each document to produce a document representation. A comparison function matches the two representations and assigns a retrieval status value (retrieved or not); human judgment of the retrieved documents determines their utility.]
Ways of Finding Text
• Searching content
– Characterize documents by the words they contain
• Searching metadata (fields)
– Using controlled or uncontrolled vocabularies
• Searching behavior
– User-Item: Find similar users based on the items they search
• E.g., users who looked for “movies for dog”
– Item-Item: Find items that cause similar reactions
• E.g., Amazon.com recommends books based on the books you
bought
Two Ways of Searching
[Diagram: two routes from author to searcher. Free text: the author writes the document using terms that convey meaning (document terms); the free-text searcher constructs a query from terms that may appear in documents (query terms); content-based query-document matching yields a retrieval status value. Controlled vocabulary: an indexer chooses appropriate concept descriptors (document descriptors); the controlled-vocabulary searcher constructs a query from the available concept descriptors (query descriptors); metadata-based query-document matching yields a retrieval status value.]
3 Types of Information Retrieval
Systems
• Exact-match retrieval
– Boolean retrieval systems
• Ranked retrieval
– Similarity-based retrieval
• Rating-based Recommendation
“Exact Match” Retrieval
• Query: Find all documents with some characteristics:
– Indexed as “Presidents -- United States”
– Containing the words “Obama” and “health”
– Read by my boss
• Result: A set of documents is returned
– Hopefully, not too many or too few
– Usually listed in date or alphabetical order
• E.g., Boolean retrieval search engines
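To make the exact-match idea concrete, here is a minimal Python sketch of Boolean AND retrieval over an inverted index. The three-document collection is invented for illustration; note that the result is an unordered set, not a ranked list.

```python
# Minimal sketch of exact-match (Boolean) retrieval using set operations.
# The tiny collection below is an illustrative assumption, not course data.

docs = {
    1: "obama signs health reform bill",
    2: "presidents of the united states",
    3: "obama speech on health care and presidents day",
}

# Build an inverted index: term -> set of document ids containing it
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# Boolean AND query: documents containing BOTH "obama" AND "health"
result = index.get("obama", set()) & index.get("health", set())
print(sorted(result))   # -> [1, 3]; conceptually a set, not a ranked list
```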
The Perfect Query Paradox
• Every information need has a perfect set of documents
– Finding that set is the goal of search
• Every document set has a perfect query
– “AND” every word in the doc to get a query for document
– Repeat for each document in the set
– “OR” every document query to get the set query
• The problem isn’t the system … it’s the query! How can
a human being formulate such a complicated query?!
Queries on the Web (1999)
• Low query construction effort
– 2.35 (often imprecise) terms per query
– 20% of Web users use searching operators
– 22% of queries are subsequently modified
• Low browsing effort
– Only 15% of users view more than one page
– Most users look only “above the fold”
• One study showed that 10% don’t know how to scroll!
3 Types of User Needs
• Informational (30-40% of AltaVista queries)
– What is a quark? How does a car engine work?
• Navigational
– Find the Web page of United Airlines talking about
luggage policy
• Transactional
– Data: What is the weather in Paris?
– Shopping: Who sells a Vaio Z505RX?
– Proprietary: Obtain a journal article
Ranked Retrieval
• A Boolean search engine is, per se, "perfect"
– Searchers are not, because of the perfect query paradox
– So build an engine that is less exact than Boolean search but more forgiving of imperfect queries
• Put most useful documents near top of a list
– Possibly useful documents go lower in the list
• Users can read down as far as they like
– Based on what they need, time available, ...
• Provides useful results from weak queries
– Untrained users find exact-match (Boolean) systems harder to
use due to lack of query formulation skills
Similarity-Based Retrieval
• Assume “most useful” = most similar to query
• Weight terms based on two criteria:
– Repeated words are good cues to meaning of a document
– Rarely used words make searches more selective
• Compare term weights with query
– Add up the weights for each query term
– Put the documents with the highest total first
Simple Example: Counting Words
Documents:
d1: Nuclear fallout contaminated Texas.
d2: Information retrieval is interesting.
d3: Information retrieval is complicated.
Query: "recall and fallout measures for information retrieval"

Build an index (term-document matrix; 1 = term occurs):

Term           d1  d2  d3  Query
complicated     0   0   1    0
contaminated    1   0   0    0
fallout         1   0   0    1
information     0   1   1    1
interesting     0   1   0    0
nuclear         1   0   0    0
retrieval       0   1   1    1
Texas           1   0   0    0

Vector of term weights for d1: (0, 1, 1, 0, 0, 1, 0, 1)
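A minimal Python sketch of this counting-words scoring, using the three example documents above; a real system would replace the raw counts with weights such as the Okapi weights on a later slide.

```python
# Score each document by summing raw counts of the query terms it contains
# (no weighting yet), then rank documents with the highest total first.

docs = {
    "d1": "nuclear fallout contaminated texas",
    "d2": "information retrieval is interesting",
    "d3": "information retrieval is complicated",
}
query = "recall and fallout measures for information retrieval"

def term_counts(text):
    counts = {}
    for term in text.lower().split():
        counts[term] = counts.get(term, 0) + 1
    return counts

query_terms = set(query.lower().split())
scores = {}
for doc_id, text in docs.items():
    counts = term_counts(text)
    # Add up the (here: raw) weights of every query term found in the document
    scores[doc_id] = sum(counts.get(t, 0) for t in query_terms)

for doc_id, score in sorted(scores.items(), key=lambda x: -x[1]):
    print(doc_id, score)   # d2 and d3 score 2 (information, retrieval); d1 scores 1 (fallout)
```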
Discussion Point:
Which Terms to Emphasize?
• Major factors
– Uncommon terms are more selective
• E.g., in a collection of chemistry abstracts, "Monica Lewinsky" would be a very selective term
– Repeated terms provide evidence of meaning
• E.g., if a document mentions "cat" many times, it is probably about cats
• Adjustments
– Give more weight to terms in certain positions
• Title, first paragraph, etc.
– Give less weight to each term in longer documents
– Ignore documents that try to “spam” the index
• Invisible text, excessive use of the “meta” field, …
“Okapi” Term Weights
E.g., cat, cat, cat, Lewinsky

w(i,j) = [ TF(i,j) / (0.5 + 1.5 · (L(i) / L̄) + TF(i,j)) ] × log[ (N − DF(j) + 0.5) / (DF(j) + 0.5) ]

TF component: the first factor, where TF(i,j) is the frequency of term j in document i, L(i) is the length of document i, and L̄ is the average document length.
IDF component: the second factor, where DF(j) is the number of documents containing term j and N is the number of documents in the collection.

[Plots: the Okapi TF component grows with raw TF but saturates, unlike the classic (linear) TF, and grows more slowly for longer documents (L/L̄ = 0.5, 1.0, 2.0); the IDF component decreases as raw DF increases.]
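A hedged Python sketch of the Okapi-style weight reconstructed above; the collection statistics and the use of the natural logarithm are assumptions for illustration.

```python
import math

def okapi_weight(tf, df, doc_len, avg_doc_len, n_docs):
    """One reading of the slide's Okapi-style weight for term j in document i:
    a length-normalized TF component times an IDF component."""
    tf_component = tf / (0.5 + 1.5 * (doc_len / avg_doc_len) + tf)
    idf_component = math.log((n_docs - df + 0.5) / (df + 0.5))  # log base is an assumption
    return tf_component * idf_component

# Toy numbers (assumptions): a 1,000-document collection, average length 100 terms.
# "cat" occurs 3 times in a 120-term document and appears in 200 documents;
# "lewinsky" occurs once but appears in only 5 documents.
print(okapi_weight(tf=3, df=200, doc_len=120, avg_doc_len=100, n_docs=1000))
print(okapi_weight(tf=1, df=5,   doc_len=120, avg_doc_len=100, n_docs=1000))
# The rare term gets the larger weight despite its lower frequency.
```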
Index Quality of Web Search Engines
Index quality is affected by:
• Crawl quality
– Comprehensiveness, dead links, duplicate detection
• Document analysis
– Frames, metadata, imperfect HTML, …
• Document extension
– Anchor text, source authority, category, language, …
• Document restriction (ephemeral text suppression)
– Banner ads, keyword spam, …
Other Web Search Quality Factors
• Spam suppression
– “Adversarial information retrieval”
• detect, isolate, and defeat spamming
– Every source of evidence has been spammed
• Text, queries, links, access patterns, …
• “Family filter” accuracy
– Family filter reduces objectionable content
– Link analysis can be very helpful
• E.g., few serious web pages link to pornography sites
Indexing Anchor Text
• A type of “document expansion”
– Terms near links describe content of the target
• Works even when you can’t index content
– Image retrieval, uncrawled links, …
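A small sketch of document expansion with anchor text; the link data and page names below are invented for illustration.

```python
# Index the words that other pages use in links to a target page
# as if they were part of that target page's own content.

links = [
    # (source_page, anchor_text, target_page) -- illustrative, not real crawl data
    ("blog.example.com", "great videoconferencing systems", "vendor.example.com/products"),
    ("news.example.com", "cheap video conferencing",        "vendor.example.com/products"),
]

expanded_index = {}   # term -> set of target pages
for _source, anchor_text, target in links:
    for term in anchor_text.lower().split():
        expanded_index.setdefault(term, set()).add(target)

# The target page can now be found by "videoconferencing" even if we never
# crawled (or cannot parse) its own content, e.g. an image-only page.
print(expanded_index.get("videoconferencing"))
```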
Rating-Based Recommendation
• Use ratings to describe objects
– Personal recommendations, peer review, …
• Beyond topicality:
– Accuracy, coherence, depth, novelty, style, …
• Has been applied to many modalities
– Recommending books, Usenet news, movies, music, jokes,
beer, …
Using Positive Information
[Ratings matrix: seven users (Joe, Ellen, Mickey, Goofy, John, Ben, Nathan) grade theme-park rides (Small World, Space Mtn, Mad Tea Party, Dumbo, Speedway, Country Bear) from A to F, with some ratings missing ("?"). A missing rating can be predicted from users whose high (positive) ratings agree with the target user's.]
Using Negative Information
[The same ratings matrix. Negative evidence helps too: users who agree in their low (D/F) ratings are also informative when predicting the missing "?" ratings.]
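A minimal sketch of rating-based recommendation in the spirit of the two slides above: predict a user's missing rating from users whose known ratings (positive and negative alike) agree with theirs. The users, rides, and grades below are illustrative stand-ins, not the slide's actual matrix.

```python
# User-based collaborative filtering sketch.
# Letter grades mapped to numbers: A=4, B=3, C=2, D=1, F=0 (an assumption).

ratings = {
    "Joe":    {"SmallWorld": 1, "SpaceMtn": 4, "MadTea": 3, "Dumbo": 4},
    "Ellen":  {"SmallWorld": 4, "SpaceMtn": 1, "MadTea": 1, "Dumbo": 4},
    "Mickey": {"SmallWorld": 1, "SpaceMtn": 4, "MadTea": 3, "Dumbo": 4, "Speedway": 4},
}

def similarity(u, v):
    """Agreement over co-rated items (shared dislikes count as much as shared likes)."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    return sum(4 - abs(ratings[u][i] - ratings[v][i]) for i in common) / (4 * len(common))

def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    num = den = 0.0
    for other in ratings:
        if other != user and item in ratings[other]:
            s = similarity(user, other)
            num += s * ratings[other][item]
            den += s
    return num / den if den else None

print(predict("Joe", "Speedway"))   # Joe's ratings closely match Mickey's, so the prediction is high
```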
Problems with Explicit Ratings
• Cognitive load on users -- people don't like to provide ratings
• Rating sparsity -- a sufficient number of raters is needed before recommendations can be made
• No way to recommend new items that have not yet been rated by any user
Implicit Evidence for Ratings
[Table of observable behaviors that can serve as implicit ratings, organized by the scope of what is acted on (segment, object, class) and by behavior category: Examine (view, select), Retain (bookmark, save, purchase, subscribe, print, delete), Reference (cite, link, quote, cut & paste, reply, forward), and annotation-type behaviors (rate, interpret, annotate, publish, organize).]
Implicit Evidence for Rating:
Click Streams
• Browsing histories are easily captured
– Each clicked link is first sent to a central site
– Record the from and to pages and the user's cookie
– Redirect the browser to the desired page
• User’s reading time is correlated with
interest
– Can be used to build individual profiles
– Used to target advertising by doubleclick.com
Slide adapted from http://www.umiacs.umd.edu/~oard/teaching/690/fall05/syllabus.html (Session 12)
Estimating Authority from Links
• Google PageRank
[Diagram: hub pages link out to many pages; authority pages are pointed to by many pages and accumulate authority.]
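A toy PageRank sketch using power iteration; the link graph and damping factor are assumptions for illustration, not Google's actual data or parameters.

```python
# Power-iteration PageRank over a tiny hand-made link graph.

links = {                 # page -> pages it links to (illustrative)
    "hub": ["a", "b", "c"],
    "a":   ["b"],
    "b":   ["a"],
    "c":   ["a", "b"],
}
pages = list(links)
damping = 0.85
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):                      # iterate until ranks stabilize
    new_rank = {p: (1 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share    # each page passes rank to pages it links to
    rank = new_rank

print(sorted(rank.items(), key=lambda x: -x[1]))
# Pages "a" and "b" end up with the most authority: the most pages link to them.
```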
Why Does Google Scan Books?
• Index is large
• Interface is intuitive
Putting It All Together
(3 Types of Evidence for Finding Info)
[Table: three types of evidence (free text, metadata, behavior) compared on topicality, quality, reliability, cost, and flexibility; each type rates from good to poor on different dimensions.]
Information Retrieval Types
[Figure organizing information retrieval types by genre and time. Source: Ayse Goker]
Expanding the Search Space: Cross-Language Information Retrieval
[Figure: expanding search beyond clean electronic text, e.g. scanned documents with handwriting ("Identity: Harriet") and recognized text ("… Later, I learned that John had not heard …").]
Document Image Retrieval:
Page Layer Segmentation
• Document image generation model
– A document consists of many layers, such as handwriting, machine-printed text, background patterns, tables, figures, noise, etc.
High Payoff Investments
(on language recognition problems)
[Chart: searchable fraction vs. transducer capability (accurately recognized words / words produced) for language-recognition technologies such as MT, handwriting recognition, speech recognition, and OCR (as used for Google Books).]
Speech Retrieval Approaches
• Controlled vocabulary indexing
– Manually describe speech content
• Ranked retrieval based on associated text
– Manual/automatic transcript, notes, anchor text
• Automatic feature-based indexing
– Use phonetic features (e.g., phonemes)
– E.g., the phoneme sequence "m ae n ih jh" indexed as overlapping trigrams: maen, aenih, nihjh
• Social filtering based on other users’ ratings
Slide from http://www.glue.umd.edu/~oard/teaching/796/spring04/syllabus.html
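A tiny sketch of phonetic feature-based indexing: splitting a recognized phoneme sequence into overlapping trigrams, reproducing the slide's example terms.

```python
# Index overlapping phoneme trigrams so speech can be searched
# without a full word-level transcript.

def phoneme_trigrams(phonemes):
    """Join each run of three consecutive phonemes into one index term."""
    return ["".join(phonemes[i:i + 3]) for i in range(len(phonemes) - 2)]

print(phoneme_trigrams(["m", "ae", "n", "ih", "jh"]))
# -> ['maen', 'aenih', 'nihjh'], matching the slide's example terms
```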
BBN Radio News Retrieval
Muscle Fish Audio Retrieval
• Compute 4 acoustic features for each time slice
– Pitch, amplitude, brightness, bandwidth
• Segment at major discontinuities
– Find average, variance, and smoothness of segments
• Store pointers to segments in 13 sorted lists
– Use a commercial database for proximity matching
• 4 features, 3 parameters for each, plus duration
– Then rank order using statistical classification
• Display file name and audio
Slide from http://www.glue.umd.edu/~oard/teaching/796/spring04/syllabus.html
Music
• Search by metadata (in text)
• Search by singing/humming
• http://www.midomi.com
Image Retrieval
• Three traditional approaches
– Controlled vocabulary indexing
– Ranked retrieval based on associated captions
– Social filtering based on other users’ ratings
• Today’s focus is on content-based retrieval
– An analogue of content-based text retrieval
Webseek: http://www.ctr.columbia.edu/webseek/
Color Histogram Matching
• Represent image as a rectangular pixel raster
– e.g., 1024 columns and 768 rows
• Represent each pixel as a quantized color
– e.g., 256 colors ranging from red through violet
• Count the number of pixels in each color bin
– Produces vector representations
• Compute color vector similarity
– e.g., normalized inner product of 2 vectors
(the normalized inner product is the cosine of the angle between the two color vectors)
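A minimal color-histogram matching sketch with NumPy; the random "images" and the single-channel 256-color quantization are assumptions for illustration.

```python
import numpy as np

# Quantize each pixel to a color bin, count pixels per bin,
# then compare images by normalized inner product (cosine similarity).

rng = np.random.default_rng(0)
image_a = rng.integers(0, 256, size=(768, 1024))   # pixels already quantized to 0..255
image_b = rng.integers(0, 256, size=(768, 1024))

def color_histogram(image, n_bins=256):
    return np.bincount(image.ravel(), minlength=n_bins).astype(float)

def similarity(h1, h2):
    # Normalized inner product of the two histogram vectors
    return float(np.dot(h1, h2) / (np.linalg.norm(h1) * np.linalg.norm(h2)))

print(similarity(color_histogram(image_a), color_histogram(image_b)))
print(similarity(color_histogram(image_a), color_histogram(image_a)))   # identical images -> 1.0
```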
Color Histogram Example
Image Retrieval Summary
• Query by
– Keywords, image example, sketch
• Matching query against image
– Caption text
– Segmentation
– Similarity of color, texture, shape
– Spatial arrangement (orientation, position)
– Specialized techniques (e.g., face recognition)
• Selection
– Thumbnails as surrogates
• For your project if you use a lot of images! E.g., Banned Book Exhibition project (Fall 2008 project)
Video Retrieval Approaches
• Visual content based approach
– A video is a sequence of images
• Semantic content based approach
– keyword search
– browsing
Evaluation of IR Systems
• What can be measured that reflects the searcher’s
ability to use a system? (Cleverdon, 1966)
– Coverage of Information
– Form of Presentation
– Effort Required / Ease of Use
– Time and Space Efficiency
– Recall and Precision (together: effectiveness)
Evaluating IR Systems:
2 Strategies
• User-centered strategy
– Given a group of users, and at least 2 retrieval systems
– Have each user try the same task on both systems
– Measure which system works the “best”
• System-centered strategy
– Test collection (given documents, queries, and relevance
judgments)
– Try several variations on the retrieval system
– Measure which ranks more good docs near the top
Which is the Best Rank Order?
[Figure: six candidate rank orderings (A–F) of the same result set, with relevant documents marked at different depths.]
The user may stop reading at any point. Compute precision at each point where a relevant document appears, then take the average (average precision).
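A short sketch of (uninterpolated) average precision as just described: compute precision at each relevant document in the ranking, then average over the relevant documents. The document ids are illustrative.

```python
def average_precision(ranking, relevant):
    """ranking: list of doc ids in rank order; relevant: set of relevant doc ids."""
    hits = 0
    precisions = []
    for depth, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / depth)     # precision if the user stops here
    return sum(precisions) / len(relevant) if relevant else 0.0

# Toy comparison: rankings that put the relevant documents higher score better.
print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d2"}))   # 1.0
print(average_precision(["d3", "d1", "d4", "d2"], {"d1", "d2"}))   # (1/2 + 2/4) / 2 = 0.5
```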
Precision and Recall
• Precision
– How much of what was found is relevant?
– Often of interest, particularly for interactive
searching
• Recall
– How much of what is relevant was found?
– Particularly important for law, patents, and
medicine
Measures of Effectiveness
[Venn diagram: the set of retrieved documents (Ret) overlaps the set of relevant documents (Rel) in Ret ∩ Rel.]
Precision = |Ret ∩ Rel| / |Ret|
Recall = |Ret ∩ Rel| / |Rel|
Slide adapted from http://www.umiacs.umd.edu/~oard/teaching/690/fall05/syllabus.html (Session 12)
Effectiveness: Set-Based Measures
                 Relevant    Not relevant
Retrieved           A             B
Not retrieved       C             D

• Collection size = A + B + C + D
• Relevant = A + C
• Retrieved = A + B
• Precision = A / (A + B)
• Recall = A / (A + C)
• Miss = C / (A + C)
• False alarm (fallout) = B / (B + D)
When is precision important?
When is recall important?
Slide adapted from http://www.umiacs.umd.edu/~jimmylin/LBSC690-2007-Spring/content.html (Session 11)
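The set-based measures above, computed directly from sets of document ids in Python; the ids and the collection size are invented for illustration.

```python
retrieved = {"d1", "d2", "d3", "d4"}
relevant  = {"d2", "d4", "d7"}
collection_size = 10            # assumed total number of documents

a = len(retrieved & relevant)               # retrieved and relevant
b = len(retrieved - relevant)               # retrieved but not relevant
c = len(relevant - retrieved)               # relevant but missed
d = collection_size - a - b - c             # neither retrieved nor relevant

precision   = a / (a + b)
recall      = a / (a + c)
miss        = c / (a + c)
false_alarm = b / (b + d)                   # a.k.a. fallout

print(precision, recall, miss, false_alarm)   # 0.5 0.667 0.333 0.286 (rounded)
```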
Precision-Recall Curves
[Plot: precision-recall curves, with precision (0 to 1) on the vertical axis and recall (0 to 1) on the horizontal axis.]
Source: Ellen Voorhees, NIST
User Studies
• Goal is to account for interface issues
– By studying the interface component
– By studying the complete system
• Formative evaluation
– Provide a basis for system development
– Include users on the development team so that they can give feedback immediately
• Summative evaluation
– Designed to assess performance of the system
Slide adapted from http://www.umiacs.umd.edu/~jimmylin/LBSC690-2007-Spring/content.html (Session 11)
Quantitative User Studies
• Select independent variable(s)
– e.g., what info to display in selection interface: title, snippet
• Select dependent variable(s)
– e.g., time to find a known relevant document
• Run subjects in different orders
– Average out learning and fatigue effects
• Compute statistical significance
– Null hypothesis: independent variable has no effect
– Rejected if p<0.05
Slide adapted from http://www.umiacs.umd.edu/~jimmylin/LBSC690-2007-Spring/content.html (Session 11)
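A hedged sketch of the significance test described above, using a paired t-test (scipy.stats.ttest_rel) on invented task-completion times for six subjects who tried both interfaces.

```python
from scipy import stats

# Seconds to find a known relevant document, per subject, on each interface.
# The numbers are invented for illustration.
time_interface_a = [42.1, 55.0, 38.7, 61.2, 47.9, 50.3]
time_interface_b = [39.5, 49.8, 36.0, 58.1, 44.2, 46.7]

t_stat, p_value = stats.ttest_rel(time_interface_a, time_interface_b)
print(t_stat, p_value)
if p_value < 0.05:
    print("Reject the null hypothesis: the interface appears to affect search time.")
else:
    print("No significant difference detected at p < 0.05.")
```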
Qualitative User Studies
• Observe user behavior
– Instrumented software, eye trackers, etc.
– Face and keyboard cameras
– Think-aloud protocols
– Interviews and focus groups
• Organize the data
– For example, group it into overlapping categories
• Look for patterns and themes
• Develop a “grounded theory”
Slide adapted from http://www.umiacs.umd.edu/~jimmylin/LBSC690-2007-Spring/content.html (Session 11)
Questionnaires
• Demographic data
– For example, computer experience
– Basis for interpreting results
• Subjective self-assessment
– Which did they think was more effective?
– Often at variance with objective results!
• Preference
– Which interface did they prefer? Why?
Slide adapted from http://www.umiacs.umd.edu/~jimmylin/LBSC690-2007-Spring/content.html (Session 11)
Affective Evaluation of Systems
• Measure stickiness through frequency of use
– Non-comparative, long-term study
• Key factors (from cognitive psychology):
– Worst experience
– Best experience
– Most recent experience
• Highly variable effectiveness is undesirable
– Bad experiences are particularly memorable
Slide adapted from http://www.umiacs.umd.edu/~oard/teaching/690/fall05/syllabus.html (Session 12)
By now you should know…
• Why information retrieval is hard
• Why information retrieval is more than just
querying a search engine
• The difference between Boolean and ranked
retrieval (and their advantages /disadvantages)
• Basics of evaluating information retrieval
systems
• Roughly how IR systems work
• Roughly how Web search engines work
Example IR Interfaces
• Google: keyword in context
• Microsoft Live: query refinement suggestions
• Exalead: faceted refinement
• Vivisimo/Clusty: clustered results
• Kartoo: cluster visualization
• WebBrain: structure visualization
• Grokker: "map view"
• PubMed: related article search
Hands On: Try Some Search Engines
• Web Pages (using spatial layout)
– http://kartoo.com/
• Multimedia (based on metadata)
– http://singingfish.com
• Movies (based on recommendations)
– http://www.movielens.umn.edu
• Grey literature (based on citations)
– http://citeseer.ist.psu.edu/
• Images (based on image similarity)
– http://elib.cs.berkeley.edu/photos/blobworld/
– (dead link, or system down?)
• Cont’d on next slide
Slide from http://www.umiacs.umd.edu/~oard/teaching/690/fall05/syllabus.html (Session 12)
Hands on: Just play with them
• Query formulation
- Advanced search interface?
- Help?
- Query refinement prompt?
• Presentation of result sets
- What information was presented to you?
- In which way? 1 dimension, 2+ dimensions? Categories?
• Rank criteria
- by similarity scores?
- by date? alphabetic order? It’s ok if you cannot tell.
Summary
• Search is a process engaged in by people
• Co-design problem
• Human-machine synergy is the key
• Content and behavior offer useful evidence
• Evaluation must consider many factors