Transcript PPT
• Session 1
IR in a Nutshell:
Applications, Research, and Challenges
Tamer Elsayed, Feb 21st, 2013
Roadmap
What is Information Retrieval (IR)?
● Overview and applications
Overview of my research interests
● Large-scale problems
● MapReduce Extensions
● Twitter Analysis
The future of IR research
● SWIRL 2012
WHAT IS IR?
OVERVIEW & APPLICATIONS/RESEARCH TOPICS
Information Retrieval (IR) …
[Diagram labels: information need, unstructured query, hits]
Who and Where?
*Source: Matt Lease (IR Course at UTexas)
IR is not just “Web Page” Ranking
or Document or Retrieval
Web Search: Google
● Search suggestions
● Query-biased summarization
● Search shortcuts
● Vertical search (news, blog, image)
● Sponsored search
Web Search: Google II
● Spelling correction
● Personalized search / social ranking
● Vertical search (local)
Cross-Lingual IR
1/3 of the Web is not in English
About 50% of Web users do not use English as their primary language
Many (maybe most) search applications have to deal with multiple languages
● monolingual search: search in one language, but with many possible languages
● cross-language search: search in multiple languages at the same time
Routing / Filtering
Given a standing query, analyze new information as it arrives
● Input: all email, RSS feed or listserv, …
● Typically classification rather than ranking
● Simple example: ham vs. spam
*Source: Matt Lease (IR Course at UTexas)
Content-based Music Search
*Source: Matt Lease (IR Course at UTexas)
Speech Retrieval
*Source: Matt Lease (IR Course at UTexas)
Entity Search
*Source: Matt Lease (IR Course at UTexas)
Question Answering & Focused Retrieval
*Source: Matt Lease (IR Course at UTexas)
Expert Search
*Source: Matt Lease (IR Course at UTexas)
Blog Search
*Source: Matt Lease (IR Course at UTexas)
μ-Blog Search (e.g. Twitter)
*Source: Matt Lease (IR Course at UTexas)
e-Discovery
*Source: Matt Lease (IR Course at UTexas)
Book Search
● Find books or more focused results
● Detect / generate / link table of contents
● Classification: detect genre (e.g. for browsing)
● Detect related books, revised editions
● Challenges: variable scan quality, OCR accuracy, copyright, etc.
Other Visual Interfaces
*Source: Matt Lease (IR Course at UTexas)
MY RESEARCH
My Research …
Text + Large-Scale Processing → User Application
● emails (Enron, ~500,000) → Identity Resolution
● web pages (ClueWeb, ~1,000,000,000) → Web Search
Back in 2009 …
Before 2009, only small text collections were available
● Largest: ~1M documents
ClueWeb09
● Crawled by CMU in 2009
● ~1B documents!
● Need to move to cluster environments
MapReduce/Hadoop seemed like a promising framework
MapReduce Framework
(a) Map: input (k1, v1) → map → [(k2, v2)]
(b) Shuffle: group values by key
(c) Reduce: (k2, [v2]) → reduce → [(k3, v3)] → output
Framework handles “everything else”!
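The three stages can be illustrated with a minimal single-machine sketch in Python. It is not Hadoop code: `run_mapreduce`, `count_mapper`, and `count_reducer` are hypothetical names chosen for this example, and word counting is simply a convenient toy job.

```python
from collections import defaultdict

def run_mapreduce(inputs, mapper, reducer):
    """Simulate map, shuffle, and reduce in memory (illustration only)."""
    # (a) Map: each input (k1, v1) yields a list of (k2, v2) pairs.
    intermediate = []
    for k1, v1 in inputs:
        intermediate.extend(mapper(k1, v1))
    # (b) Shuffle: group all intermediate values by their key k2.
    groups = defaultdict(list)
    for k2, v2 in intermediate:
        groups[k2].append(v2)
    # (c) Reduce: each (k2, [v2]) yields a list of (k3, v3) pairs.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

def count_mapper(doc_id, text):
    # Emit (word, 1) for every token in the document.
    return [(word, 1) for word in text.lower().split()]

def count_reducer(word, counts):
    # Sum the partial counts collected for one word.
    return [(word, sum(counts))]

docs = [("d1", "web search"), ("d2", "web page ranking")]
word_counts = run_mapreduce(docs, count_mapper, count_reducer)
```

In a real cluster the framework would also handle partitioning, scheduling, and fault tolerance, which is the “everything else” this sketch leaves out.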
Ivory
http://ivory.cc
● E2E search toolkit using MapReduce
● Completely designed for the Hadoop environment
● Experimental platform for research
● Supports common text collections (+ ClueWeb09)
● Open source release
● Implements state-of-the-art retrieval models
(1) Pairwise Similarity in Large Collections
[Figure: documents compared pairwise, with similarity scores between 0.00 and 0.74]
Applications:
● Clustering
● “more-like-that” queries
Decomposition
Each term contributes only if it appears in both documents of a pair
[Equation not recoverable; its parts are labeled: map, reduce]
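One way to read this decomposition, assuming the similarity of two documents is the inner product of their term-weight vectors, is sketched below. The map-like step walks each term's postings list and emits one partial product per document pair sharing that term; the reduce-like step sums the partial products per pair. The function names and the toy weights are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_postings(doc_vectors):
    """Invert {doc: {term: weight}} into {term: [(doc, weight), ...]}."""
    postings = defaultdict(list)
    for doc_id, weights in doc_vectors.items():
        for term, weight in weights.items():
            postings[term].append((doc_id, weight))
    return postings

def pairwise_similarity(doc_vectors):
    """Sum per-term partial products for every pair sharing a term."""
    scores = defaultdict(float)
    for term, plist in build_postings(doc_vectors).items():
        # "map": a term only produces output for documents that contain it.
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            # "reduce": accumulate the contribution for this document pair.
            scores[(d1, d2)] += w1 * w2
    return dict(scores)

vectors = {
    "d1": {"web": 0.5, "search": 0.5},
    "d2": {"web": 0.5, "ranking": 1.0},
    "d3": {"music": 1.0},
}
sims = pairwise_similarity(vectors)
```

Pairs with no term in common, such as any pair involving `d3` here, never appear in the output, so no work is spent on them.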
(2) Cross-Lingual Pairwise Similarity
Find similar document pairs in different languages
More difficult than monolingual!
● Multilingual text mining, machine translation
● Application: automatic generation of potential “interwiki” language links
Locality-sensitive Hashing
● Vectors close to each other are likely to have similar signatures
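A minimal sketch of one locality-sensitive hashing scheme, random projection, shows why nearby vectors tend to get similar signatures. Each signature bit records on which side of a random hyperplane the vector falls. The function names, the seed, and the 64-bit signature length are assumptions made for this example, not details taken from the talk.

```python
import random

def random_hyperplanes(num_bits, dim, seed=0):
    """Draw one Gaussian random vector per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the dot product is non-negative."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        bits.append(1 if dot >= 0 else 0)
    return bits

def hamming(sig_a, sig_b):
    """Number of positions where two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))

planes = random_hyperplanes(num_bits=64, dim=3)
v = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]      # same angle as v, different length
opposite = [-1.0, -2.0, -3.0]         # points the other way

d_same = hamming(signature(v, planes), signature(same_direction, planes))
d_opposite = hamming(signature(v, planes), signature(opposite, planes))
```

The Hamming distance between two signatures grows with the angle between the original vectors, so vectors pointing the same way agree on every bit while opposite vectors disagree on every bit.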
Solution Overview
● Nf German articles → CLIR projection → English document vectors
● Ne English articles → Preprocess → English document vectors
● Ne+Nf English document vectors → Signature generation (Random Projection / Minhash / Simhash) → Ne+Nf signatures (e.g. 11100001010, 01110000101)
● Ne+Nf signatures → Sliding window algorithm → Similar article pairs
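The slide does not spell out the sliding window algorithm, so the following is a simplified sketch of one plausible reading: sort the signatures, then compare each one only with its neighbors inside a fixed-size window. The document ids, the bit strings, and the single sort pass are assumptions; a fuller version would likely repeat the pass over several bit permutations to catch pairs that one ordering separates.

```python
def sliding_window_pairs(signatures, window, max_distance):
    """Return id pairs whose signatures are close, checking only neighbors.

    signatures: dict mapping a document id to a bit string.
    window: each signature is compared with the next (window - 1) in sort order.
    """
    ordered = sorted(signatures.items(), key=lambda item: item[1])
    pairs = set()
    for i, (id_a, sig_a) in enumerate(ordered):
        for id_b, sig_b in ordered[i + 1 : i + window]:
            distance = sum(x != y for x, y in zip(sig_a, sig_b))
            if distance <= max_distance:
                pairs.add(tuple(sorted((id_a, id_b))))
    return pairs

# Hypothetical 8-bit signatures for two English and two German articles.
sigs = {
    "en_1": "11100001",
    "de_1": "11100011",
    "en_2": "00011100",
    "de_2": "00011110",
}
similar = sliding_window_pairs(sigs, window=2, max_distance=1)
```

Sorting brings signatures with a shared prefix next to each other, so the number of comparisons grows with the window size rather than with the square of the collection size.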
(3) Approximate Positional Indexes
“Learning to Rank” models: learn effective ranking functions
Proximity features rely on term positions
● Exact term positions: large index, slow query evaluation
● Approximate term positions: smaller index, faster query evaluation
Close Enough is Good Enough?
Fixed-Width Buckets
Buckets of length W
[Figure: document d1 split into buckets 1 to 5 and document d2 split into buckets 1 to 3]
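As a rough illustration of fixed-width buckets, the sketch below replaces each exact term position with the number of the bucket it falls in. It assumes zero-based token positions and one-based bucket numbers (to match the figure); the helper names and the example positions are hypothetical.

```python
def to_bucket(position, width):
    """Map a zero-based token position to a one-based bucket of length `width`."""
    return position // width + 1

def bucketize(positions, width):
    """Store only the distinct buckets in which a term occurs."""
    return sorted({to_bucket(p, width) for p in positions})

def min_bucket_distance(positions_a, positions_b, width):
    """Approximate proximity: smallest gap, in buckets, between two terms."""
    return min(
        abs(a - b)
        for a in bucketize(positions_a, width)
        for b in bucketize(positions_b, width)
    )

W = 10
web_positions = [3, 47]       # positions of one query term in a document
search_positions = [5, 95]    # positions of another query term

web_buckets = bucketize(web_positions, W)
search_buckets = bucketize(search_positions, W)
gap = min_bucket_distance(web_positions, search_positions, W)
```

A bucket distance of 0 only says the two terms occur within the same W-token span, which is the “close enough” approximation: the index stores fewer, smaller numbers at the cost of exact distances.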
(4) Pseudo Training Data for Web Rankers
Documents, queries, and relevance judgments
● Important driving force behind IR innovation
● In industry, easy to get
● In academia, hard and really expensive
Web Graph
[Figure: web graph of pages P1 to P7, with links labeled by anchor text such as “web search”, “SIGIR 2012”, and “Google”]
Queries and Judgments?
● anchor text lines ≈ pseudo queries
● target pages ≈ relevant candidates
● noise reduction?
[Figure: the same web graph of pages P1 to P7, with anchor text such as “web search” and “SIGIR 2012” read as queries]
(5) Extending MapReduce Framework
● Iterative computations (iHadoop)
● Concurrent jobs with shared data
● m maps - r reduces instead of 1 map - 1 reduce
(6) Twitter Analysis
Real-time search in Twitter
● TREC 2011 (6th out of 59 teams)
● TREC 2013?
Answering Real-time Questions from Arabic Social Media
● NPRP-submitted
FUTURE RESEARCH DIRECTIONS
SWIRL 2012
Goal of Report
● Inspire researchers and graduate students to address the questions raised
● Provide funding agencies with data to focus and coordinate support for information retrieval research
Participants were asked to focus on efforts that could be handled in an academic setting, without the requirement of large-scale commercial data.
Key Themes (across Topics)
Not just a ranked list
● move beyond the classic “single ad hoc query and ranked list” approach
Help for users
● support users more broadly, including ways to bring IR to inexperienced, illiterate, and disabled users.
Capturing context
● treat people using search systems, their context, and their information needs as critical aspects needing exploration
Information, not documents
● beyond document retrieval and into more complex types of data and more complicated results
New Domains
● data with restricted access, collections of “apps,” and richly connected workplace data
Evaluation
● suggest new techniques for evaluation
“Most Interesting” Topics
[1] Conversational Answer Retrieval
IR: provides ranked lists of documents in response to a wide range of keyword queries
QA: provides more specific answers to a very limited range of natural language questions.
Goal: combine the advantages of both to provide effective retrieval of appropriate answers to a wide range of questions expressed in natural language, with rich user-system dialogue
Proposed Research
Questions: open-domain, natural language text questions
Answers: develop more general approaches to identifying as many constraints as possible on the answers to questions
Dialogue: initiated by the searcher and proactively by the system, for:
● refining the understanding of questions
● improving the quality of answers
Answers: short answers, text passages, clustered groups of passages, documents, or even groups of documents may be appropriate answers, as may tables, figures, images, or videos
Challenges
● Definitions of question and answer for open domain searching
● Techniques for representing questions and answers
● Techniques for reasoning about and ranking answers
● Techniques for representing a mixed-initiative CAR dialogue
● Effective dialogue actions for improving question understanding
● Effective dialogue actions for refining answers
[2] Finding What You Need with Zero Query Terms (or Less)
Function without an explicit query, depending on context and personalization in order to understand user needs
Anticipate user needs and respond with information appropriate to the current context without the user having to enter a query (zero query terms) or even initiate an interaction with the system (or less).
In a mobile context: take the form of an app that recommends interesting places and activities based on the user’s location, personal preferences, past history, and environmental factors such as weather and time.
In a traditional desktop environment: might monitor ongoing activities and suggest related information, or track news, blogs, and social media for interesting updates.
Imagine a system that automatically gathers information related to an upcoming task.
Proposed Research
● New representations of information and user needs, along with methods for matching the two
● Modeling person, task, and context
● Methods for finding “objects of interest”, including content, people, objects, and actions
● Methods for determining what, how, and when to show material of interest
Challenges
● Time- and geo-sensitivity; trust, transparency, privacy; determining interruptibility; summarization
● Power management in mobile contexts
● Evaluation
[3] Mobile Information Retrieval Analytics (MIRA)
No company or researcher has an understanding of mobile information access across a variety of tasks, modes of interaction, or software applications.
For example, a search service provider might know that a query was issued, but not know whether the results it provided resulted in consequent action.
The identification of common types of web search queries led to query classification and algorithms tuned for different purposes, which improved web search accuracy. A similar understanding for mobile information seeking would focus research on the problems of highest value to mobile users.
Goal: study what information, what kind of information, and what granularity of information to deliver for different tasks and contexts
Proposed Research
Methodology and tools for doing large-scale collection of data about mobile information access.
Research on incentive mechanisms is required to understand situations in which people are willing to allow their behavior to be monitored.
Research on privacy is required to understand what can be protected by dataset licenses alone, what must be anonymized, and tradeoffs between anonymization and data utility.
Development of well-defined information seeking tasks
Support quantitative evaluation in well-defined evaluation frameworks that lead to repeatable scientific research
Challenges
● Developing incentive mechanisms
● Developing data collections that are sufficiently detailed to be useful while still protecting people’s privacy
● Collection of data in a manner that university internal review boards will consider ethically acceptable
● Collection of data in a manner that does not violate the Terms of Use restrictions of commercial service providers
[4] Empowering Users to Search and Learn
Search engines are currently optimized for look-up tasks and not tasks that require more sustained interactions with information
People have been conditioned by current search engines to interact in particular ways that prevent them from achieving higher levels of learning.
We seek to empower users to be more proactive and critical thinkers during the information search process.
[5] The Structure Dimension
Better integration of structured and unstructured information to seamlessly meet a user’s information needs is a promising but underdeveloped area of exploration.
Named entities, user profiles, contextual annotations, as well as (typed) links between information objects ranging from web pages to social media messages.
[6] Understanding People in Order to Improve Information (Retrieval) Systems
Development of a research resource for the IR community:
1. from which hypotheses about how to support people in information interactions can be developed
2. in which IR system designs can be appropriately evaluated.
Conducting studies of people
● before, during, and after engagement with information systems,
● at a variety of levels,
● using a variety of methods:
• ethnography
• in situ observation
• controlled observation
• large-scale logging
Thank You!