Comparing Document Segmentation for
Passage Retrieval in Question Answering
Jörg Tiedemann
University of Groningen
presented by:
Moy’awiah Al-Shannaq
[email protected]
December 05, 2011
Outline
• Introduction
• Overview of passage retrieval module
• Strategies for passage retrieval in QA
• Document segmentation
• Passage retrieval in Joost
• Experiments
– Setup
– Results
• Conclusion
• Future Work
• References
Introduction
• Information Retrieval (IR): the area of study concerned with searching
for documents, for information within documents, and for metadata about
documents, as well as searching structured storage, relational
databases, and the World Wide Web.*
• Passage Retrieval: retrieves individual passages within documents (one or
more sentences or paragraphs).
• Precision = (number of relevant documents retrieved) / (total number of
documents retrieved)
• Recall = (number of relevant documents retrieved) / (total number of
relevant documents)
• Question Answering (QA): QA systems include a passage retrieval component
to reduce the search space for information extraction modules.
*http://en.wikipedia.org/wiki/Information_retrieval
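The precision and recall definitions above can be illustrated with a minimal Python sketch. The function name `precision_recall` and the document ids are hypothetical, chosen only for illustration; returning 0.0 for an empty set is an assumption, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: ids of the documents returned by the system
    relevant:  ids of the documents judged relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    # Assumption: an empty denominator yields 0.0 instead of raising an error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 of the 6 relevant ones found.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d3", "d7", "d8", "d9"]
print(precision_recall(retrieved, relevant))  # (0.75, 0.5)
```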
Passage Types
• Discourse passages: segmentation is based on document structure.
– Problems with this approach often arise with special structures such as
headers, lists and tables, which are easily mixed with other units such as
proper paragraphs.
• Semantic passages: split documents into semantically motivated units
using some topical structure.
• Window-based passages: use fixed or variable-sized windows to segment
documents into smaller units.
– In the simplest case, windows have a fixed length and cover
non-overlapping parts of the document.
Passage Incorporation Approaches
• We can distinguish between two approaches to the incorporation of
passages in information retrieval:
1) Using passage-level evidence to improve document retrieval.
2) Using passages directly as the unit to be retrieved.
• The paper focuses on the second approach, which returns small units for QA.
What are the differences between Passage
Retrieval in QA and ordinary IR?
• Passage retrieval in QA differs from ordinary IR in at least two points:
1) Queries are generated from user questions and not manually created as in
standard IR.
2) The units to be retrieved are usually much smaller than documents in IR.
• The division of documents into passages is crucial for two reasons:
1) The textual units have to be big enough to ensure IR works properly.
2) They have to be small enough to enable efficient and accurate QA.
Strategies for Passage Retrieval in QA
• Search-time passaging: a two-step strategy that retrieves documents first
and then selects relevant passages within these documents.
– Returns only one passage per relevant document.
• Index-time passaging: a one-step strategy in which documents are segmented
before indexing, so relevant passages are returned directly.
– Allows multiple passages per relevant document to be returned.
• In our QA system we adopt the second strategy, using a standard IR engine
to match keyword queries generated from a natural language question with
passages.
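The contrast between the two strategies can be sketched in a few lines of Python. This is a toy illustration only: the keyword-overlap `score` function stands in for a real IR engine, and all names, parameters and example texts are hypothetical rather than taken from the paper or from Joost.

```python
def score(query_terms, text):
    """Toy relevance score: number of distinct query terms found in the text."""
    return len(set(query_terms) & set(text.lower().split()))


def search_time_passaging(query_terms, docs, top_docs=2):
    """Two steps: rank whole documents, then keep the best passage of each."""
    ranked = sorted(
        docs,
        key=lambda d: score(query_terms, " ".join(docs[d])),
        reverse=True,
    )
    results = []
    for doc_id in ranked[:top_docs]:
        best = max(docs[doc_id], key=lambda p: score(query_terms, p))
        results.append((doc_id, best))  # at most one passage per document
    return results


def index_time_passaging(query_terms, docs, top_passages=2):
    """One step: passages are the indexed units and are ranked directly."""
    units = [(doc_id, p) for doc_id, passages in docs.items() for p in passages]
    units.sort(key=lambda u: score(query_terms, u[1]), reverse=True)
    return units[:top_passages]  # several passages may share a document


# Hypothetical mini-collection: each document is a list of passages.
docs = {
    "a": ["joost is a dutch qa system", "joost answers dutch questions"],
    "b": ["the weather was fine"],
}
query = ["joost", "dutch"]
print(search_time_passaging(query, docs))  # one passage from "a", one from "b"
print(index_time_passaging(query, docs))   # both passages come from "a"
```

In this toy run the index-time variant returns two passages from the same document, which the search-time variant cannot do.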
Document Segmentation
• The experiments work with Dutch data from the QA tasks at the
Cross-Language Evaluation Forum (CLEF).
• The document collection used there is a collection of two daily newspapers
from the years 1994 and 1995.
– It includes about 190,000 documents (newspaper articles).
– It contains 4 million sentences with approximately 80 million words.
– The documents include additional markup to segment them into
paragraphs.
• We define document boundaries as hard boundaries, i.e., passages may
never come from more than one document in the collection.
Document Segmentation Strategies
• Window-based passages: Documents are split into passages of fixed size
(in terms of number of sentences).
• Variable-sized arbitrary passages: Passages may start at any sentence in
each document and may have variable lengths.
– This is implemented by adding redundant information to our standard IR
index.
– We create passages starting at every sentence in a document for each
length defined.
• Sliding window passages: A sliding window approach also adds
redundancy to the index by sliding over documents with a fixed-sized
window.
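The three strategies above can be sketched for a single document as follows. This is a minimal illustration, not the paper's implementation: the function names, the `step` parameter, and the truncation of passages at the end of a document are assumptions. Each function works on one document at a time, which keeps document boundaries hard.

```python
def fixed_windows(sentences, size):
    """Non-overlapping windows of `size` sentences (the last may be shorter)."""
    return [sentences[i:i + size] for i in range(0, len(sentences), size)]


def sliding_windows(sentences, size, step=1):
    """Overlapping fixed-size windows; `step` is an assumed parameter.

    With step > 1, trailing sentences that no full window reaches are left out.
    """
    if len(sentences) <= size:
        return [sentences[:]]
    last_start = len(sentences) - size
    return [sentences[i:i + size] for i in range(0, last_start + 1, step)]


def arbitrary_passages(sentences, lengths):
    """Passages starting at every sentence, for each length in `lengths`.

    Assumption: passages are cut off at the end of the document, and
    identical spans produced by this truncation are kept only once.
    """
    n = len(sentences)
    spans = []
    for start in range(n):
        for length in lengths:
            span = (start, min(start + length, n))
            if span not in spans:
                spans.append(span)
    return [sentences[a:b] for a, b in spans]


# Hypothetical five-sentence document.
doc = ["s1", "s2", "s3", "s4", "s5"]
print(fixed_windows(doc, 2))                 # [['s1', 's2'], ['s3', 's4'], ['s5']]
print(len(sliding_windows(doc, 2)))          # 4
print(len(arbitrary_passages(doc, [1, 2])))  # 9
```

The passage counts show the redundancy the slides mention: the non-overlapping split yields 3 index units for this document, the sliding window 4, and the arbitrary passages 9.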
Passage Retrieval in Joost
• The Joost QA system includes two strategies:
1) A table-lookup strategy using fact databases that have been created offline.
2) An on-line answer extraction strategy with passage retrieval and
subsequent answer identification and ranking modules.
• The paper focuses on the second strategy in order to examine the
passage retrieval component and its impact on QA performance.
Dutch CLEF Corpus
• The contents of the CLEF dataset are evidently very diverse. Most of the
documents are very short, but the longest one contains 625 sentences.
Figure 1: Distribution of document sizes in terms of sentences they contain in the Dutch CLEF corpus.
Figure 2: Distribution of paragraph sizes in terms of sentences in the Dutch CLEF corpus.
Figure 3: Distribution of paragraph sizes in terms of characters in the Dutch CLEF corpus.
Experiment Setup
• The entire Dutch CLEF document collection is used to create the index files
with the various segmentation approaches.
• There are 777 questions; each question may have several answers.
• For each setting, 20 passages are retrieved per question using the same
query generation strategy.
Evaluation Measures
1) Redundancy: The average number of passages retrieved per question
that contain a correct answer.
2) Coverage: Percentage of questions for which at least one passage is
retrieved that contains a correct answer.
3) Mean reciprocal rank (MRR): The mean, over all questions, of the
reciprocal rank of the first passage retrieved that contains a correct answer.
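The three measures can be computed from per-question judgments, as in the minimal sketch below. The input format (one list of booleans per question, in rank order, marking whether each retrieved passage contains a correct answer) and the function names are assumptions made for illustration. Counting a question without any answer-bearing passage as 0 in the MRR is a common convention that the slides do not state explicitly.

```python
def redundancy(judgments):
    """Average number of answer-bearing passages retrieved per question."""
    return sum(sum(flags) for flags in judgments) / len(judgments)


def coverage(judgments):
    """Percentage of questions with at least one answer-bearing passage."""
    return 100.0 * sum(any(flags) for flags in judgments) / len(judgments)


def mean_reciprocal_rank(judgments):
    """Mean of 1/rank of the first answer-bearing passage per question."""
    total = 0.0
    for flags in judgments:
        for rank, has_answer in enumerate(flags, start=1):
            if has_answer:
                total += 1.0 / rank
                break  # only the first hit counts
        # Assumption: questions without any hit contribute 0.
    return total / len(judgments)


# Hypothetical judgments for 4 questions with 3 retrieved passages each.
judgments = [
    [False, True, True],    # first answer at rank 2
    [True, False, False],   # first answer at rank 1
    [False, False, False],  # no answer retrieved
    [False, False, False],  # no answer retrieved
]
print(redundancy(judgments))            # 0.75
print(coverage(judgments))              # 50.0
print(mean_reciprocal_rank(judgments))  # 0.375
```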
Coverage and redundancy
Figure 4: Coverage and redundancy of passages retrieved for various segmentation strategies.
Mean Reciprocal Ranks
Figure 5: Mean reciprocal ranks of passage retrieval (IR MRR) and question answering (QA MRR) for various segmentation strategies.
Conclusion
• Accurate passage retrieval is essential for Question Answering.
• Discourse-based segmentation into paragraphs works well with standard
information retrieval techniques.
• Among the window-based approaches, a segmentation into overlapping
passages of variable length performs best, in particular for passages with
sizes of 1 to 10 sentences.
• Passage retrieval is more effective than full document retrieval.
Future Work
• Further improve discourse-based segmentation.
• Combine several retrieval settings that use various segmentation approaches.
References
[1] J. P. Callan. Passage-level evidence in document retrieval. In SIGIR ’94: Proceedings
of the 17th annual international ACM SIGIR conference on Research and Development
in information retrieval, pages 302–310, New York, NY, USA, 1994. Springer-Verlag
New York, Inc.
[2] CLEF. Multilingual question answering at CLEF. http://clef-qa.itc.it/, 2005.
[3] M. A. Greenwood. Using pertainyms to improve passage retrieval for questions
requesting information about a location. In Proceedings of the Workshop on
Information Retrieval for Question Answering (SIGIR 2004), Sheffield, UK, 2004.
[4] M. Kaszkiel and J. Zobel. Effective ranking with arbitrary passages. Journal of the
American Society of Information Science, 52(4):344–364, 2001.
[5] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Girju, R. Goodrum, and V. Rus.
The structure and performance of an open-domain question answering system, 2000.
[6] I. Roberts and R. Gaizauskas. Evaluating passage retrieval approaches for question
answering. In Proceedings of 26th European Conference on Information Retrieval,
2004.
[7] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau. Okapi at
TREC-3. In Text REtrieval Conference, pages 21–30, 1992.
[8] http://en.wikipedia.org/wiki/Information_retrieval
Thank You