Using semantic components to represent and search
domain-specific documents: An evaluation of indexing
accuracy and consistency
Marianne Lykke
Royal School of Library and Information Science
Susan L. Price and Lois M. L. Delcambre
Portland State University
ISKO 2010 Conference
Sapienza University of Rome, Faculty of Philosophy
February 23 - 26, 2010
ISKO 2010
Marianne Lykke
Agenda
• Problem and motivation
• Semantic component model
• Research questions
• Test design
• Results
• Conclusions
Problem and motivation
Challenges for information retrieval in domain-specific
digital libraries:
• Domain-specific libraries often contain large sets of
similar documents about few topics
o Important to be able to distinguish between topically
similar documents
• Domain experts often have specific information needs
targeting a single “right answer”, specified by domain-specific facets.
o Important to be able to limit search to domain-specific
dimensions
(e.g. Leckie et al., 1996; Fagin et al., 2003; Freund et al., 2005; Hearst et al., 2006)
Problem and motivation
• Little time for information retrieval
o Important that relevant documents are ranked highly and
retrieved by the first query
• Distributed indexing, carried out by indexers with varying
degrees of indexing competence
o Important to address classical indexing problems: quality,
exhaustivity, specificity, consistency
(e.g. Leckie et al., 1996; Fagin et al., 2003; Freund et al., 2005; Hearst et al., 2006)
Semantic component model
• Semantic component model developed to facilitate formulation
of specific, structured queries that cover the search topic
exhaustively along domain-specific dimensions
• Two-level model dividing a given collection into a set of
document classes, each class with an associated set of
semantic components
• Based on assumptions that
o Domain experts know document genres within a certain domain:
content and structure (Dillon, 1991; Orlikowski & Yates, 1994; Bishop, 1999; Vaughan
& Dillon, 2005)
o Domain-specific document content and structure correspond to
domain-specific information needs (Ely et al., 1999, 2000; Price, Delcambre &
Nielsen, 2006)
[Figure: document class Clinical method; semantic components (SC) shown: General information, Practical information]
HIO 2009
Marianne Lykke
[Figure: document class Clinical method; semantic components (SC) shown: General information, Risk factors, After treatment]
Semantic component model
Document class: semantic components
• Clinical problem: General information; Diagnosis; Referral; Treatment
• Clinical unit: Function and specialty; Practical information; Referral; Staff and organization
• Clinical method: General information; Practical information; Referral; Aftercare; Risks; Expected results
• Drugs: General information; Practical information; Target group; Effect; Side effects
• Services: General information; Practical information; Referral
• Notice: General information; Practical information; Qualification
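The document classes and semantic components listed above form a two-level structure that can be written down as a plain mapping. The sketch below is illustrative only: the names `SC_MODEL`, `components_for` and `classes_with` are hypothetical and are not taken from the sundhed.dk system.

```python
# Two-level semantic component model, transcribed from the slide.
# SC_MODEL, components_for and classes_with are hypothetical names.
SC_MODEL = {
    "Clinical problem": ["General information", "Diagnosis", "Referral", "Treatment"],
    "Clinical unit": ["Function and specialty", "Practical information",
                      "Referral", "Staff and organization"],
    "Clinical method": ["General information", "Practical information", "Referral",
                        "Aftercare", "Risks", "Expected results"],
    "Drugs": ["General information", "Practical information", "Target group",
              "Effect", "Side effects"],
    "Services": ["General information", "Practical information", "Referral"],
    "Notice": ["General information", "Practical information", "Qualification"],
}


def components_for(document_class):
    """Return the semantic components defined for one document class."""
    if document_class not in SC_MODEL:
        raise ValueError(f"Unknown document class: {document_class!r}")
    return SC_MODEL[document_class]


def classes_with(component):
    """Return, in alphabetical order, the document classes that define a component."""
    return sorted(cls for cls, comps in SC_MODEL.items() if component in comps)


print(components_for("Drugs"))
print(classes_with("Referral"))
```

A mapping like this makes the model easy to customize: adapting it to another collection would mean editing the dictionary, not building a controlled vocabulary.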
Case study
• sundhed.dk: Danish, national health portal
• Active since 2001, 25,000 documents
• Two main target groups: citizens and medical professionals
• Combination of full-text indexing and controlled, assigned
indexing:
o ICPC, International Classification of Primary Care
o ICD-10, International Classification of Diseases
o Home-grown Citizens Thesaurus
• Large and varied group of indexers
o 5 regions
o Up to 250 indexers per region
• Specific target group: family doctors
Test design
• Comparative, experimental indexing study
o Baseline: keyword indexing (controlled and free terms)
o Experimental: semantic component indexing
• Participants: 16 sundhed.dk indexers (convenience sample)
• Indexing task: 12 sundhed.dk documents
o 6 documents were indexed with semantic components (SC)
o 6 documents were indexed with keywords
• Random assignment of documents and indexing methods
• Training session
• Evaluation measures:
o Accuracy
o Consistency
o Indexing time
o Easiness
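The random assignment of documents to indexing methods can be sketched as follows. This is a minimal illustration, assuming that each indexer's 12 documents are shuffled independently and split into two halves; the slides do not describe the actual randomization procedure, and `assign_methods` is a hypothetical helper.

```python
import random


def assign_methods(indexers, documents, seed=None):
    """Randomly split each indexer's documents between the two indexing methods.

    Assumption: every indexer gets an independent shuffle, with the first half
    indexed with semantic components and the second half with keywords.
    """
    rng = random.Random(seed)
    half = len(documents) // 2
    plan = {}
    for indexer in indexers:
        shuffled = list(documents)
        rng.shuffle(shuffled)
        plan[indexer] = {
            "semantic_components": sorted(shuffled[:half]),
            "keywords": sorted(shuffled[half:]),
        }
    return plan


# 16 indexers and 12 documents, as in the test design
plan = assign_methods(range(1, 17), range(1, 13), seed=2010)
print(plan[1])
```

With 16 indexers and 6 documents per method, a design like this yields at most 96 indexing instances for each method.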
Research questions
• Is semantic component indexing more accurate than keyword
indexing compared to a reference standard?
• Is semantic component indexing more consistent than
keyword indexing?
• Is semantic component indexing faster than keyword
indexing?
• Is semantic component indexing easier than keyword
indexing?
Accuracy
          Semantic component               Keywords
Document  Recall         Precision         Recall         Precision
          macroaverage   macroaverage      macroaverage   macroaverage
1         0.74 ± 0.37    0.89 ± 0.26       0.14 ± 0.33    0.74 ± 0.43
2         0.56 ± 0.33    0.61 ± 0.39       0.35 ± 0.47    0.74 ± 0.42
3         0.59 ± 0.45    0.72 ± 0.38       0.10 ± 0.23    0.72 ± 0.42
4         0.33 ± 0.29    0.72 ± 0.41       0.16 ± 0.35    0.70 ± 0.45
5         0.74 ± 0.39    0.68 ± 0.47       0.38 ± 0.47    0.85 ± 0.30
6         0.59 ± 0.13    0.81 ± 0.35       0.01 ± 0.04    0.88 ± 0.31
7         0.63 ± 0.39    0.79 ± 0.31       0.28 ± 0.36    0.62 ± 0.41
8         0.70 ± 0.31    0.93 ± 0.17       0.01 ± 0.02    0.61 ± 0.49
9         0.66 ± 0.33    0.76 ± 0.43       0.21 ± 0.39    0.79 ± 0.39
10        0.61 ± 0.35    0.75 ± 0.26       0.25 ± 0.42    0.79 ± 0.39
11        0.65 ± 0.43    0.86 ± 0.31       0.12 ± 0.27    0.80 ± 0.36
12        0.63 ± 0.48    0.83 ± 0.30       0.03 ± 0.08    0.85 ± 0.34
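The recall and precision macroaverages compare each indexing against a reference standard. The sketch below shows one way such figures could be computed, assuming that the macroaverage is the mean of per-indexer scores and that an empty indexing gets a precision of 0; the slides do not spell out either choice. The function names and the example terms are invented for illustration.

```python
from statistics import mean, stdev


def precision_recall(assigned, reference):
    """Precision and recall of one indexing against the reference standard."""
    assigned, reference = set(assigned), set(reference)
    hits = len(assigned & reference)
    # Assumption: an empty indexing scores 0 precision rather than being skipped.
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


def macro_average(indexings, reference):
    """Mean and sample SD of per-indexer precision and recall for one document."""
    scores = [precision_recall(a, reference) for a in indexings]
    precisions = [p for p, _ in scores]
    recalls = [r for _, r in scores]

    def summary(values):
        return mean(values), (stdev(values) if len(values) > 1 else 0.0)

    return {"precision": summary(precisions), "recall": summary(recalls)}


# Invented example: three indexers and a four-term reference standard
reference = {"diabetes", "insulin", "diet", "referral"}
indexings = [
    {"diabetes", "insulin"},
    {"diabetes", "diet", "exercise"},
    {"referral"},
]
result = macro_average(indexings, reference)
print(result["precision"], result["recall"])
```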
Consistency
          Semantic component       Keywords
Document  Mean K ± SD              Binary K             Traditional
          (of all semantic         (all vocabularies)   consistency (1) ± SD
          components in the
          document)
1         0.46 ± 0.35              -0.08                0.05 ± 0.13
2         0.21 ± 0.16              0.001                0.18 ± 0.19
3         0.25 ± 0.30              -0.08                0.05 ± 0.11
4         0.35 ± 0.23              0.02                 0.19 ± 0.30
5         0.50 ± 0.30              0.32                 0.33 ± 0.23
6         0.05 ± 0.11              -0.07                0.23 ± 0.41
7         0.40 ± 0.48              0.26                 0.27 ± 0.18
8         0.66 ± 0.11              -0.08                0.05 ± 0.11
9         0.04 ± 0.24              -0.02                0.09 ± 0.14
10        0.44 ± 0.16              0.27                 0.29 ± 0.13
11        0.48 ± 0.41              -0.06                0.04 ± 0.09
12        0.01 ± 0.07              -0.12                0.08 ± 0.24
(1) consistency = c / (a + b - c)
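The traditional consistency measure, c / (a + b - c), can be sketched in a few lines. The code assumes the usual reading of the symbols, which the slide does not define: a and b are the numbers of terms assigned by two indexers and c is the number of terms they share. Averaging over all pairs of indexers is also an assumption, and the function names and example terms are invented.

```python
from itertools import combinations
from statistics import mean, stdev


def pair_consistency(terms_a, terms_b):
    """Consistency = c / (a + b - c) for one pair of indexers."""
    set_a, set_b = set(terms_a), set(terms_b)
    a, b = len(set_a), len(set_b)
    c = len(set_a & set_b)  # terms assigned by both indexers
    denominator = a + b - c
    # Assumption: two empty indexings count as full agreement.
    return c / denominator if denominator else 1.0


def document_consistency(indexings):
    """Mean and sample SD of pairwise consistency; needs at least two indexings."""
    scores = [pair_consistency(x, y) for x, y in combinations(indexings, 2)]
    return mean(scores), (stdev(scores) if len(scores) > 1 else 0.0)


# Invented example: three indexers' keywords for the same document
keyword_sets = [
    {"asthma", "inhaler"},
    {"asthma", "children"},
    {"asthma", "inhaler", "dosage"},
]
print(document_consistency(keyword_sets))
```

Because the measure counts only exact term matches, it tends to fall as the vocabulary grows, which is one reason it is not directly comparable to a chance-corrected statistic.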
Time to index
[Figure: bar chart of number of indexing instances (0 to 40) by time to index (< 2 min, 2 - 5 min, 5 - 10 min, 10 - 15 min, > 15 min), with one series for semantic component indexing and one for keyword indexing]
Easiness
[Figure: bar chart of number of indexers (0 to 10) per indexing step, with a legend running from Very difficult to Very easy. Steps: choose concept, choose keyword, what each SC is, designate SC, mark boundaries, choose doc. class]
Conclusions
• Accuracy varied for both indexing methods, but the data suggest
that semantic component indexing might be more accurate
• Indications that the feasibility and easiness of the two indexing
methods are similar
• Semantic component indexing may be a preferable alternative when
no appropriate controlled vocabulary is available, because it takes
little time to develop and is easy to customize to a specific
document collection
• Limitations:
o Small sample and a single domain
o Evaluation measures for the two methods are not directly comparable
• Retrieval test shows a 25.6% improvement in document ranking,
measured by nDCG (normalized Discounted Cumulative Gain)
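For readers unfamiliar with nDCG, the sketch below shows one common formulation: each result's graded relevance is discounted by log2(rank + 1), and the sum is divided by the score of the ideal ordering. The slides do not say which variant or rank cutoff the retrieval test used, so this is an assumption, and the relevance grades in the example are invented.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def relative_improvement(baseline, experimental):
    """Relative change in nDCG between two rankings, as a fraction."""
    return (ndcg(experimental) - ndcg(baseline)) / ndcg(baseline)


# Invented relevance grades (0 = not relevant, 3 = highly relevant)
baseline_ranking = [0, 2, 1, 3]
experimental_ranking = [3, 2, 1, 0]
print(ndcg(baseline_ranking), ndcg(experimental_ranking))
print(relative_improvement(baseline_ranking, experimental_ranking))
```

Because nDCG rewards placing the most relevant documents first, it fits the stated requirement that relevant documents be ranked highly and found with the first query.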
Future research
• Development of model:
o Simpler version
o Mark-up by users (social tagging)
o Automatic mark-up
o Mark-up in XML
• Larger scale evaluation
• Evaluation in other domains
References
Dillon, A (1991). Readers’ models of text structures: the case of academic articles. International Journal of Man-Machine
Studies, 35. 913 – 925.
Ely, J, Osheroff, J, Ebell, M, Bergus, G, Levy, B, Chambliss, M & Evans, E (1999). Analysis of questions asked by family
doctors regarding patient care. BMJ, 319 (7206). 358 – 361.
Ely, J, Osheroff, J, Gorman, P, Ebell, M, Bergus, G, Levy, B, Chambliss, M, Pifer, E & Stavri, P (2000). A taxonomy of generic
clinical questions: classification study. BMJ, 321 (7258). 429 – 432.
Fagin, R., Kumar, R., McCurley, K S., Novak, J., Sivakumar, D., Tomlin, J.A. & Williamson, D.P. (2003). Searching the
workplace web. In: Proceedings of the 12th International World Wide Web Conference (WWW ’03), Budapest,
Hungary, May 20-24, 2003. 366-375.
Freund, L., Toms, E. & Waterhouse, J. (2005). Modeling the information behaviour of software engineers using a work-task
framework. In: Grove, A (ed.) ASIS&T ’05 Proceedings of the 68th Annual meeting, Charlotte, NC, October 28 - November
2, 2005.
Hearst, M & Plaunt, C (1993). Subtopic structuring for full length document access. Proceedings of the ACM SIGIR
Conference on Research and Development in Information Retrieval. 59 – 69.
Leckie, G.J., Pettigrew, K.E. & Sylvain, C. (1996). Modeling the information seeking of professionals. Library Quarterly, 66
(2). 161-193.
Orlikowski, W J & Yates, J (1994). Genre repertoire: the structuring of communicative practices in organizations.
Administrative Science Quarterly, 39. 541 – 574.
Price, S, Delcambre, L & Nielsen, M L (2006). Using semantic components to express questions against document
collections. Proceedings International Workshop on Health Information and Knowledge Management (HIKM 2006),
Arlington (VA).
Price, S, Nielsen, M L, Delcambre, L & Vedsted, P (2007). Semantic components enhance retrieval of domain-specific
documents. Proceedings of the ACM Sixteenth Conference on Information and Knowledge Management (CIKM),
Lisboa, November 6 - 8, 2007.
[Figure callouts: "Search term"; "Search term should appear in specified semantic component"; "Semantic component should appear in document"]
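The two search conditions in the callouts (a search term that must appear in a specified semantic component, and a semantic component that must appear in the document) can be sketched as a simple filter. This is a minimal illustration, assuming each document stores its marked-up segments as a mapping from component name to text. The document structure, the `matches` and `search` functions, and the sample texts are all hypothetical and do not describe the sundhed.dk search engine.

```python
def matches(document, term=None, component=None):
    """Check one document against a search term and/or a semantic component."""
    segments = document["components"]  # component name -> text of that segment
    if component is not None and component not in segments:
        return False  # the required semantic component is missing
    if term is None:
        return True  # only the presence of the component was requested
    needle = term.lower()
    if component is not None:
        # The term must appear inside the specified component
        return needle in segments[component].lower()
    # No component given: fall back to searching every segment
    return any(needle in text.lower() for text in segments.values())


def search(documents, term=None, component=None):
    """Return the ids of all matching documents, in collection order."""
    return [d["id"] for d in documents if matches(d, term, component)]


# Invented sample documents of class Drugs
DOCS = [
    {"id": "d1", "doc_class": "Drugs",
     "components": {"General information": "Used for pain relief.",
                    "Side effects": "May cause nausea."}},
    {"id": "d2", "doc_class": "Drugs",
     "components": {"General information": "Nausea is a common reason for use."}},
]

print(search(DOCS, term="nausea"))
print(search(DOCS, term="nausea", component="Side effects"))
print(search(DOCS, component="Side effects"))
```

Restricting the term to one component is what lets a query separate topically similar documents: both sample documents mention nausea, but only one discusses it as a side effect.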
[Figure: bar chart of number of documents (0 to 250) by time to index (< 2 min, 2 - 5 min, 5 - 10 min, 10 - 15 min, > 15 min)]
Time to index
Time required for indexing documents
Indexing type        Total documents      Mean num. docs        Mean time   Min time    Max time
                     indexed (max = 96)   indexed per indexer   (min:sec)   (min:sec)   (min:sec)
                                          (max = 6)
Semantic components  83                   5.2                   07:03       00:24       27:05
Keywords             88                   5.5                   05:56       01:06       31:26
[Figure: bar chart of number of indexers (0 to 10) by task type (for indexing documents, for searching). Legend: prefer keyword indexing, about the same, prefer semantic component indexing]
Research team
General practice
Information and computer science
Peter Vedsted
MD, Ph.D.
Research Unit for General Practice,
Århus University
Lois Delcambre, Ph.D., Professor
Susan Price, MD, Ph.D. student
Computer Science Department
Portland State University, USA
Jens Rubak
MD
Praksis.dk, Region Midt
Marianne Lykke, Ph.D., Associate professor
Information Interaction and Information
Architecture
Danmarks Bibliotekskole
Vibeke Luk
Information specialist
sundhed.dk
sundhed.dk
Frans la Cour
IT consultant
Autonomy
Supported by grants from the National Science Foundation, grant numbers
0514238, 0511050 and 0534762, the National Library of Medicine Training Grant 5T15-LM07088 and Kvalitetsudviklingsudvalget for Almen Praksis, Aarhus Amt