Transcript Powerpoint

CLiMB:
Computational Linguistics
for
Metadata Building
Center for Research on Information Access
Columbia University Libraries
7/27/2016
CLiMB - Columbia University
1
Overall Goals
• Research: Development of richer retrieval through increased
numbers of descriptors
• Research and Practice: Creation of enabling technologies for new
large digitization projects
• Research and Practice: Expand capability for cross-collection
searching
• Practice: Development of suite of CLiMB tools
• Resources: Vocabulary list which can be used by other visual
resource professionals
The essence of CLiMB:
• Use scholars themselves as “catalogers” by utilizing scholarly
publications
• Enhance existing descriptive metadata
CLiMB Project Teams
Coordinating
Collections
(Curatorial)
Technical
External
Advisory
CLiMB:
2 year timetable
• YEAR 1
– Evaluating existing computational tools
– Developing additional software as needed
– Selecting and building (scanning, converting)
needed candidate texts
– Loading initial descriptive metadata into end-user
system
– Evaluating initial results with user groups
CLiMB:
2 year timetable
• YEAR 2
– Use feedback to refine metadata
generation & filtering
– Prepare additional collections for testing
– Incorporate data in different user platforms
– Seek external partners for using CLiMB
toolset
Computational Linguistic Techniques
• What techniques have we tried?
– Goal: Identify high quality metadata terms
– Goal: Use metadata for finding images
• How well have they worked?
• What else do we want to try?
Text about Images
The Blacker House is known for
its porte cochère and adjacent
terraces. Samuel Parker
Williams, an occasional Greene
collaborator, worked on the site,
particularly on the sandstone
boulder foundation for the
sleeping porch.
-- Based on Bosley
Techniques We Have Tried
Supervised (using existing resources)
– Matching algorithms - proper names & variants
– Back of book index analysis
– Composite list of terms from authoritative lists
Unsupervised
– Part of speech tagging
– Noun phrase identification
– Proper noun identification
Computational Linguistic Techniques
• What techniques have we tried?
– Goal: Identify high quality metadata terms
– Goal: Load metadata into image search database
– Goal: Use enriched metadata for finding images
• How well have they worked?
• What else do we want to try?
CLiMB Art Keys (CAKEs)
• Need Unique Identifiers
– Key of database records
• Varies from collection to collection
– Greene & Greene – Project Names
– Chinese Paper Gods – God Names
– South Asian Temples – Temple Names
Text about Images
The Blacker House is known for its porte
cochère and adjacent terraces. Samuel
Parker Williams, an occasional Greene
collaborator, worked on the site,
particularly on the sandstone boulder
foundation for the sleeping porch.
-- Based on Bosley
Compile list of
subject vocabulary
Find meaningful
terms in texts
Segment relevant
texts
Collect terms from all sources.
Identify and link CAKEs described in text.
Determine term relationships
Extract metadata
Insert into existing metadata records.
Mount in image search platform.
Process queries and evaluate
Create Composite List of
Subject Terms
Philosophy: Use whatever resources exist
• Catalog records
– Robert R. Blacker house (Pasadena, Calif.)
– Greene, Charles Sumner
– Blacker, Robert R.
• Art and Architecture Thesaurus
– porte cochère
• Back of the book index
– Blacker house
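A minimal sketch of how such a composite list might be assembled, assuming each resource has already been reduced to a plain list of strings. The function name and the source labels ("catalog", "aat", "index") are illustrative and are not part of the CLiMB toolset.

```python
from collections import defaultdict

def build_composite_list(sources):
    """Merge term lists from several resources, recording each term's origin.

    `sources` maps a source label to an iterable of terms. Terms are keyed
    case-insensitively so that spelling variants differing only in case merge.
    """
    composite = defaultdict(set)
    for label, terms in sources.items():
        for term in terms:
            composite[term.strip().casefold()].add(label)
    return dict(composite)

# Terms taken from the slide; the grouping into labelled lists is assumed.
sources = {
    "catalog": ["Robert R. Blacker house (Pasadena, Calif.)",
                "Greene, Charles Sumner", "Blacker, Robert R."],
    "aat": ["porte cochère"],
    "index": ["Blacker house"],
}
composite = build_composite_list(sources)
```

Keeping the source labels alongside each term would let later steps weight a term by how many resources agree on it.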
How CLiMB Works Today
Official pages:
• www.columbia.edu/cu/cria
Work in progress (temporary sites):
• www.cs.columbia.edu/~delson/cni
• www.cs.columbia.edu/~delson/CAKEFinder
Progress – Composite List
• Greene & Greene
– Extracted back of the book indexes
– Direct matching of index terms to the text
• Terms found - highlighted in yellow
– David Gamble
– Pasadena
– Westmoreland Place
– furniture
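Direct matching of index terms to text can be sketched as below. This is an illustration only: the sample sentence is invented for the example, and whole-word, case-insensitive matching is an assumption about how the matching was done.

```python
import re

def find_index_terms(text, index_terms):
    """Return (start, end, term) for every whole-word, case-insensitive match."""
    hits = []
    for term in index_terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.end(), term))
    return sorted(hits)  # ordered by position in the text

# Invented sample sentence; the index terms are the ones listed on the slide.
text = ("David Gamble commissioned furniture for his house "
        "on Westmoreland Place in Pasadena.")
index_terms = ["David Gamble", "Pasadena", "Westmoreland Place", "furniture"]
hits = find_index_terms(text, index_terms)
```

The character offsets in each hit are what a display layer would need in order to highlight the matched spans.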
Compile list of
subject vocabulary
Find meaningful
terms in texts
Segment relevant
texts
Collect terms from all sources.
Identify and link CAKEs described in text.
Determine term relationships
Extract metadata
Insert into existing metadata records.
Mount in image search platform.
Process queries and evaluate
Three Term Types and Approaches
1) Art Object keys (Charles Pratt)
• Named Entity noun phrase finders, POS taggers
2) Proper nouns important to the domain
3) Common noun terms
• generic domain vocabulary (chimney)
• semantically significant to the domain (V-shaped plan)
Part of Speech (POS) taggers
• Why use a part of speech tagger?
– To identify nouns, verbs and proper nouns
• The Blacker House is known for its porte cochère…
– <Determiner>The
– <Proper_Noun>
• <Singular_Proper_Noun>Blacker
• <Singular_Proper_Noun>House
– <Verb_Present>is
– <Verb_Past_Participle>known
– <Preposition>for
– <Possessive_Pronoun>its
– <Adjective>adjacent
– <Noun_Plural>terraces
Part of Speech (POS) taggers
• Strength: An essential step that allows the rest
of the system to work
• Weakness: The best POS taggers have
95% accuracy
– A typical 20-word sentence is likely to have a
mistake!
• But: some errors do not matter much
– E.g. sleeping porch
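The 20-word claim follows from simple arithmetic, under the simplifying assumption that tagging errors are independent from word to word. Real tagger errors tend to cluster, so this is only a rough estimate.

```python
def prob_at_least_one_error(per_word_accuracy, sentence_length):
    """P(at least one tagging error), assuming independent per-word errors."""
    return 1.0 - per_word_accuracy ** sentence_length

p = prob_at_least_one_error(0.95, 20)
print(f"{p:.2f}")  # prints 0.64
```

Put another way, at 95% accuracy the expected number of mistagged words in a 20-word sentence is 20 × 0.05 = 1.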
Proper Nouns
• Alembic WorkBench Results
– 91.2% recall
• Misses The senior Pratt, Hall brothers
– 97.5% precision using Alembic
• Successfully finds William Issac Ott, University of California
• This is very good!
• LTChunk proper nouns highlighted in peach
– Laurabelle Robinson
– Greenes
– Pasadena
– Etc.
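For reference, recall and precision figures like those quoted for Alembic are computed as sketched below. The gold and found sets here are a toy illustration built from the names on the slide; they are not the evaluation data behind the reported 91.2% and 97.5%.

```python
def precision_recall(found, gold):
    """Precision and recall of a set of extracted names against a gold set."""
    found, gold = set(found), set(gold)
    correct = found & gold
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Toy sets: two names found, two missed.
gold = {"William Issac Ott", "University of California",
        "The senior Pratt", "Hall brothers"}
found = {"William Issac Ott", "University of California"}
toy_precision, toy_recall = precision_recall(found, gold)

# Harmonic mean of the two figures reported on the slide (about 0.94).
reported_f1 = f1(0.975, 0.912)
```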
Noun Phrase Chunking
[The [ Blacker House ] ] is known for
[ [its porte cochère] and [adjacent terraces] ].
[Samuel Parker Williams],
[an occasional Greene collaborator],
worked on [the site], particularly on
[the [ [sandstone boulder] foundation] ]
for [the [ sleeping porch ] ].
-- Based on Bosley
NP Chunkers
• Columbia’s LinkIT
– Regular expression grammar over POS tags
– Improves WorkBench results through finding
simplex NPs
• LTChunk
– By LTG Group, University of Edinburgh
– Not as many NPs
• Arizona - commercialized
• IBM – also commercial
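The idea of a regular expression grammar over POS tags can be sketched as follows. This illustrates the general technique only and is not LinkIT's actual grammar. The input is hand-tagged, and the Penn Treebank-style tag names and the single-letter tag classes are assumptions made for the example.

```python
import re

# Hand-tagged input (Penn Treebank-style tags, assumed for illustration).
tagged = [("The", "DT"), ("Blacker", "NNP"), ("House", "NNP"), ("is", "VBZ"),
          ("known", "VBN"), ("for", "IN"), ("its", "PRP$"), ("porte", "NN"),
          ("cochère", "NN"), ("and", "CC"), ("adjacent", "JJ"),
          ("terraces", "NNS")]

def tag_class(tag):
    """Collapse a POS tag to one letter so a regex can scan the tag sequence."""
    if tag in ("DT", "PRP$"):
        return "D"          # determiner or possessive pronoun
    if tag.startswith("JJ"):
        return "A"          # adjective
    if tag.startswith("NN"):
        return "N"          # any noun, common or proper
    return "x"              # everything else

# Simplex NP: optional determiner, any adjectives, one or more nouns.
NP_PATTERN = re.compile(r"D?A*N+")

def chunk_noun_phrases(tagged_tokens):
    """Return the word strings of every span whose tag classes match NP_PATTERN."""
    classes = "".join(tag_class(tag) for _, tag in tagged_tokens)
    return [" ".join(word for word, _ in tagged_tokens[m.start():m.end()])
            for m in NP_PATTERN.finditer(classes)]

chunks = chunk_noun_phrases(tagged)
print(chunks)
# ['The Blacker House', 'its porte cochère', 'adjacent terraces']
```

Because each token maps to exactly one letter, match offsets in the class string line up with token indices, which keeps the mapping back to words trivial.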
Results: NP Chunking
• Common noun phrases highlighted in light
blue:
– His brother’s property
– Planter boxes
– The south wing
– Etc.
Experiments with Algorithms
• TF/IDF and term frequency ratios
– Filter technical terms from frequent common nouns
– Term frequency ratio algorithm to improve accuracy
• Co-occurrence
– Useful terms may appear near other good ones
• Machine learning
– Use learning algorithms to discover complex
associational context
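A small sketch of the two scoring ideas named above, assuming whitespace-tokenized text. The exact formulas CLiMB used are not given on the slide, so the TF/IDF variant, the crude add-one smoothing, and the toy token lists below are all assumptions.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, all_docs):
    """Term frequency in one document times log inverse document frequency."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    docs_with_term = sum(1 for d in all_docs if term in d)
    if docs_with_term == 0:
        return 0.0
    return tf * math.log(len(all_docs) / docs_with_term)

def frequency_ratio(term, domain_tokens, general_tokens):
    """Relative frequency in domain text divided by that in general text.

    Crude add-one smoothing keeps the ratio finite for terms that never
    occur in the general text.
    """
    domain = Counter(domain_tokens)
    general = Counter(general_tokens)
    domain_rate = (domain[term] + 1) / (len(domain_tokens) + 1)
    general_rate = (general[term] + 1) / (len(general_tokens) + 1)
    return domain_rate / general_rate

# Toy token lists, invented for the example.
domain = "the chimney and the terrace face the garden".split()
general = "the cat sat on the mat and the dog slept".split()
docs = [domain, general]
```

On this toy data, "the" scores zero under TF/IDF because it occurs in every document, while "chimney" gets a higher frequency ratio than "the" because it is absent from the general text.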
Compile list of
subject vocabulary
Find meaningful
terms in texts
Segment relevant
texts
Collect terms from all sources.
Identify and link CAKEs described in text.
Determine term relationships
Extract metadata
Insert into existing metadata records.
Mount in image search platform.
Process queries and evaluate
What is Segmentation?
• Divide texts into cohesive chunks
• Needed for determining associational
context
• Needed to determine what terms are
related to an art object
Results: Segmentation
[Chart: Project People, Frequency. x-axis: Paragraph (1 to 49); y-axis: Frequency (0 to 12); series: Cole, Bolton, Thorsen, Pratt, Gamble, Blacker, Robinson, Ford]
• Use the frequency with which our terms appear within a
document to estimate where the document is about that term
• This graph shows where different names are mentioned in
Bosley on Greene & Greene Ch. 5
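The counts behind a chart of this kind could be produced roughly as follows. The paragraphs are invented stand-ins rather than text from Bosley, and picking the single peak paragraph is just one simple way to estimate where a text is "about" a name.

```python
import re

def name_frequency_by_paragraph(paragraphs, names):
    """For each name, count case-insensitive whole-word mentions per paragraph."""
    table = {}
    for name in names:
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        table[name] = [len(pattern.findall(p)) for p in paragraphs]
    return table

def peak_paragraph(counts):
    """Index of the paragraph where a name is mentioned most (first on ties)."""
    return max(range(len(counts)), key=counts.__getitem__)

# Invented sample paragraphs.
paragraphs = [
    "The Blacker house came first. Blacker wanted terraces.",
    "Gamble asked for a summer house.",
    "The Gamble house and the Gamble furniture were finished later.",
]
table = name_frequency_by_paragraph(paragraphs, ["Blacker", "Gamble"])
```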
What We’ve Tried: Segmenters
• Marti Hearst’s TextTiling
– Performs well for a general algorithm, but not
sufficient for this specialized task
– M. Hearst, ACL, 1993
• F. Choi’s C99 segmenter
– Performance comparable to TextTiling
– F. Y. Y. Choi, NAACL, 2000
• Frequency ratio approach outperformed
TextTiling
• In-house tool to be tested
– Kan & Klavans, WVLC-6, 1998, Segmenter
Meronymy as “Part-Of”
• Why is this potentially useful?
– A method for identifying “hot” paragraphs
• Descriptive text contains “part of” relations
• Details that correlate to the whole
– Porch is a part of house
• An early hypothesis – in testing stages
Meronymy for Cohesion
The Spinks house design is an elaboration of
the rectangular, large-gabled form of the
“California House” ….has … porches and
terraces. In front, an expanse of …lawn rises
nearly to the level of the entry terrace…. The
front door is approached obliquely in the
shaded recess of the terrace….
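One way to test the "hot paragraph" hypothesis is to count how many parts of a whole each paragraph mentions. The sketch below uses a small hand-built part-of table, which is purely hypothetical; a real system might draw these relations from a lexical resource instead. The first sample string is adapted from the passage above.

```python
import re

# Hypothetical hand-built part-of table, for illustration only.
PARTS_OF = {"house": {"porch", "porches", "terrace", "terraces",
                      "door", "entry"}}

def part_mentions(paragraph, whole, parts_of=PARTS_OF):
    """Count tokens in the paragraph that name a part of `whole`."""
    tokens = re.findall(r"\w+", paragraph.lower())
    parts = parts_of.get(whole, set())
    return sum(1 for t in tokens if t in parts)

hot = ("The Spinks house has porches and terraces. The front door is "
       "approached obliquely in the shaded recess of the terrace.")
cold = ("Samuel Parker Williams, an occasional Greene collaborator, "
        "worked on the site.")
hot_score = part_mentions(hot, "house")
cold_score = part_mentions(cold, "house")
```

A paragraph with a high score would be a candidate for descriptive text about the object; the threshold would have to be set empirically.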
Meronymy and Other Relations
The California House
Other Houses
Spinks House
porch
terrace
entry terrace
front entry
front door
Compile list of
subject vocabulary
Find meaningful
terms in texts
Segment relevant
texts
Collect terms from all sources.
Identify and link CAKEs described in text.
Determine term relationships
Extract metadata
Insert into existing metadata records.
Mount in image search platform.
Process queries and evaluate
Progress – Project Name Matching
• Finding project names in Greene & Greene
• Challenge: finding variations
– CLiMB Art key (CAKE): Robert Roe Blacker House
– RRB House
– The house
– 1214 Fairlawn Terrace
• Possible techniques to improve matching
– Developing a semi-automatic technique
– Use existing information to label text
– An iterative platform for manual intervention
Variants of The Culbertson House
• Cordelia A. Culbertson house (Pasadena, Calif.)
• Francis F. Prentiss house (Pasadena, Calif.)
• Culbertson sisters house (Pasadena, Calif.)
• Prentiss, Francis F.
• Culbertson, Cordelia A.
• Allen, Elizabeth S.
• Allen, Mrs. Dudley P.
• House was purchased by Mrs. Allen, who remarried and
became Mrs. Prentiss!
Zaoshen (Chinese deity)
• USE FOR: Dingfuzhenjun (Chinese deity)
• USE FOR: Kitchen God (Chinese deity)
• USE FOR: Simingzaojun (Chinese deity)
• USE FOR: Simingzaoshen (Chinese deity)
• USE FOR: Ssu-ming-tsao-chün (Chinese deity)
• USE FOR: Ssu-ming-tsao-shen (Chinese deity)
• USE FOR: Ting-fu-chen-chün (Chinese deity)
• USE FOR: Tsao-chün (Chinese deity)
• USE FOR: Tsao-shen (Chinese deity)
• USE FOR: Tsao-wang (Chinese deity)
• USE FOR: Tsao-wang-yeh (Chinese deity)
• USE FOR: Zaojun (Chinese deity)
• USE FOR: Zaowang (Chinese deity)
• REFERENCE: Encyc. Britannica (Tsao Shen, pinyin Zao Shen, in Chinese
mythology, the god of the kitchen (god of the hearth), who is believed to
report to the celestial gods on family conduct and have it within his power to
bestow poverty or riches on individual families; has also been confused with
Ho Shen (god of fire) and Tsao Chün (Furnace Prince))
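An authority record like this one lends itself to a simple lookup from any variant to the preferred heading. The sketch below assumes the record has already been parsed into a preferred form plus a list of "USE FOR" variants; the function names are illustrative, and only a few of the variants are included.

```python
def build_variant_index(preferred, variants):
    """Map the preferred heading and every variant to the preferred heading."""
    index = {preferred.casefold(): preferred}
    for variant in variants:
        index[variant.casefold()] = preferred
    return index

def resolve(name, index):
    """Return the preferred heading for a name, or None if it is unknown."""
    return index.get(name.casefold())

# A subset of the variants listed on the slide.
zaoshen = build_variant_index(
    "Zaoshen",
    ["Kitchen God", "Tsao-shen", "Tsao-wang", "Zaojun", "Zaowang"],
)
```

With such an index, a mention of "Kitchen God" in a text could be linked to the same CAKE as a mention of "Zaojun".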
Some Data to Illustrate
• Unaltered Project Names
– 0 matches (both case sensitive and insensitive)
• Case Insensitive Project Name matching
– 4 matches
– {Theodore Irwin house} occurs 1 time
– {California Institute of Technology} occurs 1 time
– {William R. Thorsen house} occurs 1 time
– {William T. Bolton house} occurs 1 time
• At least twice as many references actually occur in the chapter
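One plausible reason the unaltered names find nothing is that catalog headings carry a parenthetical place qualifier that never appears in running text. That reading is an assumption, as are the helper names and the invented sample sentence in this sketch.

```python
import re

def strip_qualifier(heading):
    """Drop a trailing parenthetical such as '(Pasadena, Calif.)'."""
    return re.sub(r"\s*\([^)]*\)\s*$", "", heading)

def count_matches(text, name, ignore_case=True):
    """Count literal occurrences of `name` in `text`."""
    flags = re.IGNORECASE if ignore_case else 0
    return len(re.findall(re.escape(name), text, flags))

heading = "William R. Thorsen house (Berkeley, Calif.)"
text = "The William R. Thorsen House stands in Berkeley."  # invented sentence

unaltered = count_matches(text, heading)
stripped_exact = count_matches(text, strip_qualifier(heading), ignore_case=False)
stripped_nocase = count_matches(text, strip_qualifier(heading))
```

Here only the stripped, case-insensitive form matches; references such as "the Thorsen house" or "the house" would still be missed, which is where variant finding comes in.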
Results: Finding CAKEs
• References to CAKEs are the highlighted
phrases:
– Robert R. Blacker House (Pasadena, Calif.)
• The Blacker house
• The house
– William R. Thorsen House (Berkeley, Calif.)
• The Thorsen house
• The house
A Future Solution
• Bootstrapping algorithm
– Seed terms hand labelled
– Terms mapped into multi-dimensional feature space
– Other terms that are close to the seed terms are
added to the set
• Features:
– Window size
– Headedness
– Modifier similar to that of a seed term
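The bootstrapping loop might look roughly like this. The feature vectors are made-up numbers standing in for features such as window size, headedness, and modifier similarity, and the Euclidean distance threshold is an assumption. Note that this sketch lets a candidate join if it is close to any term accepted so far, not only to the original seeds, which is what makes the set grow over several rounds.

```python
import math

def bootstrap(seeds, candidates, threshold, max_rounds=10):
    """Grow a term set from hand-labelled seeds.

    `seeds` and `candidates` map a term to its feature vector (a tuple of
    floats). Each round, any candidate within `threshold` Euclidean distance
    of an already accepted term is added; stop when a round adds nothing.
    """
    accepted = dict(seeds)
    remaining = dict(candidates)
    for _ in range(max_rounds):
        newly = {term: vec for term, vec in remaining.items()
                 if any(math.dist(vec, a) <= threshold
                        for a in accepted.values())}
        if not newly:
            break
        accepted.update(newly)
        for term in newly:
            del remaining[term]
    return set(accepted)

# Made-up two-dimensional feature vectors.
seeds = {"Blacker house": (1.0, 1.0)}
candidates = {"Thorsen house": (1.2, 0.9),
              "the house": (1.5, 0.8),
              "Pasadena": (5.0, 4.0)}
grown = bootstrap(seeds, candidates, threshold=0.4)
```

In this toy run "Thorsen house" is accepted in the first round and "the house" only in the second, once "Thorsen house" is in the set; "Pasadena" stays out.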
Summary: Research Tools Tested
• Part of Speech Taggers
• Noun Phrase Chunkers
• Merging techniques
• Proper Noun Finders
• Proper Name Variant Finder
• Segmenters
Compile list of
subject vocabulary
Find meaningful
terms in texts
Segment relevant
texts
Collect terms from all sources.
Identify and link CAKEs described in text.
Determine term relationships
Extract metadata
Insert into existing metadata records.
Mount in image search platform.
Process queries and evaluate
Future: Determine relationships
• The Blacker House is related to Greene
– The Greenes built the house.
• Porte Cochère is related to Blacker House
– because it is directly a part of the house.
• William Issac Ott is related to
– Blacker House (on which he worked)
– Greene (with whom he worked).
• Detecting these semantic relationships statistically is a
challenge for our next steps:
– Co-occurrence
– Use of subject headings
– Meronymy and other relations (WordNet)
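As a starting point for the co-occurrence idea, the sketch below counts how often two terms are mentioned in the same segment. Plain substring matching and the two sample segments (taken from sentences on the slides) are simplifications; co-occurrence alone says that two terms are related, not how.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(segments, terms):
    """Count how many segments mention each pair of terms together."""
    counts = Counter()
    for segment in segments:
        lowered = segment.lower()
        present = sorted(t for t in terms if t.lower() in lowered)
        counts.update(combinations(present, 2))
    return counts

segments = [
    "The Blacker House is known for its porte cochère and adjacent terraces.",
    "The Greenes built the Blacker House.",
]
terms = ["Blacker House", "porte cochère", "Greene"]
pairs = cooccurrence_counts(segments, terms)
```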
Compile list of
subject vocabulary
Find meaningful
terms in texts
Segment relevant
texts
Collect terms from all sources.
Identify and link CAKEs described in text.
Determine term relationships
Extract metadata
Insert into existing metadata records.
Mount in image search platform.
Process queries and evaluate
Thank you!
Any questions?
www.columbia.edu/cu/cria/climb