Faceted Metadata for Information Architecture and Search

Semi-Automated Creation of Facet Hierarchies
Marti Hearst
School of Information, UC Berkeley
Joint work with Dr. Emilia Stoica
Outline
- Faceted Metadata
  - Definition
  - Advantages
- Flamenco: Search Interface Design using Faceted Metadata
- Castanet: (Semi) Automated Tool for Creation of Category Systems
- Comparison to State-of-the-Art Alternatives
- Conclusions
Marti Hearst, Taxonomy Bootcamp ‘06

Focus: Search and Navigation of Large Collections
- Shopping Sites
- Digital Libraries
- E-Government Sites
- Image Collections

Problems with Site Search
- Study by Vividence in 2001 on 69 sites
  - 70% eCommerce
  - 31% Service
  - 21% Content
  - 2% Community
- Poorly organized search results
- Frustration and wasted time
- Poor information architecture
  - Confusion
  - Dead ends
  - "back and forthing"
  - Forced to search

What We Want to Achieve
- Integrate browsing and searching seamlessly
- Support exploration and learning
- Avoid dead-ends, “pogo’ing”, and “lostness”

Main Idea
- Use hierarchical faceted metadata
- Design the interface to:
  - Allow flexible navigation
  - Provide previews of next steps
  - Organize results in a meaningful way
  - Support both expanding and refining the search

The Problem With Hierarchy
- Most things can be classified in more than one way.
- Most organizational systems do not handle this well.
- Example: Animal Classification
[Figure: the same animals (robin, penguin, otter, seal, salmon, wolf, cobra, bat) grouped three ways: by Skin Covering, by Locomotion, and by Diet]

The Problem With Hierarchy
- Inflexible
  - Forces the user to start with a particular category
  - What if I don’t know the animal’s diet, but the interface makes me start with that category?
- Wasteful
  - Have to repeat combinations of categories
  - Makes for extra clicking and extra coding
- Difficult to modify
  - To add a new category type, must duplicate it everywhere or change things everywhere

The Problem With Hierarchy
[Figure: a single hierarchy that begins at "start", branches on locomotion (swim, fly, run, slither, …), repeats the skin coverings (fur, scales, feathers) under each branch, and repeats the diets (fish, rodents, insects) under each of those before reaching animals such as salmon, bat, robin, and wolf]

The Idea of Facets
- Facets are a way of labeling data
  - A kind of metadata (data about data)
  - Can be thought of as properties of items
- Facets vs. Categories
  - Items are placed INTO a category system
  - Multiple facet labels are ASSIGNED TO items

The Idea of Facets
- Create INDEPENDENT categories (facets)
  - Each facet has labels (sometimes arranged in a hierarchy)
- Assign labels from the facets to every item
- Example: recipe collection
    Ingredient     Cooking Method
    Chicken        Stir-fry
    Bell Pepper    Curry

    Course         Cuisine
    Main Course    Thai

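One way to picture "labels assigned to items" is a small in-memory structure. The sketch below is only illustrative: the recipe names, the `recipes` dictionary, the "American" cuisine label and the `items_with` helper are hypothetical and do not come from Flamenco.

```python
# Hypothetical recipe collection: every item carries labels from several
# independent facets, instead of living in exactly one category.
recipes = {
    "thai chicken stir-fry": {
        "Ingredient": {"Chicken", "Bell Pepper"},
        "Cooking Method": {"Stir-fry"},
        "Course": {"Main Course"},
        "Cuisine": {"Thai"},
    },
    "baked pineapple cake": {
        "Ingredient": {"Pineapple"},
        "Cooking Method": {"Bake"},
        "Course": {"Dessert"},
        "Cuisine": {"American"},
    },
}


def items_with(collection, facet, label):
    """Return the names of items that were assigned `label` in `facet`."""
    return sorted(
        name
        for name, facets in collection.items()
        if label in facets.get(facet, set())
    )


print(items_with(recipes, "Cuisine", "Thai"))
print(items_with(recipes, "Ingredient", "Pineapple"))
```

Because each facet is stored independently, adding a new facet means adding one more key per item rather than restructuring a tree.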
The Idea of Facets
- Break out all the important concepts into their own facets
- Sometimes the facets are hierarchical
  - Assign labels to items from any level of the hierarchy

Preparation Method
  Fry
  Saute
  Boil
  Bake
  Broil
  Freeze
Desserts
  Cakes
  Cookies
  Dairy
    Ice Cream
    Sorbet
    Flan
Fruits
  Cherries
  Berries
    Blueberries
    Strawberries
  Bananas
  Pineapple

Using Facets
- Now there are multiple ways to get to each item
- The facet hierarchies are Preparation Method, Desserts, and Fruits, as on the previous slide
- One item is labeled:
  - Fruit > Pineapple
  - Dessert > Cake
  - Preparation > Bake
- Another item is labeled:
  - Dessert > Dairy > Sherbet
  - Fruit > Berries > Strawberries
  - Preparation > Freeze

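With hierarchical facets, selecting a node should reach every item labeled at that node or anywhere beneath it. A minimal sketch, assuming labels are stored as root-to-node paths; the item names and the `matches` and `select` helpers are hypothetical:

```python
# Labels are stored as paths from the facet root down to the assigned node.
items = {
    "pineapple cake": [
        ("Fruit", "Pineapple"),
        ("Dessert", "Cake"),
        ("Preparation", "Bake"),
    ],
    "strawberry sherbet": [
        ("Dessert", "Dairy", "Sherbet"),
        ("Fruit", "Berries", "Strawberries"),
        ("Preparation", "Freeze"),
    ],
}


def matches(label_path, selection):
    """True if the assigned label equals the selection or lies below it."""
    return label_path[: len(selection)] == selection


def select(collection, selection):
    """Items reachable from `selection`, at any depth of its subhierarchy."""
    return sorted(
        name
        for name, paths in collection.items()
        if any(matches(path, selection) for path in paths)
    )


print(select(items, ("Fruit",)))
print(select(items, ("Dessert", "Dairy")))
```

The same item is reachable through any of its facets, which is the "multiple ways to get to each item" property from the slide.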
Example: Nobel Prize Winners Collection (Before and After Facets)
Only One Way to View Laureates
First, Choose Prize Type
Next, view the list!
The user must first choose an award type (literature), then browse through the laureates in chronological order.
No choice is given to, say, organize by year and then award, or by country, then decade, then award, etc.

Flamenco Interface: Using Hierarchical Faceted Metadata
Opening View
Select literature from PRIZE facet
Group results by YEAR facet
Select 1920s from YEAR facet
Current query is PRIZE > literature AND YEAR > 1920s. Now remove PRIZE > literature
Now Group By YEAR > 1920s
Hierarchy Traversal: Group By YEAR > 1920s, and drill down to 1921
Select an individual item
Use Endgame to expand out
Or use “More like this” to find similar items
Start a new search using keyword “California”
Note that category structure remains after the keyword search
The query is now a keyword ANDed with a facet subhierarchy

Using Facets
- The system only shows the labels that correspond to the current set of items
  - Start with all items and all facets
  - The user then selects a label within a facet
  - This reduces the set of items (only those that have been assigned to the subcategory label are displayed)
  - This also eliminates some subcategories from the view.

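The narrowing behaviour can be sketched in a few lines. This is an illustration under simplifying assumptions (flat facets, one label per facet per item), not Flamenco's implementation; the records and the `refine` and `previews` helpers are hypothetical.

```python
from collections import Counter

# Hypothetical laureate records; facet names follow the slides, values are made up.
laureates = [
    {"id": 1, "PRIZE": "literature", "YEAR": "1920s"},
    {"id": 2, "PRIZE": "literature", "YEAR": "1930s"},
    {"id": 3, "PRIZE": "physics", "YEAR": "1920s"},
    {"id": 4, "PRIZE": "peace", "YEAR": "1920s"},
]


def refine(items, selections):
    """Keep only the items that carry every selected (facet, label) pair."""
    return [
        item
        for item in items
        if all(item.get(facet) == label for facet, label in selections.items())
    ]


def previews(items, facets):
    """Count items per label; labels with no matching items never appear."""
    return {f: Counter(item[f] for item in items if f in item) for f in facets}


current = refine(laureates, {"YEAR": "1920s"})
counts = previews(current, ["PRIZE", "YEAR"])
print(counts["PRIZE"])
```

Since the preview counts are computed from the current result set, any label the user can click is backed by at least one item, which is why facet-only navigation cannot reach an empty result.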
Advantages of Facets
- Can’t end up with empty result sets
  - (except with keyword search)
- Helps avoid feelings of being lost.
- Easier to explore the collection.
  - Helps users infer what kinds of things are in the collection.
  - Evokes a feeling of “browsing the shelves”
- Is preferred over standard search for collection browsing in usability studies.
  - (Interface must be designed properly)

Advantages of Facets
- Seamless to add new facets and subcategories
- Seamless to add new items.
- Helps with “categorization wars”
  - Don’t have to agree exactly where to place something
- Interaction can be implemented using a standard relational database.
- May be easier for automatic categorization

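To illustrate the relational-database point, here is a minimal sketch using Python's built-in `sqlite3` module. The schema (`item`, `item_label`), the sample rows and the `label_counts` helper are assumptions made for this example; the slides do not describe Flamenco's actual tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE item_label (item_id INTEGER, facet TEXT, label TEXT);
    """
)
conn.executemany(
    "INSERT INTO item VALUES (?, ?)",
    [(1, "pineapple cake"), (2, "strawberry sherbet"), (3, "fried bananas")],
)
conn.executemany(
    "INSERT INTO item_label VALUES (?, ?, ?)",
    [
        (1, "Fruit", "Pineapple"), (1, "Preparation", "Bake"),
        (2, "Fruit", "Strawberries"), (2, "Preparation", "Freeze"),
        (3, "Fruit", "Bananas"), (3, "Preparation", "Fry"),
    ],
)


def label_counts(conn, facet, selected=None):
    """Count items per label of `facet`, optionally within one selected (facet, label)."""
    sql = (
        "SELECT l.label, COUNT(DISTINCT l.item_id) FROM item_label AS l "
        "WHERE l.facet = ?"
    )
    params = [facet]
    if selected is not None:
        # Restrict to items that also carry the selected label.
        sql += (
            " AND l.item_id IN (SELECT item_id FROM item_label "
            "WHERE facet = ? AND label = ?)"
        )
        params += list(selected)
    sql += " GROUP BY l.label ORDER BY l.label"
    return conn.execute(sql, params).fetchall()


print(label_counts(conn, "Fruit"))
print(label_counts(conn, "Fruit", selected=("Preparation", "Bake")))
```

One row per (item, facet, label) keeps new facets and new items cheap to add: both are plain inserts, with no schema change.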
Limitations of Facets
- Do not naturally capture MAIN THEMES
- Facets do not show RELATIONS explicitly
  - Example photo labels: Aquamarine, Red, Orange (colors); Door, Doorway, Wall (objects)
  - Which color is associated with which object?
Photo by J. Hearst, jhearst.typepad.com

Terminology Clarification
- Facets vs. Attributes
  - Facets are shown independently in the interface
  - Attributes are just associated with individual items
    - E.g., ID number, Source, Affiliation
    - However, can always convert an attribute to a facet
- Facets vs. Labels
  - Labels are the names used within facets
  - These are organized into subhierarchies
- Synonyms
  - There should be alternate names for the category labels
  - Currently (in Flamenco) this is done with subcategories
    - E.g., Deer has subcategories “stag”, “fawn”, “doe”

Usability Study Results
Flamenco Usability Studies
- Usability studies done on 3 collections:
  - Recipes (epicurious): 13,000 items
  - Architecture Images: 40,000 items
  - Fine Arts Images: 35,000 items
- Conclusions:
  - Users like and are successful with the dynamic faceted hierarchical metadata, especially for browsing tasks
  - Very positive results, in contrast with studies on earlier iterations.

Most Recent Usability Study
- Participants & Collection
  - 32 Art History Students
  - ~35,000 images from SF Fine Arts Museum
- Study Design
  - Within-subjects
    - Each participant sees both interfaces
    - Balanced in terms of order and tasks
  - Participants assess each interface after use
  - Afterwards they compare them directly
- Data recorded in behavior logs, server logs, paper surveys; one or two experienced testers at each trial.
- Used 9-point Likert scales.
- Session took about 1.5 hours; pay was $15/hour

Post-Interface Assessments
All significant at p<.05 except “simple” and “overwhelming”
Post-Test Comparison
Which Interface Preferable For:             Baseline  Faceted
  Find images of roses                            15       16
  Find all works from a given period               2       30
  Find pictures by 2 artists in same media         1       29
Overall Assessment:
  More useful for your tasks                       4       28
  Easiest to use                                   8       23
  Most flexible                                    6       24
  More likely to result in dead ends              28        3
  Helped you learn more                            1       31
  Overall preference                               2       29

How to Create Facet Hierarchies?
Our Approach: Castanet
Example: Recipes (3500 docs)
Castanet Output (shown in Flamenco)
Our Approach
- Leverage the structure of WordNet
[Pipeline diagram: Documents -> Select terms -> Get hypernym paths (from WordNet) -> Build tree -> Compress tree -> Divide into facets]

1. Select Terms
- Select well-distributed terms from the collection
  - Eliminate stopwords
  - Retain only those terms with a distribution higher than a threshold (default: top 10%)
[Pipeline diagram: Documents -> Select terms -> Build core tree -> Augment core tree -> Compress tree -> Remove top-level categories, with WordNet as an input]

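A minimal sketch of the term-selection step. It assumes that a term's "distribution" is its document frequency (the number of documents containing it) and that "top 10%" means the top tenth of terms ranked by that count; the stopword list, tokenizer and `select_terms` name are hypothetical, not Castanet's.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "and", "in", "of", "the", "with"}  # tiny illustrative list


def select_terms(docs, top_fraction=0.10):
    """Return the top fraction of non-stopword terms ranked by document frequency."""
    doc_freq = Counter()
    for text in docs:
        # A term counts once per document, however often it occurs there.
        terms = set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS
        doc_freq.update(terms)
    # Highest document frequency first; ties broken alphabetically.
    ranked = sorted(doc_freq, key=lambda term: (-doc_freq[term], term))
    keep = max(1, math.ceil(len(ranked) * top_fraction))
    return ranked[:keep], doc_freq


docs = [
    "Sundae with cherries and syrup",
    "Strawberry sherbet with syrup",
    "Baked flan with syrup and cherries",
    "Frozen sundae",
]
terms, doc_freq = select_terms(docs, top_fraction=0.25)
print(terms)
```

The example uses 25% rather than the slide's 10% default only because the toy collection has eight distinct terms.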
2. Build Core Tree
- Build a “backbone”
  - Create paths from unambiguous terms only
  - Bias the structure towards appropriate senses of words
- Get hypernym path if term:
  - has only one sense, or
  - matches a pre-selected WordNet domain
- Adding a new term increases a count at each node on its path by # of docs with the term.
Example hypernym paths:
  entity > substance, matter > nutriment > dessert > frozen dessert > ice cream sundae > sundae
  entity > substance, matter > nutriment > dessert > frozen dessert > sherbet, sorbet > sherbet

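The counting rule can be sketched as follows. The `PATHS` table is a hypothetical stand-in for WordNet lookups (the "date" paths are abbreviated), only the single-sense condition is modeled, and nodes are keyed by their full path prefix so that equal names in different branches stay distinct.

```python
from collections import Counter

# Hypothetical stand-in for WordNet: one hypernym path per sense of each term.
PATHS = {
    "sundae": [("entity", "substance, matter", "nutriment", "dessert",
                "frozen dessert", "ice cream sundae", "sundae")],
    "sherbet": [("entity", "substance, matter", "nutriment", "dessert",
                 "frozen dessert", "sherbet, sorbet", "sherbet")],
    "date": [("entity", "substance, matter", "food", "edible fruit", "date"),
             ("abstraction", "measure, quantity", "time period",
              "calendar day", "date")],
}


def core_counts(doc_freq):
    """Add each unambiguous term's document count to every node on its path."""
    counts = Counter()
    for term, n_docs in doc_freq.items():
        senses = PATHS.get(term, [])
        if len(senses) != 1:
            continue  # ambiguous or unknown terms wait for the augmentation step
        path = senses[0]
        for depth in range(1, len(path) + 1):
            counts[path[:depth]] += n_docs
    return counts


counts = core_counts({"sundae": 2, "sherbet": 1, "date": 3})
frozen = ("entity", "substance, matter", "nutriment", "dessert", "frozen dessert")
print(counts[frozen])
```

Here "date" contributes nothing yet because it has two candidate paths; its placement is decided later, once the core tree's counts are known.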
2. Build Core Tree (cont.)
- Merge hypernym paths to build a tree
entity
  substance, matter
    nutriment
      dessert
        frozen dessert
          ice cream sundae
            sundae
          sherbet, sorbet
            sherbet

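Merging paths that share a prefix is essentially building a trie. A minimal sketch with nested dictionaries; `merge_paths` and `render` are hypothetical helper names:

```python
def merge_paths(paths):
    """Merge root-to-leaf paths into one nested dict; shared prefixes are stored once."""
    tree = {}
    for path in paths:
        node = tree
        for name in path:
            node = node.setdefault(name, {})
    return tree


def render(tree, depth=0):
    """Return the tree as indented lines, one node per line."""
    lines = []
    for name, children in tree.items():
        lines.append("  " * depth + name)
        lines.extend(render(children, depth + 1))
    return lines


paths = [
    ("entity", "substance, matter", "nutriment", "dessert", "frozen dessert",
     "ice cream sundae", "sundae"),
    ("entity", "substance, matter", "nutriment", "dessert", "frozen dessert",
     "sherbet, sorbet", "sherbet"),
]
tree = merge_paths(paths)
print("\n".join(render(tree)))
```

The two seven-node paths collapse into nine nodes because the first five levels are shared.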
3. Augment Core Tree
- Attach to the core tree the terms with more than one sense
  - Favor the more common path over other alternatives

3. Augment Core Tree (cont.)
Date (p1):
  entity > substance, matter > food, nutrient > nutriment > food > edible fruit (78) > date
Date (p2):
  abstraction > measure, quantity > fundamental quality > time period > calendar day (18) > date
Choose path p1 since it has more items assigned

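The "date" example can be sketched as a comparison of counts. This assumes that "more items assigned" is measured at the node directly above the term, which is where the slide shows the figures 78 and 18; `support` and `choose_path` are hypothetical helper names.

```python
# Counts already accumulated in the core tree, keyed by node name.
# The two figures come from the slide; the data layout is an assumption.
core_counts = {"edible fruit": 78, "calendar day": 18}

candidate_paths = {
    "date": [
        ("entity", "substance, matter", "food, nutrient", "nutriment", "food",
         "edible fruit", "date"),
        ("abstraction", "measure, quantity", "fundamental quality",
         "time period", "calendar day", "date"),
    ],
}


def support(path, counts):
    """Items already assigned to the node directly above the term."""
    return counts.get(path[-2], 0)


def choose_path(term, paths_by_term, counts):
    """Pick the candidate path whose parent node has the most items."""
    return max(paths_by_term[term], key=lambda path: support(path, counts))


best = choose_path("date", candidate_paths, core_counts)
print(" > ".join(best))
```

In a recipe collection this attaches "date" under "edible fruit" rather than "calendar day", matching the choice shown on the slide.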
4. Compress Tree
- Rule 1: Eliminate a parent with fewer than k children unless it is the root or its distribution is larger than 0.1*maxdist
Before:
  dessert
    frozen dessert
      ice cream sundae
        sundae
      parfait
      sherbet, sorbet
        sherbet
After:
  dessert
    frozen dessert
      sundae
      parfait
      sherbet

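A sketch of Rule 1 under stated assumptions: "eliminating" a parent splices it out and promotes its children, `maxdist` is taken to be the largest count of any node in the tree, the tree is processed bottom-up, and k defaults to 2. The counts in the example are made up, and none of these details are confirmed by the slides.

```python
def node(name, count, *children):
    """Build a tree node; `count` is the number of documents under it."""
    return {"name": name, "count": count, "children": list(children)}


def max_count(n):
    """Largest count anywhere in the subtree (used here as maxdist)."""
    return max([n["count"]] + [max_count(c) for c in n["children"]])


def compress_rule1(root, k=2, ratio=0.1):
    """Splice out sparse internal nodes, keeping the root and well-populated nodes."""
    threshold = ratio * max_count(root)

    def visit(n):
        kept = []
        for child in n["children"]:
            visit(child)  # compress the child's subtree first
            few = 0 < len(child["children"]) < k
            if few and child["count"] <= threshold:
                kept.extend(child["children"])  # promote its children
            else:
                kept.append(child)
        n["children"] = kept

    visit(root)  # the root itself is never a candidate for removal
    return root


tree = node("dessert", 40,
            node("frozen dessert", 4,
                 node("ice cream sundae", 2, node("sundae", 2)),
                 node("parfait", 1),
                 node("sherbet, sorbet", 1, node("sherbet", 1))))
compress_rule1(tree)
frozen = tree["children"][0]
print([c["name"] for c in frozen["children"]])
```

With these counts the single-child synset nodes disappear while "frozen dessert" survives because it ends up with three children.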
4. Compress Tree (cont.)
- Rule 2: Eliminate a child whose name appears within the parent’s name
Before:
  dessert
    frozen dessert
      sundae
      parfait
      sherbet
After:
  dessert
    sundae
    parfait
    sherbet

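A sketch of Rule 2, read through the slide's example in which "frozen dessert" is removed from under "dessert". The code assumes the test is whether every word of the parent's name also occurs in the child's name, and it only splices internal nodes so that no leaf term is lost; both choices are assumptions, as are the helper names.

```python
import re


def words(name):
    """Lower-cased words of a category name."""
    return set(re.findall(r"[a-z]+", name.lower()))


def compress_rule2(name, children):
    """Return the children of `name` after splicing out name-repeating child nodes."""
    result = {}
    for child, grandchildren in children.items():
        repeats_parent = bool(words(name)) and words(name) <= words(child)
        if grandchildren and repeats_parent:
            # Drop the child and re-check its children against the same parent.
            result.update(compress_rule2(name, grandchildren))
        else:
            result[child] = compress_rule2(child, grandchildren)
    return result


tree = {"dessert": {"frozen dessert": {"sundae": {}, "parfait": {}, "sherbet": {}}}}
compressed = {root: compress_rule2(root, kids) for root, kids in tree.items()}
print(compressed)
```

Sibling names are assumed to be unique after promotion; a fuller version would need to merge nodes that collide.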
5. Divide into Facets (Remove top levels)
Before:
  entity
    substance, matter
      food, nutriment
        food stuff, food product
          ingredient, fixings
            flavorer
              herb
                parsley
                oregano
              sweetening
                sugar
                syrup
After:
  flavorer
    herb
      parsley
      oregano
    sweetening
      sugar
      syrup
- Rule 1: Eliminate very general categories (e.g., entity, abstraction). If no paths are longer than threshold t, then done. Else:
- Rule 2: Undo first step. Then eliminate all top levels until the maximum length of any path in the resulting hierarchy is t.

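The two rules can be sketched on a nested-dict forest. Assumptions in this sketch: path length is counted in nodes, "very general" categories are a fixed set (`GENERAL`), a root without children is kept rather than discarded, and sibling names are unique. The helper names are hypothetical.

```python
GENERAL = {"entity", "abstraction"}  # "very general" labels named on the slide


def depth(forest):
    """Length, in nodes, of the longest root-to-leaf path in a forest."""
    return max((1 + depth(kids) for kids in forest.values()), default=0)


def strip_top(forest, only=None):
    """Replace matching internal roots by their children; leaf roots stay in place."""
    result = {}
    for name, kids in forest.items():
        if kids and (only is None or name in only):
            result.update(kids)
        else:
            result[name] = kids
    return result


def divide_into_facets(forest, t):
    if t < 1:
        raise ValueError("t must be at least 1")
    # Rule 1: drop only the very general top-level categories.
    trimmed = strip_top(forest, only=GENERAL)
    if depth(trimmed) <= t:
        return trimmed
    # Rule 2: undo Rule 1, then peel whole top levels until no path exceeds t.
    result = forest
    while depth(result) > t:
        result = strip_top(result)
    return result


tree = {"entity": {"substance, matter": {"food, nutriment": {
    "food stuff, food product": {"ingredient, fixings": {"flavorer": {
        "herb": {"parsley": {}, "oregano": {}},
        "sweetening": {"sugar": {}, "syrup": {}},
    }}}}}}}
facets = divide_into_facets(tree, t=3)
print(list(facets))
```

With t = 3 the five most general levels are peeled away and "flavorer" becomes a top-level facet, as in the slide's example.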
Disambiguation
- Ambiguity in:
  - Word senses
  - Paths up the hypernym tree
2 paths for same word:
  Sense 1 for word “tuna”:
    organism, being => plant, flora => vascular plant => succulent => cactus => tuna
  Sense 2 for word “tuna”:
    organism, being => fish => ... => tuna
2 paths for same sense (sense 2):
    organism, being => fish => food fish => tuna
    organism, being => fish => bony fish => spiny-finned fish => percoid fish => tuna

How to Select the Right Senses and Paths?
- First: build core tree
  - (1) Create paths for words with only one sense
  - (2) Use Domains
    - WordNet has 212 Domains
      - medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
    - Automatically scan the collection to see which domains apply
    - The user selects which of the suggested domains to use or may add own
    - Paths for terms that match the selected domains are added to the core tree
- Then: add remaining terms to the core tree.

Optional Step: Domains
- To disambiguate, use Domains
- WordNet has 212 Domains
  - medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
- A better collection has been developed by Magnini 2000
  - Assigns a domain to every noun synset
- Automatically scan the collection to see which domains apply
- The user selects which of the suggested domains to use or may add own
- Paths for terms that match the selected domains are added to the core tree

Using Domains
dip glosses:
  Sense 1: A depression in an otherwise level surface
  Sense 2: The angle that a magnetic needle makes with the horizon
  Sense 3: Tasty mixture into which bite-size foods are dipped
dip hypernyms:
  Sense 1: solid => shape, form => concave shape => depression
  Sense 2: space => angle
  Sense 3: food => ingredient, fixings => flavorer
Given domain “food”, choose sense 3

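The "dip" example amounts to filtering senses by the domains the user selected. A minimal sketch; the hypernym chains follow the slide, but the domain tags for senses 1 and 2 are hypothetical placeholders (the slide only names "food"), as is the `choose_sense` helper.

```python
# Senses of "dip" as listed on the slide. Only the "food" domain is given
# there; "geography" and "physics" are placeholder tags for illustration.
DIP_SENSES = [
    {"sense": 1, "domain": "geography",
     "hypernyms": ["solid", "shape, form", "concave shape", "depression"]},
    {"sense": 2, "domain": "physics",
     "hypernyms": ["space", "angle"]},
    {"sense": 3, "domain": "food",
     "hypernyms": ["food", "ingredient, fixings", "flavorer"]},
]


def choose_sense(senses, selected_domains):
    """Return the single sense whose domain was selected, or None if unresolved."""
    hits = [s for s in senses if s["domain"] in selected_domains]
    return hits[0] if len(hits) == 1 else None


chosen = choose_sense(DIP_SENSES, {"food"})
print(chosen["sense"], " => ".join(chosen["hypernyms"]))
```

Returning `None` when zero or several senses match leaves the term for the later augmentation step instead of forcing a guess.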
Castanet Evaluation
- This is a tool for information architects, so information architects did the evaluation
- We compared output on
  - Recipes
  - Biomedical journal titles
- We compared to two state-of-the-art algorithms
  - LDA (Blei et al. ’04)
  - Subsumption (Sanderson & Croft ’99)

Subsumption Output (shown in Flamenco)
LDA Output (shown in Flamenco)
Evaluation Method
- Information architects assessed the category systems
- For each of 2 systems’ output:
  - Examined and commented on top-level
  - Examined and commented on two sub-levels
- Then commented on overall properties
  - Meaningful?
  - Systematic?
  - Likely to use in your work?

Evaluation Results
- Results on recipes collection for “Would you use this system in your work?”
  - Yes in some cases or yes definitely:
    - Pine (Castanet): 29/34
    - Oak (LDA): 0/18
    - Birch (Subsumption): 6/16
- Results on quality of categories: [chart]

Opportunities for Tagging
- New opportunity: Tagging, folksonomies (flickr, del.icio.us)
  - People are creating facets in a decentralized manner
  - They are assigning multiple facets to items
  - This is done on a massive scale
  - This leads naturally to meaningful associations

Conclusions
- Flexible application of hierarchical faceted metadata is a proven approach for navigating large information collections.
  - Midway in complexity between simple hierarchies and deep knowledge representation.
  - Currently in use on e-commerce sites; spreading to other domains
- Systems are needed to help create faceted metadata structures
  - Our WordNet-based algorithm, while not perfect, seems like it will be a useful tool for Information Architects.

Acknowledgements
- Flamenco Team
  - Brycen Chun, Ame Elliott, Jennifer English, Kevin Li, Rashmi Sinha, Emilia Stoica, Kirsten Swearingen, Ka-Ping Yee
- Castanet
  - Emilia Stoica
- Funding
  - This work supported in part by NSF (IIS-9984741)

For more information:
flamenco.berkeley.edu
Thank you!
Marti Hearst