Faceted Metadata for Information Architecture and Search
Semi-Automated Creation of
Facet Hierarchies
Marti Hearst
School of Information, UC Berkeley
Joint work with Dr. Emilia Stoica
Outline
Faceted Metadata
Definition
Advantages
Flamenco:
Search Interface Design using Faceted Metadata
Castanet:
(Semi) Automated Tool for Creation of Category Systems
Comparison to State-of-the-Art Alternatives
Conclusions
Marti Hearst, Taxonomy Bootcamp ‘06
Focus: Search and Navigation of Large Collections
Shopping Sites
Digital Libraries
E-Government Sites
Image Collections
Problems with Site Search
Study by Vividence in 2001 on 69 Sites
70% eCommerce
31% Service
21% Content
2% Community
Poorly organized search results
Frustration and wasted time
Poor information architecture
Confusion
Dead ends
"back and forthing"
Forced to search
What we want to Achieve
Integrate browsing and searching seamlessly
Support exploration and learning
Avoid dead-ends, “pogo’ing”, and “lostness”
Main Idea
Use hierarchical faceted metadata
Design the interface to:
Allow flexible navigation
Provide previews of next steps
Organize results in a meaningful way
Support both expanding and refining the search
The Problem With Hierarchy
Most things can be classified in more than one way.
Most organizational systems do not handle this well.
Example: Animal Classification
[Figure: the same animals (robin, penguin, otter, seal, salmon, wolf, cobra, bat) grouped three different ways: by Skin Covering, by Locomotion, and by Diet]
The Problem with Hierarchy
Inflexible
Force the user to start with a particular category
What if I don’t know the animal’s diet, but the
interface makes me start with that category?
Wasteful
Have to repeat combinations of categories
Makes for extra clicking and extra coding
Difficult to modify
To add a new category type, must duplicate it
everywhere or change things everywhere
The Problem With Hierarchy
[Figure: one fixed hierarchy rooted at “start”. The first level branches on locomotion (swim, fly, run, slither, …), each branch repeats the skin coverings (fur, scales, feathers), and each of those repeats the diets (fish, rodents, insects); salmon, bat, robin, and wolf appear at the leaves]
The Idea of Facets
Facets are a way of labeling data
A kind of Metadata (data about data)
Can be thought of as properties of items
Facets vs. Categories
Items are placed INTO a category system
Multiple facet labels are ASSIGNED TO items
The Idea of Facets
Create INDEPENDENT categories (facets)
Each facet has labels (sometimes arranged in a hierarchy)
Assign labels from the facets to every item
Example: recipe collection
Ingredient: Chicken, Bell Pepper
Cooking Method: Stir-fry, Curry
Course: Main Course
Cuisine: Thai
The Idea of Facets
Break out all the important concepts into their
own facets
Sometimes the facets are hierarchical
Assign labels to items from any level of the hierarchy
Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze
Desserts: Cakes, Cookies, Dairy (Ice Cream, Sorbet, Flan)
Fruits: Cherries, Berries (Blueberries, Strawberries), Bananas, Pineapple
Using Facets
Now there are multiple ways to get to each item
Preparation Method: Fry, Saute, Boil, Bake, Broil, Freeze
Desserts: Cakes, Cookies, Dairy (Ice Cream, Sherbet, Flan)
Fruits: Cherries, Berries (Blueberries, Strawberries), Bananas, Pineapple
Paths to one item:
Fruit > Pineapple
Dessert > Cake
Preparation > Bake
Paths to another item:
Dessert > Dairy > Sherbet
Fruit > Berries > Strawberries
Preparation > Freeze
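A minimal sketch of how facet paths give several routes to the same item. The item names ("pineapple cake", "strawberry sherbet") and the `build_index` helper are hypothetical, chosen only to match the label paths on the slide.

```python
from collections import defaultdict

# Hypothetical items; each carries one label path per facet, as on the slide.
ITEMS = {
    "pineapple cake": [
        ("Fruit", "Pineapple"),
        ("Dessert", "Cake"),
        ("Preparation", "Bake"),
    ],
    "strawberry sherbet": [
        ("Dessert", "Dairy", "Sherbet"),
        ("Fruit", "Berries", "Strawberries"),
        ("Preparation", "Freeze"),
    ],
}

def build_index(items):
    """Map every prefix of every label path to the set of items under it."""
    index = defaultdict(set)
    for name, paths in items.items():
        for path in paths:
            for depth in range(1, len(path) + 1):
                index[path[:depth]].add(name)
    return index

INDEX = build_index(ITEMS)
print(sorted(INDEX[("Fruit",)]))             # both items are reachable via Fruit
print(sorted(INDEX[("Dessert", "Dairy")]))   # only the sherbet sits under Dairy
```

Because every prefix of a path is indexed, an item can be reached from any level of any facet it is labeled with.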
Example:
Nobel Prize Winners Collection
(Before and After Facets)
Only One Way to View Laureates
First, Choose Prize Type
Next, view the list!
The user must first choose an award type (e.g., literature), then browse through the laureates in chronological order.
No choice is given to, say, organize by year and then award, or by country, then decade, then award, etc.
Flamenco Interface:
Using Hierarchical Faceted Metadata
Opening View
Select literature from PRIZE facet
Group results by YEAR facet
Select 1920’s from YEAR facet
Current query is PRIZE > literature AND
YEAR: 1920’s. Now remove PRIZE > literature
Now Group By YEAR > 1920’s
Hierarchy Traversal:
Group By YEAR > 1920’s, and drill down to 1921
Select an individual item
Use Endgame to expand out
Or use “More like this” to find similar items
Start a new search using keyword “California”
Note that category structure remains after the keyword search
The query is now a keyword ANDed with a facet subhierarchy
Using Facets
The system only shows the labels that correspond
to the current set of items
Start with all items and all facets
The user then selects a label within a facet
This reduces the set of items (only those that have
been assigned to the subcategory label are displayed)
This also eliminates some subcategories from the view.
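The refinement loop described above can be sketched as follows. This is an illustration only, not Flamenco's code: the three recipes, their labels, and the function names `refine` and `previews` are all assumptions.

```python
from collections import Counter

# Hypothetical collection: each item has a set of (facet, label) pairs.
ITEMS = {
    "thai chicken curry": {("Cuisine", "Thai"), ("Course", "Main Course"),
                           ("Ingredient", "Chicken")},
    "chicken stir-fry": {("Cuisine", "Chinese"), ("Course", "Main Course"),
                         ("Ingredient", "Chicken")},
    "mango sticky rice": {("Cuisine", "Thai"), ("Course", "Dessert"),
                          ("Ingredient", "Mango")},
}

def refine(items, selected):
    """Keep only the items that carry every selected (facet, label) pair."""
    return {name: labels for name, labels in items.items() if selected <= labels}

def previews(items):
    """Count items per label; labels with no matching items never appear."""
    return Counter(label for labels in items.values() for label in labels)

current = refine(ITEMS, {("Cuisine", "Thai")})
counts = previews(current)
for (facet, label), n in sorted(counts.items()):
    print(f"{facet} > {label} ({n})")
```

Since the preview counts are computed from the current item set, a label that would lead to zero items is simply never offered.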
Advantages of Facets
Can’t end up with empty result sets
(except with keyword search)
Helps avoid feelings of being lost.
Easier to explore the collection.
Helps users infer what kinds of things are in the
collection.
Evokes a feeling of “browsing the shelves”
Is preferred over standard search for collection
browsing in usability studies.
(Interface must be designed properly)
Advantages of Facets
Seamless to add new facets and subcategories
Seamless to add new items.
Helps with “categorization wars”
Don’t have to agree exactly where to place something
Interaction can be implemented using a standard
relational database.
May be easier for automatic categorization
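To illustrate the relational-database point, here is a small sketch using Python's built-in `sqlite3` module. The two-table schema, the sample rows, and the `count_previews` helper are assumptions made for this example; they are not Flamenco's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE item_label (item_id INTEGER, facet TEXT, label TEXT);
""")
# Hypothetical laureates, each labeled in two facets.
conn.executemany("INSERT INTO item VALUES (?, ?)",
                 [(1, "laureate A"), (2, "laureate B"), (3, "laureate C")])
conn.executemany("INSERT INTO item_label VALUES (?, ?, ?)", [
    (1, "PRIZE", "literature"), (1, "YEAR", "1920s"),
    (2, "PRIZE", "physics"),    (2, "YEAR", "1920s"),
    (3, "PRIZE", "literature"), (3, "YEAR", "1930s"),
])

def count_previews(conn, facet, label, group_facet):
    """For items carrying (facet, label), count items per label of group_facet."""
    rows = conn.execute("""
        SELECT g.label, COUNT(*) FROM item_label AS s
        JOIN item_label AS g ON g.item_id = s.item_id
        WHERE s.facet = ? AND s.label = ? AND g.facet = ?
        GROUP BY g.label ORDER BY g.label
    """, (facet, label, group_facet))
    return rows.fetchall()

# Select PRIZE > literature, then group the results by YEAR.
print(count_previews(conn, "PRIZE", "literature", "YEAR"))
```

A self-join on the label table is enough to compute the per-label counts; adding a new facet means inserting rows, not changing the schema.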
Limitation of Facets
Do not naturally capture MAIN THEMES
Facets do not show RELATIONS explicitly
Aquamarine
Red
Orange
Door
Doorway
Wall
Which color associated with which object?
Photo by J. Hearst, jhearst.typepad.com
Terminology Clarification
Facets vs. Attributes
Facets are shown independently in the interface
Attributes are just associated with individual items
E.g., ID number, Source, Affiliation
However, can always convert an attribute to a facet
Facets vs. Labels
Labels are the names used within facets
These are organized into subhierarchies
Synonyms
There should be alternate names for the category labels
Currently (in Flamenco) this is done with subcategories
E.g., Deer has subcategories “stag”, “fawn”, “doe”
Usability Study Results
Flamenco Usability Studies
Usability studies done on 3 collections:
Recipes (epicurious): 13,000 items
Architecture Images: 40,000 items
Fine Arts Images: 35,000 items
Conclusions:
Users like and are successful with the dynamic
faceted hierarchical metadata, especially for
browsing tasks
Very positive results, in contrast with studies on
earlier iterations.
Most Recent Usability Study
Participants & Collection
32 Art History Students
~35,000 images from SF Fine Arts Museum
Study Design
Within-subjects
Each participant sees both interfaces
Balanced in terms of order and tasks
Participants assess each interface after use
Afterwards they compare them directly
Data recorded in behavior logs, server logs, paper-surveys;
one or two experienced testers at each trial.
Used 9 point Likert scales.
Session took about 1.5 hours; pay was $15/hour
Post-Interface Assessments
All significant at p<.05 except “simple” and “overwhelming”
Post-Test Comparison
                                            Baseline   Faceted
Which Interface Preferable For:
  Find images of roses                          15        16
  Find all works from a given period             2        30
  Find pictures by 2 artists in same media       1        29
Overall Assessment:
  More useful for your tasks                     4        28
  Easiest to use                                 8        23
  Most flexible                                  6        24
  More likely to result in dead ends            28         3
  Helped you learn more                          1        31
  Overall preference                             2        29
How to Create Facet Hierarchies?
Our Approach: Castanet
Example: Recipes
(3500 docs)
Castanet Output (shown in Flamenco)
Our Approach:
Leverage the structure of WordNet
[Diagram: Documents → Select terms → Get hypernym paths (from WordNet) → Build tree → Compress tree → Divide into facets]
1. Select Terms
Select well-distributed terms from the collection
Eliminate stopwords
Retain only those terms with a distribution higher than a threshold (default: top 10%)
[Pipeline diagram, repeated on each step slide: Documents → Select terms → Build core tree → Augment core tree → Compress tree → Remove top-level categories, drawing on WordNet]
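A rough sketch of the term-selection step. It assumes that "distribution" means the number of documents containing a term, and that "top 10%" means the top tenth of terms ranked by that count; the stopword list, the sample documents, and the `select_terms` name are illustrative only.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "and", "with", "of", "in"}  # tiny illustrative list

def select_terms(docs, top_fraction=0.10):
    """Return the top fraction of non-stopword terms by document frequency."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document.
        doc_freq.update(set(doc.lower().split()) - STOPWORDS)
    keep = max(1, math.ceil(len(doc_freq) * top_fraction))
    # Rank by document frequency, breaking ties alphabetically.
    ranked = sorted(doc_freq, key=lambda t: (-doc_freq[t], t))
    return ranked[:keep]

docs = [
    "bake the cake with sugar",
    "freeze the sherbet with sugar",
    "bake cookies and sugar",
]
# A larger fraction than the default is used here because the sample is tiny.
print(select_terms(docs, top_fraction=0.5))
```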
2. Build Core Tree
Build a “backbone”
Create paths from unambiguous terms only
Bias the structure towards appropriate senses of words
Get hypernym path if term:
- has only one sense, or
- matches a pre-selected WordNet domain
Adding a new term increases a count at each node on its path by # of docs with the term.
Example hypernym paths:
sundae: entity > substance, matter > nutriment > dessert > frozen dessert > ice cream sundae > sundae
sherbet: entity > substance, matter > nutriment > dessert > frozen dessert > sherbet, sorbet > sherbet
2. Build Core Tree (cont.)
Merge hypernym paths to build a tree
[Figure: the sundae and sherbet paths share entity > substance, matter > nutriment > dessert > frozen dessert; after merging, frozen dessert has two branches, ice cream sundae > sundae and sherbet, sorbet > sherbet]
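The merge step can be sketched as a trie build over hypernym paths. The paths below are copied from the slide example and hard-coded rather than looked up in WordNet; the document counts (12 and 7) are invented, and `build_core_tree` is a hypothetical name.

```python
# Root-first hypernym paths paired with the number of documents that
# contain the term. The counts are made up for illustration.
PATHS = {
    "sundae": (["entity", "substance, matter", "nutriment", "dessert",
                "frozen dessert", "ice cream sundae", "sundae"], 12),
    "sherbet": (["entity", "substance, matter", "nutriment", "dessert",
                 "frozen dessert", "sherbet, sorbet", "sherbet"], 7),
}

def build_core_tree(paths):
    """Merge paths into nested dicts; each node stores a count and children."""
    root = {"count": 0, "children": {}}
    for path, n_docs in paths.values():
        node = root
        for name in path:
            node = node["children"].setdefault(name, {"count": 0, "children": {}})
            node["count"] += n_docs   # every node on the path gets the doc count
    return root

def find(tree, path):
    """Walk down the tree along a list of node names."""
    node = tree
    for name in path:
        node = node["children"][name]
    return node

tree = build_core_tree(PATHS)
frozen = find(tree, ["entity", "substance, matter", "nutriment", "dessert",
                     "frozen dessert"])
print(frozen["count"], sorted(frozen["children"]))
```

Shared prefixes collapse into a single chain, so the counts on shared nodes are the sum over all terms beneath them.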
3. Augment Core Tree
Attach to the core tree the terms with more than one sense
Favor the more common path over other alternatives
Augment Core Tree (cont.)
Date (p1): entity > substance, matter > food, nutrient > nutriment > food > edible fruit (78) > date
Date (p2): abstraction > measure, quantity > fundamental quality > time period > calendar day (18) > date
Choose p1 since it has more items assigned
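One plausible reading of "favor the more common path" is to compare how many items are already assigned to the nearest core-tree ancestor on each candidate path. The sketch below follows that reading; the `support` and `choose_path` helpers are hypothetical, and only the two counts (78 and 18) come from the slide.

```python
# Item counts already recorded in the core tree (numbers from the slide).
CORE_COUNTS = {"edible fruit": 78, "calendar day": 18}

# Candidate root-first paths for the ambiguous term "date".
CANDIDATES = {
    "p1": ["entity", "substance, matter", "food, nutrient", "nutriment",
           "food", "edible fruit", "date"],
    "p2": ["abstraction", "measure, quantity", "fundamental quality",
           "time period", "calendar day", "date"],
}

def support(path, core_counts):
    """Count at the deepest ancestor of the term that is in the core tree."""
    for name in reversed(path[:-1]):
        if name in core_counts:
            return core_counts[name]
    return 0

def choose_path(candidates, core_counts):
    """Return the key of the candidate path with the highest support."""
    return max(candidates, key=lambda key: support(candidates[key], core_counts))

print(choose_path(CANDIDATES, CORE_COUNTS))
```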
4. Compress Tree
Rule 1: Eliminate a parent with fewer than k children unless it is the root or its distribution is larger than 0.1*maxdist
[Figure: dessert subtree before and after Rule 1; the intermediate nodes ice cream sundae and sherbet, sorbet are removed, leaving sundae, parfait, and sherbet directly under frozen dessert]
4. Compress Tree (cont.)
Rule 2: Eliminate a child whose name appears within the parent’s name
Before: dessert > frozen dessert > {sundae, parfait, sherbet}
After: dessert > {sundae, parfait, sherbet}
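The two compression rules can be sketched on the dessert example. Several things here are assumptions: the node counts are invented, k is set to 2, "eliminating" a node is taken to mean promoting its children to its parent, and Rule 2 is implemented the way the slide's example behaves (the child "frozen dessert" is dropped because it repeats the parent name "dessert").

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    count: int = 0                      # documents under this node (invented)
    children: list = field(default_factory=list)

def rule1(node, k, maxdist, is_root=True):
    """Return the nodes that replace `node` after dropping sparse parents."""
    kids = []
    for child in node.children:
        kids.extend(rule1(child, k, maxdist, is_root=False))
    node.children = kids
    # A parent is sparse if it has fewer than k children and a small count.
    sparse = 0 < len(kids) < k and node.count <= 0.1 * maxdist
    return kids if (sparse and not is_root) else [node]

def rule2(node):
    """Drop a child whose name repeats the parent's name; promote its children."""
    kids = []
    for child in node.children:
        rule2(child)
        if node.name in child.name:
            kids.extend(child.children)
        else:
            kids.append(child)
    node.children = kids
    return node

tree = Node("dessert", 200, [
    Node("frozen dessert", 30, [
        Node("ice cream sundae", 12, [Node("sundae", 12)]),
        Node("parfait", 6),
        Node("sherbet, sorbet", 12, [Node("sherbet", 12)]),
    ]),
])
(tree,) = rule1(tree, k=2, maxdist=200)
after_rule1 = [c.name for c in tree.children[0].children]
rule2(tree)
after_rule2 = [c.name for c in tree.children]
print(after_rule1, after_rule2)
```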
5. Divide into Facets (Remove top levels)
Rule 1: Eliminate very general categories (e.g., entity, abstraction). If no paths are longer than threshold t, then done. Else:
Rule 2: Undo first step. Then eliminate all top levels until the maximum length of any path in the resulting hierarchy is t.
[Figure: the top of the chain entity > substance, matter > food, nutriment > food stuff, food product > ingredient, fixings is removed, leaving the lower levels: flavorer, herb (parsley, oregano), sweetening (sugar, syrup)]
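A rough sketch of the two facet-division rules over root-first paths. It assumes that path length is measured in nodes, that "eliminate all top levels" strips one level from every path at a time, and that the set of "very general" categories is just the two named on the slide; the sweetening branch is left out because its attachment point is not recoverable from the figure.

```python
GENERAL = {"entity", "abstraction"}   # assumed list of very general categories

def divide_into_facets(paths, t):
    """Trim top levels; the first remaining name on a path names its facet."""
    # Rule 1: drop the very general categories.
    trimmed = [[name for name in path if name not in GENERAL] for path in paths]
    if max(len(path) for path in trimmed) <= t:
        return trimmed
    # Rule 2: undo Rule 1, then strip top levels until no path exceeds t.
    trimmed = [list(path) for path in paths]
    while max(len(path) for path in trimmed) > t:
        trimmed = [path[1:] for path in trimmed if len(path) > 1]
    return trimmed

TOP = ["entity", "substance, matter", "food, nutriment",
       "food stuff, food product", "ingredient, fixings"]
PATHS = [
    TOP + ["flavorer", "herb", "parsley"],
    TOP + ["flavorer", "herb", "oregano"],
]
facets = divide_into_facets(PATHS, t=3)
print(facets)
```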
Disambiguation
Ambiguity in:
Word senses
Paths up the hypernym tree
2 paths for same word:
Sense 1 for word “tuna”
organism, being
=> plant, flora
=> vascular plant
=> succulent
=> cactus
=> tuna
Sense 2 for word “tuna” (2 paths for same sense)
organism, being
=> fish
then either:
=> food fish
=> tuna
or:
=> bony fish
=> spiny-finned fish
=> percoid fish
=> tuna
How to Select the Right Senses and Paths?
First: build core tree
(1) Create paths for words with only one sense
(2) Use Domains
WordNet has 212 Domains
medicine, mathematics, biology, chemistry, linguistics,
soccer, etc.
Automatically scan the collection to see which domains
apply
The user selects which of the suggested domains to use
or may add their own
Paths for terms that match the selected domains are
added to the core tree
Then: add remaining terms to the core tree.
Optional Step: Domains
To disambiguate, use Domains
WordNet has 212 Domains
medicine, mathematics, biology, chemistry, linguistics,
soccer, etc.
A better collection has been developed by Magnini 2000
Assigns a domain to every noun synset
Automatically scan the collection to see which domains
apply
The user selects which of the suggested domains to use or
may add their own
Paths for terms that match the selected domains are added
to the core tree
Using Domains
dip glosses:
Sense 1: A depression in an otherwise level surface
Sense 2: The angle that a magnetic needle makes with the horizon
Sense 3: Tasty mixture into which bite-size foods are dipped
dip hypernyms:
Sense 1: solid => concave shape => depression
Sense 2: shape, form => space => angle
Sense 3: food => ingredient, fixings => flavorer
Given domain “food”, choose sense 3
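Domain-based sense selection can be sketched as a filter over a term's senses. The sense table below is hand-written for illustration: only the "food" domain for sense 3 of "dip" comes from the slide, while the domains given for senses 1 and 2, the "parsley" entry, and the `pick_senses` name are hypothetical.

```python
# Hypothetical sense inventory; in practice this would come from WordNet
# plus a domain labeling of its synsets.
SENSES = {
    "dip": [
        {"sense": 1, "gloss": "a depression in an otherwise level surface",
         "domain": "geology"},      # assumed domain
        {"sense": 2, "gloss": "the angle a magnetic needle makes with the horizon",
         "domain": "physics"},      # assumed domain
        {"sense": 3, "gloss": "tasty mixture into which bite-size foods are dipped",
         "domain": "food"},
    ],
    "parsley": [
        {"sense": 1, "gloss": "an herb", "domain": "food"},
    ],
}

def pick_senses(term, senses, selected_domains):
    """Return the sense numbers of `term` whose paths go into the core tree."""
    candidates = senses[term]
    if len(candidates) == 1:              # unambiguous terms are always accepted
        return [candidates[0]["sense"]]
    return [s["sense"] for s in candidates if s["domain"] in selected_domains]

print(pick_senses("dip", SENSES, {"food"}))
```

Terms that match no selected domain return an empty list here; they are the ones left for the later augmentation step.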
Castanet Evaluation
This is a tool for information architects, so information
architects did the evaluation
We compared output on
Recipes
Biomedical journal titles
We compared to two state-of-the-art algorithms
LDA (Blei et al. 04)
Subsumption (Sanderson & Croft ’99)
Subsumption Output (shown in Flamenco)
LDA Output (shown in Flamenco)
Evaluation Method
Information architects assessed the category
systems
For each of 2 systems’ output:
Examined and commented on top-level
Examined and commented on two sub-levels
Then commented on overall properties
Meaningful?
Systematic?
Likely to use in your work?
Evaluation Results
Results on recipes collection for “Would you use this system in your work?”
Yes in some cases or yes definitely:
Pine (Castanet): 29/34
Oak (LDA): 0/18
Birch (Subsumption): 6/16
Results on quality of categories:
Opportunities for Tagging
New opportunity: Tagging, folksonomies
(Flickr, del.icio.us)
People are creating facets in a decentralized manner
They are assigning multiple facets to items
This is done on a massive scale
This leads naturally to meaningful associations
Conclusions
Flexible application of hierarchical faceted
metadata is a proven approach for navigating large
information collections.
Midway in complexity between simple hierarchies and
deep knowledge representation.
Currently in use on e-commerce sites; spreading to other
domains
Systems are needed to help create faceted
metadata structures
Our WordNet-based algorithm, while not perfect, seems
like it will be a useful tool for Information Architects.
Acknowledgements
Flamenco Team
Brycen Chun, Ame Elliott, Jennifer English, Kevin Li,
Rashmi Sinha, Emilia Stoica, Kirsten Swearingen, Ka-Ping Yee
Castanet
Emilia Stoica
Funding
This work was supported in part by NSF (IIS-9984741)
For more information:
flamenco.berkeley.edu
Thank you!
Marti Hearst