NLP Support for Faceted Navigation in Scholarly Collections

ACL’09 Workshop on NLP for Scholarly Collections
Marti Hearst and Emilia Stoica
Presented by Preslav Nakov


Motivation

- Faceted navigation is now standard for “vertical” content collections
  - e-commerce stores
  - image collections
- It is also being used for digital libraries
  - WorldCat, NCSU, Chicago
- Problem: the labels for the SUBJECT facet need to be richer.
  - How can these facets be created automatically?

Our solution: Castanet applied to scholarly collections

Marti Hearst, Taxonomy Bootcamp ‘06

Outline

- Definition of faceted metadata
- Examples of faceted navigation in use
- Castanet: an algorithm for (semi) automatic creation of facet hierarchies
- Application of Castanet to a scholarly collection

The Idea of Facets

- Facets are a way of labeling data
  - A kind of metadata (data about data)
  - Can be thought of as properties of items
- Facets vs. Categories
  - Items are placed INTO a category system
  - Multiple facet labels are ASSIGNED TO items

The Idea of Facets

- Create INDEPENDENT categories (facets)
  - Each facet has labels (sometimes arranged in a hierarchy)
- Assign labels from the facets to every item
  - Example: recipe collection

Cooking Method: Stir-fry
Ingredient: Chicken, Bell Pepper, Curry
Course: Main Course
Cuisine: Thai

The Idea of Facets

- Break out all the important concepts into their own facets
- Sometimes the facets are hierarchical
  - Assign labels to items from any level of the hierarchy

Preparation Method
  Fry
  Saute
  Boil
  Bake
  Broil
  Freeze

Desserts
  Cakes
  Cookies
  Dairy
    Ice Cream
    Sorbet
    Flan

Fruits
  Cherries
  Berries
    Blueberries
    Strawberries
  Bananas
  Pineapple

Using Facets

 Now there are multiple ways to get to each item

Preparation Method
  Fry
  Saute
  Boil
  Bake
  Broil
  Freeze

Desserts
  Cakes
  Cookies
  Dairy
    Ice Cream
    Sherbet
    Flan

Fruits
  Cherries
  Berries
    Blueberries
    Strawberries
  Bananas
  Pineapple

Fruit > Pineapple; Dessert > Cake; Preparation > Bake

Dessert > Dairy > Sherbet; Fruit > Berries > Strawberries; Preparation > Freeze
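The two groups of label paths above can be modeled as items that each carry one path per facet, so that any facet leads to the item. The following is a minimal Python sketch under stated assumptions: the item names (`pineapple cake`, `strawberry sherbet`) and the helpers `matches` and `browse` are hypothetical and introduced only for illustration; they are not taken from the Flamenco code.

```python
# Hypothetical items; the label paths themselves are the ones shown on the slide.
ITEMS = {
    "pineapple cake": [
        ("Fruit", "Pineapple"),
        ("Dessert", "Cake"),
        ("Preparation", "Bake"),
    ],
    "strawberry sherbet": [
        ("Dessert", "Dairy", "Sherbet"),
        ("Fruit", "Berries", "Strawberries"),
        ("Preparation", "Freeze"),
    ],
}


def matches(item_paths, query):
    """True if one of the item's label paths starts with the query path."""
    return any(path[:len(query)] == query for path in item_paths)


def browse(items, *queries):
    """Names of items that satisfy every query path (one per chosen facet)."""
    return sorted(
        name for name, paths in items.items()
        if all(matches(paths, query) for query in queries)
    )


print(browse(ITEMS, ("Fruit",)))             # reachable through the Fruit facet
print(browse(ITEMS, ("Dessert", "Dairy")))   # reachable through a Dessert sub-level
print(browse(ITEMS, ("Preparation", "Bake"), ("Fruit", "Pineapple")))
```

A query path may stop at any level of a hierarchy, which mirrors the point that labels can be assigned, and browsed, at any level.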

Faceted navigation’s advantages:

- Integrate browsing and searching seamlessly
- Support exploration and learning
- Avoid dead-ends, “pogo’ing”, and “lostness”

Uses of Faceted Navigation in Online Digital Libraries

WorldCat (screenshots)

U Chicago (screenshots)

Advantages of Facets

- Can’t end up with empty result sets
  - (except with keyword search)
- Helps avoid feelings of being lost.
- Easier to explore the collection.
  - Helps users infer what kinds of things are in the collection.
  - Evokes a feeling of “browsing the shelves”
- Is preferred over standard search for collection browsing in usability studies.
  - (Interface must be designed properly)
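The first point, that browsing cannot end in an empty result set, follows if the interface only offers labels that still occur among the currently matching items. A minimal sketch, assuming items are stored as sets of (facet, label) pairs; the item names and the function `refinement_counts` are hypothetical and not part of any system named in the talk:

```python
from collections import Counter

# Hypothetical toy collection; labels follow the recipe examples in the slides.
ITEMS = {
    "thai chicken stir-fry": {
        ("Cuisine", "Thai"), ("Course", "Main Course"), ("Cooking Method", "Stir-fry"),
    },
    "pineapple cake": {("Course", "Dessert"), ("Cooking Method", "Bake")},
    "strawberry sherbet": {("Course", "Dessert"), ("Cooking Method", "Freeze")},
}


def refinement_counts(items, selected):
    """Count the labels still available among items carrying every selected label."""
    current = [labels for labels in items.values() if selected <= labels]
    return Counter(label for labels in current for label in labels - selected)


# After choosing Course > Dessert, only labels with at least one match are offered.
counts = refinement_counts(ITEMS, {("Course", "Dessert")})
for (facet, label), n in sorted(counts.items()):
    print(f"{facet}: {label} ({n})")
```

Because every offered label has a count of at least one, clicking it always returns some item; free-text keyword search has no such guarantee, which is the exception noted above.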

Limitation of Facets

- Do not naturally capture MAIN THEMES
- Facets do not show RELATIONS explicitly
  - Example labels: Aquamarine, Red, Orange (colors); Door, Doorway, Wall (objects)
  - Which color is associated with which object?

Photo by J. Hearst, jhearst.typepad.com

Usability Studies (using Flamenco)

- Usability studies done on 3 collections:
  - Recipes (epicurious): 13,000 items
  - Architecture Images: 40,000 items
  - Fine Arts Images: 35,000 items
- Conclusions:
  - Users like and are successful with the dynamic faceted hierarchical metadata, especially for browsing tasks
  - Very positive results, in contrast with studies on earlier iterations.

How to Create Facet Hierarchies?

Our Approach: Castanet

Biomedical Journal Titles

(3275 Titles)

- Journal of clinical hypertension
- American journal of hypertension : journal of the American Society of Hypertension
- Hypertension in pregnancy : official journal of the International Society for the Study of Hypertension in Pregnancy
- Journal of interventional cardiac electrophysiology : an international journal of arrhythmias and pacing
- Heart failure reviews
- Hypertension research : official journal of the Japanese Society of Hypertension
- Current hypertension reports
- European journal of heart failure : journal of the Working Group on Heart Failure of the European Society of Cardiology
- Congestive heart failure (Greenwich, Conn.)
- Clinical and experimental hypertension (New York, N.Y. : 1993)
- Hypertension
- Journal of human hypertension

Castanet Output (Bio titles) (screenshots)

Castanet Output (LibraryThing tags) (screenshots)

Our Approach

- Leverage the structure of WordNet
- Pipeline (using WordNet): get hypernym paths, build tree, compress tree, divide into facets

1. Select Terms

- Select well distributed terms from collection

Example terms: red, blue
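The slide does not define "well distributed". One plausible reading, shown here purely as an assumption, is to keep terms whose document frequency lies inside a band, so that very rare and near-universal words are dropped. The function `select_terms`, its thresholds, and the toy documents are all hypothetical.

```python
from collections import Counter


def select_terms(documents, min_fraction=0.3, max_fraction=0.9):
    """Keep terms that occur in a moderate share of the documents.

    The thresholds are illustrative assumptions, not values from the talk.
    """
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(text.lower().split()))  # count each term once per document
    total = len(documents)
    return sorted(
        term for term, df in doc_freq.items()
        if min_fraction <= df / total <= max_fraction
    )


# Toy item descriptions (invented for this sketch).
docs = ["red door", "blue door", "red wall", "blue wall aquamarine"]
print(select_terms(docs))
```

Here `aquamarine` appears in only one of four documents and falls below the lower threshold, while `red` and `blue` are kept as target terms.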

2. Get Hypernym Path

red: abstraction => property => visual property => color => chromatic color => red, redness => red

blue: abstraction => property => visual property => color => chromatic color => blue, blueness => blue
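Getting a hypernym path amounts to following "is a kind of" links upward until no parent remains. The sketch below uses a small hand-written table (`TOY_HYPERNYM`, an assumption standing in for WordNet) that contains only the links shown on the slide; the function name `hypernym_path` is likewise hypothetical.

```python
# Toy stand-in for WordNet: each label maps to its direct hypernym.
# Only the links visible on the slide are included.
TOY_HYPERNYM = {
    "red": "red, redness",
    "red, redness": "chromatic color",
    "blue": "blue, blueness",
    "blue, blueness": "chromatic color",
    "chromatic color": "color",
    "color": "visual property",
    "visual property": "property",
    "property": "abstraction",
}


def hypernym_path(term, hypernym_of=TOY_HYPERNYM):
    """Return the path from the most general ancestor down to the term."""
    path = [term]
    while path[-1] in hypernym_of:
        path.append(hypernym_of[path[-1]])
    path.reverse()
    return path


print(" => ".join(hypernym_path("red")))
print(" => ".join(hypernym_path("blue")))
```

With a real WordNet installation, NLTK's `Synset.hypernym_paths()` returns comparable root-to-synset paths; a synset can have more than one path, which the toy table above does not model.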

3. Build Tree

Hypernym paths from WordNet:

red: abstraction => property => visual property => color => chromatic color => red, redness => red

blue: abstraction => property => visual property => color => chromatic color => blue, blueness => blue

Merged tree:

abstraction
  property
    visual property
      color
        chromatic color
          red, redness
            red
          blue, blueness
            blue
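Merging the paths into one tree means sharing every common prefix. A minimal sketch, assuming the tree is represented as nested dictionaries keyed by label; `build_tree` and `show` are hypothetical helper names:

```python
def build_tree(paths):
    """Merge root-to-leaf label paths into one nested-dict tree."""
    tree = {}
    for path in paths:
        node = tree
        for label in path:
            node = node.setdefault(label, {})  # reuse the node if the prefix exists
    return tree


def show(tree, depth=0):
    """Render the tree as indented lines, one label per line."""
    lines = []
    for label, children in tree.items():
        lines.append("  " * depth + label)
        lines.extend(show(children, depth + 1))
    return lines


SHARED = ["abstraction", "property", "visual property", "color", "chromatic color"]
RED_PATH = SHARED + ["red, redness", "red"]
BLUE_PATH = SHARED + ["blue, blueness", "blue"]

tree = build_tree([RED_PATH, BLUE_PATH])
print("\n".join(show(tree)))
```

The two paths share their first five labels, so the tree branches only below `chromatic color`.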

4. Compress Tree

Before:

color
  chromatic color
    red, redness
      red
    blue, blueness
      blue
    green, greenness
      green

After:

color
  chromatic color
    red
    blue
    green

4. Compress Tree (cont.)

Before:

color
  chromatic color
    red
    blue
    green

After:

color
  red
  blue
  green
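The compression slides show only before and after trees, so the rules in this sketch are inferred from them and should be read as assumptions: (a) drop a non-root node that has fewer than `min_children` children, and (b) drop a node whose label ends with its parent's label (for example `chromatic color` under `color`). In both cases the node's children are promoted one level. The function name `compress` and its parameters are hypothetical.

```python
def compress(children, parent_label, min_children=2, use_name_rule=True):
    """Return the compressed children of the node labelled parent_label."""
    result = {}
    for label, grandchildren in children.items():
        kids = compress(grandchildren, label, min_children, use_name_rule)
        # Assumed rule (a): an internal node with too few children adds no structure.
        too_few = 0 < len(kids) < min_children
        # Assumed rule (b): an internal node that merely repeats its parent's name.
        repeats_parent = (
            use_name_rule and bool(kids) and label.endswith(" " + parent_label)
        )
        if too_few or repeats_parent:
            result.update(kids)   # drop this node, promote its children
        else:
            result[label] = kids
    return result


COLOR_SUBTREE = {
    "chromatic color": {
        "red, redness": {"red": {}},
        "blue, blueness": {"blue": {}},
        "green, greenness": {"green": {}},
    }
}

stage_one = {"color": compress(COLOR_SUBTREE, "color", use_name_rule=False)}
stage_two = {"color": compress(COLOR_SUBTREE, "color")}
print(stage_one)
print(stage_two)
```

`stage_one` applies only rule (a) and corresponds to the first compression slide; `stage_two` applies both rules and corresponds to the continuation slide. The root is never tested because `compress` is called on its children.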

5. Divide into Facets

Divide into facets

Disambiguation

- Ambiguity in:
  - Word senses
  - Paths up the hypernym tree

Two paths for the same word:

Sense 1 for word “tuna”: organism, being => plant, flora => vascular plant => succulent => cactus => tuna

Sense 2 for word “tuna”: organism, being => fish => food fish => tuna

Two paths for the same sense (sense 2):

organism, being => fish => food fish => tuna

organism, being => fish => bony fish => spiny-finned fish => percoid fish => tuna

How to Select the Right Senses and Paths?

- First: build core tree
  - (1) Create paths for words with only one sense
  - (2) Use Domains
    - WordNet has 212 Domains: medicine, mathematics, biology, chemistry, linguistics, soccer, etc.
    - Automatically scan the collection to see which domains apply
    - The user selects which of the suggested domains to use or may add own
    - Paths for terms that match the selected domains are added to the core tree
- Then: add remaining terms to the core tree.

Using Domains

dip glosses:

Sense 1: A depression in an otherwise level surface
Sense 2: The angle that a magnetic needle makes with the horizon
Sense 3: Tasty mixture into which bite-size foods are dipped

dip hypernyms:

Sense 1: solid => concave shape => depression
Sense 2: shape, form => space => angle
Sense 3: food => ingredient, fixings => flavorer

Given domain “food”, choose sense 3
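The selection logic for the core tree can be sketched as: accept a term directly when it has one sense, otherwise accept it only if exactly one sense matches a selected domain, and otherwise defer it. The sketch below is an assumption-laden simplification: it treats a sense as matching when the domain name appears in the hypernym path shown on the slide, whereas the actual system relies on WordNet Domains labels. The name `choose_sense` is hypothetical.

```python
# Hypernym paths for "dip", copied from the slide (most general label first).
DIP_SENSES = {
    1: ["solid", "concave shape", "depression"],
    2: ["shape, form", "space", "angle"],
    3: ["food", "ingredient, fixings", "flavorer"],
}


def choose_sense(sense_paths, selected_domains):
    """Return the sense id to add to the core tree, or None to defer the term."""
    if len(sense_paths) == 1:
        return next(iter(sense_paths))       # step (1): only one sense
    hits = [
        sense for sense, path in sense_paths.items()
        if any(domain in path for domain in selected_domains)
    ]
    # Step (2): accept only an unambiguous domain match.
    return hits[0] if len(hits) == 1 else None


print(choose_sense(DIP_SENSES, {"food"}))      # sense chosen for the food domain
print(choose_sense(DIP_SENSES, {"medicine"}))  # no match: left for the later step
```

Terms for which `choose_sense` returns `None` correspond to the "remaining terms" that are added to the core tree afterwards.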

Castanet Evaluation

- This is a tool for information architects, so information architects did the evaluation
- We compared output on
  - Recipes
  - Biomedical journal titles
- We compared to two state-of-the-art algorithms
  - LDA (Blei et al. ’04)
  - Subsumption (Sanderson & Croft ’99)

Subsumption Output (Bio titles) (screenshots)

LDA Output (Bio titles) (screenshots)

Evaluation Method

- Information architects assessed the category systems
- For each of 2 systems’ output:
  - Examined and commented on top-level
  - Examined and commented on two sub-levels
- Then comment on overall properties
  - Meaningful?
  - Systematic?
  - Likely to use in your work?

Evaluation Results (Bio titles)

- 15 participants, all PubMed users
- Results for “Would you use this system in your work?”
  - Answering “Yes in some cases” or “Yes definitely”
    - Pine (Castanet): 11/15
    - Oak (LDA): 1/7
    - Birch (Subsumption): 1/8

Evaluation Results (recipes)

- Results on recipes collection for “Would you use this system in your work?”
  - “Yes in some cases” or “Yes definitely”:
    - Pine (Castanet): 29/34
    - Oak (LDA): 0/18
    - Birch (Subsumption): 6/16
- Results on quality of categories: (figure not reproduced)
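The raw counts are easier to compare as proportions, because the denominators differ. In both collections the Oak and Birch counts sum to the Pine count (7 + 8 = 15 and 18 + 16 = 34), which suggests, though the slides do not state it, that each participant assessed Castanet plus one of the two other systems. A short sketch that converts the reported counts; the helper name `yes_rates` is hypothetical:

```python
# (yes answers, participants who assessed that system), copied from the slides.
RESULTS = {
    "Bio titles": {
        "Pine (Castanet)": (11, 15),
        "Oak (LDA)": (1, 7),
        "Birch (Subsumption)": (1, 8),
    },
    "Recipes": {
        "Pine (Castanet)": (29, 34),
        "Oak (LDA)": (0, 18),
        "Birch (Subsumption)": (6, 16),
    },
}


def yes_rates(systems):
    """Fraction of assessors answering 'yes' for each system."""
    return {name: yes / total for name, (yes, total) in systems.items()}


for collection, systems in RESULTS.items():
    for name, rate in yes_rates(systems).items():
        print(f"{collection:10s} {name:20s} {rate:.0%}")
```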

Conclusions

- Flexible application of hierarchical faceted metadata is a proven approach for navigating scholarly collections.
  - Midway in complexity between simple hierarchies and deep knowledge representation.
  - Currently in use in digital library sites, but the SUBJECT categories need more work.
- Algorithms are needed to help create faceted metadata structures
  - Our WordNet-based algorithm, while not perfect, provides a good starting point for scholarly collections

For more information: flamenco.berkeley.edu

Thank you!

Preslav Nakov, Marti Hearst & Emilia Stoica