Facetted Classification and
Thesauri Introduction
University of California, Berkeley
School of Information
IS 245: Organization of Information In
Collections
IS 257 – Fall 2007
2007.04.04 - SLIDE 1
Lecture Overview
• Note: Must end early today
• Facetted Classification
– Traditional vs. Facetted Classification
– Designing Facetted Classifications
– Thesaurus Design intro
IS 257 – Fall 2007
2007.04.04 - SLIDE 2
Agenda
• Facetted Classification
– Traditional vs. Facetted Classification
– Designing Facetted Classifications
– Thesaurus Design
IS 257 – Fall 2007
2007.04.04 - SLIDE 3
Controlled Vocabularies
• Vocabulary control is the attempt to
provide a standardized and consistent set
of terms (such as subject headings,
names, classifications, etc.) with the intent
of aiding the searcher in finding
information
• That is, it is an attempt to provide a
consistent set of descriptions for use in (or
as) metadata
IS 257 – Fall 2007
2007.04.04 - SLIDE 4
Hierarchical Classification
• Each category is successively broken
down into smaller and smaller subdivisions
• No item occurs in more than one
subdivision
• Each level divided out by a “character of
division” (also known as a feature)
– Example:
• Distinguish “Literature” based on:
– Language
– Genre
– Time Period
Slide author: Marti Hearst
IS 257 – Fall 2007
2007.04.04 - SLIDE 5
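A minimal Python sketch of successive division by a character of division. The sample works are hypothetical (titles and attribute values are made up for illustration); the function name divide is likewise an assumption. Each pass splits the items on one feature, so every work ends up in exactly one leaf.

```python
from collections import defaultdict

# Hypothetical sample works; titles and values are illustrative only.
works = [
    {"title": "Work A", "language": "English", "genre": "Prose", "period": "18th"},
    {"title": "Work B", "language": "English", "genre": "Poetry", "period": "17th"},
    {"title": "Work C", "language": "French", "genre": "Poetry", "period": "19th"},
    {"title": "Work D", "language": "English", "genre": "Poetry", "period": "17th"},
]


def divide(items, characters):
    """Recursively split items by each character of division in turn."""
    if not characters:
        # No features left: this is a leaf, so list the titles it holds.
        return [item["title"] for item in items]
    first, rest = characters[0], characters[1:]
    groups = defaultdict(list)
    for item in items:
        groups[item[first]].append(item)
    return {value: divide(group, rest) for value, group in groups.items()}


# Language first, then genre, then period, as in the Literature example.
tree = divide(works, ["language", "genre", "period"])
print(tree["English"]["Poetry"]["17th"])
```

Passing the features in a different order produces a different tree over the same works, which is one way to see how a strict hierarchy commits to a single order of division.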
Hierarchical Classification
[Tree diagram: Literature divides by language (English, French, Spanish, ...); each language divides by genre (Prose, Poetry, Drama, ...); each genre divides by century (16th, 17th, 18th, 19th).]
Slide author: Marti Hearst
IS 257 – Fall 2007
2007.04.04 - SLIDE 6
Labeled Categories for Hierarchical
Classification
• LITERATURE
– 100 English Literature
• 110 English Prose
– 111 English Prose 16th Century
– 112 English Prose 17th Century
– 113 English Prose 18th Century
– ...
• 120 English Poetry
– 121 English Poetry 16th Century
– 122 English Poetry 17th Century
– ...
• 130 English Drama
– 131 English Drama 16th Century
– …
– 200 French Literature
Slide author: Marti Hearst
IS 257 – Fall 2007
2007.04.04 - SLIDE 7
Facetted Categories
• Mutually exclusive
– Non-overlapping, distinct categories
• Relational
– Relations between facets, subfacets, and foci
(elements) are not restricted to hierarchical
generalization-specialization relations
• Composable
– Combined using grammars of order and
relation to form compound descriptions
IS 257 – Fall 2007
2007.04.04 - SLIDE 8
Facetted Classification Along With Labeled Categories
• A Language
– a English
– b French
– c Spanish
• B Genre
– a Prose
– b Poetry
– c Drama
• C Period
– a 16th Century
– b 17th Century
– c 18th Century
– d 19th Century
• Aa English Literature
• AaBa English Prose
• AaBaCa English Prose 16th Century
• AbBbCd French Poetry 19th Century
• BcCd Drama 19th Century
Slide author: Marti Hearst
IS 257 – Fall 2007
2007.04.04 - SLIDE 9
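The labeled facets above can be read as a small notation: an uppercase letter names the facet and a lowercase letter names the focus. The following Python sketch encodes and decodes such labels. The facet letters and foci are taken from the slide; the dictionary layout, the function names decode and encode, and the fixed A, B, C combination order are assumptions made for illustration.

```python
import re

# Facets and foci as listed on the slide; the data layout is an assumption.
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}


def decode(notation):
    """Split a compound label such as 'AaBaCa' into (facet, focus) pairs."""
    pairs = re.findall(r"([A-Z])([a-z])", notation)
    if "".join(f + c for f, c in pairs) != notation:
        raise ValueError(f"malformed notation: {notation!r}")
    return [(FACETS[f][0], FACETS[f][1][c]) for f, c in pairs]


def encode(**foci):
    """Build a label from keyword arguments, e.g. encode(language='French')."""
    parts = []
    for letter, (name, table) in FACETS.items():  # fixed order A, B, C
        wanted = foci.get(name.lower())
        if wanted is not None:
            # Reverse lookup: find the code letter for the requested focus.
            code = next(c for c, term in table.items() if term == wanted)
            parts.append(letter + code)
    return "".join(parts)


print(decode("AaBaCa"))
print(encode(language="French", genre="Poetry", period="19th Century"))
```

Because any facet may be omitted, the same three tables cover partial descriptions (a language alone, or a genre plus a period) without adding entries to the vocabulary.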
Ranganathan
• PMEST Facets
– P(ersonality)
• WHO: The most important types or names of things for the
particular discipline
– M(atter)
• WHAT: Constituent materials
– E(nergy)
• HOW: Action or activity terms
– S(pace)
• WHERE: Where things occur
– T(ime)
• WHEN: When things occur
IS 257 – Fall 2007
2007.04.04 - SLIDE 10
“Classical” CRG/BC2 Facet Analysis
• Entity
• Kind
• Part
• Property
• Material
• Process
• Operation
• Patient
• Product
• By-Product
• Agent
• Space
• Time
IS 257 – Fall 2007
2007.04.04 - SLIDE 11
“Classical” Facet Analysis
• What is being done?
– Entity
– Kind
– Product
– By-Product
• What are its parts?
– Part
• How is this achieved?
– Process
• By what means?
– Operation
• By whom?
– Agent
– Patient
• What are its properties?
– Property
– Material
• Where?
– Space
• When?
– Time
IS 257 – Fall 2007
2007.04.04 - SLIDE 12
“Classical” Facet Analysis
• Nouns
– Entity
– Kind
– Part
– Patient
– Product
– By-Product
– Agent
• Intransitive Verbs
– Process
• Transitive Verbs
– Operation
• Adverbs
– Space
– Time
• Adjectives
– Property
– Material
IS 257 – Fall 2007
2007.04.04 - SLIDE 13
Semantic and Syntactic Relationships
• Semantic relationships
– Is-A (thing/kind, genus/species)
• Mammals
– Primates
» Humans
– Has-Parts
• Human
– Head
» Eyes
• Syntactic relationships
– Compounds
• Wheat + harvesting = “wheat harvesting”
• Object + operation = operation on object
IS 257 – Fall 2007
2007.04.04 - SLIDE 14
Facetted Classification
• Clearly distinguishes between semantic
relationships and syntactic relationships
– Semantic relationships
• Within a facet
• Containment relations
– Syntactic relationships
• Across facets
• Combinatoric relations
• Have a “syntax” for syntactic combination
of semantic terms
IS 257 – Fall 2007
2007.04.04 - SLIDE 15
Power of Facet Combinations
• The syntactic relations of facetted
classifications enable a small controlled
vocabulary to produce
– Many, many structured descriptions
– Complex, but formally structured descriptions
using nested compound descriptions
– Descriptions for things we do not have words
for
IS 257 – Fall 2007
2007.04.04 - SLIDE 16
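A rough Python illustration of this combinatoric power, reusing the Language, Genre, and Period foci from the labeled-categories example earlier in the lecture. Treating every facet as optional and joining the chosen terms in a fixed order are assumptions of this sketch, not rules stated on the slides.

```python
from itertools import product

facets = {
    "Language": ["English", "French", "Spanish"],
    "Genre": ["Prose", "Poetry", "Drama"],
    "Period": ["16th Century", "17th Century", "18th Century", "19th Century"],
}

# Size of the controlled vocabulary: total number of foci across facets.
vocabulary_size = sum(len(foci) for foci in facets.values())

# Each facet is either left out (None) or filled with exactly one focus.
options = [[None] + foci for foci in facets.values()]
descriptions = [
    " ".join(term for term in combo if term is not None)
    for combo in product(*options)
    if any(combo)  # skip the empty description
]

print(vocabulary_size, len(descriptions))
```

With these numbers, 10 terms yield (3 + 1) × (3 + 1) × (4 + 1) − 1 = 79 distinct descriptions, and the count grows multiplicatively as facets are added.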
Example: Objects
Red Plastic Glass
Blue Paper Straw
IS 257 – Fall 2007
2007.04.04 - SLIDE 17
IS202 Project Team Facetted
Classifications (2004)
• 007
– Personality
• Straw
• Glass
– Operation
• Drinking
• Slurping
• Sipping
– Material
• Plastic
• Paper
– Color
• Blue
• Red
• ARTery
– Color
– Size
– Material
– Weight
– Shape
– Radius/Circumference
– Density
– Volume/Capacity
– Function/Use
– Hardness/Softness
– Yin/Yang
IS 257 – Fall 2007
2007.04.04 - SLIDE 18
IS202 Project Team Facetted
Classifications (2004)
• Culture Feed
– Color
• Red
• Blue
– Material
• Plastic
• Paper
– Use
• Drink from
• Drink with
– Dimensions
• Circumference
• Height
• Diameter
• Picture Portal
– Color
• Red
• Blue
– Material
• Paper
• Plastic
– Use
• Containment
• Transport
– Shape
• Torus
• Planar
– # Holes
• 0
• 1
IS 257 – Fall 2007
2007.04.04 - SLIDE 19
IS202 Project Team Facetted
Classifications (2004)
• F.U.N.
– Shape
– Color
– Material
• Rigidity
– Function
• Container
• Conduit
• MNM
– Functionality
• What it does
• What you can do with it
– Physical Properties
• Color
• Shape
• Material
– Locale
– Weight
– Size
IS 257 – Fall 2007
2007.04.04 - SLIDE 20
IS202 Project Team Facetted
Classifications (2004)
• pillBox
– Function
• Container
• Conduit
– Form
• Shape
– Cylinder
• Composition
– Paper
– Plastic
• Color
– Blue
– Red
• Size
– Tall and skinny
– Short and fat
• Team iTour
– Color
• Red
• Blue
– State
• Solid
• Non-porous
• Flexible
– Material
• Plastic
• Paper
– Geometry
• Cylindrical
• Hollow
– Function
• Container
• Drinking
• Sucking
• Blowing
IS 257 – Fall 2007
2007.04.04 - SLIDE 21
Example: Objects
Gray Metal Glass
Two Yellow Plastic Straws
IS 257 – Fall 2007
2007.04.04 - SLIDE 22
Example: Objects
• Function
• Form
– Shape
– Material
– Color
– Number
• Function: Drinking
• Form
– Shape: Cylinder
– Material: Plastic
– Color: Red
– Number: 1
IS 257 – Fall 2007
2007.04.04 - SLIDE 23
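One way to see what facet descriptions buy a searcher is to store each object as a set of facet values and filter on any combination of them. The sketch below uses the four example objects named in the lecture; the color, material, and number values are read off their names, while the record layout and the helper names select and facet_counts are assumptions.

```python
# Facet values read off the object names used in the lecture examples.
objects = [
    {"label": "Red Plastic Glass", "color": "Red", "material": "Plastic", "number": 1},
    {"label": "Blue Paper Straw", "color": "Blue", "material": "Paper", "number": 1},
    {"label": "Gray Metal Glass", "color": "Gray", "material": "Metal", "number": 1},
    {"label": "Two Yellow Plastic Straws", "color": "Yellow", "material": "Plastic", "number": 2},
]


def select(items, **criteria):
    """Return the items whose facet values match every given criterion."""
    return [o for o in items if all(o.get(f) == v for f, v in criteria.items())]


def facet_counts(items, facet):
    """Count how many items fall under each focus of one facet."""
    counts = {}
    for o in items:
        counts[o[facet]] = counts.get(o[facet], 0) + 1
    return counts


plastic = [o["label"] for o in select(objects, material="Plastic")]
print(plastic)
print(facet_counts(objects, "material"))
```

Criteria can be combined freely, for example select(objects, material="Plastic", number=2), because each facet is recorded independently of the others.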
Agenda
• Facetted Classification
– Traditional vs. Facetted Classification
– Designing Facetted Classifications
– Thesaurus Design
IS 257 – Fall 2007
2007.04.04 - SLIDE 24
Facetted Classification Design
• Collect examples that need to be classified
• Identify candidates for facets and subfacets
– Test classification scheme on examples for facet orthogonality
• Order foci within facets
• Explicate grammar for ordering and combining facets
and subfacets
– Test classification scheme on examples for combinatoric power
• Extend foci for comprehensiveness where applicable
• Create new facets and subfacets where needed
– Test classification scheme on new examples, especially
boundary cases
• Iterate and refine throughout
IS 257 – Fall 2007
2007.04.04 - SLIDE 25
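The orthogonality test in the design steps above can be partly mechanized. The sketch below flags any focus that appears under more than one facet, which is only a rough proxy for orthogonality (it cannot catch facets that overlap in meaning but use different words). The draft scheme, including its deliberate mistake, and the function name overlapping_foci are hypothetical.

```python
from collections import defaultdict


def overlapping_foci(scheme):
    """Return the foci that appear under more than one facet."""
    seen = defaultdict(set)
    for facet, foci in scheme.items():
        for focus in foci:
            seen[focus].add(facet)
    return {focus: sorted(fs) for focus, fs in seen.items() if len(fs) > 1}


# Hypothetical draft scheme with one deliberate mistake:
# "Paper" is a material but has also been filed under Form.
draft = {
    "Function": ["Container", "Conduit"],
    "Material": ["Plastic", "Paper"],
    "Form": ["Cylinder", "Paper"],
}

problems = overlapping_foci(draft)
print(problems)
```

An empty result does not prove the facets are orthogonal; it only means no term is literally shared, so testing on real examples is still needed.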
Facet Guidelines
• Terms on the same level in the ontology should
be of the same level and type
– Sports (levels mixed)
• Team Sports
– Baseball
• Football
• Basketball
• Solo Sports
• Marathon Running
– Sports (levels consistent)
• Team Sports
– Baseball
– Football
– Basketball
• Solo Sports
– Marathon Running
• Facets, subfacets, and foci should have a
discernible order
• Use of capitalization and singular/plural forms
should be uniform
IS 257 – Fall 2007
2007.04.04 - SLIDE 26
Ordering Foci (“Array”)
• Simple to complex
– (Locomotions: walk, run, jump, skip, hurdle, cartwheel)
• Common/popular to uncommon/unpopular
– (Vegetarian Pizza Toppings: mushroom, onion, olive, artichoke,
pineapple, pine nuts)
• Spatial, geographical, or geometric
– (Southwestern States: California, Nevada, Arizona, New Mexico)
• Chronological, historical, or evolutionary
– (Dinosaur Periods: Triassic, Jurassic, Cretaceous)
• Canonical (pre-established order)
– (Playground Counting: Eeny, Meeny, Miny, Moe)
• Alphabetical
– (Boys’ Names: Al, Bob, Chuck, David, Ed, Frank, George, Harry)
• Size
– (T-Shirts: Small, Medium, Large, XL, XXL)
IS 257 – Fall 2007
2007.04.04 - SLIDE 27
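In software, an array order other than alphabetical usually has to be stored explicitly, because a plain sort only knows character order. A small Python sketch using the T-shirt sizes from the slide; the names SIZE_ARRAY and order_foci are assumptions.

```python
# The intended array for the Size facet, taken from the slide.
SIZE_ARRAY = ["Small", "Medium", "Large", "XL", "XXL"]


def order_foci(foci, array):
    """Sort foci by their position in an explicitly declared array."""
    rank = {focus: position for position, focus in enumerate(array)}
    return sorted(foci, key=rank.__getitem__)


unordered = ["XL", "Small", "XXL", "Large", "Medium"]

alphabetical = sorted(unordered)
by_size = order_foci(unordered, SIZE_ARRAY)
print(alphabetical)
print(by_size)
```

The alphabetical result puts Large before Small, which is why a facet whose foci follow a size, chronological, or canonical order needs that order declared alongside the terms.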
Agenda
• Facetted Classification
– Traditional vs. Facetted Classification
– Designing Facetted Classifications
– Thesaurus Design (intro)
IS 257 – Fall 2007
2007.04.04 - SLIDE 28
Types of Indexing Languages
• Uncontrolled keyword indexing
• Indexing languages
– Controlled, but not structured
• Thesauri
– Controlled and structured
• Classification systems
– Controlled, structured, and coded
• Facetted classification systems
IS 257 – Fall 2007
2007.04.04 - SLIDE 29
Thesauri
• A Thesaurus is a collection of selected
vocabulary (preferred terms or descriptors)
with links among synonymous, equivalent,
broader, narrower and other related terms
IS 257 – Fall 2007
2007.04.04 - SLIDE 30
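A minimal sketch of how such links might be held in memory, assuming the conventional BT (broader term), NT (narrower term), and RT (related term) relationship tags. The class and method names are hypothetical; the sample terms reuse the Sports example from the facet guidelines.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    name: str
    broader: set = field(default_factory=set)   # BT links
    narrower: set = field(default_factory=set)  # NT links
    related: set = field(default_factory=set)   # RT links


class Thesaurus:
    def __init__(self):
        self.terms = {}

    def term(self, name):
        """Fetch a term record, creating it on first use."""
        return self.terms.setdefault(name, Term(name))

    def add_broader(self, narrow, broad):
        """Record BT and NT together so the pair stays reciprocal."""
        self.term(narrow).broader.add(broad)
        self.term(broad).narrower.add(narrow)

    def add_related(self, a, b):
        """Record a symmetric RT link."""
        self.term(a).related.add(b)
        self.term(b).related.add(a)

    def ancestors(self, name):
        """Collect every broader term reachable from name."""
        found, stack = set(), [name]
        while stack:
            for broad in self.term(stack.pop()).broader:
                if broad not in found:
                    found.add(broad)
                    stack.append(broad)
        return found


t = Thesaurus()
t.add_broader("Baseball", "Team Sports")
t.add_broader("Team Sports", "Sports")
t.add_related("Baseball", "Basketball")
print(t.ancestors("Baseball"))
```

Writing both directions of each link in one call is a design choice that keeps the reciprocal references from drifting apart as the vocabulary is edited.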
Thesaurus Standards
• National and International Standards for
Thesauri
– ANSI/NISO z39.19-1994 — American National
Standard Guidelines for the Construction, Format and
Management of Monolingual Thesauri
– ANSI/NISO Draft Standard Z39.4-199x — American
National Standard Guidelines for Indexes in
Information Retrieval
– ISO 2788 — Documentation — Guidelines for the
establishment and development of monolingual
thesauri
– ISO 5964 — Documentation — Guidelines for the
establishment and development of multilingual
thesauri
IS 257 – Fall 2007
2007.04.04 - SLIDE 31
Thesaurus Examples
• Examples
– Non-Facetted
• The ERIC Thesaurus of Descriptors
– Semi-Facetted
• The Medical Subject Headings (MESH) of the
National Library of Medicine
– Facetted
• The Art and Architecture Thesaurus
IS 257 – Fall 2007
2007.04.04 - SLIDE 32
ERIC Thesaurus – Entry
IS 257 – Fall 2007
2007.04.04 - SLIDE 33
ERIC Thesaurus – Alphabetic
IS 257 – Fall 2007
2007.04.04 - SLIDE 34
ERIC Thesaurus – KWIC Index
IS 257 – Fall 2007
2007.04.04 - SLIDE 35
ERIC Thesaurus – Hierarchies
IS 257 – Fall 2007
2007.04.04 - SLIDE 36
ERIC Thesaurus – Groups
IS 257 – Fall 2007
2007.04.04 - SLIDE 37
ERIC Thesaurus – Online
http://www.ericfacility.net/extra/pub/thessearch.cfm
IS 257 – Fall 2007
2007.04.04 - SLIDE 38
MESH – Entry
IS 257 – Fall 2007
2007.04.04 - SLIDE 39
MESH – Alphabetic
IS 257 – Fall 2007
2007.04.04 - SLIDE 40
MESH – Tree Structures
IS 257 – Fall 2007
2007.04.04 - SLIDE 41
MESH – KWOC Index
IS 257 – Fall 2007
2007.04.04 - SLIDE 42
MESH - Online
http://www.nlm.nih.gov/mesh/meshhome.html
IS 257 – Fall 2007
2007.04.04 - SLIDE 43
AAT – Facets
IS 257 – Fall 2007
2007.04.04 - SLIDE 44
AAT – Hierarchies (print)
IS 257 – Fall 2007
2007.04.04 - SLIDE 45
AAT – Hierarchies (online)
http://www.getty.edu/research/tools/vocabulary/aat/
IS 257 – Fall 2007
2007.04.04 - SLIDE 46
AAT – Entry (online)
IS 257 – Fall 2007
2007.04.04 - SLIDE 47
Lecture Overview
• Thesaurus Design and Development
– Controlled Vocabularies for topical description
– Thesaurus Design
– Steps In Thesaurus Development (intro)
IS 257 – Fall 2007
2007.04.04 - SLIDE 48
Why Develop a Thesaurus?
• To provide a conceptual structure or
“space” for a body of information
– To make it possible to adequately describe
the topical content of information resources at
an appropriate level of generality or specificity
– To provide enhanced search capabilities and
to improve the effectiveness of searching (i.e.,
to retrieve most of the relevant material
without too much irrelevant material)
IS 257 – Fall 2007
2007.04.04 - SLIDE 49
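The parenthetical above corresponds to two standard retrieval measures, recall (share of the relevant material that was retrieved) and precision (share of the retrieved material that is relevant). The slide does not name them; the sketch below, with hypothetical document identifiers, shows how they are usually computed.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document identifiers for one search.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved documents are relevant (precision 0.5) and two of the three relevant documents were found (recall of about 0.67).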
Why Develop a Thesaurus?
• To provide vocabulary (or terminological)
control
– When there are several possible terms
designating a single concept, the thesaurus
should lead the indexer or searcher to the
appropriate concept, regardless of the terms
they start with
IS 257 – Fall 2007
2007.04.04 - SLIDE 50
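Vocabulary control of this kind is often implemented as an entry vocabulary: a table of lead-in terms that each point to one preferred term. A minimal Python sketch follows; every term in it is hypothetical and chosen only for illustration, as are the names PREFERRED, USE, and resolve.

```python
# Hypothetical vocabulary; none of these terms come from the lecture.
PREFERRED = {"Automobiles", "Bicycles"}

# Lead-in (non-preferred) terms, each pointing to one preferred term.
USE = {
    "cars": "Automobiles",
    "autos": "Automobiles",
    "motorcars": "Automobiles",
    "bikes": "Bicycles",
}

# Case-insensitive index of the preferred terms themselves.
_BY_LOWER = {p.lower(): p for p in PREFERRED}


def resolve(term):
    """Map any known starting term to its preferred term, else None."""
    key = term.strip().lower()
    return _BY_LOWER.get(key) or USE.get(key)


print(resolve("Cars"), resolve("automobiles"), resolve("trains"))
```

Indexer and searcher both pass their starting term through the same lookup, so they arrive at the same descriptor regardless of which synonym they began with.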
Preliminary Considerations
• What is used now?
– Continue using an existing thesaurus?
– Ad hoc modification of existing thesaurus?
– Develop a new well-structured thesaurus?
• What is the scope and complexity of the
subject field?
• What kind of retrieval objects or data will
be dealt with?
• How exhaustive and specific is the desired
description of objects?
IS 257 – Fall 2007
2007.04.04 - SLIDE 51
Preliminary Considerations
• The scope and complexity of the field will
provide some indication of the scope and
complexity of the thesaurus
– It is better to plan for a larger and more
comprehensive system than a smaller system
that rapidly will become inadequate as the
database grows
• Development of a good thesaurus requires
a major intellectual effort as well as clerical
operations like data entry and production
of sorted lists
IS 257 – Fall 2007
2007.04.04 - SLIDE 52
Development of a Thesaurus
• Term selection
• Merging and development of concept
classes
• Definition of broad subject fields and
subfields
• Development of classificatory structure
• Review, testing, application, revision
IS 257 – Fall 2007
2007.04.04 - SLIDE 53
Flow of Work in Thesaurus Construction
[Flowchart; boxes listed in approximate order]
• Select Sources
• Select Terms
• Record Selected Terms
• Sort Terms
• Merge identical Terms
• Merge Terms in Same Concept class
• Define Broad Subject Fields
• Sort Terms into Broad Subject Fields
• Define Subfields within one Subject Field
• Work out detailed structure of the Subject Field
• Select Preferred Terms
• All Subfields of Broad Subject finished? (Yes/No)
• All Broad Subjects finished? (Yes/No)
• Improve Class Structure
• Assign codes
• Print Classified Index and review
• Discuss with Experts and Users
• Select descriptors and checklist items
• Many Modifications? (Yes: Revise as needed; No: continue)
• Assign Notation
• Produce Full Thesaurus and Check references
• Review and Test
Based on Soergel, pp 327-333
IS 257 – Fall 2007
2007.04.04 - SLIDE 54