Developing Digital Libraries:


DL:Lesson 5
Classification Schemas
Luca Dini
Overview
The Dublin Core defines a number of metadata elements,
but what about the values for those elements?
Should they be unrestricted text values or come from
pre-defined vocabularies?
Answer: "it depends".
We will discuss how to determine the appropriate
approach for an organization's situation.
We will also cover how pre-defined vocabularies should
be sourced, structured, and maintained.
Vocabulary development and maintenance
Vocabulary development and maintenance is the LEAST of three problems:
– The Vocabulary Problem: How are we going to build and maintain the lists of pre-defined values that can go into some of the metadata elements?
– The Tagging Problem: How are we going to populate metadata elements with complete and consistent values?
    – What can we expect to get from automatic classifiers? What kind of error detection and error correction procedures do we need?
– The ROI Problem: How are we going to use content, metadata, and vocabularies in applications to obtain business benefits?
    – More sales? Lower support costs? Greater productivity?
    – How much content? How big an operating budget?
Need to know the answer to the ROI Problem before solving the Vocabulary Problem.
Definitions
Term: Definition
Metadata Element: A ‘field’ for storing information about one piece of content. Examples: Title, Creator, Subject, Date, …
Metadata Value: The ‘contents’ of one Metadata Element. Values may be text strings, or selections from a predefined vocabulary.
Metadata Schema: A defined set of metadata elements. The Dublin Core is one schema.
Free Text Value: An unconstrained text metadata value. Some text values are constrained to follow a format (e.g. YYYY-MM-DD).
Vocabulary: A list of predefined values for a metadata element.
Controlled: A vocabulary with a defined and enforced
Controlled vocabularies
Hierarchical classification of things into a tree structure
Linnaeus …
Kingdom: Animalia
  Phylum: Chordata
    Class: Mammalia
      Order: Carnivora
        Family: Canidae
          Genus: Canis
            Species: C. familiaris
UNSPSC …
Segment: 44-Office Equipment and Accessories and Supplies
  Family: .12-Office Supplies
    Class: .17-Writing Instruments
      Commodity: .05-Mechanical pencils
      Commodity: .06-Wooden pencils
      Commodity: .07-Colored pencils
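The UNSPSC example can be read as a path through a tree: each two-digit group selects one level. A minimal Python sketch, assuming the four slide levels are concatenated into an 8-digit code (44 + 12 + 17 + 05); the helper names `split_unspsc` and `ancestors` are hypothetical and only illustrate the idea:

```python
LEVELS = ("Segment", "Family", "Class", "Commodity")

def split_unspsc(code: str) -> dict:
    """Split an 8-digit code into its four two-digit levels."""
    if len(code) != 8 or not code.isdigit():
        raise ValueError(f"expected 8 digits, got {code!r}")
    parts = [code[i:i + 2] for i in range(0, 8, 2)]
    return dict(zip(LEVELS, parts))

def ancestors(code: str) -> list:
    """Return the code prefixes from the broadest level to the most specific."""
    return [code[:n] for n in (2, 4, 6, 8)]

# Mechanical pencils from the slide: 44 / .12 / .17 / .05
print(split_unspsc("44121705"))
print(ancestors("44121705"))
```

Because every prefix is itself a node, a search for "4412" can retrieve everything classified under Office Supplies without listing the commodities one by one.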
Types of vocabularies
Vocabulary Type | Cplxty. | Description | Relation Type
Term List | 1 | Simple list of terms with no internal structure or relations. | None
Synonym Rings | 2 | List of sets of terms to regard as equivalent. Widely supported in search software. | Equivalence
Authority Files | 3 | List of names for known entities – people, organizations, books, etc. | Reference
Classification Schemes | 4 | Hierarchical arrangement of concepts. | Loose Hierarchy
Thesauri | 5 | Hierarchical arrangement of concepts plus supporting information and additional, non-hierarchical, relations. | “Is-a” Hierarchy plus Loose Relations
Ontologies | 6 | Arrangement of concepts and relations based on a model of underlying reality – e.g. organs, symptoms, diseases & treatments in medicine. | Model-based Typed Relations
Vocabulary Control

The degree of control over a vocabulary is (mostly) independent of its type.
– Uncontrolled – Anybody can add anything at any time and no effort is made to keep things consistent. Multiple lists and variations will abound.
– Managed – Software makes sure there is a list that is consistent (no duplicates, no orphan nodes) at any one time. Almost anybody can add anything, subject to consistency rules. (e.g. File System Hierarchy)
– Controlled – A documented process is followed for the update of the vocabulary. Few people have authority to change the list. Software may help, but emphasis is on human processes and custodianship. (e.g. Employee list)
Term lists, synonym lists, … can be controlled, managed, or uncontrolled.
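The consistency rules of a "managed" vocabulary (no duplicates, no orphan nodes) are the kind of thing software can enforce. A minimal sketch, assuming the vocabulary is held as (term, parent) pairs and that duplicates are detected case-insensitively; the function name `check_vocabulary` and the data layout are illustrative assumptions, not part of the lesson:

```python
def check_vocabulary(terms):
    """terms: list of (term, parent) pairs; parent is None for a root.

    Returns a list of human-readable problems (empty if consistent).
    """
    problems = []
    seen = set()
    # Rule 1: no duplicate terms (compared case-insensitively).
    for term, _ in terms:
        key = term.strip().lower()
        if key in seen:
            problems.append(f"duplicate term: {term}")
        seen.add(key)
    # Rule 2: no orphan nodes (every parent must itself be a term).
    for term, parent in terms:
        if parent is not None and parent.strip().lower() not in seen:
            problems.append(f"orphan node: {term} (missing parent {parent})")
    return problems

print(check_vocabulary([("Animals", None), ("Dogs", "Animals")]))
print(check_vocabulary([("Animals", None), ("animals", None), ("Cats", "Pets")]))
```

A controlled vocabulary would add a human approval step on top of checks like these; the software alone only makes the list managed.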
Type of controls
Controlled vocabularies are frequently mentioned
– That does not mean they are always necessary
– Control comes at a cost, but can provide significant data quality benefits by reducing variations.
Is this a well-controlled vocabulary?
– No! It is an uncontrolled, but well-managed, term list
Is this part of an appropriate solution to the ROI problem?
– Yes! There is no budget to do ongoing control and QA
Source: http://del.icio.us/tag/
Likelihood of controlled values
(Virtually) Mandatory: Language (RFC 3066), Format (IMT)
Highly Likely: Coverage (ISO 3166), Type (DCMI Type?), Subject (Custom)
Maybe: Creator (LDAP?), Publisher (Custom), Contributor (LDAP?), Identifier (Custom)
Highly Unlikely to (Virtually) Impossible: Date (W3C DTF), Rights, Title, Relation, Source, Description
Mandatory
DC recommends specific best practices:
– Language: RFC 3066 (which works with ISO 639)
– Format: Internet Media Types (aka MIME)
These vocabularies are widely used throughout the Internet. If you want to do something else, it should be justified.
– Describing physical objects?
    – Use Extent and Medium refinements instead of Format.
– Regional (vs. National) dialects?
    – a) Why?
    – b) Consider a custom element in addition to standard Language
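Values for Language and Format can be checked mechanically before they are stored. A minimal sketch, assuming only the surface grammar is tested: the language pattern follows the RFC 3066 shape (a primary subtag of 1 to 8 letters, then optional subtags of 1 to 8 letters or digits), and the media-type pattern is a simplified "type/subtype" check, not the full grammar. Neither function consults the actual registries, so a well-formed but unregistered value would still pass:

```python
import re

# RFC 3066 shape: primary subtag of 1-8 letters, then optional
# "-"-separated subtags of 1-8 letters or digits.
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*")

# Simplified "type/subtype" shape (an assumption, not the full grammar).
MEDIA_TYPE = re.compile(r"[a-z]+/[A-Za-z0-9][A-Za-z0-9.+-]*")

def is_language_tag(value: str) -> bool:
    """True if value has the shape of an RFC 3066 language tag."""
    return LANGUAGE_TAG.fullmatch(value) is not None

def is_media_type(value: str) -> bool:
    """True if value looks like an Internet Media Type such as text/html."""
    return MEDIA_TYPE.fullmatch(value) is not None

for candidate in ("en", "en-GB", "en_GB"):
    print(candidate, is_language_tag(candidate))
for candidate in ("text/html", "HTML page"):
    print(candidate, is_media_type(candidate))
```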
Likely
DC recommends specific best practices:
– Coverage: ISO 3166
    – ISO 3166 should be used unless you have good reasons to use something else
    – Consider Getty Thesaurus of Geographic Names if you need cities, rivers, etc. (http://www.getty.edu/research/conducting_research/vocabularies/tgn/)
    – DC provides Encodings for both
– Type: DCMITypes (http://dublincore.org/documents/dcmi-type-vocabulary/)
    – DCMIType list is not necessarily a best practice
    – No widely accepted type list exists, so a custom list is likely
May be

Creator, Contributor could come from an “authority file”
– LC NAF in library contexts
– LDAP Directory in corporate contexts
    – Recommended where possible
    – Many exceptions where author is outside LDAP
Publisher could come from an authority file
– Org chart in corporate contexts – e.g. internal records management system.
Identifier should be a URI
– Organization may manage these, but it's typically a text field, not a controlled list.
Subject and extensions

Best practice: Use pre-defined subject schemes, not user-selected keywords.
– DC Encodings (DDC, LCC, LCSH, MESH, UDC) most useful in library contexts.
– Not useful for most corporate needs
Recommended: Factor “Subject” into separate facets.
– People, Places, Organizations, Events, Objects, Products & Services, Industry sectors, Content types, Audiences, Business Functions, Competencies, …
Store the different facets in different fields
– Use DC elements where appropriate (coverage, type, audience, …)
– Extend with custom elements for other fields (industry, products, …)
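Storing each facet in its own field can be pictured as a record with one key per facet. A minimal sketch, assuming a plain dictionary as the storage format; the `x:` prefix for custom elements, the sample values, and the helper `facet_values` are all hypothetical:

```python
# One record: DC elements where they fit, custom "x:" elements otherwise.
record = {
    "dc:title": "Quarterly sales report",
    "dc:type": "Report",
    "dc:coverage": "ES",  # ISO 3166 country code
    "x:industry": ["Retail"],
    "x:products": ["Wooden pencils", "Colored pencils"],
}

def facet_values(rec: dict, field: str) -> list:
    """Return the values of one facet field as a list (empty if absent)."""
    value = rec.get(field, [])
    return value if isinstance(value, list) else [value]

print(facet_values(record, "dc:coverage"))
print(facet_values(record, "x:products"))
```

Keeping the facets apart means each field can be validated against its own vocabulary and exposed as its own filter, instead of parsing one overloaded subject string.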
Thesauri

A Thesaurus is a collection of selected
vocabulary (preferred terms or descriptors) with
links among synonymous, equivalent, broader,
narrower and other related terms
Standards

National and International Standards for Thesauri
– ANSI/NISO Z39.19-1994 – American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri
– ANSI/NISO Draft Standard Z39.4-199x – American National Standard Guidelines for Indexes in Information Retrieval
– ISO 2788 – Documentation – Guidelines for the establishment and development of monolingual thesauri
– ISO 5964 – Documentation – Guidelines for the establishment and development of multilingual thesauri
Thesaurus Examples

Examples
– The ERIC Thesaurus of Descriptors
– The Medical Subject Headings (MeSH) of the National Library of Medicine
– The Art and Architecture Thesaurus
ERIC Thesaurus – Entry
ERIC Thesaurus – Online
http://www.eric.ed.gov/ERICWebPortal/Home.portal?_nfpb=true&_pageLabel
MeSH
MeSH Online
http://www.nlm.nih.gov/mesh/meshhome.html
Dewey
– Dewey Decimal Classification System (DDC) first published in 1876 by Melvil Dewey
– Most widely used classification system in the world (used in 135 countries)
– In this country used primarily by public and school libraries
– Maintained by the Library of Congress
Dewey
– DDC is divided into ten main classes, then ten divisions, each division into ten sections
– The first digit in each three-digit number represents the main class.
    – “500” = natural sciences and mathematics.
– The second digit in each three-digit number indicates the division.
    – “500” is used for general works on the sciences
    – “510” for mathematics
    – “520” for astronomy
    – “530” for physics
Dewey
– The third digit in each three-digit number indicates the section.
    – “530” is used for general works on physics
    – “531” for classical mechanics
    – “532” for fluid mechanics
    – “533” for gas mechanics
– A decimal point follows the third digit in a class number, after which division by ten continues to the specific degree of classification needed.
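The digit-by-digit structure described above can be made explicit in a few lines. A minimal sketch, assuming a class number written as three digits with an optional decimal extension; the function name `dewey_levels` and the returned keys are illustrative:

```python
def dewey_levels(number: str) -> dict:
    """Break a DDC class number into main class, division, and section."""
    whole, _, fraction = number.partition(".")
    if len(whole) != 3 or not whole.isdigit():
        raise ValueError(f"expected three digits before the point: {number!r}")
    return {
        "main_class": whole[0] + "00",   # first digit
        "division": whole[:2] + "0",     # first two digits
        "section": whole,                # all three digits
        "subdivision": fraction,         # digits after the decimal point, if any
    }

# 532 is fluid mechanics: main class 500, division 530, section 532.
print(dewey_levels("532"))
```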
Library of Congress Subjects
– Essentially an artificial indexing language
– Based on literary warrant
– Entry vocabulary provided in the form of reference structure
– Moving slowly towards a real thesaurus structure (not there yet)
– Not faceted – subdivisions pre-selected, based on individual heading or “pattern” heading
LCSH
Digital libraries
– see from “Electronic libraries”
– see from “Virtual libraries”
– see broader term: “Libraries”
– see also “Information storage and retrieval systems”
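The reference structure of this heading is what lets a search on a variant term land on the preferred one. A minimal sketch, assuming each heading is stored as a dictionary and that "see from" references are treated as entry vocabulary pointing to the heading; the field names and `build_lookup` are hypothetical:

```python
ENTRY = {
    "heading": "Digital libraries",
    "used_for": ["Electronic libraries", "Virtual libraries"],  # "see from"
    "broader": ["Libraries"],
    "related": ["Information storage and retrieval systems"],   # "see also"
}

def build_lookup(entries: list) -> dict:
    """Map every heading and every variant (lower-cased) to the preferred heading."""
    lookup = {}
    for entry in entries:
        lookup[entry["heading"].lower()] = entry["heading"]
        for variant in entry["used_for"]:
            lookup[variant.lower()] = entry["heading"]
    return lookup

lookup = build_lookup([ENTRY])
print(lookup["virtual libraries"])
```

Broader and related terms are deliberately left out of the lookup: they suggest other headings to try, but they are not equivalent to this one.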
Library of Congress Classification
– 21 basic classes, based on single alphabetic character (K=law, N=art, etc.)
– Subdivided into two or three alpha characters (KF=American Law, ND=painting, etc.)
– Further subdivision by specific numeric assignment
– Author numbers and dates arrange works by a particular author together and in chronological order
LCC
153##$aQL638.E55$hZoology$hChordates. Vertebrates$hFishes$hSystematic divisions$hOsteichthys (Bony fishes). By family, A-Z$hFamilies$jEngraulidae (Anchovies)
– $a = Classification number--single number or beginning number of span (R)
– $h = Caption hierarchy
– $j = Caption (lowest level, relating to the specific number in $a)
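A record like the one above can be split into its subfields with plain string handling. A minimal sketch, assuming the "$" character only ever introduces a subfield code and never appears inside a value; `parse_subfields` is a hypothetical helper, not a MARC library call, and the sample string is a shortened version of the slide's record:

```python
def parse_subfields(field: str) -> list:
    """Split '153##$aQL638.E55$hZoology...' into (code, value) pairs.

    Everything before the first '$' (tag and indicators) is skipped.
    """
    pairs = []
    for chunk in field.split("$")[1:]:
        if chunk:
            pairs.append((chunk[0], chunk[1:].strip()))
    return pairs

sample = "153##$aQL638.E55$hZoology$hChordates. Vertebrates$hFishes$jEngraulidae (Anchovies)"
pairs = parse_subfields(sample)

number = [value for code, value in pairs if code == "a"][0]
hierarchy = [value for code, value in pairs if code == "h"]
print(number)
print(" > ".join(hierarchy))
```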
DMOZ: A worst case example of a unified ‘subject’
– DMOZ has over 600k categories
– Most are a combination of common facets – Geography, Organization, Person, Document Type, …
    – (e.g.) Top: Regional: Europe: Spain: Travel and Tourism: Travel Guides
www.dmoz.org
History of Faceted Navigation
– Relatively new; taxonomies go back to Aristotle
– S. R. Ranganathan – 1960’s
    – Issue of Compound Subjects
    – The Universe consists of PMEST
        – Personality, Matter, Energy, Space, Time
– Classification Research Group – 1950’s, 1970’s
    – Based on Ranganathan, simplified, less doctrinaire
    – Principles:
        – Division – a facet must represent only one characteristic
        – Mutual Exclusivity
– Classification Theory to Web Implementation
    – An Idea waiting for a technology
    – Multiple Filters / dimensions
What are Facets?
– Facets are not categories
    – Entities or concepts belong to a category
    – Entities have facets
– Facets are metadata - properties or attributes
    – Entities or concepts fit into one category
    – All entities have all facets – defined by set of values
– Facets are orthogonal – mutually exclusive – dimensions
    – An event is not a person is not a document is not a place.
    – A winery is not a region is not a price is not a color.
– Relations between facets, subfacets, and foci (elements) are not restricted to hierarchical generalization-specialization relations
– Combined using grammars of order and relation to form compound descriptions
Faceted Classification
Clearly distinguishes between semantic relationships and syntactic relationships
– Semantic relationships
    – Within a facet
    – Containment relations
– Syntactic relationships
    – Across facets
    – Combinatoric relations
    – Have a “syntax” for syntactic combination of semantic terms
Semantic and Syntactic Relationships
Semantic relationships
– Is-A (thing/kind, genus/species)
    – Mammals
        – Primates
            – Humans
– Has-Parts
    – Human
        – Head
            – Eyes
Syntactic relationships
– Compounds
    – Wheat + harvesting = “wheat harvesting”
    – Object + operation = operation on object
What is Faceted Navigation?
– Not a Yahoo-style Browse
    – Computer Stores under Computers and Internet
    – One value per facet per entity
– Faceted Navigation is not hierarchical
    – Tree – travel up and down, not across
    – Facets are filters, multidimensional
– Facets are applied at search results time – post-coordination, not pre-coordination [Advanced Search]
– Faceted Navigation is an active interface – dynamic combination of search and browse
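Post-coordination means the facet values are combined at query time, as filters over the result set, rather than baked into a single tree beforehand. A minimal sketch, assuming items are dictionaries with one value per facet; the wine data, `apply_filters`, and `facet_counts` are made up for illustration:

```python
from collections import Counter

WINES = [
    {"name": "Wine A", "color": "red", "region": "California", "price": "$25 and below"},
    {"name": "Wine B", "color": "red", "region": "Australia", "price": "$25-$50"},
    {"name": "Wine C", "color": "white", "region": "California", "price": "$25 and below"},
    {"name": "Wine D", "color": "white", "region": "Australia", "price": "$25 and below"},
]

def apply_filters(items: list, selected: dict) -> list:
    """Keep the items that match every selected facet value."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]

def facet_counts(items: list, facet: str) -> Counter:
    """Count how many remaining items carry each value of one facet."""
    return Counter(item[facet] for item in items)

reds = apply_filters(WINES, {"color": "red"})
print([wine["name"] for wine in reds])
print(facet_counts(reds, "region"))
```

The counts are recomputed after every click, which is what makes the interface a dynamic mix of search and browse: the user moves across dimensions rather than up and down one tree.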
When to Use Faceted Navigation
Advantages
Systematic Advantages:
– Need fewer Elements
    – 4 facets of 10 nodes = 10,000 node taxonomy
– Ability to Handle Compound Subjects
Content Management Advantages:
– Easier to “categorize” – not as conceptual
– Fewer = simple, can use auto-classification better
– Flexible – can add new facets, elements in facet
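The "4 facets of 10 nodes" claim is simple arithmetic: the nodes to maintain add up (4 × 10 = 40), while the distinct combinations multiply (10 × 10 × 10 × 10 = 10,000). A minimal sketch with placeholder facet and value names (all hypothetical):

```python
from itertools import product
from math import prod

# Four facets with ten values each; the names are placeholders.
facets = {
    f"facet_{i}": [f"value_{i}_{j}" for j in range(10)]
    for i in range(4)
}

nodes_to_maintain = sum(len(values) for values in facets.values())
combinations = prod(len(values) for values in facets.values())

print(nodes_to_maintain)   # additive cost of maintenance
print(combinations)        # multiplicative expressive power

# Enumerating the combinations confirms the product.
enumerated = sum(1 for _ in product(*facets.values()))
print(enumerated == combinations)
```

A single pre-coordinated taxonomy covering the same ground would need one leaf per combination, which is where the 10,000-node figure comes from.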
When to Use Faceted Navigation
Advantages: Implementation
– More intuitive – easy to guess what is behind each door
    – Simplicity of internal organization
    – 20 questions – we know and use
– Dynamic selection of categories
    – Allow multiple perspectives
– Trick Users into “using” Advanced Search
    – wine where color = red, price = x-y, etc.
    – Click on color red, click on price x-y, etc.
– Flexible – can be combined with other navigation elements
When to Use Faceted Navigation
Disadvantages
Systematic Disadvantages:
– Lack of Standards for Faceted Classifications
    – Every project is unique customization
Implementation Disadvantages:
– Loss of Browse Context
    – Difficult to grasp scope and relationships
– No immediate support for popular subjects
Essential Limit of Faceted Navigation
– Limited Domain Applicability – type and size
– Entities not concepts, documents, web sites
Developing Facet Structure: Selection of Facets: Theory
– Issue - Complete Model of a domain
– Ranganathan – PMEST
    – Personality – Person, animal, event
    – Matter – what x is made of
    – Energy – how x changes
    – Space – where x is
    – Time – when x happens
– Three Planes – Idea, Verbal, Notational
Facets: an example
A Language
– a English
– b French
– c Spanish
B Genre
– a Prose
– b Poetry
– c Drama
C Period
– a 16th Century
– b 17th Century
– c 18th Century
– d 19th Century
Compound notations:
– Aa English Literature
– AaBa English Prose
– AaBaCa English Prose 16th Century
– AbBbCd French Poetry 19th Century
– BcCd Drama 19th Century
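The notation in this example is compositional: each pair of letters names a facet (upper case) and a focus within it (lower case), and pairs are concatenated. A minimal sketch of a decoder, assuming the three facet tables exactly as listed on the slide; the `decode` function and its error handling are illustrative:

```python
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}

def decode(notation: str) -> str:
    """Translate a notation such as 'AbBbCd' into its verbal form."""
    if len(notation) % 2:
        raise ValueError(f"notation must be facet/focus pairs: {notation!r}")
    labels = []
    for i in range(0, len(notation), 2):
        facet, focus = notation[i], notation[i + 1]
        _name, foci = FACETS[facet]   # KeyError for an unknown facet
        labels.append(foci[focus])    # KeyError for an unknown focus
    return " ".join(labels)

print(decode("AaBaCa"))
print(decode("AbBbCd"))
```

Adding a new focus (say a fourth language) only touches one table; every existing notation stays valid, which is the extensibility argument for faceted schemes.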
Developing Facet Structure: Selection of Facets: Practice Wine.com
– Region
    – Australia, California
– Type
    – Red Wine, White, Bubbly
– Winery
    – Alphabetical listing
– Price
    – $25 and below
    – $25-$50
– Top Rated Wines
    – 90+ under $20
– Top Sellers
    – Cabernet Sauvignon
    – Pinot Noir
– Hot Features
    – Wine outlet
    – Sideways collection
Faceted Approach
– Power
    – 4 independent categories of 10 nodes = 10,000 nodes (10^4)
– Faster construction
    – Use existing taxonomies in specific fields
– Reduced maintenance cost
– More opportunity for data reuse
– Can be easier to navigate with appropriate UI
[Figure: 60 nodes, 24,000 combinations]
Organization
– Either expose them directly in the user interface (post-coordinating) or
– Combine them in a minimal hierarchy (pre-coordination) or
– Hide them from the user!
– Post-coordination takes software support, which may be fancy or basic.
– How many facets?
    – Log10(#documents) as a guide
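The Log10 rule of thumb is easy to apply to a collection size. A minimal sketch, assuming the result is rounded to the nearest whole number and never drops below one facet (both are assumptions added here, since the slide only gives the logarithm as a guide); `suggested_facet_count` is a hypothetical name:

```python
import math

def suggested_facet_count(n_documents: int) -> int:
    """Rule of thumb from the slide: about log10(#documents) facets."""
    if n_documents < 1:
        raise ValueError("need at least one document")
    # Rounding and the floor of 1 are assumptions, not part of the rule.
    return max(1, round(math.log10(n_documents)))

for size in (1_000, 100_000, 10_000_000):
    print(size, suggested_facet_count(size))
```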
Element | DC mapping | Data Type | Length | Req. / Repeat | Source | Purpose
Asset Metadata
Unique ID | dc:identifier | Integer | Fixed | 1 | System supplied | Basic accountability
Recipe Title | dc:title | String | Variable | 1 | Licensed Content | Text search & results display
Recipe summary | dc:description | String | Variable | 1 | Licensed Content | Content
Subject Metadata
Main Ingredients | X | List | Variable | ? | Main Ingredients vocabulary | Key index to retrieve & aggregate recipes, & generate shopping list
Meal Types | X | List | Variable | * | Meal Types vocab | Browse or group recipes & filter search results
Cuisines | X | List | Variable | * | Cuisines | Browse or group recipes & filter search results
Courses | X | List | Variable | * | Courses vocab | Browse or group recipes & filter search results
Cooking Method | X | Flag | Fixed | * | Cooking vocab | Browse or group recipes & filter search results
Link Metadata
Recipe Image | dcterms:hasPart | Pointer | Variable | ? | Product Group | Merchandize products
Use Metadata
Rating | | String | Variable | 1 | Licensed Content | Filter, rank, & evaluate recipes
Release Date | dc:date | Date | Fixed | 1 | Product Group | Publish & feature new recipes
dc:type=“recipe”, dc:format=“text/html”, dc:language=“en”
Project/exercise
– Produce a faceted classification of your documents (at least 3 facets, min 5 foci each)
– Encode the facet classification as an extension of dc:subject
– Attribute facets to your docs.
– Check extensibility by adding 10 new docs