
CS 430: Information Discovery
Lecture 6
Descriptive Metadata 2
Library Catalogs
Dublin Core
1
Course Administration
Assignment 1
• Submission instructions will be posted soon.
• You will need a csuglab account. If you do not have such
an account, go to Upson 311.
Programming in Perl
• First class on Perl is Wednesday night, Hollister 110, 7:30
to 9:00 p.m.
2
Course Administration
New Course
• LAW 410 Limits on and Protection of Creative
Expression - Copyright Law and Its Close Neighbors
This course, offered during fall term 2001, provides an
introduction to copyright law and closely related legal
regimes for non-law students.
3
Example: Monograph catalog record
Citation
Caroline R. Arms, editor, Campus strategies for libraries and
electronic information. Bedford, MA: Digital Press, 1990.
4
MARC fields
tag value (field name)
001 89-16879 r93
050 Z675.U5C16 1990
082 027.7/0973 20
245 Campus strategies for libraries and electronic information/Caroline Arms, editor. (title statement)
260 {Bedford, Mass.} : Digital Press, c1990. (publisher)
300 xi, 404 p. : ill. ; 24 cm. (collation)
440 EDUCOM strategies series on information technology (series title)
504 Includes bibliographical references (p. {373}-381).
020 ISBN 1-55558-036-X : $34.95
5
MARC fields (continued)
650 Academic libraries--United States--Automation. (subject heading)
650 Libraries and electronic publishing--United States.
650 Library information networks--United States.
650 Information technology--United States.
700 Arms, Caroline R. (Caroline Ruth)
040 DLC DLC DLC
043 n-us--
955 CIP ver. br02 to SL 02-26-90
985 APIF/MIG
6
MARC Encoding: For Print and
Computer Processing
tag: 260
subfield a: {Bedford, Mass.} :
subfield b: Digital Press,
subfield c: c1990.
MARC encoding:
&2600#abc#{Bedford, Mass.} :#Digital Press,#c1990.%
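The encoding above can be reproduced with a short sketch. It follows only the simplified notation shown on this slide ('&' to open a field, '#' as delimiter, '%' to close), not the real MARC exchange format. Treating the '0' after the tag as an indicator value is an assumption, and the function names are hypothetical.

```python
def encode_field(tag, subfields, indicator="0"):
    """Build one field string in the slide's simplified notation.

    The meaning of the character after the tag is assumed to be an indicator.
    """
    codes = "".join(code for code, _ in subfields)
    values = "#".join(value for _, value in subfields)
    return f"&{tag}{indicator}#{codes}#{values}%"


def decode_field(encoded):
    """Split a field string back into tag, indicator, and (code, value) pairs."""
    body = encoded[1:-1]  # drop the leading '&' and the trailing '%'
    head, codes, *values = body.split("#")
    return head[:3], head[3:], list(zip(codes, values))


publisher = [("a", "{Bedford, Mass.} :"), ("b", "Digital Press,"), ("c", "c1990.")]
encoded = encode_field("260", publisher)
print(encoded)
print(decode_field(encoded))
```

This sketch breaks if a subfield value itself contains '#'. A real exchange format has to reserve delimiter characters that cannot occur in the data.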
7
Name authority files
• Caroline R. Arms or Caroline Ruth Arms?
• Which William Phillips of Cardiff?
• Mark Twain or Samuel Clemens?
• Epithets:
    of Cardiff
    doctor
• Dates:
    1832 - 1876
    flourished 1860
    circa 1832 - 1876
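A name authority file can be pictured as a mapping from variant forms of a name to one authorized heading. The sketch below is only an illustration: the table and function name are hypothetical, the heading for Arms is taken from the 700 field shown earlier, and the heading "Twain, Mark" is an assumed form.

```python
# Hypothetical authority file: variant form -> authorized heading.
AUTHORITY = {
    "caroline r. arms": "Arms, Caroline R. (Caroline Ruth)",
    "caroline ruth arms": "Arms, Caroline R. (Caroline Ruth)",
    "mark twain": "Twain, Mark",      # assumed form of the heading
    "samuel clemens": "Twain, Mark",  # real name collocated with the pseudonym
}


def authorized_heading(name):
    """Return the authorized heading for a name, or None if it is unknown."""
    key = " ".join(name.lower().split())  # normalize case and spacing
    return AUTHORITY.get(key)


print(authorized_heading("Caroline Ruth Arms"))
print(authorized_heading("Samuel Clemens"))
print(authorized_heading("William Phillips"))
```

A bare name such as William Phillips cannot be resolved by lookup alone, which is why authority records add epithets and dates.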
8
Shared cataloguing
OCLC -- large centralized transaction-processing database system
When a library catalogs a book, it deposits a MARC record in OCLC.
Other libraries can copy the record:
• saves duplication of cataloguing
• builds a database of holdings
The OCLC database has 43 million records.
9
Subject information
Library of Congress Subject Headings
Academic libraries--United States--Automation
Hierarchical classification
Library of Congress call number: Z675.U5C16
Dewey Decimal Classification: 027.7
Creation and maintenance of lists of subject headings and
classifications is a never ending task.
10
Notes on MARC
A great achievement:
• Developed in the 1960s
• Magnetic tape exchange format for printing catalog records
• The dawn of computing:
    mixed upper and lower case
    variable length fields
    repeated fields
    non-Roman scripts
• 100(?) million records with standard content and format
• Thousands of trained librarians (millions?)
11
Notes on MARC
A great problem:
• Not designed for computer algorithms
• One record per item (poor links between records)
• Tied to traditional materials and traditional practices
• Not Unicode
• 100 million records at $100 each -- $10 billion
A classic legacy system!
12
Cataloguing Objectives
Functions of catalogs:
finding
collocating (recall and precision)
choosing
acquiring
navigating
... among items in a bibliographic universe
Compare use cases in software design.
13
IFLA Model
Work. A work is the underlying abstraction, e.g.,
• The Iliad
• The Computer Science departmental web site
• Beethoven's Fifth Symphony
• Unix operating system
• The 1996 U.S. census
This is roughly equivalent to the concept of "literary
work" used in copyright law.
14
IFLA Model
Expression. A work is realized through an expression, e.g.,
• The Iliad has oral expressions and written expressions
• A musical work has score and performance(s).
• Software has source code and machine code
Many works have only a single expression, e.g. a web page, or a
book.
15
IFLA Model
Manifestation. An expression is given form in one or more
manifestations, e.g.,
• The text of The Iliad has been manifest in numerous
manuscripts and printed books.
• A musical performance can be distributed on CD, or
broadcast on television.
• Software is manifest as files, which may be stored or
transmitted in any digital medium.
16
IFLA Model
Item. When many copies are made of a manifestation,
each is a separate item, e.g.,
• a specific copy of a book
• a computer file
[Works, expressions, manifestations and items are
explored in CS 502, Computing Methods for
Digital Libraries.]
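The four levels of the model nest naturally, which a small data-structure sketch can make concrete. The class and field names below are illustrative assumptions, not part of the IFLA model itself, and the copies listed are invented examples.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    location: str  # one specific copy


@dataclass
class Manifestation:
    form: str  # e.g. manuscript, printed book, CD
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    kind: str  # e.g. oral, written, score, performance
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    title: str
    expressions: List[Expression] = field(default_factory=list)

    def count_items(self):
        """Count every individual copy reachable from this work."""
        return sum(len(m.items) for e in self.expressions for m in e.manifestations)


iliad = Work("The Iliad", [
    Expression("oral"),
    Expression("written", [
        Manifestation("printed book", [Item("library copy 1"), Item("library copy 2")]),
        Manifestation("manuscript", [Item("archive copy")]),
    ]),
])
print(iliad.count_items())
```

A catalog record in the MARC tradition describes roughly one manifestation, so the links up to the work and down to the items are exactly what a flat record tends to lose.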
17
Dublin Core
Simple set of metadata elements for online information
• 15 basic elements
• intended for all types and genres of material
• all elements optional
• all elements repeatable
Developed by an international group chaired by Stuart Weibel
since 1995.
(Diane Hillmann and Carl Lagoze of Cornell are very active in this
group.)
18
19
Dublin Core
publisher: OCLC
creator: Weibel, Stuart L.
creator: Miller, Eric J.
title: Dublin Core Reference Page
date: 1996-05-28
format: text/html (MIME type)
language: en (English)
identifier: http://purl.org/dc/documents/rec-dces-199809.htm#
20
Dublin Core with Meta Tags
<meta name="publisher" content="OCLC">
<meta name="creator" content="Weibel, Stuart L.">
<meta name="creator" content="Miller, Eric J.">
<meta name="title" content="Dublin Core Reference Page">
<meta name="date" content="1996-05-28">
<meta name="format" content="text/html">
<meta name="language" content="en">
<meta name="identifier"
content="http://purl.org/dc/documents/rec-dces-199809.htm#">
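Meta tags like these can be generated from a list of element/value pairs. The following is a minimal sketch; the function name is hypothetical, and a list of pairs is used rather than a dictionary because every Dublin Core element is repeatable (note the two creator entries).

```python
from html import escape


def dublin_core_meta(pairs):
    """Render (element, value) pairs as HTML meta tags, one per line."""
    return "\n".join(
        f'<meta name="{escape(name)}" content="{escape(value)}">'
        for name, value in pairs
    )


record = [
    ("publisher", "OCLC"),
    ("creator", "Weibel, Stuart L."),
    ("creator", "Miller, Eric J."),
    ("title", "Dublin Core Reference Page"),
]
tags = dublin_core_meta(record)
print(tags)
```

Escaping the values keeps the generated HTML valid when a title or name contains quotes or angle brackets.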
21
Dublin Core elements
1. Title The name given to the resource by the creator or publisher.
2. Creator The person or organization primarily responsible for
the intellectual content of the resource. For example, authors in the
case of written documents, artists, photographers, or illustrators in
the case of visual resources.
3. Subject The topic of the resource. Typically, subject will be
expressed as keywords or phrases that describe the subject or
content of the resource. The use of controlled vocabularies and
formal classification schemes is encouraged.
22
Dublin Core elements
4. Description A textual description of the content of the resource,
including abstracts in the case of document-like objects or content
descriptions in the case of visual resources.
5. Publisher The entity responsible for making the resource
available in its present form, such as a publishing house, a university
department, or a corporate entity.
6. Contributor A person or organization not specified in a creator
element who has made significant intellectual contributions to the
resource but whose contribution is secondary to any person or
organization specified in a creator element (for example, editor,
transcriber, and illustrator).
23
Dublin Core elements
7. Date A date associated with the creation or availability of the
resource.
8. Type The category of the resource, such as home page, novel,
poem, working paper, preprint, technical report, essay,
dictionary.
9. Format The data format of the resource, used to identify the
software and possibly hardware that might be needed to display
or operate the resource.
24
Dublin Core elements
10. Identifier A string or number used to uniquely identify the
resource. Examples for networked resources include URLs and
URNs.
11. Source Information about a second resource from which
the present resource is derived.
12. Language The language of the intellectual content of the
resource.
13. Relation An identifier of a second resource and its
relationship to the present resource. This element permits links
between related resources and resource descriptions to be
indicated. Examples include an edition of a work
(IsVersionOf), or a chapter of a book (IsPartOf).
25
Dublin Core elements
14. Coverage The spatial locations and temporal durations
characteristic of the resource.
15. Rights A rights management statement, an identifier that
links to a rights management statement, or an identifier that
links to a service providing information about rights
management for the resource.
26
Qualifiers
Element qualifiers
Example: Date
DC.Date -> Created: 1997-11-01
DC.Date -> Issued: 1997-11-15
DC.Date -> Available: 1997-12-01/1998-06-01
DC.Date -> Valid: 1998-01-01/1998-06-01
27
Qualifiers
Value qualifiers
Example: Subject
DC.Subject -> DDC: 509.123
DC.Subject -> LCSH: Digital libraries--United States
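One way to picture the two kinds of qualifier is as optional fields on each metadata statement: an element qualifier narrows the meaning of the element, while a value qualifier names the scheme the value comes from. The sketch below is an assumption-laden illustration; the type and field names are hypothetical.

```python
from typing import NamedTuple, Optional


class Statement(NamedTuple):
    element: str
    value: str
    element_qualifier: Optional[str] = None  # narrows the element, e.g. "created"
    value_scheme: Optional[str] = None       # names the scheme of the value, e.g. "DDC"


record = [
    Statement("date", "1997-11-01", element_qualifier="created"),
    Statement("date", "1997-11-15", element_qualifier="issued"),
    Statement("subject", "509.123", value_scheme="DDC"),
]


def unqualified(statements):
    """Drop all qualifiers, leaving plain element/value pairs."""
    return [(s.element, s.value) for s in statements]


print(unqualified(record))
```

Dropping the qualifiers still leaves a usable simple record, although the two dates can no longer be told apart.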
28
29
Dublin Core with qualifiers
<title>Digital Libraries and the Problem of Purpose</title>
<creator>David M. Levy</creator>
<publisher>Corporation for National Research Initiatives</publisher>
<date date-type = "publication">January 2000</date>
<type resource-type = "work">article</type>
<identifier uri-type = "DOI">10.1045/january2000-levy</identifier>
<identifier uri-type = "URL">http://www.dlib.org/dlib/january00/01levy.html</identifier>
<language>English</language>
<rights>Copyright (c) David M. Levy</rights>
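A record in this form can be read with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree on a subset of the elements above. Wrapping the fragment in a `<record>` root element is an assumption made here, because the slide shows no enclosing element.

```python
import xml.etree.ElementTree as ET

fragment = """
<title>Digital Libraries and the Problem of Purpose</title>
<creator>David M. Levy</creator>
<date date-type = "publication">January 2000</date>
<identifier uri-type = "DOI">10.1045/january2000-levy</identifier>
<identifier uri-type = "URL">http://www.dlib.org/dlib/january00/01levy.html</identifier>
"""

# The <record> wrapper is hypothetical; XML requires a single root element.
root = ET.fromstring(f"<record>{fragment}</record>")

for element in root:
    print(element.tag, element.attrib, element.text)

# Index the repeated identifier element by its qualifier.
identifiers = {e.get("uri-type"): e.text for e in root.findall("identifier")}
print(identifiers["DOI"])
```

The qualifiers appear as XML attributes, so a program that ignores attributes still sees a valid unqualified record.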
30
31
32
Limits of Dublin Core
Complex objects
Metadata records
Complete object
Sub-objects
• Article within a journal
• A thumbnail of another image
• The March 28 final edition of a newspaper
33
Flat v. linked records
Flat record
All information about an item is held in a single Dublin Core
record, including information about related items
• convenient for access and preservation
• information is repeated -- maintenance problem
Linked record
Related information is held in separate records with a link from the
item record
• less convenient for access and preservation
• information is stored once
Compare with normal forms in relational databases
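The trade-off can be sketched with two articles from the same serial. The article titles and the new serial name are invented for illustration, and using the ISSN as the link is an assumption; the serial name and ISSN themselves are those of D-Lib Magazine.

```python
# Flat: every article record repeats the serial information.
flat_records = [
    {"title": "Article A", "serial_name": "D-Lib Magazine", "issn": "1082-9873"},
    {"title": "Article B", "serial_name": "D-Lib Magazine", "issn": "1082-9873"},
]

# Linked: the serial is described once and each article points to it by ISSN.
serials = {"1082-9873": {"serial_name": "D-Lib Magazine"}}
linked_records = [
    {"title": "Article A", "serial": "1082-9873"},
    {"title": "Article B", "serial": "1082-9873"},
]


def resolve(record):
    """Follow the link to rebuild the flat view of a linked record."""
    serial = serials[record["serial"]]
    return {
        "title": record["title"],
        "serial_name": serial["serial_name"],
        "issn": record["serial"],
    }


# Renaming the serial takes one update in linked form...
serials["1082-9873"]["serial_name"] = "D-Lib"
# ...but one update per article in flat form.
for record in flat_records:
    record["serial_name"] = "D-Lib"

print([resolve(r) for r in linked_records] == flat_records)
```

The linked form is what normalization produces in a relational database, and the flat form is the denormalized view that is easier to ship around and preserve as a single object.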
34
Dublin Core with flat record extension
Continuation
<relation rel-type = "InSerial">
<serial-name>D-Lib Magazine</serial-name>
<issn>1082-9873</issn>
<volume>6</volume>
<issue>1</issue>
</relation>
35
Events
Version 1
Version 2
New material
Should Version 2 have its own record, or should extra
information be added to the Version 1 record?
How are these represented in Dublin Core?
36
Minimalist versus structuralist
Minimalist
15 elements, no qualifiers, suitable for non-professionals
encourage creators to provide metadata
Structuralist
15 elements, qualifiers, RDF, detailed coding rules
will require trained metadata experts
[For an example of how complex Dublin Core can become, see
the source of: http://purl.org/dc/documents/rec-dces-199809.htm#]
37
Dublin Core in many languages
See:
Thomas Baker, Languages for Dublin Core, D-Lib Magazine
December 1998,
http://www.dlib.org/dlib/december98/12baker.html
38
Dublin Core: Personal Opinion
Dublin Core is a simple way to describe digital content that:
• is a single, self-contained object ("document-like")
• is static with time
• has few relationships
Some web sites satisfy these criteria
Dublin Core is not suitable for digital content that:
• is heavily structured
• changes dynamically
39