Digital Libraries
Lillian N. Cassel
A digital library
• An informal definition of a digital library is a
managed collection of information, with
associated services, where the information is
stored in digital formats and accessible over a
network. - Wm Arms, Digital Libraries, 1999
• A focused collection of digital objects,
including text, video, and audio, along with
methods for access and retrieval, and for
selection, organization, and maintenance of
the collection. - Witten and Bainbridge, How to Build a Digital Library, 2003
What is a library?
• An active exercise to explore what we
know about, and think about, traditional
libraries.
• How do we translate these
characteristics to the digital world?
– Is that the right model? Are we
unnecessarily constraining the digital
environment? Are there things that do not
translate?
Vannevar Bush
• “As we may think”
• (http://www.theatlantic.com/doc/194507/bush)
• Reflecting after WWII
– The value of collaboration
– The sad use of scientific expertise to invent the
atomic bomb
– The need for organization and access to
information.
memex
• Vannevar Bush’s vision
Image source:
kelty.rice.edu/375/images/memex/camera.jpg
http://www.knowledgesearch.org/presentations/etcon/images/memex.gif
MyLifeBits
• Gordon Bell and Microsoft
• http://www.guardian.co.uk/science/story/0,3605,1674359,00.html
“Gordon Bell doesn't need to remember, but has no chance of forgetting.
At the age of 71, he is recording as much of his life as modern
technology will allow, storing it all on a vast database: a digital facsimile
of a life lived.
If he goes for a walk, a miniature camera that dangles from his neck
snaps pictures every minute or so, immediately committing the scene
to a memory built not of neurons but ones and noughts. If he wanders
into a cafe, sensors note the change in light, the shift of temperature
and squirrel the information away. Conversations are recorded and
steps logged thanks to a GPS receiver carried with him.”
Related work
• Walden’s Path
– http://www.csdl.tamu.edu/walden/
– System used by itself or as a service within a digital library
– Allows a user to make a path through a set of related
resources and save the path for reuse at a later time.
• Used to allow a teacher to “blaze a trail” through a collection of
materials to help students find their way from a starting point to
a goal.
• Also for recording personal trips through a collection of material
to be revisited.
How does that compare to a set of
bookmarks?
Moving Forward
• Looked at what a library is
• Now
– How do we translate that to a digital entity?
• Information resources, including digital libraries, are
very complex systems.
– A formal model helps to capture the essence of the system
and give special attention to specific areas
– The model also allows developers of digital libraries to have
a check list of areas to consider and develop well.
The 5S model
• Streams
– The flow of information in various formats
• Structures
– Organizational aspects of the DL
• Spaces
– Views of components; real or abstract images
• Scenarios
– Services and behaviors
• Societies
– Communities and relationships among them
5S summary
• Stream
– Primitives: Text; video; audio; software program
– Formalisms: Sequences; types
– Objectives: Describes properties of the DL content, encoding and textual material or particular forms of multimedia data.
• Structure
– Primitives: Collection; catalog; hypertext; document; metadata; organizational tools
– Formalisms: Graphs; nodes; links; labels; hierarchies
– Objectives: Specifies organizational aspects of the DL content
• Space
– Primitives: User interface; index; retrieval model
– Formalisms: Sets; operations; vector space; measure space; probability space
– Objectives: Defines logical and presentational views of several DL components
• Scenarios
– Primitives: Service; event; condition; action
– Formalisms: Sequence diagrams; collaboration diagrams
– Objectives: Details the behavior of DL services
• Societies
– Primitives: Community; managers; actors; classes; relationships; attributes; operators
– Formalisms: Object-oriented modeling constructs; design patterns
– Objectives: Defines managers responsible for running DL services; actors that use those services, and relationships among them
Source: http://www.dlib.vt.edu/projects/5S-Model/
Etana - A DL for archeology
An example application of 5S: Etana, a DL for an archeological site
[Figure: 5S models of the Etana DL. Labels in the diagram:
Society model: Archaeologist; General public; Service Manager.
Scenario model: Services (Value added; Domain specific; Repository building; Information Satisfaction).
Space model: Geographic space; Metric space; User interface.
Structure model: Region; *Site; *Partition; *Sub-partition; *Locus; *Container; *Artifact; Metadata; Taxonomies (Spatial; Temporal; Artifact-specific).
Stream model: Text; Video; Audio; Drawing; Photo; 3D.]
Source: E. A. Fox http://feathers.dlib.vt.edu/
Applying the model, informally
Personal Photos; Movie, TV, media
• Stream - what types of data? Gif, jpg, avi?
• Structure - How are the elements organized? Is
there a hierarchy? Are there multiple structures?
• Spaces - How would you index the items? How
would you divide them into related groups?
• Scenarios - what services would you provide?
What information do we need to provide those
services?
• Societies - who is the library intended to serve?
Remember to include agents and other
processes as well as users.
In your group, choose one or the other (photos or movie/TV/media).
Start with stream, scenarios, societies.
More formally: Definitions
• Definition: A stream is a sequence
whose co-domain is a non-empty set.
• Definition: A structure is a tuple (G, L, F)
where G = (V,E) is a directed graph with
vertex set V and edge set E, L is a set
of label values, and F is a labeling
function.
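The two definitions above can be sketched in Python. This is only an illustration under stated assumptions: the class name `Structure`, the `is_valid` check, and the sample catalog are hypothetical and are not part of the 5S model itself.

```python
from dataclasses import dataclass

# A stream: a sequence over a non-empty set (here, the set of characters).
text_stream = list("memex")


@dataclass
class Structure:
    """A structure (G, L, F): directed graph G = (V, E), label set L, labeling function F."""
    vertices: set
    edges: set      # E, a set of (source, target) pairs
    labels: set     # L
    labeling: dict  # F, mapping vertices and edges to label values

    def is_valid(self) -> bool:
        # Every edge must join known vertices, and F may only produce labels from L.
        edges_ok = all(u in self.vertices and v in self.vertices for u, v in self.edges)
        labels_ok = all(label in self.labels for label in self.labeling.values())
        return edges_ok and labels_ok


# Hypothetical example: a tiny collection holding a map and a photo.
catalog = Structure(
    vertices={"collection", "map", "photo"},
    edges={("collection", "map"), ("collection", "photo")},
    labels={"root", "item", "contains"},
    labeling={
        "collection": "root",
        "map": "item",
        "photo": "item",
        ("collection", "map"): "contains",
        ("collection", "photo"): "contains",
    },
)
```

Calling `catalog.is_valid()` checks that the graph and the labeling are consistent with each other; it says nothing about whether the organization is a good one.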
Definitions, cont’d
• Definition: A space is a measurable space,
measure space, probability space, vector
space, topological space, or metric space
– A vector space is a representation for the set of
elements in a collection. The vector representing
each element is a set of characteristics held by
that element and both connecting that element to
others that are similar and distinguishing it from
those that are different.
– We will do an exercise to illustrate
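As a small illustration of the vector space idea, the sketch below represents each item by a vector of characteristics and compares items by cosine similarity. The characteristics and the three items are made up for this example, and cosine similarity is one common choice of measure, not something the definition requires.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero characteristic vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical characteristics: (is_map, is_photo, shows_coast, seventeenth_century)
items = {
    "old_coast_map": [1, 0, 1, 1],
    "modern_coast_chart": [1, 0, 1, 0],
    "family_photo": [0, 1, 0, 0],
}

similar = cosine_similarity(items["old_coast_map"], items["modern_coast_chart"])
different = cosine_similarity(items["old_coast_map"], items["family_photo"])
```

The two maps share characteristics and score close to 1, while the map and the photo share none and score 0. That is the sense in which the vector both connects an element to similar ones and distinguishes it from different ones.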
Definitions - 3
• Definition: A scenario is a sequence of related
transition events $(e_1, e_2, \ldots, e_n)$ on state set $S$
such that $e_k = (s_k, s_{k+1})$ for $1 \le k \le n$.
– More easily visualized, a scenario is a path in a
directed graph, $G = (S, \Sigma_e)$, where vertices
correspond to states in the state set $S$ and
directed edges are equivalent to events in a set of
events, $\Sigma_e$, and correspond to transitions between
states.
– Scenarios must be implemented to make a
working system.
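The path view of a scenario can be checked mechanically. The following is a minimal sketch, assuming events are stored as (from_state, to_state) pairs; the function name `is_scenario` and the search states are hypothetical.

```python
def is_scenario(events, states):
    """True if events is a non-empty chain of transitions over the given state set."""
    if not events:
        return False
    # Each event must connect two states from the state set S.
    for source, target in events:
        if source not in states or target not in states:
            return False
    # Consecutive events must chain: the target of one is the source of the next.
    return all(events[k][1] == events[k + 1][0] for k in range(len(events) - 1))


# Hypothetical states for a simple search service.
states = {"start", "query_entered", "results_shown", "item_delivered"}

search = [
    ("start", "query_entered"),
    ("query_entered", "results_shown"),
    ("results_shown", "item_delivered"),
]
broken = [("start", "query_entered"), ("results_shown", "item_delivered")]
```

Here `search` is a path through the state graph, while `broken` skips a transition and so is not a scenario in the sense of the definition.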
Definitions - 4
• Definition: A society is a tuple $(C, R)$ where
– $C = \{c_1, c_2, \ldots, c_n\}$ is a set of conceptual
communities, each community referring to a set of
individuals of the same class or type (e.g. actors,
activities, components, hardware, software, data);
– $R = \{r_1, r_2, \ldots, r_m\}$ is a set of relationships, each
relationship being a tuple $r_j = (e_j, i_j)$ where $e_j$ is a
Cartesian product $c_{k_1} \times c_{k_2} \times \cdots \times c_{k_{n_j}}$, $1 \le k_1 < k_2 < \cdots < k_{n_j} \le n$,
which specifies the communities involved in the
relationship and $i_j$ is an activity.
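A rough sketch of the society definition, using Python sets for communities and `itertools.product` for the Cartesian product. The community names, their members, and the `relationship` helper are hypothetical, chosen only to make the tuple structure concrete.

```python
from itertools import product

# C: conceptual communities, each a set of individuals of one type (made-up members).
communities = {
    "actors": {"archaeologist", "student"},
    "services": {"search", "browse"},
}


def relationship(community_names, activity):
    """Build r = (e, i): e is the Cartesian product of the named communities, i an activity."""
    e = set(product(*(sorted(communities[name]) for name in community_names)))
    return (e, activity)


# R: here a single relationship pairing every actor with every service.
uses = relationship(["actors", "services"], "uses")
society = (communities, [uses])
```

The product contains every (actor, service) pair, which is what "specifies the communities involved" amounts to; the activity `"uses"` names what the relationship means.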
The Digital Library Content
• Essential elements for a digital library
– Users
– Content
– Services
Content - requirements
• Store
– Organize
– Describe
• Find
• Deliver
Describing the content
• How to describe content
– Metadata
• Machine readable description of anything
• What description
– Machine readable requires standard descriptive elements
• Dublin Core (http://dublincore.org/)
– International standard
– “a standard for cross-domain information resource description.”
– 15 descriptive elements
• Other metadata schemes
– IEEE-LOM
Metadata
• What does metadata look like?
• Metadata is data about data
– Information about a resource, encoded in
the resource or associated with the
resource.
• The language of metadata: XML
– eXtensible Markup Language
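To give a feel for what XML metadata looks like, here is a small sketch that builds a Dublin Core record with Python's standard `xml.etree.ElementTree`. The wrapper element name `metadata` and the helper `dublin_core_record` are assumptions made for this example; the namespace URI is the one DCMI publishes for the 15 core elements, and the sample values are borrowed from the map example in these slides.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Wrap element-name/value pairs in a (hypothetical) <metadata> container."""
    record = ET.Element("metadata")
    for name, value in fields.items():
        element = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        element.text = value
    return record


record = dublin_core_record({
    "creator": "Nicolaum Visscher",
    "date": "1996-04-17",
    "format": "image/gif",
})
xml_text = ET.tostring(record, encoding="unicode")
```

Printing `xml_text` shows one `dc:`-prefixed element per field. Because the element names come from a shared standard, another system can read the record without knowing anything about this particular collection.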
Google Books Project
• Michael A. Keller, Closing Keynote
– Ida M. Green University Librarian at Stanford,
– Director of Academic Information Resources,
– Publisher of HighWire Press, and
– Publisher of the Stanford University Press:
• "One good turn deserves another; how the
Google Book Search project is benefiting
everyone".
Google Books demo
• Full text - Life of Miguel de Cervantes
• Limited Preview - The Life of Miguel de
Cervantes Saavedra
• Snippet View - "Discreción" in the
Works of Cervantes: A Semantic Study
What has been accomplished
• As of September 2006
• Nearly 30,000 Stanford books digitized
– ~1M books from all partner libraries
• Over 4,000 books identified as needing
preservation treatment (& so not digitized)
• A great debate about copyright has started
– Orphan works
– What can an archive do to provide access
– Defense of fair use underway
This slide is taken from the presentation by Michael A. Keller at ECDL 2006
Original Principles
• If legally possible, digitize every book (9M volumes)
in the Stanford libraries
– Now digitizing with imprint dates up to 1963
• Partner libraries (*added recently)
– University of Michigan (similar to Stanford)
– Harvard (public domain (?), maybe > 1M)
– NYPL (public domain, unusual collections)
– Oxford - Bodleian (earlier than 1885, ~ 1M titles)
– University of California (similar to Stanford >6M)
– (more to follow)
This slide is taken from the presentation by Michael A. Keller at ECDL 2006
Purposes
• Digital preservation
– Virtual Bookshelves under construction as part of the
Stanford Digital Repository
– For Stanford use only
• Other searching and research functions
– Subtle searching (as in Socrates & HighWire)
– Taxonomic (LCSH & HighWire) & Associative Searching (Takano)
– Citation linking (HighWire & “InforTools” (Ebrary))
– Better navigation (through visualization ?) (Grokker)
• Digitized books from all sources as test bed for new research;
combine with articles, datasets, etc. for data mining & other
transformative uses.
This slide is taken from the presentation by Michael A. Keller at ECDL 2006
Some Conclusions
• Google Book Search
– Is an indexing, not a publishing project
– Offers substantial increases in access to contents of books
in library collections by keyword searching
– Offers publishers global marketing of their publications
– Offers several useful services to readers
• Offers participating libraries
– Digital copies of books on their shelves for preservation
– New possibilities for services to local readers
– New possibilities for research for local faculty & students
This slide is taken from the presentation by Michael A. Keller at ECDL 2006
Google statement
• “Many of the books in Google Book Search come from authors and
publishers who participate in our Partner Program. For these books,
our partners decide how much of the book is browsable -- anywhere
from a few sample pages to the whole book.
• For books that enter Book Search through the Library Project, what you
see depends on the book's copyright status. We respect copyright law
and the tremendous creative effort authors put into their work. If the
book is in the public domain and therefore out of copyright, you can
page through the entire book and even download it and read it offline.
But if the book is under copyright, and the publisher or author is not
part of the Partner Program, we only show basic information about the
book, similar to a card catalog, and, in some cases, a few snippets --
sentences of your search terms in context. The aim of Google Book
Search is to help you discover books and learn where to buy or borrow
them, not read them online from start to finish. It's like going to a
bookstore and browsing - with a Google twist.”
http://books.google.com/support/bin/answer.py?answer=43729&topic=9259&hl=en
Other projects
• Open Content Alliance (Yahoo and the
Internet Archive)
• The Internet Archive www.archive.org
• The European Digital Library (Growing
number of countries)
• others
Comments? Discussion?
A DL example
• Library of Congress American Memory project
– http://memory.loc.gov/ammem/index.html
– “American Memory provides free and open access through the
Internet to written and spoken words, sound recordings, still and
moving images, prints, maps, and sheet music that document the
American experience. It is a digital record of American history and
creativity. These materials, from the collections of the Library of
Congress and other institutions, chronicle historical events, people,
places, and ideas that continue to shape America, serving the
public as a resource for education and lifelong learning.”
Dublin Core for a map
• Map found in the LOC American Memory
collection
– Map at
http://memory.loc.gov/ammem/gmdhtml/gmdhome.html
• Dublin Core metadata illustration found at
http://webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
– Part of a DL course at U. of Alabama
Go to the web site to explore what is there, including copyright information, title,
history, etc.
Dublin Core: Title
• Name given, usually by the creator or publisher
<META name = "DC.Title"
content = "Novi Belgii Novæque Angliæ: nec non
partis Virginiæ tabula multis in locis emendata"
lang = "la"
>
Source:
webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
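Because the metadata sits in ordinary HTML META tags, a harvester can pull it out with a standard HTML parser. The sketch below uses Python's `html.parser`; the class name `DublinCoreParser` is hypothetical, the title is shortened, and the tag is assumed to be written with plain ASCII quotes.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect the content of META tags whose name starts with 'DC.'."""

    def __init__(self):
        super().__init__()
        self.elements = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; attribute values keep their case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.lower().startswith("dc."):
            self.elements[name] = attributes.get("content")


page = '<META name="DC.Title" content="Novi Belgii Novæque Angliæ" lang="la">'
parser = DublinCoreParser()
parser.feed(page)
```

After `feed`, `parser.elements` maps each Dublin Core element name found on the page to its content string, ready to be indexed.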
Dublin Core: Subject
• What the work is about, possibly
keywords, terms from classification
scheme if available.
<META name = "DC.Subject"
content = "Middle Atlantic States - Maps
- Early works to 1800 - Facsimiles"
scheme = "LCSH"
>
LCSH = Library of Congress Subject Headings
Source:
webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Description
• Free text description, abstract, etc.
<META
name = "DC.Description"
content = "An (sic) historical map
showing the coast of New Jersey as
perceived in the seventeenth century"
>
Source:
webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Source
• Is this object derived from another? Is
this map a part of a larger map? Is this
text a variation or revision of another
piece of text?
<META
name = "DC.Source"
content = "G3715 1685 .V5 1969"
scheme = "LCCN"
>
LCCN = Library of Congress Call Number
Source:
webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Language
• Language of the content of the resource
• For the map, there is no language
content
<META
name = "DC.Language"
content = "nl"
>
Source:
webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Relation
• To what other object(s) or collection is this object
related? Does it also exist in another collection? Is it
derived from another document or image? How is it
related?
<META
name = "DC.Relation"
content = "isPartOf
http://lcweb2.loc.gov/cgi-bin/query/r?ammem/gmd:@filreq(@field(NUMBER+@band(g3715+ct000001))+@field(COLLID+dsxpmap))"
>
Source:
webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Creator
• Person or organization responsible for
the Intellectual Content of this object
<META
name = "DC.Creator"
content = "Nicolaum Visscher"
>
Source:
webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Publisher
• Entity responsible for making the
resource available in its present form
• Not shown in the example, but should
be something like this:
<META name = "DC.Publisher"
content = "Library of Congress
American Memory Project"
>
Source:
webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Contributor
• Any entity making a contribution to this
object.
• Example: someone who added some
information to the original document or
image
• No entry for this map.
Dublin Core: Rights
• A pointer to a copyright notice, a rights
management statement, or a rights server.
<META
name = "DC.Rights"
content = "http://lcweb2.loc.gov/cgi-bin/ammemrr.pl?title=%3ca%20href%3d%22%2fammem%2fgmdhtml%2fdsxphome.html%22%3eDiscovery%20and%20Exploration%3c%2fa%3e&coll=gmd&div=&agg=g3715&default=ammem&dir=ammem"
>
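The rights pointer is a URL whose query string is percent-encoded, which makes it hard to read by eye. As an aside, the sketch below decodes it with Python's standard `urllib.parse`; the URL is the one from the slide with the line breaks and stray spaces removed, which is an assumption about how it was originally written.

```python
from urllib.parse import parse_qs, urlsplit

rights_url = (
    "http://lcweb2.loc.gov/cgi-bin/ammemrr.pl"
    "?title=%3ca%20href%3d%22%2fammem%2fgmdhtml%2fdsxphome.html%22%3e"
    "Discovery%20and%20Exploration%3c%2fa%3e"
    "&coll=gmd&div=&agg=g3715&default=ammem&dir=ammem"
)

# Split off the query string and decode each parameter; keep empty ones such as div.
query = parse_qs(urlsplit(rights_url).query, keep_blank_values=True)
title = query["title"][0]
```

Decoded, the `title` parameter is an HTML link labeled "Discovery and Exploration", and `agg` carries the map's call-number prefix, so the rights server can tell which collection and item the request is about.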
Dublin Core: Date
• Date on which this object was made available
in its present form, possibly the date it was
entered into this digital collection.
<META
name = "DC.DATE"
content = "1996-04-17"
scheme = "ISO 8601"
>
Specify the date format so that others can interpret it correctly
Source:
webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
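Naming the scheme is what lets software interpret the date without guessing. A minimal sketch, assuming the harvester receives the content and scheme as strings; the function name `parse_dc_date` is hypothetical.

```python
from datetime import date


def parse_dc_date(content, scheme):
    """Parse a DC date value, but only when it declares a scheme we understand."""
    if scheme != "ISO 8601":
        raise ValueError(f"unsupported date scheme: {scheme}")
    # ISO 8601 calendar dates are written year-month-day, e.g. 1996-04-17.
    return date.fromisoformat(content)


entered = parse_dc_date("1996-04-17", "ISO 8601")
```

A value such as 04/05/1996 would be rejected rather than silently read as either April 5 or May 4, which is exactly the ambiguity the scheme attribute is meant to remove.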
Dublin Core: Type or Category
• What sort of thing is this? Some
examples: home page, novel, poem,
working paper, technical report, essay,
dictionary, …
• Type should be selected from a
controlled list. For example, see the
DCMI Type Vocabulary:
• http://dublincore.org/documents/2006/08/28/dcmi-type-vocabulary/
Why is this recommended as a controlled vocabulary field?
DCMI Type Vocabulary
• Collection
• Dataset
• Event
• Image
• InteractiveResource
• MovingImage
• PhysicalObject
• Service
• Software
• Sound
• StillImage
• Text
See the official page for explanations of the categories.
Note that Image is a broad category and MovingImage and
StillImage are more restricted subcategories.
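One practical benefit of a controlled list is that a program can check a value against it. The sketch below hard-codes the twelve terms listed on this slide; the function name `is_dcmi_type` is hypothetical, and exact, case-sensitive matching is an assumption of this example.

```python
# The twelve DCMI Type Vocabulary terms listed on the slide.
DCMI_TYPES = {
    "Collection", "Dataset", "Event", "Image",
    "InteractiveResource", "MovingImage", "PhysicalObject",
    "Service", "Software", "Sound", "StillImage", "Text",
}


def is_dcmi_type(value):
    """True only for an exact match against the controlled list."""
    return value in DCMI_TYPES


checks = {value: is_dcmi_type(value) for value in ["StillImage", "image.photograph", "Map"]}
```

A free-form value such as `image.photograph` fails the check even though a person can guess what it means. Two collections that both use `StillImage` can be searched together without any guessing.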
Dublin Core: Type
• Category of this resource
<META
name = "DC.Type"
content = "image.photograph"
>
Source:
webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Format
• The way the content is encoded. This
tells what resource is needed to access
this content.
<META
name = "DC.Format"
content = "image/gif"
scheme = "IMT"
>
Internet MIME Types:
http://www.ltsw.se/knbase/internet/mime.htp
See also Internet Media Type:
http://www.graphcomp.com/info/specs/mime.html
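Internet media types are also what software uses to decide how to open a file. As a small illustration, Python's standard `mimetypes` module maps a file name to its media type; the file name here is made up.

```python
import mimetypes

# Guess the media type from the file extension (the file does not need to exist).
media_type, encoding = mimetypes.guess_type("visscher_map.gif")

# A media type has the form major/minor, e.g. image/gif.
major, minor = media_type.split("/")
```

The major part (`image`) tells a client which kind of viewer it needs, and the minor part (`gif`) tells it which decoder, which is the information the Format element is meant to carry.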
Dublin Core: Unique ID
• The key for this object in the collection.
• I cannot find one for the map we are looking
at, but the ID for the map of which it is a part
is g3715 ct000001
• The Metadata specification for that would be
<META name = "DC.Id"
content = "g3715 ct000001"
>
Source: http://memory.loc.gov/cgi-bin/query/r?ammem/gmd:@filreq(@field(NUMBER+@band(g3715+ct000001))+@field(COLLID+dsxpmap))
Dublin Core: Coverage
• The time, space or other measurement of the
scope or completeness of the object.
• No coverage entry specified, but might be
this:
<META
name = "DC.Coverage"
content = "North America, Eastern lands and
coast, as viewed in late seventeenth century"
>
Example not a controlled vocabulary. Why
would a controlled vocabulary be better?
International Consensus
• Recognition of International Scope of
Resource Discovery on Web
• 17 Countries Currently Involved in DC
Working Groups
• 50+ Implementation Projects in 10
Countries
Source:
webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Guide to Good Practice
• The NINCH Guide to Good Practice in the Digital
Representation and Management of Cultural
Heritage Materials
• http://www.nyu.edu/its/humanities/ninchguide/index.html
Legal and Technical Issues
• Legal: When is a resource available to
digitize and make available? What
requirements exist for controlling access?
• Technical: How do we control access to a
resource that is stored online?
– Policies
– Encoding
– Distribution limitations
Date of work; protected from; term
• Created 1-1-78 or after
– Protected from: When work is fixed in tangible medium of expression
– Term: Life + 70 years (or if work of corporate authorship, the shorter of 95 years from publication, or 120 years from creation)
• Published before 1923
– Protected from: In public domain
– Term: None
• Published 1923 - 63
– Protected from: When published with notice
– Term: 28 years + could be renewed for 47 years, now extended by 20 years for a total renewal of 67 years. If not so renewed, now in public domain
• Published from 1964 - 77
– Protected from: When published with notice
– Term: 28 years for first term; now automatic extension of 67 years for second term
• Created before 1-1-78 but not published
– Protected from: 1-1-78, the effective date of the 1976 Act which eliminated common law copyright
– Term: Life + 70 years or 12-31-2002, whichever is greater
• Created before 1-1-78 but published between then and 12-31-2002
– Protected from: 1-1-78, the effective date of the 1976 Act which eliminated common law copyright
– Term: Life + 70 years or 12-31-2047, whichever is greater
Chart created by Lolly Gasaway. Updates at
http://www.unc.edu/~unclng/public-d.htm
Works for hire
• Usual case -- works created by faculty
are not the property of the university.
– Faculty surrender copyright to publishers of
journals and books
– Some publishers allow faculty to retain
copyright, giving the publisher specific
limited rights to reproduce and distribute
the work.
Fair use
• No clear, easy answers.
• Checksheet provided in the article is a
good guide to the issues.
• Link to the checksheet:
http://www.copyright.iupui.edu/checklist.htm
Moral rights
• Fair to the creator
– Keep the identity of the creator of the work
– Do not cut the work
– Generally, be considerate of the person (or
institution) that created the work.
Getting Permission
• With the best will in the world, getting the appropriate
permissions is not always easy.
– Identify who holds the rights
– Get in touch with the rights holder
– Get a suitable agreement to cover the needs of your use.
• Useful links:
http://www.loc.gov/copyright/
http://www.utsystem.edu/OGC/IntellectualProperty/PERMISSN.HTM
– Connections to various ways to discover and contact the
rights holder of a work.
Checking copyright status
Source: NINCH Guide to Good Practice. Chapter 4: Rights Management
Considering people depicted in the work
Source: NINCH Guide to Good Practice. Chapter 4: Rights Management
Copyright: Lauryn G. Grant
Technical issues
• Link the resource to the copyright statements
• Maintain that link when the resource is copied
or used
• Approaches:
– Steganography
– Encryption
– Digital Wrappers
– Digital Watermarks
Issues in Encryption
• General cases for protection of controlled content:
Concern for passive listening, active interference.
– Listening: intruder gains information, may not be detected.
Effects indirect.
– Active interference
• Intruder may prevent delivery of the message to the intended
recipient.
• Intruder may substitute a fake message for the intended one
• Effects are direct and immediate
• Less likely in the case of digital library content