The Question of Quality. Most of this presentation is based on the work of Marcos Gonçalves, as cited in the references.

Goals for this class
• Consider quality in digital libraries
– How do we define quality?
– How do we measure quality?
– How does quality control impact a user?
Understanding Quality in a DL
• Quality indicators: proposed descriptions of
quantities or observable variables that may
be related to quality
– “measures” = stronger term. Requires validation
– Gonçalves et al provide analysis of quality conditions and
recommend specific quantities to be used.
• Dimensions of quality
• Proposed indicators
• Application to DL concerns
Getting the data
• Where does the data come from?
– Logging
– Surveys
– Focus Groups
• Know what information is needed, then
choose the method most likely to provide the
data.
– More about the sources of data after we see what
we need to know.
What are we looking for?
• Consider that we are concerned about the
quality of the following characteristics of a DL:
– Data objects
– Metadata
– Collection
– Catalog
– Repository
– Services
• What characteristics do we want each of
those to have?
Dimensions of Quality
• Digital Object
– Accessibility
– Pertinence
– Preservability
– Relevance
– Similarity
– Significance
– Timeliness
• Metadata Specification
– Accuracy
– Completeness
– Conformance
• Collection
– Completeness
• Catalog
– Completeness
– Consistency
• Repository
– Completeness
– Consistency
• Services
– Composability
– Efficiency
– Effectiveness
– Extensibility
– Reusability
– Reliability
What information do we need
- related to Digital Objects
• Accessibility
– What collection?
– # of structured streams
– Rights management metadata
– Communities to be served
• Pertinence
– Context
– Information content
– Information need
Information need - Digital
Objects, continued
• Preservability
– Fidelity (lossiness)
– Migration cost
– Digital object complexity
– Stream formats
• Relevance
– Feature frequency
– Inverse document frequency
– Document size
– Document structure
– Query size
– Collection size
Information need - Digital
Objects, continued
• Similarity
– All the same features as in relevance
– Also: citation/link patterns
• Significance
– Citation/link patterns
• Timeliness
– Age
– Time of latest citation
– Collection freshness
Information need - Metadata
Specification
• Accuracy
– Accurate attributes
– # attributes in the record
• Completeness
– Missing attributes
– Schema size
• Conformance
– Conformant attributes
– Schema size
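The slide lists only the inputs for each metadata indicator, not how to combine them. A minimal sketch, assuming each indicator is a simple ratio of the two listed quantities (the function names and the sample record are hypothetical):

```python
def accuracy(accurate_attributes: int, attributes_in_record: int) -> float:
    """Share of the attributes in a record whose values are accurate."""
    if attributes_in_record <= 0:
        raise ValueError("attributes_in_record must be positive")
    return accurate_attributes / attributes_in_record


def completeness(missing_attributes: int, schema_size: int) -> float:
    """Share of the schema's attributes that the record fills in."""
    if schema_size <= 0:
        raise ValueError("schema_size must be positive")
    return 1.0 - missing_attributes / schema_size


def conformance(conformant_attributes: int, schema_size: int) -> float:
    """Share of the schema's attributes that follow the schema's rules."""
    if schema_size <= 0:
        raise ValueError("schema_size must be positive")
    return conformant_attributes / schema_size


# Hypothetical record described with a 15-attribute schema:
# 12 attributes are present (3 missing), 9 of those are accurate,
# and 11 follow the schema's rules.
record_accuracy = accuracy(9, 12)
record_completeness = completeness(3, 15)
record_conformance = conformance(11, 15)
print(record_accuracy, round(record_completeness, 2), round(record_conformance, 2))
```

All three values fall between 0 and 1, with 1 as the best score, which makes them easy to compare across records.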
Information - Collection and
Catalog
• Completeness of the Collection
– Collection size
– Size of an “ideal” collection
• Completeness of the Catalog
– # of digital objects with no metadata
• Item level metadata
– Size of the collection
• Catalog Consistency
– # of metadata specifications per digital object
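The completeness inputs above can be combined into ratios. This is a sketch under the assumption that collection completeness is the actual size divided by the "ideal" size, and that catalog completeness is the share of digital objects that have metadata; the numbers are invented for illustration.

```python
def collection_completeness(collection_size: int, ideal_size: int) -> float:
    """Size of the collection relative to an 'ideal' collection."""
    if ideal_size <= 0:
        raise ValueError("ideal_size must be positive")
    return collection_size / ideal_size


def catalog_completeness(objects_without_metadata: int, collection_size: int) -> float:
    """Share of digital objects in the collection that have metadata."""
    if collection_size <= 0:
        raise ValueError("collection_size must be positive")
    return 1.0 - objects_without_metadata / collection_size


# Hypothetical collection: 8,000 objects out of an ideal 10,000,
# with 400 objects that have no metadata record.
coll_score = collection_completeness(8_000, 10_000)
cat_score = catalog_completeness(400, 8_000)
print(round(coll_score, 2), round(cat_score, 2))
```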
Information about the
Repository
• Completeness
– # of collections
• Consistency
– # of collections
– Catalog/collection match
• How well do the catalogs match the collections?
• Are the catalogs for all the collections at the
same level of detail?
Service Information Need
• Composability (ability to be combined to form
new services)
– Extensibility
– Reusability
• Efficiency
– Response time
• Effectiveness
– Precision/recall (of search)
– Classification
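For search effectiveness, precision and recall have standard set-based definitions. A short sketch follows; the document identifiers are hypothetical, and returning 0.0 for an empty set is a convention chosen here.

```python
def precision(retrieved: set, relevant: set) -> float:
    """Fraction of retrieved items that are relevant."""
    if not retrieved:
        return 0.0  # convention chosen here for an empty result list
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved: set, relevant: set) -> float:
    """Fraction of relevant items that were retrieved."""
    if not relevant:
        return 0.0  # convention chosen here when nothing is relevant
    return len(retrieved & relevant) / len(relevant)


# Hypothetical search: four results returned, three documents judged relevant.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
p = precision(retrieved, relevant)  # 2 of 4 results are relevant
r = recall(retrieved, relevant)     # 2 of 3 relevant documents were found
print(p, round(r, 3))
```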
Service Information, continued
• Extensibility
– # extended services
– # services in the DL
– # lines of code per service manager
• Reusability
– # reused services
– # services in the DL
– # lines of code per service manager
• Reliability
– # service failures
– # accesses
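Reliability is the simplest of these to make concrete. A minimal sketch, assuming reliability is the share of accesses that did not end in a service failure (the counts are hypothetical log totals):

```python
def reliability(service_failures: int, accesses: int) -> float:
    """Share of service accesses that completed without a failure."""
    if accesses <= 0:
        raise ValueError("accesses must be positive")
    return 1.0 - service_failures / accesses


# Hypothetical log totals for a search service: 3 failures in 1,200 accesses.
search_reliability = reliability(3, 1_200)
print(round(search_reliability, 4))
```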
Making more concrete
• Each of the measures listed gives an
idea of the information need
• Exactly what do we measure?
• How do we combine numbers obtained
to get a usable result?
• Following pages describe specific
measures and formulas for combining
those.
Digital Object Accessibility
• Basic requirement
– If a user cannot access the DO, there is little point
in having it in the DL
– Identified measures:
• Collection, # structured streams, rights management
metadata, communities
– Say it another way:
• Is it present in a collection in the repository?
• Is there a service that can retrieve and display the
content?
• Is the rights management open enough for access by this
user?
Digital Object Accessibility - formally
Define $do_x$ = a specific digital object and $ac_y$ = a specific actor (user)
Accessibility = $acc(do_x, ac_y)$ =
– 0, if there is no collection $C$ in the DL repository $R$ such that $do_x \in C$
– Otherwise, $acc(do_x, ac_y) = \frac{\sum_{z \in struct\_streams(do_x)} r_z(ac_y)}{|struct\_streams(do_x)|}$
– where $r_z(ac_y)$ is a rights management rule defined as
• 1, if
– $z$ has no access constraints, or
– $z$ has access constraints and $ac_y \in cm_z$,
» where $cm_z \in Soc(1)$ is a community that has the right to access $z$; and
• 0, otherwise
This does not deal with accessibility issues related to accessing the streams themselves
An illustration
• NDLTD is the Networked Digital Library of
Theses and Dissertations
– Some institutions require that all theses and
dissertations be stored in this DL
– Student chooses how visible to make the
document.
• Parts of the document may be visible while other parts
are not
• The document, or parts of it, may be visible to a
restricted community.
Accessibility case
• $etd_x$ is a specific electronic thesis or dissertation of interest
• $acc(etd_x, ac_y)$ is
– 0 if it is not in the collection
– Otherwise $\frac{\sum_{z \in struct\_streams(etd_x)} r_z(ac_y)}{|struct\_streams(etd_x)|}$
• Where $r_z(ac_y)$ = 1
– if $etd_x$ is marked “world wide access”, or $etd_x$ is marked “local institution only” and $ac_y \in C$, where $C$ is defined as the identifiable members of the local institution
• = 0 otherwise
With the numbers
• An example from VT
• For authors whose names begin with A (219 entries):
– Unrestricted ETDs: 164
– Restricted ETDs: 50
– Mixed ETDs: 5
• Fraction unrestricted for each of the 5 mixed ETDs: (0.5, 0.5, 0.167, 0.1875, 0.6)
• Overall measure of accessibility outside VT:
– (164 × 1 + 50 × 0 + 0.5 + 0.5 + 0.167 + 0.1875 + 0.6)/219
– ≈ 0.76
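The accessibility definition and the VT figures can be checked with a few lines of Python. This is a sketch, not the authors' implementation: the `Stream` class, the actor names, and the choice to return 0.0 for an object with no streams are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class Stream:
    """A structured stream; `allowed` is None when there are no access constraints."""
    allowed: Optional[FrozenSet[str]] = None


def rights_rule(stream: Stream, actor: str) -> int:
    """Rights management rule: 1 if the actor may access the stream, else 0."""
    if stream.allowed is None:
        return 1
    return 1 if actor in stream.allowed else 0


def accessibility(streams: Sequence[Stream], actor: str, in_collection: bool = True) -> float:
    """Share of the object's structured streams that the actor may access."""
    if not in_collection or not streams:
        return 0.0  # the empty-streams case is an assumption to avoid dividing by zero
    return sum(rights_rule(s, actor) for s in streams) / len(streams)


# Hypothetical mixed ETD: one open part and one part restricted to campus users.
campus = frozenset({"vt_student"})
mixed_etd = [Stream(), Stream(allowed=campus)]
outside_score = accessibility(mixed_etd, "outside_reader")
campus_score = accessibility(mixed_etd, "vt_student")

# Collection-level figure from the slide: average the per-ETD scores
# as seen by a reader outside VT.
mixed_scores = [0.5, 0.5, 0.167, 0.1875, 0.6]
scores = [1.0] * 164 + [0.0] * 50 + mixed_scores
overall = sum(scores) / len(scores)
print(round(overall, 2))  # 0.76
```

Averaging per-object scores is what the slide's arithmetic does: each unrestricted ETD contributes 1, each restricted ETD contributes 0, and each mixed ETD contributes its unrestricted fraction.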
Solidifying Pertinence
• How do we measure something like
pertinence?
• Relation between the information
content of a digital object and the need
of the user
• Depends on the user’s situation: background, current context, etc.
Pertinence
• $Inf(do_i)$ represents the information content of digital object $do_i$
• $IN(ac_j)$ is the information need of actor (user) $ac_j$
• $Context(ac_j, k)$ is the combined effect of the social factors that determine the pertinence of $do_i$ to $ac_j$ at time $k$
• Two communities of actors
– Users whose information needs we try to satisfy
– External judges who are responsible for judging the relevance of a document in response to a query
– The two groups do not overlap
Pertinence formula
• $Pertinence(do_i, ac_j, k)$: $Inf(do_i) \times IN(ac_j) \times Context(ac_j, k) \to \{0, 1\}$, defined as
– 1 if $Inf(do_i)$ is judged by $ac_j$ to be informative with regard to $IN(ac_j)$ in context $Context(ac_j, k)$
– 0 otherwise
• Rather complex way to say that the
information is relevant if either the user or a
qualified independent judge says it is
Preservability
• Property of a digital object that describes its state relative to changes in hardware, software, and representation format standards
– Ex. new recording technologies (replacement of VHS video tapes by DVDs)
– New versions of software such as Word or Acrobat
– New image standards such as JPEG 2000
Digital preservation techniques
• Migration (most commonly used)
– Transform from one format to another
• Ex. Open the document in one format and save in another or do an
automated transformation
• Emulation
– Reproducing the effect of the environment originally used to
display the material
• Keep an old version of the software, or have new software that can
read the old format
• Wrapping
– Keep the original format, but add enough human-readable
metadata so that it can be decoded in the future
• Note that the material is not directly usable
• Refreshing
– Copy the stream of bits from one location to another
• Particularly suitable for guarding against the physical deterioration of
the medium
Preservability issues
• Obsolescence
– How out of date is the digital object?
• Many versions of the software?
• Old storage media?
– Difficult to migrate
• Appropriate tools? Expertise?
• Fidelity
– How different is the migrated version from the original?
– Distortion = loss of information
• Preservability of a digital object in a digital library is a function
of the fidelity of the migration and the obsolescence of the
object
• $preservability(do_i, dl) = (fidelity(do_i, format_x, format_y), obsolescence(do_i, dl))$, where the fidelity is that of migrating $do_i$ from $format_x$ to $format_y$
– Two values to reflect the two dimensions of the concept: fidelity and
obsolescence
Miniclip Internet Archive
Preservability factors
• Capital direct costs
– Software
• Developing software to create new versions of the object
or obtaining licenses for new versions of the original
software
– Hardware
• For processing the migration and for storing the results
• Indirect operating costs
– Monitoring digital objects for migration needs
– Maintaining up-to-date intellectual property rights
– Storage
– Staff training
Calculating Obsolescence
• $obsolescence(do_i, dl)$ = cost of converting/migrating the digital object $do_i$ within the context of a specific digital library
Calculating fidelity
• Fidelity is the inverse of distortion:
$fidelity(do_i, format_x, format_y) = \frac{1}{distortion(mp(format_x, format_y)) + 1.0}$
– No distortion must yield a fidelity of 1.0
• One common measure of distortion
– mean squared error (mse)
• Let $\{x_n\}$ be a stream of $do_i$ and $\{y_n\}$ be the converted stream
• $mse(\{x_n\}, \{y_n\}) = \frac{1}{N} \sum_{n=1}^{N} (x_n - y_n)^2$
Use mse for distortion:
$fidelity(do_i, format_x, format_y) = \frac{1}{mse(\{x_n\}, \{y_n\}) + 1.0} = \frac{1}{\frac{1}{N} \sum_{n=1}^{N} (x_n - y_n)^2 + 1.0}$
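The fidelity formula is easy to try on small streams. The sketch below treats a stream as a list of numeric sample values, which is an assumption for illustration; the sample values are made up.

```python
from typing import Sequence


def mse(original: Sequence[float], converted: Sequence[float]) -> float:
    """Mean squared error between a stream and its converted version."""
    if len(original) != len(converted) or len(original) == 0:
        raise ValueError("streams must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(original, converted)) / len(original)


def fidelity(original: Sequence[float], converted: Sequence[float]) -> float:
    """Fidelity as 1 / (distortion + 1), using mse as the distortion measure."""
    return 1.0 / (mse(original, converted) + 1.0)


# Hypothetical sample values before and after a lossy migration.
original = [10, 20, 30, 40]
converted = [10, 22, 29, 40]

lossless = fidelity(original, original)  # no distortion
lossy = fidelity(original, converted)    # squared errors 0, 4, 1, 0 give mse 1.25
print(lossless, round(lossy, 3))
```

The "+ 1.0" in the denominator is what makes a lossless migration score exactly 1.0 instead of dividing by zero.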
A Preservation Scenario
From Gonçalves, adapted from one of his sources
• Librarian learns that special collection of 1,000 digital images, stored
in TIFF v5.0, is in danger of obsolescence because the latest
version of the display software does not support that version.
• Librarian decides to migrate all images to JPEG 2000, now the de
facto image preservation standard, recommended by the Research
Libraries Group (RLG)
• Librarian does search for options, finds a tool costing $500, that
converts TIFF 5.0 to JPEG 2000
• About 20 hours needed to order, install, learn, apply the software to
all images. Hourly rate of $66.60 per library employee.
• To save space, choose to use a compression rate that produces
average mse = 8 per image.
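• Preservability of each image = preservability(image-TIFF5.0, dl) = (1/(8 + 1), ($500 + $66.60 × 20)/1000) = (0.11, $1.83)
– First value, fidelity: 1/(distortion + 1), with distortion = mse = 8
– Second value, obsolescence cost: (tool cost + hourly rate × hours) / # images
– Fidelity is at most 1.0 (no distortion), so higher is better; obsolescence is a cost, so lower is better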
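The scenario's arithmetic can be reproduced directly. This sketch uses only the figures given in the scenario; the function and variable names are chosen here for illustration.

```python
def scenario_fidelity(mse: float) -> float:
    """Fidelity as 1 / (distortion + 1), with mse as the distortion."""
    return 1.0 / (mse + 1.0)


def obsolescence_per_object(tool_cost: float, hourly_rate: float,
                            hours: float, n_objects: int) -> float:
    """Migration cost per object: software plus staff time, spread over all objects."""
    if n_objects <= 0:
        raise ValueError("n_objects must be positive")
    return (tool_cost + hourly_rate * hours) / n_objects


# Figures from the scenario: $500 tool, 20 hours at $66.60, 1,000 images, mse = 8.
image_fidelity = scenario_fidelity(8)
image_cost = obsolescence_per_object(500.0, 66.60, 20, 1000)
preservability = (round(image_fidelity, 2), round(image_cost, 2))
print(preservability)  # (0.11, 1.83)
```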
Relevance
• $Relevance(do_i, q)$
= 1 if $do_i$ is judged by an external judge to be relevant to query $q$
= 0 otherwise
• Measure of the distance between the vector representing the object and the vector representing the query
• The “external judge” requirement makes the
measure objective and independent of local
contextual issues. Relevance has a consistency,
independent of the momentary information need.
• Pertinence is a measure of usefulness within a
particular information need.
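The slides do not say how the object and query vectors are built or compared. One common choice, assumed here, is TF-IDF weighting with cosine similarity, which draws on the feature frequency and inverse document frequency inputs listed earlier. The tiny corpus and all names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vector(tokens: List[str], doc_freq: Dict[str, int], n_docs: int) -> Dict[str, float]:
    """Weight each term by its frequency times log(n_docs / document frequency)."""
    counts = Counter(tokens)
    return {t: f * math.log(n_docs / doc_freq[t])
            for t, f in counts.items() if doc_freq.get(t)}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical three-document collection and a two-word query.
docs = [
    ["digital", "library", "quality"],
    ["image", "compression", "quality"],
    ["digital", "preservation"],
]
doc_freq = Counter(term for doc in docs for term in set(doc))
query_vec = tfidf_vector(["digital", "library"], doc_freq, len(docs))
scores = [cosine(tfidf_vector(doc, doc_freq, len(docs)), query_vec) for doc in docs]

# Rank documents from most to least similar to the query.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

A score like this is what a retrieval service computes; the binary relevance value in the definition above still comes from the external judge's decision.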
Significance
• Significance is an expression of the
absolute usefulness of a given digital
object, independent of particular user
needs.
• Citation records of objects in digital
libraries offer one measure of significance.
(This disadvantages the most recently
obtained objects, since they have had less
time to be cited by others.)
Look at ACM DL and the citation counts,
for example.
Life Cycle and Quality
• The quality indicators relate to the core components of a
digital library – creation, use, finding, distribution.
• Creation
– Authoring, modifying
– Describing, Organizing, Indexing
• Use
– Access, filtering
• Finding (seeking)
– Searching, Browsing, recommending
• Distribution
– Storing
– Archiving
– Networking
Quality and Lifecycle - 2
Quality and Life Cycle - 3
• Note that some elements repeat
– Timeliness is relevant to the content and to
the metadata that describes the content
– Accessibility affects both usefulness and
distribution.
References
• Gonçalves, M. A., Moreira, B. L., Fox, E. A.,
and Watson, L. T. “Quality Model for Digital
Libraries”.