Transcript: Digital Libraries
Based on the Draft Book “Foundations for Digital Libraries”
Prof. Marcos Andre Goncalves, Universidade Federal de Minas Gerais
Synchronous Scholarly Communication
Same time, Same or different place
Asynchronous, Digital Library Mediated Scholarly Communication
Different time and/or place
Digital Libraries Shorten the Chain from
Editor, Reviewer, Publisher, A&I, Library
DLs Shorten the Chain to
A Digital Library connecting: Author, Teacher, Reader, Editor, Reviewer, Learner, Librarian
DL Overview: Why of Global Interest?
• National projects can preserve antiquities and heritage: cultural, historical, linguistic, scholarly
• Knowledge and information are essential to economic and technological growth, education
• DL - a domain for international collaboration
– wherein all can contribute and benefit
– which leverages investment in networking
– which provides useful content on Internet & WWW
– which will tie nations and peoples together more strongly and through deeper understanding
Digital Libraries --- Objectives
• World Lit.: 24hr / 7day / from desktop
• Integrated “super” information systems: 5S: Table of related areas and their coverage
• Ubiquitous, Higher Quality, Lower Cost
• Education, Knowledge Sharing, Discovery
• Disintermediation -> Collaboration
• Universities Reclaim Property
• Interactive Courseware, Student Works
• Scalable, Sustainable, Usable, Useful
How is a DL different from a database?
• A traditional SQL database has as its basic element data items in a relation:
– select name from employee, project where employee.deptnumber = “25” AND project.number = “100”
• Databases exploit known structures and relations
• DBMS retrieval is not probabilistic (Frakes, Baeza-Yates, p. 3)
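The contrast between exact DBMS retrieval and ranked retrieval can be sketched in Python. This is only an illustration under stated assumptions: the `employee` rows, the document strings, and the function names `exact_match` and `ranked_match` are invented here, and the term-overlap score is a crude stand-in for a real probabilistic or vector model.

```python
import sqlite3

def exact_match(dept: str) -> list[str]:
    """DBMS-style retrieval: a row either satisfies the predicate or it does not."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employee (name TEXT, deptnumber TEXT)")
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?)",
        [("Ana", "25"), ("Bo", "30"), ("Cy", "25")],  # invented rows
    )
    rows = conn.execute(
        "SELECT name FROM employee WHERE deptnumber = ? ORDER BY name", (dept,)
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]

def ranked_match(query: str, docs: dict[str, str]) -> list[tuple[str, float]]:
    """IR-style retrieval: every document gets a score and results are ranked."""
    q_terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        d_terms = set(text.lower().split())
        score = len(q_terms & d_terms) / len(q_terms)  # fraction of query terms found
        scored.append((doc_id, score))
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

The SQL query returns a set with no notion of "better" or "worse" answers, while the ranked version returns every document with a degree of match.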
How is a DL different from the WWW?
• The keyword is managed
– The WWW is not managed
• Some meta searchers (Yahoo, Lycos) attempt to add an organizational framework to their web holdings
– However, most are focused on keyword searching (e.g., Google)
How is a DL different from the WWW?
• Another key difference is who controls the input into the system
– most meta searchers hunt down their holdings
• Lycos is short for Lycosidae lycosa (the wolf spider), which pursues its prey and does not build a web (Mauldin, IEEE Expert, 1/97)
– some (Yahoo) have humans in the loop for review and classification
• To date, DLs are generally more tightly controlled, and have a targeted customer set
DL = Content + Services
[Figure: layered diagram]
– Access: WWW (http) access (most common); non-WWW access (now uncommon)
– Digital Library Services (searching, browsing, citation analysis, usage analysis, alerts), including vector and/or Boolean search engines (traditional IR)
– Content, held in: RDBMS, file systems, other technologies
• WWW by itself has low archival & management characteristics
• In the same way that a card catalog is not a TL, an RDBMS is not a DL, only a candidate technology for use in DLs
• DL is the union of the content and services defined on the content
How is a DL Different from a Traditional Library?
• TL has as its focus physical objects
– even if the card catalog (metadata) is electronic, the purpose is to point you to a physical location
– trafficking in physical objects has both obvious and subtle implications
• object can exist only in 1 place
• if you have it, I can't
• I have to go to the object, or wait for it to come to me
TLs vs. DLs
• DLs clearly better than TLs at:
– Dissemination, storing information variety
• However, TL objects are more survivable
– Who will archive the research information?
• the publishers?
• the institutions?
• the authors?
– Will the average DL object still be accessible in 10 years?
• take my digital preservation seminar in the spring!
image from: http://www.ancientegypt.co.uk/writing/rosetta.html
How is a DL Different from a Traditional Library?
• Digital Library
– removing the physical restriction has obvious benefits
• multiple access, multiple listings, electronic transmission
– also complicates many other issues...
• intellectual property, terms and conditions, etc.
• Note that a TL offers additional social and educational benefits
– Most TLs offer hybrid services too.
DL Definitions - 1
• “A digital library is an organized and focused collection of digital objects, including text, images, video, and audio, along with methods of access and retrieval, and for selection, creation, organization, maintenance, and sharing of the collection.”
– Witten & Bainbridge, “How to Build a Digital Library”, Morgan Kaufmann 2003
DL Definitions - 2
• “Digital libraries are organizations that provide the resources, including the specialized staff, to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily and economically available for use by a defined community or set of communities”
– Waters, D.J., CLIR Issues, July/August 1998
– www.clir.org/pubs/issues/issues04.html
Informal 5S & DL Definitions
DLs are complex systems that
• help satisfy info needs of users (societies)
• provide info services (scenarios)
• organize info in usable ways (structures)
• present info in usable ways (spaces)
• communicate info with users (streams)
5Ss
• Streams
– Examples: Text; video; audio; image
– Objectives: Describes properties of the DL content such as encoding and language for textual material or particular forms of multimedia data
• Structures
– Examples: Collection; catalog; hypertext; document; metadata
– Objectives: Specifies organizational aspects of the DL content
• Spaces
– Examples: Measure; measurable, topological, vector, probabilistic
– Objectives: Defines logical and presentational views of several DL components
• Scenarios
– Examples: Searching, browsing, recommending
– Objectives: Details the behavior of DL services
• Societies
– Examples: Service managers, learners, teachers, etc.
– Objectives: Defines managers, responsible for running DL services; actors, that use those services; and relationships among them
5S and DL formal definitions and compositions (April 2004 TOIS)
[Figure: concept map of the formal definitions (d. = definition number, as printed on the slide) and how they compose]
• Underlying concepts: relation (d.1), function (d.2), sequence (d.3), tuple (d.4)*, language (d.5), graph (d.6), grammar (d.7), event (d.10), state (d.18); measurable (d.12), measure (d.13), probability (d.14), vector (d.15), topological (d.16) spaces
• 5S: streams (d.9), structures (d.10), spaces (d.18), scenarios (d.21), societies (d.24)
• DL concepts: services (d.22), transmission (d.23), structural metadata specification (d.25), descriptive metadata specification (d.26), structured stream (d.29), digital object (d.30), collection (d.31), metadata catalog (d.32), repository (d.33), indexing service (d.34), searching service (d.35), hypertext (d.36), browsing service (d.37)
• digital library (minimal) (d.38)
ETANA-DL
• Archaeological DL
• Integrated DL
– Heterogeneous data handling
• Applies and extends the OAI-PMH
– Open Archives Initiative Protocol for Metadata Harvesting
• Design considerations
– Componentized
– Extensible
– Portable
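For readers unfamiliar with OAI-PMH, a harvester issues plain HTTP requests whose query string names a protocol verb. The sketch below only builds such a request URL; it is not ETANA-DL's code. The base URL and set name used in the usage note are hypothetical, while `ListRecords`, `metadataPrefix`, `set`, and the `oai_dc` prefix are defined by the protocol itself.

```python
from typing import Optional
from urllib.parse import urlencode

def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL for a repository's base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optional selective harvesting by set
    return f"{base_url}?{urlencode(params)}"
```

For example, `list_records_url("http://example.org/oai", set_spec="site1")` yields a URL that a harvester could fetch to retrieve Dublin Core records for a (hypothetical) set named `site1`.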
Initial ETANA-DL Member Locations
• Canadian University College
• Walla Walla College
• Willamette University
• Andrews University
• CWRU
• Virginia Tech
• Vanderbilt University
• Mississippi State University
Map courtesy: www.enchantedlearning.com
Lahav Website
Megiddo Opening Screen
Locus Screen: Pictures
View all
Area Screen
ETANA-DL Approach
• Applying and extending Digital Library (DL) techniques to solve key problems: making primary data available, data preservation, and interoperability
• Modeling archaeological information systems using 5S to better understand the domain and design the system and the supporting services
• Rapidly prototyping DLs that handle heterogeneous archaeological data using componentized frameworks:
– eliciting requirements
– refining metamodel and union schema
– modeling sites
– mapping
– harvesting
– providing useful services
ETANA-DL Website
Marking Items
Marking – writing notes for a specific user
Marked Items Display: Sender, Date, Object OAI ID, Sender Comments
Options: View Record, Add record to Items Of Interest, Re-mark item (Redirect), Unmark item (Remove item from list)
Discussions Page Discussions about an object View/Post messages, create new threads
Recommendations Items recommended on the basis of similar interests
ETANA-DL Searching Service Search
ETANA-DL Multi-dimensional Browsing 3 new sites 2 new types of artifacts
ETANA-DL Visual Browsing Service By site Visual Browse
Visual Browsing Nimrin: Topographical Drawings Full site Square: N40/W20 North west quadrant
Visual Browsing Nimrin : Square information Square: N40/W20 Locus: 86 Loci layout
Visual Browsing Nimrin : locus sheet
Visual Browsing Bab edh-Dhra' Cemetery Pottery # 25
ETANA Societies
1. Historic and pre-historic societies (being studied)
2. Archaeologists (in academic institutes, fieldwork settings, or local and national governmental bodies)
3. Project directors
4. Technical staff (consisting of photographers, technical illustrators, and their assistants)
5. Field staff (responsible for the actual work of excavation)
6. Camp staff (e.g., camp managers, registrars, tool stewards)
7. General public (e.g., educators, learners, citizens)
ETANA Societies
• Social issues
1. Who owns the finds?
2. Where should they be preserved?
3. What nationality and ethnicity do they represent?
4. Who has publication rights?
5. What interactions took place between those at the site studied, and others? What theories are proposed by whom about this?
ETANA Scenarios
1. Life in the site in former times
2. Digital recording: the planning stage and the excavation stage
3. Planning stage: remote sensing, fieldwalking, field surveys, building surveys, consulting historical and other documentary sources, and managing the sites and monuments
4. Excavation
   1. Detailed information is recorded, including for each layer of soil, and for features such as pole holes, pits, and ditches.
   2. Data about each artifact is recorded together with information about its exact find spot.
   3. Numerous environmental and other samples are taken for laboratory analysis, and the location and purpose of each is carefully recorded.
   4. Large numbers of photographs are taken, both general views of the progress of excavation and detailed shots showing the contexts of finds.
5. Organization and storage of material
6. Analysis and hypotheses generation and testing
7. Publications, museum displays
8. Information services for the general public
ETANA Spaces
1. Geographic distribution of found artifacts
2. Temporal dimension (as inferred by archaeologists)
3. Metric or vector spaces
   1. used to support retrieval operations, and to calculate distance (and similarity)
   2. used to browse / constrain searches spatially
4. 3D models of the past, used to reconstruct and visualize archaeological ruins
5. 2D interfaces for human-computer interaction
ETANA Structures
1. Site Organization
   1. Region, site, partition, sub-partition, locus, …
2. Temporal orderings (ages, periods)
3. Taxonomies
   1. for bones, seeds, building materials, …
4. Stratigraphic relationships
   1. above, beneath, coexistent
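Stratigraphic "above / beneath" relationships form a directed graph, which makes them a natural example of a 5S structure. The sketch below is not from ETANA-DL: the locus labels are hypothetical, and it assumes the usual superposition reading that a lower locus was deposited before the one above it.

```python
from graphlib import TopologicalSorter

def deposition_order(above: list[tuple[str, str]]) -> list[str]:
    """Order loci from lowest (assumed oldest) to highest.

    Each pair is (upper, lower), read as "upper lies above lower".
    """
    sorter = TopologicalSorter()
    for upper, lower in above:
        # Register `lower` as a predecessor of `upper`, so it is emitted first.
        sorter.add(upper, lower)
    # static_order() raises graphlib.CycleError if the relations contradict each other.
    return list(sorter.static_order())
```

A contradictory record set (A above B and B above A) surfaces as a `CycleError`, which is one simple way such a structure could support consistency checks on field data.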
ETANA Streams
1. successive photos and drawings of excavation sites, loci, unearthed artifacts
2. audio and video recordings of excavation activities and discussions
3. textual reports
4. 3D models used to reconstruct and visualize archaeological ruins.
Streams
• Multiple media types and representation
– See ch. 4 for IR (except some here for non-text)
– Standards for each, and for some combinations
• Text
– Character strings, encoding (Unicode)
– Morphology -> Stemming
– Syntax, semantics -> stop words
– ** POS tagging, phrases
• Images, Audio, Video, Graphics, Animation
– Capture, digitization, representation
– CBIR for each
• ** Compression, processing, analysis
• ** Synchronization, rendering, presentation, interchange
– RealVideo, SMIL, QoS
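The text-stream steps listed above (character strings, stop words, stemming) can be sketched as a tiny pipeline. Everything here is an illustrative assumption: the stop-word list is deliberately tiny, and `naive_stem` is crude suffix stripping, not the Porter algorithm or any stemmer a real DL would use.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and", "in", "were", "was", "is", "are"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "es", "s")  # naive suffix stripping, NOT the Porter algorithm

def naive_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text: str) -> list[str]:
    """Lowercase, tokenize on runs of letters, drop stop words, then stem."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # Unicode-aware: letters only
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

For instance, "The excavated artifacts were recorded in the catalogs" reduces to the terms `excavat`, `artifact`, `record`, `catalog`.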
Content Based Information Retrieval
Problems
• Image similarity is subjective
– Personal Interpretation
• Concept x Appearance
By Visual features
– Retrieve images with 50 percent of white colour and 50 percent of black colour
Query result
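A query by visual features such as "50 percent white, 50 percent black" can be approximated with a very coarse color histogram. The sketch below assumes images are small grids of 8-bit grayscale values and uses arbitrary thresholds (200 and 55) to decide what counts as white or black; both choices are assumptions made for illustration, not the descriptors used in the work presented here.

```python
def color_fractions(pixels):
    """Return (white_fraction, black_fraction) for an 8-bit grayscale pixel grid."""
    flat = [p for row in pixels for p in row]
    white = sum(1 for p in flat if p >= 200) / len(flat)  # assumed "white" threshold
    black = sum(1 for p in flat if p <= 55) / len(flat)   # assumed "black" threshold
    return white, black

def query_by_color(images, white_target=0.5, black_target=0.5):
    """Rank image ids by L1 distance between their color fractions and the target."""
    def distance(item):
        white, black = color_fractions(item[1])
        return abs(white - white_target) + abs(black - black_target)
    return [img_id for img_id, _ in sorted(images.items(), key=distance)]
```

The result is a ranking rather than a yes/no answer: images close to the requested color mix come first, and no image is excluded outright.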
Textual information retrieval
Query on Google using “Sunset” and “Rio de Janeiro”
Image Classification by shape
Work of Torres et al
• Search in collections of fish images using a combination of
– image properties (CBIR) and
– textual descriptions
Motivation
• Query 1:
– List all metadata related to fish which were observed in the Amazon River
• Query 2:
– Retrieve images of fishes whose shape is similar to that in the example
• Query 3:
– List all metadata related to fishes that were observed in the Amazon River and whose shape is similar to that in the example
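Query 3 combines a metadata predicate with a content-based similarity test. A minimal sketch of that combination follows, assuming each record carries a numeric shape descriptor and that Euclidean distance is an acceptable similarity measure. The `FishRecord` class, its fields, and the distance threshold are hypothetical and are not taken from the system of Torres et al.

```python
import math
from dataclasses import dataclass

@dataclass
class FishRecord:
    species: str
    river: str
    shape: tuple[float, ...]  # hypothetical shape descriptor vector

def combined_query(records, river, example_shape, max_distance):
    """Filter by metadata (river), then keep records whose shape is near the example."""
    hits = []
    for rec in records:
        if rec.river != river:
            continue  # metadata predicate: exact match
        dist = math.dist(rec.shape, example_shape)  # content predicate: similarity
        if dist <= max_distance:
            hits.append((dist, rec.species))
    return [species for _, species in sorted(hits)]  # most similar first
```

The metadata part behaves like a database filter, while the shape part is a ranked, approximate match.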
Motivation
• Retrieve fish descriptions whose shapes are similar to the one shown below, that belong to the “Notropis” genus, that have large eyes, and that have been observed in the “Tennessee River”
Problem
• There is no Biodiversity Information System which allows queries involving:
– Geographic data
– Species metadata
– Image descriptors
• Existing systems:
– Metadata or
– Metadata + spatial data
– Images are stored as separate files
• With no possibility of retrieval by content
WeBioS
Torres: Visualizations
Concentric Rings Pattern Spiral Pattern
Structures
• Digital Objects
– Documents, digitization, packaging (METS), interchange, standards, format conversion
– Genre: plays, encyclopedia, dictionaries, educational resources: courses (e.g., syllabi) and lessons
– Structural organizations (books, chapters, sections), excerpts/spans (mark, superimposed info)
• Metadata: standards, markup
• Knowledge Structures & Representations
– Databases, Schema, Ontologies, Thesauri, Lexicons, Authority files, Concept maps, Semantic networks
• Indexes
– Inverted files, signature files, R-trees, Quad trees, etc.
• Clusters & Classification Schemes
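Of the index structures listed, the inverted file is the simplest to illustrate: it maps each term to the set of documents containing it. The sketch below uses whitespace tokenization and invented document texts, and omits everything a production index would add (positions, weights, compression).

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def lookup(index, term):
    """Return the sorted posting list for a term (empty if the term is unknown)."""
    return sorted(index.get(term.lower(), set()))
```

Lookup cost depends on the length of the posting list, not on the size of the collection, which is why inverted files underpin most text search services.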
Degree of Structure
• Web: Chaotic
• DLs: Organized
• DBs: Structured
Digital Objects (DOs)
• Born digital
• Digitized version of “real” object
– Is the DO version the same, better, or worse?
– Decision for ETDs: structured + rendered
• Surrogate for “real” object
– Not covered explicitly in metamodel for a minimal DL
– Crucial in metamodel for archaeology DL
Metadata Objects (MDOs)
• MARC
• Dublin Core
• RDF
• IMS
• OAI (Open Archives Initiative)
• Crosswalks, mappings
• Ontologies
• Topic maps, concept maps
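A crosswalk is, at its simplest, a table that maps fields of one metadata scheme onto another. The sketch below maps a few MARC tags to Dublin Core elements. The mapping is a simplified, illustrative subset (real crosswalks work at the subfield level and cover many more fields), and the record representation as (tag, value) pairs is an assumption made for brevity.

```python
# Simplified, illustrative subset of a MARC -> Dublin Core crosswalk.
MARC_TO_DC = {
    "100": "creator",    # main entry, personal name
    "245": "title",      # title statement
    "260": "publisher",  # publication information (simplified)
    "650": "subject",    # topical subject heading
}

def crosswalk(marc_record):
    """Convert a list of (tag, value) pairs into a dict of Dublin Core elements."""
    dc = {}
    for tag, value in marc_record:
        element = MARC_TO_DC.get(tag)
        if element is not None:  # unmapped tags are dropped: the mapping is lossy
            dc.setdefault(element, []).append(value)
    return dc
```

Going from a complex scheme to a simple one loses information, which is why unmapped fields simply disappear in this sketch.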
Complex to Simple
MARC ($50) Dublin Core (DC) + thesis
Spaces
• Retrieval models
– Boolean, extended Boolean
– Vector, LSI
– Probabilistic: classical, belief network, inference network, language models
• User interfaces and visualization
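The difference between the Boolean and vector models can be shown on a toy collection. This is a sketch under simplifying assumptions: documents are invented strings, tokenization is a whitespace split, and vectors hold raw term counts with no tf-idf weighting.

```python
import math
from collections import Counter

def boolean_and(query, docs):
    """Boolean model: return ids of documents containing every query term."""
    terms = set(query.lower().split())
    return sorted(d for d, text in docs.items() if terms <= set(text.lower().split()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def vector_rank(query, docs):
    """Vector model: rank all documents by cosine similarity to the query."""
    q = Counter(query.lower().split())
    scores = {d: cosine(q, Counter(text.lower().split())) for d, text in docs.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))
```

The Boolean query partitions the collection into matches and non-matches, while the vector model places every document somewhere in a space and orders them by closeness to the query.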
User interfaces and visualization
• 2D interfaces
• 3D interfaces
• GIS
• Other paradigms
Scenarios
• Recall OO for streams
– now have objects as well as scenarios
– e.g., interface components
• Information Access
– Searching: ad hoc, filtering/routing
– Browsing: using an organization, using a visualization, using links (i.e., hypertext, hypermedia)
– Workflow: sessions, feedback, etc.
• Scenario-based Design
• Usability: goals, tasks, claims
• NOTE: this is covered in the outline
Societies
• User communities
– Authors, editors, teachers, students, readers
– Personal(ization), group(ware), community, global
– Accessibility, universal access
• Librarians: reference, acquisition, operations
• Research community
– Associations, conferences, publications, labs, projects
• Economics
– Copyright, intellectual property rights, digital rights management, authorization, authentication, security, privacy, self-archiving (eprints)
– Publishers, catalogers, distributors, sustainability
– Open source, commercial, hybrid