Digital Libraries Based on Draft Book “Foundations for


Digital Libraries

Prof. Marcos Andre Goncalves Universidade Federal de Minas Gerais

Synchronous Scholarly Communication

Same time, Same or different place

Asynchronous, Digital Library Mediated Scholarly Communication

Different time and/or place

Digital Libraries Shorten the Chain from

Editor Reviewer Publisher A&I Library

DLs Shorten the Chain to

[Diagram: Author, Teacher, Reader, Editor, Reviewer, Learner, and Librarian, all connected through the Digital Library]

DL Overview Why of Global Interest?

• National projects can preserve antiquities and heritage: cultural, historical, linguistic, scholarly
• Knowledge and information are essential to economic and technological growth, education
• DL - a domain for international collaboration
– wherein all can contribute and benefit
– which leverages investment in networking
– which provides useful content on Internet & WWW
– which will tie nations and peoples together more strongly and through deeper understanding

Digital Libraries --- Objectives

• World Lit.: 24hr / 7day / from desktop
• Integrated “super” information systems: 5S: Table of related areas and their coverage
• Ubiquitous, Higher Quality, Lower Cost
• Education, Knowledge Sharing, Discovery
• Disintermediation -> Collaboration
• Universities Reclaim Property
• Interactive Courseware, Student Works
• Scalable, Sustainable, Usable, Useful

How is a DL different from a database?

• A traditional SQL database has as its basic element data items in a relation:
– select name from employee, project where employee.deptnumber = '25' AND project.number = '100'
• Databases exploit known structures and relations
• DBMS retrieval is not probabilistic (Frakes, Baeza-Yates, p. 3)
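The contrast between exact-match database retrieval and ranked retrieval can be sketched as follows. This is only an illustration: the table contents, the `ranked` helper, and its term-overlap score are assumptions made up for the example, not part of the slides.

```python
import sqlite3

# Hypothetical schema and rows, invented only to make the slide's query runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (name TEXT, deptnumber TEXT);
    CREATE TABLE project (number TEXT, title TEXT);
    INSERT INTO employee VALUES ('Ana', '25'), ('Bo', '25'), ('Cy', '30');
    INSERT INTO project VALUES ('100', 'Archive'), ('200', 'Portal');
""")

# The slide's query: every row either satisfies the predicate or it does not.
rows = conn.execute(
    "SELECT name FROM employee, project "
    "WHERE employee.deptnumber = '25' AND project.number = '100' "
    "ORDER BY name"
).fetchall()
names = [r[0] for r in rows]


def ranked(query, docs):
    """Order documents by how many query terms they share (a toy relevance score)."""
    q = set(query.lower().split())
    scored = [(len(q & set(d.lower().split())), d) for d in docs]
    return [d for score, d in sorted(scored, key=lambda p: -p[0]) if score > 0]


docs = ["digital library services", "library card catalog", "relational database"]
hits = ranked("digital library", docs)
```

The SQL result is a set with no notion of "better" or "worse" matches, while the IR-style result is an ordering in which a partial match still appears, just lower down.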

How is a DL different from the WWW?

• The keyword is managed
– The WWW is not managed
• Some meta searchers (Yahoo, Lycos) attempt to add an organizational framework to their web holdings
– However, most are focused on keyword searching (e.g., Google)

How is a DL different from the WWW?

• Another key difference is who controls the input into the system
– most meta searchers hunt down their holdings
• Lycos is short for Lycosidae lycosa (the “wolf spider”), which pursues its prey and does not build a web (Mauldin, IEEE Expert, 1/97)
– some (Yahoo) have humans in the loop for review and classification
• To date, DLs are generally more tightly controlled, and have a targeted customer set

DL = Content + Services

[Figure: Digital Library Services (searching, browsing, citation analysis, usage analysis, alerts) layered over Content. Access paths: WWW (http) access (most common); non-WWW access (now uncommon). Candidate technologies: vector and/or Boolean search engines (traditional IR), RDBMS, file systems, other technologies.]
• “ ”?
– WWW by itself has low archival & management characteristics
• “ ”
– In the same way that a card catalog is not a TL, an RDBMS is a candidate technology for use in DLs
• DL is the union of the content and services defined on the content

How is a DL Different from a Traditional Library?

TL has as its focus physical objects
– even if the card catalog (metadata) is electronic, the purpose is to point you to a physical location
– trafficking in physical objects has both obvious and subtle implications
• object can exist only in 1 place
• if you have it, I can’t
• I have to go to the object, or wait for it to come to me

TLs vs. DLs

• DLs clearly better than TLs at:
– Dissemination, storing information variety
• However, TL objects are more survivable
– Who will archive the research information?


• the publishers?

• the institutions?

• the authors?

– Will the average DL object still be accessible in 10 years?

• take my digital preservation seminar in the spring!

image from: http://www.ancientegypt.co.uk/writing/rosetta.html

How is a DL Different from a Traditional Library?

• Digital Library
– removing the physical restriction has obvious benefits
• multiple access, multiple listings, electronic transmission
– also complicates many other issues...
• intellectual property, terms and conditions, etc.
• Note that a TL offers additional social and educational benefits
– Most TLs also offer hybrid services too.

DL Definitions - 1

• “A digital library is an organized and focused collection of digital objects, including text, images, video, and audio, along with methods of access and retrieval, and for selection, creation, organization, maintenance, and sharing of the collection.” • Witten & Bainbridge – “How to Build a Digital Library” – Morgan Kaufmann 2003

DL Definitions - 2

• “Digital libraries are organizations that provide the resources, including the specialized staff, to select, structure, offer intellectual access to, interpret, distribute, preserve the integrity of, and ensure the persistence over time of collections of digital works so that they are readily and economically available for use by a defined community or set of communities”
• Waters, D.J., CLIR Issues, July/August 1998
• www.clir.org/pubs/issues/issues04.html

Informal 5S & DL Definitions

DLs are complex systems that
• help satisfy info needs of users (societies)
• provide info services (scenarios)
• organize info in usable ways (structures)
• present info in usable ways (spaces)
• communicate info with users (streams)

5Ss

Ss | Examples | Objectives
Streams | Text; video; audio; image | Describes properties of the DL content such as encoding and language for textual material or particular forms of multimedia data
Structures | Collection; catalog; hypertext; document; metadata | Specifies organizational aspects of the DL content
Spaces | Measure; measurable, topological, vector, probabilistic | Defines logical and presentational views of several DL components
Scenarios | Searching, browsing, recommending | Details the behavior of DL services
Societies | Service managers, learners, teachers, etc. | Defines managers, responsible for running DL services; actors, that use those services; and relationships among them

5S and DL formal definitions and compositions (April 2004 TOIS)

[Figure: dependency diagram of the formal definitions, from mathematical foundations up to the minimal digital library. Definition numbers as shown on the slide:]
• Foundations: relation (d. 1), function (d. 2), sequence (d. 3), tuple (d. 4)*, language (d. 5), graph (d. 6), grammar (d. 7), event (d. 10), state (d. 18)
• Spaces: measurable (d. 12), measure (d. 13), probability (d. 14), vector (d. 15), topological (d. 16)
• 5S: streams (d. 9), structures (d. 10), spaces (d. 18), scenarios (d. 21), societies (d. 24)
• Built on 5S: services (d. 22), transmission (d. 23), structural metadata specification (d. 25), descriptive metadata specification (d. 26), structured stream (d. 29), digital object (d. 30), collection (d. 31), metadata catalog (d. 32), repository (d. 33), indexing service (d. 34), searching service (d. 35), hypertext (d. 36), browsing service (d. 37)
• digital library (minimal) (d. 38)
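One way to read the top of this diagram is that a minimal digital library is composed of a repository of collections, metadata catalogs for those collections, a set of services that includes indexing, searching, and browsing, and a society. The sketch below encodes that reading only; the class name, field names, and the `is_minimal` check are hypothetical, and the authoritative definitions are the numbered ones in the cited TOIS paper.

```python
from dataclasses import dataclass, field

# Hypothetical names; this mirrors the figure's layering, not the formal definitions.
REQUIRED_SERVICES = {"indexing", "searching", "browsing"}


@dataclass
class MinimalDL:
    repository: dict                       # collection name -> list of digital object ids
    catalogs: dict                         # collection name -> {object id: descriptive metadata}
    services: set = field(default_factory=set)
    society: set = field(default_factory=set)

    def is_minimal(self):
        """Check each collection has a catalog and the core services are present."""
        has_catalogs = set(self.repository) <= set(self.catalogs)
        return has_catalogs and REQUIRED_SERVICES <= self.services


dl = MinimalDL(
    repository={"pottery": ["obj1", "obj2"]},
    catalogs={"pottery": {"obj1": {"title": "Jar"}, "obj2": {"title": "Bowl"}}},
    services={"indexing", "searching", "browsing"},
    society={"archaeologist", "learner"},
)
```

Removing any one of the three core services, or a collection's catalog, makes `is_minimal()` return `False`, which is the sense in which the composition is "minimal".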

ETANA-DL

• Archaeological DL
• Integrated DL
– Heterogeneous data handling
• Applies and extends the OAI-PMH
– Open Archives Initiative Protocol for Metadata Harvesting
• Design considerations
– Componentized
– Extensible
– Portable
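OAI-PMH requests are plain HTTP requests whose query string carries a verb and its arguments. The sketch below only builds such URLs; the base URL and the record identifier are made up for illustration, and nothing here reflects how ETANA-DL itself extends the protocol.

```python
from urllib.parse import urlencode

# Hypothetical base URL; a real repository publishes its own OAI-PMH endpoint.
BASE_URL = "https://example.org/oai"


def oai_request(verb, **arguments):
    """Build an OAI-PMH request URL from a verb and its arguments."""
    params = {"verb": verb}
    params.update(arguments)
    return BASE_URL + "?" + urlencode(params)


# Harvest all records as unqualified Dublin Core.
list_url = oai_request("ListRecords", metadataPrefix="oai_dc")
# Fetch one record by its OAI identifier (the identifier here is made up).
get_url = oai_request("GetRecord", identifier="oai:example.org:locus-86", metadataPrefix="oai_dc")
```

A harvester would issue the `ListRecords` request against each member site and merge the returned XML records into a union catalog.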

Initial ETANA-DL Member Locations
• Canadian University College
• Walla Walla College
• Willamette University
• Andrews University
• CWRU
• Virginia Tech
• Vanderbilt University
• Mississippi State University
Map courtesy: www.enchantedlearning.com

Lahav Website

Megiddo Opening Screen

Locus Screen: Pictures

View all

Area Screen

ETANA-DL Approach
• Applying and extending Digital Library (DL) techniques to solve key problems: making primary data available, data preservation, and interoperability
• Modeling archaeological information systems using 5S to better understand the domain and design the system and the supporting services
• Rapidly prototyping DLs that handle heterogeneous archaeological data using componentized frameworks:
– eliciting requirements
– refining metamodel and union schema
– modeling sites
– mapping
– harvesting
– providing useful services

ETANA-DL Website

Marking Items
Marking – writing notes for a specific user
Fields shown: Sender, Date, Object OAI ID, Sender Comments
Options: View Record, Add record to Items Of Interest, Re-mark item (Redirect), Unmark item (Remove item from list)
Marked Items Display

Discussions Page Discussions about an object View/Post messages, create new threads

Recommendations Items recommended on the basis of similar interests

ETANA-DL Searching Service Search

ETANA-DL Multi-dimensional Browsing 3 new sites 2 new types of artifacts

ETANA-DL Visual Browsing Service By site Visual Browse

Visual Browsing Nimrin: Topographical Drawings Full site Square: N40/W20 North west quadrant

Visual Browsing Nimrin : Square information Square: N40/W20 Locus: 86 Loci layout

Visual Browsing Nimrin : locus sheet

Visual Browsing Bab edh-Dhra' Cemetery Pottery # 25


ETANA Societies

1. Historic and pre-historic societies (being studied)
2. Archaeologists (in academic institutes, fieldwork settings, or local and national governmental bodies)
3. Project directors
4. Technical staff (consisting of photographers, technical illustrators, and their assistants)
5. Field staff (responsible for the actual work of excavation)
6. Camp staff (e.g., camp managers, registrars, tool stewards)
7. General public (e.g., educators, learners, citizens)

ETANA Societies

• Social issues
1. Who owns the finds?
2. Where should they be preserved?
3. What nationality and ethnicity do they represent?
4. Who has publication rights?
5. What interactions took place between those at the site studied, and others? What theories are proposed by whom about this?

ETANA Scenarios

1. Life in the site in former times
2. Digital recording: the planning stage and the excavation stage
3. Planning stage: remote sensing, fieldwalking, field surveys, building surveys, consulting historical and other documentary sources, and managing the sites and monuments
4. Excavation
   1. Detailed information is recorded, including for each layer of soil, and for features such as pole holes, pits, and ditches.
   2. Data about each artifact is recorded together with information about its exact find spot.
   3. Numerous environmental and other samples are taken for laboratory analysis, and the location and purpose of each is carefully recorded.
   4. Large numbers of photographs are taken, both general views of the progress of excavation and detailed shots showing the contexts of finds.
5. Organization and storage of material
6. Analysis and hypotheses generation and testing
7. Publications, museum displays
8. Information services for the general public

ETANA Spaces

1. Geographic distribution of found artifacts
2. Temporal dimension (as inferred by archaeologists)
3. Metric or vector spaces
   1. used to support retrieval operations, and to calculate distance (and similarity)
   2. used to browse / constrain searches spatially
4. 3D models of the past, used to reconstruct and visualize archaeological ruins
5. 2D interfaces for human-computer interaction
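The two uses of metric spaces listed above (computing distance, and constraining a search spatially) can be sketched with plain coordinates. The find records, the site grid in metres, and both helper functions are assumptions for illustration only.

```python
import math

# Hypothetical find records: (artifact id, x, y) in site grid metres.
finds = [("jar-1", 2.0, 3.0), ("bowl-7", 10.0, 1.0), ("seal-3", 4.0, 4.0)]


def within_box(records, x_min, y_min, x_max, y_max):
    """Keep only finds whose find spot lies inside the bounding box."""
    return [r for r in records if x_min <= r[1] <= x_max and y_min <= r[2] <= y_max]


def nearest(records, x, y):
    """Order finds by Euclidean distance from the point (x, y)."""
    return sorted(records, key=lambda r: math.hypot(r[1] - x, r[2] - y))


in_square = [r[0] for r in within_box(finds, 0, 0, 5, 5)]
by_distance = [r[0] for r in nearest(finds, 4.0, 3.0)]
```

The bounding-box filter corresponds to browsing one excavation square, while the distance ordering is the kind of operation a "find similar or nearby items" service would rely on.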

ETANA Structures

1. Site Organization
   1. Region, site, partition, sub-partition, locus, …
2. Temporal orderings (ages, periods)
3. Taxonomies
   1. for bones, seeds, building materials, …
4. Stratigraphic relationships
   1. above, beneath, coexistent
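Stratigraphic "above / beneath" relationships form a directed graph, and a topological sort of that graph gives one top-down sequence consistent with every recorded relationship. The loci and relationships below are invented, and using Python's `graphlib` for this is an illustrative choice rather than anything the slides prescribe.

```python
from graphlib import TopologicalSorter

# Hypothetical loci; each pair (upper, lower) records "upper lies above lower".
above = [("L1", "L2"), ("L2", "L4"), ("L1", "L3"), ("L3", "L4")]

# TopologicalSorter expects node -> set of predecessors,
# so each locus maps to the set of loci recorded as lying above it.
graph = {}
for upper, lower in above:
    graph.setdefault(lower, set()).add(upper)
    graph.setdefault(upper, set())

# One top-down ordering consistent with every recorded relationship.
sequence = list(TopologicalSorter(graph).static_order())
```

Loci with no recorded relationship between them (here L2 and L3) may appear in either order, which matches the "coexistent or unknown" case; a contradictory record would raise `graphlib.CycleError`.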

ETANA Streams

1. successive photos and drawings of excavation sites, loci, unearthed artifacts
2. audio and video recordings of excavation activities and discussions
3. textual reports
4. 3D models used to reconstruct and visualize archaeological ruins.

Streams

• Multiple media types and representation
– See ch. 4 for IR (except some here for non-text)
– Standards for each, and for some combinations
• Text
– Character strings, encoding (Unicode)
– Morphology -> Stemming
– Syntax, semantics -> stop words
– ** POS tagging, phrases
• Images, Audio, Video, Graphics, Animation
– Capture, digitization, representation
– CBIR for each
• ** Compression, processing, analysis
• ** Synchronization, rendering, presentation, interchange
– RealVideo, SMIL, QoS
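For the text items above, a typical indexing pipeline normalizes the character encoding, removes stop words, and stems what is left. The sketch below is deliberately crude: the stop list and suffix rules are tiny made-up examples, and `crude_stem` is a stand-in for a real stemmer, not an implementation of one.

```python
import unicodedata

# Tiny illustrative stop list and suffix rules; real systems use fuller lists and a proper stemmer.
STOP_WORDS = {"the", "of", "and", "a", "in", "were"}
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    """Strip one common suffix; a stand-in for real stemming, not the Porter algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Normalize Unicode, lowercase, drop stop words, then stem what remains."""
    normalized = unicodedata.normalize("NFC", text).lower()
    tokens = [t.strip(".,;:!?") for t in normalized.split()]
    return [crude_stem(t) for t in tokens if t and t not in STOP_WORDS]


terms = index_terms("The excavated jars were recorded in the catalog.")
```

The resulting stems need not be dictionary words; what matters for retrieval is that variants such as "jar" and "jars" map to the same index term.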

Content Based Information Retrieval

Problems

• Image similarity is subjective
– Personal Interpretation
• Concept x Appearance

By Visual features

– Retrieve images with 50 percent of white colour and 50 percent of black colour
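A query of this kind can be answered from a very simple visual feature: the proportion of pixels of each colour. The sketch below assumes images are small grids of grayscale values and uses an arbitrary tolerance; the image data and function names are hypothetical.

```python
# Hypothetical images: small grids of grayscale pixels, 0 = black, 255 = white.
images = {
    "checker": [[0, 255], [255, 0]],
    "mostly_white": [[255, 255], [255, 0]],
    "gray": [[128, 128], [128, 128]],
}


def colour_fractions(pixels):
    """Return the fractions of black and white pixels in an image."""
    flat = [p for row in pixels for p in row]
    black = sum(1 for p in flat if p == 0) / len(flat)
    white = sum(1 for p in flat if p == 255) / len(flat)
    return black, white


def query(collection, black_target, white_target, tolerance=0.1):
    """Names of images whose black and white fractions are within tolerance of the targets."""
    matches = []
    for name, pixels in collection.items():
        black, white = colour_fractions(pixels)
        if abs(black - black_target) <= tolerance and abs(white - white_target) <= tolerance:
            matches.append(name)
    return matches


result = query(images, 0.5, 0.5)
```

Note that the feature says nothing about what the image depicts, which is exactly the "concept versus appearance" problem: any half-black, half-white picture matches.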

Query result

Textual information retrieval

Query on Google using “Sunset” and “Rio de Janeiro”

Image Classification by shape


Work of Torres et al

• Search in collections of fish images
• using combination of
– image properties (CBIR) and
– textual descriptions

Motivation

• Query 1:
– List all metadata related to fish which were observed in the Amazon River
• Query 2:
– Retrieve images of fishes whose shape is similar to that in the example
• Query 3:
– List all metadata related to fishes that were observed in the Amazon River and whose shape is similar to that in the example
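The three queries differ only in which predicates they combine: metadata alone, shape similarity alone, or both. A minimal sketch follows, assuming each fish record carries a river name and a made-up two-number shape descriptor; the records, the descriptor, and the distance threshold are all hypothetical and do not reflect the descriptors used in the work of Torres et al.

```python
import math

# Hypothetical records: metadata plus a made-up two-number shape descriptor.
fish = [
    {"name": "A", "river": "Amazon", "shape": (0.9, 0.2)},
    {"name": "B", "river": "Amazon", "shape": (0.1, 0.8)},
    {"name": "C", "river": "Tennessee", "shape": (0.9, 0.25)},
]


def by_river(records, river):
    """Query 1: metadata filter."""
    return [r for r in records if r["river"] == river]


def by_shape(records, example, max_distance):
    """Query 2: records whose descriptor lies within max_distance of the example, nearest first."""
    close = [r for r in records if math.dist(r["shape"], example) <= max_distance]
    return sorted(close, key=lambda r: math.dist(r["shape"], example))


example_shape = (0.88, 0.22)
query1 = [r["name"] for r in by_river(fish, "Amazon")]
query2 = [r["name"] for r in by_shape(fish, example_shape, 0.2)]
# Query 3: apply the metadata filter, then the shape filter.
query3 = [r["name"] for r in by_shape(by_river(fish, "Amazon"), example_shape, 0.2)]
```

Query 3 is the case that needs metadata and image descriptors to live in the same system; with images kept as separate files, only Query 1 can be answered.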

Motivation

• Retrieve fish descriptions whose shapes are similar to the one shown below, that belong to the “Notropis” genus, that have “large eyes” and that have been observed in the “Tennessee River”

Problem

• There is no Biodiversity Information System which allows queries involving:
– Geographic data
– Species metadata
– Image descriptors
• Existing systems:
– Metadata or
– Metadata + spatial data
– Images are stored as separate files
• With no possibility of retrieval by content

WeBioS

Torres: Visualizations

Concentric Rings Pattern Spiral Pattern

Structures

• Digital Objects
– Documents, digitization, packaging (METS), interchange, standards, format conversion
– Genre: plays, encyclopedia, dictionaries, educational resources: courses (e.g., syllabi) and lessons
– Structural organizations (books, chapters, sections), excerpts/spans (mark, superimposed info)
• Metadata: standards, markup
• Knowledge Structures & Representations
– Databases, Schema, Ontologies, Thesauri, Lexicons, Authority files, Concept maps, Semantic networks
• Indexes
– Inverted files, signature files, R-trees, Quad trees, etc.
• Clusters & Classification Schemes
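Of the index structures listed, the inverted file is the simplest to show: it maps each term to the documents that contain it, so a Boolean AND becomes an intersection of posting lists. The documents and helper names below are made up for the example.

```python
from collections import defaultdict

# Hypothetical documents keyed by id.
documents = {
    1: "pottery from the cemetery",
    2: "cemetery locus sheet",
    3: "pottery drawings",
}


def build_inverted_file(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def boolean_and(index, *terms):
    """Documents containing every term: intersect the posting lists."""
    sets = [set(index.get(t, [])) for t in terms]
    return sorted(set.intersection(*sets)) if sets else []


index = build_inverted_file(documents)
both = boolean_and(index, "pottery", "cemetery")
```

The index is built once at ingest time; queries then touch only the posting lists of their own terms instead of scanning every document.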

Degree of Structure
• Web: Chaotic
• DLs: Organized
• DBs: Structured

Digital Objects (DOs)

• Born digital
• Digitized version of “real” object
– Is the DO version the same, better, or worse?
– Decision for ETDs: structured + rendered
• Surrogate for “real” object
– Not covered explicitly in metamodel for a minimal DL
– Crucial in metamodel for archaeology DL

Metadata Objects (MDOs)

• MARC
• Dublin Core
• RDF
• IMS
• OAI (Open Archives Initiative)
• Crosswalks, mappings
• Ontologies
• Topic maps, concept maps
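A crosswalk is, at its core, a mapping table from the fields of one metadata scheme to the elements of another. The sketch below uses three well-known MARC tags and their usual Dublin Core counterparts; the table is intentionally incomplete, the record is invented, and the dict-based record shape is an assumption made for brevity.

```python
# A few MARC-to-Dublin-Core correspondences, simplified for illustration;
# a production crosswalk covers many more fields and subfields.
MARC_TO_DC = {
    "100": "creator",   # main entry, personal name
    "245": "title",     # title statement
    "650": "subject",   # subject added entry, topical term
}


def crosswalk(marc_record):
    """Convert a {tag: [values]} MARC-like dict into a {element: [values]} DC-like dict."""
    dc = {}
    for tag, values in marc_record.items():
        element = MARC_TO_DC.get(tag)
        if element is not None:          # unmapped fields are dropped: the mapping is lossy
            dc.setdefault(element, []).extend(values)
    return dc


# Hypothetical record; tag 300 (physical description) has no entry in the table above.
marc = {"245": ["Excavation report"], "100": ["Doe, J."], "650": ["Pottery", "Bronze Age"], "300": ["120 p."]}
dc = crosswalk(marc)
```

Going from a rich scheme to a simple one loses detail, which is why crosswalks are usually applied in the complex-to-simple direction.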

Complex to Simple

MARC ($50) Dublin Core (DC) + thesis

Spaces

• Retrieval models
– Boolean, extended Boolean
– Vector, LSI
– Probabilistic: classical, belief network, inference network, language models
• User interfaces and visualization
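In the vector model, documents and queries are points in a term space and relevance is the cosine of the angle between them. The sketch below uses one common TF-IDF weighting, tf × log(N / df); the three-document collection is invented, and other weighting variants are equally valid.

```python
import math
from collections import Counter

# Hypothetical three-document collection.
docs = {
    "d1": "pottery jar pottery",
    "d2": "bone seed",
    "d3": "jar seed",
}


def tfidf_vectors(collection):
    """Weight each term by term frequency times log(N / document frequency)."""
    n = len(collection)
    counts = {d: Counter(text.split()) for d, text in collection.items()}
    df = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log(n / df[term]) for term in df}
    return {d: {t: tf * idf[t] for t, tf in c.items()} for d, c in counts.items()}, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


vectors, idf = tfidf_vectors(docs)
query = {"pottery": idf["pottery"]}
ranking = sorted(docs, key=lambda d: cosine(query, vectors[d]), reverse=True)
```

Unlike the Boolean model, every document receives a score, so the result is a ranking in which documents that share no terms with the query simply sit at the bottom with a score of zero.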

User interfaces and visualization

• 2D interfaces
• 3D interfaces
• GIS
• Other paradigms

Scenarios

• Recall OO for streams
– now have objects as well as scenarios
– ex interface components
• Information Access
– Searching: ad hoc, filtering/routing
– Browsing: using an organization, using a visualization, using links (i.e., hypertext, hypermedia)
– Workflow: sessions, feedback, etc.
• Scenario-based Design
• Usability: goals, tasks, claims
• NOTE: this is covered in the outline

Societies

• User communities
– Authors, editors, teachers, students, readers
– Personal(ization), group(ware), community, global
– Accessibility, universal access
• Librarians: reference, acquisition, operations
• Research community
– Associations, conferences, publications, labs, projects
• Economics
– Copyright, intellectual property rights, digital rights management, authorization, authentication, security, privacy, self-archiving (eprints)
– Publishers, catalogers, distributors, sustainability
– Open source, commercial, hybrid