The National Science Digital Library (NSDL) as an Example of Information Science Research
William Y. Arms
Cornell University
October 25, 2002
Some Light Reading
William Y. Arms, "Economic models for open-access publishing." iMP, March 2000. http://www.cisp.org/imp/march_2000/03_00arms.htm
William Y. Arms, "Automated digital libraries." D-Lib Magazine, July/August 2000. http://www.dlib.org/dlib/july20/07contents.html
William Y. Arms, "What are the alternatives to peer review? Quality control in scholarly publishing on the web." Journal of Electronic Publishing, 8(1), August 2002. http://www.press.umich.edu/jep/08-01/arms.html
William Y. Arms, et al., "A Spectrum of Interoperability: The Site for Science Prototype for the NSDL." D-Lib Magazine, 8(1), January 2002. http://www.dlib.org/dlib/january02/arms/01arms.html
A Scenario
A faculty member wished to find a paper for students to read in a class. He began by asking an expert. She suggested the original research paper as suitable. Later, he typed a few terms into Google, browsed the hits, selected one that led to ResearchIndex, found the paper, and downloaded a PDF version from the author's web site.
Viewpoints
[Figure: Society, Cognitive Studies, HCI, Computer Science.]
HCI: Eye Tracking
Information Science
[Figure: Applications, Society, Cognitive Studies, HCI, Computer Science.]
Open Access to Scientific, Scholarly and Professional Information
Before the Web
Access to Scientific, Medical, Legal Information
In the United States:
excellent if you belonged to a rich organization (e.g., a major university)
very poor otherwise (e.g., most K-12 schools)
In many countries of the world:
very poor for everybody
Research Libraries are Expensive
library materials
buildings & facilities
staff
Baumol's Cost Disease
[Figure: price versus year (1900 to 2050) for labor-intensive services, a bundle of goods and services, and manufactured goods.]
Baumol's Cost Disease
[Figure: price versus year (1900 to 2050) for labor-intensive services, a bundle of goods and services, and manufactured goods, with Moore's Law added.]
Brute Force Computing
Few people really understand Moore's Law
Computing power doubles every 18 months
Increases 100 times in 10 years
Increases 10,000 times in 20 years
Simple algorithms
plus
immense computing power
can outperform human intelligence
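The growth figures on this slide follow directly from the 18-month doubling period. The short Python sketch below checks the arithmetic; the function name `growth_factor` is purely illustrative and not part of the talk.

```python
def growth_factor(years: float, doubling_months: float = 18.0) -> float:
    """Return how many times capacity multiplies after `years` of steady doubling."""
    doublings = years * 12.0 / doubling_months
    return 2.0 ** doublings


# Compare against the rounded figures quoted on the slide.
for years in (10, 20):
    print(f"{years} years: about {growth_factor(years):,.0f} times")
```

The results come out slightly above 100 and 10,000, which matches the rounded numbers on the slide.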
Example: Catalogs and Indexes
Cost disease: catalogs and indexes
Catalog, index and abstracting records are very expensive when created by skilled professionals
Moore's Law: automatic indexing of full text
Retrieval using automatic indexing can be at least as effective as manual indexing with controlled vocabularies (Cleverdon 1967, reporting on experiments by Salton).
Brute Force Computing: Substitutes for Human Intelligence
Automated algorithms for information discovery
Similarity of two documents
Vector space and statistical methods (Salton, Sparck Jones, et al.)
Importance of digital object
Rank importance of web pages by analysis of the graph of web links (Kleinberg, Page, et al.)
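To make the link-analysis idea concrete, here is a minimal Python sketch of a PageRank-style power iteration over a toy link graph. It is a simplified illustration, not the algorithm of any particular search engine: the graph `toy_web`, the function name `rank_pages`, the damping value 0.85, and the iteration count are all assumptions chosen for the example.

```python
from typing import Dict, List


def rank_pages(links: Dict[str, List[str]],
               damping: float = 0.85,
               iterations: int = 50) -> Dict[str, float]:
    """Score pages by repeatedly passing each page's score along its outgoing links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline score regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "c", and "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = rank_pages(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

The page with the most incoming links ends up with the highest score, and no human judgment of the pages' content is involved, which is the point of the slide.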
Information Discovery: 1992 and 2002
                    1992          2002
Content             print         digital
Computing           expensive     inexpensive
Choice of content   selective     comprehensive
Index creation      human         automatic
Frequency           one time      monthly
Vocabulary          controlled    not controlled
Query               Boolean       ranked retrieval
Users               trained       untrained
Brute Force Computing: Automated Metadata Extraction
Informedia (Carnegie Mellon)
Automatic processing of segments of video, e.g., television news.
Algorithms for:
dividing raw video into discrete items
generating short summaries
indexing the sound track using speech recognition
recognizing faces
(Wactlar, et al.)
Brute Force Computing + Intelligence of the User
Simple algorithms
plus
immense computing power
plus
the intelligence of the user
can replace labor-intensive services
[Figure: Cognitive Studies, HCI, Computer Science.]
The National Science Foundation's National Science Digital Library (NSDL) http://www.nsdl.org
Scope
All digital information relevant to any level of education in any branch of science.
Scientific and technical information
Materials used in education
Materials tailored to education
How Big might the NSDL be?
All branches of science, all levels of education, very broadly defined:
Five year targets
1,000,000 different users
10,000,000 digital objects
10,000 to 100,000 independent sites
The Integration Task ...
... to provide a coherent set of collections and services across great diversity
Resources
                   Budget         Staff     Management
Integration team   $4-6 million   25-30     Diffuse
How can a small team, without direct management control, create a very large-scale digital library?
Philosophy
It is possible to build a very large digital library with a small staff.
But ...
Every aspect of the library must be planned with scalability in mind.
Some compromises will be made.
Example 1: The Mortal behind the Portal
[This space left intentionally blank.]
Example 2: Interoperability
The Problem
Conventional approaches require partners to support agreements (technical, content, and business).
But NSDL needs thousands of very different partners, most of whom are not directly part of the NSDL program.
The challenge is to create incentives for independent digital libraries to adopt agreements
Function Versus Cost of Acceptance
[Figure: cost of acceptance versus function, with regions labeled "Few adopters" and "Many adopters".]
Example: Textual Mark-up
[Figure: cost of acceptance versus function for ASCII, HTML, XML, and SGML.]
The Spectrum of Interoperability
Level: Federation
Agreements: Strict use of standards (syntax, semantic, and business)
Example: AACR, MARC, Z39.50

Level: Harvesting
Agreements: Digital libraries expose metadata; simple protocol and registry
Example: Open Archives metadata harvesting

Level: Gathering
Agreements: Digital libraries do not cooperate; services must seek out information
Example: Web crawlers and search engines
Example 3: Searching
Basic Assumptions
The integration team will not manage any collections.
The integration team will not create any metadata.
Effective Information Retrieval
Comprehensive metadata with Boolean retrieval (e.g., monograph catalog).
Can be excellent for well-understood categories of material, but requires expensive metadata, which is rarely available.
Full text indexing with ranked retrieval (e.g., news articles).
Excellent for relatively homogeneous material, but requires available full text.
Full text indexing with contextual information and ranked retrieval (e.g., Google).
Excellent for mixed textual information with rich structure.
Contextual information with non-textual materials and ranked retrieval (e.g., Google image retrieval).
Promising, but still experimental.
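The difference between Boolean and ranked retrieval can be seen on a tiny example. The Python sketch below is only an illustration under stated assumptions: the three documents in `DOCS` are invented, the Boolean query is a plain AND of all terms, and ranking uses cosine similarity over raw term counts (one simple vector space method, not necessarily what any production system uses).

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical mini-collection, invented for this example.
DOCS = {
    "d1": "digital library metadata catalog",
    "d2": "digital library digital collections search",
    "d3": "weather news report",
}


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def boolean_and(query: str, docs: Dict[str, str]) -> List[str]:
    """Return documents containing every query term (unordered match set)."""
    terms = set(tokenize(query))
    return sorted(d for d, text in docs.items() if terms <= set(tokenize(text)))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def ranked(query: str, docs: Dict[str, str]) -> List[str]:
    """Return documents sharing any query term, best match first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), d) for d, text in docs.items()]
    return [d for score, d in sorted(scored, key=lambda p: (-p[0], p[1])) if score > 0]


query = "digital library search"
print("Boolean AND:", boolean_and(query, DOCS))
print("Ranked:     ", ranked(query, DOCS))
```

Boolean retrieval returns only the document that contains all three terms, while ranked retrieval also returns the partial match in second place. An untrained user who picks imperfect query terms still gets something useful, without needing to reformulate the query.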
The NSDL Search Service
Full Text or Metadata?
Full text indexing is excellent, but is not possible for all materials (non-textual, no access for indexing).
Comprehensive metadata is available for very few of the materials.
What Architecture to Use?
Few collections support an established search protocol (e.g., Z39.50).
Broadcast Searching does not Scale
[Figure: diagram linking the User, the User interface server, and the Collections.]
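One reason often given for this claim, which the slide itself does not spell out, is that a broadcast search depends on every collection answering each query. The Python sketch below assumes that each collection is independently available with the same probability; the value 0.99 and the function name `all_respond_probability` are illustrative assumptions, not NSDL data.

```python
def all_respond_probability(collections: int, availability: float = 0.99) -> float:
    """Probability that every one of `collections` independent servers answers a query."""
    return availability ** collections


# The chance of a complete answer collapses as the number of collections grows.
for n in (10, 100, 1000, 10000):
    print(f"{n:>6} collections: {all_respond_probability(n):.6f}")
```

Under these assumptions a complete answer is likely with ten collections but very unlikely with the thousands of independent sites the NSDL is targeting.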
The Metadata Repository
The metadata repository is a resource for service providers. It holds information about every collection and item known to the NSDL, including contextual information.
[Figure: diagram with Collections, Metadata repository, Services, and Users.]
Support for Service Providers
The Metadata Repository as a Resource
Records are exposed through the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).
The Core Integration team provides some services based on the metadata repository.
The architecture encourages others to build services.
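As a rough idea of what harvesting looks like to a service provider, the Python sketch below builds an OAI-PMH `ListRecords` request and parses a response. It makes no network call. The base URL `http://repository.example.org/oai`, the sample record, and the helper names are all hypothetical; only the `verb` and `metadataPrefix` parameters and the XML namespaces come from the OAI-PMH and Dublin Core specifications.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build the URL a harvester would fetch to list a repository's records."""
    return base_url + "?" + urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})


# Hypothetical response with one invented record, standing in for a real harvest.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:item-1</identifier>
        <datestamp>2002-10-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Introduction to Plate Tectonics</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_records(xml_text: str) -> List[Dict[str, str]]:
    """Extract the identifier and Dublin Core title from each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(f"{{{OAI_NS}}}record"):
        identifier = record.findtext(f"{{{OAI_NS}}}header/{{{OAI_NS}}}identifier")
        title = record.findtext(f".//{{{DC_NS}}}title")
        records.append({"identifier": identifier, "title": title})
    return records


print(list_records_url("http://repository.example.org/oai"))
print(parse_records(SAMPLE_RESPONSE))
```

Because the request is a plain HTTP GET and the response is XML, a third party can build a new service on the repository without any special agreement, which is what the architecture is meant to encourage.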
Search Service
[Figure: architecture diagram with three Portals, Search and Discovery Services, the Metadata repository, and Collections; connections are labeled SDLIP, OAI, and http.]
James Allan, Bruce Croft (University of Massachusetts, Amherst)
Where is the Center of the Universe?
[Figure: diagram showing NSDL together with Alexandria, Library of Congress, Elsevier, Informedia, Joe's Pictures, and Math DL.]
Where is the Center of the Universe?
[Figure: diagram showing NSDL together with British Library, Internet Archive, Elsevier, Library of Congress, OCLC, and Harvard.]
Where is the Center of the Universe?
[Figure: diagram showing Bill Arms together with email, Office, Course web sites, Directories, News and weather, NSDL, and Technical documentation.]
Acknowledgement
The NSDL is a program of the National Science Foundation's Directorate for Education and Human Resources, Division of Undergraduate Education.
The NSDL Core Integration is a collaboration between the University Corporation for Atmospheric Research (Dave Fulker), Columbia University (Kate Wittenberg), and Cornell University (Bill Arms). The Technical Director is Carl Lagoze (Cornell University).