Stanford Digital Libraries Technologies Projects


Stanford Digital Libraries
Technologies Projects
Pratik Dave
Raghu Akkapeddi
For CPSC 689DL Fall ’02
Texas A&M University
interLib
Joint effort of U.C. Berkeley, U.C. Santa Barbara, and
Stanford University.
Testbed developed by SDSC (San Diego
Supercomputer Center).
Demonstrated on CDL (California Digital Library).
Berkeley - tools and technologies to support highly improved models of the "scholarly information life cycle."
Our goal is to facilitate the move from the current centralized, discrete publishing model, to a distributed, continuous,
and self-publishing model, while still preserving the best aspects of the current model such as peer review.
Santa Barbara - The Alexandria Digital Earth Prototype (ADEPT) aims to use the digital earth metaphor for organizing, using, and presenting
information at all levels of spatial and temporal resolution.
• creating geospatial information and meta-information collections;
• building operational services for: (1) discovering heterogeneous, distributed collections; (2) organizing these resources into
Iscapes (Information Landscapes) tailored for specific applications; and (3) collaborative use and visualization of Iscapes;
• applying and evaluating ADEPT services in undergraduate learning; and
• developing scalable, efficient, and secure systems
Stanford Component: Goals
Develop technologies to overcome barriers to effective
DLs.
“An important part of the project's vision is that
digital libraries will not just be collections of
information repositories. Rather, they will include
aspects of communication among patrons and
between patrons and human library staff.”
“Design and implement the infrastructure
and services needed for collaboratively
creating, disseminating, sharing, and
managing information in a DL context.”
“Main thrust of project is technology creation, evaluation, and deployment.”
People
Hector Garcia-Molina - chair of CS
department, distributed objects
Terry Winograd - HCI and usability
Dan Boneh - Security
Andreas Paepcke - interoperability
Barriers to effective DLs
1. Heterogeneity of information and services
2. Lack of powerful filtering mechanisms that let users find truly
valuable information
3. Insufficient availability of interfaces and tools that effectively
operate on portable devices
4. Lack of a solid economic infrastructure that encourages providers
to make information available and gives users privacy guarantees.
Retrieving Information
SDLIP
PowerBrowsing
Query Translation
Value Filtering
WebBase
Simple Digital Library Interoperability
Protocol (SDLIP)
SDLIP = InfoBus architecture
Our basic approach is to use distributed objects to allow integrated access to
heterogeneous services across networks. We call this system the InfoBus.
We use CORBA to provide communication between remote processes. In
particular, we use Xerox PARC's ILU, a free implementation of a CORBA
superset; MICO, a free CORBA implementation under the GNU license; and
Visigenic, a commercial provider. We use Java, C++, and the interpreted,
object-oriented language Python for our development work.
Clients use SDLIP to request searches to be performed
over information sources.
The result documents are returned synchronously, or they
are streamed from service to client as they become
available.
SDLIP Core
Synchronous access
Client sends request + tokens:
Server Set ID, Client Request ID
Parking Meter state model
Delegation
SDLIP Async
Delivery interface in client
Result Cache locally
Result Cache distributed
Delegation
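The difference between synchronous and streamed delivery can be illustrated with a minimal sketch. This is not the SDLIP interface itself: the names `ToySource`, `search_sync`, `search_async`, and `deliver` are hypothetical, and plain Python calls stand in for the CORBA transport.

```python
from typing import Callable, Iterable, List

class ToySource:
    """Hypothetical information source; not the real SDLIP interface."""

    def __init__(self, documents: Iterable[str]):
        self.documents = list(documents)

    def _matches(self, query: str):
        # Yield matching documents one at a time, as a service might produce them.
        for doc in self.documents:
            if query.lower() in doc.lower():
                yield doc

    def search_sync(self, query: str) -> List[str]:
        # Synchronous mode: the caller blocks and receives all results at once.
        return list(self._matches(query))

    def search_async(self, query: str, deliver: Callable[[str], None]) -> int:
        # Streamed mode: each result is pushed to the client's delivery
        # callback as soon as it is available; the total count is returned.
        count = 0
        for doc in self._matches(query):
            deliver(doc)
            count += 1
        return count

source = ToySource(["Digital library interoperability", "Web crawling", "Library staff"])
all_at_once = source.search_sync("library")

received: List[str] = []          # plays the role of the client-side result cache
streamed = source.search_async("library", received.append)
```

In the streamed case the client supplies the delivery function, which loosely corresponds to the "Delivery interface in client" item above.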
PowerBrowsing
• Site Search/Keyword Completion
• Accordion Summarization
• Text Summarization
• Form Entry
Site Search/Keyword Completion
As a way to address bandwidth and battery life
limitations, we provide local site search
facilities for all sites. We incrementally index
Web sites in real time as the PDA user visits
them. These indexes have narrow scope at
first, and improve as the user dwells on the
site, or as more users visit the site over time.
We address the keyword input problem by
providing site specific keyword completion,
and indications of keyword selectivity within
sites.
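A minimal sketch of the idea, assuming a simple in-memory index: pages are added as the user visits them, and completions are reported together with the fraction of indexed pages that contain each keyword as a rough selectivity indicator. The class `SiteIndex` and its methods are hypothetical, not the project's implementation.

```python
from collections import defaultdict

class SiteIndex:
    """Hypothetical incremental index for a single Web site."""

    def __init__(self):
        self.postings = defaultdict(set)  # keyword -> set of page URLs
        self.pages = set()

    def add_page(self, url, text):
        # Called each time a page is visited, so coverage grows over time.
        self.pages.add(url)
        for raw in text.lower().split():
            word = raw.strip(".,;:!?")
            if word:
                self.postings[word].add(url)

    def complete(self, prefix):
        # Return (keyword, fraction of indexed pages containing it), with the
        # most selective (rarest) keywords first and ties broken alphabetically.
        prefix = prefix.lower()
        hits = [(word, len(urls) / len(self.pages))
                for word, urls in self.postings.items()
                if word.startswith(prefix)]
        return sorted(hits, key=lambda item: (item[1], item[0]))

index = SiteIndex()
index.add_page("/a", "Digital library projects")
index.add_page("/b", "Digital wallets and libraries")
suggestions = index.complete("lib")
```

A keyword that appears on every indexed page (here "digital") has selectivity 1.0 and is a poor choice for narrowing a search within the site.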
Accordion Summarization
We concentrate on end-game browsing,
where the user is close to or on the
target page. Web page is first
represented as a short summary. The
user can then drill down to discover
relevant parts of the page. If desired,
keywords can be highlighted and
exposed automatically.
Text Summarization
Each Web page is broken into text units that
can each be hidden, partially displayed, made
fully visible, or summarized. The methods
accomplish summarization by different
means. One method extracts significant
keywords from the text units, another
attempts to find each text unit's most
significant sentence to act as a summary for
the unit. We found that the combination of
keywords and single-sentence summaries
provides significant improvements in access
times and number of pen actions, as
Form Entry
• The form input widgets are not shown
until the user is ready to fill them in. At
that point, only one widget is shown at a
time. The form is summarized on the
screen by displaying just the text labels
that prompt the user for each widget's
information.
Query Translation
• Deals with the problem of translating
Boolean queries into the different native
query languages supported by various search
services, to make distributed search
possible and shield users from the
details of different query languages.
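As an illustration, one Boolean query can be held as a small tree and rendered into more than one target syntax. The two target syntaxes below (keyword operators and symbolic operators) are invented for the example and do not correspond to any particular search service; all class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Term:
    word: str

@dataclass
class And:
    left: "Query"
    right: "Query"

@dataclass
class Or:
    left: "Query"
    right: "Query"

Query = Union[Term, And, Or]

def to_keyword_syntax(q: Query) -> str:
    # Hypothetical target 1: AND/OR keywords with explicit parentheses.
    if isinstance(q, Term):
        return q.word
    op = "AND" if isinstance(q, And) else "OR"
    return f"({to_keyword_syntax(q.left)} {op} {to_keyword_syntax(q.right)})"

def to_symbol_syntax(q: Query) -> str:
    # Hypothetical target 2: '&' and '|' operators.
    if isinstance(q, Term):
        return q.word
    op = "&" if isinstance(q, And) else "|"
    return f"({to_symbol_syntax(q.left)} {op} {to_symbol_syntax(q.right)})"

query = And(Term("digital"), Or(Term("library"), Term("archive")))
keyword_form = to_keyword_syntax(query)
symbol_form = to_symbol_syntax(query)
```

The user writes the query once; each service receives it in its own syntax. A real translator would also have to handle operators that a given service does not support, which this sketch ignores.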
Value Filtering
• The project is developing searching and filtering techniques that rely, in
addition to textual similarity, on other information value metrics. These
metrics may be opinion based, for example, did other colleagues we
trust find a document useful, or has this document been reviewed by
some editorial board? The metrics may also be access-pattern based,
e.g., has this video been retrieved by many users? The metrics may be
context-based. For example, is the information coming from a
trustworthy source, do we know the author, or are the Web pages that
point to this document related to our search?
• Along similar lines, the Stanford Value Filtering project plans a service
that allows users to annotate Web pages, without needing to physically
modify those pages. The annotations might be reminders users leave
for themselves, or they might be directed at colleagues who are known
to be scanning the same information space. The annotations
themselves can be useful value information, as are the collected
access paths.
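One simple way to combine textual similarity with the other kinds of value metrics is a weighted sum. The sketch below is only an assumption about how such a score might be formed: the fields, the weights, and the cap of 10 are all hypothetical, not taken from the project.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    title: str
    text_similarity: float  # 0..1, from an ordinary text match
    endorsements: int       # opinion based: trusted colleagues who found it useful
    retrievals: int         # access-pattern based: how often it was retrieved
    trusted_source: bool    # context based: does it come from a trusted source?

def value_score(doc, weights=(0.5, 0.2, 0.2, 0.1), cap=10):
    # Hypothetical weighted sum; counts are capped and scaled into 0..1.
    w_text, w_opinion, w_access, w_context = weights
    opinion = min(doc.endorsements, cap) / cap
    access = min(doc.retrievals, cap) / cap
    context = 1.0 if doc.trusted_source else 0.0
    return (w_text * doc.text_similarity + w_opinion * opinion
            + w_access * access + w_context * context)

def rank(docs):
    # Highest combined value first.
    return sorted(docs, key=value_score, reverse=True)

docs = [
    Candidate("A", text_similarity=0.9, endorsements=0, retrievals=0, trusted_source=False),
    Candidate("B", text_similarity=0.7, endorsements=5, retrievals=10, trusted_source=True),
]
ranked_titles = [d.title for d in rank(docs)]
```

Here document B outranks A even though its text match is weaker, because colleagues endorsed it, it is frequently retrieved, and it comes from a trusted source.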
WebBase
• The Stanford WebBase project is
investigating various issues in crawling,
storage, indexing, and querying of large
collections of Web pages. The project builds
on the previous Google activity that was part
of the DLI1 initiative. The DLI2 WebBase
project aims to build the necessary
infrastructure to facilitate the development
and testing of new algorithms for clustering,
searching, mining, and classification of Web
content.
Interpreting Information
WebClustering
WebClustering
• Clustering refers to the grouping of pages into categories, in a fashion
similar to Yahoo or the Open Directory.
• We are currently investigating techniques to efficiently cluster the entire
web. Traditional IR approaches are not appropriate in the context of the
web, due to both the enormous size and hyperlinked nature of the web.
We plan to use recently developed techniques that allow for similarity
searches in high dimensional spaces (for instance,
http://theory.stanford.edu/~indyk/vldb99.ps) to allow for offline clustering
of the web. Even with the newer techniques, the resource requirements
will be large, especially as precision requirements are raised.
Supercomputing resources will be a valuable asset in performing
clustering and other mining operations on the contents of the web.
Such resources will allow us to explore and evaluate more of the
available clustering options as we develop the most effective
techniques.
Managing Information
• Archival Repositories
• InterBib
Archival Repositories
The goal of this project is to design and
implement a modern, scalable digital
library repository (DLR).
Under our architecture, a Digital Library
Repository (DLR) is formed by a
collection of independent but
collaborating sites.
Signatures as Object Handles
Each object in a DLR has a handle used to identify and retrieve it. Handles are
internal to the DLR and are not used by end users to identify documents.
Given an object, we define its handle to be a (large) signature computed
exclusively from its contents, using a checksum or a Cyclic Redundancy
Check (CRC). If the contents are smaller than the size of the signature, the
object (at creation time) is "padded" with a random string to make its size
larger than the size of a signature.
1. Each site can generate objects and handles without consulting other sites.
Sites only need to agree on the signature function, not on software versions,
character sets, etc.
2. The handle can be reconstructed from the object itself.
3. Copies at different sites will have the same handle.
4. Different objects will have different handles.
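The handle scheme can be sketched as follows. SHA-256 from Python's `hashlib` is used here purely as a stand-in signature function (the slides mention a checksum or CRC); the function names and the 32-byte signature size are assumptions of this example.

```python
import hashlib
import os

SIGNATURE_BYTES = 32  # size of a SHA-256 digest (stand-in signature function)

def pad_if_small(contents: bytes) -> bytes:
    # Done once, at creation time: short objects get a random suffix so that
    # the stored object is larger than the signature.
    if len(contents) <= SIGNATURE_BYTES:
        contents += os.urandom(SIGNATURE_BYTES - len(contents) + 1)
    return contents

def handle_of(stored_object: bytes) -> str:
    # The handle depends only on the stored bytes, so any site can recompute
    # it without consulting other sites.
    return hashlib.sha256(stored_object).hexdigest()

small = pad_if_small(b"hi")
handle_here = handle_of(small)
handle_elsewhere = handle_of(bytes(small))  # a copy held at another site
```

Because the handle is a pure function of the stored bytes, two sites holding the same object compute the same handle independently. How the padding is later told apart from the original contents is not addressed in this sketch.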
No Deletions
Because of our handle scheme, objects cannot be
updated in place. That is, if the contents of an object
are modified, it automatically becomes a new object,
with a different handle.
Another fundamental rule in our architecture is that
objects are never (voluntarily) deleted. Allowing
deletions is dangerous when sites are managed
independently; in particular, it makes it hard to
distinguish between a deleted object and one that
was corrupted ("morphed" into another) and needs
to be restored.
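A minimal sketch of a store that follows these two rules, assuming SHA-256 as a stand-in signature function. The class `AppendOnlyStore` and its methods are hypothetical. Modified contents always produce a new entry, there is no delete operation, and corruption is detected by recomputing the signature.

```python
import hashlib

class AppendOnlyStore:
    """Hypothetical single-site object store with no update and no delete."""

    def __init__(self):
        self._objects = {}  # handle -> bytes

    def put(self, contents: bytes) -> str:
        handle = hashlib.sha256(contents).hexdigest()
        # Storing the same contents twice is a no-op; nothing is overwritten.
        self._objects.setdefault(handle, contents)
        return handle

    def get(self, handle: str) -> bytes:
        return self._objects[handle]

    def is_intact(self, handle: str) -> bool:
        # An object whose recomputed signature no longer matches its handle
        # has been corrupted and should be restored from another site.
        return hashlib.sha256(self._objects[handle]).hexdigest() == handle

store = AppendOnlyStore()
v1 = store.put(b"draft one")
v2 = store.put(b"draft two")          # an "update" is simply a new object
intact_before = store.is_intact(v1)
store._objects[v1] = b"bit rot"       # simulate corruption on disk
intact_after = store.is_intact(v1)
```

Since nothing is ever deleted on purpose, a missing or mismatching object can always be treated as damage rather than as a legitimate removal.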
Layered Architecture
Since each DLR site may be implemented differently, it is important
to have site interfaces that are well defined and as simple as possible.
1. Object Store Layer
2. Identity Layer - provides access to objects via handles and provides basic
facilities for reporting changes to its objects
3. Complex Objects Layer - manages collections of related objects
4. Reliability Layer - coordinates replication of objects to multiple stores
for long-term archiving
5. Upper Layers - protect intellectual property, enforce security, and charge
customers under various revenue models
Layered Architectures Diagram
Awareness Everywhere
Awareness services (standing orders,
subscriptions, alerts) are important in
digital libraries. They are also important
for our reliability and indexing layers: if
one site is backing up another, it must
be aware of new objects or corrupted
objects to take appropriate action. In
our architecture, awareness services
are an integral part of every layer.
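The backup scenario can be sketched with a simple subscription mechanism. Everything here is hypothetical (the class `AwareStore`, the event names "created" and "corrupted", and the `mirror` callback); it only illustrates how one site might react to events reported by another.

```python
class AwareStore:
    """Hypothetical store that notifies subscribers of object events."""

    def __init__(self):
        self.objects = {}
        self.subscribers = []

    def subscribe(self, callback):
        # callback(event, handle) is invoked for every reported change.
        self.subscribers.append(callback)

    def _notify(self, event, handle):
        for callback in self.subscribers:
            callback(event, handle)

    def add(self, handle, contents):
        self.objects[handle] = contents
        self._notify("created", handle)

    def report_corruption(self, handle):
        self._notify("corrupted", handle)

primary = AwareStore()
backup = {}

def mirror(event, handle):
    if event == "created":
        backup[handle] = primary.objects[handle]   # copy the new object
    elif event == "corrupted":
        primary.objects[handle] = backup[handle]   # restore from the copy

primary.subscribe(mirror)
primary.add("h1", b"report")
primary.objects["h1"] = b"damaged"                 # simulate corruption
primary.report_corruption("h1")
```

After the corruption event is reported, the primary site holds the original contents again because the backing site was aware of both the creation and the damage.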
Disposable Auxiliary
Structures
Layers typically maintain auxiliary
structures for improving performance. In
our architecture these structures are
designed to be disposable, so they can
be reconstructed from the underlying
digital objects.
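A word index is a typical example of such a structure. In the sketch below (the function and variable names are hypothetical), the index holds no information that is not already in the objects, so it can be thrown away and rebuilt at any time.

```python
from collections import defaultdict

def build_word_index(objects):
    # objects: mapping handle -> text. The index is derived data only.
    index = defaultdict(set)
    for handle, text in objects.items():
        for word in text.lower().split():
            index[word].add(handle)
    return index

objects = {"h1": "digital library", "h2": "digital wallet"}
index = build_word_index(objects)
index = None                        # the index is lost or discarded
index = build_word_index(objects)   # rebuilt from the objects alone
```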
InterBib
3 facilities:
1. conversion of bibliographies among
different formats
2. the processing of documents to
include bibliographies
3. the collaborative accumulation of
bibliographies that can be searched.
Converting Bibliographies
An online form accepts bibliographies in BibTeX or
Refer format, converts them to HTML and MIF
(FrameMaker), and converts back and forth. Good
entries are retained for the InterBib server.
Generating Bibliographies
Processes RTF, HTML, or FrameMaker MIF files. Generally,
citations in your documents need to be of the form
'[garc89, ullm93]', but you can use any keys you want,
as long as they match the ones in your BibTeX file. In
Refer, you can use the field '%L' to specify a key. If
no key is specified in Refer, InterBib will construct an
all lower-case key from the first four letters of the first
author and the publication year. Apostrophes are left
out of the key. You can also change the characters
you use to delimit citations in your document.
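The default key rule for Refer entries can be sketched as a small function. Two points are assumptions based on the example citation '[garc89, ullm93]' rather than on the stated rule: the key uses the author's last name, and the year is written with two digits. The function name and the sample authors are illustrative only.

```python
def refer_default_key(first_author_last_name: str, year: int) -> str:
    # Drop apostrophes, lower-case, keep the first four letters, then append
    # the year. The two-digit year is an assumption based on 'garc89'.
    name = first_author_last_name.replace("'", "").lower()
    return f"{name[:4]}{year % 100:02d}"

key_plain = refer_default_key("Garcia-Molina", 1989)
key_apostrophe = refer_default_key("O'Neil", 1995)
```

With these assumptions, an entry by Garcia-Molina from 1989 gets the key "garc89", matching the form of the citation shown above.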
Sharing Bibliographies
can search for relevant entries
Sharing Information
DietORB
Digital Wallets
DietORB
a highly minimalized CORBA for handheld
devices. We developed a CORBA ORB
for the Palm Pilot PDA. The ORB
currently only allows the PDA to call out
to full-sized services. This project is
associated with MICO, a free, GNU-licensed CORBA implementation.
Digital Wallets
A digital wallet is a software component
that allows a user to make an electronic
payment with a financial instrument
(such as a credit card or a digital coin),
and hides the low-level details of
executing the payment protocol that is
used to make the payment.
Extensible
Accommodate all of the user's different payment
instruments, and inter-operate with multiple
payment protocols.
Vendors should be able to develop electronic
coupons that offer discounts on products
without requiring that users install a new
wallet to hold these coupons and make
payments with them.
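Extensibility can be sketched as a wallet that treats instruments and protocols as plug-ins. The class `Wallet`, its methods, and the two toy protocols are hypothetical and do not perform real payments; they only show a coupon being added to an existing wallet without replacing it.

```python
class Wallet:
    """Hypothetical extensible wallet: instruments and protocols plug in."""

    def __init__(self):
        self.instruments = {}  # name -> instrument data
        self.protocols = {}    # name -> callable(instrument, amount) -> receipt

    def add_instrument(self, name, instrument):
        self.instruments[name] = instrument

    def add_protocol(self, name, handler):
        self.protocols[name] = handler

    def pay(self, instrument_name, protocol_name, amount):
        # The caller never sees protocol details; the handler hides them.
        handler = self.protocols[protocol_name]
        return handler(self.instruments[instrument_name], amount)

def card_protocol(card_number, amount):
    return f"charged {amount} to card ending {card_number[-4:]}"

def coupon_protocol(percent_off, amount):
    discounted = amount * (100 - percent_off) // 100
    return f"charged {discounted} after {percent_off}% discount"

wallet = Wallet()
wallet.add_instrument("card", "0000111122223333")   # obviously fake number
wallet.add_protocol("card", card_protocol)
card_receipt = wallet.pay("card", "card", 200)

# A vendor coupon is added later to the same wallet, with no new wallet needed.
wallet.add_instrument("coupon10", 10)
wallet.add_protocol("coupon", coupon_protocol)
coupon_receipt = wallet.pay("coupon10", "coupon", 200)
```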
Client-Driven
Vendors should not be capable of
invoking the client's digital wallet to do
anything that the end-user may resent
or consider an annoyance.
Symmetric
Vendors and banks run software
analogous to wallets, which manages
their end of the financial operations.
Since the functionality is so similar, it
makes sense to re-use, whenever
possible, the same infrastructure and
interfaces within wallets, vendors, and
banks.
Generalized
Interfaces should be similar regardless of the type of
device or computer on which the wallet, bank, or vendor
application is running.
A digital wallet running on an "alternative" device, such
as a personal digital assistant (PDA) or a smart card,
for example, has substantial functionality in common
with a digital wallet built as an extension to a web
browser.
Thus, a digital wallet in these two environments should
re-use the same instrument and protocol
management interfaces.