Information Access for a Digital Library:
Cheshire II and the Berkeley Environmental Digital Library
Ray R. Larson
School of Information Management & Systems
University of California, Berkeley
Chad Carson
Computer Science Division, EECS
University of California, Berkeley
October 26, 1999
ASIS Annual Meeting 1999: Ray R. Larson
UCB Digital Library Project:
Research Agenda
• Funded by NSF/NASA/DARPA Digital
Library Initiative (Phases I and II)
• Research agenda
– Understand user needs.
– Extend functionality of documents.
• “Enliven” legacy documents.
– Improve access to information.
– Scale to large systems.
– Re-Invent Scholarly Information Access and Use
Testbed: An Environmental
Digital Library
• Collection: Diverse material relevant to
California’s key habitats.
• Users: A consortium of state agencies,
development corporations, private
corporations, regional government alliances,
educational institutions, and libraries.
• Potential: Impact on state-wide
environmental system (CERES)
The Environmental Library Users/Contributors
• California Resources Agency, California
Environment Resources Evaluation System
(CERES)
• California Department of Water Resources
• The California Department of Fish & Game
• SANDAG
• UC Water Resources Center Archives
• New Partners: CDL and SDSC
The Environmental Library Contents
• Environmental technical reports, bulletins, etc.
• County general plans
• Aerial and ground photography
• USGS topographic maps
• Land use and other special purpose maps
• Sensor data
• “Derived” information
• Collection data bases for the classification and
distribution of the California biota (e.g., SMASCH)
• Supporting 3-D, economic, traffic, etc. models
• Videos collected by the California Resources Agency
The Environmental Library Contents
• As of mid 1999, the collection represents
about three quarters of a terabyte of data,
including over 70,000 digital images, over
300,000 pages of environmental documents,
and over a million records in geographical
and botanical databases.
Botanical Data:
 The CalFlora Database contains taxonomical
and distribution information for more than
8000 native California plants. The Occurrence
Database includes over 300,000 records of
California plant sightings from many federal,
state, and private sources. The botanical
databases are linked to our CalPhotos
collection of California plants, and are also
linked to external collections of data, maps,
and photos.
Geographical Data:
 Much of the geographical data in our collection is
being used to develop our web-based GIS Viewer.
The Street Finder uses 500,000 TIGER records of
S.F. Bay Area streets along with the 70,000 records
from the USGS GNIS database. California
Dams is a database of information about the 1395
dams under state jurisdiction. An additional 11 GB
of geographical data represents maps and imagery
that have been processed for inclusion as layers in
our GIS Viewer. This includes Digital Ortho
Quads and DRG maps for the S.F. Bay Area.
Documents:
 Most of the 300,000 pages of digital documents are
environmental reports and plans that were provided by
California state agencies. This collection includes
documents, maps, articles, and reports on the California
environment including Environmental Impact Reports
(EIRs), educational pamphlets, water usage bulletins, and
county plans. Documents in this collection come from the
California Department of Water Resources (DWR),
California Department of Fish and Game (DFG), San
Diego Association of Governments (SANDAG), and many
other agencies. Among the most frequently accessed
documents are County General Plans for every California
county and a survey of 125 Sacramento Delta fish species.
Documents - cont.
The collection also includes about 20Mb of
full-text (HTML) documents from the
World Conservation Digital Library. In
addition to providing online access to
important environmental documents, the
document collection is the testbed for our
Multivalent Document research.
Photographs:
The photo collection includes 17,000
images of California natural resources from
the state Department of Water Resources,
several hundred aerial photos, 17,000
photos of California native plants from St.
Mary's College, the California Academy of
Sciences, and others, a small collection of
California animals, and 40,000 Corel stock
photos.
Testbed Success Stories
• LUPIN: CERES’ Land Use Planning Information
Network
– California County General Plans and other
environmental documents.
– Enter at Resources Agency Server, documents stored at
and retrieved from UCB DLIB server.
• California flood relief efforts
– High demand for some data sets only available on our
server (created by document recognition).
• CalFlora: Creation and interoperation of
repositories pertaining to plant biology.
• Cloning of services at Cal State Library, FBI
Research Highlights
• Documents
– Multivalent Document prototype
• Page images, structured documents, GIS data, photographs
• Intelligent Access to Content
– Document recognition
– Vision-based Image Retrieval: stuff, thing, scene
retrieval
– Natural Language Processing: categorizing the web,
Cheshire II, TileBar Interfaces
User Interface Paradigms:
Multivalent Documents
• An approach to new document types and
their authoring.
• Supports active, distributed, composable
transformations of multimedia documents.
• Enables sophisticated annotations,
intelligent result handling, user-modifiable
interface, composite documents.
Multivalent Documents
[Figure: a multivalent document drawn as a stack of layers over a scanned page image. Layer labels: Cheshire Layer, GIS Layer, Table Layer, OCR Layer, OCR Mapping Layer, Scanned Page Image. A separate label reads Network Protocols & Resources.]
Valence: 2: The relative capacity to unite, react, or interact
(as with antigens or a biological substrate).
Webster’s 7th Collegiate Dictionary
GIS in the MVD Framework
• Layers are georeferenced data sets.
• Behaviors are
– display semi-transparently
– pan
– zoom
– issue query
– display context
– “spatial hyperlinks”
– annotations
• Written in Java (to be merged with MVD-1 code line?)
GIS Viewer Example
http://elib.cs.berkeley.edu/annotations/gis/buildings.html
Overview of Cheshire II
• The Cheshire II system is intended to
provide an easy-to-use, standards-compliant
system capable of retrieving any type of
information in a wide variety of settings.
Overview of Cheshire II
• It supports SGML and XML.
• It is a client/server application.
• Uses the Z39.50 Information Retrieval Protocol.
• Server supports a Relational Database Gateway.
• Supports Boolean searching of all servers.
• Supports probabilistic ranked retrieval in the Cheshire search
engine.
• Search engine supports “nearest neighbor” searches and
relevance feedback.
• GUI interface on X window displays.
• WWW/CGI forms interface for DL, using combined
client/server CGI scripting via WebCheshire.
• Image Content retrieval using BlobWorld
• Support for the SDLIP (Simple Digital Library
Interoperability Protocol) for search and as Z39.50 Gateway
Cheshire II Searching
[Figure: Cheshire II search architecture. Diagram labels: Local, Remote, Z39.50, Internet, Scanned Images, Text.]
Current Usage of Cheshire II
• Web clients for:
– NSF/NASA/ARPA Digital Library
• Includes support for full-text and page-level search.
• Experimental Blob-World image search
– SunSite
– University of Liverpool
– University of Essex, HDS (part of AHDS)
– California Sheet Music Project
– Cha-Cha (Berkeley Intranet Search Engine)
– Univ. of Virginia
• Cheshire ranking algorithm is basis for
Inktomi (i.e., Yahoo, Hotbot, MSN? and
others)
Image Retrieval Research
• Finding “Stuff” vs “Things”
• BlobWorld
• Other Vision Research
Blobworld: use regions for retrieval
• We want to find general objects
→ Represent images based on coherent regions
Outline
• Why regions?
• Creating Blobworld: segmentation and
description
• Using Blobworld: query experiments
• Indexing blobs for faster querying
• Conclusions
Creating and using Blobworld
Create: extract features → segment image → describe regions
Use: query
Extract features for each pixel
• Color
– Take average color (L*a*b*) at the selected scale
→ ignore local color variations due to texture
– “zebra = gray horse + stripes”
• Texture
– Find contrast, anisotropy, polarity at the selected scale
• Position
Find groups in feature space
• Model feature distribution as a mixture of
Gaussians using Expectation-Maximization (EM)
Find regions in the image
• Label each pixel based on its Gaussian cluster
• Find connected components → regions
[Figure: example pixel labels (clusters 1 to 4) grouped into connected regions.]
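The two steps on this slide (label pixels by cluster, then find connected components) can be sketched as follows. This is a minimal illustration, not the Blobworld implementation: the function name `connected_regions` is hypothetical, and 4-connectivity is an assumption, since the slide does not say which neighborhood is used.

```python
from collections import deque


def connected_regions(labels):
    """Group equally labeled, 4-connected pixels into regions.

    labels: 2-D list of per-pixel cluster labels (e.g. Gaussian cluster ids).
    Returns (region_map, region_count), where region_map has the same shape
    as labels and holds a region id for every pixel.
    """
    rows, cols = len(labels), len(labels[0])
    region = [[-1] * cols for _ in range(rows)]
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if region[r][c] != -1:
                continue  # pixel already belongs to a region
            # Breadth-first flood fill from this unvisited pixel.
            region[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and region[ny][nx] == -1
                            and labels[ny][nx] == labels[y][x]):
                        region[ny][nx] = next_id
                        queue.append((ny, nx))
            next_id += 1
    return region, next_id
```

Note that two pixels with the same cluster label end up in different regions when they are not connected, which is why one cluster can yield several blobs.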
Describe regions by color, texture,
shape
• Color
– Color histogram within region
– Quadratic distance: encode similarity between
color bins
d2hist(x, y) = (x - y)' A (x - y)
• Texture
– Mean contrast and anisotropy
→ stripes vs. spots vs. smooth
• (Basic) Shape
– Fourier descriptors of contour
Select appropriate scale for
processing
• Polarity: do all the gradient vectors point in
the same direction?
• Choose scale where polarity stabilizes
→ include one approximate period
Initialize means using image data
• Before, we picked random initialization
• Now, choose initial means based on image
tiles
[Figure: initial means chosen from image tiles for K=2, K=3, K=4, K=5.]
• Add noise to means and restart EM (4 runs
per K)
Grouping: Expectation-Maximization
• Given class characteristics (μ, Σ), find class membership
• Given class membership, find class characteristics (μ, Σ)
• Iterate
[Figure: alternating steps: update labels, update μ, Σ, update labels, update μ, Σ.]
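The alternation described on this slide can be illustrated with a small one-dimensional example. This is a sketch under stated assumptions, not the Blobworld code: the function names are hypothetical, the data are scalar rather than per-pixel color/texture/position vectors, and the variance floor of 1e-6 is an added safeguard that the slides do not mention.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """Fit a 1-D Gaussian mixture by Expectation-Maximization.

    means, variances, weights are the initial parameters (one per component).
    Returns the updated (means, variances, weights) as new lists.
    """
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(n_iter):
        # E-step: given class characteristics, find soft class membership.
        resp = []
        for x in data:
            joint = [weights[i] * gaussian_pdf(x, means[i], variances[i])
                     for i in range(k)]
            total = sum(joint)
            resp.append([j / total for j in joint])
        # M-step: given class membership, find class characteristics.
        for i in range(k):
            r_sum = sum(r[i] for r in resp)
            weights[i] = r_sum / len(data)
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / r_sum
            var = sum(r[i] * (x - means[i]) ** 2
                      for r, x in zip(resp, data)) / r_sum
            variances[i] = max(var, 1e-6)  # assumed floor to avoid collapse
    return means, variances, weights
```

Calling `em_gmm_1d(data, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5])` on data drawn from two well-separated groups moves each mean toward the center of its group.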
How many Gaussians?
• Model selection: Minimum Description Length
– Prefer fewer Gaussians if performance is comparable
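One common way to express "prefer fewer Gaussians if performance is comparable" is a two-part description length: the negative log-likelihood plus a penalty that grows with the number of free parameters. The sketch below assumes that standard form, score = -log L + (m/2) log n; the slide names only Minimum Description Length, so the exact criterion and the function names here are assumptions.

```python
import math


def gaussian_mixture_param_count(k, d):
    """Free parameters of a K-component, d-dimensional full-covariance mixture.

    (k - 1) mixing weights, k * d mean entries, and k * d * (d + 1) / 2
    entries of the symmetric covariance matrices.
    """
    return (k - 1) + k * d + k * d * (d + 1) // 2


def select_k_by_mdl(log_likelihoods, n_samples, d):
    """Pick the K with the smallest description length.

    log_likelihoods: dict mapping K to the fitted model's log-likelihood.
    n_samples: number of data points (pixels) the models were fitted to.
    """
    best_k, best_score = None, float("inf")
    for k, log_l in sorted(log_likelihoods.items()):
        penalty = 0.5 * gaussian_mixture_param_count(k, d) * math.log(n_samples)
        score = -log_l + penalty
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```

With this scoring, a larger K is chosen only when its gain in log-likelihood exceeds the extra penalty for its additional parameters.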
Find groups in feature space
• Model feature distribution as a mixture of
Gaussians using Expectation-Maximization (EM)
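EM math
Probability density:
\[ f(x \mid \Theta) = \sum_{i=1}^{K} \alpha_i \, f_i(x \mid \theta_i) \]
\[ f_i(x \mid \theta_i) = \frac{1}{(2\pi)^{d/2} \, (\det \Sigma_i)^{1/2}} \, e^{-\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)} \]
Update equations:
\[ \alpha_i^{new} = \frac{1}{N} \sum_{j=1}^{N} p(i \mid x_j, \Theta^{old}) \]
\[ \mu_i^{new} = \frac{\sum_{j=1}^{N} x_j \, p(i \mid x_j, \Theta^{old})}{\sum_{j=1}^{N} p(i \mid x_j, \Theta^{old})} \]
\[ \Sigma_i^{new} = \frac{\sum_{j=1}^{N} p(i \mid x_j, \Theta^{old}) \, (x_j - \mu_i^{new})(x_j - \mu_i^{new})^T}{\sum_{j=1}^{N} p(i \mid x_j, \Theta^{old})} \]
where
\[ p(i \mid x_j, \Theta^{old}) = \frac{\alpha_i \, f_i(x_j \mid \theta_i)}{\sum_{k=1}^{K} \alpha_k \, f_k(x_j \mid \theta_k)} \]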
Encode similarity between color bins
• Quadratic distance
• Distance between histograms x and y:
d2hist(x, y) = (x - y)' A (x - y)
• Aij is based on the similarity between bins i
and j
– Neighboring bins have Aij = 0.5
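The quadratic distance above can be worked through in a few lines. This is a minimal sketch: the helper names are hypothetical, and the similarity matrix assumes bins arranged along a single axis (1 on the diagonal, 0.5 for adjacent bins, 0 otherwise), whereas real color bins would have neighbors in several dimensions.

```python
def neighbor_similarity_matrix(n_bins):
    """Build A with A[i][i] = 1, A[i][j] = 0.5 for adjacent bins, else 0.

    Adjacency along a single axis is an assumption made for illustration.
    """
    return [[1.0 if i == j else 0.5 if abs(i - j) == 1 else 0.0
             for j in range(n_bins)]
            for i in range(n_bins)]


def quadratic_distance_sq(x, y, A):
    """Squared quadratic-form distance (x - y)' A (x - y) between histograms."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    n = len(diff)
    return sum(diff[i] * A[i][j] * diff[j] for i in range(n) for j in range(n))
```

For example, with three bins, moving all mass from bin 0 to the neighboring bin 1 gives a squared distance of 1.0, while moving it to the non-neighboring bin 2 gives 2.0. A plain Euclidean distance would score both cases the same, which is the motivation for encoding bin similarity in A.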
Fourier descriptors for shape
• [Zahn & Roskies ’72, Kuhl & Giardina ’82]
• Find (x,y) representation of outer contour
• Find Fourier series of (x,y)
– Coefficients specify an ellipse (4 parameters):
major axis, minor axis, orientation, starting point
• Remove starting point ambiguity
• Store first ten Fourier coefficients
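A simplified stand-in for the idea on this slide is sketched below. It is not the elliptic formulation of the cited papers: it treats each contour point as a complex number x + iy, takes a discrete Fourier transform, and keeps only the magnitudes of the first ten coefficients. Discarding phase is one simple way to remove the starting point ambiguity for a uniformly sampled closed contour; the function name and this choice are assumptions.

```python
import cmath
import math


def fourier_descriptors(contour, n_coeffs=10):
    """Magnitudes of the first n_coeffs Fourier coefficients of a closed contour.

    contour: list of (x, y) points sampled in order around the outline.
    The k = 0 term (the centroid) is skipped, so the result does not
    depend on where the shape sits in the image.
    """
    n = len(contour)
    z = [complex(x, y) for x, y in contour]
    magnitudes = []
    for k in range(1, n_coeffs + 1):
        coeff = sum(z[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)) / n
        magnitudes.append(abs(coeff))
    return magnitudes
```

Starting the same contour at a different point only multiplies each coefficient by a unit-magnitude phase factor, so the returned magnitudes are unchanged.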
Creating and using Blobworld
Create: extract features → segment image → describe regions
Use: query
Querying: let user see the
representation
• Current systems are unsatisfying
– User can’t see what the computer sees
– Unclear how parameters relate to the image
• User should interact with the representation
– Helps in query formulation
– Makes results understandable
– Minimizes disappointment
http://elib.cs.berkeley.edu/photos/blobworld
Query experiments
• Collection of 10,000 Corel stock photos
• Five query images in each of ten categories
(e.g., cheetahs, polar bears, airplanes)
• Compare Blobworld to global histogram queries
• Precision (% of retrieved images that are correct)
vs. Recall (% of correct images that are retrieved)
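The two measures defined on this slide can be computed for the top k results of a ranked list as follows. This is a small illustrative helper; the function name and the cutoff parameter k are not from the slides.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall after retrieving the top k ranked images.

    precision: fraction of retrieved images that are correct.
    recall: fraction of correct images that are retrieved.
    """
    retrieved = ranked_ids[:k]
    relevant = set(relevant_ids)
    hits = sum(1 for image_id in retrieved if image_id in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Sweeping k from 1 to the length of the ranking yields the points of a precision vs. recall curve for one query.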
Distinctive objects
• Tigers, cheetahs, and zebras:
– Blobworld does better than global histograms
[Figure: precision vs. recall plots for cheetah and zebra queries, comparing two blobs, blob+background, and global histogram.]
Distinctive objects and
backgrounds
• Eagles and black bears:
– Blobworld does better than global histograms
[Figure: precision vs. recall plot for black bear queries, comparing two blobs, blob+background, and global histogram.]
Distinctive scenes
• Airplanes and brown bears:
– Global histograms do better than Blobworld
– But Blobworld has room to grow (shape, etc.)
[Figure: precision vs. recall plot for airplane queries, comparing two blobs, blob+background, and global histogram.]
Index to search huge collections
• Indexing is trickier than for traditional data
• We can afford some mistakes: even with full
search, we’ll miss some tigers and include
some pumpkins
• Two approaches we have tried:
– Store terms and treat image as a document
– Store features and index using a tree
• Final (“correct”) ranking of images from
index
Index using conventional IR
methods
• Treat each database blob as a document
– Store “terms” (bins) for color, texture, location, and
shape
– Repeat color terms based on histogram weights
• Index using Cheshire II
• Treat each query blob as a document
– Repeat “terms” according to query weights
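The idea of turning a blob into a text-like document can be sketched as below. Everything specific here is hypothetical: the term spellings (`color3`, `texture2`, and so on), the function name, and the `max_repeats` scaling are invented for illustration, and the slides do not describe how Cheshire II actually encodes the terms.

```python
def blob_to_terms(color_hist, texture_bin, location_bin, shape_bin,
                  max_repeats=10):
    """Represent one blob as a list of index terms.

    color_hist: histogram weights per color bin (assumed to sum to 1).
    Color terms are repeated in proportion to their histogram weight so
    that term frequency stands in for the weight; the texture, location,
    and shape bins each contribute a single term.
    """
    terms = []
    for bin_index, weight in enumerate(color_hist):
        repeats = round(weight * max_repeats)
        terms.extend([f"color{bin_index}"] * repeats)
    terms.append(f"texture{texture_bin}")
    terms.append(f"location{location_bin}")
    terms.append(f"shape{shape_bin}")
    return terms
```

A query blob would be converted the same way, with repeats driven by the query weights, so that an ordinary text index can match query terms against database blob terms.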
Indexing and Retrieval with
Cheshire II
• Originally used the same probabilistic
algorithm used for text
– Blobs are not distributed like text words or
stems
• Now using a weighting based on
coordination level match with a minimum
threshold (must have at least half of the
characteristics of the query cluster).
• Still eyeballing data, but seems much better
for many types of queries
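A coordination level match with a minimum threshold can be sketched as follows. This is an illustration of the general technique, not the Cheshire II code: the function name, the dictionary-based index, and the tie-breaking by blob id are assumptions. The default threshold of one half follows the slide.

```python
def coordination_level_rank(query_terms, blob_index, min_fraction=0.5):
    """Rank blobs by how many distinct query terms they share.

    query_terms: terms describing the query blob.
    blob_index: dict mapping blob id to that blob's list of terms.
    Blobs matching fewer than min_fraction of the distinct query terms
    are dropped; the rest are ordered by number of shared terms.
    """
    query = set(query_terms)
    needed = min_fraction * len(query)
    scored = []
    for blob_id, terms in blob_index.items():
        overlap = len(query & set(terms))
        if overlap >= needed:
            scored.append((overlap, blob_id))
    # Highest overlap first; ties broken by blob id for a stable order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [blob_id for _, blob_id in scored]
```

Unlike a probabilistic text ranking, this score depends only on how many query characteristics a blob shares, not on how rare each characteristic is in the collection.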
Conclusions
• Image retrieval in general collections
requires region segmentation and
description
• Blobworld yields high precision in queries
for distinctive objects
• Blobworld can be indexed to allow fast
querying
Further Information
• Full Cheshire II client and server source is
available
ftp://sherlock.berkeley.edu/pub/cheshire/
– Includes HTML and Troff documentation
• http://cheshire.lib.berkeley.edu/
• UC Berkeley Digital Library Project
– http://elib.cs.berkeley.edu