Information Discovery
Lecture 18
Library Catalogs 3
A Footnote on Stoplists
Adding a term to a stoplist tends to:
Lower recall (some matches are missed)
Increase precision (unrelated apparent matches are missed)
A skilled user:
Is aware of the stoplist
Takes care to avoid stoplist words if recall might suffer
An unskilled user:
Is unaware of the stoplist
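The effect of a stoplist on recall and precision can be illustrated with a minimal sketch. The stoplist, the three tiny documents, and the "match any term" rule below are all assumptions made for illustration; they are not taken from any particular catalog system.

```python
import re

# Hypothetical stoplist; real systems choose their own.
STOPLIST = frozenset({"a", "and", "be", "not", "of", "or", "the", "to"})

def tokenize(text, stoplist=frozenset()):
    """Lowercase the text, split on non-letters, and drop stoplist terms."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in stoplist]

def search(query, docs, stoplist=frozenset()):
    """Return ids of documents that contain at least one surviving query term."""
    terms = set(tokenize(query, stoplist))
    hits = []
    for doc_id, text in docs.items():
        if terms & set(tokenize(text, stoplist)):
            hits.append(doc_id)
    return hits

# Illustrative documents only.
DOCS = {
    1: "To be or not to be, that is the question",
    2: "The making of the punditocracy",
    3: "Sound and fury",
}

# Lower recall: every query term is a stop word, so the real match is missed.
print(search("to be or not to be", DOCS))
print(search("to be or not to be", DOCS, STOPLIST))

# Higher precision: documents that matched only on "the" no longer appear.
print(search("the fury", DOCS))
print(search("the fury", DOCS, STOPLIST))
```

With the stoplist in place the first query finds nothing, while the second query returns only the document that actually mentions "fury".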
General practice:
Limits of Dublin Core and MARC: Complex Objects
Complex objects
[Figure: metadata records, complete object, sub-objects]
• Article within a journal
• Page within a Web site
• A thumbnail of another image
• The March 28 final edition of a newspaper
Limits of Dublin Core and MARC: Events
[Figure: Version 1, Version 2, new material]
Should Version 2 have its own record or should extra information be added to the Version 1 record?
How are these represented in Dublin Core or MARC?
Packaging Rules
Metadata Object Description Schema (MODS)
http://www.loc.gov/standards/mods/
MPEG 21
http://www.chiariglione.org/mpeg/standards/mpeg-21/mpeg21.htm
Modernizing MARC
1. Keep the data
2. Convert to Unicode for representing scripts
3. Convert to XML for tagging cataloguing metadata.
MARCXML (MARC 21 XML)
http://www.loc.gov/standards/marcxml/
[Direct conversion to XML tagging]
Metadata Object Description Schema (MODS)
http://www.loc.gov/standards/mods/
[Subset of MARC with data clean-up]
MARCXML
1. Simple and Flexible MARC XML Schema
The schema retains the semantics of MARC. Fields
are treated as elements with the tag as an attribute
and indicators treated as attributes. Subfields are
treated as subelements with the subfield code as an
attribute.
2. Lossless Conversion of MARC to XML
3. Roundtripability from XML back to MARC
4. Data Presentation by writing an XML stylesheet
5. Validation of MARC data
6. Extensibility
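The structure described in point 1 can be sketched by reading a small MARCXML fragment with Python's standard library. The fragment is hand-written for illustration (the title is borrowed from the MODS example in this lecture, and the indicator values are assumed); the function name read_datafields is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by the MARC 21 XML ("slim") schema.
NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hand-written illustrative fragment, not a real catalog record.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">Sound and fury :</subfield>
    <subfield code="b">the making of the punditocracy /</subfield>
  </datafield>
</record>"""

def read_datafields(xml_text):
    """Return each data field with its tag, indicators, and (code, value) subfields."""
    root = ET.fromstring(xml_text)
    fields = []
    for df in root.findall("marc:datafield", NS):
        subfields = [(sf.get("code"), sf.text)
                     for sf in df.findall("marc:subfield", NS)]
        fields.append({
            "tag": df.get("tag"),      # MARC field tag, carried as an attribute
            "ind1": df.get("ind1"),    # indicators, also attributes
            "ind2": df.get("ind2"),
            "subfields": subfields,    # subfield code is an attribute of each subelement
        })
    return fields

for field in read_datafields(SAMPLE):
    print(field["tag"], field["ind1"], field["ind2"], field["subfields"])
```

Because the tag, indicators, and subfield codes all survive as attributes, the same data could be written back out as MARC, which is what makes the conversion lossless in principle.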
MODS Example (extracts)
<mods>
  <titleInfo>
    <title>Sound and fury :</title>
    <subTitle>the making of the punditocracy /</subTitle>
  </titleInfo>
  <name type="personal">
    <namePart>Alterman, Eric</namePart>
    <role>
      <roleTerm type="text">creator</roleTerm>
    </role>
  </name>
  <typeOfResource>text</typeOfResource>
  <originInfo>
    <place>
      <placeTerm type="text">Ithaca, N.Y</placeTerm>
    </place>
    <publisher>Cornell University Press</publisher>
    <dateIssued>c1999</dateIssued>
  </originInfo>
  <language>
    <languageTerm authority="iso639-2b" type="code">eng</languageTerm>
  </language>
</mods>
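As a rough sketch of how such a record might be consumed, the extract can be read with Python's standard library. The extract shown on the slide carries no XML namespace, so the paths below are unqualified; a complete MODS file would normally declare the MODS namespace and the paths would need to be qualified accordingly. The function name summarize is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the extract from the slide (no namespace declared).
MODS_XML = """<mods>
  <titleInfo>
    <title>Sound and fury :</title>
    <subTitle>the making of the punditocracy /</subTitle>
  </titleInfo>
  <name type="personal">
    <namePart>Alterman, Eric</namePart>
    <role><roleTerm type="text">creator</roleTerm></role>
  </name>
  <originInfo>
    <publisher>Cornell University Press</publisher>
    <dateIssued>c1999</dateIssued>
  </originInfo>
</mods>"""

def summarize(mods_text):
    """Pull a few descriptive fields out of a namespace-free MODS extract."""
    root = ET.fromstring(mods_text)
    return {
        "title": root.findtext("titleInfo/title"),
        "subtitle": root.findtext("titleInfo/subTitle"),
        "creator": root.findtext("name/namePart"),
        "publisher": root.findtext("originInfo/publisher"),
        "date": root.findtext("originInfo/dateIssued"),
    }

for key, value in summarize(MODS_XML).items():
    print(f"{key}: {value}")
```

Note that the MARC punctuation (the trailing " :" and " /") is still present in the values, so an application would probably want to strip it before display.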
Using Catalog Data for IR
The basic operation of information retrieval is to match the way that a user describes an information requirement (a query) against the way that items are described (an index).
The success of conventional catalogs (e.g., MARC + Anglo-American Cataloguing Rules) or indexing services (e.g., Medline) comes from the use of precise language to describe items, combined with trained and experienced users to formulate queries.
Why is Dublin Core not used to Index and Search the Web?
Technology: The methods used in early Infoseek, Lycos and AltaVista have been greatly enhanced. (Note that these methods provide quite good precision at the expense of low recall.)
Users: The typical user who searches the Web has limited training and does not understand catalogs.
Economics: The size of the Web makes human indexing of every important site impossible. The rate of change requires frequent re-indexing.
Automatic extraction of catalog data
Example: Dublin Core records for web pages
Strategies
• Manual by trained cataloguers
  - high quality records, but expensive and time consuming
• Entirely automatic
  - fast, almost zero cost, but poor quality
• Automatic followed by human editing
  - cost and quality depend on the amount of editing
• Manual collection level record, automatic item level record
DC-dot
DC-dot is a Dublin Core metadata editor for web pages, created by Andy Powell at UKOLN
http://www.ukoln.ac.uk/metadata/dcdot/
DC-dot has two parts:
(a) A skeleton Dublin Core record is created automatically from clues in the web page
(b) A user interface is provided for cataloguers to edit the record
Automatic record for CS 430 home page
DC-dot applied to
http://www.cs.cornell.edu/courses/cs430/2001sp/
<link rel="schema.DC" href="http://purl.org/dc">
<meta name="DC.Title" content="CS 430: Information
Discovery">
<meta name="DC.Subject" content="[email protected];
Course Structure; Readings and references; Slides; Basic
Information; William Y. Arms; Information Retrieval Data
Structures and Algorithms; [email protected]; Assignments;
Syllabus; Text Book; Laptop computers; Assumed Background;
Nomadic Computing Experiment; Notices; Course Description;
Code of practice; Assignments and Grading; Last changed:
Automatic record for CS 430 home page
(continued)
DC-dot applied to
http://www.cs.cornell.edu/courses/cs430/2001sp/
<meta name="DC.Publisher" content="Cornell University">
<meta name="DC.Date" scheme="W3CDTF" content="2001-02-07">
<meta name="DC.Type" scheme="DCMIType" content="Text">
<meta name="DC.Format" content="text/html">
<meta name="DC.Format" content="5781 bytes">
<meta name="DC.Identifier"
content="http://www.cs.cornell.edu/courses/cs430/2001sp/">
Observations on DC-dot applied to CS 430 home page
DC.Title is a copy of the HTML <title> field
DC.Publisher is the owner of the IP address where the page was stored
DC.Subject is a list of headings and noun phrases presented for editing
DC.Date is taken from the Last-Modified field in the HTTP header
DC.Type and DC.Format are taken from the MIME type of the HTTP response
DC.Identifier was supplied by the user as input
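Several of these observations can be sketched in a few lines of Python. This is not DC-dot's code; it is a minimal illustration that assumes the page text and the HTTP response headers have already been fetched. The names skeleton_record and to_meta_tags, the sample page, the time of day in the Last-Modified header, and the rule that any text/* MIME type maps to the DCMIType "Text" are all assumptions. DC.Publisher and DC.Subject are omitted because they need an address-ownership lookup and phrase extraction respectively.

```python
from email.utils import parsedate_to_datetime
from html import escape
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the first-level <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def skeleton_record(url, html_text, headers):
    """Build (name, scheme, content) triples from the page and its HTTP headers."""
    parser = TitleParser()
    parser.feed(html_text)
    record = []
    if parser.title.strip():
        record.append(("DC.Title", None, " ".join(parser.title.split())))
    if "Last-Modified" in headers:
        when = parsedate_to_datetime(headers["Last-Modified"])
        record.append(("DC.Date", "W3CDTF", when.strftime("%Y-%m-%d")))
    if "Content-Type" in headers:
        mime = headers["Content-Type"].split(";")[0].strip()
        if mime.startswith("text/"):  # assumed mapping, for illustration only
            record.append(("DC.Type", "DCMIType", "Text"))
        record.append(("DC.Format", None, mime))
    record.append(("DC.Format", None, f"{len(html_text.encode('utf-8'))} bytes"))
    record.append(("DC.Identifier", None, url))  # supplied by the user
    return record

def to_meta_tags(record):
    """Render the triples as HTML meta tags in the style shown on the slides."""
    lines = []
    for name, scheme, content in record:
        scheme_attr = f' scheme="{scheme}"' if scheme else ""
        lines.append(f'<meta name="{name}"{scheme_attr} content="{escape(content)}">')
    return "\n".join(lines)

# Illustrative inputs; the header values are made up apart from the date.
SAMPLE_URL = "http://www.cs.cornell.edu/courses/cs430/2001sp/"
SAMPLE_HTML = ("<html><head><title>CS 430: Information\nDiscovery</title></head>"
               "<body><h1>Syllabus</h1></body></html>")
SAMPLE_HEADERS = {
    "Last-Modified": "Wed, 07 Feb 2001 15:30:00 GMT",
    "Content-Type": "text/html; charset=iso-8859-1",
}

print(to_meta_tags(skeleton_record(SAMPLE_URL, SAMPLE_HTML, SAMPLE_HEADERS)))
```

Everything in this sketch comes from mechanical clues, which is why the resulting record is cheap to produce but needs a cataloguer to review it.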
Automatic record for George W. Bush home page
DC-dot applied to http://www.georgewbush.com/
<link rel="schema.DC" href="http://purl.org/dc">
<meta name="DC.Subject" content="George W. Bush;
Bush; George Bush; President; republican; 2000 election;
election; presidential election; George; B2K; Bush for
President; Junior; Texas; Governor; taxes; technology;
education; agriculture; health care; environment; society;
social security; medicare; income tax; foreign policy;
defense; government">
<meta name="DC.Description" content="George W. Bush
is running for President of the United States to keep the
country prosperous.">
Automatic record for George W. Bush home page (continued)
DC-dot applied to http://www.georgewbush.com/
<meta name="DC.Publisher" content="Concentric Network
Corporation">
<meta name="DC.Date" scheme="W3CDTF" content="2001-01-12">
<meta name="DC.Type" scheme="DCMIType" content="Text">
<meta name="DC.Format" content="text/html">
<meta name="DC.Format" content="12223 bytes">
<meta name="DC.Identifier"
content="http://www.georgewbush.com/">
Observations on DC-dot applied to George W. Bush home page
The home page has several meta tags:
<META NAME="TITLE" CONTENT="George W. Bush for
President"> [The page has no html <title>]
<META NAME="CONTACT" CONTENT="George W Bush
Campaign, P. O. Box 1902, Austin, TX 78767, Phone: (512) 637-2000">
<META NAME="DESCRIPTION" CONTENT="George W. Bush
is running for President of the United States to keep the country
prosperous.">
<META NAME="KEYWORDS" CONTENT="George W. Bush,
Bush, George Bush, President, republican, 2000 election and
more
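Comparing the page's own meta tags with the automatic record suggests that DC.Subject and DC.Description were copied from the KEYWORDS and DESCRIPTION tags. The sketch below illustrates that kind of mapping; it is an inference from the two slides, not DC-dot's actual code, and the function name subject_and_description is hypothetical. The keyword list is cut to the first four terms shown on the slide.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs, keyed by lowercased name."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content

def subject_and_description(html_text):
    """Map KEYWORDS and DESCRIPTION meta tags to Dublin Core fields."""
    collector = MetaCollector()
    collector.feed(html_text)
    record = {}
    if "keywords" in collector.meta:
        terms = [t.strip() for t in collector.meta["keywords"].split(",") if t.strip()]
        record["DC.Subject"] = "; ".join(terms)
    if "description" in collector.meta:
        # Collapse line breaks inside the attribute value.
        record["DC.Description"] = " ".join(collector.meta["description"].split())
    return record

# Abbreviated from the meta tags quoted on the slide.
PAGE = '''<META NAME="DESCRIPTION" CONTENT="George W. Bush
is running for President of the United States to keep the country
prosperous.">
<META NAME="KEYWORDS" CONTENT="George W. Bush, Bush, George Bush, President">'''

for name, content in subject_and_description(PAGE).items():
    print(f"{name}: {content}")
```

A record built this way is only as trustworthy as the page author's keywords, which matters for pages written to impress search engines.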
Collection-level metadata
Several of the most difficult fields to extract automatically are the same across all pages in a web site.
Therefore create a collection record manually and combine it with automatic extraction of other fields at item level.
For the CS 430 home page, collection-level metadata:
<meta name="DC.Publisher" content="Cornell University">
<meta name="DC.Creator" content="William Y. Arms">
<meta name="DC.Rights" content="William Y. Arms, 2001">
See: Jenkins and Inman
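Combining the two sources can be sketched as a simple merge. The collection-level values are the ones given above for the CS 430 home page; the item-level values and the merge policy are assumptions made for illustration. In this sketch a collection-level value replaces any automatic value for the fields named in override, and all other collection-level values are appended alongside the automatic ones.

```python
# Collection-level record, created once by hand for the whole site.
COLLECTION = [
    ("DC.Publisher", "Cornell University"),
    ("DC.Creator", "William Y. Arms"),
    ("DC.Rights", "William Y. Arms, 2001"),
]

# Item-level record as an automatic tool might produce it (illustrative values).
ITEM = [
    ("DC.Title", "CS 430: Information Discovery"),
    ("DC.Publisher", "Cornell University"),
    ("DC.Type", "Text"),
    ("DC.Format", "text/html"),
]

def combine(item_record, collection_record,
            override=("DC.Publisher", "DC.Creator", "DC.Rights")):
    """Merge item-level and collection-level (name, value) pairs.

    Assumed policy: for fields listed in `override`, the manual
    collection-level value wins; other fields keep values from both sources.
    """
    overridden = {name for name, _ in collection_record if name in override}
    combined = [(n, v) for n, v in item_record if n not in overridden]
    for pair in collection_record:
        if pair not in combined:
            combined.append(pair)
    return combined

for name, value in combine(ITEM, COLLECTION):
    print(f"{name}: {value}")
```

Letting the manual value win for fields such as publisher guards against cases where the automatic guess is wrong, as with the hosting company reported as publisher of the Bush page.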
Metadata extracted automatically by DC-dot
D.C. Field   Qualifier    Content
title                     Digital Libraries and the Problem of Purpose
subject                   not included in this slide
publisher                 Corporation for National Research Initiatives
date         W3CDTF       2000-05-11
type         DCMIType     Text
format                    text/html
format                    27718 bytes
identifier                http://www.dlib.org/dlib/january00/01levy.html
Collection-level record
D.C. Field   Qualifier    Content
publisher                 Corporation for National Research Initiatives
type                      article
type         resource     work
relation     rel-type     InSerial
relation     serial-name  D-Lib Magazine
relation     issn         1082-9873
language                  English
rights                    Permission is hereby given for the material in D-Lib Magazine to be used for ...
Combined item-level record
(DC-dot plus collection-level)
D.C. Field   Qualifier    Content
title                     Digital Libraries and the Problem of Purpose
publisher                 (*) Corporation for National Research Initiatives
date         W3CDTF       2000-05-11
type                      (*) article
type         resource     (*) work
type         DCMIType     Text
format                    text/html
format                    27718 bytes
(*) indicates collection-level metadata
Combined item-level record
(DC-dot plus collection-level)
D.C. Field   Qualifier    Content
relation     rel-type     (*) InSerial
relation     serial-name  (*) D-Lib Magazine
relation     issn         (*) 1082-9873
language                  (*) English
rights                    (*) Permission is hereby given for the material in D-Lib Magazine to be used for ...
identifier                http://www.dlib.org/dlib/january00/01levy.html
(*) indicates collection-level metadata
Manually created record
D.C. Field   Qualifier    Content
title                     Digital Libraries and the Problem of Purpose
creator                   (+) David M. Levy
publisher                 Corporation for National Research Initiatives
date         publication  January 2000
type                      article
type         resource     work
(+) entry that is not in the automatically generated records
continued on next slide
Manually created record
D.C. Field   Qualifier    Content
relation     rel-type     InSerial
relation     serial-name  D-Lib Magazine
relation     issn         1082-9873
relation     volume       (+) 6
relation     issue        (+) 1
identifier   DOI          (+) 10.1045/january2000-levy
identifier   URL          http://www.dlib.org/dlib/january00/01levy.html
language                  English
rights                    (+) Copyright (c) David M. Levy
(+) entry that is not in the automatically generated records
Collection-level metadata
Compare:
(a) Metadata extracted automatically by DC-dot
(b) Collection-level record
(c) Combined item-level record (DC-dot plus collection-level)
(d) Manual record
For web pages, information retrieval works better with automatic indexing than with automatic extraction of metadata followed by indexing of the metadata.
However, we will see later an effective example of automated extraction of metadata from video sequences (Informedia).
Search Engine Spam
D-Lib Magazine
Web pages created for users, with good quality control and no attempt to impress search engines. (The editor originally trained as a librarian.)
The site lends itself to automatic indexing.
Political Web Sites (Bush and Gore)
Web pages created for marketing, with little consistency, designed to impress search engines. (The editors are specialists in public relations.)
The sites are difficult to index automatically.
Metatest
Metatest is a research project led by Liz Liddy at Syracuse with participation from the Human Computer Interaction group at Cornell.
The aim is to compare the effectiveness, as perceived by the user, of indexing based on:
(a) Manually created Dublin Core
(b) Automatically created Dublin Core (higher quality than DC-dot)
(c) Full text indexing
Preliminary results suggest remarkably little difference in