Introducing Voyager with Unicode


Voyager® with Unicode™:
A Cataloger’s Session
Connie Braun
Training Consultant
Agenda
Introduction
Your Work Environment
Conversion
New Features
Learning More
Q&A
Release Update
General release occurred October 6, 2004!
• 4 production partners
  • 1 Windows Server, 3 Solaris
• 8 test server partners
  • 4 Task Force members (large non-Roman collections)
  • 1 large consortium with Universal Borrowing & Universal Catalog
  • 2 European customers
• As of 01/20/05, 71 customers have upgraded and are functioning in a production environment with Voyager with Unicode. Approximately 50 upgrades are scheduled between now and May 2005.
Why Unicode™ in Voyager?
• Brings Voyager up to current IT standards
• Finds and displays records in the native language
• Creates and edits any MARC record using UTF-8
• Imports and exports records with any supported character set
• Lets operators select a Unicode-compliant font of their choice
• Displays Unicode characters in the OPAC without proprietary software
Implementing Voyager with Unicode
For our customers, it’s business as usual, but with some interesting
changes and improvements, especially in Cataloging.
Helping everyone to implement a Unicode-compliant system is
Endeavor’s aim.
The Unicode standard is an important step towards realizing that
goal.
Implementing the Unicode standard is an extension of Endeavor’s
original mission: access to information regardless of location or
format.
Following Standards
• Follows standards (not proprietary)
  • See http://www.unicode.org for much more detail on these standards.
  • See http://lcweb.loc.gov/marc/specifications/speccharucs.html for details on LC’s format of MARC records that use Unicode. Voyager follows this specification.
  • Specifics on the Code Tables may be viewed at http://www.loc.gov/marc/specifications/specchartables.html
• The Voyager implementation of the Unicode standard gives libraries and their users greater flexibility when accessing collection materials that contain both Roman and non-Roman text.
Multilingual Input and Display
• With improved multilingual input and display capabilities in Voyager, characters now display correctly according to the Unicode and MARC standards.
• Greater script coverage for cataloging items in your collections, published in languages around the world.
• How many? As originally designed, UTF-8 could encode 2,147,483,648 (2^31) values; the current standard restricts it to the 1,114,112 code points of Unicode.
Preview Server
• Anyone interested in trying out Voyager with Unicode before your
upgrade? You can!
• http://support.endinfosys.com/cust/voy/upgrade/unicode/testwv_pre.html
provides all the details necessary to get you started
• Preview Server uses the Voyager training database that has been augmented with numerous records in both Roman and non-Roman languages
• Try keyword searches:
  • “non roman script japanese”
  • “non roman script arabic”
  • “roman script french”
  • “roman script italian”
Agenda
Introduction
Your Work Environment
• Workstation Requirements
• Setting Up For Languages Other Than English
• Tag Tables
• Session Defaults and Preferences
Conversion
New Features
Q&A
Workstation Requirements
In order to enjoy the full range of benefits, PCs
must have up-to-date operating systems and
productivity software.
This means that staff PCs will need:
• Windows 2000 or XP operating system
• Unicode standard compliant Internet browser
  • IE 6+
  • Netscape 6+
• Unicode-compliant font: Lucida Sans Unicode or Arial Unicode MS
MS Windows™
Voyager is more integrated with Windows in terms of:
• Standard Windows 2000/XP Unicode support
• Standard Unicode fonts
• Standard input using Input Method Editors (IMEs)
• Standard browser support
Setting Up for Languages Other
Than English
• Workstations need to be specifically
configured to work with languages
other than English
• Likely will require technical IT
assistance to install needed
languages on staff PCs
• Best to install all languages so that
cataloger may easily include new
ones as necessary
Adding Languages to PCs
• Regional and language
options are specific to each
PC
• Among options available via
Start – Settings – Control
Panel
• Details button on
Languages tab lets operator
view or change languages
and methods to enter text
• Can include supplemental
language support, too
Choosing Languages
• Languages added to
PCs will match
languages for items
found in your
collections
• Add and remove
according to your
needs; as few or
many as necessary
• May also set
preferences for
language bar and key
settings
Tag Tables
MARC Tag Tables have been completely revised
and rewritten for Voyager with Unicode
Tag Tables
• Ability to modify tag table configuration remains
the same as in earlier releases
• But, may not specify anything for Leader position
9 since that byte is now hard-coded to identify
records that have been converted to UTF-8
• May want to consider whether or not library will
need or want to revise Tag Tables for local use
• See Appendix A of Cataloging User’s Guide for
full details on revising, maintaining and updating
the Tag Tables
Record Validation
• MARC validation
• MARC21 character set validation
• Authority control validation
• Decomposition of accented characters for MARC21
Session Defaults and Preferences:
Record Validation
Bypass MARC21 Character set validation
• Uses MARC21 Repertoire.cfg to control validation of the MARC21 character set
• Helps to enforce MARC21 standard
Bypass Decomposition of accented characters for MARC21
• Allows records to be saved to the database without decomposing the characters
• IMPORTANT: If you select this option, MARC21 rules are ignored. We strongly recommend that this check box be unchecked, in order to comply with the MARC21 standard.
Session Defaults and Preferences:
Mapping Tab
Expected Character Set of Imported Records now has six options
Session Defaults and Preferences:
Colors/Fonts Tab
Agenda
Introduction
Your Work Environment
Conversion
• Data Conversion
• Conversion Error Logging
• Conversion Details
• Identifying Non-Unicode Data
• The Rest of Voyager
New Features
Learning More
Q&A
Data Conversion
• Conversion process during upgrade treats data differently than when importing records through Cataloging client or via BulkImport
• MARC records are converted from VRLIN (Voyager legacy encoding) to MARC21-compliant UTF-8 encoding
  • Leader position 9 becomes an ‘a’
  • Conversion log is created
• UTF-8 allows for variable-length characters. The majority of characters in the database occupy the same amount of space as before conversion.
Note: All indexes and database columns with MARC data are regenerated after conversion.
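To illustrate the point about variable-length characters, here is a minimal Python sketch (not part of Voyager; the sample characters are chosen only for illustration) that reports how many bytes individual characters occupy in UTF-8. Plain ASCII letters stay at one byte, which is why most existing data keeps its size.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# Illustrative sample characters (assumed, not taken from a Voyager database).
samples = [
    ("a", "basic Latin letter"),
    ("\u00f1", "precomposed n with tilde"),
    ("\u0303", "combining tilde"),
    ("\u5b89", "CJK ideograph"),
]

for ch, label in samples:
    print(f"U+{ord(ch):04X} {label}: {utf8_length(ch)} byte(s)")
```

The basic Latin letter takes one byte, the two tilde forms take two bytes each, and the CJK ideograph takes three.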
Conversion Details
IMPORTANT! NO RECORDS ARE LOST

Each field in the record handled individually.

As each field is processed, it may change length, requiring adjustments to
the leader and directory of the record.

Records are saved to the database with a leader position 9 = ‘a’.

Both record-level and field-level checking are performed. In rare cases an
entire record might fail conversion; it is more likely that an individual field
fails to be converted.

Records may not convert if they contain text that cannot be mapped into
Unicode according to the standard MARC-8 to Unicode mappings.

Records that do not convert are stored in the database as is, without being
converted to Unicode.
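The leader and directory adjustment mentioned above can be sketched as follows. This is a simplified illustration of the standard MARC record layout (24-character leader, 12-character directory entries, field and record terminators), not Voyager's conversion code; the function name, leader template and sample field are assumptions made for the example.

```python
FIELD_TERMINATOR = b"\x1e"
RECORD_TERMINATOR = b"\x1d"
# Hypothetical leader template; position 9 is 'a' to flag UTF-8.
LEADER_TEMPLATE = "00000nam a2200000 a 4500"


def build_record(fields):
    """Assemble a MARC record from (tag, text) pairs.

    text already holds indicators and subfield delimiters (0x1F).
    Lengths are counted in bytes, so they change when the encoding changes.
    """
    directory = b""
    body = b""
    for tag, text in fields:
        data = text.encode("utf-8") + FIELD_TERMINATOR
        # Directory entry: tag (3), field length (4), starting offset (5).
        directory += f"{tag}{len(data):04d}{len(body):05d}".encode("ascii")
        body += data
    base_address = 24 + len(directory) + 1  # leader + directory + terminator
    record_length = base_address + len(body) + 1
    leader = (f"{record_length:05d}" + LEADER_TEMPLATE[5:12]
              + f"{base_address:05d}" + LEADER_TEMPLATE[17:])
    return (leader.encode("ascii") + directory + FIELD_TERMINATOR
            + body + RECORD_TERMINATOR)


composed = build_record([("245", "10\x1faEspa\u00f1a")])
decomposed = build_record([("245", "10\x1faEspan\u0303a")])
print(composed[:24], composed[24:36])
print(decomposed[:24], decomposed[24:36])
```

The decomposed form of the same title is one byte longer, so both the directory entry and the record length in the leader differ between the two records.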
Conversion Error Logging
Libraries need to know the details about the results of the conversion process.
• Full error checking and logging is included as part of the upgrade
  • Technical User’s Guide, Chapter 4
  • Cataloging User’s Guide, Appendix C
• Library designates should review this file to plan for correcting any records that have errors
Sample from Conversion Log File
Conversion Log Details 1
# 11 secs read=982 changed=791 880=0 okay=982 errors=0 written=982
# 21 secs read=1931 changed=1558 880=0 okay=1931 errors=0 written=1931
# 29 secs read=2848 changed=2087 880=0 okay=2848 errors=0 written=2848
# 36 secs read=3699 changed=2533 880=0 okay=3699 errors=0 written=3699
# 43 secs read=4607 changed=3076 880=0 okay=4607 errors=0 written=4607
# 51 secs read=5519 changed=3610 880=0 okay=5519 errors=0 written=5519
Legend (columns numbered left to right)
1 number of seconds used by job so far
2 read=number of records processed
3 changed=number of records changed
4 880=how many records contain 880s
5 okay=# records processed successfully
6 errors=# records not processed due to errors
7 written=# records written to the database
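For sites that want to summarize a long log, the progress lines above can be read with a few lines of Python. This is a minimal sketch that assumes every progress line follows exactly the layout in the sample; the function name is hypothetical.

```python
import re

# Matches lines such as "# 11 secs read=982 changed=791 ... written=982".
PROGRESS_LINE = re.compile(r"#\s*(\d+) secs\s+(.*)")


def parse_progress(line):
    """Return the counters of one progress line as a dict, or None."""
    match = PROGRESS_LINE.match(line.strip())
    if not match:
        return None
    stats = {"secs": int(match.group(1))}
    for pair in match.group(2).split():
        key, _, value = pair.partition("=")
        stats[key] = int(value)
    return stats


sample = "# 11 secs read=982 changed=791 880=0 okay=982 errors=0 written=982"
print(parse_progress(sample))
```

Comparing `read` with `written`, and checking that `errors` is zero on the last line, gives a quick health check before reviewing individual error messages.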
Conversion Log Details 2
=bib 6213: [17](700): c->8 loose char page=0 at 20 '091e ..'
=bib 35322: [14](856): c->8 undefined char page=0 at 61 'fc7220486973746f .r Histo'
=bib 35516: [23](856): c->8 no char to combine to page=0 at 82 '1e .'
=================================================================
Legend (elements 1 to 8 numbered left to right on the first line; 9 and 10 mark the second and third lines)
1 record type and id
2 index within record of field that generated error
3 tag that generated error
4 c->8 indicates conversion to UTF-8 encoding
5 description of error
6 page=subset to which source character belongs
7 at # position of source character that caused error
8 hex dump of source character
9 description of error
10 description of error
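The error lines can be broken into the elements of the legend with a regular expression. The sketch below assumes the exact line layout shown in the sample; the field names in the resulting dict are hypothetical labels chosen for this example.

```python
import re

# One error line: "=bib 35322: [14](856): c->8 undefined char page=0 at 61 '...'"
ERROR_LINE = re.compile(
    r"^=(\w+) (\d+): \[(\d+)\]\((\d{3})\): c->8 (.+?) page=(\d+) at (\d+)\s*(.*)$"
)


def parse_error(line):
    """Split one conversion error line into its legend elements, or None."""
    match = ERROR_LINE.match(line.strip())
    if not match:
        return None
    rec_type, rec_id, index, tag, error, page, position, dump = match.groups()
    return {
        "record_type": rec_type,
        "record_id": int(rec_id),
        "field_index": int(index),
        "tag": tag,
        "error": error,
        "page": int(page),
        "position": int(position),
        "dump": dump,
    }


line = ("=bib 35322: [14](856): c->8 undefined char page=0 at 61 "
        "'fc7220486973746f .r Histo'")
print(parse_error(line))
```

Grouping the parsed results by `error` and `tag` is one way to plan cleanup work, for example by listing all bib ids with an undefined character in an 856 field.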
Conversion Log Details 3
loose char:
a warning message indicating that a
character not strictly part of Voyager
encoding has been converted (e.g.
unexpected carriage return)
no char to combine to:
a warning message indicating that a
combining character appeared but it
lacks a base character with which to
combine (e.g. umlaut but no a, o, u base
letter)
undefined char:
an error message indicating that there is
a single character that cannot be
mapped to UTF-8
Identifying non-Unicode data
• To identify a non-Unicode record in the Cataloging client, select a color for Conversion records in Session Defaults and Preferences > Colors/Fonts tab.
Identifying non-Unicode data
• Any non-converted record displays in the color selected
in Options/Preferences.
Identifying non-Unicode data
There are other ways to identify records that have conversion errors.
Records that cannot be converted to Unicode are viewable in the Cataloging
module with nc (not converted) displayed in the Title Bar.
Any characters that cannot be matched or recognized are replaced with a
Unicode substitution character.
Fonts and Unicode
• A MARC record may contain non-Roman characters even though you cannot see them.
• Records are sure to display correctly if a Unicode-compliant font has been selected.
• Lucida Sans Unicode installed by default with Windows
• Arial Unicode MS
  • Good choice for libraries with mixed cataloging
  • Included with Microsoft Office and other Microsoft products
The Rest of Voyager
• Non-MARC data is not converted
  • Acquisitions data
  • Circulation data (patron info, etc.)
  • Item data
• Reporter
  • Not Unicode standard compliant
  • Translates data to LATIN1
  • Dots appear where you used to see squares
Agenda
Introduction
Your Work Environment
Conversion
New Features
• Cataloging
Diacritics & Special Characters, Importing Records, New
Record Views, Search URIs
• WebVoyáge
Browsers, Searching, Displaying
• Interacting with Other Systems
Learning More
Q&A
Diacritic and Special Character Entry
• Cataloging practices: then and now
  • Pre-Unicode input in Cataloging = accent character (diacritic) precedes the base character.
    Example: Espa~na
  • Post-Unicode input in Cataloging = accent character (diacritic) follows the base character.
    Example: Espan~a
  • Ability to display combined characters is an improvement over past versions and a way to ensure accurate entry
    Example: España
SpecialCharacters.cfg
SpecialCharacters.cfg, located in the C:\Voyager\Catalog folder,
defines the content of the special character entry dialog box.
Operators may define their most frequently used characters here.
Special Character Entry
This is what the dialog box in Cataloging looks like.
The key press column identifies the keyboard equivalent that may be used instead of turning on Special Character Mode in Cataloging.
Finding Little Used Characters
• For situations where a character not part of the
Special Characters list is needed, operator can
use Character Map from MS Windows
• Start – Programs – Accessories – System Tools –
Character Map
• Locate character or perform search
• Select and Copy character, then paste into
position in bib record
Cataloging: Input of Non-Roman Text
Voyager® with Unicode allows Cataloging operators
to use all of the standard Microsoft Windows
keyboard and input method editors (IMEs).
With this functionality in place, operators may search
for, display, and edit the contents of all MARC records
using the full range of UTF-8 characters.
The entire JACKPHY group is part of the UTF-8 character set, which includes the right-to-left input needed for Arabic, Persian, Hebrew and Yiddish.
Reminder: JACKPHY = Japanese, Arabic, Chinese, Korean, Persian,
Hebrew, Yiddish
Linking in a MARC21 Record
Tag  I1  I2  Subfield Data
100  1       ‡6 880-01 ‡a An, Zhen.
245  1   0   ‡6 880-02 ‡a Ri yue yun yan / ‡c An Zhen zhu.
250          ‡6 880-03 ‡a Di 1 ban.
260          ‡6 880-04 ‡a Changchun Shi : ‡b Changchun chu ban she, ‡c 1997.
300          ‡a 4, 2, 291 p. ; ‡c 21 cm.
440      0   ‡6 880-05 ‡a Zhongguo li dai wang chao xing shuai qu shi lu
500          ‡a Non-Roman script – Chinese
651      0   ‡a China ‡x History ‡y Ming dynasty, 1368-1644.
880  1       ‡6 100-01/$1 ‡a 安震.
880  1   0   ‡6 245-02/$1 ‡a 日月云烟 / ‡c 安震著.
880          ‡6 250-03/$1 ‡a 第1版.
880          ‡6 260-04/$1 ‡a 长春市 : ‡b 长春出版社, ‡c 1997.
880      0   ‡6 440-05/$1 ‡a 中国历代王朝兴衰启示录
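The ‡6 linkage works in both directions: the regular field points to 880 with an occurrence number, and the 880 field points back to the original tag with the same occurrence number plus a script code ($1 identifies CJK). A minimal Python sketch of pairing the two follows; the tuple layout and function names are assumptions for illustration and do not reflect any Voyager API.

```python
import re

# Subfield 6 content: linked tag, occurrence number, optional script code.
LINK = re.compile(r"^(\d{3})-(\d{2})(?:/(.+))?$")


def parse_link(subfield6):
    """Split '880-01' or '100-01/$1' into (tag, occurrence, script)."""
    match = LINK.match(subfield6)
    if not match:
        raise ValueError(f"unrecognized linkage: {subfield6!r}")
    return match.groups()


def pair_fields(fields):
    """fields: list of (tag, subfield6 or None, text). Returns linked pairs."""
    romanized = {}
    for tag, sf6, text in fields:
        if tag != "880" and sf6:
            _, occurrence, _ = parse_link(sf6)
            romanized[(tag, occurrence)] = text
    pairs = []
    for tag, sf6, text in fields:
        if tag == "880" and sf6:
            linked_tag, occurrence, _ = parse_link(sf6)
            if (linked_tag, occurrence) in romanized:
                pairs.append((linked_tag, romanized[(linked_tag, occurrence)], text))
    return pairs


record = [
    ("100", "880-01", "An, Zhen."),
    ("245", "880-02", "Ri yue yun yan / An Zhen zhu."),
    ("300", None, "4, 2, 291 p. ; 21 cm."),
    ("880", "100-01/$1", "安震."),
    ("880", "245-02/$1", "日月云烟 / 安震著."),
]
for tag, roman, original in pair_fields(record):
    print(tag, roman, "<->", original)
```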
Using On-Screen Keyboard
Typically, the path is Start – Programs – Accessories – Accessibility – On-Screen Keyboard
Importing Records
• Conversion process is separate and distinct from the
process of importing records
• Important distinction for operators who import records
through the Cataloging client or via BulkImport
• Expected character set needs to be accurately identified if
records are to be imported correctly
• Some experimentation may be necessary to determine the
correct character set
• Let’s look at some details to help everyone understand
what is happening
Record Exchange Scenarios
Voyager 2001.2 and earlier
• In Voyager 2001.2 and earlier, there were several options from which to choose regarding the character set:
  • Latin1
  • OCLC
  • RLIN legacy
  • MARC21 MARC8
• Until now it has been quite simple to choose the correct option when importing records through the Cataloging client or processing large numbers of records through BulkImport.
After Upgrade to Voyager 2003.1
• From Voyager 2003.1 forward, there are numerous options from which to choose regarding the character set:
  • Latin1 (non-Unicode)
  • MARC21 MARC8 (non-Unicode)
  • MARC21 UTF8
  • OCLC (non-Unicode)
  • RLIN legacy (non-Unicode)
  • Voyager legacy (non-Unicode)
• With Voyager 2003.1 and beyond, it is very important to determine the character set of records before importing records through the Cataloging client or processing large numbers of records through BulkImport. Some experimentation may be necessary.
* Transition to MARC21 UTF8 occurs as the Unicode standard becomes pervasive
One Year From Now
• In Voyager 2003.1 and beyond, numerous options for character sets will continue to be needed:
  • Latin1 (non-Unicode)
  • MARC21 MARC8 (non-Unicode)
  • MARC21 UTF8
  • OCLC (non-Unicode)
  • RLIN legacy (non-Unicode)
  • Voyager legacy (non-Unicode)
• But the Unicode standard will be much more pervasive, having been adopted and deployed by bibliographic utilities, vendors who massage records, vendors who supply records, and others.
• This means that selecting the correct option will again be simpler, even though knowing the character sets will continue to be very important.
Bulk Import
• Bulk Import of MARC Records
  • Fundamentally the same as before
  • Leader byte 9 is checked against the incoming character set identified in the import rule.
    • Blank = non-Unicode™; converted & imported
    • ‘a’ = Unicode™; imported
    • Neither blank nor ‘a’: errors out, not imported
  • See log.imp.yyyymmdd for details on import success
  • Records that cannot be converted are not imported; found in err.imp.yyyymmdd
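The three outcomes for leader byte 9 can be expressed as a small decision function. This is a sketch of the rule as stated on the slide, not BulkImport's code; the function name and return labels are hypothetical, and the additional comparison against the character set named in the import rule is left out.

```python
def classify_leader(leader: str) -> str:
    """Decide what happens to an incoming record based on leader byte 9.

    Returns 'convert' (blank: non-Unicode, converted then imported),
    'import' ('a': already Unicode), or 'reject' (anything else).
    """
    if len(leader) != 24:
        raise ValueError("a MARC leader is exactly 24 characters long")
    flag = leader[9]
    if flag == " ":
        return "convert"
    if flag == "a":
        return "import"
    return "reject"


print(classify_leader("00714cam  2200205 a 4500"))  # blank at position 9
print(classify_leader("00714cam a2200205 a 4500"))  # 'a' at position 9
```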
Bulk Import and Expected Character Set

Character set mapping for Bulk Import is designated in the Bulk
Import rule in SysAdmin > Cataloging > Bulk Import Rules.
MARC Export
• Default export character set is MARC21 UTF-8
• Use the -a option to choose a different character set (in the command line)
  • See page 10-8 in the Technical User’s Guide for more detail
• LATIN1 records will get a dot exported for characters outside the LATIN1 character set
• If mapping for a composed character is not found, it decomposes and Voyager® attempts to find a match for each part.
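The last two points can be pictured with a rough Python sketch: try each character as is, fall back to its decomposed parts, and substitute a dot for anything that still has no mapping. This is only an illustration of the idea using Python's built-in Latin-1 codec; Voyager's actual mapping tables and the exact fallback order are not documented here, so treat the behavior as an assumption.

```python
import unicodedata


def to_latin1(text: str) -> bytes:
    """Encode text as Latin-1, decomposing unmapped characters and
    writing a dot for any part that still cannot be mapped."""
    out = []
    for ch in text:
        try:
            out.append(ch.encode("latin-1"))
            continue
        except UnicodeEncodeError:
            pass
        # No direct mapping: decompose and try each part separately.
        for part in unicodedata.normalize("NFD", ch):
            try:
                out.append(part.encode("latin-1"))
            except UnicodeEncodeError:
                out.append(b".")
    return b"".join(out)


print(to_latin1("Espa\u00f1a"))   # n with tilde exists in Latin-1
print(to_latin1("\u0101"))        # a with macron: base letter kept, macron dotted
print(to_latin1("\u5b89"))        # CJK ideograph: no mapping at all
```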
New ISBN Indexes
For improved duplicate detection:
New ISBN Index
• 020N: 020a Number only
• 020R: 020z Number only
Example: with number-only indexing, 020 |a 1234567890 (Knopf) matches 020 |a 1234567890
• Check Bibliographic and Authority duplicate detection profiles in System Administration!
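"Number only" means that qualifiers such as the publisher name are ignored when comparing ISBNs. A minimal Python sketch of that normalization is shown below; it is an assumed approximation for illustration, not the indexing routine Voyager uses.

```python
import re

# Leading run of digits, X check characters and hyphens in an 020 subfield.
ISBN_TOKEN = re.compile(r"\s*([0-9Xx-]+)")


def isbn_number_only(subfield: str) -> str:
    """Return just the ISBN from an 020 subfield, dropping any qualifier."""
    match = ISBN_TOKEN.match(subfield)
    if not match:
        return ""
    return match.group(1).replace("-", "").upper()


print(isbn_number_only("1234567890 (Knopf)"))
print(isbn_number_only("1234567890"))
```

With this normalization both sample 020 fields yield the same key, so a duplicate detection profile that uses the number-only index would treat them as a match.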
HTTP Posting
• Much easier access to WebVoyáge display from clients
• Available in Cataloging, Acquisitions & Circulation
• Toggle record view from staff client to WebVoyáge
  • Record menu in Cataloging contains a Send Record To option (Send Record To: WebVoyáge)
• LinkFinderPlus available in Cataloging, Acquisitions & Circulation
  • Record menu in Cataloging contains a Send Record To option (Send Record To: LinkFinderPlus)
• Configured in voyager.ini file [MARC POSTing] stanza
Enabling HTTP Posting
To enable HTTP posting, a stanza is added to the voyager.ini file. An example is shown below.
[MARC POSTing]
WebVoyage="http://train20031c1db.comet.endinfosys.com/cgi-bin/Pbibredirect.cgi"
LinkfinderPlus="http://207.56.64.116/cgi-bin/Phttplinkresolver.cgi"
Easier Access to OPAC Display
• Send Record To… in Cataloging
• Send Record To… in Acquisitions
Search URI
• Staff Client Search URI in Cataloging, Circulation and Acquisitions
  • Drive searches to resources on the web
  • Add new button to search interface in staff clients
  • Click button… a browser is opened & search is executed
  • This is PC specific (voyager.ini)
  • Possible applications
    • Link to another OPAC
    • Link to one of your vendors
    • Link to an online book seller
Presenting Search URI
Staff client search URI
Available in Cataloging, Circulation, and
Acquisitions
Adding Search URIs
clipped from voyager.ini
[SearchURI]
Name=Google
URI=http://www.google.com
Copy=Y
SearchSyntax=/search?&q=<searchtext>

#Name=Barnes&Noble
#URI=http://search.barnesandnoble.com
#Copy=Y
#SearchSyntax=/booksearch/results.asp?WRD=<searchtext>

#Name=Gale Group
#URI=http://www.galegroup.com
#Copy=Y
#SearchSyntax=/servlet/SearchPageServlet?region=9&imprint=<searchtext>
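When testing a new entry, it can help to see the URL that results from combining URI and SearchSyntax. The Python sketch below substitutes the search text for the <searchtext> placeholder; how the staff client actually escapes the text is not stated in this session, so the use of `quote_plus` (UTF-8 percent-encoding with `+` for spaces) is an assumption.

```python
from urllib.parse import quote_plus


def build_search_url(uri: str, search_syntax: str, search_text: str) -> str:
    """Join URI and SearchSyntax, replacing the <searchtext> placeholder."""
    return uri + search_syntax.replace("<searchtext>", quote_plus(search_text))


url = build_search_url(
    "http://www.google.com",
    "/search?&q=<searchtext>",
    "ri yue yun yan",
)
print(url)
```

Pasting the printed URL into a browser is a quick way to confirm that the SearchSyntax value is correct before adding the stanza to voyager.ini on each PC.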
WebVoyáge and Unicode
• MARC data supplied to the browser in UTF-8
  • IE 6+ generally displays Unicode characters correctly. Some characters do not display correctly unless a Unicode-compliant font is selected.
  • Netscape 6+ detects that it needs to display Unicode characters without any special settings
  • Consider new help text in your OPAC to help patrons understand language options, especially if there are records using different languages in your database
• New UTF-8 download/save format
Searching in WebVoyáge
Search and display in native languages for staff and users.
• WebVoyáge and Cataloging allow Unicode character input; you can search for and retrieve records in native languages.
• Record display includes non-Latin scripts, including right-to-left scripts like Arabic and Hebrew. Voyager takes advantage of the web browser’s native rendering support.
Records with Other Languages in the OPAC
Displaying Records in WebVoyáge
Linking in a MARC21 Record
Tag  I1  I2  Subfield Data
100  1       ‡6 880-01 ‡a An, Zhen.
245  1   0   ‡6 880-02 ‡a Ri yue yun yan / ‡c An Zhen zhu.
250          ‡6 880-03 ‡a Di 1 ban.
260          ‡6 880-04 ‡a Changchun Shi : ‡b Changchun chu ban she, ‡c 1997.
300          ‡a 4, 2, 291 p. ; ‡c 21 cm.
440      0   ‡6 880-05 ‡a Zhongguo li dai wang chao xing shuai qu shi lu
500          ‡a Non-Roman script – Chinese
651      0   ‡a China ‡x History ‡y Ming dynasty, 1368-1644.
880  1       ‡6 100-01/$1 ‡a 安震.
880  1   0   ‡6 245-02/$1 ‡a 日月云烟 / ‡c 安震著.
880          ‡6 250-03/$1 ‡a 第1版.
880          ‡6 260-04/$1 ‡a 长春市 : ‡b 长春出版社, ‡c 1997.
880      0   ‡6 440-05/$1 ‡a 中国历代王朝兴衰启示录
Interacting with Other Systems
• Incoming Z39.50 Connections
  • Records in Unicode databases are UTF8 encoded
  • z3950svr may send MARC8-encoded records, UTF8-encoded records, or both
  • Default is set to send MARC8-encoded records
  • Two different z3950svr ports can be configured to provide records in both formats, thereby accommodating all sites connecting to the database
Interacting with Other Systems
• Outgoing Z39.50 Connections
  • Retrieves and displays records of any type in UTF-8
  • Converts incoming records based on new Database Definitions setting in System Administration called ‘Source Character Set’
    • Latin1 (non-Unicode™)
    • MARC21 MARC8 (non-Unicode™)
    • MARC21 UTF8
    • OCLC (non-Unicode™)
    • RLIN legacy (non-Unicode™)
    • Voyager® legacy (non-Unicode™)
Agenda
Introduction
Your Work Environment
Conversion
New Features
Learning More
Final Q&A
If you want to know more about…..
Coded Character Sets - EndUser 2004: Session 29
Title: Coded Character Sets: A Technical Primer for Librarians
Presenters: Michael Doran, Systems Librarian, University of Texas at Arlington;
Dan Sweeney, Business Analyst II, Endeavor Information Systems
Great Website: http://rocky.uta.edu/doran/charsets/
Strategies and Tools for Cleaning Up Your Data -- EndUser 2004: Session 45
Title: Transitioning To Unicode: Strategies for Tidying Your Data
Presenters: Fran Budde, Acquisitions & Cataloging Specialist, Pacific Lutheran
University; Francesca Lane Rasmus, Director, Technical Services, Pacific
Lutheran University; Layne Nordgren, Director of Instructional
Technologies/Library Systems, Pacific Lutheran University
If you want to know more about…..
Special Character Input/Issues – EndUser 2004:Session 65
Title: Why Unicode?
Presenter: Martin Heijdra, Chinese Bibliographer/ Head of Public Services,
East Asian Library, Princeton University
Preparing for Unicode Conversion & Cataloging Issues – EndUser 2004:
Session 74
Title: Unicode Conversion at the Library of Congress
Presenter: Ann Della Porta, Assistant Coordinator, Integrated Systems
Office, Library of Congress
SupportWeb: KnowledgeBase, EndUser archives
http://support.endinfosys.com/cust/index.html
If you want to know more about….
880 – Alternate Graphic Representation (R)
http://www.loc.gov/marc/bibliographic/ecbdhold.html#mrcb880
OCLC Character Sets
http://www.oclc.org/support/documentation/worldcat/records/subscription/5/5.pdf
Original Scripts in RLG Databases
http://www.rlg.org/origscripts.html
MARC 21 Concise Bibliographic: Control Subfields
http://www.loc.gov/marc/bibliographic/ecbdcntf.html
MARC 21 Concise Bibliographic: Multiscript Records
http://www.loc.gov/marc/bibliographic/ecbdmulti.html
Thank you!