Towards a Digital Edition of the Slovenian Biographical Lexicon
Petra Vide Ogrin
Slovenian Academy of Sciences and Arts, Library
Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute
INFuture 2007, Zagreb
Overview of the talk

SBL (publication, nature, significance)
Methodology:
  - TEI P5
  - up-conversion into TEI-XML format
Example of TEI-XML article structure:
  - skeleton
  - actual XML document
Future plans: implementation of IR system
SBL

15 volumes + index, published over a long period of time (1925-1991)
Who is included? Notable figures important for Slovenian cultural life, from the beginnings up to the present day (criteria)
Covers 5,031 biographical entries, over 5,100 persons
Data in the articles are checked against the relevant primary sources
Methodology of encoding

Use of open standards and software
Use of TEI P5 Guidelines
Up-conversion from OCR source into TEI-XML
Down-conversion into XHTML
(Implementation of DL open source software → full-text and advanced searching)
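The down-conversion step can be illustrated with a small sketch. The talk does not show the project's stylesheets, so the tag mapping, the sample fragment and the function name `tei_to_xhtml` below are assumptions made for illustration, and Python stands in for whatever transformation language the project actually uses.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical TEI-to-XHTML tag mapping; not taken from the SBL project.
TAG_MAP = {"div": "div", "head": "h2", "p": "p", "hi": "em"}


def tei_to_xhtml(tei_elem):
    """Recursively copy a TEI element into an XHTML-like element tree."""
    local = tei_elem.tag.split("}")[-1]            # drop the namespace part
    html = ET.Element(TAG_MAP.get(local, "span"))  # unmapped tags become <span>
    if local not in TAG_MAP:
        html.set("class", local)                   # keep the TEI name for styling
    html.text = tei_elem.text
    for child in tei_elem:
        converted = tei_to_xhtml(child)
        converted.tail = child.tail                # preserve text after the child
        html.append(converted)
    return html


# Invented sample fragment, not an actual SBL article.
sample = (
    f'<div xmlns="{TEI_NS}"><head>Example entry</head>'
    "<p>Born in <placeName>Ljubljana</placeName>.</p></div>"
)
xhtml = ET.tostring(tei_to_xhtml(ET.fromstring(sample)), encoding="unicode")
print(xhtml)
```

Keeping the TEI element name as a `class` attribute is one possible design choice: the XHTML stays simple while the semantic annotation remains available to stylesheets.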
TEI – Text Encoding Initiative

What's TEI?
Why do we encode?
  - to make explicit (to a machine) what is implicit (to a person)
  - to add value by supplying annotations (structural metadata)
  - to facilitate re-use of the same material
XML (eXtensible Markup Language):
  - international standard
  - application-, platform- and vendor-independent
  - extensible
TEI P5

no backward compatibility with P4 – new possibilities for text encoding
validation of an XML document: checking against an XML schema
an XML schema (XML syntax) = project-specific combination of TEI modules
extension and generalization of the modular system
interoperability and standards (ISO, W3C: Unicode, lang → xml:lang, id → xml:id)
some new elements, e.g. for biographical and prosopographical data → relevant for the SBL project
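The move from `id` and `lang` to the W3C attributes `xml:id` and `xml:lang` can be seen in a short sketch. The fragment, its identifier and its language code are invented for illustration. The point is only that both attributes live in the standard XML namespace, so any namespace-aware parser can read them without knowing TEI.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace bound to the xml: prefix

# Invented fragment; the identifier and language code are illustrative only.
sample = (
    f'<div xmlns="{TEI_NS}" xml:id="sbl-demo-001" xml:lang="sl">'
    "<p>Example text.</p></div>"
)

root = ET.fromstring(sample)
entry_id = root.get(f"{{{XML_NS}}}id")      # value of xml:id
entry_lang = root.get(f"{{{XML_NS}}}lang")  # value of xml:lang
print(entry_id, entry_lang)
```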
Up-conversion into TEI-XML

OpenOffice – TEI OO package (XSLT stylesheets) → TEI-XML document (basic structure)
(semi-)automatic encoding to achieve the needed structure:
  - Perl, XSLT
  - manual intervention (correction)
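The combination of automatic encoding and manual correction can be sketched as a pass that tags what it recognises and sets aside what it does not. The project uses Perl and XSLT for this. The Python below is a stand-in, and the date pattern, the sample sentence and the function name `tag_dates` are assumptions rather than the project's actual rules.

```python
import re

# Assumed pattern: day. month. year written as digits, e.g. "3. 12. 1800".
FULL_DATE = re.compile(r"\b(\d{1,2})\.\s*(\d{1,2})\.\s*(\d{4})\b")


def tag_dates(text):
    """Wrap plausible dates in <date when="...">; collect doubtful ones."""
    doubtful = []

    def repl(match):
        day, month, year = (int(g) for g in match.groups())
        if not (1 <= day <= 31 and 1 <= month <= 12):
            doubtful.append(match.group(0))  # left untouched for manual correction
            return match.group(0)
        iso = f"{year:04d}-{month:02d}-{day:02d}"
        return f'<date when="{iso}">{match.group(0)}</date>'

    return FULL_DATE.sub(repl, text), doubtful


# Invented sample sentence; the third date is deliberately implausible.
tagged, to_check = tag_dates("Born 3. 12. 1800, died 8. 2. 1849; ordained 5. 13. 1825.")
print(tagged)
print(to_check)
```

The list of doubtful matches is what would be handed to a human editor in the manual-intervention step.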
An SBL article

Typical structure:
  - biographical entry
  - biography: data about birth, death, residence, occupation, important events (marriage, ordination, etc.)
  - representative bibliography that depicts a person's life and work
One or more paragraphs
Encyclopaedic style: dense language, many abbreviations (bibliography, authors, general: e.g. months (Sept.), etc.)
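One way to handle the many abbreviations is to keep the printed form and record its expansion next to it, using the TEI `<choice>`, `<abbr>` and `<expan>` elements. The sketch below assumes a small lookup table. The table entries, the sample sentence and the function name `mark_abbreviations` are hypothetical, and the talk does not say that the project encodes abbreviations this way.

```python
import re

# Hypothetical abbreviation table; the real SBL list is not given in the talk.
ABBREVIATIONS = {"Sept.": "September", "Oct.": "October", "Dec.": "December"}

# Longest abbreviations first, and never in the middle of a longer word.
_PATTERN = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))
    + ")"
)


def mark_abbreviations(text):
    """Wrap each known abbreviation in <choice><abbr/><expan/></choice>."""
    def repl(match):
        abbr = match.group(0)
        return f"<choice><abbr>{abbr}</abbr><expan>{ABBREVIATIONS[abbr]}</expan></choice>"

    return _PATTERN.sub(repl, text)


marked = mark_abbreviations("Died 4 Sept. 1880.")  # invented sample sentence
print(marked)
```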
Article TEI-XML structure
<div>
  <listPerson>
    <person>
      <!-- other elements for biographical data: birth, death, occupation ... -->
    </person>
  </listPerson>
  <p>
    <!-- the annotated text of the article -->
  </p>
</div>
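To show how this skeleton might be used once filled in, the sketch below parses one article and pulls out the structured data from `<person>`. The person, dates and occupation are placeholders, not an SBL entry, and the choice of `persName`, `birth`, `death` and `occupation` with a `when` attribute is an assumption about how the elided elements are encoded.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI}

# Invented entry following the skeleton; names and dates are placeholders.
article = f"""
<div xmlns="{TEI}">
  <listPerson>
    <person>
      <persName>Janez Novak</persName>
      <birth when="1850-01-15"/>
      <death when="1910-06-30"/>
      <occupation>teacher</occupation>
    </person>
  </listPerson>
  <p>The annotated text of the article.</p>
</div>
"""


def person_records(div):
    """Return one dict per <person> with name, birth, death and occupation."""
    records = []
    for person in div.findall("tei:listPerson/tei:person", NS):
        birth = person.find("tei:birth", NS)
        death = person.find("tei:death", NS)
        records.append({
            "name": person.findtext("tei:persName", default="", namespaces=NS),
            "born": birth.get("when") if birth is not None else None,
            "died": death.get("when") if death is not None else None,
            "occupation": person.findtext("tei:occupation", default="", namespaces=NS),
        })
    return records


records = person_records(ET.fromstring(article.strip()))
print(records)
```

Separating the `<listPerson>` data from the running text in `<p>` means that records like these can be extracted without touching the prose of the article.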
Future plans

Implementation of an IR system for full-text and advanced searching
Possible adoption of PhiloLogic
Exploring automatic recognition, extraction and encoding of data
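As a rough idea of what full-text searching involves, the sketch below builds a tiny inverted index over article texts. It is not PhiloLogic and says nothing about how that system works. The article ids, the sample texts and the function names are all invented for illustration.

```python
import re
from collections import defaultdict


def build_index(articles):
    """Map each lower-cased word to the set of article ids containing it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(article_id)
    return index


def search(index, query):
    """Return ids of articles that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Invented sample texts, not SBL content.
articles = {"a1": "Poet and lawyer, born in Vrba.", "a2": "Lawyer and politician."}
index = build_index(articles)
print(search(index, "lawyer"))
print(search(index, "poet lawyer"))
```

Advanced searching would go further by indexing the encoded fields (birth, death, occupation) separately from the running text, which is where the TEI annotation pays off.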