Towards a Digital Edition of the Slovenian Biographical Lexicon
Petra Vide Ogrin
Slovenian Academy of Sciences and Arts, Library
Tomaž Erjavec
Department of Knowledge Technologies, Jožef Stefan Institute
INFuture 2007, Zagreb
Overview of the talk
SBL (publication, nature, significance)
Methodology:
TEI P5
up-conversion into TEI-XML format
Example of TEI-XML article structure:
skeleton
actual XML document
Future plans: implementation of IR system
SBL
15 volumes + index, published over a long period (1925-1991)
Who is included? Notable figures important for Slovenian cultural life, from the beginnings up to the present; inclusion criteria
Covers 5,031 biographical entries, over 5,100 persons
Data in the articles are checked against the relevant primary sources
Methodology of encoding
Use of open standards and software
Use of TEI P5 Guidelines
Up-conversion from OCR source into TEI-XML
Down-conversion into XHTML
(Implementation of open-source digital library software → full-text and advanced searching)
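The down-conversion step can be illustrated with a minimal sketch. The slides do not say how the project implements it (XSLT would be the usual tool); Python is substituted here only to keep the example self-contained. The sample fragment, the function name tei_to_xhtml and the output layout are assumptions, and the TEI namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified TEI fragment (invented sample, namespace omitted).
TEI = """<div>
  <listPerson>
    <person><persName>Novak, Janez</persName></person>
  </listPerson>
  <p>Sample article text.</p>
</div>"""


def tei_to_xhtml(tei_source: str) -> str:
    """Map a TEI article <div> to a small XHTML fragment (illustrative only)."""
    article = ET.fromstring(tei_source)
    out = ET.Element("div", {"class": "article"})
    # Each person name becomes a heading.
    for name in article.iter("persName"):
        ET.SubElement(out, "h2").text = name.text
    # Direct-child paragraphs are copied as XHTML paragraphs (text content only).
    for para in article.findall("p"):
        ET.SubElement(out, "p").text = "".join(para.itertext())
    return ET.tostring(out, encoding="unicode")


print(tei_to_xhtml(TEI))
```

Inline annotations inside the paragraph are flattened to plain text here; a real conversion would map them to XHTML elements as well.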
TEI – Text Encoding Initiative
What’s TEI?
Why do we encode?
to make explicit (to a machine) what is implicit (to a person)
to add value by supplying annotations (structural metadata)
to facilitate re-use of the same material
XML (eXtensible Markup Language):
international standard
application-, platform- and vendor-independent
extensible
TEI P5
no backward compatibility with P4 – new possibilities for text encoding
validation of an XML document: checking against an XML schema
an XML schema (XML syntax) = project-specific combination of TEI modules
extension and generalization of the modular system
interoperability and standards (ISO, W3C: Unicode, lang → xml:lang, id → xml:id)
some new elements, e.g. for biographical and prosopographical data → relevant for SBL project
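The move from lang/id to xml:lang/xml:id can be seen in a small sketch using Python's standard library. The sample element and its identifier are invented for illustration. Note that this checks well-formedness only; validation against a project schema needs a separate validator, which is not shown.

```python
import xml.etree.ElementTree as ET

# The reserved XML namespace that the xml: prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Invented sample; the identifier scheme is an assumption.
SAMPLE = (
    '<person xml:id="sbl-example-1" xml:lang="sl">'
    "<persName>Novak, Janez</persName>"
    "</person>"
)

# fromstring raises ParseError if the input is not well-formed XML.
person = ET.fromstring(SAMPLE)

# ElementTree exposes xml:id and xml:lang under the expanded namespace name.
person_id = person.get(f"{{{XML_NS}}}id")
person_lang = person.get(f"{{{XML_NS}}}lang")
print(person_id, person_lang)
```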
Up-conversion into TEI-XML
OpenOffice – TEI OO package (XSLT stylesheets) → TEI-XML document (basic structure)
(semi-)automatic encoding – to achieve the needed structure:
Perl, XSLT
manual intervention (correction)
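The (semi-)automatic step can be sketched as pattern-based tagging with a fallback to manual correction. The project used Perl and XSLT; Python is substituted here for illustration. The opening-line format, the pattern and the function name encode_opening are all assumptions, not the actual SBL conventions.

```python
import re
from xml.sax.saxutils import escape

# Assumed (invented) opening format: "Name, occupation, b. YYYY, d. YYYY."
PATTERN = re.compile(
    r"^(?P<name>[^,]+), (?P<occupation>[^,]+), "
    r"b\. (?P<birth>\d{4}), d\. (?P<death>\d{4})\."
)


def encode_opening(line):
    """Return a TEI <person> string, or None if the line needs manual work."""
    m = PATTERN.match(line)
    if m is None:
        return None  # irregular entry: leave for manual intervention
    return (
        "<person>"
        f"<persName>{escape(m['name'])}</persName>"
        f"<occupation>{escape(m['occupation'])}</occupation>"
        f'<birth when="{m["birth"]}"/>'
        f'<death when="{m["death"]}"/>'
        "</person>"
    )


print(encode_opening("Novak Janez, teacher, b. 1850, d. 1900."))
print(encode_opening("An irregular opening line"))
```

Entries that do not match the pattern return None, which mirrors the split between automatic encoding and manual correction.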
An SBL article
Typical structure:
biographical entry
biography: data about birth, death, residence, occupation, important events (marriage, ordination etc.)
representative bibliography that depicts a person's life and work
One or more paragraphs
Encyclopaedic style: dense language, many abbreviations (bibliography, authors, general: e.g. months (Sept.) etc.)
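One way to handle the many abbreviations is to pair each with its expansion using the TEI elements choice, abbr and expan. The sketch below assumes a small hand-made abbreviation table and text that is already XML-escaped; the table and the function name mark_abbreviations are hypothetical, and the slides do not say whether the project encodes abbreviations this way.

```python
import re

# Hypothetical abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"Sept.": "September", "Dec.": "December"}

# Longest abbreviations first, so longer forms win over shorter prefixes.
_ALTERNATIVES = "|".join(
    re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True)
)
_PATTERN = re.compile(r"\b(?:" + _ALTERNATIVES + ")")


def mark_abbreviations(text):
    """Wrap known abbreviations in TEI <choice><abbr/><expan/></choice>."""

    def wrap(match):
        abbr = match.group(0)
        return (
            f"<choice><abbr>{abbr}</abbr>"
            f"<expan>{ABBREVIATIONS[abbr]}</expan></choice>"
        )

    return _PATTERN.sub(wrap, text)


print(mark_abbreviations("born 3 Sept. 1850"))
```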
Article TEI-XML structure
<div>
  <listPerson>
    <person>
      <!--other elements for biographical
      data: birth, death, occupation ...-->
    </person>
  </listPerson>
  <p>
    <!--the annotated text of the article-->
  </p>
</div>
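A filled-in instance of the skeleton, with a simple structural check, might look as follows. The person data are invented, the TEI namespace is omitted, and follows_skeleton is a hypothetical helper that tests only the three structural points of the skeleton; it is not a substitute for schema validation.

```python
import xml.etree.ElementTree as ET

# Invented sample article following the skeleton (namespace omitted).
ARTICLE = """<div>
  <listPerson>
    <person>
      <persName>Novak, Janez</persName>
      <birth when="1850"/>
      <death when="1900"/>
      <occupation>teacher</occupation>
    </person>
  </listPerson>
  <p>Annotated text of the article.</p>
</div>"""


def follows_skeleton(source):
    """True if the root is <div> with a listPerson/person and at least one <p>."""
    root = ET.fromstring(source)
    has_person = root.find("listPerson/person") is not None
    has_text = root.find("p") is not None
    return root.tag == "div" and has_person and has_text


print(follows_skeleton(ARTICLE))
print(follows_skeleton("<div><p>Text without person data.</p></div>"))
```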
Future plans
Implementation of an IR system – for full-text and advanced searching
Possible adoption of PhiloLogic
Exploring automatic recognition, extraction and encoding of data
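To make the idea of full-text searching concrete, here is a toy inverted index in Python. It is a stand-in for illustration only: it does not reflect PhiloLogic or any planned implementation, and the article identifiers and texts are invented.

```python
import re
from collections import defaultdict


def build_index(articles):
    """Map each lower-cased word to the set of article ids containing it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(article_id)
    return index


def search(index, query):
    """Return ids of articles that contain every word of the query."""
    tokens = re.findall(r"\w+", query.lower())
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# Invented sample data.
articles = {"a1": "teacher and poet", "a2": "poet and priest"}
index = build_index(articles)
print(sorted(search(index, "poet")))
print(sorted(search(index, "poet teacher")))
```

Advanced searching over the encoded biographical data (for example by occupation or birth year) would query the TEI elements rather than the running text.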