The use of SGML and XML at the Publications Office

Dr. Holger Bagola
Dir A – Cell “Methods and
Development — Formats”
Table of contents
• Historical overview
• Formex
• Other areas of XML usage
• Conclusion
Historical overview
• Tasks of the Publications Office
• Archiving of legislative publications
• First steps in SGML
• Migration to XML
• Basic advantage: availability of tools
Formex (1)
• Basic principles
– XML Schema instead of DTD
– One single schema
– Number of root elements: 12 instead of 30
– Number of elements: about 350 instead of 1200
– Distinction between semantic and physical markup
Formex (2)
ARTICLE      (TI.ARTICLE, (PARAG+ | ALINEA+))
TI.ARTICLE   (#PCDATA)
PARAG        (NO.PARAG, ALINEA+)
NO.PARAG     (#PCDATA)
ALINEA       ((#PCDATA | NOTE | HT | FT)* | (P | LIST | TABLE)+)
. . .
Blue: semantic markup
Red: physical markup
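As an illustration of the ARTICLE content model above, the following minimal sketch checks the child sequence of an ARTICLE element against (TI.ARTICLE, (PARAG+ | ALINEA+)). It is not how Formex instances are validated in production (that is done against the schema); the regular expression is only a stand-in for the grammar, and the function name and sample instances are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Content model from the slide: (TI.ARTICLE, (PARAG+ | ALINEA+)),
# written as a regex over a space-terminated sequence of child tag names.
ARTICLE_MODEL = re.compile(r"TI\.ARTICLE (?:(?:PARAG )+|(?:ALINEA )+)")


def article_matches_model(article):
    """Return True if the children of an ARTICLE element fit the model."""
    sequence = "".join(child.tag + " " for child in article)
    return ARTICLE_MODEL.fullmatch(sequence) is not None


# Hypothetical sample instances, invented for this sketch.
good = ET.fromstring(
    "<ARTICLE><TI.ARTICLE>Article 1</TI.ARTICLE>"
    "<PARAG><NO.PARAG>1.</NO.PARAG><ALINEA>Text</ALINEA></PARAG></ARTICLE>"
)
mixed = ET.fromstring(
    "<ARTICLE><TI.ARTICLE>Article 1</TI.ARTICLE>"
    "<PARAG><NO.PARAG>1.</NO.PARAG><ALINEA>Text</ALINEA></PARAG>"
    "<ALINEA>Text</ALINEA></ARTICLE>"
)

print(article_matches_model(good))   # True
print(article_matches_model(mixed))  # False: PARAG and ALINEA may not be mixed
```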
Formex (3)
• Table model
– Analysis of CALS, HTML, Formex v. 3
– Choice:
  • Model close to HTML (top-down approach, nested tables)
  • Maintenance of semantic information such as in Formex v. 3
Formex (4)
• Footnotes
– Distinction between notes in text and tables for readability and production simplicity
– Insertion of text notes into the surrounding text
– ID/IDREF to signal identical footnotes
– Numbering is an object of presentation
– Table notes assembled at the top of the table
Formex (5)
• Quotations
– Structured quotations vs. ‘#PCDATA’ quotations
– Elements signaling start and end of a quotation (quotation marks)
– Element with function of a container for structured quotations
Formex (6)
Example:
Article 2
In article 1(2) of regulation (EC) 1234/94 the word ‘car’ is replaced by ‘bus’.
Article 6 of the same regulation is replaced by the following text:
‘Article 6
This is the new text of article 6.’
Formex (7)
Example:
<ARTICLE IDENTIFIER="002">
  <TI.ARTICLE>Article 2</TI.ARTICLE>
  <ALINEA>In article 1(2) of regulation (EC) 1234/94 the word <QUOT.START
    ID="QS0001" REF.END="QE0001" CODE="2018"/>car<QUOT.END ID="QE0001"
    REF.START="QS0001" CODE="2019"/> is replaced by <QUOT.START ID="QS0002"
    REF.END="QE0002" CODE="2018"/>bus<QUOT.END ID="QE0002"
    REF.START="QS0002" CODE="2019"/>.</ALINEA>
  <ALINEA>
    <P>Article 6 of the same regulation is replaced by the following text:</P>
    <QUOT.S>
      <ARTICLE IDENTIFIER="006">
        <TI.ARTICLE><QUOT.START ID="QS0003"
          REF.END="QE0003" CODE="2018"/>Article 6</TI.ARTICLE>
        <ALINEA>This is the new text of article 6.<QUOT.END ID="QE0003"
          REF.START="QS0003" CODE="2019"/></ALINEA>
      </ARTICLE>
    </QUOT.S>
  </ALINEA>
</ARTICLE>
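The pairing of QUOT.START and QUOT.END through ID, REF.END and REF.START can be checked mechanically. The sketch below is one possible way to do so and to render the markers as quotation marks; it assumes that CODE holds a hexadecimal Unicode code point (2018 and 2019 being the single quotation marks), which the slide suggests but does not state. The function names are invented for this example.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    '<ALINEA>the word <QUOT.START ID="QS0001" REF.END="QE0001" CODE="2018"/>car'
    '<QUOT.END ID="QE0001" REF.START="QS0001" CODE="2019"/> is replaced by '
    '<QUOT.START ID="QS0002" REF.END="QE0002" CODE="2018"/>bus'
    '<QUOT.END ID="QE0002" REF.START="QS0002" CODE="2019"/>.</ALINEA>'
)

MARKERS = {"QUOT.START", "QUOT.END"}


def unpaired_markers(root):
    """Return the sorted IDs of markers whose partner does not point back."""
    starts = {e.get("ID"): e.get("REF.END") for e in root.iter("QUOT.START")}
    ends = {e.get("ID"): e.get("REF.START") for e in root.iter("QUOT.END")}
    problems = [s for s, end_ref in starts.items() if ends.get(end_ref) != s]
    problems += [e for e, start_ref in ends.items() if starts.get(start_ref) != e]
    return sorted(problems)


def render(element):
    """Flatten an element to text, replacing markers by their character.

    Assumption: CODE is a hexadecimal Unicode code point.
    """
    parts = [element.text or ""]
    for child in element:
        if child.tag in MARKERS:
            parts.append(chr(int(child.get("CODE"), 16)))
        else:
            parts.append(render(child))
        parts.append(child.tail or "")
    return "".join(parts)


alinea = ET.fromstring(SAMPLE)
print(unpaired_markers(alinea))  # []
print(render(alinea))            # the word ‘car’ is replaced by ‘bus’.
```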
Formex (8)
• Splitting large documents
– Fragmentation by definition of inclusions for the main document
– Secondary instances referencing the inclusions by means of the XML entity mechanism
– Inclusions may not necessarily be valid XML instances
Formex (9)
main.xml
<?xml version="1.0"?>
<doc>
  <ti>title</ti>
  <chap no="1">
    <incl ref="frag-1.frg"/>
  </chap>
</doc>
frag-1.frg
<text>…</text>
<text>…</text>
container.xml
<?xml version="1.0"?>
<!DOCTYPE frag [
  <!ENTITY cnt SYSTEM "frag-1.frg">
]>
<frag>&cnt;</frag>
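The point that a fragment need not be an XML instance on its own, while a container makes it parsable, can be sketched as follows. This is a simplified illustration, not the mechanism used in production: the fragment is wrapped by string concatenation instead of an external entity, the files are held in memory, and the text content and function names are invented for this example.

```python
import xml.etree.ElementTree as ET

# In-memory stand-ins for the files on the slide; the text content is invented.
MAIN_XML = '<doc><ti>title</ti><chap no="1"><incl ref="frag-1.frg"/></chap></doc>'
FRAGMENTS = {"frag-1.frg": "<text>first</text><text>second</text>"}


def is_well_formed(xml_text):
    """Return True if xml_text parses as a single-rooted XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def load_fragment(name):
    """Wrap a fragment in a <frag> container so that it can be parsed."""
    return ET.fromstring("<frag>" + FRAGMENTS[name] + "</frag>")


def resolve_inclusions(root):
    """Replace every <incl ref="..."/> by the children of the referenced fragment."""
    for parent in list(root.iter()):
        children = []
        for child in list(parent):
            if child.tag == "incl":
                children.extend(list(load_fragment(child.get("ref"))))
            else:
                children.append(child)
        parent[:] = children
    return root


print(is_well_formed(FRAGMENTS["frag-1.frg"]))  # False: two top-level elements
doc = resolve_inclusions(ET.fromstring(MAIN_XML))
print([child.tag for child in doc.find("chap")])  # ['text', 'text']
```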
Formex (10)
• Character set
– OJ publications in 20 (21) languages
– Different alphabets
– International character set definition: Unicode (UTF-8)
– Definition of allowed character ranges
– Special font ‘EU-Albertina’
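A definition of allowed character ranges lends itself to a simple automatic check. The sketch below shows the idea only; the ranges listed are hypothetical and are not the ranges defined in the Formex specification.

```python
# Hypothetical allow-list for illustration; the real ranges are those
# laid down in the Formex specification.
ALLOWED_RANGES = [
    (0x0020, 0x007E),  # Basic Latin, printable characters
    (0x00A0, 0x017F),  # Latin-1 Supplement and Latin Extended-A
    (0x0370, 0x03FF),  # Greek
    (0x2018, 0x2019),  # single quotation marks
]


def disallowed_characters(text):
    """Return (position, character) pairs for characters outside the ranges."""
    return [
        (index, char)
        for index, char in enumerate(text)
        if not any(low <= ord(char) <= high for low, high in ALLOWED_RANGES)
    ]


# The text is decoded from UTF-8 first, then checked character by character.
sample = "Ελλάδα ‘EU’".encode("utf-8").decode("utf-8")
print(disallowed_characters(sample))     # []
print(disallowed_characters("a\u0416b"))  # [(1, 'Ж')]
```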
Formex (11)
• Meta-data
– OJ publications are composed of different levels:
  • Publication
  • Document
  • ‘Contents’
– Meta-data separated according to these levels
Formex (12)
[Diagram: levels of an OJ publication]
• Publication: meta-data concerning the publication; structure of the publication with references to documents
• Document (two shown): meta-data for document; references to components
• Contents: main part (001); Annex 1 (001.001); Annex 2 (001.002); main part (002)
• ProCat
Formex (13)
• Meta-data (continued)
– Extraction of meta-data by means of automatic processes (pre-notices)
– Extension of pre-notices by juridical analysis
– Availability of notices in ProCat for other productions (Celex) and projects
Formex (14)
• Final remark on Formex specifications
– Only a few complete production chains from the author to the printer
– Concentration on publication of the Official Journal
Formex (15)
• Validation of Formex deliveries
– In-depth validation necessary
– Automatic procedures
– Manual procedures
Formex (16)
• Validation of Formex deliveries (continued)
– Automatic procedures
  • Control of filename conventions
  • Parsing of various components
  • Control of completeness
  • Execution of additional validation rules
  • Comparison of contents between Formex and PDF
→ Report (XML instance)
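The first two automatic checks and the XML report they feed can be sketched as below. This is a minimal illustration under stated assumptions: the filename pattern, the report element and attribute names, and the accepted/rejected status values are all hypothetical, and the completeness, additional-rule and PDF comparison steps are left out.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical filename convention, for illustration only.
FILENAME_PATTERN = re.compile(r"^[a-z0-9_.-]+\.xml$")


def validate_delivery(files):
    """Check a delivery given as {filename: XML text}; return an XML report."""
    report = ET.Element("report")
    for name in sorted(files):
        entry = ET.SubElement(report, "file", name=name)
        # Check 1: control of filename conventions.
        if not FILENAME_PATTERN.match(name):
            error = ET.SubElement(entry, "error", type="filename")
            error.text = "name violates convention"
        # Check 2: parsing of the component.
        try:
            ET.fromstring(files[name])
        except ET.ParseError as exc:
            error = ET.SubElement(entry, "error", type="parsing")
            error.text = str(exc)
    has_errors = report.find(".//error") is not None
    report.set("status", "rejected" if has_errors else "accepted")
    return report


delivery = {"doc_001.xml": "<doc/>", "Bad Name.xml": "<doc>"}
report = validate_delivery(delivery)
print(report.get("status"))  # rejected
print(ET.tostring(report, encoding="unicode"))
```

Because the report is itself an XML instance, it can be passed on to later steps without a separate format.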
Formex (17)
• Validation of Formex deliveries (continued)
– Manual procedures
  • Verification of the report generated by the automatic validation procedure
  • Control of the use of Formex specifications in all language versions
→ Report (XML instance) = basis for archiving or rejection
Formex (18)
• Conversion of Formex v. 3 into Formex v. 4
– Conversion of character set (ISO 2020 to UTF-8)
– Transformation of SGML instances into well-formed XML instances
– Extraction of tables and conversion into an intermediate model
– Generation of meta-data levels
– Conversion of old elements and generation of new elements
– Validation of the results
Formex (19)
• Specifications:
http://formex.publications.eu.int/
Other areas of XML usage (1)
• Index of OJ publications
– Biannual issues
– Monthly issues
– Extraction from Celex/ProCat
– Transformation into PDF by means of XSLT and XSL FO (biannual version only)
Other areas of XML usage (2)
• Consolidation of legal documents
– Mainly based on Formex
– Additional administrative data in XML
– Relations between historical levels
  • Description of the composition of a given historical level
  • Concordance of information on numbering schemes (articles, …) for each level
Other areas of XML usage (3)
• Conversion to RTF
– Compatibility with other EU services
– Input in SGML or XML
– Results with LegisWrite templates
Other areas of XML usage (4)
[Diagram: conversion to RTF]
• SGML instance (Formex v. 3)
• Character conversion
• Transformation into well-formed XML
• Transformation into internal XML format
• XML instance (Formex v. 4)
• Transformation into RTF (LegisWrite)
• Output in RTF (LegisWrite)
Other areas of XML usage (5)
• Production of the EU budget
– Creation and maintenance of a common central repository (XML)
– Markup of modified elements during the decision process in working language
– Translation of modified parts only
– Update of repository after publication
Other areas of XML usage (6)
[Diagram: production of the EU budget]
• Translation service
• Budget services
• Budget XML repository
• Formex archive
• Publications Office: pre-printing, post-printing
• Printer
Other areas of XML usage (7)
• ‘Secondary legislation’
– Publication of legislation in force in ‘new’ languages
– XML production on basis of Formex archive
– Transformation of translated input
– Transformation of SGML into XML of Formex instance
– Merging of XML instances
Other areas of XML usage (8)
[Diagram: production of ‘secondary legislation’]
• Word document: conversion into XML, extraction of text
• Formex archive: conversion into XML, extraction of skeleton
• Merging skeleton & text
• Simplify structure
• Celex
• ProCat
• Publication
Other areas of XML usage (9)
• European document repository
– TIFF of publications
– PDF of publications
– Formex instances of OJ publications
– Exchange of information by XML messages
Other areas of XML usage (10)
• Publication of calls for tender (OJ-S)
– Input in different electronic formats
– Harmonization in XML
– Updating database TED
– Production of CD-ROM version
Conclusion
• Difficult start with SGML
• Successful use of XML as well as of other standards such as XSLT/XPath, XSL FO
• Powerful possibilities of re-use of XML instances
• How to profit from our experiences?
Proposal for technical solution
• An example: a regulation in the European legislative context and a ‘Verordnung’ in German legislation
• Evident structural differences
• Evident common structural objects
Differences and common objects (1)
• EU regulation
– Title
– Preamble
  • Citations
  • Recitals
– Enacting terms
  • Articles
    – Article header
      » Numbering
    – Paragraphs or alineas
• German regulation
– Title
– Preamble
  • Paragraphs
– Enacting terms
  • Articles
    – Article header
      » Numbering + text
    – alineas
Differences and common objects (2)
• EU regulation
– Final
  • Applicability
  • Signature
• German regulation
– Final
  • Signature
Differences and common objects (3)
• preamble
– European model
PREAMBLE        (PREAMBLE.INIT, CITATION+, RECITAL+, PREAMBLE.FINAL)
PREAMBLE.INIT   (P)
CITATION        (P)
RECITAL         (NP)
PREAMBLE.FINAL  (P)
– German model
PREAMBLE        (P)
Differences and common objects (4)
• article
– European model
ARTICLE         (ARTICLE.HEADER, (PARAG+ | ALINEA+))
ARTICLE.HEADER  (#PCDATA)
PARAG           (NO.PARAG, ALINEA+)
ALINEA          (P | LIST)+
– German model
ARTICLE         (ARTICLE.HEADER, (PARAG+ | ALINEA+))
ARTICLE.HEADER  (NP)
NP              (NO.P, TXT)
PARAG           (NO.PARAG, ALINEA+)
ALINEA          (P | LIST)+
Differences and common objects (5)
• final
– European model
FINAL          (APPLICABILITY, SIGNATURE)
APPLICABILITY  (P)
SIGNATURE      (PL.DATE, SIGNATORY)
PL.DATE        (P)
SIGNATORY      (P+)
– German model
FINAL          (SIGNATURE)
SIGNATURE      (PL.DATE, SIGNATORY)
PL.DATE        (P)
SIGNATORY      (P+)
Differences and common objects (6)
[Diagram: overlap of the two grammars]
• Specific models for European regulation
• Common models for European and German regulation
• Specific models for German regulation
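The split into common and specific models can be derived mechanically by comparing the two sets of content models. The sketch below does this for a subset of the European and German models given on the preceding slides, transcribed as plain strings; the dictionary layout and the function name are assumptions made for this example.

```python
# Content models transcribed from the European and German model slides (subset).
EU_MODELS = {
    "PREAMBLE": "(PREAMBLE.INIT, CITATION+, RECITAL+, PREAMBLE.FINAL)",
    "ARTICLE": "(ARTICLE.HEADER, (PARAG+ | ALINEA+))",
    "ARTICLE.HEADER": "(#PCDATA)",
    "PARAG": "(NO.PARAG, ALINEA+)",
    "ALINEA": "(P | LIST)+",
    "FINAL": "(APPLICABILITY, SIGNATURE)",
    "APPLICABILITY": "(P)",
    "SIGNATURE": "(PL.DATE, SIGNATORY)",
}
DE_MODELS = {
    "PREAMBLE": "(P)",
    "ARTICLE": "(ARTICLE.HEADER, (PARAG+ | ALINEA+))",
    "ARTICLE.HEADER": "(NP)",
    "PARAG": "(NO.PARAG, ALINEA+)",
    "ALINEA": "(P | LIST)+",
    "FINAL": "(SIGNATURE)",
    "SIGNATURE": "(PL.DATE, SIGNATORY)",
}


def split_models(first, second):
    """Return (common, only_first, only_second) dictionaries of content models.

    A model is common when both grammars declare the element identically.
    """
    common = {name: model for name, model in first.items() if second.get(name) == model}
    only_first = {n: m for n, m in first.items() if n not in common}
    only_second = {n: m for n, m in second.items() if n not in common}
    return common, only_first, only_second


common, eu_specific, de_specific = split_models(EU_MODELS, DE_MODELS)
print(sorted(common))       # ['ALINEA', 'ARTICLE', 'PARAG', 'SIGNATURE']
print(sorted(eu_specific))  # ['APPLICABILITY', 'ARTICLE.HEADER', 'FINAL', 'PREAMBLE']
print(sorted(de_specific))  # ['ARTICLE.HEADER', 'FINAL', 'PREAMBLE']
```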
Differences and common objects (7)
• Common grammar fragment
<!ELEMENT ALINEA          (P | LIST)+>
<!ELEMENT ARTICLE         (ARTICLE.HEADER, (ALINEA+ | PARAG+))>
<!ELEMENT ENACTING.TERMS  (ARTICLE+)>
<!ELEMENT ITEM            (NP, (P | LIST))>
<!ELEMENT NO.P            (#PCDATA)>
<!ELEMENT NOTE            (P+)>
<!ATTLIST NOTE            NOTE.ID ID #REQUIRED>
<!ELEMENT NP              (NO.P, TXT)>
<!ELEMENT P               (#PCDATA | NOTE)*>
<!ELEMENT PARAG           (PARAG.NO, ALINEA+)>
<!ELEMENT PARAG.NO        (#PCDATA)>
<!ELEMENT PL.DATE         (P+)>
<!ELEMENT REGULATION      (TITLE, PREAMBLE, ENACTING.TERMS, FINAL)>
<!ATTLIST REGULATION      CTRY (DE | EU-EN) #REQUIRED>
<!ELEMENT SIGNATORY       (P+)>
<!ELEMENT SIGNATURE       (PL.DATE, SIGNATORY)>
<!ELEMENT TITLE           (P+)>
<!ELEMENT TXT             (#PCDATA | LIST | NOTE)*>
Differences and common objects (8)
• Specific grammar for EU regulation
<!ENTITY % common SYSTEM "regulation-common.dtd">
%common;
<!ELEMENT APPLICABILITY   (P)>
<!ELEMENT ARTICLE.HEADER  (P)>
<!ELEMENT CITATION        (P)>
<!ELEMENT FINAL           (APPLICABILITY, SIGNATURE)>
<!ELEMENT PREAMBLE        (PREAMBLE.INIT, CITATION+, RECITAL.INIT?,
                           RECITAL+, PREAMBLE.FINAL)>
<!ELEMENT PREAMBLE.FINAL  (P)>
<!ELEMENT PREAMBLE.INIT   (P)>
<!ELEMENT RECITAL         (P | NP)>
<!ELEMENT RECITAL.INIT    (P)>
Differences and common objects (9)
• Specific grammar for German regulation
<!ENTITY % common SYSTEM "regulation-common.dtd">
%common;
<!ELEMENT ARTICLE.HEADER  (NP)>
<!ELEMENT FINAL           (SIGNATURE)>
<!ELEMENT PREAMBLE        (P+)>
Final remarks
• Possible objects:
– Metadata on document level
– Metadata on archiving level (research aspects)
– Common models for complex objects: tables, quotations, etc.