Implementing a Government-wide Semantic Solution to Thesauri

Kenneth B. Sall, Science Applications International Corporation (SAIC), and
Ronald P. Reck, RRecktek LLC
November 15, 2005
XML 2005, Atlanta, GA
Agenda
• Problem
• Goals and Requirements
• Early Thesaurus Attempts
• Thesauri Standards and Specifications
• Basic Thesaurus Terminology
• SKOS (Simple Knowledge Organisation System)
• Our SKOS Element Subset and Extensions
• SKOSaurus Pilot (updated since September 2005 paper)
• Next Steps
Problem Statement
• Government agencies need a common vocabulary of
(technical) terminology.
• Communication and data sharing are greatly enhanced
when the semantics are clear.
• Various government groups approach this in different
ways: Microsoft Word, Excel, HTML, databases, and wiki
pages, holding bulleted lists, tables, spreadsheets, acronym lists,
etc.
• Need to focus on common formats and standards that
enable reuse and harmonization across Communities of
Interest (COIs).
Goals and Requirements
• Allow grouping terms in one COI or sharing across COIs.
• Should benefit from ISO standards for thesauri.
• Enable term authors to use familiar tools (e.g., Excel).
• XML-based (RDF) solution with few required elements but
many optional and/or repeatable elements.
• Multiple definitions of the same term must be permitted,
with either the same or a different subject/context.
• Should support semantic relationships between terms
(synonyms, related-to, broader-than, narrower-than)
and searching of the thesaurus.
• [Many more in paper.]
Early Thesaurus Attempts
See http://kensall.com/gov/glossary/#older
XSLT-Generated Search Links
AcronymFinder; Wikipedia; Clusty; Clusty Gov [.gov and .mil]; Google Uncle Sam [.gov and .mil];
Google Define; Google; Merriam-Webster; W3C; W3Schools; Webopedia; WhatIs; WordNet; ZVON.
Thesauri Standards and Specifications
• ISO 2788:1986 – Documentation - Guidelines for the
establishment and development of monolingual thesauri
– Developing a Thesaurus (monolingual)
– ISO 5964:1985 – multilingual version
• ISO 1087:2000 - Vocabulary of Terminology
• ISO 704:2000 - Principles and Methods
• ANSI/NISO Z39.19-2003 - Construction, Format, and
Management
• ISO 15836:2003 - The Dublin Core metadata element set
• [Many more listed in paper.]
Basic Thesaurus Terminology [ISO 2788:1986]
• Thesaurus – list of concepts in a particular domain of
knowledge together with explicit relationships
• Concept - unit of thought that exists in the mind as an
abstract entity, independent of the term(s) that identify it
• Concept Scheme - set of concepts, optionally including
statements about semantic relationships between those
concepts.
– Thesauri, classification schemes, subject heading lists,
taxonomies, terminologies, glossaries and other types of
controlled vocabularies
Basic Thesaurus Terminology [ISO 2788:1986]
• USE (or SEE) – preferred label for this concept
• UF = Used For – alternate label, may be a synonym but less
preferred [e.g., birds UF Aves]
• SN = Scope Note – to clarify or constrain the meaning;
sometimes contains the concept’s definition
• BT = Broader Term – another concept more general than
this concept
• NT = Narrower Term – more specialized than this concept
• RT = Related Term – concept that is similar in some way
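As a minimal sketch of how these relationship codes fit together, one thesaurus entry can be modelled as a small record and printed in the conventional display order. The class name, field names, and the `format_entry` helper are hypothetical; the sample values are taken from the bird example later in this deck, with the scope note reduced to a single illustrative word.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThesaurusEntry:
    """One concept, keyed by its preferred (USE) label. Hypothetical structure."""
    use: str                                      # USE: preferred label
    uf: List[str] = field(default_factory=list)   # UF: non-preferred labels
    sn: str = ""                                  # SN: scope note
    bt: List[str] = field(default_factory=list)   # BT: more general concepts
    nt: List[str] = field(default_factory=list)   # NT: more specialized concepts
    rt: List[str] = field(default_factory=list)   # RT: related concepts


def format_entry(entry: ThesaurusEntry) -> str:
    """Render an entry as indented 'CODE value' lines under its preferred label."""
    lines = [entry.use]
    if entry.sn:
        lines.append(f"  SN {entry.sn}")
    for code, values in (("UF", entry.uf), ("BT", entry.bt),
                         ("NT", entry.nt), ("RT", entry.rt)):
        for value in values:
            lines.append(f"  {code} {value}")
    return "\n".join(lines)


# Sample data only; the scope note wording is illustrative.
bird = ThesaurusEntry(
    use="bird",
    uf=["Aves"],
    sn="ornithology",
    bt=["vertebrate", "animal"],
    nt=["robin", "hawk", "sparrow", "eagle"],
    rt=["lizard", "reptile"],
)
print(format_entry(bird))
```

Keeping each code as a separate list makes the repeatable relationships (UF, BT, NT, RT) explicit, while USE and SN stay single-valued.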
Thesaurus Concept Example
Source: GAO Thesaurus, Feb. 2005
SKOS (Simple Knowledge Organisation System)
• Leverages ISO 2788 (and ISO 5964)
• Semantic Web Best Practices and Deployment Working
Group – W3C
• SKOS Working Drafts (W3C) and Related Efforts
– SKOS Core Guide
– SKOS Core Vocabulary Specification
– Quick Guide to Publishing a Thesaurus on the Semantic Web
– SKOS Mapping
– SKOS Extensions
– SKOS API
– Development Wiki
SKOS
• “SKOS Core is a model for expressing the structure and
content of concept schemes (thesauri, classification
schemes, subject heading lists, taxonomies,
terminologies, glossaries and other types of controlled
vocabulary).”
• “The SKOS Core Vocabulary is an application of the
Resource Description Framework (RDF), that can be used
to express a concept scheme as an RDF graph. Using RDF
allows data to be linked to and/or merged with other RDF
data by semantic web applications.”
• It uses RDFS Classes and RDF Properties to describe
Concepts and Concept Schemes.
Source: SKOS Core Guide, May 2005
SKOS Vocabulary
Implemented 9 of the 26 SKOS Properties
Our SKOS Element Subset and Extensions, 1
• skos:Concept – used for Containment
• skos:prefLabel, skos:altLabel – used for
Lexical Labeling
• skos:related, skos:narrower, skos:broader
– used for Semantic Relationships
• skos:scopeNote, skos:definition,
skos:example – used for Documentation
• skos:subject – used for Indexing (future)
Our SKOS Element Subset and Extensions, 2
• skos:Concept – contains all statements about
properties for one concept
• skos:prefLabel – USE; preferred handle for
this concept; designator. [In SKOS, no two concepts
in the same concept scheme may have the same prefLabel.]
• skos:altLabel – UF; alternate handle; spelling
variants; can be used for abbreviations or
acronyms (but we don’t)
• skos:related, skos:narrower, skos:broader
– associated with, more specific, or more
general than this concept
Our SKOS Element Subset and Extensions, 3
• skos:scopeNote – constrains meaning; ISO
2788 allows definitions to appear here (but we
don’t)
• skos:definition – statement or formal
explanation of the meaning of a concept
• skos:example – used in a sentence
• skos:subject – topic; can be a skos:broader concept
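A minimal sketch of how the labeling and documentation properties above could be serialized as SKOS in RDF/XML, using only the Python standard library. The `concept_xml` function and its parameters are hypothetical and are not the pilot's actual generator (which used Perl and XSLT); the concept URI follows one of the candidate formats shown later in this deck.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
ET.register_namespace("rdf", RDF)
ET.register_namespace("skos", SKOS)


def concept_xml(uri, pref_label, alt_labels=(), definition=None,
                scope_note=None, example=None):
    """Build an rdf:RDF document holding one skos:Concept (illustrative only)."""
    root = ET.Element(f"{{{RDF}}}RDF")
    concept = ET.SubElement(root, f"{{{SKOS}}}Concept",
                            {f"{{{RDF}}}about": uri})
    # Exactly one preferred label, then any number of alternate labels.
    ET.SubElement(concept, f"{{{SKOS}}}prefLabel").text = pref_label
    for alt in alt_labels:
        ET.SubElement(concept, f"{{{SKOS}}}altLabel").text = alt
    # Optional documentation properties are emitted only when supplied.
    for tag, value in (("definition", definition),
                       ("scopeNote", scope_note),
                       ("example", example)):
        if value:
            ET.SubElement(concept, f"{{{SKOS}}}{tag}").text = value
    return ET.tostring(root, encoding="unicode")


print(concept_xml("http://skos.rrecktek.com/drm/concept/bird", "bird",
                  alt_labels=["Aves"], scope_note="ornithology"))
```

Semantic relationships (skos:broader, skos:narrower, skos:related) would be added as further child elements that point to other concepts' URIs with rdf:resource; they are omitted here to keep the sketch short.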
Our SKOS Element Subset and Extensions, 4
• Pilot Extensions (non-SKOS)
– ABBREVIATON_OR_ACRONYM – common government
need (could define as rdfs:subPropertyOf
skos:altLabel)
– SOURCE – official document names and URLs are
preferred, but specific names of people or agencies are
acceptable (could probably be defined as
rdfs:subPropertyOf skos:note)
– COI – essentially a skos:Collection (with a potential
skos:ConceptScheme)
Bird Example as RDF Graph
Illustrative Statements
• An alternate label (skos:altLabel) for "bird" is "Aves".
• The concepts with the preferred label "vertebrate" and "animal"
are broader than the concept with the preferred label "bird".
• There are four specializations of birds listed ("robin", "hawk",
"sparrow" and "eagle"), each indicated as skos:narrower than
"bird".
• The concepts "lizard" and "reptile" are skos:related to the
"bird" concept in some way.
• Among various concepts which might have the skos:prefLabel
of "bird", the one illustrated is constrained to ornithology,
according to skos:scopeNote. This distinguishes the concept
from other senses of "bird", such as the informal term for a (young) woman.
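The statements above can be written down as subject-predicate-object triples, which is the form an RDF graph takes. This is a simplified sketch: it uses preferred labels as node names instead of URIs, the `objects` helper is hypothetical, and the scope note value is a one-word stand-in for the real note.

```python
# Each tuple is (subject, predicate, object), mirroring the bullets above.
triples = [
    ("bird", "skos:altLabel", "Aves"),
    ("bird", "skos:broader", "vertebrate"),
    ("bird", "skos:broader", "animal"),
    ("bird", "skos:narrower", "robin"),
    ("bird", "skos:narrower", "hawk"),
    ("bird", "skos:narrower", "sparrow"),
    ("bird", "skos:narrower", "eagle"),
    ("bird", "skos:related", "lizard"),
    ("bird", "skos:related", "reptile"),
    ("bird", "skos:scopeNote", "ornithology"),  # illustrative wording
]


def objects(subject, predicate):
    """Return every object linked to `subject` by `predicate`, in list order."""
    return [o for s, p, o in triples if s == subject and p == predicate]


print(objects("bird", "skos:narrower"))
print(objects("bird", "skos:broader"))
```

Because labels stand in for URIs here, the homonym case (a second concept also labelled "bird") cannot be represented; that limitation is what motivates the URI design issues discussed later.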
OWL Statements About SKOS
• skos:broader owl:inverseOf skos:narrower
• skos:narrower owl:inverseOf skos:broader
and
• skos:broader is an owl:TransitiveProperty
• skos:narrower is an owl:TransitiveProperty
• RDF/OWL version of SKOS Core
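A minimal sketch of what these OWL statements let a reasoner infer, written as plain set operations in place of an OWL engine. The function names are hypothetical. The sample pairs come from the bird example: "eagle" has "bird" as broader, and "bird" has "vertebrate" as broader.

```python
def invert(pairs):
    """owl:inverseOf: every (a broader b) implies (b narrower a)."""
    return {(b, a) for a, b in pairs}


def transitive_closure(pairs):
    """owl:TransitiveProperty: (a, b) and (b, c) imply (a, c), repeated to a fixpoint."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# (narrower concept, broader concept) pairs asserted via skos:broader.
broader = {("eagle", "bird"), ("bird", "vertebrate")}

all_broader = transitive_closure(broader)   # adds ("eagle", "vertebrate")
all_narrower = invert(all_broader)          # e.g. ("vertebrate", "eagle")
print(sorted(all_broader))
print(sorted(all_narrower))
```

The practical benefit is that an author only needs to state each relationship once, in one direction; the inverse and the indirect links can be derived.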
SKOSaurus Pilot
• Proof of concept
• Many simplifying assumptions
• Fabricated data (except for DTIC)
• About 100 man hours
• Ron Reck and Ken Sall
SKOSaurus Pilot: Environment
• The host operating system is Microsoft Windows XP with
Service Pack 2.
• Dell Latitude D800 (1.69GHz) with 1 GB of RAM.
• The Windows XP host runs VMware 5.0 build 13124 to
emulate a machine onto which the Solaris X86 operating
system version 10 is installed.
• Solaris is the guest operating system, which
runs the SKOSaurus system, consisting of:
– Perl version 5.8.7 and various Perl modules
– Java version 1.4.2.08
– Kowari server 1.1.0 Pre2
– XSLT stylesheets
Main Use Cases (for Pilot)
• Concept Entry via Web Form
• File Upload of Excel Spreadsheet (as CSV)
• File Upload of SKOS (or RDF)
• Query of Concept Data Store
Example Spreadsheet: birds
Spreadsheet Conventions (Pilot)
• One row per concept, sparsely or densely populated.
• New row for different definition or homonym (e.g., bird). [SKOS
conflict: duplicate prefLabels.]
• The heading row should not be removed or modified.
• Column order is invariant.
• Since several elements are repeatable, use a semicolon to separate
repeated values within a cell.
• A limitation in our pilot parser requires the author to use the
pipe symbol ("|") instead of a comma within a cell.
• Any number of rows can be included, but there must be no
blank rows or separator rows.
• File > Save As Comma Separated Values (*.csv).
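A minimal sketch of a reader for CSV files that follow these conventions. This is not the pilot's Perl parser: the column headings, the set of repeatable columns, and the sample row (including its definition text) are all assumptions made for illustration. Translating the pipe symbol back to a comma on input is likewise an assumed reading of the convention.

```python
import csv
import io

# Assumed headings for repeatable columns; the real spreadsheet may differ.
REPEATABLE = {"altLabel", "broader", "narrower", "related"}


def parse_concepts(csv_text):
    """Return one dict per row, keyed by the heading row."""
    concepts = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        concept = {}
        for column, cell in row.items():
            if column is None:           # extra cells beyond the heading row
                continue
            cell = (cell or "").strip()
            if not cell:                 # sparse rows: skip empty cells
                continue
            if column in REPEATABLE:
                # A semicolon separates repeated values within one cell.
                concept[column] = [v.strip().replace("|", ",")
                                   for v in cell.split(";") if v.strip()]
            else:
                # The pipe stands in for a comma inside a cell.
                concept[column] = cell.replace("|", ",")
        concepts.append(concept)
    return concepts


sample = (
    "prefLabel,altLabel,definition,broader,narrower\n"
    "bird,Aves,warm-blooded| egg-laying vertebrate,vertebrate;animal,robin;hawk\n"
    "hawk,,,bird,\n"
)
for concept in parse_concepts(sample):
    print(concept)
```

Reading by heading name keeps the sketch readable, but it relies on the heading row being left unmodified, which is why the conventions above forbid editing it.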
SKOSaurus: Home
SKOSaurus: Manage COIs
SKOSaurus: Upload CSV or SKOS
SKOSaurus: Upload Feedback
Generated SKOS files
Datastore for COI
SKOSaurus: Generated SKOS Excerpt, 1 of 2
SKOSaurus: Generated SKOS Excerpt, 2 of 2
SKOSaurus: Web Form
SKOSaurus: Kowari Model Dump: Query
SKOSaurus: Kowari Model Dump: Result
SKOSaurus: Intuitive Search
SKOSaurus: Intuitive Search: bird Result
Note: 2 altLabels, really 2 different concepts.
SKOSaurus: Intuitive Search: animal, reptile
SKOSaurus: Intuitive Search: eagle
SKOSaurus: Intuitive Search: bald eagle
SKOSaurus: Intuitive Search: dame
SKOSaurus: DTIC Data: 73,500 lines
Source: Defense Technical Information Center
SKOSaurus: DTIC Search: ANATOMY
SKOSaurus: DTIC Search: ANATOMY
SKOSaurus: DTIC Search: BIOLOGY
SKOSaurus: DTIC Search: BIOPHYSICS, PHYSICS
Design Issue: Bootstrapping
• Unique numeric URI
– Pro: designed to avoid collisions (duplicate prefLabels)
– Con: not intuitive and not transparent for humans
• URI based on concept’s skos:prefLabel
– Pro: transparent; less query intensive since we can map from
prefLabel directly to URI
– Con: easy to have collisions; need rules for dealing with:
• Capitalization
• White space
• Singular vs. plural forms
• Non-alphabetic characters vs. acceptable URI characters
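A minimal sketch of the second option, deriving a URI from the prefLabel with explicit rules for capitalization, white space, and non-URI characters. The rules chosen here (lowercase, underscores for white space, percent-encoding for everything else) and the function names are assumptions, not the pilot's behavior. Singular versus plural forms are deliberately left unresolved. The base URI is taken from one of the candidate formats in this deck.

```python
import re
from collections import defaultdict
from urllib.parse import quote

BASE = "http://skos.rrecktek.com/drm/concept/"


def label_to_uri(pref_label, base=BASE):
    """Map a preferred label to a URI using assumed normalization rules."""
    slug = pref_label.strip().lower()     # capitalization: fold to lowercase
    slug = re.sub(r"\s+", "_", slug)      # white space: collapse runs to "_"
    slug = quote(slug, safe="_-")         # percent-encode other characters
    return base + slug


def find_collisions(labels):
    """Group distinct labels that normalize to the same URI."""
    by_uri = defaultdict(set)
    for label in labels:
        by_uri[label_to_uri(label)].add(label)
    return {uri: sorted(group) for uri, group in by_uri.items()
            if len(group) > 1}


print(label_to_uri("Bald  Eagle"))
print(find_collisions(["bird", "Bird", "BIRD ", "birds"]))
```

Normalization makes the mapping predictable but also creates collisions between labels that differ only in case or spacing, so a collision check like `find_collisions` would be needed before minting URIs. It does nothing for true homonyms, which share an identical prefLabel.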
Design Issue: URI Format
1. <skos:Concept
rdf:about="http://skos.rrecktek.com/drm/bird#concept">
2. <skos:Concept
rdf:about="http://skos.rrecktek.com/drm/concept#bird">
3. <skos:Concept
rdf:about="http://skos.rrecktek.com/drm/bird/concept">
4. <skos:Concept
rdf:about="http://skos.rrecktek.com/drm/concept/bird">
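One way to compare the four candidates is to split each URI into the part that names a retrievable document and the fragment after "#". This sketch uses the standard library's `urlsplit`; the `describe` helper is hypothetical. It is offered as one consideration among several, not as the authors' selection criterion.

```python
from urllib.parse import urlsplit

candidates = [
    "http://skos.rrecktek.com/drm/bird#concept",
    "http://skos.rrecktek.com/drm/concept#bird",
    "http://skos.rrecktek.com/drm/bird/concept",
    "http://skos.rrecktek.com/drm/concept/bird",
]


def describe(uri):
    """Separate the document part of a URI from its fragment identifier."""
    parts = urlsplit(uri)
    return {
        "document": f"{parts.scheme}://{parts.netloc}{parts.path}",
        "fragment": parts.fragment,
    }


for number, uri in enumerate(candidates, start=1):
    print(number, describe(uri))
```

In formats 1 and 2 the fragment is not sent to the server in an HTTP request, so under format 2 every concept resolves to the same document; in formats 3 and 4 each concept has its own path and can be served separately.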
Next Steps – 1 of 3
1. Normalize Definitions - A submission-and-approval
process can be added which requires each Community of
Interest (COI) to designate an owner who can promote
entries from the candidate status to the approved status.
2. Edits of Existing Terms
3. Add Query across COIs.
4. Add skos:historyNote - to record CSV file upload date,
contributor; update with modifications.
5. If a database can output data in RDF or SKOS format, the
SKOSaurus system could permit an upload or SOAP entry
of the data without modification.
6. Migrate from an existing representation (database) to
SKOS or RDF (e.g., Oracle 10g Release 2 shows
promise for supporting RDF).
Next Steps – 2 of 3
7. Consider other conversion processes (e.g., Microsoft
Word tables).
8. Integrate Google define:, Wikipedia, and other generated
search links.
9. Create XSLT and XSL-FO stylesheets for printing, probably
with an interface to select a subset of concepts to print.
10. Consider a central or federal data store for all concepts
for multiple COIs and agencies with system management
roles at COI and agency level, as well as across the entire
data store.
11. Enhance the web form so that repeatable fields
(narrower, broader, related, altLabel, etc.) can be entered
more than once.
Next Steps – 3 of 3
12. Establish clear authoring conventions
– Case convention (UpperCamelCase, Title Case, lowercase, all
caps?)
– Pluralization (use singular form, but what about irregular
cases: men/man, geese/goose, etc.)
– Compound terms (e.g., Data Architecture, Data Class)
– Placement of acronym/abbreviation (separate element)
– Placement of source (separate element)
– Citation method (URIs, bibliographical, free form?) Source
could contain child elements for each possible format.
Summary and Conclusion
• SKOS is a useful vocabulary for implementing a thesaurus.
• The U.S. Government would benefit from a unified
approach to thesauri, especially when sharing terminology
within and across Communities of Interest.
• Our approach assumes government term authors want to
work in Excel, not XML/RDF/SKOS (although we permit
SKOS upload).
• Other SKOS implementations are worth considering (e.g.,
Java-based NBII SKOS Thesaurus client).
• We hope W3C considers SKOS for the Recommendation
Track.
Resources
• SKOS home page
http://www.w3.org/2004/02/skos/
• Semantic Web Activity home page
• W3C Semantic Web News and Events Archive
• SKOS: A language to describe simple knowledge
structures for the web – A. Miles et al, XTech 2005
slides or paper
• SKOS Core Tutorial for DCMI 2005 – A. Miles, slides or
PDF
• NBII SKOS Thesaurus
• Sall’s Earlier Glossary Work