Taxonomy Tools: Requirements and Capabilities
Joseph A. Busch, PPC Senior Principal
Today’s agenda
Time        Duration  Agenda
1:00-1:15   15 min    Introductions
1:15-2:00   45 min    Taxonomy Basics
2:00-3:00   60 min    Taxonomy Development Process
3:00-3:15   15 min    Coffee Break
3:15-4:00   45 min    Taxonomy Construction Tools
4:00-4:45   45 min    Exercise
4:45-5:00   15 min    Q&A, Closing
Learning Objectives:
▪ Ability to identify taxonomies by type, to choose the appropriate type for an
information product development application, and to articulate the benefits
of the taxonomy for use in development of an information product.
▪ Understand basic taxonomy-related terminology.
▪ Demonstrate the ability to identify taxonomy term record elements.
▪ Demonstrate the ability to focus on the key concepts and build term records
for a small taxonomy.
1. TAXONOMY BASICS
What taxonomy is: Systematics view
Biological taxonomy places an organism in one and only one place.
Linnaeus …
Kingdom: Animalia
Phylum: Chordata
Class: Mammalia
Order: Carnivora
Family: Canidae
Genus: Canis
Species: C. familiaris
What taxonomy is: Pragmatic view
But most of the time things belong to more than one category.
Linnaeus …
Kingdom: Animalia
Phylum: Chordata
Class: Mammalia
Order: Carnivora
Family: Canidae
Genus: Canis
Species: C. familiaris
Other categories: Pets, Mammals, Dogs, Farm Animals
Other semantic schemes
Type
Remarks
Synonym Ring
A set of words/phrases that can be used interchangeably for
searching.
Example: Hypertension, High blood pressure
Controlled Vocabulary
A list of preferred and variant terms, with defined hierarchical
and associative relationships. A taxonomy is a type of controlled
vocabulary.
Typically used for names of countries, individuals, organizations.
Classification Scheme
An arrangement of knowledge that does not follow taxonomy
rules.
Usually enumerated; e.g., Dewey Decimal Classification
Thesaurus
A tool that controls synonyms and identifies the semantic
relationships among terms.
Ontology
Resembles faceted taxonomy but uses richer semantic
relationships among terms and attributes and strict specification
rules.
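A synonym ring can be pictured as a plain set of interchangeable terms. The sketch below is illustrative only: the ring contents reuse the Hypertension example above, and the names SYNONYM_RINGS and expand_query are assumptions, not part of any tool covered in this course.

```python
# Hypothetical synonym rings; only the hypertension pair comes from the slide.
SYNONYM_RINGS = [
    {"hypertension", "high blood pressure"},
]


def expand_query(term):
    """Return every term that is interchangeable with `term` for searching."""
    needle = term.strip().lower()
    for ring in SYNONYM_RINGS:
        if needle in ring:
            return sorted(ring)
    return [needle]  # no ring found: search on the term itself


print(expand_query("Hypertension"))
```

A search front end could OR the returned terms together, so a query on either phrase retrieves the same content.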
Semantic schemes: Simple to complex
[Figure: vocabularies placed on a scale from Simple to Complex]
Vocabularies shown: Synonym Rings, Authority Files, Classification Schemes, Taxonomies, Thesauri, Ontologies
Relationships shown: Equivalence, Hierarchical, Associative
Source: Amy Warner. Metadata and Taxonomies for a More Flexible Information
Architecture (http://www.lexonomy.com/presentations/metadataAndTaxonomies.ppt)
Taxonomic metadata: e-Government example
Metadata Elements: Agency, Form Type, Industry, Jurisdiction, BRM Impact, Keyword Topic, Audience
Controlled Vocabularies:
Agency
0001 Legislative; 1000 Judicial; 1100 Executive Office of Pres; 0003 Exec Depts;
1200 Agriculture; 1300 Commerce; 9700 Defense; 9100 Education; 8900 Energy;
7500 HHS; 7000 DHS; 8600 HUD; 1400 Interior; 1500 Justice; 1600 Labor;
1900 State; 6900 Transport; 2000 Treasury; 3600 Veterans; Ind Agencies; Intl Orgs
Form Type
Application; Approval; Claim; Information request; Information submission;
Instructions; Legal filing; Payment; Procurement; Renewal; Reservation;
Service request; Test; Other input; Other transaction
Industry
00 Generic; 11 Agriculture; 21 Mining; 22 Utilities; 23 Construct; 31-33 Manuf;
42 Wholesale; 44-45 Retail; 48-49 Trans; 51 Info; 52 Finance; 54 Profession;
55 Mgmt; 56 Support; 61 Education; 62 Health Care; 71 Arts; 72 Hospitality;
81 Other Services; 92 Public Admin
Jurisdiction
Federal; State +; Local +; Other +
BRM Impact
Citizen Srvcs; Social Srvs; Defense; Disasters; Econ Dev; Education; Energy;
Env Mgmt; Law Enf; Judicial; Correctional; Health; Security; Income Sec;
Intelligence; Intl Affairs; Nat Resour; Transport; Workforce; Science;
Delivery; Support; Management
Keyword Topic
Agriculture & food; Commerce; Communications; Education; Energy; Env pro;
Foreign rels; Govt; Health & safety; Housing & comm dev; Labor; Law;
Named grps; National def; Nat resources; Recreation; Sci & tech; Social pgms;
Transport
Audience
All; General; Citizen; Business; Govt; Employee; Native American; Nonresident;
Tourist; Special group
Standards
• Taxonomy
– Z39.19-2003. Guidelines for the Construction, Format, and
Management of Monolingual Thesauri
• BT/NT
– ISO 13250. Topic Maps
• Topics, associations, occurrences
• Metadata
– ISO 15836 and Z39.85-2007. Dublin Core Metadata Element Set.
• 15 elements
– FRBR. Functional Requirements for Bibliographic Records
• Work → Expression → Manifestation → Item
Standards (2)
• Semantic web (interoperability)
– RDF. Resource Description Framework.
• Subject-predicate-object descriptions
– ISO 11179. Metadata Registry (MDR).
• Metadata-driven exchange of data in a heterogeneous
environment, based on exact definitions of data.
Taxonomy definitions
Term
Definition
Concept
The characteristics of a real or imaginary object expressed as
terms in the taxonomy.
Controlled Vocabulary
A list of terms that have been explicitly enumerated. The
terms are controlled and published by a designated
authority or authoritative source. If multiple terms are used
to mean the same thing, one of the terms is identified as the
Preferred Term in the Controlled Vocabulary and the other
terms are listed as synonyms or aliases.
Facet
A grouping of concepts of the same inherent category.
Examples of categories that may be used for grouping
concepts into facets are: Audience, Channels, Components,
Content Types, Functions, Industries, Intentions, Lifecycle,
Location, Organization, Products, etc.
Taxonomy
The core metadata elements and the Controlled
Vocabularies required to find, use, and manage content in a
collection.
Some definitions associated with terms
Term
Definition
UID
The unique identifier for the concept.
Entry Term
The preferred term that is used to label a concept. An entry
term is also known as a Descriptor.
Broader Term (BT)
A term to which another term (or multiple terms) is
subordinate in a hierarchy.
Narrower Term (NT)
A term that is subordinate to another term or to multiple
terms in a hierarchy.
Used For Term (UF)
Non-preferred term(s) that are equivalent to the Entry Term.
Used for terms may be synonyms, aliases (such as
abbreviations) and quasi-synonyms (such as more specific
terms).
Related Term (RT)
A term that is associatively (but not hierarchically) linked to
another term in a Controlled Vocabulary.
Scope Note (SN)
A note following a term explaining its source, rationale,
coverage, specialized usage, or rules for assigning it.
Relationships
Definition
Associative Relationship
A relationship between or among terms that leads from one
term to other terms that are related to or associated with it.
An Associative Relationship is a Related Term or cross-reference relationship.
Equivalence Relationship
A relationship between or among terms in a Controlled
Vocabulary that leads to one or more terms that are to be
used instead of the term from which the Reference is made.
An Equivalence Relationship is a Used For Term relationship.
Hierarchical Relationship
A relationship between or among terms in a Controlled
Vocabulary that depicts broader (generic) to narrower
(specific) or whole-part relationships. A Hierarchical
Relationship is a Broader Term to Narrower Term relationship.
Concept, terms and relationships
CONCEPT: IBM
TERMS: IBM; I.B.M.; International Business Machines
RELATIONSHIPS:
IBM: Is Preferred Label
I.B.M.: Is Used For
International Business Machines: Is Used For
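The term record elements defined above (UID, Entry Term, UF, BT, NT, RT, SN) map naturally onto a small record type. The following is a minimal sketch, not the data model of any particular tool: the class name TermRecord, the helper build_lookup, and the UID value "T0001" are all hypothetical; only the IBM labels come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class TermRecord:
    uid: str                                       # unique identifier for the concept
    entry_term: str                                # preferred label (Descriptor)
    used_for: list = field(default_factory=list)   # UF: non-preferred equivalents
    broader: list = field(default_factory=list)    # BT
    narrower: list = field(default_factory=list)   # NT
    related: list = field(default_factory=list)    # RT
    scope_note: str = ""                           # SN


def build_lookup(records):
    """Map every label (preferred or not) to its record, ignoring case."""
    lookup = {}
    for rec in records:
        for label in [rec.entry_term, *rec.used_for]:
            lookup[label.lower()] = rec
    return lookup


# "T0001" is an invented UID for illustration.
ibm = TermRecord(
    uid="T0001",
    entry_term="IBM",
    used_for=["I.B.M.", "International Business Machines"],
)
lookup = build_lookup([ibm])
print(lookup["international business machines"].entry_term)  # IBM
```

Resolving every Used For label to one record is what lets the equivalence relationship steer users from a variant to the preferred term.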
Business taxonomy problem: How can a customer pick from >5,000 faucets w/o quitting?
Refine search by:
• Category
• Price
• Brand
• Color/Finish
• # Handles
• Series Name
• Water Filter?
• Faucet Spray
• Handle Shape
• Soap Dispenser?
How business taxonomy translates into frontend interface
Metadata Field: Size
Taxonomy Values: 4.5, 5.5, 6, 6.5, 7, 8, …
Metadata Field: Type
Taxonomy Values: Athletic Inspired, Boots, Loafers and Slip-ons, Oxfords and More, Sandals
Metadata Field: Color
Taxonomy Values: Black, Blue, Brown, Green, Grey, Ivory, …
Metadata Field: Brand
Taxonomy Values: Antonio Maurizi, Bacco Bucci, Ben Sherman, Bruno Magli, …
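A "refine search by" interface amounts to filtering items on metadata field values and counting what remains in each facet. Here is a minimal sketch under stated assumptions: the product records and the function names refine and facet_counts are invented for illustration; only the field names and values are taken from the slide.

```python
# Hypothetical product records tagged with the slide's metadata fields.
PRODUCTS = [
    {"name": "Shoe A", "Size": "7", "Type": "Boots", "Color": "Black", "Brand": "Ben Sherman"},
    {"name": "Shoe B", "Size": "8", "Type": "Sandals", "Color": "Brown", "Brand": "Bruno Magli"},
    {"name": "Shoe C", "Size": "7", "Type": "Boots", "Color": "Brown", "Brand": "Bacco Bucci"},
]


def refine(products, **selected):
    """Keep only products whose fields match every selected facet value."""
    return [
        p for p in products
        if all(p.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(products, facet):
    """Count how many products carry each value of one facet."""
    counts = {}
    for p in products:
        counts[p[facet]] = counts.get(p[facet], 0) + 1
    return counts


boots = refine(PRODUCTS, Type="Boots")
print([p["name"] for p in boots])
print(facet_counts(boots, "Color"))
```

Recomputing the counts after each selection is what lets the interface show only refinements that still return results.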
Learning Objectives:
▪ Demonstrate knowledge of multiple taxonomy development methods.
▪ Demonstrate the ability to choose the appropriate taxonomy development
method for use in development of an information product.
▪ Demonstrate knowledge of common taxonomy facets.
▪ Demonstrate the ability to identify specialized facets for use in an information
product.
▪ Demonstrate the ability to map the facets to the appropriate elements in a Dublin
Core-based metadata specification.
2. TAXONOMY DEVELOPMENT PROCESS
Taxonomy development methods
Method
Description
Automated analysis
Munge, blast, crunch text to analyze
corpus.
Workshopping
Guide group in activities to identify
key concepts.
Strawman
Prepare best guess, then bring it to
the table to discuss.
Adapt Existing Vocabularies
Customize internal terminology,
industry standards, etc.
Hybrid
Combination of some or all of these
methods.
Key components to a successful taxonomy project
• Set-up taxonomy team
• Identify business case
• Planning & research
• Interview stakeholders
• Define use cases
• Build high-level taxonomy
• Build-out taxonomy detail
• Validation testing & review
• Migrate content
• Maintain & evolve taxonomy
Define business case: Business case examples
• Improve search and browsing to reduce the amount of time
employees spend looking for information.
• Reduce business silos, foster collaboration and content reuse,
and thereby reduce redundant work.
• Reduce the amount of time employees spend e-mailing basic
information to each other.
• Build confidence that employees are getting the most up to date
information, and increase employee loyalty by helping them stay
“up to date” on the company.
Research & planning
• Identify target content to be focused on.
– Provide a list of websites (and/or other target content file stores)
– Prioritize this list for the purposes of the taxonomy project.
• Gather any query logs, usage statistics and usability surveys.
• Collect any existing documentation related to audience
personas, content organization, metadata, keywords, and any
other guidelines or standards.
• Identify and gather any internal classifications (org charts, sales
regions, records retention schedule, code of conduct, product
lists, etc.); and any relevant industry standard classifications
(UNSPSC, NAICS, USPS, regulated activities, etc.)
Interview stakeholders
• Recruit people from business-critical functions such as
marketing, public relations, product marketing, legal, etc.
– Include people who have credibility, are early adopters, hold large
amounts of content, and are “squeaky wheels” or “fans.”
• Conduct 10-20 interviews.
• The goal is for stakeholders to be the review board during the
taxonomy development process, and beyond.
Define use cases: Intranet examples
• Content related to business areas or facilities
– By geographic location, by type, by specific facility, by access
restrictions, by audience, etc.
▪ Use Case: Create a safety policies and procedures website for facilities organized by
State.
▪ Use Scenario: Find all safety policies and procedures related to facilities located in
Ohio.
• Company-wide content
– By business function, by topic, by access rights, etc.
▪ Use Case: Locate any content that has policies and procedures around a particular
topic.
▪ Use Scenario: A policy regarding smoking company-wide has changed and references
to outdated policies should be removed. Find official policies, as well as newsletters
related to the smoking policy company-wide.
Define use cases: .com examples
• Web content managers
– By content type, by topic, by location, etc.
▪ Use Case: Find and recall all public-facing pages that describe a specific safety tip.
▪ Use Scenario: Find and recall all public-facing pages that discuss gas safety.
• Public users seeking information
– By topic, by location, etc.
▪ Use Case: Provide search for dividend schedules, earnings statements and stock splits;
and the corresponding press releases for a specific time period.
▪ Use Scenario: An investor who recently sold stock is preparing taxes and would like to
do a concise search so that they can find historical information about their holdings.
Build high-level taxonomy
• Identify the types of actors
– Audiences, roles & access rights
• Identify the types of content
• Identify the types of activities
– Business processes, applications & uses
• Identify the types of named entities
– Products, services, projects, organizations, locations, etc.
• Topics will be everything else.
→ A business taxonomy should have no more than 6-10 broad
divisions.
Build high-level taxonomy: Oracle.com top-level taxonomy
Person
Organization
Location
Content Type
Audience
Products
“Is a” groups of Products: Product Line, Technology, Application, Industry Solution
→ The Oracle.com taxonomy has no explicit
topics, only actors, content types, and
named entities.
Build high-level taxonomy: SGMS top-level taxonomy
Topics
→ The SGMS (Singapore Government
Metadata Standard) Taxonomy is much
more focused on Topics.
Build-out taxonomy detail
• Get agreement on the broad divisions first, then build-out the
detailed taxonomy.
• Use existing terminologies whenever they are available for
business functions, locations, products & services, etc.
• Only build a vocabulary when no alternative authoritative source
exists.
• Only create categories for which there already is content, or for which
there is likely to be content soon.
• Keep the taxonomy broad and shallow.
– Roll-up more specific terms into broader categories
→ A business taxonomy should have no more than 1,200
categories.
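The two size guidelines in this section (no more than 6-10 broad divisions, no more than 1,200 categories) are easy to check mechanically once a draft exists. The sketch below assumes the draft is held as a parent-to-children dictionary; that representation, the function name size_report, and the sample categories are all hypothetical.

```python
def size_report(children, max_divisions=10, max_categories=1200):
    """Count broad divisions and total categories in a parent -> children map."""
    all_children = {c for kids in children.values() for c in kids}
    categories = set(children) | all_children
    top_level = set(children) - all_children  # parents that are nobody's child
    return {
        "top_level": len(top_level),
        "total": len(categories),
        "too_many_divisions": len(top_level) > max_divisions,
        "too_many_categories": len(categories) > max_categories,
    }


# Invented draft taxonomy for illustration.
draft = {
    "Audience": ["Citizen", "Business"],
    "Location": ["Ohio"],
    "Topic": [],
}
print(size_report(draft))
```

Run against each draft, a report like this gives the review board an early warning that specific terms should be rolled up into broader categories.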
Build out taxonomy detail: NASA Taxonomy
http://nasataxonomy.jpl.nasa.gov/
Validation testing and review
Method: Walkthrough
Process: Show & explain
Who: Taxonomist, SME, Team
Requires: Rough taxonomy
Validation: Approach; Appropriateness to task
Method: Walkthrough
Process: Check conformance to editorial rules
Who: Taxonomist
Requires: Draft taxonomy; Editorial Rules
Validation: Consistent look and feel
Method: Usability Testing
Process: Contextual analysis (card sorting, scenario testing, etc.)
Who: Users
Requires: Rough taxonomy; Tasks & Answers
Validation: Tasks are completed successfully; Time to complete task is reduced
Method: User Satisfaction
Process: Survey
Who: Users
Requires: Rough Taxonomy; UI Mockup; Search prototype
Validation: Reaction to taxonomy; Reaction to new interface; Reaction to search results
Method: Tagging Samples
Process: Tag sample content with taxonomy
Who: Taxonomist, Team, Indexers
Requires: Sample content; Rough taxonomy (or better)
Validation: Content ‘fit’; Fills out content inventory; Training materials for people & algorithms; Basis for quantitative methods
Migrate content
• Prioritize content to be tagged
– Identify and dispose of ROT.
• Use business rules to automate content tagging
– Tag landing pages for major sections.
– Lower-level pages inherit tags from top-level pages.
• Use workflow to enforce tagging
– Require entry of simple tagging in order to submit an item into the
content management system.
• Use templates to guide user tagging
– Pre-populate template fields whenever possible.
– Use context-sensitive pick lists.
– Call-out to taxonomy service for more complex controlled vocabularies.
• Provide tagging incentives
– Almost instantaneous feedback.
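The business rule "lower-level pages inherit tags from top-level pages" can be sketched as a walk up the URL path. This is an illustration only: the paths, tag strings, and the names LANDING_TAGS and inherited_tags are assumptions, and a real content management system would apply the rule through its own configuration.

```python
# Hypothetical site: tags are assigned only to landing pages.
LANDING_TAGS = {
    "/safety": {"Topic: Safety"},
    "/safety/gas": {"Topic: Gas safety"},
}


def inherited_tags(path):
    """Collect tags from every landing page at or above `path`."""
    tags = set()
    parts = path.strip("/").split("/")
    for i in range(1, len(parts) + 1):
        prefix = "/" + "/".join(parts[:i])
        tags |= LANDING_TAGS.get(prefix, set())
    return tags


print(sorted(inherited_tags("/safety/gas/tips.html")))
```

Only the landing pages need manual tagging here; everything beneath them is tagged by rule, which is the point of automating before asking people to tag.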
Maintain and evolve taxonomy
• Taxonomy building is iterative.
– A taxonomy should be improved over time and maintained.
• Designate a taxonomy editor as the single point-of-contact for
taxonomy changes.
• Log change requests and notify requestors.
• Prioritize taxonomy changes, e.g.
▪ Improves information access, use and reuse.
▪ Requires creating new data or metadata.
▪ Affects program operations or has a financial impact.
▪ Enables communication campaigns or organizational strategy.
▪ Positive impact on users.
Licensing an existing taxonomy
• See Factiva’s taxonomy www.taxonomywarehouse.com
– There are usually license fees, but these will be less than the effort to
develop an equivalent taxonomy.
– But pre-existing taxonomies rarely fit an organization’s needs and may
require extensive customization.
• Recommendation
– Adopt a faceted approach.
– Reuse existing (especially internal) vocabularies for as many of the
facets as possible.
– Plan on doing full-custom “Content Type” and “Topic” taxonomies.
Free sources for 8 common taxonomies
Organization
Definition: Organizational structure.
Potential Sources: SP 800-87, U.S. Government Manual, Your organizational
structure, etc.
Content Type
Definition: Structured list of the various types of content being managed or used.
Potential Sources: Dublin Core Type Vocabulary, AGLS Document Type, Your records
management policy, etc.
Industry
Definition: Broad market categories such as lines of business, life events, or
industry codes.
Potential Sources: SIC, NAICS, Your market segments, etc.
Location
Definition: Place of operations or constituencies.
Potential Sources: FIPS 5-2, FIPS 55-3, ISO 3166, UN Statistics Div, US Postal Service,
Your sales regions, etc.
Business Activity
Definition: Business activities or functions performed to accomplish mission
and goals.
Potential Sources: Federal Enterprise Architecture Business Reference Model,
Enterprise ontology, Your business functions, etc.
Topic
Definition: Business topics relevant to your mission & goals.
Potential Sources: Federal Register Thesaurus, NAL Agricultural Thesaurus, Your
research areas, etc.
Audience
Definition: Subset of constituents to whom a piece of content is directed or is
intended to be used by.
Potential Sources: GEM, ERIC Thesaurus, IEEE LOM, Your psycho-graphics or personas,
etc.
Products & Services
Definition: Names of products/programs and services.
Potential Sources: ERP system, Your products and services, etc.
Learning Objectives:
▪ Demonstrate the ability to identify appropriate taxonomy sources for use in
development of an information product.
▪ Demonstrate the ability to define and populate a small taxonomy with 3-5
facets using MultiTes.
▪ Demonstrate the ability to design the validation methods for a taxonomy.
3. TAXONOMY CONSTRUCTION TOOLS
Tools
• Taxonomy editing
– Data Harmony, MultiTes, Protégé, Synaptica, SchemaLogic, Wordmap
• Metadata tagging (automated categorization)
– CIS, ConceptSearching, Data Harmony, MetaTagger, nStein, Smartlogic,
temis
• Content management
– Documentum, Drupal, FatWire, Interwoven, Joomla!, OpenText,
SharePoint
Taxonomy Editing Tools
Vendor / Tool | URL
Cuadra STAR/Thesaurus | www.cuadra.com/products/thesaurus.html
Thesaurus Master | www.dataharmony.com/products/tm.htm
Autonomy Interwoven MetaTagger | http://www.interwoven.com/components/pagenext.jsp?topic=PRODUCT::METATAGGER
Business Objects Tools for Advanced Visualization | http://www.sap.com/solutions/sapbusinessobjects/large/businessintelligence/dashboard-visualization/advancedvisualization/index.epx
MS Excel | www.microsoft.com
Intelligent Topic Manager | www.mondeca.com
MultiTes Pro | www.multites.com
Taxonomy/Authority File Manager | www.nstein.com/epub/ncm-taxonomy.asp
Protégé | http://protege.stanford.edu/
SchemaServer | www.schemalogic.com
Semaphore | www.smartlogic.com
Synaptica | www.synapticasoftware.com
SAS Ontology Management | http://www.sas.com/text-analytics/ontologymanagement/index.html
Luxid for Content Enrich | www.temis.com
Term Tree | www.termtree.com.au
Enterprise Vocab Server | www.webchoir.com/products/wvs.html
Designer | www.wordmap.com
Advanced
Midrange
Basic
Normal taxonomy editor functionality requirements
• Standard and Custom Fields
• Standard and Custom Relations
• Data Typing and Restrictions
• Consistency Enforcement
• Flexible Reporting
• Flexible Importing?
• UNICODE
• Multiple Vocabulary Support
• Inter-Vocabulary Relations
• Unique IDs: externally supplied IDs are not sufficient
• Workflow
• Voting
• Change Request Mgmt.
• Stylistic rules enforcement
• Programmability
[Screenshot labels: Term Editing, Hierarchy Browser]
Additional functionality for taxonomy editing software:
▪ Aliases – Need to deal with synonyms, but also with alternative labels based on
language or other factors.
▪ Notes – Useful to have several types of notes fields to keep public notes
separate from team’s working notes.
▪ Effective dates – Enable the determination of what was the ‘valid’ taxonomy on
dates in the past. Part of a set of strong requirements on provenance.
▪ Inter-category relations – Must be able to provide links that don’t follow
hierarchy, and even go between vocabularies.
▪ Poly-hierarchy – Mid-range tools should deal with this.
▪ Rules checking – Check conformance to style rules like length, use of &, etc.
▪ Workflow – Tracking the handling of change requests, as well as the process of
getting approvals for edits.
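Rules checking of the kind listed above can be approximated with a few string tests per label. In this sketch the specific rules are assumptions chosen for illustration (a 40-character limit, and flagging "&" in favor of "and"); an actual editorial style guide would set its own, and check_term is a hypothetical name.

```python
def check_term(label, max_length=40):
    """Return a list of style problems found in one term label."""
    problems = []
    if len(label) > max_length:
        problems.append(f"longer than {max_length} characters")
    if "&" in label:
        # Assumed house rule: spell out 'and'. A style guide could require the opposite.
        problems.append("uses '&' instead of 'and'")
    if label != label.strip():
        problems.append("leading or trailing whitespace")
    return problems


for label in ["Labor", "Health & safety"]:
    print(label, check_term(label))
```

Running such checks on import or on save is one way an editor can enforce stylistic rules consistently rather than relying on manual review.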
Sample taxonomy editor: Data Harmony
[Screenshot labels: Hierarchy Browser, Standard Term Info]
Taxonomy editing tools vendors
An immature area: no vendors are in the upper-right quadrant!
[Chart: Ability to Execute (low to high) vs. Completeness of Vision; quadrants labeled Niche Players and Visionaries]
• Most popular taxonomy editor is MS Excel
• High functionality/high cost products (~$100K)
• MultiTes is widely used, cheap with functionality
MultiTes Taxonomy Tool
• Z39.19 compatible taxonomy editor
• Self-study: http://www.multites.com/lessons.htm
– Getting Started with MultiTes Pro
– Navigating your thesaurus
– Importing data from text files
– Working with Subject Categories
– Working with Multilingual Thesauri
MultiTes: Formatting an import file
Recommendation: Use a text editor (Notepad)
Subject Taxonomy
Arithmetic
  Operations
    Addition
    Subtraction
    Multiplication
    Division
    Roots
    Factorials
    Factoring
    Properties of Operations
    Estimation
  Fractions
  Decimals
  Comparison of numbers
  Exponents
MultiTes Import Format
Arithmetic
Operations
BT: Arithmetic
Addition
BT: Operations
Subtraction
BT: Operations
Multiplication
BT: Operations
Division
BT: Operations
Roots
BT: Operations
Factorials
BT: Operations
Factoring
BT: Operations
Properties of Operations
BT: Operations
Estimation
BT: Operations
Fractions
BT: Arithmetic
Decimals
BT: Arithmetic
Comparison of numbers
BT: Arithmetic
Exponents
BT: Arithmetic
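Typing the BT lines by hand is error-prone for larger taxonomies, so the import file can be generated from an indented outline. The sketch below only reproduces the term-then-"BT:" layout shown on this slide; the function name outline_to_import and the two-spaces-per-level indentation are assumptions, and the exact layout MultiTes expects should be confirmed against its own import lesson.

```python
def outline_to_import(outline, indent="  "):
    """Turn an indented outline into term lines followed by 'BT: parent' lines."""
    lines = []
    stack = []  # stack[d] holds the most recent term seen at depth d
    for raw in outline.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        depth = (len(raw) - len(raw.lstrip(" "))) // len(indent)
        term = raw.strip()
        del stack[depth:]  # drop terms at this depth or deeper
        lines.append(term)
        if stack:  # a top term has no broader term
            lines.append(f"BT: {stack[-1]}")
        stack.append(term)
    return "\n".join(lines)


sample = "Arithmetic\n  Operations\n    Addition\n  Fractions"
print(outline_to_import(sample))
```

The output can be saved from a plain text editor, in line with the recommendation above, and then loaded through File > Import.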
MultiTes: Create a new taxonomy, then Import a file
• File > New
• Navigate to destination directory, then enter filename
• Click Continue button in New Thesaurus pop-up
• File > Import
• Navigate to target file, then click Open button
MultiTes: Imported taxonomy
MultiTes: Hierarchy report
• Reports > Top term
– Not Hierarchical
• In Select Term range tab, click on Print/Export button
– Default should be set to Output to: Screen
MultiTes: Hierarchy report
MultiTes: Alphabetical report
• Reports > Alphabetical report
• Click on Print/Export button
MultiTes exercise
• Format a small taxonomy (10-20 terms, 2-3 levels deep)
• Import it into MultiTes.
• Generate hierarchy (TopTerm) and alphabetical reports.
Questions?
Joseph A. Busch, + 415-377-7912, [email protected]
http://www.ppc.com