Taxonomies and Metadata
for Content Management
Michael Huff
Information Resource Officer
U.S. Department of State
E-Government Act of 2002
• The use of computers and the Internet is rapidly transforming
societal interactions and the relationships among citizens, private
businesses, and the Government.
• The Federal Government has had uneven success in applying
advances in information technology to enhance governmental
functions and services, achieve more efficient performance,
increase access to Government information, and increase citizen
participation in Government.
• Most Internet-based services of the Federal Government are
developed and presented separately, according to the jurisdictional
boundaries of an individual department or agency, rather than
being integrated cooperatively according to function or topic.
Which U.S. Government organizations are
experienced in using metadata & taxonomy tools?
– Defense Intelligence Agency
– USDA Economic Research Service (ERS)
– Federal Aviation Administration
– FirstGov
– NASA
– Small Business Administration
– Social Security Administration
– Department of State
Term | Definition
Metadata | Data about data - a label that describes a content object so unstructured content can be managed like structured content.
Taxonomy | The specification and classification of the names of people, places, things, and everything else that is needed to allow search engines and other content applications to work better.
Facet Classification | Discrete set of elements (or fields) for labeling content and content components.
Controlled Vocabulary | A managed set of terms for which there is an agreed-upon value or definition.
Metadata
Field | Data Type / Source
Title | string
Creator | string
Identifier | URL
Date | date
Subject | Taxonomy (~10,000 categories)
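The field list above could be written down as a typed record. The following is a minimal sketch, not a prescribed schema: the class name `MetadataRecord`, the sample subject terms, and the small `SUBJECT_TAXONOMY` set (standing in for the ~10,000-category taxonomy) are all assumptions made for illustration.

```python
import datetime
from dataclasses import dataclass

# Hypothetical stand-in for the ~10,000-category subject taxonomy.
SUBJECT_TAXONOMY = {"Visas", "Passports", "Travel Advisories"}


@dataclass
class MetadataRecord:
    title: str            # string
    creator: str          # string
    identifier: str       # URL
    date: datetime.date   # date
    subject: str          # a term drawn from the taxonomy

    def __post_init__(self):
        # Reject subject values that are not controlled terms.
        if self.subject not in SUBJECT_TAXONOMY:
            raise ValueError(f"unknown subject term: {self.subject!r}")


record = MetadataRecord(
    title="Visa FAQ",
    creator="Jane Doe",
    identifier="http://example.gov/visa-faq",
    date=datetime.date(2004, 1, 15),
    subject="Visas",
)
```

Only the Subject field is checked against a controlled list here, which mirrors the table: the other fields are free strings, a URL, and a date.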
Why use metadata?
• Adding metadata to unstructured content
allows it to be managed like structured content.
• Enriching content with structured metadata is
critical for supporting search and personalized
content delivery.
• Content that has been adequately tagged with
metadata can be leveraged in usage tracking,
personalization and improved searching.
Where does metadata fit in the
information system architecture?
User experience. How content is presented and how users
experience and interact with it dictates its perceived and
actual value.
Content architecture. Scalable metadata framework to
enable content reuse, and handle changes in organizational
goals, user needs, and retrieval concerns.
Tools and technology. The information supply-chain
platform that enables workflows, and supports organizational
and operational concerns.
Content architecture defined
• Scalable metadata framework to enable content reuse, and
handle changes in organizational goals, user needs, and
retrieval concerns.
• Business objectives
• Content inventory
• Metadata specification
• Content model
• Vocabulary specification
• Training
• Rules & procedures
[Figure: sample content model showing eight facets with their top-level values]
Content Types: Web Content, Images, Code, Rich Media, Document Type
Organization: Internal, External
Audience: Internal, External
Location: International, US
Function / Process: Business Processes, Operating / Supporting Processes
Market: Lines of Business, Asset Type, LifePhases
Product / Services: Card, Loans, Insurance, Savings Products, Product Attributes
Topics: Contracts, Credits, Credit Line Management, Fee & Charges, Finance, Financial Institutions, Financial Instruments, Management, Market Strategy, Marketing, Mass Media, Public Relations, Purchasing, Rates, Rates and Rankings, Ratios, Research, Risk, Settlement and Damages, Statistics
What is Dublin Core?
• Dublin Core is the metadata standard for describing
Internet resources so they are easy to find.
1995: Original workshop held in Dublin, Ohio.
2003: Dublin Core approved as ISO 15836.
2004: Shanghai meeting.
For more information: http://www.dublincore.org
Why is metadata important?
[Figure: metadata categories plotted by Complexity against Enabled Functionality]
Asset metadata – Who, Where & When: Title, Creator, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language
Subject metadata – What & Why: Subject, Description, Coverage
Use metadata – How can it be used: Rights & Permissions
Relational metadata – Links between and to: Relation
Enabled functionality: better navigation & discovery, more efficient editorial process
http://dublincore.org/documents/dcmi-terms/
What is a taxonomy?
The specification of the names of people, places, things … and everything else that is needed
to allow search engines and other content applications to work better.
Linnaeus …
Kingdom: Animalia
Phylum: Chordata
Class: Mammalia
Order: Carnivora
Family: Canidae
Genus: Canis
Species: C. familiaris
UNSPSC …
Segment: 44-Office Equipment and Accessories and Supplies
Family: .12-Office Supplies
Class: .17-Writing Instruments
Commodity: .05-Mechanical pencils, .06-Wooden pencils, .07-Colored pencils
Sample Recipe Taxonomy
Facet Categories | Controlled Vocabularies
Main Ingredients | Chocolate, Dairy, Fruits, Grains, Meat & Seafood, Nuts, Olives, Pasta, Spices & Seasonings, Vegetables
Meal Type | Breakfast, Brunch, Lunch, Supper, Dinner, Snack
Cuisines | African, American, Asian, Caribbean, Continental, Eclectic/Fusion/International, Jewish, Latin American, Mediterranean, Middle Eastern, Vegetarian
Courses | Appetizers, Beverages, Breads, Cheese, Cocktails, Desserts, Fish & Shellfish, Fruit, Hors d'Oeuvres, Meat, Pasta, Salad, Sandwiches, Soup, Vegetables
Cooking Methods | Advanced, Bake, Broil, Fry, Grill, Marinade, Microwave, No Cooking, Poach, Quick, Roast, Sauté, Slow Cooking, Steam, Stir-fry
The power of taxonomy facets
• 4 independent
categories of 10 nodes
each have the same
discriminatory power
as one hierarchy of
10,000 nodes (10^4)
• Easier to maintain
• Can be easier to
navigate
[Figure: four facets from the sample recipe taxonomy: Main Ingredients, Meal Type, Cuisines, Cooking Methods]
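The arithmetic behind the claim about facets can be checked directly. This is a small sketch under the slide's own assumption of four independent facets with ten nodes each; the function name `facet_power` is made up for the example.

```python
from math import prod


def facet_power(facet_sizes):
    """Return (distinct combinations, nodes to maintain) for independent facets."""
    return prod(facet_sizes), sum(facet_sizes)


combinations, nodes_to_maintain = facet_power([10, 10, 10, 10])
print(combinations, nodes_to_maintain)  # 10000 40
```

Four facets distinguish 10 x 10 x 10 x 10 = 10,000 combinations while only 40 nodes need to be maintained, which is why the faceted approach is described as easier to maintain.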
7 Common taxonomy facets
Personalized content delivery requires defining taxonomy facets
Facet | Definition | Example Source
Products and Services | Names of products and services. | ERP system, your products and services, etc.
Organization | Organizational structure. | FIPS 95-2, your organizational structure, etc.
Content Type | Structured list of the various types of content being managed or used. | AGLS Document Type, AAT Information Forms, records management policy, etc.
Industry | Broad market categories such as lines of business, life events, or industry codes. | FIPS 66, SIC, NAICS, etc.
Location | Place of operations or constituencies. | FIPS 5-2, FIPS 55-3, ISO 3166, US Postal Service, etc.
Function | Functions and processes performed to accomplish mission and goals. | FEA Business Reference Model, Enterprise Ontology, AAT Functions, etc.
Audience | Subset of constituents to whom a piece of content is directed or intended to be used. | GEM, ERIC Thesaurus, IEEE LOM, etc.
Topic | Business topics relevant to your mission and goals. | Federal Register Thesaurus, ERIC Thesaurus, ProQuest, etc.
… and re-use of existing vocabulary sources
Applying the facets to the Dublin Core metadata elements
Dublin Core Elements | Definition | Vocabulary Source
Title | Resource name. | Not applicable
Creator | Content maker. | LDAP
Subject | Content topic. | Keyword Topic facet
Description | Description of content, summary. | Not applicable
Publisher | Publisher of this manifestation. | Agency facet
Contributor | Content contributor. | LDAP
Date | Content lifecycle event for this manifestation. | Not applicable
Type | Genre. | Form Type facet
Format | Format of this manifestation. | RFC 2045
Identifier | Reference for this manifestation, e.g., URL. | Not applicable
Source | Source from which this manifestation has been derived. | Not applicable
Language | Language of this manifestation. | ISO 639
Relation | Reference to related resource. | None
Coverage | Space, period, date, jurisdiction, etc. | Jurisdiction facet
Rights | Who has rights to use this manifestation. | Privacy level
[Figure: High-Level Taxonomy. Top-level facets: Content types, Organization, Audiences, Topics, Functionality and process, Locations, Markets, Product and services]
Applied taxonomy metadata
facilitates a multi-faceted view
of content
Facets at work on FirstGov site
Frequency
Organization
Audience
Content Type
http://www.firstgov.gov
Guided Navigation
2-3 clicks to product
No dead ends
http://www.tesco.com/winestore
http://www.towerrecords.com
http://www.fortunoff.com
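One way to get the "no dead ends" behavior of guided navigation is to offer only those refinements that still have matching items. The sketch below is an illustration of that idea, not a description of how any of the sites above are built; the catalog entries, facet names, and function names are all hypothetical.

```python
from collections import Counter

# Hypothetical catalog entries tagged with facet values.
WINES = [
    {"name": "A", "Colour": "Red", "Country": "France"},
    {"name": "B", "Colour": "Red", "Country": "Chile"},
    {"name": "C", "Colour": "White", "Country": "France"},
]


def narrow(items, selections):
    """Keep items that match every facet value chosen so far."""
    return [i for i in items if all(i.get(f) == v for f, v in selections.items())]


def refinements(items, facet):
    """Count matching items per value of `facet`; zero-hit values never appear."""
    return dict(Counter(i[facet] for i in items if facet in i))


reds = narrow(WINES, {"Colour": "Red"})
print(refinements(reds, "Country"))  # {'France': 1, 'Chile': 1}
```

Because the counts are computed from the already narrowed result set, a user who has chosen White is only ever offered France, so every click leads to at least one product.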
Seven practical rules for taxonomies
1. Incremental, extensible process that identifies and
enables owners, and engages stakeholders.
2. Quick implementation that provides measurable
results as quickly as possible.
3. Not monolithic: has separately maintainable facets.
4. Re-uses existing IP as much as possible.
5. A means to an end, and not the end in itself.
6. Not perfect, but it does the job it is supposed to
do, such as improving search and navigation.
7. Improved over time, and maintained.
What is the general purpose of the content you are managing?
What types of content are you handling?
Who is the audience for this content?
What are the core organizational objectives that the content is
related to?
• Creating a taxonomy is
only part of the job
• How will it be put to use?
• In a new application, or by
modifying an existing
application?
• What’s the effort around
that?
• Additional Issues
• Tagging – Who will add the
metadata and how?
[Figure: example application features driven by the taxonomy]
• Browse by Topic
• Link to Bios from Personal Names
• Link to company data (quotes, news, ...) from Company names
• Link to info on Countries
• Alerts on People, Companies, and Topics
1 Identify Objectives: Conduct interviews
2 Inventory Content: ID sources, spider assets & extract metadata
3 Specify Metadata: Define fields & purpose
4 Model Content: Define content chunks & XML DTDs
5 Specify Vocabularies: Compile controlled vocabularies
6 Specify Procedures: Develop workflow, rules & procedures
7 Train Staff: Develop materials & train staff
Task 1 – Identify objectives
What do you do? What kinds of digital
assets are being produced? For
what audiences?
What is the business process for
submitting, selecting, editing,
maintaining digital assets?
How many digital assets are there?
How fast is this growing?
Are there particular industry or other
standards that are important?
What types of assets are hard to search
for (that should be easier to find)?
What tools would be helpful in locating
assets? Acronyms? Abbreviations?
Nicknames? Glossary? Thesaurus?
Taxonomy?
Who else should we be talking to?
Task 2 – Inventory content
1. Identify target asset file path/URL.
2. Automatically generate
inventory metadata by crawling
file stores.
3. Audit assets using inventory.
4. Enhance metadata with new facets.
[Figure: inventory worksheet with columns for Path/URL, spider-generated metadata, audit process, and new facets]
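Step 2, generating inventory metadata by crawling file stores, might look like the sketch below for a local directory tree. It uses only the Python standard library; the function name `inventory`, the row keys, and the choice of `audience` and `topic` as the new facets are assumptions for illustration.

```python
import os
from datetime import datetime, timezone


def inventory(root):
    """Walk `root` and return one inventory row per file found."""
    rows = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
            rows.append({
                # Spider-generated metadata
                "path": path,
                "format": os.path.splitext(name)[1].lstrip(".").lower(),
                "size_bytes": info.st_size,
                "modified": modified.date().isoformat(),
                # New facets, left empty until the audit (step 4)
                "audience": None,
                "topic": None,
            })
    return rows
```

The rows can then be exported to a spreadsheet for the audit, where reviewers fill in the empty facet columns.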
Task 3 – Specify metadata
Element | Data Type | Length | Req. / Repeat | Source | Purpose
Identifier | String | 48 chars | 1 | System supplied | Basic accountability
Author | String | Variable | * | LDAP validated | Credits
Title | String | Variable | ? | User | Text search, results display
Embargo Date | Date | Fixed | ? | System | Obey rights
Description | String | Variable | ? | User | Text search, results display
Asset Type | List | Fixed | 1 | Asset Types vocabulary | Browse or group search results
Subject
Audience | List | Fixed | * | Audience vocabulary | Custom interface for group of users
Location | List | Fixed | * | ISO 3166 | Filter or rank search results
Organization | List | Fixed | * | Organization vocabulary | Key index to retrieve & aggregate assets
...
Legend:
? – 1 or more
* - 0 or more
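A metadata specification like this one can be enforced in code. The following is a simplified sketch: the `SPEC` entries, their required/repeatable flags, and the two-term Audience vocabulary are assumptions chosen for the example and are not a transcription of the table above.

```python
# Hypothetical, simplified spec:
# element -> (required, repeatable, allowed values or None for free text)
SPEC = {
    "Identifier": (True, False, None),
    "Title": (True, False, None),
    "Author": (False, True, None),
    "Audience": (False, True, {"Internal", "External"}),
}


def validate(record):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for element, (required, repeatable, vocabulary) in SPEC.items():
        values = record.get(element, [])
        if required and not values:
            problems.append(f"{element}: required but missing")
        if not repeatable and len(values) > 1:
            problems.append(f"{element}: not repeatable")
        if vocabulary is not None:
            for value in values:
                if value not in vocabulary:
                    problems.append(f"{element}: {value!r} not in vocabulary")
    for element in record:
        if element not in SPEC:
            problems.append(f"{element}: not in specification")
    return problems


good = {"Identifier": ["a1"], "Title": ["Fee schedule"], "Audience": ["Internal"]}
bad = {"Title": ["A", "B"], "Audience": ["Martians"]}
print(validate(good))  # []
print(len(validate(bad)))  # 3
```

Every element is stored as a list of values, so the same check covers single-valued and repeatable elements.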
Task 4 – Model content
Factor asset types from inventory into canonical types.
Select examples from inventory (possibly with spider).
Identify useful chunks for each asset type.
Factor chunks into element superset.
Identify relationships between chunks.
Iterate until there is agreement on asset types, elements, and relationships.
[Figure: page layout divided into Header area, Left navigation area, Main content area, and Footer area]
Task 5 – Specify vocabularies
Develop broad taxonomy outline (1-3 levels deep)
Review, revise, and approve
taxonomy outline with
stakeholders and subject matter
experts.
Fill in taxonomy outline
Tag random samples from content
inventory
Review, revise, and approve draft
taxonomy with stakeholders and
subject matter experts.
Task 6 – Specify procedures
Develop taxonomy style rules, and
ensure that the taxonomy follows
them.
Develop tagging rules and
procedures, along with software
to assist in the task.
Specify taxonomy maintenance
process and the update
procedures to follow.
Task 6 – Governance & Maintenance
The taxonomy must be changed over time.
Suggestions for changes can come from users, through query log analysis, and from staff, through a feedback form.
Governance structure needed to make sure changes are justified.
[Figure: components shown are Firewall, Application UI, Application Logic, Content, Tagging UI, Tagging Logic, Taxonomy, End User (query log analysis), Tagging Staff (staff notes 'missing' concepts), Taxonomy Editor, and Steering Committee]
Recommendations by Editor
1 Small taxonomy changes (labels, synonyms)
2 Large taxonomy changes (retagging, application changes)
3 New 'best bets' content
Committee considerations
1 Business Goals
2 Change in user experience
3 Retagging cost
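Query log analysis for 'missing' concepts can start very simply: count the searches that do not match any taxonomy term. The sketch below assumes a plain list of query strings and exact, case-insensitive matching; the function name, the `min_count` parameter, and the sample data are hypothetical.

```python
from collections import Counter


def missing_concepts(queries, taxonomy_terms, min_count=2):
    """Return (query, count) pairs for frequent queries with no taxonomy term."""
    known = {term.lower() for term in taxonomy_terms}
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    return [
        (query, n)
        for query, n in counts.most_common()
        if query not in known and n >= min_count
    ]


log = ["visa renewal", "Visa Renewal", "passports",
       "embassy hours", "embassy hours", "embassy hours"]
print(missing_concepts(log, ["Passports", "Visas"]))
# [('embassy hours', 3), ('visa renewal', 2)]
```

A list like this gives the taxonomy editor concrete candidates to bring to the steering committee, which can then weigh them against retagging cost.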
Task 6 – Steering Committee Roles
Business Lead
• Keeps committee on track with larger business objectives
• Balances cost/benefit issues to decide appropriate levels of effort
• Specialists help in estimating costs
• Obtains needed resources if those in committee can’t accomplish a particular task
Technical Specialist
• Estimates costs of proposed changes in terms of amount of data to be retagged, additional storage and processing burden, software changes, etc.
• Helps obtain data from various systems
Content Specialist
• Committee’s liaison to content creators
• Estimates costs of proposed changes in terms of editorial process changes, additional or reduced workload, etc.
Taxonomy Specialist
• Suggests potential taxonomy changes based on analysis of query logs, indexer feedback
• Makes edits to taxonomy, installs into system with aid of IT specialist
Content Owner
• Reality check on process change suggestions
Task 7 – Train staff
Staff will require training on
• The UI they use to tag the content
• The rules to follow when deciding what codes to apply
• The end-effect of the codes they apply
• The structure of the taxonomy
Tagging examples come from the content inventory
Hardcopies of the taxonomy, and yellow highlighters, are helpful during training
Indexing rules
Rule | Description
Specificity rule | Apply the most specific terms when tagging assets. Specific terms can always be generalized, but generic terms cannot be specialized.
Repeatable rule | All attributes should be repeatable. Use as many terms as necessary to describe What the asset is about and Why it is important. Storage is cheap. Re-creating content is expensive.
Appropriateness rule | Not all attributes apply to all assets. Only supply values for attributes that make sense.
Usability rule | Anticipate how the asset will be searched for in the future, and how to make it easy to find it. Remember that search engines can only operate on explicit information.
[Figure: Indexing UI]
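The specificity rule works because a specific term can be rolled up to its broader terms automatically, while the reverse is not possible. A minimal sketch, assuming a hypothetical `PARENT` map of child-to-parent links for a Location facet:

```python
# Hypothetical child -> parent links for a Location facet.
PARENT = {
    "Texas": "US",
    "Virginia": "US",
    "US": "Location",
    "Asia": "International",
    "International": "Location",
}


def generalize(term):
    """Return the term followed by all of its broader terms."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


print(generalize("Texas"))  # ['Texas', 'US', 'Location']
```

An asset tagged "Texas" can be found by a search on "US", but an asset tagged only "US" carries no information about which state it concerns.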
What about Automatic Categorization?
• Automatic vs. Manual Categorization is a
cost/benefit tradeoff
– Semi-automated recommended over pure
manual in production situations.
– Automatic performance not bad, but not equal to
trained manual tagging.
• Software is not sane, so errors look crazy.
– Large backlogs of content can’t justify
investment in high-quality manual tagging
• Old articles rarely accessed.
• Recommend automated bulk tagging with
error reporting and correction process.
What about automatically-created taxonomies?
Typically a single hierarchy with no overall plan
Results hard for people to navigate
What about automatic categorization?
Accuracy close to human levels, but errors are very
different
Cost/benefit tradeoff
Semi-automation is best practice
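Semi-automation can be pictured as software proposing tags and people confirming the uncertain ones. The sketch below uses naive keyword counting purely as a stand-in for a real categorizer; the function name, the `auto_threshold` parameter, and the keyword lists are hypothetical.

```python
import re


def suggest_tags(text, vocabulary, auto_threshold=2):
    """Split candidate tags into (auto-applied, needs human review).

    `vocabulary` maps each taxonomy term to a list of trigger keywords.
    """
    words = re.findall(r"[a-z]+", text.lower())
    accepted, review = [], []
    for term, keywords in vocabulary.items():
        hits = sum(words.count(keyword) for keyword in keywords)
        if hits >= auto_threshold:
            accepted.append(term)   # strong evidence: tag automatically
        elif hits > 0:
            review.append(term)     # weak evidence: route to a person
    return accepted, review


vocab = {"Fees and charges": ["fee", "fees"], "Loans": ["loan", "loans"]}
print(suggest_tags("Annual fee notice: the fee applies to each loan.", vocab))
# (['Fees and charges'], ['Loans'])
```

Raising or lowering the threshold is one way to trade tagging cost against error rate, with the review queue feeding an error reporting and correction process.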
Enterprise taxonomy maintenance workflow
[Figure: flowchart with lanes for Taxonomy Tool, Analyst, Editor, Copywriter, and Sys Admin. Steps shown: Problem? If yes, suggest new name/category. Review new name. Problem? If no, copy edit new name. Add to enterprise taxonomy.]
Categorize with a purpose
What is the problem you are trying to solve?
Improve search
Browse for content on an enterprise-wide portal
Enable users to syndicate content
Otherwise provide the basis for content re-use
How will you control the cost of creating and maintaining the
metadata needed to solve these problems?
CMS with metadata tagging products
Semi-automated classification
Taxonomy editing tools
Guided navigation tools
How do you sell it?
Don’t sell the taxonomy, sell the vision of what
you want to be able to do
Clearly understanding what the problem is
and what the opportunities are
Costs and benefits
Design the taxonomy in relation to the value
at hand
Internet Resources
U.S. Government
Resources
http://www.nasa.gov/home/index.html
http://pub-lib.jpl.nasa.gov/pub-lib/dscgi/ds.py/View/Collection-10
http://www.loc.gov/flicc/wg/taxonomy.html
http://www.loc.gov/lexico/servlet/lexico/
http://www.archives.gov/federal_register/code_of_federal_regulations/thesaurus.html
http://feapmo.gov/
http://www.km.gov/
Other Resources
http://www.educause.edu/asp/taxonomy/show_taxonomy_links.asp?TREE=1&EXPAND=1
http://databases.unesco.org/thesaurus/
http://www.naa.gov.au/recordkeeping/control/functions_thesaur/contents.html
http://www.taxonomystrategies.com/html/bibliography.htm
Summary
Why taxonomies?
Why metadata?
Shiyali Ramamrita Ranganathan
Ranganathan’s Five Laws of
Library Science
1. Books are for use (They don't belong on the
shelf)
2. Books are for all; every reader his book (Every
reader is unique)
3. Every book its reader (Every book is unique)
4. Save the time of the reader (Make libraries
easy to use)
5. A library is a growing organism (Libraries are
constantly changing to meet changing patron
needs)
Thank you
Michael Huff
Information Resource Officer
U.S. Department of State