Metadata Strategy

Automatic Facets:
Faceted Navigation and Entity Extraction
Tom Reamy
Chief Knowledge Architect
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Agenda
• Introduction: Elements
  – Facets, Taxonomies, Software, People
• 3 Environments
  – E-Commerce, Enterprise, Internet
• Design Issues – Facets and Entities
• Conclusion – Integrated Solution

KAPS Group: General
• Knowledge Architecture Professional Services
• Virtual Company: Network of consultants – 12-15
• Partners – Inxight, FAST, etc.
• Consulting, Strategy, Knowledge architecture audit
• Taxonomies: Enterprise, Marketing, Insurance, etc.
• Services:
  – Taxonomy development, consulting, customization
  – Technology Consulting – Search, CMS, Portals, etc.
  – Metadata standards and implementation
  – Knowledge Management: Collaboration, Expertise, e-learning
  – Applied Theory – Faceted taxonomies, complexity theory, natural categories

Elements
• Facet – orthogonal dimension of metadata
• Entity / Noun Phrase – metadata value of a facet
• Entity extraction – feeds facets, signature, ontologies
• Taxonomy and categorization rules
• Auto-categorization – aboutness, subject facets
• People – tagging, evaluating tags, fine tune rules and taxonomy

Essentials of Facets
• Facets are not categories
  – Categories are what a document is about – limited number
  – Entities are contained within a document – any number
• Facets are orthogonal – mutually exclusive – dimensions
  – An event is not a person is not a document is not a place.
• Facets – variety – of units, of structure
  – Numerical range (price), Location – big to small
  – Alphabetical, Hierarchical – taxonomic
• Facets are designed to be used in combination
  – Wine where color = red, price = excessive, location = California, and sentiment = snotty

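The wine example can be sketched as a filter over independent facet values. This is only an illustration under stated assumptions: the `WINES` records, the `price_band` labels, and the `filter_by_facets` helper are hypothetical and do not come from any particular product.

```python
# Hypothetical catalog: each record carries one value per facet.
WINES = [
    {"name": "Wine A", "color": "red", "price_band": "excessive",
     "location": "California", "sentiment": "snotty"},
    {"name": "Wine B", "color": "red", "price_band": "moderate",
     "location": "California", "sentiment": "friendly"},
    {"name": "Wine C", "color": "white", "price_band": "excessive",
     "location": "Oregon", "sentiment": "snotty"},
]


def filter_by_facets(records, **selected):
    """Keep records whose value matches every selected facet (logical AND)."""
    return [
        r for r in records
        if all(r.get(facet) == value for facet, value in selected.items())
    ]


matches = filter_by_facets(
    WINES, color="red", price_band="excessive",
    location="California", sentiment="snotty",
)
print([w["name"] for w in matches])
```

Because the facets are orthogonal, each selection narrows the set independently, and the selections can be applied in any order.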
Advantages of Faceted Navigation
• More intuitive – easy to guess what is behind each door
  – Simplicity of internal organization
  – 20 questions – we know and use
• Dynamic selection of categories
  – Allow multiple perspectives
  – Ability to Handle Compound Subjects
• Systematic Advantages – fewer elements
  – 4 facets of 10 nodes each (40 nodes) = same coverage as a 10,000 node taxonomy
  – Ability to Handle Compound Subjects
• Flexible – can be combined with other navigation elements

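The "fewer elements" arithmetic can be checked directly: four facets of ten values each need 40 nodes to maintain, yet their combinations cover 10^4 = 10,000 distinct compound subjects. The facet and value names in this sketch are placeholders.

```python
from itertools import product

# Assumption for illustration: 4 facets, each with 10 values.
facets = {
    f"facet_{i}": [f"value_{i}_{j}" for j in range(10)]
    for i in range(4)
}

# Nodes someone has to define and maintain.
nodes_to_maintain = sum(len(values) for values in facets.values())

# Compound subjects reachable by picking one value from each facet.
combinations = sum(1 for _ in product(*facets.values()))

print(nodes_to_maintain)  # 40
print(combinations)       # 10000
```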
Essentials of Taxonomies
Internal Organization
• Formal Taxonomy – parent – child relationship
  – Is-A-Kind-Of ---- Animal – Mammal – Zebra
  – Partonomy – Is-A-Part-Of ---- US-California-Oakland
• Browse Classification – cluster of related concepts
  – Food and Dining – Catering – Restaurants
• Taxonomies deal with complex, not compound
  – Conceptual relationships – category membership
  – Contextual relationships – Computers & Software
• Taxonomies deal with semantics & documents
  – Multiple meanings and purposes
  – Essential attributes of documents are not single value

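A formal taxonomy's parent-child relationship can be sketched as child-to-parent links. The terms below are the slide's own examples; the `path_to_root` helper and the dictionary layout are illustrative assumptions, not a description of any taxonomy tool.

```python
# Child -> parent links, using the two example hierarchies.
IS_A_KIND_OF = {"Zebra": "Mammal", "Mammal": "Animal"}
IS_A_PART_OF = {"Oakland": "California", "California": "US"}


def path_to_root(term, parent_of):
    """Return the chain of terms from the root down to `term`."""
    path = [term]
    while path[-1] in parent_of:
        path.append(parent_of[path[-1]])
    return list(reversed(path))


print(" > ".join(path_to_root("Zebra", IS_A_KIND_OF)))
print(" > ".join(path_to_root("Oakland", IS_A_PART_OF)))
```

Both hierarchies share the same structure; what differs is the meaning of the link (kind-of versus part-of).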
Developing Facets: Tools and Techniques
Software Tools
• Text Analytics – Taxonomy management, entity extraction, categorization, sentiment
• Search – Integrated features, at index, Internet sources
• CM – Enterprise environment, taggers and policy
• Programmable Rules
  – Business and Subject matter expertise
  – Auto-populate variety of metadata – author, title, date, etc.
  – Relevance – best bets to weights and classes of documents
• People – refine, monitor – it’s not automatic

Developing Facets: Tools and Techniques
Software Tools – Auto-categorization
• Auto-categorization
  – Training sets – Bayesian, Vector Machine
  – Terms – literal strings, stemming, dictionary of related terms
  – Rules – simple – position in text (Title, body, url)
  – Advanced – saved search queries (full search syntax)
  – NEAR, SENTENCE, PARAGRAPH
  – Boolean – X NEAR Y and Not-Z
• Advanced Features
  – Facts / ontologies / Semantic Web – RDF +
  – Sentiment Analysis – positive, negative, neutral

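A rule of the form "X NEAR Y and Not-Z" can be sketched with plain token positions. Real categorization engines define NEAR in their own query syntax; the five-token window, the tokenizer, and the function names here are assumptions made for illustration.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def near(tokens, x, y, window=5):
    """True if x and y occur within `window` tokens of each other."""
    xs = [i for i, t in enumerate(tokens) if t == x]
    ys = [i for i, t in enumerate(tokens) if t == y]
    return any(abs(i - j) <= window for i in xs for j in ys)


def rule_x_near_y_not_z(text, x, y, z, window=5):
    """Categorization rule: X NEAR Y and Not-Z."""
    tokens = tokenize(text)
    return near(tokens, x, y, window) and z not in tokens


doc = "Marine toxins were measured during the ocean analysis."
print(rule_x_near_y_not_z(doc, "marine", "toxins", "freshwater"))
```

Rules like this are easy to read and audit, which is one reason people can keep refining them.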
Developing Facets: Tools and Techniques
Software Tools – Entity Extraction
• Dictionaries – variety of entities, coverage, specialty
  – Cost of update – service or in-house
  – Inxight – 50+ predefined entity types
  – Nstein – 800,000 people, 700,000 locations, 400,000 organizations
• Rules
  – Capitalization, text – Mr., Inc.
  – Advanced – proximity and frequency of actions, associations
  – Need people to continually refine the rules
• Entities and Categorization
  – Total number and pattern of entities = a type of aboutness of the document – Bar Code, Fingerprint

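Dictionary lookup and simple capitalization rules can be combined in a few lines. This is a toy sketch, not how Inxight or Nstein work: the three-entry dictionary, the two regular expressions, and the sample sentence are all hypothetical. The per-type count at the end is one crude way to express the "pattern of entities" as a document fingerprint.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: surface form -> entity type.
DICTIONARY = {
    "Oakland": "location",
    "California": "location",
    "EPA": "organization",
}

# Capitalization-plus-cue rules, in the spirit of "Mr." and "Inc."
PERSON_RULE = re.compile(r"\bMr\.\s+([A-Z][a-z]+)")
COMPANY_RULE = re.compile(r"\b([A-Z][A-Za-z]+(?:\s[A-Z][A-Za-z]+)*),?\s+Inc\.")


def extract_entities(text):
    """Return (entity, type) pairs found by the dictionary and the rules."""
    found = []
    for term, kind in DICTIONARY.items():
        for _ in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            found.append((term, kind))
    found += [(m.group(1), "person") for m in PERSON_RULE.finditer(text)]
    found += [(m.group(1), "organization") for m in COMPANY_RULE.finditer(text)]
    return found


def fingerprint(entities):
    """Count entities per type: a crude 'pattern of entities' signature."""
    return Counter(kind for _, kind in entities)


text = "Mr. Ford met the EPA in Oakland, California about Acme Widgets Inc. filings."
entities = extract_entities(text)
print(entities)
print(fingerprint(entities))
```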
Elements: People
• Programmers, Librarians, Taxonomists, Metadata specialists
  – Integrate, design, develop rules, monitor activity & quality
• Authors, Subject Matter Experts
  – Input into design (important facets), rules, activity meaning
• Users – Web 2.0
  – Feedback – quality and usability
  – Suggestions – missing terms, bad categorization & entity
  – Tag Clouds & folksonomy – for social networking features, not for information retrieval

Three Environments
• E-Commerce
  – Catalogs, small uniform collections of entities
  – Uniform behavior – buy this
• Enterprise
  – More content, more types of content
  – Enterprise Tools – Search, ECM
  – Publishing Process – tagging, metadata standards
• Internet
  – Wildly different amount and type of content, no taggers
  – General Purpose – Flickr, Yahoo
  – Vertical Portal – selected content, no taggers

Three Environments: E-Commerce
Enterprise Environment – When and how to add metadata
• Enterprise Content – different world than eCommerce
  – More Content, more kinds, more unstructured
  – Not a catalog to start – less metadata and structured content
  – Complexity – not just content but variety of users and activities
• Combination of human and automatic metadata – ECM
  – Software aided – suggestions, entities, ontologies
• Enterprise – Question of Balance / strategy
  – More facets = more findability (up to a point)
  – Fewer facets = lower cost to tag documents
• Issues
  – Not enough facets
  – Wrong set of facets – business not information
  – Ill-defined facets – too complex internal structure

Facets and Taxonomies
Enterprise Environment – Case One – Taxonomy, 7 facets
• Taxonomy of Subjects / Disciplines:
  – Science > Marine Science > Marine microbiology > Marine toxins
• Facets:
  – Organization > Division > Group
  – Clients > Federal > EPA
  – Instruments > Environmental Testing > Ocean Analysis > Vehicle
  – Facilities > Division > Location > Building X
  – Methods > Social > Population Study
  – Materials > Compounds > Chemicals
  – Content Type – Knowledge Asset > Proposals

External Environment – Text Mining, Vertical Portals
• Internet Content
  – Scale – impacts design and technology – speed of indexing
  – Limited control – ranges from association of publishers, to selection of content, to none
  – Major subtypes – different rules – metadata and results
• Complex queries and alerts
  – Terrorism taxonomy + geography + people + organizations
• Text Mining
  – General or specific content and facets and categories
  – Dedicated tools or component of Portal – internal or external
• Vertical Portal
  – Relatively homogeneous content and users
  – General range of questions

Internet Design
• Subject Matter taxonomy – Business Topics
  – Finance > Currency > Exchange Rates
• Facets
  – Location > Western World > United States
  – People – Alphabetical and/or Topical – Organization
  – Organization > Corporation > Car Manufacturing > Ford
  – Date – Absolute or range (1-1-01 to 1-1-08, last 30 days)
  – Publisher – Alphabetical and/or Topical – Organization
  – Content Type – list – newspapers, financial reports, etc.

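The Date facet mixes two kinds of filter: a fixed absolute range and a rolling "last N days" window. A minimal sketch follows, assuming that 1-1-01 to 1-1-08 means January 1, 2001 through January 1, 2008; the function names and sample dates are hypothetical.

```python
from datetime import date, timedelta


def in_absolute_range(d, start, end):
    """True if d falls within the fixed range, endpoints included."""
    return start <= d <= end


def in_last_n_days(d, n, today):
    """True if d falls within the n days up to and including `today`."""
    return today - timedelta(days=n) <= d <= today


# Assumed reading of the slide's range: 2001-01-01 to 2008-01-01.
START, END = date(2001, 1, 1), date(2008, 1, 1)

published = date(2005, 6, 15)
print(in_absolute_range(published, START, END))
print(in_last_n_days(date(2008, 3, 1), 30, today=date(2008, 3, 20)))
```

The rolling window depends on the current date, so its results change from day to day while the absolute range stays fixed.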
Integrated Facet Application
Design Issues – General
• What is the right combination of elements?
  – Faceted navigation, metadata, browse, search, categorized search results, file plan
• What is the right balance of elements?
  – Dominant dimension or equal facets
  – Browse topics and filter by facet
• When to combine search, topics, and facets?
  – Search first and then filter by topics / facet
  – Browse/facet front end with a search box

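"Search first and then filter by facet" usually means counting facet values over the result set and letting the user pick one to narrow it. The sketch below assumes the search has already run; `RESULTS`, its field names, and both helpers are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical search results with facet metadata already attached.
RESULTS = [
    {"title": "Exchange rate outlook", "content_type": "financial report",
     "location": "United States"},
    {"title": "Ford quarterly results", "content_type": "financial report",
     "location": "United States"},
    {"title": "Currency markets today", "content_type": "newspaper",
     "location": "United Kingdom"},
]


def facet_counts(results, facets):
    """Count how many results carry each value of each facet."""
    counts = defaultdict(Counter)
    for doc in results:
        for facet in facets:
            if facet in doc:
                counts[facet][doc[facet]] += 1
    return counts


def apply_filter(results, facet, value):
    """Narrow the result set to documents with the chosen facet value."""
    return [doc for doc in results if doc.get(facet) == value]


counts = facet_counts(RESULTS, ["content_type", "location"])
print(dict(counts["content_type"]))
narrowed = apply_filter(RESULTS, "content_type", "newspaper")
print([doc["title"] for doc in narrowed])
```

The counts tell the user what is behind each door before they open it.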
Integrated Facet Application
Design Issues – General
• Homogeneity of Audience and Content
• Model of the Domain – broad
  – How many facets do you need?
  – More facets and let users decide
  – Allow for customization – can’t define a single set
• User Analysis – tasks, labeling, communities
  – Issue – labels that people use to describe their business vs. labels that they use to find information
• Match the structure to domain and task
  – Users can understand different structures

Automatic Facets – Special Issues
• Scale requires more automated solutions
  – More sophisticated rules
• Rules to find and populate existing metadata
  – Variety of types of existing metadata – Publisher, title, date
  – Multiple implementation Standards – Last Name, First / First Name, Last
• Issue of disambiguation:
  – Same person, different name – Henry Ford, Mr. Ford, Henry X. Ford
  – Same word, different entity – Ford and Ford
• Number of entities and thresholds per results set / document
  – Usability, audience needs
• Relevance Ranking – number of entities, rank of facets

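Two of these issues, mixed name formats and name variants for one person, can be sketched together. The alias table is hand-built and hypothetical (including the "Ford Motor Company" entry and the entity ids); a real system would derive it from rules and context.

```python
def normalize_name(raw):
    """Convert 'Last, First' to 'First Last'; leave other forms unchanged."""
    raw = " ".join(raw.split())
    if "," in raw:
        last, first = [part.strip() for part in raw.split(",", 1)]
        return f"{first} {last}"
    return raw


# Hypothetical alias table mapping name variants to one canonical entity.
ALIASES = {
    "Henry Ford": "person:henry_ford",
    "Mr. Ford": "person:henry_ford",
    "Henry X. Ford": "person:henry_ford",
    "Ford Motor Company": "org:ford",
}


def resolve(raw):
    """Return a canonical entity id, or None if the name is unknown or ambiguous."""
    return ALIASES.get(normalize_name(raw))


print(normalize_name("Ford, Henry"))
print(resolve("Ford, Henry"))
print(resolve("Ford"))
```

A bare "Ford" is deliberately left unresolved: deciding between the person and the company needs surrounding context, which is the "Ford and Ford" problem.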
Putting it all together – Infrastructure Solution
• Facets, Taxonomies, Software, People
• Combine formal power with ability to support multiple user perspectives
• Facet System – interdependent, map of domain
• Entity extraction – feeds facets, signatures, ontologies
• Taxonomy & Auto-categorization – aboutness, subject
• People – tagging, evaluating tags, fine tune rules and taxonomy
• The future is the combination of simple facets with rich taxonomies with complex semantics / ontologies

Questions?
Tom Reamy
[email protected]
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com