Empowering the Publishing Process with Semantic Technologies


Empowering the Publishing Process with Semantic Technologies
Stephen Cohen
Principal Consultant
O’Reilly Tools of Change Conference
23 February 2010
www.innodata-isogen.com
Agenda
• Overview
• Semantic technologies
• Case studies
• Benefits and challenges
• Questions
Innodata Isogen – Who We Are
Innodata Isogen provides knowledge, production, technology and consulting services to the world’s leading media, publishing and information services companies.
We specialize in publishing, helping our clients to:
• lower the total cost of ownership for their content supply chain
• re-engineer business processes
• use multi-shore services to lower cost, manage risk and balance the cost / quality ratio
• combine content and technology outsourcing to add value
Our clients include
• leading scholarly, business and legal publishers
• secondary publishers (content aggregators)
• agencies of the U.S. Department of Defense
• major aerospace manufacturers
6,500 global staff in London, Paris, Israel, Delhi, Manila, Cebu, Colombo, New Jersey and Dallas
Overview
• Semantic technologies are often used to monetize content more effectively and to improve the customer experience on the Web
– semantic advertising
– semantic search
• They have also been used effectively
throughout the publishing process
• Today we will talk about companies that are
using semantic technologies and text mining
to process content better, faster, cheaper
What Do Publishers Have in Common?
• They all want to deliver information better,
faster, cheaper
• Better
– offer the information customers and users want and
need (focused)
– make it easier for customers to discover new
information and relationships between information
• Faster
– get it in the hands of customers ahead of your
competition (when they need it)
• Cheaper
– do it in the most cost-effective way possible
Semantic Analysis Tools Can Help
• Across the content supply chain
• Better
– more accurate, consistent content tagging, indexing,
abstracting, linking
• Faster
– find out sooner about new information (e.g.,
announcements, legal opinions, rules changes)
– (semi) automate content enrichment
– increase throughput
• Cheaper
– deploy resources most cost-effectively (do more with less)
Semantic Technologies: Some Characteristics
• Briefly, semantic technologies are algorithms that seek
to model the associative processes that humans
perform to extract meaning from information
• Knowing a little bit about “the man behind the curtain”
can help when it comes to deciding which approach is a
good fit for your company’s needs
• They can be rules-based, use statistical analysis, use
semantic and linguistic clustering, etc.
• Not surprisingly, there are many approaches to
modeling and each has its strengths and weaknesses
Rules-Based Text Analysis
• Precisely defines criteria by which a document belongs to a category
• Matches terms in a thesaurus to words in content
• Typically uses “if-then-else” rules
• Relatively easy to deploy; start with simple rules and enhance over time
• Rules can get complex and difficult to maintain
Example rules:
• If word = ‘shrub’, assign category ‘bush’
• If word = ‘Bush’ AND within 4 words of ‘President’, assign category ‘chief executive’
• If doc.type = email, assign category ‘internal communication’
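As an illustration only, here is a minimal sketch in Python of how rules like these might be expressed in code; the terms, categories and four-word window are hypothetical, not a description of any particular product:

import re

def classify(doc_type, text):
    # Apply simple hand-written if-then-else rules to assign categories.
    categories = set()
    words = re.findall(r"\w+", text.lower())

    # Rule 1: a plain thesaurus match ('shrub' maps to the category 'bush')
    if "shrub" in words:
        categories.add("bush")

    # Rule 2: a contextual rule -- 'Bush' within 4 words of 'President'
    for i, w in enumerate(words):
        if w == "bush" and "president" in words[max(0, i - 4):i + 5]:
            categories.add("chief executive")

    # Rule 3: a metadata rule on document type
    if doc_type == "email":
        categories.add("internal communication")

    return categories

print(classify("article", "President George W. Bush signed the bill."))
# expected: {'chief executive'}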
Statistical Analysis
• Word frequency
• Relative placement of words, groupings
• Distance between words in a document
• Pattern analysis
• Co-occurrence of terms to find clumps or clusters of closely related documents
• Makes assignments to categories based on a set of training
documents
• Requires more time to deploy due to need to select a
representative set of documents for training the tool
• Accuracy of the semantic analysis will depend on how well the
training documents have been chosen
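A hedged sketch of the statistical approach, assuming the scikit-learn library is available; the training documents and labels below are invented placeholders, and real deployments train on far larger, editor-selected sets:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A (hypothetical) representative training set chosen by editors
train_docs = [
    "The court reversed the lower court's ruling on appeal.",
    "Quarterly earnings beat analyst estimates on strong sales.",
    "The appellate judge cited precedent in the opinion.",
    "Shares fell after the company cut its revenue forecast.",
]
train_labels = ["legal", "business", "legal", "business"]

# Word-frequency features feed a simple probabilistic classifier
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_docs, train_labels)

print(model.predict(["The ruling sets a precedent for future appeals."]))
# accuracy depends on how well the training documents were chosen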
Semantic and Linguistic Clustering
• Concept extraction
• Language dependent
• Documents clustered or grouped depending on meaning of
words using thesauri, parts-of-speech analyzers, rule-based &
probabilistic grammar, etc.
• Analyzes structure of sentences
– analysis of words: prefixes, suffixes, roots
– word-level analysis including parts of speech
– structure and relationships between words in a sentence
– possible meanings of a sentence; enhanced by statistical analysis
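As a rough sketch of this linguistic layer, assuming the spaCy library and its small English model are installed; the sentence is invented and no specific vendor tool is implied:

import spacy

# requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("The Securities and Exchange Commission fined the bank in New York.")

# word-level analysis: roots (lemmas), parts of speech, syntactic role
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.dep_)

# concept extraction: named entities recognized in the sentence
for ent in doc.ents:
    print(ent.text, ent.label_)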
The Content Supply Chain
• We view the publishing process in terms of a supply chain
• It begins with content acquisition, moves through conversion and enhancement, on to product assembly and, lastly, to product publishing and distribution
• Using semantic tools has an impact on roles and responsibilities,
workflows and the way content is processed at each stage of the
content supply chain
• Semantic tools and text mining are used at different stages of
the editorial and production process
Semantic Tools in the Content Supply Chain
Source / Create: intelligent agents for targeted retrieval (content federation); “acquire what is new or changed from sites I am interested in”
Convert / Structure: extract content for tagging; identify not only document structure but document meaning; structure unstructured content
Normalize: linking; entity extraction; citations; classification; machine-aided indexing; contextual meaning
Store / Manage: controlled vocabulary and authority list management; taxonomy managers; knowledge management
Edit / Enhance: abstracting, auto-summarization (e.g., synopses, headnotes)
Product Assembly: custom publishing; ‘synthetic documents’
Publish / Distribute: content delivery for multiple output channels and product formats
Case Studies
Preview of Case Studies
• Rules-based auto-classification
• Document analysis and entity linking
• Auto-summarization
• Product assembly
• Custom information feeds
Case Study
Rules-based Auto-classification
Rules-based Auto-classification
Set-up (define classification rules):
• Indexer defines classification rules, which are stored in a rules base
• Rules are tested and adjusted against a baseline test set

Taxonomy manager:
• Add/remove terms; create groupings; map terms
• Rules are automatically updated to reflect changes in the taxonomy

Auto-classification (system):
• Apply rules to classify content against the taxonomy
• System tracks rules usage (which ones are used; frequency)
• System tracks rules that generated incorrect classifications

Indexer review:
• Indexer accepts, rejects or adds classification terms
• Reviews rules the system applied that yielded a wrong classification
• Flags problems to the rules builder; suggests new terms

Rules management:
• Review usage statistics (rules used and not used); add, modify or delete rules
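A minimal sketch, assuming a home-grown rules base, of how a system might track which rules fired and which were rejected on review; the class name, rule names and rule functions are hypothetical:

from collections import Counter

class TrackedRulesBase:
    def __init__(self, rules):
        self.rules = rules            # {rule_name: function(text) -> category or None}
        self.usage = Counter()        # how often each rule fired
        self.rejections = Counter()   # rules whose output an indexer rejected

    def classify(self, text):
        assigned = {}
        for name, rule in self.rules.items():
            category = rule(text)
            if category:
                self.usage[name] += 1        # track rules usage and frequency
                assigned[category] = name    # remember which rule produced the term
        return assigned

    def record_rejection(self, rule_name):
        self.rejections[rule_name] += 1      # feeds the periodic usage review

rules = {"bush-rule": lambda t: "bush" if "shrub" in t.lower() else None}
base = TrackedRulesBase(rules)
print(base.classify("Prune the shrub in early spring."))   # {'bush': 'bush-rule'}
print(base.usage, base.rejections)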
Case Study
Document Analysis and Entity Linking
Document Analysis and Entity Linking
• Focus is on document analysis and entity linking in
editorial workflow
• Subsidiary of a global legal publishing house
– content base of 3.5 million cases and related documents
– manages over 17 million citations
– updates of case law processed daily
– cases growing at 20% per annum
• Challenges
– avoid processes performed manually by individuals
– allow the user to select and filter the information needed for their job
– take into account an increasing number of legal information sources
• This describes the target configuration; it is not yet fully realized
Goals for the New Process
• Aid the process of knowledge extraction and storage
– identify legal sources (e.g., official publication, case law
decision)
– extract legal citations (which source is cited and why?)
– populate a knowledge base and cyclically enrich the content
• Process each piece of information one time
– normalize, tag, enrich, link, form concepts, etc.
• Build a standardized common knowledge base for use throughout the editorial and production process and downstream by end-users
• Maintain consistent thesauri, ontologies, taxonomies and
provide a mechanism for their management and updating
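As an illustration of the citation-extraction goal, a hedged Python sketch using a deliberately simple regular expression; the pattern covers only one U.S.-style reporter citation form and is an assumption, since real legal citation grammars are far richer:

import re

CITATION = re.compile(r"\b(\d+)\s+(U\.S\.|F\.3d|F\.2d|S\.Ct\.)\s+(\d+)\b")
knowledge_base = {}   # normalized citation -> set of documents that cite it

def extract_citations(doc_id, text):
    found = []
    for volume, reporter, page in CITATION.findall(text):
        key = "{} {} {}".format(volume, reporter, page)   # normalize once
        knowledge_base.setdefault(key, set()).add(doc_id)
        found.append(key)
    return found

text = "The court relied on Bush v. Gore, 531 U.S. 98, in reaching its decision."
print(extract_citations("case-001", text))   # ['531 U.S. 98']
print(knowledge_base)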
Document Analysis and Linking Process
Definition phase:
• Domain-specific lists for entity recognition
• Test the text analysis tool against a baseline test set
Automated semantic analysis (automated text analysis):
• Entity extraction; text mining rules; tag content; linking
• Iterative application of rules
Review and QC of enriched content:
• Legal editors use search and navigation services, backed by knowledge management, to review, identify and correct entity errors, link errors and concept errors
List and rules maintenance:
• Librarian conducts a weekly review of exception reports
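A minimal sketch of the linking and exception-report step, assuming extracted entities are matched against domain-specific lists; the list entries, identifiers and document IDs are invented:

# hypothetical domain-specific authority list: court names -> authority IDs
COURT_LIST = {
    "supreme court of the united states": "court:scotus",
    "court of appeals for the ninth circuit": "court:ca9",
}

exception_report = []   # unresolved entities, reviewed weekly by the librarian

def link_entity(doc_id, surface_form):
    # return an authority ID for a recognized entity, or log an exception
    key = surface_form.strip().lower()
    if key in COURT_LIST:
        return COURT_LIST[key]
    exception_report.append((doc_id, surface_form))
    return None

print(link_entity("case-002", "Supreme Court of the United States"))   # court:scotus
print(link_entity("case-002", "Hackensack Municipal Court"))           # None
print(exception_report)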
Benefits of the New Process
• Workflow
– a semi-automated process
– editors review QC output from text mining tool to enhance and correct as
necessary
– analysis and linking by automated text analysis tool
– parallel processing in text analysis tool
– analysis, referencing and linking become part of the same workflow
• Roles and responsibilities
– editors no longer need to be experts in mark-up languages; content is tagged
automatically
– low value editorial tasks handled by text analysis tool
– existing staff can focus on high value tasks
– new role to maintain and enhance semantic lists and text mining tool rules
• Content
– quality of document analysis improves through enhancements to the lists and rules used by the text mining tool
– able to federate metadata across multiple content management systems
– same knowledge base and text mining tool integrated into online products
Case Study
Auto-summarization
Auto-summarization – Major Newspaper
• Content in: document analysis of source, type, format and content
• Extent of automation depends on article importance
• Auto-summarization rules base:
– document zones
– rules: semantics; dictionary; complex grammar rules
– section weightings
– sentence position
– relative importance of sentences
– markers for start of sections, paragraphs, sentences
– sentence length of summary
• Auto-summarization produces a draft version; expert review and edit produces the final version
• Alternatively, manual summarization by outsourced or in-house experts
• Administrator monitors and improves the rules set based on usage
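A rough sketch of rule-weighted extractive summarization in Python; the scoring weights (a lead-sentence boost plus word-frequency totals) are arbitrary stand-ins for the zone, position and weighting rules listed above:

import re
from collections import Counter

def summarize(text, max_sentences=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))

    scored = []
    for position, sentence in enumerate(sentences):
        score = sum(freq[w] for w in re.findall(r"\w+", sentence.lower()))
        if position == 0:        # lead sentences weighted more heavily
            score *= 1.5
        scored.append((score, position, sentence))

    top = sorted(scored, reverse=True)[:max_sentences]
    # keep the selected sentences in original document order
    return " ".join(s for _, _, s in sorted(top, key=lambda t: t[1]))

article = ("The city council approved the new stadium plan. "
           "Funding for the stadium comes from bonds. "
           "Critics question the stadium funding. "
           "A vote on the bonds is expected next month.")
print(summarize(article))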
Case Study
Product Assembly
Product Assembly
• New content process: source / capture; convert / normalize; analyze / classify / enhance (editorial)
• Content repository: XML content store; rich media
• Extract product content from the repository: select content for each product (XQuery)
• Format each product: render with FOSI, XSL-FO or proprietary tools; XSLT, CSS or RSS; WCSS or proprietary formats
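A hedged sketch of the extract-and-render step, using XPath selection and an XSLT stylesheet via the lxml library as a stand-in for the XQuery and rendering layers named above; the repository structure and element names are invented:

from lxml import etree

# a tiny stand-in for the XML content store
store = etree.XML(
    "<repository>"
    "<article topic='soccer'><title>Cup Final Preview</title></article>"
    "<article topic='finance'><title>Rates Hold Steady</title></article>"
    "</repository>"
)

# select content for one product (the slide names XQuery; XPath is used here)
print(store.xpath("//article[@topic='soccer']/title/text()"))

# render the selection for one output channel with a minimal XSLT stylesheet
stylesheet = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <feed>
      <xsl:for-each select="//article[@topic='soccer']">
        <item><xsl:value-of select="title"/></item>
      </xsl:for-each>
    </feed>
  </xsl:template>
</xsl:stylesheet>""")
transform = etree.XSLT(stylesheet)
print(str(transform(etree.ElementTree(store))))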
Case Study
Custom Information Feeds
Custom Information Feeds
• XML content repository holds content tagged by sport (soccer, football, baseball, hockey), level (pro, college, high school, rec) and content type (news, scores, stats, standings, schedules, rules, players, people, rich media)
• Delivery to end users via real-time feeds, real-time updates, enriched email and targeted information
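A minimal sketch of how tagged repository content might be matched against subscriber profiles to build targeted feeds; the items, tags and addresses are invented examples:

# each item from the XML repository carries classification tags
items = [
    {"headline": "Late goal wins city derby", "tags": {"soccer", "pro", "scores"}},
    {"headline": "Regional high school standings", "tags": {"football", "high school", "standings"}},
    {"headline": "Trade deadline roundup", "tags": {"hockey", "pro", "news"}},
]

# hypothetical subscriber profiles: deliver an item only if it carries all requested tags
subscribers = {
    "fan@example.com": {"soccer", "scores"},
    "coach@example.com": {"football", "high school"},
}

def build_feeds(items, subscribers):
    feeds = {address: [] for address in subscribers}
    for item in items:
        for address, wanted in subscribers.items():
            if wanted <= item["tags"]:   # subset test: every wanted tag is present
                feeds[address].append(item["headline"])
    return feeds

for address, headlines in build_feeds(items, subscribers).items():
    print(address, "->", headlines)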
Benefits of Using Semantic Technologies
• People
– minimize high-value resources performing commodity tasks
– editors focus on real editorial added value; no need to be concerned about markup
– increased capacity without increasing headcount
– novice indexers come up to speed more quickly
• Process
– reduced processing time due to automation
– sequential tasks can be performed in one step
– products can be more targeted to specific customer needs
– parts can be outsourced
• Content
– richer, more consistent classification, linking, summarization and semantic tagging
– common controlled vocabularies maintained and applied across the entire content base
– the same content can be classified and summarized along more dimensions to serve different customer groups
– greater value can be extracted from unstructured content with text mining and semantic analysis
– taxonomy managers support a rigorous approach to maintenance and updating
Challenges Using Semantic Technologies
• People
– retraining resources for new roles (rules builder, taxonomy manager, etc.) is time-consuming
– the level of accuracy depends on the ability of editors to write logical rules
• Process
– the time required to refine rules and train the analysis engine can be extensive (some report 12-18 months)
– productivity improvements are a function of thesaurus structure, the rule-builder’s skill level and document type; the more complex any of these are, the longer it takes to achieve return on investment
• Content
– automated content analysis doesn’t match up to the analytical skills of trained subject area experts (at least in some highly technical disciplines)
– some find it difficult to measure the impact of indexing consistency
– quality is lower when machine-aided indexing is fully automated with no follow-on QC by subject area experts
Questions
THANK YOU
Stephen Cohen
Principal Consultant
[email protected]
+1 (201) 371-8044
Innodata Isogen, Inc.
Three University Plaza
Hackensack, NJ 07601
+1 (201) 371-2828
www.innodata-isogen.com