Transcript Slide 1

Extreme Content Management at LexisNexis
Alfresco Summit 2013
Presenter: Flavio Villanustre
LexisNexis Risk Solutions, Reed Elsevier
November 13th, 2013
Boston, USA
Content Management: the traditional view
“The set of processes and technologies that support the collection, managing, and
publishing of information in any form or medium” – Wikipedia
More generally: storage, processing, retrieval and disposal of digital content such as
text documents, multimedia files, etc., where usually certain types of workflows in
the document lifecycle are involved.
As long as volume, content acquisition speed,
classification complexity and retrieval process are
kept within reasonable limits, we have a good
solution…
Extreme Content Management
2
Evolution and new requirements: a wider and semantic World
• The world’s information is doubling every
two years
• We broke the zettabyte barrier already
• Sifting through huge volumes of data
efficiently creates new challenges
• And big stores can be hard to manage
and slow
• The traditional TF/IDF approach no
longer works
Source: IDC/EMC
Content Management and Big Data Technologies: a dichotomy
• Enterprise Content Management: evolved from
the necessity to have appropriate control over
internal enterprise digital content
• Big Data technologies: born from the necessity of
integrating large amounts of diverse data,
prioritizing functionality over discipline
• ECM provides control, search, traceability,
workflows, user interfaces
• Big Data brings large-scale data integration,
semantic analysis, recommendations, contextual
search
Let’s dream for a moment…
• What if your content management system were
able to recommend content based on your
interest, recent activity or your affinity with
related people?
• How about ingesting petabytes of data and still
being able to effectively integrate it, by
resolving and disambiguating entities and
creating relationships that can enhance
retrieval capabilities?
• What if you could express your queries in
natural language?
• What if the system could present to you
interesting content based on your behavior?
What we do at LexisNexis Risk Solutions
• Data and analytics-based solutions (part of Reed Elsevier, together
with LexisNexis, Elsevier, Reed Business Information and Reed
Exhibitions)
• $1.4B in revenue
• Insurance, Financial Services, Law Enforcement, Healthcare, Retail,
etc.
• Typically hundreds of billions of records and trillions of individual
attributes across petabytes of data
• Search and retrieval, massive graphs, analytics and statistical
modeling
• Designed our Big Data system in the late 1990s
• Distributed platform for data processing and real-time delivery,
with a declarative dataflow programming paradigm (ECL)
• Released the HPCC Systems platform and ECL as an Open Source
project in 2011
ECM in LexisNexis Risk Solutions
Our culture:
• A DIY culture, with a large technology group (850+ people)
• But we understand our core competencies and don’t want to reinvent the wheel
• We love open source, not because it’s cheap but because it’s free (as in freedom of
speech, not as in free beer) and we can fix and enhance it ourselves
Enterprise Content Management:
• Supporting human-based, document-oriented workflows for certain products and
enterprise services
• Usually deeply customized with integration and automation (integrated document
routing and approvals, automated escalation process, interoperable with our specific
products, etc.)
• Originally based on EMC Documentum, migrated to Alfresco several years ago with
overhauled functionality
A look at the LexisNexis HPCC Systems Big Data Platform
Alfresco + HPCC: an integrated solution to semantic information management
• Alfresco and the HPCC Systems platform have similar business
models: both are Open Source (LGPL and Apache, respectively)
and offer commercial licenses with support, maintenance, etc.
• Alfresco is a top-notch and flexible Content Management
System
• HPCC is a robust and proven Big Data platform, currently in use
in other semantic stores (for example, the recommendation
system for Elsevier’s ScienceDirect)
• We have experience in both…
Implementation
(Architecture diagram. Labels: Billions of events; Millions of documents; Thor; Similarity; Co-download matrix; Roxie; Attribute; Ranking)
Recommendation generation process
• Export data and metadata from the Alfresco document store
• Export usage logs from Alfresco
• Export user information from Alfresco
• Extract feature vectors from the document data and metadata
in Thor
• Analyze behavioral patterns and similarities across users
• Create distance vectors for users and documents and generate
the actual recommendations
• Provide real-time ranked recommendations from Roxie, as users
search and browse content
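The steps above can be illustrated with a minimal sketch. The slides do not say which similarity measure is used, and the production pipeline runs in ECL on Thor and Roxie, so everything here is an assumption made for illustration: Python instead of ECL, cosine similarity over binary "who downloaded what" vectors, and the hypothetical helper names build_co_download, similarity and recommend. The event tuples stand in for the exported Alfresco usage logs.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def build_co_download(events):
    """Build the co-download counts from (user_id, doc_id) log events."""
    docs_by_user = defaultdict(set)
    for user, doc in events:
        docs_by_user[user].add(doc)

    downloads = defaultdict(int)  # doc -> number of distinct users
    co = defaultdict(int)         # (doc_a, doc_b), doc_a < doc_b -> shared users
    for docs in docs_by_user.values():
        for doc in docs:
            downloads[doc] += 1
        for a, b in combinations(sorted(docs), 2):
            co[(a, b)] += 1
    return docs_by_user, downloads, co


def similarity(a, b, downloads, co):
    """Cosine similarity of two documents' binary download vectors."""
    if not downloads.get(a) or not downloads.get(b):
        return 0.0  # unknown document: nothing to compare
    if a == b:
        return 1.0
    pair = (a, b) if a < b else (b, a)
    return co.get(pair, 0) / sqrt(downloads[a] * downloads[b])


def recommend(user, docs_by_user, downloads, co, top_n=3):
    """Rank unseen documents by summed similarity to the user's downloads."""
    seen = docs_by_user.get(user, set())
    scores = {}
    for candidate in downloads:
        if candidate in seen:
            continue
        score = sum(similarity(candidate, doc, downloads, co) for doc in seen)
        if score > 0:
            scores[candidate] = score
    # Highest score first; ties broken by document id for stable output.
    ranked = sorted(scores, key=lambda doc: (-scores[doc], doc))
    return ranked[:top_n]


# Made-up usage log: (user, downloaded document).
events = [
    ("ann", "d1"), ("ann", "d2"),
    ("bob", "d1"), ("bob", "d2"), ("bob", "d3"),
    ("cat", "d2"), ("cat", "d3"),
    ("dan", "d1"),
]
docs_by_user, downloads, co = build_co_download(events)
print(recommend("dan", docs_by_user, downloads, co))
```

In this sketch the batch step (building the counts) loosely corresponds to the work done in Thor, and the per-user lookup to the real-time ranking served from Roxie. A real system would also fold in the document feature vectors and user information listed above.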
Business Impact in LexisNexis Risk Solutions
• Significant human efficiency gains, moving from multiple cycles of “search, wait and
pray” to proactively pushed content and much smarter search
• Specific functionality is now better customized to particular groups (content and
behavior driven)
• Scalability to much larger content repositories without increased retrieval latencies
• Streamlined the data ingest process, since diverse sources are all managed and
integrated through HPCC
• Proper handling of near-real-time data updates and streaming when needed
• Ability to tap into a large library of Natural Language Processing and Machine Learning
algorithms on HPCC
• Additional visualization and Exploratory Data Analysis capabilities when appropriate
Future enhancements
• Contextually Relevant Content (semantic search and
browse)
• Minimal Searching
• Minimal Next Steps
• Enable all devices
• Knows me, my team, my market, my interests
• Improved Usability
• Increased Effectiveness
• Better Alignment
• Mobile and Social
Questions?
Thank you!
Email: [email protected]
http://hpccsystems.com
http://www.lexisnexis.com/risk