Transcript Slide 1
Extreme Content Management at LexisNexis
Alfresco Summit 2013
Presenter: Flavio Villanustre
LexisNexis Risk Solutions, Reed Elsevier
November 13th, 2013
Boston, USA

Slide 2
Content Management: the traditional view
“The set of processes and technologies that support the collection, managing, and publishing of information in any form or medium” – Wikipedia
More generally: storage, processing, retrieval and disposal of digital content such as text documents, multimedia files, etc., where usually certain types of workflows in the document lifecycle are involved.
As long as volume, content acquisition speed, classification complexity and retrieval process are kept within reasonable limits, we have a good solution…

Slide 3
Evolution and new requirements: a wider and semantic World
• The world’s information is doubling every two years
• We broke the zettabyte barrier already
• Sifting through huge volumes of data efficiently creates new challenges
• And big stores can be hard to manage and slow
• The traditional TF/IDF approach no longer works
Source: IDC/EMC

Slide 4
Content Management and Big Data Technologies: a dichotomy
• Enterprise Content Management: evolved from the necessity to have appropriate control over internal enterprise digital content
• Big Data technologies: born from the necessity of integrating large amounts of diverse data, prioritizing functionality over discipline
• ECM provides control, search, traceability, workflows, user interfaces
• Big Data brings large-scale data integration, semantic analysis, recommendations, contextual search

Slide 5
Let’s dream for a moment…
• What if your content management system was able to recommend content based on your interest, recent activity or your affinity with related people?
• How about ingesting petabytes of data and still being able to integrate it effectively, by resolving and disambiguating entities and creating relationships that can enhance retrieval capabilities?
• What if you could express your queries in natural language?
• What if the system could present interesting content to you based on your behavior?

Slide 6
What we do at LexisNexis Risk Solutions
• Data and analytics-based solutions (part of Reed Elsevier, together with LexisNexis, Elsevier, Reed Business Information and Reed Exhibitions)
• $1.4 billion in revenue
• Insurance, Financial Services, Law Enforcement, Healthcare, Retail, etc.
• Typically hundreds of billions of records and trillions of individual attributes across petabytes of data
• Search and retrieval, massive graphs, analytics and statistical modeling
• Designed our Big Data system in the late 1990s
• Distributed platform for data processing and real-time delivery, with a declarative dataflow programming paradigm (ECL)
• Released the HPCC Systems platform and ECL as an Open Source project in 2011

Slide 7
ECM in LexisNexis Risk Solutions
Our culture:
• A DIY culture, with a large technology group (850+ people)
• But we understand our core competencies and don’t want to reinvent the wheel
• We love open source, not because it’s cheap but because it’s free (as in freedom of speech, not free beer) and we can fix and enhance it ourselves
Enterprise Content Management:
• Supporting human-based, document-oriented workflows for certain products and enterprise services
• Usually deeply customized with integration and automation (integrated document routing and approvals, automated escalation process, interoperable with our specific products, etc.)
• Originally based on EMC Documentum, migrated to Alfresco several years ago with overhauled functionality

Slide 8
A look at the LexisNexis HPCC Systems Big Data Platform

Slide 9
Alfresco + HPCC: an integrated solution to semantic information management
• Alfresco and the HPCC Systems platform have similar business models: both are Open Source (LGPL and Apache, respectively) and offer commercial licenses with support, maintenance, etc.
• Alfresco is a top-notch and flexible Content Management System
• HPCC is a robust and proven Big Data platform, currently in use in other semantic stores (for example, the recommendation system for Elsevier’s ScienceDirect)
• We have experience in both…

Slide 10
Implementation
[Diagram labels: Billions of events; Millions of documents; Thor; Similarity; Co-download matrix; Roxie; Attribute Ranking]

Slide 11
Recommendation generation process
• Export data and metadata from the Alfresco document store
• Export usage logs from Alfresco
• Export user information from Alfresco
• Extract feature vectors from the document data and metadata in Thor
• Analyze behavioral patterns and similarities across users
• Create distance vectors for users and documents and generate the actual recommendations
• Provide real-time ranked recommendations from Roxie, as users search and browse content

Slide 12
Business Impact in LexisNexis Risk Solutions
• Significant human efficiency gains, moving from multiple cycles of “search, wait and pray” to proactively pushed content and much smarter search
• Specific functionality is now better customized to particular groups (content and behavior driven)
• Scalability to much larger content repositories without increased retrieval latencies
• Streamlined data ingest process, since diverse sources are all managed and integrated through HPCC
• Proper handling of near-real-time data updates and streaming when needed
• Ability to tap into a large library of Natural Language Processing and Machine Learning algorithms on HPCC
• Additional visualization and Exploratory Data Analysis capabilities when appropriate

Slide 13
Future enhancements
• Contextually Relevant Content (semantic search and browse)
• Minimal Searching
• Minimal Next Steps
• Enable all devices
• Knows me, my team, my market, my interests
• Improved Usability
• Increased Effectiveness
• Better Alignment
• Mobile and Social

Slide 14
Questions? Thank you!
Email: [email protected]
http://hpccsystems.com
http://www.lexisnexis.com/risk
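
Illustrative note on the recommendation generation process: the slides mention a co-download matrix and a similarity step but do not show how they are computed. The sketch below is one common way to do it (item-to-item cosine similarity over binary download vectors), written in plain Python rather than ECL. It is not the presenter's implementation. The sample log, the function names (build_co_download_matrix, cosine_similarity, recommend) and the scoring rule are all assumptions made for illustration; in the architecture described in the talk, the batch part would presumably run in Thor and the ranked lookup would be served from Roxie.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical usage log: (user_id, document_id) download events.
DOWNLOADS = [
    ("alice", "doc_a"), ("alice", "doc_b"),
    ("bob", "doc_a"), ("bob", "doc_b"), ("bob", "doc_c"),
    ("carol", "doc_c"), ("carol", "doc_d"),
]


def build_co_download_matrix(events):
    """Count, for each pair of documents, how many users downloaded both."""
    docs_by_user = defaultdict(set)
    for user, doc in events:
        docs_by_user[user].add(doc)

    downloads_per_doc = defaultdict(int)  # distinct users per document
    co_counts = defaultdict(int)          # symmetric co-download counts
    for docs in docs_by_user.values():
        for doc in docs:
            downloads_per_doc[doc] += 1
        for a, b in combinations(sorted(docs), 2):
            co_counts[(a, b)] += 1
            co_counts[(b, a)] += 1
    return co_counts, downloads_per_doc


def cosine_similarity(a, b, co_counts, downloads_per_doc):
    """Cosine similarity between the binary user vectors of two documents."""
    shared = co_counts.get((a, b), 0)
    return shared / sqrt(downloads_per_doc[a] * downloads_per_doc[b])


def recommend(user, events, top_n=3):
    """Rank documents the user has not downloaded by summed similarity
    to the documents the user has downloaded (an assumed scoring rule)."""
    co_counts, downloads_per_doc = build_co_download_matrix(events)
    seen = {doc for u, doc in events if u == user}
    scores = defaultdict(float)
    for candidate in downloads_per_doc:
        if candidate in seen:
            continue
        for doc in seen:
            scores[candidate] += cosine_similarity(
                candidate, doc, co_counts, downloads_per_doc
            )
    # Highest score first; ties broken by document id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(doc, score) for doc, score in ranked if score > 0][:top_n]


if __name__ == "__main__":
    print(recommend("alice", DOWNLOADS))
```

With the sample log, recommend("alice", DOWNLOADS) returns a single suggestion, doc_c, because bob downloaded it together with both of alice's documents; doc_d shares no downloaders with anything alice has seen and is filtered out. At the volumes quoted in the slides (billions of events), the pairwise counting would need to be distributed, which is the role the talk assigns to Thor.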