Transcript Slide 1

©2014 Experian Information Solutions, Inc. All rights reserved.

Experian Confidential.

Data Hub

Enabling easy and safe access to Experian’s data

Greg Bonin

Principal Scientist | Experian DataLabs

©2014 Experian Information Solutions, Inc. All rights reserved. Experian and the marks used herein are service marks or registered trademarks of Experian Information Solutions, Inc. Other product and company names mentioned herein are the trademarks of their respective owners. No part of this copyrighted work may be reproduced, modified, or distributed in any form or manner without the prior written permission of Experian.


Introduction and overview

How do we cost-effectively and safely provide simple access to Experian’s internal data to clients and ourselves?

The Experian Analytical Sandbox™ – a case study

• What is it?

• How did we build it?

From Experian Analytical Sandbox™ to data hub

• Extending the Experian Analytical Sandbox™ to other parts of Experian

• Making the Experian Analytical Sandbox™ into a delivery platform


What is the Experian Analytical Sandbox™?

An ad-hoc environment where clients and internal users can access datasets such as the MAD (Monthly Analytic Dataset) and perform statistical analysis

Key design goals

• Dataset will be shared across many users (should be scalable)
• Underlying data will be anonymized (but real data)
• Dataset should contain all records (not a sample)
• Clients have their own environment, where they may bring in data
• Clients should not be able to pull data out of the system
• Clients must be able to access data through SAS


Experian Analytical Sandbox™ – Data requirements

What is the MAD data?

• Raw tradeline data (one record per trade per consumer)
• Various scores and attributes (one record per consumer)
• MAD data is a 10% sample of U.S. consumers and is typically produced monthly

How much storage do we need?

• We want to store 100% of the raw files
► One month of 100% file is approximately 10TB (uncompressed)
• Five years of monthly history needed for analytical use
• Our total storage needs are around 700TB!
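A back-of-the-envelope check of the storage figure, as a minimal sketch. It uses only the numbers on this slide (about 10TB per monthly file, five years of monthly history); the variable names are illustrative. The slide does not say what accounts for the difference between this raw total and the quoted figure of around 700TB.

```python
TB_PER_MONTH = 10      # one month of the 100% file, uncompressed (from the slide)
YEARS_OF_HISTORY = 5   # monthly history needed for analytical use (from the slide)

months = YEARS_OF_HISTORY * 12
raw_history_tb = TB_PER_MONTH * months

print(f"{months} monthly files x {TB_PER_MONTH}TB = {raw_history_tb}TB uncompressed")
```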


Experian Analytical Sandbox™ – Design overview

• Provides monthly depersonalized data
• Utilize Hadoop as a cost-efficient, scalable data store
• Access data through Hive
• Strong authentication via Kerberos
• Leverage Citrix to ensure all data stays within Experian

[Diagram: High-level lab design for the Analytic Sandbox. Labels: Experian Data Center; Anonymized Bureau; Premiers; TV Variables; ...; Database (HDFS-based; primary access method: Hive; contains Raw Tradeline, Raw Trendview); Client Services; Client Server; Demo Server (multiple clients, no client data); Client Data; Internal Users; External Users (use Citrix and SAS EG)]


What Do We Have Now?

Cluster Specs

• 30-node Hadoop cluster running CDH
• 128GB and 16 cores per data node
• 700TB total disk (usable ~230TB)

Cost

• ~$700,000 for hardware
• Funded by CIS

Usage

• Currently have one client (AMEX). Current contract recovers most of the initial cost.
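The gap between 700TB of total disk and roughly 230TB usable is about what block replication would produce. The sketch below assumes the HDFS default replication factor of 3; the slides do not state the replication setting, so treat that value as an assumption. The cost per usable terabyte is derived from the hardware figure above.

```python
TOTAL_DISK_TB = 700          # total disk (from the slide)
HARDWARE_COST_USD = 700_000  # approximate hardware cost (from the slide)
REPLICATION_FACTOR = 3       # assumption: HDFS default, not stated in the slides

# Each block is stored REPLICATION_FACTOR times, so usable space shrinks accordingly.
usable_tb = TOTAL_DISK_TB / REPLICATION_FACTOR
cost_per_usable_tb = HARDWARE_COST_USD / usable_tb

print(f"usable: ~{usable_tb:.0f}TB, cost: ~${cost_per_usable_tb:,.0f} per usable TB")
```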


Shared data store – Why Hadoop?


We need to store and access large amounts of data in a cost-effective way

• Works well with off-the-shelf hardware
• Can meet performance needs by adding servers
• Limited licensing costs

We want to make the data access easy and flexible

• Hadoop supports several SQL-like languages (Hive, Impala, etc.)
• We needed to integrate with SAS, which works with Hive

Usage pattern fits well with Hadoop


Technical challenges

Hadoop does not have strong authentication by default

• We used Kerberos to handle the authentication … which was painful to set up
• Complicates client applications as they need to support Kerberos
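To illustrate what Kerberos support means for a client, here is a minimal sketch of the connection string a JDBC client would pass to a Kerberos-secured HiveServer2. The host, port, database, and realm are hypothetical placeholders, and the slides do not say which driver the clients actually used. Only the `principal=` parameter of the `jdbc:hive2://` URL is standard Hive behavior. The client must also hold a valid Kerberos ticket (for example from `kinit`) before connecting.

```python
def hive_jdbc_url(host, realm, port=10000, database="default", service="hive"):
    """Build a HiveServer2 JDBC URL that requests Kerberos authentication."""
    # The principal names the server's Kerberos identity, not the user's;
    # the user's identity comes from the local ticket cache.
    principal = f"{service}/{host}@{realm}"
    return f"jdbc:hive2://{host}:{port}/{database};principal={principal}"

# Hypothetical host and realm, for illustration only.
url = hive_jdbc_url("hive.sandbox.example.com", "EXAMPLE.COM")
print(url)
```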

SAS and Hadoop are not ideal bed-fellows

• Pulling large quantities of data down through SAS is slow
• It is hard to force SAS to utilize the cluster efficiently
• Managing DB permissions with SAS is annoying


Case study – Using the Experian Analytical Sandbox™ to answer questions

“What is the trend of VantageScore® for people who recently obtained an auto loan?”

A simple SQL query was able to answer this question in 2.5 minutes

► Process involved joining a 2TB file with a 250GB file

Similar analysis using SAS on a single server could take 50 to 100x longer

[Chart: Mean VantageScore® V3 Score by date, with three series: Overall, Auto opened recently, Has auto. The score axis runs from 645 to 705.]
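A query of this shape can be sketched as follows. The table names, column names, sample rows, and the six-month definition of "recently opened" are all hypothetical, since the slides do not show the actual query or schema. SQLite stands in for Hive here so the sketch runs anywhere.

```python
import sqlite3

# Hypothetical schema: one score row per consumer per month,
# one trade row per trade per consumer per month.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE scores (consumer_id INTEGER, month TEXT, vantage_v3 INTEGER);
CREATE TABLE trades (consumer_id INTEGER, month TEXT,
                     trade_type TEXT, months_open INTEGER);
""")
conn.executemany("INSERT INTO scores VALUES (?, ?, ?)", [
    (1, "2014-01", 700), (2, "2014-01", 650), (3, "2014-01", 690),
    (1, "2014-02", 710), (2, "2014-02", 660), (3, "2014-02", 680),
])
conn.executemany("INSERT INTO trades VALUES (?, ?, ?, ?)", [
    (1, "2014-01", "AUTO", 2),   # auto loan opened recently
    (2, "2014-01", "AUTO", 40),  # older auto loan
    (1, "2014-02", "AUTO", 3),
    (2, "2014-02", "AUTO", 41),
    (3, "2014-02", "CARD", 1),   # no auto loan
])

# Collapse trades to one flag row per consumer and month, then join to scores.
QUERY = """
SELECT s.month,
       AVG(s.vantage_v3) AS mean_overall,
       AVG(CASE WHEN a.has_auto = 1 THEN s.vantage_v3 END) AS mean_has_auto,
       AVG(CASE WHEN a.recent_auto = 1 THEN s.vantage_v3 END) AS mean_recent_auto
FROM scores s
LEFT JOIN (
    SELECT consumer_id, month,
           1 AS has_auto,
           MAX(CASE WHEN months_open <= 6 THEN 1 ELSE 0 END) AS recent_auto
    FROM trades
    WHERE trade_type = 'AUTO'
    GROUP BY consumer_id, month
) a ON a.consumer_id = s.consumer_id AND a.month = s.month
GROUP BY s.month
ORDER BY s.month
"""
rows = conn.execute(QUERY).fetchall()
for month, overall, has_auto, recent_auto in rows:
    print(month, round(overall, 1), has_auto, recent_auto)
```

The three averages correspond to the three series in the chart: Overall, Has auto, and Auto opened recently.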


Building Experian Analytical Sandbox™ environments for other parts of Experian

• Opportunities exist to build more sandboxes
• Building more sandboxes across Experian’s data assets will allow broad, safe access to data
• We believe this would lead to increased opportunities for innovation

Candidate sandboxes (Experian Analytical Sandbox™ name, type of data, potential use):

Business Information Services
• Type of data: Raw trade line data
• Potential use: Similar to use case for Experian Analytical Sandbox™ except less regulatory sensitivity

Healthcare
• Type of data: Claims and eligibility checks from Experian Healthcare
• Potential use: Provide researchers or private parties a rich data set to analyze

Digital Advertising
• Type of data: IP impression information (Audience IQ℠); Device IDs (41st Parameter®)
• Potential use: Allow third parties to use this data for model-building or reporting

ConsumerView℠
• Type of data: Monthly-trended ConsumerView℠ data
• Potential use: Provide insight into changes in demographic data over time


The “Cloud” landscape

[Diagram: the “Cloud” landscape. Recoverable labels: Client-driven, Data, Proprietary]

From Experian Analytical Sandbox™ to Data Hub

• Extending our design will allow solutions developed in the Experian Analytical Sandbox™ to be deployed
• Using Experian tools will allow quick deployment of models
► Example: Model outputs written in PMML would allow quick deployment

[Diagram: Experian Data Hub. Labels: Source Data Systems; Depersonalization Process; Depersonalized Data (Credit, Medical, etc.); Ad-hoc Modeling Tools; Experian Modeling Tools; Model and Attribute Definitions; Batch; Data Hub Production System; Linking DB; Client Systems; Consumer Data for “new model”; Personalized Data; Personalized Model Results]
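To make the PMML example concrete, here is a minimal sketch of a linear model written out as a PMML 4.2 `RegressionModel` and then scored from that document alone, which is the hand-off that would let a production system run a model built elsewhere. The field names, coefficients, and helper functions are invented for illustration, and the slides do not say which tools would produce or consume the PMML. The `score` helper reads only a single numeric regression table and is not a general PMML evaluator.

```python
import xml.etree.ElementTree as ET

PMML_NS = "http://www.dmg.org/PMML-4_2"

def to_pmml(intercept, coefficients, target="risk_score"):
    """Serialize a linear model as a minimal PMML 4.2 RegressionModel."""
    root = ET.Element("PMML", {"version": "4.2", "xmlns": PMML_NS})
    ET.SubElement(root, "Header", {"description": "Illustrative sandbox model"})
    dictionary = ET.SubElement(root, "DataDictionary")
    model = ET.SubElement(root, "RegressionModel",
                          {"functionName": "regression", "modelName": "sketch"})
    schema = ET.SubElement(model, "MiningSchema")
    # Declare every input field plus the target in both dictionary and schema.
    for name in list(coefficients) + [target]:
        ET.SubElement(dictionary, "DataField",
                      {"name": name, "optype": "continuous", "dataType": "double"})
        usage = "target" if name == target else "active"
        ET.SubElement(schema, "MiningField", {"name": name, "usageType": usage})
    table = ET.SubElement(model, "RegressionTable",
                          {"intercept": repr(float(intercept))})
    for name, coef in coefficients.items():
        ET.SubElement(table, "NumericPredictor",
                      {"name": name, "coefficient": repr(float(coef))})
    return ET.tostring(root, encoding="unicode")

def score(pmml_text, record):
    """Evaluate the RegressionTable of a PMML document against one record."""
    ns = {"p": PMML_NS}
    table = ET.fromstring(pmml_text).find(".//p:RegressionTable", ns)
    total = float(table.get("intercept"))
    for pred in table.findall("p:NumericPredictor", ns):
        total += float(pred.get("coefficient")) * record[pred.get("name")]
    return total

# Hypothetical model and record, for illustration only.
pmml_doc = to_pmml(600.0, {"utilization": -50.0, "months_on_file": 0.5})
result = score(pmml_doc, {"utilization": 0.4, "months_on_file": 120})
print(result)
```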


Conclusion

• The Experian Analytical Sandbox™ is one way to make Experian’s internal data easier to access and use
► Making access to data easier reduces barriers to innovation
• Extending the functionality of the Experian Analytical Sandbox™ could lead to a new way of using Experian data
► Easy and safe access to raw data can allow clients to understand their customers better
► Streamlined deployment can make those insights actionable


#FOIC2014

Greg Bonin

Principal Scientist | Experian DataLabs

e: [email protected]

t: (858) 314-2613
