Designing and Creating
applications built on R
Richard Pugh, Andy Nicholls & Chris Campbell
23rd October 2012
Thank you for
the invitation
to speak
tonight
Andy Nicholls
Senior R Consultant
Richard Pugh
Principal R Consultant
& Co-Founder
Chris Campbell
Senior R Consultant
Agenda
• Who are Mango Solutions?
• Why Build Analytic Applications on R?
• Formal R Application Development
• Case Studies
• The R Community
• Discussion
Who are Mango Solutions?
Overview of Mango Solutions
• Private Company formed in 2002
• Global Team of ~70
• Cross-Sector Software and Services
• ISO 9001 Accredited
Located here ...
Bath, UK
London, UK
Shanghai, CN
Basel, CH
Spend a lot of time here ...
The Beginning: October 2002
• Started by 2 ex-Insightful
colleagues
• Sales Guy (BO, Cognos etc)
• Techy Guy (S+, SAS, R etc)
• Idea to deploy predictive
analytics to business users
Why Mango?
• Early awful ideas
• DataStatz
• Stats Entertainment
• VizUStat
• Stats2U
• In the end, named after my
colleague’s cat
Growth of Mango Solutions
[Chart: number of employees by role (Consultants, Developers, PMs, Testers, Support, Other)]
2006: Total = 10
2010: Total = 28
2012: Total = 71
What we do?
R Training
Code Creation
Consultants
Validation
Support
What we do?
Consultants
Developers
Analytic Application Development
Mango Key Industries
• Mango work across sectors:
• Pharmaceuticals
• Mango Imaging
• Finance
• Energy
• Sensory
Why Build Analytic Applications
on R?
Why Analytics?
• Analytics can help people answer all sorts of questions
• I believe there is no company in the world today who
cannot benefit from analytics in some way
• More and more people are realising it
Who is a good driver?
What bonus should I pay?
Will someone like this?
When might this break?
How do we win more games?
What are they likely to want?
Why build Analytic Applications?
• 3 key reasons we see:
• To deploy analytical tools to decision makers
• To make an analyst’s life more efficient
• To add rigour to an analyst’s workflow
Deploying Analytics
• Adding analytics into a business process can
mean more informed decisions can be made
• Complex analytics shouldn’t be attempted by
non-analysts
• Means there is a communication between the
decision maker and the analyst
Deploying Analytics
• If we build an application which …
• is easy for the decision maker to use
• contains the correct analysis to apply
• communicates analytical results in suitable manner
• … this leads to some major benefits!
Benefits for the Decision Maker
• No need to wait for information
• Can perform “what if” analysis
• Decision not dependent on analyst availability
Benefits for the Analyst
• Less need to perform often-repetitive tasks
• Comfortable that the “right” analysis is being run
• Can get on with more strategic things?
Analytic App Structure
[Diagram: User Interface, Analytic Outputs, Data, Analytic Engine, Data Storage, Code Management, Analytic Code]
Why build Analytic Applications on R?
Building applications requires installing analytic engine on
desktops, servers, clusters, clouds
R is license free
Building analytic applications involves integrating an analytic
engine with other technologies (data sources, UI etc)
R’s open nature means it can be readily integrated
Why build Analytic Applications on R?
We want a programmable engine so that it can be readily
extended (i.e. no black boxes please)
R can be extended by the developer as needed
We often want to be able to deploy new algorithms and
techniques as they become available
R is rapidly developed
Formal R Application Development
Formal R Development
• Creating sophisticated analytic applications requires a
formal development approach
• This mostly means taking standard development
practices and applying them to analytics
• Mango’s formal R development procedures and structure
have been evolving since their inception ~2004
[Diagram, shown in three builds: Quality Manual, Requirements, Project Management, Issue Tracking, Behaviour Driven Dev Procedures, R Coding Standards, StatET, RUnit (testthat in the final build), roxygen2, mangoUtils, Continuous Integration, Code Review, Review Board, Knowledge Management]
Case Studies
Case Studies
• These are examples of applications we’ve built that use R
in some way
• We’ll present a range of information about each
including:
• Business Reason for the application
• Technical Approach
• Some Technical Detail where applicable
• Things that worked well / things that didn’t
Case Studies
• Ranges from information we can fully disclose to only being
able to say vague things about the customer
• Only so much info we can give today – please see us after
or contact us and we can step through things in more detail
Richard Pugh = [email protected]
Andy Nicholls = [email protected]
Chris Campbell = [email protected]
Case Studies
• PKPD Web Modelling Platform
• M&S Workflow Platform
• Non-Compartmental Analysis Application
• Coffee Blend Optimisation Tool
• Pipeline Corrosion Forecasting Application
• Backtesting Application
CASE STUDY
PKPD WEB PLATFORM
Case Study: PKPD Modelling
Overview
• Pharmacokinetics-pharmacodynamics (PKPD) is
the study of the manner in
which a drug transitions through
the body and its impact on a
target disease
• PK is highly complex, involving
sophisticated non-linear mixed
effects modelling approaches
Case Study: PKPD Modelling
Overview
• Modellers use “NONMEM” software in order to fit
these models
• Inputs and outputs to NONMEM are a mixture of
structured and unstructured textual files
• R often used to analyse the outputs in order to
assess model fit (see “xpose4” library)
Case Study: PKPD Modelling
Overview
• PKPD is an evolving and exciting area, with
modellers needing flexibility and a variety of tools
• However, being within life sciences, rigour around
workflows is key in order to satisfy regulatory
requirements
Case Study: PKPD Modelling
The Challenge
• Build a modern modelling platform that provides rigour
whilst allowing the modellers the flexibility they need
• Range of technical users from “everything is a shell
script” to “which button do I click”
• Execution of third party tools (NONMEM, R, SAS, PsN, …)
in a controlled manner
• Interface to generate reproducible graphics, tables and
reports
Case Study: PKPD Modelling
The “R” bit
• Where does R fit in?
• Many users use R and want to be able to
develop scripts and execute them on an
internal grid
• R used as the graphics engine to support the
model evaluation and reporting processes
• Users want to be able to execute R
interactively with objects in their project
[Diagram: App Server (The App), Execution Server(s) (RPool Manager, MIF Queue, MIF), Cloud + Others, Grid + Others]
Case Study: PKPD Modelling
What is a “Report Item Definition”
• Definition of a graph or table that can be executed
from Navigator
• Consists of snippet of R code, options that may be
presented to the user, required columns, and a few
other bits
• Can be used in a number of situations in the
application
• Originally XML then stored in Db (XML shown to give a
feel for structure on next slide)
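A report item definition of this kind might be sketched in R roughly as follows. This is an illustration only: the field names (`code`, `options`, `required_columns`) and the helper `run_report_item` are hypothetical and are not the application's actual schema.

```r
# Hypothetical report item definition: an R snippet plus the metadata
# needed to present options and validate the source data.
rid <- list(
  name = "conc_summary",
  type = "table",
  required_columns = c("TIME", "CONC"),
  options = list(digits = 2),
  code = "round(tapply(data$CONC, data$TIME, mean), opts$digits)"
)

# Check the required columns, then evaluate the snippet in a fresh
# environment holding `data` and `opts`.
run_report_item <- function(rid, data, opts = rid$options) {
  missing_cols <- setdiff(rid$required_columns, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing required columns: ", paste(missing_cols, collapse = ", "))
  }
  env <- new.env(parent = globalenv())
  env$data <- data
  env$opts <- opts
  eval(parse(text = rid$code), envir = env)
}

# Made-up source data: mean concentration per time point.
example_data <- data.frame(TIME = c(0, 0, 1, 1),
                           CONC = c(1.234, 1.0, 2.5, 3.5))
result <- run_report_item(rid, example_data)
print(result)
```

Keeping the snippet as data rather than as a function is what would let such definitions be stored (as XML or in a database) and versioned separately from the application.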
[Diagram: The App / RPool Manager sends a Command Definition, Report Options and Source Data to the Execution Engine (Java). Command Definitions (xml + Method) exist for Text, Data, Table and Graph items and are held under Version Control. Command Results: Text Item (Character), Data Item (Data Frame), Table Item (Table Object), Graph Item (Graphics)]
Case Study: PKPD Modelling
How are “RIDs” used?
• Created, managed by Super Users (under version control)
• Called in a few places in the application:
• Directly (create this graph with this data)
• In “Run Views” (reports)
• In “Comparison Views” (reports that compare models)
• In “Template Reports” (tagged docx files)
Case Study: PKPD Modelling
Outcome
• The app in general was a big success
• The “R” part was created as a separate service that we have
since reused in a number of other applications (e.g. Lloyds
Risk Platform!)
• Shame that regulatory rules forced some design which we’re
now building alternatives to
• Next: interactive graphical presentation
CASE STUDY
M&S WORKFLOW PLATFORM
Case Study: M&S Workflow Platform
Overview
• Exciting project for major pharmaceutical company
• Possibly the closest we’ve come to deploying an analyst’s
workflow in a scalable platform
• Hundreds of pre-clinical (animal) studies are run by a team
of ~400 scientists
• Analysis performed by roughly 15 advanced modellers
• Outcome: most studies not analysed!
Case Study: M&S Workflow Platform
The Challenge
• Idea to create a truly scalable platform to allow bench
scientists to run their own analysis
• Modeller publishes an analysis “protocol” containing analysis
paths, code, and support documentation
• Desktop application pulls from central set of protocols and
“derives” the interface which is presented to the user
• Modeller can put in checks to ensure things look right (e.g.
data is of the right format, or model fit is particularly poor but
the user seems keen to create predictions from it)
Case Study: M&S Workflow Platform
The Solution
• Eclipse RCP application executing R and NONMEM scripts on
an internal LSF grid, with protocols and code held in SVN
• Generated workflow “protocol” definition (XML) detailing
possible paths in a step, linked to R scripts and NONMEM
model code with corresponding dialog
• Built “Protocol Developer” Eclipse interface onto repository
• RCP application derives analysis paths, UI, options and
commentary to guide the end user
[Diagram: a Protocol (Metadata, Workflow) contains a Data Check Step (R Script, Options) and an Analysis Step (R Script, NM Model File, Options, Commentary). The Modeller publishes via the File System to the Protocol Server; the Scientist sees Possible Models, Derived Options and Commentary; NONMEM runs on the LSF Grid]
Case Study: M&S Workflow Platform
How did it go
• Technical solution was very strong and applicable to
other areas
• RCP good technology, but steep learning curve
• Testing was complex
• Agile project – pros and cons
• Ultimately, not deployed (site closure)
CASE STUDY
NON-COMPARTMENTAL
ANALYSIS
RapidNCA, the non-compartmental
analysis workflow tool
• Need for RapidNCA
• Using .NET
• RapidNCA Structure
• Code Quality
• Connections with R.NET
• Complete & Deploy RapidNCA
Need for RapidNCA
• Customer needed to send monthly reports to dozens of
trial centres
• Small team, so time limited
• Predefined non-compartmental
analysis
• Standardized report
Using .NET
What is .NET?
• Object-oriented environment to develop applications
• Safe execution environment
• Choice of programming languages
• Framework consisting of:
• runtime
• class library
• Developed with Visual Studio
Using .NET
Visual Studio
• A graphical programming tool (IDE)
• Visual Studio Express - free version
Using .NET
Choice of languages
• C# is the main one
• F# is a functional language (similar concepts to OCaml)
• XAML (a Microsoft declarative XML language) for interactive
graphics
• C++/CLI useful for legacy and bespoke parallel processing
(including GPGPU)
• Other possibilities...
• Vb.Net is very like C# (no advantage over it)
• Third parties have added languages to the CLI platform
Using .NET
“Ajar Source” Platform
• Not exactly open source, but…
• Most CLI third party languages are open
• C# and VB.Net are not, but many open source projects
based on them
• Microsoft have made F# open source
• Compiler is free
• Other editors / IDEs are available
Using .NET
Performance
• Performance is very good
• On graphics (millions of data points will plot with ease and
zoom smoothly)
• Computation is fast enough in C#, calling R adds little overhead
• Standard Maths library is limited; third parties and MS maths for
“drawing” are better
• Data parallel computation is possible on the desktop (GPGPU)
• F# provides further “big data” capabilities
RapidNCA Structure
[Diagram: Code Management, User Interface, Analytic Outputs, Data, Data Service, Data Storage, Analytic Code, Analytic Engine]
RapidNCA Structure
MangoNca Analytic Code
[Diagram: Data Checks, Analyse Element, Do Analysis, Unit Tests, Get Analysis]
Code Quality
Unit Tests
• Ensure product works!
• User/Customer/Payer trust
• Ease of maintenance/extension
Code Quality
Run Code, Check Output
> require(RUnit)
> # there are other automated test packages!
• Working Cases
> test1 <- ncaAnalysis(Conc = c(4, 9, 8, 6, 4:1, 1),
+     Time = 0:8, Dose = 100, Dof = 2)
> checkEquals(test1[1, "ROutput_adjr2"], 0.9714937901,
+     tol = 1e-8)
[1] TRUE
Code Quality
Error Case Unit Tests
• Use try
> test7 <- try(AUCLast(Conc = 1:10, Time = 9:0),
+     silent = TRUE)
> checkEquals(test7,
+     "Error in checkOrderedVector(Time, ... ")
[1] TRUE
• Handled Error Cases
> test26 <- ncaAnalysis(Conc = c(4, 9, 8, 6, 4:1, 1),
+     Time = 0:8, Dof = 1)
> checkEquals(test26[, "ROutput_Error"],
+     "Error in checkSingleNumeric(Dose, ... ")
[1] TRUE
Connections with R.NET
• What will be provided to R?
• What will be returned from R?
• What happens if something goes wrong?
Connections with R.NET
Using the R Service
• R.NET allows R calls to be submitted to an R service
• R.NET connects to R down to Expression level
• Data from return objects passed back into .NET
Connections with R.NET
Data Checks
• Function may be passed data outside its anticipated
structure
> checkOrderedVector(c(0, 1, 3, 2, 4),
+     description = "Time")
Error in checkOrderedVector(c(0, 1, 3, 2, 4), description = "Time") :
  Error: Time is not ordered. Actual value is 0 1 3 2 4
>
Connections with R.NET
Data Checks
• The tool expects a certain return object
• An error in an R call should be trapped by the
communicating function
> check01 <- try(checkOrderedVector(Time,
+     description = "Time"), silent = TRUE)
> if (is(check01, "try-error")) { return(object) }
• Return object passed as normal
• An error checking element of the return object can report
information about the error
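The trapping pattern above could look roughly like this. It is a sketch only: this `checkOrderedVector` is a stand-in written for illustration (the real function body is not shown in the slides), and `safeAnalysis` and its `Cmax` column are hypothetical names. `inherits()` is used in place of `is()` so that the sketch depends on base R only.

```r
# Illustrative stand-in for the data check shown above.
checkOrderedVector <- function(x, description = "Vector") {
  if (is.unsorted(x)) {
    stop(description, " is not ordered. Actual value is ",
         paste(x, collapse = " "))
  }
  invisible(TRUE)
}

# Hypothetical wrapper: always returns a data frame of the same shape,
# with any error message carried in the ROutput_Error column.
safeAnalysis <- function(Conc, Time) {
  object <- data.frame(Cmax = NA_real_, ROutput_Error = "",
                       stringsAsFactors = FALSE)
  check01 <- try(checkOrderedVector(Time, description = "Time"),
                 silent = TRUE)
  if (inherits(check01, "try-error")) {
    # Report the problem through the return object instead of failing.
    object$ROutput_Error <- conditionMessage(attr(check01, "condition"))
    return(object)
  }
  object$Cmax <- max(Conc)
  object
}

ok  <- safeAnalysis(Conc = c(4, 9, 8), Time = 0:2)
bad <- safeAnalysis(Conc = c(4, 9, 8), Time = c(0, 2, 1))
print(bad$ROutput_Error)
```

Because the calling application always receives an object of the same structure, it only has to inspect one error field rather than handle an R-level failure.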
Connections with R.NET
_pluginsManager = new RPluginManager(PluginLocation, RLocation);
_pluginsManager.SetActivePlugin();
_session = _pluginsManager.GetSession();
bool sessionOk = _pluginsManager.TryMakeSession(out _session);
• R is efficiently accessed via R.NET (as pictured in Visual
Studio) through a plugin (as above)
Connections with R.NET
[Diagram: User Interface, Analytic Outputs, Data, Data Service, Data Storage, Code Management, Analytic Code and Analytic Engine, with R.NET connecting the Data Service to the Analytic Engine]
Connections with R.NET
.NET Data Service
[Diagram: Validators, Project Wizard, Data Service, Analysis Display, Get PK Params, R.NET, Data Importers, Receive R Output, Create R Expressions, Dialog Service, App Logger, Status Bar Service, App Config Management]
Connections with R.NET
Using the framework
_pluginsManager = new RPluginManager(PluginLocation, RLocation);
_pluginsManager.SetActivePlugin();
_session = _pluginsManager.GetSession();
bool sessionOk = _pluginsManager.TryMakeSession(out _session);
_session.SetNumericSymbol("TimePtVector", CheckTimePointData(toAnalyse));
_session.SetNumericSymbol("ConcVector", CheckConcentrationPointData(toAnalyse));
var evalString = string.Format("ncaAnalysis(TimePtVector, ConcVector, …
MathEngineDataRowDto<double> ncaGetBack =
_session.PerformNumericEvaluation(evalString, "ROutput_Error");
_lastErrors = ncaGetBack.ErrorStrings;
_session.FlushConsole();
_pluginsManager.RelinquishSession();
Complete & Deploy
RapidNCA
• Can users understand how to use tool?
• How confident are we in tool output?
• On-going code review
• Independent test team
• Installation Qualification
• Operational Qualification
• Performance Qualification
Deploy Tool
Data Import
Map Variables
Review Analysis
Review Grouping
Generate Report
Select Report Type
Add Group Comments
View Report
Conclusions
• Great graphical interfaces can be built using .NET
• Intuitive interactive features are available
• R.NET allows R analysis to be accessed as a service
• Good coding practice will ensure application is robust
• Work on a well engineered framework will be rewarded
with desktop solutions created at high speed
CASE STUDY
COFFEE BLEND OPTIMIZATION
Company Background
• A global chocolatier, biscuit baker, candy maker and
maker of gum.
Business /Technical Situation
• The client was using a desktop SPLUS application to simulate
and optimise coffee blends for their manufacturing teams
• Hugely successful application saving the company $millions
• They wanted to make improvements and expand the usage
beyond Global Statistics and beyond coffee
• Also keen to remove the license fee
Application Workflow
• Import Data from Excel
• Simulate Blends
• Run Blend Optimiser
• Graphical Visualisations
• Export Data
• Audit Log
System Architecture
[Diagram: Data Import, R Package (Functions for GUI, Functions for Analysis), Optimizer, Data Export]
Approach
• Development phase split into three separate pieces:
• Code conversion
• GUI creation
• Development and integration of a new optimiser
• Each required the generation of unit and system tests and
appropriate documentation, including help files
• Design specifications captured prior to development
• Project estimated at c. 90 man days over 3 months
Creation of new GUI
GUI Choices
Some R/R-based technologies we could have used...
• tcltk is R’s ‘recommended’ menu builder
• Glade, RGtk2
• gWidgets
• rpanel
• Deducer
• manipulate (RStudio)
• ...
GUI Choices
Other options:
• Choice is almost limitless
• Often they require a knowledge of other languages such
as Java or C
• Possibly warrants a standalone talk...
Creation of a New GUI using RGtk2
• RGtk2 is an R adapter for the GTK+ engine
• GTK = GIMP Toolkit
• Glade can be used to trial new features
• GTK allows for automated testing of the GUI
• Huge time saving
Code Conversion
Mango took a test-based approach for the code conversion
(RUnit)
• Allows for automated testing in future revisions
• Simple PASS/FAIL reporting
• SPLUS knowledge not required for R code development
Optimization
• The original SPLUS application used the SPLUS NuOpt
optimizer
• R NuOpt exists but only under license
• Mango used an open source optimiser that we integrated
into the R GUI
• Mango implemented a ‘quick run’ option to allow quick
comparisons with the simulation piece
Primary Benefits
• New departments are now benefitting from the
application
• The application is now in the hands of the manufacturing
teams, reducing the burden on Global Statistics
• Test-based approach facilitates future development of
the application
CASE STUDY
PIPELINE CORROSION APP
Background
• One of the biggest companies in the world with thousands
of staff
• Oilfield Exploration Team based in the UK but with
responsibility for complex exploration areas
• Alaska, shale fields etc
Business Situation
• Thousands of miles of pipeline corroding in freezing,
isolated areas
• How do you choose how often to inspect them?
• The cost of a leak can run into many billions of £s
Technical Situation
• Customer Team were analysing data using S-PLUS
Insightful Miner with many non-analytical workarounds
• Process was messy and took a long time to run
System Architecture
• This piece is one of several in a continuous workflow
• All information is fed back into the database
[Diagram: R Package (General Workflow, Functions for GUI, Functions for Analysis) with Read and Write connections to an Access Database]
Approach
• Consulting engagement to improve programming
techniques and statistical methodology
• Create an R package for the code
• Construct a GUI in order to deploy to non-technical users
on the frontline
An Interesting Challenge:
Converting S-Plus Code to R
This is Easy, Right?
Some (true?) statements:
• R can be considered as a different implementation of S
• There are some important differences, but much code
written for S runs unaltered under R
Discuss...
Source: www.r-project.org
Considerations
S+ applications can generally be split into two pieces:
• An underlying library of code
• A set of functions defining the menu system and help
pages
Approach
There are essentially two approaches to code conversion:
• Direct Conversion
• Test-based Conversion
Direct Conversion
• Requires knowledge of both languages (stdev vs sd)
• Relatively quick to achieve
• Difficult to prove the new code does what the old code
did
Test-based Conversion
• Generating unit tests in S+ requires some S+ knowledge
• Takes some time to generate and document tests but
better in the long-run
• Unit tests give a definitive PASS/FAIL result
• Can often be automated
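One way to picture the test-based approach: record inputs and outputs from the legacy system as reference cases, then replay them against the R code and report PASS/FAIL. The reference values and the `runReferenceCases` helper below are made up for illustration; in a real conversion the expected values would be exported from S+.

```r
# Reference cases: inputs plus expected outputs. The expected values are
# illustrative; in practice they would be recorded from the S+ system.
referenceCases <- list(
  list(name = "sd of 1:10",         input = 1:10,          expected = 3.02765),
  list(name = "sd of evens",        input = c(2, 4, 6, 8), expected = 2.581989),
  list(name = "deliberately wrong", input = c(1, 2, 3),    expected = 2)
)

# Replay every case against `fun` and produce a PASS/FAIL report.
runReferenceCases <- function(cases, fun, tolerance = 1e-6) {
  passed <- vapply(cases, function(case) {
    isTRUE(all.equal(fun(case$input), case$expected, tolerance = tolerance))
  }, logical(1))
  data.frame(test = vapply(cases, function(case) case$name, character(1)),
             result = ifelse(passed, "PASS", "FAIL"),
             stringsAsFactors = FALSE)
}

report <- runReferenceCases(referenceCases, sd)
print(report)
```

The same report can be regenerated after every code change, which is what makes the approach easy to automate.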
Code Conversion Challenges
• The application upgrade usually coincides with an
operating system upgrade
• Windows (or other) version and R version need to be
determined in advance
• It is almost guaranteed that the new system will produce
different results for the same test data!
What is “different”?
• Often this is simply rounding
• Still require agreement on precision: 0.049782 vs
0.050436
• If simulation is involved this can be VERY difficult to
define!!!
• Appearance of graphics may also differ
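Agreement on precision can be made concrete by writing the tolerance down as a number. A small sketch using the two values quoted above; the helper `agreesWithin` and the tolerances chosen are assumptions for illustration only.

```r
# TRUE when `new` lies within a relative tolerance of the legacy value `old`.
agreesWithin <- function(old, new, relTol) {
  abs(new - old) <= relTol * abs(old)
}

old <- 0.049782   # value from the legacy system
new <- 0.050436   # value from the converted code

relDiff <- abs(new - old) / abs(old)      # roughly 1.3%
strict  <- agreesWithin(old, new, relTol = 0.001)  # 0.1% tolerance
relaxed <- agreesWithin(old, new, relTol = 0.02)   # 2% tolerance
print(c(relDiff = relDiff, strict = strict, relaxed = relaxed))
```

The same pair of numbers fails a 0.1% tolerance and passes a 2% one, so the verdict depends entirely on what the business owner has agreed to in advance.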
Other Challenges
As the business owner I want to use the opportunity to
improve the application:
• New menu items
• New functionality
• Modifications to existing functionality
All of these require careful planning
Primary Benefits for Customer
• Rationalised code base means the analysis is quicker and
extensible by end-users
• Construction of a front-end has enabled rollout to users on
the front line in Alaska
• Conversion to R has removed license cost
CASE STUDY
BACKTESTING APP FOR HEDGE
FUND
Case Study: Backtesting App
Overview
• Backtesting has a key role to play in the testing of
automated trading strategies
• Asked by a Hedge Fund Manager to build for his team
of users (who love Excel)
• Mango were asked to build a backtesting platform
that was more sophisticated than what was on offer
from other vendors
• Sorry that the details may be occasionally sketchy in
this section
Case Study: Backtesting App
The Challenge
• Key parts of the challenge included:
• Integration with standard finance data streams
• Advanced portfolio optimisation
• Flexibility to define automated strategy
• Transaction-cost based benefit analysis
• Leverage of financial hurdle
• ARCH-style error incorporation
• Advanced reporting
[Diagram: .NET Interface, RdotNet, C Interface!, .Rda Files, Data Storage, Alpha Storage, Data Flow]
How I learnt apply functions!!
Some hacky code here …
Case Study: Backtesting App
The Outcome
• Very successful hedge fund
• Convinced the users to use R – UI dropped!
The R Community
IP Considerations
• IP based on R includes:
• New libraries & code
• New scripts
• Mango attempt to open source
(with client permission) any
“R-side” generic functionality
• Also feedback and assist
library authors
[Diagram: User Interface, Analytic Code, New R Libraries]
Great Example
• MSToolkit library built for Pfizer
• Funded by Pfizer, built by Mango
• Released as open source library
• Since extended by other companies
R Community
• Contribute code where allowed/useful
• Sponsor R conferences and events
• Provide free training courses / webinars
• Organise and fund many R user groups
(LondonR, BaselR, ZurichR, ShanghaiR,
NewJerseyR, …)
The End!
Summary
• Thank you for the invitation
• Hope the discussion was useful
• We could only cover a certain amount of detail in the
time, so ask us for more if interested!
Richard Pugh
[email protected]
Chris Campbell
[email protected]
Andy Nicholls
[email protected]