R in practice: language, environment
and a Telecom case
André Andrade Baceti
March 25, 2015
Start-up
About 1 year old
Development of
explanatory and
prediction models
Project delivery/billing
methodologies
Call center
demand
Pre-defined scope
projects
Traditional project structure
(heritage from traditional IT
projects)
Credit
score
Forecast as service
Billing by forecast and
assertiveness
Identification of
optimality frontiers
Development of in-
company data science
labs
Creation of a basic structure
for mining modeling
opportunities in-company
…
In-company maintenance
Company Structure
Electrical Engineer (Saint Petersburg State University, Russia), PhD from UFRJ
Worked as a researcher in the ATLAS experiment at CERN (Higgs boson)
Board
Eugenio Caner
Currently working as board president at Murabei
Physicist (USP) with a specialization in astrophysics
Headed projects with multinationals in different economic sectors
Roberto Ono
Currently focused on modeling and working with big data
Biologist, master's in microbiology with emphasis on biosensor development,
currently finishing a second degree in mathematics applied to signal and system control
Worked in different data science projects
André Baceti
Currently focused on the development of PumpWood (this lecture)
Statisticians
Other members
of Murabei
Web Developer
IT professional
Summary
R in practice
Environment
Language
Development
Platforms
Syntax
Overview
Speed
Community
Memory
Consumption
Project overview
Telecom case
R Development Challenges
Proposed solution: PumpWood
PumpWood Overall Results
Packages
Learning
Curve
In Company
Usage
R Environment
R is a language and environment for statistical computing and graphics
Operating Systems
Nice graphics
Community
Stackoverflow
78,792 R
3,651 SAS
1,137 Stata
550 SPSS
O'Reilly Data Scientist Survey for 2014
R Environment
Packages
Integration
Database querying
Access
SQL Server
Teradata
Sqlite
PostgreSQL
Oracle
MySQL
Hadoop
MongoDB
…
In-database execution
Oracle
Exasol
“Tableau”
…
Execution from other programs and languages
SAS
Stata
SPSS
Python
Perl
…
Hornik, 2012
Other Programming languages in R
2015
6218
Packages
Available
Fortran
C
Python
C++
Julia
Java
…
R Language
R syntax is similar to other programming
languages such as C and Java: statements are
enclosed in braces.
If Else
if (condition) {
statement 1
} else if (cond.) {
statement 2
}
For
No need to define
size of arrays in
declaration
Class definitions are tricky,
not so easy to work with
for (i in 1:5) {
statement
}
But it has its own ways…
All primitive
variables are arrays
by definition
Class-based programming is available too
There are some tutorials
on the internet
R is optimized to
deal with arrays
Generic functions
(This one is nice)
For and while
loops are slow
Ex.: summary(obj)
predict(model)
Operations made with
entire arrays and apply
functions are much faster
summary(obj) behaves
differently
depending on the class of obj
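The generic-function idea on this slide (one call such as summary(obj), different behavior per class) can be illustrated outside R. The following is a minimal sketch in Python, offered only as an analogy: functools.singledispatch selects an implementation from the class of the first argument. The summary function and its two methods are hypothetical stand-ins, not R's summary.

```python
from functools import singledispatch


@singledispatch
def summary(obj):
    """Fallback used when no class-specific method is registered."""
    return f"object of class {type(obj).__name__}"


@summary.register
def _(obj: list):
    # Method for lists of numbers: report size and mean.
    return f"n={len(obj)}, mean={sum(obj) / len(obj)}"


@summary.register
def _(obj: dict):
    # Method for dicts: report the field names.
    return "fields: " + ", ".join(sorted(obj))
```

Calling summary on a list, a dict, or anything else runs a different method each time, which is the behavior the slide attributes to summary(obj) and predict(model) in R.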
R Language
Speed
Memory consumption
R is slow
There are some paid alternative
implementations of the language
R runs
on RAM
Has problems
managing
memory
Limiting
factor for
GLMs mainly
Solutions
Open source packages
You can always go back
Implement time-consuming steps
in C or Fortran and wrap them in an R
function
Low-quality code
Just analyze data one time
No need for optimizing
and documentation
Paid
Closed source
ff
ffbase
biglm
Disk-stored
datasets
Statistics for ff
objects
On-disk GLM
models
Memory consumption… still a problem?
In memory database era…
R Language
Learning Curve
R is difficult?
More difficult
SPSS
SAS
Some R point and click
R
When you don’t have any
programming skills… yeah!
Can you build a data
science project without
programming skills…
I don’t think so!
Point and click
In Company Usage
IT Departments
Resistance to open
source technologies
OK… 6218 packages
are too many to
certify
FDA recommends
basic packages for
drug trial studies
Like to pay?
Closed source
Telecom case: Project overview
2x per month
Project overview
Call center
demand
forecast
Predict 60
days ahead
total incoming
calls by day
Two different
areas
Area 1
24 different skills
Area 2
15 different skills
Alignment with affected areas
Follow up meetings
Model improvement
Client tested a proprietary modeling tool
“Modeling toolbox”
Needs a data scientist
to pilot the system
A large toolbox does not
guarantee a correct repair
Model used to verify impacts
Models were also used to verify
impacts of different actions
Client mailing
Promotions
Expiration dates
Telecom case: R Development Challenges
R
Programming language
Easily create/modify
models
Model versioning
Model evaluation
Model sharing and
validation
No out of the box solutions
Nope
No point-and-click interface to define
and modify models. It has to be done in
the script file
Nope
Models are defined in script files. You
can store old files or work with version
control software (Git, Mercurial)… not
so easy
Yeah
You can develop a script to report model
deviance and other parameters. You have
to open the file to change constants such as file
name, date, series, etc…
Nope
Models live in scripts, so model sharing means
sending script files and data to one
another… under time pressure this tends
toward chaos
Telecom case: R Development Challenges
R
Has all we needed for the statistical
models and analysis in the project
We had to find a way to manage
the drawbacks
R Statistical language
So let’s divide the tasks
Do the stats
We need an interface for the users
Model sharing and
validation
A way that everybody can see
the same model and results
Web interface!
Platform independent
No recompiling and easily go mobile!
No code on the users' side
No installing anything
Central data service
Everybody sees the same results
Easier to find developers
Much easier to find a web developer
than R, C++, etc…
Advantages
Telecom case: Proposed solution
Web interface!
…
Nice catchphrase
The web framework for
perfectionists with deadlines
N-dimensional array objects and linear algebra methods
Data plotting, optimization, statistical models, etc …
Possible to open R objects and codes in Python
Database
Open source
Also plays well
with Django!
Postgres plays well
with Django and R
Has a spatial extension
Telecom case: Proposed solution
Model / View / Template
Framework
Task division helps to
keep the code minimal
Model
Each model defines a unit of information
(much like a class), and each entry of
this unit (class object) is saved in the DB
View
Define functions which are responsible
for the user interaction
Template
Stores the website's HTML code. Pages
can inherit from other pages, which helps
keep the code from being repetitive
Auto-generated admin web page
Object-relational mapper
Change, remove and add database entries
Automatic database management
Specify functions associated with models
Tables / fields / constraints / primary keys / etc…
Ex.: DescriptionModel -> run model
Fields in the DB are like object attributes
Telecom case: Proposed solution
Solution overview: Job Division
Main user interface
Estimate models
Manage created models
Load data
Store all data regarding
models, results,
historical data and
prediction
Create predictions (RPy2)
Scalability
Modular design
Each part of the
system can be boosted
by adding new nodes
Still have to stress-test the
system and finish
some implementations
Telecom case: Proposed solution
Communication between parts
Model info
DB connection
JSON
All saved in DB
Django -> R
DB connection
Just confirmations
Django login and Cross Site Request Forgery
Login
Token
CSRF
Stored in R as a local variable
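The login and CSRF handshake above can be sketched as follows. This is a minimal illustration in Python (the talk does this from R), assuming Django's default names: the csrftoken and sessionid cookies and the X-CSRFToken header. The function name build_django_post, the URL and the payload are hypothetical, and the request is only built, not sent.

```python
import json
import urllib.request


def build_django_post(url, payload, csrf_token, session_id=None):
    """Build (but do not send) a JSON POST that passes Django's CSRF check.

    Assumes Django's default cookie names (csrftoken, sessionid) and the
    default X-CSRFToken header; the caller keeps the token in a variable
    after login, as the slide describes for R.
    """
    cookies = [f"csrftoken={csrf_token}"]
    if session_id:
        cookies.append(f"sessionid={session_id}")
    headers = {
        "Content-Type": "application/json",
        "X-CSRFToken": csrf_token,  # must match the csrftoken cookie
        "Cookie": "; ".join(cookies),
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical confirmation sent back after a model has run.
request = build_django_post(
    "http://localhost:8000/descriptionmodel/7/",
    {"status": "done"},
    csrf_token="abc123",
    session_id="s1",
)
```

Keeping the token in one variable and attaching it to every request is the same pattern as storing it in R as a local variable.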
Telecom case: Proposed solution
Communication between parts
DB connection
JSON
DB connection
Packages used in implementation
rjson
Transform R objects to JSON
and vice versa
httr
Makes requests from R,
manages cookies too
gsubfn
Modifies strings, making it easy to
create Django-style URLs in R
Does not remove the JSON
vulnerability prefix
(“)]},” at the beginning of the string)
/descriptionmodel/%(id)s
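The two string tasks on this slide can be sketched in a few lines. This is a minimal illustration in Python rather than R: the %(id)s template is filled with Python's named-placeholder formatting (the style that gsubfn imitates in R), and the protective prefix is stripped before the JSON is parsed. The function names are hypothetical, and the default prefix is copied from the slide as an assumption about what the server sends.

```python
import json


def build_url(template, **params):
    """Fill a Django-style template such as '/descriptionmodel/%(id)s'."""
    return template % params


def parse_protected_json(text, prefix=")]},"):
    """Parse JSON after dropping a leading vulnerability-protection prefix.

    The default prefix is taken from the slide; adjust it to whatever the
    server actually prepends.
    """
    if text.startswith(prefix):
        text = text[len(prefix):]
    return json.loads(text)


url = build_url("/descriptionmodel/%(id)s", id=42)
model = parse_protected_json(')]},\n{"id": 42, "status": "waiting"}')
```

The parser leaves unprefixed JSON untouched, so the same call works whether or not the server adds the prefix.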
Telecom case: Proposed solution
Communication between parts
DB connection
JSON
DB connection
Packages used in implementation
RPostgreSQL
Postgres database connection,
queries, inserts, etc…
reshape2
Easily pivot and unpivot tables
gsubfn
Modifies strings, making it easy to
create dynamic queries
Makes it easy to build
the regression matrix
Select *
From table_1
Where id = %(id)s
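To show what pivoting a table into a regression matrix means, here is a minimal sketch in plain Python (the talk uses reshape2 in R for this step). The function names, column names and sample rows are all hypothetical.

```python
def pivot(rows, index, column, value):
    """Turn long-format rows into {index: {column: value}} (wide format)."""
    wide = {}
    for row in rows:
        wide.setdefault(row[index], {})[row[column]] = row[value]
    return wide


def to_matrix(wide, columns, fill=None):
    """Return (row labels, matrix) with one column per requested name."""
    labels = sorted(wide)
    matrix = [[wide[label].get(col, fill) for col in columns] for label in labels]
    return labels, matrix


# Hypothetical long-format data: one row per (day, variable) pair.
long_rows = [
    {"day": "2014-03-01", "variable": "calls", "value": 120},
    {"day": "2014-03-01", "variable": "promotion", "value": 1},
    {"day": "2014-03-02", "variable": "calls", "value": 95},
]
labels, matrix = to_matrix(
    pivot(long_rows, "day", "variable", "value"), ["calls", "promotion"]
)
```

Each row of matrix is one day and each column one variable; missing combinations are filled with None here, which is a choice made for this sketch only.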
Telecom case: Proposed solution
Communication between parts
DB connection
JSON
DB connection
Packages used in implementation
Pandas
Creates R-like data frames,
pivots and unpivots tables
A little tricky to work with… Pivoting
and grouping are faster than in R and
SQL
REST
framework
Easily create REST services
Really, this one is awesome! Define
serializers for the models
South
Automatically creates
database migrations based
on model modifications
Custom complex migrations can be
built too
Telecom case: Proposed solution
Models are run according
to a hierarchical queue
REST communication
between R and Django
Loops of interactions
between R and Django
Model is only run if all its
dependencies are
finished
R can call Django to
generate new models
More complex model
structures
Stepwise Implementation
Queen Model
Queen is associated with a mold
Coordinates model
breeding and checks whether the
best model has been achieved
Mold stores the model parameters and
which variables should be used in
stepwise
Breeding Model
Creates new models
based on the best
previous layer model
Queen
Breeding
Ordinary
Best one
Mold
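The queue rule stated on this slide (a model runs only once all of its dependencies have finished) can be sketched as below. This is a minimal in-memory illustration in Python, not the PumpWood implementation: the function run_queue, the model names and the dependency table are hypothetical, and the real system keeps this state in the database.

```python
def run_queue(dependencies, run):
    """Run every model once, never before all of its dependencies finished.

    dependencies maps a model name to the names it has to wait for;
    run is the callable that actually estimates one model.
    """
    waiting = dict(dependencies)
    finished = set()
    order = []
    while waiting:
        # A model is ready when every dependency is already finished.
        ready = sorted(
            name for name, deps in waiting.items()
            if all(dep in finished for dep in deps)
        )
        if not ready:
            raise ValueError("circular or missing dependency")
        for name in ready:
            run(name)
            finished.add(name)
            order.append(name)
            del waiting[name]
    return order


# Hypothetical hierarchy: the queen waits for everything below it.
log = []
order = run_queue(
    {
        "breeding": [],
        "ordinary_1": ["breeding"],
        "ordinary_2": ["breeding"],
        "queen": ["breeding", "ordinary_1", "ordinary_2"],
    },
    run=log.append,
)
```

Because the queen depends on every model below it, it always runs last, which is what lets it compare the finished models.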
Model queue hierarchy determines that
the queen model has to wait for the breeding
model to finish running
Red arrows indicate
dependencies
R calls Django
(JSON), asking it to
breed models
according to the
mold
Django sets the new models
under the queen and above
breeding in the hierarchy
R calls Django
to execute the
queen method
Queen chooses
the best model
under it
Is this best model
better than the
previous one?
No previous one
Next breeding
Breeding model compares the mold and the previous best
model to see which variables still have to be tested
Queen is set as
waiting again
Final Model
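The queen and breeding loop walked through above amounts to a forward stepwise search. The following is a minimal single-process sketch in Python under stated assumptions: the score function is hypothetical (higher is better), the function name stepwise is invented for the example, and in the talk each candidate is a separate model estimated in R through the queue rather than a local call.

```python
def stepwise(mold_variables, score):
    """Forward stepwise search mirroring the queen / breeding loop.

    mold_variables: variables the mold allows to be tested.
    score: hypothetical callable returning a quality value (higher is
    better) for a list of variables.
    """
    best_vars, best_score = [], None
    while True:
        # Breeding: compare the mold with the previous best model to see
        # which variables still have to be tested.
        remaining = [v for v in mold_variables if v not in best_vars]
        if not remaining:
            return best_vars, best_score
        layer = [best_vars + [v] for v in remaining]
        # Queen: choose the best model of the new layer.
        layer_best = max(layer, key=score)
        layer_score = score(layer_best)
        # Stop when the new best does not improve on the previous one.
        if best_score is not None and layer_score <= best_score:
            return best_vars, best_score
        best_vars, best_score = layer_best, layer_score


# Hypothetical additive score, used only to exercise the loop.
gains = {"a": 3.0, "b": 2.0, "c": -1.0}
final_vars, final_score = stepwise(
    ["a", "b", "c"], lambda variables: sum(gains[v] for v in variables)
)
```

In the first round there is no previous best, so the search always breeds at least one more layer, matching the "no previous one, next breeding" branch on the slides.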
Telecom case: PumpWood
Birth of a new point-and-click (part of it at least) statistical system
Name of a pioneer tropical
tree which has developed
a symbiosis with ants
Tree houses and
feeds the ants
Ants protect the tree
from predators
Why ants?
A single ant tells little
about the
colony
Russian word for an ant colony
Have to look at
the BIG
picture…
Big Data
Data Science
Telecom case: PumpWood
Project PumpWood’s overall results
System stability
Django
Recent months with no need to
restart after a crash
R
There are some hiccups in the
implementation
(multicollinearity of models),
but OK, stable too
Hardware usage
3 Django processes
All nodes on
the same machine
Usually running
with
3 R processes
8 GB RAM
Postgres
12 GB of disk usage
Hardware investment
Less than R$5k
Telecom case: PumpWood
Project PumpWood’s overall results
In one year
35711 different models created
2411 without stepwise ones
40193 estimated models
6200 without stepwise ones
Each change in a model's
inputs leads to a new model
Model is defined by
its inputs and output
(24 x 2 + 15)x12 = 756
That is too many models!
2411 / 756 = 3.1891 (retries per update)
Would senior analysts be more economical when using PumpWood?
Telecom case: PumpWood
Current development
Migrating out of Django’s admin
Django admin is very
useful, but limited
Development
already under way
Use Django as a RESTful
service that accepts CORS requests
Advantages
Easier to change
GUI platform
Single page app
Less network traffic
More freedom
for designing
Working on a touch
friendly design
Graphics and other data visualization options
R Trends
References
R project: www.r-project.org
Kurt Hornik, Austrian Journal of Statistics, Volume 41, Number 1, pp. 59–66 (2012)
Companies using R:
http://www.revolutionanalytics.com/companies-using-r
Pumpwood Photos:
http://espacepourlavie.ca/en/biodome-flora/shield-leaf-pumpwood
https://treesandfish.wordpress.com/2011/06/25/trees-of-puerto-rico-part-1-cecsch-and-schmor/
Thank you
Contact: [email protected]
PumpWood: A data science tool
Big Data Science
Science is a method
Reproducible
experiments
Testable
hypotheses
Diffusion of
the results
Huge amount of data
Challenges
Usually associated with user information
Parallel processing
Profile
Navigation path
Service usage
Geospatial data
…
Infrastructure
Fast algorithms
PumpWood: A data science tool
Model evolution history
Insights!!
Create deliveries inside of PumpWood (helps with the information diffusion)
Description
Study correlation between sales and price for
food and beverage
Notes
The price and sales correlation showed a significant
negative effect of price on sales. Despite that,
high-value products have reduced elasticity. For
more, check the reports attached to the basket
Delivery Basket
Delivery date: 2014-03-05 User: abaceti
Model objects hold
the information
necessary to rerun
them
Files and reports can be
attached to the deliveries
PumpWood: A data science tool
Overview of data analysts' development
Check how the job is done
by your best ones
Improve the rest of the
team with the lessons
learned!
Grant and revoke production status for different models
Models can also be used in frequent tasks
This helps keep track of which ones are
being used in production and which are part of
development
Thank you
Contact: [email protected]