Scala for
Machine Learning
Patrick Nicolas
December 2014
patricknicolas.blogspot.com
www.slideshare.net/pnicolas
What challenges?
What makes Scala particularly suitable to solve
machine learning and optimization problems?
Building scientific and machine learning applications
requires ….
1. Clearly defined abstractions
2. Flexible, dynamic models
3. Scalable execution
... and may involve mathematicians, data scientists,
software engineers and DevOps.
Scala tool box
Which elements in the Scala tool box are useful to meet
these challenges?
[Diagram: toolbox elements (reactive, F-bounded types, composed futures, actors) mapped to the three challenges]
Abstraction
• Non-linear learning models <= functorial tensors
• Kernel monadic composition <= monads
• Extending library types <= implicits
Flexibility
Scalability
Abstraction: Non-linear learning models
Low-dimension feature space (manifold) embedded into an observation space (Euclidean)
Abstraction: Non-linear learning models
Tensor fields are geometric entities defining linear relations between
vector fields, differential forms, scalars and other tensor fields.
• Scalar field $f(x, y, z)$
• Vector field (contravariant): $\nabla f = \frac{\partial f}{\partial x}\,\mathbf{i} + \frac{\partial f}{\partial y}\,\mathbf{j} + \frac{\partial f}{\partial z}\,\mathbf{k}$
• Inner product $\langle v, w \rangle$
• Covariant vector field (one-form/map): $\alpha_w(v) = \langle v, w \rangle$
• Tensor product $T_m^n \otimes T_p^q$, exterior product $dx_i \wedge dx_j$, …
Each type of tensor is a category, associated with a functor category.
Abstraction: Non-linear learning models
Machine learning consists of identifying a low-dimension
feature space, a manifold within a Euclidean observation
space. The computation of smooth manifolds relies on tensors
and tensor metrics (Riemann, Laplace-Beltrami, …).
Problem: How to represent tensors and metrics?
Solution: Functorial representation of tensors,
tensor products and differential forms.
Abstraction: Non-linear learning models
Functor:
f: U => V F(f): F(U) => F(V)
One option is to define a vector field as a collection (e.g. List) and
leverage the functor for List.
Convenient, but inconsistent with the mathematical definition of tensors.
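A minimal sketch of this first option, assuming a Functor type class and a List of coordinates (names are illustrative, not the slide's code):

```scala
// Vector field naively represented as a List; the list functor lifts
// f: U => V into F(f): F(U) => F(V).
object ListFunctorExample {
  trait Functor[F[_]] { def map[U, V](fu: F[U])(f: U => V): F[V] }

  val listFunctor: Functor[List] = new Functor[List] {
    def map[U, V](fu: List[U])(f: U => V): List[V] = fu.map(f)
  }

  // e.g. rescaling the components of a "vector field"
  val scaled: List[Double] = listFunctor.map(List(1.0, 2.0, 3.0))(_ * 0.5)
}
```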
Abstraction: Non-linear learning models
Functor (F)
f: U => V F(f): F(U) => F(V)
Let's define generic vector field and co-vector field types.
A tensor is defined as a higher kind, being either a vector or a
co-vector, accessed through type projection.
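A hedged sketch of these types, assuming a Hom-style two-sided type whose projections give the vector and co-vector fields (Hom, VField, CoVField are assumed names):

```scala
object Tensors {
  type Hom[T] = {
    type Left[X]  = Function1[T, X]   // vector field over T
    type Right[X] = Function1[X, T]   // co-vector field (one-form) over T
  }
  // Accessed through type projection:
  type VField[X]   = Hom[Double]#Left[X]
  type CoVField[X] = Hom[Double]#Right[X]
}
```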
Abstraction: Non-linear learning models
Covariant Functor
f: U => V F(f): F(U) => F(V)
The functor for the vector field relies on the projection (Hom
functor) of the two-argument type Tensor onto its covariant and
contravariant components.
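A sketch of the covariant functor instance for the vector-field projection (VField as sketched on the previous slide; names are assumptions):

```scala
object VFieldFunctor {
  type VField[X] = Function1[Double, X]   // vector field, covariant in X

  trait Functor[F[_]] { def map[U, V](fu: F[U])(f: U => V): F[V] }

  // f: U => V lifted to F(f): F(U) => F(V) by post-composition
  val functor: Functor[VField] = new Functor[VField] {
    def map[U, V](vf: VField[U])(f: U => V): VField[V] = vf andThen f
  }
}
```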
Abstraction: Non-linear learning models
Contravariant functor
f: U => V F(f): F(V) => F(U)
Contravariant functors are used for morphisms or transformations on
covariant tensors (type CoVField).
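A sketch of a contravariant functor applied to the co-vector field: here the arrow is reversed, F(f): F(V) => F(U) (CoVField and the contramap name are assumptions):

```scala
object CoVFieldFunctor {
  type CoVField[X] = Function1[X, Double]   // one-form, contravariant in X

  trait ContravariantFunctor[F[_]] { def contramap[U, V](fu: F[U])(f: V => U): F[V] }

  // f: V => U lifted by pre-composition
  val functor: ContravariantFunctor[CoVField] = new ContravariantFunctor[CoVField] {
    def contramap[U, V](cf: CoVField[U])(f: V => U): CoVField[V] = f andThen cf
  }
}
```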
Abstraction: Non-linear learning models
Tensor metrics and products require other types of functors …
Product Functor
BiFunctor
(*) Cats framework https://github.com/non/cats http://non.github.io/cats/
Abstraction: Kernel monadic composition
Clustering or classifying observations entails computing the
inner product of observations on the manifold.
Kernel functions are commonly used in training to separate
classes of observations with a linear decision boundary
(hyperplane).
Problem: Building a model entails creating,
composing and evaluating numerous kernels.
Solution: Define kernels as a first-class
programming concept with monadic operations.
Abstraction: Kernel monadic composition
Define a kernel function as the composition of two functions, g ∘ h:
$\mathcal{K}_f(\mathbf{x}, \mathbf{y}) = g\Big(\sum_i h(x_i, y_i)\Big)$
We create a monad to generate any kind of kernel function $\mathcal{K}_f$ by
composing its components: g1 ∘ g2 ∘ … ∘ gn ∘ h.
Abstraction: Kernel functions composition
A monad extends a functor with a binding method (flatMap).
The monadic implementation of the kernel function component h follows.
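A hedged sketch of such a kernel-function monad: KF wraps the g component and folds h over the coordinate pairs; map and flatMap operate on g (KF, F1, F2 and metric are assumed names, not the exact slide code):

```scala
object KernelMonad {
  type F1 = Double => Double
  type F2 = (Double, Double) => Double

  case class KF[G](g: G, h: F2) {
    // K(x, y) = g( sum_i h(x_i, y_i) )
    def metric(x: Array[Double], y: Array[Double])(implicit ev: G => F1): Double =
      ev(g)(x.zip(y).map { case (xi, yi) => h(xi, yi) }.sum)

    def map[H](f: G => H): KF[H] = KF(f(g), h)
    def flatMap[H](f: G => KF[H]): KF[H] = f(g)
  }
}
```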
Abstraction: Kernel functions composition
Declaration of explicit kernel functions:
Radial basis function (RBF) kernel
$\mathcal{K}(\mathbf{x}, \mathbf{y}) = e^{-\frac{\|\mathbf{x}-\mathbf{y}\|^2}{2\sigma^2}}$  with  $h: (x, y) \mapsto x - y$  and  $g: x \mapsto e^{-\frac{x^2}{2\sigma^2}}$
Polynomial kernel
$\mathcal{K}(\mathbf{x}, \mathbf{y}) = (1 + \mathbf{x} \cdot \mathbf{y})^d$  with  $h: (x, y) \mapsto x \cdot y$  and  $g: x \mapsto (1 + x)^d$
Abstraction: Kernel functions composition
Our monad is ready to compose any kind of explicit kernel on
demand, using a for-comprehension.
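A hedged usage sketch, building on the KF monad sketched above; the g and h components and the bandwidth/degree values are assumptions:

```scala
object KernelComposition {
  import KernelMonad._

  val sigma = 0.8
  val hDiff: F2 = (x, y) => x - y
  val gRbf:  F1 = x => math.exp(-x * x / (2 * sigma * sigma))   // RBF post-processing
  val gPoly: F1 = x => math.pow(1.0 + x, 3)                     // polynomial, d = 3

  // Compose the two g components over the same h with a for-comprehension
  val composed: KF[F1] = for {
    g1 <- KF[F1](gRbf, hDiff)
    g2 <- KF[F1](gPoly, hDiff)
  } yield g1 andThen g2
}
```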
Abstraction: Kernel functions composition
Notes
• Quite often, monads define filtering capabilities (e.g. Scala collections).
• Incidentally, the for-comprehension closure can also be used to create dynamic workflows.
Abstraction: Extending library types
The purpose of reusability goes beyond refactoring code.
It includes leveraging existing, well-understood concepts
and semantics.
Scala library classes cannot always be subclassed, and wrapping a library
component in a helper class clutters the design.
Implicit classes extend the functionality of classes
without cluttering namespaces (an alternative to
type classes).
Abstraction: Extending library types
A data flow micro-router for successful and failed computations,
built by transforming Try into Either with recovery and processing
functions.
scala.util.Try[T]
  recover[U >: T](f: PartialFunction[Throwable, U]): Try[U]
  getOrElse[U >: T](default: => U): U
  orElse[U >: T](default: => Try[U]): Try[U]
  toEither[U](rec: () => U)(f: T => T): Either[U, T]   // the extension defined on the next slide
Abstraction: Extending library types
Four lines of Scala code extend Try with the Either concept
… as applied to a normalization problem.
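A hedged sketch of the extension (Try2Either is an assumed name): on success apply the processing function, on failure fall back to the recovery value.

```scala
object TryToEither {
  import scala.util.{Try, Success, Failure}

  implicit class Try2Either[T](t: Try[T]) {
    def toEither[U](rec: () => U)(f: T => T): Either[U, T] = t match {
      case Success(s) => Right(f(s))     // processing function on success
      case Failure(_) => Left(rec())     // recovery function on failure
    }
  }
  // e.g. Try(x / norm).toEither(() => 0.0)(v => v * 100.0)
}
```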
Abstraction: Extending library types
Notes
• Type conversions such as toDouble or toFloat can be extended
to deal with rounding errors or rendering precision.
• Creating a type class is a more generic (and perhaps more
appropriate) methodology to extend the functionality of a closed
model or framework. Is there a reason why Try in the
Scala standard library does not support conversion to
Either?
Abstraction
Non-linear learning models <= functorial tensors
Kernel monadic composition <= monads
Extending library types <= implicits
Flexibility
Modeling <= Stackable traits
Scalability
Flexibility: modeling
Factory design patterns have been used to model dynamic
systems (GoF). Are they adequate to model dynamic
workflow?
Building machine learning apps requires
configurable, dynamic workflows that preserve
the model formalism
Leverage mixins, inheritance and abstract values
to create models and weave data transformation.
Flexibility: modeling
Traditional programming languages compare unfavorably to
scientific languages such as R because of their inability
to follow a strict mathematical formalism:
1. Variable declaration
2. Model definition
3. Instantiation
Scala stacked traits and abstract values preserve the core
formalism of mathematical expressions.
Flexibility: modeling
Declaration
$f \in \mathbb{R}^n \rightarrow \mathbb{R}^n$,  $g \in \mathbb{R}^n \rightarrow \mathbb{R}$
Model
$h = g \circ f$
Instantiation
$f(x) = e^x$,  $g(\mathbf{x}) = \sum_i x_i$
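A hedged sketch of how abstract values and mixable traits can mirror the declaration / model / instantiation steps above (names are illustrative):

```scala
object Formalism {
  // Declaration: f: R^n -> R^n, g: R^n -> R
  trait F { val f: Vector[Double] => Vector[Double] }
  trait G { val g: Vector[Double] => Double }

  // Model: h = g o f
  trait H { self: F with G =>
    def h(x: Vector[Double]): Double = g(f(x))
  }

  // Instantiation: f(x) = e^x, g(x) = sum_i x_i
  val model = new H with F with G {
    val f = (x: Vector[Double]) => x.map(math.exp)
    val g = (x: Vector[Double]) => x.sum
  }
  // model.h(Vector(0.0, 1.0)) == 1.0 + math.E
}
```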
Flexibility: modeling
Multiple models and algorithms are typically evaluated by
weaving computation tasks.
A learning platform is a framework that
• defines computational tasks
• wires the tasks (data flow)
• deploys the tasks (*)
It overcomes the limitations of monadic composition (3 levels of
dynamic binding…).
(*) Actor-based deployment
Flexibility: modeling
Even the simplest workflow (model of data transformation) requires
flexibility …..
Flexibility: modeling
Summary of the three configurability layers of the Cake pattern:
1. Given the objective of the computation, select the best
sequence of modules/tasks (e.g. modeling: preprocessing +
training + validating).
2. Given the profile of the input data, select the best data
transformation for each module (e.g. data preprocessing:
Kalman, DFT, moving average…).
3. Given the computing platform, select the best
implementation for each data transformation (e.g. Kalman:
KalmanOnAkka, Spark…).
Flexibility: modeling
Implementation of Preprocessing module
Flexibility: modeling
Implementation of the Preprocessing module using the discrete Fourier
transform … and the discrete Kalman filter.
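A hedged sketch of a Cake-style Preprocessing module: the module exposes an abstract preprocessor, and the concrete transformation (DFT-based filter or discrete Kalman filter) is selected when the module is assembled. All names and the stubbed filters are assumptions:

```scala
object PreprocessingCake {
  type TimeSeries = Vector[Double]

  trait PreprocessingModule {
    trait Preprocessor { def apply(xt: TimeSeries): Option[TimeSeries] }
    val preprocessor: Preprocessor            // abstract value, bound at assembly

    class DFTFilter(cutOff: Double) extends Preprocessor {
      def apply(xt: TimeSeries): Option[TimeSeries] = Some(xt)  // low-pass filter stub
    }
    class DKalman(q: Double, r: Double) extends Preprocessor {
      def apply(xt: TimeSeries): Option[TimeSeries] = Some(xt)  // Kalman filter stub
    }
  }

  // Assembly: pick the implementation best suited to the input data profile
  val workflow = new PreprocessingModule {
    val preprocessor = new DFTFilter(cutOff = 0.9)
  }
}
```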
Flexibility: modeling
Clustering workflow = preprocessing task -> reducing task
Modeling workflow = preprocessing task -> model training
task -> model validation task
[Diagram: Loading -> Preprocessing -> Clustering (Reducing) or Modeling (Training -> Validating); Preprocessor: DFTFilter, Kalman; Reducer: PCA; Supervisor: SVM, EM, MLP]
Flexibility: modeling
A simple clustering workflow requires a preprocessor and a
reducer. The computation sequence exec transforms a
time series of elements of type U and returns a time series
of type W as an option.
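A hedged sketch (assumed names) of such a workflow: a preprocessor and a reducer are wired through a self-type, and exec chains them into a TS[U] => Option[TS[W]] computation:

```scala
object ClusteringWorkflowSketch {
  type TS[T] = Vector[T]

  trait PreprocModule[U, V]  { val preprocessor: TS[U] => Option[TS[V]] }
  trait ReducingModule[V, W] { val reducer: TS[V] => Option[TS[W]] }

  trait ClusteringWorkflow[U, V, W] {
    self: PreprocModule[U, V] with ReducingModule[V, W] =>
    def exec(xt: TS[U]): Option[TS[W]] = preprocessor(xt).flatMap(reducer)
  }
}
```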
Flexibility: modeling
A model is created by processing the original time series of type TS[T]
through a preprocessor, a training supervisor and a validator
Flexibility: modeling
Putting it all together for a conditional path execution…
Abstraction
Non-linear learning models <= functorial tensors
Kernel monadic composition <= monads
Extending library types <= implicits
Flexibility
Modeling <= Stackable traits
Scalability
Dynamic programming <= tail recursion
Online processing <= streams
Data flow control <= back-pressure strategy
Scalability: dynamic programming
Choosing between iterative and recursive implementations
of algorithms is a well-documented dilemma.
Many machine learning algorithms (HMM, RL,
EM, MLP, …) rely on dynamic programming
techniques.
Tail recursion is a very efficient solution because it
avoids the creation of new stack frames.
Scalability: dynamic programming
Viterbi algorithm for hidden Markov models:
the objective is to find the most likely sequence of states
{qt} given a set of observations {Ot} and a λ-model.
Scalability: dynamic programming
The algorithm recurses along the observations with N
different states. The first invocation initializes the context of
the recursion; the context can be used to wrap the results.
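A hedged sketch of the tail-recursive traversal pattern only (not the full Viterbi delta/psi recurrence): the recursion advances along the observations, carrying its context as parameters, and @tailrec guarantees no new stack frames. Names and the score function are assumptions:

```scala
import scala.annotation.tailrec

object TailRecursionSketch {
  def traverse(obs: Vector[Int], numStates: Int,
               score: (Int, Int, Int) => Double): Vector[Int] = {

    @tailrec
    def recurse(t: Int, path: Vector[Int]): Vector[Int] =
      if (t >= obs.length) path
      else {
        // pick the state maximizing a (simplified) local score at step t
        val best = (0 until numStates).maxBy(s => score(t, path.lastOption.getOrElse(0), s))
        recurse(t + 1, path :+ best)      // tail call: the stack frame is reused
      }

    recurse(0, Vector.empty)
  }
}
```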
Scalability: dynamic programming
Relative performance of the recursion w/o tail elimination
for the Viterbi algorithm given the number of observations
Scalability: online processing
An increasing number of algorithms, such as reinforcement
learning, rely on online (or on-demand) training.
Some problems lend themselves to processing very
large data sets of unknown size, for which the
execution may have to be aborted or re-applied.
Streams reduce memory consumption by
allocating and releasing chunks of data (slices of the
time series) while allowing the reuse of intermediate
results.
Scalability: online processing
The large data set is converted into a stream, then broken
down into manageable slices. The slices are instantiated,
processed (e.g. by a loss function) and released back to the
garbage collector, one at a time.
[Diagram: the data stream X0 … Xn … Xm resides on the heap; a slice is allocated with .take, traversed by the loss function
$L(\mathbf{w}) = \frac{1}{2m}\sum_n \big(y_n - f(\mathbf{w}|x_n)\big)^2 + \lambda \|\mathbf{w}\|^2$,
then released to the garbage collector with .drop.]
Scalability: online processing
Slices of NOBS observations are allocated (.take), processed,
then released (.drop), one at a time.
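A hedged sketch of the slicing loop: slices of NOBS observations are taken, processed (here a plain sum of squared residuals), then dropped, one at a time. NOBS and the processing function are assumptions:

```scala
import scala.annotation.tailrec

object StreamSlices {
  val NOBS = 1000                                            // assumed slice size

  def traverse(xt: Stream[(Double, Double)], f: Double => Double): Double = {
    @tailrec
    def loop(strm: Stream[(Double, Double)], acc: Double): Double =
      if (strm.isEmpty) acc
      else {
        val slice = strm.take(NOBS)                          // allocate one slice
        val partial = slice.map { case (x, y) => { val e = y - f(x); e * e } }.sum
        loop(strm.drop(NOBS), acc + partial)                 // release it, move on
      }
    loop(xt, 0.0)
  }
}
```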
Scalability: online processing
The reference streamRef has to be weak in order for the processed
slices to be garbage collected; otherwise the memory consumption
increases with each new batch of data.
(*) Alternatives: define streamRef as a def or use a StreamIterator.
Scalability: online processing
Comparing a list, a stream and a stream with weak references (chart; the operating zone is highlighted).
Scalability: online processing
Notes:
Iterators:
• Computations cannot be memoized ("iterators are the imperative version of streams")
• One element at a time
• Non-recursive (tail elimination)
Views:
• No intermediate results preserved
• One element at a time
Stream iterators:
• Lazy tails
Scalability: flow control
Actors provide a very efficient and reliable way to deploy
workflows and tasks over a large number of cores and
hosts.
The execution of a workflow may create stream
bottlenecks for slow tasks and overflow local
buffers.
A flow-control mechanism handles back pressure
on the bounded mailboxes of upstream actors.
Scalability: flow control
Akka has reliable mechanisms to handle failures. What about
temporary disruptions?
[Diagram: router/dispatcher feeding the workers]
An actor-based workflow has to consider
• Cascading failures => supervision strategy
• Cascading bottlenecks => mailbox back-pressure strategy
Scalability: flow control
Message-passing scheme to process various data streams
with transformations.
[Diagram: the Controller loads the Dataset (Load), sends Compute to the Workers through bounded mailboxes and receives Completed; the Watcher replies to GetStatus with Status.]
It can be modeled as a flow-control feedback loop…
Scalability: flow control
Worker actors process the data chunk msg.xt sent by the
Controller with the transformation msg.fct.
The Compute message sent by the controller triggers the computation.
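A hedged sketch of the worker actor: the Compute message carries the data chunk xt and the transformation fct; the worker applies the transformation and reports Completed to its sender. The message layouts follow the slide's wording but the exact types are assumptions:

```scala
import akka.actor.Actor

case class Compute(id: Int, xt: Vector[Double], fct: Vector[Double] => Vector[Double])
case class Completed(id: Int, result: Vector[Double])

class Worker extends Actor {
  def receive: Receive = {
    case msg: Compute => sender ! Completed(msg.id, msg.fct(msg.xt))
  }
}
```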
Scalability: flow control
The Watcher actor monitors the message queues and reports to the
controller with a Status message.
The GetStatus message sent by the controller has no payload.
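A hedged sketch of the watcher: on GetStatus (no payload) it replies with a Status load indicator. How the mailbox sizes are obtained is an assumption (passed in here as a function):

```scala
import akka.actor.Actor

case object GetStatus
case class Status(load: Double)

class Watcher(queueSizes: () => Seq[Int], capacity: Int) extends Actor {
  def receive: Receive = {
    case GetStatus =>
      val sizes = queueSizes()
      val load = if (sizes.isEmpty) 0.0 else sizes.sum.toDouble / (capacity * sizes.size)
      sender ! Status(load)               // report the relative mailbox load
  }
}
```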
Scalability: flow control
The Controller creates the workers, a bounded mailbox for each worker
actor (msgQueues), and the watcher actor.
Scalability: flow control
The Controller loads the data sets one chunk at a time upon receiving the
Load message from the main program. It processes the results
of the computation from the workers (Completed) and throttles
the input to the workers on each Status message.
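A hedged sketch of the controller loop, reusing the Compute/Completed and GetStatus/Status messages from the previous sketches; chunk loading, result aggregation and the exact throttling policy are placeholders:

```scala
import akka.actor.{Actor, ActorRef}

case object Load

class Controller(workers: Vector[ActorRef], watcher: ActorRef,
                 dataset: Vector[Double],
                 fct: Vector[Double] => Vector[Double]) extends Actor {
  private var batchSize = 64          // adjusted by the feedback loop
  private var cursor = 0              // position in the data set
  private var id = 0                  // chunk identifier

  def receive: Receive = {
    case Load => dispatch()                       // start streaming chunks to the workers
    case Completed(chunkId, result) =>            // collect a partial result ...
      watcher ! GetStatus                         // ... and probe the load
      dispatch()
    case Status(load) =>                          // simple throttle (see the next sketch)
      batchSize = if (load > 0.75) math.max(16, batchSize / 2)
                  else math.min(1024, batchSize * 2)
  }

  private def dispatch(): Unit =
    if (cursor < dataset.size) {
      val chunk = dataset.slice(cursor, cursor + batchSize)
      workers(id % workers.size) ! Compute(id, chunk, fct)
      cursor += chunk.size
      id += 1
    }
}
```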
Scalability: flow control
The Load message is implemented as a loop that creates data chunks
whose size is adjusted according to the load computed by the
watcher and forwarded to the controller as a Status message.
Scalability: flow control
A simple throttle increases/decreases the size of the batch of observations
given the current load and a specified watermark.
Selecting a faster/slower and less/more accurate version of the algorithm
can also be used in the regulation strategy.
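A hedged sketch of the throttle itself: increase or decrease the batch size according to the current load and a pair of watermarks (all thresholds and factors are assumptions):

```scala
object Throttle {
  val lowWatermark  = 0.35
  val highWatermark = 0.75

  def adjust(batchSize: Int, load: Double): Int =
    if (load > highWatermark)     math.max(16,   (batchSize * 0.8).toInt)  // back off
    else if (load < lowWatermark) math.min(1024, (batchSize * 1.2).toInt)  // speed up
    else batchSize                                                         // steady state
}
```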
Scalability: flow control
The feedback control loop adjusts the size of the batches given the
load in the mailboxes and the complexity of the computation.
Scalability: flow control
Notes
• The feedback control loop should be smoothed (moving
average, Kalman…).
• A larger variety of data flow control actions can be considered,
such as adding more workers or increasing queue capacity.
• The watchdog should handle dead letters, in case of a
failure of the feedback control or of the workers.
• Reactive streams, introduced in Akka 2.2+, have a
sophisticated TCP-based propagation and back-pressure
flow control.
… and there is more
There are many other Scala programming language constructs
I found particularly intriguing as far as machine learning is
concerned …
Domain Specific Language
Emulate the R language so scientists can use the application.
Reactive streams (TCP)
Effective fault-tolerance and flow control mechanism
Delimited continuations
Save, restore and reuse computation states
References
Monads are Elephants, J. Iry –
james-iry.blogspot.com/2007/10/monads-are-elephans-part2.html
Cats functional library, P. Phillips – https://github.com/non/cats
Extending the Cake pattern: Dependency injection in Scala, A. Warski –
www.warski.org/blog/2010/12/di-in-scala-cake-pattern
Programming in Scala, §12.5 Traits as stackable modifications, M. Odersky,
L. Spoon, B. Venners – Artima 2008
Introducing Akka, J. Bonér – Typesafe 2012,
www.slideshare.net/jboner/introducing-akka
Scala for Machine Learning, §1 Getting started, P. Nicolas –
Packt Publishing 2014
Exploring Akka Stream's TCP Back Pressure, U. Peter – Xebia 2015,
blog.xebia.com/2015/01/14/exploring-akka-streams-tcp-back-pressure/
Donate to Apache software and Eclipse foundations
List of books on Machine learning
Beginner
• Introduction to Machine Learning, E. Alpaydin – MIT Press 2004-2007
References
• Pattern Recognition and Machine Learning, C. Bishop – Springer 2006
• Machine Learning: A Probabilistic Perspective, K. Murphy – MIT Press 2012
• The Elements of Statistical Learning: Data Mining, Inference and Prediction, T. Hastie, R. Tibshirani, J. Friedman – Springer 2001