Information Geometry:
Duality, Convexity, and Divergences
Jun Zhang*
University of Michigan
Ann Arbor, Michigan 48104
*Currently on leave to AFOSR under IPA
Lecture Plan
1) A revisit to Bregman divergence
2) Generalization ($\alpha$-divergence on $\mathbb{R}^n$) and $\alpha$-Hessian
geometry
3) Embedding into infinite-dimensional function space
4) Generalized Fisher metric and $\alpha$-connection on Banach space
Clarify two senses of duality in information geometry:
Reference duality: $\alpha \leftrightarrow -\alpha$
choice of the reference vs comparison point on the manifold;
Representational duality: $p \to \log p,\ p^{\beta}, \ldots$
choice of a monotonic scaling of density function;
Bregman Divergence
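Let $\Phi: \mathbb{R}^n \to \mathbb{R}$ be a strictly convex function, and $\theta_1, \theta_2 \in \mathbb{R}^n$. The Bregman divergence is
$$B_\Phi(\theta_1, \theta_2) = \Phi(\theta_1) - \Phi(\theta_2) - \langle \theta_1 - \theta_2,\ (\partial\Phi)(\theta_2) \rangle$$
i) Quadrilateral relation:
$$B_\Phi(\theta_1,\theta_2) + B_\Phi(\theta_4,\theta_3) - B_\Phi(\theta_1,\theta_3) - B_\Phi(\theta_4,\theta_2) = \langle \theta_1 - \theta_4,\ (\partial\Phi)(\theta_3) - (\partial\Phi)(\theta_2) \rangle$$
Triangular relation (generalized cosine) as a special case ($\theta_4 = \theta_2$):
$$B_\Phi(\theta_1,\theta_2) + B_\Phi(\theta_2,\theta_3) - B_\Phi(\theta_1,\theta_3) = \langle \theta_1 - \theta_2,\ (\partial\Phi)(\theta_3) - (\partial\Phi)(\theta_2) \rangle$$
ii) Reference-representation biduality:
$$B_\Phi(\theta_1,\theta_2) = B_{\Phi^*}\big((\partial\Phi)(\theta_2),\ (\partial\Phi)(\theta_1)\big)$$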
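A minimal numerical sketch of the quadrilateral, triangular, and biduality relations. It assumes the convention $B_\Phi(\theta_1,\theta_2) = \Phi(\theta_1) - \Phi(\theta_2) - \langle\theta_1-\theta_2, (\partial\Phi)(\theta_2)\rangle$ and an illustrative potential $\Phi(\theta) = \sum_i e^{\theta^i}$ that is not taken from the lecture; the helper names (`phi`, `grad_phi`, `phi_star`, `bregman`, `bregman_star`) are hypothetical.

```python
import numpy as np

def phi(theta):
    """Illustrative strictly convex potential: Phi(theta) = sum_i exp(theta_i)."""
    return float(np.sum(np.exp(theta)))

def grad_phi(theta):
    """Gradient (dPhi)(theta) of the potential above."""
    return np.exp(theta)

def phi_star(eta):
    """Fenchel conjugate of phi (defined for eta > 0); its gradient is log(eta)."""
    return float(np.sum(eta * np.log(eta) - eta))

def bregman(t1, t2):
    """B_Phi(t1, t2) = Phi(t1) - Phi(t2) - <t1 - t2, (dPhi)(t2)>."""
    return phi(t1) - phi(t2) - float(np.dot(t1 - t2, grad_phi(t2)))

def bregman_star(e1, e2):
    """The same construction applied to the conjugate potential phi_star."""
    return phi_star(e1) - phi_star(e2) - float(np.dot(e1 - e2, np.log(e2)))

rng = np.random.default_rng(0)
t1, t2, t3, t4 = rng.normal(size=(4, 3))  # four arbitrary points in R^3

# Quadrilateral relation
quad_lhs = bregman(t1, t2) + bregman(t4, t3) - bregman(t1, t3) - bregman(t4, t2)
quad_rhs = float(np.dot(t1 - t4, grad_phi(t3) - grad_phi(t2)))

# Triangular relation (quadrilateral with t4 = t2)
tri_lhs = bregman(t1, t2) + bregman(t2, t3) - bregman(t1, t3)
tri_rhs = float(np.dot(t1 - t2, grad_phi(t3) - grad_phi(t2)))

# Reference-representation biduality
dual_lhs = bregman(t1, t2)
dual_rhs = bregman_star(grad_phi(t2), grad_phi(t1))

print(quad_lhs - quad_rhs, tri_lhs - tri_rhs, dual_lhs - dual_rhs)
```

All three printed differences should vanish up to floating-point rounding for any choice of points.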
Canonical Divergence and Fenchel Inequality
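An alternative expression of Bregman divergence is the canonical divergence
$$A_\Phi(\theta, \eta) = B_\Phi\big(\theta,\ (\partial\Phi)^{-1}(\eta)\big)$$
or explicitly:
$$A_\Phi(\theta, \eta) = \Phi(\theta) + \Phi^*(\eta) - \langle \theta, \eta \rangle = A_{\Phi^*}(\eta, \theta)$$
where $\Phi^*(\eta) = \langle (\partial\Phi)^{-1}(\eta),\ \eta \rangle - \Phi\big((\partial\Phi)^{-1}(\eta)\big)$.
That $A_\Phi$ is non-negative is a direct consequence of the Fenchel inequality
for a strictly convex function:
$$\Phi(\theta) + \Phi^*(\eta) \ge \langle \theta, \eta \rangle$$
where equality holds if and only if $\eta = (\partial\Phi)(\theta)$, equivalently $\theta = (\partial\Phi^*)(\eta)$.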
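A small sketch of the canonical divergence as a Fenchel gap, again assuming the illustrative potential $\Phi(\theta) = \sum_i e^{\theta^i}$ (so that $\Phi^*(\eta) = \sum_i (\eta_i\log\eta_i - \eta_i)$ and $(\partial\Phi)^{-1} = \log$). The function names are hypothetical.

```python
import numpy as np

def phi(theta):
    """Illustrative potential Phi(theta) = sum_i exp(theta_i)."""
    return float(np.sum(np.exp(theta)))

def phi_star(eta):
    """Its Fenchel conjugate, defined for eta > 0."""
    return float(np.sum(eta * np.log(eta) - eta))

def canonical(theta, eta):
    """A_Phi(theta, eta) = Phi(theta) + Phi*(eta) - <theta, eta>."""
    return phi(theta) + phi_star(eta) - float(np.dot(theta, eta))

rng = np.random.default_rng(1)
theta = rng.normal(size=3)
eta = rng.uniform(0.5, 2.0, size=3)

# Fenchel gap for an arbitrary (theta, eta) pair
gap = canonical(theta, eta)

# Same quantity written as a Bregman divergence B_Phi(theta, (dPhi)^{-1}(eta))
theta_of_eta = np.log(eta)
bregman_form = phi(theta) - phi(theta_of_eta) - float(np.dot(theta - theta_of_eta, eta))

# Equality case of the Fenchel inequality: eta = (dPhi)(theta)
tight = canonical(theta, np.exp(theta))

print(gap, bregman_form, tight)
```

The first two printed values should agree, and the third should be zero up to rounding.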
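Convex Inequality and α-Divergence Induced by It
By the definition of a strictly convex function $\Phi$,
$$\frac{1-\alpha}{2}\,\Phi(\theta_1) + \frac{1+\alpha}{2}\,\Phi(\theta_2) - \Phi\!\left(\frac{1-\alpha}{2}\,\theta_1 + \frac{1+\alpha}{2}\,\theta_2\right) \ge 0, \qquad \alpha \in (-1, 1)$$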
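It is easy to show that the following is non-negative for all $\alpha \in \mathbb{R}$:
$$D_\Phi^{(\alpha)}(\theta_1, \theta_2) = \frac{4}{1-\alpha^2}\left[\frac{1-\alpha}{2}\,\Phi(\theta_1) + \frac{1+\alpha}{2}\,\Phi(\theta_2) - \Phi\!\left(\frac{1-\alpha}{2}\,\theta_1 + \frac{1+\alpha}{2}\,\theta_2\right)\right] \ge 0$$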
Conjugate-symmetry:
$$D_\Phi^{(\alpha)}(\theta_1, \theta_2) = D_\Phi^{(-\alpha)}(\theta_2, \theta_1)$$
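Easily verifiable:
$$D_\Phi^{(1)}(\theta_1, \theta_2) = \lim_{\alpha \to 1} D_\Phi^{(\alpha)}(\theta_1, \theta_2) = B_\Phi(\theta_1, \theta_2)$$
$$D_\Phi^{(-1)}(\theta_1, \theta_2) = \lim_{\alpha \to -1} D_\Phi^{(\alpha)}(\theta_1, \theta_2) = B_\Phi(\theta_2, \theta_1)$$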
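The properties of the convex-induced $\alpha$-divergence can be checked numerically. The sketch below assumes $D_\Phi^{(\alpha)}(\theta_1,\theta_2) = \frac{4}{1-\alpha^2}\big[\frac{1-\alpha}{2}\Phi(\theta_1) + \frac{1+\alpha}{2}\Phi(\theta_2) - \Phi(\frac{1-\alpha}{2}\theta_1 + \frac{1+\alpha}{2}\theta_2)\big]$, the Bregman convention $B_\Phi(\theta_1,\theta_2) = \Phi(\theta_1) - \Phi(\theta_2) - \langle\theta_1-\theta_2,(\partial\Phi)(\theta_2)\rangle$, and an illustrative potential $\Phi(\theta)=\sum_i e^{\theta^i}$; the helper names are hypothetical.

```python
import numpy as np

def phi(theta):
    """Illustrative strictly convex potential Phi(theta) = sum_i exp(theta_i)."""
    return float(np.sum(np.exp(theta)))

def bregman(t1, t2):
    """B_Phi(t1, t2) with the gradient taken at the second argument."""
    return phi(t1) - phi(t2) - float(np.dot(t1 - t2, np.exp(t2)))

def alpha_div(t1, t2, alpha):
    """Convex-induced alpha-divergence on R^n (alpha != +1, -1)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4.0 / (1 - alpha**2) * (a * phi(t1) + b * phi(t2) - phi(a * t1 + b * t2))

rng = np.random.default_rng(2)
t1, t2 = rng.normal(size=(2, 3))

# Non-negativity, including |alpha| > 1 where both the prefactor and the bracket flip sign
values = {alpha: alpha_div(t1, t2, alpha) for alpha in (-3.0, -0.5, 0.0, 0.5, 3.0)}

# Conjugate symmetry: D^(alpha)(t1, t2) = D^(-alpha)(t2, t1)
sym_gap = abs(alpha_div(t1, t2, 0.3) - alpha_div(t2, t1, -0.3))

# Limits alpha -> +1 and alpha -> -1, approximated just inside the interval
near_plus = alpha_div(t1, t2, 1 - 1e-6)
near_minus = alpha_div(t1, t2, -1 + 1e-6)

print(values, sym_gap, near_plus - bregman(t1, t2), near_minus - bregman(t2, t1))
```

The limits are only approximated here; an implementation meant for use at exactly $\alpha = \pm 1$ would switch to the Bregman formula directly.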
Significance of Bregman Divergence
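Among α-Divergence Family
Proposition:
For a smooth function $\Phi: \mathbb{R}^n \to \mathbb{R}$, the following are equivalent:
(i) $\Phi$ is strictly convex;
(ii) $D_\Phi^{(1)}(\theta_1, \theta_2) = B_\Phi(\theta_1, \theta_2) \ge 0$;
(iii) $D_\Phi^{(-1)}(\theta_1, \theta_2) = B_\Phi(\theta_2, \theta_1) \ge 0$;
(iv) $D_\Phi^{(\alpha)}(\theta_1, \theta_2) \ge 0$ for all $|\alpha| < 1$;
(v) $D_\Phi^{(\alpha)}(\theta_1, \theta_2) \ge 0$ for all $|\alpha| > 1$.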
Statistical Manifold Structure Induced From
Divergence Function (Eguchi, 1983)
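Given a divergence $D(x,y)$ with $D(x,x) = 0$, one can derive
the Riemannian metric and a pair of conjugate connections by
expanding $D(x,y)$ around $x = y$:
i) 2nd order: one (and the same) metric
$$g_{ij}(x) = -\left.\frac{\partial^2 D(x,y)}{\partial x^i\,\partial y^j}\right|_{x=y} = -\left.\frac{\partial^2 D(x,y)}{\partial y^i\,\partial x^j}\right|_{x=y}$$
ii) 3rd order: a pair of conjugate connections
$$\Gamma_{ij,k}(x) = -\left.\frac{\partial^3 D(x,y)}{\partial x^i\,\partial x^j\,\partial y^k}\right|_{x=y}, \qquad \Gamma^*_{ij,k}(x) = -\left.\frac{\partial^3 D(x,y)}{\partial y^i\,\partial y^j\,\partial x^k}\right|_{x=y}$$
In essence, the conjugacy condition
$$\partial_k g_{ij} = \Gamma_{ki,j} + \Gamma^*_{kj,i}$$
is satisfied by such identification of derivatives of $D$.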
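Eguchi's second-order recipe can be illustrated by finite differences. The sketch below is only an illustration under stated assumptions: the divergence is a Bregman divergence of the potential $\Phi(\theta) = \log(1 + \sum_i e^{\theta^i})$ (chosen here, not in the lecture), for which the induced metric should equal the Hessian of $\Phi$. The names `divergence` and `eguchi_metric` are hypothetical.

```python
import numpy as np

def phi(theta):
    """Illustrative strictly convex potential: log(1 + sum_i exp(theta_i))."""
    return float(np.log1p(np.sum(np.exp(theta))))

def grad_phi(theta):
    """Gradient of phi."""
    w = np.exp(theta)
    return w / (1.0 + np.sum(w))

def divergence(x, y):
    """Bregman divergence of phi, standing in for a generic D(x, y)."""
    return phi(x) - phi(y) - float(np.dot(x - y, grad_phi(y)))

def eguchi_metric(D, x, h=1e-3):
    """g_ij(x) = -d^2 D / dx^i dy^j at x = y, by central differences."""
    n = x.size
    basis = np.eye(n)
    g = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = h * basis[i], h * basis[j]
            mixed = (D(x + ei, x + ej) - D(x + ei, x - ej)
                     - D(x - ei, x + ej) + D(x - ei, x - ej)) / (4 * h * h)
            g[i, j] = -mixed
    return g

x = np.array([0.2, -0.4, 0.7])
g_numeric = eguchi_metric(divergence, x)

# Closed-form Hessian of phi for comparison: diag(p) - p p^T
p = grad_phi(x)
g_exact = np.diag(p) - np.outer(p, p)

print(np.max(np.abs(g_numeric - g_exact)))
```

The same finite-difference idea extends to the third-order terms, at the cost of larger truncation and rounding error.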
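α-Hessian Geometry (of Finite-Dimension Vector Space)
Theorem.
$D_\Phi^{(\alpha)}$ induces the α-Hessian manifold, i.e.
i) The metric and conjugate affine connections are given by:
$$g_{ij}(\theta) = \frac{\partial^2 \Phi}{\partial\theta^i\,\partial\theta^j} \equiv \Phi_{ij}, \qquad \Gamma^{(\alpha)}_{ij,k}(\theta) = \frac{1-\alpha}{2}\,\frac{\partial^3 \Phi}{\partial\theta^i\,\partial\theta^j\,\partial\theta^k}, \qquad \Gamma^{*(\alpha)}_{ij,k}(\theta) = \frac{1+\alpha}{2}\,\frac{\partial^3 \Phi}{\partial\theta^i\,\partial\theta^j\,\partial\theta^k}$$
ii) Riemann curvature is given by:
$$R^{(\alpha)}_{\mu\nu ij}(\theta) = \frac{1-\alpha^2}{4} \sum_{l,k} \left(\Phi_{il\nu}\Phi_{jk\mu} - \Phi_{il\mu}\Phi_{jk\nu}\right)\Phi^{lk} = R^{*(\alpha)}_{ij\mu\nu}(\theta)$$
where subscripts on $\Phi$ denote partial derivatives and $\Phi^{lk}$ is the matrix inverse of $\Phi_{ij}$.
iii) The manifold is equi-affine, with the Tchebychev potential given by:
$$\det|g| = \det\left[\frac{\partial^2 \Phi(\theta)}{\partial\theta^i\,\partial\theta^j}\right]$$
and α-parallel volume form given by
$$\omega^{(\alpha)}(\theta) = \det\left[\frac{\partial^2 \Phi(\theta)}{\partial\theta^i\,\partial\theta^j}\right]^{(1-\alpha)/2}$$
iv) There exist biorthogonal coordinates:
$$\eta_i = \frac{\partial \Phi(\theta)}{\partial\theta^i}, \qquad \theta^i = \frac{\partial \Phi^*(\eta)}{\partial\eta_i}$$
with
$$\frac{\partial\eta_j}{\partial\theta^i} = g_{ij}, \qquad \frac{\partial\theta^i}{\partial\eta_j} = g^{ij}$$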
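Part (i) of the theorem can be sanity-checked in one dimension. The sketch below assumes the illustrative potential $\Phi(\theta) = e^{\theta}$ (so $\Phi'' = \Phi''' = e^{\theta}$) and applies central differences to $D_\Phi^{(\alpha)}$; the expected values are $g = e^{\theta}$, $\Gamma^{(\alpha)} = \frac{1-\alpha}{2}e^{\theta}$ and $\Gamma^{*(\alpha)} = \frac{1+\alpha}{2}e^{\theta}$. All function names are hypothetical.

```python
import math

def phi(t):
    """Illustrative one-dimensional potential Phi(t) = exp(t)."""
    return math.exp(t)

def alpha_div(x, y, alpha):
    """Convex-induced alpha-divergence for scalar arguments (alpha != +1, -1)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4.0 / (1 - alpha**2) * (a * phi(x) + b * phi(y) - phi(a * x + b * y))

def d_xy(D, t, h=1e-2):
    """Central-difference estimate of d^2 D / dx dy at x = y = t."""
    return (D(t + h, t + h) - D(t + h, t - h)
            - D(t - h, t + h) + D(t - h, t - h)) / (4 * h * h)

def d_xxy(D, t, h=1e-2):
    """Central-difference estimate of d^3 D / dx dx dy at x = y = t."""
    def dy(x):
        return (D(x, t + h) - D(x, t - h)) / (2 * h)
    return (dy(t + h) - 2 * dy(t) + dy(t - h)) / (h * h)

alpha, t = 0.4, 0.3

def D(x, y):
    return alpha_div(x, y, alpha)

def D_swapped(x, y):
    # Swapping the arguments turns d^3/dx dx dy into d^3/dy dy dx of the original D
    return alpha_div(y, x, alpha)

g = -d_xy(D, t)
gamma = -d_xxy(D, t)
gamma_star = -d_xxy(D_swapped, t)

print(g, gamma, gamma_star)
```

Note that `gamma + gamma_star` should match the derivative of the metric, here again $e^{\theta}$, which is the conjugacy condition in one dimension.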
From Vector Space to Function Space
Question: How to extend the above analysis to infinite-dimensional
function space?
A General Divergence Function(al)
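$$D^{(\alpha)}_{f,\rho}(p, q) = \frac{4}{1-\alpha^2} \int \left[\frac{1-\alpha}{2}\, f(\rho(p)) + \frac{1+\alpha}{2}\, f(\rho(q)) - f\!\left(\frac{1-\alpha}{2}\,\rho(p) + \frac{1+\alpha}{2}\,\rho(q)\right)\right] d\mu$$
for any two functions $p(\zeta)$, $q(\zeta)$ in some function space, a strictly convex function $f$, and an arbitrary, strictly
increasing function $\rho: \mathbb{R} \to \mathbb{R}$.
Remark: Induced by the convex inequality applied to $f$.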
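A Special Case of $D^{(\alpha)}_{f,\rho}$: Classic α-Divergence
$$A^{(\alpha)}(p, q) = \frac{4}{1-\alpha^2}\, E_\mu\!\left\{\frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q - p^{\frac{1-\alpha}{2}}\, q^{\frac{1+\alpha}{2}}\right\}$$
with $E_\mu\{\cdot\}$ denoting integration against $\mu$.
For parameterized pdfs, such divergence induces an α-independent metric,
but α-dependent dual connections:
$$g_{ij}(\theta) = E_{p_\theta}\!\left\{\frac{\partial \log p}{\partial\theta^i}\,\frac{\partial \log p}{\partial\theta^j}\right\}$$
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = E_{p_\theta}\!\left\{\left(\frac{\partial^2 \log p}{\partial\theta^i\,\partial\theta^j} + \frac{1-\alpha}{2}\,\frac{\partial \log p}{\partial\theta^i}\,\frac{\partial \log p}{\partial\theta^j}\right)\frac{\partial \log p}{\partial\theta^k}\right\}$$
$$\Gamma^{*(\alpha)}_{ij,k}(\theta) = \Gamma^{(-\alpha)}_{ij,k}(\theta)$$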
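A short sketch of the classic α-divergence on a finite sample space, assuming counting measure and two hand-picked probability vectors (both are illustrative assumptions). For normalized $p$, $q$, the limits $\alpha \to -1$ and $\alpha \to 1$ give the two Kullback-Leibler divergences, and $\alpha = 0$ gives twice the squared distance between $\sqrt{p}$ and $\sqrt{q}$.

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """Classic alpha-divergence A^(alpha)(p, q) under counting measure (alpha != +1, -1)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4.0 / (1 - alpha**2) * float(np.sum(a * p + b * q - p**a * q**b))

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for normalized, strictly positive vectors."""
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

# alpha = 0: symmetric member of the family
a0 = alpha_divergence(p, q, 0.0)
hellinger_like = 2.0 * float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Limits alpha -> -1 and alpha -> +1, approximated just inside the interval
near_minus = alpha_divergence(p, q, -1 + 1e-6)
near_plus = alpha_divergence(p, q, 1 - 1e-6)

print(a0, hellinger_like, near_minus - kl(p, q), near_plus - kl(q, p))
```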
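Other Examples of $D^{(\alpha)}_{f,\rho}$
Jensen Difference
$$E^{(\alpha)}(p, q) = \frac{4}{1-\alpha^2}\, E_\mu\!\left\{\frac{1-\alpha}{2}\, p \log p + \frac{1+\alpha}{2}\, q \log q - \left(\frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q\right) \log\!\left(\frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q\right)\right\}$$
U-Divergence ($\alpha = 1$)
$$U^{(\alpha)}(p, q) = \frac{4}{1-\alpha^2}\, E_\mu\!\left\{\frac{1-\alpha}{2}\, U\big((U')^{-1}(p)\big) + \frac{1+\alpha}{2}\, U\big((U')^{-1}(q)\big) - U\!\left(\frac{1-\alpha}{2}\,(U')^{-1}(p) + \frac{1+\alpha}{2}\,(U')^{-1}(q)\right)\right\}$$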
A Short Detour: Monotone Scaling
Define monotone embedding (“scaling”) of a measurable function $p$ as the
transformation $\rho(p)$, where
$\rho: \mathbb{R} \to \mathbb{R}$
is a strictly monotone function.
Observe:
i) $\rho$ is strictly monotone iff $\rho^{-1}$ is strictly monotone;
ii) $\rho(t) = t$ serves as the identity element;
iii) if $\rho_1, \rho_2$ are strictly monotone, so is $\rho_1 \circ \rho_2$.
Therefore, monotone embeddings of a given probability density function
form a group, with functional composition as group operation:
$$(\rho_1 \circ \rho_2)(t) = \rho_1(\rho_2(t))$$
We recall that for a strictly convex function $f$:
$f'$ is strictly increasing, with inverse $(f')^{-1} = (f^*)'$.
DEFINITION: $\rho$-embedding is said to be conjugated to $\tau$-embedding with
respect to a strictly convex function $f$ (whose conjugate is $f^*$) if:
$$\tau(p) = f'(\rho(p)) \iff \rho(p) = (f')^{-1}(\tau(p)) = (f^*)'(\tau(p))$$
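Example: α-embedding
$$\rho(t) = \begin{cases} \dfrac{2}{1-\alpha}\, t^{(1-\alpha)/2}, & \alpha \ne 1 \\ \log t, & \alpha = 1 \end{cases} \qquad \tau(t) = \begin{cases} \dfrac{2}{1+\alpha}\, t^{(1+\alpha)/2}, & \alpha \ne -1 \\ \log t, & \alpha = -1 \end{cases}$$
$$f(t) = \frac{2}{1+\alpha}\left(\frac{1-\alpha}{2}\, t\right)^{\frac{2}{1-\alpha}}, \qquad f^*(t) = \frac{2}{1-\alpha}\left(\frac{1+\alpha}{2}\, t\right)^{\frac{2}{1+\alpha}}$$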
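The conjugacy of the α-embeddings can be verified numerically. This sketch assumes the normalization $\rho(t) = \frac{2}{1-\alpha}t^{(1-\alpha)/2}$, $\tau(t) = \frac{2}{1+\alpha}t^{(1+\alpha)/2}$, $f(t) = \frac{2}{1+\alpha}\big(\frac{1-\alpha}{2}t\big)^{2/(1-\alpha)}$ and $f^*(t) = \frac{2}{1-\alpha}\big(\frac{1+\alpha}{2}t\big)^{2/(1+\alpha)}$ for $|\alpha| < 1$; the derivative functions are worked out by hand and the names are hypothetical.

```python
import numpy as np

def rho(t, alpha):
    """alpha-embedding rho(t) = 2/(1-alpha) * t^((1-alpha)/2)."""
    return 2.0 / (1 - alpha) * t ** ((1 - alpha) / 2)

def tau(t, alpha):
    """Conjugate embedding tau(t) = 2/(1+alpha) * t^((1+alpha)/2)."""
    return 2.0 / (1 + alpha) * t ** ((1 + alpha) / 2)

def f(s, alpha):
    return 2.0 / (1 + alpha) * ((1 - alpha) / 2 * s) ** (2 / (1 - alpha))

def f_star(u, alpha):
    return 2.0 / (1 - alpha) * ((1 + alpha) / 2 * u) ** (2 / (1 + alpha))

def f_prime(s, alpha):
    """Derivative of f, computed by hand."""
    return 2.0 / (1 + alpha) * ((1 - alpha) / 2 * s) ** ((1 + alpha) / (1 - alpha))

def f_star_prime(u, alpha):
    """Derivative of f*, computed by hand."""
    return 2.0 / (1 - alpha) * ((1 + alpha) / 2 * u) ** ((1 - alpha) / (1 + alpha))

alpha = 0.3
p = np.array([0.1, 0.5, 1.0, 2.5])  # arbitrary positive values of a density

# tau(p) = f'(rho(p)) and rho(p) = (f*)'(tau(p))
forward_gap = np.max(np.abs(f_prime(rho(p, alpha), alpha) - tau(p, alpha)))
backward_gap = np.max(np.abs(f_star_prime(tau(p, alpha), alpha) - rho(p, alpha)))

# Fenchel equality along conjugate pairs: f(rho) + f*(tau) = rho * tau
fenchel_gap = np.max(np.abs(f(rho(p, alpha), alpha) + f_star(tau(p, alpha), alpha)
                            - rho(p, alpha) * tau(p, alpha)))

print(forward_gap, backward_gap, fenchel_gap)
```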
Parameterized Functions as Forming
a Submanifold under Monotone Scaling
A sub-manifold is said to be $\rho$-affine if there exists a countable set of linearly
independent functions $\lambda_i(\zeta)$ over a measurable space such that:
$$\rho(p(\zeta)) = \sum_i \theta^i \lambda_i(\zeta)$$
Here, $\theta$ is called the “natural parameter”. The “expectation parameter” $\eta$ is
defined by projecting the conjugated $\tau$-embedding onto the $\lambda_i$:
$$\eta_i = \int \tau(p(\zeta))\, \lambda_i(\zeta)\, d\mu$$
Example: For log-linear model (exponential family)
$$\log p(\zeta) = \sum_i \theta^i \lambda_i(\zeta)$$
The expectation parameter is:
$$\eta_i = \int p(\zeta)\, \lambda_i(\zeta)\, d\mu$$
Proposition.
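For the $\rho$-affine submanifold:
i) The following potential function is strictly convex:
$$\Phi(\theta) = \int f\big(\rho(p(\zeta|\theta))\big)\, d\mu$$
$\Phi$ is called the generating (partition) functional.
ii) Define, under the conjugate representations,
$$\tilde{\Phi}(\theta) = \int f^*\big(\tau(p(\zeta|\theta))\big)\, d\mu$$
then $\Phi^*(\eta) \equiv \tilde{\Phi}\big((\partial\Phi)^{-1}(\eta)\big)$ is the Fenchel conjugate of $\Phi(\theta)$.
$\Phi^*$ is called the generalized entropy functional.
Theorem.
The $\rho$-affine submanifold is an α-Hessian manifold.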
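The proposition can be illustrated on the log-linear case. The sketch below assumes $\rho = \log$, $f = \exp$, $\tau(p) = p$ and $f^*(t) = t\log t - t$, a five-point sample space with counting measure, and two hand-picked functions $\lambda_1$, $\lambda_2$; all of these choices and names are illustrative assumptions. It checks that $\eta = \partial\Phi/\partial\theta$ and that the conjugate-representation integral equals the Legendre transform $\langle\theta,\eta\rangle - \Phi(\theta)$.

```python
import numpy as np

zeta = np.arange(5) / 4.0          # points of a finite sample space
lam = np.stack([zeta, zeta**2])    # two linearly independent functions lambda_i(zeta)
mu = np.ones_like(zeta)            # counting measure

def density(theta):
    """rho-affine family with rho = log: log p = sum_i theta^i lambda_i (unnormalized)."""
    return np.exp(theta @ lam)

def potential(theta):
    """Phi(theta) = int f(rho(p)) dmu with f = exp, i.e. the total mass of p."""
    return float(np.sum(density(theta) * mu))

def expectation_parameter(theta):
    """eta_i = int tau(p) lambda_i dmu with tau(p) = f'(rho(p)) = p."""
    return lam @ (density(theta) * mu)

def conjugate_potential(theta):
    """int f*(tau(p)) dmu with f*(t) = t log t - t."""
    p = density(theta)
    return float(np.sum((p * np.log(p) - p) * mu))

theta = np.array([0.3, -0.8])
eta = expectation_parameter(theta)

# eta should equal the gradient of Phi (central differences)
h = 1e-5
grad_fd = np.array([(potential(theta + h * e) - potential(theta - h * e)) / (2 * h)
                    for e in np.eye(2)])

# Legendre transform of Phi at eta, evaluated through theta
legendre = float(theta @ eta) - potential(theta)

print(eta, grad_fd, legendre, conjugate_potential(theta))
```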
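An Application: the $(\alpha,\beta)$-Divergence
Take $f = \rho^{-1}$ (up to the normalizing constant $2/(1+\beta)$), where:
$$\rho(t) = \begin{cases} t^{(1-\beta)/2}, & \beta \ne 1 \\ \log t, & \beta = 1 \end{cases}$$
is the “alpha-embedding”, with its parameter now denoted by $\beta$.
$$D^{(\alpha,\beta)}(p, q) = \frac{4}{1-\alpha^2}\,\frac{2}{1+\beta} \int \left[\frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q - \left(\frac{1-\alpha}{2}\, p^{\frac{1-\beta}{2}} + \frac{1+\alpha}{2}\, q^{\frac{1-\beta}{2}}\right)^{\frac{2}{1-\beta}}\right] d\mu$$
$\alpha$: parameter reflecting reference duality
$\beta$: parameter reflecting representation duality
They reduce to the α-divergence proper $A^{(\alpha)}$ and to the Jensen difference $E^{(\alpha)}$:
$$\lim_{\alpha \to -1} D^{(\alpha,\beta)}(p, q) = A^{(-\beta)}(p, q)$$
$$\lim_{\alpha \to 1} D^{(\alpha,\beta)}(p, q) = A^{(\beta)}(p, q)$$
$$\lim_{\beta \to 1} D^{(\alpha,\beta)}(p, q) = A^{(\alpha)}(p, q)$$
$$\lim_{\beta \to -1} D^{(\alpha,\beta)}(p, q) = E^{(\alpha)}(p, q)$$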
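The four limits can be checked numerically on a finite sample space. The sketch below assumes counting measure, two hand-picked positive vectors, and the forms of $D^{(\alpha,\beta)}$, $A^{(\alpha)}$ and $E^{(\alpha)}$ with weight $\frac{1-\alpha}{2}$ on $p$; limits are approximated by evaluating just inside the parameter range. Function names are hypothetical.

```python
import numpy as np

def ab_divergence(p, q, alpha, beta):
    """(alpha, beta)-divergence under counting measure (alpha != +1, -1; beta != +1, -1)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    e = (1 - beta) / 2
    mix = (a * p**e + b * q**e) ** (1 / e)
    return 4.0 / (1 - alpha**2) * 2.0 / (1 + beta) * float(np.sum(a * p + b * q - mix))

def alpha_divergence(p, q, alpha):
    """Classic alpha-divergence A^(alpha)(p, q)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4.0 / (1 - alpha**2) * float(np.sum(a * p + b * q - p**a * q**b))

def jensen_difference(p, q, alpha):
    """Jensen difference E^(alpha)(p, q)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    m = a * p + b * q
    return 4.0 / (1 - alpha**2) * float(
        np.sum(a * p * np.log(p) + b * q * np.log(q) - m * np.log(m)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
alpha, beta, eps = 0.4, 0.5, 1e-5

gap_alpha_minus = ab_divergence(p, q, -1 + eps, beta) - alpha_divergence(p, q, -beta)
gap_alpha_plus = ab_divergence(p, q, 1 - eps, beta) - alpha_divergence(p, q, beta)
gap_beta_plus = ab_divergence(p, q, alpha, 1 - eps) - alpha_divergence(p, q, alpha)
gap_beta_minus = ab_divergence(p, q, alpha, -1 + eps) - jensen_difference(p, q, alpha)

print(gap_alpha_minus, gap_alpha_plus, gap_beta_plus, gap_beta_minus)
```

Each printed gap should be of the order of `eps`.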
Information Geometry on Banach Space
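Proposition 1. Denote tangent vector fields $u(\zeta|p)$, $v(\zeta|p)$ which are,
at given $p$ on the manifold, themselves functions in Banach space. The metric
and dual connections induced by $D^{(\alpha)}_{f,\rho}(p, q)$ take the forms:
$$g_p(u, v) = \int g(p)\, u(\zeta|p)\, v(\zeta|p)\, d\mu$$
$$\big(\nabla^{(\alpha)}_v u\big)_p(\zeta) = (d_v u)(\zeta) + B^{(\alpha)}(p)\, u(\zeta|p)\, v(\zeta|p)$$
$$\big(\nabla^{*(\alpha)}_v u\big)_p(\zeta) = (d_v u)(\zeta) + B^{(-\alpha)}(p)\, u(\zeta|p)\, v(\zeta|p)$$
where
$$g(p) = f''(\rho(p))\, (\rho'(p))^2, \qquad B^{(\alpha)}(p) = \frac{1-\alpha}{2}\, \frac{f'''(\rho(p))\, \rho'(p)}{f''(\rho(p))} + \frac{\rho''(p)}{\rho'(p)}$$
Written in dually symmetric form:
$$g(p) = \rho'(p)\, \tau'(p), \qquad B^{(\alpha)}(p) = \frac{d}{dp}\left(\frac{1+\alpha}{2}\, \log \rho'(p) + \frac{1-\alpha}{2}\, \log \tau'(p)\right)$$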
Corollary 1a.
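For a finite-dimensional submanifold (parametric model), with
$$u = \frac{\partial p(\zeta|\theta)}{\partial\theta^i}, \qquad v = \frac{\partial p(\zeta|\theta)}{\partial\theta^j},$$
the metric and dual connections associated with $D^{(\alpha)}_{f,\rho}(\theta_p, \theta_q)$ are given by (writing $\rho = \rho(p(\zeta|\theta))$ and $\tau = \tau(p(\zeta|\theta))$):
$$g_{ij}(\theta) = \int f''(\rho)\, \frac{\partial\rho}{\partial\theta^i}\, \frac{\partial\rho}{\partial\theta^j}\, d\mu = \int \frac{\partial\rho}{\partial\theta^i}\, \frac{\partial\tau}{\partial\theta^j}\, d\mu$$
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = \int \left[\frac{1-\alpha}{2}\, f'''(\rho)\, \frac{\partial\rho}{\partial\theta^i}\, \frac{\partial\rho}{\partial\theta^j}\, \frac{\partial\rho}{\partial\theta^k} + f''(\rho)\, \frac{\partial^2\rho}{\partial\theta^i\,\partial\theta^j}\, \frac{\partial\rho}{\partial\theta^k}\right] d\mu = \int \left[\frac{1+\alpha}{2}\, \frac{\partial^2\rho}{\partial\theta^i\,\partial\theta^j}\, \frac{\partial\tau}{\partial\theta^k} + \frac{1-\alpha}{2}\, \frac{\partial^2\tau}{\partial\theta^i\,\partial\theta^j}\, \frac{\partial\rho}{\partial\theta^k}\right] d\mu$$
with $\Gamma^{*(\alpha)}_{ij,k}(\theta) = \Gamma^{(-\alpha)}_{ij,k}(\theta)$.
Remark: Choosing $\rho(t) = \log t$, $\tau(t) = t$, $f(t) = \exp(t)$ reduces these to the forms of the Fisher
metric and the α-connections in classical parametric information geometry, where
$\rho = \log p$, $\tau = f'(\rho) = p$ (and likewise $f''(\rho) = f'''(\rho) = p$).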
Proposition 2. The curvature $R^{(\alpha)}$ and torsion tensors $T^{(\alpha)}$ associated with
any α-connection on the infinite-dimensional function space $B$ are identically zero.
Remark: The ambient space $B$ is flat, so it embeds, as proper submanifolds,
(i) the manifold $M_\mu$ of probability density functions (constrained to be
positive-valued and normalized to unit measure);
(ii) the finite-dimensional manifold $M_\theta$ of parameterized probability models.
$$M_\theta \subset M_\mu \subset B \quad \text{(ambient manifold)}$$
CAVEAT: Topology? (G. Pistone and his colleagues)
Proposition 3.
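The $(\alpha,\beta)$-divergence for the parametric models
gives rise to the Fisher metric proper and alpha-connections proper:
$$g_{ij}(\theta) = \int p_\theta\, \frac{\partial \log p}{\partial\theta^i}\, \frac{\partial \log p}{\partial\theta^j}\, d\mu$$
$$\Gamma^{(\alpha,\beta)}_{ij,k}(\theta) = \int p_\theta \left(\frac{\partial^2 \log p}{\partial\theta^i\,\partial\theta^j} + \frac{1-\alpha\beta}{2}\, \frac{\partial \log p}{\partial\theta^i}\, \frac{\partial \log p}{\partial\theta^j}\right) \frac{\partial \log p}{\partial\theta^k}\, d\mu = \Gamma^{(-\alpha,-\beta)}_{ij,k}(\theta)$$
$$\Gamma^{*(\alpha,\beta)}_{ij,k}(\theta) = \Gamma^{(-\alpha,\beta)}_{ij,k}(\theta) = \Gamma^{(\alpha,-\beta)}_{ij,k}(\theta)$$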
Remark: The $(\alpha,\beta)$-divergence is the homogeneous f-divergence
$$D^{(\alpha,\beta)}(\lambda p, \lambda q) = \lambda\, D^{(\alpha,\beta)}(p, q)$$
As such, it should reproduce the standard Fisher metric and the dual alpha-connections in their proper form. Again, it is the product $\alpha\beta$ that takes the role of
the conventional “alpha” parameter.
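Both the homogeneity and the role of the product $\alpha\beta$ can be probed numerically. The sketch below assumes a hypothetical one-parameter exponential family on a three-point space, counting measure, and Eguchi's third-order recipe $\Gamma = -\partial_x\partial_x\partial_y D$ evaluated by central differences; the closed form uses $\partial^2_\theta \log p = -\mathrm{Var}(\lambda)$ for this family. All names are illustrative.

```python
import numpy as np

lam = np.array([0.0, 1.0, 2.0])   # hypothetical sufficient statistic on a 3-point space

def model(theta):
    """One-parameter exponential family p(zeta | theta) proportional to exp(theta * lambda)."""
    w = np.exp(theta * lam)
    return w / w.sum()

def ab_divergence(p, q, alpha, beta):
    """(alpha, beta)-divergence under counting measure."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    e = (1 - beta) / 2
    mix = (a * p**e + b * q**e) ** (1 / e)
    return 4.0 / (1 - alpha**2) * 2.0 / (1 + beta) * float(np.sum(a * p + b * q - mix))

def gamma_fd(alpha, beta, theta, h=5e-3):
    """-d^3 D / dx dx dy at x = y = theta, by central differences."""
    def D(x, y):
        return ab_divergence(model(x), model(y), alpha, beta)
    def dy(x):
        return (D(x, theta + h) - D(x, theta - h)) / (2 * h)
    return -(dy(theta + h) - 2 * dy(theta) + dy(theta - h)) / (h * h)

def gamma_formula(alpha, beta, theta):
    """Closed form with coefficient (1 - alpha*beta)/2 on the squared score."""
    p = model(theta)
    score = lam - np.sum(p * lam)          # d log p / d theta
    d2 = -np.sum(p * score**2)             # d^2 log p / d theta^2 (same at every point)
    return float(np.sum(p * (d2 + (1 - alpha * beta) / 2 * score**2) * score))

theta = 0.5
p, q = model(0.5), model(-0.2)

# Homogeneity: D(c p, c q) = c D(p, q)
homog_gap = ab_divergence(3.0 * p, 3.0 * q, 0.4, 0.5) - 3.0 * ab_divergence(p, q, 0.4, 0.5)

g1 = gamma_fd(0.4, 0.5, theta)
g2 = gamma_fd(0.5, 0.4, theta)      # same product alpha * beta
g3 = gamma_fd(-0.4, -0.5, theta)    # same product alpha * beta
exact = gamma_formula(0.4, 0.5, theta)

print(homog_gap, g1, g2, g3, exact)
```

The three finite-difference values share the same product $\alpha\beta$ and should agree with one another and with the closed form, up to discretization error.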
Summary of Current Approach
Divergence
α-divergence
equivalent to δ-divergence (Zhu & Rohwer, 1995)
includes KL divergence as a special case
f-divergence (Csiszar)
Bregman divergence
equivalent to the canonical divergence
U-divergence (Eguchi)
Convex-based α-divergence for
vector space of finite dim
function space of infinite dim
Geometry
Riemannian metric
Fisher information
Conjugate connections
α-connection family
Equi-affine structure
cubic form, Tchebychev 1-form
Curvature
Generalized expressions of
Fisher metric
α-connections
References
Zhang, J. (2004). Divergence function, duality, and convex analysis. Neural
Computation, 16: 159-195.
Zhang, J. (2005) Referential duality and representational duality in the scaling of
multidimensional and infinite-dimensional stimulus space. In Dzhafarov, E. and
Colonius, H. (Eds.) Measurement and representation of sensations: Recent progress
in psychological theory. Lawrence Erlbaum Associates, Mahwah, NJ.
Zhang, J. and Hästö, P. (2006) Statistical manifold as an affine space: A functional
equation approach. Journal of Mathematical Psychology, 50: 60-65.
Zhang, J. (2006). Referential duality and representational duality on statistical
manifolds. Proceedings of the Second International Symposium on Information
Geometry and Its Applications, Tokyo (pp 58-67).
Zhang, J. (2007). A note on curvature of α-connections of a statistical manifold. Annals
of the Institute of Statistical Mathematics, 59: 161-170.
Zhang, J. and Matsuzoe, H. (in press). Dualistic differential geometry associated with
a convex function. To appear in a special volume in the Springer series of
Advances in Mechanics and Mathematics.
Zhang, J. (under review) Nonparametric information geometry: Referential duality
and representational duality on statistical manifolds.
Questions?