Information Geometry:
Duality, Convexity, and Divergences
Jun Zhang*
University of Michigan
Ann Arbor, Michigan 48104
[email protected]
*Currently on leave to AFOSR under IPA
Lecture Plan
1) A revisit to Bregman divergence
2) Generalization ($\alpha$-divergence on $\mathbb{R}^n$) and $\alpha$-Hessian geometry
3) Embedding into infinite-dimensional function space
4) Generalized Fisher metric and $\alpha$-connection on Banach space
Clarify two senses of duality in information geometry:
Reference duality: $\alpha \leftrightarrow -\alpha$,
choice of the reference vs. comparison point on the manifold;
Representational duality: $p \mapsto \log p,\ p,\ \ldots$,
choice of a monotonic scaling of the density function.
Bregman Divergence
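Let $\Phi: \mathbb{R}^n \to \mathbb{R}$ be a strictly convex function, and $\theta_1, \theta_2 \in \mathbb{R}^n$:
$$B_\Phi(\theta_1, \theta_2) = \Phi(\theta_1) - \Phi(\theta_2) - \langle \theta_1 - \theta_2,\ (\partial\Phi)(\theta_2) \rangle$$
i) Quadrilateral relation:
$$B_\Phi(\theta_1, \theta_2) + B_\Phi(\theta_4, \theta_3) - B_\Phi(\theta_1, \theta_3) - B_\Phi(\theta_4, \theta_2) = \langle \theta_1 - \theta_4,\ (\partial\Phi)(\theta_3) - (\partial\Phi)(\theta_2) \rangle$$
Triangular relation (generalized cosine) as a special case ($\theta_4 = \theta_2$):
$$B_\Phi(\theta_1, \theta_2) + B_\Phi(\theta_2, \theta_3) - B_\Phi(\theta_1, \theta_3) = \langle \theta_1 - \theta_2,\ (\partial\Phi)(\theta_3) - (\partial\Phi)(\theta_2) \rangle$$
ii) Reference-representation biduality:
$$B_\Phi(\theta_1, \theta_2) = B_{\Phi^*}\big((\partial\Phi)(\theta_2),\ (\partial\Phi)(\theta_1)\big)$$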
Canonical Divergence and Fenchel Inequality
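An alternative expression of Bregman divergence is the canonical divergence
$$A_\Phi(\theta, \eta) \equiv B_\Phi\big(\theta,\ (\partial\Phi)^{-1}(\eta)\big)$$
or explicitly:
$$A_\Phi(\theta, \eta) = \Phi(\theta) + \Phi^*(\eta) - \langle \theta, \eta \rangle = A_{\Phi^*}(\eta, \theta)$$
where
$$\Phi^*(\eta) = \langle (\partial\Phi)^{-1}(\eta),\ \eta \rangle - \Phi\big((\partial\Phi)^{-1}(\eta)\big)$$
That $A_\Phi$ is non-negative is a direct consequence of the Fenchel inequality for a strictly convex function $\Phi$:
$$\Phi(\theta) + \Phi^*(\eta) \ge \langle \theta, \eta \rangle$$
where equality holds if and only if $\eta = (\partial\Phi)(\theta) \iff \theta = (\partial\Phi^*)(\eta)$.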
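As a one-dimensional illustration of the Fenchel inequality, the sketch below assumes $\Phi(\theta) = e^\theta$, whose conjugate is $\Phi^*(\eta) = \eta\log\eta - \eta$ for $\eta > 0$, and compares the canonical form $\Phi(\theta) + \Phi^*(\eta) - \theta\eta$ with a Bregman divergence whose second argument is $(\partial\Phi)^{-1}(\eta) = \log\eta$. All names are hypothetical.

```python
import math

def phi(theta):
    """Illustrative strictly convex potential."""
    return math.exp(theta)

def phi_star(eta):
    """Fenchel conjugate of exp, valid for eta > 0."""
    return eta * math.log(eta) - eta

def canonical(theta, eta):
    """A(theta, eta) = phi(theta) + phi*(eta) - theta * eta."""
    return phi(theta) + phi_star(eta) - theta * eta

def bregman(t1, t2):
    """B(t1, t2) with the gradient taken at the second argument (assumed convention)."""
    return phi(t1) - phi(t2) - (t1 - t2) * math.exp(t2)

theta, eta = 0.3, 2.5
a_value = canonical(theta, eta)
b_value = bregman(theta, math.log(eta))          # (dPhi)^{-1}(eta) = log(eta)
gap_at_dual = canonical(theta, math.exp(theta))  # eta = dPhi(theta): equality case

print(a_value, b_value, gap_at_dual)
```

The first two printed numbers should agree, and the third should vanish up to rounding, which is the equality case of the Fenchel inequality.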
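Convex Inequality and $\alpha$-Divergence Induced by It
By the definition of a strictly convex function $\Phi$,
$$\frac{1-\alpha}{2}\,\Phi(\theta_1) + \frac{1+\alpha}{2}\,\Phi(\theta_2) \ \ge\ \Phi\!\left(\frac{1-\alpha}{2}\,\theta_1 + \frac{1+\alpha}{2}\,\theta_2\right), \qquad \alpha \in (-1, 1)$$
It is easy to show that the following is non-negative for all $\alpha \in \mathbb{R}$:
$$D_\Phi^{(\alpha)}(\theta_1, \theta_2) = \frac{4}{1-\alpha^2}\left[\frac{1-\alpha}{2}\,\Phi(\theta_1) + \frac{1+\alpha}{2}\,\Phi(\theta_2) - \Phi\!\left(\frac{1-\alpha}{2}\,\theta_1 + \frac{1+\alpha}{2}\,\theta_2\right)\right] \ \ge\ 0$$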
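The non-negativity claim, including values of $\alpha$ outside $(-1, 1)$, can be probed numerically. This is a minimal sketch assuming the form $D^{(\alpha)}_\Phi(\theta_1,\theta_2) = \frac{4}{1-\alpha^2}\big[\frac{1-\alpha}{2}\Phi(\theta_1) + \frac{1+\alpha}{2}\Phi(\theta_2) - \Phi(\frac{1-\alpha}{2}\theta_1 + \frac{1+\alpha}{2}\theta_2)\big]$ and the illustrative potential $\Phi(\theta) = \sum_i e^{\theta^i}$; the names are hypothetical.

```python
import math

def phi(theta):
    """Illustrative strictly convex potential on R^n."""
    return sum(math.exp(t) for t in theta)

def alpha_divergence(t1, t2, alpha):
    """Convex-inequality-based alpha-divergence; requires alpha != +1, -1."""
    w1, w2 = (1 - alpha) / 2, (1 + alpha) / 2
    mix = [w1 * a + w2 * b for a, b in zip(t1, t2)]
    return 4 / (1 - alpha ** 2) * (w1 * phi(t1) + w2 * phi(t2) - phi(mix))

t1, t2 = [0.2, -1.0], [1.0, 0.5]

# Sample alpha inside and outside the interval (-1, 1).
values = {a: alpha_divergence(t1, t2, a) for a in (-3.0, -0.5, 0.0, 0.5, 3.0)}
for a, d in values.items():
    print(a, d)
```

For $|\alpha| > 1$ the bracket and the prefactor both change sign, so the product stays non-negative for a convex potential.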
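Conjugate-symmetry:
$$D_\Phi^{(\alpha)}(\theta_1, \theta_2) = D_\Phi^{(-\alpha)}(\theta_2, \theta_1)$$
Easily verifiable:
$$D_\Phi^{(1)}(\theta_1, \theta_2) = \lim_{\alpha \to 1} D_\Phi^{(\alpha)}(\theta_1, \theta_2) = B_\Phi(\theta_1, \theta_2)$$
$$D_\Phi^{(-1)}(\theta_1, \theta_2) = \lim_{\alpha \to -1} D_\Phi^{(\alpha)}(\theta_1, \theta_2) = B_\Phi(\theta_2, \theta_1)$$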
Significance of Bregman Divergence
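Among $\alpha$-Divergence Family
Proposition:
For a smooth function $\Phi: \mathbb{R}^n \to \mathbb{R}$, the following are equivalent:
(i) $\Phi$ is strictly convex;
(ii) $D_\Phi^{(1)}(\theta_1, \theta_2) = B_\Phi(\theta_1, \theta_2) \ge 0$;
(iii) $D_\Phi^{(-1)}(\theta_1, \theta_2) = B_\Phi(\theta_2, \theta_1) \ge 0$;
(iv) $D_\Phi^{(\alpha)}(\theta_1, \theta_2) \ge 0$ for all $|\alpha| < 1$;
(v) $D_\Phi^{(\alpha)}(\theta_1, \theta_2) \ge 0$ for all $|\alpha| > 1$.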
Statistical Manifold Structure Induced From
Divergence Function (Eguchi, 1983)
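Given a divergence $D(x, y)$ with $D(x, x) = 0$, one can derive the Riemannian metric and a pair of conjugate connections.
In essence, expand $D(x, y)$ around $x = y$:
i) 2nd order: one (and the same) metric
$$g_{ij}(x) = -\left.\frac{\partial^2 D(x, y)}{\partial x^i\, \partial y^j}\right|_{x=y} = -\left.\frac{\partial^2 D(x, y)}{\partial y^i\, \partial x^j}\right|_{x=y}$$
ii) 3rd order: a pair of conjugate connections
$$\Gamma_{ij,k}(x) = -\left.\frac{\partial^3 D(x, y)}{\partial x^i\, \partial x^j\, \partial y^k}\right|_{x=y}, \qquad \Gamma^*_{ij,k}(x) = -\left.\frac{\partial^3 D(x, y)}{\partial y^i\, \partial y^j\, \partial x^k}\right|_{x=y}$$
The compatibility condition $\partial_k g_{ij} = \Gamma_{ki,j} + \Gamma^*_{kj,i}$ is satisfied by such identification of derivatives of $D$.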
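$\alpha$-Hessian Geometry (of Finite-Dimensional Vector Space)
Theorem.
$D_\Phi^{(\alpha)}$ induces the $\alpha$-Hessian manifold, i.e.
i) The metric and conjugate affine connections are given by:
$$g_{ij}(\theta) = \frac{\partial^2 \Phi}{\partial\theta^i\, \partial\theta^j}$$
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = \frac{1-\alpha}{2}\, \frac{\partial^3 \Phi}{\partial\theta^i\, \partial\theta^j\, \partial\theta^k}, \qquad \Gamma^{*(\alpha)}_{ij,k}(\theta) = \frac{1+\alpha}{2}\, \frac{\partial^3 \Phi}{\partial\theta^i\, \partial\theta^j\, \partial\theta^k}$$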
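ii) Riemann curvature is given by:
$$R^{(\alpha)}_{\mu\nu ij}(\theta) = \frac{1-\alpha^2}{4} \sum_{l,k} \left(\Phi_{il\nu}\,\Phi_{jk\mu} - \Phi_{il\mu}\,\Phi_{jk\nu}\right) \Psi^{lk} = R^{*(\alpha)}_{ij\mu\nu}(\theta)$$
where subscripts on $\Phi$ denote partial derivatives and $\Psi^{lk}$ is the matrix inverse of $\Phi_{lk}$.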
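iii) The manifold is equi-affine, with the Tchebychev potential given by:
$$\det|g| = \det\left[\frac{\partial^2 \Phi(\theta)}{\partial\theta^i\, \partial\theta^j}\right]$$
and the $\alpha$-parallel volume form given by
$$\left(\det\left[\frac{\partial^2 \Phi(\theta)}{\partial\theta^i\, \partial\theta^j}\right]\right)^{(1-\alpha)/2}$$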
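iv) There exist biorthogonal coordinates:
$$\eta_i = \frac{\partial \Phi(\theta)}{\partial\theta^i}, \qquad \theta^i = \frac{\partial \Phi^*(\eta)}{\partial\eta_i}$$
with
$$\frac{\partial \eta_i}{\partial\theta^j} = g_{ij}, \qquad \frac{\partial \theta^i}{\partial\eta_j} = g^{ij}$$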
From Vector Space to Function Space
Question: How to extend the above analysis to infinite-dimensional
function space?
A General Divergence Function(al)
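$$D^{(\alpha)}_{f,\rho}(p, q) = \frac{4}{1-\alpha^2} \int \left\{ \frac{1-\alpha}{2}\, f(\rho(p)) + \frac{1+\alpha}{2}\, f(\rho(q)) - f\!\left(\frac{1-\alpha}{2}\,\rho(p) + \frac{1+\alpha}{2}\,\rho(q)\right) \right\} d\mu$$
for any two functions $p(\zeta)$, $q(\zeta)$ in some function space, and an arbitrary, strictly increasing function $\rho: \mathbb{R} \to \mathbb{R}$.
Remark: Induced by the convex inequality for the strictly convex function $f$.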
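To make the roles of $f$ and $\rho$ concrete, here is a minimal sketch that discretises the functional on a finite sample space with counting measure (an assumption made only for illustration). It assumes the form $\frac{4}{1-\alpha^2}\sum\{\frac{1-\alpha}{2}f(\rho(p)) + \frac{1+\alpha}{2}f(\rho(q)) - f(\frac{1-\alpha}{2}\rho(p) + \frac{1+\alpha}{2}\rho(q))\}$; the function name `general_divergence` is hypothetical.

```python
import math

def general_divergence(p, q, alpha, f, rho):
    """Discretised D^{(alpha)}_{f,rho}(p, q) with counting measure; alpha != +1, -1."""
    w1, w2 = (1 - alpha) / 2, (1 + alpha) / 2
    total = 0.0
    for pi, qi in zip(p, q):
        rp, rq = rho(pi), rho(qi)
        total += w1 * f(rp) + w2 * f(rq) - f(w1 * rp + w2 * rq)
    return 4 / (1 - alpha ** 2) * total

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

# f = exp with rho = log: the mixture term becomes p^{(1-alpha)/2} q^{(1+alpha)/2}.
d_exp_log = general_divergence(p, q, 0.5, math.exp, math.log)

# f(t) = t log t with rho = identity: a Jensen-type difference.
d_tlogt = general_divergence(p, q, 0.5, lambda t: t * math.log(t), lambda t: t)

print(d_exp_log, d_tlogt)
```

Different pairs `(f, rho)` give different members of the family; the only requirements used here are that `f` is strictly convex on the range of `rho` and that `rho` is strictly increasing.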
A Special Case of $D^{(\alpha)}$: Classic $\alpha$-Divergence
$$A^{(\alpha)}(p, q) = \frac{4}{1-\alpha^2}\, E_\mu\!\left\{ \frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q - p^{(1-\alpha)/2}\, q^{(1+\alpha)/2} \right\}$$
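As an illustration on discrete distributions, the sketch below evaluates the classic $\alpha$-divergence in the form $\frac{4}{1-\alpha^2}\sum\{\frac{1-\alpha}{2}p + \frac{1+\alpha}{2}q - p^{(1-\alpha)/2}q^{(1+\alpha)/2}\}$ (counting measure assumed) and compares its values near $\alpha = \pm 1$ with the two Kullback-Leibler divergences. The names `alpha_div` and `kl` are hypothetical.

```python
import math

def alpha_div(p, q, alpha):
    """Classic alpha-divergence on a finite sample space; alpha != +1, -1."""
    w1, w2 = (1 - alpha) / 2, (1 + alpha) / 2
    s = sum(w1 * a + w2 * b - a ** w1 * b ** w2 for a, b in zip(p, q))
    return 4 / (1 - alpha ** 2) * s

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for normalised p, q."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

near_minus = alpha_div(p, q, -1 + 1e-6)   # approaches KL(p || q)
near_plus = alpha_div(p, q, 1 - 1e-6)     # approaches KL(q || p)
print(near_minus, kl(p, q))
print(near_plus, kl(q, p))
```

With this parametrisation, $\alpha \to -1$ recovers $\mathrm{KL}(p\,\|\,q)$ and $\alpha \to 1$ recovers $\mathrm{KL}(q\,\|\,p)$, which is one way to see KL divergence as a special case of the family.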
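For parameterized pdf's, such divergence induces an $\alpha$-independent metric, but $\alpha$-dependent dual connections:
$$g_{ij}(\theta) = E_\mu\!\left\{ p_\theta\, \frac{\partial \log p}{\partial\theta^i}\, \frac{\partial \log p}{\partial\theta^j} \right\}$$
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = E_\mu\!\left\{ p_\theta \left( \frac{\partial^2 \log p}{\partial\theta^i\, \partial\theta^j} + \frac{1-\alpha}{2}\, \frac{\partial \log p}{\partial\theta^i}\, \frac{\partial \log p}{\partial\theta^j} \right) \frac{\partial \log p}{\partial\theta^k} \right\}$$
$$\Gamma^{*(\alpha)}_{ij,k}(\theta) = \Gamma^{(-\alpha)}_{ij,k}(\theta)$$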
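The claim that the metric does not depend on $\alpha$ can be checked on a one-parameter example. The sketch below assumes a Bernoulli model $p_\theta = (\theta, 1-\theta)$ with counting measure, for which the Fisher information is $1/(\theta(1-\theta))$, and recovers the metric from the second-order behaviour of the classic $\alpha$-divergence, $A^{(\alpha)}(p_\theta, p_{\theta \pm h}) \approx \tfrac{1}{2} g\, h^2$. Function names are hypothetical.

```python
def alpha_div(p, q, alpha):
    """Classic alpha-divergence on a finite sample space; alpha != +1, -1."""
    w1, w2 = (1 - alpha) / 2, (1 + alpha) / 2
    s = sum(w1 * a + w2 * b - a ** w1 * b ** w2 for a, b in zip(p, q))
    return 4 / (1 - alpha ** 2) * s

def bernoulli(theta):
    """Probabilities of the two outcomes."""
    return [theta, 1 - theta]

def metric_from_divergence(theta, alpha, h=1e-3):
    """Symmetrised second-order estimate: [D(theta, theta+h) + D(theta, theta-h)] / h^2."""
    p = bernoulli(theta)
    forward = alpha_div(p, bernoulli(theta + h), alpha)
    backward = alpha_div(p, bernoulli(theta - h), alpha)
    return (forward + backward) / h ** 2

theta = 0.3
fisher = 1 / (theta * (1 - theta))
estimates = [metric_from_divergence(theta, a) for a in (-0.5, 0.0, 0.7)]
print(fisher, estimates)
```

Averaging the forward and backward steps cancels the third-order term, which is where the $\alpha$-dependence (the connection) enters; only the common second-order term survives.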
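Other Examples of $D^{(\alpha)}$
Jensen Difference
$$E^{(\alpha)}(p, q) = \frac{4}{1-\alpha^2}\, E_\mu\!\left\{ \frac{1-\alpha}{2}\, p \log p + \frac{1+\alpha}{2}\, q \log q - \left(\frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q\right) \log\!\left(\frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q\right) \right\}$$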
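For discrete distributions the Jensen difference is easy to evaluate. The sketch below assumes the form $\frac{4}{1-\alpha^2}\sum\{\frac{1-\alpha}{2}p\log p + \frac{1+\alpha}{2}q\log q - m\log m\}$ with $m = \frac{1-\alpha}{2}p + \frac{1+\alpha}{2}q$, and checks that at $\alpha = 0$ it equals four times the Jensen-Shannon divergence. Names are hypothetical.

```python
import math

def jensen_difference(p, q, alpha):
    """Jensen difference E^{(alpha)}(p, q) with counting measure; alpha != +1, -1."""
    w1, w2 = (1 - alpha) / 2, (1 + alpha) / 2
    total = 0.0
    for a, b in zip(p, q):
        m = w1 * a + w2 * b
        total += w1 * a * math.log(a) + w2 * b * math.log(b) - m * math.log(m)
    return 4 / (1 - alpha ** 2) * total

def jensen_shannon(p, q):
    """JS(p, q) = KL(p || m)/2 + KL(q || m)/2 with m = (p + q)/2."""
    js = 0.0
    for a, b in zip(p, q):
        m = (a + b) / 2
        js += 0.5 * a * math.log(a / m) + 0.5 * b * math.log(b / m)
    return js

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

e0 = jensen_difference(p, q, 0.0)
print(e0, 4 * jensen_shannon(p, q))
```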




U-Divergence ($\alpha = 1$)
$$U^{(\alpha)}(p, q) = \frac{4}{1-\alpha^2}\, E_\mu\!\left\{ \frac{1-\alpha}{2}\, U\big((U')^{-1}(p)\big) + \frac{1+\alpha}{2}\, U\big((U')^{-1}(q)\big) - U\!\left(\frac{1-\alpha}{2}\,(U')^{-1}(p) + \frac{1+\alpha}{2}\,(U')^{-1}(q)\right) \right\}$$
A Short Detour: Monotone Scaling
Define monotone embedding ("scaling") of a measurable function $p$ as the transformation $\rho(p)$, where $\rho: \mathbb{R} \to \mathbb{R}$ is a strictly monotone function.
Observe:
i) $\rho$ is strictly monotone iff $\rho^{-1}$ is strictly monotone;
ii) $\rho(t) = t$ as the identity element;
iii) if $\rho_1$, $\rho_2$ are strictly monotone, so is $\rho_1 \circ \rho_2$.
Therefore, monotone embeddings of a given probability density function form a group, with functional composition as group operation:
$$(\rho_1 \circ \rho_2)(t) = \rho_1(\rho_2(t))$$
We recall that for a strictly convex function $f$: $f'$ is strictly increasing, with inverse $(f')^{-1} = (f^*)'$.
DEFINITION: $\rho$-embedding is said to be conjugated to $\tau$-embedding with respect to a strictly convex function $f$ (whose conjugate is $f^*$) if $\tau = f' \circ \rho$:
$$\tau(p) = f'(\rho(p)) \iff \rho(p) = (f')^{-1}(\tau(p)) = (f^*)'(\tau(p))$$
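Example: $\alpha$-embedding
$$\rho(t) = \begin{cases} \dfrac{2}{1-\alpha}\, t^{(1-\alpha)/2}, & \alpha \ne 1 \\ \log t, & \alpha = 1 \end{cases} \qquad \tau(t) = \begin{cases} \dfrac{2}{1+\alpha}\, t^{(1+\alpha)/2}, & \alpha \ne -1 \\ \log t, & \alpha = -1 \end{cases}$$
$$f(t) = \frac{2}{1+\alpha} \left(\frac{1-\alpha}{2}\, t\right)^{2/(1-\alpha)}, \qquad f^*(t) = \frac{2}{1-\alpha} \left(\frac{1+\alpha}{2}\, t\right)^{2/(1+\alpha)}$$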
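The conjugacy condition $\tau = f' \circ \rho$ can be verified numerically for the $\alpha$-embedding. The sketch below assumes the normalised forms $\rho(t) = \frac{2}{1-\alpha}t^{(1-\alpha)/2}$, $\tau(t) = \frac{2}{1+\alpha}t^{(1+\alpha)/2}$, $f(s) = \frac{2}{1+\alpha}(\frac{1-\alpha}{2}s)^{2/(1-\alpha)}$ and $f^*(s) = \frac{2}{1-\alpha}(\frac{1+\alpha}{2}s)^{2/(1+\alpha)}$ for $|\alpha| < 1$; the factors $2/(1 \mp \alpha)$ are an assumption chosen so that the four functions are mutually consistent. Names are hypothetical.

```python
def rho(t, a):
    """alpha-embedding (assumed normalisation), a != 1."""
    return 2 / (1 - a) * t ** ((1 - a) / 2)

def tau(t, a):
    """Conjugate embedding, a != -1."""
    return 2 / (1 + a) * t ** ((1 + a) / 2)

def f(s, a):
    """Strictly convex function linking rho and tau."""
    return 2 / (1 + a) * ((1 - a) / 2 * s) ** (2 / (1 - a))

def f_star(s, a):
    """Fenchel conjugate of f."""
    return 2 / (1 - a) * ((1 + a) / 2 * s) ** (2 / (1 + a))

def derivative(fn, s, h=1e-6):
    """Central finite-difference derivative."""
    return (fn(s + h) - fn(s - h)) / (2 * h)

a, t = 0.4, 1.7
r, u = rho(t, a), tau(t, a)

conjugacy_gap = derivative(lambda s: f(s, a), r) - u   # f'(rho(t)) - tau(t)
fenchel_gap = f(r, a) + f_star(u, a) - r * u           # equality case of Fenchel
print(conjugacy_gap, fenchel_gap)
```

Both printed gaps should be near zero: the first expresses $\tau = f' \circ \rho$, the second the Fenchel equality $f(\rho) + f^*(\tau) = \rho\,\tau$ at conjugate points.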
Parameterized Functions as Forming
a Submanifold under Monotone Scaling
A sub-manifold is said to be $\rho$-affine if there exists a countable set of linearly independent functions $\lambda_i(\zeta)$ over a measurable space such that:
$$\rho(p(\zeta)) = \sum_i \theta^i\, \lambda_i(\zeta)$$
Here, $\theta$ is called the "natural parameter". The "expectation parameter" is defined by projecting the conjugated $\tau$-embedding onto the $\lambda_i$:
$$\eta_i = \int \tau(p(\zeta))\, \lambda_i(\zeta)\, d\mu$$
Example: For the log-linear model (exponential family)
$$\log p(\zeta) = \sum_i \theta^i\, \lambda_i(\zeta)$$
the expectation parameter is:
$$\eta_i = \int p(\zeta)\, \lambda_i(\zeta)\, d\mu$$
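For the log-linear example, the natural and expectation parameters can be computed explicitly on a finite sample space. The sketch below assumes counting measure on four sample points, two made-up functions $\lambda_1, \lambda_2$, unnormalised densities $p = \exp(\sum_i \theta^i\lambda_i)$, and $f = \exp$, so that the potential is $\Phi(\theta) = \sum_\zeta p(\zeta)$. It then checks that $\eta_i = \sum_\zeta p\,\lambda_i$ matches the gradient of $\Phi$. All names and numbers are hypothetical.

```python
import math

# Two linearly independent functions on a 4-point sample space (made-up values).
LAMBDAS = [[1.0, 0.0, 1.0, 2.0],
           [0.0, 1.0, 1.0, -1.0]]

def density(theta):
    """Unnormalised log-linear density: log p(z) = sum_i theta^i * lambda_i(z)."""
    n = len(LAMBDAS[0])
    return [math.exp(sum(th * lam[z] for th, lam in zip(theta, LAMBDAS)))
            for z in range(n)]

def potential(theta):
    """Phi(theta) = sum_z f(rho(p)) with f = exp, rho = log, i.e. sum_z p(z)."""
    return sum(density(theta))

def expectation_parameter(theta):
    """eta_i = sum_z p(z) * lambda_i(z), using tau(p) = p."""
    p = density(theta)
    return [sum(pz * lz for pz, lz in zip(p, lam)) for lam in LAMBDAS]

def gradient(fn, theta, h=1e-6):
    """Central finite-difference gradient."""
    out = []
    for i in range(len(theta)):
        up, dn = list(theta), list(theta)
        up[i] += h
        dn[i] -= h
        out.append((fn(up) - fn(dn)) / (2 * h))
    return out

theta = [0.3, -0.2]
eta = expectation_parameter(theta)
eta_fd = gradient(potential, theta)
print(eta, eta_fd)
```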
p p 
p p '
Proposition.
For the $\rho$-affine submanifold:
i) The following potential function is strictly convex:
$$\Phi(\theta) = \int f\big(\rho(p(\zeta|\theta))\big)\, d\mu$$
$\Phi$ is called the generating (partition) functional.
ii) Define, under the conjugate representations,
$$\tilde\Phi(\theta) = \int f^*\big(\tau(p(\zeta|\theta))\big)\, d\mu$$
then $\Phi^*(\eta) \equiv \tilde\Phi\big((\partial\Phi)^{-1}(\eta)\big)$ is the Fenchel conjugate of $\Phi(\theta)$.
$\Phi^*$ is called the generalized entropy functional.
Theorem.
The $\rho$-affine submanifold is an $\alpha$-Hessian manifold.
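An Application: the $(\alpha, \beta)$-Divergence
Take $f = \rho^{-1}$ (scaled by the constant factor $2/(1+\beta)$), where:
$$\rho(t) = \begin{cases} \dfrac{2}{1-\beta}\, t^{(1-\beta)/2}, & \beta \ne 1 \\ \log t, & \beta = 1 \end{cases}$$
called "alpha-embedding", with the parameter now denoted by $\beta$. Then
$$D^{(\alpha,\beta)}(p, q) = \frac{4}{1-\alpha^2}\, \frac{2}{1+\beta} \int \left\{ \frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q - \left( \frac{1-\alpha}{2}\, p^{(1-\beta)/2} + \frac{1+\alpha}{2}\, q^{(1-\beta)/2} \right)^{2/(1-\beta)} \right\} d\mu$$
$\alpha$: parameter reflecting reference duality
$\beta$: parameter reflecting representation duality
They reduce to the $\alpha$-divergence proper $A^{(\alpha)}$ and to the Jensen difference $E^{(\alpha)}$:
$$\lim_{\alpha \to -1} D^{(\alpha,\beta)}(p, q) = A^{(-\beta)}(p, q), \qquad \lim_{\alpha \to 1} D^{(\alpha,\beta)}(p, q) = A^{(\beta)}(p, q)$$
$$\lim_{\beta \to 1} D^{(\alpha,\beta)}(p, q) = A^{(\alpha)}(p, q), \qquad \lim_{\beta \to -1} D^{(\alpha,\beta)}(p, q) = E^{(\alpha)}(p, q)$$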
Information Geometry on Banach Space
Proposition 1. Denote tangent vector fields $u(\zeta|p)$, $v(\zeta|p)$ which are, at given $p$ on the manifold, themselves functions in Banach space. The metric and dual connections induced by $D^{(\alpha)}_{f,\rho}(p, q)$ take the forms:
$$g_p(u, v) = \int g(p)\, u(\zeta|p)\, v(\zeta|p)\, d\mu$$
$$(\nabla^{(\alpha)}_v u)_p(\zeta) = (d_v u)(\zeta) + B^{(\alpha)}(p)\, u(\zeta|p)\, v(\zeta|p)$$
$$(\nabla^{*(\alpha)}_v u)_p(\zeta) = (d_v u)(\zeta) + B^{(-\alpha)}(p)\, u(\zeta|p)\, v(\zeta|p)$$
where
$$g(p) = f''(\rho(p))\, (\rho'(p))^2 = \rho'(p)\, \tau'(p)$$
$$B^{(\alpha)}(p) = \frac{1-\alpha}{2}\, \frac{f'''(\rho(p))\, \rho'(p)}{f''(\rho(p))} + \frac{\rho''(p)}{\rho'(p)}$$
Written in dually symmetric form:
$$B^{(\alpha)}(p) = \frac{d}{dp} \left[ \frac{1+\alpha}{2}\, \log \rho'(p) + \frac{1-\alpha}{2}\, \log \tau'(p) \right]$$
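Corollary 1a.
For a finite-dimensional submanifold (parametric model), write $\rho = \rho(p(\zeta|\theta))$, $\tau = \tau(p(\zeta|\theta))$, and take the tangent vectors
$$u = \frac{\partial p(\zeta|\theta)}{\partial\theta^i}, \qquad v = \frac{\partial p(\zeta|\theta)}{\partial\theta^j}$$
The metric and dual connections associated with $D^{(\alpha)}_{f,\rho}(\cdot, \cdot)$ are given by:
$$g_{ij}(\theta) = \int f''(\rho)\, \frac{\partial\rho}{\partial\theta^i}\, \frac{\partial\rho}{\partial\theta^j}\, d\mu = \int \frac{\partial\rho}{\partial\theta^i}\, \frac{\partial\tau}{\partial\theta^j}\, d\mu$$
$$\Gamma^{(\alpha)}_{ij,k}(\theta) = \int \left\{ \frac{1-\alpha}{2}\, f'''(\rho)\, \frac{\partial\rho}{\partial\theta^i}\, \frac{\partial\rho}{\partial\theta^j} + f''(\rho)\, \frac{\partial^2\rho}{\partial\theta^i\, \partial\theta^j} \right\} \frac{\partial\rho}{\partial\theta^k}\, d\mu$$
$$\phantom{\Gamma^{(\alpha)}_{ij,k}(\theta)} = \int \left\{ \frac{1+\alpha}{2}\, \frac{\partial^2\rho}{\partial\theta^i\, \partial\theta^j}\, \frac{\partial\tau}{\partial\theta^k} + \frac{1-\alpha}{2}\, \frac{\partial^2\tau}{\partial\theta^i\, \partial\theta^j}\, \frac{\partial\rho}{\partial\theta^k} \right\} d\mu$$
with $\Gamma^{*(\alpha)}_{ij,k}(\theta) = \Gamma^{(-\alpha)}_{ij,k}(\theta)$.
Remark: Choosing $\rho(t) = \log t$, $\tau(t) = t$, $f(t) = \exp(t)$ reduces to the forms of the Fisher metric and the $\alpha$-connections in classical parametric information geometry, where
$$\rho = \log p, \qquad f''(\rho) = p, \qquad \tau = p$$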
Proposition 2. The curvature $R^{(\alpha)}$ and torsion $T^{(\alpha)}$ tensors associated with any $\alpha$-connection on the infinite-dimensional function space $\mathcal{B}$ are identically zero.
Remark: The ambient space $\mathcal{B}$ is flat, so it embeds, as proper submanifolds,
(i) the manifold M of probability density functions (constrained to be positive-valued and normalized to unit measure);
(ii) the finite-dimensional manifold M of parameterized probability models.
[Figure: the two manifolds M drawn inside $\mathcal{B}$ (ambient manifold)]
CAVEAT: Topology? (G. Pistone and his colleagues)
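Proposition 3.
The $(\alpha, \beta)$-divergence for the parametric models gives rise to the Fisher metric proper and alpha-connections proper:
$$g_{ij}(\theta) = \int p_\theta\, \frac{\partial \log p}{\partial\theta^i}\, \frac{\partial \log p}{\partial\theta^j}\, d\mu$$
$$\Gamma^{(\alpha,\beta)}_{ij,k}(\theta) = \int p_\theta \left\{ \frac{\partial^2 \log p}{\partial\theta^i\, \partial\theta^j} + \frac{1-\alpha\beta}{2}\, \frac{\partial \log p}{\partial\theta^i}\, \frac{\partial \log p}{\partial\theta^j} \right\} \frac{\partial \log p}{\partial\theta^k}\, d\mu = \Gamma^{(-\alpha,-\beta)}_{ij,k}(\theta)$$
$$\Gamma^{*(\alpha,\beta)}_{ij,k}(\theta) = \Gamma^{(-\alpha,\beta)}_{ij,k}(\theta) = \Gamma^{(\alpha,-\beta)}_{ij,k}(\theta)$$
Remark: The $(\alpha, \beta)$-divergence is a homogeneous f-divergence:
$$D^{(\alpha,\beta)}(\lambda p, \lambda q) = \lambda\, D^{(\alpha,\beta)}(p, q)$$
As such, it should reproduce the standard Fisher metric and the dual alpha-connections in their proper form. Again, it is the product $\alpha\beta$ that takes the role of the conventional "alpha" parameter.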
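The homogeneity property is straightforward to confirm numerically. The sketch below assumes counting measure on a finite sample space and the form $D^{(\alpha,\beta)}(p,q) = \frac{4}{1-\alpha^2}\frac{2}{1+\beta}\sum\{\frac{1-\alpha}{2}p + \frac{1+\alpha}{2}q - (\frac{1-\alpha}{2}p^{(1-\beta)/2} + \frac{1+\alpha}{2}q^{(1-\beta)/2})^{2/(1-\beta)}\}$; the inputs need not be normalised. The name `ab_div` and the test values are hypothetical.

```python
def ab_div(p, q, alpha, beta):
    """(alpha, beta)-divergence; alpha != +1, -1 and beta != +1, -1."""
    w1, w2 = (1 - alpha) / 2, (1 + alpha) / 2
    e = (1 - beta) / 2                      # inner exponent; outer exponent is 1/e
    s = sum(w1 * a + w2 * b - (w1 * a ** e + w2 * b ** e) ** (1 / e)
            for a, b in zip(p, q))
    return 4 / (1 - alpha ** 2) * 2 / (1 + beta) * s

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
scale = 3.0

base = ab_div(p, q, 0.3, -0.4)
scaled = ab_div([scale * a for a in p], [scale * b for b in q], 0.3, -0.4)
print(scaled, scale * base)
```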
Summary of Current Approach
Divergence
  $\alpha$-divergence
    equiv. to $\delta$-divergence (Zhu & Rohwer, 1985)
    includes KL divergence as a special case
  f-divergence (Csiszár)
  Bregman divergence
    equivalent to the canonical divergence
  U-divergence (Eguchi)
Convex-based $\alpha$-divergence for
  vector space of finite dim
  function space of infinite dim
Geometry
  Riemannian metric
    Fisher information
  Conjugate connections
    $\alpha$-connection family
  Equi-affine structure
    cubic form, Tchebychev 1-form
  Curvature
Generalized expressions of
  Fisher metric
  $\alpha$-connections
References
Zhang, J. (2004). Divergence function, duality, and convex analysis. Neural
Computation, 16: 159-195.
Zhang, J. (2005) Referential duality and representational duality in the scaling of
multidimensional and infinite-dimensional stimulus space. In Dzhafarov, E. and
Colonius, H. (Eds.) Measurement and representation of sensations: Recent progress
in psychological theory. Lawrence Erlbaum Associates, Mahwah, NJ.
Zhang, J. and Hasto, P. (2006) Statistical manifold as an affine space: A functional
equation approach. Journal of Mathematical Psychology, 50: 60-65.
Zhang, J. (2006). Referential duality and representational duality on statistical
manifolds. Proceedings of the Second International Symposium on Information
Geometry and Its Applications, Tokyo (pp 58-67).
Zhang J. (2007). A note on curvature of a-connections of a statistical manifold. Annals
of the Institute of Statistical Mathematics. 59, 161-170.
Zhang, J. and Matsuzoe, H. (in press). Dualistic differential geometry associated with
a convex function. To appear in a special volume in the Springer series of
Advances in Mechanics and Mathematics.
Zhang, J. (under review) Nonparametric information geometry: Referential duality
and representational duality on statistical manifolds.
Questions?