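Information Geometry: Duality, Convexity, and Divergences

Jun Zhang*
University of Michigan
Ann Arbor, Michigan 48104
*Currently on leave to AFOSR under IPA

Lecture Plan

1) A revisit to Bregman divergence
2) Generalization ($\alpha$-divergence on $\mathbb{R}^n$) and $\alpha$-Hessian geometry
3) Embedding into infinite-dimensional function space
4) Generalized Fisher metric and $\alpha$-connection on Banach space

Clarify two senses of duality in information geometry:
Reference duality: $\alpha \leftrightarrow -\alpha$, choice of the reference vs. comparison point on the manifold;
Representational duality: $p \mapsto \log p,\ p, \ldots$, choice of a monotonic scaling of the density function.

Bregman Divergence

Let $\Phi: \mathbb{R}^n \to \mathbb{R}$ be a strictly convex function, and $\theta_1, \theta_2 \in \mathbb{R}^n$:

$$B_\Phi(\theta_1, \theta_2) = \Phi(\theta_1) - \Phi(\theta_2) - \langle \theta_1 - \theta_2,\ (\partial\Phi)(\theta_2) \rangle$$

i) Quadrilateral relation:

$$B_\Phi(\theta_1, \theta_2) + B_\Phi(\theta_4, \theta_3) - B_\Phi(\theta_1, \theta_3) - B_\Phi(\theta_4, \theta_2) = \langle \theta_1 - \theta_4,\ (\partial\Phi)(\theta_3) - (\partial\Phi)(\theta_2) \rangle$$

Triangular relation (generalized cosine) as the special case $\theta_4 = \theta_2$:

$$B_\Phi(\theta_1, \theta_2) + B_\Phi(\theta_2, \theta_3) - B_\Phi(\theta_1, \theta_3) = \langle \theta_1 - \theta_2,\ (\partial\Phi)(\theta_3) - (\partial\Phi)(\theta_2) \rangle$$

ii) Reference-representation biduality:

$$B_\Phi(\theta_1, \theta_2) = B_{\Phi^*}\big((\partial\Phi)(\theta_2),\ (\partial\Phi)(\theta_1)\big)$$

Canonical Divergence and Fenchel Inequality

An alternative expression of the Bregman divergence is the canonical divergence

$$A_\Phi(\theta, \eta) = B_\Phi\big(\theta,\ (\partial\Phi)^{-1}(\eta)\big)$$

or explicitly:

$$A_\Phi(\theta, \eta) = \Phi(\theta) + \Phi^*(\eta) - \langle \theta, \eta \rangle = A_{\Phi^*}(\eta, \theta)$$

where

$$\Phi^*(\eta) = \langle (\partial\Phi)^{-1}(\eta),\ \eta \rangle - \Phi\big((\partial\Phi)^{-1}(\eta)\big)$$

That $A_\Phi$ is non-negative is a direct consequence of the Fenchel inequality for a strictly convex function:

$$\Phi(\theta) + \Phi^*(\eta) - \langle \theta, \eta \rangle \ge 0$$

where equality holds if and only if $\eta = (\partial\Phi)(\theta)$, equivalently $\theta = (\partial\Phi^*)(\eta)$.

Convex Inequality and the $\alpha$-Divergence Induced by It

By the definition of a strictly convex function $\Phi$,

$$\frac{1-\alpha}{2}\Phi(\theta_1) + \frac{1+\alpha}{2}\Phi(\theta_2) - \Phi\Big(\frac{1-\alpha}{2}\theta_1 + \frac{1+\alpha}{2}\theta_2\Big) \ge 0, \qquad \alpha \in (-1, 1).$$

It is easy to show that the following is non-negative for all $\alpha \in \mathbb{R}$ (with $\alpha = \pm 1$ understood as limits):

$$D_\Phi^{(\alpha)}(\theta_1, \theta_2) = \frac{4}{1-\alpha^2}\Big[\frac{1-\alpha}{2}\Phi(\theta_1) + \frac{1+\alpha}{2}\Phi(\theta_2) - \Phi\Big(\frac{1-\alpha}{2}\theta_1 + \frac{1+\alpha}{2}\theta_2\Big)\Big]$$

Conjugate-symmetry:

$$D_\Phi^{(\alpha)}(\theta_1, \theta_2) = D_\Phi^{(-\alpha)}(\theta_2, \theta_1)$$

Easily verifiable:

$$D_\Phi^{(1)}(\theta_1, \theta_2) = \lim_{\alpha \to 1} D_\Phi^{(\alpha)}(\theta_1, \theta_2) = B_\Phi(\theta_1, \theta_2)$$
$$D_\Phi^{(-1)}(\theta_1, \theta_2) = \lim_{\alpha \to -1} D_\Phi^{(\alpha)}(\theta_1, \theta_2) = B_\Phi(\theta_2, \theta_1)$$

Significance of Bregman Divergence Among the $\alpha$-Divergence Family

Proposition: For a smooth function $\Phi: \mathbb{R}^n \to \mathbb{R}$, the following are equivalent:
(i) $\Phi$ is strictly convex;
(ii) $D_\Phi^{(1)}(\theta_1, \theta_2) = B_\Phi(\theta_1, \theta_2) \ge 0$;
(iii) $D_\Phi^{(-1)}(\theta_1, \theta_2) = B_\Phi(\theta_2, \theta_1) \ge 0$;
(iv) $D_\Phi^{(\alpha)}(\theta_1, \theta_2) \ge 0$ for all $|\alpha| < 1$;
(v) $D_\Phi^{(\alpha)}(\theta_1, \theta_2) \ge 0$ for all $|\alpha| > 1$.

Statistical Manifold Structure Induced From Divergence Function (Eguchi, 1983)

Given a divergence $D(x, y)$ with $D(x, x) = 0$, one can derive the Riemannian metric and a pair of conjugate connections. In essence, expand $D(x, y)$ around $x = y$:

i) 2nd order: one (and the same) metric

$$g_{ij}(x) = -\frac{\partial^2 D(x, y)}{\partial x^i\, \partial y^j}\bigg|_{x=y} = -\frac{\partial^2 D(x, y)}{\partial y^i\, \partial x^j}\bigg|_{x=y}$$

ii) 3rd order: a pair of conjugate connections

$$\Gamma_{ij,k}(x) = -\frac{\partial^3 D(x, y)}{\partial x^i\, \partial x^j\, \partial y^k}\bigg|_{x=y}, \qquad \Gamma^*_{ij,k}(x) = -\frac{\partial^3 D(x, y)}{\partial y^i\, \partial y^j\, \partial x^k}\bigg|_{x=y}$$

The conjugacy condition $\partial_k g_{ij} = \Gamma_{ki,j} + \Gamma^*_{kj,i}$ is satisfied by such identification of derivatives of $D$.

$\alpha$-Hessian Geometry (of Finite-Dimensional Vector Space)

Theorem. $D_\Phi^{(\alpha)}$ induces the $\alpha$-Hessian manifold, i.e.

i) The metric and conjugate affine connections are given by:

$$g_{ij}(\theta) = \frac{\partial^2 \Phi}{\partial\theta^i\, \partial\theta^j}, \qquad \Gamma^{(\alpha)}_{ij,k}(\theta) = \frac{1-\alpha}{2}\,\frac{\partial^3 \Phi}{\partial\theta^i\, \partial\theta^j\, \partial\theta^k}, \qquad \Gamma^{*(\alpha)}_{ij,k}(\theta) = \frac{1+\alpha}{2}\,\frac{\partial^3 \Phi}{\partial\theta^i\, \partial\theta^j\, \partial\theta^k}$$

ii) The Riemann curvature is given by:

$$R^{(\alpha)}_{\mu\nu ij}(\theta) = \frac{1-\alpha^2}{4} \sum_{l,k} \big(\Phi_{il\nu}\Phi_{jk\mu} - \Phi_{il\mu}\Phi_{jk\nu}\big)\, g^{lk} = R^{*(\alpha)}_{ij\mu\nu}(\theta)$$

where $\Phi_{ijk} = \partial^3\Phi / \partial\theta^i \partial\theta^j \partial\theta^k$ and $g^{lk}$ is the inverse of $g_{lk}$.

iii) The manifold is equi-affine, with the Tchebychev potential given by

$$\det |g| = \det\Big[\frac{\partial^2 \Phi}{\partial\theta^i\, \partial\theta^j}\Big]$$

and the $\alpha$-parallel volume form given by

$$\Omega^{(\alpha)}(\theta) = \det\Big[\frac{\partial^2 \Phi}{\partial\theta^i\, \partial\theta^j}\Big]^{(1-\alpha)/2}$$

iv) There exist biorthogonal coordinates:

$$\eta_i = \frac{\partial \Phi}{\partial \theta^i}, \qquad \theta^i = \frac{\partial \Phi^*}{\partial \eta_i}, \qquad \text{with} \quad \frac{\partial \eta_i}{\partial \theta^j} = g_{ij}, \qquad \frac{\partial \theta^i}{\partial \eta_j} = g^{ij}$$

From Vector Space to Function Space

Question: How to extend the above analysis to infinite-dimensional function space?

A General Divergence Function(al)

$$D^{(\alpha)}_{f,\rho}(p, q) = \frac{4}{1-\alpha^2} \int \Big\{\frac{1-\alpha}{2} f(\rho(p)) + \frac{1+\alpha}{2} f(\rho(q)) - f\Big(\frac{1-\alpha}{2}\rho(p) + \frac{1+\alpha}{2}\rho(q)\Big)\Big\}\, d\mu$$

for any two functions $p(\zeta), q(\zeta)$ in some function space, and an arbitrary, strictly increasing function $\rho: \mathbb{R} \to \mathbb{R}$.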
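As a numerical sanity check of the finite-dimensional statements (non-negativity of $D^{(\alpha)}$ for $|\alpha| < 1$ and $|\alpha| > 1$, conjugate-symmetry, and the Bregman limits at $\alpha \to \pm 1$), here is a minimal Python sketch. The potential $\Phi(x) = \sum_i e^{x_i}$, the test points, and all function names are illustrative assumptions, not taken from the lecture. The Bregman divergence is written with the gradient evaluated at the second argument, which is the convention under which $D^{(1)}(\theta_1, \theta_2) = B_\Phi(\theta_1, \theta_2)$ holds.

```python
import math

def phi(x):
    """Illustrative strictly convex potential: Phi(x) = sum_i exp(x_i)."""
    return sum(math.exp(xi) for xi in x)

def grad_phi(x):
    """Gradient of the illustrative potential."""
    return [math.exp(xi) for xi in x]

def bregman(x, y):
    """B_Phi(x, y) = Phi(x) - Phi(y) - <x - y, grad Phi(y)>."""
    g = grad_phi(y)
    return phi(x) - phi(y) - sum((xi - yi) * gi for xi, yi, gi in zip(x, y, g))

def alpha_div(x, y, alpha):
    """Convex-based alpha-divergence D^(alpha)(x, y), for alpha != +1, -1."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    mix = [a * xi + b * yi for xi, yi in zip(x, y)]
    return 4 / (1 - alpha ** 2) * (a * phi(x) + b * phi(y) - phi(mix))

# Two arbitrary test points in R^2 (hypothetical values).
x, y = [0.3, -1.0], [1.2, 0.5]

inside = alpha_div(x, y, 0.4)              # |alpha| < 1
outside = alpha_div(x, y, 3.0)             # |alpha| > 1
swapped = alpha_div(y, x, -0.4)            # conjugate-symmetric partner
near_plus_one = alpha_div(x, y, 1 - 1e-6)  # approximates B(x, y)
near_minus_one = alpha_div(x, y, -1 + 1e-6)  # approximates B(y, x)
```

Comparing `near_plus_one` with `bregman(x, y)` and `near_minus_one` with `bregman(y, x)` illustrates the two limits; comparing `inside` with `swapped` illustrates conjugate-symmetry.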
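Remark: The divergence functional $D^{(\alpha)}_{f,\rho}$ is induced by the convex inequality applied to a strictly convex function $f$.

A Special Case of $D^{(\alpha)}$: Classic $\alpha$-Divergence

$$A^{(\alpha)}(p, q) = \frac{4}{1-\alpha^2}\, E_\mu\Big\{\frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q - p^{(1-\alpha)/2}\, q^{(1+\alpha)/2}\Big\}$$

For parameterized pdf's, such a divergence induces an $\alpha$-independent metric, but $\alpha$-dependent dual connections:

$$g_{ij}(\theta) = E_{p_\theta}\Big\{\frac{\partial \log p}{\partial\theta^i}\, \frac{\partial \log p}{\partial\theta^j}\Big\}$$

$$\Gamma^{(\alpha)}_{ij,k}(\theta) = E_{p_\theta}\Big\{\Big(\frac{\partial^2 \log p}{\partial\theta^i\, \partial\theta^j} + \frac{1-\alpha}{2}\, \frac{\partial \log p}{\partial\theta^i}\, \frac{\partial \log p}{\partial\theta^j}\Big) \frac{\partial \log p}{\partial\theta^k}\Big\}$$

$$\Gamma^{*(\alpha)}_{ij,k}(\theta) = \Gamma^{(-\alpha)}_{ij,k}(\theta)$$

Other Examples of $D^{(\alpha)}$

Jensen difference:

$$E^{(\alpha)}(p, q) = \frac{4}{1-\alpha^2}\, E_\mu\Big\{\frac{1-\alpha}{2}\, p \log p + \frac{1+\alpha}{2}\, q \log q - \Big(\frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q\Big) \log\Big(\frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q\Big)\Big\}$$

U-divergence ($\alpha = 1$):

$$U^{(\alpha)}(p, q) = \frac{4}{1-\alpha^2}\, E_\mu\Big\{\frac{1-\alpha}{2}\, U\big((U')^{-1}(p)\big) + \frac{1+\alpha}{2}\, U\big((U')^{-1}(q)\big) - U\Big(\frac{1-\alpha}{2}\, (U')^{-1}(p) + \frac{1+\alpha}{2}\, (U')^{-1}(q)\Big)\Big\}$$

A Short Detour: Monotone Scaling

Define a monotone embedding ("scaling") of a measurable function $p$ as the transformation $\rho(p)$, where $\rho: \mathbb{R} \to \mathbb{R}$ is a strictly monotone function.

Observe:
i) $\rho$ is strictly monotone iff $\rho^{-1}$ is strictly monotone;
ii) $\rho(t) = t$ acts as the identity element;
iii) if $\rho_1, \rho_2$ are strictly monotone, so is $\rho_1 \circ \rho_2$.

Therefore, monotone embeddings of a given probability density function form a group, with functional composition as the group operation:

$$(\rho_1 \circ \rho_2)(t) = \rho_1(\rho_2(t))$$

We recall that for a strictly convex function $f$: $f'$ is strictly increasing, with inverse $(f')^{-1} = (f^*)'$.

DEFINITION: A $\rho$-embedding is said to be conjugated to a $\tau$-embedding with respect to a strictly convex function $f$ (whose conjugate is $f^*$) if

$$\tau(p) = f'(\rho(p)) \quad \Longleftrightarrow \quad \rho(p) = (f')^{-1}(\tau(p)) = (f^*)'(\tau(p))$$

Example: $\alpha$-embedding

$$\rho(t) = \begin{cases} \log t, & \alpha = 1 \\ \dfrac{2}{1-\alpha}\, t^{(1-\alpha)/2}, & \alpha \ne 1 \end{cases} \qquad \tau(t) = \begin{cases} \log t, & \alpha = -1 \\ \dfrac{2}{1+\alpha}\, t^{(1+\alpha)/2}, & \alpha \ne -1 \end{cases}$$

$$f(t) = \frac{2}{1+\alpha} \Big(\frac{1-\alpha}{2}\, t\Big)^{2/(1-\alpha)}, \qquad f^*(t) = \frac{2}{1-\alpha} \Big(\frac{1+\alpha}{2}\, t\Big)^{2/(1+\alpha)}$$

Parameterized Functions as Forming a Submanifold under Monotone Scaling

A submanifold is said to be $\rho$-affine if there exists a countable set of linearly independent functions $\lambda_i(\zeta)$ over a measurable space such that:

$$\rho(p(\zeta)) = \sum_i \theta^i \lambda_i(\zeta)$$

Here, $\theta$ is called the "natural parameter".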
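Two of the claims above lend themselves to a quick numerical check on a finite sample space: that the general functional with $\rho = \log$ and $f = \exp$ reproduces the classic $\alpha$-divergence, and that the $\alpha$-embedding pair satisfies the conjugacy condition $\tau(p) = f'(\rho(p))$. The following is a minimal Python sketch under stated assumptions: counting measure on three points, hypothetical densities `p` and `q`, and illustrative function names that do not come from the lecture.

```python
import math

def general_div(p, q, alpha, f, rho):
    """D^(alpha)_{f,rho}(p, q) under counting measure on a finite sample space."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    total = sum(
        a * f(rho(pi)) + b * f(rho(qi)) - f(a * rho(pi) + b * rho(qi))
        for pi, qi in zip(p, q)
    )
    return 4 / (1 - alpha ** 2) * total

def classic_alpha_div(p, q, alpha):
    """Classic alpha-divergence A^(alpha)(p, q), for alpha != +1, -1."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    total = sum(a * pi + b * qi - pi ** a * qi ** b for pi, qi in zip(p, q))
    return 4 / (1 - alpha ** 2) * total

def l_embed(t, alpha):
    """alpha-embedding: log t if alpha == 1, else 2/(1-alpha) * t^((1-alpha)/2)."""
    if alpha == 1:
        return math.log(t)
    return 2 / (1 - alpha) * t ** ((1 - alpha) / 2)

def f_prime(t, alpha):
    """Derivative of f(t) = 2/(1+alpha) * ((1-alpha)/2 * t)^(2/(1-alpha))."""
    return 2 / (1 + alpha) * ((1 - alpha) / 2 * t) ** ((1 + alpha) / (1 - alpha))

# Hypothetical densities on a three-point sample space.
p, q = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]
alpha = 0.5

d_general = general_div(p, q, alpha, math.exp, math.log)
d_classic = classic_alpha_div(p, q, alpha)

# Conjugacy: tau(p) computed directly vs. through f'(rho(p)).
tau_direct = [l_embed(pi, -alpha) for pi in p]
tau_via_f = [f_prime(l_embed(pi, alpha), alpha) for pi in p]
```

Here `d_general` and `d_classic` should agree up to rounding, as should `tau_direct` and `tau_via_f` entry by entry.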
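The "expectation parameter" is defined by projecting the conjugated $\tau$-embedding onto the $\lambda_i$:

$$\eta_i = \int \tau(p(\zeta))\, \lambda_i(\zeta)\, d\mu$$

Example: For the log-linear model (exponential family)

$$\log p(\zeta) = \sum_i \theta^i \lambda_i(\zeta)$$

the expectation parameter is:

$$\eta_i = \int p(\zeta)\, \lambda_i(\zeta)\, d\mu$$

Proposition. For the $\rho$-affine submanifold:

i) The following potential function is strictly convex:

$$\Phi(\theta) = \int f\big(\rho(p(\zeta \mid \theta))\big)\, d\mu$$

$\Phi$ is called the generating (partition) functional.

ii) Define, under the conjugate representations,

$$\tilde{\Phi}^*(\theta) = \int f^*\big(\tau(p(\zeta \mid \theta))\big)\, d\mu$$

then $\Phi^*(\eta) = \tilde{\Phi}^*\big((\partial\Phi)^{-1}(\eta)\big)$ is the Fenchel conjugate of $\Phi(\theta)$. $\Phi^*$ is called the generalized entropy functional.

Theorem. The $\rho$-affine submanifold is an $\alpha$-Hessian manifold.

An Application: the $(\alpha, \beta)$-Divergence

Take $f = \rho^{-1}$ (up to the constant factor $2/(1+\beta)$), where

$$\rho(t) = \begin{cases} \log t, & \beta = 1 \\ \dfrac{2}{1-\beta}\, t^{(1-\beta)/2}, & \beta \ne 1 \end{cases}$$

is the embedding called "alpha-embedding" above, with its parameter now denoted by $\beta$. Then

$$D^{(\alpha,\beta)}(p, q) = \frac{4}{1-\alpha^2}\, \frac{2}{1+\beta} \int \Big\{\frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q - \Big(\frac{1-\alpha}{2}\, p^{(1-\beta)/2} + \frac{1+\alpha}{2}\, q^{(1-\beta)/2}\Big)^{2/(1-\beta)}\Big\}\, d\mu$$

$\alpha$: parameter reflecting reference duality
$\beta$: parameter reflecting representation duality

They reduce to the $\alpha$-divergence proper $A^{(\alpha)}$ and to the Jensen difference $E^{(\alpha)}$:

$$\lim_{\alpha \to -1} D^{(\alpha,\beta)}(p, q) = A^{(-\beta)}(p, q), \qquad \lim_{\alpha \to 1} D^{(\alpha,\beta)}(p, q) = A^{(\beta)}(p, q)$$
$$\lim_{\beta \to 1} D^{(\alpha,\beta)}(p, q) = A^{(\alpha)}(p, q), \qquad \lim_{\beta \to -1} D^{(\alpha,\beta)}(p, q) = E^{(\alpha)}(p, q)$$

Information Geometry on Banach Space

Proposition 1. Denote tangent vector fields $u(\zeta \mid p), v(\zeta \mid p)$ which are, at a given $p$ on the manifold, themselves functions in Banach space. The metric and dual connections induced by $D^{(\alpha)}_{f,\rho}(p, q)$ take the forms:

$$g_p(u, v) = \int g(p)\, u(\zeta \mid p)\, v(\zeta \mid p)\, d\mu$$

$$\big(\nabla^{(\alpha)}_v u\big)_p(\zeta) = (d_v u)(\zeta) + B^{(\alpha)}(p)\, u(\zeta \mid p)\, v(\zeta \mid p)$$

$$\big(\nabla^{*(\alpha)}_v u\big)_p(\zeta) = (d_v u)(\zeta) + B^{(-\alpha)}(p)\, u(\zeta \mid p)\, v(\zeta \mid p)$$

where

$$g(p) = f''(\rho(p))\, (\rho'(p))^2 = \rho'(p)\, \tau'(p)$$

$$B^{(\alpha)}(p) = \frac{1-\alpha}{2}\, \frac{f'''(\rho(p))\, \rho'(p)}{f''(\rho(p))} + \frac{\rho''(p)}{\rho'(p)}$$

Written in dually symmetric form:

$$B^{(\alpha)}(p) = \frac{d}{dp}\Big(\frac{1+\alpha}{2} \log \rho'(p) + \frac{1-\alpha}{2} \log \tau'(p)\Big)$$

Corollary 1a. For a finite-dimensional submanifold (parametric model), with

$$u = \frac{\partial \rho(p(\zeta \mid \theta))}{\partial\theta^i}, \qquad v = \frac{\partial \rho(p(\zeta \mid \theta))}{\partial\theta^j}$$

the metric and dual connections associated with $D^{(\alpha)}_{f,\rho}(\cdot, \cdot)$ are given by:

$$g_{ij}(\theta) = \int f''(\rho)\, \frac{\partial\rho}{\partial\theta^i}\, \frac{\partial\rho}{\partial\theta^j}\, d\mu = \int \frac{\partial\rho}{\partial\theta^i}\, \frac{\partial\tau}{\partial\theta^j}\, d\mu$$

$$\Gamma^{(\alpha)}_{ij,k}(\theta) = \int \Big\{\frac{1-\alpha}{2}\, f'''(\rho)\, \frac{\partial\rho}{\partial\theta^i}\, \frac{\partial\rho}{\partial\theta^j}\, \frac{\partial\rho}{\partial\theta^k} + f''(\rho)\, \frac{\partial^2\rho}{\partial\theta^i\, \partial\theta^j}\, \frac{\partial\rho}{\partial\theta^k}\Big\}\, d\mu = \int \Big\{\frac{1+\alpha}{2}\, \frac{\partial^2\rho}{\partial\theta^i\, \partial\theta^j}\, \frac{\partial\tau}{\partial\theta^k} + \frac{1-\alpha}{2}\, \frac{\partial^2\tau}{\partial\theta^i\, \partial\theta^j}\, \frac{\partial\rho}{\partial\theta^k}\Big\}\, d\mu$$

with $\Gamma^{*(\alpha)}_{ij,k}(\theta) = \Gamma^{(-\alpha)}_{ij,k}(\theta)$.

Remark: Choosing $\tau(t) = t$ and $\rho^{-1}(t) = f(t) = \exp(t)$ reduces these to the forms of the Fisher metric and the $\alpha$-connections in classical parametric information geometry, where $\rho = \log p$, $f(\rho) = p$, $\tau = p$.

Proposition 2. The curvature $R^{(\alpha)}$ and torsion $T^{(\alpha)}$ tensors associated with any $\alpha$-connection on the infinite-dimensional function space $B$ are identically zero.

Remark: The ambient space $B$ is flat, so it embeds, as proper submanifolds,
(i) the manifold $M_\mu$ of probability density functions (constrained to be positive-valued and normalized to unit measure);
(ii) the finite-dimensional manifold $M_\theta$ of parameterized probability models.

$$M_\theta \subset M_\mu \subset B \ \text{(ambient manifold)}$$

CAVEAT: Topology? (G. Pistone and his colleagues)

Proposition 3. The $(\alpha, \beta)$-divergence for the parametric models gives rise to the Fisher metric proper and the alpha-connections proper:

$$g_{ij}(\theta) = \int p_\theta\, \frac{\partial \log p}{\partial\theta^i}\, \frac{\partial \log p}{\partial\theta^j}\, d\mu$$

$$\Gamma^{(\alpha,\beta)}_{ij,k}(\theta) = \int p_\theta \Big(\frac{\partial^2 \log p}{\partial\theta^i\, \partial\theta^j} + \frac{1-\alpha\beta}{2}\, \frac{\partial \log p}{\partial\theta^i}\, \frac{\partial \log p}{\partial\theta^j}\Big) \frac{\partial \log p}{\partial\theta^k}\, d\mu = \Gamma^{(-\alpha,-\beta)}_{ij,k}(\theta)$$

$$\Gamma^{*(\alpha,\beta)}_{ij,k}(\theta) = \Gamma^{(-\alpha,\beta)}_{ij,k}(\theta) = \Gamma^{(\alpha,-\beta)}_{ij,k}(\theta)$$

Remark: The $(\alpha, \beta)$-divergence is the homogeneous f-divergence

$$D^{(\alpha,\beta)}(\lambda p, \lambda q) = \lambda\, D^{(\alpha,\beta)}(p, q)$$

As such, it should reproduce the standard Fisher metric and the dual alpha-connections in their proper form. Again, it is the product $\alpha\beta$ that takes the role of the conventional "alpha" parameter.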
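The homogeneity property and two of the limiting cases of the $(\alpha, \beta)$-divergence can be checked numerically. The sketch below is a minimal illustration under stated assumptions: counting measure on a three-point sample space, hypothetical positive measures `p` and `q`, limits approximated by parameters within $10^{-6}$ of $1$, and function names chosen for illustration only.

```python
def ab_div(p, q, alpha, beta):
    """(alpha, beta)-divergence under counting measure, alpha != +-1, beta != +-1."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    r = (1 - beta) / 2
    total = sum(
        a * pi + b * qi - (a * pi ** r + b * qi ** r) ** (1 / r)
        for pi, qi in zip(p, q)
    )
    return 4 / (1 - alpha ** 2) * 2 / (1 + beta) * total

def classic_alpha_div(p, q, alpha):
    """Classic alpha-divergence A^(alpha)(p, q), for alpha != +1, -1."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    total = sum(a * pi + b * qi - pi ** a * qi ** b for pi, qi in zip(p, q))
    return 4 / (1 - alpha ** 2) * total

# Hypothetical positive measures on a three-point sample space.
p, q = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]
alpha, beta, lam = 0.5, 0.3, 2.5

base = ab_div(p, q, alpha, beta)

# Homogeneity: D(lam * p, lam * q) = lam * D(p, q).
scaled = ab_div([lam * pi for pi in p], [lam * qi for qi in q], alpha, beta)

# alpha -> 1 should approach A^(beta); beta -> 1 should approach A^(alpha).
alpha_to_one = ab_div(p, q, 1 - 1e-6, beta)
beta_to_one = ab_div(p, q, alpha, 1 - 1e-6)
```

Since the limits are only approximated, the comparisons against `classic_alpha_div(p, q, beta)` and `classic_alpha_div(p, q, alpha)` hold up to a small tolerance rather than exactly.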
Summary of Current Approach

Divergence:
  $\alpha$-divergence: equivalent to the $\delta$-divergence (Zhu & Rohwer, 1995); includes KL divergence as a special case
  f-divergence (Csiszár)
  Bregman divergence: equivalent to the canonical divergence
  U-divergence (Eguchi)
  Convex-based $\alpha$-divergence: for vector space of finite dimension; for function space of infinite dimension

Geometry:
  Riemannian metric: Fisher information
  Conjugate connections: $\alpha$-connection family
  Equi-affine structure: cubic form, Tchebychev 1-form
  Curvature
  Generalized expressions of the Fisher metric and $\alpha$-connections

References

Zhang, J. (2004). Divergence function, duality, and convex analysis. Neural Computation, 16: 159-195.
Zhang, J. (2005). Referential duality and representational duality in the scaling of multidimensional and infinite-dimensional stimulus space. In Dzhafarov, E. and Colonius, H. (Eds.), Measurement and representation of sensations: Recent progress in psychological theory. Lawrence Erlbaum Associates, Mahwah, NJ.
Zhang, J. and Hästö, P. (2006). Statistical manifold as an affine space: A functional equation approach. Journal of Mathematical Psychology, 50: 60-65.
Zhang, J. (2006). Referential duality and representational duality on statistical manifolds. Proceedings of the Second International Symposium on Information Geometry and Its Applications, Tokyo (pp. 58-67).
Zhang, J. (2007). A note on curvature of $\alpha$-connections of a statistical manifold. Annals of the Institute of Statistical Mathematics, 59: 161-170.
Zhang, J. and Matsuzoe, H. (in press). Dualistic differential geometry associated with a convex function. To appear in a special volume in the Springer series Advances in Mechanics and Mathematics.
Zhang, J. (under review). Nonparametric information geometry: Referential duality and representational duality on statistical manifolds.

Questions?