Face Alignment by Explicit Shape Regression
Xudong Cao, Fang Wen, Yichen Wei, Jian Sun
Visual Computing Group, Microsoft Research Asia
Problem: face shape estimation
• Find semantic facial points S = {(x_i, y_i)}
• Crucial for:
 – recognition
 – modeling
 – tracking
 – animation
 – editing

Desirable properties
1. Robust
 – to complex appearance: pose, expression, lighting, occlusion
 – to rough initialization
2. Accurate
 – error: ||S − Ŝ||, where Ŝ is the ground truth shape
3. Efficient
 – training: minutes / testing: milliseconds

Previous approaches
• Active Shape Model (ASM) [Cootes et al. 1992]
 – detects points from local features [Milborrow et al. 2008] …
 – sensitive to noise
• Active Appearance Model (AAM) [Cootes et al. 1998]
 – sensitive to initialization
 – fragile to appearance change [Matthews et al. 2004] …
All of these use a parametric (PCA) shape model.

Previous approaches (cont.)
• Boosted regression for face alignment: predict model parameters; fast
 – [Saragih et al. 2007] (AAM)
 – [Sauer et al. 2011] (AAM)
 – [Cristinacce et al. 2007] (ASM)
• Cascaded pose regression [Dollar et al. 2010]
 – pose-indexed features
 – also uses a parametric pose model

The parametric shape model is dominant
• But it has drawbacks:
1. Parameter error ≠ alignment error
 – minimizing parameter error is suboptimal
2. Hard to specify the model capacity
 – usually heuristic and fixed, e.g., the PCA dimension
 – not flexible for iterative alignment: strict initially? flexible finally?

Can we discard the parametric model?
1. Directly estimate the shape S by regression? Yes
2. Overcome the challenges? Yes
 – high-dimensional output
 – highly non-linear mapping
 – large variations in facial appearance
 – large training data and feature space
3. Still preserve the shape constraint? Yes

Our approach: Explicit Shape Regression
1. Directly estimate the shape S by regression? Yes
 – boosted (cascaded) regression framework
 – minimize ||S − Ŝ|| from coarse to fine
2. Overcome the challenges?
Yes
 – a two-level cascade for better convergence
 – efficient and effective features
 – fast correlation-based feature selection
3. Still preserve the shape constraint? Yes
 – an automatic and adaptive shape constraint

Approach overview
[Figure: estimated shapes at t = 0 (initialized from the face detector), t = 1, t = 2, …, t = 10; at each stage an affine transform normalizes the image I, the shape increment is predicted, and the result is transformed back; the final output is S = S^10.]
• The regressor R^t updates the previous shape S^(t−1) incrementally:
  S^t = S^(t−1) + R^t(I, S^(t−1))
• Each R^t is trained to minimize the remaining residual over all training examples i:
  R^t = argmin_R Σ_i ||ΔS_i − R(I_i, S_i^(t−1))||
  where ΔS = Ŝ − S^(t−1) is the ground truth shape residual.

Regressor learning
[Figure: the cascade R^1, …, R^(t−1), R^t, …, R^T.]
Three questions:
1. What is the structure of R^t?
2. What are the features?
3. How to select features?

Two-level cascade
• A single simple regressor (e.g., one decision tree) is too weak as R^t: slow convergence and poor generalization.
• Instead, each R^t is itself a cascade of primitive regressors r_1, …, r_K: a stronger R^t and rapid convergence.

Trade-off between the two levels, with a fixed total of 5,000 primitive regressors r_k:

  #stages (top level)   #stages (bottom level)   error (×10^−2)
  5000                  1                        5.2
  100                   50                       4.5
  10                    500                      3.3
  5                     1000                     6.2

Pixel-difference feature
• Powerful on large training data:
 – [Ozuysal et al. 2010]: keypoint recognition
 – [Dollar et al. 2010]: object pose estimation
 – [Shotton et al. 2011]: body part recognition
 – …
• Extremely fast to compute:
 – no need to warp the image; just transform the pixel coordinates
 – e.g., I(left eye) − I(right eye); I(mouth) ≫ I(nose tip)?
How to index pixels?
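As a minimal sketch of the pixel-difference feature (Python with NumPy; the function name and array layout are my assumptions, not from the talk), here is the computation using plain global pixel coordinates; N sampled pixels yield N^2 candidate features:

```python
import numpy as np

def pixel_diff_features(image, coords):
    """All pairwise pixel-difference features I(p_i) - I(p_j) for N sampled
    pixel locations; N pixels yield an N x N matrix, i.e. N^2 features."""
    # coords: (N, 2) integer array of (row, col) locations, here in global
    # image coordinates; shape-indexed coordinates would be more robust.
    vals = image[coords[:, 0], coords[:, 1]].astype(np.int32)
    return vals[:, None] - vals[None, :]   # broadcasted pairwise differences

# Toy usage on a 3x3 "image" with intensities 0..8
img = np.arange(9, dtype=np.uint8).reshape(3, 3)
pts = np.array([[0, 0], [1, 1], [2, 2]])   # sampled intensities: 0, 4, 8
feats = pixel_diff_features(img, pts)
# e.g. feats[2, 0] == 8 - 0 == 8
```

Only pixel reads and subtractions are involved, which is why testing the full cascade runs in milliseconds per image.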
• Global coordinates (x, y) in the (normalized) image: sensitive to personal variations in face shape. ✗
• Shape-indexed pixels, relative to the current shape (Δx, Δy from the nearest landmark): more robust to personal geometry variations. ✓

Tree-based regressor r_k
• Node split function: feature > threshold
 – select the (feature, threshold) pair that maximizes the variance reduction after the split
[Figure: a tree whose internal nodes test pixel differences, e.g., I(x1) − I(y1) > t1?, I(x2) − I(y2) > t2?]
• Each leaf stores the shape increment that best fits the training samples falling into it:
  δS_leaf = argmin_δS Σ_(i∈leaf) ||Ŝ_i − (S_i + δS)|| = Σ_(i∈leaf) (Ŝ_i − S_i) / leaf size
  where Ŝ_i is the ground truth shape and S_i is the shape from the last step.

Non-parametric shape constraint
• Since each δS_leaf is a linear combination of ground truth shapes, every update S^(t+1) = S^t + δS stays in that span: S^t = S^0 + Σ_i w_i Ŝ_i.
• Therefore all shapes S^t lie in the linear space of the training shapes Ŝ_i, provided the initial shape S^0 does.
• Unlike PCA, this constraint is learned from the data:
 – automatically
 – coarse-to-fine

Learned coarse-to-fine constraint
[Figure: PCA (keeping 95% variance) applied to all δS_leaf in each first-level stage; one panel plots the number of principal components per stage from stage 1 to stage 10, the other visualizes the #1, #2, and #3 principal components.]

Challenges in feature selection
• Large feature pool: N pixels → N^2 features; N = 400 gives 160,000 features
• Random selection: poor accuracy
• Exhaustive selection: too slow

Correlation-based feature selection
• A discriminative feature is also highly correlated with the regression target, and correlation is fast to compute: O(N) time.
• For each tree node (with the samples falling into it):
 1. Project the regression target ΔS onto a random direction.
 2. Select the feature with the highest correlation to the projection.
 3.
Select the best threshold to minimize the variance after the split.

More details
• Fast correlation computation: O(N) instead of O(N^2), where N is the number of pixels.
• Training data augmentation: introduce sufficient variation in the initial shapes.
• Multiple initializations: merging multiple results is more robust.

Performance

  #points   Training (2,000 images)   Testing (per image)
  5         5 mins                    0.32 ms
  29        10 mins                   0.91 ms
  87        21 mins                   2.9 ms

• Testing is extremely fast (300+ FPS): only pixel access and comparison, plus vector addition (SIMD).

Results on challenging web images
• Comparison to [Belhumeur et al. 2011]
 – P. Belhumeur, D. Jacobs, D. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. In CVPR, 2011.
 – 29 points, LFPW dataset; 2,000 training images from the web; the same 300 testing images.
• Comparison to [Liang et al. 2008]
 – L. Liang, R. Xiao, F. Wen, and J. Sun. Face alignment via component-based discriminative search. In ECCV, 2008.
 – 87 points, LFW dataset; the same training (4,002) and test (1,716) images.

Comparison with [Belhumeur et al. 2011]
• Our method is 2,000+ times faster.
[Figure: relative error reduction by our approach for each of the 29 points (point radius: mean error); most points improve by more than 10%, some by less than 10%, and a few are worse.]
[Figure: results of 29 points.]

Comparison with [Liang et al. 2008]
• 87 points, many of them texture-less, so the shape constraint is more important.

  Method                           mean error < 5 pixels   < 7.5 pixels   < 10 pixels
  Method in [Liang et al. 2008]    74.7%                   93.5%          97.8%
  Our method                       86.1%                   95.2%          98.2%

  (percentage of test images with mean error below each threshold)

[Figure: results of 87 points.]

Summary
• Challenge: a heuristic and fixed shape model (e.g., PCA) → our technique: a non-parametric shape constraint.
• Challenge: large variation in face appearance/geometry → cascaded regression and shape-indexed features.
• Challenge: large training data and feature space → correlation-based feature selection.
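To make the correlation-based selection step concrete, here is a minimal sketch (Python with NumPy; the function name and array layout are assumptions for illustration, and the threshold search from the last step is omitted). The 2D shape residuals are projected onto a random direction, and the candidate feature whose values correlate most strongly with that scalar projection is selected:

```python
import numpy as np

def select_feature(features, targets, rng):
    """Pick one feature column by correlation with a randomly projected target.

    features: (n_samples, n_features) pixel-difference feature values
    targets:  (n_samples, d) regression targets (flattened shape residuals dS)
    """
    # 1. Project the regression target onto a random direction.
    direction = rng.standard_normal(targets.shape[1])
    proj = targets @ direction
    # 2. Correlation of every feature column with the projection.
    f = features - features.mean(axis=0)
    p = proj - proj.mean()
    corr = (f.T @ p) / (np.linalg.norm(f, axis=0) * np.linalg.norm(p) + 1e-12)
    # Select the feature with the highest absolute correlation.
    return int(np.argmax(np.abs(corr)))

# Toy usage: feature column 1 is an exact copy of the (1-D) target,
# column 0 is independent noise, so column 1 must be selected.
rng = np.random.default_rng(0)
t = rng.standard_normal((200, 1))
X = np.column_stack([rng.standard_normal(200), t[:, 0]])
best = select_feature(X, t, rng)   # -> 1
```

The key point is cost: correlations against all feature columns are computed in one pass over the samples, avoiding the exhaustive split search over the full N^2 feature pool.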