Detecting Cartoons
a Case Study in Automatic Video-Genre
Classification
Tzvetanka Ianeva
Arjen de Vries
Hein Röhrig
Outline
• Goal: remove cartoons from search
results in TREC-2002 video track
• Our Approach: extract Image
Descriptors & SVM Machine Learning
• Related work
• Novel Descriptors from Granulometry
• SVM Learning
• Experimental Results
TREC-2002 video track
• TREC: workshops for large-scale
evaluation of information retrieval
technology
• CWI participation: Probabilistic
Multimedia Retrieval Model
• The model does not sufficiently
distinguish “cartoons”
Example of undesirable ‘cartoon’
[Figure: a query and the best matches returned]
Related work
• M.Roach et al. Motion based classification
of cartoons (2001)
• B.T.Truong et al. Automatic genre
identification for content-based video
categorization (2000)
• J.R.Smith et al. Searching for images and
videos on the world wide web
• N.C.Rowe et al. Automatic caption
localization for photographs on www pages
• V.Athitsos et al. [ASF] Distinguishing
photographs and graphics on the www
Cartoons
• What is a Cartoon?
– Cartoons do not contain any photographic
material
– Photos: taken with a photographic camera
• Appears easy to find cartoons
– Few, simple, strong colors, patches of
uniform colors, strong black edges, text
Quiz: Cartoon or Photo?
Examples not so Typical
Photos like cartoons
“Cartoons” like photos
Artificial photos
Small cues
Overlapping Frames
Mixed
Shadow & Sparkle
Image Descriptors
[Figure: an input image (240x352x3) is mapped to a vector of 148 image descriptors, indexed 1, 2, …, 148. Two example vectors are shown, starting 0.6231, 0.9266 and 0.2880, 0.4125.]
• greater correlation
• normalized
• Example: avg. sat., thresh. brightness
Overview of all our image descriptors
Image descriptor                Dimension
average saturation              1
threshold brightness            1
color histogram                 45
edge-direction histogram        40
compression ratio               1
multi-scale pattern spectrum    60
total                           148
Brightness and Saturation
• HSV color model
• Cartoons brighter =>
use % pixels with
Value > 0.4
• Cartoons have strong
colors =>
use average Saturation
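The two scalar descriptors above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function name `brightness_saturation` and the toy pixel lists are assumptions made for the example, not part of the original system.

```python
import colorsys

def brightness_saturation(pixels, value_threshold=0.4):
    """Return (average saturation, fraction of pixels with V > threshold).

    `pixels` is an iterable of (r, g, b) tuples with components in [0, 1].
    """
    total_saturation = 0.0
    bright = 0
    count = 0
    for r, g, b in pixels:
        _, s, v = colorsys.rgb_to_hsv(r, g, b)  # HSV color model
        total_saturation += s
        if v > value_threshold:
            bright += 1
        count += 1
    if count == 0:
        raise ValueError("image has no pixels")
    return total_saturation / count, bright / count

# Hypothetical toy images: strongly colored "cartoon", dull gray "photo".
cartoon = [(1.0, 0.0, 0.0)] * 3 + [(0.0, 0.0, 0.0)]
photo = [(0.3, 0.3, 0.3)] * 4
print(brightness_saturation(cartoon))
print(brightness_saturation(photo))
```

Both values lie in [0, 1], so they can be used as descriptor components without further scaling.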
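Saturation in cartoon and photo images
[Figure: RGB images next to their S (HSV) channel; average saturation 0.6231 and 0.2880, respectively.]
Brightness in cartoon and photo images
[Figure: RGB images next to their V (HSV) channel; threshold brightness 0.9266 and 0.4125, respectively.]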
Histograms
• Image $I : X \times Y \to \mathbb{R}^c$
• Filter $F : I \to I'$
• Bins $B_k$: partition of $\mathbb{R}^c$
• $h_k = \#\{ (x,y) : I'(x,y) \in B_k \}$
• E.g. brightness metric: $I$ grayscale, $c = 1$, $B_1 = [0, 0.4]$, $B_2 = (0.4, 1]$, return $h_2$
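The histogram definition above can be sketched generically. In this hedged example the filter is simplified to a per-pixel function and each bin is a predicate; the names `histogram` and `brightness_metric` are illustrative assumptions.

```python
def histogram(image, filt, bins):
    """Count filtered pixel values per bin.

    image: 2-D list of pixel values.
    filt:  per-pixel function standing in for the filter F (a simplification;
           a real filter may look at neighbouring pixels too).
    bins:  list of predicates, one per bin B_k, assumed to be disjoint.
    """
    counts = [0] * len(bins)
    for row in image:
        for value in row:
            filtered = filt(value)
            for k, in_bin in enumerate(bins):
                if in_bin(filtered):
                    counts[k] += 1
                    break
    return counts

def brightness_metric(gray_image):
    """Brightness example: two bins split at 0.4, return the count h_2."""
    bins = [lambda v: 0.0 <= v <= 0.4, lambda v: 0.4 < v <= 1.0]
    h = histogram(gray_image, lambda v: v, bins)
    return h[1]

gray = [[0.1, 0.5], [0.4, 0.9]]
print(brightness_metric(gray))
```

Dividing the count by the number of pixels gives the percentage form used for the threshold brightness descriptor.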
Color Histogram
• More general than
brightness & saturation
• Again HSV color space
• Partition HSV into
3x3x5 = 45 bins
• Cartoons have less
colors => col. hist. desc.
[Figure: color histograms of two example images in the 45-bin HSV space]
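A 45-bin HSV histogram could look like the sketch below. The slides only say 3x3x5; which axis gets 5 bins is not stated, so the split used here (3 hue, 3 saturation, 5 value bins) is an assumption, as are the function names.

```python
import colorsys

# Assumed split of the 3x3x5 partition; the slides do not say which axis has 5 bins.
H_BINS, S_BINS, V_BINS = 3, 3, 5

def _bin(value, n_bins):
    """Map a value in [0, 1] to a bin index in 0 .. n_bins - 1."""
    return min(int(value * n_bins), n_bins - 1)

def hsv_histogram(pixels):
    """Return 45 normalized bin counts for (r, g, b) pixels in [0, 1]."""
    counts = [0] * (H_BINS * S_BINS * V_BINS)
    n = 0
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        index = (_bin(h, H_BINS) * S_BINS + _bin(s, S_BINS)) * V_BINS + _bin(v, V_BINS)
        counts[index] += 1
        n += 1
    if n == 0:
        raise ValueError("image has no pixels")
    return [c / n for c in counts]

# Hypothetical two-pixel image: pure red and mid gray.
hist = hsv_histogram([(1.0, 0.0, 0.0), (0.5, 0.5, 0.5)])
print(sum(1 for value in hist if value > 0), "of", len(hist), "bins used")
```

An image with few colors concentrates its mass in few bins, which is the property the descriptor is meant to expose.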
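Edge detection
• Cartoons have strong black edges => use the intensity gradient
  $\nabla I(x,y) = (\partial_x I(x,y), \partial_y I(x,y))$
• Approx. total derivative of intensity
• Approx. $|\nabla I|$ and direction $\theta$
• Histogram of $(\theta, |\nabla I|)$
• 5 intervals for $|\nabla I| \in 0 \dots \sqrt{20}$
• 8 intervals for $\theta \in 0 \dots 2\pi$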
Edge angles & edge magnitudes
Edge histogram
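The edge histogram can be sketched as follows. The slides do not name the derivative approximation; a Sobel operator is assumed here because its maximum magnitude for intensities in [0, 1] is sqrt(20), which matches the stated range. Function and constant names are illustrative.

```python
import math

# Assumed derivative approximation: 3x3 Sobel masks.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))
MAG_BINS, ANGLE_BINS = 5, 8          # 5 x 8 = 40 descriptor components
MAX_MAG = math.sqrt(20)

def edge_histogram(gray):
    """gray: 2-D list of intensities in [0, 1]; returns 40 normalized counts."""
    height, width = len(gray), len(gray[0])
    counts = [0] * (MAG_BINS * ANGLE_BINS)
    n = 0
    for y in range(1, height - 1):          # skip the one-pixel border
        for x in range(1, width - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    p = gray[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * p
                    gy += SOBEL_Y[dy][dx] * p
            magnitude = math.hypot(gx, gy)
            angle = math.atan2(gy, gx) % (2 * math.pi)
            m = min(int(magnitude / MAX_MAG * MAG_BINS), MAG_BINS - 1)
            a = min(int(angle / (2 * math.pi) * ANGLE_BINS), ANGLE_BINS - 1)
            counts[m * ANGLE_BINS + a] += 1
            n += 1
    if n == 0:
        raise ValueError("image must be at least 3x3")
    return [c / n for c in counts]

# Hypothetical 4x4 image with one sharp vertical edge.
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
edges = edge_histogram(step)
```

For the toy step image every interior pixel has gradient (4, 0), so all mass falls into the strongest magnitude bin at angle 0.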
Compressibility
[Figure: two example images labeled 0.13548 and 0.23365]
• Cartoons: simpler composition
• Detect complexity by measuring
compression ratio
• Theory: “Kolmogorov complexity”
• Our application: use lossless PNG
compression
• Lossy JPEG not useful
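A rough stand-in for the compression-ratio descriptor is shown below. It applies DEFLATE from the standard `zlib` module to raw pixel bytes; PNG uses the same compression method after per-scanline filtering, so this is only an approximation of the PNG-based measure, and the function name is an assumption.

```python
import random
import zlib

def compression_ratio(raw_bytes):
    """Compressed size divided by raw size; lower means simpler content."""
    if not raw_bytes:
        raise ValueError("no pixel data")
    return len(zlib.compress(raw_bytes, 9)) / len(raw_bytes)

# Hypothetical pixel data: a uniform patch versus unstructured noise.
flat = bytes(4096)
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(4096))

print(compression_ratio(flat) < compression_ratio(noisy))
```

Lossless compression is the relevant choice here because the compressed size then reflects the complexity of the exact image content.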
Granulometries
• Idea: measure size
distribution of objects
• How? openings by
structuring element of
growing scale
• Normalized size
distribution
• Derivative = pattern
spectrum
Openings
• Opening = erosion then
dilation with same SE
$\gamma_B(f) = \delta_{\hat{B}}[\varepsilon_B(f)]$
Structuring Elements
• Non-flat parabola better(?) than flat disk
• Parabola: efficient computation, symmetry
$\varepsilon_B(f) = \min_{(x,y) \in B} \{ f(x,y) - B(x,y) \}$
Small-scale pattern spectrum
descriptors
SE disk
ri = i, i = 1,…20
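The granulometry steps above (openings at growing scale, normalized size distribution, discrete derivative) can be sketched as follows. For brevity this example uses a flat square structuring element rather than the disk or parabola from the slides; that substitution and all function names are assumptions.

```python
def _window_filter(image, radius, reduce_fn):
    """Apply min or max over a (2*radius+1)^2 window, clipped at the border."""
    height, width = len(image), len(image[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [image[j][i]
                      for j in range(max(0, y - radius), min(height, y + radius + 1))
                      for i in range(max(0, x - radius), min(width, x + radius + 1))]
            row.append(reduce_fn(window))
        out.append(row)
    return out

def erode(image, radius):
    return _window_filter(image, radius, min)

def dilate(image, radius):
    return _window_filter(image, radius, max)

def opening(image, radius):
    """Opening = erosion followed by dilation with the same element."""
    return dilate(erode(image, radius), radius)

def pattern_spectrum(image, max_radius):
    """Discrete derivative of the normalized size distribution."""
    total = sum(map(sum, image))
    if total == 0:
        raise ValueError("image is empty")
    volumes = [sum(map(sum, opening(image, r))) for r in range(max_radius + 1)]
    distribution = [1 - v / total for v in volumes]
    return [distribution[r] - distribution[r - 1] for r in range(1, max_radius + 1)]

# Hypothetical 7x7 image: one isolated pixel and one 3x3 block.
image = [[0] * 7 for _ in range(7)]
image[1][1] = 1
for j in range(3, 6):
    for i in range(3, 6):
        image[j][i] = 1
spectrum = pattern_spectrum(image, 2)
```

In the toy image the isolated pixel disappears at radius 1 and the 3x3 block at radius 2, so the spectrum reports how much image content lives at each scale.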
SVM Learning
• Simplest case: 
linear separator
• SVM finds
hyperplane with
largest margin
• Closest points =
Support Vectors
SVM Learning: nonseparable
• Noisy data: no
separating
hyperplane at all!
• Solution: penalty C
for points inside
the margin
• C-SVM machines
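The role of the penalty C can be made concrete by evaluating the soft-margin objective for a given hyperplane. This is a hedged sketch: the slack of a point is taken as max(0, 1 - y((w·x) + b)), the standard choice for a C-SVM, and the function name and toy data are assumptions.

```python
def primal_objective(w, b, points, labels, C):
    """Return (objective value, slacks) for the hyperplane (w, b).

    The slack of a point is zero when it lies on the correct side outside
    the margin and grows linearly once it enters the margin.
    """
    slacks = []
    for x, y in zip(points, labels):
        functional_margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        slacks.append(max(0.0, 1.0 - functional_margin))
    objective = 0.5 * sum(wi * wi for wi in w) + C * sum(slacks)
    return objective, slacks

# Hypothetical 1-D data: the third point lies on the wrong side.
points = [[2.0], [-2.0], [0.5]]
labels = [1, -1, -1]
value, slacks = primal_objective([1.0], 0.0, points, labels, C=2.0)
```

A larger C makes margin violations more expensive, so the learned hyperplane fits the training data more tightly.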
SVM = quadratic programming
SVM task:
$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i$
subject to: $y_i((w \cdot x_i) + b) \ge 1 - \xi_i$, $i = 1, \dots, l$
Equivalent dual problem:
$\max_{\alpha} \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
subject to: $0 \le \alpha_i \le C$, $i = 1, \dots, l$, $\sum_{i=1}^{l} y_i \alpha_i = 0$
SVM with kernels
$\Phi : \mathbb{R}^n \to F$, $k(x, \hat{x}) = (\Phi(x) \cdot \Phi(\hat{x}))$
SVM task:
$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} \xi_i$
subject to: $y_i((w \cdot \Phi(x_i)) + b) \ge 1 - \xi_i$, $i = 1, \dots, l$
Equivalent dual problem:
$\max_{\alpha} \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j k(x_i, x_j)$
subject to: $0 \le \alpha_i \le C$, $i = 1, \dots, l$, $\sum_{i=1}^{l} y_i \alpha_i = 0$
SVM kernels
RBF kernels:
$k(x, \hat{x}) = \exp\left( -\frac{\|x - \hat{x}\|^2}{2\sigma^2} \right)$
Polynomial kernels:
$k(x, \hat{x}) = ((x \cdot \hat{x}) + 1)^q$
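Both kernel families are short to write down in code. The sketch below assumes the RBF kernel has the form exp(-||x - x̂||² / (2σ²)) and the polynomial kernel the form ((x·x̂) + 1)^q; the function names are illustrative.

```python
import math

def rbf_kernel(x, x_hat, sigma2):
    """RBF kernel with variance parameter sigma2 (the sigma squared of the slides)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return math.exp(-squared_distance / (2 * sigma2))

def polynomial_kernel(x, x_hat, q):
    """Polynomial kernel of degree q."""
    return (sum(a * b for a, b in zip(x, x_hat)) + 1) ** q

print(rbf_kernel([0.0, 0.0], [0.0, 0.0], 0.07))
print(polynomial_kernel([1, 2], [3, 4], 2))
```

A small sigma2 makes the RBF kernel fall off quickly with distance, so each training point influences only its close neighbourhood.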
SVM with kernels: decision function
Decision function:
$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{l} y_i \alpha_i k(x, x_i) + b \right)$
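Once the multipliers are known, classification only needs the kernel and the training points with non-zero weight. The sketch below evaluates the decision function with an RBF kernel; the support vectors, multipliers, and offset are made-up values for illustration, not results of any training run.

```python
import math

def rbf_kernel(x, x_hat, sigma2):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return math.exp(-squared_distance / (2 * sigma2))

def decide(x, support_vectors, labels, alphas, b, kernel):
    """Sign of the kernel expansion; a score of exactly 0 is mapped to +1 here."""
    score = b
    for sv, y, alpha in zip(support_vectors, labels, alphas):
        score += y * alpha * kernel(x, sv)
    return 1 if score >= 0 else -1

# Hypothetical 1-D model: class -1 near 0.0, class +1 near 1.0.
support_vectors = [[0.0], [1.0]]
labels = [-1, 1]
alphas = [1.0, 1.0]

def kernel(u, v):
    return rbf_kernel(u, v, sigma2=0.1)

print(decide([0.9], support_vectors, labels, alphas, 0.0, kernel))
print(decide([0.1], support_vectors, labels, alphas, 0.0, kernel))
```

Only training points with a non-zero multiplier contribute to the sum, which is why they are called support vectors.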
Experimental Data
• Key frames from TREC 2002 Video
Track
• 13,026 photographic images
• 1,620 cartoons
• Manually classified
• Experiments 1-3: train on (random)
3908 photos and 486 cartoons
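A random split of this size can be sketched as below. The training counts correspond to about 30% of each class (0.3 x 13,026 rounds to 3908 and 0.3 x 1,620 is 486); the 30% figure is inferred from the counts rather than stated on the slides, and the function name and seeds are assumptions.

```python
import random

def split_indices(n_items, train_fraction, seed):
    """Return (train indices, test indices) for a random split."""
    rng = random.Random(seed)
    n_train = round(train_fraction * n_items)
    train = set(rng.sample(range(n_items), n_train))
    test = [i for i in range(n_items) if i not in train]
    return sorted(train), test

# Class sizes from the slide; the fraction 0.3 is an inferred assumption.
photo_train, photo_test = split_indices(13026, 0.3, seed=0)
cartoon_train, cartoon_test = split_indices(1620, 0.3, seed=1)
print(len(photo_train), len(cartoon_train))
```

Splitting each class separately keeps the photo-to-cartoon proportion the same in the training and test sets.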
Experiment 1: individual performance
$E_t = E_p \frac{|p|}{|p|+|c|} + E_c \frac{|c|}{|p|+|c|}$
Descriptor              σ²                 Error photos  Error cartoons  Total error
average saturation      σ² = 0.1           0.0027        0.9541          0.108
threshold brightness    0.05 < σ² < 0.5    0             1               0.1106
color histogram         σ² = 0.07          0.0095        0.754           0.0919
edge histogram          0.05 < σ² < 0.5    0             1               0.1106
compression ratio       0.05 < σ² < 0.5    0             1               0.1106
pattern spectrum        σ² = 0.07          0.0002        0.9497          0.1052
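The total error is the class-size-weighted average of the two per-class errors. The sketch below checks this against one set of figures from the slide (0.0027, 0.9541, 0.108), assuming the test set is everything not used for training, i.e. 13,026 - 3,908 photos and 1,620 - 486 cartoons; that assumption and the function name are not from the slides.

```python
def total_error(error_photos, error_cartoons, n_photos, n_cartoons):
    """Weighted average of the per-class error rates."""
    wrong = error_photos * n_photos + error_cartoons * n_cartoons
    return wrong / (n_photos + n_cartoons)

# Assumed test-set sizes: all key frames not used for training.
n_photos = 13026 - 3908
n_cartoons = 1620 - 486

combined = total_error(0.0027, 0.9541, n_photos, n_cartoons)
baseline = total_error(0.0, 1.0, n_photos, n_cartoons)
print(round(combined, 3), round(baseline, 4))
```

The baseline case shows why a total error near 0.11 is uninformative: a classifier that labels every key frame as a photo already reaches it.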
Experiment 2: “convergence” of SVM learning
[Plot (pattern spectrum): error (y-axis, 0.1020 to 0.1120) versus σ² (x-axis, 1/2 to 1/18).]
Experiment 3: combined performance (σ² = 0.06)
Descriptors                   Error photos  Error cartoons  Total error
all - average saturation      0.0068        0.6914          0.0825
all - threshold brightness    0.0111        0.657           0.0825
all - color histogram         0.0068        0.7734          0.0916
all - edge histogram          0.009         0.672           0.0823
all - compression ratio       0.0098        0.6684          0.0826
all - pattern spectrum        0.011         0.7046          0.0884
all                           0.0111        0.6437          0.0811
Experiment 4: web-image classifier on our data
[Plot: error (y-axis, 0.0 to 0.5) versus training set size (x-axis, 100 to 600) for two classifiers, “we” and [ASF].]
Test set: random 1,000 photos and 1,000 cartoons
Experiment 5: Performance on web images
Comparison on 14,039 photographic and 9,512 graphical images harvested from the WWW; trained on (random) 4239 photographic and 2826 graphical images.
[Bar chart: error (y-axis, 0 to 0.1) for “we”, [ASF], and “+ dimension and file type features”.]
Conclusions
• Hard task: good classifier
• Use dynamics/spatio-temporal
relations ?
• Semantic Gap?
• Combine classifiers?
• Granulometry not enough