Detection of text in images from a video stream for semantic indexing
5 December 2003
Christian Wolf
http://rfv.insa-lyon.fr/~wolf
Thesis advisor: Jean-Michel Jolion
Laboratoire d'Informatique en Images et Systèmes d'information
LIRIS, FRE 2672 CNRS
Bât. Jules Verne, INSA de Lyon
69621 Villeurbanne cedex
1
The framework of the thesis
2 industrial contracts with France Télécom: ECAV I, ECAV II
("Enrichissement du Contenu AudioVisuel", i.e. enrichment of audiovisual content)
Collaboration with the Language and Media Processing Laboratory, University of Maryland.
2 research internships:
2001: character segmentation
2002: video indexing (TREC)
2
Indexing using Text
[Diagram: during the indexing phase, the text appearing in video frames is extracted (e.g. "Patrick Mayhew, Min. chargé de l'Irlande du Nord" (Minister for Northern Ireland), "ISRAEL, Jerusalem", "montage T.Nouel"); a keyword-based search on "Patrick Mayhew" then returns the corresponding frames as result.]
3
Plan
• Introduction
• Detection in still images
• Detection in video sequences
• Character segmentation
• Experimental Results
• Conclusion
4
Videos vs. scanned documents
• Temporal aspects
• Complex and moving background
• Artificial shadows
• Low resolution
5
What is text? - character segmentation
[Example images: scene text vs. artificial text]
6
What is text? - texture
Example: Gabor energy features on a text image.
[Figure: original image; filter tuned to the example text; Gabor energy; thresholded Gabor energy]
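A minimal sketch of such a texture feature, assuming OpenCV and NumPy and illustrative filter parameters (the file name and the threshold rule are placeholders, not taken from the slides): one Gabor filter is tuned to the stroke frequency and orientation of the example text, its squared response gives the energy image, and a simple threshold yields the text mask.

```python
# Sketch (assumptions: OpenCV/NumPy, illustrative parameters, hypothetical file name).
import cv2
import numpy as np

def gabor_energy(gray, wavelength=8.0, orientation=0.0, sigma=4.0):
    """Squared response of a single Gabor filter (one scale, one orientation)."""
    kernel = cv2.getGaborKernel((31, 31), sigma, orientation, wavelength, 0.5, 0)
    response = cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kernel)
    return response ** 2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)       # placeholder input frame
energy = gabor_energy(gray)
thresholded = energy > energy.mean() + 2.0 * energy.std()  # thresholded Gabor energy
```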
7
What is text? - contrast & geometry
[Figure: example image and its accumulated horizontal Sobel edges]
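A small sketch of the accumulated horizontal Sobel edge feature (window size and normalization are assumptions, not read off the slide): the horizontal gradient responds strongly to vertical character strokes, and accumulating it along the text direction makes text lines stand out.

```python
# Sketch (assumption): accumulate the horizontal Sobel response over a short
# horizontal window so that rows of dense vertical character strokes stand out.
import cv2
import numpy as np

def accumulated_horizontal_sobel(gray, window=16):
    gx = np.abs(cv2.Sobel(gray.astype(np.float32), cv2.CV_32F, 1, 0, ksize=3))
    box = np.ones((1, window), np.float32) / window      # accumulate along x
    return cv2.filter2D(gx, cv2.CV_32F, box)
```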
8
A text detection system for videos
[Pipeline:]
• Initial frame integration (averaging)
• Detection per single frame
• Tracking → text occurrences
• Suppression of false alarms
• Image enhancement: multiple frame integration
• Binarization
• OCR (example output: "Soukaina Oufkir")
9
Plan (section divider): Detection in still images
10
2 Algorithms for still images

Algorithm 1 (local contrast):
• Calculate a text probability image according to a text model (1 value/pixel).
• Find the optimal threshold: separate the probability values into 2 classes.
• Post-processing.

Algorithm 2 (learning):
• Calculate a text feature image (N values/pixel).
• Classify each pixel in the feature image.
• Post-processing.
11
The local contrast method
• Calculate a text probability image according to a text model (1 value/pixel): filter of F. LeBourgeois.
• Separate the probability values into 2 classes: Fisher/Otsu.
• Post-processing:
  • Mathematical morphology
  • Geometrical constraints
  • Verification of special cases
  • Combination of rectangles
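A minimal sketch of the thresholding and morphology steps, assuming a precomputed text probability image (for instance the accumulated gradient response from the earlier slides); the structuring-element size is an illustrative assumption, not the author's setting.

```python
# Sketch (assumption): separate a precomputed text probability image into two
# classes with Fisher/Otsu thresholding, then apply morphological post-processing.
import cv2
import numpy as np

def binarize_probability(prob):
    prob8 = cv2.normalize(prob, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    _, mask = cv2.threshold(prob8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Closing along the text direction joins characters into candidate lines;
    # geometrical constraints on the connected components would follow.
    return cv2.morphologyEx(mask, cv2.MORPH_CLOSE, np.ones((1, 15), np.uint8))
```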
12
Properties of the local contrast method
+ High detection accuracy (accurate localization).
+ Not very sensitive to the type of text.
+ Low computational complexity (very fast!).
– False alarms due to the assumption of text presence. Geometrical constraints are imposed in the post-processing step.
13
Method 2: why learning?
+ Hope to increase the precision (decrease the number of false alarms) of the detection algorithm by learning the characteristics of text.
+ More complex text models are very difficult to derive analytically.
+ The introduction of support vector machine (SVM) learning and its ability to generalize even in high-dimensional spaces opened the door to complex decision functions and feature models.
Drawback:
– Specialization to a specific type of text (generalization)? Text exists in a wide variety of forms, fonts, sizes, orientations and deformations (especially scene text).
14
Geometrical features
Learning gray values and edge maps alone may not generalize enough.
Texture alone is not reliable, especially if the text is short.
Geometry is a valuable feature.
State of the art: enforce geometrical constraints in the post-processing step (mathematical morphology).
We propose the use of geometrical features very early in the detection process, i.e. not during post-processing.
15
Geometrical features: baseline
Text consists of:
• A high density of strokes in the direction of the text baseline.
• A consistent baseline (a rectangular region with an upper and lower border).
Two detection philosophies:
• Detection of the baseline directly, before detecting the text region.
• Detection of the baseline as the boundary area of the detected text region, in order to refine the detection quality.
16
Estimation of the text rectangle height
[Figure: original image and its accumulated gradients]
17
Features (derived from the mode of the accumulated gradient profile):
• Mode width (= rectangle height)
• Mode height (= contrast)
• Difference in height left-right
• Mode mean
• Mode standard deviation
• Difference in mode width
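A small sketch of how such mode descriptors could be extracted from one vertical profile of the accumulated-gradient image; the half-height support rule is an illustrative assumption, not the exact feature definition used in the thesis.

```python
# Sketch (assumption): locate the dominant mode of a 1-D vertical profile of
# accumulated gradients and derive simple geometrical descriptors from it.
import numpy as np

def mode_features(profile):
    profile = np.asarray(profile, dtype=float)
    peak = int(np.argmax(profile))
    height = float(profile[peak])                  # mode height (~ contrast)
    above = profile >= 0.5 * height                # half-height support of the mode
    top = peak
    while top > 0 and above[top - 1]:
        top -= 1
    bottom = peak
    while bottom < len(profile) - 1 and above[bottom + 1]:
        bottom += 1
    mode = profile[top:bottom + 1]
    return {
        "mode_width": bottom - top + 1,            # ~ text rectangle height
        "mode_height": height,
        "mode_mean": float(mode.mean()),
        "mode_std": float(mode.std()),
    }
```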
18
Learning with Support Vector Machines
• Training image database → positive samples, negative samples.
• Bootstrapping, cross-validation.
• Classification step: a reduction of the computational complexity is necessary:
  • Sub-sampling of the pixels to classify (4x4).
  • Approximation of the SVM model by SVM regression.
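A sketch of SVM training with a bootstrapping loop, using scikit-learn as a stand-in for whatever SVM library was actually used; the kernel, C and the number of rounds are assumptions. False alarms (background vectors wrongly classified as text) are re-injected as additional negative samples.

```python
# Sketch (assumption: scikit-learn, illustrative hyper-parameters).
import numpy as np
from sklearn.svm import SVC

def train_with_bootstrapping(X_pos, X_neg, X_unlabeled_neg, rounds=3):
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")
    X = np.vstack([X_pos, X_neg])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])
    for _ in range(rounds):
        clf.fit(X, y)
        # Bootstrapping: collect false alarms and add them as negatives.
        false_alarms = X_unlabeled_neg[clf.predict(X_unlabeled_neg) == 1]
        if len(false_alarms) == 0:
            break
        X = np.vstack([X, false_alarms])
        y = np.concatenate([y, np.zeros(len(false_alarms))])
    return clf
```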
19
Plan (section divider): Detection in video sequences
20
Tracking the text appearances
[Diagram: the list of rectangles detected for the current frame is matched against the list containing the most recent rectangle of each text occurrence, along the frame number (time) axis.]
The integration is done using a greedy search in the overlap matrix.
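A minimal sketch of such a greedy search (rectangle format and the use of raw intersection area as the overlap score are assumptions): the largest overlaps are matched first, and each matched rectangle and occurrence is removed from further consideration.

```python
# Sketch (assumption) of greedy matching on an overlap matrix.
import numpy as np

def overlap(a, b):
    """Intersection area of two rectangles given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def greedy_match(current_rects, occurrence_rects):
    M = np.array([[overlap(c, o) for o in occurrence_rects] for c in current_rects])
    matches = []
    while M.size and M.max() > 0:
        i, j = np.unravel_index(int(M.argmax()), M.shape)
        matches.append((i, j))           # rectangle i continues occurrence j
        M[i, :] = 0                      # remove both from further matching
        M[:, j] = 0
    return matches
```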
21
Tracking: content verification
Verification of the text box contents: L2 comparison of a signature vector (the vertical projection profile of the Sobel edges).
Text occurrences frequently appear at the same location without a significant temporal pause between them.
[Plots: signature comparison for different text, same text, and fading text]
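A small sketch of the signature comparison (normalization, the crop-based alignment and the decision threshold are assumptions): the vertical projection profile of the Sobel edges serves as a signature, and a small L2 distance between two signatures indicates the same text content.

```python
# Sketch (assumption): signature of a text box and L2 comparison between boxes.
import cv2
import numpy as np

def signature(box_gray):
    edges = np.abs(cv2.Sobel(box_gray.astype(np.float32), cv2.CV_32F, 1, 0, ksize=3))
    profile = edges.sum(axis=0)                       # vertical projection profile
    return profile / (np.linalg.norm(profile) + 1e-9)

def same_content(box_a, box_b, threshold=0.5):
    a, b = signature(box_a), signature(box_b)
    n = min(len(a), len(b))                           # crude alignment by cropping
    return np.linalg.norm(a[:n] - b[:n]) < threshold  # small L2 distance -> same text
```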
22
Enhancement
Multiple frame integration: averaging over the frames of a detected text occurrence.
Super-resolution (interpolation): bi-linear interpolation vs. bi-cubic splines.
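A minimal sketch of these two enhancement steps (the scale factor is an assumption, and the frames of the occurrence are assumed to be already aligned and of equal size): average the frames, then upscale with bi-cubic interpolation.

```python
# Sketch (assumption): enhance a detected text occurrence by frame averaging
# followed by bi-cubic upscaling.
import cv2
import numpy as np

def enhance(occurrence_frames, scale=4):
    avg = np.mean(np.stack(occurrence_frames).astype(np.float32), axis=0)
    h, w = avg.shape[:2]
    return cv2.resize(avg, (w * scale, h * scale), interpolation=cv2.INTER_CUBIC)
```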
23
Plan (section divider): Character segmentation
24
Adaptive binarization
Niblack’s adaptive method:
Sauvola’s improvement:
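The threshold formulas appear only as images on the slide; for reference, the standard forms of these two well-known methods, with m(x,y) and s(x,y) the local mean and standard deviation in a window centered on the pixel, k a user parameter and R the dynamic range of the standard deviation (typically 128 for 8-bit images), are:

T_{\text{Niblack}}(x,y) = m(x,y) + k \cdot s(x,y)

T_{\text{Sauvola}}(x,y) = m(x,y) \cdot \Bigl(1 + k\bigl(\tfrac{s(x,y)}{R} - 1\bigr)\Bigr)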
25
Our solution: contrast maximization
Quantities used (formulas shown as images on the slide):
• Contrast at the center of the image
• The maximum local contrast
• The contrast of the window
We keep the following pixels (threshold criterion; see the note below).
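The threshold formula itself is a graphic on the slide. The published form of this contrast-maximization criterion (quoted here from the corresponding paper rather than read off the slide, so treat the exact notation as an assumption) is:

T(x,y) = m(x,y) - k\Bigl(1 - \tfrac{s(x,y)}{R}\Bigr)\bigl(m(x,y) - M\bigr)

where m and s are the local mean and standard deviation of the window, M is the minimum gray level of the image, R is the maximum of s over all windows (the maximum local contrast), and k is a fixed parameter (around 0.5).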
26
Character segmentation: examples
[Figure: the original image binarized with Fisher/Otsu, Fisher/Otsu (windowed), Yanowitz-Bruckstein, Yanowitz-Bruckstein + post-processing, Niblack, Sauvola et al., and contrast maximization]
27
Modeling text with a Markov random field
Collaboration with the Laboratory for Language and Media Processing, University of Maryland (David Doermann).
Binarization as a Bayesian maximum a posteriori estimation problem using a Markov random field model.
• Prior: models the prior knowledge on the spatial relationships in the image as an MRF.
• Likelihood of the observation: depends on the observation and noise model; in our case, Gaussian noise corrected by Niblack's threshold surface.
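A generic sketch of this MAP formulation (the standard Bayesian form, not transcribed from the slide): the binary label field f is estimated from the observed image o as

\hat{f} = \arg\max_f P(o \mid f)\,P(f) = \arg\min_f \Bigl( \sum_s \frac{(o_s - \mu_{f_s})^2}{2\sigma^2} + \sum_{c \in \mathcal{C}} V_c(f) \Bigr)

where the first term is the Gaussian likelihood (with the class means corrected by Niblack's threshold surface) and the second sums the clique potentials V_c of the MRF prior.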
28
The prior knowledge
• The clique energies (4x4) are learned and interpolated from training data.
• Optimization of the energy function with simulated annealing.
The clique labelings of the repaired pixel before and after flipping it; all 16 cliques favor the change of the pixel (lower energy after the flip):
before:  1.05  1.82  1.48  1.85  2.00  2.14  1.80  1.77  1.87  1.84  1.72  1.66  2.00  2.08  1.89  1.93
after:   0.95  1.38  1.15  1.30  1.36  1.40  1.79  1.52  1.16  1.57  1.32  1.42  1.28  1.57  1.50  1.69
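The optimization step mentioned above (simulated annealing over the MRF energy) can be sketched as follows; this is a generic single-site flip scheme with assumed cooling parameters, and energy_delta is a hypothetical callable standing in for the clique-plus-likelihood energy change, not the author's implementation.

```python
# Sketch (assumption): simulated annealing with single-pixel label flips.
import numpy as np

def simulated_annealing(labels, energy_delta, t0=4.0, t_min=0.05, alpha=0.95):
    rng = np.random.default_rng(0)
    t = t0
    while t > t_min:
        for _ in range(labels.size):               # one sweep per temperature
            i = rng.integers(labels.shape[0])
            j = rng.integers(labels.shape[1])
            d = energy_delta(labels, i, j)          # energy change if pixel flipped
            if d < 0 or rng.random() < np.exp(-d / t):
                labels[i, j] = 1 - labels[i, j]
        t *= alpha                                  # geometric cooling schedule
    return labels
```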
29
Plan (section divider): Experimental Results

System     Ashida  HWDavid  Wolf  Todoran  Full
Recall       46      46      44     18      6
Precision    55      44      30     19      1
H. mean      50      45      36     18      2
30
Evaluation measures
[Diagram: detected rectangles compared against ground-truth rectangles]
ICDAR:
• 1-1 matches
• overlap information only
CRISP:
• 1-1, 1-M, M-1 matches
• thresholded matches
• no overlap information
AREA:
• 1-1, 1-M, M-1 matches
• thresholded matches
• overlap information
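A small sketch of a thresholded one-to-one matching between detections and ground truth, in the spirit of the measures compared above (this is not the exact CRISP definition; the overlap score and threshold are assumptions):

```python
# Sketch (assumption): thresholded 1-1 matching and recall/precision/harmonic mean.
def rect_area(r):
    return max(r[2] - r[0], 0) * max(r[3] - r[1], 0)

def overlap_ratio(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(w, 0) * max(h, 0)
    union = rect_area(a) + rect_area(b) - inter
    return inter / union if union else 0.0

def precision_recall(detections, ground_truth, thr=0.5):
    matched_gt, matched_det = set(), 0
    for d in detections:
        best = max(range(len(ground_truth)),
                   key=lambda j: overlap_ratio(d, ground_truth[j]), default=None)
        if (best is not None and best not in matched_gt
                and overlap_ratio(d, ground_truth[best]) >= thr):
            matched_gt.add(best)
            matched_det += 1
    recall = len(matched_gt) / len(ground_truth) if ground_truth else 0.0
    precision = matched_det / len(detections) if detections else 0.0
    h = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return recall, precision, h
```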
31
[Example frames from the test videos: AIM2 (commercials), AIM3 (news), AIM4 (cartoons, news), AIM5 (news)]
32
Detection in still images
Local contrast:

Dataset                                    #    G     Eval.   Recall  Precision  H. mean
Artificial text + no text                144  1.49   ICDAR     70.2     18.0      28.6
                                                      CRISP     81.2     20.1      32.3
                                                      AREA      83.5     26.3      40.0
Artificial text + scene text + no text   384  1.84   ICDAR     55.9     17.3      26.4
                                                      CRISP     59.1     18.1      27.7
                                                      AREA      60.8     21.9      32.2

SVM learning:

Dataset                                    #    G     Eval.   Recall  Precision  H. mean
Artificial text + no text                144  1.49   ICDAR     54.8     23.2      32.6
                                                      CRISP     59.7     23.9      34.2
                                                      AREA      68.8     25.5      37.3
Artificial text + scene text + no text   384  1.84   ICDAR     45.1     21.7      29.3
                                                      CRISP     47.5     21.5      29.6
                                                      AREA      53.6     24.1      33.3

(# = number of images, G = generality of the dataset.)
33
[Figure: detection results, local contrast vs. SVM learning]
34
[Figure: further detection results, local contrast vs. SVM learning]
35
The influence of falling generality
[Plot: local contrast vs. SVM learning as generality falls]
36
Detection in video sequences
Videos:

                          Contrast   SVM Learn.
Classified as text             301          284
Classified as non-text          21           38
Total in ground truth          322          322

Positives                      350          384
False alarms                   947          171
Logos                           75           39
Scene text                      72           90
Total - false alarms           497          513
Total                         1444          684

Recall (%)                    93.5         88.2
Precision (%)                 34.4         75.0
Harmonic mean (%)             50.3         81.1
37
OCR results
Local contrast based binarization (recognition by Abbyy FineReader 5.0):

Bin. method     Recall  Precision  H. mean  N. cost
Otsu             47.3     90.5      62.1     56.8
Niblack          80.5     80.4      80.4     40.0
Sauvola          72.4     81.2      76.5     42.3
Max. contrast    85.4     90.7      88.0     23.0
Bayesian estimation using a Markov random field prior.
Character recognition rate (%):

Document        1     2     3     4     5   Total
Sauvola et al. 77.1  39.8  77.1  99.0  98.7  79.0
MRF            81.0  40.5  87.3  99.3  98.8  82.0
38
TREC 2002
The type of videos present in the collection does not favor the use of recognized text: text is only rarely present.
[Example query topics: "Music", "Oil", "Energy Gas", "Air plane", "Airline", "Dance"]
39
Conclusion
We developed a new system for detection, tracking, enhancement and binarisation of text.
Detection performance is high due to the integration of several types of features at a very early stage. The learning method is less sensitive to textured noise in the image.
We proposed a new evaluation method which takes into account several measures of detection quality.
We derived a new binarisation method adapted to the type of text found in videos.
• 2 patents
• 2 publications in international journals (+1 submitted)
• 3 publications in international conferences
• 6 publications in national conferences
40
Outlook
• Possible improvement of the features (e.g. contrast normalization, non-linear texture filters).
• Integration of different feature types (statistical, structural, ...).
• Multi-orientation processing is not yet complete (new training set, implementation of the post-processing).
• Adaptation of the tracking algorithm to general types of motion.
• OCR on low-resolution grayscale images.
• Use of a priori knowledge about text in order to decrease the number of false alarms.
• Integration of the detected text into an indexing/browsing/segmentation framework.
41