Chinese Academy of Sciences, Beijing, China
Report Document
Overview of MPEG-7
Dr Zhang Sen
Speech Group, INRIA-LORIA
Villers les Nancy, France
Chinese Academy of Sciences
Beijing, China
7/17/2015
Speech and Language Processing Techniques
Outline of contents
• Introduction
• Basic Components
• Content Description
• Audiovisual (AV) Descriptions
• Multimedia Description Schemes
• XM and Applications
• More Information
Ozone WP2 architecture
[Figure labels: Ozone application; User Context; Ozone Context; Multi-modal widgets; speech recognition; gesture recognition; smart agent; video browser; animated agent; Dialog management; Perception; QoS; Situation Sensitivity; User Interface management; User-interaction module; Ozone Services; Authentication; Security; ...; Software Environment layer]
From MPEG-1 to MPEG-7
[Timeline, 1990 to 2001 and beyond: MPEG-1 (1992), MPEG-2 (1994), MPEG-4 v1 (1998) and v2 (1999), MPEG-7 (2001), MPEG-21 (?)]
• MPEG-3 was once planned but abandoned
• MPEG-5 and MPEG-6 were never defined
MPEG Family
MPEG-1 – Coding of moving pictures and audio for digital storage media (CD-ROM, MP3), 11/92
MPEG-2 – Generic coding of moving pictures and audio information (DVD, digital TV), 11/94
MPEG-4 – Coding of audiovisual objects for multimedia applications, Ver. 1 09/98, Ver. 2 11/99
MPEG-7 – Multimedia content description for AV material, 08/01
MPEG-21 – Digital AV framework: integration of multimedia technologies, 11/01
Why is MPEG-7 needed?
• Digital audiovisual information is increasing
– more and more content is available
– it comes from all kinds of sources
• Using digital audiovisual information requires
– description of the contents
– fast search of the contents
Objective of MPEG-7
• Standardize content-based description for various types of
audiovisual information
– Enable fast and efficient content searching, filtering and
identification
– Describe several aspects of the content (low-level features,
structure, semantic, models, collections, creation, etc.)
– Address a large range of applications
• Types of audiovisual information:
– Audio, speech
– Moving video, still pictures, graphics, 3D models
– Information on how objects are combined in scenes
Scope of MPEG-7
[Figure: Description generation → Description → Description consumption. The Description is the scope of MPEG-7; generation and consumption are labeled "Research and future competition".]
• Description generation (feature extraction, indexing process, annotation & authoring tools, ...) and description consumption (search engine, filtering tool, retrieval process, browsing device, ...) are non-normative parts of MPEG-7.
• The goal is to define the minimum that enables interoperability.
Scope of MPEG-7 standardization
[Figure: Feature Extraction → MPEG-7 Description → Search Engine]
• Feature Extraction: content analysis (D, DS), feature extraction (D, DS), annotation tools (DS), authoring (DS)
• MPEG-7 Scope: Description Schemes (DSs), Descriptors (Ds), Language (DDL)
• Search Engine: searching & filtering, classification, manipulation, summarization, indexing
Ref: MPEG-7 Concepts
Audio in MPEG-7
• Audio content description (yes)
• Sound retrieval and classifier (yes)
• Speech synthesis (no)
• Speech recognition (no)
• Probability Models (yes)
Parts of the MPEG-7 Standard
• ISO/IEC 15938-1: Systems
• ISO/IEC 15938-2: Description Definition Language
• ISO/IEC 15938-3: Visual
• ISO/IEC 15938-4: Audio
• ISO/IEC 15938-5: Multimedia Description Schemes
• ISO/IEC 15938-6: Reference Software
Main elements of MPEG-7
• Descriptors (D): representations of features that define the syntax and the semantics of each feature representation (low-level).
• Description Schemes (DS): specify the structure and semantics of the relationships between their components, which may be both Ds and DSs (high-level).
• A Description Definition Language (DDL): based on XML Schema, it allows the creation of new DSs and Ds and the extension and modification of existing DSs.
• System tools: support multiplexing of descriptions, synchronization, transmission mechanisms, coded representations, and the management and protection of intellectual property.
Relations of main elements
[Figure: the DDL sits above a hierarchy of DSs, each of which contains Ds and further DSs]
Description Definition Language
• The Description Definition Language (DDL) is a language that defines which descriptions are valid and allows the creation of new Description Schemes and Descriptors. It also allows the extension and modification of existing Description Schemes.
• DDL is used to define a set of formal rules
• ordering of the elements
• occurrences of elements
……...
• XML + MPEG-7 extensions
XML: Base for DDL
• Why choose XML as the base for the DDL?
• The popularity of XML
• The interoperability with other standards in the future
• Why should XML be extended for MPEG-7?
• SGML > XML
• Structural extensions
• Datatype extensions
DDL parser
A DDL parser is software that checks whether a description is valid.
[Figure: a Description and a Schema are fed to the Parser, which answers Yes or No]
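The yes/no check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a real DDL parser validates against an XML Schema, whereas the sketch uses a hypothetical rule table (TOY_SCHEMA) that encodes just the two kinds of rule named on the DDL slide, the ordering of elements and their number of occurrences. The function names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy "schema": for each element, the ordered list of
# (child name, minimum occurrences, maximum occurrences).
# This is NOT MPEG-7 DDL syntax, only a stand-in for it.
TOY_SCHEMA = {
    "letter": [("header", 1, 1), ("text", 1, 1)],
    "header": [("name", 1, 1), ("address", 0, 1)],
    "address": [("street", 1, 1), ("city", 1, 1)],
}


def is_valid(element, schema=TOY_SCHEMA):
    """Return True if every element's children respect order and counts."""
    rules = schema.get(element.tag, [])  # elements without rules must be leaves
    children = list(element)
    pos = 0
    for name, lowest, highest in rules:
        count = 0
        # Consume the run of consecutive children carrying the expected name.
        while pos < len(children) and children[pos].tag == name:
            count += 1
            pos += 1
        if not lowest <= count <= highest:
            return False
    if pos != len(children):  # leftover children: unknown or out of order
        return False
    return all(is_valid(child, schema) for child in children)


def parse_and_check(text):
    """Answer yes (True) or no (False) for a description given as text."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:  # not even well-formed XML
        return False
    return root.tag in TOY_SCHEMA and is_valid(root)
```

For instance, parse_and_check("<letter><header><name>A</name></header><text>hi</text></letter>") answers yes, while the same elements with text placed before header are rejected because the ordering rule fails.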
Type of descriptions
• Low level description (features, etc.)
• Generic and flexible
• Intelligent / efficient search engine
• High level description (structures, concepts, etc.)
• Efficient and powerful
• Lack of flexibility
Low-level Description
• Information in the creation and production processes
• director, title, short feature movie
• Information related to the usage of the content
• copyright pointers, usage history, broadcast schedule
• Information on the storage features of the content
• storage format, encoding
• Information about low-level features in the content
• colors, textures, sound timbres, melody
High-level Description
• Structural description
– video segments, frames, still and moving regions,
audio segments
– Segment DS (representing the spatial, temporal or
spatio-temporal structure)
• Conceptual (semantic) description
– objects, events, and notions
– links of the two descriptions
Illustration of descriptions
Basic description
• Elements
– Information containers
– containing data and other elements
– <city> …… </city>
• Attributes
– Attribute-value pairs used to characterize elements
– <city population="10000"> …… </city>
Structured descriptions
• Structured descriptions are trees
• Trees are suitable for retrieval and search
[Figure: a tree whose internal nodes are DSs and whose leaves are Ds]
Description trees
<letter>
  <header>
    <name> Mr Sen </name>
    <address>
      <street> 16 rue Laplace </street>
      <city> Nancy </city>
    </address>
  </header>
  <text> Dear Mr White, …</text>
</letter>
[Figure: the corresponding tree, with letter at the root, header and text below it, name and address under header, and street and city under address]
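As a small illustration of why tree-shaped descriptions suit search and retrieval, the letter description can be parsed and walked with Python's standard library. The helper name tree_lines is invented for this sketch, and the surrounding whitespace of the slide's text values is trimmed for convenience.

```python
import xml.etree.ElementTree as ET

LETTER = """<letter>
  <header>
    <name>Mr Sen</name>
    <address>
      <street>16 rue Laplace</street>
      <city>Nancy</city>
    </address>
  </header>
  <text>Dear Mr White, ...</text>
</letter>"""


def tree_lines(element, depth=0):
    """Yield one indented line per element, in depth-first order."""
    yield "  " * depth + element.tag
    for child in element:
        yield from tree_lines(child, depth + 1)


root = ET.fromstring(LETTER)
outline = list(tree_lines(root))
print("\n".join(outline))

# A path through the tree retrieves a single value directly.
city = root.findtext("header/address/city")
print(city)
```

The path expression header/address/city follows the branches of the tree, which is the kind of structured access a flat text annotation cannot offer.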
Example: Audio description
<Mpeg7Main>
  <DescriptionMetadata>
    <Version>1.0</Version>
  </DescriptionMetadata>
  <ContentDescription>
    <AudioContent xsi:type="AudioType">
      <Audio>
        <CreationInformation>
          <Creation>
            <Title> The daily news </Title>
          </Creation>
        </CreationInformation>
      </Audio>
    </AudioContent>
  </ContentDescription>
</Mpeg7Main>
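A consumer of such a description might read it as follows. This is a minimal sketch, assuming the type attribute uses the usual XML Schema instance prefix (xsi) and that the prefix is declared on the root element so a standard parser accepts it; the slide omits all namespace declarations, and real MPEG-7 instances also carry an MPEG-7 namespace that is left out here.

```python
import xml.etree.ElementTree as ET

# Assumption: xsi is bound to the standard XML Schema instance namespace.
XSI = "http://www.w3.org/2001/XMLSchema-instance"

DESCRIPTION = f"""<Mpeg7Main xmlns:xsi="{XSI}">
  <DescriptionMetadata><Version>1.0</Version></DescriptionMetadata>
  <ContentDescription>
    <AudioContent xsi:type="AudioType">
      <Audio>
        <CreationInformation>
          <Creation><Title> The daily news </Title></Creation>
        </CreationInformation>
      </Audio>
    </AudioContent>
  </ContentDescription>
</Mpeg7Main>"""

root = ET.fromstring(DESCRIPTION)

# Namespaced attributes are addressed as {namespace-uri}local-name.
content = root.find("ContentDescription/AudioContent")
content_type = content.get(f"{{{XSI}}}type")

version = root.findtext("DescriptionMetadata/Version")
title = root.findtext(".//Creation/Title").strip()

print(version, content_type, title)
```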
Audio description
• Low-level Description
– spectrum, parametric, and temporal features
• High-level Description
– Audio signature Description Scheme
– Instrument timbre Description Schemes
– The melody Description Tools
– Sound recognition and indexing Description Tools
– Spoken Content Description Tools
Audio low-level descriptors
• Waveform
• Loudness
• Spectral basis
• Spectral envelope
• Spectral centroid
• Spectral spread
• Fundamental frequency
• Harmonicity
• Attack time
Audio descriptor: Basic
• Two basic audio Descriptors
– AudioWaveform Descriptor
• describes the audio waveform envelope (minimum
and maximum)
– AudioPower Descriptor
• describes the temporally-smoothed instantaneous
power
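The two basic descriptors can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the signal is cut into non-overlapping frames of a fixed, arbitrarily chosen size, and the temporal smoothing of the power is approximated by a plain per-frame average. The function names are invented here and this is not the normative extraction procedure.

```python
import math


def frames(samples, size):
    """Yield consecutive non-overlapping frames of the given size."""
    for start in range(0, len(samples) - size + 1, size):
        yield samples[start:start + size]


def waveform_envelope(samples, size):
    """Per-frame (minimum, maximum) pairs, a compact waveform envelope."""
    return [(min(frame), max(frame)) for frame in frames(samples, size)]


def audio_power(samples, size):
    """Per-frame mean of the squared samples, a smoothed instantaneous power."""
    return [sum(x * x for x in frame) / size for frame in frames(samples, size)]


# Test signal: a unit-amplitude sine with a period of 20 samples.
signal = [math.sin(2 * math.pi * 5 * n / 100) for n in range(200)]
envelope = waveform_envelope(signal, 50)
power = audio_power(signal, 50)
print(envelope[0], power[0])
```

For the unit-amplitude sine used here, each frame's envelope spans roughly -1 to 1 and the power settles near 0.5, the mean square of a sine.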
Audio descriptor: Basic Spectral
• AudioSpectrumEnvelope Descriptor
– describes the short-term power spectrum
• AudioSpectrumCentroid Descriptor
– describes the center of gravity of the log-frequency
power spectrum
• AudioSpectrumSpread Descriptor
– describes the second moment of the log-frequency
power spectrum
• AudioSpectrumFlatness Descriptor
– describes the flatness properties of the spectrum
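The centroid and spread can be made concrete with a small worked computation. The sketch below is a simplification, not the normative procedure: it assumes frequencies are expressed in octaves relative to a 1 kHz reference, reports the spread as the square root of the power-weighted second central moment, and ignores any special handling of very low frequency bins. The function name and the two-bin test spectrum are invented for illustration.

```python
import math


def log_frequency_moments(freqs_hz, power, ref_hz=1000.0):
    """Return (centroid, spread) of a power spectrum on a log2 frequency axis.

    Both values are in octaves relative to ref_hz (an assumed reference).
    """
    total = sum(power)
    if total <= 0:
        raise ValueError("power spectrum must have positive total power")
    octaves = [math.log2(f / ref_hz) for f in freqs_hz]
    # Centroid: power-weighted mean position (center of gravity).
    centroid = sum(o * p for o, p in zip(octaves, power)) / total
    # Spread: power-weighted RMS deviation from the centroid.
    variance = sum((o - centroid) ** 2 * p for o, p in zip(octaves, power)) / total
    return centroid, math.sqrt(variance)


# Equal power one octave below and one octave above the reference.
centroid, spread = log_frequency_moments([500.0, 2000.0], [1.0, 1.0])
print(centroid, spread)
```

With equal power at 500 Hz and 2000 Hz, the center of gravity falls on the 1 kHz reference (0 octaves) and the spread is one octave.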
Audio Signature Description
• AudioSignature Description Scheme
provides a unique content identifier for the
purpose of robust automatic identification
of audio signals
• Applications include
– audio fingerprinting
– identification of audio
– locating metadata for legacy audio content
Instrument Timbre Description
• Timbre is defined as the perceptual features that make two sounds with the same pitch and loudness sound different.
• The Timbre Description captures these perceptual features with a reduced set of Descriptors
– HarmonicInstrumentTimbre Descriptor
– LogAttackTime Descriptor
– PercussiveInstrumentTimbre Descriptor
– Combination with Basic Spectral Descriptors
Melody Description Tools
The melody Description Tools are designed to facilitate efficient, robust,
and expressive melodic similarity matching
• MelodyContour Description Scheme
– 5-step contour representation
– basic rhythmic information representation
• MelodySequence Description Scheme
– supporting an expanded descriptor set and high
precision of interval encoding
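A 5-step contour can be sketched as a quantization of the interval between consecutive notes into the values -2 to +2. The thresholds used below (50 and 250 cents) are an assumption made for this illustration and should be checked against the standard; the function names are likewise invented.

```python
import math


def interval_cents(f_prev, f_next):
    """Interval between two fundamental frequencies in cents (100 per semitone)."""
    return 1200.0 * math.log2(f_next / f_prev)


def contour_step(cents):
    """Quantize an interval into five steps: -2, -1, 0, +1, +2.

    The 50 and 250 cent boundaries are assumed values for this sketch.
    """
    if cents <= -250:
        return -2
    if cents <= -50:
        return -1
    if cents < 50:
        return 0
    if cents < 250:
        return 1
    return 2


def melody_contour(freqs_hz):
    """Contour of a note sequence given as fundamental frequencies in Hz."""
    return [contour_step(interval_cents(a, b))
            for a, b in zip(freqs_hz, freqs_hz[1:])]


# A4, A4, B4, E5, D5, A4 in equal temperament.
notes = [440.0, 440.0, 493.88, 659.26, 587.33, 440.0]
contour = melody_contour(notes)
print(contour)
```

Because only the direction and rough size of each interval survive, two renditions of a tune in different keys or with slight intonation errors map to the same contour, which is what makes the representation robust for matching.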
General Sound Recognition and
Indexing Description Tools
• SoundModel (SM) DS
– statistical model, such as HMM or GMM
– SoundModelStatePath Descriptor
• consists of a state sequence generated by a SM
– SoundModelStateHistogram Descriptor
• consists of a normalized histogram of the state
sequence generated by a SM given an audio segment
• SoundClassificationModel DS
– a trainable multi-way classifier based on SMs
• speech vs music, male vs female, trumpet vs violin
• genre classification, voice recognition
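The relation between the state path and the state histogram can be shown in a few lines. This is a minimal sketch with an invented function name and a made-up state path; in practice the path would come from decoding an audio segment with a trained sound model.

```python
from collections import Counter


def state_histogram(state_path, num_states):
    """Normalized histogram: fraction of frames spent in each model state."""
    if not state_path:
        raise ValueError("state path must not be empty")
    counts = Counter(state_path)
    return [counts.get(state, 0) / len(state_path) for state in range(num_states)]


# Hypothetical state sequence produced by a 4-state model for one segment.
path = [0, 0, 1, 2, 2, 2, 1, 0]
histogram = state_histogram(path, 4)
print(histogram)
```

The histogram discards the order of the states, so segments of different lengths become fixed-size vectors that can be compared directly.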
Spoken content retrieval
• Output of ASR
– phone lattice or word lattice
– spoken content DS stores these
lattices instead of plain text
– lattices are good for retrieval
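Why a lattice helps retrieval can be sketched as follows: the lattice keeps alternative hypotheses, so a query can still match when the single best transcript is wrong. The edge list, words, and probabilities below are hypothetical, and the representation is a plain Python list rather than the SpokenContentLattice format.

```python
# Hypothetical word lattice: (from_node, to_node, word, probability).
LATTICE = [
    (0, 1, "taj", 0.7), (0, 1, "touch", 0.3),
    (1, 2, "mahal", 0.8), (1, 2, "mall", 0.2),
    (2, 3, "drawing", 0.9),
]


def match_score(lattice, query):
    """Best probability of any lattice path segment spelling the query words.

    Returns 0.0 when no path through the lattice contains the query.
    """
    best = 0.0

    def extend(node, index, prob):
        nonlocal best
        if index == len(query):  # all query words matched
            best = max(best, prob)
            return
        for src, dst, word, p in lattice:
            if src == node and word == query[index]:
                extend(dst, index + 1, prob * p)

    # The query may start at any node that has outgoing edges.
    for start in {src for src, _, _, _ in lattice}:
        extend(start, 0, 1.0)
    return best


print(match_score(LATTICE, ["taj", "mahal"]))
print(match_score(LATTICE, ["mall", "drawing"]))
```

A plain-text transcript would keep only the top hypothesis at each step, so a query for the lower-ranked word "mall" would find nothing; the lattice still returns it with a lower score.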
Spoken Content Description Tools
• SpokenContentLattice
– representing the actual decoding produced by
an ASR engine
• SpokenContentHeader
– contains information about the speakers being
recognized and the recognizer itself
– WordLexicon Descriptor
– PhoneLexicon Descriptor
– SpeakerInfo Descriptor
– ConfusionInfo Descriptor
Gaussian DS
<Gaussian>
<Mean>
4087.18 7173.73 1.36364 94.2727 1834.36 2359.55 2645.27 2577.09
………………………………
</Mean>
<Variance>
1.6982e+007 5.21621e+007 14.3636 9749.09 3.65743e+006
………………………………
</Variance>
</Gaussian>
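One way such a model might be used is to score a feature vector against it. The sketch below assumes that the Variance element holds one variance per dimension (a diagonal covariance), which the slide suggests but does not state; the function name and the toy values in the usage line are invented, and the elided numbers from the slide are not reproduced.

```python
import math


def diag_gaussian_log_pdf(x, mean, variance):
    """Log density of x under a Gaussian with per-dimension variances."""
    if not (len(x) == len(mean) == len(variance)):
        raise ValueError("x, mean and variance must have the same length")
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, variance):
        # Each dimension contributes independently under a diagonal covariance.
        log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return log_p


# Toy two-dimensional example: the point sits exactly on the mean.
score = diag_gaussian_log_pdf([1.0, 2.0], [1.0, 2.0], [1.0, 1.0])
print(score)
```

Working in the log domain avoids underflow when the variances are as large as those shown on the slide.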
State-transition model DS
<StateTransitionModel>
  <Transitions size1="20" size2="20">
    0 0 0.210526 0.0526316 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    ……………………………………
  </Transitions>
  <Initial size="20">
    0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
  </Initial>
  <State label="0 players" confidence="1"/>
  ……………………………………
  <State label="19 players" confidence="0.223607"/>
</StateTransitionModel>
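Reading such a model back can be sketched as below. The example uses a hypothetical two-state model rather than the 20-state one on the slide, and it assumes the matrix values are stored row by row, with size1 rows and size2 columns; both the layout and the helper names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical 2-state model in the same shape as the slide's example.
MODEL = """<StateTransitionModel>
  <Transitions size1="2" size2="2">
    0.9 0.1
    0.4 0.6
  </Transitions>
  <Initial size="2">1 0</Initial>
</StateTransitionModel>"""


def read_matrix(element):
    """Reshape whitespace-separated values into size1 rows of size2 columns."""
    rows, cols = int(element.get("size1")), int(element.get("size2"))
    values = [float(v) for v in element.text.split()]
    if len(values) != rows * cols:
        raise ValueError("element count does not match size1 x size2")
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


def path_probability(initial, transitions, path):
    """Probability of a state path under the initial and transition values."""
    prob = initial[path[0]]
    for current, following in zip(path, path[1:]):
        prob *= transitions[current][following]
    return prob


root = ET.fromstring(MODEL)
transitions = read_matrix(root.find("Transitions"))
initial = [float(v) for v in root.find("Initial").text.split()]
probability = path_probability(initial, transitions, [0, 0, 1])
print(transitions, initial, probability)
```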
ProbabilityModelClassifier DS
<ProbabilityModelClassifier confidence="0.9" length="2">
  <ProbabilityModelClass SemanticLabel="fish" Confidence="0.5"
      DescriptorName="ColorHistogram">
    <Gaussian>
      <Mean>
        4087.18 7173.73 1.36364 94.2727 1834.36 2359.55
        ………………………….
      </Mean>
      <Variance>
        1.6982e+007 5.21621e+007 14.3636 9749.09
        ………………………….
      </Variance>
    </Gaussian>
  </ProbabilityModelClass>
</ProbabilityModelClassifier>
SpokenContentLattice DS
A lattice structure for a hypothetical (combined phone and word)
decoding of the expression “Taj Mahal drawing …”.
SoundRecognitionClassifier
Extraction of sound indexes using a sound-recognition classifier. The model reference and state path are stored.
[Figure labels: AUDIO QUERY; SPECTRUM PROJECTION; AudioSpectrumBasis; HMM 1, HMM 2, ..., HMM N-1, HMM N; SoundRecognitionModel; HMM AND BASES; N; SELECT; MODEL REF + STATE PATH; SoundModelStatePath; Segmented Audio Description; MPEG-7 SOUND DATABASE]
SoundRecognitionClassifier
Query-by-example application with a query in media source form. Features must be extracted and projected into the classification space for each model in order to match against the database.
[Figure labels: AUDIO QUERY; SPECTRUM PROJECTION; AudioSpectrumBasis; HMM 1, HMM 2, ..., HMM N-1, HMM N; SoundRecognitionModel; ContinuousMarkovModel; HMM AND BASIS; N; SELECT; MODEL REF + STATE PATH; SoundModelStatePath; MATCHING; Indexed Audio; MPEG-7 SOUND DATABASE; RESULT LIST]
An example search application utilizing a query in DDL format
[Figure labels: DDL QUERY; MODEL REF + STATE PATH; MATCHING; MPEG-7 SOUND DATABASE; RESULT LIST]
Extraction of hidden Markov model and basis functions and storage in a DDL representation
[Figure labels: AUDIO WAV FILES; FEATURE EXTRACT; SoundRecognitionFeatures; BASIS EXTRACT; AudioSpectrumBasis; HMM; ContinuousMarkovModel; SoundRecognitionModel; HMM AND BASIS]
Scenarios for the Spoken Content Description Tools
• Recall of AV data by memorable spoken events
– A film or video recording where a character or person spoke a particular
word or sequence of words. The source media would be known, and the
query would return a position in the media.
• Spoken Document Retrieval
– There is a database consisting of separate spoken documents. The result
of the query is the relevant documents, and optionally the position in those
documents of the matched speech
• Annotated Media Retrieval
– Similar to spoken document retrieval. The result of the query is the
media which is annotated with speech, and not the speech itself.
An example is a photograph retrieved using a spoken annotation.
Multimedia DSs
Multimedia Description Schemes are metadata structures
for describing and annotating audio-visual (AV) content
• Basic Elements
• Content Management
• Content Description
• Content Organization
• Navigation and Access
• User Interaction
Organization of Multimedia DSs
Content Management
• Creation and production information
– Creation information
• title, textual annotation, creators, and dates
– Classification information
• genre, subject, purpose, language
• Media coding, storage and file formats
– format, compression, and coding
• Content usage
– usage rights, usage record
Navigation and Access
• Summaries
– hierarchical summaries
– sequential summaries
• Partitions and Decompositions
– decompositions in space, time and frequency
– used in multi-resolution access and progressive retrieval
• Variations
– selection of the most suitable variation of an AV program
– adaptation to the different capabilities of terminal devices,
network conditions or user preferences
Hierarchical summary
Illustration of variations
Content Organization
• Collections
– group the contents into clusters
– describe statistics and models of the attribute values
– describe relationships among collection clusters
• Models
– model the attributes and features of AV content
– Probability Model
• specify statistical functions and structures
– Analytic Model
• specify semantic labels
• specify the confidence
• build classifiers
Collection Structure
User Interaction
• User Preference
– context dependency in terms of time and place
– relative importance of different preferences
– privacy characteristics of the preferences
– preferences update by agent or user
• Usage History
– history of actions
– used to determine the user's preferences
eXperimentation Model (XM)
• Simulation platform for:
• Ds, DSs, CSs, DDL
• XM applications:
• the server (extraction) applications
• the client (search, filtering and/or transcoding)
applications
CS: Coding Schemes
The XM applications
• Extraction from Media
• all low-level Ds or DSs should have an
application class of this type
• Search & Retrieval Application
• either client application
• Media Transcoding Application
• either client application
• Description Filtering Application
• either client application
Extraction from Media
Search and retrieval application
Media transcoding application
Description Filtering Application
Interface model for XM app
Real world application
MDB = media database, DDB = description database.
First, two features are extracted from a media database. Then, based on the first feature,
relevant media files are selected from the media database.
The relevant media files are transcoded based on the second extracted feature.
MPEG-7 application areas
• Storage and retrieval of audiovisual databases (image, film, radio
archives)
• Broadcast media selection (radio, TV programs)
• Surveillance (traffic control, surface transportation, production chains)
• E-commerce and Tele-shopping (searching for clothes / patterns)
• Remote sensing (cartography, ecology, natural resources management)
• Entertainment (searching for a game, for a karaoke)
• Cultural services (museums, art galleries)
• Journalism (searching for events, persons)
• Personalized news service on Internet (push media filtering)
• Intelligent multimedia presentations
• Educational applications
• Bio-medical applications
Illustration of applications
Information Flow
[Figure labels: Feature extraction (manual/automatic); AV Description; Encoding; Storage; Transmission; Decoding; Pull: Search/query, Browse; Push: Filter; Users]
Push and Pull applications
• Pull applications
– Example: Search engines for internet and DBs
– Advantage: Many search engines work on
standardized descriptions
• Push applications
– Example: Broadcast of video, Interactive TV
– Advantage: Intelligent agents filter standardized
descriptions
Example: Pull application
MPEG-7
Database
Example: Push application
Example: queries
• Text (keywords):
– Find AV material with subject corresponding to some
keywords
• Semantic description:
– Find AV material corresponding to a specified semantic description
• Image as an example:
– Find an image with similar characteristics (global or
local)
• A few notes of music:
– Find corresponding musical pieces or movies
• Low level features (example: motion):
– Find video with specific object motion trajectories
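The query-by-example case can be sketched as a nearest-neighbor ranking over stored descriptor values. Everything concrete below is hypothetical: the file names, the three-bin histograms, and the choice of the L1 distance, which stands in for whatever matching measure a real search engine would use (matching is outside the scope of the standard).

```python
# Hypothetical descriptor database: item name -> normalized color histogram.
DATABASE = {
    "sunset.jpg": [0.7, 0.2, 0.1],
    "forest.jpg": [0.1, 0.8, 0.1],
    "beach.jpg": [0.5, 0.2, 0.3],
}


def l1_distance(a, b):
    """Sum of absolute differences between two descriptor vectors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_by_similarity(query, database):
    """Return item names ordered from most to least similar to the query."""
    return sorted(database, key=lambda name: l1_distance(query, database[name]))


# Descriptor extracted from the example image supplied by the user.
ranking = rank_by_similarity([0.65, 0.2, 0.15], DATABASE)
print(ranking)
```

Because the comparison runs on descriptors rather than on the media itself, the same ranking code works whichever tool extracted the histograms, which is the interoperability the standard aims for.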
Integration of MPEG-7 into XML
<seq begin="20s" dur="10s">
  <img id="Image1" dur="5s">
    <MP7:annotation>
      <Who>Fernando Morientes</Who>
      <WhatAction>Spain vs. Sweden soccer match</WhatAction>
    </MP7:annotation>
  </img>
  <img id="Image2" dur="2s" />
</seq>
MPEG-7 and other Standards
• MPEG-1, -2, and -4 are designed to
represent the information itself, while
MPEG-7 is meant to represent information
about the information.
• MPEG-1, -2, and -4 make content available,
while MPEG-7 allows you to find the
content you need.
Ultimate ambition of MPEG-7
• To make the web as searchable for
multimedia content as it is searchable for
text today
• To make the use of computer systems
as easy as possible
MPEG-7 beyond
• To mould computers around human requirements
and not humans around computer requirements
• To enable content disclosure based on facts, rather
than on human annotations
• To find information by rich spoken queries and hand-drawn images,
and to address what most people expect computers to be able to do
More Information on WWW
• Major MPEG-7 documents
http://www.cselt.it/mpeg/, semi-official website
http://www.mpeg-7.com, official website
• Others
http://www.elsevier.com/locate/image
Conclusion
[Figure labels: AV contents; Features; Structures; Ds; DSs; DDL; Ds, DSs; User]
Thanks
Low level AV descriptors
• Video segments: Color, Camera motion, Motion activity, Mosaic
• Still regions: Color, Shape, Position, Texture
• Moving regions: Color, Motion trajectory, Parametric motion, Spatio-temporal shape
• Audio segments: Spoken content, Spectral feature, Timbre
Face Recognition Descriptor
• Projection of a face vector onto a set of basis vectors
• Feature set is extracted from a normalized face image
• Normalized face image
– 56 lines with 46 intensity values in each line
– The centers of the two eyes are located on the 24th row
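The projection step can be sketched as a set of dot products. This is an illustration under assumptions: the face image is taken to be raster-scanned into a single vector, a mean face is subtracted before projecting (the slide does not say so), and the basis vectors are given as rows. The toy example uses four dimensions instead of the full image size, and all values are made up.

```python
ROWS, COLS = 56, 46          # normalized face image size from the slide
DIMENSION = ROWS * COLS      # length of the raster-scanned face vector


def project(face_vector, mean_face, basis):
    """Project a mean-centered face vector onto each basis vector (row)."""
    if len(face_vector) != len(mean_face):
        raise ValueError("face vector and mean face must have the same length")
    centered = [x - m for x, m in zip(face_vector, mean_face)]
    return [sum(b * c for b, c in zip(row, centered)) for row in basis]


# Toy 4-dimensional example with two hypothetical basis vectors.
features = project(
    [3.0, 1.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
    [[1.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.5, 0.5]],
)
print(DIMENSION, features)
```

The resulting feature list has one entry per basis vector, so a face image of 2576 intensity values is reduced to a much shorter descriptor.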
Segment Decomposition
MPEG-7 Normative Interfaces
Example: Content description
[Figure labels: Feature extraction; Indexing; Low level process; High level process; MPEG-7 Database; Search; Retrieval]
Segment DS
Segment DS describes the result of a spatial,
temporal, or spatio-temporal partitioning of the
AV content. It has nine major subclasses:
• Multimedia Segment DS
• AudioVisual Region DS
• AudioVisual Segment DS
• Audio Segment DS
• Still Region DS
• Still Region 3D DS
• Moving Region DS
• Video Segment DS
• Ink Segment DS
Examples: T/S segments
Example: Segment trees
Illustration of conceptual description
[Figure labels: AV content; Semantic DS; Semantic base DS; Semantic container DS; Object DS; Event DS; Concept DS; Semantic state DS; Semantic place DS; Semantic time DS]
Visual description
• Basic structures
– Grid layout, Time series, Multiple view,
Spatial 2D coordinates, Temporal interpolation
• Descriptors
– Color, Texture, Shape, Motion, Localization
Example: Color Descriptors
• Color space
• Color Quantization
• Dominant Colors
• Scalable Color
• Color Layout
• Color-Structure
• GoF/GoP Color
Example: Color space
• R,G,B
• Y,Cr,Cb
• H,S,V
• HMMD
• Linear transformation matrix with reference to R, G, B
• Monochrome
Audio Framework
Descriptor
• Definition
A Descriptor (D) is a representation of a Feature. A Descriptor
defines the syntax and the semantics of the Feature representation.
• Notes
A descriptor allows an evaluation of the corresponding feature
via the descriptor value. It is possible to have several descriptors
representing a single feature.
• Examples
Possible descriptors include the color histogram (for the color feature),
the average of the frequency components, the motion field, and the text
of the title.
Descriptor Value
• Definition
A Descriptor Value is an instantiation of a Descriptor for
a given data set (or subset thereof).
• Notes
Descriptor Values are combined via the mechanism of a
Description Scheme to form a Description.
Description Scheme
• Definition
A Description Scheme (DS) specifies the structure and
semantics of the relationships between its components,
which may be both Descriptors and Description Schemes.
• Examples
A movie, structured as scenes and shots, including some
textual descriptors at the scene level, and color, motion
and some audio descriptors at the shot level.
• Note
Ds contain only basic data types and do not refer to
other Ds or DSs.
DDL: XML Schema & Extensions
• XML Schema
• Data types
• Simple and Complex types
• Elements
• Inheritance, Abstract types
• MPEG-7 extensions
• Array and Matrix datatype
• Enumerated datatypes for MimeType,
CountryCode, RegionCode,
CurrencyCode and CharacterSetCode
• Typed references
Basic elements of DS
• Constructs for linking media files
• Localizing pieces of content
• Describing
– time, places, persons, individuals, groups,
organizations, and textual annotation, etc
– Who? What object? What action? Where?
When? Why? and How?
Content recognition tools
• No speech or face or gesture recognition
engines included in MPEG-7
• Building content recognition tools is a task for
industry, not for the standard
– coding tools in MPEG-1, -2, -4 were for
research purposes, not part of the standard
– no tools were part of the MPEG standard