Multimedia Information extraction
from HTML product catalogues
Martin Labský1, Vojtěch Svátek1, Pavel Praks2, Ondřej Šváb1
{labsky, svatek, xsvao06}@vse.cz, [email protected]
rainbow.vse.cz
1 Dept. of Information and Knowledge Engineering, Prague University of Economics
2 Dept. of Applied Mathematics, Technical University of Ostrava
DATESO, April 14th 2005
Agenda
• Information Extraction from Internet
• Annotation using Hidden Markov Models
• Extracting images
• Instance composition guided by ontology
• Bicycle search application
IE from Internet
• Motivation
– Semantic and structured search over large document collections
– Example: searching for objects of type Bicycle in the price range €500 - €900; find structures (name, price, equipment)
• Requirements
– Identify relevant documents
– Perform automatic IE
• Documents are semi-structured and have heterogeneous layouts and formatting
IE from Internet
Our approach to IE
[Figure: processing pipeline. Acquire new document → HTML preprocessing → token sequence w1 w2 ... wn → annotation using HMMs (tokens labelled name, price, picture) → instance extraction → Bicycle offer (name: w3 w4, price: w6 w7, picture: w9)]
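As a rough illustration of the last step in this pipeline, the sketch below collects runs of identically labelled tokens into the slots of a single offer. It assumes the HMM annotation is already available as one label per token; the function name `extract_instance` and the background label `bg` are hypothetical, and the ontology-guided composition listed in the agenda is not modelled here.

```python
from itertools import groupby


def extract_instance(tokens, labels, background="bg"):
    """Group consecutive tokens sharing a non-background label into slot values.

    `background` is a hypothetical label for tokens that belong to no slot.
    Only the first run found for each slot is kept.
    """
    instance = {}
    pairs = zip(tokens, labels)
    for label, run in groupby(pairs, key=lambda pair: pair[1]):
        if label == background:
            continue
        value = " ".join(token for token, _ in run)
        instance.setdefault(label, value)
    return instance


# Token and label sequence mirroring the slide: name = w3 w4, price = w6 w7, picture = w9.
tokens = ["w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8", "w9"]
labels = ["bg", "bg", "name", "name", "bg", "price", "price", "bg", "picture"]
offer = extract_instance(tokens, labels)
print(offer)
```

Keeping only the first run per slot is a simplification; a page that lists several offers would need the runs split into separate instances.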
IE from Internet
Relevant documents
Agenda
• Information Extraction from Internet
• Annotation using Hidden Markov Models
• Extracting images
• Instance composition guided by ontology
• Bicycle search application
Annotation using HMMs
Preprocessing
• HTML cleanup
– conversion to valid XHTML
• Only potentially relevant blocks kept
– blocks that do not directly contain text or images omitted
• Formatting tags
– attributes removed
– several rules matching common constructions (add-to-basket form, choose-amount button)
• Images
– baseline: all images treated as a single token
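The cleanup steps above could be sketched roughly as follows, using Python's standard `html.parser`. This is a minimal sketch, not the authors' implementation: the class name, the `<IMG>` placeholder, and the `<td>`/`<b>` wrapper in the sample input are assumptions, and neither the block-filtering step nor the rules for add-to-basket forms are covered.

```python
from html.parser import HTMLParser

IMAGE_TOKEN = "<IMG>"  # hypothetical placeholder shared by every image (the baseline)


class CatalogueTokenizer(HTMLParser):
    """Flattens an HTML fragment into tokens: bare tags, words, image placeholders."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped; every image collapses to one shared token.
        if tag == "img":
            self.tokens.append(IMAGE_TOKEN)
        else:
            self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        # Images are already represented by the placeholder.
        if tag != "img":
            self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is split on whitespace into word tokens.
        self.tokens.extend(data.split())


def tokenize(html):
    parser = CatalogueTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


# Image tag and text taken from the example slide; the <td> and <b> wrapper is assumed.
sample = (
    '<td><img src=/smsimg/3/tn_m2366_05tksession77.jpg width=100 height=70 '
    'alt="TREK Session 77" border=0><b>TREK Session 77</b> (2005) '
    'OUR PRICE £3000.00</td>'
)
tokens = tokenize(sample)
print(tokens)
```

The resulting flat token list is the kind of sequence an HMM annotator would then label token by token.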
Annotation using HMMs
Preprocessing – example
<img src=/smsimg/3/tn_m2366_05tksession77.jpg width=100 height=70 alt="TREK Session 77" border=0>
TREK Session 77
(2005)
OUR PRICE £3000.00