

Multimedia Information extraction
from HTML product catalogues
Martin Labský1, Vojtěch Svátek1, Pavel Praks2, Ondřej Šváb1
{labsky, svatek, xsvao06}@vse.cz, [email protected]

rainbow.vse.cz

1 Dept. of Information and Knowledge Engineering,
Prague University of Economics
2 Dept. of Applied Mathematics, Technical University of Ostrava
DATESO, April 14th 2005

Agenda
• Information Extraction from Internet
• Annotation using Hidden Markov Models
• Extracting images
• Instance composition guided by ontology
• Bicycle search application


IE from Internet

IE from Internet
• Motivation
– Semantic and structured search over large document collections
– e.g. searching for objects of type Bicycle in price range €500 - €900; find structures (name, price, equipment)

• Requirements
– Identify relevant documents
– Perform automatic IE
• documents are semi-structured, have heterogeneous layouts and formattings

IE from Internet

Our approach to IE
[Figure: processing pipeline. Acquire new document → HTML preprocessing (tokens w1 w2 ... wn) → Annotation using HMMs (tokens labelled name, price, picture) → Instance extraction (Bicycle offer: name w3 w4, price w6 w7, picture w9)]

IE from Internet

Relevant documents


Annotation using HMMs

Preprocessing
• HTML cleanup
– conversion to valid XHTML

• Only potentially relevant blocks kept
– blocks that do not directly contain text or images omitted

• Formatting tags
– attributes removed
– several rules matching common constructions (add-to-basket form, choose-amount button)

• Images
– baseline: all images treated as a single token

Annotation using HMMs

Preprocessing – example

Input HTML (fragment):
<img src=/smsimg/3/tn_m2366_05tksession77.jpg width=100 height=70 alt="TREK Session 77" border=0>
TREK Session 77
(2005)
OUR PRICE £3000.00
<form action=/products.php?plid=m1b0s1p0 name=buyit> <input name=cartadditem id=cartadditem value=979> <select name="selected_size" id="selected_size"> <option value="15.5">15.5
<input type="hidden" name="selected_colour" id="selected_colour" value="default"> <input type=submit name=submit id=submit value="Add to Basket">

After preprocessing:
TREK Session 77
( 2005 )
OUR PRICE £ 3000 . 00
- - Select Size - 15 . 5 17 . 5 19
<_CHOOSEAMOUNT/> <_ADDTOBASKET/>
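
The token splitting visible in the example (for instance "£3000.00" becoming "£ 3000 . 00") can be approximated with a small regular-expression tokenizer. This is only a sketch: the pattern and the function name `tokenize` are assumptions made for illustration, not the tokenizer used by the authors.

```python
import re

# Assumed tokenization rule: runs of letters, runs of digits,
# and single punctuation/currency characters each become one token.
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]")

def tokenize(text):
    """Split raw text into word, number and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(" ".join(tokenize("OUR PRICE £3000.00")))  # -> OUR PRICE £ 3000 . 00
print(" ".join(tokenize("(2005)")))              # -> ( 2005 )
```

Splitting prices into several tokens lets the lexical model share statistics across different amounts instead of treating every price string as a new word.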

Annotation using HMMs

Document modeling using HMMs
• Generative model
• Document = [w1 c1] [w2 c2], where each word wi is paired with a class ci
• P([w1 c1] [w2 c2]) = P(c1) P(c2|c1) P(w1|c1) P(w2|c2)
– transition probabilities: P(c2|c1), P(c1|c2)
– lexical probabilities: P(w1|c1), P(w1|c2)
– estimated from training data (frequencies)
• c1 c2 = argmax_{i,j} P([w1 ci] [w2 cj])
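
The argmax over class sequences is usually computed with the Viterbi algorithm rather than by enumerating every sequence. Below is a minimal sketch for a first-order HMM. The class names, the toy probabilities and the `floor` smoothing constant are all hypothetical values chosen for illustration, not parameters from the talk.

```python
import math

def viterbi(words, classes, start_p, trans_p, lex_p, floor=1e-6):
    """Return the most probable class sequence for `words` under a first-order HMM."""
    def lp(p):
        # Log-probability with a crude floor for unseen events (assumed smoothing).
        return math.log(max(p, floor))

    # best[c] = (log-probability, path) of the best path ending in class c
    best = {c: (lp(start_p.get(c, 0)) + lp(lex_p[c].get(words[0], 0)), [c])
            for c in classes}
    for w in words[1:]:
        new_best = {}
        for c in classes:
            score, path = max(
                ((best[p][0] + lp(trans_p[p].get(c, 0)), best[p][1]) for p in classes),
                key=lambda t: t[0],
            )
            new_best[c] = (score + lp(lex_p[c].get(w, 0)), path + [c])
        best = new_best
    return max(best.values(), key=lambda t: t[0])[1]

# Toy model (hypothetical numbers).
classes = ["bg", "name", "price"]
start_p = {"bg": 0.8, "name": 0.1, "price": 0.1}
trans_p = {
    "bg": {"bg": 0.6, "name": 0.3, "price": 0.1},
    "name": {"bg": 0.2, "name": 0.6, "price": 0.2},
    "price": {"bg": 0.8, "name": 0.1, "price": 0.1},
}
lex_p = {
    "bg": {"buy": 0.5},
    "name": {"TREK": 0.5, "Session": 0.4},
    "price": {"3000": 0.6},
}
labels = viterbi(["buy", "TREK", "Session", "3000"], classes, start_p, trans_p, lex_p)
print(labels)
```

Working in log space avoids numeric underflow on long documents, where a product of many small probabilities would otherwise round to zero.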

Annotation using HMMs

HMM Structure
• States
– adopted from [Freitag, McCallum 99]
– Target, Prefix, Suffix and Background
– densely connected

• Class trigram model
– P(name | name_prefix, name)

• Variations
– word-ngram models for lexical probabilities of target states, P(wi | wi-1, name)
– state substructures instead of single target states, learned by EM
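
A class trigram probability such as P(name | name_prefix, name) can be estimated from labelled training sequences by relative frequency. The sketch below assumes plain unsmoothed counts and a hypothetical `<s>` padding symbol. The training sequences are invented for illustration.

```python
from collections import Counter

def trigram_transitions(label_sequences):
    """Relative-frequency estimate of P(c_i | c_{i-2}, c_{i-1}) from class sequences."""
    tri, bi = Counter(), Counter()
    for labels in label_sequences:
        padded = ["<s>", "<s>"] + list(labels)  # "<s>" marks the document start
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1
    return {(a, b, c): n / bi[(a, b)] for (a, b, c), n in tri.items()}

# Invented training data: one class label per token.
train = [
    ["bg", "name_prefix", "name", "name", "name_suffix", "bg"],
    ["name_prefix", "name", "name_suffix"],
]
probs = trigram_transitions(train)
# P(name | name_prefix, name): how often a name continues after its first word
print(probs[("name_prefix", "name", "name")])
```

In practice such estimates would need smoothing, since many class trigrams never occur in a small training set.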

Extracting Images

Extracting Images
• Baseline
– every image represented by the same token
– HMM only extracts product images based on context, e.g.
P(product_picture | name, product_picture_prefix)

• Use image classifier to preprocess images
– classifies into 3 classes – Pos, Neg, Unk
– before HMM annotation, each image occurrence in document is substituted by its class
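
The substitution step can be pictured as a pass over the token stream that swaps each image for a class-specific token. In the sketch below, the dictionary representation of an image token, the token names such as `<IMG_POS/>` and the size-based `toy_classify` rule are all assumptions for illustration only.

```python
def substitute_images(tokens, classify):
    """Replace each image token by a token naming its predicted class."""
    out = []
    for tok in tokens:
        if isinstance(tok, dict) and tok.get("type") == "img":
            out.append("<IMG_%s/>" % classify(tok).upper())
        else:
            out.append(tok)
    return out

def toy_classify(img):
    # Hypothetical stand-in for the real image classifier.
    if img["w"] < 10 or img["h"] < 10:
        return "Neg"
    if img["w"] >= 50 and img["h"] >= 50:
        return "Pos"
    return "Unk"

doc = [{"type": "img", "w": 100, "h": 70}, "TREK", "Session", "77",
       {"type": "img", "w": 1, "h": 1}]
annot_input = substitute_images(doc, toy_classify)
print(annot_input)
```

With three distinct image tokens, the HMM can learn separate lexical probabilities for likely and unlikely product pictures instead of relying on context alone.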

Extracting Images

Image Classification – Features
• Image size
– estimated 2-dimensional normal distribution from a set of 1000 unique bicycle images → NC(x, y)
– estimated decision threshold (1-feature binary classifier) using held-out set of 150 images (60% positive)

• Image similarity
– latent semantic similarity [Praks 2004] → sim(I1, I2)
– estimated decision threshold for 1-feature binary classifier

• Does the image repeat in document?
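
The image-size feature can be sketched as follows: fit a 2-dimensional normal distribution to the (width, height) pairs of known product images, score a new image by its density, and pick a threshold on held-out data. The sample sizes below are invented, and choosing the cut by held-out accuracy is an assumption about how the threshold might be set.

```python
import math

def fit_gaussian_2d(sizes):
    """Maximum-likelihood mean and covariance of (width, height) pairs."""
    n = len(sizes)
    mx = sum(w for w, _ in sizes) / n
    my = sum(h for _, h in sizes) / n
    sxx = sum((w - mx) ** 2 for w, _ in sizes) / n
    syy = sum((h - my) ** 2 for _, h in sizes) / n
    sxy = sum((w - mx) * (h - my) for w, h in sizes) / n
    return mx, my, sxx, syy, sxy

def density(params, w, h):
    """Density of the fitted 2-D normal at (w, h)."""
    mx, my, sxx, syy, sxy = params
    det = sxx * syy - sxy ** 2
    dx, dy = w - mx, h - my
    # Squared Mahalanobis distance via the inverse of the 2x2 covariance matrix
    m = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * m) / (2 * math.pi * math.sqrt(det))

def choose_threshold(scored):
    """scored: list of (score, is_positive); return the cut with the best accuracy."""
    best_t, best_acc = None, -1.0
    for t, _ in scored:
        acc = sum((s >= t) == y for s, y in scored) / len(scored)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Invented sizes of product images, and an invented labelled held-out set.
params = fit_gaussian_2d([(100, 70), (120, 90), (110, 75), (90, 60), (130, 100), (105, 85)])
held_out = [(110, 80, True), (95, 65, True), (1, 1, False), (468, 60, False)]
scored = [(density(params, w, h), y) for w, h, y in held_out]
threshold = choose_threshold(scored)
```

An image is then classified as a product picture when `density(params, w, h) >= threshold`. Tracking pixels and banner-shaped images fall far from the fitted mean and receive near-zero density.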

Extracting Images

Image Classification
• Combined binary classifier
– Multi-layer perceptron (Weka)
– Features: NC(x,y) , simC(I) , repeats(I)

• Performance of binary classifiers
– 10-fold cross-validation, document-level folds
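
One plausible reading of "document-level folds" is that all images from the same document are placed in the same fold, so that a classifier is never tested on images from a page it was trained on. A minimal sketch of such a split follows. The round-robin assignment and the function name are assumptions.

```python
def document_level_folds(doc_ids, k=10):
    """Assign each image to a fold by its document, so no document spans two folds."""
    unique_docs = sorted(set(doc_ids))
    fold_of_doc = {d: i % k for i, d in enumerate(unique_docs)}
    return [fold_of_doc[d] for d in doc_ids]

# One entry per image: the document it came from (invented identifiers).
doc_ids = ["a", "a", "b", "c", "c", "c"]
folds = document_level_folds(doc_ids, k=2)
print(folds)  # -> [0, 0, 1, 0, 0, 0]
```

Splitting by image instead would leak information, because images on one catalogue page tend to share size and style.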


Extracting Images

Annotation Results
• Combined ternary classifier
– outputs Pos, Unk or Neg
– decision list based on predictions of all 3 single-feature ternary classifiers
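
A decision list checks an ordered sequence of rules and returns the outcome of the first one that fires. The slides do not give the actual rules, so the ones below are purely hypothetical and only show the shape of such a combiner over the three single-feature predictions.

```python
def combine(size, similarity, repeats):
    """Combine three ternary predictions ('Pos', 'Unk', 'Neg') with a decision list."""
    votes = (size, similarity, repeats)
    if size == "Neg":            # rule 1 (hypothetical): implausible size vetoes
        return "Neg"
    if votes.count("Pos") >= 2:  # rule 2 (hypothetical): two features agree on Pos
        return "Pos"
    if votes.count("Neg") >= 2:  # rule 3 (hypothetical): two features agree on Neg
        return "Neg"
    return "Unk"                 # default when no rule fires

print(combine("Pos", "Pos", "Unk"))
print(combine("Unk", "Unk", "Pos"))
```

Keeping an explicit Unk outcome leaves uncertain images for the HMM to resolve from their context.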


Instance Composition

Instance Composition
[Figure: Document annotated by HMM and Presentation ontology → Instance extraction algorithm → Instances (XML) → Sesame RDF repository]


Instance Composition
Domain ontology

Presentation Ontology


Instance Composition

Instance extraction algorithm
• Sequentially parses annotated document
• Adds annotated attributes to working instance WI
• If adding an attribute would cause an inconsistency, an empty working instance is created. The old working instance is saved only if it is consistent.
http://eso.vse.cz/~labsky/cgi-bin/client/
WI = empty_instance;
while (more_attributes) {
    A = next_attribute;
    if (cannot_add (WI, A)) {
        if (consistent (WI)) {
            store (WI);
        }
        WI = empty_instance;
    }
    add (WI, A);
}
if (consistent (WI)) {
    store (WI);    // save the last working instance as well
}
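
The same loop can be written as a small runnable sketch. Here the presentation ontology is reduced to two hypothetical constraint tables: `max_card` (how many values of an attribute one instance may hold) and `required` (attributes an instance needs in order to be consistent). The real ontology is richer than this. The sketch also stores the final working instance once the input is exhausted, which the loop needs in order not to lose the last offer.

```python
def extract_instances(attributes, max_card, required):
    """Group a sequence of (name, value) annotations into consistent instances."""
    instances, wi = [], {}

    def cannot_add(inst, name):
        # Adding would exceed the allowed number of values for this attribute.
        return len(inst.get(name, [])) >= max_card.get(name, 1)

    def consistent(inst):
        # An instance is kept only if all required attributes are present.
        return all(inst.get(r) for r in required)

    for name, value in attributes:
        if cannot_add(wi, name):
            if consistent(wi):
                instances.append(wi)
            wi = {}
        wi.setdefault(name, []).append(value)
    if consistent(wi):
        instances.append(wi)
    return instances

# Invented annotation sequence for three offers; the second has no price.
annotated = [
    ("name", "TREK Session 77"), ("price", "3000.00"), ("picture", "a.jpg"),
    ("name", "Bike B"), ("picture", "b.jpg"),
    ("name", "Bike C"), ("price", "500"),
]
offers = extract_instances(annotated,
                           max_card={"name": 1, "price": 1, "picture": 1},
                           required=["name", "price"])
print(len(offers))
```

The offer without a price is dropped because it fails the consistency check, while the first and third offers are stored.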

Bicycle search application, powered by Sesame RDF DB

http://rainbow.vse.cz:8000/sesame/


Future work
• Learn to correct annotation errors
– use document structure to detect unlabeled attributes
– bootstrap from these new examples
– use ontology constraints on values (types, lists, regexps)

• Population algorithm
– utilize scores for each annotated attribute
– augment presentation ontology with frequencies of attribute orderings
– use approximate name matching to identify instances

• Improve search interface
– approximate name matching (word and char edit distance)
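
Word-level and character-level edit distance can share one implementation, since Levenshtein distance is defined over any sequence. The sketch below is a standard dynamic-programming version. Lowercasing product names before comparison is an assumption, not something stated in the slides.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]

# Character-level distance between two strings
char_d = edit_distance("kitten", "sitting")
# Word-level distance between two product names (lowercased first)
word_d = edit_distance("TREK Session 77".lower().split(),
                       "Trek Session 77 2005".lower().split())
print(char_d, word_d)  # -> 3 1
```

Word-level distance tolerates a missing model year or extra qualifier, while character-level distance catches small spelling variations within a word.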


Thank you!

rainbow.vse.cz



Slide 2

Multimedia Information extraction
from HTML product catalogues
Martin Labský1, Vojtěch Svátek1, Pavel Praks2, Ondřej Šváb1
{labsky, svatek, xsvao06}@vse.cz, [email protected]

rainbow.vse.cz

1 Dept.

of Information and Knowledge Engineering,
Prague University of Economics
2 Dept. of Applied Mathematics, Technical University of Ostrava
DATESO, April 14th 2005

Agenda






Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by ontology
Bicycle search application

DATESO, April 14th 2005

2

IE from Internet

IE from Internet
• Motivation

searching for objects of type Bicycle
in price range €500 - €900
find structures (name, price, equipment)

– Semantic and structured search over large
document collections

• Requirements
– Identify relevant documents
– Perform automatic IE
• documents are semi-structured, have
heterogeneous layouts and formattings
DATESO, April 14th 2005

3

IE from Internet

Our approach to IE
Acquire new
document
w1 w2 ... wn

Annotation
using HMMs
Bicycle offer
name w3w4
price w6w7
picture w9

HTML
Preprocessing
name
price picture
w1 w2 w
w77 w8 w99 ... wn
w33 w
w44 w5 w6 w

Instance
extraction
DATESO, April 14th 2005

4

IE from Internet

Relevant documents

DATESO, April 14th 2005

5

Agenda
• Information Extraction from Internet
• Annotation using Hidden Markov
Models
• Extracting images
• Instance composition guided by ontology
• Bicycle search application

DATESO, April 14th 2005

6

Annotation using HMMs

Preprocessing
• HTML cleanup
– conversion to valid XHTML

• Only potentially relevant blocks kept
– blocks that do not directly contain text or images omitted

• Formatting tags
– attributes removed
– several rules matching common constructions (add-tobasket form, choose-amount button)

• Images
– baseline: all images treated as a single token
DATESO, April 14th 2005

7

Annotation using HMMs

Preprocessing – example

src=/smsimg/3/tn_m2366_05tksession77.jpg width=100 height=70
alt="TREK Session 77" border=0>
TREK Session 77


(2005)
OUR PRICE £3000.00

action=/products.php?plid=m1b0s1p0 name=buyit> name=cartadditem id=cartadditem value=979> name="selected_size" id="selected_size"> value="15.5">15.5

type="hidden" name="selected_colour" id="selected_colour"
value="default"> type=submit name=submit id=submit value="Add to Basket">


TREK Session 77
( 2005 )

OUR PRICE £ 3000 . 00

- - Select Size - 15 . 5 17 . 5 19
<_CHOOSEAMOUNT/> <_ADDTOBASKET/>
DATESO, April 14th 2005

8

Annotation using HMMs

Document modeling using HMMs
word

class

• Generative model
• Document = [w1c1] [w2c2]
• P([w1c1] [w2c2]) = P(c1)P(c2|c1)P(w1|c1)P(w2|c2)
transition prob.

P(c2|c1)
c1
P(w1|c1)

lexical prob.

c2
P(c1|c2)

P(w1|c2)

estimated from
training data (frequencies)

• c1c2 = argmaxi,j P([w1ci] [w2cj])
DATESO, April 14th 2005

9

Annotation using HMMs

HMM Structure
• States
– adopted from [Freitag, McCallum 99]
– Target, Prefix, Suffix and Background
– densely connected

• Class trigram model
– P(name | name_prefix, name)

• Variations
– word-ngram models for lexical probabilities of
target states P(w1 | wi-1, name)
– state substructures instead of single target states,
learned by EM
DATESO, April 14th 2005

10

Agenda






Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by ontology
Bicycle search application

DATESO, April 14th 2005

11

Extracting Images

Extracting Images
• Baseline
– every image represented by the same
token
– HMM only extracts product images based on
context, e.g.
P(product_picture | name, product_picture_prefix)

• Use image classifier to preprocess images
– classifies into 3 classes – Pos, Neg, Unk
– before HMM annotation, each image occurrence
in document is substituted by its class
DATESO, April 14th 2005

12

Extracting Images

Image Classification – Features
• Image size
– estimated 2-dimensional normal distribution from a set
of 1000 unique bicycle images  NC(x, y)
– estimated decision threshold (1-feature binary classifier)
using held-out set of 150 images (60% positive)

• Image similarity
– latent semantic similarity [Praks 2004]  sim(I1,I2)


– estimated decision threshold for 1-feature bin classifier

• Does the image repeat in document?
DATESO, April 14th 2005

13

Extracting Images

Image Classification
• Combined binary classifier
– Multi-layer perceptron (Weka)
– Features: NC(x,y) , simC(I) , repeats(I)

• Performance of binary classifiers
– 10-fold cross-validation, document-level folds

DATESO, April 14th 2005

14

Extracting Images

Annotation Results
• Combined ternary classifier
– outputs Pos Unk Neg
– decision list based on predictions of all 3 single
feature ternary classifiers

DATESO, April 14th 2005

15

Agenda





Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by
ontology
• Bicycle search application

DATESO, April 14th 2005

16

Instance Composition

Instance Composition
Document
annotated
by HMM

Instance
extraction
algorithm

Presentation
ontology

Instances
(xml)

Sesame
RDF
repository

DATESO, April 14th 2005

17

Instance Composition
Domain ontology

Presentation Ontology

DATESO, April 14th 2005

18

Instance Composition

Instance extraction algorithm
• Sequentially parses annotated document
• Adds annotated attributes to working instance WI
• If adding an attribute would cause an inconsitency, an
empty working_instance is created. The old
working_instance is saved only if it is consistent.
http://eso.vse.cz/~labsky/cgi-bin/client/
1. WI = empty_instance;
2. while (more_attributes) {
3.
A = next_attribute;
4.
if (cannot_add (WI, A)) {
5.
if (consistent (WI)) {
6.
store (WI);
7.
}
8.
WI = empty_instance;
9.
}
10.
add (WI, A);
11. }
DATESO, April 14th 2005

19

Agenda






Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by ontology
Bicycle search application

DATESO, April 14th 2005

20

Bicycle search application, powered by Sesame RDF DB

http://rainbow.vse.cz:8000/sesame/

DATESO, April 14th 2005

21

Future work
• Learn to correct annotation errors
– use document structure to detect unlabeled attributes
– bootstrap from these new examples
– use ontology constraints on values (types, lists, regexps)

• Population algorithm
– utilize scores for each annotated attribute
– augment presentation ontology with frequencies of attribute
orderings
– use approximate name matching to identify instances

• Improve search interface
– approximate name matching (word and char edit distance)

DATESO, April 14th 2005

22

Thank you!

rainbow.vse.cz

DATESO, April 14th 2005

23


Slide 3

Multimedia Information extraction
from HTML product catalogues
Martin Labský1, Vojtěch Svátek1, Pavel Praks2, Ondřej Šváb1
{labsky, svatek, xsvao06}@vse.cz, [email protected]

rainbow.vse.cz

1 Dept.

of Information and Knowledge Engineering,
Prague University of Economics
2 Dept. of Applied Mathematics, Technical University of Ostrava
DATESO, April 14th 2005

Agenda






Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by ontology
Bicycle search application

DATESO, April 14th 2005

2

IE from Internet

IE from Internet
• Motivation

searching for objects of type Bicycle
in price range €500 - €900
find structures (name, price, equipment)

– Semantic and structured search over large
document collections

• Requirements
– Identify relevant documents
– Perform automatic IE
• documents are semi-structured, have
heterogeneous layouts and formattings
DATESO, April 14th 2005

3

IE from Internet

Our approach to IE
Acquire new
document
w1 w2 ... wn

Annotation
using HMMs
Bicycle offer
name w3w4
price w6w7
picture w9

HTML
Preprocessing
name
price picture
w1 w2 w
w77 w8 w99 ... wn
w33 w
w44 w5 w6 w

Instance
extraction
DATESO, April 14th 2005

4

IE from Internet

Relevant documents

DATESO, April 14th 2005

5

Agenda
• Information Extraction from Internet
• Annotation using Hidden Markov
Models
• Extracting images
• Instance composition guided by ontology
• Bicycle search application

DATESO, April 14th 2005

6

Annotation using HMMs

Preprocessing
• HTML cleanup
– conversion to valid XHTML

• Only potentially relevant blocks kept
– blocks that do not directly contain text or images omitted

• Formatting tags
– attributes removed
– several rules matching common constructions (add-tobasket form, choose-amount button)

• Images
– baseline: all images treated as a single token
DATESO, April 14th 2005

7

Annotation using HMMs

Preprocessing – example

src=/smsimg/3/tn_m2366_05tksession77.jpg width=100 height=70
alt="TREK Session 77" border=0>
TREK Session 77


(2005)
OUR PRICE £3000.00

action=/products.php?plid=m1b0s1p0 name=buyit> name=cartadditem id=cartadditem value=979> name="selected_size" id="selected_size"> value="15.5">15.5

type="hidden" name="selected_colour" id="selected_colour"
value="default"> type=submit name=submit id=submit value="Add to Basket">


TREK Session 77
( 2005 )

OUR PRICE £ 3000 . 00

- - Select Size - 15 . 5 17 . 5 19
<_CHOOSEAMOUNT/> <_ADDTOBASKET/>
DATESO, April 14th 2005

8

Annotation using HMMs

Document modeling using HMMs
word

class

• Generative model
• Document = [w1c1] [w2c2]
• P([w1c1] [w2c2]) = P(c1)P(c2|c1)P(w1|c1)P(w2|c2)
transition prob.

P(c2|c1)
c1
P(w1|c1)

lexical prob.

c2
P(c1|c2)

P(w1|c2)

estimated from
training data (frequencies)

• c1c2 = argmaxi,j P([w1ci] [w2cj])
DATESO, April 14th 2005

9

Annotation using HMMs

HMM Structure
• States
– adopted from [Freitag, McCallum 99]
– Target, Prefix, Suffix and Background
– densely connected

• Class trigram model
– P(name | name_prefix, name)

• Variations
– word-ngram models for lexical probabilities of
target states P(w1 | wi-1, name)
– state substructures instead of single target states,
learned by EM
DATESO, April 14th 2005

10

Agenda






Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by ontology
Bicycle search application

DATESO, April 14th 2005

11

Extracting Images

Extracting Images
• Baseline
– every image represented by the same
token
– HMM only extracts product images based on
context, e.g.
P(product_picture | name, product_picture_prefix)

• Use image classifier to preprocess images
– classifies into 3 classes – Pos, Neg, Unk
– before HMM annotation, each image occurrence
in document is substituted by its class
DATESO, April 14th 2005

12

Extracting Images

Image Classification – Features
• Image size
– estimated 2-dimensional normal distribution from a set
of 1000 unique bicycle images  NC(x, y)
– estimated decision threshold (1-feature binary classifier)
using held-out set of 150 images (60% positive)

• Image similarity
– latent semantic similarity [Praks 2004]  sim(I1,I2)


– estimated decision threshold for 1-feature bin classifier

• Does the image repeat in document?
DATESO, April 14th 2005

13

Extracting Images

Image Classification
• Combined binary classifier
– Multi-layer perceptron (Weka)
– Features: NC(x,y) , simC(I) , repeats(I)

• Performance of binary classifiers
– 10-fold cross-validation, document-level folds

DATESO, April 14th 2005

14

Extracting Images

Annotation Results
• Combined ternary classifier
– outputs Pos Unk Neg
– decision list based on predictions of all 3 single
feature ternary classifiers

DATESO, April 14th 2005

15

Agenda





Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by
ontology
• Bicycle search application

DATESO, April 14th 2005

16

Instance Composition

Instance Composition
Document
annotated
by HMM

Instance
extraction
algorithm

Presentation
ontology

Instances
(xml)

Sesame
RDF
repository

DATESO, April 14th 2005

17

Instance Composition
Domain ontology

Presentation Ontology

DATESO, April 14th 2005

18

Instance Composition

Instance extraction algorithm
• Sequentially parses annotated document
• Adds annotated attributes to working instance WI
• If adding an attribute would cause an inconsitency, an
empty working_instance is created. The old
working_instance is saved only if it is consistent.
http://eso.vse.cz/~labsky/cgi-bin/client/
1. WI = empty_instance;
2. while (more_attributes) {
3.
A = next_attribute;
4.
if (cannot_add (WI, A)) {
5.
if (consistent (WI)) {
6.
store (WI);
7.
}
8.
WI = empty_instance;
9.
}
10.
add (WI, A);
11. }
DATESO, April 14th 2005

19

Agenda






Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by ontology
Bicycle search application

DATESO, April 14th 2005

20

Bicycle search application, powered by Sesame RDF DB

http://rainbow.vse.cz:8000/sesame/

DATESO, April 14th 2005

21

Future work
• Learn to correct annotation errors
– use document structure to detect unlabeled attributes
– bootstrap from these new examples
– use ontology constraints on values (types, lists, regexps)

• Population algorithm
– utilize scores for each annotated attribute
– augment presentation ontology with frequencies of attribute
orderings
– use approximate name matching to identify instances

• Improve search interface
– approximate name matching (word and char edit distance)

DATESO, April 14th 2005

22

Thank you!

rainbow.vse.cz

DATESO, April 14th 2005

23


Slide 4

Multimedia Information extraction
from HTML product catalogues
Martin Labský1, Vojtěch Svátek1, Pavel Praks2, Ondřej Šváb1
{labsky, svatek, xsvao06}@vse.cz, [email protected]

rainbow.vse.cz

1 Dept.

of Information and Knowledge Engineering,
Prague University of Economics
2 Dept. of Applied Mathematics, Technical University of Ostrava
DATESO, April 14th 2005

Agenda






Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by ontology
Bicycle search application

DATESO, April 14th 2005

2

IE from Internet

IE from Internet
• Motivation

searching for objects of type Bicycle
in price range €500 - €900
find structures (name, price, equipment)

– Semantic and structured search over large
document collections

• Requirements
– Identify relevant documents
– Perform automatic IE
• documents are semi-structured, have
heterogeneous layouts and formattings
DATESO, April 14th 2005

3

IE from Internet

Our approach to IE
Acquire new
document
w1 w2 ... wn

Annotation
using HMMs
Bicycle offer
name w3w4
price w6w7
picture w9

HTML
Preprocessing
name
price picture
w1 w2 w
w77 w8 w99 ... wn
w33 w
w44 w5 w6 w

Instance
extraction
DATESO, April 14th 2005

4

IE from Internet

Relevant documents

DATESO, April 14th 2005

5

Agenda
• Information Extraction from Internet
• Annotation using Hidden Markov
Models
• Extracting images
• Instance composition guided by ontology
• Bicycle search application

DATESO, April 14th 2005

6

Annotation using HMMs

Preprocessing
• HTML cleanup
– conversion to valid XHTML

• Only potentially relevant blocks kept
– blocks that do not directly contain text or images omitted

• Formatting tags
– attributes removed
– several rules matching common constructions (add-tobasket form, choose-amount button)

• Images
– baseline: all images treated as a single token
DATESO, April 14th 2005

7

Annotation using HMMs

Preprocessing – example

src=/smsimg/3/tn_m2366_05tksession77.jpg width=100 height=70
alt="TREK Session 77" border=0>
TREK Session 77


(2005)
OUR PRICE £3000.00

action=/products.php?plid=m1b0s1p0 name=buyit> name=cartadditem id=cartadditem value=979> name="selected_size" id="selected_size"> value="15.5">15.5

type="hidden" name="selected_colour" id="selected_colour"
value="default"> type=submit name=submit id=submit value="Add to Basket">


TREK Session 77
( 2005 )

OUR PRICE £ 3000 . 00

- - Select Size - 15 . 5 17 . 5 19
<_CHOOSEAMOUNT/> <_ADDTOBASKET/>
DATESO, April 14th 2005

8

Annotation using HMMs

Document modeling using HMMs
word

class

• Generative model
• Document = [w1c1] [w2c2]
• P([w1c1] [w2c2]) = P(c1)P(c2|c1)P(w1|c1)P(w2|c2)
transition prob.

P(c2|c1)
c1
P(w1|c1)

lexical prob.

c2
P(c1|c2)

P(w1|c2)

estimated from
training data (frequencies)

• c1c2 = argmaxi,j P([w1ci] [w2cj])
DATESO, April 14th 2005

9

Annotation using HMMs

HMM Structure
• States
– adopted from [Freitag, McCallum 99]
– Target, Prefix, Suffix and Background
– densely connected

• Class trigram model
– P(name | name_prefix, name)

• Variations
– word-ngram models for lexical probabilities of
target states P(w1 | wi-1, name)
– state substructures instead of single target states,
learned by EM
DATESO, April 14th 2005

10

Agenda






Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by ontology
Bicycle search application

DATESO, April 14th 2005

11

Extracting Images

Extracting Images
• Baseline
– every image represented by the same
token
– HMM only extracts product images based on
context, e.g.
P(product_picture | name, product_picture_prefix)

• Use image classifier to preprocess images
– classifies into 3 classes – Pos, Neg, Unk
– before HMM annotation, each image occurrence
in document is substituted by its class
DATESO, April 14th 2005

12

Extracting Images

Image Classification – Features
• Image size
– estimated 2-dimensional normal distribution from a set
of 1000 unique bicycle images  NC(x, y)
– estimated decision threshold (1-feature binary classifier)
using held-out set of 150 images (60% positive)

• Image similarity
– latent semantic similarity [Praks 2004]  sim(I1,I2)


– estimated decision threshold for 1-feature bin classifier

• Does the image repeat in document?
DATESO, April 14th 2005

13

Extracting Images

Image Classification
• Combined binary classifier
– Multi-layer perceptron (Weka)
– Features: NC(x,y) , simC(I) , repeats(I)

• Performance of binary classifiers
– 10-fold cross-validation, document-level folds

DATESO, April 14th 2005

14

Extracting Images

Annotation Results
• Combined ternary classifier
– outputs Pos Unk Neg
– decision list based on predictions of all 3 single
feature ternary classifiers

DATESO, April 14th 2005

15

Agenda





Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by
ontology
• Bicycle search application

DATESO, April 14th 2005

16

Instance Composition

Instance Composition
Document
annotated
by HMM

Instance
extraction
algorithm

Presentation
ontology

Instances
(xml)

Sesame
RDF
repository

DATESO, April 14th 2005

17

Instance Composition
Domain ontology

Presentation Ontology

DATESO, April 14th 2005

18

Instance Composition

Instance extraction algorithm
• Sequentially parses annotated document
• Adds annotated attributes to working instance WI
• If adding an attribute would cause an inconsitency, an
empty working_instance is created. The old
working_instance is saved only if it is consistent.
http://eso.vse.cz/~labsky/cgi-bin/client/
1. WI = empty_instance;
2. while (more_attributes) {
3.
A = next_attribute;
4.
if (cannot_add (WI, A)) {
5.
if (consistent (WI)) {
6.
store (WI);
7.
}
8.
WI = empty_instance;
9.
}
10.
add (WI, A);
11. }
DATESO, April 14th 2005

19

Agenda






Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by ontology
Bicycle search application

DATESO, April 14th 2005

20

Bicycle search application, powered by Sesame RDF DB

http://rainbow.vse.cz:8000/sesame/

DATESO, April 14th 2005

21

Future work
• Learn to correct annotation errors
– use document structure to detect unlabeled attributes
– bootstrap from these new examples
– use ontology constraints on values (types, lists, regexps)

• Population algorithm
– utilize scores for each annotated attribute
– augment presentation ontology with frequencies of attribute
orderings
– use approximate name matching to identify instances

• Improve search interface
– approximate name matching (word and char edit distance)

DATESO, April 14th 2005

22

Thank you!

rainbow.vse.cz

DATESO, April 14th 2005

23


Slide 5

Multimedia Information extraction
from HTML product catalogues
Martin Labský1, Vojtěch Svátek1, Pavel Praks2, Ondřej Šváb1
{labsky, svatek, xsvao06}@vse.cz, [email protected]

rainbow.vse.cz

1 Dept.

of Information and Knowledge Engineering,
Prague University of Economics
2 Dept. of Applied Mathematics, Technical University of Ostrava
DATESO, April 14th 2005

Agenda






Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by ontology
Bicycle search application

DATESO, April 14th 2005

2

IE from Internet

IE from Internet
• Motivation

searching for objects of type Bicycle
in price range €500 - €900
find structures (name, price, equipment)

– Semantic and structured search over large
document collections

• Requirements
– Identify relevant documents
– Perform automatic IE
• documents are semi-structured, have
heterogeneous layouts and formattings
DATESO, April 14th 2005

3

IE from Internet

Our approach to IE
Acquire new
document
w1 w2 ... wn

Annotation
using HMMs
Bicycle offer
name w3w4
price w6w7
picture w9

HTML
Preprocessing
name
price picture
w1 w2 w
w77 w8 w99 ... wn
w33 w
w44 w5 w6 w

Instance
extraction
DATESO, April 14th 2005

4

IE from Internet

Relevant documents

DATESO, April 14th 2005

5

Agenda
• Information Extraction from Internet
• Annotation using Hidden Markov
Models
• Extracting images
• Instance composition guided by ontology
• Bicycle search application

DATESO, April 14th 2005

6

Annotation using HMMs

Preprocessing
• HTML cleanup
– conversion to valid XHTML

• Only potentially relevant blocks kept
– blocks that do not directly contain text or images omitted

• Formatting tags
– attributes removed
– several rules matching common constructions (add-tobasket form, choose-amount button)

• Images
– baseline: all images treated as a single token
DATESO, April 14th 2005

7

Annotation using HMMs

Preprocessing – example

src=/smsimg/3/tn_m2366_05tksession77.jpg width=100 height=70
alt="TREK Session 77" border=0>
TREK Session 77


(2005)
OUR PRICE £3000.00

action=/products.php?plid=m1b0s1p0 name=buyit> name=cartadditem id=cartadditem value=979> name="selected_size" id="selected_size"> value="15.5">15.5

type="hidden" name="selected_colour" id="selected_colour"
value="default"> type=submit name=submit id=submit value="Add to Basket">


TREK Session 77
( 2005 )

OUR PRICE £ 3000 . 00

- - Select Size - 15 . 5 17 . 5 19
<_CHOOSEAMOUNT/> <_ADDTOBASKET/>
DATESO, April 14th 2005

8

Annotation using HMMs

Document modeling using HMMs
word

class

• Generative model
• Document = [w1c1] [w2c2]
• P([w1c1] [w2c2]) = P(c1)P(c2|c1)P(w1|c1)P(w2|c2)
transition prob.

P(c2|c1)
c1
P(w1|c1)

lexical prob.

c2
P(c1|c2)

P(w1|c2)

estimated from
training data (frequencies)

• c1c2 = argmaxi,j P([w1ci] [w2cj])
DATESO, April 14th 2005

9

Annotation using HMMs

HMM Structure
• States
– adopted from [Freitag, McCallum 99]
– Target, Prefix, Suffix and Background
– densely connected

• Class trigram model
– P(name | name_prefix, name)

• Variations
– word-ngram models for lexical probabilities of
target states P(w1 | wi-1, name)
– state substructures instead of single target states,
learned by EM
DATESO, April 14th 2005

10

Agenda






Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by ontology
Bicycle search application

DATESO, April 14th 2005

11

Extracting Images

Extracting Images
• Baseline
– every image represented by the same
token
– HMM only extracts product images based on
context, e.g.
P(product_picture | name, product_picture_prefix)

• Use image classifier to preprocess images
– classifies into 3 classes – Pos, Neg, Unk
– before HMM annotation, each image occurrence
in document is substituted by its class
DATESO, April 14th 2005

12

Extracting Images

Image Classification – Features
• Image size
– estimated 2-dimensional normal distribution from a set
of 1000 unique bicycle images  NC(x, y)
– estimated decision threshold (1-feature binary classifier)
using held-out set of 150 images (60% positive)

• Image similarity
– latent semantic similarity [Praks 2004]  sim(I1,I2)


– estimated decision threshold for 1-feature bin classifier

• Does the image repeat in document?
DATESO, April 14th 2005

13

Extracting Images

Image Classification
• Combined binary classifier
– Multi-layer perceptron (Weka)
– Features: NC(x,y) , simC(I) , repeats(I)

• Performance of binary classifiers
– 10-fold cross-validation, document-level folds


Extracting Images

Annotation Results
• Combined ternary classifier
– outputs Pos Unk Neg
– decision list based on predictions of all 3
single-feature ternary classifiers
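
A decision list is an ordered list of rules in which the first matching rule determines the output. The slides do not give the actual rules, so the ones below are invented purely to show the mechanism. Only the three inputs (the size, similarity and repetition classifiers) and the three outputs (Pos, Unk, Neg) come from the slides.

```python
# Ordered (condition, label) rules; the first condition that holds wins.
# These particular rules are hypothetical.
RULES = [
    (lambda votes: votes.count("Neg") >= 2, "Neg"),
    (lambda votes: votes.count("Pos") >= 2 and "Neg" not in votes, "Pos"),
    (lambda votes: True, "Unk"),   # default rule
]

def combine(size_pred, sim_pred, repeat_pred):
    """Combine three single-feature ternary predictions into one class."""
    votes = [size_pred, sim_pred, repeat_pred]
    for condition, label in RULES:
        if condition(votes):
            return label

result = combine("Pos", "Pos", "Unk")
```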

Instance Composition

Instance Composition
[Diagram: Document annotated by HMM → Instance extraction algorithm (uses Presentation ontology) → Instances (XML) → Sesame RDF repository]

Instance Composition
Domain ontology

Presentation Ontology

Instance Composition

Instance extraction algorithm
• Sequentially parses annotated document
• Adds annotated attributes to working instance WI
• If adding an attribute would cause an inconsistency, a new
empty working instance is created. The old
working instance is saved only if it is consistent.
http://eso.vse.cz/~labsky/cgi-bin/client/
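WI = empty_instance;
while (more_attributes) {
    A = next_attribute;
    if (cannot_add(WI, A)) {
        // A does not fit: keep the old instance only if it is consistent
        if (consistent(WI)) {
            store(WI);
        }
        WI = empty_instance;
    }
    add(WI, A);
}
// assumed final step: keep the last instance if it is consistent
if (consistent(WI)) {
    store(WI);
}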
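
The algorithm can be made runnable with a small sketch. It assumes the presentation ontology is reduced to two hypothetical inputs: a maximum cardinality per attribute (which drives `cannot_add`) and a set of required attributes (which drives `consistent`). Storing the last working instance after the loop is also an assumption, since the slide's pseudocode stops at the end of the loop.

```python
def extract_instances(attributes, max_card, required):
    """Group a sequence of (name, value) annotations into instances."""
    stored = []
    wi = {}                                   # working instance

    def cannot_add(inst, name):
        # adding would exceed the allowed number of values for this attribute
        return len(inst.get(name, [])) >= max_card.get(name, 1)

    def consistent(inst):
        # every required attribute is present
        return all(r in inst for r in required)

    for name, value in attributes:
        if cannot_add(wi, name):
            if consistent(wi):
                stored.append(wi)
            wi = {}
        wi.setdefault(name, []).append(value)

    if consistent(wi):                        # assumed final flush
        stored.append(wi)
    return stored

# Hypothetical annotated attributes in document order
annotated = [
    ("name", "TREK Session 77"), ("price", "3000"), ("picture", "a.jpg"),
    ("name", "GIANT"), ("price", "900"),
    ("price", "850"),                         # stray price without a name
]
instances = extract_instances(
    annotated,
    max_card={"name": 1, "price": 1, "picture": 1},
    required={"name", "price"},
)
```

In this example the stray final price starts a new working instance that never receives a name, so it is discarded as inconsistent.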

Bicycle search application, powered by Sesame RDF DB

http://rainbow.vse.cz:8000/sesame/

Future work
• Learn to correct annotation errors
– use document structure to detect unlabeled attributes
– bootstrap from these new examples
– use ontology constraints on values (types, lists, regexps)

• Population algorithm
– utilize scores for each annotated attribute
– augment presentation ontology with frequencies of attribute
orderings
– use approximate name matching to identify instances

• Improve search interface
– approximate name matching (word and char edit distance)
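
Approximate name matching by edit distance could look like the sketch below. It uses the standard Levenshtein distance, written generically so that the same function works at character level (on strings) and at word level (on lists of words). The product names are examples only; the slides list this as future work and do not specify an algorithm.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))            # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                  # delete x
                cur[j - 1] + 1,               # insert y
                prev[j - 1] + (x != y),       # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]

char_dist = edit_distance("TREK Session 77", "TREK Sesion 77")
word_dist = edit_distance("TREK Session 77".split(), "Session 77".split())
```

The character-level distance tolerates typos, while the word-level distance tolerates missing or extra words such as a brand name.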

Thank you!

rainbow.vse.cz

Slide 10

Multimedia Information extraction
from HTML product catalogues
Martin Labský1, Vojtěch Svátek1, Pavel Praks2, Ondřej Šváb1
{labsky, svatek, xsvao06}@vse.cz, [email protected]

rainbow.vse.cz

1 Dept.

of Information and Knowledge Engineering,
Prague University of Economics
2 Dept. of Applied Mathematics, Technical University of Ostrava
DATESO, April 14th 2005

Agenda






Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by ontology
Bicycle search application

DATESO, April 14th 2005

2

IE from Internet

IE from Internet
• Motivation

searching for objects of type Bicycle
in price range €500 - €900
find structures (name, price, equipment)

– Semantic and structured search over large
document collections

• Requirements
– Identify relevant documents
– Perform automatic IE
• documents are semi-structured, have
heterogeneous layouts and formattings
DATESO, April 14th 2005

3

IE from Internet

Our approach to IE
Acquire new
document
w1 w2 ... wn

Annotation
using HMMs
Bicycle offer
name w3w4
price w6w7
picture w9

HTML
Preprocessing
name
price picture
w1 w2 w
w77 w8 w99 ... wn
w33 w
w44 w5 w6 w

Instance
extraction
DATESO, April 14th 2005

4

IE from Internet

Relevant documents

DATESO, April 14th 2005

5

Agenda
• Information Extraction from Internet
• Annotation using Hidden Markov
Models
• Extracting images
• Instance composition guided by ontology
• Bicycle search application

DATESO, April 14th 2005

6

Annotation using HMMs

Preprocessing
• HTML cleanup
– conversion to valid XHTML

• Only potentially relevant blocks kept
– blocks that do not directly contain text or images omitted

• Formatting tags
– attributes removed
– several rules matching common constructions (add-tobasket form, choose-amount button)

• Images
– baseline: all images treated as a single token
DATESO, April 14th 2005

7

Annotation using HMMs

Preprocessing – example

src=/smsimg/3/tn_m2366_05tksession77.jpg width=100 height=70
alt="TREK Session 77" border=0>
TREK Session 77


(2005)
OUR PRICE £3000.00

action=/products.php?plid=m1b0s1p0 name=buyit> name=cartadditem id=cartadditem value=979> name="selected_size" id="selected_size"> value="15.5">15.5

type="hidden" name="selected_colour" id="selected_colour"
value="default"> type=submit name=submit id=submit value="Add to Basket">


TREK Session 77
( 2005 )

OUR PRICE £ 3000 . 00

- - Select Size - 15 . 5 17 . 5 19
<_CHOOSEAMOUNT/> <_ADDTOBASKET/>
DATESO, April 14th 2005

8

Annotation using HMMs

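Document modeling using HMMs
• Generative model
• Document = [w_1 c_1] [w_2 c_2], a sequence of (word, class) pairs
• $P([w_1 c_1]\,[w_2 c_2]) = P(c_1)\,P(c_2 \mid c_1)\,P(w_1 \mid c_1)\,P(w_2 \mid c_2)$
– transition prob. $P(c_2 \mid c_1)$ and lexical prob. $P(w_1 \mid c_1)$, estimated from training data (frequencies)
• $c_1 c_2 = \arg\max_{i,j} P([w_1 c_i]\,[w_2 c_j])$
[Figure: two-state diagram with states c1 and c2, transition arcs P(c2|c1) and P(c1|c2), and lexical probabilities P(w1|c1) and P(w1|c2)]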
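The slide does not say how the argmax over class sequences is computed. The standard choice for a model of this form is the Viterbi algorithm, which avoids enumerating every sequence. The sketch below assumes a plain bigram HMM stored in dictionaries; the function name, the table layout and the `floor` value for unseen events are all assumptions.

```python
import math

def viterbi(words, classes, start_p, trans_p, lex_p, floor=1e-9):
    """Return the most probable class sequence for `words` under a bigram HMM.

    start_p[c] = P(c), trans_p[prev][c] = P(c | prev), lex_p[c][w] = P(w | c).
    Unseen events get the small probability `floor` (an assumption made
    here only to avoid log(0)).
    """
    def lp(table, key):
        return math.log(table.get(key, floor) or floor)

    # best[c] = (log-probability, path) of the best sequence ending in class c
    best = {c: (lp(start_p, c) + lp(lex_p.get(c, {}), words[0]), [c])
            for c in classes}
    for w in words[1:]:
        new_best = {}
        for c in classes:
            # Pick the best predecessor class for c.
            score, path = max(
                (best[p][0] + lp(trans_p.get(p, {}), c), best[p][1])
                for p in classes
            )
            new_best[c] = (score + lp(lex_p.get(c, {}), w), path + [c])
        best = new_best
    return max(best.values())[1]
```

Working in log space keeps long documents from underflowing, which matters because a product page can contain hundreds of tokens.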

Annotation using HMMs

HMM Structure
• States
– adopted from [Freitag, McCallum 99]
– Target, Prefix, Suffix and Background
– densely connected

• Class trigram model
– P(name | name_prefix, name)

• Variations
– word n-gram models for lexical probabilities of target states, P(w_i | w_{i-1}, name)
– state substructures instead of single target states, learned by EM
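A class trigram model conditions each class on the two preceding classes, as in P(name | name_prefix, name). The sketch below estimates such probabilities by relative frequency from labelled class sequences. The function name and the `<s>` start padding are assumptions, and no smoothing is applied.

```python
from collections import Counter

def trigram_class_probs(label_sequences):
    """Estimate P(c_i | c_{i-2}, c_{i-1}) by relative frequency."""
    tri, bi = Counter(), Counter()
    for labels in label_sequences:
        # Pad with two start symbols so the first classes have a history.
        padded = ["<s>", "<s>"] + list(labels)
        for i in range(2, len(padded)):
            history = (padded[i - 2], padded[i - 1])
            tri[history + (padded[i],)] += 1
            bi[history] += 1
    # Keys are (c_{i-2}, c_{i-1}, c_i) tuples.
    return {t: n / bi[t[:2]] for t, n in tri.items()}
```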

Agenda
• Information Extraction from Internet
• Annotation using Hidden Markov Models
• Extracting images
• Instance composition guided by ontology
• Bicycle search application

Extracting Images

Extracting Images
• Baseline
– every image represented by the same token
– HMM only extracts product images based on context, e.g. P(product_picture | name, product_picture_prefix)

• Use image classifier to preprocess images
– classifies into 3 classes: Pos, Neg, Unk
– before HMM annotation, each image occurrence in the document is substituted by its class
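The substitution step can be sketched as a single pass over the token stream. The token representation (plain strings for words, dictionaries with a `type` key for images), the output token names such as `<_IMG_POS/>`, and the `classify_image` callback are hypothetical and chosen only for illustration.

```python
def substitute_image_tokens(tokens, classify_image):
    """Replace each image token by a class-specific token before HMM annotation."""
    out = []
    for tok in tokens:
        if isinstance(tok, dict) and tok.get("type") == "image":
            label = classify_image(tok)  # expected to be "Pos", "Neg" or "Unk"
            out.append("<_IMG_%s/>" % label.upper())
        else:
            out.append(tok)  # ordinary word tokens pass through unchanged
    return out
```

The HMM then sees three distinct image symbols instead of one, so its lexical probabilities can separate likely product pictures from the rest.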

Extracting Images

Image Classification – Features
• Image size
– estimated 2-dimensional normal distribution from a set of 1000 unique bicycle images → NC(x, y)
– estimated decision threshold (1-feature binary classifier) using held-out set of 150 images (60% positive)

• Image similarity
– latent semantic similarity [Praks 2004] → sim(I1, I2)
– estimated decision threshold for 1-feature binary classifier

• Does the image repeat in document?
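The image-size feature can be sketched as follows: fit a 2-dimensional normal distribution to (width, height) pairs of known product images, then accept an image when the density at its size exceeds a threshold. The function names are assumptions, and the threshold would be tuned on held-out data as the slide describes; no value from the original work is reproduced here.

```python
import math

def fit_normal_2d(sizes):
    """Fit mean and covariance of (width, height) pairs by maximum likelihood."""
    n = len(sizes)
    mx = sum(w for w, _ in sizes) / n
    my = sum(h for _, h in sizes) / n
    sxx = sum((w - mx) ** 2 for w, _ in sizes) / n
    syy = sum((h - my) ** 2 for _, h in sizes) / n
    sxy = sum((w - mx) * (h - my) for w, h in sizes) / n
    return (mx, my), (sxx, sxy, syy)

def density(size, mean, cov):
    """Evaluate the fitted 2-D normal density at a (width, height) pair."""
    (mx, my), (sxx, sxy, syy) = mean, cov
    det = sxx * syy - sxy * sxy  # must be > 0 (non-degenerate data)
    dx, dy = size[0] - mx, size[1] - my
    # Quadratic form d^T * inverse(cov) * d for a 2x2 covariance matrix.
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

def is_product_size(size, mean, cov, threshold):
    """One-feature binary classifier: positive if the density reaches the threshold."""
    return density(size, mean, cov) >= threshold
```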

Extracting Images

Image Classification
• Combined binary classifier
– Multi-layer perceptron (Weka)
– Features: NC(x, y), simC(I), repeats(I)

• Performance of binary classifiers
– 10-fold cross-validation, document-level folds

Extracting Images

Annotation Results
• Combined ternary classifier
– outputs Pos, Unk or Neg
– decision list based on predictions of all 3 single-feature ternary classifiers

Agenda
• Information Extraction from Internet
• Annotation using Hidden Markov Models
• Extracting images
• Instance composition guided by ontology
• Bicycle search application

Instance Composition

Instance Composition
[Figure: data flow. Document annotated by HMM → Instance extraction algorithm (guided by the Presentation ontology) → Instances (XML) → Sesame RDF repository]

Instance Composition
Domain ontology

Presentation Ontology

Instance Composition

Instance extraction algorithm
• Sequentially parses the annotated document
• Adds annotated attributes to the working instance WI
• If adding an attribute would cause an inconsistency, a new empty working instance is created. The old working instance is saved only if it is consistent.
http://eso.vse.cz/~labsky/cgi-bin/client/

WI = empty_instance;
while (more_attributes) {
    A = next_attribute;
    if (cannot_add (WI, A)) {      // adding A would make WI inconsistent
        if (consistent (WI)) {
            store (WI);            // keep only consistent instances
        }
        WI = empty_instance;       // start a new working instance
    }
    add (WI, A);
}
if (consistent (WI)) {             // flush the last working instance
    store (WI);
}
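The algorithm above can be sketched as runnable code once the two checks are made concrete. In this sketch, `cannot_add` is modelled as a per-attribute cardinality limit and `consistent` as a list of required attributes. Both are assumed stand-ins for constraints that would come from the presentation ontology, and the function and parameter names are hypothetical.

```python
def extract_instances(attributes, max_count, required):
    """Group a sequence of (name, value) attributes into consistent instances.

    max_count[name] caps how often an attribute may occur in one instance
    (default 1); `required` lists attribute names every stored instance
    must contain. Both are illustrative stand-ins for ontology constraints.
    """
    def cannot_add(instance, name):
        return len(instance.get(name, [])) >= max_count.get(name, 1)

    def consistent(instance):
        return bool(instance) and all(name in instance for name in required)

    stored, working = [], {}
    for name, value in attributes:
        if cannot_add(working, name):
            if consistent(working):
                stored.append(working)
            working = {}  # start a new working instance
        working.setdefault(name, []).append(value)
    if consistent(working):  # flush the last working instance
        stored.append(working)
    return stored
```

An instance that lacks a required attribute when the next one begins is dropped, which mirrors the rule that only consistent working instances are saved.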

Agenda
• Information Extraction from Internet
• Annotation using Hidden Markov Models
• Extracting images
• Instance composition guided by ontology
• Bicycle search application

Bicycle search application, powered by Sesame RDF DB

http://rainbow.vse.cz:8000/sesame/

Future work
• Learn to correct annotation errors
– use document structure to detect unlabeled attributes
– bootstrap from these new examples
– use ontology constraints on values (types, lists, regexps)

• Population algorithm
– utilize scores for each annotated attribute
– augment presentation ontology with frequencies of attribute orderings
– use approximate name matching to identify instances

• Improve search interface
– approximate name matching (word and char edit distance)
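
Approximate name matching by edit distance can be sketched with the classic Levenshtein recurrence. The same function works at character level (on strings) and at word level (on lists of words). This is a generic illustration of the technique named in the bullet, not code from the described system.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]
```

Calling it on `name.split()` compares whole words, so a product name with one misspelled word is at distance 1 regardless of how many characters differ.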

Thank you!

rainbow.vse.cz

Slide 15

Multimedia Information extraction
from HTML product catalogues
Martin Labský1, Vojtěch Svátek1, Pavel Praks2, Ondřej Šváb1
{labsky, svatek, xsvao06}@vse.cz, [email protected]

rainbow.vse.cz

1 Dept.

of Information and Knowledge Engineering,
Prague University of Economics
2 Dept. of Applied Mathematics, Technical University of Ostrava
DATESO, April 14th 2005

Agenda






Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by ontology
Bicycle search application

DATESO, April 14th 2005

2

IE from Internet
• Motivation
– Semantic and structured search over large document collections
– Example query: search for objects of type Bicycle in the price range €500 - €900 and find their structures (name, price, equipment)

• Requirements
– Identify relevant documents
– Perform automatic IE
• documents are semi-structured and have heterogeneous layouts and formatting

IE from Internet

Our approach to IE
– Acquire new document: w1 w2 ... wn
– HTML preprocessing
– Annotation using HMMs: name (w3 w4), price (w6 w7), picture (w9)
– Instance extraction: Bicycle offer with name w3 w4, price w6 w7, picture w9

IE from Internet

Relevant documents


Annotation using HMMs

Preprocessing
• HTML cleanup
– conversion to valid XHTML

• Only potentially relevant blocks kept
– blocks that do not directly contain text or images omitted

• Formatting tags
– attributes removed
– several rules matching common constructions (add-to-basket form, choose-amount button)

• Images
– baseline: all images treated as a single token

Annotation using HMMs

Preprocessing – example

src=/smsimg/3/tn_m2366_05tksession77.jpg width=100 height=70
alt="TREK Session 77" border=0>
TREK Session 77


(2005)
OUR PRICE £3000.00

action=/products.php?plid=m1b0s1p0 name=buyit> name=cartadditem id=cartadditem value=979> name="selected_size" id="selected_size"> value="15.5">15.5

type="hidden" name="selected_colour" id="selected_colour"
value="default"> type=submit name=submit id=submit value="Add to Basket">


TREK Session 77
( 2005 )

OUR PRICE £ 3000 . 00

- - Select Size - 15 . 5 17 . 5 19
<_CHOOSEAMOUNT/> <_ADDTOBASKET/>
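
The tokenised output above suggests that words, digit runs and punctuation marks each become a separate token (for example, £3000.00 becomes £ 3000 . 00). A minimal Python sketch of such a tokeniser follows. The regular expression and the function name tokenize are assumptions made for illustration, not the authors' implementation, and the sketch does not cover tag handling or the form-replacement rules.

```python
import re

# Assumed token classes: runs of letters, runs of digits, or one punctuation/symbol character.
TOKEN_RE = re.compile(r"[^\W\d_]+|\d+|[^\w\s]|_")

def tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return TOKEN_RE.findall(text)

tokens = tokenize("TREK Session 77 (2005) OUR PRICE £3000.00")
print(" ".join(tokens))
```

Splitting prices into several tokens keeps the vocabulary small, which may help when lexical probabilities are estimated from limited training data.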

Annotation using HMMs

Document modeling using HMMs
• Generative model over (word, class) pairs
• Document = [w1 c1] [w2 c2]
• P([w1 c1] [w2 c2]) = P(c1) P(c2|c1) P(w1|c1) P(w2|c2)
– transition probabilities, e.g. P(c2|c1), P(c1|c2)
– lexical probabilities, e.g. P(w1|c1), P(w1|c2)
– both estimated from training data (frequencies)
• c1 c2 = argmax_{i,j} P([w1 ci] [w2 cj])
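
For longer documents the argmax over all class sequences is usually computed with the standard Viterbi algorithm rather than by enumeration. The sketch below is a generic first-order Viterbi decoder in Python. The class names and all probabilities in the toy model are invented for illustration, and the probability floor for unseen events is an assumption; none of this is taken from the authors' system.

```python
import math

def viterbi(words, classes, start_p, trans_p, lex_p, floor=1e-6):
    """Return the most probable class sequence for `words` under a first-order HMM."""
    def logp(p):
        # Unseen events get a small floor probability instead of zero (assumption).
        return math.log(max(p, floor))

    # best[c] = (log-probability, path) of the best sequence that ends in class c
    best = {
        c: (logp(start_p.get(c, 0.0)) + logp(lex_p[c].get(words[0], 0.0)), [c])
        for c in classes
    }
    for word in words[1:]:
        step = {}
        for c in classes:
            # Pick the predecessor class that maximises the score of reaching c.
            prev = max(classes, key=lambda p: best[p][0] + logp(trans_p[p].get(c, 0.0)))
            score = (best[prev][0] + logp(trans_p[prev].get(c, 0.0))
                     + logp(lex_p[c].get(word, 0.0)))
            step[c] = (score, best[prev][1] + [c])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]

# Toy model with invented numbers.
classes = ["bg", "name", "price"]
start_p = {"bg": 0.6, "name": 0.3, "price": 0.1}
trans_p = {
    "bg": {"bg": 0.6, "name": 0.2, "price": 0.2},
    "name": {"bg": 0.3, "name": 0.6, "price": 0.1},
    "price": {"bg": 0.5, "name": 0.1, "price": 0.4},
}
lex_p = {
    "bg": {"OUR": 0.5, "PRICE": 0.5},
    "name": {"TREK": 0.5, "Session": 0.5},
    "price": {"£": 0.5, "3000": 0.5},
}
words = ["TREK", "Session", "OUR", "PRICE", "£", "3000"]
print(viterbi(words, classes, start_p, trans_p, lex_p))
```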

Annotation using HMMs

HMM Structure
• States
– adopted from [Freitag, McCallum 99]
– Target, Prefix, Suffix and Background
– densely connected

• Class trigram model
– P(name | name_prefix, name)

• Variations
– word n-gram models for lexical probabilities of target states, e.g. P(wi | wi-1, name)
– state substructures instead of single target states,
learned by EM
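
A class trigram model conditions each class on the two preceding classes. As a rough illustration, the Python sketch below estimates such probabilities by relative frequency from labelled class sequences. The padding symbol, the function name and the tiny training set are assumptions, and no smoothing is applied.

```python
from collections import Counter

def class_trigram_probs(label_sequences, pad="<s>"):
    """Estimate P(c_i | c_{i-2}, c_{i-1}) by relative frequency, keyed as (c_{i-2}, c_{i-1}, c_i)."""
    trigrams = Counter()
    contexts = Counter()
    for labels in label_sequences:
        padded = [pad, pad] + list(labels)
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            trigrams[(a, b, c)] += 1
            contexts[(a, b)] += 1
    return {tri: n / contexts[tri[:2]] for tri, n in trigrams.items()}

# Invented training sequences of class labels.
training = [
    ["bg", "name_prefix", "name", "name", "bg"],
    ["name_prefix", "name", "bg"],
]
probs = class_trigram_probs(training)
print(probs[("name_prefix", "name", "name")])  # P(name | name_prefix, name)
```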

Extracting Images
• Baseline
– every image represented by the same token
– HMM only extracts product images based on context, e.g. P(product_picture | name, product_picture_prefix)

• Use image classifier to preprocess images
– classifies into 3 classes – Pos, Neg, Unk
– before HMM annotation, each image occurrence in the document is substituted by its class
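
The substitution step can be pictured as a pass over the token stream that replaces every image occurrence with a token naming its predicted class. The Python sketch below is only an illustration: the token representation, the token names such as <_IMG_Pos/> and the toy classifier are all hypothetical.

```python
def substitute_images(tokens, classify):
    """Replace each ("img", url) token by a class token; keep all other tokens unchanged."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple) and tok[0] == "img":
            out.append("<_IMG_%s/>" % classify(tok[1]))  # classify returns "Pos", "Neg" or "Unk"
        else:
            out.append(tok)
    return out

def toy_classifier(url):
    # Hypothetical stand-in for the real image classifier.
    return "Pos" if "tn_" in url else "Unk"

doc = [("img", "/smsimg/3/tn_m2366_05tksession77.jpg"), "TREK", "Session", "77"]
print(substitute_images(doc, toy_classifier))
```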

Extracting Images

Image Classification – Features
• Image size
– estimated a 2-dimensional normal distribution from a set of 1000 unique bicycle images → NC(x, y)
– estimated decision threshold (1-feature binary classifier) using a held-out set of 150 images (60% positive)

• Image similarity
– latent semantic similarity [Praks 2004] → sim(I1, I2)
– estimated decision threshold for the 1-feature binary classifier

• Does the image repeat in the document?
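
To make the image size feature concrete, the Python sketch below fits a 2-dimensional normal distribution to (width, height) pairs and scores a new image by its density. The slides do not say how the decision threshold was estimated; choosing the threshold that maximises accuracy on a labelled held-out set is an assumption here, as are the function names and the sample sizes.

```python
import math

def fit_gaussian_2d(sizes):
    """Maximum-likelihood mean and covariance of (width, height) pairs."""
    n = len(sizes)
    mx = sum(w for w, _ in sizes) / n
    my = sum(h for _, h in sizes) / n
    sxx = sum((w - mx) ** 2 for w, _ in sizes) / n
    syy = sum((h - my) ** 2 for _, h in sizes) / n
    sxy = sum((w - mx) * (h - my) for w, h in sizes) / n
    return mx, my, sxx, syy, sxy

def density(width, height, params):
    """Density of the fitted 2-D normal distribution at (width, height)."""
    mx, my, sxx, syy, sxy = params
    det = sxx * syy - sxy ** 2
    dx, dy = width - mx, height - my
    # Quadratic form with the inverse of the 2x2 covariance matrix.
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))

def pick_threshold(scored):
    """scored: (score, is_positive) pairs; return the score threshold with the most correct decisions."""
    candidates = sorted({score for score, _ in scored})
    def correct(t):
        return sum((score >= t) == positive for score, positive in scored)
    return max(candidates, key=correct)

params = fit_gaussian_2d([(100, 70), (120, 80), (110, 90), (90, 60), (130, 75)])
print(density(110, 75, params) > density(16, 16, params))
```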

Extracting Images

Image Classification
• Combined binary classifier
– Multi-layer perceptron (Weka)
– Features: NC(x,y) , simC(I) , repeats(I)

• Performance of binary classifiers
– 10-fold cross-validation, document-level folds
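
Document-level folds presumably mean that all images from one document fall into the same fold, so that images from the same page are never split between training and test data. A minimal Python sketch of such a split follows; the round-robin assignment of documents to folds and the function name are assumptions.

```python
def document_level_folds(image_docs, k=10):
    """image_docs[i] is the document id of image i; return k lists of image indices."""
    docs = sorted(set(image_docs))
    fold_of_doc = {doc: i % k for i, doc in enumerate(docs)}  # round-robin over documents
    folds = [[] for _ in range(k)]
    for index, doc in enumerate(image_docs):
        folds[fold_of_doc[doc]].append(index)
    return folds

image_docs = ["a", "a", "b", "c", "c", "c"]
print(document_level_folds(image_docs, k=2))
```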


Extracting Images

Annotation Results
• Combined ternary classifier
– outputs Pos, Unk or Neg
– decision list based on the predictions of all 3 single-feature ternary classifiers


Instance Composition
Document annotated by HMM → Instance extraction algorithm (uses the presentation ontology) → Instances (XML) → Sesame RDF repository

Instance Composition
Domain ontology

Presentation Ontology


Instance Composition

Instance extraction algorithm
• Sequentially parses annotated document
• Adds annotated attributes to the working instance WI
• If adding an attribute would cause an inconsistency, a new empty working instance is created. The old working instance is saved only if it is consistent.
http://eso.vse.cz/~labsky/cgi-bin/client/

WI = empty_instance;
while (more_attributes) {
    A = next_attribute;
    if (cannot_add (WI, A)) {
        if (consistent (WI)) {
            store (WI);
        }
        WI = empty_instance;
    }
    add (WI, A);
}
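
The pseudocode above can be turned into a small runnable Python sketch. The two constraints used here (an attribute may occur at most once per instance, and an instance is consistent only if it has both a name and a price) are hypothetical stand-ins for the presentation ontology. The final flush of the last working instance is also an assumption, since the pseudocode ends without storing it.

```python
def extract_instances(attributes, cannot_add, consistent):
    """Group a sequence of (attribute_name, value) pairs into instances."""
    stored = []
    working = {}
    for name, value in attributes:
        if cannot_add(working, name):
            if consistent(working):
                stored.append(working)
            working = {}  # start a new empty working instance
        working[name] = value
    # Assumption: also keep the last working instance if it is consistent.
    if working and consistent(working):
        stored.append(working)
    return stored

# Hypothetical constraints standing in for the presentation ontology.
def cannot_add(instance, name):
    return name in instance

def consistent(instance):
    return "name" in instance and "price" in instance

annotated = [
    ("name", "TREK Session 77"), ("price", "3000.00"), ("picture", "a.jpg"),
    ("name", "Other bike"), ("picture", "b.jpg"),
    ("name", "Third bike"), ("price", "500"),
]
instances = extract_instances(annotated, cannot_add, consistent)
print([inst["name"] for inst in instances])
```

In this run the second candidate is dropped because it has no price, which mirrors the rule that an old working instance is saved only if it is consistent.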

Bicycle search application, powered by Sesame RDF DB

http://rainbow.vse.cz:8000/sesame/


Future work
• Learn to correct annotation errors
– use document structure to detect unlabeled attributes
– bootstrap from these new examples
– use ontology constraints on values (types, lists, regexps)

• Population algorithm
– utilize scores for each annotated attribute
– augment presentation ontology with frequencies of attribute orderings
– use approximate name matching to identify instances

• Improve search interface
– approximate name matching (word and char edit distance)
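
Word-level and character-level edit distance can share one routine, because Levenshtein distance is defined over any sequence. The Python sketch below is a generic textbook implementation offered as an illustration of the idea, not the planned search interface; the example product names are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # substitute x by y (free if equal)
        prev = cur
    return prev[-1]

char_distance = edit_distance("trek sesion 77", "trek session 77")
word_distance = edit_distance("trek session 77".split(), "trek session 77 2005".split())
print(char_distance, word_distance)
```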


Thank you!

rainbow.vse.cz



Slide 19

Multimedia Information extraction
from HTML product catalogues
Martin Labský1, Vojtěch Svátek1, Pavel Praks2, Ondřej Šváb1
{labsky, svatek, xsvao06}@vse.cz, [email protected]

rainbow.vse.cz

1 Dept.

of Information and Knowledge Engineering,
Prague University of Economics
2 Dept. of Applied Mathematics, Technical University of Ostrava
DATESO, April 14th 2005

Agenda






Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by ontology
Bicycle search application

DATESO, April 14th 2005

2

IE from Internet

IE from Internet
• Motivation

searching for objects of type Bicycle
in price range €500 - €900
find structures (name, price, equipment)

– Semantic and structured search over large
document collections

• Requirements
– Identify relevant documents
– Perform automatic IE
• documents are semi-structured, have
heterogeneous layouts and formattings
DATESO, April 14th 2005

3

IE from Internet

Our approach to IE
Acquire new
document
w1 w2 ... wn

Annotation
using HMMs
Bicycle offer
name w3w4
price w6w7
picture w9

HTML
Preprocessing
name
price picture
w1 w2 w
w77 w8 w99 ... wn
w33 w
w44 w5 w6 w

Instance
extraction
DATESO, April 14th 2005

4

IE from Internet

Relevant documents

DATESO, April 14th 2005

5

Agenda
• Information Extraction from Internet
• Annotation using Hidden Markov
Models
• Extracting images
• Instance composition guided by ontology
• Bicycle search application

DATESO, April 14th 2005

6

Annotation using HMMs

Preprocessing
• HTML cleanup
– conversion to valid XHTML

• Only potentially relevant blocks kept
– blocks that do not directly contain text or images omitted

• Formatting tags
– attributes removed
– several rules matching common constructions (add-tobasket form, choose-amount button)

• Images
– baseline: all images treated as a single token
DATESO, April 14th 2005

7

Annotation using HMMs

Preprocessing – example

src=/smsimg/3/tn_m2366_05tksession77.jpg width=100 height=70
alt="TREK Session 77" border=0>
TREK Session 77


(2005)
OUR PRICE £3000.00

action=/products.php?plid=m1b0s1p0 name=buyit> name=cartadditem id=cartadditem value=979> name="selected_size" id="selected_size"> value="15.5">15.5

type="hidden" name="selected_colour" id="selected_colour"
value="default"> type=submit name=submit id=submit value="Add to Basket">


TREK Session 77
( 2005 )

OUR PRICE £ 3000 . 00

- - Select Size - 15 . 5 17 . 5 19
<_CHOOSEAMOUNT/> <_ADDTOBASKET/>
DATESO, April 14th 2005

8

Annotation using HMMs

Document modeling using HMMs
word

class

• Generative model
• Document = [w1c1] [w2c2]
• P([w1c1] [w2c2]) = P(c1)P(c2|c1)P(w1|c1)P(w2|c2)
transition prob.

P(c2|c1)
c1
P(w1|c1)

lexical prob.

c2
P(c1|c2)

P(w1|c2)

estimated from
training data (frequencies)

• c1c2 = argmaxi,j P([w1ci] [w2cj])
DATESO, April 14th 2005

9

Annotation using HMMs

HMM Structure
• States
– adopted from [Freitag, McCallum 99]
– Target, Prefix, Suffix and Background
– densely connected

• Class trigram model
– P(name | name_prefix, name)

• Variations
– word-ngram models for lexical probabilities of
target states P(w1 | wi-1, name)
– state substructures instead of single target states,
learned by EM
DATESO, April 14th 2005

10

Agenda






Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by ontology
Bicycle search application

DATESO, April 14th 2005

11

Extracting Images

Extracting Images
• Baseline
– every image represented by the same
token
– HMM only extracts product images based on
context, e.g.
P(product_picture | name, product_picture_prefix)

• Use image classifier to preprocess images
– classifies into 3 classes – Pos, Neg, Unk
– before HMM annotation, each image occurrence
in document is substituted by its class
DATESO, April 14th 2005

12

Extracting Images

Image Classification – Features
• Image size
– estimated 2-dimensional normal distribution from a set
of 1000 unique bicycle images  NC(x, y)
– estimated decision threshold (1-feature binary classifier)
using held-out set of 150 images (60% positive)

• Image similarity
– latent semantic similarity [Praks 2004]  sim(I1,I2)


– estimated decision threshold for 1-feature bin classifier

• Does the image repeat in document?
DATESO, April 14th 2005

13

Extracting Images

Image Classification
• Combined binary classifier
– Multi-layer perceptron (Weka)
– Features: NC(x,y) , simC(I) , repeats(I)

• Performance of binary classifiers
– 10-fold cross-validation, document-level folds

DATESO, April 14th 2005

14

Extracting Images

Annotation Results
• Combined ternary classifier
– outputs Pos Unk Neg
– decision list based on predictions of all 3 single
feature ternary classifiers

DATESO, April 14th 2005

15

Agenda





Information Extraction from Internet
Annotation using Hidden Markov Models
Extracting images
Instance composition guided by
ontology
• Bicycle search application

DATESO, April 14th 2005

16

Instance Composition

Instance Composition
Document
annotated
by HMM

Instance
extraction
algorithm

Presentation
ontology

Instances
(xml)

Sesame
RDF
repository

DATESO, April 14th 2005

17

Instance Composition
Domain ontology

Presentation Ontology

DATESO, April 14th 2005

18

Instance Composition

Instance extraction algorithm
• Sequentially parses annotated document
• Adds annotated attributes to working instance WI
• If adding an attribute would cause an inconsitency, an
empty working_instance is created. The old
working_instance is saved only if it is consistent.
http://eso.vse.cz/~labsky/cgi-bin/client/
1. WI = empty_instance;
2. while (more_attributes) {
3.
A = next_attribute;
4.
if (cannot_add (WI, A)) {
5.
if (consistent (WI)) {
6.
store (WI);
7.
}
8.
WI = empty_instance;
9.
}
10.
add (WI, A);
11. }
DATESO, April 14th 2005

19

Agenda
• Information Extraction from Internet
• Annotation using Hidden Markov Models
• Extracting images
• Instance composition guided by ontology
• Bicycle search application

Bicycle search application, powered by Sesame RDF DB

http://rainbow.vse.cz:8000/sesame/

Future work
• Learn to correct annotation errors
– use document structure to detect unlabeled attributes
– bootstrap from these new examples
– use ontology constraints on values (types, lists, regexps)

• Population algorithm
– utilize scores for each annotated attribute
– augment presentation ontology with frequencies of attribute
orderings
– use approximate name matching to identify instances

• Improve search interface
– approximate name matching (word-level and character-level edit distance)
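
Word-level and character-level edit distance can share one routine, since Levenshtein distance works on any sequence. A minimal sketch; the helper names and the lowercasing step are assumptions made for illustration, not part of the planned search interface:

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def char_distance(a, b):
    """Character-level distance, ignoring case."""
    return edit_distance(a.lower(), b.lower())

def word_distance(a, b):
    """Word-level distance over whitespace-separated tokens, ignoring case."""
    return edit_distance(a.lower().split(), b.lower().split())
```

For example, `char_distance("TREK Session 77", "Trek Sesion 77")` is 1 (one missing letter) and `word_distance("TREK Session 77", "Trek Session 77 2005")` is 1 (one extra word).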

Thank you!

rainbow.vse.cz
