ODE: Ontology-Assisted Data Extraction
Weifeng Su, Jiying Wang, Frederick H. Lochovsky
Summarized by Joseph Park
Overview
• “Web databases…compose what is referred to as the deep Web”
• The goal of data extraction:
– (1) Query result section identification - decides what section in a dynamically generated query result page contains the data that need to be extracted.
– (2) Record segmentation - segments the query result section into records and extracts them.
– (3) Data value alignment - aligns the data values from multiple records that belong to the same attribute so that they can be arranged into a table.
– (4) Label assignment - assigns a suitable, meaningful label (i.e., an attribute name) to each column in an aligned table.
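Steps (3) and (4) can be illustrated with a minimal sketch. It assumes the records have already been segmented into lists of strings, and it stands in for an ontology with a hypothetical dictionary of regular expressions (`ONTOLOGY`, `label_value`, and `align_records` are illustrative names, not part of ODE). It is a toy under those assumptions, not the authors' method.

```python
import re

# Hypothetical toy "ontology": attribute label -> pattern its values match.
# A real ontology would be built from the domain; these are assumptions.
ONTOLOGY = {
    "Title": re.compile(r"^[A-Z][A-Za-z ,:'-]+$"),
    "Price": re.compile(r"^\$\d+(\.\d{2})?$"),
    "Year": re.compile(r"^(19|20)\d{2}$"),
}


def label_value(value):
    """Return the first ontology label whose pattern matches, else None."""
    for label, pattern in ONTOLOGY.items():
        if pattern.match(value):
            return label
    return None


def align_records(records):
    """Place each value in the column of its matched attribute.

    Attributes missing from a record (optional attributes) stay None,
    so every row has the same columns.
    """
    columns = list(ONTOLOGY)
    table = []
    for record in records:
        row = dict.fromkeys(columns)  # every column starts as None
        for value in record:
            value = value.strip()
            label = label_value(value)
            if label is not None and row[label] is None:
                row[label] = value
        table.append([row[c] for c in columns])
    return columns, table


# Two segmented records; the second has no price (an optional attribute).
records = [
    ["Data on the Web", "$39.95", "1999"],
    ["Web Data Mining", "2007"],
]
columns, table = align_records(records)
```

Here `columns` holds the assigned labels and `table` holds one aligned row per record, with `None` where a record lacks an attribute.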
Problems
• Automatically extract data from query results
• Limitations of other systems:
– Incapable of processing either zero or few query results.
– Vulnerable to optional and disjunctive attributes.
– Incapable of processing nested data structures.
– No label assignment.
Approach
• ODE – Ontology-assisted data extraction
• PADE wrapper
• Query result annotation
• Attribute matching
• Ontology construction
Approach continued
• Query result section identification
• Record segmentation
• Data value alignment and label assignment
– MaxEnt model is used
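The slides only state that a MaxEnt model is used. As a reminder of what such a model computes, here is a minimal sketch of the general maximum-entropy form, p(label | x) proportional to exp(sum of weights of active features). The feature names, weights, and labels are made up for illustration; in practice the weights would be learned from training data, and nothing here reflects the specific features ODE uses.

```python
import math


def maxent_probs(features, weights, labels):
    """Conditional label distribution of a maximum-entropy model.

    features: names of the features active for one data value
    weights:  dict mapping (feature, label) -> weight (assumed given)
    labels:   candidate attribute labels
    """
    scores = {
        label: sum(weights.get((f, label), 0.0) for f in features)
        for label in labels
    }
    z = sum(math.exp(s) for s in scores.values())  # normalizing constant
    return {label: math.exp(s) / z for label, s in scores.items()}


# Hypothetical weights for two binary features of a data value.
weights = {
    ("has_dollar_sign", "Price"): 2.0,
    ("all_digits", "Year"): 1.5,
}
probs = maxent_probs(["has_dollar_sign"], weights, ["Price", "Year", "Title"])
best = max(probs, key=probs.get)
```

With these made-up weights, a value containing a dollar sign gets most of the probability mass on `Price`, and the remaining labels share the rest equally.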
Experimental Results
Extraction performed using DeLa
Conclusion
• Can only label attributes that appear in query result pages
• References a few DEG papers
– DKE99, Tisp, TANGO
• Could take advantage of MaxEnt for pre-labeling data
• Need to look into DeLa for data extraction