A “Quick and Dirty” Website
Data Quality Indicator
Irit Askira Gelman
University of Arizona
Anthony L. Barletta
University of Arizona
Information quality on the web: DEBKAfile

An Israeli, Jerusalem-based website (www.debka.co.il) with
commentary and analyses on terrorism, intelligence, security, and
military and political affairs in the Middle East

According to DEBKAfile, over 1,000,000 viewers a week

Forbes' Best of The Web award: “Debkafile has been ahead of
the pack often enough to suggest that the reporting is good.”

However, Forbes decries the fact that "most of the information is
attributed to unidentified sources"

Has been criticized as a fringe outfit catering to conspiracy theorists.
Some claim that the site relies on sources with an agenda, and that
Israeli intelligence does not consider even 10% of the content
reliable

The site's operators have claimed that 80% turns out to be true
October 30, 2008
WICOW 2008
Information quality on the web: DEBKAfile
Content quality:
- Highly current, up-to-date
But deficiencies in:
- Accuracy
- Source reliability
- Objectivity

Representation quality:
- Spelling errors & various typos
- Very long sentences
- Grammatical errors
- ..
Critical observation
Information quality deficiencies are often not isolated
- Poor information quality control?
Website information quality assessment:
Our approach (I)

Look for an easy-to-measure data quality facet

Use it as an indicator of aggregate data quality
Website information quality assessment:
Our approach (I)

Focus on spelling errors as an indicator of
aggregate data quality

Hypothesis 1: The spelling error rate of a
document set is related to the aggregate data
quality of the set: a higher error rate indicates
lower aggregate quality
Related questions (I)

To what extent is a lower aggregate quality
detected by the spelling error rate?

To what extent does a higher spelling error rate
indicate a lower aggregate quality?

Are there significant variations across different
settings?
Our approach (II):
A “quick and dirty” indicator

Instead of an exhaustive spelling error check,
focus on a minimal set of spelling errors,
carefully chosen to fit the target document
population

Use the hit count feature of a common search
engine (e.g., Google) to assess the rate of the
chosen spelling errors in the target population
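
One way to picture the second step is as a pair of exact-word queries per spelling error, restricted to the target population. The sketch below only builds query strings. It assumes the population is a single site or domain selected with the `site:` search operator, and the helper name `build_queries` is hypothetical; the slides do not say how the queries were actually phrased.

```python
from typing import List, Tuple


def build_queries(site: str, pairs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return (error_query, correct_query) strings for each spelling pair.

    Assumption: the target population is one site or domain, selected with
    the `site:` operator. Quoting each word asks the engine for that exact
    spelling rather than a corrected variant.
    """
    return [
        (f'"{error}" site:{site}', f'"{correct}" site:{site}')
        for error, correct in pairs
    ]


# "example.gov" is a placeholder population, not one taken from the study.
for error_query, correct_query in build_queries("example.gov", [("recieve", "receive")]):
    print(error_query, "|", correct_query)
```

The hit counts reported for the two queries of a pair are what the indicator compares.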
A “quick and dirty” indicator:
Initial implementation

10 common English spelling errors, selected
from the autocorrect word list of MS Office

Target: broad document populations

Error rates estimated from Google’s hit count
A “quick and dirty” indicator:
Initial implementation
Spelling Error    Correct Spelling
Recieve           Receive
Accomodate        Accommodate
Accross           Across
Truely            Truly
Acheive           Achieve
Affraid           Afraid
Agressive         Aggressive
Appearence        Appearance
Tomorow           Tomorrow
Arguement         Argument
A “quick and dirty” indicator:
Initial implementation

Indicator defined by:

\[ \mathrm{ErrorIndex}(e_j, d) = \frac{\mathrm{HitCount}(e_j, d)}{\mathrm{HitCount}(e_j, d) + \mathrm{HitCount}(c_j, d) + 1} \]

\[ \mathrm{ErrorIndex}(d) = \mathrm{AVERAGE}\{\mathrm{ErrorIndex}(e_j, d) : j = 1, \ldots, 10\} \]

e_j, j = 1, ..., 10, denotes the jth spelling error
c_j denotes the correct spelling that matches e_j
d denotes the document set
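
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the hit counts are passed in as a plain mapping from word to count, how they are obtained from the search engine is left out, and the function names are hypothetical.

```python
from statistics import mean
from typing import Mapping

# (spelling error, correct spelling) pairs from the slide's table
ERROR_PAIRS = [
    ("recieve", "receive"), ("accomodate", "accommodate"),
    ("accross", "across"), ("truely", "truly"),
    ("acheive", "achieve"), ("affraid", "afraid"),
    ("agressive", "aggressive"), ("appearence", "appearance"),
    ("tomorow", "tomorrow"), ("arguement", "argument"),
]


def pair_error_index(error_hits: int, correct_hits: int) -> float:
    """ErrorIndex(e_j, d): share of hits that use the misspelling.

    The +1 in the denominator keeps the ratio defined when both counts are 0.
    """
    return error_hits / (error_hits + correct_hits + 1)


def error_index(hit_counts: Mapping[str, int]) -> float:
    """ErrorIndex(d): average of the per-pair indices over all ten pairs.

    Words missing from `hit_counts` are treated as having zero hits
    (an assumption of this sketch).
    """
    return mean(
        pair_error_index(hit_counts.get(error, 0), hit_counts.get(correct, 0))
        for error, correct in ERROR_PAIRS
    )


# Made-up counts for illustration only: 1 misspelled hit per 8 correct hits.
example_counts = {}
for error, correct in ERROR_PAIRS:
    example_counts[error] = 1
    example_counts[correct] = 8
print(error_index(example_counts))
```

With these made-up counts every pair contributes 1 / (1 + 8 + 1), so the averaged index comes out at about 0.1.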
Website information quality assessment:
Our approach (II)

Hypothesis 2: The proposed indicator is related to
the aggregate data quality of the document set: a
higher indicator value indicates lower aggregate quality
Related questions (II)

To what extent is a lower aggregate quality detected
by this indicator?
… (see Questions I)

Spelling error set:
- What spelling errors to include?
- How many?

Hit count:
- Is it reliable?
- How valid is it in measuring error rates?
Initial tests & results

We have conducted initial tests of hypothesis 1,
hypothesis 2, & related questions

Askira Gelman I. and Barletta A.L. Initial Study of a
“Quick and Dirty” Website Data Quality Index, ICIQ
2008
Initial tests & results

To what extent does a higher spelling error rate indicate
a lower aggregate quality?

Positive initial results on large websites & web domains
(.gov sites, university sites, Wikipedia, and more)

Spelling error set: size can be increased; choose errors
carefully, because the search engine's matching is not
context sensitive

Hit count: for higher reliability, conduct a series of
measurements and remove outliers
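
The hit-count advice can be sketched as follows. The slides do not say which outlier rule was used, so this example assumes a median-absolute-deviation (MAD) filter with a cutoff of 3 MADs; both the rule and the function name `robust_hit_count` are illustrative choices, not the authors' procedure.

```python
from statistics import mean, median
from typing import Sequence


def robust_hit_count(measurements: Sequence[float], k: float = 3.0) -> float:
    """Average repeated hit counts after discarding outliers.

    Assumed rule: drop any reading that lies more than k median absolute
    deviations (MADs) away from the median of the series.
    """
    if not measurements:
        raise ValueError("need at least one measurement")
    med = median(measurements)
    mad = median(abs(x - med) for x in measurements)
    # At least half of the readings always lie within one MAD of the
    # median, so the filtered list is never empty.
    kept = [x for x in measurements if abs(x - med) <= k * mad]
    return mean(kept)


# Made-up series for illustration: four stable readings and one spike.
print(robust_hit_count([1000, 1010, 990, 1005, 50000]))  # 1001.25
```

In this made-up series the spike of 50000 is dropped and the remaining four readings are averaged.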