Processing of Large Document Collections 1

Helena Ahonen-Myka
University of Helsinki
Organization of the course
Classes: 17.9., 22.10., 23.10., 26.11.
lectures (Helena Ahonen-Myka): 10-12, 13-15
exercise sessions (Lili Aunimo): 15-17
required presence: 75%
Exercises are given (and returned) each
week
 required: 75%
Exam: 4.12. at 16-20, Auditorio
Points: Exam 30 pts, exercises 30 pts
Schedule
17.9. Character sets, preprocessing of
text, text categorization
22.10. Text summarization
23.10. Text compression
26.11. … to be announced…
self-study: basic transformations for text
data, using linguistic tools, etc.
In this part...
Character sets
preprocessing of text
text categorization
1. Character sets
Abstract character vs. its graphical
representation
abstract characters are grouped into
alphabets
each alphabet forms the basis of the written
form of a certain language or a set of
languages
Character sets
For instance
for English:
uppercase letters A-Z
lowercase letters a-z
punctuation marks
digits 0-9
common symbols: +, =
ideographic symbols of Chinese and
Japanese
phonetic letters of Western languages
Character sets
To represent text digitally, we need a
mapping between (abstract) characters
and values stored digitally (integers)
this mapping is a character set
the domain of the character set is called a
character repertoire (= the alphabet
for which the mapping is defined)
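A minimal illustration of such a mapping, using Python's built-in ord and chr, which expose the Unicode character set's mapping between characters and integer code values (the example characters are my own choice):

# ord() maps an abstract character to its integer code value,
# chr() is the inverse mapping from a code point back to a character
for ch in "Aaä":
    print(ch, ord(ch))        # A 65, a 97, ä 228
print(chr(0x20AC))            # '€'
# letters have consecutive code values, which simplifies sorting
print(ord("b") - ord("a"))    # 1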
Character sets
For each character in the character repertoire,
the character set defines a code value in the
set of code points
in English:
26 letters in both lower- and uppercase
ten digits + some punctuation marks
in Russian: Cyrillic letters
both could use the same set of code points (if
not a bilingual document)
in Japanese: could be over 6000 characters
Character sets
The mere existence of a character set
supports operations like editing and
searching of text
usually character sets have some
structure
e.g. integers within a small range
all lower-case (resp. upper-case) letters have
code values that are consecutive integers
(simplifies sorting etc.)
Character sets: standards
Character sets can be arbitrary, but in
practice standardization is needed for
interoperability (between computers,
programs,...)
early standards were designed for English
only, or for a small group of languages at
a time
Character sets: standards
ASCII
ISO-8859 (e.g. ISO Latin1)
Unicode
UTF-8, UTF-16
ASCII
American Standard Code for Information
Interchange
A seven-bit code -> 128 code points
actually only 95 printable characters
code points 0-31 and 127 are assigned to
control characters (mostly outdated)
ISO 646 (1972) version of ASCII
incorporated several national variants
(accented letters and currency symbols)
ASCII
With 7 bits, the set of code points is too
small for anything other than American
English
solution:
8 bits brings more code points (256)
ASCII character repertoire is mapped to the
values 0-127
additional symbols are mapped to other
values
Extended ASCII
Problem:
different manufacturers each developed their
own 8-bit extensions to ASCII
different character repertoires -> translation
between them is not always possible
also 256 code values is not enough to
represent all the alphabets -> different
variants for different languages
ISO 8859
Standardization of 8-bit character sets
In the 1980s, the multipart standard ISO 8859
was produced
defines a collection of 8-bit character sets, each
designed for a group of languages
the first part: ISO 8859-1 (ISO Latin1)
covers most Western European languages
0-127: identical to ASCII, 128-159 (mostly) unused,
160-255: 96 code values for accented letters and symbols
Unicode
256 is not enough code points
for ideographically represented languages
(Chinese, Japanese…)
for simultaneous use of several languages
solution: more than one byte for each
code value
a 16-bit character set has 65,536 code
points
Unicode
A 16-bit character set, i.e. 65,536 code
points
not sufficient to give all the characters
of the Chinese, Japanese, and
Korean scripts distinct positions
CJK consolidation (Han unification): characters of these
scripts are given the same code value if they look
the same
Unicode
Code values for all the characters used to
write contemporary ’major’ languages
also the classical forms of some languages
Latin, Greek, Cyrillic, Armenian, Hebrew,
Arabic, Devanagari, Bengali, Gurmukhi,
Gujarati, Oriya, Tamil, Telugu, Kannada,
Malayalam, Thai, Lao, Georgian, Tibetan
Chinese, Japanese, and Korean ideograms,
and the Japanese and Korean phonetic and
syllabic scripts
Unicode
punctuation marks
technical and mathematical symbols
arrows
dingbats (pointing hands, stars, …)
both accented letters and separate diacritical
marks (accents, tildes…) are included, with a
mechanism for building composite characters
can also create problems: two characters that
look the same may have different code values
-> normalization may be necessary
Unicode
Code values for nearly 39,000 symbols
are provided
part of the code space is reserved for an expansion
method (see later)
6,400 code points are reserved for private
use
they will never be assigned to any character
by the standard, so they will not conflict with
the standard
Unicode: encodings
Encoding is a mapping that transforms a code
value into a sequence of bytes for storage and
transmission
identity mapping for an 8-bit code?
it may be necessary to encode 8-bit characters as
sequences of 7-bit (ASCII) characters
e.g. Quoted-Printable (QP)
code values 128-255 are encoded as a sequence of 3 bytes
byte 1: the ASCII code for ’=’, bytes 2 & 3: the hexadecimal digits of the value
233 -> E9 -> =E9
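A minimal sketch of this rule in Python; the standard quopri module implements Quoted-Printable, and the last line applies the "=" plus two hex digits rule by hand:

import quopri

# bytes 128-255 become '=' followed by two hexadecimal digits
raw = "café".encode("latin-1")        # b'caf\xe9'  (é has code value 233 = 0xE9)
print(quopri.encodestring(raw))       # b'caf=E9'
print("=" + format(233, "02X"))       # '=E9', the rule applied by hand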
Unicode: encodings
UTF-8
ASCII code values are likely to be more
common in most text than any other values
in UTF-8 encoding, ASCII characters are sent
as such (high-order bit 0)
other code values are encoded as sequences of
up to six bytes (high-order bit set to 1)
Unicode: encodings
UTF-16: expansion method
two 16-bit values are combined into a 32-bit
value -> about a million additional characters available
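A short illustration of both encodings (the example characters are my own choice): ASCII code values stay single bytes in UTF-8, other characters take several bytes, and a character outside the 16-bit range is encoded in UTF-16 as a surrogate pair of two 16-bit values:

text = "Aé€"
print(text.encode("utf-8"))       # b'A\xc3\xa9\xe2\x82\xac' - 'A' stays one byte
print(text.encode("utf-16-be"))   # two bytes per character for these characters

# U+1D11E lies outside the 16-bit range: UTF-16 combines two 16-bit values
print("\U0001D11E".encode("utf-16-be").hex())   # 'd834dd1e', a surrogate pair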
2. Preprocessing of text
Text cannot be directly interpreted by many
document processing applications
an indexing procedure is needed
mapping of a text into a compact
representation of its content
what are the meaningful units of text?
how should these units be combined?
(usually considered less important)
Vector model
A document is usually represented as a
vector of term weights
the vector has as many dimensions as
there are terms (or features) in the whole
collection of documents
the weight represents how much the term
contributes to the semantics of the
document
Vector model
Different approaches:
different ways to understand what a term is
different ways to compute term weights
Terms
Words
typical choice
set of words, bag of words
phrases
syntactical phrases
statistical phrases
usefulness not yet known?
Terms
Some parts of the text are not considered terms
very common words (function words):
articles, prepositions, conjunctions
numerals
these words are pruned using a stopword list
other preprocessing is possible
stemming, reduction to base forms (see the sketch below)
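A minimal sketch of this kind of preprocessing; the stopword list and the suffix-stripping rule below are toy stand-ins for a real stopword list and a real stemmer (e.g. Porter's algorithm):

import re

STOPWORDS = {"the", "a", "an", "of", "in", "and", "or", "to", "is"}   # toy list

def preprocess(text):
    tokens = re.findall(r"[a-z0-9]+", text.lower())                        # crude tokenization
    tokens = [t for t in tokens if t not in STOPWORDS and not t.isdigit()] # prune stopwords, numerals
    return [t[:-1] if t.endswith("s") else t for t in tokens]              # naive "stemming"

print(preprocess("The claims of novelty in 20 documents"))
# ['claim', 'novelty', 'document']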
Weights of terms
Weights usually range between 0 and 1
binary weights may be used
1 denotes presence, 0 absence of the term in
the document
often the tfidf function is used (see the sketch below)
higher weight if the term occurs often in the
document
lower weight if the term occurs in many
documents
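A minimal sketch of tfidf weighting; the exact tf and idf variants differ between systems, so the raw-frequency-times-log-idf form with cosine normalization used here is one common choice, not the only one:

import math
from collections import Counter

docs = [["wheat", "farm", "export"], ["wheat", "commodity"], ["bank", "money", "bank"]]
N = len(docs)
df = Counter(t for d in docs for t in set(d))            # document frequency of each term

def tfidf_vector(doc):
    tf = Counter(doc)
    w = {t: tf[t] * math.log(N / df[t]) for t in tf}     # frequent in doc, rare in collection
    norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
    return {t: v / norm for t, v in w.items()}           # weights now lie between 0 and 1

print(tfidf_vector(docs[2]))    # {'bank': 0.89..., 'money': 0.44...}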
Structure
Either the full text of the document or
selected parts of it are indexed
e.g. in a patent categorization application
title, abstract, the first 20 lines of the
summary, and the section containing the
claims of novelty of the described invention
some parts may be considered more
important
e.g. higher weight for the terms in the title
Dimensionality reduction
Many algorithms cannot handle high
dimensionality of the term space (= large
number of terms)
usually dimensionality reduction is
applied
dimensionality reduction also reduces
overfitting
a classifier that overfits the training data is good
at re-classifying the training data but worse at
classifying previously unseen data
Dimensionality reduction
Local dimensionality reduction
for each category, a reduced set of terms is
chosen for classifying documents under that category
hence, different subsets are used when
working with different categories
global dimensionality reduction
a reduced set of terms is chosen for the
classification under all categories
Dimensionality reduction
Dimensionality reduction by term selection
the terms of the reduced term set are a
subset of the original term set
Dimensionality reduction by term
extraction
the terms are not of the same type as the terms
in the original term set, but are obtained by
combinations and transformations of the
original ones
Dimensionality reduction
by term selection
Goal: select terms that, when used for
document indexing, yield the highest
effectiveness in the given application
wrapper approach
the reduced set of terms is found iteratively and
tested with the application
filtering approach
keep the terms that receive the highest score
according to a function that measures the
”importance” of the term for the task
Dimensionality reduction
by term selection
Many functions available
document frequency: keep the highest-frequency terms
stopwords have already been removed
typically, about 50% of the words occur only once in the
document collection
e.g. remove all terms occurring in at most 3
documents (see the sketch below)
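A sketch of such a document-frequency filter, using the slide's example threshold (drop terms occurring in at most 3 documents):

from collections import Counter

def df_filter(docs, max_rare_df=3):
    """Drop terms that occur in at most max_rare_df documents."""
    df = Counter(t for d in docs for t in set(d))
    kept = {t for t, n in df.items() if n > max_rare_df}
    return [[t for t in d if t in kept] for d in docs]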
Dimensionality reduction
by term selection
Information-theoretic term selection
functions, e.g.
chi-square
information gain
mutual information
odds ratio
relevancy score
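As an example of such a function, a sketch of the chi-square score computed from a 2x2 contingency table of document counts for a term and a category (the standard chi-square term-selection formula; terms with the highest scores are kept; the counts in the usage line are made up):

def chi_square(A, B, C, D):
    # A: docs of the category containing the term,  B: other docs containing it
    # C: docs of the category without the term,     D: other docs without it
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

print(chi_square(A=40, B=10, C=60, D=890))   # high score: term and category are dependent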
Dimensionality reduction
by term extraction
Term extraction attempts to generate,
from the original term set, a set of
”synthetic” terms that maximize
effectiveness
due to polysemy, homonymy, and
synonymy, the original terms may not be
optimal dimensions for document content
representation
Dimensionality reduction
by term extraction
Term clustering
tries to group words with a high degree of pairwise
semantic relatedness
groups (or their centroids) may be used as
dimensions
latent semantic indexing
compresses document vectors into vectors of a
lower-dimensional space whose dimensions are obtained as
combinations of the original dimensions by looking at
their patterns of co-occurrence (see the sketch below)
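A minimal sketch of latent semantic indexing via a truncated singular value decomposition (assuming numpy is available; the number of dimensions k is a free parameter):

import numpy as np

def lsi(term_doc_matrix, k):
    # rows = terms, columns = documents; entries are e.g. tfidf weights
    U, s, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False)
    # keep only the k strongest co-occurrence patterns; each column of the
    # result is a document represented in the k-dimensional latent space
    return np.diag(s[:k]) @ Vt[:k, :]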
3. Text categorization
Text classification, topic
classification/spotting/detection
problem setting:
assume: a predefined set of categories, a set
of documents
label each document with one (or more)
categories
Text categorization
Two major approaches:
knowledge engineering -> end of 80’s
manually defined set of rules encoding expert
knowledge on how to classify documents under
the given categories
machine learning, 90’s ->
an automatic text classifier is built by learning,
from a set of preclassified documents, the
characteristics of the categories
Text categorization
Let
D: a domain of documents
C = {c1, …, c|C|} : a set of predefined categories
T = true, F = false
The task is to approximate the unknown target
function Φ′: D x C -> {T,F} by means of a function
Φ: D x C -> {T,F}, such that the two functions
"coincide as much as possible"
function Φ′: how documents should be classified
function Φ: the classifier (hypothesis, model…)
We assume...
Categories are just symbolic labels
no additional knowledge of their meaning is
available
No knowledge outside of the documents
is available
all decisions have to be made on the basis of
the knowledge extracted from the documents
metadata, e.g., publication date, document
type, source etc. is not used
-> general methods
Methods do not depend on any
application-dependent knowledge
in operational applications all kinds of
knowledge can be used
content-based decisions are necessarily
subjective
it is often difficult to measure the
effectiveness of the classifiers
even human classifiers do not always agree
Single-label vs. multi-label
Single-label text categorization
exactly 1 category must be assigned to each
dj ∈ D
Multi-label text categorization
any number of categories may be assigned
to the same dj ∈ D
Special case of single-label: binary
each dj must be assigned either to category
ci or to its complement ¬ ci
Single-label, multi-label
The binary case (and, hence, the single-label case) is more general than the multi-label case
an algorithm for binary classification can also
be used for multi-label classification
the converse is not true
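A sketch of that reduction: the multi-label decision is taken by running one independent binary classifier per category (the toy classifiers below stand in for any trained binary learners):

def multilabel_classify(document, binary_classifiers):
    # binary_classifiers maps each category ci to a function returning True/False
    return {c for c, clf in binary_classifiers.items() if clf(document)}

classifiers = {
    "wheat": lambda words: "wheat" in words,
    "money": lambda words: "bank" in words or "money" in words,
}
print(multilabel_classify({"wheat", "farm", "money"}, classifiers))   # both categories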
Category-pivoted vs.
document-pivoted
Two different ways for using a text
classifier
given a document, we want to find all the
categories, under which it should be filed
-> document-pivoted categorization (DPC)
given a category, we want to find all the
documents that should be filed under it
-> category-pivoted categorization (CPC)
Category-pivoted vs.
document-pivoted
The distinction is important, since the sets C
and D might not be available in their entirety
right from the start
DPC: suitable when documents become
available at different moments in time, e.g.
filtering e-mail
CPC: suitable when new categories are added
after some documents have already been
classified (and have to be reclassified)
Category-pivoted vs.
document-pivoted
Some algorithms may apply to one style
and not the other, but most techniques
are capable of working in either mode
Hard categorization vs.
ranking categorization
Hard categorization
the classifier answers T or F
Ranking categorization
given a document, the classifier might rank
the categories according to their estimated
appropriateness to the document
respectively, given a category, the classifier
might rank the documents
Applications of text
categorization
Automatic indexing for Boolean
information retrieval systems
document organization
text filtering
word sense disambiguation
hierarchical categorization of Web pages
Automatic indexing for
Boolean IR systems
In an information retrieval system, each
document is assigned one or more keywords or
keyphrases describing its content
keywords belong to a finite set called a controlled
dictionary
TC problem: the entries in a controlled
dictionary are viewed as categories
k1 ≤ x ≤ k2 keywords are assigned to each
document
document-pivoted TC
Document organization
Indexing with a controlled vocabulary is
an instance of the general problem of
document base organization
e.g. a newspaper office has to classify the
incoming ”classified” ads under categories
such as Personals, Cars for Sale, Real
Estate etc.
organization of patents, filing of
newspaper articles...
Text filtering
Classifying a stream of incoming documents
dispatched in an asynchronous way by an
information producer to an information
consumer
e.g. newsfeed
producer: news agency; consumer: newspaper
the filtering system should block the delivery of
documents the consumer is likely not interested in
Word sense
disambiguation
Given the occurrence in a text of an
ambiguous word, find the sense of this
particular word occurrence
E.g.
Bank of England
the bank of the river Thames
”Last week I borrowed some money from the
bank.”
Word sense
disambiguation
Indexing by word senses rather than by words
text categorization
documents: word occurrence contexts
categories: word senses
also resolving other natural language
ambiguities
context-sensitive spelling correction, part of speech
tagging, prepositional phrase attachment, word
choice selection in machine translation
Hierarchical categorization of
Web pages
E.g. Yahoo!-like hierarchical web
catalogues
typically, each category should be
populated by ”a few” documents
new categories are added, obsolete ones
removed
usage of link structure in classification
usage of the hierarchical structure
Knowledge engineering
approach
In the 1980s: knowledge engineering
techniques
manually building expert systems capable of
taking text categorization decisions
an expert system consists of a set of rules, e.g.:
wheat & farm -> wheat
wheat & commodity -> wheat
bushels & export -> wheat
wheat & winter & ~soft -> wheat
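A sketch of how such rules might be evaluated over the set of words of a document (the rule encoding below is my own illustration of the rules listed above):

# each rule: (words that must occur, words that must not occur)
WHEAT_RULES = [
    ({"wheat", "farm"}, set()),
    ({"wheat", "commodity"}, set()),
    ({"bushels", "export"}, set()),
    ({"wheat", "winter"}, {"soft"}),     # wheat & winter & ~soft
]

def is_wheat(words):
    return any(required <= words and not (forbidden & words)
               for required, forbidden in WHEAT_RULES)

print(is_wheat({"wheat", "winter", "crop"}))   # True
print(is_wheat({"wheat", "winter", "soft"}))   # False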
Knowledge engineering
approach
Drawback: rules must be manually
defined by a knowledge engineer with the
aid of a domain expert
any update again necessitates human
intervention
totally domain dependent
-> expensive and slow process
Machine learning approach
A general inductive process (learner)
automatically builds a classifier for a category ci
by observing the characteristics of a set of
documents manually classified under ci or ¬ci by
a domain expert
from these characteristics the learner gleans the
characteristics that a new unseen document
should have in order to be classified under ci
supervised learning (= supervised by the
knowledge of the training documents)
Machine learning approach
The learner is domain independent
usually available ’off-the-shelf’
the inductive process is easily repeated, if the
set of categories changes
manually classified documents often already
available
manual process may exist
if not, it is still easier to manually classify a set of
documents than to build and tune a set of rules
Training set, test set,
validation set
Initial corpus of manually classified
documents
let dj belong to the initial corpus
for each pair <dj, ci> it is known if dj should
be filed under ci
positive examples, negative examples of a
category
Training set, test set,
validation set
The initial corpus is divided into two sets
a training (and validation) set
a test set
the training set is used to build the
classifier
the test set is used for testing the
effectiveness of the classifiers
each document is fed to the classifier and the
decision is compared to the manual category
Training set, test set,
validation set
The documents in the test set are not used in
the construction of the classifier
alternative: k-fold cross-validation
k different classifiers are built by partitioning
the initial corpus into k disjoint sets and then
iteratively applying the train-and-test
approach, where k-1 sets form the
training set and 1 set is used as the test set
(see the sketch below)
individual results are then averaged
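A minimal sketch of the k-fold procedure; train and evaluate are placeholders for any learning algorithm and effectiveness measure:

def k_fold_cross_validation(corpus, k, train, evaluate):
    folds = [corpus[i::k] for i in range(k)]           # k disjoint sets
    scores = []
    for i in range(k):
        test_set = folds[i]
        training_set = [d for j, fold in enumerate(folds) if j != i for d in fold]
        classifier = train(training_set)                # k-1 folds form the training set
        scores.append(evaluate(classifier, test_set))   # 1 fold is used as the test set
    return sum(scores) / k                              # individual results are averaged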
Training set, test set,
validation set
The training set can be split into two parts
one part (the validation set) is used for optimising parameters
test which parameter values yield the
best effectiveness
the test set and the validation set must be kept
separate
Inductive construction of
classifiers
A ranking classifier for a category ci
definition of a function that, given a
document, returns a categorization status
value for it, i.e. a number between 0 and 1
documents are ranked according to their
categorization status value
Inductive construction of
classifiers
A hard classifier for a category
definition of a function that returns true or
false, or
definition of a function that returns a value
between 0 and 1, followed by a definition of a
threshold
if the value is higher than the threshold -> true
otherwise -> false
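A sketch of the second kind of hard classifier: a ranking function returning a categorization status value between 0 and 1 is turned into a true/false decision with a threshold (the threshold value here is illustrative):

def harden(csv_function, threshold=0.5):
    # csv_function(document, category) returns a categorization status value in [0, 1]
    return lambda document, category: csv_function(document, category) >= threshold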
Learners
Probabilistic classifiers (Naïve Bayes)
decision tree classifiers
decision rule classifiers
regression methods
on-line methods
neural networks
example-based classifiers (k-NN)
support vector machines
Rocchio method
Linear classifier method
for each category, an explicit profile (or
prototypical document) is constructed
benefit: profile is understandable even for
humans
Rocchio method
A classifier is a vector of the same
dimension as the documents
weights:
w_ki = Σ_{dj ∈ POSi} w_kj / |POSi| − Σ_{dj ∈ NEGi} w_kj / |NEGi|
(POSi and NEGi are the sets of positive and negative training examples of category ci)
classifying: cosine similarity of the
category vector and the document vector
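A minimal sketch of the Rocchio profile and the classification step, assuming documents are given as numpy weight vectors and using the simple variant of the formula above (positive and negative centroids weighted equally); the example vectors are made up:

import numpy as np

def rocchio_profile(pos, neg):
    # category vector: centroid of positive examples minus centroid of negatives
    return pos.mean(axis=0) - neg.mean(axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

pos = np.array([[1.0, 0.0, 1.0], [0.8, 0.2, 0.9]])    # positive training documents
neg = np.array([[0.0, 1.0, 0.1]])                     # negative training documents
profile = rocchio_profile(pos, neg)
print(cosine(profile, np.array([0.9, 0.1, 1.0])))     # similarity of a new document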