Processing of Large Document Collections 1
Helena Ahonen-Myka
University of Helsinki
Organization of the course
Classes: 17.9., 22.10., 23.10., 26.11.
lectures (Helena Ahonen-Myka): 10-12, 13-15
exercise sessions (Lili Aunimo): 15-17
required presence: 75%
Exercises are given (and returned) each
week
required: 75%
Exam: 4.12. at 16-20, Auditorio
Points: Exam 30 pts, exercises 30 pts
Schedule
17.9. Character sets, preprocessing of
text, text categorization
22.10. Text summarization
23.10. Text compression
26.11. … to be announced…
self-study: basic transformations for text
data, using linguistic tools, etc.
In this part...
Character sets
preprocessing of text
text categorization
1. Character sets
Abstract character vs. its graphical
representation
abstract characters are grouped into
alphabets
each alphabet forms the basis of the written
form of a certain language or a set of
languages
Character sets
For instance
for English:
uppercase letters A-Z
lowercase letters a-z
punctuation marks
digits 0-9
common symbols: +, =
ideographic symbols of Chinese and
Japanese
phonetic letters of Western languages
Character sets
To represent text digitally, we need a
mapping between (abstract) characters
and values stored digitally (integers)
this mapping is a character set
the domain of the character set is called a
character repertoire (= the alphabet
for which the mapping is defined)
Character sets
For each character in the character repertoire,
the character set defines a code value in the
set of code points
in English:
26 letters in both lower- and uppercase
ten digits + some punctuation marks
in Russian: Cyrillic letters
both could use the same set of code points (if
not a bilingual document)
in Japanese: could be over 6000 characters
Character sets
The mere existence of a character set
supports operations like editing and
searching of text
usually character sets have some
structure
e.g. integers within a small range
all lower-case (resp. upper-case) letters have
code values that are consecutive integers
(simplifies sorting etc.)
Character sets: standards
Character sets can be arbitrary, but in
practice standardization is needed for
interoperability (between computers,
programs,...)
early standards were designed for English
only, or for a small group of languages at
a time
Character sets: standards
ASCII
ISO-8859 (e.g. ISO Latin1)
Unicode
UTF-8, UTF-16
ASCII
American Standard Code for Information
Interchange
A seven-bit code -> 128 code points
actually only 95 printable characters
code points 0-31 and 127 are assigned to
control characters (mostly outdated)
ISO 646 (1972) version of ASCII
incorporated several national variants
(accented letters and currency symbols)
ASCII
With 7 bits, the set of code points is too
small for anything other than American
English
solution:
8 bits brings more code points (256)
ASCII character repertoire is mapped to the
values 0-127
additional symbols are mapped to other
values
Extended ASCII
Problem:
different manufacturers each developed their
own 8-bit extensions to ASCII
different character repertoires -> translation
between them is not always possible
also 256 code values are not enough to
represent all the alphabets -> different
variants for different languages
ISO 8859
Standardization of 8-bit character sets
in the 80s, the multipart standard ISO 8859
was produced
defines a collection of 8-bit character sets, each
designed for a group of languages
the first part: ISO 8859-1 (ISO Latin1)
covers most Western European languages
0-127: identical to ASCII; 128-159: (mostly) unused;
160-255: 96 code values for accented letters and symbols
Unicode
256 code points are not enough
for ideographically represented languages
(Chinese, Japanese…)
for simultaneous use of several languages
solution: more than one byte for each
code value
a 16-bit character set has 65,536 code
points
Unicode
A 16-bit character set, i.e. 65,536 code
points
not sufficient to give all the characters
of the Chinese, Japanese, and
Korean scripts distinct positions
CJK unification: characters of these scripts
are given the same code value if they look the
same
Unicode
Code values for all the characters used to
write contemporary ’major’ languages
also the classical forms of some languages
Latin, Greek, Cyrillic, Armenian, Hebrew,
Arabic, Devanagari, Bengali, Gurmukhi,
Gujarati, Oriya, Tamil, Telugu, Kannada,
Malayalam, Thai, Lao, Georgian, Tibetan
Chinese, Japanese, and Korean ideograms,
and the Japanese and Korean phonetic and
syllabic scripts
Unicode
punctuation marks
technical and mathematical symbols
arrows
dingbats (pointing hands, stars, …)
both accented letters and separate diacritical
marks (accents, tildes…) are included, with a
mechanism for building composite characters
can also create problems: two characters that
look the same may have different code values
-> normalization may be necessary
Unicode
Code values for nearly 39,000 symbols
are provided
some part is reserved for an expansion
method (see later)
6,400 code points are reserved for private
use
they will never be assigned to any character
by the standard, so they will not conflict with
the standard
Unicode: encodings
Encoding is a mapping that transforms a code
value into a sequence of bytes for storage and
transmission
identity mapping for an 8-bit code?
it may be necessary to encode 8-bit characters as
sequences of 7-bit (ASCII) characters
e.g. Quoted-Printable (QP)
code values 128-255 as a sequence of 3 bytes
byte 1: ASCII code for ’=’; bytes 2 & 3: hexadecimal digits of the value
233 -> E9 -> =E9
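As a small illustration, here is a minimal sketch (Python) of the 3-byte scheme above; the function name is illustrative, not from the source:

```python
# Minimal sketch of Quoted-Printable encoding for one byte value:
# non-ASCII values become '=' followed by two hexadecimal digits.
def qp_encode_byte(value: int) -> str:
    if 128 <= value <= 255:           # outside 7-bit ASCII
        return "={:02X}".format(value)
    return chr(value)                 # ASCII passes through unchanged

print(qp_encode_byte(233))  # -> "=E9", matching the example above
```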
Unicode: encodings
UTF-8
ASCII code values are likely to be more
common in most text than any other values
in UTF-8 encoding, ASCII characters are sent
as themselves (high-order bit 0)
other code values are encoded using
up to six bytes (high-order bit set to 1)
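A small demonstration (Python) of this property: ASCII bytes are unchanged, other characters become multi-byte sequences:

```python
# ASCII characters encode as themselves; others get multi-byte
# sequences whose bytes all have the high-order bit set.
for ch in "Aé":
    print(ch, ch.encode("utf-8").hex(" "))
# A -> 41       (one byte, plain ASCII)
# é -> c3 a9    (two bytes, high-order bits set)
```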
Unicode: encodings
UTF-16: expansion method
two 16-bit values are combined to a 32-bit
value -> a million characters available
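The expansion method can be seen in code; a brief sketch (Python) of a character just outside the 16-bit range:

```python
# A code point beyond 16 bits is encoded in UTF-16 as two 16-bit
# values (a surrogate pair) that together select the character.
ch = "\U00010000"                       # first code point past 16 bits
print(ch.encode("utf-16-be").hex(" "))  # -> d8 00 dc 00 (two 16-bit units)
```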
2. Preprocessing of text
Text cannot be directly interpreted by
many document processing applications
an indexing procedure is needed
mapping of a text into a compact
representation of its content
which are the meaningful units of text?
how should these units be combined?
(usually considered less ”important”)
Vector model
A document is usually represented as a
vector of term weights
the vector has as many dimensions as
there are terms (or features) in the whole
collection of documents
the weight represents how much the term
contributes to the semantics of the
document
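For concreteness, a minimal bag-of-words sketch (Python, toy vocabulary) of a document as a term-weight vector, here with binary weights:

```python
# Each document becomes a vector with one dimension per term in the
# collection; a weight of 1.0 marks presence, 0.0 absence.
def to_vector(doc_words, vocabulary):
    return [1.0 if t in doc_words else 0.0 for t in vocabulary]

vocabulary = ["export", "farm", "market", "wheat"]
print(to_vector({"wheat", "export"}, vocabulary))  # [1.0, 0.0, 0.0, 1.0]
```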
Vector model
Different approaches:
different ways to understand what a term is
different ways to compute term weights
Terms
Words
typical choice
set of words, bag of words
phrases
syntactical phrases
statistical phrases
usefulness not yet known?
Terms
Some parts of the text are not considered terms
very common words (function words):
articles, prepositions, conjunctions
numerals
these words are pruned
stopword list
other preprocessing possible
stemming, reduction to base forms
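A sketch (Python) of this pruning step; the stopword list here is a tiny illustrative sample, not a real one:

```python
# Drop function words and numerals before indexing.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "or", "to"}

def prune(tokens):
    return [t for t in tokens
            if t.lower() not in STOPWORDS and not t.isdigit()]

print(prune("the price of wheat rose 12 percent".split()))
# -> ['price', 'wheat', 'rose', 'percent']
```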
Weights of terms
Weights usually range between 0 and 1
binary weights may be used
1 denotes presence, 0 absence of the term in
the document
often the tf-idf function is used
higher weight, if the term occurs often in the
document
lower weight, if the term occurs in many
documents
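A minimal tf-idf sketch (Python); this simple variant is one of many and is chosen here only for illustration:

```python
import math

# Weight grows with in-document frequency (tf) and shrinks when the
# term appears in many documents (idf).
def tfidf(term, doc, docs):
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0

docs = [["wheat", "farm"], ["wheat", "export"], ["market", "news"]]
print(tfidf("wheat", docs[0], docs))  # wheat occurs in 2 of 3 documents
```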
Structure
Either the full text of the document or
selected parts of it are indexed
e.g. in a patent categorization application
title, abstract, the first 20 lines of the
summary, and the section containing the
claims of novelty of the described invention
some parts may be considered more
important
e.g. higher weight for the terms in the title
Dimensionality reduction
Many algorithms cannot handle high
dimensionality of the term space (= large
number of terms)
usually dimensionality reduction is
applied
dimensionality reduction also reduces
overfitting
a classifier that overfits the training data is good
at re-classifying the training data but worse at
classifying previously unseen data
Dimensionality reduction
Local dimensionality reduction
for each category, a reduced set of terms is
chosen for classification under that category
hence, different subsets are used when
working with different categories
global dimensionality reduction
a reduced set of terms is chosen for the
classification under all categories
Dimensionality reduction
Dimensionality reduction by term selection
the terms of the reduced term set are a
subset of the original term set
Dimensionality reduction by term
extraction
the terms are not of the same type as the terms
in the original term set, but are obtained by
combinations and transformations of the
original ones
Dimensionality reduction
by term selection
Goal: select terms that, when used for
document indexing, yield the highest
effectiveness in the given application
wrapper approach
the reduced set of terms is found iteratively and
tested with the application
filtering approach
keep the terms that receive the highest score
according to a function that measures the
”importance” of the term for the task
Dimensionality reduction
by term selection
Many functions available
document frequency: keep the high
frequency terms
stopwords have been already removed
50% of the words occur only once in the
document collection
e.g. remove all terms occurring in at most 3
documents
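As a sketch (Python), document-frequency selection with the threshold from the example above:

```python
from collections import Counter

# Keep only terms that occur in more than min_df documents.
def df_select(docs, min_df=3):
    df = Counter(t for d in docs for t in set(d))
    return {t for t, n in df.items() if n > min_df}
```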
Dimensionality reduction
by term selection
Information-theoretic term selection
functions, e.g.
chi-square
information gain
mutual information
odds ratio
relevancy score
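As an example of one listed function, a hedged sketch (Python) of the chi-square score over the usual 2x2 contingency counts (term presence vs. category membership); the exact formulation in the course materials may differ:

```python
# n11: term present & in category, n10: present & not in category,
# n01: absent & in category,       n00: absent & not in category.
def chi_square(n11, n10, n01, n00):
    n = n11 + n10 + n01 + n00
    num = n * (n11 * n00 - n10 * n01) ** 2
    den = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return num / den if den else 0.0
```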
Dimensionality reduction
by term extraction
Term extraction attempts to generate,
from the original term set, a set of
”synthetic” terms that maximize
effectiveness
due to polysemy, homonymy, and
synonymy, the original terms may not be
optimal dimensions for document content
representation
Dimensionality reduction
by term extraction
Term clustering
tries to group words with a high degree of pairwise
semantic relatedness
groups (or their centroids) may be used as
dimensions
latent semantic indexing
compresses document vectors into vectors of a
lower-dimensional space whose dimensions are
obtained as combinations of the original
dimensions by looking at their patterns of
co-occurrence
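A compact sketch (Python with NumPy) of the idea: a truncated SVD of the term-document matrix yields lower-dimensional document vectors; the matrix shapes and k are illustrative:

```python
import numpy as np

# Rows: terms, columns: documents (toy random data stands in for weights).
A = np.random.rand(500, 100)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 20                                  # reduced dimensionality
docs_k = np.diag(s[:k]) @ Vt[:k]        # documents in k-dimensional space
print(docs_k.shape)                     # (20, 100)
```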
3. Text categorization
Text classification, topic
classification/spotting/detection
problem setting:
assume: a predefined set of categories, a set
of documents
label each document with one (or more)
categories
Text categorization
Two major approaches:
knowledge engineering -> end of the 80s
manually defined set of rules encoding expert
knowledge on how to classify documents under
the given categories
machine learning, 90’s ->
an automatic text classifier is built by learning,
from a set of preclassified documents, the
characteristics of the categories
Text categorization
Let
D: a domain of documents
C = {c1, …, c|C|} : a set of predefined categories
T = true, F = false
The task is to approximate the unknown target
function Φ′: D x C -> {T, F} by means of a function
Φ: D x C -> {T, F}, such that the two functions
”coincide as much as possible”
function Φ′: how documents should be classified
function Φ: classifier (hypothesis, model…)
We assume...
Categories are just symbolic labels
no additional knowledge of their meaning is
available
No knowledge outside of the documents
is available
all decisions have to be made on the basis of
the knowledge extracted from the documents
metadata, e.g., publication date, document
type, source etc. is not used
-> general methods
Methods do not depend on any
application-dependent knowledge
in operational applications all kinds of
knowledge can be used
content-based decisions are necessarily
subjective
it is often difficult to measure the
effectiveness of the classifiers
even human classifiers do not always agree
Single-label vs. multi-label
Single-label text categorization
exactly 1 category must be assigned to each
dj ∈ D
Multi-label text categorization
any number of categories may be assigned
to the same dj ∈ D
Special case of single-label: binary
each dj must be assigned either to category
ci or to its complement ¬ci
Single-label, multi-label
The binary case (and, hence, the single-label case) is more general than the multi-label case
an algorithm for binary classification can also
be used for multi-label classification
the converse is not true
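To see why, a sketch (Python, with illustrative names): multi-label categorization can be obtained by running one independent binary classifier per category:

```python
# Each binary classifier answers T/F for its own category; collecting
# the T answers yields a multi-label assignment.
def multilabel_classify(doc, binary_classifiers):
    return [c for c, clf in binary_classifiers.items() if clf(doc)]

classifiers = {"wheat": lambda d: "wheat" in d,
               "export": lambda d: "export" in d}
print(multilabel_classify({"wheat", "export", "farm"}, classifiers))
```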
Category-pivoted vs.
document-pivoted
Two different ways for using a text
classifier
given a document, we want to find all the
categories, under which it should be filed
-> document-pivoted categorization (DPC)
given a category, we want to find all the
documents that should be filed under it
-> category-pivoted categorization (CPC)
Category-pivoted vs.
document-pivoted
The distinction is important, since the sets C
and D might not be available in their entirety
right from the start
DPC: suitable when documents become
available at different moments in time, e.g.
filtering e-mail
CPC: suitable when new categories are added
after some documents have already been
classified (and have to be reclassified)
Category-pivoted vs.
document-pivoted
Some algorithms may apply to one style
and not the other, but most techniques
are capable of working in either mode
Hard-categorization vs.
ranking categorization
Hard categorization
the classifier answers T or F
Ranking categorization
given a document, the classifier might rank
the categories according to their estimated
appropriateness to the document
respectively, given a category, the classifier
might rank the documents
Applications of text
categorization
Automatic indexing for Boolean
information retrieval systems
document organization
text filtering
word sense disambiguation
hierarchical categorization of Web pages
Automatic indexing for
Boolean IR systems
In an information retrieval system, each
document is assigned one or more keywords or
keyphrases describing its content
keywords belong to a finite set called controlled
dictionary
TC problem: the entries in a controlled
dictionary are viewed as categories
k1 ≤ x ≤ k2 keywords are assigned to each
document
document-pivoted TC
Document organization
Indexing with a controlled vocabulary is
an instance of the general problem of
document base organization
e.g. a newspaper office has to classify the
incoming ”classified” ads under categories
such as Personals, Cars for Sale, Real
Estate etc.
organization of patents, filing of
newspaper articles...
Text filtering
Classifying a stream of incoming documents
dispatched in an asynchronous way by an
information producer to an information
consumer
e.g. newsfeed
producer: news agency; consumer: newspaper
the filtering system should block the delivery of
documents the consumer is likely not interested in
Word sense
disambiguation
Given the occurrence in a text of an
ambiguous word, find the sense of this
particular word occurrence
E.g.
Bank of England
the bank of river Thames
”Last week I borrowed some money from the
bank.”
Word sense
disambiguation
Indexing by word senses rather than by words
text categorization
documents: word occurrence contexts
categories: word senses
also resolving other natural language
ambiguities
context-sensitive spelling correction, part of speech
tagging, prepositional phrase attachment, word
choice selection in machine translation
Hierarchical categorization of
Web pages
E.g. Yahoo!-like hierarchical Web
catalogues
typically, each category should be
populated by ”a few” documents
new categories are added, obsolete ones
removed
usage of link structure in classification
usage of the hierarchical structure
Knowledge engineering
approach
In the 80s: knowledge engineering
techniques
building manually expert systems capable of
taking text categorization decisions
expert system: consists of a set of rules
wheat & farm -> wheat
wheat & commodity -> wheat
bushels & export -> wheat
wheat & winter & ~soft -> wheat
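The rule set above, written out as code (a direct transcription of the slide's rules; the set-of-words representation is an illustrative choice):

```python
# True if the document's word set matches any of the four rules.
def wheat_rules(words: set) -> bool:
    return (("wheat" in words and "farm" in words)
            or ("wheat" in words and "commodity" in words)
            or ("bushels" in words and "export" in words)
            or ("wheat" in words and "winter" in words
                and "soft" not in words))

print(wheat_rules({"wheat", "winter", "harvest"}))  # True
```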
Knowledge engineering
approach
Drawback: rules must be manually
defined by a knowledge engineer with the
aid of a domain expert
any update again necessitates human
intervention
totally domain dependent
-> expensive and slow process
Machine learning approach
A general inductive process (learner)
automatically builds a classifier for a category ci
by observing the characteristics of a set of
documents manually classified under ci or ¬ci by
a domain expert
from these characteristics the learner gleans the
characteristics that a new unseen document
should have in order to be classified under ci
supervised learning (= supervised by the
knowledge of the training documents)
Machine learning approach
The learner is domain independent
usually available ’off-the-shelf’
the inductive process is easily repeated, if the
set of categories changes
manually classified documents often already
available
manual process may exist
if not, it is still easier to manually classify a set of
documents than to build and tune a set of rules
Training set, test set,
validation set
Initial corpus of manually classified
documents
let dj belong to the initial corpus
for each pair <dj, ci> it is known if dj should
be filed under ci
positive examples, negative examples of a
category
Training set, test set,
validation set
The initial corpus is divided into two sets
a training (and validation) set
a test set
the training set is used to build the
classifier
the test set is used for testing the
effectiveness of the classifiers
each document is fed to the classifier and the
decision is compared to the manual category
Training set, test set,
validation set
The documents in the test set are not used in
the construction of the classifier
alternative: k-fold cross-validation
k different classifiers are built by partitioning
the initial corpus into k disjoint sets and then
iteratively applying the train-and-test
approach, where k-1 sets form the training
set and the remaining set is used as the test set
individual results are then averaged
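A minimal k-fold cross-validation sketch (Python); train_and_test stands for any train-then-evaluate procedure and is an assumed callable:

```python
# Partition the corpus into k disjoint folds; each fold serves once
# as the test set while the other k-1 folds form the training set.
def k_fold_cv(corpus, k, train_and_test):
    folds = [corpus[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [d for j, f in enumerate(folds) if j != i for d in f]
        scores.append(train_and_test(train, test))
    return sum(scores) / k              # averaged individual results
```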
Training set, test set,
validation set
Training set can be split into two parts
one part is used for optimising parameters
test which values of parameters yield the
best effectiveness
test set and validation set must be kept
separate
Inductive construction of
classifiers
A ranking classifier for a category ci
definition of a function that, given a
document, returns a categorization status
value for it, i.e. a number between 0 and 1
documents are ranked according to their
categorization status value
Inductive construction of
classifiers
A hard classifier for a category
definition of a function that returns true or
false, or
definition of a function that returns a value
between 0 and 1, followed by a definition of a
threshold
if the value is higher than the threshold -> true
otherwise -> false
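A one-line sketch (Python) of the thresholding step; the threshold value is illustrative and would be tuned on the validation set:

```python
# Turn a categorization status value in [0, 1] into a hard T/F decision.
def hard_decision(status_value: float, threshold: float = 0.5) -> bool:
    return status_value >= threshold
```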
Learners
Probabilistic classifiers (Naïve Bayes)
decision tree classifiers
decision rule classifiers
regression methods
on-line methods
neural networks
example-based classifiers (k-NN)
support vector machines
Rocchio method
Linear classifier method
for each category, an explicit profile (or
prototypical document) is constructed
benefit: the profile is understandable even
to humans
Rocchio method
A classifier is a vector of the same
dimension as the documents
weights:
w_ki = Σ_{dj ∈ POSi} w_kj / |POSi| − Σ_{dj ∈ NEGi} w_kj / |NEGi|
where POSi and NEGi are the sets of positive and
negative training examples of category ci
classifying: cosine similarity of the
category vector and the document vector
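A sketch (Python) of the profile construction and cosine classification above; vectors are plain lists and the toy data is illustrative:

```python
import math

# Profile = centroid of positive examples minus centroid of negatives
# (some variants weight the two parts with control parameters).
def rocchio_profile(pos, neg):
    dims = len(pos[0]) if pos else len(neg[0])
    avg = lambda vs, k: sum(v[k] for v in vs) / len(vs) if vs else 0.0
    return [avg(pos, k) - avg(neg, k) for k in range(dims)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

profile = rocchio_profile([[1, 0, 1], [1, 1, 0]], [[0, 1, 0]])
print(cosine(profile, [1, 0, 0]))   # similarity of a new document
```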