Text Categorization
Moshe Koppel
Lecture 5: Authorship Verification
with Jonathan Schler and Shlomo Argamon
Attribution vs. Verification
• Attribution – Which of authors A1,…,An
wrote doc X?
• Verification – Did author A write doc X?
Authorship Verification:
Did the author of S also write X?
Story: Ben Ish Chai, a 19th C. Baghdadi Rabbi, is the
author of a corpus, S, of 500+ legal letters.
Ben Ish Chai also published another corpus of
500+ legal letters, X, but denied authorship of X,
despite external evidence that he wrote it.
How can we determine if the author of S is also the
author of X?
Verification is Harder than
Attribution
In the absence of a closed set of alternate
suspects to S, we’re never sure that we have
a representative set of not-S documents.
Let’s see why this is bad.
Round 1: “The Lineup”
D1,…,D5 are corpora written by other Rabbis of the same
region and period as Ben Ish Chai. They will play the
role of “impostors”.
1. Learn a model for S vs. (each of) the impostors.
2. For each document in X, check whether it is classed as S or as an impostor.
3. If “many” are classed as impostors, exonerate S.
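As a rough illustration (not the lecture's actual code), the lineup step might look like the following Python/scikit-learn sketch; the document lists, feature choice, and classifier are all assumptions.

# Hypothetical sketch of "The Lineup": train S-vs-impostor classifiers,
# then see how the documents of X are classed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

def lineup(s_docs, impostor_corpora, x_docs):
    """For each impostor corpus, return the fraction of X documents classed as S."""
    fractions = []
    for imp_docs in impostor_corpora:
        vec = TfidfVectorizer()                            # any lexical feature set would do
        train = vec.fit_transform(s_docs + imp_docs)
        labels = [1] * len(s_docs) + [0] * len(imp_docs)   # 1 = S, 0 = impostor
        clf = LinearSVC().fit(train, labels)
        preds = clf.predict(vec.transform(x_docs))
        fractions.append(preds.mean())                     # share of X attributed to S
    return fractions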
In fact, almost all are classified as S. (i.e. many mystery
documents seem to point to S as the “guilty” author.)
Does this mean S really is the author?
Why “The Lineup” Fails
No.
This only shows that S is a better fit than these
impostors, not that he is guilty.
The real author may simply be some other person
more similar to S than to (any of) these impostors.
(One important caveat: suppose we had, say, 10000
impostors. That would be a bit more convincing.)
Well, at least we can rule out these impostors.
Round 2: Composite Sketch
Does X Look Like S?
Learn a model for S vs. X. If CV “fails” (so that we
can’t distinguish S from X), S is probably guilty
(esp. since we already know that we can distinguish S [and
X] from each of the impostors).
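A minimal sketch of this check, assuming scikit-learn and plain bag-of-words features (both are assumptions, not the lecture's setup):

# Sketch of the "composite sketch" test: if cross-validation cannot separate
# S from X, the two corpora may share an author.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def cv_separability(s_docs, x_docs, folds=10):
    """Mean cross-validation accuracy for distinguishing S from X."""
    docs = s_docs + x_docs
    labels = [1] * len(s_docs) + [0] * len(x_docs)
    model = make_pipeline(CountVectorizer(), LinearSVC())
    return cross_val_score(model, docs, labels, cv=folds).mean()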
In fact, we obtain 98% CV accuracy for S vs. X.
So can we exonerate S?
Why Composite Sketch Fails
No.
Superficial differences, due to:
• thematic differences,
• chronological drift,
• different purposes or contexts,
• deliberate ruses
would be enough to allow differentiation between S and X
even if they were by the same author.
We call these differences “masks”.
Example: House of Seven Gables
This is a crucial point, so let’s consider an example
where we know the author’s identity.
With what CV accuracy can we distinguish House of
Seven Gables from the known works of
Hawthorne, Melville and Cooper (respectively)?
In each case, we obtain 95+% (even though
Hawthorne really wrote it).
Example (continued)
A small number of features allows House to be
distinguished from Hawthorne’s other work (The Scarlet
Letter). For example: he, she.
What happens when we eliminate features like
those?
Round 3: Unmasking
1. Learn models for X vs. S and for X vs. each
impostor.
2. For each of these, drop the k (k=5,10,15,..) best
(=highest weight in SVM) features and learn again.
3. “Compare curves.”
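The following is an illustrative sketch of unmasking, assuming scikit-learn and a precomputed document-term matrix; dropping a fixed number of top-weighted features per round matches the growing k = 5, 10, 15, … in step 2.

# Sketch of unmasking: repeatedly train an SVM, record its accuracy, then drop
# the features it leans on most heavily (the "mask").
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def unmasking_curve(X, y, rounds=10, drop_per_round=5):
    """X: document-term matrix (2-D array), y: labels (1 = corpus A, 0 = corpus B).
    Returns the cross-validation accuracy after each feature-elimination round."""
    X = np.asarray(X, dtype=float)
    active = np.arange(X.shape[1])            # indices of features still in play
    curve = []
    for _ in range(rounds):
        acc = cross_val_score(LinearSVC(), X[:, active], y, cv=5).mean()
        curve.append(acc)
        clf = LinearSVC().fit(X[:, active], y)
        strongest = np.argsort(np.abs(clf.coef_[0]))[-drop_per_round:]
        active = np.delete(active, strongest)
    return curve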
House of Seven Gables (concluded)
[Figure: unmasking curves for House of Seven Gables vs. the known works of Melville, Cooper, and Hawthorne; y-axis: accuracy (50–100%), x-axis: feature-elimination rounds 1–9.]
Does Unmasking Always Work?
Experimental setup:
• Several similar authors each with multiple books
(chunked into approx. equal-length examples)
• Construct unmasking curve for each pair of books
• Compare same-author pairs to different-author
pairs
Unmasking: 19th C. American Authors
(Hawthorne, Melville, Cooper)
[Figure: two panels of unmasking curves, “Identical Authors” and “Non Identical Authors”; y-axis: accuracy (0.5–1.0), x-axis: feature-elimination rounds 0–8.]
Unmasking: 19th C. English Playwrights
(Shaw, Wilde)
[Figure: two panels of unmasking curves, “Non Identical Authors” and “Identical Authors”; y-axis: accuracy (0.5–1.0), x-axis: feature-elimination rounds 1–8.]
Unmasking: 19th C. American Essayists
(Thoreau, Emerson)
[Figure: two panels of unmasking curves, “Non Identical Authors” and “Identical Authors”; y-axis: accuracy (0.5–1.0), x-axis: feature-elimination rounds 1–8.]
Experiment
• 21 books; 10 authors (21 × 10 = 210 labelled book-author examples)
• Represent unmasking curves as vectors
Leave-one-book-out experiments:
• Use the training books to learn to distinguish same-author curves from different-author curves
• Classify the left-out book (yes/no) for each author (independently)
• Use “The Lineup” to filter false positives
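One way to picture the leave-one-book-out step (the curve representation and classifier here are assumptions, not the lecture's specification): each book pair's unmasking curve becomes a feature vector, and a classifier trained on the remaining pairs decides same-author vs. different-author.

# Hypothetical sketch: unmasking curves as vectors, same-author vs. different-author.
from sklearn.svm import LinearSVC

def classify_pair(train_curves, train_labels, test_curve):
    """train_curves: unmasking curves for known book pairs (lists of accuracies),
    train_labels: 1 = same author, 0 = different authors."""
    clf = LinearSVC().fit(train_curves, train_labels)
    return int(clf.predict([test_curve])[0])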
Results
• 2 misclassified out of 210
• Simple rule that almost always works:
· accuracy after 6 elimination rounds is lower than 89%, and
· the second-highest accuracy drop over two consecutive iterations is greater than 16%
⇒ the books are by the same author
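Read literally, the rule can be coded as below; the thresholds come from the slide, but the exact indexing of “after 6 elimination rounds” and the reading of “drop in two consecutive iterations” are assumptions.

def same_author(curve):
    """curve[i] = CV accuracy in percent (0-100) after i feature-elimination rounds."""
    # Accuracy drop across two consecutive iterations, read here as curve[i] - curve[i + 2].
    drops = [curve[i] - curve[i + 2] for i in range(len(curve) - 2)]
    second_highest_drop = sorted(drops, reverse=True)[1]
    return curve[6] < 89 and second_highest_drop > 16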
Unmasking Ben Ish Chai
[Figure: unmasking curve for Ben Ish Chai’s S vs. X; y-axis: accuracy (50–100%), x-axis: feature-elimination rounds 1–11.]
Unmasking: Summary
• This method works very well in general (provided
X and S are both large).
• Key is not how similar/different two texts are, but
how robust that similarity/difference is to changes
in the feature set.
Now let’s try a much harder problem…
• Suppose, instead of one candidate, we have 10,000
candidate authors – and we aren’t even sure if any of them
is the real author. (This is two orders of magnitude more than has ever been
tried before.)
• Building a classifier doesn’t sound promising, but
information retrieval methods might have a chance.
• So, let’s try assigning an anonymous document to
whichever author’s known writing is most similar (using the
usual vector space/cosine model).
IR Approach
• We tried this on a corpus of 10,000 blogs, where the object
was to attribute a short snippet from each blog. (Each attribution
problem is handled independently.)
• Our feature set consisted of character 4-grams.
• 46% of “snippets” are correctly attributed.
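A bare-bones version of this baseline (the TF-IDF weighting is an assumption; the lecture specifies only character 4-grams and cosine similarity):

# Sketch of the IR-style baseline: character 4-gram vectors, nearest author by cosine.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def nearest_author(author_texts, snippet):
    """author_texts: dict mapping author name -> that author's known writing (one string)."""
    names = list(author_texts)
    vec = TfidfVectorizer(analyzer="char", ngram_range=(4, 4))
    known = vec.fit_transform([author_texts[n] for n in names])
    sims = cosine_similarity(vec.transform([snippet]), known)[0]
    return names[sims.argmax()]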
IR Approach
• 46% is not bad but completely useless in most
applications.
• What we’d really like to do is figure out which
attributions are reliable and which are not.
• In an earlier attempt (KSA 2006), we tried building a
meta-classifier that could solve that problem (but
meta-classifiers are a bit fiddly).
When does most similar = actual author?
• Can generalize unmasking idea.
• Check if similarity between snippet and an author’s known
text is robust wrt changes in feature set.
– If it is, that’s the author.
– If not, we just say we don’t know. (If in fact none of the
candidates wrote it, that’s the best answer).
Algorithm
1. Randomly choose a subset of the features.
2. Find the most similar author (using that feature subset).
3. Iterate.
4. If S is most similar for at least k% of the iterations, S is the author. Else, say Don’t Know. (The choice of k trades off precision against recall.)
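A sketch of this procedure under assumed representations (precomputed feature vectors; the helper names are illustrative, not the lecture's):

# Sketch: attribute a snippet only if one author wins on most random feature subsets.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def robust_attribution(author_matrix, snippet_vec, iters=100, feature_frac=0.5, k=0.90, seed=0):
    """author_matrix: (n_authors, n_features) known-text vectors; snippet_vec: (n_features,).
    Returns the index of the winning author, or None for "Don't Know"."""
    rng = np.random.default_rng(seed)
    n_features = author_matrix.shape[1]
    wins = np.zeros(author_matrix.shape[0], dtype=int)
    for _ in range(iters):
        subset = rng.choice(n_features, size=int(feature_frac * n_features), replace=False)
        sims = cosine_similarity(snippet_vec[subset].reshape(1, -1), author_matrix[:, subset])[0]
        wins[sims.argmax()] += 1
    best = int(wins.argmax())
    return best if wins[best] >= k * iters else None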
Results
• 100 iterations, 50% of features per iteration
• training text = 2000 words, snippet = 500 words
• 1000 candidates: 93.2% precision at 39.2% recall (k = 90)
Results
How often do we attribute a snippet not written by
any candidate to somebody?
k = 90
• 10,000 candidates – 2.5%
• 5,000 candidates – 3.5%
• 1,000 candidates – 5.5%
(The fewer candidates, the greater the chance some poor shnook will consistently
be most similar.)
Comments
• Can give an estimate of the probability that A is the author.
• Almost all variance in recall/precision is explained by:
– Snippet length
– Known-text length
– Number of candidates
– Score (number of iterations in which A is most similar)
• Method is language-independent.
So Far…
• Have covered cases of many authors (closed or open set).
• Unmasking covers cases of open set, few authors, lots of text.
• The only problem not yet covered is the hardest one: open set, few
authors, little text.
• Can we convert this case to our problem by adding artificial
candidates?