Automatic Authorship Identification

Download Report

Transcript Automatic Authorship Identification

Automatic Authorship
Identification
Diana Michalek, Ross T. Sowell, Paul Kantor,
Alex Genkin, David Madigan, Fred Roberts,
and David D. Lewis
Acknowledgements
• Support
– U.S. National Science Foundation
• Knowledge Discovery and Dissemination Program
• Disclaimer
– The views expressed in this talk are those of the
authors, and not of any other individuals or
organizations.
The Authorship Problem
• Given:
– A piece of text with unknown author
– A list of possible authors
– A sample of their writing
• Problem:
– Can we automatically determine which person
wrote the text?
The Authorship Problem
• Given:
– A piece of text
– A list of possible authors
– A sample of their writing
• Problem:
– Can we automatically determine which person wrote
the text?
• Approach:
– Use style markers to identify the author
Motivation and Applications
• Forensics
• Arts
Motivation and Applications
• Forensics
– Unabomber
• Arts
Motivation and Applications
• Forensics
– Unabomber
• Arts
– Shakespeare
Motivation and Applications
• History
• E-mail
Motivation and Applications
• History
– Federalist Papers
• E-mail
Motivation and Applications
• History
– Federalist Papers
• E-mail
Motivation and Applications
• History
– Federalist Papers
• E-mail
Motivation and Applications
• History
– Federalist Papers
• 85 Total
• 12 Disputed
• E-mail
Motivation and Applications
• History
– Federalist Papers
• 85 Total
• 12 Disputed
• E-mail
Motivation and Applications
• Counter-Terrorism
Motivation and Applications
• Counter-Terrorism
– Osama Bin Laden
Previous Work: Mosteller and
Wallace (1984)
• Function Words
Previous Work: Mosteller and
Wallace (1984)
• Function Words
Upon
Also
An
By
Of
On
There
This
To
Although
Both
Enough
While
Whilst
Always
Though
Commonly
Consequently
Considerable(ly)
According
Apt
Direction
Innovation(s)
Language
Vigor(ous)
Kind
Matter(s)
Particularly
Probability
Work(s)
Previous Work: Mosteller and
Wallace (1984)
• Function Words
Upon
Also
An
By
Of
On
There
This
To
Although
Both
Enough
While
Whilst
Always
Though
Commonly
Consequently
Considerable(ly)
According
Apt
Direction
Innovation(s)
Language
Vigor(ous)
Kind
Matter(s)
Particularly
Probability
Work(s)
wk = number times
word k appears in text
T = (w1, w2, …, w30)
Previous Work: Mosteller and
Wallace (1984)
• Bayesian Inference
Previous Work: Mosteller and
Wallace (1984)
• Bayesian Inference
Odds(1, 2 | x) = (p1/p2)[f1(x)/f2(x)]
Final odds = (initial odds)(likelihood ratio)
Previous Work: Mosteller and
Wallace (1984)
• Experiment
– Use 18 Hamilton and 14 Madison papers to
gather information
• Results
Previous Work: Mosteller and
Wallace (1984)
• Experiment
– Use 18 Hamilton and 14 Madison papers to
gather information
– Test: known Hamilton papers, disputed papers
• Results
Previous Work: Mosteller and
Wallace (1984)
• Experiment
– Use 18 Hamilton and 14 Madison papers to gather
information
– Test: known Hamilton papers, disputed papers
• Results
– Strong odds in favor of Hamilton for other known
Hamilton papers
– Strong odds in favor of Madison for all disputed papers
Previous Work: Corney (2003)
• Analyzed email data to determine:
– minimum message length
– minimum number of messages needed to model
an authors’ style
– which stylometric features can be used to
determine authorship
Previous Work: Corney (2003)
• Stylometric features
–
–
–
–
–
Proportion of white-space
Punctuation patterns
Function word frequencies
Frequency of 2-grams
Email-specific features
• Greetings, signatures, html tags
Previous Work: Corney (2003)
• Conclusions:
– Authorship attribution can be successfully
performed
– 200-250 words is enough
– 20 data points is enough for training
– Best feature: function words
– Not so great: 2-grams
Our Work: Trials with the
Federalist Papers
• Wrote scripts in Perl and Python to
compute
– Sentence length frequencies
– Word length frequencies
– Ratios of 3-letter words to 2-letter words
• Analyzed our data with graphing and
statistics software.
Sentence Length Frequencies
• Step 1: Parsing the text
– What constitutes a sentence?
“Mrs. Jones is has been working on her Ph.D. for 8.5 years.”
“Take the no. 7 bus downtown.”
“I said no.”
“What are you talking about ?!?!?!?!!”
“Sometimes….I just feel…anxious.”
Sentence Length Frequencies
• Step 2: Obtain sentence length data
i
M
H
i
M
H
1
1
0
10
19
21
2
0
0
11
15
20
3
0
0
…
…
…
4
1
0
30
26
21
5
9
2
31
28
16
6
6
6
32
26
28
7
14
7
…
…
…
8
22
6
173
0
1
9
16
14
201
1
0
i - sentence length
M - Number of length-i sentences in
known Madison papers
(1139 sentences)
H - Number of length-i sentences in
known Hamilton papers
(1142 sentences)
Sentence Length Frequencies
• Step 3: Graph the data
Sentence Length Distributions
• Step 4: Does the data show a difference
between Madison and Hamilton?
– View sentence lengths as sample data taken
from two distributions
– Apply the Kolmogorov-Smirnov test
Kolmogorov-Smirnov Test
A
B
1
4
A
B
– Two vectors of data values, taken from a 4
3
continuous distribution.
6
1
4
2
5
10
8
7
8
12
5
1
16 19
• Input:
• Method:
– Examines maximal vertical distance
between empirical cumulative
distribution curves
• Output:
– p-value
21 20
Kolmogorov-Smirnov Test
• Results of step 4:
– p-value for sentence length frequency data
is…
0.5121
Kolmogorov-Smirnov Test
• Results of step 4:
– p-value for sentence length frequency data
is…
0.5121
• Not too helpful…but there is hope!
– Try more features
– Try different features
Future Work
• Examine email data
• Build our own authorship-identification
tool
• Test new stylometric features for
distinguishing ability