TXA982_1 - PhiloComp.net

Transcript TXA982_1 - PhiloComp.net

1
A Variety of Literary Puzzles
1. The Works of “Homer”
 The Iliad and the Odyssey are generally
attributed to a single individual named “Homer”.
But both are derived from long oral tradition,
and it is not clear whether they are indeed
ultimately the work of a single author (or group
of authors).
 Since historical evidence is almost non-existent
in this case, the only way of addressing this issue
is to look at internal evidence within the texts.
2
2. The Letters of Saint Paul
 The New Testament contains a large number of
letters traditionally attributed to Saint Paul, but it is
not clear whether he indeed wrote them all. In
particular the letter to the Hebrews has for a long
time been viewed suspiciously on theological
grounds, but these are hard to make precise.
 It is interesting to ask whether such theological
speculations can be backed up with hard statistical
data derived from textual analysis (e.g. does
Hebrews have similar word frequencies and
sentence lengths to Paul’s known letters?).
3
3. The Dialogues of Plato
 Plato developed his philosophy in the form of
dialogues, putting his own doctrines into the
mouth of his teacher Socrates. Socrates himself left
no writings, and there is clear
evidence of development in the philosophy of
Plato’s dialogues in various respects, but in which
directions did the development take place?
Knowing the order of his dialogues could be
crucial for the understanding of his ideas.
 Stylometric methods have been used to try to
place the dialogues in the correct order.
4
4. The Ethics of Aristotle
 There are two famous works of Ethics by
Aristotle, both transcribed from lecture notes by
others. These are the Eudemian Ethics and the
Nicomachean Ethics.
 However, books 5-7 of the Nicomachean Ethics
are identical to books 4-6 of the Eudemian Ethics,
raising the question of which of the two sources
these books originally came from.
 This issue is of philosophical as well as literary
interest, and has been settled using stylometric
research by philosopher Anthony Kenny.
5
5. The Plays of Shakespeare
 Notoriously Shakespeare’s plays have been
attributed to a variety of authors: e.g. Bacon,
Marlowe and various noblemen.
 There are also a number of plays (for example
Edward III and The Two Noble Kinsmen)
supposed by some to have been written either by
Shakespeare alone or co-authored by him.
 Stylometry has the potential to bring objective
analysis to bear on such (heated!) controversies,
and was first attempted by Mendenhall in 1901
(without the benefit of computers, of course!).
6
6. The Federalist Papers
 The Federalist Papers are a series of articles
published in 1787-88 with the aim of promoting
the ratification of the new United States Constitution. They
were written by three authors, Jay, Hamilton and
Madison, under the pseudonym “Publius”.
 Some of the papers are of known (and in some
cases joint) authorship but others are disputed.
Stylometric methods were famously brought to
bear by Mosteller and Wallace in the early 1960s
to attempt to identify the author of the disputed papers.
The question is now generally considered settled.
7
The Signature Stylometric Program
 The Signature computer program provides a
number of tools for “literary detection”. It can be
downloaded from www.etext.leeds.ac.uk/signature/
together with various resource files (e.g. Federalist
Papers, various novels, Shakespearean texts).
 Unpack all these into a suitable base directory (e.g.
C:\Signature), run signature.exe, and use “Load...”
from the File menu to select your texts.
 When loading texts, note that you can select many at
once, by “control-clicking”, i.e. holding down the
“Ctrl” key while you select the texts you want.
8
Frequency Tests within Signature
 Having loaded your texts, highlight them within
the file list box in the usual way (again control-clicking
if necessary to select several) – as you do
this, an appropriate graph and table will display
automatically on the right-hand side of the screen.
 Five different graphs (and tables) can be chosen
using the “tabs” at the bottom of the screen: these
display the frequency of word length, sentence
length, paragraph length, letters, and punctuation.
 Analysis of specific word frequencies is more
advanced, and will be dealt with later.
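The counts behind the word length and sentence length graphs can be approximated outside Signature. What follows is a minimal Python sketch, not Signature's own code: the function names, the regular expressions and the naive rule for splitting sentences are all assumptions made for illustration.

```python
import re
from collections import Counter

def word_length_counts(text):
    """Count how many words of each length occur in the text."""
    words = re.findall(r"[A-Za-z']+", text)
    # Strip stray apostrophes so that "'tis" counts as a 3-letter word.
    return Counter(len(w.strip("'")) for w in words if w.strip("'"))

def sentence_length_counts(text):
    """Count sentences by number of words, splitting naively on . ! ?"""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return Counter(len(s.split()) for s in sentences)

sample = "Sing, O goddess, the anger of Achilles. It brought countless ills."
wl = word_length_counts(sample)      # word length -> number of words
sl = sentence_length_counts(sample)  # sentence length -> number of sentences
```

A real text would need more careful handling of abbreviations, hyphens and quotation marks before the sentence counts could be trusted.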
9
Comparing Texts by Eye
 The graphs produced by the frequency tests
can give a good visual impression of the
similarities and differences between texts.
 This is usually much more striking if you
select “Display percentages” rather than
absolute frequencies (using the radio buttons at
the top), as then the relative frequencies within
the various texts appear on a matching scale.
 Graphs can be displayed “flat” or “deep” using
the “2D” or “3D” buttons. “Options” gives
access to many other display possibilities.
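The reason percentages compare better than absolute frequencies can be seen in a small sketch. The function name and the two frequency tables below are invented for illustration; they are not taken from Signature or from any real text.

```python
def to_percentages(counts):
    """Convert absolute frequencies into percentages of the total."""
    total = sum(counts.values())
    return {key: 100.0 * value / total for key, value in sorted(counts.items())}

# Made-up word length counts: a 100-word text and a 1000-word text.
short_text = {2: 30, 3: 40, 4: 30}
long_text = {2: 300, 3: 420, 4: 280}

short_pct = to_percentages(short_text)
long_pct = to_percentages(long_text)
```

Plotted as absolute counts, the longer text would dwarf the shorter one; as percentages, both sit on the same 0 to 100 scale and their shapes can be compared directly.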
10
Adding Rigour (1) - Large Data
 Simply comparing one text with another will not
necessarily give a reliable indication of common
(or different) authorship – any single text,
especially a short one, could be idiosyncratic for
many different reasons.
 Tests are far more reliable if they are carried out
on large bodies of data, so if you want to check,
say, a play for possible authorship by
Shakespeare, it is far better to test it against ALL
the relevant Shakespearean corpus (e.g. all the
known tragedies, or comedies, or even all his
plays) rather than just one other play.
11
Combining Texts
 To treat several texts as a single “corpus”, select
them with the mouse and click on the “Combine
Files into Corpus” button under the file list box.
 You’ll be invited to give a name to the corpus – take
this opportunity to give a meaningful name (e.g.
“Austen”, “Shakespeare tragedies”) as this will
appear in your graphs and tables. (To change a
corpus name or a file’s “alias”, double-click on it.)
 All “corpora” will appear in the lower box. These
can be graphed exactly as though they were single
files (so you can combine files and corpora within a
graph; to deselect all, right-click the relevant box).
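In terms of the underlying data, combining files into a corpus amounts to adding up their frequency tables. A minimal sketch of that idea follows; the function name and the counts are hypothetical, and how Signature stores a corpus internally is not stated here.

```python
from collections import Counter

def combine_into_corpus(name, *file_counts):
    """Add up per-file frequency tables into one named corpus table."""
    total = Counter()
    for counts in file_counts:
        total.update(counts)  # adds counts key by key
    return name, total

# Made-up word length counts for two files.
file_one = Counter({2: 120, 3: 200, 4: 180})
file_two = Counter({2: 90, 3: 150, 4: 160, 5: 40})

name, corpus = combine_into_corpus("Shakespeare tragedies", file_one, file_two)
```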
12
Adding Rigour (2) - Statistics
 You can add more rigour to your stylometric
tests by replacing mere comparison of graphs
by eye (which can be extremely misleading)
with objective statistical measurement.
 For this purpose the Signature system provides
an option under the Statistics menu that
performs a “Chi-Square” comparison test. It is
not necessary to understand the details of how
this test works (the calculations can just be left
to the computer), but you do need to be able to
interpret the results that it yields.
13
Principles of Statistical Testing
 Most statistical testing works on the principle of
calculating some “measure” (in this case, a measure
of how far two texts differ in some respect) which is
known to follow some standard “distribution” (in
this case, the so-called “Chi-Square” distribution).
 The form of the distribution can be looked up in
statistical tables (or stored in a computer program),
and then compared against the calculated measure.
 The result of the test is a “p-value”, which gives the
probability that mere random variation between two
samples would give rise to a difference measure of
at least the calculated magnitude.
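For readers who want to see how a p-value is obtained from a Chi-Square value, here is a sketch using only the Python standard library. It evaluates the upper tail of the Chi-Square distribution through a series for the incomplete gamma function. The function name is an assumption, and nothing here is claimed to match how Signature performs the lookup.

```python
import math

def chi_square_p_value(chi2, df):
    """Probability that a Chi-Square variable with df degrees of
    freedom is at least as large as chi2."""
    if chi2 <= 0:
        return 1.0
    a, x = df / 2.0, chi2 / 2.0
    # Series for the regularised lower incomplete gamma function P(a, x).
    term = 1.0 / a
    total = term
    n = 0
    while abs(term) > 1e-15 * abs(total):
        n += 1
        term *= x / (a + n)
        total += term
    lower = total * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

p = chi_square_p_value(5.991, 2)  # 5.991 is the standard 5% value for df = 2
```

With 2 degrees of freedom the p-value reduces to exp(-chi2 / 2), which gives a convenient check on the series. For extremely small p-values the subtraction from 1 loses precision, so a statistics library is the better choice in serious work.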
14
 Suppose we generate word length frequency data on
two texts S and R, R being a large “reference” text
(e.g. all the known works of some author) and S a
“sample” text (e.g. a work of unknown authorship).
 Then applying the Chi-Square test to these data can
be equivalent to asking:
 “Suppose text R gives a reliable guide to the word length
frequencies used by an author (so if R contains a 20%
proportion of 3-letter words, this reflects a general
tendency of that author to use 3-letter words on average
one word in five). Then what is the probability that such
an author, writing a text of the same length as S, would
purely by chance produce a text with a word length
frequency (WLF) ‘spectrum’ which differs from that of R
by as much as the actual WLF ‘spectrum’ of S differs?”
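The question above corresponds to the standard "goodness of fit" form of the Chi-Square calculation: the reference proportions predict an expected count for each word length in a text the size of S, and the squared differences from the observed counts are summed. The sketch below assumes that form; whether Signature uses exactly this formula is not stated here, and the counts are invented.

```python
def chi_square_against_reference(sample, reference):
    """Chi-Square value of sample counts against reference proportions."""
    n_sample = sum(sample.values())
    n_reference = sum(reference.values())
    chi2 = 0.0
    for length, ref_count in reference.items():
        if ref_count == 0:
            continue  # no expectation can be formed for this length
        expected = n_sample * ref_count / n_reference
        observed = sample.get(length, 0)
        chi2 += (observed - expected) ** 2 / expected
    return chi2

# Made-up word length counts: R has 1000 words, S has 100.
R = {1: 50, 2: 200, 3: 200, 4: 250, 5: 300}
S = {1: 8, 2: 15, 3: 25, 4: 22, 5: 30}

chi2 = chi_square_against_reference(S, R)  # expected counts: 5, 20, 20, 25, 30
```

Here the five terms are 1.8, 1.25, 1.25, 0.36 and 0, giving a Chi-Square value of 4.66. Word lengths that occur in S but never in R are ignored by this sketch, which is one more reason to restrict the test to well-populated lengths.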
15
The Meaning of the Chi-Square Value
 The “Chi-Square” value is calculated as a number,
and the larger the number, the less likely it is that a
text like the sample (in the measured respect, e.g.
word length frequencies) could have arisen by
chance from an author having the overall
characteristics manifested in the reference text.
 This number is usually compared against a standard
value (e.g. the “5%” value or the “1%” value), to
indicate whether it counts as “significant” to the
corresponding degree (typically in scientific
research, only a result which is at least “significant
at the 5% level” is taken to be worthy of note).
16
 If, for example, the test yields a Chi-Square value
which is between the “5%” and the “1%” values,
then Signature will display a message saying that
the difference between the sample and reference
texts is “significant at the 5% level” (i.e. sufficiently
great that you’d only expect such large differences
to arise by chance on average at most 1 time in 20).
 Signature will also display the actual Chi-Square
value given by the test, and also the two standard
values between which it lies (in the case above, the
“5%” and the “1%” values). This enables you to
see roughly how close the actual value came to
these limits (e.g. you might be able to see that it
very nearly achieved significance at the 1% level).
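The comparison against standard values can be sketched as a simple table lookup. The critical values below are the usual published ones for 1 to 8 degrees of freedom; the function name and the wording of the messages are assumptions, not Signature's actual output.

```python
# degrees of freedom -> (5% value, 1% value) of the Chi-Square distribution
CRITICAL_VALUES = {
    1: (3.841, 6.635),
    2: (5.991, 9.210),
    3: (7.815, 11.345),
    4: (9.488, 13.277),
    5: (11.070, 15.086),
    6: (12.592, 16.812),
    7: (14.067, 18.475),
    8: (15.507, 20.090),
}

def describe_significance(chi2, df):
    """Report which standard values the Chi-Square value exceeds."""
    five_percent, one_percent = CRITICAL_VALUES[df]
    if chi2 >= one_percent:
        return "significant at the 1% level"
    if chi2 >= five_percent:
        return "significant at the 5% level"
    return "not significant at the 5% level"

message = describe_significance(10.0, 4)  # lies between 9.488 and 13.277
```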
17
“Statistical” and Genuine Significance
 Texts vary more systematically than do the repeatable
scientific phenomena for which statistical tests are
most typically used. Even a “difference at the 1%
level” might not be genuinely significant in context – it
might be that almost any pair of texts you took, even
by the same author, would differ just as much!
 So when assessing the real significance of Signature’s
statistical results, it’s important to make appropriate
comparisons of the Chi-Square results you get.
 For this purpose, there’s an option under the File menu
to divide single texts into halves, so you can see what
results you get from testing one half against the other.
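The half-against-half comparison can be imitated as follows, to give a baseline Chi-Square value for a single author. This is a sketch under stated assumptions: the text is split at the middle word, word lengths are the feature tested, and the second half serves as the reference. How Signature itself divides a text is not described here.

```python
import re
from collections import Counter

def word_lengths(words):
    """Frequency table of word lengths."""
    return Counter(len(w) for w in words)

def chi_square(sample, reference):
    """Chi-Square value of sample counts against reference proportions."""
    n_s, n_r = sum(sample.values()), sum(reference.values())
    chi2 = 0.0
    for length, ref_count in reference.items():
        expected = n_s * ref_count / n_r
        chi2 += (sample.get(length, 0) - expected) ** 2 / expected
    return chi2

def half_against_half(text):
    """Test the first half of a text against its second half."""
    words = re.findall(r"[A-Za-z]+", text)
    middle = len(words) // 2
    first, second = words[:middle], words[middle:]
    return chi_square(word_lengths(first), word_lengths(second))

# An artificial text whose two halves are identical in style.
baseline = half_against_half("the cat sat on the mat " * 100)
```

The artificial text gives a baseline of zero; a real text will not. If a disputed text differs from the reference by no more than an author's own halves differ from each other, the result is not genuinely significant, whatever its nominal level.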
18
What Statistical Tests Do Not Prove!
 Many people who use statistical tests fall victim to a
natural misinterpretation of their results.
 The p-value does NOT give the probability that the
“sample” (in our case, the sample text) actually “came
from the same population as” (in our case, had the
same author as) the “reference” (the reference text).
 Rather, it gives the probability that a difference of at
least the magnitude measured between the “sample”
and the “reference” would arise by chance if a sample
of similar size were to be taken at random from the
reference “population”.
 THESE TWO ARE QUITE DIFFERENT!
19
How to Use the Chi-Square Result
 To assess the actual probability that “the sample came
from the same population as the reference” requires all
sorts of background information – it might, for
example, be known that the two works had the same
(or different) author, or that one author’s style is
notoriously inconsistent, or that two authors’ styles are
remarkably similar. Any other relevant information
bearing on the case (e.g. from conventional historical
and literary studies) would also need to be taken into
account in any assessment of the actual probability.
 Use the Chi-Square test, therefore, to confirm the
objective significance of apparent differences, rather
than as a definitive assessment of probability.
20
Degrees of Freedom
 The Chi-Square test can operate on any chosen range of
data, and its results can depend very largely on what
range is chosen. In a word length frequency test, for
example, you could apply the test to words of ALL
lengths, or just to the length range, say, 2 to 8.
 The larger the range of data included, the larger the “degrees
of freedom” (i.e. the scope for mere random variation to give
rise to a large Chi-Square value). Hence it can be more
difficult for the test to give a genuinely significant result.
 It’s best to restrict the test to the visually “crucial” and
“reliable” data range – where there’s plenty of data – and
to exclude the long thin “tail” of the graphs where
proportionately large random variation is to be expected.
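Restricting the range is easy to express in code. The sketch below assumes the usual rule that the degrees of freedom are one less than the number of categories tested; the function names and the word length counts are invented for illustration.

```python
def restrict_range(counts, shortest, longest):
    """Keep only the categories between shortest and longest inclusive."""
    return {k: v for k, v in counts.items() if shortest <= k <= longest}

def degrees_of_freedom(counts):
    """Number of categories minus one (assumed rule for this test)."""
    return len(counts) - 1

# Made-up word length counts with a long thin tail.
all_lengths = {1: 40, 2: 170, 3: 210, 4: 190, 5: 130, 6: 90, 7: 70,
               8: 50, 9: 25, 10: 12, 11: 6, 12: 4, 13: 2, 14: 1}

core = restrict_range(all_lengths, 2, 8)
df_all = degrees_of_freedom(all_lengths)  # 13
df_core = degrees_of_freedom(core)        # 6
```

On that rule the standard 5% value rises from 12.592 at 6 degrees of freedom to 22.362 at 13, so the sparse tail raises the bar for significance while contributing mostly noise.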
21
Reference and Sample Texts
 We have so far taken the “reference” and “sample”
texts to be quite distinct, but if both are quite short, it
can be best to include the “sample” text within the
“reference” (there is a checkbox in Signature to do
this automatically; this is ticked by default, but will
be greyed out if the texts involve overlapping
corpora). Applying the test is then like asking:
 “Suppose we have an author whose style is typified by the
combination of the reference and the sample texts, R and S.
Then what is the probability that such an author, writing a
text of the same length as S, would purely by chance ...”
 The principle here is simply that statistical results are
more reliable when they’re based on more data.
22
Word Lists
 The tests looked at so far are relatively crude from a
literary point of view. Stylistic differences can be
shown more sensitively by testing for authors’ usage
of particular words rather than mere word lengths etc.
 Signature provides a Word List facility (under the
Wordlists menu). Select “Edit wordlists”, then click
on “New List”, give your word list a name, then type
into the Memo the words you want the list to contain,
before clicking on “OK” (also indicate if you want
your list to be case-sensitive). Now when you select
any texts, they will if necessary be re-read, and a
graph of the listed word occurrences displayed on the
“Words” tab (see bottom-right corner of the screen).
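What a word list measures can be sketched in a few lines of Python. The function name, the tokenising rule and the example sentence are assumptions for illustration, and the case_sensitive parameter simply mirrors the option mentioned above.

```python
import re
from collections import Counter

def word_list_counts(text, word_list, case_sensitive=False):
    """Count occurrences in the text of each word on the list."""
    tokens = re.findall(r"[A-Za-z']+", text)
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
        word_list = [w.lower() for w in word_list]
    counts = Counter(tokens)
    return {word: counts[word] for word in word_list}

text = "Upon reflection, the plan rests upon trust, while others wait on events."
words = ["upon", "while", "whilst", "on"]

loose = word_list_counts(text, words)                        # "Upon" counts as "upon"
strict = word_list_counts(text, words, case_sensitive=True)  # "Upon" does not
```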
23
Key Words
 Signature’s key words facility can help you decide
which words are most useful for author identification.
To illustrate this, load the five Federalist Papers files
provided, click on “Hamilton” and “Madison”, then
select “Key words” from the Wordlists menu. A table
will be generated showing the top 50 “keywords” that
tell in favour of Hamilton (+ values), followed by the
top 50 that tell in favour of Madison (– values).
 Move down the list, control-clicking to select “upon”,
“while”, “whilst” and “on”. Then click on the cyan
label, and you’ll enter the word list facility with the
chosen words already included. Use this to identify the
true author of the “Unknown” Federalist Papers!
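One simple way to find candidate key words is to compare how often each word occurs per thousand words in the two corpora and rank by the difference. The sketch below does this. It is an assumed stand-in, since the scoring that Signature actually uses for its key words table is not described here, and the two short texts are invented rather than drawn from the Federalist Papers.

```python
import re
from collections import Counter

def rates_per_thousand(text):
    """Occurrences of each word per 1000 words of text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return {w: 1000.0 * c / len(words) for w, c in counts.items()}

def key_words(text_a, text_b, top=3):
    """Words most characteristic of A (positive scores) and of B (negative)."""
    rate_a, rate_b = rates_per_thousand(text_a), rates_per_thousand(text_b)
    scores = {w: rate_a.get(w, 0.0) - rate_b.get(w, 0.0)
              for w in set(rate_a) | set(rate_b)}
    # Sort by descending score; break ties alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top], ranked[::-1][:top]

author_a = "upon the whole the matter rests upon the people"
author_b = "whilst the people wait on the states the states act"

favour_a, favour_b = key_words(author_a, author_b)
```

Common words with a small difference, such as "the", land in the middle of the ranking, which is why the useful discriminators tend to be words like "upon" and "whilst" that one author favours and the other avoids.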
24
Other Features of Signature
 The File menu enables you to read texts into multiple
files (where “<DIV …>” tags are present), and to reload texts (useful if editing has taken place meanwhile,
or if loading was previously aborted for some reason).
You can also “hide” texts to avoid cluttering the lists.
 The View menu provides a text viewer into which texts
can be read (at the cost of speed). This helps when
defining an appropriate “filter” (also accessible through
the View menu) for removing unwanted textual
artefacts (e.g. HTML entities, or changing “Mr.” to
“Mr” to avoid misidentification of sentence endings).
 The Graphs and Tables menus provide a variety of
formats for copying graphs/tables to the ClipBoard.
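The effect of a filter on sentence counting can be illustrated in Python. This is an analogue only: the syntax of Signature's own filters is not given here, and the abbreviation table, function names and example sentence are assumptions.

```python
import html
import re

# Abbreviations whose full stop would otherwise look like a sentence ending.
ABBREVIATIONS = {"Mr.": "Mr", "Mrs.": "Mrs", "Dr.": "Dr"}

def clean(text):
    """Decode HTML entities and remove full stops from abbreviations."""
    text = html.unescape(text)
    for abbreviation, replacement in ABBREVIATIONS.items():
        text = text.replace(abbreviation, replacement)
    return text

def count_sentences(text):
    """Count sentences, splitting naively on . ! ?"""
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])

raw = "Mr. Darcy bowed &amp; left. Mrs. Bennet sighed."
before = count_sentences(raw)        # abbreviations inflate the count
after = count_sentences(clean(raw))  # the two real sentences
```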
