STYLOMETRY IN IR SYSTEMS
Leyla BİLGE
Büşra ÇELİKKAYA
Kardelen HATUN
4/20/2007

OUTLINE
- Stylistics and Stylometry
- Applications of stylometry
- History of stylometric research
- Stylistic features
- Recent Studies
- Our approach
- Conclusion

STYLISTICS
The theoretical framework for stylistics combines:
- Halliday’s Language Theory
- Sander’s Theories of Stylistics

Halliday says:
“A text is what is meant, selected from the total set of options that constitute what can be meant”

Sander says:
“Style is the result of choices made by an author from a range of possibilities offered by the language system”

STYLISTICS, CONT’D
Stylistic variation depends on
- Author preferences and competence
- Familiarity
- Genre
- Communicative context
- Expected characteristics of the intended audience

Modeling, representing and utilizing this variation is the business of stylistic analysis.

STYLOMETRY
The application of the study of linguistic style

Style refers to the linguistic choices of authors that persist over their works, independently of content

The aim is to describe a text from a rather formal perspective, using measures such as:
- Number of words
- Number of repetitions
- Sentence length

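The three formal measures listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `formal_profile`, the regex-based tokenization, and the reading of "repetitions" as word occurrences beyond the first are our choices, not something the slides define.

```python
import re
from collections import Counter

def formal_profile(text):
    """Return word count, repeated-word count and mean sentence length."""
    # Naive segmentation (assumption): sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of letters; digits and punctuation are ignored.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    # "Repetitions" is taken here to mean every occurrence after the first.
    repetitions = sum(c - 1 for c in counts.values())
    mean_len = len(words) / len(sentences) if sentences else 0.0
    return {
        "words": len(words),
        "repetitions": repetitions,
        "mean_sentence_length": mean_len,
    }

profile = formal_profile("The cat sat. The cat ran away!")
```

A real study would likely replace the regex splitting with a proper sentence and word tokenizer for the language at hand.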
APPLICATIONS OF STYLOMETRY
- Authorship attribution
- Forensic author identification
  - To find the author of an anonymous text
- Observation of the “characteristics” of a particular author
- Organization and retrieval of documents based on their writing style
- Systems for genre-based information retrieval

HISTORY OF STYLOMETRY
Stylometry grew out of analyzing text for evidence of authenticity and authorial identity

According to the modern practice of the discipline, a language has distinctive patterns that can be used to identify authors

With the development of computers and their capacities:
- Large data sets can be analyzed
- New methods can be generated and easily applied

HISTORY OF STYLOMETRY, CONT’D
Current research uses techniques based on term frequency counts
- Frequency data are collected for common terms
- These data are then analyzed using a range of fairly standard statistical techniques

However, these techniques cannot yet guarantee quality output, e.g. for Ulysses

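As a rough illustration of frequency-based analysis, the sketch below collects rates per thousand tokens for a few common terms and compares two texts by their mean absolute difference. The term list `COMMON_TERMS`, both function names, and the choice of comparison statistic are assumptions made for this example; the slides do not say which terms or which statistical technique were used.

```python
from collections import Counter

# Hypothetical list of common terms, chosen only for illustration.
COMMON_TERMS = ["the", "of", "and", "to", "a"]

def rates_per_thousand(tokens, terms=COMMON_TERMS):
    """Occurrences of each term per 1000 tokens of the text."""
    counts = Counter(tokens)
    total = len(tokens)
    return [1000.0 * counts[t] / total if total else 0.0 for t in terms]

def mean_abs_difference(rates_a, rates_b):
    """A simple dissimilarity: average absolute gap between two rate profiles."""
    return sum(abs(a - b) for a, b in zip(rates_a, rates_b)) / len(rates_a)

text_a = "the cat and the dog went to the park".split()
text_b = "a cat went to a park and a dog slept".split()
rates_a = rates_per_thousand(text_a)
rates_b = rates_per_thousand(text_b)
gap = mean_abs_difference(rates_a, rates_b)
```

Normalizing by text length keeps the profiles comparable across texts of different sizes, although very short texts still give unstable rates.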
METHODOLOGY
Use a subset of structural and stylometric features on a set of authors without consideration of author characteristics

Currently, authorship attribution studies are dominated by the use of lexical measures

Generally used statistics:
- Word length
- Syllables per word
- Sentence length
- Sentence count
- Text length in words
- Use of punctuation marks

STYLISTIC FEATURES
Lexically-Based Methods
- Vocabulary richness of the author
- Frequencies of occurrence of individual words

Vocabulary diversity:
- Type-token ratio V/N
  - V: size of vocabulary of sample text
  - N: number of tokens
- Hapax legomena
  - How many words occur once

Frequencies of occurrence:
- Function words
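
The two vocabulary diversity measures above are simple to compute once a text is tokenized. The following is a minimal sketch; the function name `vocabulary_richness` and the whitespace tokenization in the example are assumptions for illustration.

```python
from collections import Counter

def vocabulary_richness(tokens):
    """Type-token ratio V/N and hapax legomena count for a token list."""
    counts = Counter(tokens)
    n = len(tokens)   # N: number of tokens
    v = len(counts)   # V: size of the vocabulary (distinct tokens)
    hapax = sum(1 for c in counts.values() if c == 1)  # words occurring once
    return {"N": n, "V": v, "ttr": v / n if n else 0.0, "hapax": hapax}

tokens = "to be or not to be that is the question".split()
richness = vocabulary_richness(tokens)
```

Here N = 10 and V = 8, so V/N = 0.8, and six words ("or", "not", "that", "is", "the", "question") occur exactly once.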
STYLISTIC FEATURES, CONT’D
Problems:
- Text length dependent
- Unstable for short texts
- Function word set requires manual effort
- Specific to the group of authors considered

Solution:
- Use set of most frequent words
  - Both content words and function words
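
One way to realize this solution is to rank all words in the corpus by frequency and use the top k as the feature set, with no hand-made function word list. The sketch below assumes pre-tokenized documents; the names `most_frequent_words` and `relative_frequencies`, and the value of k, are illustrative choices.

```python
from collections import Counter

def most_frequent_words(documents, k):
    """Pick the k most frequent words over the whole corpus."""
    totals = Counter()
    for tokens in documents:
        totals.update(tokens)
    # Descending count, then alphabetical, so that ties are deterministic.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]

def relative_frequencies(tokens, vocabulary):
    """Share of the document's tokens taken by each vocabulary word."""
    counts = Counter(tokens)
    total = len(tokens)
    return [counts[w] / total if total else 0.0 for w in vocabulary]

docs = ["the sea and the sky".split(), "the ship and the sea".split()]
vocab = most_frequent_words(docs, 3)
features = relative_frequencies(docs[0], vocab)
```

Because the list is derived from the data, content words such as "sea" can enter it alongside function words such as "the" and "and".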

RELATED STUDIES
Analysis of the text by a natural language processing tool:
- Use existing NLP tool
- Sentence and Chunk Boundaries Detector (SCBD)

Use sub-word units like character n-grams instead of word frequencies:
- Character sequences of length n
- Most frequent n-grams provide information about author’s stylistic choices on lexical, syntactical and structural level
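
Character n-grams need no tokenizer at all, which is part of their appeal. A minimal sketch of extracting them and picking the most frequent ones follows; the function names and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter

def char_ngrams(text, n):
    """All overlapping character sequences of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def top_ngrams(text, n, k):
    """The k most frequent character n-grams with their counts."""
    counts = Counter(char_ngrams(text, n))
    # Descending count, then alphabetical, so that ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

grams = char_ngrams("banana", 3)
best = top_ngrams("banana", 3, 1)
```

Since spaces and punctuation are kept as ordinary characters, the n-grams also pick up habits such as punctuation use and word boundaries.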

WORD BASED FEATURES
- Bag-of-words
  - Apply stemming and stopword list
- Function words
  - Content-free
- POS Annotation
- Feature Selection
- Semantic Disambiguation

LINGUISTIC CONSTITUENTS
- The structure of natural language sentences shows that word occurrences follow a specific order
- Words are grouped into syntactic units called “constituents”
- Use word relationships by extracting constituents for feature construction
  - Subdivide document into sentences
  - Construct a syntax tree for each sentence

SYNTAX TREE
Use a syntax tree representation of different authors’ sentences as features

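One possible way to turn a syntax tree into features is to count its production rules (for example, "S -> NP VP"). The sketch below is an assumption-laden illustration: the tree is written by hand as nested tuples with Penn Treebank style labels rather than produced by a parser, and the slides do not say which tree encoding the cited work uses.

```python
from collections import Counter

def production_rules(tree):
    """Yield 'PARENT -> CHILD ...' for each internal node of a tuple tree.

    A tree is (label, child, child, ...); a leaf node is (POS, word).
    """
    label, *children = tree
    if all(isinstance(c, str) for c in children):
        return  # (POS, word) node: skip lexical rules to stay content-free
    yield f"{label} -> {' '.join(child[0] for child in children)}"
    for child in children:
        yield from production_rules(child)

# Hand-written parse of "the author writes novels" (hypothetical example).
tree = ("S",
        ("NP", ("DT", "the"), ("NN", "author")),
        ("VP", ("VBZ", "writes"), ("NP", ("NNS", "novels"))))
rule_counts = Counter(production_rules(tree))
```

Counting rules over all sentences of a document gives a frequency profile of the author's preferred sentence structures.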
OUR APPROACH
Use stylometry to analyze the following:
- Texts translated by the same translator but written by different authors
- Texts translated by different translators but written by the same author

PROPOSED STEPS
1. Feature Extraction
   - Determine which features represent the style best
2. Training
   - Training the classifier with a training set
   - Many methods present (SVM, Bayesian, …)
3. Recognition and Classification of texts
4. Analyzing the results of classification

1. FEATURE EXTRACTION
The stylometric features of a text can be:
- Word length
- Sentence length
- Paragraph length
- Character n-grams
- Function words

Feature choices seriously affect classification results.

Then obtain a feature vector with n dimensions:
V = {v1, v2, v3, …, vn}

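To make the feature vector concrete, the sketch below builds a small V from some of the features listed above. Everything specific here is an assumption for illustration: the three-word `FUNCTION_WORDS` list, the regex tokenization, paragraphs separated by blank lines, and the choice to leave character n-grams out to keep the example short.

```python
import re

# Hypothetical, deliberately tiny function word list.
FUNCTION_WORDS = ["the", "and", "of"]

def feature_vector(text):
    """Build V = [v1, ..., vn] from simple length and function word features."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text.lower())
    n_words = len(words)
    if n_words == 0:
        return [0.0] * (3 + len(FUNCTION_WORDS))
    v = [
        sum(len(w) for w in words) / n_words,  # mean word length (characters)
        n_words / len(sentences),              # mean sentence length (words)
        n_words / len(paragraphs),             # mean paragraph length (words)
    ]
    # One dimension per function word: its share of all words in the text.
    v += [words.count(fw) / n_words for fw in FUNCTION_WORDS]
    return v

sample = "The cat sat. The dog ran.\n\nAnd the end of it."
v = feature_vector(sample)
```

In practice the dimensions have very different scales, so they would usually be normalized before any distance is computed between vectors.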
2. TRAINING
Choose training data for every class
- May be randomly selected texts
- May be manually picked

Determine the parameters corresponding to each class

[Diagram: Training data → Feature Extraction → Class Parameters]

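The slides do not specify what the class parameters are. As a deliberately simple stand-in for the SVM or Bayesian methods mentioned earlier, the sketch below takes each class's parameters to be the mean of its training feature vectors (a centroid). The function name, class labels and numbers are invented for illustration.

```python
def train_centroids(training_data):
    """Map each class label to the mean of its training feature vectors.

    training_data: dict of label -> list of equal-length feature vectors.
    """
    centroids = {}
    for label, vectors in training_data.items():
        n = len(vectors)
        # zip(*vectors) walks the vectors dimension by dimension.
        centroids[label] = [sum(column) / n for column in zip(*vectors)]
    return centroids

# Made-up vectors: [mean word length, mean sentence length]
training_data = {
    "author_a": [[4.0, 12.0], [5.0, 14.0]],
    "author_b": [[6.0, 20.0], [6.0, 24.0]],
}
centroids = train_centroids(training_data)
```

A centroid model has no tunable settings, which makes it a convenient baseline before moving to a stronger classifier.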
3. RECOGNITION AND CLASSIFICATION
- Use the parameters we obtained from training data
- Compute the distance
- Label the data
- Classify the data

[Diagram labels: Recognition, Distance, Classification]

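The recognition steps can be sketched as a nearest-centroid rule: compute the distance from the new text's feature vector to each class's parameters and assign the closest label. Euclidean distance and the centroid values are assumptions for this example, since the slides do not name a distance measure. `math.dist` requires Python 3.8 or later.

```python
import math

def classify(vector, centroids):
    """Label a feature vector with the class whose centroid is nearest."""
    # Euclidean distance from the vector to every class centroid.
    distances = {label: math.dist(vector, c) for label, c in centroids.items()}
    label = min(distances, key=distances.get)
    return label, distances

# Hypothetical class parameters obtained from a training step.
centroids = {"author_a": [4.5, 13.0], "author_b": [6.0, 22.0]}
label, distances = classify([5.0, 15.0], centroids)
```

Returning the full distance table as well as the label makes it possible to see how close the decision was.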
RESULTS OF THE CLASSIFICATION
We will have two sets of results:
- The original texts classified by author
- The translated texts classified with no prior class information

These results will give us a clue about the two issues we stated at the beginning

Example: “The Picture of Dorian Gray” has been translated into Turkish by many translators
- Check whether these translations are clustered in one class or in separate classes

OUR AIM
With the right classification we will be able to identify:
- Whether stylometric analysis works in finding an author in two different languages
- Whether translations carry more of their translators’ style or still have their authors’ style

“…yet, to date, no stylometrist has managed to establish a methodology which is better able to capture the style of a text than that based on lexical items.”

CONCLUSION
Today there are many useful applications of stylometry.
- Authorship attribution, plagiarism detection, genre-based information retrieval

Which features are valuable for analysis is still an important question.

We aim to find the stylistic connection between a text and its translation.

REFERENCES
- Computational Stylistics in Forensic Author Identification, Carole E. Chaski
- Style vs. Expression in Literary Narratives, Özlem Uzuner, Boris Katz
- Computer-Based Authorship Attribution Without Lexical Measures, E. Stamatatos, N. Fakotakis, G. Kokkinakis
- Ensemble-Based Author Identification Using Character N-grams, E. Stamatatos
- Combining Text and Linguistic Document Representations for Authorship Attribution, A. Kaster, S. Siersdorfer, G. Weikum