STYLOMETRY IN IR SYSTEMS
Leyla BİLGE
Büşra ÇELİKKAYA
Kardelen HATUN
OUTLINE
4/20/2007
Stylistics and Stylometry
Applications of stylometry
History of stylometric research
Stylistic features
Recent Studies
Our approach
Conclusion
Stylometry in IR Systems
STYLISTICS
The theoretical framework for stylistics combines:
Halliday’s language theory
Sander’s theories of stylistics
Halliday says:
“A text is what is meant, selected from the total set of options that constitute what can be meant”
Sander says:
“Style is the result of choices made by an author from a range of possibilities offered by the language system”
STYLISTICS
Stylistic variation depends on
Author preferences and competence
Familiarity
Genre
Communicative context
Expected characteristics of the intended audience
Modeling, representing and utilizing this
variation is the business of stylistic analysis.
STYLOMETRY
Stylometry is the application of the study of linguistic style
Style refers to the linguistic choices of authors that persist over their works, independently of content
The aim is to describe a text from a formal perspective, using measures such as:
Number of words
Number of repetitions
Sentence length
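The three measures above can be computed with a few lines of standard-library Python. This is only a minimal sketch: the function name `describe`, the regex-based sentence splitting, and the reading of "repetitions" as every occurrence of a word beyond its first are assumptions made for illustration, not definitions taken from the slides.

```python
import re
from collections import Counter

def describe(text):
    """Return a few surface-level descriptors of a text (illustrative sketch)."""
    # Crude sentence split on ., ! and ?; a real system would use a proper segmenter.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased word tokens; \w+ also matches digits and underscores.
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    # Assumption: a "repetition" is any occurrence of a word after its first.
    repetitions = sum(c - 1 for c in counts.values())
    return {
        "n_words": len(words),
        "n_repetitions": repetitions,
        "mean_sentence_length": len(words) / len(sentences) if sentences else 0.0,
    }
```

For example, `describe("The cat sat. The cat ran!")` counts six words, two repetitions ("the" and "cat" each recur once), and a mean sentence length of three words.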
APPLICATIONS OF STYLOMETRY
Authorship attribution
Forensic author identification
To find the author of an anonymous text
Observation of the “characteristics” of a particular author
Organization and retrieval of documents based on their writing style
Systems for genre-based information retrieval
HISTORY OF STYLOMETRY
Stylometry grew out of analyzing text for evidence of authenticity and authorial identity
According to the modern practice of the discipline, authors can be identified by distinctive patterns in their use of language
With the development of computers and their growing capacity:
Large data sets can be analyzed
New methods can be generated and easily applied
HISTORY OF STYLOMETRY, CONT’D
Current research uses techniques based on term frequency counts
Frequency data are collected for common terms
These data are then analyzed using a range of fairly standard statistical techniques
However, these techniques cannot yet guarantee quality output, e.g. Ulysses
METHODOLOGY
Use a subset of structural and stylometric features on
a set of authors without consideration of author
characteristics
Currently, authorship attribution studies are
dominated by the use of lexical measures
Generally used statistics:
Word length
Syllables per word
Sentence length
Sentence count
Text length in words
Use of punctuation marks
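As a rough illustration of the statistics listed above, the sketch below computes them for a single text. The names `count_syllables` and `lexical_statistics` are hypothetical, the syllable count is an English-oriented vowel-group heuristic rather than a linguistic syllabifier, and normalizing punctuation by word count is an assumption.

```python
import re
import string

def count_syllables(word):
    """Rough estimate: number of vowel groups (English-oriented heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def lexical_statistics(text):
    """Compute simple lexical measures for one text (illustrative sketch)."""
    # ASCII letters and apostrophes only; other alphabets would need a wider pattern.
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    punctuation = sum(1 for ch in text if ch in string.punctuation)
    return {
        "text_length_in_words": n_words,
        "sentence_count": len(sentences),
        "mean_word_length": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "syllables_per_word": sum(count_syllables(w) for w in words) / n_words if n_words else 0.0,
        "mean_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Assumption: punctuation use is normalized by the number of words.
        "punctuation_per_word": punctuation / n_words if n_words else 0.0,
    }
```

Normalizing by text length keeps the values comparable between texts of different sizes, which matters when authors are represented by samples of unequal length.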
STYLISTIC FEATURES
Lexically-Based Methods
Vocabulary richness of the author
Frequencies of occurrence of individual words
Vocabulary diversity:
Type-token ratio V/N
V: size of vocabulary of sample text
N: number of tokens
Hapax legomena
Number of words that occur exactly once
Frequencies of occurrence:
Function words
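The lexically based measures above translate directly into code. The sketch below computes the type-token ratio V/N, the number of hapax legomena, and relative frequencies for a small function-word list. The list `FUNCTION_WORDS` and the function name are assumptions chosen for illustration; an actual study would use a much longer, language-specific list.

```python
import re
from collections import Counter

# Tiny illustrative list; not the word set used in any particular study.
FUNCTION_WORDS = ("the", "of", "and", "a", "in", "to")

def vocabulary_measures(text):
    """Type-token ratio, hapax legomena count and function-word frequencies."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    n = len(tokens)   # N: number of tokens
    v = len(counts)   # V: size of the vocabulary (distinct tokens)
    hapax = sum(1 for c in counts.values() if c == 1)
    return {
        "type_token_ratio": v / n if n else 0.0,
        "hapax_legomena": hapax,
        "function_word_freq": {w: (counts[w] / n if n else 0.0) for w in FUNCTION_WORDS},
    }
```

For "the cat and the dog", N is 5 and V is 4, so the type-token ratio is 0.8, and three words ("cat", "and", "dog") occur exactly once.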
STYLISTIC FEATURES
Problems:
Text length dependent
Unstable for short texts
Function word set requires manual effort
Specific to the group of authors considered
Solution:
Use the set of most frequent words
Both content-words and function words
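One way to read this solution is to derive the word list from the corpus itself rather than compiling it by hand. The sketch below, under that assumption, selects the k most frequent words over all texts and then describes each text by the relative frequencies of those words. The function names and the choice of k are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercased word tokens."""
    return re.findall(r"\w+", text.lower())

def most_frequent_words(texts, k):
    """The k most frequent words across all texts, content and function words alike."""
    totals = Counter()
    for text in texts:
        totals.update(tokenize(text))
    return [word for word, _ in totals.most_common(k)]

def frequency_profile(text, vocabulary):
    """Relative frequency of each vocabulary word in a single text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    n = len(tokens)
    return [counts[w] / n if n else 0.0 for w in vocabulary]
```

Because the vocabulary is computed automatically, no manual selection is needed, and the same procedure can be rerun for a different group of authors.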
RELATED STUDIES
Analysis of the text by a natural language
processing tool:
Use an existing NLP tool
Sentence and Chunk Boundaries Detector (SCBD)
Use sub-word units like character N-grams
instead of word frequencies:
Character sequences of length n
The most frequent n-grams provide information about the author’s stylistic choices at the lexical, syntactic and structural levels
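Character n-grams are simple to extract, as the following sketch suggests. It slides a window of length n over the raw text and reports the most frequent sequences with their relative frequencies. The function names are hypothetical, and keeping whitespace and punctuation inside the n-grams is an assumption, not something the slides specify.

```python
from collections import Counter

def char_ngrams(text, n):
    """All overlapping character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def top_ngrams(text, n, k):
    """The k most frequent character n-grams with their relative frequencies."""
    grams = char_ngrams(text, n)
    total = len(grams)
    return [(gram, count / total) for gram, count in Counter(grams).most_common(k)]
```

For instance, `char_ngrams("abab", 2)` yields `["ab", "ba", "ab"]`. No tokenizer or language-specific resource is required, which is one reason sub-word units are attractive.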
WORD BASED FEATURES
Bag-of-words
Apply stemming and stopword list
Function words
Content-free
POS Annotation
Feature Selection
Semantic Disambiguation
LINGUISTIC CONSTITUENTS
The structure of natural language sentences shows that word occurrences follow a specific order
Words are grouped into syntactic units called
“constituents”
Use word relationships by extracting constituents
for feature construction
Subdivide document into sentences
Construct a syntax tree for each sentence
SYNTAX TREE
Use a syntax tree representation of different authors’ sentences as features
OUR APPROACH
Use stylometry to analyze the following:
Texts translated by the same translator but written by different authors
Texts translated by different translators but written by the same author
PROPOSED STEPS
1. Feature Extraction
Determine which features represent the style best
2. Training
Training the classifier with a training set
Many methods are available (SVM, Bayesian…)
3. Recognition and Classification of texts
4. Analyzing the results of classification
1. FEATURE EXTRACTION
The stylometric features of a text can be:
Word length
Sentence length
Paragraph length
Character n-grams
Function words
Feature choices seriously affect classification results.
Then obtain a feature vector with n dimensions
V = {v1, v2, v3, …, vn}
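A feature vector of this kind might be assembled as in the sketch below. It is only an illustration: the function-word and n-gram lists are placeholder assumptions, paragraphs are assumed to be separated by blank lines, and lengths are measured in characters for words and in words for sentences and paragraphs.

```python
import re

FUNCTION_WORDS = ("the", "of", "and", "to")   # illustrative placeholder list
NGRAMS = ("th", "he", "an")                   # illustrative character bigrams

def mean(values):
    return sum(values) / len(values) if values else 0.0

def feature_vector(text):
    """Build V = [v1, ..., vn] from simple stylometric features (sketch)."""
    lowered = text.lower()
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", lowered)
    n_bigrams = max(len(lowered) - 1, 1)
    vector = [
        mean([len(w) for w in words]),                           # word length
        mean([len(re.findall(r"\w+", s)) for s in sentences]),   # sentence length
        mean([len(re.findall(r"\w+", p)) for p in paragraphs]),  # paragraph length
    ]
    # Relative frequency of each bigram; str.count is exact here because
    # none of the chosen bigrams can overlap with itself.
    vector += [lowered.count(g) / n_bigrams for g in NGRAMS]
    # Relative frequency of each function word.
    vector += [words.count(w) / len(words) if words else 0.0 for w in FUNCTION_WORDS]
    return vector
```

With these placeholder lists the vector has ten dimensions. In practice the features would usually be rescaled before training, since raw lengths and relative frequencies live on very different scales.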
2. TRAINING
Choose training data for every class
May be randomly selected texts
May be manually picked
Determine the corresponding parameters for each class
(Diagram labels: Training data, Feature Extraction, Class Parameters)
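The slides do not say which parameters are estimated for each class. As one simple possibility, the sketch below assumes the class parameters are the mean feature vector (centroid) of that class's training texts. The function name `train_centroids` and the input format are hypothetical.

```python
from collections import defaultdict

def train_centroids(samples):
    """Estimate one mean feature vector per class.

    samples: list of (label, feature_vector) pairs, all vectors the same length.
    Returns a dict mapping each label to its centroid.
    """
    grouped = defaultdict(list)
    for label, vector in samples:
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }
```

Other classifiers, such as an SVM or a Bayesian model, would replace the centroid with their own learned parameters while keeping the same overall flow.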
3. RECOGNITION AND CLASSIFICATION
Use the parameters we obtained from the training data
Compute the distance
Label the data
Classify the data
(Diagram labels: Distance, Recognition, Classification)
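A distance-based recognition step could look like the sketch below. It assumes each class is summarized by a single parameter vector and that Euclidean distance is used; the slides name neither, so both are illustrative choices, as are the function names.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def classify(vector, class_parameters):
    """Label a feature vector with the class whose parameters are closest.

    class_parameters: dict mapping label to a parameter vector.
    Returns (label, distance to that class).
    """
    distances = {label: euclidean(vector, params)
                 for label, params in class_parameters.items()}
    label = min(distances, key=distances.get)
    return label, distances[label]
```

Returning the distance along with the label makes it possible to inspect how confidently a text was assigned, which is useful when analyzing the classification results.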
RESULTS OF THE CLASSIFICATION
We will have two sets of results:
The original texts classified by author
The translated texts classified with no prior class information
Look at whether these are clustered in one class or in separate classes
These results will give us a clue about the two issues we stated at the beginning
Example: “The Picture of Dorian Gray” has been translated into Turkish by many translators
OUR AIM
With the right classification we will be able to identify:
Whether stylometric analysis works in finding an author in two different languages
Whether translations carry more of their translators’ style or still have their authors’ style
“…yet, to date, no stylometrist has managed to establish a methodology which is better able to capture the style of a text than that based on lexical items.”
CONCLUSION
Today there are many useful applications of stylometry.
Authorship attribution, plagiarism detection, genre-based information retrieval
What features are valuable for analysis is still an important question.
We aim to find the stylistic connection between a text and its translation.
REFERENCES
Computational Stylistics in Forensic Author Identification, Carole E. Chaski
Style vs. Expression in Literary Narratives, Özlem Uzuner, Boris Katz
Computer-Based Authorship Attribution Without Lexical Measures, E. Stamatatos, N. Fakotakis, G. Kokkinakis
Ensemble-Based Author Identification Using Character N-grams, E. Stamatatos
Combining Text and Linguistic Document Representations for Authorship Attribution, A. Kaster, S. Siersdorfer, G. Weikum