Adding typology to
lexicostatistics: a combined
approach to language
classification
ASJP Consortium
< Dik Bakker et al. mult. >
Language Classification
Overview
Project:
ASJP (Automated Similarity Judgment Program), started January 2007
Overall goal:
Automatic reconstruction of language relationships
Basis:
Distance matrix between individual languages based
on lexical elements
Method:
Lexicostatistics: mass comparison of basic lexical items
As in traditional lexicostatistics, but:
1. use of computational algorithms and tools
2. methodology from classification in biology
3. extended by all relevant data available
Caveat:
ASJP goal:
Reconstruction of relationships between languages
NOT: better than experts in classification of areas/groups
BUT:
1. Optimize lexicostatistics on basis of expert knowledge
on well-explored areas
2. Provide method and tools to assess and improve
classifications for un(der)explored areas
Overview
Current collaborators:
Dik Bakker
David Beck
Oleg Belyaev
Cecil H. Brown
Pamela Brown
Matthew Dryer
Dmitry Egorov
Pattie Epps
Anthony Grant
Eric W. Holman
Hagen Jung
Johann-Mattis List
Robert Mailhammer
André Müller
Uri Tadmor
Matthias Urban
Viveka Velupillai
Søren Wichmann
Kofi Yakpo
Overview ASJP system
[Diagram, built up over several slides: LEX (lexical data) -> ASJP software -> distance matrix -> CLASSIF software. Existing expert classifications (ETHN = Ethnologue, WALS, EXPRT) and STAT software are used for EVALUATION and CALIBRATION of the method. Further components added in later builds: GEOGRAPH data with MAP software, HIST FACTS, and TYPOL DATA.]
Sample entries of the distance matrix:
DUTCH   ENGLISH    53.3
DUTCH   FRENCH     72.7
DUTCH   MANDARIN   93.8
…
Today …
[Diagram: LEX + TYPOL DATA -> ASJP software -> distance matrix -> CLASSIF software]
List of basic lexical items
Lexical items
Word list Morris Swadesh (1955):
100 basic meanings
1. I         21. dog       41. nose      61. die     81. smoke
2. you       22. louse     42. mouth     62. kill    82. fire
3. we        23. tree      43. tooth     63. swim    83. ash
4. this      24. seed      44. tongue    64. fly     84. burn
5. that      25. leaf      45. claw      65. walk    85. path
6. who       26. root      46. foot      66. come    86. mountain
7. what      27. bark      47. knee      67. lie     87. red
8. not       28. skin      48. hand      68. sit     88. green
9. all       29. flesh     49. belly     69. stand   89. yellow
10. many     30. blood     50. neck      70. give    90. white
11. one      31. bone      51. breasts   71. say     91. black
12. two      32. grease    52. heart     72. sun     92. night
13. big      33. egg       53. liver     73. moon    93. hot
14. long     34. horn      54. drink     74. star    94. cold
15. small    35. tail      55. eat       75. water   95. full
16. woman    36. feather   56. bite      76. rain    96. new
17. man      37. hair      57. see       77. stone   97. good
18. person   38. head      58. hear      78. sand    98. round
19. fish     39. ear       59. know      79. earth   99. dry
20. bird     40. eye       60. sleep     80. cloud   100. name
Lexical items
Swadesh list: assumptions
- A word for each meaning exists in most languages
- Inherited rather than borrowed
- Relatively stable over time
- Easily accessible (fieldwork / grammars)
Lexical items
Languages transcribed to date:
- Over 3500 languages (incl. dialects; around 45%
of the languages of the world)
[Map: languages currently collected]
Lexical items: further reduction
Reduction of the full Swadesh list:
1. Not the complete list, only most stable items
2. Not full IPA representation, but generalized coding
1. Not the complete list
- Most stable items = least formal variation in
well-established genetic groups (Dryer’s genera)
Nichols (1995):
stability of word k = language pairs with (word k = word k) / all pairs
-> What is the optimal number … ?
[Charts, built up over several slides: fit with the Ethnologue Classification* and the WALS Classification**; horizontal axis: stability of items, from + to -; the value 40 is marked]
*Goodman-Kruskal
**Pearson
[Slide: the 100-item Swadesh list repeated, with the 40 most stable items marked]
Lexical items: transcription
2. NOT full IPA but ASJPcode:
7 Vowels
34 Consonants
All other phonemes mapped to ‘closest sound’ (automatic)
Abaza (Caucasian):
Meaning   IPA            ASJPcode
PERSON    ʕʷɨʧʼʲʷʕʷɨs    Xw3Cw"yXw3s
LEAF      bɣʲɨ           bxy3
SKIN      ʧʷazʲ          Cwazy
HORN      ʧʼʷɨʕʷa        Cw"3Xwa
NOSE      pɨnʦʼa         p3nc"a
TOOTH     pɨʦ            p3c
Loss of information?
Shown for representative groups:
- ASJP as good for separating language families as
full IPA
- More accurate for precise genetic classification
than IPA (under our current method)
Comparing words and languages
Comparing words
Most successful measure to date:
Levenshtein Distance
Levenshtein Distance (LD) =
Number of transformations (= changes & additions)
to get from the shorter form to the longer form
Example: ALT -> ASJP takes 3 transformations, so LD = 3
1. Normalization:
LDN = LD / Lmax   (Lmax = length of the longer form; range 0.0 to 1.0)
2. Eliminate ‘background noise’:
LDND = LDN / LDN(different pairs)   (different pairs = word pairs with different meanings)
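The three measures can be tried out with a minimal sketch. It assumes that every character of an ASJPcode string counts as one symbol (modifiers such as * or ~ are not treated specially here), and that LDND for a language pair is the mean LDN over same-meaning pairs divided by the mean LDN over different-meaning pairs. The function names are illustrative and are not taken from the ASJP software; the word forms come from the sample word list on a later slide.

```python
from itertools import permutations


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ldn(a: str, b: str) -> float:
    """LD normalized by the length of the longer form (0.0 to 1.0)."""
    return levenshtein(a, b) / max(len(a), len(b))


def ldnd(words1: dict, words2: dict) -> float:
    """Assumed reading: mean LDN of same-meaning pairs / mean LDN of different-meaning pairs."""
    shared = sorted(set(words1) & set(words2))
    same = [ldn(words1[m], words2[m]) for m in shared]
    diff = [ldn(words1[m], words2[n]) for m, n in permutations(shared, 2)]
    return (sum(same) / len(same)) / (sum(diff) / len(diff))


# Five-item excerpts (I, YOU, WE, ONE, TWO) copied from the sample word list.
mandarin = {"I": "wo", "YOU": "nimen", "WE": "women", "ONE": "i", "TWO": "el"}
hmong_daw = {"I": "ku", "YOU": "ko", "WE": "pe", "ONE": "i", "TWO": "o"}
tak_hmong = {"I": "ku", "YOU": "ko", "WE": "pe", "ONE": "i", "TWO": "o"}

print(levenshtein("ALT", "ASJP"))   # 3, as on the slide
print(ldn("ALT", "ASJP"))           # 0.75
print(ldnd(hmong_daw, tak_hmong))   # 0.0 (identical lists)
print(ldnd(mandarin, hmong_daw))
```

The distance tables in these slides report values such as 81.75, which suggests that the result is multiplied by 100 for display; that scaling is left out of the sketch.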
Classifying languages
LNG              SIL   I     YOU        WE        ONE     TWO
CANTONESE        yue   Noh   neihdeih   Nhdeih    yat     yih
HAINAN_MINNAN    nan   va    lu         vaneN     zy~a7   no*|no
HAKKA            hak   Nai   Ni         Naiteu    yit     ly~oN|Ni
MANDARIN         cmn   wo    nimen      women     i       el
SUZHOU_WU        wuu   No    nE         Sia*nj3   ji7     lia*
A_TONG           aot   aN    naN        niN       sa      ni
MIKIR            mjw   ne    nEng       netum     isi     hini
TARAON           mhu   ha*   nu*        niN       kiN     kaiN
NAXI             nbf   N3    nv         N3Ng3     d3      5i
CHIANGRAI_MIEN   ium   yia   mei        bua       yet     i
HMONG_DAW        mww   ku    ko         pe        i       o
SUYONG_HMONG     mww   ko    ko         pe        i       au
TAK_HMONG        mww   ku    ko         pe        i       o
…
[Diagram: Swadesh lists (3500) -> ASJP -> distance matrix]
LG1        LG2                LDND
MANDARIN   MIDDLE_CHINESE     81.75
MANDARIN   OLD_CHINESE        94.30
MANDARIN   SUZHOU_WU          85.87
MANDARIN   DHAMMAI            97.48
MANDARIN   A_TONG             97.91
MANDARIN   KAYAH_LI_EASTERN   94.75
MANDARIN   MIKIR              99.05
MANDARIN   LEPCHA             97.24
MANDARIN   APATANI            92.24
MANDARIN   BENGNI             96.91
MANDARIN   BOKAR              95.28
…
3500 languages ~ 240,000,000 comparisons
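The order of magnitude of that figure can be checked with a little arithmetic. The sketch below assumes, as one plausible reading, that every pair of languages is compared on a 40-item list and that only same-meaning word pairs are counted; the slides do not spell out how the number was obtained, and the different-meaning comparisons needed for LDND would add to it.

```python
n_languages = 3500
n_items = 40  # assumption: the reduced 40-item list

# Each unordered pair of languages is compared once.
language_pairs = n_languages * (n_languages - 1) // 2
# One same-meaning word comparison per list item and language pair.
word_comparisons = language_pairs * n_items

print(f"{language_pairs:,} language pairs")                    # 6,123,250 language pairs
print(f"{word_comparisons:,} same-meaning word comparisons")   # 244,930,000 same-meaning word comparisons
```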
Processing problems …
Solution: parallel processing
[Diagram: Swadesh lists (3500) -> ASJP -> distance matrix -> MEGA4 (software for DNA patterns), using Neighbour Joining]
MEGA4: http://www.megasoftware.net/
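How a tree is obtained from a distance matrix can be illustrated with a bare-bones version of neighbour joining. This is only a sketch of the general algorithm, not MEGA4's implementation: it returns the branching order as nested tuples and omits branch lengths. The five taxa and their distances are invented toy values, not ASJP data.

```python
def neighbor_joining(names, dist):
    """Return the branching order as nested tuples (branch lengths omitted)."""
    nodes = list(names)
    d = {(a, b): dist[i][j]
         for i, a in enumerate(names) for j, b in enumerate(names)}
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all other nodes.
        r = {a: sum(d[a, b] for b in nodes if b != a) for a in nodes}
        pairs = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
        # Join the pair that minimizes the neighbour-joining criterion Q.
        a, b = min(pairs, key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]])
        new = (a, b)
        for c in nodes:
            if c not in (a, b):
                d[new, c] = d[c, new] = 0.5 * (d[a, c] + d[b, c] - d[a, b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    return tuple(nodes)


# Hypothetical toy distances for five taxa (symmetric matrix, zero diagonal).
taxa = ["A", "B", "C", "D", "E"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
tree = neighbor_joining(taxa, matrix)
print(tree)  # A and B are joined as one pair, D and E as another
```

With real data the matrix would hold the pairwise LDND values, one row and column per language.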
SEE COMPLETE TREE-OF-THE-MONTH ON:
email.eva.mpg.de/~wichmann/ASJPHomePage
[Tree, shown over several slides: Mayan (34 of the 69 languages in Ethnologue), LDND + Mega4, scale bar 10; < all & only >. Subgroups marked on the tree and compared with Ethnologue/experts: tzeltalan, cholan, yucatecan. Languages in tree order:]
RABINAL ACHI
SAKAPULTEKO
SIPAKAPENSE
KAQCHIKEL NORTHERN
USPANTEKO
TZUTUJIL
QUICHE
POQOMCHI WESTERN
POCOMAM
KEKCHI
TEKTITEKO
AGUACATEC
MAM
IXIL CHAJUL
MOCHO
JACALTEC
QANJOBAL EASTERN
AKATEKO
CHUJ
TOJOLABAL
TZELTAL OXCHUC
TZELTAL
TZOTZIL SAN ANDRES
ZINACANTAN TZOTZIL
CHOL TILA
CHOL
CHONTAL TABASCO
CHORTI
MOPAN
LACANDON
MAYA YUCATAN
ITZAJ
CHICOMUCELTEC
HUASTEC
ASJP and genetic classification
- Method works at a global level
- Often also at the lowest levels
- Refinement necessary at intermediate level
Adding typological data
Trying to improve the fit …
Enrich lexical with typological data:
Haspelmath, M., M. Dryer, D. Gil & B. Comrie (eds) (2005).
The World Atlas Of Language Structures.
Oxford: Oxford University Press
WALS Online: http://wals.info/
Lexical plus typological data
[Diagram: Swadesh (3500) + WALS (2580; 140 features) = ‘SWALSH’ -> ASJP -> distance matrix -> tree software (TREE SFTW)]
Improving the fit
Enrich lexical with typological data:
- NOT 1:1 with ASJP languages
[Diagram: SWALSH (1250) -> ASJP -> distance matrix -> tree software (TREE SFTW)]
- WALS matrix very UNevenly filled (16%)
cf. Cysouw (2008) – STUF 61.3
-> Determine most stable features
Feature Stability
Nichols (1995): metric for the stability S of feature k (Ftrk) in genus Gx:
S(Ftrk) = pairs (valk = valk) / all pairs
ASJP: metric for stability of Ftrk:
SFk = (R - U) / (1 - U)
R = pairs (valk = valk) / all pairs within genera, combined over all Gx to allow for size differences between genera
U = pairs (valk = valk) / all pairs among unrelated languages (‘background noise’)
Division by (1 - U): normalization, makes SFk comparable across features
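A small sketch may make the stability metric concrete. It assumes that R is pooled over all within-genus language pairs and U over all pairs from different genera; the slides only state that size differences between genera are taken into account, so the pooling is an assumption. The function name, genera and feature values are invented for illustration.

```python
from itertools import combinations


def stability(values: dict) -> float:
    """values maps language -> (genus, feature value); returns (R - U) / (1 - U)."""
    related = related_same = unrelated = unrelated_same = 0
    for (g1, v1), (g2, v2) in combinations(values.values(), 2):
        same = int(v1 == v2)
        if g1 == g2:          # pair within one genus
            related += 1
            related_same += same
        else:                 # pair from different genera: 'background noise'
            unrelated += 1
            unrelated_same += same
    r = related_same / related
    u = unrelated_same / unrelated
    return (r - u) / (1 - u)  # undefined when every unrelated pair agrees (u == 1)


# Hypothetical data: the value is constant within each genus and differs between genera.
stable = {"a1": ("A", "x"), "a2": ("A", "x"), "b1": ("B", "y"), "b2": ("B", "y")}
# Hypothetical data: related languages always disagree, unrelated ones agree half the time.
unstable = {"a1": ("A", "x"), "a2": ("A", "y"), "b1": ("B", "x"), "b2": ("B", "y")}

print(stability(stable))    # 1.0
print(stability(unstable))  # -1.0
```

The second case shows how values below zero can arise, as they do for a few of the WALS features.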
Most stable WALS features
31. Sex-based and Non-sex-based Gender Systems         0.81
118. Predicative Adjectives                            0.74
30. Number of Genders                                  0.73
119. Nominal and Locational Predication                0.71
29. Syncretism in Verbal Person/Number Marking         0.71
Most unstable WALS features
128. Utterance Complement Clauses                      0.07
115. Negative Indefinite Pronouns/Predicate Negation   0.07
59. Possessive Classification                          0.01
135. Red and Yellow                                    -0.07
58. Obligatory Possessive Inflection                   -0.25
[Charts, built up over several slides: Correlation with Ethnologue; one curve per minimum number of features (Min ftrs = 20, 40, 60, 80, 100); horizontal axis: stability of features, from + to -; the values 20, 40, 60 and 85 are marked in successive slides]
[Chart: WALS compared with Swadesh40]
Improving the fit
Typological variables* do not perform better
than lexical ones to establish genetic relationships
*WALS!
What about a combination?
[Charts, built up over several slides: curves for Ftrs = 100, 80, 60, 40, 20, listed alongside Lgs = 79, 109, 139, 218, 341; horizontal axis from Only WALS to Only Sw40, with the mixes 85:15, 70:30, 50:50 and 35:65 marked in successive slides; the value 0.91 is shown on the 50:50 slide]
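One way such mixing ratios could be applied is as a weighted average of a lexical and a typological distance. The sketch below is an assumption throughout: the slides do not define how the two data types are combined, nor which side of a ratio refers to WALS and which to Sw40. The typological distance used here is simply the share of features attested in both languages that have different values, and the feature values are hypothetical.

```python
def typological_distance(features1: dict, features2: dict) -> float:
    """Share of features attested in both languages that have different values."""
    shared = set(features1) & set(features2)
    if not shared:
        raise ValueError("no feature is attested in both languages")
    return sum(features1[k] != features2[k] for k in shared) / len(shared)


def combined_distance(lexical: float, typological: float, lexical_weight: float) -> float:
    """Weighted average; lexical_weight=1.0 means 'only lexical', 0.0 'only typological'."""
    return lexical_weight * lexical + (1 - lexical_weight) * typological


# Hypothetical feature values keyed by WALS feature number.
lang1 = {"29": "2", "30": "3", "31": "1"}
lang2 = {"30": "1", "31": "1", "118": "2"}

typ = typological_distance(lang1, lang2)   # features 30 and 31 are shared, one differs
print(typ)                                 # 0.5
print(combined_distance(0.8, typ, 0.5))    # a 50:50 mix of 0.8 and 0.5
```

Because the WALS matrix is sparsely filled, the set of shared features can be small or empty for many language pairs, which is why a minimum number of features per language matters in the charts.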
Improving the fit
Typological variables do not perform better
than lexical ones to establish genetic relationships
A combined, balanced approach is superior, but …
… at a much higher cost per language than just
lexicostatistics: 84% of the WALS matrix still to be filled in …
-> Continue extension/optimization of lexical method
Publications 2008 - 2009
1. Brown, Cecil H., Eric W. Holman, Søren Wichmann & Viveka Velupillai
(2008). ‘Automated classification of the world’s languages: a
description of the method and preliminary results’. Sprachtypologie und
Universalienforschung 61: 285-308.
2. Holman, E. W., S. Wichmann, C. H. Brown, V. Velupillai, A. Müller & D.
Bakker (2008). ‘Advances in automated language classification’. In A.
Arppe, K. Sinnemäki and U. Nikanne (eds) Quantitative Investigations
in Theoretical Linguistics. Helsinki: University of Helsinki, 40-43.
3. Holman, E. W., S. Wichmann, C. H. Brown, V. Velupillai, A. Müller & D.
Bakker (2008). ‘Explorations in automated language classification’.
Folia Linguistica 42-2, 331-354.
4. Bakker, D., A. Müller, V. Velupillai, S. Wichmann, C. H. Brown, P.
Brown, D. Egorov, R. Mailhammer, A. Grant, E. W. Holman (2009).
‘Adding typology to lexicostatistics: a combined approach to language
classification’. Linguistic Typology 13, 167-179.
?
ASJP
Overall goal:
- Method + Tools for Reconstruction of Language Relationships
Derived goals:
- Critical assessment and refinement of existing classifications
- Classify newly described and unclassified languages
- Search for (ir)regularities in family reconstructions
- Test hypotheses about families
- Experimentally find an optimal dating method
- Automatically detect borrowings