
Social Network Inspired Models
of NLP and Language Evolution
Monojit Choudhury (Microsoft Research India)
Animesh Mukherjee (IIT Kharagpur)
Niloy Ganguly (IIT Kharagpur)
What is a Social Network?
 Nodes: Social entities (people, organizations, etc.)
 Edges: Interactions/relationships between entities
(friendship, collaboration, sexual contact)
Courtesy: http://blogs.clickz.com
Social Network Inspired Computing
 Society and the nature of human interaction form a
Complex System
 Complex Network: A generic tool to model
complex systems
There is a growing body of work on Complex Network
Theory (CNT), applied to a variety of fields – Social,
Biological, Physical & Cognitive sciences, Engineering &
Technology
Language is a complex system
Objective of this Tutorial
 To show that SNIC (Social Network Inspired Computing) is
an emerging and promising technique
 To apply it to model natural languages:
NLP, Quantitative Linguistics, Language Evolution,
Historical Linguistics, Language Acquisition
 To familiarize the audience with the tools and techniques
of SNIC
 To compare it with other standard approaches to NLP
Outline of the Tutorial
 Part I: Background
Introduction [25 min]
Network Analysis Techniques [25 min]
Network Synthesis Techniques [25 min]
 Break [3:20pm – 3:40pm]
 Part II: Case Studies
Self-organization of Sound Systems [20 min]
Modeling the Lexicon [20 min]
Unsupervised Labeling (Syntax & Semantics) [20 min]
 Conclusion and Discussions [20 min]
Complex System
 Non-trivial properties and patterns emerging
from the interaction of a large number of simple
entities
 Self-organization: The process through which
these patterns evolve without any external
intervention or central control
 Emergent Property or Emergent Behavior: The
pattern that emerges due to self-organization
Emergence of a networked life
[Figure: emergence across scales – Atom → Molecule → Cell → Tissue → Organs → Organisms → Communities]
Language – a complex system
 Language: a medium for communication through
an arbitrary set of symbols
 Constantly evolving
 An outcome of self-organization at many levels:
Neurons
Speakers and listeners
Phonemes, morphemes, words …
 The 80-20 rule holds at every level of structure
Syntactic Network of Words
[Figure: a weighted network over the words sky, blood, color, weight, light, heavy, red and blue, with edge weights such as 1, 20 and 100]
Complex Network Theory
Handy toolbox for modeling complex
systems
Marriage of Graph theory and Statistics
Complex because:
Non-trivial topology
Difficult to specify completely
Usually large (in terms of nodes and edges)
Provides insight into the nature and
evolution of the system being modeled
Internet
9-11 Terrorist Network
Social Network Analysis is a
mathematical methodology for
connecting the dots -- using
science to fight terrorism.
Connecting multiple pairs of
dots soon reveals an emergent
network of organization.
What Questions can be asked
 Do these networks display some symmetry?
 Are these networks the creation of intelligent
agents, or have they emerged?
 How have these networks emerged?
What are the underlying simple rules leading
to their complex formation?
Bi-directional Approach
Analysis of the real-world networks
Global topological properties
Community structure
Node-level properties
Synthesis of the network by means of
some simple rules
Small-world models
Preferential attachment models
Application of CNT in Linguistics - I
 Quantitative linguistics
Invariance and typology (Zipf’s law, syntactic
dependencies)
 Natural Language Processing
Unsupervised methods for text labeling (POS tagging,
NER, WSD, etc.)
Textual similarity (automatic evaluation, document
clustering)
Evolutionary Models (NER, multi-document
summarization)
Application of CNT in Linguistics - II
 Language Evolution
How did sound systems evolve?
Development of syntax
 Language Change
Innovation diffusion over social networks
Language as an evolving network
 Language Acquisition
Phonological acquisition
Evolution of the mental lexicon of the child
Linguistic Networks
Name | Nodes | Edges | Why?
PhoNet | Phonemes | Co-occurrence likelihood in languages | Evolution of sound systems
WordNet | Words | Ontological relations | Host of NLP applications
Syntactic Network | Words | Similarity between syntactic contexts | POS tagging
Semantic Network | Words, names | Semantic relations | IR, parsing, NER, WSD
Mental Lexicon | Words | Phonetic similarity and semantic relations | Cognitive modeling, spell checking
Tree-banks | Words | Syntactic dependency links | Evolution of syntax
Word Co-occurrence | Words | Co-occurrence | IR, WSD, LSA, …
Summarizing
 SNIC and CNT are emerging techniques for
modeling complex systems at mesoscopic level
 Applied to Physics, Biology, Sociology,
Economics, Logistics …
 Language - an ideal application domain for SNIC
 SNIC models in NLP, Quantitative linguistics,
language change, evolution and acquisition
Topological Characterization of Networks
Types Of Networks and Representation
 Unipartite: binary/weighted, undirected/directed
 Bipartite: binary/weighted, undirected/directed
Representation (example: the triangle over nodes a, b, c)
1. Adjacency Matrix
    a  b  c
 a  0  1  1
 b  1  0  1
 c  1  1  0
2. Adjacency List
 a : {b, c}
 b : {a, c}
 c : {a, b}
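As a concrete illustration, here is a minimal Python sketch (hypothetical helper code, not part of the tutorial) that builds both representations for the triangle graph a–b–c shown above:

```python
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("a", "c"), ("b", "c")]

# 1. Adjacency matrix: matrix[i][j] = 1 iff nodes i and j are connected.
index = {v: i for i, v in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]
for u, v in edges:                      # undirected, so fill both entries
    matrix[index[u]][index[v]] = 1
    matrix[index[v]][index[u]] = 1

# 2. Adjacency list: each node maps to the set of its neighbors.
adj = {v: set() for v in nodes}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

print(matrix)   # [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(adj)      # {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}}
```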
Characterization of Complex Networks
 They have a non-trivial topological structure
 Properties:
Heavy tail in the degree distribution (non-negligible
probability mass towards the tail; more than in the case of
an exponential distribution)
High clustering coefficient
Centrality Properties
Social Roles & Equivalence
Assortativity
Community Structure
Random Graphs & Small avg. path length
Preferential attachment
Small World Properties
Degree Distribution (DD)
 Let pk be the fraction of vertices in the network that have
degree k.
 The k versus pk plot is defined as the degree distribution
of a network
 For most real-world networks these distributions are
right-skewed, with a long right tail of values far above the
mean – pk varies as k^(-α)
 Due to noisy and insufficient data, the definition is
sometimes slightly modified:
the cumulative degree distribution is plotted instead
 Probability that the degree of a node is greater than or
equal to k
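A small Python sketch of both forms of the definition, assuming the adjacency-set representation introduced earlier (`adj`: node → set of neighbors; the helper names are ours):

```python
from collections import Counter

def degree_distribution(adj):
    """p_k: fraction of vertices with degree k."""
    n = len(adj)
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}

def cumulative_distribution(pk):
    """P(degree >= k): the cumulative form used for noisy data."""
    tail = 0.0
    cum = {}
    for k in sorted(pk, reverse=True):  # accumulate from the largest degree down
        tail += pk[k]
        cum[k] = tail
    return dict(sorted(cum.items()))
```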
A Few Examples
Power law: Pk ~ k^(-α)
Friend of Friends
 Consider the following scenario:
 Sourish and Ravi are friends
 Sourish and Shaunak are friends
 Are Shaunak and Ravi friends?
 If so, then …
[Figure: triangle over Sourish, Ravi and Shaunak]
 This property is known as transitivity
Measuring Transitivity: Clustering Coefficient
 The clustering coefficient for a vertex ‘v’ in a network
is defined as the ratio of the number of
connections among the neighbors of ‘v’ to the total
number of possible connections between the
neighbors
 High clustering coefficient means my friends know
each other with high probability – a typical property
of social networks
Mathematically…
 The clustering coefficient of a vertex i with n neighbors is
Ci = (# of links between the n neighbors) / (n(n-1)/2)
 The clustering coefficient of the whole network is
the average
C = (1/N) ∑i Ci
 Alternatively,
C = 3 × (# of triangles in the n/w) / (# of triples in the n/w)
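A direct Python sketch of these formulas over the adjacency-set representation (an illustrative implementation, not the tutorial's code):

```python
def clustering_coefficient(adj, v):
    """C_v = (# links among the neighbors of v) / (n(n-1)/2)."""
    nbrs = list(adj[v])
    n = len(nbrs)
    if n < 2:
        return 0.0
    links = sum(1 for i in range(n) for j in range(i + 1, n)
                if nbrs[j] in adj[nbrs[i]])
    return links / (n * (n - 1) / 2)

def average_clustering(adj):
    """Network clustering coefficient: C = (1/N) * sum_i C_i."""
    return sum(clustering_coefficient(adj, v) for v in adj) / len(adj)
```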
Network | C | Crand | L | N
WWW | 0.1078 | 0.00023 | 3.1 | 153127
Internet | 0.18-0.3 | 0.001 | 3.7-3.76 | 3015-6209
Actor | 0.79 | 0.00027 | 3.65 | 225226
Coauthorship | 0.43 | 0.00018 | 5.9 | 52909
Metabolic | 0.32 | 0.026 | 2.9 | 282
Foodweb | 0.22 | 0.06 | 2.43 | 134
C. elegans | 0.28 | 0.05 | 2.65 | 282
Centrality
 Centrality measures are commonly described as
indices of the 4 Ps – prestige, prominence,
importance, and power
Degree – Count of immediate neighbors
Betweenness – Nodes that form a bridge between two
regions of the n/w:
CB(v) = ∑s≠v≠t σst(v) / σst
 where σst is the total number of shortest paths between s and
t, and σst(v) is the number of shortest paths from s to t
that pass through v
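A short sketch of these two centralities using the networkx library (assumed available here) on Zachary's karate club network, a standard small social network:

```python
import networkx as nx

# Zachary's karate club: a classic small social network shipped with networkx.
G = nx.karate_club_graph()

deg = nx.degree_centrality(G)       # degree: count of immediate neighbors (normalized)
btw = nx.betweenness_centrality(G)  # betweenness: sum over s,t of sigma_st(v)/sigma_st

# Nodes with the highest betweenness act as bridges between regions:
print(sorted(deg, key=deg.get, reverse=True)[:3])
print(sorted(btw, key=btw.get, reverse=True)[:3])
```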
Eigenvector centrality – Bonacich (1972)
It is not just how many people know me
that counts towards my popularity (or power), but how
many people know the people who know
me – this is recursive!
In the context of HIV transmission: a person
x with one sex partner is less prone to the
disease than a person y with multiple
partners –
but imagine what happens if the partner of x
has multiple partners.
This is the basic idea of eigenvector centrality.
Definition
 Eigenvector centrality is defined as the principal
eigenvector of the adjacency matrix
 An eigenvector of a symmetric matrix A = {aij} is
any vector e such that
A e = λ e, i.e., λ ei = ∑j aij ej
where λ is a constant and ei is the centrality of node i
 What does it imply – centrality of a node is
proportional to the centrality of the nodes it is
connected to (recursively)…
 Practical Example: Google PageRank
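A minimal power-iteration sketch (our own illustrative code, not the tutorial's) that computes the principal eigenvector, assuming a connected, non-bipartite graph in the adjacency-set representation:

```python
def eigenvector_centrality(adj, iters=100):
    """Power iteration: repeatedly set e_i <- sum of e_j over neighbors j,
    then renormalize; this converges to the principal eigenvector of A."""
    e = {v: 1.0 for v in adj}
    for _ in range(iters):
        new = {v: sum(e[u] for u in adj[v]) for v in adj}
        norm = sum(x * x for x in new.values()) ** 0.5
        e = {v: x / norm for v, x in new.items()}
    return e
```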
Assortativity (homophily)
Rich goes with the rich (selective linking):
a famous actor (e.g., Shah Rukh Khan) would
prefer to pair up with some other famous actor
(e.g., Rani Mukherjee) in a movie rather than
a newcomer in the film industry.
[Figures: an assortative scale-free network vs. a disassortative scale-free network]
Measures of Assortativity
ANND (Average nearest neighbor degree)
Find the average degree of the neighbors of each node
i with degree k
Find the Pearson correlation (r) between the degree of i
and the average degree of its neighbors
For further reference see the supplementary material
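A sketch of ANND and the degree–ANND Pearson correlation under the same adjacency-set representation (the helper names are ours, not from the tutorial):

```python
def annd(adj):
    """ANND: average degree of the neighbors of each (non-isolated) node."""
    return {v: sum(len(adj[u]) for u in adj[v]) / len(adj[v])
            for v in adj if adj[v]}

def degree_annd_correlation(adj):
    """Pearson correlation r between a node's degree and its ANND;
    r > 0 indicates assortative mixing, r < 0 disassortative."""
    nn = annd(adj)
    nodes = list(nn)
    x = [len(adj[v]) for v in nodes]   # degree of each node
    y = [nn[v] for v in nodes]         # ANND of each node
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0
```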
Community structure
Community structure: a group of vertices that
have a high density of edges within them and a
low density of edges in between groups
Example:
•Friendship n/w of children
•Citation n/ws: research interest
•World Wide Web: subject matter
of pages
•Metabolic networks: Functional
units
•Linguistic n/ws: similar linguistic
categories
Some Examples
Community Structure in
Political Books
Community structure in a Social n/w of
Students (American High School)
Community Identification Algorithms
 Hierarchical
 Girvan-Newman
 Radicchi et al.
 Chinese Whispers
 Spectral Bisection
See (Newman 2004) for a comprehensive
survey (you will find the ref. in the
supplementary material)
Evolution of Networks
Processes on Networks
The World is Small!
 “Registration fees for IJCNLP 2008 are being
waived for all participants – get it collected from
the registration counter”
 How long do you think the above information will
take to spread among yourselves?
 Experiments say it will spread very fast – within 6
hops from the initiator it would reach everyone
 This is Milgram’s famous six degrees of
separation
The Small World Effect
Even in very large social networks, the average distance
between nodes is usually quite short.
Milgram’s small world experiment:
 Target individual in Boston
 Initial senders in Omaha, Nebraska
 Each sender was asked to forward a packet to a friend
who was closer to the target
 Friends asked to do the same
Result: Average of ‘six degrees’ of separation.
S. Milgram, The small world problem, Psych. Today, 2 (1967), pp. 60-67.
Measure of Small-Worldness
 Low average geodesic path length
 High clustering coefficient
 Geodesic path – Shortest path through the
network from one vertex to another
 Mean path length
 ℓ = (2 / n(n+1)) ∑i≥j dij, where dij is the geodesic distance from
vertex i to vertex j
Most of the networks observed in the real world have ℓ ≤ 6
 Film actors: 3.48
 Company directors: 4.60
 Emails: 4.95
 Internet: 3.33
 Electronic circuits: 4.34
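For completeness, a BFS-based sketch of the mean path length formula above, assuming an unweighted, connected graph in the adjacency-set representation (illustrative code, not the tutorial's):

```python
from collections import deque

def mean_path_length(adj):
    """ell = (2 / n(n+1)) * sum_{i>=j} d_ij, computed by BFS from every vertex.
    'total' sums d_ij over ordered pairs, i.e. twice the unordered sum, so the
    factor of 2 in the formula cancels."""
    n = len(adj)
    total = 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(dist.values())
    return total / (n * (n + 1))
```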
Random Graphs & Small
Average Path Length
Q: What do we mean by a ‘random graph’?
A: Erdos-Renyi random graph model:
For every pair of nodes, draw an edge
between them with equal probability p.
Degrees of Separation in a Random Graph
• N nodes
• z neighbors per node on average, z = <k>
• D degrees of separation
z^D ≈ N  ⇒  D ≈ log N / log z
Degree distribution is a Poisson distribution: P(k) ~ e^(-<k>) <k>^k / k!
Clustering
C = Probability that two of a node’s neighbors
are themselves connected
In a random graph: Crand ~ 1/N (if the average
degree is held constant)
Watts-Strogatz ‘Small World’ Model
Watts and Strogatz introduced this simple model to
show how networks can have both short path lengths
and high clustering.
D. J. Watts and S. H. Strogatz, Collective dynamics of “small-world”
networks, Nature, 393 (1998), pp. 440–442.
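A quick comparison sketch using networkx (assumed available); the parameter values are illustrative, not from the tutorial:

```python
import networkx as nx

# Erdos-Renyi random graph: every pair connected with probability p.
er = nx.erdos_renyi_graph(1000, 0.01)        # <k> ~ 10, so connected w.h.p.
# Watts-Strogatz: ring lattice (10 neighbors each) with 10% of edges rewired.
ws = nx.watts_strogatz_graph(1000, 10, 0.1)

for name, g in [("ER", er), ("WS", ws)]:
    print(name,
          "C =", round(nx.average_clustering(g), 3),
          "l =", round(nx.average_shortest_path_length(g), 2))
# The WS graph keeps a high clustering coefficient (unlike ER, where C ~ 1/N)
# while its mean path length stays short.
```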
Power Law
[Figure: airline route map; a Poisson degree distribution corresponds to an exponential network, a power-law degree distribution to a scale-free network]
Degree distributions for various networks
(a) World-Wide Web
(b) Coauthorship
networks: computer
science, high energy
physics, condensed
matter physics,
astrophysics
(c) Power grid of the
western United
States and Canada
(d) Social network of 43
Mormons in Utah
How do Power law DDs arise?
Barabási-Albert Model of Preferential Attachment
(Rich gets Richer)
(1) GROWTH : Starting with a small number of nodes (m0) at
every timestep we add a new node with m (<=m0) edges
(connected to the nodes already present in the system).
(2) PREFERENTIAL ATTACHMENT : The probability Π that
a new node will be connected to node i depends on the
connectivity ki of that node:
Π(ki) = ki / ∑j kj
A.-L.Barabási, R. Albert, Science 286, 509 (1999)
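A compact simulation sketch of the BA process (our own illustrative code, not the authors'); sampling uniformly from the list of edge endpoints is exactly degree-proportional sampling:

```python
import random

def barabasi_albert(n, m):
    """Grow a BA graph: start from a clique of m+1 nodes; each new node
    attaches m edges with Pi(k_i) = k_i / sum_j k_j (rich gets richer)."""
    seed = m + 1
    adj = {i: {j for j in range(seed) if j != i} for i in range(seed)}
    # 'ends' holds every edge endpoint once per incidence, so a vertex of
    # degree k appears k times; uniform sampling is degree-proportional.
    ends = [v for v in adj for _ in adj[v]]
    for new in range(seed, n):
        targets = set()
        while len(targets) < m:            # m distinct targets
            targets.add(random.choice(ends))
        adj[new] = set(targets)
        for t in targets:
            adj[t].add(new)
            ends.extend([new, t])
    return adj
```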
Growth analysis
Markov chain representation
Probability that a new edge is attached to any of the vertices of degree k:
m · (k pk) / ∑k k pk = k pk / 2,
where ∑k k pk = <k> = 2m (m edges are added per timestep)
Growth dynamics at time t+1 (n nodes; pk,n = fraction of degree-k nodes):
(n+1) pk,n+1 = n pk,n + ½ (k-1) pk-1,n – ½ k pk,n      for k > m
(n+1) pm,n+1 = n pm,n + 1 – ½ m pm,n                   for k = m
(nodes of degree k-1 at time t that gain an edge become nodes of degree k
at time t+1; nodes of degree k that gain an edge are lost)
The net change in n·pk per vertex added:
pk = ½ (k-1) pk-1 – ½ k pk      for k > m
pm = 1 – ½ m pm                 for k = m
In the stationary solution, we find
pm = 2/(m+2) and pk = pk-1 (k-1)/(k+2),
which results in pk = 2m(m+1) / [k(k+1)(k+2)] ~ k^(-3)
CASE STUDY I: Self-Organization
of the Sound Inventories
Human Speech Sounds
 Human speech sounds are called
phonemes – the smallest sound units of a
language
 Phonemes are characterized by certain
distinctive features:
I. Place of articulation
II. Manner of articulation
III. Phonation
Mermelstein’s Model
Types of Phonemes
[Figure: vowels (e.g., /i/, /a/, /u/), consonants (e.g., /t/, /p/, /k/) and diphthongs (e.g., /ai/)]
Choice of Phonemes
How does a language choose a set of
phonemes to build its sound
inventory?
Is the process arbitrary?
Certainly not!
What are the forces affecting this choice?
Vowels: A (Partially) Solved Mystery
Languages choose vowels based on
maximal perceptual contrast.
For instance, if a language has three
vowels, then in more than 95% of the
cases they are /a/, /i/ and /u/.
[Figure: /a/, /i/ and /u/ are maximally distinct in the perceptual space]
Consonants: A Jigsaw Puzzle
Research: from 1929 to date
No single satisfactory explanation of the
organization of the consonant inventories:
The set of features that characterize
consonants is much larger than that of vowels
No single force is sufficient to explain this
organization
Rather, a complex interplay of forces goes on in
shaping these inventories
Principle of Occurrence
 PlaNet – The “Phoneme-Language Network”
 Data Source: UPSID (317 languages)
Choudhury et al. 2006, ACL
Mukherjee et al. 2007, Int. Jnl. of Modern Physics C
A bipartite network N=(VL,VC,E)
VL : Nodes representing languages of the world
VC : Nodes representing consonants
E : Set of edges which run between VL and VC
 There is an edge e Є E between two nodes
vl Є VL and vc Є VC if the consonant c occurs
in the language l.
[Figure: languages L1–L4 linked to the consonants /θ/, /ŋ/, /m/, /d/, /s/, /p/ they contain]
The Structure of PlaNet
Degree Distribution of PlaNet
DD of the language nodes follows a β-distribution:
pk = beta(k) with α = 7.06 and β = 47.64, i.e.,
pk = [Γ(54.7) / (Γ(7.06) Γ(47.64))] k^6.06 (1-k)^46.64
[Plot: pk vs. language inventory size (degree k), k = 0…200]
DD of the consonant nodes follows a
power law with an exponential cut-off:
Pk = k^(-0.71); kmin = 5, kmax = 173, kavg = 21
[Plot (log-log): cumulative Pk vs. degree of a consonant k, k = 1…1000, showing the exponential cut-off]
Synthesis of PlaNet
 Non-linear preferential attachment
 Iteratively construct the language inventories
given their inventory sizes
Pr(Ci) = (di^α + ε) / ∑x Є V* (dx^α + ε)
[Figure: the growing bipartite network after steps 3 and 4, languages L1–L4]
Simulation Result
[Plot (log-log): Pk vs. degree k for PlaNet, PlaNetsyn and PlaNetrand]
The parameters α and ε are 1.44 and 0.5 respectively.
The results are averaged over 100 runs.
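A sketch of one attachment step under the non-linear kernel above, with α and ε as reported on the slide; the outer loop over languages and inventory sizes is omitted, and all names are illustrative:

```python
import random

def choose_consonant(degrees, alpha=1.44, eps=0.5):
    """One attachment step of the kernel Pr(C_i) = (d_i**alpha + eps) /
    sum_x (d_x**alpha + eps). 'degrees' maps each consonant still available
    to the current language to its degree in the growing bipartite network."""
    items = list(degrees.items())
    weights = [d ** alpha + eps for _, d in items]   # eps lets degree-0 nodes in
    r = random.uniform(0, sum(weights))
    for (c, _), w in zip(items, weights):
        r -= w
        if r <= 0:
            return c
    return items[-1][0]  # guard against floating-point leftovers
```

To build one synthetic inventory of size s, this step would be repeated s times, removing each chosen consonant from `degrees` so a language never picks the same consonant twice.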
Principle of Co-occurrence
 Consonants tend to co-occur in groups or
communities
 These groups tend to be organized around a few
distinctive features (based on: manner of
articulation, place of articulation & phonation) –
Principle of feature economy
If a language has the voiced plosives /b/ (bilabial) and
/d/ (dental) in its inventory, then it will also tend to have
their voiceless counterparts /p/ and /t/.
How to Capture these Co-occurrences?
 PhoNet – “Phoneme Phoneme Network”
A weighted network N=(VC,E)
VC : Nodes representing consonants
E : Set of edges which run between the nodes in VC
 There is an edge e Є E between two nodes vc1, vc2 Є VC
if the consonants c1 and c2 co-occur in a language. The
number of languages in which c1 and c2 co-occur defines
the edge weight of e; the number of languages in which c1
occurs defines the node weight of vc1.
[Figure: a fragment of PhoNet around /k/ (node weight 283), /k′/, /kw/ and /d′/, with edge weights such as 50, 42, 39, 38, 17, 14 and 13]
Construction of PhoNet
 Data Source : UPSID
 Number of nodes in VC is 541
 Number of edges is 34012
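A sketch of the PhoNet construction from raw inventories, assuming the input is a dict mapping each UPSID language to its consonant set (the variable names are ours):

```python
from itertools import combinations
from collections import Counter

def build_phonet(inventories):
    """Build the weighted PhoNet from {language: set of consonants}:
    node weight = # languages a consonant occurs in,
    edge weight = # languages in which a pair of consonants co-occurs."""
    node_w = Counter()
    edge_w = Counter()
    for consonants in inventories.values():
        for c in consonants:
            node_w[c] += 1
        for c1, c2 in combinations(sorted(consonants), 2):
            edge_w[(c1, c2)] += 1      # sorted pair = canonical undirected edge
    return node_w, edge_w
```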
PhoNet
[Figure: visualization of PhoNet]
Community Formation
Radicchi et al. algorithm:
[Figure: a worked example on a small graph, with edge-clustering-based values such as 5.17, 10.94, 11.11, 7.5, 7.14, 3.77 and 0.06 attached to the edges; edges with η > 1 are removed]
For different values of η we get
different sets of communities
Consonant Societies!
[Figures: consonant communities obtained at η = 0.35, 0.60, 0.72 and 1.25]
The fact that the communities are
good can be quantitatively
shown by measuring the
feature entropy
Problems to ponder on …
Physical significance of PA:
Functional forces
Historical/Evolutionary process
Labeled synthesis of PlaNet and PhoNet
Language diversity vs. Preferential
attachment
CASE STUDY II: Modeling the
Mental Lexicon
Mental Lexicon (ML) – Basics
The ML refers to the repository of the word forms
that resides in the human brain
Two Questions:
How are words stored in the long-term memory,
i.e., what is the organization of the ML?
How are words retrieved from the ML (lexical
access)?
The two questions are highly inter-related – to
predict the organization one can investigate how
words are retrieved, and vice versa.
Ways of Organization of Mental Lexicon
Un-organized (a bag full of words) or,
Organized
By sound (phonological similarity)
E.g., start the same: banana, bear, bean …
End the same: look, took, book …
Number of phonological segments they share
By Meaning (semantic similarity)
Banana, apple, pear, orange …
By age at which the word is acquired
By frequency of usage
By POS
Orthographically
Some Unsolved Mysteries –
You can Give it a Try
What can be a model for the evolution of
the ML?
How is the ML acquired by a child learner?
Is there a single optimal structure for the
ML, or is it organized based on multiple
criteria (i.e., a combination of the different
n/ws)? – Towards a single framework for
studying the ML!
CASE STUDY III: Syntax
Unsupervised POS Tagging
Labeling of Text
 Lexical Category (POS tags)
 Syntactic Category (Phrases, chunks)
 Semantic Role (Agent, theme, …)
 Sense
 Domain dependent labeling (genes, proteins, …)
How to define the set of labels?
How to (learn to) predict them automatically?
“Nothing makes sense, unless in context”
Distribution-based definition of
Lexical category
Sense (meaning)
The X is …
If you X then I shall …
… looking at the star PP
General Approach
 Represent the context
of a word (token)
 Define some notion of
similarity between the
contexts
 Cluster the contexts
of the tokens
 Get the label of the
tokens
[Figure: tokens w1, w2, w3, w4 in a text, and their contexts grouped into clusters]
Issues
How to define the context?
How to define similarity?
How to cluster?
How to evaluate?
Syntactic Network of Words
[Figure: a network over the words sky, blood, color, weight, light, heavy, red and blue; the weight of the edge between two words, e.g., red and blue, is 1 – cos(red, blue), computed over their syntactic context vectors]
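A sketch of how such an edge weight could be computed, assuming words are represented as sparse syntactic-context vectors (the toy vectors below are invented for illustration):

```python
def cosine(u, v):
    """Cosine similarity of two sparse context vectors (dicts)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = sum(x * x for x in u.values()) ** 0.5
    nv = sum(x * x for x in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical context counts for two words:
red  = {"the_X_is": 20, "X_color": 5}
blue = {"the_X_is": 18, "X_color": 4, "X_sky": 7}

# Edge weight as on the slide: 1 - cos(red, blue)
weight = 1 - cosine(red, blue)
print(weight)
```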
The Chinese Whisper Algorithm
[Figure: the same word network with edge weights such as 0.9, 0.8, 0.7, 0.5 and -0.5; over successive iterations each node adopts the class that carries the highest total edge weight among its neighbors, shown here in three snapshots]
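A minimal sketch of the Chinese Whispers update rule described above (our own rendering, with illustrative data structures: an adjacency dict and an edge-weight dict):

```python
import random

def chinese_whispers(adj, weights, iters=20):
    """Every node starts in its own class; on each pass (in random order)
    a node adopts the class with the highest total edge weight among its
    neighbors. 'weights[(u, v)]' is the weight of edge u-v."""
    label = {v: v for v in adj}
    nodes = list(adj)
    for _ in range(iters):
        random.shuffle(nodes)
        for v in nodes:
            if not adj[v]:
                continue
            score = {}
            for u in adj[v]:
                w = weights.get((v, u), weights.get((u, v), 1.0))
                score[label[u]] = score.get(label[u], 0.0) + w
            label[v] = max(score, key=score.get)  # adopt the strongest class
    return label
```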
Word Sense Disambiguation
 Véronis, J. 2004. HyperLex: lexical cartography
for information retrieval. Computer Speech &
Language 18(3):223-252.
 Let the word to be disambiguated be “light”
 Select a subcorpus of paragraphs which have at
least one occurrence of “light”
 Construct the word co-occurrence graph
HyperLex
A beam of white light is dispersed
into its component colors by its
passage through a prism.
Energy efficient light fixtures
including solar lights, night lights,
energy star lighting, ceiling lighting,
wall lighting, lamps
What enables us to see the light
and experience such wonderful
shades of colors during the course
of our everyday lives?
[Figure: co-occurrence graph over the words beam, prism, dispersed, white, colors, shades, energy, efficient, fixtures and lamps]
Hub Detection and MST
[Figure: hubs of the co-occurrence graph of “light” (e.g., beam, energy) and the minimum spanning tree rooted at the target word; a new occurrence such as “White fluorescent lights consume less energy than incandescent lamps” is then labeled via the nearest hub]
Other Related Works
 Solan, Z., Horn, D., Ruppin, E. and Edelman, S. 2005.
Unsupervised learning of natural languages. PNAS, 102
(33): 11629-11634
 Ferrer i Cancho, R. 2007. Why do syntactic links not
cross? Europhysics Letters
 Also applied to: IR, Summarization, sentiment
detection and categorization, script evaluation,
author detection, …
Discussions & Conclusions
What we learnt
Advantages of SNIC in NLP
Comparison to standard techniques
Open problems
Concluding remarks and Q&A
What we learnt
 What is SNIC and Complex Networks
 Analytical tools for SNIC
 Applications to human languages
 Three Case-studies:
Case | Area | Perspective | Technique
I | Sound systems | Language evolution and change | Synthesis models
II | Lexicon | Psycholinguistic modeling and linguistic typology | Topology and search
III | Syntax & Semantics | Applications to NLP | Clustering
Insights
 Language features complex structure at every
level of organization
 Linguistic networks have non-trivial properties:
scale-free & small-world
 Therefore, Language and Engineering systems
involving language should be studied within the
framework of complex systems, esp. CNT
Advantages of SNIC
 Fully unsupervised techniques:
No labeled data required: a good solution to
resource scarcity
Problem of evaluation: circumvented by semi-supervised
techniques
 Ease of computation:
Simple and scalable
Distributed and parallel computable
 Holistic treatment:
 Language evolution & psycho-linguistic theories
Comparison to Standard Techniques
Rule-based vs. Statistical NLP
Graphical Models
Generative models in machine learning
HMM, CRF, Bayesian belief networks
[Figure: a graphical model over the POS tags JJ, NN, RB, VF]
Graphical Models vs. SNIC
GRAPHICAL MODEL
 Principled: based on
Bayesian theory
 Structure is assumed and
parameters are learnt
 Focus: decoding &
parameter estimation
 Data-driven and
computationally intensive
 The generative process is
easy to visualize, but there is
no visualization of the data
COMPLEX NETWORK
 Heuristic, but with underlying
principles of linear algebra
 Structure is discovered
and studied
 Focus: topology and
evolutionary dynamics
 Unsupervised and
computationally easy
 Easy visualization of the
data
Language Modeling
 A network of words as a model of language vs.
n-gram models
 Hierarchical, hyper-graph based models
 Smoothing through holistic analysis of the
network topology
Jedynak, B. and Karakos, D. 2007. Unigram Language
Models using Diffusion Smoothing over Graphs. Proc. of
TextGraphs - 2
Open Problems
 Universals and variables of linguistic networks
 Superimposition of networks: phonetic, syntactic,
semantic
 Which clustering algorithm for which topology?
 Metrics for network comparison – important for
language modeling
 Unsupervised dependency parsing using
networks
 Mining translation equivalents
Resources
 Conferences
TextGraphs, Sunbelt, EvoLang, ECCS
 Journals
PRE, Physica A, IJMPC, EPL, PRL, PNAS, QL, ACS,
Complexity, Social Networks
 Tools
Pajek, C#UNG,
http://www.insna.org/INSNA/soft_inf.html
 Online Resources
Bibliographies, courses on CNT
Contact
 Monojit Choudhury
[email protected]
http://www.cel.iitkgp.ernet.in/~monojit/
 Animesh Mukherjee
[email protected]
http://www.cel.iitkgp.ernet.in/~animesh/
 Niloy Ganguly
[email protected]
http://www.facweb.iitkgp.ernet.in/~niloy/
Thank you!!
Book Volume on Dynamics on and of Complex Networks
To be published by May 2008 from Birkhauser, Springer
http://www.cel.iitkgp.ernet.in/~eccs07/