Social Sub-groups II
Outline
“How?”
- Review group-finding strategies
- “Evade” – PCA (=SVD for the math-oriented!)
- Theory Problem: What should group-structure be?
“Why?”
Wayne Baker
•Social structure in a place where there should be none
Scott Feld
•What causes clustering in a network? Opportunity and
interests
Examples from Add Health & Prosper
Practical:
•Software & Program examples.
Next week: Roles & Blockmodels
Methods: How do we identify primary groups in a network?
Strategies for identifying primary groups:
Search:
1) Fit Measure: Identify a measure of groupness (usually a function of
the number of ties that fall within group compared to the number of
ties that fall between group).
2) Algorithm to maximize fit. Once we have the index, we need a
clever method for searching through the network to maximize the fit.
See: “Jiggle”, “Factions” etc.
Destroy:
Break apart the network in strategic ways, removing the weakest parts
first; what’s left are your primary groups. See “edge betweenness”,
“MCL”
Evade:
Don’t look directly, instead find a simpler problem that correlates:
Examples: Generalized cluster analysis, Factor Analysis, RM.
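The “Search” strategy above can be sketched in a few lines. This is a minimal Python illustration, not the code behind Factions or Jiggle: it assumes Newman's modularity as the fit measure (one of many possible within-vs-between indices), a fixed number of groups, and a simple one-node-at-a-time greedy search. The function names and the toy network (two 4-person cliques joined by one tie) are made up for the example.

```python
from itertools import combinations


def modularity(edges, labels):
    """Fit measure: Newman modularity Q for an undirected, unweighted edge list."""
    m = len(edges)
    within = {}  # group -> number of ties that fall inside the group
    degree = {}  # group -> summed degree of the group's members
    for u, v in edges:
        degree[labels[u]] = degree.get(labels[u], 0) + 1
        degree[labels[v]] = degree.get(labels[v], 0) + 1
        if labels[u] == labels[v]:
            within[labels[u]] = within.get(labels[u], 0) + 1
    return sum(within.get(g, 0) / m - (d / (2 * m)) ** 2 for g, d in degree.items())


def greedy_search(edges, start_labels, n_groups):
    """Search step: move one node at a time whenever the move raises the fit."""
    labels = list(start_labels)
    best = modularity(edges, labels)
    improved = True
    while improved:
        improved = False
        for node in range(len(labels)):
            original = labels[node]
            best_label = original
            for g in range(n_groups):
                if g == original:
                    continue
                labels[node] = g
                q = modularity(edges, labels)
                if q > best + 1e-12:  # accept strict improvements only
                    best, best_label = q, g
            labels[node] = best_label
            if best_label != original:
                improved = True
    return labels, best


# Toy network (hypothetical): two 4-person cliques joined by a single tie 3-4.
edges = list(combinations(range(4), 2)) + list(combinations(range(4, 8), 2)) + [(3, 4)]
start = [i % 2 for i in range(8)]  # deliberately poor starting values
labels, q = greedy_search(edges, start, n_groups=2)
print(labels, round(q, 3))
```

Greedy searches like this can stall in local optima on real data, which is one reason the hybrid approaches use an evade technique to supply starting values.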
Strategies for identifying primary groups:
Search:
- UCINET’s Factions
- R’s FastGreedy
- PAJEK’s Generalized block-modeling
- Frank’s KliqueFinder
Destroy:
Edge-betweenness reduction
MCL Flow model
Evade:
Leading Eigenvector model
Clustering Distance (or other) matrix
Principal Component / Factor / SVD methods
RNM
Hybrids:
Use a simple evade technique for starting values and then use a search
technique. (CROWDS, JIGGLE)
Strategies for identifying primary groups:
Evade
Factor Analysis: Treat the adjacency/similarity matrix as a set of N variables and look for
latent factors that explain the variance in the data.
[Figure: path diagram. SES is measured by Income and IQ by Math Score, each with a loading of 1.0 and a disturbance term d fixed at 0.0.]
We often use simple indicators and assume they measure our concepts
Strategies for identifying primary groups:
Evade
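[Figure: measurement model. The latent concepts SES and IQ point to multiple indicators (Income, Occupation, House Size, Highest Degree, Languages Spoken, Reading Score, Math Score), each with its own disturbance term d.]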
But we don’t have to! We can imagine that each latent concept causes our
indicators, and build a measurement model.
Strategies for identifying primary groups:
Evade
$\text{Income} = \lambda_1(\text{SES}) + d_1$
$\text{Occupation} = \lambda_2(\text{SES}) + d_2$
$\text{HouseSize} = \lambda_3(\text{SES}) + d_3$
Strategies for identifying primary groups:
Evade
In a network, we assume that the tie pattern is an imperfect measure of an
underlying latent structure that we can explain with similar factors. Instead of lots
of “measurements” we have many columns in the adjacency (sim) matrix, and we
can summarize that with factor scores.
- works best if the similarity matrix has more information
- so multiple account data are perfect.
- or you can transform the data in some way to carry more information (for example,
use a distance matrix).
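To make the “columns as variables” idea concrete, here is a minimal Python sketch (the slides use SAS; this is a stand-in, not the PROSPER code). It correlates the columns of a small similarity matrix and pulls out the first principal component by power iteration, then reads groups off the sign of each person's loading. The toy network, the self-ties added to the diagonal, and all function names are assumptions made for illustration only.

```python
import math


def column_correlations(matrix):
    """Pearson correlations between the columns of a square matrix."""
    n = len(matrix)
    cols = [[matrix[r][c] for r in range(n)] for c in range(n)]
    centered = []
    for col in cols:
        mean = sum(col) / n
        centered.append([x - mean for x in col])
    norms = [math.sqrt(sum(x * x for x in col)) for col in centered]
    corr = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if norms[i] > 0 and norms[j] > 0:  # constant columns stay at 0
                dot = sum(a * b for a, b in zip(centered[i], centered[j]))
                corr[i][j] = dot / (norms[i] * norms[j])
    return corr


def first_component(sym, iterations=500):
    """Leading eigenvector of a symmetric PSD matrix via power iteration."""
    n = len(sym)
    vec = [1.0] + [0.0] * (n - 1)  # asymmetric start so we do not sit on a symmetry axis
    for _ in range(iterations):
        nxt = [sum(sym[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    return vec


# Toy network (hypothetical): two 4-person cliques joined by the tie 3-4.
n = 8
ties = [(i, j) for i in range(4) for j in range(i + 1, 4)]
ties += [(i, j) for i in range(4, 8) for j in range(i + 1, 8)]
ties.append((3, 4))
sim = [[0] * n for _ in range(n)]
for i, j in ties:
    sim[i][j] = sim[j][i] = 1
for i in range(n):
    sim[i][i] = 1  # assumption: self-ties added so group mates have matching columns

loadings = first_component(column_correlations(sim))
if loadings[0] < 0:  # eigenvectors are only defined up to sign
    loadings = [-x for x in loadings]
groups = [0 if x > 0 else 1 for x in loadings]
print(groups)
print([round(x, 3) for x in loadings])
```

In this toy case the two bridging people (3 and 4) load a little lower than their clique mates, which is the kind of “embeddedness” information factor loadings can carry. A full factor analysis would retain several components and rotate them; this sketch keeps only the first.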
Strategies for identifying primary groups:
Evade
Here is code I used in the PROSPER data:
/* This section builds info on how to weight dyads for in-group, out-group. */
twostp = ((adjmat + adjmat`) > 0) * adjmat; /* two-step paths; the first term allows either direction */
ttie = adjmat # twostp;                     /* nonzero if the tie contributes to a transitive triple */
ttie = ttie + ttie`;                        /* symmetrize */
adjraw = adjmat;                            /* keep a copy of the original directed matrix */
adjmat = adjmat + adjmat`;                  /* force it to be symmetric: 1=asymmetric, 2=reciprocated */
adjmat = adjmat - diag(adjmat);             /* remove any self ties */
d2 = reachlim((adjmat > 0), 3);             /* reachlim is not built into IML; presumably a user-written module giving distances up to 3 steps */
/* re-weight to bias toward reciprocated ties */
wm_4  = (d2=1) # (adjmat=2) # 8;            /* reciprocated direct ties */
wm_2a = (d2=1) # (adjmat=1) # 4;            /* unreciprocated direct ties */
wm_1  = 2 * (d2=2);                         /* ties 2 steps out */
wm_p5 = 0 * (d2=3);                         /* ties 3 steps out - note it's zeroed out here */
wm = wm_4 + wm_2a + wm_1 + wm_p5 + (3 * (ttie / max(ttie))); /* transitivity is at the end */
wm = wm - diag(wm);                         /* clear the diagonal of the weight matrix */
Strategies for identifying primary groups:
Evade
Here is code I used in the PROSPER data:
/* run factor analysis.
Note nfactors is a high value, should only take those
w. EV > 2, but this gives us room... */
proc factor rotate=varimax
min=&minev out=factset data=symmat nfactors=175
outstat=fscores noprint;
run; quit;
Strategies for identifying primary groups:
Evade
Result:
Each column is a person; these are the factor loadings for each person on each retained factor.
Strategies for identifying primary groups:
Evade
Result:
Sociogram for a single school.
The problem is that there are no necessary connectivity checks: you can get “groups” that are disconnected.
Biggest strengths are:
a) Really fast
b) Allows for overlapping groups
c) Gives you “embeddedness” scores based on factor loadings
Strategies for identifying primary groups:
Hybrid
The Crowds Algorithm
1. Identify members of network bicomponents, remove people not included.
2. Cluster the reduced network.
- Identify optimal number of groups: (TREEWALK)
- For each level of the cluster partition tree do (BFS):
-Move up the tree from smaller to larger groups.
-If the fit for both groups is improved by joining them then do so.
-If not, then identify group at that level.
-End TREEWALK.
Do until all groups are identified (GLOBAL LOOP):
3. Evaluate node fit.
Do until nodes cannot be moved:
For each identified cluster do (GRPCHECK):
- Ensure group is a bi-component.
-Calculate effect on group a of moving node j to group a.
-Calculate effect on j's present group of removing j.
- If there is a positive net gain to moving j from own group to a, then do so.
End.
4. Identify Bridging members.
-If removing j from group a would improve the fit of group a, AND assigning j to any other group
would lower the fit for that group, then j is considered a bridge. Place all bridges in separate class.
5. Group Check.
Check returns to combining groups. IF merging groups would improve the fit of all groups to be
merged, then do so.
- Evaluate bridges, to be sure that they are not bridging two groups that have now merged.
End Global loop.
Return to first question: What is a group?
•The simple notions of a complete clique are difficult to square w. real-world data.
•Density is an indicator, but subject to over-grouping (no connectivity) and star-patterns.
•Groups are likely internally differentiated – with “core” vs. “periphery” members
•Most sociological theories of groups rest on transitive closure and short distances
•There’s a sense that members are equal – a tight-knit group
•The group should be fairly small – face-to-face scale
•The social processes underlying the group turn on reciprocity, trust, communication,
homogeneity of norms & beliefs.
•Almost all require a comparative set: in-group to out-group. It is relational not
essential.
•Cross-cutting social circles – would lead us to expect overlapping groups, but in
practice most methods do not do that, as it’s analytically too cumbersome.
Practically, group detection is hard and most methods will give you (slightly) different
results. You can compare results using a Rand statistic (proportion of pairs similarly
categorized in two partitions), but for small settings these differences can matter.
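The Rand statistic described above is easy to compute directly. A minimal Python sketch follows; the function name and the two example partitions are made up for illustration.

```python
from itertools import combinations


def rand_statistic(part_a, part_b):
    """Proportion of pairs treated the same way (together or apart) by both partitions."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(part_a)), 2):
        together_a = part_a[i] == part_a[j]
        together_b = part_b[i] == part_b[j]
        if together_a == together_b:
            agree += 1
        total += 1
    return agree / total


# Two hypothetical group assignments for six people; they disagree only on person 2.
method_1 = [0, 0, 0, 1, 1, 1]
method_2 = [0, 0, 1, 1, 1, 1]
rand = rand_statistic(method_1, method_2)
print(round(rand, 3))
```

Moving a single person changes 5 of the 15 pairs here, so the statistic drops to 10/15. That is why small differences between methods can matter in small settings. The statistic depends only on who is grouped with whom, not on the group labels themselves.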
Social Sub-groups: why look?
Wayne Baker: The Social Structure of a National Securities Market:
1) Behavioral assumptions of economic actors
2) Micro-structure of networks
3) Macro-structure of networks
4) Price Consequences
Under standard economic assumptions, people should act rationally and
act only on price. This would result in expansive and homogeneous (i.e.,
random) networks. It is, in fact, this structure that allows microeconomic
theory to predict that prices will settle to an optimal equilibrium.
Baker’s Model:
He makes two assumptions in contrast to standard economic assumptions:
a) that people do not have access to perfect information and
b) that some people act opportunistically
He then shows how these assumptions change the underlying mechanisms
in the market, focusing on price volatility as a marker for uncertainty.
The key on the exchange floor is “market makers”: people who will keep
the process active, keep trading alive, and thus not ‘hoard’ (and lower
profits system wide).
Baker’s Model:
Micronetworks: Actors should trade extensively and widely. Why might they not?
A) Physical factors (noise and distance)
B) Avoid risk and build trust
Macro-Networks: Should be undifferentiated. Why not?
A) Large crowds should be more differentiated than small crowds. Why?
Price consequences: Markets should clear. They often don’t. Why?
Network differentiation reduces economic efficiency, leading to less
information and more volatile prices
Baker: Use frequency of exchange to identify the network, resulting in:
[Figure: groups found w. NEGOPY]
Baker finds that the structure of this network significantly (and differentially) affects the price volatility of the network.
The one other program you should know about is NEGOPY.
Negopy is a program that combines elements of the density based
approach and the graph theoretic approach to find groups and
positions. Like CROWDS, NEGOPY assigns people both to
groups and to ‘outsider’ or ‘between’ group positions. It also tells
you how many groups are in the network.
It’s a DOS based program, and a little clunky to use, but
NEGWRITE.MOD will translate your data into NEGOPY format
if you want to use it.
There are many other approaches. If you’re interested in some
specifically designed for very large networks (10,000+ nodes),
I’ve developed something I call Recursive Neighborhood Means
that seems to work fairly well.
Baker: Because size is the primary determinant of clustering in this setting, he concludes that the standard economic assumption of large market = efficient is unwarranted.
Scott Feld: Focal Organization of Social Ties
Feld wants to look at the effects of constraint & opportunity for mixing, to
situate relational activity within a wider context.
The contexts form “Foci”,
“A social, psychological, legal or physical entity around which
joint activities are organized” (p.1016)
People with similar foci will be clustered together. He contrasts this with
social balance theory.
Claim: that much of the clustering attributed to interpersonal balance
processes is really due to focal clustering.
(Note that this is not a theoretically fair critique, given that balance theory
can easily accommodate non-personal balance factors (like smoking or
group membership), but it is a good empirical critique: most researchers
haven’t properly accounted for foci.)
Observed Clustering within Adolescent Social Networks
Network Characteristics of Sub Groups
• On average, 65% of a school’s adolescents are in
cohesive sub-groups.
• 87% of all relations are within sub-groups.
• The average sub-group has 22 members.
• The average diameter for a sub-group is 3 steps.
• The mean segregation index is .96 (1=Complete,
0=Random)
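For reference, a segregation index on this 1=Complete, 0=Random scale can be computed as below. This Python sketch assumes the index is Freeman's segregation index (expected cross-group ties minus observed, divided by expected) applied to an undirected network; the slide does not spell out the formula, and the function name and toy data are hypothetical.

```python
def freeman_segregation(edges, labels):
    """(expected cross-group ties - observed) / expected, for an undirected edge list."""
    n = len(labels)
    m = len(edges)
    total_pairs = n * (n - 1) / 2
    sizes = {}
    for g in labels:
        sizes[g] = sizes.get(g, 0) + 1
    within_pairs = sum(s * (s - 1) / 2 for s in sizes.values())
    cross_pairs = total_pairs - within_pairs
    # Under random mixing, ties land on cross-group pairs in proportion to their share of pairs.
    expected_cross = m * cross_pairs / total_pairs
    observed_cross = sum(1 for u, v in edges if labels[u] != labels[v])
    return (expected_cross - observed_cross) / expected_cross


# Toy network (hypothetical): two 4-person cliques joined by one cross-group tie.
edges = [(i, j) for i in range(4) for j in range(i + 1, 4)]
edges += [(i, j) for i in range(4, 8) for j in range(i + 1, 8)]
edges.append((3, 4))
labels = [0, 0, 0, 0, 1, 1, 1, 1]
seg = freeman_segregation(edges, labels)
print(round(seg, 3))
```

With no cross-group ties the index is 1; with as many cross-group ties as random mixing would produce it is 0, and it can dip slightly below 0 when groups mix more than chance.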
Observed Clustering within Adolescent Social Networks
Distribution of characteristic within groups, relative to school distribution
Characteristic  Relative distribution
Grade           34%
Race            65%
College         84%
GPA             86%
Activities      79%
Smoking         74%
Group Data in Add Health
[Figure: sociograms of selected groups: Groups 23 & 24, Group 1, Group 15, Group 18]
Group data in Add Health
Inter-Group Relations
[Figure: network of inter-group relations. Nodes are numbered groups, coded by grade composition (Mostly Seniors, Mostly Juniors, Mostly Sophomores, Mostly Freshmen, Mixed Grades); ties between groups are drawn as directed arrows.]
Group data in Prosper
We have 368 network observations based on 2 cohorts observed over 5 waves in 2 states.
Using a variant of the CROWDs algorithm, I identified groups in every network.
- Results in about 4500 groups averaging about 10 kids in size, though some settings
are really too cohesive to break into small bits, resulting in “peer groups” of 40ish kids.
Table 1. All groups with > 40 members:
state  cohort  wave  school  group  grpsize  grpnumbc  grppctbc
1      2       1     112     5      45       2         0.82222
1      1       2     112     4      73       2         0.91781
2      1       2     160     11     41       1         0.90244
1      1       1     220     1      45       1         0.93333
2      2       3     262     1      42       1         1.00000
1      1       5     306     1      53       1         0.98113
1      1       5     306     5      66       1         0.87879
2      2       5     351     2      45       2         0.84444

Mean Network Characteristics (network descriptives)
Variable  Mean        Std Dev    Min        Max
NumGrps   13.3287671  8.1827593  2.0000000  50.0000000
pisolate  0.0295607   0.0245523  0          0.1343284
pliaison  0.0391871   0.0422634  0          0.3750000
jfoptmod  0.5605613   0.0661626  0.2668055  0.7366568
Group data in Prosper
Table 3. Descriptive stats for group-level structure scores.
Variable      Label                                                N     Mean    Std     Min     Max
grpsize       Number of people in group                            4865  10.025  5.759   1.0     73.0
group         Group label                                          4865  56.644  200.1   0       888.0
igrpties      Sum of within-group ties                             4865  26.461  23.683  0       220.0
s_ogrpties    Sum of ties sent to out-groups                       4865  10.829  8.493   0       99.0
r_ogrpties    Sum of ties received from out-groups                 4865  10.829  9.332   0       111.0
ingrprat      Ratio of in-group ties to out-group ties             4482  1.590   2.169   0       49.0
grpsegs       Freeman Segregation index, group specific            4539  0.655   0.164   -0.032  1.0
avgogtrcvd    Per member ties received from other groups           4865  1.055   0.783   0       7.0
avgogtsent    Per member ties sent to other groups                 4865  1.100   0.777   0       6.0
grpden        Density of within-group ties                         4777  0.294   0.170   0       1.0
grptran       Transitivity of within-group ties                    4379  0.446   0.205   0       1.0
grprecp       Reciprocity of within-group ties                     4433  0.393   0.181   0       1.0
grpdst        Mean distance btwn reachable pairs, directed         4433  1.800   0.474   1.0     4.64
grprchbl      Proportion pairs reachable, directed                 4433  0.675   0.231   0.029   1.0
grpdst_sym    Mean distance btwn reachable pairs, undirected       4433  1.777   0.438   1.0     5.50
grprchbl_sym  Proportion pairs reachable, undirected               4433  0.978   0.124   0.044   1.0
grppctbc      Proportion of members in largest bicomponent         4160  0.828   0.191   0.125   1.0
grpnumbc      Number of bicomponents within group                  4160  1.131   0.377   1.0     5.0
avgpop        Average popularity of members, percentile normalized 4865  0.528   0.171   0.013   0.96
grpcntrlzn    Closeness centralization of the group                4263  0.431   0.318   0       5.60
Group data in Prosper
Variance components by level (setting, group, person) and intraclass correlations, by wave:

AVG USE
               wave1   wave2   wave3   wave4   wave5
setting        0.0003  0.0004  0.0018  0.0051  0.0097
group          0.0018  0.0081  0.0139  0.0581  0.1102
person         0.0488  0.0825  0.1985  0.3458  0.5290
ICC - setting  0.0060  0.0049  0.0085  0.0124  0.0149
ICC - group    0.0359  0.0898  0.0665  0.1472  0.1795

IRT USE
               wave1   wave2   wave3   wave4   wave5
setting        0.0014  0.0020  0.0052  0.0103  0.0152
group          0.0067  0.0223  0.0400  0.1016  0.1660
person         0.1893  0.2657  0.4377  0.6317  0.8646
ICC - setting  0.0073  0.0068  0.0108  0.0139  0.0145
ICC - group    0.0352  0.0788  0.0880  0.1470  0.1739

AVG DEV
               wave1   wave2   wave3   wave4   wave5
setting        0.0003  0.0005  0.0005  0.0015  0.0015
group          0.0055  0.0103  0.0158  0.0302  0.0319
person         0.0751  0.1084  0.1781  0.2364  0.3009
ICC - setting  0.0043  0.0046  0.0024  0.0056  0.0045
ICC - group    0.0685  0.0866  0.0815  0.1140  0.0969

IRT DEV
               wave1   wave2   wave3   wave4   wave5
setting        0.0025  0.0032  0.0030  0.0132  0.0090
group          0.0366  0.0523  0.0686  0.0989  0.1058
person         0.3446  0.4030  0.5157  0.5948  0.6753
ICC - setting  0.0065  0.0070  0.0050  0.0186  0.0114
ICC - group    0.0978  0.1173  0.1197  0.1531  0.1429

TGRAD_R
               wave1   wave2   wave3   wave4   wave5
setting        0.0066  0.0075  0.0155  0.0051  0.0160
group          0.1004  0.1202  0.1852  0.2040  0.2159
person         0.5686  0.5992  0.6724  0.6605  0.6905
ICC - setting  0.0098  0.0103  0.0178  0.0058  0.0173
ICC - group    0.1552  0.1729  0.2276  0.2397  0.2501
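The ICC rows in the variance-component figures above appear to be each level's share of the total variance (setting + group + person). The slides do not state the formula, so treat that as an assumption. A quick Python check using the first wave-1 column (setting 0.0003, group 0.0018, person 0.0488); the function name is made up for the example:

```python
def icc_shares(setting_var, group_var, person_var):
    """Share of total variance at the setting and group levels (assumed ICC definition)."""
    total = setting_var + group_var + person_var
    return setting_var / total, group_var / total


# Wave-1 variance components from the first block of figures above.
icc_setting, icc_group = icc_shares(0.0003, 0.0018, 0.0488)
print(round(icc_setting, 4), round(icc_group, 4))
```

This comes out close to the reported 0.0060 and 0.0359; the small gap is presumably rounding in the printed variance components.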
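Group data in Prosper
Multilevel models of group structure. Outcomes: Group Size (a), Reciprocity (b), Transitivity (b). Cells are Coef. (SE).

Group Size (a)
                             Model 1             Model 2
Fixed Effects
School Level
  Intercept                  2.370 *** (0.027)   2.372 *** (0.027)
  PA School                                      -0.117 *  (0.056)
  Treatment School                               -0.027    (0.053)
Group Level
  Group Delinquency (IRT)    -0.151 ** (0.052)   -0.018    (0.054)
  Group Drinking (%)         0.206 *   (0.090)   0.143     (0.094)
  Family Attachment                              0.213 *   (0.100)
  Grades                                         0.088 **  (0.029)
  Religious Attendance                           0.001     (0.016)
  School Attachment                              -0.016    (0.012)
  Friends Outside of School                      0.016     (0.009)
  Free Lunch (%)                                 -0.317 *** (0.064)
  Two-Parent Family (%)                          0.020     (0.093)
  Male Group                                     -0.029    (0.042)
  Female Group                                   -0.052    (0.038)
  White Group                                    -0.026    (0.036)

Reciprocity (b)
                             Model 1             Model 2
Fixed Effects
School Level
  Intercept                  0.384 *** (0.007)   0.382 *** (0.006)
  PA School                                      0.029     (0.016)
  Treatment School                               0.018     (0.013)
Group Level
  Group Delinquency (IRT)    -0.101 *** (0.019)  -0.007    (0.022)
  Group Drinking (%)         0.123 **  (0.038)   0.072 *   (0.035)
  Family Attachment                              -0.030    (0.033)
  Grades                                         0.045 *   (0.019)
  Religious Attendance                           0.009     (0.007)
  School Attachment                              0.006     (0.005)
  Friends Outside of School                      -0.021 *** (0.005)
  Free Lunch (%)                                 -0.015    (0.034)
  Two-Parent Family (%)                          -0.012    (0.058)
  Male Group                                     -0.038 ** (0.011)
  Female Group                                   0.103 *** (0.013)
  White Group                                    0.028 **  (0.011)
  Group Size                                     -0.003 *** (0.001)

Transitivity (b)
                             Model 1             Model 2
Fixed Effects
School Level
  Intercept                  0.429 *** (0.009)   0.433 *** (0.009)
  PA School                                      -0.009    (0.019)
  Treatment School                               0.012     (0.018)
Group Level
  Group Delinquency (IRT)    -0.087 *** (0.023)  0.041     (0.027)
  Group Drinking (%)         0.135 **  (0.038)   0.105 **  (0.036)
  Family Attachment                              0.114 *   (0.048)
  Grades                                         0.061 **  (0.020)
  Religious Attendance                           0.018 **  (0.007)
  School Attachment                              0.002     (0.006)
  Friends Outside of School                      -0.013 ** (0.005)
  Free Lunch (%)                                 -0.064    (0.043)
  Two-Parent Family (%)                          -0.027    (0.057)
  Male Group                                     0.017     (0.018)
  Female Group                                   0.085 *** (0.016)
  White Group                                    -0.001    (0.014)
  Group Size                                     -0.006 *** (0.001)

Random Effects
Variance Components  Group Size M1  Group Size M2  Reciprocity M1  Reciprocity M2  Transitivity M1  Transitivity M2
Between (level-2)    0.025 ***      0.026 ***      0.001           0.000           0.002 **         0.002 ***
Within (level-1)     2.270          2.170          0.034           0.029           0.040            0.034

***p<.001, **p<.01, *p<.05
Note: SEs are robust (adjusted for clustering) and variables are grand centered.
(a) Model is hierarchical overdispersed Poisson
(b) Model is hierarchical linear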