A Brief History of Lognormal and Power Law Distributions


Toward Validation and Control of
Network Models
Michael Mitzenmacher
Harvard University
1
Internet Mathematics Articles Related to This Talk
• The Future of Power Law Research
• A Brief History of Generative Models for Power Law and Lognormal Distributions
2
Motivation: General
• Network Science and Engineering is emerging as
its own (sub)field.
– NSF : cross-cutting area starting this year.
– Courses : Cornell (Easley/Kleinberg), Kearns (U Penn),
many others.
• For undergrads, not just grads!
– In popular culture: books like Linked by Barabási or
Six Degrees by Watts.
– Other sciences: Economics, biology, physics, ecology,
linguistics, etc.
• What has been and what should be the research
agenda?
3
My (Biased) View
• The 5 stages of networking research.
1) Observe: Gather data to demonstrate a behavior in a
system. (Example: power law behavior.)
2) Interpret: Explain the importance of this observation in
the system context.
3) Model: Propose an underlying model for the observed
behavior of the system.
4) Validate: Find data to validate (and if necessary
specialize or modify) the model.
5) Control: Design ways to control and modify the
underlying behavior of the system based on the model.
4
My (Biased) View
• In networks, we have spent a lot of time observing
and interpreting behaviors.
• We are currently very active in modeling.
– Many, many possible models.
– Perhaps easiest to write papers about.
• We need to now put much more focus on
validation and control.
– Have been moving in this direction.
– And these are specific areas where computer science
has much to contribute!
5
Models
• After observation, the natural step is to
explain/model the behavior.
• Outcome: lots of modeling papers.
– And many models rediscovered.
• Example : power laws
• Lots of history…
6
History
• In 1990’s, the abundance of observed power laws in networks
surprised the community.
– Perhaps they shouldn’t have… power laws appear frequently
throughout the sciences.
• Pareto: income distribution, 1897
• Zipf-Auerbach: city sizes, 1913/1940's
• Zipf-Estoup: word frequency, 1916/1940's
• Lotka: bibliometrics, 1926
• Yule: species and genera, 1924
• Mandelbrot: economics/information theory, 1950's+
• Observation/interpretation were/are key to initial understanding.
• My claim: by now the mere existence of power laws should not
be surprising, or necessarily even noteworthy.
• My (biased) opinion: The bar should now be very high for
observation/interpretation.
7
So Many Models…
• Preferential Attachment (sketched below)
• Optimization (HOT)
• Monkeys typing randomly (scaling)
• Multiplicative processes
• Kronecker graphs
• Forest fire model (densification)
8
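To make the first item on that list concrete, here is a minimal Python sketch of preferential attachment (illustrative code, not from the talk): each new node links to m earlier nodes chosen with probability proportional to their current degree.

    import random

    def preferential_attachment(n, m=2, seed=None):
        """Grow an n-node graph; each new node links to m earlier nodes,
        chosen with probability proportional to their current degree."""
        rng = random.Random(seed)
        edges = [(i, i + 1) for i in range(m)]      # small seed path on m+1 nodes
        endpoints = [v for e in edges for v in e]   # node v appears deg(v) times
        for new in range(m + 1, n):
            # Uniform choice from the endpoint multiset = degree-biased choice.
            targets = {rng.choice(endpoints) for _ in range(m)}  # dedup for simplicity
            for t in targets:
                edges.append((new, t))
                endpoints += [new, t]
        return edges

Tallying degrees over the returned edge list exhibits the heavy tail; in this basic variant the fraction of nodes with degree k decays roughly as k^-3.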
What Makes a Good Model…
• New variations come up all the time.
• Question : What makes a new network model
sufficiently interesting to merit attention and/or
publication?
– Strong connection to an observed process.
• Many models claim this, but few demonstrate it convincingly.
– Theory perspective: significant new mathematical
insight or sophistication.
• A matter of taste?
• My (biased) opinion: the bar should start being
raised on model papers.
9
Validation: The Current Stage
• We now have so many models.
• It is important to know the right model, to
extrapolate and control future behavior.
• Given a proposed underlying model, we need tools
to help us validate it.
• We appear to be entering the validation stage of
research…. BUT the first steps have focused on
invalidation rather than validation.
10
Examples : Invalidation
• Lakhina, Byers, Crovella, Xie
– Show that the observed power law of Internet topology
might be due to biases in traceroute sampling.
• Pedarsani, Figueiredo, Grossglauser
– Show that densification may also arise from sampling
approaches, rather than being intrinsic to the network.
• Chen, Chang, Govindan, Jamin, Shenker,
Willinger
– Show that Internet topology has characteristics that do
not match preferential-attachment graphs.
– Suggest an alternative mechanism.
• But does this alternative match all characteristics, or are we
still missing some?
11
My (Biased) View
• Invalidation is an important part of the process!
BUT it is inherently different from validating a
model.
• Validating seems much harder.
• Indeed, it is arguable what constitutes a validation.
• Question: what should it mean to say
“This model is consistent with observed data.”
12
An Alternative View
• There is no “right model”.
• A model is the best until some other model comes
along and proves better.
– Greedy refinement via invalidation in model space.
– Statistical techniques: compare likelihood ratios for
various models (see the sketch after this slide).
• My (biased) opinion: this is one useful approach;
but not the end of the question.
– Need methods other than comparison for confirming
validity of a model.
13
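A minimal sketch of the likelihood-ratio idea: fit both a power law and a lognormal by maximum likelihood and compare. The formulas are the standard MLEs, but the bare ratio below omits the variance correction a proper test (e.g. Vuong's) would add.

    import math

    def power_law_loglik(xs):
        # Log-likelihood under an MLE-fit continuous power law
        # p(x) = ((alpha-1)/xmin) * (x/xmin)^(-alpha), x >= xmin.
        xmin, n = min(xs), len(xs)
        alpha = 1 + n / sum(math.log(x / xmin) for x in xs)   # standard MLE
        return sum(math.log((alpha - 1) / xmin) - alpha * math.log(x / xmin)
                   for x in xs)

    def lognormal_loglik(xs):
        # Log-likelihood under an MLE-fit lognormal: log x ~ Normal(mu, var).
        logs = [math.log(x) for x in xs]
        mu = sum(logs) / len(logs)
        var = sum((l - mu) ** 2 for l in logs) / len(logs)
        return sum(-l - 0.5 * math.log(2 * math.pi * var)
                   - (l - mu) ** 2 / (2 * var) for l in logs)

    # A positive difference favors the power law, a negative one the lognormal:
    # ratio = power_law_loglik(data) - lognormal_loglik(data)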
Time-Series/Trace Analysis
• Many models posit some sort of actions.
– New pages linking to pages in the Web.
– New routers joining the network.
– New files appearing in a file system.
• A validation approach: gather traces and see if the
traces suitably match the model.
– Trace gathering can be a challenging systems problem.
– Checking the model match requires appropriate
statistical techniques and tests.
– May lead to new, improved, better justified models.
14
Sampling and Trace Analysis
• Often, cannot record all actions.
– Internet is too big!
• Sampling
– Global: snapshots of entire system at various times.
– Local: record actions of sample agents in a system.
• Examples:
– Snapshots of file systems: full systems vs. actions of
individual users.
– Router topology: Internet maps vs. changes at subset of
routers.
• Question: how much/what kind of sampling is
sufficient to validate a model appropriately?
– Does this differ among models?
15
To Control
• In many systems, intervention can impact the
outcome.
– Maybe not for earthquakes, but for computer networks!
– Typical setting: individual agents acting in their own
selfish interest. Agents can be given incentives to
change behavior.
• General problem: given a good model, determine
how to change system behavior to optimize a
global performance function.
– Distributed algorithmic mechanism design.
– Mix of economics/game theory and computer science.
16
Possible Control Approaches
• Adding constraints: local or global
– Example: total space in a file system.
– Example: preferential attachment but links limited by
an underlying metric.
• Add incentives or costs
– Example: charges for exceeding soft disk quotas.
– Example: payments for certain AS level connections.
• Limiting information
– Impact decisions by not letting everyone have true view
of the system.
17
My Related Work : Hash Algorithms
• On the Internet, we need a measurement and
monitoring infrastructure, for validation and
control.
– Approximate is fine; speed is key.
– Must be general, multi-purpose.
– Must allow data aggregation.
• Solution : hash-based architecture.
– Eventual goal: every router has a programmable “hash
engine”.
18
Vision
• Three-pronged research agenda.
• Low: Efficient hardware implementations of
relevant algorithms and data structures.
• Medium: New, improved data structures and
algorithms for old and new applications.
• High: Distributed infrastructure supporting
monitoring and measurement schemes.
19
The High-Level Pitch
• Lots of hash-based schemes being designed
for approximate measurement/monitoring
tasks.
– But not built into the system to begin with.
• Want a flexible router architecture that
allows:
– New methods to be easily added.
– Distributed cooperation using such schemes.
20
What We Need
• Memory + Computation + Communication + Control

[Block diagram: a hashing computation unit with on-chip memory, off-chip memory, and CAM(s); a control system; a programming language; a unit for other computation; and a communication architecture.]
21
Lots of Design Questions
• How much space for various memory levels? How
to dynamically divide memory among competing
applications?
• What hash functions should be included? Openness
to new hash functions?
• What programming language and functionality?
• What communication infrastructure?
• Security?
• And so on…
22
Which Hash Functions?
• Theorists:
– Want analyzable hash functions.
– Dislike standard assumption of perfectly random hash
functions.
– Hard to prove things about actual performance.
• Practitioners:
– Want easy implementation, speed, small space.
– Want simple analysis (back-of-the-envelope).
– Will accept simulated results under right settings.
23
Why Do Weak Hash Functions
Work So Well?
• In reality, assuming perfectly random hash
functions seems to be the right thing to do.
– Easier to analyze.
– Real systems almost always work that way,
even with weak hash functions!
• Can Theory explain strong performance of
weak hash functions?
24
Recent Work
• A new explanation (joint work with Salil Vadhan):
• Choosing a hash function from a pairwise independent
family is enough – if data has sufficient entropy.
– Randomness of hash function and data “combine”.
– Behavior matches truly random hash function with high
probability.
• Techniques based on theory of randomness
extraction.
– Extensions of Leftover Hash Lemma.
25
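For illustration, the textbook pairwise-independent family h_{a,b}(x) = ((ax + b) mod p) mod m is easy to write down; the prime and names below are illustrative choices of mine, not taken from the paper.

    import random

    P = (1 << 61) - 1          # an illustrative Mersenne prime, larger than any key

    def draw_hash(m, rng=random):
        # Draw h_{a,b}(x) = ((a*x + b) mod P) mod m with a, b uniform, a != 0.
        # (The final mod m leaves the output only near-uniform; standard slack.)
        a = rng.randrange(1, P)
        b = rng.randrange(P)
        return lambda x: ((a * x + b) % P) % m

    # h = draw_hash(1 << 10)   # one random family member; h(key) lands in [0, 1024)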
What Functionality?
• Hash tables should be a basic primitive.
• “Best” hash tables: cuckoo hashing.
– Worst case constant lookup time.
– Simple to build, design.
• How can we make them even better?
– Move cuckoo hashing from theory to practice!
26
Cuckoo Hashing [Pagh,Rodler]
• Basic scheme: each element gets two
possible locations.
• To insert x, check both locations for x. If
one is empty, insert.
• If both are full, x kicks out an old element y.
Then y moves to its other location.
• If that location is full, y kicks out z, and so
on, until an empty slot is found.
27
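A minimal Python sketch of the scheme just described, as a toy model (two hash functions simulated by salting Python's built-in hash; a kick limit stands in for the theoretical re-hash rule):

    import random

    class CuckooTable:
        # Toy cuckoo hash table: each element has one candidate slot per table.
        def __init__(self, size, max_kicks=32):
            self.size, self.max_kicks = size, max_kicks
            self.salts = (random.random(), random.random())  # two "hash functions"
            self.tables = [[None] * size, [None] * size]

        def _pos(self, i, x):
            return hash((self.salts[i], x)) % self.size

        def lookup(self, x):
            return any(self.tables[i][self._pos(i, x)] == x for i in (0, 1))

        def insert(self, x):
            return self._place(x) is None

        def _place(self, x):
            # Returns None on success, or the element left homeless on failure.
            if self.lookup(x):
                return None
            i = 0
            for _ in range(self.max_kicks):
                for j in (i, 1 - i):               # take either slot if empty
                    p = self._pos(j, x)
                    if self.tables[j][p] is None:
                        self.tables[j][p] = x
                        return None
                p = self._pos(i, x)                # both full: kick out the occupant,
                x, self.tables[i][p] = self.tables[i][p], x
                i = 1 - i                          # which retries in its other table
            return x                               # failure; theory says re-hash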
Cuckoo Hashing Examples

[Slides 28-33: diagrams of successive insertions of elements A-G into the two tables, showing direct placements and chains of displacements until every element finds a slot.]
Cuckoo Hashing Failures
• Bad case 1: inserted element runs into cycles.
• Bad case 2: inserted element has very long path before
insertion completes.
– Could be on a long cycle.
• Bad cases occur with small probability when the load is
sufficiently low, but not so small that they can be ignored:
• Theoretical solution: re-hash everything if a failure occurs.
• For 2 choices and load below 50%, n elements give a failure
rate of Θ(1/n); maximum insert time is O(log n).
– Better space utilization and failure rate with more choices, or more
elements per bucket.
34
Recent Work : A CAM-Stash
• Use a CAM (Content Addressable Memory) to stash away
elements that would cause failure.
– Joint with Kirsch/Wieder.
• Intuition: if failures were independent, the probability that s
elements cause failures goes to Θ(1/n^s).
– Failures not independent, but nearly so.
– A stash holding a constant number of elements greatly reduces failure
probability.
– Implemented as a CAM in hardware, or a cache line in
hardware/software.
• Lookups must also check the stash.
35
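Extending the toy sketch from slide 27, the stash fits naturally on top; the stash_size constant and linear scan below stand in for the constant-size CAM or cache line:

    class StashedCuckooTable(CuckooTable):
        # Cuckoo table plus a small constant-size stash (modeling the CAM).
        def __init__(self, size, stash_size=4, **kwargs):
            super().__init__(size, **kwargs)
            self.stash, self.stash_size = [], stash_size

        def lookup(self, x):
            # Every lookup also scans the stash (cheap: constant size).
            return super().lookup(x) or x in self.stash

        def insert(self, x):
            homeless = self._place(x)
            if homeless is None:
                return True
            if len(self.stash) < self.stash_size:
                self.stash.append(homeless)        # park the failed element
                return True
            return False                           # stash full: genuine failure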
Modeling : Economic Principles
• Joint work with Corbo, Jain, Parkes.
• Exploration : what models make sense for AS
connectivity.
– Extending approach of Chang, Jamin, Mao, Willinger.
– Entering nodes link according to business model, utility
function.
– Nodes revise their links based on new entrants.
• Like the forest fire model.
• Future considerations: how to validate such
models.
36
Conclusion : My (Biased) View
• There are 5 stages of networking research.
1) Observe: Gather data to demonstrate power law
behavior in a system.
2) Interpret: Explain the import of this observation in the
system context.
3) Model: Propose an underlying model for the observed
behavior of the system.
4) Validate: Find data to validate (and if necessary
specialize or modify) the model.
5) Control: Design ways to control and modify the
underlying behavior of the system based on the model.
• We need to focus on validation and control.
– Lots of open research problems.
37
A Chance for Collaboration
• The observe/interpret stages of research are dominated by
systems; modeling dominated by theory.
– And we need new insights from statistics, control theory, and economics!
• Validation and control require a strong theoretical
foundation.
– Need universal ideas and methods that span different types of
systems.
– Need understanding of underlying mathematical models.
• But also a large systems buy-in.
– Getting/analyzing/understanding data.
– Finding avenues for real impact.
• Good area for future systems/theory/others collaboration
and interaction.
38
More About Me
• Website: www.eecs.harvard.edu/~michaelm
– Links to papers
– Link to book
– Link to blog : mybiasedcoin
• mybiasedcoin.blogspot.com
39