
IST 511 Information Management: Information
and Technology
Probabilistic reasoning
Dr. C. Lee Giles
David Reese Professor, College of Information Sciences
and Technology
The Pennsylvania State University, University Park, PA,
USA
[email protected]
http://clgiles.ist.psu.edu
Special thanks to J. Lafferty, J. Latombe, T. Cover, R.V. Jones
Last Time
Introduction to machine learning (ML)
– Definitions/theory
• Supervised learning most common
– Why important
– How is ML used (often)
What is learning
– Part of AI
– Relation to animal/human learning
Impact on information science
Today
What are probabilities
What is information theory
What is probabilistic reasoning
– Definitions
– Why important
– How used – decision making
– Decision trees
Impact on information science
Tomorrow
Topics used in IST
• Data mining, information extraction
• Metadata; digital libraries, scientometrics
• Others?
Theories in Information Sciences
Enumerate some of these theories in this course.
Issues:
– Unified theory?
– Domain of applicability
– Conflicts
Theories here are
– Very algorithmic
– Some quantitative
– Some qualitative
Quality of theories
– Occam’s razor
– Subsumption of other theories (foundational)
Theories of reasoning
– Cognitive, algorithmic, social
Probability vs all the others
Probability theory
• the branch of mathematics concerned with analysis of
random phenomena.
• Randomness: a non-order or non-coherence in a sequence of
symbols or steps, such that there is no intelligible pattern or
combination.
• The central objects of probability theory are random
variables, stochastic processes, and events
• mathematical abstractions of non-deterministic events or
measured quantities that may either be single occurrences
or evolve over time in an apparently random fashion.
Uncertainty
– A lack of knowledge about an event
– Can be represented by a probability
• Ex: roll a die, draw a card
– Can be represented as an error
Statistic (a measure in statistics)
– Can use probability in determining that measure
Founders of Probability Theory
Blaise Pascal (1623-1662, France) and Pierre Fermat (1601-1665, France)
Laid the foundations of probability theory in a correspondence on a dice game posed by a French nobleman.
Sample Spaces – measures of events
Collection (list) of all possible outcomes
EXPERIMENT: ROLL A DIE
– e.g.: All six faces of a die:
EXPERIMENT: DRAW A CARD
e.g.: All 52 cards in a deck:
Types of Events
Random event
– Different likelihoods of occurrence
Simple event
– Outcome from a sample space with one
characteristic in simplest form
– e.g.: King of clubs from a deck of cards
Joint event
– Conjunction (AND, ∧, “,”); disjunction (OR, v)
– Contains several simple events
– e.g.: A red ace from a deck of cards
Visualizing Events
Excellent ways of determining probabilities:
Contingency tables (a neat way to look at events):

         Ace   Not Ace   Total
Black     2      24        26
Red       2      24        26
Total     4      48        52

Tree diagrams:

Full Deck of Cards
– Red Cards: Ace / Not an Ace
– Black Cards: Ace / Not an Ace
Review of Probability Rules
Given 2 events: G, H
1) P(G OR H) = P(G) + P(H) - P(G AND H); for mutually exclusive events, P(G AND H) = 0
2) P(G AND H) = P(G) P(H|G), also written as P(H|G) = P(G AND H) / P(G)
3) If G and H are independent, P(H|G) = P(H), thus P(G AND H) = P(G) P(H)
4) P(G) ≥ P(G AND H); P(H) ≥ P(G AND H)
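These rules can be checked mechanically. A small sketch using Python's exact fractions, with G = "the card is red" and H = "the card is an ace" from a standard deck (here suit color and ace rank happen to be independent):

```python
from fractions import Fraction

# Deck example: G = "card is red" (26/52), H = "card is an ace" (4/52)
p_g = Fraction(26, 52)
p_h = Fraction(4, 52)
p_g_and_h = Fraction(2, 52)          # the two red aces

# Rule 1: addition rule
p_g_or_h = p_g + p_h - p_g_and_h     # 28/52 = 7/13

# Rule 2: conditional probability
p_h_given_g = p_g_and_h / p_g        # 2/26 = 1/13

# Rule 3: red and ace are independent here
assert p_h_given_g == p_h
assert p_g_and_h == p_g * p_h

# Rule 4: a conjunction is never more probable than either conjunct
assert p_g >= p_g_and_h and p_h >= p_g_and_h

print(p_g_or_h)  # 7/13
```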
Odds
Another way to express probability is in terms of odds, d
d = p/(1-p)
p = probability of an outcome
Example: What are the odds of getting a six on a dice throw?
We know that p=1/6, so
d = 1/6/(1-1/6) = (1/6)/(5/6) = 1/5.
Gamblers often turn it around and say that the odds against
getting a six on a dice roll are 5 to 1.
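A minimal sketch of the conversion in Python, using exact fractions:

```python
from fractions import Fraction

def odds(p):
    """Convert a probability p into odds d = p / (1 - p)."""
    return p / (1 - p)

# Odds of rolling a six with a fair die: p = 1/6
p = Fraction(1, 6)
d = odds(p)
print(d)   # 1/5 — gamblers quote this as "5 to 1 against"
```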
Probabilistic Reasoning
Types of probabilistic reasoning
• Reasoning using probabilistic methods
• Reasoning with uncertainty
• Rigorous reasoning vs heuristics or biases
Heuristics and Biases in Reasoning
Tversky & Kahneman showed that people often do not
follow rules of probability
Instead, decision making may be based on heuristics
(heuristic decision making)
Lower cognitive load but may lead to systematic errors and
biases
Example heuristics
– Representativeness
– Availability
– Conjunctive fallacy
Gambling/Predictions
A fair coin is flipped. H heads, T tails
- What is a more likely sequence?
A) H T H T T H
B) H H H H H H
- Which result is more likely to follow A)? To follow B)?
Decision Tree
Representativeness Heuristic
The sequence “H T H T T H” is seen as more representative
of or similar to a prototypical coin sequence
While each sequence has the same probability of occurring
The likelihood of a flip following both A and B are the same:
½ H; ½ T
The T for B) is no more likely; events are independent
Gambler’s Fallacy
When is this not the case?
Linda is 31 years old, single, outspoken, and very bright. She
majored in philosophy. As a student, she was deeply concerned
with issues of discrimination and social justice, and also
participated in anti-nuclear demonstrations.
Please choose the most likely alternative:
(a) Linda is a bank teller
(b) Linda is a bank teller and is active in the feminist movement
Conjunction Fallacy
Nearly 90% choose the second alternative (bank teller and
active in the feminist movement), even though it is
logically incorrect (conjunction fallacy)
bank tellers
bank tellers
who are not
feminists
feminists
feminist bank tellers
P(A) > P(A,B); P(B) > P(A,B)
feminists
who are not
bank tellers
Kahneman and Tversky (1982)
How to avoid these mistakes
Such mistakes can cause bad decisions and loss of
• Profits
• Lives
• Health
• Justice (prosecutor’s fallacy)
• Etc.
Instead, use probabilistic methods
Example
Reasoning with an Uncertain Agent
[Diagram: an agent connected to its environment through sensors and actuators, guided by an internal model; question marks mark the uncertainty at each interface]
An Old Problem … Getting
Somewhere
Types of Uncertainty
Uncertainty in prior knowledge
E.g., some causes of a disease are unknown and are not represented in the background knowledge of a medical assistant agent
Uncertainty in actions
E.g., actions are represented with relatively short lists of preconditions, while these lists are in fact arbitrarily long
For example, to drive your car in the morning:
• It must not have been stolen during the night
• It must not have flat tires
• There must be gas in the tank
• The battery must not be dead
• The ignition must work
• You must not have lost the car keys
• No truck should obstruct the driveway
• You must not have suddenly become blind or paralytic
• Etc…
Not only would it not be possible to list all of them, but
would trying to do so be efficient?
Questions
How to represent uncertainty in knowledge?
How to reason (inferences) with uncertain
knowledge?
Which action to choose under uncertainty?
Handling Uncertainty
Possible Approaches:
1. Default reasoning
2. Worst-case reasoning
3. Probabilistic reasoning
Default Reasoning
Creed: The world is fairly normal. Abnormalities are rare
So, an agent assumes normality, until there is evidence of the
contrary
E.g., if an agent sees a bird x, it assumes that x can fly, unless it
has evidence that x is a penguin, an ostrich, a dead bird, a
bird with broken wings, …
Worst-Case Reasoning
Creed: Just the opposite! The world is ruled by
Murphy’s Law
Uncertainty is defined by sets, e.g., the set of possible
outcomes of an action, or the set of possible positions
of a robot
The agent assumes the worst case, and chooses the
action that maximizes a utility function in this case
Example: Adversarial search
Probabilistic Reasoning
Creed: The world is not divided between “normal” and
“abnormal”, nor is it adversarial. Possible situations have
various likelihoods (probabilities)
The agent has probabilistic beliefs – pieces of knowledge with
associated probabilities (strengths) – and chooses its actions
to maximize the expected value of some utility function
Notion of Probability
You drive on Atherton often, and you notice that 70%
of the time there is a traffic slowdown at the exit to Park. The
next time you plan to drive on Atherton, you will believe that
the proposition “there is a slowdown at the exit to Park” is True
with probability 0.7
Axioms of probability:
The probability of a proposition A is a real number P(A)
between 0 and 1
P(True) = 1 and P(False) = 0
P(A v B) = P(A) + P(B) - P(A ∧ B)
P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
Since P(A v ¬A) = 1 and P(A ∧ ¬A) = 0, it follows that:
P(¬A) = 1 - P(A)
Interpretations of Probability
Frequency
Subjective
Frequency Interpretation
Draw a ball from a bag containing n balls of the same size, r
red and s yellow.
The probability that the proposition A = “the ball is red” is true
corresponds to the relative frequency with which we expect
to draw a red ball  P(A) = r/n
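The frequency interpretation can be illustrated with a small simulation; the bag of 3 red and 7 yellow balls below is an assumed example, not from the slides:

```python
import random

# Bag with r red and s yellow balls; the frequency interpretation says the
# relative frequency of red draws converges to r / n (counts are illustrative)
r, s = 3, 7
bag = ["red"] * r + ["yellow"] * s

random.seed(0)
trials = 100_000
reds = sum(random.choice(bag) == "red" for _ in range(trials))
print(reds / trials)   # close to r / (r + s) = 0.3
```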
Subjective Interpretation
There are many situations in which there is no objective
frequency interpretation:
– On a windy day, just before paragliding from the top of El Capitan,
you say “there is probability 0.05 that I am going to die”
– You have worked hard in this class and you believe that the
probability that you will get an A is 0.9
Random Variables
A proposition that takes the value True with
probability p and False with probability 1-p is a
random variable with distribution (p,1-p)
If a bag contains balls having 3 possible colors – red,
yellow, and blue – the color of a ball picked at
random from the bag is a random variable with 3
possible values
The (probability) distribution of a random variable X
with n values x1, x2, …, xn is:
(p1, p2, …, pn)
with P(X=xi) = pi and Σi=1,…,n pi = 1
Expected Value
Random variable X with n values x1,…,xn and distribution
(p1,…,pn)
E.g.: X is the state reached after doing an action A under
uncertainty
Function U of X
E.g., U is the utility of a state
The expected value of U after doing A is
E[U] = Σi=1,…,n pi U(xi)
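A sketch of the expected-utility computation; the three states and their utility values below are hypothetical:

```python
# Expected utility E[U] = sum_i p_i * U(x_i) over the outcomes of action A.
def expected_utility(probs, utilities):
    assert abs(sum(probs) - 1.0) < 1e-9, "distribution must sum to 1"
    return sum(p * u for p, u in zip(probs, utilities))

# Action A reaches one of three states with the given probabilities
probs     = [0.2, 0.5, 0.3]
utilities = [10.0, 4.0, -2.0]
print(expected_utility(probs, utilities))   # 0.2*10 + 0.5*4 + 0.3*(-2) = 3.4
```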
Toothache Example
A certain dentist is only interested in two things
about any patient, whether he has a toothache and
whether he has a cavity
Over years of practice, she has constructed the
following joint distribution:

           Toothache   ¬Toothache
Cavity        0.04        0.06
¬Cavity       0.01        0.89
Joint Probability Distribution
k random variables X1, …, Xk
The joint distribution of these variables is a table in
which each entry gives the probability of one
combination of values of X1, …, Xk
Example:

           Toothache   ¬Toothache
Cavity        0.04        0.06
¬Cavity       0.01        0.89

P(Cavity ∧ Toothache) = 0.04
P(¬Cavity ∧ Toothache) = 0.01
Joint Distribution Says It All

           Toothache   ¬Toothache
Cavity        0.04        0.06
¬Cavity       0.01        0.89

P(Toothache) = P((Toothache ∧ Cavity) v (Toothache ∧ ¬Cavity))
= P(Toothache ∧ Cavity) + P(Toothache ∧ ¬Cavity)
= 0.04 + 0.01 = 0.05
P(Toothache v Cavity)
= P((Toothache ∧ Cavity) v (Toothache ∧ ¬Cavity)
v (¬Toothache ∧ Cavity))
= 0.04 + 0.01 + 0.06 = 0.11
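Reading marginals and disjunctions off the joint table is mechanical once the table is stored; a sketch using the dentist's numbers:

```python
# Marginals and disjunctions read straight off the dentist's joint table.
# Keys are (cavity, toothache) truth pairs; values are the slide's entries.
joint = {
    (True,  True):  0.04, (True,  False): 0.06,
    (False, True):  0.01, (False, False): 0.89,
}

p_toothache = sum(p for (c, t), p in joint.items() if t)             # 0.05
p_tooth_or_cavity = sum(p for (c, t), p in joint.items() if t or c)  # 0.11
p_cavity = sum(p for (c, t), p in joint.items() if c)                # 0.10

print(round(p_toothache, 2), round(p_tooth_or_cavity, 2), round(p_cavity, 2))
```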
Conditional Probability
Definition:
– P(A ∧ B) = P(A|B) P(B)
Read P(A|B): probability of A given that we know B
P(A) is called the prior probability of A
P(A|B) is called the posterior or conditional probability of A given B
Example

           Toothache   ¬Toothache
Cavity        0.04        0.06
¬Cavity       0.01        0.89

P(Cavity ∧ Toothache) = P(Cavity|Toothache) P(Toothache)
P(Cavity) = 0.1
P(Cavity|Toothache) = P(Cavity ∧ Toothache) / P(Toothache)
= 0.04/0.05 = 0.8
Generalization
P(A ∧ B ∧ C) = P(A|B,C) P(B|C) P(C)
Conditional Independence
Propositions A and B are (conditionally) independent iff:
P(A|B) = P(A)
⇒ P(A ∧ B) = P(A) P(B)
A and B are independent given C iff:
P(A|B,C) = P(A|C)
⇒ P(A ∧ B|C) = P(A|C) P(B|C)
Conditional Independence
Let A and B be independent, i.e.:
P(A|B) = P(A)
P(A ∧ B) = P(A) P(B)
What about A and ¬B?
P(A|¬B) = P(A ∧ ¬B) / P(¬B)
A = (A ∧ B) v (A ∧ ¬B)
P(A) = P(A ∧ B) + P(A ∧ ¬B)
P(A ∧ ¬B) = P(A) - P(A) P(B) = P(A) x (1 - P(B))
P(¬B) = 1 - P(B)
Therefore P(A|¬B) = P(A): A and ¬B are independent too
Bayes’ Rule
P(A ∧ B) = P(A|B) P(B)
= P(B|A) P(A)

P(B|A) = P(A|B) P(B) / P(A)
Example
Given:
– P(Cavity) = 0.1
– P(Toothache) = 0.05
– P(Cavity|Toothache) = 0.8
Bayes’ rule tells:
– P(Toothache|Cavity) = (0.8 x 0.05)/0.1
= 0.4
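A sketch of the same calculation in Python; the argument names are illustrative:

```python
def posterior(likelihood, prior, evidence):
    """Bayes' rule: P(B|A) = P(A|B) P(B) / P(A)."""
    return likelihood * prior / evidence

# P(Toothache|Cavity) = P(Cavity|Toothache) P(Toothache) / P(Cavity)
p = posterior(likelihood=0.8, prior=0.05, evidence=0.1)
print(round(p, 2))   # 0.4
```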
Generalization
P(A ∧ B ∧ C) = P(A ∧ B|C) P(C)
= P(A|B,C) P(B|C) P(C)
P(A ∧ B ∧ C) = P(A ∧ B|C) P(C)
= P(B|A,C) P(A|C) P(C)

P(B|A,C) = P(A|B,C) P(B|C) / P(A|C)
Web Size Estimation - Capture/Recapture Analysis
Consider the web page coverage of search engines a and b
– pa: probability that engine a has indexed a page; pb for engine b; pa,b joint probability
– Assuming independence: pa,b = pa|b pb ≈ pa pb
– sa: number of unique pages indexed by engine a; N: number of web pages
pa = sa / N, so pa,b = sa,b / N ≈ (sa / N)(sb / N),
giving the web size estimate N ≈ sa sb / sa,b
– nb: number of documents returned by b for a query; na,b: number of documents returned
by both engines a & b for the query; the ratio sb / sa,b is estimated by averaging
nb / na,b over queries
Lower bound estimate of size of the Web:
N̂ ≈ sa × ⟨nb / na,b⟩queries ; sa known
– random sampling assumption
– extensions - Bayesian estimate, more engines (Bharat, Broder, WWW7 ‘98), etc.
Lawrence, Giles, Science’98
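A rough sketch of this estimator; the index size and per-query result counts below are invented for illustration, not from the study:

```python
# Capture/recapture sketch: estimate corpus size N from overlap between two
# engines, N-hat = s_a * average(n_b / n_ab), under the independence and
# random-sampling assumptions. All counts below are made up.
def estimate_size(s_a, overlap_ratios):
    avg = sum(overlap_ratios) / len(overlap_ratios)
    return s_a * avg

s_a = 100_000_000                      # pages indexed by engine a (assumed known)
# per-query (n_b, n_ab): results from b, and results returned by both engines
queries = [(200, 60), (150, 40), (300, 100)]
ratios = [n_b / n_ab for n_b, n_ab in queries]
print(estimate_size(s_a, ratios))      # a lower-bound estimate of N
```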
What we just covered
Types of uncertainty
Default/worst-case/probabilistic reasoning
Probability
Random variable/expected value
Joint distribution
Conditional probability
Conditional independence
Bayes rule
The most common use of the term “information theory”
•Shannon founded information theory with a
landmark paper published in 1948.
•He had already founded digital computer and digital
circuit design theory in 1937: as a 21-year-old master's
student at MIT, he wrote a thesis demonstrating that
electrical applications of Boolean algebra could
construct and resolve any logical, numerical
relationship. It has been claimed that this was the most
important master's thesis of all time.
•Shannon contributed to the basic work on code
breaking.
•Coined the term “bit”
Information Theory (in classical sense)
A model of innate information content of something
Documents, images, messages, DNA
Other models?
Information
– That which reduces uncertainty
Entropy
– A measure of information content
– Conditional Entropy
• Information content based on a context or other information
Formal limitations on what can be
– Compressed
– Communicated
– Represented
Claude Shannon 1948
Shannon defined information in a way not previously used
Shannon noted that information content can depend on the
probability of the events, not just on the number of
outcomes.
Uncertainty is the lack of knowledge about an outcome.
Entropy is a measure of that uncertainty (or randomness)
– in information
– in a system
Information Theory – another
definition
Defines the amount of information in a message or document
as the minimum number of bits needed to encode all
possible meanings of that message, assuming all messages
are equally likely
What would be the minimum message to encode the days of
the week field in a database?
A type of compression!
Fundamental Questions Addressed by
Entropy and Information Theory
What is the ultimate data compression for an information
source?
How much data can be sent reliably over a noisy
communications channel?
How accurately can we represent an object (e.g. image, etc.)
as a function of the number of bits used.
Good feature selection for data mining and machine learning
Information Content I(x)
Define the amount of information gained after observing an
event x with probability p(x) as I(x), where:
– I(x) = log2(1/p(x)) = - log2 p(x)
Examples
– Flip a coin, x = heads
• p(heads) = 1/2; I(heads) = 1
– Roll a die, x = 6
• p(6) = 1/6; I(6) = 2.58..
More information is gained from observing a die toss than a
coin flip. Why? Because there are more possible outcomes.
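A minimal helper for I(x), reproducing the two examples above:

```python
import math

def info_content(p):
    """Information gained from observing an event of probability p, in bits."""
    return math.log2(1 / p)

print(info_content(1 / 2))             # 1.0 bit (coin shows heads)
print(round(info_content(1 / 6), 2))   # 2.58 bits (die shows a six)
```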
Properties of Information I(x)
p(x) = 1; I = 0
– If we know with certainty the outcome of an event ,
there is no information gained by its occurrence
I(x) ≥ 0
– The occurrence of an event provides some or no
information, but it never results in a loss of
information.
I(x) > I(y) for p(x) < p(y)
– The less probable an event is, the more
information we gain from its occurrence.
I(x ∧ y) = I(x) + I(y) for independent events x and y : additive
Entropy H(x)
Entropy H(x) of an event is the expectation (average) of
amount of information gained from that event over all
possible happenings
H(x) = E[I(x)] = Σx p(x) I(x) = Σx p(x) log2(1/p(x))
Entropy is the average amount of uncertainty in an event.
Entropy is the amount of information in a message or
document
A message in which everything is known p(x) = 1 has zero
entropy
Entropy as a function of probability
[Plot: H as a function of p(x) for a two-outcome event — an inverted U peaking at p(x) = 0.5]
Max entropy occurs when all p(x)’s are equal!
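A sketch of the entropy computation, confirming that the uniform distribution maximizes H:

```python
import math

def entropy(dist):
    """H = sum p log2(1/p), in bits; zero-probability outcomes contribute 0."""
    return sum(p * math.log2(1 / p) for p in dist if p > 0)

print(entropy([0.5, 0.5]))             # 1.0 bit — fair coin, max for 2 outcomes
print(round(entropy([0.9, 0.1]), 3))   # ≈ 0.469 — biased coin, less uncertain
print(entropy([1.0]))                  # 0.0 — a certain outcome has no entropy
print(round(entropy([1/6] * 6), 3))    # ≈ 2.585 — fair die beats the fair coin
```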
Entropy Definition
Entropy Facts
Examples of Entropy
Average over all possible outcomes to calculate the entropy.
If all events are equally likely, there is more entropy when
more events can occur.
More possibilities (events) for a die than a coin =>
entropy of a die > entropy of a coin
Joint Entropy
H(x,x) = H(x)
Mutual Information I(x;y)
I(x;y) = H(y) – H(y|x)
I is how much information x gives about y on the average
Mutual Information I(x;y)
– Entropy is a special case H(x) = I(x;x)
– Symmetric: I(x;y) = I(y;x)
• Uncertainty of x after seeing y is the same as the uncertainty of
y after seeing x
– Nonnegative: I(x;y) ≥ 0
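A sketch computing I(x;y) directly from a joint distribution table; the two toy joints below are assumed examples:

```python
import math

# Mutual information from a joint distribution p(x, y):
# I(x;y) = H(y) - H(y|x) = sum p(x,y) log2( p(x,y) / (p(x) p(y)) )
def mutual_information(joint):
    px = {x: sum(joint[x].values()) for x in joint}
    py = {}
    for x in joint:
        for y, p in joint[x].items():
            py[y] = py.get(y, 0) + p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for x in joint for y, p in joint[x].items() if p > 0)

# Independent pair: the joint factorizes, so I = 0
indep = {"a": {"c": 0.25, "d": 0.25}, "b": {"c": 0.25, "d": 0.25}}
print(round(mutual_information(indep), 6))   # 0.0

# Perfectly correlated pair: knowing x pins down y, so I = H(y) = 1 bit
corr = {"a": {"c": 0.5, "d": 0.0}, "b": {"c": 0.0, "d": 0.5}}
print(mutual_information(corr))              # 1.0
```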
Other methods for making decisions
Decision Trees
– Powerful/popular for classification & prediction
– Represent rules
• Rules can be expressed in English
– IF Age <=43 & Sex = Male
& Credit Card Insurance = No
THEN Life Insurance Promotion = No
• Rules can be expressed using SQL for query
– Useful to explore data to gain insight into relationships
of a large number of candidate input variables to a
target (output) variable
You use mental decision trees often!
– Game: “I’m thinking of…” “Is it …?”
Decision for playing tennis

Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N

Resulting decision tree:
Outlook?
– sunny → Humidity? high → N; normal → P
– overcast → P
– rain → Windy? true → N; false → P
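The root split of the play-tennis tree can be recovered by computing the information gain of each attribute; a sketch of the ID3-style calculation on the table above:

```python
import math

# Choosing the root split by information gain on the play-tennis table:
# Gain(A) = H(Class) - sum_v (|S_v|/|S|) * H(Class in S_v)
rows = [  # (outlook, temperature, humidity, windy, play)
    ("sunny", "hot", "high", False, "N"), ("sunny", "hot", "high", True, "N"),
    ("overcast", "hot", "high", False, "P"), ("rain", "mild", "high", False, "P"),
    ("rain", "cool", "normal", False, "P"), ("rain", "cool", "normal", True, "N"),
    ("overcast", "cool", "normal", True, "P"), ("sunny", "mild", "high", False, "N"),
    ("sunny", "cool", "normal", False, "P"), ("rain", "mild", "normal", False, "P"),
    ("sunny", "mild", "normal", True, "P"), ("overcast", "mild", "high", True, "P"),
    ("overcast", "hot", "normal", False, "P"), ("rain", "mild", "high", True, "N"),
]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(v) for v in set(labels)))

def gain(col):
    labels = [r[-1] for r in rows]
    remainder = sum(
        len(sub) / len(rows) * entropy(sub)
        for v in set(r[col] for r in rows)
        for sub in [[r[-1] for r in rows if r[col] == v]])
    return entropy(labels) - remainder

for name, col in [("Outlook", 0), ("Temperature", 1), ("Humidity", 2), ("Windy", 3)]:
    print(name, round(gain(col), 3))
# Outlook has the highest gain (≈ 0.247), so it becomes the root of the tree
```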
Grade decision tree
Percent >= 90%? Yes → Grade = A
No → 89% >= Percent >= 80%? Yes → Grade = B
No → 79% >= Percent >= 70%? Yes → Grade = C
No → Etc...
Decision tree
Written decision rules
If tear production rate = reduced then recommendation = none.
If age = young and astigmatic = no and tear production rate = normal
then recommendation = soft
If age = pre-presbyopic and astigmatic = no and tear production
rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope and
astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no and
tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes and
tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes and tear production rate =
normal
then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope
and astigmatic = yes then recommendation = none
Decision Tree Template
Drawn top-to-bottom or left-to-right
Top (or left-most) node = Root Node
Descendent node(s) = Child Node(s)
Bottom (or right-most) node(s) = Leaf Node(s)
Unique path from root to each leaf = Rule
Decision Tree – What is it?
A structure that can be used to divide up a large
collection of records into successively smaller sets
of records by applying a sequence of simple
decision rules
A decision tree model consists of a set of rules for
dividing a large heterogeneous population into
smaller, more homogeneous groups with respect
to a particular target variable
Decision Tree Types
Binary trees – only two choices in each split. Can be nonuniform (uneven) in depth
N-way trees or ternary trees – three or more choices in at least
one of its splits (3-way, 4-way, etc.)
Scoring
Often it is useful to show the proportion of the data in each of
the desired classes
Decision Tree Splits (Growth)
The best split at root or child nodes is defined as one
that does the best job of separating the data into
groups where a single class predominates in each
group
– Example: US Population data input categorical
variables/attributes include:
• Zip code
• Gender
• Age
– Split the above according to the above “best split” rule
Example: Good & Poor Splits
[Diagram: a good split separates the two classes into nearly pure groups; a poor split leaves them mixed]
Split Criteria
The best split is defined as one that does the best job of
separating the data into groups where a single class
predominates in each group
Measure used to evaluate a potential split is a purity measure
– The purity measure answers the question, "Based upon a particular
split, how good of a job did we do of separating the two classes away
from each other?" We calculate this purity measure for every possible
split and choose the one that gives the highest possible value.
– The best split is one that increases purity of the sub-sets by the
greatest amount
– A good split also creates nodes of similar size or at least does not
create very small nodes
– Must have a stopping criterion
Methods for Choosing Best Split
Purity (Diversity) Measures:
– Gini (population diversity)
– Entropy (information gain)
– Information Gain Ratio
– Chi-square Test
– others
Gini (Population Diversity)
The Gini measure of a node is the sum of the squares of the
proportions of the classes.
Root Node: 0.5^2 + 0.5^2 = 0.5 (even balance)
Leaf Nodes: 0.1^2 + 0.9^2 = 0.82 (close to pure)
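A sketch of this purity score as the slide defines it (note it is the complement of the more common Gini impurity, which is 1 minus this sum):

```python
# Gini purity as defined on the slide: sum of squared class proportions.
# 0.5 for an even two-class mix; approaches 1.0 as a node becomes pure.
def gini_purity(proportions):
    assert abs(sum(proportions) - 1.0) < 1e-9
    return sum(p * p for p in proportions)

print(gini_purity([0.5, 0.5]))            # 0.5  — root node, even balance
print(round(gini_purity([0.1, 0.9]), 2))  # 0.82 — leaf node, close to pure
```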
Pruning
Decision Trees can often
be simplified or pruned:
– CART
– C5
– Stability-based
Decision Tree Advantages
1. Easy to understand
2. Map nicely to a set of domain rules
3. Applied to real problems
4. Make no prior assumptions about the data
5. Able to process both numerical and categorical data
Decision Tree Disadvantages
1. Sensitive to initial conditions
2. Output attribute must be categorical
3. Small number of output attributes
4. Decision tree algorithms can be unstable
5. Trees created from numeric datasets can be complex (scaling)
What we covered
• Probabilistic reasoning
• Flaws in human decision making
• Information theory
• Decision trees
Propositions
• Decision making is not easy
• Humans often make mistakes
• In some cases animals are smarter (empirical learning)
• Probabilistic methods help
• Data sensitive
• Bayes methods
• Information theory
• measures the amount of information in a message or document(s)
• Uses
• Filtering
• Data mining
• Decision trees are useful for learning rules
Questions
• Role of reasoning in information science
• Impact of probabilistic reasoning on
information science
• Role of decision making in information science
• What next?