Transcript: Lecture 16

Today’s Topics
• Some Exam-Review Notes
– Midterm is Thurs, 5:30-7:30pm HERE
– One 8.5x11 inch page of notes
(both sides), simple calculator (log’s and arithmetic)
– Don’t Discuss Actual Midterm with Others until Nov 3
• Planning to Attend TA’s Review Tomorrow?
• Bayes’ Rule
• Naïve Bayes (NB)
• NB as a BN
• Prob Reasoning Wrapup
• Next: BN’s for Playing Nannon (HW3)
10/13/15
CS 540 - Fall 2015 (Shavlik©), Lecture 16, Week 6
1
Topics Covered So Far
(If you don’t recognize this …)
• Some AI History and Philosophy (more final class)
• Learning from Labeled Data (more ahead)
• Reasoning from Specific Cases (k-NN)
• Searching for Solutions (many variants, common core)
• Projecting Possible Futures (eg, game-playing)
• Simulating ‘Problem Solving’ Done by the Biophysical World (SA, GA, and [next] neural nets)
• Reasoning Probabilistically (just Ch 13 & Lec 14)
Detailed List of Course Topics
Learning from labeled data
  Experimental methodologies for choosing parameter settings and estimating future accuracy
  Decision trees and random forests
  Probabilistic models, nearest-neighbor methods
  Genetic algorithms
  Neural networks
  Support vector machines
  Reinforcement learning (if time permits)
Searching for solutions
  Heuristically finding shortest paths
  Algorithms for playing games like chess
  Simulated annealing
  Genetic algorithms
Reasoning probabilistically
  Probabilistic inference (just the basics so far)
  Bayes' rule
  Bayesian networks
Reasoning from concrete cases
  Case-based reasoning
  Nearest-neighbor algorithm
Reasoning logically
  First-order predicate calculus
  Representing domain knowledge using mathematical logic
  Logical inference
Problem-solving methods based on the biophysical world
  Genetic algorithms
  Simulated annealing
  Neural networks
Philosophical aspects
  Turing test
  Searle's Chinese Room thought experiment
  The coming singularity
  Strong vs. weak AI
  Societal impact of AI
Some Key Ideas
ML: Easy to fit training examples, hard to generalize to future examples (never use TEST SET to choose model!)

SEARCH: OPEN holds partial solutions; how to choose which partial sol’n to extend? (CLOSED prevents infinite loops)

PROB: Fill JOINT prob table (explicitly or implicitly) simply by COUNTING data; then can answer all kinds of questions
Exam Advice
• Mix of ‘straightforward’ concrete problem solving
and brief discussion of important AI issues and
techniques
• Problem solving graded ‘precisely’
• Discussion graded ‘leniently’
• Previous exams are great training and tune sets (hence soln’s not posted for old exams, ie so they can be used as TUNE sets)
Exam Advice (cont.)
• Think before you write
• Briefly discuss important points
• Don’t do a ‘core dump’
• Some questions are open-ended, so budget your time wisely
• Always say SOMETHING
Bayes’ Rule
• Recall P(A ∧ B) = P(A | B) × P(B)
                  = P(B | A) × P(A)
• Equating the two RHS (right-hand sides) we get

  P(A | B) = P(B | A) × P(A) / P(B)

This is Bayes’ Rule!
Common Usage
- Diagnosing CAUSE Given EFFECTS
P(disease | symptoms)
  = P(symptoms | disease) × P(disease) / P(symptoms)

where the symptoms are usually a big AND of several random variables, so a JOINT probability
In HW3, you’ll compute
prob(this move leads to a WIN | NANNON board configuration)
Simple Example
(only ONE symptom variable)
• Assume we have estimated from data
  P(headache | condition=haveFlu)    = 0.90
  P(headache | condition=haveStress) = 0.40
  P(headache | condition=healthy)    = 0.01
  P(haveFlu)    = 0.01 // Dropping ‘condition=’ for clarity
  P(haveStress) = 0.20 // Because it’s midterms time!
  P(healthy)    = 0.79 // We assume the 3 ‘diseases’ disjoint
• Patient comes in with headache,
what is most likely diagnosis?
Solution
P(disease | symptoms) = P(symptoms | disease) × P(disease) / P(symptoms)

P(flu | headache)     = 0.90 × 0.01 / P(headache)
P(stress | headache)  = 0.40 × 0.20 / P(headache)
P(healthy | headache) = 0.01 × 0.79 / P(headache)

Note: we never need to compute the denominator to find the most likely diagnosis!
STRESS most likely (by nearly a factor of 9)
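The comparison above can be sketched in a few lines: compute only the Bayes-rule numerators P(symptom | disease) × P(disease) and rank them, since the shared denominator P(headache) cancels.

```python
# Minimal sketch of the headache diagnosis: rank diagnoses by the
# Bayes-rule numerators; the shared denominator P(headache) cancels.
priors = {"haveFlu": 0.01, "haveStress": 0.20, "healthy": 0.79}
p_headache_given = {"haveFlu": 0.90, "haveStress": 0.40, "healthy": 0.01}

# Unnormalized posteriors (numerators of Bayes' rule).
scores = {d: p_headache_given[d] * priors[d] for d in priors}

best = max(scores, key=scores.get)
print(best)                              # haveStress: 0.08 vs 0.009 vs 0.0079
print(scores[best] / scores["haveFlu"])  # nearly a factor of 9
```

Dividing any two scores gives the same ratio as dividing the true posteriors, which is why skipping the denominator is safe for ranking.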
Base-Rate Fallacy
https://en.wikipedia.org/wiki/Base_rate_fallacy

(This same issue arises when we have many more neg than pos ex’s – false pos overwhelm true pos.)

Assume Disease A is rare: one in 1 million, say.
Assume the population is 10B = 10^10, so 10^4 people have it.
Assume testForA is 99.99% accurate (so 0.01% of the healthy test positive).
You test positive. What is the prob you have Disease A?

Someone (not in cs540) might naively think prob = 0.9999.
But among the people for whom testForA = true:
  9999 people actually have Disease A
  10^6 people do NOT have Disease A
so Prob(A | testForA) = 9999 / (9999 + 10^6) ≈ 0.01

[Figure: Venn-style diagram (not to scale) of the population, Disease A, and the positive testers.]
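The base-rate arithmetic above can be checked directly with the slide's numbers:

```python
# Sketch of the base-rate computation, using the numbers from the slide.
population = 10**10      # 10 billion people
p_disease = 1e-6         # Disease A: one in a million
accuracy = 0.9999        # testForA is 99.99% accurate (both directions)

sick = population * p_disease                     # ~10^4 people have Disease A
true_pos = sick * accuracy                        # ~9999 sick people test positive
false_pos = (population - sick) * (1 - accuracy)  # ~10^6 healthy people test positive

p_a_given_pos = true_pos / (true_pos + false_pos)
print(round(p_a_given_pos, 2))   # 0.01, despite the 99.99% accurate test
```

The rare-disease prior swamps the test's accuracy: false positives outnumber true positives about 100 to 1.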
Recall: What if Symptoms
NOT Disjoint?
• Assume we have symptoms A, B, and C, and they are not disjoint
• Convert to the eight mutually exclusive conjunctions, eg:
  A’ = A ∧ ¬B ∧ ¬C
  B’ = ¬A ∧ B ∧ ¬C
  C’ = ¬A ∧ ¬B ∧ C
  D’ = A ∧ B ∧ ¬C
  E’ = A ∧ ¬B ∧ C
  F’ = ¬A ∧ B ∧ C
  G’ = A ∧ B ∧ C
  H’ = ¬A ∧ ¬B ∧ ¬C
Dealing with Many Boolean-Valued Symptoms
(D = Disease, Si = Symptom i)
P(D | S1 ∧ S2 ∧ S3 ∧ … ∧ Sn)                                           // Bayes’ Rule
  = P(S1 ∧ S2 ∧ S3 ∧ … ∧ Sn | D) × P(D) / P(S1 ∧ S2 ∧ S3 ∧ … ∧ Sn)
If n small, could use a full joint table
If not, could design/learn a Bayes Net
We’ll consider `conditional independence’ of S’s
Assuming
Conditional Independence
Repeatedly using P(A ∧ B | C) = P(A | C) × P(B | C)
we get

  P(S1 ∧ S2 ∧ S3 ∧ … ∧ Sn | D) = Π P(Si | D)

Assuming D has three possible disjoint values,

  P(D1 | S1 ∧ S2 ∧ … ∧ Sn) = [ Π P(Si | D1) ] × P(D1) / P(S1 ∧ S2 ∧ … ∧ Sn)
  P(D2 | S1 ∧ S2 ∧ … ∧ Sn) = [ Π P(Si | D2) ] × P(D2) / P(S1 ∧ S2 ∧ … ∧ Sn)
  P(D3 | S1 ∧ S2 ∧ … ∧ Sn) = [ Π P(Si | D3) ] × P(D3) / P(S1 ∧ S2 ∧ … ∧ Sn)

We know Σ P(Di | S1 ∧ S2 ∧ … ∧ Sn) = 1, so if we want, we could solve for P(S1 ∧ S2 ∧ … ∧ Sn) and, hence, need not compute/approx it!
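The normalization trick above can be sketched directly: compute the three Bayes-rule numerators, then divide by their sum, which implicitly equals P(S1 ∧ … ∧ Sn). The numbers below are hypothetical estimates, not from the slides.

```python
# Sketch of Naive Bayes for a 3-valued disease D: the per-disease scores are
# the Bayes-rule numerators; since the posteriors must sum to 1, dividing by
# their total recovers the denominator P(S1 ∧ S2) implicitly.
from math import prod

# Hypothetical estimates: P(D=k) and [P(S1=true|D=k), P(S2=true|D=k)].
prior = {"D1": 0.2, "D2": 0.3, "D3": 0.5}
p_s_given_d = {"D1": [0.9, 0.8], "D2": [0.5, 0.5], "D3": [0.1, 0.2]}

# Observed S1=true and S2=true: multiply the conditionals (indep. assumption).
score = {d: prod(p_s_given_d[d]) * prior[d] for d in prior}
total = sum(score.values())                  # plays the role of P(S1 ∧ S2)
posterior = {d: s / total for d, s in score.items()}
print(posterior)                             # sums to 1, denominator never needed
```
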
Full Joint vs. Naïve Bayes
• Completely assuming conditional
independence is called Naïve Bayes (NB)
– We need to estimate (eg, from data)
  P(Si | Dj)  // For each disease j, prob symptom i appears
  P(Dj)       // Prob of each disease j
• If we have N binary-valued symptoms and a tertiary-valued disease, size of full joint is (3 × 2^N) – 1
• NB needs only (3 × N) + 3 – 1
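The gap between the two counts grows quickly; a tiny script makes the formulas above concrete:

```python
# Parameter counts for N binary symptoms and a tertiary-valued disease:
# full joint table vs Naive Bayes (formulas from the slide).
def full_joint_size(n):
    return 3 * 2**n - 1       # (3 * 2^N) - 1 entries

def naive_bayes_size(n):
    return 3 * n + 3 - 1      # (3 * N) + 3 - 1 estimates

for n in (5, 10, 20):
    print(n, full_joint_size(n), naive_bayes_size(n))
# e.g. for n = 20: 3,145,727 joint entries vs just 62 NB estimates
```
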
Log Odds
(Odds > 1 iff prob > 0.5)

Odds(x) ≡ prob(x) / (1 – prob(x))

Recall (and now assuming D only has TWO values)

1) P(D | S1 ∧ S2 ∧ … ∧ Sn)  = [ Π P(Si | D) ] × P(D) / P(S1 ∧ S2 ∧ … ∧ Sn)
2) P(¬D | S1 ∧ S2 ∧ … ∧ Sn) = [ Π P(Si | ¬D) ] × P(¬D) / P(S1 ∧ S2 ∧ … ∧ Sn)

Dividing (1) by (2), the denominators cancel out!

  P(D | S1 ∧ S2 ∧ … ∧ Sn) / P(¬D | S1 ∧ S2 ∧ … ∧ Sn)
    = [ Π P(Si | D) ] × P(D) / ( [ Π P(Si | ¬D) ] × P(¬D) )

Since P(¬D | S1 ∧ S2 ∧ … ∧ Sn) = 1 – P(D | S1 ∧ S2 ∧ … ∧ Sn),

  odds(D | S1 ∧ S2 ∧ … ∧ Sn) = [ Π { P(Si | D) / P(Si | ¬D) } ] × [ P(D) / P(¬D) ]

(Notice we removed one Π via algebra.)
The Missing Algebra
The implicit algebra from the previous page:

  (a1 × a2 × a3 × … × an) / (b1 × b2 × b3 × … × bn)
    = (a1 / b1) × (a2 / b2) × (a3 / b3) × … × (an / bn)
Log Odds (continued)
Odds(x) ≡ prob(x) / (1 – prob(x))

We ended two slides ago with

  odds(D | S1 ∧ S2 ∧ … ∧ Sn) = [ Π { P(Si | D) / P(Si | ¬D) } ] × [ P(D) / P(¬D) ]

Recall log(A × B) = log(A) + log(B), so we have

  log [ odds(D | S1 ∧ S2 ∧ … ∧ Sn) ]
    = { Σ log [ P(Si | D) / P(Si | ¬D) ] } + log [ P(D) / P(¬D) ]

If log-odds > 0, D is more likely than ¬D since log(x) > 0 iff x > 1
If log-odds < 0, D is less likely than ¬D since log(x) < 0 iff x < 1
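The sum above is easy to implement: start from the prior term log[P(D)/P(¬D)], then add one log-ratio per observed symptom. The probabilities below are hypothetical, chosen only to illustrate the sign test.

```python
# Sketch of the log-odds sum: prior term plus one log-ratio per symptom;
# a positive total means D is more likely than ¬D.
from math import log

def log_odds(p_d, evidence):
    """evidence: list of (P(Si | D), P(Si | ¬D)) pairs for the observed Si."""
    total = log(p_d / (1 - p_d))                  # log [ P(D) / P(¬D) ]
    for p_given_d, p_given_not_d in evidence:
        total += log(p_given_d / p_given_not_d)   # one term per symptom
    return total

# Hypothetical numbers: a 30% prior plus two symptoms that favor D.
lo = log_odds(0.30, [(0.9, 0.3), (0.8, 0.4)])
print(lo > 0)    # True: the evidence outweighs the prior against D
```

Working in log space also avoids the numeric underflow that multiplying many small probabilities would cause.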
Log Odds (concluded)
We ended the last slide with

  log [ odds(D | S1 ∧ S2 ∧ … ∧ Sn) ]
    = { Σ log [ P(Si | D) / P(Si | ¬D) ] } + log [ P(D) / P(¬D) ]

Consider log [ P(D) / P(¬D) ]:
  if D is more likely than ¬D, we start the sum with a positive value

Consider each log [ P(Si | D) / P(Si | ¬D) ]:
  if Si is more likely given D than given ¬D, we add a positive value to the sum
  if less likely, we add a negative value
  if Si is independent of D, we add zero

At the end we see if the sum is POS (D more likely), ZERO (tie), or NEG (¬D more likely)
Viewing NB as a PERCEPTRON, the Simplest Neural Network

[Figure: a perceptron whose inputs are paired nodes Si and ¬Si for each symptom (i = 1 … n) plus a constant input ‘1’; each weight is the corresponding log ratio, eg log [ P(Sn | D) / P(Sn | ¬D) ] on the Sn input, and the output is the log-odds.]

If Si = true,  then NODE Si = 1 and NODE ¬Si = 0
If Si = false, then NODE Si = 0 and NODE ¬Si = 1
Naïve Bayes Example
(for simplicity, ignore m-estimates here)
Dataset
  S1  S2  S3 | D
  T   F   T  | T
  F   T   T  | F
  F   T   T  | T
  T   T   F  | T
  T   F   T  | F
  F   T   T  | T
  T   F   F  | F

Estimate from the data:
  P(D=true)              =
  P(D=false)             =
  P(S1=true | D = true)  =
  P(S1=true | D = false) =
  P(S2=true | D = true)  =
  P(S2=true | D = false) =
  P(S3=true | D = true)  =
  P(S3=true | D = false) =
‘Law of Excluded Middle’
P(S3=true | D=false) + P(S3=false | D=false) = 1
so no need for the P(Si=false | D=?) estimates
Naïve Bayes Example
(for simplicity, ignore m-estimates)

Dataset
  S1  S2  S3 | D
  T   F   T  | T
  F   T   T  | F
  F   T   T  | T
  T   T   F  | T
  T   F   T  | F
  F   T   T  | T
  T   F   F  | F

  P(D=true)              = 4/7
  P(D=false)             = 3/7
  P(S1=true | D = true)  = 2/4
  P(S1=true | D = false) = 2/3
  P(S2=true | D = true)  = 3/4
  P(S2=true | D = false) = 1/3
  P(S3=true | D = true)  = 3/4
  P(S3=true | D = false) = 2/3
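The estimates above come from simple counting; a short script over the 7-row dataset reproduces every fraction:

```python
# One pass of counting over the 7-example dataset from the slide.
data = [  # (S1, S2, S3, D)
    (True,  False, True,  True),
    (False, True,  True,  False),
    (False, True,  True,  True),
    (True,  True,  False, True),
    (True,  False, True,  False),
    (False, True,  True,  True),
    (True,  False, False, False),
]

pos = [row for row in data if row[3]]       # D = true examples
neg = [row for row in data if not row[3]]   # D = false examples
print(len(pos), "/", len(data))             # P(D=true) = 4/7

for i in range(3):  # counts behind P(Si=true | D=true) and P(Si=true | D=false)
    print(f"S{i+1}:",
          sum(r[i] for r in pos), "/", len(pos),
          "and",
          sum(r[i] for r in neg), "/", len(neg))
# Reproduces the slide: 2/4 & 2/3, 3/4 & 1/3, 3/4 & 2/3
```
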
Processing a ‘Test’ Example
Prob(D = true | S1 = true ∧ S2 = true ∧ S3 = true) ?
(Here, vars = true unless a NOT sign is present.)

Odds(D | S1 ∧ S2 ∧ S3)   // Recall Odds(x) ≡ Prob(x) / (1 – Prob(x))

  = [ P(S1 | D) × P(S2 | D) × P(S3 | D) × P(D) ] / [ P(S1 | ¬D) × P(S2 | ¬D) × P(S3 | ¬D) × P(¬D) ]

  = (3 / 4) × (9 / 4) × (9 / 8) × (4 / 3) = 81 / 32 ≈ 2.53

Use Prob(x) = Odds(x) / (1 + Odds(x)) to get prob ≈ 0.72
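The arithmetic above checks out with exact fractions, using the estimates from the previous slide:

```python
# Sketch of the test-example arithmetic; Fraction avoids any rounding.
from fractions import Fraction as F

# Ratios P(Si | D) / P(Si | ¬D) for S1 = S2 = S3 = true,
# plus the prior ratio P(D) / P(¬D) = (4/7) / (3/7).
odds = (F(2, 4) / F(2, 3)) * (F(3, 4) / F(1, 3)) * (F(3, 4) / F(2, 3)) * (F(4, 7) / F(3, 7))
print(odds)               # 81/32, about 2.53

prob = odds / (1 + odds)  # Prob(x) = Odds(x) / (1 + Odds(x))
print(float(prob))        # ~0.72
```
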
NB as a BN
P(D | S1 ∧ S2 ∧ … ∧ Sn) = [ Π P(Si | D) ] × P(D) / P(S1 ∧ S2 ∧ … ∧ Sn)

– A full joint table would be 1 CPT of size 2^(n+1); the NB network needs only (N+1) small CPTs, one per node (the CPT for P(D) has size 1).
– We need not compute the denominator P(S1 ∧ S2 ∧ … ∧ Sn) if we use the ‘odds’ method from prev slides.

[Figure: the NB Bayes net – node D with an arrow to each of S1, S2, S3, …, Sn.]
Recap: Naïve Bayes
Parameter Learning
• Use training data to estimate
(for Naïve Bayes) in one pass through data
P(fi = vj | category = POS) for each i, j
P(fi = vj | category = NEG) for each i, j
P(category = POS)
P(category = NEG)
// Note: Some of above unnecessary since some combo’s of probs sum to 1
• Apply Bayes’ rule to find odds(category = POS | test example’s features)
• Incremental/Online Learning is Easy: simply increment counters (true for BN’s in general, if no ‘structure learning’)
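The online-learning point above can be sketched as a counter class: absorbing a new labeled example is just a handful of increments, so no retraining pass is needed. The class and method names here are illustrative, not from the lecture.

```python
# Sketch of incremental NB 'learning': parameters are just counters,
# so each new labeled example is absorbed in O(number of features).
class NBCounts:
    def __init__(self, n_features):
        self.n = {True: 0, False: 0}            # examples per category
        self.f = {True: [0] * n_features,       # feature=true counts,
                  False: [0] * n_features}      # split by category

    def update(self, features, label):
        """Absorb one labeled example by incrementing counters."""
        self.n[label] += 1
        for i, value in enumerate(features):
            if value:
                self.f[label][i] += 1

model = NBCounts(3)
model.update([True, False, True], True)
model.update([False, True, True], False)
print(model.n[True], model.f[True])    # 1 [1, 0, 1]
```

The probability estimates P(fi | category) are then just ratios of these counters, recomputed on demand.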
Is NB Naïve?
Surprisingly, the assumption of independence,
while most likely violated, is not too harmful!
• Naïve Bayes works quite well
– Very successful in text categorization (‘bag-of-words’ rep)
– Used in printer diagnosis in Windows, spam filtering, etc
• Prob’s not accurate (‘uncalibrated’) due to double counting,
but good at seeing if prob > 0.5 or prob < 0.5
• Resurgence of research activity in Naïve Bayes
– Many ‘dead’ ML algo’s resuscitated by availability
of large datasets (KISS Principle)
A Major Weakness of BN’s
• If many ‘hidden’ random vars (N binary vars, say), then the marginalization formula leads to many calls to a BN (2^N in our example; for N = 20, 2^N = 1,048,576)
• Using uniform-random sampling to estimate the result is too
inaccurate since most of the probability might be concentrated
in only a few ‘complete world states’
• Hence, much research (beyond cs540’s scope) on scaling up
inference in BNs and other graphical models, eg via more
sophisticated sampling (eg, MCMC)
Bayesian Networks Wrapup
• BNs one Type of ‘Graphical Model’
• Lots of Applications (though
currently focus on ‘deep [neural] networks’)
• Bayes’ Rule Appealing Way to Go from
EFFECTS to CAUSES (ie, diagnosis)
• Full Joint Prob Tables and Naïve Bayes
are Interesting ‘Limit Cases’ of BNs
• With ‘Big Data,’ Counting Goes a Long Way!