Midterm Review

Constructing decision trees
• Normal procedure: top-down, in a recursive divide-and-conquer fashion
– First: an attribute is selected for the root node and a branch is created for each possible attribute value
– Then: the instances are split into subsets (one for each branch extending from the node)
– Finally: the same procedure is repeated recursively for each branch, using only the instances that reach that branch
• The process stops when all instances have the same class, or when most instances have the same class.

Weather data

Which attribute to select?
• Choose the attribute that results in the lowest entropy of the children nodes.
• Example: attribute "Outlook"

Information gain
• Usually people don't use the entropy of a node directly; rather, the information gain is used.
• Clearly, the greater the information gain, the better the purity of a node. So we choose "Outlook" for the root.

Continuing to split

Highly-branching attributes
• The weather data with an ID code; tree stump for the ID code attribute
• Subsets are more likely to be pure if there is a large number of values
• Information gain is therefore biased towards choosing attributes with a large number of values
• What's the remedy? The gain ratio

Gain ratios for weather data
• In this example of only 14 training instances, the "ID code" still has the greater gain ratio, but its advantage is greatly reduced.

Numerical attributes
• Tests in nodes can be of the form x_j > constant
• This divides the space into rectangles.

Considering splits
• The only thing we need to do differently in our algorithm is to consider splitting between each pair of data points in each dimension.

Bankruptcy Example
• We consider all the possible splits in each dimension, and compute the average entropies of the children.
• Next, we consider all the splits of the remaining part of the space.
• Note that we have to recalculate all the average entropies, because the points that fall into the leaf node are taken out of consideration.

Regression Trees
• Like decision trees, but with real-valued constant outputs at the leaves.

Splitting
• Use the average variance of the children to evaluate the quality of splitting on a particular feature.
• Here we have a data set, for which I've just indicated the y values.
• Compute a weighted average variance.
• We can see that the average variance of splitting on feature f3 is much lower than that of splitting on f7, so we'd choose to split on f3.

Stopping
• Stop when the variance at the leaf is small enough.
• Then set the value at the leaf to be the mean of the y values of its elements.

Rules: Coverage and Accuracy
• Coverage of a rule:
– Fraction of records that satisfy the antecedent of the rule
• Accuracy of a rule:
– Fraction of records that satisfy both the antecedent and the consequent of the rule (over those that satisfy the antecedent)

Tid | Refund | Marital Status | Taxable Income | Class
 1  | Yes    | Single         | 125K           | No
 2  | No     | Married        | 100K           | No
 3  | No     | Single         | 70K            | No
 4  | Yes    | Married        | 120K           | No
 5  | No     | Divorced       | 95K            | Yes
 6  | No     | Married        | 60K            | No
 7  | Yes    | Divorced       | 220K           | No
 8  | No     | Single         | 85K            | Yes
 9  | No     | Married        | 75K            | No
10  | No     | Single         | 90K            | Yes

Rule: (Status=Single) → No
Coverage = 40%, Accuracy = 50%
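As a small sketch (not from the slides), the coverage and accuracy of the rule (Status=Single) → No can be checked against the table above; the record list below is transcribed from that table.

```python
# Minimal sketch: coverage and accuracy of the rule (Status=Single) -> No
# on the 10-record table above. Records are (refund, status, income, cls).
records = [
    ("Yes", "Single",   "125K", "No"),  ("No",  "Married",  "100K", "No"),
    ("No",  "Single",   "70K",  "No"),  ("Yes", "Married",  "120K", "No"),
    ("No",  "Divorced", "95K",  "Yes"), ("No",  "Married",  "60K",  "No"),
    ("Yes", "Divorced", "220K", "No"),  ("No",  "Single",   "85K",  "Yes"),
    ("No",  "Married",  "75K",  "No"),  ("No",  "Single",   "90K",  "Yes"),
]

antecedent = lambda r: r[1] == "Single"   # Status = Single
consequent = lambda r: r[3] == "No"       # Class = No

covered = [r for r in records if antecedent(r)]
correct = [r for r in covered if consequent(r)]

coverage = len(covered) / len(records)    # fraction satisfying the antecedent
accuracy = len(correct) / len(covered)    # fraction of covered records with class No

print(f"coverage = {coverage:.0%}, accuracy = {accuracy:.0%}")  # 40%, 50%
```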
A simple covering algorithm
• Generates a rule by adding tests that maximize the rule's accuracy.
• Here, each new test (growing the rule) reduces the rule's coverage.
• Goal: maximizing accuracy
– t: total number of instances covered by the rule
– p: positive examples of the class covered by the rule
– t − p: number of errors made by the rule
• Select the test that maximizes the ratio p/t.
(Figure: space of examples, rule so far, rule after adding a new term.)

Pseudo-code for PRISM
For each class C
  Initialize E to the instance set
  While E contains instances in class C
    Create a rule R with an empty left-hand side that predicts class C
    Until R is perfect (or there are no more attributes to use) do
      For each attribute A not mentioned in R, and each value v,
        consider adding the condition A = v to the left-hand side of R
      Select A and v to maximize the accuracy p/t
        (break ties by choosing the condition with the largest p)
      Add A = v to R
    Remove the instances covered by R from E

RIPPER
• The algorithm is similar; it uses the information gain instead of p/t.

Probabilistic Reasoning

Conditional Independence – Naïve Bayes
P(cause | effect1, effect2) ∝ P(effect1 | effect2, cause) P(effect2 | cause) P(cause)
                            = P(effect1 | cause) P(effect2 | cause) P(cause)
In general:
P(cause | effect1, ..., effectn) ∝ P(effect1 | cause) ... P(effectn | cause) P(cause)
Two assumptions:
• Attributes are equally important
• Attributes are conditionally independent (given the class value)
• This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute (if the class is known)
• Although based on assumptions that are almost never correct, this scheme works well in practice!

Weather Data
• Here we don't really have effects, but rather evidence.

The weather data example
P(play=yes | E) =
  P(Outlook=Sunny | play=yes) * P(Temp=Cool | play=yes) * P(Humidity=High | play=yes) * P(Windy=True | play=yes) * P(play=yes) / P(E)
  = (2/9) * (3/9) * (3/9) * (3/9) * (9/14) / P(E) = 0.0053 / P(E)
Don't worry about the 1/P(E); it's alpha, the normalization constant.
P(play=no | E) =
  P(Outlook=Sunny | play=no) * P(Temp=Cool | play=no) * P(Humidity=High | play=no) * P(Windy=True | play=no) * P(play=no) / P(E)
  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14) / P(E) = 0.0206 / P(E)

Normalization constant
P(play=yes | E) + P(play=no | E) = 1
0.0053 / P(E) + 0.0206 / P(E) = 1
P(E) = 0.0053 + 0.0206
So,
P(play=yes | E) = 0.0053 / (0.0053 + 0.0206) = 20.5%
P(play=no | E) = 0.0206 / (0.0053 + 0.0206) = 79.5%

The "zero-frequency problem"
• What if an attribute value doesn't occur with every class value (e.g. "Humidity = High" for class "Play=Yes")?
– The probability P(Humidity=High | play=yes) would be zero!
• The a posteriori probability would then also be zero, no matter how likely the other values are!
• Remedy:
– Add 1 to the count for every attribute value–class combination (Laplace estimator);
– Add k (the number of possible attribute values) to the denominator.
• For the example above, the estimate becomes:
P(play=yes | E) = ((2+1)/(9+3)) * ((3+1)/(9+3)) * ((3+1)/(9+2)) * ((3+1)/(9+2)) * (9/14) / P(E) = 0.007 / P(E)
(The +3 in the denominators reflects the three possible values of 'Outlook' and of 'Temperature'; the +2 the two possible values of 'Humidity' and of 'Windy'.)
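A minimal numeric sketch (using only the counts quoted in the slides: 2/9 sunny, 3/9 cool, 3/9 high, 3/9 windy among the nine 'yes' days, prior 9/14) comparing the raw and Laplace-smoothed scores for play=yes:

```python
# Raw (unsmoothed) Naive Bayes score for play=yes on E = (sunny, cool, high, windy=true).
raw = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
print(round(raw, 4))        # 0.0053

# Laplace-smoothed version: add 1 to each count and the number of attribute
# values to each denominator (3 for outlook and temperature, 2 for humidity and windy).
smoothed = ((2+1)/(9+3)) * ((3+1)/(9+3)) * ((3+1)/(9+2)) * ((3+1)/(9+2)) * (9/14)
print(round(smoothed, 4))   # 0.0071 (the slide rounds this to 0.007)
```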
Missing values
• Training: the instance is not included in the frequency count for that attribute value–class combination
• Classification: the attribute is omitted from the calculation
• Example:
P(play=yes | E) = P(Temp=Cool | play=yes) * P(Humidity=High | play=yes) * P(Windy=True | play=yes) * P(play=yes) / P(E)
  = (3/9) * (3/9) * (3/9) * (9/14) / P(E) = 0.0238 / P(E)
P(play=no | E) = P(Temp=Cool | play=no) * P(Humidity=High | play=no) * P(Windy=True | play=no) * P(play=no) / P(E)
  = (1/5) * (4/5) * (3/5) * (5/14) / P(E) = 0.0343 / P(E)
After normalization: P(play=yes | E) = 41%, P(play=no | E) = 59%

Dealing with numeric attributes
• Usual assumption: attributes have a normal (Gaussian) probability distribution given the class.
• The probability density function for the normal distribution is:
f(x | class) = (1 / sqrt(2πσ²)) · e^(−(x−μ)² / (2σ²))
• We approximate μ by the sample mean:
x̄ = (1/n) Σᵢ xᵢ
• We approximate σ² by the sample variance:
s² = (1/(n−1)) Σᵢ (xᵢ − x̄)²

Weather Data (numeric attributes)

outlook  | temperature | humidity | windy | play
sunny    | 85          | 85       | FALSE | no
sunny    | 80          | 90       | TRUE  | no
overcast | 83          | 86       | FALSE | yes
rainy    | 70          | 96       | FALSE | yes
rainy    | 68          | 80       | FALSE | yes
rainy    | 65          | 70       | TRUE  | no
overcast | 64          | 65       | TRUE  | yes
sunny    | 72          | 95       | FALSE | no
sunny    | 69          | 70       | FALSE | yes
rainy    | 75          | 80       | FALSE | yes
sunny    | 75          | 70       | TRUE  | yes
overcast | 72          | 90       | TRUE  | yes
overcast | 81          | 75       | FALSE | yes
rainy    | 71          | 91       | TRUE  | no

f(temperature=66 | yes) = e^(−(66−m)² / (2·var)) / sqrt(2·π·var)
m = (83+70+68+64+69+75+75+72+81) / 9 = 73
var = ((83−73)² + (70−73)² + (68−73)² + (64−73)² + (69−73)² + (75−73)² + (75−73)² + (72−73)² + (81−73)²) / (9−1) = 38
f(temperature=66 | yes) = e^(−(66−73)² / (2·38)) / sqrt(2·π·38) = 0.034

Classifying a new day
• A new day E (outlook=sunny, temperature=66, humidity=90, windy=true):
P(play=yes | E) = P(Outlook=Sunny | play=yes) * P(Temp=66 | play=yes) * P(Humidity=90 | play=yes) * P(Windy=True | play=yes) * P(play=yes) / P(E)
  = (2/9) * 0.0340 * 0.0221 * (3/9) * (9/14) / P(E) = 0.000036 / P(E)
P(play=no | E) = P(Outlook=Sunny | play=no) * P(Temp=66 | play=no) * P(Humidity=90 | play=no) * P(Windy=True | play=no) * P(play=no) / P(E)
  = (3/5) * 0.0291 * 0.0380 * (3/5) * (5/14) / P(E) = 0.000136 / P(E)
After normalization: P(play=yes | E) = 20.9%, P(play=no | E) = 79.1%

Bayesian Net Semantics
• Suppose we have the variables X1, …, Xn, ordered according to the topology of the given Bayes net. The probability for them to have the values x1, …, xn respectively is P(xn, …, x1):
P(xn, …, x1) = P(xn | xn−1, …, x1) P(xn−1, …, x1)
             = P(xn | xn−1, …, x1) P(xn−1 | xn−2, …, x1) P(xn−2, …, x1)
             = …
             = ∏ᵢ P(xi | xi−1, …, x1) = ∏ᵢ P(xi | parents(Xi))
• E.g., P(j, m, a, b, e) = P(j | a) P(m | a) P(a | b, e) P(b) P(e) = …
• P(xn, …, x1) is short for P(Xn = xn, …, X1 = x1).

Inference in Bayesian Networks
• Notation:
– X denotes the query variable
– E denotes the set of evidence variables E1, …, Em, and e is a particular event, i.e. an assignment to the variables in E
– Y denotes the set of the remaining variables (the hidden variables)
• A typical query asks for the posterior probability P(x | e1, …, em)
• E.g., we could ask: what's the probability of a burglary if both Mary and John call, P(burglary | johncalls, marycalls)?

Classification
• We compute and compare the following:
P(x | e1, …, em) = P(x, e1, …, em) / P(e1, …, em) = α P(x, e1, …, em)
P(¬x | e1, …, em) = P(¬x, e1, …, em) / P(e1, …, em) = α P(¬x, e1, …, em)
• However, how do we compute P(x, e1, …, em) and P(¬x, e1, …, em)? What about the hidden variables Y1, …, Yk?

Inference by enumeration
P(x, e1, …, em) = Σ_y1 … Σ_yk P(x, e1, …, em, y1, …, yk)
and
P(¬x, e1, …, em) = Σ_y1 … Σ_yk P(¬x, e1, …, em, y1, …, yk)
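A minimal sketch of inference by enumeration (not the slides' burglary network; the joint table below is a made-up toy with one hidden variable Y), showing the sum over hidden-variable values followed by normalization:

```python
# Toy joint distribution P(X, E, Y) over binary variables, given as a table.
# The numbers are hypothetical and only need to sum to 1.
joint = {
    # (x, e, y): probability
    (True,  True,  True):  0.10, (True,  True,  False): 0.05,
    (True,  False, True):  0.15, (True,  False, False): 0.10,
    (False, True,  True):  0.05, (False, True,  False): 0.20,
    (False, False, True):  0.15, (False, False, False): 0.20,
}

def query_x_given_e(e_value):
    """Return P(X | E = e_value) by summing out the hidden variable Y."""
    scores = {}
    for x in (True, False):
        # P(x, e) = sum over y of P(x, e, y)
        scores[x] = sum(joint[(x, e_value, y)] for y in (True, False))
    alpha = 1.0 / sum(scores.values())        # normalization constant
    return {x: alpha * s for x, s in scores.items()}

print(query_x_given_e(True))   # {True: 0.375, False: 0.625}
```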
Example: P(burglary | johncalls, marycalls)? (Abbreviated P(b | j, m))
P(b | j, m) = α P(b, j, m) = α Σ_a Σ_e P(b, j, m, a, e)
            = α [ P(b, j, m, a, e) + P(b, j, m, a, ¬e) + P(b, j, m, ¬a, e) + P(b, j, m, ¬a, ¬e) ]

Weather data
• What is the Bayesian network corresponding to Naïve Bayes?

Play probability table
Based on the data:
P(play=yes) = 9/14
P(play=no) = 5/14
Let's correct with Laplace:
P(play=yes) = (9+1)/(14+2) = 0.625
P(play=no) = (5+1)/(14+2) = 0.375

Outlook probability table
Based on the data:
P(outlook=sunny | play=yes) = (2+1)/(9+3) = 0.25
P(outlook=overcast | play=yes) = (4+1)/(9+3) = 0.417
P(outlook=rainy | play=yes) = (3+1)/(9+3) = 0.333
P(outlook=sunny | play=no) = (3+1)/(5+3) = 0.5
P(outlook=overcast | play=no) = (0+1)/(5+3) = 0.125
P(outlook=rainy | play=no) = (2+1)/(5+3) = 0.375

Windy probability table
Based on the data, the conditional probabilities for "windy" are:
P(windy=true | play=yes, outlook=sunny) = (1+1)/(2+2) = 0.5
P(windy=true | play=yes, outlook=overcast) = 0.5
P(windy=true | play=yes, outlook=rainy) = 0.2
P(windy=true | play=no, outlook=sunny) = 0.4
P(windy=true | play=no, outlook=overcast) = 0.5
P(windy=true | play=no, outlook=rainy) = 0.75

Final figure

Classification I
Classify E = (outlook=sunny, temp=cool, humidity=high, windy=true):
P(play=yes | outlook=sunny, temp=cool, humidity=high, windy=true)
= α * P(play=yes) * P(outlook=sunny | play=yes) * P(temp=cool | play=yes, outlook=sunny) * P(humidity=high | play=yes, temp=cool) * P(windy=true | play=yes, outlook=sunny)
= α * 0.625 * 0.25 * 0.4 * 0.2 * 0.5 = α * 0.00625

Classification II
P(play=no | outlook=sunny, temp=cool, humidity=high, windy=true)
= α * P(play=no) * P(outlook=sunny | play=no) * P(temp=cool | play=no, outlook=sunny) * P(humidity=high | play=no, temp=cool) * P(windy=true | play=no, outlook=sunny)
= α * 0.375 * 0.5 * 0.167 * 0.333 * 0.4 = α * 0.00417

Classification III
P(play=yes | outlook=sunny, temp=cool, humidity=high, windy=true) = α * 0.00625
P(play=no | outlook=sunny, temp=cool, humidity=high, windy=true) = α * 0.00417
α = 1 / (0.00625 + 0.00417) = 95.969
P(play=yes | outlook=sunny, temp=cool, humidity=high, windy=true) = 95.969 * 0.00625 = 0.60
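A small sketch (using the Laplace-corrected table values quoted above) that reproduces the Classification I–III numbers:

```python
# Bayes-net classification of E = (outlook=sunny, temp=cool, humidity=high, windy=true)
# using the CPT entries from the slides (Laplace-corrected).
score_yes = 0.625 * 0.25 * 0.4   * 0.2   * 0.5   # P(yes)*P(sunny|yes)*P(cool|yes,sunny)*P(high|yes,cool)*P(windy|yes,sunny)
score_no  = 0.375 * 0.5  * 0.167 * 0.333 * 0.4   # the same factors for play=no

alpha = 1 / (score_yes + score_no)               # normalization constant
print(round(alpha, 3))                           # 95.962 (the slides show 95.969, from the rounded 0.00417)
print(round(alpha * score_yes, 2))               # 0.6 -> P(play=yes | E)
print(round(alpha * score_no, 2))                # 0.4 -> P(play=no  | E)
```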
Classification IV (missing values or hidden variables)
P(play=yes | temp=cool, humidity=high, windy=true)
= α Σ_outlook P(play=yes) * P(outlook | play=yes) * P(temp=cool | play=yes, outlook) * P(humidity=high | play=yes, temp=cool) * P(windy=true | play=yes, outlook)
= … (next slide)

Classification V (missing values or hidden variables)
P(play=yes | temp=cool, humidity=high, windy=true)
= α Σ_outlook P(play=yes) * P(outlook | play=yes) * P(temp=cool | play=yes, outlook) * P(humidity=high | play=yes, temp=cool) * P(windy=true | play=yes, outlook)
= α [ P(play=yes) * P(outlook=sunny | play=yes) * P(temp=cool | play=yes, outlook=sunny) * P(humidity=high | play=yes, temp=cool) * P(windy=true | play=yes, outlook=sunny)
    + P(play=yes) * P(outlook=overcast | play=yes) * P(temp=cool | play=yes, outlook=overcast) * P(humidity=high | play=yes, temp=cool) * P(windy=true | play=yes, outlook=overcast)
    + P(play=yes) * P(outlook=rainy | play=yes) * P(temp=cool | play=yes, outlook=rainy) * P(humidity=high | play=yes, temp=cool) * P(windy=true | play=yes, outlook=rainy) ]
= α [ 0.625*0.25*0.4*0.2*0.5 + 0.625*0.417*0.286*0.2*0.5 + 0.625*0.33*0.333*0.2*0.2 ]
= α * 0.01645

Classification VI (missing values or hidden variables)
P(play=no | temp=cool, humidity=high, windy=true)
= α Σ_outlook P(play=no) * P(outlook | play=no) * P(temp=cool | play=no, outlook) * P(humidity=high | play=no, temp=cool) * P(windy=true | play=no, outlook)
= α [ P(play=no) * P(outlook=sunny | play=no) * P(temp=cool | play=no, outlook=sunny) * P(humidity=high | play=no, temp=cool) * P(windy=true | play=no, outlook=sunny)
    + P(play=no) * P(outlook=overcast | play=no) * P(temp=cool | play=no, outlook=overcast) * P(humidity=high | play=no, temp=cool) * P(windy=true | play=no, outlook=overcast)
    + P(play=no) * P(outlook=rainy | play=no) * P(temp=cool | play=no, outlook=rainy) * P(humidity=high | play=no, temp=cool) * P(windy=true | play=no, outlook=rainy) ]
= α [ 0.375*0.5*0.167*0.333*0.4 + 0.375*0.125*0.333*0.333*0.5 + 0.375*0.375*0.4*0.333*0.75 ]
= α * 0.0208

Classification VII (missing values or hidden variables)
P(play=yes | temp=cool, humidity=high, windy=true) = α * 0.01645
P(play=no | temp=cool, humidity=high, windy=true) = α * 0.0208
α = 1 / (0.01645 + 0.0208) = 26.846
P(play=yes | temp=cool, humidity=high, windy=true) = 26.846 * 0.01645 = 0.44
P(play=no | temp=cool, humidity=high, windy=true) = 26.846 * 0.0208 = 0.56
I.e. P(play=yes | temp=cool, humidity=high, windy=true) is 44% and P(play=no | temp=cool, humidity=high, windy=true) is 56%, so we predict 'play=no.'

Evaluation

Predicting performance
• We want to say: p (the probability of success) lies within a certain specified interval with a certain specified confidence
• Example: S = 750 successes in N = 1000 trials
– Estimated success rate: 75%
– How close is this to the true success rate p?
– Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
• Another example: S = 75 and N = 100
– Estimated success rate: 75%
– With 80% confidence, p ∈ [69.1%, 80.1%]
– I.e. the probability that p ∈ [69.1%, 80.1%] is 0.8.
• The bigger N is, the more confident we are, i.e. the surrounding interval is smaller.
– Above, for N = 100 we were less confident than for N = 1000.

Estimating p
• Let Y be the random variable with possible values 1 for success and 0 for error.
• Let the probability of success be p; then the probability of error is q = 1 − p.
• The mean is p and the variance is pq.
• Let S be the random variable denoting the number of successes, i.e. S is the sum of N samplings of Y.
• Now, we approximate p with the success rate in N trials, i.e. S/N.
• By the Central Limit Theorem, when N is big, the probability distribution of the random variable f = S/N is approximated by a normal distribution with
– mean p and
– variance pq/N.
• Suppose a random variable X has mean 0 and a symmetric distribution. Then
Pr[−z ≤ X ≤ z] = c
Pr[−z ≤ X ≤ z] = 1 − 2 · Pr[X ≥ z]
• Confidence limits for the normal distribution with mean 0 and variance 1, e.g.:
Pr[−1.65 ≤ X ≤ 1.65] = 90%
• To use this, we have to reduce our random variable S/N to have mean 0 and unit variance:
Pr[−1.65 ≤ (S/N − p) / σ_{S/N} ≤ 1.65] = 90%
• Now we solve two equations:
(S/N − p) / σ_{S/N} = 1.65
(S/N − p) / σ_{S/N} = −1.65
• Let N = 100 and S = 70.
σ_{S/N} is sqrt(pq/N), and we approximate it by sqrt(0.7 * 0.3 / 100) = 0.046.
The two equations become:
(0.7 − p) / 0.046 = 1.65, giving p = 0.624
(0.7 − p) / 0.046 = −1.65, giving p = 0.776
With 90% confidence we have p ∈ [0.624, 0.776].
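A minimal sketch of the N = 100, S = 70 confidence-interval computation above (using the approximate form from the slides, with the observed rate in place of p inside the square root):

```python
import math

N, S = 100, 70
z = 1.65                             # z-value for a 90% confidence interval
f = S / N                            # observed success rate

sigma = math.sqrt(f * (1 - f) / N)   # approximate std. dev. of S/N, i.e. sqrt(pq/N) with p ~ f
low, high = f - z * sigma, f + z * sigma
print(round(sigma, 3))                   # 0.046
print(round(low, 3), round(high, 3))     # 0.624 0.776
```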
k-Fold Cross-Validation
• First step: split the data into k subsets of equal size
• Second step: use each subset in turn for testing, and the remainder for training
• The error estimates are averaged to yield an overall error estimate
• Frequent question: which of two learning schemes performs better?
• Obvious way: compare, for example, the 10-fold cross-validation estimates
• Problem: variance in the estimate. We don't know whether the results are reliable; we need a statistical test for that.
• Student's t-test tells us whether the means of two samples are significantly different.
– In our case the samples are cross-validation estimates for different datasets from the domain
• Use a paired t-test because the individual samples are paired
– The same cross-validation split is applied to both schemes

Distribution of the means
• x1, x2, …, xk and y1, y2, …, yk are the 2k samples for the k different datasets
• mx and my are the means
• The estimated variances of the means are sx²/k and sy²/k
• With enough samples, the mean of a set of independent samples is normally distributed
• With small samples (k < 30), the mean follows Student's distribution with k−1 degrees of freedom (similar shape to the normal distribution, but wider)

Distribution of the differences
• Let md = mx − my
• The difference of the means (md) also has a Student's distribution with k−1 degrees of freedom
• Let sd² be the estimated variance of the differences
• The standardized version of md is called the t-statistic:
t = md / sqrt(sd² / k)
• Fix a significance level α
– If a difference is significant at the α% level, there is a (100−α)% chance that the true means differ
• Divide the significance level by two because the test is two-tailed
• Look up the value for z that corresponds to α/2
• If t ≤ −z or t ≥ z, then the difference is significant
– i.e. the null hypothesis (that the difference is zero) can be rejected

Example
• We have compared two classifiers through cross-validation on 10 different datasets. The error rates are:

Dataset:       1     2     3     4     5     6     7     8     9     10
Classifier A:  10.6  9.8   12.3  9.7   8.8   10.6  9.8   12.3  9.7   8.8
Classifier B:  10.2  9.4   11.8  9.1   8.3   10.2  9.4   11.8  9.1   8.3
Difference:    0.4   0.4   0.5   0.6   0.5   0.4   0.4   0.5   0.6   0.5

• md = 0.48
• sd = 0.0789
• t = md / sqrt(sd² / k) = 0.48 / sqrt(0.0789² / 10) = 19.24
• The critical value of t for a two-tailed test at α = 10% and 9 degrees of freedom is 1.83.
• 19.24 is far bigger than 1.83, so classifier B is much better than A.

True positives and false positives and ROC space
• True positive rate: TP rate = positives correctly classified / total positives
• False positive rate: FP rate = negatives incorrectly classified / total negatives
• A discrete classifier produces an (FP rate, TP rate) pair corresponding to a single point in ROC space.
• One point in ROC space is better than another if it is to the northwest of the first: its TP rate is higher and its FP rate is lower.

Curves in ROC space
• Some classifiers, such as a Naive Bayes classifier, yield an instance probability or score.
– Such a ranking or scoring classifier can be used with a threshold to produce a discrete (binary) classifier:
• if the classifier output is above the threshold, the classifier produces a Y,
• else an N.
– Each threshold value produces a different point in ROC space (corresponding to a different confusion matrix).
– Conceptually, we may imagine varying a threshold from −infinity to +infinity and tracing a curve through ROC space.

Algorithm
• Exploit the monotonicity of thresholded classifications:
– Any instance that is classified positive with respect to a given threshold will be classified positive for all lower thresholds as well.
• Therefore, we can simply:
– sort the test instances by decreasing score,
– move down the list, processing one instance at a time, and
– update TP and FP as we go.
• In this way, an ROC graph can be created from a linear scan.

Example
• A threshold of +infinity produces the point (0, 0).
• As we lower the threshold to 0.9, the first positive instance is classified positive, yielding (0, 0.1).
• As the threshold is further reduced, the curve climbs up and to the right, ending up at (1, 1) with a threshold of 0.1.
• Lowering the threshold corresponds to moving from the "conservative" to the "liberal" areas of the ROC graph.
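A minimal sketch of the linear-scan ROC construction just described (the scores and labels below are made up, not the figure's data): sort by decreasing score, lower the threshold one instance at a time, and record the (FP rate, TP rate) points.

```python
def roc_points(scores, labels):
    """Return (fp_rate, tp_rate) points by scanning instances in decreasing score order.

    labels are 1 for positive, 0 for negative."""
    P = sum(labels)
    N = len(labels) - P
    # Sort instances by decreasing score (i.e. lowering the threshold step by step).
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]                  # threshold = +infinity
    for i in order:
        if labels[i] == 1:
            tp += 1                        # instance now classified positive, and it is positive
        else:
            fp += 1                        # instance now classified positive, but it is negative
        points.append((fp / N, tp / P))
    return points                          # ends at (1.0, 1.0)

# Hypothetical scores/labels, just to exercise the function.
scores = [0.9, 0.8, 0.7, 0.55, 0.54, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,    0,    1,   0,   0,   0]
for fp_rate, tp_rate in roc_points(scores, labels):
    print(f"({fp_rate:.2f}, {tp_rate:.2f})")
```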
Observations
• The ROC point at (0.1, 0.5) produces its highest accuracy (70%).
• Note that the classifier's best accuracy occurs at a threshold of 0.54, rather than at 0.5 as we might expect with a balanced class distribution.

Area under an ROC Curve
• AUC has an important statistical property: the AUC of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance.
• It is often used to compare classifiers:
– The bigger the AUC, the better.
• AUC can be computed by a slight modification of the algorithm for constructing ROC curves.

Convex Hull
• If you aim to cover just 40% of the true positives, you should choose method A, which gives a false positive rate of 5%.
• If you aim to cover 80% of the true positives, you should choose method B, which gives a false positive rate of 60%, as compared with A's 80%.
• If you aim to cover 60% of the true positives, then you should combine A and B.
• The shaded area is called the convex hull of the two curves.
• You should always operate at a point that lies on the upper boundary of the convex hull.
• What about some point in the middle, where neither A nor B lies on the convex hull?
• Answer: "randomly" combine A and B.

Combining classifiers
• Example (CoIL Symposium Challenge 2000):
– There is a set of 4000 clients to whom we wish to market a new insurance policy.
– Our budget dictates that we can afford to market to only 800 of them, so we want to select the 800 who are most likely to respond to the offer.
– The expected class prior of responders is 6%, so within the population of 4000 we expect to have 240 responders (positives) and 3760 non-responders (negatives).
• Assume we have generated two classifiers, A and B, which score clients by the probability that they will buy the policy. In ROC space,
– A's best point lies at (.1, .2) and
– B's best point lies at (.25, .6).
• We want to market to exactly 800 people, so our solution constraint is: fp rate * 3760 + tp rate * 240 = 800
• If we use A, we expect: .1 * 3760 + .2 * 240 = 424 candidates, which is too few.
• If we use B, we expect: .25 * 3760 + .6 * 240 = 1084 candidates, which is too many.
• We want a classifier between A and B.
• The solution constraint is shown as a dashed line. It intersects the line between A and B at C, approximately (.18, .42).
• A classifier at point C would give the performance we desire, and we can achieve it using linear interpolation.
• Calculate k as the proportional distance that C lies along the line between A and B:
k = (.18 − .1) / (.25 − .1) ≈ 0.53
• Therefore, if we sample B's decisions at a rate of 0.53 and A's decisions at a rate of 1 − 0.53 = 0.47, we should attain C's performance.
• In practice this fractional sampling can be done as follows: for each instance (person), generate a random number between zero and one. If the random number is greater than k, apply classifier A to the instance and report its decision; otherwise apply classifier B.
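A small sketch of the interpolation arithmetic above (the rates, priors, and point C are taken from the slides):

```python
# Interpolating between classifiers A and B to hit a target of 800 contacts.
pos, neg = 240, 3760                      # expected responders / non-responders
A = (0.10, 0.20)                          # (fp rate, tp rate) of A's best point
B = (0.25, 0.60)                          # (fp rate, tp rate) of B's best point

def expected_contacts(point):
    fp_rate, tp_rate = point
    return fp_rate * neg + tp_rate * pos

print(expected_contacts(A))               # 424.0  -> too few
print(expected_contacts(B))               # 1084.0 -> too many

C = (0.18, 0.42)                          # intersection with the constraint line (from the slide)
k = (C[0] - A[0]) / (B[0] - A[0])         # proportional distance of C between A and B
print(round(k, 2))                        # 0.53 -> sample B's decisions with prob. k, A's with 1-k
print(round(expected_contacts(C)))        # 778, close to the 800 target (C is only approximate)
```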
Iso-Performance lines
• Let c(Y, n) be the cost of a false positive error.
• Let c(N, p) be the cost of a false negative error.
• Let p(p) be the prior probability of a positive example; p(n) = 1 − p(p) is the prior probability of a negative example.
• The expected cost of a classification by the classifier represented by a point (TP, FP) in ROC space is:
p(p) * (1 − TP) * c(N, p) + p(n) * FP * c(Y, n)
• Therefore, two points (TP1, FP1) and (TP2, FP2) have the same performance if
(TP2 − TP1) / (FP2 − FP1) = p(n) c(Y, n) / p(p) c(N, p)
• This equation defines the slope of an iso-performance line, i.e., all classifiers corresponding to points on the line have the same expected cost.
• Each set of class and cost distributions defines a family of iso-performance lines.
– Lines "more northwest" (having a larger TP intercept) are better because they correspond to classifiers with lower expected cost.
– Two such lines in the figure show the optimal classifier under different sets of conditions.

Discussion – Comparing Classifiers

Precision and Recall
• Precision is the fraction of retrieved documents that are relevant.
• Recall is the fraction of relevant documents that are retrieved.
• In terms of the confusion matrix: precision = TP / (TP + FP), recall = TP / (TP + FN).

Why having two numbers?
• The advantage of having the two numbers, precision and recall, is that one is more important than the other in many circumstances.
• Typical web surfers:
– would like every result on the first page to be relevant (high precision), but have not the slightest interest in knowing, let alone looking at, every document that is relevant.
• In contrast, various professional searchers such as paralegals and intelligence analysts:
– are very concerned with getting as high a recall as possible, and will tolerate fairly low-precision results in order to get it.
• Nevertheless, the two quantities clearly trade off against one another:
– you can always get a recall of 1 (but very low precision) by retrieving all documents for all queries!
• Recall is a non-decreasing function of the number of documents retrieved.
– On the other hand, precision usually decreases as the number of documents retrieved is increased.

What about a single number?
• The combined measure that is standardly used is called the F measure, the weighted harmonic mean of precision and recall:
F = 1 / (α (1/P) + (1 − α) (1/R)) = (β² + 1) P R / (β² P + R), where β² = (1 − α) / α.
• The default is to weight precision and recall equally, giving the balanced F measure.
– This corresponds to making α = 1/2 or β = 1.
– It is commonly written as F1, which is short for Fβ=1.

Evaluation of ranked retrieval results
• Precision-recall curves have a distinctive saw-tooth shape:
– if the (k+1)-th document retrieved is irrelevant, then recall is the same as for the top k documents, but precision has dropped;
– if it is relevant, then both precision and recall increase, and the curve jags up and to the right.
• It's often useful to remove these jiggles by interpolation:
– the interpolated precision at a certain recall level r is defined as the highest precision found for any recall level q ≥ r.
• Justification:
– almost anyone would be prepared to look at a few more documents if it would increase the percentage of the viewed set that is relevant (that is, if the precision of the larger set is higher).
• Interpolated precision is shown by the red line in the figure.

Precision at k
• The above measures precision at all recall levels.
• What often matters instead is how many good results there are on the first page or the first three pages.
• This leads to measures of precision at fixed low numbers of retrieved results, such as 10 or 30 documents.
• This is referred to as "Precision at k", for example "Precision at 10."
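A minimal sketch computing precision, recall, and the balanced F measure from confusion-matrix counts (the counts below are made up):

```python
# Hypothetical retrieval outcome: 40 relevant documents in the collection,
# 50 documents retrieved, 30 of which are relevant.
tp = 30          # relevant and retrieved
fp = 20          # retrieved but not relevant
fn = 10          # relevant but not retrieved

precision = tp / (tp + fp)                           # fraction of retrieved docs that are relevant
recall    = tp / (tp + fn)                           # fraction of relevant docs that are retrieved
f1 = 2 * precision * recall / (precision + recall)   # balanced F measure (beta = 1)

print(round(precision, 2), round(recall, 2), round(f1, 2))   # 0.6 0.75 0.67
```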
Books and Slides
• Corresponding chapters in the book: Ch 1, Ch 2 (2.1, 2.2), Ch 4 (4.1, 4.2, 4.3, 4.5, 4.6), Ch 5 (5.1, 5.3, 5.4.1, 5.4.2, 5.4.3, 5.7.1, 5.7.2).
• However, the book isn't very clear on predictive data mining (it is pretty good for the second part of the course; that's the reason for choosing it).
• The Witten&Eibe book (on library reserve) is clearer for predictive data mining (but not good for the second part of the course).
– See my e-mail with respect to the relevant chapters.
• Exercises in the midterm will be similar to the examples in the slides and the assignment.
– The slides also expand more (than both books) on some topics (e.g. decision trees, Bayesian nets, ROC curves, etc.).
• In the midterm: calculators and a single-sided help sheet are allowed.