Chapter I: Basics of R - DCU School of Computing

Download Report

Transcript Chapter I: Basics of R - DCU School of Computing

4. Basic distributions with R
CA200
(based on the book by Prof. Jane M. Horgan)
1
Discrete distributions: Binomial distribution
Def:
Conditions:
1. An experiment consists of n repeated trials
2. Each trial has two possible outcomes
a) a success with probability p is constant from trial to trial
b) a failure with probability q = 1 − p
3. Repeated trials are independent
X = number of successes in n trials. X is a BINOMIAL RANDOM VARIABLE.
The probability of getting exactly x successes in n trials is:
In R use function:
prefix “binom” with prefix “d” for pdf, “p” for cdf. Function are
dbinom(x, n, p)
pbinom(x, n, p)
2
Binomial distribution – cont.
Example 1. Terminals on an on-line computer system are attached to a communication line to the
central computer system. The probability that any terminal is ready to transmit is 0.95.
p = 0.95
Let X = number of ready terminals
Example 2. A fair coin is tossed 10 times. Success and failure are “heads” and “tails”, respectively,
each with probability, 0.5.
Lets denote:
X = number of heads (successes) obtained
Example 3. It is known that 20% of integrated circuit chips on a production line are defective. To
maintain and monitor the quality of the chips, a sample of 20 chips is selected at regular
intervals for inspection.
Lets denote:
X = number of defectives found in the sample
p = 0.20
n = 20
P(X=0) = P(0 defective in a sample of size 20);
P(X=1) = P(1 defectives in a sample of size 20)
P(X=2) = P(2 defectives in a sample of size 20)
3
Binomial distribution – cont.
Example 4. It is known that 1% of bits transmitted through a digital transmission are received in
error. One hundred bits are transmitted each day.
Lets denote:
X = number of bits found in error each day
p = 0.01
n = 100
P(X=0) = P(0 errors in the 100 transmission)
P(X=1) = P(1 error in the 100 transmission)
P(X=2) = P(2 errors in the 100 transmission)
4
Binomial distribution – cont.
Calculating binomial pdfs in R:
Use prefix “binom” with “d” for pdf. Function is dbinom with arguments n and p.
Example 1:
For all cumulative probabilities:
x <- 0:5
dbinom(x, size = 5, prob = 0.95)
[1] 0.0000003125 0.0000296875 0.0011281250 0.0214343750 0.2036265625 0.7737809375
The probability of getting exactly 3 ready terminals from five:
dbinom(x = 3, size =5, prob = .95)
[1] 0.02143438
Example 2: with round function for 4 decimals!
x <- 0:10
round(dbinom(x, 10, 0.5), 4)
[1] 0.0010 0.0098 0.0439 0.1172 0.2051 0.2461 0.2051 0.1172 0.0439 0.0098 0.0010
Example 3:
dbinom(x, 20, 0.2)
Example 4:
dbinom(x, 100, 0.01)
5
Binomial distribution – cont.
Plotting binomial pdfs in R:
par(mfrow = c(2, 2))
# multiframe
x <- 0:5
#Example 1
plot(x, dbinom(x, size = 5, prob = 0.95), xlab = "Number of Ready Terminals",
ylab = "P(X = x)", type = "h", main = "Ready Terminals, n = 5, p = 0.95")
x <- 0:10
#Example 2
plot(x, dbinom(x, size = 10, prob = 0.5), xlab = "Number of Heads",
ylab = "P(X = x)", type = "h", main = "Heads, n = 10, p = 0.5")
x <- 0:20
#Example 3
plot(x, dbinom(x, size = 20, prob = 0.2), xlab = "Number of Defectives",
ylab = "P(X = x)", type = "h", main = "Defectives, n = 20, p = 0.2")
x <- 0:20
#Example 4. No need for full range here
plot(x, dbinom(x, size = 100, prob = 0.01), xlab = "Number of Bits in Error",
ylab = "P(X = x)", type = "h", main = "Bits in Error, n = 100, p = 0.01")
6
Binomial distribution – cont.
Binomial pdfs
7
Binomial distribution – cont.
Calculating binomial cdfs in R:
Use prefix “binom” with “p” for cdf. Function is pbinom with arguments n and p.
Example 1: n = 5, p = 0.95
P(X ≤ 3):
pbinom(3, 5, 0.95)
#[1] 0.0225925
P(X > 3):
1 - pbinom(3, 5, 0.95)
#[1] 0. 9774075
Example 3: n = 20, p = 0.02
P(X ≤ 4):
pbinom (4, size = 20, prob = 0.2) #[1] 0.6296483
P(X > 4) = 1 - P(X ≤ 4 ):
1- pbinom(4, size = 20, prob = 0.2) #[1] 0.3703517
For all cumulative probabilities
x <- 0:20
prob <- pbinom(x, size = 20, prob= 0.2)
round to 4 decimal places:
round(prob, 4)
# [1] 0.0115 0.0692 0.2061 0.4114 0.6296
8
Binomial distribution – cont.
Plotting binomial cdfs in R:
par(mfrow = c(2,2))
#multiframe
x<-0:5
plot(x, pbinom(x, size = 5, prob = 0.95), xlab = "Number of Ready Terminals",
ylab = "P(X <= x)", ylim = c(0, 1), type = "s", main = "n = 5, p = 0.95")
x<-0:10
plot(x, pbinom(x, size = 10, prob = 0.5), xlab = "Number of Heads",
ylab = "P(X< = x)", ylim = c(0, 1), type = "s", main= "n = 10, p = 0.5")
x<-0:20
plot(x, pbinom(x, size = 20, prob = 0.2), xlab = "Number of Defectives",
ylab = "P(X <= x)", ylim = c(0, 1), type = "s", main = "n = 20, p = 0.2")
x<-0:100
plot(x, pbinom(x, size = 100, prob = 0.01), xlab = "Number of Error Bits",
ylab = "P(X <= x)", ylim = c(0, 1), type = "s", main = "n = 100, p = 0.01")
9
Binomial distribution – cont.
Binomial cdf
10
Normal distribution – continuous distribution
•
in practice, measures tend to cluster around a central values
•
the largest proportion of measurement occurring near central value and
decreasing as the measurements extend at either side of this central value
•
bell-shaped curve symmetrical about the central value is called “normal
distribution”
17
Normal distribution – cont.
Example 5. Heights of females and males. We can accept that:
- on average, males are taller than females but maybe spread around their respective
averages could be the same
- most of female and male heights cluster around the same average
- the heights of females and males might be modelled by normal distribution:
Height of females and males:
X ~ N(1.65, 0.25)
X ~ N(1.85, 0.25)
 bell-shaped curves
 centred at different points
 variation at either side
of central points is the same
Two normal distributions with different mean
and the same spread
18
Normal distribution – cont.
Example 6. Arises in product manufacturing. Components manufactured by two
machines are set to a diameter 10 cm.
One machine produces components with error 1cm:
X ~ N(10, 1)
The second machine has an error of 2cm:
X ~ N(10, √2)
 bell-shaped curves
 centred at the same point, 10cm
 variation at either side
of central points is different
→ one high and narrow,
the other low and wide
Two normal distributions with the same mean
and different standard deviations 19
Normal distribution – cont.
The Probability Density Function for Normal distribution
Def:
The PDF of a normal (Gaussian) distribution, f(x), is a function defined as:
𝒇 𝑿 =
𝟏
𝝈 𝟐𝝅
(𝑿−µ)𝟐
𝒆− 𝟐𝝈
−∞<𝑿<∞
It can be shown that
E(X) = µ
Var(X) = σ2
X is said to be normally distributed with a mean of μ and a variance of σ2
i.e.
X ~ N (μ, σ2)
Increasing μ moves the distribution along the x-axis (example 1), and increasing σ2
widens the distribution (example 2)
20
Normal distribution – cont.
Generating Normal Curves
To plot a normal curve use the function curve with dnorm as the argument:
curve (dnorm(x, 2, 0.25), from = 1, to = 3)
# plots a normal curve with a mean of 2 and a standard deviation of 0.25 in the range 1 to 3
curve (dnorm(x, 2, 0.25), from = 1, to = 3, labels = FALSE, tick = FALSE)
# to suppress the labels and axis ticks
21
Normal distribution – cont.
Generating Normal Curves
To generate normal curves on same axis use (example 1, example 2)
cbind which combines R objects with columns and
matplot which plot columns as matrices
0.5
f(x)
1.0
1.5
x <- seq(0.5, 3, len = 101)
#101 equidistant values between 0.5 and 3
y <- cbind(dnorm(x, 1.65, 0.25), dnorm(x, 1.85, 0.25))
#two normal curves
matplot(x, y, type = "l", xlab = "height(metres)", ylab = "f(x)")
#multiple plotting
text(1.15, 0.5, "Females")
text(2.35, 0.5, "Males")
Males
0.0
Females
0.5
1.0
1.5
2.0
height(metres)
2.5
3.0
Normal distribution – cont.
The Cumulative Distribution Function
Def:
The CDF of normal distribution in R is obtained with pnorm function:
P(X ≤ x) = pnorm(x, μ, σ)
Example 7. From previous records it is known that examination results are normally
distributed with mean μ = 45 and the standard deviation σ = 4:
X ~ N(45, 16)
What percentage of students obtain a mark
• less than 45;
• larger than 45;
• larger than 50;
• between 40 and 50;
• greater than 37?
23
Using R to obtain normal probabilities
24
Conditional Probabilities in Normal distribution
Example 8. An analogue signal received at a detector (measured in microvolts),
is normally distributed with a mean of 100 and a variance of 256. What is the probability
that the signal will be less than 120 microvolts, given that it is larger than 110 microvolts?
Solution 8.
X ~ N(100, 16)
Recall Conditional probability of B given A:
𝑃 𝑋 < 120 𝑋 > 110 =
𝑃 𝐵𝐴 =
𝑃(𝐴∩𝐵)
𝑃(𝐴)
𝑃( 𝑋 > 110 ∩ 𝑋 < 120 ) 𝑃(110 < 𝑋 < 120)
=
𝑃(𝑋 > 110)
𝑃(𝑋 > 110)
For
P(110 < X < 120)
pnorm(120, 100, 16) - pnorm(110, 100, 16)
[1] 0.1603358
P(X > 110)
1 - pnorm(110, 100, 16)
[1] 0.2659855
𝑃 𝑋 < 120 𝑋 > 110 =
0.1603358
= 0.6027988
0.2659855
25
Example 9. From previous records scores on an aptitude are normally distributed with
mean μ = 500 and standard deviation σ = 100. What is the probability that an
individual’s score exceeds 650, given that it exceeds 500?
Solution 9.
X ~ N(500, 100)
𝑃 𝑋 > 650 𝑋 > 500 =
𝑃( 𝑋 > 500 ∩ 𝑋 > 650 ) 𝑃(𝑋 > 650)
=
𝑃(𝑋 > 500)
𝑃(𝑋 > 500)
For
P(X > 650)
1 - pnorm(650, 500, 100)
[1] 0.0668072
P(X > 110)
1 - pnorm(500, 500, 100)
[1] 0.5
0.0668072
P X > 650 X > 500 =
= 0.1336144
0.5
26
Example 10. Summary for basic statistics with R:
(Based on the Excel file “CA200_Questionnaire_Summary”which has been previously circulated)
a) Read the survey answers into a suitable data frame in R. [It was necessary to save the file, from within Excel,
in comma separated (csv) format. But this has already been done for you that is the circulated file is of .csv
type]
To read the file in R use “read.csv”. For example, if the file is on the C drive then
results<-read.csv(“C:/CA200_Questionnaire_Summary.csv”, header=T)
this will read the data from the file into the data frame called “results”.
As a check that the data have been read correctly, one can see that the command:
results$Q1
gives results from column Q1 only:
[1] 1 1 1 1 1 2 1 1 2 1 1 2 1 2 2 1 1 1 1 1 1 1 2 2 1 2 2 NA 1 2 1 NA 1 2 2 1 1 2 2 1 1 1 1 1 1 2
b)
Use R to find out how many males and females there are (check the questionnaire for the questions!) For
number of females and males use summary function. This will give you all important information:
- Hint: First convert categorical variable Q1 into a factor using:
results$Q1<-factor(results$Q1)
- Then use summary(results) to find out how many females/males are out there
c)
Use R to and to separate data for females and for males into two different data sets:
Use subset function:
m_results<-subset(results, results$Q1==1)
To check if this is properly entered use function to print results in the work space:
m_results
#This will filter only data for males
- repeat same for separating results for females by using:
f_results<-subset(results, results$Q1==2)
#This will filter only data for females
27
Example 10. Summary - cont.
d) Identify any other categorical variables (questions) and similarly specify these as
factors. Then, repeat the summary(results) command.
e) Calculate mean and standard deviation of the salary for summer job for females Q13b:
> f_results$Q13b
[1] 200 150 NA 240 NA 150 NA 300 200 200 300 NA NA NA 330
> summary(f_results$Q13b)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
150 200 200 230 300 330
6
> mean(f_results$Q13b)
#don’t forget to remove all NAs
[1] NA
> mean(f_results$Q13b, na.rm=T)
[1] 230
> sd(f_results$Q13b, na.rm=T)
[1] 66.52067
f) For question e) use R to calculate skewness and kurtosis:
Note: you will have to install additional package which contains these functions. The name of a package is
“e1071”. For installation use:
> install.packages(“e1071”)
> library("e1071")
Loading required package: class
Warning message:
package 'e1071' was built under R version 2.13.2
> skewness(f_results$Q13b, na.rm=T)
[1] 0.2196901
#what can you conclude from this number?
> kurtosis(f_results$Q13b, na.rm=T)
[1] -1.68135
#what can you conclude from this number?
28
Example 10. Summary - cont.
Note:
Categorical data are generally non-numerical data which are placed into exclusive categories and
then counted rather than measured. In fact, there are two types of categorical data, namely
Nominal: Simple names or labels which cannot be ordered (e.g. car registration number or a PIN
number)
Ordinal: Can be placed in a meaningful order (though without any measurements). For example,
“small”, “medium” or “large” vests.
A factor is a vector object used to specify a discrete classification (grouping) of the components.
of other vectors of the same length.
g)
Plot the data from e) in a histogram and compare to Normal curve by e.g. superposition:
> m <-mean(f_results$Q13b, na.rm=T)
> s <-sd(f_results$Q13b, na.rm=T)
> mn <-min(f_results$Q13b, na.rm=T)
> mx <-max(f_results$Q13b, na.rm=T)
> c<-curve(dnorm(x, m, s), from=mn, to=mx)
> hist(f_results$Q13b, freq=FALSE, xlab=“put adequate name”, ylab=“density”, main=“put adequate name”)
> lines(c)
29