Probability and Statistics
Representation of Data
Measures of Center for Data
Simple Analysis of Data
Overview
In this module you’ll be learning about the basics of statistics:
Statistical Displays – Data can be displayed graphically in
different ways. You will learn how to choose displays by the type
of data and the message to be delivered to the audience. Some
“do not” examples will also be covered.
Measures of Center – A single number is commonly used
to describe an entire set of data. You will explore the different
types of “averages” and learn why you might choose one over
another.
Analysis – This module covers the simple analysis of the data.
You will look at what information can be obtained from the data
and how to make comparisons of various data sets.
Topics
• An Introduction to statistics
• Types of data
• Displaying data
• How NOT to display data
• Arithmetic Mean
• Median
• Mode
• Weighted Mean
• Types of distributions
• Measures of center vs. variation
Introduction to Statistics
Statistics: An Introduction
The first page at this site gives an explanation of how statistics is used.
Clicking on the “Continue” link on the bottom right of the page will take
you to the section on “Revealing Patterns Using Descriptive Statistics”. It
may be worthwhile to read the first page of this section to review some
of the common terms/vocabulary used in describing data. Return to this
presentation when you are ready.
(On the right of the web page, you will notice additional information that is beyond the
scope of this module. Feel free to come back at a later time to explore further.)
Statistics is the set of mathematical tools for collecting,
organizing, and analyzing data, and then interpreting the
information to make decisions.
Displaying Data
Types of Data
What is Data?
Read the text on the web site. Answer the nine questions at the bottom
of the page to check your understanding of the topic. Return to this
presentation when finished.
Data can be qualitative, describing distinct categories, or quantitative,
describing numerical counts or measurements.
Qualitative data can be nominal, where no natural order exists
between the categories, or ordinal, meaning an order does exist.
Quantitative data can be divided into continuous, when the data can
take any value within a range, and discrete, when the data are
counts that take only separate values (typically integers).
Types of Data (cont.)
Another explanation of Types of Data
If you are still unsure about recognizing qualitative and
quantitative data, click on the link above to review how to
distinguish between these two variables. When you have
completed the “Progress check” at the bottom of the web page,
return to this presentation.
You should now be able to classify data as qualitative nominal,
qualitative ordinal, quantitative discrete or quantitative
continuous, and are ready to explore how to display data.
Continue to the next slide!
Types of Graphs
Common Graphs
Visit the above website for a brief description and representation
of ten of the most common graphs.
You should have a basic idea of the types of graphs that can be created
to display data. As you move on through the slides, you will learn how
to create these graphs and how to determine which graph gives the
best representation of the data you want to display.
Types of Graphs (cont.)
Math Dictionary
This web site is home to a mathematics dictionary. It has examples of the
graphs listed below. Return to this presentation when finished.
There are various ways to display your data. The differences arise from
the type of data and the information and/or message you want to
deliver. Following is a list of the more traditional types of graphs.
Bar graph
Histogram
Pie graph
Line graph
Box plot
Scatter plot
Line plot/Dot plot
Pictograph
Stem & Leaf plot
Bar Graphs
Bar Graphs
Click on the link above to learn how to create a bar graph. After
reading the information on bar graphs, answer the ten questions
in the “Your Turn” section at the bottom of the page.
One way to graphically represent data is by using a bar
graph/chart. What type of data is best represented by a bar
graph? What information about a data set should you be able
to interpret from a bar graph?
Histograms
Histograms
Read about histograms by following the link above. Check your
understanding by answering the ten questions in the “Your turn”
section at the bottom of the web page.
Histograms can be used to represent continuous data.
You should be able to identify data that is continuous and be
able to create a histogram to represent that data.
Histograms
Create a Histogram (video)
This video demonstrates how to take a data set and create a
histogram.
Histograms are best used when the data variable on the x-axis is
quantitative. The bars most often represent a range of values.
Each bar could also represent an individual value. In this case,
the histogram would more accurately be called a frequency
distribution graph.
Histograms (cont.)
A Histogram is NOT a Bar Chart
It is important to distinguish the difference between a histogram
and a bar chart. This is the first of several sites that will help you
determine when to display data as one or the other. Read the
information on the first page and then return to this
presentation.
Histograms and bar charts can look similar even though they
display very different representations of the data. After reading
the information on the web page linked above, you should be
able to identify three differences between histograms and bar
charts.
Histograms (cont.)
Bar Charts and Histograms (includes a video)
This webpage has additional information on when to use a bar
chart or histogram to display your data. You can also view the
video which shows how to create a bar chart and histogram.
Histograms (cont.)
Histograms vs. Bar Graphs
Click on the link above to read more about the differences
between histograms and bar charts. The information is set up as
a conversation between a teacher and student reasoning through
how each graph can be used to display specific types of data.
When you have finished reading the discussion, please return to
this presentation.
Can you answer the following questions?
• What type of data would be best represented by a histogram?
• What information should you be able to identify when data is
represented in a bar graph?
Pie Charts
Pie Charts
Read about how to create a pie chart and what type of data
displays best in this format. Be sure to complete the questions at
the bottom of the web page in the “Your turn” section and then
return to this presentation!
Pie charts represent data as a part-to-whole relationship.
Pie Charts (cont.)
Pie Charts
This site looks at how NOT to use pie charts, along with showing
many examples found in the news, in business reports and other
media.
You should be able to answer the following questions regarding
pie charts:
• What is the best type of data to represent graphically in a pie
chart?
• When interpreting information from a pie chart, what are
three areas you should pay attention to in the representation?
Scatterplots
What is a Scatterplot?
This site will introduce you to scatterplots. Click the blue “View
Video” button to see how to make and read scatterplots. Once
you have watched the video and read through the information on
this webpage, return to this presentation.
Main points:
• A scatterplot is used to graph the relationship between two
quantitative variables or bivariate data;
• Scatterplots may show patterns – weak or strong, positive or
negative correlations;
• Correlation does not indicate cause and effect.
Scatterplots (cont.)
Scatterplots and Correlation
This site presents another view of scatterplots and correlation.
After reading this information, answer the nine questions in the
“Your turn” section at the bottom of the page.
Explore further…
At the above website, under the correlation graphs, is a link
More About Correlation. Here you will see how correlation is
calculated. In most cases, you will use a calculator or software
function for this; however, it’s beneficial to know how the
correlation coefficient is derived.
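To make the derivation concrete, here is a minimal sketch of the Pearson correlation coefficient computed by hand. The function name and the data values are made up for illustration; in practice you would normally rely on a calculator or software function.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]          # made-up data: hours studied
scores = [52, 60, 61, 70, 77]    # made-up data: test scores
r = pearson_r(hours, scores)
print(round(r, 3))               # close to +1: a strong positive correlation
```

A value near +1 or -1 indicates a strong linear pattern and a value near 0 a weak one. As noted above, even a strong correlation does not indicate cause and effect.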
Line Graph
Line Graph
This website gives many examples of line graphs and explains
what makes a line graph different from a scatter plot. Read
through this information and then return to this presentation.
Main ideas:
• Line graphs help to determine the relationship between two
sets of values;
• Value sets represent an independent variable and a
dependent variable;
• Line graphs are useful in showing trends and making
predictions.
Line Graph (cont.)
Line Graphs
Check your understanding in interpreting line graphs by
answering the ten “Your turn” questions at the bottom of this
webpage.
You should now be able to answer the following questions:
• What are the main differences between scatterplots and line
graphs?
• What type of data is best represented in a line graph?
Box Plot
Box Plots (YouTube video)
This video introduces you to Box Plots, as it demonstrates how to
create a box plot and defines the vocabulary terms listed below.
When you have finished viewing the video, return to this
presentation.
Vocabulary to understand box plots:
• Distribution
• Median
• Average (Mean)
• Extremes
• Quartiles
• Interquartile Range
Box Plot
Quartiles / Interquartile Range / Box and Whisker Plot
This webpage gives another look at the breakdown of Box Plots.
Once you have read through the information, try answering the
ten “Your turn” questions at the end of the page. (Tip: It will be
helpful to have scrap paper available)
At this point, you should be able to:
• determine the lower, middle and upper quartiles of a data set;
• calculate interquartile range;
• construct a Box and Whisker Plot to represent the data;
• compare box plots from two data sets and make observations
about the distributions.
Box Plot
Boxplot (aka, Box and Whisker Plot)
If you need additional information to understand boxplots, click
on the link above and “View Video”, which gives more details on
how to read a boxplot. When you have finished reading through
Boxplots Basics and How to Interpret a Boxplot, return to this
presentation.
Stem & Leaf Plot
Stemplots (aka, Stem and Leaf Plots)
Click the blue button to View Video and then read the
information on stem and leaf plots. For additional explanation
about this type of graph, proceed to the next slide.
• Use to display quantitative data
• Best used with small sets of data
• Shows shape of distribution
• Stem values can have any number of digits
• Leaves can only be represented by one digit
• Limitations displaying decimals
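The points above can be sketched in a few lines of code. This is only an illustration, assuming non-negative whole numbers; the function name and the data are made up. Notice that the stem may have several digits while each leaf is always a single digit.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative whole numbers by stem (all digits but the last)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # leaf is always a single digit
        stems[stem].append(leaf)
    return dict(stems)

ages = [12, 15, 21, 24, 24, 30, 38, 105]   # made-up data
plot = stem_and_leaf(ages)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

This sketch prints only the stems that actually occur. A hand-drawn stem and leaf plot usually lists the empty stems as well, so that gaps in the distribution remain visible.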
Stem & Leaf Plot
Stem and Leaf Plots
This site provides additional details on “splitting the stems” and
“splitting stems using decimal values”.
You should now know:
• under what circumstances stems should be split;
• how to organize decimal data in a stem and leaf plot;
• how to interpret data by looking at a stem plot.
Line Plot / Dot Plot
Line Plot (YouTube Video)
View this video on how to make a Line Plot then return to this
presentation.
Vocabulary:
• Clusters
• Gaps
• Outliers
Line Plot / Dot Plot
Dot Plot vs. Line Plot (YouTube Video)
This YouTube video does a good job describing the similarities
and differences between a line plot and dot plot. Then return to
this slide and click here to reinforce what you have learned
about Line and Dot plots.
Picture Graph / Pictograph
Pictographs
Read the information on Pictographs and then answer the nine
“Your turn” questions at the bottom of this webpage.
In a Pictograph, symbols are used to display statistical data.
Symbols can be misleading if not accurately proportioned or if
the symbols cannot be divided evenly to represent fractional
parts.
Types of Graphs
Comparing Graphs
Test your understanding of the graphs covered in this unit. At
this website, read through the problems and decide which graph
most clearly represents the data and what information is to be
conveyed to the reader. Also, work through the five questions at
the bottom of the page.
Most data can be represented using multiple graphs. Decisions
on the most appropriate display should be made based on what
information you want the reader to draw from the graph.
An Advanced Display of Data
Hans Rosling
Probably one of the most informative and modern displays of
data can be seen from the work of Hans Rosling. The link above
shows a video of his TED talk in 2006. It is a 20 minute video and
it gets very interesting about 4 minutes into the video. Watch it
all if you have time but we recommend at least 10 minutes.
The point of this experience is not that we expect you to
duplicate this extraordinary presentation, but that you appreciate
the power of displaying data in a clear and understandable
method. Any enhancement of the display should be for the
purpose of clarity and not just distracting visuals.
How NOT to Display Data
Misleading Line Graphs by Khan Academy
The above link is by Salman Khan, founder of the Khan Academy.
In his video he highlights the misleading visual displays of a line
graph (5 min). Return to this presentation when finished.
Often a data display can mislead the reader. At times this
may be intentional when the creator wants to persuade the
reader in some way. Other times it may be unintentional when
the creator tries to make the display more visually appealing and
causes the reader to misinterpret the results.
How NOT to Display Data
Misleading Graphs by Wikipedia
The above link, by Wikipedia, shows various ways a graphic
display can mislead its intended audience. Return to this
presentation when finished.
Typically the displays we see are technically accurate but they use
visual “tricks” to mislead the reader who may not pay close
attention to the details of the graphic display.
Measures of Center
Measures of Center
Definitions of these terms
If you are not familiar with the terms listed below, follow the link
above to familiarize yourself with these terms.
(The above site includes other measures of center that are beyond the scope of this
presentation)
Different ways to measure the center of data
• Arithmetic mean (commonly called average or just mean)
• Median
• Mode
• Weighted mean
Measures of Center
Central Values
The link above gives some simple examples of measures of
center and compares the mean, median, and mode. Check your
understanding with the ten questions at the bottom of the page
before returning.
What is meant by “Measure of Center”? Sometimes we want to
describe a group of data (numbers, values) by a single number.
The advantage of this is the ability to more easily compare
different groups of data. The disadvantage is when you describe
a data set by a single number you lose the details and could
mislead someone.
Arithmetic Mean
The mean of a set of data is found by adding all the data values
and dividing that sum by the number of data points (often referred
to as “n”).
Strengths
• Its calculation includes all the data
• It is common and more likely understood by others
• It is often used in other statistical formulas
Arithmetic Mean
Weaknesses
• Sometimes you don’t know all the data points needed to
calculate the mean (data may be in a graph only)
• The mean of an extremely large data set may be tedious to calculate by hand.
• It can be influenced by outliers, those values much larger or
smaller than the rest of the data.
• It is often a value that is different than any of the data values
When best to use
• The mean is best used when your data is continuous and symmetrical.
• Often necessary for use in other statistical measures.
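As a small illustration of the definition and of two of the weaknesses listed above, here is a sketch with made-up data. The function name is just for this example.

```python
def arithmetic_mean(values):
    """Add all the data values and divide by the number of points, n."""
    return sum(values) / len(values)

data = [4, 8, 15, 16, 23, 42]            # made-up data
print(arithmetic_mean(data))             # 108 / 6 = 18.0, not one of the data values

with_outlier = data + [1000]             # one extreme value added
print(arithmetic_mean(with_outlier))     # the mean is pulled far above most of the data
```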
Lessons on Arithmetic Mean
How to Find the Mean
Visit the web site above to learn more about the arithmetic
mean. After reading the lesson make sure and check your
understanding by answering the ten questions at the end.
In case you missed it, make sure and check out the “mean
machine”. Run this virtual machine to see the relationship
between the data points and the mean value.
Median
Wikipedia defines median
The web site above gives a very detailed definition of median.
(Many of the examples are beyond the scope of this presentation)
The median of a set of data is found by arranging all the data in
numerical order and then selecting the data point in the middle.
If the data has an even number of values the median is the mean
of the two central values.
Strengths
• Requires little if any mathematical calculation
• It is not affected by outliers (large or small data points)
• It can be approximated from a frequency distribution or a
distribution graph
Median
Weaknesses
• Arranging a large set of data in order can be very difficult.
When best to use
• The median is usually preferred when the data distribution is
skewed
• It is used with ordinal data when the mean cannot be used
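The rule for finding the median can be written out as a short sketch. The function name and data are made up for illustration; Python's standard library also offers `statistics.median`, which does the same job.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([7, 1, 3]))         # odd count: the middle value, 3
print(median([1, 2, 3, 4]))      # even count: (2 + 3) / 2 = 2.5
print(median([1, 2, 3, 1000]))   # still 2.5: the outlier does not move the median
```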
Lessons on Median
How to Find the Median Value
Visit the following web site to learn more about the median.
After reading the lesson make sure and check your understanding
by answering the ten questions at the end.
Comparing Mean & Median
Mean / Median Applet
The link above gives you the ability to see how the mean and
median change as the data points change. The applet allows you
to drag data points on the line or move data points on the line.
Take some time and play with this applet and see how the mean
and median change and compare.
You can also check the box for “box plot” to see how a boxplot
would look with the data that shows on the line.
When you have finished, jot down the patterns you have
observed and then return to this presentation
Comparing Mean & Median
Seeing Statistics
Use the link above for a more comprehensive lesson on the
attributes and differences between the mean and median. The
link will take you to an introduction of the web interface. When
you think you are familiar with how to navigate the system, click
on the icon in the left column.
When the table of contents shows, click on lesson #3 “Describing
the Center”. You can advance from one page to the next by
clicking on the icon in the top right corner of the page.
Return to this presentation when you finish.
Mode
Wikipedia defines mode
The web site above gives a very detailed definition of mode.
(Many of the examples are beyond the scope of this presentation)
The mode of a set of data is found by identifying the data
element that occurs most often. Many people remember this by
associating the word “most” with mode.
Strengths
• Depending on the display of the data or the size of the data, it is
often easy to identify
• It is the ONLY measure of center you can use for non-numeric
data (nominal data). Example: What is the best measure of
center for the eye color of this group of people?
Mode
Weaknesses
• Sometimes the data set could have two modes or even
more.
• Often the data does not have any data element that is more
numerous than any other.
• Sometimes the mode is nowhere near the center of the data.
When best to use
• It is the only measure of center valid with nominal data
(Example: data on student’s eye color)
• It can support the validity of the mean and median if it has a
similar value. If the data is perfectly normal,
mean=median=mode
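The sketch below finds the mode by counting, and shows the strengths and weaknesses listed above. The function name and data are made up; the function returns a list because a data set may have more than one mode.

```python
from collections import Counter

def modes(values):
    """Return every value that occurs most often (there may be several)."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

eye_colors = ["brown", "blue", "brown", "green", "blue", "brown"]
print(modes(eye_colors))      # works for nominal data: ['brown']
print(modes([1, 1, 2, 2, 3])) # two modes: [1, 2]
print(modes([1, 2, 3]))       # every value ties, so no value stands out
```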
Mode
How to Find the Mode Value
Visit the web site above to learn more about the mode. After
reading the lesson make sure and check your understanding by
answering the ten questions at the end.
Weighted Mean
Wikipedia defines weighted mean
The web site above gives a very detailed definition of weighted
mean.
(Many of the examples are beyond the scope of this presentation)
• Sometimes certain values in a data set contribute more to a
measure of center than other values. In this situation, we
calculate a weighted mean.
Weighted Mean
Dr. Math explains weighted mean (weighted average)
The web site above gives examples of calculating a weighted
mean or weighted average.
A simple example:
• Consider a university that teaches two classes. One class has
10 students, the other has 100 students. If you ask the
university for the average (mean) class size, they respond with 55:
(100 + 10) / 2. However, if you ask every student what size class
they are in and take the mean, you get about 91.8:
[(100 * 100) + (10 * 10)] / 110.
• The 100 students in the larger class carry more WEIGHT than
the 10 students in the smaller class.
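The class-size example can be checked with a short sketch. The function and variable names are made up for illustration. Each class size is weighted by the number of students who experience it.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

class_sizes = [10, 100]

# The university's view: each class counts once.
university_view = sum(class_sizes) / len(class_sizes)

# The students' view: each class size is weighted by how many students are in it.
student_view = weighted_mean(class_sizes, weights=class_sizes)

print(university_view)          # 55.0
print(round(student_view, 1))   # 91.8
```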
Weighted Mean
Weighted Mean
Visit the web site above to learn more about the weighted mean.
After reading the lesson make sure and check your understanding
by answering the ten questions at the end.
Simple Analysis of Data
Simple Exploration and Analysis of Data
Exploring Data
Read the first page in the link above to learn about the
importance of thinking carefully about how to interpret what a
data set can reveal.
Measures of center, like the mean, median, and mode, give useful information
about a data set, but much is hidden by such single-number summaries.
To understand the information in a complex or large data set, it is important to
examine the integrity of the data, to look out for interesting and useful
patterns, and to summarize the data skillfully.
Patterns are likely to be found more easily in visual, rather than numerical,
representations of the data.
A single number is unlikely to summarize the data effectively.
Data Integrity - Outliers
Outliers
Outliers are numbers in a data set that are very different from all the
others. Read the information in the link above to learn more about
outliers. Then work the problems at the end to test your
understanding.
Why do some data sets have outliers? Have these numbers been recorded
wrongly? Do they correspond to bad mistakes, e.g. in measurements?
You should have found from the problems you worked that
• Outliers don’t have much effect on the median;
• Outliers can have a big effect on the mean.
• How can we tell when a “suspicious” number is a “genuine” outlier rather
than just being at the limits of what is “normal”?
• What, if anything, should we do about outliers? If we ignore them, will they
make problems for how we interpret the data?
We’ll discuss some of these issues in the pages ahead.
Patterns of Data
Patterns of Data
Open the link above to read about various ways to describe and
identify patterns in data sets.
To begin with, we will concentrate on finding ways to measure the spread of a
data set.
This will give us two numbers (a measure of center and a measure of spread) to
use when we summarize a data set. Two is better than one.
The interaction of center and spread is important. If a data set has small spread,
the values will be clustered closely around the center, so the center will
represent the data values well.
Range
Range
Discover what is meant by the range of a quantitative data set by
reading the explanation in the link above and working through
the activities.
The range of a data set is the difference between the largest and smallest
values. It is the simplest measure of the spread of the data.
Strength:
• Very simple to compute.
Weaknesses:
• Very sensitive to unreliable data or outliers; it is easy for an inaccurate
measurement to be much bigger or smaller than the others; this could
have a big and misleading impact on the range.
• Only uses two data values, so a lot of information about the data set is
lost.
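A brief sketch shows both how simple the range is to compute and how sensitive it is to one unreliable value. The data are made up, and the mistyped reading is a hypothetical example.

```python
def data_range(values):
    """Difference between the largest and smallest values."""
    return max(values) - min(values)

clean = [12, 15, 11, 14, 13]     # made-up measurements
with_error = clean + [140]       # hypothetical mistyped reading of 14

print(data_range(clean))         # 15 - 11 = 4
print(data_range(with_error))    # 140 - 11 = 129
```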
Quartiles
Quartile Definition and Computation
Click the link above to read about quartiles and how to compute
them. Then check your understanding by working the problem at
the end.
Half the values in a data set are at least as big as the median – and half the
values are no bigger than the median.
• The first (or lower) quartile Q1 is essentially the median of the lower half of
the data set;
• The third (or upper) quartile Q3 is essentially the median of the upper half of
the data set.
Controversy:
• There is no generally accepted agreement about how precisely to compute
upper and lower quartiles. In fact, different calculators or software packages
will give different results for the quartiles of the same data set.
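The sketch below illustrates the controversy. It computes quartiles as the medians of the lower and upper halves (the function name is made up for this example), and compares the result with the two methods offered by Python's `statistics.quantiles`. All three are reasonable, yet they disagree on the same data.

```python
from statistics import median, quantiles

def quartiles_by_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves (needs 2+ values)."""
    if len(values) < 2:
        raise ValueError("need at least two data values")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]      # skips the middle value when the count is odd
    return median(lower), median(upper)

data = [1, 3, 4, 6, 7, 8, 10, 12]    # made-up data
q1_halves, q3_halves = quartiles_by_halves(data)
q1_exclusive = quantiles(data, n=4, method="exclusive")[0]
q1_inclusive = quantiles(data, n=4, method="inclusive")[0]

print(q1_halves, q1_exclusive, q1_inclusive)   # three different values for Q1
```

When you compare quartiles from different sources, it is worth checking which method each one used.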
Interquartile Range
Interquartile Range
Open the link above to learn what interquartile range (IQR)
means and how to compute it. Don’t miss the embedded YouTube
video. Then read more here and check your understanding by
answering the ten questions at the end.
The IQR tells you about the spread of a data set through focusing on the middle
50% of the data. Its value is the length of the box in a box and whisker plot.
Advantages of the IQR as a measure of spread:
• Easy to compute;
• Much less likely than the range to be affected by outliers.
When best to use:
• When you use the median as a measure of center, the IQR is a good
measure of the spread of a distribution.
Interquartile Range and Outliers
IQR and Outliers
Visit the link above to find how to use the interquartile range to
identify outliers in a data set. Then try your hand at the problems
in this set.
A standard convention is that a number in a data set is an outlier if it lies
more than 1.5 IQRs below the first quartile or more than 1.5 IQRs above the
third quartile.
It is important to understand that the choice of 1.5 IQRs is not specified by any
theory. It is an arbitrary convention – but it has worked well for many years.
An outlier can easily be spotted in a box and whisker plot – it lies more than
one and a half box lengths beyond either end of the box.
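Here is a minimal sketch of the 1.5 IQR check. It assumes the common convention of measuring the 1.5 IQR distance outward from the first and third quartiles, and it computes the quartiles as medians of the lower and upper halves, which is only one of several accepted methods. The function name and data are made up.

```python
from statistics import median

def iqr_outliers(values):
    """Values more than 1.5 IQRs below Q1 or above Q3 (assumed convention)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])      # median of the lower half
    q3 = median(ordered[-half:])     # median of the upper half
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in ordered if v < low_fence or v > high_fence]

data = [2, 4, 5, 5, 6, 7, 8, 30]     # made-up data
print(iqr_outliers(data))            # Q1 = 4.5, Q3 = 7.5, fences at 0 and 12
```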
Five Number Summary
Five number summary
The link above explains how a box and whisker plot provides five
numbers that conveniently summarize a data set. Five summary
numbers allow us a much richer analysis of a data set than a single
measure of center. Practice problems here.
The five number summary of a data set is based on a box and whisker plot. It
consists of
• The biggest value;
• The third quartile;
• The median;
• The first quartile;
• The smallest value.
It provides representative information about a data set that easily leads to a
measure of center and a measure of spread at the same time as making outlier
detection a simple matter of checking for over-long whiskers.
Standard Deviation
Standard Deviation
So far we have measured the spread of a data set in ways (range, IQR) that
associate well with the median measure of center. Read the link above to
see how to measure spread in a way (standard deviation) that is compatible
with the mean measure of center. Then work the problems at the end.
The deviation of a data value from the mean is just the difference between the
two.
The variance of a data set is the average of the squares of the deviations (with a
slight adjustment, dividing by n - 1 instead of n, if the data are a sample drawn
from a larger population.)
The standard deviation of a data set is the square root of its variance.
It is usually better to avoid computing variance or standard deviation by hand.
Many calculators have these computations pre-programmed, so it is easy to get
the information once the data is entered.
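For readers who want to see the steps once, here is a sketch of the calculation. The function names and the `sample` flag are made up for this illustration; the adjustment for sample data is the usual division by n - 1 instead of n. Python's `statistics` module provides `pstdev` and `stdev` for the same purpose.

```python
from math import sqrt

def variance(values, sample=False):
    """Average squared deviation from the mean (n - 1 divisor for a sample)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    divisor = n - 1 if sample else n
    return sum(squared_deviations) / divisor

def std_dev(values, sample=False):
    """Square root of the variance."""
    return sqrt(variance(values, sample))

data = [2, 4, 4, 4, 5, 5, 7, 9]          # made-up data with mean 5
print(variance(data))                    # 32 / 8 = 4.0
print(std_dev(data))                     # 2.0
print(std_dev(data, sample=True))        # slightly larger: divides by 7
```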
Standard Deviation and Outliers
Standard Deviation Video
Here’s a video to help you understand the concept of standard
deviation.
Standard deviation behaves like the average distance of the data values from
their mean. It is a measure of the spread of the data set:
• When the average distance is small, many of the data values will be
clustered around the mean, and the spread will be small;
• When the average distance is large, many of the data values will be far from
the mean, and the spread will be large.
We viewed a data value as an outlier when it was “far away”, in IQR terms, from
the middle half of the data. We shall also label a data value as an outlier when
it is far away from the mean, as measured in terms of the standard deviation.
• One convention is that an outlier is at least 3 standard deviations from the
mean – but this is a matter of debate, as is what you should do about
outliers.
Distinguishing Between Data Sets
Anscombe's Quartet
It is important to realize that very different data sets can have identical
means and standard deviations. In other words, they have identical center
and spread, but look very different. The link above is to Anscombe’s famous
examples.
Our objective is to understand and analyze our data sets. Because of
Anscombe’s and similar examples, simple number summaries cannot give
complete answers to our questions.
It is necessary to look for patterns in other ways.
Challenging Problems
Challenging Problems
Before changing direction, here are some problems that you will need
to think about carefully. If you get stuck, there are solutions posted.
Frequency Distributions
Frequency Distributions
Consult the link above as well as this one to find out about frequency
distributions. Make sure you confirm your understanding by working
the problems.
• Sometimes a value occurs more than once in a data set. Its frequency is the
number of times it appears in the list of values.
• By putting the values into bins if appropriate and counting up the total
frequency in each bin, it is not difficult to create a frequency table that can
be represented as a histogram.
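The binning step can be sketched as follows. The function name, the bin width, and the data are made up for illustration, and the sketch assumes whole-number data so that the bin labels print neatly.

```python
from collections import Counter

def frequency_table(values, bin_width):
    """Count how many values fall in each bin [start, start + bin_width)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

scores = [52, 67, 71, 75, 78, 83, 85, 88, 91, 99]   # made-up test scores
table = frequency_table(scores, bin_width=10)

# A rough text histogram: one '#' per data value in the bin.
for start, count in table.items():
    print(f"{start}-{start + 9}: {'#' * count}")
```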
Relative Frequency Distributions
Relative Frequency Distributions
To create a relative frequency distribution, we proceed as for the
frequency distributions, but scale each of the frequencies by dividing by
the total count of data values. This scaled frequency is the relative
frequency. Read the link above and here, working the problems.
• Relative frequencies are proportions that add up to 1, so a relative frequency
distribution can be read as a distribution of (estimated) probabilities.
• Using relative frequency distributions allows us to compare data sets of
different sizes on an equal basis.
• If we selected one data set of 100 measurements and another data set
of 1000 measurements by taking samples from a massive collection,
there shouldn’t be too much difference between how each value
compares with the others, regardless of which data set we examine.
• However, we should expect the frequency (e.g. 49) of one value in the
second data set to be roughly 10 times its frequency (e.g. 5) in the first
set. On the other hand, the relative frequencies should be similar (e.g.
5/100 = 5.0% and 49/1000 = 4.9%).
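The comparison of a 100-value data set with a 1000-value data set can be sketched like this. The two data sets are made up to match the counts in the example above, and the function name is only for illustration.

```python
from collections import Counter

def relative_frequencies(values):
    """Each frequency divided by the total count of data values."""
    total = len(values)
    return {value: count / total for value, count in Counter(values).items()}

small = ["A"] * 5 + ["B"] * 95       # 100 made-up observations
large = ["A"] * 49 + ["B"] * 951     # 1000 made-up observations

print(relative_frequencies(small)["A"])   # 5 / 100
print(relative_frequencies(large)["A"])   # 49 / 1000
```

The raw frequencies of "A" differ by a factor of about ten, while the relative frequencies are nearly equal, which is what allows data sets of different sizes to be compared.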
Describing Data Patterns I
Describing Data Patterns
The link above describes some basic patterns that can occur in frequency
distributions of data. These descriptions go beyond measuring center and
spread, and focus as well on shape and unusual features.
• Some distributions are symmetric around the mean – which must then
coincide with the median. (Why is this the case?)
• Some distributions are skewed with a tail to the right or the left.
• Some distributions have more than one mode.
Describing Data Patterns II
Data Pattern Video
See whether you can use your knowledge of the previous slide to
answer the questions in the YouTube video linked above.
• It turns out that distributions that are symmetric around the mean/median
and that have a single mode that also coincides with the mean/median are
the most important of all.
Characteristics of a Normal Distribution
Normal Distribution
Consult the link above for basic facts about the normal distribution and
how it arises. As always, work the problems at the end. Check out the
YouTube video here for ways to test carefully and systematically
whether or not your data really is normal.
• The continuous normal distribution has a distinctive bell shape, with the
mode, mean and median all at the same place. The bell is wider when the
standard deviation is greater, but the basic form is always the same.
• Approximately 68% of all values are within 1 standard deviation of the mean,
95% are within 2 standard deviations of the mean, and 99.7% are within 3
standard deviations of the mean.
• Normal distributions are continuous distributions that tend to arise in
measuring heights, sizes, pressures, temperatures, and so on.
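The 68%, 95% and 99.7% figures can be checked numerically. This sketch uses the standard result that, for a normal distribution, the fraction of values within k standard deviations of the mean equals erf(k / sqrt(2)). The function name is made up for illustration.

```python
from math import erf, sqrt

def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
```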
Standard Normal Distribution
Standard Normal Distribution
Consult the link above for basic facts about the standard normal
distribution and work the problems at the end.
• A normal distribution is standard if it has mean 0 and standard deviation 1.
• All normally distributed data sets can be converted simply to a standard
normal distribution. If the original data set has mean μ and standard
deviation σ, the new data set created by replacing the old data values x by
new data values z = (x-μ)/σ will be a standard normal distribution.
• It is common and useful to convert normal distributions to standard form;
this makes it much easier to make comparisons.
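The conversion z = (x-μ)/σ can be sketched as below. The function name and data are made up, and the sketch treats the data as the whole population when computing σ. Note that the conversion only shifts and rescales: the result has mean 0 and standard deviation 1, but it is normal only if the original data were.

```python
from statistics import mean, pstdev

def standardize(values):
    """Replace each x by z = (x - mu) / sigma."""
    mu = mean(values)
    sigma = pstdev(values)       # population standard deviation
    return [(x - mu) / sigma for x in values]

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with mean 5 and standard deviation 2
z = standardize(data)
print(z)                         # the first value, 2, becomes (2 - 5) / 2 = -1.5
```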
Characteristics of a Poisson Distribution
Poisson Distribution
Consult the Wikipedia link above to seek out basic facts about the Poisson
distribution. It is a discrete probability distribution that “expresses the probability of
a given number of events occurring in a fixed interval of time and/or space if these
events occur with a known average rate and independently of the time since the last
event.”
• A Poisson distribution is a discrete distribution that takes on values that are 0
or a positive integer.
• The mean and variance of a Poisson distribution are always the same.
• When the mean/variance is small, the Poisson distribution is skewed with a
long tail to the right.
• When the mean/variance is large, the Poisson distribution closely resembles
the normal distribution with the same mean and variance.
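The claim that the mean and variance coincide can be checked numerically. This sketch uses the standard Poisson probability formula, P(k) = λ^k e^(-λ) / k!, with a made-up rate λ = 2.5. The sums stop at k = 59, an arbitrary cutoff beyond which the probabilities are negligible for this rate.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return lam ** k * exp(-lam) / factorial(k)

lam = 2.5                        # made-up average rate
ks = range(60)                   # the tail beyond this is negligible for lam = 2.5
probs = [poisson_pmf(k, lam) for k in ks]

mean = sum(k * p for k, p in zip(ks, probs))
variance = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))
print(round(mean, 6), round(variance, 6))   # both equal the rate
```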
Characteristics of a Binomial Distribution
Binomial Distribution
Consult the Wikipedia link above to fish for basic facts about the
binomial distribution. It is a discrete probability distribution of the
number of successes in a sequence of n independent yes/no
experiments, each of which yields success with probability p.
• A binomial distribution B(n,p) is a discrete distribution that takes on values
that are 0 or a positive integer no greater than n.
• The mean is np and variance is np(1-p).
• A binomial distribution is not symmetric, except when p = ½.
• When n is large and both np and n(1-p) are not too small, the binomial
distribution closely resembles the normal distribution with the same mean
and variance.
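The facts above can be checked with a short sketch based on the standard binomial probability formula, P(k) = C(n, k) p^k (1-p)^(n-k). The function name and the choice n = 10, p = 0.3 are made up for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.3                   # made-up parameters
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * q for k, q in enumerate(probs))
variance = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
print(round(mean, 6), round(variance, 6))   # np = 3 and np(1-p) = 2.1

# With p = 1/2 the distribution is symmetric: P(k) equals P(n - k).
symmetric = all(
    abs(binomial_pmf(k, n, 0.5) - binomial_pmf(n - k, n, 0.5)) < 1e-12
    for k in range(n + 1)
)
print(symmetric)
```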