C459- Intro to Probability & Statistics, Western Governors University
The *explanatory* (or independent) variable is denoted by
*X*
The *response* (or dependent) variable is denoted by
*Y*
Note that in most studies involving gender, it will be the
*explanatory* variable. We are not trying to see how a person's smoking habits affect the person's gender, but rather explore whether gender is one of the factors that can explain whether the person is a smoker or not.
Case C→C
---------------
Case C→Q
---------------
Case Q→Q
---------------
PART 1 *Examining Distributions*
---------------
PART 2 *Examining Relationships*
---------------
*Select data display and numerical summary approach for Case C-C, C-Q, Q-C, or Q-Q*
1. Case C-C: Two-way table or double bar chart using conditional percents. 2. Case C-Q: Boxplots and five-number summary. 3. Case Q-C: Not covered. 4. Case Q-Q: Scatterplot (explanatory on x, response on y) or labeled scatterplot.
Role-Type Classification Table
1. Categorical explanatory and quantitative response 2. Categorical explanatory and categorical response 3. Quantitative explanatory and quantitative response 4. Quantitative explanatory and categorical response
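The role-type classification and the display choices listed above can be sketched as a small lookup. This is only an illustration: the name `classify_case` and the dictionary layout are assumptions, not part of the course material.

```python
# Map (explanatory type, response type) to the case label and the display
# suggested in these notes. "C" = categorical, "Q" = quantitative.
CASES = {
    ("C", "C"): ("C→C", "two-way table or double bar chart with conditional percents"),
    ("C", "Q"): ("C→Q", "boxplots and five-number summary"),
    ("Q", "C"): ("Q→C", "not covered"),
    ("Q", "Q"): ("Q→Q", "scatterplot (explanatory on x, response on y)"),
}

def classify_case(explanatory_type, response_type):
    """Return (case label, suggested display) for the two variable types."""
    return CASES[(explanatory_type, response_type)]

# Handedness (categorical) explaining longevity (quantitative):
label, display = classify_case("C", "Q")
print(label, "-", display)
```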
Statistics is a process where we are:
1. Collecting data 2. Summarizing data 3. Interpreting data
The stemplot is a simple but useful visual display of quantitative data. Its principal virtues are:
1. Easy and quick to construct for small, simple datasets. 2. Retains the actual data. 3. Sorts (ranks) the data.
The distribution of a categorical variable is summarized using:
1. Graphical display: pie chart or bar chart, supplemented by 2. Numerical summaries: category counts and percentages.
The three main numerical measures for the center of a distribution are the
1. Mode 2. Mean 3. Median
What are the 4 steps of the big picture?
1. Producing Data- Choosing a sample from the population of interest and collecting data. 2. Exploratory Data Analysis (EDA)- Summarizing the data we've collected. 3. and 4. Probability and Inference—Drawing conclusions about the entire population based on the data collected from the sample.
The three most commonly used measures of spread are:
1. Range 2. Inter-quartile range (IQR) 3. Standard deviation. Like the different measures of center, these measures provide different ways to quantify the variability of the distribution.
When describing the shape of a distribution, we should consider:
1. Symmetry/skewness of the distribution. 2. Peakedness (modality)—the number of peaks (modes) the distribution has.
What are two simple graphical displays for visualizing the distribution of *categorical* data?
1. The Pie Chart 2. The Bar Chart
Most studies involve two variables, each of the variables has a role. We distinguish between:
1. the explanatory variable (also commonly referred to as the independent variable): *the variable that claims to explain, predict or affect the response*; and 2. the response variable (also commonly referred to as the dependent variable): *the outcome of the study*.
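In a normal distribution, the share of observations within one standard deviation of the mean is approximately
68%
In a normal distribution, the share of observations within two standard deviations of the mean is approximately
95%
In a normal distribution, the share of observations within three standard deviations of the mean is approximately
99.7%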
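As a rough check of these percentages, the sketch below computes the share of a normal distribution lying within k standard deviations of the mean, together with the matching interval. The helper names and the example values (mean 100, standard deviation 15) are assumptions chosen for illustration.

```python
import math

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

def interval_within(mean, sd, k):
    """Interval from mean - k*sd to mean + k*sd."""
    return (mean - k * sd, mean + k * sd)

for k in (1, 2, 3):
    low, high = interval_within(100, 15, k)  # hypothetical mean and sd
    print(f"within {k} sd: {low} to {high}, about {share_within(k):.1%}")
```

The rule applies only when the distribution is normal or approximately normal.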
Scatterplot
A display of data that has two quantitative variables
Stemplot
Another graphical display of the distribution of quantitative data.
Outliers
Data points that deviate from the pattern.
In a histogram, the height of each bar (the number of observations that fall in that interval) is labeled as the
Frequency
We want to explore whether the outcome of the study—the score on a test—is affected by the test-taker's gender. Therefore:
Gender is the *explanatory* -and- Test score is the *response*
The formula to find the IQR is
IQR= Q3 - Q1
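A minimal sketch of the IQR calculation follows. It assumes one common textbook convention for quartiles: Q1 is the median of the lower half of the ordered data and Q3 is the median of the upper half, with the overall median left out when n is odd. Other conventions give slightly different quartiles, and the function names are illustrative.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q3) using the median-of-halves convention (an assumption).

    Needs at least two observations.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    lower = ordered[:half]    # observations below the median position
    upper = ordered[-half:]   # observations above the median position
    return median(lower), median(upper)

def iqr(data):
    """IQR = Q3 - Q1."""
    q1, q3 = quartiles(data)
    return q3 - q1

hours = [1, 6, 7, 5, 5, 8, 11, 12, 15]
print(quartiles(hours), iqr(hours))
```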
Skewed Distributions can also be bimodal. A medium-sized neighborhood 24-hour convenience store collected data from 537 customers on the amount of money spent in a single visit to the store. The following histogram displays the data.
Note that the overall shape of the distribution is skewed to the right with a clear mode around $25. In addition it has another (smaller) "peak" (mode) around $50-55. The majority of the customers spend around $25 but there is a cluster of customers who enter the store and spend around $50-55.
Remember that the *median* is the value of the middle exam score, not the middle of the range.
Only 10 students out of 40 could have scores lower than 60. So 60 is too small to be the median. The correct answer is 68.
To find the median:
Order the data from smallest to largest. #1- Consider whether *n*, the number of observations, is an even or odd number. #2- If *n* is an odd number, the median *M* is the center observation in the ordered list. This observation is the one "sitting" in the (n + 1) / 2 spot in the ordered list. #3- If *n* is an even number, the median *M* is the mean of the two center observations in the ordered list. These two observations are the ones "sitting" in the n / 2 and n / 2 + 1 spots in the ordered list.
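The steps above can be sketched directly in code. The function name `median_by_position` is an assumption; the logic follows the (n + 1) / 2 and n / 2, n / 2 + 1 positions described in the steps, shifted by one because Python lists count from 0.

```python
def median_by_position(data):
    """Median found by ordering the data and picking the center spot(s)."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 1:
        # Odd n: the observation in spot (n + 1) / 2 of the ordered list.
        return ordered[(n + 1) // 2 - 1]
    # Even n: mean of the observations in spots n / 2 and n / 2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(median_by_position([1, 6, 7, 5, 5, 8, 11, 12, 15]))  # odd n  -> 7
print(median_by_position([64, 65, 66, 68, 70, 71]))        # even n -> 67.0
```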
The entire group that is the target of our interest is referred to as the
Population
The 1.5(IQR) Criterion for Outliers: an observation is considered a suspected outlier if it is below
Q1 - 1.5(IQR)
The 1.5(IQR) Criterion for Outliers: an observation is considered a suspected outlier if it is above
Q3 + 1.5(IQR)
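The two cutoffs can be computed together. The sketch below is illustrative (the function names are assumptions) and uses Q1 = 130 and Q3 = 150 from the sports-ticket survey discussed later in these notes.

```python
def outlier_fences(q1, q3):
    """Return (lower cutoff, upper cutoff) under the 1.5(IQR) criterion."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_suspected_outlier(x, q1, q3):
    """True if x falls below the lower cutoff or above the upper cutoff."""
    low, high = outlier_fences(q1, q3)
    return x < low or x > high

print(outlier_fences(130, 150))            # (100.0, 180.0)
print(is_suspected_outlier(85, 130, 150))  # True: 85 is below 100
print(is_suspected_outlier(145, 130, 150)) # False
```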
Histograms are used as a graphical display of what?
Quantitative Data
When the distribution is either skewed to the left or right, and we also see a suspected outlier on either side,
Remember that both of these shape features will tend to pull the *mean* more than they would pull the median.
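Standard Deviation Rule in reverse: finding the percentage of observations in a given range, instead of finding the range for a given percentage.
Starting from the mean, subtract (or add) the standard deviation repeatedly until you reach the value you need; the number of steps tells you which percentage (68%, 95%, or 99.7%) applies. Work the problem backwards like the attached example.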
Type 2 ~Histogram Symmetric Distributions~ Symmetric, Double-peaked (Bimodal) Distribution
Symmetric, Double-peaked (Bimodal) Distribution
Type 1 ~Histogram Symmetric Distributions~ Symmetric, Single-peaked (Unimodal) Distribution
Symmetric, Single-peaked (Unimodal) Distribution
Type 3 ~Histogram Symmetric Distributions~ Symmetric, Uniform Distribution
Symmetric, Uniform Distribution
Midpoint
The center of the distribution. The value that divides the distribution so that approximately half the observations take smaller values, and approximately half the observations take larger values (rough estimates of course.)
Dotplot
The dotplot, like the stemplot, shows each observation, but displays it with a dot rather than with its actual value. Here is the dotplot for the ages of Best Actress Oscar winners.
Separate each data point into a stem and leaf, as follows:
The leaf is the right-most digit. The stem is everything except the right-most digit. So, if the data point is 34, then 3 is the stem and 4 is the leaf. If the data point is 3.41, then 3.4 is the stem and 1 is the leaf.
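A minimal sketch of this split, assuming each data point is handled as text so that decimals such as 3.41 keep their digits exactly. The function names and the sample ages are hypothetical.

```python
from collections import defaultdict

def split_stem_leaf(value):
    """Split a data point into (stem, leaf): leaf is the right-most digit."""
    text = str(value)
    return text[:-1], text[-1]

def stemplot(values):
    """Group leaves by stem, with the data sorted first."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = split_stem_leaf(v)
        rows[stem].append(leaf)
    return dict(rows)

print(split_stem_leaf("34"))    # ('3', '4')
print(split_stem_leaf("3.41"))  # ('3.4', '1')

ages = [34, 21, 35, 41, 29, 33]  # hypothetical two-digit data
for stem, leaves in stemplot(ages).items():
    print(stem, "|", " ".join(leaves))
```

Because the plot is built from the sorted data, it both retains the actual values and ranks them, which matches the virtues of the stemplot listed earlier.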
Here are the number of hours that 9 students spend on the computer on a typical day: 1 6 7 5 5 8 11 12 15 The mean number of hours spent on the computer is:
The mean involves the sum of all the observations (1 + 6 + 7 + 5 + 5 + 8 + 11 + 12 + 15 = 70) divided by the total number of observations (9), which equals 7.78
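The same arithmetic can be checked in a few lines; the variable names are illustrative.

```python
hours = [1, 6, 7, 5, 5, 8, 11, 12, 15]

total = sum(hours)   # sum of all the observations
n = len(hours)       # number of observations
mean = total / n

print(total, n, round(mean, 2))  # 70 9 7.78
```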
The Current Population Survey conducted by the Census Bureau records the incomes of a large sample of U.S. households each month. What will be the relationship between the mean and median of the collected data?
The mean will be bigger than the median because the distribution of incomes is skewed right.
The SAT Math scores of 1,000 future engineers and physicists are recorded. What will be the relationship between the mean and median of the collected data?
The mean will be smaller than the median. Since the SAT Math scores for these students will be mostly high scores, the distribution will be skewed to the left. Thus, the few low scores (outliers) will make the mean smaller than the median.
Median
The median M is the midpoint of the distribution. It is the number such that half of the observations fall above, and half fall below.
Intervals [()]
The square bracket means "including" and the parenthesis means "not including". For example, [50,60) is the interval from 50 to 60, including 50 and not including 60; [60,70) is the interval from 60 to 70, including 60, and not including 70, etc. It really does not matter how you decide to set up your intervals, as long as you're consistent.
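As an illustration of the "including / not including" convention, the sketch below assigns each score to an interval of width 10 that includes its left endpoint and excludes its right endpoint. The function name, the width, and the scores are assumptions made up for the example.

```python
from collections import Counter

def interval_for(x, width=10):
    """Return (start, end) of the interval [start, end) that contains x."""
    start = (x // width) * width
    return (start, start + width)

scores = [50, 59, 60, 61, 69, 70]  # hypothetical exam scores
counts = Counter(interval_for(s) for s in scores)

for (start, end), count in sorted(counts.items()):
    print(f"[{start},{end}): {count}")
```

A score of exactly 60 lands in [60,70), not in [50,60), so no observation is counted twice.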
Recall that when we described the distribution of a single quantitative variable with a histogram, we described the overall pattern of the distribution (shape, center, spread) and any deviations from that pattern (outliers).
We do the same thing with the scatterplot. The following figure summarizes this point:
The *mean* is more sensitive than the median to the tail of a distribution, and also to outliers.
We see that this distribution is skewed to the right (the tail is towards the right) and we also see a suspected outlier on the right side. Both of these shape features will tend to pull the *mean* more than they would pull the median. Therefore, in this case: mean > median.
A survey taken of 140 sports fans asked the question: "What is the most you have ever spent for a ticket to a sporting event?" The five-number summary for the data collected is: min = 85, Q1 = 130, Median = 145, Q3 = 150, Max = 250 Should the smallest observation be classified as an outlier?
Yes. The IQR is 150 - 130 = 20. Using the 1.5(IQR) criterion we get 130 - 1.5(20) = 100. Since the smallest observation of 85 is smaller than 100, it should be considered an outlier.
Variable
a particular characteristic of the individual.
Individual
a particular person or object.
Dataset
a set of data identified with particular circumstances. Datasets are typically displayed in tables, in which rows represent individuals and columns represent variables.
Pictogram
a variation on the pie chart and bar chart that is very commonly used in the media. Pictograms can be misleading, so make sure to use a critical approach when interpreting the information the pictogram is trying to convey.
The mean describes the center as an average value, in which the
actual values of the data points play an important role.
A negative (or decreasing) relationship means that
an increase in one of the variables is associated with a decrease in the other.
A positive (or increasing) relationship means that
an increase in one of the variables is associated with an increase in the other.
Two-way table
an informative display that summarizes the data between two *categorical* variables.
The spread (also called variability) of the distribution can be described by the
approximate range covered by the data. Approximate min: 45 (the middle of the lowest interval of scores). Approximate max: 95 (the middle of the highest interval of scores). Approximate range: *95 - 45 = 50* (the spread).
Population
can refer not only to people, but also to animals, things etc.
A study was conducted in order to determine whether longevity (how long a person lives) is related to a person's handedness (right-handed/left-handed). This study is an example of:
Case C→Q. In this case the explanatory variable (handedness) is categorical and the response variable (longevity) is quantitative.
We will get a sense of the overall pattern of the data from the histogram's
center, spread, and shape.
Conditional Percents
converting the counts in a two-way table to percents.
The histogram's outliers will highlight
deviations from that pattern.
Another way to visualize the conditional percents, instead of a table, is the
double bar chart. This display is quite common in newspapers.
The form of the relationship is its
general shape.
To display data from *one* *quantitative* variable graphically, we can use either the
histogram or the stemplot.
To create a scatterplot, each pair of values is plotted, so that the value of the explanatory variable (X) is plotted on the
horizontal axis, and the value of the response variable (Y) is plotted on the vertical axis.
Distribution
how often each of the categories occurs.
The mean is very sensitive to outliers (because it factors in their magnitude), while the median
is resistant to outliers. Data set A → 64 65 66 68 70 71 *73* Data set B → 64 65 66 68 70 71 *730* - For symmetric distributions with no outliers: x̄ is approximately equal to M. - For skewed right distributions and/or datasets with high outliers: x̄ > M. - For skewed left distributions and/or datasets with low outliers: x̄ < M.
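A quick check with the two data sets shows the effect: changing 73 to 730 moves the mean a long way but leaves the median where it was. The variable names are illustrative.

```python
from statistics import mean, median

set_a = [64, 65, 66, 68, 70, 71, 73]
set_b = [64, 65, 66, 68, 70, 71, 730]  # same data with one high outlier

print(round(mean(set_a), 2), median(set_a))  # 68.14 68
print(mean(set_b), median(set_b))            # mean jumps to 162, median stays 68
```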
Mean and standard deviation are used for a normal shaped/symmetrical distribution and
median and IQR are used for skewed (asymmetric) distributions.
The IQR should be used as a measure of spread of a distribution only when the
median is used as a measure of center.
It's true that the mean tends to be pulled by outliers, but remember that the *median* is the
middle value. Moving 30 to 3 wouldn't change the 20th or the 21st value in the data, so the median would be the same. "The mean will decrease, but the median won't change."
The three main numerical measures for the center of a distribution are the:
mode, mean (x̄), and the median (M). The mode is the most frequently occurring value. The mean is the average value. The median is the middle value.
If a distribution has more than two modes, we say that the distribution is
multimodal.
The Standard Deviation Rule applies to
normal or approximately normal (i.e. symmetric, bell-shaped) distributions *ONLY.*
The histogram is a graphical display of the distribution of a *SINGLE* *quantitative* variable. It plots the number (count) of
observations that fall in intervals of values.
Outliers
observations that fall outside the overall pattern and require further research before continuing the analysis. For example, the following histogram represents a distribution that has a high probable outlier:
The median, on the other hand, locates the middle value as the center, and the
order of the data is the key to finding it.
Data
pieces of information about individuals organized into variables.
Relationships with a curvilinear form are most simply described as
points dispersed around the same curved line.
Relationships with a linear form are most simply described as
points scattered about a line.
Not all relationships can be classified as either
positive or negative.
The direction of the relationship can be
positive, negative, or neither.
Standard Deviation
quantifies the spread of a distribution by measuring how far the observations are from their mean (x̄). The standard deviation gives the average (or typical) distance between a data point and the mean (x̄).
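The idea of a "typical distance from the mean" can be made concrete with a short sketch. It assumes the sample standard deviation, which divides the sum of squared deviations by n - 1; the notes do not state which divisor is intended, and the function name is illustrative.

```python
import math

def sample_sd(data):
    """Sample standard deviation: square root of the sum of squared
    deviations from the mean, divided by n - 1 (an assumed convention)."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = [(x - mean) ** 2 for x in data]
    return math.sqrt(sum(squared_deviations) / (n - 1))

hours = [1, 6, 7, 5, 5, 8, 11, 12, 15]
print(round(sample_sd(hours), 2))
```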
Not all distributions have a simple, recognizable
shape.
Categorical Variables
take *category* or *label* values, and place an individual into one of several groups (Race, Gender, Smoking 'yes or no').
Quantitative variables
take *numerical* values, and represent some kind of measurement (age, weight, height).
Measures of Center
tell us what a "typical value" of the distribution is.
Mean
the average of a set of observations (i.e., the sum of the observations divided by the number of observations).
Range (one measure of variability, or spread) =
the exact distance between the smallest data point (min) and the largest one (max). Range = Max - min.
Type 5 ~Histogram Skewed Distributions~ Skewed Left
the left tail (smaller values) is much longer than the right tail (larger values).
Mode
the most commonly occurring value in a distribution. For simple datasets where the frequency of each value is available or easily determined, the value that occurs with the highest frequency is the mode. Here are the number of hours that 9 students spend on the computer on a typical day: 1 6 7 5 5 8 11 12 15 The *mode* number of hours spent on the computer is *5*
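For a small dataset like this one, the mode can be found by counting how often each value occurs. The sketch below returns a list because a dataset may have more than one mode; the function name is an assumption.

```python
from collections import Counter

def modes(data):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(data)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

hours = [1, 6, 7, 5, 5, 8, 11, 12, 15]
print(modes(hours))  # [5]
```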
Type 4 ~Histogram Skewed Distributions~ Skewed Right
the right tail (larger values) is much longer than the left tail (smaller values).
While the range quantifies the variability by looking at the range covered by ALL the data, the IQR measures
the variability of a distribution by giving us the range covered by the MIDDLE 50% of the data.
Column Percents
when the *explanatory variable is in columns* and the response variable is in rows.
Row Percents
when the *explanatory variable is in rows* and the response variable is in columns.
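To show how counts in a two-way table turn into conditional percents, here is a minimal sketch of row percents. The counts are hypothetical, the explanatory variable (gender) is placed in the rows, and the function name is an assumption.

```python
# Hypothetical counts: rows hold the explanatory variable (gender),
# columns hold the response (smoker or not).
table = {
    "female": {"smoker": 20, "non-smoker": 60},
    "male": {"smoker": 45, "non-smoker": 75},
}

def row_percents(two_way):
    """Convert each row's counts to percents of that row's total."""
    result = {}
    for row, counts in two_way.items():
        total = sum(counts.values())
        result[row] = {col: 100 * n / total for col, n in counts.items()}
    return result

for row, percents in row_percents(table).items():
    print(row, percents)
```

Each row of the result sums to 100%, so the smoking rates of the two groups can be compared directly even though the group sizes differ.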