WGU C459 - Introduction to Probability and Statistics
P(A) =
# outcomes in A / # outcomes in sample space (when all outcomes are equally likely)
Law of Large Numbers
*As the number of trials increases, the relative frequency approaches the actual probability. *As the number of trials increases, the empirical probability gets closer and closer to the theoretical probability.
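A minimal simulation sketch of the Law of Large Numbers, assuming a fair coin (theoretical probability 0.5). The function name `relative_frequency_of_heads` and the fixed seed are illustrative choices, not part of the course material.

```python
import random

def relative_frequency_of_heads(num_trials, seed=0):
    """Flip a simulated fair coin num_trials times; return the share of heads."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    heads = sum(1 for _ in range(num_trials) if rng.random() < 0.5)
    return heads / num_trials

# The empirical probability should drift toward 0.5 as the trial count grows.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

With few trials the relative frequency can be far from 0.5; with many trials it typically settles close to it.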
Relative Frequency definition of probability
*The ratio of the number of times an event occurs to the total number of trials *how often something happens divided by the number of trials *aka empirical probability
observational study
*attempt to understand cause-and-effect relationships by assessing the values of the variables as they naturally occur. *unlike experiments, the researcher is not able to control (1) how subjects are assigned to groups and/or (2) which treatments each group receives.
The distribution of a categorical variable is summarized using:
Graphical display supplemented by numerical summaries
For disjoint events, P(A and B) =
0
Probabilities range from:
0 to 1 ("never" to "certain")
P(not A) =
1 - P(A)
P(at least one) =
1 - P(none)
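A short worked sketch of the "at least one" shortcut, assuming independent trials that each succeed with the same probability (here, rolling a six on a fair die, four times). The helper name `p_at_least_one` is hypothetical.

```python
from fractions import Fraction

def p_at_least_one(p_single, num_trials):
    """P(at least one success) = 1 - P(none), assuming independent trials."""
    p_none = (1 - p_single) ** num_trials
    return 1 - p_none

# At least one six in four rolls of a fair die.
p = p_at_least_one(Fraction(1, 6), 4)
print(p, float(p))
```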
Probability of the Complement of an Event (The Complement Rule): P (not A) =
1-P(A)
The Standard Deviation Rule:
1. Approximately 68% of the observations fall within 1 standard deviation of the mean. 2. Approximately 95% of the observations fall within 2 standard deviations of the mean. 3. Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.
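The three percentages can be checked numerically, assuming the data follow a normal distribution (the setting in which the rule applies). This sketch uses the standard identity P(|Z| < k) = erf(k / sqrt(2)); the function name `share_within` is illustrative.

```python
from math import erf, sqrt

def share_within(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(100 * share_within(k), 1))
```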
Test for independence
1. Compare P(A) to P(A | B) → if equal, then independent; if not equal, then dependent or 2. Compare P(A and B) to P(A) * P(B) → if equal, then independent; if not equal, then dependent
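The second check can be sketched in code, assuming a single roll of a fair die as the sample space. The events and the helper names `prob` and `are_independent` are made up for illustration.

```python
from fractions import Fraction

SAMPLE_SPACE = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (assumed example)

def prob(event):
    """Classical probability: favorable outcomes over all outcomes."""
    return Fraction(len(event & SAMPLE_SPACE), len(SAMPLE_SPACE))

def are_independent(a, b):
    """Compare P(A and B) to P(A) * P(B)."""
    return prob(a & b) == prob(a) * prob(b)

even = {2, 4, 6}
at_most_four = {1, 2, 3, 4}
at_most_three = {1, 2, 3}

print(are_independent(even, at_most_four))   # 1/3 equals 1/2 * 2/3
print(are_independent(even, at_most_three))  # 1/6 differs from 1/2 * 1/2
```

Exact fractions are used so that the equality test is not affected by floating-point rounding.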
The relationship between a categorical explanatory variable and a quantitative response variable (C->Q) is summarized using:
1. Data display: side-by-side boxplots 2. Numerical summaries: descriptive statistics *Exploring the relationship between a categorical explanatory variable and a quantitative response variable amounts to comparing the distributions of the quantitative response for each category of the explanatory variable. *In particular, we look at how the distribution of the response variable differs between the values of the explanatory variable.
The relationship between two categorical variables (C->C) is summarized using:
1. Data display: two-way table, supplemented by 2. Numerical summaries: conditional percentages. Conditional percentages are calculated for each value of the explanatory variable separately. They can be row percents, if the explanatory variable "sits" in the rows, or column percents, if the explanatory variable "sits" in the columns. When we try to understand the relationship between two categorical variables, we compare the distributions of the response variable for values of the explanatory variable. In particular, we look at how the pattern of conditional percentages differs between the values of the explanatory variable.
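A minimal sketch of conditional (row) percentages, assuming the explanatory variable sits in the rows. The table counts and labels below are hypothetical, chosen only to show the arithmetic.

```python
def row_percents(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for row_label, counts in table.items():
        total = sum(counts.values())
        result[row_label] = {col: 100 * n / total for col, n in counts.items()}
    return result

# Hypothetical two-way table: rows = explanatory variable, columns = response.
table = {
    "group A": {"yes": 30, "no": 70},
    "group B": {"yes": 45, "no": 55},
}
percents = row_percents(table)
print(percents)
```

Comparing the "yes" percentage across rows is the comparison of conditional distributions described above.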
Distribution of a variable
1. What values the variable takes and 2. How often the variable takes those values
correlation coefficient (r)
1. a numerical measure of the strength and direction of a linear relationship between two quantitative variables 2. denoted by r 3. values -1 to 1 4. values closer to 1 indicate a strong positive relationship 5. values closer to -1 indicate a strong negative relationship 6. values closer to 0 indicate a weak linear relationship; 0 indicates no linear relationship
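One common way to compute r, sketched here under the assumption that sample standard deviations (dividing by n - 1) are used. The function name `correlation` and the data are illustrative.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation coefficient r for paired quantitative data."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * s_x * s_y)

print(correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly increasing line
print(correlation([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly decreasing line
```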
extrapolation
1. a prediction for ranges of the explanatory variable that are not in the data. 2. Since there is no way of knowing whether a relationship holds beyond the range of the explanatory variable in the data, extrapolation is not reliable, and should be avoided.
The probabilities of all possible outcomes in a sample space add up to:
1: P(sample space) = 1
"Before-and-after" studies
A common type of matched pairs design. For each individual, the response variable of interest is measured twice: first before the treatment, then again after the treatment. The categorical explanatory variable is which treatment was applied, or whether a treatment was applied, to that participant.
Multimodal distribution
A distribution with more than one mode
Range
A measure of spread. Range = Max - min. The distance between the smallest data point and the largest one.
Variable
A particular characteristic of the individual
Individual
A particular person or object
Dataset
A set of data identified with particular circumstances. Typically displayed in tables, in which rows represent individuals and columns represent variables.
least squares criterion
Among all the lines that look good on your data, the one that has the smallest sum of squared vertical deviations.
randomized response
An effective technique for collecting accurate data on sensitive questions. It allows individuals in the sample to answer anonymously, while the researcher still gains information about the population.
probability sampling plan (or technique)
Any sampling plan that relies on random selection
Numerical summary
Category counts and percentages
Standard deviation
Gives the average (or typical) distance between a data point and the mean. Should be paired as a measure of spread with the mean as a measure of center. Strongly influenced by outliers in the data. Use the mean and standard deviation as measures of center and spread only for reasonably symmetric distributions.
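A minimal sketch of the calculation, assuming the sample version of the standard deviation (dividing by n - 1, which is an assumption here since the card does not state the divisor). The function name and data are illustrative.

```python
from math import sqrt

def sample_standard_deviation(data):
    """Square root of the sum of squared deviations from the mean over n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sum(data) / n
    squared_deviations = sum((x - mean) ** 2 for x in data)
    return sqrt(squared_deviations / (n - 1))

sd = sample_standard_deviation([2, 4, 4, 4, 5, 5, 7, 9])
print(sd)
```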
experiment
Instead of assessing the values of the variables as they naturally occur, the researchers interfere, and they are the ones who assign the values of the explanatory variable to the individuals. The researchers "take control" of the values of the explanatory variable because they want to see how changes in the value of the explanatory variable affect the response variable. (Note: By nature, involves at least two variables.)
Weighted average
Instead of each data point contributing equally to the final mean, some data points contribute more "weight" than others. Formula: 1. Multiply the numbers in your data set by the weights. 2. Add the results up. 3. Divide by the sum of the weights (this step can be skipped when the weights already add up to 1). * easily influenced by outliers
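The weighted-average steps can be sketched as follows. The scores and weights are hypothetical, and the function name `weighted_average` is illustrative.

```python
def weighted_average(values, weights):
    """Multiply each value by its weight, add up, divide by the total weight."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

# Hypothetical course scores weighted 2 : 3 : 5.
result = weighted_average([80, 90, 70], [2, 3, 5])
print(result)
```

Dividing by the total weight means the weights do not have to add up to 1.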
the form of the relationship
Its general shape. When identifying the form, we try to find the simplest way to describe the shape of the scatterplot
Max.
Largest observation
Outliers
Observations that fall outside the overall pattern
Symmetric unimodal distribution
One mode around which the observations are concentrated
For dependent events; general formula
P(A and B) = P(A) * P(B | A)
For independent events, P(A and B) =
P(A) * P(B)
Probability of Independent events (The Multiplication Rule for Independent Events): P(A and B) =
P(A) * P(B)
For disjoint events, since P(A and B) = 0, P(A or B) =
P(A) + P(B)
If A and B are mutually exclusive events (The Addition Rule for Disjoint Events), then probability of A or B is: P (A or B) =
P(A) + P(B)
For non-disjoint events; general formula, P(A or B) =
P(A) + P(B) - P(A and B)
Probability of Compound Events (The General Addition Rule): If A and B are two events, then the probability of A or B is
P(A) + P(B) - P(A and B)
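A quick check of the General Addition Rule, assuming one card drawn at random from a standard 52-card deck. The events (hearts, face cards) and helper names are chosen for illustration.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = set(product(RANKS, SUITS))  # 52 (rank, suit) pairs

def prob(event):
    """Classical probability of an event, given as a set of cards."""
    return Fraction(len(event), len(DECK))

hearts = {card for card in DECK if card[1] == "hearts"}
faces = {card for card in DECK if card[0] in {"J", "Q", "K"}}

direct = prob(hearts | faces)                               # P(A or B) by counting
by_rule = prob(hearts) + prob(faces) - prob(hearts & faces) # P(A) + P(B) - P(A and B)
print(direct, by_rule)
```

Subtracting P(A and B) removes the three face cards of hearts, which would otherwise be counted twice.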
Graphical display
Pie chart or bar chart. Variation: pictogram (can be misleading).
Data
Pieces of information about individuals organized into variables
P(A | B)
Probability of the 2nd event (A) happening given that the 1st event (B) has happened. * P(2nd event happening | 1st event has happened) *alternate: P(possible event | known event)
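Rearranging the general multiplication rule gives P(A | B) = P(A and B) / P(B), provided P(B) is not 0. A minimal sketch, assuming one roll of a fair die with A = "roll is a 2" and B = "roll is even"; the function name is illustrative.

```python
from fractions import Fraction

def conditional_probability(p_a_and_b, p_b):
    """P(A | B) = P(A and B) / P(B); undefined when P(B) is 0."""
    if p_b == 0:
        raise ValueError("P(B) must be greater than 0")
    return p_a_and_b / p_b

# A = roll is a 2, B = roll is even: P(A and B) = 1/6, P(B) = 1/2.
p = conditional_probability(Fraction(1, 6), Fraction(1, 2))
print(p)
```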
Symmetric uniform distribution
Relatively flat, no modes or no values around which the observations are concentrated.
Min.
Smallest observation
Spread (aka variability)
The approximate range covered by the data.
Mean
The average. The sum of observations divided by the number of observations. Very sensitive to outliers; actual numbers play an important role.
Midpoint
The center of the distribution. The value that divides the distribution so that approximately half the observations take smaller values and approximately half take larger values.
treatments (common abbreviation: ttt)
The different imposed values of the explanatory variable
sampling
The first stage of the production of data. Choosing the individuals from the population that will be included in the sample.
treatment groups
The groups receiving different treatments
Skewed left distribution
The left tail (smaller values) is much longer than the right tail. The bulk of the observations are medium or large, with a few observations that are much smaller than the rest. Example: distribution of age at death from natural causes.
Median
The midpoint. 1. If there is an odd number of observations, it is the center observation in an ordered list. 2. If there is an even number, it is the mean of the two center observations. *Resistant to outliers. The order of data is the key.
Mode
The most commonly occurring value in a distribution
randomized controlled double-blind experiment
The most reliable way to determine whether the explanatory variable is actually causing changes in the response variable
the Hawthorne effect
The phenomenon whereby people in an experiment behave differently from how they would normally behave
Inter-Quartile Range (IQR)
The range of the middle 50% of the data. IQR = Q3 - Q1, where: 1. Q1 is the median of the lower half of the data (the observations from min up to M) and 2. Q3 is the median of the upper half of the data (the observations from M up to Max)
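A minimal sketch of the quartile and IQR calculation. It assumes the convention that the median itself is left out of both halves when the number of observations is odd; other conventions exist and give slightly different quartiles. Function names are illustrative, and the data need at least two observations.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, M, Q3) using medians of the lower and upper halves."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Skip the middle observation when n is odd (assumed convention).
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

def iqr(data):
    """Inter-quartile range: Q3 - Q1."""
    q1, _, q3 = quartiles(data)
    return q3 - q1

print(quartiles([1, 2, 3, 4, 5, 6, 7, 8, 9]), iqr([1, 2, 3, 4, 5, 6, 7, 8, 9]))
```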
study design
The second stage in the production of data. Collecting the data from the sample population
linear regression
The technique of finding the line that best fits the pattern of the linear relationship (or in other words, the line that best describes how the response variable linearly depends on the explanatory variable).
regression
The technique that specifies the dependence of the response variable on the explanatory variable
Symmetric bimodal distribution
Two modes around which the observations are concentrated
Probability
Used to quantify how much we expect random samples to vary.
Categorical variables
Variables that take category or label values, and place an individual into one of several groups
Quantitative variables
Variables that take numerical values, and represent some kind of measurement
Simpson's paradox
When including a lurking variable causes us to rethink the direction of an association
Like any other line, the equation of the least-squares regression line for summarizing the linear relationship between the response variable (Y) and the explanatory variable (X) has the form:
$Y = a + bX$, with intercept $a$ and slope $b$. Given $\bar{X}$ (the mean of the explanatory variable's values), $S_X$ (the standard deviation of the explanatory variable's values), $\bar{Y}$ (the mean of the response variable's values), $S_Y$ (the standard deviation of the response variable's values), and $r$ (the correlation coefficient), the slope and intercept of the least squares regression line are found using the following formulas: $b = r \, (S_Y / S_X)$ and $a = \bar{Y} - b\bar{X}$
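The slope and intercept formulas can be sketched in code, assuming sample standard deviations (dividing by n - 1) for both variables. The function name `least_squares_line` and the data are illustrative.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for Y = a + bX using b = r * (S_Y / S_X), a = Y_bar - b * X_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation coefficient r from the paired deviations.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * (s_y / s_x)
    a = y_bar - b * x_bar
    return a, b

# Points that lie exactly on Y = 1 + 2X.
a, b = least_squares_line([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```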
multistage sampling
a "complex form" of cluster sampling. When conducting cluster sampling, it might be unrealistic, or too expensive to sample all the individuals in the chosen clusters. In cases like this, it would make sense to have another stage of sampling, in which you choose a sample from each of the randomly selected clusters. Multistage sampling can have more than 2 stages.
Histogram
a bar graph that shows how frequently data occur within certain ranges or intervals. The height of each bar gives the frequency in the respective interval.
scatterplot
a graph made by plotting ordered pairs in a coordinate plane to show the relationship between two quantitative variables. 1. the explanatory variable should always be plotted on the horizontal X-axis, and 2. the response variable should be plotted on the vertical Y-axis.
Stemplot
a method of organizing numerical data in order of place value. The 'ones digit' and the 'tens digit and greater' of each data item are separated as leaves and stems, respectively.
sample survey
a particular type of observational study in which individuals report variables' values themselves, frequently by giving their opinions
lurking variable
a variable that is not among the explanatory or response variables in a study, but could substantially affect your interpretation of the relationship among those variables.
response variable
aka the dependent variable; the outcome of the study; denoted by Y
explanatory variable
aka the independent variable; the variable that claims to explain, predict or affect the response; denoted by X
confounding variable
an "extra" variable that you didn't account for.
randomized controlled experiment
an experiment in which researchers control values of the explanatory variable with a randomization procedure
random experiment
an experiment that produces an outcome that cannot be predicted in advance; involves uncertainty
A negative (or decreasing) relationship
an increase in one of the variables is associated with a decrease in the other
A positive (or increasing) relationship
an increase in one of the variables is associated with an increase in the other
role-type classification
classify each of the two relevant variables according to type (categorical or quantitative) 1. Categorical explanatory and quantitative response 2. Categorical explanatory and categorical response 3. Quantitative explanatory and quantitative response 4. Quantitative explanatory and categorical response
blocking
dividing subjects into groups of individuals who are similar with respect to an outside variable that may be important in the relationship being studied.
Inference
drawing reliable conclusions about the population based on what we've discovered in our sample
Noncompliance
failure to submit to the assigned treatment
subjects
human participants in an experiment
factor
in an experiment, the explanatory variable
sampling frame
list of potential individuals to be sampled
slope of a straight line; linear equation
$m = \frac{y_1 - y_2}{x_1 - x_2}$; $y = mx + b$, where $m$ is the slope and $b$ is the y-intercept
Five Number Summary
min, Q1, M, Q3, Max. Provides a quick numerical description of both the center and spread of a distribution.
Independent events:
one event's occurrence does not affect the probability the other event will occur
Cluster Sampling
a sampling technique used when our population is naturally divided into groups (which we call clusters). Take a random sample of clusters, and use all the individuals within the selected clusters as the sample.
Boxplot
shows the distribution of a set of data along a number line, dividing the data into four parts using the median and quartiles.
matched pairs
study design that compares responses for the same individual under two explanatory values, or for two individuals who are as similar as possible except that the first gets one treatment, and the second gets another (or serves as the control). Enables us to pinpoint the effects of the explanatory variable.
retrospective observational study
the values of the variables of interest are recorded backward in time
prospective observational study
the values of the variables of interest are recorded forward in time
control group
those individuals on whom no specific treatment was imposed
Disjoint events:
two events that cannot occur at the same time
Lack of realism (aka lack of ecological validity)
unrealistic setting
Stratified Sampling
used when our population is naturally divided into sub-populations, each of which we call a stratum (plural: strata). Choose a simple random sample from each stratum; the sample consists of all these simple random samples put together.
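A minimal sketch of stratified sampling, assuming the same sample size is drawn from every stratum (the card does not specify how sizes are chosen). The strata, ID numbers, function name, and seed are all hypothetical.

```python
import random

def stratified_sample(strata, size_per_stratum, seed=0):
    """Draw a simple random sample from each stratum and pool the results."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable output
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, size_per_stratum))
    return sample

# Hypothetical population: ID numbers grouped by class year.
strata = {
    "freshmen": list(range(0, 50)),
    "seniors": list(range(100, 130)),
}
sample = stratified_sample(strata, 5)
print(sample)
```

Unlike cluster sampling, every stratum contributes individuals to the final sample.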