WGU C459 - Introduction to Probability and Statistics

P(A) =

# outcomes in A / # outcomes in sample space
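
As a small illustration of this counting rule, the sketch below enumerates a sample space and counts the outcomes in an event. It assumes every outcome is equally likely (two fair six-sided dice); the event "the faces sum to 7" is chosen only as an example.

```python
from fractions import Fraction
from itertools import product

# Sample space: every ordered pair of faces from two fair six-sided dice.
sample_space = list(product(range(1, 7), repeat=2))

# Event A (illustrative choice): the two faces sum to 7.
event_a = [roll for roll in sample_space if sum(roll) == 7]

# P(A) = (# outcomes in A) / (# outcomes in sample space)
p_a = Fraction(len(event_a), len(sample_space))
print(p_a)  # 1/6
```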

Law of Large Numbers

*As the number of trials increases, the relative frequency approaches the actual probability. *As the number of trials increases, the empirical probability gets closer and closer to the theoretical probability.
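
A quick simulation can make this concrete. The sketch below flips a simulated fair coin and reports the relative frequency of heads for increasing numbers of trials; the function name `relative_frequency` and the fixed seed are illustrative assumptions, not part of the course material.

```python
import random

def relative_frequency(num_flips, seed=0):
    """Fraction of heads in num_flips simulated fair-coin flips."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(num_flips))
    return heads / num_flips

# The theoretical probability of heads is 0.5.
for n in (10, 1_000, 100_000):
    print(n, relative_frequency(n))
```

With more flips, the printed relative frequency should tend to sit closer to 0.5.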

Relative Frequency definition of probability

*The ratio of the number of times an event occurs to the total number of trials *how often something happens divided by the number of trials *empirical probability

observational study

*attempt to understand cause-and-effect relationships by assessing the values of the variables as they naturally occur. *unlike experiments, the researcher is not able to control (1) how subjects are assigned to groups and/or (2) which treatments each group receives.

The distribution of a categorical variable is summarized using:

Graphical display supplemented by numerical summaries

For disjoint events, P(A and B) =

0

Probabilities range from:

0 to 1 ("never" to "certain")

P(not A) =

1 - P(A)

P(at least one) =

1 - P(none)
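
A worked example of this rule, assuming four independent rolls of a fair die and asking for the probability of at least one six (the scenario is illustrative only):

```python
from fractions import Fraction

# Probability that a single fair-die roll is NOT a six.
p_no_six_one_roll = Fraction(5, 6)

# Rolls are independent, so P(no six in 4 rolls) multiplies.
p_none = p_no_six_one_roll ** 4

# P(at least one six) = 1 - P(none)
p_at_least_one = 1 - p_none
print(p_at_least_one)  # 671/1296
```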

Probability of the Complement of an Event (The Complement Rule): P (not A) =

1-P(A)

The Standard Deviation Rule:

1. Approximately 68% of the observations fall within 1 standard deviation of the mean. 2. Approximately 95% of the observations fall within 2 standard deviations of the mean. 3. Approximately 99.7% (or virtually all) of the observations fall within 3 standard deviations of the mean.
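
The three percentages can be checked numerically for a normal (bell-shaped) distribution, which is the shape this rule assumes. The sketch below uses Python's standard-library `NormalDist`; the helper name `share_within` is an illustrative choice.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```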

Test for independence

1. Compare P(A) to P(A | B) → if equal, then independent; if not equal, then dependent or 2. Compare P(A and B) to P(A) * P(B) → if equal, then independent; if not equal, then dependent
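
Both checks can be sketched with a standard 52-card deck, taking A = "card is a heart" and B = "card is a face card" as illustrative events. The helper `is_independent` is a hypothetical name, and P(A | B) is computed here as P(A and B) / P(B).

```python
from fractions import Fraction

def is_independent(p_a, p_b, p_a_and_b):
    """Check 2: compare P(A and B) to P(A) * P(B)."""
    return p_a_and_b == p_a * p_b

p_heart = Fraction(13, 52)          # P(A)
p_face = Fraction(12, 52)           # P(B)
p_heart_and_face = Fraction(3, 52)  # J, Q, K of hearts

# Check 1: compare P(A) to P(A | B) = P(A and B) / P(B).
p_heart_given_face = p_heart_and_face / p_face
print(p_heart_given_face == p_heart)                      # True
print(is_independent(p_heart, p_face, p_heart_and_face))  # True
```

Exact fractions are used so the equality comparison is not disturbed by floating-point rounding.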

The relationship between a categorical explanatory variable and a quantitative response variable (C->Q) is summarized using:

1. Data display: side-by-side boxplots 2. Numerical summaries: descriptive statistics *Exploring the relationship between a categorical explanatory variable and a quantitative response variable amounts to comparing the distributions of the quantitative response for each category of the explanatory variable. *In particular, we look at how the distribution of the response variable differs between the values of the explanatory variable.

The relationship between two categorical variables (C->C) is summarized using:

1. Data display: two-way table, supplemented by 2. Numerical summaries: conditional percentages. Conditional percentages are calculated for each value of the explanatory variable separately. They can be row percents, if the explanatory variable "sits" in the rows, or column percents, if the explanatory variable "sits" in the columns. When we try to understand the relationship between two categorical variables, we compare the distributions of the response variable for values of the explanatory variable. In particular, we look at how the pattern of conditional percentages differs between the values of the explanatory variable.
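
The row-percent calculation can be sketched as follows. The counts are made-up illustration data, with the explanatory variable sitting in the rows; `row_percents` is a hypothetical helper name.

```python
# Hypothetical two-way table: rows are values of the explanatory variable,
# columns are values of the response variable. Counts are invented.
table = {
    "group_1": {"yes": 30, "no": 20},
    "group_2": {"yes": 20, "no": 30},
}

def row_percents(two_way_table):
    """Conditional percentages of the response within each explanatory value."""
    result = {}
    for group, counts in two_way_table.items():
        row_total = sum(counts.values())
        result[group] = {k: 100 * v / row_total for k, v in counts.items()}
    return result

percents = row_percents(table)
print(percents["group_1"])  # {'yes': 60.0, 'no': 40.0}
print(percents["group_2"])  # {'yes': 40.0, 'no': 60.0}
```

Comparing the two printed rows is the comparison of conditional distributions described above.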

Distribution of a variable

1. What values the variable takes and 2. How often the variable takes those values

correlation coefficient (r)

1. a numerical measure of the strength and direction of a linear relationship between two quantitative variables 2. denoted by r 3. values range from -1 to 1 4. values closer to 1 indicate a strong positive relationship 5. values closer to -1 indicate a strong negative relationship 6. values closer to 0 indicate a weak linear relationship; 0 indicates no linear relationship
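
A minimal sketch of computing r from paired data, using the standard sum-of-products form of the formula (the function name `correlation` and the sample data are illustrative assumptions):

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired quantitative data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Perfect increasing line: r is 1 (up to rounding).
r_line = correlation([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])

# U-shaped data: a clear relationship, but not a linear one, so r is 0.
r_curve = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(r_line, r_curve)
```

The second data set shows why r near 0 only rules out a linear pattern, not every kind of relationship.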

extrapolation

1. a prediction for ranges of the explanatory variable that are not in the data. 2. Since there is no way of knowing whether a relationship holds beyond the range of the explanatory variable in the data, extrapolation is not reliable, and should be avoided.

The probabilities of all possible outcomes in a sample space add up to:

1: P(sample space) = 1

"Before-and-after" studies

A common type of matched pairs design. For each individual, the response variable of interest is measured twice: first before the treatment, then again after the treatment. The categorical explanatory variable is which treatment was applied, or whether a treatment was applied, to that participant.

Multimodal distribution

A distribution with more than one mode

Range

A measure of spread. Range = Max - min. The distance between the smallest data point and the largest one.

Variable

A particular characteristic of the individual

Individual

A particular person or object

Dataset

A set of data identified with particular circumstances. Typically displayed in tables, in which rows represent individuals and columns represent variables.

least squares criterion

Among all the lines that could be drawn through the data, choose the one with the smallest sum of squared vertical deviations between the data points and the line.

randomized response

An effective technique for collecting accurate data on sensitive questions. It allows individuals in the sample to answer anonymously, while the researcher still gains information about the population.

probability sampling plan (or technique)

Any sampling plan that relies on random selection

Numerical summary

Category counts and percentages

Standard deviation

Gives the average (or typical) distance between a data point and the mean. Should be paired as a measure of spread with the mean as a measure of center. Strongly influenced by outliers in the data. Use the mean and standard deviation as measures of center and spread only for reasonably symmetric distributions.
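
The effect of an outlier on both the mean and the standard deviation can be seen in a short sketch. The data values are invented for illustration, and `statistics.stdev` computes the sample standard deviation (dividing by n - 1), which is an assumption about which version the course uses.

```python
from statistics import mean, stdev

# Invented, roughly symmetric data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# The same data with one extreme value added.
with_outlier = data + [50]

print(mean(data), stdev(data))
print(mean(with_outlier), stdev(with_outlier))
```

A single extreme value pulls the mean upward and inflates the standard deviation several times over.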

experiment

Instead of assessing the values of the variables as they naturally occur, the researchers interfere, and they are the ones who assign the values of the explanatory variable to the individuals. The researchers "take control" of the values of the explanatory variable because they want to see how changes in the value of the explanatory variable affect the response variable. (Note: By nature, involves at least two variables.)

Weighted average

Instead of each data point contributing equally to the final mean, some data points contribute more "weight" than others. Formula: 1. Multiply the numbers in your data set by the weights. 2. Add the products up. 3. Divide by the sum of the weights (this step changes nothing when the weights already add up to 1). * easily influenced by outliers
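
The calculation can be sketched as below: multiply each value by its weight, add the products, and divide by the total weight. The scores and weights are hypothetical, as is the function name `weighted_mean`.

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    total = sum(v * w for v, w in zip(values, weights))
    return total / sum(weights)

# Hypothetical course scores and their relative weights.
scores = [90, 80, 70]
weights = [5, 3, 2]
print(weighted_mean(scores, weights))  # 83.0
```

With equal weights, the same function returns the ordinary mean.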

the form of the relationship

Its general shape. When identifying the form, we try to find the simplest way to describe the shape of the scatterplot

Max.

Largest observation

Outliers

Observations that fall outside the overall pattern

Symmetric unimodal distribution

One mode around which the observations are concentrated

For dependent events; general formula

P(A and B) = P(A) * P(B | A)
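
A worked example of this rule, assuming two cards are drawn without replacement from a standard 52-card deck, with A = "first card is an ace" and B = "second card is an ace" (an illustrative scenario):

```python
from fractions import Fraction

p_first_ace = Fraction(4, 52)               # P(A)
p_second_ace_given_first = Fraction(3, 51)  # P(B | A): one ace is already gone

# P(A and B) = P(A) * P(B | A)
p_two_aces = p_first_ace * p_second_ace_given_first
print(p_two_aces)  # 1/221
```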

For independent events, P(A and B) =

P(A) * P(B)

Probability of Independent events (The Multiplication Rule for Independent Events): P(A and B) =

P(A) * P(B)

For disjoint events, since P(A and B) = 0, P(A or B) =

P(A) + P(B)

If A and B are mutually exclusive events (The Addition Rule for Disjoint Events), then probability of A or B is: P (A or B) =

P(A) + P(B)

For non-disjoint events; general formula, P(A or B) =

P(A) + P(B) - P(A and B)

Probability of Compound Events (The General Addition Rule): If A and B are two events, then the probability of A or B is

P(A) + P(B) - P(A and B)
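
The rule can be checked by direct counting. The sketch below builds a standard 52-card deck and takes A = "heart" and B = "face card" as illustrative, non-disjoint events; the helper `prob` is a hypothetical name.

```python
from fractions import Fraction
from itertools import product

ranks = ["A"] + [str(n) for n in range(2, 11)] + ["J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))  # 52 (rank, suit) pairs

def prob(event):
    """Probability of an event, counting equally likely cards."""
    return Fraction(sum(1 for card in deck if event(card)), len(deck))

def is_heart(card):
    return card[1] == "hearts"

def is_face(card):
    return card[0] in {"J", "Q", "K"}

# General addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_or = prob(is_heart) + prob(is_face) - prob(lambda c: is_heart(c) and is_face(c))
print(p_or)  # 11/26
```

Subtracting P(A and B) removes the three heart face cards that would otherwise be counted twice.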

Graphical display

Pie chart or bar chart. Variation: pictogram (can be misleading).

Data

Pieces of information about individuals organized into variables

P(A | B)

Probability of the 2nd event (A) happening given that the 1st event (B) has happened. * P(2nd event happening | 1st event has happened) *alternate: P(possible event | known event)
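
A small counting example, assuming one roll of a fair die with B = "roll is even" as the known event and A = "roll is greater than 3" as the event of interest. It uses the standard relation P(A | B) = P(A and B) / P(B), which for equally likely outcomes reduces to counting within B.

```python
from fractions import Fraction

outcomes = range(1, 7)                        # one fair six-sided die
b = {x for x in outcomes if x % 2 == 0}       # known event: roll is even
a = {x for x in outcomes if x > 3}            # event of interest: roll > 3

# Restrict attention to outcomes in B, then count those also in A.
p_a_given_b = Fraction(len(a & b), len(b))
p_a = Fraction(len(a), len(outcomes))
print(p_a_given_b, p_a)  # 2/3 1/2
```

Knowing that B happened changes the probability of A here, so these two events are dependent.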

Symmetric uniform distribution

Relatively flat, no modes or no values around which the observations are concentrated.

Min.

Smallest observation

Spread (aka variability)

The approximate range covered by the data.

Mean

The average. The sum of observations divided by the number of observations. Very sensitive to outliers; actual numbers play an important role.

Midpoint

The center of the distribution. The value that divides the distribution so that approximately half the observations take smaller values and approximately half take larger values.

treatments (common abbreviation: ttt)

The different imposed values of the explanatory variable

sampling

The first stage of the production of data. Choosing the individuals from the population that will be included in the sample.

treatment groups

The groups receiving different treatments

Skewed left distribution

The left tail (smaller values) is much longer than the right tail. The bulk of the observations are medium or large, with a few observations that are much smaller than the rest. Example: distribution of age at death from natural causes.

Median

The midpoint. 1. If there is an odd number of observations, it is the center observation in an ordered list. 2. If there is an even number, it is the mean of the two center observations. *Resistant to outliers. The order of data is the key.
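
The two cases can be sketched in a few lines. The function below is a hand-written illustration (Python's standard library also offers `statistics.median`); the sample data are invented.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 2]))           # 2
print(median([1, 2, 3, 4]))        # 2.5
print(median([1, 2, 3, 4, 100]))   # 3
```

In the last call the outlier 100 does not move the median, while the mean of the same data is 22.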

Mode

The most commonly occurring value in a distribution

randomized controlled double-blind experiment

The most reliable way to determine whether the explanatory variable is actually causing changes in the response variable

the Hawthorne effect

The phenomenon, whereby people in an experiment behave differently from how they would normally behave

Inter-Quartile Range (IQR)

The range of the middle 50% of the data. IQR = Q3 - Q1, where: 1. Q1 is the median of the lower half of the data (the values from min up to M) and 2. Q3 is the median of the upper half of the data (the values from M up to Max)
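
A sketch of the median-of-halves calculation follows. Conventions for quartiles differ between textbooks and software; this version assumes the overall median is left out of both halves when the number of observations is odd. The function names and data are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (min, Q1, M, Q3, Max) using medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: skip the middle observation when n is odd.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

low, q1, m, q3, high = five_number_summary([1, 3, 5, 7, 9, 11, 13])
iqr = q3 - q1
print((low, q1, m, q3, high), iqr)  # (1, 3, 7, 11, 13) 8
```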

study design

The second stage in the production of data. Collecting the data from the sample population

linear regression

The technique of finding the line that best fits the pattern of the linear relationship (or in other words, the line that best describes how the response variable linearly depends on the explanatory variable).

regression

The technique that specifies the dependence of the response variable on the explanatory variable

Symmetric bimodal distribution

Two modes around which the observations are concentrated

Probability

Used to quantify how much we expect random samples to vary.

Categorical variables

Variables that take category or label values, and place an individual into one of several groups

Quantitative variables

Variables that take numerical values, and represent some kind of measurement

Simpson's paradox

When including a lurking variable causes us to rethink the direction of an association

Like any other line, the equation of the least-squares regression line for summarizing the linear relationship between the response variable (Y) and the explanatory variable (X) has the form:

Y = a + bX, where a is the intercept and b is the slope. Given:
X̄: the mean of the explanatory variable's values
S_X: the standard deviation of the explanatory variable's values
Ȳ: the mean of the response variable's values
S_Y: the standard deviation of the response variable's values
r: the correlation coefficient
the slope and intercept of the least squares regression line are found using the following formulas:
b = r (S_Y / S_X)
a = Ȳ - b X̄
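
The slope and intercept formulas can be sketched directly from the five summary quantities. The function name `least_squares_line` and the data are illustrative assumptions; `statistics.stdev` is the sample standard deviation, and r is computed with the matching n - 1 divisor.

```python
from statistics import mean, stdev

def least_squares_line(xs, ys):
    """Return (a, b) for Y = a + bX using b = r * (S_Y / S_X), a = Y-bar - b * X-bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation coefficient from the sum of products of deviations.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

a, b = least_squares_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(a, 3), round(b, 3))  # 2.2 0.6
```

For this invented data set the fitted line is Y = 2.2 + 0.6X.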

multistage sampling

a "complex form" of cluster sampling. When conducting cluster sampling, it might be unrealistic, or too expensive to sample all the individuals in the chosen clusters. In cases like this, it would make sense to have another stage of sampling, in which you choose a sample from each of the randomly selected clusters. Multistage sampling can have more than 2 stages.

Histogram

a bar graph that shows how frequently data occur within certain ranges or intervals. The height of each bar gives the frequency in the respective interval.

scatterplot

a graph made by plotting ordered pairs in a coordinate plane to show the relationship between two quantitative variables. 1. the explanatory variable should always be plotted on the horizontal X-axis, and 2. the response variable should be plotted on the vertical Y-axis.

Stemplot

a method of organizing numerical data in order of place value. The 'ones digit' and the 'tens digit and greater' of each data item are separated as leaves and stems, respectively.

sample survey

a particular type of observational study in which individuals report variables' values themselves, frequently by giving their opinions

lurking variable

a variable that is not among the explanatory or response variables in a study, but could substantially affect your interpretation of the relationship among those variables.

response variable

aka the dependent variable; the outcome of the study; denoted by Y

explanatory variable

aka the independent variable; the variable that claims to explain, predict or affect the response; denoted by X

confounding variable

an "extra" variable that you didn't account for.

randomized controlled experiment

an experiment in which researchers control values of the explanatory variable with a randomization procedure

random experiment

an experiment that produces an outcome that cannot be predicted in advance; it involves uncertainty

A negative (or decreasing) relationship

an increase in one of the variables is associated with a decrease in the other

A positive (or increasing) relationship

an increase in one of the variables is associated with an increase in the other

role-type classification

classify each of the two relevant variables according to type (categorical or quantitative) 1. Categorical explanatory and quantitative response 2. Categorical explanatory and categorical response 3. Quantitative explanatory and quantitative response 4. Quantitative explanatory and categorical response

blocking

dividing subjects into groups of individuals who are similar with respect to an outside variable that may be important in the relationship being studied.

Inference

drawing reliable conclusions about the population based on what we've discovered in our sample

Noncompliance

failure to submit to the assigned treatment

subjects

human participants in an experiment

factor

in an experiment, the explanatory variable

sampling frame

list of potential individuals to be sampled

slope of a straight line linear equation

m = (y₁ - y₂) / (x₁ - x₂); y = mx + b, where m is the slope and b is the y-intercept

Five Number Summary

min, Q1, M, Q3, Max Provides a quick numerical description of both the center and spread of a distribution.

Independent events:

one event's occurrence does not affect the probability the other event will occur

Cluster Sampling

a sampling technique used when our population is naturally divided into groups (which we call clusters). Take a random sample of clusters, and use all the individuals within the selected clusters as the sample.
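
A minimal sketch of the procedure, assuming the population is stored as a dictionary from cluster name to its members. The cluster names, member labels, function name `cluster_sample`, and fixed seed are all hypothetical.

```python
import random

def cluster_sample(clusters, num_clusters, seed=0):
    """Randomly pick whole clusters and return every individual inside them."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    chosen = rng.sample(sorted(clusters), num_clusters)
    return [person for name in chosen for person in clusters[name]]

# Hypothetical population: four classrooms of three students each.
classrooms = {
    "room_a": ["a1", "a2", "a3"],
    "room_b": ["b1", "b2", "b3"],
    "room_c": ["c1", "c2", "c3"],
    "room_d": ["d1", "d2", "d3"],
}
sample = cluster_sample(classrooms, num_clusters=2)
print(sample)
```

The randomness is applied only at the cluster level; everyone in a chosen cluster ends up in the sample.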

Boxplot

shows the distribution of a set of data along a number line, dividing the data into four parts using the median and quartiles.

matched pairs

study design that compares responses for the same individual under two explanatory values, or for two individuals who are as similar as possible except that the first gets one treatment, and the second gets another (or serves as the control). Enables us to pinpoint the effects of the explanatory variable.

retrospective observational study

the values of the variables of interest are recorded backward in time

prospective observational study

the values of the variables of interest are recorded forward in time

control group

those individuals on whom no specific treatment was imposed

Disjoint events:

two events that cannot occur at the same time

Lack of realism (aka lack of ecological validity)

unrealistic setting

Stratified Sampling

used when our population is naturally divided into sub-populations, each of which we call a stratum (plural: strata). Choose a simple random sample from each stratum; the sample consists of all these simple random samples put together.
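
A minimal sketch of the procedure, assuming the population is stored as a dictionary from stratum name to its members and that the same number is drawn from every stratum. The strata, labels, function name `stratified_sample`, and fixed seed are hypothetical.

```python
import random

def stratified_sample(strata, per_stratum, seed=0):
    """Draw a simple random sample from every stratum and combine them."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    sample = []
    for name in sorted(strata):
        sample.extend(rng.sample(strata[name], per_stratum))
    return sample

# Hypothetical population split by class year.
students = {
    "freshman": ["f1", "f2", "f3", "f4"],
    "sophomore": ["s1", "s2", "s3", "s4"],
    "junior": ["j1", "j2", "j3", "j4"],
}
sample = stratified_sample(students, per_stratum=2)
print(sample)
```

Unlike cluster sampling, every stratum is represented, and the randomness is applied inside each one.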

