STATS 1430.02 MIDTERM
Rules for calculating probability
- Multiplication rule - Addition rule - Complement rule
Implementation biases
- Nonresponse - Response
T/F The standard deviation can never be negative.
TRUE
T/F The starting point can affect the way a graph looks.
TRUE
T/F You can have two data sets with the same mean but different standard deviations.
TRUE
2 major types of data collections:
- observational studies - experiments
Good surveys....
- select a good sample - design a survey that avoids bias - implement the survey to avoid bias - analyze data properly
Question wording
the way in which survey questions are phrased, which influences how respondents answer them
Residuals
observed y - predicted y (data minus line)
One way to find P(at least one)...
1 - P(none)
The probabilities in the 4 cells of a two-way table are...
"And" Joint probabilities
P(A) = P(A and B) + P(A and Bc) where Bc means...
"B complement"
Frequencies
# (y-axis) in each category - table, bar graph
Response Rate Formula
# of responses received / total # of surveys
Relative frequencies
% (y-axis) in each category - table, pie-charts
coefficient of determination
% of variability in y due to x R^2 or R-squared
Strength of linear relationships
+/-.7 - strong +/-.5 - moderate +/-.3 - weak +/-0 - no linear relationship
Suppose you make 10 telemarketer calls. What is a complement of "at most" 9 sales?
- "all 10" - "more than 9"
How else to know if A and B are independent?
- "random sample" is stated - experiment is replicated on each subject - stated in problem - never assume
Describing scatterplot relationship
- Pattern (relationship): linear/straight line - Direction: uphill (positive), downhill (negative) or 0 (no relationship) - Strength: how closely points follow a line
Elements of survey design
- Question wording - Type of survey - Timing
A well-designed experiment has what characteristics?
- Randomization of subjects to treatments - Comparison of the results from different groups - Controlled variables - Avoids bias - Has enough data
3 results that are true if and only if A and B are independent:
1. P(A | B) = P(A) - conditional = marginal - can switch A & B 2. P(A | B) = P(A | not B, or Bc) - two conditionals equal 3. P(A and B) = P(A)*P(B) >>> P(A)*P(B | A) simplifies to P(A)*P(B) when independent *if no to any of these, A and B are dependent
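The three checks can be tried on a small two-way table. This is a minimal Python sketch; the four joint probabilities are hypothetical numbers chosen so that A and B come out independent, and the variable names are illustrative.

```python
import math

# Hypothetical joint ("and") probabilities for a 2x2 two-way table.
p_a_and_b = 0.12
p_a_and_not_b = 0.18
p_not_a_and_b = 0.28
p_not_a_and_not_b = 0.42

# Marginals: add across the row or column.
p_a = p_a_and_b + p_a_and_not_b
p_b = p_a_and_b + p_not_a_and_b

# Conditionals: joint divided by the marginal of the given event.
p_a_given_b = p_a_and_b / p_b
p_a_given_not_b = p_a_and_not_b / (1 - p_b)

# The three equivalent checks; isclose guards against rounding error.
check1 = math.isclose(p_a_given_b, p_a)              # conditional = marginal
check2 = math.isclose(p_a_given_b, p_a_given_not_b)  # two conditionals equal
check3 = math.isclose(p_a_and_b, p_a * p_b)          # joint = product of marginals
print(check1, check2, check3)
```

With real data the checks rarely match exactly, so treat this as a check on stated probabilities rather than on sample counts.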
How to avoid bias
1. Use a sampling procedure 2. Use a random sample in order to represent entire population
Line of best fit 5 number summary
1. correlation (r) 2. mean of x-values (xbar) 3. mean of y-values (ybar) 4. Std. dev. (Sx) of x-values 5. Std. dev. (Sy) of y-values
Probability Tree
1st branch - marginals, sums to 1; 2nd branch - conditionals, sums to 1; 1st branch x 2nd branch = "and" (joint) probabilities - every time the tree breaks off, the branches sum to 1
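A tree can be sketched as nested dictionaries. The probabilities below are hypothetical, and the names first_branch and second_branch are made up for illustration; the point is only that multiplying along each path gives the joint probabilities, which together sum to 1.

```python
# Hypothetical tree: first branch is A / not A, second branch is B / not B.
first_branch = {"A": 0.6, "not A": 0.4}        # marginals, sum to 1
second_branch = {
    "A":     {"B": 0.8, "not B": 0.2},         # conditionals given A, sum to 1
    "not A": {"B": 0.5, "not B": 0.5},         # conditionals given not A, sum to 1
}

# Multiply along each path: marginal x conditional = joint.
joint = {
    (a, b): first_branch[a] * p
    for a, branches in second_branch.items()
    for b, p in branches.items()
}

total = sum(joint.values())
print(joint)
print(total)
```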
Volunteer sample
A method of sample selection that consists of people choosing themselves by responding to a general appeal.
Timing of survey
EX: Length of time from survey to election; the shorter the time, the more accurate the survey.
What is UNDERCOVERAGE?
Excluding a group from the survey/experiment from the beginning.
categorical data
Data that consists of names, labels, or other nonnumerical values
T/F A flat histogram (with a line straight across) contains no variability whatsoever, according to our definition.
FALSE
Control group
In an experiment, the group that is not exposed to the treatment; contrasts with the experimental group and serves as a comparison/baseline for evaluating the effect of the treatment.
Notation for correlation
r
Treatment groups
The groups receiving different treatments.
Law of Total Probability
The probability of A is the sum of its joint probabilities with B and with not B: P(A) = P(A and B) + P(A and not B)
Response bias
a systematic pattern of incorrect responses in a sample survey
A _______________ is a way to graph quantitative data.
histogram
when to use two-way table?
if you are given P(A and B) ("and" joint probabilities)
Main objective in data collections is...
minimize/avoid bias
Scatterplots are two _______________.
quantitative variables
Best slope (b1)
b1 = r x (Sy / Sx), where r = correlation, Sy = std. dev. of y, Sx = std. dev. of x
All good samples are__________.
random
Interpret y-intercept
- only if needed - when x=0 - need data in that area to interpret
Calculating residuals
- plug in the x-value for each point to get the predicted y - subtract that predicted y from the point's original (observed) y: residual = observed y - predicted y
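The two steps can be sketched in Python. The data are made up, and the line y = 2.2 + 0.6x is the least squares line for these particular values; a positive residual means the point sits above the line, a negative one below.

```python
# Made-up data and its least squares line (b0 = 2.2, b1 = 0.6).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

# Step 1: plug each x into the line to get the predicted y.
predicted = [b0 + b1 * xi for xi in x]

# Step 2: residual = observed y - predicted y.
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]

for xi, yi, res in zip(x, y, residuals):
    position = "above" if res > 0 else "below" if res < 0 else "on"
    print(xi, yi, round(res, 2), position)
```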
What should the residual plot have if the regression line fits the data well?
- random patterns - no fan shapes - points fall around the horizontal line y=0
Properties of standard deviation
- x values represent data points - used to measure concentration of data around mean - never negative - can equal 0 - affected by outliers & skewness
Correlation of a sample (r) will always be a number between:
-1 and 1
A good experiment...
-makes comparisons -avoids bias -has enough data
What is the standard deviation of the data set 1, 1, 1, 1?
0
a correlation of ______ or higher is considered strong in the positive direction.
0.7
What is the x-variable in a data set?
Independent variable, or factor (cause)
When skewness is present in a set of data, which descriptive summary measure is most appropriate?
Interquartile range and median
Correlation is a measure of the strength and direction of what TYPE of relationship between two quantitative variables?
Linear
Sample space
S = { all possible outcomes }
What does SSE stand for?
Sum of Squares for Error
Correlation does not always imply _____________.
causation
Convenience sample
choosing individuals in the easiest way (not to be confused with volunteer, you are still choosing individuals)
A listing of all possible values in a data set and how often they occurred is called a:
data distribution
Quantitative data
data with numbers
What is the y-variable in a data set?
dependent variable, response
Bayes Rule
describes the probability of an event, based on prior knowledge of conditions that might be related to the event
Histograms vs. Boxplots
histograms - show shape - hard to identify quantities - only a rough idea of center & variability - hard to compare data sets box plots - shows skewness/symmetry - does not show what type of symmetric shape - easy to determine center & variability (center - median, variability - IQR (distance)) - can see quantities but no other breakdown - good for comparing
when to use tree diagram?
if P(A) (marginal), P(B | A) (conditional), etc.
Examining residuals
if a line fits well the residuals have - no pattern - y-values don't fan out as x increases (no fan shapes) - none/few unusually large values (outliers in y-direction) - no influential points (outliers in x-direction)
Experiment
imposes some treatment, any sizable differences in data should be due to treatment
If you add a number to every value of a data set, what happens to the standard deviation?
it does not change
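A quick check in Python, using made-up values and the standard library's sample standard deviation. The shift of 100 is arbitrary.

```python
import statistics

data = [3, 7, 8, 12, 20]            # made-up values
shifted = [v + 100 for v in data]   # add 100 to every value

sd_original = statistics.stdev(data)
sd_shifted = statistics.stdev(shifted)

# The mean moves by 100, but the spread around the mean is unchanged.
print(statistics.mean(data), statistics.mean(shifted))
print(sd_original, sd_shifted)
```

The shifted data set has a much larger mean but the same standard deviation, so a large mean does not force a large standard deviation.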
A __________ distribution summarizes the information from one variable ONLY, without considering ANY information from another variable.
marginal
Probability of an event
long-term chance it will occur P(A) "P of A"
Center measures
mean, median
Variability measures
standard deviation, IQR (interquartile range)
Linear relationship
the points show a straight line pattern, go at a constant rate and move from left to right
horizontal (x) axis
the variable measured
An experiment imposes a _______________________ on the subjects, which makes it different from an observational study.
treatment
Correlation has no __________.
units - therefore the change in units of a problem does not affect correlation
You can show that P(B|A) = P(A and B)/P(A) by...
using multiplication rule and evaluating conditionals
Confidentiality is _________ than anonymity.
weaker
x-intercept
y = 0
Line of best fit equation
y = b0 + b1x, where b0 = y-intercept and b1 = slope
Incentives for truth in survey questions
- Anonymity - Confidentiality
T/F The standard deviation has no units.
FALSE - same as original units of problem
T/F The probabilities in the four cells of a two-way table are conditional probabilities.
FALSE - they are "and" (joint) probabilities
T/F the best fitting line always has an SSE of 0.
False
How do disjoint probabilities affect "Or" Addition Rule?
P(A or B) = P(A) + P(B) + 0 because P(A and B) = 0 if disjoint
Scatterplots examine relationships between what type(s) of variables?
Quantitative
If a data set is skewed to the right, the mean is ____________________ the median.
more than - tail goes to the right, few large values
Observational study
no imposed treatments, does not attempt to influence responses
If the conditional distributions are the same, then there is:
no relationship
Skewed
not necessarily bad, means off from center
vertical (y) axis
number or % frequency in each group
Which technique should you use when you have P(B|A) and some other probabilities but you are looking for P(A|B)?
Bayes Rule - switching conditionals
Which is stronger, experiments or observational studies?
Experiments, because you are controlling variables, imposing treatments, and randomly assigning those treatments
T/F If variables A and B are related in a certain way in a two way table (with 2 variables), no matter how many other variables you look at in addition to these two, the relationship will always stay the SAME.
FALSE
T/F If you have all the information filled in a two-way table, you can fill in all the information on a tree. But not the other way around
FALSE
T/F In a boxplot you can tell the exact pattern of the data set (beyond just whether the data is skewed or symmetric.)
FALSE
T/F The mean of a data set must be one of the values in the data set.
FALSE
T/F You can compare two MARGINAL distributions to see if the corresponding two variables are related.
FALSE
Suppose you have a numerical data set that is very much skewed to the left. If you had to pick one, which measure of center (mean or median) best represents most of the data in this case?
Median
A and B are disjoint if...
P(A and B) = 0
"Or" Probability Addition Rule
P(A or B) - P(A or B, or both) - P(at least one) - P(one or more) P(A or B) = P(A) + P(B) - P(A and B) * subtract P(A and B) so "both" isn't counted 2x
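The addition rule can be checked by counting outcomes. This sketch uses one roll of a fair six-sided die as an illustrative example (not from the card); the events A and B and the helper prob are made up for the demonstration.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (illustrative example).
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}   # even
B = {4, 5, 6}   # greater than 3

def prob(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event), len(S))

# Addition rule: subtract the overlap so "both" is not counted twice.
p_or_rule = prob(A) + prob(B) - prob(A & B)

# Direct count of outcomes in A or B (or both), for comparison.
p_or_direct = prob(A | B)
print(p_or_rule, p_or_direct)
```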
Bayes Rule Formula
P(A | B) = P(B | A)P(A) / P(B) the top (numerator) will always match something on the bottom (denominator) good for when you need to flip conditionals
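A worked example of flipping a conditional, as a minimal Python sketch. All three input probabilities are hypothetical; the denominator P(B) is built from P(B and A) + P(B and not A), which is why the numerator matches one term on the bottom.

```python
# Hypothetical inputs: A = has the disease, B = tests positive.
p_a = 0.02               # P(A), marginal
p_b_given_a = 0.95       # P(B | A)
p_b_given_not_a = 0.10   # P(B | not A)

# Denominator: P(B) = P(B and A) + P(B and not A).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes Rule: flip the conditional from P(B | A) to P(A | B).
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))
```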
Marginal probability
P(A) probability of a single event - adds to 1 - found on tree
For Disjoint Events: P(A or B) = ?
P(A) + P(B)
If A and B are independent then P(A or B) becomes...
P(A) + P(B) - P(A)P(B)
Thirty-seven percent of registered voters said they are likely to vote by mail in the November election according to a new survey released Tuesday. Among them, 48% plan to vote for Joe Biden. That's more than twice the 23% who plan to vote for President Trump. Write the 48% as a probability using the given information.
P(Biden voter | voting by mail)
Suppose a school figures that 70% of adults will purchase a candy bar from a 6th grader during a fund-raiser. A sixth grader randomly selects 10 adults. What's the chance that at least one of them will buy a candy bar?
P(at least one) = 1 - P(none) = 1 - (0.30)^10 * P(none) = (1 - 0.70)^10 because each of the 10 adults has a 0.30 chance of not buying (assuming they decide independently)
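A quick check of the arithmetic, as a minimal Python sketch. It assumes the 10 adults decide independently of one another, which the card does not state explicitly; the variable names are illustrative.

```python
# Chance that a single adult buys a candy bar (from the card).
p_buy = 0.70
n_adults = 10

# P(none buy): each adult fails to buy with probability 1 - 0.70,
# multiplied across 10 adults (independence assumed).
p_none = (1 - p_buy) ** n_adults

# Complement rule: P(at least one) = 1 - P(none).
p_at_least_one = 1 - p_none
print(p_none, p_at_least_one)
```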
What happens to the correlation between two quantitative variables if you switch X and Y?
R (correlation) measures how X and Y move together (numerator) compared to how they move separately (denominator). If you switch the X's and Y's around in the ENTIRE formula you get the SAME ANSWER, by commutative property of multiplication.
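This can be confirmed numerically. The sketch below uses made-up data and a hand-written correlation helper (an illustrative function, not a library call); it also rescales x to show that a change of units leaves the correlation alone.

```python
import statistics

def correlation(u, v):
    """Sample correlation: sum of products of deviations over (n - 1) * Su * Sv."""
    u_bar, v_bar = statistics.mean(u), statistics.mean(v)
    s_u, s_v = statistics.stdev(u), statistics.stdev(v)
    n = len(u)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v)) / ((n - 1) * s_u * s_v)

x = [1, 2, 3, 4, 5]   # made-up data
y = [2, 4, 5, 4, 5]

r_xy = correlation(x, y)
r_yx = correlation(y, x)   # X and Y switched

# Changing units (here, multiplying x by 100) also leaves the correlation unchanged.
r_rescaled = correlation([100 * v for v in x], y)
print(r_xy, r_yx, r_rescaled)
```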
What occurs when a two-way table shows one relationship, but the relationship reverses if a third variable gets involved?
Simpson's Paradox
When a difference in treatment is decided to be due to more than random chance, what do you call the results?
Statistically significant
T/F A two-way table has this name because it contains rows going one way, and columns going the other way.
TRUE
T/F An outlier in a data set can significantly affect the value of the mean but not the median.
TRUE
T/F It is possible for the average of a data set to be larger than most of the values in that data set.
TRUE
If 60% of male-owned businesses are successful in their first year, and 60% of female-owned businesses are successful in their first year, are gender and having a successful business in their first year independent?
Yes, the conditionals are the same: P(success | male) = P(success | female)
Is correlation affected by outliers?
Yes, correlation is based on the mean of X, the mean of Y, the SD of X, and the SD of Y. All four of these items are affected by outliers.
Best y-intercept (b0)
b0 = ybar - b1(xbar), where ybar = mean of y-values and xbar = mean of x-values
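The slope and intercept formulas can be tried on a small data set. This is a minimal Python sketch with made-up data; it builds the line from the five summaries (r, xbar, ybar, Sx, Sy) using b1 = r x Sy/Sx and b0 = ybar - b1(xbar).

```python
import statistics

# Made-up paired data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar = statistics.mean(x)
y_bar = statistics.mean(y)
s_x = statistics.stdev(x)   # sample standard deviation of x
s_y = statistics.stdev(y)   # sample standard deviation of y

# Sample correlation r: sum of products of z-scores, divided by n - 1.
n = len(x)
r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y for xi, yi in zip(x, y)) / (n - 1)

b1 = r * s_y / s_x        # best slope
b0 = y_bar - b1 * x_bar   # best y-intercept
print(b1, b0)
```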
causation
a change in one variable makes another variable change
statistical significance
a difference beyond chance
Confounding variable
a factor other than the independent variable that might produce an effect in an experiment
If a residual is negative, then that data point lies _________________ the regression line.
below
Nonresponse bias
bias introduced to a sample when a large fraction of those sampled fails to respond
anonymity
cannot attach you to your answers
If a data set is skewed to the left, the mean is ____________________ the median.
less than - tail goes to the left, few small values
Mean
measure of center, average of data set
Data is fairly symmetric when...
the mean and median are close
Two-way tables explore relationships between ____________________________.
two categorical variables
Properties of Correlation
- 2 quantitative variables - Linear relationship only - NO units - Switching x & y does not change correlation, only changes slope - Affected by outliers & skewness
Simpson's Paradox
- 2-way table shows one relationship, a 3-way table reverses this relationship - 2 variables vs. 3 variables
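A small numeric illustration of the reversal, as a Python sketch. The counts are illustrative (treat them as hypothetical), and the labels A, B, small, and large are placeholders: within each level of the third variable treatment A has the higher success rate, yet the collapsed two-way table favors B.

```python
# Illustrative counts (treat as hypothetical): (successes, total) by
# treatment, split by a third variable with levels "small" and "large".
counts = {
    "small": {"A": (81, 87),   "B": (234, 270)},
    "large": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    return successes / total

# Three-way view: compare treatments within each level of the third variable.
within = {
    size: {t: rate(*st) for t, st in by_treatment.items()}
    for size, by_treatment in counts.items()
}

# Two-way view: collapse over the third variable by adding the counts.
overall = {}
for t in ("A", "B"):
    successes = sum(counts[size][t][0] for size in counts)
    total = sum(counts[size][t][1] for size in counts)
    overall[t] = rate(successes, total)

print(within)
print(overall)
```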
Examples of RESPONSE BIAS?
- Answering incorrectly/dishonestly (NOT undercoverage, which is excluding individuals from the sample from the beginning, and NOT nonresponse, which is not answering the survey question at all)
Types of biased samples
- Convenience - Volunteer/self-selected -Undercoverage
Joint "and" distribution
- Dividing by grand total (NOT column/row totals, that's conditional) - Can't see a relationship yet
Conditionals
- Individual cell divided by column or row total - compare rows and columns to find relationship - if conditionals are same, no relationship
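The difference between joint and conditional distributions can be sketched in Python. The counts below are hypothetical and the row and column labels are placeholders; they are chosen so both rows have the same conditional distribution, which is the "no relationship" case.

```python
# Hypothetical counts: rows are groups, columns are outcomes.
table = {
    "male":   {"success": 60, "failure": 40},
    "female": {"success": 90, "failure": 60},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Joint distribution: each cell divided by the grand total.
joint = {g: {k: v / grand_total for k, v in row.items()} for g, row in table.items()}

# Conditional distribution within each row: each cell divided by its row total.
conditional = {
    g: {k: v / sum(row.values()) for k, v in row.items()} for g, row in table.items()
}

joint_total = sum(sum(row.values()) for row in joint.values())
print(joint)
print(conditional)
```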
Which of the following distributions must sum to 1?
- Joint "And" - Marginal - Conditionals
Least squares regression line
- Minimized SSE - b0 (y-intercept) - b1 (slope)
Stratified random sampling
- aimed at a specific subgroups ("strata") of interest - choose a random sample from each group - different perspectives
Notes for box plots
- can't tell what the sample size is from a boxplot - bigger boxes don't equal more data - can't see the mean on a boxplot
Information used to find line of best fit:
1. description of relationship (uphill, straight line) 2. mean 3. standard deviations of x and y
A five number summary contains the:
1. min 2. max 3. Q1 (25th) 4. median (Q2, 50th percentile) 5. Q3 (75th)
coefficient of determination
- The % of variability in Y that is explained by X - R-squared - the square of correlation R
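One way to see "variability in Y explained by X" is to compare R-squared with 1 - SSE/SST for the least squares line. This Python sketch uses made-up data; SST (total squared deviation of y from its mean) is a name introduced here and does not appear on the card.

```python
# Made-up paired data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)   # SST: total variability in y
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Correlation and its square.
r = sxy / (sxx * syy) ** 0.5
r_squared = r ** 2

# Least squares line and its sum of squared errors.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Share of the variability in y that the line accounts for.
explained = 1 - sse / syy
print(r_squared, explained)
```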
To find conditionals from 4 cells in a two-way table...
- divide each cell by its corresponding column or row total (not both)
Sum of Squares for Error (SSE)
- each residual is squared to make it positive - residual = distance between point and line: point minus line (observed y - predicted y) - if error is 0, point is on line - find line with smallest SSE from all potential lines
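A short Python sketch comparing SSE for two candidate lines. The data are made up; 2.2 + 0.6x is the least squares line for these values, and 1.0 + 1.0x is an arbitrary competitor chosen for contrast.

```python
# Made-up data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def sse(b0, b1):
    """Sum of squared residuals (observed y - predicted y) for the line y = b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

sse_best = sse(2.2, 0.6)    # least squares line for this data
sse_other = sse(1.0, 1.0)   # an arbitrary competing line

print(sse_best, sse_other)
```

Even the best line has an SSE above 0 here, because the points do not all fall on one line.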
Simple random sample
- examines population as it exists - answers same question overall, applies to everyone
Marginals
- grand totals, "the sides" of tables - can't see relationship
Experiment factors
- has to do with treatment imposed - quantity
double-blind study
An experiment in which neither the participant nor the researcher knows whether the participant has received the treatment or the placebo
"And" Joint Probability
P (A and B) = Probability A and B occur - adds to 1 - multiply marginal and conditional to get joint
Conditional probability
P (B | A) - B, given A % "of" A that have characteristic B - look for word "of" - adds to 1 - found on tree
"And" Probability Multiplication Rule
P(A and B) = P(A)*P(B | A)
Suppose the probability of someone being female and voting for Candidate A is 30%. What is the notation for this probability?
P(Female and Voting for A) = 0.30
Suppose 90% of the patients who test positive for a disease actually have the disease. Write this as a probability.
P(have disease | test positive)=0.90
Interpret slope
always put the slope over 1: as x goes up (or down) by 1, y increases/decreases by ___.
Which type of graph is best for COMPARING two or more quantitative data sets, a boxplot or a histogram?
boxplot
Which type of graph is made from the 5-number summary?
boxplot
Confidentiality
can attach you to your answers, but promises not to
How to know if an experiment has enough data?
can be replicated on several different subjects
Complement Rule
set of all outcomes in S that are NOT INCLUDED in A Notation: Ac P(Ac)=1 - P(A) - at least one - one or more - at most one (none + exactly one)
When every possible sample with the same number of observations is equally likely to be chosen, the selected sample is called:
simple random sample
The "average distance from the mean" is measured by the __________________________.
standard deviation
Which is more affected by skewness, the IQR or standard deviation?
standard deviation
Which type of sample is used when you want to compare subgroups of the population?
stratified random sample
events (characteristics)
subsets of S (letters: A, B, C, etc.)
What is the most common observational study?
surveys
Bias
systematic favoritism of certain outcomes
Variability
- how much the scores in a data set vary from each other and from the mean - farther away from center, more variability
What summary measures CANNOT be directly calculated from a boxplot?
- mean - sample size - standard deviation
Boxplot
- one-dimensional graph of quantitative data, broken into 4 equal parts (25% each) - based on 5 number summary - can be horizontal or vertical
Ways to check fit of a line
- scatterplot - correlation - coefficient of determination - residuals
Interpretations of data
- shape - center - variability
Two things that can affect the graph:
- starting point - # of bars
T/F You can recreate the original data values from its histogram.
FALSE
T/F The median of a data set must be one of the values in the data set.
FALSE
T/F P(A and B) = P(A) * P(B) for ANY events A and B.
FALSE - only for independent events
T/F Conditional probabilities are present on a tree.
TRUE
T/F If 2 corresponding conditional distributions are different from each other in a two-way table, we know that the variables are related.
TRUE
T/F If the mean of a data set is large, the standard deviation has to be large also.
FALSE - adding a number to every value raises the mean but does not change the standard deviation
A student has applied to two graduate schools but she will only enroll in one of them. She has a 60% chance of getting into grad school A, and a 40% of getting into grad school B. If grad school A lets her in first, there is an 80% chance she will enroll. If grad school B lets her in first, there is a 90% chance she will enroll. How do you write .80 in probability notation?
P(Enroll | A)
IQR (interquartile range)
IQR = Q3 - Q1, the range of the middle 50% of the data. Q1 - 1st quartile, 25th percentile (25% of data is below it) Q2 - 2nd quartile, 50th percentile/median (50% data below it) Q3 - 3rd quartile, 75th percentile (75% below)
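The quartiles, five number summary, and IQR can be computed with Python's standard library. The data are made up, and method="inclusive" is only one common quartile convention; a textbook or calculator may give slightly different Q1 and Q3 for other data sets.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # made-up values

# Quartile conventions vary; "inclusive" is one of the two methods
# offered by statistics.quantiles.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number_summary = (min(data), q1, median, q3, max(data))
iqr = q3 - q1   # spread of the middle 50% of the data
print(five_number_summary, iqr)
```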
Random sample
every member of the population has an equal chance of being selected
Problems in surveys that can cause bias:
- Nonresponse - Undercoverage - Question wording
What probability formulas CANNOT be used to test for independence?
- P(A | B) = P(B) - P(A | B) = P(Ac | B) Ac - complement of A
Two challenges of collecting good data:
- Selecting a good sample - Collecting good/unbiased data
What do the slices on a pie chart represent?
relative frequencies
What are the units of residuals?
same as the units for Y
Types of survey
mail, telephone, online, face-to-face