PSYC 5405 Advanced Stats Midterm
Outliers and clusters: these are ways of _________ _______ when describing the scatter plot
Unusual Features
Attributes that may take different values for various individuals
Variables
When examining the relationship between ______, these steps should be taken: - Plot the data and examine any numerical summaries (five number summary, mean, standard deviation) - Describe the scatter plot
Variables
ŷ is the ___________ _____ of the response variable for a given value of the explanatory variable
predicted value
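- Arrange the observations in increasing order and locate the median - The first quartile is the median of the observations to the left of the median - The third quartile is the median of the observations to the right of the median These are the steps to take to calculate ___________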
quartiles
Measures the outcome of a study (dependent variable)
response variable
- ____________ observational studies examine existing data for a sample of individuals - ______________ observational studies track individuals into the future
retrospective, prospective
- Assign labels that place individuals into particular groups - Have NO order - Ex: Hair color, zip code, favorite song
Categorical
The median or mean (depending on distribution)
Center
In a roughly symmetric distribution, the mean and median are _______ __________
Close together
_________ _________ _________: Selects a sample by randomly choosing clusters and including each member of the selected clusters in the sample - A _______ is a group
Cluster random sample
- The __________ __ _________ (r²) measures the percent of the variability in the response variable that is accounted for by the least-squares regression line - It is the square of the correlation coefficient r - We can find the linear regression line and the correlation coefficient by using LinReg on our calculator
Coefficient of determination
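A minimal sketch of computing r and r² by hand instead of with the calculator's LinReg. The data (hours studied and exam scores) and the function name `correlation` are made up for illustration.

```python
from math import sqrt

def correlation(xs, ys):
    """Pearson correlation r for paired quantitative data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied (explanatory) and exam score (response).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

r = correlation(hours, scores)
r_squared = r ** 2  # share of the variability in scores accounted for by the line
print(f"r = {r:.3f}, r^2 = {r_squared:.3f}")
```

Here r² comes out near 0.96, so about 96% of the variability in the made-up scores is accounted for by the least-squares line on hours.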
- A _________ ______ of a variable describes the values of that variable among individuals who have a particular value of another variable - Ex: Conditional distribution by sport: Male baseball: 13/36, Female baseball: 23/36, and so on
Conditional Distribution
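A small sketch of the conditional distribution from the example above. The baseball counts (13 and 23 out of 36) come from the card; the soccer row and the function name `conditional_distribution` are assumptions added so the table has more than one row.

```python
# Two-way table: rows are sports, columns are gender.
table = {
    "baseball": {"male": 13, "female": 23},  # counts from the card's example
    "soccer": {"male": 35, "female": 29},    # hypothetical row
}

def conditional_distribution(table, row):
    """Distribution of the column variable among individuals in one row."""
    counts = table[row]
    row_total = sum(counts.values())
    return {category: count / row_total for category, count in counts.items()}

baseball = conditional_distribution(table, "baseball")
for category, share in baseball.items():
    print(f"{category}: {share:.1%}")
```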
What does the distribution represent?
Context
_______ _____: Selects individuals from the population who are easy to reach
Convenience sample
- Collect data from a representative sample (from the population of interest) - Perform data analysis, keeping probability in mind - Use the results to make inferences about the population These are the steps for going from ______ _______ to ____________
Data analysis, inference
- Direction: positive association, negative association, no association - Form: linear or nonlinear - Strength: weak, moderate, strong - Unusual features: outliers and clusters - Context of the problem These are the features to cover when you _______ ___ ______ _____
Describe the scatter plot
- tells us what values a variable takes and how frequently it takes these values - Ex: Histograms, box plots, dot plots, scatter plots, stem and leaf plots, and line graphs for quantitative data - Ex: Bar graphs, two-way tables, and pie charts for categorical data
Distribution
In the normal distribution with mean m and standard deviation s: - Approximately 68% of observations fall within one s of m - Approximately 95% of observations fall within 2s of m - approximately 99.7% of observations fall within 3s of m This is the ______ __________
Empirical rule
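The 68-95-99.7 figures can be checked against the exact normal curve. This is a quick sketch using Python's standard-library `statistics.NormalDist`; the helper name `share_within` is made up.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

The exact proportions are about 0.6827, 0.9545, and 0.9973, which the rule rounds to 68%, 95%, and 99.7%.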
the median of the observations located to the right of the median in the list
Third quartile
T/F All normal curves are characterized by a bell shape, a single peak, and are symmetrical
True
T/F Correlation is NOT resistant to outliers
True
a _____-_____ _____ describes two categorical variables, organizing counts according to a row variable and a column variable
Two-way Table
________ ________ ______: Consists of people who choose themselves by responding to general appeal - Often show bias because people with strong opinions are more likely to respond
Voluntary response sample
Does adding or subtracting the same number n to each observation add or subtract n to the measures of center and location (mean, median, quartiles, percentiles)?
Yes
Does multiplying or dividing each observation by the same number n multiply or divide the measures of center and location by n?
Yes
Does multiplying or dividing each observation by the same number n multiply or divide the measures of spread by |n|?
Yes
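A quick numeric check of the three cards above, as a sketch with made-up data. Shifting by 10 moves the mean and median but not the standard deviation; multiplying by -3 multiplies the mean and median by -3 and the standard deviation by |-3| = 3.

```python
from statistics import mean, median, stdev

data = [2, 4, 4, 6, 9]              # hypothetical observations
shifted = [x + 10 for x in data]    # add the same number to every observation
scaled = [x * -3 for x in data]     # multiply every observation by the same number

for label, values in (("original", data), ("shifted", shifted), ("scaled", scaled)):
    print(f"{label:8s} mean={mean(values):6.2f} median={median(values):6.2f} "
          f"sd={stdev(values):5.2f}")
```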
A scatter plot that displays the residuals on the vertical axis and the explanatory variable on the horizontal axis - If there is no leftover pattern, the regression model is ___________ - If there is a leftover pattern in the residual plot, consider using a different form of _____ _______
appropriate, regression model
The design of a statistical study shows _____ if it is very likely to underestimate or overestimate the value you want to know
bias
two varieties of variables
categorical, quantitative
- Select the rows or columns of interest - Use the data from the table to calculate the conditional distribution of the rows or columns - Make a graph to display the conditional distribution (use a side-by-side bar graph or a segmented bar graph) These are the steps to take to examine or compare _____________ _______________
conditional distributions
the process of organizing, displaying, summarizing, and questioning data
data analysis
The ______ of a sample refers to the method used to choose the sample from the population
design
Positive association, negative association, no association: these are ways of describing _________ when describing the scatter plot
direction
For a linear association between two quantitative variables, the correlation (r) measures both the ________ and _______ __ ___ ____________
direction, strength of the association
An _________ deliberately imposes some treatment on individuals in order to observe their responses
experiment
Data always involves ________ and ________
individuals, variables
- Use the data from the table to calculate the marginal distribution of the row or column totals - Create a graph to display the marginal distribution These are the steps to take to examine a ___________ ____________
marginal distribution
a ____ _____ _____ provides a good assessment of the adequacy of the normal model for a set of data
normal probability plot
An _________ _______ observes individuals and measures variables of interest but does not attempt to influence the responses
observational study
A __________ z-score is above the mean, a __________ z-score is below the mean
positive, negative
_________ involves studying a part in order to gain information about the whole
sampling
- Convenience - Voluntary response - Simple random - Multi-stage random - Stratified random - Cluster random - Systematic random These are types of _______ _____
sampling design
________ ________ _____: Consists of n individuals chosen from the population in such a way that every set of n individuals has an equal chance to be the sample actually selected
simple random sample
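One way to draw a simple random sample in code, as a sketch. The roster of 30 students and the function name `simple_random_sample` are hypothetical. `random.sample` draws without replacement, so every set of n individuals is equally likely to be chosen.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Choose n individuals so that every set of n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical population: a class roster of 30 students.
roster = [f"student_{i:02d}" for i in range(1, 31)]
srs = simple_random_sample(roster, 5, seed=1)
print(srs)
```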
- When observations are not possible, ___________ provide an alternate method for producing data - We generate random numbers and assign certain numbers to outcomes based on probability
simulations
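A sketch of a simulation in this sense, using a made-up question: how often do 3 fair coin flips give at least 2 heads? Random digits 0-9 are generated and assigned to outcomes by probability (0-4 is heads, 5-9 is tails). The function name and the number of trials are assumptions.

```python
import random

def estimate_two_or_more_heads(trials=10_000, seed=0):
    """Estimate P(at least 2 heads in 3 fair coin flips) by simulation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Assign digits 0-4 to heads and 5-9 to tails (probability 0.5 each).
        digits = [rng.randrange(10) for _ in range(3)]
        heads = sum(1 for d in digits if d <= 4)
        if heads >= 2:
            hits += 1
    return hits / trials

estimate = estimate_two_or_more_heads()
print(f"estimated probability: {estimate:.3f}")
```

The exact answer is 4/8 = 0.5, so the estimate should land close to that.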
The _________________ is susceptible to outliers
standard deviation
The typical (roughly average) distance between each value and the mean
standard deviation
In a perfectly symmetric distribution, the mean and median are ________
the same
_____________ ____ occurs when some groups in the population are left out of the process of choosing the sample
undercoverage bias
The "average" squared deviation
variance
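A sketch of the calculation with made-up data. It assumes the sample version of the variance, which divides the sum of squared deviations by n - 1 rather than n (presumably why the card puts "average" in quotes). The standard deviation is the square root of the variance.

```python
from math import sqrt

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)

data = [2, 4, 4, 6, 9]   # hypothetical observations, mean = 5
var = sample_variance(data)
sd = sqrt(var)
print(f"variance = {var:.2f}, standard deviation = {sd:.2f}")
```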
a is the _-_________: the value of y when x = 0
y-intercept
The _-_____ tells us how many standard deviations away from the mean an observation falls, and what direction it falls in
z-score
_-_____ have no units
z-scores
Attempts to explain the observed outcomes (independent variable)
Explanatory variable
The science of data
Statistics
- Undercoverage - Nonresponse - Response - Order of choice - Wording of questions These are types of ____
Bias
T/F The mean and the median of a normal curve are not the same
False
the median of the observations located to the left of the median in the list
First quartile
- A _____-______ ______ is a quick summary of the distribution of a data set - It contains the minimum, first quartile, median, third quartile, and maximum - A box plot contains all numbers in a _____-______ ______
Five-Number summary
Linear or nonlinear: these are ways of describing _________ when describing the scatter plot
Form
- Divide the range of data into classes of equal width - Find the count or percent of individuals in each class - Label and scale your axes and draw the histogram These are the steps to take to construct a ___________
Histogram
- graphs that display the distribution of a quantitative variable by showing each interval of the values as a bar - The heights of the bars show the frequencies of values in each interval - show off distributions very clearly - are the most common graph of distribution
Histograms
Objects described in a data set
Individuals
- The difference between the third and first quartiles (Q3 − Q1) - This can also be found using your calculator - It is resistant to outliers - An observation is an outlier if it falls more than 1.5 × IQR above the third quartile or more than 1.5 × IQR below the first quartile
Interquartile range
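A sketch of the 1.5 × IQR rule with made-up data. It assumes the quartile convention used in this deck (each quartile is the median of the observations on one side of the median). Calculators and software may use slightly different conventions and give slightly different quartiles. The function names are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum (quartiles as medians of each half)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # observations to the left of the median
    upper = data[half + n % 2:]    # observations to the right of the median
    return data[0], median(lower), median(data), median(upper), data[-1]

def outliers(values):
    """Observations more than 1.5 x IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

data = [3, 5, 7, 8, 9, 11, 13, 40]   # hypothetical observations
print(five_number_summary(data))
print(outliers(data))
```

For this data Q1 = 6 and Q3 = 12, so IQR = 6 and the fences are -3 and 21; only 40 is flagged.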
_____-_____ ______ ______: The line that makes the sum of the squared residuals as small as possible
Least-Squares Regression Line
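A sketch of finding the line ŷ = a + bx that minimizes the sum of squared residuals, using the standard formulas for simple linear regression: b is the sum of (x - x̄)(y - ȳ) divided by the sum of (x - x̄)², and a = ȳ - b·x̄. The data and the function name are made up.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # y-intercept
    return a, b

# Hypothetical data: hours studied (x) and exam score (y).
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

a, b = least_squares_line(hours, scores)
print(f"y-hat = {a:.1f} + {b:.1f}x")
```

For this data the slope is 6 and the y-intercept is 46: each extra hour predicts 6 more points.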
- The __________ ____________ of one of the categorical variables is the distribution of values of that variable among all individuals described by the table - Ex: Marginal distribution of gender: Male: 48/100 = 48% Female: 52/100 = 52% - The percentages in a marginal distribution should total 100%
Marginal Distribution
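A sketch of computing a marginal distribution from a two-way table. The counts are illustrative, chosen so that the gender totals reproduce the 48/100 and 52/100 in the example; the table layout and function name are assumptions.

```python
# Two-way table: rows are sports, columns are gender (illustrative counts).
table = {
    "baseball": {"male": 13, "female": 23},
    "soccer": {"male": 35, "female": 29},
}

def marginal_distribution(table):
    """Distribution of the column variable among all individuals in the table."""
    totals = {}
    for counts in table.values():
        for category, count in counts.items():
            totals[category] = totals.get(category, 0) + count
    grand_total = sum(totals.values())
    return {category: count / grand_total for category, count in totals.items()}

gender = marginal_distribution(table)
for category, share in gender.items():
    print(f"{category}: {share:.0%}")
```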
- The _______ is the average of all individual data values - To find the ______, add all of the observations and divide by the number of observations
Mean
Determine if you should use the mean or median to measure the center of a distribution of data - If the distribution is reasonably symmetric and has no outliers, use the ________ - Outliers have a big impact on the _______ which would cause an inaccurate measure of center (it is not resistant to outliers)
Mean
A normal curve is described by its ________ and _______ _________
Mean, Standard deviation
- Arrange all observations from smallest to largest - If the number of observations is odd, the median is the center observation in the list - If the number of observations is even, the _______ is the average of the two center observations in the list - For n observations in a group, use (n + 1)/2 to find the position of the ________ in the list of observations
Median
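The steps above can be sketched as follows, with the (n + 1)/2 position counted from 1. The function name is made up.

```python
def median_by_position(values):
    """Median found by locating position (n + 1) / 2 in the sorted list."""
    data = sorted(values)            # arrange from smallest to largest
    n = len(data)
    position = (n + 1) / 2           # 1-based position of the median
    if n % 2 == 1:
        return data[int(position) - 1]       # odd n: the center observation
    below = data[int(position) - 1]          # even n: the two center observations
    above = data[int(position)]
    return (below + above) / 2

print(median_by_position([5, 1, 3]))      # odd number of observations
print(median_by_position([4, 1, 3, 2]))   # even number of observations
```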
- The _________ is the midpoint of the distribution - It is the number where half of the observations are smaller and the other half larger
Median
Determine if you should use the mean or median to measure the center of a distribution of data - If the distribution of data is skewed or has outliers, use the _________ - Outliers have little to no effect on the ________, thus maintaining its accuracy (it is resistant to outliers)
Median
_____-_____ ______ ______: Involves the repeated selection of simple random samples within prior random samples
Multi-stage random sample
Does adding or subtracting the same number n to each observation change the shape or measures of spread of the distribution (range, IQR, standard deviation)?
No
Does multiplying or dividing each observation by the same number n change the shape of the distribution?
No
_____________ _____ occurs when an individual chosen for the sample can't be contacted or doesn't cooperate
Nonresponse bias
- The nth _________ of a distribution is the value with n percent of the observations less than it - Ex: 60th _____________ of data is 50. This means that 60% of the data is less than 50 and 40% of the data is 50 or above
Percentile
_______: The entire group of individuals we want information about ______: A subset of individuals in the population from which we collect data
Population, sample
+ means ________ direction, - means ________ direction
Positive, negative
- Take numerical values for which it is sensible to find an average - Have order - Ex: Age, speed, height
Quantitative
- A _________ ____ displays the relationship between two variables, but only when one of the variables helps explain or predict the other - It is a model for the data: the equation gives us a compact mathematical description of what this model tells us about the relationship between y and x
Regression line
- When data has a _________ overall pattern, we can use a simplified model called a ______ ________ to describe it - Always on or above the horizontal axis - It has an area of exactly 1 underneath it
Regular, Density Curve
- A _________ is the difference between the actual value of y and the value of y predicted by the regression line - residual = y − ŷ
Residual
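A sketch of computing residuals as actual y minus predicted ŷ. The data and the fitted line ŷ = 46 + 6x are hypothetical (the line is the least-squares fit for this made-up data, so the residuals add up to zero).

```python
# Hypothetical data and a hypothetical fitted line y-hat = 46 + 6x.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]
a, b = 46, 6

predicted = [a + b * x for x in hours]                      # y-hat for each x
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]  # y - y-hat

for x, e in zip(hours, residuals):
    print(f"x = {x}: residual = {e:+d}")
```

Plotting these residuals against `hours` gives the residual plot; a patternless scatter around zero suggests the linear model is appropriate.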
A _______ ______ is a scatter plot that displays the residuals on the vertical axis and the explanatory variable on the horizontal axis - If there is no leftover pattern, the regression model is appropriate - If there is a leftover pattern in the residual plot, consider using a regression model with a different form.
Residual plot
_______ _____ occurs when the timing of the survey or the identity of the surveyor influences the answers - Also occurs when people do not remember answers or lie
Response bias
2 types of variables to keep in mind when analyzing two or more variables:
Response, Explanatory
When describing distribution of quantitative data, we use the acronym _____
SOCCS
Symmetric, Skewed Right, Skewed Left, Bimodal, Unimodal
Shape
What does SOCCS stand for?
Shape, Outliers, Context, Center, Spread
b is the ________: the amount y is predicted to change when x increases by one
Slope
The range (most of the time) or the standard deviation
Spread
The ______ _________ is the distance from the center to the change-of-curvature points on either side
Standard deviation
- the normal distribution with mean 0 and standard deviation 1 - We obtain this by converting every value into its z-score and representing each data point as its z-score in the distribution - N(0,1) the ____ ________ ___________
Standard normal distribution
- another name for z-score is ____________ _____________ - the formula is z = (x − mean) / standard deviation
Standardized value
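A sketch of standardizing a data set with z = (x − mean) / standard deviation. The test scores and the function name are made up, and the sample standard deviation (`statistics.stdev`) is assumed.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardized value of each observation: (x - mean) / standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(x - m) / s for x in values]

scores = [60, 70, 80, 90, 100]   # hypothetical test scores, mean = 80
zs = z_scores(scores)
for x, z in zip(scores, zs):
    print(f"score {x}: z = {z:+.2f}")
```

Scores below the mean get negative z-scores, scores above it get positive ones, and the result has no units.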
- Separate each observation into a stem and a leaf - A stem includes all but the final digit - A leaf is just the final digit of the number - Write all possible stems from the smallest to the largest in a vertical column - Draw a vertical line to the right of the column - Write each leaf in the row to the right of its corresponding stem - Arrange the leaves in increasing order out from the stem - Provide a key that explains in context what the stems and leaves represent These are the steps on how to make a ____-____-____ ____
Stem-and-Leaf Plot
____-___-_____ ____ are a simple graphical display for small sets of data - They give us a visual of the distribution while including the actual numerical values
Stem-and-Leaf Plots
_________ _________ _______: First classify the population into groups of similar individuals who share characteristics called strata. Then choose a separate SRS in each stratum and combine these SRSs to form the full sample
Stratified Random sample
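A sketch of a stratified random sample: group the population into strata, then take a separate SRS from each stratum and combine them. The population of (name, class year) pairs and the function name are hypothetical.

```python
import random

def stratified_sample(population, stratum_of, per_stratum, seed=None):
    """Take an SRS of size per_stratum from each stratum and combine them."""
    rng = random.Random(seed)
    strata = {}
    for individual in population:
        strata.setdefault(stratum_of(individual), []).append(individual)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, per_stratum))  # separate SRS per stratum
    return sample

# Hypothetical population: 20 freshmen and 20 seniors.
population = [(f"p{i}", "freshman" if i < 20 else "senior") for i in range(40)]
sample = stratified_sample(population, lambda person: person[1], 3, seed=2)
print(sample)
```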
Weak, moderate, strong: these are ways of describing _________ when describing the scatter plot
Strength
- The closer to 1 or -1, the ________ the association - The closer to 0, the ______ the association
Stronger, weaker
__________ _________ _________: Selects a sample from an ordered arrangement of the population by randomly selecting one of the first k individuals and choosing every kth individual thereafter
Systematic random sample
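A sketch of a systematic random sample: pick a random starting point among the first k individuals, then take every kth individual after that. The ordered population of 100 ID numbers and the function name are hypothetical.

```python
import random

def systematic_sample(ordered_population, k, seed=None):
    """Randomly pick one of the first k individuals, then every kth thereafter."""
    rng = random.Random(seed)
    start = rng.randrange(k)            # index of one of the first k individuals
    return ordered_population[start::k]

# Hypothetical ordered population: ID numbers 0 through 99.
ids = list(range(100))
chosen = systematic_sample(ids, 10, seed=4)
print(chosen)
```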