stats mini exam
empirical rules for normal distribution
About 68% of the data fall within one standard deviation of the mean, about 95% fall within two standard deviations, and almost all (about 99.7%) fall within three standard deviations.
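The 68 / 95 / 99.7 figures can be checked numerically. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `area_within` is made up for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

Because every normal distribution standardizes to the same curve, the same proportions hold for any mean and standard deviation.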
Describing Quantitative Variables
Always start by making a picture: a histogram or stem-and-leaf plot. A summary of the different values observed for the variable includes the "3 S's": shape, center, and spread. Spread is also known as variation or variability.
Finding area to the right of a z-score
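1. Use symmetry: the area to the right of z equals the area to the left of -z. 2. Use properties of density curves: the total area is 1, so area to the right = 1 - (area to the left).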
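Both routes give the same answer, which can be seen in a short sketch. It assumes the standard normal curve and uses `statistics.NormalDist.cdf` for the area to the left; the two function names are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def area_right_by_symmetry(z):
    """Area to the right of z equals the area to the left of -z."""
    return std_normal.cdf(-z)

def area_right_by_complement(z):
    """Total area is 1, so subtract the area to the left of z."""
    return 1 - std_normal.cdf(z)

z = 1.25  # arbitrary example value
print(area_right_by_symmetry(z), area_right_by_complement(z))
```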
a density curve
A mathematical model used to describe the overall pattern of the distribution of a random variable; rescale a percent histogram so that the area under the curve is 1
correlation
A measure of the extent to which two factors vary together, and thus of how well either factor predicts the other.
standard normal distribution
A normal distribution with a mean of 0 and a standard deviation of 1.
coefficient of determination (r^2)
The fraction of the variation in the values of y that is accounted for by the least-squares regression line of y on x. A measure of how well the LSRL fits the data. It is the square of the correlation coefficient (r): the fraction of the variance in y (vertical scatter) that can be explained linearly by changes in x (horizontal scatter).
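The two descriptions of r^2 (square of r, and fraction of variation explained) agree, as the sketch below suggests. The data are hypothetical toy values, and the variable names are chosen only for illustration.

```python
from statistics import mean

# Hypothetical toy data (not from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)  # total variation in y

# Least-squares slope and intercept.
b = s_xy / s_xx
a = y_bar - b * x_bar

# Route 1: square of the correlation coefficient.
r = s_xy / (s_xx * s_yy) ** 0.5
r_squared = r ** 2

# Route 2: 1 - (unexplained variation / total variation).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
explained_fraction = 1 - sse / s_yy

print(r_squared, explained_fraction)
```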
bar charts
Used when data are divided into categories (discrete data). The bars are separated to show the different categories. The height represents the frequency of that category among all individuals. Describe the distribution in a bar chart by comparing the heights of the bars.
extrapolation
Using an LSRL to predict outside the range of observed values of the explanatory variable. Predictions are not reliable (they can lead to ridiculous conclusions if the current linear trend does not continue).
IQR
The distance between the first and third quartiles (Q3 - Q1). Resistant to outliers and skew, because it looks only at the middle 50% of the observations.
five-number summary
Minimum, Q1, median, Q3, maximum. A quick numerical summary of a quantitative variable; used to create a box plot.
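A small sketch of the five-number summary follows. It assumes one common textbook convention for quartiles (medians of the lower and upper halves); software often uses slightly different rules, so results can differ a little. The function name and data are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    leaving out the overall median when the count is odd.
    """
    data = sorted(values)
    n = len(data)
    lower_half = data[: n // 2]
    upper_half = data[(n + 1) // 2 :]
    return (data[0], median(lower_half), median(data), median(upper_half), data[-1])

scores = [13, 2, 30, 8, 5, 12, 4, 10, 7]  # hypothetical data
summary = five_number_summary(scores)
iqr = summary[3] - summary[1]  # Q3 - Q1
print(summary, iqr)
```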
parameter
numerical summary of some feature of the population, e.g. "mu" (mean) and "sigma" (standard deviation)
normal QQ plot
A normal QQ plot of the residuals: we want the points to fall on the line, which shows that the residuals are approximately normally distributed and centered at 0.
interpret LSRL
on average, for each additional unit of x, y changes by b units.
a residual
Also called the error: the vertical distance between an observed and a predicted value of y. ei = yi - yiHat. For an LSRL, the positive and negative residuals cancel, so the residuals add up to 0.
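The cancellation of positive and negative residuals can be seen on a toy example. The data are hypothetical, and the intercept and slope below are the least-squares values for exactly these five points.

```python
# Hypothetical toy data (not from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

# Least-squares intercept and slope for this toy data.
a, b = 2.2, 0.6

# Residual = observed y minus predicted y.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
total = sum(residuals)

print([round(e, 2) for e in residuals])
print(round(total, 10))  # zero up to floating-point rounding
```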
ordinal variable
ordered; ex. letter grade, rankings
categorical variable
places an individual into one of several groups or categories (ordinal or nominal)
re-randomization
reassign individuals into treatment groups and observe the difference between these new groups as a comparison
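One way to use the re-randomized groups as a comparison is to count how often a shuffled difference is at least as large as the observed one. The sketch below is an assumption about how that comparison might be done, not a method stated in these notes; the function name, data, and number of repetitions are all hypothetical.

```python
import random
from statistics import mean

def rerandomization_p_value(group_a, group_b, reps=2000, seed=0):
    """Share of re-randomized mean differences at least as extreme as the observed one."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)  # reassign individuals to the two groups
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    return at_least_as_extreme / reps

treatment = [12, 15, 14, 16, 13]  # hypothetical responses
control = [10, 11, 9, 12, 10]
p_value = rerandomization_p_value(treatment, control)
print(p_value)
```

A small proportion suggests the observed difference would be unusual if the treatment had no effect.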
when asked if a linear model is appropriate for your data,
report on the residuals and R^2
Individuals
The objects described by a set of data; the rows of the data set. n = number of individuals.
two-way table
A table containing counts for two categorical variables. It has r rows and c columns; each variable can have any number of categories.
2 measures of center
mean and median
fitted LSRL
yHat = a + bx
standardized value (z-score)
z = (x - mu) / sigma
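A short sketch of standardizing, which also shows how z-scores allow comparison across different scales. The test names, means, and standard deviations are invented for illustration.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mu) / sigma

# Hypothetical example: two scores on tests with different scales.
z_test_1 = z_score(680, mu=500, sigma=100)
z_test_2 = z_score(27, mu=18, sigma=6)

print(z_test_1, z_test_2)
```

The first score is further above its mean in standard-deviation units, so it is the more unusual of the two.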
scatter plots
Shows the relationship between two quantitative variables. Points that fall close to a straight line indicate a stronger linear correlation. Describe by form, direction, and strength of association.
intercept
a = y-bar - b(x-bar)
normal distribution
a bell-shaped curve, a family of curves, describing the spread of a characteristic throughout a population, defined by its center and spread (mean, std dev)
z-score
a measure of how many standard deviations you are away from the norm (average or mean)
sample
a subset of the population; needs to be chosen at random to be representative of the population
outliers
a value that deviates from the overall pattern, unusual observation
lurking variable
a variable that is not among the explanatory or response variables in a study but that may influence the response variable
response variable
a variable that measures an outcome or result of a study, y
explanatory variable
a variable that we think explains or causes changes in the response variable, x
table A
allows us to find the area (proportion of observations / probability) to the left of a z-score; these areas are known as cumulative proportions / probabilities
influential observation
an observation that markedly changes the regression line if removed, substantially changes the regression equation, may or may not be an outlier too
a variable
any characteristic of an individual; can take different values for different individuals; should not be predetermined, must vary; the columns of the data set
standardization
Any normal distribution can be transformed into the standard normal distribution. In order to use the table, you need to standardize x to get z. Standardizing allows you to compare observations on different scales by recentering at 0 and rescaling to a standard deviation of 1.
slope
b = r(sy / sx)
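The slope formula combines with the intercept a = y-bar - b(x-bar) to give the fitted line yHat = a + bx. A minimal sketch on hypothetical data, with `correlation` and `fit_lsrl` as made-up helper names:

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: average product of z-scores, dividing by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)

def fit_lsrl(xs, ys):
    """Return (a, b) for the fitted line yHat = a + b*x."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)  # slope: b = r(sy / sx)
    a = mean(ys) - b * mean(xs)                      # intercept: a = y-bar - b(x-bar)
    return a, b

# Hypothetical toy data (not from the notes).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_lsrl(xs, ys)
print(round(a, 3), round(b, 3))
```

The intercept formula guarantees that the line passes through the point (x-bar, y-bar).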
1.5 x IQR rule
Low outlier: less than Q1 - (1.5 x IQR). High outlier: greater than Q3 + (1.5 x IQR).
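The rule amounts to computing two cutoffs ("fences") and flagging anything beyond them. In this sketch the data and quartile values are hypothetical, and `outlier_fences` is a made-up helper name.

```python
def outlier_fences(q1, q3):
    """Return (low_fence, high_fence) from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data with assumed quartiles Q1 = 4.5 and Q3 = 12.5.
data = [2, 4, 5, 7, 8, 10, 12, 13, 30]
low_fence, high_fence = outlier_fences(q1=4.5, q3=12.5)

outliers = [x for x in data if x < low_fence or x > high_fence]
print(low_fence, high_fence, outliers)
```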
histograms
shows the number of individuals that fall in each interval (height of each bin)
2 measures of spread
standard deviation and IQR
marginal distribution
Summarizes each categorical variable independently (row totals, column totals). Ignores the potential bivariate relationship between the categorical variables in the table.
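As a sketch, the marginal distributions come from the row and column totals divided by the grand total. The two-way table below is invented for illustration (class year by a yes/no answer).

```python
# Hypothetical two-way table of counts: rows are class year, columns are an answer.
table = {
    "first-year": {"yes": 30, "no": 20},
    "senior": {"yes": 10, "no": 40},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of the row variable: row totals / grand total.
row_marginal = {name: sum(row.values()) / grand_total for name, row in table.items()}

# Marginal distribution of the column variable: column totals / grand total.
col_totals = {}
for row in table.values():
    for category, count in row.items():
        col_totals[category] = col_totals.get(category, 0) + count
col_marginal = {category: total / grand_total for category, total in col_totals.items()}

print(row_marginal, col_marginal)
```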
common distribution shapes
symmetric, skewed, complex / multimodal
quantitative variable
takes numerical values for which arithmetic operations such as adding and averaging make sense across individuals; ex. height, weight, GPA
the distribution of the variable
tells us 1) the possible values or outcomes of a variable and 2) the frequency with which it takes on those values
mean of a density curve
the balance point, at which the curve would balance if made of solid material; if skewed it will get pulled towards the tail
statistic
the corresponding numerical quantity for the sample, x-bar and s
conditional distribution
The distribution of values of one variable among only the individuals who have a given value of the other variable. Equivalently, the distribution of the response variable, given a particular fixed category of the explanatory variable.
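A conditional distribution divides each count by its own row total rather than by the grand total. The sketch below assumes the rows hold the explanatory variable; the table and the function name are hypothetical.

```python
# Hypothetical two-way table: rows are the explanatory variable (class year),
# columns are the response variable (a yes/no answer).
table = {
    "first-year": {"yes": 30, "no": 20},
    "senior": {"yes": 10, "no": 40},
}

def conditional_distribution(table, given_row):
    """Distribution of the response among individuals in one explanatory category."""
    row = table[given_row]
    row_total = sum(row.values())
    return {category: count / row_total for category, count in row.items()}

for year in table:
    print(year, conditional_distribution(table, year))
```

If the conditional distributions differ noticeably between rows, that suggests an association between the two variables.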
mean
The distribution's "center of mass". It is not robust: it is sensitive to extreme values and will be pulled toward the tail of a skewed distribution. The sample mean (average) is denoted by x-bar.
median
The distribution's midpoint (50th percentile). It is robust and resistant to outliers. Be sure to sort the values from smallest to largest first.
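The difference in resistance between the mean and the median shows up when a single value is made extreme. The numbers below are hypothetical.

```python
from statistics import mean, median

clean = [1, 2, 3, 4, 5]          # hypothetical data
with_outlier = [1, 2, 3, 4, 50]  # same data with the largest value made extreme

# The median only depends on the middle of the sorted values, so it stays put.
print(median(clean), median(with_outlier))

# The mean uses every value, so the outlier drags it upward.
print(mean(clean), mean(with_outlier))
```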
population
the entire group of individuals about which we want information
median of a density curve
the equal-areas point, the point that divides the area under the curve in half
least squares regression line
The line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible; in effect, the line that keeps the points, on average, as close to it as possible.
nominal variable
unordered; ex. position on a team, gender identity
standard deviation
Used to describe the variation around the mean; represents the typical distance of an observation from the mean. It uses the mean and is therefore affected by outliers. Denoted by s.
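A minimal sketch of the sample standard deviation, assuming the usual n - 1 formula (the notes do not spell the formula out). The data and the function name `sample_sd` are hypothetical; `statistics.stdev` is used only as a cross-check.

```python
from math import sqrt
from statistics import mean, stdev

def sample_sd(values):
    """Sample standard deviation s: square root of the sum of squared
    deviations from the mean, divided by n - 1."""
    x_bar = mean(values)
    squared_deviations = sum((x - x_bar) ** 2 for x in values)
    return sqrt(squared_deviations / (len(values) - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
s = sample_sd(data)
print(round(s, 3), round(stdev(data), 3))  # the two values should agree
```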
clustered bar chart
used to graph a conditional distribution, shows the distribution of the response variable conditioned on the levels of the explanatory variable
residuals vs fitted
want an even, random scatter of the points above and below zero. This indicates that fitting a linear model is appropriate and the residuals have constant variance.