AP Stats Unit 1 & 2
Variance
SD squared
describing a distribution
Shape, Outliers, Center, Spread (SOCS) +context
problem with segmented bar graph
Every bar is scaled to 100%, so the graph hides that the sample sizes of the groups can differ
What does not work well for large data sets?
Stemplots
Strength
Strong: points very close together. Moderate: not as close. Weak: spread out, far apart.
Zero
Sum of residuals
Skewed Left Distribution
Tail on the left. Mean less than the median.
correlation coefficient equation
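$r = \frac{1}{n-1}\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$. r is not resistant to outliers. r is not affected by changes in scale or center. r = 0 represents no linear correlation.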
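To make the correlation formula concrete, here is a minimal Python sketch that computes r as the sum of the products of z-scores divided by n - 1. The function name `correlation` and the data (hours studied, test scores) are made up for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample SDs (these also divide by n - 1)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, made up for illustration.
hours = [1, 2, 3, 4, 5]
scores = [52, 60, 61, 70, 77]

r = correlation(hours, scores)

# Changing the units of x (scale and center) leaves r unchanged.
r_rescaled = correlation([60 * h + 10 for h in hours], scores)
print(round(r, 3), round(r_rescaled, 3))
```

Both printed values match, which illustrates that r is not affected by changes in scale or center.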
Spread
range, IQR, standard deviation
Median/IQR
resistant to outliers
Standard Deviation of Residuals equation
$s = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}$
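As a rough sketch of this calculation: square each residual, add them up, divide by n - 2, and take the square root. The function name `residual_sd` and the data are hypothetical; the line y-hat = 2.5 + 0.8x is the LSRL for this particular made-up data set.

```python
from math import sqrt

def residual_sd(observed, predicted):
    """Standard deviation of the residuals: sqrt(sum of squared residuals / (n - 2))."""
    n = len(observed)
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    return sqrt(sse / (n - 2))

# Hypothetical data, made up for illustration.
xs = [1, 2, 3, 4]
ys = [3, 5, 4, 6]

# Assumed model: the LSRL for this data works out to y-hat = 2.5 + 0.8x.
predicted = [2.5 + 0.8 * x for x in xs]

s = residual_sd(ys, predicted)
print(round(s, 3))
```

Here the residuals are -0.3, 0.9, -0.9, and 0.3, so s is the square root of 1.8 / 2.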
Ways to show categorical variables
segmented bar graph, mosaic plot, pie chart
Boxplots (box-and-whisker plots)
showing data through quartiles
Describe Shape of distribution
skewness, symmetry, unimodal, bimodal, multimodal, approximately normal, bell curve
standard deviation
spread; a measure of variability that describes the typical distance of each value from the mean
extrapolation
the act of estimating by projecting known information beyond the range of the data; predictions made this way are often unreliable
z-score
a measure of how many standard deviations a value is from the mean: z = (value - mean) / SD
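A small Python sketch of the z-score calculation, assuming the usual formula (value minus mean, divided by standard deviation). The function name `z_score` and the test-score numbers are made up for illustration.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Hypothetical test with mean 500 and SD 100.
z_high = z_score(650, mean=500, sd=100)
z_low = z_score(420, mean=500, sd=100)
print(z_high, z_low)
```

A positive z-score means the value is above the mean, and a negative z-score means it is below.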
Minitab
A statistical software package used to perform statistical analysis. Designed to perform analysis as accurately as possible.
Frequency
(counts) number of times something occurs
What is a categorical variable?
A characteristic of an individual that takes on values that are names or labels. Shape is irrelevant. Ex. type of fruit.
Stemplot
A graphical representation of a quantitative data set. The leading digits of each data value are presented as the stem and the final digit is given as the leaf.
Power Transformation
A transformation in which a power/exponent is chosen, and then each original value is raised to that power to obtain the corresponding transformed value. Do NOT pick 0 as the exponent as that would make every value 1, and an exponent of 1 is NOT a transformation either.
What is a discrete variable?
A variable that can take only specific values in a given range. Ex. # of students in a class
explanatory variable (independent variable)
A variable that may explain changes in a second variable, or a variable that contains information that you have. (x)
normal distribution (bell curve)
Bell-shaped curve. Perfectly symmetric. Central tendency: the mean, median, and mode are all equal. The standard normal distribution has a mean of 0 and an SD of 1.
How to read Histograms?
Each bin includes its left endpoint but not its right endpoint, and the horizontal axis does not have to start at 0.
Larger SD
Data is farther apart
Describing a relationship between two variables
Direction, unusual features, form, and strength (DUFS)
marginal distribution
In a two-way table, the distribution of values of one variable among all individuals described by the table.
Ways to show quantitative data
Dotplots, stemplots, box and whisker plots, histograms
Dotplots
Each data value is shown as a dot above its location on a number line.
Empirical Rule (68-95-99.7 Rule)
Gives benchmarks for understanding how probability is distributed under a normal curve. In a normal distribution, about 68% of the observations are within one standard deviation of the mean, about 95% are within two standard deviations of the mean, and about 99.7% are within three standard deviations of the mean.
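These benchmarks can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `area_within` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard normal curve within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))
```

The areas come out to roughly 0.6827, 0.9545, and 0.9973, which the rule rounds to 68%, 95%, and 99.7%.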
segmented bar graph
Graph used to compare the distribution of a categorical variable in each of several groups. For each group, there is a single bar with "segments" that correspond to the different values of the categorical variable. The height of each segment is determined by the percent of individuals in the group with that value. Each bar has a total height of 100%.
Influential Point
A point that, if removed, substantially changes a (y-intercept), b (slope), and/or r.
normalcdf
Input: z-score or variable value. Output: area or probability.
InvNorm
Input: area or probability. Output: z-score or variable value.
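normalcdf and invNorm are calculator commands. As a sketch of what they compute, the Python stand-ins below use `NormalDist` from the standard `statistics` module. The function names `normalcdf` and `inv_norm`, their argument order, and the default mean 0 and SD 1 are assumptions for this example, not the calculator's own code.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Values in, area out: area under the normal curve between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

def inv_norm(area, mean=0.0, sd=1.0):
    """Area in, value out: the value with the given area to its left."""
    return NormalDist(mean, sd).inv_cdf(area)

# Area between z = -1.96 and z = 1.96 on the standard normal curve.
middle_area = normalcdf(-1.96, 1.96)

# z-score with 97.5% of the area to its left.
z_cutoff = inv_norm(0.975)

# Everything to the left of z = 1.5, using negative infinity as the lower bound.
left_area = normalcdf(float("-inf"), 1.5)
print(round(middle_area, 3), round(z_cutoff, 2), round(left_area, 4))
```

The two functions undo each other: feeding the area from one into the other returns the original cutoff.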
residual
Left over; the remaining difference between the observed and predicted value. Negative residual: the point is below the line of best fit. Positive residual: the point is above the line of best fit.
Does association imply causation?
No, only an experiment can show causation
Unusual Features
Outliers, high leverage, influential points
Relative Frequency
Percentage/Proportion
Direction
Positive or Negative
Correlation
Positive: y tends to go up as x goes up. Negative: y tends to go down as x goes up.
Lower outlier
Any value less than Q1-(1.5*IQR)
Higher outlier
Any value greater than Q3+(1.5*IQR)
Equation for IQR
Q3-Q1
IQR (interquartile range)
Q3-Q1 (middle 50%)
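The IQR and the 1.5*IQR outlier rule can be sketched in Python as follows. This assumes quartiles are found as the medians of the lower and upper halves of the sorted data (leaving out the overall median when n is odd); other software may use a slightly different quartile method. The function names and data are made up for illustration.

```python
from statistics import median

def quartiles(data):
    """Q1 and Q3 as medians of the lower and upper halves (needs at least 2 values)."""
    values = sorted(data)
    half = len(values) // 2
    lower_half = values[:half]    # excludes the middle value when n is odd
    upper_half = values[-half:]
    return median(lower_half), median(upper_half)

def find_outliers(data):
    """Values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    q1, q3 = quartiles(data)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

# Hypothetical data, made up for illustration.
data = [2, 4, 5, 6, 7, 8, 9, 10, 30]
q1, q3 = quartiles(data)
outliers = find_outliers(data)
print(q1, q3, q3 - q1, outliers)
```

For this data Q1 = 4.5 and Q3 = 9.5, so the IQR is 5, the fences are -3 and 17, and only 30 is flagged.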
What is a quantitative variable?
Quantitative variables are numeric like: Height, age, number of cars sold, SAT score
y-intercept
The predicted value of y when x=0.
cumulative frequency
The sums of the frequencies of the data values from smallest to largest.
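As a short illustration of frequency, relative frequency, and cumulative frequency together, the sketch below builds all three from one frequency table. The table (number of pets per student) is made up for illustration.

```python
from itertools import accumulate

# Hypothetical frequency table: number of pets -> count of students,
# listed from the smallest value to the largest.
frequencies = {0: 3, 1: 5, 2: 8, 3: 4}

total = sum(frequencies.values())

# Relative frequency: each count as a proportion of the total.
relative = {value: count / total for value, count in frequencies.items()}

# Cumulative frequency: running sum of the counts from smallest to largest.
cumulative = dict(zip(frequencies, accumulate(frequencies.values())))

print(relative)
print(cumulative)
```

The last cumulative frequency always equals the total number of observations.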
response variable (dependent variable)
The variable that shows the value you want to predict. (y)
Nonlinear relationships
Two variables may be related but not linearly; in this case the relationship is curved (curvilinear)
histograms
Used when data is continuous. The bars touch each other. Shows frequency distributions.
Scatterplot
a graphed cluster of points, each of which represents the values of two variables; the slope of the points suggests the direction of the relationship between the two variables
coefficient of determination
a measure of the amount of variation in the dependent variable about its mean that is explained by the regression equation. It gives the percent of the variation in y that is explained (accounted for) by the LSRL.
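One way to see this is to compute the coefficient of determination two ways: by squaring r, and as 1 minus the ratio of unexplained variation (squared residuals from the LSRL) to total variation in y. The sketch below uses made-up data and the standard least-squares formulas; the variable names are hypothetical.

```python
from statistics import mean

# Hypothetical data, made up for illustration.
xs = [1, 2, 3, 4]
ys = [3, 5, 4, 6]

x_bar, y_bar = mean(xs), mean(ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)   # total variation in y about its mean

# LSRL: y-hat = a + bx
b = s_xy / s_xx
a = y_bar - b * x_bar

# Variation left unexplained by the line (sum of squared residuals).
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

r = s_xy / (s_xx * s_yy) ** 0.5
r_squared_from_r = r ** 2
r_squared_from_variation = 1 - sse / s_yy
print(round(r_squared_from_r, 2), round(r_squared_from_variation, 2))
```

Both approaches give 0.64 for this data, meaning about 64% of the variation in y is accounted for by the LSRL.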
residual plot
a scatterplot of the regression residuals (y) against the explanatory variable (x)
What is a continuous variable?
a variable that can take on any value in an interval, so there are infinitely many possible values along a continuum. Ex. height
socsC
always use context
Transformation
applies a math operation to a variable
Slope Interpretation of Residuals
the predicted change in y for each 1-unit increase in x; the change in y hat over the change in x, or b/1
Smaller SD
data is closer together
High Leverage
data points whose x values are far from the mean of x
Conditional Distribution
describes the values of that variable among individuals who have a specific value of another variable
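The difference between a marginal and a conditional distribution can be sketched with a small two-way table. The table below (class year by answer to a yes/no survey question) is made up for illustration.

```python
# Hypothetical two-way table of counts: class year (rows) by answer (columns).
table = {
    "Freshman": {"Yes": 30, "No": 20},
    "Senior": {"Yes": 10, "No": 40},
}

grand_total = sum(sum(row.values()) for row in table.values())

# Marginal distribution of the answer: column totals over the grand total.
marginal = {
    answer: sum(row[answer] for row in table.values()) / grand_total
    for answer in ["Yes", "No"]
}

# Conditional distribution of the answer among seniors:
# the counts in one row over that row's own total.
senior_total = sum(table["Senior"].values())
conditional_senior = {
    answer: count / senior_total for answer, count in table["Senior"].items()
}

print(marginal)
print(conditional_senior)
```

Overall 40% answered Yes, but among seniors only 20% did, so knowing the class year changes the distribution of answers.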
Outlier in a Scatterplot
doesn't follow trend, large residual
Roundoff error
effect of rounding off results
skewed right distribution
has a majority of data values on the left and a tail on the right; the mean is greater than the median; best described by the median
association
knowing the value of one variable helps predict the value of the other
Relationship between residual plot and linear model
a linear model is appropriate if the residual plot shows no clear pattern (if it shows a curved pattern, such as a quadratic shape, the linear model is NOT a good model to choose)
What is the form of the relationship?
linear or nonlinear
Center
mean, median, mode
Mean/Standard Deviation
nonresistant to outliers
residual equation
observed y - predicted y ; y-y hat
standard deviation formula
the square root of the variance
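A brief sketch of the relationship between variance and standard deviation, assuming the sample versions that divide by n - 1. The data set is made up for illustration, and the hand calculation is compared against Python's standard `statistics` module.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical data, made up for illustration.
data = [2, 4, 4, 6, 9]

x_bar = mean(data)

# Sample variance: sum of squared deviations from the mean, divided by n - 1.
var_by_hand = sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)

# Standard deviation: the square root of the variance.
sd_by_hand = sqrt(var_by_hand)

print(var_by_hand, round(sd_by_hand, 3))
print(variance(data), round(stdev(data), 3))
```

Both routes give a variance of 7, and squaring the standard deviation recovers the variance.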
SD of residuals/Typical prediction error
The sum of the residuals from the LSRL is zero, so s is used to measure their typical size. Interpretation: the actual y-value is typically about s units away from the value predicted by the LSRL with x = the explanatory variable. The smaller s is, the better the predictions will be.
Mosaic plot (segmented bar charts)
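a segmented bar chart in which the width of each bar is proportional to the size of its group, so it is used to show the size of the sample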
Predicted value
what does the hat above the y represent?
What is a distribution?
what values a variable takes and how often it takes these values
Simpson's Paradox
when averages (or proportions) are taken within separate groups, the trend can appear to contradict the trend in the overall averages for the combined data
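A numerical sketch can make the paradox easier to see. All counts below are hypothetical, chosen only so that the reversal shows up: Hospital A has the higher survival rate within each patient group, yet Hospital B has the higher overall rate because it treats far fewer patients in poor condition.

```python
# Hypothetical counts: (survived, total) by hospital and patient condition.
outcomes = {
    "Hospital A": {"good": (590, 600), "poor": (210, 400)},
    "Hospital B": {"good": (870, 900), "poor": (30, 100)},
}

def rate(survived, total):
    """Proportion of patients who survived."""
    return survived / total

def overall_rate(groups):
    """Survival rate after combining all condition groups for one hospital."""
    survived = sum(s for s, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return survived / total

for hospital, groups in outcomes.items():
    by_group = {cond: round(rate(*counts), 3) for cond, counts in groups.items()}
    print(hospital, by_group, "overall:", round(overall_rate(groups), 3))
```

Within each condition Hospital A does better (about 0.983 vs 0.967 and 0.525 vs 0.3), but overall Hospital B does better (0.9 vs 0.8).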
Mean of a sample
x̅
LSRL (Least Squares Regression Line)
y-hat = a + bx; a is y-intercept and b is slope
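As a minimal sketch of how a and b can be found, the code below uses the standard least-squares formulas: the slope is the sum of the products of deviations divided by the sum of squared x deviations, and the line passes through (x̅, y̅). The function name `fit_lsrl` and the data are made up for illustration.

```python
from statistics import mean

def fit_lsrl(xs, ys):
    """Return (a, b) for y-hat = a + bx using the least-squares formulas."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar  # the LSRL passes through (x-bar, y-bar)
    return a, b

# Hypothetical data, made up for illustration.
xs = [1, 2, 3, 4]
ys = [3, 5, 4, 6]

a, b = fit_lsrl(xs, ys)
predicted = [a + b * x for x in xs]

# Residual = observed y - predicted y.
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]
print(round(a, 2), round(b, 2), round(sum(residuals), 6))
```

For this data the line is y-hat = 2.5 + 0.8x, and the residuals add up to zero (apart from tiny floating-point rounding).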
Mean of population
μ
Standard Deviation of population
σ