AP Statistics Chapter 5: Summarizing Bivariate Data
Predicted (fitted) values ŷ1, ŷ2, ..., ŷn
Obtained by substituting the x value for each observation in the data set into the least-squares line: ŷ1 = a + bx1, ..., ŷn = a + bxn
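As a minimal Python sketch (the coefficients and x values here are made up for illustration), each fitted value comes from plugging a sample x into the line:

```python
# Hypothetical least-squares coefficients: a = intercept, b = slope.
a, b = 2.0, 0.5
x_values = [1, 2, 3, 4]  # sample x values

# Predicted (fitted) value for each observation: y-hat_i = a + b * x_i.
y_hat = [a + b * x for x in x_values]
print(y_hat)  # [2.5, 3.0, 3.5, 4.0]
```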
residuals
Obtained by subtracting each predicted value from the corresponding observed y value. These are the vertical deviations from the least-squares line.
Influential Observation
An observation is potentially influential if it has an X value that is far away from the rest of the data (separated from the rest of the data in the x direction). To determine if the observation is in fact influential, we assess whether removal of this observation has a large impact on the value of the slope or intercept of the least-squares line.
Outlier
An observation that has a large residual. Outlier observations fall far away from the least-squares line in the y direction.
Independent (predictor or explanatory) variable
In a bivariate data set, the variable that will be used to make a prediction of the dependent variable. This is denoted by x. Also sometimes called the predictor variable or the explanatory variable.
Dependent (Response) Variable
In a bivariate data set, the variable whose value we would like to predict. This is denoted by y. Also sometimes called the response variable.
Notation of population correlation coefficient
ρ (the Greek letter rho)
Residual
The difference between an observed y value and the corresponding predicted y-value. The residuals from the least-squares line are the n quantities. Residual = Observed value - Predicted value
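For example (with made-up observed and fitted values), the residuals are just elementwise differences:

```python
# Hypothetical observed y values and the corresponding fitted values.
y_obs = [3.0, 2.5, 4.0]
y_hat = [2.5, 2.5, 4.5]

# Residual = observed value - predicted value (vertical deviation from the line).
residuals = [obs - pred for obs, pred in zip(y_obs, y_hat)]
print(residuals)  # [0.5, 0.0, -0.5]
```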
Total sum of squares
The sum of squared deviations from the sample mean is a measure of total variation in the observed y values.
Residual (error) sum of squares
The sum of the squared residuals is a measure of y variation that cannot be attributed to an approximate linear relationship (unexplained variation).
Logistic regression function
A model in which a linear function of the predictor variable(s) is passed through the logistic (sigmoid) function to predict the probability of the target categorical dependent variable.
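A minimal sketch of the logistic regression idea; the coefficients a and b below are made up, since in practice they are estimated from the data:

```python
from math import exp

def logistic_prediction(x, a=-1.0, b=2.0):
    """Pass the linear predictor z = a + b*x through the logistic function.

    The coefficients a and b are hypothetical values for illustration only.
    """
    z = a + b * x
    return 1 / (1 + exp(-z))  # a probability between 0 and 1

p = logistic_prediction(0.5)  # z = 0 here, so the predicted probability is 0.5
```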
Power transformation
A transformation in which a power/exponent is chosen, and then each original value is raised to that power to obtain the corresponding transformed value. Do NOT pick 0 as the exponent as that would make every value 1, and an exponent of 1 is NOT a transformation either.
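For instance, a square-root transformation (exponent 1/2) applied to a small made-up data set:

```python
# Power transformation with exponent p (p must not be 0 or 1).
p = 0.5  # square root: a common choice for pulling in a right-skewed variable
data = [1.0, 4.0, 9.0, 16.0]
transformed = [x ** p for x in data]
print(transformed)  # [1.0, 2.0, 3.0, 4.0]
```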
population correlation coefficient
(ρ) The correlation computed using all possible pairs of data values (x, y) taken from a population.
Pearson's sample correlation coefficient
A measure of the strength and direction of a linear relationship between two numerical variables. Denoted by r. Although there are several different correlation coefficients, Pearson's correlation coefficient is by far the most commonly used, and so the name "Pearson's" is often omitted and it is referred to as simply the correlation coefficient.
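The defining formula can be sketched in plain Python (no libraries; the data below is made up and chosen so the relationship is perfectly linear):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's sample correlation coefficient r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # cross deviations
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

print(pearson_r([1, 2, 3], [2, 4, 6]))  # perfectly linear, so r = 1.0
```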
Transformation
A simple function of the x and/or y variable that is then used in a regression; that is, the replacement of a variable by a function of that variable, for example replacing a variable x by the square root of x or the logarithm of x. In a stronger sense, a transformation is a replacement that changes the shape of a distribution or relationship.
Predicted (fitted) value
Result from substituting each sample x value into the equation for the least-squares line.
Residual plot
A scatterplot of the (x, residual) pairs: the residuals are plotted on the vertical axis and the independent variable on the horizontal axis. Isolated points or a pattern of points in a residual plot indicate potential problems. If the points are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a nonlinear model may be more appropriate.
a = ȳ - bx̄
The intercept of the least-squares line.
Least-squares line
The line that minimizes the sum of squared deviations. The least-squares line is also called the sample regression line.
Sum of squared deviations
The most widely used measure of the goodness of fit of a line y = a + bx to bivariate data (x1, y1), ..., (xn, yn).
Coefficient of determination
The proportion of variation in the observed y values that can be attributed to an approximate linear relationship between x and y. Denoted by r². The value of r² is often converted to a percentage (by multiplying by 100) and interpreted as the percentage of variation in y that can be explained by the approximate linear relationship between x and y.
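Sketching the computation in Python with made-up observed and fitted values, using the total and residual sums of squares defined earlier in this set:

```python
# r^2 = 1 - SSResid / SSTo (hypothetical observed and fitted values).
y_obs = [2.0, 4.0, 5.0, 7.0]
y_hat = [2.5, 3.5, 5.5, 6.5]

mean_y = sum(y_obs) / len(y_obs)
ss_total = sum((y - mean_y) ** 2 for y in y_obs)            # total sum of squares
ss_resid = sum((o - p) ** 2 for o, p in zip(y_obs, y_hat))  # residual sum of squares
r_squared = 1 - ss_resid / ss_total  # proportion of y variation explained
```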
Standard deviation about the least-squares line
The size of a "typical" deviation from the least-squares line. Roughly speaking, se is the typical amount by which an observation deviates from the least-squares line.
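se is computed as the square root of SSResid divided by n - 2; a short sketch with made-up values:

```python
from math import sqrt

# Hypothetical observed and fitted y values.
y_obs = [2.0, 4.0, 5.0, 7.0]
y_hat = [2.5, 3.5, 5.5, 6.5]

ss_resid = sum((o - p) ** 2 for o, p in zip(y_obs, y_hat))
# s_e: typical deviation of an observation from the least-squares line.
s_e = sqrt(ss_resid / (len(y_obs) - 2))
```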
Principle of least squares
A form of mathematical regression analysis that finds the line of best fit for a data set, providing a visual demonstration of the relationship between the data points. The principle states that the best-fitting straight line is the one that minimizes the sum of the squared errors, i.e., the squared residuals resulting from the differences between the observed values and the values predicted by the model. For the least-squares line ŷ = a + bx, the slope is b = r(sy/sx) and the intercept is a = ȳ - bx̄.
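A self-contained sketch of fitting a line by least squares (the data is made up):

```python
def least_squares(x, y):
    """Return intercept a and slope b of the least-squares line y-hat = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Slope: sum of x-y cross deviations over sum of squared x deviations.
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx  # intercept: a = y-bar - b * x-bar
    return a, b

a, b = least_squares([1, 2, 3, 4], [2, 4, 5, 7])
print(a, b)  # 0.5 1.6
```

Any other line through the same data would have a larger sum of squared residuals than this one.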
Scatterplot
a graphical depiction of the relationship between two variables
Notation for Residual
e
Sample correlation coefficient
measures both the strength and direction of the linear relationship between two variables
Notation for sample correlation coefficient
r
regression analysis
used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships
Equation of a line
y = a + bx, where a = intercept and b = slope