Units 7-10: Correlation, Regression, & Chi-squared-test

alpha (chi-squared)

0.05: the probability we accept of wrongly rejecting a true null hypothesis (a Type I error)

What values does R squared lie between

0 and 1 (worst to best fit)

What does a correlation of one mean?

A correlation of +1 indicates a perfect positive correlation, meaning that both variables move in the same direction together.

Covariance

A measure of how much two random variables vary together. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, the covariance is positive.
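
As a minimal sketch of this definition, the sample covariance can be computed by averaging the products of each variable's deviations from its mean. The function name and the data below are hypothetical, and the n - 1 denominator is an assumption (the usual sample convention).

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equal-length lists (n - 1 denominator)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of paired deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (len(xs) - 1)


hours = [1, 2, 3, 4, 5]    # hypothetical predictor values
scores = [2, 4, 5, 4, 5]   # hypothetical response values
cov_xy = sample_covariance(hours, scores)            # positive: they rise together
cov_reversed = sample_covariance(hours, scores[::-1])  # negative: they move oppositely
```

The sign carries the direction of the relationship; the magnitude depends on the units of both variables, which is why it is usually standardized into a correlation.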

Ordinary Least Squares (OLS)

A method for estimating the parameters of a linear regression model. The ordinary least squares estimates are the regression coefficients that minimize the sum of squared residuals, that is, the sum of squared differences between the actual and predicted values of the dependent variable.
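
For a single predictor, the minimization has a closed-form solution. The sketch below assumes that simple case; `ols_fit`, `sum_squared_residuals`, and the data are hypothetical names and values chosen for illustration.

```python
def ols_fit(xs, ys):
    """Closed-form OLS estimates (b0, b1) for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # slope
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared differences between observed and predicted Y."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


hours = [1, 2, 3, 4, 5]    # hypothetical data
scores = [2, 4, 5, 4, 5]
b0, b1 = ols_fit(hours, scores)
best = sum_squared_residuals(hours, scores, b0, b1)
# Any other line should do no better than the OLS line:
nudged = sum_squared_residuals(hours, scores, b0 + 0.1, b1 - 0.05)
```

Comparing `best` with `nudged` illustrates the criterion: moving the intercept or slope away from the OLS estimates increases the sum of squared residuals.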

What information does a contingency table contain?

Absolute frequencies of categorical variables and their sums (female/male; low/medium/high)

Define F test and when we see it

An F-test is any statistical test in which the test statistic has an F-distribution under the null hypothesis. It is most often used when comparing statistical models that have been fitted to a data set, in order to identify the model that best fits the population from which the data were sampled. We see it in regression output as the omnibus test of the overall model. The logic parallels the chi-squared test (which uses the chi-squared distribution, not the F-distribution): we examine the sampling distribution under the null hypothesis and check whether the observed statistic falls within the critical region.

Third Variable Problem

An impediment to establishing causality: the possibility that a third (confounding) variable influences both observed variables and explains the correlation between them

How to interpret slope B1

A one-unit increase in X is associated with a change of B1 units in Y (an increase if B1 is positive, a decrease if B1 is negative)

Re-written equation for slope (regression analysis)

Y = B0 + B1X

How to make inferences with regressions

Run a statistical hypothesis test on B1 with H0: B1 = 0, using either indirect or direct testing

What kind of equation is a regression?

Function (y = a + bx)

What descriptive plots are typically used when dealing with two categorical variables?

Grouped bar chart, mosaic plot

What are the hypotheses for a regression analysis?

H0: X has no effect on Y (B1 = 0); H1: X has effect on Y (B1 does not equal 0)

What is H0 and H1 for studying correlation?

H0: r equals 0 (no correlation); H1: r does not equal 0. Reject the null hypothesis if zero is not included in the confidence interval.

How can we test for normality?

Histogram; quantile-quantile (QQ) plot; Kolmogorov-Smirnov or Shapiro-Wilk test

Relationship between p and alpha

If p is less than or equal to alpha reject H0. If p is more than alpha don't reject H0.

What does a p-value of more than 0.05 tell us in a Kolmogorov-Smirnov test?

If the p-value is greater than 0.05, the null hypothesis of normality is not rejected: the data are consistent with a normal distribution (this does not prove that they are normal).

QQ plot

A Q-Q plot compares two probability distributions by plotting their quantiles against each other; it is commonly used to compare a data set to a theoretical model. Using Q-Q plots to compare two samples of data can be viewed as a non-parametric approach to comparing their underlying distributions. Each point (x, y) pairs a quantile of the first distribution (x) with the same quantile of the second (y). If the two distributions are similar, the points lie approximately on the line y = x; if they are linearly related, the points lie approximately on some straight line, though not necessarily y = x.

What is the equation for R squared

R^2 = r(x,y)^2, where r(x,y) is the correlation coefficient between X and Y (so R^2 is its square; this holds for regression with a single predictor)

What is the difference between regression vs. correlation?

Regression aims to model a linear functional relationship between X and Y; correlation aims to characterize the dependency of two variables by a single number. Regression can be extended to multiple predictors, whereas correlation involves only two variables. Order matters in regression (it matters whether we regress Y on X or X on Y) but not in correlation.

What is the criterion for a regression line?

The regression line must run through the middle of the cloud of data points

Residuals

The difference between an observed value of the response variable and the value predicted by the regression line: ei = Yi - Y^i. r^2 is an aggregated measure of the residuals.

What is the p-value based on in a regression?

t-value (estimate / standard error)

F statistic (omnibus test)

Tests the global H0 that all slope coefficients are zero (B1 = B2 = ... = 0). Rejecting it indicates that at least one coefficient differs from zero, without stating which one.

What are residuals (e)? (chi-squared)

the sample equivalent of the error term; distance of a measured value from the predicted value

mosaic plot

uses the area of rectangles to display the relative frequency of occurrence of all combinations of two categorical variables

Commutativity

x + y = y + x; xy = yx

What axis is the manipulated variable?

x-axis

Basic equation for slope

(y2 - y1) / (x2 - x1)

What does correlation not mean?

That two things are correlated does not mean that one causes the other. Correlation does not mean causality; in our example, ice cream is not causing people's deaths.

What does the chi-squared sampling distribution look like?

Right-skewed (asymmetric) and defined only for non-negative values, similar in shape to the F-distribution

When testing for regression - we use confidence intervals or critical regions?

confidence intervals

Predicting with regressions

Develop a regression formula by estimating B0 and B1 from the data, then plug in X values to calculate/predict Y.

Contingency table

displays counts of two categorical variables

Ei

error term - explains the noise/imperfections in the data

Chi-square test assumption(s)

expected cell frequencies > 5

How to tell which cells are responsible for the dependence in chi-squared?

Look at the mosaic plot: the cells with the greatest difference between observed and expected values

Is the F-statistic a normal distribution?

No, the F-distribution is right-skewed and takes only non-negative values

What do B0 and B1 represent?

regression parameters - intercept and slope

Relationship between Ei and ei

Residuals ei are the sample equivalent of the error term Ei. This matters because the regression assumptions are stated in terms of Ei, and the validity of the inferences drawn from the model depends on them.

What does regression allow us to do?

Make predictions, since we are fitting a line to a scatterplot. Order matters: it matters whether we regress Y on X or X on Y.

Regression assumptions (E related)

1. On average, the error term for each i is 0; 2. homoscedasticity: the variance of the error terms is similar across i; 3. error terms are uncorrelated across i, i'; 4. error terms are normally distributed; 5. linearity

What testing method is used with a non-parametric correlation?

Direct testing, Spearman's rho has no confidence interval

What would a R square of 1 mean?

100% of variation in Y can be explained by X

How to obtain t-statistic?

Divide B1 by its standard error. The t-value measures the size of the difference relative to the variation in your sample data. Put another way, t is simply the calculated difference expressed in units of standard error. The greater the magnitude of t, the greater the evidence against the null hypothesis.
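
A sketch of this calculation for a regression with one predictor follows. It assumes the usual formula for the standard error of the slope (residual variance with n - 2 degrees of freedom, divided by the sum of squared deviations of X); the function name and data are hypothetical. Turning t into a p-value needs the t-distribution, which the Python standard library does not provide, so that step is left out.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (b1, standard error of b1, t) for a one-predictor regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    # Residual variance: squared residuals averaged over n - 2 degrees of freedom.
    ssr = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    residual_variance = ssr / (n - 2)
    se_b1 = math.sqrt(residual_variance / sxx)
    return b1, se_b1, b1 / se_b1


hours = [1, 2, 3, 4, 5]    # hypothetical data
scores = [2, 4, 5, 4, 5]
b1, se_b1, t_value = slope_t_statistic(hours, scores)
```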

R squared

Coefficient of determination - measuring the accuracy of prediction in a regression ("goodness of fit"). Tells us how well the independent variable predicts the dependent variable. Can range between 0 and 1. If r squared = 0.8 then 80% of the variability in Y is "explained" by the variability in X
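
One way to see where the "explained variability" comes from is to compute R squared as 1 minus the ratio of residual to total sum of squares. This is a minimal sketch for a single predictor with hypothetical names and data; in that setting the result should equal the squared correlation coefficient.

```python
import math


def r_squared(xs, ys):
    """R^2 = 1 - SS_residual / SS_total for a one-predictor OLS fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def correlation(xs, ys):
    """Pearson correlation coefficient, used here only as a cross-check."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


hours = [1, 2, 3, 4, 5]    # hypothetical data
scores = [2, 4, 5, 4, 5]
r2 = r_squared(hours, scores)          # share of variability in Y explained by X
r = correlation(hours, scores)
```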

When testing for correlation - do we use confidence intervals or critical regions?

Confidence intervals. 95% of intervals constructed this way contain the true value, but this does not mean that there is a 95% probability that the true value lies in any one computed interval.

What does a good R squared depend on?

Context

Bivariate correlation

Correlation between two variables

How do you get Pearson's correlation coefficient?

Covariance divided by the product of the two standard deviations: r = cov(X, Y) / (sX * sY)
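
A short sketch of the calculation: the covariance is divided by the product of the two sample standard deviations, which removes the units. The function name and the data are hypothetical; `statistics.mean` and `statistics.stdev` come from the Python standard library.

```python
import statistics


def pearson_r(xs, ys):
    """Pearson's r: sample covariance over the product of the sample SDs."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))


hours = [1, 2, 3, 4, 5]    # hypothetical data
scores = [2, 4, 5, 4, 5]
r = pearson_r(hours, scores)
r_minutes = pearson_r([60 * h for h in hours], scores)  # rescaled X, same r
r_swapped = pearson_r(scores, hours)                    # order does not matter
```

The rescaled and swapped calls illustrate two properties of r: it is scale independent and commutative.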

Correlation

Degree of dependency between two variables

What do we do in regression?

Fit a line to a scatterplot and examine goodness of fit

monotonic relationship

In a monotonic relationship, the variables tend to move in the same relative direction, but not necessarily at a constant rate. In a linear relationship, the variables move in the same direction at a constant rate.

Partial correlation

Looking at relationship between two variables while controlling for (an) additional variable(s)
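
For one control variable, a common way to compute this is the first-order partial correlation formula built from the three pairwise Pearson correlations. The sketch below assumes that formula; the names and the deliberately contrived data (Z drives both X and Y) are hypothetical.

```python
import math


def pearson_r(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(xs, ys, zs):
    """Correlation of X and Y after removing the linear effect of Z from both."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


z = [1, 2, 3, 4, 5, 6]    # hypothetical third variable
x = [2, 1, 3, 4, 4, 7]    # z plus a small disturbance
y = [0, 3, 3, 4, 6, 5]    # z minus the same disturbance
raw = pearson_r(x, y)                     # positive, because both follow z
controlled = partial_correlation(x, y, z)  # negative once z is held constant
```

In this constructed example the raw correlation is positive only because both variables track Z; controlling for Z reverses the sign, which is the situation the third variable problem warns about.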

How to find the "expected" values (chi-squared)

Multiply the total sum of the row by the total sum of the column and divide by the grand total
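
The rule above can be sketched as follows, together with the chi-squared statistic that compares observed and expected counts. The function names and the 2 x 3 table are hypothetical.

```python
def expected_counts(observed):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def chi_squared_statistic(observed):
    """Sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )


# Hypothetical contingency table: rows = two groups, columns = low/medium/high.
observed = [[10, 20, 30],
            [20, 20, 20]]
expected = expected_counts(observed)
chi2 = chi_squared_statistic(observed)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
```

The statistic is then compared with the critical value of the chi-squared distribution for the given degrees of freedom and alpha.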

What does Spearman's correlation coefficient not require?

No normal distribution required; Linearity not required; Does not need to be metric (can be ordinal)
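
These properties follow from how Spearman's coefficient is usually computed: the values are replaced by their ranks and Pearson's formula is applied to the ranks. The sketch below assumes that approach, with tied values given their average rank; the names and data are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1   # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(xs, ys):
    return pearson_r(average_ranks(xs), average_ranks(ys))


x = [1, 2, 3, 4, 5]          # hypothetical data
y = [1, 4, 9, 16, 25]        # monotonic but not linear in x
rho = spearman_rho(x, y)     # perfect monotonic relationship
r = pearson_r(x, y)          # high, but below 1 because the relation is curved
```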

How should we interpret a correlation coefficient of 0?

No relationship. As one value increases, there is no tendency for the other value to change in a specific direction.

What are the two kinds of correlation coefficients?

Pearson's correlation coefficient and Spearman's correlation coefficient

How to make regression predictions?

Plug a new Xi into the equation

What does Yi represent?

Response variable/dependent variable (always metric)

Multiple Regression assumptions

Same as simple (single-predictor) regression, plus no multicollinearity

What does a normal QQ plot look like?

Sample quantiles are compared to the theoretical quantiles of a normal distribution. If the points fall approximately on a straight line, the data are approximately normal.
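
The points of a normal QQ plot can be sketched without any plotting library. The version below standardizes the sample and uses the plotting positions (i - 0.5) / n, which is one common convention and an assumption here; the function name and the sample are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def normal_qq_points(sample):
    """Return (theoretical quantile, standardized sample quantile) pairs."""
    n = len(sample)
    standard_normal = NormalDist()          # mean 0, standard deviation 1
    m, s = mean(sample), stdev(sample)
    points = []
    for i, value in enumerate(sorted(sample), start=1):
        p = (i - 0.5) / n                   # plotting position in (0, 1)
        theoretical = standard_normal.inv_cdf(p)
        points.append((theoretical, (value - m) / s))
    return points


sample = [4.1, 5.0, 5.9, 4.6, 5.4]          # hypothetical measurements
points = normal_qq_points(sample)
```

Plotting these pairs as a scatterplot gives the QQ plot; for roughly normal data they should lie close to the line y = x.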

What is the difference between the Kolmogorov-Smirnov test and Shapiro-Wilk test?

Shapiro-Wilk test is a specific test for normality, whereas the method used by Kolmogorov-Smirnov test is more general, but less powerful (meaning it correctly rejects the null hypothesis of normality less often). Kolmogorov-Smirnov test is used for sample size n ≥50.

Z-score

Simply put, a z-score (also called a standard score) gives you an idea of how far from the mean a data point is. But more technically it's a measure of how many standard deviations below or above the population mean a raw score is.
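
As a small worked example, a z-score subtracts the mean from the raw score and divides by the standard deviation. The data are hypothetical, and treating the list as the whole population (hence `statistics.pstdev`) is an assumption of this sketch.

```python
import statistics


def z_score(value, population):
    """Number of population standard deviations that value lies from the mean."""
    mu = statistics.mean(population)
    sigma = statistics.pstdev(population)
    return (value - mu) / sigma


scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical population: mean 5, SD 2
z_high = z_score(9, scores)         # two standard deviations above the mean
z_low = z_score(4, scores)          # half a standard deviation below the mean
```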

When do you use the Spearman's correlation coefficient, and not the Pearson's correlation coefficient?

Spearman's correlation coefficient is used when assumptions of Pearson's correlation coefficient aren't fulfilled

Properties of pearson's correlation coefficient (r)

Standardized (greater than or equal to -1, less than or equal to 1); Scale independent; Commutativity; Sign determines direction of dependency; Linear dependency

Pearson's chi-square test

Statistical test used to determine whether an observed value is significantly different from what would have been expected through random chance

What is the t-distribution?

Student's t-distribution (or simply the t-distribution) is any member of a family of continuous probability distributions that arise when estimating the mean of a normally distributed population in situations where the sample size is small and the population's standard deviation is unknown. The t-distribution is symmetric and bell-shaped, like the normal distribution. However, the t-distribution has heavier tails, meaning that it is more prone to producing values that fall far from its mean.

Variance formula

Sum of squared distances of the values from the mean, divided by n - 1 for a sample (or by N for a population)

What is a Kolmogorov-Smirnov test used to do? (broad)

The Kolmogorov-Smirnov test is used to test the null hypothesis that a set of data comes from a Normal distribution, using a p-value.

Sum of squares

The sum of squares is the sum of the square of variation, where variation is defined as the spread between each individual value and the mean.

What is the difference between correlation and slope?

The value of the correlation indicates the strength of the linear relationship. The value of the slope does not. The slope interpretation tells you the change in the response for a one-unit increase in the predictor.

What do correlation coefficients do?

They represent the dependency of two metric variables (a scatterplot) in a single, standardized number

How to interpret intercept B0

Value of Y for X = 0

What does Xi represent?

Value of predictor/independent variable

How to interpret an R^2 of 0.13?

We can explain only 13% of the variance in Y based on X. R^2 alone does not tell us whether to reject the null hypothesis; that decision depends on the p-value.

Describe relationship between Yi and Xi

When we increase Xi by one unit, how much does Yi change

What does the chi-squared test measure?

Whether or not two variables are independent from one another. Asks, what would frequencies look like if they were independent vs not (create table under null hypothesis, and compare with actual values; observed vs expected -- creates residuals)

What assumptions are made when using pearson's correlation?

X&Y are metric variables; linear relationship; in small samples each variable would follow a normal distribution

regression equation

Yi = (B0) + (B1)(Xi) + error

