507 MT

what is a non-parametric alt to one-way ANOVA?

Kruskal-Wallis test. Use it when you have more than 2 groups to compare and the underlying distributions of the groups are far from normal, or the data are ordinal
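As a rough sketch of how the Kruskal-Wallis statistic is built from ranks rather than raw values: pool all groups, rank the pooled data (ties share the average rank), then compute H = 12 / (N(N+1)) * sum(R_j^2 / n_j) - 3(N+1), where R_j is the rank sum of group j. The function names `average_ranks` and `kruskal_wallis_h` and the data are made up for illustration, and no tie correction is applied; a statistics package would also report the p-value.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) upward; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H statistic without tie correction (illustrative only)."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)


# Three made-up groups with no overlap at all
h_stat = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

A large H (compared against a chi-square distribution with k-1 degrees of freedom) suggests that at least one group tends to have larger or smaller values than the others.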

Limitations of simple linear regression

The correlation/linear regression quantifies only the LINEAR relationship between two variables. The correlation/regression should NEVER be extrapolated beyond the observed ranges of the variables.

simple linear regression (SLR)

given a relationship between x and y, create a mathematical model to predict y given a value of x

Disadvantages of nonparametric tests

if the data do follow a normal distribution, nonparametric tests are less powerful than parametric tests. they do not use all the info provided by the data bc they use ranks instead of the og values.

why is R^2 not a good measure of goodness of fit in multiple regression?

R2 never decreases when you add another independent variable to the regression, and almost always increases, even if the new variable has no real relationship with y.

what test would you use to compare two population means between two independent groups?

independent t-test

what is a in the regression line equation?

intercept

Which hypothesis do we usually assume to be correct unless there is convincing evidence against it?

null

Which of the following is a parametric test 1) Wilcoxon rank sum test 2) Mann Whitney U test 3) ANOVA 4) Kruskal Wallis test

ANOVA

what test would you use to compare 3+ population means?

ANOVA

To measure goodness of fit for multiple regression, we can use... a. R2 b. Adjusted R2 c. F test

Adjusted R2

what to do if no explanation for the outlier is found?

one has to decide whether to leave or omit the corresponding observation

if an alternative hypothesis uses < or >, what sided test is it?

one sided

what is y in the regression line equation?

outcome/dependent variable

what sample size is usually taken as large enough for the central limit theorem to apply -- so that the sample mean is approximately normally distributed?

over 30 (a rule of thumb)

Which of the following is true for parametric methods 1) parametric methods compare means 2) parametric methods use ranks instead of original values 3) parametric methods do not assume normal distribution. 4) parametric methods are better for ordinal outcomes (e.g. Likert scales).

parametric methods compare means

what parametric test can you use to compare two continuous variables?

pearson's correlation coefficient or simple linear regression

how would a scatter plot with an x that goes up as y goes up be described?

positive, linear association

in a multiple regression equation, what does x1 to xp represent?

predictor variables

what is x in the regression line equation?

predictor/independent variable

The output of a regression is a function that...

predicts the dependent variable based upon values of the independent variables.

with hypotheses testing, what are decisions based on?

probability

method of least squares

process of fitting a mathematical function to a set of measured points by minimizing the sum of the squares of the distances from the points to the curve / lowest total sum of squared residuals.

In simple linear regression, R^2=

r^2: correlation coefficient squared
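A small numeric check of this identity, using made-up data: the squared Pearson correlation and the R^2 computed from the fitted least-squares line come out the same. All variable names here are illustrative.

```python
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]  # made-up illustrative data

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)  # total sum of squares of y
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))

r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient

slope = sxy / sxx  # least-squares slope
intercept = my - slope * mx
sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
r_squared = 1 - sse / syy  # proportion of variation in y explained by the line
```

The identity holds only for simple linear regression (one predictor, with an intercept); with several predictors R^2 is no longer the square of any single pairwise correlation.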

in a multiple regression equation, what does epsilon (E) represent?

random error

what is e in the regression line equation?

random error

spearman's correlation coefficient is based on

ranks

non-parametric / distribution free methods

require no distributional assumptions

when fitting a regression line, what is the distance between the point and a line called?

residual

how is ρ, a population parameter, estimated with two continuous variables?

sample correlation coefficient, r

assumption for chi-square test?

the sample size has to be large: expected frequency >5 in each cell. if not, use fisher's exact test instead (some software reports it automatically).
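To make the "expected frequency" condition concrete, here is a minimal sketch that computes the expected count for each cell (row total * column total / grand total), checks the >5 rule from the card, and computes the chi-square statistic. The table and the function names are invented for illustration.

```python
def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Sum over cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))


observed = [[10, 20], [30, 40]]  # made-up 2x2 table of counts
expected = expected_counts(observed)
large_enough = all(e > 5 for row in expected for e in row)
chi2 = chi_square_statistic(observed)
```

Note that the rule applies to the expected counts, not the observed ones.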

alpha

significance level

what is beta in the regression line equation?

slope of the regression line / the change in the dependent variable for each one-unit change in the independent variable.

what nonparametric test can you use to compare two continuous variables?

spearman's correlation coefficient

identification of outliers

standardized residual > 3 or <-3
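A minimal sketch of this rule for simple linear regression, assuming the standardized residual is the raw residual divided by the residual standard error (with n - 2 degrees of freedom). Statistical packages may additionally adjust for leverage, so their values can differ slightly. The data and the function name are made up.

```python
from math import sqrt


def standardized_residuals(x, y):
    """Residuals of the least-squares line divided by the residual standard error."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom (assumed scaling).
    s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return [e / s for e in residuals]


x = list(range(20))
y = [2 * a for a in x]
y[10] += 50  # plant one made-up aberrant observation

z = standardized_residuals(x, y)
flagged = [i for i, v in enumerate(z) if abs(v) > 3]  # indices of suspected outliers
```

With very small samples this rule can never fire, because a single residual cannot be much larger than the residual standard error it contributes to.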

forward selection

starting with no variables in the model, adding the variable (if any) which is the most statistically significant, and repeating this process until none improves the model to a statistically significant extent.

F test on ANOVA

testing the overall significance of all independent variables altogether.

what should be used for goodness of fit in multiple regression?

Adjusted R^2. It will adjust for the number of independent variables included in the regression. It will not always increase as you add more independent variables.
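The usual formula is adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - p - 1), where n is the number of observations and p the number of independent variables. The sketch below (function name and numbers are illustrative) shows a case where adding a nearly useless variable raises R^2 slightly but lowers adjusted R^2.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a model with p predictors fitted to n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Made-up example: a 4th predictor nudges R^2 from 0.800 to 0.801
three_predictors = adjusted_r2(0.800, n=50, p=3)
four_predictors = adjusted_r2(0.801, n=50, p=4)
penalised = four_predictors < three_predictors
```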

What is a multiple regression analysis?: a. An analysis that determines how multiple independent variables will impact a response b. A series of linear regression that lead to a model for system behavior c. A hypothesis test that considers a problem from multiple perspectives to determine the best analytical approach

An analysis that determines how multiple independent variables will impact a response

Parametric methods

Assume the data comes from some underlying distribution whose general form is known. Efficient if the distribution is correctly assumed.

what is wilcoxon rank-sum test based on?

the ranks of the individual observations rather than on their actual values, which would be used in a t-test

beta <0

then as x increases by 1, the expected value of y=a+bx will decrease by |b| (it changes by b, which is negative)

beta >0

then as x increases by 1, the expected value of y=a+bx will increase by b

beta=0

there is no linear relationship between x and y

how to find an explanation for outliers?

Check additional information about the data collection, you might be able to avoid the outlier (e.g. by correcting a misprint, or introducing an extra explanatory variable)

The usage of QQ/PP plot is to:

Check whether the data or residuals follow a normal distribution.

what can cause an outlier?

(unforeseen) special circumstances about the particular observation; measurement errors; random variation

what can a linear regression model be used for?

-Use the known value of one variable to predict the value of another -Know the incremental effect of change on one variable to change in another

The closer to __ or __, the more correlated the variables are

1 or -1

Ranking procedure for the wilcoxon-rank sum test

1. Combine the data from the two groups, & order the observations from lowest to highest
2. assign ranks to the values, with lowest rank going to the smallest observation
3. if a group of observations has the same value, assign the average rank for the group to every observation in the group
4. compute the sum of the ranks for each group
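The ranking steps above can be sketched in a few lines. The two groups are made-up data and the variable names are illustrative; the value 5 appears in both groups so the tie rule is exercised.

```python
group_a = [3, 5, 8]
group_b = [5, 9, 12]

# Step 1: combine the two groups and order from lowest to highest
pooled = sorted(group_a + group_b)

# Steps 2-3: record the 1-based positions of each value; tied values
# receive the average of the positions they occupy
positions = {}
for pos, value in enumerate(pooled, start=1):
    positions.setdefault(value, []).append(pos)
avg_rank = {value: sum(p) / len(p) for value, p in positions.items()}

# Step 4: sum the ranks within each group
rank_sum_a = sum(avg_rank[v] for v in group_a)
rank_sum_b = sum(avg_rank[v] for v in group_b)
```

A quick sanity check is that the two rank sums add up to N(N+1)/2, here 6 * 7 / 2 = 21.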

what are the three assumptions for ANOVA?

1. each group is independent & randomly sampled 2. normality: n>30 or normal distribution 3. homoscedasticity: population variances are equal

The following formula represents a regression model which uses the number of days a student spent on studying HPH507 to predict their score on the final test (%). y = 5x + 35. What is the expected exam score if the student did not spend any days studying?

35

The following formula represents a regression model which uses the number of days a student spent on studying HPH507 to predict their score on the final test (%). y = 5x + 35. What is the value of Y when X is 0 (intercept)?

35

The following formula represents a regression model which uses the number of days a student spent on studying HPH507 to predict their score on the final test (%). y = 5x + 35. what is the extra % scored on test for every day spent on studying

5
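The three cards above read the intercept and slope straight off y = 5x + 35. A tiny sketch (the function name is made up) shows the same readings, plus why the model should not be extrapolated far beyond the observed range of x.

```python
def predicted_score(days_studied):
    """Expected final test score (%) from the cards' model y = 5x + 35."""
    intercept = 35  # expected score with zero days of study
    slope = 5       # extra percentage points per additional day
    return slope * days_studied + intercept


score_no_study = predicted_score(0)
gain_per_day = predicted_score(4) - predicted_score(3)
score_14_days = predicted_score(14)  # exceeds 100%, which is impossible
```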

how to fit a regression line?

A good line will make the distances from the observed values to the predicted values on the regression line as small as possible.

Which of the following is false for simple linear regression 1) It measures the relationship between a continuous outcome and a continuous explanatory variable 2) The regression equation has two parameters: intercept and slope 3) If the value of the slope is negative, the relationship between the continuous outcome and the continuous explanatory variable is negative. 4) Correlations means causation.

Correlations means causation.

t test for individual beta?

Each post-hoc t-test based on the multiple regression is testing the individual significance of each independent variable.

What does the least squares method do exactly?

Finds those (best) values of the intercept and slope that provide us with the smallest value of the residual sum of squares
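For simple linear regression the least-squares values have a closed form: slope b = Sxy / Sxx and intercept a = mean(y) - b * mean(x). The sketch below (made-up data, illustrative function names) fits the line and then confirms that nudging either parameter only makes the residual sum of squares larger.

```python
def fit_line(x, y):
    """Return (intercept, slope) minimising the residual sum of squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope


def rss(intercept, slope, x, y):
    """Residual sum of squares for the line y = intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


x = [0, 1, 2, 3]
y = [1, 3, 4, 8]  # made-up illustrative data
a, b = fit_line(x, y)
best = rss(a, b, x, y)

# Any nearby line has a residual sum of squares at least as large.
nearby = [rss(a + da, b + db, x, y)
          for da in (-0.1, 0, 0.1) for db in (-0.1, 0, 0.1)]
```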

confidence limits of prediction

For the prediction of y for a new member of the population (whose values were not used in constructing the regression line); shown as the dotted lines on the plot

what would be the null-hypothesis for a wilcoxon-rank sum test?

H0: the population medians are equal

why determine the coefficient of determination (R^2)?

If a linear relationship exists; it is also useful to measure the strength of the relationship

pagano example

If graphed, holding one variable constant produces a two-dimensional graph for the other variable.

Which of the following is true for Pearson correlation 1) It uses ranks instead of original values 2) Its value closer to zero, the correlation is stronger 3) It can be used for ordinal data 4) If its value is positive, the correlation is positive

If its value is positive, the correlation is positive

A multiple regression model has the form: y = 2 + 3 X1 + 4 X2. As X1 increases by 1 unit (holding X2 constant), y will... a. Increase by 3 units b. Decrease by 3 units c. Increase by 4 units

Increase by 3 units
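"Holding X2 constant" can be checked directly: whatever values X1 and X2 start at, raising X1 by one unit changes the prediction by exactly the X1 coefficient. The function name below is made up; the coefficients are the ones in the card.

```python
def predict(x1, x2):
    """Prediction from the card's model y = 2 + 3*X1 + 4*X2."""
    return 2 + 3 * x1 + 4 * x2


# Increase X1 by 1 at several starting points, keeping X2 fixed each time
changes = [predict(x1 + 1, x2) - predict(x1, x2)
           for x1 in (0, 5) for x2 in (-1, 0, 10)]
```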

Non-parametric methods

Make no such distributional assumptions. No assumption is made on the shape of the distribution, or the CLT seems inapplicable bc of small sample size. Distribution free or parameter free. Typically applied to pops that take on a ranked order. Robust.

For a QQ/PP plot of standardized residuals from a multiple regression, the normality assumption will be satisfied if

Most of the points fall on the diagonal line

Does a small p value mean the data is consistent with H0?

No

Which of the following is true for nonparametric methods 1) Nonparametric methods compare means 2) Nonparametric methods are always good for small sample size studies 3) Nonparametric methods can not be used for continuous outcomes. 4) Nonparametric methods are better for ordinal outcomes compared to parametric methods.

Nonparametric methods are better for ordinal outcomes compared to parametric methods.

What types of data require a multiple regression analysis?: a. Multiple discrete Y and one continuous X b. One discrete Y and multiple continuous X's c. One continuous Y and multiple continuous X's

One continuous Y and multiple continuous X's

multiple regression

One continuous dependent variable and more than one independent variable

Which of the following is a parametric test? 1) Wilcoxon rank sum test 2) Spearman correlation 3) Pearson correlation 4) Kruskal Wallis test

Pearson correlation

is pearson's or spearman's correlation coefficient sensitive to outliers?

Pearson's is sensitive to extreme observations; spearman's is robust to extreme observations and thus is recommended particularly when "outliers" are present, or data is far from normal, or when the data is ordinal.
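A small made-up example of this difference: y rises with x in every step, but the last y value is extreme. Spearman's coefficient (Pearson's formula applied to the ranks) stays at 1, while Pearson's is pulled well below 1. Function names are illustrative.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Average ranks: (count of smaller values) + (count of equal values + 1) / 2."""
    return [sum(v < u for v in values) + (sum(v == u for v in values) + 1) / 2
            for u in values]


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 2, 3, 4, 100]  # last observation is extreme

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```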

QQ-plot

Plotting the quantiles of the observed standardized residuals (zresid) against the corresponding quantiles of a standard normal distribution

if you have a small sample size (n<30), what should be used to check normality?

QQ plot

To check normality assumption, we need to check... a. QQ plot of Y value b. QQ plot of X value c. QQ plot of standardized residual

QQ plot of standardized residual

Which of the following measures is for comparing the goodness of the fit of regression models?

R-square

Which of the following statements about the coefficient of determination R2 is FALSE?

R2 measures the correlation between two continuous variables

Regression analysis usage

Regression analysis is not just used for prediction and understanding correlations, but for inference about the relationship between a factor and an outcome / attempt to explain the variation in a dependent variable using the variation in independent variables.

Reporting outliers

Report model results both with and without the outliers or influential observations to see how much they differ. Alternatively, transform the data (logarithm) in such a way that outlying points are pulled closer towards the general trend in the transformed data. Or nonparametric methods

regression analysis steps

Step 1: Model fitting (least squares estimation)
Step 2: Model evaluation (adjusted R2)
Step 3: Diagnostics (residual plots)
Step 4: If the model fits badly or the assumptions are severely violated, try a transformation of Y (e.g. logarithm), outlier removal ... -> back to Step 1
Step 5: After the final model is settled, make inference: hypothesis testing for zero slope (β=0), confidence interval for the slope β, C.I. for the predicted values

What does a simple linear regression analysis examine?

The relationship between one dependent and one independent variable

Nonparametric methods can be used when 1) The outcome measure follows a normal distribution 2) The sample size is small and the outcome measure follows a normal distribution 3) There are three groups for comparisons. 4) The sample size is large but the outcome measure is ordinal

The sample size is large but the outcome measure is ordinal *if you see normal distribution, choose a parametric test regardless of sample size

The linear regression equation for the data is y = 1.5x + 12. Interpret the slope of the equation.

The slope is 1.5. It means that it costs 1.5 dollars per topping.

Uses of multiple regression

Use several variables at once to explain the variation in a continuous dependent variable. Isolate the unique effect of one variable on the continuous dependent variable while taking into consideration that other variables are affecting it too

r < 0

Variables are negatively correlated (as one increases or decreases, the other decreases or increases accordingly).

r > 0

Variables are positively correlated (as one increases or decreases, the other increases or decreases also).

r=0

Variables are uncorrelated (no linear relationship).

If data points do not show a linear relationship when trying to do a linear regression, what should be used in the equation to best fit the curve?

X^2

influential point

a point whose observed value has a great influence on the analysis. the statistical analysis is v sensitive to a single (or a few) data point(s), in the sense that if the value of this point is changed even slightly, the outcome of the analysis alters greatly

R^2 = 1

all variation in y can be explained by variation of x

what is a type I error probability denoted as?

alpha

outlier

an observation which differs from the main trend in the data

what does a best model for multiple regression include?

a balance: enough predictors to explain the variation in y, but as few predictors as possible, since the variance of predictions increases as the number of predictors increases. A simple model is also easier to interpret.

why care about assumptions when using t-test?

bc the t-test assumes the data follow a normal distribution. if the normality assumption is violated, the results (p-values, confidence intervals) may not be valid.

why are parametric methods called parametric?

bc they assume the data come from a distribution described by a few parameters (e.g. a normal distribution with its mean and variance)

what is a type II error probability denoted as?

beta

R^2 takes values...

between 0 and 1. The closer it is to 1, the better the fit of the model. Acceptable values depend on the application

what test would you use to compare two population proportions or categorical-categorical association?

chi-square test

in a multiple regression equation, what does Bi represent?

coefficient for variable Xi (the change in y for every unit increase in Xi)

R^2

coefficient of determination: A summary measure of goodness of fit. the proportion (%) of the total variance that is "explained" by the regression line, as measured by the regression sums of squares / how much % of variation in y explained by the regression
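Written out as a formula (the SS subscripts are labels chosen here, not notation from the course), the definition above for a least-squares fit with an intercept is:

```latex
\[
R^2 \;=\; \frac{SS_{\text{regression}}}{SS_{\text{total}}}
    \;=\; 1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}},
\]
\[
SS_{\text{total}} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
SS_{\text{regression}} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, \qquad
SS_{\text{residual}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 .
\]
```

Here \(\hat{y}_i\) is the value predicted by the regression for observation i and \(\bar{y}\) is the mean of the observed y values, so R^2 = 0.6 reads as "60% of the variation in y is explained by the regression".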

how is the best model for multiple regression selected?

compare models with fewer predictors to models with more predictors in order to choose the "best" model.

on sas, when looking at the scatter plot, what is the shaded area?

confidence limits for the average value of y given x (based on the collected data)

what sort of outcome should be expected in a sample we use two-sample independent t-test?

continuous

what type of outcome is expected in a sample you would use ANOVA for?

continuous

in a multiple regression equation, what does Y represent?

dependent variable

Advantages of non-parametric tests

do not require that the underlying pop be normally distributed. less sensitive to measurement error bc they use ranks. can be used for ordinal as well as continuous data.

what symbol is usually used for null hypothesis?

equal

correlation analysis

formal measure of the nature and strength of association

automatic selection methods

forward selection and backward elimination

backward elimination

involves starting with all candidate variables, deleting the variable (if any) which is the least significant, and repeating this process until no further nonsignificant variables can be deleted.

P-value

the probability of obtaining sample data at least as extreme as those observed if H0 is true; a measure of how consistent the sample data are with H0

which measure of central tendency is less sensitive to outliers?

median

What do non-parametric methods compare?

medians. non-parametric methods sort the data first, then perform tests on the ranks rather than og values

simple regression

model the possible relationship between a normally distributed dependent variable y and an independent variable x

linear regression model

model the possible relationship between a normally distributed outcome variable y and one or more predictor variables, where the independent variable/x's may be either continuous or categorical variables

assumptions for regression

use a residual plot, which plots the predicted values against the standardized residuals. check for Linearity, Normality, Homoscedasticity, Independence

how would a scatter plot with an x that goes up as y goes down be described?

negative, linear association

how would a scatter plot with an x that goes up as y varies in no predictable way be described?

no association

how would a scatter plot in which x and y have a relationship that cannot be called linear be described?

non-linear association

Non-parametric methods are often used when the data is

non-normal / sample size under 30, ordinal

Wilcoxon rank-sum test (Mann-Whitney U test)

non-parametric alt to the two-sample independent t-test. used to compare 2 independent (unpaired) populations when the sample size in each group is <30 and/or the data are far from normal, or when the response is ordinal.

what should be used if normality is violated?

non-parametric method

what type of distribution is usually assumed for t-test?

normal

are influential points outliers?

not necessarily an outlier as defined above. Influential points do not necessarily constitute a problem

Correlation can be used... 1) to evaluate the relationship between two categorical variables 2) to evaluate the relationship between a continuous variable and a categorical variable 3) to compare means between two groups 4) to evaluate the relationship between two continuous variables

to evaluate the relationship between two continuous variables

if an alternative hypothesis uses a not equal to sign, what sided test is it?

two

when should a chi-square test be used?

two variables, both discrete

scatter plot

visual check to see how the variables are related. can tell us at a glance whether two variables are likely to be related, and how they are related.

If the p-value of the F test is less than 0.05

we reject the null hypothesis

What is a non-parametric alt to the paired t-test? when should it be used?

wilcoxon signed-rank test. used for paired or matched data when the sample size is <30 and/or the data are far from normal, or when the data are ordinal

R^2 = 0

x gives no information about y

in a multiple regression equation, what does B0 represent?

y-intercept of the line

regression line equation

y=a+bx+e

Does a high p value mean the data is consistent with H0?

yes

