MKTG DATA ANALYSIS Exam three retake, ch 7,8,9,11

Perfect collinearity

exists when at least one predictor in a regression model is a perfect linear combination of the others (the simplest example being two predictors that are perfectly correlated - they have a correlation coefficient of 1 or −1).

Independent errors

for any two observations in regression the residuals should be uncorrelated (or independent).

βi

standardized regression coefficient. Indicates the strength of relationship between a given predictor, i, of many and an outcome in a standardized form. It is the change in the outcome (in standard deviations) associated with a one standard deviation change in the predictor.

Levene's test tests whether: A.Data are normally distributed. B.The variances in different groups are equal. C.The assumption of sphericity has been met. D.Group means differ.

B

Bivariate correlation

a correlation between two variables.

Outcome variable

a variable whose values we are trying to predict from one or more predictor variables.

Differences between group means can be characterized as a regression (linear) model if: A.The experimental groups are represented by a binary variable (i.e. coded 0 and 1). B.The outcome variable is categorical. C.The groups have equal sample sizes. D.Differences between group means cannot be characterized as a linear model, they must be analysed with an independent t-test.

A

R2 is known as the: A.Coefficient of determination. B.Multiple correlation coefficient. C.Partial correlation coefficient. D.Semi-partial correlation coefficient.

A

Rank the score of 5 in the following set of scores: 9, 3, 5, 10, 8, 5, 9, 7, 3, 4 A.4.5 B.4 C.3 D.6

A
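
The tied-rank rule behind this answer (tied scores share the average of the positions they would occupy in the sorted data) can be sketched in a few lines of Python; `mid_ranks` is a hypothetical helper, not a library function:

```python
# Mid-rank (average rank) for tied scores: tied values share the mean
# of the 1-based positions they occupy in the sorted data.
def mid_ranks(scores):
    ordered = sorted(scores)
    return {x: sum(i + 1 for i, s in enumerate(ordered) if s == x) / ordered.count(x)
            for x in set(scores)}

scores = [9, 3, 5, 10, 8, 5, 9, 7, 3, 4]
print(mid_ranks(scores)[5])  # the two 5s sit at positions 4 and 5 -> 4.5
```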

The covariance is: A.All of these. B.A measure of the strength of relationship between two variables. C.Dependent on the units of measurement of the variables. D.An unstandardized version of the correlation coefficient.

A

What is b0 in regression analysis? A.The value of the outcome when all of the predictors are 0. B.The relationship between a predictor and the outcome variable. C.The value of the predictor variable when the outcome is zero. D.The gradient of the regression line.

A

Which of these statements describes a Bonferroni correction? A.You apply a criterion for significance based on the usual criterion for significance (.05) divided by the number of tests performed. B.You divide the F-ratio by the number of tests performed. C.The degrees of freedom are corrected to make the F-ratio less significant. D.The error in the model is adjusted for the number of tests performed.

A

A correlation of .7 was found between time spent studying and percentage on an exam. What is the proportion of variance in exam scores that can be explained by time spent studying? A..70 B..49 C..30 D..7

B

A researcher measured people's physiological reactions to horror films. He split the data into two groups: males and females. The resulting data were normally distributed and men and women had equal variances. What test should be used to analyse the data? A.Dependent t-test B.Independent t-test C.Mann-Whitney test D.Wilcoxon signed-rank test

B

A researcher measured people's physiological reactions while watching a horror film and compared them to when watching a comedy film, and a documentary about wildlife. Different people viewed each type of film. The resulting data were normally distributed and the variances across groups were similar. What test should be used to analyse the data? A.Repeated-measures analysis of variance B.Kruskal-Wallis test C.Friedman's ANOVA D.Independent analysis of variance

D

A researcher was interested in stress levels of lecturers during lectures. She took the same group of 8 lecturers and measured their anxiety (out of 15) during a normal lecture and again in a lecture in which she had paid students to be disruptive and misbehave. What test is best used to compare the mean level of anxiety in the two lectures? A.Independent samples t-test B.Paired-samples t-test C.One-way independent ANOVA D.Mann-Whitney test

B

Which of the following statements about outliers is not true? A.Outliers are values very different from the rest of the data. B.Influential cases will always show up as outliers. C.Outliers have an effect on the mean. D.Outliers have an effect on regression parameters.

B

What does the error bar on an error bar chart represent? A.The confidence interval around the mean. B.The standard error of the mean. C.The standard deviation of the mean. D.It can represent any of these.

D

How much variance has been explained by a correlation of .9? A.18% B.9% C.81% D.None of these

C

Which of the following is not a type of post hoc test? A.Games-Howell B.Bonferroni C.Cramér D.REGWQ

C

Which of the following statistical tests allows causal inferences to be made? A.Analysis of variance B.Regression C.None of these, it's the design of the research that determines whether causal inferences can be made. D.t-test

C

For which regression assumption does the Durbin-Watson statistic test? A.Linearity B.Homoscedasticity C.Multicollinearity D.Independence of errors

D

If Pearson's correlation coefficient between stress level and workload is .8, how much variance in stress level is not accounted for by workload? A.20% B.2% C.8% D.36%

D

The t-test tests for: A.Differences between means B.Whether a correlation is significant C.Whether a regression coefficient is equal to zero D.All of these

D

When variances across independent groups are unequal, which of the following is not an appropriate course of action? A.Use Games-Howell post hoc tests. B.Use Kruskal-Wallis test. C.Use Welch's F-ratio. D.Use Friedman's ANOVA.

D

Pearson's correlation coefficient

Pearson's product-moment correlation coefficient, to give it its full name, is a standardized measure of the strength of relationship between two variables. It can take any value from −1 (as one variable changes, the other changes in the opposite direction by the same amount), through 0 (as one variable changes the other doesn't change at all), to +1 (as one variable changes, the other changes in the same direction by the same amount).
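
The definition above can be sketched directly from the formula (covariance divided by the product of the standard deviations); `pearson_r` is a hypothetical helper, not a library call:

```python
# Pearson's r from its definition: the (n - 1) divisors cancel,
# so raw sums of cross-products and squared deviations suffice.
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear, same direction -> 1.0
```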

t-statistic

Student's t is a test statistic with a known probability distribution (the t-distribution). In the context of regression it is used to test whether a regression coefficient b is significantly different from zero; in the context of experimental work it is used to test whether the differences between two means are significantly different from zero. See also paired-samples t-test and Independent t-test.

Residual

The difference between the value a model predicts and the value observed in the data on which the model is based. Basically, an error. When the residual is calculated for each observation in a data set the resulting collection is referred to as the residuals.

Polynomial contrast

a contrast that tests for trends in the data. In its most basic form it looks for a linear trend (i.e., that the group means increase proportionately).

Simple regression

a linear model in which an outcome variable is predicted from a single predictor variable. The model takes the form Y = b0 + (b1 × X) + error, in which Y is the outcome variable, X is the predictor, b1 is the regression coefficient associated with the predictor and b0 is the value of the outcome when the predictor is zero.
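
The least-squares estimates for a simple regression follow from the definitional formulas b1 = (sum of cross-product deviations) / (sum of squared deviations of X) and b0 = mean(Y) − b1 × mean(X); a minimal sketch with a hypothetical helper name:

```python
# OLS estimates for the simple regression model Y = b0 + b1*X
def simple_regression(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    return my - b1 * mx, b1

b0, b1 = simple_regression([0, 1, 2, 3], [1, 3, 5, 7])  # data lie exactly on y = 1 + 2x
```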

Mean squares

a measure of average variability. For every sum of squares (which measure the total variability) it is possible to create mean squares by dividing by the number of things used to calculate the sum of squares (or some function of it).

Variance inflation factor (VIF)

a measure of multicollinearity. The VIF indicates whether a predictor has a strong linear relationship with the other predictor(s). Myers (1990) suggests that a value of 10 is a good value at which to worry. Bowerman and O'Connell (1990) suggest that if the average VIF is greater than 1, then multicollinearity may be biasing the regression model.

Covariance

a measure of the 'average' relationship between two variables. It is the average cross-product deviation (i.e., the cross-product divided by one less than the number of observations).

Cross-product deviations

a measure of the 'total' relationship between two variables. It is the deviation of one variable from its mean multiplied by the other variable's deviation from its mean.
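
The two definitions above differ only by the divisor: the covariance is the total cross-product deviation divided by n − 1. A sketch (hypothetical helper names):

```python
def cross_product_deviations(x, y):
    # the 'total' relationship: summed products of paired deviations from the means
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))

def covariance(x, y):
    # the 'average' relationship: total cross-product divided by n - 1
    return cross_product_deviations(x, y) / (len(x) - 1)
```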

DFBeta

a measure of the influence of a case on the values of bi in a regression model. If we estimated a regression parameter bi and then deleted a particular case and re-estimated the same regression parameter bi, then the difference between these two estimates would be the DFBeta for the case that was deleted. By looking at the values of the DFBetas, it is possible to identify cases that have a large influence on the parameters of the regression model; however, the size of DFBeta will depend on the units of measurement of the regression parameter.

DFFit

a measure of the influence of a case. It is the difference between the adjusted predicted value and the original predicted value of a particular case. If a case is not influential then its DFFit should be zero - hence, we expect non-influential cases to have small DFFit values. However, we have the problem that this statistic depends on the units of measurement of the outcome and so a DFFit of 0.5 will be very small if the outcome ranges from 1 to 100, but very large if the outcome varies from 0 to 1.

Deleted residual

a measure of the influence of a particular case of data. It is the difference between the adjusted predicted value for a case and the original observed value for that case.

Adjusted predicted value

a measure of the influence of a particular case of data. It is the predicted value of a case from a model estimated without that case included in the data. The value is calculated by re-estimating the model without the case in question, then using this new model to predict the value of the excluded case. If a case does not exert a large influence over the model then its predicted value should be similar regardless of whether the model was estimated including or excluding that case. The difference between the predicted value of a case from the model when that case was included and the predicted value from the model when it was excluded is the DFFit.

Studentized deleted residual

a measure of the influence of a particular case of data. This is a standardized version of the deleted residual.

Adjusted R²

a measure of the loss of predictive power or shrinkage in regression. The adjusted R² tells us how much variance in the outcome would be accounted for if the model had been derived from the population from which the sample was taken.

Cook's distance

a measure of the overall influence of a case on a model. Cook and Weisberg (1982) have suggested that values greater than 1 may be cause for concern.

Partial correlation

a measure of the relationship between two variables while 'controlling' the effect of one or more additional variables on both.

Semi-partial correlation

a measure of the relationship between two variables while 'controlling' the effect that one or more additional variables has on one of those variables. If we call our variables x and y, it gives us a measure of the variance in y that x alone shares.

Model sum of squares

a measure of the total amount of variability for which a model can account. It is the difference between the total sum of squares and the residual sum of squares.

Total sum of squares

a measure of the total variability within a set of observations. It is the total squared deviance between each observation and the overall mean of all observations.

Residual sum of squares

a measure of the variability that cannot be explained by the model fitted to the data. It is the total squared deviance between the observations, and the value of those observations predicted by whatever model is fitted to the data.

Covariance ratio (CVR)

a measure of whether a case influences the variance of the parameters in a regression model. When this ratio is close to 1 the case has very little influence on the variances of the model parameters. Belsley et al. (1980) recommend the following: if the CVR of a case is greater than 1 + [3(k + 1)/n] then deleting that case will damage the precision of some of the model's parameters, but if it is less than 1 − [3(k + 1)/n] then deleting the case will improve the precision of some of the model's parameters (k is the number of predictors and n is the sample size).
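
The two CVR bounds can be computed directly from k and n as defined in the entry; `cvr_bounds` is a hypothetical helper for illustration:

```python
# CVR concern bounds: 1 -/+ 3(k + 1)/n, with k predictors and n cases
def cvr_bounds(k, n):
    half_width = 3 * (k + 1) / n
    return 1 - half_width, 1 + half_width

lo, hi = cvr_bounds(k=2, n=30)  # e.g., 2 predictors and 30 cases -> (0.7, 1.3)
```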

Hierarchical regression

a method of multiple regression in which the order in which predictors are entered into the regression model is determined by the researcher based on previous research: variables already known to be predictors are entered first, new variables are entered subsequently.

Stepwise regression

a method of multiple regression in which variables are entered into the model based on a statistical criterion (the semi-partial correlation with the outcome variable). Once a new variable is entered into the model, all variables in the model are assessed to see whether they should be removed.

Ordinary least squares (OLS)

a method of regression in which the parameters of the model are estimated using the method of least squares.

Repeated contrast

a non-orthogonal planned contrast that compares the mean in each condition (except the first) to the mean of the preceding condition.

Simple contrast

a non-orthogonal planned contrast that compares the mean in each condition to the mean of either the first or last condition, depending on how the contrast is specified.

Difference contrast

a non-orthogonal planned contrast that compares the mean of each condition (except the first) to the overall mean of all previous conditions combined.

Helmert contrast

a non-orthogonal planned contrast that compares the mean of each condition (except the last) to the overall mean of all subsequent conditions combined.

Deviation contrast

a non-orthogonal planned contrast that compares the mean of each group (except for the first or last, depending on how the contrast is specified) to the overall mean.

Kendall's tau

a non-parametric correlation coefficient similar to Spearman's correlation coefficient, but it should be used in preference to Spearman's coefficient for a small data set with a large number of tied ranks.

Weight

a number by which something (usually a variable in statistics) is multiplied. The weight assigned to a variable determines the influence that variable has within a mathematical equation: large weights give the variable a lot of influence.

Planned contrasts

a set of comparisons between group means that are constructed before any data are collected. These are theory-led comparisons and are based on the idea of partitioning the variance created by the overall effect of group differences into gradually smaller portions of variance. These tests have more power than post hoc tests.

Post hoc tests

a set of comparisons between group means that were not thought of before data were collected. Typically these tests involve comparing the means of all combinations of pairs of groups. To compensate for the number of tests conducted, each test uses a strict criterion for significance. As such, they tend to have less power than planned contrasts. They are usually used for exploratory work for which no firm hypotheses were available on which to base planned contrasts.

Multicollinearity

a situation in which two or more variables are very closely linearly related.

Spearman's correlation coefficient

a standardized measure of the strength of relationship between two variables that does not rely on the assumptions of a parametric test. It is Pearson's correlation coefficient performed on data that have been converted into ranked scores.
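
"Pearson's correlation performed on ranked data" can be made concrete in a short sketch (mid-ranks for ties; `spearman_rho` is a hypothetical helper, not a library call):

```python
def spearman_rho(x, y):
    def rank(values):
        # average (mid) rank for tied values
        ordered = sorted(values)
        return [sum(i + 1 for i, s in enumerate(ordered) if s == v) / ordered.count(v)
                for v in values]

    def pearson(a, b):
        n = len(a)
        ma, mb = sum(a) / n, sum(b) / n
        sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
        saa = sum((p - ma) ** 2 for p in a)
        sbb = sum((q - mb) ** 2 for q in b)
        return sab / (saa * sbb) ** 0.5

    return pearson(rank(x), rank(y))

# any monotonically increasing relationship gives rho = 1
print(spearman_rho([1, 4, 9, 100], [2, 3, 5, 6]))
```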

Biserial correlation

a standardized measure of the strength of relationship between two variables when one of the two variables is dichotomous. The biserial correlation coefficient is used when one variable is a continuous dichotomy (e.g., has an underlying continuum between the categories).

Point-biserial correlation

a standardized measure of the strength of relationship between two variables when one of the two variables is dichotomous. The point-biserial correlation coefficient is used when the dichotomy is a discrete, or true, dichotomy (i.e., one for which there is no underlying continuum between the categories). An example of this is pregnancy: you can be either pregnant or not, there is no in between.

Standardized DFBeta

a standardized version of DFBeta. These standardized values are easier to use than DFBeta because universal cut-off points can be applied. Stevens (2002) suggests looking at cases with absolute values greater than 2.

Standardized DFFit

a standardized version of DFFit.

Analysis of variance

a statistical procedure that uses the F-ratio to test the overall fit of a linear model. In experimental research this linear model tends to be defined in terms of group means, and the resulting ANOVA is therefore an overall test of whether group means differ.

Durbin-Watson test

a test for serial correlations between errors in regression models. Specifically, it tests whether adjacent residuals are correlated, which is useful in assessing the assumption of independent errors. The test statistic can vary between 0 and 4, with a value of 2 meaning that the residuals are uncorrelated. A value greater than 2 indicates a negative correlation between adjacent residuals, whereas a value below 2 indicates a positive correlation. The size of the Durbin-Watson statistic depends upon the number of predictors in the model and the number of observations. For accuracy, look up the exact acceptable values in Durbin and Watson's (1951) original paper. As a very conservative rule of thumb, values less than 1 or greater than 3 are definitely cause for concern; however, values closer to 2 may still be problematic depending on the sample and model.
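
The statistic itself is simple to compute once the residuals are in hand: the sum of squared differences between successive residuals divided by the sum of squared residuals. A minimal sketch (hypothetical helper name):

```python
# Durbin-Watson statistic: ~2 means adjacent residuals are uncorrelated;
# values toward 0 suggest positive autocorrelation, toward 4 negative.
def durbin_watson(residuals):
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    return num / sum(e * e for e in residuals)

print(durbin_watson([1, -1, 1, -1, 1, -1]))  # alternating residuals push DW toward 4
```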

F-ratio

a test statistic with a known probability distribution (the F-distribution). It is the ratio of the average variability in the data that a given model can explain to the average variability unexplained by that same model. It is used to test the overall fit of the model in simple regression and multiple regression, and to test for overall differences between group means in experiments.

Independent t-test

a test using the t-statistic that establishes whether two means collected from independent samples differ significantly.

Paired-samples t-test

a test using the t-statistic that establishes whether two means collected from the same sample (or related observations) differ significantly.

Predictor variable

a variable that is used to try to predict values of another variable known as an outcome variable.

Studentized residuals

a variation on standardized residuals. A Studentized residual is an unstandardized residual divided by an estimate of its standard deviation that varies point by point. These residuals have the same properties as the standardized residuals but usually provide a more precise estimate of the error variance of a specific case.

Brown-Forsythe F

a version of the F-ratio designed to be accurate when the assumption of homogeneity of variance has been violated.

Welch's F

a version of the F-ratio designed to be accurate when the assumption of homogeneity of variance has been violated. Not to be confused with the squelch test which is where you shake your head around after writing statistics books to see if you still have a brain.

Dummy variables

a way of recoding a categorical variable with more than two categories into a series of variables all of which are dichotomous and can take on values of only 0 or 1. There are seven basic steps to create such variables:
(1) count the number of groups you want to recode and subtract 1;
(2) create as many new variables as the value you calculated in step 1 (these are your dummy variables);
(3) choose one of your groups as a baseline (i.e., a group against which all other groups should be compared, such as a control group);
(4) assign that baseline group values of 0 for all of your dummy variables;
(5) for your first dummy variable, assign the value 1 to the first group that you want to compare against the baseline group (assign all other groups 0 for this variable);
(6) for the second dummy variable, assign the value 1 to the second group that you want to compare against the baseline group (assign all other groups 0 for this variable);
(7) repeat this process until you run out of dummy variables.
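
The seven-step recipe above can be sketched for a three-group variable with 'control' as the baseline; the group names and the `dummy_code` helper are hypothetical:

```python
def dummy_code(values, baseline):
    # steps 1-3: the non-baseline groups each get one dummy variable
    groups = [v for v in dict.fromkeys(values) if v != baseline]
    # steps 4-7: baseline rows score 0 on every dummy; each other group
    # scores 1 on its own dummy and 0 elsewhere
    return {g: [1 if v == g else 0 for v in values] for g in groups}

dummies = dummy_code(["control", "drugA", "drugB", "drugA"], baseline="control")
```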

Harmonic mean

a weighted version of the mean that takes account of the relationship between variance and sample size. It is calculated by summing the reciprocal of all observations, then dividing by the number of observations. The reciprocal of the end product is the harmonic mean: H = n / (1/x1 + 1/x2 + ... + 1/xn).
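
The calculation described above fits in one line (hypothetical helper name):

```python
def harmonic_mean(xs):
    # reciprocal of the mean of reciprocals: n / sum(1/x)
    return len(xs) / sum(1 / x for x in xs)

print(harmonic_mean([1, 2, 4]))  # 3 / (1 + 0.5 + 0.25) = 3 / 1.75
```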

Homoscedasticity

an assumption in regression analysis that the residuals at each level of the predictor variable(s) have similar variances. Put another way, at each point along any predictor variable, the spread of residuals should be fairly constant.

Omega squared

an effect size measure associated with ANOVA that is less biased than eta squared. It is a (sometimes hideous) function of the model sum of squares and the residual sum of squares and isn't actually much use because it measures the overall effect of the ANOVA and so can't be interpreted in a meaningful way. In all other respects it's great though.

Eta squared (η²)

an effect size measure that is the ratio of the model sum of squares to the total sum of squares. So, in essence, the coefficient of determination by another name. It doesn't have an awful lot going for it: not only is it biased, but it typically measures the overall effect of an ANOVA, and effect sizes are more easily interpreted when they reflect specific comparisons (e.g., the difference between two means).

Multiple regression

an extension of simple regression in which an outcome is predicted by a linear combination of two or more predictor variables. The form of the model is Y = b0 + (b1 × X1) + (b2 × X2) + ... + (bn × Xn) + error, in which the outcome is denoted as Y and each predictor is denoted as X. Each predictor has a regression coefficient b associated with it, and b0 is the value of the outcome when all predictors are zero.

Goodness of fit

an index of how well a model fits the data from which it was generated. It's usually based on how well the data predicted by the model correspond to the data that were actually collected.

Independent ANOVA

analysis of variance conducted on any design in which all independent variables or predictors have been manipulated using different participants (i.e., all data come from different entities).

Hat values

another name for leverage.

Cross-validation

assessing the accuracy of a model across different samples. This is an important step in generalization. In a regression model there are two main methods of cross-validation: adjusted R² or data splitting, in which the data are split randomly into two halves, and a regression model is estimated for each half and then compared.

Pairwise comparisons

comparisons of pairs of means.

Quadratic trend

if the means in ordered conditions are connected with a line then a quadratic trend is shown by one change in the direction of this line (e.g., the line is curved in one place); the line is, therefore, U-shaped. There must be at least three ordered conditions.

Quartic trend

if the means in ordered conditions are connected with a line then a quartic trend is shown by three changes in the direction of this line. There must be at least five ordered conditions.

Standard error of differences

if we were to take several pairs of samples from a population and calculate their means, then we could also calculate the difference between their means. If we plotted these differences between sample means as a frequency distribution, we would have the sampling distribution of differences. The standard deviation of this sampling distribution is the standard error of differences. As such it is a measure of the variability of differences between sample means.

Cubic trend

if you connected the means in ordered conditions with a line then a cubic trend is shown by two changes in the direction of this line. You must have at least four ordered conditions.

Leverage

leverage statistics (or hat values) gauge the influence of the observed value of the outcome variable over the predicted values. The average leverage value is (k+1)/n in which k is the number of predictors in the model and n is the number of participants. Leverage values can lie between 0 (the case has no influence whatsoever) and 1 (the case has complete influence over prediction). If no cases exert undue influence over the model then we would expect all of the leverage values to be close to the average value. Hoaglin and Welsch (1978) recommend investigating cases with values greater than twice the average (2(k + 1)/n) and Stevens (2002) recommends using three times the average (3(k + 1)/n) as a cut-off point for identifying cases having undue influence.
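
The average leverage and the two published cut-offs follow directly from k and n as defined in the entry; a minimal sketch (hypothetical helper name):

```python
# average leverage (k + 1)/n plus the Hoaglin & Welsch (2x) and
# Stevens (3x) cut-offs, for k predictors and n cases
def leverage_cutoffs(k, n):
    avg = (k + 1) / n
    return avg, 2 * avg, 3 * avg

avg, hoaglin_welsch, stevens = leverage_cutoffs(k=3, n=100)  # 0.04, 0.08, 0.12
```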

Orthogonal

means perpendicular (at right angles) to something. It tends to be equated to independence in statistics because of the connotation that perpendicular linear models in geometric space are completely independent (one is not influenced by the other).

Dependent t-test

see paired-samples t-test

Suppressor effects

situation where a predictor has a significant effect, but only when another variable is held constant.

Variance sum law

states that the variance of a difference between two independent variables is equal to the sum of their variances.
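
The law can be checked with a quick simulation: draw two independent samples, and the sample variance of their difference should land close to the sum of their variances (close, not exact, because sample covariance is only approximately zero). This is an illustrative sketch, not a proof:

```python
import random

random.seed(42)
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]  # variance ~ 1
y = [random.gauss(0, 2) for _ in range(n)]  # variance ~ 4, independent of x

def var(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

diff_var = var([a - b for a, b in zip(x, y)])
# variance sum law: var(x - y) should be close to var(x) + var(y), i.e. about 5
```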

Generalization

the ability of a statistical model to say something beyond the set of observations that spawned it. If a model generalizes it is assumed that predictions from that model can be applied not just to the sample on which it is based, but to a wider population from which the sample came.

Shrinkage

the loss of predictive power of a regression model if the model had been derived from the population from which the sample was taken, rather than the sample itself.

Grand mean

the mean of an entire set of observations.

Multiple R

the multiple correlation coefficient. It is the correlation between the observed values of an outcome and the values of the outcome predicted by a multiple regression model.

Heteroscedasticity

the opposite of homoscedasticity. This occurs when the residuals at each level of the predictor variables(s) have unequal variances. Put another way, at each point along any predictor variable, the spread of residuals is different.

Experimentwise error rate

the probability of making a Type I error in an experiment involving one or more statistical comparisons when the null hypothesis is true in each case.

Familywise error rate

the probability of making a Type I error in any family of tests when the null hypothesis is true in each case. The 'family of tests' can be loosely defined as a set of tests conducted on the same data set and addressing the same empirical question.

Standardization

the process of converting a variable into a standard unit of measurement. The unit of measurement typically used is standard deviation units (see also z-scores). Standardization allows us to compare data when different units of measurement have been used (we could compare weight measured in kilograms to height measured in inches).

Coefficient of determination

the proportion of variance in one variable explained by a second variable. It is Pearson's correlation coefficient squared.

Standardized residuals

the residuals of a model expressed in standard deviation units. Standardized residuals with an absolute value greater than 3.29 (actually, we usually just use 3) are cause for concern because in an average sample a value this high is unlikely to happen by chance; if more than 1% of our observations have standardized residuals with an absolute value greater than 2.58 (we usually just say 2.5) there is evidence that the level of error within our model is unacceptable (the model is a fairly poor fit of the sample data); and if more than 5% of observations have standardized residuals with an absolute value greater than 1.96 (or 2 for convenience) then there is also evidence that the model is a poor representation of the actual data.
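
The three rules of thumb above amount to counting how many standardized residuals fall beyond each cut-off; a sketch with a hypothetical helper name:

```python
# counts and proportions of standardized residuals beyond the usual cut-offs:
# any |r| > 3.29 is a concern; >1% beyond 2.58 or >5% beyond 1.96 suggests poor fit
def residual_checks(std_resid):
    n = len(std_resid)
    return {
        "n_beyond_3.29": sum(abs(r) > 3.29 for r in std_resid),
        "prop_beyond_2.58": sum(abs(r) > 2.58 for r in std_resid) / n,
        "prop_beyond_1.96": sum(abs(r) > 1.96 for r in std_resid) / n,
    }

checks = residual_checks([0.5, -2.0, 3.5, 1.0])
```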

Unstandardized residuals

the residuals of a model expressed in the units in which the original outcome variable was measured.

Predicted value

the value of an outcome variable based on specific values of the predictor variable or variables being placed into a statistical model.

Grand variance

the variance within an entire set of observations.

Mahalanobis distances

these measure the influence of a case by examining the distance of cases from the mean(s) of the predictor variable(s). One needs to look for the cases with the highest values. It is not easy to establish a cut-off point at which to worry, although Barnett and Lewis (1978) have produced a table of critical values dependent on the number of predictors and the sample size. From their work it is clear that even with large samples (N = 500) and five predictors, values above 25 are cause for concern. In smaller samples (N = 100) and with fewer predictors (namely three) values greater than 15 are problematic, and in very small samples (N = 30) with only two predictors values greater than 11 should be examined. However, for more specific advice, refer to Barnett and Lewis's (1978) table.

Tolerance

tolerance statistics measure multicollinearity and are simply the reciprocal of the variance inflation factor (1/VIF). Values below 0.1 indicate serious problems, although Menard (1995) suggests that values below 0.2 are worthy of concern.

bi

unstandardized regression coefficient. Indicates the strength of relationship between a given predictor, i, of many and an outcome in the units of measurement of the predictor. It is the change in the outcome associated with a unit change in the predictor.

Autocorrelation

when the residuals of two observations in a regression model are correlated.

