QMB5616 Final

Why is model/variable selection important in regression analysis? Compare and contrast two regression model/variable selection approaches that were covered within the context of regression analysis.

It helps to mitigate the problems caused by multicollinearity and can improve predictive performance. Two methods are stepwise and all-possible-subsets regression. The stepwise approach is faster and can handle much larger pools of variables. The all-possible-subsets approach, however, guarantees that the best (largest R-squared) subset is found for each subset size.

What is the precise interpretation of a number in the structure matrix of the discriminant function output?

It is a discriminant loading, which is a measure of the correlation between the variable and the discriminant function.

What is the precise interpretation of the Wilks' lambda value for the discriminant function that is reported in a two-group discriminant analysis?

It is a measure of the ratio of the within-group sum-of-squares for the discriminant scores to the total sum-of-squares for the discriminant scores. The closer to zero, the better the discriminatory value of the function.

What (specifically) is the range of predictors issue to be considered in regression analysis?

It is the principle that you should be very careful about making predictions using independent variable measurements that are well outside the range of measurements used to establish the regression equation.

What is a p-value in hypothesis testing?

It is the probability of getting the observed test statistic, or one more extreme, given that the null hypothesis is true.

What is the value of α (alpha) in hypothesis testing?

It is the user-specified tolerance for a type I error, which is the probability of mistakenly rejecting the null hypothesis.

For two-group classification, name one advantage and one disadvantage of logistic regression relative to discriminant analysis.

Logistic regression is more robust with respect to non-normality and non-constant error variance. Discriminant analysis is better-suited to accommodate differential costs (i.e., asymmetry) of misclassification.

What are some important changes in discriminant analysis when you move from two groups to three groups?

There are some significant mathematical changes, but the big practical change is that you can have more than one discriminant function. Also, the patterns of misclassified cases can be more interesting.

What exactly are principal component loadings?

They are measures of correlation between variables and components.

What is the objective function that is optimized when fitting a logistic regression model?

To find the estimates that maximize the log-likelihood of the data.

What is the purpose of rotating component loadings?

To improve interpretability by making it easier to identify which variables define each component.

Find the eigenvalues and unit-length lead eigenvector of the following correlation matrix and identify the proportion of variation explained by the first component. [[1, .7], [.7, 1]]

Eigenvalues: 1 + .7 = 1.7 and 1 - .7 = .3. Unit-length lead eigenvector: [1/sqrt(2), 1/sqrt(2)] = [.7071, .7071]. The first component explains 1.7/2 = 85% of the variation.
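The arithmetic above can be checked with a short sketch. It uses the closed-form eigen-decomposition of a symmetric 2x2 matrix; the helper name `eig_sym_2x2` is made up for this illustration and assumes a nonzero off-diagonal entry.

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenvalues and unit-length lead eigenvector of [[a, b], [b, d]] (assumes b != 0)."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    lam1, lam2 = mean + radius, mean - radius  # lam1 is the larger eigenvalue
    # (b, lam1 - a) satisfies (A - lam1*I) v = 0; rescale it to unit length.
    vx, vy = b, lam1 - a
    norm = math.hypot(vx, vy)
    return lam1, lam2, (vx / norm, vy / norm)

lam1, lam2, v = eig_sym_2x2(1.0, 0.7, 1.0)
# For a correlation matrix the eigenvalues sum to the number of variables (here 2).
proportion = lam1 / (lam1 + lam2)
print(lam1, lam2, v, proportion)
```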

You have been given a small data set with eight observations measured on two variables, as well as a partition of the eight observations into three clusters. You are asked to apply k-means clustering to the data. Describe the steps of the process in careful detail.

1- Compute the centroids (variable means) for the three clusters
2- Compute the distance from each observation to each centroid
3- Assign each observation to the cluster with the nearest centroid
4- Recompute the centroids and repeat steps 2 and 3 until no observation changes membership
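The steps can be sketched in plain Python. The eight data points and the starting partition below are invented for illustration (they are not from the exam), and the sketch assumes that no cluster ever becomes empty.

```python
def centroid(points):
    """Variable means of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_from_partition(data, labels, k, max_iter=100):
    """Run k-means starting from a given partition; returns the final labels."""
    labels = list(labels)
    for _ in range(max_iter):
        # Step 1: centroids of the current clusters (assumes none is empty).
        cents = [centroid([p for p, l in zip(data, labels) if l == c])
                 for c in range(k)]
        # Steps 2-3: assign each observation to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, cents[c]))
                      for p in data]
        # Step 4: stop once no observation changes membership.
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Hypothetical data: eight observations on two variables.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9), (1, 9), (2, 9)]
start = [0, 0, 1, 1, 1, 1, 2, 2]  # given partition; (2, 1) starts in the "wrong" cluster
final = kmeans_from_partition(data, start, k=3)
print(final)
```

In this example the point (2, 1) moves to cluster 0 on the first pass, and the second pass produces no further changes, so the procedure stops.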

What do the eigenvalues associated with a principal component analysis based on the correlation matrix mean; that is, how should they be interpreted?

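An eigenvalue is the amount of variation captured by a given component. Because the analysis is based on the correlation matrix, the eigenvalues sum to the number of variables, so dividing an eigenvalue by the number of variables gives the percentage of variation in the full data set that is explained by that component.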

What is 'best subsets' logistic regression? How does it compare/contrast with l1-regularized logistic regression?

Best subsets regression selects a subset of the candidate variables, thus excluding unselected variables, so as to improve predictive performance. Best subsets and l1-regularized logistic regression are both variable selection procedures that can improve predictive performance. The former accomplishes this goal by explicitly selecting some variables and excluding others, whereas the latter does it in a more implicit manner by shrinking the coefficient estimates.

Explain how PCA might be used gainfully in conjunction with regression.

By using the uncorrelated PCA scores instead of the raw independent variables, we are able to circumvent problems of multicollinearity.

What is the primary purpose of principal component analysis (PCA)?

Dimensionality reduction: to reduce a large number of variables to a more manageable number of components.

The slope coefficient for the variable 'age' in a regression model used to predict customer satisfaction is -1.72. Provide a precise interpretation of this coefficient.

For each one unit increase in age, we would expect a decrease of 1.72 in satisfaction, assuming all other variables held constant.

Distinguish (i.e., describe the difference) between hierarchical and nonhierarchical clustering methods, as well as the two major categories of hierarchical clustering procedures.

Hierarchical clustering methods form a tree-like structure of relationships among items. This is accomplished using one of two approaches: (i) agglomerative methods, which start with all items in their own cluster and then merge a pair of clusters at each step until all items are in one cluster, and (ii) divisive methods, which start with all items in one cluster and split a cluster into two at each step until all items are in their own individual cluster. Nonhierarchical (or partitioning) methods begin with items divided (possibly at random) into a pre-selected number of clusters and, subsequently, items are reassigned across clusters until some termination criterion is satisfied.

Precisely define what multicollinearity is in the context of regression analysis. Specifically identify the potential consequences of multicollinearity with respect to both explanation of the dependent variable and prediction of the dependent variable.

Multicollinearity is high correlation between two or more independent variables. In the context of explanation, multicollinearity is a problem because it distorts the sign and magnitude of the slope coefficients and affects their significance tests. In the context of prediction, using superfluous variables can result in a deterioration in the prediction of holdout/validation cases.

Consider the coefficient (b1) for variable x1 in an OLS regression model. What is the precise interpretation of that coefficient? How would your interpretation of that coefficient change if we change the context to logistic regression?

OLS: For each one-unit increase in x1, we expect a change in the dependent variable equal to b1, assuming all other variables are held constant. Logistic: For each one-unit increase in x1, we expect the odds of the binary dependent variable being 1 to be multiplied by a factor of exp(b1), assuming all other variables are held constant.
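A small numerical sketch of the logistic case: with made-up coefficients (`b0`, `b1`) and a made-up starting value of `x`, the ratio of the odds after and before a one-unit increase in x1 equals exp(b1).

```python
import math

def logistic(z):
    """Predicted probability for a linear predictor z."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)

# Hypothetical coefficients and starting value, chosen only for illustration.
b0, b1 = -1.0, 0.5
x = 2.0

odds_before = odds(logistic(b0 + b1 * x))
odds_after = odds(logistic(b0 + b1 * (x + 1.0)))
ratio = odds_after / odds_before  # multiplicative change in the odds

print(ratio, math.exp(b1))
```

The ratio does not depend on the starting value of `x`, which is why exp(b1) is reported as a single odds ratio.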

Describe two ways in which the output from a cluster analysis using Ward's method can be used to choose the number of clusters.

One approach is to examine the agglomeration schedule and look for large increases in SSE. Select the number of clusters (K) such that going to K-1 clusters would produce a large increase in SSE. A second approach is to look at changes in cluster assignments. If going from K to K+1 clusters results in only one or a few items changing cluster membership, then perhaps stop at K clusters.

What are two of the most common reasons proffered for not using OLS regression when the dependent variable is binary?

One reason is that the OLS model can yield predictions that are outside of the possible 0-1 range for the binary variable. A second reason is that the assumption of constant error variance is violated in the case of a binary dependent variable.

Explain how PCA might be used gainfully in conjunction with cluster analysis.

PCA was used to obtain component scores, which were then used as clustering variables. This was a useful way to standardize the variables.

What is the difference between single-linkage and complete-linkage cluster analysis?

Single-linkage selects mergers so as to minimize the minimum distance between any two items not already in the same cluster. Contrastingly, complete-linkage selects mergers so as to minimize the maximum distance between any two items not already in the same cluster.

Our SPSS output reported both standardized and unstandardized discriminant function coefficients. How did we use each of these two pieces of information in our analyses?

Standardized coefficients were used to compare the relative importance of the variables to the discriminant function. Unstandardized coefficients were used to compute discriminant scores and classify cases.

You have been given an eight-by-eight matrix of distances between eight observations and asked to apply complete-linkage hierarchical clustering to the data. Describe the steps of the process in careful detail.

Start with each object in its own cluster. Evaluate all possible mergers of pairs of clusters, where the distance between two clusters is the maximum distance between any object in one and any object in the other, and merge the two clusters for which this maximum distance is smallest. Repeat this process of merging clusters until all objects are in one cluster.
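A minimal sketch of the procedure, written for any symmetric distance matrix. The 4-by-4 matrix below is invented to keep the trace short; an eight-by-eight matrix would be handled the same way.

```python
def complete_linkage(dist):
    """Return the merge history as (cluster_a, cluster_b, merge_distance) tuples."""
    clusters = [frozenset([i]) for i in range(len(dist))]  # each object alone
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(dist[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

# Hypothetical symmetric distance matrix for four objects.
D = [
    [0, 1, 6, 8],
    [1, 0, 5, 7],
    [6, 5, 0, 2],
    [8, 7, 2, 0],
]
history = complete_linkage(D)
for step in history:
    print(step)
```

Here objects 0 and 1 merge first (distance 1), then 2 and 3 (distance 2), and the final merger occurs at distance 8, the largest distance between the two remaining clusters.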

What is the precise null hypothesis of the chi-square test based on Wilks' lambda in the two-group discriminant analysis SPSS output?

That the population discriminant score means for the two groups are equal (to zero).

What is the precise interpretation of the bivariate (or Pearson) correlation measure between two variables x and y? How does a measure of partial correlation between the same two variables differ from this measure?

The Pearson correlation coefficient is a measure of linear association between x and y that ranges from -1 to +1. The partial correlation between x and y is similar, but it measures the association that remains after adjusting for the relationships of both x and y with one or more other variables.
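For a single control variable z, the partial correlation can be computed from the three pairwise Pearson correlations using the standard first-order formula. The sketch below uses made-up correlation values; the function name is only illustrative.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical pairwise correlations.
r_xy, r_xz, r_yz = 0.6, 0.5, 0.5
r_xy_given_z = partial_corr(r_xy, r_xz, r_yz)
print(r_xy_given_z)
```

With these values the partial correlation is smaller than the raw correlation, because part of the x-y association is shared with z.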

What is l1-regularized logistic regression? What is the purpose of l1-regularization?

The l1-regularized logistic regression method is a modified version of logistic regression that either imposes a constraint on the sum of the absolute values of the logistic regression coefficients, or imposes a penalty on that sum. The purpose is to shrink the size of the coefficients (perhaps eliminating some variables entirely) so as to reduce their variability and improve predictive performance.
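In penalty form, the idea can be written as follows. The notation is assumed for this sketch rather than taken from the course: \gamma \ge 0 is the penalty weight, p is the number of predictors, and the intercept \beta_0 is left unpenalized, which is a common convention.

```latex
\hat{\beta} = \arg\max_{\beta_0,\,\beta}
  \left\{ \ell(\beta_0, \beta) - \gamma \sum_{j=1}^{p} |\beta_j| \right\},
\qquad
\ell(\beta_0, \beta) = \sum_{i=1}^{n}
  \left[ y_i \eta_i - \log\!\left(1 + e^{\eta_i}\right) \right],
\quad
\eta_i = \beta_0 + x_i^{\top} \beta .
```

Setting \gamma = 0 recovers ordinary maximum-likelihood logistic regression; larger values of \gamma shrink more coefficients toward (and possibly exactly to) zero.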

What is the precise relationship between Wilks' lambda and the eigenvalue in the two-group discriminant analysis SPSS output?

We have T = B + W, where T, B, and W are, respectively, the total, between-group, and within-group variation in the discriminant scores. Wilks' lambda is W/T and the eigenvalue is B/W.
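Combining these definitions gives a direct link between the two quantities. In the sketch below, \Lambda denotes Wilks' lambda and \lambda the eigenvalue (symbols chosen here for convenience):

```latex
\Lambda = \frac{W}{T} = \frac{W}{B + W} = \frac{1}{1 + B/W} = \frac{1}{1 + \lambda}
```

So a larger eigenvalue corresponds to a Wilks' lambda closer to zero.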

Under what circumstances would it be prudent to standardize the clustering variables prior to completing a K-means cluster analysis?

When the variables are measured on different scales and there are major differences between their means and/or variances.

