BA 375 Midterm 1 (ch3,4,7)


Jaccard's coefficient

(# of variables with a matching nonzero value for u and v) / [(total # of variables) - (# of variables with matching zero values for u and v)] Use for nonbinary!

to use the simple linear regression, data must:

- be in the correct form - contain no categorical variables (use dummy variables instead)

Tables should be used when...

-Readers need to refer to specific # values -Readers need to make precise comparisons (not relative) -values being displayed have different units or very different magnitudes

When using tables remember...

-to limit unnecessary ink -to avoid using vertical lines

steps for K-Means clustering

1. Split all observations into 3 (or "k") clusters 2. Create a centroid for each cluster 3. If an observation in one cluster is closer to a different cluster's centroid, move it to that cluster 4. Repeat steps 2+3 until no observations move from one cluster to another
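
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `k_means`, the sample points, and the starting labels are all hypothetical, and the sketch assumes every cluster index from 0 to k-1 has at least one observation in the initial split.

```python
import math

def k_means(points, initial_labels, max_iter=100):
    """Follow the four card steps, starting from a given initial split (step 1)."""
    labels = list(initial_labels)
    k = max(labels) + 1
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: the centroid is the average of each variable over the cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        # Step 3: move each observation to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop once no observation changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Hypothetical data: two visibly separate groups, deliberately mixed at the start.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, centroids = k_means(points, initial_labels=[0, 1, 0, 1, 0, 1])
print(labels, centroids)
```

Because the result depends on the initial split, a different `initial_labels` list can lead to a different final clustering on other data sets.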

EX Q: using the C of D SSR = 15.8712 SST = 23.9 what percent of the total sum of squares can be explained by using the regression equation?

15.8712/23.9 = .6641 66.41% can be explained by using the regression equation

least squares method

A procedure for using sample data to find the estimated regression equation. It provides the values of b0 and b1 that minimize the sum of the squares of the deviations between the observed values of y and the predicted values of y
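
As a rough sketch of how b0 and b1 are obtained for simple linear regression, the standard closed-form least squares formulas can be coded directly. The function name `least_squares` and the five data points are made up for illustration.

```python
def least_squares(x, y):
    """Return (b0, b1) minimizing the sum of squared (observed y - predicted y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # b0 = y_bar - b1 * x_bar, so the line passes through (x_bar, y_bar)
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical sample data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)
print(b0, b1)  # approximately 2.2 and 0.6
```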

heat map

A two-dimensional graphical representation of data that uses different shades of color to indicate magnitude

similar equation to Euclidean distance

A^2 +B^2 = C^2

Charts

Aka graphs; visual methods for displaying data

centroid linkage

Centroid is found by averaging the value of all variables in all observations of the cluster

Pivot tables

Crosstabulation in Excel

for a valid regression analysis using the least squares method, the values of _____ must be statistically _____________

ε (the error term), independent (ε must also be normally distributed)

Similarity/dissimilarity measurement methods to use for K means clustering

Euclidean distance

Similarity/dissimilarity measurement methods to use for Hierarchical clustering

Euclidean distance Matching Coefficient Jaccard's Coefficient

to conduct a hypothesis test: find the significance of the model using the __stat and the CV (critical value); find the significance of the slope and the intercept of the model using the __stat and the CV

F, t

Matching Coefficient

Measure of similarity between observations consisting solely of categorical variables. Use dummy variables (0 or 1). MC = (# of variables with a matching value for observations u and v) / (total # of variables)
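
A small sketch comparing the matching coefficient with Jaccard's coefficient on the same pair of 0/1 observations. The function names and the two observations are hypothetical, and returning `None` when every variable is a matching zero is an assumption of this sketch, since the Jaccard formula is undefined in that case.

```python
def matching_coefficient(u, v):
    """Share of variables on which u and v have the same value (0-0 or 1-1)."""
    return sum(a == b for a, b in zip(u, v)) / len(u)

def jaccard_coefficient(u, v):
    """Like the matching coefficient, but matching zeros are left out entirely."""
    both_one = sum(a == 1 and b == 1 for a, b in zip(u, v))
    both_zero = sum(a == 0 and b == 0 for a, b in zip(u, v))
    denominator = len(u) - both_zero
    if denominator == 0:
        return None  # assumption: undefined when all variables are matching zeros
    return both_one / denominator

# Hypothetical dummy-variable observations
u = [1, 0, 0, 1, 0]
v = [1, 0, 0, 0, 0]
mc = matching_coefficient(u, v)   # 4 of 5 variables match
jc = jaccard_coefficient(u, v)    # 1 shared "1" out of 2 variables that are not 0-0
print(mc, jc)
```

Here three of the four matches are shared zeros, which is why the matching coefficient (0.8) is much higher than Jaccard's coefficient (0.5).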

Key performance indicators (KPIs)

Metrics that are crucial to understanding the current performance of an organization

Crosstabulation

Provides a tabular summary of data for two variables

duv becomes smaller when observations are ____________

More similar

Types of data that can be used for Hierarchical clustering

Numerical or categorical

Which of the following regression models is used to model a nonlinear relationship between the independent and dependent variables by including the independent variable and the square of the independent variable in the model? a. Simple regression model b. Least squares regression model c. Multiple regression model d. Quadratic regression model

Quadratic regression model

relationship among SST, SSR, and SSE

SST = SSR + SSE (sum of squares total) = (sum of squares regression) + (sum of squares error)

trend line

Slope provides an approximation of the relationship between the two variables

How to avoid the distortion of data in Euclidean distance

Standardize! Use the z-score for each value instead. standardized duv = ((-1.76-(-.76))^2 + (-.056-(-.62))^2)^(1/2)
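
One way to sketch the standardization step is shown below. The four customers are invented for illustration (only the first two echo the age/income example in this set), and using the sample standard deviation from `statistics.stdev` is an assumption; a course may use the population version instead.

```python
import math
import statistics

def standardize(column):
    """Convert each value to a z-score: (value - mean) / standard deviation."""
    mean = statistics.mean(column)
    sd = statistics.stdev(column)  # assumption: sample standard deviation
    return [(x - mean) / sd for x in column]

# Hypothetical data set: one entry per customer
ages = [23, 26, 31, 40]
incomes = [20375, 19475, 45000, 61000]

z_age = standardize(ages)
z_income = standardize(incomes)

# Distance between the first two customers, before and after standardizing
raw_duv = math.dist((ages[0], incomes[0]), (ages[1], incomes[1]))
std_duv = math.dist((z_age[0], z_income[0]), (z_age[1], z_income[1]))
print(raw_duv, std_duv)
```

After standardizing, both variables are on the same scale, so income no longer dominates the distance just because its numbers are larger.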

Hierarchical clustering

Starts with each observation in its own cluster, then repeatedly merges the most similar clusters

estimated regression

The estimate of the regression equation developed from sample data by using the least squares method. Because the parameters B0 and B1 are unknown, they are replaced by the sample estimates b0 and b1

experimental region

The range of values of the independent variables in the data used to estimate the regression model.

point estimator

a predicted value of Y using an X value

Assessing the regression model on data other than the sample data that was used to generate the model is known as: a. cross-validation. b. graphical validation. c. approximation. d. postulation.

a. cross-validation.

Ward's method

Measures the dissimilarity within a cluster as the sum of squared distances between each observation and the cluster centroid; combine the two clusters whose merger increases this dissimilarity the least

K means clustering

Assigns each observation to one of "K" clusters so that observations in each cluster are similar. Each cluster is summarized by its centroid (the cluster average), and each observation is moved to the cluster whose centroid is closest

__________ refers to the data set used to compare model forecasts and ultimately pick a model for predicting values of the dependent variable. a. Training set b. Validation set c. Codomain d. Range

b. Validation set

_________ refers to the scenario in which the relationship between the dependent variable and one independent variable is different at different values of a second independent variable. a. Autocorrelation b. Covariance c. Interaction d. Multicollinearity

c. Interaction

__________ is the data set used to build the candidate models. a. Range b. Validation set c. Training set d. Codomain

c. Training set

Regression analysis involving one dependent variable and more than one independent variable is known as: a. linear regression. b. simple regression. c. multiple regression. d. None of these choices are correct.

c. multiple regression.

The tests of significance in regression analysis are based on assumptions about the error term ε. One such assumption is that the variance of ε, denoted by σ^2, is.... a. greater as x increases. b. less as x increases. c. the same for all values of x. d. unrelated to the value of x.

c. the same for all values of x.

problem with matching coefficient

can mistake absence of a feature for similarity (use Jaccard's coefficient instead!)

The ___________ is a measure of the goodness of fit of the estimated regression equation. It can be interpreted as the proportion of the variability in the dependent variable y that is explained by the estimated regression equation. a. dummy variable b. interaction variable c. residual d. coefficient of determination

coefficient of determination

The coefficient of determination: a. is equal to negative one for the poorest fit. b. takes values between -1 to +1. c. is equal to zero for a perfect fit. d. is used to evaluate the goodness of fit.

d. is used to evaluate the goodness of fit.

The least squares regression line minimizes the sum of the: a. absolute deviations between actual and predicted y values. b. absolute deviations between actual and predicted x values. c. differences between actual and predicted y values. d. squared differences between actual and predicted y values.

d. squared differences between actual and predicted y values.

data dashboards

data visualization tool that illustrates metrics and updates them as new data becomes available; should present all KPIs at the same time

cluster analysis is ____________ rather than ____________

descriptive, predictive (it tells a story rather than making predictions)

Ex Q: Euclidean Distance U= 23 year old customer with income of $20,375 V= 26 year old customer with income of $19,475

duv = ((23-26)^2 + (20,375-19,475)^2)^(1/2) duv ≈ 900.005 Problem with this? Income is numerically much larger than age, so it dominates the distance and distorts the result (biased toward variables with large values)
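
The calculation can be checked with a short sketch that also prints how much each variable contributes to the squared distance. The variable names are arbitrary.

```python
import math

# Observations from the example: (age, income)
u = (23, 20375)
v = (26, 19475)

# Squared difference for each variable
squared_terms = [(a - b) ** 2 for a, b in zip(u, v)]
d_uv = math.sqrt(sum(squared_terms))

print(squared_terms)  # age contributes 9, income contributes 810000
print(d_uv)           # approximately 900.005
```

The income term accounts for almost the entire distance, which is the distortion the card describes.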

Edward Tufte introduced the idea of the data-ink ratio, as a way of quantifying the proportion of "data-ink" to the total amount of ink used in a table or chart. Which of the following options would increase the data-ink ratio of a table? a. adding table borders b. increasing the row heights and column widths for all rows and columns in the table c. rounding off the data to reduce excessive decimal precision d. adding a title to the table

d. adding a title to the table

Higher MC means_______

more similarity!

problem with K-means clustering

results depend on the starting clusters, so different people can come up with different clusters (even with the same data)

when do experts say we shouldn't use stacked column/bar charts?

experts recommend against the use of stacked column/bar charts for more than a couple variables in each category. human eye has difficulty perceiving small areas (alternative: clustered column/bar chart)

Prediction of the dependent variable value outside the experimental region is called...

extrapolation (risky!)

for the rule to be a good fit...

f-stat > critical value

quadratic regression model

flexible; can represent a wide variety of nonlinear relationships

Scatter chart

graphical representation of the relationship between two quantitative variables

K means clustering should be used when observations are _________ 500

greater than

Bar chart

horizontal bars display magnitude of quantitative variables

in a simple linear regression, x is the _________ variable and y is the _________ variable

independent, dependent

Often the relationship between the dependent variable and one independent variable is different at various values of a second independent variable. When this occurs, it is called an ______________.

interaction

Hierarchical clustering should be used when observations are _________ 500

less than (merging each observation takes so much time!)

The estimated regression equation would provide a perfect fit if every value of the dependent variable happened to...

lie on the estimated regression line

Parallel coordinates plot

chart for examining more than 2 variables, in which each variable is represented by a different vertical axis

Euclidean distance

most common method to measure dissimilarity between observations on continuous variables

data-ink ratio

most important idea for creating data visualizations; measures the proportion of ink used to convey data ("data-ink") relative to the total ink used in a table or chart

simple linear regression

most used tool in stats; predicts one variable based on values of another variable: y = B0 + B1x + E

ANOVA table

note: degrees of freedom for error is n-k-1

Types of data that can be used for K means clustering

numerical

In Euclidean distance, U is ______ and V is _________

one observation, another observation

don't use simple linear regression when.....

patterns are non linear or a growth curve

Multiple Linear Regression

predict one variable based on values of multiple other variables

Multicollinearity

refers to the correlation among the independent variables in a multiple regression analysis

Hierarchical clustering is _________ to outliers

sensitive

Line chart

similar to a scatter chart, but a line connects all the points in the chart; very useful for series data collected over a period of time; sometimes called a "time series plot"

complete linkage

similarity is defined by the observations that are most different

single linkage

similarity is defined by the observations that are most similar
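
The linkage definitions can be compared side by side in a short sketch. The function names and the two small clusters are hypothetical, and Euclidean distance is assumed as the dissimilarity measure.

```python
import math

def single_linkage(a, b):
    """Distance between the two MOST similar observations, one from each cluster."""
    return min(math.dist(p, q) for p in a for q in b)

def complete_linkage(a, b):
    """Distance between the two MOST different observations, one from each cluster."""
    return max(math.dist(p, q) for p in a for q in b)

def centroid(cluster):
    """Average of each variable over all observations in the cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def centroid_linkage(a, b):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(a), centroid(b))

# Hypothetical clusters on a line, so the distances are easy to check by hand
cluster_a = [(0, 0), (1, 0)]
cluster_b = [(3, 0), (5, 0)]

print(single_linkage(cluster_a, cluster_b))    # closest pair: (1,0) and (3,0)
print(complete_linkage(cluster_a, cluster_b))  # farthest pair: (0,0) and (5,0)
print(centroid_linkage(cluster_a, cluster_b))  # centroids (0.5,0) and (4,0)
```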

5 ways to compare similarity of clusters (and possibly merge them)

single linkage complete linkage centroid linkage Ward's method McQuitty's method

____________________is the process of making estimates and drawing conclusions about one or more characteristics of a population (the value of one or more parameters) through the analysis of sample data drawn from the population

statistical inference

We use a __________to test the hypothesis that a regression parameter is zero

t-test. Null hypothesis H0: B = 0; alternative hypothesis Ha: B ≠ 0. If B is zero, then the dependent variable Y does not change when X changes, so there is no linear relationship.

McQuitty's method

When clusters A and B are merged, the dissimilarity between the new cluster AB and any other cluster C is the average of the two original dissimilarities: dis(AB,C) = [dis(A,C) + dis(B,C)]/2

stacked column chart

type of column/bar chart where multiple variables appear on the same bar

pie charts

used to compare categorical data; inferior to other charts

coefficient of determination

used to evaluate the goodness of fit for the estimated regression equation. r^2 = SSR/SST or 1 - (SSE/SST), with 0 <= r^2 <= 1
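
A short sketch of how the sums of squares fit together. The data set and the coefficients b0 = 2.2 and b1 = 0.6 are hypothetical (they are the least squares fit for these five made-up points); the last lines repeat the SSR and SST values from the example question in this set.

```python
# Hypothetical sample data and its least squares line y_hat = 2.2 + 0.6x
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = 2.2, 0.6

y_bar = sum(y) / len(y)
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation in y
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # variation explained by the line
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation

r_squared = ssr / sst
print(sst, ssr, sse, r_squared)  # SST = SSR + SSE, up to rounding

# Values from the example question: SSR = 15.8712, SST = 23.9
r_squared_example = round(15.8712 / 23.9, 4)
print(r_squared_example)  # 0.6641, i.e. 66.41%
```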

Tree Map

useful for visualizing hierarchical data among multiple dimensions

Geographical Information System (GIS)

uses map and stats to present data collected over different geographical areas

clustered column/bar chart

variables are compared side by side; often superior to stacked column charts when comparing quantitative variables; becomes cluttered with more than a few variables per category

Column chart

vertical bars display magnitude of quantitative variables

bubble charts

visualizing 3 variables in a two-dimensional graph; preferred alternative to 3D charts

Ex Q: using the regression line intercept = .3431 slope = -1.157e-4 find the predicted y if x = 2

y = .3431 - .0001157(2) y = .3429

Ex Q: using the regression line intercept = 14.8602 slope = .27543 find the predicted y if x = 12

y = 14.8602 + .27543(12) y = 18.17
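
Plugging an x value into an estimated regression line is a one-line calculation. The sketch below uses the intercept and slope stated in the question; the function name `predict` is made up for illustration.

```python
def predict(b0, b1, x):
    """Point estimate of y from the estimated regression equation y_hat = b0 + b1*x."""
    return b0 + b1 * x

# Intercept and slope as stated in the question, evaluated at x = 12
y_hat = predict(14.8602, 0.27543, 12)
print(round(y_hat, 2))
```

The prediction is only trustworthy when x lies inside the experimental region; outside it, the same calculation is extrapolation.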

multiple regression model

y = B0 + B1X1 + B2X2 + ... + BqXq + E (where q is the number of independent variables)

to find the predicted value of y, use the slope and intercept

y = mx + b, i.e. predicted y = b0 + b1x

In the simple linear regression equation, the parameter B0 represents the _____ of the true regression line.

y-intercept

