BA 375 Midterm 1 (ch3,4,7)
Jaccard's coefficient
(# of variables with matching non-zero value for u and v) / ((total # of variables) - (# of variables with matching zero values for u and v)) use for nonbinary!
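The ratio above can be sketched in a few lines of Python. This is only an illustration: the function name `jaccard_coefficient` and the two sample observations are made up, and returning 0.0 when every variable is a matching zero is an assumption, since the formula is undefined in that case.

```python
def jaccard_coefficient(u, v):
    """Jaccard's coefficient for two equal-length lists of dummy-coded values."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    # Numerator: variables where u and v share the same non-zero value.
    matching_nonzero = sum(1 for a, b in zip(u, v) if a == b and a != 0)
    # Matching zeros are removed from the denominator.
    matching_zero = sum(1 for a, b in zip(u, v) if a == 0 and b == 0)
    denominator = len(u) - matching_zero
    # Assumption: report 0.0 when every variable is a matching zero.
    return matching_nonzero / denominator if denominator else 0.0


u = [1, 0, 1, 0, 0]  # hypothetical observation
v = [1, 0, 0, 0, 1]  # hypothetical observation
print(round(jaccard_coefficient(u, v), 4))  # 1 match / (5 - 2 shared zeros) = 0.3333
```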
to use the simple linear regression, data must:
-be in the correct form -contain no categorical variables (use dummy variables instead)
Tables should be used when...
-Readers need to refer to specific numerical values -Readers need to make precise comparisons (not relative) -Values being displayed have different units or very different magnitudes
When using tables remember...
-to limit unnecessary ink -to avoid using vertical lines
steps for K-Means clustering
1. split all observations into 3 (or "k") clusters 2. calculate a centroid for each cluster 3. if an observation is closer to a different cluster's centroid than to its own, move it to that cluster 4. repeat steps 2+3 until no observations move from one cluster to another
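The four steps can be sketched in plain Python as follows. This is a rough illustration under stated assumptions, not the procedure any particular software uses: the function name `k_means`, the round-robin starting split, and the sample points are all hypothetical, and a cluster that loses all its observations simply keeps its previous centroid.

```python
import math


def k_means(points, k, max_iter=100):
    """Return (labels, centroids) for a list of numeric tuples."""
    if k < 1 or k > len(points):
        raise ValueError("k must be between 1 and the number of observations")
    # Step 1: split observations into k clusters (round-robin is an arbitrary choice).
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: centroid = average of every variable over the cluster's observations.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        # Step 3: move each observation to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop once no observation changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]  # hypothetical data
labels, centroids = k_means(points, k=2)
print(labels)
```

Because the result depends on the starting split, a different initial assignment can produce different clusters from the same data.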
EX Q: using the C of D SSR = 15.8712 SST = 23.9 what percent of the total sum of squares can be explained by using the regression equation?
15.8712/23.9 = .6641 66.41% can be explained by using the regression equation
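The same arithmetic as a quick check in Python, using the SSR and SST values from the question; the variable names are illustrative only.

```python
ssr = 15.8712  # sum of squares due to regression (from the question)
sst = 23.9     # total sum of squares (from the question)

r_squared = ssr / sst
print(f"{r_squared:.4f}")  # 0.6641
print(f"{r_squared:.2%}")  # 66.41%
```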
least squares method
A procedure for using sample data to find the estimated regression equation. Provides the values of b0 and b1 that minimize the sum of the squares of the deviations between the observed values of y and the predicted values of y
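A minimal sketch of the least squares calculation for one independent variable, using the standard closed-form formulas for b1 and b0. The function name `least_squares` and the five data points are hypothetical.

```python
def least_squares(x, y):
    """Return (b0, b1) for the estimated regression line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # b1 = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    # b0 = y_bar - b1 * x_bar
    b0 = y_bar - b1 * x_bar
    return b0, b1


x = [1, 2, 3, 4, 5]  # hypothetical sample data
y = [2, 4, 5, 4, 5]
b0, b1 = least_squares(x, y)
print(f"y_hat = {b0:.2f} + {b1:.2f}x")  # y_hat = 2.20 + 0.60x
```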
heat map
A two-dimensional graphical representation of data that uses different shades of color to indicate magnitude
similar equation to Euclidean distance
A^2 +B^2 = C^2
Charts
Aka graphs; visual methods for displaying data
centroid linkage
Centroid is found by averaging the value of all variables in all observations of the cluster
Pivot tables
Crosstabulation in excel
for a valid regression analysis using the least squares method, the values of _____ must be statistically _____________
E, independent (E must also be normally distributed)
Similarity/dissimilarity measurement methods to use for K means clustering
Euclidean distance
Similarity/dissimilarity measurement methods to use for Hierarchical clustering
Euclidean distance Matching Coefficient Jaccard's Coefficient
to conduct a hypothesis test, find the significance of the model using the __ stat and the CV (critical value); find the significance of the slope and the intercept of the model using the __ stat and the CV
F, t
Matching Coefficient
Measure of similarity between observations consisting solely of categorical variables. Use dummy variables (0 or 1) MC= # of variables with a matching value for observations u and v/ total # of variables
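A short sketch of the matching coefficient, assuming each observation is a list of 0/1 dummy variables. The function name `matching_coefficient` and the sample observations are hypothetical.

```python
def matching_coefficient(u, v):
    """Share of variables on which observations u and v have the same value."""
    if len(u) != len(v):
        raise ValueError("observations must have the same number of variables")
    # Every agreement counts as a match, including shared zeros.
    matches = sum(1 for a, b in zip(u, v) if a == b)
    return matches / len(u)


u = [1, 0, 1, 0, 0]  # hypothetical observation
v = [1, 0, 0, 0, 1]  # hypothetical observation
print(matching_coefficient(u, v))  # 3 matches out of 5 variables = 0.6
```

Two of the three matches here are shared zeros, which is why the absence of a feature can inflate this measure.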
Key performance indicators (KPIs)
Metrics that are crucial to understanding the current performance of an organization
Crosstabulation
Provides a tabular summary of data for two variables
duv becomes smaller when observations are ____________
More similar
Types of data that can be used for Hierarchical clustering
Numerical or categorical
Which of the following regression models is used to model a nonlinear relationship between the independent and dependent variables by including the independent variable and the square of the independent variable in the model? a. Simple regression model b. Least squares regression model c. Multiple regression model d. Quadratic regression model
Quadratic regression model
relationship among SST, SSR, and SSE
SST = SSR + SSE (sum of squares total) = (sum of squares regression) + (sum of squares error)
trend line
Slope provides an approximation of the relationship between the two variables
How to avoid the distortion of data in Euclidean distance
Standardize! Use the z-score for each value instead. standardized duv = ((-1.76 - (-.76))^2 + (-.056 - (-.62))^2)^(1/2)
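A sketch of standardizing before measuring distance. The first two customers come from the Euclidean distance example in these notes; the other two customers and the helper names `z_scores` and `standardize` are made up for illustration. Using the sample standard deviation is also an assumption.

```python
import math
import statistics


def z_scores(values):
    """Convert one variable's values to z-scores: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mean) / sd for v in values]


def standardize(observations):
    """Standardize each variable (column) across all observations."""
    columns = list(zip(*observations))
    z_columns = [z_scores(col) for col in columns]
    return list(zip(*z_columns))


# (age, income); the last two customers are hypothetical
customers = [(23, 20375), (26, 19475), (35, 48000), (41, 52500)]
z = standardize(customers)
d_uv = math.dist(z[0], z[1])  # distance between the first two customers
print(round(d_uv, 3))
```

After standardizing, age and income are on the same scale, so income no longer dominates the distance.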
Hierarchical clustering
Starts with each observation in its own cluster, then repeatedly merges the most similar clusters
estimated regression
The estimate of the regression equation developed from sample data by using the least squares method. The unknown parameters B0 and B1 are estimated by the sample statistics b0 and b1
experimental region
The range of values of the independent variables that are used to estimate the regression model.
point estimator
a predicted value of Y using an X value
Assessing the regression model on data other than the sample data that was used to generate the model is known as: a. cross-validation. b. graphical validation. c. approximation. d. postulation.
a. cross-validation.
Ward's method
measure the dissimilarity within each candidate merged cluster (sum of squared distances from each observation to the cluster centroid); combine the two clusters whose merger increases the dissimilarity the least
K means clustering
assigns each observation to one of "K" clusters so that observations in each cluster are similar; used to summarize data. The centroid (cluster average) is calculated, and each observation is moved to the cluster whose centroid is closest
__________ refers to the data set used to compare model forecasts and ultimately pick a model for predicting values of the dependent variable. a. Training set b. Validation set c. Codomain d. Range
b. Validation set
_________ refers to the scenario in which the relationship between the dependent variable and one independent variable is different at different values of a second independent variable. a. Autocorrelation b. Covariance c. Interaction d. Multicollinearity
c. Interaction
__________ is the data set used to build the candidate models. a. Range b. Validation set c. Training set d. Codomain
c. Training set
Regression analysis involving one dependent variable and more than one independent variable is known as: a. linear regression. b. simple regression. c. multiple regression. d. None of these choices are correct.
c. multiple regression.
The tests of significance in regression analysis are based on assumptions about the error term ε. One such assumption is that the variance of ε, denoted by σ^2, is.... a. greater as x increases. b. less as x increases. c. the same for all values of x. d. unrelated to the value of x.
c. the same for all values of x.
problem with matching coefficient
can mistake absence of a feature for similarity (use Jaccard's coefficient instead!)
The ___________ is a measure of the goodness of fit of the estimated regression equation. It can be interpreted as the proportion of the variability in the dependent variable y that is explained by the estimated regression equation. a. dummy variable b. interaction variable c. residual d. coefficient of determination
coefficient of determination
The coefficient of determination: a. is equal to negative one for the poorest fit. b. takes values between -1 to +1. c. is equal to zero for a perfect fit. d. is used to evaluate the goodness of fit.
d. is used to evaluate the goodness of fit.
The least squares regression line minimizes the sum of the: a. absolute deviations between actual and predicted y values. b. absolute deviations between actual and predicted x values. c. differences between actual and predicted y values. d. squared differences between actual and predicted y values.
d. squared differences between actual and predicted y values.
data dashboards
data visualization tool that illustrates metrics and updates them as new data becomes available; should represent all KPIs at the same time
cluster analysis is ____________ rather than ____________
descriptive, predictive (tells a story, does not make predictions)
Ex Q: Euclidean Distance U= 23 year old customer with income of $20,375 V= 26 year old customer with income of $19,475
duv = ((23-26)^2 + (20,375-19,475)^2)^(1/2) duv = (9 + 810,000)^(1/2) ≈ 900.005 Problem with this? Income is numerically much larger than age, so it dominates the distance (the measure is biased toward variables with large values)
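Working the calculation through in Python, with both differences squared before taking the square root. The customer values are from the example; the variable names are illustrative.

```python
import math

u = (23, 20375)  # (age, income) for customer U
v = (26, 19475)  # (age, income) for customer V

age_part = (u[0] - v[0]) ** 2     # 9
income_part = (u[1] - v[1]) ** 2  # 810000
d_uv = math.sqrt(age_part + income_part)
print(round(d_uv, 3))  # 900.005

# Income alone gives a distance of 900, so age contributes almost nothing.
print(math.sqrt(income_part))  # 900.0
```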
Edward Tufte introduced the idea of the data-ink ratio, as a way of quantifying the proportion of "data-ink" to the total amount of ink used in a table or chart. Which of the following options would increase the data-ink ratio of a table? a. adding table borders b. increasing the row heights and column widths for all rows and columns in the table c. rounding off the data to reduce excessive decimal precision d. adding a title to the table
d. adding a title to the table
Higher MC means_______
more similarity!
problem with K-means clustering
everyone can come up with different clusters (even with the same data)
when do experts say we shouldn't use stacked column/bar charts?
experts recommend against the use of stacked column/bar charts for more than a couple variables in each category. human eye has difficulty perceiving small areas (alternative: clustered column/bar chart)
Prediction of the dependent variable value outside the experimental region is called...
extrapolation (risky!)
for the rule to be a good fit...
f-stat > critical value
quadratic regression model
flexible; can represent a wide variety of nonlinear relationships
Scatter chart
graphical representation of the relationship between two quantitative variables
K means clustering should be used when observations are _________ 500
greater than
Bar chart
horizontal bars display magnitude of quantitative variables
in a simple linear regression, x is the _________ variable and y is the _________ variable
independent, dependent
Often the relationship between the dependent variable and one independent variable is different at various values of a second independent variable. When this occurs, it is called an ______________.
interaction
Hierarchical clustering should be used when observations are _________ 500
less than (merging each observation takes so much time!)
The estimated regression equation would provide a perfect fit if every value of the dependent variable happened to...
lie on the estimated regression line
Parallel coordinates plot
chart for examining data with more than 2 variables, in which each variable is represented by a different vertical axis
Euclidean distance
most common method to measure dissimilarity between observations on continuous variables
data-ink ratio
most important idea for creating data visualizations; measures the proportion of ink in a table or chart that is used to present data (data-ink) out of the total ink used
simple linear regression
most used tool in stats; predicts one variable based on values of another variable: y = B0 + B1x + E
ANOVA table
note: degrees of freedom for error are n-k-1 (k = number of independent variables)
Types of data that can be used for K means clustering
numerical
In Euclidean distance, U is ______ and V is _________
one observation, another observation
dont use simple linear regression when.....
patterns are non linear or a growth curve
Multiple Linear Regression
predict one variable based on values of multiple other variables
Multicollinearity
refers to the correlation among the independent variables in a multiple regression analysis
Hierarchical clustering is _________ to outliers
sensitive
Line chart
similar to a scatter chart, but a line connects all the points in the chart; very useful for data collected over a period of time; sometimes called a "time series plot"
complete linkage
similarity is defined by the observations that are most different
single linkage
similarity is defined by the observations that are most similar
5 ways to compare similarity of clusters (and possibly merge them)
single linkage, complete linkage, centroid linkage, Ward's method, McQuitty's method
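The first three measures can be sketched for two small clusters as shown here, using Euclidean distance between observations. The function names and the sample clusters are hypothetical; Ward's and McQuitty's methods are left out of this sketch.

```python
import math
from itertools import product


def single_linkage(a, b):
    """Distance between the two most similar (closest) observations."""
    return min(math.dist(p, q) for p, q in product(a, b))


def complete_linkage(a, b):
    """Distance between the two most different (farthest) observations."""
    return max(math.dist(p, q) for p, q in product(a, b))


def centroid(cluster):
    """Average of every variable over all observations in the cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def centroid_linkage(a, b):
    """Distance between the two cluster centroids."""
    return math.dist(centroid(a), centroid(b))


cluster_a = [(0, 0), (1, 0)]  # hypothetical clusters
cluster_b = [(4, 0), (6, 0)]
print(single_linkage(cluster_a, cluster_b))    # 3.0
print(complete_linkage(cluster_a, cluster_b))  # 6.0
print(centroid_linkage(cluster_a, cluster_b))  # 4.5
```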
____________________is the process of making estimates and drawing conclusions about one or more characteristics of a population (the value of one or more parameters) through the analysis of sample data drawn from the population
statistical inference
We use a __________to test the hypothesis that a regression parameter is zero
t-test (null hypothesis) H0: B = 0 (alternative hypothesis) Ha: B ≠ 0. If B is zero, then the dependent variable Y does not change when X changes, so there is no linear relationship.
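A sketch of the t statistic for the slope in simple linear regression, t = b1 / s_b1, compared with a critical value. The data are hypothetical, the names are illustrative, and the critical value 3.182 is the standard t-table entry for a two-tailed test at alpha = .05 with n - 2 = 3 degrees of freedom.

```python
import math

x = [1, 2, 3, 4, 5]  # hypothetical sample data
y = [2, 4, 5, 4, 5]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Standard error of the slope: s / sqrt(Sxx), where s^2 = SSE / (n - 2).
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))
s_b1 = s / math.sqrt(sxx)

t_stat = b1 / s_b1
t_critical = 3.182  # t-table value: alpha = .05, two-tailed, df = 3

print(round(t_stat, 4))
# Reject H0: B1 = 0 only when |t| exceeds the critical value.
print("reject H0" if abs(t_stat) > t_critical else "fail to reject H0")
```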
McQuitty's method
take the average dissimilarity of combining other options; if the average dissimilarity is less than the other, you are good to cluster: dis(A,B) < [dis(B,C) + dis(A,C)]/2
stacked column chart
type of column/bar chart where multiple variables appear on the same bar
pie charts
used to compare categorical data; inferior to other charts
coefficient of determination
used to evaluate the goodness of fit for the estimated regression equation; r^2 = SSR/SST or 1 - (SSE/SST); 0 <= r^2 <= 1
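A small check that both forms of r^2 agree and that SST = SSR + SSE, using a least squares fit on hypothetical data. All names are illustrative.

```python
x = [1, 2, 3, 4, 5]  # hypothetical sample data
y = [2, 4, 5, 4, 5]
n = len(x)

# Least squares estimates of the slope and intercept.
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)               # total sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # sum of squares due to regression
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # sum of squares due to error

r_squared = ssr / sst
print(f"SST={sst:.2f} SSR={ssr:.2f} SSE={sse:.2f}")  # SST=6.00 SSR=3.60 SSE=2.40
print(f"r^2={r_squared:.2f}")                        # r^2=0.60
```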
Tree Map
useful for visualizing hierarchical data among multiple dimensions
Geographical Information System (GIS)
uses map and stats to present data collected over different geographical areas
clustered column/bar chart
variables are compared side by side; often superior to stacked column charts when comparing quantitative variables; becomes cluttered with more than a few variables per category
Column chart
vertical bars display magnitude of quantitative variables
bubble charts
visualizes 3 variables in a two-dimensional graph; preferred alternative to 3D charts
Ex Q: using the regression line intercept = .3431 slope = -1.157e-4 find the predicted y if x = 2
y = .3431 - .0001157(2) y = .3429
Ex Q: using the regression line intercept = 14.8602 slope = .27543 find the predicted y if x = 12
y = 14.8602 + .27543(12) y = 18.17
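The prediction step as a tiny Python helper, using the intercept and slope from the question above. The function name `predict` is hypothetical.

```python
def predict(b0, b1, x):
    """Predicted y from the estimated regression equation y_hat = b0 + b1 * x."""
    return b0 + b1 * x


y_hat = predict(b0=14.8602, b1=0.27543, x=12)
print(round(y_hat, 2))  # 18.17
```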
multiple regression model
y = B0 + B1X1 + B2X2 + ... + BqXq + E (where q is the number of independent variables)
to find the predicted value of y, use slope and intercept
y = mx + b, i.e. predicted y = b0 + b1x (m = slope b1, b = intercept b0)
In the simple linear regression equation, the parameter B0 represents the _____ of the true regression line.
y-intercept