Business Analytics Final exam
Correlation is always between
-1 and +1
Time series plot
A display of values against time
Uniform
A distribution whose histogram doesn't appear to have any mode and in which all the bars are approximately the same height
Unimodal
A distribution whose histogram has one main peak
Positive pattern
A pattern running from the lower left to the upper right
Negative pattern
A pattern that runs from the upper left to the lower right
Always pair the median with
IQR
Skewed
A distribution is skewed if one tail stretches out farther than the other; it is skewed to the side of the longer tail
Linear
If there is a straight-line relationship, it will appear as a cloud or swarm of points stretched out in a generally consistent, straight form
Modes
Peaks or humps seen in a histogram
Data=
Predicted + Residual
How to determine outliers
Any value above Q3 + 1.5(IQR) or below Q1 - 1.5(IQR)
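As a rough illustration of the 1.5(IQR) rule, the sketch below computes the two fences and checks a value against them. The function names and the quartile values are made up for the example.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) fences from the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def is_outlier(value, q1, q3):
    """True if the value falls outside either fence."""
    lower, upper = outlier_fences(q1, q3)
    return value < lower or value > upper


# Hypothetical quartiles: Q1 = 10 and Q3 = 20, so IQR = 10
lower, upper = outlier_fences(10, 20)
print(lower, upper)            # -5.0 35.0
print(is_outlier(40, 10, 20))  # True
```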
Before using correlation, you must check
Quantitative Variables Condition, Linearity Condition, Outlier Condition
Correlation coefficient
Since the x's and y's are paired, multiply each standardized value of x by the standardized value of y it is paired with and add up those cross products. Divide by n - 1. In other words, r is the ratio of the sum of the products zx * zy over every point in the scatterplot to n - 1
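The recipe above can be sketched in a few lines of Python. This is only an illustration: the function name and the data are invented, and the standard deviations use the n - 1 denominator to match the formula.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sum of paired z-score products, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    z_x = [(x - x_bar) / s_x for x in xs]
    z_y = [(y - y_bar) / s_y for y in ys]
    return sum(a * b for a, b in zip(z_x, z_y)) / (n - 1)


# Made-up paired data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = correlation(xs, ys)
print(round(r, 4))  # 0.7746
```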
leptokurtic
concentrated in one place
What is affected by outliers/skewness?
correlation, mean, and standard deviation
Kurtosis
describes how peaked or flat the curve of the distribution is
Evaluate: "Outliers do not affect the correlation." a. True. Correlation coefficients remain the same if outliers are removed or kept b. False. Correlation coefficients remain the same if outliers are removed or kept c. True. Correlation coefficients change if outliers are removed or kept d. False. Correlation coefficients change if outliers are removed or kept e. All of the above answers are correct
d
If mean > median, the distribution is
skewed right
The more unusual a value is
the farther its z-score is from zero
Boxplot
highlights several features of the distribution of the variable, including the quartiles, the median, and any outlying values
Strength
how much scatter or cluster
The five-number summary of a distribution
reports its median, quartiles, and extremes (maximum and minimum)
The median is
resistant
Correlation does not prove that one variable causes..
results or change in the other
Mean is larger than the median if skewed
right
Under the least squares criterion, the point that contributes least to the sum of squared residuals is the one with..
the smallest residual value (in absolute size)
We place the explanatory or predictor variable on
the x-axis
We place the response variable on
the y-axis
If the correlation were 1.0
then the model predicts y perfectly, the residuals would all be zero and have no variation.
Multimodal
three or more peaks
The mean is a natural summary for
unimodal, symmetric distributions
Outlier
unusual observation, standing away from the overall pattern of the scatterplot
quartiles
values that frame the middle 50% of the data. One quarter of the data lies below the lower quartile, Q1, and one quarter lies above the upper quartile, Q3
Stationary
when a time series is without a strong trend or change in variability
Explanatory variable
x in the regression line equation
Response variable
y in the regression line equation; y hat is its predicted value
A linear model can be written in the form
y hat = b0 + b1x, where b0 and b1 are numbers estimated from the data and y hat is the predicted value
Residual
(e) The difference between the observed value, y, and the predicted value, y hat; e = y - y hat
Variance
(s^2) average of the squared deviations; sum of (y value minus the mean)^2 / (n - 1)
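A minimal sketch of the variance formula, using invented data. The function name is hypothetical; the result is compared against the standard library's `statistics.variance`, which uses the same n - 1 denominator.

```python
from statistics import variance


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / (n - 1)


# Made-up data with mean 5; the squared deviations sum to 32
data = [2, 4, 4, 4, 5, 5, 7, 9]
s2 = sample_variance(data)
print(s2)               # 32 / 7, about 4.571
print(variance(data))   # library result for comparison
```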
A value is often treated as an outlier if its z-score is beyond
+ or - 3 (three standard deviations from the mean)
Tukey method
1. Order the values. 2. Split the data at the median (if n is odd, include the median in both halves). 3. Find the median of each half (if a half has an even number of values, average its two middle values). 4. The two medians are Q1 and Q3.
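The steps above can be sketched as follows. The function name is made up; `statistics.median` already averages the two middle values when a list has an even length.

```python
from statistics import median


def tukey_quartiles(values):
    """Return (Q1, Q3) by splitting the ordered data at the median."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # Odd n: the median value is included in both halves
        lower = ordered[: half + 1]
        upper = ordered[half:]
    else:
        lower = ordered[:half]
        upper = ordered[half:]
    return median(lower), median(upper)


print(tukey_quartiles([1, 2, 3, 4, 5, 6, 7]))     # (2.5, 5.5)
print(tukey_quartiles([1, 2, 3, 4, 5, 6, 7, 8]))  # (2.5, 6.5)
```

Other quartile conventions exist (for example, excluding the median from both halves when n is odd), so software packages may report slightly different values.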
Platykurtic
Amodal; flat
Quantitative Data Condition
Before making a histogram or stem-and-leaf display; the data must be values of a quantitative variable whose units are known
Judgment call
Characterizing the shape of a distribution
Quantitative Variables Condition
Correlation applies only to quantitative variables
Linearity Condition
Correlation measures the strength only of the linear association. If the underlying relationship is curved, summarizing its strength with a correlation would be misleading.
Residual=
Data - Predicted
Tails
The thinner ends of a distribution
Smooth trace
Helps you better understand the trend in time series data
Bimodal
Two main peaks
The subway runs every 15 minutes. You arrive at the station and cannot locate the timetable or the current time. The probability that it will arrive in the next minute can be calculated based on a ___ model. a. binomial b. uniform c. geometric d. Poisson e. the answer is not above
Uniform
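Under a uniform model, every moment in the 15-minute cycle is equally likely, so the probability of an arrival within a window is just the window length divided by the interval. A small sketch, assuming the arrival time is uniform over the interval (the function name is hypothetical):

```python
def uniform_probability(window, interval):
    """P(arrival within `window`) when arrival is uniform over `interval`."""
    if not 0 <= window <= interval:
        raise ValueError("window must lie between 0 and interval")
    return window / interval


# Train every 15 minutes; chance it arrives in the next minute
p = uniform_probability(1, 15)
print(round(p, 4))  # 0.0667
```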
Outlier Condition
Unusual observations can distort the correlation. When you see an outlier, it's often a good idea to report the correlation both with and without the point.
According to Crovitz, how can big data be misused? a. when governments use big data to prevent protests and arrest dissidents b. when health providers and governments use big data to find out what time of year people are sick most c. when health providers use big data to identify treatments for premature babies d. when governments use big data to reduce fire hazards e. all of the above
a
The R^2 ranges from ___. a. 0 to 1.00 b. -1.00 to +1.00 c. -100 to +100 d. none of the above is correct
a
Shape
describes a distribution in terms of its modes, its symmetry, and whether it has any gaps or outlying values
Histogram
a graph for a quantitative variable; we usually slice up all the possible values into bins and then count the number of cases that fall in each bin
Lurking variable
a third variable that is simultaneously affecting both of the variables you have observed
If fairly symmetric/ symmetric mean will be..
about the same as the median
A linear model is just
an equation of a straight line through the data
Stem-and-leaf displays
are like histograms, but they also give the individual values
There are two independent variables X and Y with the respective means of 40 and 20, and the respective standard deviations of 3 and 5. If you added 3 to Y, what would be the standard deviation of the new distribution? a. 3 b. 5 c. 6 d. 8 e. 4.24
b
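Adding a constant shifts every value by the same amount, so the center moves but the spread does not. A quick check with invented data (not the X and Y from the question):

```python
from statistics import mean, stdev

# Made-up values, for illustration only
ys = [12, 18, 20, 25, 25]
shifted = [y + 3 for y in ys]

print(mean(ys), mean(shifted))    # the mean goes up by 3
print(stdev(ys), stdev(shifted))  # the standard deviation is unchanged
```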
When reviewing a scatterplot, it is noted that the independent variable is the Banner Identification Number and the dependent variable is height in inches. One can conclude that the ____ assumption has been violated, and calculating a correlation is not appropriate. a. Qualitative data b. Quantitative data c. Homoscedasticity d. Linearity e. None of the above are correct
b
Slope of the least squares line
b1= r(sy/sx)
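A one-line sketch of the slope formula. The function name and the input numbers are hypothetical.

```python
def slope_from_correlation(r, s_x, s_y):
    """Slope of the least squares line: b1 = r * (s_y / s_x)."""
    return r * s_y / s_x


# Hypothetical summary statistics: r = 0.5, s_x = 2, s_y = 4
b1 = slope_from_correlation(0.5, s_x=2.0, s_y=4.0)
print(b1)  # 1.0
```

Because r is at most 1 in absolute value, a change of one standard deviation in x predicts a change of at most one standard deviation in y.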
mesokurtic
bell curve; popeye
Central 50% of values is
between Q1 and Q3
Which of the following correctly reflects the condition of the outcomes "gender of customers at the ATM?" a. independent because the outcome of one event does influence the outcome of another b. not independent because the outcome of one event does influence the outcome of another c. independent because the outcome of one event does not influence another event d. not independent because the outcome of one event does not influence the outcome of another e. the answer does not appear above.
c
Equal Spread Condition
check a residual plot for equal scatter for all x-values
In the text the New England Journal of Medicine published a report saying that eating chocolate could improve one's intelligence. What misconceptions did Velickovic suggest were present in their report? a. the idea that correlation implies causality b. it is possible to generalize a correlation found on a group level to an individual level c. it is necessary to infer from a correlation found on one group level to all other groups on any level d. both A and B e. All; A,B,C
d
There are two independent distributions. Distribution A N(54.8, 4.3) and Distribution B N(45.6, 5.1). If the two distributions are added to each other, the mean of the new distribution is ___. a. 50.2 b. 8.9 c. -8.9 d. 100.4 e. not listed above
d
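The arithmetic behind this answer can be sketched as below. It assumes the second number in N( , ) is a standard deviation rather than a variance, which the card does not state; under that assumption the variances (not the standard deviations) add for independent variables.

```python
from math import sqrt

mean_a, sd_a = 54.8, 4.3  # Distribution A (second value assumed to be an SD)
mean_b, sd_b = 45.6, 5.1  # Distribution B (same assumption)

# Means add directly
mean_sum = mean_a + mean_b

# For independent variables, variances add; take the root for the SD
sd_sum = sqrt(sd_a ** 2 + sd_b ** 2)

print(round(mean_sum, 1))  # 100.4
print(round(sd_sum, 2))    # 6.67
```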
The interquartile range (IQR)
defined to be the difference between the two quartile values; Q3-Q1
y variable
dependent
Consultants from Southpark Corp have been analyzing consumer behaviors and noted that there was a strong negative correlation between satisfaction with checkout (1 being highly satisfied and 100 being highly dissatisfied) and the length of the forms required by transactions, measured in the number of blanks to be filled in. They have interpreted this to reflect that customer satisfaction increases when more blanks in the entailed paperwork of the associated transactions must be filled in. As a diligent scholar of analytic interpretation, you: a. Disagree, because it is not a perfect correlation b. Agree, because correlation proves causation c. Disagree, because it is fallacious reasoning of causality d. Concur that they could be correct in their interpretation e. Disagree, because the relationship is the opposite of stated
e
Merce Motors' large equipment operators know that for their Model 3 front loader the mean Tire Tread Depth is 20 cm with a standard deviation of 3.5 cm, and the mean Miles Traveled is 1820 miles with a standard deviation of 25.3 miles. The slope between Tire Tread Depth (dependent variable) and Miles Traveled is -0.1333. The intercept is ___. a. -242.06 b. -202.06 c. +222.06 d. -222.06 e. the answer does not appear above
e
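The least squares line always passes through the point (mean of x, mean of y), so the intercept is b0 = (mean of y) - b1 * (mean of x). The sketch below plugs in the numbers from the question; the function name is made up.

```python
def intercept(y_bar, x_bar, slope):
    """Intercept of the least squares line: b0 = y_bar - b1 * x_bar."""
    return y_bar - slope * x_bar


# Values from the question: mean depth 20 cm, mean miles 1820, slope -0.1333
b0 = intercept(y_bar=20, x_bar=1820, slope=-0.1333)
print(round(b0, 3))  # 262.606
```

Since 262.606 is not among choices a through d, the result is consistent with answer e.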
x variable
independent
When describing a distribution, attention should be paid to
its shape, center, spread
Under the least squares criterion, the point that contributes most to the sum of squared residuals is the one with..
the largest residual value (in absolute size)
Median is larger than the mean if skewed
left
next step of analysis
logarithmic transformation might make distribution more symmetric
Range
max-min; not resistant to unusual observations
Correlation
measures the strength of the linear association between two quantitative variables
R^2
percent of variance that is accounted for by regression; represents strength
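One way to see R^2 as "variance accounted for" is to compare the leftover squared residuals with the total squared deviations of y. A minimal sketch on invented data (function name hypothetical):

```python
from statistics import mean


def r_squared(xs, ys):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Least squares slope and intercept
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst


# Made-up paired data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r2 = r_squared(xs, ys)
print(round(r2, 2))  # 0.6
```

For a simple linear regression this equals the square of the correlation r, which is why it cannot be negative or exceed 1.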
Relative frequency histograms
percentages of each bin in the histogram
Scatterplot
plots one quantitative variable against another, is an effective display to look for trends, patterns, and relationships between two quantitative variables
Always pair the mean with
standard deviation
Form
straight, curved, exotic
Mean
sum of y values (or x values) / number of values
Correlation treats x and y
symmetrically
Standard deviation
takes into account how far each value is from the mean; appropriate for symmetric distributions; square root of the variance
Correlation is not affected by changes in
the center or scale of either variable
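A quick numerical check of this property, using invented data. The helper is the same z-score recipe for r; the shift (100) and the rescaling factor (2.54) are arbitrary choices for the example. Note that rescaling by a negative constant would flip the sign of r.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sum of paired z-score products, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


# Made-up paired data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r_original = correlation(xs, ys)
r_shifted = correlation([x + 100 for x in xs], ys)    # change of center
r_rescaled = correlation(xs, [2.54 * y for y in ys])  # change of scale

print(round(r_original, 4), round(r_shifted, 4), round(r_rescaled, 4))
```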
Symmetric
the halves on either side of the center look, at least approximately, like mirror images
The x- and y-variables are sometimes referred to as
the independent and dependent variables
Line of best fit/ least squares line
the line for which the sum of the squared residuals is smallest
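To illustrate the least squares idea, the sketch below fits a line to invented data and shows that nudging the line away from the fitted coefficients increases the sum of squared residuals. Function names and the size of the nudge are made up for the example.

```python
from statistics import mean


def least_squares(xs, ys):
    """Return (b0, b1) for the line minimizing the sum of squared residuals."""
    x_bar, y_bar = mean(xs), mean(ys)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    b0 = y_bar - b1 * x_bar
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of (observed - predicted)^2 for the line y hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Made-up paired data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = least_squares(xs, ys)
best = sum_squared_residuals(xs, ys, b0, b1)
nudged = sum_squared_residuals(xs, ys, b0 + 0.1, b1 - 0.05)

print(round(b0, 2), round(b1, 2))        # 2.2 0.6
print(round(best, 4), round(nudged, 4))  # 2.4 2.4375
```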
The more tightly the values cluster around the mean..
the lower the standard deviation
If the shape is unimodal and symmetric
the mean and standard deviation and possibly the median and IQR should be reported
If a distribution is skewed, contains gaps, or contains outliers, then it is better to use
the median
If the shape is skewed
the median and IQR should be reported
If the correlation were 0
the model would predict the mean for all x-values. The residuals would have the same variability as the original data
Z-score
the standardized value that tells how many standard deviations a value is above or below the overall mean; (x minus the mean) / standard deviation
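A minimal sketch of the z-score calculation. The function name and the numbers are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Hypothetical example: a value of 27 when the mean is 20 and the SD is 3.5
z = z_score(27, mean=20, sd=3.5)
print(z)  # 2.0
```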
Correlation measures
the strength of the linear association between the two variables.