Business Analytics 1 Final part 3
Linearity
- examine scatter diagram (should appear linear) - examine residual plot (should appear random)
What does simple linear regression do?
-Finds a linear relationship between: one independent variable x and one dependent variable y
•Test whether the average age of respondents is equal to 35. What are H0 and H1?
-H0: mean age = 35 - H1: mean age <> 35
What are your two options with hypothesis?
-reject the null and conclude the sample data provides sufficient evidence to support H1 or - fail to reject the null and conclude the sample data does not support H1
Selecting the Proper Excel Procedure: •Population variances are unknown but assumed equal:
-t-Test: Two-Sample Assuming Equal Variances
Selecting the Proper Excel Procedure: •Population variances are unknown and assumed unequal:
-t-Test: Two-Sample Assuming Unequal Variances
confidence coefficient
1 - α = P(not rejecting H0 | H0 is true) - The value of α can be controlled. Common values are 0.01, 0.05, or 0.10
What is the Systematic Model Building Approach?
1. Construct a model with all available independent variables. Check for significance of the independent variables by examining the p-values. 2. Identify the independent variable having the largest p-value that exceeds the chosen level of significance. 3. Remove the variable identified in step 2 from the model and evaluate adjusted R^2 (remove variables one at a time). 4. Continue until all variables are significant.
What is the Hypothesis testing procedure?
1.Identify the population parameter and formulate the hypotheses to test. 2.Select a level of significance 3.Determine the decision rule on which to base a conclusion. 4.Collect data and calculate a test statistic. 5.Apply the decision rule and draw a conclusion.
Assumptions of ANOVA
The groups being compared represent populations whose outcome measures 1. are randomly and independently obtained, 2. are normally distributed, and 3. have equal variances
What is the level of significance?
The risk of drawing an incorrect conclusion when H0 is actually true
how do you find chi-squared degrees of freedom?
df = (r-1)(c-1), where r and c are the numbers of rows and columns in the cross-tabulation
Type 1 error
α (the level of significance) = P(rejecting H0 | H0 is true)
Type 2 error
β = P(not rejecting H0 | H0 is false)
Excel function CHISQ.INV.RT(probability, deg of freedom)
Returns the χ^2 value that has a right-tail area equal to probability for the specified degrees of freedom. By setting probability equal to the level of significance, we obtain the critical value for the hypothesis test (COMPUTES CRITICAL VALUE)
Residual formula =
Actual Y value - Predicted Y value
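The residual formula can be sketched in a few lines of Python. This is a minimal illustration, assuming a least-squares line fitted to a small made-up data set; the helper name `fit_line` and the data are hypothetical, not from the course material.

```python
from statistics import mean

def fit_line(x, y):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]
# Residual = actual Y value - predicted Y value
residuals = [yi - pi for yi, pi in zip(y, predicted)]
```

For a least-squares line with an intercept, the residuals sum to (approximately) zero, which is a quick sanity check on the fit.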
Chi - square test calculations step 3:
Compare the chi-square statistic for the level of significance α to the critical value from a chi-square distribution with (r-1)(c-1) degrees of freedom, where r and c are the number of rows and columns in the cross tabulation table respectively
Chi - Square test calculations Step 2:
Compute a test statistic, called the chi-square statistic, which is the sum of the squares of the differences between the observed frequency, fo, and the expected frequency, fe, divided by the expected frequency: χ^2 = Σ (fo - fe)^2 / fe
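The chi-square calculations can be sketched as follows. This is a minimal illustration, assuming the usual expected frequency under independence (row total × column total / grand total); the function name and the 2×2 table are hypothetical.

```python
def chi_square_independence(observed):
    """Return (expected frequencies, chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    # Expected frequency if the two variables are independent
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    # Sum of (fo - fe)^2 / fe over every cell of the cross-tabulation
    statistic = sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(observed, expected)
        for fo, fe in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, statistic, dof

# Hypothetical 2x2 cross-tabulation, for illustration only
observed = [[20, 30], [30, 20]]
expected, chi2, dof = chi_square_independence(observed)
```

The statistic would then be compared with the critical value for `dof` degrees of freedom, for example from Excel's CHISQ.INV.RT.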
Procedure: Two-sample test for equality of variances
Excel F-Test Two-Sample for Variances
Procedure: Two-sample test for means, σ^2 unknown, assumed equal
Excel t-Test: Two-Sample Assuming Equal Variances
Procedure: Two-sample test for means, σ^2 unknown, assumed unequal
Excel t-Test: Two-Sample Assuming Unequal Variances
Procedure: Two-sample test for means, σ^2 known
Excel z-Test: Two Sample for Means
The principle of Parsimony
Good models are as simple as possible
What is the assumption of hypothesis testing?
That H0 is true; the test then uses the sample data to determine whether H1 is more likely to be true
•CadSoft sampled 44 customers and asked them to rate the overall quality of a software package. Sample data revealed that 35 respondents (a proportion of 35/44 = 0.795) thought the software was very good or excellent. In the past, this proportion has averaged about 75%. Is there sufficient evidence to conclude that this satisfaction measure has significantly exceeded 75% using a significance level of 0.05?
Hypotheses: - H0: pi <= 0.75 - H1: pi > 0.75 Test statistic: z = (0.795 - 0.75) / sqrt(0.75(1 - 0.75)/44) = 0.69 Critical value = NORM.S.INV(0.95) = 1.645 P-value = 1 - NORM.S.DIST(0.69, TRUE) = 0.245 Do not reject H0, because 0.69 < 1.645
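The CadSoft proportion test can be reproduced with Python's standard library. This is a sketch of the upper-tailed one-sample z-test for a proportion, with `NormalDist().inv_cdf` and `NormalDist().cdf` standing in for Excel's NORM.S.INV and NORM.S.DIST; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_proportion_z(successes, n, pi0):
    """Return (z, upper-tail p-value) for H0: pi <= pi0 vs H1: pi > pi0."""
    p_hat = successes / n
    std_error = sqrt(pi0 * (1 - pi0) / n)  # standard error under H0
    z = (p_hat - pi0) / std_error
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

# CadSoft example: 35 of 44 respondents, hypothesized proportion 0.75
z, p_value = one_sample_proportion_z(35, 44, 0.75)
critical = NormalDist().inv_cdf(0.95)  # plays the role of NORM.S.INV(0.95)
reject_h0 = z > critical
```

Because z falls below the critical value (equivalently, the p-value is larger than 0.05), `reject_h0` is False, matching the conclusion on the card.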
Using the t-statistic
If |t|<1, then the standard error will decrease and adjusted R^2 will increase if the variable is removed. If |t|>1 then the opposite will occur----- you are using t-values instead of p-values basically
Simple Linear Regression
Involves a single independent variable
Are you proving anything with hypothesis testing?
No you are not
H0=? H1=?
Null hypothesis (describes an existing theory) Alternative Hypothesis (the complement of the null)
What do you do if T is smaller than the lower critical value?
Reject the null hypothesis
What is the rule of thumb for standard residual?
Standard residuals outside of +/-2 or +/-3 are potential outliers
Independence of Errors:
Successive observations should not be related - This is important when the independent variable is time
What excel function for a two tailed test using t-distribution?
T.INV(1-a/2,n-1) or T.INV.2T(a, n-1)
What does an increase in adjusted R^2 indicate?
That the model has improved
what is the test statistic used for?
The decision to reject or fail to reject a null hypothesis - depends on the type of hypothesis test
What happens the further away the true mean is from the hypothesized value?
The smaller the value of β
Power test
The value of 1 - Beta =P(rejecting H0 | H0 is false) The value of β cannot be specified in advance and depends on the value of the (unknown) population parameter.
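Power can only be computed for an assumed value of the true parameter. The sketch below does this for an upper-tailed one-sample z-test, assuming a known population standard deviation; the function name and all numbers are hypothetical and chosen only to show how power responds to the true mean and the sample size.

```python
from math import sqrt
from statistics import NormalDist

def power_upper_tail_z(mu0, mu_true, sigma, n, alpha=0.05):
    """Power = 1 - beta for H0: mu <= mu0 vs H1: mu > mu0 (sigma assumed known)."""
    std_error = sigma / sqrt(n)
    # Reject H0 when the sample mean exceeds this cutoff
    cutoff = mu0 + NormalDist().inv_cdf(1 - alpha) * std_error
    # beta = P(not rejecting H0 | true mean is mu_true)
    beta = NormalDist(mu_true, std_error).cdf(cutoff)
    return 1 - beta

# Hypothetical scenarios, for illustration only
power_near = power_upper_tail_z(mu0=25, mu_true=28, sigma=10, n=30)
power_far = power_upper_tail_z(mu0=25, mu_true=31, sigma=10, n=30)
power_big_n = power_upper_tail_z(mu0=25, mu_true=28, sigma=10, n=100)
```

In this sketch, moving the true mean further from the hypothesized value, or increasing the sample size, raises the power (lowers β).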
What is the problem with higher order polynomials?
They are generally not very smooth and hard to interpret visually - DO NOT recommend going beyond the 3rd order
If the test statistic is nonnegative, then
The one-tailed p-value in the Excel output is the correct p-value for an upper-tailed test, but you must subtract it from 1 for a lower-tailed test
Excel output: If the test statistic is negative for a one-tailed p-value...
This is the correct value for a lower-tailed test, but for an upper-tailed test you must subtract the value from one
What is a 2nd order polynomial shaped like?
U-shaped
Chi - Square test calculations Step 1:
Using a cross-tabulation of the data, compute the expected frequency if the two variables are independent.
Standard Error
Variability between observed and predicted Y values... "Standard Error of the Estimate"
Improving the power of the test
We would like the power of the test to be high (equivalently, we would like the probability of a type 2 error to be low) to allow us to make a valid conclusion
What do you do in situations where the data is naturally paired/matched?
A paired t-test is more accurate than assuming that the data come from independent populations. μD is the mean difference between the paired samples
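A paired t-test works on the differences within each pair. The sketch below computes the test statistic as the mean difference divided by its standard error; the function name and the before/after data are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(first, second, d0=0.0):
    """t statistic for H0: mean difference (second - first) equals d0."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    std_error = stdev(diffs) / sqrt(n)  # sample std. dev. of the differences
    return (mean(diffs) - d0) / std_error

# Hypothetical matched measurements, for illustration only
before = [10, 12, 9, 11]
after = [12, 15, 10, 14]
t_paired = paired_t(before, after)
```

The statistic has n - 1 degrees of freedom, so it would be compared with a critical value such as Excel's T.INV.2T(alpha, n-1) for a two-tailed test.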
chi-square test for independence
a test to determine whether two classifications are independent H0= Two categorical variables are independent H1 = two categorical variables are dependent
adjusted R square
Adjusts R^2 for the sample size and the number of X variables
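One common way to write the adjustment is adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), where n is the sample size and k is the number of independent variables. A minimal sketch with hypothetical values:

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for sample size n and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical values: adding a second variable raises R^2 only slightly
adj_one_var = adjusted_r_squared(0.60, n=5, k=1)
adj_two_vars = adjusted_r_squared(0.61, n=5, k=2)
```

In this example R^2 rises from 0.60 to 0.61, yet adjusted R^2 falls, which is why adjusted R^2 is the better guide when deciding whether to keep a variable.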
For an upper-tailed test, if the confidence interval falls entirely above the hypothesized value, we
also reject the null hypothesis
What does adjusted R^2 reflect?
Both the number of independent variables and the sample size; it may either increase or decrease when an independent variable is added or dropped. An increase in adjusted R^2 indicates that the model has improved.
how can we test for interactions?
by defining a new variable as the product of the two variables, X3=X1*X2 and testing whether this variable is significant, leading to an alternative model
For a lower test of a one-tailed critical value in excel what must you do?
change the sign
What is chi-square distribution?
characterized by degrees of freedom - is a sampling distribution
What is multiple R squared called?
coefficient of multiple determination
The conclusion to reject or fail to reject H0 is based on....
comparing the value of the test statistic to a "critical value" from the sampling distribution of the test statistic when the null hypothesis is true and the chosen level of significance a
What should you do if you choose a small level of significance?
compensate by having a large sample size
ANOVA
Conducts an F-test to determine whether variation in Y is due to varying levels of X. Used to test the significance of regression: H0: population slope coefficient = 0; H1: population slope coefficient <> 0
what does the critical value do to the sample distribution?
divides the sampling distribution into two parts, a rejection region and a non-rejection region. If the test statistic falls into the rejection region, we reject the null hypothesis; otherwise, we fail to reject it.
What is the base of natural log functions and what is it used for?
e = 2.71828; it is often used as the base b in exponential functions
Procedure: Paired two - sample test for means
excel t-test: paired two sample for means
What do small sample sizes result in?
A low value of 1 - β (low power)
for a one tail test if H1 is stated as >, the rejection region is...
in the upper tail
For ANOVA what does rejecting H0 mean?
indicates that X explains variation in Y
b0=
intercept
Hypothesis Testing
involves drawing inferences about two contrasting propositions (each called a hypothesis) relating to the value of one or more population parameters
Multiple Regression
involves two or more independent variables
variance inflation factor
is a better indicator of Multicollinearity but is not on excel
What happens when significant Multicollinearity is present?
it becomes difficult to isolate the effect of one independent variable on the dependent variable, the signs of coefficients may be the opposite of what they should be, making it difficult to interpret regression coefficients, and p-values can be inflated.
If the test statistic is nonnegative, then the one-tailed p-value...
is correct for an upper-tail test, but for a lower-tail test you must subtract this number from 1.0 to get the correct p-value
What does it mean when values in the correlation matrix exceed the recommended threshold of +/- 0.7?
Large correlations exist
What is significant about correlations exceeding +/- 0.7?
may indicate multicollinearity
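The ±0.7 rule of thumb can be checked programmatically. The sketch below computes pairwise Pearson correlations among the independent variables and flags any pair beyond the threshold; the helper names and the data are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Sample correlation coefficient between two equal-length lists."""
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def high_correlations(columns, threshold=0.7):
    """List variable pairs whose |correlation| exceeds the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged

# Hypothetical independent variables, for illustration only
columns = {
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 11],
    "x3": [5, 1, 4, 2, 3],
}
flagged = high_correlations(columns)
```

Here only the x1/x2 pair is flagged, so those two variables would be the ones to examine for multicollinearity.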
What is the R^2
A measure of the fit of the line to the data. The value is between 0 and 1, where 1 is a perfect fit; the larger the value, the better the fit
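R^2 can be computed as 1 - SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squares about the mean of Y. A minimal sketch, using hypothetical actual and predicted values:

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SSE / SST."""
    y_bar = sum(actual) / len(actual)
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # unexplained
    sst = sum((a - y_bar) ** 2 for a in actual)                 # total
    return 1 - sse / sst

# Hypothetical values; predictions come from the line y = 2.2 + 0.6x at x = 1..5
actual = [2, 4, 5, 4, 5]
predicted = [2.8, 3.4, 4.0, 4.6, 5.2]
r2 = r_squared(actual, predicted)
```

A value of 1 would mean every predicted value equals the actual value.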
What is Multiple R called
multiple correlation coefficient
Linear regression model with more than one independent variable is called what?
Multiple linear regression model: Y = B0 + B1X1 + ... + BkXk + E, where Y = dependent variable; X1...Xk = independent (explanatory) variables; B0 = intercept term; B1...Bk = regression coefficients for the independent variables; E = error term
What does regression analysis require?
numerical data
interaction
occurs when the effect of one variable is dependent on another variable
Multicollinearity
occurs when there are strong correlations among the independent variables and they can predict each other better than the dependent variable
types of hypothesis tests
one-sample test for mean, σ known; one-sample test for mean, σ unknown
Why would you reject H0 relating to the P-value?
p-value < a
What does the excel function CHISQ.TEST(actual range, expected range) compute?
p-value for the chi- squared test
One Sample Tests for Proportions
pi0 is the hypothesized value and p-hat is the sample proportion
•Test whether the average age of respondents is equal to 35. -H0: mean age = 35 - H1: mean age <> 35 •n = 34; sample mean = 38.677; sample standard deviation = 7.858. What is the test statistic?
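t = (38.677 - 35) / (7.858 / sqrt(34)) = 2.73; reject H0 (for example, at a 0.05 level of significance the two-tailed critical value T.INV.2T(0.05, 33) is about 2.03)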
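The test statistic for this example is the one-sample t statistic, t = (sample mean - hypothesized mean) / (s / sqrt(n)). A short sketch using the numbers on the card; the function name is hypothetical.

```python
from math import sqrt

def one_sample_t(sample_mean, hypothesized_mean, sample_sd, n):
    """t statistic: distance from the hypothesized mean in standard errors."""
    std_error = sample_sd / sqrt(n)
    return (sample_mean - hypothesized_mean) / std_error

# Age example: n = 34, sample mean = 38.677, sample std. dev. = 7.858
t_age = one_sample_t(38.677, 35, 7.858, 34)
```

The result would be compared with a two-tailed critical value with n - 1 = 33 degrees of freedom, for example from Excel's T.INV.2T.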
For a lower tailed test if the confidence interval falls entirely below the hypothesized value we...
reject the null hypothesis
Partial Regression Coefficients
represent the expected change in the dependent variable when the associated independent variable is increased by one unit while the values of all other independent variables are held constant.
Standard Residual formula =
residual/standard deviation
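A sketch of the standard residual calculation, assuming that "standard deviation" here means the sample standard deviation of the residuals (an assumption; software may use a slightly different scaling). The residual values are hypothetical, and the ±2 cutoff is the rule of thumb for potential outliers.

```python
from statistics import stdev

def standard_residuals(residuals):
    """Divide each residual by the sample std. dev. of the residuals (assumed scaling)."""
    s = stdev(residuals)
    return [e / s for e in residuals]

# Hypothetical residuals from a fitted line, for illustration only
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
std_res = standard_residuals(residuals)
# Rule of thumb: flag standard residuals outside +/-2 as potential outliers
potential_outliers = [i for i, z in enumerate(std_res) if abs(z) > 2]
```

In this small example no observation is flagged, since every standard residual lies within ±2.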
What does ANOVA test for?
Significance of the entire model; that is, it computes an F-statistic for testing the hypotheses H0: B1 = B2 = ... = Bk = 0; H1: at least one Bj is not 0
B1=
slope
•In the CadSoft example, sample data for 44 customers revealed a mean response time of 21.91 minutes and a sample standard deviation of 19.49 minutes.
t = −1.05 indicates that the sample mean of 21.91 is 1.05 standard errors below the hypothesized mean of 25 minutes.
What is an alternative way of testing whether or not the slope is 0?
t-test
What does the F-test do?
Tests for equality of variances between two samples - MUST assume that both samples are drawn from normal populations
Excel Output if the test statistic is negative
the one-tailed p-value is the correct p-value for a lower - tail test
What happens to R^2 as the order of the polynomial increases?
R^2 increases; that is, a 4th order polynomial will provide a better fit than a 3rd order, and so on.
Where is the rejection region for a one-tailed test if H1 is stated as <?
the rejection region is in the lower tail
What happens if we are not able to reject the null with a certain variable in an ANOVA for Multiple Regression?
then that independent variable is not significant and probably should not be included in the model. You remove and then rerun until you can reject the null
factor
variable of interest
Homoscedasticity
variation about the regression line is constant - examine the residual plot
Normality of Errors
view a histogram of standard residuals, regression is robust to departures from normality
Multiple R
Equal to |r|, where r is the sample correlation coefficient. The value of r varies from −1 to +1 (r is negative if slope is negative)
Linear
y=a+bx
Exponential
y=ab^x
Polynomial (2nd order)
y=ax^2+bx+c
Polynomial (3rd order)
y=ax^3+bx^2+cx+d
Power
y=ax^b
Logarithmic
y=ln(x)
Simple Linear Regression Model
The error term E is calculated separately and is not used in estimating the parameters
Excel output: upper tail p-value test and the test statistic is negative
you must subtract this number from one to get the correct p-value
Selecting the Proper Excel Procedure: •Population variances are known:
z-test: two-sample for means
Two Sample Hypothesis Test: Upper tailed Test
•This test seeks evidence that the difference between population parameter (1) and population parameter (2) is greater than some value, D0. When D0 = 0, the test simply seeks to conclude whether population parameter (1) is larger than population parameter (2).
Two sampled Hypothesis Tests: Lower-tailed test H0
•This test seeks evidence that the difference between population parameter (1) and population parameter (2) is less than some value D0 When D0= 0, the test simply seeks to conclude whether population parameter (1) is smaller than population parameter (2).
Two-Sample Hypothesis tests: Two Tailed Test
•This test seeks evidence that the difference between the population parameters is not equal to D0 - When D0 = 0, we are seeking evidence that population parameter (1) differs from population parameter (2) ---- In most applications D0 = 0 and we are simply seeking to compare the population parameters
Residuals
•are the observed errors associated with estimating the value of the dependent variable using the regression line: residual = actual Y value - predicted Y value
Overfitting means...
•fitting a model too closely to the sample data at the risk of not fitting it well to the population in which we are interested.
Statistical Inference
•focuses on drawing conclusions about populations from samples. •Statistical inference includes estimation of population parameters and hypothesis testing, which involves drawing conclusions about the value of the parameters of one or more populations.
Regression Analysis
•is a tool for building mathematical and statistical models that characterize relationships between a dependent (ratio) variable and one or more independent, or explanatory variables (ratio or categorical), all of which are numerical.
p-value
•is the probability of obtaining a test statistic value equal to or more extreme than that obtained from the sample data when the null hypothesis is true.
What does Anova Measure?
•measures variation between groups relative to variation within groups.
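The comparison of between-group and within-group variation can be sketched as a one-way ANOVA F statistic. This is a minimal illustration; the function name and the three groups are hypothetical.

```python
def one_way_anova_f(groups):
    """F = (between-group mean square) / (within-group mean square)."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical samples from three groups, for illustration only
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
```

A large F means the group means differ by more than the within-group scatter would suggest; the value would be compared with an F critical value for (df_between, df_within) degrees of freedom.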