MKT 317 Final
how to show residuals in R
residuals(employment_model)
Interpreting p-values for multiple linear regression
Interpreting a small p-value: The p-value for the coefficient of the variable (years experience) is very small, much less than 0.05. Therefore we can conclude that once we have controlled for number of subordinates and amount of post-bachelor's education, there is a statistically significant additional relationship between the number of years experience and the average salary.
Interpreting a large p-value: The p-value for the coefficient of the variable (size of team) is large, much greater than 0.05. From this information we can NOT conclude that the size of team is not linearly related to the average salary. We do not have enough information from the model provided to determine whether or not the size of team is correlated with salary!
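A minimal sketch of how such a model could be fit in R (the data frame and variable names here are hypothetical):
# hypothetical salary model; the Coefficients table of summary() shows a p-value (Pr(>|t|)) for each slope
salary_model <- lm(SALARY ~ EXPERIENCE + SUBORDINATES + POST.BACH.EDU + TEAM.SIZE, data=EmployeeData)
summary(salary_model)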
r-squared
A number between 0 and 1 that measures the strength of the relationship; values closer to 1 indicate a stronger fit.
If the p-value associated with the slope is greater than 0.05, then...
- The data do not provide sufficient evidence to conclude that X is correlated with Y.
- The data do not provide sufficient evidence to conclude that X is linearly related to Y.
- There is not a statistically significant correlation between X and Y.
- There is not a statistically significant linear relationship between X and Y.
If the p-value associated with the slope is less than 0.05, then we can make the following (equivalent) conclusions...
- The data provide sufficient evidence to conclude that X is correlated with Y (with over 95% confidence).
- There is a statistically significant correlation between the variables X and Y (at the 95% confidence level).
- We are 95% confident that X is correlated with Y.
- The data provide sufficient evidence to conclude that there is some linear relationship between X and Y at the 95% confidence level.
- There is a statistically significant linear relationship between the variables X and Y.
- We are 95% confident that X has a linear relationship with Y.
What is required for an independent t-test
- The dependent variable (Y-variable) must be quantitative.
- The independent variable (X-variable) is a categorical variable that has two possible values. (In other words, the X-variable is a "Group" variable and we are comparing the average "Y" between two groups.)
- The data are independent: each person is either in Group 1 or Group 2, but not both. In other words, each person appears in the data set exactly one time.
AIC (Akaike information criterion)
AIC measures model quality. Model quality is a trade-off of accuracy and simplicity:
- Higher accuracy means higher quality.
- More simplicity (fewer variables) means higher quality.
- AIC balances accuracy and simplicity (see the sketch below).
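A sketch of comparing two hypothetical models with R's AIC() function (model and data names are borrowed from the model-reduction card later in this guide):
SimpleModel <- lm(PRICE ~ SIZE, data=HOUSEDATA)
BiggerModel <- lm(PRICE ~ SIZE + TAX + AGE, data=HOUSEDATA)
AIC(SimpleModel, BiggerModel)   # the model with the lower AIC has the better accuracy/simplicity balance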
ANOVA in R
ANOVAname <- aov(Y ~ independent variable, data=name of data set)
summary(ANOVAname)
Include as.factor(variable name) around the X-variable if it is categorical but entered numerically.
Interaction term
An interaction term in a linear model is a term where two or more of the variables are being multiplied together.
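A minimal sketch in R (names are made up): the colon creates an interaction term, and X1*X2 is shorthand for X1 + X2 + X1:X2.
InteractionModel <- lm(Y ~ X1 + X2 + X1:X2, data=MyData)
summary(InteractionModel)   # the X1:X2 row is the interaction term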
Panel data
Panel data consist of several cross-sections of observations from the same individuals, so each individual appears in the data set more than once. A balanced panel has an observation for every individual in every period; an unbalanced panel contains some "missing values."
T-test, linear regression, and ANOVA: when to use them
If the Y-variable is quantitative and there is at least one quantitative X-variable: We may use linear regression to analyze the data (or exponential, power, logarithmic, or polynomial regression). We may not use an ANOVA or a t-test for the difference of means.
If the Y-variable is quantitative and all independent variables are categorical: We may use linear regression or an ANOVA to analyze the data. The interpretations are slightly different between linear regression and ANOVA, so we choose the method based on the desired interpretation.
If the Y-variable is quantitative and the only independent variable is categorical with only two possible values (such as a yes/no variable): We may use linear regression, ANOVA, or a t-test for a difference of means to analyze the data. All three methods are equivalent (they will give the same p-value). The t-test for the difference of means is the most commonly used in this situation.
ANOVA p-value interpretations
If the p-value is greater than 0.05, then we do not have enough information to conclude that there are differences between groups. Informally, all groups have "about the same" population average. If the p-value is less than 0.05, then we conclude that at least one group's population average is different from some other group (i.e. there are some differences between groups).
Paired t-test p-value interpretations
If the p-value is small (this often means less than 0.05), then we conclude that once we control for individual variation, there is a statistically significant difference in the average Y-value between the two situations. If the p-value is not small (this often means greater than 0.05), then when we control for individual variation, there is not a statistically significant difference in average Y-value between the two situations; we conclude that the average Y-value for the two situations are "about the same, once we control for individual variation."
t-test p-value interpretations
If the p-value of the t-test for the difference of means is larger than 0.05, then we do not have evidence to support the claim that the averages are unequal, which means it is plausible that the averages might be equal, or at least generally close to each other. When this is the case, we say "there is not a statistically significant difference between the population averages." If the p-value of the t-test for the difference of means is small (less than 0.05), then we have sufficient evidence to confidently conclude that the populations have unequal averages. When this is the case, we say "there is a statistically significant difference between the population averages."
Chi-squared: Test statistic interpretation
If the test statistic is larger than the critical number, then there is a statistically significant relationship/correlation between the variables. If the test statistic is less than the critical number, then there is not a statistically significant relationship between the variables (i.e. we conclude that the variables are independent / not correlated).
When do we care about multicollinearity
Interpreting coefficients (slopes): we need low multicollinearity. Not interpreting slopes: Multicollinearity (often) doesn't matter!
Confidence interval R for two populations
It is important to enter groups in the same order when defining the YES and TOTAL. Below, we are entering the values for Segment A first, then the values for Segment B.
YES <- c(245, 56)
TOTAL <- c(341, 159)
prop.test(YES, TOTAL)
Model Forms
Linear: Y = a + b(X). An absolute change in X corresponds to an absolute change in Y.
Exponential: Y = a*b^X. An absolute change in X corresponds to a percentage change in Y.
Logarithmic: Y = a + b*log(X). A percentage change in X corresponds to an absolute change in Y.
Power Law: Y = a*X^b. A percentage change in X corresponds to a percentage change in Y.
Power law regression
LogXLogYModel <- lm(log(Electricity)~log(Income), data=GDPandElectricity)
Model form: Y = a*X^b, which is fitted as ln(Y) = b0 + b1*ln(X)
Shortcut: if ln(Y) = b0 + b1*ln(X), then Y = (e^b0)*X^b1
Logs spread out the small values and condense the large values, so often when we see data that are very highly skewed, it is nice to consider the log of the highly skewed variable.
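To recover a and b in the power-law form from the fitted log-log model (a sketch, assuming the model above has been run):
coef(LogXLogYModel)           # b0 (intercept) and b1 (slope)
exp(coef(LogXLogYModel)[1])   # a = e^b0
coef(LogXLogYModel)[2]        # b = b1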
How to compute coefficients for log regression model
LogXModel <- lm(Score~log(MinutesOnHold), data=CustomerSatisfaction)
coef(LogXModel)
Logarithmic Model
LogXModel <- lm(Y ~ log(X), data=name of the data frame)
Y = b0 + b1*ln(X)
How to compute coefficients for exponential regression model
LogYModel <- lm(log(Score)~MinutesOnHold, data=CustomerSatisfaction)
coef(LogYModel)
exp(coef(LogYModel))
Exponential Model
LogYModel <- lm(log(Y) ~ X, data=name of the data frame)
ln(Y) = b0 + b1*X, which is equivalent to Y = a*b^X
Shortcut: if ln(Y) = b0 + b1(X), then Y = (e^b0)*(e^b1)^X
Logistic regression in R
Model <- glm(dependent dummy variable ~ independent variables separated by plus symbols, data=name of data set, family=binomial)
summary(Model)
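Once the model is fit, predicted probabilities can be requested (a sketch, assuming the Model object above):
predict(Model)                    # the linear part (log-odds) for each observation
predict(Model, type="response")   # the predicted PROBABILITY of a YES for each observation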
When do we include the identification variable as an independent variable in a multiple linear regression model?
Only when we have two or more observations per individual and want to control for individual variability.
Overfitting
Overfitting means making a model that is overly complicated: one that fits the specific data set very well, but is so complicated that it does not capture the trend in the population.
Multiple R-squared of a model
Proportion of variation of the dependent variable that is explained by the regression model.
Pros and Cons of Polynomials
Pros:
- Polynomial models can be more accurate at predicting the average value of Y for a given value of X.
- Polynomial models can capture relationships where the average value of Y is sometimes increasing and other times decreasing.
Cons:
- When the polynomial model has a degree larger than 2, there is no nice way to interpret the general relationship of how a change in X will impact a change in Y. The only four model types where we have some constant rate of change (in terms of absolute or percentage change) are linear, exponential, logarithmic, and power law models.
- Sometimes it is challenging to know what degree polynomial to use.
- It is possible to have issues associated with "overfitting."
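A minimal sketch of fitting a degree-2 (quadratic) polynomial model in R (names are hypothetical):
# I() is needed so that X^2 is treated as squaring rather than formula syntax
QuadModel <- lm(Y ~ X + I(X^2), data=MyData)
summary(QuadModel)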
Chi-squared in R
Row1 <- c(10, 20, 30)
Row2 <- c(15, 10, 15)
Row3 <- c(50, 35, 40)
Row4 <- c(30, 20, 25)
Table <- rbind(Row1, Row2, Row3, Row4)
chisq.test(Table)
Result: X-squared = 11.913, df = 6, p-value = 0.06394
correlation coefficient
The correlation coefficient is a number that is between -1 and +1. The higher the absolute value, the stronger the correlation.
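For example, using R's cor() function on the built-in mtcars data:
cor(mtcars$mpg, mtcars$wt)   # about -0.87, a strong negative correlation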
OLS regression
The idea behind OLS regression is that we want to minimize the overall size of the prediction errors (residuals)
F statistic
The larger the F statistic, the more evidence we have that there are differences between the groups
correlogram
The larger the dot, the stronger the correlation. The darker the dot, the stronger the correlation. Very dark blue dots indicate strong positive correlation. Very dark red dots indicate strong negative correlation.
ANOVA equivalence of approaches (test statistic in relation to critical number + p-value)
The test statistic is less than the critical number exactly when the p-value is greater than 0.05. In both situations, we do not have enough evidence to conclude that there are differences between the groups. The test statistic is greater than the critical number exactly when the p-value is less than 0.05. In both situations, we conclude that there are some differences between the groups.
Time series data
Time series data are measurements for only ONE variable for the SAME individual over time. Time series data must be collected at regular (equal) time intervals
limitation of a scatterplot
We can only reasonably graph using two axes, which means we can only use the scatter plot method to determine if a model is appropriate for the data if there is only one quantitative X-variable!
When to use (and when not to use) as.factor() in R
When do we need to use the as.factor() function?
- When we have a categorical variable that is entered in numeric (number) format, such as AGENCY (with values 1, 2, 3, or 4); an example model is sketched below.
When is the as.factor() function optional?
- The as.factor() command is optional when we have categorical variables entered as text. (R will automatically treat them correctly, but adding as.factor() will not hurt the analysis.)
- The as.factor() command is optional when we have a categorical variable whose only values are 0 and 1 (i.e. a dummy variable). It is still preferred to use the as.factor() command for 0/1 variables, but not absolutely necessary.
When can we NOT use the as.factor() function?
- We should not use the as.factor() function for a quantitative variable, or for an ordinal variable that we wish to analyze as if it were quantitative.
- In the real world, there might be some special cases where using as.factor() for a quantitative variable may be appropriate. In this course, never use as.factor() with a quantitative variable unless specifically told to (i.e. only use as.factor() with a quantitative variable if you're told to treat it as if it were a categorical variable).
- We never use as.factor() for the Y-variable in a linear model.
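A sketch of the AGENCY example from the first bullet (the other variable and data-frame names are hypothetical):
# AGENCY is categorical but stored as the numbers 1-4, so wrap it in as.factor()
Model <- lm(SALARY ~ as.factor(AGENCY) + EXPERIENCE, data=EmployeeData)
summary(Model)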
To determine if there is a statistically significant interaction between X1 and X2 with respect to Y, it depends on what type of variables X1 and X2 are:
If X1 and X2 are each either quantitative or a categorical variable with only two possible values (some examples: Location: North/South; Temperature: over 95 degrees/under 95 degrees): if the p-value for the interaction term X1:X2 is very small (typically less than 0.05), then we say that there is a statistically significant interaction between X1 and X2 with respect to Y.
If X1 and/or X2 is a categorical variable with more than two possible values (for example, Agency, which can equal 1, 2, 3, or 4): there will be more than one interaction term in the output, one for each necessary dummy variable. If one or more of these interaction terms has a very small p-value (typically less than 0.05), then we say there is a statistically significant interaction between X1 and X2 with respect to Y.
Confidence interval R for a single population
YES <- c(301)
TOTAL <- c(500)
prop.test(YES, TOTAL)
OR: prop.test(301, 500)
ANOVA
A t-test is used to detect a difference between two groups or situations, and an ANOVA is used to detect a difference between two or more groups or situations. An ANOVA tests to see if the variability (differences) between groups is sufficiently large compared to the random variation that appears within the groups (the "noise"). Then, the sample size and the number of possible values of the categorical variable are taken into account, and a "test statistic" is calculated; this is called an "F-statistic."
Graphical Intuition: When the trend lines are not parallel and not equal
consider breaking the data into two groups (which is a special case of a "subset analysis") or consider using an "interaction term," which means "different slopes for different groups."
Graphical Intuition: When the trend lines are parallel, but not equal
means that once we've controlled for X, then there probably is a difference in the average Y-value when comparing some of the groups. [We need to use p-values to know for sure!]
Graphical Intuition: When the trend lines are about the same
means that once we've controlled for X, then there probably isn't a difference in average Y when comparing the groups.
Prediction error and residual
observed - predicted
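A quick check in R using the built-in mtcars data:
m <- lm(mpg ~ wt, data=mtcars)
all.equal(unname(residuals(m)), unname(mtcars$mpg - predict(m)))   # TRUE: residual = observed - predicted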
Independent observations
one observation per individual
Paired t-test in R
t.test(Y ~ X, paired=TRUE, data=name of data set)
t-test in R
t.test(Y ~ X, var.equal=TRUE, data=name of data set)
when is there an interaction when looking at a box-and-whiskers plot
when the averages are different between the groups (small ANOVA p-value)
When there is an interaction on a graph
when the changes between groups aren't the same
One-way, two-way, three-way etc. ANOVA
The name corresponds to the number of categorical X-variables in the model (one-way = one categorical X-variable, two-way = two, and so on).
percent change
(end-start)/start
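For example, if sales go from 80 to 92:
(92 - 80) / 80   # = 0.15, i.e. a 15% increase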
Factorial ANOVA
Has an interaction term
Multicollinearity
when one independent variable has a strong linear relationship with a combination of one or more of the other independent variables. We want to avoid this issue in our regression models, as the estimates of the slopes will not be reliable in a model with high multicollinearity, which means that our interpretations of the model will not be valid.
If the p-value for the coefficient of X1 is not small...
(in most cases, this means p-value > 0.05), then the corresponding independent variable is either not linearly related to the outcome variable (when all other variables have been accounted for) or it is redundant.
Cross-sectional data
Cross-sectional data are data collected from multiple individuals during one period of time, where each individual has exactly one value for each variable.
Independent t-test
Difference of means tests to determine if there is likely to be a difference in the population average values of Y when comparing Group 1 and Group 2, where each individual in the data set has exactly one Y-value; each individual in the data set is either in Group 1 or Group 2 (but not both).
Paired t-test
Difference of means tests to determine if there is likely to be a difference in the population average values of Y when comparing Situation 1 and Situation 2, where each individual has a measurement for both situations (each individual has a pair of observations, i.e. two "Y-values").
Reducing models in R
FullModel <- lm(PRICE ~ SIZE + TAX + SPEC.FEATS + AGE + TRAIN.DIST, data=HOUSEDATA)
step(FullModel)
F-statistic and the critical number
If the F statistic is less than the critical number, then we do not have enough information to conclude that there are differences between groups. Informally, all groups have "about the same" population average. If the F statistic is larger than the critical number, then we conclude that at least one group's population average is different from some other group (i.e. there are some differences between groups).
Chi-squared: p-value interpretation
If the p-value is less than 0.05, then there is a statistically significant relationship/correlation between the variables. If the p-value is greater than 0.05, then there is not a statistically significant relationship between the variables (i.e. we conclude that the variables are independent / not correlated).
Types of ANOVA
Independent: cross-sectional data, where every individual has exactly one Y-value in the data set.
Mixed measures: panel data, but we have some variables that are unchanging characteristics of an individual over the duration of the study (such as location, employment status, etc.), whereas other variables change (such as time).
Repeated measures: panel data where we do not have any variables that are unchanging characteristics of an individual.
Calculations
Module 8 -> Part 4 -> Calculating the F statistic
Logistic regression
Three pieces (worked example below):
- Linear part = a + bX
- Predicted ODDS of a YES outcome = e^(linear part), to 1
- Predicted PROBABILITY of a YES outcome = ODDS / (1 + ODDS)
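A worked example with made-up numbers (a = -2, b = 0.5, X = 6):
linear_part <- -2 + 0.5 * 6   # = 1
odds <- exp(linear_part)      # about 2.72, so the predicted odds of a YES are about 2.72 to 1
odds / (1 + odds)             # about 0.73 predicted probability of a YES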
Reformatting categorical variables in R
To re-format numeric categorical variables in R, we re-define the numeric variable as a factor variable using the as.factor() function.
Interpretation of coefficient and p-value
When we control for [list all x-variables in the model, except the variable whose p-value we are interpreting], there is a statistically significant difference in average [y-variable] when comparing [group defined by the dummy variable] and [the comparison group].
how to show predicted values in R
predict(employment_model)
Chi-squared: critical number
qchisq(0.95, df=degrees of freedom)
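For example, matching the Chi-squared in R card above (df = 6, at the 0.05 significance level):
qchisq(0.95, df=6)   # about 12.59; the test statistic 11.913 is below this, consistent with the p-value of 0.064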
ANOVA: calculating the critical number
qf(0.95, df1 = numerator degrees of freedom, df2 = denominator degrees of freedom)