482 end me please Exam 1


What are Scatter Plots?

- are two-dimensional graphs produced by plotting one continuous variable against another continuous variable within a set of coordinate axes, and they describe the relationship between the two variables. - A linear relationship can be inferred when the general shape of a scatter plot is a straight line.

R^2 is bounded between:

0 and 1

You want R^2 to be close to________ because this indicates that the model is a relatively good fit to the data.

1

Logistic Regression equation explanation:

1. No error term in this equation 2. The Betas here are calculated from maximum likelihood estimation

How do you decide whether there is an association between variables?

1. Null hypothesis : That there is NOT an association between variables. 2. Alternative hypothesis: That there IS an association between variables. 3. DF=(number of rows-1)*(number of columns-1)

When you perform a regression analysis, several assumptions must be met to provide valid tests of hypotheses and confidence intervals:

1. The linear model fits the data adequately. a. For example, in simple linear regression, the mean of the response variable is linearly related to the value of the predictor variable. To make sure that a linear relationship exists, you can create a scatter plot of the response, Y, versus the predictor, X. 2. Errors are normally distributed with mean 0. 3. Errors have an equal variance at each value of the predictor variable. 4. Errors are independent.

For the Logistic Regression model, log means natural log rather than log with a base

10

What is Point Estimate?

A single, best estimate of a population parameter

What is Inferential statistics?

Focus is on learning about populations

What is Central Limit Theorem?

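Indicates that the distribution of sample means approaches a normal distribution as the sample size grows, regardless of the shape of the population distribution (provided its variance is finite)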
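
A quick simulation can make this concrete. The sketch below is only an illustration under stated assumptions: it draws repeated samples from a strongly skewed exponential population (mean 1, standard deviation 1) and collects the sample means. The function name `simulate_sample_means`, the sample size of 50, and the 2,000 repetitions are arbitrary choices, not part of the course material.

```python
import random
import statistics

def simulate_sample_means(sample_size, repetitions, seed=0):
    """Return the means of repeated samples from an exponential(1) population."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        statistics.mean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(repetitions)
    ]

means = simulate_sample_means(sample_size=50, repetitions=2000)
print(statistics.mean(means))   # close to the population mean, 1
print(statistics.stdev(means))  # close to 1 / sqrt(50), roughly 0.14
```

Even though the population is skewed, a histogram of `means` would look roughly bell-shaped, centered near 1.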

What is an Association?

It exists between two variables when the expected value of one variable changes at different levels of the other variable.

_____________________________________ of the response variable are common remedies to the departure from equal variances.

Natural log and square root transformations

What is Information criteria?

Other selection criteria that you can use to select variables for a model, as well as evaluate competing models.

The Y within the Simple Linear Regression model equation =

Response

For Adjusted R-squared:

The R-square always increases or stays the same as you include more terms in the model.

You should add a ________________ to fix residual plots that don't fit the linear assumption

quadratic term

Good residual plots are:

randomly scattered

When you compare information criterion values, the model with the ___________ information criterion is considered better.

smaller

Selection Method Stepwise:

1. Combines aspects of both forward and backward selection. 2. It starts with no predictor variables in the model and incrementally builds a model one variable at a time, as in forward selection. 3. However, as in backward selection, stepwise selection can drop non-significant variables. 4. The stepwise selection process terminates if no further variables can be added to or removed from the model, or when the variable to be added to the model is the one just deleted from it. 5. The default p-values to enter and to stay in the model are both 0.15. The p-values can be changed based on the research goal or subject-matter expertise.

What is Coefficient of determination, R squared?

A measure of the proportion of variability in the response variables explained by the predictor variables in the analysis. It's calculated as the model sum of squares divided by the total sum of squares.
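
As a minimal sketch of that calculation, the function below computes R^2 from observed and predicted values. It uses the equivalent form 1 - (error SS / total SS), which matches model SS / total SS only for a least-squares fit that includes an intercept; that restriction, and the name `r_squared`, are assumptions of this example.

```python
def r_squared(observed, predicted):
    """Return R^2 = 1 - error SS / total SS."""
    mean_y = sum(observed) / len(observed)
    total_ss = sum((y - mean_y) ** 2 for y in observed)
    error_ss = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - error_ss / total_ss

y = [1, 2, 3, 4]
print(r_squared(y, [1, 2, 3, 4]))          # perfect predictions give 1.0
print(r_squared(y, [2.5, 2.5, 2.5, 2.5]))  # predicting the mean gives 0.0
```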

For Measurement of Scales, What is Categorical?

Associated with specific non-numeric levels

What is Logistic Regression?

Binary response=logistic regression

Test data is used for?

to give a final honest estimate of generalization for the chosen model

What are Predictor Variables?

- (Xs) The input, explanatory, or independent variable(s) - The measures that are associated with the response variable and can therefore be used to predict its value

What is the Response Variable?

- (Y) The outcome, target, or dependent variable - The variable you seek to predict

What is Empirical Rule?

- 68% of the normal distribution lies within 1 standard deviation of the mean; 95% lies within 2 standard deviations of the mean; 99.7% lies within 3 standard deviations of the mean - Anything past 3 standard deviations away is considered an outlier
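
The rounded percentages can be checked against the standard normal distribution. This is a small sketch using Python's standard library; the helper name `share_within` is made up for the example.

```python
from statistics import NormalDist

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))  # about 0.6827, 0.9545, 0.9973
```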

What is One-Sample t-test?

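- A one-sample t-test compares the mean calculated from a sample to a hypothesized mean. - (Know how to calculate this, as in homework 1.) - t = (x̄ - μ0) / (s / √n), where x̄ is the sample mean, μ0 the hypothesized mean, s the sample standard deviation, and n the sample size.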
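
The test statistic can be computed directly from its definition. The sketch below assumes a small made-up sample and a hypothesized mean of 5; neither comes from the homework, and `one_sample_t` is a hypothetical helper name.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """Return t = (sample mean - mu0) / (s / sqrt(n))."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)           # sample standard deviation (n - 1 denominator)
    standard_error = s / math.sqrt(n)
    return (x_bar - mu0) / standard_error

t = one_sample_t([4, 5, 6, 7, 8], mu0=5)
print(t)  # about 1.414, to be compared with a t distribution on n - 1 = 4 df
```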

What is Overfitting?

- Adding too many variables in a model - Leads to models that have higher variance when they are applied to a population.

What is Type 1 Error?

- If you reject the null hypothesis when it's actually true, you've made this type of error. - The probability of committing this type of error is α. - α is the significance level of the hypothesis test

What are Confidence Intervals for the mean?

- Interval estimators of the population mean. - (Know how to calculate confidence intervals, as in homework 1.)
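
One common form of the interval is x̄ ± t* · s / √n, where t* is the critical value from a t distribution with n - 1 degrees of freedom. The sketch below assumes that form and takes t* as an argument looked up from a t table (2.776 and 4.604 are the two-sided 95% and 99% values for 4 degrees of freedom). The data and the helper name `mean_confidence_interval` are illustrative only.

```python
import math
import statistics

def mean_confidence_interval(sample, t_crit):
    """Return (lower, upper) = sample mean -/+ t_crit * s / sqrt(n)."""
    n = len(sample)
    x_bar = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    margin = t_crit * standard_error
    return (x_bar - margin, x_bar + margin)

sample = [4, 5, 6, 7, 8]
print(mean_confidence_interval(sample, t_crit=2.776))  # 95%: roughly (4.04, 7.96)
print(mean_confidence_interval(sample, t_crit=4.604))  # 99%: wider, roughly (2.74, 9.26)
```

The second interval is wider than the first, which is the cost of asking for a higher confidence level.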

What is Underfitting?

- Not adding enough variables in a model - Leads to biased predictions

What is the P-Value?

- The probability of obtaining a test statistic as extreme or more extreme than the one observed in your data given that the null hypothesis is true. - P-value low= doubt about the truth of the null hypothesis

What is Standard Error?

- The variability associated with the sample mean, x̄.

Which automated model selection method is best?

- There's no one method that's best, just as there's no one model that's perfect. - All models are approximations that are based on a sample from the population of interest. - These model selection approaches provide suggestions for a useful approximating model.

What is Type II Error?

- This type of error occurs when you fail to reject the null hypothesis and it's actually false. Its probability is often denoted β.

What is the Null Hypothesis?

- Usually one of equality - What you assume to be true until the data provide evidence against it

What is the Alternative Hypothesis?

- What you suspect or what you're trying to demonstrate. - Alternative is one of inequality

Logistic Regression is used for:

- to model the relationship between a binary response variable and a set of predictor variables. - to estimate the probability of the response according to the various continuous and categorical predictors

What are the four types of information criteria available in SAS?

1. Akaike's information criterion (AIC) 2. Corrected Akaike's information criterion (AICC) a. Should be used in preference to AIC when you work with small sample sizes or relatively few observations per estimated parameter 3. Sawa Bayesian information criterion (BIC) 4. Schwarz Bayesian information criterion (SBC)

Selection Method Backward:

1. Also called backward elimination 2. Starts with all predictor variables in the model. 3. Results of the F tests for the individual parameter estimates are examined, and the least significant variable that is above the specified significance level is removed. 4. After a variable is removed from the model, it remains excluded and cannot reenter. 5. Backward selection is repeated until no other variable in the model meets the specified significance level for removal, 0.10 by default.

What is ANOVA?

1. Analysis of Variance 2. If you have a response variable that's continuous and all of your predictor variables are categorical, use this method

With Two-way ANOVA Treatments, or treatment groups:

1. Formed by combining the two factors. 2. If the first factor has three levels, and the second factor has four levels, then there are 12 treatment groups.

With Two-way ANOVA, Interaction effects are:

1. Occur when the relationship between the response and a predictor changes according to another predictor in the model. 2. This possible interaction means that the effect of one factor depends on the value of the other factor.

Selection Method Forward:

1. Starts with no predictor variables in the model. 2. Computes an F statistic for each predictor variable not in the model, and examines the largest of these statistics. 3. If it's significant at a specified significance level, the corresponding variable is added to the model. 4. After a variable is added to the model, it stays in, even if it becomes non-significant later. 5. Forward selection keeps adding variables, one at a time, until none of the remaining variables meets the specified level for entry, 0.05 by default.

What is K-fold Validation?

1. You train and assess the model on k total different partitions of the data for the same model. 2. The results from each holdout set can then be averaged to interpret how well the model generalizes to new data.
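
The two steps above can be sketched in a few lines. This example assumes the rows are already in random order (no shuffling is done) and uses a deliberately trivial "model" that predicts the training mean; `k_fold_indices` and `cross_validated_mse` are hypothetical helper names.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, holdout_indices) for k roughly equal folds."""
    indices = list(range(n))
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        holdout = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, holdout
        start += size

def cross_validated_mse(y, k):
    """Average holdout mean squared error of a mean-only model across k folds."""
    fold_errors = []
    for train, holdout in k_fold_indices(len(y), k):
        prediction = sum(y[i] for i in train) / len(train)  # "fit" on the training part
        mse = sum((y[i] - prediction) ** 2 for i in holdout) / len(holdout)
        fold_errors.append(mse)
    return sum(fold_errors) / k  # average over the k holdout sets

print(cross_validated_mse([2.0, 4.0, 5.0, 4.0, 5.0, 7.0, 8.0, 6.0, 9.0, 10.0], k=5))
```

Every observation lands in exactly one holdout set, so all of the data is used for both fitting and assessment.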

With k predictors, there are:

2^k possible models
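
To see where 2^k comes from, note that each predictor is either in or out of the model. The sketch below enumerates every subset for three placeholder predictor names (`x1`, `x2`, `x3` are made up for the example).

```python
from itertools import combinations

def all_candidate_models(predictors):
    """Return every subset of predictors, including the intercept-only model."""
    return [
        subset
        for size in range(len(predictors) + 1)
        for subset in combinations(predictors, size)
    ]

models = all_candidate_models(["x1", "x2", "x3"])
print(len(models))  # 8, which is 2**3
```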

What are Samples?

A group of measurements from a population

In ___________, the goal is to determine whether there are significant differences among the group means.

ANOVA

What is an outlier?

An unusual data point

What is an influential observation?

An unusual data point that has a large effect on some part of the model, such as the model coefficients, the standard errors, or the predicted values.

What is Two-way ANOVA?

Analyzes the effect of two predictors individually, and tests for interactions between them.

For outliers:

Anything beyond -2 or +2 is considered unusual, and anything beyond -3 or +3 is considered VERY unusual.

Why would you not always use 99.9% confidence so that any confidence interval you calculate contains the true value of the population mean?

As you increase the confidence level, the width of the interval increases, making it less informative.

What is Normal Distribution?

Bell-shaped, symmetric, and defined by two parameters: σ, the standard deviation and μ, the mean.

(Sum of Squares) Model Sum of Squares=

Between Group Variation - The variability explained by the independent variable and therefore, measures the variability between groups. It's calculated as the weighted sum of the squared differences between the mean for each group and the overall mean.

What is Pearson Correlation?

By definition, two continuous variables are correlated if there's a linear association between them, but remember that it's possible to have strong associations that are nonlinear in nature.

For Measurement of Scales, What is Continuous?

Can take on any numeric measurement

What is the SECOND way of determining whether there are significant differences among the group means?

Comparing the sources of variability enables us to evaluate the null hypothesis: Are the group means equal?

_________________________, such as k-fold cross validation and bootstrap methods, were developed so that all the data can be used for both fitting and honest assessment.

Computer-intensive methods

What are Parameters?

Evaluations of characteristics of populations

As the ________________, the more evidence we have that not all group means are equal, because it indicates that more variability is explained by the model and not attributed to error.

F value increases

What is the Alternative Hypothesis for Multiple Linear Regression?

HA: at least one βi ≠ 0 (i = 1, ..., n)

What is the Null Hypothesis for Multiple Linear Regression?

H0: β1 = β2 = ... = βn = 0

What is the THIRD way of determining whether there are significant differences among the group means?

If the within group variability were LARGE, and the distributions overlapped, it's possible that the sampled means could be highly similar.

(Sum of Squares) Total Sum of Squares=

Model Sum of Squares + Error Sum of Squares

Observations are not independent when:

Observations are correlated

__________ are not the same as probabilities. Instead, ____________ are calculated from probabilities. You divide the probability that the event occurs by the probability that the event does not occur.

Odds

_______________________ Indicates how much more likely it is that a certain event or outcome occurs in one group relative to its occurrence in another group.

Odds Ratio
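
Both quantities follow directly from probabilities. In this sketch the probabilities 0.8 and 0.5 are invented for illustration, and `odds` and `odds_ratio` are hypothetical helper names.

```python
def odds(p):
    """Odds = probability the event occurs / probability it does not occur."""
    return p / (1 - p)

def odds_ratio(p_group_a, p_group_b):
    """Odds of the event in group A relative to the odds in group B."""
    return odds(p_group_a) / odds(p_group_b)

print(odds(0.8))             # about 4: the event is 4 times as likely to occur as not
print(odds(0.5))             # 1.0: occurring and not occurring are equally likely
print(odds_ratio(0.8, 0.5))  # about 4: group A's odds are about 4 times group B's
```

An odds ratio of 1 would mean the odds are the same in both groups, which corresponds to no association.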

_______________ highly affect correlation coefficients; one data point can misrepresent the linear relationship between two variables and make it seem stronger than it really is.

Outliers

The X within the Simple Linear Regression Model equation =

Predictor

What is Predictive Modeling?

Predicts future values of a response variable based on the existing values of predictor variables

What is Bootstrapping?

Resampling method that tries to approximate the distribution of the parameter estimates in order to obtain correct standard errors and p-values.

For Measurement of Scales, What is Ordinal?

Similar to categorical variables, but the levels have a natural hierarchy to them

Residuals are:

The difference between each observed value of Y and its predicted value.

With Two-way ANOVA, term effects are:

The expected change in the response variable due to the change in value of a predictor variable.

With Two-way ANOVA, effects are:

The factor variables are referred to as effects in the model.

What is Power?

The probability that you correctly reject the null hypothesis. The _________ of a statistical test is equal to 1- β.

What are Nonparametric models?

Things like decision trees, which predict new cases based on a sequence of decisions, or rules, based on the values of the inputs without using an equation

What is ONE way of determining whether there are significant differences among the group means?

This is accomplished by partitioning the Total Variation in the response variable (as measured by the corrected total sum of squares) into two components: a. The Between Group Variation (displayed in the ANOVA table as the Model Sum of Squares) b. The Within Group Variation (displayed as the Error Sum of Squares)
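
The partition can be verified numerically. The sketch below uses three small made-up groups and also computes the F value as (model SS / model df) / (error SS / error df), with k - 1 and N - k degrees of freedom for k groups and N observations; `anova_sums_of_squares` is a hypothetical helper name.

```python
def anova_sums_of_squares(groups):
    """Return (model_ss, error_ss, total_ss) for a one-way layout."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    # Between-group variation: weighted squared distance of each group mean from the overall mean
    model_ss = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group variation: squared distance of each value from its own group mean
    error_ss = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    # Total variation: squared distance of each value from the overall mean
    total_ss = sum((v - grand_mean) ** 2 for v in all_values)
    return model_ss, error_ss, total_ss

groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
model_ss, error_ss, total_ss = anova_sums_of_squares(groups)
print(model_ss, error_ss, total_ss)  # 54.0 6.0 60.0, and 54 + 6 = 60

n_total = sum(len(g) for g in groups)
f_value = (model_ss / (len(groups) - 1)) / (error_ss / (n_total - len(groups)))
print(f_value)  # 27.0
```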

How to interpret Pearson Correlation?

To define the correlation that we observe in our scatter plot, we use correlation statistics, which measure the degree, or strength, of linear association between two variables.

(Sum of Squares) Total Sum of Squares=

Total Variation - The overall variability in the response variable. It's calculated as the sum of the squared differences between each observed value and the overall mean.

What is General Linear Model?

When your response variable is continuous, you have continuous predictors, and you can assume a normal distribution of errors, use a general linear model or, more specifically, an ordinary least squares regression

Because collinearity involving several predictors can be missed by correlations, we need additional tools for collinearity detection such as ______________________

Variance Inflation Factors.
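
A variance inflation factor for predictor j is commonly defined as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The sketch below covers only the simplest case, two predictors, where R_j^2 equals the squared Pearson correlation between them; the data and the name `vif_two_predictors` are assumptions of this example.

```python
def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    mean_1 = sum(x1) / len(x1)
    mean_2 = sum(x2) / len(x2)
    s12 = sum((a - mean_1) * (b - mean_2) for a, b in zip(x1, x2))
    s11 = sum((a - mean_1) ** 2 for a in x1)
    s22 = sum((b - mean_2) ** 2 for b in x2)
    r_squared = s12 ** 2 / (s11 * s22)  # squared Pearson correlation
    return 1 / (1 - r_squared)

print(vif_two_predictors([1, 2, 3, 4, 5], [2, 4, 5, 4, 5]))  # about 2.5
```

With more than two predictors, R_j^2 has to come from a full regression of each predictor on all the others, which is how a VIF can flag collinearity that pairwise correlations miss.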

A cone shape within a residual plot means that:

Variance of the residuals is not constant

Collinearity doesn't violate regression assumptions. Why is it a problem?

When multiple variables try to explain the same variation in the response, it leads to inflated standard errors and instability in the regression model.

(Sum of Squares) Error Sum of Squares=

Within Group Variation - The variability not explained by the model. It's also referred to as within treatment variability or residual sum of squares. Therefore, it measures the random variability within groups. It's calculated as the sum of the squared differences between each observed value and the mean for its group.

What is the Formula for Simple Linear Regression?

Y = β0 + β1X + ε
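
The intercept and slope are usually estimated by least squares. As a minimal sketch, assuming the standard closed-form estimates (slope = Sxy / Sxx, intercept = mean of Y minus slope times mean of X) and made-up data:

```python
def fit_simple_linear_regression(x, y):
    """Return least-squares estimates (b0, b1) for Y = b0 + b1 * X + error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx               # slope: average change in Y per one-unit change in X
    b0 = mean_y - b1 * mean_x    # intercept: predicted Y when X is 0
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_simple_linear_regression(x, y)
print(b0, b1)  # about 2.2 and 0.6

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # estimates of the error term
```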

What is the equation for Multiple Linear Regression?

Y = β0 + β1X1 + β2X2 + ... + βnXn + ε

What is the Two-way ANOVA equation?

Yijk = μ + αi + βj + αiβj + ϵijk

With Logistic Regression binary is:

a. The mean of the response is the probability of a success. b. You can code binary data with numbers but these values are still categories and the coding is arbitrary. c. If the response variable has only two levels, you can't assume the constant variance and normality that are required for linear regression.

The calculations of all information criteria begin the same way:

a. They all start with n (the sample size) times the natural log of the sum of squared errors divided by n. b. Then, each criterion adds a penalty that represents the complexity of the model. c. The magnitude of the penalty is what differentiates each type of information criterion.
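
As a rough illustration of "same starting term, different penalty", the sketch below uses commonly cited forms: a penalty of 2p for AIC and p·ln(n) for SBC, where p is the number of estimated parameters. These penalty forms are an assumption here, since the cards do not spell them out, and the exact constants a given SAS procedure reports may differ. The SSE values are invented.

```python
import math

def aic(sse, n, p):
    """n * ln(SSE / n) plus a penalty of 2 per estimated parameter (assumed form)."""
    return n * math.log(sse / n) + 2 * p

def sbc(sse, n, p):
    """n * ln(SSE / n) plus a penalty of ln(n) per estimated parameter (assumed form)."""
    return n * math.log(sse / n) + p * math.log(n)

# Two hypothetical models fit to the same 50 observations
small_model = {"sse": 100.0, "p": 3}
large_model = {"sse": 98.0, "p": 6}
for name, m in (("small", small_model), ("large", large_model)):
    print(name, aic(m["sse"], 50, m["p"]), sbc(m["sse"], 50, m["p"]))
```

Here the larger model lowers the SSE only slightly, so the penalty outweighs the gain and the smaller model has the smaller (better) value on both criteria.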

For building a predictive model, Scoring is?

a. Where you get to apply the model you built and verified to new sets of data to make predictions. b. Make sure new data matches the data set up of data used to build the model

choosing the "best" model for __________________ is not as simple as just making the R-square as large as possible.

adjusted R-squared

You can also think of an _______________ as the probability that one variable depends on the probability of the other.

association

A linear regression model assumes the data is continuous, but for logistic regression, the response is ______________

binary.

For Pearson Correlation, a sample correlation coefficient can be large because:

both variables are affected by other variables, such as time of year.

Within Predictive Modeling, Observations =

cases=instances=records

For Logistic Regression, the estimated probabilities can then be used to:

classify an unknown response into one of the two outcome levels, given a set of predictors

For Standard Error the larger the sample size=

closer we get to measuring all the data of the population=the smaller the standard error will be= the more precise our estimate=the more confident we are that the sample mean is a good estimate of the population mean.

With Collinearity/Multicollinearity, strong correlations between pairs of predictors can be detected with ____________________.

correlation analysis.

When using the Pearson Correlation Coefficient as our correlation statistic the:

correlation coefficient ranges from -1 to +1.

The ε within the Simple Linear Regression Model equation =

error term that represents the variation of Y around the line

Mean Squared Error=

estimate of the model variance

Parametric models have:

formulas like regression

If R^2 is close to 0 then:

the predictor variables do not explain much variability in the data

If deleting an observation results in a change in the standard errors, then the observation_________________________ of the parameters

influences the precision

Within Predictive Modeling, Predictor Variables=

inputs=features=explanatory variables=independent variables

Type I and Type II error are _______________________. As one type increases, the other decreases.

inversely related

What is adjusted R-squared?

is a measure that's similar to R-squared, but it takes into account the number of terms in the model. a. It can be thought of as a penalized version of R-squared. b. The penalty increases with each parameter that's added to the model.
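
A common textbook form of the penalty is adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - p - 1), with n observations and p predictors (not counting the intercept). The cards do not state the formula, so treat this form, and the numbers below, as assumptions for illustration.

```python
def adjusted_r_squared(r_squared, n, p):
    """Penalized R^2 for a model with p predictors fit to n observations."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(adjusted_r_squared(0.80, n=21, p=5))  # about 0.733
print(adjusted_r_squared(0.80, n=21, p=6))  # lower: an extra term that adds nothing is penalized
```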

What is the goal of Predictive Modeling?

is to predict, or score, future values of a target variable based on the existing values of inputs.

What are Associations?

it exists between two categorical variables if the distribution of one variable changes when the value of the other variable changes.

For building a predictive model, If two models both fit well, choose the one with the __________________________

fewest predictor variables

Pearson correlations measure only the _________________ between variables. That is, variables can have a near-zero correlation, but can be strongly related in a nonlinear fashion.

linear association
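
A small example shows how a perfect nonlinear relationship can still give a correlation of zero. The data are made up, and `pearson_r` is a hand-written helper rather than a library call.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [-2, -1, 0, 1, 2]
print(pearson_r(x, [2 * v + 1 for v in x]))  # 1.0: perfect positive linear relationship
print(pearson_r(x, [v ** 2 for v in x]))     # 0.0: y depends entirely on x, but not linearly
```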

A ____________________ model applies a logit transformation, or simply the log odds transformation, to the probabilities

logistic regression
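
The logit is the natural log of the odds, and its inverse maps any real number back to a probability strictly between 0 and 1. A minimal sketch, with hypothetical helper names:

```python
import math

def logit(p):
    """Log odds: natural log of p / (1 - p), defined for 0 < p < 1."""
    return math.log(p / (1 - p))

def inverse_logit(z):
    """Map a value on the log-odds scale back to a probability."""
    return 1 / (1 + math.exp(-z))

print(logit(0.5))                # 0.0: even odds
print(logit(0.9))                # positive: the event is more likely than not
print(inverse_logit(-10))        # small, but still above 0
print(inverse_logit(10))         # large, but still below 1
```

Because the linear predictor lives on the unbounded log-odds scale, the predicted probabilities recovered from it cannot fall outside the 0 to 1 range.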

Each information criterion searches for a model that:

minimizes the unexplained variability with as few effects in the model as possible.

After the best model is chosen the:

model is deployed to make predictions on new data using a process called scoring.

You can use _____________________ to consider many predictors.

multiple linear regression

Collinearity/Multicollinearity is a potential problem in __________________________.

multiple regression

You can also use linear regression to model _________________________with the response variable by adding polynomials, such as squared or cubed terms, or you can add interactions to your model.

non-linear relationships

The process of Predictive modeling begins with:

partitioning a data set into separate training and validation data sets.

Residual plots that don't fit the linear assumption are:

plots that have an obvious pattern.

Multiple linear Regression is a ___________________ tool for both explanatory analysis and for prediction.

powerful

For The Logistic Regression Model, the logit effectively avoids the boundary problem for probabilities since:

probabilities are bounded between 0 and 1

For Logistic Regression we can't use:

regression model assumptions, fitting techniques, and procedures we've learned thus far.

To verify the assumptions of linear regression, you can use the _______________ from the regression analysis as your best estimates of the error terms.

residual values

To check for violations of equal variances, you use the ________________________________________.

residuals versus predicted values plot.

The Yijk within Two-way ANOVA equation =

response variable

If there's no association, the distribution of the first variable is the ___________, regardless of the level of the other variable.

same

You should always produce a ______________ before you conduct a regression analysis

scatter plot

If deleting an observation results in a large change in parameter estimates, then that observation has a _____________________________ on the parameters.

significant influence

To check for outliers, you can look at the __________________, which puts them on a standard deviation scale.

standardized residuals

Within Predictive Modeling, Response Variables=

targets=outcomes=dependent variables

For Pearson Correlation, a strong correlation between two variables doesn't mean:

that a change in one variable causes a change in the other variable, or vice versa. It's possible that other reasons account for a strong correlation between two variables.

The β1 within the Multiple Linear Regression equation =

the average change in Y for a one-unit change in X1, holding X2...X_n constant.

What does the chi-square measure?

the difference between the observed cell counts and the cell counts that are expected if there's no association between the variables, and the null hypothesis is in fact true.
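
Under the null hypothesis of no association, the expected count for a cell is (row total × column total) / overall total, and the statistic sums (observed - expected)^2 / expected over all cells. The sketch below applies this to an invented 2×2 table; `chi_square` is a hypothetical helper name.

```python
def chi_square(observed):
    """Return (expected counts, chi-square statistic, degrees of freedom) for a two-way table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    # Expected count under no association: row total * column total / overall total
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, statistic, df

expected, statistic, df = chi_square([[30, 10], [20, 40]])
print(expected)   # [[20.0, 20.0], [30.0, 30.0]]
print(statistic)  # about 16.67
print(df)         # 1
```

The statistic is then compared with a chi-square distribution on `df` degrees of freedom to get a p-value.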

The αi within Two-way ANOVA equation =

the effect of the i^th level of variable 1

The αiβj within Two-way ANOVA equation =

the effect of the interaction between the i^th level of variable 1 and the j^th level of variable 2

The βj within Two-way ANOVA equation =

the effect of the j^th level of variable 2

The ϵijk within Two-way ANOVA equation =

the error term or residual in your model

The β0 within the Simple Linear Regression Model equation =

the intercept parameter=the value of the response variable when the predictor is 0. This is where the regression line crosses the Y axis.

The μ within Two-way ANOVA equation =

the overall population mean of the response

if R^2 is close to 1 then:

the predictor variables explain a relatively large proportion of variability in the data.

The β1 within the Simple Linear Regression Model equation =

the slope parameter=the average change in Y for a 1 unit change in X

What do Odds Ratios measure?

the strength of the association between a binary predictor variable and a binary response variable.

When interpreting the Pearson Correlation Coefficient, the closer the value is to -1:

the stronger the negative linear relationship between the two variables. A negative linear relationship between two variables means that, as the values of one variable increase, the values of the other variable decrease.

When interpreting the Pearson Correlation Coefficient, the closer the value is to +1:

the stronger the positive linear relationship between the two variables. That is, as the values of one variable increase, the other tends to increase as well.

When interpreting the Pearson Correlation Coefficient, the closer the value is to 0:

the weaker the linear relationship is, and a correlation coefficient equal to 0 means that no linear relationship exists between the two variables.

Validation data is used for?

to assess the model and pick the best performing model

Training data is used for?

to build the model

For building a Predictive Model select the best model by applying validation data to the models built with __________________

training data

Use most of the data for ______________ set and less for validation and test data sets

training data

Collinearity/Multicollinearity occurs when ____________________________ are highly correlated with each other.

two or more predictor variables

in __________________ each factor has multiple levels

two-way ANOVA

For Predictive Modeling, the model is built using the training data and then assessed using the _______________________.

validation data.

The β0 within the Multiple Linear Regression equation =

y-intercept

In explanatory analysis:

you develop a model to test the statistical significance of the parameter coefficients to determine whether a relationship exists between the response variable and the predictor variables.

When you use multiple regression for prediction:

your focus is the predictive power of the model. You choose terms that you've determined best predict future values of the response variable.

(Know how to calculate the expected values of variable combinations.) (Know how to calculate the chi-square test statistic.)


(Know how to calculate and interpret an odds ratio.)
