Exam 2 (hop 3)


Statistical test based on the Student's t probability distribution that can be used to test the hypothesis that a regression parameter is zero; if this hypothesis is rejected, we conclude that there is a regression relationship between the jth independent variable and the dependent variable.

T-test

To calculate the number of dummy variables needed for a categorical variable with k categories (the category coded all zeros is the reference category), use

k-1
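
A minimal sketch of creating k − 1 dummy variables, assuming pandas is available; the data frame, the column name "region", and its categories are hypothetical.

```python
import pandas as pd

# Hypothetical data: a categorical variable with k = 3 levels.
df = pd.DataFrame({"region": ["East", "West", "North", "East", "North"]})

# drop_first=True keeps k - 1 = 2 dummy columns; the dropped level
# ("East", coded 0,0) serves as the reference category.
dummies = pd.get_dummies(df, columns=["region"], drop_first=True)
print(dummies)
```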

A graphical presentation of the relationship between two variables is

scatter chart

variable selection procedures

1. Backward Elimination 2. Forward Selection 3. Stepwise Selection 4. Best Subsets
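
As an illustration of one of these procedures, here is a hedged sketch of forward selection using statsmodels; the simulated data, the alpha cutoff, and the helper name forward_selection are hypothetical, and real software may use different entry criteria (for example, adjusted R-squared or F tests).

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                      # hypothetical candidate predictors
y = 3 + 2 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=100)

def forward_selection(X, y, alpha=0.05):
    """Greedy forward selection: at each step add the candidate variable
    with the smallest p-value, as long as it is below alpha."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining:
        pvals = {}
        for j in remaining:
            cols = selected + [j]
            fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
            pvals[j] = fit.pvalues[-1]             # p-value of the newly added variable
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_selection(X, y))                     # likely to pick columns 0 and 2
```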

t test hypotheses

H0: Bj = 0; HA: Bj ≠ 0 (reject H0 to conclude a relationship exists)
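
A brief sketch, assuming statsmodels, of reading the t statistics and p-values used in this test; the simulated data are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=50)                    # hypothetical independent variable
y = 4 + 1.2 * x + rng.normal(size=50)      # true slope is nonzero

fit = sm.OLS(y, sm.add_constant(x)).fit()
print(fit.tvalues)   # t statistics for b0 and b1
print(fit.pvalues)   # reject H0: Bj = 0 when the p-value is below 0.05
```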

When interaction is present between two independent variables

meaningful conclusions can be developed only if we consider the joint relationship that both independent variables have with the dependent variable.

Conditions necessary for valid inference in the least squares regression model:

1. For any given combination of values of the independent variables x1, x2, ..., xq, the population of potential error terms e is normally distributed with a mean of 0 and a constant variance (so the regression estimates are unbiased). 2. The values of e are statistically independent. (Residual plots, a type of scatter chart, can be used to check these conditions; see the sketch below.)
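
A minimal sketch of a residual plot for checking these conditions, assuming numpy and matplotlib; the simulated data are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 80)
y = 5 + 2 * x + rng.normal(0, 1, 80)       # hypothetical data meeting the conditions

b1, b0 = np.polyfit(x, y, 1)               # least squares fit
residuals = y - (b0 + b1 * x)

# Residual plot: points scattered randomly around 0 with roughly constant
# spread are consistent with the error-term conditions above.
plt.scatter(b0 + b1 * x, residuals)
plt.axhline(0, color="gray")
plt.xlabel("predicted value")
plt.ylabel("residual")
plt.show()
```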

3 methods of determining possible causal relationships

1. Measures of association 2. Regression analysis 3. forecasting

characteristics of a simple Regression Equation

1. The sum of predicted values is equal to the sum of the values of the dependent variable y. 2. The sum of the residuals is 0. 3. The sum of the squared residuals has been minimized.
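
A quick numerical check of these three properties with numpy; the small sample is hypothetical.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])   # hypothetical sample

b1, b0 = np.polyfit(x, y, 1)               # least squares minimizes squared residuals
y_hat = b0 + b1 * x
residuals = y - y_hat

print(np.isclose(y_hat.sum(), y.sum()))    # property 1: sums match
print(np.isclose(residuals.sum(), 0.0))    # property 2: residuals sum to 0
```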

how to reduce the effects of large data sets:

1. Perform regression on random samples of observations. 2. Divide the data set into 5 subgroups of 20% each and compare the results; if the results are very different, it may be an effect of the large data set (see the sketch below).
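
A sketch of the 20% subgroup comparison described above, assuming numpy; the simulated data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
x = rng.normal(size=n)
y = 10 + 0.5 * x + rng.normal(size=n)      # hypothetical large data set

# Shuffle the rows, split into 5 subgroups of 20% each, and compare the
# estimated slopes; very different results may signal a large-data effect.
idx = rng.permutation(n)
for chunk in np.array_split(idx, 5):
    slope, intercept = np.polyfit(x[chunk], y[chunk], 1)
    print(round(slope, 4), round(intercept, 4))
```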

In regression, we commonly use inference to estimate and draw conclusions about the following:

1. the regression parameters 2. The mean value and/or the predicted value of the dependent variable y for specific values of the independent variables

In BLANK, the slope coefficient bj represents the change in the mean value of the dependent variable y that corresponds to a one-unit increase in the independent variable xj, holding the values of ALL other independent variables in the model constant.

A multiple regression model

The estimated change in the mean of the dependent variable y that is associated with a one-unit increase in the independent variable x.

b1 (the slope)

An iterative variable selection procedure that starts with a model with all independent variables and considers removing an independent variable at each step.

Backward Elimination

A variable selection procedure that constructs and compares all possible models with up to a specified number of independent variables.

Best-subsets

describes how one thing affects another (often occurring in only one direction)

Causality

diagnostic and predictive analytics are based on BLANK within a relationship

Causality

A measure of the goodness of fit of the estimated regression equation. It can be interpreted as the proportion of the variability in the dependent variable y that is explained by the estimated regression equation. Denoted r^2; equal to the square of the correlation between yi and y-hat i.

Coefficient of determination
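
A small numpy sketch, on hypothetical data, showing that r^2 = 1 − SSE/SST and that it equals the squared correlation between y and y-hat.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8, 12.1])   # hypothetical sample

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
r_squared = 1 - sse / sst

# r^2 also equals the squared correlation between y and y-hat.
print(r_squared, np.corrcoef(y, y_hat)[0, 1] ** 2)
```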

standardized measure of linear association between 2 variables that takes on a value between -1 and +1

Correlation coefficient (Excel: CORREL)

a descriptive measure of the linear association between 2 variables; its sign gives the direction of the relationship (0 = not linearly related, positive = positive linear relationship, negative = negative linear relationship), but its magnitude depends on the units of the variables

Covariance (Excel: COVARIANCE.S)
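
A brief numpy sketch on hypothetical data, mirroring COVARIANCE.S and CORREL and showing how the correlation rescales the covariance.

```python
import numpy as np

x = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y = np.array([10.0, 14.0, 19.0, 25.0, 28.0])    # hypothetical sample

cov_xy = np.cov(x, y)[0, 1]          # sample covariance (like Excel COVARIANCE.S)
corr_xy = np.corrcoef(x, y)[0, 1]    # correlation coefficient (like Excel CORREL)

# The correlation rescales the covariance by the standard deviations,
# so it is unit-free and always between -1 and 1.
print(cov_xy, corr_xy, cov_xy / (x.std(ddof=1) * y.std(ddof=1)))
```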

Assessment of the performance of a model on data other than the data that were used to generate the model.

Cross-validation

A branch of analytics that examines data to understand why things happened in the past. This is often described as answering the question, "Why did things happen?"

Diagnostic Analytics

The estimate of the regression equation developed from sample data by using the least squares method.

Estimated regression equation

Prediction of the mean value of the dependent variable y for values of the independent variables x1, x2, ..., xq that are outside the experimental range. This is risky and should be avoided if possible.

Extrapolation

An iterative variable selection procedure that starts with a model with no variables and considers adding an independent variable at each step.

Forward Selection

The process of making a conjecture about the value of a population parameter, collecting sample data that can be used to assess this conjecture, measuring the strength of the evidence against the conjecture that is provided by the sample, and using these results to draw a conclusion about the conjecture.

Hypothesis testing

The degree of correlation among independent variables in a regression model. CORREL() can be used to help detect it (see the sketch below).

Multicollinearity
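
A minimal sketch of screening for multicollinearity with a correlation matrix of the independent variables, assuming pandas and numpy; the predictor names and simulated data are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
x1 = rng.normal(size=60)
x2 = x1 + rng.normal(scale=0.1, size=60)   # nearly a copy of x1 (multicollinear)
x3 = rng.normal(size=60)

X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})   # hypothetical predictors
# Pairwise correlations among the independent variables (like CORREL in Excel);
# values near +1 or -1 suggest multicollinearity.
print(X.corr().round(2))
```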

principle of using the simplest meaningful model possible without sacrificing accuracy

Ockham's razor, the law of parsimony, or the law of economy.

represents the probability of collecting a sample of the same size from the same population that yields a larger t statistic, given that the value of Bj is actually zero.

P-value

Regression model in which one linear relationship between the independent and dependent variables is fit for values of the independent variable below a prespecified value, and a different linear relationship is fit for values of the independent variable above that prespecified value (the knot).

Piecewise linear model (also called a segmented or spline model)
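
A hedged sketch of a piecewise (spline) fit with one knot, assuming statsmodels; the knot location and simulated data are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 120)
knot = 6.0                                             # prespecified breakpoint
y = 2 + 1.5 * x + 3.0 * np.maximum(x - knot, 0) + rng.normal(size=120)

# One extra term, max(x - knot, 0), lets the slope change at the knot while
# keeping the fitted line continuous.
X = np.column_stack([x, np.maximum(x - knot, 0)])
fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.params)    # intercept, slope below the knot, change in slope above it
```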

A branch of analytics that is used to make predictions of what will happen in the future or in hypothetical situations. This is often described as answering the question, "What will happen?"

Predictive Analytics

Regression model in which a nonlinear relationship between the independent and dependent variables is fit by including the independent variable and the square of the independent variable in the model: y-hat = b0 + b1x + b2x^2; also referred to as a second-order polynomial model. (Can help find the point at which y maxes out.)

Quadratic Regression Model
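
A small numpy sketch of a quadratic fit on hypothetical data, including the max-out point at x = -b1 / (2 * b2) when b2 is negative.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([3.0, 6.5, 9.0, 10.5, 11.0, 10.4, 9.1, 6.8])   # hypothetical, rises then falls

b2, b1, b0 = np.polyfit(x, y, 2)      # fits y-hat = b0 + b1*x + b2*x^2
print(b0, b1, b2)

# With b2 < 0 the curve has a maximum at x = -b1 / (2 * b2).
print(-b1 / (2 * b2))
```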

a statistical procedure used to develop an equation, showing how variables are related

Regression Analysis

The difference between the observed value of the dependent variable and the value predicted using the estimated regression equation; for the ith observation, the ith residual is yi - y-hat i.

Residual

The process of making estimates and drawing conclusions about one or more characteristics of a population (the value of one or more parameters) through analysis of sample data drawn from the population.

Statistical inference

An iterative variable selection procedure that considers adding an independent variable and removing an independent variable at each step.

Stepwise Selection

used to measure how much the values on the estimated regression line (the y-hat i) deviate from y-bar, the mean of y

Sum of Squares due to Regression (SSR)

a measure of the error (in squared units of the dependent variable) that results from using the estimated regression equation to predict the values of the dependent variable in the sample.

Sum of squares due to error (SSE)

measure of how much the observations cluster about the y-bar line (the mean of y)

Total Sum of Squares (SST)
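
A quick numpy check, on hypothetical data, that SST = SSR + SSE for a least squares fit with an intercept.

```python
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 8.0])
y = np.array([4.5, 7.0, 8.1, 11.2, 12.3])   # hypothetical sample

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)        # total variation about y-bar
ssr = np.sum((y_hat - y.mean()) ** 2)    # variation explained by the regression
sse = np.sum((y - y_hat) ** 2)           # unexplained (error) variation

print(sst, ssr + sse)                    # SST = SSR + SSE
```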

data set used to build the candidate models.

Training set

If any estimated regression parameters b1, b2, ..., bq or associated p values change dramatically when a new independent variable is added to the model (or an existing independent variable is removed from the model), multicollinearity is likely present. Looking for changes such as these is sometimes used as a way to detect multicollinearity. T or F?

True

If practical experience dictates that the nonsignificant independent variable has a relationship with the dependent variable, the independent variable should be left in the model. T or F?

True

Interestingly, statistical tests do not actually indicate if causality exists. They only provide evidence that a relationship is present. T or F?

True

The estimated value of the y-intercept often results from extrapolation. T or F?

True

When interaction between two variables is present, we cannot study the relationship between one independent variable and the dependent variable y independently of the other variable. T or F?

True

in Regression analysis we must recognize that samples do not replicate the population exactly T or F?

True

Overfitting generally results from creating an overly complex model to explain idiosyncrasies in the sample data. T or F?

True

Regression analysis does not determine a cause-and-effect relationship. T or F?

True

Fitting a model too closely to sample data, resulting in a model that does not accurately reflect the population.

Overfitting

using Constant Zero (forcing the y-intercept to be zero):

can substantially alter the estimated slopes in the regression model and result in a less effective regression that yields less accurate predicted values of the dependent variable.

example of Constant Zero: regression through the origin

A common business example of regression through the origin is a model in which output from a labor-intensive production process is the dependent variable and hours of labor is the independent variable; because the process is labor-intensive, we would expect no output when the value of labor hours is zero (see the sketch below).
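
A minimal numpy sketch contrasting regression through the origin with an ordinary fit; the labor-hours and output figures are hypothetical.

```python
import numpy as np

hours = np.array([10.0, 20.0, 30.0, 40.0, 50.0])      # hypothetical labor hours
output = np.array([52.0, 98.0, 151.0, 205.0, 248.0])  # hypothetical output

# Regression through the origin: omit the column of ones, forcing y-hat = b1*x.
b1_origin = np.linalg.lstsq(hours.reshape(-1, 1), output, rcond=None)[0][0]

# Ordinary regression with an intercept, for comparison.
b1_full, b0_full = np.polyfit(hours, output, 1)

print(b1_origin)
print(b0_full, b1_full)
```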

An estimate of a population parameter that provides an interval believed to contain the value of the parameter at some level of confidence.

confidence interval

An indication of how frequently interval estimates based on samples of the same size taken from the same population using identical sampling techniques will contain the true value of the parameter we are estimating.

confidence level

Functionality to remove the y-intercept from the model

constant Zero

A variable used to model the effect of categorical independent variables in a regression model; generally takes only the value zero or one.

dummy variable

The range of values for the independent variables x1, x2, ..., xq for the data that are used to estimate the regression model.

experimental region

Method of cross-validation in which sample data are randomly divided into mutually exclusive and collectively exhaustive sets, then one set is used to build the candidate models and the other set is used to compare model performances and ultimately select a model. -simple and quick

holdout method
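
A sketch of the holdout method, assuming scikit-learn's train_test_split and numpy; the simulated data and the 70/30 split are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 3))                        # hypothetical data
y = 1 + X @ np.array([2.0, 0.0, -1.0]) + rng.normal(size=200)

# Randomly split into a training set (to build candidate models) and a
# validation/holdout set (to compare their predictive performance).
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, random_state=0)

coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(len(X_train)), X_train]),
                           y_train, rcond=None)
y_pred = np.column_stack([np.ones(len(X_valid)), X_valid]) @ coef
print(np.mean((y_valid - y_pred) ** 2))              # holdout mean squared error
```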

in a positive relationship as one variable increases, the other variable generally also

increases

Regression modeling technique used when the relationship between the dependent variable and one independent variable is different at different values of a second independent variable.

interaction
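
A brief statsmodels sketch of modeling interaction by including the product x1*x2 as an extra independent variable; the simulated data are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x1 = rng.uniform(0, 5, 150)
x2 = rng.uniform(0, 5, 150)
y = 1 + 2 * x1 + 0.5 * x2 + 1.5 * x1 * x2 + rng.normal(size=150)   # hypothetical

# The product x1*x2 is the interaction term: the effect of x1 on y
# depends on the value of x2 (and vice versa).
X = np.column_stack([x1, x2, x1 * x2])
fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.params)
```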

The use of sample data to calculate a range of values that is believed to include the unknown value of a population parameter.

interval estimation

problem with covariance:

its magnitude is difficult to interpret because its units depend on the units of x and y, so it does not tell you the strength of the relationship

problem with multicollinearity:

it is possible to conclude that a parameter associated with one of the multicollinear independent variables is not significantly different from zero when the independent variable actually has a strong relationship with the dependent variable.

the prespecified value of the independent variable at which its relationship with the dependent variable changes in a piecewise linear regression model; also called the breakpoint or the joint.

knot

A procedure for using sample data to find the estimated regression equation. minimizes the sum of squared errors

least squares method

if the values of a point estimator change dramatically from sample to sample, the point estimator has high variability, and so the value of the point estimator that we calculate based on a random sample will likely be a BLANK reliable estimate.

less

A single value used as an estimate of the corresponding population parameter.

point estimator

An interval estimate of the prediction of an individual y value given values of the independent variables.

prediction interval
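
A hedged sketch, assuming statsmodels, showing both the confidence interval for the mean of y and the wider prediction interval for an individual y value; the simulated data and the new x values are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(8)
x = rng.uniform(0, 10, 100)
y = 3 + 2 * x + rng.normal(size=100)                  # hypothetical data

fit = sm.OLS(y, sm.add_constant(x)).fit()
new_x = sm.add_constant(np.array([2.0, 5.0, 8.0]))    # x values to predict at

frame = fit.get_prediction(new_x).summary_frame(alpha=0.05)
# mean_ci_*  -> confidence interval for the mean of y at those x values
# obs_ci_*   -> wider prediction interval for an individual y value
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])
```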

If the values of a point estimator such as b0, b1, ..., bq change relatively little from sample to sample, the point estimator has low variability, and so the value of the point estimator that we calculate based on a random sample will likely be a BLANK estimate of the population parameter.

reliable

BLANK p values indicate stronger evidence against the hypothesis that the value of Bj is zero (i.e., stronger evidence of a relationship between xj and y). The hypothesis is rejected when the corresponding p value is smaller than some predetermined level of significance (usually 0.05 or 0.01).

smaller

as the magnitude of BLANK increases (as t deviates from zero in either direction), we are more likely to reject the hypothesis that the regression parameter Bj is zero and so conclude that a relationship exists between the dependent variable y and the independent variable xj.

t

The phenomenon by which the value of an estimate generally becomes closer to the value of the parameter being estimated as the sample size grows is called

the law of large numbers

data set used to compare model forecasts and ultimately pick a model for predicting values of the dependent variable.

validation set

