Ch.7 P2 Linear Regression with Categorical Variables

As a rule remember: If you have a categorical variable with k levels then you need to add

k - 1 dummy variables

In general, if a categorical variable has k levels then we need to add

k - 1 dummy variables

If a categorical variable has k levels then

k-1 dummy variables are required, with each dummy variable corresponding to one of the levels of the categorical variable and coded as 0 or 1

We model independent categorical variables with

dummy variables

To incorporate a variable that indicates whether a driving assignment included travel on the congested segment of a highway during the afternoon rush hour into a model that currently includes the miles traveled and the number of deliveries, we define the following variable:

The categorical variable is rush hour, represented by the dummy variable x3:
x3 = 0 if an assignment did not include travel on the congested segment of highway during the afternoon rush hour
x3 = 1 if an assignment included travel on the congested segment of highway during the afternoon rush hour
Question: Will this dummy variable add valuable information to the current Butler Trucking regression model?

More Complex Categorical Variables

Lecture video question: What do we do if our categorical variable is more complex than the binary variable in our Johnson Filtration example?

Example of a categorical variable with k levels:

-A product comes in many colors, and we would like to include color as an independent variable in our model
-A categorical variable that records a satisfaction rating

Each dummy variable can be set to either

0 or 1. This holds for a binary categorical variable, which has only two possible outcomes.

Dummy variables in the regression equation can only take the values

0 or 1. They represent a shift in the y-intercept or a change in the slope. If a dummy variable is not significant, we exclude it from the interpretation of the regression equation, because there is then no significant shift in the intercept or change in the slope for the population.

Examples of complex categorical variables explained

1) Suppose a manufacturer of vending machines organized the sales territories for a particular state into three regions: A, B, and C.
-A, B, and C represent categories.
-Managers want to use regression analysis to help predict the number of vending machines sold per week, and they believe that the sales region is one of the important factors in predicting the number of units sold.
Important: Sales Region is the categorical variable, with 3 levels (A, B, and C). The number of dummy variables we need is one less than the number of levels: k - 1 = 3 - 1 = 2.
Next, we name these 2 dummy variables x1 and x2:
x1 = 1 if sales are in region B; 0 otherwise (region A or C)
x2 = 1 if sales are in region C; 0 otherwise (region A or B)
Question: What about region A?
Answer: Region A is accounted for when both x1 and x2 are 0.
Slide table explained:
-Whenever we have a sale from region A, x1 = 0 and x2 = 0
-Whenever we have a sale from region B, x1 = 1 and x2 = 0
-Whenever we have a sale from region C, x1 = 0 and x2 = 1
Overall, it is possible to handle all 3 levels (A, B, and C) of the categorical variable (Sales Region) using only 2 dummy variables. Two are enough because they cover every possible case, so a 3rd dummy variable would add nothing.
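The coding table for the sales regions can be sketched in a few lines of Python. This is only an illustration: the helper name `dummy_code` and the choice to pass the baseline level explicitly are assumptions made here, not something taken from the lecture. Region A is treated as the baseline (the all-zeros case), matching the example.

```python
def dummy_code(level, levels, baseline):
    """Return the k - 1 dummy values (each 0 or 1) for one observation.

    dummy_code is a hypothetical helper written for this example.
    """
    if level not in levels:
        raise ValueError(f"unknown level: {level!r}")
    # One dummy per non-baseline level; the baseline is the all-zeros case.
    return [1 if level == other else 0 for other in levels if other != baseline]


regions = ["A", "B", "C"]  # k = 3 levels, so k - 1 = 2 dummy variables
coding = {r: dummy_code(r, regions, baseline="A") for r in regions}

for region, (x1, x2) in coding.items():
    print(f"Region {region}: x1 = {x1}, x2 = {x2}")
```

Each region gets a distinct (x1, x2) pair, which is why a third dummy variable is unnecessary.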

The 2 histograms in Figure 7.23 show that driving assignments that included travel on a congested segment of a highway during the afternoon rush hour period tend to have positive residuals, which means we are generally underpredicting the travel times for those driving assignments.

Conversely, driving assignments that did not include travel on a congested segment of a highway during the afternoon rush hour period tend to have negative residuals, which means we are generally overpredicting the travel times for those driving assignments. These results suggest that the dummy variable could potentially explain a substantial proportion of the variance in travel time that is unexplained by the current model, and so we proceed by adding the dummy variable x3 to the current Butler Trucking multiple regression model.
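The residual check described here can be sketched as follows. The numbers are made up for illustration and are not the Butler Trucking data; each residual is the observed value minus the predicted value, and the residuals are averaged separately for the two groups.

```python
from statistics import mean

# Hypothetical (observed hours, predicted hours, rush-hour flag) triples.
# These values are invented for illustration only.
assignments = [
    (9.3, 8.6, 1),
    (7.4, 6.9, 1),
    (8.9, 8.1, 1),
    (4.8, 5.3, 0),
    (6.5, 7.0, 0),
    (6.2, 6.6, 0),
]


def mean_residual(rows, flag):
    """Average residual (observed - predicted) for rows with the given flag."""
    residuals = [y - y_hat for y, y_hat, rush in rows if rush == flag]
    return mean(residuals)


mean_rush = mean_residual(assignments, 1)
mean_no_rush = mean_residual(assignments, 0)

print(f"Mean residual, rush hour:    {mean_rush:+.2f}")
print(f"Mean residual, no rush hour: {mean_no_rush:+.2f}")
```

A positive mean for the rush-hour group and a negative mean for the other group is the pattern that suggests a rush-hour dummy variable is worth adding.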

If a dummy variable shows as NOT significant, that means there is

NO significant shift in the y-intercept in that particular case

Regions B and C

For region B, x1 is set to 1 and x2 is set to 0. For region C, x1 is set to 0 and x2 is set to 1.

Butler Trucking Company and Rush Hour

Several of Butler Trucking's driving assignments require the driver to travel on a congested segment of a highway during the afternoon rush hour. Management believes that this factor may also contribute substantially to variability in the travel times across driving assignments. How do we incorporate information on which driving assignments include travel on a congested segment of a highway during the afternoon rush hour into a regression model?

More Complex Categorical Variables

The categorical variable for the Butler Trucking Company example had two levels: (1) driving assignments that include travel on the congested segment of highway during the afternoon rush hour and (2) driving assignments that do not. As a result, defining a dummy variable with a value of zero indicating a driving assignment that does not include travel on the congested segment of highway during the afternoon rush hour and a value of one indicating a driving assignment that includes such travel was sufficient. However, when a categorical variable has more than two levels, care must be taken in both defining and interpreting the dummy variables. As we will show, if a categorical variable has k levels, then k - 1 dummy variables are required, with each dummy variable corresponding to one of the levels of the categorical variable and coded as 0 or 1

Interpreting the Parameters

The model estimates that the mean travel time (the y variable) increases by:
• 0.0672 hours for every increase of 1 mile traveled, holding constant the number of deliveries and whether the driving assignment route requires the driver to travel on the congested segment of a highway during the afternoon rush hour.
• 0.6735 hours for every delivery, holding constant the number of miles traveled and whether the driving assignment route requires the driver to travel on the congested segment of a highway during the afternoon rush hour.
• 0.998 hours if the driving assignment route requires the driver to travel on the congested segment of a highway during the afternoon rush hour, holding constant the number of miles traveled and the number of deliveries.

The previous independent variables we have considered (such as the miles traveled and the number of deliveries) have been quantitative, but this new variable is

categorical and will require us to define a new type of variable called a dummy variable.

So far the examples we have considered have involved quantitative independent variables such as the miles traveled and the number of deliveries. In many situations, however, we must work with

categorical independent variables such as marital status (married, single) and method of payment (cash, credit card, check). The purpose of this section is to show how categorical variables are handled in regression analysis. To illustrate the use and interpretation of a categorical independent variable, we will again consider the Butler Trucking Company example.

Dummy variables are sometimes referred to as

indicator variables

Recall that the residual for the ith observation is $e_i = y_i - \hat{y}_i$, which is the difference between the

observed and the predicted values of the dependent variable.

A binary categorical variable means it

only has two possible outcomes. You represent the two outcomes with a single dummy variable and set it to 0 or to 1, depending on which of the two outcomes applies to a given observation.

If a dummy variable shows as significant, that means there is a

significant shift in the y-intercept in that particular case. The corresponding coefficient gives the magnitude of the shift.

A dummy variable is a

variable used to model the effect of categorical independent variables in a regression model; generally takes only the value zero or one.

Using Excel's Regression tool to develop the estimated regression equation on the data in the file ButlerHighway, we obtain the Excel output in Figure 7.24. The estimated regression equation is

$\hat{y} = -0.3302 + 0.0672x_1 + 0.6735x_2 + 0.9980x_3$
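As a rough sketch, the estimated equation can be evaluated directly in code. The coefficients come from the equation above; the function name `predict_travel_time` and the sample assignment (100 miles, 4 deliveries) are assumptions chosen here for illustration.

```python
def predict_travel_time(miles, deliveries, rush_hour):
    """Predicted travel time in hours from the estimated regression equation.

    rush_hour is the dummy variable x3 and must be 0 or 1.
    """
    if rush_hour not in (0, 1):
        raise ValueError("rush_hour must be 0 or 1")
    return -0.3302 + 0.0672 * miles + 0.6735 * deliveries + 0.9980 * rush_hour


# Hypothetical assignment: 100 miles and 4 deliveries.
without_rush = predict_travel_time(100, 4, 0)
with_rush = predict_travel_time(100, 4, 1)

print(f"Without rush hour: {without_rush:.4f} hours")
print(f"With rush hour:    {with_rush:.4f} hours")
print(f"Difference:        {with_rush - without_rush:.4f} hours")
```

With miles and deliveries held fixed, switching x3 from 0 to 1 changes the prediction by exactly the x3 coefficient, 0.9980 hours. That is the shift in the intercept that the dummy variable represents.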

Categorical Independent Variables -Will rush hour add valuable information to the current Butler Trucking regression model?

• The histograms show that driving assignments that included travel on a congested segment of a highway during the afternoon rush hour period tend to have positive residuals, which means we are generally underpredicting the travel times for those driving assignments.
• Conversely, driving assignments that did not include travel on a congested segment of a highway during the afternoon rush hour period tend to have negative residuals, which means we are generally overpredicting the travel times for those driving assignments.
• These results suggest that the dummy variable could potentially explain a substantial proportion of the variance in travel time that is unexplained by the current model, and so we proceed by adding the dummy variable x3 to the current Butler Trucking multiple regression model.

Let's look at the equation with the x1 and x2 dummy variables and how it translates into a regression equation

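With the two dummy variables, the regression equation is E(y) = b0 + b1x1 + b2x2, where:
• b0 = Mean (expected value) of sales for Region A.
• b1 = Difference between the mean number of units sold in Region B and the mean number of units sold in Region A.
• b2 = Difference between the mean number of units sold in Region C and the mean number of units sold in Region A.
-b1 and b2 are the coefficients that correspond to the dummy variables x1 and x2
-If the observations come from region A, then x1 is set to 0 and x2 is set to 0, so E(y) = b0
-If the observations come from region B, then x1 = 1 and x2 = 0, so E(y) = b0 + b1
-If the observations come from region C, then x1 = 0 and x2 = 1, so E(y) = b0 + b1(0) + b2(1) = b0 + b2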
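The interpretation of b0, b1, and b2 can be checked on a small example. The sales figures below are invented for illustration. The sketch relies on a standard result: when the only independent variables are the dummy variables for one categorical variable, the least squares estimates equal the baseline group mean and the differences from it.

```python
from statistics import mean

# Hypothetical weekly unit sales by region (invented numbers).
sales = {"A": [10, 12, 14], "B": [15, 17, 19], "C": [8, 9, 10]}

b0 = mean(sales["A"])        # mean of the baseline region A
b1 = mean(sales["B"]) - b0   # mean of region B minus mean of region A
b2 = mean(sales["C"]) - b0   # mean of region C minus mean of region A


def predicted_sales(x1, x2):
    """Evaluate b0 + b1*x1 + b2*x2 for one setting of the dummy variables."""
    return b0 + b1 * x1 + b2 * x2


for region, (x1, x2) in {"A": (0, 0), "B": (1, 0), "C": (0, 1)}.items():
    print(f"Region {region}: predicted mean sales = {predicted_sales(x1, x2)}")
```

Plugging each region's dummy values into the equation recovers that region's mean, which is what the coefficient interpretations say.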

