mgsc 22


Nonstationary:

A nonstationary series has no average level that it wants to stay near; instead it wanders or diverges off into space.

Orange Juice Sales Model: interpret what is happening (answer below)

> fit <- glm(log(sales) ~ brand + log(price), data=oj)
> round(coef(fit), 4)
    (Intercept) brandminute.maid   brandtropicana       log(price)
        10.8288           0.8702           1.5299          -3.1387
β̂ = -3.14 is the coefficient on log(price): the estimated "sales-price elasticity."
How to interpret the sales-price elasticity: expected sales drop by about 3% for every 1% increase in price.

The ~ symbol

The ~ symbol is read as "regressed onto" or "as a function of."

GLM to fit regression model in R: coef(fit)

supplies just the coefficient estimates.

We can use the autocorrelation function (ACF)

The autocorrelation function (ACF) measures the correlation between a series and its lagged values at different time lags (lag ℓ).
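A minimal sketch in R of how you might look at the ACF (the series name y is a placeholder, not from the course data):

# assumes y is a numeric time series vector, e.g., weekly sales
acf(y)                # plot the sample autocorrelation at lags 1, 2, ...
acf(y, plot=FALSE)    # print the autocorrelation values instead of plotting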

sample size

, n, minus the model degrees of freedom (n - model df) gives the residual degrees of freedom.

If |β1| >1 Properties of AR Models

the series diverges and will move toward very large or very small values.

residuals 3

1. Residuals are normally distributed. 2. Residuals are the "independent" errors.

How to estimate Trend

1. Simply create a time index variable, t. 2. Use the time index as the predictor variable: ŷ = β_0 + β_1·t
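A minimal R sketch of this two-step recipe, using the built-in AirPassengers series as a stand-in response (an assumption, not necessarily the course data):

y <- log(AirPassengers)     # any numeric series works here
t <- 1:length(y)            # 1. create the time index variable
trendFit <- glm(y ~ t)      # 2. use the index as the predictor
coef(trendFit)              # beta_0 (intercept) and beta_1 (trend per period)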

Logistic Regression

1. We change the straight line into a non-linear curve. 2. This curve goes from 0 to 1. 3. So the curve represents the probability of an email being spam. 4. Now we can set a threshold to classify emails based on their probability. 5. The default threshold is 0.5 (see the sketch below).
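A minimal sketch of the thresholding step in R, assuming a fitted logistic model such as spamFit from the later cards:

probs <- predict(spamFit, type="response")            # predicted probability of spam for each email
predClass <- ifelse(probs > 0.5, "spam", "not spam")  # apply the default 0.5 threshold
table(predClass)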

Make a prediction in R for log(sales) for Dominick's OJ priced at $2.00.
1. Create an observation with the feature of interest:
> newdata <- data.frame(price=2, brand="dominicks")
2. Use the predict() function to make predictions based on your fitted model and the new data:
> predict(fit, newdata)
8.653246
8.653246 is the predicted value for log(sales). To put this in terms of sales, use exp(·), R's function for e^(·):
> exp(8.653246)
[1] 5728.792

detecting spam with r (logistic regression) with positive coefficient

> coef(spamFit)["word_free"]
word_free
 1.542706
But we will look at e^β to make the interpretation easier. Keep in mind that e^β = odds(x+1)/odds(x).
> exp(1.542706)
[1] 4.67723
The odds of the email being spam when the word "free" is present are about 4.7 times the odds when "free" is not in the email.

Interpreting Negative coefficient detecting spam with r (logistic regression)

> coef(spamFit)["word_george"]
word_george
  -5.779841
We will look at e^β. Keep in mind that e^β = odds(x+1)/odds(x).
> exp(-5.779841)
[1] 0.003089207
The odds of an email being spam when the word "george" is present are a tiny fraction (0.003) of the odds when "george" isn't mentioned.

Finding the levels of brand (in R)

> levels(oj$brand)
[1] "dominicks"   "minute.maid" "tropicana"

detecting spam with r (logistic regression)

> spamFit <- glm(spam ~ ., data=spammy, family='binomial')
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
The warning message you get when you run this regression, "fitted probabilities numerically 0 or 1 occurred," means the regression is able to fit some data points exactly.

home example

E[y|x] = β_0 + x·β_1
E[Home value | Living space] = β_0 + β_1·(Living space)
E[Home value | 1500 sq ft] = β_0 + β_1·1500
With β_1 = 100 and β_0 = 10000:
E[Home value | 1500 sq ft] = 10000 + 100·1500 = $160,000
β_1 = 100 means that for every additional square foot of living space added to a home, the estimated increase in the home's price is $100 (e.g., living space = 1: 10000 + 100·1 = $10,100; living space = 2: 10000 + 100·2 = $10,200).
β_0 = 10000 means the average value of an empty lot with zero living space is $10,000.

glm model

E[y|x] = f(x′β). E[·] denotes taking the expectation (average) of whatever random variable is inside the brackets. x′β, the linear function, is the linear combination of predictor variables.

Residuals

Estimating the error variance: when fitting a linear regression model, we estimate an expected value, ŷ, for each observation in our dataset. The residual is the difference between the observed (actual) value and the predicted (fitted) value: e_i = y_i - ŷ_i, where y_i is the observed (actual) value and ŷ_i is the fitted (expected, predicted) value.

Linear equation with categorical variable

Example: we want to predict sales based on price and brand, where brand is a categorical variable.
E(sales | price, brand) = β_0 + β_1·(price) + β_2·(brand)
1. E(sales | price, brand) = β_0 + β_1·(price) + β_d1·(Minute Maid) + β_d2·(Tropicana)
2. E(sales | price, brand) = α_brand + β_1·(price)
Log-log model: E(log(sales) | log(price), brand) = α_brand + β_1·log(price), i.e., log(sales) = α_brand + β_1·log(price) + ε

logit link

For a link function in logistic regression, we require a function that produces output between zero and one for any given input, since the output is p(y=1|x).

R² in R

For our OJ regression, we can calculate R² in the following ways.
> 1 - fit3way$deviance/fit3way$null.deviance
[1] 0.5353939
> 1 - SSE/SST
[1] 0.5353939
> cor(fit3way$fitted, log(oj$sales))^2
[1] 0.5353939
Interpretation: this regression explains around 54% of the variability in log orange juice sales.

•What is the role of link function in linear regression ?

In linear regression, the link function doesn't do anything to the inputs (it is the identity), so f(x′β) = x′β.

•How is a factor (group) selected as the reference level?

It is selected based on alphabetical order

Stationary vs non-stationary

Nonstationary: the series has no average level that it wants to stay near; it wanders or diverges off into space. Stationary: the mean and variance are constant over time.

log linear model

One common transformation is to take the logarithm of the response variable: log(y) = x′β + ε, ε ~ N(0, σ²). This model is called a log-linear model.

Reference Level: (categorical to dummy)

One of these dummy variables is left out as a reference level. It serves as a baseline for comparison. Interpretation: when analyzing, we compare the effects of other brands to the reference level. This helps us understand how each brand differs from the baseline in our analysis.

Residual Variance

Residual variance: we take the sum of the squared residuals (the SSE) and divide by the residual df (28,935).
> SSE/fit3way$df.residual
[1] 0.4829706
This gives you σ̂², or what summary.glm calls the "dispersion."

log log 2

Taking the log of the response and the log of the input in a regression gives what is called a log-log model.

home example conditional mean

The conditional mean is a function, e.g., E(Home value | bathrooms, location): the average home value given specific conditions, such as the number of bathrooms and the location.

tip example linear regression

The data includes at least two variables: response and predictor variables. The Tip amount is dependent on the Bill amount. Bill amount is the input feature (predictor variable) Tip amount is the response variable

feature engineering

There are other functional transformations, such as x² and x1·x2. This is often referred to as feature engineering.

reference level

You can check the reference level of a factor by noting the first in the list using the levels function on your factor.
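For example, with the OJ data (relevel() is mentioned here only as an aside, not something covered in these cards):

levels(oj$brand)       # "dominicks" is listed first, so it is the reference level
levels(oj$brand)[1]
# relevel() changes the baseline if you want a different comparison group
oj$brand <- relevel(oj$brand, ref="tropicana")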

log-log model

You will likely also consider transformations for the input variables, such that elements of x include logarithms: log(y) = β_0 + β_1·log(x) + ε. This model is called a log-log model.

Interpreting coefficient in a log-linear model

> exp(coef(fitAirSLR)["t"])
       t
1.010099
The result of exponentiation is a multiplicative factor. Interpret exp(β_t) = 1.01: for every one-unit increase in the predictor variable, the response variable increases by a factor of about 1.01, which is equivalent to an increase of about 1%. The multiplicative change in passenger count is 1.01, meaning about a 1% increase for each additional t. So each month, we expect the passenger count to increase by about 1%.

link thing

f(·) is a "link" function that maps the relationship between the expected value of the response variable and the linear combination of predictor variables. In general: function(input) → output. Here: link function(linear combination of predictor variables) → expected value of the response variable.

Sum of squares error (SSE)

is another name for the deviance (the sum of the squared residuals).

The R²

is the proportion of variability explained by the regression: the proportional reduction in squared errors due to your regression inputs. The name R² is derived from the fact that, for linear regression only, it is equal to the square of the correlation (usually denoted r) between the fitted ŷ_i and observed values y_i.

Stationary:

mean and variance are constant

link function

The link function, f(·), defines the relationship between your linear function and the response.

odds formula

odds=p/(1-p)
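A quick worked example in R (the probability 0.8 is just an illustration):

p <- 0.8
odds <- p / (1 - p)    # 4: the event is 4 times as likely to happen as not
odds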

OJ Sales with Interaction (log sales answer)

> fit2way <- glm(log(sales) ~ log(price)*brand, data=oj)
> coef(fit2way)
                (Intercept)                  log(price)            brandminute.maid              brandtropicana
                10.95468173                 -3.37752963                  0.88825363                  0.96238960
log(price):brandminute.maid   log(price):brandtropicana
                 0.05679476                  0.66576088
Regression equation: log(sales) = 10.95 - 3.38·log(price) + 0.89·minute.maid + 0.96·tropicana + 0.06·(log(price)×minute.maid) + 0.67·(log(price)×tropicana)

we need to find unknown coefficients

β_0 is the intercept; β_1 is the slope.

GLM to fit regression model in R: predict(fit, newdata=mynewdata)

predicts y where mynewdata is a data frame with the same variables as mydata.

interaction term

product of 2 inputs

residuals

Residuals are the "independent" errors. Looking at residuals is a common way to check how much variability is left unexplained by the model.
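A minimal R sketch of this check, assuming a fitted model such as fit from the OJ cards:

e <- residuals(fit)      # e_i = y_i - yhat_i
hist(e)                  # roughly bell-shaped if the normality assumption holds
plot(fitted(fit), e)     # look for leftover patterns; ideally there are none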

transformations (If the variables change multiplicatively with other factors

then the relationship is not linear).

If |β1| = 1 Properties of AR Models

you have a random walk.

GLM to fit regression model in R summary(fit)

summary(fit): prints the model, information about residual errors, the estimated coefficients and uncertainty about these estimates (we will cover the uncertainty in detail in the next chapter), and statistics related to model fit.

home example marginal mean

The average home value without considering any other variables.

Modeling Trends in Time Series

•If your data include dates, then you should create indicator variables for, say, each year, month, and day. A best practice is to proceed hierarchically

1.linear equation model

β_0 + x·β_1

In a log-log model

β_1 is the proportional change in y over the proportional change in x.

Generalized Linear Model (GLM)

GLM is a generalization of the linear model. GLMs allow us to model a linear relationship between predictors and the response variable even if their relationship is not linear. A GLM includes a linear function (β_0 + x_1·β_1 + ... + x_p·β_p) and a link function. The link function, f(·), defines the relationship between your linear function and the response.

trend

•Gradual movement to a relatively higher or lower value

logistic regression

•Here we want to predict categories

home example

Home value is a random variable: it is an uncertain quantity. The expected value of a random variable (e.g., E(home value)) is the average of random draws from its probability distribution.

If |β1| <1 Properties of AR Models

the values are mean reverting.

seasonality

A repeating pattern at a fixed time interval.

Regression for Time Series Data

In a time series dataset, the response variable is almost always correlated over time. In business settings, the response of interest is typically sales numbers, revenue, profit, active users, or prices.

Handling Categorical Variables in Regression

Categorical variables have multiple groups: categories like "brand" can have many groups, such as Tropicana, Minute Maid, or Dominick's. We want to create dummy variables for them.

OJ sales example

•Consider sales data for orange juice (OJ) from Dominick's grocery stores, a Chicago area grocery chain. •The data includes: Weekly prices and Unit sales (number of cartons sold) for three OJ brands—Tropicana, Minute Maid, and Dominick's—at 83 stores in the Chicago area, as well as an indicator Ad: showing whether each brand was advertised (in store or flyer) that week •Here we want to predict sales by using price E(sales|price)

Deviance

•Deviance is the distance between your fitted model and the data. •It measures the amount of variation you have after fitting the regression. •Deviance is the sum of squared errors (residuals).

Diverging Series

For AR(1) terms larger than one, life is more complicated. This case results in what is called a diverging series, because the y_t values move exponentially far from y_1. Observations diverge quickly even for β_1 = 1.02, very close to one. A diverging series is useless for modeling and prediction.

random walk

In a random walk, the series just wanders around and the autocorrelation stays high for a long time. Consider the daily Dow Jones Average (DJA) composite index from 2000 to 2007. The time series plot (line graph of the series) appears as though it is just wandering around, and the ACF plot confirms the high autocorrelation, which stays high for a very long time.

The link function allows you to map a nonlinear relationship into a linear one. Linear regression: the linear function β_0 + x·β_1 relates directly (linearly) to E[y|x].

In logistic regression, we use the link function to map from a linear function to a response that takes a few specific values. Logistic regression: β_0 + x·β_1 relates non-linearly to E[y|x].

odds

•In order to discuss how to interpret the coefficients in logistic regression, it is first important to talk about the odds of success. •The odds are the probability an event happens divided by the probability it doesn't happen.

Ordinary Least Squares (OLS)

•In the linear regression, you are minimizing the sum of squared residual errors. •This gives linear regression its common name: Ordinary Least Squares, or OLS . •We will use the terms OLS and linear regression interchangeably.

log-log juice example

In the orange juice dataset, variables change multiplicatively with each other. One possible model for creating a linear relationship between sales and price is then a log-log model: log(sales) = β_0 + β_1·log(price) + ε. In a log-log model, log(sales) increases by β_1 for every unit increase in log(price); that is, β_1 is the proportional change in y over the proportional change in x.

Categorical to Dummy

•In the orange juice example, we turn brands (a categorical variable) into dummy variables. Each brand gets its own dummy variable, with 1s and 0s indicating presence or absence.

Interactions (linear regression)

Interactions: in regression, they refer to situations where the impact of one predictor variable on the response variable changes based on the value of another predictor variable. For example: Does gender change the effect of education on wages? How does consumer price sensitivity change across brands? In each case, you want to know whether one variable changes the effect of another.

Higher Lags

It is also possible to expand the AR idea to higher lags, AR(p): y_t = β_0 + β_1·y_(t-1) + ... + β_p·y_(t-p) + ε_t. Regularization methods (we haven't covered these yet) make it straightforward to let the data choose the appropriate lags. The simple stationary versus nonstationary interpretations for β_1 no longer apply if you include higher lags. A need for higher lags sometimes means you are missing a more persistent trend or seasonality in the data. One way to build the lagged inputs is sketched below.
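A sketch of building the lagged inputs with base R's embed(), on a generic numeric series y (the names p, lagged, and arFit are placeholders):

p <- 3                                     # number of lags to include
L <- embed(y, p + 1)                       # column 1 is y_t, columns 2..p+1 are y_{t-1}..y_{t-p}
lagged <- data.frame(yt = L[, 1], L[, -1])
names(lagged)[-1] <- paste0("lag", 1:p)
arFit <- glm(yt ~ ., data=lagged)          # AR(p) regression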

OJ sale prediction

Let's revisit our first simple regression model.
> fit <- glm(log(sales) ~ brand + log(price), data=oj)
> round(coef(fit), 4)
    (Intercept) brandminute.maid   brandtropicana       log(price)
        10.8288           0.8702           1.5299          -3.1387
We can write the regression equation and make a prediction for log(sales) for Dominick's OJ priced at $2.00:
log(sales)-hat = 10.8288 + 0.8702·(0) + 1.5299·(0) - 3.1387·log(2) = 10.8288 - 3.1387·log(2) = 8.653219

linear regression

Linear regression predicts a quantitative variable, like house price. It is a GLM that models a continuous numeric response.

linear regression

•Linear regression predicts the mean of the response variable based on its linear relationship with predictors.

logistic regression 2

Logistic regression is the GLM that you will want to use for modeling a binary response: a y that is either 1 or 0 (e.g., true or false). Binary responses arise from a number of prediction targets: Will this person pay their bills or default? Will the customer take advantage of the offer?

mean reverting series 2

Mean reversion is common, and finding an AR(1) coefficient between -1 and 1 should give you some confidence that you have included the right trend variables and are modeling the right version of the response. The AR(1) component of our regression for log passenger counts was mean reverting, with each y_t expected to be 0.79 times the response for the previous month.

assumptions

The model assumptions are about the error term (ε) in y = x′β + ε. When you fit a model: 1. We assume the variability around the line (the errors) is normally distributed (a.k.a. Gaussian): ε ~ N(0, σ²). 2. The ε are "independent," which means the leftover variation in y is not correlated with x.

Model Matrix ex

Now, R creates the model matrix for you when you call glm. However, we can create it using the model.matrix function just for illustration.
> x <- model.matrix( ~ log(price) + brand, data=oj)
> x[c(100,200,300),]
    (Intercept) log(price) brandminute.maid brandtropicana
100           1  1.1600209                0              1
200           1  1.0260416                1              0
300           1  0.3293037                0              0
Notice there is no brand column; instead we have the dummy variables brandminute.maid and brandtropicana. There is a column of 0s and 1s for both Minute Maid and Tropicana, but none for Dominick's. This means Dominick's is the reference level.

residuals 2

Residuals represent the variability in the response that the model does not explain. We can use the residuals to check the model assumptions.

Properties of AR Models

The AR(1) model is simple but hugely powerful. If you have any suspicion of autocorrelation, it is a good move to include the lagged response as an input (see the sketch below). The coefficient on this lag gives you important information about the time-series properties.
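A minimal sketch of adding the lag-1 response as an input (y, ar1, and ar1Fit are placeholder names):

n <- length(y)
ar1 <- data.frame(yt = y[2:n], ylag = y[1:(n - 1)])   # pair each y_t with y_{t-1}
ar1Fit <- glm(yt ~ ylag, data=ar1)
coef(ar1Fit)["ylag"]      # the AR(1) coefficient beta_1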

transformations

•The linear regression assumes that there is a linear relationship between the predictor and the response variable.

link function 2

The link function, f(·), also allows you to map a nonlinear relationship between the predictor and response variable into a linear one: applying the link to the linear function, f(β_0 + x·β_1), gives E[y|x].

Mean-Reverting Series

The most interesting series have AR(1) terms between -1 and 1. These series are called stationary because y_t is always pulled back toward the mean. These are the most common, and most useful, type of AR series. The past matters in a stationary series, but with a limited horizon, and the autocorrelation drops off rapidly.

Null deviance (SST)

The null deviance is calculated for the "null model." In the absence of predictor variables, the deviance is due only to error, which makes it a measure of total deviance. The sum of squares total (SST) is another name for the null deviance: SST = Σ_i (y_i - ȳ)²
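In R, the null deviance is stored on the fitted glm object, and you can reproduce it by hand as SST (sketched for the fit3way model used in the other cards):

fit3way$null.deviance                 # deviance of the intercept-only model
y <- log(oj$sales)
SST <- sum((y - mean(y))^2)           # the same quantity computed by hand
SST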

detecting spam with r (logistic regression) 2

The output looks basically the same as what you get for a linear regression. Note that in logistic regression there is no σ² to estimate as the "dispersion parameter," because there is no error term like the ε of linear regression. Instead, glm outputs "Dispersion parameter for binomial family taken to be 1". If you don't see this, then you might have forgotten to put family="binomial".

Autocorrelation

The residual analysis shows that y_(t-1) can be used to predict y_t. This phenomenon is called autocorrelation: the correlation between a variable and its lagged (past) values. Time series data is simply a collection of observations gathered over time. For example, suppose y_1, ..., y_t are weekly sales, daily temperatures, or five-minute stock returns; in each case, you might expect what happens at time t to be correlated with time t - 1.

R²

Think of the null deviance as the total variability in your response, y. Think of the deviance as the amount of variability in the response that is left over, or not explained, by the regression. Their ratio, deviance/null.deviance, is then the proportion of variability in y not explained by the regression. R² = 1 - deviance/null.deviance

Random Walk: Returns Transformation

•This property is implied by the series being a random walk: the differences between yt and yt-1 are independent. •If you have a random walk, you should perform this "returns" transformation to obtain something that is easier to model. •For example, it is standard to model asset price series in terms of returns rather than raw prices.
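A minimal sketch of the returns transformation in R (dja here is a placeholder for the price series):

returns <- diff(log(dja))   # log returns: differences of log prices, roughly the percent change
acf(returns)                # for a random walk, the returns should show little autocorrelation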

dummy variables

To do this, we use dummy variables: we convert categorical data into binary (0 or 1) variables. Dummy variables act as switches: when a dummy variable is 1, it represents the presence of a specific category; when it's 0, it signifies absence. This helps the model include or exclude categories in its predictions.

Detecting spam email: predictions (logistic regression)

To get the probabilities, we need to add the type="response" argument.
> predict(spamFit, newdata=spammy[c(1,4000),], type="response")
        1      4000
0.8839073 0.1509989
Email 1 has an 88% chance of being spam and email 4,000 has a 15% chance of being spam.

categorical variables

We want one regression model for all groups: instead of separate models for each category, we aim to create a single model that works for all categories. E(sales | price, brand) = β_0 + β_1·(price) + β_2·(brand)

Logistic Regression in glm

•You can use glm to fit logistic regressions in R with the exact syntax for linear regression, but you need to add the argument family="binomial". •The logit link is how glm fits probabilities. •The response variable can take a number of forms including numeric 0 or 1, logical TRUE or FALSE, or a two level factor such as win vs. lose.

residual degrees of freedom

are equal to the number of opportunities you have to observe variation around the fitted model means.

model degrees of freedom

•are the number of random observations your model could fit perfectly. In regression models, this is the number of coefficients

β_1 (slope) in a log-log model

•elasticity.

GLM to fit regression model in R

glm in R takes a formula and a dataset:
fit <- glm(y ~ var1 + var2 + ... + varp, data = myData)
The ~ symbol is read as "regressed onto" or "as a function of."

How to interpret β in a log-linear model ?

log(Y) = 4 + 0.31·X
1. exp(0.31) = 1.363425
2. Y is expected to increase by about 36% for every 1-unit increase in X.

Degrees of Freedom

n - 1 is the degrees of freedom. Degrees of freedom are "opportunities to observe variation."

•What is the relationship between the odds of success and logistic regression?

So e^(x′β) is the odds of success: p(success)/p(failure). Notice: log(e^(x′β)) = x′β. Therefore x′β is the log odds of success.
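A small sketch tying this together in R: plogis() is the inverse of the logit, mapping the log odds x′β back to a probability (the number is made up):

xb <- 1.386294     # an example value of x'beta, the log odds
exp(xb)            # the odds of success, about 4
plogis(xb)         # the probability of success, exp(xb)/(1 + exp(xb)) = 0.8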

