SF2930 - Regression
Question 26 Regression analysis often utilizes the variable selection procedure known as best subsets regression. (a) Describe thoroughly the steps of the "all possible regressions" procedure. (b) Specify at least two objective criteria that can be used for evaluating the models in the "all possible regressions" procedure. Explain how to apply these criteria and motivate why they are suitable for this type of variable selection. Explain the advantages and disadvantages of this approach to regression model building. (c) Suppose that there are three candidate predictors, x1, x2, and x3, for the final regression model. Suppose further that the intercept term, β0, is always included in all the model equations. How many models must be estimated and examined if one applies the "all possible regressions" approach? Motivate your answer.
a) 1: Create all possible subsets of the regressor variables.
2: Fit a linear model to each of the subsets.
3: Use these models to compute relevant evaluation statistics. Which ones to use is up to the person doing the evaluation.
4: Use these statistics to decide which models to evaluate further.
5: Perform model adequacy checks on the chosen models and investigate outliers, influence and multicollinearity.
b) Two objective statistics to use when considering which model to choose are R^2 and MS_Res. Both can be seen as measures of how well the model fits the observations, which makes them well suited for evaluating the models. To apply these criteria, a model needs to be fitted first. Once the model is fitted we can determine SS_Res and then calculate the statistics using the following formulas: R^2 = 1 − SS_Res/SS_tot and MS_Res = SS_Res/(n − p), where SS_tot measures the total variability in the observations, n is the number of observations and p is the number of parameters in the model. Since R^2 never decreases when a regressor is added, it is mainly useful for comparing models of the same size or for seeing where additional regressors stop giving a meaningful increase. MS_Res divides by n − p and can therefore increase when an unhelpful regressor is added, so a model with small MS_Res is a natural candidate. These statistics can be used in conjunction with each other as well as with other model criteria. One could for example favor a simpler model with fewer regressors and a smaller R^2 value over a more complex model with a higher R^2 value.
One big disadvantage of best subset regression is that it can be computationally expensive if we have many possible regressors. One advantage of the method is that the models can be ranked by whatever criteria we choose. Another advantage is that no candidate model is missed: if you have the capacity to perform this method, every model is examined, so the best one according to the chosen criterion is found.
c) If we have k possible regressors to consider, there are 2^k different subsets to be considered. The intercept β0 and the corresponding regressor x0 = 1 are not normally included in k even though they are a part of the model. The regressors would in this case be x1, x2 and x3, which means that k = 3, and therefore we would need to evaluate 2^3 = 8 different models.
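The count in c) can be checked by listing the subsets explicitly. This is only a small sketch; the names "b0", "x1", "x2", "x3" are labels chosen for illustration and no models are actually fitted.

```python
from itertools import combinations

def all_subsets(regressors):
    """Return every subset of the candidate regressors, including the empty one."""
    subsets = []
    for size in range(len(regressors) + 1):
        subsets.extend(combinations(regressors, size))
    return subsets

models = all_subsets(["x1", "x2", "x3"])
for subset in models:
    # The intercept is always included, so it is listed first in every model.
    print(" + ".join(("b0",) + subset))
print(len(models))  # 2**3 = 8
```

Each added candidate regressor doubles the number of printed models, which is the computational disadvantage mentioned in b).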
Question 23 Describe the "best power algorithm". Motivate, using equations, why this algorithm works.
1. Let α0 be an initial guess for α, and let ξ0 := ξ(α0), where ξ(α) := {x^α if α ≠ 0, and log x if α = 0}.
2. Calculate the LSE β̂0 and β̂1 for the equation E[y] = β0 + β1 ξ0.
3. Calculate the LSE β̂0*, β̂1* and β̂2* for the equation E[y] = β0 + β1 ξ0 + β2 ξ0 log x (for the usual starting value α0 = 1 the extra regressor is x log x).
4. Update the guess to α1 := α0 + β̂2*/β̂1 and repeat the steps with α1 in place of α0.
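The question also asks for a motivation with equations, which the steps above do not give. A possible sketch, assuming the "best power algorithm" is the Box-Tidwell procedure, that the true model is E[y] = β0 + β1 x^α, and that α0 ≠ 0 (the symbol γ is introduced here for the coefficient called β2 in step 3):

```latex
\begin{align*}
E[y] &= \beta_0 + \beta_1 x^{\alpha}, \\
x^{\alpha} &\approx x^{\alpha_0} + (\alpha - \alpha_0)\, x^{\alpha_0} \ln x
  \quad \text{(first-order Taylor expansion in } \alpha \text{ about } \alpha_0\text{)}, \\
E[y] &\approx \beta_0 + \beta_1 x^{\alpha_0} + \gamma\, x^{\alpha_0} \ln x,
  \qquad \gamma = \beta_1 (\alpha - \alpha_0), \\
\alpha &= \alpha_0 + \frac{\gamma}{\beta_1}
  \;\Rightarrow\; \alpha_1 = \alpha_0 + \frac{\hat\beta_2^{*}}{\hat\beta_1}.
\end{align*}
```

The update in step 4 is therefore the Taylor relation solved for α, with β1 estimated from the fit in step 2 and γ from the fit in step 3.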
Question 9 A friend was interested to see if wages are "socially inherited", in the sense that parents' wages influence the children's wages, ceteris paribus ("all else held equal"). She had a very large amount of observations on individual wages, education, working experience, and parents' wages. She ran a regression of log(wage) on parents' wages, dummies for highest university or college degree (bachelor's degree, etc.; "no degree" was benchmark), working experience, and years of study at college or university. The coefficients came out with the sign she had expected, except for the coefficient for years of study, which to her surprise came out slightly negative (indicating that college studies are detrimental to wage opportunities). To test, she computed a confidence interval for that coefficient, but also this interval was all negative. What is your explanation for this?
1. Probably multicollinearity, since the degree dummies (bachelor's degree and the others, with "no degree" as benchmark) will be highly correlated with the number of years at university. 2. The positive contribution from university studies is thus already included in the model through the dummy variables. The coefficient for years of study then measures the effect of additional years with the degree held fixed, so for example taking 6 years to complete a bachelor's degree will affect salary negatively.
Question 30 Define BIC, deviance, and VIF, and describe how they are used.
BIC is the Bayesian information criterion (Schwarz criterion). With L the maximized likelihood function for a specific model, p the number of parameters and n the number of observations:
BIC_Sch = −2 ln(L) + p ln(n)
For ordinary least squares regression it is
BIC_Sch = n ln(SS_Res/n) + p ln(n)
It is used for model selection, where a lower BIC usually indicates a better model. It introduces a penalty term for adding additional parameters and thus seeks to prevent overfitting.
Deviance is a measure of goodness of fit: the smaller the deviance, the better the fit.
D = 2 ln(L(saturated model)/L(full model))
A large D implies that the model does not provide a satisfactory fit.
VIF_j = C_jj = (1 − R_j^2)^(−1), where R_j^2 is the coefficient of determination when x_j is regressed on the remaining regressors.
Orthogonal regressors: R_j^2 is small and C_jj is near unity.
Linearly dependent regressors: R_j^2 is near unity and C_jj is large.
One or more large VIFs (larger than 5 to 10) indicate multicollinearity.
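As a small illustration of how the OLS form of BIC is applied, the sketch below compares two fitted models. The SS_Res values, n and p are made-up numbers, not results from any data set in these notes.

```python
import math

def bic_ols(ss_res, n, p):
    """BIC for an OLS fit, n*ln(SS_Res/n) + p*ln(n), with p fitted parameters."""
    return n * math.log(ss_res / n) + p * math.log(n)

# Hypothetical fits: the larger model lowers SS_Res only slightly.
n = 50
bic_small = bic_ols(ss_res=120.0, n=n, p=3)
bic_large = bic_ols(ss_res=118.0, n=n, p=5)
print(bic_small, bic_large)
print("prefer the smaller model" if bic_small < bic_large else "prefer the larger model")
```

Here the small drop in SS_Res does not pay for the two extra parameters, so the smaller model has the lower BIC.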
Question 27 Explain how cross validation works, and how it can be used to estimate the mean squared error 𝔼[1/n*∑(𝑦𝑖 − 𝑦̂𝑖)^2]
Cross validation is a resampling method that uses different partitions of the data to train and test a model on different iterations. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice.
Step 1: Randomly divide the data set into k groups, or "folds", of roughly equal size.
Step 2: Choose one of the folds to be the holdout set. Fit the model on the remaining k-1 folds. Calculate the test MSE on the observations in the fold that was held out.
Step 3: Repeat this process k times, using a different fold each time as the holdout set.
Step 4: Calculate the overall test MSE as the average of the k test MSEs: Test MSE = (1/k)*ΣMSE_i
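The four steps can be sketched for simple linear regression as follows. The data are simulated and the helper names fit_line and k_fold_mse are made up for this illustration.

```python
import random

def fit_line(xs, ys):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b1 * x_bar, b1

def k_fold_mse(xs, ys, k, seed=0):
    """Average of the k held-out mean squared errors."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)         # Step 1: random partition
    folds = [idx[i::k] for i in range(k)]    # k folds of roughly equal size
    fold_mses = []
    for hold in folds:                       # Steps 2-3: each fold is held out once
        hold_set = set(hold)
        train = [i for i in idx if i not in hold_set]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errs = [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in hold]
        fold_mses.append(sum(errs) / len(errs))
    return sum(fold_mses) / k                # Step 4: average the k test MSEs

# Simulated data: a straight line plus normal noise with standard deviation 0.3.
rng = random.Random(1)
xs = [i / 10 for i in range(40)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 0.3) for x in xs]
cv_mse = k_fold_mse(xs, ys, k=5)
print(cv_mse)
```

Because every observation is predicted by a model that did not see it, the average estimates the error on new data rather than the (optimistic) training error.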
Question 25
(No answer written.)
Question 11 Assume that we have a data set in which the errors do not seem to be normally distributed and that you are not able to find a transform that makes them approximately normally distributed. Describe a method that can then be used to find a confidence interval for one of the regression parameters. Which assumptions do you need to make on the data set in order to apply this method?
A bootstrap confidence interval can be used. For parametric bootstrapping (bootstrapping residuals), we need to assume that the errors of the model are i.i.d. For non-parametric bootstrapping (bootstrapping cases), we assume that X is random, so that the pairs (x_i, y_i) can be treated as an i.i.d. sample.
Bootstrap CIs for regression coefficients by the "reflection method": let β̂*(α/2) and β̂*(1−α/2) be the lower 100(α/2) and upper 100(1−α/2) percentiles of the bootstrap distribution of β̂*, and define
D1 = β̂ − β̂*(α/2) and D2 = β̂*(1−α/2) − β̂.
This gives the bootstrap CI: β̂ − D2 ≤ β ≤ β̂ + D1.
The bootstrap (cases) procedure itself:
1. Compute the statistic of interest for the original data.
2. Resample B times from the data, with replacement, to form B bootstrap samples.
3. Compute the statistic on each bootstrap sample. This creates the bootstrap distribution, which approximates the sampling distribution of the statistic.
4. Use the approximate sampling distribution to obtain bootstrap estimates such as standard errors, confidence intervals, and evidence for or against the null hypothesis.
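A minimal sketch of bootstrapping cases with the reflection interval, for the slope of a simple linear regression. The data are simulated with skewed (non-normal) errors, and the function names, the number of resamples and the way the percentiles are picked from the sorted list are choices made for this illustration.

```python
import random

def slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx

def reflection_ci(xs, ys, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap-cases confidence interval for the slope, reflection method."""
    rng = random.Random(seed)
    n = len(xs)
    b_hat = slope(xs, ys)                       # 1. statistic on the original data
    boot = []
    while len(boot) < n_boot:                   # 2. resample pairs with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:                    # slope undefined if all x are equal
            continue
        boot.append(slope(bx, [ys[i] for i in idx]))     # 3. statistic per resample
    boot.sort()
    lower_pct = boot[int(alpha / 2 * n_boot)]            # Bhat*(alpha/2)
    upper_pct = boot[int((1 - alpha / 2) * n_boot) - 1]  # Bhat*(1 - alpha/2)
    d1 = b_hat - lower_pct
    d2 = upper_pct - b_hat
    return b_hat - d2, b_hat + d1               # 4. Bhat - D2 <= B <= Bhat + D1

rng = random.Random(42)
xs = [i / 2 for i in range(30)]
# Mean-zero exponential errors: clearly not normally distributed.
ys = [1.0 + 2.0 * x + (rng.expovariate(1.0) - 1.0) for x in xs]
lower, upper = reflection_ci(xs, ys)
print(lower, slope(xs, ys), upper)
```

No normality assumption is used anywhere; the interval comes entirely from the spread of the resampled slopes.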
Question 14 Describe the concept of an influential data point (use figures!) and explain how such points can be detected using DFFITS and Cook's distance measure.
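Influential points "pull" the regression model in their direction. (Draw a point away from the clear data area, unusual in both x and y.)
DFFITS_i = (ŷ_i − ŷ_(i))/sqrt(S_(i)^2 h_ii), where ŷ_(i) is the fitted value at x_i when point i is left out and S_(i)^2 is the variance estimate without point i. It measures how many standard deviations the fitted value changes if point i is removed. If this changes too much (a common cutoff is |DFFITS_i| > 2 sqrt(p/n)), the point is influential.
Cook's D_i = (β̂_(i) − β̂)' M (β̂_(i) − β̂)/c, usually with M = X'X and c = p MS_Res. It measures the squared distance between the least squares estimates with and without point i. If this distance is large (a common cutoff is D_i > 1), point i is an influential point.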
Question 35 Describe a setting in which kernel regression might be advantageous over linear regression.
It can be advantageous to use kernel regression if we have non-linear data that cannot easily be described by a polynomial. It can also be advantageous if it is hard to get insight into the relationship between the response variable and the regressor variables. One example where it is hard to use linear regression is when the data take the form y = sin(x) + ε, where ε ~ N(0, σ^2).
Question 34 Assume that we want to fit a polynomial of degree two to a data set (x1,y1),(x2,y2),...,(xn,yn). Does it make any difference if we fit the data to the model y=β0+β1x+β2x^2+ε, or to the model y=β0+β1x+β2(x^2/2)+ε? Motivate your answer! Hint: Does the answer depend on which estimates for β0 and β1 you use? Does it matter if the model matrix is on standard form?
The short answer is "it depends on what we mean by difference". With ordinary least squares the two models are reparametrizations of each other: the columns 1, x, x^2 and 1, x, x^2/2 span the same space, so the fitted values and the residuals are identical. The estimates of β0 and β1 are unchanged, and the estimate of β2 in the second model is twice that in the first. If we mean difference as in "adequacy of the model", then we compare some statistic such as RMSE or R^2, and these are the same for both models. If we mean difference as in the "inner workings" of the model, such as the structure of the β vector itself or the penalty term in ridge or LASSO, then it will make a difference: a penalty on the size of β treats β2 and 2β2 differently, so penalized estimates of all coefficients can change, unless the model matrix is first put on standard form (centered and scaled columns), which removes the factor 1/2. Again, the key here is where we are looking for differences.
Question 12 Define some different types of residuals (at least three) (for example standardized, studentized, PRESS, etc.), specify their properties, and explain how they can be used for detecting outliers
Standardized residuals are based on the fact that MS_Res estimates the variance of the residuals. If |d_i| = |e_i|/sqrt(MS_Res) is larger than 3, the point may (generally) be regarded as an outlier.
Studentized residuals use a separate standard deviation for each data point through h_ii, the i:th diagonal element of the hat matrix, since Var(e_i) = σ^2 (1 − h_ii). If |r_i| = |e_i|/sqrt(MS_Res (1 − h_ii)) is larger than 3, the point may (generally) be regarded as an outlier.
PRESS residuals check how each data point affects the fitted values by predicting y_i from a fit with point i excluded: e_(i) = y_i − ŷ_(i). Refitting n times is inefficient for large data sets, but it is not needed, since e_(i) = e_i/(1 − h_ii) can be computed from the single full fit. A large difference between e_(i) and e_i indicates a point where the model fits well only because the point itself is included.
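The three residual types can be computed from one fit. The sketch below does this for simple linear regression, where h_ii = 1/n + (x_i − x̄)^2/S_xx; the data are made up, with one response value shifted upward to act as an outlier.

```python
import math

def fit_line(xs, ys):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b1 * x_bar, b1

def residual_types(xs, ys):
    """Standardized, studentized and PRESS residuals from a single fit."""
    n, p = len(xs), 2
    b0, b1 = fit_line(xs, ys)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    e = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
    ms_res = sum(ei ** 2 for ei in e) / (n - p)
    h = [1 / n + (x - x_bar) ** 2 / sxx for x in xs]   # hat diagonals h_ii
    d = [ei / math.sqrt(ms_res) for ei in e]                         # standardized
    r = [ei / math.sqrt(ms_res * (1 - hi)) for ei, hi in zip(e, h)]  # studentized
    press = [ei / (1 - hi) for ei, hi in zip(e, h)]                  # PRESS
    return d, r, press

xs = [float(i) for i in range(1, 11)]
ys = [2 * x + 0.1 * (-1) ** i for i, x in enumerate(xs)]
ys[4] += 5.0   # plant an outlier at the fifth point
d, r, press = residual_types(xs, ys)
worst = max(range(len(xs)), key=lambda i: abs(r[i]))
print(worst, d[worst], r[worst], press[worst])
```

The planted point has the largest studentized residual, and its PRESS residual equals the prediction error obtained by refitting without it.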
Question 19 Derive in detail at least two diagnostic measures for detecting multicollinearity in multiple linear regression and explain in which way these measures reflect the degree of multicollinearity.
VIF
VIF_j = C_jj = (1 − R_j^2)^(−1), where R_j^2 is the coefficient of determination when x_j is regressed on the remaining regressors.
Orthogonal regressors: R_j^2 is small and C_jj is near unity.
Linearly dependent regressors: R_j^2 is near unity and C_jj is large.
One or more large VIFs (larger than 5 to 10) indicate multicollinearity.
Eigensystem analysis of X'X
Eigenvalues of X'X: λ_1, λ_2, ..., λ_p. If one or more of the eigenvalues are small, it indicates near-linear dependence within the data.
Also the condition number: κ = λ_max/λ_min. κ < 100 means no serious problem, κ between 100 and 1000 means moderate to strong multicollinearity, and κ > 1000 means severe multicollinearity.
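With only two regressors in correlation form, X'X = [[1, r], [r, 1]], so both diagnostics have closed forms: VIF = 1/(1 − r^2) for both regressors, and the eigenvalues are 1 + r and 1 − r. The sketch below uses this special case with made-up data; it is not a general implementation.

```python
import math

def correlation(u, v):
    """Sample correlation between two equally long lists."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def two_regressor_diagnostics(x1, x2):
    """VIF and condition number when X'X is the 2x2 correlation matrix."""
    r = correlation(x1, x2)
    vif = 1.0 / (1.0 - r ** 2)                 # same for both regressors
    kappa = (1.0 + abs(r)) / (1.0 - abs(r))    # lambda_max / lambda_min
    return r, vif, kappa

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]   # nearly a copy of x1
r, vif, kappa = two_regressor_diagnostics(x1, x2)
print(r, vif, kappa)
```

For these nearly identical regressors the VIF is far above 10 and κ falls in the moderate-to-strong band, while orthogonal regressors give VIF = κ = 1.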
Question 32 (a) Describe the model of logistic regression, including the assumptions made on the data. (b) Describe how the regression parameters are estimated in logistic regression, and explain why the least-squares approach cannot be used. (c) Describe at least one statistic which can be used for model testing for logistic regression.
a) Assumptions:
• The response y_i is a binary variable, i.e. y_i ~ Be(p_i), so E(y_i) = P[y_i = 1] = p_i
• The y_i's are independent
• The regressors x_i are allowed to be continuous
Logistic regression model: E(y_i) = p_i = exp(x_i^T β)/(1 + exp(x_i^T β))
Note that the logistic model models the probability of observing a one at a certain sample point, and not a {0, 1}-valued function.
b) The regression parameters are estimated by maximum likelihood. Likelihood function: See drive docs 32b. The least squares approach cannot be used because the errors are not normally distributed: y_i only takes the values 0 and 1, so the error has only two possible values, and its variance p_i(1 − p_i) is not constant.
c) The likelihood ratio test (LR) is applicable to check the significance of the logistic regression model. A large value of this test statistic would indicate that at least one of the regressor variables in the logistic regression model is important because it has a nonzero regression coefficient.
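For reference, the standard textbook form of the likelihood in b) and of the test statistic in c) can be written out as below. This is a sketch and not the content of the drive document; FM and RM denote the full model and the reduced model (intercept only, when testing overall significance), and r is the number of coefficients set to zero in RM.

```latex
\begin{align*}
L(\beta) &= \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{1 - y_i},
  \qquad p_i = \frac{\exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)}, \\
\ln L(\beta) &= \sum_{i=1}^{n} \left[ y_i\, x_i^T \beta - \ln\!\left(1 + \exp(x_i^T \beta)\right) \right], \\
LR &= 2 \ln \frac{L(\mathrm{FM})}{L(\mathrm{RM})}
  = 2\left[\ln L(\mathrm{FM}) - \ln L(\mathrm{RM})\right]
  \;\overset{\text{approx.}}{\sim}\; \chi^2_{r} \ \text{under } H_0 .
\end{align*}
```

The log-likelihood has no closed-form maximizer, so β̂ is found numerically, for example by iteratively reweighted least squares.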
Question 36 (a) Describe how kernel regression can be seen as a generalization of linear regression. (b) What are some similarities between kernel regression and linear regression? (c) Is kernel regression parametric or non-parametric regression? Discuss!
a) Both kernel regression and generalized linear regression are designed to be used in cases where we cannot validate the assumptions necessary to use simple linear regression. For kernel regression, this is the assumption that the relationship between the regressor and the response can be described linearly, or with a polynomial function. For GLM, this is the assumption of normality and constant variance. In both cases, a function is used to bring the data into a form that we are then able to regress on. For kernel regression, this is a kernel function that is used to make linear models non-linear. For GLM, these are link functions, described in Question 31.
b) One of the more obvious and straightforward similarities between kernel regression and linear regression is that both methods use a data set with inputs and outputs to (in different ways) determine a "function of best fit". In other words, the goal of both is to determine the relationship between our independent and dependent variables. Another similarity is that, depending on the structure of our data, different variants of both linear and kernel regression may be more advantageous. One last similarity is that neither method is a "be-all and end-all" method for determining the relationship between input and output. In other words, both linear and kernel regression rely on adequacy and precision tests to determine how well they actually "model" the data.
c) Kernel regression can be viewed as non-parametric regression. This is because we do not assume some known form of the relationship between the response and the predictor prior to performing the regression. For example, in non-parametric regression we have random variables X and Y and assume the relationship E[Y | X = x] = m(x), where m(x) is some deterministic function. If this were parametric regression, we would assume m(x) to have a specific form with finitely many parameters, for example a polynomial, but for non-parametric regression we make no such underlying assumption.
Question 20 (a) Explain in detail the idea of ridge and Lasso regression. (b) Which of these two approaches behaves only as a shrinkage method and which one can directly perform variable selection? Your answer must be formulated in mathematical terms. Provide a geometric interpretation of the constraints used in ridge and Lasso estimation approaches to confirm your answer. (c) Sketch an example of the graph with traces of ridge and Lasso coefficient estimators as the tuning parameter is varied, and explain the difference in trace shapes.
a) RIDGE
The idea of ridge regression is to allow some bias in the estimator in order to achieve a reduction of its variance. This is done by adding a penalty term to the function that is minimized. The problem then becomes: min_β (y − Xβ)^T (y − Xβ) + λ β^T β, where λ ≥ 0 is the tuning parameter. We can thus see that ridge regression favors a smaller β, as the penalty term increases with the size of β. Ridge regression thus acts as a shrinkage method as it shrinks β̂ towards zero. How much it shrinks depends on λ, which sets the magnitude of the penalty.
LASSO
Lasso regression is very similar to ridge regression (the idea is the same), the only difference being the penalty term. The problem to be solved using Lasso is: min_β (y − Xβ)^T (y − Xβ) + λ Σ|β_i|. This makes for an important difference: the penalty |β_i| is not differentiable at zero and its slope stays at ±λ arbitrarily close to zero, which allows Lasso to shrink a parameter β_i exactly to zero.
b) Ridge regression cannot set a parameter to exactly zero, it can only approach zero, thus only performing shrinkage. Lasso can shrink a parameter β_i exactly to zero and can thus directly perform variable selection. Geometrically, the ridge constraint β^T β ≤ c is a disk (ball) without corners, so the elliptical contours of the residual sum of squares generally touch it at a point where no coordinate is zero. The Lasso constraint Σ|β_i| ≤ c is a diamond with corners on the coordinate axes, so the contours often first touch it at a corner, where some β_i = 0.
c) As we can see in the trace shapes, for ridge the parameters only approach zero as λ grows, whilst for Lasso the parameters become exactly zero one after another. See overleaf
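The difference between shrinkage and selection is easiest to see in the special case where the columns of X are orthonormal (X^T X = I), which is an assumption made only for this sketch. Minimizing the two objectives above coordinate by coordinate then gives closed forms: ridge divides each least squares coefficient by 1 + λ, and Lasso soft-thresholds it at λ/2.

```python
import math

def ridge_orthonormal(b_ls, lam):
    """Ridge solution when X^T X = I: every coefficient is scaled by 1/(1 + lam)."""
    return [b / (1.0 + lam) for b in b_ls]

def lasso_orthonormal(b_ls, lam):
    """Lasso solution when X^T X = I: soft-thresholding at lam/2."""
    return [math.copysign(max(abs(b) - lam / 2.0, 0.0), b) for b in b_ls]

b_ls = [3.0, -1.0, 0.4]   # hypothetical least squares estimates
print(ridge_orthonormal(b_ls, 1.0))   # [1.5, -0.5, 0.2]
print(lasso_orthonormal(b_ls, 1.0))   # [2.5, -0.5, 0.0]
```

Ridge never produces an exact zero for finite λ, while Lasso sets the smallest coefficient to zero as soon as λ/2 exceeds its absolute value, which is the behaviour seen in the trace plots.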
Question 21 (a) Explain the idea of the ridge regression (in relation to multicollinearity) and define the ridge estimator of the vector of regression coefficients for the linear model 𝑦 = 𝑋𝛽 + 𝜀, where the design matrix 𝑋 is in the centered form. (b) Show formally that the ridge estimator is a linear transform of the ordinary LS estimator of the regression coefficients. (c) Show formally that the ridge estimator is a biased estimator of 𝛽. d) Explain why the ridge estimator is also called a shrinkage estimator. (You can assume that the columns of 𝑋 are orthonormal.)
a) Ridge mitigates the effects of multicollinearity by adding a term to the diagonal of X^T X, thereby making it less ill-conditioned and the inverse less unstable. More specifically, it reduces the inflated variance of β̂ when multicollinearity is present by using a new estimator β̂_Ridge = (X^T X + λI)^(−1) X^T y, which is biased but has less variance, unless the ridge parameter λ is set to zero, in which case β̂_Ridge = β̂_LS.
b) We have β̂_Ridge = (X^T X + λI)^(−1) X^T y. From the normal equations of the least squares estimator of β we have X^T X β̂_LS = X^T y, so β̂_Ridge = (X^T X + λI)^(−1) X^T X β̂_LS, i.e. β̂_Ridge is a linear transformation of β̂_LS.
c) We know from b) that β̂_Ridge = (X^T X + λI)^(−1) X^T X β̂_LS, so E[β̂_Ridge] = E[(X^T X + λI)^(−1) X^T X β̂_LS] = (X^T X + λI)^(−1) X^T X E[β̂_LS] = (X^T X + λI)^(−1) X^T X β, which differs from β as long as λ ≠ 0, so the ridge estimator is biased.
d) The ridge estimator is also called a shrinkage estimator because the constrained least squares problem β̂_Ridge = argmin_β (y − Xβ)^T (y − Xβ) s.t. β^T β ≤ c can be written with a Lagrange multiplier when the constraint is active, β^T β = c (otherwise we just get the ordinary least squares solution): β̂_Ridge = argmin_β (y − Xβ)^T (y − Xβ) + λ β^T β, where λ is the ridge parameter. When λ = 0 we get β̂_LS as usual. As λ increases, β^T β (the squared L^2 norm of β) is penalised and the original β̂_LS shrinks to a new estimate β̂_Ridge. Thus λ is also called the shrinkage parameter. If the columns of X are orthonormal, X^T X = I and β̂_Ridge = β̂_LS/(1 + λ), so every coefficient is scaled down by the same factor. As λ → ∞, β̂_Ridge → 0, i.e. β̂_LS is shrunk towards zero.
Question 24 Four different methods to select a subset of the variables are "best subset regression", "stepwise forward", "stepwise backward", and LASSO. (a) Briefly describe the four different approaches. b) Will the above methods always return the same sets of variables? Motivate your answer!
a) Stepwise forward starts with no regressors and adds the best candidate variable at each step (for as long as it is beneficial to add).
Stepwise backward starts with all regressors and removes the worst regressor at each step (until it is no longer beneficial to remove).
Best subset regression starts by considering all possible models with 1 variable, 2 variables, ..., k variables, then chooses the best model of size 1, the best model of size 2, ..., the best of size k. Lastly, from these finalists, it chooses the best overall model.
LASSO shrinks the coefficient estimates towards zero by adding an L1 penalty to the least squares criterion: β̂_Lasso = argmin Σ(y_i − β_0 − x_i^T β)^2 + λ Σ|β_j|. For large enough λ some coefficients become exactly zero, and the corresponding variables are dropped.
b) No. Since the methods have different approaches to creating subsets (including, excluding etc.), they will often give similar, but not necessarily the same, sets of variables. For example the difference in best subset created in HA1 with forward vs backward, where stepwise backward chose one less regressor variable for the final subset.
Question 22 (a) Explain in mathematical terms the idea of principal-component regression (PCR) and how this approach combats the problem of multicollinearity in the linear regression models. (b) Give some motivation for why principal components corresponding to small eigenvalues can be removed from the model. (c) Explain the geometric interpretation of the principal components.
a) The principal-component regression method (PCR) can be described by the following four steps:
1. Complete a principal components analysis of the X matrix and save the principal components in the matrix Z. In other words, Z = XP, where the columns of P are the eigenvectors of X^T X.
2. Fit the regression of y on Z, obtaining least squares estimates of α.
3. Order the components by decreasing eigenvalues λ1 ≥ λ2 ≥ ... ≥ λp. If the last s of these eigenvalues are close to zero, then all s corresponding components should be removed.
4. Transform back to the original coefficients using β = Pα.
When multicollinearity occurs, least squares estimates are unbiased, but their variances are large so they may be far from the true value. By adding a degree of bias to the regression estimates, principal components regression reduces the standard errors.
b) Let α = P^T β, then y = Zα + ε = Xβ + ε. The variance of α̂ is given by Var(α̂) = σ^2 (Z^T Z)^(−1) = σ^2 D^(−1). Since D = diag(λ1, λ2, ..., λp), a small eigenvalue λi of X^T X means that the variance of α̂_i will be large (since it depends on D^(−1)). Removing such components removes the largest contributions to the variance, at the cost of some bias.
c) Each principal component is part of a sequence of unit vectors, where the i:th vector is in the direction of a line that best fits the data while being orthogonal to the first i-1 vectors.
Question 18 Assume that 𝜀 > 0 is small and that you have the three data points 𝒙1 = (1, 1), 𝒙2 = (2, 2), and 𝒙3 = (0, 𝜀). (a) Explain why there is multicollinearity in this data set. (b) Give an example of a response 𝑦 = (𝑦1, 𝑦2, 𝑦3) for which the multicollinearity will cause ||𝛽̂||_2^2 to be large.
a) Using the notation of the lectures, we have X = [1 1 1, 1 2 2, 1 0 ε] and λ = [0 −1 1]^T, which gives Xλ = [0 0 ε]^T ≈ [0 0 0]^T if ε is small enough. Thus, we have near-linear dependence between the columns of X.
b) As a note, given that ε is in the denominator of the expression for ||β̂||_2^2, any value of y (except for values for which the 2nd and 3rd terms cancel) will cause this to be large. For example, the vector y = (1, 2, 3) gives us: ||β̂||_2^2 = 1.3 × 10^11
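The dependence on ε can be made explicit. Since X is square and invertible for ε ≠ 0, β̂ solves Xβ = y exactly: subtracting row 1 from row 2 gives β1 + β2 = y2 − y1, hence β0 = 2y1 − y2, and row 3 gives β2 = (y3 − β0)/ε. The sketch below evaluates this; the value ε = 1e-5 is an assumption, as the notes do not state which ε was used.

```python
def beta_hat(eps, y):
    """Exact solution of X beta = y for X = [[1, 1, 1], [1, 2, 2], [1, 0, eps]]."""
    y1, y2, y3 = y
    b0 = 2 * y1 - y2          # from rows 1 and 2
    b2 = (y3 - b0) / eps      # from row 3: b0 + eps * b2 = y3
    b1 = (y2 - y1) - b2       # from row 2 minus row 1: b1 + b2 = y2 - y1
    return b0, b1, b2

def beta_norm_sq(eps, y):
    """Squared Euclidean norm of the solution."""
    return sum(b ** 2 for b in beta_hat(eps, y))

print(beta_norm_sq(1e-5, (1, 2, 3)))   # about 18 / eps**2
print(beta_norm_sq(1e-5, (1, 2, 0)))   # y3 = 2*y1 - y2, so the large terms vanish
```

For y = (1, 2, 3) the norm is roughly 18/ε^2, which for ε = 1e-5 is of the same order of magnitude as the value quoted above. The norm stays small only when y3 = 2y1 − y2.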
Question 17 Consider the data (x_i, y_i) = (−1, −1), (−1, 1), (1, −1), (1, 1). (a) Find the least squares estimates of β0 and β1 in the model β0 + β1 x + ε. (b) Do the least squares estimates for β0, β1, and β2 in the model β0 + β1 x + β2 x^2 + ε exist? Motivate your answer! (c) Assume that you can add a new point to your data set at a point x5 which you can choose. For which values of x5 do the least squares estimates in (b) exist?
a) We are not neglecting ε. We are simply measuring the squared distance between the observed responses and the fitted line, using f(β0, β1) = Σ ε_i^2 = Σ(y_i − (β0 + β1 x_i))^2. By the symmetry of the data (the points form a square) we can expect both estimates to be 0. To show it, we differentiate f with respect to β0 and β1, set the derivatives to zero and solve the resulting equations. Since Σ y_i = 0 and Σ x_i y_i = 0, we reach the conclusion β̂0 = β̂1 = 0.
b) To show this we can state the model in matrix form. The multiple regression model described in the question has the form y = Xβ + ε with X = [1 −1 1, 1 −1 1, 1 1 1, 1 1 1], y = [−1 1 −1 1]^T, β = [β0 β1 β2]^T and ε = [ε1 ε2 ε3 ε4]^T. The third column of X (x^2) equals the first (the intercept), so X has rank 2, whereas full rank given its dimensions would be 3. Thus X^T X cannot be inverted, and as such unique least squares estimates β̂ do not exist.
c) We want to add a point such that X gets full rank. The new matrix is X plus a fifth row: X = [1 −1 1, 1 −1 1, 1 1 1, 1 1 1, 1 x_5 x_5^2]. In order for a solution to exist, the matrix needs to have full rank. We can show that this is equivalent to x_5^2 ≠ 1, or specifically x_5 ≠ ±1.
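The rank claim in c) can be verified directly by asking when a linear combination of the three columns of the extended X vanishes (a, b, c are coefficients introduced for this argument):

```latex
\begin{align*}
a \cdot 1 + b\,x_i + c\,x_i^2 &= 0 \quad \text{for } i = 1, \dots, 5, \\
x = -1:&\quad a - b + c = 0, \qquad x = 1:\quad a + b + c = 0
  \;\Rightarrow\; b = 0,\ a = -c, \\
x = x_5:&\quad a + c\,x_5^2 = c\,(x_5^2 - 1) = 0 .
\end{align*}
```

If x_5^2 ≠ 1 the last equation forces c = 0, hence a = b = c = 0, so the columns are linearly independent and X^T X is invertible. If x_5 = ±1, any c ≠ 0 with a = −c, b = 0 works and the rank stays 2.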
Question 2 Assume that data comes from a model y = β0 + β1 x1 + β2 x2 + ε, where the errors ε are independent with a distribution N(0, σ^2). Assume that we use least squares to fit the model y = β̂0 + β̂1 x1 to the data. (a) Derive the expression for E[β̂1]. (b) What analysis or tests could we perform to detect that our model is not the correct one? Motivate your answers!
a) With S_xx = Σ(x_1i − x̄_1)^2, the least squares slope is β̂1 = Σ(x_1i − x̄_1) y_i / S_xx. The regressors are fixed, so E[β̂1] = Σ(x_1i − x̄_1) E[y_i] / S_xx, and E[y_i] = β0 + β1 x_1i + β2 x_2i
=> E[β̂1] = Σ(x_1i − x̄_1)(β0 + β1 x_1i + β2 x_2i) / S_xx = β1 + β2 Σ(x_1i − x̄_1) x_2i / S_xx,
since Σ(x_1i − x̄_1) = 0 and Σ(x_1i − x̄_1) x_1i = S_xx. The estimator is thus biased unless β2 = 0 or x1 and x2 are uncorrelated in the sample.
b) We could do a test on an individual regression coefficient in the full model, to test if our reduced model is correct or if we should have kept the full model. We set up the hypotheses H0: β2 = 0, H1: β2 ≠ 0. If we fail to reject H0, this indicates that the regressor x2 could indeed be removed from our model, so there is no evidence that the reduced model is worse than the full one. If H0 is rejected, our reduced model is not the correct one and we should have included x2.
Question 13 (a) What are influential points, high leverage points, and outliers? Draw pictures to illustrate your definitions. (b) How can we find high leverage points in a data set with many regressors?
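a) - Influential points "pull" the regression model in their direction, so that removing the point changes the fitted model noticeably. (Draw a point away from the clear data area.)
- High leverage points have an unusual x value, hence they may have a large effect on the regression model and its accuracy. (Draw a point further "forward" along x from the general data area.)
- Outliers have an unusual y value, i.e. a large residual compared to the rest of the data. (Draw a point far above or below the general data area.)
b) Points for which the hat diagonal h_ii is larger than twice the average, 2p/n, may be considered leverage points. Here h_ii is the i:th diagonal element of the hat matrix H = X(X'X)^(−1)X', and the average of the h_ii is p/n.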
Question 6 (a) State the multiple linear regression model in matrix notations, form normal equations and derive the solution using the least-squares estimation approach. (b) State the model assumptions under which the least-squares estimator of the vector of regression coefficients is obtained. c) Show formally that the LS estimator of the vector of regression coefficients is an unbiased estimator under the model assumption specified in part
a) The model involves four matrices (Y, X, β, ε): Y (n×1) = X (n×p) β (p×1) + ε (n×1), with ε ~ N(0, σ^2 I_n), so Y ~ N(Xβ, σ^2 I), where I is the identity matrix. From the least squares estimation we get the normal equations X'X β̂ = X'y, which multiplied by the inverse of X'X gives β̂ = (X'X)^(−1) X'y (more in drive docs 7a notes).
b) 1) E(ε) = 0, Var(ε) = σ^2 2) Errors are uncorrelated 3) Regressor variables x1, x2, ..., xk are fixed variables, measured without error. 4) (X'X)^(−1) exists, i.e. the regressors are linearly independent.
c) E[β̂] = E[(X'X)^(−1) X'(Xβ + ε)] = E[(X'X)^(−1) X'X β + (X'X)^(−1) X'ε] = β, since E[ε] = 0 and (X'X)^(−1) X'X = I. Thus β̂ is an unbiased estimator of β.
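The step from the least squares criterion to the normal equations in a) can be written out as follows, using the standard derivation with S(β) denoting the sum of squared errors:

```latex
\begin{align*}
S(\beta) &= (y - X\beta)'(y - X\beta) = y'y - 2\beta' X'y + \beta' X'X \beta, \\
\frac{\partial S}{\partial \beta}\bigg|_{\hat\beta} &= -2X'y + 2X'X\hat\beta = 0
  \;\Rightarrow\; X'X\hat\beta = X'y, \\
\hat\beta &= (X'X)^{-1} X'y \quad \text{provided } (X'X)^{-1} \text{ exists.}
\end{align*}
```

Since X'X is positive definite when X has full column rank, the stationary point is the unique minimum.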
Question 10 (a) Describe what we expect to see in the quantile plots if the errors are normally distributed. Motivate your answer. (b) For each of the four data sets, describe the patterns displayed in the residuals (fat tails/thin tails, non-constant variance). (c) For each data set, give an example of a transform that could be applied to make the distribution of the residuals closer to the normal distribution. Motivate your answers.
a) If the errors are normally distributed, we expect the dots in the quantile plots to fall approximately along a straight line. The x axis shows the expected quantiles (if the data are normally distributed) and the y axis shows the sample quantiles. Thus, if the dots fall along a line with slope = 1, the sample quantiles equal the theoretical quantiles and the errors are thus normal.
b) 1. Seems to be normally distributed with constant variance.
2. Light tailed with constant variance.
3. Seems to be fat tailed with constant variance.
4. Slightly heavy tails and non-constant variance, since the residuals increase with x.
c) Plot 1: y' = y (no transformation, σ^2 ∝ constant)
Plot 2: y' = √y (σ^2 ∝ E(Y|X))
Plot 3:
Plot 4: y' = √y (assuming the residual plot is of "funnel type")
Question 16 Assume that you have a data set with k regressors and n data points, where k > n. Let the full model be the model y = β0 + β1 x1 + ... + βk xk + ε, which includes all of the regressors. (a) Explain why the ordinary least squares estimates cannot be used to obtain estimates of the regression parameters for the full model. (b) Describe a method which can be used to obtain estimates for the regression parameters in the full model. Explain why it works.
a) If k > n, where k is the number of regressors and n the number of data points, then the columns of the X matrix cannot be linearly independent, since X has more columns than rows and its rank is at most n. This means that X^T X is singular, i.e. at least one of its eigenvalues is exactly zero. Thus the normal equations (X^T X) β̂ = X^T y will have infinitely many solutions.
b) In order to solve the problem from a), where X^T X is not invertible, ridge regression can be used to add a small bias. Some number t > 0 times the identity matrix I is added to X^T X, which makes all eigenvalues at least t and thus makes the matrix invertible: β̂_R = (X^T X + tI)^(−1) X^T y. The idea is that this t should not affect the solution too much. Any t > 0 makes the matrix invertible, so t is chosen as a trade-off: small enough for the bias not to get too large, but large enough for the inverse to be numerically stable and the variance reasonable.
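A toy example with more regressors than data points shows both the failure of least squares and the ridge fix. For simplicity the sketch uses one observation, two regressors and no intercept, all of which are assumptions made for the illustration; the solver handles two-column designs only.

```python
def ridge_two_columns(X, y, t):
    """Solve (X^T X + t I) beta = X^T y for a design matrix with two columns."""
    a = sum(row[0] * row[0] for row in X) + t
    b = sum(row[0] * row[1] for row in X)
    d = sum(row[1] * row[1] for row in X) + t
    g1 = sum(row[0] * yi for row, yi in zip(X, y))
    g2 = sum(row[1] * yi for row, yi in zip(X, y))
    det = a * d - b * b
    if det == 0:
        raise ZeroDivisionError("X^T X + tI is singular")
    return (d * g1 - b * g2) / det, (a * g2 - b * g1) / det

X = [[1.0, 2.0]]   # n = 1 observation, k = 2 regressors
y = [3.0]

try:
    ridge_two_columns(X, y, t=0.0)      # ordinary least squares
except ZeroDivisionError as err:
    print("OLS fails:", err)

b1, b2 = ridge_two_columns(X, y, t=0.01)
print(b1, b2)   # close to (0.6, 1.2)
```

With t = 0 the matrix X^T X = [[1, 2], [2, 4]] has determinant zero. For t > 0 the solution here is (3/(t + 5), 6/(t + 5)), which for small t is close to (0.6, 1.2), one of the infinitely many exact solutions of the normal equations.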
Question 29 Two main sampling procedures for bootstrapping regression estimates are usually referred to as bootstrapping residuals and bootstrapping cases. (a) Give in detail the steps of both procedures and specify the difference between these two approaches. (b) Describe what assumptions (if any) must hold for the bootstrap results to be meaningful. (c) Explain how to find a bootstrap estimate of the standard deviation of the estimate of the mean response at a particular point 𝑥0. Explain how to obtain approximate confidence intervals for regression coefficients through bootstrapping.
a) In both procedures, bootstrap samples of size n are drawn with replacement and a regression model is fitted to each of them. When bootstrapping residuals, the sample consists of residuals from the original fit. The sampled residuals are attached to the predicted values to obtain a bootstrapped response variable, and a regression model is fitted to it using the original regressors. When bootstrapping cases, the sample consists of pairs of the regressors and the response variable, and the model is fitted directly to the resampled pairs. Bootstrapping cases may be used when there is doubt about the model's adequacy, when the variance of the errors is non-constant or when the regressors are stochastic.
b) Bootstrap methods assume that each observation is randomly selected from its population. In other words, any observation is equally likely to be selected, and its selection is independent of the others.
c) (Brief)
1. Draw a random observation (with replacement) and save it.
2. Repeat step 1 n times, where n is the number of observations.
3. Refit the model to the resampled data and record the desired statistic, for example the estimated mean response at x0 or a regression coefficient.
4. Repeat steps 1 through 3 many thousands of times.
5. Calculate the standard deviation of your thousands of values of the statistic. This is the bootstrap estimate of its standard deviation.
6. Obtain the 2.5th and 97.5th centiles of the thousands of values of the statistic, i.e. remove the bottom and top 2.5 percent of the values. These give an approximate 95% confidence interval.
Question 31 (a) Explain the purpose of the link function in general linear models. Use the link function used in logistic regression as an example. b) Explain why we, in general, cannot use least squares estimates for 𝛽 in logistic regression.
a) Instead of fitting the model to y_i, one might try to fit the model to E(y_i) = p_i. However, this is still not what we want from a response, since p_i does not cover the whole real line. To avoid this, we can apply a so-called link function η: (0, 1) → R. For instance, the logit function η(p_i) = ln(p_i/(1 − p_i)) works well. Using the logit transformation, we get the model η_i = x_i^T β, which thus models the logarithm of the odds as a linear function.
b) The method of least squares in logistic regression would imply solving a minimization problem of the form c(β) = Σ(η_i − x_i^T β)^2. The problem is that η_i is not an observable quantity (as opposed to y_i in linear regression): p_i is unknown and the logit of an observed y_i ∈ {0, 1} is ±∞. If one instead minimizes Σ(y_i − p_i(β))^2, the criterion is a non-convex function of the parameters β, so there is no guarantee of global convergence.
Question 15 (a) What does it mean to have multicollinearity in a model? (b) Describe in detail (with formulas) at least two effects of multicollinearity on the precision and accuracy of regression analysis.
a) Multicollinearity in a model is when two or more of the predictor variables are highly correlated with each other, that is, when there are near-linear dependencies among the columns of 𝑿. This can cause problems with the interpretation of the model because the coefficient estimates for the predictor variables become unstable: they remain unbiased, but their variances are inflated and they are sensitive to small changes in the data. b) Var(𝛽̂𝑗) = 𝜎^2*𝐶𝑗𝑗 and Cov(𝛽̂𝑖, 𝛽̂𝑗) = 𝜎^2*𝐶𝑖𝑗, where 𝐶 = (𝑿′𝑿)^−1. Since the elements of 𝐶 grow as the near-linear dependencies among the regressors get stronger, the variances and the absolute values of the covariances increase with multicollinearity. This results in a drastic decrease in the precision of the estimates. Further, the expected squared distance between 𝜷̂ and 𝜷 is 𝐸(𝐿_1^2) = ∑Var(𝛽̂𝑗) = 𝜎^2*Tr((𝑿′𝑿)^−1), so it also increases as a result of multicollinearity, meaning that 𝜷̂ tends to lie far from the true 𝜷.
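The variance inflation in b) can be illustrated numerically. The sketch below uses the variance inflation factors (the diagonal of the inverse correlation matrix of the regressors), a standard diagnostic that the answer above does not name explicitly; the simulated regressors and the helper `vif` are assumptions made for illustration.

```python
import numpy as np

def vif(X):
    """Variance inflation factors: diagonal of the inverse correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(R))

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)               # unrelated to x1
x2_coll = x1 + 0.05 * rng.normal(size=n)    # nearly a copy of x1

vif_indep = vif(np.column_stack([x1, x2_indep]))
vif_coll = vif(np.column_stack([x1, x2_coll]))
print("independent regressors:", vif_indep)
print("collinear regressors:  ", vif_coll)
```

With nearly independent regressors the factors stay close to 1, while the nearly collinear pair gives factors in the hundreds, which is the same growth of the diagonal of (𝑿′𝑿)^−1 described above, seen in correlation form.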
Question 28 (a) Explain the conceptual idea of parametric (bootstrapping residuals) and non-parametric bootstrap (bootstrapping cases) and its applications in regression analysis. (b) Give an example of a data set where the regressors are non-random, but when you cannot apply parametric bootstrap. (c) Suppose you have a sample (𝑥1, 𝑦1), (𝑥2, 𝑦2), ..., (𝑥𝑛, 𝑦𝑛), a model 𝑦 = 𝑋𝛽 + 𝜖 and an estimator 𝜃̂ = 𝜃̂(𝑥1, 𝑥2, ..., 𝑥𝑛). Derive the bootstrap based estimator of the standard error of 𝜃.
a) Parametric bootstrap, or bootstrapping residuals, means drawing a sample of size n, with replacement, from the residuals of a fitted model and arranging them in a vector. These are then attached to the predicted y values to obtain the bootstrapped y vector. The model is refitted to this bootstrapped response using the original regressors to obtain a bootstrap estimate of the betas. Non-parametric bootstrap, or bootstrapping cases (bootstrapping pairs), instead resamples pairs of the response variable and the regressors and refits the model to obtain the estimate of the betas. It is used when the variance of the errors is non-constant or when the regressors are stochastic. In both cases the sampling is typically repeated 200-1000 times, or until the standard deviation of the bootstrap estimates stabilizes. Both may be applied to linear, nonlinear or generalized linear models and are useful for constructing confidence intervals in situations where no standard procedure is available. b) When the distribution of the errors is unknown and the error variance is not constant. c) Assuming the bootstrap has been performed with n replications, giving the estimates 𝜃̂_1, ..., 𝜃̂_n, their sample mean is calculated as ∑𝜃̂_i∕𝑛. The bootstrap-based estimate of the standard error of 𝜃̂ is then the sample standard deviation of the n bootstrap estimates about this mean.
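Written out, the estimator in c) takes the following form. The symbol B for the number of bootstrap replications (called n in the answer above) and the starred notation for the bootstrap estimates are notational assumptions introduced here to avoid a clash with the sample size.

```latex
\begin{align*}
\bar{\theta}^{*} &= \frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{*}_{b}, \\
\widehat{\mathrm{se}}_{\mathrm{boot}}(\hat{\theta})
  &= \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\left(\hat{\theta}^{*}_{b}-\bar{\theta}^{*}\right)^{2}},
\end{align*}
```

where each replicate is the estimator evaluated on the b-th bootstrap sample.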
Question 7 (a) For the model 𝒚 = 𝑿𝜷 + 𝜺, obtain the least squares estimator 𝜷̂ of 𝜷. (b) Make the proper normality assumptions and derive the distribution of 𝜷̂ under these assumptions. (c) For the model specified in (a) and proper normality assumptions on 𝜺, obtain the distribution of 𝒆 = 𝒚 − 𝒚̂. State the test of significance of a single slope parameter 𝛽𝑗 and derive the test statistics (t-test or F-test) in the multiple regression setting. (d) Describe the situations in regression analysis where the assumption of normal distribution is crucial and where it is not (coefficients and mean response estimates, tests, confidence intervals, prediction intervals). Clear motivation must be presented.
a) 𝑆(𝜷) = ∑𝜀𝑖^2 = 𝜺′𝜺 = (𝒚−𝑿𝜷)′(𝒚−𝑿𝜷), which can be expressed as 𝑆(𝜷) = 𝒚′𝒚 − 𝜷′𝑿′𝒚 − 𝒚′𝑿𝜷 + 𝜷′𝑿′𝑿𝜷 = 𝒚′𝒚 − 2𝜷′𝑿′𝒚 + 𝜷′𝑿′𝑿𝜷, since 𝜷′𝑿′𝒚 is a 1 × 1 matrix, or a scalar, and its transpose (𝜷′𝑿′𝒚)′ = 𝒚′𝑿𝜷 is the same scalar. The least-squares estimator must satisfy 𝜕𝑆/𝜕𝜷|𝜷̂ = −2𝑿′𝒚 + 2𝑿′𝑿𝜷̂ = 𝟎, which simplifies to 𝑿′𝑿𝜷̂ = 𝑿′𝒚. These equations are called the least-squares normal equations. If the inverse (𝑿′𝑿)^−1 exists, the least-squares estimator is given by 𝜷̂ = (𝑿′𝑿)^−1𝑿′𝒚. b) We assume that the errors 𝜀𝑖 are normally and independently distributed with mean zero and variance 𝜎^2. Therefore the observation vector 𝒚 is normally distributed with mean 𝑿𝜷 and covariance matrix 𝜎^2*𝑰. Since the least-squares estimator 𝜷̂ is a linear combination of the observations, it follows that 𝜷̂ is normally distributed. Its mean vector is 𝐸[𝜷̂] = 𝐸[(𝑿′𝑿)^−1𝑿′𝒚] = (𝑿′𝑿)^−1𝑿′𝐸[𝒚] = (𝑿′𝑿)^−1𝑿′𝐸[𝑿𝜷 + 𝜺] = (𝑿′𝑿)^−1𝑿′𝑿𝜷 + (𝑿′𝑿)^−1𝑿′𝐸[𝜺] = 𝜷, and its covariance matrix is 𝑉𝑎𝑟(𝜷̂) = 𝑉𝑎𝑟((𝑿′𝑿)^−1𝑿′𝒚) = (𝑿′𝑿)^−1𝑿′(𝜎^2*𝑰)𝑿(𝑿′𝑿)^−1 = 𝜎^2(𝑿′𝑿)^−1. Thus 𝜷̂ has the distribution 𝜷̂ ∼ 𝑁(𝜷, 𝜎^2(𝑿′𝑿)^−1). c) d) The coefficient estimates and the mean response estimates do not depend on the assumption of a normal distribution. The least-squares estimators of the coefficients are determined solely by minimizing the residual sum of squares, without any further assumptions or hypotheses. Being a function of 𝜷̂, the mean response estimate inherits this property. The tests and intervals, on the other hand, rely heavily on the normality assumption. All the test statistics used for confidence intervals, significance tests and prediction intervals are functions of quantities assumed to have a normal distribution (as a consequence of the normality assumption on the errors).
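The normal equations in a) and the covariance formula in b) can be checked numerically. The following is a minimal sketch on simulated data; the design, the true coefficients and the noise level are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 2
# Design matrix with an intercept column and k simulated regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = np.array([1.0, 2.0, -0.5])      # hypothetical true coefficients
sigma = 0.3
y = X @ beta_true + sigma * rng.normal(size=n)

# Solve the normal equations X'X beta_hat = X'y
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Estimated covariance matrix MS_Res * (X'X)^-1
p = X.shape[1]
residuals = y - X @ beta_hat
ms_res = residuals @ residuals / (n - p)
cov_beta_hat = ms_res * np.linalg.inv(XtX)

# Cross-check against NumPy's least-squares solver
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat, np.sqrt(np.diag(cov_beta_hat)))
```

Solving the linear system with `np.linalg.solve` rather than forming the inverse explicitly is the usual numerical choice; the inverse is only computed here because the covariance matrix needs it.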
Question 3 (a) Explain why 𝑆𝑆𝑟𝑒𝑠 decreases when we add regressors to a model. (b) Give two reasons why adding more regressors does not necessarily result in a better model.
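a) 𝑆𝑆𝑟𝑒𝑠 = ∑[𝑦𝑖 − 𝛽̂0 − 𝛽̂1*𝑥𝑖1 − ... − 𝛽̂𝑘*𝑥𝑖𝑘]^2, where the estimates are chosen to minimize this sum. When a regressor is added, the old fit is still attainable by setting the new coefficient to zero, so the minimum over the larger set of coefficients can only decrease or stay the same. Hence 𝑆𝑆𝑟𝑒𝑠 is non-increasing in the number of regressors. b) 1. Adding a regressor with no correlation to the response variable does not improve the model: 𝑆𝑆𝑟𝑒𝑠 barely changes while a degree of freedom is lost. 2. If the new predictor is contained in the linear span of the predictors already in the model, it carries no new information and makes 𝑿′𝑿 singular.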
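Both parts can be seen in a small simulation. The data, the unrelated regressor and the redundant regressor below are invented for illustration, and `ss_res` is a hypothetical helper name.

```python
import numpy as np

def ss_res(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta_hat
    return float(r @ r)

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
noise_regressor = rng.normal(size=n)        # unrelated to the response
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, noise_regressor])
# New column lies in the span of the intercept and x1
X_redundant = np.column_stack([X_small, 3.0 * x1 + 2.0])

ss_small = ss_res(X_small, y)
ss_big = ss_res(X_big, y)
ss_redundant = ss_res(X_redundant, y)
print(ss_small, ss_big, ss_redundant)
print("MS_Res:", ss_small / (n - 2), ss_big / (n - 3))
```

The residual sum of squares never goes up, but it drops only slightly for the unrelated regressor and not at all for the redundant one, while the residual mean square divides by one fewer degree of freedom.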
Question 33 (a) Describe the differences in the frequentist and Bayesian approach to regression. (b) Describe how ridge regression and LASSO appear in Bayesian regression. (c) Using a Bayesian perspective, how can the regularization parameters in ridge regression and LASSO regression be interpreted.
a) The difference is that, in the Bayesian approach, the parameters that we are trying to estimate are treated as random variables. In the frequentist approach, they are fixed. Random variables are governed by their parameters (mean, variance, etc.) and distributions (Gaussian, Poisson, binomial, etc.). b) A Bayesian viewpoint for regression assumes that the coefficient vector 𝛽 has some prior distribution, say 𝑝(𝛽), where 𝛽 = (𝛽0, ..., 𝛽𝑝)′. The likelihood of the data can be written as 𝑓(𝑌|𝑋, 𝛽), where 𝑋 = (𝑋1, ..., 𝑋𝑝). Following Bayes' theorem, multiplying the prior distribution by the likelihood gives the posterior distribution up to a constant, which takes the form 𝑝(𝛽|𝑋, 𝑌) ∝ 𝑓(𝑌|𝑋, 𝛽) ∗ 𝑝(𝛽). We assume the usual linear model, 𝑌 = 𝛽0 + 𝑋1𝛽1 + ... + 𝑋𝑝𝛽𝑝 + 𝜖, and suppose that the errors are independent and drawn from a normal distribution. Furthermore, assume that 𝑝(𝛽) = ∏_(𝑗=1)^𝑝 𝑔(𝛽𝑗), for some density function g. It turns out that ridge regression and the lasso follow naturally from two special cases of g: RIDGE: If g is a normal distribution 𝑁(0, 𝑓(𝜆)), then the posterior mode is given by the ridge regression solution. LASSO: If g is a Laplace distribution 𝐿𝑎𝑝𝑙𝑎𝑐𝑒(0, 𝑓(𝜆)), then the posterior mode is the lasso solution. c) From a Bayesian point of view, we talk about lasso and ridge regression as regularization, saying that the least squares estimator does not perform as well when we have many covariates and not so much data. Therefore, we want to shrink our coefficient estimates towards zero to get a lower mean square error. The regularization parameter 𝜆 reflects the spread of the prior: a larger 𝜆 corresponds to a prior that is more concentrated around zero. If the value of lambda is unknown to us, we need to choose a value for this parameter in some way. We do this by using a prior for this value: we treat it as another unknown and specify its distribution and parameters. Then, from the joint posterior, we can integrate to find the marginal posterior of lambda.
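The link between the prior and the penalty can be made explicit by writing out the negative log posterior. The derivation below is a sketch under added assumptions: errors with known variance σ², a flat prior on the intercept, and the symbols τ² (prior variance) and b (Laplace scale), which do not appear in the answer above.

```latex
\begin{align*}
\text{Normal prior } \beta_j \sim N(0,\tau^2):\quad
-\log p(\beta \mid X, Y)
  &= \frac{1}{2\sigma^2}\lVert Y - X\beta\rVert^2
   + \frac{1}{2\tau^2}\sum_{j=1}^{p}\beta_j^2 + \text{const}, \\
\hat{\beta}_{\text{mode}}
  &= \arg\min_{\beta}\ \lVert Y - X\beta\rVert^2
   + \lambda\sum_{j=1}^{p}\beta_j^2,
   \qquad \lambda = \frac{\sigma^2}{\tau^2}, \\[6pt]
\text{Laplace prior } \beta_j \sim \mathrm{Laplace}(0,b):\quad
-\log p(\beta \mid X, Y)
  &= \frac{1}{2\sigma^2}\lVert Y - X\beta\rVert^2
   + \frac{1}{b}\sum_{j=1}^{p}\lvert\beta_j\rvert + \text{const}, \\
\hat{\beta}_{\text{mode}}
  &= \arg\min_{\beta}\ \lVert Y - X\beta\rVert^2
   + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert,
   \qquad \lambda = \frac{2\sigma^2}{b}.
\end{align*}
```

In both cases the penalty weight is the ratio of the noise variance to the spread of the prior, so a tight prior (small τ² or b) means heavy shrinkage.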
Question 5 (a) Derive the least-squares estimate of 𝛽1 in the no-intercept model 𝑦 = 𝛽1𝑥 + 𝜀 from the least square criterion that is to minimize 𝑆(𝛽1) = ∑(𝑦𝑖 − 𝛽1𝑥𝑖)^2 (b) Give examples of when such model can be appropriate/inappropriate
a) Setting the derivative of 𝑆 with respect to 𝛽1 to zero at 𝛽̂1: 𝜕𝑆/𝜕𝛽1|𝛽̂1 = −2∑(𝑦𝑖 − 𝛽̂1*𝑥𝑖)𝑥𝑖 = 0 ⇒ 𝛽̂1∑𝑥_𝑖^2 = ∑𝑥𝑖𝑦𝑖 ⇒ 𝛽̂1 = ∑𝑥𝑖𝑦𝑖/∑𝑥_𝑖^2. b) No-intercept model: vaccination rate against GDP per capita, because a GDP of 0 should not exist. Model with intercept: rate of reaction versus time for some chemical reaction.
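The closed form in a) is easy to verify numerically. The four data points below are made-up toy numbers used only to check the formula against a general least-squares solver.

```python
import numpy as np

# Toy data (hypothetical, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Closed-form no-intercept estimate: sum(x*y) / sum(x^2)
beta1_hat = np.sum(x * y) / np.sum(x ** 2)

# Same fit via least squares on a design matrix without an intercept column
beta1_lstsq = np.linalg.lstsq(x[:, None], y, rcond=None)[0][0]
print(beta1_hat, beta1_lstsq)
```

Note the denominator is the raw sum of squares ∑𝑥_𝑖^2, not the centred 𝑆𝑥𝑥 used in the model with an intercept.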
Question 8 (a) What is the difference between interpolation and extrapolation? (b) Why does one often want to avoid extrapolation? (c) Explain the problem of hidden extrapolation in multiple linear regression. Motivate your explanations by drawing pictures. Explain how to detect this problem by using the properties of the hat matrix H.
a) Interpolation: predicting at values of the independent variable inside the range of the data. Extrapolation: predicting at values of the independent variable outside the range of the data. b) Extrapolation assumes that the observed trend is the same outside the data range as within it, which might not be the case. c) A combination of regressor values may lie outside the joint region of the data even though each regressor is within its own observed range. See picture in drive docs 8c. Detect this by looking at the diagonal of the hat matrix 𝑯 = 𝑿(𝑿′𝑿)^−1𝑿′. Let ℎ_max be the largest diagonal element ℎ_𝑖𝑖 and, for a new point 𝒙0, let ℎ_00 = 𝒙0′(𝑿′𝑿)^−1𝒙0. Points with ℎ_00 > ℎ_max are extrapolation points, lying outside the ellipsoid displayed in the picture.
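The hat-matrix check in c) can be sketched as follows. The two correlated regressors are simulated, and the points `inside` and `hidden` are hypothetical prediction points chosen to illustrate the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(0, 10, size=n)
x2 = x1 + rng.normal(scale=0.5, size=n)     # strongly correlated with x1
X = np.column_stack([np.ones(n), x1, x2])

XtX_inv = np.linalg.inv(X.T @ X)
# Diagonal of the hat matrix: h_ii = x_i' (X'X)^-1 x_i
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
h_max = h.max()

def h00(x0):
    """Leverage-type quantity x0' (X'X)^-1 x0 for a new point x0."""
    x0 = np.asarray(x0, dtype=float)
    return float(x0 @ XtX_inv @ x0)

inside = np.array([1.0, 5.0, 5.0])   # intercept, x1, x2: near the data cloud
hidden = np.array([1.0, 2.0, 8.0])   # each coordinate in range, combination is not
print(h_max, h00(inside), h00(hidden))
```

The point `hidden` has both coordinates inside their individual ranges, yet its value of ℎ_00 exceeds ℎ_max, which is exactly the hidden extrapolation that a check of each regressor separately would miss.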
Question 4 Assume that an experiment is performed, and that the data follows the model 𝑦 = 𝛽0 + 𝛽1*𝑥 + 𝜀. Does the variance of the least squares estimators 𝛽̂0 and 𝛽̂1 always get better when more points are added to the dataset? Motivate!
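𝑉𝑎𝑟(𝛽̂0) = 𝜎^2(1/𝑛 + 𝑥̄^2/𝑆𝑥𝑥) and 𝑉𝑎𝑟(𝛽̂1) = 𝜎^2/𝑆𝑥𝑥. The term 1/𝑛 naturally decreases as we increase 𝑛, and 𝑆𝑥𝑥 = ∑(𝑥𝑖 − 𝑥̄)^2 increases (or at least does not decrease) when we increase the number of data points. Therefore the variances do not increase as more points are added, and a decreased variance is a better variance. The improvement is not always strict, however: a new point placed exactly at 𝑥̄ leaves 𝑆𝑥𝑥 unchanged, so 𝑉𝑎𝑟(𝛽̂1) stays the same.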
Question 1 Assume that you are designing an experiment with the goal of finding as accurate values as possible for the parameters in the model 𝛽0 + 𝛽1𝑥 + 𝜀 where you know that 𝜀 ∼ 𝑁(0, 𝜎2). Assume further that you can only collect at most ten data points. If you want to make the confidence intervals for the parameters as small as possible, at which x should you collect data? Motivate your answer.
𝛽1 = 𝛽̂1 ± 𝑡_(𝛼∕2,𝑛−2)*sqrt(MSres/Sxx) and 𝛽0 = 𝛽̂0 ± 𝑡_(𝛼∕2,𝑛−2)*sqrt(MSres(1/n+x_bar^2/Sxx)). Looking at the confidence intervals, the only term affected by the choice of x is the expression under the square root. We want to minimize this term to shrink the confidence intervals; the term 1∕𝑛 need not be considered since it is constant when all ten points are used. For 𝛽1 we therefore want to maximize 𝑆𝑥𝑥 = ∑(𝑥𝑖 − 𝑥̄)^2. The data points should be selected so that half of the points are placed at a fixed distance, as large as possible, to the left of 𝑥̄ and the other half at the same distance to the right. For 𝛽0 we also want to maximize 𝑆𝑥𝑥, so the same idea applies, but since we now also want to minimize 𝑥̄^2, we choose to sample the data at equal distances from 𝑥 = 0. So, for example, if we sample five points at 𝑥 = −5 and five at 𝑥 = 5, we get 𝑥̄^2 = 0, and 𝑆𝑥𝑥 is larger than if one of the samples were taken at 𝑥 = 4. An example of this is visualized in the figure below. Note that to minimize both intervals at the same time we want the left design, centred at the origin. If it is only necessary to show how to minimize one confidence interval at a time, one can choose the points according to the figure (left for 𝛽1 and right for 𝛽0).
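The design argument can be checked by comparing the factors under the square roots for a few candidate designs. The allowed range [−5, 5] follows the example in the answer; the alternative designs and the helper `ci_factors` are illustrative assumptions.

```python
import numpy as np

def ci_factors(x):
    """Return (Sxx, 1/Sxx, 1/n + xbar^2/Sxx): the design-dependent CI factors."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xbar = x.mean()
    sxx = np.sum((x - xbar) ** 2)
    return sxx, 1.0 / sxx, 1.0 / n + xbar ** 2 / sxx

endpoints = [-5.0] * 5 + [5.0] * 5              # five points at each end
one_moved = [-5.0] * 5 + [5.0] * 4 + [4.0]      # one point moved to x = 4
evenly_spaced = np.linspace(-5.0, 5.0, 10)      # ten equally spaced points

sxx_end, slope_end, icpt_end = ci_factors(endpoints)
sxx_moved, slope_moved, icpt_moved = ci_factors(one_moved)
sxx_even, slope_even, icpt_even = ci_factors(evenly_spaced)

for name, vals in [("endpoints", (sxx_end, slope_end, icpt_end)),
                   ("one moved", (sxx_moved, slope_moved, icpt_moved)),
                   ("evenly spaced", (sxx_even, slope_even, icpt_even))]:
    print(f"{name:14s} Sxx={vals[0]:8.2f} slope={vals[1]:.5f} intercept={vals[2]:.5f}")
```

Splitting the points between the two ends of the range gives the largest 𝑆𝑥𝑥 and, because the design is symmetric about zero, also the smallest factor for the intercept.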
