SRM

Advantages of using alternative fitting procedures instead of least squares

- result in a simpler model - improve prediction accuracy - likely easier to interpret

Deviance of Generalized Linear Models

- scaled deviance can be used to assess the quality of fit for nested models - a small deviance means that the fitted model is close to the saturated model, which indicates a good fit - a saturated model has a deviance of zero

For a K-nearest neighbors classifier, as K increases

- squared bias increases - variance decreases - flexibility decreases

Which is most appropriate to model whether a person is hospitalized or not

- binomial distribution - logit link function (restricts values to the range 0 to 1), as in binary classification

Time series models

- a larger moving average length results in a smoother series
- differencing the time series will likely result in a time series that is stationary in the mean
- a logarithmic transformation will likely result in a time series that is stationary in the variance
- differencing the logarithmic transformation will likely result in a time series that is stationary in the mean and variance
- the larger the k value, the smoother the time series

Ridge Regression

- a shrinkage method
- as the tuning parameter λ → ∞, the coefficients tend to zero
- the ridge regression coefficients β̂^R_1, β̂^R_2, ..., β̂^R_p are those that minimize ∑_{i=1}^n (y_i − β_0 − ∑_{j=1}^p β_j x_{ij})² + λ ∑_{j=1}^p β_j²
- ridge regression coefficients are not scale equivariant
- the shrinkage penalty is applied to all the coefficient estimates except the intercept
- shrinking the coefficient estimates has the benefit of reducing the variance
- none of the coefficient estimates can be exactly 0
- the β parameters produced by ridge regression will not in general be unbiased
- involves all of the predictors
- cross-validation can be used to select the best tuning parameter
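
A minimal sketch of the shrinkage behavior, assuming made-up data and scikit-learn (whose alpha plays the role of the tuning parameter λ; all variable names here are illustrative):

```python
# Ridge coefficients shrink toward (but never exactly reach) zero as lambda grows
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(size=100)

X_std = StandardScaler().fit_transform(X)  # ridge is not scale equivariant, so standardize first
for lam in [0.01, 1.0, 100.0, 10000.0]:
    coefs = Ridge(alpha=lam).fit(X_std, y).coef_
    print(lam, np.round(coefs, 3))  # larger lambda -> smaller coefficients, none exactly 0
```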

Poisson Regression

- a square root link outputs only non-negative numbers; since it is preferable for a link function to output any real number, a square root link is not as appropriate as a log link
- if a model is adequate, the deviance is a realization from a chi-square distribution
- a large Pearson chi-square statistic indicates that overdispersion is likely more severe
- using its canonical link, to estimate the variance of the response, the only parameters that need estimating are the regression coefficients
- the canonical link for a Poisson distribution is indeed the log link
- an offset does not address overdispersion
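
A minimal sketch of a Poisson GLM with the canonical log link and an exposure offset, assuming a tiny made-up data set and the statsmodels package:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "claims": [2, 0, 1, 3, 0, 4],
    "age": [25, 40, 33, 52, 29, 61],
    "exposure": [1.0, 0.5, 2.0, 1.5, 1.0, 2.5],
})
X = sm.add_constant(df[["age"]])
model = sm.GLM(df["claims"], X,
               family=sm.families.Poisson(),     # canonical log link by default
               offset=np.log(df["exposure"]))    # offset adjusts for exposure; it does not fix overdispersion
print(model.fit().summary())
```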

advantages of regression trees

- are easier to interpret - better mirror human decision making - can be presented graphically - manage qualitative predictors without the need for dummy variables - do not always outperform linear regression

model selection

- best subset selection does not necessarily result in a nested set of best models; only forward stepwise selection and backward stepwise selection are guaranteed to do so - residual sum of squares is not a suitable metric because it decreases monotonically as the number of predictors increases - forward stepwise selection can still be used in high-dimensional settings where n < p + 1, while backward stepwise selection cannot be used (it requires fitting the full model)

control charts

- control charts are used to detect nonstationarity in a time series - control charts have superimposed lines called control limits - the R chart helps examine the stability of the variability of a time series - the x-bar chart examines the stability of the mean of a time series

in the case of two classes

- cross entropy >= Gini index >= classification error rate
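
A quick worked check (here p denotes the proportion of class 1 in a node; this notation is assumed, not part of the original card):

classification error = 1 − max(p, 1 − p)
Gini index = 2p(1 − p)
cross entropy = −p·ln(p) − (1 − p)·ln(1 − p)

At p = 0.5: error = 0.5, Gini = 0.5, cross entropy = ln 2 ≈ 0.693, consistent with cross entropy ≥ Gini ≥ classification error.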

decision trees

- decision trees typically have poorer predictive accuracy compared to other statistical methods
- dummy variables are not necessary since every possible grouping of the classes into two groups is considered when splitting
- pruning helps to reduce variance and leads to a smoother fit
- without tree pruning, the recursive binary splitting method leads to a large tree, hence likely to overfit the data
- a tree with more splits tends to have a higher variance
- when using the cost complexity pruning method, α = 0 results in a very large tree
- easy to interpret and explain
- can be presented visually
- can manage categorical variables without the need for dummy variables
- can mimic human decision making better than other statistical models
- the number of leaves does not necessarily equal the number of branches
- the number of branches is always greater than the number of internal nodes
- a stump is a decision tree with one internal node
- in bagging, every tree is constructed independently of every other tree
- relative to bagging, a random forest attempts to make the constructed trees less similar

bagging

- does not operate on a choice of flexibility so it does not require cross-validation -reduces variance -special form of random forest where we consider all explanatory variables at each split

x bar chart

- examines the stability of the mean of a time series

VIF

- if an explanatory variable is uncorrelated with all other explanatory variables, the corresponding variance inflation factor would be 1

simple linear relationship, y = β0 + β1x + ε

- if ε=0, the 95% confidence interval is equal to the 95% prediction interval -the prediction interval accounts for the irreducible error -the prediction interval is always at least as wide as the confidence interval - The confidence interval quantifies the possible range for E(y∣x) -The prediction interval quantifies the possible range for y∣x

10-fold cross-validation

- involves randomly dividing the data set into 10 sets - each set functions as the test dataset while the remaining 9 sets are used as the training dataset - this means that every observation will be included in the training dataset 9 times

Tolerance

- is the reciprocal of the VIF
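
As a formula sketch (R²_j here denotes the R² from regressing predictor j on the other predictors; this notation is assumed, not from the original card):

VIF_j = 1 / (1 − R²_j)
Tolerance_j = 1 / VIF_j = 1 − R²_j

If predictor j is uncorrelated with all other predictors, R²_j = 0, so VIF_j = 1 and the tolerance is 1.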

severe multicollinearity

- the more severe the multicollinearity, the less reliable the regression coefficient estimates - this can mask which predictors are in fact significant - with perfect multicollinearity, the ordinary least squares estimates are no longer unique - the MSE is still reliable

Log link

- maps the range (0,∞) to (−∞,∞)

Which of the following is true regarding models that use numerous decision trees to obtain a prediction function f?

- not every decision tree is usually pruned
- out-of-bag error can help determine an appropriate number of trees to construct, but that applies to bagging and random forests, where the number of trees is not a flexibility measure
- with bagging and random forests, improvement in accuracy comes from lowering f's variance, whereas with boosting, it comes from slow learning and finding an appropriate number of trees
- they address the relatively poor predictive accuracy of a single decision tree
- f is usually harder to interpret for these models since it consolidates the results from multiple trees
- these models are equally suited to handle regression or classification

Principal component analysis

- provides low-dimensional linear surfaces that are closest to the observations - the first principal component is the line in p-dimensional space that is closest to the observations - finds a low-dimensional representation of a dataset that contains as much variation as possible - serves as a tool for data visualization

Principal Component Regression

- recommended that the modeler standardize each predictor prior to generating the principal components - all variables are used in generating the principal components - assumes that the directions in which features show the most variation are the directions that are associated with the target - can reduce overfitting - the first principal component direction of the data is that along which the observations vary the most

Bagging

- relatively flexible approach

Alternative fitting procedure

- removes irrelevant variables from the predictors, thus leading to a simpler model - results are easier to interpret - accuracy may improve due to a reduction in variance

Modeling for inference

- researchers in a clinical trial use statistical learning to identify the largest risk factors for heart disease

Logit link function

- restricts values to the range 0 to 1

Relationship between a response Y and p predictors, Y = f(X) + ε

- the accuracy of the prediction for Y depends on both the reducible error and the irreducible error - the variability of ε cannot be reduced because it is the irreducible error - the reducible error can be reduced by using the most appropriate learning method to estimate f - ε has a mean of 0

Two Poisson regression models are fit using the same data: one accounts for varying exposures and the other does not.

- the coefficient estimates should not be the same for both models - with all else equal, a unit change in predictor xj changes the estimated means of both models by the same factor if the corresponding coefficient estimate is the same - both ought to be equally inadequate at handling overdispersion

K-means clustering algorithm

- the decision to standardize the variables depends heavily on the problem at hand
- seeks to find subgroups of homogeneous observations
- begins with a random assignment of the data points into K clusters
- the number of clusters remains the same at each iteration
- the algorithm must be re-run for each different number of clusters
- the algorithm must be initialized with an assignment of the data points to clusters
- a greedy method
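
A minimal sketch with scikit-learn and made-up two-dimensional data, illustrating that K must be pre-specified and the algorithm re-run for each K:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

for k in [2, 3, 4]:                        # the number of clusters must be pre-specified
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)                  # total within-cluster sum of squares for this K
```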

in wrongly assuming the multiple linear regression error terms are independent, the following issues are likely to occur

- the estimated standard errors are smaller than they should be

simple moving average with length k or exponential smoothing with weight w to smooth a time series

- the larger the value of k, the smoother the time series - as w decreases to 0, the amount of smoothing decreases - using a simple moving average with length k = 1 or exponential smoothing with weight w = 0 will result in the same smoothed series (the original series)
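
A minimal sketch with pandas on a made-up series; here exponential smoothing with weight w is expressed via alpha = 1 − w in ewm (an assumed mapping, consistent with the recursion s_t = w·s_{t−1} + (1 − w)·y_t):

```python
import pandas as pd

y = pd.Series([3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0])

sma_k1 = y.rolling(window=1).mean()              # k = 1 reproduces the original series
sma_k3 = y.rolling(window=3).mean()              # larger k gives a smoother series
exp_w0 = y.ewm(alpha=1.0, adjust=False).mean()   # w = 0 (alpha = 1) also reproduces the series
exp_w8 = y.ewm(alpha=0.2, adjust=False).mean()   # w = 0.8 smooths heavily

print(pd.DataFrame({"y": y, "sma_k3": sma_k3, "exp_w8": exp_w8}))
```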

simple linear regression

- the least squares line always passes through the point (mean x, mean y) -the squared sample correlation between x and y is equal to the coefficient of determination of the model - the F statistic of the model is always the square of the t-statistic of the coefficient estimate for x - a random pattern in the scatterplot of y against x indicates a coefficient of determination close to 0

Linear model Assumptions

- the leverage for each observation in a linear model must be between 1/n and 1 - the leverages must sum to p + 1, the number of predictors plus one for the intercept - if an explanatory variable is uncorrelated with all other explanatory variables, the corresponding variance inflation factor would be 1

you examine a residual plot for a SLR model. The residuals are mostly positive on the left and on the right of the plot, but mostly negative in the middle of the plot

- the model can be improved by adding a quadratic variable as a predictor, which should have a positive coefficient

Stationary autoregressive models of order one, y_t = β0 + β1·y_{t−1} + ε_t, t = 1, 2, ...

- the parameter β0 can be any fixed constant - β1 must be a value strictly between −1 and 1 - if β1 = 0, then the model reduces to a white noise process - if β1 = 1, then the model is a random walk - only the immediate past value, y_{t−1}, is used as a predictor for y_t
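
A minimal simulation sketch with NumPy (parameter values are made up) showing the special cases β1 = 0 (white noise) and β1 = 1 (random walk):

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1(beta0, beta1, n=200):
    # simulate y_t = beta0 + beta1 * y_{t-1} + eps_t with y_0 = 0
    y = np.zeros(n)
    eps = rng.normal(size=n)
    for t in range(1, n):
        y[t] = beta0 + beta1 * y[t - 1] + eps[t]
    return y

white_noise = ar1(beta0=0.0, beta1=0.0)   # beta1 = 0: white noise
stationary  = ar1(beta0=1.0, beta1=0.6)   # |beta1| < 1: stationary AR(1)
random_walk = ar1(beta0=0.0, beta1=1.0)   # beta1 = 1: random walk (not stationary)
print(white_noise.var(), stationary.var(), random_walk.var())
```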

multicollinearity

- the presence of an approximate linear relationship between the predictors, not an attribute found in certain observations - can be detected using variance inflation factors - can lead to inflated estimates of standard errors used for inferences - no issue of multicollinearity when there is only one explanatory variable - the more severe the multicollinearity, the less reliable the regression coefficient estimates

principal component analysis

- the proportion of variance explained by an additional principal component decreases as more principal components are added
- the cumulative proportion of variance explained increases as more principal components are added
- using the first few principal components is often sufficient to get a good understanding of the data
- a scree plot provides a method for determining the number of principal components to use
- looks for a low-dimensional representation of the observations that explains a significant amount of variance
- principal components are not correlated with one another
- the first principal component explains the largest portion of variability
- together the principal components explain 100% of the variance
- the dot product of the first principal component and the second principal component must equal 0
- can only be applied to a dataset with quantitative features
- can be used to address multicollinearity in a dataset
- is not useful if the variables in the dataset are not correlated

Leverage

- the smallest possible value is 1/n
- not a function of the response variable
- a large value indicates the presence of a high leverage point, not an outlier
- leverages are the diagonal elements of the hat matrix X(X^T X)^(-1) X^T
- the leverage for the ith observation has the formula h_i = x_i^T (X^T X)^(-1) x_i
- the leverage for each observation in a linear model must be between 1/n and 1
- the leverages must sum to p + 1, which is the number of predictors plus 1
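
A minimal NumPy check on a made-up design matrix with an intercept, verifying the bounds and the sum of the leverages:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept plus p predictors

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix
leverage = np.diag(H)

print(leverage.min() >= 1 / n, leverage.max() <= 1.0)   # each leverage lies in [1/n, 1]
print(leverage.sum())                                    # sums to p + 1 = 4
```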

GLM with Poisson distribution

- the standard normal distribution is typically used to make statistical inferences on a regression coefficient - accounting for exposures allows the mean of the response to be the same as the variance -the variance of a negative binomial distribution is always greater than its mean

linear probability, logistic, and probit regression for binary dependent variables

- the three major drawbacks of linear probability models are poor fitted values, heteroscedasticity, and meaningless residual analysis - the logistic and probit regression models aim to circumvent the drawbacks of linear probability models

Hurdle model

- the variance can be greater than or less than the mean - this model can accommodate both overdispersion and underdispersion -discrete mixture

zero-inflated model

- the variance is always greater than the mean - this model can only accommodate overdispersion -special case of a latent class model -discrete mixture

negative binomial model

- the variance is always greater than the mean - this model can only accommodate overdispersion - a special case of the heterogeneity model - can be set up as a Poisson-gamma mixture

studentized residuals in multiple linear regression

- they should be realizations of a t-distribution - they are unitless - a likely outlier is indicated by the magnitude of the studentized residual not its sign - a high leverage point is identified using leverage, not studentized residual

boosting in the context of decision trees

- unlike bagging, boosting does not involve bootstrap sampling - boosting has three tuning parameters: the number of trees, the shrinkage parameter, and the number of splits in each tree - the number of splits in each tree is known as the interaction depth
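
A minimal sketch with scikit-learn's gradient boosting on made-up data, mapping the three tuning parameters to n_estimators, learning_rate, and tree size (here depth-1 stumps, i.e. one split per tree):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

model = GradientBoostingRegressor(
    n_estimators=500,    # number of trees
    learning_rate=0.01,  # shrinkage parameter (slow learning)
    max_depth=1,         # stumps: one split per tree (interaction depth 1)
).fit(X, y)
print(model.train_score_[-1])   # training loss after the final tree
```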

clustering methods

- we can cluster the n observations on the basis of the p features in order to identify subgroups among the observations - we can cluster p features on the basis of the n observations in order to discover subgroups among the features - clustering is an unsupervised learning method

For a random forest, let p be the total number of features and m be the number of features selected at each split.

- when m=p the random forest produces the same result as bagging -(p-m)/p is the probability a split will not consider the strongest predictor - the typical choice of m is sqrt(p) for classification problems and p/3 for regression problems
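
A minimal sketch with scikit-learn and made-up data; max_features plays the role of m, and the mapping of "sqrt", a fraction, and None to sqrt(p), p/3, and m = p is the assumption being illustrated:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))                 # p = 9 features
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
y_reg = X[:, 0] + rng.normal(size=300)

clf = RandomForestClassifier(max_features="sqrt").fit(X, y_class)  # m ≈ sqrt(p) for classification
reg = RandomForestRegressor(max_features=1/3).fit(X, y_reg)        # m ≈ p/3 for regression
bag = RandomForestRegressor(max_features=None).fit(X, y_reg)       # m = p reproduces bagging
```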

Binomial distribution

- with a binary qualitative response variable, this is the most appropriate distribution

autoregressive models of order one AR(1)

- an AR(1) model with a positive slope coefficient, i.e. β1 > 0, is a meandering process - the general AR(1) model is a generalization of a white noise process and a random walk model - this means that a stationary AR(1) model is not a generalization of a random walk model, since β1 = 1 for a random walk model - the lag k autocorrelation of a stationary AR(1) model is non-negative only if the slope coefficient is non-negative

Bias-Variance Tradeoff

- bias refers to the error arising from the assumptions built into the statistical learning tool - variance refers to the error arising from the sensitivity of the fitted model to the training data set - as model flexibility increases, squared bias decreases and variance increases

comparing k-fold cross-validation (with k < n) and leave-one-out cross-validation (LOOCV), used on a GLM with log link and gamma error

- k-fold cross-validation requires fitting k models
- LOOCV requires fitting n models
- k-fold cross-validation has a computational advantage over LOOCV
- with respect to bias, LOOCV has a lower bias compared to k-fold cross-validation
- with respect to variance, k-fold cross-validation has a lower variance compared to LOOCV
- with least squares regression, the LOOCV estimate for the test MSE can be calculated by fitting a model once
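
For ordinary least squares regression (not the gamma GLM in this card), the single-fit LOOCV shortcut uses the leverages h_i:

CV_(n) = (1/n) ∑_{i=1}^n ( (y_i − ŷ_i) / (1 − h_i) )²

where ŷ_i is the fitted value from the model fit once on all n observations.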

box plot

-captures the distribution of a variable by emphasizing the distribution tails and the first, second, and third quartiles

with high dimensional data, the following becomes unreliable

-R^2 -the fitted equation - confidence intervals for a regression coefficient

Modeling for predictions

- an advertising company is interested in identifying individuals who will respond positively to a marketing campaign based on the demographics of those individuals - a real estate broker is interested in finding out if a house is undervalued or overvalued based on the characteristics of the house - a statistician uses a regression tree to estimate the salary of a football player based on the traits of the player - an agricultural scientist uses statistical models to predict the yield of corn crops based on the soil and weather

Claim amounts

- are non-negative and primarily modeled as a continuous variable

decision trees

- bagging and random forests require bootstrapping
- bagging reduces variance by averaging the predictions of all the unpruned (bagged) trees
- since a random forest averages the predictions across many trees, it cannot be easily illustrated by one tree diagram
- the main difference between bagging and random forests is the number of predictors considered at each split when building trees
- single decision tree models generally have higher variance than random forest models
- random forests provide an improvement over bagging because trees in a random forest are less correlated than those in bagged trees

hierarchical clustering

- changing the data set could result in extremely different clusters; not robust
- able to visualize the clusters using a dendrogram
- it is commonly a bottom-up or agglomerative approach
- the number of clusters does not need to be pre-specified
- unlike k-means clustering, categorical variables can be used
- two clusters are fused at each iteration of the algorithm, so with n observations there will be n − 1 fusions
- will always produce the same dendrogram given a linkage and a dissimilarity measure
- the decision to standardize the variables before performing hierarchical clustering depends on the problem at hand
- a greedy method

Linear regression

-considered inflexible because the number of possible models is restricted to a certain form -allows the analyst discretion regarding adding or removing variables

non-parametric model

- does not make explicit assumptions about the form of the function - classification tree, K-nearest neighbors, regression tree

qq plot

- examines the similarity between a variable's distribution and a theoretical distribution using quantiles

R chart

- examines the stability of the variability of a time series

leads to unreliable results from a multiple linear regression

-excluding a key predictor -including as many predictors as possible -errors not following a normal distribution

Parametric model

-first, make an assumption on the form of the function -second, fit the model using the training data -logistic regression

continuous mixture

-heterogeneity model

Best models

-highest R^2 , lowest AIC and lowest BIC

scatter plots

- if it shows a quadratic relationship, the variables' sample correlation depends on which region of the quadratic curve appears on the scatter plot - if all the points form a straight line with a slope of −0.32, the variables' sample correlation would be −1 - useful for detecting any type of relationship between two variables

Poisson Regression

- if the model is adequate, the deviance is a realization from a chi-square distribution - a large Pearson chi-square statistic indicates that overdispersion is likely more severe

Random Forest

-if the number of predictors used at each split is equal to the total number of available predictors, the result is the same as using bagging -when building a specific tree, a new subset of predictor variables is used at each split -improvement over bagging because the trees are decorrelated

If the results of a statistical learning method do not vary much between training data sets

-it means that the method has low variance -low flexibility

Ordinary least squares approach

- least squares estimators - maximum likelihood estimators (assuming normally distributed errors) - unbiased, hence the bias is zero

Lasso Regression

- less flexible than linear regression - determines the subset of variables to use, while linear regression allows the analyst discretion regarding adding or removing variables - performs variable selection, because it is possible for the coefficient estimates to be exactly 0 - as the tuning parameter increases, flexibility decreases - the irreducible error will remain constant
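
A minimal sketch with scikit-learn on made-up data, showing that increasing the tuning parameter drives some coefficients to exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)   # only the first two predictors matter

X_std = StandardScaler().fit_transform(X)
for lam in [0.01, 0.1, 1.0]:
    coefs = Lasso(alpha=lam).fit(X_std, y).coef_
    print(lam, np.round(coefs, 3))   # larger lambda -> less flexible fit, more coefficients exactly 0
```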

benefit of K-means clustering over hierarchical clustering

- less restrictive in its clustering structure - there are fewer areas of consideration in clustering a data set - running the algorithm once is guaranteed to find clusters with a local minimum of the total within-cluster variation

reasonable course of action to address multicollinearity

-make no changes -only use orthogonal predictors -drop all but one predictor among those with high VIF -combine all predictors with high variance inflation factors into one predictor

stationary

- means that something does not vary with time - a random walk is stationary only if the error variance equals 0 - constant mean and variance

Tweedie Distribution

- a mixed distribution with a discrete mass at 0 that is continuous over the positive numbers - the distribution can be motivated as an aggregate loss with Poisson frequency and gamma severity

Hierarchical Clustering

- the number of clusters does not need to be pre-specified - the algorithm only needs to be performed once for any number of clusters - does not require random assignments - the results of clustering depend on the choice of the number of clusters, dissimilarity measure, and linkage

Course of action to address multicollinearity

- only use orthogonal predictors; this completely eliminates multicollinearity - drop all but one predictor among those with high variance inflation factors - combine all predictors with high variance inflation factors into one predictor

tree pruning

- overfitting is likely an issue in an unpruned decision tree created using recursive binary splitting - a pruned tree likely has higher bias compared to an unpruned tree - in cost complexity pruning, if the tuning parameter α is zero, the algorithm results in the largest decision tree

supervised problems

-problems with a response -Boosting -K-nearest neighbors -Regression tree -logistic regression -ridge regression

unsupervised problems

-problems without a clear response -cluster analysis -K-means clustering

Bagging

-provides additional flexibility

Principal Component Analysis

-provides low-dimensional linear surfaces that are closest to the observations -the first principal component is the line in the p-dimensional space that is the closest to the observations -finds a lower dimension representation of a dataset that contains as much variation as possible -serves as a tool for data visualization -uses all variables

K-Means Clustering

- randomly assign a cluster to each observation; this serves as the initial cluster assignment - the algorithm needs to be repeated for each K - the number of clusters must be pre-specified

random forest

- randomly selects a subset of predictors to be considered for creating each split
- helps to decorrelate the trees
- can handle both qualitative and quantitative predictors
- can handle non-linear relationships between the predictors and the response; however, a random forest is not less appropriate if the data follow a straight line
- a clear relationship between y and x does not by itself make a random forest appropriate
- overfitting is not a concern when a large number of trees is used
- if the number of predictors used at each split is equal to the total number of available predictors, the result is the same as using bagging

Principal Component Regression

- recommended that the modeler standardize each predictor prior to generating the principal components - assumes that the directions in which features show the most variation are the directions that are associated with the target - can reduce overfitting - the first principal component direction of the data is that along which the observations vary the most - does not perform feature selection because all variables are used

Flexibility model from least to greatest

-ridge regression model -Lasso -linear regression -Boosting -regression tree

Ridge Regression

- a shrinkage method; as the tuning parameter approaches infinity, the coefficients are shrunk towards zero - ridge regression coefficients are not scale equivariant - the shrinkage penalty is applied to all the coefficient estimates except for the intercept - shrinking the coefficient estimates has the benefit of reducing the variance, not the bias

nonstationary

- something varies with time - when the mean does not equal 0 in a random walk - when the variance does not equal 0 in a random walk - if a series has a linear trend in time, the sample means of successive subsamples will not be relatively stable

For a statistical learning method, as flexibility increases

-the interpretability decreases -the training MSE decreases -The test MSE forms a U-shape -squared bias decreases -variance increases

Principal components

- the proportion of variance explained by an additional principal component decreases as more principal components are added - the cumulative proportion of variance explained increases as more principal components are added - using the first few principal components is often sufficient to get a good understanding of the data - a scree plot provides a method for determining the number of principal components to use

heteroscedasticity

- the spread of the residuals changes throughout the plot - leads to an unreliable MSE, the estimate of the irreducible error - adjusted R^2 is unreliable - the F test is unreliable - a concave transformation of the response could resolve a heteroscedastic pattern

GLM

- the standard normal distribution is typically used to make statistical inferences on a regression coefficient
- regression coefficients in a GLM are estimated with maximum likelihood estimators; these are asymptotically normally distributed
- accounting for exposures allows the mean of the response to be the same as the variance if it is a Poisson distribution
- the variance of a negative binomial distribution is always greater than the mean; this is beneficial by loosening the restriction of the Poisson's equal mean and variance
- the inverse of the link function is the mean function
- for a distribution in the linear exponential family, the canonical link is the inverse of the mean function
- the choice of the mean function and the variance function drives the inference
- great at handling non-constant conditional variance
- can accommodate polynomial relationships

decision tree test sets

- the test errors for bagging and random forests tend to be similar - a random forest tends to perform better than bagging, thus resulting in a lower test error

orthogonal

- two explanatory variables are said to be orthogonal if they are uncorrelated; an R^2 of 0 from regressing one on the other indicates this

Deviance

- useful for testing the significance of explanatory variables in nested models - for a normal distribution, it is proportional to the residual sum of squares - defined as a measure of distance between the saturated and fitted models

control chart

- used to detect nonstationarity in a time series - have superimposed lines called control limits: an upper control limit and a lower control limit

Boosted model

- when all of its trees are stumps, i.e. trees with only one split, a boosted model becomes an additive model - reduces bias - overfitting is a concern here - all variables are considered at each split - the prediction is obtained by fitting successive trees

discrete mixture

-zero-inflated model -hurdle model

Most interpretable to least

Lasso, classification tree, bagging

Best Model

Model with the lowest MSE

Categorical variables

are variables that take on one of a limited number of different values

scaled deviance

cannot be negative

Low standard error and high t value

indicate that the predictor is significant in predicting the response

classification problems

problems with a qualitative response

regression problems

problems with a quantitative response

Flexibility & interpretability

there is a trade-off between flexibility and interpretability

white noise process

y_t = β0 + ε_t - stationary in the mean and variance - sample autocorrelations should be close to 0

simple linear regression

y = β0 + β1x + ε - if ε = 0, then the confidence interval equals the prediction interval, because the prediction interval includes the irreducible error - the prediction interval is always at least as wide as the confidence interval - the confidence interval quantifies the possible range for E(y∣x)

random walk

y_t = β0 + y_{t−1} + ε_t - the differenced series is stationary - the sample variance of the series is greater than the sample variance of the differenced series - the differenced series follows a white noise model - exhibits a linear trend in time and increasing variability - as time t increases, the variance of a random walk increases
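
A minimal NumPy sketch (drift and sample size are made up): differencing a simulated random walk with drift recovers a white-noise-like series with much smaller variance:

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.normal(size=500)
y = np.cumsum(0.1 + eps)      # random walk with drift beta0 = 0.1

diff = np.diff(y)             # differenced series = beta0 + eps_t, a white noise process
print(y.var(), diff.var())    # the differenced series has a much smaller sample variance
```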

