MAS II
Out of Bag error (Statistical Learning, Trees)
-A way to approximate the test error for bagged trees.
-Each bagged tree is fit on a bootstrap sample, so for every observation there are about 1/3 of the bagged trees that did not use that observation when creating the model. The observation is "out of bag" for those trees.
-When we want test error, we want "new data," so to those 1/3 of the trees, the observation is a new observation. Find the test error for that observation using only the trees for which it is out of bag: average their predictions and compute the MSE if regression is the goal, or take a majority vote if classification is the goal (see the sketch below).
-The OOB approach for estimating test error is particularly convenient when performing bagging on large data sets where LOOCV would take a long time.
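A minimal sketch of the OOB estimate using scikit-learn's BaggingRegressor; the data set and settings are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# oob_score=True makes each observation's prediction come only from the
# trees whose bootstrap sample excluded it (~1/3 of the B trees).
bag = BaggingRegressor(n_estimators=200, oob_score=True, random_state=0).fit(X, y)

oob_mse = np.mean((y - bag.oob_prediction_) ** 2)  # OOB estimate of test MSE
print(f"OOB MSE: {oob_mse:.2f}")
```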
Recursive Binary Splitting (Statistical Learning, Trees)
-Begin at the top of the tree, where all observations belong to a single region, then successively split the predictor space so that RSS (or the Gini index for a classification tree) is minimized at each split. Go with the split that has the smallest RSS (or Gini index) among all the potential splits; a sketch of one such split appears below.
-Greedy: at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.
-Leads to overfitting because the tree becomes too complex (has too many splits).
-The best splits group observations with similar responses together.
-A tree with no splits has the highest RSS; yhat is the mean of the entire data set (or the majority class, if classification).
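A hypothetical numpy sketch of a single greedy split: brute force over every predictor and cutpoint, keeping the split with the smallest total RSS of the two child regions:

```python
import numpy as np

def best_split(X, y):
    best = (None, None, np.inf)  # (feature index, cutpoint, RSS)
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:           # candidate cutpoints
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            # RSS of each child region around its own mean response
            rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if rss < best[2]:
                best = (j, s, rss)
    return best

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.where(X[:, 1] > 0.5, 5.0, 0.0) + rng.normal(size=100)
print(best_split(X, y))  # should recover a split near X[:, 1] = 0.5
```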
K Nearest Neighbors (Statistical Learning, KNN)
-Can be used for qualitative or quantitative responses.
-Performs best with a low number of features.
-If regression is the goal, predictions are based on the mean or median of the k closest observations.
-If classification is the goal, the mode (most common class) of the closest observations serves as the prediction.
-Smaller K means a more flexible model (in the extreme case, overfitting).
-Larger K means a less flexible model (in the extreme case, underfitting).
-Pick the K that has the lowest test error, as in the sketch below.
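An illustrative sketch, assuming scikit-learn and simulated data, of picking K by comparing test errors:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for k in (1, 5, 25, 101):
    # small K -> flexible (risk of overfitting); large K -> rigid (underfitting)
    err = 1 - KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_te, y_te)
    print(k, round(err, 3))
```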
K Means Clustering (Statistical Learning, Clusters)
-Choose K clusters such that the total within-cluster variation is minimized.
-There are almost K^n ways to partition n observations into K clusters.
-Because the K-means algorithm finds a local rather than a global optimum, the results obtained depend on the initial (random) cluster assignment of each observation in the first step of the algorithm. We should run K-means multiple times on the data to help us pick the best clusters (the ones that minimize total within-cluster variation); see the sketch below.
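A short sketch, assuming scikit-learn: rerun K-means from many random initializations and keep the run with the lowest total within-cluster variation:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# n_init=20 reruns the algorithm from 20 random assignments; sklearn keeps
# the run with the smallest inertia_ (total within-cluster sum of squares).
km = KMeans(n_clusters=3, n_init=20, random_state=0).fit(X)
print(km.inertia_)
```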
Bagging (Statistical Learning, Trees)
-Create B bootstrapped training sets (artificially created using the original data).
-Create B trees, then average their predictions to get a final prediction if regression is the goal, or take a majority vote if classification is the goal (see the sketch below).
-The B trees are not pruned (they have low bias because they are overfit to the training data).
-The B trees do not have the same number of nodes.
-The number of trees B is not critical; a very large value of B will not lead to overfitting.
-But B should be sufficiently large. When it is, the out-of-bag (OOB) error is equivalent to the leave-one-out cross-validation error.
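A bare-bones sketch of the procedure (the data and B are assumptions for illustration): fit B deep, unpruned trees on bootstrap samples and average their predictions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2 + rng.normal(size=200)

B, preds = 100, []
for _ in range(B):
    idx = rng.integers(0, len(y), size=len(y))          # bootstrap sample
    tree = DecisionTreeRegressor().fit(X[idx], y[idx])  # deep, unpruned tree
    preds.append(tree.predict(X))

y_hat = np.mean(preds, axis=0)  # averaging the B trees lowers variance
```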
Hierarchical Clustering (Statistical Learning, Clustering)
-Each observation starts in its own cluster.
-Based on the dissimilarity measure (usually Euclidean distance), the two clusters with the lowest dissimilarity are fused together into a cluster.
-Repeat until all observations belong to a single cluster (see the sketch below).
Notes:
-The hierarchical clustering algorithm depends on the type of linkage used and the choice of dissimilarity measure.
-When you are comparing clusters (more than one observation), use linkage. When you are comparing two observations, use the dissimilarity measure.
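A sketch with SciPy, assuming Euclidean dissimilarity and complete linkage on simulated data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])

# Successively fuse the pair of clusters with the lowest dissimilarity
Z = linkage(X, method="complete", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the dendrogram at 2 clusters
```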
List the types of dissimilarity measures you can use in Hierarchical Clustering (Statistical Learning, Clustering)
-Euclidean distance
-Correlation-based distance: two observations are similar if their features are highly correlated, even if the observed values may be far apart in terms of Euclidean distance
What are 3 ways to "plot" out a continuous posterior distribution. (Describing Posterior Distributions)
-Grid approximation
-Quadratic approximation
-MCMC (Markov chain Monte Carlo)
Explain when it is appropriate to use Gini Index, Entropy, or Classification Error
-Growing trees (to get T0): Gini index or entropy, to ensure node purity
-Pruning: Gini index, entropy, or classification error
-Prediction: classification error
Spurious Relationships (Bayesian, Multivariate Linear Models)
-If a predictor has a spurious relationship, it means that when it is the only predictor of the outcome (bivariate model), it shows a strong association with the mean outcome, but when it is included with other predictors (multivariate model), it has little significance to the mean outcome.
-Cause: can arise when a truly causal predictor, call it xreal, influences both the outcome y and a spurious predictor, xspur. By running a model with xreal and xspur together, you can identify the "right" predictor.
Explain when correlation between alpha and beta becomes an issue in a bayesian linear model (Bayesian, Bayesian Linear Models)
-If it's a simple linear model, mu = alpha + beta*x, then the correlation between alpha and beta isn't problematic.
-If you are fitting a multiple linear model, mu = alpha + beta1*x1 + beta2*x2, and there is heavy correlation among alpha, beta1, and beta2, then there is a problem.
Explain why someone might want to scale the features before clustering (Statistical Learning, Clustering)
-If variables are measured on different scales (e.g., cm versus km).
-If we want each variable to be given equal importance in the groups that form.
List some things you should be thinking when you see a standardized model and an unstandardized bayesian model (Bayesian, Linear Models)
-If y is standardized as well, then you cannot compare the standardized model to the unstandardized model, because the outputs are in different units.
-If the models are simple linear models, then either one is usable.
-If a linear model only has the predictors centered, then the correlation between alpha and beta goes away. WAIC would not change.
Explain three possible point estimates you could choose to summarize your posterior distribution. (Describing Posterior Distributions)
-Mean
-Mode (MAP)
-Median
Forests (Statistical Learning, Trees)
-Same idea as bagging, except it decorrelates the bagged trees by only allowing each tree to choose a split from a random subset of m predictors instead of all p (the subset changes at each split). This prevents all the trees from making the same first split on the strongest predictor; if they did, the trees would all start to look the same. On average, (p-m)/p of the splits will not even consider the strong predictors, so other predictors have more of a chance.
-You get more diverse bagged trees, a larger reduction in variance, and a larger reduction in test error and OOB error compared to bagging: averaging many highly correlated quantities does not lead to as large a reduction in variance as averaging many uncorrelated quantities.
-Use m = sqrt(p), the square root of the number of predictors (see the sketch below).
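An illustrative comparison, assuming scikit-learn: the same ensemble with m = sqrt(p) (a forest) versus m = p (plain bagging):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=16, random_state=0)

forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",  # m = sqrt(p)
                                oob_score=True, random_state=0).fit(X, y)
bagged = RandomForestClassifier(n_estimators=200, max_features=None,    # m = p -> bagging
                                oob_score=True, random_state=0).fit(X, y)
print(forest.oob_score_, bagged.oob_score_)  # forest typically does no worse
```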
Explain what happens when you center your predictors (x - xbar) when you create a bayesian linear model (mu = alpha + beta*(x - xbar))
-The posterior correlation between alpha and beta becomes 0 (see the numerical check below).
-The quadratic approximation (the mean and standard deviation of its normal posterior) changes for the intercept alpha.
-The beta and sigma approximations stay the same.
-The intercept's posterior mean will be approximately ybar, the sample mean of the outcome, since alpha is now the expected outcome at the average x.
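A quick numerical check of the first point; as an assumption of this sketch, the design-matrix algebra (the off-diagonal of (X^T X)^(-1)) stands in for the alpha-beta posterior correlation under flat priors:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(10, 2, size=100)  # x far from zero -> strong alpha/beta correlation

def corr_alpha_beta(x):
    X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]
    V = np.linalg.inv(X.T @ X)                 # proportional to posterior covariance
    return V[0, 1] / np.sqrt(V[0, 0] * V[1, 1])

print(corr_alpha_beta(x))             # near -1
print(corr_alpha_beta(x - x.mean()))  # essentially 0 after centering
```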
Explain why using a single decision tree would be a bad idea. (Statistical Learning, Trees)
-A single tree, even after being pruned, will overfit the data and have high variance.
-High variance means that if we test how well the tree predicts on new data, it does poorly.
-To get better prediction accuracy, we need to reduce the variance (we have seen this happen when we average things).
Classification Tree (Statistical Learning, Trees)
-Used to predict a qualitative response (e.g., will a person with certain characteristics have good health or bad health)
-The predicted class of an observation is the most commonly occurring class of training observations in its region
Regression Tree (Statistical Learning, Trees)
-Used to predict a quantitative response
-The predicted response is the mean response of the training observations in the same region
-A common stopping rule: restrict terminal nodes to have at least 5 observations
State three facts about standardizing predictors when doing bayesian polynomial regressions (Bayesian, Linear Models)
-After standardizing, changing the predictor by one unit means changing it by one standard deviation, making results easier to interpret
-Standardizing can avoid numerical problems encountered by the algorithms when computing the posterior distribution
-Since standardized variables are not in their natural scale, it can be difficult to interpret the physical meaning of the results
What is true about masked relationships (Bayesian, Multivariate Linear Models)
-If one predictor is masked in its bivariate model, then the other predictor is masked in its bivariate model as well.
-If both predictors have a positive correlation with the outcome, then predictor 1's effect needs to be negative when predictor 2's is positive to mask the relationship. In other words, the two predictors need to have a negative correlation with each other.
Name two type of credibility intervals (Bayesian, Describing Posterior distributions)
-Percentile interval (the Bayesian analogue of a confidence interval)
-Highest posterior density interval (HPDI)
Loading Vector Rules (Stats, PCA)
-The squared loadings must sum to 1
-Each loading vector must be orthogonal to the others (dot product is 0)
Principal Component Analysis (Statistical Learning, PCA)
-tool used for data visualization or data pre-processing before supervised techniques are applied. Can also be used for data exploration.
Clustering (Statistical Learning, Clusters)
-tool used to discover unknown subgroups in data
Boosting (Statistical Learning, Trees)
-Trees are grown sequentially: each tree is grown using information from previously grown trees. Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set. Fit a small tree using the current residuals, rather than the outcome Y, as the response. Update yhat by adding in a shrunken version of the new tree. Update the residuals using the updated yhat. Then fit a new small tree to the updated residuals. Repeat.
-The number of trees B must be chosen through cross-validation. Unlike bagging and random forests, boosting can overfit if B is too large, although overfitting tends to occur slowly.
-There is a tuning parameter, lambda, that controls how slowly the model learns (usually set to 0.01 or 0.001).
-The number of splits in each tree, d, also called the interaction depth, controls the complexity. Usually d = 1 is good, aka a "stump."
-Using small trees, like a stump (one split), leads to an additive model. See the sketch below.
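A sketch of the recipe with scikit-learn's GradientBoostingRegressor: stumps (d = 1), a small lambda, and B chosen by cross-validation. All settings and data are illustrative assumptions:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)

for B in (100, 1000, 3000):
    gbm = GradientBoostingRegressor(n_estimators=B, learning_rate=0.01,
                                    max_depth=1, random_state=0)  # stumps learn slowly
    mse = -cross_val_score(gbm, X, y, scoring="neg_mean_squared_error").mean()
    print(B, round(mse, 1))  # pick the B with the lowest CV error
```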
List possible solutions if problems arise with estimating the covariance parameters of a linear mixed model (ie when the algorithm fails to converge to a value that makes sense for our random effects covariance & variance)
1. Choose alternative starting values for the covariance parameters
2. Rescale the covariates
3. Remove unnecessary random effects: in general, remove higher-order terms (e.g., higher-level interactions and higher-order polynomials) from a model first, for both random and fixed effects
4. Fit the implied marginal model
5. Fit a marginal model (not the implied marginal model, because the fixed coefficients will be different) with an unstructured covariance matrix
Give 2 reasons why you might want to incorporate multiple predictors when estimating the mean outcome, μ. (Bayesian, Multivariate Linear Models)
1. It's a good way to rule out predictors with spurious relationships
2. You can identify relationships between predictors and the outcome that may not show up when the predictor is modeled by itself (i.e., the predictor's relationship is masked in the bivariate model)
Solving for Loading vector (Stats, PCA)
1. Start with the constraint that the squared loadings sum to 1. You will get two possible values.
2. Test both in the orthogonality constraint (dot product with the other loading vector equals 0).
Explain what a loss function is and how it could help someone pick a point estimate to summarize the posterior distribution. (Describing Posterior Distributions)
A loss function is a rule that tells you the cost associated with using any particular point estimate. We want to select the point estimate that minimizes this cost. Types of loss functions and the value that minimizes each (computed from posterior samples in the sketch below):
-Squared error: posterior mean
-Absolute: posterior median
-Zero-one: posterior mode
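A short sketch computing all three point estimates from posterior samples; the gamma draws stand in for an assumed posterior:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
samples = rng.gamma(2.0, 1.0, size=10_000)  # assumed posterior draws

mean = samples.mean()              # minimizes squared-error loss
median = np.median(samples)        # minimizes absolute loss
kde = stats.gaussian_kde(samples)  # density estimate for the mode (zero-one loss)
grid = np.linspace(samples.min(), samples.max(), 1000)
mode = grid[np.argmax(kde(grid))]
print(mean, median, mode)
```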
Scree Plot (Stats, PCA)
A scree plot depicts the proportion of variance explained (PVE) by each of the principal components. The y-axis can be the proportion of variance or the cumulative proportion of variance. If the former, the plot is decreasing, since the first component explains the most. If the latter, the plot is increasing (the full variance is explained when all components are used).
Describe what happens to dissimilarity as the stages go on in hierarchical clustering (Stats, Clusters)
As the stages go on, we combine more groups together, which increases the dissimilarity (the height on the dendrogram) at which the clusters fuse
How is quadratic approximation similar to maximum likelihood estimation? (Bayesian, Posterior Distributions)
Both maximize a function of the likelihood, trying to find the value of θ that would most likely produce the values of y that we are seeing in the data. Quadratic approximation maximizes the (unnormalized) posterior f(y|θ)*f(θ); MLE maximizes f(y|θ).
Explain how you can improve the predictive accuracy of a tree (Statistical Learning, Trees)
By aggregating many decision trees, using methods like bagging, random forests , and boosting, the predictive performance of trees can be substantially improved
Explain how one could increase prediction accuracy when modeling with trees (Statistical Learning, Trees)
Combining a large number of trees can result in dramatic improvements in prediction accuracy, at the expense of some loss in interpretation.
List the types of linkage you could use to measure inter-cluster dissimilarity (Statistical Learning, Clustering)
Complete - maximum inter-cluster dissimilarity. Compute all pairwise dissimilarities between the observations in cluster A and the observations in cluster B; record the largest of these dissimilarities.
Single - same as complete, but record the smallest dissimilarity. Can result in extended, trailing clusters in which single observations are fused one at a time.
Average - same as complete, but record the average of the dissimilarities.
Centroid - dissimilarity between the centroid of cluster A and the centroid of cluster B. Centroid linkage can result in undesirable inversions.
Which residual covariance structure would you use if you suspect equal correlation in the measurements of a subject. For example, the study you are conducting has repeated trials under the same condition
Compound Symmetry
A linear mixed model with only random intercepts and a homogeneous residual covariance structure (diagonal) is equivalent to having an implied marginal model with what kind of covariance matrix?
A compound symmetric matrix, where the covariance is the random intercept's variance and the variance is the random intercept's variance plus the residual variance; see the formula below.
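Written out (a sketch of the algebra, with the notation assumed here: σ_b² for the random-intercept variance, σ² for the residual variance):

```latex
\operatorname{Var}(y_{ij}) = \sigma_b^2 + \sigma^2, \qquad
\operatorname{Cov}(y_{ij}, y_{ik}) = \sigma_b^2 \quad (j \neq k)
```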
Describe pros and cons when using the Newton Raphson algorithm to estimate covariance parameters of a linear mixed model.
Pros:
-The Hessian matrix from the final iteration can be used to find the standard errors of your estimated covariance parameters
-Requires fewer iterations than EM
-Recommended for obtaining final estimates
Cons:
-Iterations are more time consuming because of the Hessian matrix calculations
Explain how the data being correlated will impact the Proportion of Variance Explained. How about if the data is uncorrelated? (Stats, PCA)
Correlated data: it is very easy to explain the data in just a few components; the first few components' PVE will explain a good deal of the data.
Uncorrelated data: PVE starts lower and decreases more slowly as the component index increases.
Explain what kind of problems decision trees can be applied to? (Statistical Learning, Trees)
Decision trees can be applied to both regression and classification problems
Which residual covariance structure would you use if you suspect the measurements for one subject are independent and have equal variance?
Diagonal (Also called homogeneous residual covariance structure)
Which algorithm would you use to get the starting values for an algorithm that is estimating the covariance parameters in a linear mixed model?
Expectation Maximization
Which residual covariance structure would you use if you are conducting a study that takes place at equally spaced intervals and you suspect that the measurements of a subject become less correlated as time increases.
First Order Autoregressive
What is Grid Approximation and describe pros and cons. (Describing Posterior Distributions)
Grid approximation approximates the shape of a continuous or discrete posterior distribution's pdf by picking a handful of discrete values for λ and calculating the posterior density f(λ|x) at each λ value. These density values get plotted on a graph to get the general shape of the posterior distribution. The more points you add to the grid, the smoother and more accurate the posterior distribution will look (see the sketch below).
Pros:
-Simple if there is only one parameter λ (or two parameters)
-Works well with small-sample data
-Can work for GLMs
Cons:
-The grid becomes too big if there are more than a couple of parameters (cannot be used for complex models that may have 100s or 1000s of parameters)
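A minimal sketch for a binomial likelihood with a uniform prior; the 6-successes-in-9-trials data is an assumption for illustration:

```python
import numpy as np
from scipy import stats

grid = np.linspace(0, 1, 1000)            # candidate parameter values
prior = np.ones_like(grid)                # uniform prior
likelihood = stats.binom.pmf(6, 9, grid)  # 6 successes in 9 trials (assumed data)

posterior = likelihood * prior
posterior /= posterior.sum()              # normalize over the grid
print(grid[np.argmax(posterior)])         # posterior mode ~ 6/9
```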
Explain when it is appropriate to standardize data in PCA (Stats, PCA)
If a feature has a very large variance due to the way it is scaled, it will represent most of the variance for the entire data set. The feature will dominate the first principal component, which will explain nearly all of the variance, and additional components will add very little to the proportion explained. If the feature is standardized, this domination by a single feature is avoided.
*Standardizing is usually needed when features are expressed in different units. But it can still be needed even when the features are expressed in the same units.
Masked Relationships (Bayesian, Multivariate Linear Models)
If a predictor has a masked relationship, it means that when it is the only predictor of the mean outcome (bivariate model), there is no apparent relationship, but when it is included with other predictors (multivariate model), it does have a meaningful relationship with the mean outcome.
-Cause: if the betas in the multivariate model have the same sign, then the predictors have a negative correlation. If the betas in the multivariate model have different signs, then the predictors have a positive correlation.
Explain when a tree may outperform a linear model (Statistical Learning, Trees)
If there is a highly non-linear, complex relationship between the features and the response, then decision trees may outperform classical approaches.
What happens to the posterior mean for mu when you increase the mu prior's standard deviation? In other words, when the prior is N(0, t)
If you start with a small standard deviation t, you are using a very informative prior, and the posterior mean will be close to 0. As you increase t, the prior becomes less informative and we lean more on the data, so the posterior mean converges to the MLE, which happens to be the sample mean (see the formula below).
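For the conjugate normal case with known data variance σ² (an assumption made here to keep the algebra closed-form), the posterior mean makes this limit explicit:

```latex
% y_i ~ N(mu, sigma^2) with prior mu ~ N(0, t^2):
E[\mu \mid y] = \frac{n\bar{y}/\sigma^2}{\,n/\sigma^2 + 1/t^2\,}
             = \bar{y}\,\frac{n t^2}{n t^2 + \sigma^2}
\;\xrightarrow{\,t \to \infty\,}\; \bar{y} \quad (\text{the MLE})
```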
How is K Nearest Neighbors related to Bayes Classifier (Statistical Learning, KNN)
In the qualitative case, the Bayes classifier assigns a test observation to the class with the highest conditional probability, which in most cases we do not know. It is the most accurate way to assign a test observation to a class, and the Bayes error rate is the irreducible error.
-With two classes, the Bayes classifier corresponds to predicting class one if Pr(Y = 1|X = x0) > 0.5, and class two otherwise.
-The set of points where Pr(Y = 1|X = x0) = 0.5 is the Bayes decision boundary.
-The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate.
-KNN estimates the Bayes classifier. Our goal is to get the KNN test error close to the Bayes error rate.
Difference between inter-observation dissimilarity and inter-cluster dissimilarity (Stats, Clusters)
Inter-observation dissimilarity - dissimilarity between observations. Only needed at the beginning, when each observation is in its own cluster. You only have to calculate these once.
Inter-cluster dissimilarity - dissimilarity between clusters, usually what we have to resort to as the dendrogram grows upward.
What is a credibility interval in terms of a posterior density? (Bayesian)
It gives you an interval of plausible (updated) parameter values - this interval holds about x% of the posterior probability. Depending on the type of interval chosen, the credibility interval can be anywhere on the density curve.
What is the implied marginal model in a mixed linear model?
It is the model with only fixed effects. The fixed coefficients are the same; the variance from the random effects is absorbed into the residual variance (in other words, we go back to having just one random source of variation). The implied marginal model has fewer restrictions on the covariance parameters because only one matrix needs to be positive definite (the residuals'), whereas in the LMM two matrices need to be positive definite (the residuals' and the random effects').
How can you tie the posterior updating to credibility? (Bayesian)
It's similar to limited-fluctuation credibility. Think of the posterior as a mix between the data (the likelihood) and the preconceived notion (the prior). If there is not a lot of data, we lean heavily on the prior. If there is a lot of data, we can learn the most from the data - there is no need to lean that heavily on our prior.
What happens to our prior selection as we increase our data set (Bayesian)
More data will "overwhelm" our prior, in other words, it becomes meaningless. Our posterior will lean more on the likelihood portion when updating. The data becomes better at telling the story than our preconceived notions from the prior.
Which algorithm would you use to estimate the covariance parameters in a linear mixed model?
Newton Raphson
Multicollinearity (Bayesian Multivariate Linear Models)
Occurs when two or more of your predictors are highly correlated with each other.
What is a percentile interval in terms of a posterior density? (Describing posterior distributions)
A percentile interval is the Bayesian analogue of a confidence interval for your posterior density. It assigns equal probability mass to each tail, so it is most representative when the posterior distribution is roughly symmetrical.
List Pros and Cons of Hierarchical Clustering (Statistical Learning, Clustering)
Pros:
-Dendrogram visual
-Do not need to pre-specify the number of clusters K
-Have access to all possible clusterings thanks to the dendrogram: just cut the dendrogram at any height to obtain your preferred clusters
Cons:
-For clusters that aren't nested within each other, hierarchical clustering can give less accurate results than K-means clustering
-Have to think about scaling or centering variables before clustering
-It may not be clear where to cut the dendrogram in order to obtain clusters
-Have to think about the type of linkage to use
-Have to think about which dissimilarity measure to use
-Not robust: a small change in the data can cause a large change in the final clusters
-Distortion of clusters when there is an outlier
List Pros and Cons for Bagging (Statistical Learning, Trees)
Pros:
-Increased predictive accuracy
-Can obtain a summary of the importance of each predictor using RSS (if regression) or the Gini index (if classification)
Cons:
-Loses interpretability
-No picture
-Not as big a decrease in variance as a forest, because the trees can be highly correlated (i.e., they all look the same)
List Pros and Cons of using the HPDI instead of percentile interval? (Describing posterior distributions)
Pros:
-Narrower
-The interval always includes the most probable parameter value
-More representative of the posterior than a percentile interval if the density curve is not symmetrical (if it is, there is no meaningful difference between the HPDI and a percentile interval)
Cons:
-More computationally intensive than a percentile interval
-Suffers from greater simulation variance - basically, it is sensitive to how many samples you draw from the posterior
-Harder to understand than the usual confidence-style interval
-If the choice of interval leads to different inferences, you'd be better off just plotting the entire posterior distribution
PVE (Stats, PCA)
Proportion of Variance Explained
List Pros and Cons of K Means Clustering (Statistical Learning, Clustering)
Cons:
-Random, so we have to run it more than once to get an idea of what the optimal clusters are
-Must pre-specify the number of clusters K
-Not robust: a small change in the data can cause a large change in the final clusters
-Distortion of clusters when there are outliers
Explain pros and cons for using Restricted Maximum Likelihood (REML) estimates for a mixed linear model
Pros:
-Estimates of the fixed effects, β, are unbiased
-Estimates of the covariance parameters for the random effects and residuals, θ, are unbiased
Cons:
-The REML estimates of Var[β] are biased (underestimated)
Explain pros and cons for using Maximum Likelihood Estimates (MLE) for a mixed linear model
Pros:
-Estimates of the fixed effects, β, are unbiased (regardless of whether θ is known)
Cons:
-The ML estimates of Var[β] are biased (underestimated)
-The ML estimates of the variances and covariances of the random effects and residuals, θ, are biased
List Pros and Cons for Forests (Statistical Learning, Trees)
Pros:
-Greater reduction in variance than bagging, especially if the predictors are highly correlated
-Lower test error and OOB error
Cons:
-Still abstract; no picture
List Pros and Cons of Boosting (Statistical Learning, Trees)
Pros:
-More interpretable than bagging and random forests
-A slow learning process, which tends to do well in prediction
Describe pros and cons when using the Fisher Scoring algorithm (modified Newton Raphson) to estimate covariance parameters of a linear mixed model.
Pros:
-Simpler calculations than N-R (because it uses the expected Hessian matrix instead of the observed one)
-More numerically stable
-More likely to converge
Cons:
-The expected Hessian matrix can be hard to find
-Not recommended for obtaining final estimates
List Pros and Cons of Using Trees (Statistical Learning, Trees)
Pros:
-Trees are very easy to explain
-Trees are more intuitive because they more closely mirror human decision making
-Trees can be displayed graphically
-Trees can easily handle qualitative predictors without the need to create dummy variables
Cons:
-Trees do not have the same level of predictive accuracy as some other regression and classification approaches
-Trees are non-robust: a small change in the data can cause a large change in the final estimated tree
Describe pros and cons when using the Expectation Maximization algorithm to estimate covariance parameters of a linear mixed model.
Pros:
-Can be used to maximize complicated likelihood functions
-Good for finding starting values of the parameters to be used in other algorithms
Cons:
-Has a slow rate of convergence (requires many iterations to get an answer)
-Tends to overestimate the covariance parameters
-The precision of estimators derived from the EM algorithm is overly optimistic (because complete hypothetical data is used instead of the observed data)
-Not recommended for obtaining final estimates
What is Quadratic Approximation and describe pros and cons. (Bayesian, Describing Posterior Distributions)
Quadratic approximation assumes the posterior distribution's pdf can be approximated by a normal distribution. The mean of the Gaussian will be the mode (the peak of the posterior, the MAP). The standard deviation is estimated from the curvature around the peak of the posterior. Example: assuming the posterior is Gaussian, it is maximized at 0.67 (the mode) and its standard deviation is 0.16 (reproduced in the sketch below).
Pros:
-Can handle more than one parameter
Cons:
-The approximation is poor if the posterior is not shaped like a Gaussian
-Only improves as the size of the data increases (and the rate of improvement can be slow)
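A sketch reproducing the example numbers, assuming a Binomial(9, p) likelihood with 6 successes and a flat prior:

```python
import numpy as np
from scipy import optimize, stats

def neg_log_post(p):
    return -stats.binom.logpmf(6, 9, p)  # flat prior drops out of the optimization

# Step 1: climb to the posterior mode (the MAP)
res = optimize.minimize_scalar(neg_log_post, bounds=(0.001, 0.999), method="bounded")
mode = res.x

# Step 2: curvature (numerical second derivative) at the mode gives the sd
h = 1e-5
curv = (neg_log_post(mode + h) - 2 * neg_log_post(mode) + neg_log_post(mode - h)) / h**2
sd = 1 / np.sqrt(curv)
print(mode, sd)  # ~0.67 and ~0.16, matching the example
```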
Explain how Forests are different from Bagged Trees (Statistical Learning, Trees)
Same thing as bagging, except the trees are decorrelated.
-Note that if the subset of predictors in a forest equals the total number of predictors (m = p), then the forest is equivalent to bagged trees.
Describe PCA methodology (Statistical Learning, PCA)
Summary: find a low-dimensional representation of the observations that explains a good fraction of the variance (see the sketch below).
-The idea is especially important when you have lots of features: not all of them are going to be important when describing the data. We can go from high dimension to low dimension with PCA.
-The new dimensions are called principal components, Z. Z1 is the first principal component (the first dimension), then Z2, and so on. Think of each like a new X variable.
-Z1 is a linear combination of the features, and each observation gets a z1 score.
-The coefficients in the linear combination are called loadings. Think of these as weights assigned to the features: the greater the magnitude, the greater the importance of the feature in explaining the component.
-The loadings are selected so that the sample variance of the scores is maximized (in other words, square each observation's z score and then average them; this is what you want to maximize).
Notes:
-PCA does not get rid of features.
-Z1 contains the largest proportion of the variance of the data.
-Each observation gets a z1 score that lives in the Z1 dimension.
-The principal components are uncorrelated. This can be good if we originally wanted to use the data in a multiple linear regression but couldn't because the features were too correlated; now we can run the regression using the uncorrelated principal components as features.
-Using the principal components as the features of a linear regression can give less noisy results, because most of the signal is concentrated in the first few principal components.
-There can be at most min(n-1, p) principal components.
-You must center your data before PCA. You don't have to scale, but if the features are expressed on different scales it's important to scale them, or else the larger-scaled features will dominate the principal components and the smaller-scaled features will have little influence.
-Each loading vector is unique up to a sign flip.
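A compact sketch, assuming scikit-learn and simulated correlated data, tying the pieces together (loadings, scores, and PVE):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))  # correlated features

Z = StandardScaler().fit_transform(X)  # center (and here also scale) before PCA
pca = PCA().fit(Z)

print(pca.components_[0])             # first loading vector (squares sum to 1)
print(pca.explained_variance_ratio_)  # PVE per component, decreasing
scores = pca.transform(Z)             # each observation's z scores
```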
Which Methods would be used for supervised learning and which for unsupervised learning? (Statistical Learning)
Supervised:
-Trees
-KNN
Unsupervised:
-PCA
-Clustering
Explain how the quap works to find the quadratic approximation of the posterior distribution (Bayesian, Describing Posterior Distributions)
Suppose we have model: y|θ ~ Poisson(θ), prior: θ ~ Exponential(2).
1. Find the θ that maximizes the posterior density f(y|θ)*f(θ). quap tries many values of θ and climbs up the posterior until it reaches the top, a maximum. This is the mode.
2. Estimate the curvature around the mode to get the standard deviation of the approximating Gaussian.
What is a high posterior density interval (HPDI)? (Describing posterior distributions)
The HPDI is the narrowest interval containing the specified probability mass.
Biplot (Stats, PCA)
The axes are the first two principal components; it's like a scatterplot of the two principal components' scores.
-Sometimes the loading vectors are superimposed on top.
In a linear mixed model, if you have a random intercept only, which of the following residual covariance structures would be an inappropriate structure?
The compound symmetry covariance matrix, because it would cause aliasing. Recall that a compound symmetric residual covariance is equivalent to an LMM with a random intercept whose variance is σ1. You can't have both this structure and random intercepts.
For classification tree, why is node purity important when growing a tree?
The purer a node is, the more confident we are in our classification of a test observation. Translation: if the node is primarily one class, then we will feel confident that a new observation landing there belongs to that class.
Name a reason one might want to fit the implied marginal model instead of a linear mixed model.
There are fewer restrictions on the covariance parameters being estimated - we won't run into convergence issues.
Predictive Posterior Distribution (Bayesian)
This is your predictive distribution: given the data, what you can expect your next outcome to be, factoring in all possible values of p, their associated outcomes, and the probability of each p value occurring. If you compute the sampling distribution of outcomes at each value of p and then average all of these prediction distributions together, weighting by the posterior probabilities of each value of p, you get the posterior predictive distribution.
Explain one disadvantage to using a tree-based method (Statistical Learning, Trees)
Tree-based methods are not competitive with the best supervised learning approaches in terms of prediction accuracy
Explain two benefits to using a tree based method (Statistical Learning, Trees)
Tree-based methods are -simple -easy to interpret
Explain how a linear mixed model is more flexible than an ANOVA model when it comes to longitudinal studies (under the assumption that missing data happens randomly).
Unlike an ANOVA model, a linear mixed model allows the time points when measurements are collected to vary for each subject. Also, we don't have to omit incomplete subjects (ie dropouts) - a linear mixed model uses all observations available for a subject.
How is Boosting different from bagging? (Statistical Learning, Trees)
Unlike bagging:
-No averaging is done; you adjust/update the residuals after fitting each tree
-The construction of each tree depends strongly on the trees that have already been grown
-Using smaller trees is better (recall that bagged trees are not pruned)
-The number of trees B is critical - a small B is good. B must be chosen using cross-validation, or you might overfit if B is too big
Explain under what circumstances you would use AIC/BIC to compare two models and loglikelihood ratio to compare two models.
Use AIC/BIC for non-nested models. Use Loglikelihood ratio test for nested models
Posterior Mean
Using grid approximation, quadratic approximation, or MCMC to approximate the posterior, you can take samples from it. The posterior mean is the mean of those samples.
Explain how outliers can affect Clusters (Statistical Learning, Clustering)
When there are outlier observations, it isn't appropriate to place them in a cluster, but hierarchical clustering and K-means will place them in one anyway. The clusters can be heavily distorted by the presence of outliers that do not belong to any cluster. Solution: use a mixture model to accommodate the presence of outliers.
Explain how a lot of data can make a prior irrelevant (Bayesian)
When there is a lot of data, we don't need to rely on our prior assumptions, we can just use the data to tell the story. The data is represented in the likelihood part of the posterior. The prior, which represents our prior assumptions, becomes unnecessary.
Which residual covariance structures would cause aliasing with the covariance parameters if the linear mixed model has a random intercept?
Whenever the residual covariance structure has a covariance (examples: Compound symmetry, Unstructured, etc)
Explain an instance where it would be inappropriate to perform a loglikelihood ratio test.
You wouldn't want to use a loglikelihood ratio test when missing data results in one model having the full data set and the other model having a partial data set. Loglikelihood ratio tests must compare apples to apples, i.e., nested models fit to the same data.
What happens to the posterior when we: a)Increase data size b) Increase the prior distribution's standard deviation
a) the posterior mean goes towards the MLE b) the posterior mean goes towards the MLE
If you have a bayesian simple linear model (mu = alpha + beta*x), what happens when x equals the sample mean xbar?
mu's posterior mean will be roughly the same as the sample mean ybar