Data Final Exam Review

External Validity

Info: It involves the extent to which the results of a study can be generalized beyond the sample.

Matrix Size

Info: A matrix's size is measured by its number of rows and columns, and it is written in the format rows × columns.

"How might omitted variable bias change model coefficients? "

"Info: The relationships could change dramatically pos to neg), the statistical signifgicance could change. "

What is a "tidy" dataset?

Info: Tidy data sets are arranged such that each variable is a column and each observation (or case) is a row.

What is a "tidy" dataset?

Info: Tidy data sets are structured datasets which are easy to manipulate, model and visualize. Tidy data sets are arranged such that each variable is a column and each observation (or case) is a row.

"Interpretation of linear regression beta coefficients in simple linear regression "

"Info: ""Put simply, beta coefficients allow you to """"compare the relative importance of each coefficient in a regression model"""" (1) Beta-0: This is the constant that represents the expected value of your dependent (Y) variable when your independent (X) variable is equal to 0. Oftentimes, it can give you a good idea of the starting point in your model. That is, when there is no variation in your independent variable, you typically can expect your depenent variable to equal your beta-0 coefficient. However, there can be times where that doesn't make sense, though, (e.g., if you run a regression model on age (X) and height (Y), you might end up with a negative beta-0. That doesn't mean that newborn babies are a negative number of inches) so you need to be cautious when evaluating your beta-0 and make sure that it makes sense if you plan to use it to explain phenomena. Beta-1: This coefficient represents the direction of the relationship (i.e., is the indenpendent variable (X) associated with an increase or decrease in the value of the dependent variable (Y)?). You can evaluate this by multiplying the unit increase of X by the Beta-1, and add that to your Beta-0 to the produce to understand the extent to which changes in your X variable is associated with changes in your Y variable. Be sure to evaluate the p-value of your beta-1 to see if the relationship between your two beta coefficients can be considered statistically significant. Sources: (1) https://www.statisticshowto.datasciencecentral.com/standardized-beta-coefficient/ https://www.theanalysisfactor.com/interpreting-the-intercept-in-a-regression-model/"" Notes: Sources: (1) https://www.statisticshowto.datasciencecentral.com/standardized-beta-coefficient/ https://www.theanalysisfactor.com/interpreting-the-intercept-in-a-regression-model/"" "

"What is the augment() function from the broom package used for in R? "

"Info: ""The augment() function from the broom package provides the specific results of the linear regression calculation process for each obervation for the data included in your model (i.e., y-hat values, residual/distance between the expected value from the model and the actual value from the data, etc.) This function prints the results in a tibble, which can make it easier to read, but also easier to subset or export the results as well. Source: https://broom.tidyverse.org/reference/augment.lm.html"" Notes: Source: https://broom.tidyverse.org/reference/augment.lm.html"" "

What is the glance() function from the broom package used for in R? How could you use it to evaluate model performance?

"Info: ""The glance() function from the broom package prints your linear model statistics in the form of a tibble, which can make it easier to read, or process/manupulate/subset/export to CSV/etc. which could be convenient in certain cases. Please be aware that the glance function only prints out the statistics for the entire model (e.g., R-square, Adjusted R-square, F-statistic, P-value from the F-statistic, Degrees of Freedom, etc.) The glance function can give you a quick """"glance"""" when you are trying to assess how well the data fit your model, if your model is statistically significant from zero, and how much of the variation in the data is accounted for in your model. **The glance() function does not, however, provide you with your beta coefficients, T-statistics, or P-values from the T-statistics.** (See the tidy() function for that) Source: https://www.rdocumentation.org/packages/broom/versions/0.5.2/topics/glance.lm"" Notes: Source: https://www.rdocumentation.org/packages/broom/versions/0.5.2/topics/glance.lm"" "

What is a "tidy" dataset?

Info: Tidy data sets have structure and working with them is easy; they’re easy to manipulate, model and visualize. They are arranged so that each variable is a column and each observation is a row.

"What is the tidy() function from the broom package used for in R? "

"Info: ""The tidy() function from the broom package prints your beta coefficients, T-statistics, and P-values from the T-statistics in the form of a tibble, which can make it easier to read, or process/manupulate/subset/export to CSV/etc. which could be convenient in certain cases. You can use the tidy() function to get a quick read on how your variables interact, if your beta coefficients for your explanatory variables are individually statisically significant from your beta-0 coefficient (for your dependent variable), the direction of your model, and a broach overview of your expected results (""""for every increase/decrease in one unit of X, Y does ___) Please be aware that **you cannot use the tidy() function to assess how well your model accounts for the variation in your data, nor can you assess whether your model, as a whole, is statistically significant from the mean.** (See the glance() function for more information on that) Source: https://www.rdocumentation.org/packages/broom/versions/0.5.2/topics/tidy.lm"" Notes: Source: https://www.rdocumentation.org/packages/broom/versions/0.5.2/topics/tidy.lm"""

"Indicators of multicollinearity "

"Info: - F statisitic (significant model that should capture some relationship but no coefficient significance - Correlation among variables - High variance inflation factor between 5 and 10. If above 10, want to think about getting rid of the variable and not "

"1) Matrix size 2) Matrix addition 3) Matrix subtraction 4) What is a ""tidy"" dataset? 5) What does BLUE stand for? "

"Info: 1) A matrix is a rectangular array of numbers or other mathematical objects for which operations such as addition and multiplication are defined. The size of a matrix is defined by the number of rows and columns that it contains. A matrix with m rows and n columns is called an m-by-n matrix, while m and n are called its dimensions. 2) Matrix addition is the operation of adding two matrices by adding the corresponding entries together. 3) Matrix subtraction is the operation of subtracting two matrices by subtracting the corresponding entries. 4) A tidy dataset is a dataset which uses tidy data, data obtained as a result of a process called data tidying. Tidy datasets have structure and working with them is easy. They are easy to manipulate, model, and visualize. They are arranged such that each variable is a column and each observation is a row. 5) BLUE stands for Best Linear Unbiased Estimator. "

Tidy data

"Info: 1. Each variable has its own column 2. Each observation has its own row 3. Each value has its own cell "

1. Matrix size 2. Matrix addition 3. Matrix subtraction 4. Requirements for matrix sizes necessary to do matrix multiplication 5. Matrix transposition

"Info: 1. Matrix size: The size of a matrix is defined by the number of rows and columns that it contains. A matrix with m rows and n columns is called an m × n matrix or m-by-n matrix 2.Matrix addition: Two matrixes may be added only if they have the same number of rows and columns. Addition is accomplished by adding corresponding elements (elements in the same position). 3. Matrix subtraction: Two matrixes may be subtracted only if they have the same number of rows and columns. Subtraction is accomplished by subtracting corresponding elements (elements in the same position). 4. Requirements for matrix sizes necessary to do matrix multiplication: The number of columns of the left matrix is the same as the number of rows of the right matrix. 5. Matrix transposition: The transpose of an m-by-n matrix A is the n-by-m matrix AT (also denoted Atr or tA) formed by turning rows into columns and vice versa. Notes: Just answer 5 questions together at one time for convenience."

requirements for matrix sizes necessary to do matrix multiplication

"Info: 1. the number of columns in the first matrix must equal the number of rows in the second matrix 2. multiply the numbers in the rows of the first matrix to the numbers in the columns of the second matrix 3. add the products "

1. What is the tidy() function from the broom package used for? 2. What is the augment() function from the broom package used for? 3. What is the glance() function from the broom package used for? 4. What is a "tidy" dataset? What is its structure? 5. What does BLUE stand for?

"Info: 1. tidy () function constructs a data frame that summarizes the model's statistical findings and includes coefficients and the p-values for each term in a regression 2. the augment () function adds columns to the original data that was modeled, which can includes predictions, residuals and cluster assignments 3. glance () is a code that summarizes of the model and usually contains the Rsquared and adjusted Rsquared values, as well as the residual standard error from a linear regression model 4. A tidy dataset is a way to organize or map data and the structure of it is such that columns represent variables and rows represent each observation. 5. BLUE stands for best linear unbiased estimator and is used to assess coefficients. Best refers to the lowest variance of the estimate, compared to other unbiased, linear estimates. "

Basic structure of a residual fit plot

"Info: Are we violating GM arguments? iS there structure that can undermind our confidence in the linear regression? - Clear violation of the different GM assumptions - Is there curvilinear relationship in them - Homo and hetero (left to right location around the fit line) elasticity - Time series issue: Observations correlated with one another. "

Assumptions for Multiple Linear Regression

"Info: As a predictive analysis, the multiple linear regression is used to explain the relationship between one continuous dependent variable and two or more independent variables. The independent variables can be continuous or categorical (dummy coded as appropriate). Assumptions: Regression residuals must be normally distributed. A linear relationship is assumed between the dependent variable and the independent variables. The residuals are homoscedastic and approximately rectangular-shaped. Absence of multicollinearity is assumed in the model, meaning that the independent variables are not too highly correlated. "

What does BLUE stand for?

"Info: Best Linear Unbiased Estimator (BLUE) The Gaussâ€"Markov theorem states that in a linear regression model in which the errors have expectation zero, are uncorrelated and have equal variances, the best linear unbiased estimator (BLUE) of the coefficients is given by the ordinary least squares (OLS) estimator, provided it exists. Here ""best"" means giving the lowest variance of the estimate, as compared to other unbiased, linear estimators. "

What is the tidy() function in the "broom" package used for in R?

"Info: First define the ""broom"" package's use: Summarizes key information about statistical objects in tidy tibbles. This makes it easy to report results, create plots and consistently work with large numbers of models at once. tidy() : summarizes information about model components such as coefficients of a regression. Notes: source: https://cran.r-project.org/web/packages/broom/broom.pdf"

Matrix Subtraction

"Info: For subtraction, like in addition, you must only use matrices with the same number of rows and columns. Subtract the matrices cell by cell: 3 4 2 3 1 1 5 3 - 54 = 0 -1 "

Heteroscedasticity

"Info: Heteroscedasticity refers to data with unequal variability (scatter) across a set of second, predictor variables. A residual plot can suggest (but not prove) heteroscedasticity. Residual plots are created by: Calculating the square residuals. Plotting the squared residuals against an explanatory variable (one that you think is related to the errors). Make a separate plot for each explanatory variable you think is contributing to the errors. "

Matrix Addition

"Info: If two matrices have the same number of rows and the same number of columns, they may be added together. For example: 3 2 6 5 9 7 12 + 4 1 = 5 3 This is your new matrix "

Requirements for matrix sizes necessary to do matrix multiplication

"Info: Let B be an m × p matrix and A a q × n matrix. Then, the product BA is defined if (and only if) p = q. "

Linear Regression

"Info: Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable. For example, a modeler might want to relate the weights of individuals to their heights using a linear regression model. Before attempting to fit a linear model to observed data, a modeler should first determine whether or not there is a relationship between the variables of interest. This does not necessarily imply that one variable causes the other (for example, higher SAT scores do not cause higher college grades), but that there is some significant association between the two variables. A valuable numerical measure of association between two variables is the correlation coefficient, which is a value between -1 and 1 indicating the strength of the association of the observed data for the two variables. A linear regression line has an equation of the form Y = a + bX, where X is the explanatory variable and Y is the dependent variable. The slope of the line is b, and a is the intercept (the value of y when x = 0). Notes: http://www.stat.yale.edu/Courses/1997-98/101/linreg.htm"

Multicollinearity

"Info: Multicollinearity generally occurs when there are high correlations between two or more predictor variables. In other words, one predictor variable can be used to predict the other. This creates redundant information, skewing the results in a regression model. Examples of correlated predictor variables (also called multicollinear predictors) are: a person’s height and weight, age and sales price of a car, or years of education and annual income. An easy way to detect multicollinearity is to calculate correlation coefficients for all pairs of predictor variables. If the correlation coefficient, r, is exactly +1 or -1, this is called perfect multicollinearity. If r is close to or exactly -1 or +1, one of the variables should be removed from the model if at all possible. "

R-squared/Adjusted R-squared

"Info: Overall: R-squared tends to reward you for including too many independent variables in a regression model, and it doesn’t provide any incentive to stop adding more. Adjusted R-squared and predicted R-squared use different approaches to help you fight that impulse to add too many. The protection that adjusted R-squared and predicted R-squared provide is critical because too many terms in a model can produce results that you can’t trust. -R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. In general, the higher the R-squared, the better the model fits your data. -The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. "

"Interpretation of linear regression beta coefficients in multivariate linear regression "

"Info: Put simply, beta coefficients allow you to ""compare the relative importance of each coefficient in a regression model"" (1) Beta-0: This is the constant that represents the expected value of your dependent variable (Y) when your explanatory (X[sub: 1, 2, etc.]) variables are equal to 0. Oftentimes, it can give you a good idea of the starting point in your model. That is, when there is no variation in your explanatory variables, you typically can expect your dependent variable to equal your beta-0 coefficient. However, there can be times where that doesn't make sense, though, (e.g., if you run a regression model on age (X[sub: 1]), childhood nutrition (X[sub: 2]) and height (Y), you might end up with a negative beta-0. That doesn't mean that newborn babies are a negative number of inches) so you need to be cautious when evaluating your beta-0 and make sure that it makes sense if you plan to use it to explain phenomena. Beta-1, 2, etc.: The rest of your beta coefficients represents the direction of the relationship (i.e., are the explanatory variables (X[sub: 1, 2, etc.]) associated with an increase or decrease in the value of the dependent variable (Y)?). You can evaluate this by multiplying the unit increase of X by the Beta-1, 2, etc., and add the individual products to your Beta-0 separately to understand the extent to which changes in your X variable is associated with changes in your Y variable. Be sure to evaluate the p-values of your beta-1, 2, et.c to see if the relationship between your beta-0 and beta 1, 2, etc., coefficients, respectively, can be considered statistically significant. If you discover an explanatory variable that has a significant relationship to the independent variable, your model accounts of all other possible explanations, and your model satisfies the appropriate conditions, you can say that a 1-unit increase in X is associated with a beta-1 increase in Y, holding all else constant. Sources: (1) https://www.statisticshowto.datasciencecentral.com/standardized-beta-coefficient/ https://www.theanalysisfactor.com/interpreting-the-intercept-in-a-regression-model/"" Notes: Sources: (1) https://www.statisticshowto.datasciencecentral.com/standardized-beta-coefficient/ https://www.theanalysisfactor.com/interpreting-the-intercept-in-a-regression-model/"" "

Formula used to calculate R-squared

"Info: R-squared = 1-[(SSres)/(SSt)] SSres is residual sum of squares SSt is total sum of squares "

Formula used to calculate beta one in simple linear regression

"Info: SSxy/SSxx where SSxx is the sum of (xi-mean of x)^2 SS means sum of the squares "

What is the augment() function from the broom package used for in R?

"Info: The broom package summarizes key information about statistical objects in tidy tibbles. The augment() function adds information about individual observations to a dataset, such as fitted values or influence measures. Notes: Source: https://cran.r-project.org/web/packages/broom/broom.pdf"

Transposition

"Info: The process whereby the rows and columns of a new matrix are the columns and rows of the original matrix. ""T"" is the symbol for ""transposed"" Example: 543 T 547 404 401 7 10 3 343 "

What is the basic structure of a residual - fit plot?

"Info: The residuals appear on the y axis and the fitted values appear on the x axis. The characteristics of a well-behaved residual vs. fits plot and what they suggest about the appropriateness of the simple linear regression model: â€" The residuals ""bounce randomly"" around the 0 line. This suggests that the assumption that the relationship is linear is reasonable. â€" The residuals roughly form a ""horizontal band"" around the 0 line. This suggests that the variances of the error terms are equal. â€" No one residual ""stands out"" from the basic random pattern of residuals. This suggests that there are no outliers. "

What is a "tidy" dataset?

"Info: Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data: Each variable forms a column. Each observation forms a row. Each type of observational unit forms a table. Tidy data makes it easy for an analyst or a computer to extract needed variables because it provides a standard way of structuring a dataset. Tidy data is particularly well suited for vectorised programming languages like R, because the layout ensures that values of different variables from the same observation are always paired. "

"BLUE ""tidy"" dataset beta coefficients in simple linear regression lr() function tidy() function"

"Info: best linear unbiased estimator; Gauss Markov theorem states that OLS is BLUE (ordinary least squares regression produces unbiased estimates that have smallest possible variance) dataset where each variable forms a column; each observation forms a row; each type of observational unit forms a table beta0 is intercept, the y value when x is equal to zero; beta1 is the slope, or the change in y’s value for every one unit increase in x calculates the likelihood ratio chi-square statistic, which allows us to determine whether variables are associated; input in R is lr(observed, expected) used to give a data frame that summarizes your model’s statistical findings, including coefficients and p values for each term in your regression "

What is the glance() function from the broom package used for in R? How could you use it to evaluate model performance?

"Info: glance(): Constructs a concise one-row summary of the model. This typically contains values such as R^2, adjusted R^2, and residual standard error that are computed once for the entire model. glance() can be used to observe any changes in adjusted R^2 value when additional variables are added to a model. "

matrix addition

"Info: if the dimensions of the matrices are the same they can be added -the corresponding numbers (those of the same row and column) are added "

matrix subtraction

"Info: the matrices have to be of the same dimension to be subtracted -the numbers of corresponding row and column are subtracted to make the new matrix "

What is a "tidy" dataset?

Info: Tidy dataset provides a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning).

What is RSS and why is it important?

Info: A residual sum of squares (RSS) is a statistical technique used to measure the amount of variance in a data set that is not explained by a regression model. The residual sum of squares is a measure of the amount of error remaining between the regression function and the data set.

Heteroscedascity

Info: An inequality in variability across a range of values. On a residuals fit plot, this is reflected in the widening or narrowing of residuals as one travels rightward on the x-axis.

Matrix size

Info: "The number of rows and columns of a matrix, written in the form rows×columns." (mathwords.com)

Matrix transposition

Info: "The transpose of a matrix is a new matrix whose rows are the columns of the original. This makes the columns of the new matrix the rows of the original." (chortle.ccsu.edu)

What is a "tidy" dataset?

Info: A tidy dataset is arranged so that each variable forms a column, each observation forms a row, and each value has its own cell.

Formula used to calculate R squared

Info: R-squared = 1 - (first sum of errors / second sum of errors), i.e., 1 - (residual sum of squares / total sum of squares).

Formula for R-Squared

Info: R-squared = 1 - ( Σ(yi - yhat_i)^2 / Σ(yi - ybar)^2 ), i.e., 1 - (sum of squared residuals for each point / total sum of squared deviations from the mean).

Causation (requirements for establishing it)

Info: 1. Temporal precedence of X before Y, 2. Covariation between X and Y, and 3. elimination (realistically, minimization) of confounding variables and other alternative explanations

Matrix Addition

Info: A function to add two vectors or matrices. first.matrix + second.matrix.

Histogram

Info: A histogram is used to plot the distribution of a continuous variable over a set interval. The x-axis shows intervals of values for the variable, while the y-axis shows the number of occurrences of data points within each interval.
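
For example, a one-line histogram in base R (the variable choice is illustrative):

```r
hist(mtcars$mpg,
     breaks = 10,                    # number of intervals along the x-axis
     xlab = "Miles per gallon",
     main = "Distribution of mpg")   # y-axis shows the count of observations in each interval
```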

Autocorrelation

Info: Autocorrelation is a characteristic of data in which the correlation between the values of the same variables is based on related objects. It violates the assumption of instance independence, which underlies most of the conventional models. It generally exists in those types of data-sets in which the data, instead of being randomly selected, is from the same source.

Formula used to calculate beta zero in simple linear regression

Info: B0 = y-bar - B1*x-bar

BLUE

Info: BEST LINEAR UNBIASED ESTIMATOR

What does BLUE stand for?

Info: BLUE = Best Linear Unbiased Estimate

What does BLUE stand for?

Info: BLUE stands for 'Best Linear Unbiased Estimators.' The qualities of BLU estimators include linear parameters, unbiasedness, and best (minimum variance around beta parameter in sampling distribution).

What does BLUE stand for?

Info: BLUE stands for Best Linear Unbiased Estimator.

BLUE

Info: BLUE stands for Best Linear Unbiased Estimators. The conditions for BLUE are described by the Gauss Markov Theorem, which specifies the conditions under which the OLS is the BLUE. Best means the lowest variance as compared to other estimators.

What does BLUE stand for?

Info: BLUE stands for Best Linear Unbiased Estimators. This is a standard used to evaluate certain approaches used to estimate population parameters (particularly applicable to least squared coefficients)

What does BLUE stand for?

Info: BLUE stands for: Best Linear Unbiased Estimators. BLUE is part of the Gauss-Markov theorem. The Gauss-Markov theorem states that the best linear unbiased estimators of the coefficients is given by the ordinary least squares estimator if it exists. "Best" means the lowest variance of the estimate as compared to other unbiased estimators. Notes: The BLUE standard is complex. We need to know what it is, what the letters stand for, and why we need it.

What does BLUE stand for?

Info: Best Linear Unbiased Estimator

What does BLUE stand for?

Info: Best Linear Unbiased Estimator, the standard used to evaluate whether the least squared approach is the best or equal to the best way of estimating population parameters

What does BLUE stand for?

Info: Best Linear Unbiased Estimators

What does BLUE stand for?

Info: Best Linear Unbiased Estimators -a trade-off between different forms of accuracy, such as unbiasedness and minimum variance, to estimate a population parameter.

What does BLUE stand for?

Info: Best Linear Unbiased Estimators. It’s a standard we use to see whether or not several approaches are the best at approaching population matters.

What does BLUE stand for?

Info: Best Linear Unbiased Estimator

Interpretation of linear regression beta coefficients in single linear regression

Info: Beta coefficient represents the change in the response variable (delta y) given an increase of 1 in the independent variable. You may want to adjust the scale so that the interpretation is meaningful. The effect of a 1 increase in x may not be as tangible as a 10 increase in x for example. For categorical independent variable coded in binary (0,1), a 1 increase would indicate a shift to the other category.

Matrix Size

Info: By definition a matrix is a rectangular array of numbers. A matrix with "m" number of rows, and "n" number of columns, is said to have a size of m x n (m by n). Note the number of rows is always written first.

Heteroscedasticity

Info: By definition, heteroscedasticity refers to the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it. The scatterplot of these variables often creates a cone-like shape, as the scatter (or variability) of the dependent variable widens or narrows as the value of the independent variable increases. Notes: Further Information can be found at: http://www.statsmakemecry.com/smmctheblog/confusing-stats-terms-explained-heteroscedasticity-heteroske.html

Requirements for matrix sizes necessary for matrix multiplication

Info: The number of columns of the first matrix has to equal the number of rows of the second matrix. Multiplication is performed by multiplying the elements of each row of the first matrix by the elements of each column of the second matrix, then adding the products.

What is a "tidy" dataset?

Info: Each variable has its own column, each row has its own observation. Each value has its own cell.

Matrix addition

Info: If two matrices are the same size, then they can be added together. Matrix addition is simple, with the individual elements being added together. For instance, the item in row 1, column 1 in matrix A will be added to the item in row 1, column 1 in matrix B. The matrices being added together must be the same size, and the resulting matrix is the same size as well.

Perfect Multicollinearity

Info: If two or more of the variables are actually linear functions of each other, they are perfectly multicollinear. Thus it is impossible to hold one variable constant with respect to the other variables. This prevents the multiple linear regression model from isolating the other variables' effects, and so one of the correlated variables will be dropped from the results in R when the model is run.

BLUE

Info: In estimation problems the minimum variance unbiased estimator (MVUE) for a given statistic is called the best linear unbiased estimator (BLUE). Notes: Additional information here: https://nptel.ac.in/courses/117103018/module3/lec9/1.html

What kinds of relationships do interaction terms help us to evaluate?

Info: Interaction terms allow us to evaluate conditional theories (e.g. relationship between X and Y is conditional upon Z, meaning an increase in X is associated with an increase in Y when condition Z is met, but not otherwise).
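
A hedged sketch of fitting an interaction term in R; the wt * am example (does the wt-mpg slope depend on transmission type?) is an illustrative choice:

```r
# y ~ x * z expands to x + z + x:z; the x:z coefficient is the interaction term
fit <- lm(mpg ~ wt * am, data = mtcars)
summary(fit)
# A significant wt:am coefficient suggests the wt-mpg relationship is conditional on am.
```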

What kinds of relationships do interaction terms help us evaluate?

Info: Interaction terms are used when the relationship between X and Y is dependent on the existence of a condition Z. Otherwise said, a change in X will result in a change in Y when some Z condition is met.

What is a "tidy" dataset?

Info: It is a consistent way of organizing data in R. Using "tidyverse" is the way to tidy up data. A tidy dataset is the easiest way to use tidyverse. A dataset is tidy when: each variable has its own column, each observation has its own row, each value has its own cell. Each dataset is put into a tibble.

What is a tidy dataset?

Info: It needs the proper structure: each row should represent an observation and each column should be a variable.

Internal validity

Info: It refers to how well an experiment is done, especially whether it avoids confounding.

What is the lr() function used for in R?

Info: Likelihood Ratio Chi-Square (LR) Calculates the likelihood ratio chi-square statistic based on observed and expected counts.

Matrix Addition

Info: Matrices can be added together if they have the same number of rows and the same number of columns.

Matrix Subtraction

Info: Matrices can be subtracted if they have the same number of rows and columns (same rule as matrix addition). Also similar to addition, you subtract cell-by-cell.

Matrix Addition

Info: Matrices need to be of the same size (and dimensions) to be addable.

Matrix subtraction

Info: Matrices needed to be of the same size (and dimensions) to be subtract-able.

Spring 2019

Info: Matrix Size: By definition a matrix is a rectangular array of numbers. A matrix with "m" number of rows, and "n" number of columns, is said to have a size of m x n (m by n). Note the number of rows is always written first.

Matrix Addition

Info: Matrix addition is the process of adding two matrices together. Essentially, you add each entry of one matrix to the corresponding entry in another matrix. The result will be a matrix of the same dimensions as the two that were added (and for matrices to be added, they must be the same size).

Matrix size

Info: Matrix size refers to the size of the matrix in terms of rows and columns in the matrix. It is expressed as m x n, which is rows x columns, said "rows by columns." For example, a matrix with 4 rows and 6 columns would be a 4 x 6 matrix. Matrix size is important in determining whether or not matrices can be added, subtracted, or multiplied together.

Matrix subtraction

Info: Matrix subtraction = subtracting one matrix from another. As long as the dimensions of two matrices are the same, one can add and subtract them much as one can add and subtract numbers. Subtraction is carried out by subtracting the corresponding elements in the matrices, i.e., the elements that are in the same positions.

Matrix Subtraction

Info: Matrix subtraction is the process of subtracting one matrix from another. Essentially, you subtract the entry of one matrix from the corresponding entry in another matrix. The result will be a matrix of the same dimensions as the two that were subtracted (and for matrices to be subtracted, they must be the same size).

Matrix Transposition

Info: Matrix transposition essentially flips a matrix diagonally, so that rows become columns and columns become rows. Often denoted as M^T

Defining Multicollinearity

Info: Multicollinearity is a state of very high intercorrelations or inter-associations among independent variables. It is therefore a type of disturbance in the data, and if present in the data, the statistical inferences made about the data may not be reliable. Notes: Additional Information: https://www.statisticssolutions.com/multicollinearity/

Adjusted vs. Regular R-Squared

Info: One major difference between R-squared and the adjusted R-squared is that R-squared assumes that every independent variable in the model explains the variation in the dependent variable while the adjusted R-squared gives the percentage of variation explained by only those independent variables that actually (in reality) affect the dependent variable. Notes: Additional information here: https://www.investopedia.com/ask/answers/012615/whats-difference-between-rsquared-and-adjusted-rsquared.asp

R-Squared Value

Info: R-Squared value essentially measures how much variance in the dependent variable is accounted for by the model. A small R-Squared would indicate that the model doesn't fully explain the variation in the dependent variable. In short, it is a measure of model-fit.

Difference between r-squared and adjusted r-squared

Info: R-Squared: R2=1-RSS/TSS. R-squared is the value between zero and one that tells you how much of the variation in y that is explained by the model. The adjusted R-squared value controls for number of variables in the model.

Formula used to calculate R-squared

Info: R-squared = 1 â€" (First Sum of Errors / Second Sum of Errors) Notes: Specifically, this linear regression is used to determine how well a line fits’ to a data set of observations, especially when comparing models. Also, it is the fraction of the total variation in y that is captured by a model. Or, how well does a line follow the variations within a set of data.

Formula used to calculate R squared

Info: R-squared = 1 â€" (residual sum of squares / total sum of squares)

What is RSS?

Info: RSS is the residual sum of squares and it is the sum of the squared residuals, with residuals being the deviation of the observed values from the prediction line.

Formula used to calculate R squared

Info: R^2 = 1 - (SSres / SST) = 1 - Σ(yi - yhat_i)^2 / Σ(yi - ybar)^2

Matrix transposition

Info: Rows become columns, columns become rows (but you aren't really just turning the matrix 90 degrees). Entry 1,1 stays 1,1, but 3,4 becomes 4,3 and so on. The diagonal axis (if there is one, i.e., if the matrix is square) remains the same (because, for example, 3,3 becomes 3,3, so it doesn't move). Useful if you want to multiply things and need different dimensions to do so (e.g., matrix A times matrix A is only possible with square matrices, but matrix A times A' is always possible). Notes: ' or ^T is used to indicate a transposed matrix.

tidy() function in R

Info: Tidy function, as its name implies, shows a tidier summary output of a regression model or other statistical tests. A condensed and cleaner version of summary()

Skew

Info: Skew can be visually assessed by checking whether the distribution is piled disproportionately toward one side (left or right), or more accurately by comparing the mean and median. If the mean is to the left of the median, the distribution is left-skewed; it is right-skewed when the mean is to the right of the median. The median is a more trustworthy measure to use if the data are skewed.

BLUE

Info: Stands for Best Linear Unbiased Estimator, given by the ordinary least squares by minimizing the sum of the squares of differences between observed and predicted values. In summary, a model where error is completely minimized.

Matrix Addition

Info: Suppose that the matrices A and B are m × n matrices (they have the same size), with components aij and bij respectively. Then (A + B)ij = (A)ij + (B)ij = aij + bij

Matrix Subtraction

Info: Suppose that the matrices A and B are m × n matrices (they have the same size), with components aij and bij respectively. Then (A - B)ij = (A)ij - (B)ij = aij - bij

What is a "tidy" dataset?

Info: Terminology refers to the structure of how data looks while performing analysis. In a "tidy" dataset, each row is an observation and each column is a variable. Helps to avoid having repeat variables.

Augment()

Info: The augment function augments data with information from an object. For example, you could use augment to add columns to a dataset.

BLUE

Info: BLUE stands for Best Linear Unbiased Estimator. Under the Gauss-Markov assumptions, the ordinary least squares estimator of the regression coefficients is BLUE: it has the lowest variance among all linear unbiased estimators.

augment()

Info: The augment function "augments" or adds to a data set with information from the object. Basically it will add another column with additional information gathered from analyzing the data set.

What is the tidy() function from the broom package used for in R?

Info: The broom package summarizes key information about statistical objects in tidy tibbles. The tidy() function summarizes information about model components, such as the coefficients of a regression, with one row per term. Notes: Source: https://cran.r-project.org/web/packages/broom/broom.pdf

Matrix Size

Info: The dimensions of a matrix. rows x columns.

lr() function in R

Info: The function LR computes the complete set of pairwise logratios, in the order [1,2], [1,3], [2,3], [1,4], [2,4], [3,4], etc.

Glance()

Info: The glance function takes a model and returns one row of summary information about the model. This includes fit measures, p-values, and r-squared values. This function can be used to evaluate statistical significance as well as how much variation among the dependent variables the model accounts for.

alternative hypothesis

Info: The hypothesis used that is contrary to the null hypothesis. Notes: It is usually taken to be that the observations are the result of a real effect (with some amount of chance variation superposed).

What is the lr() function used for in R?

Info: The lr() function in R performs a log-rank test. The log-rank test uses all cases to compare multiple groups (e.g., treated groups vs. control groups in a randomized trial).

What is the lr()function used for in R?

Info: The lr()function in R is used to calculate the likelihood ratio chi-square statistic based on observed and expected counts

Requirements for matrix sizes necessary to do matrix multiplication

Info: The matrices do not have to be the same size in order to multiply them. Instead, the number of columns of the first matrix has to be the same as the number of rows of the second matrix. Hence if matrix A has the size n x m, then matrix B must have the size m x p in order to multiply them together.

Matrix size

Info: The matrix size refers to the dimensions of the matrix - the number of rows by the number of columns.

Matrix addition

Info: The operation of adding two matrices by adding the corresponding entries together.

Matrix Transposition

Info: The reflection of a matrix diagonally, resulting in the replacement of the values in the rows with those from the columns and vice versa.

Matrix Size

Info: The size of a matrix is dependent upon the number of rows and the number of columns of the matrix. The size of a matrix is written as A x B with A being the number of rows and B being the number of columns (always in that order).

Dispersion

Info: The spread of the variable values.

What is the tidy() function from the broom package used for in R?

Info: The tidy() function in broom is used to turn the messy outputs of built-in functions in R into tidy data frames.

Matrix Transposition

Info: The transpose A' of an m × n matrix A is the matrix obtained by exchanging the rows and columns of A. Thus, the transpose A' is the n × m matrix whose ijth entry is the jith entry of A. Thus, the roles of rows and columns are reversed.

Central tendency

Info: The typical or average value of the variable.

Z-score

Info: The value for the number of standard deviations away from the mean any data point lies. This can be converted into a confidence interval through a z-score chart.

What is the lr() function used for in R?

Info: This function computes the complete set of pairwise logratios, in the order [1,2], [1,3], [2,3], [1,4], [2,4], [3,4], etc.

lr()

Info: This function is used for a log-rank test. The test is a hypothesis test between the observed and expected survival curves. The LR statistic is calculated as LR = (O - E)^2 / E.

Tidy Data

Info: Tidy data is data obtained as a result of a process called tidying which is one of the important cleaning processes in big data. Tidy data is structured so that working with it is easy in manipulations, modeling, and visualizations. Tidy data sets are arranged such that each variable is a column and each observation is a row. Notes: Additional Information: https://www.jstatsoft.org/index

Matrix addition

Info: To add two matrices, they must have the same dimensions. Then add equivalent locations (entry 1,1 in the first matrix is added to 1,1 of the second matrix). The resulting matrix will have the same dimensions as the original matrices.

Requirements for matrix sizes necessary to do matrix multiplication

Info: To multiply matrices, the number of columns in the first matrix must equal the number of rows in the second matrix (a x b * b x c). NOTE: this means that the commutative property of multiplication does _NOT_ hold for matrix multiplication: A * B does not equal B * A. Notes: To multiply matrices, use the dot product (not on the exam).

Matrix subtraction

Info: To subtract matrices, as with adding matrices, they must be the same size (have the same dimensions). The second matrix's entries are subtracted item by item from the first matrix (so for matrices a - b, the entry in the second row, third column of the result would be a(2,3) - b(2,3)).

Matrix addition

Info: Two matrices can be added together only if they have the same number of rows and columns. Then, to add two matrices, simply add the corresponding elements of the two matrices. That is: add the entry in the first row, first column of the first matrix with the entry in the first row, first column of the second matrix; add the entry in the first row, second column of the first matrix with the entry in the first row, second column of the second matrix.

matrix transposition

Info: a matrix where the rows become the columns

Linear Regression

Info: a relationship between a dependent variable and one or more explanatory variables

Causation

Info: a relationship between two events where one event is affected by the other Notes: when the value of one event, or variable, increases or decreases as a result of other events, it is said there is causation

What is the augment() function from the broom package used for in R?

Info: add columns to the original data that was modeled

What is the augment() function from the broom package used for in R?

Info: adds columns to the original data that was modeled. This includes predictions, residuals, and cluster assignments.

Formula used to calculate beta zero in simple linear regression

Info: b0=(ymean) -(b1)(xmean)

Formula used to calculate beta one in simple linear regression

Info: b1 = (SSxy)/(SSxx)

what does BLUE stand for?

Info: best linear unbiased estimator

What does BLUE stand for?

Info: best linear unbiased estimators

matrix addition

Info: cells in same location get added together but they must be same size for that to happen

matrix subtraction

Info: cells in same location get subtracted from one another but they must be same size for that to happen

What kind of relationships do interaction terms help us to evaluate?

Info: conditional relationships-- allow us to measure if the relationship between x and y is dependent on z

What is tidy?

Info: data obtained as a result of a process called data tidying, in which each variable forms a column and each observation forms a row

glance()

Info: glance() constructs a single row summary of a model, fit, or other object. You can use it to evaluate model performance because it gives you information such as the p value and the r squared statistics.

sampling variability

Info: how much an estimate varies between samples

Transposition

Info: The operation that flips a matrix so that its rows become the columns of the new matrix and its columns become the rows.

multicollinearity

Info: is a phenomenon in which one predictor variable in a multiple regression model can be linearly predicted from others

What is the lm() function used for in R?

Info: lm() is used to fit linear models and to regress variables against another. It gives us the coefficients for the effect of each independent variable on the dependent variable, as well as the intercept of the line of best fit.

Matrix Addition

Info: operation of adding two matrices by adding corresponding entries

What is the glance() function from the broom package used for in R?How could you use it to evaluate model performance?

Info: produces summary statistics for the entire regression. tells what the r squared value is, which represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model. (from ppts & wikipedia)

What is RSS and why is it important?

Info: Residual sum of squares. RSS measures the deviation of the predicted values from the actual empirical values of the data. It is important for determining how well your model fits the data; smaller is better.

matrix size

Info: rows by columns

what is RSS/ why is it important

Info: sum of the squares of residuals. It is a measure of the discrepancy between the data and an estimation model. A small RSS indicates a tight fit of the model to the data. --wikipedia

Skew

Info: the degree of distortion from the symmetrical bell curve Notes: can be positive or negative

matrix size

Info: the number of rows and columns of a matrix. a matrix read "two by three" has two rows and three columns

z score

Info: the number of standard deviations from the mean a data point is Notes: also known as a standard score

What does the tidy() function do?

Info: tidy() turns an object into a tidy tibble. This gives you a dataframe that summarizes your model's statistical findings, including coefficients and p-values.

What is the tidy() function from the broom package used for in R?

Info: tidy(): Constructs a data frame that summarizes the model’s statistical findings. This includes coefficients and p-values for each term in a regression, per-cluster information in clustering applications, or per-test information for multtest functions.

Does multivariate linear regression improve our ability to evaluate the three conditions for establishing causal claims over bivariate analysis?

Info: Yes. Controlling for additional explanatory variables helps eliminate (or at least minimize) confounding variables and other alternative explanations, one of the three conditions for establishing causation.

