Linear Regression 1
Question: When there is a relationship between the order of the observations and the residuals in a regression model, it is known as _________.
Autocorrelation Difficulty: Hard Explanation: Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of the delay. In regression analysis, it refers to residuals (errors) that are correlated with one another across the order of the observations, which violates the assumption of independent errors.
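Example: a minimal sketch of one common check for autocorrelated residuals, the Durbin-Watson statistic. The simulated residual series and the AR(1) coefficient of 0.8 are illustrative assumptions, not part of the card above. Values near 2 suggest little lag-1 autocorrelation; values well below 2 suggest positive autocorrelation.

```python
import numpy as np

def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)
white = rng.normal(size=500)          # independent residuals (assumed example data)

# AR(1) residuals: each value carries over 0.8 of the previous one
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.8 * ar1[t - 1] + white[t]

dw_white = durbin_watson(white)   # expected to be near 2
dw_ar1 = durbin_watson(ar1)       # expected to be well below 2
print(dw_white, dw_ar1)
```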
Question: _________ is a variable selection method that, unlike stepwise procedures, considers all possible subsets of predictors to identify the best performing model.
Best subset regression Difficulty: Hard Explanation: Best subset regression is a method used in regression analysis to select the best performing model considering all possible subsets of predictors. It searches exhaustively through the space of predictor subsets and chooses the best subset for each model size according to a specified criterion.
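Example: a rough sketch of the exhaustive search described above, using residual sum of squares (RSS) as the within-size criterion. The helper names `rss` and `best_subsets` and the simulated data are hypothetical; in practice the winners for each size would then be compared with a criterion such as adjusted R², AIC, or cross-validated error.

```python
import itertools
import numpy as np

def rss(X, y):
    """Residual sum of squares of an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def best_subsets(X, y):
    """For each model size k, return the column subset with the lowest RSS."""
    p = X.shape[1]
    best = {}
    for k in range(1, p + 1):
        best[k] = min(itertools.combinations(range(p), k),
                      key=lambda cols: rss(X[:, list(cols)], y))
    return best

# Assumed example data: only columns 0 and 2 actually drive y
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)
result = best_subsets(X, y)
print(result)
```

The search fits 2^p - 1 models, so it is only practical for a modest number of predictors.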
Question: In linear regression, the _________ represents the expected change in the dependent variable for a one unit change in the independent variable, holding all other independent variables constant.
Coefficient Difficulty: Medium Explanation: In linear regression, the coefficients represent the expected change in the dependent variable (also called the response or target variable) for a one unit change in the corresponding independent variable (also called predictor or feature), holding all other independent variables constant. They define the relationship between the predictors and the response.
Question: The _________, ranging between 0 and 1, is a statistical measure in linear regression that indicates how well the regression predictions approximate the real data points.
Coefficient of determination (R-squared) Difficulty: Hard Explanation: The coefficient of determination, often denoted R², is a statistical measure that shows the proportion of the variance in the dependent variable that is predictable from the independent variables. An R² of 1 (100 percent) indicates that all of the variation in the dependent variable is explained by the independent variables; an R² of 0 indicates that the model explains none of it.
Question: The proportion of the variance in the dependent variable that is predictable from the independent variable(s) in linear regression is known as the _________.
Coefficient of determination or R-squared Difficulty: Medium Explanation: The coefficient of determination, often denoted R², is a statistical measure that shows the proportion of the variance in a dependent variable that is explained by an independent variable or variables in a regression model.
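Example: a small sketch of the usual computation, R² = 1 - SS_res / SS_tot. The five data points are made up for illustration. Predicting the mean for every observation gives an R² of 0, which is the baseline the measure is compared against.

```python
import numpy as np

def r_squared(y, y_hat):
    """Proportion of the variance in y explained by the predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)        # unexplained variation
    ss_tot = np.sum((y - np.mean(y)) ** 2)   # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Assumed example data, roughly y = 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = np.polyfit(x, y, 1)       # simple linear fit
r2 = r_squared(y, slope * x + intercept)
r2_mean_only = r_squared(y, np.full_like(y, y.mean()))
print(r2, r2_mean_only)
```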
Question: The estimated probability distribution of a dependent variable given the independent variables in a regression model, which is often assumed to be normally distributed, is called the _________ distribution.
Conditional Difficulty: Hard Explanation: The conditional distribution in a regression context is the estimated distribution of the dependent variable given the values of the independent variables. Often in linear regression, it is assumed that the conditional distribution of the dependent variable is normally distributed.
Question: When dealing with multiple independent variables, a visual tool that is used to identify potential relationships or correlations among variables is called a _________ matrix.
Correlation Difficulty: Medium Explanation: A correlation matrix is a table that shows the correlation coefficients between many variables. Each cell in the table shows the correlation between two variables. It is a powerful tool to summarize data and to identify potential collinearity in the regression model.
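Example: a minimal sketch using NumPy's `corrcoef` on three simulated predictors, where `x2` is constructed (by assumption) to be nearly a copy of `x1`. The large off-diagonal entry flags the pair as a collinearity risk.

```python
import numpy as np

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=300)   # nearly a copy of x1
x3 = rng.normal(size=300)                    # unrelated predictor

# rowvar=False: each column is a variable, each row an observation
corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
print(np.round(corr, 2))
```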
Question: In linear regression, the variable that is being predicted or explained is often referred to as the _________ variable.
Dependent Difficulty: Easy Explanation: In the context of linear regression, the dependent variable is the variable that we aim to predict or explain. It's also known as the response or target variable.
Question: _________ tests whether all of the slope coefficients in a linear regression model are simultaneously zero.
F-test Difficulty: Hard Explanation: The F-test in regression is used to assess the significance of the overall regression model. Specifically, it tests the null hypothesis that all of the slope coefficients (every coefficient except the intercept) are zero, meaning that none of the independent variables matter, versus the alternative hypothesis that at least one does matter.
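Example: a sketch of the overall F-statistic, F = ((SS_tot - SS_res) / p) / (SS_res / (n - p - 1)), for a model with p predictors and an intercept. The function name and simulated data are assumptions for illustration; the statistic would normally be compared against an F distribution with (p, n - p - 1) degrees of freedom to obtain a p-value.

```python
import numpy as np

def overall_f_statistic(X, y):
    """F-statistic for H0: all slope coefficients are zero."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])           # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return ((ss_tot - ss_res) / p) / (ss_res / (n - p - 1))

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
y_signal = 2.0 * X[:, 0] + rng.normal(size=100)    # one predictor matters
y_noise = rng.normal(size=100)                     # no predictor matters

f_signal = overall_f_statistic(X, y_signal)
f_noise = overall_f_statistic(X, y_noise)
print(f_signal, f_noise)
```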
Question: In regression analysis, the process of expanding the feature set by adding higher degrees and interactions of the original features is known as _________.
Feature engineering Difficulty: Medium Explanation: Feature engineering in the context of regression analysis often involves creating additional relevant features from existing features, including adding interaction terms, polynomial terms, or creating new meaningful features that can help improve the predictive power of the model.
Question: In linear regression, if the residuals have constant variance, they are said to be _________.
Homoscedastic Difficulty: Medium Explanation: Homoscedasticity refers to the assumption in regression analysis that the variance of the errors, or residuals, is constant across all levels of the independent variables. This is one of the key assumptions of linear regression.
Question: In the context of linear regression, _________ refers to the spread of residuals or errors of prediction and is expected to be constant across all levels of the independent variables.
Homoscedasticity Difficulty: Hard Explanation: Homoscedasticity is a statistical assumption in linear regression that suggests that the variance or spread of residuals or errors of prediction is constant across all levels of the independent variables. If the variance changes, the data is said to exhibit heteroscedasticity, which violates one of the assumptions of linear regression.
Question: In linear regression, the assumption that the observations are collected independently of each other is known as _________.
Independence Difficulty: Medium Explanation: The assumption of independence in linear regression implies that the observations are collected independently of each other. It's important because the standard errors of the coefficient estimates, hypothesis tests, and confidence intervals rely on this assumption.
Question: A _________ variable in linear regression is an additional variable created by multiplying two or more independent variables together.
Interaction Difficulty: Medium Explanation: Interaction variables represent situations where the effect of one independent variable on the dependent variable depends on the value of another independent variable. Interaction variables are created by multiplying these independent variables together.
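Example: a minimal sketch of adding an interaction column to the design matrix. The coefficients (1, 2, -1, 0.5) and the noise-free data are assumptions chosen so the fit recovers them exactly. With an interaction, the effect of `x1` on `y` is 2 + 0.5 * x2 rather than a single constant.

```python
import numpy as np

rng = np.random.default_rng(6)
x1 = rng.normal(size=50)
x2 = rng.normal(size=50)

# Noise-free response with an interaction effect of 0.5
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x1 * x2

# Design matrix: intercept, x1, x2, and the product x1 * x2
A = np.column_stack([np.ones(50), x1, x2, x1 * x2])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(beta, 6))
```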
Question: In a linear regression model, the _________ is the value of the dependent variable when all independent variables are zero.
Intercept Difficulty: Medium Explanation: The intercept (often denoted as the beta zero coefficient) in a linear regression model is the expected value of the dependent variable when all the independent variables are zero. It is the point where the regression line crosses the y-axis.
Question: The method of introducing a constant term in regression analysis, so that the regression line passes through the centroid of the data points is known as fitting the model with a/an _________.
Intercept Difficulty: Medium Explanation: Including an intercept in a least-squares regression model ensures that the fitted regression line passes through the centroid (the point that represents the mean value of the predictors and the response) of the data points. This constant term allows the regression equation to fit the data appropriately.
Question: _________ regression is a version of linear regression that can help in selecting features by introducing a penalty term that effectively discards non-influential features.
Lasso Difficulty: Hard Explanation: Lasso (Least Absolute Shrinkage and Selection Operator) regression is a type of linear regression that includes a regularization term in the loss function. The regularization term is the sum of the absolute values of the coefficients, which encourages sparse solutions: it can shrink the coefficients of non-influential features to zero, effectively selecting out those features.
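Example: a rough sketch of Lasso fitted by coordinate descent with soft-thresholding, minimizing (1/(2n)) * ||y - Xb||² + lam * ||b||₁. This is one standard way to solve the problem, not necessarily what any particular library does; the function names, the penalty `lam=0.5`, and the simulated data are assumptions. The data are centered so no intercept is needed.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become zero."""
    return np.sign(rho) * max(abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for the Lasso objective (centered data)."""
    n, p = X.shape
    beta = np.zeros(p)
    z = (X ** 2).sum(axis=0) / n                   # per-column scale
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back
            partial_resid = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, lam) / z[j]
    return beta

# Assumed example data: only the first two of five features matter
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
X -= X.mean(axis=0)
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)
y -= y.mean()

beta = lasso_cd(X, y, lam=0.5)
print(np.round(beta, 3))
```

The coefficients of the three irrelevant features come out exactly zero, while the two relevant ones are shrunk toward zero but kept.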
Question: When the fit of a linear regression model is disproportionately influenced by a single observation or a small group of observations with unusual predictor values, these observations are said to have high _________.
Leverage Difficulty: Hard Explanation: Leverage measures how far an observation's predictor (independent variable) values lie from those of the other observations. Observations with high leverage can have an undue influence on the estimated regression coefficients and can disproportionately affect the model's fit and predictions.
Question: When fitting a linear regression model, outliers can have a disproportionately large effect on the model parameters. This problem is known as _________.
Leverage effect Difficulty: Hard Explanation: In statistics and in particular in regression analysis, leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points can have a disproportionately large effect on the estimated regression coefficients and hence they are a concern in regression diagnostics.
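Example: a small sketch computing leverage as the diagonal of the hat matrix H = A (AᵀA)⁻¹ Aᵀ, where A is the design matrix with an intercept column. The six x values are made up, with the last one placed far from the rest. The leverages sum to the number of fitted parameters (here 2).

```python
import numpy as np

def leverages(X):
    """Diagonal of the hat matrix for a design with an intercept column."""
    A = np.column_stack([np.ones(X.shape[0]), X])
    H = A @ np.linalg.pinv(A.T @ A) @ A.T
    return np.diag(H)

# Assumed example data: the last x value is far from the others
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
h = leverages(x.reshape(-1, 1))
print(np.round(h, 3))
```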
Question: Linear regression assumes that the relationship between the independent and dependent variables is _________, meaning that the regression line is a straight line.
Linear Difficulty: Easy Explanation: Linear regression operates under the assumption of linearity, which means it assumes a linear relationship between the dependent and independent variables. It models this relationship as a straight line.
Question: When there's a linear relationship between the log-transformed dependent variable and the log-transformed independent variable, the model is known as a _________ regression model.
Log-Log Difficulty: Hard Explanation: A log-log regression model is a type of regression model where both the dependent variable and independent variables are log-transformed. It is used to model situations where percentage change in the independent variable(s) is associated with a percentage change in the dependent variable.
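Example: a minimal sketch showing that a power-law relationship y = a * x^b becomes linear after taking logs, log y = log a + b * log x, so the slope b reads as the percentage change in y per one percent change in x. The values a = 3 and b = 1.5 are assumptions for illustration.

```python
import numpy as np

# Assumed noise-free power law: y = 3 * x^1.5
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = 3.0 * x ** 1.5

# Fit a straight line to the log-transformed variables
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
print(slope, np.exp(intercept))   # slope recovers b, exp(intercept) recovers a
```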
Question: When the linearity assumption is violated, one common approach is to apply a _________ transformation to the dependent variable, independent variables, or both.
Logarithmic Difficulty: Medium Explanation: A logarithmic transformation can be applied to the dependent variable, independent variables, or both when the linearity assumption of a regression model is violated. This transformation can help to linearize the relationships between variables, stabilize variances, and make the distribution of the variables more symmetric and closer to a normal distribution.
Question: The extension of linear regression to situations where the response variable is categorical is called _________ regression.
Logistic Difficulty: Medium Explanation: Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable. It's an extension of the linear regression model for scenarios where the outcome is a categorical variable.
Question: _________ regression is a modification of linear regression useful for models which predict probabilities that are expected to fall between 0 and 1, such as in binary classification tasks.
Logistic Difficulty: Medium Explanation: Logistic regression is a type of regression analysis used for predicting the probability of occurrence of an event. It is useful in models where the response variable is categorical, typically binary (e.g., yes/no or 0/1 outcomes).
Question: In linear regression, when the independent variables are highly correlated, it may lead to a problem known as _________.
Multicollinearity Difficulty: Hard Explanation: Multicollinearity is a problem that can arise in linear regression when two or more independent variables are highly correlated. It can cause instability in the coefficient estimates and make it difficult to determine the effect of each independent variable on the dependent variable.
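Example: a sketch of the variance inflation factor (VIF), a common diagnostic for multicollinearity that is not named in the card above. For predictor j, VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the other predictors. The function name and simulated data are assumptions; a frequently quoted rule of thumb treats values above about 10 as a warning sign.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        resid = target - A @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(7)
x1 = rng.normal(size=300)
x2 = x1 + 0.1 * rng.normal(size=300)   # highly correlated with x1
x3 = rng.normal(size=300)              # unrelated
v = vif(np.column_stack([x1, x2, x3]))
print(np.round(v, 1))
```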
Question: The generalization of simple linear regression to handle multiple independent variables is known as _________ regression.
Multiple Difficulty: Easy Explanation: Multiple regression is a generalization of simple linear regression to more than one independent variable. It allows you to understand the influence of several independent variables on a single dependent variable.
Question: In linear regression, the statistical technique to determine the relationship between a single dependent variable and several independent variables is known as _________ regression.
Multiple Difficulty: Easy Explanation: Multiple regression is a statistical technique used to predict the value of a single dependent variable based on the values of several independent variables.
Question: The residuals in a linear regression model should be normally distributed. This is an assumption called _________.
Normality Difficulty: Medium Explanation: The assumption of normality in linear regression implies that the residuals, or prediction errors, are normally distributed. This assumption is important for statistical inference tests that are used to interpret the significance of the coefficients.
Question: The method commonly used to estimate the parameters of a linear regression model is _________, which minimizes the sum of the squared residuals.
Ordinary least squares (OLS) Difficulty: Medium Explanation: Ordinary least squares (OLS) is a type of linear least squares method for estimating the unknown parameters in a linear regression model. It minimizes the sum of the squared residuals, which are the differences between the observed and predicted values.
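Example: a minimal sketch of an OLS fit using NumPy's least-squares solver. The true coefficients (1, 2, -3) and the simulated data are assumptions. At the OLS solution the residuals are orthogonal to every column of the design matrix, which is what minimizing the sum of squared residuals implies.

```python
import numpy as np

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.01, size=100)

A = np.column_stack([np.ones(100), X])            # intercept plus predictors
beta, *_ = np.linalg.lstsq(A, y, rcond=None)      # minimizes ||y - A @ beta||^2
resid = y - A @ beta
print(np.round(beta, 3))
```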
Question: In multiple regression, the inclusion of unnecessary predictors in the model is known as _________.
Overfitting Difficulty: Medium Explanation: Overfitting in multiple regression occurs when the model includes more predictor variables than necessary. An overfitted model tends to predict poorly on new data because it fits minor fluctuations (noise) in the training data rather than the underlying relationship.
Question: In linear regression, _________ is a measure of how much each predictor variable is contributing to the variance in the fitted model.
Partial R-squared Difficulty: Hard Explanation: Partial R-squared in linear regression is a measure of how much of the variance in the dependent variable is explained by a predictor variable while controlling for the effects of other predictors. It helps to understand the contribution of each predictor in the presence of others.
Question: _________ in linear regression refers to the scenario where an independent variable can be perfectly predicted from other independent variables.
Perfect multicollinearity Difficulty: Hard Explanation: Perfect multicollinearity is a condition in which one independent variable is an exact linear combination of the others, so it can be perfectly predicted from them. It poses a problem for linear regression because the coefficient estimates are no longer uniquely determined: the matrix XᵀX is singular, so the ordinary least squares solution cannot be computed without dropping or combining variables.
Question: _________ is an extension of the linear regression model that allows for non-linear relationships between the independent variables and the dependent variable by introducing interaction terms and polynomial terms.
Polynomial regression Difficulty: Medium Explanation: Polynomial regression is a form of regression analysis in which the relationship between the independent variable and the dependent variable is modelled as an nth degree polynomial. Polynomial regression can model relationships where the effect of the independent variables is not a straight line.
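Example: a small sketch showing that polynomial regression is still linear in its coefficients: the powers of x are simply extra columns in the design matrix. The quadratic y = 1 - 2x + 0.5x² is an assumed noise-free example.

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 25)
y = 1.0 - 2.0 * x + 0.5 * x ** 2        # assumed quadratic relationship

# Columns are x^0, x^1, x^2
A = np.vander(x, 3, increasing=True)
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(beta, 6))
```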
Question: In linear regression, _________ is a technique used to address overfitting by adding a penalty term to the loss function that depends on the size of the coefficients.
Regularization Difficulty: Hard Explanation: Regularization is a technique used to prevent overfitting in linear regression models by adding a penalty term to the loss function. The penalty term is typically a function of the model's coefficients, and it encourages the model to have smaller coefficients, which leads to a simpler, more generalized model.
Question: In linear regression, a plot of residuals against the predicted values is known as a _________ plot, which is used to check the assumptions of homoscedasticity and linearity.
Residual Difficulty: Medium Explanation: A residual plot is a graph that shows the residuals on the vertical axis and the predicted (fitted) values on the horizontal axis. If the points are randomly dispersed around the horizontal axis with roughly constant spread, a linear regression model is appropriate for the data; a curved pattern suggests non-linearity, and a funnel shape suggests heteroscedasticity.
Question: In a linear regression model, the difference between the observed value and the predicted value for any observation is called a _________.
Residual Difficulty: Medium Explanation: In the context of a linear regression model, a residual for an observation is the difference between the observed value and the predicted value given by the model. Residuals are used to check the goodness of fit of a model.
Question: The _________ plot in linear regression is a graphical representation of the residuals against the fitted values to detect heteroscedasticity or non-linearity.
Residual Difficulty: Medium Explanation: The residual plot is a graph that shows the residuals on the vertical axis and the fitted values on the horizontal axis. If the points in a residual plot are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a nonlinear model is more appropriate.
Question: In a linear regression model, the estimated standard deviation of the random error term is known as the _________.
Residual standard error Difficulty: Medium Explanation: The residual standard error is a measure of the amount of variability unexplained by the model. In other words, it is an estimate of the standard deviation of the random error term in the regression model.
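Example: a sketch of the usual formula, RSE = sqrt(RSS / (n - p - 1)), for a model with p predictors and an intercept. The simulated data use an error standard deviation of 2 (an assumption), so the estimate should land close to 2.

```python
import numpy as np

def residual_standard_error(X, y):
    """Estimate of the error standard deviation from an OLS fit."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ beta) ** 2)
    return np.sqrt(rss / (n - p - 1))      # n - p - 1 degrees of freedom

rng = np.random.default_rng(9)
X = rng.normal(size=(2000, 2))
y = 5.0 + X @ np.array([1.0, -1.0]) + rng.normal(scale=2.0, size=2000)
rse = residual_standard_error(X, y)
print(rse)
```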
Question: The goal of _________ regression is to modify the loss function from ordinary least squares regression to penalize large coefficients, thus preventing overfitting.
Ridge Difficulty: Hard Explanation: Ridge regression is a variant of linear regression that adds a regularization term to the loss function. The regularization term is the sum of the squared coefficients (scaled by a tuning parameter), which has the effect of penalizing large coefficients. This can help prevent overfitting by discouraging the learning of a more complex model.
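Example: a minimal sketch of the closed-form ridge solution, beta = (XᵀX + lam * I)⁻¹ Xᵀy, on centered data so that no intercept is penalized. The function name, the penalty value `lam=50`, and the simulated data are assumptions. Setting `lam=0` reduces to ordinary least squares, and a positive `lam` shrinks the coefficient vector.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge coefficients for centered X and y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(10)
X = rng.normal(size=(60, 4))
X -= X.mean(axis=0)
y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + rng.normal(size=60)
y -= y.mean()

beta_ols = ridge(X, y, lam=0.0)      # no penalty: same as OLS
beta_ridge = ridge(X, y, lam=50.0)   # penalized: smaller coefficients
print(np.round(beta_ols, 3), np.round(beta_ridge, 3))
```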
Question: When a single predictor variable is used to explain the variation in the dependent variable, the model is known as _________ linear regression.
Simple Difficulty: Easy Explanation: Simple linear regression is a statistical method that allows us to summarize and study relationships between two continuous (quantitative) variables: one variable is considered to be an explanatory variable (independent variable), and the other is considered to be a dependent variable.
Question: Linear regression models can be used for time series forecasting, but they assume that the data is _________, meaning that the mean, variance, and autocorrelation are constant over time.
Stationary Difficulty: Hard Explanation: Stationarity is an important concept in time series analysis. A stationary process has the property that the mean, variance, and autocorrelation structure do not change over time. Linear regression models, when applied to time series data, often assume that the data are stationary.
Question: In the context of linear regression, _________ is a measure of how well the model generalizes to unseen data.
Test error Difficulty: Medium Explanation: The test error of a model is a measure of how well the model performs on unseen data. It's often estimated by splitting the available data into a training set and a test set, fitting the model to the training set, and then evaluating it on the test set.
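Example: a rough sketch of the train/test procedure described above, using mean squared error as the error measure. The 80/20 split, the coefficients, and the simulated data with unit noise variance are assumptions, so the test MSE should come out near 1.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=1000)

# Shuffle the row indices, then hold out the last 20 percent as a test set
idx = rng.permutation(1000)
train, test = idx[:800], idx[800:]

# Fit on the training rows only
A_train = np.column_stack([np.ones(len(train)), X[train]])
beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)

# Evaluate on rows the model has not seen
A_test = np.column_stack([np.ones(len(test)), X[test]])
test_mse = np.mean((y[test] - A_test @ beta) ** 2)
print(test_mse)
```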