topic 5 HW guide
A nutritionist conducts a study to analyze the relationship between the number of calories consumed daily (X) and weight gain in kilograms (Y) of an individual. She uses a simple linear regression model to predict the weight gain based on the calories consumed. Given the linear regression equation: Y = 2 + 0.005*(Calories Consumed Daily) How much weight gain (in kilograms) does the model predict for an individual who consumes 2000 calories daily? A. 2kg B. 12kg C. 4kg D. 10kg
B. 12kg
- Y is the predicted weight gain in kilograms.
- 2 is the intercept (the predicted weight gain when no calories are consumed).
- 0.005 is the slope: the predicted weight gain per additional calorie consumed daily.
Substitute 2000 for "Calories Consumed Daily":
Y = 2 + 0.005*2000 = 2 + 10 = 12
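The substitution can also be checked with a few lines of Python. This is only a sketch; the function name predict_weight_gain is made up for illustration, and the intercept and slope are copied from the equation in the question.

```python
def predict_weight_gain(calories_per_day: float) -> float:
    """Apply Y = 2 + 0.005 * (Calories Consumed Daily) from the question."""
    intercept = 2.0   # predicted gain (kg) at 0 calories
    slope = 0.005     # predicted gain (kg) per extra daily calorie
    return intercept + slope * calories_per_day


prediction = predict_weight_gain(2000)
print(f"{prediction:.1f} kg")  # prints "12.0 kg"
```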
Based on the dataset provided below, which car feature is most likely to have the highest direct correlation with the car's price? Use Excel to answer this question.
Car                      1      2      3
Horsepower (x1)          150    200    120
Doors (x2)               4      2      4
Fuel Eff. (mi/gal) (x3)  25     20     30
Age (yrs) (x4)           3      5      1
Price (Y)                $18k   $25k   $15k
A. Age of the Car ($X_4$) B. Horsepower ($X_1$) C. Number of Doors ($X_2$) D. Fuel Efficiency ($X_3$)
B. Horsepower ($X_1$)
Solved in Excel with =CORREL(array1, array2), pairing each feature with Price:
- B. Horsepower (x1): =CORREL({150,200,120},{18000,25000,15000}) = 0.997, the strongest correlation
Use the same formula for the other features, substituting their values for horsepower:
- C. Doors (x2) = -0.956
- D. Fuel efficiency (x3) = -0.974
- A. Age of the car (x4) = 0.974
With only three cars, every feature correlates strongly with price, but horsepower has the largest coefficient.
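One way to cross-check the Excel CORREL results is to compute Pearson's r directly from its definition. The sketch below is plain Python rather than part of the Excel solution; the helper name pearson_r and the dictionary layout are illustrative, and the data is copied from the question.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


price = [18000, 25000, 15000]
features = {
    "Horsepower": [150, 200, 120],
    "Doors": [4, 2, 4],
    "Fuel efficiency": [25, 20, 30],
    "Age": [3, 5, 1],
}

correlations = {name: pearson_r(values, price) for name, values in features.items()}
for name, r in correlations.items():
    print(f"{name}: r = {r:.3f}")

# Feature with the largest correlation coefficient
best = max(correlations, key=correlations.get)
print("Highest correlation with price:", best)
```

Because the sample has only three cars, the coefficients are all large in magnitude, so this is best read as a check on the arithmetic rather than evidence about cars in general.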
The following figure shows a bar plot of the GDP of three countries in 1960. Which country shows the highest GDP among the three countries?
[Bar plot values: Brazil: 10,000,000; China: 50,000,000; India: 30,000,000]
A. China B. Brazil C. Both China and Brazil D. India
A. China
A survey is conducted to determine whether the age of a car and the type of the car influence the annual maintenance cost. A sample of 10 cars is selected and the data is shown below. You want to visualize the distribution of annual maintenance cost. Which type of plot will help to achieve the task?
Age of car (months)   Type of the car   Annual maintenance cost ($)
3                     Electric          120
5                     Electric          115
6                     Hybrid            135
7                     Hybrid            290
9                     IC                275
10                    IC                300
11                    IC                350
13                    IC                475
14                    IC                500
15                    IC                550
A. Histogram B. Scatter plot C. Pie chart D. Regression line
A. Histogram
A histogram is the most appropriate choice for visualizing the distribution of annual maintenance cost. It displays the frequency distribution of a continuous variable by showing how many observations fall into each range of values.
B. Scatter plot: shows the relationship between two continuous variables.
C. Pie chart: shows proportions or percentages of categories in categorical data.
D. Regression line: shows the fitted relationship between an independent and a dependent variable in regression analysis.
A _____ is used to visualize sample data graphically and to draw preliminary conclusions about the possible relationship between two quantitative variables. A. scatter chart B. Gantt chart C. contingency table D. pie chart
A. scatter chart
Scatter chart: used to graphically represent the relationship between two quantitative (numerical) variables.
Gantt chart: used in project management to visualize project schedules, tasks, and timelines.
Contingency table: used to summarize the relationship between two categorical variables.
Pie chart: used to represent proportional data, typically showing percentages or relative frequencies of categories.
A researcher is investigating the factors that influence the selling price of houses in a particular city. They collect data on the number of bedrooms, the total area of the house, the age of the house, and the distance from the city center. In this scenario, which of the following is the dependent variable? A. Selling price of the house B. Distance from the city center C. Total area of the house D. Number of bedrooms
A. Selling price of the house
The researcher is investigating the factors that influence the selling price of houses, so the selling price is the variable being influenced by the other factors (number of bedrooms, area, age, and distance from the city center). It is therefore the dependent variable.
B. Distance from the city center: an independent variable (predictor). The researcher is investigating how factors like this one influence the selling price.
C. Total area of the house: another independent variable. It is examined for its potential influence on the selling price.
D. Number of bedrooms: also an independent variable. The researcher is examining its relationship to the selling price, but it is not the variable being predicted.
You are an urban planner analyzing factors that influence housing prices in a metropolitan area. Among the variables you're examining are the distance to the nearest public transit station (TransitDistance) and the walking time to the nearest public transit station (WalkingTime). After analyzing the data, you find these two predictors have a correlation coefficient of 0.95. Given this scenario, which of the following statements best captures the situation? A. The high correlation between Transit Distance and Walking Time suggests that one variable might be redundant in predicting housing prices. B. Since Transit Distance and Walking Time are both related to public transit, their high correlation is coincidental. C. A high correlation coefficient ensures that both variables will be statistically significant in the regression analysis.
A. The high correlation between Transit Distance and Walking Time suggests that one variable might be redundant in predicting housing prices.
A correlation of 0.95 means the two predictors convey nearly the same information. This is multicollinearity: when two or more predictors are highly correlated with each other, it becomes difficult to determine the individual impact of each predictor on the dependent variable (in this case, housing prices). Because TransitDistance and WalkingTime are so closely related, one of them adds little beyond the other, so you could consider removing one to simplify the model without losing much predictive power.
In a linear regression model, the variable that is being predicted or explained is known as _____. A. dependent variable B. residual variable C. independent variable D. regression variable
A. dependent variable
In a linear regression model, the variable that is being predicted or explained is called the dependent variable (also known as the response variable or outcome variable). It is represented as Y in the regression equation:
Y = β0 + β1X1 + β2X2 + ... + βnXn
- Y = dependent variable (the one being predicted)
- X1, X2, ..., Xn = independent variables (predictors)
- β0 = intercept
- β1, β2, ..., βn = regression coefficients
Residual variable: a residual is the difference between the actual and predicted values in a regression model.
Independent variable: independent (predictor) variables are used to predict the dependent variable.
Regression variable: not a standard term in regression analysis.
Prediction of the mean value of the dependent variable y for values of the independent variables x1, x2, . . . , xq that are outside the experimental range is called _____. A. extrapolation B. interaction C. overfitting D. dummy variable
A. extrapolation
Extrapolation: predicting the mean value of the dependent variable (y) for values of the independent variables (x1, x2, ..., xq) that are outside the range of the observed data.
Interaction: a situation where the effect of one independent variable on the dependent variable depends on the value of another independent variable.
Overfitting: occurs when a model learns patterns that are too specific to the training data, capturing noise instead of true relationships.
Dummy variable: used to represent categorical variables in regression models.
Imagine you are a digital marketing analyst evaluating the impact of different variables on website traffic:
Variable          Ad spend (x1)   # of ads (x2)
Coefficient       0.090           0.015
Standard error    0.013           0.020
P-value           <0.001          0.190
Based on this table, where should you allocate more resources to increase online engagement effectively? (Hint: Your decision will be based on which variable influences website traffic.) A. Design more Ad Creatives. B. Increase Ad Spend C. Neither option will likely lead to increased engagement. D. Both are equally significant.
B. Increase Ad Spend
Coefficient: the strength and direction of the relationship between the variable and website traffic. A positive coefficient means an increase in that variable is associated with an increase in website traffic.
P-value: indicates whether the relationship between the variable and website traffic is statistically significant. A p-value less than 0.05 is typically considered statistically significant.
Ad spend (x1):
- Coefficient = 0.090: for every unit increase in ad spend, website traffic is expected to increase by 0.090 units (a positive impact).
- P-value < 0.001: much smaller than 0.05, so ad spend is a statistically significant predictor of website traffic.
# of ads (x2):
- Coefficient = 0.015 with p-value = 0.190: greater than 0.05, so there is not enough evidence that the number of ads affects website traffic.
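The p-values in the table come from test statistics of the form coefficient / standard error. As a rough illustration (not part of the original answer), the sketch below recomputes those ratios from the table. The exact p-values also depend on the degrees of freedom, which the question does not give, so only the ratios are shown; the dictionary and function names are illustrative.

```python
# Coefficients and standard errors copied from the table in the question.
estimates = {
    "Ad spend (x1)": {"coef": 0.090, "se": 0.013},
    "# of ads (x2)": {"coef": 0.015, "se": 0.020},
}


def t_statistic(coef: float, se: float) -> float:
    """Test statistic for the null hypothesis that the coefficient is 0."""
    return coef / se


t_stats = {name: t_statistic(v["coef"], v["se"]) for name, v in estimates.items()}
for name, t in t_stats.items():
    print(f"{name}: t = {t:.2f}")
```

A ratio far from zero (ad spend, roughly 6.9) goes with a very small p-value, while a ratio below 1 (number of ads, 0.75) goes with a large one, which matches the table.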
What is the standard method to compute the lower bound to detect outliers using the box plot ? (IQR: Inter-quartile range) A. Q2- 1.5*IQR B. Q1 - 1.5*IQR C. Q3 - 1.5*IQR D. Q3 + 1.5*IQR
B. Q1 - 1.5*IQR
Q1 is the first quartile (the 25th percentile of the data), and IQR = Q3 - Q1. Points below Q1 - 1.5*IQR are flagged as low outliers.
A. Q2 - 1.5*IQR: Q2 is the median (the 50th percentile of the data), not the reference point for the lower bound.
C. Q3 - 1.5*IQR: incorrectly uses Q3 (the third quartile) as the reference point.
D. Q3 + 1.5*IQR: this is the upper bound (the threshold beyond which data points are considered outliers on the high side).
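The two fences can be computed as in the sketch below. It uses the annual maintenance costs from the car survey questions in this guide as sample data, and Python's statistics.quantiles with method="inclusive". That quartile convention is an assumption here, and other conventions can give slightly different quartiles.

```python
from statistics import quantiles

# Annual maintenance cost ($) for the 10 surveyed cars
costs = [120, 115, 135, 290, 275, 300, 350, 475, 500, 550]

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3
q1, q2, q3 = quantiles(costs, n=4, method="inclusive")
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # lower bound for outliers
upper = q3 + 1.5 * iqr   # upper bound for outliers

outliers = [c for c in costs if c < lower or c > upper]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"lower bound={lower}, upper bound={upper}")
print("outliers:", outliers)
```

For this sample no value falls outside the fences, so a box plot of the costs would show no outlier points.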
A survey is conducted to determine whether the age of a car and the type of the car influence the annual maintenance cost. A sample of 10 cars is selected and the data is shown below. The coefficient of determination (R^2 or R-squared) is 0.82. What can you infer about the quality of the regression fit?
Age of car (months)   Type of the car   Annual maintenance cost ($)
3                     Electric          120
5                     Electric          115
6                     Hybrid            135
7                     Hybrid            290
9                     IC                275
10                    IC                300
11                    IC                350
13                    IC                475
14                    IC                500
15                    IC                550
A. R-squared does not explain anything about the regression model fitting B. R-squared is close to 1, hence the fit is good. C. R-squared is close to 1, hence the fit is NOT good. D. R-squared = SSE/SST
B. R-squared is close to 1, hence the fit is good.
The coefficient of determination (R-squared or R^2) measures the quality of a regression model's fit. It indicates how well the independent variables (predictors) explain the variability of the dependent variable (response).
- R-squared = 0.82 means that 82% of the variance in the dependent variable (annual maintenance cost) is explained by the independent variables (age of the car and type of the car).
- The closer R-squared is to 1, the better the model fits the data, so 0.82 indicates a relatively strong fit.
A. R-squared does not explain anything about the regression model fitting: incorrect. R-squared tells us how much of the variation in the dependent variable is explained by the model.
C. R-squared is close to 1, hence the fit is NOT good: incorrect. An R-squared close to 1 means the model explains a large portion of the variability in the data.
D. R-squared = SSE/SST: incorrect. The correct formula is R^2 = 1 - SSE/SST, where SSE is the sum of squared errors and SST is the total sum of squares. SSE/SST is the share of variability left unexplained, so subtracting it from 1 gives the share explained by the model.
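To make the formula R^2 = 1 - SSE/SST concrete, here is a minimal sketch. The actual and predicted values are hypothetical numbers chosen for illustration; they are not the car data, because the question does not give the fitted model.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - SSE/SST."""
    mean_actual = sum(actual) / len(actual)
    # SSE: squared gaps between actual and predicted values
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # SST: squared gaps between actual values and their mean
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sse / sst


# Hypothetical observations and model predictions (illustration only)
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.8, 5.3, 6.9, 9.2]

r2 = r_squared(actual, predicted)
print(f"R-squared = {r2:.3f}")
```

Here SSE is 0.18 and SST is 20, so R-squared is about 0.991: the predictions leave under 1% of the variability unexplained.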
You extracted a medical dataset with the following variables to predict ten-year risk of coronary heart disease (CHD):
1- Age
2- Gender
3- Systolic blood pressure
4- Heart rate
5- Glucose
6- Diabetes (YES or NO)
7- Ten-year risk of CHD (YES or NO)
Which of the following is your target variable? A. Heart rate B. Ten-year risk of CHD C. Diabetes D. Age
B. Ten-year risk of CHD
In a predictive modeling problem, the target variable (dependent variable) is the one we are trying to predict. Here we are trying to predict whether a person has a ten-year risk of coronary heart disease (YES or NO) based on the other features: age, gender, blood pressure, heart rate, glucose, and diabetes status.
A. Heart rate: an independent (predictor) variable, not the target.
C. Diabetes: also a predictor variable, used to help determine CHD risk.
D. Age: another predictor variable in the dataset.
A mathematical procedure for using sample data to estimate regression parameters is _____. A. interval estimation B. The least squares method C. extrapolation D. point estimation
B. The least squares method
The least squares method finds the line that minimizes the sum of the squared differences between the observed Y values and the Y values predicted by the regression line.
Interval estimation: calculating a range (e.g., a confidence interval) within which a population parameter is likely to fall.
Extrapolation: predicting values outside the observed data range using an existing regression model.
Point estimation: using a single value (like a sample mean) to estimate a population parameter.
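For a single predictor, the least squares estimates have a closed form: slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). The sketch below applies it to the age and maintenance-cost columns from the car survey in this guide; the function name least_squares_fit is illustrative and not from the course material.

```python
def least_squares_fit(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope


age_months = [3, 5, 6, 7, 9, 10, 11, 13, 14, 15]
maintenance_cost = [120, 115, 135, 290, 275, 300, 350, 475, 500, 550]

intercept, slope = least_squares_fit(age_months, maintenance_cost)
print(f"cost = {intercept:.2f} + {slope:.2f} * age")

# Residuals of a least squares line with an intercept sum to (about) zero
residuals = [y - (intercept + slope * x) for x, y in zip(age_months, maintenance_cost)]
print(f"sum of residuals = {sum(residuals):.6f}")
```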
What can we infer from the following plot ? Consider x-axis represents variable X1 and y-axis represents variable X2. A. X1 and X2 have positive linear relationship B. X1 and X2 have non-linear relationship C. Coefficient of correlation between X1 and X2 is 0 D. X1 and X2 have negative linear relationship
B. X1 and X2 have non-linear relationship
When the data points form a curved or non-straight pattern (e.g., a U-shape or an S-shape), it suggests a non-linear relationship between X1 and X2.
A. X1 and X2 have positive linear relationship: would require the data points to follow an upward-sloping straight line (from left to right), so that as X1 increases, X2 also increases.
C. Coefficient of correlation between X1 and X2 is 0: would fit points scattered randomly with no discernible pattern (neither increasing nor decreasing), implying no linear relationship between X1 and X2.
D. X1 and X2 have negative linear relationship: would require the data points to follow a downward-sloping straight line (from left to right), so that as X1 increases, X2 decreases.
You are a real estate analyst who developed a regression model to predict house prices based on various attributes such as size, location, number of rooms, etc. You then assess the overall effectiveness and fit of your model and find the following data:
Model                    Model 2
R-squared                0.812
Adjusted R-squared       0.805
P-value (F-statistic)    <0.001
Based on the given table, which statement most accurately describes the efficacy of the real estate price prediction model? A. The model might be ineffective because the Adjusted R-squared is lower than the R-squared. B. Since the p-value is less than 0.05, the model isn't statistically significant. C. Adjusting for the number of predictors, the model accounts for 80.5% of the variation in house prices. D. The model is ineffective as the R-squared value is less than 1.
C. Adjusting for the number of predictors, the model accounts for 80.5% of the variation in house prices.
R-squared (0.812):
- 0.812 means your model explains about 81% of the variation in house prices.
- So if you're trying to guess a house's price based on things like size or location, your model is doing a really good job: it's capturing most of the important stuff.
Adjusted R-squared (0.805):
- It slightly penalizes the model for having too many features (like size, location, etc.) that might not actually help much. That is why it is always a bit lower than R-squared, which rules out option A.
- In this case it's 0.805, so after adjusting for extra variables, the model still explains about 80.5% of the variation in house prices, which is still really good.
P-value (F-statistic) < 0.001:
- A p-value less than 0.001 means it's very unlikely that a fit this good arose just by chance. In other words, your model is statistically significant, so you can trust that at least one of the things you included (like size, location, etc.) really is related to house prices.
Please refer to the following correlation matrix. Which pair of variables shows the strongest linear relationship? The variables are: CARDPROM, NUMPROM, RAMNTALL, MAXRAMNT and LASTGIFT. Each cell in the matrix shows the correlation coefficient between two variables. For example, the correlation coefficient between CARDPROM and NUMPROM is 0.949.
Correlations   cardprom   numprom   ramntall   maxramnt   lastgift
cardprom       1.000      0.949     0.550      0.023      -0.059
numprom        0.949      1.000     0.624      0.066      -0.024
ramntall       0.550      0.624     1.000      0.557      0.324
maxramnt       0.023      0.066     0.557      1.000      0.563
lastgift       -0.059     -0.024    0.324      0.563      1.000
A. Ramntall and Maxramnt B. Cardprom and lastgift C. Cardprom and numprom D. Lastgift and cardprom
C. Cardprom and numprom
The pair with the strongest linear relationship is the one whose correlation coefficient is largest in absolute value, ignoring the 1.000 entries on the diagonal. The correlation between CARDPROM and NUMPROM is 0.949, the highest off-diagonal value in the matrix.
A. Ramntall and Maxramnt: 0.557, a moderate correlation.
B. Cardprom and lastgift: -0.059, a very weak negative relationship.
D. Lastgift and cardprom: the same pair as answer B (-0.059).
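Scanning a correlation matrix by eye gets error-prone as it grows, so the same lookup can be scripted. The sketch below copies the matrix from the question and searches the off-diagonal entries for the largest absolute value; the variable names are illustrative.

```python
names = ["cardprom", "numprom", "ramntall", "maxramnt", "lastgift"]
corr = [
    [1.000, 0.949, 0.550, 0.023, -0.059],
    [0.949, 1.000, 0.624, 0.066, -0.024],
    [0.550, 0.624, 1.000, 0.557, 0.324],
    [0.023, 0.066, 0.557, 1.000, 0.563],
    [-0.059, -0.024, 0.324, 0.563, 1.000],
]

# The matrix is symmetric, so only the pairs above the diagonal are needed.
pairs = [
    (abs(corr[i][j]), names[i], names[j])
    for i in range(len(names))
    for j in range(i + 1, len(names))
]

strength, var_a, var_b = max(pairs)
print(f"Strongest linear relationship: {var_a} and {var_b} (|r| = {strength})")
```

Taking the absolute value matters: a correlation of -0.9 would indicate a stronger linear relationship than one of +0.6.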
Let us say that we have a set of emails which are labeled as spam or not spam. Using this data, your task is to determine which future emails are potentially spam. What kind of analytics is it? A. Both predictive and prescriptive B. Prescriptive analytics C. Predictive analytics D. Descriptive analytics
C. Predictive analytics The given task involves using past data (emails labeled as spam or not spam) to predict whether a new (future) email is spam
Given the partial Excel output from a multiple regression, write the equation of the regression model.
            Coefficients   Standard error
Intercept   37,375.357     3,721.625
x1          55.655         9.370
x2          -5.750         3.575
x3          0.213          0.373
A. Y = 37,375.357 - 55.655x1 + 5.750x2 - 0.213x3 B. Y = 3,721.625 + 9.370x1 - 3.575x2 + 0.373x3 C. Y = 37,375.357 + 55.655x1 - 5.750x2 + 0.213x3 D. Y = 55.655 + 37,375.357x1 - 3.575x2 + 0.213x3
C. Y = 37,375.357 + 55.655x1 - 5.750x2 + 0.213x3
The multiple regression equation follows this general format: Y = Intercept + b1x1 + b2x2 + b3x3, where the values come from the Coefficients column (not the Standard error column):
- Intercept = 37,375.357
- Coefficient for x1 = 55.655
- Coefficient for x2 = -5.750
- Coefficient for x3 = 0.213
A multiple regression model for predicted heart rate is as follows: heart rate = 10 - 0.5*(run speed) + 12*(body weight). As the run speed increases by 1 unit (holding body weight constant), heart rate is expected to? A. increase by 12 B. increase by 0.4 C. decrease by 0.5 D. decrease by 12
C. decrease by 0.5
- The coefficient of run speed is -0.5.
- This means that for every 1-unit increase in run speed, the predicted heart rate decreases by 0.5, holding body weight constant.
- The negative sign in front of 0.5 indicates a negative relationship between run speed and heart rate.
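The "holding body weight constant" reading can be verified numerically: evaluate the model at two run speeds that differ by 1 while keeping body weight fixed, then take the difference. The specific run speeds and body weight below are hypothetical values chosen only for illustration.

```python
def predicted_heart_rate(run_speed: float, body_weight: float) -> float:
    """heart rate = 10 - 0.5*(run speed) + 12*(body weight), from the question."""
    return 10 - 0.5 * run_speed + 12 * body_weight


body_weight = 70.0  # hypothetical value, held constant
change = predicted_heart_rate(6.0, body_weight) - predicted_heart_rate(5.0, body_weight)
print(f"Change in predicted heart rate: {change:+.1f}")  # prints -0.5
```

The body weight term cancels in the difference, so the result is the run speed coefficient regardless of which weight is chosen.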
In a simple linear regression model, y = β0 + β1x, the parameter β1 represents the _____. A. error term B. intercept C. slope of the true regression line D. mean value of x
C. slope of the true regression line
- y = dependent variable (the predicted value)
- x = independent variable (predictor)
- β0 = intercept (value of y when x = 0)
- β1 = slope (the rate at which y changes as x increases by 1 unit)
β1 represents the slope of the true regression line, meaning it shows how much y changes for a one-unit increase in x.
Error term: the error term ε accounts for random variation not explained by the model.
Intercept: the intercept (β0) represents the expected value of y when x = 0.
Mean value of x: a summary statistic representing the average value of the independent variable.
According to company records, 5% of all automobiles brought to Geoff's Garage last year for a state-mandated annual inspection did not pass. Of the next 10 automobiles entering the inspection station, what is the probability that more than 5 will not pass inspection? A. =BINOM.DIST(5, 10, 0.05, FALSE) B. =BINOM.DIST(5, 10, 0.05, TRUE) C. =1-BINOM.DIST(5, 10, 0.05, TRUE) D. =POISSON.DIST(5, 10, TRUE)
C. =1-BINOM.DIST(5, 10, 0.05, TRUE)
This computes P(X > 5), the probability that more than 5 cars fail. The binomial distribution applies because:
- Each car either passes or fails (two possible outcomes).
- The probability of failure is constant at p = 0.05.
- The number of trials (n = 10) is fixed.
- The results of each car inspection are independent.
P(X > 5) = 1 - P(X ≤ 5)
A. =BINOM.DIST(5, 10, 0.05, FALSE): P(X = 5), the probability that exactly 5 cars fail.
B. =BINOM.DIST(5, 10, 0.05, TRUE): P(X ≤ 5), the cumulative probability of 5 or fewer failures.
D. =POISSON.DIST(5, 10, TRUE): the Poisson distribution is used for counting events over a fixed interval, not for a fixed number of independent trials.
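The complement rule behind the Excel formula can be reproduced from the binomial probability mass function. This is a sketch outside Excel; binom_pmf and binom_cdf are small helpers written for this example, intended to play the roles of BINOM.DIST with FALSE and TRUE as the last argument.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k): sum of the pmf from 0 through k."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))


n, p = 10, 0.05

# P(X > 5) = 1 - P(X <= 5)
p_more_than_5 = 1 - binom_cdf(5, n, p)

# Same probability computed directly as P(X = 6) + ... + P(X = 10)
p_direct = sum(binom_pmf(k, n, p) for k in range(6, n + 1))

print(f"1 - P(X <= 5)    = {p_more_than_5:.3e}")
print(f"sum of P(X = 6..10) = {p_direct:.3e}")
```

Both routes give a probability on the order of a few in a million, which makes sense: with a 5% failure rate, seeing more than half of 10 cars fail is extremely rare.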
A survey is conducted to determine whether the age of a car and the type of the car influence the annual maintenance cost. A sample of 10 cars is selected and the data is shown below. You want to detect whether there are any outliers in the variable "Age of car (months)". Which of the following is the best visualization technique to do so?
Age of car (months)   Type of the car   Annual maintenance cost ($)
3                     Electric          120
5                     Electric          115
6                     Hybrid            135
7                     Hybrid            290
9                     IC                275
10                    IC                300
11                    IC                350
13                    IC                475
14                    IC                500
15                    IC                550
A. Bar plot B. Pie chart C. Scatter plot D. Box plot
D. Box plot
A box plot (also known as a box-and-whisker plot) is the most appropriate visualization technique for detecting outliers in a continuous variable. It displays the distribution of the data, showing the median, the interquartile range (IQR), and the range of the data, with potential outliers drawn as individual points beyond the whiskers.
A. Bar plot: typically used for categorical variables to show the frequency or count of different categories.
B. Pie chart: used to show proportions or percentages of categories in categorical data.
C. Scatter plot: used to show the relationship between two continuous variables.
A nutritionist wants to study the effect of various factors like calorie intake, exercise duration, and age on a person's weight. In a regression model where the weight is represented by Y, which of the following correctly represents the independent variables? A. Weight B. None of the above C. Calorie intake only D. Calorie intake, exercise duration, and age
D. Calorie intake, exercise duration, and age
In regression analysis, the independent variables (also called predictors or explanatory variables) are the factors believed to influence or explain the dependent variable (the outcome). Here the dependent variable is weight (Y). The nutritionist wants to study the effect of calorie intake, exercise duration, and age on weight, so these three factors are the independent variables.
A survey is conducted to determine whether the age of a car and the type of the car influence the annual maintenance cost. A sample of 10 cars is selected and the data is shown below.
Age of car (months)   Type of the car   Annual maintenance cost ($)
3                     Electric          120
5                     Electric          115
6                     Hybrid            135
7                     Hybrid            290
9                     IC                275
10                    IC                300
11                    IC                350
13                    IC                475
14                    IC                500
15                    IC                550
What is the variable type for the variable "Type of the Car"? A. Integer B. Continuous C. Quantitative D. Categorical
D. Categorical
The car types (Electric, Hybrid, IC) represent distinct groups rather than numerical values.
A. Integer: a whole number, typically used for quantitative variables (e.g., number of cars).
B. Continuous: continuous variables are numerical and can take any value within a range (e.g., height, weight, or temperature).
C. Quantitative: quantitative variables are numerical and can be measured or counted (e.g., annual maintenance cost, age of the car).
A variable used to model the effect of categorical independent variables in a regression model is a _____. A. dependent variable B. predictor variable C. response D. dummy variable
D. dummy variable
Dummy variable: used in regression analysis to represent categorical independent variables (e.g., gender, region, or type of product). Categorical variables are converted into binary (0 or 1) dummy variables that indicate the presence or absence of a particular category.
Dependent variable: the dependent variable (also called the response variable) is what we are trying to predict in regression analysis.
Predictor variable: a predictor variable (also called an independent variable) is any variable used to predict the outcome of the dependent variable.
Response: the response variable is another term for the dependent variable (the outcome we measure in a study).
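As an illustration of the 0/1 conversion, the sketch below encodes the "Type of the car" column from the car survey questions in this guide. It assumes the fourth car is a Hybrid and treats IC as the baseline category that gets no column of its own; both the baseline choice and the helper name dummy_encode are illustrative.

```python
# "Type of the car" for the 10 surveyed cars (fourth entry assumed to be Hybrid)
car_types = ["Electric", "Electric", "Hybrid", "Hybrid",
             "IC", "IC", "IC", "IC", "IC", "IC"]


def dummy_encode(values, baseline):
    """Create one 0/1 column per category, except the baseline category."""
    levels = sorted(set(values) - {baseline})
    return [
        {f"is_{level}": int(value == level) for level in levels}
        for value in values
    ]


rows = dummy_encode(car_types, baseline="IC")
for car_type, row in zip(car_types, rows):
    print(car_type, row)
```

Three categories need only two dummy columns: a car with both dummies equal to 0 must be the baseline (IC). Leaving out one category avoids a column that is fully determined by the others.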