4300 Quizzes
You are hired by a real estate agency to predict housing values in neighborhoods around Boston. The response variable is medv, the median house value in thousand dollars ($1,000) for 506 neighborhoods. You train your model with a single predictor: rm, the average number of rooms per dwelling. You estimate the following linear regression model in Python: medv = regressor.predict(rm)
Summary output:
Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   -34.671       2.650   -13.08    <2e-16 ***
rm              9.102       0.419    21.72    <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.616 on 504 degrees of freedom
Multiple R-squared: 0.4835, Adjusted R-squared: 0.4825
F-statistic: 471.8 on 1 and 504 DF, p-value: < 2.2e-16
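Use the summary output above to answer the two questions about this model (the predicted house value and the confidence interval for the slope).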
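Why should we NOT use linear regression when the response is qualitative/categorical?
- Linear regression can produce predicted probabilities below 0 or above 1.
- Linear regression treats the numeric codes of the categories as ordered (and evenly spaced), even when the outcomes have no natural ordering.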
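The first point can be checked with a small sketch. The data below are toy values made up for illustration (they are not from the quiz), and `fit_simple_ols` is a hypothetical helper that fits a one-predictor least-squares line to a 0/1 coded outcome.

```python
# Toy data, made up for illustration: x is a predictor, y is a 0/1 coded outcome.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 0, 1, 1, 1, 1]

def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    sxx = sum((a - x_bar) ** 2 for a in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

intercept, slope = fit_simple_ols(x, y)

# Fitted values read as "probabilities" of y = 1.
fitted = [intercept + slope * xi for xi in x]
print(round(min(fitted), 3), round(max(fitted), 3))  # -0.167 1.167
```

The smallest fitted value is negative and the largest exceeds 1, so neither can be interpreted as a probability.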
Which of the following simple linear regressions will have the lower R-squared value?
- Plot B on the right (the points lie further from the regression line, so the fit is worse and R-squared is lower)
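Plots A and B are not reproduced here, so the sketch below uses made-up toy data to show the same idea: when points sit further from the fitted line, R-squared drops. The helper `r_squared` and both datasets are assumptions introduced for illustration.

```python
def r_squared(xs, ys):
    """R-squared of the least-squares line of ys on xs: 1 - SS_res / SS_tot."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
             / sum((a - x_bar) ** 2 for a in xs))
    intercept = y_bar - slope * x_bar
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(xs, ys))
    ss_tot = sum((b - y_bar) ** 2 for b in ys)
    return 1 - ss_res / ss_tot

# Toy data (made up): the same underlying line y = 2x with small vs. large deviations.
x = [1, 2, 3, 4, 5, 6]
offsets = [1, -1, 1, -1, 1, -1]
y_tight = [2 * a + 0.2 * e for a, e in zip(x, offsets)]  # points close to the line
y_loose = [2 * a + 3.0 * e for a, e in zip(x, offsets)]  # points far from the line

r2_tight = r_squared(x, y_tight)
r2_loose = r_squared(x, y_loose)
print(r2_tight > r2_loose)  # True
```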
In multiple linear regression, with a large number of coefficients and predictors, we calculate the average effect of a predictor in such a way that...? Please check one of the following answers.
...we hold all other predictors constant while varying one
Write down the equation for calculating the median house value in a neighborhood with an average number of 5 rooms per dwelling. You can round numbers to the nearest integer to simplify the calculation.
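= 9*5 - 35 = 10 (rounded), i.e. a predicted median value of about $10,000. Because you are given the estimates for the intercept and the rm coefficient, you can calculate the predicted median value as (rm coefficient * number of rooms + intercept).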
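A minimal sketch of this calculation in Python, using the coefficient estimates from the summary output. The function name `predict_medv` is hypothetical and introduced only for illustration.

```python
# Coefficient estimates copied from the summary output.
INTERCEPT = -34.671
RM_SLOPE = 9.102

def predict_medv(rooms, intercept=INTERCEPT, slope=RM_SLOPE):
    """Predicted median house value (in $1,000s) for a given average room count."""
    return intercept + slope * rooms

rounded = predict_medv(5, intercept=-35, slope=9)  # quiz shortcut with rounded estimates
exact = predict_medv(5)                            # unrounded estimates

print(rounded)          # 10
print(round(exact, 3))  # 10.839
```

Both versions give a predicted median value of roughly $10,000 to $11,000 for a neighborhood averaging 5 rooms per dwelling.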
The following two plots show two different data science models fitted to the same data: a linear regression model (left) and a smoothing spline (right). Which model has the smaller bias?
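= Right model (smoothing spline) - the spline is more flexible, so it follows the data more closely and has lower bias, at the cost of higher variance (risk of overfitting). The linear regression is less flexible: lower variance, but higher bias (underfitting) when the true relationship is not linear.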
Calculate the 95% confidence interval for the slope coefficient. You can round numbers to one decimal place to simplify the calculation. Hint: You need to calculate two values: the lower bound and the upper bound of the interval.
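= [8.3, 9.9] Formula: estimate ± (z-value for the confidence level * standard error of the estimate) = 9.1 ± 1.96 * 0.4. The standard error of the slope is read directly from the summary output; the sample-mean version of the formula, mean ± z * (standard deviation / sqrt(n)), has the same form. - Z-score for a 95% confidence level = 1.960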
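The interval can be reproduced with a short sketch. The estimate and standard error come from the summary output; `confidence_interval` is a hypothetical helper name, and using the normal critical value 1.96 (rather than the t value for 504 degrees of freedom, which is nearly identical) is an assumption that matches the quiz answer.

```python
ESTIMATE = 9.102   # slope estimate for rm, from the summary output
STD_ERROR = 0.419  # its standard error, from the summary output
Z_95 = 1.96        # normal critical value for a 95% confidence level

def confidence_interval(estimate, std_error, critical=Z_95):
    """Return (lower, upper) = estimate -/+ critical * std_error."""
    margin = critical * std_error
    return estimate - margin, estimate + margin

lower, upper = confidence_interval(ESTIMATE, STD_ERROR)
print(round(lower, 1), round(upper, 1))  # 8.3 9.9
```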
You have data on the number of assaults and the number of murders in each of the 50 US states. Which of the following types of plots is an appropriate way to visualize this data?
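= scatter plot - there are two quantitative variables measured for each state, and a scatter plot shows the relationship between them (a histogram would only show the distribution of one variable)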
A consulting company collects data on the top 500 firms in the US. For each firm they record CEO salary, annual profit, number of employees, and type of industry. They ask you to build a data science model that explains CEO salary. Is this a problem of supervised learning or unsupervised learning?
= supervised
You are trying to predict how many Oscars a movie will win at the next Academy Awards ("The Oscars"). You have data for all Oscar-nominated movies from the last 10 years, including how many Oscars a movie has won, its genre, average review scores from film critics, production budget, and total ticket sales on the first weekend. Is this a problem of supervised learning or unsupervised learning?
= supervised - the data is labeled: the response (number of Oscars won) is observed for every movie
A hospital has a large dataset of medical records from patients suffering from diabetes. You are asked to identify different patient groups based on gender, age, weight, activity level, and family history for which the hospital might tailor separate treatments. Is this a problem of supervised learning or unsupervised learning?
= unsupervised
In which of the following situations do we use multiple linear regression? Check one of the following answers.
We have multiple predictors for the same quantitative response