Machine learning

3. Selection and summary statistics: We found the zip code with the highest average house price. What is the average house price of that zip code?

$2,160,607

134. From the section "Effect of L1 penalty": Consider the simple model with 2 features trained on the entire sales dataset. Which of the following values of l1_penalty would not set w[1] to zero, but would set w[2] to zero, if we were to take a coordinate gradient step in that coordinate? (Select all that apply)

- 1.64e8
- 1.73e8

135. Refer to the same model as the previous question. Which of the following values of l1_penalty would set both w[1] and w[2] to zero, if we were to take a coordinate gradient step in that coordinate? (Select all that apply)

- 1.9e8
- 2.3e8

142. Which of the following datasets is best suited to nearest neighbor or kernel regression? Choose all that apply.

- A dataset with two features whose observations are evenly scattered throughout the input space
- A dataset with many observations

45. Which of the following statements is true? (Check all that apply)

- Features in computer vision act like local detectors.
- By learning non-linear features, neural networks have allowed us to automatically learn detectors for computer vision.

100. In ridge regression, choosing a large penalty strength λ tends to lead to a model with (choose all that apply):

- High bias
- Low variance

68. Which of the following statements about step-size in gradient descent is/are TRUE (select all that apply)

- If the step size is too large, gradient descent may not converge
- If the step size is too small (but not zero), gradient descent may take a very long time to converge
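Both statements can be checked on a toy problem. The sketch below is an illustration only: the objective f(w) = w², the starting point, and the three step sizes are assumptions chosen for clarity, not values from the course.

```python
def gradient_descent(step_size, w0=1.0, n_iters=50):
    """Minimize f(w) = w**2 (gradient 2*w) from w0 with a fixed step size."""
    w = w0
    for _ in range(n_iters):
        w = w - step_size * 2.0 * w  # standard gradient descent update
    return w

# Too large: each update multiplies w by (1 - 2.2) = -1.2, so |w| grows.
w_large = gradient_descent(step_size=1.1)
# Too small: each update multiplies w by 0.9998, so w barely moves in 50 steps.
w_small = gradient_descent(step_size=1e-4)
# Moderate: each update multiplies w by 0.5, so w shrinks quickly toward 0.
w_good = gradient_descent(step_size=0.25)

print(w_large, w_small, w_good)
```

On this quadratic, any step size above 1.0 makes the iterates diverge, while a very small one still converges but needs far more iterations.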

93. Selecting model complexity on test data (choose all that apply):

- Provides an overly optimistic assessment of performance of the resulting model
- Should never be done

160. Which of the following products are represented in the 20 most negative reviews?

- The First Years True Choice P400 Premium Digital Monitor, 2 Parent Unit
- Peg-Perego Tatamia High Chair, White Latte
- Safety 1st High-Def Digital Monitor

140. This question refers to the same model as the previous question. In the model trained with l1_penalty=1e4, which of the following features has non-zero weight? (Select all that apply)

- constant
- sqft_living
- grade
- waterfront
- sqft_basement

138. In the section "Evaluating LASSO fit with more features", we split the data into training and test sets and learn weights with varying degree of L1 penalties. The model now has 13 features. In the model trained with l1_penalty=1e7, which of the following features has non-zero weight? (Select all that apply)

- constant
- sqft_living
- waterfront

133. Consider the model learned with the l1_penalty found in the previous question. Which of the following features has non-zero coefficients? (Choose all that apply)

- sqft_living
- bathrooms

128. We learn weights on the entire house dataset, using an L1 penalty of 1e10 (or 5e2, if using scikit-learn). Some features are transformations of inputs; see the reading. Which of the following features have been chosen by LASSO, i.e. which features were assigned nonzero weights? (Choose all that apply)

- sqft_living
- grade

57. Your friend in the U.S. gives you a simple regression fit for predicting house prices from square feet. The estimated intercept is -44850 and the estimated slope is 280.76. You believe that your housing market behaves very similarly, but houses are measured in square meters. To make predictions for inputs in square meters, what intercept must you use? Hint: there are 0.092903 square meters in 1 square foot.

-44850
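The intercept is unchanged because rescaling the input only rescales the slope. A minimal sketch using the numbers from the question (the function names and the 2000 sq ft test house are illustrative assumptions):

```python
SQM_PER_SQFT = 0.092903  # conversion factor given in the question's hint

intercept_sqft = -44850.0
slope_sqft = 280.76      # dollars per square foot

# price = intercept + slope_sqft * sqft, and sqft = sqm / SQM_PER_SQFT,
# so only the slope changes when the input unit changes.
intercept_sqm = intercept_sqft
slope_sqm = slope_sqft / SQM_PER_SQFT  # dollars per square meter

def predict_from_sqft(sqft):
    return intercept_sqft + slope_sqft * sqft

def predict_from_sqm(sqm):
    return intercept_sqm + slope_sqm * sqm

# The same house described in either unit gets the same predicted price.
house_sqft = 2000.0
print(predict_from_sqft(house_sqft), predict_from_sqm(house_sqft * SQM_PER_SQFT))
```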

74. What is the mean value (arithmetic average) of the 'lat_plus_long' feature on TEST data? (round to 2 decimal places)

-74.65

149. From the section "Compute a single distance": we take our query house to be the first house of the test set. What is the Euclidean distance between the query house and the 10th house of the training set? Enter your answer in American-style decimals (e.g. 0.044) rounded to 3 decimal places.

0.060

167. Enter the accuracy of the majority class classifier model on the test_data. Round your answer to two decimal places (e.g. 0.76).

0.84

161. What is the accuracy of the sentiment_model on the test data? Round your answer to 2 decimal places (e.g. 0.76).

0.91

147. Suppose you are creating a website to help shoppers pick houses. Every time a user of your website visits the webpage for a specific house, you want to compute a prediction of the house value. You are using 1-NN to make the prediction and have 100,000 houses in the dataset, with each house having 100 features. Computing the distance between the features of two houses takes about 10 microseconds. Assuming the cost of all other operations involved (e.g., fetching data, etc.) is negligible, about how long will it take to make a prediction using the brute-force method described in the videos?

1 second
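The arithmetic behind this answer, as a short sketch: brute-force 1-NN computes one distance per stored house, so the query time is the number of houses times the cost of one distance (variable names are illustrative).

```python
n_houses = 100_000
seconds_per_distance = 10e-6  # 10 microseconds per pairwise distance

# Brute force compares the query against every house in the dataset once.
query_seconds = n_houses * seconds_per_distance
print(query_seconds)
```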

163. Consider the coefficients of simple_model. There should be 21 of them, an intercept term + one for each word in significant_words. How many of the 20 coefficients (corresponding to the 20 significant_words and excluding the intercept term) are positive for the simple_model?

112. This question refers to the section "selecting an L2 penalty via cross-validation". What is the best value for the L2 penalty according to 10-fold validation?

1000

123. Given 20 potential features, how many models do you have to evaluate in the all subsets algorithm?

1048576

71. What is the mean value (arithmetic average) of the 'bedrooms_squared' feature on TEST data? (round to 2 decimal places)

12.45

115. This question refers to the same model as the previous question. What is the value of the coefficient for sqft_living that you learned with high regularization (l2_penalty=1e11)? Use American-style decimals (e.g. 30.5) and round your answer to 1 decimal place.

124.6

130. Using the best value of l1_penalty as mentioned in the previous question, how many nonzero weights do you have?

124. Given 20 potential features, how many models do you have to evaluate if you are running the forward stepwise greedy algorithm? Assume you run the algorithm all the way to the full feature set.

210
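The counts for the all subsets algorithm (question 123) and the forward stepwise algorithm (question 124) follow from simple formulas. A sketch, with illustrative variable names:

```python
n_features = 20

# All subsets: each feature is either in or out of the model.
all_subsets_models = 2 ** n_features

# Forward stepwise: at step k (k features already chosen) we try each of the
# remaining n_features - k candidates, giving 20 + 19 + ... + 1 fits in total.
forward_stepwise_models = sum(n_features - k for k in range(n_features))

print(all_subsets_models, forward_stepwise_models)
```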

118. We run ridge regression to learn the weights of a model that has two features (sqft_living, sqft_living15), once with l2_penalty=0.0 and once with l2_penalty=1e11. What is the value of the coefficient for sqft_living that you learned with no regularization, rounded to 1 decimal place? Use American-style decimals (e.g. 30.5).

243.1

152. From the section "Perform 1-nearest neighbor regression": Take the query house to be third house of the test set (features_test[2]). What is the predicted value of the query house based on 1-nearest neighbor regression? Enter your answer in simple decimals without comma separators (e.g. 300000), rounded to nearest whole number.

249000

114. We run ridge regression to learn the weights of a simple model that has a single feature (sqft_living), once with l2_penalty=0.0 and once with l2_penalty=1e11. What is the value of the coefficient for sqft_living that you learned with no regularization, rounded to 1 decimal place? Use American-style decimals (e.g. 30.5).

263.0

79. What is the value of the weight for sqft_living from your gradient descent predicting house prices (model 1)? Round your answer to 1 decimal place.

281.91

60. According to the inverse regression function and the regression slope and intercept from predicting prices from square-feet, what is the estimated square-feet for a house costing $800,000?

3004

132. We then explore the narrow range of l1_penalty values between l1_penalty_min and l1_penalty_max.

3448968612.16

49. For the first image in the test data, in what range is the mean distance between this image and its 5 nearest neighbors that were labeled 'cat' in the training data?

35 to 37

80. What is the predicted price for the 1st house in the TEST data set for model 1 (round to nearest dollar)?

356134.44

81. What is the predicted price for the 1st house in the TEST data set for model 2 (round to nearest dollar)?

366651.41

50. For the first image in the test data, in what range is the mean distance between this image and its 5 nearest neighbors that were labeled 'dog' in the training data?

37 to 39

131. We explore a wide range of l1_penalty values to find a narrow region of l1_penalty values where models are likely to have the desired number of non-zero weights (max_nonzeros=7).

3792690190.73

151. From the section "Perform 1-nearest neighbor regression": Take the query house to be third house of the test set (features_test[2]). What is the (0-based) index of the house in the training set that is closest to this query house?

382

154. From the section "Perform k-nearest neighbor regression": Take the query house to be third house of the test set (features_test[2]). Predict the value of the query house by the simple averaging method. Enter your answer in simple decimals without comma separators (e.g. 241242), rounded to nearest whole number.

413988

148. For the housing website described in the previous question, you learn that you need predictions within 50 milliseconds. To accomplish this, you decide to reduce the number of features in your nearest neighbor comparisons. How many features can you use?

5 features
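One way to arrive at this number, assuming (as the question implies but does not state) that the cost of a distance computation scales linearly with the number of features. Integer nanoseconds are used to avoid floating-point rounding.

```python
n_houses = 100_000
ns_per_distance_100_features = 10_000   # 10 microseconds with 100 features
budget_ns = 50_000_000                  # 50 milliseconds per prediction

# Assumption: distance cost is proportional to the number of features.
ns_per_feature = ns_per_distance_100_features // 100   # 100 ns per feature
budget_ns_per_distance = budget_ns // n_houses          # 500 ns per house
max_features = budget_ns_per_distance // ns_per_feature

print(max_features)
```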

69. Let's analyze how many computations are required to fit a multiple linear regression model using the closed-form solution based on a data set with 50 observations and 10 features. In the videos, we said that computing the inverse of the 10x10 matrix H^T H was on the order of D^3 operations. Let's focus on forming this matrix prior to inversion. How many multiplications are required to form the matrix H^T H?

5000
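The count can be reproduced by walking through a naive matrix product. This sketch assumes every one of the D x D entries is computed separately as a dot product over the N observations, without exploiting the symmetry of H^T H.

```python
def count_gram_multiplications(n_obs, n_features):
    """Count scalar multiplications in a naive computation of H^T H."""
    count = 0
    for i in range(n_features):        # row of H^T H
        for j in range(n_features):    # column of H^T H
            for n in range(n_obs):     # one multiply H[n][i] * H[n][j] per observation
                count += 1
    return count

print(count_gram_multiplications(n_obs=50, n_features=10))
```

In general this is N * D^2 multiplications, which is where the N D^2 term in the overall closed-form cost comes from.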

56. You have a data set consisting of the sales prices of houses in your neighborhood, with each sale time-stamped by the month and year in which the house sold. You want to predict the average value of houses in your neighborhood over time, so you fit a simple regression model with average house price as the output and the time index (in months) as the input. Based on 10 months of data, the estimated intercept is $4569 and the estimated slope is 143 ($/month). If you extrapolate this trend forward in time, at which time index (in months) do you predict that your neighborhood's value will have doubled relative to the value at month 10? (Round to the nearest month).

52 (51.95)
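The answer follows from evaluating the fitted line at month 10, doubling that value, and solving the line for the time index. A short sketch with illustrative variable names:

```python
intercept = 4569.0   # dollars
slope = 143.0        # dollars per month

value_at_month_10 = intercept + slope * 10   # 5999.0
target_value = 2 * value_at_month_10         # 11998.0

# Solve intercept + slope * t = target_value for t.
doubling_month = (target_value - intercept) / slope

print(doubling_month, round(doubling_month))
```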

155. From the section "Perform k-nearest neighbor regression": Make prediction for the first 10 houses using k-nearest neighbors with k=10. What is the index of the house in this query set that has the lowest predicted value? Enter an index between 0 and 9.

97. Which degree (1, 2, ..., 15) had the lowest RSS on Validation data?

52. In what range is the accuracy of the 1-nearest neighbor classifier at classifying 'dog' images from the test set?

60 to 70

157. How many weights are greater than or equal to 0?

68419

72. What is the mean value (arithmetic average) of the 'bed_bath_rooms' feature on TEST data? (round to 2 decimal places)

7.50

73. What is the mean value (arithmetic average) of the 'log sqft_living' feature on TEST data? (round to 2 decimal places)

7.55

58. Using your Slope and Intercept from predicting prices from square feet, what is the predicted price for a house with 2650 sqft? Use simple decimals without comma separators (e.g. 300000), and do not include the dollar sign.

700074.85

150. From the section "Compute multiple distances": we take our query house to be the first house of the test set. Among the first 10 training houses, which house is the closest to the query house? Enter the 0-based index of the closest house.

119. This question refers to the same model as the previous question. What is the value of the coefficient for sqft_living that you learned with high regularization (l2_penalty=1e11)? Use American-style decimals (e.g. 30.5) and round your answer to 1 decimal place.

91.5

105. Assume you have a training dataset consisting of 1 million observations. Suppose running the closed-form solution to fit a multiple linear regression model using ridge regression on this data takes 1 second. Suppose you want to choose the penalty strength λ by searching over 100 possible values. How long will it take to run leave-one-out (LOO) cross-validation for this selection task?

About 3 years
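A rough sketch of the estimate. It assumes that fitting on N-1 observations costs about the same 1 second as fitting on all N, and that a year has 365 days; LOO needs one fit per held-out observation for each candidate penalty value.

```python
n_observations = 1_000_000
n_penalty_values = 100
seconds_per_fit = 1.0   # assumed unchanged when one observation is held out

# LOO: for every penalty value, refit once per held-out observation.
total_seconds = n_penalty_values * n_observations * seconds_per_fit

seconds_per_year = 365 * 24 * 3600
total_years = total_seconds / seconds_per_year

print(total_years)
```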

66. Gradient descent/ascent is...

An algorithm for minimizing/maximizing a function

143. Which of the following is the most significant advantage of k-nearest neighbor regression (for k>1) over 1-nearest neighbor regression?

Better copes with noise in the data

108. Next, we split the sales data frame into four subsets (set_1, set_2, set_3, set_4) and fit a 15th order polynomial model using each of the subsets. For the models learned in each of these training sets, what is the smallest value you learned for the coefficient of feature power_1? Choose the range that contains this value.

Between -1000 and -100

129. We split the house sales dataset into training set, test set, and validation set and choose the l1_penalty that minimizes the error on the validation set. In which of the following ranges does the best l1_penalty fall?

Between 0 and 100

59. Using the learned slope and intercept from the squarefeet model, what is the RSS for the simple linear regression using squarefeet to predict prices on TRAINING data?

Between 1.1e+15 and 1.3e+15

98. What is the RSS on TEST data for the model with the degree selected from Validation data? (Make sure you got the correct degree from the previous question)

Between 1.2e+14 and 1.3e+14

110. Using the same 4 subsets (set_1, set_2, set_3, set_4), we train 15th order polynomial models again, but this time we apply a large L2 penalty. For the models learned with the high level of regularization in each of these training sets, what is the smallest value you learned for the coefficient of feature power_1? Choose the range that contains this value.

Between 1.8 and 2.1

109. This question refers to the same models as the previous question. For the models learned in each of these training sets, what is the largest value you learned for the coefficient of feature power_1? Choose the range that contains this value.

Between 1000 and 10000

136. From the section "Cyclical coordinate descent": Using the simple model (with 2 features), we run our implementation of LASSO coordinate descent on the normalized sales dataset. We apply an L1 penalty of 1e7 and tolerance of 1.0. Which of the following ranges contains the RSS of the learned model on the normalized dataset?

Between 1e15 and 3e15

111. This question refers to the same models as the previous question. For the models learned with the high level of regularization in each of these training sets, what is the largest value you learned for the coefficient of feature power_1? Choose the range that contains this value.

Between 2.3 and 2.8

120. This question refers to the same model as the previous question. Using the weights learned with high regularization (l2_penalty=1e11), make predictions for the TEST data. In which of the following ranges does the TEST error (RSS) fall?

Between 4e14 and 8e14

107. We first fit a 15th order polynomial model using the 'sqft_living' column of the 'sales' data frame, with a tiny L2 penalty applied. Which of the following ranges contains the learned value for the coefficient of feature power_1?

Between 70 and 150

156. From the section "Perform k-nearest neighbor regression": We use a validation set to find the best k value, i.e. one that minimizes the RSS on validation set. If we perform k-nearest neighbors with optimal k found above, what is the RSS on the TEST data? Choose the range that contains this value

Between 8e13 and 2e14

113. Using the best L2 penalty found above, train a model using all training data. Which of the following ranges contains the RSS on the TEST data of the model you learn with this L2 penalty?

Between 8e13 and 4e14

159. Which of the following products are represented in the 20 most positive reviews?

Britax Decathlon Convertible Car Seat, Tiffany

67. Gradient descent/ascent allows us to...

Estimate model parameters from data

86. If the features of Model 1 are a strict subset of those in Model 2, which model will USUALLY have lowest TEST error?

It's impossible to tell with only this information

89. A simple model with few parameters is most likely to suffer from:

High Bias

90. A complex model with many parameters is most likely to suffer from:

High Variance

101. In ridge regression using unnormalized features, if you double the value of a given feature (i.e., a specific column of the feature matrix), what happens to the estimated coefficients for every other feature? They:

Impossible to tell from the information provided

104. Assume you have a training dataset consisting of N observations and D features. You use the closed-form solution to fit a multiple linear regression model using ridge regression. To choose the penalty strength λ, you run leave-one-out (LOO) cross validation searching over L values of λ. Let Cost(N,D) be the computational cost of running ridge regression with N data points and D features. Assume the prediction cost is negligible compared to the computational cost of training the model. Which of the following represents the computational cost of your LOO cross validation procedure?

L * N * Cost(N-1,D)

144. To obtain a fit with low variance using kernel regression, we should choose the kernel to have:

Large bandwidth λ

116. This question refers to the same model as the previous question. Comparing the lines you fit with no regularization versus high regularization (l2_penalty=1e11), which one is steeper?

Line fit with no regularization (l2_penalty=0)

117. This question refers to the same model as the previous question. Using the weights learned with no regularization (l2_penalty=0), make predictions for the TEST data. In which of the following ranges does the TEST error (RSS) fall?

Between 2e14 and 5e14

92. A common process for selecting a parameter like the optimal polynomial degree is:

Minimizing validation error

82. Which estimate was closer to the true price for the 1st house on the TEST data set, model 1 or model 2?

Model 1

61. Which model (square feet or bedrooms) has lowest RSS on TEST data?

Model 1 (Square feet)

78. Which model (1, 2 or 3) has lowest RSS on TESTING Data?

Model 2

83. Which model (1 or 2) has lowest RSS on all of the TEST data?

Model 2

85. If the features of Model 1 are a strict subset of those in Model 2, which model will USUALLY have lowest TRAINING error?

Model 2

87. If the features of Model 1 are a strict subset of those in Model 2, which model will USUALLY have lower BIAS?

Model 2

77. Which model (1, 2 or 3) has lowest RSS on TRAINING Data?

Model 3

76. What is the sign (positive or negative) for the coefficient/weight for 'bathrooms' in model 2?

Negative (-)

162. Does a higher accuracy value on the training data always imply that the classifier is better?

No, higher accuracy on training data does not necessarily imply that the classifier is better.

95. Is the sign (positive or negative) for power_15 the same in all four models?

No, it is not the same in all four models

70. More generally, if you have D features and N observations, what is the total complexity of computing (H^T H)^(-1)?

O(ND^2 + D^3)

75. What is the sign (positive or negative) for the coefficient/weight for 'bathrooms' in model 1?

Positive (+)

165. Which model (sentiment_model or simple_model) has higher accuracy on the TRAINING set?

Sentiment_model

166. Which model (sentiment_model or simple_model) has higher accuracy on the TEST set?

Sentiment_model

99. Which of the following is NOT a valid measure of overfitting?

Sum of parameters (w1+w2+...+wn)

Who is the nearest neighbor to 'Elton John' using raw word counts? a. Paul McCartney b. Cliff Richard c. Rod Stewart d. Mary Fitzgerald (artist) e. David Beckham f. Taylor Swift

19. How do you compare the different learned models with the baseline approach where we are just predicting the majority class? a. The model learned using all words performed much worse than the other two. The other two approaches performed about the same. b. The model learned using all words performed much better than the other two. The other two approaches performed not the same. c. The model learned using all words performed much better than the other two. The other two approaches performed about the same.

The model learned using all words performed much better than the other two. The other two approaches performed about the same.

141. In the section "Evaluating each of the learned models on the test data", we evaluate three models on the test data. The three models were trained with same set of features but different L1 penalties. Which of the three models gives the lowest RSS on the TEST data?

The model trained with 1e4

54. In a simple regression model, if you increase the input value by 1 then you expect the output to change by:

The value of the slope parameter

121. This question refers to the same model as the previous question. Predict the price of the first house in the test set using the weights learned with no regularization. Do the same using the weights learned with high regularization. Which weights make better prediction for the first house in the test set?

The weights learned with high regularization (l2_penalty=1e11)

65. If you double the value of a given feature (i.e. a specific column of the feature matrix), what happens to the least-squares estimated coefficients for every other feature? (assume you have no other feature that depends on the doubled feature i.e. no interaction terms).

They stay the same

158. Of the three data points in sample_test_data, which one has the lowest probability of being classified as a positive review?

Third

125. Which of the following statements about coordinate descent is true? (Select all that apply.)

To test the convergence of coordinate descent, look at the size of the maximum step you take as you cycle through coordinates.

94. Which of the following statements is true (select all that apply): For a fixed model complexity, in the limit of an infinite amount of training data,

Variance goes to 0

53. Assume you fit a regression model to predict house prices from square feet based on a training data set consisting of houses with square feet in the range of 1000 and 2000. In which interval would we expect predictions to do best?

[1000, 2000]

102. If we only have a small number of observations, K-fold cross validation provides a better estimate of the generalization error than the validation set method. a. True b. False

12. Out of the 11 words in selected_words, which one is most used in the reviews in the dataset? a. great b. wow c. terrible d. love

145. In k-nearest neighbor regression and kernel regression, the complexity of functions that can be represented grows as we get more data. a. True b. False

164. Are the positive words in the simple_model also positive words in the sentiment_model? a. Yes b. No

168. Is the sentiment_model definitely better than the majority class classifier (the baseline)? a. Yes b. No

22. Why is the value of the predicted_sentiment for the most positive review found using the sentiment_model much more positive than the value predicted using the selected_words_model? a. None of the selected_words appeared in the text of this review. b. Some of the selected_words appeared in the text of this review. c. Some of the selected_words did not appear in the text of this overview.

28. Top word count words for Elton John a. (the, in, and) b. (the, or, and) c. (the, in, also)

42. Using the first 10,000 unique users only in the test data, use the personalized_model learned on the training data to recommend 1 song to each user. What's the most recommended song? a. Undo- Bjork b. Taylor Swift c. William Tabbert d. Kings of Leon

7. For a linear classifier classifying between "positive" and "negative" sentiment in a review x, Score(x) = 0 implies (check all that apply): a. We are uncertain whether the review is "positive" or "negative" b. We are certain whether the review is "positive" c. We are certain whether the review is "negative"

9. True or false: For a classifier classifying between 5 classes, there always exists a classifier with accuracy greater than 0.18. a. True b. False

Which of the following ranges contains the 'predicted_sentiment' for the most positive review for 'Baby Trend Diaper Champ', according to the sentiment_model from the IPython Notebook from lecture? a. 0.9 to 1.0 b. 0.7 to 0.8 c. 0.7 to 0.9

Which of the following ranges contains the accuracy of the selected_words_model on the test_data? a. 0.841 to 0.871 b. 0.901 to 0.931 c. 0.811 to 0.843

Who is closer to 'Elton John', 'Victoria Beckham' or 'Paul McCartney'? a. Paul McCartney b. Cliff Richard c. Rod Stewart d. Mary Fitzgerald (artist) e. David Beckham f. Taylor Swift

46. If you have lots of images of different types of plankton labeled with their species name, and lots of computational resources, what would you expect to perform better predictions:

a deep neural network trained on this data.

47. If you have a few images of different types of plankton labeled with their species name, what would you expect to perform better predictions:

a simple classifier trained on this data, using deep features as input, which were trained using ImageNet data.

37. Recommending items using featurized matrix factorization can (check all that apply): a. provide personalization b. capture context (e.g., time of day) c. do not provide personalization d. do not capture context (e.g. time of day)

44. A simple linear classifier can represent which of the following functions? (Check all that apply) a. x1 OR x2 OR NOT x3 b. x1 AND x2 AND NOT x3 c. x1 OR (x2 AND NOT x3) d. x1 OR (x2 AND x3) e. x1 AND x2 AND x3

abc

27. Which of the following statements are true? (Check all that apply) (Multiple choice | Example of answer - "acd", without spaces, in alphabetical order): a. If we are performing clustering, we typically assume we either do not have or do not use class labels in training the model b. If we are performing clustering, we typically assume we either do not have class labels in training the model. c. Deciding whether an email is spam or not spam using the text of the email and some spam/not-spam labels is a supervised learning problem. d. Deciding whether an email is spam or not spam using the text of the email and some spam/not-spam labels is an unsupervised learning problem.

1. True or false: The model that best minimizes training error is the one that will perform best for the task of prediction on new data. a. True b. False

10. True or false: A false negative is always worse than a false positive. a. True b. False

103. 10-fold cross validation is more computationally intensive than leave-one-out (LOO) cross validation. a. True b. False

122. The best fit model of size 5 (i.e., with 5 features) always contains the set of features from best fit model of size 4. a. True b. False

13. Out of the 11 words in selected_words, which one is least used in the reviews in the dataset? a. great b. wow c. terrible d. love

146. Parametric regression and 1-nearest neighbor regression will converge to the same solution as we collect more and more noiseless observations. a. True b. False

2. True or false: One always prefers to use a model with more features since it better captures the true underlying process. a. True b. False

26. For the TF-IDF representation, does the relative importance of words in a document depend on the base of the logarithm used? For example, take the words "bus" and "wheels" in a particular document. Is the ratio between the TF-IDF values for "bus" and "wheels" different when computed using log base 2 versus log base 10? a. Yes b. No c. Impossible to define
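The effect of the logarithm base can be checked numerically. The sketch below uses made-up term and document counts (all numbers are hypothetical) and the common definition TF-IDF = tf * log(N / df). Since changing the base multiplies every logarithm by the same constant, that constant cancels in a ratio.

```python
import math

def tf_idf(term_count, n_docs, docs_with_term, base):
    """TF-IDF as term count times log(N / document frequency) in the given base."""
    return term_count * math.log(n_docs / docs_with_term, base)

# Hypothetical corpus statistics, for illustration only.
n_docs = 64
bus = {"term_count": 3, "docs_with_term": 4}
wheels = {"term_count": 2, "docs_with_term": 16}

ratio_base_2 = tf_idf(bus["term_count"], n_docs, bus["docs_with_term"], 2) / \
               tf_idf(wheels["term_count"], n_docs, wheels["docs_with_term"], 2)
ratio_base_10 = tf_idf(bus["term_count"], n_docs, bus["docs_with_term"], 10) / \
                tf_idf(wheels["term_count"], n_docs, wheels["docs_with_term"], 10)

print(ratio_base_2, ratio_base_10)
```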

29. Top TF-IDF words for Elton John a. (furnish, elton, desk) b. (furnish, elton, billboard) c. (elton, john, billboard)

30. The cosine distance between 'Elton John's and 'Victoria Beckham's articles (represented with TF-IDF) falls within which range? a. 0.91 to 1.0 b. 0.9 to 1.0 c. 0.7 to 0.9 d. 0.7 to 0.89

38. Normalizing co-occurrence matrices is used primarily to account for: a. items purchased by a number of people b. items purchased by many people c. items not purchased by people

39. Which of the artists below have had the most unique users listening to their songs? a. Undo- Bjork b. Taylor Swift c. William Tabbert d. Kings of Leon

43. Which of the following statements are true? (Check all that apply) a. The lack of non-linear functions can allow us to study very accurate linear classifiers. b. Having good non-linear features can allow us to learn very accurate linear classifiers. c. The lack of non-linear functions can allow us to study very imprecise linear classifiers. d. Having good non-linear functions can allow us to study very imprecise linear classifiers.

5. Building a regression model with several more features: What is the difference in RMSE between the model trained with my_features and the one trained with advanced_features? a. the RMSE of the model with advanced_features lower by less than $250,000 b. the RMSE of the model with advanced_features lower by less than $25,000 c. the RMSE of the model with advanced_features lower by more than $25,000 d. the RMSE of the model with advanced_features lower by less than $2,500

55. Two people present you with fits of their simple regression model for predicting house prices from square feet. You discover that the estimated intercept and slopes are exactly the same. This implies that these two people fit their models on exactly the same data set. a. True b. False

63. Your estimated model for predicting house prices has a large positive weight on 'square feet living'. This implies that if we remove the feature 'square feet living' and refit the model, the new predictive performance will be worse than before. a. True b. False

8. True or false: High classification accuracy always indicates a good classifier. a. True b. False

84. If the features of Model 1 are a strict subset of those in Model 2, the TRAINING error of the two models can never be the same. a. True b. False

88. It is always optimal to add more features to a regression model. a. True b. False

96. The plotted fitted lines all look the same in all four plots a. True b. False

Consider the most positive review for 'Baby Trend Diaper Champ' according to the sentiment_model from the IPython Notebook from lecture. Which of the following ranges contains the predicted_sentiment for this review, if we use the selected_words_model to analyze it? a. 0.9 to 1.0 b. 0.7 to 0.8 c. 0.7 to 0.9

Which of the following ranges contains the accuracy of the sentiment_model in the IPython Notebook from lecture on the test_data? a. 0.841 to 0.871 b. 0.901 to 0.931 c. 0.811 to 0.843

6. The simple threshold classifier for sentiment analysis described in the video (check all that apply): (Multiple choice | write the chosen options in alphabetical order in your answer, e.g. "abc") a. Must not either count attributes equally or pre-define weights on attributes b. Must either count attributes equally or pre-define weights on attributes c. Must have pre-defined positive and negative attributes d. Must not have pre-defined positive and negative attributes

137. Refer to the same model as the previous question. Which of the following features were assigned a zero weight at convergence?

bedrooms

48. What's the least common category in the training data?

bird

11. Which of the following statements are true? (Check all that apply) a. Test error tends to decrease with more training data until a point, and then change (i.e., curve flattens) b. Test error tends to increase with more training data until a point, and then does not change (i.e., curve flattens out) c. Test error tends to decrease with more training data until a point, and then does not change (i.e., curve flattens out)

15. Out of the 11 words in selected_words, which one got the most negative weight in the selected_words_model? (Tip: when printing the list of coefficients, make sure to use print_rows(rows=12) to print ALL coefficients.) a. great b. wow c. terrible d. love

23. A country, called Simpleland, has a language with a small vocabulary of just "the", "on", "and", "go", "round", "bus", and "wheels". For a word count vector with indices ordered as the words appear above, what is the word count vector for a document that simply says "the wheels on the bus go round and round"? Please enter the vector of counts as follows: If the counts were ["the"=1, "on"=3, "and"=2, "go"=1, "round"=2, "bus"=1, "wheels"=1], enter 1321211. a. 1321211 b. 2111212 c. 2111211 d. 1322211
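The counting in question 23 can be checked mechanically. This is a minimal sketch, assuming lowercase whitespace tokenization with trailing punctuation stripped; the helper name `word_count_vector` is illustrative and not taken from the course notebook.

```python
from collections import Counter

# Vocabulary in the order given by the question.
VOCAB = ["the", "on", "and", "go", "round", "bus", "wheels"]

def word_count_vector(document, vocab=VOCAB):
    """Count how often each vocabulary word occurs in the document."""
    # Assumption: split on whitespace, strip basic punctuation, lowercase.
    tokens = [t.strip(".,!?").lower() for t in document.split()]
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vector = word_count_vector("the wheels on the bus go round and round.")
encoded = "".join(str(c) for c in vector)  # answer format used by the quiz
print(vector, encoded)
```

Here "the" and "round" each occur twice and every other vocabulary word occurs once, so the encoded string is 2111211.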

24. In Simpleland, a reader is enjoying a document with a representation: [1 3 2 1 2 1 1]. Which of the following articles would you recommend to this reader next? a. [1 7 0 0 2 0 0 ] b. [1 7 1 3 2 1 2 1] c. [1 7 0 0 2 0 1 ] d. [1 3 2 1 2 1 1]

4. Filtering data: What fraction of the houses have living space between 2000 sq.ft. and 4000 sq.ft.? a. Between 0.2 and 0.39 b. Between 0.5 and 0.59 c. Between 0.4 and 0.49 d. Between 0.2 and 0.4

41. Which of the artists below is the least popular artist, the one with smallest total listen_count, in the data set? a. Undo- Bjork b. Taylor Swift c. William Tabbert d. Kings of Leon

Which of the following ranges contains the accuracy of the majority class classifier, which simply predicts the majority class on the test_data? a. 0.841 to 0.871 b. 0.901 to 0.931 c. 0.811 to 0.843
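A majority class classifier ignores the features and always predicts the most frequent label, so its accuracy equals the share of that label in the evaluation data. The sketch below uses small hypothetical label lists (not the course's test_data) and assumes the majority class is determined from the training labels.

```python
from collections import Counter

def majority_class_accuracy(train_labels, test_labels):
    """Return the majority label and its accuracy on test_labels."""
    # Most frequent label in the (hypothetical) training set.
    majority = Counter(train_labels).most_common(1)[0][0]
    # Every prediction is `majority`, so accuracy is its share in the test set.
    correct = sum(1 for y in test_labels if y == majority)
    return majority, correct / len(test_labels)

# Hypothetical sentiment labels: +1 positive, -1 negative.
train = [1, 1, 1, -1, 1, -1, 1, 1]
test = [1, -1, 1, 1, -1]
majority, accuracy = majority_class_accuracy(train, test)
print(majority, accuracy)
```

This baseline is the number a trained classifier such as sentiment_model has to beat for its accuracy to be meaningful.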

Who is the nearest neighbor to 'Elton John' using TF-IDF? a. Paul McCartney b. Cliff Richard c. Rod Stewart d. Mary Fitzgerald (artist) e. David Beckham f. Taylor Swift

51. On average, is the first image in the test data closer to its 5 nearest neighbors in the 'cat' data or in the 'dog' data?

cat

139. This question refers to the same model as the previous question. In the model trained with l1_penalty=1e8, which of the following features has non-zero weight? (Select all that apply)

constant

14. Out of the 11 words in selected_words, which one got the most positive weight in the selected_words_model? (Tip: when printing the list of coefficients, make sure to use print_rows(rows=12) to print ALL coefficients.) a. great b. wow c. terrible d. love

25. A corpus in Simpleland has 99 articles. If you pick one article and perform 1-nearest neighbor search to find the closest article to this query article, how many times must you compute the similarity between two articles? a. 45 b. 45.5 c. 1 d. 98
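The comparison count in question 25 follows from a brute-force search: the query article is compared once with every other article in the corpus. The sketch below assumes Euclidean distance and a made-up corpus of 99 short vectors; both are illustrative choices, not part of the question.

```python
def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def nearest_neighbor(query_index, articles, distance):
    """Brute-force 1-NN search; also reports how many distances were computed."""
    comparisons = 0
    best_index, best_distance = None, float("inf")
    for i, article in enumerate(articles):
        if i == query_index:
            continue  # the query is not compared with itself
        comparisons += 1
        d = distance(articles[query_index], article)
        if d < best_distance:
            best_index, best_distance = i, d
    return best_index, comparisons

# Hypothetical corpus of 99 articles, each a tiny count vector.
articles = [[i % 7, i % 5, i % 3] for i in range(99)]
best, comparisons = nearest_neighbor(0, articles, euclidean)
print(comparisons)
```

With 99 articles the loop computes 98 distances, one for each article other than the query.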

31. The cosine distance between the 'Elton John' and 'Paul McCartney' articles (represented with TF-IDF) falls within which range? a. 0.91 to 1.0 b. 0.9 to 1.0 c. 0.7 to 0.9 d. 0.7 to 0.89

35. Who is the nearest neighbor to 'Victoria Beckham' using raw word counts? a. Paul McCartney b. Cliff Richard c. Rod Stewart d. Mary Fitzgerald (artist) e. David Beckham f. Taylor Swift

40. Which of the artists below is the most popular artist, the one with highest total listen_count, in the data set? a. Undo- Bjork b. Taylor Swift c. William Tabbert d. Kings of Leon

36. Who is the nearest neighbor to 'Victoria Beckham' using TF-IDF? a. Paul McCartney b. Cliff Richard c. Rod Stewart d. Mary Fitzgerald (artist) e. David Beckham f. Taylor Swift

106. Assume you have a training dataset consisting of 1 million observations. Suppose running the closed-form solution to fit a multiple linear regression model using ridge regression on this data takes 1 second. Suppose you want to choose the penalty strength λ by searching over 100 possible values. If you only want to spend about 1 hour to select λ, what value of k should you use for k-fold cross-validation?

k=36
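The arithmetic behind k=36 can be sketched as follows, assuming each of the k fits per penalty value costs about the same 1 second as a fit on the full dataset (a simplification, since each fold trains on slightly less data). The function name `max_folds` is illustrative.

```python
def max_folds(budget_seconds, seconds_per_fit, num_penalties):
    """Largest k such that k fits per penalty value stay within the budget."""
    # Total cost is roughly k * num_penalties * seconds_per_fit.
    return int(budget_seconds // (seconds_per_fit * num_penalties))

# 1 hour budget, about 1 second per fit, 100 candidate penalty values.
k = max_folds(budget_seconds=3600, seconds_per_fit=1, num_penalties=100)
print(k)
```

One hour is 3600 seconds, and 100 penalty values at k fits each cost about 100k seconds, so k = 3600 / 100 = 36.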

64. Complete the following: Your estimated model for predicting house prices has a positive weight on 'square feet living'. You then add 'lot size' to the model and re-estimate the feature weights. The new weight on 'square feet living' [____________] be positive.

might

91. A model with many parameters that fits training data very well but does poorly on test data is considered to be...

overfitted

153. From the section "Perform k-nearest neighbor regression": Take the query house to be third house of the test set (features_test[2]). Which of the following is NOT part of the 4 training houses closest to the query house? (Note that all indices are 0-based.)

training house with index 2818

62. Which of the following is NOT a linear regression model (select all that apply):

y=w_0 * w_1 + log(w_1) * x
