Predictive Analytics Midterm Practice

Match each of the following tasks to the corresponding type of predictive analytics task (1. Focus on the relationship between outputs and inputs or 2. Relationships among all variables) a. Predict housing price based on several factors such as property tax, crime rate, # of bedrooms, # of bathrooms, etc. b. Recommend products that a particular customer would likely purchase c. Bundle products that tend to go together d. Forecast the Gross Domestic Product (GDP) based on unemployment rate, inflation rate, and producer price index

1, 2, 2, 1

What are other terms that are related to Predictive Analytics? Select all that apply a. Data Mining b. Machine Learning c. Data Science d. Artificial Intelligence

a, b, c, d

In R, if you want to understand what the function mean does, which of the following is the correct way to do it? Select all appropriate options. a. help(mean) b. help("mean") c. mean? d. ?mean

a, b, d

Which of the following graphs/plots that we covered are for data exploration? Select all that apply a. histograms b. bar graphs c. pie charts d. boxplots e. stem-and-leaf plots f. scatterplots g. line graphs

a, b, d, f, g

Which of the following is NOT a valid subset selection algorithm or search? Select all choices that apply a. Adjusted R squared b. Exhaustive Search c. Under-fit Search d. Forward Selection e. Stepwise Regression f. Backward Elimination

a, c

What are the characteristics of the principal components? Select all options that apply a. only a few of them contain most of the original information b. their correlations are non-zero c. they are uncorrelated d. they are linear combinations of the original variables

a, c, d

Given: housing.df <- read.csv("BostonHousing.csv") Which of the following codes produce the histogram plot for the INDUS column in 'BostonHousing.csv'? Select all possible choice(s). a. hist(housing.df$INDUS, xlab = "INDUS") b. hist(housing.df.INDUS, xlab = "INDUS") c. hist(housing.df.INDUS, ylab = "INDUS") d. hist(housing.df$INDUS)

a, d

To generate a random sample (a random ordering) of the integers from 1 to 10, which of the following codes should we use? Select all appropriate answers a. sample(1:10) b. sample(0:11) c. rnorm(10, mean = 1, sd = 1) d. sample(10)

a, d
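
A quick sketch of why options a and d behave the same: when sample() is given a single positive integer n, it permutes 1:n. The seed value below is arbitrary and only there to make the run repeatable.

```r
# sample(1:10) and sample(10) both return a random permutation of 1..10.
set.seed(42)            # arbitrary seed, only for reproducibility
perm_a <- sample(1:10)

set.seed(42)            # reset so both calls draw from the same random stream
perm_b <- sample(10)

print(perm_a)
print(identical(perm_a, perm_b))
```

Option b would instead permute the twelve integers 0 through 11, and option c draws real-valued normal variates rather than integers.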

In order to choose how many principal components we should include, we often look at a. Cumulative proportion b. Principal component scores c. Standard deviation d. Proportion of variance

a. Cumulative proportion
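
As an illustration of reading the cumulative proportion, here is a minimal sketch that assumes prcomp() is the PCA function in use and borrows the built-in USArrests data set. The 90% cutoff is an illustrative assumption, not a rule from the course.

```r
# PCA on a built-in data set (USArrests ships with R's datasets package).
pca <- prcomp(USArrests, scale. = TRUE)   # scale. = TRUE normalizes each variable

# Rows: Standard deviation, Proportion of Variance, Cumulative Proportion
importance <- summary(pca)$importance
print(importance)

# Smallest number of components whose cumulative proportion reaches 90%
# (the 90% threshold is an assumption chosen for this example)
cum_prop <- importance["Cumulative Proportion", ]
n_keep <- which(cum_prop >= 0.90)[1]
print(n_keep)

# The component scores are uncorrelated with one another.
print(round(cor(pca$x), 3))
```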

Match each of the following statements to either explanatory modeling (E) or predictive modeling (P) a. discuss the relationship between predictors and outcome b. the goal is to fit the given data well c. predict outcome in other data set where we have the predictor values but not the outcome values d. use the entire data set to build the model e. the goal is to optimize predictive accuracy f. divide the data into training set and validation set g. assess performance on validation set

a. E b. E c. P d. E e. P f. P g. P

For aggregate() function, list out the order of operations (First, Second, Third) a. Splits the data into subsets b. Return the outcome in a convenient form c. Computes summary statistics for each subset of the data

a. First b. Third c. Second
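
The split, compute, return sequence can be seen in a small sketch. The data frame, its column names, and its values are made up for illustration.

```r
# Toy data: the column names and values are invented for this example.
sales <- data.frame(
  region = c("East", "East", "West", "West", "West"),
  amount = c(10, 20, 5, 15, 25)
)

# aggregate() splits `amount` by `region`, applies mean() to each subset,
# and returns the results as a data frame with one row per group.
avg_by_region <- aggregate(amount ~ region, data = sales, FUN = mean)
print(avg_by_region)
```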

Which of the following is NOT among the methods that are used in Predictive Analytics and Data Mining? a. Patterning b. Prediction c. Classification d. Recommendation systems e. Association Rules

a. Patterning

Which method should we use for the following business problem? How much is a car worth based on several indicators such as mileage, age, model, color, etc.? a. Prediction b. Classification c. Association Rules d. Collaborative Filtering

a. Prediction

Deciding whether to issue a loan to an applicant based on demographic and financial data (with reference to a database of similar data on prior customers) a. Supervised Learning b. Unsupervised Learning

a. Supervised Learning

Predicting whether a company will go bankrupt based on comparing its financial data to those of similar bankrupt and non bankrupt firms a. Supervised Learning b. Unsupervised Learning

a. Supervised Learning

Which of the following codes produces the data in row 2 to row 4 and the first 5 columns? a. housing.df[2:4,1:5] b. housing.df(2:4,1:5) c. housing.df[1:5,2:4] d. housing.df[2:4,5]

a. housing.df[2:4,1:5]

To generate a sample of ten normal random variables with mean 2 and variance 9, which of the following codes should we use? Select all appropriate answers a. rnorm(10, mean = 2, sd = 3) b. rnorm(10, mean = 2, sd = 9) c. rnorm(10, mean = 3, sd = 2) d. rnorm(10, mean = 9, sd = 2)

a. rnorm(10, mean = 2, sd = 3)
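
The key step is that rnorm() takes a standard deviation, not a variance, so a variance of 9 becomes sd = sqrt(9) = 3. A minimal sketch, with an arbitrary seed and an arbitrary large sample size used only to check the variance empirically:

```r
set.seed(1)                       # arbitrary seed for reproducibility
x <- rnorm(10, mean = 2, sd = 3)  # variance 9 means sd = sqrt(9) = 3
print(x)

# With a much larger sample the empirical variance lands close to 9.
big_sample <- rnorm(100000, mean = 2, sd = 3)
print(var(big_sample))
```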

What syntax should we include to normalize the variables in PCA? a. scaled = T b. normalized = T c. normal = T d. scale. = T

d. scale. = T

All of the following are among the characteristics of supervised learning EXCEPT a. used in association rules b. used in linear regression c. used in time series forecasting d. used in predicting a numerical outcome

a. used in association rules

Match the following terms with their associated definition a. Training partition 1. is an optional way to double-check the model's performance on new data b. Validation partition 2. is used to build the model c. Test Partition 3. is to make sure the values of the random variables generated are the same for consistent evaluation d. set.seed() 4. is used to evaluate the model on 'new data' that wasn't part of the model development process

a2, b4, c1, d3
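
One way these pieces fit together is sketched below. The built-in mtcars data set, the seed value, and the 60/40 split are all assumptions made for the example.

```r
set.seed(1)                                      # makes the random split repeatable
n <- nrow(mtcars)                                # mtcars is a built-in data set
train_rows <- sample(n, size = round(0.6 * n))   # 60/40 split is an assumption

train_df <- mtcars[train_rows, ]                 # used to build the model
valid_df <- mtcars[-train_rows, ]                # held out to evaluate the model
print(c(train = nrow(train_df), valid = nrow(valid_df)))
```

Rerunning the block with the same seed reproduces the same partition, which is what keeps model comparisons consistent.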

Match the following subset selection algorithms with their definition a. Forward Selection 1. Start with all predictors, then at each step, eliminate the least useful predictor b. Backward Elimination 2. Start with no predictors and then add predictors one by one. However, at each step, consider dropping predictors c. Stepwise Regression 3. Start with no predictors and then add predictors one by one

a3, b1, c2

Match the visualization plots with what they are best used for. a. Line graph 1. Show the distribution of a variable b. Bar Chart 2. Relationship between two numerical variables c. Scatterplot 3. Time Series d. Histogram 4. Comparing distributions e. Boxplot 5. Categorical variables

a3, b5, c2, d1, e4

Match the following R codes with their corresponding definition a. head() 1. sets the seed of the random number generator b. set.seed() 2. read data from the .csv file c. summary() 3. print out the last 6 observations in the data set d. read.csv() 4. print out the first 6 observations in the data set e. tail() 5. print out the five-number summary and average value of the data set

a4, b1, c5, d2, e3

For a closer view of the patterns, we'd like to zoom in to the range of 3500-5000 on the y-axis. Which of the following codes would execute this task? a. plot(appship.ts, xlab = "Year", ylab = "Shipments (in million dollars)", xlim = c(3500, 5000)) b. plot(appship.ts, xlab = "Year", ylab = "Shipments (in million dollars)", ylim = c(3500, 5000)) c. plot(appship.ts, xlab = "Year", ylab = "Shipments (in million dollars)", xlim = c(3500, 5000), ylim = c(3500, 5000)) d. none of the alternatives are correct

b

In an exhaustive search, which criterion do we choose to select the most promising subset of predictors? a. R^2 because it uses a penalty on the number of predictors b. Adjusted R^2 because it uses a penalty on the number of predictors c. R^2 because it does not account for the number of predictors d. Adjusted R^2 because it does not account for the number of predictors

b
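
The effect of the penalty can be seen by adding a pure-noise predictor to a regression. This is a sketch on the built-in mtcars data; the model formula and the noise column are assumptions made for the example.

```r
set.seed(1)                       # arbitrary seed for reproducibility
cars <- mtcars
cars$noise <- rnorm(nrow(cars))   # a predictor unrelated to mpg by construction

fit_small <- summary(lm(mpg ~ wt + hp, data = cars))
fit_big   <- summary(lm(mpg ~ wt + hp + noise, data = cars))

# R^2 cannot decrease when a predictor is added; adjusted R^2 can,
# because it penalizes the number of predictors.
print(c(small = fit_small$r.squared,     big = fit_big$r.squared))
print(c(small = fit_small$adj.r.squared, big = fit_big$adj.r.squared))
```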

What is(are) the R code(s) to print out the first SIX elements in column 'ROOMS'? Select all that apply a. housing.df$ROOMS[6] b. head(housing.df$ROOMS) c. tail(housing.df$ROOMS) d. housing.df$ROOMS[1:6]

b, d

Which method should we use for the following business problem? A hotel wants to identify which customers would have a high churn rate? a. Prediction b. Classification c. Association Rules d. Collaborative Filtering

b. Classification

Every time we run rnorm(4, mean = 4, sd = 4), the outcome should always be the same a. True b. False

b. False

Regardless of whether we use exhaustive search, backward elimination, forward selection, or stepwise regression, the number of predictors and the predictors themselves in the resulting model should always be the same a. True b. False

b. False

Two models are applied to a data set that has been partitioned. Model A is considerably more accurate than Model B on the training data, but slightly less accurate than Model B on the validation data. Which model are you more likely to consider for final deployment? a. Model A b. Model B c. Either model is fine d. Can't choose; the data set shouldn't be partitioned

b. Model B

Which of the following concepts explains why, when a model is fit to training data, zero error with those data is not necessarily good? a. Missing data b. Overfitting data c. Normalizing data d. Detecting outliers

b. Overfitting data

Identifying segments of similar customers a. Supervised Learning b. Unsupervised Learning

b. Unsupervised Learning

In an online bookstore, making recommendations to customers concerning additional items to buy based on the buying patterns in prior transactions. a. Supervised Learning b. Unsupervised Learning

b. Unsupervised Learning

Suppose fitted represents forecasted values by a model, and true represents the actual values of the data. When using the accuracy() function to compute the predictive accuracy measures, what is the correct order of the parameters? a. accuracy(true, fitted) b. accuracy(fitted, true) c. none of the alternatives are correct d. accuracy(fitted, fitted)

b. accuracy(fitted, true)

Which of the following is considered a major drawback for exhaustive search? a. there is no drawback in exhaustive search b. the implementation in R can be tedious and unstable c. it provides the 'best' subset of predictors d. it covers predictors that might be the 'best' ones collectively but not individually

b. the implementation in R can be tedious and unstable

For time series data, which function is used to zoom in on a specific time frame? a. ts() b. window() c. tsapply() d. aggregated()

b. window()
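
A minimal sketch of window() on a made-up monthly series. The series name and its values are invented for illustration.

```r
# A made-up monthly series for 2000-2004 (the values are just 1..60).
monthly.ts <- ts(1:60, start = c(2000, 1), frequency = 12)

# Keep only January 2002 through December 2003.
zoomed.ts <- window(monthly.ts, start = c(2002, 1), end = c(2003, 12))
print(zoomed.ts)
```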

Comment on the use of normalization (standardization) a. normalizing the data would be appealing to wine tasters b. normalizing the data would force the data to behave like a normal distribution c. normalizing the data would equalize the scales and eliminate the undesired impact of scale on the calculation of the principal components d. normalizing the data would not be recommended in this situation

c

Let vl.res be the residuals of a prediction model based on the validation data set. Which of the following is the correct expression to compute RMSE? a. sqrt(mean(abs(vl.res$residuals))) b. mean(vl.res$residuals^2) c. sqrt(mean(vl.res$residuals^2)) d. mean(abs(vl.res$residuals))

c
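
RMSE is the square root of the mean of the squared residuals. The sketch below builds a hypothetical vl.res from invented actual and predicted values so that the expression can be run end to end.

```r
# Hypothetical validation results; the numbers are invented for illustration.
actual    <- c(10, 12, 15, 20)
predicted <- c(11, 10, 15, 24)
vl.res <- data.frame(residuals = actual - predicted)

rmse <- sqrt(mean(vl.res$residuals^2))   # root of the mean squared error
mae  <- mean(abs(vl.res$residuals))      # option d computes MAE, not RMSE
print(c(RMSE = rmse, MAE = mae))
```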

plot(housing.df$MEDV ~ housing.df$DIS, xlab = "DIS", ylab = "MEDV", log = 'xy') #line2 plot(housing.df$MEDV ~ housing.df$DIS, xlab = "DIS", ylab = "MEDV") #line3 What is a difference when observing the plot in line 2 versus line 3? a. the logarithmic scale on (only) the x-axis in line 2 helps show the correlation more clearly b. the logarithmic scale on (only) the y-axis in line 2 helps show the correlation more clearly c. the logarithmic scale on both the x- and y-axes in line 2 helps show the correlation more clearly d. no difference in the plots between line 2's and line 3's code

c

How should we handle missing data? Select all appropriate options a. omit all of the records that have missing values b. keep everything. R can handle those missing values c. drop records with missing values when there are only a few of them d. replace missing values with reasonable replacements

c, d

Which of the following is NOT a correct partition? a. Training partition b. Validation partition c. Cross-validation partition d. Test Partition

c. Cross-validation partition

If we want to understand what a particular R code does, which of the following functions do we use? a. seek() b. answer() c. help() d. question()

c. help()

Which of the following R commands determines whether a data point is missing? a. na() b. which() c. is.na() d. is.missing()

c. is.na()
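
A short sketch of is.na() in use, together with the two remedies for missing data mentioned earlier (dropping records and substituting a replacement). The toy vector and the choice of the median as the replacement are assumptions made for the example.

```r
x <- c(4, NA, 7, NA, 10)     # toy vector with two missing values

print(is.na(x))              # TRUE where a value is missing
print(which(is.na(x)))       # positions of the missing values

# Two common remedies: drop the missing entries, or impute a replacement.
dropped <- x[!is.na(x)]
imputed <- ifelse(is.na(x), median(x, na.rm = TRUE), x)
print(imputed)
```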

Which of the following R functions is used to stack a set of columns into a single column of data? a. cast() b. aggregate() c. melt() d. reshape()

c. melt()

The objectives of Principal Components Analysis are the following EXCEPT a. to remove the overlap of information between numerical variables b. to provide a smaller-size set of numerical variables that contain most of the information c. to reduce the number of numerical variables d. to expand to a larger-size set of numerical variables to contain most of the information

d

Which of the following codes is used to rescale to logarithm scale for x values? a. plot(housing.df$TAX ~ housing.df$LSAT, xlab="LSAT", ylab="TAX", log = 'xy') b. plot(housing.df$TAX ~ housing.df$LSAT, xlab="LSAT", ylab="TAX", ln = 'x') c. plot(housing.df$TAX ~ housing.df$LSAT, xlab="LSAT", ylab="TAX", log_10 = 'x') d. plot(housing.df$TAX ~ housing.df$LSAT, xlab="LSAT", ylab="TAX", log = 'x')

d

Why shouldn't we always include all available variables in a linear regression model? a. it may be expensive or not feasible to collect a full complement for future predictions b. we may be able to measure fewer predictors more accurately c. the more predictors involved, the higher the chance of missing values in the data d. all of the reasons are correct

d

Which symbol is used to indicate a comment in R? a. % b. $ c. @ d. #

d. #

Which method should we use for the following business problem? Amazon wants to recommend certain products to a customer based on his/her historical purchases a. Prediction b. Classification c. Association Rules d. Collaborative Filtering

d. Collaborative Filtering

Why is data visualization important? a. it helps the data cleaning process b. it helps detect outliers and unexpected patterns c. it helps determine the usefulness of variables d. all of the listed choices are correct

d. all of the listed choices are correct

Which of the following codes produces the first column of housing.df? a. housing.df[1,] b. housing.df[1,1] c. housing.df(,1) d. housing.df[,1]

d. housing.df[,1]

In implementing the forward selection method using the step() function, which of the following criteria is used to make a decision on adding a variable? a. select the variable that goes with the largest R^2 b. select the variable that goes with the largest adjusted R^2 c. select the variable that goes with the largest AIC d. select the variable that goes with the smallest AIC

d. select the variable that goes with the smallest AIC
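
A hedged sketch of forward selection with step() on the built-in mtcars data. Using mpg as the outcome and all remaining columns as candidate predictors is an assumption made for the example.

```r
# mtcars is a built-in data set; choosing mpg as the outcome is an assumption.
null_fit <- lm(mpg ~ 1, data = mtcars)   # start with no predictors
full_fit <- lm(mpg ~ ., data = mtcars)   # upper bound: all predictors

forward_fit <- step(
  null_fit,
  scope = list(lower = formula(null_fit), upper = formula(full_fit)),
  direction = "forward",
  trace = 0                              # set to 1 to see the AIC at each step
)
print(formula(forward_fit))
print(AIC(forward_fit))
```

At each step, step() adds the candidate that yields the lowest AIC and stops when no addition lowers it further.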

