Midterm 1 (Topics 1-5)
Match the following R codes with their corresponding definition.
- head() - set.seed() - summary() - read.csv() - tail()
A. print out the five-number summary B. print out the first 6 observations in a data set C. print out the last 6 observations D. read data from a .csv file E. seed the random number generator
head() - B
set.seed() - E
summary() - A
read.csv() - D
tail() - C
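A quick sketch of these functions, using R's built-in mtcars data frame so no external .csv file is needed (read.csv() is shown commented out, since "Cereals.csv" is assumed to be on disk):

```r
head(mtcars)          # first 6 observations of the data set
tail(mtcars)          # last 6 observations
summary(mtcars$mpg)   # min, quartiles, median, mean, max
set.seed(42)          # fix the random number generator's state
rnorm(3)              # these draws are now reproducible across runs
# cereals.df <- read.csv("Cereals.csv")  # would load data from a .csv file
```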
For questions 6 and 7, use the data "Cereals.csv" for the breakfast cereals example in Section 4.8 to explore and summarize the data as follows:
#load the data
cereals.df <- read.csv("Cereals.csv", stringsAsFactors = FALSE)
--
Topic 2 | Homework 2
--
Topic 2 | Quiz 2
--
Topic 3 | HW3
--
Topic 3 | Quiz 3
--
Topic 4 | Homework 4
--
Topic 4 | Quiz 4
--
Topic 5 | Quiz 5
--
Use data set "wine.csv" for questions 3-5. Table 4.13 in the textbook shows the PCA output on data (nonnormalized) in which the variables represent chemical characteristics of wine, and each case is a different wine.
--
Use the data 'Universities.csv' for questions 8-10. University Rankings. The dataset on American college and university rankings (available from www.dataminingbook.com) contains information on 1302 American colleges and universities offering an undergraduate program. For each university, there are 17 measurements that include continuous measurements (such as tuition and graduation rate) and categorical measurements (such as location by state and whether it is a private or a public school).
universities.df <- read.csv("Universities.csv")
head(universities.df)
--
What is the most expensive tax bill (in USD) that a homeowner has to pay? $
15,319
Run through the entire data set and check how many missing data points there are.
2,002
From the 'BostonHousing.csv' file and using the method of your choice, the median value of owner-occupied homes (MEDV) is $ .
21,200
All of the following are among the characteristics of supervised learning EXCEPT A. used in association rules B. used in linear regression C. used in time series forecasting D. used in predicting a numerical outcome
A
Applying prcomp() to universities.df1 with normalizing the data, how many principal components are needed in order to obtain at least 90% of the total variation associated with all numerical variables? Show your work in the supporting R notebook. A. 9 B. 2 C. 4 D. 7
A
Applying prcomp() to universities.df1 without normalizing the data, how many principal components are needed in order to obtain at least 90% of the total variation associated with all numerical variables? Show your work in the supporting R notebook. A. 2 B. 9 C. 4 D. 7
A
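The normalized vs. non-normalized contrast can be sketched on R's built-in USArrests data (standing in for Universities.csv, which is not reproduced here); Assault has a much larger scale than the other columns, so it dominates the unscaled PCA:

```r
pcs.raw  <- prcomp(USArrests)                 # no normalization
pcs.norm <- prcomp(USArrests, scale. = TRUE)  # normalized (standardized)

# cumulative proportion of variance explained by successive components
cum.raw  <- cumsum(pcs.raw$sdev^2)  / sum(pcs.raw$sdev^2)
cum.norm <- cumsum(pcs.norm$sdev^2) / sum(pcs.norm$sdev^2)

# number of components needed for at least 90% of total variance
which(cum.raw  >= 0.90)[1]   # 1 -- the large-scale variable dominates
which(cum.norm >= 0.90)[1]   # 3 -- variance spreads across components
```

Normalizing typically increases the number of components needed, because no single large-scale variable carries most of the (raw) variance.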
Assuming that data mining techniques are to be used in the following cases, identify whether the task required is supervised or unsupervised learning. Deciding whether to issue a loan to an applicant based on demographic and financial data (with reference to a database of similar data on prior customers). A. Supervised Learning B. Unsupervised Learning
A
Assuming that data mining techniques are to be used in the following cases, identify whether the task required is supervised or unsupervised learning. Identifying segments of similar customers. A. Unsupervised learning B. Supervised Learning
A
Assuming that data mining techniques are to be used in the following cases, identify whether the task required is supervised or unsupervised learning. Predicting whether a company will go bankrupt based on comparing its financial data to those of similar bankrupt and nonbankrupt firms. A. Supervised Learning B. Unsupervised Learning
A
Continued from the problem above. Explain why the 'Proportion of Variance' value for PC1 is so much greater than that of any other column. Hint: run pcs.cor$rot A. PC1 has a very high proportion of the variance because it is composed mostly of proline, which has a much larger scale than the other variables. B. PC1 has a very high proportion of the variance because it is composed mostly of malic acid, which has a much larger scale than the other variables. C. PC1 has a very high proportion of the variance because it is composed mostly of magnesium, which has a much larger scale than the other variables. D. PC1 has a very high proportion of the variance because it is composed mostly of alcohol, which has a much larger scale than the other variables.
A
If we want to understand what a particular R code does, which of the following functions do we use? A. help() B. seek() C. answer() D. question()
A
In order to choose how many principal components we should include, we often look at A. Cumulative proportion B. Principal component scores C. Standard deviation D. Proportion of variance
A
Refer back to the housing.df data frame. Use the aggregate() function to compute the median value of INDUS (percentage of land occupied by non-retail business) based on the CHAS (=1 if tract bounds river, =0 otherwise) variable. Which of the following is the correct code to do that? A. data <- aggregate(housing.df$INDUS, by = list(housing.df$CHAS), FUN = median) B. data <- aggregate(housing.df$CHAS, by = list(housing.df$INDUS), FUN = median) C. data <- aggregate(housing.df$INDUS, by = list(housing.df$CHAS), FUN = mean) D. data <- aggregate(housing.df$CHAS, by = list(housing.df$INDUS), FUN = mean)
A
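The same aggregate() pattern can be tried on R's built-in mtcars: here the median mpg is grouped by transmission type (am), analogous to INDUS grouped by CHAS (housing.df itself is not reproduced here):

```r
# summary statistic (first arg) grouped by the variable in by = list(...)
data <- aggregate(mtcars$mpg, by = list(mtcars$am), FUN = median)
names(data) <- c("am", "median.mpg")
data   # one row per group: am = 0 (automatic) and am = 1 (manual)
```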
Refer to question 7: For a closer view of the patterns, we'd like to zoom in to the range of 3500-5000 on the y-axis. Which of the following codes would execute this task? A. plot(appship.ts, xlab = "Year", ylab = "Shipments (in million dollars)", ylim = c(3500, 5000)) B. plot(appship.ts, xlab = "Year", ylab = "Shipments (in million dollars)", xlim = c(3500, 5000)) C. plot(appship.ts, xlab = "Year", ylab = "Shipments (in million dollars)", xlim = c(3500, 5000), ylim = c(3500, 5000)) D. None of the alternatives are correct
A
Removing all categorical variables and all records with missing numerical measurements from the data set, we set the new data to be universities.df1 . Which of the following codes would do that? A. universities.df1<-na.omit(universities.df[,-c(1:3)]) B. universities.df1<-universities.df[,-c(1:3)] C. universities.df1<-na.omit(universities.df) D. universities.df1<-na.omit(universities.df[-c(1:3),])
A
To generate a sample of ten normal random variables with mean 2 and variance 9, which of the following codes should we use? Select all appropriate answer(s). A. rnorm(10, mean = 2, sd = 3) B. rnorm(10, mean = 2, sd = 9) C. rnorm(10, mean = 3, sd = 2) D. rnorm(10, mean = 9, sd = 2)
A
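The key point is that rnorm() takes a standard deviation, not a variance: variance 9 means sd = sqrt(9) = 3. A small sketch (the seed is only for reproducibility of the illustration):

```r
set.seed(1)
x <- rnorm(10, mean = 2, sd = 3)   # 10 draws from N(mean = 2, var = 9)
length(x)                          # 10

# with a large sample the empirical sd approaches 3, not 9
sd(rnorm(100000, mean = 2, sd = 3))
```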
Using summary(housing.df$LIVING.AREA), what is the median of the Living Area (in square feet) of houses in the data set? A. 1548 B. 1550 C. 1873 D. 1874
A
Using summary(housing.df), What is the median home value in USD? A. $375.9K B. $325.1K C. $438.8K D. $392.7K
A
Using summary(housing.df), what is the minimum number of rooms a house includes? A. 3 B. 8 C. 1 D. 6
A
What is the average number of rooms a house has? If necessary, round your answer to the nearest integer. A. 7 B. 3 C. 6 D. 8
A
Which method should we use for the following business problem? Amazon wants to recommend certain products to a customer based on his/her historical purchases A. Collaborative Filtering B. Classification C. Prediction D. Association Rules
A
Which method should we use for the following business problem? How much is a car worth based on several indicators such as mileage, age, model, color, etc.? A. Prediction B. Classification C. Association Rules D. Collaborative Filtering
A
Which of the following R commands determines whether a data point is missing? A. is.na() B. is.missing() C. na() D. which()
A
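is.na() returns TRUE/FALSE per cell, so wrapping it in sum() counts missing values. The sketch uses R's built-in airquality data, which contains NAs in its Ozone and Solar.R columns:

```r
sum(is.na(airquality))       # total number of missing data points
colSums(is.na(airquality))   # missing values per column
```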
Which of the following codes produces the data in row 2 to row 4 and the first 5 columns? A. housing.df[2:4,1:5] B. housing.df(2:4,1:5) C. housing.df[1:5,2:4] D. housing.df[2:4,5]
A
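Bracket indexing on a data frame is [rows, columns], so rows 2 through 4 and the first 5 columns come first and second respectively. Illustrated on built-in mtcars in place of housing.df:

```r
mtcars[2:4, 1:5]        # rows 2-4, first 5 columns
dim(mtcars[2:4, 1:5])   # 3 rows, 5 columns
```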
Which of the following codes produces the first column of housing.df? A. housing.df[,1] B. housing.df[1,] C. housing.df[1,1] D. housing.df(,1)
A
Which of the following concepts explains why, when a model is fit to training data, zero error on those data is not necessarily good? A. Overfitting data B. Missing data C. Normalizing data D. Detecting outliers
A
Which of the following is NOT among the methods used in Predictive Analytics and Data Mining? A. Patterning B. Prediction C. Classification D. Recommendation systems E. Association rules
A
Which symbol is used to indicate a comment in R? A. # B. % C. $ D. @
A
How should we handle missing data? Select all appropriate options. A. Drop records with missing values when there are only a few of them. B. Replace missing values with a reasonable replacement. C. Omit all of the records that have missing values. D. Keep everything. R can handle those missing values.
A, B
What are other terms that are related to Predictive Analytics? Select all options that apply. A. Data Mining B. Machine Learning C. Data Science D. Artificial Intelligence
A, B, C, D
In R, if you want to understand what the function mean does, which of the following codes is the correct way to do it? Select all appropriate options. A. help(mean) B. help("mean") C. mean? D. ?mean
A, B, D
Which of the following graphs/plots that we covered are for data exploration? Select all that apply. A. Histograms B. Bar graphs C. Pie charts D. Boxplots E. Stem-and-leaf plots F. Scatterplots G. Line graphs
A, B, D, F, G
What are the characteristics of the principal components? Select all options that apply. A. Only a few of them contain most of the original information B. Their correlations are non-zero C. They are uncorrelated D. They are linear combinations of the original variables
A, C, D
Run round(cor(boston.housing.df),2). Select all variables that are positively correlated with CRIM. A. AGE B. MEDV C. PTRATIO D. RAD E. RM F. CAT..MEDV G. ZN H. DIS I. INDUS J. NOX K. CHAS L. LSTAT M. TAX
A, C, D, I, J, L, M
Given: housing.df <- read.csv("BostonHousing.csv") Which of the following codes produce the histogram plot for the INDUS column in 'BostonHousing.csv'? Select all possible choice(s). A. hist(housing.df$INDUS, xlab = "INDUS") B. hist(housing.df.INDUS, xlab = "INDUS") C. hist(housing.df.INDUS, xlab = "INDUS") D. hist(housing.df$INDUS)
A, D
To generate a random sample from 1 to 10, which of the following codes should we use? Select all appropriate answer(s). A. sample(1:10) B. sample(0:11) C. rnorm(10, mean = 1, sd = 1) D. sample(10)
A, D
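sample(10) is shorthand for sample(1:10): when given a single number n >= 1, sample() permutes 1 through n. A small check (the seed is only so the two calls can be compared):

```r
set.seed(7)
a <- sample(1:10)   # random permutation of 1..10
set.seed(7)
b <- sample(10)     # same thing, shorthand form
identical(a, b)     # TRUE
sort(a)             # 1 2 3 4 5 6 7 8 9 10 -- each value appears exactly once
```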
Compute the mean, median, min, max, and standard deviation for each of the quantitative variables. This can be done through R's sapply() function (e.g., sapply(data, mean, na.rm = TRUE)). Match the following items to their corresponding numerical values:
- Average of sugars amount (in grams)
- Median of consumer ratings
- Smallest calories per serving
- Largest amount of sodium (in milligrams) in a box of cereals
- Standard deviation of potassium (in milligrams)
Average of sugars amount (in grams) - 7.03
Median of consumer ratings - 40.40
Smallest calories per serving - 50.00
Largest amount of sodium (in milligrams) in a box of cereals - 320.00
Standard deviation of potassium (in milligrams) - 70.41
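The same sapply() pattern, sketched on built-in airquality in place of the cereals data (na.rm = TRUE skips missing values, which the cereals data also contains):

```r
sapply(airquality, mean,   na.rm = TRUE)   # mean of every column
sapply(airquality, median, na.rm = TRUE)   # median of every column
sapply(airquality, sd,     na.rm = TRUE)   # standard deviation of every column
```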
Assuming that data mining techniques are to be used in the following cases, identify whether the task required is supervised or unsupervised learning. In an online bookstore, making recommendations to customers concerning additional items to buy based on the buying patterns in prior transactions. A. Supervised Learning B. Unsupervised Learning
B
Every time we run rnorm(4, mean = 4, sd = 4), the outcome should always be the same. A. True B. False
B
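Without set.seed(), each call to rnorm() advances the random number generator, so repeated calls differ; fixing the seed makes the draws reproducible:

```r
x1 <- rnorm(4, mean = 4, sd = 4)
x2 <- rnorm(4, mean = 4, sd = 4)
identical(x1, x2)   # FALSE (almost surely) -- no seed was fixed

set.seed(123)
y1 <- rnorm(4, mean = 4, sd = 4)
set.seed(123)
y2 <- rnorm(4, mean = 4, sd = 4)
identical(y1, y2)   # TRUE -- same seed, same draws
```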
In 'PA_Topic03_Rnotebook.Rmd', the following codes produce the bar chart of Mean MEDV based on CHAS:
## barchart of CHAS vs. mean MEDV
# compute mean MEDV per CHAS = (0, 1)
data.for.plot <- aggregate(housing.df$MEDV, by = list(housing.df$CHAS), FUN = mean)
#head(data.for.plot)
names(data.for.plot) <- c("CHAS", "MeanMEDV")
barplot(data.for.plot$MeanMEDV, names.arg = data.for.plot$CHAS, xlab = "CHAS", ylab = "Avg. MEDV")
Repeat the same process to draw the bar chart of Mean CRIM based on CAT.MEDV. Select the most appropriate codes to achieve that. A. data.for.plot <- aggregate(housing.df$CAT.MEDV, by = list(housing.df$CRIM), FUN = mean) B. data.for.plot <- aggregate(housing.df$CRIM, by = list(housing.df$CAT.MEDV), FUN = mean) C. data.for.plot <- aggregate(housing.df$MEDV, by = list(housing.df$CHAS), FUN = mean) D. data.for.plot <- aggregate(housing.df$CRIM, by = list(housing.df$CAT.MEDV), FUN = median)
B
Let's run the following R codes:
#load the data and compute principal components
wine.df <- read.csv("wine.csv")
options(scipen = 999, digits = 2)
pcs.cor <- prcomp(wine.df[,-1])
summary(pcs.cor)
Consider the row labeled "Proportion of Variance." Based on this information, how many principal component(s) should we consider? A. 2 B. 1 C. More than 3 principal components are needed D. 3
B
Refer to questions 7-9. Using the R codes below, you can create a line graph of the series at a yearly aggregated level (i.e., the total shipments in each year).
#a line graph of the series at a yearly aggregated level
yearly.ts <- aggregate(appship.ts, FUN = mean)
plot(yearly.ts, xlab = "Year", ylab = "Shipments (in million dollars)", main = "Annual Shipment", ylim = c(3500, 5000))
In which of the time periods do the shipments show an increasing trend? A. 1987-1989 B. 1985-1987 C. 1987-1988 D. 1986-1988
B
The dataset ToyotaCorolla.csv contains data on used cars on sale during the late summer of 2004 in the Netherlands. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications. To analyze the data using various predictive analytics techniques described in future topics, let's first prepare the data for use as follows: The dataset has two categorical attributes, Fuel Type and Color. Let's convert these into binary dummy variables.
#install.packages("dummies") #if you need to install this library
library(dummies)
#load data
car.df <- read.csv("ToyotaCorolla.csv")
#create binary dummy variables for Fuel_Type and Color
car.df <- dummy.data.frame(car.df, names = c("Fuel_Type", "Color"), sep=".")
By doing this, how many more variables are created? Show the code you use to get the answer in the supporting R file. A. 10 B. 11 C. 50 D. 12
B
Time series data is best visualized in A. Bar graph B. Line graph C. Boxplot D. None of the alternatives are correct
B
Two models are applied to a data set that has been partitioned. Model A is considerably more accurate than model B on the training data, but slightly less accurate than model B on the validation data. Which model are you more likely to consider for final deployment? Write your explanation in the supported R codes for the homework. A. Model A B. Model B C. Either model is fine D. Can't choose. The data set shouldn't be partitioned.
B
Use dummy() to transform toyota.corolla.df$Fuel_Type into dummy variables. How many dummy variables are created when running such code? A. 2 B. 3 C. 1 D. 4
B
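The dummies package is dated, but base R's model.matrix() illustrates the same idea: a factor with one column per level. The Fuel_Type levels below (CNG, Diesel, Petrol) follow the Toyota Corolla example; the toy vector itself is made up:

```r
fuel <- factor(c("CNG", "Diesel", "Petrol", "Petrol", "Diesel"))
dummies <- model.matrix(~ fuel - 1)   # "- 1" drops the intercept: one 0/1 column per level
colnames(dummies)
ncol(dummies)   # 3 -- one dummy per factor level
```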
Which method should we use for the following business problem? A hotel wants to identify which customers would have a high churn rate. A. Prediction B. Classification C. Association Rules D. Collaborative Filtering
B
Which of the following R codes print out the first SIX elements in column 'ROOMS'? (Select all that apply) A. housing.df$ROOMS[6] B. head(housing.df$ROOMS) C. tail(housing.df$ROOMS) D. housing.df$ROOMS[1:6]
B, D
From the box plots in question 5 (PLOT 1 and PLOT 2), which one gives a clearer picture of the crime rate for areas with median home value either above or below $30,000? A. No difference B. PLOT 1 C. PLOT 2 D. Not enough information to conclude
C
From chapter 2's R codes, follow the steps below to create missing values and replace them with the mean of the remaining values in the LIVING.AREA column:
1. Set rows.to.missing to [154, 657, 1032, 3421, 2010, 5721, 45, 243, 4512, 1187]
2. Set housing.df$LIVING.AREA at rows.to.missing to NA
3. Replace these NA values with the mean of the non-NA values in the same column
4. Get the summary of housing.df$LIVING.AREA and report the median
A. 1548 B. 1874 C. 1550 D. 1873
C
Run the following R code:
housing.df <- read.csv("BostonHousing.csv")
Which of the following codes produces the scatter plot between columns TAX (as an outcome) and INDUS (as an input) in housing.df? A. plot(housing.df$TAX + housing.df$INDUS, xlab = "TAX", ylab = "INDUS") B. plot(TAX ~ INDUS, xlab = "TAX", ylab = "INDUS") C. plot(housing.df$TAX ~ housing.df$INDUS, xlab = "INDUS", ylab = "TAX") D. plot(housing.df$TAX ~ housing.df$INDUS, xlab = "TAX", ylab = "INDUS")
C
Running the following codes:
housing.df <- read.csv("BostonHousing.csv")
plot(housing.df$MEDV ~ housing.df$DIS, xlab = "DIS", ylab = "MEDV", log='xy') #line2
plot(housing.df$MEDV ~ housing.df$DIS, xlab = "DIS", ylab = "MEDV") #line3
Is there an advantage when observing the plot in line 2 versus line 3? A. Yes, the logarithmic scale on the x-axis in line 2 helps see the correlation more clearly B. Yes, the logarithmic scale on the y-axis in line 2 helps see the correlation more clearly C. Yes, the logarithmic scale on both the x- and y-axis in line 2 helps see the correlation more clearly D. No difference between line 2's and line 3's code
C
Scatterplots are best used to A. Display how many data points in each category B. Plot time series data C. Describe the relationship between two numerical variables D. Show the behavior of a variable
C
Using summary(housing.df), what is the average tax bill amount in USD that a homeowner has to pay? A. $4090 B. $1320 C. $4939 D. $4728
C
Using summary(housing.df), what year was the newest house built? A. 1935 B. 1937 C. 2011 D. 1955 E. 1920
C
Which of the following R function is used to stack a set of columns into a single column of data? A. cast() B. aggregate() C. melt() D. reshape()
C
Which of the following is NOT a correct partition? A. Training partition B. Validation partition C. Cross-validation D. Test partition
C
Which variables are quantitative/numerical? Which are ordinal? Which are nominal?
Calories, type, sodium, shelf, mfr, cups
Calories - Quantitative
type - Nominal
sodium - Quantitative
shelf - Ordinal
mfr - Nominal
cups - Quantitative
Comment on the use of normalization (standardization) in problem . A. Normalizing the data would be appealing to wine tasters. B. Normalizing the data would force the data to behave like a normal distribution. C. Normalizing the data would not be recommended in this situation. D. Normalizing the data would equalize the scales and eliminate the undesired impact of scale on the calculation of the principal components.
D
Refer to questions 7 and 8. Using the R code provided below, we create one chart with four separate lines, one line for each of Q1, Q2, Q3, and Q4. In R, this can be achieved by generating a data frame for each quarter Q1, Q2, Q3, Q4, and then plotting them as separate series on the line graph. Zoom in to the range of 3500-5000 on the y-axis.
#generate separate data frame for each quarter
Q1 <- ts(appship.ts[seq(1, 20, 4)])
Q2 <- ts(appship.ts[seq(2, 20, 4)])
Q3 <- ts(appship.ts[seq(3, 20, 4)])
Q4 <- ts(appship.ts[seq(4, 20, 4)])
#plot each quarter series on a separate line graph
plot(Q1, xlab = "Year", ylab = "Shipments (in million dollars)", ylim = c(3500, 5000))
lines(Q2, xlab = "Year", ylab = "Shipments (in million dollars)", col = "red")
lines(Q3, xlab = "Year", ylab = "Shipments (in million dollars)", col = "blue")
lines(Q4, xlab = "Year", ylab = "Shipments (in million dollars)", col = "green")
legend("bottomright", c("Q1", "Q2", "Q3", "Q4"), col = c("black", "red", "blue", "green"), lty = 1)
Does there appear to be a difference between quarters? A. Yes, we can see differences between the quarters. The shipments for quarter Q1 are generally higher than in other quarters. B. Yes, we can see differences between the quarters. The shipments for quarter Q3 are generally higher than in other quarters. C. Yes, we can see differences between the quarters. The shipments for quarter Q4 are generally higher than in other quarters. D. Yes, we can see differences between the quarters. The shipments for quarter Q2 are generally higher than in other quarters.
D
Run the following R code:
boston.housing.df <- read.csv("BostonHousing.csv")
Use the table() function with appropriate syntax to count the number of houses with the median value of owner-occupied homes in tract above $30,000 (CAT..MEDV = 1). A. 471 B. 35 C. 422 D. 84
D
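table() counts the occurrences of each value, so the same syntax applies to boston.housing.df$CAT..MEDV. A sketch on a toy 0/1 vector (the values below are made up, not from the Boston data):

```r
cat.medv <- c(0, 1, 0, 0, 1, 0)   # hypothetical binary indicator
table(cat.medv)                   # counts of 0s and 1s
table(cat.medv)["1"]              # count of records with value 1 (here 2)
```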
The objectives of Principal Components Analysis include all of the following EXCEPT A. To remove the overlap of information between numerical variables B. To provide a smaller-size set of numerical variables that contain most of the information C. To reduce the number of numerical variables D. To expand to a larger-size set of numerical variables to contain most of the information
D
What syntax should we include to normalize the variables in PCA? A. scaled = T B. normalized = T C. normal = T D. scale. = T
D
Which of the following codes is used to rescale the x values to a logarithmic scale? A. plot(housing.df$TAX ~ housing.df$LSTAT, xlab="LSTAT", ylab="TAX", log = 'xy') B. plot(housing.df$TAX ~ housing.df$LSTAT, xlab="LSTAT", ylab="TAX", ln = 'x') C. plot(housing.df$TAX ~ housing.df$LSTAT, xlab="LSTAT", ylab="TAX", log_10 = 'x') D. plot(housing.df$TAX ~ housing.df$LSTAT, xlab="LSTAT", ylab="TAX", log = 'x')
D
Why is data visualization important? A. It helps the data cleaning process B. It helps detect outliers and unexpected patterns C. It helps determine the usefulness of variables D. All of the listed choices are correct
D
Problem 4.3 Sales of Toyota Corolla Cars. The file ToyotaCorolla.csv contains data on used cars (Toyota Corollas) on sale during late summer of 2004 in the Netherlands. It has 1436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications. The goal will be to predict the price of a used Toyota Corolla based on its specifications. Run the following codes:
#load the data
toyota.corolla.df <- read.csv("ToyotaCorolla.csv")
head(toyota.corolla.df)
Identify whether the following variables are categorical or quantitative.
Fuel_Type, SportModel, Color, Price, Age_08_04, KM
Fuel_Type - Categorical
SportModel - Categorical
Color - Categorical
Price - Quantitative
Age_08_04 - Quantitative
KM - Quantitative
Match the visualization plots with what they are best used for.
Plots: Line graph, Bar chart, Scatterplot, Histogram, Boxplot
Uses: Relationship between two numerical variables; Comparing distributions; Show the distribution of a variable; Categorical variables; Time series
Line graph - Time series
Bar chart - Categorical variables
Scatterplot - Relationship between two numerical variables
Histogram - Show the distribution of a variable
Boxplot - Comparing distributions
Match the following visual feature to the corresponding predictive analytics task.
- Predict housing price based on several factors such as property tax, crime rate, # of bedrooms, # of bathrooms, etc.
- Recommend products that a particular customer would likely purchase
- Bundle products that tend to go together
- Forecast the Gross Domestic Product (GDP) based on unemployment rate, inflation rate, and producer price index
Choices: Focus on the relationship between output and inputs; Relationships among all variables
Predict housing price based on several factors such as property tax, crime rate, # of bedrooms, # of bathrooms, etc. - Focus on the relationship between output and inputs
Recommend products that a particular customer would likely purchase - Relationships among all variables
Bundle products that tend to go together - Relationships among all variables
Forecast the Gross Domestic Product (GDP) based on unemployment rate, inflation rate, and producer price index - Focus on the relationship between output and inputs
Continuing from question 9, we prepare the dataset (as factored into dummies) for data mining techniques of supervised learning by creating partitions in R. Select the variables and use default values for the random seed and partitioning percentages for training (60%) and validation (40%) sets. Use set.seed(201). What are the first 3 ids in the validation data set?
The first id in the validation set is 4 The second id in the validation set is 5 The third id in the validation set is 13
Match the following terms with their associated definition.
- Training partition - Validation partition - Test partition - set.seed()
A. is an optional way to double-check the model's performance on new data B. is used to evaluate the model on 'new data' that wasn't part of the model-developing process C. is used to build the model D. is to make sure the values of the random variables generated are the same for consistent evaluation
Training partition - is used to build the model
Validation partition - is used to evaluate the model on 'new data' that wasn't part of the model-developing process
Test partition - is an optional way to double-check the model's performance on new data
set.seed() - is to make sure the values of the random variables generated are the same for consistent evaluation
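A minimal sketch of a 60/40 training/validation split; the data frame, seed, and sizes here are illustrative stand-ins, not the homework's actual data:

```r
df <- data.frame(id = 1:100, x = rnorm(100))   # toy data

set.seed(201)                                   # fix the split for reproducibility
train.rows <- sample(rownames(df), 0.6 * nrow(df))   # 60% for training
train.df   <- df[train.rows, ]
valid.df   <- df[setdiff(rownames(df), train.rows), ]   # remaining 40%

nrow(train.df)   # 60
nrow(valid.df)   # 40
```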
Given the information below: Shipments of Household Appliances: Line Graphs. The file 'ApplianceShipments.csv' contains the series of quarterly shipments (in millions of dollars) of US household appliances between 1985 and 1989. To create a well-formatted time plot of the data using R, we use:
#need forecast package for creating time series of time data
library(forecast)
#load the data
appship.df <- read.csv("ApplianceShipments.csv")
#use time series analysis to create time series from the above data frame
appship.ts <- ts(appship.df$Shipments, start = ?, end = ?, freq = ?)
#line chart for the shipments data
plot(appship.ts, xlab = "Year", ylab = "Shipments (in million dollars)")
Match each '?' for start, end, and freq with one of the following choices: 4, c(1985,1), c(1989,4)
start - c(1985,1)
end - c(1989,4)
freq - 4
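As a check: a quarterly series (freq = 4) running from 1985 Q1 through 1989 Q4 spans 5 years x 4 quarters = 20 observations. The toy values below stand in for the actual ApplianceShipments.csv data:

```r
shipments <- rnorm(20, mean = 4300, sd = 300)   # hypothetical shipment values
appship.ts <- ts(shipments, start = c(1985, 1), end = c(1989, 4), freq = 4)

length(appship.ts)     # 20 observations
frequency(appship.ts)  # 4 observations per year
start(appship.ts)      # 1985 1
end(appship.ts)        # 1989 4
```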