R Final Exam
Regression is a
Method
The application case for auto insurance was about
Mining unstructured auto insurance claims
Did the WSJ fake accounts state their interests explicitly to TikTok when they signed up?
No
If you are trying to predict how much someone will buy, would you rather know their age or their war veteran status?
Age
Which demographic is buying the largest quantity of products? Choose the best answer.
Older people
Which of the following is true about R?
R is open source
The total set of documents under consideration is called
corpus
For the code " myFamilyAges[ myFamilyAges > 6 & myFamilyAges < 20 ] " , it selects the family members ___________.
that are older than 6 but younger than 20
A linear model uses ____________ to represent the relationship between a predictor variable and an outcome variable.
the slope and the intercept of a line
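As a minimal sketch of what "slope and intercept" means in practice, the example below fits a line with lm() to made-up data (the x and y values are assumptions chosen so the true line is y = 2 + 3x) and reads both numbers back with coef().

```r
# Toy data, invented for illustration: y is exactly 2 + 3 * x
x <- c(1, 2, 3, 4, 5)
y <- 2 + 3 * x

# Fit a linear model with one predictor
fit <- lm(y ~ x)

# coef() returns the intercept first, then the slope for x
intercept <- unname(coef(fit)[1])
slope <- unname(coef(fit)[2])
print(c(intercept = intercept, slope = slope))
```

The intercept is the predicted outcome when the predictor is zero, and the slope is the change in the outcome for each one-unit increase in the predictor.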
You can use the ______ function so that R knows the filename is from a location on the web.
url()
"We thought we were in the insurance business, but we were actually in the ___ business"
Knowledge
A sample begins with an entropy of 0.9. Then it is split in half. One half has an entropy of zero, and the other half has an entropy of 0.8. What is the resulting information gain from that split?
.5
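The arithmetic behind that answer can be sketched as follows: information gain is the parent entropy minus the weighted average of the child entropies. The variable names are illustrative only.

```r
# Entropy of the sample before the split
parent_entropy <- 0.9

# The split produces two halves, so each child has weight 0.5
child_entropy <- c(0, 0.8)
child_weight <- c(0.5, 0.5)

# Weighted average entropy after the split: 0.5 * 0 + 0.5 * 0.8 = 0.4
weighted_child_entropy <- sum(child_weight * child_entropy)

# Information gain = entropy before - weighted entropy after
info_gain <- parent_entropy - weighted_child_entropy
print(info_gain)
```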
One leaf has only one observation. What's the misclassification rate within that leaf?
0%
Assume that you can observe age and quantity and you want to predict War. Create a decision tree with War as the outcome variable. Use 10-fold cross-validation. You'll have to change War into a "factor" even though it is not a character. I provided an example of this in the ACS code recently. What is the accuracy rate of the final decision tree? Do not answer in percentage terms. Type it in the way it is listed in R output.
1
library(caret)
dat$WarFactor <- as.factor(dat$War)
dat$ageFactor <- as.factor(dat$age)
dat$QuantityFactor <- as.factor(dat$Quantity)
## it's already all numbers, so you don't have to
## worry about converting character variables to levels
trainList <- createDataPartition(y=dat$WarFactor, p=.5, list = FALSE)
trainData <- dat[trainList,]
testData <- dat[-trainList,]
## note: number = 5 gives 5 folds; the question asks for 10-fold (number = 10)
trctrl <- trainControl(method = "repeatedcv", number = 5)
model.rpart <- train(WarFactor ~ ageFactor + QuantityFactor, method = "rpart", data = trainData, trControl = trctrl, tuneLength = 20)
model.rpart
library(rpart.plot)
rpart.plot(model.rpart$finalModel)
What is the average number of items all customers buy?
108.6519, from mean(dat$Quantity)
What is the average number of items bought by customers aged over 25 (age>25)?
111.46, from mean(dat$Quantity[dat$age>25])
What is the average number of items bought by people who were in the war (War == 1)?
141.43
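The three averages above all follow the same pattern: subset Quantity with a logical condition, then take the mean. The sketch below uses a small made-up data frame (the real dat is not shown here, so these values are assumptions and will not reproduce the answers above).

```r
# Hypothetical stand-in for the course data set `dat`
dat <- data.frame(
  age = c(22, 30, 41, 19, 55),
  War = c(0, 1, 1, 0, 0),
  Quantity = c(90, 140, 150, 80, 110)
)

# Average over all customers
mean_all <- mean(dat$Quantity)

# Average for customers older than 25
mean_over_25 <- mean(dat$Quantity[dat$age > 25])

# Average for customers who were in the war (War == 1)
mean_war <- mean(dat$Quantity[dat$War == 1])

print(c(all = mean_all, over_25 = mean_over_25, war = mean_war))
```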
How many radar stations does NOAA maintain?
158
What's the misclassification rate for the overall model? (The model predicts passing the test based on study hours.)
20%
In what presidential year did Nate Silver first famously predict the election results (before they came in) using data analytics?
2012
The other leaf contains 4 observations (3 people who passed the test and 1 person who failed the test). What's the misclassification rate within that leaf?
25%
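The leaf-level and overall misclassification rates in these cards fit together as sketched below, assuming the tree has exactly the two leaves described (one leaf with 1 observation and no errors, one leaf with 4 observations and 1 error).

```r
# Number of observations and number of misclassified observations per leaf
leaf_sizes <- c(1, 4)
leaf_errors <- c(0, 1)

# Misclassification rate within each leaf
leaf_rate <- leaf_errors / leaf_sizes

# Overall rate: total errors divided by total observations
overall_rate <- sum(leaf_errors) / sum(leaf_sizes)

print(leaf_rate)
print(overall_rate)
```

Note that the overall rate is weighted by leaf size, so it is not the simple average of the two leaf rates.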
"In each of the 6 years from 2007 to 2013, Friedberg's company used __ times more data than the year before."
40
Chap4Function <- function(the_vector) {
  med <- median(the_vector)
  return(med)
}
Chap4Function(smallSurvey$Age)
What do you get if you run that?
45
# Remove NAs from the LTR column
# (instructions from Chapter 3!)
library(dplyr)  # provides filter()
smallSurvey <- filter(smallSurvey, !is.na(LTR))
# What is the average Likelihood.to.recommend?
mean(smallSurvey$LTR)
7.29
# Create a function that returns the mean of LTR
mean_LTR <- function(inputVector) {
  return(mean(inputVector))
}
mean_LTR(smallSurvey$LTR)
What do you get when you run that line of code?
7.29
If there are 8 predictions and 6 of them are correct, then what is the accuracy rate?
75%
The author, Cal Newport, lists two core abilities for thriving in the new economy. Name one of them:
According to Cal Newport, having the ability to quickly master hard things is a core ability for thriving in the new economy.
Create a scatter plot showing Quantity on the vertical axis and age on the horizontal axis. How is age related to quantity bought?
According to the scatterplot, as age increases, the quantity bought increases. There is a direct (positive) relationship between age and Quantity.
library(ggplot2)
g <- ggplot(dat) + aes(x = age, y = Quantity) + geom_point()
g
The key question, according to Cowen, is
Are you good at working with intelligent machines?
_________ (also known as "parameter") is a term used by computer scientists to refer to some information that is sent into a function to help it do its job.
Argument
"American culture, in particular, loves the storyline of the prodigy..." What does Cal Newport claim is wrong with the prodigy narrative? [Write 1 or 2 sentences.]
Cal Newport claims the prodigy narrative is wrong because it fails to consider deliberate practice, which requires you to focus tightly on the skill you are trying to improve, with no distractions. He argues that becoming skillful at something almost always requires deep practice of that skill.
What was the technological advance that made it possible to assemble and manipulate the radar data?
Cloud computing
The model is Quantity ~ age + Ad. What is the estimate of the Intercept coefficient on quantity? That means it is the expected quantity if age and Ad are both zero.
Correct answer: 18.52441
Which is a purity measure?
Entropy
A psychiatrist devised a short screening test for depression. We are evaluating the Accuracy metrics for this short test. The outcomes of the short test are compared with a gold standard for diagnosis of depression among 200 psychiatric outpatients. Assume that the gold standard is 100% accurate. Among the 50 outpatients found to be depressed according to the gold standard, 35 patients were positive for the short test. Among 150 patients found not to be depressed according to the gold standard, 30 patients were found to be positive for the test. The sensitivity was 80% (The "true positive rate" = Sensitivity = the proportion of subjects with the disease correctly diagnosed by the test)
FALSE
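The screening-test figures can be checked directly from the counts given in the question. The sketch below derives the four confusion-matrix cells and then the sensitivity, specificity, and prevalence; the variable names are illustrative.

```r
# Counts from the question
depressed <- 50       # positive by the gold standard
not_depressed <- 150  # negative by the gold standard
true_pos <- 35        # depressed and test positive
false_pos <- 30       # not depressed but test positive

# Remaining cells of the confusion matrix
false_neg <- depressed - true_pos      # 15
true_neg <- not_depressed - false_pos  # 120

# Sensitivity (true positive rate): 35 / 50 = 0.7, so the 80% claim is false
sensitivity <- true_pos / (true_pos + false_neg)

# Specificity (true negative rate): 120 / 150 = 0.8
specificity <- true_neg / (true_neg + false_pos)

# Prevalence: 50 / 200 = 0.25
prevalence <- depressed / (depressed + not_depressed)

print(c(sensitivity = sensitivity, specificity = specificity, prevalence = prevalence))
```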
Computers could beat humans in Jeopardy before chess.
FALSE
In Chapter 2, Cowen argues that everyone needs to become a programmer. (Hint: The answer is on page 21 in Chapter 2.)
FALSE
In the new economy, the wages for people with no technical training are going down. The wages for people with advanced technical training are also going down.
FALSE
It is not necessary to convert repair_cat to a factor, as I do on line 46. I could use it in a decision tree if I leave it as verbal information.
FALSE
The decision tree is trained or created on line 55. The rpart.plot() command on line 60 is just creating a visual for a tree that already exists. The visual is nice for human interpretation.
FALSE
The specificity was 80% (The "true negative rate" = Specificity = the proportion of subjects without the disease correctly excluded by the test)
TRUE
What company did Friedberg work for at the beginning of the story?
Google
On page 10, Cowen says (writing in 2013) that the "future of technology is likely to illuminate the unsettling implications of ___."
How predictable we are
RStudio is an ______ for R.
IDE (integrated development environment)
Which is a splitting criterion?
Information Gain
Where did Friedberg find data on how many plants were in a field at any moment in time for the past 40 years?
Infrared satellite images
Which of the following is true about read_excel()?
It guesses the format of xls or xlsx
Which of the following data formats is an increasingly common way of sharing data on the web, particularly from web-based "application programming interfaces"?
JSON
Which of the following is best practice?
Reserve validation data to test the model later
If we flatten the curve, we will achieve:
A slower spread of the virus, so that sick people can get treatment
Which is true?
Sometimes computers are better and cheaper for a classification task
Classification is
Supervised
Regression is
Supervised
A new variable called repair_cat is created in lines 41 and 42. The new variable is a character variable, meaning it contains text not numbers.
TRUE
Among the 200 people, 50 are actually depressed, so the prevalence of depression was 25%.
TRUE
As Brynjolfsson and McAfee emphasize, this Great Restructuring [of the labor market] is not driving down all jobs but is instead dividing them.
TRUE
Classification is supervised.
TRUE
Having more data allows you to reduce uncertainty and quantify risk.
TRUE
Information Retrieval is part of Text Analytics
TRUE
Not only did Friedberg use data to create a specialized product for each farmer, but he used this customized data to help him sell the product to them.
TRUE
Smart software is already used to ferret out phony web reviews on sites such as Amazon and TripAdvisor (according to Cowen in 2013).
TRUE
The CRISP cycle starts with Business Understanding.
TRUE
When I use rbind() on line 52, I am doubling the number of rows in our fake data set. This appends a copy of the oil data in such a way that the number of columns remains the same in the new fake data set called oil2.
TRUE
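A minimal sketch of that row-doubling behavior is shown below. The oil data frame here is a made-up stand-in (its column names and values are assumptions, not the course data); only rbind() and the name oil2 come from the card.

```r
# Hypothetical stand-in for the oil data set
oil <- data.frame(
  oil_changes = c(1, 3, 5),
  cost = c(300, 200, 100)
)

# Stack the data frame on top of a copy of itself
oil2 <- rbind(oil, oil)

# Rows double, columns stay the same
print(dim(oil))
print(dim(oil2))
```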
What can we conclude from seeing one pitch from Justin Bieber?
That pitch is probably close to his average ability, within a standard deviation
The #FlattenTheCurve visual communicates:
The way the trend of the virus affects the medical system
How does the TikTok experience begin for a new user?
They are shown popular videos
How did the WSJ figure out the algorithm explained in the video report?
They created bot TikTok users
Which of the following is true about the code below? " colNameSort <- function(inputDF) {} "
This function only has one argument
Which of the following groups of workers will NOT do better in the new information-age economy?
Those who can't work with computers
If a model is not complex enough, that is called
Underfitting
Association mining is
Unsupervised
Clustering or cluster analysis is
Unsupervised
Link analysis is
Unsupervised
Do people who were in the war buy more on average?
YES
Next, use a linear model to predict quantity purchased. The model is Quantity ~ age + War Which input variable is significant, meaning it has a small p-value?
age
Which input variable turns out to be more important for predicting War? You can see this by plotting the tree or by checking variable importance.
age
Which of the following codes combines numbers 1, 10, and 100 into a single vector?
c(1,10,100)
You use air pressure to predict whether there will be rain or no rain tomorrow. This is an example of a prediction technique, _______.
classification
In R, _______ refers to any interface where you can directly type in commands.
console
Data formats that are human readable are _______
easy to diagnose when something goes wrong
Which of the following codes defines a function?
function( arguments ) { codes here }
On line 36 of the provided .R file, which piece of the code indicates that the graph will be a scatter plot?
geom_point()
Write code to create a histogram of 200 samples from your jar vector (sample size = 8; with replacement).
hist(replicate(200, mean(sample(jar, size=8, replace=TRUE)), simplify=TRUE))
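To see what that one-liner does step by step, here is a sketch with a made-up jar vector (the contents of the real jar are not shown, so a 50/50 mix of 1s and 0s is an assumption). It draws one sample of 8 with replacement, repeats the draw 200 times keeping each sample mean, and bins the means.

```r
set.seed(1)  # only so the random draws are repeatable

# Hypothetical jar: 50 gumballs coded 1 and 50 coded 0
jar <- c(rep(1, 50), rep(0, 50))

# One sample of 8 gumballs, drawn with replacement
one_draw <- sample(jar, size = 8, replace = TRUE)

# Repeat the sampling 200 times, keeping the mean of each sample
sample_means <- replicate(200, mean(sample(jar, size = 8, replace = TRUE)), simplify = TRUE)

# Bin the 200 means; drop plot = FALSE to draw the histogram on screen
h <- hist(sample_means, plot = FALSE)
print(h$counts)
```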
The input and output variables are also called the _______ variable and the _____ variable.
independent; dependent
Which of the following codes finds the minimum number in a vector?
min()
You define the object myFamilyAges as " myFamilyAges <- c(43, 42, 12, 8, 5) " Which of the following codes selects the ages of the family members who are older than 21?
myFamilyAges[ myFamilyAges > 21 ]
Which of the following codes selects the third, the second, and the fifth elements of the myFamilyAges vector?
myFamilyAges[ c(3,2,5) ]
You define the object myFamilyAges as " myFamilyAges <- c(43, 42, 12, 8, 5) " Which of the following codes selects the family members' ages that are not 12?
myFamilyAges[ myFamilyAges != 12]
Which of the following codes finds the min and max values of a vector?
range()
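The vector-selection cards can be tried out together. The sketch below uses the myFamilyAges vector exactly as defined in the questions; the result variable names are illustrative.

```r
myFamilyAges <- c(43, 42, 12, 8, 5)

# Logical condition: keep ages greater than 21
older_than_21 <- myFamilyAges[myFamilyAges > 21]

# Two conditions joined with &: older than 6 but younger than 20
between_6_and_20 <- myFamilyAges[myFamilyAges > 6 & myFamilyAges < 20]

# Positional indexing: third, second, then fifth element
picked <- myFamilyAges[c(3, 2, 5)]

# Everything except the value 12
not_12 <- myFamilyAges[myFamilyAges != 12]

# min() gives the smallest value; range() gives both min and max
smallest <- min(myFamilyAges)
min_and_max <- range(myFamilyAges)

print(older_than_21)
print(between_6_and_20)
print(picked)
print(not_12)
print(min_and_max)
```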
Which of the following R packages should you use to read a comma-separated values (CSV) file?
readr
You use the number of oil changes for a vehicle to predict the repair costs for that vehicle. This is an example of a prediction technique, _______.
regression
What code can you use to sample 8 gumballs with replacement? (hint: The sample() function is discussed in Chapter 6 with lots of examples.)
sample(jar, size=8, replace=TRUE)
Using honest means to rise in search results rankings is
white hat SEO