R Final Exam

Regression is a

Method

The application case for auto insurance was about

Mining unstructured auto insurance claims

Did the WSJ fake accounts state their interests explicitly to TikTok when they signed up?

No

If you are trying to predict how much someone will buy, would you rather know their age or their war veteran status?

Age

Which demographic is buying the largest quantity of products? Choose the best answer.

Older people

Which of the following is true about R?

R is open source

The total set of documents under consideration is called

corpus

For the code " myFamilyAges[ myFamilyAges > 6 & myFamilyAges < 20 ] " , it selects the family members ___________.

that are older than 6 but younger than 20

A linear model uses ____________to represent the relationship between a predictor variable and an outcome variable.

the slope and the intercept of a line
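As a minimal sketch of what "the slope and the intercept of a line" means in practice, the snippet below fits `lm()` to a small made-up data set. The `age` and `quantity` values are hypothetical and chosen to lie exactly on a line so the fitted numbers are easy to check; they are not the course data.

```r
# Hypothetical data: quantity lies exactly on the line 10 + 2 * age
age <- c(20, 25, 30, 35, 40, 45, 50, 55)
quantity <- 10 + 2 * age

# Fit a linear model with one predictor
fit <- lm(quantity ~ age)

# The whole model is summarized by two numbers
intercept <- unname(coef(fit)["(Intercept)"])
slope <- unname(coef(fit)["age"])

# A prediction is just intercept + slope * predictor
predicted <- intercept + slope * 33
```

With real, noisy data the points would scatter around the line, but the prediction would still be built from the same two coefficients.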

You can use the ______ function so that R knows the filename is from a location on the web.

url()

"We thought we were in the insurance business, but we were actually in the ___ business"

Knowledge

A sample begins with an entropy of 0.9. Then it is split in half. One half has an entropy of zero, and the other half has an entropy of 0.8. What is the resulting information gain from that split?

0.5
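The arithmetic behind this answer can be sketched as follows. Information gain is the parent entropy minus the weighted average of the child entropies; the helper names `entropy` and `info_gain` are illustrative, not from the course code.

```r
# Entropy (in bits) of a vector of class proportions
entropy <- function(p) {
  p <- p[p > 0]            # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Information gain = parent entropy - weighted average of child entropies
info_gain <- function(parent_entropy, child_entropies, child_weights) {
  parent_entropy - sum(child_weights * child_entropies)
}

# Parent entropy 0.9; two equal halves with entropies 0 and 0.8
gain <- info_gain(0.9, c(0, 0.8), c(0.5, 0.5))
gain
```

The weighted child entropy is 0.5 * 0 + 0.5 * 0.8 = 0.4, so the gain is 0.9 - 0.4 = 0.5.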

One leaf has only one observation. What's the misclassification rate within that leaf?

0%

Assume that you can observe age and quantity and you want to predict War. Create a decision tree with War as the outcome variable. Use 10-fold cross-validation. You'll have to change War into a "factor" even though it is not a character. I provided an example of this in the ACS code recently. What is the accuracy rate of the final decision tree? Do not answer in percentage terms. Type it in the way it is listed in R output.

1

library(caret)

# War, age, and Quantity are already all numbers, so you don't have to
# worry about converting character variables to levels
dat$WarFactor <- as.factor(dat$War)
dat$ageFactor <- as.factor(dat$age)
dat$QuantityFactor <- as.factor(dat$Quantity)

# Split the data into training and test sets
trainList <- createDataPartition(y = dat$WarFactor, p = .5, list = FALSE)
trainData <- dat[trainList, ]
testData <- dat[-trainList, ]

# 10-fold cross-validation, as the question asks
trctrl <- trainControl(method = "cv", number = 10)
model.rpart <- train(WarFactor ~ ageFactor + QuantityFactor, method = "rpart", data = trainData, trControl = trctrl, tuneLength = 20)
model.rpart

# Plot the final tree
library(rpart.plot)
rpart.plot(model.rpart$finalModel)

What is the average number of items all customers buy?

108.6519, from mean(dat$Quantity)

What is the average number of items bought by customers aged over 25 (age>25)?

111.46, from mean(dat$Quantity[dat$age>25])

What is the average number of items bought by people who were in the war (War == 1)?

141.43

How many radar stations does NOAA maintain?

158

What's the misclassification rate for the overall model? (The model predicts passing the test based on study hours.)

20%

In what presidential year did Nate Silver first famously predict the election results (before they came in) using data analytics?

2012

The other leaf contains 4 observations (3 people who passed the test and one person who failed the test). What's the misclassification rate within that leaf?

25%
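The leaf-level and overall rates in these tree questions fit together, as the sketch below shows. It assumes each leaf predicts its majority class; the label of the single observation in the first leaf is an assumption (the questions do not state it), and `misclass_rate` is an illustrative helper name.

```r
# Misclassification rate of a leaf that predicts its majority class
misclass_rate <- function(actual) {
  majority <- names(which.max(table(actual)))
  mean(actual != majority)
}

leaf_a <- c("fail")                          # one observation (label assumed)
leaf_b <- c("pass", "pass", "pass", "fail")  # 3 passed, 1 failed

rate_a <- misclass_rate(leaf_a)
rate_b <- misclass_rate(leaf_b)

# Overall rate: weight each leaf by how many observations it holds
n_a <- length(leaf_a)
n_b <- length(leaf_b)
overall <- (rate_a * n_a + rate_b * n_b) / (n_a + n_b)
```

A one-observation leaf can never be wrong about its own observation (0%), the four-observation leaf misses one of four (25%), and together that is one error in five observations (20%).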

"In each of the 6 years from 2007 to 2013, Friedberg's company used __ times more data than the year before."

40

Chap4Function <- function(the_vector) {
  med <- median(the_vector)
  return(med)
}
Chap4Function(smallSurvey$Age)

What do you get if you run that?

45

# Remove NAs from the LTR column
# (instructions from Chapter 3!)
smallSurvey <- filter(smallSurvey, !is.na(LTR))

# What is the average Likelihood.to.recommend?
mean(smallSurvey$LTR)

7.29

# Create a function that returns the mean of LTR
mean_LTR <- function(inputVector) {
  return(mean(inputVector))
}
mean_LTR(smallSurvey$LTR)

What do you get when you run that line of code?

7.29

If there are 8 predictions and 6 of them are correct, then what is the accuracy rate?

75%
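A quick way to check an accuracy rate in R is to compare predictions with actual values and take the mean of the matches. The two vectors below are made up so that exactly 6 of 8 predictions are correct.

```r
# Hypothetical predictions and true labels: the last two predictions are wrong
predicted <- c(1, 1, 0, 0, 1, 0, 1, 1)
actual    <- c(1, 1, 0, 0, 1, 0, 0, 0)

# TRUE counts as 1 and FALSE as 0, so the mean is the share of correct predictions
accuracy <- mean(predicted == actual)
accuracy
```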

The author, Cal Newport, lists two core abilities for thriving in the new economy. Name one of them:

According to Cal Newport, having the ability to quickly master hard things is a core ability for thriving in the new economy.

Create a scatter plot showing Quantity on the vertical axis and age on the horizontal axis. How is age related to quantity bought?

According to the scatterplot, quantity bought increases as age increases. There is a direct relationship between age and Quantity.

library(ggplot2)
g <- ggplot(dat) + aes(x = age, y = Quantity) + geom_point()
g

The key question, according to Cowen, is

Are you good at working with intelligent machines?

_________ (also known as "parameter") is a term used by computer scientists to refer to some information that is sent into a function to help it do its job.

Argument

"American culture, in particular, loves the storyline of the prodigy..." What does Cal Newport claim is wrong with the prodigy narrative? [Write 1 or 2 sentences.]

Cal Newport claims the prodigy narrative is wrong because it fails to consider deliberate practice, which requires you to focus tightly, without distraction, on the skill you are trying to improve. He argues that becoming skillful at something almost always requires deep practice of that skill.

What was the technological advance that made it possible to assemble and manipulate the radar data?

Cloud computing

The model is Quantity ~ age + Ad. What is the estimate of the Intercept coefficient on quantity? That means it is the expected quantity if age and Ad are both zero.

18.52441

Which is a purity measure?

Entropy

A psychiatrist devised a short screening test for depression. We are evaluating the Accuracy metrics for this short test. The outcomes of the short test are compared with a gold standard for diagnosis of depression among 200 psychiatric outpatients. Assume that the gold standard is 100% accurate. Among the 50 outpatients found to be depressed according to the gold standard, 35 patients were positive for the short test. Among 150 patients found not to be depressed according to the gold standard, 30 patients were found to be positive for the test. The sensitivity was 80% (The "true positive rate" = Sensitivity = the proportion of subjects with the disease correctly diagnosed by the test)

FALSE
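The screening-test numbers can be worked through from the four cells of the confusion matrix. The sketch below uses only the counts given in the question; the variable names are illustrative.

```r
# Counts from the question (gold standard vs. short test)
tp <- 35          # depressed, test positive
fn <- 50 - tp     # depressed, test negative
fp <- 30          # not depressed, test positive
tn <- 150 - fp    # not depressed, test negative

sensitivity <- tp / (tp + fn)                   # true positive rate
specificity <- tn / (tn + fp)                   # true negative rate
prevalence  <- (tp + fn) / (tp + fn + fp + tn)  # share actually depressed

c(sensitivity = sensitivity, specificity = specificity, prevalence = prevalence)
```

Sensitivity comes out to 35/50 = 70%, not 80%, which is why the statement is false. The same counts give a specificity of 120/150 = 80% and a prevalence of 50/200 = 25%.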

Computers could beat humans in Jeopardy before chess.

FALSE

In Chapter 2, Cowen argues that everyone needs to become a programmer. (Hint: The answer is on page 21 in Chapter 2.)

FALSE

In the new economy, the wages for people with no technical training are going down. The wages for people with advanced technical training are also going down.

FALSE

It is not necessary to convert repair_cat to a factor, as I do on line 46. I could use it in a decision tree if I leave it as verbal information.

FALSE

The decision tree is trained or created on line 55. The rpart.plot() command on line 60 is just creating a visual for a tree that already exists. The visual is nice for human interpretation.

FALSE

The specificity was 80% (The "true negative rate" = Specificity = the proportion of subjects without the disease correctly excluded by the test)

TRUE (120 of the 150 patients without depression tested negative, and 120/150 = 80%)

What company did Friedberg work for at the beginning of the story?

Google

On page 10, Cowen says (writing in 2013) that the "future of technology is likely to illuminate the unsettling implications of ___."

How predictable we are

RStudio is an ______ for R.

IDE (integrated development environment)

Which is a splitting criterion?

Information Gain

Where did Friedberg find data on how many plants were in a field at any moment in time for the past 40 years?

Infrared satellite images

Which of the following is true about read_excel()?

It guesses whether the file format is xls or xlsx

Which of the following data formats is an increasingly common way of sharing data on the web, particularly from web-based "application programming interfaces"?

JSON

Which of the following is best practice?

Reserve validation data to test the model later

If we flatten the curve, we will achieve:

A slower spread of the virus, so that sick people can get treatment

Which is true?

Sometimes computers are better and cheaper for a classification task

Classification is

Supervised

Regression is

Supervised

A new variable called repair_cat is created in lines 41 and 42. The new variable is a character variable, meaning it contains text not numbers.

TRUE

Among the 200 people, 50 are actually depressed, so the prevalence of depression was 25%.

TRUE

As Brynjolfsson and McAfee emphasize, this Great Restructuring [of the labor market] is not driving down all jobs but is instead dividing them.

TRUE

Classification is supervised.

TRUE

Having more data allows you to reduce uncertainty and quantify risk.

TRUE

Information Retrieval is part of Text Analytics

TRUE

Not only did Friedberg use data to create a specialized product for each farmer, but he used this customized data to help him sell the product to them.

TRUE

Smart software is already used to ferret out phony web reviews on sites such as Amazon and TripAdvisor (according to Cowen in 2013).

TRUE

The CRISP cycle starts with Business Understanding.

TRUE

When I use rbind() on line 52, I am doubling the number of rows in our fake data set. This appends a copy of the oil data in such a way that the number of columns remains the same in the new fake data set called oil2.

TRUE

What can we conclude from seeing one pitch from Justin Bieber?

That pitch is probably close to his average ability, within a standard deviation

The #FlattenTheCurve visual communicates:

The way the trend of the virus affects the medical system

How does the TikTok experience begin for a new user?

They are shown popular videos

How did the WSJ figure out the algorithm explained in the video report?

They created bot TikTok users

Which of the following is true about the code below? " colNameSort <- function(inputDF) {} "

This function only has one argument
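The function in the question has an empty body, so only its signature matters: a single argument named `inputDF`. As a hedged illustration, the sketch below fills in one plausible body (reordering columns alphabetically by name); that body is an assumption, not the course's implementation.

```r
# One argument, inputDF; the body is an assumed example implementation
colNameSort <- function(inputDF) {
  inputDF[, order(colnames(inputDF)), drop = FALSE]
}

# Count the arguments programmatically
n_args <- length(formals(colNameSort))

# Try it on a small made-up data frame
demo <- data.frame(b = 1:2, a = 3:4)
sorted <- colNameSort(demo)
colnames(sorted)
```

`formals()` lists a function's arguments, which is a quick way to confirm how many it takes.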

Which of the following groups of workers will NOT do better in the new information-age economy?

Those who can't work with computers

If a model is not complex enough, that is called

Underfitting

Association mining is

Unsupervised

Clustering or cluster analysis is

Unsupervised

Link analysis is

Unsupervised

Do people who were in the war buy more on average?

YES

Next, use a linear model to predict quantity purchased. The model is Quantity ~ age + War Which input variable is significant, meaning it has a small p-value?

age

Which input variable turns out to be more important for predicting War? You can see this by plotting the tree or by checking variable importance.

age

Which of the following codes combines numbers 1, 10, and 100 into a single vector?

c(1,10,100)

You use air pressure to predict whether there will be rain or no rain tomorrow. This is an example of a prediction technique, _______.

classification

In R, _______ refers to any interface where you can directly type in commands.

console

Data formats that are human readable are _______

easy to diagnose when something goes wrong

Which of the following defines a function?

function( arguments ) { codes here }

On line 36 of the provided .R file, which piece of the code indicates that the graph will be a scatter plot?

geom_point()

Write code to create a histogram of 200 samples from your jar vector (sample size = 8; with replacement).

hist(replicate(200, mean(sample(jar, size=8, replace=TRUE)), simplify=TRUE))
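The one-liner is easier to follow when split into steps. The sketch below assumes a numeric `jar` vector (1 for a red gumball, 0 for a blue one); the actual jar contents from the course are not shown here, so these values are made up.

```r
set.seed(1)  # make the random draws repeatable

# Hypothetical jar: 50 red gumballs (1) and 50 blue gumballs (0)
jar <- c(rep(1, 50), rep(0, 50))

# One sample of 8 gumballs, drawn with replacement
one_draw <- sample(jar, size = 8, replace = TRUE)

# Repeat the sampling 200 times, keeping the mean of each sample
sample_means <- replicate(200, mean(sample(jar, size = 8, replace = TRUE)))

# Histogram of the 200 sample means
hist(sample_means)
```

Each sample mean is the share of red gumballs in that draw of 8, so the histogram shows how much that share varies from sample to sample.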

The input and output variables are also called the _______ variable and the _____ variable.

independent; dependent

Which of the following finds the minimum number in a vector?

min()

You define the object myFamilyAges as " myFamilyAges <- c(43, 42, 12, 8, 5) " Which of the following codes selects the family members' ages that are older than 21?

myFamilyAges [myFamilyAges > 21]

Which of the following codes selects the third, the second, and the fifth elements of the myFamilyAges vector?

myFamilyAges[ c(3,2,5) ]

You define the object myFamilyAges as " myFamilyAges <- c(43, 42, 12, 8, 5) " Which of the following codes selects the family members' ages that are not 12?

myFamilyAges[ myFamilyAges != 12]
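The vector-selection questions in this set can all be checked by running them against the same `myFamilyAges` vector. The result variable names below are illustrative.

```r
myFamilyAges <- c(43, 42, 12, 8, 5)

# Logical test inside the brackets keeps only the elements where it is TRUE
older_than_21 <- myFamilyAges[myFamilyAges > 21]                     # 43 42
between_6_20  <- myFamilyAges[myFamilyAges > 6 & myFamilyAges < 20]  # 12 8
not_12        <- myFamilyAges[myFamilyAges != 12]                    # 43 42 8 5

# A vector of positions picks elements in the order given
picked <- myFamilyAges[c(3, 2, 5)]                                   # 12 42 5
```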

Which of the following finds the min and max values of a vector?

range()
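A short check of how `min()` and `range()` differ, using the vector from the `c(1,10,100)` question:

```r
v <- c(1, 10, 100)

smallest <- min(v)    # a single value: 1
both     <- range(v)  # a length-2 vector: minimum first, then maximum
both
```

`range()` returns the minimum and maximum together, so it answers the "min and max" question in one call.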

Which of the following R packages should you use to read a comma-separated values (CSV) file?

readr

You use the number of oil changes for a vehicle to predict the repair costs for that vehicle. This is an example of a prediction technique, _______.

regression

What code can you use to sample 8 gumballs with replacement? (hint: The sample() function is discussed in Chapter 6 with lots of examples.)

sample(jar, size=8, replace=TRUE)

Using honest means to rise in search results rankings is

white hat SEO

