BDAN Final

What are the goodness-of-fit measures?

The coefficient of determination (R^2) and adjusted R^2; higher values indicate a better fit.

What is the median of the following numbers? 6, 10, 10, 10, 1, 4, 8

8 (the middle value once the numbers are sorted: 1, 4, 6, 8, 10, 10, 10)
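A minimal sketch of the median calculation, using Python's standard `statistics` module. The variable names are illustrative only.

```python
import statistics

values = [6, 10, 10, 10, 1, 4, 8]

# Sorting makes the middle position easy to see.
ordered = sorted(values)

# With an odd count (7 values), the median is the 4th sorted value.
median = statistics.median(values)

print(ordered)
print(median)
```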

What is the probability of a soccer team winning with odds of 5 (i.e., 5 to 1 odds)?

0.83 (odds / (1 + odds) = 5 / 6)

What are the odds of a tennis player winning with a probability of 0.80?

4 (probability / (1 − probability) = 0.80 / 0.20, i.e., 4 to 1 odds)
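The two conversions in the cards above can be sketched as a pair of small functions. The function names are hypothetical, chosen for illustration.

```python
def odds_to_probability(odds: float) -> float:
    """Convert odds in favor (e.g. 5 means 5 to 1) to a probability."""
    return odds / (1 + odds)


def probability_to_odds(p: float) -> float:
    """Convert a probability (0 <= p < 1) to odds in favor."""
    return p / (1 - p)


# Soccer team with odds of 5: probability of winning.
soccer_probability = odds_to_probability(5)

# Tennis player with probability 0.80: odds of winning.
tennis_odds = probability_to_odds(0.80)

print(round(soccer_probability, 2))
print(round(tennis_odds, 2))
```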

Someone makes the following three claims when arguing that Period 4 students outperformed Period 3 students: Half the students in Period 4 scored 75 points or higher, whereas half the students in Period 3 scored 60 points or lower. Half the students in Period 4 scored 75 points or higher, but only 25% of students in Period 3 scored 75 points or higher. The top 25% of students in Period 4 scored 85 points or higher, whereas the top 25% of students in Period 3 scored only 75 points or higher. How many of these claims are actually TRUE, based on the box plots?

All three of the statements are true (i.e., none of the three statements is false).

In the following equation ŷ = 30,000 + 4x with given sales (y in dollars) and marketing (x in dollars), what does the equation imply?

An increase of $1 in marketing is associated with an increase of $4 in sales.

Data ____________ is the process of retrieving, cleaning, integrating, transforming, and enriching data to support analytics.

Data wrangling

According to the Hans Rosling video, in 2009, China (Rural) has the same wealth and health as _____________

Ghana

The observations for any variable can be classified into one of four major measurement scales: nominal, ordinal, interval, or ratio. Which of the following best describes interval data?

Has equal intervals, but lacks a true zero (or has a zero point that is arbitrary).

What are the relationships on scatterplots?

If the dots trend downward from left to right, the relationship is negative; if they trend upward, it is positive.

Mary asks some of her friends on Facebook for recommendations for the best restaurants in Chicago. There are several ways to describe this data. Which term below better describes this data?

Sample data/ cross-sectional data

Which of the following statements about the exponential regression model is true?

The model allows us to estimate the percent change in E(y) when x increases by one unit.

Given the following accuracy rates for k-fold cross-validation with k = 4, which model will we choose to make predictions and what is the accuracy rate for that model?

The model with the highest average accuracy is the preferred model. In this case, Model 1 has an average accuracy rate of 69.7%, whereas Model 2 has an average accuracy rate of 70.5%, so Model 2 is considered the superior model.

In the model y = β0 + β1x + ε, the predicted value is ŷ = b0 + b1x. What is the impact of the estimated slope coefficient?

b1 measures the approximate change in y when x increases by 1 unit.

With the __________ strategy, the probability of inclusion increases for cases that have been previously misclassified or have large prediction errors. This forces the modeling process to place emphasis on the most difficult cases. The downside, however, is that this strategy is more prone to overfitting.

boosting

__________ measures the way two variables move together, both the direction and the closeness of their movement.

correlation

How do adjusted R^2 and predictive modeling work together?

The higher the adjusted R^2 value, the better suited the model is for predictive modeling.

The simple linear regression model y = β0 + β1x + ɛ implies that if x ________, we expect y to change by β1, irrespective of the value of x.

if x goes up by one unit

In a regression model, the _____________ exists when a predictor variable has a different partial effect on the outcome of another predictor variable.

interaction effect

What are leaf nodes?

Leaf nodes are the nodes at the bottom of the tree, with no branches coming off them.

In cross-validation, if k equals the sample size, the resulting method is also called the _____________.

leave-one-out cross-validation method
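A rough sketch of why k equal to the sample size gives leave-one-out: each fold then holds exactly one observation. The helper `k_fold_indices` is hypothetical, and it splits the indices in order without shuffling, which real implementations often do differently.

```python
def k_fold_indices(n: int, k: int):
    """Yield (train, test) index lists for k roughly equal folds."""
    # Spread any remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


# With 5 observations and k = 5, every test fold is a single observation.
folds = list(k_fold_indices(5, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```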

Which of the chart types below tends to be used to track how a variable changes over time?

line chart

A nonlinear regression model where both the response and predictor variables are transformed into natural logs is called a _____________.

log-log regression model

Which of the following broad categories is not a type of analytic technique?

manipulative analytics

In the following table, there are four observations with three variables. Which is the best fit to be transferred into dummy variables?

marital status

If you want to summarize a categorical variable, then the __________ is the only meaningful measure of the central location.

mode

What is the mode of the following numbers? 6, 10, 10, 10, 1, 4, 8

10 (the mode is the most frequent number)

In the following boxplot, the left whisker is longer than the right whisker. This indicates that the underlying distribution is _______.

negatively skewed

Unstructured data is best described as __________.

not conforming to a predefined, row-column format

A regression model made to conform to a sample set of data, compromising predictive power is called _____________.

overfitting

Which of the following is the most common data visualization of two numeric variables?

scatterplot

Which of the following classification measures focuses on the correct positive classifications?

sensitivity

In a linear regression, ε, read as epsilon, is

the random error

Of the following numerical variables, which is continuous?

weight

Which of the following are TRUE based on the Syllabus?

- Email messages without a proper subject line may be returned to you or may not be answered promptly.
- Late work might be accepted for half credit, but only within one week of the due date.
- The Final Exam is a proctored exam.
- The Final Exam will cover the unit concepts materials only, and you are allowed to create a single page of hand-written notes that you can use during the Final Exam.

Break the data down into two data sets, the TRAINING data and TESTING data. Why is it important to have data for testing that is different from the training data?

- Performance Evaluation: Testing data allows us to measure the performance of the predictive model objectively. By comparing the model's predictions with the actual outcomes in the testing data, we can calculate various performance metrics such as accuracy, precision, recall, or mean squared error. These metrics provide insights into how well the model is performing and help assess its usefulness and reliability.
- Decision Making: In many real-world applications, the ultimate purpose of predictive analytics is to make informed decisions based on the model's predictions. Testing data allows us to estimate how well the model will perform in practical scenarios. By evaluating the model's performance on testing data, we can gain confidence in its ability to assist in decision making.
- Avoiding Overfitting: Overfitting occurs when a model becomes overly specialized in the training data and fails to generalize well to new data. If the same data used for training is also used for testing, the model may simply memorize the training examples without truly understanding the underlying patterns. Having separate testing data helps identify if the model is overfitting, as it evaluates the model's performance on unseen examples.

Match the times in the Simon Sinek video with the comments Sinek is making at that point in the video.

- Time 4:25: "People don't buy what you do. People buy why you do it."
- Time 7:40: "The goal is not just to sell to people who need what you have; the goal is to sell to people who believe what you believe. The goal is not just to hire people who need a job; it's to hire people who believe what you believe. I always say if you hire people just because they can do a job, they'll work for your money, but if they believe what you believe, they'll work for you with blood and sweat and tears."
- Time 16:00: "250,000 people showed up on the right day at the right time to hear him speak. How many of them showed up for him? Zero. They showed up for themselves. It's what they believed about America that got them to travel in a bus for eight hours to stand in the sun in Washington in the middle of August."
- Time 17:10: "Because there are leaders and there are those who lead. Leaders hold a position of power or authority, but those who lead inspire us. Whether they're individuals or organizations, we follow those who lead, not because we have to, but because we want to. We follow those who lead, not for them, but for ourselves. And it's those who start with "Why" that have the ability to inspire those around them or find others who inspire them."

If data on cigarette usage and lung health has an R-squared of 0.7, that would mean __________________.

Cigarette usage explains 70% of the variation in lung health.

A scatter plot allows you to see the shape and spread of information in _____ dimension(s).

2 dimensions

What is the predicted value (ŷ) when the numerical variable is x = 70 for the regression equation ŷ = −810 + 24.4x − 0.142x^2?

202.2 (substitute x = 70: −810 + 24.4(70) − 0.142(70)^2 = −810 + 1,708 − 695.8)
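The substitution can be checked with a short sketch. The function name `predict` is illustrative.

```python
def predict(x: float) -> float:
    """Evaluate the quadratic regression equation from the card."""
    return -810 + 24.4 * x - 0.142 * x ** 2


y_hat = predict(70)

# Round to one decimal to hide floating-point noise.
print(round(y_hat, 1))
```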

Assume 300 people applied for a car loan, and the credit scores of the applicants had a mean value of 640 with a standard deviation of 16. Assuming a bell-shaped curve, how many loan applicants fall within a score range of 608 to 672?

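285 (608 to 672 is within two standard deviations of the mean, which covers about 95% of a bell-shaped distribution: 0.95 × 300 = 285)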
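A sketch of the empirical-rule reasoning behind this answer. The 95% figure is the rounded rule of thumb; the exact normal share, computed here with `statistics.NormalDist` for comparison, is slightly higher.

```python
from statistics import NormalDist

mean, sd, applicants = 640, 16, 300
low, high = 608, 672

# Express the bounds in standard deviations from the mean.
z_low = (low - mean) / sd
z_high = (high - mean) / sd

# Empirical rule: about 95% of values lie within 2 standard deviations.
empirical_share = 0.95
count = round(empirical_share * applicants)

# Exact share under a normal curve, for comparison only.
dist = NormalDist(mu=mean, sigma=sd)
exact_share = dist.cdf(high) - dist.cdf(low)

print(z_low, z_high, count)
print(round(exact_share, 4))
```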

Transform the marital status into category scores where Single = 1 and Married = 0. How many would have the category score of 0?

4

How many squares are there in the big square image?

40

If R^2 = 0.62, how much of the variation in the dependent variable is explained by the independent variable(s)?

62%

Using the simple mean imputation strategy, what value would be placed in the missing observation in x1?

84

Two graphs that look similar are the Bar Chart and the Histogram. Which is a graphical representation of categorical data?

A Bar Chart is a graphical representation of categorical data.

Two graphs that look similar are the Bar Chart and the Histogram. Which is a graphical representation of quantitative (numerical) data with equal space between each pair of consecutive bars?

A Histogram is a graphical representation of quantitative (numerical) data.

Which of the answers below correctly describes the type of variable being predicted by a classification problem?

A classification problem tries to predict a categorical value or label.

Which of the following statements best describes a confusion matrix?

A confusion matrix is the best way to analyze categorical predictions

Which of the answers below correctly describes the type of variable being predicted by a regression problem?

A regression problem tries to predict a continuous numerical value, such as predicting house prices, stock prices, or a person's age.

When comparing competing linear regression models with different numbers of predictor variables, the __________ value is used.

Adjusted R-squared

Which option best interprets the impact of the coefficients and p-values?

Anytime the coefficient is positive and the p-value is approximately 0, then there is a positive impact and significant influence.

According to the Syllabus, if you earn 719 points out of the 800 total points possible in this class, what will your final grade be for the class?

B

Which of the following were identified as examples of where human subjectivity is involved in Data Analysis?

Choosing the problem, the results to share, how to present the results, what data to collect, how to collect the data

In supervised learning, CART stands for

Classification and Regression Trees

A confusion matrix is used when making predictions about which type of outcome variable?

Confusion matrix is used to evaluate predictions on a CATEGORICAL outcome variable

In the video, Dr. Crews used a movie theater and an operating room to provide an example of narrative and context. What was the example?

Dimming lights. Dimming lights in a movie theater is expected. Dimming lights in an operating room is unexpected.

According to the Hans Rosling video, in 1810, life expectancy was what?

In 1810, life expectancy was below 40 in almost all countries

According to the Hans Rosling video, in 2009, China (Shanghai) has the same wealth and health as _____________

Italy

Sam, a marketing manager for XYZ big box stores, is trying to determine if there is a relationship between shelf space (in feet) and sales (in hundreds of dollars). To do this, Sam selected the top 12 producing locations. The regression results produced the following adjusted R^2 values: Model 1: 0.8874 and Model 2: 0.6028. Which model is more suitable for prediction?

Model 1 is more suitable because (0.8874 > 0.6028).

Data can be classified as either numerical or categorical. Which of the two statements better describes these two?

Numerical data is quantitative. Categorical data is qualitative.

_____________ ranges between 0 and 1, whereas _____________ range between 0 and infinity.

Probability; odds

The accuracy score is the number of correct predictions divided by the total number of predictions. What is the accuracy score for the above confusion matrix?

            Reference
Prediction    0    1
         0   30   10
         1   20   40
Accuracy = (30 + 40) / 100 = 70%
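The accuracy calculation for this confusion matrix can be sketched as follows. The dictionary layout, keyed by (prediction, reference), is just one convenient way to hold the four counts.

```python
# Counts keyed by (prediction, reference), copied from the confusion matrix.
matrix = {
    (0, 0): 30,
    (0, 1): 10,
    (1, 0): 20,
    (1, 1): 40,
}

# Correct predictions sit on the diagonal, where prediction == reference.
correct = sum(n for (pred, ref), n in matrix.items() if pred == ref)
total = sum(matrix.values())
accuracy = correct / total

print(correct, total, accuracy)
```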

Which of the following can be determined from the above confusion matrix? *** SELECT ALL CORRECT ANSWERS ***

            Reference
Prediction    0    1
         0   30   10
         1   20   40
The confusion matrix tells us the number of rows (or observations) in the TESTING data.

One example he gave was how our brains process information differently when we are in SURVIVE mode versus THRIVE mode. Which of these two is described as "narrow focus, like walking a tightrope"?

SURVIVE mode has a narrow focus, like walking a tightrope

A study was completed on cholesterol in 100 male adults 40-60 years of age to determine if there is a relationship between cholesterol concentration and time spent watching TV. The researchers wanted to determine if there are any predictive results, such as if the amount of time spent watching TV increases or decreases cholesterol levels. Based on the following regression results, is TV-watch statistically significant?
Variable        Model 1     p-value
Constant        −2.13478    0.000
TV_time         0.044069    0.002
Adjusted R^2    0.1426

The p-value = 0.002 and is statistically significant because it is under the 5% level.

Why are the partial effects of the two predictor variables more difficult to interpret when a model contains the interaction of two numerical variables?

The partial effect of either predictor variable depends on the value of the other predictor variable.

In the lecture video, Dr. Crews talked about the MIT Sloan Management Review article titled "Minding the Analytics Gap". What is the problem described in the article?

There is a gap between an organization's capacity to produce analytics results and its ability to apply those analytics results effectively to business issues.

What are root nodes?

They are the start of the binary tree, at the top, where the branches begin.

In the k-fold cross-validation method, the holdout cross-validation method is used k times.

True

Manipulation and Persuasion BOTH involve influencing another person's thinking and decision making process. Manipulation and Persuasion CAN BOTH USE TRUE FACTS to achieve their goals. The main difference between Manipulation and Persuasion is _______

When you are manipulating someone, you are trying to convince the other person to think or behave in a way that is in YOUR best interest, regardless of how it impacts the other person. When you are persuading someone, you are trying to convince the other person to think or behave in a way that is in THEIR best interest.

Which of the following goodness-of-fit measures assesses how well the model fits the data for binary choice models?

accuracy rate

In the model y = β0 + β1ln(x) + ε, the predicted value is ŷ = b0 + b1ln(x). What is the impact of the estimated slope coefficient?

b1 × 0.01 measures the approximate change in y when x increases by 1%.

In the model ln(y) = β0 + β1x + ε, the predicted value is ŷ = exp(b0 + b1x + se^2/2), where se is the standard error of the estimate. What is the impact of the estimated slope coefficient?

b1 × 100 measures the approximate percentage change in y when x increases by 1 unit.
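For the model ln(y) = β0 + β1x + ε, the b1 × 100 figure is read as an approximate percentage change in y. The sketch below compares it with the exact percentage change, (exp(b1) − 1) × 100. The slope value 0.05 is a made-up number for illustration, not taken from the course material.

```python
import math

# Hypothetical estimated slope, chosen only for illustration.
b1 = 0.05

# Rule-of-thumb interpretation: percentage change in y per 1-unit increase in x.
approx_pct_change = b1 * 100

# Exact percentage change implied by the log-transformed response.
exact_pct_change = (math.exp(b1) - 1) * 100

print(round(approx_pct_change, 2))
print(round(exact_pct_change, 2))
```

The approximation is close for small slopes and drifts further from the exact value as the absolute value of b1 grows.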

The slope coefficient, β, is read as

beta

When a target variable is categorical, the CART algorithm produces a __________ tree to predict the class memberships of new cases.

classification

The degree of strength of the linear relationship between x and y is called?

correlation coefficient

Genie wants to know how well her model will perform on data it has not seen before. What technique should she use to assess the predictive power of her model?

cross-validation

Bivariate data means ____________________

data for two variables

If ŷ = 110 − 5x with y = product and x = price of product, what happens to the demand if the price is increased by 3 units?

decreases by 15 units

Imagine a regression tree was developed to predict customer spending for a hotel during football season. One of the leaf nodes consists of six cases in the training set with the following values: 312.00, 350.00, 285.00, 295.00, 380.00, 220.00. What is the predicted spending amount on a hotel for the night for a customer that falls into this leaf node?

307 (the simple average of the six values: 1,842 / 6)
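A regression-tree leaf predicts the average of its training cases, which can be sketched as:

```python
# Spending values for the six training cases in the leaf node.
leaf_values = [312.00, 350.00, 285.00, 295.00, 380.00, 220.00]

# The leaf's prediction is the simple average of those values.
prediction = sum(leaf_values) / len(leaf_values)

print(prediction)
```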

A ____________ variable, also referred to as an indicator or a binary variable, takes on values of 1 or 0 to describe two categories of a categorical variable.

dummy variable

Which R function do you use to construct a logistic regression model?

glm

An R-squared value of 1 means ______________

One variable can be predicted perfectly from the other, since 100% of its variation is explained. For example, temperature in Fahrenheit can be predicted exactly if you know the temperature in Celsius.

Using the omission strategy, what value would be placed in the missing observation in x1?

No value; the observation with the missing value is excluded.

The observations for any variable can be classified into one of four major measurement scales: nominal, ordinal, interval, or ratio. Assume the variable college holds values such as "Western Kentucky University", "University of Kentucky", "University of Tennessee", and "Texas A&M". What type of variable is college?

nominal

According to Jaggia, what are two common strategies for handling missing values?

omission and imputation

An instructor hands out course evaluations where students give a rank of 1 ("very unsatisfied") to 5 ("very satisfied"). Which of the following best describes that ranking data?

ordinal

Using the following Boxplot, what is the star to the far right considered?

outlier

The __________ strategy implements not only repeated sampling of the training data, but also a random selection of a subset of predictor variables, called features, to construct each tree. This strategy is particularly useful if the predictor variables are highly correlated.

random forest

What is the range of the following numbers? 6, 10, 10, 10, 1, 4, 8

range is 10 - 1 = 9.

Before an upcoming ice festival, the ice festival committee measures the depth of the ice. What is the type of measurement scale?

ratio

When a target variable is numerical, the CART algorithm produces a __________ tree to estimate the values of new cases.

regression

A ________ line is a line that is as close as possible to all the points at the same time.

regression line

Which of the following classification measures focuses on the model's ability to correctly identify negative or nontarget class cases?

specificity
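A sketch of sensitivity and specificity, reusing the counts from the confusion-matrix card elsewhere in this set (30, 10, 20, 40). Treating class 1 as the positive (target) class is an assumption made here for illustration; the cards do not say which class is positive.

```python
# Assumption: class 1 is the positive class, class 0 the negative class.
true_positives = 40   # predicted 1, reference 1
false_negatives = 10  # predicted 0, reference 1
true_negatives = 30   # predicted 0, reference 0
false_positives = 20  # predicted 1, reference 0

# Sensitivity: share of actual positives that were classified correctly.
sensitivity = true_positives / (true_positives + false_negatives)

# Specificity: share of actual negatives that were classified correctly.
specificity = true_negatives / (true_negatives + false_positives)

print(sensitivity, specificity)
```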

If the correlation coefficient is computed to be −0.85, this means the relationship between the two variables is _______.

strong, but negative

Bob is working with some data in a table in Excel. What does the data represent?

structured data

What is the arithmetic mean of the following numbers? 6, 10, 10, 10, 1, 4, 8

7 (sum / count = 49 / 7)
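The same seven numbers are used for the mean, mode, and range cards in this set. A minimal sketch computing all three, with illustrative variable names:

```python
import statistics

values = [6, 10, 10, 10, 1, 4, 8]

# Mean: sum of the values divided by how many there are.
mean = sum(values) / len(values)

# Mode: the most frequent value.
mode = statistics.mode(values)

# Range: largest value minus smallest value.
value_range = max(values) - min(values)

print(mean, mode, value_range)
```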

The interquartile range (IQR) is IQR = Q3 − Q1. It can be thought of as the _____________.

the spread of the middle 50% of the data

Based on the following sorted 20 values for age, what are the possible split points? {20, 22, 24, 26, 28, 31, 32, 33, 35, 40, 42, 43, 45, 47, 49, 50, 52, 53, 55, 57}

{21, 23, 25, 27, 29.5, 31.5, 32.5, 34, 37.5, 41, 42.5, 44, 46, 48, 49.5, 51, 52.5, 54, 56}. These are the midpoints between consecutive values.
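The candidate split points can be reproduced by averaging each pair of neighboring sorted values, as sketched below. Twenty values give nineteen midpoints.

```python
ages = [20, 22, 24, 26, 28, 31, 32, 33, 35, 40,
        42, 43, 45, 47, 49, 50, 52, 53, 55, 57]

# Pair each value with its successor and take the midpoint.
split_points = [(a + b) / 2 for a, b in zip(ages, ages[1:])]

print(len(split_points))
print(split_points)
```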

