DSCI 4520 Exam 1

Transforming numerical variables means performing mathematical functions on them and creating new variables that are better suited for our data mining model.

True
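As a minimal sketch of such a transformation, assuming a hypothetical right-skewed `incomes` list (not from the course data), a log transform creates a new variable whose values sit on a much narrower scale:

```python
import math

# Hypothetical right-skewed values; not taken from the course data set.
incomes = [20_000, 35_000, 50_000, 1_000_000]

# New variable: natural log of each value, which pulls in the long right tail.
log_incomes = [math.log(x) for x in incomes]

# Ratio of the largest to the smallest value before and after the transform.
raw_ratio = max(incomes) / min(incomes)
log_ratio = max(log_incomes) / min(log_incomes)
```

The largest raw value is 50 times the smallest, while the largest log value is less than twice the smallest.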

When a model is over-fitted the regression coefficients represent noise in the data, rather than the genuine relationships in the population.

True

k-nearest neighbor (k-NN) is a supervised method that can be used for predicting categorical or numerical targets.

True

In model training, ... measures how much the model's predictions fluctuate when given different input data.

Variance

"Learn from the observed records to predict the class value of unseen records." In data mining, this is called ...

Classification

Which statement is INCORRECT about clustering?

Clustering is useful for predicting association rules

The following figure shows the logic for creating dummy variables from a categorical predictor called "Remodel". Which class level is considered the baseline (reference)?

"None"

What is the Euclidean distance between the following two records WITHOUT normalization? Round your answer to 1 decimal.

11.50
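The two records from the original figure are not reproduced here, so the sketch below uses hypothetical values (`record_1`, `record_2`) only to show how an unnormalized Euclidean distance is computed and rounded to one decimal:

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences between two records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical records with two numerical variables each.
record_1 = [25, 3.0]
record_2 = [28, 7.0]

# Differences are 3 and 4, so the distance is sqrt(9 + 16).
distance = round(euclidean(record_1, record_2), 1)
```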

The following formula shows the model that we have developed to predict the average used car Price based on its Age, Fuel Type, and quarterly Tax amount (Fuel Type = Natural Gas is the model baseline). Use the model to predict the price of a car with the following characteristics: Age = 12, Fuel Type = Diesel, Tax = 230

21445

The following linear regression model is calculated to predict Boston house prices using CRIM (per capita crime rate by town), ZN (proportion of residential land zone), INDUS (proportion of industrial land zone), CHAS (Charles River dummy variable), and NOX (air pollution level). Use the model to predict the Median Value of the following house: CRIM = 0.121, ZN = 22, INDUS = 1.47, CHAS = 0, NOX = 0.44

27.7

The following chart shows the within-cluster sum of square errors versus the number of clusters in a k-means clustering model. Based on the Elbow method, what value of k is optimum for clustering?

4

Which of the following models (the dark blue line) shows a case of underfitting?

A

In the following scatter plot matrix, Price is the target variable. What predictor shows the strongest negative correlation with Price?

Age_08_04

Based on the following plot (Average Income versus Gender and Education Level) which statement is CORRECT?

As the level of education decreases the average income decreases in both gender categories

In the development of a linear regression model, what is the naïve (baseline) model against which we compare the performance of the linear model?

Average model
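A rough sketch of that comparison, assuming small hypothetical `x` and `y` lists: the naïve model predicts the mean of the target for every record, and a one-predictor least-squares line is judged by how much it lowers the RMSE relative to that baseline.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical data, roughly following y = 2x + 1 with some noise.
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

# Naive (average) model: predict the mean of y for every record.
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
naive_rmse = rmse(y, [mean_y] * len(y))

# Simple least-squares line fitted on the same data.
slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
intercept = mean_y - slope * mean_x
linear_rmse = rmse(y, [intercept + slope * xi for xi in x])
```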

With the k-NN model for a numerical target, after we have determined the k nearest neighbors of a new data record, how is the target value predicted?

Average of the neighbors
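A minimal sketch of this prediction step, assuming hypothetical training records and an unweighted average (the function name `knn_predict` is illustrative, not from the course):

```python
import math

def knn_predict(train_X, train_y, new_x, k):
    """Predict a numerical target as the mean target of the k nearest records."""
    # Pair each training record's distance to new_x with its target value.
    distances = sorted(
        (math.dist(row, new_x), target) for row, target in zip(train_X, train_y)
    )
    nearest_targets = [target for _, target in distances[:k]]
    return sum(nearest_targets) / k

# Hypothetical training data with two numerical predictors.
train_X = [[1, 1], [2, 2], [3, 3], [10, 10]]
train_y = [10, 20, 30, 100]

# The three nearest neighbors of [2, 2] have targets 20, 10, and 30.
prediction = knn_predict(train_X, train_y, [2, 2], k=3)
```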

Which model is an OVER-FITTED model?

B

In the context of predictive model training .... is a measure that shows how much the model's predictions differ from the true values.

Bias

What is the first phase in the CRISP-DM approach for data mining tasks?

Business understanding

Based on the following correlation matrix, what is potentially the weakest predictor of MEDV?

CHAS

In statistics and data mining, "a statistical measure of the strength of the relationship between the relative changes of two variables" is called ...

Correlation Coefficient
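One common version of this measure is the Pearson correlation coefficient. The sketch below computes it by hand for hypothetical `age` and `price` values, so the numbers illustrate the formula only and do not come from the course data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    co_variation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return co_variation / math.sqrt(spread_x * spread_y)

# Hypothetical values: price falls as age rises.
age = [1, 2, 3, 4, 5]
price = [20, 18, 15, 13, 10]

r = pearson_r(age, price)
```

Here `r` is close to -1, which indicates a strong negative linear relationship.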

Which of the following scatterplots shows the strongest negative correlation?

D

Which of the following are core ideas/tasks in data mining?

Data cleaning and pre-processing; data visualization; regression modeling

Which statement about the data mining process is INCORRECT?

Data cleaning and pre-processing is usually a trivial step in the process

ANOVA is an analysis under which of the following data mining task categories?

Data exploration

Which statement about business intelligence workflow is CORRECT?

Data in the operational database is transformed to analytical data in the data warehouse

Which of the following is NOT a step in data pre-processing?

Data modeling

Which of the following variable selection methods for the linear regression model examines all possible combinations of variables?

Exhaustive search

"Estimating the repair time required for an aircraft based on a trouble ticket." Performing this task in data mining requires an unsupervised learning approach.

False

"Identifying segments of similar customers." Performing this task in data mining requires a supervised learning approach.

False

"Predicting whether a company will go bankrupt based on comparing its financial data to those of similar bankrupt and nonbankrupt firms." Performing this task in data mining requires an unsupervised learning approach.

False

Both numerical and categorical variables can be used in the Euclidean distance function in the k-means clustering algorithm.

False

Categorical variables can NOT be used as predictors in the linear regression model.

False
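Categorical predictors usually enter a linear regression as 0/1 dummy variables, with one level left out as the baseline. A small sketch, assuming a hypothetical "Remodel" column whose non-baseline level names ("Old", "Recent") are illustrative assumptions:

```python
def make_dummies(values, baseline, prefix):
    """Return one 0/1 column per non-baseline level of a categorical variable."""
    levels = sorted(set(values) - {baseline})
    return {
        f"{prefix}_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }

# Hypothetical records; "None" is treated as the baseline (reference) level.
remodel = ["None", "Old", "Recent", "None"]
dummies = make_dummies(remodel, baseline="None", prefix="Remodel")
```

Records at the baseline level get 0 in every dummy column, so their effect is absorbed into the model intercept.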

In the k-means clustering technique, the desired number of clusters (k) is a number that is determined in the middle of the algorithm by calculating the model error.

False

In the k-nearest neighbor models, increasing the value of k leads to overfitting.

False

In the search for the best set of variables for the linear regression model, when the number of potential predictors is small, the exhaustive search method gives significantly different and better results than other methods.

False

Predictors of a multiple linear regression model can only be numeric type.

False

The following box plot shows students' GPA stratified by students' Gender. According to this plot, the minimum GPA of Male students is less than Female students.

False

The k-means clustering algorithm can easily handle noisy data with outliers as well as non-convex data patterns.

False

We run two k-means clustering models on the same data with k=3 and k=5. The model with k=3 is necessarily better than the other one because a smaller value of k is always better for clustering.

False

When data is not uniformly distributed and includes outliers, linear normalization is better than the Z-score standardization method.

False

You are requested to use a large data set of customers to predict how many days after their first purchase they will make the second purchase. You can do this by developing a classification model.

False

We have trained a linear regression model and tested it on the validation set. Here are the results. On the training set: RMSE is low and R2 is high. On the validation set: RMSE is low and R2 is high. What type of fit is this?

Good (optimized) fit

Which of the following tasks is an unsupervised learning task?

Grouping customers based on the similarity in their online behavior

Which one is NOT one of the primary reasons for discretizing numerical variables?

Higher Accuracy

The following scatterplot matrix is created from the Boston Housing data set that shows the median price of the houses in the Boston metropolitan area. Which two variables (any two variables) show the strongest correlation? (read the column and row labels to find the variable names)

INDUS & NOX

The following report shows Excel output for a linear regression model. What can the p-value of F-statistic tell us?

If this p-value is less than our significance level then the model as a whole is significant

What is the essential element in the machine learning algorithms that distinguish supervised from unsupervised learning?

In the supervised learning models target variable is used in the model, but in the unsupervised learning models there is no target to predict

Which statement is INCORRECT about linear regression models?

It is a very popular, robust, and flexible method for predicting numerical and categorical targets

Which statement is FALSE about the data-driven decision-making approach?

It is loaded with assumptions and theories

Based on the following correlation matrix, what is potentially the strongest predictor of MEDV?

LSTAT

Which statement is INCORRECT about k-NN predictive models?

Larger values of k increase the risk of over-fitting

Which statement is INCORRECT about choosing the number of clusters in the k-means clustering method?

Maximizing the within-cluster sums of squared errors (WSS) is the goal when selecting k
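For reference, WSS is a quantity that should be small, not large. A minimal sketch of how it is computed, assuming hypothetical points with already-assigned cluster labels and centroids:

```python
def wss(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(
        sum((p - c) ** 2 for p, c in zip(point, centroids[label]))
        for point, label in zip(points, labels)
    )

# Hypothetical two-dimensional points forming two obvious groups.
points = [[1, 1], [1, 3], [8, 8], [10, 8]]

# k = 2: each group has its own centroid.
wss_k2 = wss(points, [0, 0, 1, 1], [[1, 2], [9, 8]])

# k = 1: every point is assigned to the overall mean.
wss_k1 = wss(points, [0, 0, 0, 0], [[5, 5]])
```

Because WSS keeps shrinking as k grows, the Elbow method looks for the k after which the decrease levels off rather than for the smallest WSS overall.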

We have developed two different linear regression models on the same data set. Which model shows a better goodness-of-fit?

Model A

The following figure shows residual plots of two linear regression models A and B. Which of the following statements is CORRECT?

Model B is violating homoscedasticity assumption

We have developed a linear regression model and the residual plots are shown in the following figure. What statement is CORRECT about the model?

Model is violating the assumption of independence of observations

We have developed a linear regression model and the residual plots are shown in the following figure. What statement is CORRECT about the model?

Model is violating the linearity assumption

When we are building a linear regression model, against what model do we compare it to evaluate its significance?

Naïve (average) model

The target variable in a multiple linear regression model must be a:

Numerical variable

You are working with the Bike Purchase data set which contains records of individuals and their bike ownership status. What does the following chart show?

Percentage of individuals in each bike purchase group across gender and commute distance

Which of the following tasks is NOT included in the data preprocessing phase?

Performance evaluation

The following bar charts are based on the "Number of Dependent" of credit card applicants and their Credit Status. If you want to explore the association between the number of dependents and the rate of Bad Credit, which plot should you use?

Plot A

Which of the following tasks is a supervised learning task?

Predicting air pollution level

Which of the following statements is INCORRECT about imputing missing numerical values?

Random generator function is one of the best methods of imputing

"Learn from the observed records to predict numerical values of unseen records." In data mining, this is called ...

Regression

To show the relationship between one numeric and one categorical variable, which plot type is NOT useful?

Scatter plot

Which of the following is NOT a strategy to prevent model over-fitting?

Set a limit on the value of R2 metric

In the following scatter plot, the Sepal Width versus Petal Length of three types of iris flowers is shown. Which iris type is the most distinguishable from the others based on this scatter plot?

Setosa

You are building a multiple linear regression model to predict median house price (MEDV) in Boston using a data set with 12 predictors as shown in the following correlation matrix. Based on the matrix, you would expect the violation of the multicollinearity assumption to happen between what variables? Hint: multicollinearity means a strong linear relationship between two predictors (independent variables).

TAX & RAD

Which statement is INCORRECT about the k-means clustering algorithm?

The algorithm starts with initial centroids that are determined by distance function

Which of the following statements is INCORRECT about the missing values in a data set?

The best strategy is always to drop records with any missing values

According to the data-driven decision-making technology pyramid shown in the following figure, which statement is FALSE?

The process only moves in one direction (upward) and higher layers never give feedback to the lower layers.

Which statement is INCORRECT about exploratory data visualization?

The purpose of visual exploration of data is to perform target prediction

What statement is correct about the k-nearest neighbor (k-NN) method?

The value of k can control model over and underfitting

"Automated sorting of mail by zip code scanning." Performing this task in data mining requires an unsupervised learning approach.

False

"Identifying a network data packet as dangerous (virus, hacker attack) based on comparison to other packets whose threat status is known." Performing this task in data mining requires a supervised learning approach.

True

Before computing the distance between two data records, we should normalize the numerical variables to prevent variables with large scales from having an undue effect.

True
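A small sketch of why this matters, assuming hypothetical (age, income) records and min-max scaling as the normalization method: on the raw scale income dominates the distance, and after scaling the ranking of neighbors can change.

```python
import math

def min_max(column):
    """Rescale a column of numbers linearly to the range 0..1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical records: [age in years, income in dollars].
records = [[25, 50_000], [60, 50_500], [26, 52_000], [40, 90_000]]

# Normalize each variable separately, then rebuild the records.
ages, incomes = zip(*records)
scaled = list(zip(min_max(ages), min_max(incomes)))

# Distances from the first record to the second and third records.
raw_ab = math.dist(records[0], records[1])
raw_ac = math.dist(records[0], records[2])
norm_ab = math.dist(scaled[0], scaled[1])
norm_ac = math.dist(scaled[0], scaled[2])
```

On the raw scale the second record looks closer to the first despite a 35-year age gap. After scaling, the third record (one year older, slightly higher income) is the closer one.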

Both the histogram and box plot are univariate plots that are useful for exploring the distribution of a variable.

True

Data exploration includes summary statistics, univariate and bivariate analysis, basic statistical tests (t-test, correlation), ANOVA, and outlier detection.

True

In a linear regression model, the t-Test for each predictor's coefficient indicates if the estimated value is significantly different from zero.

True

In practice, data preprocessing takes a significant portion of data mining projects.

True

In the data preparation step, normalizing numeric data is a popular method to transform variables into a more suitable scale for modeling.

True

Increasing the data size can help avoid both over- and under-fitting problems.

True

The data dictionary is meta-data, which is data about data

True

Splitting the data set into training and validation sets is a method to avoid overfitting. We can calculate metrics such as RMSE and R2 for the training and validation sets. Under what conditions can we tell that the model is overfitted?

Very low training error and high validation error
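One way to see this pattern is a 1-nearest-neighbor model, which memorizes the training set. The sketch below uses hypothetical noisy data around y = 2x, and the helper names are illustrative assumptions:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def one_nn_predict(train_X, train_y, new_x):
    """Return the target of the single closest training record."""
    _, target = min((math.dist(row, new_x), t) for row, t in zip(train_X, train_y))
    return target

# Hypothetical noisy training data around y = 2x.
train_X = [[1], [2], [3], [4], [5], [6]]
train_y = [2.0, 4.5, 5.5, 8.5, 9.5, 12.5]

# Hypothetical validation records that follow y = 2x exactly.
valid_X = [[1.4], [2.6], [3.4], [4.6], [5.4]]
valid_y = [2.8, 5.2, 6.8, 9.2, 10.8]

# Each training record is its own nearest neighbor, so training error is zero.
train_rmse = rmse(train_y, [one_nn_predict(train_X, train_y, x) for x in train_X])
valid_rmse = rmse(valid_y, [one_nn_predict(train_X, train_y, x) for x in valid_X])
```

A training RMSE of zero next to a clearly larger validation RMSE is the signature of an over-fitted model.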

Based on the following scatter plots and summary statistics, which statement is CORRECT?

Visualization can help us see differences between data sets that cannot be identified by looking at summary statistics

What is the name of the following plot?

box-and-whiskers plot

What is the name of the following plot?

heat map

We have created a multiple linear regression model to predict car MPG (miles per gallon) values from five car features: cylinders, displacement, horsepower, weight, and acceleration. The following table shows the linear model. Based on the results, which variable's estimated coefficient is significantly different from zero?

horsepower

When a model captures both the underlying patterns and random fluctuations in data, it is called ...

over-fitting

