CIS 375 Final Practice Questions

Determining which customers are most likely to leave a business (or a social media site) is unsupervised learning

FALSE

In a confusion matrix, we want to minimize which of the following?

False positive and false negative

If you could only choose between a fitting graph and a learning curve to compare multiple models' performances, which should you choose?

Learning curve

If we have a well-defined cost-benefit matrix, which of the following curves can we use to evaluate model performance?

Profit curve

"Do my customers form natural groups?" is an example of clustering

TRUE

In the mathematical equation below, representing a linear boundary, variable "y" represents the ____________. y = b + w1x1 + w2x2 + w3x3 + ....

Target variable
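
A minimal R sketch of how this linear boundary produces a target-variable score; the intercept, weights, and cutoff below are made-up illustration values, not anything estimated in class.

    # Hypothetical intercept and weights for y = b + w1*x1 + w2*x2 + w3*x3
    b <- -1.5
    w <- c(0.8, -0.3, 1.2)
    x <- c(2.0, 1.0, 0.5)                    # one instance's attribute values

    y <- b + sum(w * x)                      # score for the target variable
    ifelse(y > 0, "positive", "negative")    # classify by which side of the boundary the instance falls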

What is the hold-out data used for?

Test a model

What is the purpose of the following statement? "berryrules <- subset(groceryrules, items %in% "berries")"

The subset() function is used to filter a set of association rules. This statement searches groceryrules for rules whose items include "berries" and stores the matching rules in berryrules.
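
For context, a hedged sketch of the arules workflow this statement sits in; the Groceries data and the support/confidence settings follow the usual arules examples and may differ from the ones used in class.

    library(arules)

    data(Groceries)                           # example transaction data shipped with arules
    groceryrules <- apriori(Groceries,
                            parameter = list(support = 0.006, confidence = 0.25, minlen = 2))

    # Keep only the rules whose items include "berries"
    berryrules <- subset(groceryrules, items %in% "berries")
    inspect(berryrules)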

What are training data and test data used for when we perform data mining tasks?

The training data is the initial dataset fed to the induction algorithm to build (train) the model, while the test data is used to evaluate how accurate the resulting model is on unseen cases.

Cluster analysis is under the category of ________________ learning

Unsupervised

What is the function we use to perform the hierarchical clustering algorithm in R?

hclust()
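
A minimal hclust() sketch; the iris features, Euclidean distance, complete linkage, and the choice of 3 clusters are illustration assumptions.

    df <- scale(iris[, 1:4])                  # standardized numeric features (iris used only as an example)

    d  <- dist(df, method = "euclidean")      # pairwise distance matrix
    hc <- hclust(d, method = "complete")      # hierarchical clustering with complete linkage

    plot(hc)                                  # dendrogram
    clusters <- cutree(hc, k = 3)             # cut the tree into 3 clusters
    table(clusters)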

What is the function we use to perform clustering with the k-means algorithm in R?

kmeans()
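
A minimal kmeans() sketch; k = 3, nstart = 25, and the iris features are illustration choices.

    set.seed(123)                             # k-means starts from random centers
    df <- scale(iris[, 1:4])                  # standardize features before clustering

    km <- kmeans(df, centers = 3, nstart = 25)
    km$size                                   # instances per cluster
    km$centers                                # cluster centroids (in standardized units)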

Regression is distinguished from classification by

numerical target variable

Examples of Applications of Data Mining

- Fraud detection
- Prediction of loan repayment
- Prediction of membership of social media site

Truth about overfitting

- Overfitting may be caused by the lack of representative instances in the training data
- Holdout method can help detect the issue of overfitting
- All data mining procedures have some tendency to overfit

Truth about model accuracy

- Accuracy is misleading with imbalanced data
- Accuracy doesn't make distinctions between false positives and false negatives

Characteristics of data mining

- DM extracts useful information and knowledge from large volumes of data by following a well-defined process
- DM revolves around data
- DM is a set of techniques for analyzing data to discover interesting knowledge or patterns in the data

Characteristics of a tree-structured model

- Made up of root, interior nodes, leaf nodes, and branches
- Every instance always ends up at a leaf node
- Each branch represents a distinct value of the attribute at that node

Describing Support Vector Machine Model

- SVMs can estimate class membership probability
- SVMs are based on supervised learning
- SVMs use the hinge loss function

Before performing cluster analysis, what shall we do to get the data ready?

- Scale/standardize the features
- Deal with missing values
- Screen for unreasonable records
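
A small sketch of these preparation steps in R; the data frame, column names, and the age rule are hypothetical.

    # Hypothetical raw data with a missing value and an unreasonable record
    raw_df <- data.frame(age    = c(25, 34, -1, 52, NA),
                         income = c(40000, 52000, 61000, 38000, 45000))

    clean <- na.omit(raw_df)                              # deal with missing values
    clean <- clean[clean$age >= 0 & clean$age <= 120, ]   # screen out unreasonable records
    ready <- scale(clean)                                 # standardize so no feature dominates the distances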

ROC Curve Characteristics

- The dashed line represents a random-guessing strategy/model
- ROC curves are not an intuitive visualization for business stakeholders
- AUC measures the area under the ROC curve
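
The ROC curve and AUC above can be produced, for example, with the pROC package; the labels and scores below are simulated, and pROC is an assumption rather than necessarily the package used in class.

    library(pROC)

    set.seed(1)
    actual <- factor(rep(c("neg", "pos"), each = 100))
    score  <- c(rnorm(100, mean = 0), rnorm(100, mean = 1))   # higher score = more likely positive

    roc_obj <- roc(response = actual, predictor = score)
    plot(roc_obj)                             # the diagonal reference line is the random-guessing model
    auc(roc_obj)                              # area under the ROC curve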

Classification/Decision Tree

- The target variable for a classification decision tree cannot be numeric
- Decision trees for classification recursively perform IG-based attribute selection
- Decision tree is one of the methods that are used for classification

How many steps are in the CRISP-DM (data mining) cycle/process?

6

Model accuracy can be calculated based on:

Confusion matrix
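
A worked example of the accuracy calculation from a 2x2 confusion matrix; the counts are made up.

    # Hypothetical confusion matrix counts
    TP <- 40; TN <- 45; FP <- 5; FN <- 10

    accuracy <- (TP + TN) / (TP + TN + FP + FN)   # (40 + 45) / 100 = 0.85
    accuracy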

If we would like to create a confusion matrix for evaluating the model performance on the test data, which function shall we use?

CrossTable()
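
A minimal CrossTable() sketch (from the gmodels package); the actual and predicted label vectors are hypothetical.

    library(gmodels)

    # Hypothetical true vs. predicted labels on the test data
    test_actual <- factor(c("yes", "no", "yes", "no", "yes", "no"))
    test_pred   <- factor(c("yes", "no", "no",  "no", "yes", "yes"))

    CrossTable(x = test_actual, y = test_pred,
               prop.chisq = FALSE,                # drop the chi-square contribution from each cell
               dnn = c("actual", "predicted"))    # dimension names shown in the table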

Estimating probability of default (or "write-off") for a new loan application is a regression problem

FALSE

In selecting informative attributes, we should look for attributes that produce subsets with highest entropy

FALSE

In the case of the biopsy dataset we discussed in class, when we use the ifelse() function to obtain classification results, the higher the cutoff value we set, the better the model performance we may get

FALSE
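
A hedged sketch of the cutoff idea with ifelse(); the probability vector and labels are made up, not the actual biopsy workflow from class.

    # Hypothetical predicted probabilities of malignancy for five test cases
    pred_prob <- c(0.10, 0.45, 0.60, 0.85, 0.95)

    cutoff <- 0.5
    ifelse(pred_prob > cutoff, "malignant", "benign")
    # Raising the cutoff converts borderline cases to "benign" predictions,
    # which does not automatically improve overall performance.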

Linear discriminant boundary uses attributes recursively to classify the data

FALSE

There is a target variable when we perform market basket analysis

FALSE

Transactional data displays the same format as the dataset we used for decision trees

FALSE

Tree-structured predictive models take into account multiple attributes (input variables) all at once, via the use of a mathematical formula

FALSE

When we customize the tuning process, we should set selectionFunction = "best" to choose the simplest candidate model

FALSE
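
A minimal sketch of where this option is set in caret's trainControl(); "oneSE" (not "best") is the selection function that favors the simplest candidate within one standard error of the best.

    library(caret)

    ctrl <- trainControl(method = "cv", number = 10,
                         selectionFunction = "oneSE")   # "best" = top performer; "oneSE" = simplest model within one SE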

What is the preferred method when pruning a decision/classification tree?

Post-pruning

Clustering attempts to segment the training dataset based on the input features/variables in order to make instances in the same cluster as _______________ as possible

Similar

Explain the basic difference between supervised and unsupervised data mining

Supervised data mining has a clear, specific target variable that we are interested in or trying to predict. Unsupervised data mining has no target variable; it searches for patterns or groupings, with no guarantee that the results are useful or have true meaning.

A decision tree model partitions the instance space into similar regions by perpendicular decision boundaries (horizontal and vertical boundaries)

TRUE

A predictive model is a sort of formula to estimate the unknown value of interest, which we often call "the target"

TRUE

A training dataset should be as similar as possible to the original dataset in terms of the class distribution of the target variable

TRUE

An "objective (loss) function" measures the amount of classification error a model has for a given training dataset

TRUE

C5.0 uses entropy to identify the best decision tree splitting candidate

TRUE
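
A minimal sketch of fitting such a tree with the C50 package; iris is used only as a stand-in dataset.

    library(C50)

    # C5.0 chooses its splits using entropy-based information gain
    model <- C5.0(x = iris[, 1:4], y = iris$Species)
    summary(model)                            # shows the chosen splits and the training error rate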

Classification models attempt to predict which class an instance/individual in a population belongs to

TRUE

Data mining is the application of various analytical techniques to find useful knowledge, patterns and relationships among data

TRUE

Every "leaf node" in a classification/decision tree represents a segment of the population, and the attributes and their values along the tree path give the characteristics of the segment

TRUE

Finding the features that differentiate customers into different groups is an example of an unsupervised learning task (think clustering)

TRUE

In ROC curves, the closer the curve is to the perfect classifier, the better it is at identifying positive values

TRUE

In a classification/decision tree induction (generation) process, the next attribute added is the one that yields the largest information gain

TRUE

In a classification/decision tree, the root node represents segmentation of the dataset by the attribute with the highest information gain value

TRUE

In supervised segmentation, informative attributes increase model accuracy

TRUE

In the confusion matrix, a false negative occurs when a classifier predicts an instance as negative when it is actually positive

TRUE

In the context of a classification/decision tree, an interior node represents a testing of an attribute

TRUE

In the context of a classification/decision tree, every instance/data point will correspond to one and only one path ending at a leaf node

TRUE

In the real-world, model developers often use several model performance tools (graphical and/or numerical) to choose a best model

TRUE

Logistic regression estimates the probability of class membership for a categorical class variable

TRUE

Overfitting occurs when a model learns the training data perfectly but cannot be generalized to a new dataset

TRUE

Pruning is a technique for reducing the complexity of a decision tree model

TRUE

Random forests focus only on ensembles of decision trees

TRUE

The Support Vector Machine (SVM) approach handles classification problems by finding the widest possible bar that fits between the points of two different classes

TRUE

The kappa statistic indicates agreement between a model's predictions and the true values

TRUE
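
A worked sketch of the kappa calculation: observed agreement minus chance agreement, divided by one minus chance agreement. The confusion-matrix counts are made up.

    # Hypothetical 2x2 confusion matrix (rows = actual, columns = predicted)
    cm <- matrix(c(40, 10,
                    5, 45), nrow = 2, byrow = TRUE)

    n   <- sum(cm)
    p_o <- sum(diag(cm)) / n                        # observed agreement (accuracy) = 0.85
    p_e <- sum(rowSums(cm) * colSums(cm)) / n^2     # agreement expected by chance = 0.50
    (p_o - p_e) / (1 - p_e)                         # kappa = 0.70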

The support of an itemset measures how frequently it occurs in the data, while a rule's confidence is a measurement of its predictive power or accuracy

TRUE
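
A small worked example of support and confidence; the five baskets are made up.

    # Five hypothetical shopping baskets
    baskets <- list(c("milk", "bread"),
                    c("milk", "berries"),
                    c("bread", "berries"),
                    c("milk", "bread", "berries"),
                    c("bread"))

    n <- length(baskets)
    support_milk <- sum(sapply(baskets, function(b) "milk" %in% b)) / n                       # 3/5 = 0.6
    support_both <- sum(sapply(baskets, function(b) all(c("milk", "berries") %in% b))) / n    # 2/5 = 0.4

    # confidence(milk -> berries) = support(milk & berries) / support(milk)
    support_both / support_milk                                                               # about 0.67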

The test dataset is used to evaluate the model in order to judge how well its characterization of the training data generalizes to new, unseen cases

TRUE

What is the problem with the holdout method (train-test split), and how should we eliminate this problem?

With a single train-test split we can have an overfitting problem, where the training accuracy is higher than the test accuracy. To mitigate this, we perform multiple splits by breaking the sample into folds (cross-validation).
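
A minimal k-fold cross-validation sketch using caret's createFolds() with a C5.0 learner; the dataset, learner, and k = 10 are illustration choices.

    library(caret)
    library(C50)

    set.seed(123)
    folds <- createFolds(iris$Species, k = 10)        # 10 lists of test-row indices, one per fold

    accuracies <- sapply(folds, function(test_idx) {
      train <- iris[-test_idx, ]
      test  <- iris[test_idx, ]
      model <- C5.0(Species ~ ., data = train)
      mean(predict(model, test) == test$Species)      # accuracy on this fold
    })
    mean(accuracies)                                  # cross-validated accuracy estimate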

What should we set for "family" parameter when performing logistic regression using glm() if we have a two class categorical response?

family = binomial
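
A minimal glm() sketch with family = binomial; the two-class data are simulated for illustration.

    # Simulated two-class data
    set.seed(1)
    df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
    df$y <- factor(ifelse(df$x1 + df$x2 + rnorm(200) > 0, "yes", "no"))

    fit  <- glm(y ~ x1 + x2, data = df, family = binomial)   # logistic regression
    prob <- predict(fit, type = "response")                  # predicted probabilities of the second factor level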

A fitting graph plots

generalization performance vs. model complexity

Information gain is used to

measure the change in entropy due to new information being added

A "measure of purity" known as entropy...

measures the impurity of a set
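
A small worked example of entropy and information gain for one candidate split; the class counts are made up.

    entropy <- function(p) {                  # p = vector of class proportions
      p <- p[p > 0]
      -sum(p * log2(p))
    }

    # Parent node: 10 positives and 10 negatives -> entropy = 1
    parent <- entropy(c(10, 10) / 20)

    # A candidate split produces two children: (9+, 1-) and (1+, 9-)
    child1 <- entropy(c(9, 1) / 10)
    child2 <- entropy(c(1, 9) / 10)

    # Information gain = parent entropy - weighted average of the child entropies
    parent - (10/20 * child1 + 10/20 * child2)    # about 0.53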

The goal of Laplace "correction" is to

reduce the influence of segments (leaf nodes) with only a few instances

Which function can we use for automated parameter tuning to search for optimal models using a choice of evaluation methods and metrics?

train()
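
A minimal automated-tuning sketch with caret's train(); the learner ("C5.0"), metric, and dataset are illustration choices.

    library(caret)

    set.seed(123)
    ctrl <- trainControl(method = "cv", number = 10)

    m <- train(Species ~ ., data = iris,
               method    = "C5.0",            # candidate models come from this learner's tuning grid
               metric    = "Accuracy",        # evaluation metric used to pick the optimal model
               trControl = ctrl)              # evaluation method (here, 10-fold cross-validation)
    m$bestTune                                # tuning parameters of the selected model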

