ISM 4545 Exam 1


In information theory, _______ is a measure of uncertainty associated with a random variable. Homogeneity Entropy Gini P-value

entropy

Confusion matrix can be fully constructed, given the accuracy value of a classifier. T/F

False

It is not possible to use binary split for continuous attributes, since there are at least 3 distinct values. T/F

False

Visualization: Side by Side Box Plot

Side by side boxplots are very effective in showing differences in a quantitative variable across different levels

Types of Classification Methods

Discriminative: Decision Tree, Support Vector Machine, Nearest Neighbor; Regression-based: Logistic Regression, Artificial Neural Network; Probabilistic: Naïve Bayes

precision equation

TP / (TP + FP) = a / (a + c)
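
Using the confusion-matrix labels from these cards (a = TP, c = FP), the precision formula can be sketched in Python; the counts below are made-up illustrative numbers, not course data.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP): the fraction of predicted positives
    that are actually positive (a / (a + c) in the card's notation)."""
    return tp / (tp + fp)

# Illustrative counts: 40 true positives, 10 false positives
print(precision(40, 10))  # 0.8
```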

Limitation of Accuracy Metric

Simplistic: accuracy alone can be misleading when the class distribution is skewed, since a classifier that always predicts the majority class still achieves high accuracy.

Assume that E depicts an event and A depicts an observation. P(E|A) is:

The conditional probability of E, given A

NN: When to Stop Updating?

-When weights change very little from one iteration to the next -When the misclassification rate reaches a required threshold -When a limit on runs is reached

Generalization

-We would want models to apply not just to the exact training set, but also to the general population -Unbiased

Assuming that "I" represents the entropy level of a set, I(3,6) ≠ I(6,3)

FALSE
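
A minimal Python sketch of the entropy formula shows why I(3,6) = I(6,3): entropy depends only on the class proportions, not on the order in which the counts are listed.

```python
from math import log2

def entropy(*counts):
    """Entropy of a class distribution, given raw class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(entropy(3, 6) == entropy(6, 3))  # True
print(entropy(5, 5))  # 1.0 (maximum uncertainty for two classes)
```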

Naïve Bayes Approach: Two Assumptions

1.Attributes are independent 2.Attributes are equally important

Overfitting:

the tendency to tailor a model to the training data as much as possible -Memorization -Confusing the noise in the training data (e.g. occurrences due to chance) with the signal (e.g. patterns)

Which of the following methods provide insights about the patterns in the training data? Naïve Bayes Decision Tree Artificial Neural Network All of the above.

Decision Tree (artificial neural networks are "black box" models; see the NN Disadvantages card)

NN: How a Network Learns

-Backpropagation: The best-known supervised learning algorithm in neural networks -Learning is done by comparing computed outputs to desired outputs of historical cases
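
The learn-by-error-correction idea can be sketched for a single sigmoid neuron; full backpropagation repeats this kind of update layer by layer. All names and the training cases below are illustrative, not from the course.

```python
from math import exp

def sigmoid(z):
    return 1 / (1 + exp(-z))

def train(cases, epochs=1000, rate=0.5):
    """Adjust weight and bias to reduce squared error on historical cases,
    by comparing the computed output to the desired output each pass."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, target in cases:
            out = sigmoid(w * x + b)
            err = out - target            # computed vs. desired output
            grad = err * out * (1 - out)  # gradient of squared error
            w -= rate * grad * x          # step toward lower error
            b -= rate * grad
    return w, b

w, b = train([(0.0, 0.0), (1.0, 1.0)])
print(sigmoid(w * 1.0 + b) > 0.5, sigmoid(b) < 0.5)  # True True
```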

CRISP Process

1. Business Understanding (objectives/requirements) 2. Data Understanding 3. Data Prep 4. Modeling 5. Evaluation 6. Deployment

Confusion Matrix Components

A = TP B = FN C = FP D = TN
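
The four components can be tallied directly from paired actual/predicted labels; this Python sketch uses made-up labels for illustration.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Tally (A=TP, B=FN, C=FP, D=TN) from paired actual/predicted labels."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

print(confusion_matrix([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (2, 1, 1, 1)
```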

Data Exploration

A preliminary exploration of the data to better understand its characteristics Key motivations of data exploration include -Helping to select the right tool for preprocessing or analysis -Making use of humans' abilities to recognize patterns Especially useful in early stages of data analysis -Detect outliers (e.g. assess data quality) -Test assumptions (e.g. normal distributions or skewed?) -Identify useful raw data & transforms (e.g. log(x))

Methods can be compared and evaluated according to different criteria:

Accuracy Speed •Time to construct the model •Time to use the model Robustness •Handling noise and missing values Scalability •Efficiency in large datasets Interpretability •Understanding the model

In an artificial neural network, each input node represents a(n) Attribute Class Cluster Weight

Attribute

Two of the assumptions of Naïve Bayes are: Classes are independent and equally important Attributes are mutually exclusive and exhaustive Classes are mutually exclusive and exhaustive Attributes are independent and equally important

Attributes are independent and equally important

Methods of Estimation: Cross-Validation

Avoids overlapping test sets -First step: data is split into k subsets of equal size -Second step: each subset in turn is used for testing and the remainder for training
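
A minimal sketch of k-fold index assignment (illustrative helper, not a library API): each record lands in exactly one test fold, so the test sets never overlap.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation:
    the data is split into k subsets; each subset is the test set once."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test

# Every index appears in exactly one test fold:
all_test = [i for _, test in k_fold_indices(6, 3) for i in test]
print(sorted(all_test))  # [0, 1, 2, 3, 4, 5]
```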

What is a Neural Network?

Biologically inspired computer models that operate like a human brain -These networks can "learn" from the data and recognize patterns

ROC (receiver operating characteristic)

Characterizes the trade-off between positive hits and false alarms (true positive rate vs. false positive rate)

Data Preparation and Preprocessing

Check the quality of the data Clean the data (if necessary) Transform the data (if necessary) Sample the data (if necessary)

Which of the following is a better split to use in a decision tree? (Class 1: 5, Class 2: 5) (Class 1: 4, Class 2: 4) (Class 1: 2, Class 2: 2) (Class 1: 7, Class 2: 8)

Class 1: 7, Class 2: 8
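
Checking the candidates with the entropy formula confirms the answer: a (7, 8) node is slightly less balanced than a perfect 50/50 split, so its entropy is marginally lower (more homogeneous). The helper below is an illustrative sketch.

```python
from math import log2

def entropy(*counts):
    """Entropy of a class distribution, given raw class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

print(entropy(5, 5))  # 1.0 (perfectly impure, same as 4/4 and 2/2)
print(entropy(7, 8) < entropy(5, 5))  # True: the (7, 8) split is slightly better
```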

Classification vs. Prediction

Classification- predicts categorical class labels Prediction- Predicts continuous-valued functions

Dataset:

Collection of data objects including their attributes

Quality of Data: Outliers

Data objects with characteristics that are considerably different than most of the other data objects in the data set

Which of the following is an example of a classification problem? Grouping customers of a bank into 3 segments based on customer value. Finding correlations between items sold at a grocery store Forecasting the monthly sale amount for the next 3 months at a retailer. Deciding whether a tumor cell is benign or not.

Deciding whether a tumor cell is benign or not.

One of the advantages of decision trees over artificial neural networks is that: Decision trees have better accuracy Decision trees don't face the problem of overfitting Decision trees reveal the pattern used for making the classification All of the above

Decision trees reveal the pattern used for making the classification

NN: Learning algorithm

Designing the network topology Sample cases are shown to the network as input and the weights are adjusted to minimize the error in its outputs

Naïve Bayes Discussion

Easy to implement. It works surprisingly well (even if the assumptions are not realistic). Though, if there exist dependencies among the attributes, these cannot be modeled by a Naïve Bayes classifier. Adding too many redundant attributes will cause problems.

Quality of Data: Missing Values

-Eliminate data objects -Estimate missing values -Ignore the missing value during analysis

NN Advantages

Good predictive ability -Can capture complex relationships -High tolerance to noisy data -Ability to classify untrained patterns

Decision tree algorithms employ a ______ strategy, meaning that they grow the tree by making locally optimal decisions. Greedy Mutually exclusive Irreversible Randomized

Greedy

Good split attributes create _______ child nodes in decision trees. A large number of Heterogeneous Small size Homogeneous

Homogeneous

Quality of Data: Duplicate Data

A major issue when merging data from heterogeneous sources. Data cleaning: the process of dealing with duplicate data.

General Structure of ANN

Multiple layers -Input layer (raw obs.) -Hidden layer(s) -Output layer Neurons (i.e. nodes) Weights (wij) Bias values (θj)

Why is the error rate on the training dataset not a good indicator of the error rate on future data?

The training error rate is overly optimistic and biased, because the model was fit to exactly those records. Future data is new and different, so the error rate on the training set is not a reliable indicator of the error rate on future data.

Which of the following is not one of the reasons that decision trees are popular as a classification method? Black box pattern identification Light computational requirements Stable performance None of the above

None of the above

Methods of Estimation: Random Subsampling

Repeated holdout -In each iteration, a certain proportion is randomly selected for training -The error rates on the different iterations are averaged to yield an overall error rate

Visualization: Histogram

Shows the distribution of values of a single variable Divide the values into bins and show a bar plot of the number of objects in each bin The height of each bar indicates the number of objects Shape of histogram depends on the number of bins

Confusion Matrix

Separates out the decisions made by the classifier, making explicit how one class is being confused with another class

NN: User Inputs

Specify network architecture: Number of hidden layers (most popular: one hidden layer). Number of nodes in hidden layer(s) (more nodes capture complexity, but increase the chance of overfitting). Number of output nodes (for classification, one node per class; in the binary case one node can suffice; for numerical prediction, use one). Learning rate (low values "downweight" the new information from errors at each iteration; this slows learning, but reduces the tendency to overfit to local structure).

Recall:

TP / (TP + FN) = a / (a + b)

TPR FPR

TPR = TP / (TP + FN) = a / (a + b) FPR = FP / (FP + TN) = c / (c + d)
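
Recall/TPR and FPR can be computed together from the four confusion-matrix counts; the numbers below are illustrative, not course data.

```python
def rates(tp, fn, fp, tn):
    """Return (TPR, FPR): TPR = TP/(TP+FN) = a/(a+b); FPR = FP/(FP+TN) = c/(c+d)."""
    return tp / (tp + fn), fp / (fp + tn)

# Illustrative counts: a=40, b=10, c=5, d=45
print(rates(40, 10, 5, 45))  # (0.8, 0.1)
```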

The shape of a histogram for a continuous variable would depend on the number of bins used. T/F

TRUE

Accuracy of a classifier is given as: the percentage of the records that were correctly classified by the classifier The number of true positives given in the confusion matrix The number of true negatives given in the confusion matrix All of the above

The percentage of the records that were correctly classified by the classifier

Posterior probability is defined as: The probability of an event before observation The probability of an event after observation Unconditional probability of an event None of the above

The probability of an event after observation

In cross-validation, each record is used exactly once for testing purposes. T/F

True

Attribute:

A property or characteristic of an object. Examples: age of a person, eye color, etc. [Attribute || Field || Variable || Feature]

The task of mapping an input attribute set into its class label is called Clustering Classification Segmentation Prediction

classification

NN: Used for.. Can capture.. Major Danger:

classification and prediction a very flexible/complicated relationship between the outcome and a set of predictors Overfitting

The performance of a model in terms of positive and negative frequencies can be depicted using a table known as the: Validation table Confusion matrix Validation matrix Sensitivity matrix

confusion matrix

________ is a method of dividing the existing data set into two parts in a way that each class label is represented with equal proportions in different parts. Segmentation Cross-validation Stratification Overfitting

Stratification (cross-validation splits the data into k subsets for repeated testing; stratification is what keeps the class proportions equal across parts)

Object

Each row is one object [Case || Record || Row || Observation]: a set of measurements for one entity, e.g. the height, weight, age, etc. of one person

The steps involved in the classification process are: Model learning and model update Algorithm development and model learning Model learning and model application Algorithm development and model application

model learning and application

Conditional probability: : P(C|A1, A2, ..., An )

probability of an event after observation -P(C=Spam|exceeded=1, login=1, data=0, fsu=0) -Posterior, after evidence

Prior probability: P(C)

probability of an event before observation

Which of the following visualization tools provides information similar to correlation analysis?

scatter plot

Classification methods can be compared and evaluated according to their ______ speed, accuracy, scalability, robustness, interpretability

speed, accuracy, scalability, robustness, interpretability (all of the above)

NN Disadvantages

-Considered a "black box" prediction machine, with no insight into relationships between predictors and outcome -No variable-selection mechanism, so you have to exercise care in selecting variables -Heavy computational requirements if there are many variables (additional variables dramatically increase the number of weights to calculate)

_______ is a popular algorithm that uses artificial neural networks to learn from data. Backpropagation Apriori C5.0 Sequential covering

Backpropagation

Why are the assumptions of Naïve Bayes important? Note: E depicts the event and A depicts the observations Because they make the process more realistic Because they simplify the task of computing P(A|E) Because they simplify the task of computing P(A) All of the above

Because they simplify the task of computing P(A|E)

Categorical vs continuous

Categorical Attribute: -Has only a finite or countable set of values -Examples: eye color, letter grade, zip code -Can be of different sub-types: nominal, ordinal and flag -Can be stored as string or numerical _______________________________ Continuous Attribute: -Quantitative attributes -Examples: weight, failures per hour, number of students, temperature -Can be of different sub-types: interval or ratio -Can be stored as numerical (real or integer)

Summary Stats: Central Tendency vs Spread

Central tendency: mean, median, frequency and mode. Spread: percentiles, range and variance.

Metrics for Performance Evaluation

Classification error -Classifying a record as belonging to one class when it belongs to another class -Count Error rate -Percent of misclassified records out of the total records

Which of the following is one of the tasks in the model learning step? Classify new / unknown records Estimate the accuracy of the model Develop the classification model All of the above

Develop the classification model

Issue w/ Naive Bayes: Continuous Attributes

Discretize the range into bins. Two-way split: (A < v) or (A ≥ v); choose only one of the two splits as the new attribute. Probability density estimation: assume the attribute follows a normal distribution; use data to estimate the parameters of the distribution (e.g., mean and standard deviation); once the probability distribution is known, use it to estimate the conditional probability P(Ai|c)
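
The probability-density approach can be sketched as: estimate the class's mean and standard deviation from its training values, then plug a new value into the normal density to estimate P(Ai|c). The data and names below are illustrative.

```python
from math import exp, pi, sqrt

def gaussian_density(x, mean, std):
    """Normal density, used as the estimate of P(Ai|c) for a continuous attribute."""
    return exp(-((x - mean) ** 2) / (2 * std ** 2)) / (sqrt(2 * pi) * std)

# Fit mean/std to this class's training values (made-up data):
values = [4.0, 5.0, 6.0]
mean = sum(values) / len(values)
std = sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
print(round(gaussian_density(5.0, mean, std), 4))  # density is highest at the mean
```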

In this course we used the ...

Greedy strategy: split the records based on an attribute test that optimizes a certain criterion, in a divide-and-conquer manner

A prerequisite of the classification task is to: Introduce human intuition into the process Have predetermined class labels Use parallel computing All of the above

Have pre-determined class labels

Which of the following is not one of the disadvantages of artificial neural networks? Black box classification Long computation times Proneness to overfitting High tolerance to noisy data

High tolerance to noisy data

Some of the questions to ask after building the classification model are: How predictive is the model that was built? How can we obtain the prediction results? Given two or more models, which one is better? All of the above?

How predictive is the model that was built? How can we obtain the prediction results? Given two or more models, which one is better? (All of the above)

Existence of a large number of decision nodes and branches in a decision tree is an indication of: Random error Systematic error Entropy Overfitting

Overfitting

What is overfitting? Explain it in the context of training set model error rate and model complexity.

Overfitting is the tendency to tailor a model to the training data as much as possible. It occurs when the model confuses the noise in the training data (occurrences due to chance) with the signal (actual patterns). As model complexity grows, the error rate on the training set keeps falling, but the error rate on new data rises, because the model has memorized chance patterns that do not generalize.

ROC Curve

The ROC curve plots the TP rate on the y-axis against the FP rate on the x-axis at various cut-off settings. A strong model has a curve that bows toward the upper left, giving a larger area under the curve.
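
Each cutoff yields one (FPR, TPR) point on the curve; this sketch predicts positive when the score meets the threshold. The scores and labels are made up for illustration.

```python
def roc_points(scores, labels, thresholds):
    """One (FPR, TPR) point per cutoff: predict positive when score >= threshold."""
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

scores = [0.9, 0.8, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0]
print(roc_points(scores, labels, [0.5]))  # one (FPR, TPR) point at cutoff 0.5
```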

Quality of Data: Noise

Random error or variance in the original value -Examples: distortion of a person's voice when talking on a poor phone and "snow" on television screen

Methods of Estimation: Holdout

The holdout method reserves a certain amount for training and uses the remainder for testing -Commonly 70% for training and 30% for test For small datasets, samples might not be representative -Few or none instances of some classes -Stratified sampling: Balancing the data -Making sure that each class is represented with approximately equal proportions in both subsets (similar class distribution in each subset)
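
Stratified holdout can be sketched as: split each class separately at the same proportion, so both subsets keep a similar class distribution. All names here are illustrative, not from the course materials.

```python
import random

def stratified_holdout(records, labels, train_frac=0.7, seed=42):
    """Holdout split that preserves each class's proportion in both subsets."""
    random.seed(seed)
    by_class = {}
    for rec, lab in zip(records, labels):
        by_class.setdefault(lab, []).append(rec)
    train, test = [], []
    for lab, recs in by_class.items():
        random.shuffle(recs)
        cut = int(len(recs) * train_frac)
        train += [(r, lab) for r in recs[:cut]]
        test += [(r, lab) for r in recs[cut:]]
    return train, test

# 10 records, two equally sized classes:
tr, te = stratified_holdout(list(range(10)), [0] * 5 + [1] * 5)
print(len(tr), len(te))  # 6 4
```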

Some Questions after Building a Classification Model

•How predictive is the model that we built? •How reliable are the predicted results? •Which model is better (in terms of performance)?

Which of the following is one of the stopping conditions that stop the growth of a decision tree? All records for a given node belong to the same class There are no more remaining attributes for further partitioning There are no more records left for splitting All of the Above

All records for a given node belong to the same class There are no more remaining attributes for further partitioning There are no more records left for splitting (All of the above)

Which of the following is not one of the user inputs required to run artificial neural network algorithms? Number of leaf nodes Learning rate # of hidden layers All of the above

Number of leaf nodes (learning rate and the number of hidden layers are required user inputs; leaf nodes belong to decision trees)

Data Types

Record Data: consists of a collection of records, each of which consists of a fixed set of attributes Transaction Data: A special type of record data, where each record (transaction) involves a set of items Document Data: Each document becomes a 'word' vector •each word is a component (attribute) of the vector •the value of each component is the number of times the corresponding word occurs in the document Graph Data

Issue w/ Naive Bayes: The Zero-probability Problem

Solution: Use Laplacian correction -Add 1 to the count for every attribute value-class combination -Result: probabilities will never be zero P(Ai|Cj) = (|Aij| + 1) / (Nj + v)
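
The correction can be sketched in one line of Python: with v distinct attribute values, an unseen value-class combination still gets a small nonzero probability. The counts below are illustrative.

```python
def laplace_prob(count_ai_cj, n_j, v):
    """P(Ai|Cj) with Laplacian correction: (count + 1) / (Nj + v),
    where v is the number of distinct values of the attribute."""
    return (count_ai_cj + 1) / (n_j + v)

# A value never seen with class Cj no longer zeroes out the whole product:
print(laplace_prob(0, 10, 3))  # 1/13, rather than 0
```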

Naive Bayes Definition

Statistical classifier -Perform probabilistic prediction -predict the probability of belonging to a class Combining pieces of evidence -Prior and posterior information

Choropleth Map

Spatial data. Maps using color shadings to represent numerical values are called choropleth maps.

Precision is computed as: The number of true positives divided by the total number of predicted positive records The number of true positives divided by the total number of actual positive records The number of true positives divided by the number of false positives The number of true positives divided by the number of true positives

The number of true positives divided by the total number of predicted positive records

Which of the following is one of the differences between a training data set and a test data set? Training data sets include class labels, whereas test data sets have unknown class labels Training data sets use old data, whereas test data sets use new data Training data sets are used to derive the classification model, whereas test data sets are used to estimate model performance All of the above

Training data sets are used to derive the classification model, whereas test data sets are used to estimate model performance

A decision tree is mutually exclusive, meaning that each data record is covered by at most one path in the tree. T/F?

True

A test set and a training set are similar in the sense that they both include observations with known class labels. T/F

True

Any classification model that has an area of greater than 0.5 under its ROC curve performs better than classification by chance. T/F

True

Entropy

a measure of the uncertainty associated with a random variable

Which of the following visualization tools can be used for outlier detection?

box plot

When a classification begins to memorize the training data than learning from a general trend, then ______ occurs. Random error Systematic error Overfitting Deviation

overfitting

