Final Exam - Data Analysis


Training-Test Dataset

-Randomly split the dataset into 70% and 30% -70% training data (build the model) -30% test data (evaluate the model performance)
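A minimal sketch of such a split, assuming scikit-learn and a toy dataset (names and data are illustrative, not from the course):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)           # any feature matrix / label vector works here
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42   # 30% randomly held out as test data
)
print(len(X_train), len(X_test))            # 105 training rows, 45 test rows
```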

Steps to Apply Naive Bayes

(1) Calculate the frequency of each class (2) Construct frequency tables for each attribute against the class (3) Calculate the posterior probability (4) (optional) normalize the results (5) Make the decision/prediction
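A hedged sketch of these steps on a tiny hypothetical categorical dataset (attribute names and data are illustrative):

```python
from collections import Counter, defaultdict

# Toy dataset: each row is (attribute values, class label).
data = [
    ({"outlook": "sunny", "windy": "no"},  "play"),
    ({"outlook": "sunny", "windy": "yes"}, "stay"),
    ({"outlook": "rain",  "windy": "no"},  "play"),
    ({"outlook": "rain",  "windy": "yes"}, "stay"),
    ({"outlook": "sunny", "windy": "no"},  "play"),
]

# (1) frequency of each class
class_counts = Counter(label for _, label in data)

# (2) frequency tables: (attribute, value) against the class
freq = defaultdict(Counter)
for attrs, label in data:
    for attr, value in attrs.items():
        freq[(attr, value)][label] += 1

def predict(new_instance):
    """(3) posterior score per class, (4) normalize, (5) predict the largest."""
    n = len(data)
    scores = {}
    for c, c_count in class_counts.items():
        p = c_count / n                               # prior P(c)
        for attr, value in new_instance.items():
            p *= freq[(attr, value)][c] / c_count     # P(value | c)
        scores[c] = p
    total = sum(scores.values()) or 1.0
    scores = {c: s / total for c, s in scores.items()}  # (4) optional normalization
    return max(scores, key=scores.get), scores

print(predict({"outlook": "sunny", "windy": "no"}))   # ('play', {...})
```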

Model Evaluation Metrics

(1) Confusion Matrix (2) TPR: True Positive Rate (3) TNR: True Negative Rate (4) FPR: False Positive Rate (5) FNR: False Negative Rate (6) Accuracy (7) Precision (8) Recall (9) F-measure

Steps to Build Decision Tree

(1) Create a frequency table for the class (2) Calculate the class entropy (3) Create a frequency table for each attribute (4) Calculate the entropy for each attribute (5) Calculate the information gain for each attribute (6) Select the attribute with the highest information gain to be the root node (7) Repeat steps 3-6 until all attributes have been used
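A hedged sketch of the entropy / information-gain calculations at the heart of these steps (toy data and names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """E(T) = -SUM(Pi * log2(Pi)) over the class frequency table."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, labels):
    """IG = E(T) - E(T, x): class entropy minus the weighted entropy of the attribute."""
    n = len(labels)
    weighted = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        weighted += (len(subset) / n) * entropy(subset)   # weighted entropy per value
    return entropy(labels) - weighted

rows   = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rain"}, {"outlook": "rain"}]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, "outlook", labels))   # 1.0: this attribute splits the classes perfectly
```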

Steps to implement KNN algorithm

(1) Decide on k value (k = number of nearest neighbors) (2) Calculate the distance between each training instance and the new instance (3) Arrange the resulting distance values in ascending order (4) Select the k smallest distance values (5) Classify the new instance based on the majority vote.
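A minimal sketch of these steps with Euclidean distance (data and names are illustrative):

```python
import math
from collections import Counter

def knn_predict(training, new_point, k=3):
    """training: list of (feature_vector, label); classify new_point by majority vote."""
    # (2) distance between the new instance and every training instance
    distances = [(math.dist(x, new_point), label) for x, label in training]
    # (3)-(4) sort ascending and keep the k smallest distances
    nearest = sorted(distances, key=lambda d: d[0])[:k]
    # (5) majority vote among the k nearest neighbors
    return Counter(label for _, label in nearest).most_common(1)[0][0]

training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]
print(knn_predict(training, (1.1, 1.1), k=3))   # "A": two of the three nearest neighbors are A
```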

Steps for Association Rules

(1) Decide on the minimum support and minimum confidence (2) Create a table (iteration 1) with a list of itemsets where K = 1 (K is the number of items in the itemset) (3) From the list of the first itemsets, extract a list of frequent itemsets whose support is >= the minimum support (4) From the list of frequent itemsets from the step above, generate a list of itemsets where K = 2 (iteration 2) (5) From the list of the second itemsets, extract a list of frequent itemsets that satisfy the minimum support (6) Keep repeating steps 4 & 5, increasing K by 1 each time, until K equals the size of the largest transaction (7) Generate the set of rules and select the rules that satisfy the minimum confidence
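A compact sketch of these iterations on a handful of hypothetical transactions (items, thresholds, and names are illustrative):

```python
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
min_support, min_confidence = 0.5, 0.7
N = len(transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / N

# Iterations: start with K = 1 itemsets, keep the frequent ones, then grow K.
k_sets = {frozenset([i]) for t in transactions for i in t}
k_sets = {s for s in k_sets if support(s) >= min_support}
frequent = []
while k_sets:
    frequent.extend(k_sets)
    candidates = {a | b for a in k_sets for b in k_sets if len(a | b) == len(a) + 1}
    k_sets = {c for c in candidates if support(c) >= min_support}

# Generate rules X -> Y from the frequent itemsets and keep those meeting the minimum confidence.
for itemset in (s for s in frequent if len(s) > 1):
    for r in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, r)):
            confidence = support(itemset) / support(lhs)
            if confidence >= min_confidence:
                print(set(lhs), "->", set(itemset - lhs), round(confidence, 2))
```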

Steps for k-means

(1) Decide on the number of clusters (2) Randomly assign the data points to each cluster (3) Calculate the centroid point of each cluster (4) Calculate the within-cluster squared error for each cluster (5) Calculate the total error (6) Calculate the distance between the centroid of each cluster and each data point (7) Evaluate the Euclidean distance of each data point and re-assign the data point to the cluster with the smallest distance (8) Start again from step 4 until you get stable results
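A hedged sketch of these steps for 2-D points, using random assignment to initialize; for brevity the squared-error bookkeeping of steps 4-5 is omitted (data and names are illustrative):

```python
import random

def kmeans(points, k=2, max_iter=100, seed=0):
    rng = random.Random(seed)
    # (2) randomly assign every point to one of the k clusters
    assignment = [rng.randrange(k) for _ in points]
    centroids = []
    for _ in range(max_iter):
        # (3) centroid of each cluster = mean of each coordinate over its points
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c] or [rng.choice(points)]
            centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
        # (6)-(7) re-assign each point to the cluster with the closest centroid
        new_assignment = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points
        ]
        if new_assignment == assignment:   # (8) stop once the clusters are stable
            break
        assignment = new_assignment
    return assignment, centroids

print(kmeans([(1, 1), (1, 2), (8, 8), (9, 8)], k=2))
```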

Row

*For more than 2 classes in matrix* FN for each class is the sum of the values in the ___ of the class, excluding the TP.

Column

*For more than 2 classes in matrix* FP for each class is the sum of the values in the ___ of the class excluding the TP.

columns, rows, excluding

*For more than 2 classes in matrix* TN is the sum of all ______ and _____ (excluding/including) the column and row of the class.
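A short sketch of the three rules above on a hypothetical 3-class matrix, assuming rows are the actual classes and columns are the predicted classes:

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual class, columns = predicted class.
cm = np.array([[50,  3,  2],
               [ 4, 40,  6],
               [ 1,  5, 39]])

for i, name in enumerate(["A", "B", "C"]):
    tp = cm[i, i]
    fn = cm[i, :].sum() - tp        # rest of the class's row
    fp = cm[:, i].sum() - tp        # rest of the class's column
    tn = cm.sum() - tp - fn - fp    # everything excluding the class's row and column
    print(name, "TP =", tp, "FP =", fp, "FN =", fn, "TN =", tn)
```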

Decision Tree

-Another classification algorithm that builds a model in a tree form -Applies a top-down approach: it keeps breaking the data set down into smaller and smaller subsets that share the same characteristics while it builds the tree -Decisions are based on an algorithm called ID3

ROC Curve

-Area under the curve = AUC (1 is perfect) -The higher the area under the curve, the better the model
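A minimal sketch of computing ROC values with scikit-learn (the dataset and model are illustrative, not from the course):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]       # probability of the positive class

fpr, tpr, _ = roc_curve(y_test, scores)          # x-axis = FPR, y-axis = TPR
print("AUC:", roc_auc_score(y_test, scores))     # closer to 1 = better model
```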

T-test

-Confidence level = 95% (95% chance of being right) -Alpha = .05 (5% chance of being wrong)
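A minimal sketch, assuming the t-test is used to compare two models' fold scores (all numbers hypothetical):

```python
from scipy import stats

# Hypothetical accuracy scores of two models over 10 cross-validation folds.
model_a = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.81, 0.80, 0.83]
model_b = [0.75, 0.77, 0.74, 0.76, 0.78, 0.73, 0.77, 0.75, 0.76, 0.74]

t_stat, p_value = stats.ttest_ind(model_a, model_b)
alpha = 0.05                                   # 95% confidence level
if p_value < alpha:
    print("Statistically significant difference, p =", round(p_value, 4))
else:
    print("No significant difference, p =", round(p_value, 4))
```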

Market segmentation

-How to cluster customers/people -Ex: based on spending, shopping habits, zip code, gender, age, etc.

Clustering applications

-Market segmentation -Outlier identification

Cost Sensitive Analysis

-Needed to reduce False Positives (FP) or False Negatives (FN), or to work with a biased dataset -A data mining learning technique that takes the misclassification cost into consideration

Apriori algorithm

-One of the earliest and best-known algorithms for association rules -Rule-based algorithm -Mines the itemsets from transactions to generate rules -The two main concepts are support and confidence

Clustering (unsupervised)

-The process/techniques of dividing a large dataset into small groups where each group shares similar characteristics -The distance between the data points of the same group is small and the distance between the points of different groups is large

Advantages of Decision Tree

-Very fast -Works with both categorical & numerical data -Can handle high-dimensional datasets (datasets with a large number of attributes)

K-nearest neighbor (KNN)

-non-probabilistic algorithm -considered an instance-based algorithm -learning = storing all the training dataset -defined as a lazy algorithm -based on the idea that instances that are close to each other share similar characteristics -decisions are based on a similarity measure called a distance function

Minkowski Distance

= (SUM |xi - yi|^p)^(1/p) -p = the order of the distance (p = 1 gives Manhattan distance, p = 2 gives Euclidean distance)

Centroid

= (SUM(x)/n, SUM(y)/n) -x, y = the x and y coordinates of the points in the cluster -n = total number of points in the cluster

Squared error

= SUM(x - m)^2 -x = the point -m = centroid -the best number of clusters is often chosen at the elbow of the error-vs-k graph (where the line flattens)

Euclidean Distance

= square root of SUM((xi - yi)^2)
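A hedged sketch of the formulas in the four cards above as small Python functions (data is illustrative):

```python
def minkowski(x, y, p=2):
    """(SUM |xi - yi|^p)^(1/p); p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    return minkowski(x, y, p=2)

def centroid(points):
    """Mean of each coordinate over all points in the cluster."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def squared_error(points, m):
    """SUM of squared distances between each point x and the centroid m."""
    return sum(euclidean(x, m) ** 2 for x in points)

cluster = [(1, 1), (2, 2), (3, 3)]
m = centroid(cluster)                  # (2.0, 2.0)
print(m, squared_error(cluster, m))    # squared error = 4.0
```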

Naive Bayes

A probabilistic classification algorithm -based on Bayes' Theorem -main requirement to use this: expects independence among all the attributes (none of them are highly correlated) -in order to use this, must understand posterior probability

Posterior probability

A probability that is a revision of a prior probability using additional information. -P(c|xi) = (P(xi|c) * P(c)) / P(xi) -c = class -xi = attribute (predictor) -P(xi|c) = the conditional probability of the attribute given the class -P(xi) = prior probability of the attribute -P(c) = the prior probability of the class
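A hedged numeric illustration of the formula (all numbers hypothetical): if P(c) = 0.6, P(xi|c) = 0.5, and P(xi) = 0.4, then P(c|xi) = (0.5 * 0.6) / 0.4 = 0.75.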

K-fold cross validation

A validation technique that we can use while building the model -applied to the training data set -k = number of folds or subsets -Ex: a training dataset of 100 instances with k = 10 --> Subset 1: 10, Subset 2: 10, ..., Subset 10: 10
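A minimal sketch with scikit-learn (the model and dataset are illustrative, not from the course):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
folds = KFold(n_splits=10, shuffle=True, random_state=0)            # k = 10 subsets
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=folds)
print(scores.mean())                                                # average accuracy over the 10 folds
```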

Solve underfitting

Adding more attributes will solve this

ID3

Algorithm based on 2 main concepts: -Entropy -Information gain

x-axis = FPR (False Positive Rate), y-axis = TPR (True Positive Rate)

Axes of ROC curve?

Disadvantage of Decision Tree

Cannot detect the correlation among attributes

Detect outliers

Common use of unsupervised clustering?

Node

Component of the tree: -Attributes -The root is the best predictor: the attribute with the lowest entropy / highest information gain

Leaves

Component of the tree: -Output or class or target (all the same)

Branches

Component of the tree: -Represent attribute criteria

K-means

Considered a partition clustering algorithm -uses the squared error criterion -main idea: randomly partition the data points into a pre-defined number of clusters -keeps reassigning the data points to the best cluster based on the squared error and a distance function

Association Rule

Discovers items that most frequently occur together -unsupervised learning -we use it to discover the hidden relationships in the data set -Those relationships are presented in the format of a set of rules -Those rules state that when an event occurs, another event occurs with a certain probability

precision, minimize

Give more weight to ___ to ___ the number of False Positives (FP). Ex: spam email & intrusion detection systems (IDS).

recall, minimize

Give more weight to ___ to ____ the number of False Negatives (FN). -Ex: bank fraud transactions & medical diagnosis situations.

Model evaluation

How well the model fits your data

Confidence

Measures the certainty of the rule. The rule will be measured against a minimum threshold value. = freq(x, y) / freq(x)

Support

Measures the usefulness of the rule. The itemset will be evaluated against a minimum threshold value. = freq(x, y) / N
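A hedged worked example combining the two formulas above (all counts hypothetical): in N = 100 transactions, if {bread, milk} appears together 20 times and bread appears 40 times, then support = 20 / 100 = 0.20 and confidence(bread -> milk) = 20 / 40 = 0.50.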

Entropy (E)

Part of ID3: -a measure of the level of uncertainty -measures the level of impurity -level of homogeneity -measures how much information the attribute can contribute to the class -0 = 100% homogeneous -1 = data is divided into 2 equal sets -For class: E(T) = -SUM(Pi * log2(Pi)) -For attribute: E(T, x) = SUM(P(c) * E(c))

Information Gain

Part of ID3: -the lower the entropy, the higher the information gain, and vice versa -IG = E(T) - E(T, x)
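A hedged worked example (counts hypothetical): with 9 positive and 5 negative instances, E(T) = -(9/14)log2(9/14) - (5/14)log2(5/14) ≈ 0.940; if an attribute's weighted entropy is E(T, x) = 0.693, then IG = 0.940 - 0.693 = 0.247.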

Confusion Matrix

Shows the number of the correct and incorrect instances predicted by the model compared to the actual data -TP -TN -FN -FP

True Negative (TN)

The instance is predicted as negative and it is actually negative.

False Negative (FN)

The instance is predicted as negative and it is actually positive.

False Positive (FP)

The instance is predicted as positive but it is actually negative.

True Positive (TP)

The instance is predicted as positive and it is actually positive.

Overfitting

Very specific model (Type II Error) -Training data set = very good -Test data set = bad

True Negative Rate (TNR)

The ratio of the correctly classified negative instances to the total actual negative instances. = TN/(TN + FP)

Accuracy

The percent of the correctly predicted instances. This is a very good measure of the quality of the model, but it requires symmetry between the False Positives (FP) and False Negatives (FN). If this can't be achieved, we should use other measurements besides this. = (TN + TP)/(TN + TP + FN + FP) OR = sum of the diagonal (TP of each class)/(total of all values) <-- for an NxN matrix

True Positive Rate (TPR)

The ratio of the correctly classified positive instances to the actual total positive instances. = TP/ (TP + FN)

Precision

The ratio of the correctly classified positive instances to the total predicted positive instances. = TP/(TP+FP)

Recall

The ratio of the correctly predicted positive instances to the total actual positive instances. We want to minimize False Negatives (FN) with this. = TP/(TP + FN) (same equation as True Positive Rate aka TPR)

False Positive Rate (FPR & TYPE I ERROR)

The ratio of the negative instances that are classified as positive to the total actual negative instances. = FP/ (FP + TN)

False Negative Rate (FNR & TYPE II ERROR)

The ratio of the positive instances that are misclassified as negative to the total positive instances. = FN/ (FN +TP)

F-measure

The weighted average of precision & recall. = 2 * (P * R)/(P + R)
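A short sketch computing the metrics above from hypothetical binary confusion-matrix counts:

```python
# Hypothetical confusion-matrix counts for a binary classifier.
tp, tn, fp, fn = 40, 45, 5, 10

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)                        # = TPR = sensitivity
tnr       = tn / (tn + fp)                        # specificity
fpr       = fp / (fp + tn)                        # Type I error rate
fnr       = fn / (fn + tp)                        # Type II error rate
f_measure = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, tnr, fpr, fnr, round(f_measure, 3))
```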

Solve overfitting

These two steps will solve this: (1) Select the significant attributes (2) use k-fold cross validation

Euclidean distance, Minkowski distance

Two main distance functions

Underfitting

Very generic model (Type I Error) -Training data set = bad -Test data set = bad

Sensitivity

Another name for Recall / the True Positive Rate (TPR)

Specificity

Another name for the True Negative Rate (TNR)

increase, decreasing

When you ____ precision, it comes at the cost of ____ recall, and vice versa

FP > FN

Which would you prefer your classifier to have?

