Final Exam-Data Analysis
Training-Test Dataset
-Need to randomly split the dataset into 70% and 30% -70% training data (build the model) -30% test data (evaluate the model performance)
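A minimal sketch of this split in Python, assuming scikit-learn is available (the tiny X and y below are placeholders):

```python
# Hypothetical example: split a feature matrix X and labels y into
# 70% training data and 30% test data, selected at random.
from sklearn.model_selection import train_test_split

X = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10],
     [11, 12], [13, 14], [15, 16], [17, 18], [19, 20]]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)  # 70% train / 30% test

print(len(X_train), "training instances,", len(X_test), "test instances")
```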
Steps to Apply Naive Bayes
(1) Calculate the frequency of each class (2) Construct frequency tables for each attribute against the class (3) Calculate the posterior probability (4) (optional) normalize the results (5) Make the decision/prediction
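A small Python sketch of these five steps on a made-up categorical dataset (no smoothing; the attribute names and values are invented for illustration):

```python
from collections import Counter, defaultdict

# Toy training data (invented): (Outlook, Windy) -> Play
data = [("Sunny", "No", "Yes"), ("Sunny", "Yes", "No"),
        ("Rainy", "No", "Yes"), ("Rainy", "Yes", "No"),
        ("Sunny", "No", "Yes"), ("Rainy", "No", "Yes")]

# (1) Frequency of each class
class_freq = Counter(row[-1] for row in data)

# (2) Frequency tables: attribute value vs. class
attr_freq = [defaultdict(Counter), defaultdict(Counter)]
for outlook, windy, cls in data:
    attr_freq[0][cls][outlook] += 1
    attr_freq[1][cls][windy] += 1

# (3) Posterior numerator: P(c) * product of P(xi | c)
new_instance = ("Sunny", "No")
scores = {}
for cls, n_cls in class_freq.items():
    p = n_cls / len(data)                       # prior P(c)
    for i, value in enumerate(new_instance):
        p *= attr_freq[i][cls][value] / n_cls   # conditional P(xi | c)
    scores[cls] = p

# (4) Normalize and (5) predict the class with the highest posterior
total = sum(scores.values())
posteriors = {cls: s / total for cls, s in scores.items()}
print(posteriors, "-> prediction:", max(posteriors, key=posteriors.get))
```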
Model Evaluation Metrics
(1) Confusion Matrix (2) TPR: True Positive Rate (3) TNR: True Negative Rate (4) FPR: False Positive Rate (5) FNR: False Negative Rate (6) Accuracy (7) Precision (8) Recall (9) F-measure
Steps to Build Decision Tree
(1) Create a frequency table for the class (2) Calculate the class entropy (3) Create a frequency table for each attribute (4) Calculate the entropy for each attribute (5) Calculate the information gain for each attribute (6) Select the attribute with the highest information gain to be the root node (7) Repeat steps 3-6 until all attributes are finished
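A Python sketch of the entropy and information-gain calculations these steps rely on (the toy outlook/play data is invented for illustration):

```python
import math
from collections import Counter

def entropy(labels):
    """E(T) = -SUM(Pi * log2(Pi)) over the class values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """IG = E(T) - E(T, x), where E(T, x) is the weighted entropy of the splits."""
    n = len(labels)
    split_entropy = 0.0
    for value in set(attribute_values):
        subset = [lbl for a, lbl in zip(attribute_values, labels) if a == value]
        split_entropy += (len(subset) / n) * entropy(subset)
    return entropy(labels) - split_entropy

# Toy example: how much does "Outlook" tell us about the class?
outlook = ["Sunny", "Sunny", "Rainy", "Rainy", "Sunny", "Rainy"]
play    = ["No",    "No",    "Yes",   "Yes",   "No",    "Yes"]
print(information_gain(outlook, play))  # 1.0: this split is perfectly informative
```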
Steps to implement KNN algorithm
(1) Decide on k value (k = number of nearest neighbors) (2) Calculate the distance between each training instance and the new instance (3) Arrange the resulting distance values in ascending order (4) Select the K smallest distance values (5) Classify the new instance based on the majority vote.
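A minimal Python sketch of these steps using the Euclidean distance (the 2D training points are made up):

```python
import math
from collections import Counter

def knn_classify(training, new_point, k=3):
    """training: list of (features, label). Classify new_point by majority vote."""
    # (2) Euclidean distance from the new instance to every training instance
    distances = [(math.dist(x, new_point), label) for x, label in training]
    # (3)+(4) sort ascending and keep the k smallest distances
    nearest = sorted(distances)[:k]
    # (5) majority vote among the k nearest neighbors
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2D example
training = [((1, 1), "A"), ((1, 2), "A"), ((6, 6), "B"), ((7, 7), "B"), ((2, 1), "A")]
print(knn_classify(training, (1.5, 1.5), k=3))  # -> "A"
```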
Steps for Association Rules
(1) Decide on the minimum support and minimum confidence (2) Create a table (iteration 1) with a list of itemsets where K = 1 (K is the number of items in the itemset) (3) From the list of 1-itemsets, extract a list of frequent itemsets whose support is >= minimum support. (4) From the list of frequent itemsets from the step above, generate a list of itemsets where K = 2 (iteration 2) (5) From the list of 2-itemsets, extract a list of frequent itemsets that satisfy the minimum support. (6) Keep repeating steps 4 & 5, increasing K by 1, until K equals the size of the largest transaction. (7) Generate the set of rules and select the rules that satisfy the minimum confidence
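A rough Python sketch of the support and confidence checks behind these iterations (the transactions are made up, and for brevity it only grows itemsets to K = 2 rather than looping up to the largest transaction):

```python
from itertools import combinations

transactions = [{"bread", "milk"}, {"bread", "butter"},
                {"bread", "milk", "butter"}, {"milk", "butter"}]
min_support, min_confidence = 0.5, 0.6
N = len(transactions)

def support(itemset):
    """support = freq(itemset) / N"""
    return sum(itemset <= t for t in transactions) / N

# Iterations 1 and 2: keep only itemsets that satisfy the minimum support
items = {i for t in transactions for i in t}
frequent = [frozenset(c) for k in (1, 2)
            for c in combinations(items, k) if support(set(c)) >= min_support]

# Generate rules X -> Y from the frequent itemsets and keep those meeting min confidence
for itemset in (s for s in frequent if len(s) > 1):
    for x in itemset:
        lhs, rhs = itemset - {x}, {x}
        confidence = support(itemset) / support(lhs)   # freq(x, y) / freq(x)
        if confidence >= min_confidence:
            print(set(lhs), "->", rhs, f"(conf={confidence:.2f})")
```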
Steps for k-means
(1) Decide on the number of clusters (2) Randomly assign the data points to each cluster (3) Calculate the centroid point of each cluster (4) Calculate the within-cluster squared error for each cluster (5) Calculate the total error (6) Calculate the distance between the centroid of each cluster and each data point (7) Evaluate the Euclidean distance of each data point and re-assign the data point to the cluster with the smallest distance (8) Repeat from step 4 until the results are stable
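A simplified Python sketch of these steps (toy 2D points; a fixed number of passes stands in for the stability check):

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Simplified k-means following the steps above: random assignment,
    centroid update, reassignment by distance, repeated for a few passes."""
    rng = random.Random(seed)
    assignment = [rng.randrange(k) for _ in points]            # (2) random assignment
    for _ in range(iterations):
        # (3) centroid of each cluster = (SUM(x)/n, SUM(y)/n)
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c] or [rng.choice(points)]
            centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
        # (6)+(7) reassign each point to the cluster with the closest centroid
        assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
    # (4)+(5) within-cluster squared errors summed into the total error
    total_error = sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(points, assignment))
    return assignment, centroids, total_error

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(points, k=2))
```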
Row
*For more than 2 classes in matrix* FN for each class is the sum of the values in the ___ of the class, excluding the TP.
Column
*For more than 2 classes in matrix* FP for each class is the sum of the values in the ___ of the class excluding the TP.
columns, rows, excluding
*For more than 2 classes in matrix* TN is the sum of all ______ and _____ (excluding/including) the column and row of the class.
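A small Python sketch of how these row/column sums fall out of an N x N confusion matrix (the 3x3 counts are invented; rows = actual class, columns = predicted class):

```python
# rows = actual class, columns = predicted class (made-up 3x3 counts)
cm = [[5, 1, 0],
      [2, 6, 1],
      [0, 1, 7]]
total = sum(sum(row) for row in cm)

for c in range(len(cm)):
    tp = cm[c][c]
    fn = sum(cm[c]) - tp                    # row of the class, excluding the TP
    fp = sum(row[c] for row in cm) - tp     # column of the class, excluding the TP
    tn = total - tp - fn - fp               # everything outside the class's row and column
    print(f"class {c}: TP={tp} FN={fn} FP={fp} TN={tn}")
```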
Decision Tree
-Another classification algorithm that builds a model in a tree form -Applies a top-down approach: it keeps breaking the dataset down into smaller and smaller subsets that share the same characteristics while it builds the tree -Decisions are based on an algorithm called ID3
ROC Curve
-Area under the curve = AUC (1 is perfect) -The higher the area under the curve, the better the model
T-test
-Confidence level = 95% (95% chance of being right) -Alpha = .05 (5% chance of being wrong)
Market segmentation
-How to cluster customers/people -Ex: based on spending, shopping habits, zip code, gender, age, etc.
Clustering applications
-Market segmentation -Outlier identification
Cost Sensitive Analysis
-Needed to reduce False Positives (FP) or False Negatives (FN), or to work with a biased dataset. -A data mining learning technique that takes the misclassification cost into consideration
Apriori algorithm
-One of the earliest and best-known association rule algorithms -A rule-based algorithm -Mines the itemsets from transactions to generate rules -The two main concepts are support and confidence
Clustering (unsupervised)
-The process/technique of dividing a large dataset into small groups where each group shares similar characteristics. -The distance between data points of the same group is small and the distance between points of different groups is large.
Advantages of Decision Tree
-Very fast -Works with both categorical & numerical data -Can handle high-dimensional datasets (datasets with a high number of attributes)
K-nearest neighbor (KNN)
-Non-probabilistic algorithm -Considered an instance-based algorithm -Learning = storing all the training data -Defined as a lazy algorithm -Based on the idea that instances that are close to each other share similar characteristics -Decisions are based on a similarity measure called a distance function.
Minkowski Distance
= (SUM |xi - yi|^p)^(1/p) -p = the order of the distance (p = 2 gives the Euclidean distance)
Centroid
= (SUM(x)/n, SUM(y)/n) -x, y = the x coordinate and the y coordinate of each point -n = total number of points in the cluster
Squared error
= SUM(x - m)^2 -x = the data point -m = the centroid -The best number of clusters is found at the elbow of the total error graph (where the line flattens)
Euclidean Distance
= square root of SUM(xi-yi)^2
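A quick Python sketch of the two distance functions (Euclidean is just the Minkowski distance with p = 2):

```python
def minkowski(x, y, p):
    """(SUM |xi - yi|^p)^(1/p)"""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def euclidean(x, y):
    """sqrt(SUM (xi - yi)^2) -- Minkowski with p = 2"""
    return minkowski(x, y, 2)

print(minkowski((1, 2), (4, 6), p=1))  # Manhattan distance: 7.0
print(euclidean((1, 2), (4, 6)))       # 5.0
```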
Naive Bayes
A probabilistic classification algorithm -Based on Bayes' Theorem -Main requirement: assumes independence among all the attributes (none of them are highly correlated) -To use it, you must understand posterior probability
Posterior probability
A probability that is a revision of a prior probability using additional information. -P(c|xi) = (P(xi|c) * P(c)) / P(xi) -c = class -xi = attribute (predictor) -P(xi|c) = the conditional probability of the attribute given the class -P(xi) = the prior probability of the attribute -P(c) = the prior probability of the class
K-fold cross validation
A validation technique we can use while building the model -Applied to the training dataset -k = number of folds or subsets -Ex: training dataset of 100 instances with k = 10 --> Subset 1: 10, Subset 2: 10, ..., Subset 10: 10
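A minimal sketch of 10-fold cross validation, assuming scikit-learn is available (the iris data and Naive Bayes model are just stand-ins):

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)   # stand-in training dataset
model = GaussianNB()

# k = 10 folds: train on 9 subsets, validate on the remaining 1, rotate 10 times
scores = cross_val_score(model, X, y, cv=10)
print(scores.mean(), "+/-", scores.std())
```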
Solve underfitting
Adding more attributes will solve this
ID3
Algorithm based on 2 main concepts: -Entropy -Information gain
x-axis = FPR (False Positive Rate), y-axis = TPR (True Positive Rate)
Axes of ROC curve?
Disadvantage of Decision Tree
Cannot detect the correlation among attributes
Detect outliers
Common use of unsupervised clustering?
Node
Component of the tree: -Attributes -The root node holds the best predictor, the attribute with the lowest entropy / highest information gain
Leaves
Component of the tree: -Output or class or target (all the same)
Branches
Component of the tree: -Represent attribute criteria
K-means
Considered a partition clustering algorithm -Uses the squared error criterion -Main idea: randomly partition the data points into a pre-defined number of clusters -Keeps reassigning the data points to the best cluster based on the squared error and a distance function.
Association Rule
Discovers the items that most frequently occur together -Unsupervised learning -Used to discover the hidden relationships in the dataset -Those relationships are presented in the format of a set of rules -Those rules state that when an event occurs, another event occurs with a certain probability
precision, minimize
Give more weight to ___ to ___ the number of False Positives (FP). Ex: spam email & intrusion detection systems (IDS).
recall, minimize
Give more weight to ___ to ____ the number of False Negatives (FN). -Ex: bank fraud transactions & medical diagnosis situations.
Model evaluation
How well the model fits your data
Confidence
Measures the certainty of the rule. The rule will be measured against a minimum threshold value. = freq(x, y) / freq(x)
Support
Measures the usefulness of the rule. The itemset will be evaluated against a minimum threshold value. = freq. (x, y) / N
Entropy (E)
Part of ID3: -a measure of the level of uncertainty -measures the level of impurity -measures the level of homogeneity -measures how much information the attribute can contribute to the class -0 = 100% homogeneous -1 = data is divided into 2 equal sets -For the class: E(T) = -SUM(Pi * log2(Pi)) -For an attribute: E(T,x) = SUM(P(c) * E(c))
Information Gain
Part of ID3: -the lower the entropy, the higher the information gain, and vice versa. -IG = E(T) - E(T,x)
Confusion Matrix
Shows the number of the correct and incorrect instances predicted by the model compared to the actual data -TP -TN -FN -FP
True Negative (TN)
The instance is predicted as negative and it is actually negative.
False Negative (FN)
The instance is predicted as negative and it is actually positive.
False Positive (FP)
The instance is predicted as positive but it is actually negative.
True Positive (TP)
The instance is predicted as positive and it is actually positive.
Overfitting
Very specific model (Type II Error) -Training data set = very good -Test data set = bad
True Negative Rate (TNR)
The ratio of the correctly classified negative instances to the total actual negative instances. = TN / (TN + FP)
Accuracy
The percentage of correctly predicted instances. This is a very good measure of the quality of the model, but it requires symmetry between the False Positives (FP) and False Negatives (FN). If this can't be achieved, we should use other measurements besides this. = (TN + TP) / (TN + TP + FN + FP) OR = sum of the TP for all classes / total of all values <-- for an NxN matrix
True Positive Rate (TPR)
The ratio of the correctly classified positive instances to the actual total positive instances. = TP/ (TP + FN)
Precision
The ratio of the correctly classified positive instances to the total predicted positive instances. = TP/(TP+FP)
Recall
The ratio of the correctly predicted positive instances to the total actual positive instances. We want to minimize False Negatives (FN) with this. = TP/(TP+FN) (same equation as True Positive Rate aka TPR)
False Positive Rate (FPR & TYPE I ERROR)
The ratio of the negative instances that are classified as positive to the total actual negative instances. = FP/ (FP + TN)
False Negative Rate (FNR & TYPE II ERROR)
The ratio of the positive instances that are misclassified as negative to the total positive instances. = FN/ (FN +TP)
F-measure
The harmonic mean of precision & recall. = 2(P*R)/(P+R)
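A small Python sketch tying the metrics above to a 2x2 confusion matrix (the counts are made up):

```python
tp, fn, fp, tn = 40, 10, 5, 45   # made-up 2x2 confusion matrix counts

accuracy  = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)       # = TPR = sensitivity
tnr       = tn / (tn + fp)       # specificity
fpr       = fp / (fp + tn)       # Type I error rate
fnr       = fn / (fn + tp)       # Type II error rate
f_measure = 2 * (precision * recall) / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} "
      f"F={f_measure:.2f} TNR={tnr:.2f} FPR={fpr:.2f} FNR={fnr:.2f}")
```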
Solve overfitting
These two steps will solve this: (1) Select the significant attributes (2) use k-fold cross validation
Euclidean distance, mimkowski distance
Two main distance functions
Underfitting
Very generic model (Type I Error) -Training data set = bad -Test data set = bad
Sensitivity
Another name for Recall / the True Positive Rate (TPR)
Specificity
Another name for the True Negative Rate (TNR)
increase, decreasing
When you ____ precision, it comes at the cost of ____ recall, and vice versa
FP > FN
Which would you like the classifier to have?