Data Analysis Final Prep
F-score
2rp / (r + p), where r is recall and p is precision, or equivalently 2TP / (2TP + FP + FN). Biased towards all except TN (TN does not appear in the formula).
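A minimal sketch showing that the two forms of the F-score agree. The function names and the counts used below are illustrative assumptions, not part of the notes.

```python
def f_score_from_counts(tp: int, fp: int, fn: int) -> float:
    """F-score from confusion-matrix counts; TN is never used."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f_score_from_pr(p: float, r: float) -> float:
    """F-score as the harmonic mean of precision p and recall r."""
    return 2 * r * p / (r + p) if (r + p) else 0.0


# Hypothetical counts: 8 true positives, 2 false positives, 4 false negatives.
tp, fp, fn = 8, 2, 4
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f_score_from_counts(tp, fp, fn), f_score_from_pr(precision, recall))
```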
Naive Bayes Classifier
A family of algorithms that consider every feature as independent of any other feature, given the class. Easy to implement and often shows good results, but depends on the assumption of class-conditional independence, so accuracy is lost when that assumption does not hold.
AdaBoost
A form of boosting algorithm. If any intermediate round produces an error rate higher than 50%, the weights are reverted back to 1/n and sampling is repeated.
activation function
A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous layer and then generates and passes an output value (typically nonlinear) to the next layer.
Rigid Model
A learning model with high bias but low variance.
Flexible model
A learning model with low bias but high variance
Linear Classifier
A linear function of the features classifies an object. The method extends to polynomial classifiers of degree p, which also assign weights to concatenations (combinations) of multiple features.
Laplacian Smoothing
A smoothing method used by the Naive Bayesian Classifier to avoid the 0-probability problem. Simply adds 1 to the count of each case.
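A small sketch of add-one smoothing, assuming the attribute has a known, finite set of possible values. The function name, the `domain` parameter, and the sample counts are illustrative assumptions.

```python
from collections import Counter


def smoothed_probs(values, domain):
    """Estimate P(value) with add-one (Laplacian) smoothing.

    Every value in `domain` gets one extra count, so a value that
    never appears in `values` still has a nonzero probability.
    """
    counts = Counter(values)
    total = len(values) + len(domain)
    return {v: (counts[v] + 1) / total for v in domain}


# Hypothetical class with 1000 tuples: "low" never occurs.
observed = ["medium"] * 990 + ["high"] * 10
print(smoothed_probs(observed, ["low", "medium", "high"]))
```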
Confusion Matrix
A table of numbers showing how often a given stimulus is reported when another stimulus was shown; the table typically provides strong evidence for reliance on features in vision. In classification, it shows how often each actual class is predicted as each class, giving TP, TN, FP, and FN.
Gini index (CART)
A type of attribute selection measure that deals with the reduction of impurity. Attribute with the largest impurity reduction is chosen. Is biased towards attributes with a large number of values.
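A minimal sketch of impurity reduction with the Gini index, assuming the usual definition Gini(D) = 1 - sum of squared class proportions. The helper names and the toy labels are illustrative assumptions.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def gini_reduction(labels, partitions):
    """Impurity reduction achieved by splitting `labels` into `partitions`."""
    n = len(labels)
    weighted = sum(len(part) / n * gini(part) for part in partitions)
    return gini(labels) - weighted


# Toy data: a split that separates the two classes perfectly.
parent = ["a", "a", "b", "b"]
print(gini(parent), gini_reduction(parent, [["a", "a"], ["b", "b"]]))
```

The attribute whose split gives the largest reduction would be chosen.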
Information Gain
A type of attribute selection measure. Calculates the amount of information gained, i.e. the amount of uncertainty reduced, when tuples are branched on a certain attribute. When branching to construct a decision tree, run this on each attribute and select the one with the highest value. Is biased towards attributes with a large number of values.
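A short sketch of information gain, assuming entropy in bits (log base 2) as the uncertainty measure. The function names and toy labels are illustrative assumptions.

```python
import math


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * math.log2(labels.count(c) / n)
        for c in set(labels)
    )


def info_gain(labels, partitions):
    """Uncertainty removed by splitting `labels` into `partitions`."""
    n = len(labels)
    remaining = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(labels) - remaining


parent = ["y", "y", "n", "n"]
print(info_gain(parent, [["y", "y"], ["n", "n"]]))  # perfect split
print(info_gain(parent, [["y", "n"], ["y", "n"]]))  # uninformative split
```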
Gain Ratio
A type of attribute selection measure. Overcomes the bias of information gain by normalizing it. Is biased towards unbalanced splits in which one partition is much smaller than others.
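A sketch of the normalization, assuming gain ratio = information gain / split information, where split information is the entropy of the partition sizes. The names and toy data are illustrative assumptions.

```python
import math


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * math.log2(labels.count(c) / n)
        for c in set(labels)
    )


def gain_ratio(labels, partitions):
    """Information gain divided by split information."""
    n = len(labels)
    gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in partitions)
    split_info = -sum(len(p) / n * math.log2(len(p) / n) for p in partitions)
    return gain / split_info if split_info else 0.0


parent = ["y", "y", "n", "n"]
# Both splits have information gain 1.0, but the four-way split is penalized.
print(gain_ratio(parent, [["y", "y"], ["n", "n"]]))
print(gain_ratio(parent, [["y"], ["y"], ["n"], ["n"]]))
```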
Stratified Sampling
A type of probability sampling in which the population is divided into groups with a common attribute and a random sample is chosen within each group. Required especially for skewed data.
Random Sampling
A type of sampling where every entry has equal probability of selection.
Learning Evaluation
Assessing the given model based on accuracy, speed, and interpretability.
Discretization
Also known as binning. Means to data transformation. Converts continuous values to discrete values. May result in the loss of information.
Rule-based ordering
Also known as decision list. Means to resolve conflict for rule-based classifiers. Rules are organized into one long priority list, according to some measure of rule quality or by experts.
z-score normalization
Also known as standardization. One of the normalization approaches that utilizes z-score. v' = (v - mean) / st.dev, where v is the original value.
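A minimal sketch of z-score normalization. It assumes the population standard deviation (`statistics.pstdev`); the notes do not say whether population or sample standard deviation is intended. The function name and data are illustrative.

```python
import statistics


def z_scores(values):
    """Return (v - mean) / st.dev for every value."""
    mean = statistics.mean(values)
    # Assumption: population standard deviation.
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


print(z_scores([2, 4, 4, 4, 5, 5, 7, 9]))
```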
Model Selection
Also known as validation methods. Select the simplest model with the highest validation assessment. Approaches include holdout test, k-fold cross-validation, and leave-one-out test.
Specificity
Alternative measure usually used in medical domains: TN / (TN + FP)
Sensitivity
Alternative measure usually used in medical domains: TP / (TP + FN), the same as recall
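A small sketch computing sensitivity and specificity side by side. The counts describe a hypothetical screening test and are not taken from the notes.

```python
def sensitivity(tp: int, fn: int) -> float:
    """TP / (TP + FN): share of actual positives that are detected."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """TN / (TN + FP): share of actual negatives that are cleared."""
    return tn / (tn + fp)


# Hypothetical test: 100 sick patients (90 detected), 100 healthy (80 cleared).
print(sensitivity(tp=90, fn=10), specificity(tn=80, fp=20))
```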
leave-one-out test
An example of model selection or validation method. Special case of k-fold cross-validation where k is the number of tuples. Most stable among validation methods but most inefficient.
Holdout Test
An example of model selection or validation method. Simply partitions the dataset at random into a training set and a test set. Good for huge data sets.
k-fold-cross-validation
An example of model selection or validation method. Randomly partition the data into k mutually exclusive subsets, each of approximately equal size. Therefore each data tuple is used k times: once for validation, k-1 times for training. Folds can also be stratified according to class distribution (stratified cross-validation).
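A minimal sketch of the partitioning step only (no model is trained). It works on tuple indices, is not stratified, and the function name and `seed` parameter are illustrative assumptions.

```python
import random


def k_fold_splits(n: int, k: int, seed: int = 0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k mutually exclusive folds.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation


for train, validation in k_fold_splits(n=10, k=5):
    print(sorted(validation), len(train))
```

Setting k equal to n gives the leave-one-out test.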
Area under the curve (AUC)
Another metric for evaluating classification performance. Ideal when 1, 0.5 when random guess.
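A sketch of one way to compute AUC without drawing the curve. It uses the rank interpretation: AUC is the probability that a randomly chosen positive gets a higher score than a randomly chosen negative, with ties counted as one half. The function name and the scores are illustrative assumptions.

```python
def auc(scores, labels):
    """AUC via pairwise comparison of positive and negative scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


print(auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # perfect ranking
print(auc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0]))  # no ranking information
```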
Irrelevant Attributes
Attributes that contain no useful data for the given data mining task.
Decision boundary
Border line between two neighboring regions of different classes
Receiver Operating Characteristic (ROC)
Curve that plots TP rate against FP rate. Performance of a classifier is represented as a point on this curve.
Data Preprocessing
Data Cleaning, Data Integration, Data Reduction, Data Transformation
Schema Integration
Data redundancy when data entry refers to the same attribute but with a different name.
Entity Identification
Data redundancy when data entry refers to the same entity but with different values.
Unsupervised Discretization
Discretization approach. Binning can be done by equal-width binning, equal-frequency binning, or K-means clustering. The mean or median of each bin is then used as its value.
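A minimal sketch of equal-width binning followed by replacing each value with its bin mean. The function names and data are illustrative assumptions; equal-frequency binning and K-means are not shown.

```python
def equal_width_bins(values, k):
    """Assign each value a bin index 0..k-1 using k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


def smooth_by_bin_means(values, k):
    """Replace every value with the mean of the bin it falls into."""
    bins = equal_width_bins(values, k)
    means = {}
    for b in set(bins):
        members = [v for v, vb in zip(values, bins) if vb == b]
        means[b] = sum(members) / len(members)
    return [means[b] for b in bins]


data = [0, 5, 10, 15, 20, 30]
print(equal_width_bins(data, 3))
print(smooth_by_bin_means(data, 3))
```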
Supervised Discretization
Discretization approach. Utilizes entropy heuristic.
Redundant Attributes
Duplicate information is contained in one or more other attributes.
Generalization error
Error due to the overfitting of the model. Model is too complex to generalize to unseen data. = Variance + Bias^2
Classification
Example of supervised learning. Target is categorical or finite-discrete. ex) medical diagnosis, fraud detection, credit loan approval
Regression
Example of supervised learning. target value is continuous. ex) weather forecast, stock price prediction
FP rate
FP/N or 1 - specificity
Test
Final procedure of supervised learning. Uses the model to make an estimate on the test set of data.
Data Interpretability
How easily data can be understood.
Supervised Learning
Learning where the labels of training data are given. Examples are classification and regression.
Unsupervised Learning
Learning where the labels of training data are not given. Examples are clustering.
Prepruning
Means to avoid overfitting a decision tree. Halts tree construction early: a node is not split if splitting it would result in the goodness measure falling below a threshold.
Postpruning
Means to avoid overfitting a decision tree. Iteratively removes branches from a fully grown tree to decide the best pruned tree. There exist two methods to do this: 1. Subtree replacement: merge from leaves to the root. 2. Subtree raising: raise the subtree of the most popular branch to replace its parent.
Data Smoothing
Means to data cleaning. Includes averaging, binning, clustering. Can solve noisy data issue.
Outlier Detection
Means to data cleaning. Also known as anomaly detection. Can solve noisy data issue.
Normalization
Means to data transformation. Changes the values of the numeric column into a common scale.
Sequential Covering Algorithm
Means to extract rules from data. Extracts rules directly from training data. Rules are learned one at a time, and every time a rule is learned, the tuples covered by the rule are removed. This procedure is repeated until termination conditions are met. Termination conditions include running out of training examples or the quality of a rule falling below a specified threshold. Examples include FOIL, RIPPER.
Size ordering
Means to resolve conflict for rule-based classifiers. Assigns the highest priority to the triggering rule that has the toughest requirements.
Class-based ordering
Means to resolve conflict for rule-based classifiers. Decreasing order of prevalence or misclassification cost per class.
Subtree Replacement
Method of postpruning that merges from leaves to the root.
Subtree Raising
Method of postpruning that raises the subtree of the most popular branch to replace its parent.
Underfitting
Model is way too simple to be used for any prediction.
Perceptron Algorithm
Modeled after neurons in the brain. It has m input values (which correspond with the m features of the examples in the training set) and one output value. Each input value x_i is multiplied by a weight-factor w_i. If the sum of the products between the feature value and weight-factor is larger than zero, the perceptron is activated and 'fires' a signal (+1). Otherwise it is not activated. The weighted sum between the input-values and the weight-values, can mathematically be determined with the scalar-product <w, x>. To produce the behaviour of 'firing' a signal (+1) we can use the signum function sgn(); it maps the output to +1 if the input is positive, and it maps the output to -1 if the input is negative. Thus, it can mathematically be modeled by the function y = sgn(b+ <w, x>). Here b is the bias, i.e. the default value when all feature values are zero. Nondeterministic and limited to linear functions.
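A minimal sketch of y = sgn(b + <w, x>) with the classic mistake-driven update rule (add y * x to w and y to b whenever an example is misclassified). The update rule, the learning rate `lr`, the epoch limit, and the AND data set are assumptions added for illustration; the notes only define the prediction function.

```python
def sgn(z):
    """+1 if z is positive, otherwise -1 (zero does not 'fire')."""
    return 1 if z > 0 else -1


def predict(w, b, x):
    """y = sgn(b + <w, x>)."""
    return sgn(b + sum(wi * xi for wi, xi in zip(w, x)))


def train_perceptron(X, y, epochs=50, lr=1.0):
    """Update weights on each misclassified example until none remain."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if predict(w, b, xi) != yi:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
                errors += 1
        if errors == 0:
            break
    return w, b


# Logical AND is linearly separable, so training converges.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, -1, 1]
w, b = train_perceptron(X, y)
print([predict(w, b, xi) for xi in X])
```

A target such as XOR is not linearly separable, so this loop would never reach zero errors on it.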
Heuristic Search
One approach to Attribute Subset Selection. 1. Run a significance test on each attribute independently. 2. Forward method: repeatedly add the best attribute. 3. Backward method: repeatedly remove the worst attribute.
Random Forests
One method to generate an ensemble, specially designed for decision tree classifiers. Randomly selects a small subset of features and determines the split among them. 1. Very good performance (speed, accuracy) when abundant data is available. 2. Use bootstrapping/bagging to initialize each tree with different data. 3. Use only a subset of variables at each node. 4. Use a random optimization criterion at each node. 5. Project features on a random different manifold at each node.
Boosting
One method to generate an ensemble. An iterative procedure to adaptively change the distribution of training data by focusing more on previously misclassified records. Previously misclassified records are given bigger weights so that they are more likely to be chosen again in subsequent rounds.
Bagging
One method to generate an ensemble. Sampling is done with replacement according to a uniform probability distribution, and a classifier is built on each bootstrap sample D. D will contain approx. 63% of the original data. Performance depends on the stability of the base classifier. Is less susceptible to model overfitting when applied to noisy data.
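A short sketch of where the 63% figure comes from. With n draws with replacement, a given tuple is missed with probability (1 - 1/n)^n, which approaches 1/e, so about 1 - 1/e = 0.632 of the distinct tuples appear in a bootstrap sample. The function names are illustrative assumptions.

```python
import math
import random


def expected_unique_fraction(n: int) -> float:
    """Probability that a given tuple appears in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n


def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]


print(expected_unique_fraction(1000), 1 - math.exp(-1))

rng = random.Random(0)
data = list(range(1000))
sample = bootstrap_sample(data, rng)
print(len(set(sample)) / len(data))  # empirical fraction, close to 0.63
```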
Feature Extraction
One of the approaches used in dimensionality reduction.
Attribute Subset Selection
One of the approaches used in dimensionality reduction. Detects and removes redundant or irrelevant attributes.
Min-max normalization
One of the normalization approaches. v' = ((v - min) / (max - min)) * (new_max - new_min) + new_min, where v is the original value.
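A minimal sketch of min-max normalization. The function name, the default target range [0, 1], and the sample values are illustrative assumptions.

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly from [min, max] to [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the range is zero")
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


# Hypothetical incomes: the smallest maps to 0, the largest to 1.
print(min_max([12000, 73600, 98000]))
```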
decimal scaling normalization
One of the normalization approaches. Similar to the min-max approach but much more interpretable. v' = v / 10^j, where v is the original value and j is the smallest integer such that max(|v'|) < 1.
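A small sketch of decimal scaling, assuming j is chosen as the smallest non-negative integer that brings every scaled value strictly below 1 in absolute value. The function name and data are illustrative.

```python
def decimal_scale(values):
    """Divide by the smallest power of ten that makes max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(decimal_scale([-986, 917]))
```

Only the decimal point moves, so the original value can be read straight off the scaled one.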
Dimensionality Reduction
One of the strategies used in data reduction. Reduces the columns of data. Feature extraction and attribute subset selection are some possible approaches of this strategy.
Numerosity Reduction
One of the strategies used in data reduction. Reduces the row of data. Sampling and clustering are some approaches.
Data Cleaning
Part of Data Preprocessing: Required due to incomplete/noisy/inconsistent/intentional data. Examples include ignoring, automatic or manual removal, smoothing, resolving inconsistencies, etc.
Data Reduction
Part of Data Preprocessing: Required to reduce running time or to improve mining quality. Includes strategies such as numerosity reduction, dimensionality reduction, and data compression.
Data Integration
Part of Data Preprocessing: Required usually due to redundancy of data. Can be handled using manual methods and correlation analysis.
Data Transformation
Part of Data Preprocessing: Usually done to improve learning speed and accuracy by producing more compact and interpretable results. Includes strategies such as normalization and discretization (binning).
Training
Procedure of supervised learning. The process of dividing a given data into training, validation and test sets and constructing a model that can be utilized in supervised learning.
Validation
Procedure of supervised learning. Utilizes the validation set data to evaluate the accuracy of the model created in the previous procedure. Also known as "tuning the model".
Data Timeliness
Refers to data being recorded at or near the time of the event or observation.
Data Trustness
Refers to how much data can be trusted. Varies by source of data and obtaining means.
Sampling with replacement
Sampling method that doesn't remove selected item from selection pool.
Sampling without replacement
Sampling method that removes the selected item from selection pool. Can avoid additional or redundant selection.
Filter Method
Scheme independent heuristic search method. Selection of attributes before learning. 1. Use a single-attribute evaluator with ranking to remove irrelevant attributes. 2. Combine attribute subset evaluator to remove additional irrelevant and redundant attributes.
Wrapper Method
Scheme-dependent heuristic search method. Uses ML methods along with heuristic search to select attributes. Is slower than the filter method.
Ensembles
Set of base classifiers built from the training data. Makes predictions by taking a majority vote of the base classifiers. Conditions: 1. each base classifier must be independent 2. base classifier error rate must be lower than random guessing (0.5)
Data Quality Measurement
Some measurements that can define a data set's quality. Examples are 1. Accuracy 2. Completeness 3. Consistency 4. Timeliness 5. Trustness 6. Interpretability
Support Vector Machine (SVM)
Supervised learning tool that seeks a dividing hyperplane in any number of dimensions. Can be used for regression or classification. Utilizes the kernel trick to learn nonlinear functions. Learns in batch mode.
Recall
TP / (TP + FN) biased towards TP and FN
Precision
TP / (TP + FP) biased towards TP and FP
TP rate
TP/P or sensitivity
Learning model
The classifier created by supervised learning. Can be represented as a mathematical function, decision tree, rules, etc.
Bias
The error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.
Kernel trick
The key element of kernel methods is that they do not actually map features to the high-dimensional space. Instead, the kernel function returns the inner product (a similarity measure) between elements in that space. This implicit mapping is called the kernel trick.
Bayes' Theorem
The probability of an event occurring based upon other event probabilities. P(H | X) = P(X | H) * P(H) / P(X)
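A minimal worked example of the formula. All probabilities below are hypothetical numbers chosen for illustration: H is "patient has the disease" and X is "test is positive".

```python
def posterior(prior_h: float, likelihood: float, evidence: float) -> float:
    """P(H | X) = P(X | H) * P(H) / P(X)."""
    return likelihood * prior_h / evidence


# Hypothetical values, not from the notes.
p_h = 0.01              # P(H): prior probability of the disease
p_x_given_h = 0.90      # P(X | H): positive test when sick
p_x_given_not_h = 0.05  # P(X | not H): positive test when healthy

# P(X) by the law of total probability.
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)

print(posterior(p_h, p_x_given_h, p_x))  # roughly 0.154
```

Even with a fairly accurate test, the posterior stays low because the prior is small.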
Sampling
To select a representative subset of data.
Accuracy
True predictions / Total = (TP + TN) / (TP + FP + TN + FN). May not be the best measurement since the classes may be unbalanced.
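A small sketch of why accuracy can mislead on unbalanced data. The counts describe a hypothetical classifier that labels every tuple as negative.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """(TP + TN) / (TP + FP + TN + FN)."""
    return (tp + tn) / (tp + fp + tn + fn)


def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN)."""
    return tp / (tp + fn)


# Hypothetical data: 990 negatives, 10 positives, everything predicted negative.
tp, tn, fp, fn = 0, 990, 0, 10
print(accuracy(tp, tn, fp, fn))  # high accuracy
print(recall(tp, fn))            # yet no positive is ever found
```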
Decision Tree
Type of classification model. Each node represents an attribute. Each branch represents an attribute value. Each leaf represents a result. Pros: fast learning and prediction, and an easy-to-interpret model. Cons: information loss due to the need to discretize attributes. Not as accurate as other classification methods.
Correlation Analysis
Used to identify data redundancy. For nominal data, the chi-square test is used. For numeric data, the Pearson correlation coefficient is used.
Data Accuracy
Whether data is accurate.
Data Completeness
Whether data is complete. Low when data is not recorded or is unavailable.
Data Consistency
Whether data is uniform overall.
Rule-based Classifier
a classification algorithm that uses a set of IF-THEN statements for classification. If more than one rule is triggered, conflict resolution is necessary. Size ordering, class-based ordering, and rule-based ordering are some means to resolve conflict.
Artificial Neural Networks (ANNs)
are networks that learn and are capable of performing tasks that are difficult with conventional computers, such as playing chess, recognizing patterns in faces and objects, and filtering spam e-mail.
Overfitting
fitting the model too exactly to the data may limit the model's ability to generalize to unseen data.
KNN
k-nearest neighbor. Given training data and some similarity function, the algorithm finds the k nearest neighbors to the current example based on the similarity function and classifies the current example based on the majority decision of the k nearest neighbors. Has a theoretical upper bound on its error rate: asymptotically, the nearest-neighbor error is at most 2 * (the error of the best rule you can come up with, i.e. the Bayes error). All instances become points in n-dimensional space, where n is the number of features of each instance. If the target is continuous rather than categorical, the algorithm can output a prediction based on the mean value of the k nearest neighbors. If the k value is too small, overfitting. If the k value is too large, underfitting.
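A minimal sketch of majority-vote KNN, assuming Euclidean distance as the (dis)similarity function. The function name and the toy points are illustrative assumptions. `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; points are numeric tuples.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [
    ((0, 0), "a"), ((0, 1), "a"),
    ((5, 5), "b"), ((6, 5), "b"), ((5, 6), "b"),
]
print(knn_classify(train, (0.2, 0.1), k=3))
print(knn_classify(train, (5.0, 5.2), k=3))
```

With k equal to the size of the training set, every query would receive the overall majority class, which illustrates the underfitting end of the trade-off.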
Variance
the amount by which the learning model and its results would change if we estimated it using a different training set.
margin
the distance from the decision boundary to the nearest point. Goodness of a decision boundary in SVM is when this value is maximized.
posterior probability
the probability that a hypothesis is true after consideration of the evidence P(H | X)