Data Analysis Final Prep


F-score

F = 2rp / (r + p) = 2TP / (2TP + FP + FN), where r is recall and p is precision. Biased towards everything except TN.
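
A minimal sketch computing precision, recall, and F1 from raw confusion-matrix counts (the toy counts are made up for illustration):

```python
def f1_score(tp, fp, fn):
    # Precision and recall from confusion-matrix counts; TN never appears.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Example: 40 true positives, 10 false positives, 20 false negatives.
print(f1_score(40, 10, 20))  # 0.7272... == 2*40 / (2*40 + 10 + 20)
```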

Naive Bayes Classifier

A family of algorithms that treat every feature as independent of every other feature. Easy to implement and gives good results, but it depends on the assumption that the class conditionals are independent, which rarely holds and can therefore cause a loss of accuracy.
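
A minimal sketch of the idea using scikit-learn's GaussianNB (assuming scikit-learn is available; the tiny dataset is made up):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two numeric features, binary class labels (toy data).
X = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 3.7]])
y = np.array([0, 0, 1, 1])

model = GaussianNB().fit(X, y)             # each feature modeled independently per class
print(model.predict([[1.1, 2.0]]))         # -> [0]
print(model.predict_proba([[1.1, 2.0]]))   # posterior P(class | features)
```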

AdaBoost

A boosting algorithm. If any intermediate round produces an error rate higher than 50%, the weights are reset to 1/n and sampling is repeated.

activation function

A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous layer and then generates and passes an output value (typically nonlinear) to the next layer.
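
A minimal sketch of two common activation functions applied to a weighted sum (NumPy; the input values are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Weighted sum of the previous layer's outputs, then a nonlinearity.
inputs  = np.array([0.5, -1.2, 3.0])
weights = np.array([0.4, 0.1, 0.2])
bias    = 0.05
z = weights @ inputs + bias
print(relu(z), sigmoid(z))
```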

Rigid Model

A learning model with high bias but low variance.

Flexible model

A learning model with low bias but high variance

Linear Classifier

Classifies an object using a linear function of its features. Using this method, one can also define polynomial classifiers that weight combinations (concatenations) of multiple features up to degree p.

Laplacian Smoothing

A smoothing method used by the Naive Bayes classifier to avoid the zero-probability problem. Simply adds 1 to each count.
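
A minimal sketch of add-one (Laplace) smoothing for estimating P(feature value | class) from counts (the toy counts are made up):

```python
def smoothed_prob(count, class_total, num_values):
    # Add 1 to every count so no conditional probability is ever zero.
    return (count + 1) / (class_total + num_values)

# Feature with 3 possible values, one of which never occurs with this class.
counts, class_total = [0, 7, 3], 10
print([smoothed_prob(c, class_total, len(counts)) for c in counts])
# [1/13, 8/13, 4/13] -- the zero count becomes a small nonzero probability
```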

Confusion Matrix

A table of numbers showing how often each class is predicted when another class is actually present. For binary classification it contains the counts TP, TN, FP, and FN, from which most evaluation measures are computed.

Gini index (CART)

A type of attribute selection measure that deals with the reduction of impurity. The attribute with the largest impurity reduction is chosen. Biased towards attributes with a large number of values.
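
A minimal sketch of Gini impurity and the impurity reduction for a binary split (the toy labels are illustrative):

```python
def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_reduction(parent, left, right):
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted  # the attribute with the largest reduction wins

parent = ['yes'] * 5 + ['no'] * 5
print(gini_reduction(parent, ['yes'] * 4 + ['no'], ['yes'] + ['no'] * 4))  # 0.18
```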

Information Gain

A type of attribute selection measure. Calculates the reduction in uncertainty (entropy) obtained by splitting the tuples on a given attribute. When branching to construct a decision tree, compute it for each attribute and select the one with the highest value. Biased towards attributes with a large number of values.
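
A minimal sketch of entropy and information gain for a candidate split (the toy labels are illustrative):

```python
import math

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_gain(parent, partitions):
    n = len(parent)
    remainder = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(parent) - remainder  # pick the attribute with the highest gain

parent = ['yes'] * 9 + ['no'] * 5
print(info_gain(parent, [['yes'] * 6 + ['no'] * 2, ['yes'] * 3 + ['no'] * 3]))
```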

Gain Ratio

A type of attribute selection measure. Overcomes the bias of information gain by normalizing it. Biased towards unbalanced splits in which one partition is much smaller than the others.

Stratified Sampling

A type of probability sampling in which the population is divided into groups with a common attribute and a random sample is chosen within each group. Required especially for skewed data.

Random Sampling

A type of sampling where every entry has equal probability of selection.

Learning Evaluation

Assessing the given model based on accuracy, speed, and interpretability.

Discretization

Also known as binning. A means of data transformation. Converts continuous values to discrete values. May result in loss of information.

Rule-based ordering

Also known as a decision list. A means of conflict resolution for rule-based classifiers. Rules are organized into one long priority list according to some measure of rule quality or by experts.

z-score normalization

Also known as standardization. One of the normalization approaches; uses the z-score: v' = (v - mean) / std.dev.
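
A minimal sketch of z-score standardization with NumPy (the values are illustrative):

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])
z = (values - values.mean()) / values.std()   # v' = (v - mean) / std.dev
print(z)  # mean ~0 and std ~1 after the transformation
```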

Model Selection

Also known as validation methods. Select the simplest model with the highest validation assessment. Approaches include holdout test, k-fold cross-validation, and leave-one-out test.

Specificity

An alternative measure, usually used in medical domains: TN / (TN + FP).

Sensitivity

An alternative measure, usually used in medical domains: TP / (TP + FN). Same as recall.

leave-one-out test

An example of a model selection (validation) method. A special case of k-fold cross-validation where k is the number of tuples. The most stable of the validation methods but the most inefficient.

Holdout Test

An example of a model selection (validation) method. Simply partitions the dataset at random into training and validation sets. Good for huge data sets.

k-fold-cross-validation

An example of a model selection (validation) method. Randomly partitions the data into k mutually exclusive subsets (folds), each of approximately equal size. Each data point is thus used k times: once for validation and k-1 times for training. Folds can also be stratified according to class distribution (stratified cross-validation).
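
A minimal sketch of stratified k-fold cross-validation using scikit-learn (assuming scikit-learn is available; the dataset and classifier are just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # stratified folds
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
print(scores, scores.mean())  # each tuple is used for validation exactly once
```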

Area under the curve (AUC)

Another metric for evaluating classification performance. Equal to 1 for an ideal classifier and 0.5 for random guessing.
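
A minimal sketch computing AUC from true labels and predicted scores with scikit-learn (assuming scikit-learn is available; the labels and scores are made up):

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]   # classifier's probability of class 1
print(roc_auc_score(y_true, y_score))       # 1.0 is ideal, 0.5 is random guessing
```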

Irrelevant Attributes

Attributes that contain no useful data for the given data mining task.

Decision boundary

Border line between two neighboring regions of different classes

Receiver Operating Characteristic (ROC)

Curve that plots against TP rate and FP rate. Performance of a classifier is represented as a point on this curve.

Data Preprocessing

Data cleaning, data integration, data reduction, and data transformation.

Schema Integration

Data redundancy when data entry refers to the same attribute but with a different name.

Entity Identification

Data redundancy when data entry refers to the same entity but with different values.

Unsupervised Discretization

A discretization approach. Binning can be done by equal-width binning, equal-frequency binning, or k-means clustering; the mean or median of each bin is then used as the value.

Supervised Discretization

A discretization approach that utilizes an entropy heuristic.

Redundant Attributes

Attributes whose information is duplicated in one or more other attributes.

Generalization error

Error due to overfitting of the model: the model is too complex to generalize to unseen data. Generalization error = variance + bias^2.

Classification

Example of supervised learning. Target is categorical or finite-discrete. ex) medical diagnosis, fraud detection, credit loan approval

Regression

Example of supervised learning. target value is continuous. ex) weather forecast, stock price prediction

FP rate

FP/N or 1 - specificity

Test

Final procedure of supervised learning. Uses the model to make predictions on the test set of data.

Data Interpretability

How easily data can be understood.

Supervised Learning

Learning where the labels of training data are given. Examples are classification and regression.

Unsupervised Learning

Learning where the labels of training data are not given. An example is clustering.

Prepruning

A means of avoiding overfitting in a decision tree. Halts tree construction early, e.g., by not splitting a node when the split would result in the goodness measure falling below a threshold.

Postpruning

A means of avoiding overfitting in a decision tree. Iteratively removes branches from a fully grown tree to find the best pruned tree. Two methods exist: 1. Subtree replacement: merge from the leaves toward the root. 2. Subtree raising: raise the subtree of the most popular branch to replace its parent.

Data Smoothing

A means of data cleaning. Includes averaging, binning, and clustering. Can address the noisy-data issue.

Outlier Detection

A means of data cleaning. Also known as anomaly detection. Can address the noisy-data issue.

Normalization

A means of data transformation. Rescales the values of a numeric column to a common scale.

Sequential Covering Algorithm

A means of extracting rules directly from training data. Rules are learned one at a time; each time a rule is learned, the tuples covered by the rule are removed. This procedure is repeated until a termination condition is met, such as running out of training examples or the quality of a rule falling below a specified threshold. Examples include FOIL and RIPPER.

Size ordering

A means of conflict resolution for rule-based classifiers. Assigns the highest priority to the triggering rule that has the toughest requirements.

Class-based ordering

A means of conflict resolution for rule-based classifiers. Rules are ordered by decreasing prevalence or misclassification cost per class.

Subtree Replacement

Method of postpruning that merges from leaves to the root.

Subtree Raising

Method of postpruning that raises the subtree of the most popular branch to replace its parent.

Underfitting

Model is way too simple to be used for any prediction.

Perceptron Algorithm

Modeled after neurons in the brain. It has m input values (which correspond with the m features of the examples in the training set) and one output value. Each input value x_i is multiplied by a weight-factor w_i. If the sum of the products between the feature value and weight-factor is larger than zero, the perceptron is activated and 'fires' a signal (+1). Otherwise it is not activated. The weighted sum between the input-values and the weight-values, can mathematically be determined with the scalar-product <w, x>. To produce the behaviour of 'firing' a signal (+1) we can use the signum function sgn(); it maps the output to +1 if the input is positive, and it maps the output to -1 if the input is negative. Thus, it can mathematically be modeled by the function y = sgn(b+ <w, x>). Here b is the bias, i.e. the default value when all feature values are zero. Nondeterministic and limited to linear functions.
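
A minimal sketch of the perceptron update rule described above (NumPy; the linearly separable toy data is made up):

```python
import numpy as np

def sgn(z):
    return 1 if z > 0 else -1

def train_perceptron(X, y, epochs=20):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            if sgn(b + w @ x_i) != y_i:   # misclassified: nudge w and b toward y_i
                w += y_i * x_i
                b += y_i
    return w, b

# Linearly separable toy data with labels in {-1, +1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print([sgn(b + w @ x) for x in X])  # should reproduce the training labels
```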

Heuristic Search

One approach to attribute subset selection. 1. Run a significance test on each attribute independently. 2. Forward method: repeatedly add the best attribute. 3. Backward method: repeatedly remove the worst attribute.

Random Forests

One method to generate an ensemble, specially designed for decision tree classifiers. Randomly selects a small subset of features and determines the best split among them. 1. Very good performance (speed, accuracy) when abundant data is available. 2. Uses bootstrapping/bagging to initialize each tree with different data. 3. Uses only a subset of variables at each node. 4. Uses a random optimization criterion at each node. 5. Projects features onto a different random manifold at each node.
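
A minimal sketch using scikit-learn's RandomForestClassifier (assuming scikit-learn is available; the dataset and parameter choices are just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Each tree sees a bootstrap sample and considers a random subset of features per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_tr, y_tr)
print(forest.score(X_te, y_te))
```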

Boosting

One method to generate an ensemble. An iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records: wrongly classified records are given bigger weights so that they are more likely to be chosen again in subsequent rounds.

Bagging

One method to generate an ensemble. Sampling is done according to a uniform probability distribution and a classifier is built on each bootstrap sample D; D contains approximately 63% of the original data. Performance depends on the stability of the base classifier. Less susceptible to model overfitting when applied to noisy data.

Feature Extraction

One of the approaches used in dimensionality reduction.

Attribute Subset Selection

One of the approaches used in dimensionality reduction. Detects and removes redundant or irrelevant attributes.

Min-max normalization

One of the normalization approaches: v' = ((v - min) / (max - min)) * (new_max - new_min) + new_min.
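
A minimal sketch of min-max normalization to a new range (NumPy; the values are illustrative):

```python
import numpy as np

def min_max(values, new_min=0.0, new_max=1.0):
    v_min, v_max = values.min(), values.max()
    return (values - v_min) / (v_max - v_min) * (new_max - new_min) + new_min

print(min_max(np.array([200.0, 300.0, 400.0, 600.0, 1000.0])))  # maps into [0, 1]
```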

decimal scaling normalization

One of the normalization approaches. Similar to the min-max approach but more interpretable: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1.

Dimensionality Reduction

One of the strategies used in data reduction. Reduces the number of columns (attributes). Feature extraction and attribute subset selection are possible approaches of this strategy.

Numerosity Reduction

One of the strategies used in data reduction. Reduces the number of rows (tuples). Sampling and clustering are some approaches.

Data Cleaning

Part of Data Preprocessing: Required due to incomplete/noisy/inconsistent/intentional data. Techniques include ignoring tuples, automatic or manual removal, smoothing, resolving inconsistencies, etc.

Data Reduction

Part of Data Preprocessing: Required to reduce running time or to improve mining quality. Includes strategies such as numerosity reduction, dimensionality reduction, and data compression.

Data Integration

Part of Data Preprocessing: Required usually due to redundancy of data. Can be handled using manual methods and correlation analysis.

Data Transformation

Part of Data Preprocessing: Usually done to improve learning speed and accuracy by producing more compact and interpretable results. Includes strategies such as normalization and discretization (binning).

Training

Procedure of supervised learning. The process of dividing the given data into training, validation, and test sets and constructing a model from the training set.

Validation

Procedure of supervised learning. Utilizes the validation set to evaluate the accuracy of the model created in the previous procedure. Also known as "tuning the model".

Data Timeliness

Refers to data being recorded at or near the time of the event or observation.

Data Trustness

Refers to how much the data can be trusted. Varies with the source of the data and the means by which it was obtained.

Sampling with replacement

Sampling method that does not remove a selected item from the selection pool, so the same item can be chosen more than once.

Sampling without replacement

Sampling method that removes the selected item from the selection pool, avoiding redundant selection of the same item.

Filter Method

A scheme-independent heuristic search method: attributes are selected before learning. 1. Use a single-attribute evaluator with ranking to remove irrelevant attributes. 2. Combine with an attribute subset evaluator to remove additional irrelevant and redundant attributes.

Wrapper Method

A scheme-dependent heuristic search method. Uses ML methods along with heuristic search to select attributes. Slower than the filter method.

Ensembles

A set of base classifiers built from the training data. Makes predictions by taking a majority vote of the base classifiers. Conditions: 1. each base classifier must be independent; 2. each base classifier's error rate must be lower than random guessing (0.5).

Data Quality Measurement

Some measurements that can define a data set's quality. Examples are 1. Accuracy 2. Completeness 3. Consistency 4. Timeliness 5. Trustness 6. Interpretability

Support Vector Machine (SVM)

A supervised learning tool that seeks a separating hyperplane in any number of dimensions. Can be used for regression or classification. Utilizes the kernel trick to learn nonlinear functions. Learns in batch mode.

Recall

TP / (TP + FN) biased towards TP and FN

Precision

TP / (TP + FP) biased towards TP and FP

TP rate

TP/P or sensitivity

Learning model

The classifier created by supervised learning. Can be represented as a mathematical function, a decision tree, rules, etc.

Bias

The error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.

Kernel trick

The key element of kernel methods is that they do not actually map features into this space; instead they compute the inner product (similarity) between elements as if they were in this space. This implicit mapping is called the kernel trick.

Bayes' Theorem

The probability of an event occurring based upon other event probabilities: P(H | X) = P(X | H) * P(H) / P(X).
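
A minimal worked example of the formula (the prior and likelihood values are made up):

```python
# P(H)   : prior probability of the hypothesis
# P(X|H) : likelihood of the evidence given the hypothesis
# P(X)   : total probability of the evidence
p_h, p_x_given_h, p_x_given_not_h = 0.01, 0.9, 0.05
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # ~0.154: the posterior after seeing the evidence
```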

Sampling

To select a representative subset of data.

Accuracy

True predictions / total = (TP + TN) / (TP + FP + TN + FN). May not be the best measure when the class distribution is unbalanced.

Decision Tree

Type of classification model. Each internal node represents an attribute, each branch represents an attribute value, and each leaf represents a result. Pros: fast learning and prediction, easy-to-interpret model. Cons: information loss due to the need to discretize continuous attributes; not as accurate as some other classification methods.

Correlation Analysis

Used to identify data redundancy. For nominal data, the chi-square test is used; for numeric data, the Pearson correlation coefficient is used.

Data Accuracy

Whether data is accurate.

Data Completeness

Whether data is complete. Low when data is not recorded or unavailable.

Data Consistency

Whether data is uniform overall.

Rule-based Classifier

a classification algorithm that uses a set of IF-THEN rules for classification. If more than one rule is triggered, conflict resolution is necessary; size ordering, class-based ordering, and rule-based ordering are some means of resolving conflicts.

Artificial Neural Networks (ANNs)

are networks that learn and are capable of performing tasks that are difficult for conventional computers, such as playing chess, recognizing patterns in faces and objects, and filtering spam e-mail.

Overfitting

fitting the model too exactly to the training data, which may limit the model's ability to generalize to unseen data.

KNN

k-nearest neighbor. Given training data and some similarity function, the algorithm finds the k nearest neighbors of the current example according to the similarity function and classifies it by the majority decision of those neighbors. Its error rate has a theoretical upper bound of twice that of the best possible rule. All instances become points in n-dimensional space, where n is the number of features of each instance. If the target is continuous rather than binary, the algorithm can output the mean value of the k nearest neighbors. If k is too small, overfitting; if k is too large, underfitting.
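
A minimal sketch of k-NN classification with Euclidean distance as the (dis)similarity function (the toy data is made up):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Distance from x to every training point, then majority vote of the k closest.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y_train = np.array(['A', 'A', 'B', 'B', 'B'])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9])))  # -> 'A'
```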

Variance

the amount by which the learned model and its results would change if we estimated it using a different training set.

margin

the distance from the decision boundary to the nearest point. In SVM, the best decision boundary is the one that maximizes this value.

posterior probability

the probability that a hypothesis is true after consideration of the evidence P(H | X)

