Machine Learning
Brett Lantz
"The practice of machine learning involves matching the characteristics of the input data to the biases of available approaches."
Sparsity
The degree to which values do not exist for each feature across all observations; the complement of density.
Duplicate Data
Occurs when data is repeated in the dataset. Note that two or more objects may be identical with respect to their features but still represent different objects.
Class
The attribute or feature that is described by the other features within an instance.
Within Cluster Sum of Squares
WCSS. The sum of squared distances between each instance and the centroid of its assigned cluster; this measure decreases as k increases.
Tom Mitchell
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks T as measured by P, improves with experience E."
Elbow Method
An approach for selecting the value of k that involves plotting the within-cluster sum of squares against various k values and looking for the "elbow" where the decrease levels off.
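A minimal sketch of the elbow method, assuming scikit-learn and matplotlib are available; the toy data and the range of k values are illustrative, and KMeans's inertia_ attribute serves as the within-cluster sum of squares:

```python
# Elbow-method sketch: plot WCSS (KMeans.inertia_) against k and look for the
# point where the decrease levels off.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # toy data

k_values = range(1, 11)
wcss = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss.append(model.inertia_)  # within-cluster sum of squares for this k

plt.plot(list(k_values), wcss, marker="o")
plt.xlabel("k")
plt.ylabel("Within-cluster sum of squares")
plt.show()
```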
Itemset
A collection of one or more items (e.g., {beer, bread, diaper, milk, eggs}).
Model Selection
A cyclical process of selecting, training, and evaluating models.
Voronoi Diagram
A diagram that uses the perpendicular bisectors of segments between points to outline, for each point, the region of the plane closest to it.
Non-Parametric Model
A learning model suited to situations where you have a lot of data and no prior knowledge, and where you don't want to worry too much about choosing exactly the right features.
Parametric Model
A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples). No matter how much data you throw at a parametric model, it won't change its mind about how many parameters it needs.
Density
The degree to which values exist for each feature across all observations; the complement of sparsity.
k-Means Clustering
A method that takes in items and assigns each item to one of k clusters such that the differences within a cluster are minimized while the differences between clusters are maximized.
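A minimal from-scratch sketch of the two alternating k-means steps (assign each item to its nearest centroid, then recompute each centroid as the mean of its assigned items), assuming numpy; the data and parameters are illustrative only:

```python
# From-scratch k-means sketch: alternate between assigning each point to its
# nearest centroid and recomputing each centroid as the mean of its points.
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iters):
        # Assignment step: distance from every point to every centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).normal(size=(200, 2))  # made-up 2-D data
labels, centroids = kmeans(X, k=3)
```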
Feature
A property or characteristic of an instance, which can be either discrete or continuous.
Box Plot
A visualization technique that answers whether a feature is significant, how its location compares across subgroups, whether variation differs between subgroups, and whether outliers are present.
Histogram
A visualization technique that can answer questions regarding data distribution, location, spread, skew, and outliers.
Odds Plot
A visualization technique that answers whether a feature is significant, how feature values affect the probability of occurrence, and whether there is a threshold for the effect.
Scatter Plot
A visualization technique that answers whether a feature is significant, how features interact, and whether there are outliers in the data.
Apriori Principle
Allows us to reduce the number of candidate itemsets by pruning the itemset lattice. If we identify an itemset as being infrequent, then its supersets should not be generated/tested.
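A minimal sketch of Apriori-style level-wise generation over a made-up transaction list; the support threshold is arbitrary, and a full implementation would also prune candidates that contain any infrequent subset:

```python
# Apriori-principle sketch: count support level by level and extend only the
# itemsets already found to be frequent.
from itertools import combinations

transactions = [  # made-up market-basket data
    {"beer", "bread", "milk"},
    {"beer", "diaper"},
    {"bread", "milk", "eggs"},
    {"beer", "bread", "diaper", "milk"},
]
min_support = 0.5

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted({item for t in transactions for item in t})
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
level = 2
while frequent:
    print(f"frequent {level - 1}-itemsets:", [set(s) for s in frequent])
    # Candidates come only from joining frequent itemsets of the previous level;
    # a full implementation would also drop candidates with any infrequent subset.
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == level}
    frequent = [c for c in candidates if support(c) >= min_support]
    level += 1
```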
Clustering
An unsupervised machine learning task that automatically divides data into groups based on similarity. Items inside a cluster should be similar while items outside the cluster should be different.
Inductive Learning Hypothesis
Any hypothesis found to approximate the target function well over a large set of training examples will also approximate the target function well over other unobserved examples.
Missing Data
Arises due to changes in data collection methods, human error, combining various datasets, human bias, etc. The key is to know how and why data is missing, as well as to understand that missing values can have meaning.
Bias Variance Tradeoff
As model complexity increases, bias decreases but variance increases.
Market Basket Analysis
As the amount of customer retail transactional data grows over time, machine learning is increasingly being applied to the data in order to detect patterns in purchasing behavior.
Discrete Features
Attributes measured in categorical form, which typically take only a reasonably small set of values (e.g., clothing size, customer satisfaction, etc.).
Continuous Features
Attributes usually measured in the form of integer or real numbers (e.g., temperature, weight, height, age, etc.).
Aggregation
Combining two or more objects into a single object.
LHS
Condition that needs to be met to trigger the association rule.
Proximity-Based Techniques
Define a proximity measure between instances, with outliers being distant from most other instances.
Density-Based Techniques
Define outliers as instances that have a local density significantly less than that of neighbors.
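A minimal sketch of a density-based technique using scikit-learn's LocalOutlierFactor; the data and the n_neighbors setting are illustrative only:

```python
# Density-based outlier sketch: LocalOutlierFactor marks points whose local
# density is much lower than that of their neighbors with a label of -1.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(100, 2)),          # dense "normal" cloud
    np.array([[8.0, 8.0], [9.0, -7.0]]),      # two far-away, low-density points
])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                   # -1 = outlier, 1 = inlier
print("outlier indices:", np.where(labels == -1)[0])
```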
Robust
Describes an algorithm that can produce acceptable results even when noise is present.
Resolution
Describes the grain of the data. Data with too much of this could have patterns blurred by noise, but too little will not reveal interesting patterns.
Variance
Errors caused by this are made as a result of the sampling of the training data.
Bias
Errors caused by this are made as a result of the specified learning algorithm.
Hot-Deck Imputation
Fill in the missing value using similar instances from the same dataset.
Support Count
Frequency of an itemset.
Frequent Itemset Generation
Generate all itemsets whose support is above the minimum support threshold. This approach is very expensive.
Cluster Sampling
Group or segment data based on similarities, then randomly select from each group. This method is efficient but typically not optimal.
Clustering
Group the data and use properties of the groups to represent the instances constituting those clusters, which smoothes the data.
Binning
Grouping the ordered data to smooth the data, either by means or boundaries.
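A minimal sketch of smoothing by bin means, assuming pandas; the values and the number of bins are made up:

```python
# Binning sketch: group ordered values into equal-width bins, then smooth by
# replacing each value with the mean of its bin.
import pandas as pd

values = pd.Series([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = pd.cut(values, bins=3)                       # three equal-width bins
smoothed = values.groupby(bins).transform("mean")   # smoothing by bin means
print(pd.DataFrame({"value": values, "bin": bins, "smoothed": smoothed}))
```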
Support
How frequently a rule occurs in a dataset, measured as a fraction.
Data Collection
Identifying and gathering the data that will be used in the learning process to generate actionable information.
Mean Imputation
Imputation method that results in underestimation of the standard deviation and pulls correlation estimates toward zero.
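A minimal pandas sketch of mean imputation on a made-up column, illustrating the shrinking standard deviation noted above:

```python
# Mean-imputation sketch: fill missing values with the column mean. Note that the
# imputed column's standard deviation is smaller than that of the observed values.
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, 31, np.nan, 45, np.nan, 52]})
df["age_imputed"] = df["age"].fillna(df["age"].mean())
print(df)
print("std before:", df["age"].std(), "std after:", df["age_imputed"].std())
```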
Random Imputation
Imputation method that tends to ignore useful information from other features.
Distribution-Based Imputation
Imputation method which assigns a value for the missing data based on the probability distribution of the non-missing data.
Predictive Imputation
Imputation method which builds a regressor or classifier to predict the missing data. Consider the missing feature as the dependent variable and the rest of the features as the independent variables.
Match-Based Imputation
Imputation method which imputes based on similar instances with non-missing values.
Cold-Deck Imputation
Impute missing values using similar instances from another dataset.
Indicator Variable
Imputes using a constant or indicator value (e.g., "unknown," "N/A," or "-1").
Normalization
Intends to make an entire set of values have a particular property; often, this involves scaling data to fall within a small, specified range.
Predictive Models
Involved with predicting a value based on other values in the dataset. The process of training a predictive model is known as supervised learning.
Descriptive Models
Involved with summarizing or grouping data in new and interesting ways. In these types of models, no single feature is more important than any other.
Imputation
Involves systematically filling in missing data using a substituted value.
Summary Statistics
Numbers that describe the properties of the features of the data.
Inconsistent Data
Occurs as a result of discrepancies in the data. Resolving this issue often requires additional or redundant information.
Imbalanced Data
Occurs when classes have very unequal frequencies, including with data that has more than two classes.
Outlier
Occurs when data has characteristics that are drastically different from most of the other data, or when values of a feature are unusual with respect to the typical values for that feature. The definition depends on hidden assumptions regarding the data structure and the applied detection method.
Dimensionality
Represents the number of features in the dataset.
Inexplicable
Rules that defy rational explanation and do not suggest a clear course of action.
Actionable
Rules that provide clear and useful insights that can be acted upon.
Trivial
Rules that provide insight that is already well-known by those familiar with the domain.
Stratified Random Sampling
Sample from the data such that the original known class distribution is maintained. The new sample reflects the original distribution and works for imbalanced data, but it is often inefficient.
Systematic Sampling
Select instances from an ordered sampling window. Then, select every kth element from the window. Be careful: this method risks interaction with regularities in the data.
Simple Random Sampling
Shuffle the data and then select examples. This method avoids regularities in the data but may be problematic with imbalanced data.
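A minimal sketch contrasting simple random and stratified sampling on a made-up imbalanced dataset, assuming pandas and scikit-learn are available:

```python
# Sampling sketch: simple random sampling vs. stratified sampling on a made-up
# imbalanced dataset (90% class 0, 10% class 1).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(1000), "label": [0] * 900 + [1] * 100})

simple = df.sample(frac=0.1, random_state=0)         # simple random sample
_, stratified = train_test_split(                    # sample that preserves class ratios
    df, test_size=0.1, stratify=df["label"], random_state=0)

print("simple random:", simple["label"].value_counts(normalize=True).to_dict())
print("stratified:  ", stratified["label"].value_counts(normalize=True).to_dict())
```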
Regression
Smoothing data by fitting it to a regression function.
Association Rules
Specify patterns found in the relationship among items or itemsets. The goal is to discover a set of regularities or rules between occurrences of items in the dataset.
Machine Learning
The construction and usage of algorithms that learn from data. A machine is said to learn when it performs better on a task as it receives more information about the task.
Feature Construction
The creation of novel features from original feature data, done because sometimes original features are not suitable for some algorithms and because sometimes more useful features can be engineered from the original ones.
RHS
The expected result of meeting the condition in association rules.
Percentile
The feature value below which a given percentage of the observed instances fall; this statistic is often useful for continuous data.
Validation & Interpretation
The fifth step in the knowledge discovery process, which comes after Modeling.
Data Collection
The first step in the knowledge discovery process, which comes before Data Exploration.
Anti-Monotone Property of Support
The foundation for the Apriori principle: the support of an itemset never exceeds that of its subsets. Therefore, if a subset of an itemset is infrequent, then the itemset is infrequent.
Modeling
The fourth step in the knowledge discovery process, which comes after Data Preparation and before Validation & Interpretation.
Lift
The increased likelihood that a rule occurs in a dataset relative to its typical rate of occurrence. It is the confidence of the rule x → y divided by the support of the itemset containing only y.
Transaction
The itemset for an observation.
Curse of Dimensionality
The more attributes there are, the easier it is to build a model that fits the sample data but that is worthless as a predictor.
Confidence
The predictive power or accuracy of a rule. We calculate this as the support of the itemset containing both x and y, divided by the support of the itemset containing only x.
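A minimal worked example computing support, confidence, and lift for a single made-up rule, following the definitions above:

```python
# Worked example: support, confidence, and lift for the rule {bread} -> {milk}.
transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer"},
    {"milk", "diaper", "beer"},
    {"bread", "milk", "diaper"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

x, y = {"bread"}, {"milk"}
rule_support = support(x | y)              # support({bread, milk}) = 3/5 = 0.6
confidence = support(x | y) / support(x)   # 0.6 / 0.8 = 0.75
lift = confidence / support(y)             # 0.75 / 0.8 = 0.9375

print(rule_support, confidence, lift)
```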
Data Preparation
The process of making the data more suitable for data science methods and techniques.
Regression
The process of predicting a value based on previous observations.
Classification
The process of predicting whether a given observation belongs to a certain category or class.
Smoothing
The process of reducing noise in the data.
Deduplication
The process of removing duplicate entries.
Unsupervised Learning
The process of training a descriptive model.
Supervised Learning
The process of training a predictive model.
Z-Score Normalization
The process of transforming data to have an average of 0 and standard deviation of 1.
Brute Force Approach
The prohibitively expensive approach of listing all association rules, computing the support and confidence for every possible rule, and pruning the rules that don't meet the thresholds.
80%
The proportion of time generally spent on data collection, exploration, and preparation.
Noise
The random component of a measurement error, which is difficult to eliminate.
Data Exploration
The second step in the knowledge discovery process, which comes after Data Collection and before Data Preparation.
Instance
The thing to be classified, associated or clustered. An independent, individual example of the target concept; described by a set of attributes or features. A set of these is the input to the learning scheme.
Data Preparation
The third step in the knowledge discovery process, which comes after Data Exploration and before Modeling.
Data Exploration
This process involves describing the data, visualizing the data, analyzing the data, and understanding the data.
Sampling
This process is done because sometimes it is too expensive or time-consuming to use all of the available data to generate a model. The resulting subset should permit the construction of a model representative of a model generated from the entire dataset.
Decimal Scaling
Transform the data by moving the decimal point of the values of feature F. The number of places moved depends on the maximum absolute value of F.
Min-Max Normalization
Transform the data from measured units to a new interval from new_min to new_max.
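A minimal sketch of the z-score, min-max, and decimal-scaling transformations defined above, assuming numpy; the feature values are made up:

```python
# Normalization sketch: z-score, min-max, and decimal scaling for one feature.
import numpy as np

f = np.array([120.0, 250.0, 380.0, 560.0, 917.0])    # made-up feature values

z_score = (f - f.mean()) / f.std()                   # mean 0, standard deviation 1

new_min, new_max = 0.0, 1.0
min_max = (f - f.min()) / (f.max() - f.min()) * (new_max - new_min) + new_min

j = int(np.floor(np.log10(np.abs(f).max()))) + 1     # places so that max(|f'|) < 1
decimal_scaled = f / 10 ** j                         # here: divide by 1000

print(z_score, min_max, decimal_scaled, sep="\n")
```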
Discretization
Transformation of continuous data into discrete counterparts. This process is similar to binning, and we would do this because some algorithms require it, it can improve visualization, or it can reduce categories for features with many values.
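A minimal sketch of discretizing a continuous feature into categories, assuming pandas; the bin edges and labels are made up:

```python
# Discretization sketch: turn a continuous feature (age) into discrete categories.
import pandas as pd

ages = pd.Series([3, 17, 25, 34, 49, 61, 78])
age_group = pd.cut(ages, bins=[0, 18, 35, 60, 100],
                   labels=["child", "young adult", "adult", "senior"])
print(pd.DataFrame({"age": ages, "age_group": age_group}))
```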
Dummy Variables
Transformation of discrete features into a series of continuous features (usually with binary values). These are helpful because some algorithms only work with continuous features, it is a useful approach for dealing with missing data, and it is a necessary pre-step in dimensionality reduction.
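A minimal sketch of creating dummy variables with pandas; the feature and its values are made up, and dummy_na is shown as one way of flagging missing data:

```python
# Dummy-variable sketch: expand a discrete feature into binary indicator columns;
# dummy_na adds an extra column that flags missing values.
import pandas as pd

df = pd.DataFrame({"size": ["S", "M", None, "L", "S"]})
dummies = pd.get_dummies(df, columns=["size"], dummy_na=True)
print(dummies)
```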
