Machine Learning

Brett Lantz

"The practice of machine learning involves matching the characteristics of the input data to the biases of available approaches."

Sparsity

A measure of how empty the data is: the degree to which values do not exist for each feature across all observations (the complement of density).

Duplicate Data

Occurs when data is repeated in the dataset. Note that two or more objects may be identical with respect to their features yet still represent different real-world objects.

Class

The attribute or feature that is described by the other features within an instance.

Within Cluster Sum of Squares

WCSS. The sum of squared distances between each observation and the centroid of its assigned cluster; this measure decreases as we increase k.
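
As a rough illustration, WCSS can be computed by summing the squared distance from each point to the centroid of its assigned cluster. The NumPy sketch below assumes hypothetical arrays X (points), labels (integer cluster assignments), and centroids are already available.

    import numpy as np

    def wcss(X, labels, centroids):
        # sum of squared distances from each point to its assigned centroid
        diffs = X - centroids[labels]
        return np.sum(diffs ** 2)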

Tom Mitchell

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks T as measured by P, improves with experience E."

Elbow Method

An approach for selecting the value of k that involves plotting the within-cluster sum of squares against various k values and choosing the k at the bend ("elbow") of the curve.
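
A minimal sketch of the elbow method, assuming scikit-learn and matplotlib are available and using synthetic data from make_blobs; inertia_ is scikit-learn's name for the within-cluster sum of squares.

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    ks = range(1, 11)
    wcss_values = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
                   for k in ks]

    plt.plot(ks, wcss_values, marker="o")   # look for the "elbow" where the curve flattens
    plt.xlabel("k")
    plt.ylabel("WCSS")
    plt.show()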

Itemset

A collection of one or more items (e.g., {beer, bread, diaper, milk, eggs}).

Model Selection

A cyclical process of selecting, training, and evaluating models.

Voronoi Diagram

A diagram that uses the perpendicular bisectors of segments between seed points to outline, for each seed point, the region of the plane closest to it.

Non-Parametric Model

A learning model for when you have a lot of data and no prior knowledge, and when you don't want to worry too much about choosing the right features.

Parametric Model

A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples). No matter how much data you throw at a parametric model, it won't change its mind about how many parameters it needs.

Density

A measure of how complete the data is: the degree to which values exist for each feature across all observations (the complement of sparsity).

k-Means Clustering

A method that takes in items and assigns each item to one of k clusters such that the differences within a cluster are minimized while the differences between clusters are maximized.
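
A from-scratch sketch of the standard (Lloyd's) k-means loop, assuming a hypothetical NumPy array X of shape (n_samples, n_features) and that no cluster ever becomes empty; real implementations handle that case and use smarter initialization.

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # initialize centroids as k distinct points chosen at random
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # assignment step: each point joins the cluster of its nearest centroid
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # update step: each centroid moves to the mean of its assigned points
            # (this sketch assumes no cluster ever becomes empty)
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids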

Feature

A property or characteristic of an instance, which can be either discrete or continuous.

Box Plot

A visualization technique that answers whether a feature is significant, how location compares across subgroups, whether variation differs between subgroups, and whether outliers are present.

Histogram

A visualization technique that can answer questions regarding data distribution, location, spread, skew, and outliers.

Odds Plot

A visualization technique which answers whether a feature is significant, how feature values affect the probability of occurrence, and a threshold for the effect.

Scatter Plot

A visualization technique that answers whether a feature is significant, how features interact, and whether there are outliers in the data.

Apriori Principle

Allows us to reduce the number of candidate itemsets by pruning the itemset lattice. If we identify an itemset as being infrequent, then its supersets should not be generated/tested.
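<br/>
A toy sketch of apriori-style pruning on hypothetical transactions with a minimum support count of 2: because beer and eggs are infrequent on their own, no candidate pair containing them is ever generated or counted.

    from itertools import combinations

    transactions = [
        {"bread", "milk"},
        {"bread", "diaper", "beer"},
        {"milk", "diaper", "bread"},
        {"milk", "eggs"},
    ]
    min_count = 2

    # support counts for single items
    item_counts = {}
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1

    frequent_items = {i for i, c in item_counts.items() if c >= min_count}

    # apriori pruning: build 2-item candidates only from frequent single items
    candidates = [set(pair) for pair in combinations(sorted(frequent_items), 2)]
    frequent_pairs = [c for c in candidates
                      if sum(c <= t for t in transactions) >= min_count]
    print(frequent_pairs)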

Clustering

An unsupervised machine learning task that automatically divides data into groups based on similarity. Items inside a cluster should be similar while items outside the cluster should be different.

Inductive Learning Hypothesis

Any hypothesis found to approximate the target function well over a large set of training examples will also approximate the target function well over other unobserved examples.

Missing Data

Arises due to changes in data collection methods, human error, combining various datasets, human bias, etc. The key is to know how and why data is missing, as well as to understand that missing values can have meaning.

Bias Variance Tradeoff

As complexity increases, bias decreases but variance increases.

Market Basket Analysis

The application of machine learning to the growing volume of customer retail transactional data in order to detect patterns in purchasing behavior.

Discrete Features

Attributes measured in categorical form, which typically have only a limited set of reasonable values (e.g., clothing size, customer satisfaction, etc.).

Continuous Features

Attributes usually measured in the form of integer or real numbers (e.g., temperature, weight, height, age, etc.).

Aggregation

Combining two or more objects into a single object.

LHS

Condition that needs to be met to trigger the association rule.

Proximity-Based Techniques

Define a proximity measure between instances, with outliers being distant from most other instances.

Density-Based Techniques

Define outliers as instances that have a local density significantly less than that of neighbors.

Robust

Describes an algorithm that can produce acceptable results even when noise is present.

Resolution

Describes the grain of the data. Data with too much of this could have patterns blurred by noise, but too little will not reveal interesting patterns.

Variance

Errors caused by this are made as a result of the sampling of the training data.

Bias

Errors caused by this are made as a result of the specified learning algorithm.

Hot-Deck Imputation

Fill in the missing value using similar instances from the same dataset.

Support Count

Frequency of an itemset.

Frequent Itemset Generation

Generate all itemsets whose support is above the minimum support threshold. This step is computationally very expensive.

Cluster Sampling

Group or segment data based on similarities, then randomly select from each group. This method is efficient but typically not optimal.

Clustering

Group the data and use properties of the groups to represent the instances constituting those clusters, which smoothes the data.

Binning

Grouping the ordered data to smooth the data, either by means or boundaries.
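
A minimal pandas sketch of smoothing by bin means, using a hypothetical numeric series: pd.cut forms equal-width bins, and each value is replaced by the mean of its bin.

    import pandas as pd

    values = pd.Series([4, 8, 15, 16, 23, 42, 45, 50, 61])
    bins = pd.cut(values, bins=3)                        # equal-width bins
    smoothed = values.groupby(bins).transform("mean")    # smooth by bin means
    print(smoothed)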

Support

How frequently a rule occurs in a dataset, measured as a fraction.
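
For example, with hypothetical numbers: if the itemset {bread, milk} appears in 3 of 5 transactions, then support({bread, milk}) = 3/5 = 0.6.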

Data Collection

Identifying and gathering the data that will be used in the learning process to generate actionable information.

Mean Imputation

Imputation method that fills in missing values with the mean of the observed values for that feature; it results in underestimation of the standard deviation and pulls correlation estimates toward zero.
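
A minimal pandas sketch, assuming a hypothetical column x with missing values:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0, np.nan]})
    df["x_imputed"] = df["x"].fillna(df["x"].mean())   # replace NaN with the column mean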

Random Imputation

Imputation method that fills in missing values with randomly selected observed values; it tends to ignore useful information from other features.

Distribution-Based Imputation

Imputation method that assigns a value for the missing data based on the probability distribution of the non-missing data.

Predictive Imputation

Imputation method which builds a regressor or classifier to predict the missing data. Consider the missing feature as the dependent variable and the rest of the features as the independent variables.

Match-Based Imputation

Imputation method which imputes based on similar instances with non-missing values.

Cold-Deck Imputation

Impute missing values using similar instances from another dataset.

Indicator Variable

Imputes using a constant or indicator value (e.g., "unknown," "N/A," or "-1").

Normalization

Intends to make an entire set of values have a particular property; often, this involves scaling data to fall within a small, specified range.

Predictive Models

Involved with predicting a value based on other values in the dataset. The process of training a predictive model is known as supervised learning.

Descriptive Models

Involved with summarizing or grouping data in new and interesting ways. In these types of models, no single feature is more important than any other.

Imputation

Involves systematically filling in missing data using a substituted value.

Summary Statistics

Numbers that describe the properties of the features of the data.

Inconsistent Data

Occurs as a result of discrepancies in the data. Resolving this issue often requires additional or redundant information.

Imbalanced Data

Occurs when classes have very unequal frequencies, including in data with more than two classes.

Outlier

Occurs when data has characteristics that are drastically different from most of the other data, or when values of a feature are unusual with respect to the typical values for that feature. The definition depends on hidden assumptions regarding the data structure and the applied detection method.

Dimensionality

Represents the number of features in the dataset.

Inexplicable

Rules that defy rational explanation and do not suggest a clear course of action.

Actionable

Rules that provide clear and useful insights that can be acted upon.

Trivial

Rules that provide insight that is already well-known by those familiar with the domain.

Stratified Random Sampling

Sample from the data such that the original known class distribution is maintained. The new sample reflects the original distribution and works for imbalanced data, but the method is often inefficient.
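
A minimal scikit-learn sketch, using a hypothetical imbalanced dataset; stratify=y keeps the original class proportions in the sampled subset.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=0)
    # stratify=y preserves the roughly 90/10 class ratio in the 30% sample
    X_sample, _, y_sample, _ = train_test_split(
        X, y, train_size=0.3, stratify=y, random_state=0
    )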

Systematic Sampling

Select instances from an ordered sampling window. Then, select every kth element from the window. Be careful: this method risks interaction with regularities in the data.

Simple Random Sampling

Shuffle the data and then select examples. This method avoids regularities in the data but may be problematic with imbalanced data.

Regression

Smoothing data by fitting it to a regression function.

Association Rules

Specify patterns found in the relationship among items or itemsets. The goal is to discover a set of regularities or rules between occurrences of items in the dataset.

Machine Learning

The construction and usage of algorithms that learn from data. A machine is said to learn when it performs better on a task as it receives more information about the task.

Feature Construction

The creation of novel features from original feature data, done because sometimes original features are not suitable for some algorithms and because sometimes more useful features can be engineered from the original ones.

RHS

The expected result of meeting the condition in association rules.

Percentile

The feature value below which a given percentage of the observed instances fall; this statistic is often useful for continuous data.

Validation & Interpretation

The fifth step in the knowledge discovery process, which comes after Modeling.

Data Collection

The first step in the knowledge discovery process, which comes before Data Exploration.

Anti-Monotone Property of Support

The foundation for the apriori principle: the support of an itemset never exceeds that of its subsets. Therefore, if a subset of an itemset is infrequent, then the itemset is infrequent.

Modeling

The fourth step in the knowledge discovery process, which comes after Data Preparation and before Validation & Interpretation.

Lift

The increased likelihood that a rule occurs in a dataset relative to its typical rate of occurrence. It is the confidence of the rule involving both x and y, divided by the support of the itemset containing only y.

Transaction

The itemset for an observation.

Curse of Dimensionality

The more attributes there are, the easier it is to build a model that fits the sample data but that is worthless as a predictor.

Confidence

The predictive power or accuracy of a rule. We calculate this as the support of the itemset containing both x and y, divided by the support of the itemset containing only x.
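
A toy sketch tying support, confidence, and lift together, using hypothetical transactions and the rule {bread} → {milk}:

    transactions = [
        {"bread", "milk"},
        {"bread", "diaper", "beer"},
        {"milk", "diaper", "bread"},
        {"milk", "eggs"},
        {"bread", "eggs"},
    ]
    n = len(transactions)

    def support(itemset):
        # fraction of transactions containing the itemset
        return sum(itemset <= t for t in transactions) / n

    support_xy = support({"bread", "milk"})          # 2/5 = 0.4
    confidence = support_xy / support({"bread"})     # 0.4 / 0.8 = 0.5
    lift = confidence / support({"milk"})            # 0.5 / 0.6 ≈ 0.83
    print(support_xy, confidence, lift)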

Data Preparation

The process of making the data more suitable for data science methods and techniques.

Regression

The process of predicting a value based on previous observations.

Classification

The process of predicting whether a given observation belongs to a certain category or class.

Smoothing

The process of reducing noise in the data.

Deduplication

The process of removing duplicate entries.

Unsupervised Learning

The process of training a descriptive model.

Supervised Learning

The process of training a predictive model.

Z-Score Normalization

The process of transforming data to have an average of 0 and standard deviation of 1.
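
A minimal NumPy sketch, using a hypothetical one-dimensional feature:

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
    z = (x - x.mean()) / x.std()   # z-scores: mean 0, standard deviation 1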

Brute Force Approach

The prohibitively expensive approach of listing all association rules, computing the support and confidence for every possible rule, and pruning the rules that don't meet the thresholds.

80%

The proportion of time generally spent on data collection, exploration, and preparation.

Noise

The random component of a measurement error, which is difficult to eliminate.

Data Exploration

The second step in the knowledge discovery process, which comes after Data Collection and before Data Preparation.

Instance

The thing to be classified, associated or clustered. An independent, individual example of the target concept; described by a set of attributes or features. A set of these is the input to the learning scheme.

Data Preparation

The third step in the knowledge discovery process, which comes after Data Exploration and before Modeling.

Data Exploration

This process involves describing the data, visualizing the data, analyzing the data, and understanding the data.

Sampling

This process is done because sometimes it is too expensive or time-consuming to use all of the available data to generate a model. The resulting subset should permit the construction of a model representative of one generated from the entire dataset.

Decimal Scaling

Transform the data by moving the decimal point of the values of feature F. The number of places moved depends on the maximum absolute value of F.
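
A minimal NumPy sketch, using hypothetical values of a feature F; j is chosen so that the scaled values for this example fall within (-1, 1).

    import numpy as np

    x = np.array([120.0, -450.0, 37.0, 980.0])
    j = int(np.ceil(np.log10(np.abs(x).max())))   # max |x| is 980, so j = 3
    x_scaled = x / (10 ** j)                      # values now lie in (-1, 1)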

Min-Max Normalization

Transform the data from measured units to a new interval from new_min to new_max.
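
A minimal NumPy sketch of the standard min-max formula, using a hypothetical feature rescaled to the interval [0, 1]:

    import numpy as np

    x = np.array([10.0, 20.0, 30.0, 50.0])
    new_min, new_max = 0.0, 1.0
    x_scaled = (x - x.min()) / (x.max() - x.min()) * (new_max - new_min) + new_min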

Discretization

Transformation of continuous data into discrete counterparts. This process is similar to binning, and we would do this because some algorithms require it, it can improve visualization, or it can reduce categories for features with many values.

Dummy Variables

Transformation of discrete features into a series of continuous features (usually with binary values). These are helpful because some algorithms only work with continuous features, it is a useful approach for dealing with missing data, and it is a necessary pre-step in dimensionality reduction.
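
A minimal pandas sketch, using a hypothetical categorical column size:

    import pandas as pd

    df = pd.DataFrame({"size": ["S", "M", "L", "M"]})
    dummies = pd.get_dummies(df, columns=["size"])   # one binary column per category
    print(dummies)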

