Machine Learning Quiz 3

Tom Mitchell Quote

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks T, as measured by P, improves with experience E."

Inductive Learning Hypothesis

"Any hypothesis found to approximate the target function well over a large set of training examples will also approximate the target function well over other unobserved examples."

Dealing with noisy data

*Smoothing* is the process of reducing noise in data. Methods: 1) Binning 2) Clustering 3) Regression

Descriptive Models

*Summarizing or grouping data in new and interesting ways.* In these types of models, no single feature is more important than any other. The process of training a descriptive model is known as unsupervised learning.

Data preparation

1) Cleaning the data 2) Transforming the data 3) Reducing the data. "Tidy data are all alike; every messy dataset is messy in its own way."

k-NN strengths

1) Simple and effective 2) makes no assumptions about the underlying data distribution 3) training phase is very fast

Dealing with Duplicate Data

1. Setting a similarity threshold to identify duplicates. 2. Defining elimination rules for dealing with duplicates.

Problems with Duplicate Data

A dataset may include completely or partially duplicated data. Be careful: occasionally, two or more instances may be identical with respect to their features but still represent different objects. Avoid combining instances that are similar but do not represent a single entity.

Entropy

A quantification of the level of randomness or disorder within a set of class values
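
A minimal sketch of how this could be computed in R; the entropy() helper and the toy label vectors are made up for illustration:

```r
# Entropy of a vector of class labels: -sum(p * log2(p)) over the observed classes
entropy <- function(labels) {
  p <- table(labels) / length(labels)   # class proportions
  -sum(p * log2(p))                     # 0 = perfectly pure, higher = more disorder
}

entropy(c("yes", "yes", "no", "no"))    # 1 bit: a maximally mixed two-class set
entropy(c("yes", "yes", "yes", "yes"))  # 0: a perfectly homogeneous set
```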

Nearest Neighbor

Approach that assigns a label to new data based on the label of existing similar examples.

Logistic Regression

Approach that assigns a label to new data based on the odds that the data belongs to a certain category

Decision Trees

Approach that assigns a label to new data by using a tree structure to determine the category that the data belongs to.

Clustering

Approach that automatically partitions data into groups based on similarity

Association Rules

Approach that builds rules that describe the co-occurrence of events in data.

Curse of Dimensionality

As the number of features (or complexity) increases, the amount of data needed to generalize accurately grows exponentially. If the size of the dataset is fixed, performance diminishes as complexity increases.

Discrete Features

Attributes measured in categorical form. They typically have a reasonably small set of possible values. Examples include: clothing size (small, medium, large), customer satisfaction (not happy, somewhat happy, very happy), etc.

Continuous Features

Attributes usually measured in the form of integer or real numbers. They have an infinite number of possible values between the lower and upper bounds. Examples include: temperature, height, weight, age, etc.

Prediction Errors

Bias error, variance error, irreducible error

Binning

Binning by means: replace each value with the mean of its bin. Binning by boundaries: replace each value with the closest bin boundary (the bin's minimum or maximum).

mean and median

Both mean and median are used to measure the location of a set of points

Entropy in R

The C50 package (the C5.0 algorithm) uses entropy as its split criterion.

Gini Impurity in R

CART (implemented in R by the rpart package) uses Gini impurity as its split criterion.
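
A minimal sketch of fitting both kinds of tree on the built-in iris data, assuming the C50 and rpart packages are installed:

```r
library(C50)    # C5.0 algorithm: entropy-based splits
library(rpart)  # CART: Gini-based splits by default

fit_c50  <- C5.0(Species ~ ., data = iris)                    # entropy / information gain
fit_cart <- rpart(Species ~ ., data = iris, method = "class") # Gini impurity

summary(fit_c50)
print(fit_cart)
```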

Gini Impurity

Can be used instead of Entropy. A way to measure the optimal feature to split upon. A measure of statistical dispersion.

Smoothing by Clustering

Cluster the data and use properties of the clusters to represent the instances constituting those clusters.

DMwR (Package)

Contains the SMOTE() function for balancing imbalanced data by generating synthetic minority-class examples.
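
A minimal sketch of how SMOTE() might be called, assuming the DMwR package is installed and that `train` is a hypothetical data frame with a binary factor column `Class`:

```r
library(DMwR)  # provides SMOTE() for synthetic minority over-sampling

# `train` and its `Class` column are assumed here, not defined
balanced <- SMOTE(Class ~ ., data = train,
                  perc.over  = 200,   # generate 200% additional minority cases
                  perc.under = 150)   # under-sample the majority class relative to those
table(balanced$Class)                 # check the new class distribution
```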

Corrplot (Package)

A package for visualizing correlation matrices.

Covariance

Covariance measures the degree to which two features vary together.

Outlier proximity based technique

Define a proximity measure between instances, with outliers being distant from most other instances

Outlier density based technique

Define outliers as instances that have a local density significantly less than that of neighbors

Data Exploration

Describing, visualizing, and analyzing the acquired data in order to better understand it

gridExtra (Package)

For arranging multiple plots side by side in a grid layout.

Pre-Pruning Complexity Parameter

If a split from the current node does not improve the fit of the decision tree by at least the value of cp, then tree building does not continue from that node.

Imputation

Imputation involves systematically filling in missing data using a substituted value. Some of the approaches to imputation include: 1) match-based imputation 2) statistical (distribution-based) imputation 3) predictive imputation 4) using an indicator variable 5) mean imputation 6) random imputation.
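
A minimal sketch of two of these approaches (mean imputation plus an indicator variable) on a made-up vector with missing values:

```r
x <- c(4.2, NA, 5.1, 3.9, NA, 4.7)         # toy feature containing missing values

# Mean imputation: replace NAs with the mean of the observed values
x_imputed <- x
x_imputed[is.na(x_imputed)] <- mean(x, na.rm = TRUE)

# Indicator variable: flag which values were originally missing
was_missing <- as.integer(is.na(x))

data.frame(original = x, imputed = x_imputed, was_missing = was_missing)
```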

Match Based Imputation

Impute based on similar instances with non-missing values. There are two approaches to this: 1. *Cold-deck Imputation*: Fill in the missing value using similar instances from another dataset. 2. *Hot-deck Imputation*: Fill in the missing value using similar instances from the same dataset.

Distribution-based Imputation

Impute by assigning a value for the missing data based on the probability distribution of the non-missing data.

Predictive Imputation

Impute by building a regressor or classifier to predict the missing data. Consider the missing feature as the dependent variable (output) and the rest of the features as the independent variables (input).

Indicator Variable (imputation)

Impute using a constant or "indicator" value. For example, you can use "unknown", "N/A" or "-1" to represent missing values. This approach is appropriate for unordered categorical variables; however, the indicator value may be mistaken for actual data.

Random Imputation

Impute using a random observed value for the feature. The major disadvantage of this approach is that it tends to ignore useful information from other features.

Mean Imputation

Impute using the mean of the feature's observed values. Some of the disadvantages of this approach include: 1. It may result in underestimates of the standard deviation. 2. It tends to "pull" estimates of correlation to zero.

Gain Ratio formula

GainRatio(F) = InformationGain(F) ÷ IntrinsicInfo(F)

Logistic vs Linear regression

Instead of modeling our response variable directly, logistic regression models the probability of a particular response value (producing a sigmoid curve)

Box plots

Is a feature significant? Does the location differ between subgroups? Does the variation differ between subgroups? Are there outliers in the data?

Odds Plots

Is a feature significant? How do feature values affect the probability of occurrence? Is there a threshold for the effect?

Scatter Plots

Is a feature significant? How do features interact? Are there outliers in the data?

High Entropy

High heterogeneity (the set of class values is highly mixed)

Low Entropy

High homogeneity (the set of class values is mostly uniform)

Generalized Linear Models Differences

Linear regression: a unit change in X results in a corresponding change in Y of β₁. Logistic regression: a unit change in X results in a corresponding change in the log-odds of Pr(X) of β₁; equivalently, for every unit change in X, the odds of Pr(X) change by a multiplicative factor of e^(β₁). (See the malignant tumor slides.)
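
A minimal sketch of reading both interpretations off a fitted model in R; the data and formula (predicting am from wt in the built-in mtcars data) are chosen only for illustration:

```r
# Binomial logistic regression: models the log-odds of am = 1 as a linear function of wt
fit <- glm(am ~ wt, data = mtcars, family = binomial)

coef(fit)       # beta_1: additive change in the log-odds of Pr(am = 1) per unit of wt
exp(coef(fit))  # e^(beta_1): multiplicative change in the odds per unit of wt
```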

Logistic Regression

Logistic regression is a probabilistic statistical regression model which is used to model the relationship between predictor variables and the odds of a categorical response

Regression

Machine learning technique where the goal is to model the size and strength of numeric relationships in order to predict a target variable based on the values of previously observed explanatory variables.

Data Preparation

Making the data more suitable for the learning process and other data science methods and techniques

Problems with Noisy Data

Noise is the random component of measurement error, i.e. unwanted variation in the data within a dataset. An algorithm is *robust* if it can produce acceptable results even when noise is present.

Problems with Outliers

Outliers may either be: 1. Instances that have characteristics that are different from most of the other data. 2. Values of a feature that are unusual with respect to the typical values for that feature. *Unlike noise, outliers can be legitimate data or values.*

Feature

Property or characteristic of an instance. Can be discrete or continuous.

Range, variance

Range and variance are used to measure the spread of a set of points.

Sampling

Sampling involves selecting a subset of a dataset as a proxy for the whole. A model generated from a good sample should be representative of a model generated from the entire dataset.

Feature Selection

Select minimal set of features where probability distribution of the predicted class is close to the one obtained by all the features. A "good" feature vector is defined by its capacity to discriminate between instances from different classes.

Recursive Partitioning

Splits the data into subsets and then recursively splits the subsets into even smaller subsets until one or more stopping criteria are met (Also called Divide and Conquer)

Classification

Supervised machine learning approach focused on predicting a discrete value (the class). Performance is evaluated using accuracy or misclassification rate.

Frequency and Mode

The *frequency* of a feature value is the percentage of time the value occurs in the dataset. The *mode* of a feature is the most frequent feature value. Both are typically used with categorical data.

Generalized Linear Models

The Generalized Linear Model (GLM) is an extension of linear regression that allows for linear predictors to be related to a response variable that is not normally distributed by using a transformation or link function.

Supervised Learning

The process of training a predictive model.

Class or Response

The attribute or feature that is described by the other features within an instance.

Post-Pruning Complexity Parameter

The cp value that corresponds to the lowest cross-validation error is used as the threshold for pruning the tree
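
A minimal sketch of this workflow with rpart on the built-in iris data (the cp value of 0.001 used to grow the initial tree is an arbitrary choice):

```r
library(rpart)

# Grow a deliberately large tree, then inspect cross-validation error per cp value
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(cp = 0.001))
printcp(fit)                                                  # cp table with xerror column

# Prune back to the cp with the lowest cross-validation error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)
```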

Machine Learning Definition

The development and use of algorithms that learn from prior experience in order to solve future problems

Normalization

The goal of standardization or normalization is to make an entire set of values have a particular property. Often, this involves scaling data to fall within a small, specified range. Three common approaches to normalization include: 1) min-max normalization 2) z-score normalization 3) decimal scaling
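
Minimal sketches of the first two approaches as R helper functions (the function names and toy values are made up):

```r
# Min-max normalization: rescale values to the range [0, 1]
min_max <- function(v) (v - min(v)) / (max(v) - min(v))

# Z-score normalization: rescale to mean 0 and standard deviation 1
z_score <- function(v) (v - mean(v)) / sd(v)

income <- c(12000, 47000, 73600, 98000)
min_max(income)
z_score(income)   # equivalent to as.numeric(scale(income))
```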

Intrinsic Information

The information needed to determine the branch to which an instance belongs.

logit function

The link function used for binomial logistic regression is called the logit function. It is the log-odds of the logistic function.
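
In base R, qlogis() is the logit (log-odds) function and plogis() is its inverse, the logistic (sigmoid); a small sketch:

```r
p <- 0.8
qlogis(p)           # logit: log(p / (1 - p)) = log(4), about 1.386
plogis(qlogis(p))   # the inverse logistic recovers the original probability, 0.8

# glm() with family = binomial uses the logit link by default
fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
```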

Problems with Imbalanced Data

The nature of certain types of problems results in unequal class distributions. For example: Attrition prediction: 97% stay, 3% attrite (per month). Medical diagnosis: 90% healthy, 10% diseased. eCommerce: 99% don't buy, 1% buy. Security: >99.99% of Americans are not terrorists.

Odds Ratio

The odds ratio of an event is the probability that the event will occur expressed as a proportion of the probability that the event will not occur: p / (1 − p).

Percentile

The percentile is the feature value below which a given percentage of the observed instances fall. Percentile is often useful for continuous data.

Cluster Sampling

The process involves grouping or segmenting data based on similarities, then randomly selecting from each group. While this approach is relatively efficient, it does not always result in optimal samples.

Stratified Random Sampling

The process involves selecting a subset from the original data such that the original class distribution is maintained. This approach works well for imbalanced data but is often inefficient

Systematic Sampling

The process involves selecting every k-th instance from an ordered set of instances in a dataset, where k = N/n, given that N is the population size and n is the sample size. This approach risks interaction with regularities in the data.

Simple Random Sampling

The process involves shuffling the data and then randomly selecting instances from the original dataset. It can be done with replacement (put the marble back in the jar) or without replacement (leave the marble out of the jar). This approach avoids regularities in the data but can be problematic with imbalanced data.
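
A minimal sketch with base R's sample(), drawing rows from the built-in iris data:

```r
set.seed(42)

# Without replacement: each row can be selected at most once
idx <- sample(nrow(iris), size = 50, replace = FALSE)
sample_without <- iris[idx, ]

# With replacement: the same row may be selected more than once
idx_boot <- sample(nrow(iris), size = 50, replace = TRUE)
sample_with <- iris[idx_boot, ]
```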

Dealing with Imbalanced Data

There are several approaches to dealing with imbalanced data. Some of which are: 1) Collect more data. 2) Over-sample and/or under-sample. 3) Generate synthetic samples. 4) Try a different algorithm. 5) Use a different set of performance measures.

Feature Construction

This involves the creation of new features from the original set of features. Feature construction is done because: 1) Sometimes original features are not suitable for some algorithms. 2) Sometimes more useful features can be engineered from the original ones.

Decimal Scaling

Transform the data by moving the decimal point for values of feature F. The number of decimal places moved depends on the maximum absolute value of F. v′ = v / 10^j, where j is the smallest integer such that max(|v′|) < 1.

Z-score (zero-mean) Normalization

Transform the data so that it has a mean of 0 and standard deviation of 1. v′ = (v − meanF) / σF

Discretization

Transformation of continuous data into discrete counterparts. This is similar to the process used for binning. The reasons to discretize data include: 1) Some algorithms require categorical or binary features. 2) It improves visualization. 3) It reduces categories for features with many values.
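
A minimal sketch using base R's cut() to discretize a continuous feature; the break points and labels are made up:

```r
age <- c(23, 35, 41, 52, 67, 19, 48)

# Bin a continuous feature into ordered categories
age_group <- cut(age,
                 breaks = c(0, 30, 50, Inf),
                 labels = c("young", "middle", "senior"))
table(age_group)
```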

Dummy Variable

Transformation of discrete features into a series of continuous features (usually with binary values). Dummy variables are useful because: 1) Some algorithms only work with continuous features. 2) It is a useful approach for dealing with missing data. 3) It is a necessary pre-step in dimensionality reduction such as with PCA (Principal Component Analysis).
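
A minimal sketch with base R's model.matrix(), which expands a factor into 0/1 indicator columns:

```r
df <- data.frame(size = factor(c("small", "medium", "large", "small")))

# Dropping the intercept (~ size - 1) yields one 0/1 column per factor level
dummies <- model.matrix(~ size - 1, data = df)
dummies
```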

Smoothing by Regression

Use a fitted regression line to represent the data.

GIGO

Garbage in, garbage out: successful data science depends on good data.

Dimensionality

represents the number of features in the dataset.

Complexity Parameter

used to control the size of the decision tree and select the optimal tree size

Decision Tree

utilize a tree structure to model the relationship among the features and the potential outcomes

Min-Max Normalization

v′ = [(v − minF)/(maxF − minF)]*(new_maxF − new_minF) + new_minF. Example: suppose the minimum and maximum values for a feature are $12,000 and $98,000, respectively, and we use min-max normalization to map it to the range [0.0, 1.0]. The normalized value for $73,600 is (73,600 − 12,000)/(98,000 − 12,000)*(1.0 − 0.0) + 0.0 = 0.716.
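
The same worked example checked directly in R:

```r
(73600 - 12000) / (98000 - 12000) * (1.0 - 0.0) + 0.0   # 0.7162791, i.e. about 0.716
```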

Overfitting

when a model fits the training data too closely, capturing noise along with the underlying patterns. Such a model has low bias and high variance.

Note about regression

Note that regression does NOT establish causation between the independent variables and the dependent variable.

Predictive Models

Predictive Models are involved with predicting a value based on other values in the dataset. The process of training a predictive model is known as supervised learning.

Regression

Supervised machine learning approach focused on predicting a continuous value (the response). Performance is evaluated using root mean squared error.

Underfitting

happens when a model is unable to capture the underlying patterns in the data. Such a model has high bias and low variance

Criteria that triggers a stop to recursive partitioning:

- All data in a leaf node are of the same class - All features have been exhausted - A specified tree size limit has been met

Applications of Nearest Neighbor

- Optical character recognition and facial recognition. - Movie or music recommendation engines. - Pattern recognition in genetic data for disease diagnosis.

Unsupervised Machine Learning

1) Association Rules 2) k-means clustering Unsupervised learning does not have labeled outputs, so its goal is to infer the natural structure present within a set of data points.

Three Common Types of Logistic Regression

1) Binomial Logistic Regression 2) Multinomial Logistic Regression 3) Ordered Logistic Regression

Parametric Model Weaknesses

1) Constrained: By choosing a functional form, these methods are highly constrained to the specified form. 2) Limited Complexity: The methods are more suited to simpler problems. 3) Poor Fit: In practice, the methods may not always match the underlying mapping function.

Data exploration consists of (4)

1) Describing the data 2) visualizing the data 3) analyzing the data 4) understanding the data

k-NN weaknesses

1) Does not produce a model 2) The selection of an appropriate k is often arbitrary 3) Rather slow classification phase 4) Does not handle missing, outlier, and nominal data well without pre-processing

Non-Parametric Models Strengths

1) Flexibility: Capable of fitting a large number of functional forms. 2) Power: No assumptions (or weak assumptions) about the underlying function. 3) Performance: Can result in higher performance models for prediction.

Logistic Regression Weaknesses

1) Makes strong assumptions about the data 2) Does not do well with missing data 3) Tends to underperform when there are multiple or non-linear decision boundaries 4) Does not naturally capture complex relationships

Non-Parametric Models Weaknesses

1) More data: Require a lot more training data to estimate the mapping function. 2) Slower: A lot slower to train, as they often have far more parameters to train. 3) Overfitting: Have a higher risk of overfitting against the training data.

Logistic Regression Strengths

1) Outputs have a nice probabilistic interpretation 2) Can be regularized to avoid overfitting 3) Easy to implement 4) Very efficient to train 5) Able to handle a reasonable number of nominal features

Parametric Model Strengths

1) Simpler: These methods are easier to understand and the results are easy to interpret. 2) Speed: Parametric models are usually very fast to train. 3) Less Data: They do not require as much training data and can work well even if the fit to the data is not perfect.

Regression Methods

1) linear regression 2) logistic regression 3) Poisson regression (log-linear)

Supervised Machine Learning

1) logistic regression 2) k-nearest neighbors 3) decision trees. The goal of supervised learning is to learn a function that, given a sample of data and desired outputs, best approximates the relationship between input and output observable in the data.

k-NN Steps

1. Choose k. 2. Identify k labeled examples in the training set that are nearest in similarity to each of the unlabeled observations in the test data set (use euclidean distance) 3. Assign the class of the majority of the k nearest neighbors in the training set to each unlabeled observation in the test set.
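
A minimal sketch of these steps with knn() from the class package on the built-in iris data; the 100/50 split and k = 7 (roughly the square root of the training size) are arbitrary choices:

```r
library(class)

set.seed(1)
idx          <- sample(nrow(iris), 100)        # simple random train/test split
train        <- iris[idx, 1:4]
train_labels <- iris$Species[idx]
test         <- iris[-idx, 1:4]
test_labels  <- iris$Species[-idx]

# Each test row is assigned the majority class of its k nearest training rows
pred <- knn(train = train, test = test, cl = train_labels, k = 7)
mean(pred == test_labels)                      # classification accuracy
```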

The analytics Process

1. Data collection 2. Data exploration 3. Data preparation 4. Modeling 5. Evaluation 6. Business insight

Large vs. Small K

A *large k* reduces the impact of noisy data but increases the risk of ignoring important patterns. A *small k* makes the model susceptible to noise and/or outliers

Information Gain

A calculation of the change in entropy that would result from a split on each possible feature which determines the optimal feature to split on.
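
A minimal sketch of that calculation in R; the helper functions are made up, and the entropy() helper from the earlier sketch is restated so the block is self-contained:

```r
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain of splitting class labels y on a candidate feature x
information_gain <- function(y, x) {
  child_entropy <- sapply(split(y, x), entropy)   # entropy within each branch
  weights       <- table(x) / length(x)           # proportion of instances per branch
  entropy(y) - sum(weights * child_entropy)       # parent entropy minus weighted child entropy
}

information_gain(y = iris$Species, x = iris$Petal.Length > 2.5)
```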

Lazy Learners

A class of non-parametric learning methods that do not generate a model but instead make use of verbatim training data for classification. They are also known as instance-based learners or rote learners

Nearest Neighbor

A classification approach that assigns a class to unlabeled examples based on the class of similar labeled examples.

Confusion Matrix

A confusion matrix is a table that categorizes predictions according to whether they match the actual value. The class of interest is known as the positive class, while all others are known as negative.
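
A minimal sketch in R, cross-tabulating predicted against actual labels (both vectors are made up):

```r
actual    <- factor(c("spam", "ham", "spam", "ham", "spam", "ham"))
predicted <- factor(c("spam", "ham", "ham",  "ham", "spam", "spam"))

# Rows are predictions, columns are actual values; "spam" is the positive class here
table(Predicted = predicted, Actual = actual)
```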

Non-Parametric Models

A learning model that does not make strong assumptions about the form of the mapping function is called a non-parametric model. By not making assumptions, non-parametric models are free to learn any functional form from the training data.

Parametric Models

A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples). No matter how much data you throw at a parametric model, it won't change its mind about how many parameters it needs.

Splitting

A measure of impurity, such as entropy, is used to decide which feature to split on. The goal is to minimize impurity within each node.

Gain Ratio

A modification of Information Gain that reduces its bias on highly branching features by taking into account the number and size of branches when choosing a feature.

Greedy Learners

Decision trees are part of a group of learners that use all the data available to them on a first-come, first-served basis.

Variance Error

Errors made as a result of not giving the learning algorithm the optimal training data. Leads to low training error but high test error.

Bias Error

Errors made as a result of not choosing the optimal learning algorithm for the training data. Leads to high training error and high test error

Sparsity and Density

Data sparsity and density describe the degree to which data exists for each feature of all observations. So if a table is 80% dense, then 20% of the data is missing or undefined. This means it is 20% sparse

rpart (Package)

For recursive partitioning in classification.

Data Collection

Identifying and gathering the data needed for the learning process.

Applications of Decision Trees

- Credit scoring models - Marketing studies of customer behavior - Diagnosis and prognosis of medical conditions

Decision Tree Strengths

- Does well on most problems - Handles numeric and nominal features well - Does well with missing data - Ignores unimportant features - Useful for both large and small datasets - Output is easy to understand - Efficient and low cost model

Decision Tree Weaknesses

- Splits biased towards features with a large number of levels - Easy to overfit or underfit - Reliance on axis-parallel splits is limiting - Small changes in data result in large changes to decision logic - Large trees can be difficult to interpret or understand

Akaike Information Criterion (AIC)

An estimate of the relative information lost by a given model: the less information a model loses, the higher the quality of the model. Given a set of candidate models from the same dataset, the model with the lowest AIC is the "preferred" model.
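
A minimal sketch comparing two candidate models fit to the same data with base R's AIC(); the models (mtcars, predicting am) are only for illustration:

```r
m1 <- glm(am ~ wt,      data = mtcars, family = binomial)
m2 <- glm(am ~ wt + hp, data = mtcars, family = binomial)

AIC(m1, m2)   # the row with the lower AIC is the preferred model
```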

Instance

An individual independent example of the concept represented by the dataset. It is described by a set of attributes or features. A dataset is made up of several instances.

The Bias Variance Tradeoff

Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the model; it leads to high error on both training and test data. Variance is the variability of model predictions for a given data point, which tells us the spread of our predictions. A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before; such models perform very well on training data but have high error rates on test data.

Logistic Regression measuring coefficients

Coefficients are estimated using a technique called Maximum Likelihood Estimation (MLE). Unlike with the Ordinary Least Squares (OLS) method used by linear regression, finding a closed form for the coefficients under MLE is not possible; instead, the estimation process is iterative.

Class (Package)

Contains the knn() function used for k-NN classification.

Modeling

Matching the input data to the appropriate machine learning approach in order to solve a problem.

Choosing the right K

Methods: 1) Use the square root of the number of training examples. 2) Test different k values against a variety of test datasets and choose the one that performs best. 3) Use weighted voting, where the closest neighbors have larger weights. *NOTE*: the larger the dataset, the less important the difference between two choices of k becomes.

Resolution

Resolution describes the grain or level of detail in the data. If data resolution is too fine, important patterns may be blurred by noise and if the resolution is too coarse, important patterns may disappear.

InformationValue (Package)

Used for maximizing information value in regression and analyzing performance in classification models.

Histograms

What kind of population distribution does the data come from? Where is the data located? How spread out is the data? Is the data symmetric or skewed? Are there outliers in the data?

Limits of Linear Regression

Works well when the response variable is normally distributed, but struggles with non-Gaussian responses.

linear regression

Y: dependent or response variable. X: independent or predictor variables. B: model parameters (unknown parameters to be estimated by the regression model). For simple linear regression, Y = B0 + B1*X + error. The goal is to estimate the values of B0 and B1 that minimize the sum of squared errors (residuals) between the actual and predicted values.
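
A minimal sketch in R, with the response and predictor drawn from the built-in mtcars data purely for illustration:

```r
# Y = B0 + B1 * X + error; lm() estimates B0 and B1 by ordinary least squares
fit <- lm(mpg ~ wt, data = mtcars)

coef(fit)           # B0 (intercept) and B1 (slope for wt)
sum(resid(fit)^2)   # the sum of squared residuals that the fit minimizes
```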

