Machine Learning Quiz 3, Machine Learning Quiz
Tom Mitchell Quote
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks T, as measured by P, improves with experience E."
Inductive Learning Hypothesis
"Any hypothesis found to approximate the target function well over a large set of training examples will also approximate the target function well over other unobserved examples."
Dealing with noisy data
*Smoothing* is the process of reducing noise in data. Methods: 1) Binning 2) Clustering 3) Regression
Descriptive Models
*Summarizing or grouping data in new and interesting ways.* In these types of models, no single feature is more important than any other. The process of training a descriptive model is known as unsupervised learning.
Data preparation
1) Cleaning the data 2) Transforming the data 3) Reducing the data. Tidy datasets are all alike, but every messy dataset is messy in its own way.
k-NN strengths
1) Simple and effective 2) makes no assumptions about the underlying data distribution 3) training phase is very fast
Dealing with Duplicate Data
1. Setting a similarity threshold to identify duplicates. 2. Defining elimination rules for dealing with duplicates.
Problems with Duplicate Data
A dataset may include completely or partially duplicated data. Be careful: occasionally, two or more objects may be identical with respect to features but still represent different objects. Avoid combining instances that are similar but don't represent a single entity.
Entropy
A quantification of the level of randomness or disorder within a set of class values
Nearest Neighbor
Approach that assigns a label to new data based on the label of existing similar examples.
Logistic Regression
Approach that assigns a label to new data based on the odds that the data belongs to a certain category
Decision Trees
Approach that assigns a label to new data by using a tree structure to determine the category that the data belongs to.
Clustering
Approach that automatically partitions data into groups based on similarity
Association Rules
Approach that builds rules that describe the co-occurrence of events in data.
Curse of Dimensionality
As the number of features (or complexity) increases, the amount of data needed to generalize accurately grows exponentially. If the size of the data is fixed, performance diminishes as complexity increases.
Discrete Features
Attributes measured in categorical form. They typically have only a limited, reasonably small set of possible values. Examples include: clothing size (small, medium, large), customer satisfaction (not happy, somewhat happy, very happy), etc.
Continuous Features
Attributes usually measured in the form of integer or real numbers. They have an infinite number of possible values between the lower and upper bounds. Examples include: temperature, height, weight, age, etc.
Prediction Errors
Bias error, variance error, irreducible error
Binning
Binning by mean: replace each value with the mean of its bin. Binning by boundaries: round each value down or up to the nearer of its bin's minimum or maximum.
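The two binning methods can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes equal-frequency bins over sorted values, and the helper names (`equal_frequency_bins`, `smooth_by_means`, `smooth_by_boundaries`) and the sample `prices` list are hypothetical.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size.

    Assumes n_bins <= len(values), so that no bin is empty.
    """
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


def smooth_by_means(bins):
    """Binning by mean: every value becomes the mean of its bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Binning by boundaries: every value becomes the nearer of the bin's min or max."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are the min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data only
bins = equal_frequency_bins(prices, 3)
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```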
mean and median
Both mean and median are used to measure the location of a set of points
Entropy in R
C50
Gini Impurity in R
CART
Gini Impurity
Can be used instead of Entropy. A way to measure the optimal feature to split upon. A measure of statistical dispersion.
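As a rough sketch of how Gini impurity can score a candidate split, the Python below computes the impurity of a set of class labels and the size-weighted impurity of the groups a split produces. The function names and the sample labels are assumptions made for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(groups):
    """Impurity of a split: each group's impurity weighted by its share of the instances."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini_impurity(g) for g in groups)


mixed = ["yes", "yes", "no", "no"]          # illustrative labels only
print(gini_impurity(mixed))                  # 0.5 (maximally impure for two classes)
print(weighted_gini([["yes", "yes"], ["no", "no"]]))  # 0.0 (a perfectly pure split)
```

A lower weighted impurity after the split indicates a better feature to split on.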
Smoothing by Clustering
Cluster the data and use properties of the clusters to represent the instances constituting those clusters.
DMwR (Package)
Contains the SMOTE function for balancing imbalanced data.
Corrplot (Package)
Correlation plotting package
Covariance
Covariance measures the degree to which two features vary together.
Outlier proximity based technique
Define a proximity measure between instances, with outliers being distant from most other instances
Outlier density based technique
Define outliers as instances that have a local density significantly less than that of neighbors
Data Exploration
Describing, visualizing, and analyzing the acquired data in order to better understand it
gridExtra (Package)
For arranging multiple plots side by side in a grid.
Pre-Pruning Complexity Parameter
If the cost of adding another variable to the decision tree from the current node is above the value of cp, then tree building does not continue
Imputation
Imputation involves systematically filling in missing data using a substituted value. Some of the approaches to imputation include: match-based imputation, statistical imputation, predictive imputation, using an indicator variable, mean imputation, and random imputation.
Match Based Imputation
Impute based on similar instances with non-missing values. There are two approaches to this: 1. *Cold-deck Imputation*: Fill in the missing value using similar instances from another dataset. 2. *Hot-deck Imputation*: Fill in the missing value using similar instances from the same dataset.
Distribution-based Imputation
Impute by assigning a value for the missing data based on the probability distribution of the non-missing data.
Predictive Imputation
Impute by building a regressor or classifier to predict the missing data. Consider the missing feature as the dependent variable (output) and the rest of the features as the independent variables (input).
Indicator Variable (imputation)
Impute using a constant or "indicator" value. For example, you can use "unknown", "N/A" or "-1" to represent missing values. This approach is appropriate for unordered categorical variables; however, the indicator value may be mistaken for actual data.
Random Imputation
Impute using a random observed value for the feature. The major disadvantage of this approach is that it tends to ignore useful information from other features.
Mean Imputation
Impute using the mean of the feature's observed values. Some of the disadvantages of this approach include: 1. It may result in underestimates of the standard deviation. 2. It tends to "pull" estimates of correlation to zero.
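The first disadvantage can be seen in a small Python sketch. It assumes missing values are encoded as `None`; the `mean_impute` helper and the `ages` list are hypothetical.

```python
import statistics


def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


ages = [20, None, 30, 40, None, 50]  # illustrative data only
observed = [v for v in ages if v is not None]
imputed = mean_impute(ages)

print(imputed)
# The imputed values sit exactly at the mean, so they add no spread
# and the sample standard deviation shrinks.
print(statistics.stdev(observed), statistics.stdev(imputed))
```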
Gain Ratio formula
InformationGain(F)÷IntrinsicInfo(F)
Logistic vs Linear regression
Instead of modeling our response variable directly, logistic regression models the probability of a particular response value (producing a sigmoid curve)
Box plots
Is a feature significant? Does the location differ between subgroups? Does the variation differ between subgroups? Are there outliers in the data?
Odds Plots
Is a feature significant? How do feature values affect the probability of occurrence? Is there a threshold for the effect?
Scatter Plots
Is a feature significant? How do features interact? Are there outliers in the data?
High Entropy
Large heterogeneity
Low Entropy
Large homogeneity
Generalized Linear Models Differences
Linear regression: a unit change in X results in a corresponding change of β₁ in Y. Logistic regression: a unit change in X results in a corresponding change of β₁ in the log-odds of Pr(X), where Pr(X) is the modeled probability of the response given X. Equivalently, for every unit change in X, the odds are multiplied by e^(β₁) (see malignant tumor slides).
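The logistic regression interpretation can be checked numerically. The sketch below uses made-up coefficients (`b0`, `b1`) and an arbitrary starting point `x`; none of these values come from the course material.

```python
import math


def sigmoid(z):
    """Logistic function: maps a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


b0, b1 = -3.0, 0.8  # hypothetical intercept and slope
x = 2.0             # hypothetical predictor value

p_before = sigmoid(b0 + b1 * x)
p_after = sigmoid(b0 + b1 * (x + 1))  # X increased by one unit

log_odds_change = math.log(odds(p_after)) - math.log(odds(p_before))
odds_multiplier = odds(p_after) / odds(p_before)

print(log_odds_change)               # approximately b1
print(odds_multiplier, math.exp(b1)) # the two values agree up to rounding
```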
Logistic Regression
Logistic regression is a probabilistic statistical regression model which is used to model the relationship between predictor variables and the odds of a categorical response
Regression
Machine learning technique where the goal is to model the size and strength of numeric relationships in order to predict a target variable based on the values of previously observed explanatory variables.
Data Preparation
Making the data more suitable for the learning process and other data science methods and techniques
Problems with Noisy Data
Noise is the random component of a measurement error, which appears as variation of data in a dataset. An algorithm is *robust* if it can produce acceptable results even when noise is present.
Problems with Outliers
Outliers may either be: 1. Instances that have characteristics that are different from most of the other data. 2. Values of a feature that are unusual with respect to the typical values for that feature. *Unlike noise, outliers can be legitimate data or values.*
Feature
Property or characteristic of an instance. Can be discrete or continuous.
Range, variance
Range and variance are used to measure the spread of a set of points.
Sampling
Sampling involves selecting a subset of a dataset as a proxy for the whole. A model generated from a good sample should be representative of a model generated from the entire dataset.
Feature Selection
Select the minimal set of features for which the probability distribution of the predicted class is close to the one obtained using all the features. A "good" feature vector is defined by its capacity to discriminate between instances from different classes.
Recursive Partitioning
Splits the data into subsets and then recursively splits the subsets into even smaller subsets until one or more stopping criteria are met (Also called Divide and Conquer)
Classification
Supervised machine learning approach focused on predicting a discrete value (the class). Performance is evaluated using accuracy or misclassification rate.
Frequency and Mode
The *frequency* of a feature value is the percentage of time the value occurs in the dataset. The *mode* of a feature is the most frequent feature value. Both are typically used with categorical data.
Generalized Linear Models
The Generalized Linear Model (GLM) is an extension of linear regression that allows for linear predictors to be related to a response variable that is not normally distributed by using a transformation or link function.
Supervised Learning
The process of training a predictive model.
Class or Response
The attribute or feature that is described by the other features within an instance.
Post-Pruning Complexity Parameter
The cp value that corresponds to the lowest cross-validation error is used as the threshold for pruning the tree
Machine Learning Definition
The development and use of algorithms that learn from prior experience in order to solve future problems
Normalization
The goal of standardization or normalization is to make an entire set of values have a particular property. Often, this involves scaling data to fall within a small, specified range. Three common approaches to normalization include: 1) min-max normalization 2) z-score normalization 3) decimal scaling
Intrinsic Information
The information needed to determine the branch to which an instance belongs.
logit function
The link function used for binomial logistic regression is called the logit function. It is the log of the odds, log(p / (1 − p)), and is the inverse of the logistic (sigmoid) function.
Problems with Imbalanced Data
The nature of certain types of problems results in unequal class distributions. For example: Attrition prediction: 97% stay, 3% attrite (per month). Medical diagnosis: 90% healthy, 10% diseased. eCommerce: 99% don't buy, 1% buy. Security: >99.99% of Americans are not terrorists.
Odds Ratio
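The odds of an event are the probability that the event will occur divided by the probability that it will not occur, p / (1 − p). An odds ratio is the ratio of two such odds, for example the odds of the event in one group relative to the odds in another.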
Percentile
The percentile is the feature value below which a given percentage of the observed instances fall. Percentile is often useful for continuous data.
Cluster Sampling
The process involves grouping or segmenting data based on similarities, then randomly selecting from each group. While this approach is relatively efficient, it does not always result in optimal samples.
Stratified Random Sampling
The process involves selecting a subset from the original data such that the original class distribution is maintained. This approach works well for imbalanced data but is often inefficient.
Systematic Sampling
The process involves selecting every k-th instance from an ordered set of instances in a dataset, where k = N/n, given that N is the population size and n is the sample size. This approach risks interaction with regularities in the data.
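A minimal Python sketch of systematic sampling follows. The `systematic_sample` helper and its `start` offset parameter are assumptions for illustration, and integer division is used for k = N/n when N is not a multiple of n.

```python
def systematic_sample(instances, n, start=0):
    """Select every k-th instance from an ordered list, with k = N // n.

    `start` is the index of the first selected instance and must be less than k.
    """
    if not 0 < n <= len(instances):
        raise ValueError("n must be between 1 and the number of instances")
    k = len(instances) // n
    if not 0 <= start < k:
        raise ValueError("start must be in the range [0, k)")
    return instances[start::k][:n]


population = list(range(1, 21))  # N = 20 ordered instances (illustrative)
print(systematic_sample(population, 5))           # k = 4 -> [1, 5, 9, 13, 17]
print(systematic_sample(population, 5, start=3))  # [4, 8, 12, 16, 20]
```

If the ordering of the data repeats with a period related to k (for example, every fourth record is a weekend), the sample will over-represent that pattern, which is the interaction risk the card describes.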
Simple Random Sampling
The process involves shuffling the data and then randomly selecting instances from the original dataset. It can be done with replacement (put the marble back in the jar) or without replacement (leave the marble out of the jar). This approach avoids regularities in the data but can be problematic with imbalanced data.
Dealing with Imbalanced Data
There are several approaches to dealing with imbalanced data, including: 1) Collect more data. 2) Over-sample and/or under-sample. 3) Generate synthetic samples. 4) Try a different algorithm. 5) Use a different set of performance measures.
Feature Construction
This involves the creation of new features from the original set of features. Feature construction is done because: 1) Sometimes original features are not suitable for some algorithms. 2) Sometimes more useful features can be engineered from the original ones.
Decimal Scaling
Transform the data by moving the decimal point for values of feature F. The number of decimal places moved depends on the maximum absolute value of F. v′ = v/(10^j), where j is the smallest integer such that max(|v′|) < 1.
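The rule for choosing j can be sketched in Python as follows. The `decimal_scale` helper and the sample values are hypothetical, and the search assumes j is non-negative.

```python
def decimal_scale(values):
    """Divide every value by 10**j, with j the smallest non-negative integer
    such that the largest absolute scaled value is below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


scaled, j = decimal_scale([-986, 120, 917])  # illustrative data only
print(j)       # 3, because 986 / 10**3 = 0.986 < 1
print(scaled)  # [-0.986, 0.12, 0.917]
```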
Z-score (zero-mean) Normalization
Transform the data so that it has a mean of 0 and standard deviation of 1. v′ = (v − Fmean) / σF
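A short Python sketch of the z-score formula is shown below. It assumes σF is the population standard deviation (`statistics.pstdev`); a course or library may instead use the sample standard deviation. The function name and data are illustrative.

```python
import statistics


def z_score_normalize(values):
    """Rescale values to mean 0 and standard deviation 1 using v' = (v - mean) / sd."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / sd for v in values]


feature = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population standard deviation 2
z = z_score_normalize(feature)
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```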
Discretization
Transformation of continuous data into discrete counterparts. The process is similar to the one used for binning. The reasons to discretize data include: 1) Some algorithms require categorical or binary features. 2) It improves visualization. 3) It reduces categories for features with many values.
Dummy Variable
Transformation of discrete features into a series of continuous features (usually with binary values). Dummy variables are useful because: 1) Some algorithms only work with continuous features. 2) It is a useful approach for dealing with missing data. 3) It is a necessary pre-step in dimensionality reduction such as with PCA (Principal Component Analysis).
Smoothing by Regression
Use a fitted regression line to represent the data.
GIGO
Garbage in, garbage out. Successful data science depends on good data.
Predictive Models
Models that predict a value based on other values in the dataset.
Dimensionality
The number of features in the dataset.
Complexity Parameter
Used to control the size of the decision tree and select the optimal tree size.
Decision Tree
Uses a tree structure to model the relationship among the features and the potential outcomes.
Min-Max Normalization
v′ = [(v − minF)/(maxF − minF)] * (new_maxF − new_minF) + new_minF. Suppose that the minimum and maximum values for a feature are $12,000 and $98,000, respectively. We can use min-max normalization to map it to the range [0.0, 1.0]. The normalized value for $73,600 will be: (73,600 − 12,000)/(98,000 − 12,000) * (1.0 − 0.0) + 0.0 = 0.716
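The formula and the worked example can be reproduced with a few lines of Python. The function and parameter names are chosen here for illustration.

```python
def min_max_normalize(v, min_f, max_f, new_min=0.0, new_max=1.0):
    """Map v from the range [min_f, max_f] onto [new_min, new_max]."""
    return (v - min_f) / (max_f - min_f) * (new_max - new_min) + new_min


# Worked example from the card: income of $73,600 with min $12,000 and max $98,000.
normalized = min_max_normalize(73_600, 12_000, 98_000)
print(round(normalized, 3))  # 0.716
```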
Overfitting
Occurs when a model fits the training data too closely, capturing noise along with the underlying patterns. Such a model has low bias and high variance.
Note about regression
Note that regression does NOT establish causation between the independent variables and the dependent variable.
Predictive Models
Predictive Models are involved with predicting a value based on other values in the dataset. The process of training a predictive model is known as supervised learning.
Regression
Supervised machine learning approach focused on predicting a continuous value (the response). Performance is evaluated using root mean squared error.
Underfitting
Happens when a model is unable to capture the underlying patterns in the data. Such a model has high bias and low variance.
Criteria that trigger a stop to recursive partitioning:
- All data in a leaf node are of the same class - All features have been exhausted - A specified tree size limit has been met
Applications of Nearest Neighbor
- Optical character recognition and facial recognition. - Movie or music recommendation engines. - Pattern recognition in genetic data for disease diagnosis.
Unsupervised Machine Learning
Examples: 1) Association rules 2) k-means clustering. Unsupervised learning does not have labeled outputs, so its goal is to infer the natural structure present within a set of data points.
Three Common Types of Logistic Regression
1) Binomial Logistic Regression 2) Multinomial Logistic Regression 3) Ordered Logistic Regression
Parametric Model Weaknesses
1) Constrained: By choosing a functional form, these methods are highly constrained to the specified form. 2) Limited Complexity: The methods are more suited to simpler problems. 3) Poor Fit: In practice, the methods may not always match the underlying mapping function.
Data exploration consists of (4)
1) Describing the data 2) visualizing the data 3) analyzing the data 4) understanding the data
k-NN weaknesses
1) Does not produce a model 2) The selection of an appropriate k is often arbitrary 3) Rather slow classification phase 4) Does not handle missing, outlier, and nominal data well without pre-processing
Non-Parametric Models Strengths
1) Flexibility: Capable of fitting a large number of functional forms. 2) Power: No assumptions (or weak assumptions) about the underlying function. 3) Performance: Can result in higher performance models for prediction.
Logistic Regression Weaknesses
1) Makes strong assumptions about the data 2) Does not do well with missing data 3) Tends to under perform when there are multiple or non-linear decision boundaries 4) Does not naturally capture complex relationships
Non-Parametric Models Weaknesses
1) More data: Require a lot more training data to estimate the mapping function. 2) Slower: A lot slower to train, as they often have far more parameters to train. 3) Overfitting: Have a higher risk of overfitting against the training data.
Logistic Regression Strengths
1) Outputs have a nice probabilistic interpretation 2) Can be regularized to avoid overfitting 3) easy to implement 4) very efficient to train 5) able to handle a reasonable number of nominal features
Parametric Models Strengths
1) Simpler: These methods are easier to understand and the results are easy to interpret. 2) Speed: Parametric models are usually very fast to train. 3) Less Data: They do not require as much training data and can work well even if the fit to the data is not perfect.
Regression Methods
1) linear regression 2) logistic regression 3) Poisson regression (log-linear)
Supervised Machine Learning
Examples: 1) logistic regression 2) k-nearest neighbor 3) decision trees. The goal of supervised learning is to learn a function that, given a sample of data and desired outputs, best approximates the relationship between input and output observable in the data.
k-NN Steps
1. Choose k. 2. Identify the k labeled examples in the training set that are nearest in similarity to each of the unlabeled observations in the test data set (use Euclidean distance). 3. Assign the class of the majority of the k nearest neighbors in the training set to each unlabeled observation in the test set.
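The three steps can be sketched in plain Python. This is an illustrative implementation under simple assumptions (numeric features, unweighted majority vote, ties resolved in favor of the nearer neighbors); the `knn_classify` name and the toy training data are hypothetical.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Classify `query` by majority vote among its k nearest training examples.

    `train` is a list of (feature_tuple, label) pairs.
    """
    # Step 2: rank the labeled examples by Euclidean distance to the query.
    neighbors = sorted(train, key=lambda example: math.dist(example[0], query))[:k]
    # Step 3: take the majority class among the k nearest neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


train = [  # toy labeled examples, for illustration only
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 7.5), "B"), ((6.5, 8.0), "B"),
]

# Step 1: choose k.
k = 3
print(knn_classify(train, (2.0, 2.0), k))  # A
print(knn_classify(train, (6.0, 7.0), k))  # B
```

Because distance drives the result, features on very different scales would usually be normalized first.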
The analytics Process
1. Data collection 2. Data exploration 3. Data preparation 4. Modeling 5. Evaluation 6. Business insight
Large vs. Small K
A *large k* reduces the impact of noisy data but increases the risk of ignoring important patterns. A *small k* makes the model susceptible to noise and/or outliers
Information Gain
The change in entropy that would result from splitting on a given feature. It is calculated for each candidate feature, and the feature with the highest information gain is the optimal one to split on.
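A small Python sketch of entropy and information gain follows. It assumes entropy is measured in bits (log base 2) and that a split is represented as a list of label partitions; the function names and sample labels are illustrative.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy in bits of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, partitions):
    """Entropy before the split minus the size-weighted entropy after it."""
    n = len(parent)
    after = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent) - after


labels = ["yes", "yes", "no", "no"]  # illustrative labels only
print(entropy(labels))                                               # 1.0
print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))      # 1.0 (pure partitions)
print(information_gain(labels, [["yes", "no"], ["yes", "no"]]))      # 0.0 (uninformative split)
```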
Lazy Learners
A class of non-parametric learning methods that do not generate a model but instead make use of verbatim training data for classification. They are also known as instance-based learners or rote learners
Nearest Neighbor
A classification approach that assigns a class to unlabeled examples based on the class of similar labeled examples.
Confusion Matrix
A confusion matrix is a table that categorizes predictions according to whether they match the actual value. The class of interest is known as the positive class, while all others are known as negative.
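For a binary problem, the four cells of the table can be counted as in the Python sketch below. The `confusion_counts` helper and the spam/ham labels are made up for illustration.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for the given positive class."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


actual = ["spam", "spam", "ham", "ham", "ham", "spam"]     # illustrative labels
predicted = ["spam", "ham", "ham", "spam", "ham", "spam"]  # illustrative predictions

cm = confusion_counts(actual, predicted, positive="spam")
accuracy = (cm["TP"] + cm["TN"]) / len(actual)
print(cm)        # {'TP': 2, 'FP': 1, 'TN': 2, 'FN': 1}
print(accuracy)  # 4 correct out of 6
```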
Non-Parametric Models
A learning model that does not make strong assumptions about the form of the mapping function is called a non-parametric model. By not making assumptions, non-parametric models are free to learn any functional form from the training data.
Parametric Models
A learning model that summarizes data with a set of parameters of fixed size (independent of the number of training examples). No matter how much data you throw at a parametric model, it won't change its mind about how many parameters it needs.
Splitting
A measure of impurity, such as entropy, is used to decide which feature to split on. The goal is to minimize impurity within each node.
Gain Ratio
A modification of Information Gain that reduces its bias on highly branching features by taking into account the number and size of branches when choosing a feature.
Greedy Learners
Decision trees are part of a group of learners that use all the data available to them on a first-come, first-served basis.
Variance Error
Errors made as a result of not giving the learning algorithm the optimal training data. Leads to low training error but high test error.
Bias Error
Errors made as a result of not choosing the optimal learning algorithm for the training data. Leads to high training error and high test error.
Sparsity and Density
Data sparsity and density describe the degree to which data exists for each feature of all observations. So if a table is 80% dense, then 20% of the data is missing or undefined. This means it is 20% sparse
rpart (Package)
For recursive partitioning in classification and regression trees.
Data Collection
Identifying and gathering the data needed for the learning process.
Applications of Decision Trees
- Credit scoring models - Marketing studies of customer behavior - Diagnosis and prognosis of medical conditions
Decision Tree Strengths
- Does well on most problems - Handles numeric and nominal features well - Does well with missing data - Ignores unimportant features - Useful for both large and small datasets - Output is easy to understand - Efficient and low cost model
Decision Tree Weaknesses
- Splits biased towards features with a large number of levels - Easy to overfit or underfit - Reliance on axis-parallel splits is limiting - Small changes in data result in large changes to decision logic - Large trees can be difficult to interpret or understand
Akaike Information Criterion (AIC)
An estimate of the relative information lost by a given model: the less information a model loses, the higher the quality of the model. Given a set of candidate models from the same dataset, the model with the lowest AIC is the "preferred" model.
Instance
An individual independent example of the concept represented by the dataset. It is described by a set of attributes or features. A dataset is made up of several instances.
The Bias Variance Tradeoff
Bias is the difference between the average prediction of our model and the correct value we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem, which leads to high error on both training and test data. Variance is the variability of the model's prediction for a given data point, that is, how much the predictions spread when the model is trained on different data. A model with high variance pays a lot of attention to the training data and does not generalize to data it has not seen before. As a result, such models perform very well on training data but have high error rates on test data.
Logistic Regression measuring coefficients
Coefficients are estimated using a technique called Maximum Likelihood Estimation (MLE). Unlike the Ordinary Least Squares (OLS) method used by linear regression, MLE for logistic regression has no closed-form solution for the coefficients. Instead, the estimation process is iterative.
class (Package)
Contains functions for k-NN classification, such as knn().
Modeling
Matching the input data to the appropriate machine learning approach in order to solve a problem.
Choosing the right K
Methods: 1) The square root of the number of training examples 2) Test different k values against a variety of test data sets and choose the one that performs best 3) Use weighted voting where the closest neighbors have larger weights *NOTE*: the larger the data set, the less important the difference between two choices for k becomes.
Resolution
Resolution describes the grain or level of detail in the data. If data resolution is too fine, important patterns may be blurred by noise and if the resolution is too coarse, important patterns may disappear.
InformationValue (Package)
Used for maximizing information value in regression and analyzing performance in classification models.
Histograms
What kind of population distribution does the data come from? Where is the data located? How spread out is the data? Is the data symmetric or skewed? Are there outliers in the data?
Limits of Linear Regression
Works well when the response variable is normally distributed, but struggles with non-Gaussian responses.
linear regression
Y: dependent or response variable. X: independent or predictor variables. B: model parameters (unknown parameters to be estimated by the regression model). The goal is to estimate the values for B0 and B1 in order to minimize the sum of squared errors (residuals) between the actual and predicted values.
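For a single predictor, the least-squares estimates of B0 and B1 have a closed form, sketched below in Python. The function name and the sample data are illustrative, and the sketch assumes a model of the form Y = B0 + B1·X plus error.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) that minimize the sum of squared residuals for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx             # slope
    b0 = mean_y - b1 * mean_x  # intercept
    return b0, b1


xs = [1, 2, 3, 4, 5]   # illustrative predictor values
ys = [3, 5, 7, 9, 11]  # generated from y = 1 + 2x with no noise
b0, b1 = fit_simple_linear_regression(xs, ys)
print(b0, b1)  # 1.0 2.0

# Sum of squared errors between actual and predicted values.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
print(sse)  # 0.0 for this noise-free example
```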
