SAS Viya
Types of ensemble models
Single-algorithm ensemble models: gradient boosting and forest. Perturb and combine (or P&C) methods create ensemble models in two steps. The perturb step creates different models by manipulating the distribution of the data or altering the construction method (for example, by changing the splitting criteria). The combine step then uses the predictions of the models built in the perturb step to create a single prediction. Perturb and combine methods can be used with any unstable algorithm, but they are mostly used with trees. The benefit of ensembling many trees is that, by adding more prediction steps, the step functions themselves are essentially smoothed out. The main drawback of ensemble models is that they cannot be interpreted the way a single tree can.
Maximum number of evaluations
specifies the maximum number of tuning evaluations. This option is available only if the search method is Genetic algorithm or Bayesian. The default value is 50. It ranges from 3 to 2,147,483,647.
Maximum number of iterations
specifies the maximum number of tuning iterations. This option is available only if the search method is Genetic algorithm or Bayesian. The default value is 5. It ranges from 1 to 2,147,483,647.
Number of Iterations
specifies the number of iterations of a boosting series. The default initial value is 100. The range is from 20 to 150.
Perturbation
(beyond bagging)
Which of the following statements is true about kernel functions?
A kernel function operates as a dot product in a higher dimension (that is, in a feature space), but it is applied to the raw data.
multilayer perceptron (or MLP)
A basic MLP has three layers: an input layer, a hidden layer, and a target (or output) layer.
To obtain a prediction for a neural network:
After the prediction formula is established, obtaining a prediction is strictly a matter of inserting the input measurements into the hidden units.
optimizing complexity
Applies to neural networks. Optimizing the complexity of a neural network involves controlling the magnitude of the weights. If the weights grow too large, the model will be overfit and will not generalize well. The two main methods of avoiding overfitting are weight decay and early stopping. These methods are often used together.
Bernoulli function
Weight estimates for a binary target are produced by minimizing the error, which is -2 times the log-likelihood. This particular error function is known as the Bernoulli function.
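A minimal Python sketch of this error function, assuming y holds 0/1 targets and p holds the model's predicted probabilities (names are illustrative):

    import math

    def bernoulli_error(y, p):
        # -2 times the Bernoulli log-likelihood (deviance) for a binary target
        return -2 * sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
                        for yi, pi in zip(y, p))

    print(bernoulli_error([1, 0, 1], [0.9, 0.2, 0.7]))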
Which of the following terms refers to the intercept estimate in a neural network?
Bias estimate is the neural network term for an intercept
Examples of mathematical functions
Centering, exponential, inverse, log, range, square, square root, and standardize. These fall under the category of Transformations
Types of decision trees
Classification trees for categorical targets; regression trees for interval targets.
Models that help interpret the results of SVMs
Partial dependence (PD), LIME, and ICE plots.
Which of the following statements is true about pruning decision trees?
Pruning starts by identifying a sequence of candidate subtrees, one for each possible number of leaves, and then selects the best of the candidates. Accuracy is obtained by multiplying the proportion of observations falling into each leaf by the proportion of those correctly classified in the leaf and then summing across all leaves.
Gini index
Pure nodes have a Gini index of 0; the more impure the node, the closer the index is to 1.
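A short Python sketch of one standard formulation of the Gini index, computed from the class counts in a node (illustrative, not the exact software implementation):

    def gini(counts):
        # Gini index from the class counts in a node: 0 means the node is pure
        n = sum(counts)
        return 1 - sum((c / n) ** 2 for c in counts)

    print(gini([50, 0]))   # pure node -> 0.0
    print(gini([25, 25]))  # maximally impure two-class node -> 0.5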
Global minimum
Finding the lowest point in the error space. Formally, a global minimum is a set of weights that generates the smallest amount of error. A simple strategy to ensure that a global minimum has been found is an exhaustive search of the error space.
Discovery phase for SVM
This is the training phase as well. Essential discovery tasks: improve the model, optimize the complexity of the model, and regularize and tune the hyperparameters of the model.
Which of the following statements about scoring are true? Select all that apply.
You can export score code from SAS Model Manager. For scoring, you use scoring data (which is new data), not validation data (which was used during model building).
SAS Viya is which of the following?
a component of the SAS Platform that is interoperable with SAS 9
Annealing Rate
a way to automatically reduce the learning rate as SGD progresses, causing smaller steps as SGD approaches a solution.
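A tiny Python sketch of one common annealing schedule (an assumption for illustration; the exact formula used by the software may differ):

    def annealed_rate(initial_rate, annealing_rate, step):
        # The learning rate shrinks as training progresses, so SGD takes smaller steps near a solution
        return initial_rate / (1 + annealing_rate * step)

    for step in (0, 10, 100):
        print(step, annealed_rate(0.1, 0.05, step))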
SVMs
make decision predictions, ranks, and estimates
Target
the response (dependent) variable
neural network formula
y = w0 + w1*h1 + w2*h2 + ..., where w0 is the bias (intercept), the other w's are weights, and the h's are the hidden unit outputs.
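A minimal Python sketch of this formula as a forward pass through one hidden layer with tanh units (all numbers and names are made up for illustration):

    import math

    def mlp_predict(x, hidden_weights, hidden_biases, output_weights, output_bias):
        # Each hidden unit is a tanh of a weighted sum of the inputs plus a bias
        hidden = [math.tanh(b + sum(w * xi for w, xi in zip(ws, x)))
                  for ws, b in zip(hidden_weights, hidden_biases)]
        # The prediction is the output bias plus a weighted sum of the hidden unit outputs
        return output_bias + sum(w * h for w, h in zip(output_weights, hidden))

    print(mlp_predict([1.0, 2.0],
                      [[0.3, -0.1], [0.2, 0.4]],  # weights into each of two hidden units
                      [0.05, -0.02],              # hidden-unit biases
                      [1.5, -0.7],                # weights from hidden units to the output
                      0.1))                       # output bias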
Number of Surrogate Rules
determines the number of surrogates that are sought. A surrogate is discarded if its agreement is less than or equal to the largest proportion of observations in any branch. As a consequence, a node might have fewer surrogates specified than the number in the Number of Surrogate Rules property. The agreement between two splits can be measured as the proportion of cases that are sent to the same branch. The split with the greatest agreement is taken as the best surrogate.
Essential data tasks
divide the data, address rare events, and manage missing values
training cases
examples, instances, records
autotuning
hyperparameter optimization in SAS. In general, autotuning searches for the best combination of hyperparameters specific to each modeling algorithm. When you decide whether to perform autotuning, keep in mind that autotuning can substantially increase run time.
SAS Viya
is a cloud-enabled, in-memory analytics runtime environment that seamlessly scales for data of any size, type, speed, and complexity.
Binning
is helpful for handling interval input variables. In binning, the original numeric values are grouped into discrete categories called bins. The missing values are assigned to their own bin. The interval variable then becomes a categorical variable.
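A small Python sketch of the idea, with made-up bin boundaries and a separate bin for missing values:

    def bin_value(x, edges):
        # Map a numeric value to a bin label; missing values get their own bin
        if x is None:
            return "MISSING"
        for i, edge in enumerate(edges):
            if x <= edge:
                return "BIN_" + str(i)
        return "BIN_" + str(len(edges))

    edges = [20, 40, 60]  # illustrative bin boundaries
    for value in (15, 35, None, 75):
        print(value, bin_value(value, edges))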
Posterior Probability
is the probability generated by a predictive model.
The variable importance measure
It takes the square root. Further, the Decision Tree node incorporates the agreement between the surrogate split and the primary split in the calculation. Importance is scaled to be between 0 and 1 by dividing by the maximum importance, so larger values indicate greater importance. Variables that do not appear in any primary or saved surrogate splits have 0 importance.
Overfitting in a decision tree
means there are no stopping rules and every case has its own leaf. The maximal tree that results from recursive partitioning is smaller than the largest possible tree but is still likely to be overfit. The maximal tree adapts to both the systematic variation of the target (or the signal) and the random variation (or the noise). At the other extreme, a small tree with only a few branches might underfit the data. It might fail to adapt sufficiently to the signal, which usually results in poor generalization.
L1 Regularization
Penalizes the absolute value of the weights. Different values for L1 are tried within the range defined by From and To. The default initial value for L1 is 0. The default range is from 0 to 10.
Common perturbation methods
resampling, subsampling, adding noise, adaptively reweighting, and randomly choosing from the competitor splits.
non-deterministic results
results are not reproducible
Unsupervised learning
Starts with unlabeled data (that is, data with no target/dependent variable; we don't yet know what we are trying to find). Unsupervised learning algorithms seek to discover intrinsic patterns that underlie the data, such as a clustering or a redundant parameter (dimension) that can be reduced. By contrast, supervised methods use the target variable in the selection process; other methods are unsupervised and ignore the target.
Unsupervised Selection
Unlabeled data is analyzed and organized in ways to show patterns or trends. We can use clustering for example, with this approach. Identifies the set of input variables that jointly explains the maximum amount of data variance. The target variable is not considered with this method. Unsupervised Selection specifies the VARREDUCE procedure to perform unsupervised variable selection by identifying a set of variables that jointly explain the maximum amount of data variance. Variable selection is based on covariance analysis.
Binning
A method of transformation that converts numeric inputs to categories or groups; there are different types of binning. You can use binning to classify missing values, reduce the effect of outliers on a model, or illustrate nonlinear relationships. A binned version of a variable also has a smaller variance than the original numeric variable.
3 types of predictions
Decisions (classifications), rankings (which are relative, e.g., a credit score), or estimates (which predict an expected value). Estimates can be transformed into either decisions or rankings.
Build Models button
In the applications menu. It is a part of discovery in the lifecycle. It is how we access Model Studio. *To access SAS Model Manager, select Manage Models. Later in this course, you use the Applications menu to access Model Manager from Model Studio, and then return to Model Studio. To access SAS Visual Analytics, select Explore and Visualize Data. From SAS Visual Analytics, you can access the SAS Visual Statistics add-on functionality, which enables you to use pipelines. In this course, you do not use SAS Visual Analytics and SAS Visual Statistics.
Predictive modeling
Is supervised learning. It begins with training data
CAS (Cloud Analytic Services)
Is the run-time environment. CAS consists of a controller node and a collection of worker nodes that manage and process distributed data of any size.
Analytics lifecycle
Its purpose is to extract value from data. It includes three phases: 1. Data: data are the foundation of everything you do; at the Data phase, you explore and prepare your data for analysis. 2. Discovery: the act of detecting something that you did not know before; you build and refine multiple models with the goal of selecting the best model for your analysis. 3. Deployment: where you put the model to work; you apply the model to new data, which is a process called scoring.
Machine Learning
Machine learning is AI that learns by model iteration (iterates to "perfection") to make predictions. Three main characteristics of machine learning are automation, customization, and acceleration.
How to address rare events:
Ask: Is the target (outcome) a rare event (for example, credit card fraud at roughly 1 in 1,000)? If so, you can use event-based sampling (EBS), and you need to ensure that each partition has a representative sample of the rare event. EBS: make two samples, one without the event and one with the event, then match each event with one or more nonevent data points (for example, 2, 3, or 4 nonevents per event). The advantage of event-based sampling is that you can obtain (on average) a model of similar predictive power with a smaller overall case count. Although it reduces analysis time, event-based sampling also introduces some analysis complications. Most model fit statistics (especially those related to prediction decisions) and most of the assessment plots are closely tied to the outcome proportions in the training samples. If the outcome proportion in the training and validation samples does not match the outcome proportion in the scoring population, model performance estimates are likely incorrect, and model prediction estimates are biased. Fortunately, Model Studio automatically adjusts assessment measures, assessment graphs, and prediction estimates for this bias.
Surrogate split
A surrogate splitting rule is a backup to the main splitting rule. For example, the main splitting rule might use COUNTY as the input, and the surrogate might use REGION. If COUNTY is unknown and REGION is known, the surrogate is used. If several surrogate rules exist, each surrogate is considered in sequence until one can be applied to the observation. If none can be applied, the main rule assigns the observation to the branch that is designated for missing values.
Ensemble Model
An ensemble model is an aggregation of multiple models. The final prediction from the ensemble model is a combination of predictions from the component models.
Regression Trees
An interval target can have any numeric value, within a certain range, including decimal values.
Linear Regression Selection
Fits and performs variable selection on an ordinary least squares regression predictive model. This is valid for an interval target and a binary target. In the case of a character binary target (or a binary target with a user-defined format), a temporary numeric variable with values of 0 or 1 is created, which is then substituted for the target. Linear Regression Selection specifies the REGSELECT procedure to perform linear regression selection based on ordinary least square regression. It offers many effect-selection methods, including Backward, Forward, Forward-swap, Stepwise methods, and modern LASSO and Adaptive LASSO methods. It also offers extensive capabilities for customizing the model selection by using a wide variety of selection and stopping criteria, from computationally efficient significance level-based criteria to modern, computationally intensive validation-based criteria.
Essential data tasks
Gather the data, explore the data, divide the data, address rare events, manage missing values, replace incorrect values, add unstructured data, extract features, manage extreme or unusual values, and select useful inputs.
Gradient boosting
Gradient boosting is an enhancement of boosting that can be applied to any type of target. The gradient boosting algorithm is similar to boosting, except that at each iteration, the target is the residual from the previous decision tree model. Gradient boosting is available in Model Studio.
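A minimal Python sketch of the residual-fitting idea, using a one-split regression stump as the weak learner (everything here is illustrative, not the software's implementation):

    def fit_stump(x, r):
        # Fit a one-split regression stump to residuals r on a single input x
        best = None
        for split in sorted(set(x)):
            left = [ri for xi, ri in zip(x, r) if xi <= split]
            right = [ri for xi, ri in zip(x, r) if xi > split]
            if not left or not right:
                continue
            lmean, rmean = sum(left) / len(left), sum(right) / len(right)
            sse = sum((ri - lmean) ** 2 for ri in left) + sum((ri - rmean) ** 2 for ri in right)
            if best is None or sse < best[0]:
                best = (sse, split, lmean, rmean)
        _, split, lmean, rmean = best
        return lambda xi: lmean if xi <= split else rmean

    def gradient_boost(x, y, n_trees=10, learning_rate=0.1):
        # At each iteration, the next tree is fit to the residuals of the current ensemble
        prediction = [sum(y) / len(y)] * len(y)   # start from the overall mean
        trees = []                                # kept so new data could be scored later
        for _ in range(n_trees):
            residuals = [yi - pi for yi, pi in zip(y, prediction)]
            tree = fit_stump(x, residuals)
            trees.append(tree)
            prediction = [pi + learning_rate * tree(xi) for pi, xi in zip(prediction, x)]
        return prediction

    x = [1, 2, 3, 4, 5, 6]
    y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.2]
    print(gradient_boost(x, y))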
Fast Supervised Selection
Identifies the set of input variables that jointly explain the maximum amount of variance contained in the target. Fast Supervised Selection specifies the VARREDUCE procedure to perform supervised variable selection by identifying a set of variables that jointly explain the maximum amount of variance contained in the response variables. Supervised selection is essentially based on the AIC, AICC, and BIC stopping criteria.
What causes instability in decision trees?
If one case is omitted or changed, it can have a large impact on the outcome. This results from the large number of univariate splits considered during recursive partitioning and the increasing fragmentation of the data. "A small change in the data can easily result in the selection of a different competitor split, which produces different subsets in the child nodes. The changes in the resulting subsets increase with each new generation of the tree." One reason why simple P&C methods give improved performance is variance reduction. The benefit of ensembling many trees is that, by adding more prediction steps, the step functions themselves are essentially smoothed out.
SAS Project
In Model Studio, the main container for your analytic work is a project. A basic Model Studio project contains a data source, a pipeline that you create, and related project metadata.
complete case analysis
Means no missing values. If missing values appear at random in the input data, you can drop the rows that contain missing values without introducing bias into the model. Techniques for managing missingness during model building include naïve Bayes, decision trees, missing indicators, imputation, and binning. In Model Studio, the naïve Bayes technique and decision tree models do not use complete case analysis; these modeling approaches can incorporate missingness.
Missing values in a decision tree
Nominal: treats missing input values as a separate level of the input variable. A nominal input with L levels and a missing value can be treated as an (L + 1)-level input. If a new case has a missing value on a splitting variable, then the case is sent to whatever branch contains the missing values. Ordinal: modifies the split search strategy for missing values by adding a separate branch adjacent to the ordinal levels. Interval: treats missing values as having the same unknown non-missing value.
Distributed Server: Massively Parallel Processing
One machine acts as the controller and other machines act as workers to process data. One or more machines are designated as worker nodes. Each worker node performs data analysis on the rows of data that are in-memory on the node. The server scales horizontally. If processing times are unacceptably long due to large data volumes, more machines can be added as workers to distribute the workload. It is also fault tolerant.
Training data
The purpose of the training data is to construct a predictive model (that is, a rule) that relates the inputs to the target. The predictive model is a concise representation of the association between the inputs and the target.
CAS action
The smallest unit of functionality in CAS, sends a request to the CAS server. The action parses the arguments of the request, invokes the action function, returns the results, and cleans the resources.
Logistic regression
The default model in the Basic pipeline template.
Transformations can do which of the following? Select all that apply.
-reduce the effect of outliers or heavy tails -standardize inputs to be on the same range or scale -reduce the bias in model predictions -change the shape of the distribution of a variable
Ways to decipher "black box" algorithms
1. Decomposition approximates the parameters in a neural network using a set of IF-THEN rules. 2. A related approach is to use a decision tree to interpret the neural network's predictions.
The essential discovery tasks for neural networks:
1. Select an algorithm. 2. Improve the model. 3. Optimize the complexity of the model. 4. Regularize and tune the hyperparameters of the model. *Ensembles are not considered for neural network models as often as they are for tree-based models.
Local Interpretable Model-Agnostic Explanations (LIME) Plot
A LIME plot creates a localized linear regression model around a particular observation based on a perturbed sample set of data. That is, near the observation of interest, a sample set of data is created. This data set is based on the distribution of the original input data. The sample set is scored by the original model, and sample observations are weighted based on proximity to the observation of interest. Next, variable selection is performed using the LASSO technique. Finally, a linear regression is created to explain the relationship between the perturbed input data and the perturbed target variable. The final result is an easily interpreted linear regression model that is valid near the observation of interest.
Poisson Error Function
A Poisson distribution is usually thought of as the appropriate distribution for count data. Because the variance is proportional to the mean, the Poisson deviance is the error measure used when the target type is interval and the error function is Poisson.
Deployment phase
Includes choosing the champion model and model management (monitoring model performance).
Complete case analysis
What is it called when the model-building process ignores training data cases with missing values of inputs?
GUI
graphical user interface
Lift Chart
A cumulative lift chart is another graphical tool that illustrates the advantage of using a predictive model as compared to not using a model. Cumulative lift is a ratio of response rates. The response rate in the numerator equals the cumulative percentile hits for a given percentile (P) from a given model (M). The response rate in the denominator equals the cumulative percentile hits for the same percentile given no model. We use the CPH (gains chart) information to get the lift statistics. For example, in the customer response example, the CPH for the top 5% of customers, given the model, was 21%. Given no model, the response rate for the top 5% of customers (from a randomly ordered list) is 5%. Thus, the lift at 5% is 21% divided by 5%, which is 4.2. This indicates that, for the top 5% of customers, the model captured 4.2 times as many responders, compared to not using a model. In the lift chart for the customer response example, the Y axis represents the lift and the X axis represents the cumulative percentiles from the list; notice the lift value of 4.2 at the 5% percentile. A lift chart can contain lines for multiple models, which is helpful for selecting the best model in a specific business scenario. The model with the higher lift curve is generally better for model deployment. However, the best model can vary depending on the percentile. For the customer response example, if you expect to contact 20% of your customers, you want to choose the model that performs best at this percentile.
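A tiny Python sketch of the lift calculation described above (the 21% and 5% figures come from the card's customer response example):

    def cumulative_lift(cumulative_pct_hits, depth):
        # Lift at a depth = cumulative percentile hits with the model / hits expected with no model
        return cumulative_pct_hits / depth

    print(cumulative_lift(0.21, 0.05))  # about 4.2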
A decision tree can have only two-way splits.
A decision tree can have multi-way splits.
Dot product
A dot product is a way to multiply vectors that results in a scalar, or a single number, as the answer. It is an element-by-element multiplication, and then a sum across the products.
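A one-function Python sketch:

    def dot(u, v):
        # Element-by-element multiplication, then a sum across the products
        return sum(ui * vi for ui, vi in zip(u, v))

    print(dot([1, 2, 3], [4, 5, 6]))  # 1*4 + 2*5 + 3*6 = 32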
Neural Network
A neural network is a nonlinear model that is designed to mimic neurons in the human brain. A major advantage of nonlinear models is flexibility: you can predict very complex surfaces with neural networks. *Nonlinear model types: a parametric nonlinear model requires a hypothesized relationship (functional form) and always needs to be optimized after its parameters are estimated. *Neural networks do not require the functional form to be specified, which enables you to construct models when the relationship between the inputs and outputs is unknown. Another route to a flexible (nonparametric-style) regression model is to transform x: one approach is to add a transformation of x to the functional form. A popular choice is a polynomial; for a single input x, polynomial terms are x-squared (quadratic), x-cubed (cubic), and so on. For example, a nonlinear model with a quadratic term is y = w0 + w1*x + w2*x^2.
Regularization Terms L1 and L2
A standard way for regularization is to penalize the weights, preventing them from growing too large. The penalties try to keep the weights small, close to zero, or even zero. An alternative name in literature for weight penalties is weight decay because it forces the weights to decay toward zero. L2 norm: Penalizes the square value of the weight (which explains the 2 in the name). Tends to drive all the weights to smaller values. L1 norm: Penalizes the absolute value of the weight. Tends to drive some weights to exactly zero.
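A small Python sketch of the two penalty terms, which are added to the model's error (objective) function; lam stands for the regularization strength and is an illustrative name:

    def l1_penalty(weights, lam):
        # L1 norm: penalize the absolute value of each weight (can drive some weights to exactly zero)
        return lam * sum(abs(w) for w in weights)

    def l2_penalty(weights, lam):
        # L2 norm: penalize the squared value of each weight (tends to shrink all weights)
        return lam * sum(w ** 2 for w in weights)

    weights = [0.5, -1.2, 0.0, 2.0]
    print(l1_penalty(weights, 0.01), l2_penalty(weights, 0.01))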
Vector
A vector in machine learning refers to the same mathematical concept present in linear algebra or geometry. In simple, intuitive terms, it is an array of numbers describing a specific combination of properties; each number specifies a property in its dimension. If a vector has 3 numbers, it is said to have 3 dimensions. A two-dimensional array of numbers is called a matrix.
When you have an interval target, which of the following fit statistics can you use to select the champion model?
ASE. You can use the average squared error for both categorical (class) and interval targets. The other fit statistics shown here are applicable only to categorical targets.
To improve/optimize SVM
Adjust Hyperparameters: penalty, kernel, and tolerance. The penalty is a term that accounts for misclassification errors in model optimization. The kernel is a mathematical function that operates as a dot product on transformed data in a higher dimension. The tolerance value balances the number of support vectors and model accuracy.
Early Stopping
Also called stopped training, is another method of avoiding overfitting by preventing the weights from growing too large. SAS Visual Data Mining and Machine Learning treats each iteration in the optimization process as a separate model. The iteration with the smallest value of the selected fit statistic on the validation data is chosen as the final model. At the beginning of the optimization process, the neural network model is set up to predict the overall average response rate for all cases. To do so, all weights and the bias in the target layer are set to zero, as shown in the equation. Next, the weights and the biases of the input layer are randomly assigned values close to zero (some positive and some negative). The philosophy of early stopping says that the optimal model comes from an earlier iteration where the objective function on the validation data has a smaller value than it does for the final iteration.
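A minimal Python sketch of the selection rule behind early stopping (the candidate "models" and validation-error curve here are stand-ins for illustration):

    def early_stopping(models_by_iteration, validation_error):
        # Treat every training iteration as a candidate model and keep the one
        # with the smallest validation error, even if training error keeps falling
        errors = [validation_error(m) for m in models_by_iteration]
        best_iteration = errors.index(min(errors))
        return models_by_iteration[best_iteration], best_iteration

    models = list(range(10))  # pretend each integer is the model after that iteration
    chosen, iteration = early_stopping(models, lambda m: (m - 4) ** 2 + 1)  # error bottoms out at 4
    print(chosen, iteration)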
Individual Conditional Expectation
An ICE plot presents a disaggregation of the PD plot to reveal interactions and differences at the observation level. An ICE plot is generated by choosing a plot variable and replicating each observation for every unique value of the plot variable. Then, each replicate is scored. SAS Visual Data Mining and Machine Learning creates a segmented ICE plot. A segmented ICE plot is created from clusters of observations instead of from individual observations. The most useful feature to observe when evaluating an ICE plot of an interval input is intersecting slopes. Intersecting slopes indicate that there is an interaction between the plot variable and the clusters in terms of the predicted outcome. In the course example, even though the lines are not perfectly parallel, they are close enough to conclude that there is a consistent effect for this variable across the five cluster groups.
Why do we need to update models?
Because data changes over time. For example, consider a movie recommendation model that must adapt as viewers grow and mature through stages of life. After a certain period, the error rate on new data surpasses a predefined threshold, and the model must be retrained or replaced. Champion-challenger testing is a common model deployment practice. This method compares the performance of a new, challenger model with the performance of the deployed model on a historic data set at regular time intervals. If the challenger model outperforms the deployed model, the challenger model replaces it. The champion-challenger process is then repeated. Another approach to refreshing a trained model is through online updates. Online updates continuously change the value of model parameters or rules based on the values of new, streaming data.
Bias estimate / Weight
Bias estimate is the neural network term for an intercept. Weight estimate is the neural network term for a parameter estimate or slope. The weights are associated with the predictors and with the lines connecting them to the hidden units; they are the derived parameters.
A support vector machine in Model Studio can be used only for which of the following?
Binary targets. In Model Studio, support vector machines are used exclusively with binary targets
A confusion matrix helps you classify which type of target?
Binary. A confusion matrix displays performance statistics for a model with a binary target.
Confusion Matrix
Calculates and displays assessment measures of model performance for decision predictions on a binary target. This matrix is a cross tabulation of the actual and predicted outcomes, based on a decision rule. A simple decision rule allocates cases to the target class with the greatest posterior probability. For binary targets, this corresponds to a 50% cutoff on the posterior probability. A confusion matrix displays four counts: true positives, true negatives, false positives, and false negatives.
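A compact Python sketch that builds the four counts from a 50% cutoff and derives a few of the related rates (all data are made up):

    def confusion_matrix(actual, predicted_probability, cutoff=0.5):
        # Cross tabulation of actual vs. predicted outcomes for a binary target
        tp = tn = fp = fn = 0
        for y, p in zip(actual, predicted_probability):
            decision = 1 if p >= cutoff else 0
            if y == 1 and decision == 1:
                tp += 1
            elif y == 0 and decision == 0:
                tn += 1
            elif y == 0 and decision == 1:
                fp += 1
            else:
                fn += 1
        sensitivity = tp / (tp + fn)        # true positive rate
        specificity = tn / (tn + fp)        # true negative rate
        accuracy = (tp + tn) / len(actual)
        return tp, tn, fp, fn, sensitivity, specificity, accuracy

    print(confusion_matrix([1, 0, 1, 0, 1], [0.9, 0.4, 0.3, 0.6, 0.8]))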
Which of the following statements about complete case analysis is true?
Complete case analysis can reduce the predictive accuracy of the model.
To solve for SVM for classification
Create 2 constraints. The first constraint is that if the target is 1, then the solution for H must be greater than or equal to 1, because it must fall on the correct side of the hyperplane. The second constraint is that if the target is -1, then H must be less than or equal to -1 in order to fall on the correct side of the hyperplane. Because the binary target is defined as 1 or -1, we can combine the two constraints into a single constraint.
Entropy
Cross or relative entropy is for nominal targets (including binary targets). It is identical to the Bernoulli distribution if the target is binary, but it offers some advantages over the Bernoulli distribution when the data are proportions. The entropy deviance estimate is based on the cross-entropy between the actual and predicted class probabilities; a sketch follows.
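The exact formula did not survive in these notes; the sketch below uses the standard cross-entropy deviance form, -2 times the sum over cases and classes of the actual value times the log of the predicted probability (an assumption about the intended formula, not the software's notation):

    import math

    def entropy_deviance(targets, probabilities):
        # -2 * sum over cases and classes of y * log(p); with two classes this
        # reduces to the Bernoulli error function
        return -2 * sum(y * math.log(p)
                        for ys, ps in zip(targets, probabilities)
                        for y, p in zip(ys, ps))

    targets = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]                          # one-hot actuals (made up)
    probabilities = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]]  # predicted class probabilities
    print(entropy_deviance(targets, probabilities))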
Which of the following machine learning models is the easiest to interpret?
Decision Tree. Decision trees are highly interpretable because they are based on English rules, which are rules that use Boolean logic. However, Model Studio's interpretability feature is available for all machine learning models, including decision trees. You might want to use this feature, for example, when you are comparing decision trees to other types of models.
After opening a Model Studio project, you can access the Exchange without leaving the project.
False
For neural networks, model generalization depends more on the number of weights than on the magnitude of the weights.
False
If you build a decision tree model to predict a binary target, the rules in that decision tree are limited to two-way splits.
False
In Model Studio, when you add any node to a pipeline, the Model Comparison node is added automatically.
False
Optimization methods are used to efficiently search the complex landscape of the error surface to find an error maximum.
False. Optimization methods are used to efficiently search the complex landscape of the error surface to find an error minimum.
For neural networks, model generalization depends more on the number of weights than on the magnitude of the weights.
False. Model generalization depends more on the magnitude of the weights than on the number of weights. In fact, large weights are responsible for overfitting.
Perturb and combine methods are used only with decision trees.
False. Perturb and combine methods can be used with any unstable algorithm, but they are mostly used with trees.
To use the C-statistic to assess model performance, you need to know the cutoff.
False. You do not need to know the cutoff to use the C-statistic. The cutoff is used in the calculation of the C-statistic.
Optimization methods are used to efficiently search the complex landscape of the error surface to find an error maximum.
False. Optimization methods are used to efficiently search the complex landscape of the error surface to find an error minimum.
Which of the following is a common interface for SAS Viya applications, that enables you to easily view, organize, and share your content from one place?
SAS Drive
Estimate Predictions
For a binary target, estimate predictions are the probability of the primary outcome for each case. Squared error: the squared difference between a target and an estimate. Averaged over all cases, squared error is a fundamental assessment measure of model performance. When calculated in an unbiased fashion, the average squared error is related to the amount of bias in a predictive model. A model with a lower average squared error is less biased than a model with a higher average squared error. For estimate predictions, there are at least two commonly used performance statistics. Schwarz's Bayesian criterion (SBC) is a penalized likelihood statistic; this likelihood statistic can be thought of as a weighted average squared error.
Estimation process for binary target
For a binary target, the weight estimation process is driven by an attempt to minimize the negative of two times the log-likelihood function. The log-likelihood function is the sum of the log of the primary outcome training cases plus the sum of the log of the secondary outcome training cases. Bernoulli function
Forest Model
For a categorical target, the forest model's prediction is either the most popular class (as determined by a vote) or the average of the posterior probabilities of the individual trees. For an interval target, the forest model's prediction is the average of the estimates from the individual decision trees. Trees in a forest created in Model Studio differ from each other in two ways: each tree is created on a different sample of the cases, and each splitting rule is based on a different sample of the inputs. This process ensures that the individual models in the ensemble are more varied. The process that the forest algorithm uses to build the individual trees and then combine the predictions results in a more stable model than a single tree. Training each tree with different data reduces the correlation of the predictions of the trees. This, in turn, is likely to improve the predictions of the forest as compared to the naïve method of using the same data to build all the trees in a forest. Uses bagging (bootstrap aggregation).
out-of-bag sample
For each tree in the forest, the data that are withheld from training. Model assessment measures (such as misclassification rates and average squared error) and iteration plots are constructed on both the entire training data set and the out-of-bag sample. So, you can think of the out-of-bag sample as another validation data set.
Which of the following is a difference between gradient boosting and forest models in Model Studio?
Forests are based on bagging. Gradient boosting models are based on boosting. Forests and gradient boosting models are based on different P&C methods.
Prediction based decision-making is used in these 4 areas
Fraud, targeted marketing, financial risk, and customer churn (attrition).
Search Options specifies the options for autotuning searching. The following options are available:
Genetic algorithm uses an initial Latin hypercube sample that seeds a genetic algorithm. The genetic algorithm generates a new population of alternative configurations at each iteration. Latin hypercube sample performs an optimized grid search that is uniform in each tuning parameter, but random in combinations. Random generates a single sample of purely random configurations. Bayesian uses priors to seed the iterative optimization. Grid searching uses the lower bound, upper bound, and midrange values for each autotuned parameter, with the initial value or values, used for the baseline model.
Gradient Descent
Gradient descent is like taking leaps in the current direction of the slope, and the learning rate is like the length of the leap that you take. It is used to minimize the error function.
SVMs
Have 2 or more inputs. *With 2 dimensions (2 inputs), the classifier is a line (linear classification), written H, with a slope vector w and a bias b. H has an equation that produces the line; when a point falls exactly on the line, the value of H is 0. A dot product is one way to multiply vectors; it returns a scalar, or single number, as the answer. To train a support vector machine model, we select w and b so that the line separates the values of the target. *With 3 dimensions, H is a plane. *With 4 or more, a hyperplane.
signal-to-noise ratio
High means the model fits the data well; low means it does not. Use the simplest neural network (Ockham's razor).
Stochastic Gradient Descent
Gradient descent is an optimization algorithm that iteratively finds the minimum value of a function. Stochastic gradient descent is best for big data and for data with redundancy (for example, clusters). It randomly picks one sample at each step and uses it to calculate all of the derivatives, which dramatically reduces the number of terms.
Gamma Error Function
In a gamma distribution, the variance is proportional to the square of the mean. It is often used when the target represents an amount. The gamma deviance function is given by the following:
CAS action set
Is a collection of actions that group functionality (for example, simple summary statistics)
Backpropagation Algorithm
Is accomplished by using the learning rate (r) in the update equation w_new = w0 - r * (∂C/∂w), where ∂C/∂w is the gradient (the steepness of the slope). Plug the first weight (for example, 0.8) into the equation; the result is plugged into the equation at the next iteration. Iterate until convergence. Then we back propagate to the next level of weights.
Assessment measure
Is based on two factors: the target measurement scale and the prediction type. For decision predictions, you might choose from accuracy, misclassification, or profit or loss. For rankings, you might choose from the C-statistic (which is the area under the ROC curve) or the Gini coefficient. And for estimates, you might choose from the average square error, root mean square error, Schwarz criterion, or Akaike's information criterion.
Learning Rate (n)
It determines how quickly or how slowly you want to update the parameters. Usually, you can start with a large learning rate, and gradually decrease the learning rate as the training progresses. The learning rate is defined in the context of optimization, and it is related to minimizing the error function (in this particular case, of a neural network). In Model Studio, the learning rate parameter is available only when the SGD optimization method is used.
Numeric Optimization methods
LBFGS and variants of gradient descent (SGD). These methods provide the update vector; they are how we find the minimum of the error function.
Which component of a decision tree provides the predictions?
Leaf nodes. A tree's leaf nodes provide the predictions. Leaf nodes are child nodes, but not all child nodes are leaves. Generation 0 is the root level of the tree, which has one root node. The root node does not make predictions. The interior nodes are all the nodes between the root node and the leaves.
In Model Studio, the Data Exploration node is in which group?
Miscellaneous
How does Model Studio Select a Champion Model?
Model Studio uses a default assessment measure, which varies by the type of target. If you want, you can specify a different assessment measure. Model Studio computes all assessment measures for each available data partition (train, validate, and test). By default, Model Studio selects a champion model based on the validation data set unless a test data set is available. If you want, you can specify a different data set. You compare the performance of competing models using many selection statistics; there are different selection statistics for class (categorical) versus interval targets. In Model Studio, you can specify selection statistics for model comparison in two places: in Project settings, under Rules (these settings apply to all pipelines in the project), and in the Model Comparison node properties (these settings apply to the pipeline that contains that instance of the Model Comparison node).
Scoring
Model deployment. Before a model is scored, it is translated into score code.
In the results from the Variable Selection node is a Variable Selection table with a Reason column. What does it mean when the Reason column for an input is blank?
The input was selected.
weight decay
Model generalization depends more on the magnitude of the weights than on the number of weights. Very large magnitude weights tend to generate an irregular fit to the data because the model adapts to noise (that is, random variation). In other words, large weights are responsible for overfitting. The weight decay method penalizes large weights by adding a term at the end of the objective function. This penalty term is the product of lambda (the decay parameter) and the sum of the squared weights. Lambda commonly ranges from zero to one, although Model Studio does allow values larger than 1. Specifying too large a penalty term risks underfitting the data. However, the advantages of weight decay far outweigh its risks: the minima are constrained closer to the origin, so the weights used to find the minima are smaller (and thus less likely to overfit).
Pros of nonlinear models
Neural networks are most appropriate for pure prediction tasks. Neural networks are meant to overcome increased dimensionality. Neural networks do not require the functional form to be specified, which enables you to construct models when the relationship between the inputs and outputs is unknown. A neural network is a universal approximator, which means that it can model any input-output relationship, no matter how complex (unlimited flexibility).
Cons of neural networks
Nonlinear regression models are more difficult to estimate than linear models: in addition to specifying the functional form, it is necessary to use an optimization method to efficiently guide the parameter search process. Other drawbacks are a lack of interpretability and the need for a strong signal in the data. The lack of interpretability is the basis for the well-known black box objection to neural networks.
Types of error functions
Normal, Poisson, gamma, and entropy.
When early stopping is used, where does the optimal model come from?
One of the iterations that occur earlier than the final training iteration The philosophy of early stopping says that the optimal model comes from an earlier iteration where the objective function on the validation data has a smaller value than it does for the final iteration. The optimal model does not necessarily need to come from the iteration immediately before the final iteration.
Accounting for errors when the data does not separate evenly-Method #1
One solution is a support vector machine known as a soft-margin hyperplane. Soft margin means that a line can separate most of the points but some errors occur. We account for errors by using a penalty term in the optimization process. This penalty is the product of two quantities: 1. an error weight, which is a regularization parameter often denoted by C, and 2. the distance between a point in error and the hyperplane (the separating line). We use the Lagrange equation to do this.
The limited memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) method
An optimization method that estimates parameters; it is better suited to big data than BFGS. This is an algorithm for finding local extrema (minima and maxima) of functions, based on Newton's method of finding stationary points of functions; ∇f denotes the gradient of the function and H its Hessian. LBFGS stores only a few vectors that represent the approximation implicitly; this is the difference that saves memory. The algorithm starts with initial estimates of the optimal value of the weights and progresses iteratively to improve the estimates of the weights. The derivatives of the function of the estimates are used to drive the algorithm to find the direction of the steepest descent. The derivatives are also used to find the estimate of the Hessian matrix (second derivative).
Hyperparameters for SVM
Penalty (0.000001 to 1) and polynomial degree (1 to 3). Additional hyperparameters are available for adjustment, the same as for neural networks.
The confusion matrix is the foundation for which of the following assessment plots?
ROC chart. The confusion matrix contains statistics that are the foundation for the ROC chart.
Imputation
Refers to replacing a missing value with information that is derived from nonmissing values in the training data. This is the approach that we'll use in this course.
Biased and incomplete data, sparsity, and high-dimensionality are common data collection challenges. What should you do to overcome these challenges? Select all that apply. Hint: Refer to Best Practices for Common Data Preparation Challenges, an earlier item on the course Contents tab.
All of the listed actions are recommended for these common data collection challenges: select and/or extract features to reduce data dimensions; take time to understand the business problem and its context; change the representation of the data by applying appropriate transformations; and enrich the data.
Sensitivity
Sensitivity, the true positive rate, is the number of true positive decisions divided by the total number of known primary cases. Used for the ROC chart.
After a pipeline is run, which of the following can you do using the Manage Variables node?
Set up imputation and transformation rules.
Which of the following statements is true regarding tree-based models?
Small changes in the training data can cause large changes in the structure of a tree. Even small changes in the training data can cause large changes in the tree structure. However, despite these changes, the overall performance of trees usually remains stable.
Support vector machines
Support vector machines automatically discover any relationship between the inputs and the target, which means you don't need to specify the relationship before modeling. Support vector machines have been used in fields such as image classification, handwriting recognition, financial decision making, and text mining. Unlike trees and neurons, a support vector is not something that most people can visualize.
ROC Chart
The ROC chart is a commonly used graphical representation of model performance for a binary target. ROC stands for receiver operating characteristic. ROC charts are based on measures of sensitivity and specificity. The sensitivity of a model is the true positive rate. The specificity of a model is the true negative rate; the chart's horizontal axis is 1 minus specificity (the false positive rate). The rate on each axis ranges from 0 to 1, so the plot is contained within a unit square. The rates are calculated for a series of decision rules (in other words, cutoffs) between 0% and 100%. However, the cutoffs are not shown in the chart. The area under the ROC curve is a measure also known as the C-statistic or the ROC index. The larger the area under the curve, the higher the C-statistic and the better the model's classification accuracy. Models whose true positive rates and false positive rates are approximately equal are weak models.
Which statement is true about the input columns that the Text Mining node creates?
The Text Mining node creates a new input column for each topic it discovers.
CPH (Cumulative Percentile Hits), i.e., Gains Chart
The cumulative percentile hits chart (or CPH chart) illustrates the advantage of using a predictive model to make business decisions as compared to not using a model. It looks like a bar chart version of the ROC chart. The CPH chart has twenty percentiles (that is, twenty bars). The first bar shows that 5% of the event cases are captured in the top 5% of customers. The selected fraction of cases (in this example, 5%) is known as the depth. Because the graph is cumulative, the second bar shows that 10% of the event cases are captured in the top 10% of customers.
How do SVMs make predictions?
The formula for H returns predictions as positive or negative values, where the sign, not the magnitude, is the critical aspect. Points on one side of the line (the points in the direction of the normal vector) return a positive value. Points on the other side of the line return a negative value.
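A small Python sketch of this sign-based decision (w, b, and the points are arbitrary illustrative values):

    def svm_predict(w, b, x):
        # Score H = w·x + b; the sign, not the magnitude, gives the predicted class
        h = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1 if h >= 0 else -1

    w, b = [0.8, -0.5], 0.2
    print(svm_predict(w, b, [1.0, 0.3]))   # positive side of the hyperplane -> 1
    print(svm_predict(w, b, [-1.0, 2.0]))  # negative side -> -1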
Which of the following statements is true regarding decision trees?
To predict cases, decision trees use rules that involve the values or categories of the input variables. To predict cases, decision trees use Boolean (if-then-else) rules that involve the values or categories of the input variables.
Momentum
The momentum term is a fraction of the last update vector and is added to the new weight value in order to prevent a rapid change in direction during the search. This ensures that the values of the weights do not oscillate around the true minima, but rather converge to it. In Model Studio, the momentum parameter is available only when the SGD optimization method is used. Also, the momentum parameter is not available in the Autotune feature.
Partial Dependence Plot
Shows how a model's predicted outcome changes as one input variable varies, with the effect averaged over the other inputs. (The ICE plot, described above, is a disaggregation of the PD plot to the observation level.)
Schedule
The rate at which the learning rate changes from large to small
Specificity
The true negative rate: the number of true negative decisions divided by the total number of known secondary cases. Used for the ROC chart.
Hidden layer units
These are nonlinear transformations of the predictors (the hidden layer uses a hyperbolic function). When there is a single hidden layer, the hidden units can be thought of as regressions applied to linear combinations of the input variables. Hidden unit regressions include an activation function: a mathematical transformation that is applied to the input layer, usually in the form of the hyperbolic tangent. You simply specify the correct number of hidden units and find reasonable values for the weights; specifying the correct number of hidden units involves some trial and error. (Interval targets use least squares and binary targets use logistic regression, but we adjust both to provide a deviance estimation.)
If you build a decision tree model to predict a binary target, the rules in that decision tree are limited to two-way splits.
This statement is false. The number of splits per rule does not depend on the type of target. If the split is based on binary input (for example, whether a person owns a home), a two-way split is the only possibility. As you learn later, you can specify the maximum number of splits that Model Studio uses to build a decision tree.
Normalizing
Is another rescaling method. Most commonly, normalizing rescales numeric data to between 0 and 1 using (x - xmin) / (xmax - xmin), where xmin is the variable's minimum value and xmax is the variable's maximum value.
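A one-function Python sketch of min-max normalization:

    def normalize(values):
        # Rescale numeric data to the 0-1 range: (x - xmin) / (xmax - xmin)
        xmin, xmax = min(values), max(values)
        return [(x - xmin) / (xmax - xmin) for x in values]

    print(normalize([10, 20, 25, 40]))  # [0.0, 0.333..., 0.5, 1.0]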
Model Studio is an interface for SAS Visual Data Mining and Machine Learning and two other SAS solutions: SAS Visual Text Analytics and SAS Visual Forecasting.
True
When you specify a transformation as metadata for specific variables on the Data tab, that metadata rule overrides the default method specified in the Default interval inputs method property (None) for the Transformations node.
True
To construct a support vector machine, only the observations closest to the separating hyperplane are used. This avoids the curse of dimensionality.
True. Using only the observations closest to the separating hyperplane avoids the curse of dimensionality by limiting the number of data points in the solution.
When the data are not linearly separable, the process of optimizing the location of the hyperplane must account for classification errors.
True. When the data are not linearly separable, the hyperplane will misclassify some data points. In this situation, the process of optimizing the location of the hyperplane must account for these classification errors.
Which data partition does Model Studio use by default to select the champion model?
Validation
Validation method
Validation method specifies the validation method for finding the objective value. If your data is partitioned, then that partition is used, and Validation method, Validation data proportion, and Cross validation number of folds are all ignored. Partition specifies using the partition validation method. With partition, you specify proportions to use for randomly assigning observations to each role. Validation data proportion specifies the proportion of data to be used for the Partition validation method. The default value is 0.3. K-fold cross validation specifies using the cross validation method. In cross validation, each model evaluation requires k training executions (on k-1 data folds) and k scoring executions (on one holdout fold). This increases the evaluation time by approximately a factor of k. Cross validation number of folds specifies the number of partition folds in the cross validation process (the k defined above). Possible values range from 2 to 20. The default value is 5.
Accounting for errors when the data does not separate evenly--Method #2
We turn the 2D data into a 3D space to find the maximum-margin hyperplane. We do this by squaring the data and using 2 inputs, the original and the squared form. The math to do this is a kernel function, not a dot product. We do not need to know the transformation itself! We call this a trick because we do not need to know exactly what the feature space looks like. It is enough to specify the kernel function as a measure of similarity. The geometric interpretation remains the same because the solution is still a hyperplane. There are linear and polynomial kernels.
error space
is defined by the error function and the weights in the model.
Neural Network Trains data by:
We use an optimizer (also known as a learning algorithm), and to compute the gradient we use backpropagation. Stochastic gradient descent (SGD) is an optimization method used, for example, to minimize a loss function. In SGD, you use one example at each iteration to update the weights of your model, depending on the error due to that example, instead of using the average of the errors of all examples (as in plain gradient descent) at each iteration. To do so, SGD needs to compute the gradient of your model. Backpropagation is an efficient technique for computing this gradient: an efficient method of computing gradients in directed graphs of computations, such as neural networks. It is not a learning method but a computational technique that is often used inside learning methods; it is an implementation of the chain rule of derivatives that computes all required partial derivatives in time linear in the graph size (whereas naive gradient computation would scale exponentially with depth). SGD is one of many optimization methods (a first-order optimizer, meaning it is based on analysis of the gradient of the objective), so for neural networks it is often applied together with backpropagation to make efficient updates. You could also apply SGD to gradients obtained in a different way (from sampling, numerical approximators, and so on), and, symmetrically, you can use other optimization techniques with backpropagation, anything that can use the gradient or Jacobian. The common misconception comes from the fact that, for simplicity, people sometimes say "trained with backprop," which actually means (if they do not specify the optimizer) "trained with SGD, using backprop as the gradient-computing technique." Older textbooks also use terms such as the "delta rule," which describe the same thing. Thus there are two layers of abstraction: gradient computation, where backpropagation comes into play, and the optimization level, where techniques like SGD, Adam, Rprop, and BFGS come into play and (if they are first order or higher) use the gradient computed above.
Which of the following terms refers to a parameter estimate or slope that is associated with an input in a neural network?
Weight estimate is the neural network term for a parameter estimate or slope.
What to consider with decision predictions:
With a binary target, you typically consider two decision types: the primary decision, corresponding to the primary outcome, and the secondary decision, corresponding to the secondary outcome. True positive: matching the primary decision with the primary outcome, which yields a correct decision. True negative: matching the secondary decision with the secondary outcome, which yields a correct decision. False positive: mismatching the primary decision to the secondary outcome, which yields an incorrect decision. False negative: mismatching the secondary decision with the primary outcome, which yields an incorrect decision. For decision predictions, the Model Comparison tool rates model performance based on accuracy or misclassification, profit or loss, and the Kolmogorov-Smirnov (KS) statistic. Accuracy and misclassification tally the correct or incorrect prediction decisions. The Kolmogorov-Smirnov statistic describes the ability of the model to separate the primary and secondary outcomes. The Kolmogorov-Smirnov (Youden) statistic is a goodness-of-fit statistic that represents the maximum distance between the model ROC curve and the baseline ROC curve.
Ranking predictions
With ranking predictions, a score is assigned to each case. The basic idea is to rank the cases based on their likelihood of being a primary or secondary outcome. Concordance: when a pair of primary and secondary cases is correctly ordered. Discordance: when a pair of primary and secondary cases is incorrectly ordered. Tied pair: when a pair of primary and secondary cases is ordered equally. For ranking predictions, two closely related measures of model fit are commonly used. The ROC index is like concordance (described above): it equals the percent of concordant cases plus one-half times the percent of tied cases. The Gini coefficient (for binary prediction) equals 2 * (ROC index - 0.5).
For a support vector machine, the classifier model has which of the following elements?
a normal vector and a bias term. The classifier model (H) has two elements: a normal vector w and a bias term b.
"Learning" and when to stop
a numerical optimization method that you specify to estimate the weights that minimize the error function. The decision to stop is based on meeting at least one of three convergence criteria: The specified error function stops improving. The gradient has no slope (in other words, the rate of change of the error function with respect to the weights is zero). Or the magnitude of the weights stops changing substantially.
Numerical optimization
Also known as learning; it optimizes by iteratively updating the weights until the algorithm finds a minimum in the error space.
Suppose you are modeling data with a binary target and three inputs. The data are linearly separable. How many possible solutions exist that classify the target?
an infinite number
SVM
A classifier, a classifier model, or a classification rule. Support vectors (also called carrying vectors) are the points in the data that are closest to the hyperplane. These points, and only these points, determine the exact location of H; they are the dots on either side of the line.
What is it called when the model-building process ignores training data cases with missing values of inputs?
complete case analysis
Learning Rate
controls the size of the weight changes. It ranges from 0 (exclusive) to 1. The default initial value is 0.1. The default range is from 0.01 to 1.
a neural network's predictions can be
decisions, rankings, or estimates
Normal Error Function
the default error function for fitting an interval target; it is based on the normal distribution.
Suppose you build a decision tree to predict whether a customer makes a purchase on the internet (yes or no). A leaf has perfect purity when it contains which of the following?
either all events or all nonevents Given a binary target, a leaf has perfect purity when all its cases have the same target outcome, but it does not matter whether they are events or nonevents. If multiple cases in the training data set have the same values of all the input variables but different target values, it is not possible to achieve perfect purity in the leaf that contains those cases.
Gradient descent
estimates parameters. It can optimize different types of models (regression, logistic regression, clustering, and so on). Example: finding the optimal value of the intercept in the model predicted height = intercept + slope * weight. 1. Pick a random starting value for the intercept. 2. Find each residual (the observed value minus the predicted value). 3. Square the residuals for all data points and add them together to get the sum of squared residuals (SSR). 4. Plotting the SSR against candidate intercept values gives a curve; the goal is its lowest point. 5. Rather than evaluating every candidate, gradient descent takes the derivative of the SSR with respect to the intercept. 6. It uses that derivative to step the intercept toward the value where the SSR is lowest, taking smaller steps along the x-axis as the slope of the curve approaches 0 (see the sketch below).
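A minimal sketch of these steps in Python (the toy data, the fixed slope, and the stopping tolerance are assumptions for illustration; this is not SAS code):

```python
import numpy as np

# Observed data: weight (input) and height (target); slope held fixed for simplicity.
weight = np.array([0.5, 2.3, 2.9])
height = np.array([1.4, 1.9, 3.2])
slope = 0.64                      # assumed known, so only the intercept is learned

def ssr(intercept):
    """Sum of squared residuals for a candidate intercept."""
    predicted = intercept + slope * weight
    return np.sum((height - predicted) ** 2)

def d_ssr(intercept):
    """Derivative of the SSR with respect to the intercept."""
    predicted = intercept + slope * weight
    return np.sum(-2 * (height - predicted))

intercept = 0.0                   # step 1: pick a starting value
learning_rate = 0.1
for step in range(100):
    gradient = d_ssr(intercept)          # steps 5-6: use the derivative
    step_size = learning_rate * gradient # steps shrink as the slope nears 0
    intercept -= step_size
    if abs(step_size) < 0.001:           # stop when the steps become tiny
        break

print(round(intercept, 3), round(ssr(intercept), 3))
```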
The reason for building a decision tree that allows for three-way splits, compared to a tree that allows for two-way splits, is improved model performance.
false
When you create a project in Model Studio, event-based sampling is used (that is, turned on) by default.
false
Which of the following techniques do deep learning models use to overcome the computational challenges associated with multiple hidden layers?
fast moving gradient-based optimizations Deep learning models use fast moving gradient-based optimizations, such as Stochastic Gradient Descent, for this purpose.
hyperbolic tangent
is a shift and rescaling of the logistic function.
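As a worked identity (standard math, not specific to SAS), the shift and rescaling can be written explicitly:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = 2\,\sigma(2x) - 1, \qquad \sigma(z) = \frac{1}{1 + e^{-z}},$$

so the hyperbolic tangent squashes its input into the range (−1, 1) instead of (0, 1).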
Hyperparameters include:
learning rate, the annealing rate, the regularization terms L1 and L2, and the momentum.
Which default model is in the Basic template for class target?
logistic regression
To find the best SVM use:
the maximum-margin hyperplane. With two inputs, the thickest (or "fat") line that separates the classes is a good starting place for the "best" line because it results in the largest margin of error on the positive and negative sides. The best solution is the exact center, or median, of the fat line. Taking the exact center of the fat line produces a unique solution known as the maximum-margin hyperplane. This solution, represented as H in the diagram, has the greatest margin of error on either side.
Neural networks are universal approximators. This means that neural networks can do which of the following?
model any input-output relationship, no matter how complex Neural networks are called universal approximators because they can model any input-output relationship, no matter how complex.
After you create a new project, Model Studio takes you to the Data tab. Which of the following can you do on the Data tab?
modify variable roles and measurement levels
components of a neural network
neurons (units) and parameters, which are associated with the connections between neurons in adjacent layers
A variable can be:
numeric (interval) or categorical (usually nominal)
When early stopping is used, where does the optimal model come from?
one of the iterations that occur earlier than the final training iteration The philosophy of early stopping says that the optimal model comes from an earlier iteration where the objective function on the validation data has a smaller value than it does for the final iteration. The optimal model does not necessarily need to come from the iteration immediately before the final iteration.
L2 Regularization
penalizes the squared values of the weights. Different values for L2 are tried between the range established by From and To. The default initial value for L2 is 0. The default range is from 0 to 10.
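As a sketch of the penalized objective (the symbols are illustrative: E is the unpenalized error function, the w_j are the network weights, and λ is the L2 value being tuned):

$$E_{\text{reg}}(w) = E(w) + \lambda \sum_{j} w_j^{2}$$

Larger λ pushes the weights toward zero, which limits their magnitude and helps avoid overfitting.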
What does a caslib do?
provides access to files in a data source and to in-memory tables
Standardizing
refers to rescaling your data to have a mean of 0 and a standard deviation of 1. Common standardizing methods, both sketched below: 1. z-score: z = (x − μ) / σ, where x is the original value, μ is the variable's mean, and σ is the variable's standard deviation. The absolute value of z represents the distance between the raw score and the population mean in units of the standard deviation; z is negative when the original value is below the mean and positive when it is above. 2. Midrange scaling: the midrange of the variable is mapped to 0 and the half range to 1, so the rescaled values fall between −1 and 1.
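Both methods in a short Python sketch (the values are made up; np.std here is the population standard deviation):

```python
import numpy as np

x = np.array([12.0, 15.0, 20.0, 22.0, 31.0])   # illustrative raw values

# 1. z-score: distance from the mean in units of the standard deviation
z = (x - x.mean()) / x.std()

# 2. Midrange scaling: midrange maps to 0 and the half range maps to 1,
#    so the rescaled values fall between -1 and 1
midrange = (x.max() + x.min()) / 2
half_range = (x.max() - x.min()) / 2
scaled = (x - midrange) / half_range

print(z.round(2), scaled.round(2))
```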
Maximum time (minutes)
specifies the maximum time (in minutes) for the optimization tuner. The default is 60.
Number of Inputs per Split
specifies the number of inputs evaluated per split. The default value is 100. The default range is from 1 to 100.
Number of evaluations per iteration
specifies the number of tuning evaluations in one iteration. This option is available only if the search method is Genetic algorithm or Bayesian. The default value is 10. It ranges from 2 to 2,147,483,647.
Nominal target objective function
specifies the objective function to optimize for tuning parameters for a nominal target. Possible values are average squared error, area under the curve, F1 score, F0.5 score, gamma, Gini coefficient, Kolmogorov-Smirnov statistic, multi-class log loss, misclassification rate, root average squared error, and Tau. The default value is misclassification rate.
Interval target objective function
specifies the objective function to optimize for tuning parameters for an interval target. Possible values are average squared error, mean absolute error, mean squared logarithmic error, root average squared error, root mean absolute error, and root mean square logarithmic error. The default value is average squared error.
Sample size
specifies the sample size. This option is available only if the search method is Random or Latin hypercube sample. The default value is 50. It ranges from 2 to 2,147,483,647.
Subsample Rate
specifies the subsample rate. The default initial value is 0.5. The default range is from 0.1 to 1.
Which of the following can you specify in the properties panel of the Model Comparison node? Select all that apply.
You can specify all of the following in the properties panel of the Model Comparison node:
- the cutoff to use when applying an ROC-based measure
- the data partition to use for selecting the champion
- the depth to use when applying a lift-based measure
To ensure a good model that does not underfit or overfit
the data are typically divided into two or three non-overlapping sets, which are called partitions. You use the first partition, the training set, to build models. The remaining partitions of the data can be referred to as holdout data. The second partition, the validation set, is used to optimize the complexity of the model and find the sweet spot between bias and variance. Based on the validation data, you tune the models that were built on the training data and determine whether additional training is required. The third partition, the test set, is optional for the model-building process, but some industries might require it as a source of unbiased model performance. The test data set gives honest, unbiased estimates of the model's performance and provides one final measure of how the model performs on new data before the model is put into production.
Metadata
the set of variable roles, measurement levels, and other configurations that apply to your data set
Stemming
The plus sign next to a word indicates stemming (for example, +service represents service, services, serviced, and so on).
Maximal Tree (starting pt. of pruning)
The final tree created with the training data. It is the starting point for optimizing the complexity of the model (that is, pruning). The maximal tree does not account for redundancy, so an additional algorithm is needed.
Outputs
The outputs of the predictive model are known as predictions. Predictions represent the model's best guess about the target given a set of input measurements. The predictive model makes predictions based on what it learns from the source data.
Split Search Process
1. Choose a splitting criterion (for example, CHISQUARE). 2. Compute a p-value for each candidate split. 3. Convert the p-value to a logworth, logworth = −log10(p-value), because p-values tend to be very small. 4. Adjust the logworth to compensate for multiple comparisons. 5. The split search is applied to every input. 6. Inputs whose best logworth fails to exceed the threshold are excluded.
Problems with split search process
1. The minimum number of cases reduces the number of potential partitions for each input in the split search. 2. When you test for the independence of column categories in a contingency table, it is possible to obtain significant (large) values of the chi-squared statistic even when there are no differences in the true, underlying proportions between split branches. 3. As the number of tests increases, the chance of a false positive result likewise increases. 4. Because the significance of secondary and subsequent splits depends on the significance of the previous splits, the algorithm again faces a multiple comparison problem. To compensate for this problem, the algorithm increases the threshold by an amount related to the number of splits above the current split. For binary splits, the threshold is increased by log10(2) × d ≈ 0.3 × d, where d is the depth of the split in the decision tree.
Bagging (or bootstrap aggregation)
A bootstrap sample is a random sample from a training data set. For bagging, a bootstrap sample is drawn with replacement. This means that some of the cases might be left out of the sample, and some cases might be represented more than once. The second step is to build a tree on each bootstrap sample. When you create an ensemble of bagged trees, large trees with low bias and high variance are ideal, so pruning can be counterproductive. In the bagging method, trees are grown independently of each other, so they can be built in parallel in a distributed computing environment like SAS Viya. This makes bagging a relatively fast method compared to other P&C methods. The third step is to combine the predictions by voting or averaging. For classification problems, you take the plurality vote of the predicted class or the mean of the posterior probabilities. Averaging the posterior probabilities often gives slightly better performance than voting. If you have an interval target, you take a mean of the predicted values.
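A simplified Python sketch of the three bagging steps (not the SAS forest implementation; it assumes X and y are NumPy arrays, that both target classes appear in every bootstrap sample, and uses scikit-learn trees as the base learners):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bag_trees(X, y, n_trees=50, random_state=0):
    """Step 1 and 2: draw bootstrap samples with replacement and grow one
    large (unpruned) tree per sample."""
    rng = np.random.default_rng(random_state)
    trees = []
    n = len(y)
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)       # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier()        # no pruning: low bias, high variance
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def bagged_predict(trees, X):
    """Step 3: combine by averaging the posterior probabilities of the event."""
    probs = np.mean([t.predict_proba(X)[:, 1] for t in trees], axis=0)
    return (probs >= 0.5).astype(int), probs
```

Voting on the predicted class would also work, but, as noted above, averaging the posterior probabilities often performs slightly better.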
Caslib
A caslib provides access to files in a data source, such as a database or a file system directory, and to in-memory tables. Access controls are associated with caslibs to manage access to data. You can think of a caslib as a container with two areas where data is referenced: a physical space that includes the source data or files, and an in-memory space that makes the data available for CAS action processing. You can load a SAS data set, database tables, and more to a caslib. The DATA step, the CASUTIL procedure, and CAS actions can be used to load data into CAS. Tables are not automatically saved.
Decision tree v. ensemble tree
A decision tree model is created through recursive partitioning, so all tree-based models use step functions to split the input space. The plot for the single decision tree shows that there are few steps and those steps are large. Therefore, relatively few steps are used to classify the target, which can decrease the accuracy. The plots for the two ensemble models have more steps and those steps are smaller. Because more steps are used to classify the target, accuracy can be increased compared to a single tree. Notice that boosting smooths the prediction surface more than bagging because boosting emphasizes the misclassified cases during the training.
Importance is defined by
A decision tree using PROC TREESPLIT. The Relative Variable importance metric is a number between 0 and 1, which is calculated in two steps. First it finds the maximum residual sum of squares, or RSS, based variable importance. This method measures variable importance based on the change of residual sum of square when a split is found at a node. Second, for each variable, it calculates the relative variable importance as the RSS based Importance of this variable, divided by the maximum RSS based importance, among all the variables. The RSS and Relative importance are calculated from the validation data. If no validation data exist, these two statistics are calculated, instead, from the training data.
Distributed programming
A distributed system, also known as distributed computing, is a system with multiple components located on different machines that communicate and coordinate actions in order to appear as a single coherent system to the end user. The CAS server represents pooled memory and runs code multi-threaded. Multi-threading tends to distribute the same instructions to other available threads for execution, creating many different queues on many different cores using separate allocations or subsets of data. The results are different because each thread works on a different subset of the data. Therefore, results can differ from thread to thread unless and until the individual results from multiple threads are summed together. Distributed systems are extremely efficient because workloads can be broken up and sent to multiple machines. Horizontal scalability: because computing happens independently on each node, it is easy and generally inexpensive to add nodes and functionality as necessary.
Splitting criteria
The goal of splitting is always to reduce the variability of the target distribution and thus increase purity in the child nodes. Splitting criteria can be based on a variety of impurity-reduction measures. Categorical target: CHAID, CHISQUARE, ENTROPY, GINI, IGR. Interval target: CHAID, FTEST, VARIANCE.
Boosting
Boosting is a perturb and combine method for creating tree-based ensemble models for categorical targets. Unlike the bagging method, each tree that boosting creates is dependent on the tree from the previous iteration. Across iterations, the algorithm keeps a cumulative count of misclassifications. Each case is then weighted based on whether its misclassification count increases in the current iteration. If the misclassification count increases, the weight increases. If the misclassification count remains the same, the weight decreases. The weights influence the likelihood that the case is selected in the sample for the next iteration. The boosting process continues in this fashion for a predetermined number of iterations. The number of iterations equals the number of trees in the series. The main advantage of boosting is that the algorithm focuses on the misclassified cases. This improves the performance of the model.
Decision tree structure
Can be categorical or interval. The first node in the tree is called the root. Subsequent rules are named interior nodes. Nodes with only one connection are leaf nodes. The depth of a tree specifies the number of generations of nodes: the root node is generation 0, the children of the root node are the first generation, and so on. By convention, cases that satisfy a rule branch to the left (yes) and cases that do not branch to the right (no). Prediction rules are often referred to as English rules.
Fitting a model
Overfitting occurs when the model is too complex and captures noise in the training data in addition to the underlying pattern; high variance is the result. Underfitting means that the model is too simple and does not take enough of the structure in the data into account; high bias is the result.
Decision tree process
Create an initial tree, then improve the model. Settings allow us to improve the model; they include properties such as maximum depth and leaf size, as well as settings related to the recursive partitioning method that is used to grow the tree and its associated parameters.
Cluster Node Algorithms:
Data is put into groups. 1. the k-means algorithm for clustering interval (quantitative) input variables 2. the k-modes algorithm for clustering nominal (qualitative) input variables 3. the k-prototypes algorithm for clustering mixed input that contains both interval and nominal variables
Dimension Reduction
Dimension reduction decreases the number of variables under consideration. Reducing the dimensionality helps find the true, latent relationship. Model Studio provides several nodes in SAS Visual Data Mining and Machine Learning for dimension reduction.
Persisted in-memory data
Pre-converted, in-memory, pre-loaded data tables. In SAS Viya, all data typically go through an I/O conversion process only once and can be reused as many times as needed thereafter, without incurring the same expense of conversion into a binary, machine-level format. SAS Viya data are either stored within the RAM of a single machine (and run in SMP mode) or within a shared pool of allocated memory created from several networked machines as part of a common memory grid (which enables Massively Parallel Processing, or MPP mode).
Recursive partitioning
Recursive partitioning is the standard method used to grow decision trees. It is a top-down, greedy algorithm; a greedy algorithm makes locally optimal choices at each step. Starting at the root node, recursive partitioning uses an iterative process to select the best split for the node. This process is called a split search. The splitting criterion measures the reduction in variability of the target distribution in the child nodes. The goal is to reduce variability and thus increase purity in the child nodes. The method is recursive because the population can be split any number of times until a stopping criterion is reached.
Run-time environment
Refers to the combination of hardware and software in which data management and analytics occur.
Dimension
Refers to the number of input variables in the data. It is especially important to have a densely populated input space when fitting highly complex models, so for algorithms that do not reduce the number of inputs themselves, it is especially important to reduce the number of inputs during data preparation. Two ways to do this are to remove redundant inputs and to reduce irrelevant inputs.
Data sources that can be used with SAS Viya
Relational and unstructured data, Hadoop, and various file formats (for example, XML, JSON, CSV). These data sources can be located in local or external databases as well as in the cloud.
Relative variability
Relative variability is useful for comparing variables with similar scales, such as several income variables. Relative variability is the coefficient of variation, which is a measure of variance relative to the mean, CV = σ / μ.
Singular Value Decomposition (SVD)
Singular value decomposition (SVD) projects the high-dimensional document and term spaces into a lower-dimensional space. Singular value decomposition is a method of decomposing a matrix A into three other matrices: A = U Σ Vᵀ, where U and V have orthonormal columns and Σ is a diagonal matrix of singular values.
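A small NumPy illustration (the term-by-document matrix is made up) of decomposing a matrix and keeping only the largest singular values to produce low-dimensional document scores:

```python
import numpy as np

# Illustrative term-by-document frequency matrix (rows = terms, columns = documents)
A = np.array([[2., 0., 1., 0.],
              [0., 3., 0., 1.],
              [1., 1., 0., 2.],
              [0., 0., 4., 1.],
              [3., 0., 0., 0.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt

k = 2                                              # keep the 2 largest singular values
doc_scores = (np.diag(s[:k]) @ Vt[:k, :]).T        # documents projected into k dimensions
print(doc_scores.shape)                            # (4 documents, 2 SVD dimensions)
```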
Create Validation Sample from Training Data
Specifies whether a validation sample should be created from the incoming training data. This is recommended even if the data have already been partitioned so that only the training partition is used for variable selection, and the validation partition can be used for modeling.
Statistical modeling versus machine learning
Statistical modeling finds relationships between variables in order to predict outcomes and relies heavily on the modeler. Machine learning focuses on making accurate predictions and depends less on the modeler; it typically pulls from big data rather than a smaller sample set and uses weights in addition to parameters. ML: creates a model for a given task; algorithms are applied to benchmark data sets; emphasizes predictive power over interpretation. Stats: aims at understanding the population in relation to the data in the sample; typically works with smaller quantities of data.
Decision Tree
Supervised machine learning models They require less data preparation and are easy to interpret. Trees follow a decision-split, IF-THEN logic, and can be represented in a tree-like graphical structure. To predict cases (that is, to score data), decision trees use rules that specify a decision based on the values of the input variables. The rules are expressed in Boolean logic, which means that they are IF-THEN-ELSE statements. The rules are arranged hierarchically in a tree-like structure with nodes connected by lines. The nodes represent the rules, and the lines order the rules.
multi-machine massively parallel processing (MPP) configuration
Supported by CAS
single-machine symmetric multiprocessing (SMP)
Supported by CAS. The single machine is designated as the controller. Because there are no worker nodes, the controller node performs data analysis on the rows of data that are in-memory. The single machine uses multiple CPUs and threads to speed up data analysis. This architecture is often referred to as symmetric multiprocessing, or SMP. All the in-memory analytic features of a distributed server are available to the single-machine server. Single-machine servers cannot load data into memory in parallel from any data source.
Variable Selection node
There are 7 types of variable selection. They assist you in reducing the number of inputs by rejecting input variables based on the selection results. This node finds and selects the best variables for analysis by using unsupervised and supervised selection methods. You can choose among one or more of the available selection methods in the variable selection process. Combination Criterion--This is a "voting" method such that each selection method gets a vote on whether a variable is selected. In the Combination criterion property, you choose at what voting level (combination criterion) a variable is selected. In pre-screening, if a variable exceeds the maximum number of class levels threshold or the maximum missing percent threshold, that variable is rejected and not processed by the subsequent variable selection methods. The Advisor option also prescreens.
Interval Variable Moments table
This table displays the interval variables with their associated statistics, which include minimum, maximum, mean, standard deviation, skewness, kurtosis, relative variability, and the mean plus or minus two standard deviations. Note that some of the input variables have negative values. You address these negative values in an upcoming practice.
Topics (inside Text mining node)
This table shows topics created by the Text Mining node. Topics are created based on groups of terms that occur together in several documents. Each term-document pair is assigned a score for every topic. Thresholds are then used to determine whether the association is strong enough to consider whether that document or term belongs in the topic. Because of this, terms and documents can belong to multiple topics. Fifteen topics were discovered, so fifteen new columns of inputs are created. The output columns contain SVD (singular value decomposition) scores that can be used as inputs for the downstream nodes.
Features
To add unstructured data, we convert unstructured text variables to usable numeric inputs. This involves text mining.
Pruning
To build the optimal tree, which is neither too large nor too small, Model Studio starts with the maximal tree and uses a process called bottom-up pruning to remove branches. Bottom-up pruning is also known as post pruning or retrospective pruning. Bottom-up pruning starts at the leaf level and removes branches in a backward fashion using a model selection criterion. Suppose the maximal tree has n leaves. First, the pruning algorithm evaluates all possible sub-trees consisting of n minus 1 leaves. To do this, the algorithm removes (or prunes) one split. Out of all possible sub-trees with n minus 1 leaves, the algorithm selects the sub-tree with the optimal value of the model selection criterion on the validation data. The selected sub-tree is the candidate from all sub-trees with n minus 1 leaves.
Bonferroni Adjustment
To maintain overall confidence in the statistical findings, statisticians inflate the p-values of each test by a factor equal to the number of tests being conducted. If an inflated p-value shows a significant result, then the significance of the overall results is assured. This type of p-value adjustment is known as a Bonferroni correction. For inputs with missing values, two sets of Bonferroni-adjusted logworths are generated. For the first set, cases with missing input values are included in the left branch of the contingency table and logworths are calculated. For the second set of logworths, missing value cases are moved to the right branch. The best split is then selected from the set of possible splits with the missing values in the left and right branches, respectively.
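Following the description above, the correction can be written on the logworth scale (a derivation from the stated rule, not a quote of the SAS documentation): if m tests are conducted, the p-value is inflated to m · p, so

$$\text{adjusted logworth} = -\log_{10}(m \cdot p) = \text{logworth} - \log_{10}(m).$$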
Decision Tree Selection
Trains a decision tree predictive model. The residual sum of squares variable importance is calculated for each predictor variable, and the relative variable importance threshold that you specify is used to select the most useful predictor variables. Decision Tree Selection specifies the TREESPLIT procedure to perform decision tree selection based on CHAID, Chi-square, Entropy, Gini, Information gain ratio, F test, and Variance target criterion. It produces a classification tree, which models a categorical response, or a regression tree, which models a continuous response. Both types of trees are called decision trees because the model is expressed as a series of IF-THEN statements.
Forest Selection:
Trains a forest predictive model by fitting multiple decision trees. The residual sum of squares variable importance is calculated for each predictor variable, averaged across all the trees, and the relative variable importance threshold that you specify is used to select the most useful predictor variables. Forest Selection specifies the FOREST procedure to create a predictive model that consists of multiple decision trees.
Transformations
Transformations are most commonly used to change the shape of the distribution of a variable by stretching or compressing it, reduce the effect of outliers or heavy tails, and standardize inputs to be on the same range or scale.
Unstructured Data--text mining
Unstructured data can be text or non-text (such as images, audio, or video). Non-textual data are often converted to text data, so we focus on text data in this course. Text Mining node enables you to process text data in a document collection. Adding text mining results can improve the predictive ability of models that are based only on structured data In text mining, data are processed in two phases: text parsing and transformation. Text parsing processes textual data into a term-by-document frequency matrix. Transformations such as singular value decomposition (or SVD) change this matrix into a data set that is suitable for data mining purposes. As a result of text mining, a document collection with thousands of documents and terms can be represented in a compact and efficient form.
Gradient Boosting Selection:
Uses Boosting, which combines weak learners sequentially so each new tree corrects the errors of the previous one. Weak learners usually only have one tree split--referred to as a stump. Trains a gradient boosting predictive model by fitting a set of additive decision trees. The residual sum of squares variable importance is calculated for each predictor variable, averaged across all the trees, and the relative variable importance threshold that you specify is used to select the most useful predictor variables. Gradient Boosting Selection specifies the GRADBOOST procedure to create a predictive model that consists of multiple decision trees.
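As an illustration of sequentially correcting errors (a squared-error Python sketch using scikit-learn stumps; this is not the GRADBOOST procedure, and the parameter values are arbitrary):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=100, learning_rate=0.1):
    """Fit an additive series of stumps; each new stump fits the residuals
    (errors) left by the current ensemble, for an interval target."""
    prediction = np.full(len(y), y.mean())       # start from a constant model
    stumps = []
    for _ in range(n_trees):
        residuals = y - prediction
        stump = DecisionTreeRegressor(max_depth=1)   # weak learner: a single split
        stump.fit(X, residuals)
        prediction += learning_rate * stump.predict(X)
        stumps.append(stump)
    return y.mean(), stumps

def gb_predict(base, stumps, X, learning_rate=0.1):
    return base + learning_rate * sum(s.predict(X) for s in stumps)
```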
SAS data types
VARCHAR, INT32, INT64, IMAGE, CHARACTER, and NUMERIC. Variables that are created or loaded using the INT32 or INT64 data types support more digits of precision than the traditional NUMERIC data type. All calculations that occur on the CAS engine maintain the INT32 or INT64 data type.
fault tolerance
When CAS is running in an MPP configuration, it can continue processing requests even if it loses connectivity to some nodes (by using a redundant copy) This communication layer also enables you to remove or add nodes while the server is running.
Convergence with SAS Viya
When an algorithm converges, it has found a parameter set that meets your requirements, that is, it has reached a good fit within the allowed number of iterations. Estimation frequently requires iterative procedures: the more iterations, the more accurate the estimates. But when are estimates accurate enough? When can iteration cease? A practical rule is that convergence is reached when more iterations do not change the interpretation of the estimates. There are many ways for convergence to fail, particularly if a suitable parameter set cannot be found within some maximum number of iterations. Conversely, you can set up unreasonable stopping criteria that lead to convergence with very poor parameters. Lack of convergence can be an indication that the data do not fit the model well, for example because there are too many poorly fitting observations.
Cross validation for small data sets
When the data set is too small to split into training and validation, you can use cross validation. Cross validation avoids overlapping test sets. k-fold cross validation consists of the following steps: Split the data into k subsets of equal size. Use each subset in turn for validation and the remainder for training. In a five-fold cross validation, the initial data set is divided into A, B, C, D, and E subsets. On the first run, the subsets B, C, D, and E are used to train the model, and the subset A is used to validate the model. Then the subsets A, C, D, and E are used to train the model, and the subset B is used to validate. The process continues until all subsets are used for training and validation. Often the subsets are stratified before the cross validation is performed. The error estimates are averaged to yield an overall error estimate.
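A compact sketch of k-fold cross validation in Python (the fit and error arguments are hypothetical placeholders for any model-fitting routine and error measure; X and y are assumed to be NumPy arrays):

```python
import numpy as np

def k_fold_indices(n, k=5, seed=0):
    """Split n row indices into k non-overlapping folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validate(fit, error, X, y, k=5):
    """Train on k-1 folds, validate on the held-out fold, average the errors."""
    folds = k_fold_indices(len(y), k)
    errors = []
    for i in range(k):
        valid = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errors.append(error(model, X[valid], y[valid]))
    return np.mean(errors)     # overall cross-validated error estimate
```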
Deterministic versus nondeterministic
A deterministic algorithm uses the same steps (takes the same path) each time to arrive at the outcome, given the same inputs; think of conventional, definitive algorithms and exact computation of model parameters. A nondeterministic algorithm can use different approaches to arrive at the outcome, given the same set of inputs; sources of nondeterminism include distributed computing environments, randomness in the algorithm, and the convergence behavior of model parameters. In fact, given the same set of inputs, a nondeterministic model might even provide different outcomes, and even if the outcome is the same, how the nondeterministic algorithm arrived at the result can vary from run to run. An example of a nondeterministic algorithm is a probabilistic algorithm.
hyperparameter
a variable that is used to find the optimal model but whose value cannot be estimated from the data. Hyperparameters affect the speed and quality of the model and must be specified externally, either manually or through an automated process. Hyperparameters are sometimes called tuning options. For decision trees, maximum tree depth is an example of a hyperparameter.
neural network
an example of a machine learning model. In a neural network, the weights start close to zero. With each pass through the data, the neural network learns more and refines the weights.
Data preprocessing
can occur at three moments: in a dedicated application (Data Studio), during visual exploration (SAS Visual Analytics), and during execution of a pipeline (Model Studio). Data preprocessing capabilities come in the form of pipeline nodes
machine learning models:
decision trees and other tree-based models, neural networks, and support vector machines.