# SAS Viya

Split Search Process

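1. Choose a splitting criterion (for example, CHISQUARE).
2. Compute the p-value for each candidate split.
3. Convert the p-value to logworth, -log10(p-value), because p-values tend to be very small.
4. Adjust the logworth (shown as the jagged line).
5. Apply the split search to every input.
6. Exclude inputs whose logworth fails to exceed the threshold.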

Problems with split search process

1. The minimum number of cases reduces the number of potential partitions for each input in the split search.
2. When you test for the independence of column categories in a contingency table, it is possible to obtain significant (large) values of the chi-squared statistic even when there are no differences in the true, underlying proportions between split branches.
3. As the number of tests increases, the chance of a false positive result likewise increases.
4. Because the significance of secondary and subsequent splits depends on the significance of the previous splits, the algorithm again faces a multiple comparison problem. To compensate for this problem, the algorithm increases the threshold by an amount related to the number of splits above the current split. For binary splits, the threshold is increased by log10(2) × d ≈ 0.3 × d, where d is the depth of the split on the decision tree.
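
As a minimal sketch of the logworth transformation and the depth adjustment for binary splits, the following Python may help. The function names, the base significance level of 0.2, and the candidate p-value are illustrative assumptions, not values from SAS.

```python
import math

def logworth(p_value: float) -> float:
    """Logworth is -log10(p-value); larger values mean stronger evidence."""
    return -math.log10(p_value)

def depth_adjusted_threshold(base_threshold: float, depth: int) -> float:
    """Raise the logworth threshold by log10(2) per level of depth (binary splits)."""
    return base_threshold + depth * math.log10(2)

# Hypothetical numbers for illustration only.
base = logworth(0.2)         # assumed base threshold, about 0.70
candidate = logworth(0.001)  # logworth of an assumed candidate split
for depth in range(4):
    threshold = depth_adjusted_threshold(base, depth)
    print(depth, round(threshold, 3), candidate > threshold)
```

The same candidate split has to clear a higher bar the deeper it sits in the tree, which is the compensation for multiple comparisons described in item 4.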

Bagging (or bootstrap aggregation)

A bootstrap sample is a random sample from a training data set. For bagging, a bootstrap sample is drawn with replacement. This means that some of the cases might be left out of the sample, and some cases might be represented more than once. The second step is to build a tree on each bootstrap sample. When you create an ensemble of bagged trees, large trees with low bias and high variance are ideal, so pruning can be counterproductive. In the bagging method, trees are grown independently of each other, so they can be built in parallel in a distributed computing environment like SAS Viya. This makes bagging a relatively fast method compared to other P&C methods. The third step is to combine the predictions by voting or averaging. For classification problems, you take the plurality vote of the predicted class or the mean of the posterior probabilities. Averaging the posterior probabilities often gives slightly better performance than voting. If you have an interval target, you take a mean of the predicted values.
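
The first and third steps (drawing a bootstrap sample and combining predictions) can be sketched in plain Python. This is an illustration under assumptions: the helper names and the five posterior probabilities are made up, and the tree-building step is omitted.

```python
import random
from collections import Counter

def bootstrap_sample(cases, rng):
    """Draw len(cases) cases with replacement."""
    return [rng.choice(cases) for _ in cases]

def plurality_vote(predicted_classes):
    """Return the most frequent predicted class."""
    return Counter(predicted_classes).most_common(1)[0][0]

def mean_posterior(posteriors):
    """Average the event probabilities predicted by the individual trees."""
    return sum(posteriors) / len(posteriors)

rng = random.Random(42)
cases = list(range(10))
sample = bootstrap_sample(cases, rng)
left_out = set(cases) - set(sample)  # cases that did not make it into this sample
print(sorted(sample), sorted(left_out))

# Hypothetical posterior probabilities of the event from five bagged trees, one case.
tree_posteriors = [0.45, 0.48, 0.49, 0.90, 0.95]
votes = ["event" if p >= 0.5 else "nonevent" for p in tree_posteriors]
print(plurality_vote(votes))                      # three of five trees say nonevent
print(round(mean_posterior(tree_posteriors), 3))  # the average is above 0.5
```

In this made-up case the two combination rules disagree: voting ignores how confident each tree is, while averaging the posterior probabilities does not.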

Caslib

A caslib provides access to files in a data source, such as a database or a file system directory, and to in-memory tables. Access controls are associated with caslibs to manage access to data. You can think of a caslib as a container with two areas where data is referenced: a physical space that includes the source data or files, and an in-memory space that makes the data available for CAS action processing. You can load a SAS data set, database tables, and more to a caslib. The DATA step, the CASUTIL procedure, and CAS actions can be used to load data into CAS. Tables are not automatically saved.

Decision tree v. ensemble tree

A decision tree model is created through recursive partitioning, so all tree-based models use step functions to split the input space. The plot for the single decision tree shows that there are few steps and those steps are large. Therefore, relatively few steps are used to classify the target, which can decrease the accuracy. The plots for the two ensemble models have more steps and those steps are smaller. Because more steps are used to classify the target, accuracy can be increased compared to a single tree. Notice that boosting smooths the prediction surface more than bagging because boosting emphasizes the misclassified cases during the training.

Importance is defined by

For a decision tree built with PROC TREESPLIT, the relative variable importance metric is a number between 0 and 1 that is calculated in two steps. First, the procedure finds the maximum residual sum of squares (RSS) based variable importance. This method measures variable importance based on the change in the residual sum of squares when a split is found at a node. Second, for each variable, it calculates the relative variable importance as the RSS-based importance of this variable divided by the maximum RSS-based importance among all the variables. The RSS and relative importance are calculated from the validation data. If no validation data exist, these two statistics are calculated from the training data instead.
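
The second step is a simple rescaling. A minimal sketch, assuming the RSS-based importances have already been computed (the variable names and numbers are hypothetical):

```python
def relative_importance(rss_importance):
    """Divide each variable's RSS-based importance by the maximum across variables."""
    max_importance = max(rss_importance.values())
    if max_importance == 0:
        # No variable reduced the RSS, so every relative importance is 0.
        return {name: 0.0 for name in rss_importance}
    return {name: value / max_importance for name, value in rss_importance.items()}

# Hypothetical RSS-based importances for three inputs.
rss = {"income": 120.0, "age": 30.0, "region": 0.0}
rel = relative_importance(rss)
print(rel)
```

The most important variable always gets a relative importance of 1, and the others are expressed as a fraction of it.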

Distributed programming

A distributed system, also known as distributed computing, is a system with multiple components located on different machines that communicate and coordinate actions in order to appear as a single coherent system to the end user. The CAS server represents pooled memory and runs code multi-threaded. Multi-threading distributes the same instructions to the available threads for execution, creating many different queues on many different cores, each using a separate allocation or subset of the data. Because each thread works on a different subset of the data, results differ from thread to thread until the individual results from the threads are combined (for example, summed together). Distributed systems are extremely efficient because workloads can be broken up and sent to multiple machines. Horizontal scalability: because computing happens independently on each node, it is easy and generally inexpensive to add nodes and functionality as necessary.
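
The idea that each thread sees only its own subset, and that the partial results agree only after they are combined, can be illustrated with Python's standard thread pool. This is a toy sketch, not how CAS is implemented; the function names and data are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_stats(rows):
    """Each worker returns the sum and count of its own subset of rows."""
    return sum(rows), len(rows)

def distributed_mean(rows, n_workers):
    """Split rows across workers, then combine the partial results."""
    chunks = [rows[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_stats, chunks))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count, partials

rows = list(range(1, 101))
mean, partials = distributed_mean(rows, n_workers=4)
print(partials)  # per-thread results differ from one another
print(mean)      # the combined result matches the single-threaded mean
```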

Surrogate split

A surrogate splitting rule is a backup to the main splitting rule. For example, the main splitting rule might use COUNTY as the input, and the surrogate might use REGION. If COUNTY is unknown and REGION is known, the surrogate is used. If several surrogate rules exist, each surrogate is considered in sequence until one can be applied to the observation. If none can be applied, the main rule assigns the observation to the branch that is designated for missing values.

3 types of predictions

Decisions (classifications), rankings (relative scores, for example a credit score), or estimates (predictions of an expected value). Estimates can be transformed into either decisions or rankings.

Ensemble Model

An ensemble model is an aggregation of multiple models. The final prediction from the ensemble model is a combination of predictions from the component models.

Regression Trees

An interval target can have any numeric value, within a certain range, including decimal values.

How to address rare events:

Ask: is the target (outcome) a rare event (for example, credit card fraud at 1 in 1,000)? If so, each partition needs a representative sample of the rare event, and event-based sampling (EBS) is one way to address it. EBS makes two samples: one with the event and one without it. Each event is then matched with one or more nonevent cases (for example, 2, 3, or 4 nonevents for every event). The advantage of event-based sampling is that you can obtain (on average) a model of similar predictive power with a smaller overall case count. Although it reduces analysis time, event-based sampling also introduces some analysis complications. Most model fit statistics (especially those related to prediction decisions) and most of the assessment plots are closely tied to the outcome proportions in the training samples. If the outcome proportions in the training and validation samples do not match the outcome proportion in the scoring population, the assessed model performance is likely to be incorrect. If the outcome proportions in the training sample and scoring population do not match, model prediction estimates are biased. Fortunately, Model Studio automatically adjusts assessment measures, assessment graphs, and prediction estimates for bias.
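
A minimal sketch of event-based sampling in Python, assuming a binary target stored under a hypothetical `target` key and a chosen ratio of nonevents per event. It shows only the sampling step, not the bias adjustment that Model Studio applies afterward.

```python
import random

def event_based_sample(cases, is_event, nonevents_per_event, rng):
    """Keep every event and a random subset of nonevents at the given ratio."""
    events = [c for c in cases if is_event(c)]
    nonevents = [c for c in cases if not is_event(c)]
    n_nonevents = min(len(nonevents), nonevents_per_event * len(events))
    return events + rng.sample(nonevents, n_nonevents)

rng = random.Random(0)
# Hypothetical data: 10,000 transactions, 1 in 1,000 is fraud (target = 1).
cases = [{"id": i, "target": 1 if i % 1000 == 0 else 0} for i in range(10_000)]
sample = event_based_sample(cases, lambda c: c["target"] == 1, 2, rng)

n_events = sum(c["target"] for c in sample)
print(len(sample), n_events)  # far fewer cases, with all events retained
```

The event proportion in the sample (one in three here) no longer matches the population (one in a thousand), which is exactly why the assessment measures and prediction estimates need adjusting.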

Data sources that can be used with SAS Viya

Relational and unstructured data, Hadoop, and various file formats (for example, XML, JSON, CSV). These data sources can be located in local or external databases as well as in the cloud.

CAS (Cloud Analytic Services)

Is the run-time environment. CAS consists of a controller node and a collection of worker nodes that manage and process distributed data of any size.

Boosting

Boosting is a perturb and combine method for creating tree-based ensemble models for categorical targets. Unlike the bagging method, each tree that boosting creates is dependent on the tree from the previous iteration. Across iterations, the algorithm keeps a cumulative count of misclassifications. Each case is then weighted based on whether its misclassification count increases in the current iteration. If the misclassification count increases, the weight increases. If the misclassification count remains the same, the weight decreases. The weights influence the likelihood that the case is selected in the sample for the next iteration. The boosting process continues in this fashion for a predetermined number of iterations. The number of iterations equals the number of trees in the series. The main advantage of boosting is that the algorithm focuses on the misclassified cases. This improves the performance of the model.

Relative variability

Relative variability is useful for comparing variables with similar scales, such as several income variables. Relative variability is the coefficient of variation, which measures the standard deviation relative to the mean: CV = σ / μ.
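
As a quick worked example of CV = σ / μ, here is a sketch using Python's standard library. It assumes the population standard deviation; a sample standard deviation would give a slightly different value, and the data are made up.

```python
import statistics

def coefficient_of_variation(values):
    """Return sigma / mu using the population standard deviation."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is 0")
    return statistics.pstdev(values) / mean

# Hypothetical values with mean 5 and population standard deviation 2.
values = [2, 4, 4, 4, 5, 5, 7, 9]
cv = coefficient_of_variation(values)
print(round(cv, 3))
```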

fault tolerance

When CAS is running in an MPP configuration, it can continue processing requests even if it loses connectivity to some nodes (by using a redundant copy). This communication layer also enables you to remove or add nodes while the server is running.

Decision tree structure

The target can be categorical or interval. The first node in the tree is called the root. Subsequent rules are named interior nodes. Nodes with only one connection are leaf nodes. The depth of a tree specifies the number of generations of nodes. The root node is generation 0. The children of the root node are the first generation, and so on. The tree shown here stops at the second generation. Leaves on the left equate to yes, and leaves on the right equate to no. Prediction rules are often referred to as English rules.

Examples of mathematical functions

Centering, exponential, inverse, log, range, square, square root, and standardize. These fall under the category of transformations.

Types of decision trees

Classification for categorical, regression trees for interval targets.

Fitting a model

Overfitting occurs when the model is too complex and fits noise in the training data rather than the underlying signal. High variance is the result. Underfitting means that the model is too simple and does not account for enough nuance in the data.

Decision tree process

Create an initial tree, then improve the model. "Settings" allow us to improve the model. Settings include depth and leaf size. Other settings relate to the recursive partitioning method that is used to grow a tree, and associated parameters.

Cluster Node Algorithms:

Data is put into groups.

1. The k-means algorithm for clustering interval (quantitative) input variables.
2. The k-modes algorithm for clustering nominal (qualitative) input variables.
3. The k-prototypes algorithm for clustering mixed input that contains both interval and nominal variables.

Logistic regression

Default Basic model template

Dimension Reduction

Dimension reduction decreases the number of variables under consideration. Reducing the dimensionality helps find the true, latent relationship. Model Studio provides the following nodes in SAS Visual Data Mining and Machine Learning for dimension reduction:

Linear Regression Selection

Fits and performs variable selection on an ordinary least squares regression predictive model. This is valid for an interval target and a binary target. In the case of a character binary target (or a binary target with a user-defined format), a temporary numeric variable with values of 0 or 1 is created, which is then substituted for the target. Linear Regression Selection specifies the REGSELECT procedure to perform linear regression selection based on ordinary least squares regression. It offers many effect-selection methods, including Backward, Forward, Forward-swap, and Stepwise methods, and modern LASSO and Adaptive LASSO methods. It also offers extensive capabilities for customizing the model selection by using a wide variety of selection and stopping criteria, from computationally efficient significance level-based criteria to modern, computationally intensive validation-based criteria.

Prediction based decision-making is used in these 4 areas

Fraud, targeted marketing, financial risk, and customer churn (attrition).

Essential data tasks

- Gather the data.
- Explore the data.
- Divide the data.
- Address rare events.
- Manage missing values.
- Replace incorrect values.
- Add unstructured data.
- Extract features.
- Manage extreme or unusual values.
- Select useful inputs.

Gradient boosting

Gradient boosting is an enhancement of boosting that can be applied to any type of target. The gradient boosting algorithm is similar to boosting, except that at each iteration, the target is the residual from the previous decision tree model. Gradient boosting is available in Model Studio.

Fast Supervised Selection

Identifies the set of input variables that jointly explain the maximum amount of variance contained in the target. Fast Supervised Selection specifies the VARREDUCE procedure to perform supervised variable selection by identifying a set of variables that jointly explain the maximum amount of variance contained in the response variables. Supervised selection is essentially based on the AIC, AICC, and BIC stopping criteria.

What causes instability in decision trees?

If one case is omitted or changed it can have a large impact on the outcome. This results from the large number of univariate splits considered during recursive partitioning and the increasing fragmentation of the data. "A small change in the data can easily result in the selection of a different competitor split, which produces different subsets in the child nodes. The changes in the resulting subsets increase with each new generation of the tree." One reason why simple P & C methods give improved performance is variance reduction. The benefit of ensembling many trees together is that, by adding more steps, the steps themselves are essentially smoothed out.

SAS Project

In Model Studio, the main container for your analytic work is a project. A basic Model Studio project contains a data source, a pipeline that you create, and related project metadata.

Build Models button

Found in the Applications menu. It is part of the Discovery phase of the lifecycle, and it is how we access Model Studio. To access SAS Model Manager, select Manage Models. Later in this course, you use the Applications menu to access Model Manager from Model Studio, and then return to Model Studio. To access SAS Visual Analytics, select Explore and Visualize Data. From SAS Visual Analytics, you can access the SAS Visual Statistics add-on functionality, which enables you to use pipelines. In this course, you do not use SAS Visual Analytics and SAS Visual Statistics.

CAS action set

Is a collection of actions that group functionality (for example, simple summary statistics)

Predictive modeling

Is supervised learning. It begins with training data.

Analytics lifecycle

The goal is to extract value from data. It includes three phases:

1. Data: Data are the foundation of everything you do. At the Data phase, you explore and prepare your data for analysis.
2. Discovery: Discovery is the act of detecting something that you did not know before. You build and refine multiple models with the goal of selecting the best model for your analysis.
3. Deployment: Deployment is where you put the model to work. You apply the model to new data, which is a process called scoring.

Machine Learning

Machine learning is AI that learns by model iteration (iterates to "perfection") to make predictions. Three main characteristics of machine learning are automation, customization, and acceleration.

complete case analysis

Complete case analysis uses only the cases that have no missing values. If missing values appear at random in the input data, you can drop the rows that contain missing values without introducing bias into the model. Techniques for managing missingness during model building include naïve Bayes, decision trees, missing indicators, imputation, and binning. In Model Studio, the naïve Bayes technique and decision tree models do not use complete case analysis; these modeling approaches can incorporate missingness.

Dimension

Refers to the number of input variables in the data. It is especially important to have a densely populated input space when fitting highly complex models. For algorithms that do not reduce the number of inputs, it is especially important to reduce the number of inputs during data preparation. Two ways to reduce the dimension are to remove redundant inputs and to remove irrelevant inputs.

Missing values in a decision tree

- Nominal: Treats missing input values as a separate level of the input variable. A nominal input with L levels and a missing value can be treated as an L + 1 level input. If a new case has a missing value on a splitting variable, then the case is sent to whatever branch contains the missing values.
- Ordinal: Modifies the split search strategy for missing values by adding a separate branch adjacent to the ordinal levels.
- Interval: Treats missing values as having the same unknown non-missing value.

Distributed Server: Massively Parallel Processing

One machine acts as the controller and other machines act as workers to process data. One or more machines are designated as worker nodes. Each worker node performs data analysis on the rows of data that are in-memory on the node. The server scales horizontally. If processing times are unacceptably long due to large data volumes, more machines can be added as workers to distribute the workload. The configuration is also fault tolerant.

Persisted in-memory data

Pre-converted, in-memory, pre-loaded data tables. In SAS Viya, all data typically go through an I/O conversion process only once and can be reused as many times as needed thereafter, without incurring the same expense of conversion into a binary, machine-level format. SAS Viya data are either stored within the RAM of a single machine (and run in SMP mode) or within a shared pool of allocated memory created from several networked machines as part of a common memory grid (which enables Massively Parallel Processing, or MPP mode).

Gini index

Pure nodes have a Gini index of 0; the more impure the node, the closer the index is to 1.
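
A short sketch of the calculation, assuming the common definition of the Gini index as 1 minus the sum of squared class proportions in a node (the class counts are made up):

```python
def gini_index(class_counts):
    """Return 1 - sum(p_i^2), where p_i is the proportion of class i in the node."""
    total = sum(class_counts)
    if total == 0:
        raise ValueError("node has no cases")
    return 1.0 - sum((n / total) ** 2 for n in class_counts)

print(gini_index([10, 0]))            # pure node
print(gini_index([5, 5]))             # evenly mixed binary node
print(round(gini_index([1] * 10), 3)) # ten equally likely classes
```

Under this definition a binary target can reach at most 0.5; the index approaches 1 only when the cases are spread evenly over many classes.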

Recursive partitioning

Recursive partitioning is the standard method used to grow decision trees. It is a top-down, greedy algorithm. A greedy algorithm makes locally optimal choices at each step. Starting at the root node, recursive partitioning uses an iterative process to select the best split for the node. This process is called a split search. The splitting criterion measures the reduction in variability of the target distribution in the child nodes. The goal is to reduce variability and thus increase purity in the child nodes. It is called recursive because the population can be split any number of times until a stopping criterion is reached.

Imputation

Refers to replacing a missing value with information that is derived from nonmissing values in the training data. This is the approach that we'll use in this course.

Run-time environment

Refers to the combination of hardware and software in which data management and analytics occur.

Singular Value Decomposition (SVD)

Singular value decomposition (SVD) projects the high-dimensional document and term spaces into a lower-dimensional space. It is a method of decomposing a matrix into three other matrices.
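
In standard textbook notation (the symbols are an assumption here, not taken from the course material), the decomposition of a term-by-document matrix A into three matrices can be written as:

```latex
A = U \Sigma V^{\top}, \qquad A \approx U_k \Sigma_k V_k^{\top}
```

Here U holds the left singular vectors, Σ is a diagonal matrix of nonnegative singular values, and V holds the right singular vectors. Keeping only the k largest singular values gives the lower-dimensional approximation on the right.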

Create Validation Sample from Training Data

Specifies whether a validation sample should be created from the incoming training data. This is recommended even if the data have already been partitioned so that only the training partition is used for variable selection, and the validation partition can be used for modeling.

Statistical modeling versus machine learning

Statistical modeling finds relationships between variables in order to predict outcomes, and it relies on the modeler. It aims to understand the population in relation to the data in the sample, and it works with smaller quantities of data. Machine learning creates a model for a given task, with more emphasis on predictive power than on interpretation. It depends less on the modeler, draws on big data rather than a smaller sample, and uses weights in addition to parameters. Algorithms are always applied to a benchmark data set.

Decision Tree

Decision trees are supervised machine learning models. They require less data preparation and are easy to interpret. Trees follow a decision-split, IF-THEN logic, and can be represented in a tree-like graphical structure. To predict cases (that is, to score data), decision trees use rules that specify a decision based on the values of the input variables. The rules are expressed in Boolean logic, which means that they are IF-THEN-ELSE statements. The rules are arranged hierarchically in a tree-like structure with nodes connected by lines. The nodes represent the rules, and the lines order the rules.

multi-machine massively parallel processing (MPP) configuration

Supported by CAS

SMP single-machine symmetric multiprocessing

Supported by CAS. The single machine is designated as the controller. Because there are no worker nodes, the controller node performs data analysis on the rows of data that are in-memory. The single machine uses multiple CPUs and threads to speed up data analysis. This architecture is often referred to as symmetric multiprocessing, or SMP. All the in-memory analytic features of a distributed server are available to the single-machine server. Single-machine servers cannot load data into memory in parallel from any data source.

Maximal Tree (starting pt. of pruning)

The final tree created with training data. It is the starting point for optimizing the complexity of the model. i.e. pruning. Does not account for redundancy--we need an additional algorithm.

Splitting criteria

The goal of splitting is always to reduce the variability of the target distribution and thus increase purity in the child nodes. Splitting criteria can be based on a variety of impurity reduction measures.

- Categorical target: CHAID, CHISQUARE, ENTROPY, GINI, IGR
- Interval target: CHAID, FTEST, VARIANCE

Outputs

The outputs of the predictive model are known as predictions. Predictions represent the model's best guess about the target given a set of input measurements. The predictive model makes predictions based on what it learns from the source data.

Stemming

The plus sign next to a word indicates stemming (for example, +service represents service, services, serviced, and so on).

Training data

The purpose of the training data is to construct a predictive model (that is, a rule) that relates the inputs to the target. The predictive model is a concise representation of the association between the inputs and the target.

CAS action

A CAS action is the smallest unit of functionality in CAS. It sends a request to the CAS server. The action parses the arguments of the request, invokes the action function, returns the results, and cleans up the resources.

Variable Selection node

There are 7 types of variable selection. They assist you in reducing the number of inputs by rejecting input variables based on the selection results. This node finds and selects the best variables for analysis by using unsupervised and supervised selection methods. You can choose among one or more of the available selection methods in the variable selection process. Combination Criterion--This is a "voting" method such that each selection method gets a vote on whether a variable is selected. In the Combination criterion property, you choose at what voting level (combination criterion) a variable is selected. In pre-screening, if a variable exceeds the maximum number of class levels threshold or the maximum missing percent threshold, that variable is rejected and not processed by the subsequent variable selection methods. The Advisor option also prescreens.

Interval Variable Moments table

This table displays the interval variables with their associated statistics, which include minimum, maximum, mean, standard deviation, skewness, kurtosis, relative variability, and the mean plus or minus two standard deviations. Note that some of the input variables have negative values. You address these negative values in an upcoming practice.

Topics (inside Text mining node)

This table shows topics created by the Text Mining node. Topics are created based on groups of terms that occur together in several documents. Each term-document pair is assigned a score for every topic. Thresholds are then used to determine whether the association is strong enough to consider whether that document or term belongs in the topic. Because of this, terms and documents can belong to multiple topics. Fifteen topics were discovered, so fifteen new columns of inputs are created. The output columns contain SVD (singular value decomposition) scores that can be used as inputs for the downstream nodes.

Features

To add unstructured data, we convert unstructured text variables to usable numeric inputs. This involves text mining.

Pruning

To build the optimal tree, which is neither too large nor too small, Model Studio starts with the maximal tree and uses a process called bottom-up pruning to remove branches. Bottom-up pruning is also known as post pruning or retrospective pruning. Bottom-up pruning starts at the leaf level and removes branches in a backward fashion using a model selection criterion. Suppose the maximal tree has n leaves. First, the pruning algorithm evaluates all possible sub-trees consisting of n minus 1 leaves. To do this, the algorithm removes (or prunes) one split. Out of all possible sub-trees with n minus 1 leaves, the algorithm selects the sub-tree with the optimal value of the model selection criterion on the validation data. The selected sub-tree is the candidate from all sub-trees with n minus 1 leaves.

Bonferroni Adjustment

To maintain overall confidence in the statistical findings, statisticians inflate the p-values of each test by a factor equal to the number of tests being conducted. If an inflated p-value shows a significant result, then the significance of the overall results is assured. This type of p-value adjustment is known as a Bonferroni correction. For inputs with missing values, two sets of Bonferroni-adjusted logworths are generated. For the first set, cases with missing input values are included in the left branch of the contingency table and logworths are calculated. For the second set of logworths, missing value cases are moved to the right branch. The best split is then selected from the set of possible splits with the missing values in the left and right branches, respectively.
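
The p-value inflation can be sketched in a few lines of Python. The function names and the example numbers are assumptions for illustration; capping the adjusted p-value at 1 is a common convention rather than something stated above.

```python
import math

def bonferroni_adjust(p_value: float, n_tests: int) -> float:
    """Inflate the p-value by the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)

def adjusted_logworth(p_value: float, n_tests: int) -> float:
    """Logworth of the Bonferroni-adjusted p-value."""
    return -math.log10(bonferroni_adjust(p_value, n_tests))

# Hypothetical split: raw p-value 0.001 found among 20 candidate split points.
print(round(bonferroni_adjust(0.001, 20), 6))
print(round(adjusted_logworth(0.001, 20), 3))
```

When the cap is not hit, the adjustment simply lowers the logworth by log10 of the number of tests, so inputs with many candidate split points are penalized more.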

Decision Tree Selection

Trains a decision tree predictive model. The residual sum of squares variable importance is calculated for each predictor variable, and the relative variable importance threshold that you specify is used to select the most useful predictor variables. Decision Tree Selection specifies the TREESPLIT procedure to perform decision tree selection based on CHAID, Chi-square, Entropy, Gini, Information gain ratio, F test, and Variance target criterion. It produces a classification tree, which models a categorical response, or a regression tree, which models a continuous response. Both types of trees are called decision trees because the model is expressed as a series of IF-THEN statements.

Forest Selection:

Trains a forest predictive model by fitting multiple decision trees. The residual sum of squares variable importance is calculated for each predictor variable, averaged across all the trees, and the relative variable importance threshold that you specify is used to select the most useful predictor variables. Forest Selection specifies the FOREST procedure to create a predictive model that consists of multiple decision trees.

Transformations

Transformations are most commonly used to change the shape of the distribution of a variable by stretching or compressing it, reduce the effect of outliers or heavy tails, and standardize inputs to be on the same range or scale.

Unsupervised Selection

Unlabeled data is analyzed and organized in ways to show patterns or trends. We can use clustering for example, with this approach. Identifies the set of input variables that jointly explains the maximum amount of data variance. The target variable is not considered with this method. Unsupervised Selection specifies the VARREDUCE procedure to perform unsupervised variable selection by identifying a set of variables that jointly explain the maximum amount of data variance. Variable selection is based on covariance analysis.

Unstructured Data--text mining

Unstructured data can be text or non-text (such as images, audio, or video). Non-textual data are often converted to text data, so we focus on text data in this course. The Text Mining node enables you to process text data in a document collection. Adding text mining results can improve the predictive ability of models that are based only on structured data. In text mining, data are processed in two phases: text parsing and transformation. Text parsing processes textual data into a term-by-document frequency matrix. Transformations such as singular value decomposition (or SVD) change this matrix into a data set that is suitable for data mining purposes. As a result of text mining, a document collection with thousands of documents and terms can be represented in a compact and efficient form.

Gradient Boosting Selection:

Uses Boosting, which combines weak learners sequentially so each new tree corrects the errors of the previous one. Weak learners usually only have one tree split--referred to as a stump. Trains a gradient boosting predictive model by fitting a set of additive decision trees. The residual sum of squares variable importance is calculated for each predictor variable, averaged across all the trees, and the relative variable importance threshold that you specify is used to select the most useful predictor variables. Gradient Boosting Selection specifies the GRADBOOST procedure to create a predictive model that consists of multiple decision trees.

SAS data types

VARCHAR, INT32, INT64, IMAGE, CHARACTER, and NUMERIC. Variables that are created or loaded using the INT32 or INT64 data types support more digits of precision than the traditional NUMERIC data type. All calculations that occur on the CAS engine maintain the INT32 or INT64 data type.

Complete case analysis

The name for a model-building process that ignores training data cases with missing values of inputs.

Convergence with SAS Viya

When an algorithm converges, it has found a parameter set that meets your requirements; that is, it has run the iterations needed to reach a good fit. Estimation frequently requires iterative procedures: the more iterations, the more accurate the estimates. But when are estimates accurate enough? When can iteration cease? A practical rule is: "Convergence is reached when more iterations do not change my interpretation of the estimates." Convergence can fail in many ways, particularly if a suitable parameter set can't be found within some maximum number of iterations. Conversely, unreasonable stopping criteria can lead to convergence with very poor parameters. Lack of convergence can indicate that the data do not fit the model well, because there are too many poorly fitting observations.

Cross validation for small data sets

When the data set is too small to split into training and validation, you can use cross validation. Cross validation avoids overlapping test sets. k-fold cross validation consists of the following steps: Split the data into k subsets of equal size. Use each subset in turn for validation and the remainder for training. In a five-fold cross validation, the initial data set is divided into A, B, C, D, and E subsets. On the first run, the subsets B, C, D, and E are used to train the model, and the subset A is used to validate the model. Then the subsets A, C, D, and E are used to train the model, and the subset B is used to validate. The process continues until all subsets are used for training and validation. Often the subsets are stratified before the cross validation is performed. The error estimates are averaged to yield an overall error estimate.
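
The steps above can be sketched as follows. This is a minimal illustration: the fold assignment is a simple interleave with no shuffling or stratification, the "model" is just the training mean, and the target values are made up.

```python
def k_fold_splits(cases, k):
    """Yield (training, validation) pairs; each fold is used for validation once."""
    folds = [cases[i::k] for i in range(k)]  # k subsets of (near-)equal size
    for i, validation in enumerate(folds):
        training = [c for j, fold in enumerate(folds) if j != i for c in fold]
        yield training, validation

def mean_squared_error(prediction, targets):
    """Average squared difference between one prediction and the targets."""
    return sum((t - prediction) ** 2 for t in targets) / len(targets)

# Hypothetical interval target values.
targets = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.0, 4.0, 6.0, 2.0]
fold_errors = []
for training, validation in k_fold_splits(targets, k=5):
    prediction = sum(training) / len(training)  # "fit" on the training folds
    fold_errors.append(mean_squared_error(prediction, validation))

# Average the per-fold error estimates into one overall estimate.
overall_error = sum(fold_errors) / len(fold_errors)
print(len(fold_errors), round(overall_error, 3))
```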

Number of Surrogate Rules

determines the number of surrogates that are sought. A surrogate is discarded if its agreement is less than or equal to the largest proportion of observations in any branch. As a consequence, a node might have fewer surrogates specified than the number in the Number of Surrogate Rules property. The agreement between two splits can be measured as the proportion of cases that are sent to the same branch. The split with the greatest agreement is taken as the best surrogate.
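As a rough illustration of the agreement measure and the discard rule, here is a Python sketch. It assumes that "the largest proportion of observations in any branch" refers to the branches of the primary split. The function names and the toy branch assignments are hypothetical, not part of any SAS API.

```python
from collections import Counter


def agreement(primary_branches, candidate_branches):
    """Proportion of cases sent to the same branch by both splits."""
    same = sum(p == c for p, c in zip(primary_branches, candidate_branches))
    return same / len(primary_branches)


def keep_surrogate(primary_branches, candidate_branches):
    """Keep a candidate only if its agreement exceeds the largest branch share
    of the primary split (assumed reading of the discard rule)."""
    largest_share = max(Counter(primary_branches).values()) / len(primary_branches)
    return agreement(primary_branches, candidate_branches) > largest_share


# Branch assignment of eight cases under the primary split and two candidates.
primary = ["L", "L", "L", "R", "R", "L", "R", "L"]
candidate_a = ["L", "L", "L", "R", "R", "L", "L", "L"]  # disagrees on one case
candidate_b = ["L", "L", "L", "L", "L", "L", "L", "L"]  # sends everything left

agreement_a = agreement(primary, candidate_a)
agreement_b = agreement(primary, candidate_b)
```

Candidate B only matches the share of the largest branch, so it does no better than always choosing that branch and is discarded.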

Essential data tasks

divide the data, address rare events, and manage missing values

Common perturbation methods

resampling, subsampling, adding noise, adaptively reweighting, and randomly choosing from the competitor splits.

non-deterministic results

results are not reproducible

machine learning models:

decision trees and other tree-based models, neural networks, and support vector machines.

Binning

is a method of transformation that converts numeric inputs to categories or groups. You can use binning to classify missing values, reduce the effect of outliers on a model, or illustrate nonlinear relationships. A binned version of a variable also has a smaller variance than the original numeric variable. There are different types of binning.

Deterministic versus nondeterministic

A deterministic algorithm uses the same steps (takes the same path) each time to arrive at the outcome. Deterministic behavior is associated with conventional, definitive algorithms and direct computation of model parameters. Nondeterministic behavior is associated with a distributed computing environment, randomness in algorithms, and convergence of model parameters. A nondeterministic algorithm can use different approaches to arrive at the outcome, given the same set of inputs. In fact, given the same set of inputs, a nondeterministic model might even provide different outcomes. Even if the outcome is the same given the same set of inputs, how the nondeterministic algorithm arrived at the result on different runs can vary. An example of a nondeterministic algorithm is a probabilistic algorithm.
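A small Python sketch of a probabilistic algorithm, to make the distinction concrete. The Monte Carlo estimate of pi and the `seed` parameter are illustrative choices, not something taken from SAS Viya. Fixing the seed makes the runs reproducible, while the unseeded run can differ from one execution to the next.

```python
import random


def estimate_pi(n_draws, seed=None):
    """Monte Carlo estimate of pi; a probabilistic (randomized) algorithm."""
    rng = random.Random(seed)  # seed=None uses an unpredictable starting state
    inside = sum(
        1 for _ in range(n_draws)
        if rng.random() ** 2 + rng.random() ** 2 <= 1.0
    )
    return 4 * inside / n_draws


run_a = estimate_pi(10_000, seed=123)
run_b = estimate_pi(10_000, seed=123)  # same seed: same result as run_a
run_c = estimate_pi(10_000)            # unseeded: may differ on every run
```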

hyperparameter

a variable that is used to find the optimal model but whose value cannot be estimated from the data. Hyperparameters help ensure the speed and quality of the model. These values must be specified externally, either manually or through an automated process. Hyperparameters are sometimes called tuning options. For decision trees, maximum tree depth is an example of a hyperparameter.

neural network

an example of a machine learning model. In a neural network, the weights start close to zero. With each pass through the data, the neural network learns more and refines the weights.

Data preprocessing

can occur at three moments: in a dedicated application (Data Studio), during visual exploration (SAS Visual Analytics), and during execution of a pipeline (Model Studio). In a pipeline, data preprocessing capabilities come in the form of pipeline nodes.

training cases

examples, instances, records

GUI

graphical user interface

autotuning

hyperparameter optimization in SAS. In general, autotuning searches for the best combination of hyperparameters specific to each modeling algorithm. When you decide whether to perform autotuning, keep in mind that autotuning can substantially increase run time.
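To show why searching over hyperparameter combinations increases run time, here is a hedged Python sketch of an exhaustive grid search. Grid search is only one possible search strategy and is not claimed to be what SAS autotuning does. The names `grid_search`, `toy_validation_error`, `max_depth`, and `min_leaf_size` are hypothetical, and the scoring function is a stand-in for training a model and scoring it on validation data.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every combination in `grid`; return (best_params, best_score, n_tried)."""
    names = list(grid)
    best_params, best_score, n_tried = None, float("inf"), 0
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = evaluate(params)  # lower is better, e.g. validation error
        n_tried += 1
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_tried


def toy_validation_error(params):
    # Hypothetical stand-in for "train a tree, score it on validation data".
    return (params["max_depth"] - 6) ** 2 + abs(params["min_leaf_size"] - 10) / 10


grid = {"max_depth": [2, 4, 6, 8, 10], "min_leaf_size": [5, 10, 20]}
best_params, best_score, n_tried = grid_search(grid, toy_validation_error)
```

Even this small grid requires 15 evaluations, and each one would be a full model fit in practice.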

SAS Viya

is a cloud-enabled, in-memory analytics runtime environment that seamlessly scales for data of any size, type, speed, and complexity.

Binning

is helpful for handling interval input variables. In binning, the original numeric values are grouped into discrete categories called bins. The missing values are assigned to their own bin. The interval variable then becomes a categorical variable.
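A minimal Python sketch of one type of binning, equal-width binning, with missing values assigned to their own bin. The function name, the bin labels, and the sample ages are assumptions made for the illustration. Other binning types (for example quantile-based) would place the cut points differently.

```python
def equal_width_bin(values, n_bins):
    """Assign each value to an equal-width bin; None goes to a 'MISSING' bin."""
    present = [v for v in values if v is not None]
    low, high = min(present), max(present)
    width = (high - low) / n_bins
    labels = []
    for v in values:
        if v is None:
            labels.append("MISSING")
            continue
        # The maximum value would fall just past the last bin, so clamp it.
        index = min(int((v - low) / width), n_bins - 1) if width > 0 else 0
        labels.append(f"BIN_{index + 1}")
    return labels


ages = [22, 35, None, 58, 41, 90, None, 29]
binned = equal_width_bin(ages, 4)
```

The outlying age of 90 simply becomes a member of the last bin, which is how binning limits the influence of extreme values.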

Posterior Probability

is the probability generated by a predictive model.

The variable importance measure

The calculation takes the square root. Further, the Decision Tree node incorporates the agreement between the surrogate split and the primary split in the calculation. The measure is scaled to be between 0 and 1 by dividing by the maximum importance. Thus, larger values indicate greater importance. Variables that do not appear in any primary or saved surrogate splits have 0 importance.
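The scaling step alone can be sketched in Python as follows. The raw importance values and variable names are invented for the example, and the sketch does not attempt to reproduce how the Decision Tree node computes the raw values.

```python
def scale_importance(raw_importance):
    """Divide each raw importance by the maximum so values fall in [0, 1]."""
    largest = max(raw_importance.values())
    if largest == 0:
        return {name: 0.0 for name in raw_importance}
    return {name: value / largest for name, value in raw_importance.items()}


# Hypothetical raw importances; "gender" appears in no split, so it is 0.
raw = {"income": 12.0, "age": 6.0, "region": 3.0, "gender": 0.0}
scaled = scale_importance(raw)
```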

Overfitting in a decision tree

means the tree adapts to noise as well as signal. The extreme case is the largest possible tree, which has no stopping rules, so every case has its own leaf. The maximal tree that results from recursive partitioning is smaller than the largest possible tree but is still likely to be overfit. The maximal tree adapts to both the systematic variation of the target (or the signal) and the random variation (or the noise). At the other extreme, a small tree with only a few branches might underfit the data. It might fail to adapt sufficiently to the signal, which usually results in poor generalization.

A variable can be:

numeric (interval) or categorical (usually nominal)

L1 Regularization

penalizes the absolute values of the weights. Different values for L1 are tried within the range defined by From and To. The default initial value for L1 is 0. The default range is from 0 to 10.
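A short Python sketch of what an L1 penalty adds to a loss. The weights, the unpenalized loss of 1.0, and the candidate L1 values are all made up for the illustration. Only the 0 to 10 range comes from the text above.

```python
def l1_penalty(weights, l1):
    """L1 penalty: l1 times the sum of the absolute weight values."""
    return l1 * sum(abs(w) for w in weights)


def penalized_loss(data_loss, weights, l1):
    """Loss that the training procedure would minimize under L1 regularization."""
    return data_loss + l1_penalty(weights, l1)


weights = [0.5, -1.5, 0.0, 2.0]      # hypothetical model weights
candidates = [0, 2.5, 5, 7.5, 10]    # assumed grid inside the 0 to 10 range
losses = {l1: penalized_loss(1.0, weights, l1) for l1 in candidates}
```

With L1 set to 0 the penalty vanishes. Larger values make the same weights progressively more expensive.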

Types of ensemble models

single-algorithm ensemble models: gradient boosting and forest. Perturb and combine (P&C) methods are used to create ensemble models in two steps. The perturb step creates different models by manipulating the distribution of the data or altering the construction method, for example by changing the splitting criteria. The combine step then uses the predictions of the models built in the perturb step to create a single prediction. Perturb and combine methods can be used with any unstable algorithm, but they are mostly used with trees. The benefit of ensembling many trees together is that, by adding more steps, the steps themselves are essentially smoothed out. The main drawback of ensemble models is that they are much harder to interpret than a single tree.
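The two P&C steps can be sketched with bagging of one-split regression trees (stumps) in plain Python. This is a toy illustration, not the forest or gradient boosting implementation in SAS Viya. The function names, the seed, and the linear toy data are assumptions.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: predict the overall mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def bagged_stumps(xs, ys, n_models, seed=0):
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Perturb: draw a bootstrap sample with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    # Combine: average the predictions of all models.
    return lambda x: sum(m(x) for m in models) / len(models)


xs = list(range(10))
ys = [float(x) for x in xs]
single = fit_stump(xs, ys)
ensemble = bagged_stumps(xs, ys, 25)
single_levels = {single(x) for x in xs}
ensemble_levels = {ensemble(x) for x in xs}
```

A single stump can only predict two levels. Averaging stumps whose split points differ yields more levels, which is the smoothing of steps described above.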

Unsupervised learning

starts with unlabeled data (that is, data that have no target or dependent variable, so we don't yet know what we are trying to find). Unsupervised learning algorithms seek to discover intrinsic patterns that underlie the data, such as a clustering or a redundant parameter (dimension) that can be reduced. By contrast, some methods are supervised, which means that the target variable is used in the selection process. Other methods are unsupervised and ignore the target.

To ensure a good model that does not under or over fit

the data are typically divided into two or three non-overlapping sets, which are called partitions. You use the first partition, the training set, to build models. Remaining partitions of the data can be referred to as holdout data. The second partition, the validation data set, is used to optimize the complexity of the model and find the sweet spot between bias and variance. Based on the validation data, you tune the models that were built on the training data and determine whether additional training is required. The third partition, the test data set, is optional for the model building process, but some industries might require it as a source of unbiased model performance. The test data set gives honest, unbiased estimates of the model's performance and provides one final measure of how the model performs on new data before the model is put into production.
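A minimal Python sketch of a three-way partition. The 60/30/10 proportions, the seed, and the function name `partition` are assumptions for the example, not recommended settings.

```python
import random


def partition(cases, train=0.6, validate=0.3, seed=42):
    """Shuffle cases and split them into non-overlapping train/validate/test sets."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validate = round(len(shuffled) * validate)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validate],
        shuffled[n_train + n_validate:],  # whatever remains is the test set
    )


train_set, validate_set, test_set = partition(range(100))
```

Every case ends up in exactly one partition, so the test set stays untouched while models are built and tuned.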

Target

the response (dependent) variable

Metadata

the set of variable roles, measurement levels, and other configurations that apply to your data set