Exam PA - Chapter 3 - A Primer on Predictive Analytics
Classifying variables by their nature
- Can be numeric: discrete (restricted to non-negative integers) or continuous (able to assume any value). Linear models work only for continuous target variables; GLMs and decision trees apply to both discrete and continuous targets. - Can be categorical: the variable takes one of a set of levels. Special case: binary variable, which has only two possible levels.
Data issues identified during Stage 2 of the model building process (list)
- Personally Identifiable Information (PII) - semi-structured data that does not fit into tabular arrangements (e.g., social media posts) and is difficult to turn into a form amenable to further analysis - variables causing unfair discrimination - target leakage
Purpose of PA and key components of PA definition
- Using PA, we can provide a data-driven response to many questions of practical interest. - There is always an outcome of interest (numeric or categorical), and we have at our disposal a collection of variables that may offer potentially useful information for predicting the outcome.
Importance of controlling the model complexity and how to do it
- Why: Effective control of model complexity is crucial to striking a balance between underfitting and overfitting. - How: This can be achieved by feature generation and selection.
Two ways to classify variables
- by their role in the study (intended use) - by their nature (characteristics)
Three main categories of predictive modeling analytic problems (with descriptions)
- Descriptive: focuses on what happened in the past; aims to "describe" or explain the observed trends by identifying the relationships between different variables. - Predictive: focuses on what will happen in the future; is concerned with making accurate predictions. - Prescriptive: uses a combination of optimization and simulation to investigate and quantify the impact of different "prescribed" actions in different scenarios. (If we prescribe/propose this change, what happens?)
Considerations for selecting the best model (list)
- predictive performance - interpretability - ease of implementation
Sampling methods
- random sampling (simplest) - stratified sampling (dividing the underlying population into several non-overlapping "strata" or groups in a non-random fashion, then randomly sampling a set number of observations from each stratum). ● Ensures every stratum is properly represented in collected data.
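A minimal sketch of the two sampling methods in Python (pandas); the column name stratum and the sample sizes are hypothetical illustration choices:

```python
# Sketch: random vs. stratified sampling with pandas.
# The `stratum` column and the sizes below are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "stratum": ["A"] * 70 + ["B"] * 20 + ["C"] * 10,
    "x": range(100),
})

# Random sampling: every observation equally likely to be drawn,
# so a small stratum like "C" may be under-represented.
random_sample = df.sample(n=30, random_state=42)

# Stratified sampling: draw a set number of observations from each
# non-overlapping stratum, guaranteeing every stratum is represented.
stratified_sample = df.groupby("stratum").sample(n=5, random_state=42)
print(stratified_sample["stratum"].value_counts())  # 5 from each of A, B, C
```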
Stage 2 - Data Design process - three qualities of good data (list)
- reasonableness - consistency - sufficient documentation
Stage 2 - Data Design process - three things we are evaluating about the data (list)
- relevance of data - sampling of data - granularity of data
Supervised vs. Unsupervised learning problems
- supervised: have a target variable "supervising" or guiding our analysis - unsupervised: no target variable
Definition of intended use classification variables: - target/response/dependent/output variable - predictors/features/explanatory, independent, input variables
- target: the one we are interested in predicting. - predictors: the ones used to predict the target variable
- optimal level of granularity in a model - option to reduce sensitivity of the model to noise
- the one that optimizes the bias-variance tradeoff. - reduce the granularity of a categorical predictor.
Key idea of the Bias-Variance Tradeoff
Complexity does not guarantee good prediction performance.
Purpose of Stage 2 in model building process
Data Collection - After Stage 1, collect useful data that will underlie the predictive models being constructed. (Important stage).
Main components of the Stage 2 process
Data Collection Stage - data design - data quality issues/data validation
Purpose of Stage 1 in model building process
Define the Business Problem - clearly formulate the business problem we are applying PA to.
Constraints of Stage 1 in model building process
Define the Business Problem ● Constraints: must evaluate the feasibility of solving the business problem and implementing the PA solution. Some constraints include: ○ Availability of easily accessible and high-quality data ○ Implementation issues (IT infrastructure & tech to fit complex models efficiently, timeline for completing the project, cost and effort, etc.).
Objectives of Stage 1 in model building process
Define the Business Problem ● Objectives: given the business objectives, most PA projects can be classified into ○ Prediction-forward: primary objective is to develop a model that makes accurate predictions of the target variable based on the other predictors available ○ Interpretation-forward: use a predictive model to understand the true relationship between the target variable and predictors.
Purpose of Stage 3 in model building process
Exploratory Data Analysis - Stage 3 Goal: using descriptive statistics and graphical displays, clean the data of incorrect, unreasonable, and inconsistent entries. ● Try to understand the characteristics of, and the key relationships among, variables in the data. ● The observations we make may suggest an appropriate type of predictive model to use and the best form for the predictors to enter the model.
Main components of the Stage 4 process
Model Construction and Evaluation - Stage 4 (important) - decide how much of the available data should be used to train our models - training/test data split - decide on performance metrics to use - cross-validation - selecting the best model
Purpose of Stage 4 in model building process
Model Construction and Evaluation - Stage 4 (important) Having collected the data and taken a first pass at the key variables, it is time to move on to the modeling phase and construct our predictive models.
Purpose of Stage 6 in model building process and steps
Model Maintenance - Stage 6 After client approval, use and maintain the model over time. Steps include: ● Retraining the model on newer data to maintain robustness and improve prediction performance ● Adding new variables to the training set and model if necessary ● Soliciting feedback and consulting external subject matter experts (SMEs) for issues not easily resolved
Purpose of Stage 5 in model building process
Model Validation - Stage 5 After selecting the best model, we need to validate it: check for obvious deficiencies and confirm the soundness of its assumptions.
Definition of feature selection (removal) and dimension reduction
The opposite of feature generation: the procedure of dropping (removing) features/variables with limited predictive power, thereby reducing the dimensions of the data.
6 Stages in the model building process (list)
Stage 1: Define the Business Problem Stage 2: Data Collection Stage 3: Exploratory Data Analysis (EDA) Stage 4: Model Construction and Evaluation (most important) Stage 5: Model Validation Stage 6: Model Maintenance
Definition of feature generation
The process of "generating" new features based on existing variables in the data.
F function equation between the target variable and predictors
Y = f(X) + ε: the target variable is the sum of the signal function f (the systematic component capturing the relationship with the predictors) and the noise ε (a zero-mean random error).
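Written out in full (the standard form of this equation, with X_1, …, X_p denoting the p predictors):

```latex
% Target = signal + noise: f is the signal function (the systematic
% part driven by the predictors); \varepsilon is a zero-mean random
% error independent of the predictors.
Y = f(X_1, \dots, X_p) + \varepsilon
```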
Why we shouldn't use all of the available data to train our model
Using all collected data for model fitting leaves no independent data for assessing the predictive performance of our models.
Definition of predictive analytics
a vast set of statistical tools or the activity of using these tools for "predicting" a target variable based on a set of closely related variables.
Qualities of good data - reasonableness how to check for it
checked using exploratory data analysis
Data issues - target leakage Definition, reason for issue, and example
■ Definition: the phenomenon that some predictors in a model include ("leak") info about the target variable that will not be available when the model is applied in practice. ● These predictors are typically strongly associated with the target variable, but their values aren't known until the target variable is observed. ● As our goal with PA is to use other variables to predict the target variable (before it is observed), these "predictors" are by definition not really predictors. ● If we ignore the timing issue and mistakenly include these variables in the model construction process, our model will appear to perform deceptively well. ● Example: data on tests done during a hospital stay is collected only after the stay is complete, but the PA model is trying to predict how long the stay will be.
Qualities of good data - consistency definition
■ Ensuring the same basis and rules have been applied to all values so they can be directly compared to each other. ● Numeric: same units. Categorical: factor levels of the variables are well defined and recorded consistently with no coding changes.
Data issues - Personally Identifiable Information (PII) Reason for issue, and solutions
■ Personally Identifiable Information (PII): info that can be used to trace an individual's identity; collecting or exposing it could violate laws and regulations. Fix by: ○ Anonymization to remove the PII ○ Data security measures ○ Terms of use (and conditions) in the privacy policy
Purpose of oversampling or undersampling data, and definition of systematic sampling
■ Oversampling or undersampling: special cases of sampling designed for unbalanced data (adjust the representation of the rare and common levels of the target). ■ Systematic sampling: draw observations according to a set pattern; no randomness involved.
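A minimal sketch of oversampling an unbalanced binary target with scikit-learn's resample utility; the column name claim and the class sizes are hypothetical. Undersampling would instead shrink the majority class (replace=False, n_samples=len(minority)):

```python
# Sketch: oversampling the rare level of an unbalanced target.
# Column name `claim` and the class sizes are hypothetical.
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({
    "claim": [0] * 90 + [1] * 10,  # 90/10 unbalanced binary target
    "x": range(100),
})
majority = df[df["claim"] == 0]
minority = df[df["claim"] == 1]

# Draw the minority class with replacement until it matches the
# majority class size, giving the rare level more weight in training.
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
balanced = pd.concat([majority, minority_up])
print(balanced["claim"].value_counts())  # 90 of each level
```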
Qualities of good data - sufficient documentation definition
■ others should easily gain an accurate understanding of various aspects of the data. ● Description of dataset overall (including data source) and each variable in data ● Notes about any past updates or other irregularities in the data ● Statement of accountability for the correctness of the dataset ● Description of the governance processes used to manage the dataset.
Difference between granularity and dimensionality
○ Applicability: dimensionality is specific to categorical variables; granularity applies to both categorical & numeric variables. ○ Comparability: we can always order categorical variables by dimension (number of levels), but not always by granularity. ■ For one variable to be more granular than another, each distinct level of the coarser variable must be expressible as a union of levels of the finer one.
3 strategies for reducing the dimensionality of a categorical predictor
○ Combining sparse categories with others: the goal is to balance the following conflicting aims: ■ Ensure each level has a sufficient # of observations ■ Preserve differences in the behavior of the target variable among different factor levels which could be useful for predictions. ○ Combining similar categories ○ Using prior knowledge of a categorical variable ■ Can regroup levels based on their known characteristics (see the sketch below)
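A minimal sketch of the first strategy, folding sparse categories into an "Other" bucket; the vehicle-type levels and the threshold of 10 observations are hypothetical choices:

```python
# Sketch: fold factor levels with too few observations into "Other".
# The levels and the threshold (10) are hypothetical.
import pandas as pd

s = pd.Series(["sedan"] * 50 + ["suv"] * 40 + ["coupe"] * 7 + ["hearse"] * 3)

counts = s.value_counts()
sparse = counts[counts < 10].index            # levels below the threshold
combined = s.where(~s.isin(sparse), "Other")  # keep common levels as-is
print(combined.value_counts())                # sedan 50, suv 40, Other 10
```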
Importance of evaluating the granularity of data during Stage 2, data design process
○ Granularity: how precisely a variable in a dataset is measured / how detailed the info contained in the variable is. - At the data design stage, it is recommended to use a relatively high level of granularity, as we can always go "down" to a coarser level later.
Considerations for selecting the best model (details behind the considerations)
○ Prediction performance: smallest RMSE or test classification error rate ○ Interpretability: model predictions should be easily explained in terms of the predictors and lead to specific actions or insights. ○ Ease of implementation: the easier for a model to be implemented (computationally, financially, or logistically), the better the model. ■ If the model requires prohibitive resources to construct and maintain, may be unaffordable for end users
Importance of evaluating the sampling of data during Stage 2, data design process
○ Sampling: the process of taking a subset of observations from the data source to generate our dataset ■ Smaller subsets are more manageable; the sample should closely resemble the business environment our models will be applied in.
Overfitting vs underfitting a model and the ideal model to use
○ Underfitting: model too simple to capture the signal. ■ Large training error and bias, small variance ○ Overfitting: model too complicated, fitting the noise in the training data in addition to the signal. ■ Low training error and bias, high test error and variance ● The ideal model sits at the minimum point of the test error curve ● Want to use a moderately flexible model. (See the sketch below.)
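A minimal sketch of the under/overfitting pattern using polynomial fits of increasing degree; the data-generating signal sin(3x) and the degrees tried are arbitrary illustration choices:

```python
# Sketch: training error falls steadily with flexibility, while the
# test error is U-shaped; the ideal model sits near its minimum.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(scale=0.3, size=60)  # signal + noise
x_tr, y_tr, x_te, y_te = x[:40], y[:40], x[40:], y[40:]

for degree in [1, 3, 10, 20]:  # increasing flexibility
    coefs = np.polyfit(x_tr, y_tr, degree)
    tr_rmse = np.sqrt(np.mean((np.polyval(coefs, x_tr) - y_tr) ** 2))
    te_rmse = np.sqrt(np.mean((np.polyval(coefs, x_te) - y_te) ** 2))
    print(degree, round(tr_rmse, 3), round(te_rmse, 3))
```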
Unsupervised learning problem definition and examples
○ are interested in extracting relationships and structures between different variables in the data. ○ Can be used for the purposes of data exploration and producing potentially useful features for predicting the target variable more accurately. ○ Ex: principal components analysis (PCA) and cluster analysis
Common model performance metrics for regression problems - Stage 4
○ RMSE: measures the discrepancy between each observed value of the target variable and the predicted value (residual in the training set; prediction error in the test set) ■ Square root of the MSE ■ The smaller the training (test) RMSE, the better the fit of the model to the training (test) data.
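A minimal sketch of the RMSE computation; the observed and predicted values are made-up numbers:

```python
# Sketch: RMSE = square root of the mean squared discrepancy between
# observed and predicted values of the target.
import numpy as np

def rmse(y_obs, y_pred):
    y_obs, y_pred = np.asarray(y_obs), np.asarray(y_pred)
    return np.sqrt(np.mean((y_obs - y_pred) ** 2))

print(rmse([3.1, 2.4, 5.0], [2.9, 2.7, 4.6]))  # test RMSE on made-up data
```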
Importance of evaluating the relevance of data during Stage 2, data design process
○ Relevance of data: the data should be unbiased and appropriately representative of the environment in which our predictive model will operate. ■ Having more data is also desirable, as it makes model training more robust and less vulnerable to noise.
Common model performance metrics for classification problems - Stage 4
○ use indicator functions to calculate a classification error rate (aka misclassification rate) ■ The smaller the rate, the better the fit ■ We are interested in the test error rate since it measures how well the model makes predictions on future, unseen data.
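A minimal sketch of the classification error rate as an average of indicator functions; the labels are made up:

```python
# Sketch: error rate = mean of the indicator 1{observed != predicted}.
import numpy as np

y_test = np.array(["yes", "no", "no", "yes"])
y_hat  = np.array(["yes", "no", "yes", "yes"])
print(np.mean(y_test != y_hat))  # 0.25 = 1 misclassification out of 4
```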
Bias vs Variance key ideas
● Bias = accuracy, variance = precision ● Ideally, want a low bias and low variance model ● Inverse relationship between model bias and variance: the bias-variance tradeoff ● The U-shape behavior of the test error is due to the relative rate of change of the bias and variance
Purpose of feature selection and dimension reduction
● Bias-variance tradeoff: feature selection is an attempt to control model complexity and prevent overfitting by reducing the variance (at the expense of a slight rise in bias). ● Particularly relevant for high-dimensional categorical predictors. ● Dimensionality of a categorical variable: the number of possible levels the variable has. ○ A predictor with many categories inflates the dimension of the model and undermines precision, leaving low exposure in some levels (the curse of dimensionality).
How to validate the model during Stage 5 of the model building process
● Can perform validation on the training set or the test set; the approach may be model-dependent. ● Training set: for GLMs, we will learn tools later in sections 3.2.2 and 4.1.3 ○ N/A for decision trees ● Test set: compare the predicted and observed values on the test set. ○ There should be no systematic patterns in the differences ● Can also compare the selected model to an existing baseline model on the test set. ○ Usually a primitive model, like OLS or one with no predictors that uses the average of the target variable on the training set for prediction.
Data issues - variables causing unfair discrimination Reason for issue, and solutions
● Differential treatment based on "controversial" variables may be deemed unethical and risky; this includes proxies of such variables. Solution: exclude them if possible.
Supervised learning problem definition and examples
● Goal is to understand the relationship between the target variable and the predictors and/or make accurate predictions for the target based on the predictors. - examples: GLMs, decision trees
Goodness of fit vs prediction accuracy
● Goodness of fit to the training data (measured by the training error) is not the same as prediction accuracy (measured by the test error).
Importance and purpose of feature generation
● Process transforms the info in the original variables into a more useful form (or scale) so the model can absorb the info more efficiently and create more powerful predictors ○ Especially important for GLMs and unstructured data ● Feature generation tries to enhance the flexibility of the model and lower squared bias (at the expense of increased variance) ● It can also make the model easier to interpret
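A minimal sketch of common feature generation steps; the column names (income, age) and the bin boundaries are hypothetical:

```python
# Sketch: log transforms, binning, and interactions as generated features.
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [30e3, 55e3, 120e3], "age": [25, 40, 60]})

# Log transform tames the right skew of income so a GLM absorbs it better.
df["log_income"] = np.log(df["income"])
# Binning a numeric variable yields a categorical feature.
df["age_band"] = pd.cut(df["age"], bins=[0, 30, 50, 100],
                        labels=["young", "middle", "senior"])
# An interaction feature lets the effect of income vary with age.
df["income_x_age"] = df["income"] * df["age"]
print(df)
```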
Regression vs classification problems
● Regression problems: supervised learning problems w/ a numeric target variable ○ Logistic regression: special case w/ binary target variable ● Classification problems: target variable is categorical in nature ○ Classify observations to a certain level (classifier)
Decomposition of expected test error
● The more complex (more flexible) a model, the lower its bias; the more flexible, the higher its variance. ● Variance and squared bias make up the reducible error; the irreducible error is not impacted by flexibility.
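The decomposition itself (standard form, evaluated at a fixed test point x_0):

```latex
% Expected test error at x_0: variance + squared bias (the reducible
% error, both driven by model flexibility) plus the irreducible error.
\mathbb{E}\Big[\big(Y_0 - \hat{f}(x_0)\big)^2\Big]
  = \underbrace{\mathrm{Var}\big(\hat{f}(x_0)\big)
    + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2}_{\text{reducible error}}
  + \underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible error}}
```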
Variables vs Features
● Variables: raw measurements that are recorded and constitute the original dataset before any transformations are applied. ○ Serve as predictors in a model. ● Features: derivations from the original variables which provide an alternative, more useful view of the information contained in the dataset. ○ "Derivatives" of raw variables; they serve as the final inputs into a predictive model.
Training/test data split definition
● partition our data into a few parts ○ Training set: estimate signal function f and potentially model parameters ○ Test set (validation set): apply the trained model to make a prediction for the observations in the test set and assess the performance
Definition and purpose of cross validation (CV)
● provides a convenient means to assess the prediction performance of a model without using additional test data. ○ Purpose: to select the values of hyperparameters (aka tuning parameters), which are parameters that control some aspect of the fitting process itself. They play a role in the mathematical expression of the objective function or the set of constraints defining the optimization problem. ○ Their values need to be supplied in advance.
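A minimal sketch of using k-fold CV to tune a hyperparameter; ridge regression, the simulated data, and the alpha grid are illustrative stand-ins for whatever model and tuning parameter are at hand:

```python
# Sketch: 5-fold CV to pick the penalty strength of a ridge regression.
# The simulated data and the alpha grid are hypothetical.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 2.0]) + rng.normal(size=100)

for alpha in [0.01, 0.1, 1.0, 10.0]:  # candidate hyperparameter values
    # Each fold is held out once while the model trains on the rest,
    # so no additional test data is needed to compare the settings.
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    print(alpha, -scores.mean())  # pick the alpha with the lowest CV RMSE
```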
How to split data into test/training sets How many observations to use for each set
● randomly according to pre-specified proportions, or ● with special statistical techniques like stratified sampling. ○ # of observations for each set: up to the user; there is a tradeoff either way. Use the split to rank competing models: fit on the training set, then use the test set to evaluate performance.
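A minimal sketch of a random 80/20 split with optional stratification; the 80/20 proportion is a common convention, not a rule:

```python
# Sketch: random train/test split, stratified on a binary target so
# both sets keep the same class proportions.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 toy observations, 2 predictors
y = np.array([0, 1] * 5)           # binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
print(len(X_train), len(X_test))   # 8 2
```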