Analytics Final
Euclidean distance
The length of a straight line between 2 observations
Logarithmic transformation
The natural logarithm converts changes in a variable into percentage changes
Data mining
The process of applying a set of analytical techniques designed to uncover hidden patterns and relationships in data
Spreadsheet engineering
The process of designing, constructing, and testing spreadsheet models
Data partitioning
The process of dividing a data set into a training, validation, and optional test data set. Training set: generate one or more models. Validation set: fine-tune and compare the performance of competing models. Test set (optional): assess the performance of the final model on a new dataset
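A minimal sketch of partitioning in Python, assuming a small hypothetical data set and illustrative 50/30/20 proportions (the proportions are not prescribed by the definition above):

    import numpy as np
    import pandas as pd

    # Toy data set: 10 observations with one predictor and one response (hypothetical)
    df = pd.DataFrame({"x": range(10), "y": np.arange(10) * 2.0})

    rng = np.random.default_rng(seed=1)
    idx = rng.permutation(len(df))
    n_train = int(0.5 * len(df))   # 50% training (illustrative proportions)
    n_valid = int(0.3 * len(df))   # 30% validation; remainder is the optional test set
    train = df.iloc[idx[:n_train]]                    # used to build the model(s)
    valid = df.iloc[idx[n_train:n_train + n_valid]]   # used to compare/fine-tune models
    test = df.iloc[idx[n_train + n_valid:]]           # assesses the final model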
Specificity
The proportion of nontarget class cases that are classified correctly
Sensitivity/recall
The proportion of target class cases that are classified correctly
Cross validation
The sample is partitioned into a training set (to estimate the model) and a validation set (to assess how well the estimated model predicts with unseen data)
Manhattan distance
The shortest distance between 2 observations measured only horizontally or vertically. Refers to the city of Manhattan and its rectangular, block-like layout
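A minimal sketch of both the Euclidean distance (defined above) and the Manhattan distance, assuming two hypothetical observations:

    import numpy as np

    a = np.array([2.0, 5.0])   # observation 1 (hypothetical values)
    b = np.array([6.0, 8.0])   # observation 2

    euclidean = np.sqrt(np.sum((a - b) ** 2))   # straight-line distance: 5.0
    manhattan = np.sum(np.abs(a - b))           # horizontal + vertical distance: 7.0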
Spreadsheet modeling
The use of spreadsheet programs, like Excel, to implement decision models
Training set
To build/train the model
Model Auditing
Trace precedents, trace dependents, watch window
The holdout method
Training set and validation set (independent and mutually exclusive).
Principal components
Uncorrelated variables whose values are the weighted linear combinations of the original variables. The first principal component accounts for most of the variability in the data, followed by the second, and so on
Dynamic spreadsheet models
Undergo multiple iterations of improvements so that the model better mirrors real life
Dimension reduction
Unsupervised DM. The process of converting a set of high-dimensional data into data with fewer dimensions
Pattern recognition
Unsupervised DM. The process of recognizing patterns in the data using machine learning techniques
Quadratic regression model
Used in situations when x's effect on y changes from positive to negative (or vice versa) as x increases. The slope, capturing the influence of x on y, changes in magnitude and sign. U or inverted U-shape (ex: a firm's average cost curve). ŷ reaches a maximum (b2 < 0) or minimum (b2 > 0) when x = -b1/(2b2)
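A minimal sketch, assuming hypothetical data, fitting ŷ = b0 + b1x + b2x² and locating the turning point at x = -b1/(2b2):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    y = np.array([3.0, 6.0, 8.0, 8.5, 7.0, 4.0])   # rises then falls (inverted U)

    b2, b1, b0 = np.polyfit(x, y, deg=2)   # coefficients of x^2, x, and the constant
    turning_point = -b1 / (2 * b2)         # a maximum here, since b2 < 0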
Overfitting
When a regression model is made overly complex to fit the quirks of given sample data. Making the model conform too closely to the same sample data compromises its predictive power
Standardization
Z-scores. How many standard deviations away an observation is from the mean
k-fold cross validation method
k subsets: 1 of the k subsets is used as the validation set and the remaining k-1 subsets are put together to form the training set; the process is repeated so that each subset serves as the validation set once
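A minimal sketch of the mechanics, assuming hypothetical data and k = 5:

    import numpy as np

    data = np.arange(20)                     # 20 hypothetical observations
    k = 5
    rng = np.random.default_rng(seed=1)
    folds = np.array_split(rng.permutation(len(data)), k)   # k roughly equal subsets

    for i in range(k):
        valid_idx = folds[i]                                                  # 1 subset: validation set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])    # k-1 subsets: training set
        # fit the model on data[train_idx], then evaluate it on data[valid_idx]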
Linear Regression equation
y = β0 + β1x1 + β2x2 + ... + βkxk + ε, where y is the response variable; x1, x2, ..., xk are the k predictor variables; ε is the random error term; and β0, β1, β2, ..., βk are the unknown parameters to be estimated
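A minimal sketch of estimating the parameters by least squares, assuming hypothetical data with k = 2 predictors:

    import numpy as np

    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
    y = np.array([5.1, 5.9, 10.2, 10.8, 15.1])        # response values (hypothetical)

    X = np.column_stack([np.ones_like(x1), x1, x2])   # column of ones for the intercept β0
    b, *_ = np.linalg.lstsq(X, y, rcond=None)         # estimates of β0, β1, β2
    y_hat = X @ b                                     # fitted values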
Semi-log regression model
Logarithmic regression model and exponential regression model
Transformation methods
Make observations independent of scale
Similarity measures for categorical variables
Matching and Jaccard's coefficient
Performance measures for classification models
Misclassification/error rate, accuracy rate, sensitivity/recall, precision/positive predictive value, specificity
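A minimal sketch computing these measures from confusion-matrix counts, assuming hypothetical values with the target treated as the positive class:

    # Hypothetical confusion-matrix counts (target = positive class)
    tp, fn = 40, 10    # target class cases: classified correctly / missed
    tn, fp = 80, 20    # nontarget class cases: classified correctly / missed

    n = tp + fn + tn + fp
    accuracy = (tp + tn) / n              # overall proportion classified correctly
    misclassification = (fp + fn) / n     # overall proportion misclassified
    sensitivity = tp / (tp + fn)          # recall: proportion of target cases correct
    specificity = tn / (tn + fp)          # proportion of nontarget cases correct
    precision = tp / (tp + fp)            # predicted target cases that are truly target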
Trace precedents
Model auditing. Identify cells that contain data that are part of the formula in the active cell
Trace dependents
Model auditing. Identify cells whose values are affected by the active cell
Watch window
Model auditing. Monitor the values of selected cells from any workbook
Jaccard's coefficient
Used when negative outcomes are not as important as positive outcomes; only matching positive outcomes between 2 observations are counted
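A minimal sketch of both the matching coefficient and Jaccard's coefficient for two observations with binary (0/1) attributes, assuming hypothetical values:

    # Two observations described by five binary attributes (hypothetical)
    obs1 = [1, 0, 1, 1, 0]
    obs2 = [1, 1, 1, 0, 0]

    pairs = list(zip(obs1, obs2))
    matches = sum(a == b for a, b in pairs)                       # 1-1 and 0-0 matches
    positive_matches = sum(a == 1 and b == 1 for a, b in pairs)   # 1-1 matches only
    negative_matches = sum(a == 0 and b == 0 for a, b in pairs)   # 0-0 matches only

    matching_coefficient = matches / len(pairs)                               # 3/5 = 0.6
    jaccard_coefficient = positive_matches / (len(pairs) - negative_matches)  # 2/4 = 0.5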
Unsupervised data mining
No target variable. Dimension reduction and pattern recognition
Target outputs
Outcome values of a decision that we wish to know
Accuracy rate
Overall proportion of observations that are classified correctly
Misclassification/error rate
Overall proportion of observations that are misclassified
Oversampling
Overweights the rare class relative to the other class or classes in order to adjust the class distribution of a dataset (ex: only 200 out of 10,000 potential customers respond to a promotional email)
Precision/positive predictive value
Proportion of the predicted target class cases that belong to the target class
Performance evaluation for prediction models
RMSE, mean/average error, MAD/MAE, mean percentage error, mean absolute percentage error. Most common: RMSE and MAD (both range from 0 to ∞)
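A minimal sketch computing these measures, assuming hypothetical actual and predicted values and the convention error = actual - predicted:

    import numpy as np

    actual = np.array([100.0, 150.0, 200.0, 250.0])
    predicted = np.array([110.0, 140.0, 190.0, 260.0])
    e = actual - predicted                      # prediction errors

    rmse = np.sqrt(np.mean(e ** 2))             # penalizes large errors
    me = np.mean(e)                             # mean error: measures bias
    mad = np.mean(np.abs(e))                    # MAD/MAE: average magnitude of errors
    mpe = np.mean(e / actual) * 100             # mean percentage error (bias, in %)
    mape = np.mean(np.abs(e) / actual) * 100    # MAPE: error magnitude as % of actual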
When to use R^2 or adjusted R^2
R^2: used when comparing models with the same number of predictor variables. Adjusted R^2: used when the number of predictor variables differs across models
Interaction effect
In a regression model, the partial effect of a predictor variable on the response variable depends on the value of another predictor variable. Ex: an additional bedroom in a house with larger square footage will increase the overall price by more than an additional bedroom in a house with less square footage (square footage is the additional predictor variable)
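A minimal sketch, assuming hypothetical housing data, where the interaction variable is the product sqft * bedrooms so that the effect of bedrooms depends on square footage:

    import numpy as np

    sqft = np.array([1200.0, 1500.0, 1800.0, 2200.0, 2600.0])
    beds = np.array([2.0, 3.0, 3.0, 4.0, 4.0])
    price = np.array([210.0, 265.0, 300.0, 380.0, 430.0])   # in $1,000s (hypothetical)

    interaction = sqft * beds                                # the interaction variable
    X = np.column_stack([np.ones_like(sqft), sqft, beds, interaction])
    b, *_ = np.linalg.lstsq(X, price, rcond=None)
    # Partial effect of an extra bedroom on price: b[2] + b[3] * sqft,
    # so it grows (or shrinks) with square footage.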
Log-log regression model
Response variable and predictor variable are transformed into natural logs
Logarithmic regression model
Semi-log model that transforms only the predictor variable.
Exponential regression model
Semi-log model that transforms only the response variable
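A minimal sketch of the three transformed models (log-log, logarithmic, exponential), assuming hypothetical positive-valued data:

    import numpy as np

    x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
    y = np.array([3.0, 5.0, 9.0, 17.0, 33.0])

    # Log-log: both variables transformed -> regress ln(y) on ln(x)
    loglog = np.polyfit(np.log(x), np.log(y), deg=1)
    # Logarithmic (semi-log): only the predictor transformed -> regress y on ln(x)
    logarithmic = np.polyfit(np.log(x), y, deg=1)
    # Exponential (semi-log): only the response transformed -> regress ln(y) on x
    exponential = np.polyfit(x, np.log(y), deg=1)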
Data table
Shows how the calculated results change if one or two input variables change within a range of possible values
Receiver operating characteristic (ROC) curve
Shows the sensitivity and specificity measures across all cutoff values, and how accurately the model is able to classify both target and nontarget class cases
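A minimal sketch, assuming hypothetical class labels and predicted probabilities and using scikit-learn's roc_curve (the tooling is an assumption, not part of the definition above):

    import numpy as np
    from sklearn.metrics import roc_curve

    y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # actual class (1 = target)
    y_prob = np.array([0.9, 0.3, 0.7, 0.6, 0.4, 0.2, 0.8, 0.5])   # predicted probabilities

    fpr, tpr, cutoffs = roc_curve(y_true, y_prob)
    # tpr is the sensitivity and 1 - fpr is the specificity at each cutoff value;
    # plotting tpr against fpr traces the ROC curve.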
Circular reference
Spreadsheet error. A formula indirectly or directly refers back to its own cell. Sometimes necessary. Bypass through "enable iterative calculation"
Relative cell reference
Spreadsheet error that occurs when a copied formula's relative references shift unintentionally. To fix, use either an absolute or mixed (part relative, part absolute) cell reference
Root mean square error (RMSE)
Standard error of estimate for validation data set. Want lowest RMSE possible. Penalizes models with large prediction errors when used for comparing models (uses squared values)
Min-max normalization
Subtracts the min from each observation and divides by the range. Rescales each value to be between 0 and 1, 0 being the minimum and 1 being the maximum.
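A minimal sketch of both rescaling methods (z-score standardization, defined above, and min-max normalization), assuming hypothetical values:

    import numpy as np

    x = np.array([10.0, 20.0, 30.0, 50.0])

    z_scores = (x - x.mean()) / x.std()              # standardization: SDs from the mean (population SD; use x.std(ddof=1) for the sample SD)
    min_max = (x - x.min()) / (x.max() - x.min())    # normalization: rescaled to [0, 1]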
PCA (principal component analysis)
Summarizes large data sets of correlated variables with a smaller number of representative variables, called principal components
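A minimal sketch of PCA on standardized data via the eigenvectors of the covariance matrix, assuming hypothetical values:

    import numpy as np

    X = np.array([[2.0, 4.0, 1.0],
                  [3.0, 6.0, 2.0],
                  [5.0, 9.0, 2.5],
                  [7.0, 13.0, 4.0]])                # 4 observations, 3 correlated variables

    Z = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize each variable
    eigvals, eigvecs = np.linalg.eigh(np.cov(Z.T))  # eigen-decomposition of covariance matrix
    order = np.argsort(eigvals)[::-1]               # largest variance first
    scores = Z @ eigvecs[:, order]                  # principal components (weighted linear combos)
    explained = eigvals[order] / eigvals.sum()      # share of variability per component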
Classification model
Supervised DM. Target variable: categorical. Objective: predict the class membership of a new case (ex: a marketing manager classifying a list of customers as buyers or nonbuyers)
Prediction model
Supervised DM. Target variable: numerical. Objective: predict the numeric value of a new case (ex: predicting the spending of a customer, the selling price of a house, or the length of a hospital stay after a medical procedure)
Primary components of a decision model
Target outputs, input variables, processes
Supervised data mining
Target variable is identified. Classification and prediction models
Machine learning
Techniques that integrate self-learning algorithms designed to evaluate results and to improve performance over time
Cumulative lift chart/cumulative gains chart/lift chart
The ability of the classification model to identify target class cases. Shows improvement that a predictive model provides in capturing the target class cases when compared with random selection
What does normalization mean in statistics?
A process that makes the numerical observation independent of scale
Influence diagram
A visual tool used in spreadsheet engineering to illustrate the various components of a decision and their relationships
Interaction variable
An additional explanatory variable, typically the product of two predictor variables, used to capture an interaction effect
MAD/mean absolute error (MAE)
Average absolute error. Measures the average magnitude of the errors without considering direction. Not squared, large errors not penalized as heavily as in RMSE
MAPE (mean absolute percentage error)
Average absolute percentage error. Shows error as a percentage of the actual value, giving a sense of the magnitude of the errors
Mean error (ME)/average error(AE)
Average of the prediction errors. Used to measure prediction bias because it is affected by both positive and negative values. ME > 0: model underpredicts. ME < 0: model overpredicts. ME ≈ 0: predictions are unbiased
Mean percentage error (MPE)
Average percentage error. Used for prediction bias (affected by both pos and neg values), but doesn't show magnitude of errors
Matching coefficient
Based on matching outcomes to determine similarity among observations/records. The higher the value, the more similar
Goal seek
Works backward to determine the exact value of one input variable required to achieve a desired outcome
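Goal Seek is an Excel feature, but a minimal bisection sketch in Python illustrates the same idea of working backward from a desired output to the required input (the profit function and target below are hypothetical):

    def profit(units):
        # Hypothetical decision model: $25/unit revenue minus $10/unit cost and $3,000 fixed cost
        return 25 * units - 10 * units - 3000

    def goal_seek(f, target, lo, hi, tol=1e-6):
        # Bisection: narrow the input interval until f(input) reaches the target
        # (assumes f is increasing between lo and hi)
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    breakeven_units = goal_seek(profit, target=0, lo=0, hi=10_000)   # about 200 units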
Common spreadsheet model errors
Circular reference, relative cell reference
Selecting cutoff values
Class membership is determined by comparing the predicted probability of belonging to the target class with the cutoff value
Artificial intelligence
Computer systems that demonstrate human-like intelligence and cognitive abilities, like deduction, pattern recognition, and interpretation of complex data
Decile-wise lift chart
Conveys the same information as cumulative lift chart, but presents the information in equal-sized intervals/deciles
Scenario manager
Create and compare up to 32 possible scenarios based on multiple input variables. Summary report to show input variables and results. Helpful to assess future risk
CRISP-DM
Cross-Industry Standard Process for Data Mining. Business understanding, data understanding, data preparation, modeling, evaluation, and deployment
Performance charts for classification models
Cumulative lift chart, decile-wise lift chart, receiver operating characteristic (ROC) curve
Performance evaluation
Data partitioning, oversampling, overfitting
Stochastic model
Decision model that incorporates some randomness and can produce different outputs, even with the same inputs
Deterministic model
Decision model where all model parameters are known with certainty and the output is fully determined by the model parameters
Target variable
Dependent variable (response variable, outcome, output)
Similarity measures for numerical variables
Euclidean and Manhattan distance, z-score standardization, min-max normalization
Validation set
Used to evaluate the model's performance on data not used for training
Similarity measures
Gauge whether a group of observations are similar or dissimilar to one another
What-if/sensitivity analysis
Helps the decision maker predict how the target outputs of a spreadsheet model would change if one or more input variables are changed. Comprised of data tables, goal seek, and scenario manager
Model inputs/assumptions
Include both controllable and uncontrollable input variables
Predictor variable
Independent variable (explanatory variable, regressor)
Processes
Intermediate or final calculations that link the inputs with outputs of the decision model
