Analytics Final

Réussis tes devoirs et examens dès maintenant avec Quizwiz!

Euclidean distance

The length of a straight line between 2 observations

Logarithmic transformation

The natural logarithm converts changes in a variable into percentage changes

Data mining

The process of applying a set of analytical techniques designed to uncover hidden patterns and relationships in data

Spreadsheet engineering

The process of designing, constructing, and testing spreadsheet models

Data partitioning

The process of dividing a data set into a training, validation, and optional test data set. Training: generate one or more models Validation: fine-tune/compare the performance of competing models Optional: assesses the performance of the final model on a new dataset

Specificity

The proportion of nontarget class cases that are classified correctly

Sensitivity/recall

The proportion of target class cases that are classified correctly

Cross validation

The sample is partitioned into a training set (to estimate the model) and a validation set (to assess how well the estimated model predicts with unseen data)

Manhattan distance

The shortest distance between 2 observations measured only horizontally or vertically. Refers to the city of Manhattan and its rectangular, block-like layout

Spreadsheet modeling

The use of spreadsheet programs, like Excel, to implement decision models

Training set

To build/train the model

Model Auditing

Trace precedents, trade dependents, watch window

The holdout method

Training set and validation set (independent and mutually exclusive).

Principal components

Uncorrelated variables whose values are the weighted linear combinations of the original variables. The first principal component accounts for most of the variability in the data, followed by the second, and so on

Dynamic spreadsheet models

Undergo multiple iterations of improvements to make the model mirror real-life better

Dimension reduction

Unsupervised DM The process of converting a set of high-dimensional data into data with lesser dimensions

Pattern recognition

Unsupervised DM The process of recognizing patterns in the data using machine learning techniques

Quadratic regression model

Used in situations when x's affect on y changes from positive to negative over time (or vice versa). Slope, capturing the influence of x on y, changes in magnitude and sign. U or inverted U-shape (ex: average cost of firms curve). Y hat reaches a maximum (b2<0) or minimum (b2>0) when x=-b1/2(b2)

Overfitting

When a regression model is made overly complex to fit the quirks of given sample data. Making the model conform too closely to the same sample data compromises its predictive power

Standardization

Z-scores. How many standard deviations away an observation is from the mean

k-fold cross validation method

k subsets: 1 of the k-subsets used as validation set, remaining k-1 subsets are put together to form training set

Linear Regression equation

y: response variable x1, x2,...xk: k predictor variables ε: random error term B0,B1,B2,...Bk: unknown parameters to be estimated

Semi-log regression model

Logarithmic regression model and exponential regression model

Transformation methods

Make observations independent of scale

Similarity measures for categorical variables

Matching and Jaccard's coefficient

Performance measures for classification models

Misclassification/error rate, accuracy rate, sensitivity/recall, precision/positive predictive value, specificity

Trace precedents

Model auditing. Identify cels that contain data that are part of the formula in the active cell

Trace dependents

Model auditing. Identify cels whose values are affected by active cell

Watch window

Model auditing. Monitor the values of selected cells from any workbook

Jaccard's coefficient

Negative outcomes are not as important as positive outcomes, only match positive outcomes between 2 observations

Unsupervised data mining

No target variable. Dimension reduction and pattern recognition

Target outputs

Outcome values of a decision that we wish to know

Accuracy rate

Overall proportion of observations that are classified correctly

Misclassification/error rate

Overall proportion of observations that are misclassified

Oversampling

Overweights the rare class relative to the other class or classes in order to adjust the class distribution of a dataset (ex: only 200 out of 10,000 potential customers respond to a promotional email)

Precision/positive predictive value

Proportion of the predicted target class cases that belong to the target class

Performance evaluation for prediction models

RMSE, mean/average error, MAD/MAE, mean percentage error, mean absolute percentage error. Most common: RMSE, MAD (0,∞)

When to use R^2 or adjusted R^2

R^2: comparing models using the same number of predictor variables Adjusted R^2: number of predictor variables is different across models

Interaction effect

Regression model, the partial effect of a predictor variable on the response variable depends on the value of another predictor variable. ex: additional bedrooms for houses with larger square footage will increase the overall price by more than an additional bedroom for a house with lower sqft (sqft is additional predictor variable)

Log-log regression model

Response variable and predictor variable are transformed into natural logs

Logarithmic regression model

Semi-log model that transforms only the predictor variable.

Exponential regression model

Semi-log model that transforms only the response variable

Data table

Shows how the calculated results change if one or two input variables change within a range of possible values

Receiver operating characteristic (ROC) curve

Shows the sensitivity and specificity measures across all cutoff values, and how accurately the model is able to classify both target and nontarget class cases

Circular reference

Spreadsheet error. A formula indirectly or directly refers back to its own cell. Sometimes necessary. Bypass through "enable iterative calculation"

Relative cell reference

Spreadsheet error. To fix, use either absolute or mixed (part relative, part absolute) cell reference

Root mean square error (RMSE)

Standard error of estimate for validation data set. Want lowest RMSE possible. Penalizes models with large prediction errors when used for comparing models (uses squared values)

Min-max normalization

Subtracts the min from each observation and divides by the range. Rescales each value to be between 0 and 1, 0 being the minimum and 1 being the maximum.

PCA (principal component analysis)

Summarized large data sets of correlated variables with a smaller number of representative variables, called principal components

Classification model

Supervised DM Target variable: categorical Objective: predict the class membership of a new case (ex: marketing manager classifying a list of customers as buyers or nonbuyers)

Prediction model

Supervised DM Target variable: numerical Objective: predict the numeric value of a new case (ex: predicting the spending of a customer, the selling price of a house, or the length of a hospital stay after a medical procedure)

Primary components of a decision model

Target outputs, input variables, processes

Supervised data mining

Target variable is identified. Classification and prediction models

Machine learning

Techniques that integrate self-learning algorithms designed to evaluate results and to improve performance over time

Cumulative lift chart/cumulative gains chart/lift chart

The ability of the classification model to identify target class cases. Shows improvement that a predictive model provides in capturing the target class cases when compared with random selection

What does normalization mean in statistics?

A process that makes the numerical observation independent of scale

Influence diagram

A visual tool used in spreadsheet engineering to illustrate the various components of a decision and their relationships

Interaction variable

Additional explanatory variable

MAD/mean absolute error (MAE)

Average absolute error. Measures the average magnitude of the errors without considering direction. Not squared, large errors not penalized as heavily as in RMSE

MAPE (mean absolute percentage error)

Average absolute percentage error. Shows error as a percentage of the actual value, giving a sense of the magnitude of the errors

Mean error (ME)/average error(AE)

Average of the prediction errors. Used to measure prediction bias, because affected by both positive and negative values ME>0: model underpredicts ME<0: model overpredicts ME≈0: predictions are unbiased

Mean percentage error (MPE)

Average percentage error. Used for prediction bias (affected by both pos and neg values), but doesn't show magnitude of errors

Matching coefficient

Based on matching outcomes to determine similarity among observations/records. Higher the value, the more similar

Goal seek

Can determine the exact value of one input variable required to achieve a desired outcome. Works backward to identify the exact value for an input variable to reach the desired outcome

Common spreadsheet model errors

Circular reference, relative cell reference

Selecting cutoff values

Class membership is determined by comparing the predicted probability of belonging to the target class

Artificial intelligence

Computer systems that demonstrate human-like intelligence and cognitive abilities, like deduction, pattern recognition, and interpretation of complex data

Decile-wise lift chart

Conveys the same information as cumulative lift chart, but presents the information in equal-sized intervals/deciles

Scenario manager

Create and compare up to 32 possible scenarios based on multiple input variables. Summary report to show input variables and results. Helpful to assess future risk

CRISP-DM

Cross-Industry Standard Process for Data Mining. Business understanding, data understanding, data preparation, modeling, evaluation, and deployment

Performance charts for classification models

Cumulative lift chart, decile-wise lift chart, receiver operating characteristic (ROC) curve

Performance evaluation

Data partitioning, oversampling, overfitting

Stochastic model

Decision model that incorporates some randomness and can produce different outputs, even with the same inputs

Deterministic model

Decision model where all model parameters are known with certainty and the output is fully determined by the model parameters

Target variable

Dependent variable (response variable, outcome, output)

Similarity measures for numerical variables

Euclidean and Manhattan distance, z-score standardization, min-max normalization

Validation set

Evaluate model

Similarity measures

Gauge whether a group of observations are similar or dissimilar to one another

What-if/sensitivity analysis

Helps the decision maker predict how the target outputs of a spreadsheet model would change if one or more input variables are changed. Comprised of data tables, goal seek, and scenario manager

Model inputs/assumptions

Include both controllable and uncontrollable input variables

Predictor variable

Independent variable (explanatory variable, regressor)

Processes

Intermediate or final calculations that link the inputs with outputs of the decision model


Ensembles d'études connexes

Lesson 5: Configuring Network Addressing and Internet Connections

View Set

World History: Quiz 2 Between the Wars

View Set

Social Studies Unit 3: Latin and South American Revolutions

View Set

Foundations of Project Management Week 1

View Set