GBUS 3302 MODULE 2
A linear relationship
will look like a line, such as a slash ( / ). The relationship may have a very slight bend but will be mostly straight. You can think of it this way: LINEar makes a line.
variable handling
If there is a non-linear relationship, it may be better to categorize the variable into bins that have similar percentages
variable issues
Imputation of missing values, data transformation, binning, dimension reduction
data quality assessment
Do we measure what we think we measure? Is there missing information?
business area example: self-service analytics
Self-service analytics has become the current hot new development. It consists of technology solutions that can be applied by staff who have the domain knowledge, rather than requiring involvement of the IT department.
PCA is helpful for logistic regression; however, it is not necessarily
beneficial for other models. One should try different data preparations for different methods to improve predictions.
machine learning 2 types
The two broad types of learning are supervised and unsupervised.
CRISP data mining 6 steps
business understanding, data understanding, data preparation, modeling, evaluation, deployment
Before we can gather data for modeling, the business problem needs to be clearly stated. True False
true
CRISP DM stands for Cross-industry standard practice process for data mining, sometimes also called Cross-industry standard process for data mining. Select one: True False
true
If predictors of nominal modeling type have many different values, dimensionality becomes a problem. True False
true
If there is a non-linear relationship between the dependent variable and a continuous independent variable that does not follow a closed-form functional relationship, then it may be better to categorize the independent variable into bins. True False
true
Imputation can be used to estimate and replace missing values in the predictors. Select one: True False
true
Imputing missing values means that missing values are replaced by a predicted value. Select one: True False
true
It is advisable to standardize continuous predictor variables or scale them to similar ranges before a model is built. Select one: True False
true
PCA analysis helps to reduce a large number of continuous predictors to fewer dimensions. True False
true
PCA is a method that helps to deal with the curse of dimensionality. True False
true
The Cross-industry standard process (practice) for data mining (CRISP DM) can also be used for analytics projects. true false
true
The six main steps in CRISP are business understanding, data understanding, data preparation, modelling, evaluation and deployment. Select one: True False
true
Before we can gather data for modeling, the business problem needs to be clearly stated. Select one: True False
true - Business understanding is the first step. This includes a clearly defined business problem with goals and objectives.
A continuous predictor X should not be binned when a Fit Y by X plot in JMP shows that a linear, quadratic or other functional relationship exists. Select one: True False
true- The Fit Y by X is a good method to check the relationship first. If there is a functional relationship then no binning should be applied.
business understanding
understanding the business problem, the objective, and the importance to the organization. It is important to frame the business problem as an analytics problem, define the project goal and timeframe, and develop a project plan with a timeline.
CLASSIFICATION TREE
uses successive splitting on variables to predict an outcome.
evaluation: training data set
used to build a predictive model
binning
used to convert a continuous variable to a categorical variable. ex: may be better to use age groups rather than actual age.
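For illustration, a minimal sketch of binning in Python with pandas (not JMP); the age values and cut points are hypothetical.

```python
import pandas as pd

# Hypothetical ages; convert the continuous variable to a categorical one by binning
ages = pd.Series([23, 31, 38, 45, 52, 67, 74])
age_group = pd.cut(ages, bins=[0, 30, 50, 65, 100],
                   labels=["<30", "30-49", "50-64", "65+"])
print(age_group.value_counts())  # counts per age group
```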
evaluation: validation set
used to validate the result
Ensemble modelling
uses a combination of other models to improve prediction.
for medium and lower priced homes, when square footage goes up
price usually goes up, with the points falling fairly evenly around the line (an approximately linear relationship).
what is most important for obtaining the truth?
Variety, not volume, is what is important for obtaining the truth.
Modeling types are used in JMP to indicate the types of variables:
--- Continuous modeling type: numeric data (continuous or integer); both are treated as continuous in JMP (blue triangle icon)
--- Nominal modeling type: categorical, unordered (male, female); identified in JMP by a red bar-chart icon
--- Ordinal modeling type: categorical, ordered (low, medium, high); identified in JMP by a green bar-chart icon
standardization features
--- Used in some techniques where variables with the largest scales would otherwise dominate and skew results
--- Puts all variables on the same scale
--- Normalizing function: subtract the mean and divide by the standard deviation
++ Generally applied behind the scenes in JMP if needed, or a built-in option is provided in the model dialog
++ Select a variable in the data table and choose New Formula Column > Transform > Standardize
--- Alternative function: scale to 0-1 by subtracting the minimum and dividing by the range
--- Useful when the data contain both dummies and numeric variables
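A minimal sketch of both scalings in Python with pandas (JMP applies these through New Formula Column > Transform > Standardize or behind the scenes); the column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"income": [42000, 58000, 61000, 75000],
                   "age": [25, 34, 41, 52]})

# z-score standardization: subtract the mean and divide by the standard deviation
z = (df - df.mean()) / df.std()

# alternative: scale to 0-1 by subtracting the minimum and dividing by the range
scaled = (df - df.min()) / (df.max() - df.min())
```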
features in JMP Pro for evaluating missing values or working with missing values
-- Columns Viewer: reports the number of values missing for a variable
-- Missing Data Pattern: finds patterns of missing data
-- Explore Missing Values utility: provides methods for imputing missing data for continuous variables
-- Multivariate platform: provides imputation for continuous variables
-- Recode: can be used to recode missing values into a "missing" category
-- Informative Missing: available in most platforms to handle missing values
omission
- If a small number of records have missing values, one can omit them
- If many records are missing values on a small set of variables, one can drop those variables (or use proxies)
- If many records have missing values, omission is not practical
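A minimal sketch of counting missing values and omitting records or variables, using Python/pandas as an illustrative stand-in for JMP's Columns Viewer and row/column exclusion; the data are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x1": [1, 2, np.nan, 4],
                   "x2": [np.nan, 5, 6, 7],
                   "x3": [8, 9, 10, 11]})

print(df.isna().sum())   # number of missing values per variable

rows_kept = df.dropna()  # omit the few records that have any missing value
cols_kept = df.dropna(axis=1, thresh=int(0.7 * len(df)))  # keep only variables with at least 70% non-missing values
```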
The main concerns of the Data Understanding phase of CRISP-DM include:
- importance of domain knowledge
- data quality assessment: garbage in, garbage out
preparing the data includes the following tasks
1. Exploring the data to obtain an understanding of any issues that could affect modeling later
2. Correcting errors in coding
3. Determining what to do with missing data
4. Transforming and reclassifying data as necessary
5. Binning variables as necessary
6. Grouping continuous variables using principal component analysis (PCA)
dimension reduction
method to deal with the "curse of dimensionality."
business understanding: key questions
Business area: What is the area of specialization? Different areas have different needs and staff, e.g., marketing, operations.
Business objective: What is the business objective? e.g., increase response rates to ads.
Data needs: What data do we need? e.g., for "increase response rates to ads," start with a wish list of factors that could affect the outcome variable.
Preparing the data includes the following tasks except: Select one: A. Exploring the data to obtain an understanding of any issues that could affect modeling later B. Correction of errors in coding C. Run a single regression to predict the independent variables D. Transform and reclassify data as necessary E. Identify outliers F. Identify correct data modeling type
C. Run a single regression to predict the independent variables
Association is not
Causation
the 1st step in data preparation is
Check for outliers and missing information using the Distribution menu under Analyze in JMP
The second step in data preparation is:
Check for outliers, reduce number of levels of variables by binning and re-code data as necessary
Which of the following tasks does preparing the data not include? Transform and reclassify data as necessary / Imputing missing values of the dependent variable / Correction of errors in coding / Exploring the data to obtain an understanding of any issues that could affect modeling later
Imputing missing values of the dependent variable
evaluation definition
Instead of using p-values, data sets are partitioned into training, validation, and test sets to assess the reliability of data mining models.
When deciding whether a variable should be nominal or continuous numeric, we should:
Look at the dependency between outcome variable and predictors to decide on nominal versus numeric continuous
garbage in garbage out.
Making predictions with bad data is more than just wasting time. It can lead to the waste of money when these predictions are implemented.
handling missing data
Most algorithms will not process records with missing values; therefore, you must take action to resolve issues with missing data.
When the independent variables have very different scales, it creates a problem for the prediction. What will alleviate this problem?
Normalizing or "standardizing" the data. It is important to normalize the data range by creating a new variable. Some data mining models in JMP perform the standardization behind the scenes without you noticing it.
Predictive analytics can be utilized for:
- Predicting alcohol-impaired driving
- Predicting whether a customer will respond to an advertisement
- Predicting which voter can be influenced by what method
- Predicting customer churn in the telecommunication industry
- Predicting if a customer will default on a mortgage
imputation
Replace missing values with reasonable substitutes; this lets you keep the record and use the rest of its (non-missing) information.
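A minimal sketch of mean imputation using scikit-learn's SimpleImputer (an illustrative stand-in for JMP's Explore Missing Values utility); the values are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 50000.0],
              [np.nan, 62000.0],
              [41.0, np.nan]])

# Replace each missing value with the mean of its column so the record can be kept
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
```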
price
Y (the outcome variable in the housing example)
BIN
ZIP and LOCATION; examine their levels with Tabulate
Cross-Industry Standard Process (CRISP) for data mining is a generally
accepted process for data mining
Which of the following are modeling types in JMP? Ordinal Nominal All the options are modeling types Continuous
All the options are modeling types (ordinal, nominal, and continuous).
data preparation and data validation
compiling the data and preparing it for predictive modeling. This includes collecting data, cleaning the data, transforming the data as necessary, creating new variables useful for modeling, and reducing the number of categories for some of the variables (DIMENSION REDUCTION).
developing a project goal
Many project goals consist of classification problems, i.e., the outcome is binary. Every outcome variable should be clearly defined without any ambiguity. It is best to start with a wish list of factors that could affect the outcome variable.
data preparation steps
correcting errors in the data, handling missing data, data transformation, grouping factors, and creating new categories for data.
step 1
Create distributions for all variables. Use Analyze > Distribution to examine the distribution of each variable and to identify outliers, missing values, and coding issues.
Missing values may have
different codes such as "unknown", NA, etc. In that case, the first step is to re-categorize these values to "Missing". The missing value may by itself be informative.
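A minimal sketch of re-categorizing missing codes into a single "Missing" category, shown in Python/pandas rather than JMP's Recode; the status values are hypothetical.

```python
import numpy as np
import pandas as pd

status = pd.Series(["employed", "unknown", "NA", "retired", np.nan])

# Recode the explicit missing codes and the true missing value into one "Missing" category
status_clean = status.replace(["unknown", "NA"], "Missing").fillna("Missing")
print(status_clean.value_counts())
```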
difference between statistics and analytics: the analytics approach
Does not start with theory. After we have identified all data sources, we attempt to find a model that best describes the data and allows accurate predictions. There is no underlying theory for the behavior of the data; thus, we need a method of validation of the analytics approach.
understanding data requires
domain knowledge, a comprehension of the various measurement scales, and the awareness of the data quality.
Neural networks
employ algorithms that are less intuitive. The k-nearest neighbor method relies on the majority of the k closest data points to any given data point to find the best prediction for this given data point.
data understanding also includes
exploration of the data to identify any issues such as missing information, coding errors, and to determine the scale for each variable.
Dimensionality is not a problem in data mining because the algorithms can handle many variables. Select one: True False
false
If a continuous modeling type predictor variable has many values it may be better to categorize the variable into bins. True False
false
PCA analysis helps to reduce a large number of predictors of nominal modeling type to fewer dimensions. True False
false
Imputation can only be used for numeric variables. Select one: True False
false - both nominal and numeric variables can have missing values imputed
deployment
the final step, where models are used by the business in daily operations.
difference between statistics and analytics: the statistics approach
Founded in theory. Data are then collected and the hypothesis of the theory is tested. Much of statistical training focuses on using standard errors to guard against over-interpreting what we observe.
a mosaic plot
a graphical method for visualizing data from two or more qualitative variables that allows you to recognize relationships between them. The x-axis must have a variable with a nominal value scale.
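A minimal sketch of a mosaic plot using statsmodels in Python (an illustrative alternative to JMP's mosaic plot in Fit Y by X); the data are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

df = pd.DataFrame({"region": ["East", "East", "West", "West", "West", "East"],
                   "responded": ["Yes", "No", "No", "Yes", "No", "Yes"]})

# Mosaic plot of two nominal variables; tile areas show the joint frequencies
mosaic(df, ["region", "responded"])
plt.show()
```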
When continuous variables have missing values
imputation is often used to replace the missing values with substituted values.
predictive analytics political campaigns
influence voters and increase turnout. The objective is to predict individual behavior of voters using demographic and other factors.
when the shape of the green line (the quadratic fit) curves, typically looking like a very extended U, the relationship
will be non-linear. You can think of it this way: NON-LINEar makes a non-straight line.
higher priced homes
the points all sit above the fitted line; price does not change linearly with square footage.
modeling
the main step; its completion leads to a set of models that address the analytic problem one is trying to solve.
It is not necessary for predictors to be normally distributed in data mining methods; transformations can
make it easier to see patterns in the data.
how association is useful
There are many applications in business where one is interested in predicting an outcome without necessarily knowing which factors cause the outcome.
evaluation: ways to assess to reliability of data mining models
measuring the error in predictions, false positive rate, false negative rate, overall error, lift curve and ROC curves, and the confusion matrix.
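A minimal sketch of computing some of these measures with scikit-learn; the true labels, predictions, and probabilities are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.8, 0.3, 0.6, 0.7, 0.1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()  # the confusion matrix
false_positive_rate = fp / (fp + tn)
false_negative_rate = fn / (fn + tp)
overall_error = (fp + fn) / len(y_true)
auc = roc_auc_score(y_true, y_prob)  # area under the ROC curve
```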
Principal Component Analysis (PCA)
a method to reduce the number of predictors. It can also be used to quickly see how strongly predictors are correlated with each other and to spot potential problems with logistic regression due to collinearity. In cases of multi-collinearity, i.e., the predictors are of continuous modeling type and are highly correlated, PCA allows you to replace many correlated predictors with a few principal components that account for a large percentage of the variation in the predictors.
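A minimal sketch of PCA with scikit-learn (standardize first, then keep a few components); the data are random numbers used only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 8)                 # 8 continuous predictors (toy data)
X_std = StandardScaler().fit_transform(X)  # put predictors on the same scale

pca = PCA(n_components=3)                  # replace 8 predictors with 3 principal components
components = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)       # share of predictor variation each component captures
```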
data understanding
modeler has to understand the data, the quality of the data, and the limitations.
types of variables indicate
the modeling type, which helps to determine the types of pre-processing needed for the data.
the log transformation
the most common type of transformation, but it requires that there be no zero values.
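A minimal sketch in Python/NumPy; log1p (the log of 1 + x) is one common workaround when zeros are present.

```python
import numpy as np

x = np.array([0.0, 3.0, 10.0, 250.0, 12000.0])  # skewed, contains a zero

# np.log(0) is undefined (-inf); log1p shifts by 1 so zeros are handled
x_log = np.log1p(x)
```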
More predictors make modeling more difficult, and often predictors are strongly correlated with each other, so adding additional variables into models may
not provide much better predictions. This is especially important in logistic regression.
Which of the following are not data modeling types in JMP? Ordinal Continuous Nominal Numeric
numeric
solutions to missing values are
omission and imputation
Machine learning
part of predictive analytics; machine learning methods iteratively develop an understanding of a dataset to automatically learn how to recognize complex patterns, construct models that predict such patterns, and optimize results.
predict churn
predicting which customer is likely to switch providers.
step 2
Prepare the predictors. Reduce the dimensions of a predictor by binning. The simplest graphical method to understand the relationship between the dependent variable and a predictor is to create a mosaic plot.
data preparation
process of collecting, "cleaning," and consolidating data for use in analysis or modeling.
imputing means
replacing the missing value with its average, provided the variable is numeric.
deployment
requires us to consider the practical implications.
step 3
Select predictors. Predictors should only be kept as numerically continuous if there is a known function that explains the relationship between the outcome variable and the predictor, such as linear or quadratic.
for higher priced homes
square footage matters less; price increases only gradually as square footage goes up.
Logistic regression
statistical method that has a relatively straightforward interpretation in the modelling of log odds
logistic regression
a statistical model in nature; it provides standard errors for the parameter estimates. Logistic models basically model ODDS.
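A minimal sketch of fitting a logistic regression with statsmodels, which reports standard errors for the coefficients on the log-odds scale; the income/response data are hypothetical.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: did a customer respond (1/0) given income in $1000s?
income = np.array([20, 35, 50, 65, 80, 95, 110, 125])
responded = np.array([0, 0, 0, 1, 0, 1, 1, 1])

X = sm.add_constant(income)                 # add the intercept term
model = sm.Logit(responded, X).fit(disp=0)  # models the log odds of responding
print(model.summary())                      # coefficients with standard errors
```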
modeler must understand
the meaning of the data, how they were collected, and the time at which they were measured.
data mining
the process of extracting patterns from large datasets. It is based on the premise that meaningful information, which is non-random, novel, valid, useful, and ultimately understandable, is contained in all massive datasets.
objective of predictive analysis
to discover patterns and build predictive models by using data mining techniques
evaluation: data mining will split the data into
training and validation (holdout) sets
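A minimal sketch of partitioning data into training and validation (holdout) sets with scikit-learn; the data are random and only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(500, 4)              # toy predictors
y = np.random.randint(0, 2, size=500)   # toy binary outcome

# Hold out 30% of the records as the validation set
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=1)
```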
Some of the variables may be highly skewed, and applying a
transformation to the data can make the variable appear more normally distributed.