GBUS 3302 MODULE 2

will look like a line, such as a slash ( / ). This relationship may have a very slight bend, but will be mostly straight. You can think of it this way: LINEar makes a line.

A linear relationship

variable handling

If there is a non-linear relationship, it may be better to categorize the variable into bins that have similar percentages

variable issues

Imputation of missing values, data transformation, binning, dimension reduction

data quality assessment

Do we measure what we think we measure? Is there missing information?

business area example: self-service analytics

has become the current hot new development. It consists of technology solutions that can be applied by the staff who have the domain knowledge, rather than requiring involvement of the IT department.

PCA is helpful for logistic regression; however, it is not necessarily

beneficial for other models. One should try different data preparations for different methods to improve predictions.

machine learning: 2 types

two broad types of learning: supervised and unsupervised

CRISP data mining: 6 steps

business understanding, data understanding, data preparation, modeling, evaluation, deployment

Before we can gather data for modeling, the business problem needs to be clearly stated. True False

true

CRISP DM stands for Cross-industry standard practice process for data mining, sometimes also called Cross-industry standard process for data mining. Select one: True False

true

If predictors of nominal modeling type have many different values, dimensionality becomes a problem. True False

true

If there is a non-linear relationship between the dependent variable and a continuous independent variable that does not follow a closed-form functional relationship, then it may be better to categorize the independent variable into bins. True False

true

Imputation can be used to estimate and replace missing values in the predictors. Select one: True False

true

Imputing missing values means that missing values are replaced by a predicted value. Select one: True False

true

It is advisable to standardize continuous predictor variables or scale them to similar ranges before a model is built. Select one: True False

true

PCA analysis helps to reduce a large number of continuous predictors to fewer dimensions. True False

true

PCA is a method that helps to deal with the curse of dimensionality. True False

true

The Cross-industry standard process (practice) for data mining (CRISP-DM) can also be used for analytics projects. True False

true

The six main steps in CRISP are business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Select one: True False

true

Before we can gather data for modeling, the business problem needs to be clearly stated. Select one: True False

true - Business understanding is the first step. This includes a clearly defined business problem with goals and objectives.

A continuous predictor X should not be binned when a Fit Y by X plot in JMP shows that a linear, quadratic or other functional relationship exists. Select one: True False

true- The Fit Y by X is a good method to check the relationship first. If there is a functional relationship then no binning should be applied.

business understanding

understanding the business problem, the objective, and the importance to the organization. It is important to frame the business problem as an analytics problem, define the project goal timeframe and develop a project plan with a timeline.

CLASSIFICATION TREE

uses successive splitting on variables to predict an outcome.

evaluation: training data set

used to build a predictive model

binning

used to convert a continuous variable to a categorical variable. Example: it may be better to use age groups rather than actual age.
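
The age-group example can be sketched in a few lines of Python. This is only an illustration: the cut points (30, 50, 65) and the sample ages are assumptions made up for the example, not values from the course material.

```python
def age_group(age):
    """Map a continuous age to a categorical bin.

    The cut points below are hypothetical; in practice they would be
    chosen by looking at how the outcome varies with age.
    """
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    if age < 65:
        return "50-64"
    return "65+"


# Hypothetical ages, for illustration only.
ages = [22, 35, 47, 51, 68, 29]
groups = [age_group(a) for a in ages]
print(groups)
```

After binning, the predictor would be treated as nominal or ordinal rather than continuous.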

evaluation: validation set

used to validate the result

Ensemble modelling

uses a combination of other models to improve prediction.

for medium and lower priced homes- when square feet goes up

usually price goes up- lines evenly- DOESNT MATTER

what is most important for obtaining the truth?

variety, not volume, is important for obtaining the truth.

Modeling types are used in JMP to indicate the types of variables:

---Continuous Modeling Type: Numeric data (Continuous or Integer). Both are treated as continuous in JMP (blue triangle icon)
---Nominal Modeling Type: Categorical, unordered (male, female). Identified in JMP by a red bar-chart icon
---Ordinal Modeling Type: Ordered (low, medium, high). Identified in JMP by a green bar-chart icon

standardization features

---Used in some techniques when variables with the largest scales would dominate and skew results
---Puts all variables on the same scale
---Normalizing function: subtract the mean and divide by the standard deviation
++Generally applied behind the scenes in JMP if needed, or a built-in option is provided in the model dialog
++Select a variable in the data table and select New Formula Column > Transform > Standardize
---Alternative function: scale to 0-1 by subtracting the minimum and dividing by the range
---Useful when the data contain both dummy and numeric variables
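
The two scaling functions described above can be sketched in plain Python. This is a minimal illustration, not JMP's implementation: the function names and the sample values are made up, and using the sample standard deviation (n - 1 denominator) is an assumption.

```python
from statistics import mean, stdev


def standardize(values):
    """Z-score: subtract the mean, divide by the (sample) standard deviation."""
    m = mean(values)
    s = stdev(values)  # assumption: sample standard deviation
    return [(v - m) / s for v in values]


def scale_to_unit_range(values):
    """Min-max scaling: subtract the minimum, divide by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical predictor values (square feet), for illustration only.
square_feet = [1000, 1500, 2000, 2500, 3000]

z_scores = standardize(square_feet)
unit_scaled = scale_to_unit_range(square_feet)
print(z_scores)
print(unit_scaled)
```

After standardizing, the values have mean 0 and standard deviation 1; after min-max scaling they lie between 0 and 1, which is comparable to the 0/1 range of dummy variables.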

features in JMP Pro for evaluating missing values or working with missing values

--Columns Viewer: reports the number of values missing for a variable
--Missing Data Pattern: finds patterns of missing data
--Explore Missing Values utility: provides methods for imputing missing data for continuous variables
--Multivariate platform: provides imputation for continuous variables
--Recode: can be used to recode missing values into a "missing" category
--Informative Missing: available in most platforms to handle missing values

omission

-If a small number of records have missing values, one can omit them
-If many records are missing values on a small set of variables, one can drop those variables (or use proxies)
-If many records have missing values, omission is not practical

The main concerns of the Data Understanding phase of CRISP-DM include:

-importance of domain knowledge
-data quality assessment (garbage in, garbage out)

preparing the data includes the following tasks

1. Explore the data to obtain an understanding of any issues that could affect modeling later
2. Correct errors in coding
3. Determine what to do with missing data
4. Transform and reclassify data as necessary
5. Bin variables as necessary
6. Group continuous variables using principal component analysis (PCA)

dimension reduction

method to deal with the "curse of dimensionality."

business understanding states

Business area: What is the area of specialization? Different areas have different needs and staff, e.g. Marketing, Operations, etc.
Business objective: What is the business objective? e.g. increase response rates to ads
Data needs: What data do we need? e.g. for "increase response rates to ads", start with a wish list of factors that could affect the outcome variable

Preparing the data includes the following tasks except: Select one:
A. Exploring the data to obtain an understanding of any issues that could affect modeling later
B. Correction of errors in coding
C. Run a single regression to predict the independent variables
D. Transform and reclassify data as necessary
E. Identify outliers
F. Identify correct data modeling type

C. Run a single regression to predict the independent variables

Association is not

Causation

the 1st step in data preparation is

Check for outliers and missing information using the Distribution menu under Analyze in JMP

The second step in data preparation is:

Check for outliers, reduce number of levels of variables by binning and re-code data as necessary

Which of the following tasks does preparing the data not include?
-Transform and reclassify data as necessary
-Imputing missing values of the dependent variable
-Correction of errors in coding
-Exploring the data to obtain an understanding of any issues that could affect modeling later

Imputing missing values of the dependent variable

evaluation definition

Instead of using p-values, data sets are partitioned into training, validation, and test sets to assess the reliability of data mining models.
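
A random partition into training, validation, and test sets might look like the sketch below. The 60/20/20 proportions, the fixed seed, and the function name `partition` are assumptions chosen for illustration; the notes do not prescribe specific proportions.

```python
import random


def partition(records, train=0.6, validation=0.2, seed=1):
    """Randomly split records into training, validation, and test sets.

    Proportions and seed are hypothetical defaults for this example.
    Whatever is left after training and validation becomes the test set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# 100 hypothetical row ids standing in for data records.
rows = list(range(100))
train_set, validation_set, test_set = partition(rows)
print(len(train_set), len(validation_set), len(test_set))
```

The model is built on the training set only; the validation and test sets stay untouched so that prediction error can be measured on records the model has not seen.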

When deciding whether a variable should be nominal or continuous numeric, we should:

Look at the dependency between outcome variable and predictors to decide on nominal versus numeric continuous

garbage in garbage out.

Making predictions with bad data is more than just wasting time. It can lead to the waste of money when these predictions are implemented.

handling missing data

Most algorithms will not process records with missing value, therefore you must take action to resolve issues with missing data

when the independent variables have very different scales it will create a problem for the prediction- what will alleviate this problem?

Normalizing or "standardizing" the data. It is important to normalize the data range by creating a new variable.
---Some data mining models in JMP perform the standardization behind the scenes without you noticing it.

Predictive analytics can be utilized for:

-Predicting alcohol-impaired driving
-Predicting whether a customer will respond to an advertisement
-Predicting which voter can be influenced by what method
-Predicting customer churn in the telecommunication industry
-Predicting if a customer will default on a mortgage

imputation

Replace missing values with reasonable substitutes. This lets you keep the record and use the rest of its (non-missing) information.
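
One simple substitute for a numeric variable is its mean. The sketch below assumes missing values are represented as `None`, and the income figures are made up; it illustrates the idea and is not the method any particular JMP utility uses.

```python
from statistics import mean


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical incomes (in thousands) with two missing entries.
income = [40.0, None, 60.0, 50.0, None]
imputed = impute_mean(income)
print(imputed)
```

All five records are kept, so the other (non-missing) columns of those records remain usable for modeling.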

price

Y

BIN

ZIP, LOCATION tabulate

Cross-Industry Standard Process (CRISP) for data mining is a generally

accepted process for data mining

Which of the following are modeling types in JMP? Ordinal Nominal All the options are modeling types Continuous

All the options are modeling types

data preparation and data validation

compiling the data and preparing them for predictive modeling. Includes collecting data, cleaning the data, transforming the data as necessary, creating new variables useful for modeling, and reducing the number of categories for some of the variables (dimension reduction).

developing a project goal

consists of classification algorithms, i.e., the outcome is binary. Every outcome variable should be clearly defined without any ambiguity. It is best to start with a wish list of factors that could affect the outcome variable.

data preparation steps

correcting errors in the data, handling missing data, data transformation, grouping factors, and creating new categories for data.

step 1

create a distribution for all variables. Use Analyze > Distribution to find the distribution of the variables and to identify outliers, missing values, and coding issues.

Missing values may have

different codes such as "unknown", NA, etc. In that case, the first step is to re-categorize these values to "Missing". The missing value may be by itself informative.
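
Re-categorizing the various missing-value codes into a single "Missing" level could be sketched as follows. The set of codes and the sample column are assumptions for illustration; a real data set would need its own list of codes.

```python
# Hypothetical codes that stand for "no value" in the raw data.
MISSING_CODES = {"unknown", "na", "n/a", ""}


def recode_missing(value):
    """Map None and any known missing-value code to the single level 'Missing'."""
    if value is None or str(value).strip().lower() in MISSING_CODES:
        return "Missing"
    return value


# Hypothetical nominal column with several different missing codes.
raw = ["owner", "unknown", "NA", "renter", None, " "]
recoded = [recode_missing(v) for v in raw]
print(recoded)
```

Keeping "Missing" as its own category, rather than dropping those records, lets a model pick up on cases where the missingness itself is informative.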

difference between statistics and analytics: analytics

does not start with theory. After we have identified all data sources, we attempt to find a model that best describes the data and allows accurate predictions. There is no underlying theory for the behavior of the data. Thus, we need a method of validation of the analytics approach.

understanding data requires

domain knowledge, a comprehension of the various measurement scales, and the awareness of the data quality.

Neural networks

employ algorithms that are less intuitive. The k-nearest neighbor method relies on the majority of the k closest data points to any given data point to find the best prediction for this given data point.
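
The k-nearest neighbor rule described here can be sketched as a majority vote among the k closest training points. The points, labels, and the choice of Euclidean distance with k = 3 are assumptions for illustration.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote of its k nearest neighbors.

    Uses Euclidean distance (an assumption; other distances are possible).
    """
    order = sorted(
        range(len(train_points)),
        key=lambda i: dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

prediction = knn_predict(points, labels, (2, 2), k=3)
print(prediction)
```

Because the method relies on distances, predictors on very different scales would normally be standardized first so that no single variable dominates.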

data understanding also includes

exploration of the data to identify any issues such as missing information, coding errors, and to determine the scale for each variable.

Dimensionality is not a problem in data mining because the algorithms can handle many variables. Select one: True False

false

If a continuous modeling type predictor variable has many values it may be better to categorize the variable into bins. True False

false

PCA analysis helps to reduce a large number of predictors of nominal modeling type to fewer dimensions. True False

false

Imputation can only be used for numeric variables. Select one: True False

false - values can be imputed for both nominal and numeric variables

deployment

final step where models are used for business in the daily operation.

difference between statistics and analytics: statistics

founded in theory. Data are then collected and the hypothesis of the theory is tested. Much of the statistical training focuses on using standard errors to guard against over-interpreting what we observe.

a mosaic plot

graphical method for visualizing data from two or more qualitative variables that allows you to recognize relationships between them. The x-axis must have a variable with a nominal value scale.

When continuous variables have missing values

imputation is often used to replace the missing values with substituted values.

predictive analytics political campaigns

influence voters and increase turnout. The objective is to predict individual behavior of voters using demographic and other factors.

when the shape of the green line curves (in this case, the quadratic line). This will typically look like a very extended U. NON-LINEar makes a non-straight line

it will be non-linear.

higher priced homes

line that points are all above the line- SF doesnt linear change

modeling

main step. Completion of this step leads to a set of models that address the analytic problem that one is trying to solve.

it is not necessary for predictors to be normally distributed in data mining methods, transformations can

make it easier to see patterns in the data.

how association is useful

many applications in business where one is interested in a prediction of an outcome without necessarily knowing which factors cause the outcome.

evaluation: ways to assess to reliability of data mining models

measuring the error in predictions, false positive rate, false negative rate, overall error, lift curve and ROC curves, and the confusion matrix.
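
Several of these measures come directly from the confusion matrix counts. The sketch below uses made-up actual and predicted labels (1 = positive, 0 = negative); the function names are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Return (tp, tn, fp, fn) for binary labels coded 1 (positive) and 0 (negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn


# Hypothetical validation-set labels and model predictions.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
false_positive_rate = fp / (fp + tn)   # share of actual negatives predicted positive
false_negative_rate = fn / (fn + tp)   # share of actual positives predicted negative
overall_error = (fp + fn) / len(actual)
print(tp, tn, fp, fn, false_positive_rate, false_negative_rate, overall_error)
```

These rates are meaningful only when computed on validation or test records, not on the data used to build the model.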

Principal Component Analysis (PCA)

method to reduce the number of predictors. It can also be used to quickly obtain information on how strongly predictors are correlated with each other and to determine potential problems with logistic regression due to collinearity. In cases where you have multicollinearity, i.e., the predictors are of continuous modeling type and are highly correlated, PCA allows you to replace many correlated predictors with a few principal components that account for a large percentage of the variation in the predictors.
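
To see why highly correlated predictors can be replaced by fewer components, consider the simplest case of two standardized predictors with correlation r. Their correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + r and 1 - r, so the first principal component accounts for (1 + |r|) / 2 of the total variation. The sketch below computes this share; the data values and function names are made up for illustration, and a real analysis with many predictors would use a PCA routine rather than this two-variable shortcut.

```python
from math import sqrt
from statistics import mean


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def first_component_share(x, y):
    """Share of total variance explained by the first principal component
    of two standardized predictors (eigenvalues are 1 + r and 1 - r)."""
    r = correlation(x, y)
    return (1 + abs(r)) / 2


# Hypothetical, highly correlated predictors for a set of homes.
square_feet = [1000, 1500, 2000, 2500, 3000]
rooms = [4, 5, 7, 8, 10]

r = correlation(square_feet, rooms)
share = first_component_share(square_feet, rooms)
print(r, share)
```

With predictors this strongly correlated, a single component carries nearly all of the variation, so little is lost by replacing the two predictors with one.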

data understanding

modeler has to understand the data, the quality of the data, and the limitations.

types of variables indicate

modeling type- help to determine the types of pre-processing needed for the data

the log transformation

most common type of transformation, but it requires that there are no zero (or negative) values.
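
A small sketch of the log transformation, including the check that every value is strictly positive. The prices are made up, and the choice of the natural logarithm is an assumption (any base compresses large values in the same way).

```python
from math import log


def log_transform(values):
    """Apply the natural log; refuse zero or negative values, where log is undefined."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform requires strictly positive values")
    return [log(v) for v in values]


# Hypothetical, right-skewed home prices: one value is far larger than the rest.
prices = [100_000, 150_000, 200_000, 2_000_000]
log_prices = log_transform(prices)
print(log_prices)
```

On the original scale the largest price is 20 times the smallest; on the log scale the gap shrinks to log(20), about 3, which pulls in the long right tail.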

more predictors make modeling more difficult and often predictors are strongly correlated with each other so that adding additional variables into models may

not provide much better predictions. This is especially important in logistic regression

Which of the following are not data modeling types in JMP? Ordinal Continuous Nominal Numeric

numeric

solutions to missing values are

omission and imputation

Machine learning

part of predictive analytics. It iteratively develops an understanding of a dataset to automatically learn how to recognize complex patterns, construct models that predict such patterns, and optimize results.

predict churn

predicting which customer is likely to switch providers.

step 2

preparing predictors: reduce the dimensions of a predictor by binning. The simplest graphical method to understand the relationship between the dependent variable and the predictor is to create a mosaic plot.

data preparation

process of collecting, "cleaning," and consolidating data for use in analysis or modeling.

imputing means

replacing the missing value with the variable's average, provided the variable is numeric.

deployment

requires us to consider the practical implications.

step 3

select predictors: predictors should only be numerically continuous if there is a known function that explains the relationship between the outcome variable and the predictors, such as linear or quadratic.

when higher priced home

square feet doesnt matter- as goes up more gradually increases for the higher the square feet

Logistic regression

statistical method that has a relatively straightforward interpretation in the modelling of log odds

logistic regression

statistical model in nature. It provides standard errors for the parameter estimates. Logistic models basically model ODDS.
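
The link between probability, odds, and log odds can be sketched as below. The function names are hypothetical; the formulas themselves are the standard definitions odds = p / (1 - p) and p = 1 / (1 + exp(-log odds)).

```python
from math import exp, log


def odds(p):
    """Odds corresponding to a probability p (0 < p < 1)."""
    return p / (1 - p)


def probability_from_log_odds(log_odds):
    """Invert the log-odds (logit) scale back to a probability."""
    return 1 / (1 + exp(-log_odds))


# A probability of 0.8 corresponds to odds of 4 to 1.
example_odds = odds(0.8)
# Log odds of 0 correspond to even odds, i.e. a probability of 0.5.
example_probability = probability_from_log_odds(0.0)
print(example_odds, log(example_odds), example_probability)
```

A logistic regression fits a linear function of the predictors on the log-odds scale, which is why its coefficients are interpreted as changes in log odds.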

modeler must understand

the meaning of the data, how they were collected, and the time at which they were measured.

data mining

the process of extracting patterns from large datasets. It is based on the premise that meaningful information, which is non-random, novel, valid, useful, and ultimately understandable, is contained in all massive datasets.

objective of predictive analysis

to discover patterns and build predictive models by using data mining techniques

evaluation: data mining will split the data into

training and validation (holdout) sets

Some of the variables may be highly skewed and applying

transformation to the data can make the variable appear more normally distributed.

