Chapter 2 Overview of the Data Mining Process ISDS 574


7 Core Ideas in Data Mining

1. Classification

SCHEMATIC OF THE DATA MODELING PROCESS 8 steps

1. Define purpose 2. Obtain data 3. Explore and clean data 4. Determine DM task 5. Choose DM methods 6. Apply methods, select final model 7. Evaluate performance 8. Deploy

Training Partition

The training partition, typically the largest partition, contains the data used to build the various models we are examining. The same training partition is generally used to develop multiple models.

Test Partition

This partition (sometimes called the holdout or evaluation partition) is used if we need to assess the performance of the chosen model with new data.

Someone with domain knowledge (i.e., knowledge of the business process and the data) should be consulted, as knowledge of what the variables represent can help build a good model and avoid errors.

Who can help you pay close attention to the variables that are included in a model, and why?

When we use the validation data to assess multiple models and then pick the model that does best with the validation data, we again encounter another (lesser) facet of the overfitting problem: chance aspects of the validation data may happen to favor the chosen model. The test data, which the model has not seen before, will provide an unbiased estimate of how well it will do with new data.

Why have both a validation and a test partition?

Association Rules

______________, or affinity analysis: the analysis of associations among items.

training data

are the data from which the classification or prediction algorithm "learns," or is "trained," about the relationship between predictor variables and the outcome variable.

Supervised learning algorithms

are those used in classification and prediction. We must have data available in which the value of the outcome of interest (e.g., purchase or no purchase) is known.

Data Visualization

Exploring data through graphical analysis to see what information they hold. This includes looking at each variable separately as well as looking at relationships between variables.

3

How many dummy variables do you need for a categorical variable with 4 categories?
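
The count of 3 comes from the fact that a categorical variable with m categories needs only m - 1 dummy (0/1) variables, because the last category is implied when all the others are 0. A minimal sketch of that coding follows; the function name `dummy_code` and the region categories are hypothetical, chosen only for illustration.

```python
def dummy_code(values, categories):
    """Code each value as 0/1 indicators, dropping the last category as the reference level."""
    kept = categories[:-1]  # m categories -> m - 1 dummy columns
    return [[1 if v == c else 0 for c in kept] for v in values]


# Hypothetical categorical variable with 4 categories.
categories = ["north", "south", "east", "west"]
rows = dummy_code(["south", "west", "north"], categories)

for row in rows:
    print(row)  # each row has 3 indicators; "west" is coded as all zeros
```

Which category serves as the reference level is a modeling choice; dropping the last one here is just a convention for the sketch.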

Classification

The most basic form of data analysis. For example, the recipient of an offer can respond or not respond. A common task in data mining is to examine data where the classification is unknown or will occur in the future, with the goal of predicting what that classification is or will be.

use bar charts

what charts are used for categorical variables?

To address this problem of bias/overfitting, we simply divide (partition) our data and develop our model using only one of the partitions.

Use and Creation of Partitions

Outliers

Values that lie far away from the bulk of the data. Analysts use rules of thumb such as "anything over 3 standard deviations away from the mean is an outlier."
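
The 3-standard-deviation rule of thumb can be sketched in a few lines. This is only an illustration under assumptions: the function name `flag_outliers` and the sample data are made up, and the sample standard deviation is used.

```python
import statistics


def flag_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) > threshold * sd]


# Made-up data: 19 ordinary values and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
        12, 10, 11, 13, 12, 11, 10, 12, 13, 95]
outliers = flag_outliers(data)
print(outliers)
```

A flagged value is a candidate for review, not automatically an error; domain knowledge decides whether it is a mistake or a genuine extreme observation.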

1. More is not necessarily better. 2. Other things being equal, parsimony (compactness) is a desirable feature in a model.

Variable Selection: what should we look at?

1. Training Partition 2. Validation partition 3. Test partition

What are the three partitions into which one would divide the data?
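
A random three-way partition can be sketched as follows. The 50/30/20 proportions, the function name `partition`, and the fixed seed are assumptions made for illustration; the source does not prescribe particular proportions.

```python
import random


def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Randomly split records into training, validation, and test partitions."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_valid = round(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains
    return train, valid, test


train, valid, test = partition(range(100))
print(len(train), len(valid), len(test))
```

Each record lands in exactly one partition, so the model is never assessed on a record it was built from.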

the greater the risk of overfitting the data.

What happens the more variables we have?

Association rules, dimension reduction methods, and clustering techniques are all unsupervised learning methods.

What techniques are used in unsupervised learning?

Unsupervised learning algorithms

are those used where there is no outcome variable to predict or classify. Hence, there is no "learning" from cases where such an outcome variable is known.

Online merchants such as Amazon.com and Netflix.com use these methods as the heart of a "recommender" system that suggests new purchases to customers.

Give an example of association rules used in business.

To normalize the data, we subtract the mean from each value and divide by the standard deviation of the resulting deviations from the mean. In effect, we are expressing each value as the "number of standard deviations away from the mean," also called a z-score.

How to normalize (standardize) the data
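
Written as a formula, the z-score of a value x is z = (x - mean) / sd. A minimal sketch, assuming the sample standard deviation and a made-up list of values (the function name `z_scores` is hypothetical):

```python
import statistics


def z_scores(values):
    """Express each value as its number of standard deviations away from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(v - mean) / sd for v in values]


z = z_scores([2, 4, 4, 4, 5, 5, 7, 9])
print([round(v, 2) for v in z])
```

After normalization the values have mean 0 and standard deviation 1, which puts variables measured in different units on a common scale.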

Simple linear regression analysis

is an example of supervised learning. The Y variable is the (known) outcome variable and the X variable is a predictor variable. A regression line is fitted to these data. The regression line can then be used to predict Y values for new values of X for which we do not know the Y value.
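
As a sketch of the idea, the line can be fitted by ordinary least squares on records where Y is known and then applied to a new X. The function name `fit_line` and the tiny dataset (which follows y = 2x + 1 exactly) are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Training data: X values with known Y outcomes.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)

# Predict Y for a new X whose outcome is unknown.
predicted = intercept + slope * 6
print(slope, intercept, predicted)
```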

variable selection. In such a matrix, we can see at a glance scatterplots for all variable combinations. A straight line would be an indication that one variable is exactly correlated with another.

What can a matrix of scatterplots be useful for?

1. Develop an understanding of the purpose of the data mining project. 2. Obtain the dataset to be used in the analysis. This often involves random sampling from a large database. 3. Explore, clean, and preprocess the data. This involves verifying that the data are in reasonable condition. 4. Reduce the data, if necessary, and (where supervised training is involved) separate them into training, validation, and test datasets. 5. Determine the data mining task (classification, prediction, clustering, etc.). 6. Choose the data mining techniques to be used (regression, neural nets, hierarchical clustering, etc.). 7. Use algorithms to perform the task. 8. Interpret the results of the algorithms. This involves making a choice as to the best algorithm to deploy. 9. Deploy the model.

Steps in Data Mining

1. Types of Variables 2. Handling Categorical Variables 3. Variable Selection

What are the 3 steps in preprocessing and cleaning the data?

to learn about possible relationships, the type of relationship, and again, to detect outliers.

What can we look for in scatterplots of pairs of numerical variables?

use histograms and boxplots to learn about the distribution of their values, to detect outliers (extreme observations), and to find other information that is relevant to the analysis task

what charts are used for numerical variables?

Quite often, we want to perform our data mining analysis on less than the total number of records that are available. Data mining algorithms will have varying limitations on what they can handle in terms of the numbers of records and variables.

Why sample from a database?

Supervised and Unsupervised Learning

A fundamental distinction among data mining techniques is between supervised and unsupervised methods.

Sample Explore Modify Model Assess

the steps in SEMMA

When we use the same data both to develop the model and to assess its performance, we introduce _______.

Bias

1. numerical or text 2. unordered or ordered

Characteristics of Categorical variables

Predictive Analytics

Classification, prediction, and to some extent, affinity analysis constitute the analytical methods employed in predictive analytics.

we can often use it as is, as if it were a continuous variable

Handling Categorical Variables: what if the categorical variable is ordered?

A good rule of thumb is to have 10 records for every predictor variable

How Many Variables and How Much Data?

1. In Excel, sort the records by the first column, then review the data for very large or very small values in that column. 2. Examine the minimum and maximum values of each column using Excel's MIN and MAX functions. 3. Use clustering techniques to identify clusters of one or a few records that are distant from others.

How do we inspect for outliers?
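
The minimum/maximum check described for Excel can be sketched outside a spreadsheet as well. This is an illustration only: the function name `column_ranges`, the column names, and the records (including the implausible age of 310) are made up.

```python
def column_ranges(rows):
    """Return the (minimum, maximum) of every column in a list of records."""
    columns = rows[0].keys()
    return {c: (min(r[c] for r in rows), max(r[c] for r in rows)) for c in columns}


# Made-up records; the age of 310 is a likely data-entry error.
rows = [
    {"age": 34, "income": 52000},
    {"age": 29, "income": 48000},
    {"age": 310, "income": 51000},
]
ranges = column_ranges(rows)
print(ranges)
```

An extreme minimum or maximum only points to where to look; whether the value is an error still needs to be judged against what the variable represents.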

1. numerical or text 2. continuous (able to assume any real numerical value, usually in a given range), integer (assuming only integer values), or categorical (assuming one of a limited number of values)

How to classify variables

Datasets are nearly always constructed and displayed so that variables are in columns and records are in rows.

Organization of Datasets: how is it done?

If the event we are interested in is rare (for example, a purchase), a simple random sample may contain too few cases of it. In such cases we would want our sampling procedure to overweight the purchasers relative to the nonpurchasers so that our sample would end up with a healthy complement of purchasers.

Oversampling Rare Events
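
One simple way to overweight the rare class is to keep every rare record and draw only a limited random sample of the common ones. The sketch below assumes this particular scheme; the function name `oversample_rare`, the `purchaser` field, and the 2% purchase rate are hypothetical.

```python
import random


def oversample_rare(records, is_rare, n_common, seed=1):
    """Keep all rare records and a random sample of n_common common records."""
    rare = [r for r in records if is_rare(r)]
    common = [r for r in records if not is_rare(r)]
    rng = random.Random(seed)
    sample = rare + rng.sample(common, min(n_common, len(common)))
    rng.shuffle(sample)  # mix the two groups
    return sample


# Made-up database: 1000 customers, of whom 2% are purchasers.
records = [{"id": i, "purchaser": i % 50 == 0} for i in range(1000)]
sample = oversample_rare(records, lambda r: r["purchaser"], n_common=20)

n_purchasers = sum(r["purchaser"] for r in sample)
print(len(sample), n_purchasers)
```

Because the sample no longer reflects the original proportion of purchasers, results obtained from it generally need to be adjusted before being applied to the full population.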

Prediction

Prediction is similar to classification, except that we are trying to predict the value of a numerical variable (e.g., amount of purchase) rather than a class (e.g., purchaser or nonpurchaser). In this book, prediction refers to the prediction of the value of a continuous variable.

Validation Partition

This partition (sometimes called the test partition) is used to assess the performance of each model so that you can compare models and pick the best one. In some algorithms (e.g., classification and regression trees), the validation partition may be used in automated fashion to tune and improve the model.

Data Reduction

This process of consolidating a large number of variables (or cases) into a smaller set

1. Have data available in which the value of the outcome of interest is known; these are called the training data. 2. The training data are the data from which the classification or prediction algorithm "learns," or is "trained," about the relationship between predictor variables and the outcome variable. 3. Once the algorithm has learned from the training data, it is then applied to another sample of data (the validation data) to see how well it does in comparison to other models. 4. If many different models are being tried out, it is prudent to save a third sample of known outcomes (the test data) to use with the model finally selected to predict how well it will do. 5. The model can then be used to classify or predict the outcome of interest in new cases where the outcome is unknown.

Steps in supervised learning

1. If the number of records with missing values is small, those records might be omitted. 2. Replace the missing value with an imputed value based on the other values for that variable across all records, possibly the mean. 3. An alternative is to examine the importance of the predictor. If it is not very crucial, it can be dropped. 4. When such a predictor is deemed central, the best solution is to invest in obtaining the missing data.

Techniques to deal with missing values?
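
The second technique, mean imputation, can be sketched as follows. It assumes missing values are represented as `None`; the function name `impute_mean` and the income figures are made up for illustration.

```python
import statistics


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


# Made-up variable with two missing values.
incomes = [40, None, 60, 50, None]
filled = impute_mean(incomes)
print(filled)
```

Mean imputation keeps every record available for modeling, but it understates the variability of the imputed variable, so it is best suited to cases where few values are missing.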

predictive analytics

the tasks of classification and prediction that are becoming key elements of a "business intelligence" function in most large firms

Data Exploration

to review and examine the data to see what messages they hold, much as a detective might survey a crime scene.

