Chapter 2 - Overview of the Data Mining Process
Supervised learning algorithms
Used in classification and prediction. We must have data available in which the value of the outcome of interest is known. The training data are the data from which the classification or prediction algorithm learns, or is trained, about the relationship between predictor variables and the outcome variable. Once the algorithm has learned from the training data, it is applied to the validation data, where the outcome is also known, to see how well it does in comparison to other models. If many different models are being tried out, it is prudent to save a third sample, the test data, to use with the finally selected model to predict how well it will do.
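A minimal sketch of this partitioning, assuming a 50/30/20 training/validation/test split (the function name and fractions are illustrative, not prescribed by the text):

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Randomly split records into training, validation, and test sets.
    Whatever fraction remains (here 0.2) becomes the test set."""
    rows = records[:]
    random.Random(seed).shuffle(rows)
    n = len(rows)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    return (rows[:n_train],
            rows[n_train:n_train + n_valid],
            rows[n_train + n_valid:])

records = list(range(100))
train, valid, test = partition(records)
print(len(train), len(valid), len(test))  # 50 30 20
```

The model is fit on the training set, compared against alternatives on the validation set, and scored once on the held-out test set.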
Unsupervised learning algorithms
Used where there is no outcome variable to predict or classify. Association rules, dimension reduction methods, and clustering techniques are examples.
Association rules
- Aka affinity analysis
- Designed to find general association patterns between items in large databases
- Generate rules that apply to an entire population
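Association rules are commonly scored by support and confidence; a dependency-free sketch over a toy transaction list (the items and the candidate rule are invented for illustration):

```python
# Toy "market basket" data: each transaction is a set of purchased items.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    # P(consequent | antecedent): support of the union over support of the antecedent.
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

print(support({"bread", "butter"}, transactions))       # 0.5
print(confidence({"bread"}, {"butter"}, transactions))  # 0.666...
```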
Categorical variables
- Can be coded as numerical (1, 2, 3) or text (payments current, payments not current, bankrupt)
- Unordered, aka nominal (North America, Europe, Asia), or ordered, aka ordinal (high value, low value, nil value)
Classification
- The most basic form of data analysis
- A common task in data mining is to examine data where the classification is unknown or will occur in the future, with the goal of predicting what that classification is or will be
- Similar data where the classification is known are used to develop rules, which are then applied to the data with the unknown classification
- Ex. buyer will/will not purchase
Missing Values
- Omit the records with missing values if their number is small
- Replace missing values with an imputed value, based on the other values for that variable across all records
- Examine the importance of the predictor: if it is not crucial, drop it; if it is important, use a proxy variable with fewer missing values; when the predictor is deemed central, the best solution is to invest in obtaining the missing data
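A minimal sketch of the imputation option, using `None` to mark missing entries (in practice a library routine such as pandas' `fillna` would do this):

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

income = [40, None, 60, 50, None]
print(impute_mean(income))  # [40, 50.0, 60, 50, 50.0]
```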
Collaborative filtering
- Online recommendation systems used by Amazon and Netflix
- Uses individual users' preferences and tastes, given their historic purchase, rating, browsing, or other measurable behavior indicative of preference
- Delivers personalized recommendations to users with a wide range of preferences
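A toy sketch of one common approach, user-based filtering with cosine similarity; the users, items, and ratings below are invented, and real systems operate on far larger, sparser data:

```python
import math

# Toy user-item rating matrix (0 = not yet rated).
ratings = {
    "ann":  {"item1": 5, "item2": 3, "item3": 0},
    "bob":  {"item1": 4, "item2": 2, "item3": 5},
    "carl": {"item1": 1, "item2": 0, "item3": 4},
}

def cosine(u, v):
    # Cosine similarity between two users' rating vectors.
    dot = sum(u[k] * v[k] for k in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm

def recommend(user, ratings):
    # Score each unrated item by other users' ratings, weighted by similarity.
    others = {name: r for name, r in ratings.items() if name != user}
    scores = {}
    for item, rating in ratings[user].items():
        if rating == 0:  # only score items the user has not rated
            scores[item] = sum(cosine(ratings[user], r) * r[item]
                               for r in others.values())
    return max(scores, key=scores.get)

print(recommend("ann", ratings))  # item3
```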
Types of Variables
- Numerical or text
- Continuous: able to assume any real numerical value, usually in a given range
- Integer: assuming only whole-number values
- Categorical: assuming one of a limited number of values
Prediction
Similar to classification, except that we are trying to predict the value of a numerical variable (ex. amount of purchase) rather than a categorical one.
Standardizing
Subtracting the mean from each value and then dividing by the standard deviation
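A minimal sketch, using the sample standard deviation from Python's standard library:

```python
import statistics

def standardize(values):
    """Convert each value to a z-score: (x - mean) / standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [(x - mean) / sd for x in values]

ages = [25, 35, 45, 55, 65]
print([round(z, 2) for z in standardize(ages)])
# [-1.26, -0.63, 0.0, 0.63, 1.26]
```

The result is the z-score defined below: each value expressed as a number of standard deviations from the mean.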
Steps in data mining
1. Develop an understanding of the purpose of the data mining project
2. Obtain the dataset to be used in the analysis
3. Explore, clean, and preprocess the data
4. Reduce the data dimension, if necessary
5. Determine the data mining task
6. Partition the data (for supervised tasks) into training, validation, and test sets
7. Choose the data mining techniques to be used
8. Use algorithms to perform the task
9. Interpret the results of the algorithm
10. Deploy the model: integrate it into operational systems and run it on real records to produce decisions or actions
Data Exploration
Aimed at understanding the global landscape of the data and detecting unusual values. Also used for data cleaning and manipulation as well as for visual discovery and hypothesis generation.
Clustering
Method for reducing the number of records by grouping similar records together into a smaller set of homogeneous clusters
Data visualization or visual analytics
Exploration by creating charts and dashboards. The purpose is to discover patterns and exceptions.
Dummy variables
Nominal categorical variables cannot be used as is and must be decomposed into a series of binary variables. Example, for employment status:
- Student - Y/N
- Unemployed - Y/N
- Employed - Y/N
- Retired - Y/N
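A dependency-free sketch of the decomposition, using the category names from the example above (in practice pandas' `get_dummies` performs this step):

```python
def to_dummies(values, categories):
    """Expand a nominal variable into one 0/1 (Y/N) column per category."""
    return [{c: int(v == c) for c in categories} for v in values]

status = ["student", "employed", "retired", "employed"]
cats = ["student", "unemployed", "employed", "retired"]
for row in to_dummies(status, cats):
    print(row)
```

Each record gets exactly one 1 across the dummy columns, so the nominal variable can now feed algorithms that require numerical inputs.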
Data Reduction
Process of consolidating a large number of records into a smaller set
Dimension Reduction
- Reducing the number of variables
- A common initial step before deploying supervised learning methods, intended to improve predictive power, manageability, and interpretability
Normalizing
Rescale each variable to the same scale. Subtract the minimum value and then divide by the range.
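A minimal sketch of this min-max rescaling, which maps every variable onto the interval [0, 1]:

```python
def normalize(values):
    """Rescale to [0, 1]: subtract the minimum, divide by the range."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

print(normalize([10, 20, 40, 50]))  # [0.0, 0.25, 0.75, 1.0]
```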
Z-score
Result of standardizing, expressing each value as the number of standard deviations away from the mean
Outliers
The more data we are dealing with, the greater the chance of encountering erroneous values resulting from measurement error, data-entry error, or the like. Outliers call attention to values that need further review.
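One common rule of thumb, sketched here with an assumed threshold (the cutoff is a convention, not a fixed standard), flags values far from the mean for manual review:

```python
import statistics

def flag_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs(x - mean) / sd > threshold]

data = [98, 101, 99, 100, 102, 500]  # 500 is likely a data-entry error
print(flag_outliers(data, threshold=2.0))  # [500]
```

Flagged values are reviewed, not automatically deleted; an outlier may be an error or a genuine extreme observation.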