Chapter 2. Overview of the Data Mining Process
Collaborative filtering
- A method that recommends items to an individual user based on that user's own history (purchases, ratings, browsing, or any other measurable behavior indicative of preference) combined with the histories of other users.
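A minimal sketch of the user-based flavor of this idea, assuming numpy and a tiny hypothetical ratings matrix; treating unrated items as 0 inside the similarity is a simplification:

```python
# User-based collaborative filtering sketch (hypothetical data).
# Rows = users, columns = items, 0 = not yet rated.
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],   # target user; the item at index 2 is unrated
    [5, 5, 4, 1],   # a user with a similar history
    [1, 0, 2, 5],   # a user with a dissimilar history
])
target, others = ratings[0], ratings[1:]

def cosine(a, b):
    # similarity of two rating histories
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sims = np.array([cosine(target, u) for u in others])
nearest = others[sims.argmax()]   # most similar user's history
print(nearest[2])                 # borrow their rating for the unrated item -> 4
```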
Pre-processing Data - Outliers
- An outlier is an observation that is "extreme," lying far from the rest of the data. - A common rule of thumb treats any value more than 3 standard deviations from the mean as an outlier.
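A minimal sketch of this rule of thumb, assuming numpy; the data are synthetic, with one extreme value planted among typical ones:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.append(rng.normal(50, 5, size=200), 120.0)  # 200 typical values + 1 extreme
z = (x - x.mean()) / x.std()                       # z-score of each observation
print(x[np.abs(z) > 3])                            # should flag only the planted 120.0
```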
Steps in Data Mining - 5. Determine the data mining task.
- (Classification, prediction, clustering, etc.). - This involves translating the general question or problem of step 1 into a more specific data mining question.
Pre-processing Data - Normalizing
- (Standardizing) - Used in some techniques when variables with the largest scales would otherwise dominate and skew results - Puts variables measured in different units on the same scale
Steps in Data Mining - 7. Choose the data mining techniques to be used.
- (regression, neural nets, hierarchical clustering, etc.).
Steps in Data Mining
- 1. Develop an understanding of the purpose of the data mining project.
- 2. Obtain the dataset to be used in the analysis.
- 3. Explore, clean, and preprocess the data.
- 4. Reduce the data dimension, if necessary.
- 5. Determine the data mining task.
- 6. Partition the data (for supervised tasks).
- 7. Choose the data mining techniques to be used.
- 8. Use algorithms to perform the task.
- 9. Interpret the results of the algorithms.
- 10. Deploy the model.
Cross tabulation (contingency table)
- A table that displays the number of observations in a data set for different subcategories of two categorical variables. - Subcategories must be mutually exclusive and exhaustive.
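A minimal sketch using pandas' crosstab; the Gender and Purchase columns are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "Gender":   ["F", "M", "F", "M", "F", "M"],
    "Purchase": ["Yes", "No", "Yes", "Yes", "No", "No"],
})
# counts per (Gender, Purchase) cell; the subcategories are
# mutually exclusive and exhaustive
print(pd.crosstab(df["Gender"], df["Purchase"]))
```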
Test Partition
- AKA: holdout or evaluation partition - used to assess the performance of the chosen model with new data.
Benford's law
- Calculates the expected frequency of digits in lists of numbers. - If a set of values were truly random, each of the nine possible leading digits would appear about 11% of the time. - But in many naturally occurring collections of numbers, the leading digit is likely to be small.
Steps in Data Mining - 4. Reduce the data dimension, if necessary.
- Dimension reduction can involve operations such as eliminating unneeded variables, transforming variables, and creating new variables. - Make sure you know what each variable means and whether it is sensible to include it in the model.
Creating Effective Tables
- Display the data to be compared in columns rather than rows. - For presentation purposes, round off to 2-3 significant digits. - Within a column, use a consistent number of decimal digits.
Frequency Distribution - Relative Frequency
- Fraction or proportion of observations that fall within a cell
Steps in Data Mining - 1. Develop an understanding of the purpose of the data mining project.
- How will the stakeholders use the results? - Who will be affected by the results? - Will the analysis be a one-shot effort or an ongoing procedure?
Test data
- If many different models are being tried out, it is prudent to save a third sample with known outcomes (this data) to use with the model finally selected, in order to predict how well it will perform.
Steps in Data Mining - 6. Partition the data (for supervised tasks).
- If the task is supervised (classification or prediction), randomly partition the dataset into 3 parts: training, validation, and test datasets.
Steps in Data Mining - 10. Deploy the model.
- Involves integrating the model into operational systems and running it on real records to produce decisions or actions.
Steps in Data Mining - 9. Interpret the results of the algorithms.
- Involves making a choice as to the best algorithm to deploy and, where possible, testing the final choice on the test data to get an idea as to how well it will perform.
Steps in Data Mining - 3. Explore, clean, and preprocess the data.
- Involves verifying that the data are in reasonable condition. - How should missing data be handled? - We also need to ensure consistency in the definitions of fields, units of measurement, time periods, and so on.
Steps in Data Mining - 2. Obtain the dataset to be used in the analysis.
- Often involves random sampling from a large database to capture records to be used in an analysis. - May also involve pulling together data from different databases (internal or external).
Validation data
- Once the algorithm has learned from the training data, it is applied to another sample of data where the outcome is known (this data) to see how well it does in comparison to other models.
Frequency Distribution - Cumulative frequency
- Proportion or percentage of observations that fall below the upper limit of a cell
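A minimal sketch computing both relative and cumulative frequencies for binned data, assuming pandas; the values and cell boundaries are hypothetical:

```python
import pandas as pd

values = pd.Series([3, 7, 8, 12, 14, 15, 18, 21, 22, 27])
cells = pd.cut(values, bins=[0, 10, 20, 30])          # cells (0,10], (10,20], (20,30]
rel = cells.value_counts(normalize=True).sort_index() # relative frequency per cell
print(rel)           # 0.3, 0.4, 0.3
print(rel.cumsum())  # cumulative frequency: 0.3, 0.7, 1.0
```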
Pre-processing Data - Missing Values
- Solution 1: Omission: If a small number of records have missing values, those records can be omitted. - Solution 2: Imputation: Replace missing values with reasonable substitutes (e.g., the variable's mean or median), as sketched below.
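A minimal sketch of both solutions, assuming pandas; the Age column is hypothetical, and the median is one common choice of substitute:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [34, 45, np.nan, 29, 52, np.nan]})
dropped = df.dropna()                              # Solution 1: omit incomplete records
imputed = df.fillna({"Age": df["Age"].median()})   # Solution 2: impute the median
```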
Normalizing function
- Subtract the mean and divide by the standard deviation. - Alternative standardizing function: Scale to 0-1 by subtracting the minimum and dividing by the range. (Useful when the data contain dummies and numeric values)
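A minimal sketch of both standardizing functions, assuming pandas; the values are hypothetical:

```python
import pandas as pd

x = pd.Series([2.0, 4.0, 6.0, 8.0])
z_score = (x - x.mean()) / x.std()             # subtract mean, divide by std. deviation
min_max = (x - x.min()) / (x.max() - x.min())  # scale to 0-1 via minimum and range
```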
Three data partitions and their roles in the data mining process
- Training Data: Build model(s) - Validation data: Evaluate model(s) - Test data: Re-evaluate model(s) (optional) - New data: Predict/classify using final model
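A minimal sketch of the random three-way partition, assuming scikit-learn is available; the 60/20/20 proportions are illustrative, not prescribed:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({"x": range(100), "y": range(100)})   # hypothetical dataset
train, rest = train_test_split(df, train_size=0.6, random_state=1)
valid, test = train_test_split(rest, train_size=0.5, random_state=1)
# train: build model(s); valid: compare model(s); test: re-evaluate the final choice
```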
Steps in Data Mining - 8. Use algorithms to perform the task.
- Typically an iterative process - trying multiple variants, and often using multiple variants of the same algorithm (choosing different variables or settings within the algorithm).
Association rules
- Designed to find general association patterns ("what goes with what") between items in large databases.
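A minimal sketch of the two basic rule measures, support and confidence, for one hypothetical rule on toy transactions:

```python
# Rule: {bread} -> {butter}
transactions = [
    {"bread", "butter"}, {"bread"}, {"milk", "butter"},
    {"bread", "butter", "milk"}, {"milk"},
]
n = len(transactions)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)
bread = sum(1 for t in transactions if "bread" in t)
support = both / n          # fraction of all transactions containing both items
confidence = both / bread   # fraction of bread transactions that also contain butter
print(support, confidence)  # 0.4 0.666...
```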
Normalization (or standardization)
- means replacing each original variable by a standardized version of the variable that has unit variance. - This is easily accomplished by dividing each variable by its standard deviation. - The effect of this is to give all variables equal importance in terms of the variability.
Benford's law - First digit law
- Ones should account for about 30% of leading digits, and each successive digit should represent a progressively smaller proportion, with nines coming last at under 5%.
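These proportions come from the first-digit formula P(d) = log10(1 + 1/d); a short check in Python:

```python
import math

for d in range(1, 10):
    print(d, round(math.log10(1 + 1 / d) * 100, 1), "%")
# 1 -> 30.1%, declining to 9 -> 4.6%
```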
Prediction
- Similar to classification, except that we are trying to predict the value of a numerical variable (e.g., amount of purchase) rather than a class (e.g., purchaser or nonpurchaser).
Training data
- The data from which the classification or prediction algorithm "learns," or is "trained," about the relationship between predictor variables and the outcome variable.
Supervised learning algorithms
- used in classification and prediction. -- we must have data available in which the value of the outcome of interest (e.g., purchase or no purchase) is known. - These training data are the data from which the classification or prediction algorithm "learns," or is "trained," about the relationship between predictor variables and the outcome variable.
Validation Partition
- used to assess the predictive performance of each model so that you can compare models and choose the best one. - In some algorithms it may be used in an automated fashion to tune and improve the model.
Unsupervised learning algorithms
- used where there is no outcome variable to predict or classify. -- Association rules, dimension reduction methods, and clustering techniques are all unsupervised learning methods. - Hence there is no "learning" from cases where such an outcome variable is known.
Overfitting
- Where a model is fit so closely to the available sample of data that it describes not merely structural characteristics of the data but random peculiarities as well.
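A minimal sketch of overfitting, assuming numpy: a high-degree polynomial chases the noise in a small training sample and then does worse than a straight line on fresh data:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 15)
y_train = 2 * x_train + rng.normal(0, 0.2, size=15)   # true relationship: linear + noise
x_new = np.linspace(0, 1, 100)                        # fresh data from the same process
y_new = 2 * x_new + rng.normal(0, 0.2, size=100)

for degree in (1, 12):
    p = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((p(x_train) - y_train) ** 2)
    new_mse = np.mean((p(x_new) - y_new) ** 2)
    print(degree, round(train_mse, 4), round(new_mse, 4))
# the degree-12 fit typically shows lower training error but higher error on new data
```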