Chapter 2 - Overview of the Data Mining Process

Supervised learning algorithms

Used in classification and prediction. We must have data in which the value of the outcome of interest is known. The training data are the data from which the classification or prediction algorithm learns, or is "trained," about the relationship between the predictor variables and the outcome variable. Once the algorithm has learned from the training data, it is applied to the validation data, where the outcome is also known, to see how well it performs in comparison with other models. If many different models are being tried out, it is prudent to hold back a third sample of labeled data, the test data, to use with the finally selected model to predict how well it will do on new data.
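
A minimal sketch of this partitioning idea in pandas, assuming a hypothetical data frame and an illustrative 50/30/20 split (the column names and proportions are not from the text):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: 1,000 records with two predictors and a known outcome.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "income": rng.normal(50, 15, 1000),
    "age": rng.integers(18, 80, 1000),
    "purchased": rng.integers(0, 2, 1000),   # known outcome for supervised learning
})

# Shuffle, then carve out training (50%), validation (30%), and test (20%) partitions.
shuffled = df.sample(frac=1, random_state=1)
n = len(shuffled)
train = shuffled.iloc[: int(0.5 * n)]               # the algorithm learns from these records
valid = shuffled.iloc[int(0.5 * n): int(0.8 * n)]   # compare candidate models here
test = shuffled.iloc[int(0.8 * n):]                 # held back to assess the final model

print(len(train), len(valid), len(test))            # 500 300 200
```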

Unsupervised learning algorithms

Used where there is no outcome variable to predict or classify. Association rules, dimension reduction methods, and clustering techniques are examples.

Association rules

- Aka affinity analysis
- Designed to find general association patterns between items in large databases
- Generate rules that are general to an entire population
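
A toy illustration of the support and confidence measures behind association rules, on a hypothetical set of market-basket transactions (the items and the example rule are made up):

```python
# Tiny illustration of support/confidence on hypothetical transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Candidate rule: {diapers} -> {beer}
antecedent, consequent = {"diapers"}, {"beer"}
rule_support = support(antecedent | consequent)
confidence = rule_support / support(antecedent)
print(f"support={rule_support:.2f}, confidence={confidence:.2f}")
```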

Categorical variables

- Can be coded as numerical (1, 2, 3) or text (payments current, payments not current, bankrupt)
- May be unordered, aka nominal (North America, Europe, Asia), or ordered, aka ordinal (high value, low value, nil value)
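
A small pandas sketch of the same distinction, reusing the example categories above; treating the payment statuses as an ordered variable here is an assumption for illustration:

```python
import pandas as pd

# Nominal (unordered) categorical variable.
region = pd.Categorical(["North America", "Europe", "Asia", "Europe"])

# Ordered (ordinal) categorical variable: the categories have a meaningful order.
status = pd.Categorical(
    ["payments current", "payments not current", "bankrupt"],
    categories=["payments current", "payments not current", "bankrupt"],
    ordered=True,
)

# The same categories can also be coded numerically (1, 2, 3).
print(status.codes + 1)                 # [1 2 3]
print(region.ordered, status.ordered)   # False True
```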

Classification

- The most basic form of data analysis
- A common task in data mining is to examine data where the classification is unknown or will occur in the future, with the goal of predicting what that classification is or will be
- Similar data where the classification is known are used to develop rules, which are then applied to the data with the unknown classification
- Example: buyer will/will not purchase
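
A minimal classification sketch, assuming scikit-learn is available and using synthetic data in place of a real buyer/non-buyer dataset; the decision tree is just one of many possible classifiers:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data where the class (buyer = 1 / non-buyer = 0) is known.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))                    # two predictor variables
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# Learn classification rules from the labeled records ...
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# ... then apply them to new records whose class is unknown.
X_new = rng.normal(size=(5, 2))
print(clf.predict(X_new))   # predicted buyer / non-buyer labels
```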

Missing Values

- If the number of records with missing values is small, those records can simply be omitted
- Replace missing values with an imputed value, based on the other values for that variable across all records (see the sketch below)
- Examine the importance of the predictor: if it is not very crucial, drop it; if it is important, use a proxy variable with fewer missing values; when the predictor is deemed central, the best solution is to invest in obtaining the missing data
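
A short sketch of the first two options, assuming a hypothetical pandas data frame with a few missing income values and using the median as the imputed value:

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with a few missing income values.
df = pd.DataFrame({
    "income": [48.0, 52.5, np.nan, 61.0, np.nan, 45.2],
    "age": [34, 51, 42, 29, 60, 38],
})

# Option 1: omit the records with missing values (reasonable if they are few).
dropped = df.dropna()

# Option 2: impute the missing values from the other records' values, e.g. the median.
imputed = df.fillna({"income": df["income"].median()})

print(len(dropped), imputed["income"].tolist())
```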

Collaborative filtering

- The online recommendation systems used by Amazon and Netflix
- A method that uses individual users' preferences and tastes, given their historic purchase, rating, browsing, or any other measurable behavior indicative of preference
- Delivers personalized recommendations to users with a wide range of preferences
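
A bare-bones sketch of the user-based flavour of collaborative filtering on a made-up ratings matrix; real systems work with far larger data and more refined similarity and weighting schemes:

```python
import numpy as np

# Hypothetical user-item rating matrix (rows = users, columns = items, 0 = not rated).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Find the user whose past ratings are most similar to user 0's ...
target = 0
others = [u for u in range(len(ratings)) if u != target]
sims = [cosine(ratings[target], ratings[u]) for u in others]
most_similar = others[int(np.argmax(sims))]

# ... and recommend the item that this neighbour rated highest among the
# items the target user has not rated yet.
unrated = np.where(ratings[target] == 0)[0]
recommend = unrated[np.argmax(ratings[most_similar, unrated])]
print(f"recommend item {recommend} to user {target}")
```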

Types of Variables

- Numerical or text
- Continuous: able to assume any real numerical value, usually within a given range
- Integer: assuming only integer values
- Categorical: assuming one of a limited number of values

Prediction

Similar to classification, except that we are trying to predict the value of a numerical variable (e.g., amount of purchase) rather than a class.
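
A minimal sketch of numerical prediction with a linear regression, assuming scikit-learn and synthetic data; the predictors and the "purchase amount" outcome are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: predict purchase amount (numerical) from income and age.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(50, 15, 300), rng.integers(18, 80, 300)])
y = 0.8 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 5, 300)   # purchase amount

model = LinearRegression().fit(X, y)
print(model.predict([[60, 35]]))   # predicted purchase amount for a new customer
```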

Standardizing

Subtracting the mean from each value and then dividing by the standard deviation
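
A one-variable NumPy sketch of this computation (the sample values are arbitrary); the result is the z-score defined below:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # hypothetical values of one variable

# Standardize: subtract the mean, then divide by the standard deviation.
z = (x - x.mean()) / x.std()

print(z)                    # the z-scores
print(z.mean(), z.std())    # mean 0 and standard deviation 1 after standardizing
```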

Steps in data mining

1. Develop an understanding of the purpose of the data mining project
2. Obtain the dataset to be used in the analysis
3. Explore, clean, and preprocess the data
4. Reduce the data dimension, if necessary
5. Determine the data mining task
6. Partition the data (for supervised tasks) into training, validation, and test sets
7. Choose the data mining techniques to be used
8. Use algorithms to perform the task
9. Interpret the results of the algorithms
10. Deploy the model: integrate it into operational systems and run it on real records to produce decisions or actions

Data Exploration

Aimed at understanding the global landscape of the data and detecting unusual values. Also used for data cleaning and manipulation as well as for visual discovery and hypothesis generation.
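
A quick sketch of the kind of first-pass summaries used here, assuming a small hypothetical pandas data frame:

```python
import pandas as pd

# Hypothetical dataset being explored.
df = pd.DataFrame({
    "income": [48.0, 52.5, 61.0, 45.2, 250.0],   # one suspiciously large value
    "age": [34, 51, 29, 60, 38],
})

# Summary statistics give a quick view of the global landscape of the data
# and help flag unusual values (e.g. the maximum income above).
print(df.describe())
print(df.isna().sum())   # count of missing values per variable
```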

Clustering

Method for grouping similar records together; used to reduce the number of cases by consolidating them into a smaller set of homogeneous groups.
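
A minimal clustering sketch using k-means from scikit-learn on synthetic data; the choice of two clusters is arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical records described by two numerical variables.
rng = np.random.default_rng(0)
records = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

# Group the 100 records into 2 clusters; each cluster can then be treated as
# a single homogeneous segment, reducing the number of cases to work with.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(records)
print(km.cluster_centers_)   # one representative centre per cluster
```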

Data visualization or visual analytics

Exploration by creating charts and dashboards. The purpose is to discover patterns and exceptions.
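
A small matplotlib sketch of two common exploratory charts (a histogram and a scatterplot) on synthetic data; dashboards would combine many such views:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical variables to explore visually.
rng = np.random.default_rng(0)
income = rng.normal(50, 15, 500)
spending = 0.6 * income + rng.normal(0, 8, 500)

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].hist(income, bins=30)            # distribution of a single variable
axes[0].set_title("Income distribution")
axes[1].scatter(income, spending, s=8)   # relationship between two variables
axes[1].set_title("Income vs. spending")
plt.tight_layout()
plt.show()
```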

Dummy variables

- Nominal categorical variables cannot be used as is; they must be decomposed into a series of binary (dummy) variables
- Example:
  - Student: Y/N
  - Unemployed: Y/N
  - Employed: Y/N
  - Retired: Y/N
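
A pandas sketch of this decomposition, assuming a hypothetical employment-status column with the four categories above:

```python
import pandas as pd

# Hypothetical nominal variable with four categories.
df = pd.DataFrame({
    "employment": ["Student", "Employed", "Retired", "Unemployed", "Employed"],
})

# Decompose it into a series of binary (dummy) variables, one per category.
dummies = pd.get_dummies(df["employment"])
print(dummies)
```

For modeling, pd.get_dummies also accepts drop_first=True, which keeps one fewer dummy and avoids redundant information.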

Data Reduction

Process of consolidating a large number of records into a smaller set

Dimension Reduction

- Reducing the number of variables
- A common initial step before deploying supervised learning methods, intended to improve predictive power, manageability, and interpretability
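
A sketch of one common dimension reduction technique, principal components analysis, via scikit-learn on synthetic correlated predictors; PCA is used here as an illustration, not as the only possible method:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset with 10 correlated predictor variables.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X = base @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))

# Replace the 10 variables with a few components that retain most of the
# variance, before passing the data to a supervised learning method.
pca = PCA(n_components=3).fit(X)
X_reduced = pca.transform(X)
print(X_reduced.shape, pca.explained_variance_ratio_.round(2))
```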

Normalizing

Rescaling each variable to a common 0-1 scale: subtract the minimum value from each value and then divide by the range.
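
A one-variable NumPy sketch (arbitrary sample values):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # hypothetical values of one variable

# Normalize to the 0-1 scale: subtract the minimum, then divide by the range.
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_norm)   # [0.   0.25 0.5  0.75 1.  ]
```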

Z-score

Result of standardizing, expressing each value as the number of standard deviations away from the mean

Outliers

Values that lie far away from the bulk of the data. The more data we are dealing with, the greater the chance of encountering erroneous values resulting from measurement error, data-entry error, or the like. Flagging outliers calls attention to values that need further review.
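
One common screening sketch, using a z-score rule of thumb of 3 standard deviations on made-up data; the threshold of 3 is a convention, not something prescribed by the text:

```python
import numpy as np
import pandas as pd

# Hypothetical variable: 30 plausible values plus one data-entry error (1000).
rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(100, 5, 30), 1000))

# Flag values more than 3 standard deviations from the mean so a domain
# expert can review them before modeling.
z = (values - values.mean()) / values.std()
print(values[z.abs() > 3])
```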

