Chapter 2. Overview of the Data Mining Process


Collaborative filtering

- A method that uses individual users' preferences and tastes given their historic purchase, rating, browsing, or any other measurable behavior indicative of preference, as well as other users' history.

Pre-processing Data - Outliers

- An outlier is an observation that is "extreme", being distant from the rest of the data.
- A common rule of thumb treats anything more than 3 standard deviations away from the mean as an outlier.
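
As a minimal sketch of the 3-standard-deviation rule, the Python below flags values whose distance from the mean exceeds a threshold. The function name `flag_outliers`, the sample readings, and the choice of the population standard deviation are assumptions made for illustration, not part of the original definition.

```python
from statistics import mean, pstdev

def flag_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all values identical, so nothing is extreme
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical sample: twenty ordinary readings and one extreme value.
readings = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
            11, 9, 10, 12, 10, 9, 11, 10, 10, 11, 100]
outliers = flag_outliers(readings)
```

Note that an extreme value also inflates the mean and standard deviation it is measured against, so in very small samples the rule may flag nothing at all.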

Steps in Data Mining - 5. Determine the data mining task.

- Classification, prediction, clustering, etc.
- This involves translating the general question or problem of step 1 into a more specific data mining question.

Pre-processing Data - Normalizing

- Also called standardizing.
- Used in some techniques when variables with the largest scales would otherwise dominate and skew results.
- Puts variables measured in different units on the same scale.

Steps in Data Mining - 7. Choose the data mining techniques to be used.

- Regression, neural nets, hierarchical clustering, etc.

Steps in Data Mining

- 1. Develop an understanding of the purpose of the data mining project.
- 2. Obtain the dataset to be used in the analysis.
- 3. Explore, clean, and preprocess the data.
- 4. Reduce the data dimension, if necessary.
- 5. Determine the data mining task.
- 6. Partition the data (for supervised tasks).
- 7. Choose the data mining techniques to be used.
- 8. Use algorithms to perform the task.
- 9. Interpret the results of the algorithms.
- 10. Deploy the model.

Cross tabulation (contingency table)

- A table that displays the number of observations in a data set for different subcategories of two categorical variables.
- Subcategories must be mutually exclusive and exhaustive.
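
A small sketch of how such a table could be built by counting, assuming each record is a pair of category labels. The variable names, the region/purchase labels, and the sample records are hypothetical.

```python
from collections import Counter

# Hypothetical records: (region, purchased) pairs for eight customers.
records = [
    ("East", "yes"), ("East", "no"), ("West", "yes"), ("East", "yes"),
    ("West", "no"), ("West", "no"), ("East", "yes"), ("West", "yes"),
]

def cross_tab(pairs):
    """Count observations for every combination of the two categorical variables."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Each record lands in exactly one cell, so the cells sum to len(pairs).
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

table = cross_tab(records)
```

Because the subcategories are mutually exclusive and exhaustive, the cell counts add up to the number of records.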

Test Partition

- Also known as the holdout or evaluation partition.
- Used to assess the performance of the chosen model with new data.

Benford's law

- Gives the expected frequency of leading digits in lists of numbers.
- If leading digits were uniformly distributed, each digit from 1 to 9 would appear about 11% of the time.
- But in many naturally occurring collections of numbers, the leading digit is likely to be small.

Steps in Data Mining - 4. Reduce the data dimension, if necessary.

- Dimension reduction can involve operations such as eliminating unneeded variables, transforming variables, and creating new variables.
- Make sure you know what each variable means and whether it is sensible to include it in the model.

Creating Effective Tables

- Display the data to be compared in columns rather than rows.
- For presentation purposes, round off to 2-3 significant digits.
- Within a column, use a consistent number of decimal digits.
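
Rounding to significant digits differs from rounding to decimal places, so a short sketch may help. The helper name `round_sig` and the sample values are assumptions for illustration.

```python
from math import floor, log10

def round_sig(x, digits=3):
    """Round x to the given number of significant digits."""
    if x == 0:
        return 0.0
    magnitude = floor(log10(abs(x)))  # position of the leading digit
    return round(x, digits - 1 - magnitude)

# Hypothetical values of very different sizes, each rounded to 3 significant digits.
rounded = [round_sig(v, 3) for v in (1234.567, 0.012345, 98.7654)]
```

Values rounded this way can still end up with different numbers of decimal places, so a column may need a second formatting pass to keep its decimals consistent.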

Frequency Distribution - Relative Frequency

- Fraction or proportion of observations that fall within a cell
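
As an illustration, relative frequencies can be computed by dividing each cell's count by the total number of observations. The function name and the sample grades are hypothetical.

```python
from collections import Counter

def relative_frequencies(observations):
    """Map each cell (category) to its share of all observations."""
    counts = Counter(observations)
    total = len(observations)
    return {cell: n / total for cell, n in counts.items()}

# Hypothetical sample of ten letter grades.
grades = ["A", "B", "B", "C", "A", "B", "C", "B", "A", "B"]
rel = relative_frequencies(grades)
```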

Steps in Data Mining - 1. Develop an understanding of the purpose of the data mining project.

- How will the stakeholders use the results?
- Who will be affected by the results?
- Will the analysis be a one-shot effort or an ongoing procedure?

Test data

- If many different models are being tried out, it is prudent to save a third sample, which also includes known outcomes (this data) to use with the model finally selected to predict how well it will do.

Steps in Data Mining - 6. Partition the data (for supervised tasks).

- If the task is supervised (classification or prediction), randomly partition the dataset into 3 parts: training, validation, and test datasets.
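
One way to sketch a random three-way partition is to shuffle the records and slice them. The 50/30/20 proportions, the seed, and the function name `partition` are assumptions; the card does not specify any of them.

```python
import random

def partition(records, train_frac=0.5, valid_frac=0.3, seed=1):
    """Randomly split records into training, validation, and test lists."""
    shuffled = list(records)                 # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)    # seeded for a repeatable split
    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]      # whatever remains
    return train, valid, test

# Hypothetical dataset of 100 record ids.
train, valid, test = partition(range(100))
```

Every record ends up in exactly one partition, which keeps the validation and test data unseen during training.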

Steps in Data Mining - 10. Deploy the model

- Involves integrating the model into operational systems and running it on real records to produce decisions or actions.

Steps in Data Mining - 9. Interpret the results of the algorithms.

- Involves making a choice as to the best algorithm to deploy and, where possible, testing the final choice on the test data to get an idea as to how well it will perform.

Steps in Data Mining - 3. Explore, clean, and preprocess the data.

- Involves verifying that the data are in reasonable condition.
- How should missing data be handled?
- We also need to ensure consistency in the definitions of fields, units of measurement, time periods, and so on.

Steps in Data Mining - 2. Obtain the dataset to be used in the analysis.

- Often involves random sampling from a large database to capture records to be used in an analysis.
- May also involve pulling together data from different databases (internal or external).

Validation data

- Once the algorithm has learned from the training data, it is then applied to another sample of data (this data) where the outcome is known, to see how well it does in comparison to other models.

Frequency Distribution - Cumulative frequency

- Proportion or percentage of observations that fall below the upper limit of a cell
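
A brief sketch of the calculation, assuming each cell is identified by its upper limit. The function name, the ages, and the cell boundaries are hypothetical.

```python
def cumulative_frequencies(values, upper_limits):
    """For each cell's upper limit, return the share of values below that limit."""
    n = len(values)
    return [sum(1 for v in values if v < limit) / n for limit in upper_limits]

# Hypothetical data: ten ages, with cells 0-20, 20-40, and 40-60.
ages = [5, 12, 18, 22, 25, 31, 38, 44, 51, 59]
cum = cumulative_frequencies(ages, upper_limits=[20, 40, 60])
```

The values never decrease from one cell to the next, and the last one reaches 1 when the final upper limit exceeds every observation.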

Pre-processing Data - Missing Values

- Solution 1: Omission. If a small number of records have missing values, they can be omitted.
- Solution 2: Imputation. Replace missing values with reasonable substitutes.
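
Both solutions can be sketched in a few lines. Using the mean of the observed values as the "reasonable substitute" is an assumption made here for illustration, and the records, field name, and function names are hypothetical.

```python
from statistics import mean

# Hypothetical records; None marks a missing income value.
records = [
    {"id": 1, "income": 40.0},
    {"id": 2, "income": None},
    {"id": 3, "income": 60.0},
    {"id": 4, "income": 50.0},
]

def omit_missing(rows, field):
    """Solution 1: drop every record whose field is missing."""
    return [r for r in rows if r[field] is not None]

def impute_mean(rows, field):
    """Solution 2: replace missing values with the mean of the observed ones."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [dict(r, **{field: fill}) if r[field] is None else dict(r) for r in rows]

complete = omit_missing(records, "income")
imputed = impute_mean(records, "income")
```

Omission shrinks the dataset, while imputation keeps every record at the cost of introducing values that were never observed.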

Normalizing function

- Subtract the mean and divide by the standard deviation.
- Alternative standardizing function: scale to 0-1 by subtracting the minimum and dividing by the range. (Useful when the data contain both dummy variables and numeric values.)
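
The two functions can be sketched as follows. The function names, the sample values, and the use of the population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev

def z_score(values):
    """Subtract the mean and divide by the standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale to the 0-1 range: subtract the minimum and divide by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

incomes = [20.0, 30.0, 40.0, 50.0, 60.0]  # hypothetical values
standardized = z_score(incomes)
rescaled = min_max(incomes)
```

Both sketches divide by zero when every value is identical, so a constant variable would need separate handling.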

Three data partitions and their roles in the data mining process

- Training data: Build model(s)
- Validation data: Evaluate model(s)
- Test data: Re-evaluate model(s) (optional)
- New data: Predict/classify using final model

Steps in Data Mining - 8. Use algorithms to perform the task.

- Typically an iterative process: trying multiple variants, and often using multiple variants of the same algorithm (choosing different variables or settings within the algorithm).

Association rules

- Designed to find general association patterns between items in large databases.

Normalization (or standardization)

- Means replacing each original variable by a standardized version of the variable that has unit variance.
- This is easily accomplished by dividing each variable by its standard deviation.
- The effect is to give all variables equal importance in terms of variability.

Benford's law - First digit law

- Ones should account for about 30% of leading digits, and each successive digit should represent a progressively smaller proportion, with nines coming last at under 5%.
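
The percentages quoted here match the standard Benford formula, in which the expected share of leading digit d is log10(1 + 1/d). The formula is not stated on the card itself, and the function name below is hypothetical; this is a sketch for checking the figures.

```python
from math import log10

def benford_expected(digit):
    """Expected share of numbers whose leading digit is `digit` (1-9)."""
    return log10(1 + 1 / digit)

# Expected proportion for each possible leading digit.
expected = {d: benford_expected(d) for d in range(1, 10)}
```

The nine proportions sum to 1 and shrink steadily from digit 1 to digit 9.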

Prediction

- Similar to classification, except that we are trying to predict the value of a numerical variable (e.g., amount of purchase) rather than a class (e.g., purchaser or nonpurchaser).

Training data

- The data from which the classification or prediction algorithm "learns," or is "trained," about the relationship between predictor variables and the outcome variable.

Supervised learning algorithms

- Used in classification and prediction.
- We must have data available in which the value of the outcome of interest (e.g., purchase or no purchase) is known.
- These training data are the data from which the classification or prediction algorithm "learns," or is "trained," about the relationship between predictor variables and the outcome variable.

Validation Partition

- Used to assess the predictive performance of each model so that you can compare models and choose the best one.
- In some algorithms it may be used in an automated fashion to tune and improve the model.
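
The comparison step can be sketched with two deliberately simple candidate models, each predicting a single constant. The outcome values, the choice of mean absolute error as the performance measure, and all names are assumptions for illustration.

```python
from statistics import mean, median

def mae(prediction, actuals):
    """Mean absolute error of a constant prediction against known outcomes."""
    return mean(abs(prediction - a) for a in actuals)

# Hypothetical outcome values for each partition.
train_y = [10, 12, 11, 13, 90]   # one extreme value pulls the mean upward
valid_y = [11, 12, 10, 14]
test_y = [12, 11, 13, 10]

# Two candidate models, both fit on the training data only.
candidates = {"mean": mean(train_y), "median": median(train_y)}

# Compare candidates on the validation partition and keep the best one.
valid_errors = {name: mae(pred, valid_y) for name, pred in candidates.items()}
best = min(valid_errors, key=valid_errors.get)

# Assess the chosen model once on the untouched test partition.
test_error = mae(candidates[best], test_y)
```

Because the validation data influenced which model was chosen, the test partition gives the less biased estimate of performance on new data.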

Unsupervised learning algorithms

- Used where there is no outcome variable to predict or classify.
- Hence there is no "learning" from cases where such an outcome variable is known.
- Association rules, dimension reduction methods, and clustering techniques are all unsupervised learning methods.

Overfitting

- Where a model is fit so closely to the available sample of data that it describes not merely structural characteristics of the data but random peculiarities as well.
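
A toy illustration, assuming made-up data in which y is roughly 2x: a model that memorizes the training sample reproduces its noise perfectly, yet does worse on held-out data than a straight line that captures only the overall trend. All names and data values are hypothetical.

```python
def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_line(xs, ys):
    """Least-squares straight line: captures only the overall trend."""
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def fit_memorizer(xs, ys):
    """1-nearest-neighbor: reproduces every training point, noise included."""
    points = list(zip(xs, ys))
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]

# Hypothetical data: y is roughly 2x, with noise in the training sample only.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [2.5, 3.5, 6.8, 7.4, 10.9, 11.3]
valid_x = [1.2, 2.2, 3.2, 4.2, 5.2]
valid_y = [2.4, 4.4, 6.4, 8.4, 10.4]

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

train_mse = {"line": mse(line, train_x, train_y),
             "memorizer": mse(memorizer, train_x, train_y)}
valid_mse = {"line": mse(line, valid_x, valid_y),
             "memorizer": mse(memorizer, valid_x, valid_y)}
```

On this data the memorizer has zero training error but a larger validation error than the line, which is the pattern that signals overfitting.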
