Data Mining Chapter 3 & 4

Distribution Plots

Box plots and histograms display the entire distribution of a variable, showing "how many" of each value occur in a data set.

Bar Chart

Useful for comparing a single statistic across groups (one numerical and one categorical variable); the height of each bar equals the value of the statistic. With Y = numerical outcome and X = predictor, bar charts support SUPERVISED learning (prediction tasks).
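A minimal Python sketch (pandas/matplotlib), assuming a toy data set; the column names "region" and "sales" are made up for illustration:

import pandas as pd
import matplotlib.pyplot as plt

# Toy data: categorical predictor "region", numerical outcome "sales"
df = pd.DataFrame({
    "region": ["East", "West", "East", "West", "North", "North"],
    "sales": [120, 95, 130, 100, 80, 85],
})

# Bar height = value of the statistic (here, the average outcome per group)
avg_sales = df.groupby("region")["sales"].mean()
avg_sales.plot(kind="bar")
plt.ylabel("Average sales")
plt.show()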

Reducing Categories

Combine close or similar categories. A single categorical variable with m categories is typically transformed into m-1 dummy variables. Each dummy variable takes the value 0 or 1 (0 = "no" for that category, 1 = "yes"). Problem: you can end up with too many variables. Solution: reduce them by combining categories that are close to each other (a pivot table helps identify which). Exception: Naïve Bayes can handle categorical variables without transforming them into dummies.
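A minimal Python sketch of the m-1 dummy transformation, assuming a hypothetical "color" column; the rare-category threshold is an illustrative choice:

import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "green", "red", "teal", "blue"]})

# Optionally combine rare (close/similar) categories before encoding
counts = df["color"].value_counts()
rare = counts[counts < 2].index          # illustrative threshold
df["color"] = df["color"].replace(dict.fromkeys(rare, "other"))

# m categories -> m-1 dummy variables (drop_first avoids the redundant dummy)
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)
print(dummies)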

Boxplots

Display the entire distribution of a numerical variable. Invented by J. Tukey. The median (50th percentile) is marked by a line within the box; the top of the box is Q3 (75th percentile) and the bottom is Q1 (25th percentile). Two lines (whiskers) extend outside the box, and points beyond them may indicate outliers. Side-by-side boxplots compare subgroups and are useful in prediction (supervised) tasks.
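A minimal Python sketch of side-by-side boxplots (pandas/matplotlib); the column names and values are hypothetical:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "value": [5, 7, 6, 12, 15, 14],
})

# One box per subgroup of the categorical variable
df.boxplot(column="value", by="group")
plt.ylabel("value")
plt.show()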

Principal Components Analysis (PCA)

Goal: Reduce the number of predictors in the model. Remove the overlap of information between variables. "Information" is measured by the sum of the variances of the variables. Create new variables that are linear combinations of the original variables. These linear combinations are uncorrelated (no information overlap). The new variables are called PRINCIPAL COMPONENTS.
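A minimal Python sketch using scikit-learn; the data and the choice of two components are illustrative:

import numpy as np
from sklearn.decomposition import PCA

# Toy data: 5 observations, 3 correlated variables
X = np.array([[1.0, 2.1, 0.9],
              [2.0, 3.9, 2.1],
              [3.0, 6.2, 2.9],
              [4.0, 8.1, 4.2],
              [5.0, 9.8, 5.1]])

pca = PCA(n_components=2)
scores = pca.fit_transform(X)           # principal component scores (uncorrelated)
print(pca.explained_variance_ratio_)    # share of total variance per component
print(scores)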

Basic Plots

Line graph, Bar Charts, Scatterplots

Summary Statistics

Min and max detect extreme values (outliers). The average and median show central values; a large deviation between the two indicates skewness. The standard deviation shows how dispersed the values are around the mean.
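A minimal Python sketch of these summary statistics with pandas; the variable name and values are hypothetical:

import pandas as pd

s = pd.Series([3, 4, 5, 6, 7, 50], name="income")

print(s.min(), s.max())        # extreme values / possible outliers
print(s.mean(), s.median())    # a large gap between them suggests skewness
print(s.std())                 # dispersion around the mean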

Normalizing data

Normalize each variable to remove the scale effect before performing PCA, so that all variables are on the same scale. Normalization (= standardization) is usually performed in PCA. Note: In XLMiner, use the CORRELATION MATRIX option to work with normalized variables.
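A minimal Python sketch of standardizing before PCA (the scikit-learn analogue of a correlation-matrix PCA); the data are illustrative:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Two variables measured on very different scales
X = np.array([[1.0, 100.0], [2.0, 250.0], [3.0, 310.0], [4.0, 480.0]])

X_std = StandardScaler().fit_transform(X)   # mean 0, std 1 for each variable
pca = PCA().fit(X_std)                      # PCA on standardized data
print(pca.explained_variance_ratio_)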

Dimension Reduction

Reducing the number of variables so that models operate efficiently; it also helps in selecting the right tool for preprocessing or analysis. Visualization and CORRELATION ANALYSIS help decide which variables can be dropped.

Correlation Analysis

Used to find redundancies in the data and to avoid the multicollinearity problem; highly correlated variables carry duplicated information, so one of each pair can be dropped.
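A minimal Python sketch of a correlation matrix with pandas; the columns are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],   # redundant: perfectly correlated with x1
    "x3": [5, 3, 6, 2, 7],
})

# Pairwise correlations; values near +/-1 flag redundant predictors
print(df.corr())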

Pivot Tables

Summarize multiple variables at once; counts and percentages are useful for summarizing CATEGORICAL data.
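A minimal Python sketch of pivot-table summaries with pandas; the columns and values are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "segment": ["A", "A", "B", "B", "B"],
    "churned": ["yes", "no", "no", "yes", "no"],
    "spend":   [100, 80, 60, 90, 70],
})

# Counts of one categorical variable by another (cross-tab)
print(pd.crosstab(df["segment"], df["churned"]))

# Average of a numerical variable by two categorical variables
print(pd.pivot_table(df, values="spend", index="segment",
                     columns="churned", aggfunc="mean"))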

Scatterplot

Shows the relationship, pattern, or trend between two NUMERICAL variables; useful for spotting information overlap and for identifying clusters of observations. UNSUPERVISED learning.
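A minimal Python sketch of a scatterplot with matplotlib; the two numerical variables are hypothetical:

import matplotlib.pyplot as plt

age =    [23, 35, 45, 52, 61, 30, 48]
income = [30, 50, 65, 70, 85, 42, 68]

# Look for a pattern/trend and possible clusters of observations
plt.scatter(age, income)
plt.xlabel("age")
plt.ylabel("income (thousands)")
plt.show()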

Heat Maps

Useful for visualizing correlation tables and for visualizing missing data. When two variables are highly correlated, one of the pair can be removed.
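A minimal Python sketch of a correlation heat map; seaborn is used here for convenience and the data are illustrative:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 5, 8, 10],
    "x3": [5, 3, 6, 2, 7],
})

# Cell color/annotation flags highly correlated variable pairs
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()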

Histograms

Show the frequency of all values of a single numerical variable; the shape may reveal SKEWNESS. Useful in prediction tasks.
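A minimal Python sketch of a histogram with matplotlib; the values and bin count are illustrative:

import matplotlib.pyplot as plt

values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9, 12]   # right-skewed toy data

# Bar heights = how many observations fall in each bin
plt.hist(values, bins=5)
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()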

Visualization

Used to get a better overview of the data.

Multicollinearity

Two or more predictors sharing the same linear relationship with the outcome variable.
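One common way to detect multicollinearity is the variance inflation factor (VIF); a minimal sketch with statsmodels, using illustrative data and the usual rule-of-thumb cutoff:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2.1, 3.9, 6.2, 8.1, 9.8, 12.1],   # nearly a multiple of x1
    "x3": [5, 3, 6, 2, 7, 4],
})

X_const = sm.add_constant(X)
# VIF well above 10 (rule of thumb) signals a multicollinearity problem
for i, col in enumerate(X_const.columns):
    if col == "const":
        continue
    print(col, variance_inflation_factor(X_const.values, i))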

