Data Mining Chapter 3 & 4
Distribution Plots
Box plots and histograms display the entire distribution of a numerical variable, i.e., "how many" of each value occur in a data set.
Bar Chart
Useful for comparing a single statistic across groups (numerical statistic, categorical grouping); bar height = value of the statistic. With Y = numerical outcome and X = categorical predictor, the bar chart supports SUPERVISED learning tasks.
Reducing Categories
Combine close or similar categories. A single categorical variable with m categories is typically transformed into m-1 dummy variables; each dummy variable takes the value 0 ("no" for the category) or 1 ("yes"). Problem: can end up with too many variables. Solution: reduce by combining categories that are close to each other (use a pivot table to see which). Exception: Naïve Bayes can handle categorical variables without transforming them into dummies.
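A minimal sketch of the m-1 dummy transformation with pandas (the toy data and the "fuel" column name are illustrative, not from the notes):

```python
import pandas as pd

# Hypothetical categorical variable with m = 3 categories.
df = pd.DataFrame({"fuel": ["gas", "diesel", "electric", "gas"]})

# drop_first=True yields m - 1 = 2 dummy columns; the dropped category
# ("diesel", first alphabetically) becomes the implicit baseline (all zeros).
dummies = pd.get_dummies(df["fuel"], drop_first=True)
print(list(dummies.columns))  # → ['electric', 'gas']
```

A row whose category was dropped shows 0 in every dummy column, which is why m-1 columns carry the full information.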
Boxplots
Display the entire distribution of a numerical variable. Invented by J. Tukey. The median (50th percentile) is marked by a line within the box; the top of the box is Q3 (75th percentile) and the bottom is Q1 (25th percentile). Two lines (whiskers) outside the box show the range; points beyond them may indicate outliers. Side-by-side boxplots compare subgroups, e.g., a numerical variable split by outcome class in a classification (supervised) task.
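The quartiles and the usual outlier rule behind a boxplot can be computed directly; a minimal numpy sketch with made-up numbers:

```python
import numpy as np

values = np.array([2, 4, 4, 5, 7, 8, 9, 11, 30])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
# Tukey's rule: points beyond 1.5 * IQR outside the box are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(q1, median, q3, outliers)  # → 4.0 7.0 9.0 [30]
```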
Principal Components Analysis (PCA)
Goal: Reduce the number of predictors in the model. Remove the overlap of information between variables. "Information" is measured by the sum of the variances of the variables. Create new variables that are linear combinations of the original variables. These linear combinations are uncorrelated (no information overlap). The new variables are called PRINCIPAL COMPONENTS.
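A minimal numpy sketch of the idea on synthetic data (PCA done via the eigendecomposition of the correlation matrix, which is equivalent to normalizing the variables first):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
noise = rng.normal(size=200)
# x2 overlaps heavily with x1 (information overlap); x3 is independent
X = np.column_stack([x1, x1 + 0.1 * noise, rng.normal(size=200)])

# Normalize (z-score) so every variable contributes on the same scale
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Principal components = eigenvectors of the correlation matrix
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]        # largest variance first
scores = Z @ eigvecs[:, order]           # the principal components

# The components are uncorrelated: off-diagonals of their correlation matrix ~ 0,
# and total "information" (sum of variances) is preserved: eigvals sum to 3.
corr_pc = np.corrcoef(scores, rowvar=False)
```

Because x1 and x2 overlap, the first component captures most of the variance, so the later components can be dropped with little information loss.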
Basic Plots
Line graph, Bar Charts, Scatterplots
Summary Statistics
Min and max detect extreme values (outliers). Average (mean) and median show central values; a large deviation between the two indicates skewness. Standard deviation shows how dispersed the values are from the mean.
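For example (made-up income values), one extreme value pulls the mean far above the median, signaling right skew:

```python
import numpy as np

income = np.array([28, 31, 35, 36, 38, 40, 42, 250])  # one extreme value

print(income.min(), income.max())        # 28 250 — the max hints at an outlier
print(income.mean(), np.median(income))  # 62.5 37.0 — mean >> median → right skew
print(income.std())                      # dispersion around the mean
```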
Normalizing data
Normalize each variable to remove scale effects before performing PCA, putting ALL variables on the same scale. Normalization (= standardization) is usually performed in PCA. Note: in XLMiner, use the CORRELATION MATRIX option to work with normalized variables.
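Normalization (z-scoring) itself is one line; a sketch with toy numbers on very different scales:

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# z-score each column: subtract its mean, divide by its standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Every column now has mean 0 and standard deviation 1 — the same scale,
# so no variable dominates the PCA just because of its units.
```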
Dimension Reduction
Reducing the number of variables so that models operate efficiently, and helping to select the right tools for preprocessing or analysis. Focus on visualization and CORRELATION ANALYSIS to reduce the number of variables.
Correlation Analysis
To find redundancies in the data, avoid the multicollinearity problem, and find near-duplicate variables.
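A sketch of hunting for redundant predictor pairs with numpy (synthetic data; the 0.9 cutoff is an arbitrary illustrative threshold):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = 2 * a + 0.01 * rng.normal(size=100)  # near-duplicate of a
c = rng.normal(size=100)                 # unrelated

corr = np.corrcoef(np.column_stack([a, b, c]), rowvar=False)

# Flag predictor pairs whose |correlation| exceeds the threshold
i, j = np.triu_indices(3, k=1)
redundant = [(int(p), int(q)) for p, q in zip(i, j) if abs(corr[p, q]) > 0.9]
print(redundant)  # → [(0, 1)] — keep one of a, b and drop the other
```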
Pivot Tables
Summarize data across multiple variables. Counts and percentages are useful for summarizing CATEGORICAL data.
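A sketch with pandas on toy data (`crosstab` is one way to build such a count/percentage pivot):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "owner":  ["yes", "no", "yes", "yes", "no"],
})

counts = pd.crosstab(df["region"], df["owner"])  # counts per combination
shares = pd.crosstab(df["region"], df["owner"],
                     normalize="index")          # row percentages
print(counts)
```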
Scatterplot
Shows the relationship, pattern, or trend between two NUMERICAL variables; reveals information overlap and helps identify clusters of observations. UNSUPERVISED learning.
Heat Maps
Useful for visualizing correlation tables and visualizing missing data. Of the two most highly correlated variables, remove one.
Histograms
Shows the frequency of all values of a single variable; might reveal SKEWNESS. Useful in prediction tasks.
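A numpy sketch of binning a small skewed toy sample into histogram counts:

```python
import numpy as np

data = np.array([1, 2, 2, 3, 3, 3, 9])  # long right tail

counts, edges = np.histogram(data, bins=3)
print(counts)  # → [6 0 1] — mass piled into the first bin suggests right skew
```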
Visualization
to get a better overview of the data
multicollinearity
Occurs when two or more predictors share the same linear relationship with the outcome variable; such redundant predictors make individual coefficient estimates unstable.