Data Mining Chapter 3 & 4
Distribution Plots
Box plots and histograms display the entire distribution of a numerical variable, i.e., "how many" of each value occur in a data set.
Bar Chart
Useful for comparing a single statistic across groups (numerical statistic, categorical grouping); bar height = value of the statistic. With Y = numerical outcome and X = categorical predictor, the bar chart supports SUPERVISED learning tasks.
Reducing Categories
Combine close or similar categories. A single categorical variable with m categories is typically transformed into m-1 dummy variables; each dummy variable takes the value 0 ("no" for the category) or 1 ("yes"). Problem: can end up with too many variables. Solution: reduce by combining categories that are close to each other (use a pivot table to see which). Exception: Naïve Bayes can handle categorical variables without transforming them into dummies.
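A minimal sketch of the m-1 dummy transformation with pandas (the toy data and the "fuel" column name are illustrative, not from the notes):

```python
import pandas as pd

# Hypothetical categorical variable with m = 3 categories.
df = pd.DataFrame({"fuel": ["gas", "diesel", "electric", "gas"]})

# drop_first=True yields m - 1 = 2 dummy columns; the dropped category
# ("diesel", first alphabetically) becomes the implicit baseline (all zeros).
dummies = pd.get_dummies(df["fuel"], drop_first=True)
print(list(dummies.columns))  # → ['electric', 'gas']
```

A row whose category was dropped shows 0 in every dummy column, which is why m-1 columns carry the full information.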
Boxplots
Display the entire distribution of a numerical variable. Invented by J. Tukey. The median (50th percentile) is marked by a line within the box; the top of the box is Q3 (75th percentile) and the bottom is Q1 (25th percentile). Two lines (whiskers) outside the box show the range; points beyond them may indicate outliers. Side-by-side boxplots compare subgroups, e.g., a numerical variable split by outcome class in a classification (supervised) task.
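The quartiles and the usual outlier rule behind a boxplot can be computed directly; a minimal numpy sketch with made-up numbers:

```python
import numpy as np

values = np.array([2, 4, 4, 5, 7, 8, 9, 11, 30])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
# Tukey's rule: points beyond 1.5 * IQR outside the box are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(q1, median, q3, outliers)  # → 4.0 7.0 9.0 [30]
```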
Principal Components Analysis (PCA)
Goal: Reduce the number of predictors in the model. Remove the overlap of information between variables. "Information" is measured by the sum of the variances of the variables. Create new variables that are linear combinations of the original variables. These linear combinations are uncorrelated (no information overlap). The new variables are called PRINCIPAL COMPONENTS.
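A minimal numpy sketch of the idea on synthetic data (PCA done via the eigendecomposition of the correlation matrix, which is equivalent to normalizing the variables first):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
noise = rng.normal(size=200)
# x2 overlaps heavily with x1 (information overlap); x3 is independent
X = np.column_stack([x1, x1 + 0.1 * noise, rng.normal(size=200)])

# Normalize (z-score) so every variable contributes on the same scale
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Principal components = eigenvectors of the correlation matrix
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]        # largest variance first
scores = Z @ eigvecs[:, order]           # the principal components

# The components are uncorrelated: off-diagonals of their correlation matrix ~ 0,
# and total "information" (sum of variances) is preserved: eigvals sum to 3.
corr_pc = np.corrcoef(scores, rowvar=False)
```

Because x1 and x2 overlap, the first component captures most of the variance, so the later components can be dropped with little information loss.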
Basic Plots
Line graph, Bar Charts, Scatterplots
Summary Statistics
Min and max detect extreme values (outliers). Average (mean) and median show central values; a large deviation between the two indicates skewness. Standard deviation shows how dispersed the values are from the mean.
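For example (made-up income values), one extreme value pulls the mean far above the median, signaling right skew:

```python
import numpy as np

income = np.array([28, 31, 35, 36, 38, 40, 42, 250])  # one extreme value

print(income.min(), income.max())        # 28 250 — the max hints at an outlier
print(income.mean(), np.median(income))  # 62.5 37.0 — mean >> median → right skew
print(income.std())                      # dispersion around the mean
```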
Normalizing data
Normalize each variable to remove scale effects before performing PCA, putting ALL variables on the same scale. Normalization (= standardization) is usually performed in PCA. Note: in XLMiner, use the CORRELATION MATRIX option to work with normalized variables.
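Normalization (z-scoring) itself is one line; a sketch with toy numbers on very different scales:

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# z-score each column: subtract its mean, divide by its standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Every column now has mean 0 and standard deviation 1 — the same scale,
# so no variable dominates the PCA just because of its units.
```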
Dimension Reduction
Reducing the number of variables so that models operate efficiently, and helping to select the right tools for preprocessing or analysis. Focus on visualization and CORRELATION ANALYSIS to reduce the number of variables.
Correlation Analysis
To find redundancies in the data, avoid the multicollinearity problem, and find near-duplicate variables.
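A sketch of hunting for redundant predictor pairs with numpy (synthetic data; the 0.9 cutoff is an arbitrary illustrative threshold):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = 2 * a + 0.01 * rng.normal(size=100)  # near-duplicate of a
c = rng.normal(size=100)                 # unrelated

corr = np.corrcoef(np.column_stack([a, b, c]), rowvar=False)

# Flag predictor pairs whose |correlation| exceeds the threshold
i, j = np.triu_indices(3, k=1)
redundant = [(int(p), int(q)) for p, q in zip(i, j) if abs(corr[p, q]) > 0.9]
print(redundant)  # → [(0, 1)] — keep one of a, b and drop the other
```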
Pivot Tables
Summarize data across multiple variables. Counts and percentages are useful for summarizing CATEGORICAL data.
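A sketch with pandas on toy data (`crosstab` is one way to build such a count/percentage pivot):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "owner":  ["yes", "no", "yes", "yes", "no"],
})

counts = pd.crosstab(df["region"], df["owner"])  # counts per combination
shares = pd.crosstab(df["region"], df["owner"],
                     normalize="index")          # row percentages
print(counts)
```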
Scatterplot
Shows the relationship, pattern, or trend between two NUMERICAL variables; reveals information overlap and helps identify clusters of observations. UNSUPERVISED learning.
Heat Maps
Useful for visualizing correlation tables and visualizing missing data. Of the two most highly correlated variables, remove one.
Histograms
Shows the frequency of all values of a single variable; might reveal SKEWNESS. Useful in prediction tasks.
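A numpy sketch of binning a small skewed toy sample into histogram counts:

```python
import numpy as np

data = np.array([1, 2, 2, 3, 3, 3, 9])  # long right tail

counts, edges = np.histogram(data, bins=3)
print(counts)  # → [6 0 1] — mass piled into the first bin suggests right skew
```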
Visualization
to get a better overview of the data
multicollinearity
Occurs when two or more predictors share the same linear relationship with the outcome variable; such redundant predictors make individual coefficient estimates unstable.