Data Mining Exam 1: Lecture 2
Attribute transformation
a function that maps the entire set of values of a given attribute to a new set of replacement values such that each old value can be identified with one of the new values
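A minimal sketch of the idea, assuming a log transform as the mapping function. The helper name `transform_attribute` and the income values are hypothetical, chosen only for illustration. Each old value maps to exactly one new value, so the original can still be identified.

```python
import math

def transform_attribute(values, func):
    """Apply func to every value of one attribute, keeping the original order."""
    return [func(v) for v in values]

# Illustrative values only (not from the lecture notes).
incomes = [1_000, 10_000, 100_000]

# log10 compresses a wide range of magnitudes into a narrow one.
log_incomes = transform_attribute(incomes, math.log10)
```

Other common choices for the function include the absolute value, a power, or standardization.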
An attribute is
a property or characteristic of an object
Binary attributes are
a special case of discrete attributes
Outliers
are data objects with characteristics that are considerably different than most of the other data objects in the data set
Dimensionality Reduction
Attempts to avoid the curse of dimensionality using techniques such as PCA, SVD, and other supervised or non-linear techniques
A collection of attributes
describes an object
Feature Subset Selection
Removes redundant and irrelevant features
Sampling
The main technique employed for data selection, because obtaining the entire set of data is too expensive or time-consuming
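A minimal sketch of simple random sampling without replacement, which is one of several possible sampling schemes. The helper name `simple_random_sample`, the seed, and the data are assumptions made for illustration.

```python
import random

def simple_random_sample(objects, k, seed=0):
    """Draw k distinct objects uniformly at random (without replacement)."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    return rng.sample(objects, k)

# Hypothetical data set of 100 object IDs.
data = list(range(100))
sample = simple_random_sample(data, 10)
```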
Discretization
The process of converting a continuous attribute into an ordinal attribute
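A minimal sketch of discretization, assuming equal-width binning (other schemes, such as equal-frequency binning, are also possible). The function name `equal_width_bins`, the heights, and the labels are hypothetical.

```python
def equal_width_bins(values, labels):
    """Map continuous values to ordered labels using equal-width intervals."""
    lo, hi = min(values), max(values)
    k = len(labels)
    width = (hi - lo) / k
    result = []
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything falls in the first bin
        else:
            # The maximum value would land in bin k, so clamp it to the last bin.
            idx = min(int((v - lo) / width), k - 1)
        result.append(labels[idx])
    return result

# Illustrative heights in centimeters.
heights_cm = [150, 160, 170, 180, 190]
height_levels = equal_width_bins(heights_cm, ["short", "medium", "tall"])
```

The labels are listed from lowest to highest, so the result is an ordinal attribute.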
An attribute is a property or characteristic of an object
True
For ordinal attributes
Values distinguish and order objects (<, >)
For nominal attributes
Values only distinguish between one another (= or !=)
Data Preprocessing Strategies
-Aggregation -Sampling -Dimensionality Reduction -Feature subset selection -Feature creation -Discretization & Binarization -Attribute Transformation
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object) in order to: reduce data, change scale, or get more stable data
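A minimal sketch of aggregating objects, assuming the goal is to sum one attribute per group. The helper `aggregate_by_key` and the sales records are hypothetical.

```python
from collections import defaultdict

def aggregate_by_key(records, key, value):
    """Combine all records sharing the same key into a single summed total."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec[value]
    return dict(totals)

# Illustrative transactions: three objects collapse into two.
daily_sales = [
    {"store": "A", "amount": 10.0},
    {"store": "A", "amount": 5.0},
    {"store": "B", "amount": 7.5},
]
store_totals = aggregate_by_key(daily_sales, "store", "amount")
```

Here the data is reduced (three objects become two) and the scale changes from individual transactions to stores.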
How to handle missing values?
-Eliminate data objects -Estimate missing values -Ignore the missing value during analysis
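The three strategies above can be sketched as follows, assuming missing values are marked with `None` and that the mean is used as the estimate (other estimates are possible). All function names and data are hypothetical.

```python
def drop_missing(rows):
    """Eliminate data objects (rows) that contain any missing value."""
    return [r for r in rows if None not in r]

def impute_mean(column):
    """Estimate missing values with the mean of the observed values."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

def mean_ignoring_missing(column):
    """Ignore missing values during analysis: use only observed values."""
    present = [v for v in column if v is not None]
    return sum(present) / len(present)

# Illustrative data.
rows = [(1.0, 2.0), (None, 3.0), (4.0, None)]
ages = [20, None, 40]
```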
Continuous Attributes
-Have real numbers as attribute values, like temperature, height, or weight -Real values can only be measured and represented using a finite number of digits -Typically represented as floating-point variables
Discrete Attributes
-Have only a finite or countably infinite set of values, like zip codes, counts, or the set of words in a document -Often represented as integer variables -Binary attributes are a special case of discrete attributes
Data quality problems
-Noise and outliers -Missing values -Duplicate data
4 Types of attributes
-Nominal -Ordinal -Interval -Ratio
Which of the following statements about asymmetric attributes is correct? A. Non-zero attribute values are equally important as zero values in data analysis. B. Non-zero attribute values are more important than zero values in data analysis. C. Zero attribute values are more important than non-zero attribute values in data analysis. D. None of the above.
B. Non-zero attribute values are more important than zero values in data analysis.
For ratio attributes
Both differences AND ratios are meaningful (*, /)
(qualitative/categorical) Ordinal data examples
Rankings (taste of potato chips from 1-10), grades, height in (tall, medium, short)
(quantitative/numeric) Interval data examples
Calendar dates, temperatures in Celsius or Fahrenheit
What is Data?
Collection of data objects and their attributes.
Feature Creation
Creates new attributes that can capture the important information in a data set much more efficiently than the original attributes
Which of the following is an example of data quality problems? A. Noise and outliers. B. Missing values. C. Duplicate data. D. All of the above.
D. All of the above
Which of the following is NOT one of the three important characteristics of structured data? A. Resolution. B. Dimensionality. C. Sparsity. D. Sample size.
D. Sample size
For interval attributes
Differences between values are meaningful (+, -)
Important characteristics of structured data
-Dimensionality: curse of dimensionality -Sparsity: only presence counts -Resolution: patterns depend on the scale
Age in years is:
Discrete, quantitative, ratio
Attributes and attribute values are equivalent
False
(qualitative/categorical) Nominal data examples
ID numbers, eye color, zip codes
Binarization
Maps a continuous or categorical attribute into one or more binary variables
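A minimal sketch of binarizing a categorical attribute, assuming one binary variable per category (often called one-hot encoding). The function name `one_hot` and the eye-color values are hypothetical.

```python
def one_hot(values):
    """Return one binary vector per value, plus the category order used."""
    categories = sorted(set(values))  # fixed, repeatable column order
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories

# Illustrative nominal attribute.
eye_colors = ["brown", "blue", "green", "blue"]
encoded, categories = one_hot(eye_colors)
```

A continuous attribute would typically be discretized first and then encoded the same way.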
Noise
refers to modification of original values
Normalization is a form of attribute transformation that:
refers to various techniques to adjust to differences among attributes in terms of frequency of occurrence, mean, variance, and magnitude
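A minimal sketch of one such technique, assuming z-score standardization (subtract the mean, divide by the standard deviation). The function name `standardize` and the input values are hypothetical, and the population standard deviation is an arbitrary choice here.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant attribute: no spread to rescale
    return [(v - mean) / std for v in values]

# Illustrative values.
z = standardize([2.0, 4.0, 6.0])
```

After this step, attributes with very different magnitudes become directly comparable.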
(quantitative/numeric) Ratio data examples
temperature in Kelvin, length, time, counts
An Attribute is also known as
variable, characteristic, or feature