4 - Data Preprocessing
What is the first reason why we preprocess the data?
Accuracy
What is incomplete data?
Lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
What is a tuple?
One record (one row) in a dataset: an ordered collection of attribute values.
How are histograms constructed?
Optimally in one dimension with dynamic programming
What are the types of numerosity reduction?
Parametric and non-parametric
What is the second way to handle noisy data?
Regression
What does consistency mean?
Some data modified but some not, leaving dangling references
What is the fourth thing that could cause incorrect attribute values?
Technology limitation
What is the first reason normalization is used?
To make two variables in different scales comparable.
What is data preprocessing?
A data mining technique that involves transforming raw data into an understandable format.
Why is data reduction used?
A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.
What is a histogram?
A popular data reduction technique that divides data into buckets and stores the average (sum) for each bucket.
What is another definition of data preprocessing?
A step in data mining which provides techniques that can help us to understand and make knowledge discovery of data at the same time.
What is sampling?
Allowing a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
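As a minimal sketch (not part of the deck), simple random sampling without replacement can be done with the standard library; the data and sample size here are made up for illustration:

```python
import random

def simple_random_sample(records, n, seed=0):
    """Draw n records uniformly at random without replacement."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    return rng.sample(records, n)

records = list(range(1000))          # stand-in for 1000 tuples
sample = simple_random_sample(records, 50)
```

Running the mining algorithm on `sample` instead of `records` is what makes the cost potentially sub-linear in the full data size.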
What are multiple regression models?
Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector.
What is a feature?
An individual measurable property or characteristic of a phenomenon being observed
What are log-linear models?
Approximates discrete multidimensional probability distributions.
What is the first data reduction strategy?
Attribute feature selection
What can clustering do?
Be hierarchical and stored in multi-dimensional index tree structures
Why is data not always available?
Because many tuples have no recorded value for several attributes, such as customer income in sales data
What is the fifth reason why we preprocess the data?
Believability
What is the first way to handle noisy data?
Binning
What is done with noisy data?
Binning method, regression, clustering
What is the second way to handle missing data?
By filling the missing values
What are the different ways to smooth data in the binning process?
By means, median, or bin boundaries.
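As an illustrative sketch (the price values follow the classic textbook example), smoothing by bin means with equal-frequency bins can be written as:

```python
def smooth_by_bin_means(values, n_bins):
    """Sort, partition into equal-frequency bins, replace each value by its bin mean."""
    data = sorted(values)
    size = len(data) // n_bins
    smoothed = []
    for i in range(n_bins):
        start = i * size
        end = start + size if i < n_bins - 1 else len(data)  # last bin takes the remainder
        bin_vals = data[start:end]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
smoothed = smooth_by_bin_means(prices, 3)
```

Smoothing by bin medians or bin boundaries replaces the mean step with the bin's median, or with whichever boundary each value is closest to.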
What is the fourth cause missing data could be attributed to?
Certain data may not be considered important at the time of entry
What is the fifth cause missing data could be attributed to?
Changes of the data or registry history
What is the third way to handle noisy data?
Clustering
What is the second reason why we preprocess the data?
Completeness
What does data generalization do?
Concept hierarchy climbing
What is the third reason why we preprocess the data?
Consistency
What is attribute feature selection?
Construct a new feature combining the given features in order to make the data mining process more efficient.
What is inconsistent data?
Containing discrepancies in codes or names
What is noisy data?
Containing errors or outliers
What is the third data cleaning task?
Correct inconsistent data
What does accuracy mean?
Correct or wrong, accurate or not
What is the first major task in data preprocessing?
Data Cleaning
What is the third major task in data preprocessing?
Data Reduction
What is the second major task in data preprocessing?
Data Transformation
What is the second thing that could cause incorrect attribute values?
Data entry problems
What are linear regression models?
Data is modeled to fit a straight line, often uses the least-square method to fit the line.
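A minimal sketch of the closed-form least-squares fit for a line y = a + b·x (the sample points are invented for illustration):

```python
def least_squares_fit(xs, ys):
    """Fit y = a + b*x by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x   # intercept
    return a, b

a, b = least_squares_fit([1, 2, 3, 4], [2.1, 3.9, 6.0, 8.1])
```

Noisy y-values can then be smoothed by replacing them with the fitted values a + b·x.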
What is the third cause missing data could be attributed to?
Data not entered due to misunderstanding
What is the third thing that could cause incorrect attribute values?
Data transmission problems
What is clustering?
Detect and remove outliers
What is the second data reduction strategy?
Dimensionality reduction
What is intentional data?
Disguised as missing data
What is the first other data problem that requires data cleaning?
Duplicate records
What is the first cause missing data could be attributed to?
Equipment malfunction
What is the first thing that could cause incorrect attribute values?
Faulty data collection instruments
What is the first data cleaning task?
Fill in missing values
What is binning?
First sort data and partition into (equal-frequency) bins
What is non-parametric numerosity reduction?
Histograms, data sampling and data cube aggregation
What does interpretability mean?
How easily the data can be understood
What does believability mean?
How much the data can be trusted to be correct
What is the second data cleaning task?
Identify outliers and smooth out noisy data
When is clustering effective?
If the data is clustered, but not if the data is 'smeared'.
What is the first way to handle missing data?
Ignore the tuples
What is the second other data problem that requires data cleaning?
Incomplete data
What is the fifth thing that could cause incorrect attribute values?
Inconsistency in naming convention
What is the third other data problem that requires data cleaning?
Inconsistent data
What is the second cause missing data could be attributed to?
Inconsistent with other recorded data and thus deleted
What is the sixth reason why we preprocess the data?
Interpretability
What is the problem with filling in missing values manually?
It can be tedious and infeasible
What sometimes must be done to missing data?
It may need to be inferred
How does data preprocessing address issues with real-world data?
It provides operations which can organize the data into a proper form for better understanding in the data mining process
What is the problem with real-world data?
It tends to be incomplete, noisy, and inconsistent
How do you fill in the missing value with the most probable value?
It's inference-based (Bayesian formula or decision tree)
What does it mean that data in the real world is dirty?
Lots of potentially incorrect data, caused for example by faulty instruments, human or computer error, or transmission errors
What is the first way to fill in the missing values?
Manually
What is 'noisy' data?
Meaningless data that can't be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc.
What is the most used transformation?
Min-max normalization
What does data cleaning deal with?
Missing and noisy data
What does completeness mean?
Not recorded, unavailable
What is the third data reduction strategy?
Numerosity reduction
What does data reduction do?
Obtain a reduced representation of the data set that is much smaller in volume however produces the same (or almost the same) analytical results
What is 'clustering'?
Partitioning data set into clusters, and one can store cluster representation only.
What can real-world data that hasn't been preprocessed lead to?
Poor quality of collected data, and a low quality of models built on such data
What should quality decisions be based on?
Quality data
What is noise?
Random error or variance in a measured variable
What is parametric numerosity reduction?
Regression models
What does smoothing the data do?
Remove noise from the data
What is numerosity reduction?
Replaces the original data by a smaller form of data representation
What does data normalization do?
Scaled to fall within a small, specified range by min-max, z-score, or decimal scaling
What is regression?
Smooth by fitting the data into regression functions
What is the second reason normalization is used?
Some models may need the data to be normalized before modeling.
What does data aggregation do?
Summarization, data cube construction
What is the z-score?
The number of standard deviations a data point is from the mean of the data set.
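A minimal sketch of z-score normalization using the population standard deviation (the sample values are invented):

```python
def z_score_normalize(values):
    """Map each value v to (v - mean) / std, using the population std."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

z = z_score_normalize([10, 20, 30, 40, 50])
```

After normalization the values have mean 0 and standard deviation 1, which makes attributes measured on different scales directly comparable.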
What is normalization?
The process entails converting numerical values into a new range using a mathematical function
What is the fourth reason why we preprocess the data?
Timeliness
What does timeliness mean?
Timely update
What does min-max normalization do?
Transforms the numerical value into a new range, for example 0 to 1.
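A minimal sketch of min-max normalization to a [0, 1] range (the income values follow the classic textbook example):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly from [min, max] into [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

scaled = min_max_normalize([12000, 73600, 98000])
```

With a minimum of 12,000 and a maximum of 98,000, the value 73,600 maps to about 0.716.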
What is done with missing data?
Tuples are ignored and missing values are filled in
What is the second way to fill in the missing values?
Use the attribute mean to fill the missing value
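A minimal sketch of mean imputation, with `None` standing in for a missing value (the data is invented):

```python
def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

filled = fill_with_mean([10, None, 20, None, 30])
```

A common refinement is to use the mean of only the tuples in the same class, which gives a more representative fill value.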
What is the third way to fill in the missing values?
Use the most probable value to fill in the missing value
What is dimensionality reduction?
Used to reduce the amount of features.
How does decimal scaling normalization work?
We move the decimal point of the attribute's values; how far the point moves depends on the maximum absolute value among all the values of the attribute.
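A minimal sketch of decimal scaling (the values are invented): divide every value by 10^j, where j is the smallest integer such that the largest absolute value becomes less than 1.

```python
def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer making max(|v|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]

scaled = decimal_scaling([-986, 217, 540])
```

Here the maximum absolute value is 986, so j = 3 and every value is divided by 1000.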
When are histograms best suited?
When related to quantization problems.
When does missing data arise?
When values are absent for one or more attributes. It can be handled in various ways.
When should you ignore the tuples?
When the dataset is quite large and multiple values are missing within a tuple.
Why is data transformation important?
Raw data in almost any project is unfit for direct consumption in analysis or modeling; it must first be transformed into a usable form.
