4 - Data Preprocessing
What is the first reason why we preprocess the data?
Accuracy
What is incomplete data?
Lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
What is a tuple?
One record (one row) in a dataset: an ordered collection of attribute values.
How are histograms constructed?
Optimally in one dimension with dynamic programming
What are the types of numerosity reduction?
Parametric and non-parametric
What is the second way to handle noisy data?
Regression
What does consistency mean?
Some data modified but some not, leaving dangling references
What is the fourth thing that could cause incorrect attribute values?
Technology limitation
What is the first reason normalization is used?
To make two variables in different scales comparable.
What is data preprocessing?
A data mining technique that involves transforming raw data into an understandable format.
Why is data reduction used?
A database/data warehouse may store terabytes of data. Complex data analysis may take a very long time to run on the complete data set.
What is a histogram?
A popular data reduction technique that divides data into buckets and stores the average (sum) for each bucket.
What is another definition of data preprocessing?
A step in data mining which provides techniques that can help us to understand and make knowledge discovery of data at the same time.
What is sampling?
Allowing a mining algorithm to run in complexity that is potentially sub-linear to the size of the data
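As a minimal sketch (not part of the deck), simple random sampling without replacement can be done with the standard library; the data and sample size here are made up for illustration:

```python
import random

def simple_random_sample(records, n, seed=0):
    """Draw n records uniformly at random without replacement."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    return rng.sample(records, n)

records = list(range(1000))          # stand-in for 1000 tuples
sample = simple_random_sample(records, 50)
```

Running the mining algorithm on `sample` instead of `records` is what makes the cost potentially sub-linear in the full data size.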
What are multiple regression models?
Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector.
What is a feature?
An individual measurable property or characteristic of a phenomenon being observed
What are log-linear models?
Approximates discrete multidimensional probability distributions.
What is the first data reduction strategy?
Attribute feature selection
What can clustering do?
Be hierarchical and stored in multi-dimensional index tree structures
Why is data not always available?
Because many tuples have no recorded value for several attributes, such as customer income in sales data
What is the fifth reason why we preprocess the data?
Believability
What is the first way to handle noisy data?
Binning
What is done with noisy data?
Binning method, regression, clustering
What is the second way to handle missing data?
By filling the missing values
What are the different ways to smooth data in the binning process?
By means, median, or bin boundaries.
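As an illustrative sketch (the price values follow the classic textbook example), smoothing by bin means with equal-frequency bins can be written as:

```python
def smooth_by_bin_means(values, n_bins):
    """Sort, partition into equal-frequency bins, replace each value by its bin mean."""
    data = sorted(values)
    size = len(data) // n_bins
    smoothed = []
    for i in range(n_bins):
        start = i * size
        end = start + size if i < n_bins - 1 else len(data)  # last bin takes the remainder
        bin_vals = data[start:end]
        mean = sum(bin_vals) / len(bin_vals)
        smoothed.extend([mean] * len(bin_vals))
    return smoothed

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
smoothed = smooth_by_bin_means(prices, 3)
```

Smoothing by bin medians or bin boundaries replaces the mean step with the bin's median, or with whichever boundary each value is closest to.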
What is the fourth cause missing data could be attributed to?
Certain data may not be considered important at the time of entry
What is the fifth cause missing data could be attributed to?
Changes of the data or registry history
What is the third way to handle noisy data?
Clustering
What is the second reason why we preprocess the data?
Completeness
What does data generalization do?
Concept hierarchy climbing
What is the third reason why we preprocess the data?
Consistency
What is attribute feature selection?
Construct a new feature combining the given features in order to make the data mining process more efficient.
What is inconsistent data?
Containing discrepancies in codes or names
What is noisy data?
Containing errors or outliers
What is the third data cleaning task?
Correct inconsistent data
What does accuracy mean?
Correct or wrong, accurate or not
What is the first major task in data preprocessing?
Data Cleaning
What is the third major task in data preprocessing?
Data Reduction
What is the second major task in data preprocessing?
Data Transformation
What is the second thing that could cause incorrect attribute values?
Data entry problems
What are linear regression models?
Data is modeled to fit a straight line, often uses the least-square method to fit the line.
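A minimal sketch of the closed-form least-squares fit for a line y = a + b·x (the sample points are invented for illustration):

```python
def least_squares_fit(xs, ys):
    """Fit y = a + b*x by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x   # intercept
    return a, b

a, b = least_squares_fit([1, 2, 3, 4], [2.1, 3.9, 6.0, 8.1])
```

Noisy y-values can then be smoothed by replacing them with the fitted values a + b·x.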
What is the third cause missing data could be attributed to?
Data not entered due to misunderstanding
What is the third thing that could cause incorrect attribute values?
Data transmission problems
What is clustering?
Detect and remove outliers
What is the second data reduction strategy?
Dimensionality reduction
What is intentional data?
Disguised as missing data
What is the first other data problem that requires data cleaning?
Duplicate records
What is the first cause missing data could be attributed to?
Equipment malfunction
What is the first thing that could cause incorrect attribute values?
Faulty data collection instruments
What is the first data cleaning task?
Fill in missing values
What is binning?
First sort data and partition into (equal-frequency) bins
What is non-parametric numerosity reduction?
Histograms, data sampling and data cube aggregation
What does interpretability mean?
How easily the data can be understood
What does believability mean?
How much the data can be trusted to be correct
What is the second data cleaning task?
Identify outliers and smooth out noisy data
When is clustering effective?
If the data is clustered, but not if the data is 'smeared'.
What is the first way to handle missing data?
Ignore the tuples
What is the second other data problem that requires data cleaning?
Incomplete data
What is the fifth thing that could cause incorrect attribute values?
Inconsistency in naming convention
What is the third other data problem that requires data cleaning?
Inconsistent data
What is the second cause missing data could be attributed to?
Inconsistent with other recorded data and thus deleted
What is the sixth reason why we preprocess the data?
Interpretability
What is the problem with filling in missing values manually?
It can be tedious and infeasible
What sometimes must be done to missing data?
It may need to be inferred
How does data preprocessing address issues with real-world data?
It provides operations which can organize the data into a proper form for better understanding in the data mining process
What is the problem with real-world data?
It tends to be incomplete, noisy, and inconsistent
How do you fill in the missing value with the most probable value?
It's inference-based (Bayesian formula or decision tree)
What does it mean that data in the real world is dirty?
Lots of potentially incorrect data, caused for example by faulty instruments, human or computer error, or transmission errors
What is the first way to fill in the missing values?
Manually
What is 'noisy' data?
Meaningless data that can't be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc.
What is the most used transformation?
Min-max normalization
What does data cleaning deal with?
Missing and noisy data
What does completeness mean?
Not recorded, unavailable
What is the third data reduction strategy?
Numerosity reduction
What does data reduction do?
Obtain a reduced representation of the data set that is much smaller in volume however produces the same (or almost the same) analytical results
What is 'clustering'?
Partitioning data set into clusters, and one can store cluster representation only.
What can real-world data that hasn't been preprocessed lead to?
Poor quality of collected data, and a low quality of models built on such data
What should quality decisions be based on?
Quality data
What is noise?
Random error or variance in a measured variable
What is parametric numerosity reduction?
Regression models
What does smoothing the data do?
Remove noise from the data
What is numerosity reduction?
Replaces the original data by a smaller form of data representation
What does data normalization do?
Scaled to fall within a small, specified range by min-max, z-score, or decimal scaling
What is regression?
Smooth by fitting the data into regression functions
What is the second reason normalization is used?
Some models may need the data to be normalized before modeling.
What does data aggregation do?
Summarization, data cube construction
What is the z-score?
The number of standard deviations a data point is from the mean of the data set.
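A minimal sketch of z-score normalization using the population standard deviation (the sample values are invented):

```python
def z_score_normalize(values):
    """Map each value v to (v - mean) / std, using the population std."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

z = z_score_normalize([10, 20, 30, 40, 50])
```

After normalization the values have mean 0 and standard deviation 1, which makes attributes measured on different scales directly comparable.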
What is normalization?
The process entails converting numerical values into a new range using a mathematical function
What is the fourth reason why we preprocess the data?
Timeliness
What does timeliness mean?
Timely update
What does min-max normalization do?
Transforms the numerical value into a new range, for example 0 to 1.
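A minimal sketch of min-max normalization to a [0, 1] range (the income values follow the classic textbook example):

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly from [min, max] into [new_min, new_max]."""
    old_min, old_max = min(values), max(values)
    return [(v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min
            for v in values]

scaled = min_max_normalize([12000, 73600, 98000])
```

With a minimum of 12,000 and a maximum of 98,000, the value 73,600 maps to about 0.716.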
What is done with missing data?
Tuples are ignored and missing values are filled in
What is the second way to fill in the missing values?
Use the attribute mean to fill the missing value
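A minimal sketch of mean imputation, with `None` standing in for a missing value (the data is invented):

```python
def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

filled = fill_with_mean([10, None, 20, None, 30])
```

A common refinement is to use the mean of only the tuples in the same class, which gives a more representative fill value.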
What is the third way to fill in the missing values?
Use the most probable value to fill in the missing value
What is dimensionality reduction?
Used to reduce the amount of features.
How does decimal scaling normalization work?
We move the decimal point of the attribute's values; how far the point moves depends on the maximum absolute value among all the values of the attribute.
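A minimal sketch of decimal scaling (the values are invented): divide every value by 10^j, where j is the smallest integer such that the largest absolute value becomes less than 1.

```python
def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer making max(|v|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return [v / (10 ** j) for v in values]

scaled = decimal_scaling([-986, 217, 540])
```

Here the maximum absolute value is 986, so j = 3 and every value is divided by 1000.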
When are histograms best suited?
When related to quantization problems.
When does missing data arise?
When values are absent for one or more attributes. It can be handled in various ways.
When should you ignore the tuples?
When the dataset is quite large and multiple values are missing within a tuple.
Why is data transformation important?
Raw data in almost any project is unfit for direct consumption in analysis or modeling; it must first be transformed into a usable form.
