Big Data: Data Cleansing

Not all organizations see data in the same way, so not all organizations __ in the same way.

clean

Clearly define abbreviations: INC for Incorporated, ST for Street, USA for United States of America.

Abbreviation expansion
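
As an illustration of abbreviation expansion, here is a minimal Python sketch. The `ABBREVIATIONS` table and the `expand_abbreviations` function are hypothetical names built from the three examples on the card; a real organization would define its own table.

```python
import re

# Hypothetical lookup table built from the examples on the card.
ABBREVIATIONS = {
    "INC": "Incorporated",
    "ST": "Street",
    "USA": "United States of America",
}

def expand_abbreviations(text: str) -> str:
    """Replace whole-word abbreviations (case-insensitive) with their expansions."""
    def replace(match):
        word = match.group(0)
        # Words that are not in the table are returned unchanged.
        return ABBREVIATIONS.get(word.upper(), word)
    # \b keeps the pattern from matching inside tokens such as "1st".
    return re.sub(r"\b[A-Za-z]+\b", replace, text)

expanded = expand_abbreviations("Acme Inc, 12 Main St, USA")
```

A fixed table like this only works when each abbreviation has one agreed meaning, which is why the card says to define them clearly.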

Data should be collected at a minimum number of places in the organization.

Separation of Duties

Data cleansing has a __ process, but that process is __.

definitive; flexible

All processes include first finding __, and then correcting them.

errors

We cleanse data to get __ data.

high quality

A data dictionary (data field names, types, ranges, constraints, defaults) should be available to anyone in the organization who collects or works with data. Also document: who is responsible for which data, where the data is collected, and what the data collection process is.

Documentation

Data can be examined for errors more quickly if it is reasonably organized. Differences stand out among similar data.

Organization

Divide data according to tokens (groups of characters that have meaning). Look for patterns: for example, two spaces in the name field suggest dividing the field into first, middle, and last names.

Parsing
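
The name-field example can be sketched in Python as follows. The `parse_name` function and its output keys are assumptions made for illustration; names that do not fit the two- or three-token pattern are left unparsed for manual review rather than guessed at.

```python
def parse_name(full_name: str) -> dict:
    """Split a name field into first, middle, and last tokens."""
    tokens = full_name.split()  # split on runs of whitespace
    if len(tokens) == 3:
        # Two separators found: treat as first, middle, last.
        first, middle, last = tokens
    elif len(tokens) == 2:
        # One separator found: no middle name.
        first, last = tokens
        middle = ""
    else:
        # Pattern not recognized: keep the raw value for review.
        return {"first": "", "middle": "", "last": "", "unparsed": full_name.strip()}
    return {"first": first, "middle": middle, "last": last, "unparsed": ""}

parsed = parse_name("Mary Ann Smith")
```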

What data cleansing guideline has these characteristics? Data management is a process that must be guided from start to end. Understanding the organization's needs and the data that supports those needs is the place to start. Data structures and data collection should be controlled to best facilitate the needs of the organization.

Planning

An example of this would be to use legal names instead of nicknames when filling out information.

Standardization

A general definition of __ is the assessment of data to determine quality failures (inaccuracy, incompleteness, etc.) and then improving the quality by correcting, where possible, any errors found.

data cleansing

What are the steps of a common data cleansing framework?

1. Define and determine error types
2. Search and identify error instances
3. Correct errors
4. Document error instances and error types
5. Modify data entry procedures to reduce future errors

What are the steps of the Data Quality Improvement Cycle?

1. Import data
2. Merge data sets
3. Rebuild missing data
4. Standardize data
5. Normalize data
6. De-duplicate
7. Verify and enrich
8. Export data

What are the 5 assumptions of linear regression?

1. Linear relationship
2. Multivariate normality
3. No or little multicollinearity
4. No auto-correlation
5. Homoscedasticity

__% - __% of operating expenses are wasted because of data quality issues.

15% - 45%

$__ is wasted by the US government annually because of bad data.

3 trillion

$__ is wasted in the US annually because of poorly targeted marketing campaigns.

611 billion

Change data values that are not recognized (misspellings).

Correction
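
One possible way to correct misspellings is to compare each value against a list of recognized values. The sketch below uses `difflib.get_close_matches` from the Python standard library; the `VALID_STATES` list, the `correct_value` function, and the 0.8 similarity cutoff are illustrative assumptions, not part of the course material.

```python
import difflib

# Hypothetical list of recognized values for one field.
VALID_STATES = ["California", "Colorado", "Connecticut"]

def correct_value(value, valid_values, cutoff=0.8):
    """Return the closest recognized value, or None if nothing is close enough."""
    if value in valid_values:
        return value  # already recognized, nothing to correct
    matches = difflib.get_close_matches(value, valid_values, n=1, cutoff=cutoff)
    # None signals that the value needs manual review.
    return matches[0] if matches else None

corrected = correct_value("Califronia", VALID_STATES)
```

Returning `None` instead of forcing a match keeps unrecognized values visible so that they can be documented as error instances.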

__ is a process that must be guided from start to finish.

Data management

__ and __ should be controlled to best facilitate the needs of the organization.

Data structures and data collection

What data cleansing guideline has these characteristics? Everyone in the process should feel responsible for ensuring data quality. A good understanding of why data quality is important, and of ways to manage it, is critical.

Education

Use a sorting method to find duplicates; the problem is finding non-exact duplicates.

Eliminate duplicate records
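
A minimal Python sketch of the sorting method: once records are sorted, duplicates sit next to each other and can be dropped in one pass. The `normalize` and `drop_sorted_duplicates` names are hypothetical, and the light normalization (case and spacing) is an added assumption.

```python
def normalize(record: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(record.lower().split())

def drop_sorted_duplicates(records):
    """Sort records, then keep only the first of each run of equal neighbors."""
    ordered = sorted(records, key=normalize)
    unique = []
    for record in ordered:
        # After sorting, a duplicate can only be adjacent to its twin.
        if not unique or normalize(record) != normalize(unique[-1]):
            unique.append(record)
    return unique

names = ["Ann Lee", "ann  lee", "Bob Ray", "Anne Lee"]
deduplicated = drop_sorted_duplicates(names)
```

In this example "Anne Lee" survives even though it may be the same person as "Ann Lee", which is the non-exact duplicate problem the card mentions.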

It can be very difficult to change data, so quality data collection methods are necessary.

Prevention

What are these examples of? Dropdown boxes to minimize text entries, ranges, data types.

Prevention
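
The three prevention examples (restricted choices, ranges, data types) can be sketched as entry-time checks in Python. The field names, the `ALLOWED_STATES` set, and the 0 to 120 age range are hypothetical values chosen for illustration.

```python
# Hypothetical dropdown choices for a "state" field.
ALLOWED_STATES = {"CA", "CO", "CT"}

def validate_entry(entry: dict) -> list:
    """Return a list of error messages; an empty list means the entry is accepted."""
    errors = []
    # Dropdown check: only predefined choices are allowed.
    if entry.get("state") not in ALLOWED_STATES:
        errors.append("state must be one of the dropdown choices")
    age = entry.get("age")
    # Data type check (bool is excluded because it is a subclass of int).
    if not isinstance(age, int) or isinstance(age, bool):
        errors.append("age must be an integer")
    # Range check, applied only when the type is valid.
    elif not 0 <= age <= 120:
        errors.append("age must be between 0 and 120")
    return errors

problems = validate_entry({"state": "ZZ", "age": "30"})
```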

Fill fields that are missing data if reasonable; missing values may be caused by errors in the original data.

Update missing fields
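
As a rough Python sketch of filling a missing field only when it is reasonable to do so, the example below derives a missing city from a ZIP code. The `ZIP_TO_CITY` table and `fill_missing_city` function are hypothetical; when no reliable source exists, the field is left empty.

```python
# Hypothetical reference table used to rebuild a missing value.
ZIP_TO_CITY = {"80202": "Denver", "94103": "San Francisco"}

def fill_missing_city(record: dict) -> dict:
    """Return a copy of the record with the city filled in when the ZIP code is known."""
    filled = dict(record)  # copy so the original record is not modified
    if not filled.get("city"):
        city = ZIP_TO_CITY.get(filled.get("zip", ""))
        # Fill only when the lookup succeeds; otherwise leave the gap.
        if city is not None:
            filled["city"] = city
    return filled

completed = fill_missing_city({"zip": "80202", "city": ""})
```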

