Big Data: Data Cleansing
Not all organizations see data in the same way, so not all organizations __ in the same way.
clean
Clearly define: INC for incorporation, ST for street, USA for United States of America.
Abbreviation expansion
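A minimal sketch of abbreviation expansion, assuming the organization keeps one agreed lookup table. The `ABBREVIATIONS` table and the `expand_abbreviations` function are hypothetical names used only for illustration; the mappings mirror the card above.

```python
import re

# Hypothetical lookup table; in practice it would come from the
# organization's agreed definitions (for example, its data dictionary).
ABBREVIATIONS = {
    "INC": "Incorporation",
    "ST": "Street",
    "USA": "United States of America",
}

def expand_abbreviations(text: str) -> str:
    """Replace whole-word abbreviations, with or without a trailing period."""
    def replace(match: re.Match) -> str:
        word = match.group(1)
        # Unknown words are returned exactly as they were matched.
        return ABBREVIATIONS.get(word.upper(), match.group(0))
    return re.sub(r"\b([A-Za-z]+)\b\.?", replace, text)

print(expand_abbreviations("123 Main ST, USA"))
```

The table has to be defined clearly because an abbreviation can be ambiguous: ST could mean "street" in one field and "saint" in another.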
Data should be collected at a minimum number of places in the organization.
Separation of Duties
Data cleansing has a __ process, but that process is __.
definitive; flexible
All processes include first finding __, and then correcting them.
errors
We cleanse data to get __ data.
high quality
A data dictionary (data field names, types, ranges, constraints, defaults) should be available to anyone in the organization who collects or works with data. It should also record who is responsible for which data, where the data is collected, and what the data collection process is.
Documentation
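One possible shape for a data dictionary entry, as a sketch only. The `FieldDefinition` class, the `customer_age` field, and every value in it are invented for illustration; they are not taken from any real system.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class FieldDefinition:
    name: str                                # data field name
    data_type: str                           # type of the stored value
    allowed_range: Optional[Tuple[int, int]] # inclusive range, if any
    constraints: str                         # e.g. "required"
    default: Optional[str]                   # default value, if any
    owner: str                               # who is responsible for the data
    collected_at: str                        # where the data is collected

# Hypothetical dictionary with a single illustrative entry.
DATA_DICTIONARY = {
    "customer_age": FieldDefinition(
        name="customer_age",
        data_type="integer",
        allowed_range=(0, 120),
        constraints="required",
        default=None,
        owner="Customer Service",
        collected_at="online signup form",
    ),
}

def describe(field_name: str) -> str:
    """Return a one-line summary of a field for anyone who works with it."""
    d = DATA_DICTIONARY[field_name]
    return f"{d.name}: {d.data_type}, range {d.allowed_range}, owner {d.owner}"

print(describe("customer_age"))
```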
Data can be examined for errors more quickly if it is reasonably organized. Differences stand out among similar data.
Organization
Divide data into tokens (groups of characters that have meaning). Look for patterns: for example, two spaces in the name field suggest dividing the field into first, middle, and last names.
Parsing
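The name example on this card can be sketched as follows. The `parse_name` function is a hypothetical helper, and the rule that three tokens mean first, middle, and last name is a simplifying assumption; real names often break it.

```python
def parse_name(full_name: str) -> dict:
    """Split a name field into tokens and assign them by position."""
    tokens = full_name.split()  # whitespace-separated tokens
    if len(tokens) == 3:
        first, middle, last = tokens
    elif len(tokens) == 2:
        first, last = tokens
        middle = ""
    else:
        # Pattern not recognized: keep the raw value for manual review.
        return {"first": "", "middle": "", "last": "", "unparsed": full_name}
    return {"first": first, "middle": middle, "last": last, "unparsed": ""}

print(parse_name("Mary Ann Smith"))
```

Values that do not fit a known pattern are kept unparsed rather than guessed at.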
What data cleansing guideline has these characteristics? Data management is a process that must be guided from start to end. Understanding the organization's needs and the data that supports those needs is the place to start. Data structures and data collection should be controlled to best facilitate the needs of the organization.
Planning
An example of this would be to use legal names instead of nicknames when filling out information.
Standardization
A general definition of __ is the assessment of data to determine quality failures (inaccuracy, incompleteness, etc.) and then improving the quality by correcting, where possible, any errors found.
data cleansing
What are the steps of a common data cleansing framework?
1. Define and determine error types 2. Search and identify error instances 3. Correct errors 4. Document error instances and error types 5. Modify data entry procedures to reduce future errors.
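The five framework steps can be sketched on a toy data set. Everything here is an illustrative assumption: the records, the two error types, and the `cleanse` function are invented, and step 5 is represented only by a count that could inform changes to data entry.

```python
from collections import Counter

# Step 1: define error types as (detector, corrector) pairs.
def is_lowercase_state(record):
    return record["state"] != record["state"].upper()

def fix_lowercase_state(record):
    return {**record, "state": record["state"].upper()}

def is_missing_state(record):
    return record["state"] == ""

ERROR_TYPES = {
    "lowercase_state": (is_lowercase_state, fix_lowercase_state),
    "missing_state": (is_missing_state, None),  # no automatic correction
}

def cleanse(records):
    cleaned, error_log = [], []
    for record in records:
        for error_name, (detect, correct) in ERROR_TYPES.items():
            if detect(record):                                # Step 2: identify
                error_log.append((record["id"], error_name))  # Step 4: document
                if correct is not None:
                    record = correct(record)                  # Step 3: correct
        cleaned.append(record)
    return cleaned, error_log

records = [
    {"id": 1, "state": "ny"},
    {"id": 2, "state": "CA"},
    {"id": 3, "state": ""},
]
cleaned, error_log = cleanse(records)

# Step 5: error counts by type show which entry procedures to revise first.
error_counts = Counter(name for _, name in error_log)
print(cleaned, error_log, error_counts)
```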
What are the steps of the Data Quality Improvement Cycle?
1. Import data 2. Merge data sets 3. Rebuild missing data 4. Standardize data 5. Normalize data 6. De-duplicate 7. Verify and enrich 8. Export data
What are the 5 assumptions of linear regression?
1. Linear relationship 2. Multivariate normality 3. No or little multicollinearity 4. No auto-correlation 5. Homoscedasticity
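As one illustration, the "no auto-correlation" assumption is often checked with the Durbin-Watson statistic on the regression residuals. The sketch below computes it directly from its definition; the `durbin_watson` function name and the sample residuals are assumptions made for the example.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squares.

    The statistic lies between 0 and 4; values near 2 suggest little
    first-order autocorrelation in the residuals.
    """
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum((cur - prev) ** 2
                    for prev, cur in zip(residuals, residuals[1:]))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator  # fails if every residual is zero

# Residuals that flip sign every step are negatively autocorrelated.
print(durbin_watson([1, -1, 1, -1]))
```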
__% - __% of operating expenses are wasted because of data quality issues.
15% - 45%
$__ is wasted by the US government annually because of bad data.
3 trillion
$__ is wasted in the US annually because of poorly targeted marketing campaigns.
611 billion
Change data values that are not recognized (for example, misspellings).
Correction
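A minimal sketch of correcting unrecognized values by matching them against a list of valid ones, using `difflib.get_close_matches` from the Python standard library. The `VALID_STATES` list, the `correct_value` helper, and the 0.8 cutoff are assumptions chosen for the example.

```python
import difflib

# Hypothetical list of recognized values for one field.
VALID_STATES = ["California", "Colorado", "Connecticut"]

def correct_value(value, valid_values=VALID_STATES, cutoff=0.8):
    """Return the closest valid value, or None if nothing is close enough."""
    if value in valid_values:
        return value
    matches = difflib.get_close_matches(value, valid_values, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(correct_value("Californa"))
```

A `None` result marks a value for manual review instead of forcing a guess.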
__ is a process that must be guided from start to finish.
Data management
__ and __ should be controlled to best facilitate the needs of the organization.
Data structures and data collection
What data cleansing guideline has these characteristics? Everyone in the process should feel responsible for ensuring data quality. A good understanding of why data quality is important, and of ways to manage it, is critical.
Education
Sort the data so that duplicates end up next to each other; the difficulty is finding duplicates that are not exact matches.
Eliminate duplicate records
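The sorting approach can be sketched as below. The `normalize` and `find_duplicates` helpers and the sample records are hypothetical; normalization here covers only letter case and extra spaces.

```python
def normalize(record):
    """Lowercase each field and collapse repeated whitespace."""
    return tuple(" ".join(str(value).lower().split()) for value in record)

def find_duplicates(records):
    """Sort by normalized key, then compare each record with its neighbor."""
    ordered = sorted(records, key=normalize)
    duplicates = []
    for previous, current in zip(ordered, ordered[1:]):
        if normalize(previous) == normalize(current):
            duplicates.append((previous, current))
    return duplicates

records = [
    ("Ann Lee", "NY"),
    ("Bob Ray", "CA"),
    ("ann  lee", "ny"),
    ("Anne Lee", "NY"),
]
print(find_duplicates(records))
```

"Anne Lee" is not flagged, which shows the problem with non-exact duplicates: catching it would need fuzzy matching or a human decision.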
It can be very difficult to change data after it has been collected, so quality data collection methods are necessary.
Prevention
What are these examples of? Dropdown boxes to minimize text entries, ranges, data types.
Prevention
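A sketch of the three prevention checks applied at data entry: a fixed option list (what a dropdown box enforces), a data type, and a range. The field names, the `ALLOWED_STATES` set, and the 0 to 120 age range are assumptions made for illustration.

```python
# Hypothetical option list, equivalent to the choices in a dropdown box.
ALLOWED_STATES = {"CA", "NY", "TX"}

def validate_entry(entry: dict) -> list:
    """Return a list of problems; an empty list means the entry is accepted."""
    errors = []
    if entry.get("state") not in ALLOWED_STATES:
        errors.append("state must be one of the listed options")
    age = entry.get("age")
    # Data type check (bool is excluded because it is a subclass of int).
    if not isinstance(age, int) or isinstance(age, bool):
        errors.append("age must be an integer")
    elif not 0 <= age <= 120:  # range check
        errors.append("age must be between 0 and 120")
    return errors

print(validate_entry({"state": "California", "age": "41"}))
```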
Fill in fields that are missing data if it is reasonable to do so; the gaps may be caused by errors in the original data.
Update missing fields
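One reasonable case for filling a missing field is when another field determines it. The sketch below fills an empty city from a ZIP code lookup; the `ZIP_TO_CITY` table and `fill_missing_city` function are hypothetical, and a real reference table would be far larger.

```python
# Small illustrative reference table.
ZIP_TO_CITY = {"10001": "New York", "94105": "San Francisco"}

def fill_missing_city(record: dict) -> dict:
    """Return a copy with the city filled in when the ZIP code is known."""
    filled = dict(record)
    if not filled.get("city"):
        city = ZIP_TO_CITY.get(filled.get("zip"))
        if city is not None:
            filled["city"] = city
    return filled

print(fill_missing_city({"zip": "10001", "city": ""}))
```

When the lookup finds nothing, the field stays empty, since an invented value would lower data quality rather than raise it.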