5.2 Extract, Transform, and Load Relevant Data
Important considerations when loading data:
1. Ensure the transformed data is stored in a format and structure acceptable to the receiving software 2. Understand how the new program will interpret data formats
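The second consideration can be illustrated with a small sketch. It assumes a hypothetical record with a ZIP code, an order date, and an amount, and a receiving program that accepts CSV; none of these details come from the chapter. The point is only that values should be written in an unambiguous form, because the receiving program may reinterpret them (for example, by reading a ZIP code as a number).

```python
import csv
import io
from datetime import date

# Hypothetical transformed record, ready to load.
rows = [{"zip": "02134", "order_date": date(2024, 3, 5), "amount": 1234.5}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["zip", "order_date", "amount"])
writer.writeheader()
for r in rows:
    writer.writerow({
        "zip": r["zip"],                           # keep as text to preserve the leading zero
        "order_date": r["order_date"].isoformat(),  # ISO 8601 avoids day/month ambiguity
        "amount": f"{r['amount']:.2f}",            # fixed two-decimal format
    })

loaded_line = buf.getvalue().splitlines()[1]
print(loaded_line)

# If the receiving program interprets the ZIP code as a number,
# the leading zero is lost.
zip_as_number = int("02134")
print(zip_as_number)
```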
Four-step transformation process:
1. Understand the data and the desired outcome 2. Standardize, structure, and clean the data 3. Validate data quality and verify data meets data requirements 4. Document the transformation process
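As a rough illustration of steps 2 and 3, the sketch below standardizes and cleans two made-up contact records and then validates them. The field names, the sample values, and the 10-digit phone rule are assumptions chosen for the example, not requirements from the chapter.

```python
import re

def transform(records):
    """Standardize and clean raw records, then flag those that fail validation."""
    cleaned = []
    for rec in records:
        # Step 2: standardize whitespace and capitalization, strip non-digits.
        name = " ".join(rec["name"].split()).title()
        phone = re.sub(r"\D", "", rec["phone"])
        cleaned.append({"name": name, "phone": phone})
    # Step 3: validate against an assumed rule (phone numbers have 10 digits).
    invalid = [r for r in cleaned if len(r["phone"]) != 10]
    return cleaned, invalid

raw = [
    {"name": "  ada   lovelace ", "phone": "(555) 010-1234"},
    {"name": "GRACE HOPPER", "phone": "555-0199"},
]
cleaned, invalid = transform(raw)
print(cleaned)
print(invalid)
```

The docstring and comments stand in for step 4; in practice the documentation would also record why each rule was applied.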
Steps for extracting data:
1. Understand data needs and the data available 2. Perform the data extraction 3. Verify the data extraction quality and document what you have done
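The three extraction steps might look like the following sketch. It assumes a hypothetical `sales` table in an in-memory SQLite database; the table, its columns, and the checks (row count and control total) are illustrative choices, not the chapter's prescribed method.

```python
import sqlite3

# Hypothetical source system: an in-memory table of sales.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "East", 100.0), (2, "West", 250.5), (3, "East", 75.25)],
)

# Step 1: decide which of the available fields the analysis needs.
needed_columns = ["id", "amount"]

# Step 2: perform the extraction.
rows = conn.execute("SELECT id, amount FROM sales ORDER BY id").fetchall()

# Step 3: verify the extract against the source, then document it.
source_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
source_total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
extract_total = sum(amount for _, amount in rows)
checks_pass = len(rows) == source_count and abs(extract_total - source_total) < 1e-9

extraction_log = {
    "source": "sales",
    "columns": needed_columns,
    "rows_extracted": len(rows),
    "control_total": extract_total,
    "checks_pass": checks_pass,
}
print(extraction_log)
```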
delimiter
A character, or series of characters, that marks the end of one field and the beginning of the next field
ETL Process
A set of procedures for blending data. The acronym stands for extract, transform, and load data
Check each item listed below that is part of the process for transforming data.
A. Validate data quality and verify data meets data requirements B. Standardize, structure, and clean the data C. Document the transformation process D. Understand the data and the desired outcome
Check each example of structured data in the list below
C. Phone numbers of employees saved in a database D. Customer addresses saved in a customer relation database
What is the fourth step of the transformation process?
Document the transformation process
What do the letters in the acronym ETL stand for?
Extract, Transform, and Load
What is the first step in the ETL process?
Extracting data
A data owner sends you an e-mail with a file to prepare for analysis. The file contains data from multiple database tables all merged into a single file. There are multiple fields in the file, each separated by a "~" symbol. For fields that contain large amounts of text, the file contains a "+" at the beginning and end of the text field. Indicate which of the following best describes (1) the type of file the data owner sent, (2) what the "+" is called, and (3) what the "~" is called.
Flat file, text qualifier, delimiter
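A file like the one in this question could be read with Python's standard `csv` module by treating "~" as the delimiter and "+" as the text qualifier (which `csv` calls the quote character). The sample rows below are invented for illustration.

```python
import csv
import io

# Invented flat file: "~" separates fields, "+" wraps long text fields.
sample = (
    "id~customer~note\n"
    "101~Acme~+Paid late ~ see note+\n"
    "102~Globex~+On time+\n"
)

reader = csv.reader(io.StringIO(sample), delimiter="~", quotechar="+")
rows = list(reader)
for row in rows:
    print(row)
```

Because the note for customer 101 is wrapped in the text qualifier, the "~" inside it is kept as part of the text rather than treated as a field boundary.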
What is the second step of the transformation process?
Standardize, structure, and clean the data
Chunhua has been building financial forecasting models for the company for several years. For each model, she saves all the data that could possibly be used in the model, even if she doesn't use all the data in her finished model. She does not document anything about the different items she has saved. When her intern, Minsuh, pulls the data, she cannot understand what all the fields mean. How would Minsuh most accurately describe the data?
The data has become a data swamp
What is the first step of the transformation process?
Understand the data and the desired outcome
What is the third step of the transformation process?
Validate data quality and verify data meets data requirements
data lake
collection of structured, semi-structured, and unstructured data stored in a single location
data swamps
data repositories that are not accurately documented so that the stored data cannot be properly identified and analyzed
data marts
data repositories that hold structured data for a subset of an organization
Examples of Semi-Structured Data
data stored in csv, xml, or json format
metadata
data that describes other data
unstructured data
data that has no uniform structure
structured data
data that is highly organized and fits into fixed fields
defining the question well makes it ___
easier to define what data is needed to address the question
Example of structured data
general ledger and data in a relational database
examples of unstructured data
images, video, documents
dark data
information the organization has collected and stored that would be useful for analysis but is not analyzed and is thus generally ignored
semi-structured data
data that is organized in some ways but is not organized enough to be inserted directly into a relational database
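One way to picture this is a JSON document whose records do not all carry the same fields. The sketch below uses an invented customer document and flattens it into uniform rows of the kind a relational table expects; the field names are assumptions for the example.

```python
import json

# Invented semi-structured document: the second order has an extra "coupon" field.
doc = json.loads("""
{
  "customer": "Acme",
  "orders": [
    {"id": 1, "total": 20.0},
    {"id": 2, "total": 35.5, "coupon": "SPRING"}
  ]
}
""")

# Flatten into rows with a fixed set of columns, filling gaps with None.
rows = [
    {
        "customer": doc["customer"],
        "order_id": order["id"],
        "total": order["total"],
        "coupon": order.get("coupon"),
    }
    for order in doc["orders"]
]
for row in rows:
    print(row)
```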
flat file
text file that contains data from multiple tables or sources and merges that data into a single row
data owner
the person or function in the organization who is accountable for the data and can give permission to access and analyze the data
text qualifier
two characters that indicate the beginning and ending of a field and tell the program to ignore any delimiters contained between the characters
without defining the data well early in the process, it is more likely that
wrong data or incomplete data will be extracted