MODULE 1

What is the name of the critical ETL component? Ability to deliver data to other applications, processes or databases in various forms, with capabilities for batch, real-time or event-triggered process scheduling.

Data delivery capabilities

What is the transformation process? Involves adding new information to the raw data already collected

Data enriching

What is the name of the critical ETL component? (data quality, profiling and mining)

Data management capabilities

What is the name of the critical ETL component? Scalability

As your company grows, so will your data volume. All components of an ETL process should scale to support arbitrarily large throughput

What is the transformation process? Organizing tuples on the basis of some attribute.

Sorting
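
Sorting in code is nearly a one-liner; a minimal Python sketch, with made-up (customer_id, order_total) tuples:

```python
# Sort extracted tuples by one attribute; here order_total (index 1).
rows = [("c1", 250.0), ("c2", 99.5), ("c3", 410.0)]
rows_sorted = sorted(rows, key=lambda row: row[1], reverse=True)
print(rows_sorted)  # [('c3', 410.0), ('c1', 250.0), ('c2', 99.5)]
```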

What is the name of the critical ETL component? Accuracy

Data cannot be dropped or changed in a way that corrupts its meaning. Every data point should be auditable at every stage in your process.

Within the ETL (traditional process), data is extracted to the staging area (either in-memory data structures or temporary databases) before it is transfo_________________ and loaded into the target database for analytics. ELT (extract-load-transform) takes advantage of the new data warehousing technologies (e.g. BigQuery, Amazon Redshift, Snowflake...) by loa_________________ the raw data into a Data Warehouse or Data Lake first and transfo_______________________ the data on the fly when it is needed for analysis. E________________ is preferred for operations working with extremely large volumes of data or with real-time data

transformed loading transforming ELT

ETL processes involve "transfor_______________" data, which means source data is changed. Destination data are differ_______________ from source data.

transforming different

Another is the rapid shift to cloud-based SaaS applications that now house significant amounts of business-critical data in their own databases, accessible through different technologies such as A__________ and web_____________________.

API Webhooks

What is the name of the critical ETL component? Mainframes (IBM Z/OS), AS/400, HP Tandem, Unix, Wintel, Linux, Virtualized Servers, etc.

Adaptation to the different hardware platforms and operating systems

What is the term for the transform architecture: Because transforms go through the extracted data, they sometimes need to handle heavy loads. Algorithmic efficiency in the design of transforms can make a difference in the time required for a transform to execute, or whether it will time out your system. Take this simple example: implementing a dictionary lookup for a 1M-row transformation vs. a for loop results in a difference of a couple of orders of magnitude.

Algorithmic efficiency
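
A toy benchmark of this point, sketched in Python; the reference data and segment values are invented, but the 1M-row size mirrors the card's example:

```python
import random
import time

# Hypothetical reference data: 1M (customer_id, segment) pairs.
reference = [(i, f"segment-{i % 10}") for i in range(1_000_000)]
lookups = [random.randrange(1_000_000) for _ in range(100)]

# For-loop / linear scan: O(n) work per row being transformed.
start = time.perf_counter()
slow = [next(seg for cid, seg in reference if cid == key) for key in lookups]
print(f"list scan:   {time.perf_counter() - start:.3f}s")

# Dictionary: built once, then O(1) per lookup.
start = time.perf_counter()
index = dict(reference)
fast = [index[key] for key in lookups]
print(f"dict lookup: {time.perf_counter() - start:.3f}s")
assert slow == fast
```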

What is the name of the critical ETL component? Degree of compactness, consistency and interoperability of the different components that make up the data integration tool (with a desirable minimum number of products, a single repository, a common development environment, interoperability with other tools or via API), etc.

Architecture and integration

What is the term for the transform architecture: Transforms often implement business logic, such as calculating a customer's lifetime value or their consecutive orders. Your architecture needs to be designed so it can handle missing or corrupt data and the ordering of transforms, thus supporting business logic implementation.

Business logic

What is this transform challenge? As company operations evolve, business definitions change. Even small changes to business logic can have multiple effects on transforms, especially if the change in one transform affects others which depend on it.

Changing business logic.

What is this transform challenge? APIs can change their response payloads, data can become corrupted, or your system might migrate to a new SaaS... so you need to implement a different transform logic. In addition to the decoupling issues, changing source data requires constant monitoring and maintenance of the transform stage.

Changing source data.

What is the transformation process? Filling NULL values with default values, mapping "U.S.A", "United States" and "America" to "USA", converting data types to standard forms, etc.

Cleaning
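
A minimal cleaning pass sketched in Python; the field names, defaults and country mapping table are hypothetical:

```python
# Normalize country spellings, fill NULLs with defaults, coerce types.
COUNTRY_MAP = {"U.S.A": "USA", "United States": "USA", "America": "USA"}

def clean_row(row: dict) -> dict:
    country = row.get("country")
    return {
        "name": row.get("name") or "unknown",               # default for NULL
        "country": COUNTRY_MAP.get(country, country),       # common spelling
        "order_total": float(row.get("order_total") or 0),  # standard type
    }

raw = {"name": None, "country": "United States", "order_total": "19.90"}
print(clean_row(raw))
# {'name': 'unknown', 'country': 'USA', 'order_total': 19.9}
```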

What type of ETL tool? Tools from cloud providers such as Google, Microsoft or Amazon, which offer their own ETL services in the cloud. Examples: Amazon AWS Glue, Microsoft Azure Data Factory, Google Cloud Dataflow, Amazon AWS EMR.

Cloud Service

What is this transform challenge? Sometimes, implementing something trivial from a business perspective can be challenging from an engineering perspective.

Complex business logic.

What is the name of the critical ETL component? (with support for data sources and destinations): ability to connect to a wide range of data structure types, including relational and non-relational databases, various file formats, XML, ERP, CRM or SCM applications, standard message formats (EDI, SWIFT or HL7), message queues, emails, websites, content repositories or office automation tools. Handling of multiple source formats: To pull in data from diverse sources such as Salesforce's API, your back-end financials application, and databases such as MySQL and MongoDB, your process needs to be able to handle a variety of data formats.

Connectivity capabilities

What type of ETL tool? Companies that develop their own tools in order to have greater flexibility. Examples of languages used for this programming: PL/SQL, T-SQL, Java, .NET, Python, etc.

Custom ETL

What is the challenge called? Depending on how fast you need data to make decisions, the extraction process can be run at lower or higher frequencies. The tradeoff is between stale or late data at lower frequencies vs. the higher computational resources needed at higher frequencies.

Data Latency

What is the challenge called? The volume of data extraction affects system design. The solutions for low-volume data do not scale well as data quantity increases. With large amounts of data, you need to implement parallel extraction solutions, which are complex and difficult to maintain from an engineering perspective.

Data Volume

What is the load challenge? Suspicious data is sometimes formatted in such a way that it circumvents all of your data validation at extraction and transformation. As a result, you need additional data quality monitoring to assure data quality in your database or data warehouse.

Data quality

What is the name of the critical ETL component? Ability to transform data, from basic transformations (type conversion, string manipulation or simple calculations), intermediate transformations (aggregations, summarizations, lookups) to complex transformations such as free text analysis or rich text.

Data transformation capabilities

What is the challenge called? Either you validate data at extraction (before pushing it down the ETL pipeline), or at the transformation stage. When validating data at extraction, check for missing data (e.g. are some fields empty, even though they should have returned data?) and corrupted data (e.g. are some returned values nonsensical, such as a Facebook ad having -3 clicks?).

Data validation.
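
A sketch of validation at extraction in Python, reusing the card's Facebook-ad example; the field names and rules are hypothetical:

```python
# Reject rows with missing required fields or nonsensical values
# before they are pushed down the pipeline.
REQUIRED = ("ad_id", "clicks", "spend")

def validate(row: dict) -> list[str]:
    errors = [f"missing field: {f}" for f in REQUIRED if row.get(f) is None]
    if row.get("clicks") is not None and row["clicks"] < 0:
        errors.append(f"corrupted value: clicks={row['clicks']}")
    return errors

print(validate({"ad_id": "fb-42", "clicks": -3, "spend": 12.5}))
# ['corrupted value: clicks=-3']
```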

What is the name of the critical ETL component? Graphical representation of repository objects, data models and data flows, test and debugging support, teamwork capabilities, workflow management of development processes, etc.

Design capabilities and development environment

What is the challenge called? Working with different data sources causes problems with overhead and management. The variety of sources increases the data management surface by increasing the demands for monitoring, orchestration and error fixes.

Disparate sources

Flexible pipelines. Lastly, and perhaps most importantly, modern ETL might actually be a pipeline that supports a combination of ETL and E_______________ (Extract, Load, and Transform — the data is loaded to the target data warehouse and transformed afterwards).

ELT

What type of ETL tool? Used by larger companies; higher cost compared to the other available options. Examples: Oracle Data Integrator, SAP Data Services, IBM Infosphere DataStage, SAS Data Manager, Microsoft SQL Server Integration Services - SSIS.

Enterprise

What is the challenge called? Have there been any errors which have caused missing or corrupted data?

Errors

What is the transformation process? Loading only certain attributes into the target system.

Filtering
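
A minimal Python sketch of filtering as a projection; the attribute names are hypothetical:

```python
# Keep only the attributes the target system needs.
KEEP = ("customer_id", "email")

def filter_row(row: dict) -> dict:
    return {k: row[k] for k in KEEP if k in row}

print(filter_row({"customer_id": 7, "email": "a@b.com", "internal_notes": "x"}))
# {'customer_id': 7, 'email': 'a@b.com'}
```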

What type of load? Main: dumps all data into the database at once. Pro: low implementation effort. Con: requires more time and resources.

Full Load

Which extraction? Pro: data is guaranteed to be fresh. Cons: extremely computationally expensive; does not scale well. Best for: the first extraction cycle, short-lived data, and sources with small amounts of data.

Full-extraction

What type of load? Main: inserts data into the database at regular intervals. Pros: less time and fewer resources needed; good for high data volumes. Con: medium level of expertise needed.

Incremental Batch Load

What type of load? Main: inserts data into the database when new data emerges or old data is updated. Pros: less time and fewer resources needed; good for low data volumes. Con: high level of expertise needed.

Incremental Stream Load

Which extraction? Pro: good balance between data freshness and computational resources. Con: deleted records from the source could be missed. Best for: mission-critical data.

Incremental extraction

What is the transformation process? Combining multiple attributes into one; for example, recoding different versions of the same data to a common denominator, such as "M", 1, "male" and "masculine" to "Male".

Joining
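
The card's own recoding example, sketched in Python (the mapping table is illustrative):

```python
# Recode different encodings of the same value to a common denominator.
GENDER_MAP = {"M": "Male", 1: "Male", "male": "Male", "masculine": "Male"}

values = ["M", 1, "male", "masculine"]
print([GENDER_MAP.get(v, v) for v in values])
# ['Male', 'Male', 'Male', 'Male']
```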

What is this transform challenge? Oftentimes, it becomes clear that there is a lack of business logic given the data we receive from the extract phase. As an example: the business rule for determining a new customer is the date of their first product purchase. But what do we do for customers who paid for shipping, but not for a product?

Lack of business logic.

What is the name of the critical ETL component? Some decisions need to be made in real time, so data freshness is critical. While there will be latency constraints imposed by particular source data integrations, data should flow through your ETL process with as little latency as possible.

Low latency

What is the name of the critical ETL component? Recovery of data models from data sources or applications, creation and maintenance of data models, mapping from physical model to logical model, open metadata repository (with the possibility of interacting with other tools), synchronization of metadata changes in the different components of the tool, documentation, etc.

Metadata capabilities and data modeling

What is the challange called? You need to monitor your extraction system on several levels

Monitoring

What type of ETL tool? Free open source tools for all users. Examples: Pentaho Data Integration, Talend Open Studio.

Open Source

What is the name of the critical ETL component? Skills for management, monitoring and control of data integration processes, such as error management, collection of execution statistics, security controls, etc. Auditing and logging: You need detailed logging within the ETL pipeline to ensure that data can be audited after it's loaded and that errors can be debugged.

Operations and administration capabilities

What is the challenge called? Based on your choices of data latency, volume, source limits and data quality (validation), you need to orchestrate your extraction scripts to run at specified times or triggers. This can become complex if you implement a mixed model of architectural design choices (which people often do in order to accommodate different business cases of data use).

Orchestration

What is the load challenge? The order of insertion can affect the end result. If a table has a foreign key constraint, it might prevent you from inserting data into that table (and would probably skip it), unless you first insert matching data in another table.

Order of insertion.
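
A runnable illustration with SQLite's in-memory engine; the customers/orders tables are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only if asked
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY)")
conn.execute("""CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id))""")

try:
    # Fails: the parent row for customer 1 has not been loaded yet.
    conn.execute("INSERT INTO orders VALUES (10, 1)")
except sqlite3.IntegrityError as exc:
    print("load rejected:", exc)

conn.execute("INSERT INTO customers VALUES (1)")   # parent first...
conn.execute("INSERT INTO orders VALUES (10, 1)")  # ...then child succeeds
```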

What is the term for the transform architecture: The order in which transform rules are applied to incoming data can affect the end result. For instance, imagine we have two transform scripts. The first one processes data to compute the consecutive number of purchases made by a customer. The second transformation process drops purchase information from the data pipeline unless there is a shipping address. If we drop the row for a customer with a missing shipping address before we calculate their consecutive purchases, we get a different purchase count than if we calculate first and drop afterwards.

Order of operations
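
The card's two transforms, sketched in Python, showing how running order changes the result (the rows and field names are hypothetical):

```python
rows = [
    {"customer": "c1", "shipping_address": "12 Main St"},
    {"customer": "c1", "shipping_address": None},
]

def count_purchases(rows_in):
    # Attach each customer's purchase count to every row.
    counts = {}
    for r in rows_in:
        counts[r["customer"]] = counts.get(r["customer"], 0) + 1
    return [dict(r, purchase_count=counts[r["customer"]]) for r in rows_in]

def drop_missing_address(rows_in):
    # Drop purchases that have no shipping address.
    return [r for r in rows_in if r["shipping_address"]]

print(drop_missing_address(count_purchases(rows))[0]["purchase_count"])  # 2
print(count_purchases(drop_missing_address(rows))[0]["purchase_count"])  # 1
```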

What is the term for the transform architecture: Transformations are often the place where data is validated against a set of criteria (e.g. do not import customer information unless we have their email) and monitored for data quality. At this stage, a lot of ETL processes are designed with alerts to notify developers of errors, as well as rules that prevent data from passing down the pipeline unless it matches certain criteria.

Quality assurance

Re_____________ time pipelines. In addition, modern ETL tools are designed to move data in re_______________ time and to allow for changes to the sche_____________ on the fly.

Real Schema

What is the challenge called? Have the extraction scripts run at all?

Reliability

What is the challenge called? How much computational power and memory is allocated?

Resources

What is this transform challenge? Transforms present challenges when the ETL processes evolve. The more transforms you implement, the harder it is to keep track of their mutual effects.

Scaling complexity.

What is the load challenge? The schema represents what the destination (database or data warehouse) expects the data to look like. As your business evolves, the schema is often updated to reflect changes in business operations. The resulting need for schema updates can lead to a waste of engineering hours, as well as unintended consequences for the entire system (e.g. data quality validations might break when the shape of the data changes).

Schema changes

What is the challenge called? You need to be aware of the source limitations when extracting data. For example, some sources (such as APIs and webhooks) have imposed limitations on how much data you can extract simultaneously. Your engineers need to work around these barriers to ensure system reliability.

Source limits

Which extraction? Pros: lower computational resources needed; data quickly available for analysis. Con: webhooks can fail, so it is not good for mission-critical data collection. Best for: data that changes periodically, or when speed of collection is more important than data quality.

Source-driven

What is the transformation process? Separating a single attribute into multiple attributes.

Splitting
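
A minimal Python sketch, assuming a hypothetical full_name attribute split into first and last name:

```python
def split_name(row: dict) -> dict:
    row = dict(row)  # avoid mutating the input row
    first, _, last = row.pop("full_name").partition(" ")
    return {**row, "first_name": first, "last_name": last}

print(split_name({"id": 1, "full_name": "Ada Lovelace"}))
# {'id': 1, 'first_name': 'Ada', 'last_name': 'Lovelace'}
```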

What is the name of the critical ETL component? Incremental loading allows you to update your analytics warehouse with new data without doing a full reload of the entire data set

Support for change data capture

However, when the destination is a clo___________-native data warehouse, E_________ is a better approach. Organizations can transform their raw data at any time, when and as necessary for their use case, and not as a step in the data pipeline. Cloud-based analytics databases have the horsepower to perform transformations in place rather than requiring a special staging area.

cloud ELT

ETL can also connect to Data Warehou______________, Data Ma____________, Data Hu_____________ or Data La_______________

Warehouse Mart Hub Lake

An ETL process helps companies create a support system for critical decision making and allows business managers to quickly acc______________ data in one place. In addition, it provides cle______________ and filtered data structures for exploitation by different end-user tools, increases data qua______________ and value, and enables decis______________ optimization.

access clean quality decision

A transformation can involve operations like data clean______________, filte______________, converting data types, formatting, enriching, applying lookups and calculations, masking, removing duplic______________, sorting and aggreg______________ (to name just a few).

cleansing filtering duplicates aggregating

An ETL process helps cons__________________ data from various overlapping systems acquired via mergers and/or acquisitions.

consolidate

3 possible approaches to implementing an extraction solution:
● Fu_________-extra________________. Each extra______________ collects all da__________ from the source and pushes it down the data pipe______________.
● Increm_____________ extraction. At each new cyc__________ of the extraction process (e.g. every time the ETL pipeline is run), only the ne__________ da________ is collected from the source, along with any data that has chan____________ since the last collection. For example, data collection via AP__________.
● Sour_________-driven extraction. The sou_________ notifies the ETL system that da_________ has changed, and the ETL pipeline is run to extract the chan____________ data. For example, data collection via web_______________.

full extraction data pipeline Incremental cycle new changed API Source data changed webhooks
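
A minimal sketch of the incremental approach in Python, assuming the source can be queried for rows changed since a timestamp; fetch_rows is a hypothetical stand-in for a real source query or API call:

```python
from datetime import datetime, timezone

# Cursor: when the last successful extraction ran.
cursor = datetime(2024, 1, 1, tzinfo=timezone.utc)

def fetch_rows(since: datetime) -> list[dict]:
    # Stand-in for e.g. SELECT * FROM orders WHERE updated_at > :since
    return []

def run_extraction_cycle() -> list[dict]:
    global cursor
    started_at = datetime.now(timezone.utc)
    new_or_changed = fetch_rows(since=cursor)
    cursor = started_at  # advance only after a successful fetch
    return new_or_changed
```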

The main conceptual difference is the final step of the process: in ETL, clean data is loa_____________ in the target destination store. In ELT, loading data happens before transformations - the final step is trans_________________ the data just before data analysis.

loaded transforming

ETLs are typically used for loa________________ data from one or more source systems to a single or mult_______________ target system (destin______________).

loading multiple destination

ETL processes are typically developed for lon___________-time deplo___________ and oper___________. The data pipeline becomes an integral part of the data ecosystem.

long deployment operation

When using traditional ETL tools, any changes to your plan may require the map______________ to be restruc_______________ and all the data to be reloa_________________.

mapping restructured reloaded

ETL provides a method of mov______________ the data from various sources into a data ware_____________. ETL process allows sample data compa____________ between the sou_____________ and the targ________________ system.

moving warehouse comparison source target

An ETL workflow will typically connect to one or more opera________________ data sources such as ERP, CRM or SCM.

operational

On-prem______________ or clo___________ data warehouses. Modern ETL tools are built to integrate with on-premise environments and cloud data warehouses — Amazon Redshift, Snowflake, Google BigQuery, Azure, or any number of other options.

premise cloud

Data extracted from source systems is ra____________ and not usable in its original form. Therefore, it needs to be cleansed, mapped, transformed and enriched so that insightful BI reports can be generated.

raw

Traditional ETL tools are well suited to working with relati_________________ databases, but often less geared toward unstruct________________ data. These tools are often designed to move data in batc___________, meaning that large volumes of data are moved at the same scheduled time, usually when network traffic is low (at night). This means that you likely wouldn't be able to perform ETL outside of the scheduled batches or perform any kind of real-time analysis.

relational unstructured batches

The transformation step of the ETL process is a set of rul_______ or func___________ applied to the extracted data to convert it into a sing___________ stan_____________ format as per requirements

rules functions single standard

Loading step involves taking data from the transform sta__________ and saving it to a target data sto______________ (relational database, NoSQL data store, data warehouse, data lake, etc), where it is ready for analysis.

stage store

ETL processes are usually run from a dedicated ETL serv__________ or managed environment in the clou_____________

server cloud

ETL (Extract, Transform and Load) is a data integration method where data from one or more sour__________ systems is first rea___________ (ext__________), then made to go through some chan_____________ (trans__________) and then the changed data is writt_____________ to a target system (lo__________).

source read extract changes transform written load

In this step of ETL architecture, data is extracted from the sou_______________ systems into the stag___________ area. Data sources can come from DB_________, Hardware, Operating Systems and Communication Protocols. Sources could include legacy applications like Mainframes, customized applications, Point of contact devices like ATM, Call switches, text files, spreadsheets, ERP, CRM, etc.

source staging DBMS

