DATA-331-Chapter 7 Study Guide

Ways to ingest data

- Direct database connection: data can be pulled from a database by querying and reading over a network connection (see the sketch after this list).
- CDC: ingesting changes from a source database system through batch or continuous change data capture.
- APIs
- Message queues and event-streaming platforms: ingest real-time data from web and mobile applications.
- Databases and file export
- Shell: an interface through which you can execute commands to ingest data.
- SFTP and SCP
- Webhooks
- Web scraping: extracts data from web pages.
- Data sharing: data providers offer datasets to third-party subscribers.
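
Below is a minimal sketch of the direct database connection pattern, using Python's standard-library sqlite3 driver as a stand-in for a networked database driver; the orders table and its columns are purely illustrative.

```python
import sqlite3

def pull_rows(conn: sqlite3.Connection, since: str) -> list[tuple]:
    """Query the source system and read the result set over the connection."""
    cur = conn.execute(
        "SELECT id, payload, updated_at FROM orders WHERE updated_at >= ?",
        (since,),
    )
    return cur.fetchall()

if __name__ == "__main__":
    # In-memory stand-in for a remote source database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, payload TEXT, updated_at TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "a", "2024-01-01"), (2, "b", "2024-02-01")],
    )
    print(pull_rows(conn, "2024-01-15"))  # -> [(2, 'b', '2024-02-01')]
```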

File-based export and ingestion

- File-based export is a push-based ingestion pattern because the data export and preparation work is done on the source system side.
- Data is quite often moved between databases and systems using files.
- Data is serialized into files in an exchangeable format, and these files are provided to an ingestion system (see the sketch below).
- Common file-exchange methods are object storage, secure file transfer protocol (SFTP), electronic data interchange (EDI), and secure copy (SCP).
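
As a rough illustration of the push side, the sketch below serializes rows to a JSONL file and drops it into object storage; it assumes boto3 is installed and credentials are configured, and the bucket and key names are hypothetical.

```python
import json

import boto3

def export_to_file(rows: list[dict], path: str) -> None:
    """Serialize rows into an exchangeable format (JSONL) on the source side."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

def push_to_object_storage(path: str, bucket: str, key: str) -> None:
    """Provide the exported file to the ingestion system via object storage."""
    boto3.client("s3").upload_file(path, bucket, key)

rows = [{"id": 1, "status": "new"}, {"id": 2, "status": "shipped"}]
export_to_file(rows, "orders_2024_01_01.jsonl")
push_to_object_storage(
    "orders_2024_01_01.jsonl",           # local export
    "example-ingest-bucket",             # hypothetical bucket
    "exports/orders_2024_01_01.jsonl",   # hypothetical key
)
```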

Message and Stream considerations

- Schema evolution
- Late-arriving data
- Ordering and multiple delivery
- Replay
- Time to live
- Message size
- Error handling and dead-letter queues
- Consumer pull and push
- Location

Ways to move data from a source to a destination

1. Serialization
2. Deserialization

Pull

A destination pulling data from a source; the target reads data directly from the source system.

Ingestion frequencies

Ingestion frequency runs on a continuum:
- Batch: least frequent
- Micro-batch: more frequent
- Real-time: most frequent, effectively continuous

Unbounded data

Data as it exists in reality, as events happen, either sporadically or continuously, ongoing and flowing.

Deserialization

Process of converting the serialized data back into its original format.
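
A quick round trip in Python makes the pairing concrete: the same record is serialized to bytes and then deserialized back to its original form (JSON is just one of the row-based formats listed later in this guide).

```python
import json

record = {"event_id": 42, "type": "page_view", "ts": "2024-01-01T00:00:00Z"}

wire_bytes = json.dumps(record).encode("utf-8")    # serialize for transmission/storage
restored = json.loads(wire_bytes.decode("utf-8"))  # deserialize back to the original format

assert restored == record
```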

Bounded data

A convenient way of bucketing data across some sort of boundary, such as time.

Asynchronous ingestion

With asynchronous ingestion, dependencies can operate at the level of individual events; each event becomes available in storage as soon as it is ingested.

Polling

Polling involves periodically checking a data source for any changes. When changes are detected, the destination pulls the data as it would in a regular pull scenario (see the sketch below).
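
The sketch below shows the shape of a polling loop; fetch_latest_version and pull_changes are hypothetical stand-ins for a cheap change check and a pull against a real source.

```python
import itertools
import time

# Simulated source: the "version" advances as new data lands.
_source_versions = itertools.count(start=1)

def fetch_latest_version() -> int:
    """Hypothetical cheap check against the source (e.g., a MAX(updated_at) query)."""
    return next(_source_versions)

def pull_changes(since_version: int) -> list[dict]:
    """Hypothetical pull of everything newer than the last version we saw."""
    return [{"version": since_version + 1, "payload": "..."}]

def poll(interval_seconds: float, max_polls: int) -> None:
    last_seen = 0
    for _ in range(max_polls):
        latest = fetch_latest_version()
        if latest > last_seen:                 # change detected
            rows = pull_changes(last_seen)     # pull as in a regular pull pattern
            print(f"pulled {len(rows)} change(s) up to version {latest}")
            last_seen = latest
        time.sleep(interval_seconds)

poll(interval_seconds=0.1, max_polls=3)
```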

Batch ingestion

Batch ingestion involves processing data in bulk.
- Time-interval batch ingestion: processes data on a set schedule (for example, once a day to produce daily reports).
- Size-based batch ingestion: breaks the data up into discrete blocks once a size threshold is reached, for later processing (see the sketch below).
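
A minimal sketch of the size-based variant, assuming an in-memory stream of records and a hypothetical write_batch loader:

```python
from typing import Iterable, Iterator

def size_based_batches(records: Iterable[dict], batch_size: int) -> Iterator[list[dict]]:
    batch: list[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:   # size boundary reached -> cut a discrete block
            yield batch
            batch = []
    if batch:                          # flush the final partial block
        yield batch

def write_batch(batch: list[dict]) -> None:
    print(f"ingesting block of {len(batch)} records")

for block in size_based_batches(({"n": i} for i in range(10)), batch_size=4):
    write_batch(block)
```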

Synchronous ingestion

With synchronous ingestion, the source, ingestion process, and destination have complex dependencies and are tightly coupled; all data in a batch must be ingested before the next stage can proceed.

Consumer pull and push

•A consumer subscribing to a topic can get events in two ways: push and pull.
•Pull subscriptions are the default choice for most data engineering applications, but you may want to consider push capabilities for specialized applications.

Payload

•A payload is the dataset you're ingesting. Its characteristics are:
- Kind: the type and format of the data
- Shape: the dimensions of the payload (tabular, unstructured text, images)
- Size: the number of bytes of the payload
- Schema and data types: the fields and the types of data within those fields
- Metadata: data about data
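
The short sketch below inspects these characteristics for a small, made-up JSON payload (kind, shape, size in bytes, and a crude schema).

```python
import json

payload = [
    {"id": 1, "name": "sensor-a", "reading": 21.5},
    {"id": 2, "name": "sensor-b", "reading": 19.8},
]

raw = json.dumps(payload).encode("utf-8")
print("kind:  ", "JSON, tabular records")
print("shape: ", f"{len(payload)} rows x {len(payload[0])} columns")
print("size:  ", f"{len(raw)} bytes")
print("schema:", {field: type(value).__name__ for field, value in payload[0].items()})
```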

Inserts, updates, and batch size

•Batch-oriented systems often perform poorly when users attempt to perform many small-batch operations rather than a smaller number of large operations.
•It is important to understand the appropriate update patterns for the database or data store you're working with.
•It is also important to understand that certain technologies are purpose-built for high insert rates.
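
To make the batch-size point concrete, here is a sketch that favors one large batched insert over many single-row inserts, using sqlite3 only as a convenient stand-in for whatever store you are loading.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
rows = [(i, f"event-{i}") for i in range(10_000)]

# Anti-pattern on batch-oriented systems: many tiny operations.
# for row in rows:
#     conn.execute("INSERT INTO events VALUES (?, ?)", row)

# Preferred: one large batch operation.
conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM events").fetchone())  # (10000,)
```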

Data ingestion

•Data ingestion is the process of moving data from one place to another.
•Data ingestion implies data movement from source systems into storage in the data engineering lifecycle.

Throughput and Scalability

•Data throughput and system scalability become critical as your data volumes grow and requirements change.
•Design your systems to scale and shrink to flexibly match the desired data throughput.
•A common solution is to use managed services that handle the throughput scaling for you.

Error handling and dead-letter queues

•Events that cannot be ingested (bad events) need to be rerouted and stored in a separate location called a dead-letter queue.
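
A minimal sketch of the idea, with an in-memory list standing in for a real dead-letter queue and illustrative validation rules:

```python
import json

dead_letter_queue: list[dict] = []

def ingest(raw_event: str) -> None:
    try:
        event = json.loads(raw_event)          # may raise on malformed events
        if "event_id" not in event:
            raise ValueError("missing event_id")
        print("ingested", event["event_id"])
    except ValueError as exc:                  # includes json.JSONDecodeError
        dead_letter_queue.append({"raw": raw_event, "error": str(exc)})

for raw in ['{"event_id": 1}', "not-json", '{"no_id": true}']:
    ingest(raw)

print(f"{len(dead_letter_queue)} event(s) routed to the dead-letter queue")
```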

ETL

•Extract means getting data from a source system.
•Once data is extracted, it can either be transformed before loading it into a storage destination (ETL) or simply loaded into storage for future transformation (ELT).
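
The contrast can be sketched as plain functions, with illustrative extract/transform/load steps: ETL transforms before loading, while ELT loads the raw extract and transforms it later in the destination.

```python
def extract() -> list[dict]:
    return [{"name": " Ada ", "score": "91"}, {"name": "Grace", "score": "88"}]

def transform(rows: list[dict]) -> list[dict]:
    return [{"name": r["name"].strip(), "score": int(r["score"])} for r in rows]

def load(rows: list[dict], destination: list[dict]) -> None:
    destination.extend(rows)

etl_destination: list[dict] = []
load(transform(extract()), etl_destination)   # ETL: transform, then load

elt_destination: list[dict] = []
load(extract(), elt_destination)              # ELT: load the raw extract...
elt_destination = transform(elt_destination)  # ...transform later, standing in for in-warehouse SQL
print(etl_destination == elt_destination)     # True: same end state
```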

Time-to-live

•How long will you preserve your event record?
•A key parameter is maximum message retention time, also known as the time to live (TTL).
•TTL is usually a configuration you'll set for how long you want events to live before they are acknowledged and ingested.

Location

•It is often desirable to integrate streaming across several locations for enhanced redundancy and to consume data close to where it is generated.
•As a general rule, the closer your ingestion is to where data originates, the better your bandwidth and latency.
•However, you need to balance this against the costs of moving data between regions to run analytics on a combined dataset.

Ordering and Multiple delivery

•Messages may be delivered out of order and more than once (at-least-once delivery).
•You should have processes and controls in place to identify and resolve out-of-order and duplicate messages.
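
One common control is an idempotent consumer that deduplicates by message ID and reorders by event time; the sketch below uses made-up message fields.

```python
messages = [
    {"id": "m2", "ts": 2, "body": "second"},
    {"id": "m1", "ts": 1, "body": "first"},
    {"id": "m2", "ts": 2, "body": "second"},   # duplicate delivery
]

seen_ids: set[str] = set()
for msg in sorted(messages, key=lambda m: m["ts"]):   # restore event-time order
    if msg["id"] in seen_ids:                         # drop redelivered messages
        continue
    seen_ids.add(msg["id"])
    print("applying", msg["id"], msg["body"])
```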

Migration

•Most data systems perform best when data is moved in bulk rather than as individual rows or events.
•File or object storage is often an excellent intermediate stage for transferring data.
•Be aware that many tools are available to automate various types of data migrations.

Push

•Pushing data from a source to a destination; a source system sends data to a target.

Reliability and durability

•Reliability, which leads directly to durability, entails high uptime and proper failover for ingestion systems.
•Durability entails making sure that data isn't lost or corrupted.

Replay

•Replay allows readers to request a range of messages from the history, allowing you to rewind your event history to a particular point in time.
•Replay is a key capability in many streaming ingestion platforms and is particularly useful when you need to reingest and reprocess data for a specific time range.
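
A rough sketch of the idea, with an in-memory list standing in for a streaming platform's retained event log:

```python
event_log = [
    {"offset": 0, "ts": "2024-01-01T00:00", "body": "a"},
    {"offset": 1, "ts": "2024-01-02T00:00", "body": "b"},
    {"offset": 2, "ts": "2024-01-03T00:00", "body": "c"},
]

def replay(log: list[dict], start_ts: str, end_ts: str) -> list[dict]:
    """Rewind to start_ts and re-read everything up to end_ts."""
    return [e for e in log if start_ts <= e["ts"] <= end_ts]

for event in replay(event_log, "2024-01-02T00:00", "2024-01-03T00:00"):
    print("reprocessing offset", event["offset"])
```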

Schema evolution

•Schema evolution is common when handling event data; fields may be added or removed, or value types might change (say, a string to an integer).
•Schema evolution can have unintended impacts on your data pipelines and destinations.
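
One defensive pattern is a tolerant reader that supplies defaults for missing fields, ignores unknown ones, and coerces changed types; the field names below are illustrative.

```python
EXPECTED_FIELDS = {"user_id": 0, "country": "unknown"}

def read_event(event: dict) -> dict:
    normalized = {}
    for field, default in EXPECTED_FIELDS.items():
        value = event.get(field, default)          # tolerate removed/missing fields
        if field == "user_id":
            value = int(value)                     # tolerate a string -> integer change
        normalized[field] = value
    return normalized                              # newly added fields are simply ignored

print(read_event({"user_id": "123", "new_field": True}))  # {'user_id': 123, 'country': 'unknown'}
```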

Serialization

•Serialization means encoding the data from a source and preparing data structures for transmission and intermediate storage stages.
•Row-based serialization (Appendix A): CSV, XML, JSON and JSONL, Avro
•Columnar serialization (Appendix A): Parquet, ORC, Apache Arrow (in-memory serialization)
•Hybrid serialization (Appendix A): Hudi, Iceberg
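
For a feel of the difference, the sketch below writes the same small table in a row-based format (JSONL) and a columnar format (Parquet); it assumes pandas and pyarrow are installed, and the file names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "city": ["Oslo", "Lima", "Pune"]})

df.to_json("table.jsonl", orient="records", lines=True)  # row-based: one JSON record per line
df.to_parquet("table.parquet")                           # columnar: values stored column by column

print(pd.read_parquet("table.parquet").equals(pd.read_json("table.jsonl", lines=True)))
```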

Batch ingestion considerations

•Snapshot or Differential Extraction
•File-Based Export and Ingestion
•ETL Versus ELT
•Inserts, Updates, and Batch Size
•Data Migration

Whom you'll work with

•Upstream Stakeholders
•Downstream Stakeholders

Snapshot or differential extraction

•With full snapshots, engineers grab the entire current state of the source system on each update read.
•With the differential update pattern, engineers can pull only the updates and changes since the last read from the source system.
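
The sketch below contrasts the two patterns against a toy sqlite3 source, using updated_at as the high-water mark for the differential read; the table and values are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ana", "2024-01-01"), (2, "Bo", "2024-02-01"), (3, "Cy", "2024-03-01")],
)

def full_snapshot() -> list[tuple]:
    """Grab the entire current state of the source on every read."""
    return conn.execute("SELECT * FROM customers").fetchall()

def differential(last_read: str) -> list[tuple]:
    """Pull only rows changed since the last read (the high-water mark)."""
    return conn.execute(
        "SELECT * FROM customers WHERE updated_at > ?", (last_read,)
    ).fetchall()

print(len(full_snapshot()))             # 3 rows every time
print(len(differential("2024-01-15")))  # 2 rows changed since the last read
```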

Message size

•You must ensure that the streaming framework in question can handle the maximum expected message size.
•Be aware of the default message-size limits and whether they can be configured.

Late-arriving data

•You should be aware of late-arriving data and its impact on downstream systems and uses.
•To handle late-arriving data, you need to set a cutoff time for when late-arriving data will no longer be processed.
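
A minimal sketch of such a cutoff, where the one-hour allowed lateness is an illustrative choice:

```python
from datetime import datetime, timedelta, timezone

ALLOWED_LATENESS = timedelta(hours=1)

def accept(event_time: datetime, now: datetime) -> bool:
    """Process the event only if it is not older than the late-data cutoff."""
    return event_time >= now - ALLOWED_LATENESS

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(accept(datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc), now))  # True: within the cutoff
print(accept(datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc), now))    # False: too late, drop or divert
```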

