DP-900


Non-relational data offerings on Azure

Describe Azure Cosmos DB APIs; describe Azure Table storage; describe Azure Blob storage; describe Azure File storage.

NoSQL advantages

- Easier to make quick, iterative improvements, with the flexibility to change both schema and queries as data requirements evolve.
- High scalability, availability, and fault tolerance.
- Uses low-cost commodity hardware.
- Supports Big Data.
- The key-value model improves storage efficiency.

NoSQL suits: handling large, unrelated, indeterminate, or rapidly changing data; schema-agnostic data or a schema dictated by the app; apps where performance and availability matter more than strong consistency; and always-on apps that serve users around the world.

A key aspect of non-relational databases is that they enable you to store data in a very flexible manner. Non-relational databases don't impose a schema on data; instead, they focus on the data itself rather than how to structure it. This approach means you can store information in a natural format that mirrors the way you would consume, query, and use it.

Azure Data Factory

A cloud-based data integration service that allows you to create data-driven workflows for orchestrating data movement and transforming data at scale.

Graph database

A database model in which data is arranged as nodes and connected by lines (edges) that establish relationships. A graph database stores two types of information: nodes that you can think of as instances of entities, and edges, which specify the relationships between nodes. Nodes and edges can both have properties that provide information about that node or edge (like columns in a table). Additionally, edges can have a direction indicating the nature of the relationship. For large graphs with lots of entities and relationships, you can perform very complex analyses very quickly, and many graph databases provide a query language that you can use to traverse a network of relationships efficiently. You can often store the same information in a relational database, but the SQL required to query this information might require many expensive recursive join operations and nested subqueries. Azure Cosmos DB supports graph databases using the Gremlin API. The Gremlin API is a standard language for creating and querying graphs. Relevant Azure service: Azure Cosmos DB Graph API
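To make the node/edge model concrete, here is a minimal sketch in plain Python (illustrative only, not the Gremlin API); the people, company, and edge labels are invented for the example. Both nodes and edges carry properties, and edges have a direction.

```python
# Tiny illustrative graph: nodes and directed edges, both with properties.
nodes = {
    "alice":   {"label": "person", "name": "Alice"},
    "bob":     {"label": "person", "name": "Bob"},
    "contoso": {"label": "company", "name": "Contoso"},
}
edges = [
    {"from": "alice", "to": "bob", "label": "knows", "since": 2019},
    {"from": "alice", "to": "contoso", "label": "worksFor", "role": "engineer"},
    {"from": "bob",   "to": "contoso", "label": "worksFor", "role": "analyst"},
]

def out_neighbors(node_id: str, edge_label: str) -> list[str]:
    """Follow outgoing edges with a given label - one hop of a traversal."""
    return [e["to"] for e in edges if e["from"] == node_id and e["label"] == edge_label]

# Who works at the same company as Alice? (a two-hop traversal, no SQL joins)
for company in out_neighbors("alice", "worksFor"):
    colleagues = [n for n in nodes
                  if company in out_neighbors(n, "worksFor") and n != "alice"]
    print(nodes[company]["name"], colleagues)
```

In a graph query language like Gremlin, a traversal of this kind replaces the recursive joins and nested subqueries that the relational equivalent would need.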

Kappa architecture

A drawback to the lambda architecture is its complexity. Processing logic appears in two different places — the cold and hot paths — using different frameworks. This leads to duplicate computation logic and the complexity of managing the architecture for both paths. The kappa architecture was proposed by Jay Kreps as an alternative to the lambda architecture. It has the same basic goals as the lambda architecture, but with an important distinction: All data flows through a single path, using a stream processing system.

Data cleaning

A generalized term that encompasses a range of actions, such as removing anomalies, and applying filters and transformations that would be too time-consuming to run during the ingestion stage.

Benchmarking

A process by which a company compares its performance with that of high-performing organizations

What is a transactional system?

A transactional system is often what most people consider the primary function of business computing. A transactional system records transactions. A transaction could be financial, such as the movement of money between accounts in a banking system, or it might be part of a retail system, tracking payments for goods and services from customers. Think of a transaction as a small, discrete unit of work.

Data storage

Data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats. This kind of store is often called a data lake. Options for implementing this storage include Azure Data Lake Store or blob containers in Azure Storage.

Azure Synapse Analytics

Analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It gives you the freedom to query data on your terms, using either serverless or dedicated options—at scale. Azure Synapse brings these worlds together with a unified experience to ingest, explore, prepare, transform, manage, and serve data for immediate BI and machine learning needs.

Parquet

Another columnar data format. A Parquet file contains row groups; the data for each column is stored together in the same row group. Each row group contains one or more chunks of data, and the file includes metadata that describes the set of rows found in each chunk. Parquet specializes in storing and processing nested data types efficiently.
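A hedged sketch of the row-group structure using the pyarrow library (assumed installed; the file name is hypothetical):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3, 4], "city": ["Oslo", "Lima", "Pune", "Kyiv"]})
pq.write_table(table, "example.parquet", row_group_size=2)  # two rows per row group

pf = pq.ParquetFile("example.parquet")
print(pf.metadata.num_row_groups)        # 2: file metadata describes each row group
print(pf.read_row_group(0).to_pydict())  # read one row group's column chunks
```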

Cognitive analytics

Attempts to draw inferences from existing data and patterns, derive conclusions based on existing knowledge bases, and then add these findings back into the knowledge base for future inferences: a self-learning feedback loop. Inferences aren't structured queries based on a rules database; rather, they're unstructured hypotheses gathered from a number of sources and expressed with varying degrees of confidence.

Azure Cosmos DB use cases

Azure Cosmos DB natively partitions your data for high availability and scalability. It offers 99.99% guarantees for availability, throughput, low latency, and consistency on all single-region accounts and all multi-region accounts with relaxed consistency, and 99.999% read availability on all multi-region database accounts. It has SSD-backed storage with low-latency, order-of-millisecond response times. Its support for consistency levels like eventual, consistent prefix, session, and bounded staleness allows for full flexibility and a low cost-to-performance ratio; no database service offers as much flexibility in consistency levels as Azure Cosmos DB. It has a flexible, data-friendly pricing model that meters storage and throughput independently, and its reserved throughput model lets you think in terms of the number of reads/writes instead of the CPU/memory/IOPS of the underlying hardware. Its design lets you scale to massive request volumes, on the order of trillions of requests per day.

IoT and telematics: IoT use cases commonly share some patterns in how they ingest, process, and store data. First, these systems need to ingest bursts of data from device sensors in various locales. Next, they process and analyze streaming data to derive real-time insights. The data is then archived to cold storage for batch analytics. Microsoft Azure offers rich services that can be applied to IoT use cases, including Azure Cosmos DB, Azure Event Hubs, Azure Stream Analytics, Azure Notification Hubs, Azure Machine Learning, Azure HDInsight, and Power BI.

Retail and marketing: Azure Cosmos DB is used extensively in Microsoft's own e-commerce platforms that run the Windows Store and Xbox Live. It is also used in the retail industry for storing catalog data and for event sourcing in order processing pipelines.

Azure Data Studio

Azure Data Studio is a cross-platform database tool for data professionals using on-premises and cloud data platforms on Windows, macOS, and Linux. It offers a modern editor experience with IntelliSense, code snippets, source control integration, and an integrated terminal. It's engineered with the data platform user in mind, with built-in charting of query result sets and customizable dashboards. Use Azure Data Studio if you: are mostly editing or executing queries; need the ability to quickly chart and visualize result sets; can execute most administrative tasks via the integrated terminal using sqlcmd or PowerShell; have minimal need for wizard experiences; do not need to do deep administrative or platform-related configuration; or need to run on macOS or Linux.

Big data solutions typically involve

Batch processing of big data sources at rest; real-time processing of big data in motion; interactive exploration of big data; and predictive analytics and machine learning. Consider a big data architecture when you need to store and process data in volumes too large for a traditional database, transform unstructured data for analysis and reporting, or capture, process, and analyze unbounded streams of data in real time or with low latency.

Batch Processing

Buffering and processing the data in groups. The whole group is then processed at a future time. Advantages of batch processing include: Large volumes of data can be processed at a convenient time. It can be scheduled to run at a time when computers or systems might otherwise be idle, such as overnight, or during off-peak hours. Disadvantages of batch processing include: The time delay between ingesting the data and getting the results. All of a batch job's input data must be ready before a batch can be processed. This means data must be carefully checked. Problems with data, errors, and program crashes that occur during batch jobs bring the whole process to a halt. The input data must be carefully checked before the job can be run again. Even minor data errors, such as typographical errors in dates, can prevent a batch job from running. Options include running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom Map/Reduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.
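As an illustration of the HDInsight Spark option mentioned above, here is a hedged PySpark sketch of a scheduled batch job; the storage path, container name, and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("nightly-batch").getOrCreate()

# Read a day's worth of accumulated files at rest, validate, then aggregate.
orders = (spark.read.option("header", True)
          .csv("wasbs://data@account.blob.core.windows.net/orders/2024-01-01/"))
valid = orders.filter(F.col("order_date").isNotNull())  # batch input must be clean
daily_totals = (valid.groupBy("store_id")
                .agg(F.sum(F.col("amount").cast("double")).alias("total")))
daily_totals.write.mode("overwrite").parquet(
    "wasbs://data@account.blob.core.windows.net/curated/daily_totals/")
```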

Azure Synapse Analytics Use Case - Data Warehousing

Build your mission-critical workloads on the industry's leading SQL engine. Easily scale your workloads and get predictable cost with no hidden charges. Generate millions of predictions in seconds directly in your data warehouse. Optimize your data warehouse to ensure resources are properly utilized. Automatically convert SQL code in minutes with Azure Synapse Pathway. Get guaranteed fast performance on complex queries without any query charge.

Most common formats for unstructured data

CSV and JSON are likely the most common formats used for ingesting, exchanging, and storing unstructured or semi-structured data. CSV files are commonly used to exchange tabular data between systems in plain text. Despite their limitations, CSV files are a popular choice for data exchange, because they are supported by a wide range of business, consumer, and scientific applications. JSON (JavaScript Object Notation) data is represented as key-value pairs in a semi-structured format. Both are self-describing and human readable, but JSON documents tend to be much smaller, leading to their popular use in online data exchange, especially with the advent of REST-based web services. JSON-formatted files have several benefits over CSV: JSON maintains hierarchical structures, making it easier to hold related data in a single document and represent complex relationships; most programming languages provide native support for deserializing JSON into objects, or provide lightweight JSON serialization libraries; JSON supports lists of objects, helping to avoid messy translations of lists into a relational data model; and JSON is a commonly used file format for NoSQL databases, such as MongoDB, Couchbase, and Azure Cosmos DB.
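A quick standard-library contrast of the two formats (the order record is invented for the example):

```python
import csv, io, json

order = {
    "id": 1001,
    "customer": {"name": "Ana", "email": "ana@example.com"},      # nested object
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],  # list of objects
}

# JSON preserves the hierarchy and deserializes straight back into objects.
round_tripped = json.loads(json.dumps(order))
print(round_tripped["items"][0]["sku"])  # A1

# CSV is flat: nested data must be squashed into columns first.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "customer_name", "item_count"])
writer.writeheader()
writer.writerow({"id": order["id"],
                 "customer_name": order["customer"]["name"],
                 "item_count": len(order["items"])})
print(buf.getvalue())
```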

When to use CSV or JSON format

CSVs are more commonly used for exporting and importing data, or processing it for analytics and machine learning. JSON-formatted files have the same benefits, but are more common in hot data exchange solutions. JSON documents are often sent by web and mobile devices performing online transactions, by IoT (internet of things) devices for one-way or bidirectional communication, or by client applications communicating with SaaS and PaaS services or serverless architectures. CSV and JSON file formats both make it easy to exchange data between dissimilar systems or devices. Their semi-structured formats allow flexibility in transferring almost any type of data, and universal support for these formats makes them simple to work with. Both can be used as the raw source of truth in cases where the processed data is stored in binary formats for more efficient querying. Azure provides several solutions for working with CSV and JSON files, depending on your needs. The primary landing place for these files is either Azure Storage or Azure Data Lake Store.

Treemap Chart

Charts of colored rectangles, with size representing the relative value of each item. They can be hierarchical, with rectangles nested within the main rectangles.

Column family databases

Columnar, wide-column, or column-family databases efficiently store data and query across rows of sparse data, and are advantageous when querying across specific columns in the database. This model organizes data into rows and columns; examples of this structure include ORC and Parquet files, described in the previous unit. A column-family database holds tabular data comprising rows and columns, but you can divide the columns into groups known as column families. In most column-family databases, the column families are stored separately: for example, a CustomerInfo column family might be held in one area of physical storage and an AddressInfo column family in another, in a simple form of vertical partitioning. You should think of the structure in terms of column families rather than rows. The data for a single entity that spans multiple column families will have the same row key in each column family. The most widely used column-family database management system is Apache Cassandra. Azure Cosmos DB supports the column-family approach through the Cassandra API.
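A plain-Python sketch of that layout (illustrative only, not a real Cassandra client); the CustomerInfo/AddressInfo families and row key follow the example above.

```python
customer_info = {   # one physical area of storage
    "row-001": {"name": "Ana", "phone": "555-0100"},
}
address_info = {    # another physical area of storage
    "row-001": {"city": "Seattle", "postcode": "98101"},
}

row_key = "row-001"
# Reading only CustomerInfo never touches AddressInfo storage - the benefit
# of vertical partitioning when queries use a few column families at a time.
print(customer_info[row_key])
# The shared row key lets you reassemble the full entity when needed:
print({**customer_info[row_key], **address_info[row_key]})
```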

Data analytics

Concerned with examining, transforming, and arranging data so that you can study it and extract useful information A catch-all that covers a range of activities, each with its own focus and goals. You can categorize these activities as descriptive, diagnostic, predictive, prescriptive, and cognitive analytics.

Predictive analytics

Helps answer questions about what will happen in the future.

Data Ingestion

Data ingestion is the process of capturing the raw data. This data could be taken from control devices measuring environmental information such as temperature and pressure, point-of-sale devices recording the items purchased by a customer in a supermarket, financial data recording the movement of money between bank accounts, and weather data from weather stations. Some of this data might come from a separate OLTP system. To process and analyze this data, you must first store the data in a repository of some sort. The repository could be a file store, a document database, or even a relational database.

What is semi-structured data?

Semi-structured data is data that contains fields. The fields don't have to be the same in every entity; you only define the fields that you need on a per-entity basis, and you're free to define whatever fields you like. Common formats include JSON, Avro, ORC, and Parquet.

Data Transformation/Data Processing

Data processing is the conversion of raw data into meaningful information. The raw data might not be in a format that is suitable for querying: it might contain anomalies that should be filtered out, or it may require transforming in some way. For example, dates or addresses might need to be converted into a standard format. After data is ingested into a data repository, you may want to do some cleaning operations and remove any questionable or invalid data, or perform some aggregations such as calculating profit, margin, and other Key Performance Indicators (KPIs). KPIs are how businesses are measured for growth and performance.
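A small illustrative cleanup-and-KPI pass in pure Python (the records and field names are invented for the example):

```python
from datetime import datetime

raw = [
    {"date": "01/31/2024", "revenue": "1200", "cost": "800"},
    {"date": "2024-02-01", "revenue": "950",  "cost": "bad"},  # invalid cost
]

def standardize_date(value: str) -> str:
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):        # normalize mixed formats
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value}")

clean = []
for row in raw:
    try:
        revenue, cost = float(row["revenue"]), float(row["cost"])
    except ValueError:
        continue                                 # drop questionable records
    clean.append({"date": standardize_date(row["date"]),
                  "profit": revenue - cost,
                  "margin": (revenue - cost) / revenue})  # the KPIs

print(clean)  # [{'date': '2024-01-31', 'profit': 400.0, 'margin': 0.333...}]
```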

ETL and ELT

A data processing mechanism can take two approaches to retrieving the ingested data, processing it to transform it and generate models, and then saving the transformed data and models: ETL and ELT.

What is unstructured data?

Data that doesn't naturally contain fields. Examples include video, audio, and other media streams. In Azure, you would probably store video and audio data as block blobs in an Azure Storage account.

Key influencer chart

Displays the major contributors to a selected result or value.

Data Analysts

Explore and analyze data to create visualizations and charts that enable organizations to make informed decisions. A data analyst enables businesses to maximize the value of their data assets. They're responsible for designing and building scalable models, cleaning and transforming data, and enabling advanced analytics capabilities through reports and visualizations. A data analyst processes raw data into insights that address identified business requirements.

External index data stores

External index data stores provide the ability to search for information held in other data stores and services. An external index acts as a secondary index for any data store, and can be used to index massive volumes of data and provide near real-time access to these indexes. For example, you might have text files stored in a file system. Finding a file by its file path is quick, but searching based on the contents of the file would require a scan of all of the files, which is slow. An external index lets you create secondary search indexes and then quickly find the path to the files that match your criteria. Another example application of an external index is with key/value stores that only index by the key. You can build a secondary index based on the values in the data, and quickly look up the key that uniquely identifies each matched item. Relevant Azure service: Azure Search

Stream data processing

Handles data in real time. Stream processing is ideal for time-critical operations that require an instant real-time response. For example, a system that monitors a building for smoke and heat needs to trigger alarms and unlock doors to allow residents to escape immediately in the event of a fire. After capturing real-time messages, the solution must process them by filtering, aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an output sink. Azure Stream Analytics provides a managed stream processing service based on perpetually running SQL queries that operate on unbounded streams. You can also use open source Apache streaming technologies like Storm and Spark Streaming in an HDInsight cluster.
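To show the windowed-aggregation idea without a managed service, here is a conceptual sketch in plain Python (not Azure Stream Analytics; the event shape is hypothetical and events are assumed to arrive in timestamp order):

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Aggregate an unbounded stream into fixed windows as events arrive."""
    counts, window_start = defaultdict(int), None
    for event in events:
        bucket = event["ts"] - (event["ts"] % window_seconds)
        if window_start is not None and bucket != window_start:
            yield window_start, dict(counts)  # emit the closed window downstream
            counts.clear()
        window_start = bucket
        counts[event["sensor"]] += 1
    if window_start is not None:
        yield window_start, dict(counts)      # flush the final window

stream = [{"ts": 5, "sensor": "smoke"}, {"ts": 42, "sensor": "heat"},
          {"ts": 70, "sensor": "smoke"}]
for window, result in tumbling_window_counts(stream):
    print(window, result)   # 0 {'smoke': 1, 'heat': 1} then 60 {'smoke': 1}
```

A real streaming engine handles the same filter/aggregate/emit loop continuously over unbounded input, writing each result to an output sink.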

Relational SQL Advantages

Handling data that is relational and has logical and discrete requirements that can be identified in advance. A schema that must be maintained and kept in sync between the app and database. Legacy systems built for relational structures. Apps requiring complex querying or multi-row transactions. Scales vertically, by increasing the capacity of the server. Typical uses: accounting, finance, and banking systems; inventory management systems; transaction management systems.

Filled Map

If you have geographical data, you can use a filled map to display how a value differs in proportion across a geography or region.

Prescriptive analytics

Helps answer questions about what actions should be taken to achieve a goal or target

Descriptive analytics

Helps answer questions about what has happened, based on historical data

What is an analytical system?

In contrast to systems designed to support OLTP, an analytical system is designed to support business users who need to query data and gain a big picture view of the information held in a database. Analytical systems are concerned with capturing raw data, and using it to generate insights. An organization can use these insights to make business decisions. For example, detailed insights for a manufacturing company might indicate trends enabling them to determine which product lines to focus on, for profitability.

Document database disadvantages

Duplicating data across documents increases the storage required, and can also make maintenance more complex.

Key-value

Key-value stores pair keys and values using a hash table. Key-value types are best when a key is known and the associated value for the key is unknown. The simplest (and often quickest) type of NoSQL database for inserting and querying data. Write operations are restricted to inserts and deletes. Azure Table storage is an example. Most key/value stores only support simple query, insert, and delete operations. To modify a value (either partially or completely), an application must overwrite the existing data for the entire value. In most implementations, reading or writing a single value is an atomic operation. If the value is large, writing may take some time. An application can store arbitrary data as a set of values, although some key/value stores impose limits on the maximum size of values. The stored values are opaque to the storage system software. Any schema information must be provided and interpreted by the application. Essentially, values are blobs and the key/value store simply retrieves or stores the value by key. Key/value stores are also not optimized for scenarios where querying or filtering by non-key values is important, rather than performing lookups based only on keys. Relevant Azure services: Azure Cosmos DB Table API Azure Cache for Redis Azure Table Storage
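A minimal in-memory sketch of these semantics (a hypothetical class, not a real Azure SDK): values are opaque blobs the store never inspects, and a write always replaces the whole value.

```python
class KeyValueStore:
    def __init__(self):
        self._data: dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        # No partial updates: writing replaces the entire value atomically.
        self._data[key] = value

    def get(self, key: str) -> bytes | None:
        return self._data.get(key)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)

store = KeyValueStore()
store.put("customer:42", b'{"name": "Ana", "tier": "gold"}')  # schema lives in the app
print(store.get("customer:42"))
# Filtering by a field inside the value would mean reading every entry, which
# is why key/value stores suit lookups by key only.
```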

Analytical data store

Many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools. The analytical data store used to serve these queries can be a Kimball-style relational data warehouse, as seen in most traditional business intelligence (BI) solutions. Alternatively, the data could be presented through a low-latency NoSQL technology such as HBase, or an interactive Hive database that provides a metadata abstraction over data files in the distributed data store. Azure Synapse Analytics provides a managed service for large-scale, cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark SQL, which can also be used to serve data for analysis.

Azure Cosmos DB APIs

SQL (Core) API, MongoDB API, Cassandra API, Gremlin API, and Table API

Orchestration

Most big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report or dashboard. To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.

Identify non-relational database use cases

Non-relational databases are highly suitable for the following scenarios:

IoT and telematics. These systems typically ingest large amounts of data in frequent bursts of activity. Additionally, you can process the data in real time using Azure Functions that are triggered as data arrives in the database.

Retail and marketing. Microsoft uses Cosmos DB for its own e-commerce platforms that run as part of Windows Store and Xbox Live. It's also used in the retail industry for storing catalog data and for event sourcing in order processing pipelines.

Gaming. Games rely on the cloud to deliver customized and personalized content like in-game stats, social media integration, and high-score leaderboards. Games often require single-millisecond latencies for reads and writes to provide an engaging in-game experience. A game database needs to be fast and be able to handle massive spikes in request rates during new game launches and feature updates.

Web and mobile applications. A non-relational database such as Azure Cosmos DB is commonly used within web and mobile applications, and is well suited for modeling social interactions, integrating with third-party services, and building rich personalized experiences. The Cosmos DB SDKs (software development kits) can be used to build rich iOS and Android applications using the popular Xamarin framework.

A relational database restructures the data into a fixed format that is designed to answer specific queries. When data needs to be ingested very quickly, or the query is unknown and unconstrained, a relational database can be less suitable than a non-relational database.

Business Intelligence

Refers to technologies, applications, and practices for the collection, integration, analysis, and presentation of business information. The purpose of business intelligence is to support better decision making. Business intelligence systems provide historical, current, and predictive views of business operations, most often using data that has been gathered into a data warehouse, and occasionally working from live operational data.

Object data stores

Object data stores are optimized for storing and retrieving large binary objects or blobs such as images, text files, video and audio streams, large application data objects and documents, and virtual machine disk images. An object consists of the stored data, some metadata, and a unique ID for accessing the object. Object stores are designed to support files that are individually very large, as well as provide large amounts of total storage to manage all files. One special case of object data stores is the network file share. Using file shares enables files to be accessed across a network using standard networking protocols like server message block (SMB). Given appropriate security and concurrent access control mechanisms, sharing data in this way can enable distributed services to provide highly scalable data access for basic, low-level operations such as simple read and write requests. Relevant Azure services: Azure Blob Storage Azure Data Lake Store Azure File Storage

Azure Synapse Analytics Use Case - Data Engineering

Process big data with serverless Spark pools using the latest Spark runtime. Perform data processing and machine learning tasks three times faster with Nvidia GPUs. Launch clusters on demand and dynamically scale in, scale out, pause, and resume. Manage data pipelines in the same analytics platform as your data warehouse. Perform code-free interactive data exploration and add it to your data pipeline. Ingest data from software-as-a-service (SaaS) apps with more than 95 built-in connectors.

Azure Blob Storage

Azure Blob Storage lets you query JSON files directly, and you can perform batch processing or real-time processing of the data.
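A hedged sketch with the azure-storage-blob (v12) Python SDK, assumed installed; the connection string, container, and blob names are placeholders. It simply downloads a JSON blob and parses it for processing.

```python
import json
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("data")

# Download a JSON blob and work with its contents directly.
payload = (container.get_blob_client("events/2024-01-01.json")
           .download_blob().readall())
events = json.loads(payload)
print(len(events))
```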

Data visualization

The graphical representation of information and data. Helps you to focus on the meaning of data, rather than looking at the data itself. If you are using Azure, the most popular data visualization tool is Power BI.

Scatter

Shows the relationship between two numerical values

NoSQL storage

Store the information for entities in collections or containers rather than relational tables. Each entity should have a unique key value. The entities in a collection are usually stored in key-value order.

Data processing

Takes the data in its raw form, cleans it, and converts it into a more meaningful format (tables, graphs, documents, and so on) Data processing can be complex, and may involve automated scripts, and tools such as Azure Databricks, Azure Functions, and Azure Cognitive Services to examine and reformat the data, and generate models. A data analyst could use machine learning to help determine future trends based on these models.

Analysis and reporting

The goal of most big data solutions is to provide insights into the data through analysis and reporting. To empower users to analyze the data, the architecture may include a data modeling layer, such as a multidimensional OLAP cube or tabular data model in Azure Analysis Services. It might also support self-service BI, using the modeling and visualization technologies in Microsoft Power BI or Microsoft Excel. Analysis and reporting can also take the form of interactive data exploration by data scientists or data analysts. For these scenarios, many Azure services support analytical notebooks, such as Jupyter, enabling these users to leverage their existing skills with Python or R. For large-scale data exploration, you can use Microsoft R Server, either standalone or with Spark.

Document Database

The opposite end of the NoSQL spectrum from a key-value store. Each document has a unique ID, but the fields in the documents are transparent to the database management system. Document databases typically store data in JSON format, as described in the previous unit, or the documents could be encoded using other formats such as XML, YAML, or BSON. Typically, a document contains the entire data for an entity. A document database does not require that all documents have the same structure; this free-form approach provides a great deal of flexibility. Most document databases will ingest large volumes of data more rapidly than a relational database, but aren't as optimal as a key-value store for this type of processing. The focus of a document database is its query capabilities. Azure Cosmos DB implements a document database approach in its Core (SQL) API. Document databases extend the concept of the key-value database by organizing entire documents into groups called collections. They support nested key-value pairs and allow queries on any attribute within a document.
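A hedged sketch using the azure-cosmos (v4) Python SDK; the endpoint, key, database, container, and documents are placeholders, and the container is assumed to already exist. It shows documents with differing structures and a query on a non-key attribute via the Core (SQL) API.

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/",
                      credential="<key>")
container = client.get_database_client("store").get_container_client("products")

# Each document carries its own structure; no fixed schema is required.
container.upsert_item({"id": "p1", "category": "bikes", "price": 499.0})
container.upsert_item({"id": "p2", "category": "bikes", "colors": ["red", "black"]})

# Query on any attribute, not just the key.
for item in container.query_items(
        query="SELECT c.id, c.price FROM c WHERE c.category = 'bikes'",
        enable_cross_partition_query=True):
    print(item)
```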

ELT is an abbreviation of Extract, Load, and Transform

The process differs from ETL in that the data is stored before being transformed. The data processing engine can take an iterative approach, retrieving and processing the data from storage, before writing the transformed data and models back to storage. ELT is more suitable for constructing complex models that depend on multiple items in the database, often using periodic batch processing.
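A minimal ELT sketch in pure Python, with local directories standing in for a data lake (paths and record shapes are hypothetical): the raw data is landed first, then transformed iteratively from storage.

```python
import json, pathlib

landing = pathlib.Path("lake/raw");     landing.mkdir(parents=True, exist_ok=True)
curated = pathlib.Path("lake/curated"); curated.mkdir(parents=True, exist_ok=True)

# Load: persist records exactly as they arrived, before any transformation.
raw_records = [{"user": "ana", "spend": 12.5}, {"user": "bo", "spend": 7.0}]
(landing / "batch1.json").write_text(json.dumps(raw_records))

# Transform: a later, often periodic batch pass reads from storage, reshapes,
# and writes the model back to storage - the raw copy remains for re-runs.
records = json.loads((landing / "batch1.json").read_text())
totals = {}
for r in records:
    totals[r["user"]] = totals.get(r["user"], 0) + r["spend"]
(curated / "spend_by_user.json").write_text(json.dumps(totals))
```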

Data ingestion

The process of obtaining and importing data for immediate use or storage in a database. The data can arrive as a continuous stream, or it may come in batches, depending on the source. The purpose of the ingestion process is to capture this data and store it. This raw data can be held in a repository such as a database management system, a set of files, or some other type of fast, easily accessible storage. The ingestion process might also perform filtering. For example, ingestion might reject suspicious, corrupt, or duplicated data. Suspicious data might be data arriving from an unexpected source. Corrupt or duplicated data could be due to a device error, transmission failure, or tampering.

Reporting

The process of organizing data into informational summaries to monitor how different areas of an organization are performing. Good reporting should raise questions about the business from its end users. Reporting shows you what has happened, while analysis focuses on explaining why it happened and what you can do about it.

ETL Extract, Transform, and Load

The raw data is retrieved and transformed before being saved. The extract, transform, and load steps can be performed as a continuous pipeline of operations. ETL is suitable for systems that only require simple models, with little dependency between items. For example, this type of process is often used for basic data cleaning tasks, deduplicating data, and reformatting the contents of individual fields. The more stream-oriented approach of ETL places greater emphasis on throughput. However, ETL can filter data before it's stored; in this way, ETL can help with data privacy and compliance, removing sensitive data before it arrives in your analytical data models.
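A minimal ETL sketch in pure Python (the field names are invented): the data is deduplicated, reformatted, and scrubbed of sensitive fields in flight, so only transformed rows ever reach storage.

```python
import json

extracted = [
    {"id": 1, "email": "ana@example.com", "amount": "10.0"},
    {"id": 1, "email": "ana@example.com", "amount": "10.0"},  # duplicate
    {"id": 2, "email": "bo@example.com",  "amount": "x"},     # bad amount
]

loaded, seen = [], set()
for row in extracted:
    if row["id"] in seen:
        continue                       # deduplicate before storage
    try:
        amount = float(row["amount"])  # reformat individual fields
    except ValueError:
        continue                       # filter out invalid records
    seen.add(row["id"])
    # The email field is dropped here: sensitive data never reaches the store.
    loaded.append({"id": row["id"], "amount": amount})

with open("clean.json", "w") as f:     # load only the transformed rows
    json.dump(loaded, f)
```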

Time series data stores

Time series data is a set of values organized by time, and a time series data store is optimized for this type of data. Time series data stores must support a very high number of writes, as they typically collect large amounts of data in real time from a large number of sources. Time series data stores are optimized for storing telemetry data. Scenarios include IoT sensors or application/system counters. Updates are rare, and deletes are often done as bulk operations. Relevant Azure services: Azure Time Series Insights OpenTSDB with HBase on HDInsight

Analytical workloads

Typically read-only systems that store vast volumes of historical data or business metrics, such as sales performance and inventory levels. Used for data analysis and decision making. Can be based on a snapshot of the data at a given point in time, or a series of snapshots. Decision makers usually don't require all the details of every transaction. They want the bigger picture.

Column family database indexing

Unlike a key/value store or a document database, most column-family databases physically store data in key order, rather than by computing a hash. The row key is considered the primary index and enables key-based access via a specific key or a range of keys. Some implementations allow you to create secondary indexes over specific columns in a column family; secondary indexes let you retrieve data by column value, rather than row key. On disk, all of the columns within a column family are stored together in the same file, with a certain number of rows in each file. With large data sets, this approach creates a performance benefit by reducing the amount of data that needs to be read from disk when only a few columns are queried together at a time. Read and write operations for a row are typically atomic within a single column family, although some implementations provide atomicity across the entire row, spanning multiple column families. Relevant Azure services: Azure Cosmos DB Cassandra API, HBase in HDInsight. Reference: https://docs.microsoft.com/en-us/azure/architecture/data-guide/big-data/non-relational-data

Avro

A row-based format created by Apache. Each record contains a header that describes the structure of the data in the record; this header is stored as JSON, while the data itself is stored as binary information. Avro is a very good format for compressing data and minimizing storage and network bandwidth requirements.
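A hedged sketch using the fastavro library (assumed installed); the schema and records are invented for the example. The schema travels with the file, and rows are encoded as binary.

```python
from fastavro import writer, reader, parse_schema

schema = parse_schema({
    "name": "Reading", "type": "record",
    "fields": [{"name": "device", "type": "string"},
               {"name": "temp", "type": "float"}],
})

with open("readings.avro", "wb") as out:
    writer(out, schema, [{"device": "d1", "temp": 21.5},
                         {"device": "d2", "temp": 19.0}])

with open("readings.avro", "rb") as fo:
    for record in reader(fo):   # the embedded schema decodes each binary row
        print(record)
```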

Non-relational data

anything not structured as a set of tables.

Diagnostic analytics

helps answer questions about why things happened

Database Administrators

manage databases, assigning permissions to users, storing backup copies of data and restore data in case of any failures.

What does NoSQL stand for?

non-relational databases

Data Engineers

Vital in working with data, applying data cleaning routines, identifying business rules, and turning data into useful information. A data engineer collaborates with stakeholders to design and implement data-related assets that include data ingestion pipelines, cleansing and transformation activities, and data stores for analytical workloads. They use a wide range of data platform technologies, including relational and nonrelational databases, file stores, and data streams. They're also responsible for ensuring that the privacy of data is maintained within the cloud and across on-premises and cloud data stores. They also own the management and monitoring of data stores and data pipelines to ensure that data loads perform as expected. Data engineers also need soft skills to communicate data trends to others in the organization and to help the business make use of the data it collects.

Lambda architecture

When working with very large data sets, it can take a long time to run the sort of queries that clients need. These queries can't be performed in real time, and often require algorithms such as MapReduce that operate in parallel across the entire data set. The results are then stored separately from the raw data and used for querying. One drawback to this approach is that it introduces latency — if processing takes a few hours, a query may return results that are several hours old. Ideally, you would like to get some results in real time (perhaps with some loss of accuracy), and combine these results with the results from the batch analytics. The lambda architecture, first proposed by Nathan Marz, addresses this problem by creating two paths for data flow. All data coming into the system goes through these two paths: A batch layer (cold path) stores all of the incoming data in its raw form and performs batch processing on the data. The result of this processing is stored as a batch view. A speed layer (hot path) analyzes data in real time. This layer is designed for low latency, at the expense of accuracy. The batch layer feeds into a serving layer that indexes the batch view for efficient querying. The speed layer updates the serving layer with incremental updates based on the most recent data.
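A toy sketch of the serving pattern in plain Python (the views and numbers are invented): queries merge the precomputed batch view with the speed layer's incremental updates.

```python
# batch_view: precomputed aggregates from the cold path (e.g., a nightly job).
# speed_view: incremental counts from the hot path since that batch ran.
batch_view = {"page_a": 10_000, "page_b": 7_500}
speed_view = {"page_a": 42, "page_c": 5}

def query_page_views(page: str) -> int:
    """Combine the batch view with the speed layer's recent increments."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(query_page_views("page_a"))  # 10042: batch result plus recent events
```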

ORC (Optimized Row Columnar format)

Organizes data into columns rather than rows. It was developed by Hortonworks for optimizing read and write operations in Apache Hive (Hive supports SQL-like queries over unstructured data). An ORC file contains stripes of data; each stripe holds the data for a column or set of columns. A stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.
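A hedged sketch of reading stripes with pyarrow's ORC module (assumed installed and recent enough to support writing; the file name and columns are hypothetical):

```python
import pyarrow as pa
from pyarrow import orc

table = pa.table({"city": ["Oslo", "Lima"], "pop": [709_000, 10_000_000]})
orc.write_table(table, "example.orc")

f = orc.ORCFile("example.orc")
print(f.nstripes)                # each stripe holds data for a set of columns
print(f.read_stripe(0))          # read a single stripe
print(f.read(columns=["city"]))  # column pruning: touch only needed columns
```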

Data processing solutions

Two broad categories: analytical systems and transaction processing systems.

