INFSCI 1540 Final

Data Warehousing:

- Consolidate data from many sources in one large repository. ~ Data loading, data aggregation, speeding up OLAP queries.

Replicating Data in Different Forms:

- Denormalization is a form of duplicating data, similar to caching: if some value is expensive to recompute on reads, you can store it somewhere. Similarly, materialized aggregates (views) in a Data Warehouse are a form of redundant (replicated) data. Keeping the same data in different representations speeds up reads - However, you need to keep it up-to-date when the underlying raw data changes.

The Problem of Data Synchronization:

- Making sure that the data ends up in all the right places. - Whenever a piece of data changes in one place, it needs to change correspondingly in all the other places where there is a copy or derivative of that data - Techniques to keep these different data systems synchronized: ~ Dual Writes ~ Using Commit Logs

Common Data Acquisition Patterns:

- Request/response pattern (browsing the web, searching for information online, and using your mobile device) - Request/acknowledge pattern (when you purchase an item online, you are given a confirmation number, which you can then use to check on the status of your order) - One-way pattern ("fire and forget", servers in a data center may send data about how much memory and CPU are being used every 10 seconds) - Publish/subscribe pattern

Aggregate Functions:

- These functions operate on the values of a column of a relation, and return a value: ~ avg: average value ~ min: minimum value ~ max: maximum value ~ sum: sum of values ~ count: number of values
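
Example (a minimal sketch, using the account(account_number, branch_name, balance) relation that appears elsewhere in these notes):

    select count(*)     as num_accounts,
           avg(balance) as avg_balance,
           min(balance) as min_balance,
           max(balance) as max_balance,
           sum(balance) as total_balance
    from account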

OLAP (online analytical processing):

Calculate key business metrics, forecasting functions, and what-if analysis of large data volumes - Complex SQL queries and views.

ZooKeeper Ensemble:

- A group of ZooKeeper servers. The minimum number of servers in an ensemble is three - If there is one node, ZooKeeper fails when the node fails - If there are two nodes and one node fails, there is no majority and again ZooKeeper fails - If there are three nodes and one node fails, there is still a majority, and ZooKeeper keeps working - It is better to have an odd number of nodes in the ensemble

Distributed Application:

- A group of running systems in a distributed application is called a cluster and each running machine in a cluster is called a node - A distributed application has two parts ~ Servers: provide a service for clients ~ Clients: tools to interact with the distributed application
Benefits: - Reliability: failure of some nodes in the cluster does not make the whole system fail - Scalability: performance can be increased by adding more nodes to the system - Transparency: the distributed system can be viewed as a single application
Challenges: - Race condition: concurrent modification of shared resources - Inconsistency due to partial failures - Deadlock: two or more operations are waiting for each other
Nodes in a distributed application cluster use ZooKeeper for coordination and for sharing data via robust synchronization techniques

Log:

- A log is a sequence of log records reflecting notable events in transaction processing (e.g., update activities on the database) - When transaction Ti starts, it registers itself by writing a <Ti start> log record - Before Ti executes write(X), a log record <Ti, X, V1, V2> is written, where V1 is the value of X before the write, and V2 is the value to be written to X. - When Ti finishes its last statement, the log record <Ti commit> is written.
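
Example: for the transaction T1 that transfers $50 from account A to account B (see "Example of Fund Transfer" below), and assuming hypothetical initial balances of 1000 for A and 2000 for B, the log would contain:

    <T1 start>
    <T1, A, 1000, 950>
    <T1, B, 2000, 2050>
    <T1 commit>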

Transaction Concept:

- A transaction is a unit of program execution that accesses and possibly updates various data items. - A transaction must start in a consistent database - During transaction execution the database may be temporarily inconsistent. - When the transaction completes successfully (is committed), the database must be consistent. - After a transaction commits, the changes it has made to the database persist, even if there are system failures. - Multiple transactions can execute in parallel - Two main issues to deal with ~ Failures of various kinds, such as hardware failures and system crashes ~ Concurrent execution of multiple transactions

Null Values and Three Valued Logic:

- Any comparison with null returns unknown ~ Example: 5 < null or null <> null or null = null - Three-valued logic using the truth value unknown: ~ OR: (unknown or true) = true, (unknown or false) = unknown, (unknown or unknown) = unknown ~ AND: (true and unknown) = unknown, (false and unknown) = false, (unknown and unknown) = unknown ~ NOT: (not unknown) = unknown ~ "P is unknown" evaluates to true if predicate P evaluates to unknown - Result of where clause predicate is treated as false if it evaluates to unknown
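
Example (a small sketch using the loan relation from these notes): a loan whose amount is null makes the predicate below evaluate to unknown, so that loan is not returned:

    select loan_number from loan where amount > 1000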

Views:

- Consider a person who needs to know a customer's loan number but has no need to see the loan amount. A view provides a mechanism to hide certain data from the view of certain users:
create view v as (select customer_name, loan_number from borrower, loan where borrower.loan_number = loan.loan_number)
- Once a view is defined, the view name can be used to refer to the virtual relation that the view generates: select * from v
- View definition is not the same as creating a new relation: the view expression is substituted into queries during query evaluation:
select * from (select customer_name, loan_number from borrower, loan where borrower.loan_number = loan.loan_number)

kSQL applications:

- Data exploration and discovery. Navigating and browsing through your data in Kafka. - Streaming ETL. Transforming data within the data pipeline to cleanly land in another system. - Real-time Monitoring and Analytics. Tracking and managing infrastructure data feeds, and creating custom alerts. - Anomaly detection. Identifying patterns and spotting anomalies (e.g., fraudulent activities) in real-time data with millisecond latency. - Personalization. Creating data-driven real-time experiences and insights for users. - Sensor data and IoT. Delivering sensor data how and where it needs to be. - Customer 360-view. Comprehensive understanding of your customers across interactions through a variety of channels, where new information is continuously incorporated in real time.

Apache Kafka:

- De-facto standard for real-time large-scale event streaming. - Foundation: distributed and persistent Commit Log.

More on Data Manipulation:

- Delete all accounts at every branch located in the city 'Brooklyn'. delete from account where branch_name in (select branch_name from branch where branch_city = 'Brooklyn')

More on Data Manipulation:

- Delete the record of all accounts with balances below the average at the bank. delete from account where balance < (select avg(balance) from account) - Problem: as we delete tuples from account, the average balance changes - Solution used in SQL: ~ First, compute avg balance and find all tuples to delete ~ Next, delete all tuples found above (without recomputing avg or retesting the tuples)

Drill Down Paths:

- Drilling down has nothing to do with descending a predetermined hierarchy: you can drill down using any attribute drawn from any dimension (e.g., the weekday from the time dimension). - A good data warehouse designer should always be thinking of additional drill-down paths to add to an existing environment. - Example: adding an audit dimension to a fact table. The audit dimension contains indicators of data quality in the fact table, such as "data element out of bounds." - You can devise a standard report to drill down to issues of data quality, including the proportion of questionable data. - By drilling down on data quality, each row of the original report would appear as multiple rows, each with a different data quality indicator

More on Drilling Down:

- Drilling down means adding a row header (a grouping column) to an existing SELECT statement. - E.g., if you're analyzing the sales of products at a manufacturer level, the select list of the query reads SELECT MANUFACTURER, SUM(SALES). - If you wish to drill down on the list of manufacturers to show the brands sold, you add the BRAND row header: SELECT MANUFACTURER, BRAND, SUM(SALES). - The GROUP BY clause in the second query reads GROUP BY MANUFACTURER, BRAND. Row headers and grouping columns are the same thing. - Now each manufacturer row expands into multiple rows listing all the brands sold. - This example is particularly simple because in a star schema, both the manufacturer attribute and the brand attribute exist in the same product dimension table.
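
Spelled out as full queries (a sketch: the fact table sales_fact, the product dimension table product, and the join key product_key are assumed names, not given in the card):

    -- sales by manufacturer
    select manufacturer, sum(sales)
    from sales_fact join product on sales_fact.product_key = product.product_key
    group by manufacturer

    -- drilled down: BRAND added as a row header / grouping column
    select manufacturer, brand, sum(sales)
    from sales_fact join product on sales_fact.product_key = product.product_key
    group by manufacturer, brand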

Event structure in Kafka Topic:

- Each Data Entry (Event) consists of an optional header and a time-stamped Key/Value pair (serialized as a sequence of bytes).

Log entry (event) structure:

- Each Log Data Entry (Event) consists of an optional header and a time-stamped Key/Value pair (serialized as a sequence of bytes).

Data Warehousing:

- Essentially, transforming data from raw "easier to write" to aggregated "easier to read" representation. - The aggregated "easier to read" representation is better for data-driven decision making. - Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business strategies. - Emphasis is on complex, interactive, exploratory analysis of very large datasets created by integrating data from across all parts of an enterprise. ~ Contrast such On-Line Analytic Processing (OLAP) with traditional On-line Transaction Processing (OLTP): mostly long queries, instead of short update transactions.

Znode Metadata:

- Every znode in the ZooKeeper data model keeps a stat structure, which provides the metadata of a znode - A stat includes information about version number, access control list (ACL), timestamp, and data length. - Version number ~ every time the data associated with a znode changes, the znode's version number increases - ACL ~ an authentication mechanism for accessing the znode, which governs read and write operations - Timestamp ~ represents the time elapsed (in milliseconds) since znode creation and modification - Data length ~ the total amount of data stored in a znode.

Race Conditions:

- Example: two different clients concurrently writing the same key to two replicas of a data store using dual writes: ~ All writes are successful, but the data stores record different values for X and become inconsistent.

Partial Failures:

- Failure during the writes: a half-finished set of changes will make your data inconsistent. - A simple email example: first, the client inserts the new message into the recipient's inbox. Then, the client makes a request to increment the unread counter. - If the failure happens between these two writes, the inbox shows the new message but the counter is never incremented, leaving the two pieces of data inconsistent.

Set Comparison:

- Find all branches that have greater assets than some branch located in Brooklyn. select branch_name from branch where assets > some (select assets from branch where branch_city = 'Brooklyn')

Set Operations:

- Find all customers who have a loan, an account, or both: (select customer_name from depositor) union (select customer_name from borrower) - Find all customers who have both a loan and an account: (select customer_name from depositor) intersect (select customer_name from borrower) - Find all customers who have an account but no loan: (select customer_name from depositor) except (select customer_name from borrower)

Nested Queries:

- Find all customers who have both an account and a loan at the bank. select distinct customer_name from borrower where customer_name in (select customer_name from depositor) - Find all customers who have a loan at the bank but do not have an account at the bank. select distinct customer_name from borrower where customer_name not in (select customer_name from depositor)

Aggregate Functions (Cont.):

- Find the average account balance at the Perryridge branch. select avg(balance) from account where branch_name = 'Perryridge'

Group By:

- Find the average balance for each branch. select branch_name, avg (balance) from account group by branch_name - Attributes in select clause outside of aggregate functions must appear in group by list

Grouping, Aggregating and Subqueries:

- Find the branches with the largest average account balance select branch_name from account group by branch_name having avg(balance) >= all (select avg(balance) from account group by branch_name)

Example Query:

- Find the names of all branches that have greater assets than all branches located in Brooklyn. select branch_name from branch where assets > all (select assets from branch where branch_city = 'Brooklyn')

Having Clause:

- Find the names of all branches where the average account balance is more than $600. select branch_name, avg(balance) from account group by branch_name having avg(balance) > 600 - Predicates in the having clause are applied after the formation of groups whereas predicates in the where clause are applied before forming groups select branch_name, avg(balance) from account where branch_name <> 'Brighton' group by branch_name having avg(balance) > 600

Aggregate Functions (Cont.):

- Find the number of tuples in the customer relation. select count (*) from customer - Find the number of depositors in the bank. select count (distinct customer_name) from depositor

Data Streaming:

- Handle changing data: record every change that happens to the data. Do not overwrite data to keep important historic information. ~ Keeping the list of all changes as a persistent log of immutable events

Data Replication:

- Handle multiple data representations: different data consumers (applications) may have different data representation preferences. ~ Data synchronization, data logging.

Creating stream with timestamps:

- If you want to use the value of one of the topic's columns as the Kafka message timestamp, set the TIMESTAMP property in the WITH clause.
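
A minimal sketch (the stream name, columns, and the pageviews topic are assumed for illustration), using a column of the topic's value as the message timestamp:

    CREATE STREAM pageviews_with_ts (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
      WITH (KAFKA_TOPIC='pageviews',
            VALUE_FORMAT='JSON',
            TIMESTAMP='viewtime');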

Commit Log In a Database System:

- In a database system commit logs are called Write Ahead Logs (WALs). - Each write to a database must first be recorded in the WAL before the data is changed in either a table or an index - The first benefit is that it speeds up database writes. Writing to a commit log is faster than writing to more complex data structures, such as a relational table. After the transaction is recorded in the WAL, the changes to tables can happen in memory, and be moved to disk later on. That way, the slow part of writing to the database happens asynchronously. - The second (bigger) benefit is that any database can be recreated from scratch by following the WAL (e.g., for crash recovery, or for streaming live changes to a read-only replica).

Aggregating over time windows:

- In kSQL a record is an immutable representation of an event in time. Each record carries a timestamp, which determines its position on the time axis. - This is the default timestamp that KSQL uses for processing the record. The timestamp is set either by the producer application or by the Apache Kafka® broker, depending on the topic's configuration. Records may be out-of-order within the stream. - Timestamps are used by time-dependent operations, like aggregations and joins. - Next we consider time window semantics in Kafka
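
For example, a tumbling-window aggregation (a sketch: the pageviews stream and its columns are assumed; newer ksqlDB releases additionally require EMIT CHANGES at the end of such a query):

    SELECT userid, COUNT(*) AS pageview_count
    FROM pageviews
    WINDOW TUMBLING (SIZE 1 MINUTE)
    GROUP BY userid;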

Commit Log-based Data Streaming:

- In modern DE data streaming tools, such as Apache Kafka, Commit Logs are essential to handling increased volumes of data, bringing coherence to distributed systems, and providing a common source of truth for microservices architectures. - The concept of Commit Log will help us to understand and master reliable and scalable data streaming.

OLAP Queries:

- Influenced by SQL and by spreadsheets. - A common operation is to aggregate a measure over one or more dimensions. ~ Find total sales. ~ Find total sales for each city, or for each state. ~ Find top five products ranked by total sales. - Roll-up: Aggregating at different levels of a dimension hierarchy. ~ E.g., Given total sales by city, we can roll-up to get sales by state - Drill-down: The inverse of roll-up. ~ E.g., Given total sales by state, can drill-down to get total sales by city. ~ E.g., Can also drill-down on a different dimension to get total sales by product for each state.

Big Picture:

- Integrated data spanning long time periods, often augmented with summary information - Several gigabytes to terabytes common. - Interactive response times expected for complex queries; ad-hoc updates uncommon.

Null Values:

- It is possible for tuples to have a null value, denoted by null, for some of their attributes - null signifies an unknown value or that a value does not exist. - The predicate is null can be used to check for null values. ~ Example: Find all loan numbers which appear in the loan relation with null values for amount. select loan_number from loan where amount is null - The result of any arithmetic expression involving null is null ~ Example: 5 + null returns null - However, aggregate functions simply ignore nulls

Dual Writing:

- It is the responsibility of your application code to update data in all the appropriate places. - E.g., if a user submits some data, some code in your application should write the data to your database, then invalidate or refresh the appropriate cache entries, then re-index the document in your full-text search index, and so on ...
Problems with dual writing: - Race conditions - Partial failures

kSQL Architecture:

- KSQL Server runs the KSQL engine that executes KSQL queries for processing, reading, and writing data to and from the target Kafka cluster. Servers form KSQL clusters and can run in containers, virtual machines, and bare-metal machines. - You can add and remove servers to/from the same KSQL cluster without restarting your applications to elastically scale KSQL's processing capacity. You can also deploy different KSQL clusters to achieve workload isolation. - The KSQL command line interface (CLI) acts as a client to the KSQL server. You may also configure KSQL servers to run in a non-interactive "headless" configuration (e.g., for production scenarios), thereby preventing KSQL CLI access. - KSQL servers, clients, queries, and applications run outside of Kafka brokers, in separate JVM instances, or in separate clusters entirely.
Main components: ~ KSQL engine: processes KSQL statements and queries ~ REST interface: enables client access to the engine ~ KSQL CLI: console that provides a command-line interface (CLI) to the engine ~ KSQL UI: enables developing KSQL applications in Confluent Control Center

Aggregating Data in kSQL tables:

- KSQL supports several aggregate functions, like COUNT and SUM. You can use these to build stateful aggregates on streaming data. - The result of an aggregate query in KSQL is always a table, because KSQL computes the aggregate for each key, and possibly for each window per key, and updates these results as it processes new input data for a key. - Query: count the number of pageviews per user based on another stream, pageviews2:
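
A sketch of such a query (the output table name pageviews_per_user is an assumed name):

    CREATE TABLE pageviews_per_user AS
      SELECT userid, COUNT(*) AS numviews
      FROM pageviews2
      GROUP BY userid;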

Kafka and ZooKeeper:

- Kafka brokers are managed by ZooKeeper - ZooKeeper notifies producers and consumers about existence of a new broker or failure of a broker - Producers push data to brokers. If a new broker starts to work, producers start sending messages to it - Consumers pull data from brokers

Leaders and Followers:

- Leader ~ elected when the service starts ~ performs automatic recovery if any of the connected nodes fails ~ responsible for processing write requests - Follower ~ Follows the leader's instructions. Forwards write requests from the clients to the leader - Request processor ~ Present only in the leader node. It governs write requests sent to the leader from the follower nodes - Atomic broadcasts ~ Broadcasting the changes from the leader to the followers - Write ~ Handled by the leader. The leader forwards the write request to all the nodes. If a majority of the nodes reply, the write process is complete - Read ~ Performed by the node the client is connected to, so there is no need to interact with other nodes in the ensemble - Replicated database ~ Stores ZooKeeper data. Each node has its own database and it is replicated across all nodes

kSQL Language Elements:

- Like traditional relational databases, KSQL supports two categories of statements: Data Definition Language (DDL) and Data Manipulation Language (DML). - Data Definition Language (DDL) statements: imperative verbs that define metadata on the KSQL server by adding, changing, or deleting streams and tables. DDL statements modify metadata and don't operate on data. ~ CREATE STREAM ~ CREATE TABLE ~ DROP STREAM ~ DROP TABLE ~ CREATE STREAM AS SELECT (CSAS) ~ CREATE TABLE AS SELECT (CTAS) - Data Manipulation Language (DML) statements: declarative verbs that read and modify data in KSQL streams and tables. DML statements modify data and don't change metadata. The KSQL engine compiles DML statements into Kafka Streams applications, which run on a Kafka cluster like any other Kafka Streams application. ~ SELECT ~ INSERT INTO ~ CREATE STREAM AS SELECT (CSAS) ~ CREATE TABLE AS SELECT (CTAS) - The CSAS and CTAS statements belong to both categories, since they perform a metadata change, like adding a stream, and they manipulate data, by creating a derivative of existing records.

Views and Decision Support:

- OLAP queries are typically aggregate queries ~ Precomputation is essential for interactive response times. ~ The CUBE is in fact a collection of aggregate queries, and precomputation is especially important: lots of work on what is best to precompute given a limited amount of space to store precomputed results. - Warehouses can be thought of as a collection of asynchronously replicated tables and periodically maintained views. ~ Has renewed interest in view maintenance

ZooKeeper Workflow:

- Once the ZooKeeper ensemble starts, clients can connect to one of its nodes (leader or follower) - When the client is connected, the node creates a session for it and sends an acknowledgement to the client - If there is no acknowledgment from the node, the client tries to connect to another node - After creating a connection between a client and a node, the client sends regular heartbeat messages to the node to ensure that the connection is alive - The client sends a read request to the node with a znode path if it wants to read a particular znode - If a client wants to store data in the ensemble, it sends the znode path and the data to the server. The connected server will forward the request to the leader and then the leader reissues the writing request to all the followers. If the majority of nodes (quorum) respond successfully, then the write request will be executed, and a successful return code will be sent to the client

Checkpoints:

- Problems in recovery procedure: ~ searching the entire log is time-consuming ~ we might unnecessarily redo transactions which have already output their updates to the database. - Streamline recovery procedure by periodically performing checkpointing 1. Output all log records currently residing in main memory onto stable storage. 2. Output all modified buffer blocks to the disk. 3. Write a log record <checkpoint> onto stable storage

Kafka Workflow:

- Producers send messages to a topic regularly - The broker stores all messages in the partitions configured for that topic. The messages are shared between partitions - Consumers subscribe to a topic. There might be a group of consumers with the same group id. - The broker provides an offset of the topic to the consumers. It also saves the offset in the ZooKeeper cluster. - Kafka sends messages to the consumers once it receives them from producers - The consumer sends an acknowledgement to the broker once the message is processed - Kafka changes the offset of the topic to a new message and updates it in ZooKeeper after receiving an acknowledgement - The above process continues until the consumer stops requesting

General Kafka Architecture:

- Producers send messages to Brokers - Brokers receive and store Messages - A Kafka Cluster can have many Brokers - Each Broker manages multiple Partitions
Partition Management: - Messages of topics are spread across Partitions - Partitions are spread across Brokers - Each Partition is stored on the Broker's disk - Partitions are Commit Log files - Each message in the Log is identified by an Offset

More on Data Manipulation:

- Provide as a gift for all loan customers of the Perryridge branch a $200 savings account. Let the loan number serve as the account number for the new savings account. insert into account select loan_number, branch_name, 200 from loan where branch_name = 'Perryridge' insert into depositor select customer_name, loan_number from loan, borrower where branch_name = 'Perryridge' and loan.loan_number = borrower.loan_number

Recovery Algorithms:

- Recovery algorithms are techniques to ensure database consistency and transaction atomicity and durability despite failure - Recovery algorithms have two parts 1. Actions taken during normal transaction processing to ensure enough information exists to recover from failures 2. Actions taken after a failure to recover the database contents to a state that ensures atomicity, consistency and durability

Recovery Procedure:

- Recovery procedure has two operations: ~ undo(Ti) restores the value of all data items updated by Ti to their old values, going backwards from the last log record for Ti ~ redo(Ti) sets the value of all data items updated by Ti to the new values, going forward from the first log record for Ti - When recovering after failure: ~ Transaction Ti needs to be undone if the log contains the record <Ti start>, but does not contain the record <Ti commit>. ~ Transaction Ti needs to be redone if the log contains both the record <Ti start> and the record <Ti commit>. - Undo operations are performed first, then redo operations.

Derived Relations:

- SQL allows a subquery expression to be used in the from clause - Find the average account balance of those branches where the average account balance is greater than $1200. select branch_name, avg_balance from (select branch_name, avg(balance) from account group by branch_name) as branch_avg (branch_name, avg_balance) where avg_balance > 1200 - Note that we do not need to use the having clause, since we compute the temporary (view) relation branch_avg in the from clause, and the attributes of branch_avg can be used directly in the where clause.

String Operations:

- SQL includes a string-matching operator for comparisons on character strings. The operator "like" uses patterns that are described using two special characters: ~ percent (%). The % character matches any substring. ~ underscore (_). The _ character matches any character. - Find the names of all customers whose street includes the substring "Main". select customer_name from customer where customer_street like '%Main%' - Match the name "Main%" like 'Main\%' escape '\' - SQL supports a variety of string operations such as ~ concatenation (using "||") ~ converting from upper to lower case (and vice versa) ~ finding string length, extracting substrings, etc.

Basic Query Structure:

- SQL is based on set and relational operations with certain modifications and enhancements - A typical SQL query has the form: select A1, A2, ..., An from r1, r2, ..., rm where P - Ai represents an attribute - ri represents a relation - P is a predicate - The result of an SQL query is a relation.

Warehousing Issues:

- Semantic Integration: When getting data from multiple sources, must eliminate mismatches, e.g., different currencies, schemas - Heterogeneous Sources: Must access data from a variety of source formats and repositories. ~ Replication capabilities can be exploited here. - Load, Refresh, Purge: Must load data, periodically refresh it, and purge too-old data. - Metadata Management: Must keep track of source, loading time, and other information for all data in the warehouse

Data Snapshot vs Data Stream:

- Snapshot: overwrite and lose the old value with the new value when it changes - Stream: individually record every change that happens to the database - LOG: a complete copy of the ENTIRE database as it evolved in time - The truth is the LOG; the database is a cache of a subset of the LOG! - Even if the customer changes the quantity to 2, it is still true that at a previous point in time, the selected quantity was 3 - If you overwrite data in your database, you lose this historic information - Keeping the list of all changes as a log of immutable events gives you richer information than if you overwrite things in the database

Failure with Loss of Nonvolatile Storage:

- So far we assumed no loss of non-volatile storage - Technique similar to checkpointing used to deal with loss of non-volatile storage - Can be extended to allow transactions to be active during dump; known as fuzzy dump or online dump

Log-based Consistency:

- Store all your writes in a fixed order, and apply them in that fixed order to the various places they need to go. - Whenever anyone wants to write some data, we append that write to the end of a sequence of records: commit log.

kSQL:

- Streaming SQL engine for Apache Kafka®. - Scalable, elastic, fault-tolerant, and real-time. - Supports a wide range of streaming operations: data filtering, transformations, aggregations, joins, windowing, and sessionization. - You can use KSQL to build event streaming applications from Apache Kafka® topics by using only SQL statements and queries. - KSQL is built on Kafka Streams, a robust stream processing framework that is part of Apache Kafka. You can use KSQL and Kafka Streams together in your event streaming applications. - You can implement custom logic and aggregations in your KSQL applications by implementing User Defined Functions in Java.

View Materialization (Precomputation):

- Suppose we precompute RegionalSales and store it. - Then, the previous query can be answered more efficiently, since the view expression does not have to be substituted into the query and re-evaluated against the base tables.
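
A minimal sketch of precomputing such a view (materialized-view syntax varies by DBMS, and the Sales/Locations schema shown here is hypothetical, not given in the card):

    create materialized view RegionalSales as
      select L.state, sum(S.sales) as total_sales
      from Sales S join Locations L on S.locid = L.locid
      group by L.state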

Concept of Data Model:

- The aim of data modeling is to represent data in such a way that it is easier to consume and process. - Different data consumers (applications) may have different data representation preferences.

Aggregate Navigator:

- The data warehouse must support drilling down at the user interface level with the most atomic data possible because the most atomic data is the most dimensional. - The atomic data must be in the same schema format as any aggregated form of the data. - An aggregated fact table (materialized view) is a derived table of summary records. - Aggregated fact tables (materialized views) offer notable performance advantages compared to using the large, atomic fact tables. But you get this performance boost only when the user asks for an aggregated result. - A modern data warehouse environment uses a query-rewrite facility called an aggregate navigator to choose a prebuilt aggregate table whenever possible - Each time the end user asks for a new drill-down path, the aggregate navigator decides in real time which aggregate fact table will support the query most efficiently. - Whenever the user asks for a sufficiently precise and unexpected drill down, the aggregate navigator gracefully defaults to the atomic data layer.

From Clause, Cartesian Product, and Natural Join:

- The from clause lists the relations involved in the query - Find all pair-wise tuple combinations of borrower and loan (Cartesian Product operation): select * from borrower, loan - Find the name, loan number, branch name and loan amount of all customers (Natural Join operation): select customer_name, borrower.loan_number, branch_name, amount from borrower, loan where borrower.loan_number = loan.loan_number

Log-based Database Modification:

- The log-based database modification scheme allows database updates of an uncommitted transaction - Since undoing may be needed, update logs must have both the old value and the new value - Output of updated blocks to the disk can take place at any time before or after transaction commit - The order in which blocks are output can be different from the order in which they are written to the buffer - We assume that the log record is output to stable storage - The Write-Ahead Logging (WAL) Protocol: 1. Must force the log record for an update to stable storage before the corresponding data page is output to disk. 2. Must force all log records for a Xact to stable storage before commit. ● #1 guarantees Atomicity. ● #2 guarantees Durability.

Creating stream with a specific key:

- The previous KSQL statement makes no assumptions about the message key. If the value of the message key in the topic is the same as the value of the columns defined in the stream, you can specify the key in the WITH clause of the CREATE STREAM statement.
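
A sketch (the stream name and columns are assumed), declaring that the Kafka message key carries the same value as the userid column:

    CREATE STREAM pageviews_keyed (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
      WITH (KAFKA_TOPIC='pageviews',
            VALUE_FORMAT='JSON',
            KEY='userid');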

Bigger Picture:

- The raw events are the form in which it's convenient to write the data - The preprocessed/aggregated data is the form in which it's convenient to read and understand the data. - When you're reading, you can get the best performance if the history of changes has already been "squashed" together in an aggregated representation of the current state. - The form of data convenient/optimized for writing is not the same as the form convenient/optimized for reading/interpreting/decision making. - It makes sense to separate the way you write to your database from the way you read from it.
Raw event: you liked a Facebook post and it has been written to a database. Aggregated summary of events: when you view a Facebook post, thousands of such events have been aggregated into the post you read.

Select Clause:

- The select clause lists the attributes desired in the result of a query - Find the names of all branches in the loan relation: select branch_name from loan - SQL names are case insensitive (i.e., you may use upper- or lower-case letters.) - SQL allows duplicates in relations as well as in query results. - To force the elimination of duplicates, insert the keyword distinct after select. - Find the names of all branches in the loan relation, and remove duplicates: select distinct branch_name from loan - The keyword all specifies that duplicates not be removed: select all branch_name from loan - An asterisk in the select clause denotes "all attributes": select * from loan - The select clause can contain arithmetic expressions involving the operations +, -, *, and /, and operating on constants or attributes of tuples. - select loan_number, branch_name, amount * 100 from loan would return a relation that is the same as the loan relation, except that the value of the attribute amount is multiplied by 100

Where Clause:

- The where clause specifies conditions that the result must satisfy - To find all loan numbers for loans made at the Perryridge branch with loan amounts greater than $1200: select loan_number from loan where branch_name = 'Perryridge' and amount > 1200 - Comparison results can be combined using the logical connectives and, or, and not.

Example of Fund Transfer:

- Transaction to transfer $50 from account A to account B: 1. read(A) 2. A := A - 50 3. write(A) 4. read(B) 5. B := B + 50 6. write(B) - Atomicity requirement: if the transaction fails after step 3 and before step 6, the system should ensure that its updates are not reflected in the database, else an inconsistency will result. - Consistency requirement: the sum of A and B is unchanged by the execution of the transaction
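
The same transfer expressed in SQL (a sketch, assuming the account(account_number, branch_name, balance) relation used elsewhere in these notes; 'A' and 'B' are hypothetical account numbers):

    start transaction;
    update account set balance = balance - 50 where account_number = 'A';
    update account set balance = balance + 50 where account_number = 'B';
    commit;
    -- if a failure occurs before commit, the DBMS rolls back both updates (atomicity);
    -- once commit returns, both updates survive system failures (durability)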

Storage Structure:

- Volatile storage: ~ does not survive system crashes ~ examples: main memory, cache memory - Nonvolatile storage: ~ survives system crashes ~ examples: disk, tape, flash memory, non-volatile (battery backed up) RAM - Stable storage: ~ a mythical form of storage that survives all failures ~ approximated by maintaining multiple copies on distinct nonvolatile media

Issues in View Materialization:

- What views should we materialize, and what indexes should we build on the precomputed results? - Given a query and a set of materialized views, can we use the materialized views to answer the query? - How frequently should we refresh materialized views to make them consistent with the underlying tables? (And how can we do this incrementally?)

ZooKeeper Data Model: Hierarchical Namespace:

- The ZooKeeper file system uses a tree structure for in-memory representation. - Every ZooKeeper node is called a znode, identified by a name and addressed by a slash-separated (/) path - At the highest level, there is a root znode. Under the root, there are two logical namespaces, config and workers - The config namespace is used for centralized configuration management - The workers namespace is used for naming
Znode Types: - Persistence znode (default) ~ alive even after the client that created this znode is disconnected. - Ephemeral znode ~ will be deleted from the ensemble once its client gets disconnected; helpful in leader election - Sequential znode ~ can be either persistent or ephemeral; helpful in locking and synchronization

What is ZooKeeper?

- ZooKeeper is a distributed coordination service, i.e., it enables coordination tasks in distributed applications - A distributed application runs on multiple systems connected to each other through a network, working together to complete a task efficiently - The more systems are involved, the less time is required to finish a complex task in a distributed environment, provided we benefit from the computing capabilities of all systems - ZooKeeper itself is a distributed application, which provides services for nodes in a distributed application

Commit Log:

- A sequence of records, each with its own unique identifier. You can add a record by appending it to the end of the log - However, you cannot change an existing record: once written, records are immutable. - Commit logs solve a core problem in software development: they provide a source of truth for what has happened in a system and in what order.

commit log:

- totally ordered - append-only (never modify existing records, only add new records at the end) - persistent (store it durably on disk). - If you do all your writes sequentially, you remove the potential for race conditions. - If you keep the order in which you make your writes, it becomes much easier to recover from partial failures.

Deleting kSQL stream:

- The DROP STREAM statement deletes a stream. If you created the stream by using CREATE STREAM AS SELECT, you must first terminate the corresponding persistent query.
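
For example (the stream name and query ID are hypothetical; SHOW QUERIES lists the actual query ID):

    TERMINATE CSAS_PAGEVIEWS_FILTERED_0;  -- stop the persistent query writing to the stream
    DROP STREAM pageviews_filtered;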

Timestamp assignment:

A record's timestamp is set either by the record's producer or by the Kafka broker, depending on the topic's timestamp configuration. The topic's message.timestamp.type setting can be either CreateTime or LogAppendTime. - CreateTime. The broker uses the record's timestamp as set by the producer. This setting enforces event-time semantics - LogAppendTime. The broker overwrites the record's timestamp with the broker's local time when it appends the record to the topic's log. This setting enforces ingestion-time semantics. If LogAppendTime is configured, the producer has no control over the timestamp - KSQL doesn't support processing-time operations directly, but you can implement user-defined functions (UDFs) that access the current time. - By default, when KSQL imports a topic to create a stream, it uses the record's timestamp, but you can add the WITH(TIMESTAMP='some-field') clause to use a different field from the record's value as the timestamp. The optional TIMESTAMP_FORMAT property indicates how KSQL should parse the field. The field you specify can be an event-time or an ingestion-time.
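
A sketch combining TIMESTAMP with TIMESTAMP_FORMAT for a string-valued field (the stream, its columns, and the format pattern are assumed for illustration):

    CREATE STREAM orders (order_id VARCHAR, order_ts VARCHAR, amount DOUBLE)
      WITH (KAFKA_TOPIC='orders',
            VALUE_FORMAT='JSON',
            TIMESTAMP='order_ts',
            TIMESTAMP_FORMAT='yyyy-MM-dd HH:mm:ss');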

Commit Log:

A source of truth to understand why something went wrong - A Commit Log is a sequence of records, each with its own unique identifier. You can add a record by appending it to the end of the log. However, you cannot change an existing record: once written, records are immutable. - Commit logs solve a core problem in software development: they provide a source of truth for what has happened in a system and in what order. - Example: In a relational database commit logs are called Write Ahead Logs (WALs). Each write to a database must first be recorded in the WAL before the data is changed in either a table or an index. - The first benefit is that it speeds up database writes. Writing to a commit log is faster than writing to more complex data structures, such as a relational table. After the transaction is recorded in the WAL, the changes to tables can happen in memory, and be moved to disk later on. That way, the slow part of writing to the database happens asynchronously. - The second (bigger) benefit is that any database can be recreated from scratch by following the WAL (e.g., for crash recovery, or for streaming live changes to a read-only replica).

CREATE STREAM:

A stream associates a schema with an underlying Kafka topic. - Create streams from Apache Kafka® topics: CREATE STREAM - Create streams of query results from other streams: CREATE STREAM AS SELECT - To create a stream from a Kafka topic, the topic must already exist in your Kafka cluster.
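
A sketch of registering a stream on an existing topic (the pageviews topic, its columns, and the value format are assumed for illustration):

    CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
      WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');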

Desired Behavior:

Atomicity: transactions may abort; roll them back ("Undo"). Durability: what if the DBMS stops running after the user has been notified of a successful update? ("Redo") - Desired behavior after the system restarts: ~ T1, T2 & T3 should be durable (effects persist: redo them). ~ T4 & T5 should be aborted (effects not seen: undo them).

Publish/subscribe pattern:

Common pattern with message-based data systems. Data flow starts with a producer publishing a message to a broker. The messages are often sent to a topic, which you can think of as a logical grouping for messages. Next, the message is sent to all the consumers subscribing to that topic. Some technologies push messages to consumers; with other technologies, the consumer pulls messages from the brokers. A producer publishing a message doesn't need to subscribe to a topic. Nor is it required that a subscriber produce a message. By using the publish/subscribe pattern we decouple the sender of the data from its consumer, producing large-scale data streams.

Continuous streaming query on a table:

Create a continuous streaming query on the users table by using the SELECT statement: The table values update continuously with the most recent records, since the underlying users topic receives new messages continuously.
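
A sketch of such a query (the column names are assumed; in classic KSQL this runs as a continuous push query, while newer ksqlDB releases require appending EMIT CHANGES):

    SELECT userid, regionid, gender FROM users;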

CREATE TABLE:

Create tables from Apache Kafka® topics: CREATE TABLE. The topic must exist already in your Kafka cluster; Create tables of query results from other tables or streams: CREATE TABLE AS SELECT.
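
A sketch (the users topic, its columns, and the key column are assumed; a KSQL table needs a key, declared here with the KEY property):

    CREATE TABLE users (registertime BIGINT, userid VARCHAR, regionid VARCHAR, gender VARCHAR)
      WITH (KAFKA_TOPIC='users', VALUE_FORMAT='JSON', KEY='userid');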

CREATE STREAM AS SELECT:

Creates a stream that contains the results from a SELECT query. KSQL persists the SELECT query results into a corresponding new topic. A stream created this way represents a persistent, continuous, streaming query, which means that it runs until you stop it explicitly.
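
A sketch (the stream names and filter condition are assumed), which creates a new stream and its backing topic, continuously populated by the SELECT:

    CREATE STREAM pageviews_filtered AS
      SELECT userid, pageid
      FROM pageviews
      WHERE pageid <> 'home';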

Partition leadership and replication:

Producer Guarantees: - Write to a leader, with no acknowledgement (At Most Once) - Write to a leader with acknowledgement (At Least Once) - Write to a leader and propagate to all followers (Exactly Once)
Partition Reconfiguration: - In case of Broker failure, partition leaders will be re-assigned

Time semantics:

Timestamps have different meanings, depending on the implementation. A record's timestamp can refer to the time when the event occurred, or when the record was ingested into Kafka, or when the record was processed. - Event-time. The time when a record is created by the data source. E.g., if the record is a geo-location change reported by a GPS sensor in a car, the event-time is the time when the GPS sensor captured the location change - Ingestion-time. The time when a record is stored in a topic partition by a Kafka broker. Ingestion-time is similar to event-time, as a timestamp is embedded in the record, but the ingestion timestamp is generated when the Kafka broker appends the record to the target topic. Ingestion-time can approximate event-time if the time difference between the creation of the record and its ingestion into Kafka is small. Consider using ingestion-time when the producer can't assign timestamps directly, like when it doesn't have access to a local clock. - Processing-time. The time when the record is consumed by a stream processing application. The processing-time can occur immediately after ingestion-time, or it may be delayed by milliseconds, hours, or days. E.g., imagine an analytics application that reads and processes the geo-location data reported from car sensors, and presents it to a fleet-management dashboard. In this case, processing-time in the analytics application might be many minutes or hours after the event-time, as cars can move out of mobile reception for periods of time and have to buffer records locally You should not mix streams or tables that have different time semantics.

ACID Properties:

To preserve the integrity of data during transaction execution the database system must ensure: - Atomicity. Either all operations of the transaction are properly reflected in the database or none are. - Consistency. Execution of a transaction in isolation preserves the consistency of the database. - Isolation. Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions. Intermediate transaction results must be hidden from other concurrently executed transactions. ~ That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished. - Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.

Different data representations used in combination with one another:

Typically, data in different forms is spread across the business using different tools. Unfortunately, it is not always possible to keep things simple: there is no single database or tool that can do everything a business application requires.

