IS471 - Exam 2 (Chapter 4, 5, 6)
Soft State (BASE)
- Database may be in an inconsistent state when data is read; thus, the results may change if the same data is requested again - Data could be updated for consistency, even though no user has written to the database between the two reads - Closely related to eventual consistency
Consistency (CAP Theorem)
- A read from any node results in the same data across multiple nodes
Availability (CAP Theorem)
- A read/write request will always be acknowledged in the form of a success or a failure
Map and Reduce Tasks
- A single processing run of the MapReduce processing engine is known as a MapReduce job - Composed of a map task and a reduce task with each having multiple stages
Batch
- AKA Offline processing, involves processing data in batches and usually imposes delays, which in turn results in high-latency responses - Typically involves large quantities of data with sequential reads/writes and comprises groups of read or write queries - Queries can be complex and involve multiple joins - OLAP systems commonly process workloads in batches - Strategic BI and analytics are batch-oriented as they are highly read-intensive tasks involving large volumes of data - In sum: comprises grouped read/writes that have a large data footprint, may contain complex joins, and provides high-latency responses
Transactional
- AKA Online processing, follows an approach whereby data is processed interactively without delay, resulting in low-latency responses - Involves small amounts of data with random reads and writes - OLTP and operational systems fall within this category - Although these workloads can contain a mix of read/write queries, they are generally more write-intensive than read-intensive - Comprise random reads/writes that involve fewer joins than business intelligence and reporting workloads - Given their online nature and operational significance to the enterprise, they require low-latency responses with a smaller data footprint
Peer-to-peer
- All nodes operate at the same level, there is not a master-slave relationship between the nodes - Each node, known as peer, is equally capable of handling reads and writes - Each write is copied to all peers - Prone to write inconsistencies that occur as a result of a simultaneous update of the same data across multiple peers - Addressed by implementing either a pessimistic or optimistic concurrency strategy
ACID vs BASE
- BASE emphasizes availability over immediate consistency - ACID ensures immediate consistency at the expense of availability due to record locking - The soft approach towards consistency allows BASE-compliant databases to serve multiple clients without any latency, albeit potentially serving inconsistent results - However, BASE-compliant databases are not useful for transactional systems where lack of consistency is a concern
Data Visualization for Big Data
- Big Data solutions require data visualization tools that can seamlessly connect to structured, semi-structured, and unstructured data sources and are further capable of handling millions of data records - Generally use in-memory analytical technologies that reduce the latency normally attributed to traditional, disk-based data visualization tools - Advanced data visualization tools incorporate predictive and prescriptive data analytics and data transformation features - These tools eliminate the need for data pre-processing methods, like ETL - Tools also provide the ability to directly connect to structured, semi-structured, and unstructured data sources - As part of Big Data solutions, advanced data visualization tools can join structured and unstructured data that is kept in memory for fast data access - Queries and statistical formulas can then be applied as part of various data analysis tasks for viewing data in a user-friendly format, such as on a dashboard
Big Data BI
- Builds upon traditional BI by acting on the cleansed, consolidated enterprise-wide data in the data warehouse and combining it with semi-structured and unstructured data sources - Comprises both predictive and prescriptive analytics to facilitate the development of an enterprise-wide understanding of business performance - Focuses on multiple business processes simultaneously - Helps reveal patterns and anomalies across a broader scope within the enterprise - Leads to data discovery by identifying insights and information that may have been previously absent or unknown - Requires the analysis of unstructured, semi-structured, and structured data residing in the enterprise data warehouse - Then requires a "next-generation" data warehouse that uses new features and technologies to store cleansed data originating from a variety of sources in a single uniform data format - Coupling a traditional data warehouse with these new technologies results in a hybrid data warehouse - Acts as a uniform and central repository of structured, semi-structured, and unstructured data that can provide Big Data BI tools with all of the required data - Eliminates the need for Big Data BI tools to have to connect to multiple data sources to retrieve or access data - Next-generation data warehouse establishes a standardized data access layer across a range of data sources
Data Warehouses
- Central, enterprise-wide repository consisting of historical and current data - Heavily used by BI to run various analytical queries; interfaced with an OLAP system to support multi-dimensional analytical queries - Data from operational systems is periodically extracted, validated, transformed, and consolidated into a single denormalized database - The amount of data contained will continue to increase -> leading to a slower query response time for data analysis tasks over time - To solve this, data warehouses usually contain optimized databases, called Analytical Databases - Analytical Database can exist as a separate DBMS
Distributed Data Processing
- Closely related to parallel data processing in that the same principle of "divide-and-conquer" is applied - Always achieved through physically separate machines that are networked together as a cluster - Eg: a task is divided into three sub-tasks that are then executed on three different machines sharing one physical switch
Innovative storage strategies and technologies to store Big Data datasets
- Clusters - File Systems and Distributed File Systems - NoSQL - Sharding - Replication - CAP theorem - ACID - BASE
Traditional Data Visualization
- Data Visualization is a technique whereby analytical results are graphically communicated using elements like charts, maps, data grids, infographics, and alerts - Graphically representing data can make it easier to understand reports, view trends, and identify patterns - Traditional Data Visualization provides mostly static charts and graphics in reports and dashboards, whereas contemporary data visualization tools are interactive and can provide both summarized and detailed views of data - Designed to help people who lack statistical and/or mathematical skills to better understand analytical results without having to resort to spreadsheets - Traditional data visualization tools query data from relational databases, OLAP systems, data warehouses and spreadsheets to present both descriptive and diagnostic analytics results
Processing in Batch Mode
- Data is processed offline in batches and the response time could vary from minutes to hours - Data must be persisted to the disk before it can be processed - Generally involves processing a range of large datasets, either on their own or joined together, essentially addressing the volume and variety characteristics of Big Data datasets - Majority of Big Data processing occurs in batch mode - Relatively simple, easy to set up and low in cost compared to realtime mode - Strategic BI, predictive and prescriptive analytics and ETL operations are commonly batch-oriented
Map (MapReduce Map Tasks)
- Divided into multiple smaller splits - Each split is parsed into its constituent records as a key-value pair - The key is the ordinal position of the record, the value is the actual record - Parsed key-value pairs for each split are then sent to a map function or mapper, with one mapper function per split - Map function executes user-defined logic - Each split generally contains multiple key-value pairs, and the mapper is run once for each key-value pair in the split - Mapper processes each key-value pair as per the user-defined logic and further generates a key-value pair as its output - The output key can either be the same as the input key or a substring value from the input value, or another serializable user-defined object - Output value can either be the same as the input value or a substring value from the input value, or another serializable user-defined object - When all records of the split have been processed, the output is a list of key-value pairs where multiple key-value pairs can exist for the same key
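- A minimal word-count mapper sketch in Python (word count is an assumed example, not taken from these notes); the input key is the record's ordinal position, the value is the line text, and the output is a list of key-value pairs:

    # Hypothetical word-count mapper: key = record position, value = line of text
    def mapper(key, value):
        # Emit one (word, 1) pair per word; the same output key may appear multiple times
        return [(word, 1) for word in value.split()]

    # mapper(0, "the quick brown the") -> [("the", 1), ("quick", 1), ("brown", 1), ("the", 1)]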
Complex Event Processing
- During CEP, a number of realtime events often coming from disparate sources and arriving at different time intervals are analyzed simultaneously for the detection of patterns and initiation of action - Rule-based algorithms and statistical techniques are applied, taking into account business logic and process context to discover cross-cutting and complex event patterns - CEP focuses more on complexity, providing rich analytics - However, speed of execution may be adversely affected - CEP is considered to be a superset of ESP and often the output of ESP results in the generation of synthetic events that can be fed into CEP
Event Stream Processing
- During ESP, an incoming stream of events, generally from a single source and ordered by time, is continuously analyzed - Analysis can occur via simple queries or the application of algorithms that are mostly formula-based - Analysis takes place in-memory before storing the events to an on-disk storage device - ESP focuses more on speed than complexity; the operation to be executed is comparatively simple to aid faster execution
Shuffle and Sort (MapReduce Reduce Tasks)
- During the first stage of the reduce task, output from all partitioners is copied across the network to the nodes running the reduce task (this is known as shuffling) - List-based key-value output from each partitioner can contain the same key multiple times - Next, the MR engine automatically groups and sorts the key-value pairs according to the keys so that the output contains a sorted list of all input keys and their values with the same keys appearing together - The way in which keys are grouped and sorted can be customized - This merge creates a single key-value pair per group, where the key is the group key and the value is the list of all group values
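- A minimal in-memory sketch of the grouping performed during shuffle and sort (assumes the pairs from all partitioners have already been copied into one list):

    from collections import defaultdict

    def shuffle_and_sort(pairs):
        # Group all values by key, then sort by key so that the same keys appear together
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return sorted(groups.items())

    # [("the", 1), ("quick", 1), ("the", 1)] -> [("quick", [1]), ("the", [1, 1])]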
Combining Sharding and Peer-to-Peer Replication
- Each shard is replicated to multiple peers, each peer is only responsible for a subset of the overall dataset - Helps achieve increased scalability and fault tolerance - No master involved, so no single point of failure and fault-tolerance for both read and write operations is supported
Atomicity (ACID)
- Ensures that all operations will always succeed or fail completely - No partial transactions
Consistency (ACID)
- Ensures that the database will always remain in a consistent state by ensuring that only data that conforms to the constraints of the database schema can be written to the database - A database that is in a consistent state will remain in a consistent state following a successful transaction
Isolation (ACID)
- Ensures that the results of a transaction are not visible to other operations until it is complete
Durability (ACID)
- Ensures that the results of an operation are permanent - Once a transaction has been committed, it cannot be rolled back - Irrespective of any system failure
Distributed File System
- File system that can store large files spread across the nodes of a cluster - Files appear to be local; however, this is only a logical view as physically the files are distributed throughout the cluster - This local view is presented via the distributed file system and it enables the files to be accessed from multiple locations - Eg: Google File System (GFS) and Hadoop Distributed File System (HDFS)
Reduce (MapReduce Reduce Tasks)
- Final stage of the reduce task - The reducer will either further summarize its input or will emit the output without making any changes - For each key-value pair that a reducer receives, the list of values stored in the value part of the pair is processed and another key-value pair is written out - Output key can either be the same as the input key or a substring value from the input value, or another serializable user-defined object - The output value can either be the same as the input value or a substring value from the input value, or another serializable user-defined object - A reducer may not produce any output key-value pair or may generate multiple key-value pairs, like the mapper - Output of the reducer, that is the key-value pairs, is then written out as a separate file--one file per reducer - The number of reducers can be customized; it is also possible to have a MR job without a reducer, for example when performing filtering - Note that the output signature (key-value types) of the map function should match the input signature (key-value types) of the reduce/combine function
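- Continuing the assumed word-count example, a minimal reducer sketch that sums the grouped values for each key:

    def reducer(key, values):
        # Summarize the group by emitting a single (word, total_count) pair
        return (key, sum(values))

    # reducer("the", [1, 1]) -> ("the", 2)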
Combine (MapReduce Map Tasks)
- Generally the output of the map function is handled directly by the reduce function - However, map and reduce tasks are mostly run over different nodes - The data movement can consume a lot of valuable bandwidth and directly contributes to processing latency - With larger datasets, the time taken to move the data between map and reduce stages can exceed the actual processing undertaken by the map and reduce tasks - For this reason, the MR engine provides an optional combine function that summarizes a mapper's output before it gets processed by the reducer - Combiner is essentially a reducer function that locally groups a mapper's output on the same node as the mapper - A reducer function can be used as a combiner function or a custom user-defined function can be used - The combiner stage is only an optimization stage, and may not even be called by the MapReduce engine - Eg: a combiner function will work for finding the largest or smallest number but will not work for finding the average of all numbers (see the sketch below)
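- A small sketch of why a combiner works for sums but not for averages (the split values are hypothetical): summing partial sums gives the correct total, but averaging partial averages does not:

    # Values seen by two hypothetical mappers
    split_a = [10, 20, 30]
    split_b = [40, 50]

    # Sum combines correctly: the sum of partial sums equals the overall sum
    assert sum([sum(split_a), sum(split_b)]) == sum(split_a + split_b)   # 150 == 150

    # Average does not: the average of partial averages differs from the overall average
    avg = lambda xs: sum(xs) / len(xs)
    print(avg([avg(split_a), avg(split_b)]))   # 32.5
    print(avg(split_a + split_b))              # 30.0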
ACID
- Stands for Atomicity, Consistency, Isolation, and Durability - Transaction management style that leverages pessimistic concurrency controls to ensure consistency is maintained through the application of record locks - ACID is the traditional approach to database transaction management as it is leveraged by relational database management systems
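- A minimal sketch of ACID-style transaction behavior using Python's built-in sqlite3 module (the accounts table and values are hypothetical): the transaction either commits fully or is rolled back, so no partial writes survive:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")

    try:
        # Both updates succeed or neither does (atomicity); schema constraints are enforced (consistency)
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")
        conn.commit()     # durability: once committed, the change persists
    except sqlite3.Error:
        conn.rollback()   # atomicity: undo any partial work on failure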
Partition (MapReduce Map Tasks)
- If more than one reducer is involved, a partitioner divides the output from the mapper or combiner into partitions between reducer instances - Number of partitions will equal the number of reducers - Although each partition contains multiple key-value pairs, all records for a particular key are assigned to the same partition - The MR engine guarantees a random and fair distribution between reducers while making sure that all of the same keys across multiple mappers end up with the same reducer instance - Depending on the job, certain reducers can receive a large number of key-value pairs compared to others - An uneven workload means that some reducers will finish earlier than others - This can be rectified by customizing the partitioning logic in order to guarantee a fair distribution of key-value pairs - Last stage of the map task and returns the index of the reducer to which a particular partition should be sent
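- A minimal hash-based partitioner sketch (hash partitioning is an illustrative assumption here, not the exact logic of any specific MapReduce engine):

    import hashlib

    def partition(key, num_reducers):
        # Return the index of the reducer that should receive this key; a stable hash
        # ensures all pairs with the same key reach the same reducer, whichever mapper emitted them
        digest = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
        return digest % num_reducers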
Processing in Realtime Mode
- In realtime mode, data is processed in-memory as it is captured before being persisted to the disk - Response time generally ranges from a sub-second to under a minute - Realtime mode addresses the velocity characteristic of Big Data datasets - Realtime processing is also called event or stream processing as the data either arrives continuously or at intervals - Interactive mode generally refers to query processing in realtime - Operational BI/analytics are generally conducted in realtime mode - Principle related to Big Data processing is called the Speed, Consistency, and Volume (SCV) principle
Data Wrangling
- Includes steps to filter, cleanse, and otherwise prepare the data for downstream analysis - From a storage perspective, a copy of the data is first stored in its acquired format, and, after wrangling, the prepared data needs to be stored again - Typically, storage is required whenever the following occurs: -> external datasets are acquired, or internal data will be used in a Big Data environment -> data is manipulated to be made amenable for data analysis -> data is processed via an ETL activity, or output is generated as a result of an analytical operation
Parallel Data Processing
- Involves the simultaneous execution of multiple sub-tasks that collectively comprise a larger task - Goal is to reduce the execution time by dividing a single larger task into multiple smaller tasks that run concurrently - Although parallel data processing can be achieved through multiple networked machines, it is more typically achieved within the confines of a single machine with multiple processors or cores
Realtime Big Data Processing and MapReduce
- MapReduce is generally unsuitable for realtime Big Data processing - MapReduce is intended for batch-oriented processing of large amounts of data that has been stored to disk - MapReduce cannot process data incrementally and can only process complete datasets - This is at odds with the requirement for realtime data processing as realtime processing involves data that is often incomplete and continuously arriving via a stream - MapReduce's reduce task cannot generally start before the completion of all map tasks - MapReduce is generally not useful for realtime processing, especially when hard-realtime constraints are present
File System
- Method of storing and organizing data on a storage device, like a flash drive, DVD, or hard drive - A file is an atomic unit of storage used by the file system to store data - Provides a logical view of the data stored on the storage device and presents it as a tree structure of directories - Operating systems employ file systems to store and retrieve data on behalf of applications - Each operating system provides support for one or more file systems, eg: NTFS on Microsoft Windows
Combining Sharding and Master-Slave Replication
- Multiple shards become slaves of a single master, and the master itself is a shard - Although it results in multiple masters, a single slave-shard can only be managed by a single master-shard - Write consistency is maintained by the master-shard - If a master-shard becomes non-operational or a network outage occurs, fault tolerance with regards to write operations is impacted - Replicas of shards are kept on multiple slave nodes to provide scalability and fault tolerance for read operations
Master-Slave
- Nodes are arranged in a master-slave configuration, and all data is written to a master node - Once saved, the data is replicated over to multiple slave nodes - All external write requests, including insert, update and delete, occur on the master node, whereas read requests can be fulfilled by any slave node - Writes are managed by the master node and data can be read from either Slave A or Slave B - Ideal for read intensive loads rather than write intensive loads since growing read demands can be managed by horizontal scaling to add more slave nodes - Writes are consistent, as all writes are coordinated by the master node - Implication is that write performance will suffer as the amount of writes increases - If the master node fails, reads are still possible via any of the slave nodes - A slave node can be configured as a backup node for the master node - One concern is read inconsistency, which can be an issue if a slave node is read prior to an update to the master being copied to it - To ensure read consistency, a voting system can be implemented where a read is declared consistent if the majority of the slaves contain the same version of the record - Implementation of such voting systems requires a reliable and fast communication mechanism between the slaves
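- A minimal sketch of the voting idea for read consistency (the inputs are hypothetical): a read is declared consistent only if a majority of slaves report the same version of the record:

    from collections import Counter

    def majority_read(slave_values):
        # slave_values: the version of one record as reported by each slave node
        value, votes = Counter(slave_values).most_common(1)[0]
        if votes > len(slave_values) // 2:
            return value                        # majority agrees -> consistent read
        raise RuntimeError("no majority; read is inconsistent")

    # majority_read(["v2", "v2", "v1"]) -> "v2"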
NoSQL
- Not-only SQL (Structured Query Language) database is a non-relational database that is highly scalable, fault-tolerant and specifically designed to house semi-structured and unstructured data - A NoSQL database often provides an API-based query interface that can be called from within an application - NoSQL databases also support query languages other than SQL because SQL was designed to query structured data stored within a relational database - Eg: a NoSQL database that is optimized to store XML files will often use XQuery as the query language - Eg: a NoSQL database designed to store RDF data will use SPARQL to query the relationships it contains - There are some NoSQL databases that also provide an SQL-like query interface
Hadoop
- Open-source framework for large-scale data storage and data processing that is compatible with commodity hardware - Established itself as a de facto industry platform for contemporary Big Data solutions - Can be used as an ETL engine or as an analytics engine for processing large amounts of structured, semi-structured, and unstructured data - Implements the MapReduce processing framework
BASE
- Stands for Basically Available, Soft State, Eventual Consistency - Favors availability over consistency - Database is A+P from a CAP perspective - Leverages optimistic concurrency by relaxing the strong consistency constraints mandated by the ACID properties
Traditional BI
- Primarily utilizes descriptive and diagnostic analytics to provide information on historical and current events - Not "Intelligent" because it only provides answers to correctly formulated questions - Formulating questions correctly requires an understanding of business problems and issues and of the data itself - BI reports on different KPIs through ad-hoc reports and dashboards - Cannot function effectively without data marts because they contain the optimized and segregated data that BI requires for reporting purposes - Without data marts, data needs to be extracted from the data warehouse via an ETL process on an ad-hoc basis whenever a query needs to be run - This increases the time and effort to execute queries and generate reports - Traditional BI uses data warehouses and data marts for reporting and data analysis because they allow complex data analysis queries with multiple joins and aggregations to be issued
Pessimistic Concurrency
- Proactive strategy that prevents inconsistency - Uses locking to ensure that only one update to a record can occur at a time - Detrimental to availability since the database record being updated remains unavailable until all locks are released
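- A minimal sketch of pessimistic concurrency using a lock from Python's threading module (the record dictionary is hypothetical): the record stays unavailable to other writers until the lock is released:

    import threading

    record = {"balance": 100}
    record_lock = threading.Lock()

    def update_balance(delta):
        # Only one writer may hold the lock at a time, preventing conflicting updates up front
        with record_lock:
            record["balance"] += delta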
Sharding
- Process of horizontally partitioning a large dataset into a collection of smaller, more manageable datasets called shards - Shards are distributed across multiple nodes, where a node is a server or a machine - Each shard is stored on a separate node and each node is responsible for only the data stored on it - Each shard shares the same schema, and all shards collectively represent the complete dataset - Sharding is often transparent to the client, but it is not a requirement - Allows the distribution of processing loads across multiple nodes to achieve horizontal scalability - Horizontal scaling is a method for increasing a system's capacity by adding similar or higher capacity resources alongside existing resources - Since each node is responsible for only a part of the whole dataset, read/write times are greatly improved - A benefit of sharding is that it provides partial tolerance towards failures - In case of a node failure, only data stored on that node is affected - With regards to data partitioning, query patterns need to be taken into account so that shards themselves do not become performance bottlenecks - Eg: queries requiring data from multiple shards will impose performance penalties - Data locality keeps commonly accessed data co-located on a single shard and helps counter such performance issues
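- A minimal sketch of hash-based shard routing (the node names and shard key are assumptions): each key maps to exactly one shard, so each node holds only part of the dataset:

    import hashlib

    SHARDS = ["node_a", "node_b", "node_c"]   # hypothetical cluster nodes

    def shard_for(key):
        # Route a record to a shard based on a stable hash of its key
        digest = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
        return SHARDS[digest % len(SHARDS)]

    # shard_for("customer_42") always returns the same node, so all reads/writes
    # for that key touch only one shard (data locality)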
Extract Transform Load (ETL)
- Process of loading data from a source system into a target system - Source system -> database, flat file, application - Target system -> database or some other storage system - Main operation by which data warehouses are fed data - Required data is first obtained or extracted -> then modified and transformed by the application of rules -> finally data is inserted or loaded into the target system
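- A minimal ETL sketch (the file names, table, and transformation rule are hypothetical): rows are extracted from a CSV source, transformed by applying a rule, and loaded into a SQLite target:

    import csv
    import sqlite3

    def etl(source_csv, target_db):
        target = sqlite3.connect(target_db)
        target.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
        with open(source_csv, newline="") as f:
            for row in csv.DictReader(f):                  # extract
                amount = float(row["amount"]) * 1.1        # transform (hypothetical rule)
                target.execute("INSERT INTO sales VALUES (?, ?)",
                               (row["region"], amount))    # load
        target.commit()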
Ad-hoc Reports
- Process that involves manually processing data to produce custom-made reports - Focus is to usually report on a specific area of the business, such as its marketing or SCM - The generated custom reports are detailed and often tabular in nature
Online Analytical Processing (OLAP)
- Processing data analysis queries - Integral part of Business Intelligence, Data Mining, and Machine Learning processes - Relevant to Big Data -> serves as both a data source and a data sink that is capable of receiving data - Used in diagnostic, predictive, and prescriptive analytics - Performs long-running, complex queries against multidimensional databases to perform advanced analytics
Dashboards
- Provide a holistic view of key business areas - The info displayed on dashboards is generated at periodic intervals, or in realtime or near-realtime - Presentation of data on dashboards is graphical in nature, using bar charts, pie charts, and gauges
Cluster (Big Data Processing)
- Provides necessary support to create horizontally scalable storage solutions - Provides the mechanism to enable distributed data processing with linear scalability - Since they are highly scalable, clusters provide an ideal environment for Big Data processing as large datasets can be divided into smaller datasets and then processed in parallel in a distributed manner - Datasets can either be processed in batch mode or realtime mode - Will be comprised of low-cost commodity nodes that collectively provide increased processing capacity - Provide inherent redundancy and fault tolerance, as they consist of physically separate nodes - Redundancy and fault tolerance allow resilient processing and analysis to occur if a network or node failure occurs
Optimistic Concurrency
- Reactive strategy that does not use locking - Allows inconsistency to occur with knowledge that eventually consistency will be achieved after all updates have propagated - Peers may remain inconsistent for some period of time before attaining consistency - Like master-slave replication, reads can be inconsistent during the time period when some of the peers have completed their updates while others perform their updates - Voting system can be implemented where a read is declared consistent if the majority of the peers contain the same version of the record
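- A minimal sketch of optimistic concurrency (the record structure is hypothetical): no lock is taken; the write succeeds only if the record's version has not changed since it was read, otherwise the caller must re-read and retry:

    record = {"version": 7, "balance": 100}   # hypothetical stored record

    def optimistic_update(expected_version, new_balance):
        # Apply the write only if no other client updated the record in the meantime
        if record["version"] != expected_version:
            return False                      # conflict detected: re-read and retry
        record["balance"] = new_balance
        record["version"] += 1
        return True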
Speed Consistency Volume (SCV)
- SCV principle is related to distributed data processing - States that a distributed data processing system can be designed to support only two of the following three requirements - Speed: refers to how quickly the data can be processed once it is generated. In the case of realtime analytics, data is processed comparatively faster than batch analytics. This generally excludes the time taken to capture data and focuses only on the actual data processing, such as generating statistics or executing an algorithm. - Consistency: refers to the accuracy and the precision of the results. Results are deemed accurate if they are close to the correct value and precise if close to each other. A more consistent system will make use of all available data, resulting in high accuracy and precision as compared to a less consistent system that makes use of sampling techniques, which can result in lower accuracy with an acceptable level of precision. - Volume: refers to the amount of data that can be processed. Big Data's velocity characteristic results in fast growing datasets leading to huge volumes of data that need to be processed in a distributed manner. Processing such voluminous data in its entirety while ensuring speed and consistency is not possible.
Online Transaction Processing (OLTP)
- Software system that processes transaction-oriented data - Stores operational data that is normalized - Common source of structured data and serves as input to many analytic processes - Queries comprised of simple insert, delete, and update operations. - Eg: ticket reservation systems, banking, point of sale
SCV Principle
- Speed, Consistency, and Volume
Eventual Consistency (BASE)
- State in which reads by different clients, immediately following a write to the database, may not return consistent results - Only attains consistency once the changes have been propagated to all nodes - While the database is in the process of attaining the state of eventual consistency, it will be in a soft state
Replication
- Stores multiple copies of a dataset, known as replicas, on multiple nodes - Provides scalability and availability due to the fact that the same data is replicated on various nodes - Fault tolerance is also achieved since data redundancy ensures that data is not lost when an individual node fails - Two different methods that are used to implement replication: -> master-slave -> peer-to-peer
Data Marts
- Subset of data stored in a data warehouse, typically belonging to a department, division, or specific line of business - Data warehouses can have multiple data marts - Enterprise-wide data is collected and business entities are then extracted - Domain-specific entities are persisted into the data warehouse via an ETL process
Basically Available (BASE)
- The database will always acknowledge a client's request, either in the form of the requested data or a success/failure notification
Processing Workloads
- The amount and nature of data that is processed within a certain amount of time - Workloads are divided into two types: -> batch -> transactional
Partition Tolerance (CAP Theorem)
- The database system can tolerate communication outages that split the cluster into multiple silos and can still service read/write requests
Sharding and Replication
- Sharding and replication can be combined to improve the limited fault tolerance of sharding while benefiting from the increased availability and scalability of replication
Clusters
- Tightly coupled collection of servers, or nodes - These servers usually have the same hardware specifications and are connected together via a network to work as a single unit - Each node in the cluster has its own dedicated resources, such as memory, a processor, and a hard drive - A cluster can execute a task by splitting it into small pieces and distributing their execution onto different computers that belong to the cluster
Realtime Big Data Processing and SCV
- While designing a realtime Big Data processing system, the SCV principle needs to be kept in mind - Consider a hard-realtime and a near-realtime Big Data processing system - For both hard-realtime and near-realtime scenarios, we assume that data loss is unacceptable; in other words, high data volume (V) processing is required for both systems - In the case of a hard-realtime system, fast response (S) is required, hence consistency (C) will be compromised if high volume data (V) needs to be processed in memory - In the case of a near-realtime system, a reasonably fast response (S) is required, hence consistency (C) can be guaranteed if high volume data (V) needs to be processed in memory
Batch Processing with MapReduce
- Widely used implementation of a batch processing framework - Highly scalable and reliable and is based on the principle of divide-and-conquer, providing built-in fault tolerance and redundancy - Divides a big problem into a collection of smaller problems that can each be solved quickly - MapReduce has roots in both distributed and parallel computing - Batch-oriented processing engine used to process large datasets using parallel processing deployed over clusters of commodity hardware - Doesn't require that the input data conform to any particular data model - Used to process schema-less datasets - Dataset is broken down into multiple smaller parts - Operations are performed on each part independently - Results from the operations are summarized to arrive at the answer - MapReduce processing engine generally only supports batch workloads - Work is not expected to have low latency
CAP Theorem
- The Consistency, Availability, and Partition Tolerance (CAP) theorem, AKA Brewer's theorem, expresses a triple constraint related to distributed database systems - States that a distributed database system, running on a cluster, can only provide two of the following three properties: -> Consistency -> Availability -> Partition Tolerance
Understanding MapReduce Algorithms
- The divide-and-conquer principle is generally achieved by using one of the following approaches: - Task parallelism: parallelization of data processing by dividing a task into sub-tasks and running each sub-task on a separate processor, generally on a separate node in a cluster. Each sub-task generally executes a different algorithm, with its own copy of the same data or different data as its input, in parallel. - Data parallelism: refers to the parallelization of data processing by dividing a dataset into multiple datasets and processing each sub-dataset in parallel (see the sketch below). Sub-datasets are spread across multiple nodes and are all processed using the same algorithm.
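- A minimal data-parallelism sketch using Python's multiprocessing module (the dataset and word-count function are hypothetical): the same algorithm runs over sub-datasets in parallel and the partial results are combined:

    from multiprocessing import Pool

    def count_words(chunk):
        # The same algorithm is applied to every sub-dataset
        return sum(len(line.split()) for line in chunk)

    if __name__ == "__main__":
        sub_datasets = [["big data is big"], ["mapreduce divides work"], ["clusters scale out"]]
        with Pool(processes=3) as pool:
            partial_counts = pool.map(count_words, sub_datasets)   # process sub-datasets in parallel
        print(sum(partial_counts))                                 # combine partial results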
Features of Visualization tools used in Big Data
Aggregation -> Provides a holistic and summarized view of data across multiple contexts - Drill-down -> enables a detailed view of the data of interest by focusing in on a data subset from the summarized view - Filtering -> helps focus on a particular set of data by filtering away the data that is not of immediate interest - Roll-up -> groups data across multiple categories to show subtotals and totals - What-if analysis -> enables multiple outcomes to be visualized by enabling related factors to be dynamically changed