BFD Chp 7

In-Memory Databases Strategies

-Not all IMDB implementations directly support durability, but instead leverage various strategies for providing durability in the face of machine failures or memory corruption.
-Non-volatile RAM (NVRAM) can be used for storing data permanently
-Database transaction logs can be periodically stored to a non-volatile medium, such as a disk
-Snapshot files, which capture database state at a certain point in time, are saved to disk
-An IMDB may leverage sharding and replication to support increased availability and reliability as a substitute for durable storage
-IMDBs can be used in conjunction with on-disk storage devices, such as NoSQL databases and RDBMSs, for durable storage
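
The transaction-log and snapshot strategies above can be sketched roughly as follows. This is a minimal illustration, not how any particular IMDB works: the class `DurableMemoryStore`, the file names, and the choice to append to the log on every write (rather than periodically) are all assumptions made for the example.

```python
import json
import os
import tempfile


class DurableMemoryStore:
    """Hypothetical in-memory store that recovers from a snapshot plus a log."""

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "txn.log")
        self.snapshot_path = os.path.join(directory, "snapshot.json")
        self.data = {}

    def put(self, key, value):
        # Record the change on disk first, then apply it in memory.
        with open(self.log_path, "a") as log:
            log.write(json.dumps([key, value]) + "\n")
        self.data[key] = value

    def snapshot(self):
        # Capture the full state at this point in time, then empty the log.
        with open(self.snapshot_path, "w") as snap:
            json.dump(self.data, snap)
        open(self.log_path, "w").close()

    def recover(self):
        # Load the last snapshot, then replay changes logged after it.
        self.data = {}
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path) as snap:
                self.data = json.load(snap)
        if os.path.exists(self.log_path):
            with open(self.log_path) as log:
                for line in log:
                    key, value = json.loads(line)
                    self.data[key] = value


with tempfile.TemporaryDirectory() as directory:
    store = DurableMemoryStore(directory)
    store.put("a", 1)
    store.snapshot()
    store.put("b", 2)                      # only in the log, not in the snapshot

    restarted = DurableMemoryStore(directory)   # simulates a machine restart
    restarted.recover()
    recovered = dict(restarted.data)
```

After the simulated restart, `recovered` holds both the snapshotted entry and the entry that existed only in the log.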

Key-Value storage device is not Appropriate when :

-applications require searching or filtering data using attributes of the stored value
-relationships exist between different key-value entries
-a group of keys' values need to be updated in a single transaction
-multiple keys require manipulation in a single operation
-schema consistency across different values is required
-updates to individual attributes of the value are required

On-Disk Storage Devices

-generally utilize low-cost hard-disk drives for long-term storage. They can be implemented via a distributed file system or a database.

Read-Through

-If a requested value for a key is not found in the IMDG, then it is synchronously read from the backend on-disk storage device, such as a database.
-Upon a successful read from the backend on-disk storage device, the key-value pair is inserted into the IMDG, and the requested value is returned to the client.
-Any subsequent requests for the same key are then served by the IMDG directly, instead of the backend storage.
-Although it is a simple approach, its synchronous nature may introduce read latency.
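
The read-through steps above can be sketched as follows. The class `ReadThroughCache` and the plain dictionary standing in for the backend database are assumptions for illustration only; real IMDG products expose their own loader interfaces.

```python
class ReadThroughCache:
    """Hypothetical in-memory grid fronting a slower backend store."""

    def __init__(self, backend):
        self.backend = backend      # any mapping standing in for the on-disk store
        self.memory = {}            # key-value pairs held in memory
        self.backend_reads = 0      # counts synchronous trips to the backend

    def get(self, key):
        if key in self.memory:
            return self.memory[key]      # served by the grid directly
        self.backend_reads += 1
        value = self.backend[key]        # synchronous read; KeyError if absent
        self.memory[key] = value         # insert into the grid on success
        return value


backend = {"user:1": "Ada"}
cache = ReadThroughCache(backend)
first = cache.get("user:1")     # miss: read synchronously from the backend
second = cache.get("user:1")    # hit: served from memory
```

Only the first call pays the backend latency, which is why `backend_reads` stays at 1 after both calls.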

Variety

-A storage device needs to handle different data formats, including documents, emails, images, and videos, as well as incomplete data. NoSQL storage devices can store these different forms of semi-structured and unstructured data formats. At the same time, NoSQL storage devices are able to store schema-less data and incomplete data with the added ability of making schema changes as the data model of the dataset evolves. In other words, NoSQL databases support schema evolution.

In-Memory Data Grids (IMDGs) Cont.

-An IMDG is akin to a distributed cache as both provide memory-based access to frequently accessed data. However, unlike a distributed cache, an IMDG provides built-in support for replication and high availability
-Realtime processing engines can make use of an IMDG where high velocity data is stored in the IMDG as it arrives and is processed there before being saved to an on-disk storage device, or data from the on-disk storage device is copied to the IMDG
-IMDGs may also support in-memory MapReduce, which helps to reduce the latency of disk-based MapReduce processing, especially when the same job needs to be executed multiple times
-An IMDG can also be deployed within a cloud-based environment where it provides a flexible storage medium that can scale out or scale in automatically as the storage demand increases or decreases
-IMDGs can be added to existing Big Data solutions by introducing them between the existing on-disk storage device and the data processing application. However, this introduction generally requires changing the application code to implement the IMDG's API
-Examples: In-Memory Data Fabric, Hazelcast, and Oracle Coherence
-In a Big Data solution environment, IMDGs are often deployed together with on-disk storage devices that act as the backend storage. This is achieved via the following approaches, which can be combined as necessary to support read/write performance, consistency and simplicity requirements:
*Read-through, write-through, write-behind, and refresh-ahead
-Examples of IMDG storage devices:
*Hazelcast, Infinispan, Pivotal GemFire, and Gigaspace XAP

Write-Behind

-Any write to the IMDG is written asynchronously in a batch manner to the backend on-disk storage device, such as a database.
-A queue is generally placed between the IMDG and the backend storage for keeping track of the required changes to the backend storage. This queue can be configured to write data to the backend storage at different intervals.
-The asynchronous nature increases both write performance (the write operation is considered complete as soon as it is written to the IMDG) and read performance (data can be read from the IMDG as soon as it is written to the IMDG), as well as scalability and availability in general.
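
A minimal sketch of the write-behind queue described above. The class `WriteBehindCache` is hypothetical, and `flush()` is called by hand here; in practice it would run on a timer or background thread at the configured interval.

```python
from collections import deque


class WriteBehindCache:
    """Hypothetical grid that queues writes for a later batch flush."""

    def __init__(self, backend):
        self.backend = backend      # mapping standing in for the on-disk store
        self.memory = {}
        self.pending = deque()      # queue between the grid and the backend

    def put(self, key, value):
        # The write is complete as soon as memory is updated.
        self.memory[key] = value
        self.pending.append((key, value))

    def flush(self):
        # Write all queued changes to the backend as one batch.
        batch = list(self.pending)
        self.pending.clear()
        for key, value in batch:
            self.backend[key] = value
        return len(batch)


disk = {}
grid = WriteBehindCache(disk)
grid.put("a", 1)
grid.put("b", 2)
disk_before_flush = dict(disk)      # backend has not seen the writes yet
flushed = grid.flush()
```

Between `put` and `flush` the two stores disagree, which is the consistency cost paid for the faster writes.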

Write-Through

-Any write (insert/update/delete) to the IMDG is written synchronously in a transactional manner to the backend on-disk storage device, such as a database.
-If the write to the backend on-disk storage device fails, the IMDG's update is rolled back.
-Due to this transactional nature, data consistency is achieved immediately between the two data stores.
-However, this transactional support is provided at the expense of write latency, as any write operation is considered complete only when feedback (write success/failure) from the backend storage is received.
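
The write-through behavior, including the rollback on backend failure, might look like the sketch below. `WriteThroughCache`, `ok_write`, and `failing_write` are hypothetical names introduced for the example.

```python
class WriteThroughCache:
    """Hypothetical grid that writes synchronously to a backend."""

    def __init__(self, backend_write):
        self.backend_write = backend_write   # callable that may raise on failure
        self.memory = {}

    def put(self, key, value):
        had_key = key in self.memory
        previous = self.memory.get(key)
        self.memory[key] = value
        try:
            self.backend_write(key, value)   # caller waits for the backend
        except Exception:
            # Roll back the in-memory update so both stores stay consistent.
            if had_key:
                self.memory[key] = previous
            else:
                del self.memory[key]
            raise


disk = {}


def ok_write(key, value):
    disk[key] = value


def failing_write(key, value):
    raise IOError("backend unavailable")


cache = WriteThroughCache(ok_write)
cache.put("a", 1)

cache.backend_write = failing_write      # simulate a backend outage
rolled_back = False
try:
    cache.put("a", 2)
except IOError:
    rolled_back = True
```

After the failed write, both the grid and the backend still hold the old value for `"a"`.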

Document

-Document storage devices also store data as key-value pairs. However, unlike key-value storage devices, the stored value is a document that can be queried by the database.
-These documents can have a complex nested structure, such as an invoice.
-The documents can be encoded using either a text-based encoding scheme, such as XML or JSON, or a binary encoding scheme, such as BSON (Binary JSON)
-They provide collections or buckets (like tables) into which key-value pairs can be organized
-Each document can have a different schema; therefore it is possible to store different types of documents in the same collection or bucket
-Additional fields can be added to a document after the initial insert, thereby providing flexible schema support
-They are not limited to storing data that occurs in the form of actual documents, such as an XML file, but can also be used to store any aggregate that consists of a collection of fields having a flat or a nested schema
-Examples: MongoDB, CouchDB, and Terrastore

Graph

-Graph storage devices are used to persist inter-connected entities. Unlike other NoSQL storage devices, where the emphasis is on the structure of the entities, graph storage devices place emphasis on storing the linkages between entities
-Entities are stored as nodes (not to be confused with cluster nodes) and are also called vertices, while the linkages between entities are stored as edges. In RDBMS parlance, each node can be thought of as a single row while the edge denotes a join.
-Nodes can have more than one type of link between them through multiple edges. Each node can have attribute data as key-value pairs, which can be used to further filter query results. Having multiple edges is similar to defining multiple foreign keys in an RDBMS; however, not every node is required to have the same edges
-Queries generally involve finding interconnected nodes based on node attributes and/or edge attributes, commonly referred to as node traversal.
-Edges can be unidirectional or bi-directional, setting the node traversal direction
-Graph storage devices provide consistency via ACID compliance
-The degree of usefulness of a graph storage device depends on the number and types of edges defined between the nodes. The greater the number and the more diverse the edges are, the more diverse the types of queries it can handle. As a result, it is important to comprehensively capture the types of relations that exist between the nodes. This is true not only for existing usage scenarios, but also for exploratory analysis of data.
-Graph storage devices generally allow adding new types of nodes without making changes to the database. This also enables defining additional links between nodes as new types of relationships or nodes appear in the database
-Examples: Neo4J, Infinite Graph, and OrientDB
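
To make nodes, typed edges, and node traversal concrete, here is a small sketch using plain Python structures. The sample nodes, the edge types `FRIEND_OF` and `WORKS_WITH`, and the helper names are all invented for illustration; a real graph storage device would use its own query language.

```python
from collections import deque

# Nodes carry attribute data as key-value pairs.
nodes = {
    "alice": {"city": "Oslo"},
    "bob": {"city": "Rome"},
    "carol": {"city": "Oslo"},
}

# Unidirectional edges: (source, edge type, destination).
edges = [
    ("alice", "FRIEND_OF", "bob"),
    ("bob", "WORKS_WITH", "carol"),
]


def neighbours(node, edge_type=None):
    """Nodes reachable over one outgoing edge, optionally of one type."""
    return [dst for src, kind, dst in edges
            if src == node and (edge_type is None or kind == edge_type)]


def traversal_distance(start, goal):
    """Breadth-first search: fewest edges from start to goal, or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, distance + 1))
    return None


hops = traversal_distance("alice", "carol")
reverse_hops = traversal_distance("carol", "alice")
```

Because the edges are unidirectional, `carol` cannot reach `alice`, so `reverse_hops` is `None` while `hops` counts two edges.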

In-Memory Databases

-IMDBs are in-memory storage devices that employ database technology and leverage the performance of RAM to overcome the runtime latency issues that plague on-disk storage devices.
-An IMDB can be relational in nature (relational IMDB) for the storage of structured data, or may leverage NoSQL technology (non-relational IMDB) for the storage of semi-structured and unstructured data.
-Unlike IMDGs, which generally provide data access via APIs, relational IMDBs make use of the more familiar SQL language, which helps data analysts or data scientists who do not have advanced programming skills.
-NoSQL-based IMDBs generally provide API-based access, which may be as simple as put, get and delete operations. Depending on the underlying implementation, some IMDBs scale out, while others scale up, to achieve scalability
-Like an IMDG, an IMDB may also support the continuous query feature, where a filter in the form of a query for data of interest is registered with the IMDB. The IMDB then continuously executes the query in an iterative manner.
-Whenever the query result is modified as a result of insert/update/delete operations, subscribing clients are asynchronously informed by sending out changes as events, such as added, removed, and updated events, with information about record values, such as old and new values

In-Memory Storage Devices Cont.

-In-memory storage enables making sense of the fast influx of data in a Big Data environment (velocity characteristic) by providing a storage medium that facilitates realtime insight generation. This supports making quick business decisions for mitigating a threat or taking advantage of an opportunity
-A Big Data in-memory storage device is implemented over a cluster, providing high availability and redundancy. Therefore, horizontal scalability can be achieved by simply adding more nodes or memory. When compared with an on-disk storage device, an in-memory storage device is expensive because of the higher cost of memory as compared to a disk-based storage device.
-They also do not provide the same level of support for durable data storage. The price factor further affects the achievable capacity of an in-memory device when compared with an on-disk storage device. Only up-to-date and fresh data or data that has the most value is kept in memory, whereas stale data gets replaced with newer, fresher data
-Depending on how it is implemented, an in-memory storage device can support schema-less or schema-aware storage. Schema-less storage support is provided through key-value based data persistence
-In-memory storage devices can be implemented as:
*In-Memory Data Grid (IMDG)
*In-Memory Database (IMDB)
-What makes these distinct is the way data is stored in memory

Refresh-Ahead

-Refresh-ahead is a proactive approach where any frequently accessed values are automatically, asynchronously refreshed in the IMDG, provided that the value is accessed before its expiry time as configured in the IMDG.
-If the value is accessed after its expiry time, the value, like in the read-through approach, is synchronously read from the backend storage and updated in the IMDG before being returned to the client.
-Due to its asynchronous and forward-looking nature, this approach helps achieve better read performance and is especially useful when the same values are accessed frequently or accessed by a number of clients.
-Compared to the read-through approach, where a value is served from the IMDG until its expiry, data inconsistency between the IMDG and the backend storage is minimized as values are refreshed before they expire.
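
A simplified sketch of refresh-ahead follows. It is an illustration under several assumptions: the class `RefreshAheadCache` is hypothetical, the clock is injected so the example is deterministic, and `run_refresh()` stands in for the asynchronous background refresher that a real IMDG would run on its own.

```python
class RefreshAheadCache:
    """Hypothetical grid that refreshes values accessed before expiry."""

    def __init__(self, backend, ttl, clock):
        self.backend = backend
        self.ttl = ttl
        self.clock = clock          # callable returning the current time
        self.entries = {}           # key -> (value, expires_at)
        self.to_refresh = set()     # keys accessed before their expiry
        self.sync_loads = 0         # synchronous backend reads seen by clients

    def _load(self, key):
        value = self.backend[key]
        self.entries[key] = (value, self.clock() + self.ttl)
        return value

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None or self.clock() >= entry[1]:
            # Missing or expired: fall back to a synchronous read-through.
            self.sync_loads += 1
            return self._load(key)
        # Accessed before expiry: serve from memory and schedule a refresh.
        self.to_refresh.add(key)
        return entry[0]

    def run_refresh(self):
        # Stand-in for the asynchronous refresher.
        for key in list(self.to_refresh):
            self._load(key)
        self.to_refresh.clear()


now = [0]
backend = {"k": "v1"}
cache = RefreshAheadCache(backend, ttl=10, clock=lambda: now[0])

first = cache.get("k")       # t=0: synchronous load, expires at t=10
backend["k"] = "v2"          # backend changes behind the grid
now[0] = 5
second = cache.get("k")      # t=5: served from memory, refresh scheduled
cache.run_refresh()          # background refresh picks up "v2", expires at t=15
now[0] = 12
third = cache.get("k")       # t=12: still fresh thanks to the refresh
```

Without the refresh, the read at `t=12` would have hit an expired entry and paid a synchronous backend read; here `sync_loads` stays at 1.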

NoSQL Characteristics

-Schema-less data model: Data can exist in its raw form
-Scale out rather than scale up: More nodes can be added to obtain additional storage with a NoSQL database, in contrast to having to replace the existing node with a better, higher performance/capacity one
-Highly available: This is built on cluster-based technologies that provide fault tolerance out of the box
-Lower operational costs: Many NoSQL databases are built on Open Source platforms with no licensing costs. They can often be deployed on commodity hardware
-Eventual consistency: Data reads across multiple nodes may not be consistent immediately after a write. However, all nodes will eventually be in a consistent state
-BASE, not ACID: BASE compliance requires a database to maintain high availability in the event of network/node failure, while not requiring the database to be in a consistent state whenever an update occurs. The database can be in a soft/inconsistent state until it eventually attains consistency. As a result, in consideration of the CAP theorem, NoSQL storage devices are generally AP or CP
-API-driven data access: Data access is generally supported via API-based queries, including RESTful APIs, whereas some implementations may also provide SQL-like query capability
-Auto sharding and replication: To support horizontal scaling and provide high availability, a NoSQL storage device automatically employs sharding and replication techniques where the dataset is partitioned horizontally and then copied to multiple nodes
-Integrated caching: This removes the need for a third-party distributed caching layer, such as Memcached
-Distributed query support: NoSQL storage devices maintain consistent query behavior across multiple shards
-Polyglot persistence: The use of NoSQL storage does not mandate retiring traditional RDBMSs. In fact, both can be used at the same time, thereby supporting polyglot persistence, which is an approach of persisting data using different types of storage technologies within the same solution architecture. This is good for developing systems requiring structured as well as semi/unstructured data
-Aggregate-focused: Unlike relational databases, which are most effective with fully normalized data, NoSQL storage devices store denormalized aggregated data (an entity containing merged, often nested, data from an object), thereby eliminating the need for joins and extensive mapping between application objects and the data stored in the database. One exception, however, is that graph database storage devices are not aggregate-focused
-Types: Key-value, document, column-family, and graph

NewSQL Databases

-NoSQL storage devices are highly scalable, available, fault-tolerant and fast for read/write operations. However, they do not provide the same transaction and consistency support as exhibited by ACID-compliant RDBMSs.
-NoSQL storage devices provide eventual consistency rather than immediate consistency. They therefore will be in a soft state while reaching the state of eventual consistency. As a result, they are not appropriate for use when implementing large scale transactional systems
-NewSQL storage devices combine the ACID properties of RDBMSs with the scalability and fault tolerance offered by NoSQL storage devices. They generally support SQL-compliant syntax for data definition and data manipulation operations, and they often use a logical relational data model for data storage
-NewSQL databases can be used for developing OLTP systems with very high volumes of transactions, for example a banking system. They can also be used for realtime analytics, for example operational analytics, as implementations leverage in-memory storage
-As compared to a NoSQL storage device, a NewSQL storage device provides an easier transition from a traditional RDBMS to a highly scalable database due to its support for SQL
-Examples: VoltDB, NuoDB, and InnoDB

In-Memory Data Grids (IMDGs)

-These store data in memory as key-value pairs across multiple nodes where the keys and values can be any business object or application data in serialized form. This supports schema-less data storage through storage of semi/unstructured data.
-Data access is typically provided via APIs
-Nodes in IMDGs keep themselves synchronized and collectively provide high availability, fault tolerance and consistency. In comparison to NoSQL's eventual consistency approach, IMDGs support immediate consistency
-IMDGs provide faster data access as they store non-relational data as objects. Hence, unlike relational IMDBs, object-to-relational mapping is not required and clients can work directly with the domain-specific objects
-They scale horizontally by implementing data partitioning and data replication and further support reliability by replicating data to at least one extra node.
-They are heavily used for realtime analytics because they support Complex Event Processing (CEP) via the publish-subscribe messaging model. This is achieved through a feature called continuous querying, also known as active querying, where a filter for event(s) of interest is registered with the IMDG. The IMDG then continuously evaluates the filter and whenever the filter is satisfied as a result of insert/update/delete operations, subscribing clients are informed.
-Notifications are sent asynchronously as change events, such as added, removed, and updated events, with information about key-value pairs, such as old and new values
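
Continuous querying can be sketched as a registered filter plus change events. The class `ContinuousQueryGrid` and the event dictionary layout are assumptions for illustration, and for simplicity the callbacks run synchronously here, whereas the text describes asynchronous delivery.

```python
class ContinuousQueryGrid:
    """Hypothetical key-value grid with registered filters."""

    def __init__(self):
        self.data = {}
        self.subscriptions = []     # (predicate, callback) pairs

    def register(self, predicate, callback):
        # The filter is evaluated on every later write.
        self.subscriptions.append((predicate, callback))

    def put(self, key, value):
        old = self.data.get(key)
        kind = "updated" if key in self.data else "added"
        self.data[key] = value
        for predicate, callback in self.subscriptions:
            if predicate(key, value):
                callback({"type": kind, "key": key, "old": old, "new": value})


events = []
grid = ContinuousQueryGrid()
grid.register(lambda key, value: value > 100, events.append)

grid.put("sensor1", 50)      # filter not satisfied: no event
grid.put("sensor1", 150)     # filter satisfied: subscriber is informed
```

The subscriber receives a single `updated` event carrying both the old value (50) and the new value (150).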

Distributed File Systems

-are agnostic to the data being stored and therefore support schema-less data storage
-in general, they provide out-of-the-box redundancy and high availability by copying data to multiple locations via replication
-a storage device that is implemented with a distributed file system provides simple, fast-access data storage that is capable of storing large datasets that are non-relational in nature, such as semi-structured and unstructured data
-provide fast read/write capability
-are not ideal for datasets comprising a large number of small files, as this creates excessive disk-seek activity, slowing down the overall data access
-there is also more overhead involved in processing multiple smaller files, as dedicated processes are generally spawned by the processing engine at runtime for processing each file before the results are synchronized from across the cluster
-work best with fewer but larger files accessed in a sequential manner. Multiple smaller files are generally combined into a single file to enable optimum storage and processing. This allows the distributed file system to have increased performance when data must be accessed in streaming mode with no random reads and writes
-are suitable when large datasets of raw data are to be stored or when archiving of datasets is required
-provide an inexpensive storage option for storing large amounts of data over a long period of time that needs to remain online. This is because more disks can simply be added to the cluster without needing to offload the data to offline data storage, such as tapes
-do not provide the ability to search the contents of files as a standard out-of-the-box capability

Relational Database Management Systems (RDBMS)

-are good for handling transactional workloads involving small amounts of data with random read/write properties
-are ACID-compliant and, to honor this compliance, they are generally restricted to a single node. For this reason RDBMSs do not provide out-of-the-box redundancy and fault tolerance
-They employ vertical scaling, which is a more costly and disruptive scaling strategy than horizontal scaling. This makes RDBMSs less than ideal for long-term storage of data that accumulates over time
-Note that some relational databases, for example IBM DB2 pureScale, Sybase ASE Cluster Edition, Oracle Real Application Cluster (RAC), and Microsoft Parallel Data Warehouse (PDW), are capable of being run on clusters. However, these database clusters still use shared storage that can act as a single point of failure
-Relational databases need to be manually sharded, mostly using application logic. This means that the application logic needs to know which shard to query in order to get the required data. This further complicates data processing when data from multiple shards is required
-Generally require data to adhere to a schema. As a result, storage of semi-structured and unstructured data whose schemas are non-relational is not directly supported
-With a relational database, schema conformance is validated at the time of data insert or update by checking the data against the constraints of the schema. This introduces overhead that creates latency
-This latency makes relational databases a less than ideal choice for storing high velocity data that needs a highly available database storage device with fast data write capability. As a result of its shortcomings, a traditional RDBMS is generally not useful as the primary storage device in a Big Data solution environment
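
The manual sharding point above can be illustrated with a sketch in which the application itself decides which shard to query. The three in-memory dictionaries standing in for separate database nodes, the CRC32-based routing, and the helper names are all assumptions for the example.

```python
import zlib

# Three dictionaries stand in for three separate database nodes.
shards = [{}, {}, {}]


def shard_for(key):
    """Application logic: pick the shard from a stable hash of the key."""
    return shards[zlib.crc32(key.encode()) % len(shards)]


def put(key, value):
    shard_for(key)[key] = value


def get(key):
    return shard_for(key).get(key)


def get_all():
    """Cross-shard read: the application must visit and merge every shard."""
    merged = {}
    for shard in shards:
        merged.update(shard)
    return merged


put("cust:1", "Ada")
put("cust:2", "Bob")
single = get("cust:1")
everything = get_all()
```

A single-key read touches one shard, while `get_all` has to visit all of them, which is the extra complexity the text refers to when data from multiple shards is required.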

An In-Memory Storage Device is Appropriate when :

-data arrives at a fast pace and requires realtime analytics or event stream processing
-continuous or always-on analytics is required, such as operational BI and operational analytics
-interactive query processing and realtime data visualization need to be performed, including what-if analysis and drill-down operations
-the same dataset is required by multiple data processing jobs
-performing exploratory data analysis, as the same dataset does not need to be reloaded from disk if the algorithm changes
-data processing involves iterative access to the same dataset, such as executing graph-based algorithms
-developing low latency Big Data solutions with ACID transaction support

An IMDG storage device is Appropriate when :

-data needs to be readily accessible in object form with minimal latency
-data being stored is non-relational in nature, such as semi-structured and unstructured data
-adding realtime support to an existing Big Data solution currently using on-disk storage
-the existing storage device cannot be replaced but the data access layer can be modified
-scalability is more important than relational storage; although IMDGs are more scalable than IMDBs (IMDBs are functionally complete databases), they do not support relational storage

An In-Memory Storage Device is Inappropriate when:

-data processing consists of batch processing
-very large amounts of data need to be persisted in-memory for a long time in order to perform in-depth data analysis
-performing strategic BI or strategic analytics that involves access to very large amounts of data and involves batch data processing
-datasets are extremely large and do not fit into the available memory
-making the transition from traditional data analysis toward Big Data analysis, as an in-memory storage device involves a complex setup
-an enterprise has a limited budget, as setting up an in-memory storage device may require upgrading nodes, which could be done either by node replacement or by adding more RAM

Main Differences between Document Storage data and Key-Value Storage data

-document storage devices are value-aware
-the stored value is self-describing; the schema can be inferred from the structure of the value, or a reference to the schema for the document is included in the value
-a select operation can reference a field inside the aggregate value
-a select operation can retrieve a part of the aggregate value
-partial updates are supported; therefore a subset of the aggregate can be updated
-indexes that speed up searches are generally supported
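
The contrast above can be sketched with two toy stores. Everything here is illustrative: plain dictionaries stand in for the storage devices, and `find` and `update_field` are hypothetical helpers, not the API of any product.

```python
import json

# Key-value style: the value is an opaque blob the store cannot look inside.
kv_store = {}
kv_store["invoice:42"] = json.dumps({"customer": "Ada", "total": 90}).encode()

# Changing one attribute means fetch, decode, modify, and re-insert in the app.
decoded = json.loads(kv_store["invoice:42"])
decoded["total"] = 95
kv_store["invoice:42"] = json.dumps(decoded).encode()

# Document style: the store is value-aware, so fields are addressable.
doc_store = {
    "invoice:42": {"customer": "Ada", "total": 90},
    "invoice:43": {"customer": "Bob", "total": 20},
}


def find(collection, field, value):
    """Select documents by a field inside the aggregate value."""
    return [key for key, doc in collection.items() if doc.get(field) == value]


def update_field(collection, key, field, value):
    """Partial update: change a subset of the aggregate in place."""
    collection[key][field] = value


matches = find(doc_store, "customer", "Ada")
update_field(doc_store, "invoice:42", "total", 95)
```

In the key-value case the whole blob is replaced; in the document case only the `total` field changes and the lookup by `customer` needs no application-side decoding.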

Storage Technology

-has continued to evolve over time, moving from inside the server to out on the network
-the bottom line is that relational technology is simply not scalable in a manner to support Big Data volumes
-Big Data has pushed the storage boundary to unified views of the available memory and disk storage of a cluster
-Even batch-based processing has been accelerated by the performance of Solid State Drives (SSDs), which have become less expensive

A Graph Storage Device is Appropriate when :

-interconnected entities need to be stored
-querying entities based on the type of relationship with each other rather than the attributes of the entities
-finding groups of interconnected entities
-finding distances between entities in terms of the node traversal distance
-mining data with a view toward finding patterns

A Document Storage Device is not Appropriate when :

-multiple documents need to be updated as part of a single transaction
-performing operations that need joins between multiple documents or storing data that is normalized
-schema enforcement for achieving consistent query design is required, as the document structure may change between successive query runs, which will require restructuring the query
-the stored value is not self-describing and does not have a reference to a schema
-binary data needs to be stored

A Column-Family Storage Device is Appropriate when:

-realtime random read/write capability is needed and data being stored has some defined structure
-data represents a tabular structure, each row consists of a large number of columns and nested groups of interrelated data exist
-support for schema evolution is required, as column families can be added or removed without any system downtime
-certain fields are mostly accessed together, and searches need to be performed using field values
-efficient use of storage is required when the data consists of sparsely populated rows, since column-family databases only allocate storage space if a column exists for a row. If no column is present, no space is allocated
-query patterns involve insert, select, update, and delete operations

Not-only SQL (NoSQL) Databases

-refers to technologies used to develop next generation non-relational databases that are highly scalable and fault-tolerant
-The NoSQL Characteristics list covers the principal features of NoSQL storage devices that differentiate them from traditional RDBMSs. This list should only be considered a general guide, as not all NoSQL storage devices exhibit all of these features
-The emergence of NoSQL storage devices can primarily be attributed to the volume, velocity, and variety characteristics of Big Data datasets

A Column-Family Storage Device is Inappropriate when:

-relational data access is required; for example, joins
-ACID transactional support is required
-binary data needs to be stored
-SQL-compliant queries need to be executed
-query patterns are likely to change frequently, as that could initiate a corresponding restructuring of how column-families are arranged

A Document Storage Device is Appropriate when :

-storing semi-structured document-oriented data comprising flat or nested schema
-schema evolution is a requirement, as the structure of the document is either unknown or is likely to change
-applications require a partial update of the aggregate stored as a document
-searches need to be performed on different fields of the documents
-storing domain objects, such as customers, in serialized object form
-query patterns involve insert, select, update and delete operations

Velocity

-the fast influx of data requires databases with fast-access data write capability. NoSQL storage devices enable fast writes by using the schema-on-read rather than the schema-on-write principle. Being highly available, NoSQL storage devices can ensure that write latency does not occur because of node or network failures

Volume

-the storage requirement of ever increasing data volumes commands the use of databases that are highly scalable while keeping costs down for the business to remain competitive. NoSQL storage devices fulfill this requirement by providing scale out capability while using inexpensive commodity servers.

In-Memory Storage Devices

-these generally utilize RAM, the main memory of a computer, as their storage medium to provide fast data access. The growing capacity and decreasing cost of RAM, coupled with the increasing read/write speed of solid state hard drives, has made it possible to develop in-memory data storage solutions
-Storage of data in memory eliminates the latency of disk I/O and the data transfer time between the main memory and the hard drive. This overall reduction in data read/write latency makes data processing much faster.
-In-memory storage device capacity can be increased massively by horizontally scaling the cluster that is hosting the in-memory storage device
-Cluster-based memory enables storage of large amounts of data, including Big Data datasets, which can be accessed considerably faster when compared with an on-disk storage device. This significantly reduces the overall execution time of Big Data analytics, thus enabling realtime Big Data analytics
-An in-memory storage device enables in-memory analytics, which refers to in-memory analysis of data, such as generating statistics by executing queries on data that is stored in memory instead of on disk. In-memory analytics enable operational analytics and operational BI through fast execution of queries and algorithms.

Key-Value

-these storage devices store data as key-value pairs and act like hash tables. The table is a list of values where each value is identified by a key. The value is opaque to the database and is typically stored as a BLOB. The value stored can be any aggregate, ranging from sensor data to videos.
-Value look-up can only be performed via the keys, as the database is oblivious to the details of the stored aggregate. Partial updates are not possible. An update is either a delete or an insert operation.
-Key-value storage devices generally do not maintain any indexes, therefore writes are quite fast. Based on a simple storage model, key-value storage devices are highly scalable.
-As keys are the only means of retrieving the data, the key is usually appended with the type of the value being saved for easy retrieval.
-Most key-value storage devices provide collections or buckets (like tables) into which key-value pairs can be organized
-Examples: Riak, Redis, and Amazon DynamoDB

Column-Family

-these storage devices store data much like a traditional RDBMS but group related columns together in a row, resulting in column-families.
-Each column can be a collection of related columns itself, referred to as a super-column
-Each super-column contains an arbitrary number of related columns that are generally retrieved or updated as a single unit. Each row consists of multiple column-families and can have a different set of columns, thereby manifesting flexible schema support. Each row is identified by a row key
-They provide fast data access with random read/write capability. They store different column-families in separate physical files, which improves query responsiveness as only the required column-families are searched
-Some column-family storage devices provide support for selectively compressing column-families.
-Leaving searchable column-families uncompressed can make queries faster because the target column does not need to be decompressed for lookups
-Most implementations support data versioning while some support specifying an expiry time for column data. When the expiry time has passed, the data is automatically removed
-Examples: Cassandra, HBase, and Amazon SimpleDB
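
The row key, column-family, and sparse-row ideas above can be pictured with nested dictionaries. This is only a mental model: the sample rows, the family names `profile` and `orders`, and the helper `read_family` are invented for illustration and say nothing about how a real device lays out its files.

```python
# row key -> column-family -> column -> value
rows = {
    "user1": {
        "profile": {"name": "Ada", "email": "ada@example.com"},
        "orders": {"2024-01": "book"},
    },
    "user2": {
        # Sparse row: no "orders" family and no "email" column are stored.
        "profile": {"name": "Bob"},
    },
}


def read_family(row_key, family):
    """Fetch one column-family of a row without touching the others."""
    return rows.get(row_key, {}).get(family, {})


def add_column(row_key, family, column, value):
    """Rows may gain columns at any time, giving a flexible schema."""
    rows.setdefault(row_key, {}).setdefault(family, {})[column] = value


bob_orders = read_family("user2", "orders")
add_column("user2", "profile", "phone", "555-0100")
bob_profile = read_family("user2", "profile")
```

Columns that are absent from a row simply do not exist in the structure, so `user2` takes no space for `email` or `orders`, and a new `phone` column can be added to one row without affecting the other.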

Key-Value storage is Appropriate when :

-unstructured data storage is required
-high performance reads/writes are required
-the value is fully identifiable via the key alone
-the value is a standalone entity that is not dependent on other values
-values have a comparatively simple structure or are binary
-query patterns are simple, involving insert, select and delete operations only
-stored values are manipulated at the application layer

A Graph Storage Device is Inappropriate when :

-updates are required to a large number of node attributes or edge attributes, as this involves searching for nodes or edges, which is a costly operation compared to performing node traversals
-entities have a large number of attributes or nested data; it is best to store lightweight entities in a graph storage device while storing the rest of the attribute data in a separate non-graph NoSQL storage device
-binary storage is required
-queries based on the selection of node/edge attributes dominate node traversal queries

