CCA-Wk 8
deletes
Add a tombstone to the log and don't delete right away; the data is actually removed later (e.g. during compaction).
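A minimal sketch of the idea using a hypothetical in-memory log-structured store (names and structure are illustrative, not Cassandra/HBase code):

```python
# Sketch: deletes in a log-structured store are appended as tombstones,
# not applied in place; the value disappears from reads immediately,
# but the space is only reclaimed later (e.g. during compaction).
TOMBSTONE = object()  # sentinel marking "deleted"

class LogStore:
    def __init__(self):
        self.log = []            # append-only (key, value) log
        self.index = {}          # latest entry per key

    def put(self, key, value):
        self.log.append((key, value))
        self.index[key] = value

    def delete(self, key):
        self.log.append((key, TOMBSTONE))   # tombstone, no in-place delete
        self.index[key] = TOMBSTONE

    def get(self, key):
        value = self.index.get(key, TOMBSTONE)
        return None if value is TOMBSTONE else value

    def compact(self):
        # Eventually delete: drop tombstones and stale log entries.
        self.index = {k: v for k, v in self.index.items() if v is not TOMBSTONE}
        self.log = list(self.index.items())
```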
HRegion
Dynamically partitioned range of rows. Built from multiple HFiles.
HRegion Location
Root Region -> Meta Region -> User Region
Cassandra data placement strategies
1. Simple Strategy
   a. Random Partitioner (hash partitioning per segment)
   b. Byte Ordered Partitioner (assigns ranges of keys per server) - range queries possible
2. Network Topology Strategy
   - 1st replica: placed by the Partitioner
   - Next replicas: go clockwise around the ring to hit a different rack
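A rough sketch of hash partitioning plus the clockwise, rack-aware replica walk (a toy ring, not Cassandra's actual implementation; node names, rack labels, and the 0-99 token space are made up):

```python
import hashlib

# Toy ring: (token, node, rack), sorted by token.
ring = sorted([
    (10, "node-a", "rack1"),
    (40, "node-b", "rack2"),
    (70, "node-c", "rack1"),
    (90, "node-d", "rack2"),
])

def token_for(key, ring_size=100):
    # Random Partitioner style: hash the key onto the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % ring_size

def place_replicas(key, replication_factor=2):
    token = token_for(key)
    # First replica: first node clockwise from the key's token.
    start = next((i for i, (t, _, _) in enumerate(ring) if t >= token), 0)
    replicas, racks_seen = [], set()
    i = start
    # Walk clockwise, preferring a node on a rack not used yet
    # (a real strategy falls back to repeated racks if it must).
    while len(replicas) < replication_factor and i < start + len(ring):
        _, node, rack = ring[i % len(ring)]
        if not replicas or rack not in racks_seen:
            replicas.append(node)
            racks_seen.add(rack)
        i += 1
    return replicas

print(place_replicas("user:42"))   # two replicas on different racks
```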
Writes
Writes: lock-free and fast (no reads or disk seeks)
Write request -> client coordinator node in the Cassandra cluster
Coordinator -> Partitioner -> all replica nodes for the key
Always writable: the Hinted Handoff mechanism buffers writes for up to a few hours if replicas are down
Per-DC coordinator elected via ZooKeeper, which runs a Paxos (consensus) variant
Writes at a replica node:
1. Log it in the on-disk commit log
2. Update the Memtable (in-memory key-value pairs; a write-back cache, not write-through)
Later, when full/old, flush to disk as:
- Data file: SSTable (Sorted String Table)
- Index file: (key, position in data SSTable)
- Bloom filter
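A compressed sketch of the replica-side write path (plain Python; the log format, file name, and flush threshold are made up, not Cassandra's on-disk layout):

```python
import json, time

class Replica:
    """Sketch of a replica's write path: commit log -> memtable -> SSTable flush."""

    def __init__(self, commit_log_path="commit.log", flush_threshold=4):
        self.commit_log_path = commit_log_path
        self.memtable = {}                  # in-memory key -> (timestamp, value)
        self.sstables = []                  # flushed, immutable, sorted key maps
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        ts = time.time()
        # 1. Log the mutation to the on-disk commit log (durability).
        with open(self.commit_log_path, "a") as log:
            log.write(json.dumps({"key": key, "value": value, "ts": ts}) + "\n")
        # 2. Update the memtable (write-back cache; no read, no disk seek).
        self.memtable[key] = (ts, value)
        # Later: when the memtable is full/old, flush it as an SSTable.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # SSTable: keys written out in sorted order, then memtable cleared.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
```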
What is bloom filter and what does it look like
A Bloom filter is a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set. The price paid for this efficiency is that a Bloom filter is a probabilistic data structure: it tells us that the element either definitely is not in the set or may be in the set. It is a compact way of representing a set of items as a large bit map: checking true means the item may be in the set, so there is some probability of "false positives," but the false positive rate is kept low (e.g. around 0.02%).
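A minimal Bloom filter sketch (the bit-array size and number of hashes are arbitrary here; a real deployment derives them from the target false-positive rate):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits          # the large bit map

    def _positions(self, item):
        # Derive several bit positions from salted hashes of the item.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True  -> item *may* be in the set (false positives possible)
        # False -> item is *definitely not* in the set
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("row-key-17")
print(bf.might_contain("row-key-17"))   # True
print(bf.might_contain("row-key-99"))   # almost certainly False
```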
What is Kafka, main components and characteristics
A distributed, partitioned, replicated pub/sub system that provides a commit log service. Originally developed and used at LinkedIn.
Describe the uses of Kafka.
Apache Kafka is a publish-subscribe based, fault-tolerant messaging system. It is fast, scalable, and distributed by design. It is a distributed publish-subscribe messaging system and a robust queue that can handle a high volume of data and enables you to pass messages from one end-point to another. Kafka is suitable for both offline and online message consumption. Kafka messages are persisted on disk and replicated within the cluster to prevent data loss. Kafka is built on top of the ZooKeeper synchronization service and integrates very well with Apache Storm and Spark for real-time streaming data analysis.
Benefits:
- Reliability - Kafka is distributed, partitioned, replicated, and fault tolerant.
- Scalability - The messaging system scales easily without downtime.
- Durability - Kafka uses a distributed commit log, which means messages are persisted on disk as fast as possible, so it is durable.
- Performance - Kafka has high throughput for both publishing and subscribing messages and maintains stable performance even when many TB of messages are stored. It is very fast and guarantees zero downtime and zero data loss.
Use cases:
- Metrics - Kafka is often used for operational monitoring data: aggregating statistics from distributed applications to produce centralized feeds of operational data.
- Log aggregation - Kafka can be used across an organization to collect logs from multiple services and make them available in a standard format to multiple consumers.
- Stream processing - Popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and write the processed data to a new topic where it becomes available for users and applications. Kafka's strong durability is also very useful in the context of stream processing.
HBase Master
Assigns HRegions to HRegion Servers
Manages HRegions and balances HRegion server load
HRegions are distributed randomly on nodes of the cluster for load balancing
Garbage collection
Handles schema changes
HFile
Basic building block of HBase
On-disk file format representing a map from string to string
Persistent, ordered, immutable map from keys to values (stored in HDFS)
Sequence of blocks on disk + an index for block lookup - mapped to MemStore
Operations: look up a value by key; iterate key/value pairs within a key range
Distributed Key-Value Stores
Cassandra Redis
NoSQL databases
Cassandra, HBase
Large scale data stores
Distributed Key-Value store: Cassandra and Redis
Scalable Database: HBase, Spark SQL
Pub-Sub Queue: Kafka
HRegion Servers
Each HRegion server manages a set of regions (10-1000 regions of 100-200 MB each by default)
Handles read/write requests to the regions
Splits regions when they get too large
Consumer rebalance
Each consumer can have several threads
Each consumer thread can consume from multiple partitions
Each partition is consumed by EXACTLY ONE consumer in the consumer group
If one of the consumers goes down, another consumer takes over its partitions, resuming from the consumer offsets of the consumer group
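A toy round-robin reassignment sketch (not Kafka's actual rebalance protocol; consumer and partition names are made up):

```python
def assign_partitions(consumers, partitions):
    """Round-robin partitions over the live consumers in a group.
    Each partition ends up with exactly one consumer."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

partitions = ["topic-0", "topic-1", "topic-2", "topic-3"]
print(assign_partitions(["c1", "c2"], partitions))
# {'c1': ['topic-0', 'topic-2'], 'c2': ['topic-1', 'topic-3']}

# If c2 goes down, a rebalance re-runs the assignment over the survivors;
# c1 takes over c2's partitions and resumes from the group's committed offsets.
print(assign_partitions(["c1"], partitions))
# {'c1': ['topic-0', 'topic-1', 'topic-2', 'topic-3']}
```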
Kafka Server Cluster Implementation
Each partition is replicated across servers
Each partition has one leader server that handles read and write requests; a follower replicates the leader and acts as backup
Each server acts as a leader for some partitions and a follower for others, for load balance
ZooKeeper is used for server consistency
Describe the scalable, low latency database that supports database operations in applications that use Hadoop.
HBase
Hbase vs HDFS
HBase is part of Hadoop's ecosystem. HBase is built on top of HDFS, and HBase files are internally stored in HDFS.
HDFS: good for batch processing; not good for record lookup, incremental adds/small batches, or updates
HBase: good for what HDFS is not - fast record lookup, record-level insertion, updates
HRegion assignment
HBase Master keeps track of HRegions
Each HRegion is assigned to one HRegion server at a time
Changes to HRegion structure:
- HBase Master initiates table creation/deletion
- HBase Master initiates HRegion merges
- HRegion server initiates HRegion splits
What are Hbase building blocks
HDFS
Apache ZooKeeper (uses ZAB - ZooKeeper Atomic Broadcast)
HFile
An HBase schema consists of several tables
Each table has column families (part of the schema)
HBase has dynamic columns (encoded inside the cells; different cells can have different columns)
Version number support for each key
Rows are kept in sorted lexicographic order
HBase tables are divided horizontally by row key range into "Regions." A region contains all rows in the table between the region's start key and end key. Regions are assigned to the nodes in the cluster, called "Region Servers," and these serve data for reads and writes. A region server can serve about 1,000 regions.
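A sketch of routing a row key to its region by start key (illustrative only; HBase actually does this via its META table, and the region names below are made up):

```python
import bisect

# Regions sorted by start key; each holds rows in [start_key, end_key).
region_start_keys = ["", "g", "n", "t"]          # "" = open start of the first region
region_names = ["region-1", "region-2", "region-3", "region-4"]

def region_for(row_key):
    # Find the last region whose start key is <= row_key.
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_names[idx]

print(region_for("apple"))    # region-1
print(region_for("mango"))    # region-2
print(region_for("zebra"))    # region-4
```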
What is Snitches mechanism
Maps IPs to racks/DCs. Configured in cassandra.yaml
1. Simple Snitch
2. Rack Inferring Snitch: x.<DC>.<rack>.<node>
3. Property File Snitch - uses a config file
4. EC2 Snitch: uses EC2; EC2 Region <-> DC, Availability Zone <-> rack
Hbase
Open source, nonrelational, column-oriented distributed database that runs on top of Hadoop (HDFS). Data is logically organized into tables, rows, and columns
Distributed Publish/Subscribe Queues
Publish/subscribe messaging, or pub/sub messaging, is a form of asynchronous service-to-service communication used in serverless and microservices architectures. In a pub/sub model, any message published to a topic is immediately received by all of the subscribers to the topic. Pub/sub messaging can be used to enable event-driven architectures, or to decouple applications in order to increase performance, reliability, and scalability.
In modern cloud architecture, applications are decoupled into smaller, independent building blocks that are easier to develop, deploy, and maintain. Pub/sub messaging provides instant event notifications for these distributed applications. The publish/subscribe model allows messages to be broadcast to different parts of a system asynchronously. A sibling to a message queue, a message topic provides a lightweight mechanism to broadcast asynchronous event notifications, and endpoints that allow software components to connect to the topic in order to send and receive those messages. To broadcast a message, a component called a publisher simply pushes a message to the topic. Unlike message queues, which batch messages until they are retrieved, message topics transfer messages with no or very little queuing and push them out immediately to all subscribers. All components that subscribe to the topic will receive every message that is broadcast, unless a message filtering policy is set by the subscriber.
The subscribers to a message topic often perform different functions, and can each do something different with the message in parallel. The publisher doesn't need to know who is using the information that it is broadcasting, and the subscribers don't need to know where the message comes from. This style of messaging is a bit different from message queues, where the component that sends the message often knows the destination it is sending to.
What is Redis and what are its main properties ?
REmote DIctionary Server
Open source, written in C
Key-value store: strings and abstract data types, e.g. hashes where keys and values are strings
Ultrafast response time (in-memory, non-blocking I/O, single threaded, 100,000+ reads/writes per second)
Checkpoints in-memory values to disk every few seconds
When you should be using Redis (Redis use cases) ?
Redis is not a database; it complements the data storage layer and adds Pub/Sub support
Potential uses: session store, logging
General use: data you don't mind losing; records that can be accessed by a single primary key; schemas with a single value or a serialized object per key
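A session-store style sketch using the redis-py client (assuming a Redis server on localhost:6379; the key names and TTL are arbitrary):

```python
import redis

# Assumes a Redis server running locally.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Session store: one hash per session, looked up by a single primary key.
session_key = "session:abc123"
r.hset(session_key, mapping={"user_id": "42", "cart_items": "3"})
r.expire(session_key, 1800)          # session data is fine to lose/expire

print(r.hgetall(session_key))        # {'user_id': '42', 'cart_items': '3'}

# Simple string key-value and counter usage.
r.set("page:/home:views", 0)
r.incr("page:/home:views")
```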
reads
Similar to writes, except:
- Coordinator can contact X replicas
- Coordinator also fetches the value from other replicas and does read repair (comparing timestamps)
- A row may be split across multiple SSTables, so reads are slower than writes
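A toy read-repair sketch: the coordinator compares (timestamp, value) pairs from the contacted replicas, returns the newest, and writes it back to stale replicas (names are illustrative, not Cassandra's API):

```python
def read_with_repair(replicas, key):
    """replicas: list of dicts mapping key -> (timestamp, value)."""
    # Collect the versions each contacted replica holds for the key.
    versions = [(rep.get(key), rep) for rep in replicas if rep.get(key) is not None]
    if not versions:
        return None
    # Newest timestamp wins.
    (latest_ts, latest_value), _ = max(versions, key=lambda v: v[0][0])
    # Read repair: push the winning version back to any stale replica.
    for (ts, _value), rep in versions:
        if ts < latest_ts:
            rep[key] = (latest_ts, latest_value)
    return latest_value

r1 = {"user:42": (100, "old address")}
r2 = {"user:42": (250, "new address")}
print(read_with_repair([r1, r2], "user:42"))   # "new address"; r1 gets repaired
```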
Offset
Spark Streaming integration with Kafka allows users to read messages from a single Kafka topic or multiple Kafka topics. A Kafka topic receives messages across a distributed set of partitions where they are stored. Each partition maintains the messages it has received in a sequential order where they are identified by an offset, also known as a position. Developers can take advantage of using offsets in their application to control the position of where their Spark Streaming job reads from, but it does require offset management.
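A sketch of explicit offset control using the kafka-python client (an assumption here, since the card discusses Spark Streaming; the broker address, topic name, and offset value are placeholders):

```python
from kafka import KafkaConsumer, TopicPartition

# Assumes a broker on localhost:9092 and a topic named "events".
consumer = KafkaConsumer(bootstrap_servers="localhost:9092",
                         enable_auto_commit=False)

partition = TopicPartition("events", 0)
consumer.assign([partition])      # manual assignment instead of group rebalance
consumer.seek(partition, 42)      # start reading at offset 42 in partition 0

for message in consumer:
    print(message.offset, message.value)
    break                          # just show the first message past offset 42
```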
Main data types
Spark can read data from HDFS, Hive tables, JSON, etc.
Main data types: Dataset, DataFrame
Spark SQL and its main properties.
Structured data processing in Spark
Built on top of RDDs
Can switch between SQL queries and Java/Python/Scala code
Explain how Spark SQL can program SQL queries on huge data.
Structured data processing in Apache Spark, built on top of the RDD data abstraction. Spark can read data from HDFS, Hive tables, JSON, etc., and you can use SQL to query the data through a strong, optimized query engine.
Limitations with Hive: Hive launches MapReduce jobs internally for executing ad-hoc queries, and MapReduce lags in performance when it comes to the analysis of medium-sized datasets (10 to 200 GB). Hive has no resume capability: if the processing dies in the middle of a workflow, you cannot resume from where it got stuck. Hive cannot drop encrypted databases in cascade when trash is enabled, which leads to an execution error; to overcome this, users have to use the Purge option to skip trash instead of drop. These drawbacks gave way to the birth of Spark SQL.
Spark SQL overview: Spark SQL integrates relational processing with Spark's functional programming. It provides support for various data sources and makes it possible to weave SQL queries with code transformations, resulting in a very powerful tool.
Integration with Spark: Spark SQL queries are integrated with Spark programs. Spark SQL allows us to query structured data inside Spark programs, using SQL or a DataFrame API which can be used in Java, Scala, Python, and R. To run streaming computation, developers simply write a batch computation against the DataFrame/Dataset API, and Spark automatically runs it incrementally in a streaming fashion. This design means that developers don't have to manually manage state, failures, or keeping the application in sync with batch jobs; the streaming job always gives the same answer as a batch job on the same data.
Uniform data access: DataFrames and SQL support a common way to access a variety of data sources, like Hive, Avro, Parquet, ORC, JSON, and JDBC, and can join data across these sources. This is very helpful for accommodating all existing users into Spark SQL.
Hive compatibility: Spark SQL runs unmodified Hive queries on current data. It rewrites the Hive front-end and metastore, allowing full compatibility with current Hive data, queries, and UDFs.
Standard connectivity: Connection is through JDBC or ODBC, the industry norms for connectivity for business intelligence tools.
Performance and scalability: Spark SQL incorporates a cost-based optimizer, code generation, and columnar storage to make queries agile while computing over thousands of nodes using the Spark engine, which provides full mid-query fault tolerance. The interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed; internally, Spark SQL uses this extra information to perform extra optimization. Spark SQL can directly read from multiple sources (files, HDFS, JSON/Parquet files, existing RDDs, Hive, etc.) and ensures fast execution of existing Hive queries. Spark SQL can execute up to 100x faster than Hadoop.
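A short PySpark sketch of mixing the DataFrame API and SQL over the same data (the file path and column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Read structured data directly (JSON here; Parquet, Hive, JDBC also work).
people = spark.read.json("hdfs:///data/people.json")

# Same data, queried two ways: the DataFrame API ...
adults_df = people.filter(people.age >= 18).select("name", "age")

# ... or plain SQL over a temporary view; both go through the same optimizer.
people.createOrReplaceTempView("people")
adults_sql = spark.sql("SELECT name, age FROM people WHERE age >= 18")

adults_sql.show()
```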
Suspicion Mechanisms in Cassandra
Suspicion mechanisms set the timeout based on network and failure behavior.
Accrual detector: the failure detector outputs a value (PHI) representing the level of suspicion
PHI is based on the detection timeout and the historical inter-arrival times of heartbeats
In practice, PHI = 5 => 10-15 second detection time
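A rough sketch of the accrual idea, assuming exponentially distributed heartbeat inter-arrival times (a simplification; the real detector fits a distribution to the observed heartbeat history):

```python
import math

def phi(time_since_last_heartbeat, heartbeat_history):
    """PHI = -log10(probability that the next heartbeat is still coming),
    here under an exponential model of inter-arrival times."""
    mean_interval = sum(heartbeat_history) / len(heartbeat_history)
    p_later = math.exp(-time_since_last_heartbeat / mean_interval)
    return -math.log10(p_later)

history = [1.0, 1.2, 0.9, 1.1]        # seconds between recent heartbeats
print(phi(1.0, history))              # small PHI: node is probably fine
print(phi(12.0, history))             # PHI around 5: suspect the node
```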
Dataframes
A Dataset organized into named columns
Equivalent to a table in a relational DB or a data frame in R/Python
Can be created from structured data files, tables in Hive, external DBs, or existing RDDs
DataFrame API available in Scala, Java, Python, and R
Datasets
A distributed collection of data
Benefits of RDDs with Spark SQL's optimized execution engine
Can be constructed from JVM objects and manipulated with functional transformations (map, flatMap, filter)
Dataset API available in Scala and Java
Python already has functions equivalent to the Dataset API
Cassandra
A distributed key-value store, intended to run within a DC and across DCs. Originally designed at Facebook but open-sourced later. Now under Apache.
membership
Any server in the cluster can be the coordinator
Each server keeps a list of servers, maintained automatically
Cluster membership is gossip-style: servers periodically gossip their membership list
Compaction
Compacts data updates in SSTables
Merges SSTables by merging the updates for each key
Runs periodically and locally at each server
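A toy merge of two SSTables where, for each key, the update with the newest timestamp wins (illustrative only; real compaction also drops tombstones and merges many files):

```python
def compact(*sstables):
    """Each SSTable is a dict: key -> (timestamp, value). Newest timestamp wins."""
    merged = {}
    for sstable in sstables:
        for key, (ts, value) in sstable.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    # Output SSTable keeps keys in sorted order.
    return dict(sorted(merged.items()))

old = {"a": (100, "v1"), "b": (120, "v1")}
new = {"a": (200, "v2"), "c": (150, "v1")}
print(compact(old, new))
# {'a': (200, 'v2'), 'b': (120, 'v1'), 'c': (150, 'v1')}
```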
Learn about Distributed Key-Value Stores and in-memory databases like Redis.
e.g. Cassandra, Redis
Topic, producer, consumer
Kafka maintains feeds of messages in categories called TOPICS
Processes that publish messages to a Kafka topic are PRODUCERS
Processes that subscribe to topics and process the feed of published messages are CONSUMERS
Kafka is run as a cluster comprised of one or more servers, each called a BROKER
Communication is over TCP; clients include Java
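A minimal producer/consumer sketch using the kafka-python client (the Python client, broker address, topic, and group id are assumptions for illustration):

```python
from kafka import KafkaProducer, KafkaConsumer

# PRODUCER: publish messages to a topic on a broker.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b"user=42 url=/home")
producer.flush()

# CONSUMER: subscribe to the topic as part of a consumer group.
consumer = KafkaConsumer("page-views",
                         bootstrap_servers="localhost:9092",
                         group_id="analytics",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)
    break
```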
Main idea of Bigtable
Sparse (empty cells are okay), distributed, persistent, multi-dimensional sorted map
The map is indexed by a row key, column key, and a timestamp
Single-row transaction lookups, inserts, deletes
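A tiny sketch of the data model as a nested map indexed by (row key, column key, timestamp), using the web-table example from the Bigtable paper (plain Python, not the Bigtable API):

```python
# table[row_key][column_key][timestamp] = value; sparse: missing cells cost nothing.
table = {
    "com.cnn.www": {
        "contents:html": {3: "<html>v3</html>", 5: "<html>v5</html>"},
        "anchor:cnnsi.com": {4: "CNN"},
    },
}

def lookup(row_key, column_key, timestamp=None):
    versions = table.get(row_key, {}).get(column_key, {})
    if not versions:
        return None
    ts = max(versions) if timestamp is None else timestamp
    return versions.get(ts)

print(lookup("com.cnn.www", "contents:html"))        # newest version: "<html>v5</html>"
print(lookup("com.cnn.www", "anchor:cnnsi.com", 4))  # "CNN"
```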