Data Systems
Problems with RPCs
A local function call is predictable and either succeeds or fails. A network request is unpredictable: a variety of problems outside your control can occur, and you have to anticipate them. A local function call either returns a result, throws an exception, or never returns; a network request may also return without a result because of a timeout. If you retry a failed network request, it could be that the requests are actually getting through and only the responses are being lost. In that case retrying could cause the action to be performed multiple times, unless you build a deduplication mechanism (idempotence) into the protocol. A network request can take a long time to execute, and latency is highly variable. You also can't pass references to objects in local memory in a network request; you have to send everything by value (not really feasible for big objects).
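A minimal sketch of the deduplication idea (all names here, such as PaymentServer, are made up for illustration): the client attaches an idempotency key to each logical request and reuses it on retries, and the server stores the result keyed by it, so a retried request isn't executed twice.

```python
import uuid

# Hypothetical sketch: an idempotency key lets the server deduplicate retried
# requests, so a retry whose original request actually got through does not
# perform the action a second time.
class PaymentServer:
    def __init__(self):
        self.processed = {}                      # idempotency key -> stored result

    def handle(self, idempotency_key, amount):
        if idempotency_key in self.processed:    # already executed: return stored result
            return self.processed[idempotency_key]
        result = f"charged {amount}"             # the side effect happens only once
        self.processed[idempotency_key] = result
        return result

server = PaymentServer()
key = str(uuid.uuid4())                          # one key per logical request, reused on retry
print(server.handle(key, 10))                    # charged 10
print(server.handle(key, 10))                    # retry with the same key: deduplicated
```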
Message Queue
A message queue is a form of asynchronous service-to-service communication used in serverless and microservices architectures. Messages are stored on the queue until they are processed and deleted. Each message is processed only once, by a single consumer.
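A toy illustration using Python's in-process queue.Queue as a stand-in for a real broker; it only shows the "each message is processed once, by a single consumer" property, not durability or delivery over a network.

```python
import queue
import threading

q = queue.Queue()

def consumer(name):
    while True:
        msg = q.get()          # each message is delivered to exactly one consumer
        if msg is None:        # sentinel value: shut this worker down
            q.task_done()
            break
        print(f"{name} processed {msg}")
        q.task_done()          # analogous to deleting the message from the queue

workers = [threading.Thread(target=consumer, args=(f"worker-{i}",)) for i in range(2)]
for w in workers:
    w.start()

for i in range(5):
    q.put(f"message-{i}")      # producer enqueues messages asynchronously
for _ in workers:
    q.put(None)                # one sentinel per worker
q.join()                       # wait until every message has been processed
```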
Eventual consistency
A model for database consistency in which updates to the database will propagate through the system so that all data copies will be consistent eventually.
Asynchronous replication (data systems)
A node that receives a write sends it to its followers but doesn't wait for the followers to process it. Advantage: the leader doesn't block; it can continue processing writes even when followers have fallen behind. Disadvantage: weak durability; data can be lost if nodes go down. Ex: if the leader fails and is not recoverable, any writes that have not yet been replicated to followers are lost.
Row-Oriented Data
All values from one row are stored next to each other in memory and on disk. Used in most OLTP databases and many document stores.
Snowflake Schema
An extension of star schema where dimension tables are further broken down into sub-dimensions.
Why Normalize Data?
If data is not normalized, multiple copies of it are spread across the database. If anything is updated, all of the redundant copies need to be updated, which incurs write overhead and risks inconsistencies (only some copies get updated).
Multilevel index
Indexing the index using a sparse index (recursively)
Data replication
Keeping a copy of the same data on several different nodes, potentially in different locations.
Backward Compatibility
Newer code can read data that was written by older code.
Migrating data
Rewriting data in a new schema. Expensive for large datasets. Most dbs allow schema changes.
Programs usually work with data in at least 2 representations:
(1) *In memory*: data is kept in objects, structs, lists, etc., optimized for efficient access and manipulation by the CPU. (2) *Byte sequence*: for writing data to a file or sending it over the network.
SSTable advantages over log segments with hash indexes
(1) Merging segments is simple and efficient (follows mergesort algorithm) (2) Don't need to keep an index of all the keys in memory. You can keep a few (sparse) and scan between them. (3) Since read requests need to scan over several key-value pairs in a range anyway, it's possible to group those records into a block and compress it before writing it to disk. Each entry of the sparse in-memory index then points at the start of a compressed block. Saves disk space and reduces I/O bandwidth use.
JSON, XML, CSV Encoding Disadvantages
*Ambiguity around encoding numbers*: in XML/CSV you can't distinguish a number from a string of digits, and JSON doesn't distinguish integers from floating-point numbers (and specifies no precision). JSON/XML don't support binary strings (though you can work around this with Base64). JSON/XML have optional schema support, but it isn't widely used. CSV is a vague format: it has no schema, so if you change rows/columns you have to manage the change in code, and parsers handle escaping and quoting differently.
Reasons to distribute a database across multiple machines
1) Scalability: the data volume, read load, or write load is bigger than a single machine can handle. 2) Fault tolerance / availability: the application needs to keep working if a machine, the network, or a whole datacenter goes down; multiple machines give you redundancy, and when one fails another takes over. 3) Latency: if you have users around the world, you might want servers in various locations so that each user can be served from a datacenter that is geographically close to them.
Contention
1. A contention is a conflict when two or more programs try to use the same resource or setting at the same time. 2. When referring to a network, a contention is when two or more computers try to access the same file at the same time. 3. With wireless networking, a contention is when two or more data stations attempt to transmit at the same time over a shared channel.
Synchronous replication (data systems)
A node that receives a write sends it to its followers and waits until the followers confirm it before reporting success. Advantage: followers are guaranteed to have an up-to-date copy of the data. Disadvantage: there can be a lot of waiting if nodes go down, connection pressure is high, or the network is interrupted, so fully synchronous replication is usually impractical. (If it's enabled, it usually means a single follower is made synchronous.)
NoSQL (overview)
A non-relational database.
One-to-many relationship
A relationship between two tables in a database in which one record in the primary table can match any number (zero, one, or many) of records in the related table. Ex: if one department can employ several employees, then department to employee is a one-to-many relationship (1 department employs many employees).
How are documents usually stored in a document-oriented db?
A single continuous string, encoded as JSON, XML, or a binary variant.
Data Warehouse
A system used for reporting and data analysis, designed for analytical queries. The data model is often relational, because SQL is generally a good fit for analytic queries, but the indexing structures are different.
Single-leader replication advantages/disadvantages
Advantages: easy to understand; no conflict resolution needed. Disadvantages: not as robust to faulty nodes, network interruptions, and latency spikes (a leader failure is a bit of a pain).
Multi-leader replication advantages/disadvantages
Advantages: more robust to faulty nodes, network interruptions, and latency spikes. Disadvantages: harder to reason about; conflicts occur because multiple writes can happen concurrently; weak consistency guarantees (from those concurrency issues).
Leaderless replication advantages/disadvantages
Advantages: more robust to faulty nodes, network interruptions, and latency spikes. Disadvantages: harder to reason about; conflicts occur because multiple writes can happen concurrently; weak consistency guarantees (from those concurrency issues).
Database index
An additional structure derived from the primary data. The idea is to keep extra metadata on the side that acts as a signpost and helps locate data efficiently. Well-chosen indexes speed up read queries, but every index slows down writes: maintaining the index incurs overhead, because it needs to be updated every time a write is performed.
Log data file (databases)
An append-only sequence of records in a file, which many database systems use internally. Databases generally need to deal with concurrency control, reclaim disk space so the log doesn't grow forever, and handle errors and partially written records.
Protobuf/Thrift field tags and schema evolution
An encoded record is just the concatenation of its encoded fields (each with a field tag number and type). You can change the name of a field without problems (names aren't stored), but you can't change its tag. You can add new fields as long as they get new tags (forward compatibility: old code just ignores unknown tags). A newly added field can't be required, because new code wouldn't be able to read old data that lacks it (backward compatibility); proto3 doesn't have required at all. You can remove fields, but you can't reuse their tag numbers.
Base64 Encoding
An encoding scheme which represents binary data using only printable ASCII characters. Each Base64 digit represents exactly 6 bits of data.
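A quick demonstration with Python's standard base64 module: 3 bytes (24 bits) become 4 Base64 characters of 6 bits each.

```python
import base64

# Each Base64 character encodes 6 bits, so 3 bytes (24 bits) become 4 characters.
data = b"\x00\xff\x10"                  # arbitrary binary data
encoded = base64.b64encode(data)        # b'AP8Q': printable ASCII only
decoded = base64.b64decode(encoded)
assert decoded == data
print(len(data), "bytes ->", len(encoded), "Base64 characters")   # 3 bytes -> 4 characters
```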
Automatic vs. manual rebalancing partitions
Automated rebalancing can be unpredictable and expensive: it requires rerouting requests and moving a large amount of data from one node to another. It can also be dangerous in combination with automatic failure detection. Ex: imagine one node is overloaded and temporarily slow to respond; the other nodes conclude it's dead and automatically rebalance the cluster to move load away from it, putting additional load on the overloaded node, the other nodes, and the network, and generally making the situation worse. It's generally good to have a human involved.
Strategies for rebalancing partitions
(1) Bad way: hash mod N. If the number of nodes changes, most keys have to be moved from one node to another, which is a lot of work. (2) Fixed number of partitions: partition size is proportional to the size of the dataset. Create many more partitions than there are nodes and assign several partitions to each node. If a node is added to the cluster, the new node can steal a few partitions from every existing node until partitions are fairly distributed again; only entire partitions are moved between nodes. (3) Dynamic partitioning: the number of partitions is proportional to the size of the dataset. When a partition grows beyond a certain size (default 10 GB in HBase), it is split into two partitions so that approximately half the data ends up on each side of the split; conversely, if data shrinks below some size it can be merged with an adjacent partition. Each partition is assigned to one node, and each node can handle multiple partitions; after a split, one half can be transferred to another node to balance the load. (4) Partitioning proportionally to nodes: the number of partitions is proportional to the number of nodes, i.e., a fixed number of partitions per node. When a new node joins the cluster, it randomly chooses a fixed number of existing partitions to split and takes ownership of one half of each split partition, leaving the other half in place.
Thrift and Protocol Buffer Overview
Binary encodings. Both require a schema for any data that is encoded. Both come with a code generation tool that takes a schema definition and produces classes that implement the schema in various languages. If a field value is not set, it is omitted from the record.
Materialized Aggregates
Cache of aggregate query results. Idea is to store counts or sums that queries use most often. One way is to use a materialized view: a table-like object whose contents are the result of some query.
Document-oriented database (overview)
Category of NoSQL databases. Subclass of a key-value store. Stores semi-structured or unstructured data, such as the text of a speech or newspaper article. Document databases store all information for a given object in a single instance in the database, and every stored object can be different from every other. Examples: MongoDB, RethinkDB, CouchDB, and Espresso.
Evolvability
Characteristic of a data system: easily adaptable to change.
Dataflow through services
Client/server communication is most common pattern over a network. Servers expose an API over the network. Clients connect to the servers to make requests to that API. The exposed API is the service. (The web is like this: web browsers make requests to web pages)
Column-Oriented Data Advantages
Columns aren't as fragmented on disk, so loading them is faster. More amenable to compression (one effective technique is bitmap encoding), because the number of distinct values in a column is typically small compared to the number of rows. Can take advantage of vectorized processing.
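A rough sketch of bitmap encoding for a single column (pure Python, not how a real column store lays out bits on disk): one bitmap per distinct value, one bit per row, and predicates become bitwise operations over bitmaps.

```python
# One bitmap per distinct value, one bit per row. Works well when the number
# of distinct values is small compared to the number of rows.
column = ["DE", "US", "US", "FR", "DE", "US"]   # a column with few distinct values

bitmaps = {}
for row_index, value in enumerate(column):
    bitmaps.setdefault(value, [0] * len(column))[row_index] = 1

print(bitmaps["US"])   # [0, 1, 1, 0, 0, 1]

# A query like WHERE country IN ('US', 'FR') becomes a bitwise OR of bitmaps.
us_or_fr = [a | b for a, b in zip(bitmaps["US"], bitmaps["FR"])]
print(us_or_fr)        # [0, 1, 1, 1, 0, 1]
```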
Disadvantage of LSM-Trees
The compaction process can sometimes interfere with the performance of ongoing reads and writes (a request may have to wait while compaction finishes). If write throughput is high and compaction is not configured properly, compaction may not be able to keep up with incoming data.
Problems with shared-disk architectures
Contention limits scalability (a conflict when two or more programs try to use the same resource or setting at the same time). Overhead of locking limits scalability.
Problems with shared-memory architectures
Cost grows faster than linearly: due to various bottlenecks, a machine twice the size cannot handle twice the load. Limited to a single location.
Covering index
Covering index: stores some of the table's columns within the index. (some queries can be answered by using the index alone.)
Multi-leader replication in multi-datacenter
Database with replicas in several different datacenters, one leader in each datacenter. In each datacenter regular leader-follower replication is used.
Dense index
Dense index: index appears for every search-key value in the file
Avro: where does writer's schema get stored?
Depends on context: When you write a file the schema will be at the beginning of the file. When writing to a db you have to encode a version number and keep a list of schema versions in your db. Sending records over a network connection (Avro RPC protocol): two processes negotiate the schema version on connection setup and then use that schema for the lifetime of the connection.
Rolling Upgrade / Staged Rollout
Deploying a new version of an application to a few nodes at a time, checking whether the new version is running smoothly, and gradually working your way through all the nodes. Allows new versions to be deployed without service downtime, and thus encourages more frequent releases and better evolvability.
Document-partitioned indexes (local indexes)
Document-partitioned indexes (local indexes) secondary indexes are stored in the same partition as the primary key and value. Only a single partition needs to be updated on write, but a read of secondary index requires a scatter/gather across all partitions.
Ways to partition a secondary index
Document-partitioned indexes (local indexes) secondary indexes are stored in the same partition as the primary key and value. Term-partitioned indexes (global indexes) secondary indexes are partitioned separately, using the indexed values. An entry in the secondary index may include records from all partitions of the primary key.
Shared nothing architecture
Each machine or virtual machine running the db software is a node. Nodes have their own resources, coordination happens at software level.
Read-after-write consistency
Fix for: reading your own writes (the user writes to the leader and immediately reads from a follower, which might not yet be consistent). A guarantee that if the user reloads the page, they will always see any updates they submitted themselves; it makes no promises about other users.
Protobuf/Thrift datatypes and schema evolution
Protocol Buffers has a repeated marker rather than a list type, so it's fine to change an optional (single-valued) field into a repeated (multi-valued) one. Thrift has a dedicated list datatype, which doesn't allow that single-to-multi-valued evolution, but it does support nested lists.
Graph-Like Data Models (Advantages)
Handle many-to-many relationships well. Good for evolvability: as you add features to your application, a graph can easily be extended to accommodate changes in your application's data structures. Lots of flexibility: can handle things that would be hard to express in a relational model (think about representing where people are from... different geographic scales for different countries... e.g. counties vs parishes)
Disadvantages of B-Trees
Has to write every piece of data at least twice: once to write-ahead log, and once to the tree page itself (and potentially again as pages are split).
Hash partitioning
Hash function is applied to each key, and a partition owns a range of hashes. Destroys the ordering of keys (range queries are inefficient), but may distribute load more evenly. Common to create fixed number of partitions in advance, to assign several partitions to each node, and to move entire partitions from one node to another when nodes are added or removed. Dynamic partitioning can also be used.
Hash Indexes Disadvantages
The hash map has to fit in memory. Range queries are not efficient: you can't easily scan over all keys between 0001 and 1001; you'd have to look up each key individually.
Why use replication (data systems)?
High availability: helps keep the system operational even when one or several machines go down. Disconnected operation: allows the application to continue working during a network interruption. Latency: you can place nodes geographically close to users so they get data faster. Scalability: handle a higher volume of reads than a single machine could.
Monotonic reads
If a process reads the value of a data item x, any successive read operation on x by that process will always return that same value or a more recent value. 1 potential implementation: make sure user always reads from the same replica (e.g. choose replica by hashing user's id). However, if node fails, you still have to get rerouted and may start over. (Solves problem of reads going back in time on replicated data systems with eventual consistency: user makes several reads from different followers, a later read can effectively be at an earlier point in time for the system since the updates don't flow uniformly to all followers.)
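One possible sketch of the "read from the same replica" implementation; the replica names are invented, and a real system would also handle re-routing when that replica is unavailable.

```python
import hashlib

# Pick a replica deterministically from the user id, so successive reads by
# the same user always hit the same follower and never appear to go backward.
REPLICAS = ["follower-1", "follower-2", "follower-3"]   # made-up names

def replica_for(user_id: str) -> str:
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return REPLICAS[int(digest, 16) % len(REPLICAS)]

# The same user is always routed to the same follower.
assert replica_for("user-42") == replica_for("user-42")
print(replica_for("user-42"))
```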
When to use a document-oriented db?
If you need to access most or all of a document at one time, there is a performance advantage to storage locality; if the data were split across multiple tables, retrieving it would take longer. If your data has mostly one-to-many relationships (tree-structured data), or no relationships between records, the document model is appropriate.
Declarative Language (overview)
An imperative language tells the computer to perform certain operations in a certain order. In a declarative language (like SQL or relational algebra), you specify the pattern of the data you want: what conditions the results must meet and how you want the data to be transformed, but not how to achieve that goal. Ex: SQL, CSS.
One-to-one relationship
In databases, a relationship in which each record in Table A can have only one matching record in Table B, and vice versa. Ex: in a school database, each student has only one student ID, and each student ID is assigned to only one person.
Many-to-many relationship
In databases, a relationship in which one record in Table A can relate to many matching records in Table B, and vice versa. Ex: An Author can write several Books, and a Book can be written by several Authors.
Write-ahead log (WAL) shipping (replication log implementation)
In many storage engines, data written to disk is also recorded in a log (the WAL of a B-tree, or the log segments of an LSM-tree). The log is an append-only sequence of bytes containing all writes to the database, so the same log can be used to build a replica on another node. The disadvantage is that the log describes the data at a very low level, so replication is tightly coupled to the storage engine: the leader and followers need to run matching versions. In practice this can mean more downtime during upgrades (WAL versions need to match).
Hash Indexes (Database)
Indexes for key-value data. Key-value stores are similar to python dicts. Usually implemented as a hash map. Keep the hash map in memory (key can point to a byte offset in a file or something).
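A minimal Bitcask-style sketch, assuming a single process and no compaction: values are appended to a log file, and an in-memory dict maps each key to the byte offset of its latest value.

```python
import json
import os

class HashIndexedLog:
    def __init__(self, path):
        self.path = path
        self.index = {}                      # key -> byte offset of the latest value
        open(path, "ab").close()             # make sure the log file exists

    def set(self, key, value):
        record = (json.dumps({"k": key, "v": value}) + "\n").encode("utf-8")
        with open(self.path, "ab") as f:     # append-only: never overwrite in place
            offset = f.tell()                # byte offset where this record starts
            f.write(record)
        self.index[key] = offset             # the newest write for a key wins

    def get(self, key):
        with open(self.path, "rb") as f:
            f.seek(self.index[key])          # jump straight to the record
            return json.loads(f.readline())["v"]

db = HashIndexedLog("example.log")
db.set("user:1", {"name": "Ada"})
db.set("user:1", {"name": "Ada Lovelace"})   # the old record stays in the log
print(db.get("user:1"))                      # {'name': 'Ada Lovelace'}
os.remove("example.log")                     # clean up the demo file
```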
Request Routing (partitioned dbs)
Instance of a more general problem: service discovery. Some high-level approaches: Allow client to ask any node randomly (e.g. through a round-robin load balancer). If that node owns the partition it handles the request, otherwise it forwards the request to the next node and so on. Have a routing tier that is a partition-aware load balancer. All requests go to it and it passes request to correct node. Have client be aware of the partitioning and assignment of partitions to nodes. Client connects directly.
In-Memory Databases
Keeps data in RAM; when restarted, the database needs to reload its state. The performance advantage of in-memory databases is not primarily that they don't read from disk (even a disk-based storage engine may never need to read from disk if you have enough memory, because the OS caches recently used disk blocks anyway); they can be faster because they avoid the overhead of encoding in-memory data structures into a form that can be written to disk.
Key range partitioning
Keys are sorted and a partition owns all the keys from some minimum up to some maximum. Range queries are possible, but there is risk of hot spots if the application often accesses keys that are close together in sorted order.
B-Trees vs. LSM-Trees
LSM-trees are becoming increasingly popular in new data stores. They are typically faster for writes (lower write amplification) and compress better, but slower for reads, since they have to check several data structures and SSTables at different stages of compaction. B-trees are more common and typically faster for reads, but they leave some disk space unused due to fragmentation (when a page is split). Another B-tree advantage: each key exists in exactly one place in the index, whereas a log-structured engine can have multiple copies across segments. That said, benchmarks are often inconclusive and sensitive to details of the workload; you need to test with your own workload.
Statement-based replication
The leader logs every write request (statement) that it executes and sends that statement log to its followers. For a relational DB, this means every INSERT, UPDATE, or DELETE statement, and the follower parses the statement as if it had been received from a client. Problems: statements that use nondeterministic functions (RAND(), NOW()) produce different results on different nodes. If statements use an autoincrementing column, or depend on the existing data in the database, they must be executed in exactly the same order on each replica, or they may have a different effect; this can be limiting when there are multiple concurrently executing transactions. Statements with side effects (e.g. triggers, stored procedures, user-defined functions) may cause different side effects on each replica, unless the side effects are absolutely deterministic. Because of these issues, other replication methods are preferred.
LSM-tree storage engine
Log-Structured Merge-Tree. An indexing structure based on the principle of merging and compacting sorted files (SSTables). Idea: (1) keep a sorted in-memory key-value structure, the memtable (usually implemented as a red-black or AVL tree); (2) keep sorted structures on disk (SSTables) that hold all the keys and values. A sparse in-memory index gives byte offsets for some of the keys in each SSTable; it can be sparse because the data is sorted, so you know a key lies between two indexed keys and can scan from there.
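A toy sketch of the write and read path under simplified assumptions: the "SSTables" are kept in memory as sorted lists instead of on-disk files, and there is no compaction or tombstone handling.

```python
import bisect

class ToyLSM:
    def __init__(self, memtable_limit=3):
        self.memtable = {}
        self.segments = []                # each segment is an immutable sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as a sorted segment, then start a fresh memtable.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:                       # 1. check the memtable
            return self.memtable[key]
        for segment in reversed(self.segments):        # 2. newest segment first
            keys = [k for k, _ in segment]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return segment[i][1]
        return None

db = ToyLSM()
for k, v in [("a", 1), ("b", 2), ("c", 3), ("a", 99)]:
    db.put(k, v)
print(db.get("a"))   # 99: the memtable and newer segments shadow older segments
```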
MapReduce
Low-level programming model for distributed execution on a cluster of machines. Uses "map" and "reduce" functions, which must be pure functions. During the first phase, the Map phase, computers work on a task in parallel. During the second phase, the Reduce phase, the work of separate computers is combined, eventually obtaining a single result.
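The shape of the model, illustrated with a single-process word count (no cluster, no framework): map emits key-value pairs, the shuffle groups them by key, and reduce combines the values for each key.

```python
from collections import defaultdict

# Word count expressed as pure map and reduce functions, the shape MapReduce expects.
def map_fn(document):
    for word in document.split():
        yield (word, 1)                     # emit (key, value) pairs

def reduce_fn(word, counts):
    return (word, sum(counts))              # combine all values for one key

documents = ["the cat sat", "the cat ran"]

# "Map phase": run map_fn over every document (on a real cluster, in parallel).
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_fn(doc):
        grouped[word].append(count)         # the shuffle groups values by key

# "Reduce phase": combine the grouped values per key.
result = dict(reduce_fn(w, c) for w, c in grouped.items())
print(result)                               # {'the': 2, 'cat': 2, 'sat': 1, 'ran': 1}
```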
Property Graphs
Model for structuring and querying data in graphs. Each vertex consists of: Unique id. Set of outgoing edges. Set of incoming edges. A collection of properties (key: val pairs). Each edge: Unique id. Start (tail) vertex. End (head) vertex. Label to describe relationship. A collection of properties (key: val pairs).
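A direct, simplified translation of that model into Python types; a real graph store would also index edges by head and tail vertex so traversals don't require a scan.

```python
from dataclasses import dataclass, field

@dataclass
class Vertex:
    id: str
    properties: dict = field(default_factory=dict)

@dataclass
class Edge:
    id: str
    tail: str          # id of the start vertex
    head: str          # id of the end vertex
    label: str         # describes the relationship
    properties: dict = field(default_factory=dict)

alice = Vertex("v1", {"name": "Alice"})
london = Vertex("v2", {"name": "London", "type": "city"})
lives_in = Edge("e1", tail=alice.id, head=london.id, label="lives_in",
                properties={"since": 2019})

# Outgoing edges of a vertex are simply the edges whose tail is that vertex.
edges = [lives_in]
outgoing = [e for e in edges if e.tail == alice.id]
print([e.label for e in outgoing])   # ['lives_in']
```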
Object-relational mismatch (impedance mismatch)
Most application development is done using object-oriented programming. Working with SQL usually requires an awkward translation layer to map objects to relations and back.
Star Schema
Most common data model for data warehouses. Center of schema is a fact table (each row represents a fact or individual event). Some columns of fact table are foreign key references to other tables, called dimension tables. If each row in a fact table is an "event," dimension tables represent the who, what, where, when, how, and why of the event.
Leader-based replication
Most common solution for data replication. One of the replicas is designated the leader. When clients write to the db, requests are sent to the leader, which first writes the new data to its local storage. The other nodes are followers. Whenever the leader changes its data, it sends the data change to all of its followers as part of a replication log or change stream. Each follower takes the log from the leader and updates its local copy, applying changes in the order they were received. When a client wants to read from the db, it can query either the leader or any of the followers (writes are only accepted by the leader). This is a built-in feature of many relational databases (PostgreSQL, MySQL, Oracle Data Guard, SQL Server) and some non-relational ones (MongoDB, RethinkDB, Espresso). Distributed message brokers such as Kafka and RabbitMQ also use it.
Concatenated Index
Multi-column index. Combines several fields into one key.
Use cases for multi-leader replication
Multi-datacenter: usually doesn't make sense within a single datacenter (the benefits rarely outweigh the complexity). Clients with offline operation: e.g. calendar apps, where every device has a local db that acts as a leader, and there is an asynchronous multi-leader replication process between the replicas of your calendar on all devices. Collaborative editing: allowing several people to edit the same document simultaneously (like Google Docs). It's a similar problem: to guarantee there won't be editing conflicts, the application must get a lock on the doc before a user can edit, and another user can't edit until the first user has committed their changes; for faster collaboration you could make the unit of change very small (like every keystroke).
B-Tree write-ahead log
The db needs to be resilient to crashes (a write can split pages and update multiple pages that must reference each other; if something breaks in the middle of that process, the index could be left in a corrupted state). So it's common for B-trees to also have a write-ahead log (WAL, or redo log): an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself. The DB uses this log to restore the B-tree to a consistent state after a crash.
Multi-column indexes
Needed when queries touch multiple columns of a table. A multicolumn index should be created when data satisfying multiple conditions is to be retrieved, such as a complex-condition retrieval using the AND operator with multiple columns as the key. For example, suppose a search uses table columns C1, C2, and C3 as the key: SELECT retrieval-column FROM retrieval-table WHERE C1=10 AND C2=20 AND C3=30. In this case, a multicolumn index over the three columns C1, C2, and C3 should be created instead of three separate single-column indexes, reducing the overhead of index and row accesses. Concatenated index: the most common type of multi-column index. It combines several fields into one key by appending one column to another, like a phone book index mapping (lastname, firstname) to phone number. Because of the sort order, the index can be used to find all people with a particular last name, or a particular last-first combination, but not all people with a particular first name.
Graph-Like Data Models (Overview)
NoSQL datastore. Anything can be related to everything. Models data using vertices and edges. Might be appropriate if many-to-many relationships are common in your data. Ex: social graphs, the web graph, road networks. Vertices and edges don't all need to be the same type of object; Facebook maintains a single graph of people, locations, events, etc.
Rebalancing partitions: Dynamic partitioning
The number of partitions is proportional to the size of the dataset. When a partition grows beyond a certain size (default 10 GB in HBase), it is split into two partitions so that approximately half the data ends up on each side of the split; conversely, if data shrinks below some size it can be merged with an adjacent partition. Each partition is assigned to one node, and each node can handle multiple partitions; after a split, one half can be transferred to another node to balance the load. Advantage: the number of partitions adapts to the total data volume. Disadvantage: an empty DB starts off with a single partition, so while the dataset is small all writes have to be processed by a single node while the other nodes sit idle. For key-range partitioning, pre-splitting requires already knowing what the key distribution is going to look like.
Rebalancing partitions: Partitioning proportionally to nodes
Number of partitions is proportional to the number of nodes. Fixed number of partitions per node. When a new node joins the cluster it randomly chooses a fixed number of existing partitions to split, and then takes ownership of one half of each of those split partitions while leaving the other half of each partition in place. Picking partition boundaries randomly requires that hash-based partitioning is used.
Forward Compatibility
Older code can read data that was written by newer code.
Many-to-one relationship
One type of data relationship in which many records in a table are related to one record in another table. Ex: employee to department relationship is many to one (many employees work in one department).
Write Amplification
One write to the db resulting in multiple writes to disk over the course of the db's lifetime (think of the repeated compaction and merging of SSTables in an LSM-tree).
OLAP
Online Analytical Processing. Typically: primarily used by analysts, not end users. Lower volume of queries, but each query is demanding. Read pattern: aggregate over a large number of records. Write pattern: bulk import (ETL) or event stream (the analysts generally don't write to it; it's a read-only copy of data from the OLTP databases). Dataset size: TB to PB. Disk bandwidth is often the bottleneck. Column-oriented storage is increasingly popular for these workloads.
OLTP
Online Transaction Processing. Typically: user-facing (handles a huge volume of requests). Read pattern: small number of records per query. Write pattern: random-access, low-latency writes from user input. Dataset size: GB to TB. Disk seek time is often the bottleneck.
Hierarchical Data Model
Organizes data into a tree-like structure that allows repeating information using defined parent/child relationships. Not good for many-to-many relationships.
Primary index
Primary index: index is primary if it is built on a unique attribute
Dataflow through databases issues
The process that writes to the database encodes the data, and the process that reads it decodes it. Forward/backward compatibility is important. One subtle detail: sometimes old code reads new data, processes it, and writes it back without knowledge of a new field. The encoding formats support preservation of unknown fields, but sometimes you need to take care at the application level.
With a partitioned db, what are some situations that require data and requests to be moved from one node to another?
Query throughput increases, so you want to add more CPUs to handle the load. Size of dataset increases, so you want to add more disks and RAM to store it. A machine fails and other machines need to take over the failed machine's responsibilities.
3 big problems with replication lag
Reading your own writes: the user writes to the leader and immediately reads from a follower; they might not be consistent. Fix: read-after-write consistency. Reads going backward in time: ex: a user makes several reads from different followers, and a later read is effectively at an earlier point in time, since updates don't flow uniformly to all followers. Fix: monotonic reads. Bad sequencing: imagine a conversation with a question and a response. Mr. Poons: "How far into the future can you see, Mrs. Cake?" Mrs. Cake: "About ten seconds usually, Mr. Poons." A third person listening to the conversation through followers could hear the order reversed if one follower lags a little. Fix: consistent prefix reads.
Rebalancing partitions
Rebalancing: the process of moving load from one node to another in the cluster. Regardless of partitioning, rebalancing is expected to meet some minimum reqs: After, load (data storage, read and write requests) should be shared fairly between nodes in the cluster. While rebalancing is happening, the database should continue accepting reads and writes. No more data than necessary should be moved between nodes, to make rebalancing fast and to minimize the network and disk I/O load.
Apache Avro
Row-based binary encoding format that came out of Hadoop (Thrift was not a good fit for Hadoop's use cases). Self-describing format: data and metadata about the schema are embedded in the file. Supports serialization and compression. Has two schema languages: Avro IDL, intended for human editing, and a JSON-based one for machine readability.
Problems with multi-leader replication in multiple datacenters
The same data may be concurrently modified in two datacenters (conflict resolution is needed). Multi-leader replication is a somewhat retrofitted feature in many databases, and there are often subtle configuration pitfalls, so it's often considered dangerous territory that should be avoided if possible.
Consistent prefix reads
Says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order. Implementations: * Make sure that any writes that are causally related to each other are written to the same partition (but in some applications that can't be done efficiently). * Use algorithms that explicitly keep track of causal dependencies. Solves the sequencing anomaly where the order of writes is not preserved across reads.
Secondary index
An index is secondary if it is not built on a unique attribute (crucial for performing joins). A secondary index can be constructed from a key-value index; the main difference is that keys are not unique. This is solved in two ways: make each value in the index a list of matching row identifiers, or make each key unique by appending a row identifier to it. Either way, both B-trees and LSM-trees can be used for secondary indexes.
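A sketch of the first approach (a list of matching row identifiers per value), using an in-memory dict in place of a B-tree or LSM-tree.

```python
from collections import defaultdict

# Each index entry maps a (non-unique) value to the list of matching row ids.
rows = {
    1: {"name": "Alice", "city": "London"},
    2: {"name": "Bob", "city": "Paris"},
    3: {"name": "Carol", "city": "London"},
}

city_index = defaultdict(list)
for row_id, row in rows.items():
    city_index[row["city"]].append(row_id)                # keys are not unique

print(city_index["London"])                               # [1, 3]
print([rows[i]["name"] for i in city_index["London"]])    # ['Alice', 'Carol']
```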
Why use an append-only log in a database?
It seems wasteful at first glance (why not overwrite the old value with the new?). (1) Appending and segment merging are sequential write operations, which are much faster than random writes, especially on magnetic spinning-disk hard drives (to some extent sequential writes are also preferable on flash-based solid state drives). (2) Concurrency and crash recovery are much simpler if segment files are append-only or immutable. (3) Merging old segments avoids the problem of data files getting fragmented over time.
Binary strings
Sequence of 1s and 0s
Service discovery
Service discovery is the automatic detection of devices and services offered by these devices on a computer network. Service discovery is how applications and (micro)services locate each other on a network. Service discovery implementations include both: a central server (or servers) that maintain a global view of addresses and clients that connect to the central server to update and retrieve addresses.
Shared architectures
Shared-memory: All CPUs, RAM chips and disks joined together in one OS. All components can be treated as a single machine. Shared-disk: Several machines with independent CPUs and RAM, but store data on an array of disks that is shared between machines.
Document-oriented database advantages
Since they store all information for a given object in a single instance in the database, mapping object into the database is simple and doesn't need an object-relational mapping. Flexible schema. Better performance from locality. They are attractive stores for web applications (since they're subject to continual change in place, and speed of deployment is an important issue).
Approaches to replication (data systems)
Single-leader replication: the client sends writes to one node (the leader), which sends a stream of data change events to the other replicas (followers); reads can be performed on any replica. Multi-leader replication: clients send writes to any of several leader nodes; leaders send streams of data change events to each other and to any follower nodes. Leaderless replication: clients send each write to several nodes, and read from several nodes in parallel in order to detect and correct nodes with stale data.
Rebalancing partitions: Fixed number of partitions
Partition size is proportional to the size of the dataset. Create many more partitions than there are nodes and assign several partitions to each node. Ex: a cluster with 10 nodes can be split into 1,000 partitions from the outset, so that approximately 100 partitions are assigned to each node. If a node is added to the cluster, the new node can steal a few partitions from every existing node until partitions are fairly distributed again; only entire partitions are moved between nodes. The number of partitions doesn't change, nor does the assignment of keys to partitions; the only thing that changes is the assignment of partitions to nodes. The reassignment (when a node is added) happens asynchronously, so you can keep reading and writing while the transfer is in progress. You can even account for mismatched hardware and give a bigger/faster node more partitions.
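A small sketch of the idea under simplified assumptions (round-robin initial assignment, invented node names): the key-to-partition mapping never changes, and adding a node only moves a few whole partitions.

```python
# Fixed number of partitions: keys map to partitions forever; only the
# partition-to-node assignment changes when nodes are added.
NUM_PARTITIONS = 12                            # chosen up front, never changes

def partition_for(key: str) -> int:
    return hash(key) % NUM_PARTITIONS          # stable, because NUM_PARTITIONS is fixed

def add_node(assignment, new_node, all_nodes):
    # Steal partitions from the most-loaded nodes until the new node has a fair share.
    target = NUM_PARTITIONS // len(all_nodes)
    assignment = dict(assignment)
    while sum(1 for n in assignment.values() if n == new_node) < target:
        loads = {n: sum(1 for v in assignment.values() if v == n)
                 for n in all_nodes if n != new_node}
        donor = max(loads, key=loads.get)                       # most-loaded node
        victim = next(p for p, n in assignment.items() if n == donor)
        assignment[victim] = new_node                           # only whole partitions move
    return assignment

before = {p: ["node-a", "node-b", "node-c"][p % 3] for p in range(NUM_PARTITIONS)}
after = add_node(before, "node-d", ["node-a", "node-b", "node-c", "node-d"])
moved = [p for p in range(NUM_PARTITIONS) if before[p] != after[p]]
print(f"moved {len(moved)} of {NUM_PARTITIONS} partitions")     # moved 3 of 12 partitions
```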
SSTable
Sorted string table. A file on disk that contains a set of arbitrary, sorted key-value pairs inside. A simple abstraction to efficiently store large numbers of key-value pairs while optimizing for high throughput, sequential read/write workloads.
Sparse index
Sparse index: contains index records for only some search-key values. Why? Less space and less maintenance overhead for insertions and deletions. Generally slower than dense index for locating records
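A tiny sketch of how a sparse index is used: find the nearest indexed key at or before the target, then scan forward through the sorted data.

```python
import bisect

# Only some keys are indexed; lookups seek to the nearest signpost and scan.
sorted_records = [("apple", 1), ("banana", 2), ("cherry", 3), ("date", 4), ("fig", 5)]
sparse_index = {"apple": 0, "cherry": 2}       # key -> position of some records only

def lookup(key):
    indexed_keys = sorted(sparse_index)
    i = bisect.bisect_right(indexed_keys, key) - 1
    start = sparse_index[indexed_keys[i]] if i >= 0 else 0
    for k, v in sorted_records[start:]:        # scan forward from the nearest signpost
        if k == key:
            return v
        if k > key:
            return None                        # works because the data is sorted
    return None

print(lookup("date"))    # 4, found by scanning forward from "cherry"
```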
Data Cube / OLAP Cube
Special case of a materialized view: a grid of aggregates grouped by different dimensions. For example, a company might wish to summarize financial data by product, by time period, and by city to compare actual and budget expenses; product, time, city, and scenario (actual vs. budget) are the data's dimensions. A cube can hold many dimensions, not just the two or three that are easy to visualize.
Data partitioning (sharding)
Splitting a big database into smaller subsets called partitions so that different partitions can be assigned to different nodes (also known as sharding). Goal: spread the data and query load evenly across multiple machines, avoiding hot spots.
B-tree Indexing (Overview)
Standard index implementation in relational databases. Keeps key-value pairs sorted by key, which allows efficient key-value lookups and range queries. Breaks the db down into fixed-size blocks or pages, traditionally 4 KB in size, and reads or writes one page at a time (this corresponds closely to the underlying hardware: disks are arranged in fixed-size blocks). Each page can be identified by an address or location, so one page can point to another. A page contains several keys and references to child pages. A leaf page contains individual keys and values (the value will typically be the byte offset of the location where the record is actually stored).
JSON, XML, CSV Encoding Advantages
Standardized; can be read and written by many languages. Somewhat human-readable. JSON is a subset of JavaScript, so it's supported natively by web browsers.
Adding a Key/Value to a B-Tree
Start at the root and move downward to the leaf that contains that key, change the value in that page, then write the page back to disk. If there isn't enough space in the leaf to accommodate a new key, it is split into two pages with half the keys/values in each, and the parent is updated to account for the split (adding another key and reference). This algorithm ensures that the tree remains balanced: a B-tree with n keys always has a depth of O(log n).
Implementation strategies for replication logs
Statement-based replication: the leader logs every write request (statement) that it executes and sends that statement log to its followers. Write-ahead log (WAL) shipping: the log is an append-only sequence of bytes containing all writes to the database; the same log can be used to build a replica on another node. Logical (row-based) log replication: decouple the replication log from the storage engine by using different formats. The physical log is the storage engine's physical data representation; the logical log is a decoupled log, usually a sequence of records describing writes to db tables at the granularity of a row. Trigger-based replication: do replication at the application level; many relational dbs have triggers and stored procedures that let you trigger custom application code on a data change.
Column-Oriented Data
Store all the values from each column together in memory and on disk. Usually uses LSM trees for writes (B-trees won't work).
Relational model advantages
Supports joins. Supports many-to-many and many-to-one relationships.
Failover (distributed data system)
Switching to a redundant or standby system when something breaks. Ex: if the leader node fails in a replicated data system, one follower has to be promoted to leader, clients need to be reconfigured to send their writes to the new leader, and the other followers need to start consuming data changes from the new leader. Can be manual or automatic. Automatic failover is complicated, so many ops teams prefer to perform failovers manually.
Term-partitioned indexes (global indexes)
Term-partitioned indexes (global indexes) secondary indexes are partitioned separately, using the indexed values. An entry in the secondary index may include records from all partitions of the primary key. When a document is written, several partitions of the secondary index need to be updated; however, a read can be served from a single partition.
Relational Model (overview)
The data is organized in the DB in "relations" or tables, where each relation is an unordered collection of tuples, or rows. SQL.
Nonclustered index
The data is present in arbitrary order, but the logical ordering is specified by the index. Index keys are sorted in order in the tree, but the physical order is different. The indexed columns are typically non-primary key columns used in JOIN, WHERE, ORDER BY clauses.
Heap File (Databases)
The key in an index is the thing queries search for, but the value can be either the actual row or a reference to a row stored elsewhere. If the value is a reference, the place where the rows are stored is called a heap file: an unsorted collection of records, where the index just references a position in the file. Good for updating a value without changing the key (it can be overwritten in place if the new value is not larger than the old one; otherwise it has to be moved and all indexes updated). Common because it avoids duplicating data when multiple secondary indexes are present: each index just references a location in the heap file, and the data is kept in one place.
Branching Factor (B-Tree)
The number of references to child pages in one page of the B-tree.
Virtual view (Databases)
The result set of a stored query on the data, which database users can query just as they would a persistent database collection object. This pre-established query command is kept in the database dictionary. As a result set, it is a virtual table computed or collated dynamically from data in the database when access to that view is requested. Essentially a shortcut for writing queries: when you read from a virtual view, the SQL engine expands it into the view's underlying query on the fly and then processes the expanded query.
Secondary index with pointers
To avoid repeating key values, the most commonly used solution for secondary indexes uses pointers: one key has a list of pointers to the matching data records.
LSM storage engine read flow
To serve a read request: look for the key in the in-memory tree (memtable), then in the most recent on-disk segment, then the next-older segment, and so on. From time to time, run a merging and compaction process in the background to combine segment files and discard overwritten or deleted values.
Relational Model: what's it good for?
Transaction processing (entering sales or banking transactions, airline reservations, stock-keeping in warehouses) and batch processing (customer invoicing, payroll, reporting).
B-Tree concurrency control
Updating pages requires careful concurrency control or a thread might see the tree in an inconsistent state. Usually protect the tree's data structures with latches (lightweight locks).
B-Tree optimizations
Use a copy-on-write scheme: a modified page is written to a different location, and a new version of the parent pages in the tree is created, pointing at the new location. This removes the need for a write-ahead log and is also useful for concurrency control. Keys can be abbreviated to save space in pages. Many implementations try to keep pages close to each other on disk to speed up access, but this is hard to maintain as the tree grows.
LSM-tree Storage Engine how do you get your data sorted by key?
Use an in-memory balanced tree structure as the memtable (red-black or AVL trees work). The on-disk part is log-based: break the db down into variable-sized segments (typically several megabytes or more) and always write a segment sequentially.
Declarative Language (advantages)
Usually more concise and easier to work with than an imperative api. Hides implementation details (query optimizer does the heavy lifting). Declarative languages lend themselves to parallel execution.
Web services
When HTTP is used as the underlying protocol for talking to the service, it is called a web service.
Avro schema evolution
When an application encodes data it uses the writer's schema; when an application decodes data it uses the reader's schema. These schemas may be different, but they need to be compatible. When data is decoded, the Avro library resolves the differences by looking at the writer's schema and the reader's schema and translating the data from the writer's schema into the reader's. To maintain compatibility, you may only add or remove a field that has a default value (Avro doesn't have optional and required markers).
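A conceptual sketch of the resolution rules only (this is not Avro's binary format or API): fields are matched by name, unknown fields are ignored, and fields missing from the written data are filled from the reader's defaults.

```python
# Toy schema resolution: reader_schema maps field name -> default value.
reader_schema = {"name": None, "favorite_color": "unknown"}

def resolve(record, reader_schema):
    out = {}
    for field, default in reader_schema.items():
        out[field] = record.get(field, default)   # fill missing fields from defaults
    # Fields present in the record but absent from the reader's schema are dropped.
    return out

written = {"name": "Ada", "favorite_number": 7}   # produced with the writer's schema
print(resolve(written, reader_schema))            # {'name': 'Ada', 'favorite_color': 'unknown'}
```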
Why is hash mod n a bad strategy for rebalancing partitions?
When partitioning by hash of a key, it's best to divide the possible hashes into ranges and assign each range to a partition (e.g., assign a key to partition 0 if 0 <= hash(key) < b_0, to partition 1 if b_0 <= hash(key) < b_1, etc.). Why not use hash(key) mod N (the number of nodes)? Because if the number of nodes changes, most of the keys will have to be moved from one node to another.
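A quick way to see the problem empirically (approximate, since Python's built-in string hash is not what a database would use): count how many keys change nodes when going from mod 10 to mod 11.

```python
# With mod-N assignment, changing the node count reassigns almost every key.
def node_for(key, n):
    return hash(key) % n

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if node_for(k, 10) != node_for(k, 11))
print(f"{moved / len(keys):.0%} of keys move when going from 10 to 11 nodes")  # roughly 90%
```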
LSM-tree Storage Engine write flow
When write comes in, add it to an in-memory balanced tree (red-black tree). [memtable] When the tree gets bigger than some threshold (a few mb) write it out to disk as an SSTable file. This file becomes the most recent segment of the db. While writing to disk, writes can continue to new tree.
Modes of Dataflow
Whenever you need to send data to another process with which you don't share memory, you need to encode it as a sequence of bytes. The most common ways data flows between processes: via databases, via service calls (REST/RPC), and via asynchronous message passing.
Pub/sub messaging
Why: In modern cloud architecture, applications are decoupled into smaller, independent building blocks that are easier to develop, deploy and maintain. Publish/Subscribe (Pub/Sub) messaging provides instant event notifications for these distributed applications. The Publish Subscribe model allows messages to be broadcast to different parts of a system asynchronously. A message topic provides a mechanism to broadcast asynchronous event notifications, and endpoints that allow software components to connect to the topic in order to send and receive those messages. To broadcast a message, a component called a publisher simply pushes a message to the topic. Unlike message queues, which batch messages until they are retrieved, message topics transfer messages with no or very little queuing, and push them out immediately to all subscribers. All components that subscribe to the topic will receive every message that is broadcast, unless a message filtering policy is set by the subscriber.
Clustered Index (Databases)
With a clustered index the rows are stored physically on the disk in the same order as the index. Therefore, there can be only one clustered index. Alternatively, the index is clustered if it is built on the same attribute used to sort records in the data file It is generally faster to read from a clustered index if you want to get back all the columns. You do not have to go first to the index and then to the table. Writing to a table with a clustered index can be slower, if there is a need to rearrange the data.
Column-Oriented Data disadvantages
Writes take longer (compression, sorting, etc)
Thrift and Protocol Buffer Backward/Forward Compatibility
You can add new fields to the schema, provided that you give each field a new tag number (an encoded record is just the concatenation of its encoded fields). If old code reads data written by new code, it just ignores fields with unrecognized tags. As long as each field has a unique tag number, new code can always read old data, because the tag numbers keep the same meaning. (But a newly added field cannot be required; it must be optional or have a default value.)
Compaction (data system)
ex: Throw away older duplicate keys in a log Reduction of the number of data elements, bandwidth, cost, and time for the generation, transmission, and storage of data without loss of information by eliminating unnecessary redundancy, removing irrelevancy, or using special coding.
Materialized view
in a relational data model, it is often defined like a standard (virtual) view: a table-like object whose contents are the results of some query. Whereas a virtual view is a shortcut for writing queries, a materialized view is an actual copy of query results stored to disk.
Location transparency
Location transparency is the use of names to identify network resources rather than their actual location. The RPC model implies a level of location transparency: calling a procedure looks largely the same whether it is local or remote. But local and remote calls are usually not identical, so local calls can still be distinguished from remote ones.
Relational Model: disadvantages
Not scalable enough for very large datasets or very high write throughput. Some specialized queries are poorly supported. Many restrictions on the data model, e.g. the object-relational mismatch.
Document-oriented database disadvantages
support for joins is weak. If you need to join you need to do it in the client code, not the db.
Encoding (Serialization, Marshalling)
The translation of the in-memory representation of data into a byte sequence that can be stored (or sent over the wire). Python has pickle, Java has java.io.Serializable, etc., but using these language built-ins is usually a bad idea (inefficient, decoding can often execute arbitrary code, poor versioning support).
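A small comparison for intuition: both calls below produce a byte sequence, but the pickle bytes are Python-specific (and unpickling untrusted data can execute arbitrary code), while the JSON bytes can be read by many languages.

```python
import json
import pickle

record = {"user_id": 42, "name": "Ada"}

# Language-specific encoding: convenient, but only Python can read it, and
# unpickling untrusted data can execute arbitrary code.
as_pickle = pickle.dumps(record)

# Standardized textual encoding: readable by many languages.
as_json = json.dumps(record).encode("utf-8")

print(type(as_pickle), type(as_json))   # both are byte sequences ready to store or send
restored = json.loads(as_json.decode("utf-8"))
assert restored == record
```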
Primary Key
uniquely identifies one row in a relational table, one document in document database, or one vertex in a graph database.