Designing Data Intensive Apps (Part 1)


Branching Factor (B Trees)

= # of ref's to child pages that a single page of the B-Tree holds. Higher factor = more keys per page = fewer levels of depth = faster DB operations. In practice, the branching factor depends on the space needed to store the page ref's & key range boundaries, but it's typically several hundred.
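A quick back-of-the-envelope sketch (plain stdlib Python, not from the book) of why a higher branching factor means fewer levels:

```python
import math

def btree_depth(total_keys: int, branching_factor: int) -> int:
    """Rough depth estimate: each level multiplies the number of
    reachable pages by the branching factor."""
    return max(1, math.ceil(math.log(total_keys, branching_factor)))

# A higher branching factor means fewer levels for the same key count.
for bf in (16, 100, 500):
    print(bf, btree_depth(1_000_000_000, bf))
# 16 -> 8 levels, 100 -> 5 levels, 500 -> 4 levels (approximate)
```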

Thrift & Protocol Buffers; Difference vs MessagePack

= Binary encoding libraries. Thrift was made by FB, Protocol Buffers by Google; both are open-sourced. Both require a schema for any data that's encoded, which defines: (1) the allowed types (2) required vs optional fields. Key difference vs MessagePack: fields are identified by compact tag #'s from the schema instead of spelled-out field names, so the encodings end up much smaller.

Data Warehouse

= DB totally isolated from OLTP operations; used specifically to house & return data for analytics. Contains a read-only copy of all data in the various OLTP systems & can be optimized for running analytics. Separation of OLTP & OLAP DB's is important b/c OLTP DB's are usually highly valuable operations that must be secured & shouldn't be queried against often... so we just isolate the OLAP & OLTP DB's from each other and analysts can query the OLAP DB all they want!

Binary Encoded JSON vs Textual JSON

= JSON that's encoded in a binary format, rather than the usual textual format → created in response to the large memory footprints that regular JSON usually takes up. Examples: MessagePack, BSON, BJSON, UBJSON, BISON, Smile, etc... Some of these formats extend the default set of available data types, i.e.: (1) distinguishing between ints and floats (2) adding support for binary strings. Otherwise, the data model is unchanged compared to JSON/XML. More about comparisons with JSON: https://www.sitepoint.com/data-serialization-comparison-json-yaml-bson-messagepack/
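Rough size-comparison sketch, assuming the third-party msgpack package is installed; the record is the book's running example, and exact byte counts will vary:

```python
import json
import msgpack  # third-party package, assumed installed (pip install msgpack)

# The book's running example record:
record = {"userName": "Martin", "favoriteNumber": 1337,
          "interests": ["daydreaming", "hacking"]}

as_json = json.dumps(record).encode("utf-8")
as_msgpack = msgpack.packb(record)

print(len(as_json), len(as_msgpack))          # msgpack is a bit smaller, but field names are still encoded
print(msgpack.unpackb(as_msgpack) == record)  # True on msgpack >= 1.0: same data model round-trips
```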

Distributed Actor Framework

= a system of actors across multiple distributed nodes, where each actor responds to the messages within the system & processes them. The same message-passing mechanism is used across all nodes; if actors are on diff nodes, MSG's are encoded into byte sequences before being tx'd, which preserves location transparency. The DAF model is basically a MSG broker + the actor programming model combined in a single framework. Cons: (1) Upgrade Compatibility → struggles with forward/backward compatibility b/c actors on diff nodes may be running diff versions of software during rolling upgrades. Pros: (1) Handles concurrency SUPER well by having each actor handle a single MSG at a time (2) Use of a MSG broker means we get all the benefits of MSG brokering systems!

Binary Encodings & Key Pros

= an encoding strategy where data is converted into compact byte representations for sending over the network to other systems / programs / apps. Best used in internal systems where performance is controllable and fully optimizable. Key pros of binary encoders: (1) eat less space than textual encoders b/c they omit field names from encoded data (2) the schema serves as a great form of documentation that's always up-to-date, since the schema is necessary for decoding (3) keeping a DB of schemas lets you check forward & backward compatibility of schema changes before anything's deployed (4) for statically typed lang's, code generation from the schema is great! Lets you check types at compile time, so you can quickly catch bugs. Generally, schema evolution allows the same kind of flexibility as schemaless / schema-on-read JSON DB's, BUT also provides guarantees about your data & much better tooling!

Anti-Caching

= approach where we move the least recently used data from memory to disk when we run out of memory. The least-accessed data gets moved to disk first; if we end up accessing this data again at some point, it's moved from disk back into memory. This swapping process is similar to what OS's do with virtual memory and swap files. However, the DB manages memory more effectively since it can work at a more granular level (the DB works at the individual record level whereas the OS works w/ entire memory pages).
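Toy sketch of the record-level eviction idea (the AntiCache class and its "disk" dict are made up for illustration, not any real DB's API):

```python
from collections import OrderedDict

class AntiCache:
    """Toy sketch: keep hot records in memory, spill the least recently
    used ones to a 'disk' dict when the memory budget is exceeded."""

    def __init__(self, max_in_memory: int):
        self.max_in_memory = max_in_memory
        self.memory = OrderedDict()   # record id -> record, in LRU order
        self.disk = {}                # evicted records live here

    def put(self, key, record):
        self.memory[key] = record
        self.memory.move_to_end(key)
        while len(self.memory) > self.max_in_memory:
            cold_key, cold_record = self.memory.popitem(last=False)  # least recently used
            self.disk[cold_key] = cold_record

    def get(self, key):
        if key not in self.memory:                # pull an evicted record back from "disk"
            self.put(key, self.disk.pop(key))
        self.memory.move_to_end(key)
        return self.memory[key]
```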

Message Passing Dataflow

= async message passing systems! These are somewhere between RPC & DB's. Like RPC b/c → the client's request is delivered to another process w/ low latency. Like DB's b/c → the message isn't sent directly from 1 process to another; it actually goes through a message broker / message queue / message-oriented middleware.

gRPC

= modern open source RPC framework that runs in any environment. Efficiently connects services in and across data centers w/ pluggable support for load balancing, tracing, health checking & authentication. Applicable in the last mile of distributed computing to connect devices, mobile applications and browsers to backend services. If you're new to gRPC, read this: https://grpc.io/docs/guides/

MessagePack

= one of the binary encoding versions of JSON. Pro: decreases the amount of space used vs textual JSON a little bit by converting the data into binary. Cons: (1) Lose human readability b/c it's now in binary form... probably not worth the small space improvement compared to having the pro of human-readability (2) Still uses field names, so it's not as compact as Thrift or Protobuf (since they only use field ID #'s). Pros & cons & overview from the creator: https://gist.github.com/frsyuki/2908191

Interface Definition Language (IDL)

= the specification language used to describe a software component's API in a language-independent way. You can think of the IDL as the language used to define the schema for transforming data from 1 end of the RPC to the other. The flow looks like this: client's data → encoded using the IDL's defined formats / schemas / binary encoder → converted into a format for the server using some decoding library based on the IDL used. For example, if we have a client program written in JS, we can transform the data into Protobuf format → send it over the network to the server. The server then uses a Protobuf decoding library to convert the data into a native data structure / format for server-side execution.

Vectorized Processing

= using instruction sets to work on SETS of data rather than single data values; specifically, in the case of column DB's, it's applying sets of instructions to entire COLUMNS (vectors) at once, rather than processing the data one value at a time. WHOA! This is basically applying linear-algebra-style operations to data at a vectorized scale (each data column is a vector that we can transform with operations like projections & bitwise operators). Column DB's also make efficient use of CPU cycles. How? The query engine takes a chunk of compressed column data that fits in the CPU's L1 cache → loads it into the L1 cache & iterates through it in a tight loop with no function calls. If we were to work across the other cache / memory levels instead, we'd need a lot more function calls at each stage. As we compress data down to smaller sizes, we can fit more into the high-performing caching tiers! Thus, we get better performance just by combining compression & use of caches :) Once the data is in the L1 cache, bitwise operators (like AND and OR) can be designed to work directly on these chunks of compressed column data. A rough sketch is below.
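A rough illustration using numpy (assumed installed); it just shows whole-column operations instead of row-by-row loops, not any specific engine's internals:

```python
import numpy as np

# Two columns from a column-oriented store, loaded as contiguous arrays.
price = np.array([5, 12, 7, 30, 18, 3], dtype=np.int32)
qty   = np.array([1,  4, 9,  2,  5, 7], dtype=np.int32)

# One vectorized expression filters whole columns at once instead of
# looping row by row; under the hood this maps to tight, SIMD-friendly loops.
mask = (price > 6) & (qty >= 4)          # bitwise AND over entire vectors
total = np.sum(price[mask] * qty[mask])  # aggregate only the matching rows
print(mask, total)
```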

Column Families

= when columns of a table are grouped & stored together, rather than every single column getting its own individual storage. This is primarily done to group columns of related data together. It's like... the medium between column DB's & row DB's. Each column family: (1) stores all columns of a row together (2) stores a row key (3) doesn't use column compression. A table can have multiple column families & a column family is usually made up of more than 1 column. Examples: Cassandra & HBase use column families (inherited this concept from Bigtable).

Clustered Indexes

= when we store data directly w/in the index itself instead of referencing some external data store / location for the desired data values Useful when the hop from index to heap file is too costly for reads. If secondary indexes are used, it's possible to actually just have them refer to the primary key (this helps to preserve "Single Source of Truth" to make it easier to update/add/delete secondary index values)

Secondary Indexes

= when you have additional keys to refer to the same value as the primary key. Used to speed up queries by adding additional lookup values to reference. Primary key vs secondary index: primary keys are always unique (single source of truth); secondary keys can have duplicates. Side note --> Heap files are useful when using 2ndary indexes b/c we're just following a reference to the data then changing w/e is at the end of the reference, rather than trying to mutate the actual data itself (again, this idea of a single source of truth, except this time, the source is an address reference, wowzuh!). Allows us to avoid duplicate data w/in heap files: each index references a location in the heap file, and the actual data is kept in one place. A toy sketch is below.
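Minimal sketch of the secondary-index-plus-heap-file idea (all names and data here are made up for illustration):

```python
# A heap "file" holds each row once; the primary and secondary indexes only
# store offsets into it, so the row data itself lives in exactly one place.
heap = []                                   # offset -> row dict
pk_index = {}                               # user_id -> heap offset (unique)
city_index = {}                             # city -> list of heap offsets (duplicates allowed)

def insert(row):
    offset = len(heap)
    heap.append(row)
    pk_index[row["user_id"]] = offset
    city_index.setdefault(row["city"], []).append(offset)

insert({"user_id": 1, "city": "Oslo", "name": "Ana"})
insert({"user_id": 2, "city": "Oslo", "name": "Bo"})

print([heap[o]["name"] for o in city_index["Oslo"]])  # ['Ana', 'Bo']
```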

Star Schema

A simple database design in which dimensional data are separated from fact or event data. A dimensional model is another name for a star schema. At the center of the star schema is a fact table.

Microservices

A software architecture that is composed of smaller modules that interact through APIs and can be updated without affecting the entire system. It's important to ensure forward & backward compatibility across the data encodings used by the various servers & clients of various versions → by doing so, we enable autonomous development of each service with assurances that certain service guarantees will be met.

Actor Models & Actors

Actor Model = model for concurrency in a single process; uses actors instead of threads Actors = Client / entity with local state that communicates with other actors by sending & receiving async messages. Actors only process 1 MSG at a time (thus don't need to worry about concurrency deadlocks) Message delivery isn't guaranteed (may be lost) within this model so it's important to beware of this!

snowflake schema

An expanded version of a star schema in which dimension tables are normalized and further broken down into sub-dimensions. Dimension tables also have pointers to other tables that hold sub-dimensional attributes

Non-clustered indexes

An indexing structure where the value of a key-val pair is just a pointer/reference to a data store (i.e. heap file) somewhere, rather than the actual data that we want

Data Encoding & Evolution of RPC Systems

As we update systems that use RPC for comm's, it's important to ensure backwards compatibility. B/c there's no way of forcing RPC clients to upgrade & stay up to date, we end up needing to support SUPER dated versions of RPC API's... painful AF

BLOB

Binary Large Object (1) Used to store large amounts of non-character data (2) Also used to hold data for exploitation by distinct types & user-defined functions (3) Commonly used for storage of large media files or large files in general (audio, photo, video, etc...)

Apache Avro

Binary encoding format that differs from Protobuf & Thrift Started b/c Thrift wasn't a good fit for Hadoop; needed a new binary encoding tool for Hadoop's use cases which required dynamic schema definitions (Protobuf & Thrift only offered static schema def's)

Bitmap Encoding & Runlength encoding

Bitmap encoding: a type of encoding where a column's values are converted into bitmaps (0's & 1's), one bitmap per distinct value. Run-length encoding: a type of encoding built on top of bitmap encoding where the lengths of the runs of 0s and 1s are encoded so we don't need to store all the bits. NOTE: we always start the encoding with the # of zeroes; if there are no leading zeroes, put 0. For example: 10010001100100 becomes --> 0,1,2,1,3,2,2,1, rest are zeroes (where each # is the length of an alternating run of bits & everything after the last # is "rest zeroes"). A tiny encoder is sketched below.
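A tiny run-length encoder that reproduces the example above (plain Python, not any particular DB's on-disk format):

```python
def run_length_encode(bits: str) -> list:
    """Encode alternating runs, always starting with the count of zeros
    (so the first number is 0 if the bitmap starts with a 1)."""
    runs, current, count = [], "0", 0
    for b in bits:
        if b == current:
            count += 1
        else:
            runs.append(count)
            current, count = b, 1
    runs.append(count)
    return runs

print(run_length_encode("10010001100100"))  # [0, 1, 2, 1, 3, 2, 2, 1, 2] -- trailing 2 is the "rest zeroes"
```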

Covering Indexes

Compromise between Clustered & Non-clustering indexes; it uses both!! Where some data is stored directly in index (usually most frequently accessed data) & some data is stored w/in a heap file Data stored w/in the index in this case is a covered query / covered index Like any duplication / caching style of logic, this: (1) speeds up read times :) (2) eats up more memory :( (3) adds more overhead on writes :( Need to beware of how to enforce transactional guarantees since duplication = more complexity & potential for inconsistencies

Log Storage DB's (Pros & Cons)

Cons: (1) Appending endlessly to a log file wastes a lot more space than updating values in place (2) The hash table of keys must fit within memory; anything larger than memory will cause crashes / overwrites / issues (3) Range lookups are really inefficient → i.e. grabbing data for keys 0 - 100,000 requires 100K individual lookups; this kinda sucks & defeats the benefits of a hash table. Pros: (1) Sequential writes are faster than random writes (2) Concurrency & crash recovery are easier to achieve, b/c the log is immutable & strictly append-only (3) Merging of old segments ensures data won't become fragmented over time. A minimal sketch is below.
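A minimal sketch of the append-only-log + in-memory hash index idea (roughly Bitcask-style; the file name and helper functions are made up):

```python
import os

# Every write appends to the log; the in-memory index maps each key to the
# byte offset of its most recent value, so reads need one seek, not a scan.
index = {}

def db_set(f, key: str, value: str):
    f.seek(0, os.SEEK_END)
    index[key] = f.tell()               # remember where this record starts
    f.write(f"{key},{value}\n")
    f.flush()

def db_get(f, key: str) -> str:
    f.seek(index[key])                  # jump straight to the latest record
    return f.readline().rstrip("\n").split(",", 1)[1]

with open("toy.log", "w+") as f:
    db_set(f, "42", "hello")
    db_set(f, "42", "world")            # the old value stays in the log until compaction
    print(db_get(f, "42"))              # world
```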

B Tree Optimizations (Copy on Write, Key Abbrev., Locality, branching factor, fractal trees)

Copy-on-Write Scheme: instead of maintaining a write-ahead log, the DB creates a full copy of a page before modifying it. The modified page is written to a different location & new versions of the parent pages are created, pointing at the new location. Useful for concurrency control b/c we can do snapshot isolation & repeatable reads → by freezing the data in time, reads stay stable & concurrency doesn't suck. Key Abbreviation: keys only need to be diff enough to mark the boundaries of each key range, so it's possible to abbreviate keys as long as the boundaries are preserved. Having more keys per page lets us have higher branching factors & thus fewer levels. Locality: B tree pages are written on disk, but nothing ensures that sequential keys (neighboring pages) are stored near each other → so locality gets extremely tough / becomes non-existent; it's extremely hard to maintain on disk b/c the tree may grow / shrink. More Pointers (sibling refs): each leaf page can have refs to its sibling pages, so we limit the # of times we traverse between levels of depth within the DB; lets us scan keys in order w/o jumping back up between parent & child pages. Fractal Trees: a B tree variant that uses some log-structured ideas to reduce disk seeks; they have nothing to do with fractals.

Avro vs Thrift & Protobuf

Diff's from Protobuf / Thrift: (1) No Tag #'s Results in smaller encodings than Protobuf & Thrift The example record we've been using before would be just 32 bytes; most compact of all! Wowzuh! (2) No field or data type identifiers; just a bunch of val's concatenated together Strings just have a length prefix followed by UTF-8 bytes Int's use variable-length encoding (same as Thrift's CompactProtocol) (3) Can only decode the data if you are using the exact same schema as the code that wrote the data This is b/c the only way to differentiate fields & schema types is by actually referencing the schema as we parse through the binary representations of the data.. OOF

Swagger

Docs: https://swagger.io/specification/ = the OpenAPI definition format. Used to describe RESTful APIs & produce documentation, so consumers can understand a service's capabilities w/o access to source code or network traffic inspection. If properly used, customers should be able to easily understand what the API does & how to use it.

Backing up DB's

Every now & then, you should create DB backups in case a DB goes down for w.e reason. Backups are created using the newest schema at the time & missing values are set to default values for consistency. Backups can also be used in the future as actual DB's for other nodes (distributed systems) or you can potentially reformat data for OLAP processes (i.e. column-oriented storage)

B Trees vs LSM Trees (Writes, Crash recovery, Concurrency, Compaction/Fragmentation, Locality, Read Speed, Write Speed, Transaction Isolation)

Generally, you'll have to test both B Tree & LSM Tree implementations for your specific use case to figure out which one is best for your context. However, here are some general pros & cons: Writes: B-Trees overwrite entire pages in place with new data while LSM Trees are append-only. Crash recovery: both use write-ahead logs. Concurrency: B-Trees mutate pages in place, so latches are necessary to avoid race conditions; LSM Trees handle concurrency better b/c all actions are logged sequentially in a memtable. Compaction / fragmentation: B Trees do get fragmented while LSM Trees go through compaction as SSTables are merged, so LSM Trees usually take up less space than B Trees; LSM compaction may however block disk I/O operations while it's occurring, which is pretty bad but doesn't happen often. Locality: B Tree pages are written to disk in random order, so it's harder to keep sequential keys close; LSM Trees rewrite storage segments in a single run during merging so it's easier to keep sequential keys closer together on disk. Read speed: B Trees are well-organized & easily traversable; LSM Tree reads are usually slower b/c various data struct's must be checked (memtable + multiple SSTables) with varying levels of compaction among them. Write speed: LSM Trees usually write faster. B Trees must write at least twice (write-ahead log + actual page) and write pages at random locations, while LSM Trees batch writes sequentially into SSTables (though compaction does rewrite data later). Transaction isolation: B Trees usually dominate here since each key-val pair exists in a single place (single source of truth); LSM Trees may have multiple copies of the same data across SSTables as compaction proceeds.

Modes of Dataflow

How data flows from one process to another within a system (1) Dataflow through DB's (2) Dataflow through Services (3) Message Passing Data Flows

Avro Writer's Schema vs Reader's Schema

In Avro, b/c field identifiers & types are omitted from the encoded data, there's no good way of knowing which field is which w/o a schema. Thus, we have to use the appropriate schema when decoding data or we'll get wrecked. Writer's schema = schema used by the writer to encode data. Reader's schema = schema used by the reader to decode data. Key idea here is to ensure the writer & reader schemas are always compatible! Else, you'll run into crazy issues. Avro compares the writer & reader schemas side by side when decoding & translates data from the writer's schema into the reader's schema. (Read more in notes for specifics.) A hedged sketch is below.
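A sketch using the third-party fastavro library (assumed installed); the schemas here are invented just to show writer-vs-reader resolution with a defaulted new field:

```python
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader  # pip install fastavro

writer_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [{"name": "userName", "type": "string"}],
})
reader_schema = parse_schema({
    "type": "record", "name": "User",
    "fields": [
        {"name": "userName", "type": "string"},
        # New field the writer never knew about, with a default value:
        {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, writer_schema, {"userName": "Martin"})
buf.seek(0)

# The reader resolves the two schemas side by side; the missing field is
# filled in from the reader schema's default.
print(schemaless_reader(buf, writer_schema, reader_schema))
# expected: {'userName': 'Martin', 'favoriteNumber': None}
```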

fact table

In a data warehouse, the star schema table that contains facts linked and classified through their common dimensions. A fact table is in a one-to-many relationship with each associated dimension table. Key properties: (1) Holds pointers/ID's that link to values in other related tables (dimension tables) (2) Can hold values for attributes stored directly in the fact table (3) Each row represents an event, with columns describing attributes of the event

MSG Brokers & Message Passing vs RPC

Key pros of message brokers over direct RPC: (1) Buffering → Queue acts as a buffer so servers don't get overloaded (they can just grab from the Q whenever they have resources to do so!) (2) Resilience → Messages persist in Q and can be re-delivered if a service crashes while trying to process the message (3) Security & Scalability → No need to know the IP & port # of the recipient; we can just connect to the message Q instead! Super key in cloud ENV where Virtual machines come & go often (4) One-to-Many → Can send 1 MSG to many recipients (think PubSub, Kafka, and other similar tools) (5) Decoupling → The client & server don't need to know about each other; they each just need to know about the Q / the middle-man. Thus, you can change the client & server independently as long as the input/output of each system remains consistent & the Message Q continues to serve as middleware Major Diff from RPC: (1) It's one-sided communication, the message sender usually doesn't expect to receive a response (2) It's possible to receive a response though (if desired) by triggering additional processes & going through a separate channel (3) Highly async operation since the sender doesn't wait for the message to be received, it just sends it & hopes for the best :)

write amplification (and why it can be problematic)

LSM Trees rewrite data multiple times b/c of merging & compaction of SSTables --> this means that the same data gets written multiple times; the # of times is the write amplification factor. High amplification factors are problematic b/c each write takes up disk bandwidth that could be used to do other things; if we're doing tons of writes, we may interrupt / block other I/O operations. A back-of-the-envelope example is below.
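Back-of-the-envelope arithmetic with made-up numbers, just to show how an amplification factor is computed:

```python
# Hypothetical numbers: the app logically writes 1 GB, but that data is also
# written to the WAL and rewritten by three rounds of compaction.
logical_gb = 1.0
physical_gb = 1.0 + 1.0 + 3 * 1.0     # initial flush + WAL + 3 compaction rewrites

write_amplification = physical_gb / logical_gb
print(write_amplification)             # 5.0 -> each logical byte cost ~5 bytes of disk bandwidth
```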

Distributed Actor Framework vs RPC

Location transparency is better in the actor model vs RPC b/c there's the assumption that messages may be lost within network calls

LSM Trees

Log-Structured Merge Trees. Basically SSTables plus a write-ahead log that's used to ensure data isn't lost if the system crashes; the log is necessary since the memtable lives in RAM until it's flushed to disk as an SSTable, so the log ensures any data still in RAM has a backup. Used in Google's Bigtable. = a DB where data is stored in sorted order, with a log used to back up the in-memory memtable data

Differences between local & network calls

Main differences: (1) Unreliability → network calls can be dropped due to all kinds of strange things in the wild; local functions are predictable (2) Timeout → function calls always return something; network calls can sometimes return nothing at all... they can just TIMEOUT (3) Idempotence → sometimes a network call actually gets through, but the response gets lost...; local function calls don't have this issue. Need a way to prevent duplication from occurring.. otherwise, you might end up with horrible messes (4) Execution time → function calls take roughly the same time every time; network calls have varying latencies based on how much traffic is going through the network & the characteristics of the network itself (5) Encoding → things passed over a network must be encoded; things in local memory can easily be passed to functions via pointers (6) Software language differences → server & client may be using diff software lang's, so the RPC framework must translate data types from one language into another. Note: not all lang's have the same types (i.e. JS's problems with #'s greater than 2^53)

In-Memory DB's vs Disk-based DB's

Main issue with these is the volatility of RAM (everything is lost if the machine is restarted / crashes). Some in-memory DB's aim for durability via: (1) special hardware: battery-powered RAM (2) writing a change log to disk (3) writing periodic snapshot backups to disk (4) replicating in-memory state to other machines (distributed systems use-case most likely). Anytime an in-memory DB is restarted, it needs to reload its state via: (1) disk (2) the network from a replica (3) other specialized hardware. Reads are served entirely from memory; writes to disk are just for durability (append-only log for backups; logs can also be used for monitoring). Can be better than disk operationally since it gets rid of some of the issues w/ managing on-disk data structures. Faster than disk: counterintuitively, in-memory DB's aren't faster because they avoid reading from disk; they're faster b/c they avoid the overhead of encoding in-memory data structures into a form that can be written to disk, i.e. we eliminate the transformation step that's necessary to store data on disk. Even a disk-based storage engine may never need to read from disk if you have enough memory, b/c the OS caches recently used disk blocks in memory anyway. More data models than disk: Redis offers a DB-like interface to data structures such as priority queues & sets. B/c all data is stored in-memory, richer data structures are easier to implement than if we had to encode them for disk. TLDR: encoding data for disk makes it more complex & slower to work with in general.

Data Cubes + Materialized View

Materialized View = a cache for storing results of commonly used OLAP Aggregation Functions. (i.e. SUM, AVG, MIN, MAX, etc...) Data Cubes = special case of materialized view; a grid of aggregates grouped by different dimensions Major advantages: Quickly calculate aggregate function results just by looking along the appropriate row & applying the aggregation logic Major Cons: Can't query along dimensions not present in the cube since the details just don't exist Key Takeaway: Only use OLAP cubes & materialized views as performance optimization tools; not as replacements for storing tons of raw data in actual data warehouses
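Toy sketch of a materialized aggregate / mini data cube in plain Python (invented data, SUM only):

```python
from collections import defaultdict

# Pre-aggregate net price by (date, product) so that common roll-ups become
# simple lookups and sums along one dimension instead of scans of raw facts.
sales = [
    ("2024-01-01", "widget", 10.0),
    ("2024-01-01", "gadget", 25.0),
    ("2024-01-02", "widget", 12.0),
]

cube = defaultdict(float)
for date, product, amount in sales:
    cube[(date, product)] += amount          # SUM aggregate per cell

total_for_widget = sum(v for (d, p), v in cube.items() if p == "widget")
print(total_for_widget)   # 22.0 -- read straight off the cube, no scan of raw fact rows
```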

B-tree

Most widely used indexing structure (standard for implementation of relational & non-relational DB's) Stores data as fixed sized pages / blocks, where each page is ID'd using addresses (similar to pointers but on disk instead of in memory). Each node holds a range of keys & leaf nodes hold the actual values.

API Versioning

No good agreement for the best way for a client to indicate which API Version they're using For REST API's: common to include the version # in the URL or HTTP Accept header For services that use API Keys to ID a particular client: Can store a client's requested API Version on the server & allow this selected version to be updated through a separate admin UI (this is what Stripe does)

REST

Not a protocol; it's a design philosophy (much like how Agile isn't a set of rules, but a set of principles to live by). Emphasizes: (1) simple data formats (2) using URLs to identify resources (3) using HTTP features for cache control, authentication & content type negotiation. Favors simpler approaches, typically involving less code generation & automated tooling. Generally more popular than SOAP nowadays --> often used in microservices & often used in cross-org service integrations. APIs designed using REST principles are called RESTful. For more info: https://ninenines.eu/docs/en/cowboy/2.6/guide/rest_principles/

Forward Compatibility

Older code can read data written by newer code Usually trickier since you may have to add exceptions, transformations, very custom logic.... :(

OLTP and OLAP

Online Transaction Processing (OLTP) is a class of information systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing. Online Analytical Processing (OLAP) performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling.

Dynamic Schema Generation in Avro

Possible b/c Avro doesn't use tag #'s in its encodings, and it's really straightforward to do. Let's say we want to do a data dump of a relational DB into a single file, and we want to avoid the issues of textual formats (space, limited types & speed), so we choose a binary format. We can fairly easily generate an Avro schema from the relational schema (in the JSON representation we saw earlier, which is 1 of Avro's 2 IDL representations). The data is then encoded using the schema & dumped into an Avro object container file. You'd create a schema for each DB table & each column becomes a field in the record; the column name in the DB maps to the field name in Avro. Anytime the DB schema changes, it's easy to generate a new Avro schema & export data in the new schema. Since fields are identified by name rather than by tag #'s, the new writer's schema can simply be matched up against the reader's schema.

REST vs RPC

RPC's are better for Private internal use (1) Custom RPC Protocols w/ a binary encoding format get better performance than generic JSON over REST (2) Designed for use within the same org. & typically within the same datacenter REST APi's are better for public API use (1) Good for experimentation & debugging → curl can just hit the API Endpoints for testing (2) Supported by all mainstream languages & platforms (3) Vast ecosystem of tools available for: --> Servers, Caches, Load balancers, Proxies, Firewalls, Monitoring & debugging tools, Testing

Thrift Binary Protocol vs Compact Protocol

Read the GDoc Notes for Specifics... the details are in the names though. Compact Protocol is more compact than binary since it uses additional encoding on top of the standard binary one

Adding/Deleting Fields in Protobuf & Thrift

Records encoded by these formats are composed of groups of fields. Each field has: (1) a tag # (1, 2, 3, etc...) (2) a data type (string, int, etc...). Both these things are essential to ensuring backward & forward compatibility. If a field isn't set, it's simply left off the encoded record. Field tags are the KEY to compatibility both forwards & backwards, so you can change field names as long as you don't change field tags. The data type annotation lets a parser that doesn't recognize a tag know how many bytes it needs to skip, so old code can read records written by new code and simply ignore the unknown fields. Adding a new field: make it optional OR give it a default value → ensures backward compatibility. Also, don't mutate existing field tags; use a new tag # for each new field → forward & backward compatibility. Deleting an existing field: NEVER delete a required field; you can only delete optional fields. NEVER reuse the field tag # of a deleted field; old code will likely still have this data defined somewhere / try to use it, so just don't use this field tag again.

Column-oriented database

Refers to relational database technologies that use column-centric, not row-centric, storage. Very important → all columnar data must be stored in such a way that the row indexes line up properly. Major pro: decreased # of disk reads. Major con: higher MGMT complexity → indexing must be kept consistent across all columns, so the nth entry of each column (taken together across columns) forms the nth row of the table.

RPCs

Remote Procedure Calls. The primary problem with RPC's = the attempt at location transparency. Fundamentally, RPC's are over-the-network calls, but we try to make them look like local ones... this abstraction causes all kinds of wonky failure modes (timeouts, lost responses, retries). It's actually way better to acknowledge the existence of the network call, rather than to try & abstract it away completely.

Size & Level Compaction

Size-tiered compaction = newer / smaller SSTables are successively merged into older & larger SSTables. Leveled compaction = the key range is split up into smaller SSTables & older data is moved into separate levels. https://github.com/facebook/rocksdb/wiki/Leveled-Compaction

Star VS Snowflake Schema

Snowflake: more normalized, more granularity in data, more modular. Star: easier to work with / query against since there are fewer references; arguably more readable since there's less depth. Both: dimension tables & fact tables will often become MASSIVE just because of the sheer volume of data produced over time. The more attributes we add, the more data points we get, which means more intelligence, but also more potential noise & complexity.

backward compatibility

Newer code can read data written by older code. Usually the easier direction to maintain, since you know the older data format when you write the newer code.

Sorting in Column Oriented DB's

Sorting usually doesn't matter, and it's easiest to store rows in the order they're inserted. It's possible to keep them in sorted order though, and then impose indexing / partitioning mechanisms to speed up queries (sorta like SSTables from before). If we choose to sort the columns, then we must ensure we maintain the row-index relationship between columns. Knowing what sort order to use usually depends on the most frequently received queries. Sorting can enable better compression: if we see lots of repeated data within the sorted columns, run-length encoding can be used to minimize overall space used. The compression effect is strongest on the first sort key; the 2nd & 3rd sort keys will likely be more jumbled, and thus less compressible. Having many sort orders is really similar to having many secondary indexes in row-oriented storage. Big diff though: row-oriented DB's → keep all row data externally in 1 place (heap file or clustered index), & secondary indexes point to that external data; column DB's → usually don't use pointers; the data just lives in the columns themselves.

Human Readable Data Encoding Formats

Textual representations of data --> JSON, XML, CSV Cons: (1) Space --> Eats up way more space generally than binary formats (2) Speed --> Longer to process partly b/c it must first be transformed from text into bits, and there's more of it to look through usually than binary formats (3) Weak Schema Def's --> Lack of clear types & schema definitions makes these encodings highly vulnerable to difficulties with maintenance → Often need to hardcode logic to decode the received payloads (4) Data Types --> XML / CSV can't distinguish between # & a String consisting of digits w/o referring to an external schema. JSON distinguishes Strings & #'s, but not integers & floating-point #'s; also lacks the concept of precision for #'s so it can't handle very large #'s well. Pros: (1) Readability b/c it's a very intuitive textual representation (2) Best used in communications between apps that aren't owned / maintained by the same organization or group. Side Note About Schema Compatibility Across Client + Server: As long as we know & agree on the format for data, it doesn't matter how pretty or efficient the data is. Really, the hardest part is getting people/organizations to agree on anything as the standard... ugh....

Message Brokers

The middleware responsible for Q'ing up messages and eventually distributing them. General logic flow of MSG Brokers: (1) process sends message to a Q or topic (2) broker looks at Q to check for messages (3) if MSG Found, broker sends it to 1 or more consumers / subscribers to that Q or topic Open source implementations are most popular nowadays --> RabbitMQ, ActiveMQ, HornetQ, NATS, Apache Kafka Any encoding format can be used for MSG's since they're just sequences of bytes w/ some metadata --> Just make sure to ensure forward & backward compatibility
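Minimal in-process stand-in using Python's stdlib queue, just to show the buffering/decoupling idea; a real broker adds persistence, topics & redelivery:

```python
import queue
import threading

# The producer never talks to the consumer directly; it only knows about the
# queue (decoupling), and the queue absorbs bursts (buffering).
broker_queue = queue.Queue(maxsize=100)

def producer():
    for i in range(3):
        broker_queue.put({"event": "signup", "user_id": i})  # fire and forget

def consumer():
    while True:
        msg = broker_queue.get()          # blocks until a message is available
        print("processed", msg)
        broker_queue.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
broker_queue.join()                        # wait until every message was handled
```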

Avro Object Container File

The object / file that contains the writer's schema & is essential for decoding during the read process. You can look at the Object Container file (Which holds the writer's schema) & look at the data the same way you would look at a JSON File The file is self-describing so it includes all needed metadata This property is super useful for dynamically typed lang's since you can just open the object container file, analyze them, & write data sets to output in Avro formats w/o even thinking about schemas. This is what Apache Pig does!

Heap Files

The place where the actual data is stored when the value of an index entry is just a reference / pointer to the data. Data is stored in no particular order w/in the heap; it might be: (1) append-only (2) tracking deleted rows so it can overwrite them w/ new data later. Useful for secondary indexes since it can serve as the single source of truth for multiple keys (all the index values just point to one place, rather than each secondary key-val pair storing its own copy of the data). Single source of truth makes it easy to delete, update & add values! Overwriting existing data in a heap file: easy to update a value w/o changing the key b/c the data can be overwritten in place, since we have the address reference. This works well if incoming data size < existing data size. If the incoming data is larger, then we need to assign a new location for this data w/in the heap since we need more space. We then either: (1) delete the old data & the reference to the old location w/in the heap & update all index references to point at the new data (2) OR place a forwarding pointer at the old address, which now points at the location of the new, larger data.

Memory Bandwidth

The rate at which data can travel from SRAM to DRAM to the processor and vice versa (megabits per second). Key concerns for queries / DB's: (1) Bottlenecks: getting data from disk into memory & getting data from main memory into the CPU cache (2) Avoiding branch mis-predictions (3) Avoiding bubbles in the CPU instruction processing pipeline (4) Making use of single-instruction-multi-data (SIMD) instructions in modern CPUs

Column Compression

This is when we further compress the data stored within column-oriented DB's by using various encoding techniques such as: (1) Bitmap Encoding (2) Run Length Encoding

Code Generation vs Dynamically Typed Languages

Thrift & Protobuf work best in languages where strong typing & code compilation prior to execution are highly valued. Avro works best in situations where dynamically typed languages are used & we couldn't care less about type checking prior to actual execution. Avro is great for these situations b/c it's designed to dynamically create schemas, so it can bypass the code generation step (which is wasteful in these cases). Caveat for Avro: Avro does support optional code generation & type checking if so desired, so it's possible... but it would probably be an anti-pattern since Avro is designed for dynamic-type situations.

Avro Schema Evolution Rules

To maintain compatibility, you can only add/remove a field that has a default value → the default accounts for the field being missing in either the writer's or reader's schema. Changing the data type of a field is possible, provided Avro can convert between the types. Changing a field name is trickier, but possible: the reader's schema can keep a list of aliases, so it can match an old writer's schema that uses the old field name → this is backward compatible, but not forward compatible.

Rolling Upgrades

Upgrade strategy where you gradually roll out the updates to various nodes in waves / batches / 1-at-a-time to slowly ensure that the new version actually works before going fully live with it

Concatenated Indexes

Used when we need to query for multiple columns or fields at once That is, when we need to query along N Degrees at once, concatenated indexes are faster than B Trees & LSM Trees (which can only return the value for 1 Degree per query) This is basically what we're doing when we use composite keys to distinguish between different values Combines several fields into 1 key by appending 1 column to another; Index definition specifies which order the fields are concatenated EX: phone books → Where you have an index of lastName, firstName to the value of your Phone# Use these sparingly & usually only when you know you'll need to query along these concatenated keys often since they can add lots of complexity to the DB structure / architecture
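Tiny sketch of the phone-book style concatenated key using a sorted list + bisect (all data invented):

```python
import bisect

# Concatenated key (lastName, firstName) kept in sorted order, so an exact
# lookup on both fields and a prefix scan on lastName alone are both cheap.
entries = [
    (("Doe", "Jane"), "555-0101"),
    (("Doe", "John"), "555-0102"),
    (("Smith", "Ann"), "555-0199"),
]
entries.sort()
keys = [k for k, _ in entries]

# Exact lookup on the full concatenated key:
i = bisect.bisect_left(keys, ("Doe", "John"))
print(entries[i][1])                       # 555-0102

# Range scan on the leading field only:
lo = bisect.bisect_left(keys, ("Doe", ""))
hi = bisect.bisect_left(keys, ("Doe", "\uffff"))
print([e[1] for e in entries[lo:hi]])      # ['555-0101', '555-0102']
```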

Depth Complexity (B Trees)

We always split B tree pages when necessary to ensure that the tree stays balanced. A B Tree w/ n keys always has a depth of O(log n). Most DB's fit in B Trees with depths of 3 to 4 levels max. I.e. a four-level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB.
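The arithmetic behind the 256 TB figure above, as a quick sanity check:

```python
# Four levels of fan-out at 500 refs per page, each leaf page holding 4 KB.
branching_factor = 500
page_size_bytes = 4 * 1024

addressable_pages = branching_factor ** 4       # 62,500,000,000 pages
capacity_bytes = addressable_pages * page_size_bytes
print(capacity_bytes / 1e12)                    # 256.0 (decimal terabytes)
```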

Full Text Search & Fuzzy Indexes (and Apache Lucene as Example)

What if you want to search for similar keys? Or handle misspelled keys? I.e. what if you need some way of implementing auto-correct or smart searches? This type of searching is known as fuzzy querying! Apache Lucene = high-performance, full-featured text search engine library written entirely in Java; suitable for nearly any application that requires full-text search, especially cross-platform. To cope w/ typos in docs or queries, Lucene can search for words w/in a certain edit distance (Levenshtein distance). B/c data is stored in a sorted & indexed fashion, Lucene can tell the query to search w/in some range around the fuzzy query; i.e. if the misspelling is "fsat" and the index this translates to is "700", then we can search within a range of 650-750, depending on what the allowed edit distance is. Really, if you need to know more about this topic, just reread the notes & see the images in the Doc. A sketch of the edit distance itself is below.
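A plain Levenshtein distance implementation (classic dynamic programming, nothing Lucene-specific) to make the "edit distance" idea concrete:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + (ca != cb),         # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("fsat", "fast"))  # 2 -- plain Levenshtein counts the swap as two substitutions
```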

Compaction

= act of throwing away duplicate keys w/in log segments at discrete points in time, keeping only the most recent value for each key; compacted segments can then be merged together, producing a compact "snapshot"/backup of the DB's contents at that point in time

Web Services: Definition, Approaches & Examples

When HTTP is used as the underlying comm's protocol, then it's a web service; 2 popular approaches for web services are REST & SOAP. These types of services are used in the web as well as other non-web contexts; 3 examples are: (1) Client running on a user device (i.e. mobile device) making requests to a server using HTTP over the public internet (2) Middleware = software that supports calls / data flow between 1 service to another where these services are owned by the same org. Basically anything that enables data flow between various services within a microservice architecture (3) External Data Exchange Between Org's a service making requests to a service owned by a diff. Org. (usually via the Internet) I.e. public APIs & Oauth for shared access to user data

Dataflow through Services

When communicating via a network, there's a few ways of setting up the comm's. One popular 1 is client ← → server Can be implemented using Web Services or RPC

Triggered Functions / Events / MSG Chaining

While topics in MSG Passing systems only allow 1 way dataflow, we can create multiplexed systems by chaining events together --> Specifically, having the consumer publish a MSG to another topic after it actions on the MSG that it received (call & response type situation). For example, having the consumer publish to a reply Q that the sender of the original MSG listens to

Runtime Checks in Protobuf & Thrift

While we specify in the IDL schema whether a field is required or optional, this doesn't affect how the field is encoded. Instead, we just get a runtime failure if a required field isn't present. So, required/optional is more of a safety net enforced by runtime checks than part of the encoding itself.

SOAP & Web Services Framework

XML-based protocol Avoids using HTTP features even though it's usually used over HTTP. This is b/c it's designed to stay independent of HTTP. Web Services Framework = a ton of standards and spec's for describing how to properly use SOAP: https://www.innoq.com/resources/ws-standards-poster/ Jesus, this is quite a lot... no wonder REST is generally more popular LOL Super high learning curve with very language specific implementation details makes it very hard to adapt & scale... very unpopular for these reasons :P Even with all its standards, rules & protocols, SOAP still sucks at ensuring communications between one organization to another. This is another reason why REST tends to be preferred among new / smaller companies, and why many large org's currently using SOAP are trying to get rid of these systems in favor of REST.

Checksums

a data transmission control that uses a hash of a file to verify accuracy; ensures all data was properly saved OR ELSE it'll throw an error

Logs

an append-only sequence of records / data files

Protobuf "Repeated" Marker vs Thrift "List" Type

both are methods of representing data types that appear multiple times Protobuf uses a "Repeated" Field Marker to denote that the data type is repeated several times while Thrift uses the familiar Array / List Object. Pros of Repeated Marker in Protobuf: (1) Easily convert a SINGLE type to MULTI Pros of LIST Type in Thrift: (1) Can have nested lists Con --> Harder to convert single value to multi values

Indexes (and trade-offs)

guide to quickly navigate data much faster than 1-by-1 iterations. Trade-off: Having more indexes means faster read times b/c there's more things to query along, BUT you get longer write times

Hashed Indexes

index implementation for key-val pairs that's done using hash maps / dictionaries. Hash maps are held in memory, so they can only be as large as RAM will allow (it's also possible to only have keys stored in memory)

Sorted String Tables (Def. & Implementation)

Keys are: (1) stored in sorted order (2) only allowed to appear once per segment. Implementation: (1) Write comes in --> add data to an in-memory balanced tree (i.e. red-black tree); the tree is called a memtable (2) When the memtable becomes too large, save it to disk as an SSTable file (3) The new SSTable becomes the latest segment of the DB --> while writing to disk, we can keep writing to a new memtable instance (4) When a read request comes in, we check the data structures in order: memtable → latest SSTable segment → next-latest SSTable → etc.... (5) Every now & then, merge & compact segment files & discard overwritten / deleted val's. A toy sketch is below.
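Toy sketch of the memtable-flush-then-read-newest-segment-first flow (tiny limits & in-memory "segments", purely for illustration):

```python
import bisect

# Writes go to a small in-memory memtable; when it fills up it is flushed
# as an immutable, sorted segment. Reads check the memtable first, then
# segments from newest to oldest.
MEMTABLE_LIMIT = 2
memtable = {}          # a real engine uses a balanced tree (e.g. red-black)
segments = []          # newest segment last; each is a sorted list of (key, value)

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        segments.append(sorted(memtable.items()))   # flush as an SSTable-like segment
        memtable.clear()

def get(key):
    if key in memtable:
        return memtable[key]
    for seg in reversed(segments):                  # newest segment first
        i = bisect.bisect_left(seg, (key,))
        if i < len(seg) and seg[i][0] == key:
            return seg[i][1]
    return None

put("b", 1); put("a", 2); put("b", 3)
print(get("a"), get("b"))   # 2 3
```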

Determining the Writer's Schema in Avro

reader must always know the writer's schema somehow when decoding data How we determine the writer schema depends on how Avro is being used: (1) Large Files w/ Lots of Records --> (i.e. Hadoop) Just store it at the head of the file in a specific file format (Object Container Files) set by Avro (2) DB w/ Individually Written Records --> Keep a version # at start of every encoded record AND Keep a list of schema versions in DB. Version linking lets you know which schema to use. Version #'s are needed b/c data written to the DB at diff times can be based on different writer schema versions. (3) Sending Records over a Network Connection --> If 2 processes communicate bidirectionally, they can negotiate the schema version on connection setup. The schema is then used throughout the duration of the connection TIP: Have a DB of Schema Versions Generally speaking, this is good to have to serve as documentation & a source of truth of what all the different possible Avro Schemas were over the lifetime of Avro's formatting within your system. Versions could just be a simple incrementing # or a hash of the schema

Data encoding

the process of converting in-memory data into a byte sequence that's sent over a network call aka serialization / marshaling a very well-known & highly studied / solved problem that has various approaches for implementation (i.e. human friendly encodings, binary encodings, language specific encodings)

Bloom Filters

used to help optimize LSM Trees & SSTable searches by approximating the contents of a set --> Tells you if the key doesn't appear in DB to save tons of time on disk reads for nonexistent values
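Minimal Bloom filter sketch (bit-array size & hash count are arbitrary choices), showing the "definitely not present" guarantee:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: may report false positives,
    but never false negatives, so a 'no' means we can skip the disk read."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)   # one byte per bit, for simplicity

    def _positions(self, key: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
bf.add("user:42")
print(bf.might_contain("user:42"))   # True
print(bf.might_contain("user:99"))   # almost certainly False -> skip searching the SSTables
```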

Dataflow Through DB's

Writer encodes data to the DB & reader reads from the DB; can think of this as passing a message to your future self. Backwards compatibility is ESPECIALLY important here since anything stored in a DB is just something we'll access in the future! So intuitive. Note --> Preserving unknown fields: let's say old code that doesn't know of the existence of a new field reads a record and writes it back to the DB; the new field could easily be dropped → should the field in the DB then be overwritten, or just left at its current value? Likely, the desired behavior is to leave the unknown field unchanged; we should only update it if the incoming data actually includes that field. This isn't a hard problem to solve but we HAVE to be aware of it!!

