Systems for Scalable Analytics
What is a thread?
A generalization of the OS's process abstraction. A program spawns many threads; each runs part of the program's computations simultaneously.
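A minimal Python sketch of one program spawning several threads that each handle part of a computation. The chunking scheme and the names `partial_sum` and `results` are illustrative assumptions. In CPython the global interpreter lock limits truly simultaneous execution of Python bytecode, so this shows the structure rather than a speedup.

```python
import threading

data = list(range(1, 101))           # the numbers 1..100
num_threads = 4
results = [0] * num_threads          # one slot per thread, so no lock is needed

def partial_sum(idx: int) -> None:
    """Sum every num_threads-th element of data, starting at position idx."""
    results[idx] = sum(data[idx::num_threads])

threads = [threading.Thread(target=partial_sum, args=(i,)) for i in range(num_threads)]
for t in threads:
    t.start()                        # each thread runs part of the computation
for t in threads:
    t.join()                         # wait for every thread to finish

total = sum(results)                 # combine the per-thread partial results
print(total)
```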
What are the high-level steps OS takes to get a process going?
1. Create a process (get Process ID; add to Process List)
2. Assign part of DRAM to the process, aka its Address Space
3. Load code and static data (if applicable) into that space
4. Set up the inputs needed to run the program's main()
5. Update the process's State to Ready
6. When the process is scheduled (Running), the OS temporarily hands off control to the process to run the show!
7. Eventually, the process finishes or the OS destroys it (Destroy)
What are the three requirements for models to perform inference in real time?
1. The features have to be created, or be available, in real time.
2. The model itself has to meet SLA requirements.
3. The model has to be self-contained and served in a distributed fashion.
What is a byte?
8 bits; the basic unit of data types
What is a directory?
A cataloging structure with a list of references to files and/or (recursively) other directories
What is a program (aka code)?
A collection of instructions for hardware to execute
What is the instruction set architecture?
The finite vocabulary of instructions (commands understood by hardware) for a processor; the bridge between hardware and software
What is volatile memory?
A data storage device that needs power/electricity to store bits; e.g., DRAM, CPU caches (SRAM)
What is semistructured data?
A form of data with less regular/more flexible substructure than structured data
What is structured data?
A form of data with regular substructure
What is a programming language?
A human-readable formal language to write programs; at a much higher level of abstraction than ISA
What is a page fault?
A page required by a process is not in DRAM; the OS intervenes to read the page from disk into DRAM
What is a file?
A persistent sequence of bytes that stores a logically coherent digital object for an application
What is application software?
A program or a collection of interrelated programs to manipulate data, typically designed for human use. Examples: Excel, Chrome, PostgreSQL, etc.
What is EC2?
Amazon Elastic Compute Cloud: a rentable remote computer (virtual server) on AWS
What is a process?
A running program, the central abstraction in an OS. Started by the OS when a program is executed by a user. The OS keeps an inventory of "alive" processes (Process List) and handles apportioning of hardware among processes.
What is a data structure?
A second layer of abstraction to organize multiple instances of the same or varied data types as a more complex object with specified properties. Examples: Array, Linked List, Tuple, Graph, etc.
What is an application programming interface?
A set of functions ("interface") exposed by a program/set of programs for use by humans/other programs
What is the relationship between characters and strings?
A string is typically just a variable-sized array of char
What is a page frame?
A virtual "slot" in DRAM to hold a page
What is the formula for access time?
Access time = Rotational delay + Seek time + Transfer time
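The formula can be sketched as a small Python function. The specific numbers below (4 ms rotational delay, 4 ms seek, 1 MB read at 100 MB/s) are illustrative assumptions, not measurements of any particular disk.

```python
def access_time_ms(rotational_delay_ms: float, seek_time_ms: float,
                   transfer_size_mb: float, transfer_rate_mb_per_s: float) -> float:
    """Access time = rotational delay + seek time + transfer time (all in ms)."""
    transfer_time_ms = transfer_size_mb / transfer_rate_mb_per_s * 1000.0
    return rotational_delay_ms + seek_time_ms + transfer_time_ms

# Hypothetical disk: 4 ms rotational delay, 4 ms seek, 1 MB read at 100 MB/s
example_ms = access_time_ms(4.0, 4.0, transfer_size_mb=1.0, transfer_rate_mb_per_s=100.0)
print(example_ms)
```

With these assumed numbers, the transfer itself (10 ms) is comparable to the positioning overhead (8 ms), which is why reading larger contiguous runs pays off.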
What is a bit?
All digital data are sequences of 0s and 1s (binary digits). Amenable to high-low/off-on electromagnetism. Layers of abstraction are used to interpret bit sequences. Easy way to remember: "Binary digIT".
What is S3?
Amazon Simple Storage Service
What is an OS?
An OS is a large set of interrelated programs that make it easier for applications and user-written programs to use computer hardware effectively, efficiently, and securely
What is a file descriptor?
An OS-assigned non-negative integer identifier/reference for a file's virtual object that a process can use
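A short Python sketch of working with a file descriptor directly through the `os` module. The file name `example.txt` and the temporary directory are assumptions made only for the illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")   # hypothetical file name
    fd = os.open(path, os.O_CREAT | os.O_RDWR)    # the OS hands back an integer
    os.write(fd, b"hello")                        # all I/O goes through that integer
    os.lseek(fd, 0, os.SEEK_SET)                  # rewind to the start of the file
    content = os.read(fd, 5)
    os.close(fd)                                  # release the descriptor

print(fd, content)
```

The exact integer depends on what the process already has open (0, 1, and 2 are conventionally standard input, output, and error).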
What is a data model?
An abstract model to capture organization of data in a database at a formal/logical level
What is a page?
An abstraction of fixed-size chunks of storage. Makes it easier to manage memory virtualization.
What is a file format?
An application-specific standard that dictates how to interpret and process a file's bytes. Hundreds of file formats exist (e.g., TXT, DOC, GIF, MPEG), with varying data models/types, domain-specific conventions, etc.
What is a segmentation fault?
An illegal address access
What is a database?
An organized collection of interrelated data
What are the key terms for a process (aka job)?
Arrival Time: time when the process gets created
Job Length: duration of time needed for the process
Completion Time: time when the process finishes or is killed
Turnaround Time = Completion Time - Arrival Time
Start Time: time when the process first starts on the processor
Response Time = Start Time - Arrival Time
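The two derived metrics can be sketched in Python as follows. The `Job` class and the example times are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Job:
    arrival_time: float      # when the process gets created
    start_time: float        # when it first runs on the processor
    completion_time: float   # when it finishes or is killed

    @property
    def turnaround_time(self) -> float:
        return self.completion_time - self.arrival_time

    @property
    def response_time(self) -> float:
        return self.start_time - self.arrival_time

# Hypothetical job: arrives at t=2, first runs at t=5, completes at t=12
job = Job(arrival_time=2.0, start_time=5.0, completion_time=12.0)
print(job.turnaround_time, job.response_time)
```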
What is caching?
Buffering a copy of bytes (instruction and/or data) from a lower level at a higher level to exploit locality
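As a rough software analogy, Python's `functools.lru_cache` keeps recent results in a small, fast buffer so that repeated accesses avoid the slow path. The function `read_block`, the cache size of 2, and the access pattern are assumptions chosen for illustration.

```python
from functools import lru_cache

@lru_cache(maxsize=2)                  # tiny cache: holds at most 2 blocks
def read_block(block_id: int) -> str:
    """Stand-in for a slow read from a lower level of the hierarchy."""
    return f"contents of block {block_id}"

# Block 1 is reused often (temporal locality), so it tends to stay cached
for block_id in [1, 2, 1, 1, 3, 1]:
    read_block(block_id)

info = read_block.cache_info()         # counts of hits and misses so far
print(info.hits, info.misses)
```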
What is address space?
Chunk(s) of memory assigned by the OS to a process. Helps virtualize and apportion physical memory.
What characterizes the 3 generations of data centers/clouds?
Cloud 1.0 (Past): Networked servers; user rents servers (time-sliced access) needed for data/software
Cloud 2.0 (Current): "Virtualization" of networked servers; user rents an amount of resource capacity; cloud provider has a lot more flexibility on provisioning (multi-tenancy, load balancing, more elasticity, etc.)
Cloud 3.0 (Ongoing Research): "Serverless" and disaggregated resources, all connected to fast networks
What are the 3 segments that memory is split into?
Code, Stack, and Heap
What is an operating system?
Collection of interrelated programs that work as an intermediary platform/service to enable application software to use hardware more effectively/easily. Examples: Linux, Windows, macOS, etc.
What are some cons of cloud computing?
Complexity of composing cloud APIs and licenses; data scientists must keep relearning; "CloudOps" teams
Cost over time can cross over and make it costlier!
"Lock-in" by cloud vendor
Privacy, security, and governance concerns
Internet disruption or unplanned downtime, e.g., AWS outage in 2015 made Netflix, Tinder, etc. unavailable!
What is cloud computing?
Compute, storage, memory, networking, etc. are virtualized and exist on remote servers; rented by application users
What are scheduling policies?
Controls how OS time-shares CPUs among processes
Roughly speaking, NVRAM is like a non-volatile form of ________, but with similar capacity as _____.
DRAM; SSDs
Roughly speaking, flash combines the speed benefits of _____ with persistence of ______.
DRAM; disks. Compared to disks: data access latency is about 100x faster; data transfer throughput is also 10-100x higher; parallel reads/writes are more feasible; cost per GB is 5-15x higher!
What is a spill/miss/fault?
Data needed for the program is not yet available at a higher level; need to get it from a lower level. Register spill (register to cache); cache miss (cache to main memory); fault (main memory to disk).
What is a hit?
Data needed is already available at a higher level
_________ of compute+memory from storage is common in cloud
Decoupling
What is deserialization?
Deserialization is the process of reconstructing the object from its serialized state. It is the reverse of serialization, i.e., bytes to data structure.
What is multiprocessing?
Different processes run on different cores (or entire CPUs) simultaneously
What is data?
Digital representation of information that is stored, processed, displayed, retrieved, or sent by a program
What is EBS?
Amazon Elastic Block Store (EBS) provides durable, block-level storage volumes that you can attach to a running Amazon EC2 instance. An EBS volume persists independently of the running life of the EC2 instance. After an EBS volume is attached to an instance, you can use it like any other physical hard drive. Amazon EBS also supports encryption of volumes.
What is virtualization?
Each hardware resource is treated as a virtual entity that OS can divvy up among processes in a controlled way
What is load balancing?
Ensuring different cores/processors are kept roughly equally busy, i.e., reducing idle times
What is a Boolean?
A true/false value. Examples in data science: Y/N or T/F responses. Just 1 bit is needed, but a Python bool is a full object that takes roughly 24-28 bytes in CPython, depending on the version.
What is an integer?
A whole number. Examples in data science: number of friends, age, number of likes. Typically 4 bytes; many variants (short, unsigned, etc.).
Disk space is organized into ________. Files are made up of disk pages aka ________
Files; Blocks
What are the 3 main kinds of software?
Firmware, Operating System, Application Software
What is a data type?
First layer of abstraction to interpret a bit sequence with a human-understandable category of information; interpretation fixed by the PL. Examples: Boolean, Byte, Integer, "floating point" number (Float).
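To illustrate that a data type is an interpretation imposed on bits, this Python sketch reads the same four bytes once as text and once as an unsigned integer. The byte values are an arbitrary choice for the example.

```python
raw = bytes([0x41, 0x42, 0x43, 0x44])     # the same 32 bits in both cases

as_text = raw.decode("ascii")             # interpreted as four ASCII characters
as_int = int.from_bytes(raw, "big")       # interpreted as one big-endian unsigned integer

print(as_text, hex(as_int))
```

The bits never change; only the type used to interpret them does.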
What is FIFO?
First-In-First-Out, aka First-Come-First-Serve (FCFS). Ranking criterion: Arrival Time; no preemption allowed. Main con: short jobs may wait a lot.
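A minimal Python sketch of FIFO scheduling on one processor, showing how short jobs wait behind a long one. The workload (one job of length 100 followed by two of length 10, all arriving at time 0) and the function name `fifo_schedule` are illustrative assumptions.

```python
def fifo_schedule(jobs):
    """jobs: list of (name, arrival_time, job_length).
    Returns {name: (start_time, completion_time)} under FIFO, no preemption."""
    clock = 0
    timeline = {}
    # Rank by arrival time; Python's sort is stable, so ties keep list order
    for name, arrival, length in sorted(jobs, key=lambda j: j[1]):
        start = max(clock, arrival)       # wait for the job to arrive if idle
        completion = start + length       # runs to completion once started
        timeline[name] = (start, completion)
        clock = completion
    return timeline

jobs = [("A", 0, 100), ("B", 0, 10), ("C", 0, 10)]
timeline = fifo_schedule(jobs)
avg_turnaround = sum(timeline[n][1] - arrival for n, arrival, _ in jobs) / len(jobs)
print(timeline, avg_turnaround)
```

Here B and C each need only 10 time units but complete at 110 and 120, which pulls the average turnaround time up to 110.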
What is EMR?
Amazon EMR (Elastic MapReduce): a fully managed cluster of EC2 instances
What is the processor?
Hardware to orchestrate and execute instructions to manipulate data as specified by a program
What is the network interface controller?
Hardware to send and receive data over a network of interconnected computers/devices
What is the main memory?
Hardware to store data and programs that allows very fast location/retrieval; byte-level addressing scheme
What is the heap for?
Heap is for dynamically created data structures
What is hardware efficiency?
How close the actual execution runtime is to the best possible runtime, given the instruction processing times of the processor. Improved with careful data layout of all data objects used by a program, based on its data access patterns. Key principle: raise cache hits; reduce memory stalls!
What is a float?
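A number with a fractional part, stored in floating-point format. IEEE-754 single-precision format is 4 bytes long; double-precision format is 8 bytes long. Java and C float is single; Python float is double! Standard IEEE format for single (aka binary32): 1 sign bit, 8 exponent bits, 23 fraction bits.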
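The sizes and the binary32 bit fields can be checked with Python's `struct` module. The value 0.15625 is chosen here only because it is exactly representable (1.01 in binary times 2 to the power -3).

```python
import struct

value = 0.15625
single = struct.pack(">f", value)          # IEEE-754 single precision (binary32)
double = struct.pack(">d", value)          # IEEE-754 double precision (binary64)

bits = int.from_bytes(single, "big")       # the 32 bits as one integer
sign = bits >> 31                          # top 1 bit
exponent = (bits >> 23) & 0xFF             # next 8 bits, stored with a bias of 127
fraction = bits & 0x7FFFFF                 # low 23 bits

print(len(single), len(double), sign, exponent, fraction)
```

The stored exponent is 124 because 124 - 127 = -3, and the fraction field holds the ".01" part of 1.01 in binary.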
What is page replacement?
If no frame is free when page fault happens, OS must evict some occupied frame's page!
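A small Python simulation of page faults with eviction. The card does not name an eviction policy; evicting the page that has been resident longest (FIFO) is an assumption made here for simplicity, as are the reference string and the frame count.

```python
from collections import deque

def count_page_faults(references, num_frames):
    """Count page faults for a sequence of page references,
    evicting the oldest resident page when no frame is free (FIFO policy)."""
    frames = deque()                 # resident pages, oldest on the left
    faults = 0
    for page in references:
        if page in frames:
            continue                 # page is already in DRAM: no fault
        faults += 1                  # page fault: page must be read from disk
        if len(frames) == num_frames:
            frames.popleft()         # no free frame: evict the oldest page
        frames.append(page)
    return faults

faults = count_page_faults([1, 2, 3, 1, 4, 1, 2], num_frames=3)
print(faults)
```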
What is locality of reference?
In computer science, locality of reference, also known as the principle of locality, is the phenomenon in which the same values, or related storage locations, are frequently accessed, depending on the memory access pattern. Spatial: nearby locations will be accessed soon. Temporal: the same locations will be accessed again soon. Locality can be exploited to reduce runtime using caching and/or prefetching across all levels in the hierarchy.
What are the three most popular services offered in cloud computing?
Infrastructure-as-a-Service (IaaS); Platform-as-a-Service (PaaS); Software-as-a-Service (SaaS)
What are the key aspects of software?
Instruction Set Architecture, Program (aka code), Programming Language (PL), Application Programming Interface (API), and Data
What are the main pros of cloud vs on-premise clusters?
Manageability: Managing hardware is not the user's problem
Pay-as-you-go: Fine-grained pricing economics based on actual usage (granularity: seconds to years!)
Elasticity: Can dynamically add or reduce capacity based on the actual workload's demand
How fast can a processor process a program?
Modern CPUs can run billions of instructions per second. The ISA tells us the number of clock cycles each instruction needs; the CPU's clock rate lets us convert that to runtime. Alas, most programs do not keep the CPU always busy! Memory access commands stall the processor; the ALU and CU are idle during memory-register transfers. Worse, data may not be in DRAM, so the program must wait for disk I/O! So, the actual execution runtime of the program may be orders of magnitude higher than what the clock rate calculation suggests!
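The clock rate calculation can be sketched in Python as follows. The instruction count, cycles per instruction, and 3 GHz clock rate are made-up numbers for illustration, and a single average cycles-per-instruction figure is itself a simplifying assumption.

```python
def ideal_runtime_seconds(num_instructions: int,
                          cycles_per_instruction: float,
                          clock_rate_hz: float) -> float:
    """Best-case runtime: total clock cycles divided by cycles per second.
    Ignores memory stalls and disk I/O, so real runtimes are usually higher."""
    return num_instructions * cycles_per_instruction / clock_rate_hz

# Hypothetical program: 6 billion instructions, 2 cycles each, on a 3 GHz core
runtime = ideal_runtime_seconds(6_000_000_000, 2, 3_000_000_000)
print(runtime)
```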
What are the 2 key principles in OS (any system) design & implementation?
Modularity: Divide the system into functionally cohesive components that each do their jobs well
Abstraction: Layers of functionality from low level (close to hardware) to high level (close to user)
How does a processor execute machine code?
Most common approach: load-store architecture
Registers: tiny local memory ("scratch space") on the processor
ISA specifies the bit length/format of machine code commands
ISA has commands to manipulate register contents:
  Memory access: load (copy bytes from a DRAM address to a register); store (the reverse)
  Arithmetic & logic on data items in registers: add/multiply/etc.; bitwise ops; compare, etc.
  Control flow (branch, call, etc.)
Caches: small local memory to buffer instructions/data
What is stored in the stack?
Most statically known data such as function arguments, return values, etc.
What is transfer time?
Moving data from/to the disk surface. Typically, hundreds of MB/s!
What is the seek time?
Moving the disk head to the correct track. Typically, 1-20 ms (high-end disks: avg is 4 ms).
What is concurrency?
Multiple processors/cores run different/same set of instructions simultaneously on different/shared data
Is it risky for OS to give up full control of hardware to some process (a user program)?
OS has mechanisms and policies to regain control
What is limited direct execution?
OS mechanism to time-share the CPU and preempt a process to run a different one (aka "context switch"). A scheduling policy tells the OS what time-sharing to use. Processes must also transfer control to the OS for "privileged" operations (e.g., I/O) via the System Calls API.
What is swap space?
OS-reserved space on disk to swap pages in and out of DRAM (physical memory)
________ _______ are still common, especially in large enterprises, healthcare, and academia, "hybrid clouds" too
On-premise clusters
What is the difference between a Relation, a Matrix, and a DataFrame?
Ordering: Matrix and DataFrame have row/col numbers; Relation is orderless on both axes!
Schema flexibility: Matrix cells are numbers. Relation tuples conform to a predefined schema. DataFrame has no predefined schema, but all rows/cols can have names, and col cells can be mixed types!
Virtual Address vs Physical Address
Physical is tricky and not flexible for programs. Virtual gives an "isolation" illusion when using DRAM. The OS and hardware work together to quickly perform address translation.
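Address translation can be sketched in Python with a simple page table. The 4096-byte page size, the table contents, and the use of a plain dictionary are assumptions for illustration; real systems do this in hardware with OS-managed tables.

```python
PAGE_SIZE = 4096                       # assumed page size in bytes

# Hypothetical page table: virtual page number -> physical frame number
page_table = {0: 5, 1: 2, 2: 7}

def translate(virtual_address: int) -> int:
    """Map a virtual address to a physical address via the page table."""
    page_number, offset = divmod(virtual_address, PAGE_SIZE)
    if page_number not in page_table:
        # In a real system this would trigger a page fault or a segmentation fault
        raise LookupError(f"virtual page {page_number} is not mapped")
    return page_table[page_number] * PAGE_SIZE + offset

physical = translate(4100)             # virtual page 1, offset 4
print(physical)
```

The offset within the page is unchanged; only the page number is swapped for a frame number.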
What is a data center?
Physical space from which a cloud is operated
What is prefetching?
Preemptively retrieving bytes (typically data) from addresses not yet explicitly requested by programs
What are the key parts of a computer?
Processor (CPU, GPU, etc.), Main Memory (aka Dynamic Random Access Memory), Disk (aka secondary/persistent storage), and Network interface controller (NIC)
What is a memory leak?
Program failed to free dynamic space
What is persistent data storage?
Program state/data is available intact even after process finishes
What is RR?
Round Robin. RR does not need to know job lengths. A fixed time slice is given to each job; cycle through jobs. RR is often very fair, but Avg Turnaround Time goes up!
What is firmware?
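A minimal Python sketch of Round Robin, compared against running the same jobs to completion one after another. The workload (three jobs of length 5, all arriving at time 0) and the time slice of 1 are illustrative assumptions.

```python
from collections import deque

def round_robin(jobs, quantum):
    """jobs: list of (name, job_length), all arriving at time 0.
    Returns {name: completion_time} when each job gets `quantum` units per turn."""
    queue = deque(jobs)
    clock = 0
    completion = {}
    while queue:
        name, remaining = queue.popleft()
        run = min(quantum, remaining)            # run for one time slice at most
        clock += run
        remaining -= run
        if remaining > 0:
            queue.append((name, remaining))      # not done: back of the queue
        else:
            completion[name] = clock
    return completion

completion = round_robin([("A", 5), ("B", 5), ("C", 5)], quantum=1)
rr_avg_turnaround = sum(completion.values()) / len(completion)
run_to_completion_avg = (5 + 10 + 15) / 3        # same jobs, one after another
print(completion, rr_avg_turnaround, run_to_completion_avg)
```

Every job makes steady progress under RR, but all three finish late (at 13, 14, and 15), so the average turnaround rises from 10 to 14.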
Read-only programs "baked into" a device to offer basic hardware control functionalities
What is sequential access?
Reading contiguous blocks together amortizes seek time and rotational delay
______ _____ predictions offer _________ gains, compared to batch models that run on a scheduled time.
Real time; disproportionate
OS keeps moving processes between 3 states:
Running, Ready, Blocked
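The movement between the three states can be sketched as a small state machine in Python. The event names and the transition table follow the usual textbook picture and are assumptions here, not something the card spells out.

```python
from enum import Enum

class State(Enum):
    READY = "Ready"
    RUNNING = "Running"
    BLOCKED = "Blocked"

# Assumed transitions: (current state, event) -> next state
TRANSITIONS = {
    (State.READY, "scheduled"): State.RUNNING,
    (State.RUNNING, "descheduled"): State.READY,
    (State.RUNNING, "io_started"): State.BLOCKED,
    (State.BLOCKED, "io_done"): State.READY,
}

def next_state(state: State, event: str) -> State:
    """Look up the next state; raises KeyError for a disallowed move."""
    return TRANSITIONS[(state, event)]

state = State.READY
for event in ["scheduled", "io_started", "io_done", "scheduled"]:
    state = next_state(state, event)
print(state)
```

Note that in this table a Blocked process returns to Ready, not directly to Running; it must be scheduled again.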
What is serialization?
Serialization is the process of converting a data structure (or program objects in general) into a neat sequence of bytes that can be exactly recovered.
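A short round-trip sketch using Python's built-in `pickle` module, one of several possible serialization formats. The `record` dictionary is made-up example data.

```python
import pickle

record = {"user_id": 42, "scores": [0.5, 0.25], "active": True}

payload = pickle.dumps(record)      # serialize: data structure -> bytes
restored = pickle.loads(payload)    # deserialize: bytes -> data structure

print(type(payload).__name__, restored == record)
```

The restored object is an equal but separate copy, so the bytes could just as well have been written to a file or sent over a network in between. Unpickling data from untrusted sources is unsafe, so formats like JSON are often preferred across system boundaries.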
What is a workload?
Set of processes, arrival times, and job lengths that the OS scheduler has to deal with
What are the 3 paradigms of multi-node parallelism?
Shared-Nothing Parallelism; Shared-Disk Parallelism; Shared-Memory Parallelism
What is SCTF?
Shortest Completion Time First. Jobs might not all arrive at the same time; preemption possible.
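A minimal Python sketch of SCTF, simulated one time unit at a time: at every tick the job with the least remaining time among those that have arrived gets the processor. The workload (a long job at time 0, two short jobs at time 10) and integer time units are illustrative assumptions.

```python
def sctf_schedule(jobs):
    """jobs: list of (name, arrival_time, job_length) with integer times.
    Returns {name: completion_time} under preemptive shortest-remaining-time."""
    remaining = {name: length for name, _, length in jobs}
    arrival = {name: arr for name, arr, _ in jobs}
    completion = {}
    clock = 0
    while remaining:
        ready = [n for n in remaining if arrival[n] <= clock]
        if not ready:
            clock += 1                           # idle until a job arrives
            continue
        current = min(ready, key=lambda n: remaining[n])
        remaining[current] -= 1                  # run the chosen job for one tick
        clock += 1
        if remaining[current] == 0:
            del remaining[current]
            completion[current] = clock
    return completion

jobs = [("A", 0, 100), ("B", 10, 10), ("C", 10, 10)]
completion = sctf_schedule(jobs)
avg_turnaround = sum(completion[n] - arr for n, arr, _ in jobs) / len(jobs)
print(completion, avg_turnaround)
```

Job A is preempted at time 10 when the shorter jobs arrive and resumes at time 30, giving an average turnaround of 50.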
What is SJF?
Shortest Job First. Ranking criterion: Job Length; no preemption allowed.
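A minimal Python sketch of non-preemptive SJF on one processor. The workload (one job of length 100 and two of length 10, all arriving at time 0) and the function name `sjf_schedule` are illustrative assumptions.

```python
def sjf_schedule(jobs):
    """jobs: list of (name, arrival_time, job_length).
    Returns {name: (start_time, completion_time)} under SJF, no preemption."""
    pending = list(jobs)
    clock = 0
    timeline = {}
    while pending:
        ready = [j for j in pending if j[1] <= clock]
        if not ready:
            clock = min(j[1] for j in pending)   # idle until the next arrival
            continue
        job = min(ready, key=lambda j: j[2])     # rank by job length
        name, _, length = job
        timeline[name] = (clock, clock + length) # runs to completion
        clock += length
        pending.remove(job)
    return timeline

jobs = [("A", 0, 100), ("B", 0, 10), ("C", 0, 10)]
timeline = sjf_schedule(jobs)
avg_turnaround = sum(timeline[n][1] - arrival for n, arrival, _ in jobs) / len(jobs)
print(timeline, avg_turnaround)
```

Running the two short jobs first gives an average turnaround of 50; serving the same workload in arrival order with A first would give 110.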
What is the disk?
Similar to memory but persistent, slower, and higher capacity/cost ratio; various addressing schemes
What is multithreading?
Same core used by many threads
What is metadata?
Summary or organizing information about file content (aka payload), stored with the file itself; format-dependent
How can async and sync processes work together?
Sync can fetch pre-calculated counters from the async process and calculate features.
What is serverless computing?
A model in which resources are disaggregated: compute, memory, storage, etc. are all network-attached and elastically added/removed
What is the kernel of an OS?
The kernel is a portion of the operating system that includes the most heavily used portions of software. Generally, the kernel is maintained permanently in main memory. The kernel runs in a privileged mode and responds to calls from processes and interrupts from devices. It is the core of an OS with modules to abstract the hardware and APIs for programs to use
What is a data access pattern?
The order in which a program has to access items of a complex data structure in memory
What is the data layout?
The order in which data items of a complex data structure are laid out in memory/disk
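To make the interaction between data layout and access pattern concrete, this Python sketch computes where each cell of a 2x3 matrix lands in memory under two common layouts, then lists the offsets touched by a row-by-row scan. The matrix size and function names are illustrative assumptions.

```python
def row_major_offset(row: int, col: int, num_cols: int) -> int:
    """Position of cell (row, col) when rows are stored one after another."""
    return row * num_cols + col

def col_major_offset(row: int, col: int, num_rows: int) -> int:
    """Position of cell (row, col) when columns are stored one after another."""
    return col * num_rows + row

num_rows, num_cols = 2, 3
cells_in_row_scan = [(r, c) for r in range(num_rows) for c in range(num_cols)]

scan_on_row_major = [row_major_offset(r, c, num_cols) for r, c in cells_in_row_scan]
scan_on_col_major = [col_major_offset(r, c, num_rows) for r, c in cells_in_row_scan]
print(scan_on_row_major, scan_on_col_major)
```

The same row-by-row access pattern walks memory contiguously on the row-major layout but jumps around on the column-major one, which is what hurts cache hit rates.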
What is a filesystem?
The part of OS that helps programs create, manage, and delete files on disk (sec. storage)
What is rotational delay?
Waiting for the sector to come under the disk head. A function of RPM; typically, 0-10 ms (avg vs. worst case).
What are disks?
Widely used secondary storage device; likely holds the vast majority of the world's day-to-day business critical data!
All digital objects are collections of _____ _____ _____ (bytes, integers, floats, and characters).
basic data types
The OS maintains a __________ to tell which chunks of DRAM are available for new processes, avoid conflicts, etc.
free space list
Virtualization of processor enables process __________, i.e., each process given an "illusion" that it alone runs
isolation
Aside from the obvious consideration of _____ ________ speed, one has to take into consideration _________ if possible.
model, inference, parallelization
Inherent tension in _________ between overall ________ performance and ________ fairness
scheduling, workload, allocation
Based on the specific use case, the size of the payload, and variation in traffic, the practitioner can significantly increase model execution speed TP99 by adjusting ______ _________.
serving configurations