CCA Final exam
Vertex
In geometry: a point where higher-dimensional geometric objects (such as edges or faces) meet. In graph theory: a node in a graph.
What is a data cube, what is it used for, how does it apply to clouds?
- A data cube is a data structure
- A sophisticated nested array
- Compression schemes
- Data aggregation techniques when the cube outstrips the host's memory
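A minimal sketch of the idea, assuming a sparse dictionary keyed by dimension values stands in for the nested array. The fact records, the dimension names (year, region, product), and `build_cube` are hypothetical and only illustrate how measures are aggregated into cells.

```python
from collections import defaultdict

# Hypothetical fact records: (year, region, product, sales).
facts = [
    (2023, "EU", "laptop", 10),
    (2023, "EU", "phone", 5),
    (2023, "US", "laptop", 7),
    (2024, "EU", "laptop", 3),
    (2024, "US", "phone", 8),
]


def build_cube(records):
    """Aggregate the sales measure into one cell per (year, region, product)."""
    cube = defaultdict(int)
    for year, region, product, sales in records:
        cube[(year, region, product)] += sales
    return dict(cube)


cube = build_cube(facts)
cell = cube[(2023, "EU", "laptop")]  # look up a single cell by its coordinates
```

A dense nested array would give faster positional access; the sparse form is used here only because it keeps the sketch short.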
What is Kubernetes and its function?
- A platform to orchestrate the deployment, scaling, and management of container-based applications
- The primary responsibility of Kubernetes is container orchestration
- All the containers that execute workloads are scheduled to run on physical or virtual machines
- It replaces dead, unresponsive, or unhealthy containers
What are some examples of columnar-based data warehouses?
- Column store, MPP, cloud-based
- MariaDB with InfiniDB (for reference: the regular row-based engine for OLTP is InnoDB)
- Google BigQuery, based on Google Dremel (paper published in 2010)
What are some hardware optimizations for columnar storage?
- Disk access pattern
  • One SSD page is 4 KB to 8 KB
  • Row store: when reading a page, only a small number of the wanted column fields, spread across different rows, are loaded
  • Column store: every page read contains only relevant column fields
- Reading multiple values for the same column in one run significantly improves cache utilization and computational efficiency
- On modern CPUs, vectorized instructions (SIMD) can be used to process multiple data points with a single CPU instruction
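The difference in layout can be sketched as follows. This is only an illustration with a hypothetical three-column table; pure Python does not expose page reads or SIMD, so the sketch shows the layout rather than the speedup.

```python
from array import array

# Hypothetical table: (id, name, amount).
rows = [
    (1, "alice", 30.0),
    (2, "bob", 12.5),
    (3, "carol", 7.5),
]


def sum_amount_row_store(records):
    """Row layout: every record is visited even though one field is needed."""
    return sum(record[2] for record in records)


# Column layout: one contiguous, typed array per column.
columns = {
    "id": array("q", [r[0] for r in rows]),
    "name": [r[1] for r in rows],
    "amount": array("d", [r[2] for r in rows]),
}


def sum_amount_column_store(cols):
    """Column layout: only the 'amount' array is scanned."""
    return sum(cols["amount"])


row_total = sum_amount_row_store(rows)
column_total = sum_amount_column_store(columns)
```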
Column Store File Format
- During the last several years, likely due to a rising demand to run complex analytical queries over growing datasets, we've seen new column-oriented file formats (Apache Parquet, Apache ORC, RCFile) as well as column-oriented stores such as Apache Kudu and ClickHouse
- Parquet: an open source file format for Hadoop, Hive, Pig, Impala, Spark
Trap and Emulate
- Guest code executes directly on the host CPU; the hypervisor configures the CPU in such a way that all potentially unsafe instructions will cause a "trap"
- An unsafe instruction is one that, for example, tries to access or modify the memory of another guest
- A trap is an exceptional condition that transfers control back to the hypervisor
- Once the hypervisor has received a trap, it inspects the offending instruction, emulates it in a safe way, and continues execution after the instruction
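The control flow above can be sketched as a toy simulation. The instruction names, the `Trap` exception, and the emulation policy are all hypothetical; a real hypervisor works on machine instructions and CPU state, not Python tuples.

```python
class Trap(Exception):
    """Raised by the simulated CPU when the guest issues an unsafe instruction."""

    def __init__(self, instruction):
        super().__init__(instruction)
        self.instruction = instruction


# Hypothetical set of instructions the hypervisor has configured to trap.
SENSITIVE = {"WRITE_OTHER_GUEST_MEMORY", "DISABLE_INTERRUPTS"}


def cpu_execute(instruction, state):
    """Run a safe instruction directly; unsafe ones trap instead of executing."""
    op = instruction[0]
    if op in SENSITIVE:
        raise Trap(instruction)
    if op == "ADD":
        state["acc"] += instruction[1]


def hypervisor_run(program):
    """Execute guest code directly and emulate every trapped instruction."""
    state = {"acc": 0, "virtual_interrupts_enabled": True}
    emulated = []
    for instruction in program:
        try:
            cpu_execute(instruction, state)
        except Trap as trap:
            op = trap.instruction[0]
            if op == "DISABLE_INTERRUPTS":
                # Safe emulation: only the guest's virtual flag changes.
                state["virtual_interrupts_enabled"] = False
            # Any other trapped instruction is denied (treated as a no-op).
            emulated.append(op)
        # Execution continues with the next instruction either way.
    return state, emulated


program = [
    ("ADD", 2),
    ("DISABLE_INTERRUPTS",),
    ("ADD", 3),
    ("WRITE_OTHER_GUEST_MEMORY", 0x1000),
]
final_state, emulated_ops = hypervisor_run(program)
```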
kube-proxy
kube-proxy is a network proxy that runs on each node in your cluster, implementing part of the Kubernetes Service concept. kube-proxy maintains network rules on nodes. These network rules allow network communication to your Pods from network sessions inside or outside of your cluster. kube-proxy uses the operating system packet filtering layer if there is one and it's available. Otherwise, kube-proxy forwards the traffic itself.
Service
A Kubernetes Service is an abstraction that defines a logical set of Pods running somewhere in your cluster and a policy by which to access them
Paravirtualization
A method for a hypervisor to offer interfaces to a guest OS that the guest OS can use instead of the normal hardware interfaces.
Roll-up
A roll-up involves summarizing the data along a dimension
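A roll-up can be sketched as summing the measure over one dimension. The cube contents and the `roll_up` helper below are hypothetical; the cube is assumed to be a dictionary keyed by (year, region, product).

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (year, region, product).
cube = {
    (2023, "EU", "laptop"): 10,
    (2023, "EU", "phone"): 5,
    (2023, "US", "laptop"): 7,
    (2024, "EU", "laptop"): 3,
}


def roll_up(cells, axis):
    """Summarize along the dimension at position `axis` by summing it away."""
    result = defaultdict(int)
    for key, value in cells.items():
        reduced_key = key[:axis] + key[axis + 1:]
        result[reduced_key] += value
    return dict(result)


# Roll up the product dimension: totals per (year, region).
by_year_region = roll_up(cube, axis=2)
```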
Union File System
A stackable unification file system, which can appear to merge the contents of several directories (branches), while keeping their physical content separate
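As a rough analogy only, Python's `collections.ChainMap` behaves like a stack of branches: lookups fall through the layers from top to bottom, while writes land in the top layer and leave the lower ones untouched. The layer names and file paths are hypothetical, and real union file systems work on directories, not dictionaries.

```python
from collections import ChainMap

# Hypothetical read-only layers, lowest first.
base_layer = {"/etc/hosts": "base hosts", "/bin/sh": "shell"}
app_layer = {"/app/main.py": "print('hi')"}
# Hypothetical writable top layer.
writable_layer = {}

# The first mapping is searched first and receives all writes.
union = ChainMap(writable_layer, app_layer, base_layer)

union["/etc/hosts"] = "edited hosts"  # stored only in writable_layer

merged_view = sorted(union)  # the layers appear as one directory tree
```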
Operating System-Level Virtualization
Also called container virtualization. A single shared OS hosts many users simultaneously: a physical server is virtualized at the operating system level, enabling multiple isolated and secure virtualized servers to run on a single physical server
kubelet
An agent that runs on each node in the cluster. It makes sure that containers are running in a Pod. The kubelet takes a set of PodSpecs that are provided through various mechanisms and ensures that the containers described in those PodSpecs are running and healthy. The kubelet doesn't manage containers which were not created by Kubernetes.
AutoML
Automated machine learning, or AutoML, aims to reduce or eliminate the need for skilled data scientists to build machine learning and deep learning models. Instead, an AutoML system allows you to provide the labeled training data as input and receive an optimized model as output.
Binary Translation
Binary translation rewrites sensitive instructions on the fly into virtualizable instructions
Docker bridge
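In networking terms, a bridge is a link layer device that forwards traffic between network segments. In Docker, a bridge network is a software bridge that lets containers attached to the same bridge communicate with each other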
Ambassador design pattern
Brokers interactions between the application container and the rest of the world
What is the Google Cloud AI Platform? What tools does it provide?
• AI Platform Notebooks
• Managed notebooks
• AI Platform Training
• Training with hyperparameter optimizations
• Continuous Evaluation
• Model optimization
• AI Platform Predictions
• Server model hosting deployment
Graph Database
• Associative data sets
• Structure of object-oriented applications
• Do not require join operators
Describe vertex-oriented graph processing
• Based on the BSP (bulk synchronous parallel) model
• You provide a directed graph to Pregel
• Pregel runs your computation at each vertex (processor)
• This repeats until every vertex votes to halt
• Pregel returns a directed graph as the result
How does Pregel guarantee fault-tolerance? Describe checkpointing, failure detection and recovery
• Checkpointing
  • The master periodically instructs the workers to save the state of their partitions to persistent storage
  • e.g., vertex values, edge values, incoming messages
• Failure detection
  • Using regular "ping" messages
• Recovery
  • The master reassigns graph partitions to the currently available workers
  • The workers all reload their partition state from the most recent available checkpoint
Relational Database
• Performs the same operation on large numbers of data elements
• Uses the relational model of data
• Each entity type has its own table
  • Rows are instances of the entity
  • Columns represent values attributed to that instance
• Rows in one table can be related to rows in another table via a unique key per row
Primitives in Pregel? Are edges or are vertices first-class citizens in Pregel?
• Vertices - first class
• Edges - not first class
• Both vertices and edges can be created and destroyed
Column-stores: what are they and what are some examples of column-stores?
- In recent years, there has been renewed interest in so-called column-oriented systems, sometimes also called column-stores, a.k.a. columnar storage
- MonetDB
- VectorWise → Ingres VectorWise → Actian Vector
- C-Store → Vertica
- Sybase IQ
Relationship between OLAP cubes and row-oriented RDBMS, including strengths and weaknesses of each approach
- OLAP cubes are traditionally known for an extreme performance advantage over row-oriented RDBMSs
  - This is less important with recent advances in computers and columnar storage
- OLAP cubes demand that you load the subset of the dimensions you're interested in into the cube
- Columnar databases allow performing similar OLAP-type workloads at equally good performance levels without the requirement to extract and build new cubes
- Note: OLAP data cubes typically offer richer analysis capabilities than RDBMSs, which are limited by the constraints of SQL. This is the main justification for why data cubes are still relevant
OLTP
- Online Transaction Processing - typically involve most or all of the columns in a row for a small number of records - Using a database to run your business - RDBMS
Amazon Redshift: based on an older version of PostgreSQL
- PostgreSQL 8.0.2 - Originally developed by ParAccel - Some PostgreSQL features that are suited to smaller-scale OLTP processing, such as secondary indexes and efficient single-row data manipulation operations, have been omitted to improve performance
Software only virtualization
- Problem: x86 processors were not virtualizable until the mid-2000s
- Software-only virtualization is a technique to work around the trap-and-emulate design of Popek and Goldberg
- Does not need special hardware support, e.g. the Intel VT-x or AMD-V features
optimizations for columnar storage
- Storing values that have the same data type together (e.g., numbers with other numbers, strings with other strings) offers a better compression ratio
  - Lower information entropy results in higher compression
- We can use different compression algorithms depending on the data type and pick the most effective compression method for each case
- Compression can be applied automatically by the engine
- Columns shrink and grow independently
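As one illustration of why grouping similar values helps, here is a sketch of run-length encoding applied to a single column. Run-length encoding is chosen here only as an example; the notes above do not name a specific algorithm, and the `country` column is hypothetical.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse each run of equal adjacent values into a (value, count) pair."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, count) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical column with long runs of repeated values.
country = ["DE", "DE", "DE", "FR", "FR", "US", "US", "US", "US"]
encoded = rle_encode(country)
decoded = rle_decode(encoded)
```

Nine stored values shrink to three pairs. The same rows stored record by record would interleave other fields between the country values and break up the runs.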
Service discovery in Docker Swarm
- User-defined networks provide a DNS service
  - User-defined bridge networks
  - User-defined overlay networks
- For most situations, you should connect to the service name, which is load-balanced and handled by all containers ("tasks") backing the service
- To get a list of all tasks backing the service, do a DNS lookup for tasks.<service-name>
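The lookup in the last point could be sketched from inside a container as below. `task_addresses` is a hypothetical helper; it relies only on the standard `socket.getaddrinfo` call. The fake resolver and the service name `web` exist solely so the sketch can run outside a swarm network.

```python
import socket


def task_addresses(service_name, port, resolver=socket.getaddrinfo):
    """Return the sorted, unique IP addresses behind tasks.<service_name>."""
    infos = resolver(f"tasks.{service_name}", port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def _fake_resolver(host, port, type=0):
    """Hypothetical stand-in for swarm DNS, used only for this demonstration."""
    return [
        (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.0.0.3", port)),
        (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("10.0.0.2", port)),
    ]


demo_addresses = task_addresses("web", 80, resolver=_fake_resolver)
```

Inside a container attached to a user-defined network, calling `task_addresses("web", 80)` without the `resolver` argument would query the real DNS service.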
Xen
- Xen was initially a university research project
- Invasive changes to the kernel were needed to run Linux as a paravirtualized guest
  - Maintenance effort required on distributions
  - Support was added in the mainline Linux kernel 3.0 (2011)
- Usually very fast: trap and emulate has overhead, and paravirtualization eliminates traps
Data Lake
- a new type of data repository for storing massive amounts of raw data in its native form, in a single location - "A large body of water, into which new water streams from many channels, and from which samples are taken and analyzed" - Solution to a growing problem: the need for a scalable, low-cost data repository that allowed organizations to easily store all data types and analyze that data to make evidence-based business decisions
OLAP
- Online analytical processing
- Reads only a few columns for a very large number of rows
- Using a database to understand your business
- Data warehouse
  * Structured data
  * SQL
  * Each query covers many or all of the records
  * Typical query involves one column
How does a data lake differ from a data warehouse? Why use a data lake? What are the components of a data lake?
- Data warehouses cannot accommodate unstructured big data projects
- Petabytes of data in structured, semi-structured, and unstructured forms
- Semi-structured and unstructured data: JSON, XML, log files, natural language, images, video, etc.
- Sources: social media sites, mobile phones, Internet of Things (IoT) devices, and many other sources, including shared data sets
- Structured data is typically collected from enterprise applications
Who are Kimball and Inmon, and what were their contributions?
- Early practitioners who observed that certain access patterns occurred in every business
- They developed repeatable methods to turn business reporting requirements into data warehouse designs
- Designs that allow teams to extract the data they need in the formats they need for their OLAP cubes
User Mode
- User processes operate in user mode
- When a user application requests a service from the operating system (that is, makes a system call), there is a transition from user mode to kernel mode to fulfill the request
Kernel Mode
- When the system boots, the hardware starts in kernel mode
- Privileged instructions execute only in kernel mode; if a user process attempts to run a privileged instruction in user mode, the hardware treats the instruction as illegal and traps to the OS
- Examples of privileged operations: input/output management, interrupt handling
Docker Swarm
A Docker Swarm is a group of either physical or virtual machines that are running the Docker application and that have been configured to join together in a cluster. Once a group of machines have been clustered together, you can still run the Docker commands that you're used to, but they will now be carried out by the machines in your cluster. The activities of the cluster are controlled by a swarm manager, and machines that have joined the cluster are referred to as nodes.
Full Virtualization
A form of virtualization where one or more operating systems and the applications they contain are run on top of virtualized hardware.
Pods
A grouping of one or more containers that share the same namespaces
What motivates modern data warehouse architecture? What technologies enable it?
- Cloud: access to near-infinite, low-cost storage; improved scalability; outsourcing of data warehousing management and security to the cloud vendor; pay per use
- Massively parallel processing (MPP): dividing computing operations to execute simultaneously across many separate computer processors
- Columnar storage
- Vectorized processing
kube-scheduler
Control plane component that watches for newly created Pods with no assigned node, and selects a node for them to run on. Factors taken into account for scheduling decisions include: individual and collective resource requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, and deadlines.
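The general filter-then-score shape of a scheduling decision can be sketched as below. This is a simplified illustration, not the kube-scheduler algorithm: the node fields, the `node_label` constraint, and the "most free CPU" scoring rule are all assumptions made for the example.

```python
def schedule(pod, nodes):
    """Pick a node name for `pod`, or None if no node is feasible."""

    def fits(node):
        # Filter step: resource requirements and a simple label constraint.
        label = pod.get("node_label")
        return (
            node["free_cpu"] >= pod["cpu"]
            and node["free_mem"] >= pod["mem"]
            and (label is None or label in node["labels"])
        )

    feasible = [node for node in nodes if fits(node)]
    if not feasible:
        return None
    # Score step (hypothetical rule): prefer the node left with the most free CPU.
    best = max(feasible, key=lambda node: node["free_cpu"] - pod["cpu"])
    return best["name"]


# Hypothetical cluster state.
nodes = [
    {"name": "n1", "free_cpu": 2, "free_mem": 4, "labels": {"ssd"}},
    {"name": "n2", "free_cpu": 8, "free_mem": 16, "labels": set()},
    {"name": "n3", "free_cpu": 4, "free_mem": 8, "labels": {"ssd"}},
]

chosen = schedule({"cpu": 2, "mem": 4, "node_label": "ssd"}, nodes)
unschedulable = schedule({"cpu": 16, "mem": 1}, nodes)
```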
What is GraphX? What technologies does GraphX use?
GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge.
What is Spark GraphX? What are some of the graph operators that are used in GraphX?
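GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multigraph with properties attached to each vertex and edge. Its graph operators include subgraph, joinVertices, and aggregateMessages, along with an optimized variant of the Pregel API.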
CPU privilege levels
In computer science, hierarchical protection domains, often called protection rings, are mechanisms to protect data and functionality from faults (by improving fault tolerance) and malicious behavior (by providing computer security). This approach is diametrically opposite to that of capability-based security.
hyperparameter
In machine learning, a hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters are derived via training.
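A small sketch of the distinction, assuming gradient descent on the toy loss (w - 3)^2. The learning rate is the hyperparameter (fixed before training), while `w` is the parameter derived by training. The candidate values and function names are hypothetical.

```python
def train(learning_rate, steps=20):
    """Fit the parameter w by gradient descent on the loss (w - 3)^2."""
    w = 0.0
    for _ in range(steps):
        gradient = 2 * (w - 3.0)
        w -= learning_rate * gradient
    return w


def loss(w):
    """Loss of the trained parameter; lower is better."""
    return (w - 3.0) ** 2


# Hyperparameter search: try several learning rates and keep the best one.
candidates = [0.01, 0.1, 0.5, 1.1]
best_rate = min(candidates, key=lambda rate: loss(train(rate)))
```

The smallest rate converges too slowly within 20 steps and the largest one diverges, which is why the choice of hyperparameter controls the learning process rather than being learned by it.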
Dockerfile
A text file of instructions, each followed by its arguments, that Docker executes in order to build a Docker image
Sidecar Design pattern
Made up of two containers: the application container and a sidecar container that extends or enhances it.
Execution of a Pregel program
- Many copies of the program begin executing on a cluster of machines
- The master assigns a partition of the input to each worker
  • Each worker loads the vertices and marks them as active
- The master instructs each worker to perform a superstep
  • Each worker loops through its active vertices and computes for each vertex
  • Messages are sent asynchronously, but are delivered before the end of the superstep
  • This step is repeated as long as any vertices are active, or any messages are in transit
- After the computation halts, the master may instruct each worker to save its portion of the graph
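The superstep loop can be sketched on a single machine as follows. This is a toy model, not the Pregel API: there is no master, no partitioning, and no workers, and the vertex program (propagating the maximum value) is just one commonly used example. All names are hypothetical.

```python
def pregel_max(graph, initial_values):
    """Propagate the maximum vertex value through a directed graph.

    graph maps each vertex to its list of out-neighbors.
    """
    values = dict(initial_values)
    active = set(graph)                 # every vertex starts active
    inbox = {v: [] for v in graph}      # messages delivered this superstep
    superstep = 0
    # Repeat while any vertex is active or any message is in transit.
    while active or any(inbox.values()):
        outbox = {v: [] for v in graph}
        # A halted vertex is reactivated by an incoming message.
        to_run = active | {v for v, msgs in inbox.items() if msgs}
        active = set()                  # each vertex votes to halt after computing
        for v in to_run:
            new_value = max([values[v]] + inbox[v])
            changed = new_value != values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for neighbor in graph[v]:
                    outbox[neighbor].append(new_value)
        inbox = outbox                  # barrier: messages arrive next superstep
        superstep += 1
    return values


# Hypothetical four-vertex graph.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result = pregel_max(graph, {"a": 3, "b": 6, "c": 2, "d": 1})
```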
What is Apache Giraph?
- Open source implementation based on Pregel
- Giraph started in the Apache Incubator and became a top-level Apache project in 2012
- Modifications are continuous
Swarm Services
Swarm services use a declarative model, which means that you define the desired state of the service, and rely upon Docker to maintain this state.
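The declarative model can be sketched as a reconciliation step that compares the desired state with the observed state and returns the actions needed to close the gap. This is an illustration only, not Docker's implementation; the task dictionaries and action tuples are hypothetical.

```python
def reconcile(desired_replicas, running_tasks):
    """Return the actions that move the observed state toward the desired state."""
    healthy = [task for task in running_tasks if task["healthy"]]
    # Unhealthy tasks are always removed.
    actions = [("remove", task["id"]) for task in running_tasks if not task["healthy"]]
    missing = desired_replicas - len(healthy)
    if missing > 0:
        # Too few healthy tasks: start replacements.
        actions += [("start", None)] * missing
    elif missing < 0:
        # Too many healthy tasks: remove the surplus.
        surplus = -missing
        actions += [("remove", task["id"]) for task in healthy[:surplus]]
    return actions


# Desired state: 3 replicas. Observed state: one healthy task, one failed task.
observed = [{"id": "a", "healthy": True}, {"id": "b", "healthy": False}]
plan = reconcile(3, observed)
```

The user only states `desired_replicas`; the loop decides what to start or remove, which is the difference from an imperative "run this container" command.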
API Server
The API server is a component of the Kubernetes control plane that exposes the Kubernetes API. The API server is the front end for the Kubernetes control plane. The main implementation of a Kubernetes API server is kube-apiserver. kube-apiserver is designed to scale horizontally—that is, it scales by deploying more instances. You can run several instances of kube-apiserver and balance traffic between those instances.
Container runtime
The container runtime is the software that is responsible for running containers. Kubernetes supports several container runtimes: Docker, containerd, CRI-O, and any implementation of the Kubernetes CRI (Container Runtime Interface).
Hardware Virtualization
The virtualization of computers or operating systems, which hides the physical characteristics of a computing platform from users.
Full Virtualization - examples
Virtual PC, VirtualBox, VMware, QEMU
What is virtualization and what is its main idea?
Virtualization allows distributed computing models without creating dependencies on physical resources
Directed Graph
When the edges of a graph are ordered pairs of vertices, it is called a directed graph. Each edge represents a specific direction from one vertex to another. For an edge written as (V1, V2), the direction is from V1 to V2. The first element V1 is the initial node or start vertex. The second element V2 is the terminal node or end vertex.
Undirected Graph
When the edges of a graph are unordered pairs of vertices, it is an undirected graph. In other words, the edges have no specific direction. The vertices are connected by undirected arcs, which are edges without arrows. If there is an edge between vertex A and vertex B, it is possible to traverse from B to A or from A to B.
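The difference between the two kinds of graph shows up directly in an adjacency-list representation. A minimal sketch, with a hypothetical `adjacency` helper and edge list:

```python
def adjacency(edges, directed):
    """Build an adjacency list from (start, end) vertex pairs."""
    adj = {}
    for start, end in edges:
        adj.setdefault(start, set()).add(end)
        adj.setdefault(end, set())
        if not directed:
            # An undirected edge can be traversed in both directions.
            adj[end].add(start)
    return {vertex: sorted(neighbors) for vertex, neighbors in adj.items()}


edges = [("V1", "V2"), ("V2", "V3")]
directed_graph = adjacency(edges, directed=True)
undirected_graph = adjacency(edges, directed=False)
```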
Edges
interconnect nodes to nodes or nodes to properties, and they represent the relationship between the two
graph
A graph is an abstract data type that is meant to implement the undirected graph and directed graph concepts from the field of graph theory within mathematics. ... These pairs are known as edges (also called links or lines), and for a directed graph are also known as arrows.
Pivot
allows an analyst to rotate the cube in space to see its various faces
Drill Up / Down
allows the user to navigate among levels of data ranging from the most summarized (up) to the most detailed (down)
Three building blocks of container
cgroups, namespaces, and union file systems (UnionFS)
etcd
etcd is an open source distributed key-value store used to hold and manage the critical information that distributed systems need to keep running. Most notably, it manages the configuration data, state data, and metadata for Kubernetes, the popular container orchestration platform
Dicing
produces a subcube by allowing the analyst to pick specific values of multiple dimensions
Describe an algorithm to find connected components in Giraph
propagate smallest vertex label to neighbors until convergence
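A single-machine sketch of that label propagation, assuming an undirected graph given as a neighbor dictionary. It mirrors the superstep structure (all vertices read the labels from the previous round) but does not use the Giraph API; the function and variable names are hypothetical.

```python
def connected_components(neighbors):
    """Label each vertex with the smallest vertex id in its component."""
    labels = {vertex: vertex for vertex in neighbors}
    changed = True
    while changed:                      # one pass corresponds to one superstep
        changed = False
        new_labels = dict(labels)
        for vertex, adjacent in neighbors.items():
            # Take the smallest label among this vertex and its neighbors.
            smallest = min([labels[vertex]] + [labels[n] for n in adjacent])
            if smallest < new_labels[vertex]:
                new_labels[vertex] = smallest
                changed = True
        labels = new_labels
    return labels


# Hypothetical graph with three components: {1, 2, 3}, {4, 5}, {6}.
neighbors = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
components = connected_components(neighbors)
```

Vertices that end with the same label belong to the same connected component.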
Nodes
represent entities (people, businesses, accounts...)
index-free adjacency
speeds up processing by ensuring that each node stores direct references to its adjacent nodes and relationships. Then, during query processing (i.e., read time), index-free adjacency ensures fast retrieval without a heavy reliance on indexes. Non-native graph processing often uses a large number of indexes in order to complete a read or write transaction, significantly slowing down the operation.
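The contrast can be sketched with a hypothetical `Node` class that holds direct references to its neighbors, next to an edge table that has to be searched on every hop. Neither is how a particular graph database is implemented; the sketch only shows the access pattern.

```python
class Node:
    """A node that stores direct references to its adjacent nodes."""

    def __init__(self, name):
        self.name = name
        self.neighbors = []


def two_hops(node):
    """Follow references twice; no lookup structure is consulted."""
    return [far.name for near in node.neighbors for far in near.neighbors]


# Index-free adjacency: traversal is pointer chasing.
ann, bob, cy, dee = Node("ann"), Node("bob"), Node("cy"), Node("dee")
ann.neighbors = [bob]
bob.neighbors = [cy, dee]

# Non-native alternative: an edge table that must be scanned or indexed per hop.
edge_table = [("ann", "bob"), ("bob", "cy"), ("bob", "dee")]


def neighbors_via_scan(name):
    """Find neighbors by searching the whole edge table."""
    return [dst for src, dst in edge_table if src == name]


friends_of_friends = two_hops(ann)
```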
Slicing
the act of picking a rectangular subset of a cube by choosing a single value for one of its dimensions
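Slicing, together with the dicing operation defined earlier, can be sketched on a cube stored as a dictionary keyed by (year, region, product). The cube contents and both helper functions are hypothetical.

```python
# Hypothetical cube cells keyed by (year, region, product).
cube = {
    (2023, "EU", "laptop"): 10,
    (2023, "EU", "phone"): 5,
    (2023, "US", "laptop"): 7,
    (2024, "EU", "laptop"): 3,
}


def slice_cube(cells, axis, value):
    """Fix one dimension to a single value and drop it from the keys."""
    return {
        key[:axis] + key[axis + 1:]: measure
        for key, measure in cells.items()
        if key[axis] == value
    }


def dice_cube(cells, allowed):
    """Keep cells whose coordinates fall in the allowed values per axis."""
    return {
        key: measure
        for key, measure in cells.items()
        if all(key[axis] in values for axis, values in allowed.items())
    }


slice_2023 = slice_cube(cube, axis=0, value=2023)
eu_2023_dice = dice_cube(cube, {0: {2023}, 1: {"EU"}})
```

The slice has one dimension fewer than the cube, while the dice keeps all dimensions and only narrows the range of values on each.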