Cloud Computing 1660

Cloud computing infrastructure

"Needs vs IT resources" IT resources need to match user needs over time

NoSQL databases

"Not Only SQL" or "Non-SQL" database. A type of database that operates using means other than relational tables. --term popularized by Johan Oskarsson, who used #nosql as the hashtag for a meetup (most of SQL supported but other properties are available) --open-source software designed to address scalability issues (grow as needed on commodity hardware) --schema-less (supports unstructured data not easily represented in tabular form, key-value store where key & value can be arbitrary values, new columns can be introduced over time) --aggregate-oriented (related items together) --normally distributed over multiple servers in the cloud --typically does not satisfy ACID (eventual consistency)

Cloud computing in the real world

"cloud" = large internet services running on 10,000s of machines (Google, Amazon, Microsoft, etc.) "cloud computing" = services by these companies that let external customers rent cycles & storage cloud services -- dropbox, google drive, netflix, facebook, etc.

Distributed File System

--A client/server based application that allows clients to access & process data stored on the server as if it were their own computer --more complex than regular disk file systems (network based & high level of fault-tolerance) --BIG DATA --dataset outgrows the storage capacity of a single physical machine (dataset is partitioned across a number of separate machines)

Application flow of Sentiment analyzer

--Client application requests index.html (which requests bundled scripts of ReactJS application) --User interacting w/ the application triggers requests to the Spring WebApp --Spring WebApp forwards the requests for sentiment analysis to the Python app --Python application calculates the sentiment and returns the result as a response --The Spring WebApp returns the response to the React app (which presents the info to the user)

Scaling up container development

--Container -reliable, fast efficient, light-weight --Easy to instantiate many containers --But difficult to manage - need networking, to be deployed appropriately, managed carefully, scalable by demand, able to detect crash, etc.

HDFS design

--Data blocks --Namenode & datanode --Data flow (read/write operations) --Hadoop Replication Strategy --Hadoop Basic File Operations --Hadoop Supported File Systems

Block abstraction

--Flexible (can be stored on any available disk) --Scalable (can handle large data set in distributed environments) --Simple storage subsystem & management (easy to determine # of blocks to be stored on disk, easy to deal w/ various failure modes) --Each block replicated to achieve the desired level of fault tolerance & availability (default replication factor is 3 machines, can be changed to fit needs)

Categories in ascending order based on level of VENDOR control

--SaaS --PaaS --IaaS --MaaS

Server virtualization

--Server virtualization enables server consolidation and containment (eliminating "server sprawl" via deployment of systems as "virtual machines" that can run safely and move transparently across shared hardware) --A virtual server can be serviced by one or more hosts, and one host may house more than one virtual server (so, increased server utilization rates - from 5-15% to 60-80%) capacity = capability of processing

Virtual server concept

--Virtual servers can be scaled out easily (amt of resources can be adjusted dynamically) --Can still be referred to by their function --Not affected by the loss of a host (hosts can be removed/introduced at will to accommodate) --Server "cloning" is easy (identical virtual servers can be easily created) --Virtual servers can be migrated from host to host dynamically as needed

Cloud benefits

--takes hassle out of collaboration --allows for work from anywhere --great protection & security measures --protects environment --automatic software updates --increase in company growth --total flexibility for business owners --reduces hardware costs --disaster recovery more simple

Virtualization advantages

--workload consolidation to reduce hardware, power, and space requirements --ability to run multiple OSs and leverage their advantages based on the application --redundancy to mitigate disaster --greater automation

5 essential cloud characteristics

1. On-demand self service 2. Broad network access 3. Resource pooling 4. Rapid elasticity (cloud resources can be allocated and released rapidly as needed) 5. Measured service (pay only for what you consume)

Cloud Computing

A model for enabling convenient, on-demand network access to a shared pool of configurable computing and networking resources that can be rapidly provisioned and released with minimal management effort or service provider interaction cloud vendors provide storage & computing (companies outsource data to them) --convenient --fast (immediate storage) --shared --computation --little manual management

MapReduce

A programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a Hadoop cluster. --abstracts the problem from disk reads & writes by transforming it into a computation over sets of keys & values --works as a BATCH QUERY PROCESSOR that can run an ad hoc query against your whole data set & get results in reasonable time Divide & Conquer -- partition & then combine

Object stores examples

Amazon Simple Storage Service (S3) --first to come out --2016, reportedly holds trillions of objects in billions of containers (buckets) Glacier --designed for long-term, secure, durable, & low cost data archiving Google -- 3 storage tiers --Standard (multi-regional) --Regional --Nearline Azure --Blob as a part of storage account

Amazon File Stores

Amazon Elastic Block Store (EBS) --designed to be attached to a single Amazon EC2 Amazon Elastic File System (EFS) --general purpose file storage service --a file system interface for one or more Amazon EC2 instances

NoSQL database examples

Amazon's DynamoDB --Based on key-value model --For each row, primary key column is the only required attribute --any # of additional columns can be defined, indexed, and made searchable Google BigTable --same database behind Google search, analytics, maps, and Gmail --maps 2 arbitrary strings, row and column key, and a timestamp by arbitrary byte array --designed for sparse & large datasets & to support large workloads Google Datastore --similar to BigTable + ACID Azure Table --similar to DynamoDB but limited

Hadoop history

Apache Lucene project - text search library Apache Nutch - open source web engine for Lucene (index & crawl, need big cluster to process/expensive to invest) Nutch originally, Doug Cutting adds DFS & MapReduce to support (Google published GFS & MapReduce papers) Then, yahoo! hires Cutting, Hadoop is created out of Nutch In 2008, Hadoop became the Apache top level project Name comes from: "The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria." ... Doug Cutting

Applications NOT suitable for HDFS

Applications that need low-latency access, as opposed to high throughput of data Applications w/ a large # of small files (costly metadata) --on average each file, directory, and block takes about 150 bytes of namenode memory (an HDFS instance that maintains 1 million files w/ one block each needs 1,000,000*150*2 = 300 MB) Applications w/ multiple writers, or modifications at arbitrary offsets in the file --files in HDFS are written by a single writer, w/ writes always made at the end of the file
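
The small-files arithmetic above can be checked with a short sketch. The 150-byte figure is the rule of thumb from this card; the function name and the choice to ignore directory objects are assumptions made for illustration.

```python
# Rough namenode memory estimate for HDFS metadata.
# Assumption: ~150 bytes per file object and per block object (rule of thumb
# from the card above); directory objects are ignored for simplicity.
BYTES_PER_OBJECT = 150


def namenode_memory_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Estimate memory needed to hold file and block metadata."""
    file_objects = num_files
    block_objects = num_files * blocks_per_file
    return (file_objects + block_objects) * BYTES_PER_OBJECT


estimate = namenode_memory_bytes(1_000_000)
print(f"{estimate / 1_000_000:.0f} MB")  # 300 MB
```

The same data packed into fewer, larger files needs far fewer file objects, which is why many small files are the costly case.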

Cloud Storage Models Overview

Attached File Stores --organizing data into folders & directories --accessed by attaching a virtual disk to a virtual machine Object Stores --Store unstructured binary objects (blobs) Binary Large Objects Databases --Structured data collection --3 well-known types: Relational databases, NoSQL databases, Graph databases Data warehouses --designed to support search over massive amounts of data

Motivations for Hadoop

Big Data, Storage & analysis, need parallel data access & shared access Challenges: --analysis tasks need to combine data from multiple sources (need a paradigm that transparently splits & merges data) --challenges of parallel data access to & from multiple disks (hardware failure)

Advances in data center deployment

Conquering complexity --building racks of servers & complex cooling systems all separately is not efficient --package & deploy into larger units

Conventional vs Cloud computing

Conventional --Manually provisioned --dedicated hardware --fixed capacity --pay for capacity --capital & operational expenses --managed via system administrators Cloud --self-provisioned --shared hardware --elastic capacity --pay per use --operational expenses --managed via APIs

Data Warehouses

Data management systems optimized to support analytic queries that read large data sets --designed to support many concurrent requests to read and update Examples: --Amazon Redshift --Google BigQuery --Azure Data Lake All provide a REST (REpresentational State Transfer) API interface

Relational Databases

Database technology involving tables (relations) representing entities and primary/foreign keys representing relationships --structured collection of data about entities and their relationships --models real world objects --normally managed through database management system (DBMS) like Oracle or MySQL or PostgreSQL (SQL = Structured Query Language) --support ACID semantics (atomicity, consistency, isolation, and durability)

Microservices

Divide a computation into small, mostly stateless components that can be: --easily replicated for scale --communicate w/ simple protocols --computation is as a swarm of communicating workers

Graph Parallel

Each function node represents a parallel invocation of the function on the distributed data structure --data is distributed arrays or streams --build a data flow graph of the algorithm functions --the graph is compiled into parallel operators that are applied to the distributed data structures Data Analytic: Spark, Spark Streaming, Apache Flink, Storm, Google DataFlow Machine Learning: Google TensorFlow, MS Cognitive Toolkit graphs generated by compiler

What if datanode fails?

Fault Tolerance -- Replication factor solution: each data block is replicated (3x by default) and the replicas are distributed across different DataNodes
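
A minimal sketch of why replication tolerates a datanode failure. The round-robin placement below is an assumption chosen for simplicity; it is not HDFS's actual rack-aware placement policy, and the node names are hypothetical.

```python
def place_blocks(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes (simple round-robin).

    Illustrative only: real HDFS placement is rack-aware.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many datanodes as replicas")
    return {
        block: [datanodes[(block + i) % len(datanodes)] for i in range(replication)]
        for block in range(num_blocks)
    }


def surviving_replicas(placement, failed_node):
    """Return the replicas of every block that remain after one node fails."""
    return {
        block: [node for node in nodes if node != failed_node]
        for block, nodes in placement.items()
    }


nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical datanode names
placement = place_blocks(6, nodes)
after_failure = surviving_replicas(placement, "dn2")

# Every block still has at least two live copies to read from or re-replicate.
print(min(len(replicas) for replicas in after_failure.values()))
```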

Pros & Cons of File-Stores vs Block-Stores

File-Stores --Visibility to OS: OS gets a network share (sees a directory w/ files) --Protocols: NFS (linux) and CIFS (windows) --Cons: Relatively slow Block-Stores --Visibility to OS: OS gets a block device (sees the volume as a disk) --Protocols: iSCSI/iSER/Vendor Specific --Cons: No built-in file system

Simple Serverless Example

GCP --deploys function w/ HTTP trigger --visit URL of function in browser AWS Lambda is an event driven, serverless computing platform -- runs code in response to events & automatically manages the computing resources that are required by the code
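
As a rough sketch of what such a function looks like, here is a handler written in the style AWS Lambda uses for Python (`handler(event, context)`). The event shape assumed here (a `queryStringParameters` field, as in an HTTP proxy-style trigger) and the greeting logic are illustrative assumptions, not a definitive deployment recipe.

```python
import json


def lambda_handler(event, context):
    """Respond to an HTTP-style event with a small JSON greeting.

    Assumption: the event carries optional query parameters under
    "queryStringParameters"; `context` is unused in this sketch.
    """
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


# Local invocation with a hand-built event; in the cloud the platform
# builds the event and calls the handler when the trigger fires.
response = lambda_handler({"queryStringParameters": {"name": "cloud"}}, None)
print(response["body"])
```

The function holds no server state of its own; the platform decides when and where it runs.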

Object Stores

General term that refers to the way in which we organize & work w/ units of storage, called objects useful information dispersal (erasure coding) algorithms to place object each object contains: --data --expandable amt of metadata --globally unique identifier Access via API at application level, rather than via OS at filesystem-level (each object gets an HTTP URL that can be publicly accessible via REST) flat object models support two-level folder-file hierarchy that allows for the creating of object containers --each can hold zero or more objects objects cannot be modified once created/uploaded --can only be deleted or replaced object storage w/ versioning - all versions of the same file are stored in the same container

Attached File Store

Good --easy to understand, files are organized around a tree of directories or folders --use standard POSIX (Portable Operating System Interface) API --Allow direct use of many existing programs w/o modification Bad --not scalable (limit in file size, num of files, num of folders, slow search when number of file is large) --No support for data model

HDFS Namenode and Datanodes

HDFS cluster has two types of node: a namenode and a number of datanodes --operates in a master/worker pattern Namenode --maintains file system tree & metadata for all the files & directories in the tree (stored persistently on local disk in the form of two files: namespace image and edit log) --Services block location requests (the entire (block, datanode) mapping table is stored in memory for efficient access) Datanode --stores data --maintains block location information Which one runs on commodity hardware? --datanode

HDFS blocks

HDFS supports the concept of a block, but it is much larger - 128 MB by default --files in HDFS are broken into block-sized chunks, stored in independent units --if size of file < HDFS block size, the file does not occupy all of it block size is large to --minimize the cost of seek time --target is to make seek time 1% of disk transfer time
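
The 1% target can be turned into a block-size estimate. The seek time (10 ms) and transfer rate (100 MB/s) below are illustrative assumptions, not figures from this card.

```python
def block_size_mb(seek_time_s: float, transfer_rate_mb_s: float,
                  seek_fraction: float = 0.01) -> float:
    """Block size at which seek time is `seek_fraction` of transfer time."""
    transfer_time_s = seek_time_s / seek_fraction  # e.g. 10 ms / 1% = 1 s
    return transfer_time_s * transfer_rate_mb_s


# Assumed disk: 10 ms seek, 100 MB/s sustained transfer.
print(block_size_mb(0.010, 100.0))  # 100.0
```

Under these assumptions the block should be around 100 MB, which is in line with the 128 MB default.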

Hadoop vs Grid Computing

Hadoop --try to co-locate data w/ the computing node (data locality) --avoid copying data around --automatic fault recovery --breaking data down into smaller components & computes locally existing grid computing ie HPC --distribute tasks to process data in a shared file system --data needs to move to the machines that run tasks --not suitable for tasks accessing large data volumes

Beyond HDFS - Hadoop supported file systems

Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation --Local file = fs.LocalFileSystem (a filesystem for locally connected disk w/ client-side checksums) --WebHDFS = hdfs.web.WebHdfsFileSystem (a filesystem providing authenticated read/write access to HDFS over HTTP) --FTP (filesystem backed by FTP server) --S3 (filesystem backed by Amazon S3) --Azure (backed by Azure) --Swift (backed by OpenStack Swift)

More on Namenode Fault Tolerance

High Availability (HA) configuration --a pair of namenodes configured as active-standby --standby takes over when active one fails --namenodes use highly available shared storage to share the edit log (active writes & standby reads to keep in sync) --datanodes must send block reports to both namenodes --client must be configured to handle namenode failover block reports are sent to both namenodes, all namespace edits are stored in shared storage

Development challenges

How to ensure services interact consistently, avoid dependency? How to avoid n*m configurations? How to migrate & scale quickly, ensure compatibility?

Roots of Virtualization

Increasing levels of abstraction in hardware & software --high level programming allows for software development, while shielding programmers away from the complexity of OS --OS provides a lower level abstraction that frees software developers from the complexities & details to interact w/ and manage physical resources (memory & I/O devices) -- OS must be fully cognizant of the hardware on which it resides

Dockerfile

A text file of instructions (each w/ its arguments) that Docker uses to build an image ex:
FROM openjdk:7
COPY . /usr/src/myapp
WORKDIR /usr/src/myapp
RUN javac Main.java
CMD ["java", "Main"]
RUN = executes commands in a new layer to create a new image
CMD = sets default command and/or parameters, which can be overridden from the command line when the container runs
ENTRYPOINT = configures a container that will run as an executable

Rest API

Architectural style for distributed hypermedia systems, described by Roy Fielding in 2000 Guiding principles of REST: Client - Server Stateless Cacheable Uniform Interface Layered System Code on demand (optional) --Key abstraction of information in REST is a resource (identified by a resource identifier ie URI) --resources can be retrieved or transformed to another state by a set of methods (GET/PUT/POST/DELETE/PATCH) --Clients & servers exchange representations of resources by using a standardized interface and protocol - typically HTTP

Stateless or stateful more suitable for docker container?

It is preferable to create Stateless application for Docker Container. --We can create a container out of our application and take out the configurable state parameters from application. Now we can run same container in Production as well as QA environments with different parameters. This helps in reusing the same Image in different scenarios. Also a stateless application is much easier to scale with Docker Containers than a stateful application.

Namenode cons

It's the main component that makes the filesystem operational --so if it fails, the whole cluster can't function Maintains a single namespace & metadata of all the blocks for the cluster --not scalable (the entire namespace & block metadata is held in memory) --poor isolation (no way to separate a group of workers from one another)

Azure Attached File Stores

Managed Disks ▪Ultra SSD Managed Disks ▪Highest performance ▪Up to 64TB Premium SSD Managed Disks ▪I/O intensive workloads with significantly high throughput and low latency ▪Up to 8TB Standard SSD Managed Disks ▪Entry-level production workloads requiring consistent latency ▪Up to 32TB Standard HDD Managed Disks ▪Cheapest ▪Up to 32TB File Share ▪Can be mounted by multiple instances via Server Message Block (SMB)protocol

Single Program Multiple Data (SPMD)

Message passing interface (MPI) programming model --computation comprises one or more processes that communicate by calling library routines to send & receive messages to other processes --same program executed by each processor --control statements select different parts for each processor to execute SPMD using AWS --Amazon CloudFormation service enables automated deployment of complex collections of related services (ie. Multiple EC2 instances, load-balancers, special network connecting them, security groups) --AWS CloudFormation cluster (fill out CfnCluster template, use AWS command line to submit, log into head node) SPMD using Azure --Azure's Slurm Cluster service (fill out template similar to AWS, setup a Slurm cluster) --Use Azure Batch (similar to AWS batch)

How to fix namenode cons

Namenode federation --fixes scalability --adds more namenodes --each manages a portion of the filesystem namespace which is independent from others --each datanode will register w/ each namenode to store blocks for different namespaces downside: cost Namenode Fault Tolerance --simple solution: secondary namenodes --periodically pulls a copy of the file metadata from the active namenode to save to local disk downside: time (30+ min) - might take too long to come up (have to load the namespace image into memory, then replay its edit log, then receive enough block reports from the datanodes to leave safemode)

Virtualization model

Network --Internal vs External Server --Full, Partial, or Para Storage --Block or File Memory --Application Level Integration --OS Level Integration Software --OS Level --Application --Service Data --Database Desktop --Virtual Desktop Infrastructure --Hosted Virtual Desktop

Virtualization resources in the cloud

Network Virtualization --combining hardware & software network resources and functionality into a single, software-based administrative entity (a virtual network) VLAN - External Network Virtualization Software defined network - Internal Server virtualization --process of using software on a physical server to create multiple partitions or "virtual instances" each capable of running independently Storage virtualization --pools physical storage from multiple network storage to enable a single storage device managed by a central console Getting VMs from: --AWS EC2 --Azure --Google Cloud

Serverless

New paradigm for service delivery --user provides a simple function to execute, under certain conditions Very lightweight process, similar to daemon waiting for event to occur Cloud provider allocates machine resources on demand, taking care of the servers on behalf of the customer Managed by cloud infrastructure --No server management --Pay only while code runs --Scale automatically --Highly available & fault tolerant Backend as a Service (BaaS) --uses third party services Function as a Service (FaaS) --runs in ephemeral containers (e.g. AWS Lambda) Examples: AWS Lambda, Google Cloud Functions, Azure Functions

Object stores example

Object storage w/ versioning-- Each NetCDF file is stored in a separate container --all versions of the same NetCDF file are stored in the same container

Hadoop

Open source software framework for storing data and running applications on clusters of commodity hardware --cheap to implement and expand Provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs --scalable HDFS - Hadoop Distributed File System --designed to be highly fault-tolerant and to be deployed on low-cost hardware MapReduce -- heart of the whole thing --a framework for processing data in batch - BSP Map = breaking the data into smaller pieces & distributing Reduce = combining the results together to produce the final product

Hadoop software architecture

PIG - scripting HIVE - query Application (MapReduce, Spark, ...) **Spark is very fast Resource Management (YARN - yet another resource negotiator) Storage (HDFS)

Development Tools

PaaS providers usually allow a set of development tools for their users to shorten development time --another trick for vendor lock-in

PaaS or IaaS?

PaaS is best suited for multi-tenancy, IaaS creates a clear separation of resources Might be difficult for PaaS users to switch to a different vendor PaaS providers usually allow a set of development tools for their users to shorten development time

Google Attached File Stores

Persistent Disks --cheapest, up to 64 TB --can be accessed anywhere in a zone Local SSD (solid state disk) --More expensive, better performance --up to 3 TB RAM disk - in memory --most expensive, and fastest --up to 208 GB

Data center efficiency metrics

Power Usage Effectiveness (PUE) = total power into data center / IT equipment power Data Center Infrastructure Efficiency (DCIE) = 1/PUE
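
A short worked example of the two metrics. The power figures (1500 kW total, 1000 kW for IT equipment) are made-up numbers for illustration.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total power into the data center / IT power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def dcie(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Data Center Infrastructure Efficiency: the reciprocal of PUE."""
    return 1.0 / pue(total_facility_kw, it_equipment_kw)


# Hypothetical facility: 1500 kW drawn in total, 1000 kW reaching IT equipment.
print(pue(1500, 1000))             # 1.5
print(round(dcie(1500, 1000), 3))  # 0.667
```

A PUE closer to 1.0 (DCIE closer to 100%) means less power is spent on cooling, power conversion, and other overhead.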

Relational Databases Pros & Cons

Pros: --best for structured data --moderate size Cons: --no support for unstructured data --not scalable in the cloud (requires a single server to host whole database)

Serverless Pros/Cons

Pros: --no need to write or manage backend code --achieve event-based programming w/o the complexity of building & maintaining the infrastructure Cons: --Vendor Lock-in --Architectural complexity (AWS Lambda limits the number of concurrent executions you can be running of all your lambdas) Lambda function = anonymous function not bound to an identifier --Startup latency in FaaS (takes time to initialize an instance of a function before each event)

Pros & Cons of Object Stores

Pros: --scalable (can be as large as needed) --simplify provisioning (flat namespace w/ metadata) --ease of use (each object gets unique ID & HTTP URL that can be publicly accessible) --agility (sysadmin not required to maintain) Cons: --no support for search (need to know object identifier to access or create complex metadata index) --eventual consistency (no guarantee that a read request returns most recent update of an object)

Relational Database Services in the Cloud

Running a DBMS, ie. MySQL, on one of the virtual machines ▪Limit in scale Relational database services ▪Amazon's Aurora ▪Google's Cloud SQL and Spanner ▪Azure's SQL Database

Traditional server concept

Servers are viewed as an integral computing unit (which includes hardware, OS, storage, and applications) --servers are often identified & referred to by their function (ex: file server, database server, SQL server...) --whenever a server reaches its capacity, a new server must be added Advantages --ease of configuration & conceptualization --ease of deployment --backup manageable --client-server paradigm is well suited for a variety of applications & services Disadvantages --maintenance cost is high -- replication is challenging --scalability may be limiting --highly vulnerable to hardware failures --utilization is usually low need physical backup machine

Parallel Computing Paradigms of the Cloud

Single Program Multiple Data (SPMD) --scientific programmers use this --Classic high performance computing (HPC) Many Task Parallelism --large queue of tasks that may be executed in any order - results stored in a database Bulk Synchronous Parallelism (BSP) --Source of map reduce --model for analyzing parallel algorithms Graph Execution --Spark & streaming systems --graph of tasks, application flow demonstrated --100x faster than BSP Microservices --computing is performed by one or more actors which communicate via messages Serverless --focus on application, not the infrastructure --does have a server tho

Cloud computing service categories

Software as a Service (SaaS) --provided w/ access to application software in the cloud --can be directly run from web browser --largest cloud market ex: Google Apps, Microsoft Office 365, salesforce.com, Oracle's Netsuite, Concur, Cisco WebEx, GoToMeeting Platform as a Service (PaaS) --provides computing platforms which includes OS, programming language, execution, database, etc. --applications using PaaS inherit cloud characteristics ex: Google App Engine, AWS Elastic Beanstalk, Salesforce.com, Amazon EMR, MS Azure HDInsight, GCP Dataproc Infrastructure as a Service (IaaS) --offers storage & computing resources that developers & IT organization use to deliver custom business solutions ex: Amazon EC2, VMWare vCloud, GCP Compute Engine Metal as a Service (MaaS) --combines the flexibility & scalability of the cloud w/ the ability to harness the power of physical servers --control over machine specifications --have the option to control everything ex: Juju

HDFS

Stands for Hadoop Distributed File System and is the way that Hadoop structures its files. Designed for storing large files w/ streaming data access patterns, running on clusters on commodity hardware --good for storing very large files (hundreds of megabytes, gigabytes, or terabytes) --streaming data access patterns (write-once, read many times pattern) --runs on clusters on commodity hardware (node failure % is high - has to tolerate failures w/o disruption or loss of data) Client/Server architecture w/ master node and worker nodes

Moore's Law application

Storage doubling period: 12 months Bandwidth doubling period: 9 months CPU computing doubling period: 18 months If you have a company, you have to plan to increase your resources every X months --continuously adding resources & adding complexity to your system A general pattern applied - what's expected to match user needs
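
To see how quickly these doubling periods diverge, the growth factor after t months is 2^(t / doubling period). The 36-month planning horizon below is an arbitrary choice for illustration.

```python
# Doubling periods in months, taken from the card above.
DOUBLING_MONTHS = {"storage": 12, "bandwidth": 9, "cpu": 18}


def growth_factor(months: float, doubling_period: float) -> float:
    """Capacity multiple after `months`, given a fixed doubling period."""
    return 2 ** (months / doubling_period)


horizon = 36  # assumed planning horizon in months
for resource, period in DOUBLING_MONTHS.items():
    print(f"{resource}: x{growth_factor(horizon, period):.0f}")
# storage: x8
# bandwidth: x16
# cpu: x4
```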

Sentiment analyzer example

Takes one sentence as input, using text analysis calculates the emotion of each sentence Consists of 3 microservices: SA-Frontend - a Nginx web server that serves our ReactJS static files SA-WebApp - a Java Web Application that handles requests from the frontend SA-Logic - a python application that performs Sentiment Analysis Build a container image for each service Kubernetize -- orchestrate sentiment analyzer's containers

Many Task Parallelism

Task parallel model that is great for solving problems that involve doing many independent computations --ongoing queue of unique tasks --important that tasks are independent each worker repeatedly pulls a sample from a queue, processes the data, and stores the result in a shared table
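
The worker loop described above can be sketched with a thread-safe queue. The `process` function is a stand-in for real work, and the dictionary plays the role of the shared results table; both are assumptions for illustration.

```python
import queue
import threading


def process(sample: int) -> int:
    """Stand-in for an independent computation on one sample."""
    return sample * sample


def worker(tasks: queue.Queue, results: dict, lock: threading.Lock) -> None:
    """Repeatedly pull a sample, process it, and store the result."""
    while True:
        try:
            sample = tasks.get_nowait()
        except queue.Empty:
            return  # no work left
        value = process(sample)
        with lock:  # the shared table is updated under a lock
            results[sample] = value


def run(samples, num_workers: int = 4) -> dict:
    tasks: queue.Queue = queue.Queue()
    for sample in samples:
        tasks.put(sample)
    results: dict = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(tasks, results, lock))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


results = run(range(10))
print(sorted(results.items()))
```

Because the tasks are independent, the order in which workers pick them up does not affect the final table.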

Host machine

The physical machine that a virtual machine is running on

Microservices in the cloud

Typically run as containers using a service deployment and management service --Amazon Elastic Container Service --Google Kubernetes --DCOS from Berkeley/Mesosphere --Docker Swarm Container Service Cluster: --consists of various cores

Docker's Union File System

Union File System allows you to take different file systems & create a union of their contents w/ the topmost layer superseding any similar files found in the lower file systems --all containers w/ same image see same directory tree --copy-on-write: only the part that needs to be modified is copied into the new layer
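
The layering idea can be imitated with Python's `collections.ChainMap`, where lookups search the topmost mapping first and writes go only to the topmost mapping. This is a toy model of the concept, not how Docker's storage drivers are implemented; the paths and contents are hypothetical.

```python
from collections import ChainMap

# Read-only image layers (path -> contents); hypothetical example data.
base_layer = {"/etc/os-release": "debian", "/app/config": "default"}
app_layer = {"/app/config": "production", "/app/main.py": "print('hi')"}

# Topmost layer first: app_layer supersedes base_layer for "/app/config".
image = ChainMap(app_layer, base_layer)


def start_container(image_layers: ChainMap) -> ChainMap:
    """Add an empty writable layer on top of the shared image layers."""
    return image_layers.new_child()


c1 = start_container(image)
c2 = start_container(image)

# A write lands only in c1's own top layer; the image layers are untouched.
c1["/app/config"] = "patched"

print(c1["/app/config"])      # patched
print(c2["/app/config"])      # production
print(c1["/etc/os-release"])  # debian
```

Both containers share the same image layers, and only the modified entry is stored in the writable layer of the container that changed it.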

MapReduce Data-Intensive Programming Model

Users specify computation in terms of map() and reduce() --underlying runtime system (YARN+HDFS) --automatically parallelizes the computation across large-scale clusters of machines --handles machine failures, communications & performance issues @ high level of abstraction, MapReduce codifies a generic recipe for processing a large data set Map: --Iterate over a large # of records --Extract something of interest from each Shuffle: (Hadoop takes care of it) --shuffle & sort intermediate results Reduce: --aggregate intermediate results --generate final output
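
The map, shuffle, and reduce steps above can be sketched in plain Python as a word count. This runs in a single process and only mirrors the shape of the model; it does not use Hadoop, and the function names are chosen for illustration.

```python
from collections import defaultdict


def map_phase(record: str):
    """Map: extract (word, 1) pairs from one input record."""
    for word in record.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group intermediate values by key and sort the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key: str, values: list):
    """Reduce: aggregate all values for one key."""
    return key, sum(values)


def run_job(records):
    intermediate = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = run_job(["the cloud stores data", "the cloud scales"])
print(counts)
# {'cloud': 2, 'data': 1, 'scales': 1, 'stores': 1, 'the': 2}
```

In a real cluster the user writes only the map and reduce functions; the runtime handles the shuffle and the distribution of records across machines.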

VMs to Containers

VM is too heavy for a simple process bc it requires a whole OS to be installed --containers are isolated, but share an OS, and bins/libs (containers can have original app, copy of app, or modified app) VM-Infrastructure, host operating system, hypervisor, guest OSs, bins/libs, apps Containers-infrastructure, operating system, container engine, bins/libs, apps

Virtual Machines VS Containers

Virtual machines: --heavyweight --fully isolated --no automation for configuration --slow deployment --easy port & IP address mapping --Custom images not portable across clouds (ex: Citrix Xen, Microsoft Hyper-V, VMWare ESXi, VirtualBox, KVM) Containers: --lightweight --process-level isolation (less secure) --script-driven configuration --rapid deployment --more abstract port & IP mappings --completely portable (ex: Docker, Google container, Linux kernel container (LXC), FreeBSD jails, Solaris Zones)

Containers

a type of virtualization that allows for shared operating systems for more resource savings and faster execution --do not have a dedicated operating system, just share the OS from the VM --reliable, fast, efficient, light-weight --easy to instantiate many containers difficult to manage --need networking --need to be deployed appropriately, managed carefully, scalable, detect crashes --traffic distribution is challenging Ex: Pokemon Go container: Java, Google BigTable, MapReduce, Cloud DataFlow Sit inside PODs, and PODs are inside nodes --usually multiple containers --managed by the POD, so share the same resources

Virtual Machine Monitor (VMM)

also called Hypervisor --a process that separates a computer's operating system and applications from the underlying physical hardware --hypervisor monitors and manages running virtual machines stands between physical & virtual machine, tries to allocate resources for each virtual machine / splits concepts

Kubernetes

an open source container management tool which automates container deployment, container scaling & descaling, and container load balancing --written in Golang, has huge community support (was first developed by Google & later donated to Cloud Native Computing Foundation (CNCF)) --works with most cloud providers --can group 'n' number of containers into one logical unit called 'POD' directs all the traffic - which POD to go to, etc. Architecture: --UI, CLI, API --Kubernetes master --Image Registry --Nodes

Data centers

are physical or virtual infrastructures used by enterprises to house computer, server and networking systems and components for the company's IT (information technology) needs --clouds are built on data centers --very big (range in size from "edge" facilities to mega scale) Facility: location and "white space" Support infrastructure: --uninterruptible power sources: battery banks, redundant power sources & generators --environmental control: cooling systems --physical security systems: biometrics & video surveillance systems IT equipment: servers, storage hardware, cables, racks, firewalls

Pokemon Go example

augmented reality game developed by Niantic for Android & iOS. Original container: Java, Google BigTable, MapReduce, Cloud DataFlow --crashed because it couldn't handle the load --needed both horizontal & vertical scaling because of the real-time activity in gaming --so the Kubernetes team at GCP worked with Niantic to handle these challenges

Features of Kubernetes

automatic bin-packing --packages software & places containers on nodes based on their resource requirements; service discovery and load balancing --automatic network & load-balancer configuration; storage orchestration --automatically mounts different storage systems for the cluster; self-healing --detects crashes & restarts containers automatically; secret & configuration management; batch execution; horizontal scaling --through the command line, the dashboard, or autoscaling; automatic rollouts and rollbacks --makes sure an update or rollback does not disrupt ongoing traffic
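
The bin-packing idea can be illustrated with a small first-fit sketch. This is not how the Kubernetes scheduler is implemented (the real scheduler filters and scores nodes); it is a simplified stand-in, and the pod names, node names, and CPU figures (in millicores) are made up for illustration.

```python
from typing import Dict, Optional


def first_fit(pods: Dict[str, int], node_capacity: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Place each pod on the first node that still has enough free CPU.

    Simplified illustration only: a pod that fits nowhere maps to None.
    """
    free = dict(node_capacity)  # remaining CPU per node
    placement: Dict[str, Optional[str]] = {}
    for pod, request in pods.items():
        placement[pod] = None
        for node in free:
            if free[node] >= request:
                free[node] -= request
                placement[pod] = node
                break
    return placement


# Hypothetical workload: CPU requests and capacities in millicores.
placement = first_fit(
    {"web": 500, "db": 1000, "cache": 700},
    {"node-a": 1000, "node-b": 1500},
)
print(placement)
```

In this run "cache" stays unplaced because neither node has 700 millicores left, which is the situation where a cluster would need another node.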

Kubernetes master

controls the cluster and the nodes in it --nodes host groups of containers called PODs --containers in a POD run on the same node & share resources such as filesystems, kernel namespaces, and an IP address --the replication controller at the master ensures that the requested number of PODs are running on the nodes --the load balancer at the master provides load balancing across a replicated group of PODs --nodes contain PODs, which contain containers
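
The replication controller's job (keep the requested number of PODs running) can be sketched as a reconcile step. This is a toy model, not Kubernetes code: the function name `reconcile` and the POD naming scheme are assumptions made for the example.

```python
from typing import List


def reconcile(desired: int, running: List[str], prefix: str = "pod") -> List[str]:
    """Return the POD list after one reconcile pass.

    Starts new PODs when too few are running and drops the extras
    when too many are running (hypothetical naming: prefix-N).
    """
    pods = list(running)
    counter = 0
    while len(pods) < desired:
        name = f"{prefix}-{counter}"
        counter += 1
        if name not in pods:  # skip names that are already in use
            pods.append(name)
    return pods[:desired]  # scale down by trimming the surplus


print(reconcile(3, ["pod-0"]))           # one POD survived a crash, two are started
print(reconcile(1, ["pod-0", "pod-1"]))  # too many PODs, one is removed
```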

Cloud history

distributed systems (1940s-50s) -> timesharing and data processing industry (1960s-70s) -> PCs and clusters or workstations (1980s) -> grid computing & peer-to-peer systems (1990s-2000s)

Memory virtualization

enables software programs to access more memory than is physically installed, by swapping data to disk storage in the background

Evolution of cloud computing

grid computing --large problems solved w/ parallel computing; utility computing --each machine serving a specific purpose --computing resources offered as a metered service; software as a service --renting a service on the cloud --network-based subscriptions to applications; cloud computing --the next generation

Basic Filesystem Operations (Hadoop)

--hadoop fs -mkdir: create a directory in HDFS
--hadoop fs -put ____ . (or -copyFromLocal): copy a file from the local filesystem to HDFS
--hadoop fs -get (or -copyToLocal): copy a file from HDFS to the local filesystem
--hadoop fs -ls /user/name: list files
The file listing also shows the replication factor --the replication factor determines how many times a file is replicated --it is empty for directories, since the concept of replication does not apply to them --directories are treated as metadata & stored by the namenode, not the datanodes

Hybrid cloud

includes two or more private, public, or community clouds, but each cloud remains separate and is only linked by technology that enables data and application portability

Docker

leading container platform --most widely known & used --easy to download --free --previously called dotCloud (2013)
Popular because of: --ease of use (command line, Docker Compose, Kubernetes) --speed (loads fast, shares libraries among containers) --Docker Hub, for sharing images --modularity & scalability
Commands:
--docker pull: downloads an image from Docker Hub
--docker ps: lists running containers
--docker stop: stops active containers
--docker run -it --rm: runs a container, attaches to it, and removes it on exit
--docker rmi: removes the specified Docker images
Options:
--"-it" connects the container's standard IO to the shell that ran the docker command
--"-d" runs the container in detached mode
--"-v [localdir]:[containerdir]" mounts localdir at containerdir inside the container
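
To see how the run options combine, here is a small sketch that only assembles a `docker run` command line as a list of arguments; it does not invoke Docker. The helper name `docker_run_args`, the image name `ubuntu`, and the paths are assumptions chosen for the example.

```python
import shlex
from typing import List, Optional


def docker_run_args(
    image: str,
    local_dir: Optional[str] = None,
    container_dir: Optional[str] = None,
    detached: bool = False,
) -> List[str]:
    """Build the argument list for a `docker run` call (illustrative helper)."""
    args = ["docker", "run"]
    if detached:
        args.append("-d")          # run in the background
    else:
        args += ["-it", "--rm"]    # interactive shell, remove container on exit
    if local_dir and container_dir:
        args += ["-v", f"{local_dir}:{container_dir}"]  # mount a local directory
    args.append(image)
    return args


# Hypothetical image and paths.
print(shlex.join(docker_run_args("ubuntu", "/tmp/data", "/data")))
```

The resulting list could be passed to `subprocess.run` on a machine where Docker is installed.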

Multi-tenancy

many users sharing the same physical computer and database --PaaS is best suited to it --SaaS also promotes multi-tenancy

Public cloud

promotes massive, global, and industrywide applications offered to the general public

Private cloud

serves only one customer or organization and can be located on the customer's premises or off the customer's premises

Moore's law on doubling periods

storage: 12 months ---> volume --bandwidth: 9 months --CPU computing: 18 months ---> velocity
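
A doubling period T implies growth by a factor of 2^(t/T) after t months. The sketch below applies that formula to the three periods listed above; the 36-month horizon is an arbitrary choice for illustration.

```python
# Doubling periods in months, taken from the card above.
DOUBLING_MONTHS = {"storage": 12, "bandwidth": 9, "cpu": 18}


def growth_factor(months: float, doubling_period: float) -> float:
    """Growth multiple after `months`, given a doubling period in months."""
    return 2 ** (months / doubling_period)


for resource, period in DOUBLING_MONTHS.items():
    print(resource, growth_factor(36, period))
```

Over 36 months this gives 8x for storage, 16x for bandwidth, and 4x for CPU, so the gap between the resources widens over time.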

Vendor Lock-in

the difficulty of moving "what you manage" in one cloud environment to a different cloud provider --PaaS may lock in applications by requiring users to develop apps against the provider's specific APIs

Container orchestration

the automatic process of managing or scheduling the work of individual containers for microservice-based applications across multiple clusters. Orchestration tools: --Kubernetes (most widely used) --Docker Swarm --Apache Mesos --CoreOS rkt

Virtualization

the practice of sharing or pooling computing resources, such as servers and storage devices; the process of creating a virtual version of a physical object --relies on clusters --customers go through the virtualization layer and don't know the physical storage details --sets up a layer on top of the physical machines & abstracts them to make it easier for the user to communicate w/ them --combination of computing + storage

Guest machine

the virtual machine, running on the host machine

WordCount Program

two-phased program --distribute work over several machines --combine the outcome from each machine into a final word count. Phase 1 -- document processing (each machine processes a fraction of the document set). Phase 2 -- count aggregation (partial word counts from the individual machines are combined into the final count). Limitations: --the program does not take the location of the documents into consideration --storing WordCount and TotalWordCount in memory is a flaw --in Phase 2, the aggregation machine becomes the bottleneck. Hadoop solution: --execute both phases in a distributed fashion on a cluster of machines that can run independently --to achieve this, functionality must be added to: store data over a cluster of processing machines; partition intermediate data across multiple machines; shuffle the partitions to the appropriate machines
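
The two phases can be sketched on a single machine as follows. This is a minimal illustration of the idea, not Hadoop code: the function names `count_words` and `aggregate` and the sample documents are invented, and each "machine" is just a list of strings.

```python
from collections import Counter
from typing import Iterable, List


def count_words(documents: Iterable[str]) -> Counter:
    """Phase 1: one machine counts the words in its share of the documents."""
    counts: Counter = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts


def aggregate(partials: List[Counter]) -> Counter:
    """Phase 2: combine the partial counts into the final word count."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical document shares for two machines.
machine_1 = ["the cloud scales", "the cluster"]
machine_2 = ["the cloud"]
total = aggregate([count_words(machine_1), count_words(machine_2)])
print(total.most_common())
```

Note that `aggregate` runs in one place and holds every count in memory, which is exactly the bottleneck and memory limitation the card describes.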

Virtual machine

virtual representation of a physical machine (not the JVM)

Big data characteristics

volume, velocity, variety

Bulk Synchronous Parallelism (BSP)

worker tasks periodically synchronize & exchange data w/ each other --barrier = the point of synchronization. MapReduce is a special case of BSP --map task = an operation applied to blocks of data in parallel (output size = input size) --reduce task = once the maps are "done", reduce the results into a single result (output size < input size)
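
A barrier can be demonstrated with Python threads. In this sketch, which assumes at least one block of data, each worker squares and sums its own block, all workers wait at the barrier, and only then does one worker reduce the partial results. The function name `bsp_sum_of_squares` and the sample blocks are made up for the example.

```python
import threading
from typing import List


def bsp_sum_of_squares(blocks: List[List[int]]) -> int:
    """Sum of squares over all blocks, using one worker thread per block."""
    n = len(blocks)
    barrier = threading.Barrier(n)   # the synchronization point
    partial = [0] * n                # one slot per worker
    result: List[int] = []

    def worker(i: int) -> None:
        # Map-like step: each worker handles its own block independently.
        partial[i] = sum(x * x for x in blocks[i])
        barrier.wait()               # nobody proceeds until every worker is done
        if i == 0:
            # Reduce-like step: safe to read all partials after the barrier.
            result.append(sum(partial))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]


print(bsp_sum_of_squares([[1, 2], [3], [4, 5]]))
```

Without the barrier, worker 0 could read `partial` before the other workers have written their slots.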

