Week 8 - Big Data Systems

Big data scaling is needed because...

...to process increasing volumes of data in a similar time frame, you need to scale computing power accordingly.

CLOUD COMPUTING

Cloud computing allows the use of remote, vertically or horizontally scaled servers for data storage and analysis. Some of the most popular cloud computing services available are:
- Google Cloud Platform
- Amazon Web Services
- Microsoft Azure

Horizontal Scaling (Scale Out)

Add more machines/servers and distribute the data and processing across them. Examples: peer-to-peer (P2P) networks, Hadoop.

Hadoop Horizontal Scaling

A framework of open-source tools that supports the analysis of data sets too large to fit into a traditional data warehouse or relational database, through reliable, scalable and distributed computing. Typical Hadoop cluster:
- 40 nodes per rack
- 1,000 nodes in a cluster

Vertical Scaling (Scale Up)

Install more processors, memory and better / faster hardware in a single machine/server. Examples: Multi-core processor upgrade, graphics processing unit (GPU) upgrade, supercomputers.

Map Reduce Limitations

MapReduce has been very popular; however, it does not use computer memory to its full potential: intermediate results are written to disk between jobs, which makes iterative tasks slow.

Horizontal Scaling Pros & Cons

Pros:
- Increases performance in small steps as needed
- The financial investment to upgrade is relatively small
- Systems can be scaled out as much as needed

Cons:
- Software has to handle all the data-distribution and parallel-processing complexity
- Only a limited amount of software can take advantage of horizontal scaling

Vertical Scaling Pros & Cons

Pros:
- Most software can easily take advantage of vertical scaling
- Hardware within a single machine is easy to manage and install

Cons:
- Requires substantial financial investment
- The system must be powerful enough for future workloads, so the additional performance is initially underused
- Beyond a certain limit it is not possible to scale vertically

Spark

Spark is a newer tool (2010) that can run directly on HDFS, inside MapReduce, and alongside MapReduce on the same cluster. Spark supports streaming data and more complex analytics such as graph algorithms and machine learning.
• The unique feature of Spark is its ability to perform in-memory computations.
• It allows data to be cached in memory, eliminating Hadoop's hard-disk overhead for iterative tasks.
• For certain tasks Spark has been measured at up to 100x faster than MapReduce when the data fits in memory, and up to 10x faster when the data resides on disk.
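The caching idea can be illustrated in plain Python (this is not Spark code; the file path and marks data are invented for the sketch). A MapReduce-style pipeline re-reads its input from disk on every iteration, while a Spark-style pipeline reads it once and keeps it in memory:

```python
import os
import tempfile

def load_from_disk(path):
    """Simulates Hadoop re-reading the whole data set for each iteration."""
    with open(path) as f:
        return [int(line) for line in f]

# Write a small invented data set of exam marks to disk.
path = os.path.join(tempfile.mkdtemp(), "marks.txt")
with open(path, "w") as f:
    f.write("\n".join(str(m) for m in [55, 72, 64, 81, 90, 47]))

# Disk-based: every iteration pays the read cost again.
disk_totals = [sum(load_from_disk(path)) for _ in range(3)]

# Cached: read once, reuse the in-memory copy (what caching an RDD
# in memory achieves in Spark).
cached = load_from_disk(path)
cached_totals = [sum(cached) for _ in range(3)]

print(disk_totals == cached_totals)  # same answers; only the I/O cost differs
```

Both pipelines produce identical results; the point is that the cached version touches the disk once rather than once per iteration.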

There are two main types of scaling for big data

- Vertical Scaling (Scale Up)
- Horizontal Scaling (Scale Out)

Hadoop Distributed File System

• Files are split into blocks (usually 128 MB).
• Blocks are replicated across multiple nodes, usually 3x for fault tolerance.
• A name node stores metadata (filenames, block locations, etc.).
• Splitting the data into blocks makes sense but also brings complications. For example, suppose we have 8 data blocks containing student exam marks:
1. How would we identify the lowest student exam mark?
2. How would we calculate the mean student exam mark?
3. How would we calculate the median student exam mark?
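The three questions can be sketched in plain Python, treating each list as the contents of one HDFS block (the marks themselves are invented for illustration):

```python
# Eight invented "blocks" of student exam marks.
blocks = [
    [55, 72, 64], [81, 90, 47], [68, 59], [77, 83, 91],
    [62, 74], [88, 53, 66], [70, 79], [61, 85, 94],
]

# 1. Lowest mark: each node finds its block's minimum, then we take
#    the minimum of those partial results.
lowest = min(min(b) for b in blocks)

# 2. Mean mark: each node returns a (sum, count) pair; partial sums and
#    counts combine exactly, so the mean comes from the grand totals.
total = sum(sum(b) for b in blocks)
count = sum(len(b) for b in blocks)
mean = total / count

# 3. Median: per-block medians do NOT combine into the true median, so
#    the marks must be gathered (or globally sorted) before selecting the
#    middle value - a much harder operation to distribute.
marks = sorted(m for b in blocks for m in b)
mid = len(marks) // 2
median = marks[mid] if len(marks) % 2 else (marks[mid - 1] + marks[mid]) / 2

print(lowest, round(mean, 2), median)  # → 47 72.33 72
```

Min and mean decompose into per-block partial aggregates, which is why they fit MapReduce naturally; the median depends on the global order of all the marks, which is what makes it awkward for block-split data.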

Hadoop Map Reduce Process (Parallel)

• Read lots of data.
• Map: extract intermediate (key, value) records about something you care about from the data.
• Shuffle and sort to group records for reducing.
• Reduce: process the data - aggregate, summarise, filter, transform, etc.
• Output the results.
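The steps above can be sketched as a single-process word count in Python, the classic MapReduce example (the input lines are invented):

```python
from collections import defaultdict

# Input: a few invented lines of text standing in for "lots of data".
lines = ["big data systems", "big data scaling", "hadoop map reduce"]

# Map: emit intermediate (key, value) records - here (word, 1).
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle & sort: group all values by key so each reducer sees one key.
groups = defaultdict(list)
for key, value in sorted(mapped):
    groups[key].append(value)

# Reduce: aggregate each key's values - here, sum the counts.
counts = {key: sum(values) for key, values in groups.items()}

print(counts["big"], counts["data"], counts["hadoop"])  # → 2 2 1
```

In a real Hadoop job the map and reduce functions run on many nodes in parallel, with the shuffle moving records between them; the logic per record is the same.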

