Week 8 - Big Data Systems

¡Supera tus tareas y exámenes ahora con Quizwiz!

Big data scaling is needed because...

...to handle increased data in a similar time frame you need to scale computing power.

CLOUD COMPUTING

Cloud computing allows for the use of remote vertical or horizontally scaled servers for data storage and analysis. Some of the most popular cloud computing services available are: -Google Cloud Platform -Amazon Web Services -Microsoft Azure

Horizontal Scaling (Scale Out)

Examples: peer-to-peer (P2P) networks, Hadoop.

Hadoop Horizontal Scaling

Framework of open-source tools for supporting the examination of data sets that are too large to fit into a traditional data warehouse or relational database through reliable, scalable & distributed computing. Typical Hadoop Cluster: 40 nodes per rack 1000 nodes in a cluster

Vertical Scaling (Scale Up)

Install more processors, memory and better / faster hardware in a single machine/server. Examples: Multi-core processor upgrade, graphics processing unit (GPU) upgrade, supercomputers.

Map Reduce Limitations

MapReduce has been very popular however it doesn't utilise computer memory to its full potential...

Horizontal Scaling Pros & Cons

Pros: -Increases performance in small steps as needed -Financial investment to upgrade is relatively less -Can scale out the systems as much as needed Cons: -Software has to handle all the data distribution and parallel processing complexities -Limited number of software are available that can take advantage of horizontal scaling

Vertical Scaling Pros & Cons

Pros: -Most of the software can easily take advantage of vertical scaling -Easy to manage and install hardware within a single machine Cons: -Requires substantial financial investment -System has to be more powerful to handle future workloads and initially the additional performance is not fully utilized -It is not possible to scale vertically after a certain limit

Spark

Spark is a new tool (2010) that can run directly on HDFS, inside MapReduce and alongside MapReduce on the same cluster. Spark supports streaming data and more complex analytics such as graph algorithms and machine learning. • The unique feature of Spark is its ability to perform in-memory computations. • It allows data to be cached in memory, thus eliminating Hadoop's hard disk overhead limitation for iterative tasks. • For certain tasks Spark is tested to be up to 100X faster than MapReduce when the data can fit in the memory and up to 10X faster when data resides on the hard disk.

There are two main types of scaling for big data

Vertical Scaling (Scale Up) Horizontal Scaling (Scale Out)

Hadoop Distributed File System

• Files split into blocks (usually 128mb). • Blocks replicated across multiple nodes. • Usually 3x for fault-tolerance. • A name node stores metadata (i.e. filenames, locations etc.) • Splitting the data into blocks makes sense but also brings complications. For example: Suppose we have 8 data blocks containing student exam marks: 1. How would we identify the lowest student exam mark? 2. How would we calculate the mean student exam mark? 3. How would we calculate the median student exam mark?

Hadoop Map Reduce Process (Parallel)

• Read lots of data. • Map: extract intermediate records (key, value) about something you care about from the data. • Shuffle and sort for reducing. • Reduce: process data - aggregate, summarise, filter, transform etc. • Output the results.


Conjuntos de estudio relacionados

MGMT ALL QUIZ 3&4 FOR EXAM 2 PT 2

View Set

Chapter 27: Reproductive System (Females)

View Set

3-3 Vocabulary Review-- Cell Structure and Function

View Set

Computer Concepts and Applications Final

View Set

The Microscope: Exercise 3 Pre lab Quiz

View Set

Digital Imaging: Photoshop: Chapter 1

View Set