Big Data 101

In the video, 2.5 quintillion bytes of data are equivalent to how many Blu-ray DVDs?

10 Million

How many petabytes make up an Exabyte?

1024

When is it estimated that the data we create and copy will reach around 35 zettabytes?

2020

Big data is bigger than any one technology; one way of dealing with big data is to use Hadoop.

The Hadoop Distributed File System, or HDFS, stores data across many different machines, creating a centralized place to store and process the data.

Big Data

A collection of data sources inside and outside your company that represents a source for ongoing discovery and analysis

Semi-structured data

A combination of the two: like structured data, it may have an organized structure, but it lacks a strictly defined model.

fact

A node is simply a computer.

fact

According to McKinsey in 2013, the emergence of cloud computing has highly contributed to the launch of the Big Data era.

visualization

An example of visualizing big data is in displaying temperature on a map by region. By using the massive amounts of data collected by sensors and satellites in space, viewers can get a quick and easy summary of where it's going to be hot or cold.
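
Not part of the original card set: a minimal Python sketch of this kind of visualization, using matplotlib and made-up regional temperatures (the region names and values are invented for illustration).

# Minimal sketch: chart average temperature by region (illustrative data only).
import matplotlib.pyplot as plt

# Hypothetical aggregated sensor readings, one average per region.
temps = {"North": -5.2, "Central": 12.8, "South": 27.4, "Coast": 18.1}

plt.bar(list(temps.keys()), list(temps.values()), color="steelblue")
plt.ylabel("Average temperature (degrees C)")
plt.title("Average temperature by region (illustrative)")
plt.show()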

fact

Big Data is best thought of as a platform rather than a specific set of software.

fact

Big Data skills include discovering and analyzing trends that occur in big data.

Big data exploration

Big data exploration addresses a challenge faced by every large organization: business information is spread across multiple systems and silos. Big data exploration enables you to explore and mine big data to find, visualize, and understand all your data, improving decision making.

governance

Big data governance requires three things: automated integration, that is, easy access to the data wherever it resides; visual content, that is, easy categorization, indexing, and discovery within big data to optimize its usage; and agile governance, the definition and execution of governance appropriate to the value of the data and its intended use.

fact

By 2020, about 1.7 megabytes of new information will be created every second for every human being in the world. By 2020, the data we create and copy will reach around 35 zettabytes, up from only 7.9 zettabytes today.

fact

Processing and analyzing new data types, such as social media, emails, and hours and hours of video footage, and analyzing data in motion and at rest, can help find new associations, or uncover patterns and facts to significantly improve intelligence, security, and law enforcement.

In Module 2: What has highly contributed to the launch of the Big Data era?

Cloud Computing

Cloud computing

Cloud computing allows users to access highly scalable computing and storage resources through the internet. By using cloud computing, companies can use server capacity as needed and expand it rapidly to the large scale required to process big data sets and run complicated mathematical models. Cloud computing lowers the price to analyze big data as the resources are shared across many users, who pay only for the capacity they actually utilize.

Module 5: What is a method of storing data to support the analysis of originally disparate sources of data?

Data Lake

Variety

Data comes from different sources. Drivers: mobile, social media, video, genomics, Internet of Things (IoT).

Velocity

Data is being generated extremely fast, in a process that never stops. Drivers: improved connectivity, competitive advantage, precomputed information.

What is the term used to describe a holistic approach that takes into account all available and meaningful information about a customer to drive better engagement, revenue, and long-term loyalty?

Enhanced 360 degree view

In Module 2: Is one byte binary? True/False

False

Module 3: 'HDFS' stands for ____________________?

Hadoop Distributed File System

What is Hadoop written in?

Hadoop is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce technology as its foundation.

So what is the Hadoop framework?

Hadoop is an open-source software framework used to store and process huge amounts of data. It is implemented in several distinct, specialized modules: storage, principally employing the Hadoop Distributed File System, or HDFS; resource management and scheduling for computational tasks; distributed processing programming models based on MapReduce; and common utilities and software libraries necessary for the entire Hadoop platform.

Some of the applications used in big data are

Hadoop, Oozie, Flume, Hive, HBase, Apache Pig, Apache Spark, MapReduce and YARN, Sqoop, ZooKeeper, and text analytics.

One trend making the Big Data revolution possible is the development of new software tools and database systems such as

Hadoop, HBase, and NoSQL for large, unstructured data sets.

Structured data refers to any data that resides in a fixed field within a record or file.

It has the advantage of being easily entered, stored, queried, and analyzed. In today's business setting, most Big Data generated by organizations is structured and stored in data warehouses. Highly structured business-generated data is considered a valuable source of information and thus just as important as machine- and people-generated data.

What is an example of a source of Semi-Structured Big data?

JSON Files
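
As an aside (not from the course), a short Python sketch of why JSON counts as semi-structured: each record labels its own fields, but no fixed schema forces every record to have the same shape. The records below are invented.

# Semi-structured data: organized, labelled fields, but no strictly defined model.
import json

records = [
    '{"name": "Ana", "age": 34, "email": "ana@example.com"}',
    '{"name": "Ben", "interests": ["cycling", "chess"]}',  # different fields, still valid
]

for raw in records:
    record = json.loads(raw)                      # parses despite the varying shape
    print(record.get("name"), "->", sorted(record.keys()))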

analysis

Let's look at a Walmart example. Walmart utilizes a search engine called Polaris, which helps shoppers search for products they wish to buy. It takes into account how a user is behaving on the website in order to surface the best results for them. Polaris will bring up items that are based on a user's interests and, because many consumers visit Walmart's website, large amounts of data are collected, making the analysis on that big data very important.

fact

MapReduce, the programming paradigm that allows for this massive scalability, is the heart of Hadoop.
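
To make the paradigm concrete, here is a toy word count in plain Python that mimics the map, shuffle, and reduce phases. Real Hadoop jobs distribute these steps across many nodes; this single-process sketch is only an illustration.

# Toy MapReduce word count: map -> shuffle (group by key) -> reduce.
from collections import defaultdict

documents = ["big data is big", "hadoop processes big data"]

# Map phase: emit a (word, 1) pair for every word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group the emitted values by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: sum the counts for each word.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}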

fact

More data has been created in the past two years than in the entire history of humankind.

Module 5: What is the term referring to a database that must be processed by means other than just the SQL query language?

NoSQL

Module 5: In the Hadoop framework, a rack is a collection of ____________?

Nodes

How big is a zettabyte?

One bit is binary: it's either a one or a zero. Eight bits make up one byte, and 1024 bytes make up one kilobyte. 1024 kilobytes make up one megabyte. Large videos and DVDs are measured in gigabytes, where 1024 megabytes make up one gigabyte of storage space.

These days we have USBs or memory sticks that can store a few dozen gigabytes of information, while computers and hard drives now store terabytes of information.

One terabyte is 1024 gigabytes. 1024 terabytes make up one petabyte, 1024 petabytes make up one exabyte, and 1024 exabytes make up one zettabyte.
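
Since each step in the ladder is a factor of 1024 (2 to the 10th power), the whole progression can be checked with a few lines of Python:

# Each unit is 1024x the previous one, i.e. successive powers of 1024.
units = ["byte", "kilobyte", "megabyte", "gigabyte",
         "terabyte", "petabyte", "exabyte", "zettabyte"]

for power, unit in enumerate(units):
    print(f"1 {unit} = 1024**{power} bytes = {1024 ** power:,} bytes")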

What does 'OLAP' stand for?

Online Analytical Processing

Three major sources of Big Data.

People-generated data, machine-generated data, and business-generated data, which is the data that organizations generate within their own operations.

What is the search engine used by Walmart?

Polaris

Given a set of data, there are three key types of Data Warehouse Modernization.

First is Pre-Processing: using big data as a landing zone before determining what data should be moved to the Data Warehouse. Data can be categorized as irrelevant, or as relevant data that goes on to the Data Warehouse.

Value from Big Data can be _____________?

Profits

Veracity

Quality and origin of data. Drivers: cost, the need for traceability and justification.

In Module 1: What is a common use of big data that is used by companies like Netflix, Spotify, Facebook and Amazon?

Recommendation Engines

Sources of big data - STRUCTURED

Relational databases and spreadsheets

Name one of the drivers of Volume in the Big Data Era?

Scalable Infrastructure

Big Data comes in three forms.

Structured, unstructured, and semi-structured.

Structured data

Structured data is data that is organized, labelled, and has a strict model that it follows.

The main components and ecosystems are outlined as follows:

Techniques for Analyzing Data, such as A/B Testing, Machine Learning, and Natural Language Processing. Big Data Technologies like Business Intelligence, Cloud Computing, and Databases. Visualization such as Charts, Graphs, and Other Displays of the data.
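
As an illustration of the first technique, A/B testing, here is a minimal two-proportion z-test in pure Python. The visitor and conversion numbers are invented, and the normal approximation is one common choice, not the only one.

# Minimal A/B test sketch: compare conversion rates of two page variants.
import math

# Hypothetical results: (conversions, visitors) for variants A and B.
conv_a, n_a = 120, 2400   # 5.0% conversion
conv_b, n_b = 156, 2400   # 6.5% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)                  # pooled rate under H0
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

print(f"A: {p_a:.3f}  B: {p_b:.3f}  z = {z:.2f}")  # |z| > 1.96 ~ significant at 5%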

An example of visualizing Big Data is___________?

Temperature on a map

Volume

The amount of data being generated is vast compared to traditional data sources. Drivers: increase in data sources, higher resolution sensors, scalable infrastructure.

integration

To integrate means to bring together or incorporate parts into a whole.

Data Warehouses provide online analytic processing: True/False

True

Module 3: A data scientist is a person who is qualified to derive insights from data by using skills and experience from computer science, business or science, and statistics. True/False

True

Module 5: The Hadoop framework is mostly written in the Java programming language. True/False

True

Unstructured data

Unstructured data is said to make up about 80% of the data in the world. It is usually in text form and does not have a predefined model or organization.

4 Vs of Big Data

Velocity (speed of the data), Volume (scale of the data; the increase in the amount of data stored), Variety (diversity of the data), Veracity (certainty and accuracy of the data).

Sources of big data - SEMI-STRUCTURED

XML and JSON files

A rack is

a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch. Network bandwidth between any two nodes in a rack is greater than bandwidth between two nodes on different racks.

The Hadoop Cluster is

a collection of racks.

IBM Analytics defines Hadoop as follows: Apache Hadoop is

a highly scalable storage platform designed to process very large data sets across hundreds to thousands of computing nodes that operate in parallel. It provides a cost-effective storage solution for large data volumes with no format requirements.

What can help organizations to find new associations or uncover patterns and facts to significantly improve intelligence, security and law enforcement?

analyzing data in motion and at rest

Operations analysis focuses on

analyzing machine data, which can include anything from signals, sensors, and logs, to data from GPS devices. Using big data for operations analysis, organizations can gain real-time visibility into operations, customer experience, transactions, and behavior. Big data empowers businesses to predict when a machine will stop working, when machine components need to be replaced, and even when employees will resign.
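
As a loose sketch of the idea (not from the course), flagging sensor readings that drift from a running baseline is one simple form of machine-data analysis. The readings, window, and threshold below are all invented.

# Minimal sketch: flag readings that deviate far from the recent average.
readings = [70.1, 70.4, 69.9, 70.2, 70.3, 84.7, 70.0]  # hypothetical sensor values

window, threshold = 5, 5.0
for i in range(window, len(readings)):
    baseline = sum(readings[i - window:i]) / window      # average of the last 5
    if abs(readings[i] - baseline) > threshold:
        print(f"reading {i} = {readings[i]} deviates from baseline {baseline:.1f}")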

The smartest Hadoop strategies start with

choosing recommended distributions, then maturing the environment with modernized hybrid architectures, and adopting a data lake strategy based on Hadoop technology.

Module 3: Data privacy is a critical part of the big data era. Businesses and individuals must give great thought to how data is _____________________________.

collected, retained, used, and disclosed

What is the process of cleaning and analyzing data to derive insights and value from it?

data science

When we look at big data, we can start with a few broad topics:

integration, analysis, visualization, optimization, security, and governance.

An enhanced 360 degree view of the customer

is a holistic approach that takes into account all available and meaningful information about the customer to drive better engagement, revenue, and long-term loyalty. This is the basis for modern customer relationship management, or CRM, systems.

In Operations Analysis, we focus on what type of data?

machine data

Second is Offloading

moving infrequently accessed data from Data Warehouses into enterprise-grade Hadoop.

Data lakes are a method

of storing data that keeps vast amounts of raw data in its native format and scales more horizontally, to support the analysis of originally disparate sources of data.

Data science is

the process of cleaning, mining, and analyzing data to derive insights and value from it. Data science is the process of distilling insights from data to inform decisions.

Value

turn data into value

Third is Exploration

using big data capabilities to explore and discover new high-value data from massive amounts of raw data, and to free up the Data Warehouse for more structured, deep analytics.

Companies like Amazon and Netflix use algorithms based on big data to make specific recommendations based on customer preferences

as when Amazon recommends something for you to purchase. Recommendation engines are a common application of big data.
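
A toy illustration of the underlying idea, using invented data and not Amazon's or Netflix's actual algorithms: recommend items that appear in the histories of customers with overlapping purchases.

# Toy co-occurrence recommender: suggest items other similar users bought.
from collections import Counter

# Hypothetical purchase histories.
purchases = {
    "u1": {"book", "lamp", "desk"},
    "u2": {"book", "lamp"},
    "u3": {"lamp", "chair"},
}

def recommend(user):
    mine = purchases[user]
    scores = Counter()
    for other, items in purchases.items():
        if other != user and mine & items:   # any overlap in history
            scores.update(items - mine)      # count items this user hasn't bought
    return [item for item, _ in scores.most_common()]

print(recommend("u2"))  # ['desk', 'chair'] (each co-occurs once)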

