Big Data 101
In the video, 2.5 Quintillion Bytes of data are equivalent to how many Blu-ray DVDs?
10 Million
How many petabytes make up an Exabyte?
1024
When is it estimated that the data we create and copy will reach around 35 zettabytes?
2020
Big data is bigger than any one technology; one common way of dealing with big data is to use Hadoop.
The Hadoop Distributed File System, or HDFS, stores data across many different locations, creating a centralized place to store and process the data.
Big Data
A collection of data sources inside and outside your company that represents a source for ongoing discovery and analysis
Semi structured data
A combination of structured and unstructured data: it may have an organized structure, but it lacks a strictly defined model.
fact
A node is simply a computer.
fact
According to McKinsey in 2013, the emergence of cloud computing has highly contributed to the launch of the Big Data era.
visualization
An example of visualizing big data is in displaying temperature on a map by region. By using the massive amounts of data collected by sensors and satellites in space, viewers can get a quick and easy summary of where it's going to be hot or cold.
fact
Big Data is best thought of as a platform rather than a specific set of software.
fact
Big Data skills include discovering and analyzing trends that occur in big data
Big data exploration
Big data exploration addresses a challenge faced by every large organization: business information is spread across multiple systems and silos. Big data exploration enables you to explore and mine big data to find, visualize, and understand all your data, improving decision making.
governance
Big data governance requires three things: automated integration, that is, easy access to the data wherever it resides; visual content, that is, easy categorization, indexing, and discovery within big data to optimize its usage; and agile governance, the definition and execution of governance appropriate to the value of the data and its intended use.
fact
By 2020, about 1.7 megabytes of new information will be created every second for every human being in the world. By 2020, the data we create and copy will reach around 35 zettabytes, up from only 7.9 zettabytes today.
fact
By processing and analyzing new data types, such as social media, emails, and hours and hours of video footage, and by analyzing data in motion and at rest, organizations can find new associations or uncover patterns and facts to significantly improve intelligence, security, and law enforcement.
In Module 2: What has highly contributed to the launch of the Big Data era?
Cloud Computing
Cloud computing
Cloud computing allows users to access highly scalable computing and storage resources through the internet. By using cloud computing, companies can use server capacity as needed and expand it rapidly to the large scale required to process big data sets and run complicated mathematical models. Cloud computing lowers the price to analyze big data as the resources are shared across many users, who pay only for the capacity they actually utilize.
Module 5: What is a method of storing data to support the analysis of originally disparate sources of data?
Data Lake
Variety
Data comes from different sources. Drivers: mobile, social media, video, genomics, Internet of Things (IoT)
Velocity
Data is being generated extremely fast, in a process that never stops. Drivers: improved connectivity, competitive advantage, precomputed information
What is the term used to describe a holistic approach that takes into account all available and meaningful information about a customer to drive better engagement, revenue, and long-term loyalty?
Enhanced 360 degree view
In Module 2: Is one byte binary? True/False
False
Module 3: 'HDFS' stands for ____________________?
Hadoop distributed file system
What is Hadoop written in?
Hadoop is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce technology as its foundation.
So what is the Hadoop framework?
Hadoop is an open-source software framework used to store and process huge amounts of data. It is implemented in several distinct, specialized modules: storage, principally employing the Hadoop Distributed File System, or HDFS; resource management and scheduling for computational tasks; distributed processing programming models based on MapReduce; and common utilities and software libraries necessary for the entire Hadoop platform.
Some of the applications used in big data are
Hadoop, Oozie, Flume, Hive, HBase, Apache Pig, Apache Spark, MapReduce and YARN, Sqoop, ZooKeeper, and text analytics.
One trend making the Big Data revolution possible is the development of new software tools and database systems such as
Hadoop, HBase, and NoSQL for large, un-structured data sets.
Structured data refers to any data that resides in a fixed field within a record or file.
It has the advantage of being easily entered, stored, queried, and analyzed. In today's business setting, most Big Data generated by organizations is structured and stored in data warehouses. Highly structured business-generated data is considered a valuable source of information and thus equally important as machine and people-generated data.
What is an example of a source of Semi-Structured Big data?
JSON Files
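As a sketch of why JSON counts as semi-structured, the records below share an organized shape but follow no strict schema. This is a toy example using Python's standard json module; the field names are invented:

```python
import json

# Two records with overlapping but not identical fields:
# organized structure, yet no strictly defined model.
raw = '''[
  {"name": "Ana", "email": "ana@example.com"},
  {"name": "Ben", "phones": ["555-0101", "555-0102"]}
]'''

records = json.loads(raw)
for rec in records:
    # Fields may or may not be present, so we look them up defensively.
    print(rec["name"], "email" in rec, len(rec.get("phones", [])))
```

Note that a relational table would reject the second record for having a column the first one lacks; JSON simply allows it.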
analysis
Let's look at a Walmart example. Walmart utilizes a search engine called Polaris, which helps shoppers search for products they wish to buy. It takes into account how a user is behaving on the website in order to surface the best results for them. Polaris will bring up items that are based on a user's interests and, because many consumers visit Walmart's website, large amounts of data are collected, making the analysis on that big data very important.
fact
MapReduce, the programming paradigm that allows for this massive scalability, is the heart of Hadoop.
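The map, shuffle, and reduce phases at the heart of Hadoop can be sketched as a toy word count in plain Python. This illustrates only the paradigm, not Hadoop itself, which runs the same pattern in parallel across many nodes:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group all emitted values by their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts collected for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data is big", "data never sleeps"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])   # 2
```

Because each map call and each reduce call is independent, the same logic scales out to thousands of machines, which is what makes the paradigm suited to big data.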
fact
More data has been created in the past two years than in the entire history of humankind.
Module 5: What is the term referring to a database that must be processed by means other than just the SQL query language?
NoSQL
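The NoSQL idea of accessing data by means other than SQL queries can be sketched as a tiny in-memory key-value store. This is a toy stand-in for systems like HBase; the class and its API are invented for illustration:

```python
class TinyKeyValueStore:
    """A toy key-value store: data is fetched by key, not by SQL queries."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Store any value under a key; no fixed schema is enforced.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

store = TinyKeyValueStore()
store.put("user:42", {"name": "Ana", "city": "Lima"})
print(store.get("user:42")["name"])   # Ana
```

Real NoSQL databases add persistence, replication, and distribution across nodes, but the access pattern of put and get by key, rather than a SQL SELECT, is the core contrast.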
Module 5: In the Hadoop framework, a rack is a collection of ____________?
Nodes
How big is a zettabyte?
One bit is binary. It's either a one or a zero. Eight bits make up one byte, and 1024 bytes make up one kilobyte. 1024 kilobytes make up one megabyte. Large videos and DVDs will be in gigabytes where 1024 megabytes make up one gigabyte of storage space.
These days we have USBs or memory sticks that can store a few dozen gigabytes of information where computers and hard drives now store terabytes of information.
One terabyte is 1024 gigabytes. 1024 terabytes make up one petabyte, and 1024 petabytes make up an exabyte.
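The 1024-based unit ladder above can be sketched in a few lines of Python, a minimal illustration of the conversions the course quizzes on:

```python
# Each storage unit is 1024 times the previous one (binary units, as in the course).
UNITS = ["byte", "kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte"]

def bytes_in(unit):
    """Return how many bytes make up one of the given unit."""
    return 1024 ** UNITS.index(unit)

print(bytes_in("kilobyte"))                          # 1024
print(bytes_in("petabyte") // bytes_in("terabyte"))  # 1024 terabytes per petabyte
print(bytes_in("exabyte") // bytes_in("petabyte"))   # 1024 petabytes per exabyte
```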
What does 'OLAP' stand for?
Online Analytical Processing
Three major sources of Big Data.
People-generated data, machine-generated data, and business-generated data, which is the data that organizations generate within their own operations.
What is the search engine used by Walmart?
Polaris
Given a set of data, there are three key approaches to Data Warehouse Modernization.
First is Pre-Processing: using Big Data as a landing zone before determining what data should be moved to the Data Warehouse. Data can be categorized as irrelevant data or as relevant data, which goes on to the Data Warehouse.
Value from Big Data can be _____________?
Profits
Veracity
Quality and origin of data. Drivers: cost, need for traceability and justification
In Module 1: What is a common use of big data that is used by companies like Netflix, Spotify, Facebook and Amazon?
Recommendation Engines
Sources of big data - STRUCTURED
Relational databases and spreadsheets
Name one of the drivers of Volume in the Big Data Era?
Scalable Infrastructure
Big Data comes in three forms.
Structured, unstructured, and semi-structured.
Structured data
Structured data is data that is organized, labelled, and has a strict model that it follows.
The main components and ecosystems are outlined as follows:
Techniques for Analyzing Data, such as A/B Testing, Machine Learning, and Natural Language Processing. Big Data Technologies like Business Intelligence, Cloud Computing, and Databases. Visualization such as Charts, Graphs, and Other Displays of the data.
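As a sketch of the first technique listed, A/B testing compares two variants on a metric such as conversion rate. The visitor and conversion numbers below are invented for illustration:

```python
# Hypothetical results of showing two page variants to equal groups of visitors.
variant_a = {"visitors": 1000, "conversions": 50}
variant_b = {"visitors": 1000, "conversions": 65}

def conversion_rate(variant):
    # Fraction of visitors who converted under this variant.
    return variant["conversions"] / variant["visitors"]

lift = conversion_rate(variant_b) - conversion_rate(variant_a)
print(f"A: {conversion_rate(variant_a):.1%}  "
      f"B: {conversion_rate(variant_b):.1%}  lift: {lift:.1%}")
```

A real A/B test would also apply a statistical significance test before declaring a winner; this sketch shows only the metric comparison.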
An example of visualizing Big Data is___________?
Temperature on a map
Volume
The amount of data being generated is vast compared to traditional data sources. Drivers: increase in data sources, higher-resolution sensors, scalable infrastructure
integration
To integrate means to bring together or incorporate parts into a whole.
Data Warehouses provide online analytic processing: True/False
True
Module 3: A data scientist is a person who is qualified to derive insights from data by using skills and experience from computer science, business or science, and statistics. True/False
True
Module 5: The Hadoop framework is mostly written in the Java programming language. True/False
True
Unstructured data
Unstructured data is said to make up about 80% of the data in the world. It is usually in text form and does not have a predefined model, nor is it organized in any particular way.
4 Vs of Big Data
Velocity (speed of the data), Volume (scale of the data; increase in the amount of data stored), Variety (diversity of the data), Veracity (certainty and accuracy of the data)
Sources of big data - SEMI-STRUCTURED
XML and JSON files
A rack is
a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch. Network bandwidth between any two nodes in a rack is greater than bandwidth between two nodes on different racks.
The Hadoop Cluster is
a collection of racks.
IBM analytics defines Hadoop as follows, Apache Hadoop is
a highly scalable storage platform designed to process very large data sets across hundreds to thousands of computing nodes that operate in parallel. It provides a cost-effective storage solution for large data volumes with no format requirements.
What can help organizations to find new associations or uncover patterns and facts to significantly improve intelligence, security and law enforcement?
analyzing data in motion and at rest
Operations analysis focuses on
analyzing machine data, which can include anything from signals, sensors, and logs, to data from GPS devices. Using big data for operations analysis, organizations can gain real-time visibility into operations, customer experience, transactions, and behavior. Big data empowers businesses to predict when a machine will stop working, when machine components need to be replaced, and even when employees will resign.
The smartest Hadoop strategies start with
choosing recommended distributions, then maturing the environment with modernized hybrid architectures, and adopting a data lake strategy based on Hadoop technology.
Module 3: Data privacy is a critical part of the big data era. Businesses and individuals must give great thought to how data is _____________________________.
collected, retained, used, and disclosed
What is the process of cleaning and analyzing data to derive insights and value from it?
data science
When we look at big data, we can start with a few broad topics:
integration, analysis, visualization, optimization, security, and governance.
An enhanced 360 degree view of the customer
is a holistic approach that takes into account all available and meaningful information about the customer to drive better engagement, revenue, and long-term loyalty. This is the basis for modern customer relationship management, or CRM, systems.
In Operations Analysis, we focus on what type of data?
machine data
Second is Offloading:
moving infrequently accessed data from Data Warehouses into enterprise-grade Hadoop.
Data lakes are a method
of storing data that keeps vast amounts of raw data in its native format, scaling horizontally to support the analysis of originally disparate sources of data.
Data science is
the process of cleaning, mining, and analyzing data to derive insights of value from it. Data science is the process of distilling insights from data to inform decisions.
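The clean-then-analyze pipeline can be illustrated with a toy example in pure Python. The sensor readings are invented, and a real project would use dedicated tools such as pandas:

```python
# Raw readings with messy entries, as data often arrives from real sources.
raw = ["23.1", "21.7", "n/a", "24.9", "", "22.4"]

# Clean: drop entries that are not valid numbers.
cleaned = []
for value in raw:
    try:
        cleaned.append(float(value))
    except ValueError:
        pass  # "n/a" and empty strings are discarded

# Analyze: a simple summary statistic that could inform a decision.
average = sum(cleaned) / len(cleaned)
print(len(cleaned), round(average, 2))
```

Even this tiny pipeline shows the shape of the work: most of the code handles cleaning, and the insight (here, an average) comes only after the data is trustworthy.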
Value
turn data into value
Third is Exploration:
using big data capabilities to explore and discover new high-value data from massive amounts of raw data, and to free up the Data Warehouse for more structured deep analytics.
Companies like Amazon and Netflix use algorithms based on big data to make specific recommendations based on customer preferences,
as when Amazon recommends something for you to purchase. Recommendation engines are a common application of big data.