Big Data I Final
Machine Learning
*Field of study that gives computers the ability to learn without being explicitly programmed.* - Allows computers to find hidden insights without being explicitly programmed where to look. - AI that learns predictive models from data. - A method of data analysis that automates analytical model building.
NOSQL
- NON-RELATIONAL - Designed to meet scalability requirements of architectures. Most scalable: up to 10 PB. - Used for analysis and HANDLES big data structures. - New breed of non-relational database products. - Rejects fixed table schemas and join operations. - Meets schema-less data management requirements. - Handles structured, semi-structured, and unstructured data.
Bit
0 or 1
Gigabyte
1 billion bytes
Megabyte
1 million bytes
Terabyte
1 trillion bytes
Reinforcement Machine Learning
1. Decision Process 2. Reward System 3. Learn series of actions - Area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. - Learns from environment and experience, NOT DATA.
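The reward-driven loop on this card can be sketched with a very small example. The epsilon-greedy multi-armed bandit below is an illustrative stand-in chosen for brevity, not a technique named on the card; the reward probabilities, `epsilon`, and the function name `run_bandit` are all assumptions.

```python
import random

def run_bandit(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: learns action values from rewards, not labeled data."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)    # how often each action was taken
    values = [0.0] * len(reward_probs)  # running average reward per action
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(reward_probs))
        else:
            # Exploit: take the action with the best estimate so far.
            action = max(range(len(values)), key=values.__getitem__)
        # The environment hands back a reward of 0 or 1 (hypothetical probabilities).
        reward = 1 if rng.random() < reward_probs[action] else 0
        counts[action] += 1
        # Incremental update of the average reward for this action.
        values[action] += (reward - values[action]) / counts[action]
        total_reward += reward
    return values, total_reward

values, total = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(values)), key=values.__getitem__)
```

The agent is never told which action is best; it discovers this from the rewards it collects while acting.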
Drivers of Big Data
1. Digitization of society 2. Lowering of technology costs 3. Connectivity through cloud computing 4. Social media applications 5. Upcoming Internet of Things
Data Science encompasses what 3 skills?
1. Hacking Skills 2. Math and Stats Skills 3. Domain Expertise - Data scientists are hard to find because few people have the complete skill set. It's often better to build a TEAM whose combined skills equate to a data scientist.
Supervised Machine Learning
1. Labeled data 2. Direct feedback 3. Predict outcome/future - Based on input and output data. - Divides into CLASSIFICATION and REGRESSION.
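As a minimal sketch of the REGRESSION case, the snippet below fits a straight line to labeled (input, output) pairs using ordinary least squares. The data points and the function names `fit_line` and `predict` are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    slope, intercept = model
    return slope * x + intercept

# Labeled training data: each input x comes with a known output y.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

model = fit_line(xs, ys)
prediction = predict(model, 6)  # predict the outcome for an unseen input
```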
Unsupervised Machine Learning
1. No labels 2. No feedback 3. Find the hidden structure - Helps find previously unknown patterns in a data set without pre-existing labels. - Based on input data only. - Main task: CLUSTERING.
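CLUSTERING can be illustrated with a tiny one-dimensional k-means sketch. This is a simplified version under stated assumptions: the points are invented, `k` must be at least 2, and the starting centroids are simply spread between the minimum and maximum value.

```python
def kmeans_1d(points, k=2, iterations=10):
    """Group unlabeled numbers into k clusters (assumes k >= 2)."""
    lo, hi = min(points), max(points)
    # Assumption: start with centroids spread evenly between min and max.
    centroids = [lo + i * (hi - lo) / (k - 1) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: abs(p - centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, labels

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]  # no labels attached
centroids, labels = kmeans_1d(points)
```

No labels are supplied; the two groups emerge from the input data alone.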
3 Types of Machine Learning
1. Supervised 2. Unsupervised 3. Reinforcement
Kilobyte
1000 bytes
Petabyte
10^15 bytes, or 1 quadrillion bytes
Byte
8 bits
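The unit cards (bit, byte, kilobyte through petabyte) use decimal sizes. A small conversion sketch, with helper names chosen only for illustration:

```python
# Decimal storage units as defined on the cards.
UNITS = {
    "B": 1,
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
}

BITS_PER_BYTE = 8

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes."""
    return value * UNITS[unit]

def to_bits(value, unit):
    """Convert a value in the given unit to bits."""
    return to_bytes(value, unit) * BITS_PER_BYTE

def convert(value, src, dst):
    """Convert between two units, e.g. petabytes to terabytes."""
    return to_bytes(value, src) / UNITS[dst]
```

For example, `convert(10, "PB", "TB")` expresses the 10 PB figure from the NoSQL card in terabytes.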
Data Mart
A data warehouse that stores only data relevant to a division or subject.
Business Intelligence
A framework for decision support. - Includes architecture for the data management cycle. - Sometimes includes tools for mining, analysis, and interpretation.
Variety
In Big Data, the range of different kinds of data and sources: structured, semi-structured, and unstructured.
Algorithm
A process or set of rules to be followed during calculations and other problem solving operations. Ex. A GPS calculation of the fastest route
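The GPS example can be sketched with Dijkstra's shortest-path algorithm. This is one common choice for route finding, not necessarily what any real GPS product uses; the road map and travel times below are invented.

```python
import heapq

def fastest_route(graph, start, goal):
    """Return (total_minutes, path) for the quickest route, or None."""
    queue = [(0, start, [start])]  # (minutes so far, current node, path taken)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, minutes in graph[node].items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + minutes, neighbor, path + [neighbor]))
    return None

# Hypothetical road map: travel time in minutes between intersections.
roads = {
    "A": {"B": 4, "C": 2},
    "B": {"D": 5},
    "C": {"B": 1, "D": 8},
    "D": {},
}

result = fastest_route(roads, "A", "D")
```

Each step follows a fixed rule: always expand the cheapest route found so far.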
How is AI related to Big Data?
AI encompasses the powerful techniques that big data relies on, and big data in turn is used to train the most powerful machine learning techniques.
Descriptive Analytics
AKA: Reporting analytics or exploratory analytics. - Refers to knowing what is happening in an organization and understanding some underlying trends and causes of such occurrences. - VISUALIZATION IS THE KEY. Ex. (Business reporting, Dashboards, Trend Analysis, Scoreboards, DW) - What actually happened in the data? NO ASSUMPTIONS, JUST REALITY. What is happening?
Scalability
Capability of a distributed computing platform to handle a growing amount of work; its potential to be scaled up to accommodate more data.
Structured Data
Data that fits in tables: Excel spreadsheets, databases, RDBMS.
Unstructured Data
Data that cannot be fit into tables. Ex. Images, Text data
Noisy Data
Data with a large amount of additional meaningless information.
NewSQL
Designed to scale out and have ACID properties. - Used for OLTP & HTAP. - Designed to meet scalability requirements of architectures. - RELATIONAL - Handles structured data. - Atomic transactions
ELT vs ETL
ETL (Extraction, Transformation, and Loading) can be very time-consuming and costly; ELT (Extraction, Loading, and Transformation) is faster and more scalable. - More common for internal organizational data. - ELT is the better choice: - When there are big volumes of data. - When the source database and target database are the same. - When the database is well adapted for that kind of processing, such as NoSQL databases and parallel data warehouses.
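The difference in ordering can be sketched with Python's built-in `sqlite3` module. SQLite stands in for the target database purely for illustration; the table names, sample rows, and function names are assumptions.

```python
import sqlite3

# Hypothetical raw extract: amounts arrive as untidy strings.
raw_rows = [("alice", " 10 "), ("bob", "25"), ("alice", "5")]

def etl(rows):
    """ETL: transform in application code first, then load the clean rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, amount INTEGER)")
    cleaned = [(name, int(amount.strip())) for name, amount in rows]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)
    return conn

def elt(rows):
    """ELT: load the raw rows as-is, then transform inside the database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)
    conn.execute(
        "CREATE TABLE sales AS "
        "SELECT customer, CAST(TRIM(amount) AS INTEGER) AS amount "
        "FROM raw_sales"
    )
    return conn

def totals(conn):
    query = "SELECT customer, SUM(amount) FROM sales GROUP BY customer ORDER BY customer"
    return conn.execute(query).fetchall()

etl_totals = totals(etl(raw_rows))
elt_totals = totals(elt(raw_rows))
```

Both routes end with the same `sales` table; ELT simply pushes the transformation work onto the database engine, which is why it suits targets built for that kind of processing.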
Data Warehouse
For storing and querying data. It's a pool of data produced to support decision making. It is a repository of current and historical data of potential interest to managers throughout the organization.
Hadoop vs. Spark
Hadoop requires developers to hand code each and every operation whereas Spark is easy to program with RDD - Resilient Distributed Dataset. ... Hadoop is designed to handle batch processing efficiently whereas Spark is designed to handle real-time data efficiently. - Spark can be up to 100x faster than Hadoop but is more expensive ($$$). - Spark allows you to do data analysis in Jupyter notebooks. (Easier to use.) - Spark can run on HDFS but keeps working data in the memory of each individual computer instead of relying on the MapReduce paradigm.
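The map-then-reduce style that Hadoop is built around can be imitated in plain Python. This is only a single-machine sketch of the idea; it uses neither the Hadoop nor the Spark API, and the function names and sample lines are invented.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data is big", "data is everywhere"]
pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(shuffle(pairs))
```

In a real cluster each phase runs across many machines, with intermediate results written to disk between phases. Avoiding those writes by keeping data in memory is the main source of Spark's speed advantage.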
Semi-Structured Data
In between structured and unstructured data. Some of it has structure and can fit into a Relational Database Management System and the rest can't. Ex. Twitter data
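A short sketch of what "some fits, the rest doesn't" can look like in practice. The JSON records below are invented, tweet-like examples and not real Twitter API output; the column list is an assumption.

```python
import json

# Hypothetical tweet-like records: shared core fields plus varying extras.
raw = '''[
  {"id": 1, "user": "ana", "text": "hello", "hashtags": ["bigdata"]},
  {"id": 2, "user": "ben", "text": "hi", "geo": {"lat": 40.7, "lon": -74.0}}
]'''

COLUMNS = ("id", "user", "text")  # the part that fits a fixed table schema

def split_record(record):
    """Separate the tabular part of a record from its free-form extras."""
    row = tuple(record.get(col) for col in COLUMNS)
    extras = {k: v for k, v in record.items() if k not in COLUMNS}
    return row, extras

records = json.loads(raw)
rows, extras = zip(*(split_record(r) for r in records))
```

`rows` could be inserted into a relational table as-is, while `extras` has a different shape for every record.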
Facebook and Twitter use ____ _____ databases.
NoSQL
Data Lake
Repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data, including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics, or ML. - Hadoop can act as a data lake. - Data lakes are independent.
Prescriptive Analytics
Seeks to recognize what is going on as well as the likely forecast, and to make decisions that achieve the best performance possible. Ex: (Optimization, Simulation, Decision modeling, Expert Systems.) - Historically studied as OPERATIONS RESEARCH or MANAGEMENT SCIENCE and has generally been aimed at optimizing the performance of a system. - GOAL is to provide a decision or recommendation of an action.
SQL
Structured Query Language - Programming language for RELATIONAL database systems. - SCALES UP to 10TB. - Handles structured data. - Atomic transactions
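"Atomic transactions" means a group of statements either all take effect or none do. A minimal sketch using Python's built-in `sqlite3` module; the `accounts` table and the `transfer` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates succeed or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected an overdraft; the credit is undone too.
        return False

def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

ok = transfer(conn, "alice", "bob", 30)       # valid transfer
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
final = balances(conn)
```

After the failed transfer, bob's balance does not keep the credit that was applied before the error: the whole transaction is rolled back.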
Volume
In Big Data, the sheer amount of data that is generated and stored.
Business Analytics
The application of models to business data to support decision making
Noise
The meaningless information in noisy data.
Signal
The valuable information from noisy data.
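One simple way to pull signal out of noisy data is smoothing. The moving-average sketch below uses invented numbers, with the noise deliberately chosen to alternate so that the effect is easy to see; real noise is rarely this tidy.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

signal = [0, 1, 2, 3, 4, 5]     # the underlying trend (valuable information)
noise = [1, -1, 1, -1, 1, -1]   # meaningless fluctuation added on top
noisy = [s + n for s, n in zip(signal, noise)]

smoothed = moving_average(noisy, 2)
```

The smoothed series rises steadily by one unit per step, matching the underlying trend, whereas the noisy series jumps up and down.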
Predictive Analytics
Use of tools to predict future unknown events. - Uses techniques from data mining, statistics, modeling, machine learning, and AI to analyze current data to make such predictions.
4 V's of Big Data
Volume, Velocity, Variety, Veracity
Dirty Data
Erroneous or flawed data.
Velocity
In Big Data, the speed at which data is generated, collected, and processed.
Veracity
In Big Data, the trustworthiness and accuracy of the data.