Big Data I Final

Machine Learning

*Field of study that gives computers the ability to learn without being explicitly programmed.*
- Allows computers to find hidden insights without being explicitly programmed where to look.
- AI that learns predictive models from data.
- A method of data analysis that automates analytical model building.

NOSQL

- NON-RELATIONAL.
- Designed to meet the scalability requirements of distributed architectures. Most scalable: up to 10PB.
- Used for analysis; HANDLES big data structures.
- New breed of non-relational database products.
- Rejects fixed table schemas and join operations.
- Meets schema-less data management requirements.
- Handles structured, semi-structured, and unstructured data.

Bit

0 or 1

Gigabyte

1 billion bytes

Megabyte

1 million bytes

Terabyte

1 trillion bytes

Reinforcement Machine Learning

1. Decision process
2. Reward system
3. Learn a series of actions
- Area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward.
- Learns from the environment and experience, NOT from DATA.
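As a rough illustration of the decision process, reward system, and learned series of actions described on this card, here is a minimal tabular Q-learning sketch. The corridor environment, the reward of 1 at the goal, and all names and parameter values (`step`, `q_learning`, `alpha`, `gamma`) are assumptions made up for this example, not part of the course material.

```python
import random

N_STATES = 5          # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = (-1, +1)    # step left or step right

def step(state, action):
    """Environment: move along the corridor; reward 1 only on reaching the goal."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def q_learning(episodes=300, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = rng.choice(ACTIONS)  # explore at random; Q-learning is off-policy
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            # Move the estimate toward reward + discounted future value.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
    return q

q = q_learning()
# Greedy policy: the best-valued action in each non-goal state.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

The agent is never shown labeled examples; it only tries actions, observes rewards, and gradually prefers the actions that lead toward the goal.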

Drivers of Big Data

1. Digitization of society 2. Lowering of technology costs 3. Connectivity through cloud computing 4. Social media applications 5. Upcoming Internet of Things

Data Science encompasses what 3 skills?

1. Hacking skills
2. Math and stats skills
3. Domain expertise
- Data scientists are hard to find because few people have the complete skill set. It is often better to build a TEAM whose combined skills equal those of a data scientist.

Supervised Machine Learning

1. Labeled data
2. Direct feedback
3. Predict outcome/future
- Based on input and output data.
- Main tasks: CLASSIFICATION and REGRESSION.
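A small sketch of both supervised tasks, using only plain Python: a least-squares line for regression and a 1-nearest-neighbor rule for classification. The data values and the function names (`fit_line`, `predict_1nn`) are invented for illustration; real projects would normally use a library.

```python
def fit_line(xs, ys):
    """Regression: least-squares slope and intercept from labeled (x, y) pairs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def predict_1nn(labeled, x):
    """Classification: return the label of the closest labeled example."""
    nearest = min(labeled, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

# Hypothetical labeled data: inputs paired with known outputs.
slope, intercept = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
predicted_y = slope * 5 + intercept

labeled_sizes = [(1.0, "small"), (1.5, "small"), (9.0, "large"), (10.0, "large")]
predicted_label = predict_1nn(labeled_sizes, 8.2)
print(predicted_y, predicted_label)
```

In both cases the model learns from examples where the correct output is already known, which is what "labeled data" and "direct feedback" refer to.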

Unsupervised Machine Learning

1. No labels
2. No feedback
3. Find the hidden structure
- Helps find previously unknown patterns in a data set without pre-existing labels.
- Based on input data only.
- Main task: CLUSTERING.
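To make "find the hidden structure" concrete, here is a minimal one-dimensional k-means sketch in plain Python. The data points, the function name `kmeans_1d`, and the simple deterministic initialization are assumptions chosen for this example; it assumes `k` is at least 2.

```python
def kmeans_1d(points, k=2, iterations=10):
    """Group unlabeled numbers into k clusters by repeatedly averaging."""
    ordered = sorted(points)
    # Deterministic start: spread the initial centers across the sorted data.
    centers = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its cluster (keep it if empty).
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled measurements: no labels, no feedback.
centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 10.1])
print(centers, clusters)
```

No label is ever supplied; the two groups emerge only from how close the inputs are to one another.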

3 Types of Machine Learning

1. Supervised 2. Unsupervised 3. Reinforcement

Kilobyte

1000 bytes

Petabyte

10^15 bytes (1 quadrillion bytes)

Byte

8 bits
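The storage units in this set can be related with a short conversion sketch. It uses the decimal sizes given on these cards (1 kilobyte = 1000 bytes, and so on); the names `UNITS`, `to_bytes`, and `to_bits` are made up for this example.

```python
# Decimal (SI) unit sizes in bytes, matching the definitions on these cards.
UNITS = {
    "B": 1,
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
}
BITS_PER_BYTE = 8

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes."""
    return value * UNITS[unit]

def to_bits(value, unit):
    """Convert a value in the given unit to bits."""
    return to_bytes(value, unit) * BITS_PER_BYTE

print(to_bytes(10, "TB"))                      # bytes in 10 terabytes
print(to_bytes(10, "PB") // to_bytes(1, "TB")) # terabytes in 10 petabytes
```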

Data Mart

A data warehouse that stores only data relevant to a division or subject.

Business Intelligence

A framework for decision support.
- Includes architecture for the data management cycle.
- Sometimes includes tools for mining, analysis, and interpretation.

Variety

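General sense: a number of different kinds; assortment. In Big Data: the range of data types and sources (structured, semi-structured, and unstructured).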

Algorithm

A process or set of rules to be followed in calculations and other problem-solving operations. Ex. A GPS calculating the fastest route.
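The GPS example can be sketched with Dijkstra's shortest-path algorithm, one well-known way to compute a fastest route. Whether a given GPS product uses this exact method is an assumption; the road network, travel times, and the name `fastest_route` are invented for illustration.

```python
import heapq

def fastest_route(graph, start, goal):
    """Dijkstra's algorithm: return (total_minutes, path) or None if unreachable."""
    queue = [(0, start, [start])]   # (minutes so far, current node, path taken)
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)  # always expand the cheapest route first
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, minutes in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + minutes, neighbor, path + [neighbor]))
    return None

# Hypothetical road network: travel times in minutes between places.
roads = {
    "home": [("A", 5), ("B", 2)],
    "A": [("office", 4)],
    "B": [("A", 1), ("office", 8)],
}
route = fastest_route(roads, "home", "office")
print(route)
```

The fixed set of rules (pick the cheapest unexplored route, extend it, repeat) is what makes this an algorithm rather than a guess.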

How is AI related to Big Data?

AI encompasses the powerful techniques used to analyze big data, and big data in turn is used to train the most powerful techniques in machine learning.

Descriptive Analytics

AKA: Reporting analytics or exploratory analytics.
- Refers to knowing what is happening in an organization and understanding some underlying trends and causes of such occurrences.
- VISUALIZATION IS THE KEY.
- Ex. Business reporting, dashboards, trend analysis, scorecards, data warehousing (DW).
- What actually happened in the data? NO ASSUMPTIONS, JUST REALITY. What is happening?

Scalability

The capability of a distributed computing platform to handle a growing amount of work; its potential to be scaled up to accommodate more data.

Structured Data

Data that fits in tables: Excel spreadsheets, databases, RDBMS.

Unstructured Data

Data that cannot be fit into tables. Ex. Images, Text data

Noisy Data

Data with a large amount of additional meaningless information.

NewSQL

Designed to scale out and have ACID properties.
- Used for OLTP & HTAP.
- Designed to meet the scalability requirements of distributed architectures.
- RELATIONAL.
- Handles structured data.
- Atomic transactions.

ELT vs ETL

ETL (Extraction, Transformation, and Loading) can be very time consuming and costly; ELT (Extraction, Loading, and Transformation) is faster and more scalable.
- More common for internal organizational data.
- ELT is the better choice:
  - When there are big volumes of data.
  - When the source database and target database are the same.
  - When the database is well adapted for that kind of processing, such as NoSQL databases and parallel data warehouses.
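The difference in ordering can be sketched with Python's built-in `sqlite3` module: in ETL the data is cleaned in application code before loading, while in ELT the raw data is loaded first and transformed inside the database with SQL. The sample rows, table names, and the `clean` helper are hypothetical, and SQLite stands in for a real warehouse.

```python
import sqlite3

# Hypothetical raw source data: amounts arrive as messy text.
raw_rows = [("alice", " 10.50 "), ("bob", "7"), ("carol", "n/a")]

def clean(amount_text):
    """Transform step: parse an amount, or return None if it is not a number."""
    try:
        return float(amount_text.strip())
    except ValueError:
        return None

# ETL: Extract -> Transform (in Python) -> Load only the cleaned rows.
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
transformed = [(name, clean(text)) for name, text in raw_rows if clean(text) is not None]
etl_db.executemany("INSERT INTO sales VALUES (?, ?)", transformed)

# ELT: Extract -> Load raw text as-is -> Transform inside the database with SQL.
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE raw_sales (customer TEXT, amount_text TEXT)")
elt_db.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_rows)
elt_db.execute("""
    CREATE TABLE sales AS
    SELECT customer, CAST(TRIM(amount_text) AS REAL) AS amount
    FROM raw_sales
    WHERE TRIM(amount_text) GLOB '[0-9]*'
""")

query = "SELECT customer, amount FROM sales ORDER BY customer"
etl_rows = etl_db.execute(query).fetchall()
elt_rows = elt_db.execute(query).fetchall()
print(etl_rows, elt_rows)
```

Both paths end with the same clean table; ELT simply pushes the transformation work onto the target database, which is why it suits databases built for that kind of processing.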

Data Warehouse

For storing and querying data. It's a pool of data produced to support decision making. It is a repository of current and historical data of potential interest to managers throughout the organization.

Hadoop vs. Spark

Hadoop requires developers to hand code each and every operation, whereas Spark is easy to program with RDDs (Resilient Distributed Datasets). ... Hadoop is designed to handle batch processing efficiently, whereas Spark is designed to handle real-time data efficiently.
- Spark can be up to 100x faster than Hadoop, but it is also more expensive ($$$).
- Spark allows you to do data analysis in Jupyter notebooks. (Easier to use.)
- Spark programs run on HDFS but load the data into the memory of each individual computer instead of using the MapReduce paradigm.
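For context on the MapReduce paradigm that Hadoop is built around, here is a plain-Python sketch of a word count in map, shuffle, and reduce steps. It runs on one machine and does not use the Hadoop or Spark APIs; the function names and sample lines are made up for this illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the grouped values for one key."""
    return key, sum(values)

# Hypothetical input; on a cluster each line could live on a different node.
lines = ["big data is big", "data is everywhere"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Writing every job as explicit map and reduce steps is the "hand coding" the card refers to; Spark expresses the same work as operations on in-memory datasets.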

Semi-Structured Data

In between structured and unstructured data. Part of it has structure and can fit into a Relational Database Management System; the rest can't. Ex. Twitter data
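A sketch of what "part of it has structure" can look like in practice: JSON records share a few fixed fields that fit table columns, while the remaining fields vary from record to record. The field names below are invented for illustration and are not Twitter's actual schema.

```python
import json

# Hypothetical tweet-like records: same core fields, different optional ones.
tweets_json = """[
  {"id": 1, "user": "ana", "text": "Hello", "hashtags": ["bigdata"]},
  {"id": 2, "user": "ben", "text": "Hi", "geo": {"lat": 40.7, "lon": -74.0}}
]"""

FIXED_COLUMNS = ("id", "user", "text")

def split_record(record):
    """Separate the table-friendly fields from the irregular leftovers."""
    row = tuple(record.get(col) for col in FIXED_COLUMNS)
    extras = {k: v for k, v in record.items() if k not in FIXED_COLUMNS}
    return row, extras

records = json.loads(tweets_json)
rows, extras = zip(*(split_record(r) for r in records))
print(rows)    # fits a relational table
print(extras)  # varies per record, so it does not fit fixed columns
```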

Facebook and Twitter use ____ _____ databases.

NoSQL

Data Lake

Repository of data stored in its natural format, usually object blobs or files. A data lake is usually a single store of all enterprise data, including raw copies of source system data and transformed data used for tasks such as reporting, visualization, analytics, or ML.
- Hadoop can act as a data lake.
- Data lakes are independent.

Prescriptive Analytics

Seeks to recognize what is going on as well as the likely forecast, and to make decisions that achieve the best performance possible.
- Ex. Optimization, simulation, decision modeling, expert systems.
- Historically studied as OPERATIONS RESEARCH or MANAGEMENT SCIENCE, and generally aimed at optimizing the performance of a system.
- GOAL is to provide a decision or a recommendation for an action.

SQL

Structured Query Language
- Programming language for RELATIONAL database systems.
- SCALES UP to 10TB.
- Handles structured data.
- Atomic transactions.
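"Atomic transactions" means that a group of SQL statements either all take effect or none do. A minimal sketch using Python's built-in `sqlite3` module follows; the `accounts` table, the balances, and the `transfer` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates succeed or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint failed, so the credit was undone too

ok_small = transfer(conn, "alice", "bob", 30)
ok_large = transfer(conn, "alice", "bob", 500)   # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok_small, ok_large, balances)
```

In the failed transfer the credit to the destination account runs first, yet it is rolled back when the debit violates the constraint, which is the all-or-nothing behavior the card describes.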

Volume

General sense: the amount of space an object takes up. In Big Data: the amount of data that is generated and stored.

Business Analytics

The application of models to business data to support decision making

Noise

The meaningless information in noisy data.

Signal

The valuable information from noisy data.

Predictive Analytics

Use of tools to predict future unknown events.
- Uses techniques from data mining, statistics, modeling, machine learning, and AI to analyze current data to make such predictions.

4 V's of Big Data

Volume, Velocity, Variety, Veracity

Dirty Data

Erroneous or flawed data.

Velocity

General sense: the speed and direction of a moving object. In Big Data: the speed at which data is generated, collected, and processed.

Veracity

General sense: truthfulness, honesty. In Big Data: how accurate and trustworthy the data is.

