IM Big Data


What is the CAP theorem?

states that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: Consistency, Availability, Partition Tolerance

What is a column store

systems that store data as columns of data rather than as rows of data

What is a key-value store

systems that store values and an index to find them based on a key
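
A minimal sketch of the idea in Python (the KVStore class and its method names are illustrative, not any particular product's API): values are reached only through their key, with no fixed schema for what a value contains.

```python
# Minimal illustration of the key-value idea: values are located
# only by their key, with no fixed schema for what the value holds.
class KVStore:
    def __init__(self):
        self._index = {}          # key -> value; the "index" is a hash map

    def put(self, key, value):
        self._index[key] = value  # insert or overwrite

    def get(self, key, default=None):
        return self._index.get(key, default)

    def delete(self, key):
        self._index.pop(key, None)

store = KVStore()
store.put("user:42", {"name": "Ada", "cart": ["sku-1", "sku-7"]})
print(store.get("user:42"))       # {'name': 'Ada', 'cart': ['sku-1', 'sku-7']}
```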

The concept of humanized big data reminds us that

the sources of most data points being analyzed are people.

a form of advanced analytics that uses both new and historical data to forecast future activity, behavior and trends.

Predictive analytics

Query vs Search

Queries may be faster, but they require precise questions to be asked. So, you can't query the internet, because it is not that structured.

are not designed to handle gigabytes or petabytes of unstructured data. You can't load volumes of photos, videos, tweets, articles and emails into a SQL Server or Oracle database and write SQL queries to run reports.

Relational Database Management Systems

NoSQL databases do not use

SQL as their primary language

How has the DB landscape changed?

Traditional DB storage vs Updated DB storage

Isolation means

Transactions are isolated from each other so they do not contend with each other

either does not have a pre-defined data model or is not organized in a pre-defined manner; can be text or non-text. Examples: Snapchats, tweets, mail messages, PowerPoint presentations, Word documents, text messages, JPEG images, MP3 audio files, video files.

Unstructured

means we need information about the database schema before we write. Writes must conform to the database structure, data types, validation rules, etc.

schema on write
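
A small illustration of schema on write using Python's built-in sqlite3 module (the table and column names are made up for the example): the schema must be declared before any write, and a write that breaks the declared rules is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Schema on write: the table structure, data types and validation rules
# must be declared before any data can be written.
conn.execute("""
    CREATE TABLE employees (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        hired DATE
    )
""")
conn.execute("INSERT INTO employees (id, name, hired) VALUES (?, ?, ?)",
             (1, "Ada", "2024-01-15"))     # conforms to the schema, accepted

try:
    # A write that violates the declared rules (name is NOT NULL) is rejected.
    conn.execute("INSERT INTO employees (id, hired) VALUES (?, ?)",
                 (2, "2024-02-01"))
except sqlite3.IntegrityError as e:
    print("rejected at write time:", e)
```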

Data warehouses and marts are ....., whereas data lakes are

schema on write; schema on read

What are the 6 advantages of moving to NOSQL

simplicity of design, better horizontal scaling, finer control over availability, the ability to easily capture all kinds of data ("big data"), speed, cost

Does OLAP or OLTP read data results out, often aggregated or summarized

OLAP

Think of NoSQL as writing data, then

applying structure after the fact

NoSQL is not one big homogeneous group of products, but they are ......

very different

Soft state means

the state of the system and data changes over time

The schema in a data warehouse is defined as schema on

write

What is big data?

a situation where the volume, velocity, variety and variability of data exceed typical storage for decision making. Includes data typically not previously captured, stored or analyzed.

Underlying semantic structure that is used to interpret data; a particular materialization of data in such a semantic structure

A cube

What is the difference between ACID consistency and BASE consistency

ACID consistency means that once data is written, you have full consistency in reads. BASE consistency (eventually consistent) means that once data is written, it will eventually appear for reading.

mean that once a transaction is complete, its data is consistent and stable on disk, which may involve multiple distinct memory locations.

ACID properties

What does ACID stand for?

Atomicity, Consistency, Isolation, Durability

What are the words for ACID

Atomicity, Consistency, Isolation, Durability

All operations in a transaction succeed, or they are all rolled back.

Atomicity

What is the difference between ACID and BASE

BASE properties are much looser than ACID

What does BASE stand for

Basically Available, Soft State, Eventual Consistency

refers to the dynamic, large and disparate volumes of data being created by people, tools and machines

Big Data

is a technology-driven process for analyzing data and presenting actionable information to help executives, managers and other corporate end users make informed business decisions.

Business Intelligence

In a relational or ACID database, systems store and query in tables. In NoSQL or BASE databases, they aggregate the following

Column Store, Document Store, Key-Value, Graph

is dealing with simultaneous sources, types, speeds

Complexity

Each operation moves the database from one consistent state to another

Consistency

Cube slicing - come up with a 2-D view of the data. Drill down - go from summary to more detailed views.

Cube Slicing, Drill Down

is a storage repository that holds a vast amount of raw data in its native format until it is needed.

Data Lake

handle large volumes of structured data exceptionally well: lists of employees, sales, transactions and the like. They feed countless business intelligence and enterprise reporting applications.

Data Warehouse

is a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes

Data Warehouse

What are the 4 key modules of HADOOP

Hadoop Distributed File System (HDFS), MapReduce, Hadoop Common, YARN

The database will not lose your data once the transaction reports success.

Durability

What is the process in building a data warehouse?

ETL: Extract (capture), Transform (scrub/data cleansing), Load (load and index)
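
A toy ETL pass in Python, assuming a hypothetical sales_export.csv with sale_id, region and amount columns and a SQLite file standing in for the warehouse; it only sketches the extract/transform/load flow.

```python
import csv, sqlite3

# Extract: capture raw rows from an operational export (hypothetical file).
with open("sales_export.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Transform: scrub/cleanse - skip rows missing an amount, normalize types.
clean_rows = []
for row in raw_rows:
    if not row.get("amount"):
        continue                              # errant record, skipped here
    clean_rows.append((row["sale_id"],
                       row["region"].strip().upper(),
                       float(row["amount"])))

# Load: write into the warehouse table and index it for reporting queries.
dw = sqlite3.connect("warehouse.db")
dw.execute("CREATE TABLE IF NOT EXISTS sales (sale_id TEXT, region TEXT, amount REAL)")
dw.executemany("INSERT INTO sales VALUES (?, ?, ?)", clean_rows)
dw.execute("CREATE INDEX IF NOT EXISTS idx_sales_region ON sales (region)")
dw.commit()
```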

What is the goal and role of a data warehouse?

Goal - to increase the value of the organization's data asset. Role - to store extracts from operational data and make those extracts available to users in a useful format.

is a goal to generate processed data that can help others, including non-data scientists, derive clear answers and insights that can be used as a basis for strategic decision making across the organization.

Humanized Big Data

One operation in-process does not affect the others.

Isolation

Logical vs Physical View

Logical view is how the data is presented to the end user. Physical view is how and where the information actually resides.

Steps of Map Reduce vs Steps of SQL

MapReduce: Input, Splitting, Mapping, Shuffling, Reducing. SQL: SELECT FROM, JOIN, WHERE, GROUP BY, HAVING.
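
For the SQL side, a runnable sketch using Python's sqlite3 (the tables and values are invented for the example) that exercises SELECT/FROM, JOIN, WHERE, GROUP BY and HAVING in one query; a matching MapReduce-style sketch appears with the Hadoop MapReduce card further down.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders    (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (customer_id INTEGER, region TEXT);
    INSERT INTO customers VALUES (1, 'EAST'), (2, 'WEST');
    INSERT INTO orders VALUES (10, 1, 120.0), (11, 1, 30.0), (12, 2, 500.0);
""")

# The clauses line up with the steps named above:
# SELECT ... FROM (input), JOIN, WHERE (filter rows),
# GROUP BY (group), HAVING (filter groups).
query = """
    SELECT c.region, SUM(o.amount) AS total
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    WHERE o.amount > 10
    GROUP BY c.region
    HAVING SUM(o.amount) > 100
"""
for region, total in conn.execute(query):
    print(region, total)    # EAST 150.0, WEST 500.0
```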

What are the challenges of using Hadoop?

MapReduce programming is not a good match for all problems; there is a widely acknowledged talent gap; data security; full-fledged data management and security

A set of graphical tools that provides users with multidimensional views of their data and allows analysis of data using simple windowing techniques

OLAP

Unlike OLTP, OLAP systems work with very large amounts of data. An OLAP system analyzes data effectively and efficiently and is characterized by a relatively low volume of transactions used to perform analyses. Queries are often very complex and involve aggregations.

OLAP (online analytical processing)

systems are "classical" systems that process data transactions

OLTP (online transactional processing)

allows you to record data without dictating (database structure, data types, validation) rules. Then when the data is read, you apply data processing rules via code.

Schema on read
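
A small schema-on-read sketch in Python: the raw lines below were stored with no enforced structure, and the types, defaults and validation are applied only when the data is read (the field names are illustrative).

```python
import json

# Raw events were written as-is, with no schema enforced at write time.
raw_lines = [
    '{"user": "ada", "action": "click", "ts": "2024-05-01T10:00:00"}',
    '{"user": "bob", "action": "purchase", "amount": "19.99"}',
    'not even valid JSON',
]

# Schema on read: structure, types and validation are applied by code
# only when the data is read for analysis.
events = []
for line in raw_lines:
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        continue                                 # malformed input handled at read time
    events.append({
        "user":   record.get("user", "unknown"),
        "action": record.get("action", "unknown"),
        "amount": float(record.get("amount", 0)),  # type rule applied on read
    })

print(events)
```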

has information associated with it, such as metadata and tags. (e.g. JSON (JavaScript Object Notation) data, graphs)

Semi Structured

High degree of organization - the traditional database model is based on a structure that assumes data will go into tables, so data is structured to fit that model - searchable by simple, straightforward search engine algorithms or other search operations.

Structured

What are the three categories of data

Structured, Unstructured, Semi Structured

What is a document store

Systems that store documents, providing index and simple query mechanisms

What is a graph store

Systems that model data as graphs, where nodes represent content
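
A toy illustration in plain Python of the graph data shape (the node and edge names are invented); real graph stores add indexing and a query language on top of this idea.

```python
# Nodes carry content; edges carry the relationships between nodes.
nodes = {
    "p1": {"type": "person",  "name": "Ada"},
    "p2": {"type": "person",  "name": "Bob"},
    "a1": {"type": "article", "title": "Intro to NoSQL"},
}
edges = [
    ("p1", "follows", "p2"),
    ("p1", "wrote",   "a1"),
    ("p2", "liked",   "a1"),
]

# Traverse the graph: who interacted with article a1?
for src, rel, dst in edges:
    if dst == "a1":
        print(nodes[src]["name"], rel, nodes[dst]["title"])
```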

Durability means

The database will not lose your data once the transaction reports success

Differences in types/sources of data - (structured, unstructured, and semi-structured).

Variety

Speed at which new data arrives. Data is being generated extremely fast —now more streaming data projects which allows the potential of near real time analysis

Velocity

Differences in data accuracy/quality. Big data is sourced from many different places; as a result, you need to test the veracity of the data: the truth, the quality and the usefulness of the data.

Veracity

More data- The amount of data being created is vast compared to traditional data sources. More of our own data (archive, junk, log files), added free or public data, premium service data.

Volume

What is a NoSQL database

a broad class of database management systems that differ from relational database management systems

Why is Hadoop important?

ability to store and process huge amounts of any kind of data quickly, computing power, fault tolerance, flexibility, low cost, scalability

Atomicity means

all or nothing: all operations in a transaction succeed, or every operation is rolled back
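
A quick demonstration of the all-or-nothing behavior with Python's sqlite3, where the connection's context manager rolls the transaction back when an error occurs (the accounts and the simulated failure are invented for the example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

try:
    with conn:  # transaction: commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - 80 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 80 WHERE name = 'zoe'")
        raise ValueError("transfer target does not exist")  # simulate a failure
except ValueError:
    pass

# Neither operation took effect: all or nothing.
print(dict(conn.execute("SELECT name, balance FROM accounts")))  # {'alice': 100, 'bob': 50}
```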

What is the Hadoop Distributed File System

allows data to be stored in an easily accessible format

What is Hadoop Common?

provides the tools needed for the computer systems to read data

What do consistency, availability, and partition tolerance mean

Consistency - after an update by some writer, all readers see that update in some shared source. Availability - a system is designed and implemented in a way that allows it to continue operation. Partition tolerance - the cluster still functions if two nodes break communication.

What are the two OLAP operations

cube slicing, drill down

Map reduce?

reads data from the database, puts it in a suitable format, and performs mathematical operations

What is the OLAP process

data -> information -> knowledge

A limited scope data warehouse is a

data mart

What are the three big data concerns

data privacy, data security, discrimination based on what is learned by capturing big data

Variability is the

difference in flows of data

Consistency means

each operation moves the database from one consistent state to another

ETL processes kick out ...

error reports, generate logs, and send errant records to files to be addressed at later dates

NoSQL databases do not have a .... and do not use

fixed schema; join operations

Eventual Consistency means

given enough time, data will be consistent across the distributed system.

Data warehouses store vast amounts of structured data in .....

highly regimented ways

BASE refers to

schema on read

Relational Databases use

schema on write

Why is a data mart easier to manage?

it has a much smaller domain, typically limited to: a particular type of input data, a particular business function, a particular business unit or geographic area

What is YARN

manages the resources of the systems storing data and running the analysis

What is Hadoop MapReduce?

Map - generates key-value pairs. Shuffle - takes the output and sorts it by key. Reduce - reduces the list into a small number of atomic values ready for further processing.
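
A minimal in-memory word count in Python that mirrors the map, shuffle and reduce phases described above; it is a sketch of the pattern only, not Hadoop's actual API.

```python
from itertools import groupby
from operator import itemgetter

lines = ["big data big ideas", "data beats opinions"]

# Map: emit (key, value) pairs - here (word, 1) for every word in the input.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: sort and group the mapped output by key so each reducer
# sees all the values for one key.
shuffled = groupby(sorted(mapped, key=itemgetter(0)), key=itemgetter(0))

# Reduce: collapse each key's list of values into a small atomic result.
counts = {word: sum(v for _, v in pairs) for word, pairs in shuffled}
print(counts)   # {'beats': 1, 'big': 2, 'data': 2, 'ideas': 1, 'opinions': 1}
```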

What does basically available mean

nodes in the distributed environment can go down, but the whole system shouldn't be affected, so it works most of the time

What is HADOOP

not a database; it is a distributed file structure that allows massively parallel computing

Traditional DB storage physical decisions- logical options-

Physical decisions: on-premise or hosted. Logical options: relational DBMS with OLTP and OLAP.

Updated DB Storage physical decisions- logical options-

Physical decisions: on-premise, cloud or hybrid. Logical options: relational DBMS, NoSQL, Hadoop, file system.

Because of the rigidity and the ways in which they work, data warehouses can support ..... ETL meaning that they can

perform a partial load, or reload portions of the data warehouse

Organizations populate data warehouses ...... and data refreshes in ......, for example at 3 am when not a lot of employees are working with the systems

periodically; in regular cycles

