IM Big Data
What is the CAP theorem?
States that it is impossible for a distributed computer system to simultaneously provide all three of the following guarantees: consistency, availability, and partition tolerance
What is a column store
systems that store data as columns of data rather than as rows of data
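A minimal sketch of the row versus column layout, assuming a tiny in-memory data set. The field names (`id`, `region`, `sales`) and the helper `to_columns` are hypothetical and only illustrate the idea; real column stores add compression and on-disk formats that are not shown here.

```python
# Hypothetical sample records stored row by row.
rows = [
    {"id": 1, "region": "east", "sales": 100},
    {"id": 2, "region": "west", "sales": 250},
    {"id": 3, "region": "east", "sales": 75},
]

def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# An aggregate over one attribute reads only that column's list,
# instead of touching every field of every row.
total_sales = sum(columns["sales"])
print(total_sales)
```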
What is key value store
systems that store values and an index to find them based on a key
The concept of humanized big data reminds us that
the sources of most data points being analyzed are people.
a form of advanced analytics that uses both new and historical data to forecast future activity, behavior and trends.
Predictive analytics
Query vs Search
A query may be faster, but it requires precise questions to be asked. You can't query the internet, for example, because it is not structured enough.
are not designed to handle gigabytes or petabytes of unstructured data. You can't load volumes of photos, videos, tweets, articles and emails into a SQL Server or Oracle database and then write SQL statements to query them or run reports.
Relational Database Management Systems
NoSQL databases do not use
SQL as their primary language
How has DB landscape changed?
Traditional DB storage vs Updated DB storage
Isolation means
Transactions are isolated from each other so they do not contend with one another
either does not have a pre-defined data model or is not organized in a pre-defined manner; can be text or non-text. Examples: Snapchat messages, tweets, mail messages, PowerPoint presentations, Word documents, text messages, JPEG images, MP3 audio files, video files.
Unstructured
meaning we need information about the database schema before we write. Writes must adapt to database structure, data types, validation rules, etc.
schema on write
Data warehouses and marts are ..... whereas data lakes are .....
schema on write; schema on read
What are the 6 advantages of moving to NOSQL
Simplicity of design; better horizontal scaling; finer control over availability; easy capture of all kinds of data ("big data"); speed; cost
Does OLAP or OLTP read data results out, often aggregated or summarized?
OLAP
Think of NoSQL as writing data, then
applying structure after the fact
NoSQL is not one big homogeneous group of products, but they are ......
very different
Soft state means
the state of the system and its data changes over time
The schema in a data warehouse is defined as schema on
write
What is big data?
A situation where the volume, velocity, variety and variability of data exceed what typical storage can handle for decision making. Includes data typically not previously captured, stored or analyzed.
The underlying semantic structure that is used to interpret data; also, a particular materialization of data in such a semantic structure
A cube
What is the difference between ACID consistency and BASE consistency?
ACID consistency means that once data is written, you have full consistency in reads. BASE consistency (eventually consistent) means that once data is written, it will eventually appear for reading.
mean that once a transaction is complete, its data is consistent and stable on disk, which may involve multiple distinct memory locations.
ACID properties
What does ACID stand for?
Atomicity, Consistency, Isolation, Durability
What are the words for ACID
Atomicity, Consistency, Isolation, Durability
All operations in a transaction succeed, or they are all rolled back.
Atomicity
What is the difference between ACID and BASE
BASE properties are much looser than ACID
What does BASE stand for
Basically Available, Soft state, Eventual consistency
refers to the dynamic, large and disparate volumes of data being created by people, tools and machines
Big Data
is a technology-driven process for analyzing data and presenting actionable information to help executives, managers and other corporate end users make informed business decisions.
Business Intelligence
In a relational or ACID database, systems store and query data in tables. In NoSQL or BASE databases, they aggregate data in the following store types
Column store, document store, key-value store, graph store
is dealing with simultaneous sources, types, speeds
Complexity
Each operation moves the database from one consistent state to another
Consistency
- come up with 2-D view of data - going from summary to more detailed views
Cube Slicing Drill Down
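A rough sketch of the two operations on a toy cube, assuming hypothetical fact rows with the dimensions year, region and product. The helpers `slice_cube` and `totals_by` are illustrative names, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: (year, region, product, sales)
facts = [
    (2023, "east", "widget", 10),
    (2023, "west", "widget", 20),
    (2023, "east", "gadget", 5),
    (2024, "east", "widget", 7),
]

def slice_cube(rows, year):
    """Cube slicing: fix one dimension (year) to get a 2-D region x product view."""
    view = defaultdict(int)
    for row_year, region, product, sales in rows:
        if row_year == year:
            view[(region, product)] += sales
    return dict(view)

def totals_by(rows, *dims):
    """Sum sales grouped by dimension positions (0=year, 1=region, 2=product)."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row[3]
    return dict(totals)

slice_2023 = slice_cube(facts, 2023)   # 2-D view for one year
summary = totals_by(facts, 0)          # summary level: sales per year
detail = totals_by(facts, 0, 1)        # drill down: sales per year and region
```

Drilling down here simply means adding a dimension to the grouping, so each summary total is broken into more detailed rows.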
is a storage repository that holds a vast amount of raw data in its native format until it is needed.
Data Lake
handle large volumes of structured data exceptionally well: lists of employees, sales, transactions and the like. They feed countless business intelligence and enterprise reporting applications.
Data Warehouse
is a subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
Data Warehouse
What are the 4 key modules of HADOOP
Hadoop Distributed File System (HDFS), MapReduce, Hadoop Common, YARN
The database will not lose your data once the transaction reports success.
Durability
What is the process in building a data warehouse?
ETL: Extract (capture), Transform (scrub/data cleansing), Load (load and index)
What is the goal and role of a data warehouse?
Goal: to increase the value of the organization's data asset. Role: to store extracts from operational data and make those extracts available to users in a useful format.
is a goal to generate processed data that can help others, including non-data scientists, derive clear answers and insights that can be used as a basis for strategic decision making across the organization.
Humanized Big Data
One operation in-process does not affect the others.
Isolation
Logical vs Physical View
The logical view is how the information is presented to the end user. The physical view is how and where the information actually resides.
Steps of Map Reduce vs Steps of SQL
MapReduce: Input, Splitting, Mapping, Shuffling, Reducing. SQL: Select From, Join, Where, Group By, Having.
What are the challenges of using Hadoop?
MapReduce programming is not a good match for all problems; there is a widely acknowledged talent gap; data security; full-fledged data management and governance
A set of graphical tools that provides users with multidimensional views of their data and allows analysis of data using simple windowing techniques
OLAP
Unlike OLTP systems, these systems work with very large amounts of data. They analyze data effectively and efficiently and are characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations.
OLAP (online analytical processing)
systems are "classical" systems that process data transactions
OLTP (online transactional processing)
allows you to record data without dictating (database structure, data types, validation) rules. Then when the data is read, you apply data processing rules via code.
Schema on read
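A small sketch of schema on read, assuming raw JSON lines that were stored without any validation. The field names and the helper `read_with_schema` are hypothetical; the point is only that types and defaults are applied by code when the data is read, not when it is written.

```python
import json

# Raw lines are stored as-is; nothing is validated at write time.
raw_lines = [
    '{"user": "ana", "age": "34"}',
    '{"user": "bo"}',
    '{"user": "cy", "age": 51, "city": "Oslo"}',
]

def read_with_schema(line):
    """Apply structure at read time: pick fields, coerce types, default missing values."""
    record = json.loads(line)
    age = record.get("age")
    return {
        "user": str(record["user"]),
        "age": int(age) if age is not None else None,
    }

parsed = [read_with_schema(line) for line in raw_lines]
print(parsed)
```

With schema on write, the second and third lines would have been rejected or reshaped before storage; here they are accepted and interpreted later.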
has information associated with it, such as metadata and tags. (e.g. JSON (JavaScript Object Notation) data, graphs)
Semi Structured
high degree of organization- traditional database model is based on structure that assumes data that will go into tables, so data is structured to fit that model - searchable by simple, straightforward search engine algorithms or other search operations.
Structured
What are the three categories of data
Structured, unstructured, semi-structured
What is a document store
Systems that store documents, providing index and simple query mechanisms
What is a graph store
Systems that model and store data as graphs, where nodes represent content and edges represent the relationships between them
Durability means
The database will not lose your data once the transaction reports success
Differences in types/sources of data - (structured, unstructured, and semi-structured).
Variety
Speed at which new data arrives. Data is being generated extremely fast, and there are now more streaming data projects, which allow for near-real-time analysis.
Velocity
Differences in data accuracy/quality. Big data is sourced from many different places; as a result, you need to test the truthfulness, quality and usefulness of the data.
Veracity
More data- The amount of data being created is vast compared to traditional data sources. More of our own data (archive, junk, log files), added free or public data, premium service data.
Volume
What is a NoSQL database
a broad class of database management systems that differ from relational database management systems
Why is hadoop important?
Ability to store and process huge amounts of any kind of data quickly; computing power; fault tolerance; flexibility; low cost; scalability
Atomicity means
all or nothing: all operations in a transaction succeed, or every operation is rolled back
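A minimal sketch of all-or-nothing behavior using Python's built-in `sqlite3` module. The `accounts` table, the `CHECK` constraint and the `transfer` helper are assumptions made up for this illustration; the connection is used as a context manager, which commits on success and rolls back if an exception is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit would make the balance negative, so the credit
        # that already ran is rolled back as well.
        return False

ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, balances)
```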
What is the hadoop distributed file system
allows data to be stored in an easily accessible format
what is hadoop common?
provides the tools needed for the computer system to read data
What do consistency, availability, and partition tolerance mean?
Consistency: after an update by some writer, all readers see that update in the shared data source. Availability: the system is designed and implemented in a way that allows it to continue operating. Partition tolerance: the cluster still functions if two nodes lose communication with each other.
What are the two OLAP operations
cube slicing drill down
Map reduce?
reads data from the database, puts it into a suitable format, and performs mathematical operations on it
What is the OLAP process
data information knowledge
A limited scope data warehouse is a
data mart
What are the three big data concerns
data privacy; data security; discrimination based on what is learned by capturing big data
Variability is the
difference in flows of data
Consistency means
each operation moves the database from one consistent state to another
ETL process kick out ...
error reports, generate logs, and send errant records to files to be addressed at a later date
NoSQL databases do not have a ..... and do not use .....
fixed schema; join operations
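A hedged sketch of a tiny ETL run that sets errant records aside instead of loading them. The sample rows, the function names `extract`, `transform` and `load`, and the dict standing in for the warehouse are all hypothetical.

```python
def extract():
    """Capture raw rows from a (hypothetical) operational source."""
    return [
        {"id": "1", "amount": " 19.99 "},
        {"id": "2", "amount": "n/a"},
        {"id": "3", "amount": "5"},
    ]

def transform(rows):
    """Scrub rows; collect the ones that fail cleansing for later review."""
    clean, errors = [], []
    for row in rows:
        try:
            clean.append(
                {"id": int(row["id"]), "amount": float(row["amount"].strip())}
            )
        except ValueError as exc:
            errors.append({"row": row, "reason": str(exc)})
    return clean, errors

def load(clean_rows, warehouse):
    """Load cleansed rows, indexed by id."""
    for row in clean_rows:
        warehouse[row["id"]] = row

warehouse = {}
clean, errors = transform(extract())
load(clean, warehouse)
print(sorted(warehouse), len(errors))
```

In a real pipeline the `errors` list would be written to a file or log so the errant records can be addressed later.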
Eventual Consistancy means
given enough time, data will be consistent across the distributed system.
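A toy simulation of eventual consistency, assuming a write lands on one replica first and is queued for the others. The classes `Replica` and `EventuallyConsistentStore` are invented for this sketch and do not model any particular NoSQL product.

```python
class Replica:
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    """Toy model: a write hits replica 0 and is queued for the other replicas."""

    def __init__(self, replica_count):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []  # (replica_index, key, value) updates not yet applied

    def write(self, key, value):
        self.replicas[0].data[key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index):
        return self.replicas[replica_index].data.get(key)

    def propagate(self):
        """Apply all queued updates in order (stands in for 'given enough time')."""
        for index, key, value in self.pending:
            self.replicas[index].data[key] = value
        self.pending.clear()

store = EventuallyConsistentStore(replica_count=3)
store.write("cart", ["book"])
stale = store.read("cart", replica_index=2)  # update has not reached replica 2 yet
store.propagate()
fresh = store.read("cart", replica_index=2)  # all replicas now agree
print(stale, fresh)
```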
Data warehouses store vast amounts of structured data in .....
highly regimented ways
ACID (relational) databases typically use
schema on write
Relational Databases use
schema on write
a data mart is easier to manage because why?
it has a much smaller domain, typically limited to a particular type of input data, a particular business function, or a particular business unit or geographic area
what is yarn
manages resources of the systems storing data and running the analysis
What is hadoop map reduce?
Map: generates key-value pairs. Shuffle: takes the map output and sorts it by key. Reduce: reduces each list to a small number of atomic values ready for further processing.
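The three phases can be sketched in plain Python with a word count, the usual teaching example. This is a single-process illustration under stated assumptions, not Hadoop's API: the function names and the two input splits are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) key-value pair for every word in the split."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group the values by key and sort the groups by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

splits = ["big data big", "data lake"]
mapped = [pair for line in splits for pair in map_phase(line)]
counts = reduce_phase(shuffle_phase(mapped))
print(counts)
```

In Hadoop the map and reduce steps run in parallel on different nodes; here they run in sequence only to show the data flow.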
What does basically available mean
nodes in the distributed environment can go down, but the whole system shouldn't be affected, so it works most of the time
What is HADOOP
not a database; it is a distributed file structure that allows massively parallel computing
Traditional DB storage: physical decisions and logical options
Physical decisions: on premise or hosted. Logical options: relational DBMS with OLTP and OLAP.
Updated DB storage: physical decisions and logical options
Physical decisions: on premise, cloud or hybrid. Logical options: relational DBMS, NoSQL, Hadoop, file system.
Because of the rigidity and the ways in which they work, data warehouses can support ..... ETL meaning that they can
partial; load or reload only portions of the data warehouse
Organizations populate data warehouses ...... and data refreshes in ......, for example at 3 am when not a lot of employees are working with the systems
periodically; regular cycles