Hadoop Quiz

Backend Data view:

-database after crawler (web pages, images, video)
-doc representation after doc analyzer (semi-structured files like JSON)
-index after indexer (inverted index files)

NoSQL + Operational Data

-document store
-key/value store
-column family store
-graph store

Big data applications

-ecommerce: churn prediction
-higher education: loan assessment, college recommendation
-financial services: fraud detection, stock price forecasting
-sentiment analysis, advertisement analysis, personalized marketing

HDFS Takeaways:

-HDFS is a highly scalable data storage system
-HDFS is reliable through distributed storage and replication
-HDFS offers an economical storage solution by using inexpensive commodity hardware

Hadoop 1.0 (MapReduce, HDFS)

-Limited to 4,000 nodes
-JobTracker is the bottleneck
-Only one master node
-MapReduce is the only job type that can run

Indexer

Organizes the normalized data (see the inverted-index sketch below)
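
A minimal sketch of what an indexer produces: an inverted index mapping each term to the IDs of the documents containing it. The sample documents and whitespace tokenization are invented for illustration.

```python
from collections import defaultdict

# Hypothetical normalized documents coming out of the doc analyzer
docs = {
    1: "hadoop stores big data",
    2: "spark processes big data fast",
    3: "hadoop and spark are big data tools",
}

# Build the inverted index: term -> set of doc IDs containing the term
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        inverted_index[term].add(doc_id)

print(inverted_index["big"])    # {1, 2, 3}
print(inverted_index["spark"])  # {2, 3}
```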

Big data technologies

Organized along two axes: operational data vs. analytical data, and NoSQL tech vs. SQL tech

Search Engine Front End:

Ranking procedure: User (Query) --> Query Rep --> Retrieval & Ranking --> Results --> User

Big Data

Refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze

SQL + Analytical Data

Relational Analytics (SQL Server, MySQL)

SQL + Operational Data

Relational Database (SQL Server, MySQL)

Data warehouse is for:

SQL + Analytical Data

Structured vs. Unstructured Data

Structured: data in fixed fields of a record or file (numbers, dates, groups)
Unstructured: data not organized in a pre-defined manner (video, voice, text, audio)

Volume

There are huge volumes of data in the world (roughly 44 zettabytes, i.e., 44 trillion GB)

Retrieval and Ranking

This is the key function that makes a search engine work: a "ranking procedure" that puts the most relevant results at the top (it goes back to the index)
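
A toy illustration of the ranking step: score each document by how many query terms it contains, then return results with the most relevant first. The index contents are made up, and term-count scoring is a stand-in for a real relevance model.

```python
from collections import Counter

# A tiny inverted index (term -> doc IDs), as built by an indexer
inverted_index = {
    "big": {1, 2, 3}, "data": {1, 2, 3}, "spark": {2, 3}, "hadoop": {1, 3},
}

def rank(query, index):
    scores = Counter()
    for term in query.lower().split():     # query rep: normalized keywords
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1            # one point per matching query term
    # most_common() sorts by score, highest first
    return [doc_id for doc_id, _ in scores.most_common()]

print(rank("big data spark", inverted_index))  # docs 2 and 3 outrank doc 1
```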

Big Data Enables AI

-computers beat humans at the game of Go
-Uber's self-driving cars

Nodes

Computers

What are Hadoop's biggest advantages?

volume + variety

HDFS Storage Cost

-HDFS uses inexpensive commodity hardware (around $3K per node)
-HDFS is open-source software with zero licensing and support costs

Relational DBMS

-A DB is relational when it presents data as tables (columns + rows)

Spark-on-Hadoop

-A fast and general engine for large-scale data processing
-In-memory computing
-Up to 100x faster than Hadoop MapReduce (Apache Spark)
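
A minimal PySpark word-count sketch showing the in-memory style of computation. It assumes pyspark is installed; the app name and HDFS input path are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("hdfs:///data/input.txt")    # hypothetical input path in HDFS
      .flatMap(lambda line: line.split())    # map: emit individual words
      .map(lambda word: (word, 1))           # map: (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)       # reduce: sum the counts per word
)
counts.cache()          # keep results in memory for reuse (Spark's speed edge)
print(counts.take(5))
spark.stop()
```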

A Complementary Approach: Hadoop Integration

-DW: structured, repeatable, linear
-Hadoop: unstructured, exploratory, iterative

Relational Data Modeling

-ER diagramming
-data needs to be converted to follow a schema (ER diagram) before storage
-entities, attributes, relationships, cardinality

data warehouse

-a relational database designed with the purpose of organizing and cleaning the data retrieved from different DBMSs
-data is flattened, filtered, indexed, preprocessed
-a relational database hub; gives a single version of the truth from the data
-a relational database that integrates all databases together
-makes it possible to build OLAP applications like data visualizations

Step 2: Reduce

-combine the outputs from all sub-blocks into the final output

What is Hadoop?

-open-source project
-data management framework
-optimized to handle massive amounts of various data types
-architecture of HDFS (storage) + MapReduce

DW

-performance
-features

Hadoop

-scalability
-flexibility
-efficient and simple fault-tolerant mechanism
-commodity, inexpensive hardware

What are 3 characteristics of HDFS Storage?

-scalable (up to 200 PB of storage)
-reliable (distributed file system)
-cheap (spans large clusters of commodity servers)

Step 1: Map

-take a file and break the data into small blocks (done by HDFS)
-perform the same function on all sub-blocks

Frontend Data view:

-user (query): query log, acoustic or textual
-results --> query: search log and click log are stored

Hadoop Master-Slave Architecture

1 master node in charge of multiple slave nodes

Hadoop Trends:

1) Hadoop 1.0 to Hadoop 2.0+
2) SQL-on-Hadoop
3) Spark-on-Hadoop

4 V's of Big Data (Big data dimensions/limitations)

1) Volume
2) Variety
3) Velocity
4) Veracity

How does HDFS Storage work (how a file is saved)?

1) Every file is distributed into "blocks"
2) Every "block" is automatically replicated (3 copies by default)
3) Replicated "blocks" are placed onto 3 different nodes (computers)
4) Blocks are uniformly distributed; no node holds two copies of the same block (see the placement sketch below)
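
A simplified sketch of this behavior, including the ~1 GB into 8 x 128 MB arithmetic from the Block card. The node names and round-robin placement are invented for illustration; real HDFS placement is rack-aware.

```python
import itertools

BLOCK_SIZE_MB = 128
REPLICATION = 3                                 # HDFS default
nodes = [f"node{i}" for i in range(1, 7)]       # a hypothetical 6-node cluster

file_size_mb = 1024                             # a ~1 GB file
num_blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division -> 8 blocks

# Place each block's replicas on REPLICATION different nodes
node_cycle = itertools.cycle(nodes)
placement = {
    block_id: [next(node_cycle) for _ in range(REPLICATION)]
    for block_id in range(num_blocks)
}

for block_id, replicas in placement.items():
    print(f"block {block_id}: {replicas}")  # no node holds two copies of a block
```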

HDFS Storage

A java-based distributed file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers

MapReduce

A parallel programming framework for writing applications that process large amounts of structured and unstructured data stored in the HDFS

Hadoop: Big data Technologies

Analytical Data + NoSQL technologies

Crawler

Created for the search engine; can automatically extract content related to the latest movies, TV, albums, etc.

NoSQL + Analytical Data

Big Data Analytics (Hadoop, Spark)

Information Management Hierarchy:

Big Data Analytics <-- Hadoop/NoSQL <-- data warehouse <-- relational database

To enable the Big V's to happen in a search engine:

Big Data Infrastructure needed: Hadoop

Hadoop's Major Players

Cloudera + Hortonworks, IBM, AWS, Microsoft

Variety

Data Sources (i.e. click stream, social media, sensors, mobile devices, audio, video, text, etc)

Veracity

Data uncertainty - the level of reliability associated with certain types of data

Block

Distributed files are evenly broken up into blocks (e.g., an original file of ~1 GB is broken into 8 blocks of 128 MB each)

Hadoop is for:

NoSQL + Analytical Data

What makes up Hadoop's architecture (originated from Google distributed file system)?

HDFS (storage) + MapReduce (process)

Slave Nodes: Layers (datanode)

HDFS - DataNode
MapReduce - TaskTracker

Master Node: Layers (namenode)

HDFS - NameNode
MapReduce - JobTracker

HDFS stands for?

Hadoop Distributed File System

Hadoop

Hadoop is a data management framework optimized to handle massive amounts of structured, semi-structured, and unstructured data

Hadoop vs. Data warehouse

Hadoop: schema on read (data --> storage --> schema to filter --> output)
DW: schema on load (schema to filter --> storage of pre-filtered data)
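
A small sketch contrasting the two models in plain Python; the records and field names are made up. Schema on read keeps the raw data and imposes structure at query time; schema on load filters to a fixed schema before anything is stored.

```python
import json

raw_lines = [
    '{"user": "ann", "amount": 10, "extra": "kept in raw storage"}',
    '{"user": "bob", "amount": 25}',
]

# Schema on read (Hadoop style): store raw lines as-is, apply the
# schema only when the data is read for a particular analysis.
def read_with_schema(lines):
    for line in lines:
        rec = json.loads(line)
        yield (rec["user"], rec["amount"])      # schema imposed at read time

print(list(read_with_schema(raw_lines)))

# Schema on load (DW style): validate and filter to the fixed schema
# *before* storage, so only pre-filtered rows are kept.
table = []
for line in raw_lines:
    rec = json.loads(line)
    if "user" in rec and "amount" in rec:       # validate against the schema
        table.append((rec["user"], rec["amount"]))
print(table)
```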

Velocity

How fast data is created or changed, as well as the speed at which it must be received, processed, and analyzed

Search Engine Back End

Indexed Corpus: Crawler --> Doc Analyzer --> Indexer --> Index --> Retrieval and Ranking

MapReduce Process:

Input--> Split --> Map --> Shuffle --> Reduce --> Final Result
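
A pure-Python simulation of this pipeline using the canonical word-count example; the input text is arbitrary.

```python
from collections import defaultdict

text = "big data needs big tools"

# Split: break the input into records
records = text.split()

# Map: emit (key, value) pairs from each record
mapped = [(word, 1) for word in records]

# Shuffle: group all emitted values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: combine each group's values into the final result
result = {key: sum(values) for key, values in groups.items()}
print(result)   # {'big': 2, 'data': 1, 'needs': 1, 'tools': 1}
```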

Vocab for Measuring Information

MB, GB, TB, PB, exabyte, zettabyte, yottabyte

Where is the single point of failure? (meaning if this fails, the entire cluster fails)

Master Node (JobTracker + NameNode)

Doc Analyzer

Used to create universal format of data; normalize content format

More dimensions for big data?

Value/validity

4 V's of Big Data

Volume: data at scale
Variety: data in many forms
Velocity: data in motion
Veracity: data uncertainty

Search Engine Big V's

Volume: hundreds of crawled pages; TBs of logs produced every day
Velocity: index refresh, log refresh, ranking model refresh
Variety: crawled files including images & videos; ranker features ranging from acoustic to static text
Veracity: unstructured web pages, spam pages, noisy click and query logs

Index

Where the inverted index file is stored

data warehouse definition

a large relational store of data accumulated from a wide range of sources within a company and used to guide management decisions

SQL-on-Hadoop

Better integration with the data warehouse; opens the data to a much wider audience

Solution to information inconsistencies/information integration problem?

create a data warehouse

User

creates query

NoSQL tech:

Data is not necessarily stored in spreadsheet-style tables; it can follow a variety of formats

Databases and data warehouses follow the relational data model:

data stored in tables

Analytical Data:

Data that yields insights and patterns when you look at it; the variety is very wide and the data is more complicated, but it can wait to be processed, so velocity is not as important (e.g., predictive modeling)

Operational Data:

Data used to facilitate everyday transactions, like inventory data; real-time data that cannot wait to be analyzed, so velocity is important

How does big data enable AI?

deep learning

deep learning

Enables particular algorithms to mimic our brain structure (neurons in the brain talking to each other via electrical impulses across synapses)

deep learning applications

healthcare analytics, self-driving cars, robotics, VR, etc

Problem with relational DBMS

-information inconsistencies: people look at different databases (e.g., the sales target in the "production" vs. "accounting" databases)
-large enterprises have more than 10K databases

Reliability

n-1 failures can be tolerated (with the default of 3 replications, 2 failures can be tolerated); see the sketch below
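
A tiny sketch of why n-1 failures are tolerable: a block stays readable as long as at least one of its replicas sits on a surviving node. Node names are hypothetical.

```python
def block_available(replica_nodes, failed_nodes):
    # The block is readable if any replica sits on a non-failed node
    return any(node not in failed_nodes for node in replica_nodes)

replicas = ["node1", "node2", "node3"]      # default replication = 3
print(block_available(replicas, {"node1", "node2"}))           # True: 2 failures tolerated
print(block_available(replicas, {"node1", "node2", "node3"}))  # False: all replicas lost
```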

Query Rep

The query needs to be represented in an understandable format: speech/acoustic signals or textual keywords

SQL

Stores relational data (columns, rows)

Hadoop 2.0+ (MapReduce + others, YARN + HDFS)

-Potentially up to 10,000 nodes
-Efficient cluster utilization (YARN)
-Multiple master nodes
-Integrates with multiple apps and languages

