Hadoop Quiz
Backend Data view:
- database after crawler (web pages, images, video)
- doc representation after doc analyzer (semi-structured files like JSON)
- index after indexer (inverted index files)
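A hypothetical example (all field names made up) of the kind of semi-structured JSON record a doc analyzer might emit, sketched as a Python dict:

    # Hypothetical normalized document representation (illustrative only)
    doc = {
        "url": "http://example.com/page1",      # made-up URL
        "title": "Example Page",
        "body": "Big data enables AI ...",
        "crawled_at": "2024-01-01T00:00:00Z",
    }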
NoSQL + Operational Data
- document store
- key/value store
- column family store
- graph store
Big data applications
- ecommerce: churn prediction
- higher education: loan assessment, college recommendation
- financial services: fraud detection, stock price forecasting
- sentiment analysis, advertisement analysis, personalized marketing
HDFS Takeaways:
- HDFS is a highly scalable data storage system
- HDFS is reliable through distributed storage and replication
- HDFS offers an economical storage solution by using inexpensive commodity hardware
Hadoop 1.0 (MapReduce, HDFS)
- limited to 4,000 nodes
- JobTracker is the bottleneck
- only one master node
- the only job it can run is MapReduce
Indexer
Organizes the normalized data into the inverted index
Big data technologies
Organized into operational data vs. analytics data AND noSQL tech vs. SQL tech
Search Engine Front End:
Ranking procedure: User (Query) --> Query Rep --> Retrieval & Ranking --> Results --> User
Big Data
Refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze
SQL + Analytical Data
Relational Analytics (SQL Server, MySQL)
SQL + Operational
Relational Database (SQL server, MySQL)
Data warehouse is for:
SQL + Analytical Data
Structured vs. Unstructured Data
Structured: data in fixed fields of a record or file (numbers, dates, groups)
Unstructured: data not organized in a pre-defined manner (video, voice, text, audio)
Volume
There are huge volumes of data in the world (roughly 44 zettabytes, i.e., 44 trillion GB)
Retrieval and Ranking
This is the key function that makes a search engine work: the "ranking procedure" puts the most relevant results at the top (it goes back to the index)
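A minimal Python sketch of the inverted-index lookup behind retrieval (toy data; real engines store compressed index files on disk and apply a ranking function rather than a plain boolean match):

    from collections import defaultdict

    docs = {1: "big data enables ai", 2: "hadoop stores big data"}

    # Build: map each term to the set of document ids containing it
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)

    def search(query):
        """Return ids of documents containing every query term."""
        postings = [index.get(term, set()) for term in query.split()]
        return set.intersection(*postings) if postings else set()

    print(search("big data"))  # {1, 2}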
Big Data Enables AI
- a computer beat a human at the game of Go
- Uber's self-driving cars
Nodes
Computers
What are Hadoop's biggest advantages?
volume + variety
HDFS Storage Cost
- HDFS uses inexpensive commodity hardware (around $3K per node)
- HDFS is open-source software with zero licensing and support costs
Relational DBMS
-A DB is relational when it presents data as tables (columns + rows)
Spark-on-Hadoop
Apache Spark:
- a fast and general engine for large-scale data processing
- in-memory computing
- up to 100x faster than Hadoop MapReduce
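A minimal PySpark word-count sketch (assumes a local Spark installation; the input path is hypothetical). Caching the RDD in memory is what lets repeated computations avoid the disk round-trips of classic MapReduce:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    counts = (spark.sparkContext.textFile("hdfs:///data/pages.txt")  # hypothetical path
              .flatMap(lambda line: line.split())    # split lines into words
              .map(lambda word: (word, 1))           # emit (word, 1) pairs
              .reduceByKey(lambda a, b: a + b))      # sum counts per word

    counts.cache()            # keep the result in memory for reuse
    print(counts.take(5))
    spark.stop()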
A Complementary Approach: Hadoop Integration
- DW: structured, repeatable, linear
- Hadoop: unstructured, exploratory, iterative
Relational Data Modeling
- ER diagramming
- data needs to be converted to follow a schema (ER diagram) before storage
- Entities, Attributes, Relationships, Cardinality
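A minimal sketch of these ER concepts in SQL, via Python's built-in sqlite3 (table and column names are made up): two entities with attributes, plus a one-to-many relationship expressed as a foreign key:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE customer (             -- entity
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL       -- attribute
    );
    CREATE TABLE purchase (             -- entity
        purchase_id INTEGER PRIMARY KEY,
        amount      REAL,
        customer_id INTEGER REFERENCES customer(customer_id)
            -- relationship with 1:N cardinality: one customer, many purchases
    );
    """)
    conn.close()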
data warehouse
- a relational database designed with the purpose of organizing and cleaning the data retrieved from different DBMSs
- data is flattened, filtered, indexed, preprocessed
- a relational database hub; gives a single version of truth from the data
- a relational database that integrates all databases together
- makes it possible to build OLAP applications like data visualizations
Step 2: Reduce
- combine the outputs from all sub-blocks to produce the final output
What is Hadoop?
- open-source project
- data management framework
- optimized to handle massive amounts of various data types
- architecture of HDFS storage + MapReduce
DW
- performance
- features
Hadoop
- scalability
- flexibility
- efficient and simple fault-tolerant mechanism
- inexpensive commodity hardware
What are 3 characteristics of HDFS Storage?
- scalable (up to 200 PB of storage)
- reliable (distributed file system)
- cheap (spans large clusters of commodity servers)
Step 1: Map
- take a file and break the data into small blocks (done by HDFS)
- perform the same function on all sub-blocks
Frontend Data view:
- user (query): the query log stores queries, acoustic or textual
- results --> user: the search log and click log are stored
Hadoop Master-Slave Architecture
1 master node in charge of multiple slave nodes
Hadoop Trends:
1) Hadoop 1.0 to Hadoop 2.0+ 2) SQL-on-Hadoop 3) Spark-on-Hadoop
4 V's of Big Data (Big data dimensions/limitations)
1) Volume 2) Variety 3) Velocity 4) Veracity
How does HDFS Storage work (how a file is saved)?
1) every file is distributed into "blocks"
2) every "block" is automatically replicated (3 copies by default)
3) the replicated "blocks" are placed on 3 different nodes (computers), etc.
4) blocks are uniformly distributed; no node holds two copies of the same block
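A toy Python simulation of this placement policy (illustrative only, not real HDFS logic): split a file into 128 MB blocks, then place 3 replicas of each block on 3 distinct nodes:

    import random

    BLOCK_MB, REPLICAS = 128, 3
    nodes = [f"node{i}" for i in range(1, 7)]   # hypothetical 6-node cluster

    def place_blocks(file_mb):
        n_blocks = -(-file_mb // BLOCK_MB)      # ceiling division
        # random.sample picks distinct nodes, so no node gets two
        # copies of the same block
        return {f"block{b}": random.sample(nodes, REPLICAS)
                for b in range(n_blocks)}

    print(place_blocks(1024))   # 1 GB file -> 8 blocks, 3 replicas each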
HDFS Storage
A Java-based distributed file system that provides scalable and reliable data storage; it was designed to span large clusters of commodity servers
MapReduce
A parallel programming framework for writing applications that process large amounts of structured and unstructured data stored in the HDFS
Hadoop: Big data Technologies
Analytical Data + NoSQL technologies
Crawler
Created for the search engine; can automatically extract content related to the latest movies, TV, albums, etc.
NoSQL + Analytical Data
Big Data Analytics (Hadoop, Spark)
Information Management Hierarchy:
Big Data Analytics <-- Hadoop/NoSQL <-- data warehouse <-- relational database
To enable Big V's to happen in Search Engine:
Big Data Infrastructure needed: Hadoop
Hadoop's Major Players
Cloudera + Hortonworks, IBM, AWS, Microsoft
Variety
Data sources (e.g., clickstream, social media, sensors, mobile devices, audio, video, text, etc.)
Veracity
Data uncertainty - the level of reliability associated with certain types of data
Block
Distributed files are evenly broken up into blocks (e.g., an original file of ~1 GB is broken into 8 blocks of 128 MB each, since 1024 MB / 128 MB = 8)
Hadoop is for:
NoSQL + Analytical Data
What makes up Hadoop's architecture (originated from Google distributed file system)?
HDFS (storage) + MapReduce (process)
Slave Nodes: Layers (datanode)
HDFS - DataNode
MapReduce - TaskTracker
Master Node: Layers (namenode)
HDFS - NameNode
MapReduce - JobTracker
HDFS stands for?
Hadoop Distributed File System
Hadoop
Hadoop is a data management framework optimized to handle massive amounts of structured, semi-structured, and unstructured data
Hadoop vs. Data warehouse
Hadoop: schema on read (data --> storage --> schema to filter --> output)
DW: schema on load (schema to filter --> storage of pre-filtered data)
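A minimal Python sketch of the contrast (record layout is made up). Schema on load validates and shapes records before storage; schema on read stores raw records and applies structure only at query time:

    import json

    raw_records = ['{"user": "a", "amount": 10}', '{"user": "b"}']

    # Schema on load (DW style): enforce the schema before storing
    clean_rows = []
    for rec in raw_records:
        d = json.loads(rec)
        if "user" in d and "amount" in d:   # filter out non-conforming rows
            clean_rows.append((d["user"], d["amount"]))

    # Schema on read (Hadoop style): store raw strings, interpret at query time
    stored = list(raw_records)
    amounts = [json.loads(r).get("amount", 0) for r in stored]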
Velocity
How fast data is created or changed, as well as the speed at which it must be received, processed, and analyzed
Search Engine Back End
Indexed Corpus: Crawler --> Doc Analyzer --> Indexer --> Index --> Retrieval and Ranking
MapReduce Process:
Input--> Split --> Map --> Shuffle --> Reduce --> Final Result
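A minimal single-process Python sketch of these phases for word count (in real Hadoop the mapper and reducer are separate programs and the framework performs the shuffle/sort across nodes):

    from itertools import groupby

    def map_phase(lines):
        for line in lines:
            for word in line.split():
                yield (word, 1)                      # Map: emit (key, 1)

    def reduce_phase(pairs):
        # Shuffle: sort and group pairs by key, as Hadoop does between phases
        for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            yield (word, sum(count for _, count in group))  # Reduce: sum

    data = ["big data big hadoop", "hadoop data"]
    print(dict(reduce_phase(map_phase(data))))
    # {'big': 2, 'data': 2, 'hadoop': 2}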
Vocab for Measuring Information
MB, GB, TB, PB, exabyte, zettabyte, yottabyte
Where is the single point of failure? (meaning if this fails, the entire cluster fails)
Master Node (JobTracker + NameNode)
Doc Analyzer
Used to create a universal format for the data; normalizes the content format
More dimensions for big data?
Value/validity
4 V's of Big Data
Volume: data at scale
Variety: data in many forms
Velocity: data in motion
Veracity: data uncertainty
Search Engine Big V's
Volume: huge numbers of crawled pages; TBs of logs produced every day
Velocity: index refresh, log refresh, ranking model refresh
Variety: crawled files including images & videos; ranker features ranging from acoustic to static text
Veracity: unstructured web pages, spam pages, noisy click and query logs
Index
Where the inverted index file is stored
data warehouse definition
a large relational store of data accumulated from a wide range of sources within a company and used to guide management decisions
SQL-on-Hadoop
Better integration with the data warehouse; opens the data to a much wider audience
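Hive is the classic SQL-on-Hadoop engine; as a stand-in, here is a minimal PySpark SQL sketch of the same idea, querying big-data storage with plain SQL (table and column names are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-on-hadoop").getOrCreate()

    df = spark.createDataFrame([("a", 10), ("b", 25), ("a", 5)],
                               ["user", "amount"])
    df.createOrReplaceTempView("purchases")   # expose the data as a SQL table

    spark.sql("SELECT user, SUM(amount) AS total "
              "FROM purchases GROUP BY user").show()
    spark.stop()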
Solution to information inconsistencies/information integration problem?
create a data warehouse
User
creates query
NoSQL tech:
data is not necessarily stored in spreadsheet-style tables; it can follow a variety of formats
Databases and data warehouses follow the relational data model:
data stored in tables
Analytical Data:
data that gives insights and patterns when you look at it; the variety is very wide and the data is more complicated; the data can wait, so velocity is not as important (e.g., predictive modeling)
Operational Data:
data used to facilitate everyday transactions, like inventory data; real-time data that cannot wait to be analyzed, so velocity is important
How does big data enable AI?
deep learning
deep learning
enables these algorithms to mimic our brain's structure (neurons in the brain talking to each other using electrical impulses across synapses)
deep learning applications
healthcare analytics, self-driving cars, robotics, VR, etc
Problem with relational DBMS
- information inconsistencies: people look at different databases (e.g., the sales target in the "production" database vs. the "accounting" database)
- large enterprises have more than 10,000 databases
Reliability
n-1 failures can be tolerated with n replications (with the default of 3 replications, 2 failures can be tolerated)
Query Rep
the query needs to be represented in an understandable format: speech/acoustic input or textual keywords
SQL
stores relational data (columns, rows)
Hadoop 2.0+ (Map Reduce + others, YARN + HDFS)
- potentially up to 10,000 nodes
- efficient cluster utilization (YARN)
- multiple master nodes
- integrates with multiple apps and languages