Data Science - Sqoop, Flume, and Oozie
Which of the following options is the final step while loading data into HDFS using Apache Flume?
binding the source and sink to the channel
Which one of the following options is the default database of Sqoop?
MySQL
__________ allows users to schedule complex workflows.
Oozie Coordinator
Flume agent
a JVM process and the basic unit of a Flume deployment. Each Flume agent has three components: source, channel, and sink, configured using a text config file
Apache Sqoop
a software framework to migrate data from relational databases to the Hadoop system and vice versa. The RDBMS could be MySQL, Oracle, Teradata, etc. It can store data onto HDFS, Hive, HBase, or Accumulo - open source: supported under the Apache Software Foundation - originally developed by Cloudera - latest stable release: 1.4.7 - when used, it runs a MapReduce job in the background, which provides parallel operation
How Sqoop processes data
- designed to import/export individual tables or entire databases - generates Java classes (the Java classes are packaged into a JAR file and deployed to the Hadoop cluster to be executed by a MapReduce job) - the job is submitted to Hadoop using a command line tool - by default, four mappers are run, with each mapper importing 25% of the data - the default db is MySQL * note: when a large amount of data is transferred from an RDBMS to Hadoop, it's called import, and vice versa is called export
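As a sketch of what such a job launch looks like, the import below overrides the default mapper count explicitly (the connection string, table, and target directory are placeholder values, not from the course material):
# hypothetical example: import the "customers" table with 4 parallel mappers
sqoop import \
  --connect jdbc:mysql://localhost/salesdb \
  --username root -P \
  --table customers \
  --num-mappers 4 \
  --target-dir /user/cloudera/customers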
Which one of the following types of node shows the start and end of the workflow?
control node
____________ can be MapReduce jobs, Java/Pig applications, Hive, etc.
action nodes
name components on Flume agent
agent1.sources = src agent1.sinks = snk agent1.channels = chl
bind the components in flume
agent1.sources.src.channels = chl agent1.sinks.snk.channel = chl
component properties on flume
agent1.sources.src.type = seq agent1.sinks.snk.type = logger agent1.channels.chl.type = memory agent1.channels.chl.capacity = 100
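Putting the three cards above together, a complete minimal agent configuration file might look like the sketch below (the agent name agent1 and the properties mirror the snippets above; the file name itself is arbitrary):
# name the components of the agent
agent1.sources = src
agent1.sinks = snk
agent1.channels = chl
# configure each component
agent1.sources.src.type = seq
agent1.sinks.snk.type = logger
agent1.channels.chl.type = memory
agent1.channels.chl.capacity = 100
# bind the source and sink to the channel
agent1.sources.src.channels = chl
agent1.sinks.snk.channel = chl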
Which of the following functions does flume support?
all (avro, syslog, netcat)
_____________ are the communication and retention mechanisms that manage event delivery.
channels
sqoop import
command to load data from an RDBMS into HDFS: sqoop import --connect jdbc:mysql://localhost/oozie --username root --password cloudera --table country --target-dir /PRA_MV * the schema is called oozie, and PRA_MV is the target directory the data is stored in. Sqoop first gathers metadata from the db, then submits a job to the cluster, and Hadoop map tasks store the data into HDFS
Which of the following directory will save the configuration details of Oozie?
conf
The _________ node drives the actions of the workflow while the __________ node is a specific executable tied to a function
control, action
Which Avro functions does flume set?
event?
flume vs sqoop
flume: - you can import data into Hadoop - streaming data sources - used for collecting and aggregating data - Goibibo uses Flume. Sqoop: - you can import data into Hadoop as well as export it back (vice versa) - the source is RDBMS data - used for parallel data transfer - Coupons.com uses Sqoop
___________ is used to run multiple jobs in parallel.
fork
sqoop now
a generic data transfer service - from any source - to any target, e.g. MySQL <-> Kafka, HDFS <-> MongoDB, FTP <-> HDFS, Kafka <-> MemSQL. Sqoop has connectors that represent pluggable data sources; connectors are configurable via - LINK configs - JOB configs
Which of the following options are Sqoop's characteristics?
it is a client program ?
types of sinks
logger HDFS avro etc...
types of channels
memory file kafka - memory gives you good performance, but it's not reliable. - file gives you reliability at the cost of performance
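As a sketch, a file channel additionally needs local directories for its checkpoint and data files (the paths below are placeholders):
# file channel: durable, slower than memory
agent1.channels.chl.type = file
agent1.channels.chl.checkpointDir = /var/flume/checkpoint
agent1.channels.chl.dataDirs = /var/flume/data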
The ______ channel behaves very similarly to the file channel.
memory channel??
Which of the following options is important for multifunction Flume agents?
multiple sinks and sources??
types of sources
netcat exec syslog avro etc...
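For example, a netcat source only needs a bind address and a port (the values below are placeholders):
# netcat source: listens on a TCP port and turns each received line into an event
agent1.sources.src.type = netcat
agent1.sources.src.bind = localhost
agent1.sources.src.port = 44444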
_______ tag in action node signifies that you can make the transition to the next node.
ok
starting a flume agent
./flume-ng agent -n agent1 -c conf -f ../conf/<agent_configuration_filename>.properties (the -n value must match the agent name used inside the configuration file)
Oozie is still using a parameter called _______ to identify the YARN arguments as they are yet to create a new parameter for YARN itself.
JobTracker?
Flume Architecture
- Web Server (produces events/logs) -> Flume Agent (source, channel, sink) -> HDFS (centralized store)
what flume does
- collects multi-source streaming data and delivers it to Hadoop for storage and analysis - to store the data, flume gives many options: you can either store it directly into Hadoop (HDFS) or into a real-time system such as HBase - flume provides horizontal scalability in case data streams and volume increase - it offers buffer storage for real-time spikes - it runs on cross-platform operating systems
Sqoop usage
- the input to sqoop is either a db table or mainframe datasets - it reads the table row by row into HDFS - for mainframe datasets, it will read records from each mainframe dataset into HDFS - the output of this import process is a set of files containing a copy of the imported table or datasets - the import process is performed in parallel, so there will be multiple files in the target dir - output can be text files or binary Avro or sequence files containing serialized record data - after applying transformations on the imported data, results can be exported back to relational db's - its export process will read a set of delimited text files from HDFS in parallel, parse them into records, and insert them as new records in the target db table - provides mechanisms to inspect the db we want to work with
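As a sketch of choosing the output format mentioned above, the import below writes binary Avro data files instead of plain text (the connection string, table, and target directory are hypothetical):
# hypothetical example: import the "orders" table as Avro data files
sqoop import \
  --connect jdbc:mysql://localhost/salesdb \
  --username root -P \
  --table orders \
  --as-avrodatafile \
  --target-dir /user/cloudera/orders_avro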
installing flume
- install the binary and source files of Apache Flume - make a flume dir in the same place where Hadoop and HBase are installed and extract the binary file there - open the flume-env.sh file and set the JAVA_HOME path - edit the bashrc file and set the flume paths: export FLUME_HOME=/usr/local/flume export FLUME_CONF_DIR=$FLUME_HOME/conf export FLUME_CLASS_PATH=$FLUME_CONF_DIR export PATH=$FLUME_HOME/bin:$PATH - go to the bin folder and check whether flume is installed by running: flume-ng
Oozie
- open-sourced in 2010 by Yahoo, now an Apache project - a server-based workflow scheduling system to manage hadoop jobs - simplifies workflow and coordination between jobs - you can schedule jobs as well as reschedule jobs on failure - allows creating a DAG of workflows - tightly integrated with the Hadoop stack, supporting various hadoop jobs - executes and monitors workflows in hadoop - periodic scheduling of workflows - triggers execution by data availability - HTTP and command line interface + web console - helps to put the following elements together: pig, hive, mapreduce, yarn, hdfs, hbase, spark/giraph - allows users to create directed acyclic graphs of workflows, and these can be run in parallel and in sequence in hadoop - used by Hadoop system admins to perform complex log evaluation on HDFS - used to perform ETL operations in a sequence and then save the output in HDFS
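As an example of the command line interface, a workflow job is typically submitted and checked with the oozie client roughly like this (the server URL, job.properties file, and job id are placeholders; 11000 is the usual default Oozie port):
# submit and start a workflow job described by job.properties
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
# check the status of a running job
oozie job -oozie http://localhost:11000/oozie -info <job-id>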
action nodes
- specify the map/reduce, pig, or java class to run - all action nodes have ok and error tags: ok transitions to the next node, error goes to the error-handling node and should print an error message. xml: <action name="[NODE-NAME]"> <ok to="[NODE-NAME]"/> <error to="[NODE-NAME]"/> </action>
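For context, a minimal workflow.xml sketch that wires one action node between the control nodes could look like this (the app name, node names, and the fs/mkdir action body are illustrative placeholders, not from the course material):
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="first-action"/>
    <action name="first-action">
        <!-- the action body (map-reduce, pig, hive, fs, ...) goes here -->
        <fs>
            <mkdir path="${nameNode}/user/${wf:user()}/demo"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>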
hduser> sqoop import ______________.
--connect jdbc:mysql://localhost/
Which one of the following options is the workflow file format?
.xml
channel
1/3 components of a flume agent - acts as a storehouse that keeps the events until the flume sink consumes them - may use a local file system to store these events - there can be more than one Flume agent; in that case, the flume sink forwards the events to the flume source of the next flume agent in the data flow - channels data between source and sink - a source can feed one or more channels, but each sink reads from exactly one channel
source
1/3 components of a flume agent - responsible for sending the event to the channel it is connected to - it has no control over how data is stored in the channel - it supports netcat, exec, Avro, TCP, and UDP as source types - primarily used to read data from web server logs
sink
1/3 components of a flume agent - removes the events from the channel and stores them into an external repository like HDFS or forwards them to another flume agent - waits for events - responsible for sending the event to the desired output - it manages issues like time-outs - as long as one sink is available, the channel will function - primarily used to write data into data stores or to other Flume agent sources (via avro)
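For example, an HDFS sink mainly needs a type and a target path (the path below is a placeholder):
# HDFS sink: writes events to files under the given HDFS directory
agent1.sinks.snk.type = hdfs
agent1.sinks.snk.hdfs.path = hdfs://localhost:8020/user/flume/events
agent1.sinks.snk.hdfs.fileType = DataStream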
___________ is a data-serialization framework.
Apache Avro
_______ provides logic between action nodes like start, end and kill.
Control nodes
Name the process in which the data is imported, transferred, loaded and processed for future use in a database
Data Ingestion
Oozie jobs
Oozie Workflow: supports defining and executing a controlled sequence of MapReduce, Hive, and Pig jobs. Oozie Coordinator: allows users to schedule complex workflows; coordinator jobs can be scheduled to execute at a certain time, and after they are started they can also be configured to run at specific intervals, so they provide scheduling. Oozie Bundles: package and manage a set of coordinator jobs as a single unit
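As a sketch, a coordinator application wraps a workflow and adds a schedule; all names, dates, and paths below are placeholders:
<coordinator-app name="daily-import" frequency="${coord:days(1)}"
                 start="2023-01-01T00:00Z" end="2023-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- HDFS directory containing the workflow.xml to run -->
            <app-path>${nameNode}/user/cloudera/apps/demo-wf</app-path>
        </workflow>
    </action>
</coordinator-app>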
flume
a tool used to collect, aggregate, and transport large amounts of streaming data, such as events and log files, from many different sources to a centralized data store; allows for geo-analytical applications - open source - reliable - scalable: log data can be multiplexed and scaled to a really large number of servers - manageable - customizable - declarative and dynamic configuration: the configuration is written in a file called the agent configuration file and is dynamic
sqoop 2
aka 'sqoop as a service' - server-side installation and configuration - provides both a CLI and a UI - REST API exposed for external integration - uses map tasks for transport and reduce tasks for transformation
why choose sqoop
allows you to transfer data from legacy systems into Hadoop - leverages the parallel processing capabilities of Hadoop to process huge amounts of data - the results of Hadoop analysis can be stored back into relational data storage using the sqoop export functionality
ELT (extract, load, transform)
one of the most common use cases of Sqoop: extracting data from the relational db, transforming it, and placing it in a data warehouse db (which is also relational) for business intel and data analytics - this processing is limited - sqoop helps by copying/extracting the relational data from our operational db to Hadoop, using Hadoop as an intermediate parallel processing engine that is part of the overall ETL process - the result can then be copied/loaded back to our relational db by sqoop. A real use case in this scenario: - copying the billing data from a db to Hadoop for billing-cycle processing; instead of batch processing the billing data in the db, we can copy the data to Hadoop, process it in parallel, and return the final result, i.e., summarized billing data for a customer, back to the db - extract and load data from a db and then transform, modify, and process the data in Hadoop, leveraging all the parallel processing capabilities
Installation of Oozie
prereq = maven 1. install maven 2. set up the maven path in the bashrc file 3. install Oozie 4. unzip it 5. edit the pom.xml file in the oozie folder, which will be used by Maven while building oozie - change the java and hadoop versions - change the link of the Codehaus repository - go to the bin folder and type: ./mkdistro.sh -DskipTests -X
sqoop 1
requires client-side installation and configuration - provides a CLI only (command line interface) - launches a single map-only job which does both data transport and transformation
Error Node
signals that an error occurred and that a message describing the error should be printed out; in the Oozie workflow xml this corresponds to the kill node: <kill name="[NODE-NAME]"> <message>[A custom message]</message> </kill>
____________ removes the events from channels and stores it into an external repository like HDFS.
sink
__________ is responsible for sending the event to the channel with which it is connected.
source
Which one of the following operations can transform the stream?
sqoop
__________ is the software framework to migrate data from the relational database to Hadoop system and vice versa.
sqoop
sqoop export
sqoop query: sqoop export --connect jdbc:mysql://localhost/cloudera --username cloudera -P --table exported --export-dir /user/country_imported/part-m-00000 * it first gathers metadata, then splits the HDFS data among different map tasks that load it into the RDBMS
Data Ingestion
the process of importing, transferring, loading and processing data for future use in a database. in order to transform data into information or insights, it needs to be ingested into Hadoop
Flume offers buffer storage for real-time spikes.
true
Sqoop will use a specified column to split the work; if that column is not present, then it will use the primary key column.
true
Flume uses a cross-platform operating system.
true ?
sqoop versions
two major versions: sqoop 1 and sqoop 2 - sqoop 2 is still a work in progress; the model & design are subject to change and not really stable - connectors for all RDBMSs: supported in sqoop 1; not supported in sqoop 2 (workaround: use generic JDBC connectors; performance can be affected) - kerberos security: supported in sqoop 1 and 2 - data transfer to Hive and HBase: supported in sqoop 1; not supported in sqoop 2 (workaround: 2-step approach) - MR job: in sqoop 1, a single job does both transport and transformation; in sqoop 2, mappers transport the data and reducers transform it
fork
used to run multiple jobs in parallel; whenever we use a fork, there must be a corresponding join
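A sketch of the corresponding workflow XML (node names are placeholders):
<fork name="parallel-work">
    <path start="job-a"/>
    <path start="job-b"/>
</fork>
<!-- job-a and job-b are action nodes whose ok transitions both point to the join -->
<join name="join-work" to="next-node"/>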
Oozie workflows
a workflow is a collection of actions. Action Nodes: can be MapReduce jobs, Java/Pig applications, Hive, etc. Control Flow Nodes: provide logic between action nodes, like start, end, and kill, and execute actions based on conditions
how sqoop works
you have some data in an RDBMS, and you have to insert the data into Hadoop using Sqoop. step 1. you need to use the import command in the Sqoop CLI step 2. sqoop will generate java classes using the table schema and package them into a JAR file step 3. the jar file is sent to the Hadoop engine, which will allocate some resources to this MapReduce job and run it on the Hadoop cluster step 4. the MapReduce job begins to run step 5. the imported data is saved on HDFS; after this step, hadoop sends a response back to the sqoop cli to show the user
Which one of the following options is the first step of Sqoop?
you need to use import command in Sqoop CLI