Data Science - Sqoop, Flume, and Oozie

Which of the following options is the final step while loading data into HDFS using Apache Flume?

binding the source and sink to the channel

Which one of the following options is the default database of Sqoop?

MySQL

__________ allows users to schedule complex workflows.

Oozie Coordinator

Flume agent

a JVM process and the basic unit of a Flume deployment. Each Flume agent has three components: source, channel, and sink, all configured using plain-text configuration files.

Apache Sqoop

a software framework to migrate data from relational databases to the Hadoop system and vice versa. The RDBMS could be MySQL, Oracle, Teradata, etc. It can store data onto HDFS, Hive, HBase, or Accumulo.
- open source: supported under the Apache Software Foundation
- originally developed by Cloudera
- latest stable release: 1.4.7
- when run, it launches a MapReduce job in the background, which provides parallel operation

How sqoop processes

- designed to import/export individual tables or entire databases
- generates Java classes (the classes are packaged into a JAR file and deployed to the Hadoop cluster to be executed by a MapReduce job)
- the job is submitted to Hadoop using a command line tool
- by default, four mappers are run, with each mapper importing 25% of the data
- the default db is MySQL
* note: when data is transferred from an RDBMS to Hadoop it is called an import, and the reverse is called an export
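
For example, the default of four mappers can be overridden with Sqoop's -m / --num-mappers flag; a minimal sketch, reusing the country table and oozie schema from the import example later in these notes:
sqoop import \
  --connect jdbc:mysql://localhost/oozie \
  --username root --password cloudera \
  --table country \
  --target-dir /PRA_MV \
  -m 8   # run 8 parallel mappers instead of the default 4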

Which one of the following types of nodes shows the start and end of the workflow?

control flow node

____________ can be MapReduce jobs, Java/Pig applications, Hive, etc.

action nodes

name components on Flume agent

agent1.sources = src
agent1.sinks = snk
agent1.channels = chl

bind the components in flume

agent1.sources.src.channels = chl
agent1.sinks.snk.channel = chl

component properties on flume

agent1.sources.src.type = seq
agent1.sinks.snk.type = logger
agent1.channels.chl.type = memory
agent1.channels.chl.capacity = 100
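
Putting the naming, property, and binding lines together, a complete minimal configuration file for this agent (e.g. agent1.properties, a hypothetical filename) looks like:
# name the components of the agent
agent1.sources = src
agent1.sinks = snk
agent1.channels = chl

# component properties
agent1.sources.src.type = seq
agent1.sinks.snk.type = logger
agent1.channels.chl.type = memory
agent1.channels.chl.capacity = 100

# bind the source and sink to the channel
agent1.sources.src.channels = chl
agent1.sinks.snk.channel = chl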

Which of the following functions does flume support?

all (avro, syslog, netcat)

_____________ are the communication and retention mechanisms that manage event delivery.

channels

sqoop import

command to load data from an RDBMS into HDFS:
sqoop import \
  --connect jdbc:mysql://localhost/oozie \
  --username root --password cloudera \
  --table country \
  --target-dir /PRA_MV
* the schema is called oozie; /PRA_MV is the HDFS directory the data is stored into. Sqoop first gathers metadata from the db, then Hadoop map tasks run in the cluster and store the rows into HDFS.

Which of the following directories will save the configuration details of Oozie?

conf

The _________ node drives the actions of the workflow while the __________ node is a specific executable tied to a function

control, action

Which Avro functions does flume set?

event?

flume vs sqoop

Flume:
- you can import data into Hadoop
- sources are streaming data
- used for collecting and aggregating data
- Goibibo uses Flume
Sqoop:
- you can import data into Hadoop as well as export it back (vice versa)
- sources are RDBMS data
- used for parallel data transfer
- Coupons.com uses Sqoop

___________ is used to run multiple jobs in parallel.

fork

sqoop now

a generic data transfer service, from any source to any target, e.g.:
MySQL <-------> Kafka
HDFS  <-------> Mongo
FTP   <-------> HDFS
Kafka <-------> MemSQL
Sqoop has connectors that represent pluggable data sources. Connectors are configurable via:
- LINK configs
- JOB configs

Which of the following options are Sqoop's characteristics?

it is a client program ?

types of sinks

logger, HDFS, Avro, etc.

types of channels

memory, file, Kafka
- memory gives you good performance, but it's not reliable
- file gives you reliability at the cost of performance
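
As a sketch, switching the earlier agent1/chl channel from memory to the more reliable file channel would look roughly like this; checkpointDir and dataDirs are standard file-channel properties, and the paths shown are hypothetical:
agent1.channels.chl.type = file
# local-disk locations for the channel's checkpoint and event data (hypothetical paths)
agent1.channels.chl.checkpointDir = /var/flume/checkpoint
agent1.channels.chl.dataDirs = /var/flume/data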

The ______ channel behaves very similarly to the file channel.

memory channel??

Which of the following options is important for multifunction Flume agents?

multiple sinks and sources??

types of sources

netcat, exec, syslog, Avro, etc.

_______ tag in action node signifies that you can make the transition to the next node.

ok

starting a flume agent

./flume-ng agent -n agent1 -c conf -f ../conf/<agent_configuration_filename>.properties
(the -n value must match the agent name used inside the configuration file, e.g. agent1)

Oozie is still using a parameter called _______ to identify the YARN arguments as they are yet to create a new parameter for YARN itself.

jobTracker (on a YARN cluster, the jobTracker parameter is simply pointed at the ResourceManager)

Flume Architecture

Web Server -> Flume Agent -> HDFS

what flume does

- gets multi-source streaming data into Hadoop for storage and analysis
- to store the data, Flume gives many options: you can store it directly into Hadoop (HDFS) or into a real-time system such as HBase
- Flume provides horizontal scalability in case data streams and volume increase
- it offers buffer storage for real-time spikes
- it is cross-platform, running on multiple operating systems

Sqoop usage

- input to Sqoop is either a db table or mainframe datasets
- reads the table row by row into HDFS
- for mainframe datasets, it reads records from each mainframe dataset into HDFS
- the output of this import process is a set of files containing a copy of the imported table or datasets
- the import process is performed in parallel, so there will be multiple files in the target dir
- output can be text files or binary Avro or sequence files containing serialized record data
- after applying transformations to the imported data, the results can be exported back to relational db's
- the export process reads a set of delimited text files from HDFS in parallel, parses them into records, and inserts them as new records in the target db table
- provides mechanisms to inspect the db we want to work with

installing flume

- install the binary and source files of Apache Flume
- make a flume dir in the same place where Hadoop and HBase are installed, and extract the binary file there
- open the flume-env.sh file and set the JAVA_HOME path
- edit the .bashrc file and set the Flume paths:
  export FLUME_HOME=/usr/local/flume
  export FLUME_CONF_DIR=$FLUME_HOME/conf
  export FLUME_CLASS_PATH=$FLUME_CONF_DIR
  export PATH=$FLUME_HOME/bin:$PATH
- go to the bin folder and check whether Flume is installed or not by running: flume-ng

Oozie

- opened in 2010 by Yahoo; now an open-source Apache project (formerly an Apache Incubator project)
- a server-based workflow scheduling system to manage Hadoop jobs
- simplifies workflow and coordination between jobs
- you can schedule jobs as well as reschedule jobs on failure
- allows users to create Directed Acyclic Graphs (DAGs) of workflows, which can be run in parallel and in sequence in Hadoop
- tightly integrated with the Hadoop stack, supporting various Hadoop jobs
- executes and monitors workflows in Hadoop
- periodic scheduling of workflows
- triggers execution by data availability
- HTTP and command line interfaces + web console
- helps to put the following elements together: Pig, Hive, MapReduce, YARN, HDFS, HBase, Spark/Giraph
- used by Hadoop system admins to perform complex log evaluation on HDFS
- used to perform ETL operations in a sequence and then save the output in HDFS

action nodes

- specify the map/reduce, Pig, or Java class to run
- all action nodes have ok and error tags:
  = ok transitions to the next node
  = error goes to the error node and should print an error message
xml skeleton:
<action name="[NODE-NAME]">
    <ok to="[NODE-NAME]"/>
    <error to="[NODE-NAME]"/>
</action>
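
As a rough sketch of what a filled-in action node can look like, a map-reduce action with its ok and error transitions; the node names, mapper/reducer class names, and the ${jobTracker}/${nameNode} properties below are illustrative assumptions, not taken from these notes:
<action name="wordcount-node">
    <map-reduce>
        <!-- jobTracker/nameNode are workflow properties supplied at submission time -->
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.mapper.class</name>
                <value>org.example.SampleMapper</value>
            </property>
            <property>
                <name>mapred.reducer.class</name>
                <value>org.example.SampleReducer</value>
            </property>
        </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
</action>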

hduser> sqoop import ______________.

--connect jdbc:mysql://localhost/

Which one of the following option is the workflow file format?

.xml

channel

1/3 components of a Flume agent
- acts as a storehouse that keeps the events until the Flume sink consumes them
- may use a local file system to store these events
- there can be more than one Flume agent; in that case, the Flume sink forwards the events to the Flume source of the next Flume agent in the data flow
- channels data between source and sink
- there can be a one-to-many relationship between source and sinks, but for each sink there needs to be one channel

source

1/3 components of a Flume agent
- responsible for sending the event to the channel it is connected to
- has no control over how the data is stored in the channel
- supports netcat, exec, Avro, TCP, and UDP as source types
- primarily used to read data from web server logs

sink

1/3 components of a Flume agent
- removes events from the channel and stores them in an external repository like HDFS, or forwards them to another Flume agent
- waits for events
- responsible for sending the event to the desired output
- manages issues like time-outs
- as long as one sink is available, the channel will function
- primarily used to write data into data stores or into other Flume agent sources (via Avro)
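
A sketch of an HDFS sink, reusing the agent1/snk/chl names from the configuration above; hdfs.path is the standard HDFS-sink property and the path shown is hypothetical:
agent1.sinks.snk.type = hdfs
# HDFS directory the sink writes events into (hypothetical path)
agent1.sinks.snk.hdfs.path = hdfs://localhost:8020/user/flume/events
agent1.sinks.snk.channel = chl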

___________ is a data-serialization framework.

Apache Avro

_______ provides logic between action nodes like start, end and kill.

Control nodes

Name the process in which the data is imported, transferred, loaded and processed for future use in a database

Data Ingestion

Oozie jobs

Oozie Workflow: supports defining and executing a controlled sequence of MapReduce, Hive, and Pig jobs
Oozie Coordinator: allows users to schedule complex workflows. Coordinator jobs can be scheduled to start at a certain time; after they are started, they can also be configured to run at specific intervals, so they provide scheduling.
Oozie Bundles: monitor the status of coordinator jobs
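
For reference, a rough sketch of what a coordinator definition (coordinator.xml) can look like; the app name, dates, and the ${workflowAppPath} property are illustrative assumptions, not taken from these notes:
<coordinator-app name="daily-import" frequency="${coord:days(1)}"
                 start="2023-01-01T00:00Z" end="2023-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <action>
        <workflow>
            <!-- path to the workflow application this coordinator schedules -->
            <app-path>${workflowAppPath}</app-path>
        </workflow>
    </action>
</coordinator-app>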

flume

a tool used to collect, aggregate, and transport large amounts of streaming data, such as events and log files, from many different sources to a centralized data store. Allows for geo-analytical applications.
- open source
- reliable
- scalable: log data can be multiplexed and scaled to a really large number of servers
- manageable
- customizable
- declarative and dynamic configuration: the configuration is written in a file called the agent configuration file and is dynamic

sqoop 2

aka 'Sqoop as a service'
- server-side installation and configuration
- provides both a CLI and a UI
- REST API exposed for external integration
- uses map tasks for transport and reduce tasks for transformation

why choose sqoop

- allows you to transfer data from legacy systems into Hadoop
- leverages the parallel processing capabilities of Hadoop to process huge amounts of data
- the results of Hadoop analysis can be stored back into relational data storage using Sqoop's export functionality

ELT (extract, load, transform)

one of the most common use cases of Sqoop. Traditionally, data is extracted from the relational db, transformed, and placed in a data warehouse db (which is also relational) for business intelligence and data analytics, but this processing is limited.
- Sqoop helps by copying/extracting the relational data from our operational db to Hadoop, and Hadoop is then used as an intermediate parallel processing engine as part of the overall ETL process
- the result can be copied/loaded back to our relational db by Sqoop
real use cases in this scenario:
- copying billing data from a db to Hadoop for billing-cycle processing: instead of batch processing the billing data in the db, we copy the data to Hadoop, process it in parallel, and return the final result, i.e., summarized billing data for a customer, back to the db
- extracting and loading data from a db and then transforming, modifying, and processing the data in Hadoop, leveraging all of its parallel processing capabilities

Installation of Oozie

prerequisite = Maven
1. install Maven
2. set up the Maven path in the .bashrc file
3. install Oozie
4. unzip it
5. edit the pom.xml file in the oozie folder, which will be used by Maven while building Oozie:
   - change the Java and Hadoop versions
   - change the link of the Codehaus repository
6. go to the bin folder and type: ./mkdistro.sh -DskipTests -X

sqoop 1

- requires client-side installation and configuration
- provides a CLI (command line interface) only
- launches a single map-only job which does both data transport and transformation

Error Node

signals that an error occurred and that a message describing the error should be printed out. In the workflow xml this is written as a kill node:
<kill name="[NODE-NAME]">
    <message>[A custom message]</message>
</kill>

____________ removes the events from channels and stores it into an external repository like HDFS.

sink

__________ is responsible for sending the event to the channel with which it is connected.

source

Which one of the following operation can transform the stream?

sqoop

__________ is the software framework to migrate data from the relational database to Hadoop system and vice versa.

sqoop

sqoop export

sqoop query:
sqoop export \
  --connect jdbc:mysql://localhost/cloudera \
  --username cloudera -P \
  --table exported \
  --export-dir /user/country_imported/part-m-00000
Sqoop gathers the table metadata, then splits the HDFS data into different map tasks that load it into the RDBMS.

Data Ingestion

the process of importing, transferring, loading and processing data for future use in a database. in order to transform data into information or insights, it needs to be ingested into Hadoop

Flume offers buffer storage for real-time spikes.

true

Sqoop will use a column to split the data; if that column is not present, then it will use the primary key column.

true

Flume uses a cross-platform operating system.

true ?

sqoop versions

two major versions: Sqoop 1 and Sqoop 2
- Sqoop 2 is still a work in progress; the model & design are subject to change and it is not really stable
- connectors for all RDBMSs: supported in Sqoop 1; not supported in Sqoop 2 (workaround: use the generic JDBC connector, though performance can be affected)
- Kerberos security: supported in Sqoop 1 and 2
- data transfer to Hive and HBase: supported in Sqoop 1; not supported in Sqoop 2 (workaround: a 2-step approach)
- MR job: in Sqoop 1, a single job does both transport and transformation; in Sqoop 2, mappers transport the data and reducers transform it

fork

used to run multiple jobs in parallel. when we use a fork, there should be a join
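
A rough sketch of the fork/join pair in workflow xml; the node and action names are illustrative:
<fork name="forking">
    <path start="first-parallel-action"/>
    <path start="second-parallel-action"/>
</fork>
<!-- both parallel actions transition to the join, which continues to the next node -->
<join name="joining" to="next-action"/>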

Oozie workflows

a workflow is a collection of actions
Action Nodes: can be MapReduce jobs, Java/Pig applications, Hive, etc.
Control Flow Nodes: provide logic between action nodes, like start, end, and kill; they execute actions based on conditions
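
Putting these pieces together, a minimal sketch of a workflow.xml; the app name, node names, and the simple fs/mkdir action used as a placeholder are illustrative assumptions, not taken from these notes:
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.5">
    <start to="make-dir"/>
    <action name="make-dir">
        <!-- placeholder filesystem action: creates an output directory -->
        <fs>
            <mkdir path="${nameNode}/user/${wf:user()}/output"/>
        </fs>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
    </kill>
    <end name="end"/>
</workflow-app>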

how sqoop works

you have some data in an RDBMS and you want to insert it into Hadoop using Sqoop.
step 1: use the import command in the Sqoop CLI
step 2: Sqoop generates Java classes using the table schema and packages them into a JAR file
step 3: the JAR file is sent to the Hadoop engine, which allocates resources to this MapReduce job and runs it on the Hadoop cluster
step 4: the MapReduce jobs begin to run
step 5: the imported data is saved on HDFS; after this step, Hadoop sends a response back to the Sqoop CLI to show the user

Which one of the following options is the first step of Sqoop?

you need to use import command in Sqoop CLI

