Combo with "HaDoop" and 27 others
[^abc]
any character except a, b, or c (negation)
How do you execute a MapReduce job from the command line?
export HADOOP_CLASSPATH=build/classes then: hadoop MyDriver input/path output/path
List three drawbacks of using Hadoop.
Does not work well with small amounts of data, MapReduce is difficult to implement or understand, and it does not guarantee atomic transactions
streaming
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
file system data
HDFS is highly fault-tolerant, provides high throughput, is suitable for applications with large data sets, provides streaming access to ____ ____ ____, and can be built out of commodity hardware.
Hadoop comes with pre-built native compression libraries for which version/OS?
Linux, 32-bit and 64-bit. On other platforms you have to compile the libraries yourself.
How do you obtain a comparator for an IntWritable?
RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);
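A minimal sketch of how that comparator might be used; the standalone class and the values 163 and 67 are made up for illustration:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.RawComparator;
    import org.apache.hadoop.io.WritableComparator;

    public class ComparatorDemo {
      public static void main(String[] args) {
        // Look up the raw comparator registered for IntWritable
        RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);
        IntWritable w1 = new IntWritable(163);   // arbitrary example values
        IntWritable w2 = new IntWritable(67);
        // Positive result because 163 > 67
        System.out.println(comparator.compare(w1, w2));
      }
    }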
What is the benefit to UDFs in Pig?
Reusable, unlike many MapReduce job libraries
threshold
The Rebalancer tool will balance the data blocks across the cluster up to an optional ____ percentage.
space
Rebalancing would be very helpful if you are having ____ issues in the other existing nodes.
What is the meaning of speculative execution in Hadoop? Why is it important?
Speculative execution is a way of coping with individual machine performance. In large clusters where hundreds or thousands of machines are involved, there may be machines which are not performing as fast as others. This may result in delays in a full job due to only one machine not performing well. To avoid this, speculative execution in Hadoop can run multiple copies of the same map or reduce task on different slave nodes. The results from the first node to finish are used.
True or False: As long as dfs.replication.min replicas are written, a write will succeed.
True, (default = 1) The namenode can schedule further replication afterward.
HDFS clusters do not benefit from RAID
True, The redundancy that RAID provides is not needed since HDFS handles replication between nodes.
True or False: FSDataInputStream allows seeking, FSDataOutputStream does not.
True, because there is no support to write anywhere but the end of a file. So there is no value to seek while writing.
True or False: Block pool storage is not partitioned
True, datanodes must register with every namenode in the cluster. Datanodes can store blocks from multiple block pools.
True or False: Pipes cannot be run in local standalone mode.
True, it relies on Hadoop's distributed cache system which is active when HDFS is running. (Will also work in pseudo-distributed mode)
True or False: It is possible for users to run different versions of MapReduce on the same YARN cluster.
True, makes upgrades more manageable.
True or False: Streaming and Pipes work the same way in MapReduce1 vs MapReduce 2.
True, the only difference is that the child and subprocesses run on node managers, not tasktrackers.
True or False: Under YARN, you no longer run a jobtracker or tasktrackers.
True, there is a single resource manager running on the same machine as the HDFS namenode(small clusters) or a dedicated machine with node managers running on each worker node.
simple
___ -In this mode of operation, the identity of a client process is determined by the host operating system. On Unix-like systems, the user name is the equivalent of `whoami`.
Structured data
___ ____ is data that is easily identifiable because it is organized in a structure. The most common form of __ ___ is a database, where specific information is stored in tables, that is, rows and columns. [same word]
HFTP
___ is a Hadoop filesystem implementation that lets you read data from a remote Hadoop HDFS cluster. The reads are done via HTTP, and data is sourced from DataNodes.
Rebalancer
____ - tool to balance the cluster when the data is unevenly distributed among DataNodes
Backup Node
____ ____ - An extension to the Checkpoint node. In addition to checkpointing it also receives a stream of edits from the NameNode and maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode namespace state. Only one Backup node may be registered with the NameNode at once.
Secondary NameNode
____ ____ - performs periodic checkpoints of the namespace and helps keep the size of the file containing the log of HDFS modifications within certain limits at the NameNode.
Checkpoint Node
____ ____ - performs periodic checkpoints of the namespace and helps minimize the size of the log stored at the NameNode containing changes to the HDFS.
Map Reduce
____ ____ is a set of programs used to access and manipulate large data sets over a Hadoop cluster.
unstructured data
____ ____ refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs, and random text. It is not in the form of rows and columns.
HTTP POST
____ ____: APPEND (see FileSystem.append) CONCAT (see FileSystem.concat)
HTTP PUT
____ ____: CREATE (see FileSystem.create) MKDIRS (see FileSystem.mkdirs) CREATESYMLINK (see FileContext.createSymlink) RENAME (see FileSystem.rename) SETREPLICATION (see FileSystem.setReplication) SETOWNER (see FileSystem.setOwner) SETPERMISSION (see FileSystem.setPermission) SETTIMES (see FileSystem.setTimes) RENEWDELEGATIONTOKEN (see FileSystem.renewDelegationToken) CANCELDELEGATIONTOKEN (see FileSystem.cancelDelegationToken)
HTTP DELETE
____ ____: DELETE (see FileSystem.delete)
Concurrency
____ and Hadoop FS "handles" The Hadoop FS implementation includes a FS handle cache which caches based on the URI of the namenode along with the user connecting. So, all calls to hdfsConnect will return the same handle but calls to hdfsConnectAsUser with different users will return different handles. But, since HDFS client handles are completely thread safe, this has no bearing on ____. [same word]
libhdfs
____ is a JNI based C API for Hadoop's Distributed File System (HDFS). It provides C APIs to a subset of the HDFS APIs to manipulate HDFS files and the filesystem. ____ is part of the Hadoop distribution and comes pre-compiled in $HADOOP_PREFIX/libhdfs/libhdfs.so . [same word]
HttpFS
____ is a server that provides a REST HTTP gateway supporting all HDFS File System operations (read and write). And it is interoperable with the webhdfs REST HTTP API.
HTTP GET
_____ _____: OPEN (see FileSystem.open) GETFILESTATUS (see FileSystem.getFileStatus) LISTSTATUS (see FileSystem.listStatus) GETCONTENTSUMMARY (see FileSystem.getContentSummary) GETFILECHECKSUM (see FileSystem.getFileChecksum) GETHOMEDIRECTORY (see FileSystem.getHomeDirectory) GETDELEGATIONTOKEN (see FileSystem.getDelegationToken) GETDELEGATIONTOKENS (see FileSystem.getDelegationTokens)
RDBMS
_________ is useful when you want to seek one record from Big Data, whereas Hadoop is useful when you want to read Big Data in one shot and perform analysis on it later.
HDFS
___________ is used to store large datasets in Hadoop.
What is the workflow in Oozie?
a DAG of action nodes and control-flow nodes
What are the two types of nodes in HDFS and in what pattern are they working?
a NameNode (the master) and a number of datanodes (workers) in a master-worker pattern
What is a meta-character?
a character with special meaning interpreted by the matcher
MapReduce is
a computing paradigm for processing data that resides on hundreds of machines
tasks
a MapReduce job is divided into two types of ____: map tasks and reduce tasks
What is Oozie?
a job control utility
What typically delimits a key from a value in MapReduce?
a tab
[a-zA-Z]
a through z or A through Z inclusive
fetchdt
a utility to fetch DelegationToken and store in a file on the local system.
Regular expressions are
a way to describe a set of strings based on common characteristics
Hadoop achieves parallelism by dividing the tasks across many nodes, so it is possible for a few slow nodes to rate-limit the rest of the program and slow it down. What mechanism does Hadoop provide to combat this?
Speculative Execution.
HDFS 1
filesystem to store large data sets by scaling out across a cluster of hosts. Optimized for throughput instead of latency. Achieves HA via replication vs. redundancy.
What do you need for Bloom filtering?
data can be separated into records, a feature can be extracted from each record, a predetermined set of hot values, and some false positives are acceptable
Mapper- Top Ten
find their local top K
Top Ten example
finding outliers, finding the top 10%
Structure of Bloom filtering
first the Bloom filter needs to be trained over the list of values; the resulting data object is stored in HDFS, then the filtering MapReduce job uses it
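A hedged sketch of the training step using Hadoop's built-in org.apache.hadoop.util.bloom.BloomFilter; the vector size, hash count, and the "hotuser42" value are made-up placeholders:

    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;
    import org.apache.hadoop.util.hash.Hash;

    public class BloomTrainingSketch {
      public static void main(String[] args) {
        // Vector size and hash count would normally be derived from the expected
        // number of hot values and the acceptable false-positive rate.
        BloomFilter filter = new BloomFilter(10000, 5, Hash.MURMUR_HASH);
        filter.add(new Key("hotuser42".getBytes()));  // training: add each hot value
        // membershipTest() can return a false positive but never a false negative
        System.out.println(filter.membershipTest(new Key("hotuser42".getBytes())));
      }
    }

In the real filtering job, the trained filter object would be serialized to HDFS and loaded by each mapper (for example via the distributed cache), matching the structure described above.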
build applications, not infrastructure
for a developer, there is no need to spend time worrying about job scheduling, error handling, and coordination in distributed processing.
What is HCatalog
for defining and sharing schemas
there are limits
for how big a single host can be.
What is MapReduce?
for processing large data sets in a scalable and parallel fashion
What is Ambari?
for provisioning, managing, and monitoring Apache Hadoop clusters
How do you copy a file from the local file system to HDFS?
hadoop fs -copyFromLocal path/to/file hdfs://localhost/path/to/file or with defaults: hadoop fs -copyFromLocal path/to/file path/to/file (all on one line)
How do you specify a configuration file when using hadoop command?
hadoop fs -conf conf/hadoop-localhost.xml -ls .
How do you list the files when running in pseudo/single/distributed mode?
hadoop fs -conf conf/hadoop-xxx.xml -ls . (where hadoop-xxx.xml is the config file for single, distributed, or pseudo-distributed mode)
Command for copying a file from HDFS to local disk
hadoop fs -copyToLocal remote/path local/path
How can you get help on the hadoop commands for interacting with the file system?
hadoop fs -help
How would you list the files in the root directory of the local filesystem via command line?
hadoop fs -ls file:///
How can you list all the blocks that make up each file in the filesystem?
hadoop fsck / -files -blocks
How do you find which blocks are in any particular file?
hadoop fsck /user/tom/part-0007 -files -blocks -racks
How do you check if the file content is the same after copying to/from HDFS?
md5 input/docs/test.txt test.copy.txt MD5 hash should match
Can Hadoop Pipes be run in standalone mode?
No, it relies on Hadoop's distributed cache mechanism, which only works when HDFS is running. For development, run in pseudo-distributed mode.
[a-d[m-p]]
union - a through d or m through p
Bloom filtering has a _______ applied to each record
unique evaluation function
pig bag
unordered collection of tuples- can spill onto disk
What is the criteria for a node manager failure?
unresponsive for 10 minutes. (Not sending heartbeat to resource manager) yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms The node manager is removed from the pool and any tasks or application managers running on the node can be recovered.
How can you produce a globally sorted file using Hadoop?
use a partitioner that respects the total order of the output (so not hash partitioner)
dfs.nameservices
used by both the namenode HA and federation features; defines the logical name of the service being provided by a pair of namenode IDs.
What can regular expressions be used for?
used to search, edit, or manipulate text and data.
uses of counters
a useful tool for gathering statistics about your job; counter values are much easier to retrieve than logs
How to retrieve counter values using the Java API (old)?
usual to get counters at the end of a job run ... Counters counters = job.getCounters(); long missing = counters.getCounter( MaxTemperatureWithCounters.Temperature.MISSING);
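For comparison, a hedged sketch of the same lookup with the new API; the Temperature enum here is a stand-in for the book's counter enum:

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;

    public class CounterLookup {
      enum Temperature { MISSING }   // stand-in for MaxTemperatureWithCounters.Temperature

      // Read a counter after the job has completed
      static long missingCount(Job job) throws Exception {
        Counters counters = job.getCounters();
        return counters.findCounter(Temperature.MISSING).getValue();
      }
    }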
What is the key?
what the data will be grouped on
dfs.client.failover.proxy.provider.<nameservice-id>
when namenode HA is enabled, clients need a way to decide which namenode is active and should be used.
Reducer Structure for Inverted Index
will receive a set of unique record identifiers to map back to the input key- identifiers will be concatenated by some delimiter
dfs.ha.namenodes.<nameservice-id>
With the nameservice ID defined by dfs.nameservices, we now need to provide which namenode IDs make up that service. The value is a comma-separated list of logical namenode names.
Tokenize the lines in Pig
words = foreach input generate flatten(TOKENIZE(line)) as word;
HDFS & MR are designed to
work on low commodity clusters (low on cost and specs), scale by adding more servers, identify and work around failures.
Doug Cutting
worked on the Nutch open source web search engine; started working on implementing Google's GFS and MapReduce, and Hadoop was born.
x{n,}, x{n,}?, x{n,}+
x at least n times
x{n,m}, x{n,m}?, x{n,m}+
x at least n times but not more than m
x{n}, x{n}?, x{n}+
x exactly n times
x?, x??, x?+
x once or not at all
x+, x+?, x++
x one or more times
How do you set memory amounts for the node manager?
yarn.nodemanager.resource.memory-mb
If a datanode is in the include and not in the exclude will it connect? If a datanode is in the include and exclude, will it connect?
yes; yes, but it will be decommissioned
x*, x*?, x*+
x zero or more times
What mechanism does Hadoop framework provide to synchronise changes made in Distribution Cache during runtime of the application?
none
x*+ - reluctant, greedy, or possessive?
possessive
how can you force a meta-character to be a normal character?
precede the metacharacter with a backslash, or enclose it within \Q (which starts the quote) and \E (which ends it).
median and standard deviation - mapper
process each input record to calculate the median comment length within each hour of the day - output key is the hour of day and output value is the comment length
data locality optimization
running the map task on the node where the input data resides
Pig- Sample
SAMPLE <relation> <fraction between 0 and 1>
Hadoop works best with _________ and ___________ data, while Relational Databases are best with the first one.
structured and unstructured
[a-z&&[^bc]]
subtraction: a through z, except for b and c: [ad-z]
Doug moved
to Cloudera, and the rest of the team started Hortonworks.
Hadoop IO Class that corresponds to Java Integer
IntWritable
When comparing Hadoop and an RDBMS, which is the better solution for seeking a single record from Big Data?
RDBMS
True or False: Each node in the cluster should run a datanode & tasktracker?
True
True or False: FileSystem Filters can only act on a file's name, not metadata
True
True or False: Hadoop is open source.
True
True or False: Skipping mode is not supported in the new MapReduce API.
True
True or False: The default JobScheduler will fill empty map task slots before reduce task slots.
True
Upgrading a cluster when the filesystem layout hasn't changed is reversible.
True
Using Pig's fs command you can run any Hadoop filesystem shell command from within Grunt
True
True
True or false? MapReduce can best be described as a programming model used to develop Hadoop-based applications that can process massive amounts of unstructured data.
What would you use to turn an array of FileStatus objects to an array of Path objects?
FileUtil.stat2Paths();
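A minimal sketch of that conversion in context, assuming the directory to list is passed on the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class ListPaths {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // listStatus() returns FileStatus[]; stat2Paths() converts it to Path[]
        FileStatus[] status = fs.listStatus(new Path(args[0]));
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
          System.out.println(p);
        }
      }
    }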
Define "fault tolerance".
"Fault tolerance" is the ability of a system to continue operating in the event of the failure of some of its components.
How do counters work?
e.g., the # of bytes of uncompressed input/output consumed by maps in a job is incremented every time collect() is called on the OutputCollector
What configuration is used with the hadoop command if you don't use the -conf option?
$HADOOP_INSTALL/conf
Where can one learn the default settings for all the public properties in Hadoop?
$HADOOP_INSTALL/docs/core-default.html hdfs-default.html mapred-default.html
Where are logs stored by default? How and where should you move them?
$HADOOP_INSTALL/logs set HADOOP_LOG_DIR in hadoop-env.sh Move it outside the install path to avoid deletion during upgrades
How do you finalize an upgrade?
$NEW_HADOOP_HOME/bin/ hadoop dfsadmin -finalizeUpgrade
How do you start an upgrade?
$NEW_HADOOP_HOME/bin/ start-dfs.sh -upgrade
How do you check the progress of an upgrade?
$NEW_HADOOP_HOME/bin/hadoop dfsadmin -upgradeProgress status
How do you roll back an upgrade?
$NEW_HADOOP_HOME/bin/stop-dfs.sh $OLD_HADOOP_HOME/bin/start-dfs.sh
What is the directory structure for the secondary namenode? What are the key points in its design?
${ dfs.checkpoint.dir } - current/ -- version -- edits -- fsimage -- fstime - previous.checkpoint/ -- version -- edits -- fsimage -- fstime Previous checkpoint can act as a stale backup If the secondary is taking over you can use -importCheckpoint when starting the namenode daemon to use the most recent version
What is the directory structure of the Datanode?
${ dfs.data.dir } - current/ -- version -- blk_<id_1> -- blk_<id_1>.meta -- blk_<id_2> -- blk_<id_2>.meta .... --subdir0/ --subdir1/
What does a newly formatted namenode look like? (directory structure)
${ dfs.name.dir } - current -- version -- edits -- fsimage -- fstime
How do you map between node addresses and network locations? Which config property defines an implementation of DNSToSwitchMapping?
(1) public interface DNSToSwitchMapping{ public List<String> resolve( List<String> names); } names - list of IP addresses - returns list of corresponding network location strings (2) topology.node.switch.mapping.impl (Namenodes and jobtrackers use this to resolve worker node network locations)
What are the following HTTP Server default ports? (a) mapred.job.tracker.http.address (b) mapred.task.tracker.http.address (c) dfs.http.address (d) dfs.datanode.http.address (e) dfs.secondary.http.address
(a) 0.0.0.0:50030 (b) 0.0.0.0:50060 (c) 0.0.0.0:50070 (d) 0.0.0.0:50075 (e) 0.0.0.0:50090
What do the following YARN properties manage? (a) yarn.resourcemanager.address (b) yarn.nodemanager.local-dirs (c) yarn.nodemanager.aux-services (d) yarn.nodemanager.resource.memory-mb (e) yarn.nodemanager.vmem-pmem-ratio
(a) 0.0.0.0:8032 (default) where the resource manager RPC runs (b) locations where node managers allow containers to share intermediate data (cleared at the end of a job) (c) List of auxiliary services run by the node manager (d) Amount of physical memory that may be allocated to containers being run by the node manager (e) Ratio of virtual to physical memory for containers.
YARN HTTP Servers: (ports and usage) (a) yarn.resourcemanager.webapp.address (b) yarn.nodemanager.webapp.address (c) yarn.web-proxy.address (d) mapreduce.jobhistory.webapp.address (e) mapreduce.shuffle.port
(a) 0.0.0.0:8088 - resource manager web ui (b) 0.0.0.0:8042 - node manager web ui (c) (default not set) webapp proxy server, if not set resource managers process (d) 0.0.0.0:19888 - job history server (e) 8080 shuffle handlers HTTP port ( not a user-accessible web UI)
YARN RPC Servers: (ports and usage) (a) yarn.resourcemanager.address (b) yarn.resourcemanager.admin.address (c) yarn.resourcemanager.scheduler.address (d) yarn.resourcemanager.resourcetracker.address (e) yarn.nodemanager.address (f) yarn.nodemanager.localizer.address (g) mapreduce.jobhistory.address
(a) 8032 Used by client to communicate with the resource manager (b) 8033 Used by the admin client to communicate with the resource manager (c) 8030 Used by in-cluster application masters to communicate with the resource manager (d) 8031 Used by in-cluster node managers to communicate with the resource manager (e) 0 Used by in-cluster application masters to communicate with node managers (f) 8040 (g) 10020 Used by the client, typically outside the cluster, to query job history
A. What is distcp? (use case) B. How do you run distcp? C. What are its options? (2)
(a) A Hadoop program for copying large amounts of data to and from the Hadoop Filesystem in parallel. use case: Transferring data between two HDFS clusters (b) hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar (c) overwrite - distcp will skip files that already exist without specifying this update - updates only the files that have changed
What are OutputCommitters used for? How do you implement it?
(a) A commit protocol to ensure that jobs or tasks either succeed or fail cleanly. The framework ensures that in the event of multiple attempts for a particular task, only one will be committed; the others will be aborted. (b) old - JobConf.setOutputCommitter() or mapred.output.committer.class new - the OutputCommitter is decided by the OutputFormat, using getOutputCommitter() (default FileOutputCommitter)
What is a Balancer program? Where does it output?
(a) A Hadoop daemon that redistributes blocks from over-utilized datanodes to under-utilized datanodes, while still adhering to the replication placement policy. Moves blocks until the cluster is deemed "balanced". (b) Standard log directory
What is fencing?
A method of stopping corruption if there is an ungraceful failover
What is fencing? What are fencing methods? (4)
(a) A method of stopping corruption if there is ungraceful failover (b) 1. Killing the namenode process 2. Revoking the namenodes access to the shared storage directory 3. Disabling its network port 4. STONITH - (extreme) Shoot the other node in the head - Specialized power distribution unit to force the host machine down
What happens during a Job Completion in MR2? (a) Before (b) During (c) After
(a) Client calls Job.waitForCompletion() as well as polling the application master every 5 seconds via (mapreduce.client.completion.pollinterval) (b) Notification via HTTP callback from the application master (c) Application master and task containers clean up their working state. Job information is archived by the job history server.
What are metrics? Which metrics contexts does Hadoop use?
(a) Data collection from HDFS and Mapreduce daemons (b) dfs, rpc, jvm (datanodes), mapred
What does FSDataOutputStream's sync() method do? What would be the effect of not having calls to sync()? T/F: Closing a file in HDFS performs an implicit sync()
(a) Forces all buffers to be synchronized to the datanodes. When sync() returns successfully, HDFS guarantees that the data written up to that point has reached all the datanodes and is visible to new readers. (b) Possible loss of up to a block of data in the event of a client or system failure (c) True
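A minimal sketch of using sync(), assuming an older-API FSDataOutputStream (newer releases spell this hflush()/hsync()); the file path comes from the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncDemo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path(args[0]));
        out.write("first record\n".getBytes("UTF-8"));
        out.sync();    // data written so far is now visible to new readers
        out.close();   // close() performs an implicit sync()
      }
    }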
What is the function of the Secondary Namenode? Where should the Secondary Namenode be located? True or False: The Secondary Namenode will always lag the Namenode
(a) It doesn't act in the same way as the Namenode. It periodically merges the namespace image with the edit log to prevent the edit log from being too large. (b) On a separate machine from the Namenode because it requires as much memory/CPU usage as the Namenode to run. (c) True
How can you set an arbitrary number of mappers to be created for a job in Hadoop?
You can't set it directly; the number of map tasks is determined by the number of input splits.
A. What is the Hadoop Archive (HAR files)? B. How do you use them? C. How do you list the file contents of a .har? D. What are the limitations of Hadoop Archives?
(a) It is a file archiving facility that packs files into HDFS blocks more efficiently. They can be used as an input to a mapreduce job. It reduces namenode memory usage, while allowing transparent access to files. (b) hadoop archive -archiveName files.har /my/files /my (c) hadoop fs -lsr har://my/files.har (d) 1. Creates a copy of the files (disk space usage) 2. Archives are immutable 3. No compression on archives, only files
What is the purpose of the Interface Progressable? When is Progressable's progress() called? T/F: Only HDFS can use Progressable's progress().
(a) Notifies your application of the progress of data being written to datanodes. (b) Progress is called after each 64KB packet of data is written to the datanode pipeline (c) True
How does a jobtracker choose a (a) reduce task? (b) a map task?
(a) Reduce task - takes the next in the list. (No data locality considerations) (b) Map task - Takes into account the tasktracker's network location and picks a task whose input split is as close as possible to the tasktracker.
What are MapReduce defaults? (a) job.setInputFormatClass() (b) job.setMapperClass() (c) job.setOutputKeyClass() (d) job.setOutputValueClass()
(a) TextInputFormat.class (b) Mapper.class (c) LongWritable.class (d) Text.class
The following YARN config files do what? (a) yarn-env.sh (b) yarn-site.xml (c) mapred-site.xml
(a) environment variables (b) config settings for YARN daemons (c) properties still used without jobtracker & tasktracker related properties
How do you disable checksum verification? How do you disable checksums on the client side?
(a) fs.setVerifyChecksum(false); fs.open(); OR -copyToLocal -ignoreCrc (b) using RawLocalFileSystem instead of FileSystem. fs.file.impl = org.apache.hadoop.fs.RawLocalFileSystem
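A hedged sketch of the first approach; args[0] is assumed to be a full file URI such as hdfs://namenode/path/file:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NoChecksumRead {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create(args[0]), new Configuration());
        fs.setVerifyChecksum(false);                 // skip client-side checksum verification
        FSDataInputStream in = fs.open(new Path(args[0]));
        System.out.println(in.read());               // read the first byte
        in.close();
      }
    }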
How do you get the block verification report for a datanode? How do you get a list of blocks on the datanode and their status?
(a) http://datanode:50075/blockScannerReport (b) http://datanode:50075/blockScannerReport?listBlocks
What are the default Streaming attributes? (a) input (b) output (c) inputFormat (d) mapper
(a) input/ncdc/sample.txt (b) output (c) org.apache.hadoop.mapred.TextInputFormat (d) /bin/cat
What are the values to set for criteria to run a task uberized? How do you disable running something uberized?
(a) mapreduce.job.ubertask.maxreduces mapreduce.job.ubertask.maxbytes mapreduce.job.ubertask.maxmaps (b) mapreduce.job.ubertask.enable = false
What do the following CompressionCodecFactory methods do? (a) getCodec() (b) removeSuffix
(a) maps a filename extension to a CompressionCodec (takes Path object for the file in question) (b) strips off the file suffix to form the output filename (ex file.gz => file )
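A minimal sketch of both methods together; file.gz is a hypothetical input file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class CodecLookup {
      public static void main(String[] args) {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        Path path = new Path("file.gz");                  // hypothetical input file
        CompressionCodec codec = factory.getCodec(path);  // maps .gz to GzipCodec
        // Strip the codec's suffix to form the output filename: file.gz => file
        String output = CompressionCodecFactory.removeSuffix(
            path.toString(), codec.getDefaultExtension());
        System.out.println(output);
      }
    }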
What do the following Safe Mode properties do? (a) dfs.replication.min (b) dfs.safemode.threshold.pct (c) dfs.safemode.extension
(a) minimum number of replicas that have to be written for a write to be successful. (b) (0.999) Proportion of blocks that must meet minimum replication before the system will exit Safe Mode (c) (30,000) Time(ms) to extend Safe Mode after the minimum replication has been satisfied
New API or Old API? (a) Job (b) JobConf (c) org.apache.hadoop.mapred (d) org.apache.hadoop.mapreduce
(a) new api (b) old api (c) old api (d) new api
How do you create a FileSystem instance? How do you create a local FileSystem instance?
(a) public static FileSystem get(URI uri, Configuration conf, String user) throws IOE -uri and user are optional (b) public static LocalFileSystem getLocal(Configuration conf) throws IOE
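A small sketch of calling both factories; the hdfs://localhost/ URI is just a placeholder namenode address:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocalFileSystem;

    public class GetFileSystems {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
        LocalFileSystem local = FileSystem.getLocal(conf);
        System.out.println(hdfs.getUri() + " " + local.getUri());
      }
    }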
What do the following options for dfsadmin do? (a) -help (b) -report (c) -metasave (d) -safemode (e) -saveNamespace (f) -refreshNodes (g) -upgradeProgress (h) -finalizeUpgrade (i) -setQuota (j) -clrQuota (k) -setSpaceQuota (l) -clrSpaceQuota (m) -refreshServiceACL
(a) shows help for given command or -all (b) shows filesystem statistics & info on datanodes (c) Dumps info on blocks being replicated/deleted and connected datanodes to logs (d) Changes or queries the state of Safe Mode (e) Saves current in-memory filesystem image to a new fsimage file and resets the edits file (only in safe mode) (f) Updates the set of datanodes that are permitted to connect to the namenode (g) Gets info on the progress of an HDFS upgrade and forces an upgrade to proceed (h) After upgrade is complete it deletes the previous version of the namenode and datanode directories (i) Sets directory quota. Limit on files/directories in the directory tree. Preserves namenode memory by preventing the creation of a large number of small files. (j) Clears specified directory quotas (k) Sets space quotas on directories. Limit on size of files in directory tree. (l) Clears specified space quotas (m) Refreshes the namenode's service-level authorization policy file.
What are MapReduce defaults? (e) job.setPartitionerClass() (f) job.setNumReduceTasks() (g) job.setReducerClass() (h) job.setOutputFormatClass()
(e) HashPartitioner.class - hashes a records key to determine which partition the record belongs in (f) 1 (g) Reducer.class (h) TextOutputFormat.class (tab delimited)
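Putting the (a)-(h) defaults together, a hedged sketch of a minimal new-API driver that states them explicitly; Job.getInstance() assumes a reasonably recent release, and the input/output paths come from the command line:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class MinimalDriverWithDefaults {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(MinimalDriverWithDefaults.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Everything below simply restates the defaults listed above
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(Mapper.class);                    // identity mapper
        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(1);
        job.setReducerClass(Reducer.class);                  // identity reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }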
What are the default Streaming attributes? (e) partitioner (f) numReduceTasks (g) reducer (h) outputFormat
(e) org.apache.hadoop.mapred.lib.HashPartitioner (f) 1 (g) org.apache.hadoop.mapred.lib.IdentityReducer (h) org.apache.hadoop.mapred.TextOutputFormat
How is an output path specified for a MapReduce job?
FileOutputFormat.setOutputPath(____); the directory should not already exist
What are common tips when installing Hadoop
- Create a hadoop user, for smaller clusters you can create the user home directory on an NFS server outside the cluster - Change the owner of the Hadoop files to the hadoop user and group - Keep config in sync between machines using rsync or shell tools (dsh, pdsh) - If you introduce a stronger class of machine, you can manage separate configs per machine class (using Chef, Puppet, cfengine)
How will you write a custom partitioner for a Hadoop job?
- Create a new class that extends the Partitioner class - Override the getPartition method - In the wrapper that runs the MapReduce job, either add the custom partitioner to the job programmatically using setPartitionerClass, or add the custom partitioner to the job as a config file (if your wrapper reads from a config file or Oozie)
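A hedged sketch of those steps; the Text/IntWritable types and the first-character partitioning rule are made up for illustration:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Route records by the first character of the key
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
      }
    }
    // In the driver: job.setPartitionerClass(FirstCharPartitioner.class);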
Consider this scenario in an M/R system: the HDFS block size is 64 MB, the input format is FileInputFormat, and we have 3 files of size 64KB, 65MB, and 127MB. How many input splits will be made by the Hadoop framework?
Hadoop will make 5 splits as follows: - 1 split for the 64KB file - 2 splits for the 65MB file - 2 splits for the 127MB file
How to get the values to also be sorted before reducing?
- Make the key a composite of the natural key and the natural value. - The sort comparator should order by the composite key, that is, the natural key and natural value. - The partitioner and grouping comparator for the composite key should consider only the natural key for partitioning and grouping.
What will a Hadoop job do if you try to run it with an output directory that is already present? Will it overwrite it, warn you and continue, or throw an exception and exit?
The Hadoop job will throw an exception and exit.
What should Pig not be used for?
- Pig doesn't perform as well as programs written in MapReduce (the gap is closing) - Designed for batch processing, therefore if you want a query that only touches a small subset of data in a large set, Pig will not perform well because it was meant to scan the entire set.
What constitutes as progress in MapReduce?
- Reading an input record - Writing an output record - Setting the status description on a reporter (using Reporter's setStatus() method) - Incrementing a Counter (Reporter's incrCounter() method) - Calling Reporter's progress() method
The input to a MapReduce job is a set of files in the data store that are spread out over the
HDFS
Using the command line in Linux, how will you (a) see all jobs running in the Hadoop cluster and (b) kill a job?
hadoop job -list hadoop job -kill jobID
Name the most common Input Formats defined in Hadoop? Which one is default?
- TextInputFormat - KeyValueInputFormat - SequenceFileInputFormat TextInputFormat is the Hadoop default.
Hadoop 2.x releases fixes namenode failure issues by adding support for HDFS High Availability (HA). What were the changes?
- You can now have 2 Namenodes in active standby configuration. - Shared storage must be used for the edit log - Datanodes send block reports to all Namenodes because block mappings are stored in the Namenodes memory, not on disk - Clients must be configured to handle Namenode failover
What are CompressionCodecs' two methods that allow you to compress or decompress data?
- createOutputStream (OutputStream out) creates a CompressionOutputStream to write your data to, to be compressed. - createInputStream (InputStream in) creates a CompressionInputStream to read uncompressed data from the underlying stream.
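A hedged sketch using the first method to gzip stdin to stdout; the codec choice and buffer size are arbitrary:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class StreamCompressor {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        // Wrap stdout so everything written to it is compressed
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();   // flush the compressed stream without closing stdout
      }
    }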
HDFS 4
- optimized for a write-once, read-many access pattern - storage nodes run a datanode to manage blocks; these are coordinated by the NameNode - it uses replication instead of hardware HA features
All compression algorithms have a space/time tradeoff. 1. How do you maximize for time? 2. How do you maximize for space?
-1 (speed) -9 (space) ex: gzip -1 file (Creates a compressed file file.gz using the fastest compression method)
What are some concrete implementations of InputFormat?
-CombineFileInputFormat -DBInputFormat -FileInputFormat -KeyValueTextInputFormat -NLineInputFormat -SequenceFileAsBinaryInputFormat -SequenceFileAsTextInputFormat -StreamInputFormat -TeraInputFormat -TextInputFormat
What are some concrete implementations of RecordReader?
-DBInputFormat.DBRecordReader -InnerJoinRecordReader -KeyValueLineRecordReader -OuterJoinRecordReader -SequenceFileAsTextRecordReader -SequenceFileRecordReader -StreamBaseRecordReader -StreamXmlRecordReader
What are the benefits of File Compression?
1. Reduces the space needed to store files. 2. Speeds up data transfer across the network to and from the disk
Fill in the blank: The command for removing a file from hadoop recursively is hadoop dfs ___________ <directory>
-rmr
Give an example of a meta-character
.
Have you ever used Counters in Hadoop? Give us an example scenario.
...
cpu power has grown much faster than network and disk speeds
...
How may reduces can the local job runner run?
0 or 1
When running under the local jobrunner, how many reducers are supported?
0 or 1
How many reducers do you need in Top Ten?
1
What is 1024 Exabytes?
1 Zettabyte
All Oozie workflows must have which control nodes?
1 start node <start to="max-temp-mr"/> 1 end node <end name="end"/> 1 kill node <kill name="fail"><message>MapReduce failed error...</message></kill> When the workflow starts it goes to the node specified in start. If workflow succeeds -> end If workflow fails -> kill
What are the steps implemented by the JobClient's submitJob() method for job initialization?
1) Ask jobtracker for new job ID 2) Check job output specification 3) Compute InputSplits 4) Copies resources needed to run job -jar file -configuration file 5) Tells the jobtracker that job is ready for execution
What are the steps taken by the task tracker for task execution?
1) Copies resources from shared file system to the task trackers' file system -jar -distributed cache files 2) creates local working directory and unjars jar 3) creates instance of TaskRunner to run the task 4) TaskRunner launches JVM 5) Runs task
List the items in a MapReduce job tuning checklist
1) Number of Mappers 2) Number of Reducers 3) Combiners 4) Intermediate Compression 5) Custom serialization 6) Shuffle Tweaks
What steps does the job scheduler take to create a list of tasks to run?
1) Retrieve InputSplits 2) create one map task for each split 3) creates reduce tasks based on the mapred.reduce.tasks property (task IDs are given as tasks are created)
Ten Steps hadoop follows to run a MapReduce job.
1) Run Job 2) Get new job id 3) Copy job resources 4) Submit job 5) initialize job 6) Retrieve input split 7) Heartbeat 8) Retrieve job resources 9) Launch 10) Run
What does Input Format do?
1) Validate the input-specification of the job. 2) Split-up the input file(s) into logical InputSplits, each of which is assigned to an individual Mapper 3) Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper
What does OutputFormat do?
1) Validate the output specification of the job, e.g., check that the output directory doesn't already exist. 2) Provide the RecordWriter implementation to be used to write out the output files of the job.
What mechanisms are provided to make the NameNode resilient to failure?
1) backup files that make up the persistent state of the file system metadata: -write to local disk -write to a remote NFS mount 2) run a secondary namenode - does not act as a namenode - periodically merges the namespace image with the edit log
What are the options for storing files in HDFS? (think compression and splitting)
1) uncompressed 2) compressed in format that support splitting (bzip2) 3) split file and compress resulting pieces 4) use a sequence file 5) use an Avro data file
1. job IDs are __ based. 2. task IDs are ___ based. 3. attempt IDs are ___ based.
1. 1 2. 0 3. 0
How much memory does Hadoop allocate per daemon? Where is it controlled?
1. 1GB 2. HADOOP_HEAPSIZE in hadoop_env.sh
1. Task logs are deleted after how long? 2. Where can it be configured? 3. How do you set the cap size of a log file?
1. 24 hours 2. mapred.userlog.retain.hours 3. mapred.userlog.limit.kb
What are the benefits of having a block abstraction for a distributed filesystem? (3)
1. A file can be larger than any disk on the network. It can be put into blocks and distributed without size concerns 2. It simplifies storage - Since we know how many blocks can be stored on a given disk through a simple calculation. It allows metadata to be stored separately from the data chunks. 3. Copies of blocks can be made (typically 3) and used in case of a node failure.
What should you do before upgrading?
1. A full disk fsck (save output and compare after upgrade) 2. clear out temporary files 3. delete the previous version (finalizing the upgrade)
What are things to look for on the Tuning Checklist? (How can I make a job run faster?) 1. Number of Mappers 2. Number of Reducers 3. Combiners 4. Intermediate Compression 5. Custom Serialization 6. Shuffle Tweaks
1. A mapper should run for about a minute. Any shorter and you should reduce the number of mappers. 2. Slightly fewer reducers than the number of reduce slots in the cluster. This allows the reducers to finish in a single wave, using the cluster fully. 3. Check if a combiner can be used to reduce the amount of data going through the shuffle. 4. Job execution time can almost always benefit from enabling map output compression. 5. Use RawComparator if you are using your own custom Writable objects or custom comparators. 6. Lots of tuning parameters for memory management
1. What is Apache Oozie? 2. What are its two main parts? 3. What is the difference between Oozie and JobControl? 4. What do action nodes do? control nodes? 5. What are two possible types of callbacks?
1. A system for running workflows of dependent jobs 2. (a) Workflow engine - stores and runs workflows composed of different types of Hadoop jobs (MapReduce, Pig, Hive) (b) Coordinator engine - runs workflow jobs based on pre-defined schedules and data availability. 3. JobControl runs on the client machine submitting the jobs. Oozie runs as a service in the cluster and client submit workflow definitions for immediate or later execution 4. (a) Performs a workflow task such as: moving files in HDFS, running MapReduce, Streaming or Pig jobs, Sqoop imports, shell scripts, java programs (b) Governs the workflow execution using conditional logic 5. (a) On workflow completion, HTTP callback to client to inform workflow status. (b) receive callbacks every time a workflow enters/exits an action node
Describe the process of decommissioning nodes (7 steps).
1. Add network address of decommissioned node to exclude file. 2. Update the namenode: hadoop dfsadmin -refreshNodes 3. Update jobtracker: hadoop mradmin -refreshNodes 4. Web UI - check that node status is "decommission in progress" 5. When status = "decommissioned" all blocks are replicated. The node can be shut down. 6. Remove from the include file, then hadoop dfsadmin -refreshNodes hadoop mradmin -refreshNodes 7. Remove nodes from slaves file.
Describe the process of commissioning nodes (6 steps).
1. Add network address of new nodes to the include file. 2. Update the namenode with new permitted tasktrackers: hadoop dfsadmin -refreshNodes 3. Update the jobtracker with the new set of permitted tasktrackers: hadoop mradmin -refreshNodes 4. Update the slaves file with the new nodes 5. Start the new datanode/tasktrackers 6. Check that the new datanodes/tasktrackers show up in the web UI.
What are the steps to access a service with Kerberos?
1. Authentication - client authenticates themselves to get a TGT (Ticket-Granting Ticket). Explicitly carried out by user using the kinit command which prompts for a password. (Good for 10 hours) For automating this you can create a keytab file using ktutil command. 2. Authorization - (not user level, the client performs) The client uses the TGT to request a service ticket from the Ticket Granting Server 3. Service request - (not user level) Client uses service ticket to authenticate itself to the server that is providing the service. (ex: namenode, jobtracker)
What two parts make up the Key Distribution Center (KDC) ?
1. Authentication Server 2. Ticket Granting Server
What are the Writable wrapper classes for Java primitives?
1. BooleanWritable 2. ByteWritable 3. IntWritable 4. VIntWritable 5. FloatWritable 6. LongWritable 7. VLongWritable 8. Text Use IntWritable for short and char. V stands for variable (as in variable length).
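A small sketch illustrating the fixed vs. variable-length encodings; the value 163 and the serialize helper are arbitrary:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.VIntWritable;
    import org.apache.hadoop.io.Writable;

    public class WritableSizes {
      static byte[] serialize(Writable w) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        w.write(dataOut);           // each Writable writes its own binary form
        dataOut.close();
        return out.toByteArray();
      }

      public static void main(String[] args) throws Exception {
        System.out.println(serialize(new IntWritable(163)).length);   // always 4 bytes
        System.out.println(serialize(new VIntWritable(163)).length);  // 2 bytes for this value
      }
    }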
What does FileSystem check do? (fsck) usage?
1. Checks the health of files in HDFS. Looks for blocks that are missing from all datanodes as well as under/over replicated blocks. fsck does a check by looking at the metadata files for blocks and checking for inconsistencies. 2. hadoop fsck / (directory to recursively search)
The balancer runs until: (3)
1. Cluster is balanced 2. It cannot move any more blocks 3. It loses contact with the Namenode
What are the methods to make the Namenode resistant to failure? (2)
1. Configure Hadoop so that it writes its persistent state to multiple filesystems 2. Run a Secondary Namenode
How does the reduce side of the Shuffle work? (Copy Phase)
1. Copy Phase - after a map task completes, the reduce task starts copying their outputs. Small numbers of copier threads are used so it can fetch output in parallel. (default = 5 mapred.reduce.parallel.copies) The output is copied to the reduce task JVM's memory. (if its small enough) otherwise, its copied to disk. When the in-memory buffer reaches threshold size or reaches threshold number of map outputs it is merged and spilled to disk. mapred.job.shuffle.merge.percent mapred.inmem.merge.threshold A combiner would be run here if specified. Any map outputs that were compressed have to be decompressed in memory. When all map outputs have been copied we continue to the Sort phase.
What are the steps in packaging a job?
1. Create a jar file using Ant, Maven or the command line. 2. Include any needed classes in the root/classes directory. Dependent jar files can go in root/lib 3. Set the HADOOP_CLASSPATH to dependent jar files.
What are the possible Hadoop compression codecs? Are they supported Natively or do they use a Java Implementation?
1. DEFLATE - (Java yes, Native yes) org.apache.hadoop.io.compress.DefaultCodec 2. gzip (Java yes, Native yes) org.apache.hadoop.io.compress.GzipCodec 3. bzip2 (Java yes, Native no) org.apache.hadoop.io.compress.BZip2Codec 4. LZO (Java no, Native yes) com.hadoop.compression.lzo.LzoCodec 5. LZ4 (Java no, Native yes) org.apache.hadoop.io.compress.Lz4Codec 6. Snappy (Java no, Native yes) org.apache.hadoop.io.compress.SnappyCodec
What are the three possibilities of Map task/HDFS block locality?
1. Data local 2. Rack local 3. Off-rack
command to start distributed file system
bin/start-dfs.sh
How does the map portion of the MapReduce write output?
1. Each map task has a circular memory buffer that writes output 100MB by default (io.sort.mb) 2. When contents of the buffer meet threshold size (80% default io.sort.spill.percent) a background thread will start to spill the contents to disk. Map outputs continue to be written to the buffer while the spill is taking place. If the buffer fills up before the spill is complete, it will wait. 3. Spills are written round robin to directories specified (mapred.local.dir) Before it writes to disk, the thread divides the data into partitions based on reducer and then the partition is sorted by key. If a combiner exists, it is then run.
1.What is Job History? 2. Where are the files stored? 3. How long are History files kept? 4. How do you view job history via command line?
1. Events and configuration for a completed job. 2. The local file system of the jobtracker (the history subdir of the logs directory). 3. 30 days (hadoop.job.history.location); a 2nd copy in the _logs/history subdirectory of the job's output location (hadoop.job.history.user.location) is never deleted. 4. hadoop job -history
What controls the schedule for checkpointing? (2)
1. Every hour (fs.checkpoint.period) 2. If the edits log reaches 64MB (fs.checkpoint.size)
The FileStatus class holds which data? (6)
1. File length 2. Block size 3. Replication 4. Modification time 5. Ownership 6. Permission Information
How do you set precedence for the users classpath over hadoop built in libraries?
1. HADOOP_USER_CLASSPATH_FIRST = true 2. mapreduce.task.classpath.first to true
What are the two types of blk files in the datanode directory structure and what do they do?
1. HDFS blocks themselves (consist of a files raw bytes) 2. The metadata for a block made up of header with version, type information and a series of checksums for sections on the block.
What is Audit logging and how do you enable it?
1. HDFS logs all filesystem requests with log4j at the INFO level 2. log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit = INFO (default WARN)
When are tasktrackers blacklisted? How do blacklisted tasktrackers behave?
1. If more than 4 tasks from the same job fail on a particular tasktracker, (mapred.max.tracker.failures) the jobtracker records this as a fault. If the number of faults is over the minimum threshold (mapred.max.tracker.blacklists) default 4, the tasktracker is blacklisted. 2. They are not assigned tasks. They still communicate with the jobtracker. Faults expire over time (1 per day) so they will get a chance to run again. If the fault can be fixed (ex: hardware) when it restarts it will be re-added.
A MapReduce job consists of three things:
1. Input Data 2. MapReduce Program 3. Configuration Information
What are the steps of an upgrade (when the filesystem layout hasn't changed)
1. Install new versions of HDFS and MapReduce 2. Shut down old daemons 3. Update config files 4. Start up new daemons and use new libraries
How does the default implementation of ScriptBasedMapping work? What happens if there is no user-defined script?
1. It runs a user-defined script to determine the mapping; the script location is configured via topology.script.file.name. The script accepts arguments (IP addresses) and returns a list of network locations 2. All nodes are mapped to a single network location called /default-rack
What does a Jobtracker do when it is notified of a task attempt which has failed? How many times will a task be re-tried before job failure? What are 2 ways to configure failure conditions?
1. It will reschedule the task on a new tasktracker (node) 2. 4 times (default) 3. mapred.map.max.attempts mapred.reduce.max.attempts Tasks may also be allowed to fail up to a certain percentage: mapred.max.map.failures.percent mapred.max.reduce.failures.percent Note: Killed tasks do not count as failures.
How do you do metadata backups?
1. Keep multiple copies of different ages (1hr, 1day,1week) 2. Write a script to periodically archive the secondary namenodes previous.checkpoint subdir to an offsite location 3. Integrity of the backup is tested by starting a local namenode daemon and verifying it has read fsimage and edits successfully.
The Web UI has action links that allow you to do what? How are they enabled?
1. Kill a task attempt 2. webinterface.private.actions = true
What are fencing methods? (4)
1. Killing the namenode process 2. Revoking the namenodes access to the shared storage directory 3. Disabling its network port 4. STONITH - (extreme) Shoot the other node in the head - Specialized power distribution unit to force the host machine down
What are two ways to limit a task's memory usage?
1. Linux ulimit command or mapred.child.ulimit. This should be larger than mapred.child.java.opts otherwise the child JVM might not start 2. Task Memory Monitoring - administrator sets allowed range of virtual memory for tasks on the cluster. Users will set memory usage in their job config, if not it uses mapred.job.map.memory.mb and mapred.job.reduce.memory.mb. This is a better approach because it encompasses the whole task tree and spawned processes. The Capacity Scheduler will account for slot usage based on memory settings.
During Namenode failure an administrator starts a new primary namenode with a filesystem metadata replica and configures datanodes and clients to use the new namenode. The new Namenode won't be able to serve requests until these (3) tasks are completed.
1. Loaded its namenode image into memory 2. Replayed edits from the edit log 3. Received enough block reports from datanodes to leave safe mode
What are the two types of log files? When are they deleted?
1. Logs ending in .log are made by log4j and are never deleted. These logs are for most daemon tasks 2. Logs ending in .out act as a combination standard error and standard output log. Only the last 5 are retained and they are rotated out when the daemon restarts.
Describe the upgrade process. ( 9 Steps)
1. Make sure any previous upgrade is finalized before proceeding 2. Shut down MapReduce and kill any orphaned tasks/processes on the tasktrackers 3. Shut down HDFS, and back up namenode directories. 4. Install new versions of Hadoop HDFS and MapReduce on cluster and clients. 5. Start HDFS with -upgrade option 6. Wait until upgrade completes. 7. Perform sanity checks on HDFS (fsck) 8. Start MapReduce 9. Roll back or finalize upgrade
What are characteristics of these compression formats? 1. Gzip 2. Bzip2 3. LZO, LZ4, Snappy
1. Middle of the space/time tradeoff 2. Compresses more effectively but slower (than Gzip) 3. All optimized for speed, compress less effectively
Which Hadoop MBeans are there? What daemons are they from? (5)
1. NameNodeActivityMBean (namenode) 2. FSNameSystemMBean (namenode) - namenode status metrics ex: # of datanodes connected 3. DatanodeActivityMBean (datanode) 4. FSDatasetMBean (datanode) - datanode storage metrics ex: capacity/free space 5. RPCActivityMBean (all rpc daemons) RPC statistics ex: average processing time
What are the restrictions while being in Safe Mode? (2)
1. Offers only a read-only view of the filesystem to clients. 2. No new datanodes are setup/written to. This is because the system has references to where the blocks are in the datanodes and the namenode has to read them all before coordinating any instructions to the datanodes.
What are reasons to turn off Speculative Execution?
1. On a busy cluster it can reduce overall throughput since there are duplicate tasks running. Admins can turn it off and have users override it per job if necessary. 2. For reduce tasks, since duplicate tasks have to transfer duplicate map inputs which increases network traffic 3. Tasks that are not idempotent. You can make tasks idempotent using OutputComitter. Idempotent(def) apply operation multiple times and it doesn't change the result.
Application Master Failure. 1. Applications are marked as failed if they fail ____ . 2. What happens during failure? 3. How does the MapReduce application manager recover state of which tasks were run successfully? 4. How does the client find a new application master?
1. Once 2. The resource manager notices a missing heartbeat from the application master and starts a new instance of the master running in a new container (managed by the node manager). 3. If recovery is enabled (yarn.app.mapreduce.am.job.recovery.enabled = true) 4. During job initialization, the client asks the resource manager for the application master's address and caches it. On failure the client experiences a timeout when it issues a status update, at which point it goes back to the resource manager to find the new address.
How does task execution work in YARN?
Once a task is assigned to a container via the resource manager's scheduler, the application master starts the container by contacting the node manager. The container is started via the Java application YarnChild, which localizes resources and runs the MapReduce task. YARN does not support JVM reuse.
In HDFS each file and directory have their own? (3 permission structures)
1. Owner 2. Group 3. Mode
How do you manage data backups?
1. Prioritize your data. What must not be lost, what can be lost easily? 2. Use distcp to make a backup to other HDFS clusters (preferably to a different hadoop version to prevent version bugs) 3. Have a policy in place for user directories in HDFS. (how big? when are they backed up?)
What are common remote debugging techniques?
1. Reproduce the failure locally, possibly using a debugger like Java's VisualVM. 2. Use JVM debugging options for JVM out-of-memory errors. Set -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps Dumps the heap to be examined afterward with tools such as jhat or the Eclipse Memory Analyzer 3. Task Profiling - Hadoop provides a mechanism to profile a subset of the tasks in a job. 4. IsolationRunner - (old hadoop) could re-run old tasks
YARN takes responsibilities of the jobtracker and divides them between which 2 components?
1. Resource Manager 2. Application Master
For multiple jobs to be run, how do you run them linearly? Directed Acyclic Graph of jobs?
1. Run each job, one after another, waiting until the previous completes successfully. Throws an exception and the processing stops at the failed job. ex: JobClient.runJob(conf1); JobClient.runJob(conf2); 2. Use Libraries. (org.apache.hadoop.mapreduce.lib.jobcontrol) A JobControl class instance represents a graph of jobs to be run. (a) Indicate jobs and their dependencies (b) Run JobControl in a thread and it runs the jobs in dependency order (c) You can poll progress (d) If a job fails JobControl won't run its dependents (e) You can query status after the jobs complete
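A hedged sketch of the JobControl approach using the new-API classes under org.apache.hadoop.mapreduce.lib.jobcontrol; job1 and job2 are assumed to be fully configured MapReduce jobs created elsewhere:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

    public class TwoStepWorkflow {
      // Runs job2 only after job1 succeeds; both jobs are configured elsewhere
      static void runInOrder(Job job1, Job job2) throws Exception {
        ControlledJob step1 = new ControlledJob(job1, null);
        ControlledJob step2 = new ControlledJob(job2, null);
        step2.addDependingJob(step1);            // (a) declare the dependency

        JobControl control = new JobControl("two-step-workflow");
        control.addJob(step1);
        control.addJob(step2);

        new Thread(control).start();             // (b) run the graph in a thread
        while (!control.allFinished()) {         // (c) poll progress
          Thread.sleep(1000);
        }
        control.stop();
      }
    }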
What are these other common benchmarks used for? 1. MRBench 2. NNBench 3. Gridmix
1. Runs a small job a number of times. Acts as a good counterpoint to sort. 2. Useful for load-testing namenode hardware. 3. Suite of benchmarks designed to model a realistic cluster
When does Safe Mode start? When does it end?
1. Safe mode starts when the namenode is started (after loading the fsimage and edit log). 2. It ends when the minimum replication condition has been met (dfs.replication.min), plus an additional 30 seconds. When you are starting a newly formatted cluster, the namenode does not go into safe mode since there are no blocks in the system yet.
What are three ways to execute Pig programs? (all work in local and mapreduce)
1. Script: pig script.pig, or with the -e option you can run short scripts inline on the command line. 2. Grunt - the interactive shell; Grunt starts when no script or -e option is given. Run scripts from within Grunt using "run" or "exec". 3. Embedded - run from Java using the PigServer class (sketched below). For access to Grunt from Java, use the PigRunner class.
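A sketch of the embedded option using PigServer (the query text and paths are hypothetical):
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

PigServer pig = new PigServer(ExecType.LOCAL);   // or ExecType.MAPREDUCE for a cluster
pig.registerQuery("records = LOAD 'input/sample.txt' AS (year:chararray, temp:int);");
pig.store("records", "output/records");          // executes the pipeline and stores the result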
What does running start-mapred.sh do? (2 steps)
1. Starts a jobtracker on the local machine 2. Starts a tasktracker on each machine in the slaves file
What does running start-dfs.sh do? ( 3 steps)
1. Starts a namenode on the machine the script was run on 2. Starts a datanode on each machine listed in the slaves file 3. Starts a secondary namenode on each machine listed in the masters file
What are the 6 key points of HDFS Design?
1. Storing very large files 2. Streaming data access - a write-once, read-many pattern; the time to read the whole data set matters more than the latency of reading the first record 3. Commodity hardware - designed to keep working through the node failures that are more frequent on commodity hardware 4. Not for low-latency data access - optimized for high throughput at the expense of latency 5. Not for lots of small files - the namenode holds filesystem metadata in memory, so the maximum number of files is governed by the namenode's memory 6. No multiple writers or arbitrary file modifications - an HDFS file is written by a single writer and writes are always made at the end of the file
True or False: 1. System properties take priority over properties defined in resource files. 2. System properties are accessible through the configuration API.
1. TRUE 2. FALSE, it will be lost if not redefined in a configuration file.
(1) True or False: Hadoop has enough Unix assumptions that it is unwise to run it on non-Unix platforms in production (2) True or False: For a small cluster (10 nodes) it is acceptable to have the namenode and jobtracker on a single machine
1. TRUE 2. TRUE, as long as you have a copy of the namenode metadata on a remote machine. Eventually, as the number of files grows, the namenode should be moved to a separate machine because it is a memory hog.
What are the Steps of Task Execution? (MR1)
1. Tasktracker copies the job JAR from the shared filesystem to the tasktrackers filesystem. Copies any files needed from distributed cache. 2. Tasktracker creates a local working directory for the task and un-jars the contents 3. Tasktracker creates an instance of TaskRunner: (a) TaskRunner launches a new JVM to run each task in. ( so that any bugs in user-defined maps or reduce functions don't cause the tasktracker to crash or hang) (b) The child tasks communicate via the umbilical interface every few seconds until completion.
What are the YARN entities?
1. The Client - submits the MapReduce job 2. resource manager - coordinates allocation of compute resources 3. node managers - launch and monitor the compute containers on machines in the cluster 4. application master - coordinates the tasks running the MapReduce job. The application master and MapReduce tasks run in containers that are scheduled by the resource manager and managed by node managers. 5. Distributed Filesystem
How does MR2 handle runtime exception/failure and sudden JVM exits? Hanging tasks? What are criteria for job failure?
1. The application master marks them as failed. 2. The application master notices an absence of pings over the umbilical channel and marks the task attempt as failed. 3. Same as MR1, same config options: a task is marked as failed after 4 attempts, and the job fails once a configured percentage of map/reduce tasks have failed.
After a successful upgrade, what should you do?
1. remove old installation and config files 2. fix any warnings in your code or config 3. Change the environment variables in your path. HADOOP_HOME => NEW_HADOOP_HOME
How does task assignment work in YARN? (only if not ubertask)
1. The application master requests containers for all MapReduce tasks in the job from the resource manager. All requests, piggybacked on heartbeat calls, include information about each map tasks data locality and memory requirements for tasks. 2. The scheduler uses locality information to make placement decisions
What are the four entities of MapReduce1 ?
1. The client 2. The jobtracker (coordinate the job run) 3. The tasktrackers (running tasks) 4. The distributed filesystem (sharing job files)
What are the steps of a File Write? (5)
1. The client calls create() on the DistributedFileSystem. 2. The DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem namespace, with no blocks associated with it. The namenode checks permissions and whether the file already exists; if the checks pass, it creates a record of the new file. The DistributedFileSystem returns an FSDataOutputStream which wraps a DFSOutputStream; the DFSOutputStream handles communication with the datanodes and the namenode. 3. As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue, the "data queue". 4. The DataStreamer consumes the data queue and asks the namenode to allocate new blocks by picking a list of suitable datanodes to hold the replicas (three by default). The datanodes form a pipeline and the DataStreamer streams the packets to the first datanode. The DFSOutputStream also maintains an internal queue of packets waiting to be acknowledged (the "ack queue"). 5. A packet is removed from the ack queue only when it has been acknowledged by all datanodes in the pipeline. (A client-side sketch follows below.)
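From the client's point of view the whole sequence is hidden behind a few calls; a minimal sketch (the path and data are hypothetical):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);                                   // DistributedFileSystem on HDFS
FSDataOutputStream out = fs.create(new Path("/user/tom/example.txt"));  // RPC to the namenode
out.writeUTF("hello hdfs");                                             // buffered into packets on the data queue
out.close();                                                            // flushes packets, waits for acks, closes the file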
What are the steps of a File Read? (6)
1. The client opens a file by calling FileSystem's open method (HDFS - uses an instance of DistributedFileSystem) 2. The FileSystem calls the Namenode for file block locations. The Namenode returns locations of datanodes sorted by proximity to the client. The DistributedFileSystem returns a FSDataInputStream to the client for it to read data from. The client calls read() on the stream. 3. DFSInputStream finds the first (closest) datanode and connects to it to get access to the first block of the file. It is streamed back to the client, which calls read() repeatedly. 4. When the data is at the end of the block, it will close the stream. The DFSInputStream will then find the next best datanode for the next block. 5. The process continues, it will call the namenode for the next set of blocks. 6. When the client is finished, it calls close() on the stream.
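The client-side view of a read is just open/read/close; a minimal sketch (hypothetical path):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FSDataInputStream in = fs.open(new Path("/user/tom/example.txt"));  // namenode supplies block locations
try {
  IOUtils.copyBytes(in, System.out, 4096, false);                   // streams blocks from the closest datanodes
} finally {
  IOUtils.closeStream(in);
}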
What is the user's task classpath comprised of?
1. The job JAR file 2. Any JAR files contained in the lib/ directories of the job jar file. 3. Any files added to the distributed cache using -libjars option or the addFileToClasspath() method on DistributedCache (old api) or Job (new api)
How does Job Initialization work in MapReduce 1?
1. The job is put into an internal queue from where the job scheduler picks it up and initializes it: it creates an object to represent the job being run, which encapsulates its tasks and the bookkeeping information used to track their status. 2. The job scheduler retrieves the input splits computed by the client from the shared filesystem and creates one map task per split; reduce tasks are also created. A job setup task is created, which tasktrackers run before any of the job's tasks, and a job cleanup task is created to run afterward and delete the temporary working space for the task output. 3. Tasktrackers send heartbeats to the jobtracker; the heartbeat also indicates whether the tasktracker is ready for a new task. The jobtracker chooses a job and then a task within it to send in response.
What happens when a jobtracker receives a notification that the last task for a job is complete?
1. The jobtracker changes the status of the job to "successful" 2. When the Job object polls for status, it prints a message to tell the user and returns from the waitForCompletion() method. 3. Job statistics and counters are printed 4. (optional) HTTP job notification (set job.end.notification.url) 5. Cleans up working state for the job and instructs tasktrackers to do the same.
Pig is made up of two parts. What are they?
1. The language used to express data flows (Pig Latin) 2. The execution environment for running Pig Latin programs: (a) local execution (single JVM) for small datasets on the local file system (b) distributed execution on a Hadoop cluster
If two configuration files set the same property, which does Hadoop use?
1. The last one added, 2. unless the property is marked "final" in an earlier resource, in which case the earlier, final definition wins.
What happens when a datanode fails while data is being written to it?
1. The pipeline is closed 2. Any packets in the ack queue are written to the front of the data queue 3. The current block on the good datanode is given a new identity so if the failed datanode comes back up, it won't continue with the block. 4. The failed datanode is removed from the pipeline 5. The remainder of the blocks data is written to the remaining 2 datanodes in the replication structure. 6. When the client is finished writing, it calls close(), which flushes the remaining packets in the datanode pipeline. It waits for acknowledgement before contacting the namenode to close the file. 7. The namenode notes if any block is under-replicated and arranges for another replication
How does job initialization work in YARN?
1. The scheduler allocates a container and the resource manager then launches the application master's process there, under the node manager's management. 2. The application master creates a map task object for each split, as well as a number of reduce tasks (configured via mapreduce.job.reduces). 3. The application master decides whether to run the tasks in the same JVM as itself (uberized) or in parallel on the cluster. A job is uberized only if it is small: fewer than 10 mappers, a single reducer, and an input smaller than one HDFS block.
Describe the Checkpoint process. (5 steps)
1. The secondary asks the primary to roll its edits file: edits => edits.new (on the primary). 2. The secondary retrieves fsimage and edits from the primary. 3. The secondary loads fsimage into memory and applies the edits, then creates a new fsimage file. 4. The secondary sends the new fsimage to the primary (HTTP POST). 5. The primary replaces the old fsimage with the new one, and the old edits file with edits.new. It updates fstime to record the time the checkpoint was taken.
How could you debug a MapReduce program?
1. Use a debug statement to log to standard error, together with a message to update the task status to alert us to look at the error log. 2. Create a custom counter to count the total number of records with implausible values in the entire dataset (valuable to see how common an occurrence it is). 3. If the amount of debug data is large, add it to the map's output for analysis and aggregation in the reducer. 4. Write a program to analyze the log files afterwards. ex (debugging in the mapper): if (airTemperature > 1000) { System.err.println("Temperature over 100 degrees for input: " + value); context.setStatus("Detected possibly corrupt record: see logs."); context.getCounter(Temperature.OVER_100).increment(1); }
What do the following SSH settings do? 1. ConnectTimeout 2. StrictHostKeyChecking
1. Used to reduce the connection timeout value so the control scripts don't wait around. 2. If set to no, it automatically adds new host keys. If ask(default),prompts the user to accept host key (not good for a large cluster)
How might a job fail? (MR1)
1. User code throws a runtime exception - the child JVM reports the error to the tasktracker before it exits; the tasktracker marks the task as failed and the error ends up in the user logs. 2. Streaming - if the streaming process exits with a nonzero exit code, the task is marked as failed (when stream.non.zero.exit.is.failure is true). 3. Sudden exit of the child JVM - the tasktracker notices the exit and marks the task as failed.
What security enhancements have been added to Hadoop?
1. Users can view and modify only their own jobs, not others', using ACLs. 2. A task may communicate only with its parent tasktracker. 3. The shuffle is secure, but not encrypted. 4. A datanode may be run on a privileged port (lower than 1024) to make sure it starts securely. 5. When tasks are run as the user who submitted the job, the distributed cache is secure; the cache was divided into secure and shared portions. 6. Malicious users can't get rogue secondary namenodes, datanodes, or tasktrackers to join the cluster; daemons are required to authenticate with the master node.
A container has 2 types of memory constraints. What are they?
1. Virtual memory constraint - a container cannot exceed a given multiple of its physical memory, set by yarn.nodemanager.vmem-pmem-ratio (usually 2:1). 2. The scheduler's min/max memory allocations: yarn.scheduler.capacity.minimum-allocation-mb and yarn.scheduler.capacity.maximum-allocation-mb.
Reduce tasks are broken down on the jobtracker web UI. What do: 1. copy 2. sort 3. reduce refer to?
1. When map outputs are being transferred to the reducers tasktracker. 2. When the reduce inputs are being merged. 3. When the reduce function is being run to produce the file output.
How do you benchmark HDFS?
1. Write: hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000 (writes 10 files of 1,000 MB each); results are in TestDFSIO_results.log, data under /benchmarks/TestDFSIO. 2. Read: hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000 (reads 10 files of 1,000 MB each) 3. Clean: hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean
How do you benchmark mapreduce?
1. Write: hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar randomwriter random-data (generates some random data) 2. Sort: hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort random-data sorted-data (runs the sort program; progress is visible at the jobtracker web URL) 3. Verify the data is sorted correctly: hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar testmapredsort -sortInput random-data -sortOutput sorted-data (reports success or failure)
What types does the addInputPath() method accept?
1. a single file 2. directory 3. file pattern
What are the features of Grunt?
1. autocomplete mechanism. 2. remembers previous/next commands that were run
What do the following Oozie components specify? 1. map-reduce action 2. mapred.input.dir / mapred.output.dir
1. contains (a) job-tracker - specifies the jobtracker to submit the job to (b) name-node - URI for data input/output (c) prepare (optional) - runs before the mapreduce job; used for directory deletion (e.g., the output dir before the job runs) 2. Used to set the FileInputFormat input paths and FileOutputFormat output paths
What are the benefits of Pig?
1. cuts down on development process of MapReduce (even faster than Streaming) 2. Issuing command line tasks to mine data is fast and easy 3. Can process terabytes of data 4. Provides commands for data introspection as you are writing scripts 5. Can run on sample subsets of data
Datanodes permitted/not permitted to connect to namenodes if specified in ___________. Tasktrackers that may/ may not connect to the jobtracker are specified in ___________.
1. dfs.hosts / dfs.hosts.exclude 2. mapred.hosts / mapred.hosts.exclude
How do you run an Oozie workflow job?
1. export OOZIE_URL="http://localhost:11000/oozie" (tells oozie command which server to use) 2. oozie job -config ch05/src.../max-temp-workflow.properties -run (run - runs the workflow) (config - local java properties file containing definitions for the parameters in the workflow xml) 3. oozie job -info 000000009-112....-oozie-tom-W (shows the status, also available via web url)
If I want to keep intermediate failed or succeeded files, how can I do that? Where are the intermediate files stored?
1. failed - keep.failed.task.files = true succeeded - keep.task.files.pattern = (regex of task Ids to keep) 2. mapred.local.dir/taskTracker/jobcache/job-ID/task-attempt-ID
Trash: 1. How do you set up Trash? 2. Where do you find Trash files? 3. Will programmatically deleted files be put in Trash? 4. How do you manually take out Trash for non-HDFS filesystems?
1. fs.trash.interval, set to greater than 0 in core-site.xml 2. In your user/home directory in a .trash folder 3. No, they will be permanently deleted 4. hadoop fs -expunge
What should be run regularly for maintenance?
1. fsck 2. balancer
How do you setup Kerberos authentication?
1. hadoop.security.authentication = kerberos (core-site.xml) 2. hadoop.security.authorization = true 3. Set up ACLs (Access Control Lists) in hadoop-policy.xml
How do you set log levels for a component? (3 ways)
1. Go to http://jobtracker-host:50030/logLevel and set org.apache.hadoop.mapred.JobTracker to DEBUG 2. hadoop daemonlog -setlevel jobtracker-host:50030 org.apache.hadoop.mapred.JobTracker DEBUG 3. (persistent) change the log4j.properties file
A mapper commonly performs three things, what are they?
1. input format parsing 2. projection (selecting relevant fields) 3. filtering (removing records that are not of interest)
1. CompressionCodecFactory finds codecs from a list defined where? 2. What is the default? 3. How is it formatted?
1. io.compression.codecs; it searches the list by file extension to find a match. 2. All codecs supported by Hadoop; custom codecs can be added to the list. 3. Comma-separated classnames (ex: org.apache.hadoop.io.compress.DefaultCodec, ...)
Pig is extensible in that you can customize what? How?
1. loading 2. storing 3. filtering 4. grouping 5. joining via user-defined functions (UDF)
What can mapreduce.framework.name be set to ?
1. local 2. classic (MapReduce 1 ) 3. yarn (MapReduce 2)
How do you set memory options for individual jobs? (2 controls)
1. mapred.child.java.opts - sets JVM heap size for map/reduce tasks 2. mapreduce.map.memory.mb - how much memory needed for map (or reduce) task containers.
What needs to be set in order to enable Task Memory Monitoring? (6)
1. mapred.cluster.map.memory.mb - amt of virtual memory to take up a map slot. Map tasks that require more can use multiple slots. 2. mapred.cluster.reduce.memory.mb - amt of virtual memory to take up a reduce slot 3. mapred.job.map.memory.mb - amt of virtual memory that a map task requires to run 4. mapred.job.reduce.memory.mb - amt of virtual memory that a reduce task requires to run 5. mapred.cluster.max.map.memory.mb - max limit users can set mapred.job.map.memory.mb 6. mapred.cluster.max.reduce.memory.mb - max limit users can set mapred.job.reduce.memory.mb
How do you configure JVM reuse?
1. mapred.job.reuse.jvm.num.tasks - the maximum number of tasks to run for a given job in each JVM launched (default = 1). There is no distinction between map and reduce tasks, but tasks from different jobs always run in separate JVMs. If set to -1 there is no limit, and the same JVM may be used for all tasks of a job. 2. JobConf.setNumTasksToExecutePerJVM()
1. How do you setup the local jobrunner? 2. How do you setup the local jobrunner on MR2? 3. How many reducers are used?
1. mapred.job.tracker = local (default) 2. mapred.framework.name = local 3. 0 or 1
What are the two files the Namenode stores data in?
1. namespace image file (fsimage) 2. edit log
What does the Datanode's VERSION file contain? (5)
1. namespaceID - received from the namenode when the datanode first connects 2. storageID = DS-5477177.... used by the namenode to uniquely identify the datanode 3. cTime = 0 4. storageType = DATA_NODE 5. layoutVersion = -18
How does job submission work in YARN?
1. A new job ID is retrieved from the resource manager (where it is called an application ID) 2. The job client checks the output specification, computes input splits, and copies job resources to HDFS 3. The job is submitted by calling submitApplication() on the resource manager
Which components does fsck measure and what do they do?
1. over-replicated blocks - extra replicas are automatically deleted 2. under-replicated blocks - additional replicas are automatically created 3. mis-replicated blocks - blocks that don't satisfy the replica placement policy; they are re-replicated 4. corrupt blocks - blocks whose replicas are all corrupt (blocks with at least one non-corrupt replica are not marked as corrupt) 5. missing replicas - blocks with no replicas anywhere; the data has been lost. You can specify -move to move the affected files to the lost+found directory, or -delete to delete the files (they cannot be recovered)
What do these common Pig commands output? 1. DUMP records 2. DESCRIBE records 3. ILLUSTRATE 4. EXPLAIN
1. shows records ex: 1950,0,1 2. shows the relation's schema ex: records: { year: chararray, temperature: int, quality: int } 3. A table representation of all the steps and transformations; it helps in understanding the query. 4. Shows the logical/physical plan breakdown for a relation
How do you run a MapReduce job on a cluster?
1. unset HADOOP_CLASSPATH (if no dependencies exist) 2. hadoop jar hadoop-examples.jar v3.MaxTemperatureDriver -conf conf/hadoop-cluster.xml input/ncdc/all max-temp
How did you debug your Hadoop code?
1. use counters 2. use the interface provided by hadoop web ui
How much memory does the namenode, secondary namenode and jobtracker daemons use by default?
1 GB each by default. For the namenode, a rule of thumb is about 1 GB of heap per million blocks of storage.
How much memory do you dedicate to the node manager?
Roughly, the machine's total memory minus about 1 GB for the datanode daemon, 1 GB for the node manager daemon, and headroom for other running processes; the remainder (commonly around 8 GB on a modest worker node) is dedicated to containers.
What is the bandwidth used by the balancer?
1 MB/s (dfs.balance.bandwidthPerSec in hdfs-site.xml). Limits the bandwidth used for copying blocks between nodes; the balancer is designed to run in the background.
What is Hadoop's default replica placement?
1st - on the same node as the client (or a random node if the client is outside the cluster) 2nd - on a node on a different (off-rack) rack 3rd - on the same rack as the 2nd, but on a different node
How does the reduce side of the Shuffle work? (Sort Phase & Reduce phase)
2. Sort phase (really a merge phase) - done in rounds: number of rounds = number of map outputs / merge factor (io.sort.factor, default 10), e.g. 50/10 = 5 rounds, producing 5 intermediate files. 3. Reduce phase - the reduce function is invoked for every key of the sorted output; the result is written to the output filesystem, typically HDFS.
Think hadoop
In 2003/2004 Google released two academic papers describing the Google File System and MapReduce.
How does the map portion of the MapReduce write output? (Part 2)
4. Each time the memory buffer reaches the spill threshold, a new spill file is created; all spill files are then merged into a single partitioned and sorted output file. If there are at least 3 spill files (min.num.spills.for.combine), the combiner is run again during the merge. 5. It is a good idea to compress the map output (not enabled by default). 6. Output file partitions are made available to reducers over HTTP. The maximum number of worker threads used to serve partitions is controlled by tasktracker.http.threads = 40 (default) (a per-tasktracker setting, not per map). In MR2 this is set automatically from the number of processors on the machine (2 x number of processors).
What is a good split size? What happens when a split is too small?
64 MB, i.e., the size of an HDFS block. If a split were larger than a block it would span blocks, so some of its data would have to be transferred over the network rather than read locally; if splits are too small, the overhead of managing all the splits and of task creation dominates.
Machines running a namenode should be which? 32 bit or 64 bit?
64bit, to avoid the 3GB limit on Java Heap size on 32bit
default size of an HDFS block
64 MB
What is the default port for the HDFS NameNode?
8020
rule of 5 9's 99.999 uptime
A 99% uptime system allows roughly 3.65 days of downtime a year (about 7 hours a month); five nines (99.999%) allows only about 5 minutes a year.
Pig- if fields are no long unique use
::
Cluster file system property (XML)
<property> <name>fs.default.name</name> <value>hdfs://namenode/</value> </property>
Cluster job tracker property (XML)
<property> <name>mapred.job.tracker</name> <value>jobtracker:8021</value> </property>
Which of the following is NOT true: A) Hadoop is decentralized B) Hadoop is distributed. C) Hadoop is open source. D) Hadoop is highly scalable.
A
In Pig Latin, how would you use multi-query execution?
A = LOAD 'input/pig/multiquery/A'; B = FILTER A BY $1 == 'banana'; C = FILTER A BY $1 != 'banana'; STORE B INTO 'output/b'; STORE C INTO 'output/c'; A is read only once for the two jobs, saving time, and the output is stored in two separate places.
What is a DataNode? How many instances of DataNode run on a Hadoop Cluster?
A DataNode stores data in the Hadoop File System HDFS. There is only One DataNode process run on any hadoop slave node. DataNode runs on its own JVM process. On startup, a DataNode connects to the NameNode. DataNode instances can talk to each other, this is mostly during replicating data.
What is HBase
A Hadoop database
three
A Hadoop file is automatically stored in ___ places.
How do status updates work with YARN?
A Task reports its progress and status back to the application master which has an aggregate view. It is sent every 3 seconds over the umbilical interface.
What is a Task Tracker in Hadoop? How many instances of TaskTracker run on a Hadoop Cluster
A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. There is only One Task Tracker process run on any hadoop slave node. Task Tracker runs on its own JVM process. Every TaskTracker is configured with a set of slots, these indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM processes to do the actual work (called as Task Instance) this is to ensure that process failure does not take down the task tracker. The TaskTracker monitors these task instances, capturing the output and exit codes. When the Task instances finish, successfully or not, the task tracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that it is still alive. These message also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated
What is the Data Block Scanner?
A background thread that periodically verifies all the blocks stored on the datanode. This guards against corruption due to "bit rot" in the physical storage media.
What is Hadoop?
A big data system that ties together a cluster of commodity machines with local storage using free and open source software to store and process vast amounts of data at a fraction of the cost of other systems.
hadoop-env.sh
A Bourne shell fragment sourced by the Hadoop scripts; this file specifies environment variables that affect the JDK used by Hadoop, daemon JDK options, the pid file, and the log file directories.
How does the Capacity Scheduler work?
A cluster is made up of a number of queues which may be hierarchical and each queue has a capacity.Within each queue jobs are scheduled using FIFO scheduling (with priorities) Allows users (defined by queues) to simulate separate clusters. Does not enforce fair sharing like the Fair Scheduler.
What is a file system designed for storing very large files with streaming data access paterns, running on clusters of commodity hardware.
HDFS
Zookeeper
A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used to build distributed applications.
How would you customize Grunt autocomplete tokens?
Create a file called autocomplete and place it in Pig's classpath. Place keywords in single separate lines (case sensitive)
log4j.properties
A java property file that contains all log configuration information
taskcontroller.cfg
A Java property-style file that defines values used by the setuid task-controller, a MapReduce helper program used when operating in secure mode.
hadoop fs -touchz
Create a file of zero length.
What are the two main new components in Hadoop 2.0?
HDFS Federation and Yarn
MAPRED: mapred.local.dir
A list of directories where MapReduce stores intermediate temp data for jobs (cleared at job end)
Map-Side Joins
A map-side join between large inputs works by performing the join before the data reaches the map function. For this to work, though, the inputs to each map must be partitioned and sorted in a particular way. Each input dataset must be divided into the same number of partitions, and it must be sorted by the same key (the join key) in each source. All the records for a particular key must reside in the same partition. This may sound like a strict requirement (and it is), but it actually fits the description of the output of a MapReduce job.
masters(optional)
A new line separated list of machines that run the secondary namenode used only by start-*.sh helper scripts.
dfs.exclude
A newline separated list of machines that are not permitted to connect to namenode
dfs.include
A newline separated list of machines that are permitted to connect to the namenode
slaves(optional)
A newline separated list of machines that run a datanode/tasktracker pair of daemons; used only by the start-*.sh commands
Datanode
A node that holds data and data blocks for files in the file system.
NameNode
A node that stores meta-data about files and keeps track of which nodes hold the data for a particular file
\B
A non-word boundary
What is the namenode's fs image file?
A persistent checkpoint of filesystem metadata
empty
A quota of one forces a directory to remain ____ . (Yes, a directory counts against its own quota!)
What is the result of any operator in Pig Latin. ex:LOAD
A relation, which is a set of tuples. ex: records = LOAD 'input/ncdc/.." records - relation alias or name
What is Pig?
A scripting language that simplifies the creation of mapreduce jobs. Used to explore and transform data
What is Apache Flume? (a) What is a sample use-case? (b) What levels of delivery reliability does Flume support?
A system for moving large quantities of streaming data into HDFS. (a) use case: Collecting log data from one system and aggregating it into HDFS for later analysis. (b) 1. best-effort - doesn't tolerate any Flume node failures 2. end-to-end - guarantees delivery even with multiple failures.
[a-z&&[^m-p]]
a through z, except m through p (subtraction)
Hadoop is
HDFS and MapReduce. Both are direct implementations of Google's GFS and MapReduce papers, respectively.
MapReduce job (definition)
A unit of work that a client wants to be performed
MapReduce job
A unit of work that the client wants to be performed -input data -the MapReduce program -configuration information
\b
A word boundary
Data Mining
According to analysts, for what can traditional IT systems provide a foundation when they're integrated with big data technologies like Hadoop? Big Data and ___ ___.
process
Administrators should use the conf/hadoop-env.sh and conf/yarn-env.sh script to do site-specific customization of the hadoop daemons' ______ environment.
What are delegation tokens used for?
Allows for later authentication access without having to contact the KDC again.
What does PathFilter do? Which FileSystem functions take an optional PathFilter?
Allows you to exclude directories, as GlobPatterns cannot. listStatus(), globStatus()
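A sketch of a custom PathFilter (the exclusion rule is made up for illustration):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class ExcludeTmpFilter implements PathFilter {
  @Override
  public boolean accept(Path path) {
    return !path.getName().endsWith("_tmp");   // skip temporary files and directories
  }
}

// usage: FileStatus[] matches = fs.globStatus(new Path("/data/2014-*"), new ExcludeTmpFilter());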
What is CompositeContext?
Allows you to output the same set of metrics to multiple contexts. -arity = number of contexts
What does the Hadoop Library Class ChainMapper do?
Allows you to run a chain of mappers, followed by a reducer and another chain of mappers in a single job.
What does a compression format being Splittable mean? Which format is?
Allows you to seek to any point in the stream and start reading. (Suitable for MapReduce) Bzip2
What is Mahout?
An Apache project whose goal is to build scalable machine learning libraries
What is the logo for Hadoop?
An Elephant
mapred-queue-acls.xml
An XML file that defines which user and or group are permitted to submit jobs to which Mapreduce Job queues
hadoop-policy.xml
An XML file that defines which users and / or groups are permitted to invoke specific RPC functions whn communicated with Hadoop
core-site.xml
An XML file that specifies parameters relevant to all Hadoop daemons and clients
mapred-site.xml
An XML file that specifies parametersused by MapReduce daemons and clients
Big Data
An assortment of such a huge and complex data that it becomes very tedious to capture, store, process, retrieve, and analyze it with the help of on-hand database management tools or traditional processing techniques.
Unix
HDFS commands have a one-to-one correspondence with ____ commands.
What is a codec?
An implementation of a compression-decompression algorithm. In Hadoop, it's represented by an implementation of the CompressionCodec interface. ex: GzipCodec encapsulates the compression-decompression algorithm for gzip.
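A short sketch of using a codec directly to compress a stream (the stream endpoints are arbitrary; this mirrors a common usage pattern rather than anything from the card):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

Configuration conf = new Configuration();
CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
CompressionOutputStream out = codec.createOutputStream(System.out);
IOUtils.copyBytes(System.in, out, 4096, false);   // gzip whatever arrives on stdin
out.finish();                                      // flush compressed data without closing the underlying stream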
What does Hadoop use for Configuration?
An instance of the Configuration Class in org.apache.hadoop.conf. They read properties from an xml file.
What is Ganglia?
An open source distributed monitoring system for very large scale clusters. Using Ganglia context you can inject Hadoop metrics into Ganglia. Low overhead and collects info about memory/CPU usage.
What is ZooKeeper?
An open source server which enables highly reliable distributed coordination
How is distcp implemented?
As a mapreduce job with the copying being done by the maps and no reducers. Each file is copied by a single map, distcp tries to give each map the same amount of data.
integration
As companies move past the experimental phase with Hadoop, many cite the need for additional capabilities, including: Improved extract, transform and load features for data ____.
Speculative Execution
As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes that don't have work to perform (Yahoo DevNet). This keeps an entire job from being delayed by one slow node.
Fill in the blank. The solution to cataloging the increasing number of web pages in the late 1900's and early 2000's was _______.
Automation
Can an average mapreduce pattern use combiner?
Average is not associative, so the reducer cannot simply be reused as a combiner; it becomes possible if each record carries a (sum, count) pair.
Schedulers wait until 5% of the map task in a job have completed before scheduling reduce tasks for the same job. In a large cluster this may cause a problem. Why? How can you fix it?
Because reduce tasks occupy slots while they wait for map output, which lowers cluster utilization. By setting mapred.reduce.slowstart.completed.maps = 0.80 (80%), reducers are not scheduled until 80% of the maps have completed, which can improve throughput.
root privileges
Because the data transfer protocol of DataNode does not use the RPC framework of Hadoop, DataNode must authenticate itself by using privileged ports which are specified by dfs.datanode.address and dfs.datanode.http.address. This authentication is based on the assumption that the attacker won't be able to get ____ ____.
What is Sqoop used for? (use case)
Bulk imports of data into HDFS from structured datastores such as relational databases. use case: An organization runs nightly Sqoop imports to load the day's data into the Hive data warehouse for analysis.
How is the output key and value returned from the mapper or reducer?
By calling myOutputCollector.collect(outputKey, outputValue), where myOutputCollector is of type OutputCollector and outputKey and outputValue are the key/value pair to be emitted.
How can counters be incremented in MapReduce jobs?
By calling the incrCounter method on the instance of Reporter passed to the map or reduce method.
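An old-API mapper sketch that does this (the enum, class, and output are made up for illustration):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class QualityMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  enum RecordQuality { MALFORMED }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    if (value.toString().isEmpty()) {
      reporter.incrCounter(RecordQuality.MALFORMED, 1);   // bump the custom counter
      return;
    }
    output.collect(new Text("records"), new IntWritable(1));
  }
}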
access
By default Hadoop HTTP web-consoles (JobTracker, NameNode, TaskTrackers and DataNodes) allow ____ without any form of authentication.
authentication
By default Hadoop runs in non-secure mode in which no actual _____ is required. By configuring Hadoop runs in secure mode, each user and service needs to be authenticated by Kerberos in order to use Hadoop services.
How does the default partitioner bucket records?
By using a hash function
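The default HashPartitioner's getPartition() boils down to the following (shown for illustration):
public int getPartition(K key, V value, int numReduceTasks) {
  // Mask off the sign bit so the result is non-negative, then bucket by modulo.
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}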
How are errors handled with CheckSumFileSystem?
Calls reportChecksumFailure() and LocalFileSystem moves offending file and its checksum to "bad-files" directory.
How to define udf streaming counters
Can be incremented by sending a specially formatted line to the standard error stream. Format must be: reporter:counter:group,counter,amount
Example converting a program to be sortable
Change the keys from Text to a SequenceFile: signed integers stored as text don't sort well lexicographically, but a SequenceFile can use IntWritable keys, which sort correctly.
hadoop fs -chgrp
Change group association of files.
hadoop fs -chown
Change the owner of files.
hadoop fs -chmod
Change the permissions of files.
hadoop fs -setrep
Changes the replication factor of a file.
How do checksums work? What type of hardware do you need for them?
Checksums are computed once when the data first enters the system and again whenever it is transmitted across a channel. The checksums are compared to check if the data was corrupted. No way to fix the data, merely serves as error detection. Must use ECC memory
JobTracker in Hadoop performs following actions
Client applications submit jobs to the Job tracker. The JobTracker talks to the NameNode to determine the location of the data The JobTracker locates TaskTracker nodes with available slots at or near the data The JobTracker submits the work to the chosen TaskTracker nodes. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may may even blacklist the TaskTracker as unreliable. When the work is completed, the JobTracker updates its status. Client applications can poll the JobTracker for information.
What is a block access token?
The client uses the block access token to authenticate itself to datanodes. Enabled by setting dfs.block.access.token.enable = true. An HDFS block may be accessed only by a client with a valid block access token from the namenode.
How to retrieve counter values using the Java API (new)?
Cluster cluster = new Cluster(getConf()); Job job = cluster.getJob(JobID.forName(jobID)); Counters counters = job.getCounters(); long missing = counters.findCounter( MaxTemperatureWithCounters.Temperature.MISSING).getValue();
Fill in the blank: ________ is shipped to the nodes of the cluster instead of _________.
Code, Data
What are combiners? When should I use a combiner in my MapReduce Job?
Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally on individual mapper outputs, which can reduce the amount of data that needs to be transferred to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative (wired up as in the sketch below). The execution of a combiner is not guaranteed: Hadoop may or may not execute it, and may execute it more than once, so your MapReduce jobs should not depend on the combiner's execution.
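Driver-side sketch of wiring in a combiner (the class names are placeholders):
// Reusing the reducer as the combiner is valid here only because taking a
// maximum is commutative and associative.
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);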
Which settings are used for commissioning/decommissioning nodes?
Commissioning: dfs.hosts (datanodes) mapred.hosts (tasktrackers) Decommissioning: dfs.hosts.exclude (datanodes) mapred.hosts.exclude (tasktrackers)
Why is Hadoop good for "big data"?
Companies need to analyze that data to make large-scale business decisions.
thread safe
Concurrency and libhdfs/JNI The libhdfs calls to JNI should always be creating thread local storage, so (in theory), libhdfs should be as ____ ___ as the underlying calls to the Hadoop FS.
How would you manually add a resource? How would you access the resources properties?
Configuration conf = new Configuration(); conf.addResource("configuration-1.xml"); assertThat(conf.get("color"), is ("yellow")); assertThat(conf.getInt("size",0), is (10)); assertThat(conf.get("breadth","wide"), is ("wide"));
What are masters and slaves files used for?
Contains a list of machine hosts names or IP addresses. Masters file - determines which machines should run a secondary namenode Slaves file - determines which machines the datanodes and tasktrackers are run on. - Used only by the control scripts running on the namenode or jobtracker
hadoop fs -get
Copy files to the local file system.
HDFS, MapReduce
Core components of Hadoop are ___ and ___.
How to make counter names readable in web UI?
Create a properties file named after the enum, using an underscore as a separator for nested classes. The properties file should be in the same directory as the top-level class containing the enum. The file is named MaxTemperatureWithCounters_Temperature.properties ... CounterGroupName=Air Temperature Records MISSING.name=Missing MALFORMED.name=Malformed
keys
Custom configuration property keys should not conflict with the namespace of Hadoop-defined properties. Typically, users should avoid using prefixes used by Hadoop: hadoop, io, ipc, fs, net, file, ftp, s3, kfs, ha, file, dfs, mapred, mapreduce, yarn.
Which of the following is NOT Hadoop drawbacks? A) inefficient join operation B) security issue C) does not optimize query for user D) high cost E) MapReduce is difficult to implement
D
Scale up (monolothic) vs. scale out
Scale up: the database runs on an impressively large single computer; when the data grows, you move it to ever larger computers and storage arrays, with cost measured in the hundreds of thousands or millions of dollars. Scale out (the Hadoop approach): add more commodity nodes to the cluster instead.
What are some concrete implementations of Output Format?
DBOutputFormat FileOutputFormat NullOutputFormat SequenceFileAsBinaryOutputFormat TeraOutputFormat TextOutputFormat
Name common compression schemes supported by Hadoop
DEFLATE gzip bzip2 LZO
Structured Data
Data that has a schema
SSL (HTTPS)
Data transfer between Web-console and clients are protected by using ___ (____). [words refer to the same thing]
Unstructured Data
Data with no structure like jpg's, pdf files, audio and video files, etc.
The __________ holds the data in the HDFS and the application connects with the __________ to send and retrieve data from the cluster.
Datanode, Namenode
How do datanodes deal with checksums?
Datanodes are responsible for verifying the data they receive before storing the data and its checksum. A client writing data sends it to a pipeline of datanodes. The last datanode verifies the checksum. If there is an error, the client receives a checksum exception. Each datanode keeps a persistent log of checksum verifications. (knows when each block was last verified) Each datanode runs a DataBlockScanner in a background thread that periodically verifies all blocks stored on the datanode.
How do you specify SSH settings?
Define the HADOOP_SSH_OPTS environment variable in hadoop-env.sh
How do you handle corrupt records that are failing in the mapper and reducer code?
Detect and ignore Abort job, throwing an Exception Count the total number of bad records in the jobs using Counters to see how widespread the problem is.
hadoop fs -dus
Displays a summary of file lengths.
hadoop fs -du
Displays aggregate length of files contained in the directory
hadoop fs -tail
Displays last kilobyte of the file to stdout.
What is Distributed Cache in Hadoop?
Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.
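A driver-side sketch using the newer Job API (the file path and link name are hypothetical; older code uses DistributedCache.addCacheFile(uri, conf) instead):
import java.net.URI;

// The '#stations' fragment creates a symlink named "stations" in each task's working directory.
job.addCacheFile(new URI("/metadata/stations.txt#stations"));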
hadoop fs -expunge
Empty the Trash.
How do you check FileStatus?
FileStatus stat = fs.getFileStatus(Path f); then: assertThat(stat.getLen(), is (7L)); ... for all properties checked
How often does the client poll the application master?
Every second (mapreduce.client.progress.monitor.pollinterval)
When does the datanode create a new block subdirectory?
Every time the number of blocks in a directory reaches 64 (dfs.datanode.numblocks). This way the datanode ensures there is a manageable number of blocks spread across different directories.
How does the Fair Scheduler work?
Every user gets a fair share of the cluster capacity over time. A single job running on the cluster would use full capacity. A short job belonging to one user will complete in a reasonable time, even while another users long job is running. Jobs are placed in pools and by default each user gets their own pool. Its possible to create custom pools with a minimum value.Supports preemption - if a pool hasn't received its fair share over time, the scheduler will kill tasks in pools running over capacity in order to give more slots to under capacity pools.
1024 Petabytes?
1 Exabyte
Counting with Counters in Hadoop uses Map and Reduce? T/F
False - just the Mapper. Counting with counters is a map-only pattern; the number of reducers is set to 0.
What is the default MapReduce scheduler
FIFO queue-based scheduler
What are some of the available MapReduce schedules?
FIFO queue-based scheduler Fair scheduler Capacity scheduler
How do you delete files or directories with FileSystem methods?
FileSystem's delete() public boolean delete(Path f, boolean recursive) throws IOE
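For example, a minimal usage sketch (the output path is hypothetical):
// Recursively delete a job's old output directory before re-running it.
if (fs.exists(new Path("/user/tom/output"))) {
  fs.delete(new Path("/user/tom/output"), true);
}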
T/F: Hadoop is not recommended for a company with a small amount of data, but it is highly recommended if this data requires instant analysis.
False
T/F: The Cassandra File System has many advantages over HDFS, but simpler deployment is not one of them.
False
T/F: The main benefit of HadoopDB is that it is more scalable than Hadoop while maintaining the same performance level on structured data analysis workloads.
False
True or False: The number of reduce tasks is governed by the size of the input.
False, the number of reducers is specified independently. job.setNumReduceTasks();
True or False: Type conflicts are detected at compile time.
False, they are detected at runtime. Therefore programs should be tested on a sample data set first to fix any type incompatibilities.
True or False: Input types of the reduce function do not have to match output types of the map function
False, they have to match
True or False: A file in HDFS that is smaller than a single block will occupy a full block's worth of underlying storage
False, unlike a file system for a single disk, it does not
Your user tries to log in to your website. Hadoop is a good technology to store and retrieve their login data. True/false? Why?
False. Hadoop is not as efficient as a relational database that can be queried; a database like mySQL is a better choice in this scenario.
Hadoop is good at storing semistructured data. True/false?
False. It's good at storing unstructured data.
What is the safemode HDFS state?
The file system is mounted read-only: no replication, no files created, no files deleted. Commands: hadoop dfsadmin -safemode enter (enter safe mode); hadoop dfsadmin -safemode leave (exit safe mode); hadoop dfsadmin -safemode get (show whether safe mode is on or off); hadoop dfsadmin -safemode wait (wait until safe mode is exited)
How do we specify the input paths for the job object?
FileInputFormat.addInputPath(job, new Path(args[0]));
How do we specify the input and output paths for the job object?
FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.addOutputPath(job, new Path(args[1]));
How do you specify an input path for a MapReduce job, and what are the possible items one can specify?
FileInputFormat.addInputPath(____) a) single file b) directory c) file pattern
How do we specify the output paths for the job object?
FileOutputFormat.addOutputPath(job, new Path(args[1]));
What is the name of the distributed tool to retrieve logs? Each daemon has a source and a sink, can also decorate (compress or filter), scales out, and the master is the point of configuration.
Flume
What is Sqoop?
For efficiency of bulk transfers of data between Hadoop and relational databases
What is Flume?
For efficiently collecting, aggregating, and moving large amounts of log data
JobTrackers
For the Hadoop setup, we need to configure ____ and TaskTrackers and then specify the TaskTrackers in the HADOOP_HOME/conf/slaves file.
Reducer - Top Ten
Each mapper emits its local top K; in the single reducer those local lists compete to form the final top K.
shell commands
HDFS is a distributed filesystem, and just like a Unix filesystem, it allows user to manipulate the filesystem using ____ _____.
How would an administrator run the checkpoint process manually while in safe-mode?
hadoop dfsadmin -saveNamespace
Configuration Tuning Principles (General & Map side)
General - Give the shuffle as much memory as possible, however you must make sure your map/reduce functions get enough memory to operate The amount of memory given to the JVM in which map/reduce tasks run is set by mapred.child.java.opts - Make this as large as possible for the amount of memory on your task node. Map-Side - Best performance by avoiding multiple spills to disk. One spill is optimal. io.sort.mb (increase) There is a counter that counts both map and reduce spills that is helpful.
What does "hadoop fs -getmerge max-temp max-temp-local" do?
Gets all the files specified in a HDFS directory and merges them into a single file on the local file system.
What do you have to set to get started?
HADOOP_HOME, and HADOOP_CONF_DIR pointing to the location where fs.default.name and mapred.job.tracker are set.
How do you increase namenode memory?
HADOOP_NAMENODE_OPTS in hadoop-env.sh (and HADOOP_SECONDARYNAMENODE_OPTS to match). The value should include a heap setting such as -Xmx2000m, which would allocate 2 GB.
How do we make sure the master node is not overwhelmed with rsync requests on daemon start?
HADOOP_SLAVE_SLEEP = 0.1 seconds
How the HDFS Blocks are replicated?
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. HDFS uses rack-aware replica placement policy. In default configuration there are total 3 copies of a datablock on HDFS, 2 copies are stored on datanodes on same rack and 3rd copy on a different rack.
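The per-file replication factor mentioned above can also be changed programmatically; a minimal sketch (the path is hypothetical):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
fs.setReplication(new Path("/data/important.log"), (short) 5);   // equivalent of 'hadoop fs -setrep 5'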
What is HDFS ? How it is different from traditional file systems?
HDFS, the Hadoop Distributed File System, is responsible for storing huge data on the cluster. This is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.
read-only
HFTP is a ____-____ filesystem, and will throw exceptions if you try to use it to write data or modify the filesystem state.
What is a distributed data warehouse that manages data stored in HDFS and provides a query language based on SQL (and which is translated by the runtime engine to MapReducer jobs) for querying data?
HIVE
When comparing Hadoop and RDBMS, which is the best solution for speed?
Hadoop
Which is cheaper for larger scale power and storage? Hadoop or RDBMS?
Hadoop
Hadoop Archives
Hadoop Archives, or HAR files, are an archival facility that packs files into HDFS blocks more efficiently, reducing namenode memory usage while still allowing transparent access to the files. In particular, Hadoop archives can be used as input to MapReduce.
What does Hadoop 2.0 Consist Of?
Hadoop Common, HDFS, YARN, MapReduce
What is HDFS?
Hadoop Distributed File System
What is the characteristic of streaming API that makes it flexible run MapReduce jobs in languages like Perl, Ruby, Awk etc.?
Hadoop Streaming allows to use arbitrary programs for the Mapper and Reducer phases of a MapReduce job by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.
What is HUE
Hadoop User Experience (HUE), is a web library to build browser based tools to interact with cluster, Beeswax, File browser, Job designer, User manager ..etc
How does profiling work in Hadoop?
Hadoop allows you to profile a fraction of the tasks in a job and, as each task completes, pulls the profile information down to your machine for later analysis. ex: Configuration conf = getConf(); conf.setBoolean("mapred.task.profile", true); conf.set("mapred.task.profile.params", "-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s"); conf.set("mapred.task.profile.maps", "0-2"); conf.set("mapred.task.profile.reduces", " "); Job job = new Job(conf, "MaxTemperature");
MAPRED: mapred.tasktracker.map.tasks.maximum
Int (default=2) number of map tasks run on a tasktracker at one time.
Read, Site
Hadoop configuration is driven by two types of important configuration files: ____-only default configuration -core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml ___-specific configuration -conf/core-site.xml, conf/hdfs-site.xml, conf/yarn-site.xml, and conf/mapred-site.xml
TaskTrackers
Hadoop deployment includes a HDFS deployment, a single job tracker, and multiple ________.
What is Speculative Execution of tasks?
Hadoop detects when a task is running slower than expected and launches another equivalent task as backup. When a task completes successfully, the duplicate tasks are killed. Turned on by default. It is an optimization and not used to make tasks run more reliably.
Explain why the performance of join operation in Hadoop is inefficient.
Hadoop does not have indices for data so entire dataset is copied in the process to perform join operation.
Explain the benefit of Hadoop versus other nondistributed parallel framworks in terms of their hardware requirements.
Hadoop does not require high performance computers to be powerful. Its power is in the library itself. It can run effectively on consumer grade hardware.
How do you merge the Reducers output files into a single file?
Hadoop fs -getmerge somedir somefile
Command for listing a directory in HDFS
hadoop fs -ls some/path (The first column shows the Unix-style file permissions, the second column is the replication factor, followed by the owner, group, file size, modification time, and filename.)
What is the command line way of uploading a file into HDFS
hadoop fs -put <file> <dir> or hadoop fs -put <file> <file> or hadoop fs -put <dir> <dir>
How can you test a driver using a mini cluster?
Hadoop has a set of testing classes (allows testing against the full HDFS and MapReduce machinery): MiniDFSCluster MiniMRCluster MiniYARNCluster MapReduceTestCase - abstract class provides methods needed to use a mini cluster in user code.
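For example, a minimal test-method sketch, assuming the Hadoop 2.x MiniDFSCluster.Builder API and the hadoop-hdfs test jar are on the classpath (fully qualified names stand in for imports):
public void testWithMiniCluster() throws Exception {
  org.apache.hadoop.conf.Configuration conf = new org.apache.hadoop.conf.Configuration();
  // Start a single-datanode HDFS cluster inside the test JVM
  org.apache.hadoop.hdfs.MiniDFSCluster cluster =
      new org.apache.hadoop.hdfs.MiniDFSCluster.Builder(conf).numDataNodes(1).build();
  try {
    org.apache.hadoop.fs.FileSystem fs = cluster.getFileSystem();
    fs.mkdirs(new org.apache.hadoop.fs.Path("/test"));  // exercises the real HDFS code paths
  } finally {
    cluster.shutdown();  // always tear the mini cluster down
  }
}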
libhadoop.so
Hadoop has native implementations of certain components for performance reasons and for non-availability of Java implementations. These components are available in a single, dynamically-linked native library called the native hadoop library. On the *nix platforms the library is named _____.
scalable
Hadoop is ________ as more nodes can be added to it.
Why is Hadoop's file redundancy less problematic than it could be?
Hadoop is cheap and costeffective able to run on unspecialized machines, open source software and the money saved by this will likely outweigh the cost of needing additional storage space.
How many Daemon processes run on a Hadoop system?
Hadoop is comprised of five separate daemons. Each of these daemons runs in its own JVM. The following 3 daemons run on Master nodes: NameNode - stores and maintains the metadata for HDFS. Secondary NameNode - performs housekeeping functions for the NameNode. JobTracker - manages MapReduce jobs and distributes individual tasks to machines running the TaskTracker. The following 2 daemons run on each Slave node: DataNode - stores actual HDFS data blocks. TaskTracker - responsible for instantiating and monitoring individual Map and Reduce tasks.
database
Hadoop is not a ______, it is an architecture with a filesystem called HDFS.
What is the key benefit of the new YARN framework?
Hadoop jobs are no longer restricted to Map Reduce. With YARN, any type of computing paradigm can be implemented to run Hadoop.
principal
Hadoop maps Kerberos ____ to OS user account using the rule specified by hadoop.security.auth_to_local which works in the same way as the auth_to_local in Kerberos configuration file (krb5.conf).
cluster size
Hadoop reduces cost of operation via limiting ___ ____.
What is "Standalone mode" ?
Hadoop runs on the local filesystem with a local jobrunner
What is data locality optimization ?
Hadoop tries to run map tasks on the node where the input data resides in HDFS, so the map task doesn't consume valuable cluster bandwidth.
Map Reduce
Hadoop uses __ __ to process large data sets.
parallel
Hadoop uses the concept of MapReduce which enables it to divide the query into small parts and process them in ___.
What is the default MapReduce partitioner
HashPartitioner
A distributed, column-oriented database that uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads)
Hbase
What is a distributed, sorted map built on HDFS with high-throughput get, put, and scan operations?
Hbase
Name three features of Hive.
HiveQL, Indexing, Different Storage types
MAPRED: mapred.job.tracker
Hostname and port the jobtracker's RPC server runs on. (default = local)
impersonate
However, if the superuser does want to give a delegation token to joe, it must first ____ joe and get a delegation token for joe, in the same way as the code example above, and add it to the ugi of joe. In this way the delegation token will have the owner as joe.
data
HttpFS can be used to transfer ____ between clusters running different versions of Hadoop (overcoming RPC versioning issues), for example using Hadoop DistCP.
versions
HttpFS can be used to transfer data between clusters running different ____ of Hadoop (overcoming RPC versioning issues), for example using Hadoop DistCP.
How are corrupted blocks "healed"?
If a client detects an error when reading a block, it reports a bad block & datanode to the namenode, and throws a ChecksumException. The namenode marks the copy as corrupt and stops traffic to it. The namenode schedules a copy of the block to be replicated on another datanode. The corrupted replica is deleted.
what is secondary sort
If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate a secondary sort on values.
host
If more lax security is preferred, the wildcard value * may be used to allow impersonation from any ____ or of any user.
When are node managers blacklisted? By what?
If more than 3 tasks fail (mapreduce.job.maxtaskfailures.per.tracker) by the application master
How is tasktracker failure handled?
If the heartbeat isn't sent to jobtracker in 10secs (mapred.task.tracker.expiry.interval) The jobtracker removes it from the pool. Any tasks running when removed from the pool have to be re-run.
Cygwin
If you are using Windows machines, first install ____ and SSH server in each machine. The link http://pigtail.net/LRP/printsrv/cygwin-sshd.html provides step-by-step instructions.
Inherent Characteristics of Big Data
Immutable and Time-Based
What is a common use of Flume?
Importing twitter feeds into a Hadoop cluster
What is HDFS Block size? How is it different from traditional file system block size?
In HDFS data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64MB or 128MB in size. Each block is replicated multiple times; the default is to replicate each block three times. Replicas are stored on different nodes. HDFS utilizes the local file system to store each HDFS block as a separate file. HDFS block size cannot be compared with the traditional file system block size.
two
In Hadoop, when we store a file, it automatically gets replicated at _______ other locations also.
primary
In Kerberized operation, the identity of a client process is determined by its Kerberos credentials. For example, in a Kerberized environment, a user may use the kinit utility to obtain a Kerberos ticket-granting-ticket (TGT) and use klist to determine their current principal. When mapping a Kerberos principal to an HDFS username, all components except for the _____ are dropped. For example, a principal todd/[email protected] will act as the simple username todd on HDFS.
When is the reducers are started in a MapReduce job?
In a MapReduce job reducers do not start executing the reduce method until the all Map jobs have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The programmer defined reduce method is called only after all the mappers have finished.
executable
In contrast to the POSIX model, there are no setuid or setgid bits for files as there is no notion of ____ files.
MAPRED: mapred.tasktracker.reduce.tasks.maximum
Int (default=2) number of reduce tasks run on a tasktracker at one time.
Java
Install ___ in all machines that will be used to set up Hadoop.
What does the GenericOptionsParser do?
Interprets Hadoop command line options and sets them to a Configuration object in your application. Implemented through the Tool Interface.
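A minimal sketch of the usual pattern (the class name MyTool is hypothetical): the Tool implementation reads a Configuration already populated by GenericOptionsParser, which ToolRunner invokes before calling run().
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyTool extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any -D key=value options parsed from the command line
    System.out.println(getConf().get("my.example.property"));
    return 0;
  }
  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyTool(), args));
  }
}
It could then be run as: hadoop MyTool -D my.example.property=foo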
What is Hadoop Streaming?
It allows you to run map reduce jobs with other languages that use standard in and standard out. ex: ruby, python
What does the Combiner do?
It can reduce the amount of data transferred between mapper and reducer. Combiner can be an instance of the reducer class.
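For example, a one-line sketch of wiring a combiner into a new-API job (the job object is assumed; reusing the reducer as the combiner only works when the reduce function is commutative and associative, e.g., sums or maxima):
job.setMapperClass(MyMapper.class);
job.setCombinerClass(MyReducer.class);  // runs on the map side before the shuffle
job.setReducerClass(MyReducer.class);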
What is the coherency model for a filesystem?
It describes data visibility of reads and writes for a file
How does MR1 handle jobtracker failure?
It is a single point of failure, however it is unlikely that particular machine will go down. After restarting, all jobs need to be resubmitted.
How is resource manager failure handled?
It is designed to recover by using a checkpoint mechanism to save state. After a crash a new instance is brought up (by the administrator) and it recovers from the saved state (consisting of node managers and applications, but not tasks, which are managed by the application master). The storage the resource manager uses is configurable; the default (org.apache.hadoop.yarn.server.resourcemanager.recovery.memstore) keeps it in memory, so it is not highly available.
How is the splitting of file invoked in Hadoop framework?
It is invoked by the Hadoop framework by running the getSplits() method of the InputFormat class (like FileInputFormat) defined by the user.
What happens when a container uses more memory than allocated?
It is marked as failed and terminated by the node manager
What is another name for the hadoop DFS module? ex: hadoop dfs ____
It is the same as hadoop fs ____ and is also called FsShell.
FSDataInputStream implements the PositionedReadable Interface. What does it provide?
It reads parts of a file given an offset.
How is the namenode machine decided?
It runs on the machine that the startup scripts were run on.
What is the Datanode block scanner?
It verifies all blocks stored on the Datanode, allowing bad blocks to be detected and deleted or fixed. DataBlockScanner maintains a list of blocks to verify (dfs.datanode.scan.period.hours = 504). Corrupt blocks are reported to the namenode to be fixed.
Suppose Hadoop spawned 100 tasks for a job and one of the task failed. What will Hadoop do?
It will restart the task again on some other TaskTracker and only if the task fails more than four (default setting and can be changed) times will it kill the job.
What does the namenodes VERSION file contain? (4)
It's a Java properties file that contains information about the version of HDFS running: 1. namespaceID - a unique identifier for the filesystem. The namenode uses it to identify new datanodes, since they will not know it until they have registered. 2. cTime = 0 - marks the creation time of the namenode's storage. It is updated from 0 to a timestamp when the filesystem is upgraded. 3. storageType = NAME_NODE - indicates the storage directory contains data structures for the namenode. 4. layoutVersion = -18 - always negative; indicates the version of HDFS.
New API or Old API? (a) Job (b) JobConf (c) org.apache.hadoop.mapred (d) org.apache.hadoop.mapreduce
Job - NEW; JobConf - OLD; org.apache.hadoop.mapred - OLD; org.apache.hadoop.mapreduce - NEW
How does Job Submission work in MapReduce 1?
Job.submit() creates a JobSubmitter instance. The JobSubmitter: 1. calls submitJobInternal() 2. Asks the jobtracker for a new job ID (JobTracker.getNewJobID()) and computes the input splits; if they can't be computed, the job is cancelled. It checks the output specifications to make sure the output dir does not exist. 3. Copies the resources needed to run the job to the jobtracker. The job JAR is copied at a high replication factor (default = 10, mapred.submit.replication). Why? So that copies are readily available for multiple tasktrackers to access. 4. Tells the jobtracker that the job is ready for execution: JobTracker.submitJob().
How do you execute in MapReduce job from within the main method of a driver class?
JobClient.runJob(myJobConf);
What is a single way of running multiple jobs in order?
JobClient.runJob(conf1); JobClient.runJob(conf2);
How is the job specified in a MapReduce driver class?
JobConf conf = new JobConf(MyDriver.class); conf.setJobName("My Job");
What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?
JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only One Job Tracker process run on any hadoop cluster. Job Tracker runs on its own JVM process. In a typical production cluster its run on a separate machine. Each slave node is configured with job tracker node location. The JobTracker is single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
What is JobTracker?
JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.
How do you set a space limit on a users home directory?
hadoop dfsadmin -setSpaceQuota 1t /user/username
How does speculative execution work in Hadoop?
JobTracker makes different TaskTrackers process same input. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.
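Speculative execution can also be disabled per job when duplicate attempts are undesirable (e.g., tasks with side effects); a sketch using the classic MR1 property names, assuming a Configuration object conf:
conf.setBoolean("mapred.map.tasks.speculative.execution", false);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);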
What is the difference in Web UIs MR1vs MR2 ?
JobTracker web UI - list of jobs. Resource Manager web UI - list of running applications with links to their respective application masters, which show progress and further info.
conf/slaves
List all slave hostnames or IP addresses in your ___/___ file, one per line.
HDFS: dfs.data.dir
List of directories for a datanode to store its blocks
Hadoop IO Class that corresponds to Java Long
LongWritable
Why do map tasks write their output to local disk instead of HDFS?
Output from a map task is temporary and would be overkill to store in HDFS. If a map task fails, the mapper is re-run so there is no point in keeping the intermediate data.
What does the NameNode do?
Manages the file system namespace. It also maintains the file system tree and the metadata for all files and directories in the tree.
Pig Complex Types
Map, Tuple, Bag
What do you use to monitor a jobs actual memory using during a job run?
MapReduce task counters: 1. PHYSICAL_MEMORY_BYTES 2. VIRTUAL_MEMORY_BYTES 3. COMMITTED_HEAP_BYTES
True or False: The Job's setup is called before any tasks are run. (Create output directory..etc) MapReduce1 MapReduce2
MapReduce1 - false, it is run in a specialized task, run by a tasktracker MapReduce2 - true, directly by the application master
What is the difference between Metrics and Counters?
Metrics - collected by Hadoop daemons (for administrators). Counters - collected from MapReduce tasks and aggregated for the job. The collection mechanism for metrics is decoupled from the component that receives the updates, and there are various pluggable outputs: (a) local files (b) Ganglia (c) JMX. The daemon collecting metrics does the aggregation.
What does calling Seekable seek() do? What happens when the position referenced is greater than the file length?
Moves to an arbitrary absolute position within a file. It results in an IOException
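A short sketch of reading a file twice with seek() (the HDFS URI and path are hypothetical; classes come from org.apache.hadoop.fs, org.apache.hadoop.conf, and org.apache.hadoop.io):
FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), new Configuration());
FSDataInputStream in = fs.open(new Path("/user/test/file.txt"));
IOUtils.copyBytes(in, System.out, 4096, false);  // read the whole file
in.seek(0);                                      // go back to the start
IOUtils.copyBytes(in, System.out, 4096, false);  // read it again
IOUtils.closeStream(in);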
What is HDFS Federation?
Multiple Namenodes. Each Namenode manages a namespace volume made up of the metadata for a namespace and block pool.
HDFS 5
NameNode constantly monitors reports sent by datanodes to ensure no blocks drop below the block replication factor. If one does, it schedules the addition of another copy of the block.
How NameNode Handles data node failures?
NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a data node after a certain amount of time, the data node is marked as dead. Since blocks will be under-replicated, the system begins replicating the blocks that were stored on the dead datanode. The NameNode orchestrates the replication of data blocks from one datanode to another. The replication data transfer happens directly between datanodes and the data never passes through the namenode.
New vs old for MapReduce job configuration
New = Configuration; Old = JobConf
New vs old for MapReduce job control
New = Job Old = JobClient
New vs old for output file names
New: part-m-nnnnn (map) and part-r-nnnnn (reduce). Old: same as new but without the "m" or "r" (part-nnnnn). Part numbers start at zero.
Are block locations persistently stored?
No, Block locations are reconstructed from datanodes when the system starts
Does a small file take up a full block in HDFS?
No, unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage.
Is Hadoop a database? Explain.
No. Hadoop is a file system.
What is the difference between these commands? hadoop fs _____ hadoop dfs _____
None they are exactly the same
Does MapReduce programming model provide a way for reducers to communicate with each other? In a MapReduce job can a reducer communicate with another reducer?
Nope, MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation.
What is difference between hadoop fs -copyToLocal and hadoop fs -get
Nothing they are identical
What is the difference between hadoop fs -copyFromLocal hadoop fs -put
Nothing they are identical
What are the other Writables (besides the Java primitives and Text)?
NullWritable, BytesWritable, MD5Hash, ObjectWritable, GenericWritable
What project was Hadoop originally a part of and what idea was that project based on?
Nutch. It was based on the idea of returning web search results faster by distributing data and calculations across different computers so multiple tasks could be accomplished simultaneously.
Any content written to a file is not guaranteed to be visible, even if the stream is flushed. Why?
Once more than a block's worth of data has been written, the first block will be visible to the readers. The current block is always invisible to new readers and will display a length of zero.
How are hanging tasks dealt with in MR1 ?
Once the timeout has been reached (default = 10 mins) the tasktracker marks the task as failed. The child JVM will be killed automatically. The timeout is set via mapred.task.timeout. It can be set to 0, although this is not advised because hanging tasks would never free up their slots.
What is the relationship between Jobs and Tasks in Hadoop?
One job is broken down into one or many tasks in Hadoop.
Describe Jobtrackers and Tasktrackers.
One jobtracker for many tasktrackers. The Jobtracker reschedules tasks and holds a record of overall progress.
One of the initial users of Hadoop
How many output files will a mapreduce job generate?
One per reducer
what are dynamic counters?
One that isn't defined by a Java enum. Because a Java enum's fields are defined at compile time, you can't create new counters on the fly using enums ... public void incrCounter(String group, String counter, long amount)
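A sketch inside an old-API mapper, where the counter name is discovered from the data at run time (the record layout is hypothetical):
String category = value.toString().split("\t")[0];       // e.g. first tab-separated field
reporter.incrCounter("RecordCategories", category, 1);   // creates the counter on first use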
super-user
Only the owner of a file or the ____ - ____ is permitted to change the mode of a file.
Reluctant does what?
Opposite of greedy, starts from one letter and builds up, the last thing they try is the entire input
What is used in Hadoop for data flow language and execution environment for exploring very large datasets?
PIG
Hadoop can run jobs in ________ to tackle large volumes of data.
Parallel
After the Map phase finishes, the Hadoop framework does "Partitioning, Shuffle and sort". Explain what happens in this phase?
Partitioning: It is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same. Shuffle: After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling. Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
How do you determine if you need to upgrade the filesystem?
Perform a trial on a test cluster.
What is the difference between Pig Latin and SQL?
Pig Latin - data flow programming language. SQL - declarative programming language. - Pig Latin takes declarative statements and breaks them into steps. It supports complex, nested data structures while SQL deals with flatter data. Pig is customizable with UDFs and doesn't support random reads/writes, similar to Hadoop.
SSO
Practically you need to manage ____ environment using Kerberos with LDAP for Hadoop in secure mode.
Latency
Processing time is measured in weeks. Process more data, throw in more hardware to keep elapsed time under control
What are the specs of a typical "commodity hardware" machine?
Processor - 2 quad-core 2-2.5 GHz; Memory - 16-24GB ECC RAM (error-correcting code); Storage - four 1TB SATA disks; Network - Gigabit Ethernet
What is HPROF?
A profiling tool that comes with the JDK and can give valuable information about a program's CPU and heap usage.
Cost
Prohibitive undertaking for small and medium size business. Reserved for multinationals.
What is Apache Whirr? What is the benefit of using it?
Provides a Java API and scripts for interacting with Hadoop on EC2. You can easily read data from S3, but it doesn't take advantage of data locality.
Class and Method signature for mapper
public class MyMapper extends MapReduceBase implements Mapper<K1, V1, K2, V2> { public void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) throws IOException { ... } }
MapReduce driver class w/new API
public class MyDriver { public static void main(String[] args) throws Exception { Job job = new Job(); job.setJarByClass(MyDriver.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.setMapperClass(MyMapper.class); job.setReducerClass(MyReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); System.exit(job.waitForCompletion(true) ? 0 : 1); } }
Class and method signature for new mapper API
public class MyNewMapper extends Mapper<K1, V1, K2, V2> { public void map(K1 key, V1 value, Context context) throws IOException, InterruptedException { context.write(key2, value2); } }
class and method signature for Reducer
public class MyReducer extends MapReduceBase implements Reducer<K1, V1, K2, V2> { public void reduce(K1 key, Iterator<V1> values, OutputCollector<K2, V2> output, Reporter reporter) throws IOException { ... } }
Describe the writeable interface
public interface Writable { void write(DataOutput out) throws IOException; void readFields(DataInput in) throws IOException; }
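For example, a minimal custom Writable (the class name is hypothetical; a key type would additionally implement WritableComparable):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PointWritable implements Writable {
  private int x;
  private int y;
  // Serialize the fields in a fixed order...
  public void write(DataOutput out) throws IOException { out.writeInt(x); out.writeInt(y); }
  // ...and deserialize them in the same order
  public void readFields(DataInput in) throws IOException { x = in.readInt(); y = in.readInt(); }
}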
Describe a MapReduce driver class.
public class MyDriver { public static void main(String[] args) throws IOException { JobConf conf = new JobConf(MyDriver.class); conf.setJobName("My Job"); ... } }
fsimage
Quotas are persistent with the _____ . When starting, if the fsimage is immediately in violation of a quota (perhaps the fsimage was surreptitiously modified), a warning is printed for each of such violations. Setting or removing a quota creates a journal entry.
What does a RecordReader do?
RecordReader typically converts the byte-oriented view of the input provided by the InputSplit and presents a record-oriented view for the Mapper and Reducer tasks for processing. It thus assumes the responsibility of processing record boundaries and presenting the tasks with keys and values.
Configuration Tuning Principals (Reduce Side & Buffer Size)
Reduce side - best performance when intermediate data can reside entirely in memory. If your reduce function has light memory requirements, you can set mapred.inmem.merge.threshold to 0 and mapred.job.reduce.input.buffer.percent to 1.0 (or a lower value). Buffer size - 4KB by default; increase it via io.file.buffer.size.
If reducers do not start before all mappers finish then why does the progress on MapReduce job shows something like Map(50%) Reduce(10%)? Why reducers progress percentage is displayed when mapper is not finished yet?
Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes in account the processing of data transfer which is done by reduce process, therefore the reduce progress starts showing up as soon as any intermediate key-value pair for a mapper is available to be transferred to reducer. Though the reducer progress is updated still the programmer defined reduce method is called only after all the mappers have finished.
extrinsic
Regardless of the mode of operation, the user identity mechanism is ____ to HDFS itself. There is no provision within HDFS for creating user identities, establishing groups, or processing user credentials.
MAPRED: mapred.system.dir
Relative to fs.default.name where shared files are stored during job run.
x*? - Reluctant, Greedy or Possessive?
Reluctant
RPC
Remote Procedure Calls - a protocol for data serialization
How does memory utilization in MapReduce 2 get rid of previous memory issues?
Resources are more fine grained instead of having a set number of blocks at a fixed memory amount. With MR2 applications can request a memory capability that is between the min and max allocation set. Default memory allocations are scheduler specific. This removes the previous problem of tasks taking too little/too much memory because they were forced to use a fixed amount.
hadoop fs -stat
Returns the stat information on the path.
What is Hive?
SQL like language for Big Data
slave nodes
SSH is used in Hadoop for launching server processes on ______ ____.
Where are bad records stored in Hadoop?
Saved as SequenceFiles in the jobs output directory under _logs/skip
Web consoles
Security features of Hadoop consist of authentication, service level authorization, authentication for ___ ___ and data confidentiality.
The ______ Interface permits seeking to a position in the file and provides a query method for the current file offset getPos()
Seekable
6 Key Hadoop Data Types
Sentiment, Clickstream, Sensor/Machine, Geographic, Server Logs, Text
How can you configure the tasktracker to retain enough information to allow a task to be rerun over the same input data for debugging?
Set keep.failed.task.files to true
How does a task report progress?
Sets a flag to indicate that the status change should be sent to the tasktracker. The flag is checked in a separate thread every 3 seconds.
What does dfs.web.ugi do?
Sets the user that HDFS web interface runs as. (used to restrict system files to web users)
STONITH
Shoot the other node in the head
What does YARN use rather than tasktrackers?
Shuffle handlers, auxiliary services running in node managers.
What is configuration of a typical slave node on Hadoop cluster? How many JVMs run on a slave node?
Single instance of a Task Tracker is run on each Slave node. Task tracker is run as a separate JVM process. Single instance of a DataNode daemon is run on each Slave node. DataNode daemon is run as a separate JVM process. One or Multiple instances of Task Instance is run on each slave node. Each task instance is run as a separate JVM process. The number of Task instances can be controlled by configuration. Typically a high end machine is configured to run more task instances.
What is the default network topology configuration? With multi-rack clusters, what do you need to do?
Single-rack. Map nodes to racks so that Hadoop can favor within-rack transfers, which are preferable. It will also allow Hadoop to place replicas more intelligently.
What can you use when 3rd party library software is causing bad records that can't be intercepted in the mapper/reducer code?
Skipping Mode - to automatically skip bad records. Enable it and use the SkipBadRecords Class. A task reports the records that are passed back to the tasktracker. Because of extra network traffic and bookkeeping to maintain the failed record ranges, skipping mode is only enabled after 2 failed task attempts. Skipping mode can only detect one bad record per task attempt. (good for catching occasional record errors) To give skipping mode enough attempts to detect and skip all bad records in an input split, increase mapred.map.max.attempts mapred.reduce.max.attempts
HDFS: dfs.name.dir
Specifies a list of directories where the namenode metadata will be stored.
What does time-based mean?
Something known at a certain moment in time.
Describe how Sqoop transfers data from a relational database to Hadoop.
Sqoop runs a query on the relational database and exports the results into files in a variety of formats. These files are then saved on HDFS. Doing this process in reverse will import formatted files from HDFS into a relational database.
What are Java Management Extensions? (JMX)
Standard Java API for monitoring and managing applications. Hadoop includes several (MBeans) managed beans which expose Hadoop metrics to JMX aware applications.
What does YARN's start-yarn.sh script do?
Starts the YARN daemon which: (a) starts resource manager on the machine the script was run on. (b) node manager on each machine in the slaves file.
How does a Pig Latin program get executed?
Step 1 - all statements are checked for syntax, then added to the logical plan. Step 2 - DUMP statement converts the logical plan to a physical plan and the commands are executed.
What is Hadoop Streaming?
Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.
MAPRED: mapreduce.map.java.opts
String (-Xmx 2000m) JVM option used for child process that runs map tasks
MAPRED: mapreduce.reduce.java.opts
String (-Xmx 2000m) JVM option used for child process that runs reduce tasks
MAPRED: mapred.child.java.opts
String (-Xmx 2000m) JVM option used to launch tasktracker child processes that run map and reduce tasks ( can be set on per-job basis)
Hadoop works best with _________ and ___________ data, while Relational Databases are best with the first one.
Structured, Unstructured
What does the job.waitForCompletion() method do?
Submits the job Waits for it to finish Return value is Boolean true/false which translates to an exit code
How do you set a system property? How do you set them via command line?
System.setProperty("size",14); -Dproperty=value
False
T/F Hadoop works in real time?
A tasktracker may connect if it's in the include file and not in the exclude file.
TRUE
If you shut down a tasktracker that is running, the jobtracker will reschedule the task on another tasktracker.
TRUE
Once an upgrade is finalized, you can't roll back to a previous version.
TRUE
Pig turns the transformations into a series of MapReduce jobs. (seamlessly to the programmer)
TRUE
True or False: If you are using the default TextInputFormat, you do not have to specify input types for your job.
TRUE
True or False: Input of a reduce task is output from all map tasks so there is no benefit of data locality
TRUE
True or False: Its better to add more jobs than add more complexity to the mapper.
TRUE
True or False: addOutputPath() must point to a directory that doesn't currently exist.
TRUE
hadoop fs - getmerge
Takes a source directory and a destination file as input and concatenates files in src into the destination local file.
HDFS: fs.default.name
Takes the HDFS filesystem URI: the host is the namenode's hostname or IP and the port is the port the namenode will listen on (default file:///; the HDFS namenode commonly uses port 8020). It specifies the default filesystem so you can use relative paths.
What is a Task instance in Hadoop? Where does it run?
Task instances are the actual MapReduce jobs which are run on each slave node. The TaskTracker starts a separate JVM processes to do the actual work (called as Task Instance) this is to ensure that process failure does not take down the task tracker. Each Task Instance runs on its own JVM process. There can be multiple processes of task instance running on a slave node. This is based on the number of slots configured on task tracker. By default a new task instance JVM process is spawned for a task.
What's a tasktracker?
TaskTracker is a node in the cluster that accepts tasks like MapReduce and Shuffle operations - from a JobTracker.
How do we ensure that multiple instances of the same task don't try to write to the same file?
Tasks write to their working directory, when they are committed, the working directory is promoted to the output directory.
Hadoop IO Class that corresponds to Java String
Text
Streaming output keys and values are always of type ____ .
Text. The IdentityMapper cannot change LongWritable keys to Text keys so it fails. Another mapper must be used.
What is the difference between TextInputFormat and KeyValueInputFormat class?
TextInputFormat: It reads lines of text files and provides the offset of the line as key to the Mapper and actual line as Value to the mapper. KeyValueInputFormat: Reads text file and parses lines into key, Val pairs. Everything up to the first tab character is sent as key to the Mapper and the remainder of the line is sent as value to the mapper.
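For example, a sketch of selecting KeyValueTextInputFormat in a new-API driver (the separator property name shown is the Hadoop 2 name and is an assumption to check against your version):
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat.class);
// Optionally change the key/value separator from the default tab character:
job.getConfiguration().set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");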
What is Hadoop Pipes?
The C++ interface to Hadoop MapReduce. Uses sockets rather than standard input/output (which Streaming uses).
How the Client communicates with HDFS?
The client communicates with HDFS using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. Client applications can talk directly to a DataNode, once the NameNode has provided the location of the data.
What is a Combiner?
The Combiner is a 'mini-reduce' process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
aware
The HDFS and the YARN components are rack-___.
REST API
The HTTP ____ ____ supports the complete FileSystem/FileContext interface for HDFS.
names
The Hadoop Distributed File System (HDFS) allows the administrator to set quotas for the number of ___ used and the amount of space used for individual directories. Name quotas and space quotas operate independently, but the administration and implementation of the two types of quotas are closely parallel.
POSIX
The Hadoop Distributed File System (HDFS) implements a permissions model for files and directories that shares much of the ____ model. Each file and directory is associated with an owner and a group.
What is the difference between HDFS and NAS ?
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences from other distributed file systems are significant. Differences between HDFS and NAS: in HDFS, data blocks are distributed across the local drives of all machines in a cluster, whereas in NAS data is stored on dedicated hardware. HDFS is designed to work with MapReduce, since computation is moved to the data; NAS is not suitable for MapReduce since data is stored separately from the computation. HDFS runs on a cluster of machines and provides redundancy using a replication protocol, whereas NAS is provided by a single machine and therefore does not provide data redundancy.
How is client-side checksumming done?
The Hadoop LocalFileSystem performs client-side checksumming.A file is written and a hidden file is created.(filename.crc) Controlled by io.bytes.per.checksum (512 bytes)
NodeManager
The Hadoop daemons are NameNode/DataNode and ResourceManager/_____.
What is the purpose of RecordReader in Hadoop?
The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the Input Format.
Jobtracker
The JobReduce number is decided via the _____.
What is the Hadoop MapReduce API contract for a key and value Class?
The Key must implement the org.apache.hadoop.io.WritableComparable interface. The value must implement the org.apache.hadoop.io.Writable interface.
What is a NameNode? How many instances of NameNode run on a Hadoop Cluster?
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. There is only One NameNode process run on any hadoop cluster. NameNode runs on its own JVM process. In a typical production cluster its run on a separate machine. The NameNode is a Single Point of Failure for the HDFS Cluster. When the NameNode goes down, the file system goes offline. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds the successful requests by returning a list of relevant DataNode servers where the data lives.
How JobTracker schedules a task?
The TaskTrackers send out heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.
Volume, Velocity, Variety
The Three Characteristics of Big Data
block size
The __ ___ of the input data can affect the performance of MapReduce computations, as the default behavior of Hadoop is to create one map task for each data block of the input files.
name quota
The ____ ____ is a hard limit on the number of file and directory names in the tree rooted at that directory
superuser
The ____ must be configured on namenode and jobtracker to be allowed to impersonate another user.
Define "fault tolerance".
The ability of a system to continue operating after the failure of some of its components.
\A
The beginning of the input
What is "the shuffle" ?
The data flow between map and reduce tasks.
containers
The data is stored in HDFS which does not have any predefined ______.
If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the reducer?
The default partitioner computes a hash value for the key and assigns the partition based on this result.
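HashPartitioner's getPartition() boils down to the following computation (masking the sign bit keeps the result non-negative):
public int getPartition(K key, V value, int numReduceTasks) {
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}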
\z
The end of the input
\Z
The end of the input but for the final terminator, if any
\G
The end of the previous match
What manages the transition of active namenode to Standby?
The failover controller
If not set explicitly, the intermediate types in MapReduce default to ____________.
The final output types, which themselves default to LongWritable (keys) and Text (values)
Input splits or splits
The fixed sized pieces into which the input is divided one map task is created for each split for many/most jobs a good split tends to be the size of an HDFS block
What are some typical functions of Job Tracker?
The following are some typical tasks of JobTracker:- - Accepts jobs from clients - It talks to the NameNode to determine the location of the data. - It locates TaskTracker nodes with available slots at or near the data. - It submits the work to the chosen TaskTracker nodes and monitors progress of each task by receiving heartbeat signals from Task tracker.
Codec
The implementation of a compression-decompression algorithm
Where is the namenode data stored?
The local disk, in two files: the namespace image and the edit log
How are Hadoop Pipes jobs written?
The mapper and reducer methods are written by extending the Mapper and Reducer classes defined in the HadoopPipes namespace. main() acts as the entry point; it calls HadoopPipes::runTask().
Where is the Mapper output (intermediate key-value data) stored?
The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in config by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
What is a Namenode?
The master node that manages the file system namespace. It maintains the file system tree and metadata for all the files/directories within. Keeps track of the location of all the datanodes for a given file.
Block
The minimum amount of data that can be read or written. For HDFS, this is a much larger unit than a normal file system. Typically this is 64MB by default
What is Streaming data access
The most efficient data processing pattern: a write-once, read-many-times pattern
The number of map tasks is driven by what?
The number of input splits, which is dictated by the size of inputs and block size.
The number of partitions is equal to?
The number of reduce tasks for the job.
Deserialization
The process of turning a byte stream back into a series of structured objects
Serialization
The process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.
slaves
The rest of the machines in the cluster act as both DataNode and NodeManager. These are the ___________.
Enter your regex: (dog){3} Enter input string to search: dogdogdogdogdogdog I found the text "dogdogdog" starting at index 0 and ending at index 9. I found the text "dogdogdog" starting at index 9 and ending at index 18. Enter your regex: dog{3} Enter input string to search: dogdogdogdogdogdog No match found. Why does the second one fail?
The second one asks for "do" followed by the letter g three times ("doggg"), because {3} applies only to the preceding g; without the grouping parentheses it can never match "dogdogdog".
Controlling sort order
The sort order for keys is controlled by a RawComparator, which is found as follows: 1. If the property mapred.output.key.comparator.class is set, either explicitly or by calling setSortComparatorClass() on Job, then an instance of that class is used. (In the old API, the equivalent method is setOutputKeyComparatorClass() on JobConf.) 2. Otherwise, keys must be a subclass of WritableComparable, and the registered comparator for the key class is used. 3. If there is no registered comparator, then a RawComparator is used that deserializes the byte streams being compared into objects and delegates to the WritableComparable's compareTo() method.
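For example, a sketch of setting both comparators on a new-API Job (the comparator class names are hypothetical):
job.setSortComparatorClass(MyKeyComparator.class);        // controls how keys are sorted
job.setGroupingComparatorClass(MyGroupComparator.class);  // controls which keys share one reduce() call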
kerberos
The superuser must have ____ credentials to be able to impersonate another user. It cannot use delegation tokens for this feature. It would be wrong if superuser adds its own delegation token to the proxy user ugi, as it will allow the proxy user to connect to the service with the privileges of the superuser.
mapred.task.tracker.report.address
The tasktracker's RPC server address and port, used by the tasktracker's child JVMs to communicate with it. The server only binds to localhost (default 127.0.0.1:0).
What does immutable mean?
The truthfulness of the data does not change. Changes of big data are new entries not updates to existing entries.
Globbing
The use of pattern matching to match multiple files with a single expression.
What is Hadoop Common?
The utilities that provide support for other Hadoop modules.
What does dfs.replication =1 mean?
There would only be one replication per block. Typically we aim for at least 3.
What's unique about -D properties when used with the hadoop command?
They have a space: -D name=value, as compared with JVM system properties: -Dname=value
Benefits of distributed cache
This is because the distributed cache is much faster. It copies the file to all tasktrackers at the start of the job, so if a tasktracker runs 10 or 100 Mappers or Reducers, they all use the same local copy of the distributed cache. On the other hand, if you read the file from HDFS inside the MR job, then every Mapper will try to access it from HDFS, so if a TaskTracker runs 100 map tasks it will read the file 100 times from HDFS. HDFS is also not very efficient when used like this.
Distributed Cache
This provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to any particular node once per job.
LDAP
Though files on HDFS are associated to owner and group, Hadoop does not have the definition of group by itself. Mapping from user to group is done by OS or _____.
Greedy does what first?
Tries to match the entire input string first , if it fails then backs off by one letter each time until a match is made.
T/F: Hadoop is open source.
True
True or False: Because CRC-32 checksum is 4 bytes long, the storage overhead is less than 1%.
True
What is HADOOP_IDENT_STRING used for?
To change the perceived user for logging purposes. The log names will contain this value.
daemons
To configure the Hadoop cluster you will need to configure the environment in which the Hadoop daemons execute as well as the configuration parameters for the Hadoop daemons.
Why is a block in HDFS so large?
To minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block. Thus the transfer of a large file made of multiple blocks operates at the disk transfer rate.
YARN
To start a Hadoop cluster, you will need to start both the HDFS and ____ cluster.
What is dfsadmin?
Tool for finding information on the state of HDFS and performing administrative actions on HDFS
How do you calculate the optimal number of reducers?
Total number of available reducer slots = # nodes in the cluster x # of slots per node (mapred.tasktracker.reduce.tasks.maximum). Then use slightly fewer reducers than total slots, which gives you one wave of reduce tasks.
How has big data been processed historically?
Traditionally very difficult, technically and financially.
FIll in the blank. Hadoop lacks notion of ________ and _______. Therefore, the analyzed result generated by Hadoop may or may not be 100% accurate.
Transaction Consistency, Recovery Checkpoint
True or False: JConsole allows you to view MBeans in a running JVM. You can see Hadoop metrics via JMX using the default metrics, but to have them update you have to configure metrics to use something other than NullContext.
True; using NullContextWithUpdateThread is appropriate if JMX is your only way to view metrics.
True or False: You can reference a java mapper and reducer in a Hadoop Pipes job.
True, you can use a hybrid Java and C++
True or False: ChecksumFileSystem is just a wrapper around FileSystem.
True, you can use methods like getChecksumFile().
Semi-structured Data
Typically rows are missing some columns or have their own unique columns
How does a FIFO Scheduler work?
Typically each job would use the whole cluster, so jobs had to wait their turn. Has the ability to set a job's priority (very high, high, normal, low, very low). It will choose the highest-priority tasks first, but there is no preemption (once a task is running, it can't be replaced).
ResourceManager
Typically one machine in the cluster is designated as the NameNode and another machine the as ___ , exclusively. These are the masters.
What does fs.default.name do?
Sets the default filesystem for Hadoop. If set, you do not need to specify the filesystem explicitly when you use -copyFromLocal via the command line. ex: fs.default.name = hdfs://localhost/
How do you list the contents of a directory?
Use the FileSystem's listStatus() method: public FileStatus[] listStatus(Path f) throws IOException
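A short usage sketch (the directory path is hypothetical):
FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] statuses = fs.listStatus(new Path("/user/hadoop/input"));
for (FileStatus status : statuses) {
  System.out.println(status.getPath() + " " + status.getLen());  // path and length in bytes
}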
CLI MiniCluster
Using the ___ ____, users can simply start and stop a single-node Hadoop cluster with a single command, and without the need to set any environment variables or manage configuration files. The ___ ___ starts both a YARN/MapReduce & HDFS clusters. This is useful for cases where users want to quickly experiment with a real Hadoop cluster or test non-Java programs that rely on significant Hadoop functionality.[same word]
What is "globbing"?
Using wildcard characters to match multiple files with a single expression rather than having to enumerate each file and directory to specify input
Three V's of Big Data
Variety, volume and velocity
How can Oozie inform a client about the workflow status?
Via an HTTP callback
How does one obtain a reference to an instance of the hadoop file system in Java
Via static factory methods: FileSystem.get(Configuration conf) throws IOException and FileSystem.get(URI uri, Configuration conf) throws IOException
Lists three drawbacks of using Hadoop
Whatever is listed in the 6 Drawbacks of Hadoop. For example: does not work well with small amounts of data, MapReduce programs are difficult to implement or understand, and it does not guarantee atomic transactions.
What is InputSplit in Hadoop?
When a Hadoop job is run, it splits input files into chunks and assign each split to a mapper to process. This is called InputSplit.
HDFS: fs.checkpoint.dir
Where the secondary namenode stores its checkpoints of the filesystem
When is it impossible to set a URLStreamHandlerFactory? What is the workaround?
When it has been used elsewhere. You can use the FileSystem API instead.
When is a CodecPool used? How is it implemented?
When lots of compression/decompression occurs. It is used to re-use compressors and de-compressors, reducing the cost of new object creation. Compressor compressor = null; try{ compressor = CodecPool.getCompressor(codec); CompressionOutputStream out = codec.createOutputStream(System.out, compressor); } finally { CodecPool.returnCompressor(compressor); }
0 or 1
When running under the local jobrunner, how many reducers are supported?
kinit
When service level authentication is turned on, end users using Hadoop in secure mode needs to be authenticated by Kerberos. The simplest way to do authentication is using ___ command of Kerberos.
Describe what happens when a slave node in a Hadoop cluster is destroyed and how the master node compensates.
When the slave node is destroyed, it stops sending heartbeat signals to the master node. The master node recognizes the loss of the slave node and relegates its tasks, including incomplete tasks, to other slave nodes.
rebalancer
When you add new nodes, HDFS will not rebalance automatically. However, HDFS provides a _____ tool that can be invoked manually.
What is the behavior of the HashPartitioner?
With multiple reducers, records will be allocated evenly across reduce tasks, with all records that share the same key being processed by the same reduce task.
What does HADOOP_MASTER defined in hadoop-env.sh do?
Worker daemons will rsync the tree rooted at HADOOP_MASTER to the local nodes HADOOP_INSTALL when the daemon starts.
What is Oozie?
Workflow scheduler system to manage Apache Hadoop jobs
What are datanodes?
Workhorses of the filesystem. They store and retrieve blocks when told to do so by the client or Namenode. They report back to the Namenode with lists of which blocks they are currently storing and where they are located
What does FileContext in metrics do?
Writes metrics to a local file. Unsuitable for large clusters because output files are spread out.
How does YARN handle memory? How does it compare to the slot model?
YARN allows applications to request an arbitrary amount of memory for a task. Node managers allocate memory from a pool, the number of tasks running on a node depends on the sum of their memory requirements, not a fixed number of slots.The slot-based model can lead to under utilization because slots are reserved for map or reduce tasks. YARN doesn't differentiate so it is free to maximize memory utilization.
Can I set the number of reducers to zero?
Yes, setting the number of reducers to zero is a valid configuration in Hadoop. When you set the reducers to zero no reducers will be executed, and the output of each mapper will be stored to a separate file on HDFS. [This is different from the condition when reducers are set to a number greater than zero, in which case the Mappers' output (intermediate data) is written to the local file system (NOT HDFS) of each mapper slave node.]
Is it possible to have Hadoop job output in multiple directories? If yes, how?
Yes, by using the MultipleOutputs class.
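A compressed sketch with the new-API MultipleOutputs (from org.apache.hadoop.mapreduce.lib.output; the named output "errors" is hypothetical):
// In the driver:
MultipleOutputs.addNamedOutput(job, "errors", TextOutputFormat.class, Text.class, IntWritable.class);
// In the reducer:
private MultipleOutputs<Text, IntWritable> mos;
protected void setup(Context context) { mos = new MultipleOutputs<Text, IntWritable>(context); }
// inside reduce(): mos.write("errors", key, value);
protected void cleanup(Context context) throws IOException, InterruptedException { mos.close(); }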
do map tasks have the advantage of data locality
Yes, many times
Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple directories as input to the Hadoop job?
Yes, the input format class provides methods to add multiple directories as input to a Hadoop job.
Can distcp work on two different versions of Hadoop?
Yes, you would have to use http ( or the newer webhdfs) ex: hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
What does YARN stand for?
Yet Another Resource Negotiator
What is YARN?
Yet Another Resource Negotiator... a framework for job scheduling and cluster resource management
How can you set an arbitrary number of Reducers to be created for a job in Hadoop?
You can either do it programmatically by using the setNumReduceTasks method of the JobConf class, or set it as a configuration setting.
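For example (both APIs expose the same setter; the -D form relies on GenericOptionsParser):
conf.setNumReduceTasks(10);   // old API, on JobConf
job.setNumReduceTasks(10);    // new API, on Job
// or on the command line: hadoop MyDriver -D mapred.reduce.tasks=10 input output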
When does PLATFORM need to be set? What is it for?
a) When you are running Hadoop Pipes and use C++ b) It specifies the operating system architecture and data model. Needs to be set before running the makefile. ex: PLATFORM=Linux-i386-32
fuse-dfs
allows mounting of HDFS as a standard filesystem (via FUSE)
Possessive does what?
always eats the entire input string, trying once (and only once) for a match
hdfs-site.xml
an XML file that specifies parameters used by HDFS daemons and clients.
The mapper in filtering
applies the evaluation function to each record it receives; the mapper outputs the same key/value types as its input, since the record is left unchanged, which is what we want in filtering
daily::exchange and divs::exchange- what does that mean?
both daily and divs have a column with the same name, need to distinguish them
Count words in PIG
cntd = foreach grpd generate group, COUNT(words);
So why not scale up and out at the same time?
It compounds the costs and weaknesses of both approaches: instead of either very large, expensive hardware or cross-cluster logic, this hybrid architecture requires both.
Jobtracker
coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers
Where are the default Hadoop properties stored?
core-default.xml
Where are hadoop fs defaults stored?
core-site.xml
Where are the site specific overrides to the default Hadoop properties stored?
core-site.xml
Which of the following is NOT Hadoop drawbacks? a. inefficient join operation b. security issue c. does not optimize query for user d. high cost e. MapReduce is difficult to implement
d
MapReduce
data processing software with specs on how to input and output data sets. It integrates tightly with HDFS.
PIG is a ________ language rather than a programming language
dataflow
Fill in the blank: The __________ holds the data in the HDFS and the application connects with the __________ to send and retrieve data from the cluster.
datanode, namenode
Partitioner Structure for Inverted Index
determines where values with the same key will eventually be copied by a reducer for final output
The downside: we need to develop software that can process data across a fleet of machines.
Developers needed to handcraft mechanisms for data partitioning and reassembly, scheduling logic, and failure handling.
How do you change HDFS block size?
dfs.block.size (hdfs-site.xml) (default 64MB recommended 128MB )
Property for path on local file system in which data node instance should store its data
dfs.data.dir (/tmp by default; must be overridden)
Which settings are used to control which network interfaces to use? (ex:eth0) (2)
dfs.datanode.dns.interface mapred.tasktracker.dns.interface
How do you reserve storage space for non-HDFS use?
dfs.datanode.du.reserved (amount in bytes)
Property for path on local file system of the NameNode instance where the NameNode metadata is stored
dfs.name.dir ex: /home/username/hdfs
What is the property that enables file permissions
dfs.permissions (the namenode runs as a super user where permissions are not applicable)
Pig- Tuple
divided into fields, e.g. (min, max, count)
How many reducers does Distinct pattern need?
doesn't matter
How to do secondary sort in Streaming?
We don't want to partition by the entire key, so we use the KeyFieldBasedPartitioner, which allows us to partition by a part of the key. The mapred.text.key.partitioner.options specification configures the partitioner.
one reason scale up systems are so costly ...
due to the redundancy built in to mitigate the impact of component failures
print cntd in PIG
dump cntd;
Some questions are only meaningful if asked of sufficiently large data sets.
e.g. what is the most popular song or movie; more meaningful if we ask 100 users rather than 10.
dfs.namenode.shared.edits.dir
Each namenode in an HA pair must have access to a shared filesystem defined by this property. The active namenode writes to it, while the standby namenode reads it and applies the changes to its in-memory version of the metadata.
how are user defined counters defined:
Defined as Java enums, e.g. enum Temperature { MISSING, MALFORMED }. Within a task they are incremented with context.getCounter(Temperature.MALFORMED).increment(1); (optionally logging, e.g. System.err.println("Ignoring possibly corrupt input: " + value);)
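A fuller hedged sketch of a mapper that increments such a counter; the class name, enum, and parsing logic are illustrative:
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      enum Temperature { MISSING, MALFORMED }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        try {
          // Illustrative parsing: treat the whole line as an integer temperature.
          int temp = Integer.parseInt(value.toString().trim());
          context.write(new Text("temperature"), new IntWritable(temp));
        } catch (NumberFormatException e) {
          System.err.println("Ignoring possibly corrupt input: " + value);
          context.getCounter(Temperature.MALFORMED).increment(1);
        }
      }
    }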
How do you access task execution from Streaming?
environment variables (ex in python: os.environ["mapred_job_id"] ) -cmdenv set environment variables via command line
Property that is the URI that describes the NameNode for the cluster
fs.default.name ex: hdfs://servername:9000 (port 9000 is arbitrary)
How do you configure a hadoop cluster for pseudo-distributed mode?
fs.default.name - hdfs://localhost/ dfs.replication = 1
What is it called when an administrator brings a namenode down manually for routine maintenance?
graceful failover
x* reluctant greedy or possessive?
greedy
Distinct- structure - exploits MapReduce's ability to ____ and uses
group keys together to remove duplicates; uses mapper to transform the data and doesn't do much in reducer
What is the core function of the MapReduce paradigm?
grouping data together by a key
Min Max Count Mapper -
groups records by a key, such as the user ID; the value consists of three columns: min, max, and count
Reducer - Distinct
groups the nulls together by key- so we'll have one null per key
Group by word in PIG
grpd = group words by word;
command for making a directory in HDFS
hadoop fs -mkdir mydir
What is the command to enter Safe Mode?
hadoop dfsadmin -safemode enter
How do you check if you are in Safe Mode?
hadoop dfsadmin -safemode get or front page of HDFS web UI
What is the command to exit Safe Mode?
hadoop dfsadmin -safemode leave
How do you set a script to run after Safe Mode is over?
hadoop dfsadmin -safemode wait # followed by the command to read/write a file
What is the command line for executing a Hadoop Streaming job
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar -input input/path -output output/path -mapper path/to/mapper.py -reducer path/to/reducer.py (all on one line) -file can be used to ship scripts to the cluster; -combiner can be used to specify a combiner
How do you use Hadoop Streaming?
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \ -input input/ncdc/sample.txt \ (input path in hdfs) -output output \ (output location) -mapper path-to-mapper-function \ -reducer path-to-reducer-function
Where do you find the list of all benchmarks?
hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar
How do you set your hadoop username and group?
hadoop.job.ugi ex: hadoop.job.ugi = test1,test2,test3 (test1 = user, test2 = group1, test3 = group2)
What property is used to set the Hadoop username and group?
hadoop.job.ugi = user,group1,group2
How do you disable the default Native Libraries to use Java Libraries instead?
hadoop.native.lib = false (the Hadoop script in bin automatically sets the native library path for you)
Costs remain lower than scale up
hardware cost grows disproportionately as one purchases larger machines
Yahoo
hired Doug Cutting in 2006 and became one of the most prominent supporters of Hadoop. Yahoo ran some of the largest Hadoop implementations and allowed Doug and his team to contribute to Hadoop, even with Yahoo's own improvements and extensions.
Where can you view a components metrics?
http://jobtracker-host:50030/metrics - ?format=json (optional)
How do you get a stack trace for a component?
http://jobtracker-host:50030/stacks
move processing, not data
If you need to process 100 TB of data on one massive server with a fibre SAN, there are limits on how much data can be delivered to that host. Instead, use 1,000 servers and make the data local to each; all that travels on the network is program binaries, metadata, and status reports.
Pig guesses at data
if not explicitly told
scale-out systems are built with 'expect to fail' in mind
individual components will fail regularly and at inconvenient times.
load data gettysburg.txt in pig
input = load 'gettysburg.txt' as (line);
What are the output of the map tasks called?
intermediate keys and values
[a-z&&[def]]
intersection - d e or f
How do you set how many bytes per checksum of data?
io.bytes.per.checksum (default 512 bytes)
How do you change the default buffer size?
io.file.buffer.size (default 4KB) recommended 128KB core-site.xml
numerical summarizations pattern
is a general pattern for calculating aggregate statistical values
HDFS 2...
It is not a POSIX-compliant filesystem; it does not provide the same guarantees as a regular filesystem.
failure is no longer a crisis to be mitigated ...
it is reduced to irrelevance
medi and standev - reducer process and what is the output?
Iterates through the given set of values, adding each value to an in-memory list while keeping a running sum and count. The comment lengths are then sorted to find the median. A running sum of squared deviations is computed from the difference between each comment length and the mean, and the standard deviation is calculated from this sum. The output is the median and standard deviation, with the key.
Min Max Count Reducer -
iterates through the values to find the min and maximum dates and sums the count
DataBlockScanner
a background thread that periodically verifies all the blocks stored on the datanode in order to guard against "bit rot"
How do you specify the mapper and reduce to use within a mapreduce job?
job.setMapperClass(Classname.class); job.setReducerClass(Classname.class);
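Putting these calls together, a hedged driver sketch (new API); the WordCount class names and argument indices are illustrative:
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);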
[abc]
just a b or c
How would you dump the jvm context to a file?
jvm.class = org.apache.hadoop.metrics.file.FileContext jvm.fileName = /tmp/jvm-metrics.log
Pig - Map
key(char array) to value(any pig type)
Reduce-Side Joins
less efficient because both datasets have to go through the MapReduce shuffle. The basic idea is that the mapper tags each record with its source and uses the join key as the map output key, so that the records with the same key are brought together in the reducer. We use several ingredients to make this work in practice: Multiple inputs Secondary sort
What does "." do?
matches any single character
thread
libhdfs is ____ safe.
pig - Limit
limits the number of output tuples
Where do map tasks write their output?
local disk, not HDFS
How do you list the files in a Hadoop filesystem within the Pig command line? What is the Pig Latin comment style?
ls / C style /* */ or --
What is used in the filter code?
mapper
Top Ten Pattern Description uses
mapper and reducer
median and stand dev in map reduce uses
mapper and reducer
Average uses-
mapper and reducer- no combiner since it is not associative can't add up averages of averages
Min Max Count- uses..
mapper combiner and reducer
How do you set a MapReduce taskscheduler?
mapred.jobtracker.taskScheduler = org.apache.hadoop.mapred.FairScheduler
What property configures the number of Reduce tasks?
mapred.reduce.tasks in JobConf (set by calling JobConf.setNumReduceTasks())
What property is used to set the timeout for failed tasks?
mapred.task.timeout (defaults to 10 min; can be configured per job or per cluster)
Which property controls the maximum number of map/reduce tasks that can run on a tasktracker at one time?
mapred.tasktracker.map.tasks.maximum (default 2) mapred.tasktracker.reduce.tasks.maximum (default 2)
What is the property that changes the number of task slots for the tasktrackers?
mapred.tasktracker.map.tasks.maximum
How do you set Hadoop to use YARN?
mapreduce.framework.name = yarn
\d
matches digits
\D
matches non digits
\S
matches non white space characters
\W
matches non word character
\s
matches whitespace
\w
matches word character [a-zA-Z_0-9]
How do you package an Oozie workflow application?
max-temp-workflow/
  lib/
    hadoop-examples.jar
  workflow.xml
(workflow.xml must be at the top level; the package can be built with Ant/Maven)
Min Max Count Combiner
min and max comment dates can be calculated for each local map task without having an effect on the final min and max
Why can minmaxcount have a combiner?
min, max, and count are all associative and commutative
and bad
moving software to a larger system is not a trivial task.
What is used for MapReduce tests?
MRUnit
How do you test a MapReduce job?
MRUnit and JUnit (the test execution framework). MRUnit's MapDriver is configured with the mapper we want to test, and runTest() executes the test. MRUnit's ReduceDriver is configured with the reducer.
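A hedged MRUnit sketch; MyMapper (a word-count-style mapper) and the sample record are illustrative:
    // Uses org.apache.hadoop.mrunit.mapreduce.MapDriver inside a JUnit test method.
    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        MapDriver.newMapDriver(new MyMapper());
    driver.withInput(new LongWritable(0), new Text("hello hello"))
          .withOutput(new Text("hello"), new IntWritable(1))
          .withOutput(new Text("hello"), new IntWritable(1))
          .runTest();   // fails the test if actual output differs from the expected output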
How is the output value type specified for a MapReduce job?
myJobConf.setOutputValueClass(IntWritable.class); (where IntWritable.class is the type of the output value)
How is the mapper specified for a MapReduce job?
myJobConf.setMapperClass(MyMapper.class);
How is combiner specified for a job?
myConf.setCombinerClass(MyCombiner.class);
How is the output key specified for a map reduce job?
myJobConf.setOutputKeyClass(Text.class); (where Text.class is the type of the output key)
How is the reducer specified for a MapReduce job?
myJobConf.setReducerClass(MyReducer.class);
Pig join syntax
name = join relation1 by key, relation2 by key;
Does MapReduce sort values?
no
do reduce tasks have the advantage of data locality
no
why Pig?
no more java/mapreduce, better load balancing of the reducers, easier joins
Combiner Structure for Inverted Index
not used; it would have little beneficial impact on the byte count
GET
operations: HTTP ___ : OPEN (see FileSystem.open) GETFILESTATUS (see FileSystem.getFileStatus) LISTSTATUS (see FileSystem.listStatus) GETCONTENTSUMMARY (see FileSystem.getContentSummary) GETFILECHECKSUM (see FileSystem.getFileChecksum) GETHOMEDIRECTORY (see FileSystem.getHomeDirectory) GETDELEGATIONTOKEN (see FileSystem.getDelegationToken) GETDELEGATIONTOKENS (see FileSystem.getDelegationTokens)
POST
operations: HTTP ____ APPEND (see FileSystem.append) CONCAT (see FileSystem.concat)
PUT
operations: HTTP ____ CREATE (see FileSystem.create) MKDIRS (see FileSystem.mkdirs) CREATESYMLINK (see FileContext.createSymlink) RENAME (see FileSystem.rename) SETREPLICATION (see FileSystem.setReplication) SETOWNER (see FileSystem.setOwner) SETPERMISSION (see FileSystem.setPermission) SETTIMES (see FileSystem.setTimes) RENEWDELEGATIONTOKEN (see FileSystem.renewDelegationToken) CANCELDELEGATIONTOKEN (see FileSystem.cancelDelegationToken)
DELETE
operations: HTTP _____ DELETE (see FileSystem.delete)
What is the combiner?
optional localized reducer; groups data in the map phase
dfs.namenode.http-address.nameservice-id.namenode-id
Optionally, it is possible to specify the hostname and port of the HTTP service for a given namenode-id within nameservice-id.
What class defines a file system in Hadoop
org.apache.hadoop.fs.FileSystem
What is Writable & WritableComparable interface?
org.apache.hadoop.io.Writable is a Java interface. Any key or value type in the Hadoop Map-Reduce framework implements this interface. Implementations typically implement a static read(DataInput) method which constructs a new instance, calls readFields(DataInput) and returns the instance. org.apache.hadoop.io.WritableComparable is a Java interface. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface. WritableComparable objects can be compared to each other using Comparators.
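A hedged sketch of a custom key type implementing WritableComparable; the IntPair name and its two fields are illustrative (equals/hashCode omitted for brevity):
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class IntPair implements WritableComparable<IntPair> {
      private int first;
      private int second;

      public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
      }

      public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
      }

      public int compareTo(IntPair other) {   // sort by first, then second
        int cmp = Integer.compare(first, other.first);
        return cmp != 0 ? cmp : Integer.compare(second, other.second);
      }
    }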
How do you run Pig locally? How do you run Pig on a distributed system?
pig -x local -x (execution environment option) pig (default)
What is a IdentityMapper and IdentityReducer in MapReduce ?
org.apache.hadoop.mapred.lib.IdentityMapper Implements the identity function, mapping inputs directly to outputs. If MapReduce programmer do not set the Mapper Class using JobConf.setMapperClass then IdentityMapper.class is used as a default value. org.apache.hadoop.mapred.lib.IdentityReducer Performs no reduction, writing all input values directly to the output. If MapReduce programmer do not set the Reducer Class using JobConf.setReducerClass then IdentityReducer.class is used as a default value.
Mapper Structure for Inverted Index
outputs the desired fields for the index as the key and the unique identifier as value
At how many nodes does MapReduce1 hit scaleability bottlenecks?
over 4,000 nodes
Hadoop can run jobs in ________ to tackle large volumes of data.
parallel
what is the record reader?
parses the data into records, passes the data into the mapper in the form of a key value pair
Once a FilesySystem is retrieved, how do you open the input stream for a file?
public FSDataInputStream open(Path f) throws IOException - Uses default buffer of 4KB public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
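A hedged sketch that opens a file, copies it to stdout, then seeks back to the start and copies it again; the URI and path are illustrative:
    // Uses org.apache.hadoop.fs.{FileSystem, FSDataInputStream, Path} and org.apache.hadoop.io.IOUtils.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode/"), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path("/user/me/sample.txt"));
      IOUtils.copyBytes(in, System.out, 4096, false);
      in.seek(0);   // FSDataInputStream supports seeking, so we can re-read from the start
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }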
If you want to append to an existing file instead of creating a new one, how to do you specify the output stream? How does appending work?
public FSDataOutputStream append(Path f) throws IOException It creates a single writer to append to the bottom of files only.
How do you create a file output stream?
public FSDataOutputStream create(Path f) throws IOException
What are the two FileSystem methods for processing globs? What does it return?
public FileStatus [ ] globStatus(Path pathPattern) throws IOE public FileStatus [ ] globStatus(Path pathPattern, PathFilter filter) throws IOE return an array of FileStatus objects whose paths match the supplied pattern, ordered by Path
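A hedged sketch of globStatus in use, assuming an existing FileSystem instance fs; the glob pattern is illustrative:
    FileStatus[] statuses = fs.globStatus(new Path("/logs/2014/*/*.gz"));
    Path[] paths = FileUtil.stat2Paths(statuses);   // convenience helper to extract the Paths
    for (Path p : paths) {
      System.out.println(p);
    }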
Which method do you use to see if a file or directory exists? (FileSystem method)
public boolean exists(Path f) throws IOException
How do you create directories with FileSystem?
public boolean mkdirs(Path f) throws IOException It passes back a boolean to indicate if directories and parents were created. This isn't used often because create() will create a directory structure as well.
class and method signature for a new reducer API
public class MyNewReducer extends Reducer<K1, V1, K2, V2> {
  public void reduce(K1 key, Iterable<V1> values, Context context)
      throws IOException, InterruptedException {
    context.write(key2, value2);
  }
}
Describe the writable comparable interface.
public interface WritableComparable<T> extends Writable, Comparable<T>
smart software, dumb hardware
push smarts into the software and away from the hardware; this allows the hardware to be generic
inverted indexes should be used when
quick query responses are required
Can you modify the partitioner?
rarely ever need to
Each map task in hadoop is broken into what phases:
record reader, mapper, combiner, partitioner
Filtering does not require the ____ part of Map Reduce because
reduce ; does not produce an aggregation
if problem requires workloads with strong mandates for transnational consistency/integrity,
relational databases are still likely to be a great solution.
How would you make sure a namenode stays in Safe Mode indefinitely?
set dfs.safemode.threshold.pct > 1
How does a mapreduce program find where the jar file on the Hadoop cluster is to run?
setJarByClass(Classname.class) It will find the jar containing that class instead of requiring an explicit jar file name
How do you set the output formats to use in a mapreduce job?
setOutputKeyClass(); setOutputValueClass(); If the mapper and reducer output types are different, you may specify the mapper output types as well: setMapOutputKeyClass(); setMapOutputValueClass();
The reduce tasks are broken into what phases:
shuffle, sort, reducer, and output format
architecture did not change much.
A single architecture at any scale is not realistic. To handle data sets of 100 TB to petabytes you may apply larger versions of the same components, but the complexity of connectivity may prove prohibitive.
so scale out
spread processing onto more and more machines. Use 2 servers instead of a double-sized one.
How do you run the balancer?
start-balancer.sh -threshold specifies the threshold percentage that is deemed "balanced" (optional). Only one balancer can be run at a time.
If you store 3 separate configs for: (a) single (b) pseudo-distributed (c) distributed How do you start/stop and specify those to the daemon?
start-dfs.sh --config path/to/config/dir start-mapred.sh --config path/to/config/dir
HDFS 3...
stores files in blocks of 64 MB (vs. 4–32 KB)
Mapper- Distinct
takes each record and extracts the data fields for which we want unique values- key is the record and value is null
What does the reducer do?
takes the grouped data from the shuffle and sort and runs a reduce function once per key grouping
what is the shuffle and sort?
takes the map output files and downloads them to the local machine where the reducer is running, then sorts them by key into one larger data list
What is the partitioner?
takes the output from mapper or combiner and splits them up into shards
How does memory limit the number of files in HDFS?
The NameNode holds filesystem metadata in memory, so the limit on the number of files in a filesystem is governed by the amount of memory on the NameNode. Rule of thumb: each file, directory, and block takes about 150 bytes.
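A rough worked example under that rule of thumb (an estimate, not an exact figure): one million files, each occupying a single block, is about two million in-memory objects, i.e. roughly 2,000,000 × 150 bytes ≈ 300 MB of NameNode memory.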
^
the beginning of a line
$
the end of a line
fair-scheduler.xml
the file used to specify the resource pools and settings for the fair scheduler plugin for MapReduce
What is the value?
the information pertinent to the analysis in reducer
capacity-scheduler.xml
the name of the file used to specify queues and settings for the capacity scheduler
What is a record
the portion of an input split for which the map function is called (e.g. a line in a file)
Hadoop runs tasks _____________ to isolate them from other running tasks
in their own Java virtual machine
dfs.namenode.rpc-address.nameservice-id.namenode-id
This parameter specifies the colon-separated hostname and port on which namenode-id should provide the namenode RPC service for nameservice-id.
dfs.ha.fencing.methods
this property specifies a newline-separated list of fencing methods
Rack Awareness
to take a node's physical location into account while scheduling tasks and allocating storage.
FIll in the blank. Hadoop lacks notion of ______________ and ______________. Therefore, the analyzed result generated by Hadoop may or may not be 100% accurate.
transaction consistency and recovery checkpoint
Generation of large data sets, e.g. large search engines and online companies. Also, the need to extract information to identify
trends and relationships to make decisions, e.g. customer habits and marketing (e.g. Google AdWords).
True or False: A good rule of thumb is to have one or more tasks than processors
true