Combo with "HaDoop" and 27 others
[^abc]
any character except a, b, or c (negation)
How do you execute a MapReduce job from the command line?
export HADOOP_CLASSPATH=build/classes then: hadoop MyDriver input/path output/path
List three drawbacks of using Hadoop.
Does not work well with small amounts of data, MapReduce is difficult to implement or understand, and it does not guarantee atomic transactions
streaming
HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
file system data
HDFS is highly fault-tolerant, provides high throughput, is suitable for applications with large data sets, provides streaming access to ____ ____ ____, and can be built out of commodity hardware.
Hadoop comes with pre-built native compression libraries for which version/OS?
Linux, 32-bit and 64-bit. On other platforms you have to compile the libraries yourself.
How do you obtain a comparator for an IntWritable?
RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);
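A minimal sketch of how that comparator might be used; the standalone class and the values 163 and 67 are made up for illustration:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.RawComparator;
    import org.apache.hadoop.io.WritableComparator;

    public class ComparatorDemo {
      public static void main(String[] args) {
        // Look up the raw comparator registered for IntWritable
        RawComparator<IntWritable> comparator = WritableComparator.get(IntWritable.class);
        IntWritable w1 = new IntWritable(163);   // arbitrary example values
        IntWritable w2 = new IntWritable(67);
        // Positive result because 163 > 67
        System.out.println(comparator.compare(w1, w2));
      }
    }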
What is the benefit to UDFs in Pig?
Reusable, unlike many MapReduce job libraries
threshold
The Rebalancer tool will balance the data blocks across the cluster up to an optional ____ percentage.
space
Rebalancing would be very helpful if you are having ____ issues in the other existing nodes.
What is the meaning of speculative execution in Hadoop? Why is it important?
Speculative execution is a way of coping with individual machine performance. In large clusters where hundreds or thousands of machines are involved, there may be machines which are not performing as fast as others. This may result in delays in a full job due to only one machine not performing well. To avoid this, speculative execution in Hadoop can run multiple copies of the same map or reduce task on different slave nodes. The results from the first node to finish are used.
True or False: As long as dfs.replication.min replicas are written, a write will succeed.
True, (default = 1) The namenode can schedule further replication afterward.
HDFS clusters do not benefit from RAID
True, The redundancy that RAID provides is not needed since HDFS handles replication between nodes.
True or False: FSDataInputStream allows seeking, FSDataOutputStream does not.
True, because there is no support to write anywhere but the end of a file. So there is no value to seek while writing.
True or False: Block pool storage is not partitioned
True, datanodes must register with every namenode in the cluster. Datanodes can store blocks from multiple block pools.
True or False: Pipes cannot be run in local standalone mode.
True, it relies on Hadoop's distributed cache system which is active when HDFS is running. (Will also work in pseudo-distributed mode)
True or False: It is possible for users to run different versions of MapReduce on the same YARN cluster.
True, makes upgrades more manageable.
True or False: Streaming and Pipes work the same way in MapReduce1 vs MapReduce 2.
True, the only difference is that the child and subprocesses run on node managers, not tasktrackers.
True or False: Under YARN, you no longer run a jobtracker or tasktrackers.
True, there is a single resource manager running on the same machine as the HDFS namenode(small clusters) or a dedicated machine with node managers running on each worker node.
simple
___ -In this mode of operation, the identity of a client process is determined by the host operating system. On Unix-like systems, the user name is the equivalent of `whoami`.
Structured data
___ ____ is data that is easily identifiable because it is organized in a structure. The most common form of __ ___ is a database, where specific information is stored in tables, that is, rows and columns. [same word]
HFTP
___ is a Hadoop filesystem implementation that lets you read data from a remote Hadoop HDFS cluster. The reads are done via HTTP, and data is sourced from DataNodes.
Rebalancer
____ - tool to balance the cluster when the data is unevenly distributed among DataNodes
Backup Node
____ ____ - An extension to the Checkpoint node. In addition to checkpointing it also receives a stream of edits from the NameNode and maintains its own in-memory copy of the namespace, which is always in sync with the active NameNode namespace state. Only one Backup node may be registered with the NameNode at once.
Secondary NameNode
____ ____ - performs periodic checkpoints of the namespace and helps keep the size of the file containing the log of HDFS modifications within certain limits at the NameNode.
Checkpoint Node
____ ____ - performs periodic checkpoints of the namespace and helps minimize the size of the log stored at the NameNode containing changes to the HDFS.
Map Reduce
____ ____ is a set of programs used to access and manipulate large data sets over a Hadoop cluster.
unstructured data
____ ____ refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs, and random text. It is not in the form of rows and columns.
HTTP POST
____ ____: APPEND (see FileSystem.append) CONCAT (see FileSystem.concat)
HTTP PUT
____ ____: CREATE (see FileSystem.create) MKDIRS (see FileSystem.mkdirs) CREATESYMLINK (see FileContext.createSymlink) RENAME (see FileSystem.rename) SETREPLICATION (see FileSystem.setReplication) SETOWNER (see FileSystem.setOwner) SETPERMISSION (see FileSystem.setPermission) SETTIMES (see FileSystem.setTimes) RENEWDELEGATIONTOKEN (see FileSystem.renewDelegationToken) CANCELDELEGATIONTOKEN (see FileSystem.cancelDelegationToken)
HTTP DELETE
____ ____: DELETE (see FileSystem.delete)
Concurrency
____ and Hadoop FS "handles" The Hadoop FS implementation includes a FS handle cache which caches based on the URI of the namenode along with the user connecting. So, all calls to hdfsConnect will return the same handle but calls to hdfsConnectAsUser with different users will return different handles. But, since HDFS client handles are completely thread safe, this has no bearing on ____. [same word]
libhdfs
____ is a JNI based C API for Hadoop's Distributed File System (HDFS). It provides C APIs to a subset of the HDFS APIs to manipulate HDFS files and the filesystem. ____ is part of the Hadoop distribution and comes pre-compiled in $HADOOP_PREFIX/libhdfs/libhdfs.so . [same word]
HttpFS
____ is a server that provides a REST HTTP gateway supporting all HDFS File System operations (read and write). And it is interoperable with the webhdfs REST HTTP API.
HTTP GET
_____ _____: OPEN (see FileSystem.open) GETFILESTATUS (see FileSystem.getFileStatus) LISTSTATUS (see FileSystem.listStatus) GETCONTENTSUMMARY (see FileSystem.getContentSummary) GETFILECHECKSUM (see FileSystem.getFileChecksum) GETHOMEDIRECTORY (see FileSystem.getHomeDirectory) GETDELEGATIONTOKEN (see FileSystem.getDelegationToken) GETDELEGATIONTOKENS (see FileSystem.getDelegationTokens)
RDBMS
_________ is useful when you want to seek one record from Big Data, whereas Hadoop is useful when you want to read Big Data in one shot and perform analysis on it later.
HDFS
___________ is used to store large datasets in Hadoop.
What is the workflow in Oozie?
a DAG of action nodes and control-flow nodes
What are the two types of nodes in HDFS and in what pattern are they working?
a NameNode (the master) and a number of datanodes (workers) in a master-worker pattern
What is a meta-character?
a character with special meaning interpreted by the matcher
MapReduce is
a computing paradigm for processing data that resides on hundreds of machines
tasks
a MapReduce job is divided into two types of ____: map tasks and reduce tasks
What is Oozie?
a job control utility
What typically delimits a key from a value in MapReduce?
a tab
[a-zA-Z]
a through z or A through Z inclusive
fetchdt
a utility to fetch DelegationToken and store in a file on the local system.
Regular expressions are
a way to describe a set of strings based on common characteristics
Hadoop achieves parallelism by dividing the tasks across many nodes, so it is possible for a few slow nodes to rate-limit the rest of the program and slow it down. What mechanism does Hadoop provide to combat this?
Speculative Execution.
HDFS 1
filesystem to store large data sets by scaling out across a cluster of hosts. Optimized for throughput instead of latency. Achieves HA via replication vs. redundancy.
What do you need for Bloom filtering?
data can be separated into records, a feature can be extracted from each record, a predetermined set of hot values, and some false positives are acceptable
Mapper- Top Ten
find their local top K
Top Ten example
finding outliers, finding the top 10%
Structure of Bloom filtering
first the Bloom filter needs to be trained over the list of values; the resulting data object is stored in HDFS, then the filtering MapReduce job uses it
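A hedged sketch of the training step using Hadoop's built-in org.apache.hadoop.util.bloom.BloomFilter; the vector size, hash count, and the "hotuser42" value are made-up placeholders:

    import org.apache.hadoop.util.bloom.BloomFilter;
    import org.apache.hadoop.util.bloom.Key;
    import org.apache.hadoop.util.hash.Hash;

    public class BloomTrainingSketch {
      public static void main(String[] args) {
        // Vector size and hash count would normally be derived from the expected
        // number of hot values and the acceptable false-positive rate.
        BloomFilter filter = new BloomFilter(10000, 5, Hash.MURMUR_HASH);
        filter.add(new Key("hotuser42".getBytes()));  // training: add each hot value
        // membershipTest() can return a false positive but never a false negative
        System.out.println(filter.membershipTest(new Key("hotuser42".getBytes())));
      }
    }

In the real filtering job, the trained filter object would be serialized to HDFS and loaded by each mapper (for example via the distributed cache), matching the structure described above.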
build applications, not infrastructure
for a developer, there is no need to spend time worrying about job scheduling, error handling, and coordination in distributed processing.
What is HCatalog
for defining and sharing schemas
there are limits
for how big a single host can be.
What is MapReduce?
for processing large data sets in a scalable and parallel fashion
What is Ambari?
for provisioning, managing, and monitoring Apache Hadoop clusters
How do you copy a file from the local file system to HDFS?
hadoop fs -copyFromLocal path/to/file hdfs://localhost/path/to/file or with defaults: hadoop fs -copyFromLocal path/to/file path/to/file (all on one line)
How do you specify a configuration file when using hadoop command?
hadoop fs -conf conf/hadoop-localhost.xml -ls .
How do you list the files when running in pseudo/single/distributed mode?
hadoop fs -conf conf/hadoop-xxx.xml -ls . (where hadoop-xxx.xml is the config file for single, distributed, or pseudo-distributed mode)
Command for copying a file from HDFS to local disk
hadoop fs -copyToLocal remote/path local/path
How can you get help on the hadoop commands for interacting with the file system?
hadoop fs -help
How would you list the files in the root directory of the local filesystem via command line?
hadoop fs -ls file:///
How can you list all the blocks that make up each file in the filesystem?
hadoop fsck / -files -blocks
How do you find which blocks are in any particular file?
hadoop fsck /user/tom/part-0007 -files -blocks -racks
How do you check if the file content is the same after copying to/from HDFS?
md5 input/docs/test.txt test.copy.txt MD5 hash should match
Can Hadoop Pipes be run in standalone mode?
No, it relies on Hadoop's distributed cache mechanism, which only works when HDFS is running. For development, run in pseudo-distributed mode.
[a-d[m-p]]
union - a through d or m through p
Bloom filtering has a _______ applied to each record
unique evaluation function
pig bag
unordered collection of tuples- can spill onto disk
What is the criteria for a node manager failure?
unresponsive for 10 minutes. (Not sending heartbeat to resource manager) yarn.resourcemanager.nm.liveness-monitor.expiry-interval-ms The node manager is removed from the pool and any tasks or application managers running on the node can be recovered.
How can you produce a globally sorted file using Hadoop?
use a partitioner that respects the total order of the output (so not hash partitioner)
dfs.nameservices
used by both the namenode HA and federation features; defines the logical name of the service being provided by a pair of namenode IDs.
What can regular expressions be used for?
used to search, edit, or manipulate text and data.
uses of counters
a useful tool for gathering statistics about your job; counter values are much easier to retrieve than logs
How to retrieve counter values using the Java API (old)?
usual to get counters at the end of a job run ... Counters counters = job.getCounters(); long missing = counters.getCounter( MaxTemperatureWithCounters.Temperature.MISSING);
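For comparison, a hedged sketch of the same lookup with the new API; the Temperature enum here is a stand-in for the book's counter enum:

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;

    public class CounterLookup {
      enum Temperature { MISSING }   // stand-in for MaxTemperatureWithCounters.Temperature

      // Read a counter after the job has completed
      static long missingCount(Job job) throws Exception {
        Counters counters = job.getCounters();
        return counters.findCounter(Temperature.MISSING).getValue();
      }
    }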
What is the key?
what the data will be grouped on
dfs.client.failover.proxy.provider.<nameservice-id>
when namenode HA is enabled, clients need a way to decide which namenode is active and should be used.
Reducer Structure for Inverted Index
will receive a set of unique record identifiers to map back to the input key- identifiers will be concatenated by some delimiter
dfs.ha.namenodes.<nameservice-id>
With the nameservice ID defined by dfs.nameservices, we now need to provide which namenode IDs make up that service. The value is a comma-separated list of logical namenode names.
Tokenize the lines in Pig
words = foreach input generate flatten(TOKENIZE(line)) as word;
HDFS & MR are designed to
work on low commodity clusters (low on cost and specs), scale by adding more servers, identify and work around failures.
Doug Cutting
worked on the Nutch open source web search engine; started working on implementing Google's GFS and MapReduce, and Hadoop was born.
x{n,}, x{n,}?, x{n,}+
x at least n times
x{n,m}, x{n,m}?, x{n,m}+
x at least n times but not more than m
x{n}, x{n}?, x{n}+
x exactly n times
x?, x??, x?+
x once or not at all
x+, x+?, x++
x one or more times
How do you set memory amounts for the node manager?
yarn.nodemanager.resource.memory-mb
If a datanode is in the include and not in the exclude will it connect? If a datanode is in the include and exclude, will it connect?
yes; yes, but it will be decommissioned
x*, x*?, x*+
x zero or more times
What mechanism does Hadoop framework provide to synchronise changes made in Distribution Cache during runtime of the application?
none
x*+ - reluctant, greedy, or possessive?
possessive
how can you force a meta-character to be a normal character?
precede the metacharacter with a backslash, or enclose it within \Q (which starts the quote) and \E (which ends it).
median and standard deviation - mapper
process each input record to calculate the median comment length within each hour of the day - output key is the hour of day and output value is the comment length
data locality optimization
running the map task on the node where the input data resides
Pig- Sample
SAMPLE <relation> <fraction between 0 and 1>
Hadoop works best with _________ and ___________ data, while Relational Databases are best with the first one.
structured and unstructured
[a-z&&[^bc]]
subtraction: a through z, except for b and c: [ad-z]
Doug moved
to Cloudera, and the rest of the team started Hortonworks.
Hadoop IO Class that corresponds to Java Integer
IntWritable
When comparing Hadoop and an RDBMS, which is the better solution for seeking a single record from Big Data?
RDBMS
True or False: Each node in the cluster should run a datanode & tasktracker?
True
True or False: FileSystem Filters can only act on a file's name, not metadata
True
True or False: Hadoop is open source.
True
True or False: Skipping mode is not supported in the new MapReduce API.
True
True or False: The default JobScheduler will fill empty map task slots before reduce task slots.
True
Upgrading a cluster when the filesystem layout hasn't changed is reversible.
True
Using Pig's fs command you can run any Hadoop filesystem shell command from within Grunt
True
True
True or false? MapReduce can best be described as a programming model used to develop Hadoop-based applications that can process massive amounts of unstructured data.
What would you use to turn an array of FileStatus objects to an array of Path objects?
FileUtil.stat2Paths();
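A minimal sketch of that conversion in context, assuming the directory to list is passed on the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class ListPaths {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // listStatus() returns FileStatus[]; stat2Paths() converts it to Path[]
        FileStatus[] status = fs.listStatus(new Path(args[0]));
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
          System.out.println(p);
        }
      }
    }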
Define "fault tolerance".
"Fault tolerance" is the ability of a system to continue operating in the event of the failure of some of its components.
How do counters work?
e.g., the # of bytes of uncompressed input/output consumed by maps in a job is incremented every time collect() is called on the OutputCollector
What configuration is used with the hadoop command if you don't use the -conf option?
$HADOOP_INSTALL/conf
Where can one learn the default settings for all the public properties in Hadoop?
$HADOOP_INSTALL/docs/core-default.html hdfs-default.html mapred-default.html
Where are logs stored by default? How and where should you move them?
$HADOOP_INSTALL/logs set HADOOP_LOG_DIR in hadoop-env.sh Move it outside the install path to avoid deletion during upgrades
How do you finalize an upgrade?
$NEW_HADOOP_HOME/bin/ hadoop dfsadmin -finalizeUpgrade
How do you start an upgrade?
$NEW_HADOOP_HOME/bin/ start-dfs.sh -upgrade
How do you check the progress of an upgrade?
$NEW_HADOOP_HOME/bin/hadoop dfsadmin -upgradeProgress status
How do you roll back an upgrade?
$NEW_HADOOP_HOME/bin/stop-dfs.sh $OLD_HADOOP_HOME/bin/start-dfs.sh
What is the directory structure for the secondary namenode? What are the key points in its design?
${ dfs.checkpoint.dir } - current/ -- version -- edits -- fsimage -- fstime - previous.checkpoint/ -- version -- edits -- fsimage -- fstime Previous checkpoint can act as a stale backup If the secondary is taking over you can use -importCheckpoint when starting the namenode daemon to use the most recent version
What is the directory structure of the Datanode?
${ dfs.data.dir } - current/ -- version -- blk_<id_1> -- blk_<id_1>.meta -- blk_<id_2> -- blk_<id_2>.meta .... --subdir0/ --subdir1/
What does a newly formatted namenode look like? (directory structure)
${ dfs.name.dir } - current -- version -- edits -- fsimage -- fstime
How do you map between node addresses and network locations? Which config property defines an implementation of DNSToSwitchMapping?
(1) public interface DNSToSwitchMapping{ public List<String> resolve( List<String> names); } names - list of IP addresses - returns list of corresponding network location strings (2) topology.node.switch.mapping.impl (Namenodes and jobtrackers use this to resolve worker node network locations)
What are the following HTTP Server default ports? (a) mapred.job.tracker.http.address (b) mapred.task.tracker.http.address (c) dfs.http.address (d) dfs.datanode.http.address (e) dfs.secondary.http.address
(a) 0.0.0.0:50030 (b) 0.0.0.0:50060 (c) 0.0.0.0:50070 (d) 0.0.0.0:50075 (e) 0.0.0.0:50090
What do the following YARN properties manage? (a) yarn.resourcemanager.address (b) yarn.nodemanager.local-dirs (c) yarn.nodemanager.aux-services (d) yarn.nodemanager.resource.memory-mb (e) yarn.nodemanager.vmem-pmem-ratio
(a) 0.0.0.0:8032 (default) where the resource manager RPC runs (b) locations where node managers allow containers to share intermediate data (cleared at the end of a job) (c) List of auxiliary services run by the node manager (d) Amount of physical memory that may be allocated to containers being run by the node manager (e) Ratio of virtual to physical memory for containers.
YARN HTTP Servers: (ports and usage) (a) yarn.resourcemanager.webapp.address (b) yarn.nodemanager.webapp.address (c) yarn.web-proxy.address (d) mapreduce.jobhistory.webapp.address (e) mapreduce.shuffle.port
(a) 0.0.0.0:8088 - resource manager web ui (b) 0.0.0.0:8042 - node manager web ui (c) (default not set) webapp proxy server, if not set resource managers process (d) 0.0.0.0:19888 - job history server (e) 8080 shuffle handlers HTTP port ( not a user-accessible web UI)
YARN RPC Servers: (ports and usage) (a) yarn.resourcemanager.address (b) yarn.resourcemanager.admin.address (c) yarn.resourcemanager.scheduler.address (d) yarn.resourcemanager.resourcetracker.address (e) yarn.nodemanager.address (f) yarn.nodemanager.localizer.address (g) mapreduce.jobhistory.address
(a) 8032 Used by client to communicate with the resource manager (b) 8033 Used by the admin client to communicate with the resource manager (c) 8030 Used by in-cluster application masters to communicate with the resource manager (d) 8031 Used by in-cluster node managers to communicate with the resource manager (e) 0 Used by in-cluster application masters to communicate with node managers (f) 8040 (g) 10020 Used by the client, typically outside the cluster, to query job history
A. What is distcp? (use case) B. How do you run distcp? C. What are its options? (2)
(a) A Hadoop program for copying large amounts of data to and from the Hadoop Filesystem in parallel. use case: Transferring data between two HDFS clusters (b) hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar (c) overwrite - distcp will skip files that already exist without specifying this update - updates only the files that have changed
What are OutputCommitters used for? How do you implement it?
(a) A commit protocol to ensure that jobs or tasks either succeed or fail cleanly. The framework ensures that in the event of multiple attempts for a particular task, only one will be committed; the others will be aborted. (b) old - JobConf.setOutputCommitter() or mapred.output.committer.class new - the OutputCommitter is decided by the OutputFormat, using getOutputCommitter() (default FileOutputCommitter)
What is a Balancer program? Where does it output?
(a) A Hadoop daemon that redistributes blocks from over-utilized datanodes to under-utilized datanodes, while still adhering to the replication placement policy. Moves blocks until the cluster is deemed "balanced". (b) Standard log directory
What is fencing?
A method of stopping corruption if there is an ungraceful failover
What is fencing? What are fencing methods? (4)
(a) A method of stopping corruption if there is ungraceful failover (b) 1. Killing the namenode process 2. Revoking the namenodes access to the shared storage directory 3. Disabling its network port 4. STONITH - (extreme) Shoot the other node in the head - Specialized power distribution unit to force the host machine down
What happens during a Job Completion in MR2? (a) Before (b) During (c) After
(a) Client calls Job.waitForCompletion() as well as polling the application master every 5 seconds via (mapreduce.client.completion.pollinterval) (b) Notification via HTTP callback from the application master (c) Application master and task containers clean up their working state. Job information is archived by the job history server.
What are metrics? Which metrics contexts does Hadoop use?
(a) Data collection from HDFS and Mapreduce daemons (b) dfs, rpc, jvm (datanodes), mapred
What does FSDataOutputStream's sync() method do? What would be the effect of not having calls to sync()? T/F: Closing a file in HDFS performs an implicit sync()
(a) Forces all buffers to be synchronized to the datanodes. When sync() returns successfully, HDFS guarantees that the data written up to that point has reached all the datanodes and is visible to new readers. (b) Possible loss of up to a block of data in the event of a client or system failure (c) True
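A minimal sketch of using sync(), assuming an older-API FSDataOutputStream (newer releases spell this hflush()/hsync()); the file path comes from the command line:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncDemo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FSDataOutputStream out = fs.create(new Path(args[0]));
        out.write("first record\n".getBytes("UTF-8"));
        out.sync();    // data written so far is now visible to new readers
        out.close();   // close() performs an implicit sync()
      }
    }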
What is the function of the Secondary Namenode? Where should the Secondary Namenode be located? True or False: The Secondary Namenode will always lag the Namenode
(a) It doesn't act in the same way as the Namenode. It periodically merges the namespace image with the edit log to prevent the edit log from being too large. (b) On a separate machine from the Namenode because it requires as much memory/CPU usage as the Namenode to run. (c) True
How can you set an arbitrary number of mappers to be created for a job in Hadoop?
You can't set it directly; the number of map tasks is determined by the number of input splits.
A. What is the Hadoop Archive (HAR files)? B. How do you use them? C. How do you list the file contents of a .har? D. What are the limitations of Hadoop Archives?
(a) It is a file archiving facility that packs files into HDFS blocks more efficiently. They can be used as an input to a mapreduce job. It reduces namenode memory usage, while allowing transparent access to files. (b) hadoop archive -archiveName files.har /my/files /my (c) hadoop fs -lsr har://my/files.har (d) 1. Creates a copy of the files (disk space usage) 2. Archives are immutable 3. No compression on archives, only files
What is the purpose of the Interface Progressable? When is Progressable's progress() called? T/F: Only HDFS can use Progressable's progress().
(a) Notifies your application of the progress of data being written to datanodes. (b) Progress is called after each 64KB packet of data is written to the datanode pipeline (c) True
How does a jobtracker choose a (a) reduce task? (b) a map task?
(a) Reduce task - takes the next in the list. (No data locality considerations) (b) Map task - Takes into account the tasktracker's network location and picks a task whose input split is as close as possible to the tasktracker.
What are MapReduce defaults? (a) job.setInputFormatClass() (b) job.setMapperClass() (c) job.setOutputKeyClass() (d) job.setOutputValueClass()
(a) TextInputFormat.class (b) Mapper.class (c) LongWritable.class (d) Text.class
The following YARN config files do what? (a) yarn-env.sh (b) yarn-site.xml (c) mapred-site.xml
(a) environment variables (b) config settings for YARN daemons (c) properties still used without jobtracker & tasktracker related properties
How do you disable checksum verification? How do you disable checksums on the client side?
(a) fs.setVerifyChecksum(false); fs.open(); OR -copyToLocal -ignoreCrc (b) using RawLocalFileSystem instead of FileSystem. fs.file.impl = org.apache.hadoop.fs.RawLocalFileSystem
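A hedged sketch of the first approach; args[0] is assumed to be a full file URI such as hdfs://namenode/path/file:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class NoChecksumRead {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create(args[0]), new Configuration());
        fs.setVerifyChecksum(false);                 // skip client-side checksum verification
        FSDataInputStream in = fs.open(new Path(args[0]));
        System.out.println(in.read());               // read the first byte
        in.close();
      }
    }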
How do you get the block verification report for a datanode? How do you get a list of blocks on the datanode and their status?
(a) http://datanode:50075/blockScannerReport (b) http://datanode:50075/blockScannerReport?listBlocks
What are the default Streaming attributes? (a) input (b) output (c) inputFormat (d) mapper
(a) input/ncdc/sample.txt (b) output (c) org.apache.hadoop.mapred.TextInputFormat (d) /bin/cat
What are the values to set for criteria to run a task uberized? How do you disable running something uberized?
(a) mapreduce.job.ubertask.maxreduces mapreduce.job.ubertask.maxbytes mapreduce.job.ubertask.maxmaps (b) mapreduce.job.ubertask.enable = false
What do the following CompressionCodecFactory methods do? (a) getCodec() (b) removeSuffix
(a) maps a filename extension to a CompressionCodec (takes Path object for the file in question) (b) strips off the file suffix to form the output filename (ex file.gz => file )
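A minimal sketch of both methods together; file.gz is a hypothetical input file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class CodecLookup {
      public static void main(String[] args) {
        CompressionCodecFactory factory = new CompressionCodecFactory(new Configuration());
        Path path = new Path("file.gz");                  // hypothetical input file
        CompressionCodec codec = factory.getCodec(path);  // maps .gz to GzipCodec
        // Strip the codec's suffix to form the output filename: file.gz => file
        String output = CompressionCodecFactory.removeSuffix(
            path.toString(), codec.getDefaultExtension());
        System.out.println(output);
      }
    }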
What do the following Safe Mode properties do? (a) dfs.replication.min (b) dfs.safemode.threshold.pct (c) dfs.safemode.extension
(a) minimum number of replicas that have to be written for a write to be successful. (b) (0.999) Proportion of blocks that must meet minimum replication before the system will exit Safe Mode (c) (30,000) Time(ms) to extend Safe Mode after the minimum replication has been satisfied
New API or Old API? (a) Job (b) JobConf (c) org.apache.hadoop.mapred (d) org.apache.hadoop.mapreduce
(a) new api (b) old api (c) old api (d) new api
How do you create a FileSystem instance? How do you create a local FileSystem instance?
(a) public static FileSystem get(URI uri, Configuration conf, String user) throws IOE -uri and user are optional (b) public static LocalFileSystem getLocal(Configuration conf) throws IOE
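A small sketch of calling both factories; the hdfs://localhost/ URI is just a placeholder namenode address:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.LocalFileSystem;

    public class GetFileSystems {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://localhost/"), conf);
        LocalFileSystem local = FileSystem.getLocal(conf);
        System.out.println(hdfs.getUri() + " " + local.getUri());
      }
    }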
What do the following options for dfsadmin do? (a) -help (b) -report (c) -metasave (d) -safemode (e) -saveNamespace (f) -refreshNodes (g) -upgradeProgress (h) -finalizeUpgrade (i) -setQuota (j) -clrQuota (k) -setSpaceQuota (l) -clrSpaceQuota (m) -refreshServiceACL
(a) shows help for given command or -all (b) shows filesystem statistics & info on datanodes (c) Dumps info on blocks being replicated/deleted and connected datanodes to logs (d) Changes or queries the state of Safe Mode (e) Saves current in-memory filesystem image to a new fsimage file and resets the edits file (only in safe mode) (f) Updates the set of datanodes that are permitted to connect to the namenode (g) Gets info on the progress of an HDFS upgrade and forces an upgrade to proceed (h) After upgrade is complete it deletes the previous version of the namenode and datanode directories (i) Sets directory quota. Limit on files/directories in the directory tree. Preserves namenode memory by preventing the creation of a large number of small files. (j) Clears specified directory quotas (k) Sets space quotas on directories. Limit on size of files in directory tree. (l) Clears specified space quotas (m) Refreshes the namenode's service-level authorization policy file.
What are MapReduce defaults? (e) job.setPartitionerClass() (f) job.setNumReduceTasks() (g) job.setReducerClass() (h) job.setOutputFormatClass()
(e) HashPartitioner.class - hashes a records key to determine which partition the record belongs in (f) 1 (g) Reducer.class (h) TextOutputFormat.class (tab delimited)
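Putting the (a)-(h) defaults together, a hedged sketch of a minimal new-API driver that states them explicitly; Job.getInstance() assumes a reasonably recent release, and the input/output paths come from the command line:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

    public class MinimalDriverWithDefaults {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setJarByClass(MinimalDriverWithDefaults.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Everything below simply restates the defaults listed above
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapperClass(Mapper.class);                    // identity mapper
        job.setPartitionerClass(HashPartitioner.class);
        job.setNumReduceTasks(1);
        job.setReducerClass(Reducer.class);                  // identity reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }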
What are the default Streaming attributes? (e) partitioner (f) numReduceTasks (g) reducer (h) outputFormat
(e) org.apache.hadoop.mapred.lib.HashPartitioner (f) 1 (g) org.apache.hadoop.mapred.lib.IdentityReducer (h) org.apache.hadoop.mapred.TextOutputFormat
How is an output path specified for a MapReduce job?
FileOutputFormat.setOutputPath(____); the directory should not already exist
What are common tips when installing Hadoop
- Create a hadoop user, for smaller clusters you can create the user home directory on an NFS server outside the cluster - Change the owner of the Hadoop files to the hadoop user and group - Keep config in sync between machines using rsync or shell tools (dsh, pdsh) - If you introduce a stronger class of machine, you can manage separate configs per machine class (using Chef, Puppet, cfengine)
How will you write a custom partitioner for a Hadoop job?
- Create a new class that extends the Partitioner class - Override the getPartition method - In the wrapper that runs the MapReduce job, either add the custom partitioner to the job programmatically using setPartitionerClass, or add the custom partitioner to the job as a config file (if your wrapper reads from a config file or Oozie)
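A hedged sketch of those steps; the Text/IntWritable types and the first-character partitioning rule are made up for illustration:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
      @Override
      public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Route records by the first character of the key
        return (key.charAt(0) & Integer.MAX_VALUE) % numPartitions;
      }
    }
    // In the driver: job.setPartitionerClass(FirstCharPartitioner.class);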
Consider this scenario in an M/R system: the HDFS block size is 64 MB, the input format is FileInputFormat, and we have 3 files of size 64KB, 65MB, and 127MB. How many input splits will be made by the Hadoop framework?
Hadoop will make 5 splits as follows: - 1 split for the 64KB file - 2 splits for the 65MB file - 2 splits for the 127MB file
How to get the values to also be sorted before reducing?
- Make the key a composite of the natural key and the natural value. - The sort comparator should order by the composite key, that is, the natural key and natural value. - The partitioner and grouping comparator for the composite key should consider only the natural key for partitioning and grouping.
What will a Hadoop job do if you try to run it with an output directory that is already present? Will it overwrite it, warn you and continue, or throw an exception and exit?
The Hadoop job will throw an exception and exit.
What should Pig not be used for?
- Pig doesn't perform as well as programs written in MapReduce (the gap is closing) - Designed for batch processing, therefore if you want a query that only touches a small subset of data in a large set, Pig will not perform well because it was meant to scan the entire set.
What constitutes as progress in MapReduce?
- Reading an input record - Writing an output record - Setting the status description on a reporter (using Reporter's setStatus() method) - Incrementing a Counter (Reporter's incrCounter() method) - Calling Reporter's progress() method
The input to a MapReduce job is a set of files in the data store that are spread out over the
HDFS
Using the command line in Linux, how will you (a) see all jobs running in the Hadoop cluster and (b) kill a job?
hadoop job -list hadoop job -kill jobID
Name the most common Input Formats defined in Hadoop? Which one is default?
- TextInputFormat - KeyValueInputFormat - SequenceFileInputFormat TextInputFormat is the Hadoop default.
Hadoop 2.x releases fixes namenode failure issues by adding support for HDFS High Availability (HA). What were the changes?
- You can now have 2 Namenodes in active standby configuration. - Shared storage must be used for the edit log - Datanodes send block reports to all Namenodes because block mappings are stored in the Namenodes memory, not on disk - Clients must be configured to handle Namenode failover
What are CompressionCodecs' two methods that allow you to compress or decompress data?
- createOutputStream (OutputStream out) creates a CompressionOutputStream to write your data to, to be compressed. - createInputStream (InputStream in) creates a CompressionInputStream to read uncompressed data from the underlying stream.
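A hedged sketch using the first method to gzip stdin to stdout; the codec choice and buffer size are arbitrary:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionOutputStream;
    import org.apache.hadoop.io.compress.GzipCodec;
    import org.apache.hadoop.util.ReflectionUtils;

    public class StreamCompressor {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
        // Wrap stdout so everything written to it is compressed
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();   // flush the compressed stream without closing stdout
      }
    }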
HDFS 4
- optimized for a write-once, read-many access pattern - storage nodes run a datanode to manage blocks; these are coordinated by the NameNode - it uses replication instead of hardware HA features
All compression algorithms have a space/time tradeoff. 1. How do you maximize for time? 2. How do you maximize for space?
-1 (speed) -9 (space) ex: gzip -1 file (Creates a compressed file file.gz using the fastest compression method)
What are some concrete implementations of InputFormat?
-CombineFileInputFormat -DBInputFormat -FileInputFormat -KeyValueTextInputFormat -NLineInputFormat -SequenceFileAsBinaryInputFormat -SequenceFileAsTextInputFormat -StreamInputFormat -TeraInputFormat -TextInputFormat
What are some concrete implementations of RecordReader?
-DBInputFormat.DBRecordReader -InnerJoinRecordReader -KeyValueLineRecordReader -OuterJoinRecordReader -SequenceFileAsTextRecordReader -SequenceFileRecordReader -StreamBaseRecordReader -StreamXmlRecordReader
What are the benefits of File Compression?
1. Reduces the space needed to store files. 2. Speeds up data transfer across the network to and from the disk
Fill in the blank: The command for removing a file from hadoop recursively is hadoop dfs ___________ <directory>
-rmr
Give an example of a meta-character
.
Have you ever used Counters in Hadoop? Give us an example scenario.
...
cpu power has grown much faster than network and disk speeds
...
How may reduces can the local job runner run?
0 or 1
When running under the local jobrunner, how many reducers are supported?
0 or 1
How many reducers do you need in Top Ten?
1
What is 1024 Exabytes?
1 Zettabyte
All Oozie workflows must have which control nodes?
1 start node <start to="max-temp-mr"/> 1 end node <end name="end"/> 1 kill node <kill name="fail"><message>MapReduce failed error...</message></kill> When the workflow starts it goes to the node specified in start. If workflow succeeds -> end If workflow fails -> kill
What are the steps implemented by the JobClient's submitJob() method for job initialization?
1) Ask jobtracker for new job ID 2) Check job output specification 3) Compute InputSplits 4) Copies resources needed to run job -jar file -configuration file 5) Tells the jobtracker that job is ready for execution
What are the steps taken by the task tracker for task execution?
1) Copies resources from shared file system to the task trackers' file system -jar -distributed cache files 2) creates local working directory and unjars jar 3) creates instance of TaskRunner to run the task 4) TaskRunner launches JVM 5) Runs task
List the items in a MapReduce job tuning checklist
1) Number of Mappers 2) Number of Reducers 3) Combiners 4) Intermediate Compression 5) Custom serialization 6) Shuffle Tweaks
What steps does the job scheduler take to create a list of tasks to run?
1) Retrieve InputSplits 2) create one map task for each split 3) creates reduce tasks based on the mapred.reduce.tasks property (task IDs are given as tasks are created)
Ten Steps hadoop follows to run a MapReduce job.
1) Run Job 2) Get new job id 3) Copy job resources 4) Submit job 5) initialize job 6) Retrieve input split 7) Heartbeat 8) Retrieve job resources 9) Launch 10) Run
What does Input Format do?
1) Validate the input-specification of the job. 2) Split-up the input file(s) into logical InputSplits, each of which is assigned to an individual Mapper 3) Provide the RecordReader implementation to be used to glean input records from the logical InputSplit for processing by the Mapper
What does OutputFormat do?
1) Validate the output specification of the job, e.g., check that the output directory doesn't already exist. 2) Provide the RecordWriter implementation to be used to write out the output files of the job.
What mechanisms are provided to make the NameNode resilient to failure?
1) backup files that make up the persistent state of the file system metadata: -write to local disk -write to a remote NFS mount 2) run a secondary namenode - does not act as a namenode - periodically merges the namespace image with the edit log
What are the options for storing files in HDFS? (think compression and splitting)
1) uncompressed 2) compressed in format that support splitting (bzip2) 3) split file and compress resulting pieces 4) use a sequence file 5) use an Avro data file
1. job IDs are __ based. 2. task IDs are ___ based. 3. attempt IDs are ___ based.
1. 1 2. 0 3. 0
How much memory does Hadoop allocate per daemon? Where is it controlled?
1. 1GB 2. HADOOP_HEAPSIZE in hadoop_env.sh
1. Task logs are deleted after how long? 2. Where can it be configured? 3. How do you set the cap size of a log file?
1. 24 hours 2. mapred.userlog.retain.hours 3. mapred.userlog.limit.kb
What are the benefits of having a block abstraction for a distributed filesystem? (3)
1. A file can be larger than any disk on the network. It can be put into blocks and distributed without size concerns 2. It simplifies storage - Since we know how many blocks can be stored on a given disk through a simple calculation. It allows metadata to be stored separately from the data chunks. 3. Copies of blocks can be made (typically 3) and used in case of a node failure.
What should you do before upgrading?
1. A full disk fsck (save output and compare after upgrade) 2. clear out temporary files 3. delete the previous version (finalizing the upgrade)
What are things to look for on the Tuning Checklist? (How can I make a job run faster?) 1. Number of Mappers 2. Number of Reducers 3. Combiners 4. Intermediate Compression 5. Custom Serialization 6. Shuffle Tweaks
1. A mapper should run for about a minute. Any shorter and you should reduce the number of mappers. 2. Slightly fewer reducers than the number of reduce slots in the cluster. This allows the reducers to finish in a single wave, using the cluster fully. 3. Check if a combiner can be used to reduce the amount of data going through the shuffle. 4. Job execution time can almost always benefit from enabling map output compression. 5. Use RawComparator if you are using your own custom Writable objects or custom comparators. 6. Lots of tuning parameters for memory management
1. What is Apache Oozie? 2. What are its two main parts? 3. What is the difference between Oozie and JobControl? 4. What do action nodes do? control nodes? 5. What are two possible types of callbacks?
1. A system for running workflows of dependent jobs 2. (a) Workflow engine - stores and runs workflows composed of different types of Hadoop jobs (MapReduce, Pig, Hive) (b) Coordinator engine - runs workflow jobs based on pre-defined schedules and data availability. 3. JobControl runs on the client machine submitting the jobs. Oozie runs as a service in the cluster and client submit workflow definitions for immediate or later execution 4. (a) Performs a workflow task such as: moving files in HDFS, running MapReduce, Streaming or Pig jobs, Sqoop imports, shell scripts, java programs (b) Governs the workflow execution using conditional logic 5. (a) On workflow completion, HTTP callback to client to inform workflow status. (b) receive callbacks every time a workflow enters/exits an action node
Describe the process of decommissioning nodes (7 steps).
1. Add network address of decommissioned node to exclude file. 2. Update the namenode: hadoop dfsadmin -refreshNodes 3. Update jobtracker: hadoop mradmin -refreshNodes 4. Web UI - check that node status is "decommission in progress" 5. When status = "decommissioned" all blocks are replicated. The node can be shut down. 6. Remove from the include file, then hadoop dfsadmin -refreshNodes hadoop mradmin -refreshNodes 7. Remove nodes from slaves file.
Describe the process of commissioning nodes (6 steps).
1. Add network address of new nodes to the include file. 2. Update the namenode with new permitted tasktrackers: hadoop dfsadmin -refreshNodes 3. Update the jobtracker with the new set of permitted tasktrackers: hadoop mradmin -refreshNodes 4. Update the slaves file with the new nodes 5. Start the new datanode/tasktrackers 6. Check that the new datanodes/tasktrackers show up in the web UI.
What are the steps to access a service with Kerberos?
1. Authentication - client authenticates themselves to get a TGT (Ticket-Granting Ticket). Explicitly carried out by user using the kinit command which prompts for a password. (Good for 10 hours) For automating this you can create a keytab file using ktutil command. 2. Authorization - (not user level, the client performs) The client uses the TGT to request a service ticket from the Ticket Granting Server 3. Service request - (not user level) Client uses service ticket to authenticate itself to the server that is providing the service. (ex: namenode, jobtracker)
What two parts make up the Key Distribution Center (KDC) ?
1. Authentication Server 2. Ticket Granting Server
What are the Writable wrapper classes for Java primitives?
1. BooleanWritable 2. ByteWritable 3. IntWritable 4. VIntWritable 5. FloatWritable 6. LongWritable 7. VLongWritable 8. Text Use IntWritable for short and char. V stands for variable (as in variable length).
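A small sketch illustrating the fixed vs. variable-length encodings; the value 163 and the serialize helper are arbitrary:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.VIntWritable;
    import org.apache.hadoop.io.Writable;

    public class WritableSizes {
      static byte[] serialize(Writable w) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        w.write(dataOut);           // each Writable writes its own binary form
        dataOut.close();
        return out.toByteArray();
      }

      public static void main(String[] args) throws Exception {
        System.out.println(serialize(new IntWritable(163)).length);   // always 4 bytes
        System.out.println(serialize(new VIntWritable(163)).length);  // 2 bytes for this value
      }
    }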
What does FileSystem check do? (fsck) usage?
1. Checks the health of files in HDFS. Looks for blocks that are missing from all datanodes as well as under/over replicated blocks. fsck does a check by looking at the metadata files for blocks and checking for inconsistencies. 2. hadoop fsck / (directory to recursively search)
The balancer runs until: (3)
1. Cluster is balanced 2. It cannot move any more blocks 3. It loses contact with the Namenode
What are the methods to make the Namenode resistant to failure? (2)
1. Configure Hadoop so that it writes its persistent state to multiple filesystems 2. Run a Secondary Namenode
How does the reduce side of the Shuffle work? (Copy Phase)
1. Copy Phase - after a map task completes, the reduce task starts copying their outputs. Small numbers of copier threads are used so it can fetch output in parallel. (default = 5 mapred.reduce.parallel.copies) The output is copied to the reduce task JVM's memory. (if its small enough) otherwise, its copied to disk. When the in-memory buffer reaches threshold size or reaches threshold number of map outputs it is merged and spilled to disk. mapred.job.shuffle.merge.percent mapred.inmem.merge.threshold A combiner would be run here if specified. Any map outputs that were compressed have to be decompressed in memory. When all map outputs have been copied we continue to the Sort phase.
What are the steps in packaging a job?
1. Create a jar file using Ant, Maven or the command line. 2. Include any needed classes in the root/classes directory. Dependent jar files can go in root/lib 3. Set the HADOOP_CLASSPATH to dependent jar files.
What are the possible Hadoop compression codecs? Are they supported Natively or do they use a Java Implementation?
1. DEFLATE - (Java yes, Native yes) org.apache.hadoop.io.compress.DefaultCodec 2. gzip (Java yes, Native yes) org.apache.hadoop.io.compress.GzipCodec 3. bzip2 (Java yes, Native no) org.apache.hadoop.io.compress.BZip2Codec 4. LZO (Java no, Native yes) com.hadoop.compression.lzo.LzoCodec 5. LZ4 (Java no, Native yes) org.apache.hadoop.io.compress.Lz4Codec 6. Snappy (Java no, Native yes) org.apache.hadoop.io.compress.SnappyCodec
What are the three possibilities of Map task/HDFS block locality?
1. Data local 2. Rack local 3. Off-rack
command to start distributed file system
bin/start-dfs.sh
How does the map portion of the MapReduce write output?
1. Each map task has a circular memory buffer that writes output 100MB by default (io.sort.mb) 2. When contents of the buffer meet threshold size (80% default io.sort.spill.percent) a background thread will start to spill the contents to disk. Map outputs continue to be written to the buffer while the spill is taking place. If the buffer fills up before the spill is complete, it will wait. 3. Spills are written round robin to directories specified (mapred.local.dir) Before it writes to disk, the thread divides the data into partitions based on reducer and then the partition is sorted by key. If a combiner exists, it is then run.
1.What is Job History? 2. Where are the files stored? 3. How long are History files kept? 4. How do you view job history via command line?
1. Events and configuration for a completed job. 2. The local file system of the jobtracker (the history subdir of the logs directory). 3. 30 days (hadoop.job.history.location); a 2nd copy in the _logs/history subdirectory of the job's output location (hadoop.job.history.user.location) is never deleted. 4. hadoop job -history
What controls the schedule for checkpointing? (2)
1. Every hour (fs.checkpoint.period) 2. If the edits log reaches 64MB (fs.checkpoint.size)
The FileStatus class holds which data? (6)
1. File length 2. Block size 3. Replication 4. Modification time 5. Ownership 6. Permission Information
How do you set precedence for the users classpath over hadoop built in libraries?
1. HADOOP_USER_CLASSPATH_FIRST = true 2. mapreduce.task.classpath.first to true
What are the two types of blk files in the datanode directory structure and what do they do?
1. HDFS blocks themselves (consist of a files raw bytes) 2. The metadata for a block made up of header with version, type information and a series of checksums for sections on the block.
What is Audit logging and how do you enable it?
1. HDFS logs all filesystem requests with log4j at the INFO level 2. log4j.logger.org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit = INFO (default WARN)
When are tasktrackers blacklisted? How do blacklisted tasktrackers behave?
1. If more than 4 tasks from the same job fail on a particular tasktracker, (mapred.max.tracker.failures) the jobtracker records this as a fault. If the number of faults is over the minimum threshold (mapred.max.tracker.blacklists) default 4, the tasktracker is blacklisted. 2. They are not assigned tasks. They still communicate with the jobtracker. Faults expire over time (1 per day) so they will get a chance to run again. If the fault can be fixed (ex: hardware) when it restarts it will be re-added.
A MapReduce job consists of three things:
1. Input Data 2. MapReduce Program 3. Configuration Information
What are the steps of an upgrade (when the filesystem layout hasn't changed)
1. Install new versions of HDFS and MapReduce 2. Shut down old daemons 3. Update config files 4. Start up new daemons and use new libraries
How does the default implementation of ScriptBasedMapping work? What happens if there is no user-defined script?
1. It runs a user-defined script to determine the mapping; the script location is configured via topology.script.file.name. The script accepts arguments (IP addresses) and returns a list of network locations 2. All nodes are mapped to a single network location called /default-rack
What does a Jobtracker do when it is notified of a task attempt which has failed? How many times will a task be re-tried before job failure? What are 2 ways to configure failure conditions?
1. It will reschedule the task on a new tasktracker (node) 2. 4 times (default) 3. mapred.map.max.attempts mapred.reduce.max.attempts Tasks may also be allowed to fail up to a certain percentage: mapred.max.map.failures.percent mapred.max.reduce.failures.percent Note: Killed tasks do not count as failures.
How do you do metadata backups?
1. Keep multiple copies of different ages (1hr, 1day,1week) 2. Write a script to periodically archive the secondary namenodes previous.checkpoint subdir to an offsite location 3. Integrity of the backup is tested by starting a local namenode daemon and verifying it has read fsimage and edits successfully.
The Web UI has action links that allow you to do what? How are they enabled?
1. Kill a task attempt 2. webinterface.private.actions = true
What are fencing methods? (4)
1. Killing the namenode process 2. Revoking the namenodes access to the shared storage directory 3. Disabling its network port 4. STONITH - (extreme) Shoot the other node in the head - Specialized power distribution unit to force the host machine down
What are two ways to limit a task's memory usage?
1. Linux ulimit command or mapred.child.ulimit. This should be larger than mapred.child.java.opts otherwise the child JVM might not start 2. Task Memory Monitoring - administrator sets allowed range of virtual memory for tasks on the cluster. Users will set memory usage in their job config, if not it uses mapred.job.map.memory.mb and mapred.job.reduce.memory.mb. This is a better approach because it encompasses the whole task tree and spawned processes. The Capacity Scheduler will account for slot usage based on memory settings.
During Namenode failure an administrator starts a new primary namenode with a filesystem metadata replica and configures datanodes and clients to use the new namenode. The new Namenode won't be able to serve requests until these (3) tasks are completed.
1. Loaded its namenode image into memory 2. Replayed edits from the edit log 3. Received enough block reports from datanodes to leave safe mode
What are the two types of log files? When are they deleted?
1. Logs ending in .log are made by log4j and are never deleted. These logs are for most daemon tasks 2. Logs ending in .out act as a combination standard error and standard output log. Only the last 5 are retained and they are rotated out when the daemon restarts.
Describe the upgrade process. ( 9 Steps)
1. Make sure any previous upgrade is finalized before proceeding 2. Shut down MapReduce and kill any orphaned tasks/processes on the tasktrackers 3. Shut down HDFS, and back up namenode directories. 4. Install new versions of Hadoop HDFS and MapReduce on cluster and clients. 5. Start HDFS with -upgrade option 6. Wait until upgrade completes. 7. Perform sanity checks on HDFS (fsck) 8. Start MapReduce 9. Roll back or finalize upgrade
What are characteristics of these compression formats? 1. Gzip 2. Bzip2 3. LZO, LZ4, Snappy
1. Middle of the space/time tradeoff 2. Compresses more effectively but slower (than Gzip) 3. All optimized for speed, compress less effectively
Which Hadoop MBeans are there? What daemons are they from? (5)
1. NameNodeActivityMBean (namenode) 2. FSNameSystemMBean (namenode) - namenode status metrics ex: # of datanodes connected 3. DatanodeActivityMBean (datanode) 4. FSDatasetMBean (datanode) - datanode storage metrics ex: capacity/free space 5. RPCActivityMBean (all rpc daemons) RPC statistics ex: average processing time
What are the restrictions while being in Safe Mode? (2)
1. Offers only a read-only view of the filesystem to clients. 2. No new datanodes are setup/written to. This is because the system has references to where the blocks are in the datanodes and the namenode has to read them all before coordinating any instructions to the datanodes.
What are reasons to turn off Speculative Execution?
1. On a busy cluster it can reduce overall throughput since there are duplicate tasks running. Admins can turn it off and have users override it per job if necessary. 2. For reduce tasks, since duplicate tasks have to transfer duplicate map inputs which increases network traffic 3. Tasks that are not idempotent. You can make tasks idempotent using OutputComitter. Idempotent(def) apply operation multiple times and it doesn't change the result.
Application Master Failure. 1. Applications are marked as failed if they fail ____ . 2. What happens during failure? 3. How does the MapReduce application manager recover state of which tasks were run successfully? 4. How does the client find a new application master?
1. Once 2. The resource manager notices a missing heartbeat from the application master and starts a new instance of the master running in a new container (managed by the node manager). 3. If recovery is enabled (yarn.app.mapreduce.am.job.recovery.enabled = true) 4. During job initialization, the client asks the resource manager for the application master's address and caches it. On failure the client experiences a timeout when it issues a status update, at which point it goes back to the resource manager to find the new address.
How does task execution work in YARN?
Once a task is assigned to a container via the resource manager's scheduler, the application master starts the container by contacting the node manager. The container is started via the Java application YarnChild, which localizes resources and runs the MapReduce task. YARN does not support JVM reuse.
In HDFS each file and directory have their own? (3 permission structures)
1. Owner 2. Group 3. Mode
How do you manage data backups?
1. Prioritize your data. What must not be lost, what can be lost easily? 2. Use distcp to make a backup to other HDFS clusters (preferably to a different hadoop version to prevent version bugs) 3. Have a policy in place for user directories in HDFS. (how big? when are they backed up?)
What are common remote debugging techniques?
1. Reproduce the failure locally, possibly using a debugger like Java's VisualVM. 2. Use JVM debugging options for JVM out-of-memory errors. Set -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps Dumps the heap to be examined afterward with tools such as jhat or the Eclipse Memory Analyzer 3. Task Profiling - Hadoop provides a mechanism to profile a subset of the tasks in a job. 4. IsolationRunner - (old hadoop) could re-run old tasks
YARN takes responsibilities of the jobtracker and divides them between which 2 components?
1. Resource Manager 2. Application Master
For multiple jobs to be run, how do you run them linearly? Directed Acyclic Graph of jobs?
1. Run each job, one after another, waiting until the previous completes successfully. Throws an exception and the processing stops at the failed job. ex: JobClient.runJob(conf1); JobClient.runJob(conf2); 2. Use Libraries. (org.apache.hadoop.mapreduce.lib.jobcontrol) A JobControl class instance represents a graph of jobs to be run. (a) Indicate jobs and their dependencies (b) Run JobControl in a thread and it runs the jobs in dependency order (c) You can poll progress (d) If a job fails JobControl won't run its dependents (e) You can query status after the jobs complete
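A hedged sketch of the JobControl approach using the new-API classes under org.apache.hadoop.mapreduce.lib.jobcontrol; job1 and job2 are assumed to be fully configured MapReduce jobs created elsewhere:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

    public class TwoStepWorkflow {
      // Runs job2 only after job1 succeeds; both jobs are configured elsewhere
      static void runInOrder(Job job1, Job job2) throws Exception {
        ControlledJob step1 = new ControlledJob(job1, null);
        ControlledJob step2 = new ControlledJob(job2, null);
        step2.addDependingJob(step1);            // (a) declare the dependency

        JobControl control = new JobControl("two-step-workflow");
        control.addJob(step1);
        control.addJob(step2);

        new Thread(control).start();             // (b) run the graph in a thread
        while (!control.allFinished()) {         // (c) poll progress
          Thread.sleep(1000);
        }
        control.stop();
      }
    }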
What are these other common benchmarks used for? 1. MRBench 2. NNBench 3. Gridmix
1. Runs a small job a number of times. Acts as a good counterpoint to sort. 2. Useful for load-testing namenode hardware. 3. Suite of benchmarks designed to model a realistic cluster
When does Safe Mode start? When does it end?
1. Safe mode starts when the namenode is started (after loading the fsimage and edit log). 2. It ends when the minimum replication condition has been met (dfs.replication.min), plus an additional 30 seconds. When you are starting a newly formatted cluster, the namenode does not go into safe mode since there are no blocks in the system yet.
What are three ways to execute Pig programs? (all work in local and mapreduce)
1. Script: pig script.pig, or with the -e option you can run short scripts inline on the command line. 2. Grunt - the interactive shell; Grunt starts when no script or -e option is given. Run scripts from within Grunt using "run" or "exec". 3. Embedded - run from Java using the PigServer class (sketched below). For access to Grunt from Java, use the PigRunner class.
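A sketch of the embedded option using PigServer (the query text and paths are hypothetical):
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

PigServer pig = new PigServer(ExecType.LOCAL);   // or ExecType.MAPREDUCE for a cluster
pig.registerQuery("records = LOAD 'input/sample.txt' AS (year:chararray, temp:int);");
pig.store("records", "output/records");          // executes the pipeline and stores the result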
What does running start-mapred.sh do? (2 steps)
1. Starts a jobtracker on the local machine 2. Starts a tasktracker on each machine in the slaves file
What does running start-dfs.sh do? ( 3 steps)
1. Starts a namenode on the machine the script was run on 2. Starts a datanode on each machine listed in the slaves file 3. Starts a secondary namenode on each machine listed in the masters file
What are the 6 key points of HDFS Design?
1. Storing very large files 2. Streaming data access - a write-once, read-many pattern; the time to read the whole data set matters more than the latency of reading the first record 3. Commodity hardware - designed to keep working through the node failures that are more frequent on commodity hardware 4. Not for low-latency data access - optimized for high throughput at the expense of latency 5. Not for lots of small files - the namenode holds filesystem metadata in memory, so the maximum number of files is governed by the namenode's memory 6. No multiple writers or arbitrary file modifications - an HDFS file is written by a single writer and writes are always made at the end of the file
True or False: 1. System properties take priority over properties defined in resource files. 2. System properties are accessible through the configuration API.
1. TRUE 2. FALSE, it will be lost if not redefined in a configuration file.
(1) True or False: Hadoop has enough Unix assumptions that it is unwise to run it on non-Unix platforms in production (2) True or False: For a small cluster (10 nodes) it is acceptable to have the namenode and jobtracker on a single machine
1. TRUE 2. TRUE, as long as you have a copy of the namenode metadata on a remote machine. Eventually, as the number of files grows, the namenode should be moved to a separate machine because it is a memory hog.
What are the Steps of Task Execution? (MR1)
1. Tasktracker copies the job JAR from the shared filesystem to the tasktrackers filesystem. Copies any files needed from distributed cache. 2. Tasktracker creates a local working directory for the task and un-jars the contents 3. Tasktracker creates an instance of TaskRunner: (a) TaskRunner launches a new JVM to run each task in. ( so that any bugs in user-defined maps or reduce functions don't cause the tasktracker to crash or hang) (b) The child tasks communicate via the umbilical interface every few seconds until completion.
What are the YARN entities?
1. The Client - submits the MapReduce job 2. resource manager - coordinates allocation of compute resources 3. node managers - launch and monitor the compute containers on machines in the cluster 4. application master - coordinates the tasks running the MapReduce job. The application master and MapReduce tasks run in containers that are scheduled by the resource manager and managed by node managers. 5. Distributed Filesystem
How does MR2 handle runtime exception/failure and sudden JVM exits? Hanging tasks? What are criteria for job failure?
1. The application master marks them as failed. 2. The application master notices an absence of pings over the umbilical channel and marks the task attempt as failed. 3. Same as MR1, same config options: a task is marked as failed after 4 attempts, and the job fails once a configured percentage of map/reduce tasks have failed.
After a successful upgrade, what should you do?
1. remove old installation and config files 2. fix any warnings in your code or config 3. Change the environment variables in your path. HADOOP_HOME => NEW_HADOOP_HOME
How does task assignment work in YARN? (only if not ubertask)
1. The application master requests containers for all MapReduce tasks in the job from the resource manager. All requests, piggybacked on heartbeat calls, include information about each map tasks data locality and memory requirements for tasks. 2. The scheduler uses locality information to make placement decisions
What are the four entities of MapReduce1 ?
1. The client 2. The jobtracker (coordinate the job run) 3. The tasktrackers (running tasks) 4. The distributed filesystem (sharing job files)
What are the steps of a File Write? (5)
1. The client calls create() on the DistributedFileSystem. 2. The DistributedFileSystem makes an RPC call to the namenode to create a new file in the filesystem namespace, with no blocks associated with it. The namenode checks permissions and whether the file already exists; if the checks pass, it creates a record of the new file. The DistributedFileSystem returns an FSDataOutputStream which wraps a DFSOutputStream; the DFSOutputStream handles communication with the datanodes and the namenode. 3. As the client writes data, the DFSOutputStream splits it into packets, which it writes to an internal queue, the "data queue". 4. The DataStreamer consumes the data queue and asks the namenode to allocate new blocks by picking a list of suitable datanodes to hold the replicas (three by default). The datanodes form a pipeline and the DataStreamer streams the packets to the first datanode. The DFSOutputStream also maintains an internal queue of packets waiting to be acknowledged (the "ack queue"). 5. A packet is removed from the ack queue only when it has been acknowledged by all datanodes in the pipeline. (A client-side sketch follows below.)
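From the client's point of view the whole sequence is hidden behind a few calls; a minimal sketch (the path and data are hypothetical):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);                                   // DistributedFileSystem on HDFS
FSDataOutputStream out = fs.create(new Path("/user/tom/example.txt"));  // RPC to the namenode
out.writeUTF("hello hdfs");                                             // buffered into packets on the data queue
out.close();                                                            // flushes packets, waits for acks, closes the file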
What are the steps of a File Read? (6)
1. The client opens a file by calling FileSystem's open method (HDFS - uses an instance of DistributedFileSystem) 2. The FileSystem calls the Namenode for file block locations. The Namenode returns locations of datanodes sorted by proximity to the client. The DistributedFileSystem returns a FSDataInputStream to the client for it to read data from. The client calls read() on the stream. 3. DFSInputStream finds the first (closest) datanode and connects to it to get access to the first block of the file. It is streamed back to the client, which calls read() repeatedly. 4. When the data is at the end of the block, it will close the stream. The DFSInputStream will then find the next best datanode for the next block. 5. The process continues, it will call the namenode for the next set of blocks. 6. When the client is finished, it calls close() on the stream.
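The client-side view of a read is just open/read/close; a minimal sketch (hypothetical path):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
FSDataInputStream in = fs.open(new Path("/user/tom/example.txt"));  // namenode supplies block locations
try {
  IOUtils.copyBytes(in, System.out, 4096, false);                   // streams blocks from the closest datanodes
} finally {
  IOUtils.closeStream(in);
}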
What is the user's task classpath comprised of?
1. The job JAR file 2. Any JAR files contained in the lib/ directories of the job jar file. 3. Any files added to the distributed cache using -libjars option or the addFileToClasspath() method on DistributedCache (old api) or Job (new api)
How does Job Initialization work in MapReduce 1?
1. The job is put into an internal queue from where the job scheduler picks it up and initializes it: it creates an object to represent the job being run, which encapsulates its tasks and the bookkeeping information used to track their status. 2. The job scheduler retrieves the input splits computed by the client from the shared filesystem and creates one map task per split; reduce tasks are also created. A job setup task is created, which tasktrackers run before any of the job's tasks, and a job cleanup task is created to run afterward and delete the temporary working space for the task output. 3. Tasktrackers send heartbeats to the jobtracker; the heartbeat also indicates whether the tasktracker is ready for a new task. The jobtracker chooses a job and then a task within it to send in response.
What happens when a jobtracker receives a notification that the last task for a job is complete?
1. The jobtracker changes the status of the job to "successful" 2. When the Job object polls for status, it prints a message to tell the user and returns from the waitForCompletion() method. 3. Job statistics and counters are printed 4. (optional) HTTP job notification (set job.end.notification.url) 5. Cleans up working state for the job and instructs tasktrackers to do the same.
Pig is made up of two parts. What are they?
1. The language used to express data flows (Pig Latin) 2. The execution environment for running Pig Latin programs: (a) local execution (single JVM) for small datasets on the local file system (b) distributed execution on a Hadoop cluster
If two configuration files set the same property, which does Hadoop use?
1. The last one added, 2. unless the property is marked "final" in an earlier resource, in which case the earlier, final definition wins.
What happens when a datanode fails while data is being written to it?
1. The pipeline is closed 2. Any packets in the ack queue are written to the front of the data queue 3. The current block on the good datanode is given a new identity so if the failed datanode comes back up, it won't continue with the block. 4. The failed datanode is removed from the pipeline 5. The remainder of the blocks data is written to the remaining 2 datanodes in the replication structure. 6. When the client is finished writing, it calls close(), which flushes the remaining packets in the datanode pipeline. It waits for acknowledgement before contacting the namenode to close the file. 7. The namenode notes if any block is under-replicated and arranges for another replication
How does job initialization work in YARN?
1. The scheduler allocates a container and the resource manager then launches the application master's process there, under the node manager's management. 2. The application master creates a map task object for each split, as well as a number of reduce tasks (configured via mapreduce.job.reduces). 3. The application master decides whether to run the tasks in the same JVM as itself (uberized) or in parallel on the cluster. A job is uberized only if it is small: fewer than 10 mappers, a single reducer, and an input smaller than one HDFS block.
Describe the Checkpoint process. (5 steps)
1. The secondary asks the primary to roll its edits file: edits => edits.new (on the primary). 2. The secondary retrieves fsimage and edits from the primary. 3. The secondary loads fsimage into memory and applies the edits, then creates a new fsimage file. 4. The secondary sends the new fsimage to the primary (HTTP POST). 5. The primary replaces the old fsimage with the new one, and the old edits file with edits.new. It updates fstime to record the time the checkpoint was taken.
How could you debug a MapReduce program?
1. Use a debug statement to log to standard error, together with a message to update the task status to alert us to look at the error log. 2. Create a custom counter to count the total number of records with implausible values in the entire dataset (valuable to see how common an occurrence it is). 3. If the amount of debug data is large, add it to the map's output for analysis and aggregation in the reducer. 4. Write a program to analyze the log files afterwards. ex (debugging in the mapper): if (airTemperature > 1000) { System.err.println("Temperature over 100 degrees for input: " + value); context.setStatus("Detected possibly corrupt record: see logs."); context.getCounter(Temperature.OVER_100).increment(1); }
What do the following SSH settings do? 1. ConnectTimeout 2. StrictHostKeyChecking
1. Used to reduce the connection timeout value so the control scripts don't wait around. 2. If set to no, it automatically adds new host keys. If ask(default),prompts the user to accept host key (not good for a large cluster)
How might a job fail? (MR1)
1. User code throws a runtime exception - the child JVM reports the error to the tasktracker before it exits; the tasktracker marks the task as failed and the error ends up in the user logs. 2. Streaming - if the streaming process exits with a nonzero exit code, the task is marked as failed (when stream.non.zero.exit.is.failure is true). 3. Sudden exit of the child JVM - the tasktracker notices the exit and marks the task as failed.
What security enhancements have been added to Hadoop?
1. Users can view and modify only their own jobs, not others', using ACLs. 2. A task may communicate only with its parent tasktracker. 3. The shuffle is secure, but not encrypted. 4. A datanode may be run on a privileged port (lower than 1024) to make sure it starts securely. 5. When tasks are run as the user who submitted the job, the distributed cache is secure; the cache was divided into secure and shared portions. 6. Malicious users can't get rogue secondary namenodes, datanodes, or tasktrackers to join the cluster; daemons are required to authenticate with the master node.
A container has 2 types of memory constraints. What are they?
1. Virtual memory constraint - a container cannot exceed a given multiple of its physical memory, set by yarn.nodemanager.vmem-pmem-ratio (usually 2:1). 2. The scheduler's min/max memory allocations: yarn.scheduler.capacity.minimum-allocation-mb and yarn.scheduler.capacity.maximum-allocation-mb.
Reduce tasks are broken down on the jobtracker web UI. What do: 1. copy 2. sort 3. reduce refer to?
1. When map outputs are being transferred to the reducers tasktracker. 2. When the reduce inputs are being merged. 3. When the reduce function is being run to produce the file output.
How do you benchmark HDFS?
1. Write: hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000 (writes 10 files of 1,000 MB each); results are in TestDFSIO_results.log, data under /benchmarks/TestDFSIO. 2. Read: hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000 (reads 10 files of 1,000 MB each) 3. Clean: hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean
How do you benchmark mapreduce?
1. Write: hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar randomwriter random-data (generates some random data) 2. Sort: hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort random-data sorted-data (runs the sort program; progress is visible at the jobtracker web URL) 3. Verify the data is sorted correctly: hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar testmapredsort -sortInput random-data -sortOutput sorted-data (reports success or failure)
What types does the addInputPath() method accept?
1. a single file 2. directory 3. file pattern
What are the features of Grunt?
1. autocomplete mechanism. 2. remembers previous/next commands that were run
What do the following Oozie components specify? 1. map-reduce action 2. mapred.input.dir / mapred.output.dir
1. contains (a) job-tracker - specifies the jobtracker to submit the job to (b) name-node - URI for data input/output (c) prepare (optional) - runs before the mapreduce job; used for directory deletion (e.g., the output dir before the job runs) 2. Used to set the FileInputFormat input paths and FileOutputFormat output paths
What are the benefits of Pig?
1. cuts down on development process of MapReduce (even faster than Streaming) 2. Issuing command line tasks to mine data is fast and easy 3. Can process terabytes of data 4. Provides commands for data introspection as you are writing scripts 5. Can run on sample subsets of data
Datanodes permitted/not permitted to connect to namenodes if specified in ___________. Tasktrackers that may/ may not connect to the jobtracker are specified in ___________.
1. dfs.hosts / dfs.hosts.exclude 2. mapred.hosts / mapred.hosts.exclude
How do you run an Oozie workflow job?
1. export OOZIE_URL="http://localhost:11000/oozie" (tells oozie command which server to use) 2. oozie job -config ch05/src.../max-temp-workflow.properties -run (run - runs the workflow) (config - local java properties file containing definitions for the parameters in the workflow xml) 3. oozie job -info 000000009-112....-oozie-tom-W (shows the status, also available via web url)
If I want to keep intermediate failed or succeeded files, how can I do that? Where are the intermediate files stored?
1. failed - keep.failed.task.files = true succeeded - keep.task.files.pattern = (regex of task Ids to keep) 2. mapred.local.dir/taskTracker/jobcache/job-ID/task-attempt-ID
Trash: 1. How do you set up Trash? 2. Where do you find Trash files? 3. Will programmatically deleted files be put in Trash? 4. How do you manually take out Trash for non-HDFS filesystems?
1. fs.trash.interval, set to greater than 0 in core-site.xml 2. In your user/home directory in a .trash folder 3. No, they will be permanently deleted 4. hadoop fs -expunge
What should be run regularly for maintenance?
1. fsck 2. balancer
How do you setup Kerberos authentication?
1. hadoop.security.authentication = kerberos (core-site.xml) 2. hadoop.security.authorization = true 3. Set up ACLs (Access Control Lists) in hadoop-policy.xml
How do you set log levels for a component? (3 ways)
1. Go to http://jobtracker-host:50030/logLevel and set org.apache.hadoop.mapred.JobTracker to DEBUG 2. hadoop daemonlog -setlevel jobtracker-host:50030 org.apache.hadoop.mapred.JobTracker DEBUG 3. (persistent) change the log4j.properties file
A mapper commonly performs three things, what are they?
1. input format parsing 2. projection (selecting relevant fields) 3. filtering (removing records that are not of interest)
1. CompressionCodecFactory finds codecs from a list defined where? 2. What is the default? 3. How is it formatted?
1. io.compression.codecs; it searches the list by file extension to find a match. 2. All codecs supported by Hadoop; custom codecs can be added to the list. 3. Comma-separated classnames (ex: org.apache.hadoop.io.compress.DefaultCodec, ...)
Pig is extensible in that you can customize what? How?
1. loading 2. storing 3. filtering 4. grouping 5. joining via user-defined functions (UDF)
What can mapreduce.framework.name be set to ?
1. local 2. classic (MapReduce 1 ) 3. yarn (MapReduce 2)
How do you set memory options for individual jobs? (2 controls)
1. mapred.child.java.opts - sets JVM heap size for map/reduce tasks 2. mapreduce.map.memory.mb - how much memory needed for map (or reduce) task containers.
What needs to be set in order to enable Task Memory Monitoring? (6)
1. mapred.cluster.map.memory.mb - amt of virtual memory to take up a map slot. Map tasks that require more can use multiple slots. 2. mapred.cluster.reduce.memory.mb - amt of virtual memory to take up a reduce slot 3. mapred.job.map.memory.mb - amt of virtual memory that a map task requires to run 4. mapred.job.reduce.memory.mb - amt of virtual memory that a reduce task requires to run 5. mapred.cluster.max.map.memory.mb - max limit users can set mapred.job.map.memory.mb 6. mapred.cluster.max.reduce.memory.mb - max limit users can set mapred.job.reduce.memory.mb
How do you configure JVM reuse?
1. mapred.job.reuse.jvm.num.tasks - the maximum number of tasks to run for a given job in each JVM launched (default = 1). There is no distinction between map and reduce tasks, but tasks from different jobs always run in separate JVMs. If set to -1 there is no limit, and the same JVM may be used for all tasks of a job. 2. JobConf.setNumTasksToExecutePerJVM()
1. How do you setup the local jobrunner? 2. How do you setup the local jobrunner on MR2? 3. How many reducers are used?
1. mapred.job.tracker = local (default) 2. mapred.framework.name = local 3. 0 or 1
What are the two files the Namenode stores data in?
1. namespace image file (fsimage) 2. edit log
What does the Datanode's VERSION file contain? (5)
1. namespaceID - received from the namenode when the datanode first connects 2. storageID = DS-5477177.... used by the namenode to uniquely identify the datanode 3. cTime = 0 4. storageType = DATA_NODE 5. layoutVersion = -18
How does job submission work in YARN?
1. A new job ID is retrieved from the resource manager (where it is called an application ID) 2. The job client checks the output specification, computes input splits, and copies job resources to HDFS 3. The job is submitted by calling submitApplication() on the resource manager
Which components does fsck measure and what do they do?
1. over-replicated blocks - extra replicas are automatically deleted 2. under-replicated blocks - additional replicas are automatically created 3. mis-replicated blocks - blocks that don't satisfy the replica placement policy; they are re-replicated 4. corrupt blocks - blocks whose replicas are all corrupt (blocks with at least one non-corrupt replica are not marked as corrupt) 5. missing replicas - blocks with no replicas anywhere; the data has been lost. You can specify -move to move the affected files to the lost+found directory, or -delete to delete the files (they cannot be recovered)
What do these common Pig commands output? 1. DUMP records 2. DESCRIBE records 3. ILLUSTRATE 4. EXPLAIN
1. shows records ex: 1950,0,1 2. shows the relation's schema ex: records: { year: chararray, temperature: int, quality: int } 3. A table representation of all the steps and transformations; it helps in understanding the query. 4. Shows the logical/physical plan breakdown for a relation
How do you run a MapReduce job on a cluster?
1. unset HADOOP_CLASSPATH (if no dependencies exist) 2. hadoop jar hadoop-examples.jar v3.MaxTemperatureDriver -conf conf/hadoop-cluster.xml input/ncdc/all max-temp
How did you debug your Hadoop code?
1. use counters 2. use the interface provided by hadoop web ui
How much memory does the namenode, secondary namenode and jobtracker daemons use by default?
1 GB each by default. For the namenode, a rule of thumb is about 1 GB of heap per million blocks of storage.
How much memory do you dedicate to the node manager?
Roughly, the machine's total memory minus about 1 GB for the datanode daemon, 1 GB for the node manager daemon, and headroom for other running processes; the remainder (commonly around 8 GB on a modest worker node) is dedicated to containers.
What is the bandwidth used by the balancer?
1 MB/s (dfs.balance.bandwidthPerSec in hdfs-site.xml). Limits the bandwidth used for copying blocks between nodes; the balancer is designed to run in the background.
What is Hadoop's default replica placement?
1st - on the same node as the client (or a random node if the client is outside the cluster) 2nd - on a node on a different (off-rack) rack 3rd - on the same rack as the 2nd, but on a different node
How does the reduce side of the Shuffle work? (Sort Phase & Reduce phase)
2. Sort phase (really a merge phase) - done in rounds: number of rounds = number of map outputs / merge factor (io.sort.factor, default 10), e.g. 50/10 = 5 rounds, producing 5 intermediate files. 3. Reduce phase - the reduce function is invoked for every key of the sorted output; the result is written to the output filesystem, typically HDFS.
Think hadoop
In 2003/2004 Google released two academic papers describing the Google File System and MapReduce.
How does the map portion of the MapReduce write output? (Part 2)
4. Each time the memory buffer reaches the spill threshold, a new spill file is created; all spill files are then merged into a single partitioned and sorted output file. If there are at least 3 spill files (min.num.spills.for.combine), the combiner is run again during the merge. 5. It is a good idea to compress the map output (not enabled by default). 6. Output file partitions are made available to reducers over HTTP. The maximum number of worker threads used to serve partitions is controlled by tasktracker.http.threads = 40 (default) (a per-tasktracker setting, not per map). In MR2 this is set automatically from the number of processors on the machine (2 x number of processors).
What is a good split size? What happens when a split is too small?
64 MB, i.e., the size of an HDFS block. If a split were larger than a block it would span blocks, so some of its data would have to be transferred over the network rather than read locally; if splits are too small, the overhead of managing all the splits and of task creation dominates.
Machines running a namenode should be which? 32 bit or 64 bit?
64bit, to avoid the 3GB limit on Java Heap size on 32bit
default size of an HDFS block
64 MB
What is the default port for the HDFS NameNode?
8020
rule of 5 9's 99.999 uptime
A 99% uptime system allows roughly 3.65 days of downtime a year (about 7 hours a month); five nines (99.999%) allows only about 5 minutes a year.
Pig- if fields are no long unique use
::
Cluster file system property (XML)
<property> <name>fs.default.name</name> <value>hdfs://namenode/</value> </property>
Cluster job tracker property (XML)
<property> <name>mapred.job.tracker</name> <value>jobtracker:8021</value> </property>
Which of the following is NOT true: A) Hadoop is decentralized B) Hadoop is distributed. C) Hadoop is open source. D) Hadoop is highly scalable.
A
In Pig Latin, how would you use multi-query execution?
A = LOAD 'input/pig/multiquery/A'; B = FILTER A BY $1 == 'banana'; C = FILTER A BY $1 != 'banana'; STORE B INTO 'output/b'; STORE C INTO 'output/c'; A is read only once for the two jobs, saving time, and the output is stored in two separate places.
What is a DataNode? How many instances of DataNode run on a Hadoop Cluster?
A DataNode stores data in the Hadoop File System HDFS. There is only One DataNode process run on any hadoop slave node. DataNode runs on its own JVM process. On startup, a DataNode connects to the NameNode. DataNode instances can talk to each other, this is mostly during replicating data.
What is HBase
A Hadoop database
three
A Hadoop file is automatically stored in ___ places.
How do status updates work with YARN?
A Task reports its progress and status back to the application master which has an aggregate view. It is sent every 3 seconds over the umbilical interface.
What is a Task Tracker in Hadoop? How many instances of TaskTracker run on a Hadoop Cluster
A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. There is only One Task Tracker process run on any hadoop slave node. Task Tracker runs on its own JVM process. Every TaskTracker is configured with a set of slots, these indicate the number of tasks that it can accept. The TaskTracker starts a separate JVM processes to do the actual work (called as Task Instance) this is to ensure that process failure does not take down the task tracker. The TaskTracker monitors these task instances, capturing the output and exit codes. When the Task instances finish, successfully or not, the task tracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that it is still alive. These message also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated
What is the Data Block Scanner?
A background thread that periodically verifies all the blocks stored on the datanode. This guards against corruption due to "bit rot" in the physical storage media.
What is Hadoop?
A big data system that ties together a cluster of commodity machines with local storage using free and open source software to store and process vast amounts of data at a fraction of the cost of other systems.
hadoop-env.sh
A Bourne shell fragment sourced by the Hadoop scripts; this file specifies environment variables that affect the JDK used by Hadoop, daemon JDK options, the pid file, and the log file directories.
How does the Capacity Scheduler work?
A cluster is made up of a number of queues which may be hierarchical and each queue has a capacity.Within each queue jobs are scheduled using FIFO scheduling (with priorities) Allows users (defined by queues) to simulate separate clusters. Does not enforce fair sharing like the Fair Scheduler.
What is a file system designed for storing very large files with streaming data access paterns, running on clusters of commodity hardware.
HDFS
Zookeeper
A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used to build distributed applications.
How would you customize Grunt autocomplete tokens?
Create a file called autocomplete and place it in Pig's classpath. Place keywords in single separate lines (case sensitive)
log4j.properties
A java property file that contains all log configuration information
taskcontroller.cfg
A Java property-style file that defines values used by the setuid task-controller, a MapReduce helper program used when operating in secure mode.
hadoop fs -touchz
Create a file of zero length.
What are the two main new components in Hadoop 2.0?
HDFS Federation and Yarn
MAPRED: mapred.local.dir
A list of directories where MapReduce stores intermediate temp data for jobs (cleared at job end)
Map-Side Joins
A map-side join between large inputs works by performing the join before the data reaches the map function. For this to work, though, the inputs to each map must be partitioned and sorted in a particular way. Each input dataset must be divided into the same number of partitions, and it must be sorted by the same key (the join key) in each source. All the records for a particular key must reside in the same partition. This may sound like a strict requirement (and it is), but it actually fits the description of the output of a MapReduce job.
masters(optional)
A new line separated list of machines that run the secondary namenode used only by start-*.sh helper scripts.
dfs.exclude
A newline separated list of machines that are not permitted to connect to namenode
dfs.include
A newline separated list of machines that are permitted to connect to the namenode
slaves(optional)
A newline separated list of machines that run a datanode/tasktracker pair of daemons; used only by the start-*.sh commands
Datanode
A node that holds data and data blocks for files in the file system.
NameNode
A node that stores meta-data about files and keeps track of which nodes hold the data for a particular file
\B
A non-word boundary
What is the namenode's fs image file?
A persistent checkpoint of filesystem metadata
empty
A quota of one forces a directory to remain ____ . (Yes, a directory counts against its own quota!)
What is the result of any operator in Pig Latin. ex:LOAD
A relation, which is a set of tuples. ex: records = LOAD 'input/ncdc/.." records - relation alias or name
What is Pig?
A scripting language that simplifies the creation of mapreduce jobs. Used to explore and transform data
What is Apache Flume? (a) What is a sample use-case? (b) What levels of delivery reliability does Flume support?
A system for moving large quantities of streaming data into HDFS. (a) use case: Collecting log data from one system and aggregating it into HDFS for later analysis. (b) 1. best-effort - doesn't tolerate any Flume node failures 2. end-to-end - guarantees delivery even with multiple failures.
[a-z&&[^m-p]]
a through z, except m through p (subtraction)
Hadoop is
HDFS and MapReduce. Both are direct implementations of Google's GFS and MapReduce papers, respectively.
MapReduce job (definition)
A unit of work that a client wants to be performed
MapReduce job
A unit of work that the client wants to be performed -input data -the MapReduce program -configuration information
\b
A word boundary
Data Mining
According to analysts, for what can traditional IT systems provide a foundation when they're integrated with big data technologies like Hadoop? Big Data and ___ ___.
process
Administrators should use the conf/hadoop-env.sh and conf/yarn-env.sh script to do site-specific customization of the hadoop daemons' ______ environment.
What are delegation tokens used for?
Allows for later authentication access without having to contact the KDC again.
What does PathFilter do? Which FileSystem functions take an optional PathFilter?
Allows you to exclude directories, as GlobPatterns cannot. listStatus(), globStatus()
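A sketch of a custom PathFilter (the exclusion rule is made up for illustration):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class ExcludeTmpFilter implements PathFilter {
  @Override
  public boolean accept(Path path) {
    return !path.getName().endsWith("_tmp");   // skip temporary files and directories
  }
}

// usage: FileStatus[] matches = fs.globStatus(new Path("/data/2014-*"), new ExcludeTmpFilter());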
What is CompositeContext?
Allows you to output the same set of metrics to multiple contexts. -arity = number of contexts
What does the Hadoop Library Class ChainMapper do?
Allows you to run a chain of mappers, followed by a reducer and another chain of mappers in a single job.
What does a compression format being Splittable mean? Which format is?
Allows you to seek to any point in the stream and start reading. (Suitable for MapReduce) Bzip2
What is Mahout?
An Apache project whose goal is to build scalable machine learning libraries
What is the logo for Hadoop?
An Elephant
mapred-queue-acls.xml
An XML file that defines which user and or group are permitted to submit jobs to which Mapreduce Job queues
hadoop-policy.xml
An XML file that defines which users and / or groups are permitted to invoke specific RPC functions whn communicated with Hadoop
core-site.xml
An XML file that specifies parameters relevant to all Hadoop daemons and clients
mapred-site.xml
An XML file that specifies parametersused by MapReduce daemons and clients
Big Data
An assortment of such a huge and complex data that it becomes very tedious to capture, store, process, retrieve, and analyze it with the help of on-hand database management tools or traditional processing techniques.
Unix
HDFS commands have a one-to-one correspondence with ____ commands.
What is a codec?
An implementation of a compression-decompression algorithm. In Hadoop, it's represented by an implementation of the CompressionCodec interface. ex: GzipCodec encapsulates the compression-decompression algorithm for gzip.
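A short sketch of using a codec directly to compress a stream (the stream endpoints are arbitrary; this mirrors a common usage pattern rather than anything from the card):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

Configuration conf = new Configuration();
CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
CompressionOutputStream out = codec.createOutputStream(System.out);
IOUtils.copyBytes(System.in, out, 4096, false);   // gzip whatever arrives on stdin
out.finish();                                      // flush compressed data without closing the underlying stream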
What does Hadoop use for Configuration?
An instance of the Configuration Class in org.apache.hadoop.conf. They read properties from an xml file.
What is Ganglia?
An open source distributed monitoring system for very large scale clusters. Using Ganglia context you can inject Hadoop metrics into Ganglia. Low overhead and collects info about memory/CPU usage.
What is ZooKeeper?
An open source server which enables highly reliable distributed coordination
How is distcp implemented?
As a mapreduce job with the copying being done by the maps and no reducers. Each file is copied by a single map, distcp tries to give each map the same amount of data.
integration
As companies move past the experimental phase with Hadoop, many cite the need for additional capabilities, including: Improved extract, transform and load features for data ____.
Speculative Execution
As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes that don't have work to perform (Yahoo DevNet). This keeps an entire job from being delayed by one slow node.
Fill in the blank. The solution to cataloging the increasing number of web pages in the late 1900's and early 2000's was _______.
Automation
Can an average mapreduce pattern use combiner?
Average is not associative, so the reducer cannot simply be reused as a combiner; it becomes possible if each record carries a (sum, count) pair.
Schedulers wait until 5% of the map task in a job have completed before scheduling reduce tasks for the same job. In a large cluster this may cause a problem. Why? How can you fix it?
Because reduce tasks occupy slots while they wait for map output, which lowers cluster utilization. By setting mapred.reduce.slowstart.completed.maps = 0.80 (80%), reducers are not scheduled until 80% of the maps have completed, which can improve throughput.
root privileges
Because the data transfer protocol of DataNode does not use the RPC framework of Hadoop, DataNode must authenticate itself by using privileged ports which are specified by dfs.datanode.address and dfs.datanode.http.address. This authentication is based on the assumption that the attacker won't be able to get ____ ____.
What is Sqoop used for? (use case)
Bulk imports of data into HDFS from structured datastores such as relational databases. use case: An organization runs nightly Sqoop imports to load the day's data into the Hive data warehouse for analysis.
How is the output key and value returned from the mapper or reducer?
By calling myOutputCollector.collect(outputKey, outputValue), where myOutputCollector is of type OutputCollector and outputKey and outputValue are the key/value pair to be emitted.
How can counters be incremented in MapReduce jobs?
By calling the incrCounter method on the instance of Reporter passed to the map or reduce method.
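An old-API mapper sketch that does this (the enum, class, and output are made up for illustration):
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class QualityMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  enum RecordQuality { MALFORMED }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    if (value.toString().isEmpty()) {
      reporter.incrCounter(RecordQuality.MALFORMED, 1);   // bump the custom counter
      return;
    }
    output.collect(new Text("records"), new IntWritable(1));
  }
}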
access
By default Hadoop HTTP web-consoles (JobTracker, NameNode, TaskTrackers and DataNodes) allow ____ without any form of authentication.
authentication
By default Hadoop runs in non-secure mode in which no actual _____ is required. By configuring Hadoop runs in secure mode, each user and service needs to be authenticated by Kerberos in order to use Hadoop services.
How does the default partitioner bucket records?
By using a hash function
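The default HashPartitioner's getPartition() boils down to the following (shown for illustration):
public int getPartition(K key, V value, int numReduceTasks) {
  // Mask off the sign bit so the result is non-negative, then bucket by modulo.
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}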
How are errors handled with CheckSumFileSystem?
Calls reportChecksumFailure() and LocalFileSystem moves offending file and its checksum to "bad-files" directory.
How to define udf streaming counters
Can be incremented by sending a specially formatted line to the standard error stream. Format must be: reporter:counter:group,counter,amount
Example converting a program to be sortable
Change the keys from Text to a SequenceFile: signed integers stored as text don't sort well lexicographically, but a SequenceFile can use IntWritable keys, which sort correctly.
hadoop fs -chgrp
Change group association of files.
hadoop fs -chown
Change the owner of files.
hadoop fs -chmod
Change the permissions of files.
hadoop fs -setrep
Changes the replication factor of a file.
How do checksums work? What type of hardware do you need for them?
Checksums are computed once when the data first enters the system and again whenever it is transmitted across a channel. The checksums are compared to check if the data was corrupted. No way to fix the data, merely serves as error detection. Must use ECC memory
JobTracker in Hadoop performs following actions
Client applications submit jobs to the Job tracker. The JobTracker talks to the NameNode to determine the location of the data The JobTracker locates TaskTracker nodes with available slots at or near the data The JobTracker submits the work to the chosen TaskTracker nodes. The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker. A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may may even blacklist the TaskTracker as unreliable. When the work is completed, the JobTracker updates its status. Client applications can poll the JobTracker for information.
What is a block access token?
The client uses the block access token to authenticate itself to datanodes. Enabled by setting dfs.block.access.token.enable = true. An HDFS block may be accessed only by a client with a valid block access token from the namenode.
How to retrieve counter values using the Java API (new)?
Cluster cluster = new Cluster(getConf()); Job job = cluster.getJob(JobID.forName(jobID)); Counters counters = job.getCounters(); long missing = counters.findCounter( MaxTemperatureWithCounters.Temperature.MISSING).getValue();
Fill in the blank: ________ is shipped to the nodes of the cluster instead of _________.
Code, Data
What are combiners? When should I use a combiner in my MapReduce Job?
Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally on individual mapper outputs, which can reduce the amount of data that needs to be transferred to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative (wired up as in the sketch below). The execution of a combiner is not guaranteed: Hadoop may or may not execute it, and may execute it more than once, so your MapReduce jobs should not depend on the combiner's execution.
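Driver-side sketch of wiring in a combiner (the class names are placeholders):
// Reusing the reducer as the combiner is valid here only because taking a
// maximum is commutative and associative.
job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);
job.setReducerClass(MaxTemperatureReducer.class);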
Which settings are used for commissioning/decommissioning nodes?
Commissioning: dfs.hosts (datanodes) mapred.hosts (tasktrackers) Decommissioning: dfs.hosts.exclude (datanodes) mapred.hosts.exclude (tasktrackers)
Why is Hadoop good for "big data"?
Companies need to analyze that data to make large-scale business decisions.
thread safe
Concurrency and libhdfs/JNI The libhdfs calls to JNI should always be creating thread local storage, so (in theory), libhdfs should be as ____ ___ as the underlying calls to the Hadoop FS.
How would you manually add a resource? How would you access the resources properties?
Configuration conf = new Configuration(); conf.addResource("configuration-1.xml"); assertThat(conf.get("color"), is ("yellow")); assertThat(conf.getInt("size",0), is (10)); assertThat(conf.get("breadth","wide"), is ("wide"));
What are masters and slaves files used for?
Contains a list of machine hosts names or IP addresses. Masters file - determines which machines should run a secondary namenode Slaves file - determines which machines the datanodes and tasktrackers are run on. - Used only by the control scripts running on the namenode or jobtracker
hadoop fs -get
Copy files to the local file system.
HDFS, MapReduce
Core components of Hadoop are ___ and ___.
How to make counter names readable in web UI?
Create a properties file named after the enum, using an underscore as a separator for nested classes. The properties file should be in the same directory as the top-level class containing the enum. The file is named MaxTemperatureWithCounters_Temperature.properties ... CounterGroupName=Air Temperature Records MISSING.name=Missing MALFORMED.name=Malformed
keys
Custom configuration property keys should not conflict with the namespace of Hadoop-defined properties. Typically, users should avoid using prefixes used by Hadoop: hadoop, io, ipc, fs, net, file, ftp, s3, kfs, ha, file, dfs, mapred, mapreduce, yarn.
Which of the following is NOT Hadoop drawbacks? A) inefficient join operation B) security issue C) does not optimize query for user D) high cost E) MapReduce is difficult to implement
D
Scale up (monolothic) vs. scale out
Scale up: the database runs on an impressively large single computer; when the data grows, you move it to ever larger computers and storage arrays, with cost measured in the hundreds of thousands or millions of dollars. Scale out (the Hadoop approach): add more commodity nodes to the cluster instead.
What are some concrete implementations of Output Format?
DBOutputFormat FileOutputFormat NullOutputFormat SequenceFileAsBinaryOutputFormat TeraOutputFormat TextOutputFormat
Name common compression schemes supported by Hadoop
DEFLATE gzip bzip2 LZO
Structured Data
Data that has a schema
SSL (HTTPS)
Data transfer between Web-console and clients are protected by using ___ (____). [words refer to the same thing]
Unstructured Data
Data with no structure like jpg's, pdf files, audio and video files, etc.
The __________ holds the data in the HDFS and the application connects with the __________ to send and retrieve data from the cluster.
Datanode, Namenode
How do datanodes deal with checksums?
Datanodes are responsible for verifying the data they receive before storing the data and its checksum. A client writing data sends it to a pipeline of datanodes. The last datanode verifies the checksum. If there is an error, the client receives a checksum exception. Each datanode keeps a persistent log of checksum verifications. (knows when each block was last verified) Each datanode runs a DataBlockScanner in a background thread that periodically verifies all blocks stored on the datanode.
How do you specify SSH settings?
Define the HADOOP_SSH_OPTS environment variable in hadoop-env.sh
How do you handle corrupt records that are failing in the mapper and reducer code?
Detect and ignore Abort job, throwing an Exception Count the total number of bad records in the jobs using Counters to see how widespread the problem is.
hadoop fs -dus
Displays a summary of file lengths.
hadoop fs -du
Displays aggregate length of files contained in the directory
hadoop fs -tail
Displays last kilobyte of the file to stdout.
What is Distributed Cache in Hadoop?
Distributed Cache is a facility provided by the MapReduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.
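A driver-side sketch using the newer Job API (the file path and link name are hypothetical; older code uses DistributedCache.addCacheFile(uri, conf) instead):
import java.net.URI;

// The '#stations' fragment creates a symlink named "stations" in each task's working directory.
job.addCacheFile(new URI("/metadata/stations.txt#stations"));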
hadoop fs -expunge
Empty the Trash.
How do you check FileStatus?
FileStatus stat = fs.getFileStatus(Path f); then: assertThat(stat.getLen(), is (7L)); ... for all properties checked
How often does the client poll the application master?
Every second (mapreduce.client.progress.monitor.pollinterval)
When does the datanode create a new block subdirectory?
Every time the number of blocks in a directory reaches 64 (dfs.datanode.numblocks). This way the datanode ensures there is a manageable number of blocks spread across different directories.
How does the Fair Scheduler work?
Every user gets a fair share of the cluster capacity over time. A single job running on the cluster would use full capacity. A short job belonging to one user will complete in a reasonable time, even while another users long job is running. Jobs are placed in pools and by default each user gets their own pool. Its possible to create custom pools with a minimum value.Supports preemption - if a pool hasn't received its fair share over time, the scheduler will kill tasks in pools running over capacity in order to give more slots to under capacity pools.
1024 Petabytes?
1 Exabyte
Counting with Counters in Hadoop uses Map and Reduce? T/F
False - just the Mapper. Counting with counters is a map-only pattern; the number of reducers is set to 0.
What is the default MapReduce scheduler
FIFO queue-based scheduler
What are some of the available MapReduce schedules?
FIFO queue-based scheduler Fair scheduler Capacity scheduler
How do you delete files or directories with FileSystem methods?
FileSystem's delete() public boolean delete(Path f, boolean recursive) throws IOE
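For example, a minimal usage sketch (the output path is hypothetical):
// Recursively delete a job's old output directory before re-running it.
if (fs.exists(new Path("/user/tom/output"))) {
  fs.delete(new Path("/user/tom/output"), true);
}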
T/F: Hadoop is not recommended for a company with a small amount of data, but it is highly recommended if this data requires instant analysis.
False
T/F: The Cassandra File System has many advantages over HDFS, but simpler deployment is not one of them.
False
T/F: The main benefit of HadoopDB is that it is more scalable than Hadoop while maintaining the same performance level on structured data analysis workloads.
False
True or False: The number of reduce tasks is governed by the size of the input.
False, the number of reducers is specified independently. job.setNumReduceTasks();
True or False: Type conflicts are detected at compile time.
False, they are detected at runtime. Therefore programs should be tested on a sample data set first to fix any type incompatibilities.
True or False: Input types of the reduce function do not have to match output types of the map function
False, they have to match
True or False: A file in HDFS that is smaller than a single block will occupy a full block's worth of underlying storage
False, unlike a file system for a single disk, it does not
Your user tries to log in to your website. Hadoop is a good technology to store and retrieve their login data. True/false? Why?
False. Hadoop is not as efficient as a relational database that can be queried; a database like mySQL is a better choice in this scenario.
Hadoop is good at storing semistructured data. True/false?
False. It's good at storing unstructured data.
What is the safemode HDFS state?
The file system is mounted read-only: no replication, no files created, no files deleted. Commands: hadoop dfsadmin -safemode enter (enter safe mode); hadoop dfsadmin -safemode leave (exit safe mode); hadoop dfsadmin -safemode get (show whether safe mode is on or off); hadoop dfsadmin -safemode wait (wait until safe mode is exited)
How do we specify the input paths for the job object?
FileInputFormat.addInputPath(job, new Path(args[0]));
How do we specify the input and output paths for the job object?
FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.addOutputPath(job, new Path(args[1]));
How do you specify an input path for a MapReduce job, and what are the possible items one can specify?
FileInputFormat.addInputPath(____) a) single file b) directory c) file pattern
How do we specify the output paths for the job object?
FileOutputFormat.addOutputPath(job, new Path(args[1]));
What is the name of the distributed tool to retrieve logs? Each daemon has a source and a sink, can also decorate (compress or filter), scales out, and the master is the point of configuration.
Flume
What is Sqoop?
For efficiency of bulk transfers of data between Hadoop and relational databases
What is Flume?
For efficiently collecting, aggregating, and moving large amounts of log data
JobTrackers
For the Hadoop setup, we need to configure ____ and TaskTrackers and then specify the TaskTrackers in the HADOOP_HOME/conf/slaves file.
Reducer - Top Ten
Each mapper emits its local top K; in the single reducer those local lists compete to form the final top K.
shell commands
HDFS is a distributed filesystem, and just like a Unix filesystem, it allows user to manipulate the filesystem using ____ _____.
How would an administrator run the checkpoint process manually while in safe-mode?
hadoop dfsadmin -saveNamespace
Configuration Tuning Principles (General & Map side)
General - Give the shuffle as much memory as possible, however you must make sure your map/reduce functions get enough memory to operate The amount of memory given to the JVM in which map/reduce tasks run is set by mapred.child.java.opts - Make this as large as possible for the amount of memory on your task node. Map-Side - Best performance by avoiding multiple spills to disk. One spill is optimal. io.sort.mb (increase) There is a counter that counts both map and reduce spills that is helpful.
What does "hadoop fs -getmerge max-temp max-temp-local" do?
Gets all the files specified in a HDFS directory and merges them into a single file on the local file system.
What do you have to set to get started?
HADOOP_HOME, and HADOOP_CONF_DIR pointing to the location where fs.default.name and mapred.job.tracker are set.
How do you increase namenode memory?
HADOOP_NAMENODE_OPTS in hadoop-env.sh (and HADOOP_SECONDARYNAMENODE_OPTS to match). The value should include a heap setting such as -Xmx2000m, which would allocate 2 GB.
How do we make sure the master node is not overwhelmed with rsync requests on daemon start?
HADOOP_SLAVE_SLEEP = 0.1 seconds
How the HDFS Blocks are replicated?
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding replication of blocks. HDFS uses rack-aware replica placement policy. In default configuration there are total 3 copies of a datablock on HDFS, 2 copies are stored on datanodes on same rack and 3rd copy on a different rack.
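The per-file replication factor mentioned above can also be changed programmatically; a minimal sketch (the path is hypothetical):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
fs.setReplication(new Path("/data/important.log"), (short) 5);   // equivalent of 'hadoop fs -setrep 5'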
What is HDFS ? How it is different from traditional file systems?
HDFS, the Hadoop Distributed File System, is responsible for storing huge data on the cluster. This is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.
read-only
HFTP is a ____-____ filesystem, and will throw exceptions if you try to use it to write data or modify the filesystem state.
What is a distributed data warehouse that manages data stored in HDFS and provides a query language based on SQL (and which is translated by the runtime engine to MapReducer jobs) for querying data?
HIVE
When comparing Hadoop and RDBMS, which is the best solution for speed?
Hadoop
Which is cheaper for larger scale power and storage? Hadoop or RDBMS?
Hadoop
Hadoop Archives
Hadoop Archives, or HAR files, are an archival facility that packs files into HDFS blocks more efficiently, reducing namenode memory usage while still allowing transparent access to the files. In particular, Hadoop archives can be used as input to MapReduce.
What does Hadoop 2.0 Consist Of?
Hadoop Common, HDFS, YARN, MapReduce
What is HDFS?
Hadoop Distributed File System
What is the characteristic of streaming API that makes it flexible run MapReduce jobs in languages like Perl, Ruby, Awk etc.?
Hadoop Streaming allows to use arbitrary programs for the Mapper and Reducer phases of a MapReduce job by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.
What is HUE
Hadoop User Experience (HUE), is a web library to build browser based tools to interact with cluster, Beeswax, File browser, Job designer, User manager ..etc
How does profiling work in Hadoop?
Hadoop allows you to profile a fraction of the tasks in a job and, as each task completes, pulls the profile information down to your machine for later analysis. ex: Configuration conf = getConf(); conf.setBoolean("mapred.task.profile", true); conf.set("mapred.task.profile.params", "-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s"); conf.set("mapred.task.profile.maps", "0-2"); conf.set("mapred.task.profile.reduces", " "); Job job = new Job(conf, "MaxTemperature");
MAPRED: mapred.tasktracker.map.tasks.maximum
Int (default=2) number of map tasks run on a tasktracker at one time.
Read, Site
Hadoop configuration is driven by two types of important configuration files: ____-only default configuration -core-default.xml, hdfs-default.xml, yarn-default.xml, and mapred-default.xml ___-specific configuration -conf/core-site.xml, conf/hdfs-site.xml, conf/yarn-site.xml, and conf/mapred-site.xml
TaskTrackers
Hadoop deployment includes a HDFS deployment, a single job tracker, and multiple ________.
What is Speculative Execution of tasks?
Hadoop detects when a task is running slower than expected and launches another equivalent task as backup. When a task completes successfully, the duplicate tasks are killed. Turned on by default. It is an optimization and not used to make tasks run more reliably.
Explain why the performance of join operation in Hadoop is inefficient.
Hadoop does not have indices for data so entire dataset is copied in the process to perform join operation.
Explain the benefit of Hadoop versus other nondistributed parallel framworks in terms of their hardware requirements.
Hadoop does not require high performance computers to be powerful. Its power is in the library itself. It can run effectively on consumer grade hardware.
How do you merge the Reducers output files into a single file?
Hadoop fs -getmerge somedir somefile
Command for listing a directory in HDFS
hadoop fs -ls some/path (The first column shows the Unix-style file permissions, the second column is the replication factor, followed by the owner, group, file size, modification time, and filename.)
What is the command line way of uploading a file into HDFS
hadoop fs -put <file> <dir> or hadoop fs -put <file> <file> or hadoop fs -put <dir> <dir>
How can you test a driver using a mini cluster?
Hadoop has a set of testing classes (allows testing against the full HDFS and MapReduce machinery): MiniDFSCluster MiniMRCluster MiniYARNCluster MapReduceTestCase - abstract class provides methods needed to use a mini cluster in user code.
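For example, a minimal test-method sketch, assuming the Hadoop 2.x MiniDFSCluster.Builder API and the hadoop-hdfs test jar are on the classpath (fully qualified names stand in for imports):
public void testWithMiniCluster() throws Exception {
  org.apache.hadoop.conf.Configuration conf = new org.apache.hadoop.conf.Configuration();
  // Start a single-datanode HDFS cluster inside the test JVM
  org.apache.hadoop.hdfs.MiniDFSCluster cluster =
      new org.apache.hadoop.hdfs.MiniDFSCluster.Builder(conf).numDataNodes(1).build();
  try {
    org.apache.hadoop.fs.FileSystem fs = cluster.getFileSystem();
    fs.mkdirs(new org.apache.hadoop.fs.Path("/test"));  // exercises the real HDFS code paths
  } finally {
    cluster.shutdown();  // always tear the mini cluster down
  }
}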
libhadoop.so
Hadoop has native implementations of certain components for performance reasons and for non-availability of Java implementations. These components are available in a single, dynamically-linked native library called the native hadoop library. On the *nix platforms the library is named _____.
scalable
Hadoop is ________ as more nodes can be added to it.
Why is Hadoop's file redundancy less problematic than it could be?
Hadoop is cheap and costeffective able to run on unspecialized machines, open source software and the money saved by this will likely outweigh the cost of needing additional storage space.
How many Daemon processes run on a Hadoop system?
Hadoop is comprised of five separate daemons. Each of these daemons runs in its own JVM. The following 3 daemons run on Master nodes: NameNode - stores and maintains the metadata for HDFS. Secondary NameNode - performs housekeeping functions for the NameNode. JobTracker - manages MapReduce jobs and distributes individual tasks to machines running the TaskTracker. The following 2 daemons run on each Slave node: DataNode - stores actual HDFS data blocks. TaskTracker - responsible for instantiating and monitoring individual Map and Reduce tasks.
database
Hadoop is not a ______, it is an architecture with a filesystem called HDFS.
What is the key benefit of the new YARN framework?
Hadoop jobs are no longer restricted to Map Reduce. With YARN, any type of computing paradigm can be implemented to run Hadoop.
principal
Hadoop maps Kerberos ____ to OS user account using the rule specified by hadoop.security.auth_to_local which works in the same way as the auth_to_local in Kerberos configuration file (krb5.conf).
cluster size
Hadoop reduces cost of operation via limiting ___ ____.
What is "Standalone mode" ?
Hadoop runs on the local filesystem with a local jobrunner
What is data locality optimization ?
Hadoop tries to run map tasks on the node where the input data resides in HDFS, so the map task doesn't consume valuable cluster bandwidth.
Map Reduce
Hadoop uses __ __ to process large data sets.
parallel
Hadoop uses the concept of MapReduce which enables it to divide the query into small parts and process them in ___.
What is the default MapReduce partitioner
HashPartitioner
A distributed, column-oriented database that uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (random reads)
Hbase
What is a distributed, sorted map built on HDFS with high-throughput get, put, and scan operations?
Hbase
Name three features of Hive.
HiveQL, Indexing, Different Storage types
MAPRED: mapred.job.tracker
Hostname and port the jobtracker's RPC server runs on. (default = local)
impersonate
However, if the superuser does want to give a delegation token to joe, it must first ____ joe and get a delegation token for joe, in the same way as the code example above, and add it to the ugi of joe. In this way the delegation token will have the owner as joe.
data
HttpFS can be used to transfer ____ between clusters running different versions of Hadoop (overcoming RPC versioning issues), for example using Hadoop DistCP.
versions
HttpFS can be used to transfer data between clusters running different ____ of Hadoop (overcoming RPC versioning issues), for example using Hadoop DistCP.
How are corrupted blocks "healed"?
If a client detects an error when reading a block, it reports a bad block & datanode to the namenode, and throws a ChecksumException. The namenode marks the copy as corrupt and stops traffic to it. The namenode schedules a copy of the block to be replicated on another datanode. The corrupted replica is deleted.
what is secondary sort
If equivalence rules for grouping the intermediate keys are required to be different from those for grouping keys before reduction, then one may specify a comparator via JobConf.setOutputValueGroupingComparator(Class). Since JobConf.setOutputKeyComparatorClass(Class) can be used to control how intermediate keys are grouped, these can be used in conjunction to simulate a secondary sort on values.
host
If more lax security is preferred, the wildcard value * may be used to allow impersonation from any ____ or of any user.
When are node managers blacklisted? By what?
If more than 3 tasks fail (mapreduce.job.maxtaskfailures.per.tracker) by the application master
How is tasktracker failure handled?
If the heartbeat isn't sent to jobtracker in 10secs (mapred.task.tracker.expiry.interval) The jobtracker removes it from the pool. Any tasks running when removed from the pool have to be re-run.
Cygwin
If you are using Windows machines, first install ____ and SSH server in each machine. The link http://pigtail.net/LRP/printsrv/cygwin-sshd.html provides step-by-step instructions.
Inherent Characteristics of Big Data
Immutable and Time-Based
What is a common use of Flume?
Importing twitter feeds into a Hadoop cluster
What is HDFS Block size? How is it different from traditional file system block size?
In HDFS data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64MB or 128MB in size. Each block is replicated multiple times; the default is to replicate each block three times. Replicas are stored on different nodes. HDFS utilizes the local file system to store each HDFS block as a separate file. HDFS block size cannot be compared with the traditional file system block size.
two
In Hadoop, when we store a file, it automatically gets replicated at _______ other locations also.
primary
In Kerberized operation, the identity of a client process is determined by its Kerberos credentials. For example, in a Kerberized environment, a user may use the kinit utility to obtain a Kerberos ticket-granting-ticket (TGT) and use klist to determine their current principal. When mapping a Kerberos principal to an HDFS username, all components except for the _____ are dropped. For example, a principal todd/[email protected] will act as the simple username todd on HDFS.
When is the reducers are started in a MapReduce job?
In a MapReduce job reducers do not start executing the reduce method until the all Map jobs have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The programmer defined reduce method is called only after all the mappers have finished.
executable
In contrast to the POSIX model, there are no setuid or setgid bits for files as there is no notion of ____ files.
MAPRED: mapred.tasktracker.reduce.tasks.maximum
Int (default=2) number of reduce tasks run on a tasktracker at one time.
Java
Install ___ in all machines that will be used to set up Hadoop.
What does the GenericOptionsParser do?
Interprets Hadoop command line options and sets them to a Configuration object in your application. Implemented through the Tool Interface.
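A minimal sketch of the usual pattern (the class name MyTool is hypothetical): the Tool implementation reads a Configuration already populated by GenericOptionsParser, which ToolRunner invokes before calling run().
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyTool extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() already contains any -D key=value options parsed from the command line
    System.out.println(getConf().get("my.example.property"));
    return 0;
  }
  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new MyTool(), args));
  }
}
It could then be run as: hadoop MyTool -D my.example.property=foo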
What is Hadoop Streaming?
It allows you to run map reduce jobs with other languages that use standard in and standard out. ex: ruby, python
What does the Combiner do?
It can reduce the amount of data transferred between mapper and reducer. Combiner can be an instance of the reducer class.
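For example, a one-line sketch of wiring a combiner into a new-API job (the job object is assumed; reusing the reducer as the combiner only works when the reduce function is commutative and associative, e.g., sums or maxima):
job.setMapperClass(MyMapper.class);
job.setCombinerClass(MyReducer.class);  // runs on the map side before the shuffle
job.setReducerClass(MyReducer.class);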
What is the coherency model for a filesystem?
It describes data visibility of reads and writes for a file
How does MR1 handle jobtracker failure?
It is a single point of failure, however it is unlikely that particular machine will go down. After restarting, all jobs need to be resubmitted.
How is resource manager failure handled?
It is designed to recover by using a checkpoint mechanism to save state. After a crash a new instance is brought up (by the administrator) and it recovers from the saved state (consisting of node managers and applications, but not tasks, which are managed by the application master). The storage the resource manager uses is configurable; the default (org.apache.hadoop.yarn.server.resourcemanager.recovery.memstore) keeps it in memory, so it is not highly available.
How is the splitting of file invoked in Hadoop framework?
It is invoked by the Hadoop framework by running the getSplits() method of the InputFormat class (like FileInputFormat) defined by the user.
What happens when a container uses more memory than allocated?
It is marked as failed and terminated by the node manager
What is another name for the hadoop DFS module? ex: hadoop dfs ____
It is the same as hadoop fs ____ and is also called FsShell.
FSDataInputStream implements the PositionedReadable Interface. What does it provide?
It reads parts of a file given an offset.
How is the namenode machine decided?
It runs on the machine that the startup scripts were run on.
What is the Datanode block scanner?
It verifies all blocks stored on the Datanode, allowing bad blocks to be detected and deleted or fixed. DataBlockScanner maintains a list of blocks to verify (dfs.datanode.scan.period.hours = 504). Corrupt blocks are reported to the namenode to be fixed.
Suppose Hadoop spawned 100 tasks for a job and one of the task failed. What will Hadoop do?
It will restart the task again on some other TaskTracker and only if the task fails more than four (default setting and can be changed) times will it kill the job.
What does the namenodes VERSION file contain? (4)
It's a Java properties file that contains information about the version of HDFS running: 1. namespaceID - a unique identifier for the filesystem. The namenode uses it to identify new datanodes, since they will not know it until they have registered. 2. cTime = 0 - marks the creation time of the namenode's storage. It is updated from 0 to a timestamp when the filesystem is upgraded. 3. storageType = NAME_NODE - indicates the storage directory contains data structures for the namenode. 4. layoutVersion = -18 - always negative; indicates the version of HDFS.
New API or Old API? (a) Job (b) JobConf (c) org.apache.hadoop.mapred (d) org.apache.hadoop.mapreduce
Job - NEW; JobConf - OLD; org.apache.hadoop.mapred - OLD; org.apache.hadoop.mapreduce - NEW
How does Job Submission work in MapReduce 1?
Job.submit() creates a JobSubmitter instance. The JobSubmitter: 1. calls submitJobInternal() 2. Asks the jobtracker for a new job ID (JobTracker.getNewJobID()) and computes the input splits; if they can't be computed, the job is cancelled. It checks the output specifications to make sure the output dir does not exist. 3. Copies the resources needed to run the job to the jobtracker. The job JAR is copied at a high replication factor (default = 10, mapred.submit.replication). Why? So that copies are readily available for multiple tasktrackers to access. 4. Tells the jobtracker that the job is ready for execution: JobTracker.submitJob().
How do you execute in MapReduce job from within the main method of a driver class?
JobClient.runJob(myJobConf);
What is a single way of running multiple jobs in order?
JobClient.runJob(conf1); JobClient.runJob(conf2);
How is the job specified in a MapReduce driver class?
JobConf conf = new JobConf(MyDriver.class); conf.setJobName("My Job");
What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop Cluster?
JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only One Job Tracker process run on any hadoop cluster. Job Tracker runs on its own JVM process. In a typical production cluster its run on a separate machine. Each slave node is configured with job tracker node location. The JobTracker is single point of failure for the Hadoop MapReduce service. If it goes down, all running jobs are halted.
What is JobTracker?
JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.
How do you set a space limit on a users home directory?
hadoop dfsadmin -setSpaceQuota 1t /user/username
How does speculative execution work in Hadoop?
JobTracker makes different TaskTrackers process same input. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.
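Speculative execution can also be disabled per job when duplicate attempts are undesirable (e.g., tasks with side effects); a sketch using the classic MR1 property names, assuming a Configuration object conf:
conf.setBoolean("mapred.map.tasks.speculative.execution", false);
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);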
What is the difference in Web UIs MR1vs MR2 ?
JobTracker web UI - list of jobs. Resource Manager web UI - list of running applications with links to their respective application masters, which show progress and further info.
conf/slaves
List all slave hostnames or IP addresses in your ___/___ file, one per line.
HDFS: dfs.data.dir
List of directories for a datanode to store its blocks
Hadoop IO Class that corresponds to Java Long
LongWritable
Why do map tasks write their output to local disk instead of HDFS?
Output from a map task is temporary and would be overkill to store in HDFS. If a map task fails, the mapper is re-run so there is no point in keeping the intermediate data.
What does the NameNode do?
Manages the file system namespace. It also maintains the file system tree and the metadata for all files and directories in the tree.
Pig Complex Types
Map, Tuple, Bag
What do you use to monitor a jobs actual memory using during a job run?
MapReduce task counters: 1. PHYSICAL_MEMORY_BYTES 2. VIRTUAL_MEMORY_BYTES 3. COMMITTED_HEAP_BYTES
True or False: The Job's setup is called before any tasks are run. (Create output directory..etc) MapReduce1 MapReduce2
MapReduce1 - false, it is run in a specialized task, run by a tasktracker MapReduce2 - true, directly by the application master
What is the difference between Metrics and Counters?
Metrics - collected by Hadoop daemons (for administrators). Counters - collected from MapReduce tasks and aggregated for the job. The collection mechanism for metrics is decoupled from the component that receives the updates, and there are various pluggable outputs: (a) local files (b) Ganglia (c) JMX. The daemon collecting metrics does the aggregation.
What does calling Seekable seek() do? What happens when the position referenced is greater than the file length?
Moves to an arbitrary absolute position within a file. It results in an IOException
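A short sketch of reading a file twice with seek() (the HDFS URI and path are hypothetical; classes come from org.apache.hadoop.fs, org.apache.hadoop.conf, and org.apache.hadoop.io):
FileSystem fs = FileSystem.get(URI.create("hdfs://localhost/"), new Configuration());
FSDataInputStream in = fs.open(new Path("/user/test/file.txt"));
IOUtils.copyBytes(in, System.out, 4096, false);  // read the whole file
in.seek(0);                                      // go back to the start
IOUtils.copyBytes(in, System.out, 4096, false);  // read it again
IOUtils.closeStream(in);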
What is HDFS Federation?
Multiple Namenodes. Each Namenode manages a namespace volume made up of the metadata for a namespace and block pool.
HDFS 5
NameNode constantly monitors reports sent by datanodes to ensure no blocks drop below the block replication factor. If one does, it schedules the addition of another copy of the block.
How NameNode Handles data node failures?
NameNode periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a data node after a certain amount of time, the data node is marked as dead. Since blocks will be under-replicated, the system begins replicating the blocks that were stored on the dead datanode. The NameNode orchestrates the replication of data blocks from one datanode to another. The replication data transfer happens directly between datanodes and the data never passes through the namenode.
New vs old for MapReduce job configuration
New = Configuration; Old = JobConf
New vs old for MapReduce job control
New = Job Old = JobClient
New vs old for output file names
New: part-m-nnnnn (map) and part-r-nnnnn (reduce). Old: same as new but without the "m" or "r" (part-nnnnn). Part numbers start at zero.
Are block locations persistently stored?
No, Block locations are reconstructed from datanodes when the system starts
Does a small file take up a full block in HDFS?
No, unlike a filesystem for a single disk, a file in HDFS that is smaller than a single block does not occupy a full block's worth of underlying storage.
Is Hadoop a database? Explain.
No. Hadoop is a file system.
What is the difference between these commands? hadoop fs _____ hadoop dfs _____
None they are exactly the same
Does MapReduce programming model provide a way for reducers to communicate with each other? In a MapReduce job can a reducer communicate with another reducer?
Nope, MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation.
What is difference between hadoop fs -copyToLocal and hadoop fs -get
Nothing they are identical
What is the difference between hadoop fs -copyFromLocal hadoop fs -put
Nothing they are identical
What are the other Writables (besides the Java primitives and Text)?
NullWritable, BytesWritable, MD5Hash, ObjectWritable, GenericWritable
What project was Hadoop originally a part of and what idea was that project based on?
Nutch. It was based on the idea of returning web search results faster by distributing data and calculations across different computers so multiple tasks could be accomplished simultaneously.
Any content written to a file is not guaranteed to be visible, even if the stream is flushed. Why?
Once more than a block's worth of data has been written, the first block will be visible to the readers. The current block is always invisible to new readers and will display a length of zero.
How are hanging tasks dealt with in MR1 ?
Once the timeout has been reached (default = 10 mins) the tasktracker marks the task as failed. The child JVM will be killed automatically. The timeout is set via mapred.task.timeout. It can be set to 0, although this is not advised because hanging tasks would never free up their slots.
What is the relationship between Jobs and Tasks in Hadoop?
One job is broken down into one or many tasks in Hadoop.
Describe Jobtrackers and Tasktrackers.
One jobtracker for many tasktrackers. The Jobtracker reschedules tasks and holds a record of overall progress.
One of the initial users of Hadoop
How many output files will a mapreduce job generate?
One per reducer
what are dynamic counters?
One that isn't defined by a Java enum. Because a Java enum's fields are defined at compile time, you can't create new counters on the fly using enums ... public void incrCounter(String group, String counter, long amount)
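A sketch inside an old-API mapper, where the counter name is discovered from the data at run time (the record layout is hypothetical):
String category = value.toString().split("\t")[0];       // e.g. first tab-separated field
reporter.incrCounter("RecordCategories", category, 1);   // creates the counter on first use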
super-user
Only the owner of a file or the ____ - ____ is permitted to change the mode of a file.
Reluctant does what?
Opposite of greedy, starts from one letter and builds up, the last thing they try is the entire input
What is used in Hadoop for data flow language and execution environment for exploring very large datasets?
PIG
Hadoop can run jobs in ________ to tackle large volumes of data.
Parallel
After the Map phase finishes, the Hadoop framework does "Partitioning, Shuffle and sort". Explain what happens in this phase?
Partitioning: It is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same. Shuffle: After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling. Sort: Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.
How do you determine if you need to upgrade the filesystem?
Perform a trial on a test cluster.
What is the difference between Pig Latin and SQL?
Pig Latin - data flow programming language. SQL - declarative programming language. - Pig Latin takes declarative statements and breaks them into steps. It supports complex, nested data structures while SQL deals with flatter data. Pig is customizable with UDFs and doesn't support random reads/writes, similar to Hadoop.
SSO
Practically you need to manage ____ environment using Kerberos with LDAP for Hadoop in secure mode.
Latency
Processing time is measured in weeks. Process more data, throw in more hardware to keep elapsed time under control
What are the specs of a typical "commodity hardware" machine?
Processor - 2 quad-core 2-2.5 GHz; Memory - 16-24GB ECC RAM (error-correcting code); Storage - four 1TB SATA disks; Network - Gigabit Ethernet
What is HPROF?
A profiling tool that comes with the JDK and can give valuable information about a program's CPU and heap usage.
Cost
Prohibitive undertaking for small and medium size business. Reserved for multinationals.
What is Apache Whirr? What is the benefit of using it?
Provides a Java API and scripts for interacting with Hadoop on EC2. You can easily read data from S3, but it doesn't take advantage of data locality.
Class and Method signature for mapper
public class MyMapper extends MapReduceBase implements Mapper<K1, V1, K2, V2> { public void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) throws IOException { ... } }
MapReduce driver class w/new API
public class MyDriver { public static void main(String[] args) throws Exception { Job job = new Job(); job.setJarByClass(MyDriver.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); job.setMapperClass(MyMapper.class); job.setReducerClass(MyReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); System.exit(job.waitForCompletion(true) ? 0 : 1); } }
Class and method signature for new mapper API
public class MyNewMapper extends Mapper<K1, V1, K2, V2> { public void map(K1 key, V1 value, Context context) throws IOException, InterruptedException { context.write(key2, value2); } }
class and method signature for Reducer
public class MyReducer extends MapReduceBase implements Reducer<K1, V1, K2, V2> { public void reduce(K1 key, Iterator<V1> values, OutputCollector<K2, V2> output, Reporter reporter) throws IOException { ... } }
Describe the writeable interface
public interface Writable { void write(DataOutput out) throws IOException; void readFields(DataInput in) throws IOException; }
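For example, a minimal custom Writable (the class name is hypothetical; a key type would additionally implement WritableComparable):
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class PointWritable implements Writable {
  private int x;
  private int y;
  // Serialize the fields in a fixed order...
  public void write(DataOutput out) throws IOException { out.writeInt(x); out.writeInt(y); }
  // ...and deserialize them in the same order
  public void readFields(DataInput in) throws IOException { x = in.readInt(); y = in.readInt(); }
}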
Describe a MapReduce driver class.
public class MyDriver { public static void main(String[] args) throws IOException { JobConf conf = new JobConf(MyDriver.class); conf.setJobName("My Job"); ... } }
fsimage
Quotas are persistent with the _____ . When starting, if the fsimage is immediately in violation of a quota (perhaps the fsimage was surreptitiously modified), a warning is printed for each of such violations. Setting or removing a quota creates a journal entry.
What does a RecordReader do?
RecordReader typically converts the byte-oriented view of the input provided by the InputSplit and presents a record-oriented view for the Mapper and Reducer tasks for processing. It thus assumes the responsibility of processing record boundaries and presenting the tasks with keys and values.
Configuration Tuning Principals (Reduce Side & Buffer Size)
Reduce side - best performance when intermediate data can reside entirely in memory. If your reduce function has light memory requirements, you can set mapred.inmem.merge.threshold to 0 and mapred.job.reduce.input.buffer.percent to 1.0 (or a lower value). Buffer size - 4KB by default; increase it via io.file.buffer.size.
If reducers do not start before all mappers finish then why does the progress on MapReduce job shows something like Map(50%) Reduce(10%)? Why reducers progress percentage is displayed when mapper is not finished yet?
Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes in account the processing of data transfer which is done by reduce process, therefore the reduce progress starts showing up as soon as any intermediate key-value pair for a mapper is available to be transferred to reducer. Though the reducer progress is updated still the programmer defined reduce method is called only after all the mappers have finished.
extrinsic
Regardless of the mode of operation, the user identity mechanism is ____ to HDFS itself. There is no provision within HDFS for creating user identities, establishing groups, or processing user credentials.
MAPRED: mapred.system.dir
Relative to fs.default.name where shared files are stored during job run.
x*? - Reluctant, Greedy or Possessive?
Reluctant
RPC
Remote Procedure Calls - a protocol for data serialization
How does memory utilization in MapReduce 2 get rid of previous memory issues?
Resources are more fine grained instead of having a set number of blocks at a fixed memory amount. With MR2 applications can request a memory capability that is between the min and max allocation set. Default memory allocations are scheduler specific. This removes the previous problem of tasks taking too little/too much memory because they were forced to use a fixed amount.
hadoop fs -stat
Returns the stat information on the path.
What is Hive?
SQL like language for Big Data
slave nodes
SSH is used in Hadoop for launching server processes on ______ ____.
Where are bad records stored in Hadoop?
Saved as SequenceFiles in the jobs output directory under _logs/skip
Web consoles
Security features of Hadoop consist of authentication, service level authorization, authentication for ___ ___ and data confidentiality.
The ______ Interface permits seeking to a position in the file and provides a query method for the current file offset getPos()
Seekable
6 Key Hadoop Data Types
Sentiment, Clickstream, Sensor/Machine, Geographic, Server Logs, Text
How can you configure the tasktracker to retain enough information to allow a task to be rerun over the same input data for debugging?
Set keep.failed.task.files to true
How does a task report progress?
Sets a flag to indicate that the status change should be sent to the tasktracker. The flag is checked in a separate thread every 3 seconds.
What does dfs.web.ugi do?
Sets the user that HDFS web interface runs as. (used to restrict system files to web users)
STONITH
Shoot the other node in the head
What does YARN use rather than tasktrackers?
Shuffle handlers, auxiliary services running in node managers.
What is configuration of a typical slave node on Hadoop cluster? How many JVMs run on a slave node?
Single instance of a Task Tracker is run on each Slave node. Task tracker is run as a separate JVM process. Single instance of a DataNode daemon is run on each Slave node. DataNode daemon is run as a separate JVM process. One or Multiple instances of Task Instance is run on each slave node. Each task instance is run as a separate JVM process. The number of Task instances can be controlled by configuration. Typically a high end machine is configured to run more task instances.
What is the default network topology configuration? With multi-rack clusters, what do you need to do?
Single-rack. Map nodes to racks so that Hadoop can favor within-rack transfers, which are preferable. It will also allow Hadoop to place replicas more intelligently.
What can you use when 3rd party library software is causing bad records that can't be intercepted in the mapper/reducer code?
Skipping Mode - to automatically skip bad records. Enable it and use the SkipBadRecords Class. A task reports the records that are passed back to the tasktracker. Because of extra network traffic and bookkeeping to maintain the failed record ranges, skipping mode is only enabled after 2 failed task attempts. Skipping mode can only detect one bad record per task attempt. (good for catching occasional record errors) To give skipping mode enough attempts to detect and skip all bad records in an input split, increase mapred.map.max.attempts mapred.reduce.max.attempts
HDFS: dfs.name.dir
Specifies a list of directories where the namenode metadata will be stored.
What does time-based mean?
Something known at a certain moment in time.
Describe how Sqoop transfers data from a relational database to Hadoop.
Sqoop runs a query on the relational database and exports the results into files in a variety of formats. These files are then saved on HDFS. Doing this process in reverse will import formatted files from HDFS into a relational database.
What are Java Management Extensions? (JMX)
Standard Java API for monitoring and managing applications. Hadoop includes several (MBeans) managed beans which expose Hadoop metrics to JMX aware applications.
What does YARN's start-yarn.sh script do?
Starts the YARN daemon which: (a) starts resource manager on the machine the script was run on. (b) node manager on each machine in the slaves file.
How does a Pig Latin program get executed?
Step 1 - all statements are checked for syntax, then added to the logical plan. Step 2 - DUMP statement converts the logical plan to a physical plan and the commands are executed.
What is Hadoop Streaming?
Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.
MAPRED: mapreduce.map.java.opts
String (-Xmx 2000m) JVM option used for child process that runs map tasks
MAPRED: mapreduce.reduce.java.opts
String (-Xmx 2000m) JVM option used for child process that runs reduce tasks
MAPRED: mapred.child.java.opts
String (-Xmx 2000m) JVM option used to launch tasktracker child processes that run map and reduce tasks ( can be set on per-job basis)
Hadoop works best with _________ and ___________ data, while Relational Databases are best with the first one.
Structured, Unstructured
What does the job.waitForCompletion() method do?
Submits the job Waits for it to finish Return value is Boolean true/false which translates to an exit code
How do you set a system property? How do you set them via command line?
System.setProperty("size",14); -Dproperty=value
False
T/F Hadoop works in real time?
A tasktracker may connect if it's in the include file and not in the exclude file.
TRUE
If you shut down a tasktracker that is running, the jobtracker will reschedule the task on another tasktracker.
TRUE
Once an upgrade is finalized, you can't roll back to a previous version.
TRUE
Pig turns the transformations into a series of MapReduce jobs. (seamlessly to the programmer)
TRUE
True or False: If you are using the default TextInputFormat, you do not have to specify input types for your job.
TRUE
True or False: Input of a reduce task is output from all map tasks so there is no benefit of data locality
TRUE
True or False: Its better to add more jobs than add more complexity to the mapper.
TRUE
True or False: addOutputPath() must point to a directory that doesn't currently exist.
TRUE
hadoop fs - getmerge
Takes a source directory and a destination file as input and concatenates files in src into the destination local file.
HDFS: fs.default.name
Takes the HDFS filesystem URI: the host is the namenode's hostname or IP and the port is the port the namenode will listen on (default file:///; the HDFS namenode commonly uses port 8020). It specifies the default filesystem so you can use relative paths.
What is a Task instance in Hadoop? Where does it run?
Task instances are the actual MapReduce jobs which are run on each slave node. The TaskTracker starts a separate JVM processes to do the actual work (called as Task Instance) this is to ensure that process failure does not take down the task tracker. Each Task Instance runs on its own JVM process. There can be multiple processes of task instance running on a slave node. This is based on the number of slots configured on task tracker. By default a new task instance JVM process is spawned for a task.
What's a tasktracker?
TaskTracker is a node in the cluster that accepts tasks like MapReduce and Shuffle operations - from a JobTracker.
How do we ensure that multiple instances of the same task don't try to write to the same file?
Tasks write to their working directory, when they are committed, the working directory is promoted to the output directory.
Hadoop IO Class that corresponds to Java String
Text
Streaming output keys and values are always of type ____ .
Text. The IdentityMapper cannot change LongWritable keys to Text keys so it fails. Another mapper must be used.
What is the difference between TextInputFormat and KeyValueInputFormat class?
TextInputFormat: It reads lines of text files and provides the offset of the line as key to the Mapper and actual line as Value to the mapper. KeyValueInputFormat: Reads text file and parses lines into key, Val pairs. Everything up to the first tab character is sent as key to the Mapper and the remainder of the line is sent as value to the mapper.
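For example, a sketch of selecting KeyValueTextInputFormat in a new-API driver (the separator property name shown is the Hadoop 2 name and is an assumption to check against your version):
job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat.class);
// Optionally change the key/value separator from the default tab character:
job.getConfiguration().set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");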
What is Hadoop Pipes?
The C++ interface to Hadoop MapReduce. Uses sockets rather than standard input/output (which Streaming uses).
How the Client communicates with HDFS?
The client communicates with HDFS using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. Client applications can talk directly to a DataNode, once the NameNode has provided the location of the data.
What is a Combiner?
The Combiner is a 'mini-reduce' process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.
aware
The HDFS and the YARN components are rack-___.
REST API
The HTTP ____ ____ supports the complete FileSystem/FileContext interface for HDFS.
names
The Hadoop Distributed File System (HDFS) allows the administrator to set quotas for the number of ___ used and the amount of space used for individual directories. Name quotas and space quotas operate independently, but the administration and implementation of the two types of quotas are closely parallel.
POSIX
The Hadoop Distributed File System (HDFS) implements a permissions model for files and directories that shares much of the ____ model. Each file and directory is associated with an owner and a group.
What is the difference between HDFS and NAS ?
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences from other distributed file systems are significant. Differences between HDFS and NAS: in HDFS, data blocks are distributed across the local drives of all machines in a cluster, whereas in NAS data is stored on dedicated hardware. HDFS is designed to work with MapReduce, since computation is moved to the data; NAS is not suitable for MapReduce since data is stored separately from the computation. HDFS runs on a cluster of machines and provides redundancy using a replication protocol, whereas NAS is provided by a single machine and therefore does not provide data redundancy.
How is client-side checksumming done?
The Hadoop LocalFileSystem performs client-side checksumming.A file is written and a hidden file is created.(filename.crc) Controlled by io.bytes.per.checksum (512 bytes)
NodeManager
The Hadoop daemons are NameNode/DataNode and ResourceManager/_____.
What is the purpose of RecordReader in Hadoop?
The InputSplit has defined a slice of work, but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the Input Format.
Jobtracker
The JobReduce number is decided via the _____.
What is the Hadoop MapReduce API contract for a key and value Class?
The Key must implement the org.apache.hadoop.io.WritableComparable interface. The value must implement the org.apache.hadoop.io.Writable interface.
What is a NameNode? How many instances of NameNode run on a Hadoop Cluster?
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. It does not store the data of these files itself. There is only One NameNode process run on any hadoop cluster. NameNode runs on its own JVM process. In a typical production cluster its run on a separate machine. The NameNode is a Single Point of Failure for the HDFS Cluster. When the NameNode goes down, the file system goes offline. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds the successful requests by returning a list of relevant DataNode servers where the data lives.
How JobTracker schedules a task?
The TaskTrackers send out heartbeat messages to the JobTracker, usually every few seconds, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.
Volume, Velocity, Variety
The Three Characteristics of Big Data
block size
The __ ___ of the input data can affect the performance of MapReduce computations, as the default behavior of Hadoop is to create one map task for each data block of the input files.
name quota
The ____ ____ is a hard limit on the number of file and directory names in the tree rooted at that directory
superuser
The ____ must be configured on namenode and jobtracker to be allowed to impersonate another user.
Define "fault tolerance".
The ability of a system to continue operating after the failure of some of its components.
\A
The beginning of the input
What is "the shuffle" ?
The data flow between map and reduce tasks.
containers
The data is stored in HDFS which does not have any predefined ______.
If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the reducer?
The default partitioner computes a hash value for the key and assigns the partition based on this result.
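HashPartitioner's getPartition() boils down to the following computation (masking the sign bit keeps the result non-negative):
public int getPartition(K key, V value, int numReduceTasks) {
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}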
\z
The end of the input
\Z
The end of the input but for the final terminator, if any
\G
The end of the previous match
What manages the transition of active namenode to Standby?
The failover controller
If not set explicitly, the intermediate types in MapReduce default to ____________.
The final output types, which themselves default to LongWritable (keys) and Text (values)
Input splits or splits
The fixed sized pieces into which the input is divided one map task is created for each split for many/most jobs a good split tends to be the size of an HDFS block
What are some typical functions of Job Tracker?
The following are some typical tasks of JobTracker:- - Accepts jobs from clients - It talks to the NameNode to determine the location of the data. - It locates TaskTracker nodes with available slots at or near the data. - It submits the work to the chosen TaskTracker nodes and monitors progress of each task by receiving heartbeat signals from Task tracker.
Codec
The implementation of a compression-decompression algorithm
Where is the namenode data stored?
The local disk, in two files: the namespace image and the edit log
How are Hadoop Pipes jobs written?
The mapper and reducer methods are written by extending the Mapper and Reducer classes defined in the HadoopPipes namespace. main() acts as the entry point; it calls HadoopPipes::runTask().
Where is the Mapper output (intermediate key-value data) stored?
The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in config by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
What is a Namenode?
The master node that manages the file system namespace. It maintains the file system tree and metadata for all the files/directories within. Keeps track of the location of all the datanodes for a given file.
Block
The minimum amount of data that can be read or written. For HDFS, this is a much larger unit than a normal file system. Typically this is 64MB by default
What is Streaming data access
The most efficient data processing pattern: a write-once, read-many-times pattern
The number of map tasks is driven by what?
The number of input splits, which is dictated by the size of inputs and block size.
The number of partitions is equal to?
The number of reduce tasks for the job.
Deserialization
The process of turning a byte stream back into a series of structured objects
Serialization
The process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.
slaves
The rest of the machines in the cluster act as both DataNode and NodeManager. These are the ___________.
Enter your regex: (dog){3} Enter input string to search: dogdogdogdogdogdog I found the text "dogdogdog" starting at index 0 and ending at index 9. I found the text "dogdogdog" starting at index 9 and ending at index 18. Enter your regex: dog{3} Enter input string to search: dogdogdogdogdogdog No match found. Why does the second one fail?
The second one asks for "do" followed by the letter g three times ("doggg"), because {3} applies only to the preceding g; without the grouping parentheses it can never match "dogdogdog".
Controlling sort order
The sort order for keys is controlled by a RawComparator, which is found as follows: 1. If the property mapred.output.key.comparator.class is set, either explicitly or by calling setSortComparatorClass() on Job, then an instance of that class is used. (In the old API, the equivalent method is setOutputKeyComparatorClass() on JobConf.) 2. Otherwise, keys must be a subclass of WritableComparable, and the registered comparator for the key class is used. 3. If there is no registered comparator, then a RawComparator is used that deserializes the byte streams being compared into objects and delegates to the WritableComparable's compareTo() method.
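For example, a sketch of setting both comparators on a new-API Job (the comparator class names are hypothetical):
job.setSortComparatorClass(MyKeyComparator.class);        // controls how keys are sorted
job.setGroupingComparatorClass(MyGroupComparator.class);  // controls which keys share one reduce() call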
kerberos
The superuser must have ____ credentials to be able to impersonate another user. It cannot use delegation tokens for this feature. It would be wrong if superuser adds its own delegation token to the proxy user ugi, as it will allow the proxy user to connect to the service with the privileges of the superuser.
mapred.task.tracker.report.address
The tasktracker's RPC server address and port, used by the tasktracker's child JVMs to communicate with it. The server only binds to localhost (default 127.0.0.1:0).
What does immutable mean?
The truthfulness of the data does not change. Changes of big data are new entries not updates to existing entries.
Globbing
The use of pattern matching to match multiple files with a single expression.
What is Hadoop Common?
The utilities that provide support for other Hadoop modules.
What does dfs.replication =1 mean?
There would only be one replication per block. Typically we aim for at least 3.
What's unique about -D properties when used with the hadoop command?
They have a space: -D name=value, as compared with JVM system properties: -Dname=value
Benefits of distributed cache
This is because the distributed cache is much faster. It copies the file to all tasktrackers at the start of the job, so if a tasktracker runs 10 or 100 Mappers or Reducers, they all use the same local copy of the distributed cache. On the other hand, if you read the file from HDFS inside the MR job, then every Mapper will try to access it from HDFS, so if a TaskTracker runs 100 map tasks it will read the file 100 times from HDFS. HDFS is also not very efficient when used like this.
Distributed Cache
This provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to any particular node once per job.
LDAP
Though files on HDFS are associated to owner and group, Hadoop does not have the definition of group by itself. Mapping from user to group is done by OS or _____.
Greedy does what first?
Tries to match the entire input string first , if it fails then backs off by one letter each time until a match is made.
T/F: Hadoop is open source.
True
True or False: Because CRC-32 checksum is 4 bytes long, the storage overhead is less than 1%.
True
What is HADOOP_IDENT_STRING used for?
To change the perceived user for logging purposes. The log names will contain this value.
daemons
To configure the Hadoop cluster you will need to configure the environment in which the Hadoop daemons execute as well as the configuration parameters for the Hadoop daemons.
Why is a block in HDFS so large?
To minimize the cost of seeks. By making a block large enough, the time to transfer the data from the disk can be made significantly larger than the time to seek to the start of the block. Thus the transfer of a large file made of multiple blocks operates at the disk transfer rate.
YARN
To start a Hadoop cluster, you will need to start both the HDFS and ____ cluster.
What is dfsadmin?
Tool for finding information on the state of HDFS and performing administrative actions on HDFS
How do you calculate the optimal number of reducers?
Total number of available reducer slots = # nodes in the cluster x # of slots per node (mapred.tasktracker.reduce.tasks.maximum). Then use slightly fewer reducers than total slots, which gives you one wave of reduce tasks.
How has big data been processed historically?
Traditionally very difficult, technically and financially.
FIll in the blank. Hadoop lacks notion of ________ and _______. Therefore, the analyzed result generated by Hadoop may or may not be 100% accurate.
Transaction Consistency, Recovery Checkpoint
True or False: JConsole allows you to view MBeans in a running JVM. You can see Hadoop metrics via JMX using the default metrics, but to have them update you have to configure metrics to use something other than NullContext.
True; using NullContextWithUpdateThread is appropriate if JMX is your only way to view metrics.
True or False: You can reference a java mapper and reducer in a Hadoop Pipes job.
True, you can use a hybrid Java and C++
True or False: ChecksumFileSystem is just a wrapper around FileSystem.
True, you can use methods like getChecksumFile().
Semi-structured Data
Typically rows are missing some columns or have their own unique columns
How does a FIFO Scheduler work?
Typically each job would use the whole cluster, so jobs had to wait their turn. Has the ability to set a job's priority (very high, high, normal, low, very low). It will choose the highest-priority tasks first, but there is no preemption (once a task is running, it can't be replaced).
ResourceManager
Typically one machine in the cluster is designated as the NameNode and another machine the as ___ , exclusively. These are the masters.
What does fs.default.name do?
Sets the default filesystem for Hadoop. If set, you do not need to specify the filesystem explicitly when you use -copyFromLocal via the command line. ex: fs.default.name = hdfs://localhost/
How do you list the contents of a directory?
Use the FileSystem's listStatus() method: public FileStatus[] listStatus(Path f) throws IOException
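A short usage sketch (the directory path is hypothetical):
FileSystem fs = FileSystem.get(new Configuration());
FileStatus[] statuses = fs.listStatus(new Path("/user/hadoop/input"));
for (FileStatus status : statuses) {
  System.out.println(status.getPath() + " " + status.getLen());  // path and length in bytes
}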
CLI MiniCluster
Using the ___ ____, users can simply start and stop a single-node Hadoop cluster with a single command, and without the need to set any environment variables or manage configuration files. The ___ ___ starts both a YARN/MapReduce & HDFS clusters. This is useful for cases where users want to quickly experiment with a real Hadoop cluster or test non-Java programs that rely on significant Hadoop functionality.[same word]
What is "globbing"?
Using wildcard characters to match multiple files with a single expression rather than having to enumerate each file and directory to specify input
Three V's of Big Data
Variety, volume and velocity
How can Oozie inform a client about the workflow status?
Via an HTTP callback
How does one obtain a reference to an instance of the hadoop file system in Java
Via static factory methods: FileSystem.get(Configuration conf) throws IOException and FileSystem.get(URI uri, Configuration conf) throws IOException
Lists three drawbacks of using Hadoop
Whatever is listed in the 6 Drawbacks of Hadoop. For example: does not work well with small amounts of data, MapReduce programs are difficult to implement or understand, and it does not guarantee atomic transactions.
What is InputSplit in Hadoop?
When a Hadoop job is run, it splits input files into chunks and assign each split to a mapper to process. This is called InputSplit.
HDFS: fs.checkpoint.dir
Where the secondary namenode stores its checkpoints of the filesystem
When is it impossible to set a URLStreamHandlerFactory? What is the workaround?
When it has been used elsewhere. You can use the FileSystem API instead.
When is a CodecPool used? How is it implemented?
When lots of compression/decompression occurs. It is used to re-use compressors and de-compressors, reducing the cost of new object creation. Compressor compressor = null; try{ compressor = CodecPool.getCompressor(codec); CompressionOutputStream out = codec.createOutputStream(System.out, compressor); } finally { CodecPool.returnCompressor(compressor); }
0 or 1
When running under the local jobrunner, how many reducers are supported?
kinit
When service level authentication is turned on, end users using Hadoop in secure mode needs to be authenticated by Kerberos. The simplest way to do authentication is using ___ command of Kerberos.
Describe what happens when a slave node in a Hadoop cluster is destroyed and how the master node compensates.
When the slave node is destroyed, it stops sending heartbeat signals to the master node. The master node recognizes the loss of the slave node and relegates its tasks, including incomplete tasks, to other slave nodes.
rebalancer
When you add new nodes, HDFS will not rebalance automatically. However, HDFS provides a _____ tool that can be invoked manually.
What is the behavior of the HashPartitioner?
With multiple reducers, records will be allocated evenly across reduce tasks, with all records that share the same key being processed by the same reduce task.
What does HADOOP_MASTER defined in hadoop-env.sh do?
Worker daemons will rsync the tree rooted at HADOOP_MASTER to the local nodes HADOOP_INSTALL when the daemon starts.
What is Oozie?
Workflow scheduler system to manage Apache Hadoop jobs
What are datanodes?
Workhorses of the filesystem. They store and retrieve blocks when told to do so by the client or Namenode. They report back to the Namenode with lists of which blocks they are currently storing and where they are located
What does FileContext in metrics do?
Writes metrics to a local file. Unsuitable for large clusters because output files are spread out.
How does YARN handle memory? How does it compare to the slot model?
YARN allows applications to request an arbitrary amount of memory for a task. Node managers allocate memory from a pool, the number of tasks running on a node depends on the sum of their memory requirements, not a fixed number of slots.The slot-based model can lead to under utilization because slots are reserved for map or reduce tasks. YARN doesn't differentiate so it is free to maximize memory utilization.
Can I set the number of reducers to zero?
Yes, setting the number of reducers to zero is a valid configuration in Hadoop. When you set the reducers to zero no reducers will be executed, and the output of each mapper will be stored to a separate file on HDFS. [This is different from the condition when reducers are set to a number greater than zero, in which case the Mappers' output (intermediate data) is written to the local file system (NOT HDFS) of each mapper slave node.]
Is it possible to have Hadoop job output in multiple directories? If yes, how?
Yes, by using the MultipleOutputs class.
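A compressed sketch with the new-API MultipleOutputs (from org.apache.hadoop.mapreduce.lib.output; the named output "errors" is hypothetical):
// In the driver:
MultipleOutputs.addNamedOutput(job, "errors", TextOutputFormat.class, Text.class, IntWritable.class);
// In the reducer:
private MultipleOutputs<Text, IntWritable> mos;
protected void setup(Context context) { mos = new MultipleOutputs<Text, IntWritable>(context); }
// inside reduce(): mos.write("errors", key, value);
protected void cleanup(Context context) throws IOException, InterruptedException { mos.close(); }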
do map tasks have the advantage of data locality
Yes, many times
Is it possible to provide multiple input to Hadoop? If yes then how can you give multiple directories as input to the Hadoop job?
Yes, the input format class provides methods to add multiple directories as input to a Hadoop job.
Can distcp work on two different versions of Hadoop?
Yes, you would have to use http ( or the newer webhdfs) ex: hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar
What does YARN stand for?
Yet Another Resource Negotiator
What is YARN?
Yet Another Resource Negotiator... a framework for job scheduling and cluster resource management
How can you set an arbitrary number of Reducers to be created for a job in Hadoop?
You can either do it programmatically by using the setNumReduceTasks method of the JobConf class, or set it as a configuration setting.
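For example (both APIs expose the same setter; the -D form relies on GenericOptionsParser):
conf.setNumReduceTasks(10);   // old API, on JobConf
job.setNumReduceTasks(10);    // new API, on Job
// or on the command line: hadoop MyDriver -D mapred.reduce.tasks=10 input output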
When does PLATFORM need to be set? What is it for?
a) When you are running Hadoop Pipes and use C++ b) It specifies the operating system architecture and data model. Needs to be set before running the makefile. ex: PLATFORM=Linux-i386-32
fuse-dfs
allows mounting of HDFS as a standard filesystem (via FUSE)
Possessive does what?
always eats the entire input string, trying once (and only once) for a match
hdfs-site.xml
an XML file that specifies parameters used by HDFS daemons and clients.
The mapper in filtering
applies the evaluation function to each record it receives; the mapper outputs the same key/value types as its input, since the record is left unchanged, which is what we want in filtering
daily::exchange and divs::exchange- what does that mean?
both daily and divs have a column with the same name, need to distinguish them
Count words in PIG
cntd = foreach grpd generate group, COUNT(words);
So why not scale up and out at the same time?
It compounds the costs and weaknesses of both approaches: instead of either very large, expensive hardware or cross-cluster logic, this hybrid architecture requires both.
Jobtracker
coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers
Where are the default Hadoop properties stored?
core-default.xml
Where are hadoop fs defaults stored?
core-site.xml
Where are the site specific overrides to the default Hadoop properties stored?
core-site.xml
Which of the following is NOT Hadoop drawbacks? a. inefficient join operation b. security issue c. does not optimize query for user d. high cost e. MapReduce is difficult to implement
d
MapReduce
data processing software with specs on how to input and output data sets. It integrates tightly with HDFS.
PIG is a ________ language rather than a programming language
dataflow
Fill in the blank: The __________ holds the data in the HDFS and the application connects with the __________ to send and retrieve data from the cluster.
datanode, namenode
Partitioner Structure for Inverted Index
determines where values with the same key will eventually be copied by a reducer for final output
The downside: we need to develop software that can process data across a fleet of machines.
Developers needed to handcraft mechanisms for data partitioning and reassembly, scheduling logic, and failure handling.
How do you change HDFS block size?
dfs.block.size (hdfs-site.xml) (default 64MB recommended 128MB )
Property for path on local file system in which data node instance should store its data
dfs.data.dir (/tmp by default; must be overridden)
Which settings are used to control which network interfaces to use? (ex:eth0) (2)
dfs.datanode.dns.interface mapred.tasktracker.dns.interface
How do you reserve storage space for non-HDFS use?
dfs.datanode.du.reserved (amount in bytes)
Property for path on local file system of the NameNode instance where the NameNode metadata is stored
dfs.name.dir ex: /home/username/hdfs
What is the property that enables file permissions
dfs.permissions (the namenode runs as a super user where permissions are not applicable)
Pig- Tuple
divided into fields, e.g. (min, max, count)
How many reducers does Distinct pattern need?
doesn't matter
How to do secondary sort in Streaming?
We don't want to partition by the entire key, so we use the KeyFieldBasedPartitioner, which allows us to partition by a part of the key. The mapred.text.key.partitioner.options specification configures the partitioner.
one reason scale up systems are so costly ...
due to the redundancy built in to mitigate the impact of component failures
print cntd in PIG
dump cntd;
Some questions are only meaningful if asked of sufficiently large data sets.
e.g. what is the most popular song or movie; more meaningful if we ask 100 users rather than 10.
dfs.namenode.shared.edits.dir
Each namenode in an HA pair must have access to a shared filesystem defined by this property. The active namenode writes to it, while the standby namenode reads it and applies the changes to its in-memory version of the metadata.
how are user defined counters defined:
Defined as Java enums, e.g. enum Temperature { MISSING, MALFORMED }. Within a task they are incremented with context.getCounter(Temperature.MALFORMED).increment(1); (optionally logging, e.g. System.err.println("Ignoring possibly corrupt input: " + value);)
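A fuller hedged sketch of a mapper that increments such a counter; the class name, enum, and parsing logic are illustrative:
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class TemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      enum Temperature { MISSING, MALFORMED }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        try {
          // Illustrative parsing: treat the whole line as an integer temperature.
          int temp = Integer.parseInt(value.toString().trim());
          context.write(new Text("temperature"), new IntWritable(temp));
        } catch (NumberFormatException e) {
          System.err.println("Ignoring possibly corrupt input: " + value);
          context.getCounter(Temperature.MALFORMED).increment(1);
        }
      }
    }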
How do you access task execution from Streaming?
environment variables (ex in python: os.environ["mapred_job_id"] ) -cmdenv set environment variables via command line
Property that is the URI that describes the NameNode for the cluster
fs.default.name ex: hdfs://servername:9000 (port 9000 is arbitrary)
How do you configure a hadoop cluster for pseudo-distributed mode?
fs.default.name - hdfs://localhost/ dfs.replication = 1
What is it called when an administrator brings a namenode down manually for routine maintenance?
graceful failover
x* reluctant greedy or possessive?
greedy
Distinct- structure - exploits MapReduce's ability to ____ and uses
group keys together to remove duplicates; uses mapper to transform the data and doesn't do much in reducer
What is the core function of the MapReduce paradigm?
grouping data together by a key
Min Max Count Mapper -
groups records by a key, such as the user ID; the value consists of three columns: min, max, and count
Reducer - Distinct
groups the nulls together by key- so we'll have one null per key
Group by word in PIG
grpd = group words by word;
command for making a directory in HDFS
hadoop fs -mkdir mydir
What is the command to enter Safe Mode?
hadoop dfsadmin -safemode enter
How do you check if you are in Safe Mode?
hadoop dfsadmin -safemode get or front page of HDFS web UI
What is the command to exit Safe Mode?
hadoop dfsadmin -safemode leave
How do you set a script to run after Safe Mode is over?
hadoop dfsadmin -safemode wait # followed by the command to read/write a file
What is the command line for executing a Hadoop Streaming job
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar -input input/path -output output/path -mapper path/to/mapper.py -reducer path/to/reducer.py (all on one line) -file can be used to ship scripts to the cluster; -combiner can be used to specify a combiner
How do you use Hadoop Streaming?
hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \ -input input/ncdc/sample.txt \ (input path in hdfs) -output output \ (output location) -mapper path-to-mapper-function \ -reducer path-to-reducer-function
Where do you find the list of all benchmarks?
hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar
How do you set your hadoop username and group?
hadoop.job.ugi ex: hadoop.job.ugi = test1,test2,test3 (test1 = user, test2 = group1, test3 = group2)
What property is used to set the Hadoop username and group?
hadoop.job.ugi = user,group1,group2
How do you disable the default Native Libraries to use Java Libraries instead?
hadoop.native.lib = false (the Hadoop script in bin automatically sets the native library path for you)
Costs remain lower than scale up
hardware cost grows disproportionately as one purchases larger machines
Yahoo
hired Doug Cutting in 2006 and became one of the most prominent supporters of Hadoop. Yahoo ran some of the largest Hadoop implementations and allowed Doug and his team to contribute to Hadoop, even with Yahoo's own improvements and extensions.
Where can you view a components metrics?
http://jobtracker-host:50030/metrics - ?format=json (optional)
How do you get a stack trace for a component?
http://jobtracker-host:50030/stacks
move processing, not data
If you need to process 100 TB of data on one massive server with a fibre SAN, there are limits on how much data can be delivered to that host. Instead, use 1,000 servers and make the data local to each; all that travels on the network is program binaries, metadata, and status reports.
Pig guesses at data
if not explicitly told
scale-out systems are built with 'expect to fail' in mind
individual components will fail regularly and at inconvenient times.
load data gettysburg.txt in pig
input = load 'gettysburg.txt' as (line);
What are the output of the map tasks called?
intermediate keys and values
[a-z&&[def]]
intersection - d e or f
How do you set how many bytes per checksum of data?
io.bytes.per.checksum (default 512 bytes)
How do you change the default buffer size?
io.file.buffer.size (default 4KB) recommended 128KB core-site.xml
numerical summarizations pattern
is a general pattern for calculating aggregate statistical values
HDFS 2...
It is not a POSIX-compliant filesystem; it does not provide the same guarantees as a regular filesystem.
failure is no longer a crisis to be mitigated ...
it is reduced to irrelevance
medi and standev - reducer process and what is the output?
Iterates through the given set of values, adding each value to an in-memory list while keeping a running sum and count. The comment lengths are then sorted to find the median. A running sum of squared deviations is computed from the difference between each comment length and the mean, and the standard deviation is calculated from this sum. The output is the median and standard deviation, with the key.
Min Max Count Reducer -
iterates through the values to find the min and maximum dates and sums the count
DataBlockScanner
a background thread that periodically verifies all the blocks stored on the datanode in order to guard against "bit rot"
How do you specify the mapper and reduce to use within a mapreduce job?
job.setMapperClass(Classname.class); job.setReducerClass(Classname.class);
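Putting these calls together, a hedged driver sketch (new API); the WordCount class names and argument indices are illustrative:
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);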
[abc]
just a b or c
How would you dump the jvm context to a file?
jvm.class = org.apache.hadoop.metrics.file.FileContext jvm.fileName = /tmp/jvm-metrics.log
Pig - Map
key(char array) to value(any pig type)
Reduce-Side Joins
less efficient because both datasets have to go through the MapReduce shuffle. The basic idea is that the mapper tags each record with its source and uses the join key as the map output key, so that the records with the same key are brought together in the reducer. We use several ingredients to make this work in practice: Multiple inputs Secondary sort
What does "." do?
matches any single character
thread
libhdfs is ____ safe.
pig - Limit
limits the number of output tuples
Where do map tasks write their output?
local disk, not HDFS
How do you list the files in a Hadoop filesystem within the Pig command line? What is the Pig Latin comment style?
ls / C style /* */ or --
What is used in the filter code?
mapper
Top Ten Pattern Description uses
mapper and reducer
median and stand dev in map reduce uses
mapper and reducer
Average uses-
mapper and reducer- no combiner since it is not associative can't add up averages of averages
Min Max Count- uses..
mapper combiner and reducer
How do you set a MapReduce taskscheduler?
mapred.jobtracker.taskScheduler = org.apache.hadoop.mapred.FairScheduler
What property configures the number of Reduce tasks?
mapred.reduce.tasks in JobConf (set by calling JobConf.setNumReduceTasks())
What property is used to set the timeout for failed tasks?
mapred.task.timeout (defaults to 10 min; can be configured per job or per cluster)
Which property controls the maximum number of map/reduce tasks that can run on a tasktracker at one time?
mapred.tasktracker.map.tasks.maximum (default 2) mapred.tasktracker.reduce.tasks.maximum (default 2)
What is the property that changes the number of task slots for the tasktrackers?
mapred.tasktracker.map.tasks.maximum
How do you set Hadoop to use YARN?
mapreduce.framework.name = yarn
\d
matches digits
\D
matches non digits
\S
matches non white space characters
\W
matches non word character
\s
matches whitespace
\w
matches word character [a-zA-Z_0-9]
How do you package an Oozie workflow application?
max-temp-workflow/
  lib/
    hadoop-examples.jar
  workflow.xml
(workflow.xml must be at the top level; the package can be built with Ant/Maven)
Min Max Count Combiner
min and max comment dates can be calculated for each local map task without having an effect on the final min and max
Why can minmaxcount have a combiner?
min, max, and count are all associative and commutative
and bad
moving software to a larger system is not a trivial task.
What is used for MapReduce tests?
MRUnit
How do you test a MapReduce job?
MRUnit and JUnit (the test execution framework). MRUnit's MapDriver is configured with the mapper we want to test, and runTest() executes the test. MRUnit's ReduceDriver is configured with the reducer.
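A hedged MRUnit sketch; MyMapper (a word-count-style mapper) and the sample record are illustrative:
    // Uses org.apache.hadoop.mrunit.mapreduce.MapDriver inside a JUnit test method.
    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        MapDriver.newMapDriver(new MyMapper());
    driver.withInput(new LongWritable(0), new Text("hello hello"))
          .withOutput(new Text("hello"), new IntWritable(1))
          .withOutput(new Text("hello"), new IntWritable(1))
          .runTest();   // fails the test if actual output differs from the expected output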
How is the output value type specified for a MapReduce job?
myJobConf.setOutputValueClass(IntWritable.class); (where IntWritable.class is the type of the output value)
How is the mapper specified for a MapReduce job?
myJobConf.setMapperClass(MyMapper.class);
How is combiner specified for a job?
myConf.setCombinerClass(MyCombiner.class);
How is the output key specified for a map reduce job?
myJobConf.setOutputKeyClass(Text.class); (where Text.class is the type of the output key)
How is the reducer specified for a MapReduce job?
myJobConf.setReducerClass(MyReducer.class);
Pig join syntax
name = join relation1 by key, relation2 by key;
Does MapReduce sort values?
no
do reduce tasks have the advantage of data locality
no
why Pig?
no more java/mapreduce, better load balancing of the reducers, easier joins
Combiner Structure for Inverted Index
not used; it would have little beneficial impact on the byte count
GET
operations: HTTP ___ : OPEN (see FileSystem.open) GETFILESTATUS (see FileSystem.getFileStatus) LISTSTATUS (see FileSystem.listStatus) GETCONTENTSUMMARY (see FileSystem.getContentSummary) GETFILECHECKSUM (see FileSystem.getFileChecksum) GETHOMEDIRECTORY (see FileSystem.getHomeDirectory) GETDELEGATIONTOKEN (see FileSystem.getDelegationToken) GETDELEGATIONTOKENS (see FileSystem.getDelegationTokens)
POST
operations: HTTP ____ APPEND (see FileSystem.append) CONCAT (see FileSystem.concat)
PUT
operations: HTTP ____ CREATE (see FileSystem.create) MKDIRS (see FileSystem.mkdirs) CREATESYMLINK (see FileContext.createSymlink) RENAME (see FileSystem.rename) SETREPLICATION (see FileSystem.setReplication) SETOWNER (see FileSystem.setOwner) SETPERMISSION (see FileSystem.setPermission) SETTIMES (see FileSystem.setTimes) RENEWDELEGATIONTOKEN (see FileSystem.renewDelegationToken) CANCELDELEGATIONTOKEN (see FileSystem.cancelDelegationToken)
DELETE
operations: HTTP _____ DELETE (see FileSystem.delete)
What is the combiner?
optional localized reducer; groups data in the map phase
dfs.namenode.http-address.nameservice-id.namenode-id
Optionally, it is possible to specify the hostname and port of the HTTP service for a given namenode-id within nameservice-id.
What class defines a file system in Hadoop
org.apache.hadoop.fs.FileSystem
What is Writable & WritableComparable interface?
org.apache.hadoop.io.Writable is a Java interface. Any key or value type in the Hadoop Map-Reduce framework implements this interface. Implementations typically implement a static read(DataInput) method which constructs a new instance, calls readFields(DataInput) and returns the instance. org.apache.hadoop.io.WritableComparable is a Java interface. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface. WritableComparable objects can be compared to each other using Comparators.
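A hedged sketch of a custom key type implementing WritableComparable; the IntPair name and its two fields are illustrative (equals/hashCode omitted for brevity):
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class IntPair implements WritableComparable<IntPair> {
      private int first;
      private int second;

      public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
      }

      public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
      }

      public int compareTo(IntPair other) {   // sort by first, then second
        int cmp = Integer.compare(first, other.first);
        return cmp != 0 ? cmp : Integer.compare(second, other.second);
      }
    }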
How do you run Pig locally? How do you run Pig on a distributed system?
pig -x local -x (execution environment option) pig (default)
What is a IdentityMapper and IdentityReducer in MapReduce ?
org.apache.hadoop.mapred.lib.IdentityMapper Implements the identity function, mapping inputs directly to outputs. If MapReduce programmer do not set the Mapper Class using JobConf.setMapperClass then IdentityMapper.class is used as a default value. org.apache.hadoop.mapred.lib.IdentityReducer Performs no reduction, writing all input values directly to the output. If MapReduce programmer do not set the Reducer Class using JobConf.setReducerClass then IdentityReducer.class is used as a default value.
Mapper Structure for Inverted Index
outputs the desired fields for the index as the key and the unique identifier as value
At how many nodes does MapReduce1 hit scaleability bottlenecks?
over 4,000 nodes
Hadoop can run jobs in ________ to tackle large volumes of data.
parallel
what is the record reader?
parses the data into records, passes the data into the mapper in the form of a key value pair
Once a FilesySystem is retrieved, how do you open the input stream for a file?
public FSDataInputStream open(Path f) throws IOException - Uses default buffer of 4KB public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
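A hedged sketch that opens a file, copies it to stdout, then seeks back to the start and copies it again; the URI and path are illustrative:
    // Uses org.apache.hadoop.fs.{FileSystem, FSDataInputStream, Path} and org.apache.hadoop.io.IOUtils.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode/"), conf);
    FSDataInputStream in = null;
    try {
      in = fs.open(new Path("/user/me/sample.txt"));
      IOUtils.copyBytes(in, System.out, 4096, false);
      in.seek(0);   // FSDataInputStream supports seeking, so we can re-read from the start
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }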
If you want to append to an existing file instead of creating a new one, how to do you specify the output stream? How does appending work?
public FSDataOutputStream append(Path f) throws IOException It creates a single writer to append to the bottom of files only.
How do you create a file output stream?
public FSDataOutputStream create(Path f) throws IOException
What are the two FileSystem methods for processing globs? What does it return?
public FileStatus [ ] globStatus(Path pathPattern) throws IOE public FileStatus [ ] globStatus(Path pathPattern, PathFilter filter) throws IOE return an array of FileStatus objects whose paths match the supplied pattern, ordered by Path
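A hedged sketch of globStatus in use, assuming an existing FileSystem instance fs; the glob pattern is illustrative:
    FileStatus[] statuses = fs.globStatus(new Path("/logs/2014/*/*.gz"));
    Path[] paths = FileUtil.stat2Paths(statuses);   // convenience helper to extract the Paths
    for (Path p : paths) {
      System.out.println(p);
    }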
Which method do you use to see if a file or directory exists? (FileSystem method)
public boolean exists(Path f) throws IOException
How do you create directories with FileSystem?
public boolean mkdirs(Path f) throws IOException It passes back a boolean to indicate if directories and parents were created. This isn't used often because create() will create a directory structure as well.
class and method signature for a new reducer API
public class MyNewReducer extends Reducer<K1, V1, K2, V2> {
  public void reduce(K1 key, Iterable<V1> values, Context context)
      throws IOException, InterruptedException {
    context.write(key2, value2);
  }
}
Describe the writable comparable interface.
public interface WritableComparable<T> extends Writable, Comparable<T>
smart software, dumb hardware
push smarts into the software and away from the hardware; this allows the hardware to be generic
inverted indexes should be used when
quick query responses are required
Can you modify the partitioner?
rarely ever need to
Each map task in hadoop is broken into what phases:
record reader, mapper, combiner, partitioner
Filtering does not require the ____ part of Map Reduce because
reduce ; does not produce an aggregation
if problem requires workloads with strong mandates for transnational consistency/integrity,
relational databases are still likely to be a great solution.
How would you make sure a namenode stays in Safe Mode indefinitely?
set dfs.safemode.threshold.pct > 1
How does a mapreduce program find where the jar file on the Hadoop cluster is to run?
setJarByClass(Classname.class) It will find the jar containing that class instead of requiring an explicit jar file name
How do you set the output formats to use in a mapreduce job?
setOutputKeyClass(); setOutputValueClass(); If the mapper and reducer output types are different, you may specify the mapper output types as well: setMapOutputKeyClass(); setMapOutputValueClass();
The reduce tasks are broken into what phases:
shuffle, sort, reducer, and output format
architecture did not change much.
A single architecture at any scale is not realistic. To handle data sets of 100 TB to petabytes you may apply larger versions of the same components, but the complexity of connectivity may prove prohibitive.
so scale out
spread processing onto more and more machines. Use 2 servers instead of a double-sized one.
How do you run the balancer?
start-balancer.sh -threshold specifies the threshold percentage that is deemed "balanced" (optional). Only one balancer can be run at a time.
If you store 3 separate configs for: (a) single (b) pseudo-distributed (c) distributed How do you start/stop and specify those to the daemon?
start-dfs.sh --config path/to/config/dir start-mapred.sh --config path/to/config/dir
HDFS 3...
stores files in blocks of 64 MB (vs. 4–32 KB)
Mapper- Distinct
takes each record and extracts the data fields for which we want unique values- key is the record and value is null
What does the reducer do?
takes the grouped data from the shuffle and sort and runs a reduce function once per key grouping
what is the shuffle and sort?
takes the map output files and downloads them to the local machine where the reducer is running, then sorts them by key into one larger data list
What is the partitioner?
takes the output from mapper or combiner and splits them up into shards
How does memory limit the number of files in HDFS?
The NameNode holds filesystem metadata in memory, so the limit on the number of files in a filesystem is governed by the amount of memory on the NameNode. Rule of thumb: each file, directory, and block takes about 150 bytes.
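A rough worked example under that rule of thumb (an estimate, not an exact figure): one million files, each occupying a single block, is about two million in-memory objects, i.e. roughly 2,000,000 × 150 bytes ≈ 300 MB of NameNode memory.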
^
the beginning of a line
$
the end of a line
fair-scheduler.xml
the file used to specify the resource pools and settings for the fair scheduler plugin for MapReduce
What is the value?
the information pertinent to the analysis in reducer
capacity-scheduler.xml
the name of the file used to specify queues and settings for the capacity scheduler
What is a record
the portion of an input split for which the map function is called (e.g. a line in a file)
Hadoop runs tasks _____________ to isolate them from other running tasks
in their own Java virtual machine
dfs.namenode.rpc-address.nameservice-id.namenode-id
This parameter specifies the colon-separated hostname and port on which namenode-id should provide the namenode RPC service for nameservice-id.
dfs.ha.fencing.methods
this property specifies a newline-separated list of fencing methods
Rack Awareness
to take a node's physical location into account while scheduling tasks and allocating storage.
FIll in the blank. Hadoop lacks notion of ______________ and ______________. Therefore, the analyzed result generated by Hadoop may or may not be 100% accurate.
transaction consistency and recovery checkpoint
Generation of large data sets, e.g. large search engines and online companies. Also, the need to extract information to identify
trends and relationships to make decisions, e.g. customer habits and marketing (e.g. Google AdWords).
True or False: A good rule of thumb is to have one or more tasks than processors
true