EAP - Hadoop Chapter 5
How do you package an Oozie workflow application?
max-temp-workflow/
    lib/
        hadoop-examples.jar
    workflow.xml
(workflow.xml must be at the top level of the directory; the package can be built with Ant or Maven)
What is used for MapReduce tests?
MRUnit
How do you test a MapReduce job?
MRUnit and JUnit (the test execution framework). MRUnit's MapDriver is configured with the mapper under test, and runTest() executes the test. Similarly, MRUnit's ReduceDriver is configured with the reducer under test. (See the sketch below.)
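A minimal sketch of such a test; WordLengthMapper and MaxReducer are toy classes made up for illustration (not the book's MaxTemperature classes):

import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class MRUnitExampleTest {

  // Toy mapper: emits (line, length-of-line) for each input line.
  static class WordLengthMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
      context.write(value, new IntWritable(value.getLength()));
    }
  }

  // Toy reducer: emits the maximum value seen for each key.
  static class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws java.io.IOException, InterruptedException {
      int max = Integer.MIN_VALUE;
      for (IntWritable v : values) {
        max = Math.max(max, v.get());
      }
      context.write(key, new IntWritable(max));
    }
  }

  @Test
  public void mapperEmitsLineLength() throws Exception {
    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new WordLengthMapper())    // mapper under test
        .withInput(new LongWritable(0), new Text("hadoop"))
        .withOutput(new Text("hadoop"), new IntWritable(6))
        .runTest();                            // executes the test
  }

  @Test
  public void reducerEmitsMaximum() throws Exception {
    new ReduceDriver<Text, IntWritable, Text, IntWritable>()
        .withReducer(new MaxReducer())         // reducer under test
        .withInput(new Text("hadoop"), Arrays.asList(new IntWritable(3), new IntWritable(6)))
        .withOutput(new Text("hadoop"), new IntWritable(6))
        .runTest();
  }
}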
What does "hadoop fs -getmerge max-temp max-temp-local" do?
Gets all the files in the specified HDFS directory (max-temp) and merges them into a single file (max-temp-local) on the local filesystem.
What does Hadoop use for Configuration?
An instance of the Configuration class from the org.apache.hadoop.conf package. Configurations read their properties from XML resource files.
All Oozie workflows must have which control nodes?
One start node: <start to="max-temp-mr"/>
One end node: <end name="end"/>
One kill node: <kill name="fail"><message>MapReduce failed error...</message></kill>
When the workflow starts, it transitions to the node named in the start element. If the workflow succeeds it transitions to the end node; if it fails it transitions to the kill node.
1. job IDs are __ based. 2. task IDs are ___ based. 3. attempt IDs are ___ based.
1. 1 2. 0 3. 0
1. Task logs are deleted after how long? 2. Where can that be configured? 3. How do you cap the size of a log file?
1. 24 hours 2. mapred.userlog.retain.hours 3. mapred.userlog.limit.kb
What are things to look for on the Tuning Checklist? (How can I make a job run faster?) 1. Number of Mappers 2. Number of Reducers 3. Combiners 4. Intermediate Compression 5. Custom Serialization 6. Shuffle Tweaks
1. A mapper should run for about a minute; if mappers run for much less, reduce their number (use larger input splits). 2. Use slightly fewer reducers than the number of reduce slots in the cluster, so the reducers finish in a single wave and the cluster is fully used. 3. Check whether a combiner can be used to reduce the amount of data going through the shuffle. 4. Job execution time can almost always benefit from enabling map output compression. 5. Provide a RawComparator if you are using your own custom Writable objects or custom comparators. 6. The shuffle exposes lots of tuning parameters for memory management. (See the sketch below for items 3 and 4.)
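A small illustrative sketch of two of those tweaks in a driver (a fragment, not a complete driver); MaxTemperatureReducer is assumed to be an existing reducer whose function is commutative and associative, so it can double as a combiner:

// Inside a driver, e.g. a Tool's run() method.
Configuration conf = getConf();
// 4. Enable intermediate (map output) compression.
conf.setBoolean("mapred.compress.map.output", true);
Job job = new Job(conf, "Max temperature");
// 3. Reuse the reducer as a combiner to cut shuffle traffic.
job.setCombinerClass(MaxTemperatureReducer.class);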
1. What is Apache Oozie? 2. What are its two main parts? 3. What is the difference between Oozie and JobControl? 4. What do action nodes do? control nodes? 5. What are two possible types of callbacks?
1. A system for running workflows of dependent jobs. 2. (a) Workflow engine - stores and runs workflows composed of different types of Hadoop jobs (MapReduce, Pig, Hive). (b) Coordinator engine - runs workflow jobs based on predefined schedules and data availability. 3. JobControl runs on the client machine that submits the jobs; Oozie runs as a service in the cluster, and clients submit workflow definitions for immediate or later execution. 4. (a) Action nodes perform a workflow task such as moving files in HDFS, running MapReduce, Streaming or Pig jobs, Sqoop imports, shell scripts, or Java programs. (b) Control nodes govern the workflow execution with conditional logic. 5. (a) On workflow completion, an HTTP callback to the client to report the workflow status. (b) A callback every time the workflow enters or exits an action node.
What are the steps in packaging a job?
1. Create a JAR file using Ant, Maven, or the command line. 2. Include application classes in the root of the JAR (or in its classes/ directory); dependent JAR files can go in its lib/ directory. 3. Set HADOOP_CLASSPATH to point at the dependent JAR files (so the client-side driver can see them).
1. What is job history? 2. Where are the files stored? 3. How long are history files kept? 4. How do you view job history via the command line?
1. Events and configuration for a completed job. 2. On the local filesystem of the jobtracker, in a history subdirectory of the logs directory (location set by hadoop.job.history.location). A second copy is written to the _logs/history subdirectory of the job's output directory (location set by hadoop.job.history.user.location). 3. The jobtracker's copy is kept for 30 days; the copy in the job's output directory is never deleted. 4. hadoop job -history
How do you give the user's classpath precedence over Hadoop's built-in libraries?
1. On the client: set the HADOOP_USER_CLASSPATH_FIRST environment variable to true. 2. For the task classpath: set mapreduce.task.classpath.first to true.
The Web UI has action links that allow you to do what? How are they enabled?
1. Kill a task attempt 2. webinterface.private.actions = true
What are common remote debugging techniques?
1. Reproduce the failure locally, possibly using a debugger such as Java's VisualVM. 2. Use JVM debugging options. For JVM out-of-memory errors, set -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps so the heap is dumped and can be examined afterward with tools such as jhat or the Eclipse Memory Analyzer. (See the sketch below.) 3. Task profiling - Hadoop provides a mechanism to profile a subset of the tasks in a job. 4. IsolationRunner - (older Hadoop versions) could re-run failed tasks in isolation.
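A sketch of passing those JVM options to the task JVMs via the child options property (mapred.child.java.opts is the old-API property name; the heap size shown is just an example):

// In the driver, before submitting the job - a fragment, not a complete driver.
Configuration conf = getConf();
conf.set("mapred.child.java.opts",
    "-Xmx512m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/dumps");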
For multiple jobs to be run, how do you run them linearly? Directed Acyclic Graph of jobs?
1. Run each job one after another, waiting until the previous one completes successfully; runJob() throws an exception on failure, so processing stops at the failed job. ex: JobClient.runJob(conf1); JobClient.runJob(conf2); 2. Use the JobControl libraries (org.apache.hadoop.mapreduce.jobcontrol). A JobControl instance represents a graph of jobs to be run: (a) indicate the jobs and their dependencies, (b) run the JobControl in a thread and it runs the jobs in dependency order, (c) you can poll progress, (d) if a job fails, JobControl won't run the jobs that depend on it, (e) you can query status after the jobs complete. (See the sketch below.)
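A hedged sketch using the new-API JobControl classes (from org.apache.hadoop.mapreduce.lib.jobcontrol); job1 and job2 are assumed to be already-configured Job objects, with job2 depending on job1:

import java.util.List;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

// Runs job2 after job1; both are assumed to be fully configured Job instances.
public static void runPipeline(Job job1, Job job2) throws Exception {
  ControlledJob cjob1 = new ControlledJob(job1.getConfiguration());
  ControlledJob cjob2 = new ControlledJob(job2.getConfiguration());
  cjob2.addDependingJob(cjob1);            // job2 runs only after job1 succeeds

  JobControl control = new JobControl("max-temp-pipeline");
  control.addJob(cjob1);
  control.addJob(cjob2);

  new Thread(control).start();             // JobControl implements Runnable
  while (!control.allFinished()) {         // poll progress
    Thread.sleep(1000);
  }
  List<ControlledJob> failed = control.getFailedJobList();
  if (!failed.isEmpty()) {
    System.err.println("Failed jobs: " + failed);   // dependents of failed jobs were not run
  }
  control.stop();
}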
True or False: 1. System properties take priority over properties defined in resource files. 2. System properties are accessible through the configuration API.
1. TRUE 2. FALSE; a system property is not accessible through the configuration API unless a configuration file redefines the property (system properties are only picked up through variable expansion).
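A short illustrative sketch of point 2, in the same assert style as the Configuration example further below (assumes no loaded resource defines a property called length):

// Setting a system property alone does not make it visible via Configuration.
System.setProperty("length", "2");
Configuration conf = new Configuration();
assertThat(conf.get("length"), is((String) null));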
What is the user's task classpath comprised of?
1. The job JAR file. 2. Any JAR files in the lib directory of the job JAR file, and its classes directory (if present). 3. Any files added to the distributed cache using the -libjars option, or the addFileToClassPath() method on DistributedCache (old API) or Job (new API). (See the sketch below.)
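A sketch of the third option via the new API (the HDFS path to the dependency JAR is made up for illustration):

// In the driver: add a dependency JAR (already in HDFS) to the task classpath.
Job job = new Job(getConf(), "Max temperature");
job.addFileToClassPath(new Path("/libs/my-dependency.jar"));   // hypothetical path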
If two configuration files set the same property, which does Hadoop use?
1. Normally, the one defined in the last resource added. 2. Exception: if the property is marked "final" in an earlier resource, that definition wins and cannot be overridden.
How could you debug a MapReduce program?
1. Use a debug statement to log to standard error, plus a message to update the task status, to alert us to look at the error log. 2. Create a custom counter to count the total number of records with implausible values in the entire dataset (valuable for seeing how common the occurrence is). 3. If the amount of debug data is large, add the debug output to the map's output for analysis and aggregation in the reducer. 4. Write a program to analyze the log files afterwards. ex: debugging in the mapper:
if (airTemp > 1000) {
  System.err.println("Temperature over 100 degrees for input: " + value);
  context.setStatus("Detected possibly corrupt record: see logs.");
  context.getCounter(Temperature.OVER_100).increment(1);
}
Reduce tasks are broken down on the jobtracker web UI. What do: 1. copy 2. sort 3. reduce refer to?
1. copy: map outputs are being transferred to the reducer's tasktracker. 2. sort: the reduce inputs are being merged. 3. reduce: the reduce function is being run to produce the final output.
What do the following Oozie components specify? 1. map-reduce action 2. mapred.input.dir / mapred.output.dir
1. The map-reduce action contains: (a) job-tracker - specifies the jobtracker to submit the job to, (b) name-node - the URI for reading and writing data, (c) prepare (optional) - runs before the MapReduce job, typically used for directory deletion (e.g. deleting the output directory before the job runs). 2. They set the FileInputFormat input paths and the FileOutputFormat output path.
How do you run an Oozie workflow job?
1. export OOZIE_URL="http://localhost:11000/oozie" (tells the oozie command which server to use)
2. oozie job -config ch05/src.../max-temp-workflow.properties -run (-run runs the workflow; -config points at a local Java properties file containing definitions for the parameters in the workflow XML)
3. oozie job -info 000000009-112....-oozie-tom-W (shows the status of the job, also available via the web UI)
If I want to keep intermediate failed or succeeded files, how can I do that? Where are the intermediate files stored?
1. failed - keep.failed.task.files = true succeeded - keep.task.files.pattern = (regex of task Ids to keep) 2. mapred.local.dir/taskTracker/jobcache/job-ID/task-attempt-ID
A mapper commonly performs three things, what are they?
1. input format parsing 2. projection (selecting relevant fields) 3. filtering (removing records that are not of interest)
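A hedged sketch of a mapper doing all three, for a made-up comma-separated input of station,year,temperature records (not the book's NCDC format):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input lines look like: "station,year,temperature" (illustrative format).
public class TemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split(",");      // 1. input format parsing
    if (fields.length != 3) {
      return;                                           // 3. filtering: drop malformed records
    }
    String year = fields[1];                            // 2. projection: keep only the fields we need
    int temperature = Integer.parseInt(fields[2].trim());
    if (temperature == 9999) {
      return;                                           // 3. filtering: drop missing readings (sentinel value)
    }
    context.write(new Text(year), new IntWritable(temperature));
  }
}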
1. How do you set up the local job runner? 2. How do you set up the local job runner on MapReduce 2? 3. How many reducers are used?
1. mapred.job.tracker = local (the default) 2. mapreduce.framework.name = local 3. Zero or one.
How do you run a MapReduce job on a cluster?
1. unset HADOOP_CLASSPATH (if the job has no library dependencies beyond the job JAR) 2. hadoop jar hadoop-examples.jar v3.MaxTemperatureDriver -conf conf/hadoop-cluster.xml input/ncdc/all max-temp
What does the Hadoop Library Class ChainMapper do?
Allows you to run a chain of mappers, followed by a reducer and another chain of mappers in a single job.
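A hedged sketch using the new-API chain classes (org.apache.hadoop.mapreduce.lib.chain, available in newer Hadoop releases); AMapper, BMapper, CMapper, and MyReducer are hypothetical classes, and the key/value classes are just examples:

// In the driver (a fragment): mapper A, then mapper B, then the reducer, then mapper C, all in one job.
Configuration conf = getConf();
Job job = new Job(conf, "chain example");
ChainMapper.addMapper(job, AMapper.class,
    LongWritable.class, Text.class, Text.class, Text.class, new Configuration(false));
ChainMapper.addMapper(job, BMapper.class,
    Text.class, Text.class, Text.class, IntWritable.class, new Configuration(false));
ChainReducer.setReducer(job, MyReducer.class,
    Text.class, IntWritable.class, Text.class, IntWritable.class, new Configuration(false));
ChainReducer.addMapper(job, CMapper.class,
    Text.class, IntWritable.class, Text.class, IntWritable.class, new Configuration(false));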
How would you manually add a resource? How would you access the resources properties?
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");
assertThat(conf.get("color"), is("yellow"));
assertThat(conf.getInt("size", 0), is(10));
assertThat(conf.get("breadth", "wide"), is("wide"));
How does profiling work in Hadoop?
Hadoop allows you to profile a fraction of the tasks in a job, and as each task completes, the profile information is pulled down to your machine for later analysis. ex:
Configuration conf = getConf();
conf.setBoolean("mapred.task.profile", true);
conf.set("mapred.task.profile.params",
    "-agentlib:hprof=cpu=samples,heap=sites,depth=6,force=n,thread=y,verbose=n,file=%s");
conf.set("mapred.task.profile.maps", "0-2");
conf.set("mapred.task.profile.reduces", "");  // profile no reduce tasks
Job job = new Job(conf, "MaxTemperature");
How can you test a driver using a mini cluster?
Hadoop has a set of testing classes that allow testing against the full HDFS and MapReduce machinery: MiniDFSCluster, MiniMRCluster, and MiniYARNCluster. ClusterMapReduceTestCase is an abstract class that provides the methods needed to use a mini cluster from user code.
What does the GenericOptionsParser do?
Interprets common Hadoop command-line options and sets them on a Configuration object in your application. It is usually used indirectly, by implementing the Tool interface and running the application with ToolRunner. (See the sketch below.)
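A minimal sketch of such a driver, roughly along the lines of the chapter's ConfigurationPrinter example (here it just dumps the configuration; a real driver's run() would set up and submit a job):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ConfigurationPrinter extends Configured implements Tool {
  @Override
  public int run(String[] args) throws Exception {
    // getConf() already reflects -conf, -D, -fs, -jt, -files, -libjars, etc.
    for (java.util.Map.Entry<String, String> entry : getConf()) {
      System.out.printf("%s=%s%n", entry.getKey(), entry.getValue());
    }
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner uses GenericOptionsParser internally to parse the standard options.
    System.exit(ToolRunner.run(new ConfigurationPrinter(), args));
  }
}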
What is HPROF?
A profiling tool that comes with the JDK and can give valuable information about a program's CPU and heap usage.
What does dfs.web.ugi do?
Sets the user that the HDFS web interface runs as (which can be used to restrict which files are visible through the web interface).
How do you set a system property? How do you set them via command line?
1. System.setProperty("size", "14"); (setProperty takes String arguments) 2. -Dproperty=value
True or False: It's better to add more jobs than to add more complexity to the mapper.
TRUE
How do you list the files when running in single/pseudo-distributed/distributed mode?
hadoop fs -conf conf/hadoop-xxx.xml -ls . (where hadoop-xxx.xml is the config file for single, pseudo-distributed, or distributed mode)
How do you set your hadoop username and group?
Set hadoop.job.ugi, a comma-separated list whose first value is the user name and whose remaining values are the group names. ex: hadoop.job.ugi = test1,test2,test3 (test1 = user; test2 and test3 = groups)
If you store 3 separate configs for: (a) single (b) pseudo-distributed (c) distributed How do you start/stop and specify those to the daemon?
start-dfs.sh --config path/to/config/dir start-mapred.sh --config path/to/config/dir