Pig in practice
Which of the following scripts is used to find scripts that have failed jobs?
a) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = foreach a generate (Chararray) j#'STATUS' as status, j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, j#'JOBID' as job;
   c = filter b by status != 'SUCCESS';
   dump c;
b) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
   c = group b by (id, user, script_name) parallel 10;
   d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces;
   e = filter d by max_reduces == 1;
   dump e;
c) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'QUEUE_NAME' as queue;
   c = group b by (id, user, queue) parallel 10;
   d = foreach c generate group.user, group.queue, COUNT(b);
   dump d;
d) None of the mentioned
a) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = foreach a generate (Chararray) j#'STATUS' as status, j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, j#'JOBID' as job;
   c = filter b by status != 'SUCCESS';
   dump c;
Pig provides the ability to register a listener to receive event notifications during the execution of a script.
Which of the following scripts finds the running time of each script (in seconds)?
a) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end;
   c = group b by (id, user, script_name);
   d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000;
   dump d;
b) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = for a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end;
   c = group b by (id, user, script_name);
   d = for c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000;
   dump d;
c) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'QUEUE_NAME' as queue;
   c = group b by (id, user, queue) parallel 10;
   d = foreach c generate group.user, group.queue, COUNT(b);
   dump d;
d) All of the mentioned
a) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end;
   c = group b by (id, user, script_name);
   d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start))/1000;
   dump d;
The HadoopJobHistoryLoader in Piggybank loads Hadoop job history files and job XML files from the file system.
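To make the arithmetic concrete, the following is a plain-Python sketch of what the running-time script computes, not Pig code. It assumes the intended expression is (MAX(b.end) - MIN(b.start))/1000 and that SUBMIT_TIME and FINISH_TIME are epoch milliseconds, as the division by 1000 suggests. The sample records and the function name `running_times` are hypothetical.

```python
from collections import defaultdict

# Hypothetical job records standing in for the job map (j) of each tuple.
# Times are assumed to be epoch milliseconds.
jobs = [
    {"PIG_SCRIPT_ID": "s1", "USER": "alice", "JOBNAME": "etl.pig",
     "SUBMIT_TIME": 1_000_000, "FINISH_TIME": 1_030_000},
    {"PIG_SCRIPT_ID": "s1", "USER": "alice", "JOBNAME": "etl.pig",
     "SUBMIT_TIME": 1_030_000, "FINISH_TIME": 1_090_000},
    {"PIG_SCRIPT_ID": "s2", "USER": "bob", "JOBNAME": "report.pig",
     "SUBMIT_TIME": 2_000_000, "FINISH_TIME": 2_015_000},
]


def running_times(records):
    """Group jobs by (id, user, script_name) and return (user, script_name, seconds)."""
    groups = defaultdict(list)
    for job in records:
        key = (job["PIG_SCRIPT_ID"], job["USER"], job["JOBNAME"])
        groups[key].append(job)

    result = []
    for (_, user, name), rows in groups.items():
        start = min(row["SUBMIT_TIME"] for row in rows)   # MIN(b.start)
        end = max(row["FINISH_TIME"] for row in rows)     # MAX(b.end)
        # Subtract first, then divide: the parentheses matter, because
        # division binds tighter than subtraction.
        result.append((user, name, (end - start) // 1000))
    return result


print(running_times(jobs))
```

A script's running time here spans from the earliest submit time to the latest finish time across all of its jobs, which is why the grouping step comes before the subtraction.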
Point out the wrong statement:
a) Pig can invoke code in Java only.
b) Pig enables data workers to write complex data transformations without knowing Java.
c) Pig's simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL.
d) Pig is complete, so you can do all required data manipulations in Apache Hadoop with Pig.
a) Pig can invoke code in Java only.
Through the User Defined Functions (UDF) facility, Pig can invoke code in many languages, such as JRuby, Jython, and Java.
Pig Latin is _______ and fits very naturally in the pipeline paradigm, while SQL is instead declarative.
a) functional
b) procedural
c) declarative
d) all of the mentioned
b) procedural
In SQL, users can specify that data from two tables must be joined, but not which join implementation to use.
Which of the following scripts determines the number of scripts run by user and queue on a cluster?
a) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = foreach a generate (Chararray) j#'STATUS' as status, j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, j#'JOBID' as job;
   c = filter b by status != 'SUCCESS';
   dump c;
b) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
   c = group b by (id, user, script_name) parallel 10;
   d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces;
   e = filter d by max_reduces == 1;
   dump e;
c) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'QUEUE_NAME' as queue;
   c = group b by (id, user, queue) parallel 10;
   d = foreach c generate group.user, group.queue, COUNT(b);
   dump d;
d) None of the mentioned
c) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'QUEUE_NAME' as queue;
   c = group b by (id, user, queue) parallel 10;
   d = foreach c generate group.user, group.queue, COUNT(b);
   dump d;
EmbeddedPigStats contains a map of SimplePigStats or TezPigScriptStats, which captures the Pig job launched in the embedded script.
Which of the following scripts finds scripts that generate more than three MapReduce jobs?
a) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = group a by (j#'PIG_SCRIPT_ID', j#'USER', j#'JOBNAME');
   c = for b generate group.$1, group.$2, COUNT(a);
   d = filter c by $2 > 3;
   dump d;
b) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = display a by (j#'PIG_SCRIPT_ID', j#'USER', j#'JOBNAME');
   c = foreach b generate group.$1, group.$2, COUNT(a);
   d = filter c by $2 > 3;
   dump d;
c) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = group a by (j#'PIG_SCRIPT_ID', j#'USER', j#'JOBNAME');
   c = foreach b generate group.$1, group.$2, COUNT(a);
   d = filter c by $2 > 3;
   dump d;
d) None of the mentioned
c) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = group a by (j#'PIG_SCRIPT_ID', j#'USER', j#'JOBNAME');
   c = foreach b generate group.$1, group.$2, COUNT(a);
   d = filter c by $2 > 3;
   dump d;
For each MapReduce job, the loader produces a tuple with schema (j:map[], m:map[], r:map[]).
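As a rough illustration of the (j, m, r) tuple shape and of the group, count, and filter steps, here is a plain-Python emulation rather than Pig code. The sample records, the empty m and r maps, and the name `scripts_with_many_jobs` are hypothetical; only the three map keys come from the script above.

```python
from collections import Counter


def make_job(script_id, user, name):
    """Build one (j, m, r) tuple; the m and r maps are left empty for brevity."""
    j = {"PIG_SCRIPT_ID": script_id, "USER": user, "JOBNAME": name}
    return (j, {}, {})


# One tuple per MapReduce job: four jobs for script s1, two for script s2.
records = (
    [make_job("s1", "alice", "etl.pig") for _ in range(4)]
    + [make_job("s2", "bob", "report.pig") for _ in range(2)]
)


def scripts_with_many_jobs(rows, threshold=3):
    """Return (user, script_name, job_count) for scripts with more than `threshold` jobs."""
    # group a by (PIG_SCRIPT_ID, USER, JOBNAME) and COUNT(a)
    counts = Counter(
        (j["PIG_SCRIPT_ID"], j["USER"], j["JOBNAME"]) for j, _m, _r in rows
    )
    # group.$1 is the user, group.$2 is the job name; keep counts above the threshold
    return [
        (user, name, n) for (_id, user, name), n in counts.items() if n > threshold
    ]


print(scripts_with_many_jobs(records))
```

Because each tuple corresponds to one MapReduce job, counting tuples per script ID gives the number of jobs that script launched.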
Which of the following scripts is used to find scripts that use only the default parallelism?
a) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = foreach a generate (Chararray) j#'STATUS' as status, j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, j#'JOBID' as job;
   c = filter b by status != 'SUCCESS';
   dump c;
b) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
   c = group b by (id, user, script_name) parallel 10;
   d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces;
   e = filter d by max_reduces == 1;
   dump e;
c) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'QUEUE_NAME' as queue;
   c = group b by (id, user, queue) parallel 10;
   d = foreach c generate group.user, group.queue, COUNT(b);
   dump d;
d) None of the mentioned
b) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]);
   b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces;
   c = group b by (id, user, script_name) parallel 10;
   d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces;
   e = filter d by max_reduces == 1;
   dump e;
The first map in the schema contains job-related entries.
Which of the following is an entry in jobconf?
a) pig.job
b) pig.input.dirs
c) pig.feature
d) none of the mentioned
b) pig.input.dirs
pig.input.dirs contains a comma-separated list of input directories for the job.
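Since the value is a comma-separated string, a consumer typically has to split it before use. The following is a minimal Python sketch of that step; the `jobconf` dictionary, its paths, and the helper name `parse_input_dirs` are hypothetical and are not part of Pig.

```python
def parse_input_dirs(jobconf):
    """Split the comma-separated pig.input.dirs value into a list of paths."""
    raw = jobconf.get("pig.input.dirs", "")
    # Drop surrounding whitespace and ignore empty segments.
    return [path.strip() for path in raw.split(",") if path.strip()]


# Hypothetical jobconf entries; the paths are illustrative only.
jobconf = {"pig.input.dirs": "/data/logs/2013-01,/data/users"}

print(parse_input_dirs(jobconf))
```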
Point out the correct statement:
a) LoadPredicatePushdown is the same as LoadMetadata.setPartitionFilter
b) getOutputFormat() is called by Pig to get the InputFormat used by the loader
c) Pig works with data from many sources
d) None of the mentioned
c) Pig works with data from many sources
Pig works with data from many sources, including structured and unstructured data, and stores the results in the Hadoop Distributed File System (HDFS).
In comparison to SQL, Pig:
a) uses lazy evaluation
b) uses ETL
c) supports pipeline splits
d) All of the mentioned
d) All of the mentioned
Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development.
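The ideas of lazy evaluation, user code in the middle of a pipeline, and pipeline splits can be loosely illustrated with Python generators. This is only an analogy under stated assumptions, not how Pig is implemented: the `load` and `to_upper` functions, the `reads` counter, and the sample rows are all hypothetical.

```python
reads = 0  # counts how many source rows have actually been read


def load(rows):
    """Yield rows one at a time, recording each read."""
    global reads
    for row in rows:
        reads += 1
        yield row


def to_upper(rows):
    """Stand-in for user code inserted mid-pipeline (a UDF, in Pig terms)."""
    for user, queue in rows:
        yield user.upper(), queue


source = [("alice", "default"), ("bob", "etl"), ("carol", "default")]

# Defining the pipeline does no work yet: generators are lazy.
pipeline = to_upper(load(source))
reads_before = reads

# Consuming the pipeline (comparable to dump or store) triggers execution.
result = list(pipeline)

# A pipeline split: one stream feeds two separate outputs.
default_queue = [row for row in result if row[1] == "default"]
other_queues = [row for row in result if row[1] != "default"]

print(reads_before, reads)
print(default_queue, other_queues)
```

In the sketch, no rows are read until the pipeline is consumed, which mirrors the way a Pig script is only executed once an output statement such as dump or store is reached.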