Pig in practice

Réussis tes devoirs et examens dès maintenant avec Quizwiz!

Which of the following script is used to check scripts that have failed jobs? a) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate (Chararray) j#'STATUS' as status, j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, j#'JOBID' as job; c = filter b by status != 'SUCCESS'; dump c; b) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces; c = group b by (id, user, script_name) parallel 10; d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces; e = filter d by max_reduces == 1; dump e; c) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'QUEUE_NAME' as queue; c = group b by (id, user, queue) parallel 10; d = foreach c generate group.user, group.queue, COUNT(b); dump d; d) None of the mentioned

a) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate (Chararray) j#'STATUS' as status, j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, j#'JOBID' as job; c = filter b by status != 'SUCCESS'; dump c; Pig provides the ability to register a listener to receive event notifications during the execution of a script.

Which of the following find the running time of each script (in seconds)? a) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end; c = group b by (id, user, script_name) d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start)/1000; dump d; b) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = for a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end; c = group b by (id, user, script_name) d = for c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start)/1000; dump d; c) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'QUEUE_NAME' as queue; c = group b by (id, user, queue) parallel 10; d = foreach c generate group.user, group.queue, COUNT(b); dump d; d) All of the mentioned

a) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) j#'SUBMIT_TIME' as start, (Long) j#'FINISH_TIME' as end; c = group b by (id, user, script_name) d = foreach c generate group.user, group.script_name, (MAX(b.end) - MIN(b.start)/1000; dump d; The HadoopJobHistoryLoader in Piggybank loads Hadoop job history files and job xml files from file system.

Point out the wrong statement : a) Pig can invoke code in language like Java Only. b) Pig enables data workers to write complex data transformations without knowing Java. c) Pig's simple SQL-like scripting language is called Pig Latin, and appeals to developers already familiar with scripting languages and SQL. d) Pig is complete, so you can do all required data manipulations in Apache Hadoop with Pig.

a) Pig can invoke code in language like Java Only. Through the User Defined Functions(UDF) facility in Pig, Pig can invoke code in many languages like JRuby, Jython and Java

Pig Latin is _______ and fits very naturally in the pipeline paradigm while SQL is instead declarative. a) functional b) procedural c) declarative d) all of the mentioned

b) procedural In SQL users can specify that data from two tables must be joined, but not what join implementation to use

Which of the following script determines the number of scripts run by user and queue on a cluster: a) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate (Chararray) j#'STATUS' as status, j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, j#'JOBID' as job; c = filter b by status != 'SUCCESS'; dump c; b) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces; c = group b by (id, user, script_name) parallel 10; d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces; e = filter d by max_reduces == 1; dump e; c) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'QUEUE_NAME' as queue; c = group b by (id, user, queue) parallel 10; d = foreach c generate group.user, group.queue, COUNT(b); dump d; d) None of the mentioned

c) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'QUEUE_NAME' as queue; c = group b by (id, user, queue) parallel 10; d = foreach c generate group.user, group.queue, COUNT(b); dump d; EmbeddedPigStats contains a map of SimplePigStats or TezPigScriptStats, which captures the Pig job launched in the embedded script

Which of the following scripts that generate more than three MapReduce jobs? a) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = group a by (j#'PIG_SCRIPT_ID', j#'USER', j#'JOBNAME'); c = for b generate group.$1, group.$2, COUNT(a); d = filter c by $2 > 3; dump d; b) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = display a by (j#'PIG_SCRIPT_ID', j#'USER', j#'JOBNAME'); c = foreach b generate group.$1, group.$2, COUNT(a); d = filter c by $2 > 3; dump d; c) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = group a by (j#'PIG_SCRIPT_ID', j#'USER', j#'JOBNAME'); c = foreach b generate group.$1, group.$2, COUNT(a); d = filter c by $2 > 3; dump d; d) None of the mentioned

c) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = group a by (j#'PIG_SCRIPT_ID', j#'USER', j#'JOBNAME'); c = foreach b generate group.$1, group.$2, COUNT(a); d = filter c by $2 > 3; dump d; For each MapReduce job, the loader produces a tuple with schema (j:map[], m:map[], r:map[]).

Which of the following code is used to find scripts that use only the default parallelism? a) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate (Chararray) j#'STATUS' as status, j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, j#'JOBID' as job; c = filter b by status != 'SUCCESS'; dump c; b) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces; c = group b by (id, user, script_name) parallel 10; d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces; e = filter d by max_reduces == 1; dump e; c) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'QUEUE_NAME' as queue; c = group b by (id, user, queue) parallel 10; d = foreach c generate group.user, group.queue, COUNT(b); dump d; d) None of the mentioned

b) a = load '/mapred/history/done' using HadoopJobHistoryLoader() as (j:map[], m:map[], r:map[]); b = foreach a generate j#'PIG_SCRIPT_ID' as id, j#'USER' as user, j#'JOBNAME' as script_name, (Long) r#'NUMBER_REDUCES' as reduces; c = group b by (id, user, script_name) parallel 10; d = foreach c generate group.user, group.script_name, MAX(b.reduces) as max_reduces; e = filter d by max_reduces == 1; dump e; The first map in the schema contains job-related entries.

Which of the following is an entry in jobconf? a) pig.job b) pig.input.dirs c) pig.feature d) none of the mentioned

b) pig.input.dirs pig.input.dirs contains comma-separated list of input directories for the job

Point out the correct statement: a) LoadPredicatePushdown is same as LoadMetadata.setPartitionFilter b) getOutputFormat() is called by Pig to get the InputFormat used by the loader c) Pig works with data from many sources d) None of the mentioned

c) Pig works with data from many sources Data sources includes structured and unstructured data, and store the results into the Hadoop Data File System.

In comparison to SQL, Pig uses: a) Lazy evaluation b) ETL c) Supports pipeline splits d) All of the mentioned

d) All of the mentioned Pig Latin ability to include user code at any point in the pipeline is useful for pipeline development


Ensembles d'études connexes

Organizational Behavior Midterm Exam: Chapter 2

View Set

Online Health · First Aid · Study Guide

View Set

Chapter 67: Care of Patients with Kidney Disorders

View Set

MEDSURG1 Chapter 26: Assessment and Management of Patients with Vascular Disorders and Disorders of Peripheral Circulation PREPU

View Set

Midterm final contemporary topics CJ

View Set

Solving trigonometric equations assignment

View Set

Chapter 43: Thyroid and Parathyroid Disorders

View Set

Are E-Sports Real Sports? Vocabulary for the YES Opinion

View Set