CISD42 MIDTERM
Each database created in Hive is stored as A file A directory A HDFS block A jar file
A directory
The tables created in Hive are stored as A file under the database directory A subdirectory under the database directory A .java file present in the database directory A HDFS block containing the database directory
A subdirectory under the database directory.
Which of the following are data types in Hive? ARRAY STRUCT MAP All the above
All of the above
Q.3 How do you change a column's data type in Hive? ALTER and CHANGE ALTER CHANGE
ALTER and CHANGE
What is Hive used as? Hadoop query engine MapReduce wrapper Hadoop SQL interface All of the above
All of the above
Which of the following are the key components of Hive Architecture? User Interface Metastore Driver All of the above
All of the above
Which of the following are commonly used Hive services? Command Line Interface (cli) Hive Web Interface (hwi) HiveServer (hiveserver) All of the above
All of the above
The parameter used to identify the individual row in HBase while importing data to it using sqoop is A - --hbase-row-key B - --hbase-rowkey C - --hbase-rowid D - --hbase-row-id
Answer: A *--hbase-row-key* Explanation: The parameter --hbase-row-key is used in Sqoop to identify each row in the HBase table.
The temporary location to which sqoop moves the data before loading into hive is specified by the parameter A - --target-dir B - --source-dir C - --hive-dir D - --sqoop-dir
Answer: A *--target-dir* Explanation: The --target-dir parameter specifies the directory used for temporarily staging the data before loading it into the Hive table.
Although the Hadoop framework is implemented in Java, MapReduce applications need not be written in ____________ a) Java b) C c) C# d) None of the mentioned
Answer: A *Java* Explanation: Hadoop Pipes is a SWIG-compatible C++ API for implementing MapReduce applications (not JNI based).
___________ is a general-purpose computing model and runtime system for distributed data analytics. a) Mapreduce b) Drill c) Oozie d) None of the mentioned
Answer: A *Mapreduce* Explanation: Mapreduce provides a flexible and scalable foundation for analytics, from traditional reporting to leading-edge machine learning algorithms.
Which of the following function will return the size of string? A. length() B. size()
Answer: A. length()
Which of the following function will remove duplicates? A. COLLECT_SET() B. COLLECT_LIST()
Answer: A. COLLECT_SET()
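The difference between the two aggregates is easy to see in a short Python sketch (the function names here are illustrative stand-ins, not Hive APIs; note that Hive's COLLECT_SET does not guarantee element order):

```python
def collect_list(values):
    # COLLECT_LIST: keeps every value, including duplicates
    return list(values)

def collect_set(values):
    # COLLECT_SET: drops duplicates (Hive does not guarantee order)
    out = []
    for v in values:
        if v not in out:
            out.append(v)
    return out

print(collect_list(["a", "b", "a"]))  # ['a', 'b', 'a']
print(collect_set(["a", "b", "a"]))   # ['a', 'b']
```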
How would you delete the data of hive table without deleting the table? A. truncate <table_name>; B. drop <table_name>; C. disable <table_name>; D. remove <table_name>;
Answer: A. truncate <table_name>;
___________ part of the MapReduce is responsible for processing one or more chunks of data and producing the output results. a) Maptask b) Mapper c) Task execution d) All of the mentioned
Answer: A. *Maptask* Explanation: *Map Task* in MapReduce is performed using the Map() function.
How do we decide the order of columns in which data is loaded to the target table? A - By using the --order-by parameter B - By using a new mapreduce job after submitting sqoop export command C - By using a database stored procedure D - By using --columns parameter with comma separated column names in the required order.
Answer: D By using --columns parameter with comma separated column names in the required order. Explanation: We can use the --columns parameter to specify the required columns in the required order.
Q 13 - If the database contains some tables then it can be forced to drop without dropping the tables by using the keyword A - RESTRICT B - OVERWRITE C - F DROP D - CASCADE
Answer: D CASCADE
Q 19 - The difference between the MAP and STRUCT data type in Hive is A - MAP is a key-value pair but STRUCT is a series of values B - There cannot be more than one MAP data type column in a table but more than one STRUCT data type column in a table is allowed. C - The keys in MAP cannot be integers but in STRUCT they can be. D - Only one pair of data types is allowed in the key-value pair of MAP while mixed types are allowed in STRUCT.
Answer: D Only one pair of data types is allowed in the key-value pair of MAP while mixed types are allowed in STRUCT.
Q 8 - In Hive, when the schema does not match the file content A - It cannot read the file B - It reads only the string data type C - It throws an error and stops reading the file D - It returns null values for mismatched fields.
Answer: D It returns null values for mismatched fields.
Point out the wrong statement. a) A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner b) The MapReduce framework operates exclusively on <key, value> pairs c) Applications typically implement the Mapper and Reducer interfaces to provide the map and reduce methods d) None of the mentioned
Answer: D *None of the mentioned* Explanation: The MapReduce framework takes care of scheduling tasks, monitoring them and re-executing the failed tasks.
_______ jobs are optimized for scalability but not latency. (Latency = the delay before a transfer of data begins following an instruction for its transfer.) a) Mapreduce b) Drill c) Oozie d) None of the mentioned
Answer: A *Mapreduce* Explanation: Hive queries are translated to MapReduce jobs to exploit the scalability of MapReduce. (Scalability = the capability of a computing process to handle a growing range of workloads.)
You can avoid importing tables one-by-one when importing a large number of tables from a database by using sqoop import-all-tables. Which of the following arguments does it require? A. --connect B. --username C. --password D. All of these
Answer: D. All of these (--connect, --username, --password)
_________ tool can list all the available database schemas. A. Sqoop-list-tables B. sqoop-list-databases C. sqoop-list-schema D. sqoop-list-columns
Answer: Option B *sqoop-list-databases*
Which of the following scenario may not be a good fit for HDFS? a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file b) HDFS is suitable for storing data related to applications requiring low latency data access c) HDFS is suitable for storing data related to applications requiring low latency data access d) None of the mentioned
Answer: a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
HDFS works in a __________ fashion. a) master-worker b) master-slave c) worker/slave d) all of the mentioned
Answer: a) master-worker
_________ is the base class for all implementations of InputFormat that use files as their data source. a) FileTextFormat b) FileInputFormat c) FileOutputFormat d) None of the mentioned
Answer: b *FileInputFormat* FileInputFormat provides implementation for generating splits for the input files.
____________is a utility which allows users to create and run jobs with any executables as the mapper and/or the reducer. a) Hadoop Strdata b) Hadoop Streaming c) Hadoop Stream d) None of the mentioned
Answer: b *Hadoop Streaming* Explanation: *Hadoop streaming* is one of the most important utilities in the Apache Hadoop distribution.
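Hadoop Streaming lets the mapper and reducer be any executable that reads lines from stdin and writes lines to stdout. A minimal word-count sketch in Python (the function names and sample input are illustrative; real streaming scripts read sys.stdin and are passed with -mapper/-reducer to the streaming jar):

```python
def mapper(lines):
    # emit (word, 1) pairs as tab-separated lines, the streaming convention
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_pairs):
    # the framework sorts by key, so equal words arrive adjacent
    current, count = None, 0
    for pair in sorted_pairs:
        word, n = pair.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{count}"
            current, count = word, 0
        count += int(n)
    if current is not None:
        yield f"{current}\t{count}"

# the shuffle/sort between map and reduce is simulated with sorted()
pairs = sorted(mapper(["hello world", "hello hadoop"]))
print(list(reducer(pairs)))
```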
In _____________ the default job is similar, but not identical, to the Java equivalent. a) Mapreduce b) Streaming c) Orchestration d) All of the mentioned
Answer: b *Streaming* Explanation: In Streaming, the default job is similar, but not identical, to the Java equivalent.
10. _________ is the default Partitioner for partitioning key space. a) HashPar b) Partitioner c) HashPartitioner d) None of the mentioned
Answer: c *HashPartitioner* The default partitioner in Hadoop is the *HashPartitioner* which has a method called getPartition to partition.
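The partitioning rule can be sketched in Python. Hadoop's HashPartitioner computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; here Python's hash() stands in for Java's hashCode(), so the exact partition numbers differ, but the behavior — the same key always lands in the same partition — is the point:

```python
def hash_partition(key, num_reduce_tasks):
    # mirrors HashPartitioner.getPartition():
    #   (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
    # masking with 0x7FFFFFFF keeps the result non-negative
    return (hash(key) & 0x7FFFFFFF) % num_reduce_tasks

# all records sharing a key are routed to the same reducer
p = hash_partition("hive", 4)
assert p == hash_partition("hive", 4)
assert 0 <= p < 4
```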
______________ is another implementation of the MapRunnable interface that runs mappers concurrently in a configurable number of threads. a) MultithreadedRunner b) MultithreadedMap c) MultithreadedMapRunner d) SinglethreadedMapRunner
Answer: c *MultithreadedMapRunner* A RecordReader is little more than an iterator over records, and the map task uses one to generate record key-value pairs, which it passes to the map function.
________ NameNode is used when the Primary NameNode goes down. a) Rack b) Data c) Secondary d) None of the mentioned
Answer: c) Secondary
Q 6 - A table contains 4 columns (C1,C2,C3,C4). With --update-key C2,C4, the sqoop generated query will be like A - Update table set C1 = 'newval', c3 = 'newval' where c2 = 'oldval' and c4 = 'oldval' B - Update table set C2 = 'newval', c4 = 'newval' where c2 = 'oldval' and c4 = 'oldval' C - Update table set C1 = 'newval', c2 = 'newval', c3 = 'newval', c4 = 'newval' where c2 = 'oldval' and c4 = 'oldval' D - None
Answer: A - Update table set C1 = 'newval', c3 = 'newval' where c2 = 'oldval' and c4 = 'oldval' Explanation: Only the columns other than those in the --update-key parameter will appear in the SET clause.
Point out the correct statement. a) MapReduce tries to place the data and the compute as close as possible b) Map Task in MapReduce is performed using the Mapper() function c) Reduce Task in MapReduce is performed using the Map() function d) All of the mentioned
Answer: A. MapReduce tries to place the data and the compute as close as possible Explanation: This feature of MapReduce is *Data Locality*. Data Locality = the ability to move the computation close to where the actual data resides on the node, instead of moving large data to the computation.
The tool that populates a Hive metastore with a definition for a table based on a database table previously imported to HDFS is A. create-hive-table B. import-hive-metastore C. create-hive-metastore D. update-hive-metastore
Answer: A *create-hive-table*
What can Hive not offer? A - storing data in tables and columns B - Online transaction processing C - Handling date time data D - Partitioning stored data
Answer: B Online transaction processing
Q 24 - The main advantage of creating table partition is A - Effective storage memory utilization B - faster query performance C - Less RAM required by namenode D - simpler query syntax
Answer: B faster query performance
Q 25 - To see the partitions present in a Hive table the command used is A - Describe B - show C - describe extended D - show extended
Answer: B show
Q 22 - The partitioning of a table in Hive creates more A - subdirectories under the database name B - subdirectories under the table name C - files under databse name D - files under the table name
Answer: B subdirectories under the table name
Q 9 - The query "SHOW DATABASES LIKE 'h.*';" gives the output with database name A - containing h in their name B - starting with h C - ending with h D - containing 'h.'
Answer: B starting with h
Q 12 - By default when a database is dropped in Hive A - the tables are also deleted B - the directory is deleted if there are no tables C - the hdfs blocks are formatted D - Only the comments associated with database is deleted
Answer: B the directory is deleted if there are no tables
What is achieved by the command - sqoop job -exec myjob A - Sqoop job named myjob is saved to sqoop metastore B - Sqoop job named myjob starts running C - Sqoop job named myjob is scheduled D - Sqoop job named myjob gets created
Answer: B - Sqoop job named myjob starts running Explanation: This is the command to execute a Sqoop job already saved in the metastore.
What will be the output of CONCAT_WS('|','hey','coder','how','are','you')? A. 'hey,coder,how,are,you' B. 'hey|coder|how|are|you' C. 'heycoderhowareyou' D. 'hey coder how are you'
Answer: B. 'hey|coder|how|are|you'
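CONCAT_WS ("concat with separator") joins its remaining arguments with the first argument, skipping NULLs. A Python sketch of the semantics (the function name is an illustrative stand-in, not a Hive API):

```python
def concat_ws(sep, *args):
    # Hive's CONCAT_WS joins arguments with the separator,
    # skipping NULL (None) arguments rather than failing on them
    return sep.join(a for a in args if a is not None)

print(concat_ws('|', 'hey', 'coder', 'how', 'are', 'you'))  # hey|coder|how|are|you
print(concat_ws('-', 'a', None, 'b'))                       # a-b
```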
________ text is appropriate for most non-binary data types. A. Character B. Binary C. Delimited D. None of the mentioned
Answer: C *Delimited*
The fields parsed by ____________ are backed by an internal buffer. A. LargeObjectLoader B. ProcessingException C. RecordParser D. None of the Men
Answer: C *RecordParser*
Q 14 - Using the ALTER DATABASE command on a database you can change the A - database name B - database creation time C - dbproperties D - directory where the database is stored
Answer: C dbproperties
The following tool imports a set of tables from an RDBMS to HDFS A. export-all-tables B. import-all-tables C. import-tables D. none of the mentioned
Answer: B *import-all-tables*
Q 3 - The results of a hive query can be stored as A - local file B - hdfs file C - both D - can not be stored
Answer: C - both (a local file and an HDFS file)
Which of the following will cast a column "a" having value 3.2 to 3 ? A. CAST(a as Float) B. a.to_int C. CAST(a as INT) D. INT(a)
Answer: C. CAST(a as INT)
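CAST to INT in Hive truncates the fractional part toward zero rather than rounding, which is why 3.2 becomes 3. The behavior can be sketched in Python (the helper name is illustrative):

```python
import math

def cast_as_int(x):
    # Hive's CAST(x AS INT) truncates toward zero; it does not round
    return math.trunc(x)

print(cast_as_int(3.2))   # 3
print(cast_as_int(3.9))   # 3 (not 4 -- truncation, not rounding)
print(cast_as_int(-3.2))  # -3
```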
What is the extension of hive query file? A. .txt B. .sql C. .hql D. .hive
Answer: C. .hql
You have one column in a Hive table named "my_ts" having datatype string and a sample value like "2018-02-24 17:22:35". How would you extract only the day from it, i.e. 24? A. get_date(myts) B. extract(myts) C. day(myts) D. not possible directly
Answer: C. day(myts)
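Hive's day() (alias dayofmonth()) reads the date portion of a 'yyyy-MM-dd HH:mm:ss' string and returns the day number. An equivalent sketch in Python (the helper name is illustrative):

```python
from datetime import datetime

def day_of(ts):
    # parse only the leading 'yyyy-MM-dd' part, as Hive's day() does
    return datetime.strptime(ts[:10], "%Y-%m-%d").day

print(day_of("2018-02-24 17:22:35"))  # 24
```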
Which of the following method will remove the spaces from both the ends of " bigdata "? A. substring() B. whitespace() C. trim() D. remove()
Answer: C. trim()
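Hive's trim() removes spaces from both ends of a string while leaving interior characters intact. In Python terms it corresponds to str.strip(' ') (the helper name below is illustrative):

```python
def hive_trim(s):
    # Hive's trim() strips space characters from both ends only;
    # interior spaces are untouched
    return s.strip(' ')

print(repr(hive_trim("  bigdata  ")))   # 'bigdata'
print(repr(hive_trim(" big data ")))    # 'big data'
```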
Mention the best features of Apache Sqoop. A. Parallel import/export B. Connectors for all major RDBMS Databases C. Import results of SQL query D. All of these
Answer: D. All of these
Q 17 - On dropping a managed table A - The schema gets dropped without dropping the data B - The data gets dropped without dropping the schema C - An error is thrown D - Both the schema and the data is dropped
Answer: D Both the schema and the data is dropped
Running a ___________ program involves running mapping tasks on many or all of the nodes in our cluster. a) MapReduce b) Map c) Reducer d) All of the mentioned
Answer: a *MapReduce* In some applications, component tasks need to create and/or write to side-files, which differ from the actual job-output files.
__________ maps input key/value pairs to a set of intermediate key/value pairs. a) Mapper b) Reducer c) Both Mapper and Reducer d) None of the mentioned
Answer: a *Mapper* *Mapper* maps are the individual tasks that transform input records into intermediate
Point out the correct statement. a) The reduce input must have the same types as the map output, although the reduce output types may be different again b) The map input key and value types (K1 and V1) are different from the map output types c) The partition function operates on the intermediate key d) All of the mentioned
Answer: d *All of the mentioned* In practice, the partition is determined solely by the key (the value is ignored).
An ___________ is responsible for creating the input splits, and dividing them into records. a) TextOutputFormat b) TextInputFormat c) OutputInputFormat d) InputFormat
Answer: d *InputFormat* As a MapReduce application writer, you don't need to deal with InputSplits directly, as they are created by an InputFormat.
Point out the wrong statement. a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file level b) Block Report from each DataNode contains a list of all the blocks that are stored on that DataNode c) User data is stored on the local file system of DataNodes d) DataNode is aware of the files to which the blocks stored on it belong to
Answer: d) DataNode is aware of the files to which the blocks stored on it belong to
Managed tables in Hive Can load the data only from HDFS Can load the data only from local file system Are useful for enterprise wide data Are Managed by Hive for their data and metadata
Are managed by Hive for their data and metadata
A ________ serves as the master and there is only one NameNode per cluster. a) Data Node b) NameNode c) Data block d) Replication
Answer: b) NameNode Explanation: All the metadata related to HDFS, including the information about data nodes, files stored on HDFS, and replication, is stored and maintained on the NameNode.
On dropping a managed table The schema gets dropped without dropping the data The data gets dropped without dropping the schema An error is thrown Both the schema and the data is dropped
Both the schema and the data is dropped.
The query "SHOW DATABASES LIKE 'h.*';" gives the output with database name Containing h in their name Starting with h Ending with h Containing 'h.'
Starting with h
The partition of an Indexed table is dropped. Then, Corresponding partition from all indexes are dropped. No indexes are dropped Indexes refresh themselves automatically Error is shown asking to first drop the indexes
Corresponding partition from all indexes are dropped
13 Which of the following queries displays the name of the database, the root location on the file system and comments if any? Describe extended Show Describe Show extended
Describe
Sqoop import and export code Export Data from hive table to mysql
Export Data from Hive table to MySQL
sqoop export \
  --connect jdbc:mysql://localhost/retail_db \
  --username root \
  --password cloudera \
  --table product_exported \
  --hcatalog-table product_hive
Sqoop import and export code Selective Column Imports
Selective Column Imports
sqoop import \
  --connect jdbc:mysql://localhost/retail_db \
  --username root \
  --password cloudera \
  --table customers \
  --target-dir /user/cloudera/customer-selected \
  --columns "customer_fname,customer_lname,customer_city"
Sqoop import and export code Specifying Mappers
Specifying Mappers
sqoop import \
  --connect jdbc:mysql://localhost/retail_db \
  --username root --password cloudera \
  --table customers -m 2
The thrift service component in hive is used for Moving hive data files between different servers Use multiple hive versions Submit hive queries from a remote client Installing hive
Submit hive queries from a remote client
By default when a database is dropped in Hive The tables are also deleted The directory is deleted if there are no tables The HDFS blocks are formatted None of the above
The directory is deleted if there are no tables.
On dropping an external table The schema gets dropped without dropping the data The data gets dropped without dropping the schema An error is thrown Both the schema and the data is dropped
The schema gets dropped without dropping the data.
In Hive you can copy The schema without the data The data without the schema Both schema and its data Neither the schema nor its data
The schema without the data
If an Index is dropped then The directory containing the index is deleted The underlying table is not dropped The underlying table is also dropped Error is thrown by hive
The directory containing the index is deleted
The drawback of managed tables in hive is They are always stored under default directory They cannot grow bigger than a fixed size of 100GB They can never be dropped They cannot be shared with other applications
They cannot be shared with other applications
Sqoop import and export code Using query
Using query
sqoop import \
  --connect jdbc:mysql://localhost/retail_db \
  --username root --password cloudera \
  --target-dir /user/cloudera/customer-queries \
  --query "Select * from customers where customer_id > 100 AND \$CONDITIONS" \
  --split-by "customer_id"
Is it possible to overwrite Hadoop MapReduce configuration in Hive? Yes No
Yes
The need for data replication can arise in various scenarios like ____________ a) Replication Factor is changed b) DataNode goes down c) Data Blocks get corrupted d) All of the mentioned
d) All of the mentioned
2 Using the ALTER DATABASE command on a database you can change the Database name dbproperties Database creation time Directory where the database is stored
dbproperties
Which of the following data type is not supported by Hive? map record string enum
enum
The 2 default TBLPROPERTIES added by hive when a hive table is created is hive_version and last_modified by last_modified_by and last_modified_time last_modified_time and hive_version last_modified_by and table_location
last_modified_by and last_modified_time
Q.2 Which among the following commands is used to change the settings within a Hive session? RESET SET
SET
Sqoop import and export code simple sqoop import
Simple Sqoop Import
sqoop import \
  --connect jdbc:mysql://localhost/retail_db \
  --username root \
  --password cloudera \
  --table customers
When a Hive query joins 3 tables, How many mapreduce jobs will be started? 0 1 2 3
2
2 - What does the --last-value parameter in sqoop incremental import signify? A - What is the number of rows successfully imported in append type import B - what is the date value to be used to select the rows for import in the last_update_date type import C - Both of the above D - The count of the number of rows that were successful in the current import.
Answer: C - Both of the above. Explanation: Sqoop uses the --last-value parameter in both the append mode and the last_update_date mode to import the incremental data from the source.
By default the records from databases imported to HDFS by sqoop are? A - Tab separated B - Concatenated columns C - space separated D - comma separated
Answer: D *comma separated* Explanation: The default field separator is the comma.
_________ function is responsible for consolidating the results produced by each of the Map() functions/tasks. a) Reduce b) Map c) Reducer d) All of the mentioned
Answer: A *Reduce* Explanation: The *Reduce* function collates the work and resolves the results.
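The consolidation step can be sketched in Python: the framework groups the intermediate (key, value) pairs produced by the Map tasks, and Reduce collapses each group into a single result (function names here are illustrative, not Hadoop APIs):

```python
from collections import defaultdict

def shuffle(mapped_pairs):
    # the framework groups intermediate (key, value) pairs by key
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_counts(groups):
    # Reduce consolidates each key's list of values into one result
    return {key: sum(values) for key, values in groups.items()}

mapped = [("hive", 1), ("sqoop", 1), ("hive", 1)]
print(reduce_counts(shuffle(mapped)))  # {'hive': 2, 'sqoop': 1}
```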
Point out the correct statement. a) Hive is not a relational database, but a query engine that supports the parts of SQL specific to querying data b) Hive is a relational database with SQL support c) Pig is a relational database with SQL support d) All of the mentioned
Answer A: Hive is not a relational database, but a query engine that supports the parts of SQL specific to querying data Explanation: Hive is a SQL-based data warehouse system for Hadoop that facilitates data summarization, ad hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems.
Q 20 - The 2 default TBLPROPERTIES added by hive when a hive table is created is A - hive_version and last_modified by B - last_modified_by and last_modified_time C - last_modified_time and hive_version D - last_modified_by and table_location
Answer B last_modified_by and last_modified_time
Hive also supports custom extensions written in ____________ a) C# b) Java c) C d) C++
Answer: B. *Java* Explanation: Hive also supports custom extensions written in *Java*, including user-defined functions (UDFs) and serializer-deserializers for reading and optionally writing custom formats.
A ________ node acts as the Slave and is responsible for executing a Task assigned to it by the JobTracker. a) MapReduce b) Mapper c) TaskTracker d) JobTracker
Answer: C. *TaskTracker* Explanation: The *TaskTracker* receives the information necessary for the execution of a task from the JobTracker, executes the task, and sends the results back to the JobTracker.
_____ is a framework for performing remote procedure calls and data serialization. a) Drill b) BigTop c) Avro d) Chukwa
Answer C. *Avro* Explanation: In the context of Hadoop, Avro can be used to pass data from one program or language to another.
______________ is a platform for constructing data flows for extract, transform, and load (ETL) processing and analysis of large datasets. a) Pig Latin b) Oozie c) Pig d) Hive
Answer: C *Pig* Explanation: Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs.
For ________ the HBase Master UI provides information about the HBase Master uptime. a) HBase b) Oozie c) Kafka d) All of the mentioned
Answer a) HBase
2. Point out the correct statement. a) DataNode is the slave/worker node and holds the user data in the form of Data Blocks b) Each incoming file is broken into 32 MB by default c) Data blocks are replicated across different nodes in the cluster to ensure a low degree of fault tolerance d) None of the mentioned
Answer a) DataNode is the slave/worker node and holds the user data in the form of Data Blocks
During start up, the ___________ loads the file system state from the fsimage and the edits log file. a) DataNode b) NameNode c) ActionNode d) None of the mentioned
Answer b) NameNode
HDFS provides a command line interface called __________ used to interact with HDFS. a) "HDFS Shell" b) "FS Shell" c) "DFS Shell" d) None of the mentioned
Answer b) "FS Shell"
An input _________ is a chunk of the input that is processed by a single map. a) textformat b) split c) datanode d) all of the mentioned
Answer b. *split* Explanation: Each split is divided into records, and the map processes each record—a key-value pair—in turn.
Q 21 - To see the data types details of only a column (not the table) we should use the command A - DESCRIBE B - DESCRIBE EXTENDED C - DESCRIBE FORMATTED D - DESCRIBE COLUMN
Answer: A DESCRIBE
Q 10 - Each database created in hive is stored as A - a directory B - a file C - a hdfs block D - a jar file
Answer: A a directory
Q 7 - Hive is A - schema on read B - schema on write C - schema on update D - all the above
Answer: A schema on read
The partition of an Indexed table is dropped. Then, A - Corresponding partition from all indexes are dropped. B - No indexes are dropped C - Indexes refresh themselves automatically D - Error is shown asking to first drop the indexes
Answer: A Corresponding partition from all indexes are dropped.
Q4 - Which of the following is not a complex data type in Hive? A - Matrix B - Array C - Map D - STRUCT
Answer: A Matrix
Q 18 - On dropping an external table A - The schema gets dropped without dropping the data B - The data gets dropped without dropping the schema C - An error is thrown D - Both the schema and the data is dropped
Answer: A The schema gets dropped without dropping the data
Q 15 - In Hive you can copy A - The schema without the data B - The data without the schema C - Both schema and its data D - Neither the schema nor its data
Answer: A The schema without the data
Q 11 - The tables created in hive are stored as A - a subdirectory under the database directory B - a file under the database directory C - a hdfs block containing the database directory D - a .java file present in the database directory
Answer: A a subdirectory under the database directory
Which of the following is used to analyse data stored in a Hadoop cluster using SQL-like queries? Mahout Hive Pig All of the above
Hive
Sqoop import and export code Hive Import
Hive Import
sqoop import \
  --connect "jdbc:mysql://localhost/retail_db" \
  --username root \
  --password cloudera \
  --table customers \
  --hive-import \
  --create-hive-table
Which of the following is true for Hive? Hive is the database of Hadoop Hive supports schema checking Hive doesn't allow row level updates Hive can replace an OLTP system
Hive doesn't allow row level updates
Hive can be accessed remotely by using programs written in C++, Ruby, etc., over a single port. This is achieved by using HiveServer HiveMetaStore HiveWeb Hive Streaming
HiveServer
Which kind of keys(CONSTRAINTS) Hive can have? Primary Keys Foreign Keys Unique Keys None of the above
None of the above
When a partition is archived in Hive it Reduces space through compression Reduces the length of records Reduces the number of files stored Reduces the block size
Reduces the number of files stored
If the schema of the table does not match the data types present in the file containing the table then Hive Automatically drops the file Automatically corrects the data Reports Null values for mismatched data Does not allow any query to run on the table
Reports Null values for mismatched data