Big Data Interview Questions and Answers

Describe RLIKE in Hive with an example?

RLIKE is a relational operator in Hive that evaluates to true if any substring of A matches the Java regular expression B. Users do not need to add a % symbol for a simple substring match with RLIKE. Examples: 'Express' RLIKE 'Exp' -> True; 'Express' RLIKE '^E.*' -> True (regular expression). Moreover, RLIKE comes in handy when the string has extra spaces: without using the TRIM function, RLIKE still satisfies the required scenario. Suppose A has the value 'Express ' (with 2 additional trailing spaces) and B has the value 'Express'; then 'Express ' RLIKE 'Express' -> True. Note: RLIKE evaluates to NULL if A or B is NULL.

What is a rack?

A rack is a physical collection of datanodes stored at a single location; a Hadoop cluster can contain multiple racks, and datanodes in different racks can be physically located at different places.

Why Reading is done in parallel and Writing is not in HDFS?

Reading is done in parallel because it lets us access the data faster. Writing is not done in parallel because data written by one node could be overwritten by another. For example, if two nodes tried to write data into the same file in parallel, the first node would not know what the second node has written and vice versa, so it would be ambiguous which data should be stored and accessed.

How to skip header rows from a table in Hive?

Suppose that while processing some log files we find header records such as: System=.... Version=... Sub-version=.... As above, a file may have 3 header lines that we do not want to include in our Hive query. To skip header lines from our tables in Hive, we can set a table property that allows us to skip them: CREATE EXTERNAL TABLE userdata ( name STRING, job STRING, dob STRING, id INT, salary INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION '/user/data' TBLPROPERTIES("skip.header.line.count"="3");

DynamicSerDe:

This SerDe also reads/writes Thrift-serialized objects, but it understands Thrift DDL, so the schema of the object can be provided at run time. It also supports many different protocols, including TBinaryProtocol, TJSONProtocol and TCTLSeparatedProtocol.

MetadataTypedColumnsetSerDe

This SerDe is used to read/write delimited records such as CSV, tab-separated or Control-A-separated records (quoting is not supported yet).

ThriftSerDe

This SerDe is used to read/write thrift serialized objects. The class file for the Thrift object must be loaded first.

What is the functionality of Query Processor in Apache Hive ?

This component implements the processing framework for converting SQL to a graph of map/reduce jobs and the execution time framework to run those jobs in the order of dependencies.

We have a table with 3 columns: name, id, address; but address contains doornum, street name, town name, city and pin, and the delimiter is (,). How can we store such data in a Hive table?

Maintain a different field delimiter than ',' (for example ':') in the input data itself, define address as a struct, and declare collection items terminated by ',', as in the sketch below.
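A minimal sketch of such a table definition (the table and column names here are hypothetical examples):

CREATE TABLE customers (
  name STRING,
  id INT,
  address STRUCT<doornum:STRING, street:STRING, town:STRING, city:STRING, pin:STRING>)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ':'
  COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE;

An input line would then look like John:101:12,MG Road,Rampur,Delhi,110001, and the individual parts can be queried as address.city, address.pin, and so on.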

Can we bulk load HBase tables with RDBMS tables via Sqoop?

Yes. The --hbase-bulkload argument enables bulk loading into an HBase table while importing from an RDBMS table via Sqoop, for example as sketched below.
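A hedged sketch of such an import (the connection string, table name, column family and row key column are hypothetical, and --hbase-bulkload is subject to the Sqoop version in use):

sqoop import \
  --connect jdbc:mysql://mysql.server.com/sqoop \
  --username sqoop --password sqoop \
  --table emp \
  --hbase-table emp_hbase \
  --column-family details \
  --hbase-row-key id \
  --hbase-create-table \
  --hbase-bulkload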

How to rename a table in Hive?

Using the ALTER command with RENAME, we can rename a table in Hive.

ALTER TABLE hive_table_name RENAME TO new_name;

How can we change a column data type in Hive?

ALTER TABLE table_name CHANGE column_name column_name new_datatype Example: If we want to change the data type of empid column from integer to bigint in a table called employee. ALTER TABLE employee CHANGE empid empid BIGINT;

Is it possible to create multiple tables in Hive for the same data?

Yes. Hive only creates a schema on top of an existing data file, so one data file can have multiple schemas. Each schema is saved in Hive's metastore, and the data is not parsed or re-serialized to disk for every schema; a schema is applied only when the data is read. For example, if the data file has 5 columns (name, job, dob, id, salary), we can define multiple schemas choosing any number of columns from that list (a table with 3 columns, 5 columns or even 6 columns), as sketched below. But if a query refers to a column that is not actually present in the file, it will return NULL values.
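For example (a sketch reusing the /user/data location, delimiter and columns from the earlier header-skipping example), two external tables can point to the same file with different schemas:

CREATE EXTERNAL TABLE userdata_full (name STRING, job STRING, dob STRING, id INT, salary INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/user/data';

CREATE EXTERNAL TABLE userdata_small (name STRING, job STRING, dob STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/user/data';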

What is Metadata?

Data about Data

Which classes are used by the Hive to Read and Write HDFS Files?

Following classes are used by Hive to read and write HDFS files TextInputFormat/HiveIgnoreKeyTextOutputFormat: These 2 classes read/write data in plain text file format. SequenceFileInputFormat/SequenceFileOutputFormat: These 2 classes read/write data in hadoop SequenceFile format.

What is the maximum size of string data type supported by Hive?

Maximum size is 2 GB.

Is multi line comment supported in Hive Script ?

No

Describe REPEAT function in Hive with example?

REPEAT function will repeat the input string n times specified in the command. Example: REPEAT('Hive',3); Output: HiveHiveHive.

Describe REVERSE function in Hive with example?

REVERSE function will reverse the characters in a string. Example: REVERSE('Hive'); Output: eviH

Describe TRIM function in Hive with example?

TRIM function removes the leading and trailing spaces of a string. Example: TRIM(' Hadoop '); Output: Hadoop. If we want to remove only leading or only trailing spaces, we can use the following commands respectively: LTRIM(' Hadoop'); RTRIM('Hadoop ');

What is Rack Awareness ?

Rack Awareness is the concept of the NameNode maintaining rack IDs for datanodes and using those rack IDs to choose the closest datanodes for HDFS file read or write requests. Choosing the closest datanodes for read/write requests through the rack awareness policy minimizes write cost and maximizes read speed.

What is the communication mode between namenode and datanode?

Datanodes communicate with the Namenode over Hadoop's RPC protocol on top of TCP, by sending periodic heartbeats and block reports; SSH is used only by the cluster start/stop scripts to launch the daemons on remote nodes.

What are the types of tables in Hive?

There are two types of tables: managed tables and external tables. The two differ only in what happens when a table is dropped: dropping a managed table deletes both the metadata and the underlying data, while dropping an external table deletes only the metadata and leaves the data files in place, as sketched below. Otherwise both types of tables behave very similarly.
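A small sketch of the difference (the table names and HDFS path are hypothetical):

CREATE TABLE managed_emp (id INT, name STRING);              -- data stored under the Hive warehouse directory
CREATE EXTERNAL TABLE external_emp (id INT, name STRING)
  LOCATION '/user/data/emp';                                 -- data stays at this HDFS location

DROP TABLE managed_emp;     -- removes both the metadata and the underlying data
DROP TABLE external_emp;    -- removes only the metadata; files under /user/data/emp remain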

What is the Hive configuration precedence order?

There is a precedence hierarchy to setting properties. In the following list, lower numbers take precedence over higher numbers:
1. The Hive SET command
2. The command line -hiveconf option
3. hive-site.xml
4. hive-default.xml
5. hadoop-site.xml (or, equivalently, core-site.xml, hdfs-site.xml, and mapred-site.xml)
6. hadoop-default.xml (or, equivalently, core-default.xml, hdfs-default.xml, and mapred-default.xml)

What is SerDe in Apache Hive?

A SerDe is a Serializer/Deserializer. Hive uses a SerDe to read and write data from tables. An important concept behind Hive is that it does NOT own the Hadoop File System (HDFS) format in which the data is stored. Users are able to write files to HDFS with whatever tools or mechanisms they like ("CREATE EXTERNAL TABLE" or "LOAD DATA INPATH") and use Hive to correctly parse that file format in a way that can be used by Hive. A SerDe is a powerful and customizable mechanism that Hive uses to parse data stored in HDFS so that Hive can work with it.
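As an illustration (a hedged sketch; the table, columns and regex are hypothetical, and depending on the Hive version the RegexSerDe class may require adding the hive-contrib jar), a SerDe can be named in the table DDL so that Hive parses each line with a regular expression:

CREATE EXTERNAL TABLE simple_log (host STRING, user STRING, status STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([0-9]*)"
)
STORED AS TEXTFILE
LOCATION '/user/data/logs';

A line such as 192.168.0.1 alice 200 would then be split into the three columns by the SerDe at read time.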

What is a heartbeat in HDFS?

A heartbeat is a signal that indicates a node is alive. A datanode sends a heartbeat to the Namenode, and a task tracker sends its heartbeat to the job tracker. If the Namenode or job tracker does not receive heartbeats, it decides that there is some problem with the datanode, or that the task tracker is unable to perform the assigned task.

If we run hive as a server, what are the available mechanisms for connecting it from application?

Below are the ways by which we can connect to the Hive Server: Thrift Client: using Thrift we can call Hive commands from various programming languages, e.g. Java, PHP, Python and Ruby. JDBC Driver: it supports a Type 4 (pure Java) JDBC driver. ODBC Driver: it supports the ODBC protocol.

What are the different types of Hive Metastore?

Below are three different types of metastore. Embedded Metastore Local Metastore Remote Metastore

What are the Binary Storage formats supported in Hive?

By default Hive supports the text file format; however, Hive also supports the binary formats below: Sequence files, Avro data files, RCFiles, ORC files and Parquet files. Sequence files: a general binary format that is splittable, compressible and row oriented. A typical example: if we have lots of small files, we can use a sequence file as a container, where the file name is the key and the content is stored as the value; its compression support enables a big gain in performance. Avro data files: like sequence files they are splittable, compressible and row oriented, but they additionally support schema evolution and bindings for multiple languages. RCFiles: Record Columnar files, a column-oriented storage format; the table is broken into row splits, and within each split the values of the first column are stored together, followed by the values of the next column, and so on. ORC files: Optimized Row Columnar files.

Is there any alternative way to rename a table without ALTER command?

By using the export and import options we can rename a table as shown below; here we save the Hive data into HDFS and import it back into a new table: EXPORT TABLE tbl_name TO 'HDFS_location'; IMPORT TABLE new_tbl_name FROM 'HDFS_location'; If we only want to preserve the data, we can create a new table from the old table as below: CREATE TABLE new_tbl_name AS SELECT * FROM old_tbl_name; DROP TABLE old_tbl_name;

How can we copy the columns of a hive table into a file?

By using the awk command in the shell, the output of the HiveQL DESCRIBE command can be written to a file: $ hive -S -e "describe table_name;" | awk -F" " '{print $1}' > ~/output

How to verify whether the daemon processes are running or not

By using Java's $ jps command we can check which Java processes are running on a machine. This command lists all the daemon processes running on the machine along with their process ids.

How to stop all the three hadoop daemons at a time

By using stop-dfs.sh command, we can stop the above three daemon processes with a single command.

Describe CONCAT function in Hive with Example?

CONCAT function concatenates the input strings; we can specify any number of strings separated by commas. Example: CONCAT('Hive','-','is','-','a','-','data warehouse','-','in Hadoop'); Output: Hive-is-a-data warehouse-in Hadoop. Here we had to supply the delimiter '-' between every pair of strings. If the delimiter is common to all the strings, Hive provides the CONCAT_WS command, where the delimiter is specified first. Syntax: CONCAT_WS('-','Hive','is','a','data warehouse','in Hadoop'); Output: Hive-is-a-data warehouse-in Hadoop

What is a daemon?

Daemon is a process or service that runs in background. In general, we use this word in UNIX environment. Hadoop or Yarn daemons are Java processes which can be verified with jps command.

What is the default file format used to store HDFS files or Hive tables when an RDBMS table is imported via Sqoop?

The default file type is the text file format. It is the same as specifying the --as-textfile clause to the sqoop import command.

What is Double data type in Hive?

The double data type in Hive presents data differently from an RDBMS. See the double-type data below: 14324.0 342556.0 1.28893E4. E4 represents 10^4 here, so the value 1.28893E4 represents 12889.3. All calculations are performed accurately using the double type. This matters when exporting double-type data to an RDBMS, since the type may be wrongly interpreted; so it is advised to cast the double type into the appropriate type before exporting.

What is HCatalog and how to use it?

HCatalog is a Table and Storage Management tool to Hadoop/HDFS. In MR, we use it by specifying InputOutput Formats i.e. HCatInputFormat and HCatOutputFormat. In Pig, we use it by specifying Storage types i.e HCatLoader and HCatStorer.

What are the examples of the SerDe classes which hive uses to Serialize and Deserialize data?

Hive currently uses the following SerDe classes to serialize and deserialize data: MetadataTypedColumnsetSerDe, ThriftSerDe and DynamicSerDe (described above).

Does Hive provide OLTP or OLAP?

Hive doesn't provide crucial features required for OLTP, Online Transaction Processing. It's closer to being an OLAP tool, Online Analytic Processing. So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.

What is the differences Between Hive and HBase?

Hive is not a database but a data warehousing framework. Hive doesn't provide record-level operations on tables. HBase is a NoSQL database and it provides record-level updates, inserts and deletes on table data. HBase doesn't provide a query language like SQL, but Hive is now integrated with HBase.

What kind of data warehouse application is suitable for Hive?

Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do. Hive is most suited for data warehouse applications, where Relatively static data is analyzed, Fast response times are not required, and When the data is not changing rapidly.

What is ObjectInspector functionality ?

Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns. ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in the memory, including: Instance of a Java class (Thrift or native Java) A standard Java object (we use java.util.List to represent Struct and Array, and use java.util.Map to represent Map) A lazily-initialized object (For example, a Struct of string fields stored in a single Java string object with starting offset for each field) A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object, but also gives us ways to access the internal fields inside the Object.

How do we write our own custom SerDe ?

In most cases, users want to write a Deserializer instead of a SerDe, because they only want to read their own data format rather than write to it. For example, the RegexDeserializer will deserialize the data using the configuration parameter 'regex' and possibly a list of column names. If your SerDe supports DDL (basically, a SerDe with parameterized columns and column types), you probably want to implement a protocol based on DynamicSerDe instead of writing a SerDe from scratch. The reason is that the framework passes DDL to the SerDe through the "thrift DDL" format, and it's non-trivial to write a "thrift DDL" parser.

When existing rows are also being updated in addition to new rows then how can we bring only the updated records into HDFS ?

In this case we need to use the --incremental lastmodified argument together with the two additional mandatory arguments --check-column <col> and --last-value <value>.
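A hedged example (the connection string, table, check column and timestamp are hypothetical):

sqoop import \
  --connect jdbc:mysql://mysql.server.com/sqoop \
  --table emp \
  --target-dir /user/data/emp \
  --incremental lastmodified \
  --check-column last_updated \
  --last-value "2014-01-01 00:00:00"

On repeated runs against an existing target directory, an --append or --merge-key argument is also needed so that Sqoop knows how to combine the newly pulled records with the old ones.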

What is the default Hive warehouse directory?

It is the /user/hive/warehouse directory on HDFS, configurable through the hive.metastore.warehouse.dir property.

What is a NameNode ?

Namenode is a dedicated machine in an HDFS cluster which acts as the master server; it maintains the file system namespace in its main memory and serves file access requests from users. The file system namespace mainly consists of the fsimage and edits files. Fsimage is a file that contains file names, file permissions and the block locations of each file. Usually only one active namenode is allowed in the default HDFS architecture.

Do Namenode and Resource Manager run on the same host?

No, in practical environment, Namenode runs on a separate host and Resource Manager runs on a separate host.

Is it possible to use same metastore by multiple users, in case of embedded hive?

No, it is not possible to use metastore in sharing mode. It is recommended to use standalone "real" database like MySQL or PostGreSQL.

Does Hive support record level Insert, delete or update?

No. Hive does not provide record-level update, insert, or delete; hence, Hive does not provide transactions either. However, users can use CASE statements and the built-in functions of Hive to achieve the effect of those DML operations. Thus, a complex update query in an RDBMS may need many lines of code in Hive.

Is Namenode also commodity hardware?

No. Namenode can never be commodity hardware because the entire HDFS relies on it. It is the single point of failure in HDFS, so Namenode has to be a high-availability machine.

Is Sqoop similar to distcp in hadoop?

Partially, yes: hadoop's distcp command is similar to the Sqoop import command in that both submit parallel map-only jobs. But distcp is used to copy any type of files from local FS/HDFS to HDFS, while Sqoop transfers data records only between an RDBMS and Hadoop ecosystem services such as HDFS, Hive and HBase.

What is the need for partitioning in Hive?

Partitioning is mainly intended to give quick turnaround time for queries on Hive tables: a query that filters on the partition column reads only the matching partition directories instead of scanning the whole table, as in the sketch below.
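For example (a sketch with hypothetical table and column names), a table partitioned by date lets a query read only the matching partition directory:

CREATE TABLE access_logs (ip STRING, url STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Only the dt=2014-01-01 partition directory is scanned:
SELECT count(*) FROM access_logs WHERE dt = '2014-01-01';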

What is the difference between order by and sort by in hive?

SORT BY will sort the data within each reducer. We can use any number of reducers for SORT BY operation. ORDER BY will sort all of the data together, which has to pass through one reducer. Thus, ORDER BY in hive uses single reducer. ORDER BY guarantees total order in the output while SORT BY only guarantees ordering of the rows within a reducer. If there is more than one reducer, SORT BY may give partially ordered final results
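A small sketch on a hypothetical emp table:

SET mapred.reduce.tasks = 4;
SELECT name, salary FROM emp SORT BY salary;   -- 4 reducers, each output file ordered on its own
SELECT name, salary FROM emp ORDER BY salary;  -- forced through a single reducer, total order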

What is Safe Mode in HDFS ?

Safe Mode is a maintenance state of the NameNode during which the NameNode doesn't allow any changes to the file system. During Safe Mode the HDFS cluster is read-only and blocks are not replicated or deleted. The NameNode automatically enters safe mode during its start-up and keeps block replication within the minimum and maximum allowable replication limits.

What is a Secondary Namenode?

The Secondary NameNode is a helper to the primary NameNode. Secondary NameNode is a specially dedicated node in HDFS cluster whose main function is to take checkpoints of the file system metadata present on namenode. It is not a backup namenode and doesn't act as a namenode in case of primary namenode's failures. It just checkpoints namenode's file system namespace.

How do change settings within Hive Session?

We can change settings from within a session, too, using the SET command. This is useful for changing Hive or MapReduce job settings for a particular query. For example, the following command ensures buckets are populated according to the table definition. hive> SET hive.enforce.bucketing=true; To see the current value of any property, use SET with just the property name hive> SET hive.enforce.bucketing; hive.enforce.bucketing=true By itself, SET will list all the properties and their values set by Hive. This list will not include Hadoop defaults, unless they have been explicitly overridden in one of the ways covered in the above answer. Use SET -v to list all the properties in the system, including Hadoop defaults.

How to start Hive metastore service as a background process?

We can start hive metastore service as a background process with below command. $ hive --service metastore & By using kill -9 <process id> we can stop this service.

How to print header on Hive query results?

We need to use the following set command before our query to show column headers in STDOUT: hive> set hive.cli.print.header=true;

If a data Node is full, then how is it identified?

When data is stored in datanode, then the metadata of that data will be stored in the Namenode. So Namenode will identify if the data node is full.

Can we import RDBMS tables into HCatalog directly, If yes, what are the limitations in it?

Yes, by using the --hcatalog-database option with --hcatalog-table we can create HCatalog tables directly. But importing into or exporting from HCatalog has limitations as of now and does not support the arguments below: --direct, --export-dir, --target-dir, --warehouse-dir, --append, --as-sequencefile, --as-avrodatafile

Can we stop all the above five daemon processes with a single command ?

Yes, by using the $ stop-all.sh command, all five of the above daemon processes can be brought down in a single shot.

We have already 3 tables named US,UK,IND in Hive. Now we have one more JPN created using hadoop fs -mkdir JPN. Can we move the content in IND to JPN directly?

Yes, we can copy contents from hive warehouse directory table IND into JPN.

Can we write any default value for all the string NULLs in an RDBMS table while importing into Hive table?

Yes, we can provide our own meaningful default value to all NULL strings in source RDBMS table while loading into Hive Table with -null-string argument. Below is the sample sqoop import command sqoop import \ --connect jdbc:mysql://mysql.server.com/sqoop \ --table emp \ --null-string '\\N' \ --null-non-string '\\N'

Can we start both Hadoop daemon processes and Yarn daemon processes with a single command?

Yes, we can start all the above mentioned five daemon processes (3 hadoop + 2 Yarn) with a single command $ start-all.sh

What are commands that need to be used to bring down a single hadoop daemon?

$ hadoop-daemon.sh stop namenode $ hadoop-daemon.sh stop secondarynamenode $ hadoop-daemon.sh stop datanode

How to start all hadoop daemons at a time

$ start-dfs.sh command can be used to start all hadoop daemons from terminal at a time

How many hadoop daemon processes run on a Hadoop System

As of hadoop-2.5.0 release, three hadoop daemon processes run on a hadoop cluster. NameNode daemon - Only one daemon runs for entire hadoop cluster. SecondaryNameNode daemon - Only one daemon runs for entire hadoop cluster. DataNode daemon - One datanode daemon per each datanode in hadoop cluster

What are the limitations of Hive?

Below are the limitations of Hive:
Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.
Hive does not provide record-level update, insert, or delete.
Hive queries have higher latency than SQL queries because of the start-up overhead of the MapReduce jobs submitted for each Hive query.
As Hadoop is a batch-oriented system, Hive doesn't support OLTP (Online Transaction Processing).
Hive is close to OLAP (Online Analytic Processing) but not ideal, since there is significant latency between issuing a query and receiving a reply, both due to the overhead of MapReduce jobs and due to the size of the data sets Hadoop was designed to serve. If we need OLTP-style, record-level access, we need to use NoSQL databases like HBase that can be integrated with Hadoop.

What are the relational databases supported in Sqoop?

Below are the list of RDBMSs that are supported by Sqoop Currently. MySQL PostGreSQL Oracle Microsoft SQL IBM's Netezza Teradata

What is big data?

Big data is a vast amount of data (generally in GBs or TBs of size) that exceeds the regular processing capacity of traditional computing servers and requires special parallel processing mechanisms. This data is too big, its rate of growth keeps accelerating, and it can be either structured or unstructured data that legacy databases may not be able to process.

How could be the various components of Hadoop cluster deployed in production?

Both Name Node and Resource Manager can be deployed on a Master Node, and Data nodes and node managers can be deployed on multiple slave nodes. There is a need for only one master node for namenode and Resource Manager on the system. The number of slave nodes for datanodes & node managers depends on the size of the cluster. One more node with hardware specifications same as master node will be needed for secondary namenode.

What are the differences between Backup Node and Checkpoint node or Secondary Namenode ?

A checkpoint node or secondary namenode has to download the fsimage and edits files from the active namenode to create a checkpoint. A backup node does not need to, because it already has an up-to-date copy of the fsimage in its main memory and receives an online stream of edits from the namenode; it applies these edits to the fsimage in its own main memory and saves a copy to its local FS. So checkpoint creation in a backup node is faster than in a checkpoint node or secondary namenode. The difference between a checkpoint node and a secondary namenode is that a checkpoint node can upload the new copy of the fsimage file back to the namenode after checkpoint creation, whereas a secondary namenode cannot upload it and can only store it on its local FS. A backup node also provides the option of running the namenode with no persistent storage, which a checkpoint node or secondary namenode does not. In case of namenode failures, some data loss with a checkpoint node or secondary namenode is certain, at least a minimal amount, because of the time gap between two checkpoints.

How to start Yarn daemon processes on a hadoop cluster

By running the $ start-yarn.sh command from the terminal, the Yarn daemons can be started: the ResourceManager on the local machine and the NodeManager daemons on the slave nodes of the hadoop cluster.

What is commodity hardware

Commodity hardware is an inexpensive system which is not of high quality or high availability. Hadoop can be installed on any commodity hardware; we don't need supercomputers or high-end hardware to work with Hadoop. Commodity hardware still needs a reasonable amount of RAM, because there will be services running in memory.

What are the destination types allowed in Sqoop Import command?

Currently Sqoop Supports data imported into below services. HDFS Hive HBase HCatalog Accumulo

What is a DataNode ?

DataNodes are slave nodes of HDFS architecture which store the blocks of HDFS files and sends blocks information to namenode periodically through heart beat messages. Data Nodes serve read and write requests of clients on HDFS files and also perform block creation, replication and deletions

What is a checkpoint ?

During Checkpointing process, fsimage file is merged with edits log file and a new fsimage file will be created which is usually called as a checkpoint.

What are the objectives of HDFS file system

Easily store large amounts of data across multiple machines.
Data reliability and fault tolerance by maintaining multiple copies of each block of a file.
Capacity to move computation to data instead of moving data to the computation server, i.e. processing data locally.
Ability to provide parallel processing of data through the Mapreduce framework.

How Many Mapreduce jobs and Tasks will be submitted for Sqoop copying into HDFS?

For each Sqoop copy into HDFS, only one MapReduce job is submitted, with 4 map tasks by default. No reduce tasks are scheduled.

What if my MySQL server is running on MachineA and Sqoop is running on MachineB for the above question?

From MachineA login to MySQL shell and perform the below command as root user. If using hostname of second machine, then that should be added to /etc/hosts file of first machine. $ mysql -u root -p mysql> GRANT ALL PRIVILEGES ON *.* TO '%'@'MachineB hostname or Ip address'; mysql> GRANT ALL PRIVILEGES ON *.* TO ''@'MachineB hostname or Ip address';

What is fsimage and edit log in hadoop?

Fsimage is a file which contains file names, file permissions and block locations of each file, and this file is maintained by Namenode for indexing of files in HDFS. We can call it as metadata about HDFS files. The fsimage file contains a serialized form of all the directory and file inodes in the filesystem. EditLog is a transaction log which contains records for every change that occurs to file system metadata.

In destination HBase table how will the row keys be maintained uniquely?

HBase tables maintain row keys uniquely; the row key is the primary key of the input RDBMS table if one is present, otherwise it is the split-by column from the import command.

What is HDFS

HDFS is a distributed file system implemented on Hadoop's framework. It is a block-structured distributed file system designed to store vast amounts of data on low-cost commodity hardware while ensuring high-speed processing of data. HDFS stores files across multiple machines and maintains reliability and fault tolerance. HDFS supports parallel processing of data by the Mapreduce framework.

What are the limitations of HDFS file systems

HDFS supports file reads, writes, appends and deletes efficiently, but it doesn't support file updates. HDFS is not suitable for large numbers of small files but best suits large files, because the file system namespace maintained by the Namenode is limited by its main memory capacity: the namespace is stored in the namenode's main memory, and a large number of files results in a big fsimage file.

What are main components/projects in Hadoop architecture

Hadoop Common: the common utilities that support the other Hadoop modules.
HDFS: the Hadoop distributed file system that provides high-throughput access to application data.
Hadoop YARN: a framework for job scheduling and cluster resource management.
Hadoop MapReduce: a YARN-based system for parallel processing of large data sets.

What are the modes in which Hadoop can run

Hadoop can run in three modes.
Standalone or local mode - no daemons run in this mode and everything runs in a single JVM.
Pseudo-distributed mode - all the Hadoop daemons run on a local machine, simulating a cluster on a small scale.
Fully distributed mode - a cluster of machines is set up in master/slave architecture to distribute and process the data across various nodes of commodity hardware.

How indexing is done in HDFS?

Hadoop has its own way of indexing. Depending upon the block size, once the data is stored, HDFS will keep on storing the last part of the data which will say where the next part of the data will be. In fact, this is the base of HDFS

What is Hadoop

Hadoop is an open source frame work from Apache Software Foundation for storing & processing large-scale data usually called Big Data using clusters of commodity hardware.

What is Hive?

Hive is one of the important tools in the Hadoop ecosystem, and it provides an SQL-like dialect on top of the Hadoop distributed file system.

What are the features of Hive?

Hive provides:
Tools to enable easy data extract/transform/load (ETL).
A mechanism to project structure onto a variety of data formats.
Access to files stored either directly in HDFS or in other data storage systems such as HBase.
Query execution through MapReduce jobs.
An SQL-like language called HiveQL that facilitates querying and managing large data sets residing in hadoop.

Can we append the tables data to an existing target directory?

If the destination directory already exists in HDFS, Sqoop will refuse to import. If we use the -append argument, Sqoop will import data to a temporary directory and then rename the files into normal target directory in a manner that, it does not conflict with existing file names in that directory.

If we want to copy 20 blocks from one machine to another, but another machine can copy only 18.5 blocks, can the blocks be broken at the time of replication?

In HDFS, blocks cannot be broken down. Before copying the blocks from one machine to another, the master node will figure out the actual amount of space required, how many blocks are being used and how much space is available, and it will allocate the blocks accordingly.

When Importing tables from MySQL to what are the precautions that needs to be taken care w.r.t to access?

In MySQL, we need to make sure that all privileges on the databases to be accessed are granted to users connecting from the destination hostname. If Sqoop is being run on localhost and MySQL is also on the same machine, then we can grant the permissions with the two commands below from a MySQL shell logged in as the ROOT user. $ mysql -u root -p mysql> GRANT ALL PRIVILEGES ON *.* TO '%'@'localhost'; mysql> GRANT ALL PRIVILEGES ON *.* TO ''@'localhost';

What are the majorly used commands in Sqoop?

In Sqoop, the import and export commands are used most, but the commands below are also useful at times: codegen, eval, import-all-tables, job, list-databases, list-tables, merge, metastore.

List important site-specific configuration files in Hadoop cluster

In order to override any hadoop configuration property's default value, we need to provide the value in a site-specific configuration file. Below are the four site-specific .xml configuration files and the environment variable setup file:
core-site.xml: common properties are configured in this file.
hdfs-site.xml: site-specific HDFS properties are configured in this file.
yarn-site.xml: Yarn-specific properties can be provided in this file.
mapred-site.xml: Mapreduce framework-specific properties are defined here.
hadoop-env.sh: Hadoop environment variables are set up in this file.
All these configuration files should be placed in hadoop's configuration directory, etc/hadoop, under hadoop's home directory.

If some hadoop daemons are already running and if we need to start any one remaining daemon process then what are the commands to use

Instead of start-dfs.sh which will trigger all the hadoop three daemons at a time, we can also start running each daemon separately by the below commands $ hadoop-daemon.sh start namenode $ hadoop-daemon.sh start secondarynamenode $ hadoop-daemon.sh start datanode

Is Namenode machine same as datanode machine as in terms of hardware?

It depends upon the cluster we are trying to create. The Hadoop VM can be on the same machine or on a different machine. For instance, in a single-node cluster there is only one machine, whereas in a development or testing environment the Namenode and datanodes are on different machines.

What is a block in HDFS and what is its size

It is a fixed-size chunk of data, usually 128 MB. It is the minimum amount of data that HDFS reads or writes. HDFS files are broken into these fixed-size chunks spread across multiple machines in a cluster; thus, blocks are the building bricks of an HDFS file. Each block is kept in 3 copies by default, as set by the replication factor in the Hadoop configuration, to provide data redundancy and maintain fault tolerance.
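Both values are configurable in hdfs-site.xml; for instance (the values shown are only illustrative defaults for Hadoop 2.x):

<property>
  <name>dfs.blocksize</name>
  <value>134217728</value> <!-- 128 MB -->
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>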

What is a Checkpoint Node

It is an enhanced secondary namenode whose main functionality is to take checkpoints of the namenode's file system metadata periodically, and it replaces the role of the secondary namenode. The advantage of a checkpoint node over the secondary namenode is that it can upload the result of the merge of the fsimage and edits log files back to the namenode while checkpointing.

What is a Backup Node?

It is an extended checkpoint node that performs checkpointing and also supports online streaming of file system edits. It maintains an in memory, up-to-date copy of file system namespace and accepts a real time online stream of file system edits and applies these edits on its own copy of namespace in its main memory. Thus, it maintains always a latest backup of current file system namespace.

What is Multiple checkpoint nodes?

Multiple checkpoint nodes can be registered with namenode but only a single backup node is allowed to register with namenode at any point of time. To create a checkpoint, checkpoint node or secondary namenode needs to download fsimage and edits files from active namenode and apply edits to fsimage and saves a copy of new fsimage as a checkpoint.

What are the core components in HDFS Cluster ?

Name Node
Secondary Name Node
Data Nodes
Checkpoint Nodes
Backup Node

Now we have to display the contents in US,UK,IND,JPN. By using SELECT * FROM TABLES is it possible to display?

No, because JPN was created using the hadoop fs -mkdir command; it is not part of the Hive metadata.

Does Sqoop support the direct (fast) import technique while importing into Hive/HBase/HCatalog?

No, fast direct import is supported only for HDFS and that too from MySQL and PostGreSQL only.

What is Data Locality in HDFS ?

One of the HDFS design idea is that "Moving Computation is cheaper than Moving data". If data sets are huge, running applications on nodes where the actual data resides will give efficient results than moving data to nodes where applications are running. This concept of moving applications to data, is called Data Locality. This reduces network traffic and increases speed of data processing and accuracy of data since there is no chance of data loss during data transfer through network channels because there is no need to move data.

Can we load RDBMS table data into a Hive partition directly? If yes, how can we achieve it?

Sqoop can import data into a particular Hive partition by specifying the --hive-partition-key and --hive-partition-value arguments, as sketched below.
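A hedged sketch (the connection string, table, Hive table name, partition key and value are hypothetical):

sqoop import \
  --connect jdbc:mysql://mysql.server.com/sqoop \
  --table emp \
  --hive-import \
  --hive-table emp_hive \
  --hive-partition-key country \
  --hive-partition-value 'IND'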

What is Sqoop?

Sqoop is an open source tool that enables users to transfer bulk data between Hadoop eco system and relational databases.

What are the various table creation arguments available in Sqoop?

Sqoop supports direct table creation in Hive, HBase and HCatalog as of now; the corresponding create-table arguments are, respectively: --create-hive-table (used with --hive-import); --hbase-create-table (along with --column-family <family> and optionally --hbase-row-key <col>); --create-hcatalog-table.

What is structured data?

Structured data is the data that is easily identifiable as it is organized in a structure. The most common form of structured data is a database where specific information is stored in tables, that is, rows and columns.

List important default configuration files in Hadoop cluster

The default configuration files in hadoop cluster are: core-default.xml hdfs-default.xml yarn-default.xml mapred-default.xml

What is Hive Metastore?

The metastore is the central repository of Hive metadata. It is divided into two pieces: a service and the backing store for the data. By default, the metastore service runs in the same process as the Hive service. It is also possible to run the metastore as a standalone (remote) process; set the METASTORE_PORT environment variable to specify the port the server will listen on.

Which operating systems are supported for Hadoop deployment ?

The only supported operating system for hadoop's production deployment is Linux. However, with some additional software Hadoop can be deployed on Windows for test environments.

While connecting to MySQL through Sqoop, I am getting Connection Failure exception what might be the root cause and fix for this error scenario?

This might be due to insufficient permissions to access your MySQL database over the network. To confirm this, we can try the command below to connect to the MySQL database from Sqoop's client machine: mysql --host=<MySQL node> --database=test --user=<username> --password=<password>

What is the criteria for specifying parallel copying in Sqoop with multiple parallel map tasks?

To use multiple mappers in Sqoop, the RDBMS table should have a primary key column, which will be used as the split-by column in the Sqoop process. If a primary key is not present, we need to provide a unique column, or a set of columns forming unique values, to the --split-by argument.
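For example (hypothetical table and column names), if emp has no primary key we can still run 8 parallel mappers by pointing Sqoop at a reasonably unique column:

sqoop import \
  --connect jdbc:mysql://mysql.server.com/sqoop \
  --table emp \
  --split-by emp_id \
  -m 8 \
  --target-dir /user/data/emp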

What is the basic difference between traditional RDBMS and Hadoop

Traditional RDBMS is used for transactional systems to report and archive the data, whereas Hadoop is an approach to store huge amount of data in the distributed file system and process it. RDBMS will be useful when we want to seek one record from Big data, whereas, Hadoop will be useful when we want Big data in one shot and perform analysis on that later.

How many YARN daemon processes run on a cluster

Two types of Yarn daemons will be running on hadoop cluster in master/slave fashion. ResourceManager - Master daemon process NodeManager - One Slave daemon process per node in a cluster.

What is unstructured data?

Unstructured data refers to any data that cannot be identified easily. It could be in the form of images, videos, documents, email, logs and random text. It is not in the form of rows and columns

How to bring down the Yarn daemon processes

Using $ stop-yarn.sh command, we can bring down both the Yarn daemon processes running on a machine.

How to configure hive remote metastore in hive-site.xml file?

We can configure remote metastore in hive-site.xml file with the below property. <property> <name>hive.metastore.uris</name> <value>thrift://node1(or IP Address):9083</value> <description>IP address (or fully-qualified domain name) and port of the metastore host</description> </property>

How can we control the parallel copying of RDBMS tables into hadoop ?

We can control, increase or decrease the copy speed by configuring the number of map tasks to be run for each Sqoop copy process. We can do this by providing the -m 10 or --num-mappers 10 argument to the sqoop import command; specifying -m 10 submits 10 map tasks in parallel at a time. Based on our requirement, we can increase or decrease this number to control the copy speed.

How to start Hive Thrift server?

We can issue the below command from the terminal to start the Hive Thrift server: $ hive --service hiveserver

If HDFS target directory already exists, then we how can we overwrite the RDBMS table into it via Sqoop?

We can use the --delete-target-dir argument with the sqoop import command, so that the target directory is deleted before copying the new table contents into it.

When RDBMS table is only getting new rows and the existing rows are not changed, then how can we pull only the new rows into HDFS via sqoop?

We can use the --incremental append argument to pull only the new rows from the RDBMS table into the HDFS directory, as in the example below.
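A hedged example (the connection string, check column and last value are hypothetical):

sqoop import \
  --connect jdbc:mysql://mysql.server.com/sqoop \
  --table emp \
  --target-dir /user/data/emp \
  --incremental append \
  --check-column id \
  --last-value 1000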

What is the best way to provide passwords to user accounts of RDBMS in Sqoop import/export commands?

We can use the --password argument to provide the password of the user account, but it is not secure. The -P argument (which prompts for the user password) reads a password from a console prompt and is the preferred way of entering credentials; however, credentials may still be transferred between nodes of the MapReduce cluster using insecure means. The most secure way is the --password-file <file containing the password> method: set the authentication password in this file in the user's home directory with 400 permissions.

While loading tables from MySQL into HDFS, if we need to copy tables with maximum possible speed, what can you do ?

We need to use the --direct argument in the import command to use the direct import fast path; --direct can be used only with MySQL and PostGreSQL as of now.

What is the example connect string for Oracle database to import tables into HDFS?

We need to use the Oracle JDBC Thin driver while connecting to an Oracle database via Sqoop. Below is a sample import command to pull the table employees from the Oracle database testdb: sqoop import \ --connect jdbc:oracle:thin:@oracle.example.com/testdb \ --username SQOOP \ --password sqoop \ --table employees

How does HDFS File Deletes or Undeletes work?

When a file is deleted from HDFS, it is not removed immediately; HDFS moves the file into the /trash directory. After a certain time interval, the NameNode deletes the file from the HDFS /trash directory, and the deletion of a file releases the blocks associated with it. The time interval for which a file remains in the /trash directory can be configured with the fs.trash.interval property in core-site.xml. As long as a file remains in /trash, it can be undeleted by moving it from /trash back to the required location in HDFS. The default trash interval is 0, so by default HDFS deletes files without storing them in the trash.
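For example, the following core-site.xml setting (the value is only illustrative) keeps deleted files in /trash for 24 hours, since the interval is given in minutes:

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>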

Whenever we run a Hive query from a different directory, it creates a new metastore_db; please explain the reason for it.

Whenever we run the hive in embedded mode, it creates the local metastore. And before creating the metastore it looks whether metastore already exist or not. This property is defined in configuration file hive-site.xml. Property is "javax.jdo.option.ConnectionURL" with default value "jdbc:derby:;databaseName=metastore_db;create=true". So to change the behavior change the location to absolute path, so metastore will be used from that location.

While importing tables from an Oracle database, sometimes I get java.lang.IllegalArgumentException: Attempted to generate class with no columns! or a NullPointerException. What might be the root cause and fix for this error scenario?

When dealing with an Oracle database from Sqoop, the case of table names and user names matters. Most probably, specifying these two values in UPPER case will solve the issue, unless the actual names mix lower and upper case; if they are mixed, then we need to provide them within double quotes. If the source table was created under a different user's namespace, we need to provide the table name as USERNAME.TABLENAME, as shown below: sqoop import \ --connect jdbc:oracle:thin:@oracle.example.com/ORACLE \ --username SQOOP \ --password sqoop \ --table SIVA.EMPLOYEES

If datanodes increase, then do we need to upgrade Namenode?

While installing the Hadoop system, the Namenode is sized based on the size of the cluster. Most of the time we do not need to upgrade the Namenode, because it does not store the actual data, only the metadata, so such a requirement rarely arises.

Can we provide free-form SQL queries in Sqoop Import commands ? If yes, what is the criteria to use them?

Yes, we can use free-form SQL queries with the -e / --query argument to the Sqoop import command, but we must also specify the --target-dir argument, and the query's WHERE clause must contain a $CONDITIONS placeholder so that Sqoop can split the work across mappers, as in the sketch below.
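A hedged sketch (the connection string, query, columns and target directory are hypothetical); note the $CONDITIONS token and the --split-by column needed when more than one mapper is used:

sqoop import \
  --connect jdbc:mysql://mysql.server.com/sqoop \
  --query 'SELECT e.id, e.name, d.dept_name FROM emp e JOIN dept d ON e.dept_id = d.id WHERE $CONDITIONS' \
  --split-by e.id \
  --target-dir /user/data/emp_dept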

Can we save RDBMS table into Hive table with Avro file format and compression enabled? If yes how can we do that ?

Yes, we can use the --as-avrodatafile clause to create the target files as Avro files and the -z / --compress clause to enable compression; optionally we can specify --compression-codec <c> to provide the compression codec class name.

What is the difference between jps and jps -lm commands ?

The jps command returns the process id and short name for each running Java process, while jps -lm additionally returns the full package name of the main class (or the path to the jar) and the arguments passed to it, along with the process id.

What is the difference between -target-dir and -warehouse-dir arguments while importing and can we use both in the same import command?

--target-dir creates the directory with the given name in HDFS and places the table's files directly in it, whereas --warehouse-dir creates one more sub-level directory named after the table, so the files land under <warehouse-dir>/<table name>/. No, we cannot use these two clauses in the same command; they are mutually exclusive.

