Hive

Are multi-line comments supported in Hive?

No. Hive supports only single-line comments, which begin with --.

How do you list all databases whose name starts with p?

SHOW DATABASES LIKE 'p*';

Give the command to see the indexes on a table.

SHOW INDEX ON table_name; This lists all the indexes created on any of the columns of the table table_name.

How do you check if a particular partition exists?

SHOW PARTITIONS table_name PARTITION (partitioned_column='partition_value');

When should we use SORT BY instead of ORDER BY?

SORT BY is preferable when dealing with huge data sets. SORT BY sorts the data using multiple reducers, so the output of each reducer is sorted but the overall result is not totally ordered. ORDER BY sorts all of the data together using a single reducer, so it takes much longer to execute on large inputs.
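The difference can be illustrated with a small simulation. This is a minimal sketch, not Hive's implementation: the function names are hypothetical, and the way rows are dealt out to reducers (value modulo the reducer count) is an assumption made only for the example.

```python
def order_by(rows):
    """ORDER BY: one reducer sees every row and produces a total order."""
    return sorted(rows)

def sort_by(rows, num_reducers):
    """SORT BY: each reducer sorts only the rows it receives."""
    # Assumed distribution for illustration: row value modulo reducer count.
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[row % num_reducers].append(row)
    return [sorted(part) for part in reducers]

rows = [42, 7, 19, 3, 88, 61, 25, 14, 70]
total = order_by(rows)
parts = sort_by(rows, num_reducers=3)

print(total)  # [3, 7, 14, 19, 25, 42, 61, 70, 88]
print(parts)  # [[3, 42], [7, 19, 25, 61, 70, 88], [14]]
```

Each inner list is sorted, but reading the reducer outputs one after another does not give a globally sorted result.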

How can clients interact with Hive?

1) Hive Thrift client 2) JDBC driver 3) ODBC driver

What is a metastore in Hive?

The metastore is the central repository of Apache Hive metadata. It stores metadata for Hive tables (such as their schema and location) and partitions in a relational database, and it provides client access to this information through the metastore service API. The Hive metastore consists of two fundamental units: 1) a service that provides metastore access to other Apache Hive services, and 2) disk storage for the Hive metadata, which is separate from HDFS storage.

Is Hive suitable for OLTP systems? Why?

No, it is not suitable for OLTP systems, since it does not offer inserts and updates at the row level.

How do you add a partition to an existing table that was not created as a partitioned table?

We cannot add a partition to an existing table if the table was not partitioned when it was created. When a table is created with the PARTITIONED BY clause, additional partitions can be added with the ALTER TABLE command, but the table has to be partitioned from the start.

What is the importance of .hiverc file?

The file contains a list of commands that run when the Hive CLI starts, e.g. setting strict mode to true.

How can you configure the remote metastore mode in HIVE?

To configure the remote metastore, set the following property in hive-site.xml: hive.metastore.uris = thrift://node1:9083, where node1 is the host name (or IP address) of the metastore host and 9083 is its port.

How does data transfer happen from HDFS to Hive?

If the data is already present in HDFS, the user does not need to run LOAD DATA, which moves the files to /user/hive/warehouse/. Instead, the user can define an external table, which creates the table definition in the Hive metastore. Otherwise, LOAD DATA INPATH 'hdfs_path' INTO TABLE table_name is how the data is transferred; the LOCAL keyword is used only when the source path is on the local file system rather than in HDFS.

What is the significance of the line set hive.mapred.mode=strict;?

It puts Hive in strict mode, in which queries on partitioned tables cannot run without a WHERE clause on a partition column. This prevents very large jobs from running for a long time.

Where is the table data stored in Hive by default?

In the Hive warehouse directory: hdfs://namenode_server/user/hive/warehouse

What is SerDe in Apache Hive?

SerDe (Serializer/Deserializer) is used for I/O purposes. It handles both serialization and deserialization in Hive, and it interprets the results of deserialization as individual fields for processing.

Usage of Hive

1) Hive is used for schema flexibility as well as schema evolution. 2) Tables in Apache Hive can be partitioned and bucketed. 3) JDBC/ODBC drivers are available in Hive, so clients can use them.

Features of Hive

1) Offers data summarization, query, and analysis in a much easier manner. 2) Supports external tables, which make it possible to process data without actually storing it in HDFS. 3) Fits the low-level interface requirement of Hadoop perfectly.

Limitations of Hive

1) Cannot perform real-time queries and does not support row-level updates. 2) Latency is only acceptable for interactive data browsing. 3) Hive is not suitable for online transaction processing.

How do you write a UDF in Hive?

1) Create a Java class for the UDF that extends org.apache.hadoop.hive.ql.exec.UDF and implements one or more evaluate() methods. 2) Package your Java class into a JAR file. 3) Go to the Hive CLI, add your JAR, and verify that your JARs are in the Hive CLI classpath. 4) Use CREATE TEMPORARY FUNCTION in Hive to point to your Java class. 5) Then use the function in Hive SQL.

What are types of Hive Built-In Functions?

1) Collection functions 2) Date functions 3) Mathematical functions 4) Conditional functions 5) String functions

Types of Hive DDL Commands.

1) Create database 2) Show databases 3) Drop database 4) Create tables 5) Browse the table 6) Alter and drop tables 7) Select data from a table 8) Load data

Give examples of the SerDe classes which hive uses to Serialize and Deserialize data?

1) MetadataTypedColumnsetSerDe: reads and writes delimited records such as CSV, tab-separated, or Control-A-separated records (quoting is not supported yet). 2) ThriftSerDe: reads and writes Thrift-serialized objects; the class file for the Thrift object must be loaded first. 3) DynamicSerDe: also reads and writes Thrift-serialized objects.

Which classes are used by Hive to read and write HDFS files?

1) TextInputFormat/HiveIgnoreKeyTextOutputFormat: read/write data in plain text file format. 2) SequenceFileInputFormat/SequenceFileOutputFormat: read/write data in Hadoop SequenceFile format.

What are Hive operators and their types?

Hive's built-in operators are of the following types: 1) Relational operators 2) Arithmetic operators 3) Logical operators 4) String operators 5) Operators on complex types

What are views in Hive?

A view allows a query to be saved and treated like a table. It is a logical construct, as it does not store data like a table. In other words, materialized views are not currently supported by Hive. Logically, you can imagine that Hive executes the view and then uses the results in the rest of the query.

What is the max size of string data type supported by Hive?

2 GB

What is a generic UDF in Hive?

A generic UDF is written by extending the GenericUDF class. All objects are passed around using the Object type. Hive is structured this way so that all code handling records and cells is generic, and to avoid the costs of instantiating and deserializing objects when it is not needed. All interaction with the data passed in to UDFs is done via ObjectInspectors. They allow you to read values from a UDF parameter and to write output values.

What is Apache Hive?

Apache Hive is a data warehousing tool. It provides SQL-like queries for performing analysis, along with an abstraction over the underlying data. Although Hive is not a database, it gives you a logical abstraction of databases and tables.

What is a Hive variable? What is it used for?

It is a variable created in the Hive environment that can be referenced by scripts. It is used to pass values to Hive queries when the query starts executing.

What kind of applications are supported by Apache Hive?

By exposing its Thrift server, Hive supports all client applications written in Java, PHP, Python, C++, or Ruby.

Can a table be renamed in Hive?

Yes: ALTER TABLE table_name RENAME TO new_name;

Differentiate between PigLatin and Hive

Apache Pig is best for structured and semi-structured data, while Apache Hive is best for structured data. Pig Latin is a procedural language, while HiveQL is a declarative language. Apache Pig supports the COGROUP feature for performing outer joins, while Apache Hive does not.

What are collection data types in Hive?

ARRAY, MAP, and STRUCT

Why do we need buckets?

Bucketing in Hive is a data organizing technique. It is similar to partitioning, with the added functionality of dividing large datasets into more manageable parts known as buckets. Bucketing is used when partitioning becomes difficult.

How can you delete the DBPROPERTY in Hive?

DBPROPERTY can't be deleted in hive.

What is the importance of driver in Hive?

The driver manages the life cycle of HiveQL queries. It receives queries from the UI or from the JDBC interfaces and processes them. It also creates a separate session to handle each query.

How does hive distribute the rows into buckets?

Hash functions are used to distribute rows into buckets. The data is stored in the bucket that corresponds to the resulting hash value.

Difference between HBase vs Hive

Hive and HBase are two different Hadoop-based technologies. Hive is a SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database on Hadoop. Hive can be used for analytical queries, while HBase is used for real-time querying.

Difference between Hive and Impala?

Hive generates query expressions at compile time, whereas Impala does runtime code generation for "big loops". Hive does batch processing based on Hadoop MapReduce, whereas Impala is more like an MPP database. Hive supports complex types, but Impala does not. Apache Hive is fault tolerant, whereas Impala does not support fault tolerance.

How is HCatalog different from Hive?

Hive is the layer for analyzing, querying, and managing large datasets that reside in Hadoop's various file systems. It uses HQL as its query language and SerDes for serialization and deserialization, and it works best with huge volumes of data. HCatalog is a table and storage management layer for Hadoop. It is a sub-component of Hive that enables ETL processes and provides access to the metadata that resides in the Hive metastore. It uses WebHCat, a web server, for engaging with the Hive metastore.

Is it possible for multiple users to use the same metastore in the case of embedded Hive?

No, the metastore cannot be shared in embedded mode. To share it, use a standalone "real" database such as MySQL or PostgreSQL.

What is the relation between MapReduce and Hive?

Hive removes the need to write MapReduce programs in Java. Instead, one can use a SQL-like language for daily tasks. Hive queries are converted to MapReduce programs in the background by the Hive compiler, so that the jobs are executed in parallel across the Hadoop cluster. This helps Hadoop developers focus on the business problem rather than on complex programming language logic.

Why does Hive not store metadata information in the HDFS?

Hive stores metadata information in the metastore, which uses an RDBMS rather than HDFS in order to achieve low latency. HDFS read/write operations are time-consuming.

Can Hive queries be executed from script files? How?

Yes, using the source command: hive> source /path/to/file/file_with_query.hql;

What is the significance of "IF EXISTS" clause while dropping a table?

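When DROP TABLE is issued without IF EXISTS and the table does not exist, Hive throws an error. With IF EXISTS, the statement does nothing and no error is raised.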

Where does the data of a Hive table get stored?

In the HDFS directory /user/hive/warehouse by default. This can be changed by specifying the desired directory in the hive.metastore.warehouse.dir configuration parameter in hive-site.xml.

What is indexing and why do we need it?

Indexing is a query optimization technique. We use it to speed up the access of a column or set of columns in a Hive database.

Can a partition be archived? What are the advantages and disadvantages?

Yes. The advantage is that archiving decreases the number of files stored in the NameNode, and the archived files can still be queried using Hive. The disadvantage is that queries become less efficient and archiving does not offer any space savings.

What is the default database provided by Apache Hive for metastore?

By default, it offers an embedded Derby database instance backed by the local disk for the metastore. This is called the embedded metastore configuration.

Explain Hive Thrift server?

It is an optional component in Hive. It allows access to Hive over a single port, and it allows clients written in various languages (Java, C++, Ruby, and many others) to access Hive remotely and programmatically.

Explain different types of joins in Hive?

JOIN: very similar to an inner join in SQL; only rows with matches in both tables are returned. FULL OUTER JOIN: combines the records of both the left and right tables, whether or not they match. LEFT OUTER JOIN: all rows from the left table are returned even if there are no matches in the right table. RIGHT OUTER JOIN: all rows from the right table are returned even if there are no matches in the left table.
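The join types can be sketched over two small in-memory tables, with a plain JOIN modelled as an inner join. This is an illustrative sketch only: the join() helper and the sample data are hypothetical, and each key is assumed to appear at most once per table.

```python
def join(left, right, how="inner"):
    """Join two {key: value} dicts; a missing side is filled with None (NULL)."""
    if how == "inner":
        keys = [k for k in left if k in right]
    elif how == "left":
        keys = list(left)
    elif how == "right":
        keys = list(right)
    elif how == "full":
        keys = list(left) + [k for k in right if k not in left]
    else:
        raise ValueError(f"unknown join type: {how}")
    return [(k, left.get(k), right.get(k)) for k in keys]

customers = {1: "Ann", 2: "Bo", 3: "Cy"}
orders = {2: "book", 3: "pen", 4: "ink"}

print(join(customers, orders, "inner"))  # [(2, 'Bo', 'book'), (3, 'Cy', 'pen')]
print(join(customers, orders, "left"))   # [(1, 'Ann', None), (2, 'Bo', 'book'), (3, 'Cy', 'pen')]
print(join(customers, orders, "right"))  # [(2, 'Bo', 'book'), (3, 'Cy', 'pen'), (4, None, 'ink')]
print(join(customers, orders, "full"))   # [(1, 'Ann', None), (2, 'Bo', 'book'), (3, 'Cy', 'pen'), (4, None, 'ink')]
```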

PIG vs Hive Vs Hadoop MapReduce

Language: Hive has a SQL-like query language; MapReduce uses a compiled language; Pig has a scripting language.
Abstraction: Hive has a high level of abstraction; MapReduce has a low level of abstraction; Pig has a high level of abstraction.
Lines of code: Hive needs fewer lines of code than both MapReduce and Pig; MapReduce needs the most lines of code; Pig needs fewer lines of code than MapReduce.

What is the difference between local and remote metastores?

Local metastore: the metastore service runs in the same JVM as the Hive service and connects to a database running in a separate JVM, either on the same machine or on a remote machine. Remote metastore: the metastore service runs in its own separate JVM, not in the Hive service JVM.

What is the difference between the external table and managed table?

Managed table: when a managed table is dropped, both the metadata and the table data are deleted from the Hive warehouse directory. External table: when an external table is dropped, Hive deletes only the metadata for the table and leaves the table data in HDFS.
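The difference in DROP behaviour can be modelled with two dictionaries standing in for the metastore and the file system. This is a toy sketch: drop_table(), the table names, and the paths other than the default warehouse directory are hypothetical.

```python
def drop_table(name, metastore, filesystem):
    """Drop a table: metadata is always removed; data only if managed."""
    meta = metastore.pop(name)
    if not meta["external"]:
        filesystem.pop(meta["location"], None)

# Hypothetical metastore entries and file system contents.
metastore = {
    "sales_managed": {"external": False,
                      "location": "/user/hive/warehouse/sales_managed"},
    "sales_external": {"external": True, "location": "/data/sales"},
}
filesystem = {
    "/user/hive/warehouse/sales_managed": ["part-0"],
    "/data/sales": ["part-0"],
}

drop_table("sales_managed", metastore, filesystem)
drop_table("sales_external", metastore, filesystem)

print(metastore)   # {}
print(filesystem)  # {'/data/sales': ['part-0']}
```

After both drops the metastore is empty, but the external table's files are still in place.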

Why do we perform partitioning in HIVE?

Partitioning provides granularity (the scale or level of detail present in a set of data). It reduces query latency by scanning only the relevant partitioned data instead of the whole dataset.
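The effect on the amount of data scanned can be shown with a toy model in which each partition is a separate group of rows. This is a sketch under assumed names: the table layout, the country partition key, and scan() are all hypothetical.

```python
# Hypothetical table partitioned by country: one "directory" per value.
table = {
    "country=US": [("alice", 30), ("bob", 25)],
    "country=FR": [("chloe", 41)],
    "country=IN": [("dev", 35), ("esha", 29), ("farid", 52)],
}

def scan(table, partition=None):
    """Return (rows, rows_scanned). With a partition filter only that
    partition is read; without one, every partition is read."""
    if partition is not None:
        rows = list(table.get(partition, []))
    else:
        rows = [row for part in table.values() for row in part]
    return rows, len(rows)

print(scan(table, "country=FR"))  # ([('chloe', 41)], 1)
print(scan(table)[1])             # 6
```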

What is a partition in HIVE?

Partitions group similar data together on the basis of one or more columns, called partition keys. Each table can have one or more partition keys to identify a particular partition. A partition is a sub-directory in the table directory.

Difference between SQL and HiveQL?

SQL is based on the relational database model; it manipulates data stored in tables and modifies their rows and columns. HiveQL is Hive's SQL-like query language: it follows SQL syntax closely, but its queries are compiled into batch jobs over data stored in Hadoop, and it is designed for analysis rather than transaction processing. Descriptions of HQL as an object-oriented query language concerned with objects and their properties refer to Hibernate Query Language, not HiveQL.

What is dynamic partitioning and when is it used?

Loading a partitioned table with a single INSERT is known as dynamic partitioning. Usually, a dynamic partition load reads the data from a non-partitioned table. Dynamic partitioning takes more time to load data than static partitioning. It is suitable when you have a large amount of data stored in a table, and also when you want to partition on a column but do not know in advance how many partition values there will be. With dynamic partitioning, no WHERE clause is required in order to use LIMIT. ALTER cannot be performed on a dynamic partition. Dynamic partitioning can be used on both external and managed Hive tables. To use dynamic partitioning, the partition mode must be set to non-strict.

How does Hive Organize data?

Tables, Partitions and Buckets

How do you specify the table creator name when creating a table in Hive?

The TBLPROPERTIES clause is used to add the creator name while creating a table. It is added like this: TBLPROPERTIES('creator'='Joan')

What happens when you point a partition of a Hive table to a new directory?

The data stays in the original location. The data has to be moved manually.

Whenever I run a Hive query from a different directory, it creates a new metastore_db. Why?

The property of interest here is javax.jdo.option.ConnectionURL. The default value of this property is jdbc:derby:;databaseName=metastore_db;create=true. This value specifies that you will be using embedded derby as your Hive metastore and the location of the metastore is metastore_db. Also the metastore will be created if it doesn't already exist. Note that the location of the metastore (metastore_db) is a relative path. Therefore, it gets created where you launch Hive from. If you update this property (in your hive-site.xml) to be, say an absolute path to a location, the metastore will be used from that location.
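The effect of a relative versus an absolute databaseName can be sketched with plain path resolution. This only illustrates how a relative path depends on the launch directory; it does not call Derby or Hive, and metastore_location() and the directories used are hypothetical.

```python
import posixpath

def metastore_location(db_name, launch_dir):
    """Resolve databaseName against the directory Hive was launched from.
    A relative name lands under launch_dir; an absolute path is unchanged."""
    return posixpath.normpath(posixpath.join(launch_dir, db_name))

# Relative name: a different metastore_db for every launch directory.
print(metastore_location("metastore_db", "/home/ann/projA"))
# /home/ann/projA/metastore_db
print(metastore_location("metastore_db", "/tmp"))
# /tmp/metastore_db

# Absolute path: the same location wherever Hive is started.
print(metastore_location("/var/hive/metastore_db", "/tmp"))
# /var/hive/metastore_db
```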

Can we change the data type of a column in a Hive table?

Yes. Using the REPLACE COLUMNS option, we can change the data type of a column: ALTER TABLE table_name REPLACE COLUMNS

Unable to instantiate

This error may occur for the following reason: 1) When the Derby metastore is used, a lock file is left behind after an abnormal exit. Remove the lock file: rm metastore_db/*.lck 2) To investigate further, run Hive in debug mode: hive -hiveconf hive.root.logger=DEBUG,console

How do you write your own SerDe?

In most cases users write a Deserializer rather than a full SerDe, because they want to read their own data format rather than write to it. For example, the RegexDeserializer deserializes the data using the configuration parameter "regex" and, possibly, a list of column names.

What is ObjectInspector's functionality?

To analyze the structure of individual columns and the internal structure of the row objects. It provides access to complex objects which can be stored in multiple formats in Hive.

What is the use of HCatalog?

HCatalog is used to share data structures with external systems. It offers access to the Hive metastore to users of other tools on Hadoop, so that they can read and write data in Hive's data warehouse.

What are the different components of Hive architecture?

User Interface, Metastore, Compiler, and Execution Engine

Is it possible to change the default location of Managed Tables in Hive, if so how?

Yes, by using the LOCATION keyword while creating the table. The user specifies the storage path of the table as the value of LOCATION.

How to optimize Hive Performance?

By using: 1) the Tez execution engine 2) a suitable file format 3) partitioning 4) bucketing 5) vectorization 6) cost-based optimization 7) indexing

Is it possible to change the default location of a managed table?

Yes, by using the clause LOCATION '<hdfs_path>' we can change the default location of a managed table.

Is there a time data type in Hive?

Yes. The TIMESTAMP data type stores date and time values in java.sql.Timestamp format.

What does the USE command in Hive do?

It sets the current database, on which all subsequent Hive queries will run.

Which java class handles the output record encoding into files which result from hive queries?

org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

Which java class handles the Input record encoding into files which store the table in Hive?

org.apache.hadoop.mapred.TextInputFormat

What types of costs are associated with creating the index on hive tables?

Indexes occupy space, and there is a processing cost in arranging the values of the column on which the index is created.

What is meant by schema on read?

The schema is validated against the data when the data is read; it is not enforced when the data is written.
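The idea can be sketched by storing raw lines untouched and applying a schema only at read time. This is a simplified illustration: the sample data, the schema, and read_with_schema() are hypothetical, and turning unparseable fields into None is an assumption meant to mirror NULL values in query results.

```python
# Raw lines are stored as-is; nothing is checked when they are written.
raw_lines = ["1,alice,30", "2,bob,not_a_number", "3,chloe"]

# Hypothetical schema, applied only when the data is read.
schema = [("id", int), ("name", str), ("age", int)]

def read_with_schema(lines, schema):
    """Parse each line against the schema; a field that is missing or
    fails conversion becomes None instead of rejecting the row."""
    rows = []
    for line in lines:
        fields = line.split(",")
        row = {}
        for i, (name, cast) in enumerate(schema):
            try:
                row[name] = cast(fields[i])
            except (IndexError, ValueError):
                row[name] = None
        rows.append(row)
    return rows

for row in read_with_schema(raw_lines, schema):
    print(row)
# {'id': 1, 'name': 'alice', 'age': 30}
# {'id': 2, 'name': 'bob', 'age': None}
# {'id': 3, 'name': 'chloe', 'age': None}
```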

Explain bucketing in Hive?

Bucketing is a way of dividing table data sets into more manageable parts. It is based on (hash function on the bucketed column) mod (total number of buckets). The hash function depends on the type of the bucketed column. Records with the same bucketed column value are stored in the same bucket. Each bucket is just a file in the table directory, and bucket numbering is 1-based. Bucketing can be done along with partitioning on Hive tables, or without partitioning. Bucketed tables produce almost equally distributed data file parts. Bucketing offers more efficient sampling than non-bucketed tables. Because the data files are equal-sized parts, map-side joins are faster on bucketed tables than on non-bucketed tables.
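The hash-mod-buckets rule can be sketched as follows. This is a minimal sketch under stated assumptions: integers hash to themselves and strings use a Java-style String.hashCode(); the hash Hive actually applies depends on the column type and the Hive version. The function names are hypothetical, and the index returned here is 0-based.

```python
def java_string_hash(s):
    """Java-style String.hashCode(), wrapped to a signed 32-bit integer.
    Assumes characters in the Basic Multilingual Plane."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_for(value, num_buckets):
    """Bucket index = (non-negative hash of the value) mod (bucket count)."""
    h = value if isinstance(value, int) else java_string_hash(str(value))
    return (h & 0x7FFFFFFF) % num_buckets

print(bucket_for(10, 4))    # 2
print(bucket_for("ab", 4))  # 1  (hash 3105, and 3105 mod 4 = 1)

# Rows with the same bucketed column value always land in the same bucket.
print(bucket_for("ab", 4) == bucket_for("ab", 4))  # True
```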

What is the difference between CREATE TABLE AND CREATE EXTERNAL TABLE?

To create an internal (managed) table we use the command CREATE TABLE, whereas to create an external table we use the command CREATE EXTERNAL TABLE.

Explain clustering in Hive?

Clustering decomposes table data sets into more manageable parts. The table is divided into a number of partitions, and these partitions can be further subdivided into more manageable parts known as buckets or clusters. The CLUSTERED BY clause is used to divide the table into buckets.

Can we run Unix shell commands from Hive?

Yes, by putting "!" before the command and a semicolon after it. For example, !pwd; prints the current working directory.

