Apache Hive Interview Questions
When you are preparing for any interview, it is wise to prepare for the questions you expect to be asked; it fills you with confidence and helps you believe you can make it. Preparing for a Hadoop job is no different: you will need to be ready for questions on tools such as Sqoop, MapReduce, and Flume, among other tools in the Hadoop ecosystem. In this article we provide answers to some of the most frequently asked questions in a Hive interview. Prepare them well and have the answers at your fingertips, and you will nail the interview with ease.
1. What is Apache Hive?
This question is asked in almost every Hive interview. Apache Hive is a data warehousing tool built on top of Hadoop. Beyond warehousing, Hive provides an SQL-like query language (HiveQL) for analysis, and it offers an abstraction layer: Hive is not a database itself, but it gives you a logical view of tables and databases over data stored in Hadoop.
2. What are some of the applications that are supported by Apache Hive?
Apache Hive can support a wide range of applications: any client application written in Java, Python, PHP, or Ruby can talk to Hive through its Thrift server.
3. Where is the data of a Hive table stored?
The data of a Hive table is stored by default in an HDFS directory under the warehouse path, /user/hive/warehouse. However, if you want to change the default and store a table's data in a folder of your choice, you can do that because it is possible.
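As a quick illustration, the warehouse path can be inspected through the `hive.metastore.warehouse.dir` property, and an individual table can be pointed at a directory of your choice with the `LOCATION` clause. The table and path names below are hypothetical:

```sql
-- Inspect the default warehouse directory (typically /user/hive/warehouse)
SET hive.metastore.warehouse.dir;

-- Store this table's data under a custom HDFS directory instead
CREATE TABLE sales (id INT, amount DOUBLE)
LOCATION '/data/custom/sales';
```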
4. What is a metastore in Hive?
The metastore is the component that stores metadata information in Hive, such as table schemas, partitions, and data locations. The metadata itself is kept in an RDBMS rather than in HDFS, and Hive uses an open-source object-relational mapping layer called DataNucleus to convert the object representation into a relational schema and to convert the relational schema back into the object representation.
5. What is the difference between a local metastore and a remote metastore?
A local metastore is a metastore service that runs in the same JVM as the Hive service, and it connects to a database running in a separate JVM, which can be on the same machine or on a remote machine. A remote metastore, on the other hand, is a metastore service that runs in its own separate JVM, not in the Hive service JVM like the local metastore.
6. What is the difference between a managed table and an external table?
When you drop a managed table, the metadata information comes along with the table data: both are deleted from the Hive warehouse directory. When you drop an external table, on the other hand, Hive just deletes the metadata information regarding the table. Dropping an external table does not delete the table data in HDFS; it is left untouched.
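The difference is easiest to see in DDL. In this sketch (table names and the external location are made up for illustration), the two `DROP` statements behave differently:

```sql
-- Managed table: DROP removes both the metadata and the data files
CREATE TABLE managed_logs (id INT, msg STRING);

-- External table: DROP removes only the metadata; the data stays in HDFS
CREATE EXTERNAL TABLE external_logs (id INT, msg STRING)
LOCATION '/data/raw/logs';

DROP TABLE managed_logs;   -- files under the warehouse directory are deleted
DROP TABLE external_logs;  -- files under /data/raw/logs are left untouched
```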
7. What is a partition in Hive?
Hive organizes tables into partitions to group similar data together on the basis of a partition key, that is, the values of one or more columns. A table can have one or more partition keys, and each combination of key values identifies a specific partition. In other words, you can define a Hive partition as a subdirectory in the table directory. I guess that will be easier to grasp and keep in mind just in case you are asked a question on the same.
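To make the "subdirectory" picture concrete, here is a minimal sketch with a hypothetical `customers` table partitioned by country:

```sql
-- Each distinct country value becomes a subdirectory such as
-- .../customers/country=US under the table directory
CREATE TABLE customers (id INT, name STRING)
PARTITIONED BY (country STRING);

-- This query is pruned to a single partition, so only one
-- subdirectory needs to be scanned
SELECT name FROM customers WHERE country = 'US';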
8. What is dynamic partitioning and when is it normally used?
With dynamic partitioning, the values for the partition columns are determined at runtime; in simpler words, the partition values become known only while the data is being loaded into the Hive table. Dynamic partitioning is typically used in two scenarios. The first is when loading data from an existing non-partitioned table, in order to improve sampling, which in turn decreases the query latency. The second is when one does not know the partition values beforehand; dynamic partitioning plays a great role here, since it saves you from having to find the values manually in a big dataset.
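A short sketch of the second scenario, assuming the partitioned `customers` table and a hypothetical non-partitioned `staging_customers` table already exist:

```sql
-- Enable dynamic partitioning for the session
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Partition values are taken from the country column at load time,
-- so they do not have to be known beforehand
INSERT OVERWRITE TABLE customers PARTITION (country)
SELECT id, name, country FROM staging_customers;
```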
9. Why do we need buckets?
The basic reason for needing buckets is to perform bucketing within a partition, and we can say we need buckets for two main reasons. One is that bucketing allows one to decrease the query time and makes the sampling process way more efficient. The other reason is that a map-side join normally requires the data belonging to a unique join key to be present in the same bucket.
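Both reasons show up in the DDL and in the sampling syntax. In this sketch the `orders` table, its columns, and the bucket count are all illustrative:

```sql
-- Hash rows into 32 buckets by id; rows with the same id always land
-- in the same bucket, which is what a map-side (bucketed) join needs
CREATE TABLE orders (id INT, total DOUBLE)
CLUSTERED BY (id) INTO 32 BUCKETS;

-- Efficient sampling: read only 1 bucket out of the 32
SELECT * FROM orders TABLESAMPLE (BUCKET 1 OUT OF 32 ON id);
```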
10. What is indexing and why do we need it?
Indexing, or a Hive index, is simply a Hive query optimization technique. Indexing is very important, and we mostly need it to increase the speed of accessing a column or a set of columns in a Hive database. With the help of an index, the system does not need to read all the rows in the table in order to find the data that was selected.
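As a sketch of the older index syntax (note that built-in index support was removed in Hive 3.0, where columnar formats and materialized views are preferred instead; the table and index names here are hypothetical):

```sql
-- Build a compact index on the id column
CREATE INDEX orders_id_idx
ON TABLE orders (id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

-- Populate the index before it can be used
ALTER INDEX orders_id_idx ON orders REBUILD;
```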
11. Mention the different types of joins in Hive.
There are normally four types of joins in HiveQL. One of them is JOIN, which is similar to an inner join in SQL and returns only the rows that fulfill the join condition. Then there is FULL OUTER JOIN, which returns all rows from both tables, combining the ones that match. Finally there are LEFT OUTER JOIN and RIGHT OUTER JOIN, which return all rows from the left or the right table respectively.
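The four join types can be sketched with two hypothetical tables, `customers` and `orders`, joined on `id`:

```sql
-- JOIN: only ids present in both tables
SELECT c.name, o.total FROM customers c JOIN orders o ON c.id = o.id;

-- LEFT OUTER JOIN: every customer, NULL total where there is no order
SELECT c.name, o.total FROM customers c LEFT OUTER JOIN orders o ON c.id = o.id;

-- RIGHT OUTER JOIN: every order, NULL name where there is no matching customer
SELECT c.name, o.total FROM customers c RIGHT OUTER JOIN orders o ON c.id = o.id;

-- FULL OUTER JOIN: all rows from both sides, NULLs where there is no match
SELECT c.name, o.total FROM customers c FULL OUTER JOIN orders o ON c.id = o.id;
```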
We hope the above will help you in preparing for Hive interviews.