Querying Hadoop Data With Hive
Hive provides an SQL-like query interface to Apache Hadoop; its query language, HiveQL, is a dialect of SQL. Hive translates queries into jobs that run over a large number of nodes in the Hadoop cluster and returns the results. This makes ad-hoc data analysis and summarization queries easy.
Hive does not support indexed queries the way an RDBMS does, but unlike a relational database it scales very well. Hive is not designed for real-time queries or row-level updates; it is best suited to batch jobs over large sets of immutable data, such as web logs.
Running Hive in Hadoop Cluster
Inside the Hadoop cluster, the data is distributed by HDFS. You can run Hive queries from the namenode or any of the data nodes.
Hive Metastore and Tables
Hive uses a metastore, a database, to store information about the tables it knows about. A Hive table is just metadata describing where the data is stored and in what format.
Hive has two kinds of tables:
- managed
- external
Managed Table
The data of a managed table is controlled by Hive. Hive creates a directory for the data on HDFS, and when you drop the table it deletes the data.
When you create a table in Hive, the default location is inside the directory “/user/hive/warehouse/”. You can load data into the table with a LOAD DATA statement, as in the sketch below.
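A minimal sketch, assuming a hypothetical comma-delimited input file and the table name used below:

```sql
-- Create a managed table; Hive owns its directory under the warehouse.
CREATE TABLE mytablename (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load a local file into the table (the path is hypothetical).
LOAD DATA LOCAL INPATH '/tmp/mydata.csv' INTO TABLE mytablename;
```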
This creates the directory “/user/hive/warehouse/mytablename”.
When you drop the table, the raw data is lost, because the directory corresponding to the table in the warehouse is deleted.
External Table
To create an external table, simply point to the location of the data while creating the table and mark it EXTERNAL. This ensures that the data is not moved into a location inside the warehouse directory.
When you drop the table, the data at that location is not deleted; only the table metadata is removed.
You can have multiple tables all pointing to the same raw data.
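A minimal sketch, assuming hypothetical web-log data already sitting at /data/weblogs on HDFS:

```sql
-- EXTERNAL plus LOCATION leaves the data where it is.
CREATE EXTERNAL TABLE weblogs (ip STRING, url STRING, ts STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/weblogs';
```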
Sort By Vs Order By
Order By requires a single reducer to sort the final output. If the number of rows in the output is too large, that single reducer can take a very long time to finish.
Sort By sorts the rows before feeding them to a reducer.
The difference between “order by” and “sort by” is that the former guarantees total order in the output, while the latter only guarantees ordering of the rows within a reducer. If there is more than one reducer, “sort by” may give a partially ordered final result.
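A minimal sketch of both forms, reusing the hypothetical weblogs table from above:

```sql
-- ORDER BY: total order, sorted by a single reducer.
SELECT ip, url FROM weblogs ORDER BY ip;

-- SORT BY: each reducer sorts only its own share of the rows,
-- so the final output is only partially ordered.
SELECT ip, url FROM weblogs SORT BY ip;
```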
Distribute By And Cluster By
Hive uses the columns in Distribute By to distribute the rows among reducers. All rows with the same Distribute By columns will go to the same reducer. However, Distribute By does not guarantee clustering or sorting properties on the distributed keys.
Cluster By is a shortcut for both Distribute By and Sort By. Combining distribution with sorting can provide a tremendous performance improvement, since the sort can potentially be done by hundreds of cluster nodes in parallel.
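A minimal sketch, again on the hypothetical weblogs table:

```sql
-- All rows with the same ip go to the same reducer, sorted there by ip.
SELECT ip, url FROM weblogs DISTRIBUTE BY ip SORT BY ip;

-- CLUSTER BY ip is shorthand for the query above.
SELECT ip, url FROM weblogs CLUSTER BY ip;
```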
Ref: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
Join Tables
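HiveQL supports SQL-style joins across tables. A minimal sketch, assuming hypothetical users and orders tables:

```sql
-- Equi-join on the user id; Hive runs this as one or more MapReduce jobs.
SELECT u.name, o.total
FROM users u
JOIN orders o ON (u.id = o.user_id);
```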
Storage Formats
TEXTFILE
The text file format can load data in the form of CSV (comma-separated values), data delimited by tabs or spaces, and JSON data. By default, if we use the TEXTFILE format, each line is considered a record.
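A minimal sketch of a comma-delimited text table (names are hypothetical):

```sql
CREATE TABLE logs_text (ip STRING, url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```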
SEQUENCEFILE
Sequence files are flat files consisting of binary key-value pairs that can be split by Hadoop and distributed across map jobs. A common use of these files is to combine two or more smaller files into a single sequence file.
There are three types of sequence files:
- Uncompressed key/value records.
- Record compressed key/value records: only the values are compressed.
- Block compressed key/value records: both keys and values are collected in ‘blocks’ separately and compressed.
RCFILE
RCFILE stands for Record Columnar File, another type of binary file format, which offers a high compression rate. This column-oriented storage is very useful when performing analytics; it is easier to perform analytics when we have a column-oriented storage type.
ORCFILE
ORC stands for Optimized Row Columnar, which means it can store data in a more optimized way than the other file formats; it is also optimized for speed.
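A minimal sketch of an ORC table; the table name is hypothetical, and the orc.compress property selects the compression codec:

```sql
CREATE TABLE logs_orc (ip STRING, url STRING)
STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB");
```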
Ref: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-ORCFileFormat
Compression
Data stored in Gzip or Bzip2 can be imported directly into a table stored as TEXTFILE. The compression is detected automatically and the file is decompressed on the fly during query execution.
However, data in a compressed format cannot be split into chunks/blocks to run multiple maps in parallel. The recommended practice is to insert the data into another table, which is stored as a SequenceFile, as in the sketch below.
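A minimal sketch of that practice, assuming the hypothetical logs_text table from above holds the raw data:

```sql
-- Compress the output of the insert, block by block.
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;

CREATE TABLE logs_seq (ip STRING, url STRING)
STORED AS SEQUENCEFILE;

-- Rewrite the data as a compressed, splittable sequence file.
INSERT OVERWRITE TABLE logs_seq
SELECT ip, url FROM logs_text;
```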
Ref: https://cwiki.apache.org/confluence/display/Hive/CompressedStorage
Using Sqoop
Tables are stored as directories under Hive’s warehouse directory, which is controlled by the hive.metastore.warehouse.dir property and defaults to /user/hive/warehouse. You can also create external tables, where you point Hive to the location of the data.
When you drop an external table, Hive will leave the data untouched and only delete the metadata.
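Sqoop imports data from a relational database into Hadoop and can create the matching Hive table for you. A minimal sketch, with a hypothetical MySQL connection string and source table:

```bash
# Import the "orders" table from MySQL and register it in Hive.
sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username dbuser \
  --table orders \
  --hive-import
```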