Hadoop data import/export with Sqoop
Apache Sqoop is an open source tool that allows to extract data from a structured data store into Hadoop for further processing. In addition to writing the contents of the database table to HDFS, Sqoop also provides you with a generated Java source file (widgets.java) written to the current local directory. Sqoop is a client command and there is no daemon process for it. It depends on HDFS and YARN and database drivers to which it connects.
Output Formats
By default, Sqoop will generate csv files for the imported data. It can also write the data as SequenceFiles, Avro datafiles or Parquet files. These binary formats allow data to be compressed while retaining MapReduce’s ability to process different sections of the same file in parallel.
Splitting columns
Using metadata about the table, Sqoop will guess a good column to use for splitting the table (typically the primary key for the table, if one exists). The minimum and maximum values for the primary key column are retrieved, and then these are used in conjunction with a target number of tasks to determine the queries that each map task should issue
Eg if table is 100,000 rows and with -m 5, then there would be 5 map-reduce task and each map-reduce task will run a part of query in range : say 10K to 20K etc
-m means 12 mappers. This means that there are 12 threads active on a single table. All the tables are imported in sequential manner one after the other. It is just a suggestion to hadoop and while during operation hadoop will decide the number of threads to use based on the size of the data, block size etc. If in the output you see files like part-001 to part–006 it means that it has used only 6 threads. By default the number of mappers is 4 If the database there is primary index with min value of 1 and max value of 12, then since there are 12 mappers, sqoop will break the data in 12 parts and issues 12 different queries . One in each thread. Eg: min 1, max 2 , min 2 and max 3 etc. Each of these will be given to each of the mappers to process.
The mappers count is decided based on the volume of the data and nodes available. References:
- How many maps and reduces
- Hadoop number of mappers and reducers
- Good way to decide number of mappers and reducers
Importing data to hive
Using scoop we can import data to Hive so that we can run SQL queries with a bigger dataset over a cluster.
Using Boundary Query and columns
The –boundary-query allow you more control while importing data in parallel. Boundary Query lets you specify an optimized query to get the max, min else it will try to find the min and max on the query statement and when there are millions of rows this may take a while and slow it down.
Using -split-by
By default during table import the primary key column is used to split and distribute the values from table across the mappers uniformly. However in case of doing a more advanced query, you’ll need to specify the column to do the parallel split.
Export
While calculating number of mappers for scoop exports, the number of blocks in which files is divided is used as the criteria. There is no difference in importing from hdfs and hive directories. Same commands are use.
input-null-non-string : specify what int value can be stored as null . In the query above its -1 Input-null-string : specify what value in file should be stored as null . In the query above its nvl