Apache Sqoop: Import from RDBMS to HDFS in ORC Format

The Apache Sqoop import tool offers the capability to import data from an RDBMS (MySQL, Oracle, SQL Server, etc.) table into HDFS. Sqoop import provides native support for storing data as text files as well as in binary formats such as Avro and Parquet. There is no native support for importing in ORC format. However, it is still possible to import in ORC format by using Sqoop and Hive in tandem.
First, let's see how Sqoop import natively supports text format and the binary formats Avro and Parquet. After that, we will see how we can leverage Hive with Sqoop to import in ORC format.
Import as Text Files
By default, the import tool stores the data as text files if no format is specified. We can also explicitly instruct the import tool to store the data as text files using the --as-textfile argument.
In the example below, we import data from a MySQL table called departments into the HDFS location /user/cloudera/import_example/text/. We use the --as-textfile argument to instruct Sqoop to store the data in text format in HDFS.
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table departments \
--num-mappers 1 \
--warehouse-dir "/user/cloudera/import_example/text/" \
--as-textfile
--connect, --username, and --password contain the connection details of the RDBMS.
--table is the RDBMS table from which the data will be imported.
--warehouse-dir is the parent output location in HDFS. In this location, Sqoop will create a directory with the name of the RDBMS table to contain all the data files.
--num-mappers specifies the number of map tasks to run in parallel. Each map task creates an output file. To keep the example simple, we set the number of map tasks to 1 so that only one output file is created.
After running the above Sqoop command, we can verify that the data was indeed stored as text. Listing the files in the output directory shows that they have no file extension.
$ hadoop fs -ls /user/cloudera/import_example/text/departments
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0 2019-05-31 03:53 /user/cloudera/import_example/text/departments/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 60 2019-05-31 03:53 /user/cloudera/import_example/text/departments/part-m-00000
You can observe that the contents of /user/cloudera/import_example/text/departments/part-m-00000 are in text format.
$ hadoop fs -cat /user/cloudera/import_example/text/departments/part-m-00000 | head
2,Fitness
3,Footwear
4,Apparel
5,Golf
6,Outdoors
7,Fan Shop
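As an aside, if you prefer the files to land directly in a specific directory rather than in a table-named subdirectory under --warehouse-dir, Sqoop's --target-dir argument can be used instead. Here is a minimal sketch of that variant; the output path is just an example chosen for illustration.
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table departments \
--num-mappers 1 \
--target-dir "/user/cloudera/import_example/text_direct/departments" \
--as-textfile
With --target-dir, the part-m-* files are written directly into the given directory, so no departments subdirectory is created.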
Import as Binary Files: Avro/Parquet
Similar to the --as-textfile parameter, the import tool has parameters to import the data in binary formats, Avro and Parquet.
- Avro: --as-avrodatafile
- Parquet: --as-parquetfile
Let's use the following command to import the data in Avro format. Note that we are passing the argument --as-avrodatafile.
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table departments \
--num-mappers 1 \
--warehouse-dir "/user/cloudera/import_example/avro/" \
--as-avrodatafile
We can see that the files have the .avro extension. We can peek into the files to confirm that they are indeed in Avro format.
$ hadoop fs -ls /user/cloudera/import_example/avro/departments
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0 2019-05-31 04:30 /user/cloudera/import_example/avro/departments/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 450 2019-05-31 04:30 /user/cloudera/import_example/avro/departments/part-m-00000.avro
$ hadoop fs -cat /user/cloudera/import_example/avro/departments/part-m-00000.avro | head
Objavro.schema�{"type":"record","name":"departments","doc":"Sqoop import of departments","fields":[{"name":"department_id","type":["null","int"],"default":null,"columnName":"department_id","sqlType":"4"},{"name":"department_name","type":["null","string"],"default":null,"columnName":"department_name","sqlType":"12"}],"tableName":"departments"}�M�ܣ����^�`�
�FitnessFootweaApparel
Golf
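For completeness, the Parquet import looks almost identical; only the format flag and the output directory change. A sketch, with the directory path chosen for this example:
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table departments \
--num-mappers 1 \
--warehouse-dir "/user/cloudera/import_example/parquet/" \
--as-parquetfile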
Import as ORC Files
There is no equivalent of --as-avrodatafile or --as-parquetfile to save as an ORC file. However, we can leverage Sqoop's HCatalog integration to import as an ORC file. You can think of HCatalog as an API to access the Hive metastore. This method involves two steps.
- Create a Hive database with your desired HDFS warehouse location
- Run the Sqoop import command to import from the RDBMS table into the HCatalog table
Create Hive Database
A Hive database must exist in order to use Sqoop with HCatalog. We can create a Hive database and provide the location where the data of its managed tables will be stored. If we don't specify a location, the default warehouse location is used.
The command below is run in the Hive CLI to create a Hive database called orc_db with hdfs:///user/cloudera/import_example/orc/ as its location. This location is equivalent to the --warehouse-dir parameter of the Sqoop import.
hive> create database orc_db location "hdfs:///user/cloudera/import_example/orc/";
OK
Time taken: 0.431 seconds
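If you want to confirm that the database points at the intended HDFS location before importing, Hive's DESCRIBE DATABASE command prints it. This is just a quick sanity check; the exact output layout depends on your Hive version.
hive> describe database orc_db;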
Sqoop Import with HCatalog Integration
We can use Sqoop import to import from an RDBMS table into an HCatalog/Hive table. Sqoop import's HCatalog integration can automatically create the HCatalog/Hive table as part of the import.
$ sqoop import \
--connect jdbc:mysql://localhost/retail_db \
--username root \
--password cloudera \
--table departments \
--num-mappers 1 \
--hcatalog-database orc_db \
--hcatalog-table departments \
--create-hcatalog-table \
--hcatalog-storage-stanza "stored as orcfile"
--hcatalog-database refers to the Hive database that we want to use.
--hcatalog-table refers to the Hive table. In this example, we call it departments. A directory named departments will be created in the location we specified in the create database command.
--create-hcatalog-table indicates that the Hive table has to be created as part of the import job. The job will fail if the table already exists.
--hcatalog-storage-stanza specifies the storage format of the newly created table. In this example, we store it as an ORC file.
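Under the hood, --create-hcatalog-table with this storage stanza produces a table roughly equivalent to the Hive DDL below. This is only an approximation for illustration; the column names and types are inferred from the Avro schema we saw earlier (department_id as int, department_name as string), not taken from the DDL Sqoop actually generates.
hive> create table orc_db.departments (department_id int, department_name string) stored as orcfile;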
After we run the Sqoop command, we can see that a directory called departments was created under the database location, hdfs:///user/cloudera/import_example/orc.
$ hadoop fs -ls hdfs:///user/cloudera/import_example/orc
Found 1 items
drwxr-xr-x - cloudera cloudera 0 2019-05-31 04:23 hdfs:///user/cloudera/import_example/orc/departments
When we list the contents of the departments directory, we can see that one file was created. Though it doesn't have an extension, rest assured that it's an ORC file.
$ hadoop fs -ls hdfs:///user/cloudera/import_example/orc/departments
Found 1 items
-rw-r--r-- 1 cloudera cloudera 377 2019-05-31 04:23 hdfs:///user/cloudera/import_example/orc/departments/part-m-00000
We can verify that it’s an ORC file by peeking at the top few lines of the file. Observe that the first line of the file reads “ORC”.
$ hadoop fs -cat hdfs:///user/cloudera/import_example/orc/departments/part-m-00000 | head
ORC
P+
6PJ�R�be!i6%Q.vǂ�Ģ�!�Ғ���b�� �UFitnessFootwearApparelGolfOutdoorsFan Shop
Fxt�V�b�`���ь
�IBH3��`�IBH3i!&
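Because the table is registered in the Hive metastore, you can also verify the import by querying it from Hive instead of inspecting raw bytes. A quick check; the rows returned should match the MySQL table.
hive> select * from orc_db.departments limit 5;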
TL;DR
Sqoop import has native support for importing in Avro and Parquet formats, but not in ORC format. We can leverage Sqoop import's HCatalog integration to import data from an RDBMS to HDFS in ORC format.
First, we create a Hive database and specify its HDFS location. Then, we import the data from the RDBMS table into the Hive table, stored in ORC format. The ORC data ends up under a directory named after the Hive table inside the HDFS location of the Hive database.