Apache Sqoop: Import data from RDBMS to HDFS in ORC Format


The Apache Sqoop import tool offers the capability to import data from an RDBMS (MySQL, Oracle, SQL Server, etc.) table into HDFS. Sqoop import provides native support for storing data in text files as well as binary formats such as Avro and Parquet. There's no native support for importing in ORC format. However, it's still possible to import in ORC format by using Sqoop and Hive in tandem.

First, let's see how Sqoop import natively supports the text format and the binary formats Avro and Parquet. After that, we will see how we can leverage Hive with Sqoop to import in ORC format.

Import as Text Files

By default, the import tool stores the data as text files if no format is specified. We can also explicitly instruct the import tool to store the data as text files using the --as-textfile argument.

In the example below, we import data from a MySQL table called departments into the HDFS location /user/cloudera/import_example/text/. We use the --as-textfile argument to instruct Sqoop to store the data in text format in HDFS.

--connect, --username, and --password contain the connection details of the RDBMS.
--table is the RDBMS table from which the data will be imported.
--warehouse-dir is the parent output location in HDFS. In this location, Sqoop will create a directory with the name of the RDBMS table to hold all the data files.
--num-mappers specifies the number of map tasks to run in parallel. Each map task creates one output file. To keep the example simple, we set the number of map tasks to 1 so that only one output file is created.
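Putting those arguments together, the full import command might look like the following sketch. The JDBC URL, username, and password are illustrative placeholders; substitute your own connection details.

```shell
# Import the MySQL "departments" table into HDFS as plain text files.
# Connection details below are placeholders, not from the original article.
sqoop import \
  --connect jdbc:mysql://localhost/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  --warehouse-dir /user/cloudera/import_example/text/ \
  --num-mappers 1 \
  --as-textfile
```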

After running the above Sqoop command, we can verify that the data was indeed stored as text. Listing the files in the output directory shows that the files have no file extension.

$ hadoop fs -ls /user/cloudera/import_example/text/departments
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0 2019-05-31 03:53 /user/cloudera/import_example/text/departments/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 60 2019-05-31 03:53 /user/cloudera/import_example/text/departments/part-m-00000

You can observe that the contents of /user/cloudera/import_example/text/departments/part-m-00000 are in text format.

$ hadoop fs -cat /user/cloudera/import_example/text/departments/part-m-00000 | head
7,Fan Shop

Import as Binary Files – Avro/Parquet

Similar to the --as-textfile parameter, the import tool has parameters to import the data in binary formats:

  • Avro: --as-avrodatafile
  • Parquet: --as-parquetfile

Let's use the following command to import the data in Avro format. Note that we are passing the --as-avrodatafile argument.
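The command is the same as the text import, with the format flag and output directory swapped. As before, the connection details are illustrative placeholders.

```shell
# Same import as before, but storing the output as Avro data files.
# Connection details below are placeholders, not from the original article.
sqoop import \
  --connect jdbc:mysql://localhost/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  --warehouse-dir /user/cloudera/import_example/avro/ \
  --num-mappers 1 \
  --as-avrodatafile
```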

We can see that the files have the .avro extension. We can peek into the files to confirm that they are indeed in Avro format.

$ hadoop fs -ls /user/cloudera/import_example/avro/departments
Found 2 items
-rw-r--r-- 1 cloudera cloudera 0 2019-05-31 04:30 /user/cloudera/import_example/avro/departments/_SUCCESS
-rw-r--r-- 1 cloudera cloudera 450 2019-05-31 04:30 /user/cloudera/import_example/avro/departments/part-m-00000.avro

$ hadoop fs -cat /user/cloudera/import_example/avro/departments/part-m-00000.avro | head
Objavro.schema�{"type":"record","name":"departments","doc":"Sqoop import of departments","fields":[{"name":"department_id","type":["null","int"],"default":null,"columnName":"department_id","sqlType":"4"},{"name":"department_name","type":["null","string"],"default":null,"columnName":"department_name","sqlType":"12"}],"tableName":"departments"}�M�ܣ����^��

Import as ORC Files

There is no equivalent of --as-avrodatafile and --as-parquetfile to save data as ORC files. However, we can leverage Sqoop's HCatalog integration to import in ORC format. You can think of HCatalog as an API to access the Hive metastore. This method involves two steps.

  1. Create Hive Database with your desired HDFS warehouse location
  2. Run Sqoop import command to import from RDBMS table to Hcatalog table

Create Hive Database

A Hive database must exist in order to use Sqoop with HCatalog. We can create a Hive database, providing the location where the data of its managed tables will be stored. If we don't specify a location, the default location is used.

The command below, run in the Hive CLI, creates a Hive database called orc_db with hdfs:///user/cloudera/import_example/orc/ as its location. This location is the equivalent of the --warehouse-dir parameter of Sqoop import.

hive> create database orc_db location "hdfs:///user/cloudera/import_example/orc/";
Time taken: 0.431 seconds

Sqoop Import with HCatalog Integration

We can use Sqoop import to load data from an RDBMS table into an HCatalog/Hive table. Sqoop import's HCatalog integration can automatically create the HCatalog/Hive table during the import.

--hcatalog-database refers to the Hive database we want to use.
--hcatalog-table refers to the Hive table. In this example, we call it departments. A directory named departments will be created in the location we specified in the create database command.
--create-hcatalog-table indicates that the Hive table should be created as part of the import job. The job will fail if the table already exists.
--hcatalog-storage-stanza specifies the storage format of the newly created table. In this example, we store the data as ORC files.
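Combining those HCatalog arguments, the import command might look like the sketch below. The connection details are illustrative placeholders; note that --warehouse-dir is not used here, since the output location comes from the orc_db database created earlier.

```shell
# Import into the HCatalog/Hive table orc_db.departments, stored as ORC.
# Connection details below are placeholders, not from the original article.
sqoop import \
  --connect jdbc:mysql://localhost/retail_db \
  --username retail_user \
  --password secret \
  --table departments \
  --num-mappers 1 \
  --hcatalog-database orc_db \
  --hcatalog-table departments \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile"
```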

After we run the Sqoop command, we can see that a directory called departments was created under the database location, hdfs:///user/cloudera/import_example/orc.

$ hadoop fs -ls hdfs:///user/cloudera/import_example/orc
Found 1 items
drwxr-xr-x - cloudera cloudera 0 2019-05-31 04:23 hdfs:///user/cloudera/import_example/orc/departments

When we list the contents of the departments directory, we can see that one file was created. Though it doesn't have an extension, rest assured that it's an ORC file.

$ hadoop fs -ls hdfs:///user/cloudera/import_example/orc/departments
Found 1 items
-rw-r--r-- 1 cloudera cloudera 377 2019-05-31 04:23 hdfs:///user/cloudera/import_example/orc/departments/part-m-00000

We can verify that it's an ORC file by peeking at the first few bytes of the file: ORC files begin with the magic bytes "ORC".

$ hadoop fs -cat hdfs:///user/cloudera/import_example/orc/departments/part-m-00000 | head
6PJ�R�be!i6%Q.vǂ�Ģ�!�Ғ���b�� �UFitnessFootwearApparelGolfOutdoorsFan Shop
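As an additional check (not part of the original walkthrough), the table can be queried through Hive, and Hive's built-in ORC file dump utility can print the file's metadata, confirming the format.

```shell
# Query the imported table through Hive to confirm the rows arrived.
hive -e 'select * from orc_db.departments;'

# Dump the ORC file's metadata (schema, stripe statistics) directly.
hive --orcfiledump /user/cloudera/import_example/orc/departments/part-m-00000
```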


Sqoop import has native support for importing in Avro and Parquet formats, but not in ORC format. We can leverage Sqoop import's HCatalog integration to import the data from an RDBMS to HDFS in ORC format.

First, we create a Hive database, specifying its HDFS location. Then, we import data from the RDBMS table into a Hive table stored in ORC format. The ORC data is stored under a directory named after the Hive table, within the HDFS location of the Hive database.
